# Exploring accident data
Road safety remains a major development issue: road crashes kill more than 1.35 million people globally each year, as reported in the Global Status Report on Road Safety 2018, with about 11% of those casualties occurring in India alone.
I come from India and have first-hand experience of frequent road accidents and the resulting fatalities. In 2019, there were 151,113 accident-related deaths in 449,000 road accidents in India.

##### That means one road accident happens in India roughly every 70 seconds, and one person is killed roughly every three and a half minutes (about 209 seconds).

### This is huge.
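
The rates quoted above follow directly from the 2019 totals. A minimal sketch of the arithmetic, assuming a 365-day year (the variable names are only for illustration):

```python
# Figures quoted above for India, 2019
accidents = 449_000
deaths = 151_113

# Seconds in a 365-day year
seconds_per_year = 365 * 24 * 60 * 60

# Average gap between consecutive events
seconds_per_accident = seconds_per_year / accidents
seconds_per_death = seconds_per_year / deaths

print(f"One accident every {seconds_per_accident:.0f} seconds")
print(f"One death every {seconds_per_death:.0f} seconds")
```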

According to the WHO report, many of these deaths can be prevented by timely treatment of accident victims. To that end, I am going to analyze the hospital coverage of accidents to see whether new hospitals are required, at which locations, and whether this could prevent some casualties.
I could not get geotagged accident data for India (perhaps it is not geotagged yet), so I will explore accident and hospital data for developed countries instead: the United Kingdom here, and the United States of America in a later exploration.

Source: https://morth.nic.in/sites/default/files/RA_Uploading.pdf

This notebook is inspired by one of the exercises in the Kaggle geospatial course by Alexis. It is a good course and worth the time.
 
 https://www.kaggle.com/alexisbcook/proximity-analysis

# Cluster Map visualization technique
 
A cluster map presents dense pockets of data using a single point. Each cluster is either sized relative to, or labelled with, the number of points that have been grouped together.

Clusters are ideal in interactive maps where the user can drill down to see individual data points contained in a cluster. Cluster maps help reduce clutter when there are many overlapping data points in a small geography.
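
To make the idea of "one marker standing in for many points" concrete, here is a deliberately simplified sketch that groups points into square grid cells and counts them. This is not how the Leaflet marker-cluster plugin works internally (it uses its own distance-based grouping that changes with zoom level); the function name, the cell size and the sample coordinates are all assumptions made for illustration.

```python
from collections import Counter

def grid_clusters(points, cell_size_deg):
    """Group (lat, lon) points into square grid cells and count points per cell."""
    counts = Counter()
    for lat, lon in points:
        # Floor division maps every point to the integer index of its cell
        cell = (int(lat // cell_size_deg), int(lon // cell_size_deg))
        counts[cell] += 1
    return counts

# Hypothetical sample points (lat, lon)
points = [
    (51.50, -0.12), (51.51, -0.13), (51.52, -0.10),  # around London
    (53.48, -2.24), (53.47, -2.25),                  # around Manchester
    (55.95, -3.19),                                  # around Edinburgh
]

clusters = grid_clusters(points, cell_size_deg=0.5)
for cell, n in sorted(clusters.items()):
    print(cell, n)
```

Each printed cell corresponds to one cluster marker, and the count is the label it would carry. A smaller cell size plays the role of zooming in: the same points split into more, smaller clusters.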

If there are only a few data points to visualize, a cluster map should not be used; a bubble map can be used instead.

I want to visualize all the accidents happening in the UK (or in a given city) on the country map. Because the accident counts run to 10k+, a bubble map would not be a feasible technique: the bubbles would be packed too closely together and would not provide any insight to the audience.

Other visualization techniques that can be used to represent this dataset are choropleth maps, hexagonal binning and dot maps.

However, as there is a very large number of accident data points, a dot map would not be an appropriate visualization technique.

Choropleth maps need data binned by geographical area; however, we do not have such representative binned data for this dataset.

So I am going to use the cluster map visualization technique for this dataset.

# Considerations for library selections

As mentioned earlier, I wish to explore the accident data, and the total number of accidents per year in the UK is approximately 0.2 million. Altair, however, has a default limit of 5,000 records. Although this limit can be disabled, doing so brings further challenges in an interactive plot.

I did not want to use Matplotlib or Seaborn, as I wanted to explore something new for this assignment as a learning objective.
Another good option is Plotly, which is very popular and can be used for diverse use cases, including large datasets.
For now I am going to explore folium to understand its uses and limitations, and I will explore Plotly later.

# Folium library

Folium is widely used in geospatial data visualisation. It is built on top of Leaflet.js and can cover most of your mapping needs in Python with its great plugins.

Folium builds on the data wrangling strengths of the Python ecosystem and the mapping strengths of the Leaflet.js library. Manipulate your data in Python, then visualize it in a Leaflet map via folium.

Getting started with Folium is easy: you can simply call `folium.Map` to visualise base maps immediately. You can then add layers to visualise your data on the interactive base maps available in Folium.

Folium is an open-source library built on open-source software.

https://python-visualization.github.io/folium/

## Cluster Map
Earlier I explored a bubble map but found that, with a large number of data points, the bubbles simply clutter the map and no insights can be extracted from such a visualization.

On exploring further, I found that a cluster map works well because it groups closely located points and shows their counts. The cluster markers are also shaded to differentiate smaller clusters from bigger ones.


# Installing Folium

#### To install folium using pip, type the following command:

$ pip install folium


#### To install folium using conda, type the following command:

$ conda install -c conda-forge folium

## Importing libraries

The following libraries are used for data cleaning and visualization.

In [None]:
!pip install geopy==1.22.0
!pip install xlrd==1.2.0
!pip install watermark
import numpy as np
import pandas as pd
import geopandas as gpd
import folium
from folium import Marker, Circle, CircleMarker, Choropleth
from folium.plugins import MarkerCluster, HeatMap
from geopandas.tools import geocode
from shapely.geometry import Point
from multiprocessing import Pool
from tqdm.notebook import tqdm, tqdm_notebook
import ipywidgets as widgets
from ipywidgets import interact, fixed, interact_manual, interactive_output
from IPython.display import display
import geopy

tqdm_notebook.pandas()

## Collecting and cleaning the data

### Accidents data
I wanted to analyze road accidents and their hospital coverage. I wanted to do this for India but, for lack of geotagged accident data there, I decided to go with the UK/USA (being developed countries, they have a high level of digitisation and freely available data).
I got the accident data for the United Kingdom from the site below:

https://datashare.ed.ac.uk/handle/10283/2509?show=full

This contains shapefiles for the years 2005 to 2010. Geospatial data in vector format is often stored in the shapefile format.

--------------------------------------------------------------------------------------------------------------------------------

#### Shapefiles: One Dataset - Many Files
A shapefile is made up of 3 or more files, all of which must share the same name and be stored in the same directory for you to be able to work with them.

**Shapefile structure**

There are 3 key files associated with any and all shapefiles:

- .shp: the file that contains the geometry for all features.
- .shx: the file that indexes the geometry.
- .dbf: the file that stores feature attributes in a tabular format.
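
Since a missing companion file is a common reason for a shapefile failing to load, a quick check before reading can help. The sketch below is only an illustration: the helper name `missing_shapefile_parts` is hypothetical, and the file name is built from the prefix and year used later in this notebook, with empty placeholder files standing in for real data.

```python
from pathlib import Path
import tempfile

REQUIRED_EXTENSIONS = (".shp", ".shx", ".dbf")

def missing_shapefile_parts(shp_path):
    """Return the required extensions that have no file next to shp_path."""
    shp_path = Path(shp_path)
    return [ext for ext in REQUIRED_EXTENSIONS
            if not shp_path.with_suffix(ext).exists()]

# Demonstration with empty placeholder files in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    for ext in (".shp", ".dbf"):  # deliberately leave out the .shx index
        (Path(tmp) / f"GB_Accidents2005{ext}").touch()
    missing = missing_shapefile_parts(Path(tmp) / "GB_Accidents2005.shp")

print(missing)
```

An empty list means all three key files are present under the same name in the same directory.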

### Data loading

Since I am interested in the hospital coverage of accidents, I will be removing some unneeded columns.

These files are stored year-wise, so I will create a function that takes the initial path, the file prefix and the years as parameters, so that files for later years can be integrated into the notebook easily.

**Another technique I am using for the first time is the tqdm library.**

In this notebook we are dealing with around a million records, and processing them often takes time. While exploring, it is very difficult to know whether the data is still being processed or whether there is an issue in the code.

Having a progress bar in these situations helps the exploration. It is easy to use, involves very little overhead and removes the guesswork around long-running processes.

In [None]:
def load_accident(path, accident_file, years):
    ''' loading the accidents file
    path - initial path of the file if outside of notebook path
    accident_file - filename prefix, i.e. excluding year and extension
    years - for which years data need to be loaded'''
    columns = ['Accident_I', 'Longitude', 'Latitude','Date','geometry']

    accidents_gdf = gpd.GeoDataFrame()
    for year in tqdm(years):
        # Build the path from the function's own parameter rather than a global variable
        filename = path + accident_file + str(year) + '.shp'
        accident1 = gpd.read_file(filename) 
        accident1 = accident1[columns]
        accident1['year'] = year
        accident1['month'] = pd.to_datetime(accident1['Date'], format='%d/%m/%Y').dt.month_name().str.slice(stop=3)
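        # Longitude/Latitude are in degrees; the EPSG:27700 label only tags the data (nothing is
        # reprojected), and the later buffer step treats the coordinate units as degrees
        geometry = [Point(xy) for xy in zip(accident1['Longitude'], accident1['Latitude'])]
        crs = 'epsg:27700'
        accident1_gdf = gpd.GeoDataFrame(accident1, crs=crs, geometry=geometry)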
        accidents_gdf = pd.concat([accidents_gdf, accident1_gdf], ignore_index=True)  
    return(accidents_gdf)

def read_internediate_data(filename):
    df = pd.read_csv(filename)
    df_gdf = gpd.GeoDataFrame(df, geometry=gpd.points_from_xy(df.Longitude, df.Latitude))
    df_gdf = df_gdf.set_crs('epsg:27700').drop(columns='Unnamed: 0')
    return(df_gdf)

In [None]:
# I could not load the whole raw dataset on Kaggle, so I have made the pre-processed data
# available here. You can either work with the raw data locally or with the pre-processed
# data on Kaggle.
path = 'input/'
years = [2005, 2006, 2007, 2009, 2010]
filenameprefix = 'GB_Accidents'

# accidents_gdf = data_load.load_accident(path, filenameprefix, years)
# print('Number of records in the file: %d' %accidents_gdf.shape[0])
# print('Number of columns in the file: %d' %accidents_gdf.shape[1])
# Save the intermediate work
# accidents_gdf.to_csv('output/intermediate_work/accidents_gdf')
# accidents_gdf.head(5)

accidents_gdf = read_internediate_data('../input/gb-accidents/accidents_gdf.csv')
accidents_gdf.head(5)

For anyone who wants to work directly with the data without downloading and processing the raw files, I am providing the processed data.

## Hospitals data

I needed geotagged hospital data for my analysis. However, I had a hard time finding geotagged hospital data for the UK, so I decided to go with the dataset below, which I found to be pretty comprehensive, and to geocode it myself using geopandas.

https://data.gov.uk/dataset/f4420d1c-043a-42bc-afbc-4c0f7d3f1620/hospitals

### Functions for loading and Cleaning Hospital data

In [None]:
# The hospital data is in an Excel sheet, in the old Excel (97-2003) format. Luckily, pandas comes to the rescue.
# With the 'xlrd' engine, the same read_excel method can be used to read the file.
def old_excel_load(path, filename, sheet_name):
    ''' Loads the hospitals data for United Kingdom
    path - initial path of the file if outside of notebook folder
    filename - filename of the hospital data
    sheet_name - sheet of the hospital data file to load;
                the file contains multiple sheets and the data we are interested in is in the second sheet'''
    filename = path + filename
    df = pd.read_excel(filename, sheet_name=sheet_name, engine='xlrd')
    return(df)

# Now it is time to write a function for cleaning/scrubbing:
# remove unwanted columns, hospitals that have closed, and records without a town name,
# as we need the town name to geotag all the hospitals
def hospital_clean(df):
    ''' cleans and takes subset of the hospital data required for analysis
    df - initial loaded dataframe of hospital data'''
    data = df.copy()
    hosp_cols = ['HOSPITAL', 'Present Name of Hospital','Closure Date', 'Town', 'County: Post 1996']
    data = data[hosp_cols]
    data.columns = ['HOSPITAL', 'Hospital_Name','Closure_Date', 'Town', 'Current_County']
    data = data[(data['Closure_Date'].isna() & (~data['Town'].isna()))].drop(columns='Closure_Date')
    return(data)

### Functions for data preparation - geocoding

In [None]:
# For any UK city we want to explore, we need its lat/long information.
# This function does the job.
def city_geocoder(city):
    ''' geocodes the city, which we want to explore
    city - name of the city you wish to explore'''
    try:
        result = geocode(city, provider='nominatim').geometry.iloc[0] 
        return(result)
    except Exception:  # geocoding failed (no match, network error, ...)
        return None

# This dataset does not contain geospatial information for the hospitals; however, it can be generated using
# the "geocode" tool of geopandas
def my_geocoder(city):
    ''' geocodes a provided town/city name and creates three column values: Latitude, Longitude and geometry
        city - town/city name of the hospital
        '''
    try:
        result = geocode(city, provider='nominatim')
        point = result.geometry.iloc[0] 
        return(pd.Series({'Latitude': point.y, 'Longitude': point.x, 'geometry': point}))
    except Exception:  # geocoding failed (no match, network error, ...)
        return None
    
# Now that we have a function to generate geocodes, let's geotag all the hospitals in our data
def hospital_geocode(df):
    ''' geocodes all the town names in the hospital dataframe and drops any rows for which geocoding was unsuccessful
    df - cleaned hospital dataframe containing the town name column'''
    df.loc[:, ['Latitude', 'Longitude', 'geometry']] = df.progress_apply(lambda x: my_geocoder(x['Town']), axis=1)
    df = df[~df['geometry'].isna()].reset_index(drop=True)
    crs = ('epsg:27700')
    df = gpd.GeoDataFrame(df, crs=crs, geometry=df.geometry)
    return(df)

### Data loading and cleaning / preprocessing 

In [None]:
# I could not load the whole raw dataset on Kaggle, so I have made the pre-processed data
# available here. You can either work with the raw data locally or with the pre-processed
# data on Kaggle.
path = '../input/gb-hospitals/'
hosp_filename = 'hospital-records.xls'

# hospitals = data_load.old_excel_load(path, hosp_filename, sheet_name=1)
# hospitals_clean = data_clean.hospital_clean(hospitals)
# hospitals_gdf = data_clean.hospital_geocode(hospitals_clean)

# print('Number of records in the file: %d' %hospitals_gdf.shape[0])
# print('Number of columns in the file: %d' %hospitals_gdf.shape[1])
# hospitals_gdf.head(3)
# Saving intermediate work in csv file
# hospitals_gdf.to_csv('hospitals_gdf')

hospitals_gdf = read_internediate_data('../input/gbhospitals/hospitals_gdf.csv')
hospitals_gdf.head(3)

In [None]:
print('Number of records successfully geocoded %d' %hospitals_gdf.shape[0])

 ### Data pre-processing
 
Measuring distance on a curved, three-dimensional surface is not the same as measuring it on a flat, two-dimensional plane.
Because the Earth is roughly spherical, distances computed directly from latitude and longitude do not hold well either. So different coordinate reference systems have been created to suit different tasks, such as preserving area or distance, and to suit different geographical regions.
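
One way to see why raw latitude/longitude differences are not distances: the ground length of one degree of longitude shrinks as you move away from the equator. The sketch below assumes a perfectly spherical Earth with a mean radius of 6371 km, which is an approximation; the function name and the sample latitudes are chosen only for illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean radius, spherical approximation

def km_per_degree(lat_deg):
    """Approximate ground length of one degree of latitude and longitude at lat_deg."""
    km_lat = math.pi / 180 * EARTH_RADIUS_KM             # same at every latitude on a sphere
    km_lon = km_lat * math.cos(math.radians(lat_deg))    # shrinks towards the poles
    return km_lat, km_lon

for name, lat in [("Equator", 0.0), ("London", 51.5), ("Northern Scotland", 57.0)]:
    km_lat, km_lon = km_per_degree(lat)
    print(f"{name:18s} 1 deg lat = {km_lat:6.1f} km, 1 deg lon = {km_lon:6.1f} km")

km_lat_london, km_lon_london = km_per_degree(51.5)
```

Under this approximation, a fixed step in degrees covers noticeably less ground east-west than north-south at UK latitudes, which is why a projected CRS in metres is usually preferred for distance work.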

In [None]:
# We are working with UK data. EPSG:3035 is designed for European regions, while EPSG:27700
# (British National Grid) is specific to Great Britain and is the code passed below.
def change_crs_for_region(hospitals, epsg):
    '''sets the coordinate reference system label of the hospital dataset
    (set_crs assigns the CRS; it does not reproject the coordinates)
    hospitals - hospital dataset dataframe
    epsg - coordinate reference system code to assign'''
    df = hospitals.set_crs(epsg=epsg)
    return(df)

hospitals_gdf = change_crs_for_region(hospitals_gdf, 27700)
# Saving  the intermediate work to csv files 
# accidents_gdf.to_csv('output/intermediate_work/accidents_gdf')
# hospitals_gdf.to_csv('output/intermediate_work/hospitals_gdf')
accidents_gdf.head(3)

### Visualize the accident data (MarkerCluster)

In [None]:
# Let's see how BubbleMap fares for this dataset. I will plot only the first 5000 records of the dataset
def visualize_data(city, accident):
    '''visualize data through bubble map'''
    point = city_geocoder(city)
    if not point:
        print('Please check the city/county name, Thank You.')
        return

    m_1 = folium.Map(location=[point.y, point.x], tiles='OpenStreetMap', zoom_start=10)
    for idx, row in accident[:5000].iterrows(): 
        Marker([row['Latitude'], row['Longitude']]).add_to(m_1)
    return(m_1)

visualize_data('London', accidents_gdf)

In [None]:
# As can be seen above, BubbleMap is certainly not suited to this dataset. Even with 5000 records, it gets cluttered
# and doesn't offer any insight into the data
# Now let's see whether ClusterMap is good for this use case
def visualize_data_mc(city, year, accident):
    '''visualize accidents through cluster map
    city - city name for which, you want to plot the data
    year - year(2005-2010), for which you want to plot accidents
    accident - dataframe name for accidents data'''
    point = city_geocoder(city)
    if not point:
        print('Please check the city/county name, Thank You.')
        return

    accident = accident.copy()
    accident = accident[accident.year == year]
    m_1 = folium.Map(location=[point.y, point.x], tiles='OpenStreetMap', zoom_start=10)

    mc = MarkerCluster()
    for idx, row in accident.iterrows():
        mc.add_child(Marker([row['Latitude'], row['Longitude']]))

    m_1.add_child(mc)
    return(m_1)

visualize_data_mc('London', 2010, accidents_gdf)

This is good: now we can clearly see that there are around 8990 accidents in London. We can also see the number of accidents in nearby cities. The best part is that this is an interactive map, so we can zoom in and drill down further as required.

Also notice how the colour shade of each cluster corresponds to the count of accidents at that cluster point.

#### Let's add some widgets

In [None]:
# Now that we have seen that ClusterMap is good for this use case, let's add some widgets and give control to users.
# Here I am adding widgets for the city name and the year for which data can be explored.
text_city = widgets.Text(value='London')  # Text takes its initial content via `value`

interact_manual(visualize_data_mc, city=text_city, year=(2005,2010), accident=fixed(accidents_gdf));

Notice that I have added manual control to the interactive widgets. This is very much required:
because the dataset is huge, in continuous update mode the map simply keeps trying to re-plot while the user is still adjusting the inputs.

Adding manual control puts the decision of when to run in the user's hands, for a better experience.

### Visualize the Hospitals

The default behaviour of MarkerCluster is to show a green shade for clusters with smaller counts and a red shade for clusters with bigger counts.
However, this would not make sense in the case of hospitals.

A town/city covered by more hospitals is better prepared to handle accidents, so a bigger hospital count should be shown in a green shade and a lower hospital count (worse prepared to handle accidents) should be shown in a red shade.

**For this reason, I have customized the code for MarkerCluster**
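
The inversion can be summarised as a mapping from cluster size to CSS class name. The Python sketch below is only an illustration of that logic, not code that folium runs: the function names are hypothetical, the hospital thresholds (50 and 300) are the ones used in this notebook, and the default thresholds (10 and 100) are my understanding of the Leaflet marker-cluster plugin's defaults.

```python
def default_cluster_class(child_count):
    """Default-style mapping: bigger clusters get the 'large' (red-ish) class.
    Thresholds 10 and 100 are assumed to match the plugin's defaults."""
    if child_count < 10:
        return "marker-cluster-small"
    elif child_count < 100:
        return "marker-cluster-medium"
    return "marker-cluster-large"

def hospital_cluster_class(child_count):
    """Inverted mapping for hospitals: few hospitals is the worrying case,
    so small counts get the 'large' (red-ish) class and big counts get 'small' (green)."""
    if child_count < 50:
        return "marker-cluster-large"
    elif child_count < 300:
        return "marker-cluster-medium"
    return "marker-cluster-small"

for n in (25, 100, 500):
    print(n, default_cluster_class(n), hospital_cluster_class(n))
```

The class names themselves are unchanged; only the rule that decides which class a given count receives is swapped.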

In [None]:
def visualize_data_mcgroup(city, hospitals):
    '''visualize hospital data in MarkerCluster
    city - name of the city
    hospitals - dataframe for hospital data'''
    point = city_geocoder(city)
    if not point:
        print('Please check the city/county name, Thank You.')
        return
    
    m_1 = folium.Map(location=[point.y, point.x], zoom_start=8)
    # Create a variable to store the JavaScript function (written as a string) that overrides the default cluster icon styling
    # The attributes below are the ones needed for this project, but they can be whatever you need
    icon_create_function = """
        function(cluster) {
        var childCount = cluster.getChildCount(); 
        var c = ' marker-cluster-';

        if (childCount < 50) {
            c += 'large';
        } else if (childCount < 300) {
            c += 'medium';
        } else {
            c += 'small';
        }
        
        return new L.DivIcon({ html: '<div><span>' + childCount + '</span></div>', className: 'marker-cluster' + c, iconSize: new L.Point(40, 40) });
        }
        """
    #Create the marker cluster group, which organizes all the gps points put into it
    mcg = MarkerCluster(name='Cluster Icons', icon_create_function=icon_create_function)
    for idx, row in hospitals.iterrows():
        mcg.add_child(Marker([row['Latitude'], row['Longitude']]))

    m_1.add_child(mcg)
    return(m_1)
# I have taken this MarkerCluster customization code verbatim from Stack Overflow
# Link : https://stackoverflow.com/questions/55657858/is-it-possible-to-change-the-default-colors-used-in-a-folium-marker-cluster-map

#### Let's visualize the hospital data

In [None]:
text_city = widgets.Text(value='London')  # Text takes its initial content via `value`
interact_manual(visualize_data_mcgroup, city=text_city, hospitals=fixed(hospitals_gdf));

I ran it for Leeds, and we can clearly see that Leeds has 25 hospitals.

### When was the closest hospital more than 10 km away?

Suppose 10 km is a buffer zone within which victims can be transported easily and saved.
Let's find the number of cases in which the closest hospital was more than 10 km away.
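
As a library-free illustration of the same question, the sketch below flags accidents whose nearest hospital is farther than a given distance, using the haversine great-circle formula on latitude/longitude pairs. It is not the buffer-based approach used in the notebook's code; the coordinates are hypothetical, and the Earth is assumed to be a sphere of radius 6371 km.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean radius, spherical approximation

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def outside_coverage(accidents, hospitals, max_km=10.0):
    """Return the accidents whose nearest hospital is farther than max_km."""
    return [
        acc for acc in accidents
        if min(haversine_km(acc[0], acc[1], h[0], h[1]) for h in hospitals) > max_km
    ]

# Hypothetical (lat, lon) locations
hospitals = [(51.50, -0.12), (53.80, -1.55)]
accidents = [(51.52, -0.10), (52.50, -1.00), (53.81, -1.50)]

uncovered = outside_coverage(accidents, hospitals)
print(uncovered)
```

This brute-force comparison checks every accident against every hospital, so it is only practical for small samples; the buffer-and-union approach scales better for the full dataset.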

In [None]:
def create_buffer(hospital, km_range):
    '''creates a buffer zone of provided km range
    hospital - dataframe containing hospital data
    km_range - range in kms, of which buffer needs to be created '''
    # Coordinates are in degrees, so km are converted with the rough rule of thumb 1 degree ~ 100 km
    x_km_buffer = hospital.geometry.buffer(km_range / 100)
    my_union = x_km_buffer.geometry.unary_union
    return(my_union)

def outside_buffer_range(buffer, accidents, year, month):
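    '''finds the accidents which occurred outside the buffer zone
    buffer - buffer zone geometry created by create_buffer
    accidents - dataframe of accident records
    year - year to filter the accidents on
    month - three-letter month name (e.g. 'May') to filter the accidents on'''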
    accidents = accidents[(accidents['year'] == year) & (accidents['month'] == month)]
    outside_range = accidents[~accidents['geometry'].progress_apply(lambda x: buffer.contains(x))]
    return(outside_range)    

hospital_buffer = create_buffer(hospitals_gdf, 10)

In [None]:
outside_range = outside_buffer_range(hospital_buffer, accidents_gdf, 2010, 'May')
print("Number of accidents outside of 10 km buffer range for year(2010) for month('May'): %d" %outside_range.shape[0])
outside_range.head(3)

In [None]:
total_record_count = len(accidents_gdf[(accidents_gdf['year'] == 2010) 
                                       & (accidents_gdf['month'] == 'May')])

percentage = round(len(outside_range) * 100 / total_record_count, 2)
print('Percentage of accidents which occurred outside of hospital coverage: %.2f' %percentage)

### Make a recommender

When accidents occur in distant locations, it becomes even more vital that injured persons are taken to the nearest available hospital.

With this in mind, we create a recommender that:

- takes the location of the accident in a given CRS,

- finds the closest hospital (where distance calculations are done in epsg:27700), and

- returns the name of the closest hospital.
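
The three steps above can be sketched without any geospatial library, assuming the coordinates are already in a projected CRS measured in metres (eastings and northings), so that straight-line distance is meaningful. The hospital names and coordinates below are hypothetical, and `nearest_hospital` is an illustrative helper, not the notebook's own function.

```python
import math

# Hypothetical hospitals: (name, easting_m, northing_m) in a projected CRS in metres
hospitals = [
    ("Hospital A", 530_000, 180_000),
    ("Hospital B", 430_000, 434_000),
    ("Hospital C", 384_000, 398_000),
]

def nearest_hospital(accident_xy, hospitals):
    """Return (name, distance_km) of the hospital closest to accident_xy."""
    name, x, y = min(hospitals, key=lambda h: math.dist(accident_xy, (h[1], h[2])))
    return name, math.dist(accident_xy, (x, y)) / 1000  # metres -> km

name, dist_km = nearest_hospital((425_000, 430_000), hospitals)
print(f"{name} is closest, about {dist_km:.1f} km away")
```

Returning the distance alongside the name makes it easy to check whether even the best recommendation lies outside the 10 km zone.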

In [None]:
# hospitals_gdf = hospitals_gdf.reset_index()
def best_hospital(accident_loc):
    '''returns the best hospital in terms of closest to accident site
    accident_loc - point(latitude and longitude) of the accident location'''
    idxmin = hospitals_gdf.geometry.distance(accident_loc).idxmin()
    # idxmin() returns an index label, so look the row up with .loc rather than .iloc
    hosp_name = hospitals_gdf.loc[idxmin].Hospital_Name
#     hosp_dist = hospitals_gdf.geometry.distance(accident_loc).min() * 100
    return(hosp_name)

best_hospital(outside_range.geometry.iloc[0])

### Which hospitals are in the highest demand?

Considering only the accidents in the outside_range DataFrame, which hospital is recommended most often?
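
Conceptually this is a frequency count over the recommendations. A minimal sketch with the standard library, using a made-up list of recommended hospital names in place of the real results:

```python
from collections import Counter

# Hypothetical recommender output, one hospital name per accident
recommendations = [
    "Hospital B", "Hospital A", "Hospital B",
    "Hospital C", "Hospital B", "Hospital A",
]

demand = Counter(recommendations)
top = demand.most_common(2)  # the two most frequently recommended hospitals
print(top)
```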

In [None]:
def top_ten_hospitals(df):
    top_10_hospitals = df.geometry.progress_apply(best_hospital).value_counts().sort_values(ascending=False).reset_index().iloc[:10, :]
    top_10_hospitals.columns = ['hospital_name', 'demand_count']
    return(top_10_hospitals)

top_hospitals = top_ten_hospitals(outside_range)
top_hospitals

These hospitals are in high demand: they attend to accident cases within their 10 km buffer zone and also to cases outside that range for which they are the nearest hospital.

Management needs to ensure that these hospitals have the infrastructure and staff needed to support the demand and reduce fatalities.

New locations also need to be identified that may reduce the burden on these high-demand hospitals.

In [None]:
%load_ext watermark
%watermark -v -m -p pandas,numpy,ipywidgets,folium,geopandas,geopy,tqdm,xlrd