An investigation of the UK dataset "All STATS19 data (accident, casualties and vehicle tables) for 2005 to 2014", available from [data.gov.uk](https://data.gov.uk/dataset/road-accidents-safety-data).

## Load the libraries ##

In [1]:
## inspired by https://commercedataservice.github.io/tutorial_biz_dynamics/
from IPython.display import display
import io, requests, zipfile
import pandas as pd
from plotly.offline import download_plotlyjs, init_notebook_mode, iplot
import plotly.graph_objs as go
import plotly.plotly as py
from sklearn import datasets, linear_model
import matplotlib.pyplot as plt
import numpy as np

from plotly import __version__
## print (__version__) ## requires version >= 1.9.0

## Generating Offline Graphs within Jupyter Notebook
## https://plot.ly/python/offline/
## init_notebook_mode(connected=True)

In [2]:
## folder holding the STATS19 CSV files; they are read below with an explicit encoding
path = './input/Stats19_Data_2005-2014/'

In [3]:
## low_memory=False reads the file in a single pass, which avoids the mixed-type DtypeWarning on column 31
acc = pd.read_csv(path + "Accidents0514.csv", encoding = "ISO-8859-1", low_memory=False)



In [4]:
cau = pd.read_csv(path + "Casualties0514.csv", encoding = "ISO-8859-1")

In [5]:
veh = pd.read_csv(path + "Vehicles0514.csv", encoding = "ISO-8859-1")

## Replace NaN values in the accident table with -1 ##

In [13]:
acc = acc.fillna(-1)

## Set up display environment ##

In [None]:
#pd.set_option('display.max_colwidth', -1)
#pd.options.display.max_rows = 4000
pd.options.display.max_columns = 4000

In [None]:
display(acc.head(3))

In [None]:
display(cau.head(3))

In [None]:
display(veh.head(3))

In [None]:
print(acc.shape)
print(cau.shape)
print(veh.shape)

## Find urban accident occurrence ##

In [None]:
## https://data.gov.uk/dataset/road-accidents-safety-data
print(acc['Urban_or_Rural_Area'].value_counts())
print(len(acc))

# 1 = Urban
# 2 = Rural
# 3 = Unallocated
# -1 = Data missing or out of range

**Q:** What fraction of accidents occur in urban areas? Report the answer in decimal form.

In [None]:
acc_tot = len(acc)
acc_urban = acc['Urban_or_Rural_Area'].value_counts()[1]
print(round(acc_urban/acc_tot, 10))
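
A shorter route to the same fraction, sketched on a made-up column (the values below are illustrative, not taken from the dataset): comparing the column with the urban code gives a boolean Series, and its mean is the share of urban rows. As in the cell above, rows coded -1 stay in the denominator.

```python
import pandas as pd

# Toy stand-in for acc['Urban_or_Rural_Area'] (illustrative values only)
area = pd.Series([1, 1, 2, 1, 3, 2, 1, 1])

# The mean of a boolean Series is the share of True values
urban_fraction = (area == 1).mean()
print(round(urban_fraction, 10))
```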

In [None]:
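## parse the Date column using the dataset's day/month/year format
## (pd.datetime was removed in pandas 2.0, so use the standard library's datetime)
## note: re-reading the file discards the earlier fillna(-1)
from datetime import datetime
dateparse = lambda dates: datetime.strptime(dates, '%d/%m/%Y')

acc = pd.read_csv(path + "Accidents0514.csv", parse_dates=['Date'],
                  index_col=['Date'], date_parser=dateparse, encoding = "ISO-8859-1")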


## Check data type ##  
**Work on missing data:** https://pandas.pydata.org/pandas-docs/stable/missing_data.html

In [None]:
print(type(acc['LSOA_of_Accident_Location']))
print(acc['LSOA_of_Accident_Location'].iloc[0])
print(type(acc['LSOA_of_Accident_Location'].iloc[0]))

#display(acc.iloc[:, 28:33].head(3))

In [None]:
##acc.loc['2005']
##acc_year = acc.groupby(acc.index.year)['Urban_or_Rural_Area'].value_counts()
acc_year = acc.groupby(acc.index.year)['Urban_or_Rural_Area'].count()
acc_year

**Q:** There appears to be a linear trend in the number of accidents that occur each year. What is that trend? Return the slope in units of increased number of accidents per year.

In [None]:
##print(type(acc_year))
y = acc_year.values
##print(y)
x = acc_year.index.values.reshape(-1, 1)
##print(x)
##print(type(x))

In [None]:
# from sklearn import datasets, linear_model
# import matplotlib.pyplot as plt

np.set_printoptions(precision=10)

regr = linear_model.LinearRegression()
regr.fit(x, y)
print(regr.coef_)
print(regr.intercept_)

In [None]:
# Plot outputs
plt.scatter(x, y,  color='black')
#plt.xticks(())
#plt.yticks(())
plt.show()
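
As a cross-check on the scikit-learn slope, the same straight-line fit can be sketched with `np.polyfit`. The yearly counts below are hypothetical (constructed to fall by exactly 500 per year) and are not the dataset's values.

```python
import numpy as np

# Hypothetical yearly counts that fall by exactly 500 accidents per year
years = np.arange(2005, 2015)
counts = 200000 - 500 * (years - 2005)

# A degree-1 polyfit returns [slope, intercept];
# centring the years keeps the fit well conditioned
slope, intercept = np.polyfit(years - years.mean(), counts, 1)
print(slope)
```

With centred years the intercept is the mean yearly count rather than the extrapolated count at year 0, while the slope is unchanged.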

**Q:** How many times more likely are you to be in an accident where you skid, jackknife, or overturn (as opposed to an accident where you don't) when it's raining or snowing compared to nice weather with no high winds? Ignore accidents where the weather is unknown or missing.

<center> **Skidding_and_Overturning** </center>  

|code|label|
|:--:|--:|
|0|None|
|1|Skidded|
|2|Skidded and overturned|
|3|Jackknifed|
|4|Jackknifed and overturned|
|5|Overturned|
|-1|Data missing or out of range|  
 
  
<center> **Weather_Conditions** </center>  

|code|label|
|:--:|:--:|
|1|Fine no high winds|
|2|Raining no high winds|
|3|Snowing no high winds|
|4|Fine + high winds|
|5|Raining + high winds|
|6|Snowing + high winds|
|7|Fog or mist|
|8|Other|
|9|Unknown|
|-1|Data missing or out of range|



In [None]:
display(acc.head(3))

In [None]:
#print(acc['Weather_Conditions'].value_counts())
#print(veh['Skidding_and_Overturning'].value_counts())

## Count occurrence ##  
Note: More than one vehicle can be involved in the same accident, so the vehicle table has more rows than the accident table.

In [None]:
print(acc.shape)
print(veh.shape)

In [None]:
print(len(set(veh['Accident_Index'])))

## Merge the dataset ##  
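The join type (`how`) makes no difference here as long as every vehicle row has a matching accident row and vice versa. `Accident_Index` is unique in the accident DataFrame, while several rows of the vehicle DataFrame can share the same `Accident_Index`.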

In [None]:
mix_acc_veh = pd.merge(acc[['Accident_Index', 'Weather_Conditions']], 
                            veh[['Accident_Index', 'Skidding_and_Overturning']],
                            how='inner',
                            on='Accident_Index')

In [None]:
print(mix_acc_veh.head(3))
print(mix_acc_veh.shape)
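
The assumptions behind the merge (unique keys on the accident side, matching keys on both sides) can be checked directly. The sketch below uses small made-up tables; `validate` and `indicator` are standard `pd.merge` parameters, and the toy values are not from the dataset.

```python
import pandas as pd

# Toy tables (hypothetical values); column names mirror the ones used above
acc_toy = pd.DataFrame({'Accident_Index': ['A1', 'A2', 'A3'],
                        'Weather_Conditions': [1, 2, 1]})
veh_toy = pd.DataFrame({'Accident_Index': ['A1', 'A1', 'A2', 'A4'],
                        'Skidding_and_Overturning': [0, 1, 0, 2]})

# validate raises MergeError if Accident_Index is duplicated on the left side;
# indicator adds a _merge column showing which table each row came from
merged = pd.merge(acc_toy, veh_toy, how='outer', on='Accident_Index',
                  validate='one_to_many', indicator=True)
print(merged['_merge'].value_counts())
```

If the `left_only` and `right_only` counts are both zero on the real tables, the inner, left, right and outer joins all return the same rows.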

## Condition ##  
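- `Skidding_and_Overturning`: codes 1, 2, 3, 4, 5 mean the vehicle skidded, jackknifed or overturned ("bad action"); 0 means none  
- `Weather_Conditions`: codes 2, 3, 5, 6 are rain or snow (bad weather); 1 is fine with no high winds (good weather)  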

In [None]:
acc_veh_good_weather = mix_acc_veh[mix_acc_veh['Weather_Conditions'] == 1]
acc_veh_bad_weather = mix_acc_veh[mix_acc_veh['Weather_Conditions'].isin([2,3,5,6])]

bad_action_list = [1,2,3,4,5]

## good weather
acc_veh_good_weather_bad_action = len(acc_veh_good_weather[
          acc_veh_good_weather['Skidding_and_Overturning'].isin(bad_action_list)])

acc_veh_good_weather_good_action = len(acc_veh_good_weather[
          acc_veh_good_weather['Skidding_and_Overturning']== 0])

print(acc_veh_good_weather_bad_action)
print(acc_veh_good_weather_good_action)

print(acc_veh_good_weather_bad_action/(acc_veh_good_weather_bad_action+acc_veh_good_weather_good_action))

## bad weather
acc_veh_bad_weather_bad_action = len(acc_veh_bad_weather[
    acc_veh_bad_weather['Skidding_and_Overturning'].isin(bad_action_list)])

acc_veh_bad_weather_good_action = len(acc_veh_bad_weather[
    acc_veh_bad_weather['Skidding_and_Overturning']==0])

print(acc_veh_bad_weather_bad_action)
print(acc_veh_bad_weather_good_action)
print(acc_veh_bad_weather_bad_action/(acc_veh_bad_weather_bad_action+acc_veh_bad_weather_good_action))

In [None]:
ratio_bad_decision_in_good = acc_veh_good_weather_bad_action/(acc_veh_good_weather_bad_action+acc_veh_good_weather_good_action)
ratio_bad_decision_in_bad = acc_veh_bad_weather_bad_action/(acc_veh_bad_weather_bad_action+acc_veh_bad_weather_good_action)

print("Fraction of vehicles with a bad action in good weather: ")
print(ratio_bad_decision_in_good)
print("Fraction of vehicles with a bad action in bad weather: ")
print(ratio_bad_decision_in_bad)

print("A bad action is {:.10f} times more likely in bad weather than in good weather."
     .format(ratio_bad_decision_in_bad/ratio_bad_decision_in_good))
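
The cell above compares two proportions, which is a relative risk. The question's wording ("as opposed to an accident where you don't") could also be read as a ratio of odds, so the sketch below computes both from four counts. The helper name and the counts are hypothetical and are not results from the dataset.

```python
def risk_and_odds_ratio(bad_skid, bad_no_skid, good_skid, good_no_skid):
    """Return (relative risk, odds ratio) of a bad action, bad vs good weather."""
    # Risk = share of vehicles with a bad action within each weather group
    risk_bad = bad_skid / (bad_skid + bad_no_skid)
    risk_good = good_skid / (good_skid + good_no_skid)
    relative_risk = risk_bad / risk_good
    # Odds = bad-action count divided by no-bad-action count
    odds_ratio = (bad_skid / bad_no_skid) / (good_skid / good_no_skid)
    return relative_risk, odds_ratio

# Hypothetical counts, for illustration only
rr, odds = risk_and_odds_ratio(30, 70, 10, 90)
print(rr, odds)
```

The two measures are close when bad actions are rare and diverge as they become more common.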


**Q:** We can use the accident locations to estimate the areas of the police districts. Represent each as an ellipse with semi-axes given by a single standard deviation of the longitude and latitude. What is the area, in square kilometers, of the largest district measured in this manner?
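
One possible way to approach this, sketched on made-up coordinates: convert each standard deviation from degrees to kilometres and apply the ellipse area formula, area = pi * a * b. The column names `Police_Force`, `Longitude` and `Latitude`, the constant of about 111.32 km per degree, and the cosine correction for longitude are assumptions of this sketch, not something the notebook specifies.

```python
import numpy as np
import pandas as pd

KM_PER_DEGREE = 111.32  # approximate length of one degree of latitude (assumption)

def ellipse_area_km2(lon, lat):
    """Area of an ellipse with semi-axes of one standard deviation in lon and lat."""
    # A degree of longitude shrinks with the cosine of the latitude
    semi_lon = lon.std() * KM_PER_DEGREE * np.cos(np.radians(lat.mean()))
    semi_lat = lat.std() * KM_PER_DEGREE
    return np.pi * semi_lon * semi_lat

# Toy accident locations for two hypothetical police forces
toy = pd.DataFrame({
    'Police_Force': [1, 1, 1, 2, 2, 2],
    'Longitude': [-0.10, 0.00, 0.10, -2.5, -2.0, -1.5],
    'Latitude': [51.4, 51.5, 51.6, 53.0, 53.5, 54.0],
})

areas = {force: ellipse_area_km2(g['Longitude'], g['Latitude'])
         for force, g in toy.groupby('Police_Force')}
largest = max(areas, key=areas.get)
print(largest, areas[largest])
```

Note that pandas' `std()` uses the sample standard deviation (`ddof=1`) by default; pass `ddof=0` if the population value is wanted.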

**Q:** When is the most dangerous time to drive? Find the hour of the day that has the highest occurrence of fatal accidents, normalized by the total number of accidents that occurred in that hour. For your answer, submit the corresponding frequency of fatal accidents to all accidents in that hour. Note: round accident times down. For example, if an accident occurred at 23:55 it occurred in hour 23.

|code|label|
|:--:|:--|
|1|Fatal|
|2|Serious|
|3|Slight|

In [6]:
##display(acc.head(3))

In [17]:
acc_severity_time = acc[['Accident_Severity', 'Time' ]]
display(acc_severity_time.head(3))
acc_severity_time_subset = acc_severity_time
##display(acc_severity_time_subset)

Unnamed: 0,Accident_Severity,Time
0,2,17:42
1,3,17:36
2,3,00:15


## Check missing data ##  
All the missing data is indicated by -1.

In [21]:
##acc = acc.fillna(-1)
##print(len(acc_severity_time[acc_severity_time['Time'].isnull()]))
print("Before removing, there are {} data entries.".format(len(acc_severity_time)))
acc_severity_time = acc_severity_time[acc_severity_time.Time != -1]
print("After removing, there are {} data entries.".format(len(acc_severity_time)))


Before removing, there are 1640597 data entries.
After removing, there are 1640464 data entries.


In [22]:
## define the function in the desired date format
## time format reference
## http://strftime.org/
## get only time
## https://stackoverflow.com/questions/18039680/django-get-only-date-from-datetime-strptime
from datetime import datetime  ## pd.datetime was removed in pandas 2.0
dateparse_hour_minutes = lambda dates: datetime.strptime(dates, '%H:%M')
print(dateparse_hour_minutes)

<function <lambda> at 0x1110d1d08>


In [23]:
## work on an explicit copy so the assignment does not trigger SettingWithCopyWarning
acc_severity_time = acc_severity_time.copy()
acc_severity_time['Time'] = acc_severity_time['Time'].apply(dateparse_hour_minutes)\
                                                     .dt.strftime('%H')



## Pivot table ##

In [24]:
acc_pivot = acc_severity_time.pivot_table(index=['Time'], columns='Accident_Severity', 
                                          aggfunc='size', fill_value=0)

display(acc_pivot.head(5))

## another approach
## acc_severity_time_groupby = acc_severity_time.groupby(['Time'])['Accident_Severity'].value_counts()
## display(acc_severity_time_groupby)

Accident_Severity,1,2,3
Time,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
0,769,4755,19707
1,613,3600,14295
2,527,3035,11345
3,450,2371,9089
4,360,1794,6963


In [106]:
acc_pivot=acc_pivot.rename(columns = {1:'fatal', 2:'Serious', 3:'Slight' })
display(acc_pivot.head(5))


Accident_Severity,fatal,Serious,Slight,total_count,fatal_ratio
Time,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
0,769,4755,19707,25231,3.047838
1,613,3600,14295,18508,3.312081
2,527,3035,11345,14907,3.535252
3,450,2371,9089,11910,3.778338
4,360,1794,6963,9117,3.948667


In [107]:
print(type(acc_pivot))
print(acc_pivot.index)
print(acc_pivot.index.name)
print(acc_pivot.columns)
print(acc_pivot.columns.values)

<class 'pandas.core.frame.DataFrame'>
Index(['00', '01', '02', '03', '04', '05', '06', '07', '08', '09', '10', '11',
       '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', '23'],
      dtype='object', name='Time')
Time
Index(['fatal', 'Serious', 'Slight', 'total_count', 'fatal_ratio'], dtype='object', name='Accident_Severity')
['fatal' 'Serious' 'Slight' 'total_count' 'fatal_ratio']


In [109]:
acc_pivot['total_count'] = acc_pivot['fatal']+acc_pivot['Serious']+acc_pivot['Slight']
acc_pivot['fatal_ratio'] = acc_pivot['fatal']/acc_pivot['total_count']*100
display(acc_pivot.head(3))

Accident_Severity,fatal,Serious,Slight,total_count,fatal_ratio
Time,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
0,769,4755,19707,25231,3.047838
1,613,3600,14295,18508,3.312081
2,527,3035,11345,14907,3.535252
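
The per-hour share of fatal accidents can also be obtained in one step with `pd.crosstab`, which normalizes each row when `normalize='index'` is passed. This is a minimal sketch on made-up records; it returns fractions rather than the percentages stored in `fatal_ratio`.

```python
import pandas as pd

# Toy records (hypothetical); severity 1 = fatal, 2 = serious, 3 = slight
toy = pd.DataFrame({'Time': ['00:15', '00:40', '17:36', '17:42', '17:59', '23:55'],
                    'Accident_Severity': [1, 3, 3, 2, 3, 1]})

# Taking the first two characters of HH:MM rounds the time down to the hour
toy['Hour'] = toy['Time'].str[:2]

# normalize='index' divides each row by its total, giving per-hour fractions
shares = pd.crosstab(toy['Hour'], toy['Accident_Severity'], normalize='index')
worst_hour = shares[1].idxmax()
print(worst_hour, shares.loc[worst_hour, 1])
```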


In [51]:
import plotly.plotly as py
import plotly.graph_objs as go
import numpy as np
import cufflinks as cf

In [110]:
print(list(acc_pivot['fatal']))
print(list(acc_pivot.index.values))

[769, 613, 527, 450, 360, 425, 582, 819, 788, 769, 954, 960, 983, 1084, 1107, 1243, 1400, 1457, 1215, 1187, 948, 918, 960, 863]
['00', '01', '02', '03', '04', '05', '06', '07', '08', '09', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', '23']


In [144]:
## Multiple-axes
## https://plot.ly/python/multiple-axes/

trace0 = go.Bar(x = acc_pivot.index,
                y = acc_pivot['fatal'],
                yaxis = 'y',
                name = 'Fatal'
            )

trace1 = go.Bar(x = acc_pivot.index,
                y = acc_pivot['Serious'],
                yaxis = 'y',
                name = 'Serious'
            )

trace2 = go.Bar(x = acc_pivot.index,
                y = acc_pivot['Slight'],
                yaxis = 'y',                
                name = 'Slight'
            )

## line on the right-hand axis: share of fatal accidents per hour, in percent
fatal_ratio_labels = ["{:.2f}".format(i) for i in acc_pivot['fatal_ratio']]
trace4 = go.Scatter(x = acc_pivot.index,
                    y = acc_pivot['fatal_ratio'],
                    yaxis = 'y2',
                    mode='lines+text',
                    text = fatal_ratio_labels,
                    textposition = 'top center',
                    name = 'Fatal ratio (%)'
            )

data = [trace2,trace1,trace0, trace4]

layout = go.Layout(
    barmode='stack',
    yaxis=dict(
        title='Number of accidents'
    ),
    yaxis2=dict(
        title='Fatal accidents (%)',
        anchor='x',
        overlaying='y',
        side='right'
    ),
)

fig = go.Figure(data=data,layout=layout)

fig1 = py.iplot(fig, filename='basic histogram')


fig1

In [56]:
dir(py.PlotlyDisplay)

AttributeError: module 'plotly.plotly' has no attribute 'PlotlyDisplay'

The number indicates the 

**Q:** Do accidents in high-speed-limit areas have more casualties? Compute the Pearson correlation coefficient between the speed limit and the ratio of the number of casualties to accidents for each speed limit. Bin the data by speed limit.
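
A minimal sketch of the binning and correlation described in this question, on made-up records. The column names `Speed_limit` and `Number_of_Casualties` are assumed, and the values are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy accident records (hypothetical values); column names are assumed
toy = pd.DataFrame({'Speed_limit': [30, 30, 40, 40, 60, 60, 70, 70],
                    'Number_of_Casualties': [1, 1, 1, 2, 2, 2, 2, 3]})

# One bin per speed limit: total casualties divided by number of accidents,
# which is the mean number of casualties per accident
per_limit = toy.groupby('Speed_limit')['Number_of_Casualties'].mean()

# Pearson correlation between the speed limits and the per-limit ratios
r = np.corrcoef(per_limit.index.values, per_limit.values)[0, 1]
print(r)
```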

**Q:** How many times more likely are accidents involving male car drivers to be fatal compared to accidents involving female car drivers? The answer should be the ratio of fatality rates of males to females. Ignore all accidents where the driver wasn't driving a car.
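
One way this ratio could be computed, sketched on made-up rows that stand in for a merged accident/vehicle table. The column names and codes (`Vehicle_Type` 9 for car, `Sex_of_Driver` 1 for male and 2 for female) are assumptions of this sketch and should be checked against the dataset's lookup tables.

```python
import pandas as pd

# Toy merged accident/vehicle rows (hypothetical values, assumed names and codes):
# Vehicle_Type 9 = car, Sex_of_Driver 1 = male / 2 = female, Accident_Severity 1 = fatal
toy = pd.DataFrame({
    'Vehicle_Type':      [9, 9, 9, 9, 9, 9, 9, 9, 1],
    'Sex_of_Driver':     [1, 1, 1, 1, 2, 2, 2, 2, 1],
    'Accident_Severity': [1, 1, 3, 3, 1, 3, 3, 3, 1],
})

# Keep car drivers only, then take the share of fatal rows per sex
cars = toy[toy['Vehicle_Type'] == 9]
fatal_rate = (cars['Accident_Severity'] == 1).groupby(cars['Sex_of_Driver']).mean()
ratio = fatal_rate.loc[1] / fatal_rate.loc[2]
print(ratio)
```

This counts one row per vehicle, so an accident involving two male car drivers contributes twice; deduplicating on `Accident_Index` within each sex would count accidents instead.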

**Q:** How fast does the number of car accidents drop off with age? Only consider car drivers who are legally allowed to drive in the UK (17 years or older). Find the rate at which the number of accidents exponentially decays with age. Age is measured in years. Assume that the number of accidents is exponentially distributed with age for drivers over the age of 17.
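
If the counts follow N(age) = A * exp(-k * age), then ln N is a straight line in age with slope -k, so the decay rate can be estimated with a linear fit to the logarithm of the counts. The sketch below uses synthetic counts generated with k = 0.03; it illustrates the technique and is not the dataset's answer.

```python
import numpy as np

# Synthetic counts per age that decay as exp(-0.03 * age) (illustrative only)
ages = np.arange(17, 90)
counts = 50000 * np.exp(-0.03 * (ages - 17))

# ln(count) = ln(A) - k * age, so the slope of a straight-line fit to the log is -k
slope, intercept = np.polyfit(ages, np.log(counts), 1)
decay_rate = -slope
print(decay_rate)
```

On real data, ages with zero accidents would need to be dropped before taking the logarithm.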