In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns # seaborn plotting - works better for some plots
import matplotlib.pyplot as plt # matplotlib - works better for other plots
%matplotlib inline

# load CSV file into a pandas dataframe
df = pd.read_csv("../input/us-accidents/US_Accidents_Dec20_Updated.csv")

Business Understanding (10 points total).
• [10 points] Describe the purpose of the data set you selected (i.e., why was this data collected in the first place?). Describe how you would define and measure the outcomes from the dataset. How would you measure the effectiveness of a good prediction algorithm or clustering algorithm?

We selected the accidents data set partly for its large number of observations and features: nearly 3 million observations and nearly 50 features. This gives our team of 5 a great deal of flexibility when performing EDA tasks. For example, there is a much lower chance of overlap in our analyses with this many features than if we had selected a dataset with 10 or fewer features for our project. Additionally, the dataset is relevant to all of us as drivers and pedestrians in the city of Philadelphia. Driving carries real risk, and data that represents the risks of venturing out on the road is valuable to many parties, including insurance carriers determining rates, collision centers deciding where to open a location, and injury attorney offices.

I would define and measure the outcomes from the dataset by focusing on a specific purpose or goal, because the dataset contains so much information and so many attributes. For example, I could focus on a region such as the City of Philadelphia, or on a particular kind of traffic accident, such as accidents occurring in or around a roundabout. The data is far too broad to explain without a specific goal or question in mind. If I attempted an overall "summary" of this data, I would likely not derive many meaningful insights.

I could measure the effectiveness of a prediction or clustering algorithm by comparing its performance on the training and test sets; a large gap between the two would suggest overfitting. Additionally, I could run the algorithm on the dataset and then compare its output to a real-world sample for accuracy.




# # Begin EDA
What are the dimensions of the dataset?

In [None]:
df.shape

47 Columns (features) and 2,907,610 rows (observations)

Print the first 5 observations.

In [None]:
df.head()

What is the earliest date the data goes back to chronologically?

In [None]:
min(df['Start_Time'])

February 8th, 2016 is the oldest record in the dataset.

What is the most recent accident in the dataset?

In [None]:
max(df['Start_Time'])

December 31st, 2020 is the latest recorded date in the dataset.
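`Start_Time` is read from the CSV as text, so `min` and `max` above compare strings. That gives the right answer for ISO-style timestamps, but parsing the column first is safer. A minimal sketch, assuming a small made-up frame in place of `df` (the timestamps below are illustrative, not taken from the dataset):

```python
import pandas as pd

# Hypothetical stand-in for df; the notebook would use the loaded accidents frame.
sample = pd.DataFrame({
    "Start_Time": ["2016-02-08 00:37:08", "2020-12-31 23:28:56", "2018-06-15 12:00:00"]
})

# Parse the text column into real datetimes; unparseable values become NaT.
start = pd.to_datetime(sample["Start_Time"], errors="coerce")

earliest = start.min()
latest = start.max()
print(earliest, latest)
```

With a parsed column, the `.dt` accessor also becomes available for grouping by year, month, or hour.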

# # Accident Severity Statistics

What is the mean severity for all accidents?

In [None]:
np.mean(df['Severity'])

Accidents average to 2.9 in severity.

What percentage of the recorded accidents fall into the 4 (most severe) category?

In [None]:
df[df['Severity'] == 4].shape[0]/df.shape[0]*100 # percentage of rows with Severity == 4
# 4.1% of accidents fall into the "4"/severe category.
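The same percentage can be obtained for every severity level at once with `value_counts(normalize=True)`. A small sketch, assuming a made-up series in place of `df["Severity"]` (the values are illustrative only):

```python
import pandas as pd

# Hypothetical severity values standing in for df["Severity"].
severity = pd.Series([2, 2, 3, 2, 4, 1, 2, 3, 2, 2])

# Share of each severity level, in percent, ordered by level.
severity_pct = severity.value_counts(normalize=True).sort_index() * 100
print(severity_pct)
```

Seeing all four levels side by side also shows how imbalanced the classes are, which matters later when judging a classifier.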

#  • [10 points] Describe the meaning and type of data (scale, values, etc.) for each attribute in the data file.

The attributes are listed below in table form. Note: some information has been supplemented due to the sheer number of attributes in the dataset.


Attribute #	Attribute	Description	Nullable


1	ID	This is a unique identifier of the accident record.	No

2	Severity	Shows the severity of the accident, a number between 1 and 4, where 1 indicates the least impact on traffic (i.e., short delay as a result of the accident) and 4 indicates a significant impact on traffic (i.e., long delay).	No

3	Start_Time	Shows start time of the accident in local time zone.	No

4	End_Time	Shows end time of the accident in local time zone. End time here refers to when the impact of accident on traffic flow was dismissed.	No

5	Start_Lat	Shows latitude in GPS coordinate of the start point.	No

6	Start_Lng	Shows longitude in GPS coordinate of the start point.	No

7	End_Lat	Shows latitude in GPS coordinate of the end point.	Yes

8	End_Lng	Shows longitude in GPS coordinate of the end point.	Yes

9	Distance(mi)	The length of the road extent affected by the accident.	No

10	Description	Shows natural language description of the accident.	No

11	Number	Shows the street number in address field.	Yes

12	Street	Shows the street name in address field.	Yes

13	Side	Shows the relative side of the street (Right/Left) in address field.	Yes

14	City	Shows the city in address field.	Yes

15	County	Shows the county in address field.	Yes

16	State	Shows the state in address field.	Yes

17	Zipcode	Shows the zipcode in address field.	Yes

18	Country	Shows the country in address field.	Yes

19	Timezone	Shows timezone based on the location of the accident (eastern, central, etc.).	Yes

20	Airport_Code	Denotes an airport-based weather station which is the closest one to location of the accident.	Yes

21	Weather_Timestamp	Shows the time-stamp of weather observation record (in local time).	Yes

22	Temperature(F)	Shows the temperature (in Fahrenheit).	Yes

23	Wind_Chill(F)	Shows the wind chill (in Fahrenheit).	Yes

24	Humidity(%)	Shows the humidity (in percentage).	Yes

25	Pressure(in)	Shows the air pressure (in inches).	Yes

26	Visibility(mi)	Shows visibility (in miles).	Yes

27	Wind_Direction	Shows wind direction.	Yes

28	Wind_Speed(mph)	Shows wind speed (in miles per hour).	Yes

29	Precipitation(in)	Shows precipitation amount in inches, if there is any.	Yes

30	Weather_Condition	Shows the weather condition (rain, snow, thunderstorm, fog, etc.)	Yes

31	Amenity	A POI annotation which indicates presence of amenity in a nearby location.	No

32	Bump	A POI annotation which indicates presence of speed bump or hump in a nearby location.	No

33	Crossing	A POI annotation which indicates presence of crossing in a nearby location.	No

34	Give_Way	A POI annotation which indicates presence of give_way in a nearby location.	No

35	Junction	A POI annotation which indicates presence of junction in a nearby location.	No

36	No_Exit	A POI annotation which indicates presence of no_exit in a nearby location.	No

37	Railway	A POI annotation which indicates presence of railway in a nearby location.	No

38	Roundabout	A POI annotation which indicates presence of roundabout in a nearby location.	No

39	Station	A POI annotation which indicates presence of station in a nearby location.	No

40	Stop	A POI annotation which indicates presence of stop in a nearby location.	No

41	Traffic_Calming	A POI annotation which indicates presence of traffic_calming in a nearby location.	No

42	Traffic_Signal	A POI annotation which indicates presence of traffic_signal in a nearby location.	No

43	Turning_Loop	A POI annotation which indicates presence of turning_loop in a nearby location.	No

44	Sunrise_Sunset	Shows the period of day (i.e. day or night) based on sunrise/sunset.	Yes

45	Civil_Twilight	Shows the period of day (i.e. day or night) based on civil twilight.	Yes

46	Nautical_Twilight	Shows the period of day (i.e. day or night) based on nautical twilight.	Yes

47	Astronomical_Twilight	Shows the period of day (i.e. day or night) based on astronomical twilight.	Yes



# • [15 points] Verify data quality: Are there missing values? Duplicate data? Outliers? Are those mistakes? How do you deal with these problems?

In [None]:
# Count missing values in every column
# (print each result: a notebook cell only displays its last expression)
print(df.isnull().sum())

# Count fully duplicated rows
print(df.duplicated().sum())

# Count rows that share the same Start_Time, Street, and Zipcode
dup_df = df[df.duplicated(subset=['Start_Time', 'Street', 'Zipcode'], keep=False)]
print(len(dup_df))

There are missing values in the dataset. The missing values are not necessarily mistakes; the reporting officer may have skipped parts of the accident report or judged some details unnecessary. The attribute with the most missing values is the street number (Number), which is missing in 1,891,672 observations. This is understandable, as some stretches of road are not in front of any business with a street number, such as an airfield.

There are no fully duplicated rows in the data, and there are no rows sharing the same Start_Time, Street, and Zipcode.

Unfortunately, few numeric attributes in the dataset lend themselves to meaningful outlier detection with a box plot. For this example, I selected the attribute Distance(mi), the length of road affected by the accident. In the box plot, we see an outlier with nearly 350 miles of road affected. Either this value was recorded inaccurately or it was a terrible accident.

We deal with these issues by dropping null/missing values and excluding outliers.
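One way to do this is sketched below. The notebook does not state which outlier rule is used, so the 1.5 × IQR fence here is an assumption, and `drop_missing_and_outliers` is a hypothetical helper name rather than part of pandas.

```python
import pandas as pd

def drop_missing_and_outliers(frame, column, k=1.5):
    """Drop rows where `column` is missing or falls outside the k * IQR fences."""
    clean = frame.dropna(subset=[column])
    q1 = clean[column].quantile(0.25)
    q3 = clean[column].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return clean[clean[column].between(lower, upper)]

# Hypothetical distances in miles, with one missing value and one extreme value.
toy = pd.DataFrame({"Distance(mi)": [0.0, 0.1, 0.2, 0.3, 0.5, None, 350.0]})
filtered = drop_missing_and_outliers(toy, "Distance(mi)")
print(filtered)
```

Dropping rows only for the column under study, rather than calling `dropna()` on the whole frame, avoids discarding most of the data because of sparsely filled attributes such as Number.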

In [None]:
import seaborn as sns
sns.boxplot(x=df['Distance(mi)'])

#  [10 points] Give simple, appropriate statistics (range, mode, mean, median, variance, counts, etc.) for the most important attributes and describe what they mean or if you found something interesting. Note: You can also use data from other sources for comparison.

We can conveniently use the pandas agg function here for accomplishing this task in one fell swoop.



In [None]:
# DISABLE TRUNCATING!
pd.set_option('display.max_colwidth', None)
# DISABLE scientific notation
pd.set_option('display.float_format', lambda x: '%.5f' % x)

In [None]:
df.agg(
     {
         "Severity": ["min", "max", "median", "mean", "count", "var", "std"],
         "Distance(mi)": ["min", "max", "median", "mean", "count", "var", "std"],
         "Temperature(F)": ["min", "max", "median", "mean", "count", "var", "std"],
         "Visibility(mi)": ["min", "max", "median", "mean", "count", "var", "std"],
     }
 )

For this exercise, I selected the attributes Severity, Distance(mi), Temperature(F), and Visibility(mi). Again, there are not many numeric attributes in the dataset to compute statistics on, so I have defaulted to these. Interestingly, the counts show that Severity and Distance(mi) are present in every observation, while Temperature(F) and Visibility(mi) are not. This is useful to know: temperature and visibility are helpful but not critical from an EDA perspective, and any analysis that uses them must account for missing values. Additionally, the Temperature(F) column has a high variance, which reflects the temperamental climate found across the US. Finally, the maximum Visibility(mi) value is 140 miles, which is an interesting outlier.

In [None]:
import matplotlib.pyplot as plt

# • [15 points] Visualize the most important attributes appropriately (at least 5 attributes). Important: Provide an interpretation for each chart. Explain for each attribute why you chose the used visualization.


In [None]:
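# Plot the mean distance per severity level; one bar per row would mean millions of bars
df.groupby("Severity")["Distance(mi)"].mean().plot.bar(alpha=0.5)

A bar plot was chosen here because Severity takes only four ordinal values, so one bar per level makes the mean Distance(mi) easy to compare.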

In [None]:
df.plot.scatter(x="Distance(mi)", y="Visibility(mi)", alpha=0.5)

A scatter plot was chosen here to visualize Distance(mi) against Visibility(mi) and easily see outliers.

In [None]:
# One box of Temperature(F) per Severity level
df.boxplot(column="Temperature(F)", by="Severity")

A box plot was chosen here to easily see outliers and to check whether temperature plays a role in Severity.

In [None]:
# One box of Visibility(mi) per Severity level
df.boxplot(column="Visibility(mi)", by="Severity")

Another box plot was chosen here to easily see outliers and to check whether visibility plays a role in Severity.

In [None]:
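# Mean severity at each humidity value; the column name includes its unit
df.groupby("Humidity(%)")["Severity"].mean().plot(alpha=0.5)

A line plot of mean Severity at each humidity level was chosen here to see if humidity plays a role in accident severity.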

SciKit Learn Modeling & Prediction

In [None]:
from matplotlib.colors import ListedColormap
from sklearn import neighbors

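# Keep numeric predictors only and exclude the target (Severity) to avoid leakage
X = df.select_dtypes(include="number").drop(columns=["Severity"])
# KNN cannot handle NaN; fill gaps with each column's median so rows stay aligned with y
X = X.fillna(X.median())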



In [None]:
X.head()

In [None]:
y = df.Severity

In [None]:
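# Plot a random sample (5,000 rows is an arbitrary choice); every row is too slow to draw
sample = X.sample(n=min(5000, len(X)), random_state=0)
pd.plotting.scatter_matrix(sample, figsize = (15, 9), c = y.loc[sample.index], marker = 'o')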

In [None]:
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

In [None]:
from sklearn.neighbors import KNeighborsClassifier

In [None]:
knn = KNeighborsClassifier(n_neighbors  = 5)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
y_pred

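# Severity ranges from 1 to 4, so class 0 never occurs; evaluate the most severe class instead
target = 4

tp = ((y_test == target) & (y_pred == target)).sum()

fn = ((y_test == target) & (y_pred != target)).sum()

fp = ((y_test != target) & (y_pred == target)).sum()

tn = ((y_test != target) & (y_pred != target)).sum()

print("recall for class " + str(target) + " = " + str(tp / (tp + fn)))
print("precision for class " + str(target) + " = " + str(tp / (tp + fp)))
print("F1 Score for class " + str(target) + " = " + str(2 * tp /(2 * tp + fp + fn)))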
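The one-vs-rest counts above generalize to every severity level. The sketch below is one possible way to do that in plain Python; `per_class_metrics` is a hypothetical helper, and the labels are made up rather than taken from the model's output.

```python
def per_class_metrics(y_true, y_pred, labels):
    """Return {label: (precision, recall, f1)} using one-vs-rest counts."""
    results = {}
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        # Guard against empty denominators when a class is absent or never predicted.
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0
        results[label] = (precision, recall, f1)
    return results

# Hypothetical labels, not model output from this notebook.
toy_true = [2, 2, 3, 4, 2, 3]
toy_pred = [2, 3, 3, 4, 2, 2]
metrics = per_class_metrics(toy_true, toy_pred, labels=[1, 2, 3, 4])
for label, (precision, recall, f1) in metrics.items():
    print(label, round(precision, 3), round(recall, 3), round(f1, 3))
```

Reporting each class separately matters here because a rare class such as Severity 4 can score poorly even when overall accuracy looks high.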

We selected Severity as the class to predict for this exercise. Although the dataset was too large for the code to finish running, the attempt gave us an idea of how to approach the final project. We will look into other libraries or a platform that can handle the 3 million records better than Kaggle/Colaboratory or our local machines.
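Until a larger platform is available, one possible workaround is to fit the model on a stratified sample so that every severity level keeps its share of rows. This is a sketch under assumptions: `stratified_sample` is a hypothetical helper, the toy frame is made up, and `groupby(...).sample` requires pandas 1.1 or later.

```python
import pandas as pd

def stratified_sample(frame, label_col, frac, seed=0):
    """Sample the same fraction from every label so class proportions are preserved."""
    return frame.groupby(label_col, group_keys=False).sample(frac=frac, random_state=seed)

# Hypothetical frame: 80 rows of Severity 2 and 20 rows of Severity 3.
toy = pd.DataFrame({"Severity": [2] * 80 + [3] * 20, "Distance(mi)": range(100)})
small = stratified_sample(toy, "Severity", frac=0.1)
print(small["Severity"].value_counts())
```

The sampling fraction would need tuning; a sample that is too small may contain very few Severity 1 or Severity 4 rows.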

In [None]:
# Groupwise mean
# numeric_only=True skips text columns, which cannot be averaged
df.groupby('Severity').mean(numeric_only=True)

There are several features that could be added to the data or created from existing features to help data scientists do their job better. One of these is creating more continuous (numeric) features. Many of the features in the dataset are class-based booleans (True/False). These can be helpful when applying decision tree, Apriori, or other prediction algorithms, but less so for plotting unless we are only aggregating counts of values. We could address this by including vehicle speed at the time of the accident if available, posted speed limits (MPH) on roads, etc. As cars become smarter, manufacturers may need to begin reporting more electronic data to NHTSA, and this data could be released to the public.
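As one illustration of creating numeric features from existing ones, the sketch below derives a duration and two time-of-day features from `Start_Time` and `End_Time`. The new column names (`Duration(min)`, `Start_Hour`, `Start_Weekday`) are assumptions chosen for this example, and the rows are made up.

```python
import pandas as pd

# Hypothetical rows using the dataset's Start_Time / End_Time column names.
toy = pd.DataFrame({
    "Start_Time": ["2020-01-01 08:00:00", "2020-01-01 23:30:00"],
    "End_Time": ["2020-01-01 08:45:00", "2020-01-02 00:30:00"],
})

start = pd.to_datetime(toy["Start_Time"])
end = pd.to_datetime(toy["End_Time"])

# Duration of the traffic impact in minutes.
toy["Duration(min)"] = (end - start).dt.total_seconds() / 60
# Hour of day and day of week (Monday = 0) when the accident started.
toy["Start_Hour"] = start.dt.hour
toy["Start_Weekday"] = start.dt.dayofweek
print(toy)
```

Features like these are continuous or ordinal, so they can be used directly in scatter plots and box plots alongside Distance(mi) and the weather attributes.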



Additionally, recording more information about the people or vehicles involved in the accidents could allow manufacturers to better understand when their vehicles get into accidents, or to include more safety features on frequently crashed models.

