## US Traffic Accident Analysis

### Team Members:
  Yinghao Wang, Keshuo Liu, Yu Shu, Zhenyang Gai, Simeng Li, Kratik Gupta

### Problem Definition:

Our goal is to identify the factors that influence the number of traffic accidents in the United States. We consider factors such as location, weather, and time of day, and use Tableau for geographic visualization. Working with a large-scale dataset makes the relationships we detect between these factors and accident counts more reliable. We aim to provide safe-driving suggestions to the DMV and to drivers.

### Data Source Link:

1. "US Traffic Accident": A Countrywide Traffic Accident Dataset (2016 - 2020). https://www.kaggle.com/sobhanmoosavi/us-accidents
2. "US Population": United States Census Bureau. https://www.census.gov/

### Data Cleaning Process:

On Kaggle we found a countrywide car accident dataset that covers 49 states of the USA. The accident data were collected from February 2016 to December 2020 using multiple APIs that provide streaming traffic incident (or event) data. These APIs broadcast traffic data captured by a variety of entities, such as the US and state departments of transportation, law enforcement agencies, traffic cameras, and traffic sensors within the road networks. The dataset currently contains about 3 million accident records, so it needs to be cleaned before analysis. From the available data, we selected records from each year to see how the number of accidents changed over 5 years.
Each year in the US accident table contains a large number of records. Considering query efficiency and our data-size and budget limits on Google Cloud Platform, we randomly sampled about 10% of each year's data and unioned the samples into one table.
Below is an example of the random selection process for the 2017 data:

In [None]:
%%bigquery
SELECT * FROM `ba775-project-team1.dataset_demo.us_traffic_accidents`
WHERE rand() <= 0.1 AND  extract(year from start_time) = 2017
LIMIT 5;



Unnamed: 0,ID,Severity,Start_Time,End_Time,Start_Lat,Start_Lng,End_Lat,End_Lng,Distance_mi_,Description,...,Roundabout,Station,Stop,Traffic_Calming,Traffic_Signal,Turning_Loop,Sunrise_Sunset,Civil_Twilight,Nautical_Twilight,Astronomical_Twilight
0,A-1829991,4,2017-02-05 12:01:14+00:00,2017-02-05 18:01:14+00:00,32.604144,-112.869644,32.381274,-112.8724,15.4,Closed between Gilbert and Ajo Well Rd - Road ...,...,False,False,False,False,False,False,Day,Day,Day,Day
1,A-24095,3,2017-04-12 08:42:05+00:00,2017-04-12 09:24:52+00:00,27.925198,-82.593422,,,0.01,Right lane blocked due to accident on I-275 No...,...,False,False,False,False,False,False,Day,Day,Day,Day
2,A-1795739,2,2017-08-23 16:12:20+00:00,2017-08-23 16:41:48+00:00,37.885376,-122.516586,37.885376,-122.516586,0.0,#1 lane blocked due to accident on US-101 Nort...,...,False,False,False,False,False,False,Day,Day,Day,Day
3,A-28611,2,2017-05-31 14:51:29+00:00,2017-05-31 20:51:29+00:00,42.212128,-72.613529,42.212128,-72.613529,0.0,At Linden St - Accident.,...,False,False,False,False,False,False,Day,Day,Day,Day
4,A-1497379,2,2017-11-17 13:18:57+00:00,2017-11-17 14:33:44+00:00,42.114193,-72.621414,42.114193,-72.621414,0.0,Slow traffic due to serious accident on US-5 R...,...,False,False,False,False,False,False,Day,Day,Day,Day
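
The same sampling idea can be sketched outside BigQuery. The snippet below is a minimal illustration, assuming hypothetical in-memory records with a `year` field. It keeps each record independently with probability 0.1, mirroring `rand() <= 0.1`, so each year's sample is about 10% of that year rather than exactly 10%. The fixed seed is an assumption added here to make the draw repeatable; the original query has no seed.

```python
import random
from collections import Counter

def sample_by_year(records, rate=0.1, seed=42):
    """Keep each record independently with probability `rate`."""
    rng = random.Random(seed)
    return [r for r in records if rng.random() <= rate]

# Hypothetical toy data: 10,000 records per year, 2016 to 2020
records = [
    {"id": f"A-{year}-{i}", "year": year}
    for year in range(2016, 2021)
    for i in range(10_000)
]

sample = sample_by_year(records)
per_year = Counter(r["year"] for r in sample)
for year in sorted(per_year):
    print(year, per_year[year])  # roughly 1,000 per year, not exactly
```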


After the first step of selecting sample data, we observed some invalid rows. Furthermore, we don't need all 47 columns from the original dataset. For example, we don't need the "country" column, since all records are from the United States. For the related analyses we also need to exclude invalid rows, such as those where the state name is null or the weather condition is unknown.
The filtering process is presented below:

In [None]:
%%bigquery
select ID, Severity, State, Start_Time as Time, EXTRACT(month from Start_Time) Month,  EXTRACT(hour FROM Start_Time) Hour, FORMAT_DATE('%A', EXTRACT(date FROM Start_Time)) AS Weekday, Weather_Condition as Weather, Temperature_F_ as Temp, 
Visibility_mi_ as Visibility, Precipitation_in_ as Preciputation, Railway, Station, Traffic_Signal
from `ba775-project-team1.dataset_demo.us_traffic_accidents` 
where Weather_Condition <> 'nan' 
and Visibility_mi_ >= 0 
and Precipitation_in_ >= 0
LIMIT 5;



Unnamed: 0,ID,Severity,State,Time,Month,Hour,Weekday,Weather,Temp,Visibility,Preciputation,Railway,Station,Traffic_Signal
0,A-38249,2,NY,2016-09-01 09:14:56+00:00,9,9,Thursday,Light Rain,72.0,5.0,0.02,False,False,False
1,A-1164713,3,NY,2020-04-18 07:13:47+00:00,4,7,Saturday,Fog,40.0,0.75,0.0,False,False,False
2,A-1251601,2,NY,2017-10-09 15:46:08+00:00,10,15,Monday,Light Rain,72.0,1.2,0.0,False,False,False
3,A-597940,2,NY,2016-07-25 08:53:43+00:00,7,8,Monday,Light Rain,78.1,10.0,0.0,False,False,False
4,A-2587815,2,DC,2020-11-12 15:55:38+00:00,11,15,Thursday,Cloudy,52.0,10.0,0.01,False,False,False


Note that across all cleaning and filtering steps, around 5,000 accidents were filtered out, which accounts for 0.17% of the dataset under observation. With so little data eliminated, the analysis output is not significantly affected, so we can still produce reliable results.

### Analysis Topics

To achieve our objective, we chose time of day, day of the week, location (by state and at particular kinds of places), and weather as the aspects to examine for their effects on traffic accidents and their severity. All data processing was performed on Google BigQuery.

### Location:

**Which state has the highest number of traffic accidents?**

We analyzed the data to find which state has the most accidents. California has the most, with more than twice as many accidents as the second-highest state, Florida.

In [None]:
%%bigquery
SELECT State, count(ID) as num_of_accidents
FROM `ba775-project-team1.dataset_demo.sample_table_wpopulation`
GROUP BY State
ORDER BY num_of_accidents DESC
LIMIT 5;



Unnamed: 0,State,num_of_accidents
0,CA,39209
1,FL,16078
2,TX,9688
3,NY,7161
4,OR,7123


<img src="https://github.com/yinghaow525/BA775-teamproject/blob/charts/Traffic%20Accidents%20Count%20National%20Map.png?raw=True" width="500" align="left" />

Since population varies from state to state, and the number of privately owned motor vehicles with it, population has an impact on the number of accident records. If we factor out state population by computing a new field, accidents per million residents, California is no longer the state with the highest figure. California now has the same rounded accidents_per_million value as Minnesota!

In [None]:
%%bigquery
SELECT State, count(ID) as num_of_accidents, population, cast(count(ID)/population*1000000 as INTEGER) accidents_per_million
FROM `ba775-project-team1.dataset_demo.sample_table_wpopulation`
GROUP BY State, population
ORDER BY accidents_per_million DESC
limit 5;



Unnamed: 0,State,num_of_accidents,population,accidents_per_million
0,OR,7123,4217737,1689
1,SC,7121,5148714,1383
2,CA,39209,39512223,992
3,MN,5594,5639632,992
4,UT,2779,3205958,867
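
As a worked check of the normalization, the snippet below recomputes accidents per million from the counts and populations in the output above. It is a sketch only: Python's `round()` stands in for the query's integer cast, which is safe here because none of the values is an exact .5 tie. Before rounding, California comes out at about 992.3 and Minnesota at about 991.9, so the two states match only once the values are rounded.

```python
# (state, sampled accident count, population) copied from the query output above
rows = [
    ("OR", 7123, 4217737),
    ("SC", 7121, 5148714),
    ("CA", 39209, 39512223),
    ("MN", 5594, 5639632),
    ("UT", 2779, 3205958),
]

def per_million(count, population):
    """Accidents per one million residents."""
    return count / population * 1_000_000

rates = {state: per_million(count, pop) for state, count, pop in rows}
for state, rate in rates.items():
    # round() stands in for the integer cast used in the SQL query
    print(f"{state}: {rate:.2f} -> {round(rate)}")
```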


<img src="https://github.com/yinghaow525/BA775-teamproject/blob/charts/Accidents%20per%20Million%20National%20Map.png?raw=True" width="500" align ="left" />

**Which states are more prone to serious traffic accidents?**

The distribution changes again when we take severity level into account. Several states, such as Montana, South Dakota, and North Dakota, do not have level-1 severity records. The number of serious (level-4) accidents in Florida, California, and New York is larger than in other states.

In [None]:
%%bigquery
SELECT State, count(ID) as num_of_accidents
FROM `ba775-project-team1.dataset_demo.sample_table_wpopulation`
WHERE Severity=4
GROUP BY State
ORDER BY num_of_accidents DESC
LIMIT 5;



Unnamed: 0,State,num_of_accidents
0,FL,457
1,CA,354
2,NY,350
3,PA,342
4,GA,332
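
Raw counts favor large states, so it can also help to look at severity-4 accidents as a share of each state's sampled accidents. The sketch below does this for the three states that appear in both the total-count output and the severity-4 output above; the other states are left out because their totals are not shown.

```python
# Sampled counts copied from the two query outputs above
total_accidents = {"CA": 39209, "FL": 16078, "NY": 7161}
severity4_accidents = {"FL": 457, "CA": 354, "NY": 350}

severe_share = {
    state: severity4_accidents[state] / total_accidents[state] * 100
    for state in severity4_accidents
}

# Print states from the highest to the lowest severity-4 share
for state, share in sorted(severe_share.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{state}: {share:.2f}% of sampled accidents are severity 4")
```

Under this view New York ranks ahead of Florida and California, even though Florida has the most severity-4 accidents in absolute terms.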


<img src='https://github.com/yinghaow525/BA775-teamproject/blob/charts/Sevirity%20Counts.png?raw=True' align='left' width='500' />

**What characteristics were associated with more traffic accidents?**

Comparing the number of accidents near each type of location feature between 2016 and 2020, traffic signals were clearly the most frequent spots, followed by junctions and crossings.

In [None]:
%%bigquery
-- COUNTIF counts the rows where the flag is TRUE; FALSE and NULL are ignored
SELECT
    COUNTIF(Amenity) AS Amenity,
    COUNTIF(Bump) AS Bump,
    COUNTIF(Crossing) AS Crossing,
    COUNTIF(Give_Way) AS Give_Way,
    COUNTIF(Junction) AS Junction,
    COUNTIF(Railway) AS Railway,
    COUNTIF(Roundabout) AS Roundabout,
    COUNTIF(Station) AS Station,
    COUNTIF(Stop) AS Stop,
    COUNTIF(Traffic_Calming) AS Traffic_Calming,
    COUNTIF(Traffic_Signal) AS Traffic_Signal,
    COUNTIF(Turning_Loop) AS Turning_Loop
FROM `ba775-project-team1.dataset_demo.sample_table_wpopulation`;



Unnamed: 0,Amenity,Bump,Crossing,Give_Way,Junction,Railway,Roundabout,Station,Stop,Traffic_Calming,Traffic_Signal,Turning_Loop
0,1694,48,11960,402,13911,1403,11,3244,2436,97,23981,0


<img src="https://github.com/KeshuoLiu/ba775-project-team1/blob/main/location.png?raw=True" align="left" width="500"/>

**Since traffic signals were the most frequent accident spots, as shown above, when did accidents occur most often around traffic signals?**

In the sampled data, accidents near traffic signals made up the largest share of all accidents on weekday mornings, during the morning peak (7 to 8 a.m.) and in the late morning (10 to 11 a.m.).

In [None]:
%%bigquery
SELECT Weekday, Hour, Traffic_Signal, ROUND(COUNT(ID)/SUM(COUNT(ID)) OVER(PARTITION BY Weekday, Hour) * 100,2) Percentage
FROM `ba775-project-team1.dataset_demo.sample_table_wpopulation`
GROUP BY Weekday, Traffic_Signal, Hour
ORDER BY Traffic_Signal DESC, Percentage DESC
LIMIT 6



Unnamed: 0,Weekday,Hour,Traffic_Signal,Percentage
0,Tuesday,11,True,22.58
1,Wednesday,10,True,21.1
2,Tuesday,8,True,20.74
3,Friday,10,True,20.33
4,Friday,11,True,20.2
5,Wednesday,7,True,19.97
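
The `Percentage` column is a share within each weekday-and-hour slot, not a share of all accidents: the window `SUM(COUNT(ID)) OVER (PARTITION BY Weekday, Hour)` is the slot total. The sketch below reproduces that calculation on a few made-up rows (the data is hypothetical, chosen only to make the arithmetic easy to follow).

```python
from collections import Counter

# Hypothetical toy rows: (weekday, hour, near_traffic_signal)
rows = [
    ("Tuesday", 11, True),
    ("Tuesday", 11, False),
    ("Tuesday", 11, False),
    ("Tuesday", 11, False),
    ("Friday", 10, True),
    ("Friday", 10, False),
]

# Equivalent of GROUP BY Weekday, Hour, Traffic_Signal
group_counts = Counter(rows)
# Equivalent of SUM(COUNT(ID)) OVER (PARTITION BY Weekday, Hour)
slot_totals = Counter((day, hour) for day, hour, _ in rows)

percentages = {
    key: round(count / slot_totals[key[:2]] * 100, 2)
    for key, count in group_counts.items()
}
for key, pct in percentages.items():
    print(key, pct)
```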


<img src="https://github.com/KeshuoLiu/ba775-project-team1/blob/d544ec67edabf2fa033fa6c20cfa443f91d8a44d/temperature.jpg?raw=True" align="left" width="400"/>

### Weather Conditions:

In this section, we discuss how often traffic accidents occur under different weather conditions.

**Under which weather conditions do traffic accidents occur most often?**

The five weather conditions with the most traffic accidents are fair, cloudy, mostly cloudy, partly cloudy, and light rain. Together these five conditions account for over 80% of traffic accidents. <br>
About 41% of the accidents happened on fair days; the rest were on cloudy, rainy, or snowy days, with cloudy conditions making up the largest portion of those.

In [None]:
%%bigquery
SELECT Weather_Condition,  COUNT(ID) as number, ROUND( COUNT(ID)/(SELECT COUNT(ID) FROM `ba775-project-team1.dataset_demo.sample_table`  ),6) AS percentage FROM `ba775-project-team1.dataset_demo.sample_table` 
GROUP BY Weather_Condition
ORDER BY number desc
LIMIT 10



Unnamed: 0,Weather_Condition,number,percentage
0,Fair,65596,0.413578
1,Cloudy,23714,0.149515
2,Mostly Cloudy,20088,0.126653
3,Partly Cloudy,13350,0.084171
4,Light Rain,13241,0.083484
5,Light Snow,3513,0.022149
6,Overcast,3360,0.021185
7,Rain,2968,0.018713
8,Fog,2591,0.016336
9,Haze,1537,0.009691


In [None]:
%%bigquery
select SUM(percentage) FROM 
( SELECT Weather_Condition,  COUNT(ID) as number, ROUND( COUNT(ID)/(SELECT COUNT(ID) FROM `ba775-project-team1.dataset_demo.sample_table`  ),6) AS percentage FROM `ba775-project-team1.dataset_demo.sample_table` 
GROUP BY Weather_Condition
ORDER BY number desc
LIMIT 5 )



Unnamed: 0,f0_
0,0.857401
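
The cumulative share can be checked by hand from the per-condition shares in the earlier output. The sketch below sums the top five and also groups the three cloudy categories together, which shows how the combined cloudy share compares with fair weather.

```python
# Shares copied from the top five rows of the weather query output above
top_five = {
    "Fair": 0.413578,
    "Cloudy": 0.149515,
    "Mostly Cloudy": 0.126653,
    "Partly Cloudy": 0.084171,
    "Light Rain": 0.083484,
}

# Total share of the five most common conditions
cumulative = round(sum(top_five.values()), 6)

# Combined share of the three conditions whose name contains "Cloudy"
cloudy_family = round(
    sum(share for name, share in top_five.items() if "Cloudy" in name), 6
)

print(f"top five combined: {cumulative:.6f}")
print(f"cloudy categories combined: {cloudy_family:.6f}")
```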


<img src="https://github.com/yinghaow525/BA775-teamproject/blob/charts/weather_condition%20pie%20chart.png?raw=True" align="left" width="500"/>

**How severe are traffic accidents under different weather conditions?**
<br>
Under all kinds of weather conditions, most traffic accidents are of severity 2.

In [None]:
%%bigquery
SELECT Severity,Weather_Condition,  COUNT(ID) as number, ROUND( COUNT(ID)/(SELECT COUNT(ID) FROM `ba775-project-team1.dataset_demo.sample_table`  ),6) AS percentage FROM `ba775-project-team1.dataset_demo.sample_table` 
GROUP BY Weather_Condition, Severity 
ORDER BY number DESC
LIMIT 10



Unnamed: 0,Severity,Weather_Condition,number,percentage
0,2,Fair,53899,0.33983
1,2,Cloudy,19035,0.120014
2,2,Mostly Cloudy,15198,0.095822
3,2,Partly Cloudy,10374,0.065407
4,2,Light Rain,9315,0.05873
5,3,Fair,8396,0.052936
6,3,Mostly Cloudy,3730,0.023517
7,3,Cloudy,3513,0.022149
8,3,Light Rain,3285,0.020712
9,2,Light Snow,2414,0.01522


<img src="https://github.com/yinghaow525/BA775-teamproject/blob/charts/Severity%20barchart.png?raw=True" align="left" width="600"/>

We put several weather factors into consideration: visibility, precipitation, temperature and wind speed.

<img src='https://github.com/yinghaow525/BA775-teamproject/blob/charts/Weather%20Conditions.png?raw=True' width='800' align='left' />

**Taking visibility distance as a factor first, do more accidents happen when visibility distance is shorter (<= 5 miles)?**

In [None]:
%%bigquery
SELECT Month, round(avg(Visibility_mi_),2) Visibility, count(*) num_of_accidents
from `ba775-project-team1.dataset_demo.sample_table`
GROUP BY Month
ORDER BY Month



Unnamed: 0,Month,Visibility,num_of_accidents
0,1,7.72,8796
1,2,7.71,8222
2,3,8.17,8257
3,4,8.92,12394
4,5,9.2,12073
5,6,9.51,12374
6,7,9.46,6195
7,8,9.23,7552
8,9,9.13,13388
9,10,8.98,19422


There is no significant negative relationship between average visibility distance and the number of accidents across months. We can observe that when visibility distance is slightly shorter than 10 miles (a very clear view), the number of accidents is higher than at longer distances. Also, when visibility distance is lower than 8 miles, the proportion of severity levels 3 and 4 is larger than at longer visibility distances. To examine further whether low visibility distance (manually defined as <= 5 miles) is associated with more accidents, we use the following chart:

<img src='https://github.com/yinghaow525/BA775-teamproject/blob/charts/Number%20of%20Accidents%20Happened%20with%20Low%20Visibility%20Distance.png?raw=True' align='left' width='800' />

In this way, we can observe a clear, repeated pattern in accidents that happened with low visibility distance. When visibility distance is less than or equal to 5 miles, accident counts tend to be much higher in the first and fourth quarters of the year.

**Did temperature affect the number of traffic accidents and severity accordingly?**

The number of traffic accidents increased at the end of the year when temperatures were low. Level-2 severity accidents accounted for most of these late-year accidents, and the proportions of other level severity decreased.

In [None]:
%%bigquery
SELECT Month, Temperature_F_ Temperature, Severity, COUNT(ID) Count
FROM `ba775-project-team1.dataset_demo.sample_table_wpopulation`
WHERE is_nan(Temperature_F_) = False
GROUP BY Month, Severity, Temperature_F_
ORDER BY COUNT(ID) DESC, Severity DESC, Month
LIMIT 5



Unnamed: 0,Month,Temperature,Severity,Count
0,12,50.0,2,773
1,12,46.0,2,664
2,12,48.0,2,654
3,12,54.0,2,646
4,12,45.0,2,645


<img src="https://github.com/yinghaow525/BA775-teamproject/blob/charts/Temperature_Severity_Month.png?raw=True" align="left" width="500"/>

### Time:

**Does the total number of accidents increase year by year?**

The number of recorded accidents in the U.S. has increased every year since 2016, and the number of accidents in 2020 is about twice that in 2019.

In [None]:
%%bigquery
SELECT Year, COUNT(Year) AS Year_accidents_n
FROM `ba775-project-team1.dataset_demo.sample_table_wpopulation`
GROUP BY Year
ORDER BY Year_accidents_n;



Unnamed: 0,Year,Year_accidents_n
0,2016,2371
1,2017,5295
2,2018,7143
3,2019,47490
4,2020,96307
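
To put numbers on the year-over-year change, the sketch below divides each year's sampled count by the previous year's, using the values from the output above. The counts rise every year, and the largest jump is between 2018 and 2019.

```python
# Sampled accident counts per year, copied from the query output above
yearly = {2016: 2371, 2017: 5295, 2018: 7143, 2019: 47490, 2020: 96307}

years = sorted(yearly)
# Ratio of each year's count to the previous year's count
growth = {
    year: yearly[year] / yearly[prev]
    for prev, year in zip(years, years[1:])
}

for year, ratio in growth.items():
    print(f"{year}: {ratio:.2f}x the previous year")
```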


**In which months do more traffic accidents happen? Does the pattern vary from year to year?**

- In 2016, the most accidents occurred in December.
- In 2017, the most accidents occurred in January.
- In 2018, the most accidents occurred in November.
- In 2019, the most accidents occurred in October.
- In 2020, the most accidents occurred in December.

In four of the five years, the most accidents occurred in the last quarter of the year.

In [None]:
%%bigquery
SELECT Year, 
   SUM(CASE WHEN Month = 1 THEN 1 ELSE 0 END) AS Jan,
   SUM(CASE WHEN Month = 2 THEN 1 ELSE 0 END) AS Feb,
   SUM(CASE WHEN Month = 3 THEN 1 ELSE 0 END) AS Mar,
   SUM(CASE WHEN Month = 4 THEN 1 ELSE 0 END) AS Apr,
   SUM(CASE WHEN Month = 5 THEN 1 ELSE 0 END) AS May,
   SUM(CASE WHEN Month = 6 THEN 1 ELSE 0 END) AS Jun,
   SUM(CASE WHEN Month = 7 THEN 1 ELSE 0 END) AS Jul,
   SUM(CASE WHEN Month = 8 THEN 1 ELSE 0 END) AS Aug,
   SUM(CASE WHEN Month = 9 THEN 1 ELSE 0 END) AS Sep,
   SUM(CASE WHEN Month = 10 THEN 1 ELSE 0 END) AS Oct,
   SUM(CASE WHEN Month = 11 THEN 1 ELSE 0 END) AS Nov,
   SUM(CASE WHEN Month = 12 THEN 1 ELSE 0 END) AS Dec
FROM `ba775-project-team1.dataset_demo.sample_table_wpopulation`
GROUP BY Year
ORDER BY Year;



Unnamed: 0,Year,Jan,Feb,Mar,Apr,May,Jun,Jul,Aug,Sep,Oct,Nov,Dec
0,2016,0,23,36,120,65,112,199,302,309,319,376,510
1,2017,702,449,532,383,324,319,216,545,350,468,407,600
2,2018,794,626,736,524,478,379,270,373,590,639,900,834
3,2019,869,1289,549,4337,4223,3748,3833,4481,5737,6921,4851,6652
4,2020,6431,5835,6404,7030,6983,7816,1677,1851,6402,11075,15898,18905
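
The peak month for each year can be read off the pivot table programmatically. The sketch below copies the monthly counts from the output above, finds the month with the highest count per year, and computes the share of each year's accidents that fall in October through December.

```python
MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

# Monthly sampled counts per year, copied from the query output above
monthly = {
    2016: [0, 23, 36, 120, 65, 112, 199, 302, 309, 319, 376, 510],
    2017: [702, 449, 532, 383, 324, 319, 216, 545, 350, 468, 407, 600],
    2018: [794, 626, 736, 524, 478, 379, 270, 373, 590, 639, 900, 834],
    2019: [869, 1289, 549, 4337, 4223, 3748, 3833, 4481, 5737, 6921, 4851, 6652],
    2020: [6431, 5835, 6404, 7030, 6983, 7816, 1677, 1851, 6402, 11075, 15898, 18905],
}

# Month with the highest count in each year
peak = {year: MONTHS[counts.index(max(counts))] for year, counts in monthly.items()}

# Share of each year's accidents in the fourth quarter (Oct, Nov, Dec)
q4_share = {year: sum(counts[9:]) / sum(counts) for year, counts in monthly.items()}

for year in monthly:
    print(year, peak[year], f"Q4 share: {q4_share[year]:.1%}")
```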


**In which hours of the day do more traffic accidents happen? Is the number of accidents related to the morning and evening traffic peaks?**

More accidents happen from 7:00 a.m. to 8:00 a.m. and from 4:00 p.m. to 5:00 p.m. than at other times of day. These are periods of heavy traffic, when people usually commute to and from work and school. The government and police can strengthen traffic regulation and road management during these specific periods.

In [None]:
%%bigquery
SELECT Hour, Severity, COUNT(ID) AS Hour_accidents_n
FROM `ba775-project-team1.dataset_demo.sample_table_wpopulation`
GROUP BY Hour, Severity
ORDER BY Hour;

<img src="https://github.com/yinghaow525/BA775-teamproject/blob/charts/Hour%20%26%20Counts.png?raw=True" align="left" width="400"/>

<img src="https://github.com/yinghaow525/BA775-teamproject/blob/charts/Accidents%20with%20Level%20of%20Severity%20by%20%20Hour.png?raw=True" align="left" width="500"/>

The severity of accidents was greatest at 5 p.m. More generally, severity is higher from 7:00 a.m. to 8:00 a.m. and from 4:00 p.m. to 5:00 p.m.

**Is the number of traffic accidents related to the day of the week: do more accidents happen on weekdays than on weekends? And how is severity distributed?**

The number of accidents on weekends is far lower than on weekdays. Most of the traffic accidents with severity level 2 or 3 also occurred on weekdays.

In [None]:
%%bigquery
SELECT Weekday, Severity, COUNT(ID) AS Weekday_accidents_n
FROM `ba775-project-team1.dataset_demo.sample_table_wpopulation`
GROUP BY Weekday, Severity
ORDER BY Weekday_accidents_n;

<img src="https://github.com/yinghaow525/BA775-teamproject/blob/charts/Accidents%20with%20Level%20of%20Severity%20by%20Day%20of%20Week.png?raw=True|" align="left" width="500"/>

### Use BigQuery Machine Learning To Predict Level of Severity for Accidents

To help the government and drivers make better use of our analysis, we decided to predict the level of severity of traffic accidents from given weather conditions and time factors.

**What are the factors that influence the level of severity of traffic accidents?**

Now we try to predict the severity level of traffic accidents given some representative features. We use the sampled dataset, with about 1.56 million records, as both training and evaluation data.

We want to use BigQuery machine learning to predict the severity level of a traffic accident under given weather conditions, at a certain place, and at a certain time of year. Because severity level is a multi-class variable, we decided to use a logistic regression model with the multi-class option. We consider the factors listed below:

- Month: 1-12
- Hour: 0-23
- Visibility: 0-10 miles
- Precipitation: 0-24 inches
- Wind Speed: 0-150 mph
- Temperature: 0-174 °F
- Railway: 0 (False), 1 (True)
- Station: 0 (False), 1 (True)
- Traffic_Signal: 0 (False), 1 (True)

We select **level of severity** as the dependent variable and the other factors as the independent variables. We exclude ID because it is unique to each traffic accident record and would distort the overall accuracy of the prediction model.
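
For intuition about the multi-class option: in the usual multinomial formulation, a logistic regression computes one linear score per class and converts the scores into probabilities with the softmax function. The sketch below uses made-up scores for the four severity levels; they are illustrative assumptions, not weights or outputs from the BigQuery model.

```python
import math

def softmax(scores):
    """Convert raw class scores into probabilities that sum to 1."""
    shift = max(scores)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical linear scores for severity levels 1 to 4
scores = {1: -1.0, 2: 2.0, 3: 0.5, 4: -0.5}

probs = dict(zip(scores, softmax(list(scores.values()))))
predicted = max(probs, key=probs.get)  # the class with the highest probability

for level, p in probs.items():
    print(f"severity {level}: {p:.3f}")
print("predicted severity:", predicted)
```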

In [None]:
%%bigquery
CREATE OR REPLACE MODEL `ba775-project-team1.dataset_demo.model`
OPTIONS(model_type='logistic_reg', labels = ['Severity'])
AS
SELECT * EXCEPT(ID) from `ba775-project-team1.dataset_demo.data_model`



**Evaluation of the logistic regression model**:

In [None]:
%%bigquery
SELECT *
FROM ML.EVALUATE
(
    MODEL dataset_demo.model,
    (SELECT * EXCEPT(ID) FROM `ba775-project-team1.dataset_demo.data_model`)
)



Unnamed: 0,precision,recall,accuracy,f1_score,log_loss,roc_auc
0,0.279207,0.250272,0.784562,0.220693,2.090181,0.691139


We now have the BigQuery ML model and its evaluation results. The precision is about 0.28 and the roc_auc is about 0.69. The higher the roc_auc score, the better the model is at distinguishing between classes.

**Using logistic regression model to make predictions on level of severity** (with existing data):

In [None]:
%%bigquery
CREATE OR REPLACE TABLE dataset_demo.severity_predictions
AS
SELECT 
    predicted_Severity, 
    predicted_Severity_probs[OFFSET(0)].prob,
    Severity,
    Visibility,
    Temperature,
    Precipitation,
    Wind_speed,
    Hour, Month, Railway, Traffic_Signal, Station
FROM ML.PREDICT
(
    MODEL `dataset_demo.model`,
    (SELECT *  FROM `ba775-project-team1.dataset_demo.data_model`)
)
ORDER BY prob DESC



The table created by the query above holds the severity predicted from the factors we used as inputs to the logistic regression, and it includes the probability that the predicted severity is true given the selected parameters. The results of the logistic regression are saved in severity_predictions.

In [None]:
%%bigquery
SELECT * FROM dataset_demo.severity_predictions
LIMIT 10



Unnamed: 0,predicted_Severity,prob,Severity,Visibility,Temperature,Precipitation,Wind_speed,Hour,Month,Railway,Traffic_Signal,Station
0,3,1.0,2,7.0,84.0,0.0,984.0,14,3,0,0,0
1,3,0.999071,2,10.0,50.0,24.0,5.0,18,12,0,0,0
2,3,0.999071,2,10.0,50.0,24.0,5.0,18,12,0,0,0
3,3,0.999071,2,10.0,50.0,24.0,5.0,18,12,0,0,0
4,3,0.992937,4,10.0,84.0,0.0,232.0,15,5,0,0,0
5,3,0.984456,2,10.0,75.0,0.0,211.0,15,9,0,0,0
6,3,0.980789,2,10.0,37.0,0.0,230.0,7,12,0,0,0
7,3,0.972629,3,5.0,37.4,10.0,26.5,19,1,0,0,0
8,3,0.969474,3,8.0,39.2,9.99,26.5,17,1,0,0,0
9,3,0.966376,3,7.0,79.0,0.0,157.0,12,4,0,0,0


**Overall accuracy of the predictions on each level of severity**:

We then measured the accuracy of the predicted results. The logistic regression model predicted the severity level correctly for about 78% of records. The dataset is concentrated in severity levels 2 and 3, with very few data points at levels 1 and 4. This makes the model difficult to train and may be the reason it predicted 0 entries at level 1 and level 4.

In [None]:
%%bigquery
SELECT 
    COUNT(*) total_accidents,
    COUNTIF(Severity=1) actual_level1, 
    COUNTIF(Severity=2) actual_level2, 
    COUNTIF(Severity=3) actual_level3, 
    COUNTIF(Severity=4) actual_level4, 
    COUNTIF(Severity=1)/COUNT(*)*100 level1_rate_percent,
    COUNTIF(Severity=2)/COUNT(*)*100 level2_rate_percent,
    COUNTIF(Severity=3)/COUNT(*)*100 level3_rate_percent,
    COUNTIF(Severity=4)/COUNT(*)*100 level4_rate_percent,
    COUNTIF(predicted_Severity=1) predicted_level1,
    COUNTIF(predicted_Severity=2) predicted_level2,
    COUNTIF(predicted_Severity=3) predicted_level3,
    COUNTIF(predicted_Severity=4) predicted_level4,
    COUNTIF(Severity=1 AND predicted_Severity=1) true_predicted_level1,
    COUNTIF(Severity=2 AND predicted_Severity=2) true_predicted_level2,
    COUNTIF(Severity=3 AND predicted_Severity=3) true_predicted_level3,
    COUNTIF(Severity=4 AND predicted_Severity=4) true_predicted_level4,
    (COUNTIF(Severity=1 AND predicted_Severity=1) + COUNTIF(Severity=2 AND predicted_Severity=2) + COUNTIF(Severity=3 AND predicted_Severity=3) + COUNTIF(Severity=4 AND predicted_Severity=4))
    /(COUNTIF(predicted_Severity=1)+COUNTIF(predicted_Severity=2)+COUNTIF(predicted_Severity=3)+ COUNTIF(predicted_Severity=4))*100 all_levels_rate_percent_predicted
FROM `ba775-project-team1.dataset_demo.severity_predictions`



Unnamed: 0,total_accidents,actual_level1,actual_level2,actual_level3,actual_level4,level1_rate_percent,level2_rate_percent,level3_rate_percent,level4_rate_percent,predicted_level1,predicted_level2,predicted_level3,predicted_level4,true_predicted_level1,true_predicted_level2,true_predicted_level3,true_predicted_level4,all_levels_rate_percent_predicted
0,1561074,25805,1225135,256827,53307,1.653029,78.480264,16.451943,3.414764,0,1559709,1365,0,0,1224306,453,0,78.456178
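
The counts in this output are enough to recompute the headline metrics by hand. The sketch below derives overall accuracy, the accuracy of a baseline that always predicts the most common level, and macro-averaged precision and recall (treating a class with no predictions as precision 0, which is an assumption). The macro averages come out close to the precision and recall reported by `ML.EVALUATE` earlier, which suggests, but does not confirm, that those figures are macro averages.

```python
# Counts copied from the query output above, keyed by severity level
actual = {1: 25805, 2: 1225135, 3: 256827, 4: 53307}
predicted = {1: 0, 2: 1559709, 3: 1365, 4: 0}
correct = {1: 0, 2: 1224306, 3: 453, 4: 0}

total = sum(actual.values())
accuracy = sum(correct.values()) / total
# Baseline: always predict the most common severity level
baseline = max(actual.values()) / total

def safe_div(numerator, denominator):
    """Return 0.0 when the denominator is zero (assumed convention)."""
    return numerator / denominator if denominator else 0.0

macro_recall = sum(safe_div(correct[k], actual[k]) for k in actual) / len(actual)
macro_precision = sum(safe_div(correct[k], predicted[k]) for k in actual) / len(actual)

print(f"accuracy:        {accuracy:.6f}")
print(f"baseline:        {baseline:.6f}")
print(f"macro recall:    {macro_recall:.6f}")
print(f"macro precision: {macro_precision:.6f}")
```

The baseline is marginally higher than the model's accuracy, so the 78% figure mostly reflects how common level 2 is rather than how well the model separates the levels.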


**Predicting level of severity for a future scenario**:

We applied our model to some manually defined factors to predict a level of severity. The logistic regression model, trained on the data above, returned the predicted result. We used the following parameters:
1) Month : March <br>
2) Hour : 12 <br>
3) Visibility : 7 <br> 
4) Precipitation : 0 in <br>
5) Wind_speed : 10 mph <br>
6) Temperature : 17 C <br>
7) Railway : False <br>
8) Station : False <br>
9) Traffic_Signal : True <br>

In [None]:
%%bigquery
SELECT 
    predicted_Severity, 
    predicted_Severity_probs[OFFSET(0)].prob,
 #   Severity,
    Visibility,
    Temperature,
    Precipitation,
    Wind_speed,
    Hour, Month, Traffic_Signal
FROM ML.PREDICT
(
    MODEL `dataset_demo.model`,
    (SELECT 3 Month, 12 Hour, 7 Visibility,  0 Precipitation, 10 Wind_speed, 17 Temperature, 0 Railway,0 Station, 1 Traffic_Signal)
)



Unnamed: 0,predicted_Severity,prob,Visibility,Temperature,Precipitation,Wind_speed,Hour,Month,Traffic_Signal
0,2,0.835499,7,17,0,10,12,3,1


Based on our SQL logistic regression model, we predict that a traffic accident that happens in March at 12 p.m. near a traffic signal, under the specified weather conditions, has severity level 2, with a probability of 83.55%.

### Use Python Machine Learning To Predict Level of Severity for Accidents

**CONCEPTS:**

- **Why do we use Python machine learning here?**<br>
  
  We previously used the SQL logistic regression model to predict the level of severity, but we found that even though the overall model accuracy is about 78%, there is NO CORRECT prediction for levels 1 and 4.

- **By working on machine learning, we can also answer the questions below:**<br>
Which features are most closely related to traffic accidents? <br>
How can we predict severity from a combination of the important features?

- **Why is the output model important?**<br>
It can give sound driving advice for situations that may lead to serious accidents.<br>
It can help the Traffic Regulatory Bureau effectively reduce the traffic accident rate.

STEP 1: **Processing Dataset with SQL**

- **4 factors that correspond to traffic accidents:** <br>
  Location/Time/Weather/Places<br>
- **12 important features:** <br>
  -- Location: State<br>
  -- Time: Month/Hour/Weekday<br>
  -- Weather: Weather/Visibility/Precipitation/Wind speed/Temperature<br>
  -- Places: Crossing/Junction/Traffic_Signal<br>
- **For each severity level, we choose 10,000 rows of data, so the overall feature dataset has 40,000 rows.**
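
The balanced selection can be sketched in Python as drawing the same number of rows from every severity level. The helper name, the toy data, and the seed below are all hypothetical. Unlike a SQL `LIMIT` without `ORDER BY`, which returns an arbitrary subset rather than a guaranteed random one, this version draws uniformly at random within each level.

```python
import random
from collections import Counter, defaultdict

def balanced_sample(rows, label_key, per_class, seed=0):
    """Draw up to `per_class` rows uniformly at random from each label value."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[label_key]].append(row)
    sample = []
    for label in sorted(by_label):
        group = by_label[label]
        sample.extend(rng.sample(group, min(per_class, len(group))))
    return sample

# Hypothetical imbalanced toy data: most rows are severity 2
rows = [
    {"Severity": level}
    for level, n in [(1, 50), (2, 2000), (3, 400), (4, 80)]
    for _ in range(n)
]

balanced = balanced_sample(rows, "Severity", per_class=50)
class_counts = Counter(r["Severity"] for r in balanced)
print(class_counts)
```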

<img src='https://github.com/yinghaow525/BA775-teamproject/blob/b0f2ec7f731968afd1da35ada7b9a4f1458ba527/Factors&Features.png?raw=Ture' align='left' width='1000' />

In [None]:
%%bigquery
(SELECT ID, Severity, State, extract(month from Start_Time) Month, extract(hour from Start_Time) Hour, FORMAT_DATE('%A', EXTRACT(date FROM Start_Time)) Weekday,
Weather_Condition, Visibility_mi_ as Visibility, Precipitation_in_ as Preciputation, Wind_Speed_mph_ as Windspead, Temperature_F_ as Temperature, 
case when Crossing = True then 1 else 0 end Crossing, 
case when Junction = True then 1 else 0 end Junction, 
case when Traffic_Signal = True then 1 else 0 end Traffic_Signal
FROM `ba775-project-team1.dataset_demo.us_traffic_accidents`
WHERE Visibility_mi_ >=0 and Precipitation_in_>=0 and Temperature_F_>=0 and Wind_Speed_mph_>=0
      and Severity = 1
      limit 10000)
union all
(SELECT ID, Severity, State, extract(month from Start_Time) Month, extract(hour from Start_Time) Hour, FORMAT_DATE('%A', EXTRACT(date FROM Start_Time)) Weekday,
Weather_Condition, Visibility_mi_ as Visibility, Precipitation_in_ as Preciputation, Wind_Speed_mph_ as Windspead, Temperature_F_ as Temperature, 
case when Crossing = True then 1 else 0 end Crossing, 
case when Junction = True then 1 else 0 end Junction, 
case when Traffic_Signal = True then 1 else 0 end Traffic_Signal
FROM `ba775-project-team1.dataset_demo.us_traffic_accidents`
WHERE Visibility_mi_ >=0 and Precipitation_in_>=0 and Temperature_F_>=0 and Wind_Speed_mph_>=0
      and Severity = 2
      limit 10000)
union all
(SELECT ID, Severity, State, extract(month from Start_Time) Month, extract(hour from Start_Time) Hour, FORMAT_DATE('%A', EXTRACT(date FROM Start_Time)) Weekday,
Weather_Condition, Visibility_mi_ as Visibility, Precipitation_in_ as Preciputation, Wind_Speed_mph_ as Windspead, Temperature_F_ as Temperature, 
case when Crossing = True then 1 else 0 end Crossing, 
case when Junction = True then 1 else 0 end Junction, 
case when Traffic_Signal = True then 1 else 0 end Traffic_Signal
FROM `ba775-project-team1.dataset_demo.us_traffic_accidents`
WHERE Visibility_mi_ >=0 and Precipitation_in_>=0 and Temperature_F_>=0 and Wind_Speed_mph_>=0
      and Severity = 3
      limit 10000)
union all
(SELECT ID, Severity, State, extract(month from Start_Time) Month, extract(hour from Start_Time) Hour, FORMAT_DATE('%A', EXTRACT(date FROM Start_Time)) Weekday,
Weather_Condition, Visibility_mi_ as Visibility, Precipitation_in_ as Preciputation, Wind_Speed_mph_ as Windspead, Temperature_F_ as Temperature, 
case when Crossing = True then 1 else 0 end Crossing, 
case when Junction = True then 1 else 0 end Junction, 
case when Traffic_Signal = True then 1 else 0 end Traffic_Signal
FROM `ba775-project-team1.dataset_demo.us_traffic_accidents`
WHERE Visibility_mi_ >=0 and Precipitation_in_>=0 and Temperature_F_>=0 and Wind_Speed_mph_>=0
      and Severity = 4
      limit 10000)

Query complete after 0.00s: 100%|██████████| 6/6 [00:00<00:00, 2708.04query/s]                        
Downloading: 100%|██████████| 40000/40000 [00:01<00:00, 35451.27rows/s]


Unnamed: 0,ID,Severity,State,Month,Hour,Weekday,Weather_Condition,Visibility,Preciputation,Windspead,Temperature,Crossing,Junction,Traffic_Signal
0,A-1251601,2,NY,10,15,Monday,Light Rain,1.2,0.00,10.4,72.0,0,1,0
1,A-597940,2,NY,7,8,Monday,Light Rain,10.0,0.00,3.5,78.1,0,1,0
2,A-2587815,2,DC,11,15,Thursday,Cloudy,10.0,0.01,13.0,52.0,0,0,0
3,A-589139,2,DC,12,13,Tuesday,Mostly Cloudy / Windy,10.0,0.00,24.0,48.0,0,0,0
4,A-31739,2,DC,12,12,Tuesday,Mostly Cloudy,10.0,0.00,18.0,45.0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
39995,A-1429346,1,FL,4,16,Tuesday,Fair,10.0,0.00,12.0,82.0,1,0,1
39996,A-397197,1,FL,5,18,Friday,Fair,10.0,0.00,10.0,85.0,1,0,1
39997,A-2444855,1,FL,4,19,Wednesday,Fair,10.0,0.00,15.0,86.0,0,0,0
39998,A-363667,1,FL,3,17,Wednesday,Mostly Cloudy,10.0,0.00,16.0,88.0,1,0,1


STEP 2: **Processing Dataset with Python**

- **Use Python to convert text-type features into ID-type features.**
- **Use Python to convert ID-type features into ONE-HOT encoded features. (Why?)**<br>
  -- Integer IDs carry a natural ordering, and a machine learning algorithm may interpret that ordering as meaningful even when the categories themselves have no order.<br>
  -- One-hot encoding prevents the model from assuming such an ordering between categories, which could otherwise lead to poor performance or unexpected results.<br>
- **Split all data into an 80% training dataset and a 20% test dataset.**
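
The two encoding steps can be illustrated on a tiny example. This is a simplified sketch, not the notebook's own implementation: `build_vocab` and `one_hot` are hypothetical helper names, and the IDs are assigned in sorted order so the result is reproducible.

```python
def build_vocab(values):
    """Map each distinct category to a stable integer ID (sorted for reproducibility)."""
    return {cat: idx for idx, cat in enumerate(sorted(set(values)))}

def one_hot(values, vocab):
    """Turn each category into a 0/1 vector with a single 1 at its ID."""
    vectors = []
    for v in values:
        vec = [0] * len(vocab)
        vec[vocab[v]] = 1
        vectors.append(vec)
    return vectors

weekdays = ["Monday", "Friday", "Monday", "Sunday"]
vocab = build_vocab(weekdays)       # text -> ID
encoded = one_hot(weekdays, vocab)  # ID -> one-hot vector
print(vocab)
print(encoded)
```

With plain IDs, "Sunday" (2) would look twice as large as "Monday" (1). In the one-hot form every weekday is an equally distant 0/1 vector, so no ordering is implied.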

<img src='https://github.com/yinghaow525/BA775-teamproject/blob/b0f2ec7f731968afd1da35ada7b9a4f1458ba527/Process%20Data%20by%20Python.png?raw=Ture' align='left' width='1000' />

In [None]:
import pandas as pd
import random
import numpy as np

# col_types = {}
features = ["State", "Month", "Hour", "Weekday", "Weather", "Visibility", "Preciputation", "Windspead", "Temperature",
            "Junction", "Crossing", "Traffic_Signal"]

def open_file(filename, mode='r'):
    return open(filename, mode, encoding='utf-8', errors='ignore')

def write_file(filename, content):
    # close the file explicitly so the content is flushed to disk
    with open_file(filename, mode="w") as f_:
        f_.write(content)

# split features data into 80% train dataset and 20% test dataset   
def random_sample(filename):
    with open_file(filename) as f_:
        lines = f_.readlines()
    random.shuffle(lines)
    len_test = int(len(lines) * 0.2)
    lines_test = lines[0:len_test]
    lines_train = lines[len_test:]
    # use context managers so both files are flushed and closed before training reads them
    with open_file("Untitled Folder/data/ft.train.txt", mode="w") as train_w, \
         open_file("Untitled Folder/data/ft.test.txt", mode="w") as test_w:
        train_w.writelines(lines_train)
        test_w.writelines(lines_test)

# turn the features into id
def feature_to_id(cate_list):
    cates = list(set(cate_list))
    cate_to_id = dict(zip(cates, range(len(cates))))
    return cates, cate_to_id

def process_all_data():
    ft = pd.read_csv('Untitled Folder/data/all_data.txt', sep='\t')
    ft = ft.drop(["ID"], axis=1)

    for col_ in features:
        col_values = ft[col_].values
        col_values = ["NAN" if pd.isnull(c) else c for c in col_values]
        c_, word_to_id = feature_to_id(col_values)

        # a = [word_to_id[cv] for cv in col_values]
        a = to_categorical([word_to_id[cv] for cv in col_values], len(word_to_id))
        ft = pd.concat([ft, pd.DataFrame(a)], axis=1)
        ft = ft.drop([col_], axis=1)

    severity = ft.pop('Severity')
    ft.insert(loc=ft.shape[1], column='severity', value=severity, allow_duplicates=False)

    print(ft.head())
    print(ft.shape[0])
    print(ft.shape[1])
    ft.to_csv('Untitled Folder/data/ft.all.txt', sep='\t', header=False, index=False)

# turn the feature id into one-hot encode
def to_categorical(y, num_classes=None):
    y = np.array(y, dtype='int')
    input_shape = y.shape
    if input_shape and input_shape[-1] == 1 and len(input_shape) > 1:
        input_shape = tuple(input_shape[:-1])
    y = y.ravel()
    if not num_classes:
        num_classes = np.max(y) + 1
    n = y.shape[0]
    categorical = np.zeros((n, num_classes))
    categorical[np.arange(n), y] = 1
    output_shape = input_shape + (num_classes,)
    categorical = np.reshape(categorical, output_shape)
    return categorical.tolist()


process_all_data()
random_sample("Untitled Folder/data/ft.all.txt")



     0    1    2    3    4    5    6    7    8    9  ...  272  273  274    0  \
0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  ...  0.0  0.0  0.0  0.0   
1  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  ...  0.0  0.0  0.0  0.0   
2  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  ...  0.0  0.0  0.0  1.0   
3  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  ...  0.0  0.0  0.0  1.0   
4  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  ...  0.0  0.0  0.0  1.0   

     1    0    1    0    1  severity  
0  1.0  1.0  0.0  1.0  0.0         2  
1  1.0  1.0  0.0  1.0  0.0         2  
2  0.0  1.0  0.0  1.0  0.0         2  
3  0.0  1.0  0.0  1.0  0.0         2  
4  0.0  1.0  0.0  1.0  0.0         2  

[5 rows x 666 columns]
40000
666


STEP 3 & 4: **Training & Testing Models**

<img src='https://github.com/yinghaow525/BA775-teamproject/blob/b0f2ec7f731968afd1da35ada7b9a4f1458ba527/Train%20Models.png?raw=Ture' align='left' width='800' />

In [None]:
import pickle
from sklearn import svm, neural_network, linear_model, naive_bayes, neighbors, tree, ensemble, metrics


def process_file(filename):
    conts, labs = [], []
    with open_file(filename) as f_:
        for line in f_:
            cs = line.strip().split("\t")
            conts.append(cs[:-1])
            labs.append(cs[-1])
    print(np.array(conts).shape)
    return np.array(conts).astype("float"), np.array(labs).astype("int").tolist()


def train(train_dir):
    train_feature, train_target = process_file(train_dir)
    print(np.array(train_feature).shape)
    print(np.array(train_target).shape)

    # train
    print("training...")
    model.fit(train_feature, train_target)


def test():
    test_feature, test_target = process_file("Untitled Folder/data/ft.test.txt")
    test_predict = model.predict(test_feature)  # return predict classification


    # accuracy
    true_false = (test_predict == test_target)
    accuracy = np.count_nonzero(true_false) / float(len(test_target))
    print()
    print("accuracy is %f" % accuracy)

    # precision    recall  f1-score
    print()
    print(metrics.classification_report(test_target, test_predict))

    # Confusion Matrix
    print("Confusion Matrix...")
    print(metrics.confusion_matrix(test_target, test_predict))
    


- Model 1: **Random Forest**

In [None]:
# random forest
model = ensemble.RandomForestClassifier()
train("Untitled Folder/data/ft.train.txt")
# print(model.feature_importances_)  # random forest only; easiest to interpret without one-hot encoding
test()

(32000, 665)
(32000, 665)
(32000,)
training...
(8000, 665)

accuracy is 0.732375

              precision    recall  f1-score   support

           1       0.85      0.88      0.86      1976
           2       0.67      0.79      0.72      2030
           3       0.61      0.61      0.61      1984
           4       0.82      0.66      0.73      2010

    accuracy                           0.73      8000
   macro avg       0.74      0.73      0.73      8000
weighted avg       0.74      0.73      0.73      8000

Confusion Matrix...
[[1729   21   74  152]
 [   5 1595  406   24]
 [  22  649 1201  112]
 [ 279  125  272 1334]]
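
As a cross-check on how to read this output, the sketch below recomputes accuracy, precision, and recall directly from the confusion matrix printed above. It assumes the scikit-learn convention that rows are the true severity levels and columns are the predicted levels, which is consistent with the row sums matching the `support` column.

```python
# Confusion matrix copied from the Random Forest output above
# (rows = true severity 1..4, columns = predicted severity 1..4).
cm = [
    [1729, 21, 74, 152],
    [5, 1595, 406, 24],
    [22, 649, 1201, 112],
    [279, 125, 272, 1334],
]
labels = [1, 2, 3, 4]

total = sum(sum(row) for row in cm)
correct = sum(cm[i][i] for i in range(len(cm)))
accuracy = correct / total

recall = {}
precision = {}
for i, label in enumerate(labels):
    row_sum = sum(cm[i])                 # accidents whose true level is `label`
    col_sum = sum(row[i] for row in cm)  # accidents predicted as `label`
    recall[label] = cm[i][i] / row_sum
    precision[label] = cm[i][i] / col_sum

print(f"accuracy = {accuracy:.6f}")
for label in labels:
    print(f"level {label}: precision = {precision[label]:.2f}, recall = {recall[label]:.2f}")
```

The off-diagonal cells also show where the model struggles: the largest confusions are between levels 2 and 3 (649 level-3 accidents predicted as level 2, and 406 level-2 accidents predicted as level 3).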


- Model 2: **Logistic Regression**

In [None]:
# logistic regression
model = linear_model.LogisticRegression(multi_class="multinomial", solver="lbfgs")

train("Untitled Folder/data/ft.train.txt")
test()

(32000, 665)
(32000, 665)
(32000,)
training...


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


(8000, 665)

accuracy is 0.695375

              precision    recall  f1-score   support

           1       0.80      0.85      0.82      1976
           2       0.66      0.77      0.71      2030
           3       0.56      0.61      0.58      1984
           4       0.82      0.55      0.66      2010

    accuracy                           0.70      8000
   macro avg       0.71      0.70      0.69      8000
weighted avg       0.71      0.70      0.69      8000

Confusion Matrix...
[[1686   22  115  153]
 [   5 1562  456    7]
 [  47  653 1207   77]
 [ 376  144  382 1108]]


- Model 3: **Support Vector Machine (SVM)**

In [None]:
# SVM
model = svm.LinearSVC()

train("Untitled Folder/data/ft.train.txt")
test()

(32000, 665)
(32000, 665)
(32000,)
training...
(8000, 665)

accuracy is 0.699500

              precision    recall  f1-score   support

           1       0.81      0.85      0.83      1976
           2       0.65      0.78      0.71      2030
           3       0.56      0.60      0.58      1984
           4       0.83      0.57      0.67      2010

    accuracy                           0.70      8000
   macro avg       0.71      0.70      0.70      8000
weighted avg       0.71      0.70      0.70      8000

Confusion Matrix...
[[1675   27  125  149]
 [   3 1586  434    7]
 [  35  681 1197   71]
 [ 353  156  363 1138]]


- Model 4: **Neural Network**

In [None]:
# neural network
model = neural_network.MLPClassifier(hidden_layer_sizes=(2048, 512), verbose=True, early_stopping=True)

train("Untitled Folder/data/ft.train.txt")
test()

(32000, 665)
(32000, 665)
(32000,)
training...
Iteration 1, loss = 0.74865711
Validation score: 0.697187
Iteration 2, loss = 0.60890580
Validation score: 0.698750
Iteration 3, loss = 0.52429944
Validation score: 0.702187
Iteration 4, loss = 0.42935016
Validation score: 0.699063
Iteration 5, loss = 0.32662756
Validation score: 0.694688
Iteration 6, loss = 0.24469872
Validation score: 0.688438
Iteration 7, loss = 0.17598556
Validation score: 0.696250
Iteration 8, loss = 0.13310566
Validation score: 0.670000
Iteration 9, loss = 0.09692953
Validation score: 0.699063
Iteration 10, loss = 0.07682298
Validation score: 0.682187
Iteration 11, loss = 0.06248188
Validation score: 0.688438
Iteration 12, loss = 0.04879386
Validation score: 0.692187
Iteration 13, loss = 0.04217029
Validation score: 0.695000
Iteration 14, loss = 0.03824850
Validation score: 0.686875
Validation score did not improve more than tol=0.000100 for 10 consecutive epochs. Stopping.
(8000, 665)

accuracy is 0.704875
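
The log above shows the training loss falling steadily while the validation score stops improving after the third iteration, which is why training stopped early. The sketch below replays the logged validation scores through a simple patience rule. The rule itself is an assumption chosen to reproduce this log (an epoch counts as "no improvement" when its score is below the best score so far plus `tol`, and training stops once more than `patience` such epochs occur in a row). The exact rule used inside scikit-learn may differ in detail.

```python
def early_stop_epoch(scores, patience=10, tol=1e-4):
    """Return the 1-based epoch at which training would stop, or None.

    Assumed rule: an epoch is stale when its validation score is below
    best_so_far + tol; stop once more than `patience` stale epochs occur in a row.
    """
    best = float("-inf")
    stale = 0
    for epoch, score in enumerate(scores, start=1):
        if score < best + tol:
            stale += 1
        else:
            stale = 0
        best = max(best, score)
        if stale > patience:
            return epoch
    return None

# Validation scores copied from the training log above.
val_scores = [0.697187, 0.698750, 0.702187, 0.699063, 0.694688, 0.688438,
              0.696250, 0.670000, 0.699063, 0.682187, 0.688438, 0.692187,
              0.695000, 0.686875]

best_epoch = max(range(len(val_scores)), key=val_scores.__getitem__) + 1
print("best validation score at epoch", best_epoch)
print("stops at epoch", early_stop_epoch(val_scores))
```

A loss that keeps shrinking while the validation score stays flat is a typical sign of overfitting, which motivates the parameter adjustment in the next step.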

        

STEP 5: **Optimize Models (Adjust Parameters)**

- **Adjust Parameters (Neural Network)**<br>
  -- Initial setting: hidden_layer_sizes=(4096, 1024)<br>
  -- Adjusted setting: hidden_layer_sizes=(2048, 512)<br>
  -- Why reduce hidden_layer_sizes?<br>
      The training accuracy was higher than the testing accuracy, so the model was probably overfitting.
      A smaller hidden_layer_sizes reduces model capacity and therefore the risk of overfitting.
      The outcome is better (you can see the chart below).

<img src='https://github.com/yinghaow525/BA775-teamproject/blob/b0f2ec7f731968afd1da35ada7b9a4f1458ba527/Adjust%20Parameter(Neural%20Network).png?raw=Ture' align='left' width='800' />

STEP 5: **Optimize Models (ONE-HOT Encoding)**

- **Turning ID features into ONE-HOT features**<br>
  -- Integer IDs carry a natural ordering, and a machine learning algorithm may interpret that ordering as meaningful even when the categories themselves have no order.<br>
  -- One-hot encoding prevents the model from assuming such an ordering between categories, which could otherwise lead to poor performance or unexpected results.<br>

STEP 6: **Evaluating Model**

**Precision**

<img src='https://github.com/yinghaow525/BA775-teamproject/blob/b0f2ec7f731968afd1da35ada7b9a4f1458ba527/Precision.png?raw=Ture' align='left' width='800' />

**Recall**

<img src='https://github.com/yinghaow525/BA775-teamproject/blob/b0f2ec7f731968afd1da35ada7b9a4f1458ba527/Recall.png?raw=Ture' align='left' width='800' />

**F1-score**

<img src='https://github.com/yinghaow525/BA775-teamproject/blob/b0f2ec7f731968afd1da35ada7b9a4f1458ba527/F1-score.png?raw=Ture' align='left' width='800' />

**Accuracy**

<img src='https://github.com/yinghaow525/BA775-teamproject/blob/b0f2ec7f731968afd1da35ada7b9a4f1458ba527/Accuracy.png?raw=Ture' align='left' width='600' />

**Accident Classification Model Conclusion:**

- Random Forest has the highest accuracy in predicting the level of severity among the four models. On the test set, it assigns the correct level of severity to about 73% of accidents.
- The top three features related to US traffic accidents are State, Temperature, and Month (as shown below).

<img src='https://github.com/SimengLi1998/BA775-teamproject/blob/fb3f8375b568dce09d0871572ef115d197c8cd44/%E6%88%AA%E5%B1%8F2021-08-30%20%E4%B8%8B%E5%8D%8810.52.30.png?raw=Ture' align='left' width='1000' />
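
When a model is trained on one-hot features, importance scores are reported per one-hot column rather than per original feature. One possible way to get a feature-level ranking, sketched here with made-up numbers, is to sum the column scores for each source feature. The helper `aggregate_importance`, the column-to-feature mapping, and the importance values are all hypothetical and are not results from this project.

```python
from collections import defaultdict

def aggregate_importance(column_sources, importances):
    """Sum per-column importances back onto the feature each column came from."""
    totals = defaultdict(float)
    for source, value in zip(column_sources, importances):
        totals[source] += value
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical example: the original feature behind each one-hot column,
# and a made-up importance score for each column.
column_sources = ["State", "State", "State", "Month", "Month", "Junction"]
importances = [0.20, 0.15, 0.10, 0.25, 0.05, 0.25]

ranking = aggregate_importance(column_sources, importances)
for name, score in ranking:
    print(f"{name}: {score:.2f}")
```

Note that summing favors features with many categories, so a ranking produced this way should be read as a rough guide only.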

### Conclusion:

After analyzing location, time, and weather conditions, we report the following about US traffic accidents:
<br>
Most traffic accidents occur in coastal states, such as CA, OR, and FL. However, the frequency of serious accidents in FL and NY is higher than in other states. This might be caused by heavy traffic and narrow roads in NY and by roads in FL that are not friendly to pedestrians and bikes. Because traffic signals, crossings, and junctions were the 3 most frequent locations for traffic accidents, we suggest that police strengthen traffic control at these 3 types of locations where traffic is heavy by installing more monitors or warning signs and arranging more patrol officers. Moreover, we discovered that accidents occurred more often around traffic signals during the morning peak and the noon break. Increasing the number of police officers directing traffic around traffic signals may lower the occurrence of traffic accidents.
<br><br>
Over 80% of traffic accidents happened under fair, cloudy, mostly cloudy, partly cloudy, or light rain conditions. Most of these accidents have a severity of two, and we do not see a significant portion of accidents with severity above two under other weather conditions. The number of traffic accidents increased as visibility distance, temperature, and precipitation decreased. Thus, more accidents happened in winter. Therefore, the DMV can add suggestions to its brochures that call for extra attention to safe driving specifically in winter.
<br><br>
In terms of time, we observed that the highest numbers of accidents occurred between 7:00 a.m. and 8:00 a.m. and between 4:00 p.m. and 5:00 p.m. daily. The 4 to 5 p.m. window is the peak time for leaving work, and the 7 to 8 a.m. window is the peak time for commuting to work. The number of accidents on weekdays is higher than on weekends, and the severity is correspondingly higher. Therefore, the DMV can alert people to be careful when they are out during these hours, and drivers need to stay cautious. The government can deploy more police during these hours to reduce the occurrence of accidents.

### Tableau Dashboard:

<img src='https://github.com/yinghaow525/BA775-teamproject/blob/charts/Team1_Dashboard.png?raw=True' align='left' />

#### Tableau Public Dashboard Link:

https://public.tableau.com/app/profile/yinghao.wang3127/viz/BA775Team1ProjectDashboard/Team1_Dashboard?publish=yes

#### Github Link:
https://github.com/yinghaow525/BA775-teamproject/blob/main/Team%20Project.ipynb