## Accident dataset

This is a countrywide car accident dataset that covers 49 states of the USA. The accident data were collected from February 2016 to March 2023, using multiple APIs that provide streaming traffic incident (or event) data. These APIs broadcast traffic data captured by various entities, including the US and state departments of transportation, law enforcement agencies, traffic cameras, and traffic sensors within the road networks. The dataset currently contains approximately 7.7 million accident records.

The data were collected in real time and cover the contiguous United States.

### Description of each column: the data comprise 46 columns and 7,728,394 rows

#### 0. ID: A unique identifier of the accident record.

#### 1. Source: Source of raw accident data

#### 2. Severity: Shows the severity of the accident, a number between 1 and 4, where 1 indicates the least impact on traffic (i.e., short delay as a result of the accident) and 4 indicates a significant impact on traffic (i.e., long delay).

#### 3. Start_Time: Shows start time of the accident in local time zone.

#### 4. End_Time: Shows end time of the accident in local time zone. End time here refers to when the impact of accident on traffic flow was dismissed.

#### 5. Start_Lat: Shows latitude in GPS coordinate of the start point.

#### 6. Start_Lng: Shows longitude in GPS coordinate of the start point.

#### 7. End_Lat: Shows latitude in GPS coordinate of the end point.

#### 8. End_Lng: Shows longitude in GPS coordinate of the end point.

#### 9. Distance(mi): The length of the road extent affected by the accident in miles.

#### 10. Description: Shows a human provided description of the accident.

#### 11. Street: Shows the street name in address field.

#### 12. City: Shows the city in address field.

#### 13. County: Shows the county in address field.

#### 14. State: Shows the state in address field.

#### 15. Zipcode: Shows the zipcode in address field.

#### 16. Country: Shows the country in address field.

#### 17. Timezone: Shows timezone based on the location of the accident (eastern, central, etc.).

#### 18. Airport_Code: Denotes an airport-based weather station which is the closest one to location of the accident.

#### 19. Weather_Timestamp: Shows the time-stamp of weather observation record (in local time).

#### 20. Temperature(F): Shows the temperature (in Fahrenheit).

#### 21. Wind_Chill(F): Shows the wind chill (in Fahrenheit).

#### 22. Humidity(%): Shows the humidity (in percentage).

#### 23. Pressure(in): Shows the air pressure (in inches).

#### 24. Visibility(mi): Shows visibility (in miles).

#### 25. Wind_Direction: Shows wind direction.

#### 26. Wind_Speed(mph): Shows wind speed (in miles per hour).

#### 27. Precipitation(in): Shows precipitation amount in inches, if there is any.

#### 28. Weather_Condition: Shows the weather condition (rain, snow, thunderstorm, fog, etc.)

#### 29. Amenity: A POI annotation which indicates presence of amenity in a nearby location.

#### 30. Bump: A POI annotation which indicates presence of speed bump or hump in a nearby location.

#### 31. Crossing: A POI annotation which indicates presence of crossing in a nearby location.

#### 32. Give_Way: A POI annotation which indicates presence of give_way in a nearby location.

#### 33. Junction: A POI annotation which indicates presence of junction in a nearby location.

#### 34. No_Exit: A POI annotation which indicates presence of no_exit in a nearby location.

#### 35. Railway: A POI annotation which indicates presence of railway in a nearby location.

#### 36. Roundabout: A POI annotation which indicates presence of roundabout in a nearby location.

#### 37. Station: A POI annotation which indicates presence of station in a nearby location.

#### 38. Stop: A POI annotation which indicates presence of stop in a nearby location.

#### 39. Traffic_Calming: A POI annotation which indicates presence of traffic_calming in a nearby location.

#### 40. Traffic_Signal: A POI annotation which indicates presence of traffic_signal in a nearby location.

#### 41. Turning_Loop: A POI annotation which indicates presence of turning_loop in a nearby location.

#### 42. Sunrise_Sunset: Shows the period of day (i.e. day or night) based on sunrise/sunset.

#### 43. Civil_Twilight: Shows the period of day (i.e. day or night) based on civil twilight.

#### 44. Nautical_Twilight: Shows the period of day (i.e. day or night) based on nautical twilight.

#### 45. Astronomical_Twilight: Shows the period of day (i.e. day or night) based on astronomical twilight.

#### Three Types of Twilight


In its most general sense, twilight is the period of time before sunrise and after sunset in which the atmosphere is partially illuminated by the sun, being neither totally dark nor completely lit.

However, there are three categories of twilight, defined by how far the sun is below the horizon.

Civil Twilight:
Begins in the morning, or ends in the evening, when the geometric center of the sun is 6 degrees below the horizon. Therefore, morning civil twilight begins when the geometric center of the sun is 6 degrees below the horizon and ends at sunrise. Evening civil twilight begins at sunset and ends when the geometric center of the sun is 6 degrees below the horizon. Under these conditions, absent fog or other restrictions, the brightest stars and planets can be seen, the horizon and terrestrial objects can be discerned, and in many cases artificial lighting is not needed.

Nautical Twilight:
Begins in the morning, or ends in the evening, when the geometric center of the sun is 12 degrees below the horizon. In general, the term nautical twilight refers to sailors being able to take reliable readings via well-known stars because the horizon is still visible, even under moonless conditions. Absent fog or other restrictions, outlines of terrestrial objects may still be discernible, but detailed outdoor activities are likely curtailed without artificial illumination.

Astronomical Twilight:
Begins in the morning, or ends in the evening, when the geometric center of the sun is 18 degrees below the horizon. In astronomical twilight, sky illumination is so faint that most casual observers would regard the sky as fully dark, especially under urban or suburban light pollution. Under astronomical twilight, the horizon is not discernible, and moderately faint stars or planets can be observed with the naked eye under a sky free of light pollution. But to test the limits of naked-eye observations, the sun needs to be more than 18 degrees below the horizon. Point light sources such as stars and planets can be readily studied by astronomers under astronomical twilight. But diffuse light sources such as galaxies, nebulae, and globular clusters need to be observed under a totally dark sky, again when the sun is more than 18 degrees below the horizon.

The figure below shows civil, nautical and astronomical twilight.  Note that the angles are not to scale so as to show the three twilight categories with more clarity.

![image.png](attachment:image.png)
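
The three thresholds above (6, 12 and 18 degrees below the horizon) can be read as a simple classification rule on the sun's elevation angle. The sketch below is only an illustration of those definitions: the function name `twilight_phase` is hypothetical, and treating sunrise and sunset as an elevation of exactly 0 degrees is a simplifying assumption (it ignores atmospheric refraction and the size of the solar disc). It does not claim to reproduce how the dataset's Day/Night columns were computed.

```python
def twilight_phase(sun_elevation_deg: float) -> str:
    """Classify the sky state from the sun's elevation in degrees.

    Negative values mean the geometric center of the sun is below the horizon.
    """
    if sun_elevation_deg >= 0:
        return "day"
    if sun_elevation_deg >= -6:
        return "civil twilight"
    if sun_elevation_deg >= -12:
        return "nautical twilight"
    if sun_elevation_deg >= -18:
        return "astronomical twilight"
    return "night"


# Walk from daylight down to full darkness.
for elevation in (10, -3, -9, -15, -25):
    print(elevation, twilight_phase(elevation))
```
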

In [3]:
# Import libraries (not all of them may end up being used)

import datetime as dt
import pandas as pd
import numpy as np
import requests
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
import h2o
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from imblearn.under_sampling import RandomUnderSampler
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
from sklearn.model_selection import train_test_split
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score
from math import radians, sin, cos, sqrt, atan2
from shapely.geometry import Polygon
from sklearn.preprocessing import LabelEncoder
from statsmodels.stats.outliers_influence import variance_inflation_factor

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import AdaBoostClassifier
from statsmodels.tools.tools import add_constant

from sklearn.feature_extraction.text import TfidfVectorizer 
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from xgboost import XGBClassifier
from imblearn.ensemble import BalancedBaggingClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.decomposition import PCA
from sklearn.preprocessing import RobustScaler
from sklearn.linear_model import LinearRegression
from h2o.automl import H2OAutoML
from tpot import TPOTClassifier

In [4]:
accidents = pd.read_csv("C:\\Users\\reza3\\OneDrive\\Desktop\\AIT\\Machine learning\\group project\\data\\US_Accidents_March23.csv")


### Creating time components

In [5]:
# Some time values have an inconsistent format, so keep only the first 19 characters (YYYY-MM-DD HH:MM:SS):
accidents.loc[:, "Start_Time"] = accidents["Start_Time"].str[:19]
accidents.loc[:, "End_Time"] = accidents["End_Time"].str[:19]

# Convert the cleaned strings to datetime:
accidents["Start_Time"] = pd.to_datetime(accidents["Start_Time"])
accidents["End_Time"] = pd.to_datetime(accidents["End_Time"])

# Get year from datetime pandas series
accidents["Start_year"]= accidents["Start_Time"].dt.year
accidents["End_year"]= accidents["End_Time"].dt.year

# Get month name from datetime pandas series
accidents["Start_month"] = accidents["Start_Time"].dt.month_name()
accidents["End_month"] = accidents["End_Time"].dt.month_name()

# Get day name from datetime pandas series
accidents["Start_day"] = accidents["Start_Time"].dt.day_name()
accidents["End_day"] = accidents["End_Time"].dt.day_name()

# Get hour from datetime pandas series
accidents["Start_hour"] = accidents["Start_Time"].dt.hour
accidents["End_hour"] = accidents["End_Time"].dt.hour

# Get time from datetime pandas series
accidents["Start_time"] = accidents["Start_Time"].dt.time
accidents["End_time"] = accidents["End_Time"].dt.time

# Get day of the week number from datetime pandas series
accidents["Start_weekday"] = accidents["Start_Time"].dt.day_of_week
accidents["End_weekday"] = accidents["End_Time"].dt.day_of_week

# create a new column for weekends:
accidents['IsWeekend'] = (accidents['Start_weekday'] >= 5).astype(int)
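
The truncation to 19 characters in the cell above works because a timestamp written as `YYYY-MM-DD HH:MM:SS` is exactly 19 characters long, so any trailing suffix is cut off before parsing. The small sketch below illustrates this on made-up values; the assumption that the inconsistency is a fractional-seconds suffix is mine and is not stated in the notebook.

```python
import pandas as pd

# Hypothetical sample mixing two timestamp layouts.
raw = pd.Series(["2016-02-08 05:46:00", "2023-03-31 17:04:19.000000000"])

# Keep only 'YYYY-MM-DD HH:MM:SS' (19 characters), then parse.
parsed = pd.to_datetime(raw.str[:19])
print(parsed)
```
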


Some of the records are incidents with very large delay values, so we should be careful when training a model that includes "Delay".

In [6]:
# Compute the time between the start and end of each accident, in minutes
accidents.insert(loc=3, column='Delay(min)', value=((accidents["End_Time"] - accidents["Start_Time"]).dt.total_seconds())/60)
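
One way to act on the caution about large delays is to look at the upper tail of `Delay(min)` and flag or cap extreme values before modelling. The sketch below uses toy numbers, and the 1.5 × IQR rule is an arbitrary assumption chosen for illustration, not a threshold taken from the notebook.

```python
import pandas as pd

# Toy durations in minutes; the last value mimics a multi-day record.
delay = pd.Series([30.0, 30.0, 45.0, 60.0, 314.0, 4320.0], name="Delay(min)")

# Hypothetical cut-off: 1.5 * IQR above the third quartile.
q1, q3 = delay.quantile([0.25, 0.75])
upper = q3 + 1.5 * (q3 - q1)

is_extreme = delay > upper              # flag rather than silently drop
delay_capped = delay.clip(upper=upper)  # alternative: cap at the cut-off

print(int(is_extreme.sum()), "extreme record(s); cap =", upper)
```

Flagging keeps the information that a record was unusual, while capping keeps the row usable in models that are sensitive to outliers.
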

In [None]:
pd.set_option('display.float_format', '{:.2f}'.format)
accidents.describe(include = 'all').to_csv("C:\\Users\\reza3\\OneDrive\\Desktop\\AIT\\Machine learning\\group project\\data\\describe.csv")

In [7]:
# Convert imperial measurements to metric (the functions operate on whole columns):
def fahrenheit_to_celsius(fahrenheit):
    return (fahrenheit - 32) * 5 / 9

accidents["Temperature(C)"] = fahrenheit_to_celsius(accidents["Temperature(F)"])
accidents["Wind_Chill(C)"] = fahrenheit_to_celsius(accidents["Wind_Chill(F)"])

def inches_to_cm(inches):
    return inches * 2.54

accidents["Pressure(cm)"] = inches_to_cm(accidents["Pressure(in)"])
# Stored in centimetres so that the values match the column name
accidents["Precipitation(cm)"] = inches_to_cm(accidents["Precipitation(in)"])

def miles_to_km(distance_miles):
    return distance_miles * 1.60934

accidents["Distance(km)"] = miles_to_km(accidents["Distance(mi)"])
accidents["Visibility(km)"] = miles_to_km(accidents["Visibility(mi)"])
accidents["Wind_Speed(kmph)"] = miles_to_km(accidents["Wind_Speed(mph)"])

# Drop the original imperial columns
accidents = accidents.drop(columns=["Temperature(F)", "Wind_Chill(F)", "Pressure(in)", "Precipitation(in)", "Visibility(mi)", "Wind_Speed(mph)", "Distance(mi)"])

In [8]:
# Create a function to assign grid square IDs
def assign_grid_square(gridsize, lat, lng):
    lat_interval = np.floor(lat / gridsize) * gridsize
    lon_interval = np.floor(lng / gridsize) * gridsize
    return f'{lat_interval}-{lat_interval + gridsize}_{lon_interval}-{lon_interval + gridsize}'

# Apply the function to create a new column 'GridSquare'
accidents['GridSquare'] = accidents.apply(lambda row: assign_grid_square(0.001, row['Start_Lat'], row['Start_Lng']), axis=1)
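
Row-wise `apply` with `axis=1` can be slow on several million rows, and multiplying the floored value back by `gridsize` can leave floating-point noise in the label strings. A possible alternative, sketched below, is to compute integer cell indices on whole columns at once. The helper name `grid_cell_index` and the `lat_lng` label format are assumptions of this sketch (the labels differ from the ones produced above), and it assumes there are no missing coordinates.

```python
import numpy as np
import pandas as pd


def grid_cell_index(values, gridsize=0.001):
    """Integer index of the grid cell containing each coordinate."""
    return np.floor(values / gridsize).astype("int64")


# Made-up sample: the first and third points fall in the same 0.001-degree cell.
sample = pd.DataFrame({
    "Start_Lat": [39.865147, 39.928059, 39.865900],
    "Start_Lng": [-84.058723, -82.831184, -84.058100],
})

lat_idx = grid_cell_index(sample["Start_Lat"])
lng_idx = grid_cell_index(sample["Start_Lng"])
sample["GridSquare"] = lat_idx.astype(str) + "_" + lng_idx.astype(str)
print(sample)
```
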


In [42]:
import geopy
import geopy.distance as distance
import plotly.graph_objects as go
from haversine import haversine, Unit
from shapely.geometry import Polygon

# Each corner lies 25 km from the center point (so each side is roughly 35 km long)
d = distance.distance(kilometers=25) # 25 km
print(d)
# Corners are generated clockwise, starting from the upper-right (north-east) one

my_lat = 38.696211
my_lon = -120.5

center_point = geopy.Point((my_lat,my_lon))
p2 = d.destination(point=center_point, bearing=45)
p3 = d.destination(point=center_point, bearing=135)
p4 = d.destination(point=center_point, bearing=-135)
p5 = d.destination(point=center_point, bearing=-45)

# print('p2','-->',p2)
# print('p3','-->',p3)
# print('p4','-->',p4)
# print('p5','-->',p5)

points = [(p.latitude, p.longitude) for p in [p2,p3,p4,p5]]
polygon = Polygon(points)

25.0 km
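
Because the four points are placed 25 km from the center along the diagonal bearings, the sides of the resulting square come out at about 25·√2 ≈ 35.4 km rather than 25 km. The sketch below checks this with plain spherical formulas. It is not geopy's geodesic model: the mean Earth radius of 6371 km and the helper names `destination` and `haversine_km` are assumptions made for this illustration.

```python
from math import asin, atan2, cos, degrees, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean radius; spherical approximation


def destination(lat, lon, bearing_deg, distance_km):
    """Point reached from (lat, lon) after distance_km along bearing_deg."""
    phi1, lam1, theta = radians(lat), radians(lon), radians(bearing_deg)
    delta = distance_km / EARTH_RADIUS_KM
    phi2 = asin(sin(phi1) * cos(delta) + cos(phi1) * sin(delta) * cos(theta))
    lam2 = lam1 + atan2(sin(theta) * sin(delta) * cos(phi1),
                        cos(delta) - sin(phi1) * sin(phi2))
    return degrees(phi2), degrees(lam2)


def haversine_km(p, q):
    """Great-circle distance between two (lat, lon) points in km."""
    phi1, phi2 = radians(p[0]), radians(q[0])
    dphi = phi2 - phi1
    dlam = radians(q[1] - p[1])
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


center = (38.696211, -120.5)
# Same bearings as above: NE, SE, SW, NW (clockwise).
corners = [destination(*center, b, 25) for b in (45, 135, -135, -45)]

# Length of each side of the square, corner to next corner.
sides = [haversine_km(a, b) for a, b in zip(corners, corners[1:] + corners[:1])]
print([round(s, 2) for s in sides])
```

If a square with 25 km sides is what is wanted, the corner distance would need to be 25/√2 ≈ 17.7 km instead.
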


In [9]:
# recoding weather condition to fewer categories:
weather_bins = {
    'Clear': ['Clear', 'Fair'],
    'Cloudy': ['Cloudy', 'Mostly Cloudy', 'Partly Cloudy', 'Scattered Clouds'],
    'Rainy': ['Light Rain', 'Rain', 'Light Freezing Drizzle', 'Light Drizzle', 'Heavy Rain', 'Light Freezing Rain', 'Drizzle', 'Light Freezing Fog', 'Light Rain Showers', 'Showers in the Vicinity', 'T-Storm', 'Thunder', 'Patches of Fog', 'Heavy T-Storm', 'Heavy Thunderstorms and Rain', 'Funnel Cloud', 'Heavy T-Storm / Windy', 'Heavy Thunderstorms and Snow', 'Rain / Windy', 'Heavy Rain / Windy', 'Squalls', 'Heavy Ice Pellets', 'Thunder / Windy', 'Drizzle and Fog', 'T-Storm / Windy', 'Smoke / Windy', 'Haze / Windy', 'Light Drizzle / Windy', 'Widespread Dust / Windy', 'Wintry Mix', 'Wintry Mix / Windy', 'Light Snow with Thunder', 'Fog / Windy', 'Snow and Thunder', 'Sleet / Windy', 'Heavy Freezing Rain / Windy', 'Squalls / Windy', 'Light Rain Shower / Windy', 'Snow and Thunder / Windy', 'Light Sleet / Windy', 'Sand / Dust Whirlwinds', 'Mist / Windy', 'Drizzle / Windy', 'Duststorm', 'Sand / Dust Whirls Nearby', 'Thunder and Hail', 'Freezing Rain / Windy', 'Light Snow Shower / Windy', 'Partial Fog', 'Thunder / Wintry Mix / Windy', 'Patches of Fog / Windy', 'Rain and Sleet', 'Light Snow Grains', 'Partial Fog / Windy', 'Sand / Dust Whirlwinds / Windy', 'Heavy Snow with Thunder', 'Heavy Blowing Snow', 'Low Drifting Snow', 'Light Hail', 'Light Thunderstorm', 'Heavy Freezing Drizzle', 'Light Blowing Snow', 'Thunderstorms and Snow', 'Heavy Rain Showers', 'Rain Shower / Windy', 'Sleet and Thunder', 'Heavy Sleet and Thunder', 'Drifting Snow / Windy', 'Shallow Fog / Windy', 'Thunder and Hail / Windy', 'Heavy Sleet / Windy', 'Sand / Windy', 'Heavy Rain Shower / Windy', 'Blowing Snow Nearby', 'Blowing Sand', 'Heavy Rain Shower', 'Drifting Snow', 'Heavy Thunderstorms with Small Hail'],
    'Snowy': ['Light Snow', 'Snow', 'Light Snow / Windy', 'Snow Grains', 'Snow Showers', 'Snow / Windy', 'Light Snow and Sleet', 'Snow and Sleet', 'Light Snow and Sleet / Windy', 'Snow and Sleet / Windy'],
    'Windy': ['Blowing Dust / Windy', 'Fair / Windy', 'Mostly Cloudy / Windy', 'Light Rain / Windy', 'T-Storm / Windy', 'Blowing Snow / Windy', 'Freezing Rain / Windy', 'Light Snow and Sleet / Windy', 'Sleet and Thunder / Windy', 'Blowing Snow Nearby', 'Heavy Rain Shower / Windy'],
    'Hail': ['Hail'],
    'Volcanic Ash': ['Volcanic Ash'],
    'Tornado': ['Tornado']
}

def map_weather_to_bins(weather):
    for bin_name, bin_values in weather_bins.items():
        if weather in bin_values:
            return bin_name
    return 'Other' 

accidents['Weather_Bin'] = accidents['Weather_Condition'].apply(map_weather_to_bins)
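
Several conditions appear in more than one list (for example, 'T-Storm / Windy' is in both 'Rainy' and 'Windy'), and the loop returns the first bin that matches, so the later entry is never used. The sketch below makes that behaviour explicit by inverting the bins into a flat lookup dictionary. `weather_bins_small` is a hypothetical, shortened stand-in for the full dictionary, and the names `condition_to_bin` and `map_weather` are introduced only for this example.

```python
# Hypothetical, shortened version of the bins, kept only for illustration.
weather_bins_small = {
    "Clear": ["Clear", "Fair"],
    "Rainy": ["Rain", "T-Storm / Windy"],
    "Windy": ["Fair / Windy", "T-Storm / Windy"],
}

# Invert to {condition: bin}. setdefault keeps the first bin that lists a
# condition, which matches the first-match behaviour of the loop above.
condition_to_bin = {}
for bin_name, conditions in weather_bins_small.items():
    for condition in conditions:
        condition_to_bin.setdefault(condition, bin_name)


def map_weather(condition):
    """Look up the bin of a condition, falling back to 'Other'."""
    return condition_to_bin.get(condition, "Other")


print(map_weather("T-Storm / Windy"))
print(map_weather("Haze"))
```

With such a dictionary, the column could also be built with `Series.map(condition_to_bin).fillna('Other')`, which avoids scanning every list for every row.
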

In [10]:
# Drop unnecessary columns (some of them have two levels, with one level covering about 100% of the records)
accidents.drop(columns=["Weather_Condition","Source", "End_Lat", "End_Lng", "Country", "Airport_Code", "Weather_Timestamp", "Timezone", "Turning_Loop"], inplace=True)

In [11]:
# Rename some columns to follow the dataset's general naming rule (for example, 'Start_year' to 'Start_Year')
accidents = accidents.rename(columns={
    'Start_year': 'Start_Year', 'End_year': 'End_Year',
    'Start_month': 'Start_Month', 'End_month': 'End_Month',
    'Start_day': 'Start_Day', 'End_day': 'End_Day',
    'Start_hour': 'Start_Hour', 'End_hour': 'End_Hour',
    'Start_weekday': 'Start_Weekday', 'End_weekday': 'End_Weekday',
})

In [34]:
accidents["Wind_Direction"] = accidents["Wind_Direction"].str[:1]
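
Keeping only the first character collapses the wind direction labels into a handful of coarse classes, which is where the later `Wind_Direction_C`, `_E`, `_N`, `_S`, `_V` and `_W` columns come from. The sketch below shows the effect on a few illustrative labels; the exact spellings present in the raw data are an assumption here.

```python
import pandas as pd

# Illustrative labels; the exact spellings in the raw data are an assumption.
wind = pd.Series(["Calm", "CALM", "SW", "SSW", "North", "N", "VAR", "Variable", None])

# Same operation as above: keep only the first character.
first_letter = wind.str[:1]
print(first_letter.value_counts(dropna=False))
```

Note that this is lossy: compound directions such as "SW" end up in the same class as "S".
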

In [35]:
def clean_data(accidents):
    # One-hot encode columns: 'Wind_Direction', 'Sunrise_Sunset' and 6 other columns
    accidents = pd.get_dummies(accidents, columns=['Wind_Direction', 'Sunrise_Sunset', 'Civil_Twilight', 'Astronomical_Twilight', 'Nautical_Twilight', 'Weather_Bin', 'Start_Day', 'Start_Month'])
    return accidents

accidents_copy = clean_data(accidents.copy())

In [36]:
accidents_copy.head()

| | ID | Severity | Delay(min) | Start_Time | End_Time | Start_Lat | Start_Lng | Description | Street | City | ... | Start_Month_December | Start_Month_February | Start_Month_January | Start_Month_July | Start_Month_June | Start_Month_March | Start_Month_May | Start_Month_November | Start_Month_October | Start_Month_September |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | A-1 | 3 | 314.0 | 2016-02-08 05:46:00 | 2016-02-08 11:00:00 | 39.865147 | -84.058723 | Right lane blocked due to accident on I-70 Eas... | I-70 E | Dayton | ... | False | True | False | False | False | False | False | False | False | False |
| 1 | A-2 | 2 | 30.0 | 2016-02-08 06:07:59 | 2016-02-08 06:37:59 | 39.928059 | -82.831184 | Accident on Brice Rd at Tussing Rd. Expect del... | Brice Rd | Reynoldsburg | ... | False | True | False | False | False | False | False | False | False | False |
| 2 | A-3 | 2 | 30.0 | 2016-02-08 06:49:27 | 2016-02-08 07:19:27 | 39.063148 | -84.032608 | Accident on OH-32 State Route 32 Westbound at ... | State Route 32 | Williamsburg | ... | False | True | False | False | False | False | False | False | False | False |
| 3 | A-4 | 3 | 30.0 | 2016-02-08 07:23:34 | 2016-02-08 07:53:34 | 39.747753 | -84.205582 | Accident on I-75 Southbound at Exits 52 52B US... | I-75 S | Dayton | ... | False | True | False | False | False | False | False | False | False | False |
| 4 | A-5 | 2 | 30.0 | 2016-02-08 07:39:07 | 2016-02-08 08:09:07 | 39.627781 | -84.188354 | Accident on McEwen Rd at OH-725 Miamisburg Cen... | Miamisburg Centerville Rd | Dayton | ... | False | True | False | False | False | False | False | False | False | False |


In [39]:
boolean_columns = ["Amenity", "Bump", "Crossing", "Give_Way", "Roundabout", "Junction", "No_Exit", "Railway", "Station", "Stop", "Traffic_Calming", "Traffic_Signal","Wind_Direction_C", "Wind_Direction_E", 
                   "Wind_Direction_N", "Wind_Direction_S","Wind_Direction_V", "Wind_Direction_W", "Sunrise_Sunset_Day", "Sunrise_Sunset_Night", "Civil_Twilight_Day", "Civil_Twilight_Night", 
                   "Astronomical_Twilight_Day", "Astronomical_Twilight_Night", "Nautical_Twilight_Day", "Nautical_Twilight_Night", "Weather_Bin_Clear", "Weather_Bin_Cloudy", "Weather_Bin_Hail", 
                   "Weather_Bin_Other", "Weather_Bin_Rainy", "Weather_Bin_Snowy", "Weather_Bin_Tornado", "Weather_Bin_Volcanic Ash", "Weather_Bin_Windy", "Start_Day_Friday", "Start_Day_Monday", 
                   "Start_Day_Saturday", "Start_Day_Sunday", "Start_Day_Thursday", "Start_Day_Tuesday", "Start_Day_Wednesday", "Start_Month_April", "Start_Month_August", "Start_Month_December",
                   "Start_Month_February", "Start_Month_January", "Start_Month_July", "Start_Month_June", "Start_Month_March", "Start_Month_May", "Start_Month_November", "Start_Month_October",
                   "Start_Month_September"]
# Cast the boolean indicator columns to integers (0/1)
accidents_copy[boolean_columns] = accidents_copy[boolean_columns].astype(int)
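
For the dummy columns, the integer cast can also be requested directly from `pd.get_dummies` through its `dtype` argument, which leaves only the pre-existing boolean POI columns to convert. The sketch below shows this on a toy frame; the frame `toy` and its values are made up for the example.

```python
import pandas as pd

# Toy frame with one categorical column and one existing boolean POI flag.
toy = pd.DataFrame({
    "Sunrise_Sunset": ["Day", "Night", "Day"],
    "Amenity": [False, True, False],
})

# dtype=int makes the generated dummy columns 0/1 integers straight away.
encoded = pd.get_dummies(toy, columns=["Sunrise_Sunset"], dtype=int)

# The pre-existing boolean column still needs its own cast.
encoded["Amenity"] = encoded["Amenity"].astype(int)
print(encoded.dtypes)
```
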


In [40]:
accidents_copy.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7728394 entries, 0 to 7728393
Data columns (total 87 columns):
 #   Column                       Dtype         
---  ------                       -----         
 0   ID                           object        
 1   Severity                     int64         
 2   Delay(min)                   float64       
 3   Start_Time                   datetime64[ns]
 4   End_Time                     datetime64[ns]
 5   Start_Lat                    float64       
 6   Start_Lng                    float64       
 7   Description                  object        
 8   Street                       object        
 9   City                         object        
 10  County                       object        
 11  State                        object        
 12  Zipcode                      object        
 13  Humidity(%)                  float64       
 14  Amenity                      int32         
 15  Bump                         int32         
 16  

Changing the attributes' data types

In [None]:

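# Plot the share of each severity level (count the levels first, otherwise every row becomes a slice)
accidents.Severity.value_counts().sort_index().plot(kind='pie', autopct='%.1f%%')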


In [None]:
sns.barplot(s)

In [None]:
accidents.columns