### This Notebook cleans the dataset, then builds and evaluates the model

Importing the required modules

In [2]:
#Basic Import
import pandas as pd
import numpy as np
import dask.dataframe as dd

from dask_ml.model_selection import train_test_split

#Importing GridSearchCV for hyperparameter tuning
from sklearn.model_selection import GridSearchCV

#Importing the CatBoostClassifier
from catboost import CatBoostClassifier, Pool

#Importing model evaluation metrics
from sklearn.metrics import accuracy_score, classification_report

from sklearn.preprocessing import LabelEncoder



Importing the Dataset from the local CSV file

In [3]:
df1=dd.read_csv('US_Accidents_March23.csv')

In [4]:
df1.info()

<class 'dask.dataframe.core.DataFrame'>
Columns: 46 entries, ID to Astronomical_Twilight
dtypes: object(20), bool(13), float64(12), int64(1)

We are going to use the CatBoostClassifier to train the model.

CatBoost builds gradient-boosted decision trees, which split on feature thresholds, so the numerical variables of the dataset do not need to be scaled.

The following columns can be dropped from the Dataset:

- ID: Does not serve any purpose other than identification
- Source: The origin of the raw accident record. It does not contribute to the Severity
- End_Lat: Significant number of values are NaN. Also has 100% correlation to Start_Lat
- End_Lng: Significant number of values are NaN. Also has 100% correlation to Start_Lng
- Description: It is a text field that describes the accident. Description cannot be used to predict the Severity
- Street: This information will already be captured in the Latitude and Longitude
- City: This information will already be captured in the Latitude and Longitude
- County: This information will already be captured in the Latitude and Longitude
- State: This information will already be captured in the Latitude and Longitude
- Zipcode: This information will already be captured in the Latitude and Longitude
- Country: Has no Variance at all. All the values are 'US' since the dataset pertains to the US
- Timezone: This information will already be captured in the Latitude and Longitude
- Amenity: Has extremely low Variance
- Bump: Has extremely low Variance
- Crossing: Has extremely low Variance
- Give_Way: Has extremely low Variance
- Junction: Has extremely low Variance
- No_Exit: Has extremely low Variance
- Railway: Has extremely low Variance
- Roundabout: Has extremely low Variance
- Station: Has extremely low Variance
- Stop: Has extremely low Variance
- Traffic_Calming: Has extremely low Variance
- Traffic_Signal: Has extremely low Variance
- Turning_Loop: Has no Variance at all. All the values are False
- Sunrise_Sunset: This information is already captured in the Start_Time and End_Time features
- Civil_Twilight: This information is already captured in the Start_Time and End_Time features
- Nautical_Twilight: This information is already captured in the Start_Time and End_Time features
- Astronomical_Twilight: This information is already captured in the Start_Time and End_Time features
- Wind_Chill(F): Has nearly 2 Million NaN records and a 99% positive correlation with Temperature
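
The reasons given in the list above (missing values, low or zero variance, near-perfect correlation) can be checked with a few pandas calls. The following is a minimal sketch on a made-up toy frame, not the actual accidents data; the helper name `column_diagnostics` and all values in `sample` are assumptions introduced for illustration.

```python
import numpy as np
import pandas as pd

def column_diagnostics(frame: pd.DataFrame) -> pd.DataFrame:
    """Summarize missing share, distinct count and dominant-value share per column."""
    rows = []
    for name in frame.columns:
        col = frame[name]
        counts = col.value_counts(dropna=True)
        # Share of rows taken by the most frequent non-missing value
        top_share = counts.iloc[0] / len(col) if len(counts) else 0.0
        rows.append({
            "column": name,
            "nan_share": col.isna().mean(),
            "n_unique": col.nunique(dropna=True),
            "top_value_share": top_share,
        })
    return pd.DataFrame(rows).set_index("column")

# Toy stand-in for a sample of the accidents data (values are made up)
sample = pd.DataFrame({
    "Start_Lat": [39.1, 40.2, 41.3, 42.4],
    "End_Lat": [39.1, np.nan, 41.3, np.nan],
    "Country": ["US", "US", "US", "US"],
    "Bump": [False, False, False, True],
})

report = column_diagnostics(sample)
print(report)

# Pairwise correlation between start and end latitude (rows with NaN are ignored)
lat_corr = sample["Start_Lat"].corr(sample["End_Lat"])
print("Start_Lat vs End_Lat correlation:", lat_corr)
```

A column with `n_unique` equal to 1 has no variance, and a `top_value_share` close to 1 indicates very low variance. On the full Dask dataframe the same checks would be run on a computed sample or with the Dask equivalents.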

In [5]:
columns_to_drop = ['ID',
'Source',
'End_Lat',
'End_Lng',
'Description',
'Street',
'City',
'County',
'State',
'Zipcode',
'Country',
'Timezone',
'Wind_Chill(F)',
'Amenity',
'Bump',
'Crossing',
'Give_Way',
'Junction',
'No_Exit',
'Railway',
'Roundabout',
'Station',
'Stop',
'Traffic_Calming',
'Traffic_Signal',
'Turning_Loop',
'Sunrise_Sunset',
'Civil_Twilight',
'Nautical_Twilight',
'Astronomical_Twilight'
]

In [6]:
#Dropping the features that are not required from the Dataset
df1=df1.drop(columns_to_drop,axis=1)

In [7]:
#Checking the shape of the Dataset after dropping the unnecessary columns
shape=df1.shape
print("Number of rows in the dataset:",shape[0].compute())
print("Number of columns in the dataset:",shape[1])

Number of rows in the dataset: 7728394
Number of columns in the dataset: 16


Dropping the records which have NaN values

In [8]:
df1=df1.dropna()

In [9]:
#Checking the shape of the Dataset after dropping the NaN values
shape=df1.shape
print("Number of rows in the dataset:",shape[0].compute())
print("Number of columns in the dataset:",shape[1])

Number of rows in the dataset: 5391216
Number of columns in the dataset: 16


We still have about 70% of the records (5,391,216 of 7,728,394) after dropping the NaN values

In [10]:
#Defining a list of columns where the date format must be in datetime64[ns]
date_columns = ['Start_Time', 'End_Time', 'Weather_Timestamp']

Dask has read Start_Time, End_Time and Weather_Timestamp as the object datatype. Let us convert them to the datetime64[ns] datatype.

In [11]:
#Converting the date columns from object datatype to datetime64[ns] datatype
for column in date_columns:
    #df1[column] = dd.to_datetime(df1[column], unit='s')
    #df1[column] = df1[column].astype('datetime64[ns]').astype('int64') // 10**9
    df1[column] = dd.to_datetime(df1[column], format='%Y-%m-%d %H:%M:%S', errors='coerce')
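
With `errors='coerce'`, any value that does not match the given format is silently replaced by NaT. Because the NaN rows were dropped before this conversion, it may be worth counting how many values the conversion itself turns into missing ones. The sketch below uses plain pandas on a made-up series (the malformed entry is hypothetical) and assumes the same `format` and `errors` arguments as the loop above.

```python
import pandas as pd

raw = pd.Series([
    "2023-01-05 08:15:00",
    "2023-01-05 09:30:00",
    "not a timestamp",  # hypothetical malformed entry
])

# Same format string and error handling as in the conversion loop
parsed = pd.to_datetime(raw, format="%Y-%m-%d %H:%M:%S", errors="coerce")

# Values that were present before the conversion but are NaT afterwards
n_failed = int(parsed.isna().sum() - raw.isna().sum())
print("Values turned into NaT by the conversion:", n_failed)
```

If the count is non-zero on the real data, a second `dropna()` after the conversion would remove those rows.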

In [12]:
#Checking the datatype of the columns
print(df1.dtypes)

Severity                      int64
Start_Time           datetime64[ns]
End_Time             datetime64[ns]
Start_Lat                   float64
Start_Lng                   float64
Distance(mi)                float64
Airport_Code                 object
Weather_Timestamp    datetime64[ns]
Temperature(F)              float64
Humidity(%)                 float64
Pressure(in)                float64
Visibility(mi)              float64
Wind_Direction               object
Wind_Speed(mph)             float64
Precipitation(in)           float64
Weather_Condition            object
dtype: object
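
The twilight columns were dropped on the grounds that Start_Time and End_Time already carry that information. One way to make it explicit for a model is to derive numeric features from the datetime columns. This is a sketch under assumptions: the notebook itself passes the datetime columns through unchanged, and the feature names `start_hour`, `start_weekday` and `duration_min` are made up for illustration.

```python
import pandas as pd

# Made-up timestamps standing in for two accident records
times = pd.DataFrame({
    "Start_Time": pd.to_datetime(["2023-01-05 08:15:00", "2023-01-07 23:50:00"]),
    "End_Time": pd.to_datetime(["2023-01-05 09:00:00", "2023-01-08 00:20:00"]),
})

features = pd.DataFrame({
    "start_hour": times["Start_Time"].dt.hour,
    "start_weekday": times["Start_Time"].dt.dayofweek,  # Monday = 0
    "duration_min": (times["End_Time"] - times["Start_Time"]).dt.total_seconds() / 60,
})
print(features)
```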


Separating the independent variables and the dependent variable into separate objects

In [13]:
#Separating features and target
X = df1.drop('Severity', axis=1)
y = df1['Severity']

Encoding the categorical variables. Since the number of distinct values in the categorical variables is high, we have chosen Label encoder inst

In [14]:
#Creating a list of categorical variables for encoding
cat_cols = ['Airport_Code','Wind_Direction','Weather_Condition']

In [15]:
#Printing the list of categorical columns
print(cat_cols)

['Airport_Code', 'Wind_Direction', 'Weather_Condition']


In [16]:
#Materializing the Dask dataframe in memory; compute() returns a pandas DataFrame, which the CatBoost Pool accepts
X_np = X.compute()

In [17]:
#Materializing the target in memory; compute() returns a pandas Series
Y_np = y.compute()

In [18]:
#Initialize the Label encoder
label_encoder = LabelEncoder()

In [19]:
#Encoding the categorical variables (the encoder is refit for each column, so only the last column's mapping is retained)
for cat in cat_cols:
    X_np[cat]=label_encoder.fit_transform(X_np[cat])
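
Reusing one `LabelEncoder` across columns works for encoding, but each `fit_transform` call overwrites the previous mapping. If the integer codes need to be mapped back to their labels later, or the same mapping applied to new data, one option is to keep a separate encoder per column. The sketch below uses a made-up toy frame; the `encoders` dictionary is an assumption, not part of the notebook.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy stand-in for two of the categorical columns (values are made up)
toy = pd.DataFrame({
    "Wind_Direction": ["N", "SW", "N", "E"],
    "Weather_Condition": ["Fair", "Rain", "Fair", "Snow"],
})

encoders = {}          # one fitted encoder per column
encoded = toy.copy()
for col in toy.columns:
    enc = LabelEncoder()
    encoded[col] = enc.fit_transform(toy[col])
    encoders[col] = enc

# The stored encoder can map the integer codes back to the original labels
decoded = encoders["Wind_Direction"].inverse_transform(encoded["Wind_Direction"])
print(encoded)
print(list(decoded))
```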

Creating the training data, validation data and test data

In [20]:
# Split into train (75%), validation (10%) and test (15%) sets
X_train, X_temp, y_train, y_temp = train_test_split(X_np, Y_np, test_size=0.25)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.6)  # test = 60% of 25% = 15%, validation = 40% of 25% = 10%
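
The two-stage split can be checked on a small synthetic array. This sketch uses scikit-learn's `train_test_split` rather than the dask_ml import used in the notebook, and it fixes `random_state` (an assumption, the notebook does not set one) so that the split is reproducible.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 1000 rows, 2 features, 4 classes
X_demo = np.arange(2000).reshape(1000, 2)
y_demo = np.arange(1000) % 4

# Stage 1: hold out 25% of the rows
X_tr, X_tmp, y_tr, y_tmp = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42
)
# Stage 2: split the held-out 25% into 40% validation and 60% test
X_va, X_te, y_va, y_te = train_test_split(
    X_tmp, y_tmp, test_size=0.6, random_state=42
)

shares = {
    name: len(part) / len(X_demo)
    for name, part in [("train", X_tr), ("val", X_va), ("test", X_te)]
}
print(shares)
```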

Creating a CatBoost Pool for efficient handling of large datasets

In [21]:
# Create a CatBoost Pool for efficient handling of large datasets
train_pool = Pool(X_train, label=y_train)
val_pool = Pool(X_val, label=y_val)

Defining a hyperparameter grid for tuning the model

In [29]:
# Defining the hyperparameter grid for tuning
param_grid = {
    'iterations': [10],        # disabled candidates: 200, 300
    'learning_rate': [0.05],   # disabled candidates: 0.1, 0.15
    'depth': [4],              # disabled candidates: 6, 8
}
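
The cost of a grid search grows with the number of parameter combinations times the number of cross-validation folds. As a rough sketch, scikit-learn's `ParameterGrid` can count the combinations before anything is trained. The `wider_grid` below restores the commented-out candidates and the fold count of 5 matches the grid search call in this notebook; both names are assumptions introduced for illustration.

```python
from sklearn.model_selection import ParameterGrid

# The grid as currently defined: a single combination
active_grid = {"iterations": [10], "learning_rate": [0.05], "depth": [4]}

# The grid with the commented-out candidates restored (illustrative)
wider_grid = {
    "iterations": [10, 200, 300],
    "learning_rate": [0.05, 0.1, 0.15],
    "depth": [4, 6, 8],
}
cv_folds = 5

for name, grid in [("active", active_grid), ("wider", wider_grid)]:
    n_candidates = len(ParameterGrid(grid))
    n_fits = n_candidates * cv_folds  # excludes the final refit on all training data
    print(f"{name}: {n_candidates} candidates, {n_fits} cross-validation fits")
```

With the single-value grid, the search amounts to one 5-fold cross-validation of a fixed configuration rather than a tuning run.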

Creating a CatBoostClassifier model with early stopping configured

In [30]:
# Create a CatBoostClassifier model with early stopping configured
model = CatBoostClassifier(
    early_stopping_rounds=10,  # Only takes effect when fit() is given an eval_set to monitor
    eval_metric='Accuracy',  # Replace with your preferred metric
    # Other model parameters as needed
)

Performing grid search with scikit-learn's GridSearchCV

In [31]:
# Perform grid search with scikit-learn's GridSearchCV on the in-memory training data
grid_search = GridSearchCV(model, param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)

0:	learn: 0.8468029	total: 1.26s	remaining: 11.4s
1:	learn: 0.8467909	total: 2.25s	remaining: 8.98s
2:	learn: 0.8457670	total: 3.55s	remaining: 8.28s
3:	learn: 0.8463862	total: 4.78s	remaining: 7.17s
4:	learn: 0.8474153	total: 6.15s	remaining: 6.15s
5:	learn: 0.8474153	total: 7.29s	remaining: 4.86s
6:	learn: 0.8474444	total: 8.39s	remaining: 3.6s
7:	learn: 0.8472348	total: 9.35s	remaining: 2.34s
8:	learn: 0.8457670	total: 10.3s	remaining: 1.14s
9:	learn: 0.8457670	total: 11.4s	remaining: 0us
0:	learn: 0.8467702	total: 803ms	remaining: 7.23s
1:	learn: 0.8467597	total: 1.6s	remaining: 6.42s
2:	learn: 0.8457670	total: 2.55s	remaining: 5.94s
3:	learn: 0.8463612	total: 3.64s	remaining: 5.46s
4:	learn: 0.8473714	total: 4.61s	remaining: 4.61s
5:	learn: 0.8473714	total: 5.49s	remaining: 3.66s
6:	learn: 0.8473714	total: 6.39s	remaining: 2.74s
7:	learn: 0.8472589	total: 7.31s	remaining: 1.83s
8:	learn: 0.8457670	total: 8.18s	remaining: 909ms
9:	learn: 0.8463268	total: 9.1s	remaining: 0us
0:	lear

Retrieving the best model and its parameters

In [33]:
# Get the best model and its parameters
best_model = grid_search.best_estimator_
best_params = grid_search.best_params_

Evaluating the performance of the model on the test data

In [34]:
# Evaluate performance on the test set
test_pool = Pool(X_test, label=y_test)
test_predictions = best_model.predict(test_pool)
test_accuracy = best_model.score(test_pool)

print("Test accuracy:", test_accuracy)

Test accuracy: 0.8458394698540713


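We have got a test accuracy of about 84.6%. The training log above reports an accuracy close to 0.846 from the first iteration onward, so this figure is best read alongside the majority-class share of Severity rather than on its own.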
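
To put an accuracy figure in context, it can be compared with the score of a model that always predicts the most frequent class, and broken down per class with `classification_report` (already imported at the top of this notebook). The sketch below uses made-up labels, not the notebook's predictions; the 85/10/5 class split is an assumption chosen for illustration.

```python
import numpy as np
from sklearn.metrics import accuracy_score, classification_report

# Made-up labels: class 2 makes up 85% of the sample
y_true = np.array([2] * 85 + [3] * 10 + [4] * 5)

# Baseline that always predicts the most frequent class
values, counts = np.unique(y_true, return_counts=True)
majority_class = values[np.argmax(counts)]
y_baseline = np.full_like(y_true, majority_class)

baseline_accuracy = accuracy_score(y_true, y_baseline)
print("Majority-class baseline accuracy:", baseline_accuracy)

# Per-class precision and recall expose what overall accuracy hides
print(classification_report(y_true, y_baseline, zero_division=0))
```

In the notebook, the same two calls would take `y_test` and `test_predictions` in place of the made-up arrays.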