## Machine Learning Project

- **What are the most important features for predicting X as a target variable?**
- **Which classification approach do you prefer for the prediction of X as a target variable, and why?**
- **How to classify the loyal and churn customers using Support Vector Machines?**
- **Why is dimensionality reduction important in machine learning?**




**a) Logical justification based on the reasoning for the specific choice of machine learning approaches.**

**b) Multiple machine learning approaches (at least two) using hyperparameters and a comparison between the chosen modelling approaches.**

**c) Visualise your comparison of ML modelling outcomes. You may use a statistical approach to argue that one feature is more important than other features (for example, using PCA).**

**d) Cross-validation methods should be used to justify the authenticity of your ML results.**




**1. Motivation, a description of the problem domain, and an explanation of how the project's goals are justified using Prediction / Classification / Clustering Rules / Dimensionality Reduction, etc. (10 marks)**

**2. Characterization of data, explanation and description of techniques used for the variation in the accuracy across three training splits (10% / 20% / 30%) using cross-validation techniques. (30 marks)**

**3. Interpret and explain the results obtained, discuss overfitting / underfitting / generalisation, provide a rationale for the chosen model and use visualisations to support your findings. Comments in Python code, conclusions of the project should be specified at the end of the report. Harvard Style must be used for citations and references. (20 marks)**

**4. Each team member presents a PowerPoint presentation of their work (maximum 5 slides) to emphasize their distinctive contributions based on their involvement in the project's conceptual understanding, code development, and deployment. (20 marks individual)**

**5. Each team member fully describes their individual contributions to the project in a reflective journal of 500 to 700 words, using images, diagrams, figures, and visualizations to elaborate on their work. (20 marks individual)**


#### Introduction

#### Data Collection

In [22]:
import pandas as pd
import missingno as msno
import matplotlib.pyplot as plt  # the plotting API lives in the pyplot submodule
import numpy as np
import seaborn as sns
df = pd.read_csv("Traffic_Crashes_-_Crashes.csv")

In [23]:
df.shape

(746498, 49)

In [24]:
pd.set_option('display.max_columns', None)
df.head()

Unnamed: 0,CRASH_RECORD_ID,RD_NO,CRASH_DATE_EST_I,CRASH_DATE,POSTED_SPEED_LIMIT,TRAFFIC_CONTROL_DEVICE,DEVICE_CONDITION,WEATHER_CONDITION,LIGHTING_CONDITION,FIRST_CRASH_TYPE,TRAFFICWAY_TYPE,LANE_CNT,ALIGNMENT,ROADWAY_SURFACE_COND,ROAD_DEFECT,REPORT_TYPE,CRASH_TYPE,INTERSECTION_RELATED_I,NOT_RIGHT_OF_WAY_I,HIT_AND_RUN_I,DAMAGE,DATE_POLICE_NOTIFIED,PRIM_CONTRIBUTORY_CAUSE,SEC_CONTRIBUTORY_CAUSE,STREET_NO,STREET_DIRECTION,STREET_NAME,BEAT_OF_OCCURRENCE,PHOTOS_TAKEN_I,STATEMENTS_TAKEN_I,DOORING_I,WORK_ZONE_I,WORK_ZONE_TYPE,WORKERS_PRESENT_I,NUM_UNITS,MOST_SEVERE_INJURY,INJURIES_TOTAL,INJURIES_FATAL,INJURIES_INCAPACITATING,INJURIES_NON_INCAPACITATING,INJURIES_REPORTED_NOT_EVIDENT,INJURIES_NO_INDICATION,INJURIES_UNKNOWN,CRASH_HOUR,CRASH_DAY_OF_WEEK,CRASH_MONTH,LATITUDE,LONGITUDE,LOCATION
0,012c5bfce715efb2f2b387d6dd86f9c13e9dc1809fb52a...,JG341943,,07/12/2023 03:05:00 PM,30,NO CONTROLS,NO CONTROLS,CLEAR,DAYLIGHT,TURNING,NOT DIVIDED,,STRAIGHT AND LEVEL,UNKNOWN,NO DEFECTS,NOT ON SCENE (DESK REPORT),NO INJURY / DRIVE AWAY,,,,"OVER $1,500",07/15/2023 11:30:00 AM,IMPROPER TURNING/NO SIGNAL,UNABLE TO DETERMINE,4754,W,63RD ST,813.0,,,,,,,2,NO INDICATION OF INJURY,0.0,0.0,0.0,0.0,0.0,2.0,0.0,15,4,7,41.778542,-87.742065,POINT (-87.742064741348 41.778541938106)
1,01d457f032e23d935a0b8f6b4c88221375180ffd4cd959...,JG338388,,07/12/2023 05:50:00 PM,30,NO CONTROLS,NO CONTROLS,CLEAR,DAYLIGHT,REAR END,NOT DIVIDED,,STRAIGHT AND LEVEL,DRY,NO DEFECTS,NOT ON SCENE (DESK REPORT),NO INJURY / DRIVE AWAY,,,,"$501 - $1,500",07/12/2023 06:41:00 PM,FOLLOWING TOO CLOSELY,NOT APPLICABLE,8300,S,PULASKI RD,834.0,,,,,,,2,NO INDICATION OF INJURY,0.0,0.0,0.0,0.0,0.0,2.0,0.0,17,4,7,41.742131,-87.721824,POINT (-87.72182410033 41.742130554062)
2,02249b4747a4bf40b88a8357304a98dfeaef9c38eebbf0...,JG350008,,07/12/2023 02:00:00 PM,30,NO CONTROLS,NO CONTROLS,CLEAR,DAYLIGHT,OTHER OBJECT,NOT DIVIDED,,STRAIGHT AND LEVEL,DRY,"RUT, HOLES",NOT ON SCENE (DESK REPORT),NO INJURY / DRIVE AWAY,,Y,,"$501 - $1,500",07/21/2023 10:10:00 AM,NOT APPLICABLE,NOT APPLICABLE,9615,S,STONY ISLAND AVE,431.0,,,,,,,1,NO INDICATION OF INJURY,0.0,0.0,0.0,0.0,0.0,1.0,0.0,14,4,7,41.719844,-87.58479,POINT (-87.584789974824 41.719844228292)
3,03e3b6caad71b78ed9ae325648effa9512bfb2517aed30...,JG338049,,07/12/2023 07:05:00 AM,30,TRAFFIC SIGNAL,UNKNOWN,FREEZING RAIN/DRIZZLE,DAYLIGHT,REAR END,NOT DIVIDED,,STRAIGHT AND LEVEL,UNKNOWN,UNKNOWN,NOT ON SCENE (DESK REPORT),NO INJURY / DRIVE AWAY,,,,"$501 - $1,500",07/12/2023 02:18:00 PM,FOLLOWING TOO CLOSELY,UNABLE TO DETERMINE,2370,N,ASHLAND AVE,1811.0,,,,,,,2,NO INDICATION OF INJURY,0.0,0.0,0.0,0.0,0.0,2.0,0.0,7,4,7,41.925105,-87.668291,POINT (-87.668291181568 41.925104953308)
4,0481fc919b38f1572d4ba04b069766102d904a662ff096...,JG338431,,07/12/2023 06:30:00 PM,30,NO CONTROLS,NO CONTROLS,CLEAR,DAYLIGHT,REAR END,NOT DIVIDED,,STRAIGHT AND LEVEL,WET,NO DEFECTS,NOT ON SCENE (DESK REPORT),NO INJURY / DRIVE AWAY,,,,"OVER $1,500",07/12/2023 07:15:00 PM,FOLLOWING TOO CLOSELY,FOLLOWING TOO CLOSELY,5200,N,ELSTON AVE,1623.0,,,,,,,2,NO INDICATION OF INJURY,0.0,0.0,0.0,0.0,0.0,2.0,0.0,18,4,7,41.975258,-87.751991,POINT (-87.751990557158 41.97525809527)


In [25]:
#nan_graph_1=msno.matrix(df.sample(axis=1, n=49))

In [26]:
df.isnull().sum()

CRASH_RECORD_ID                       0
RD_NO                              4307
CRASH_DATE_EST_I                 690109
CRASH_DATE                            0
POSTED_SPEED_LIMIT                    0
TRAFFIC_CONTROL_DEVICE                0
DEVICE_CONDITION                      0
WEATHER_CONDITION                     0
LIGHTING_CONDITION                    0
FIRST_CRASH_TYPE                      0
TRAFFICWAY_TYPE                       0
LANE_CNT                         547494
ALIGNMENT                             0
ROADWAY_SURFACE_COND                  0
ROAD_DEFECT                           0
REPORT_TYPE                       21222
CRASH_TYPE                            0
INTERSECTION_RELATED_I           575368
NOT_RIGHT_OF_WAY_I               711724
HIT_AND_RUN_I                    513706
DAMAGE                                0
DATE_POLICE_NOTIFIED                  0
PRIM_CONTRIBUTORY_CAUSE               0
SEC_CONTRIBUTORY_CAUSE                0
STREET_NO                             0


"MOST_SEVERE_INJURY" will be our target variable. It has 5 possible outcomes and 1,630 null values. The classes are heavily imbalanced, which we will need to address: "NO INDICATION OF INJURY" occurs about 643k times, while "FATAL" occurs only 812 times.

In [27]:
df["MOST_SEVERE_INJURY"].unique()

array(['NO INDICATION OF INJURY', 'REPORTED, NOT EVIDENT',
       'NONINCAPACITATING INJURY', 'INCAPACITATING INJURY', nan, 'FATAL'],
      dtype=object)

In [28]:
df["MOST_SEVERE_INJURY"].value_counts()

NO INDICATION OF INJURY     643460
NONINCAPACITATING INJURY     57308
REPORTED, NOT EVIDENT        30560
INCAPACITATING INJURY        12728
FATAL                          812
Name: MOST_SEVERE_INJURY, dtype: int64
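
To put numbers on the imbalance in the target, the following sketch computes each class's share and the ratio between the largest and smallest class, using the counts printed in the `value_counts()` output. The variable names (`counts`, `share_pct`, `imbalance_ratio`) are illustrative and not part of the original notebook.

```python
import pandas as pd

# Class counts copied from the value_counts() output of MOST_SEVERE_INJURY.
counts = pd.Series({
    "NO INDICATION OF INJURY": 643460,
    "NONINCAPACITATING INJURY": 57308,
    "REPORTED, NOT EVIDENT": 30560,
    "INCAPACITATING INJURY": 12728,
    "FATAL": 812,
})

# Share of each class among the labelled rows, as a percentage.
share_pct = (counts / counts.sum() * 100).round(2)

# Number of majority-class rows per row of the rarest class.
imbalance_ratio = counts.max() / counts.min()

print(share_pct)
print(f"Majority-to-minority ratio: {imbalance_ratio:.0f}:1")
```

The majority class makes up roughly 86% of the labelled rows and "FATAL" about 0.1%, so a model that always predicts "NO INDICATION OF INJURY" would already reach an accuracy of about 86%.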

In [29]:
df["INJURIES_UNKNOWN"].unique()

array([ 0., nan])

In [30]:
# from ydata_profiling import ProfileReport
# slice_df = df.iloc[:, :10]
# report = ProfileReport(df, title='My Data', minimal=True)
# report.to_file("Crushes_in_Chicago.html")

We drop the near-constant variable "INJURIES_UNKNOWN": 99.8% of its values are zero, and the only other value it takes is NaN.

In [31]:
df = df.drop(columns=['INJURIES_UNKNOWN'])

We drop the features "INJURIES_TOTAL", "INJURIES_FATAL", "INJURIES_INCAPACITATING", "INJURIES_NON_INCAPACITATING", "INJURIES_REPORTED_NOT_EVIDENT" and "INJURIES_NO_INDICATION" because they count the outcomes that define our target variable. A model trained on them would rely on information that is only known once the outcome has been recorded (target leakage), so it would not be applicable for predicting new data.

In [32]:
df = df.drop(columns=["INJURIES_TOTAL","INJURIES_FATAL","INJURIES_INCAPACITATING","INJURIES_NON_INCAPACITATING","INJURIES_REPORTED_NOT_EVIDENT","INJURIES_NO_INDICATION"])

We drop "CRASH_RECORD_ID" because every value is unique, so it carries no predictive information.

In [33]:
df = df.drop(columns='CRASH_RECORD_ID')

We drop all variables with 65% or more missing values.

In [34]:
df = df.drop(columns=['CRASH_DATE_EST_I','LANE_CNT', 'INTERSECTION_RELATED_I', 'NOT_RIGHT_OF_WAY_I', 'HIT_AND_RUN_I', 'PHOTOS_TAKEN_I', 'STATEMENTS_TAKEN_I', 'DOORING_I', 'WORK_ZONE_I','WORK_ZONE_TYPE', 'WORKERS_PRESENT_I'])
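
The column list above was written out by hand. As a hedged sketch, the same 65% rule could be applied programmatically. The helper name `drop_sparse_columns` and the toy DataFrame are assumptions made for illustration and are not part of the original notebook.

```python
import numpy as np
import pandas as pd


def drop_sparse_columns(frame: pd.DataFrame, threshold: float = 0.65) -> pd.DataFrame:
    """Return a copy of `frame` without columns whose missing share is >= threshold."""
    missing_share = frame.isnull().mean()  # fraction of NaN values per column
    to_drop = missing_share[missing_share >= threshold].index
    return frame.drop(columns=to_drop)


# Toy data for illustration only (not the crash dataset).
toy = pd.DataFrame({
    "mostly_missing": [np.nan, np.nan, np.nan, "Y"],  # 75% missing
    "half_missing": [1.0, np.nan, 3.0, np.nan],       # 50% missing
    "complete": [10, 20, 30, 40],                     # 0% missing
})

cleaned = drop_sparse_columns(toy)
print(list(cleaned.columns))
```

Deriving the list from `isnull().mean()` keeps the rule and the dropped columns in sync if the dataset is refreshed, at the cost of hiding which columns were removed unless they are printed.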

In [35]:
df.shape

(746498, 30)

We drop all rows where the target variable is NaN.

In [36]:
df.dropna(subset=["MOST_SEVERE_INJURY"],inplace=True)

In [37]:
df.isnull().sum()

RD_NO                       4294
CRASH_DATE                     0
POSTED_SPEED_LIMIT             0
TRAFFIC_CONTROL_DEVICE         0
DEVICE_CONDITION               0
WEATHER_CONDITION              0
LIGHTING_CONDITION             0
FIRST_CRASH_TYPE               0
TRAFFICWAY_TYPE                0
ALIGNMENT                      0
ROADWAY_SURFACE_COND           0
ROAD_DEFECT                    0
REPORT_TYPE                21180
CRASH_TYPE                     0
DAMAGE                         0
DATE_POLICE_NOTIFIED           0
PRIM_CONTRIBUTORY_CAUSE        0
SEC_CONTRIBUTORY_CAUSE         0
STREET_NO                      0
STREET_DIRECTION               4
STREET_NAME                    1
BEAT_OF_OCCURRENCE             5
NUM_UNITS                      0
MOST_SEVERE_INJURY             0
CRASH_HOUR                     0
CRASH_DAY_OF_WEEK              0
CRASH_MONTH                    0
LATITUDE                    4897
LONGITUDE                   4897
LOCATION                    4897
dtype: int

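The remaining null values are in REPORT_TYPE, which appears to describe how police filed the report after the crash; in LATITUDE, LONGITUDE and LOCATION, which carry the same information as STREET_NO and STREET_NAME; in RD_NO; and in a few rows of STREET_DIRECTION, STREET_NAME and BEAT_OF_OCCURRENCE. We recommend dropping the three location columns together with REPORT_TYPE. The rows with missing STREET_DIRECTION, STREET_NAME or BEAT_OF_OCCURRENCE can also be dropped, since at most 10 rows are affected, which is not significant.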

In [38]:
df = df.drop(columns=['LATITUDE', 'LONGITUDE', 'LOCATION', 'REPORT_TYPE'])

In [39]:
df.dropna(subset=["STREET_DIRECTION", "STREET_NAME", "BEAT_OF_OCCURRENCE"],inplace=True)

After investigating the "RD_NO" feature further, we found that it also has only unique values, so, as with "CRASH_RECORD_ID", we drop it. Once it is removed, no missing values remain in the dataset.

In [40]:
df['RD_NO'].describe()

count       740565
unique      740565
top       JG341943
freq             1
Name: RD_NO, dtype: object

In [41]:
df = df.drop(columns='RD_NO')
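
The uniqueness check that led to dropping "RD_NO" and "CRASH_RECORD_ID" can be sketched as a small helper. The function name `identifier_like_columns` and the toy data are hypothetical and serve only as an illustration. A continuous numeric column can also have all-distinct values, so the result should be treated as a list of candidates to inspect, not as columns to drop automatically.

```python
import pandas as pd


def identifier_like_columns(frame: pd.DataFrame) -> list:
    """Return columns whose non-null values are all distinct."""
    return [
        col for col in frame.columns
        if frame[col].count() > 0 and frame[col].nunique() == frame[col].count()
    ]


# Toy data for illustration only (not the crash dataset).
toy = pd.DataFrame({
    "record_id": ["a1", "b2", "c3", "d4"],          # unique per row
    "weather": ["CLEAR", "CLEAR", "RAIN", "CLEAR"],  # repeated categories
    "crash_hour": [15, 17, 15, 7],                   # repeated values
})

print(identifier_like_columns(toy))
```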

In [50]:
df = df.drop(columns='CRASH_DATE')

In [56]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 744859 entries, 0 to 746497
Data columns (total 24 columns):
 #   Column                   Non-Null Count   Dtype  
---  ------                   --------------   -----  
 0   POSTED_SPEED_LIMIT       744859 non-null  int64  
 1   TRAFFIC_CONTROL_DEVICE   744859 non-null  object 
 2   DEVICE_CONDITION         744859 non-null  object 
 3   WEATHER_CONDITION        744859 non-null  object 
 4   LIGHTING_CONDITION       744859 non-null  object 
 5   FIRST_CRASH_TYPE         744859 non-null  object 
 6   TRAFFICWAY_TYPE          744859 non-null  object 
 7   ALIGNMENT                744859 non-null  object 
 8   ROADWAY_SURFACE_COND     744859 non-null  object 
 9   ROAD_DEFECT              744859 non-null  object 
 10  CRASH_TYPE               744859 non-null  object 
 11  DAMAGE                   744859 non-null  object 
 12  DATE_POLICE_NOTIFIED     744859 non-null  object 
 13  PRIM_CONTRIBUTORY_CAUSE  744859 non-null  object 
 14  SEC_

In [67]:
from sklearn.tree import DecisionTreeClassifier       
from sklearn.model_selection import train_test_split  
from sklearn import metrics                           
from sklearn import tree
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import NearMiss
import warnings
warnings.filterwarnings('ignore')

In [48]:
X=df.drop(columns="MOST_SEVERE_INJURY", inplace=False)
y=df['MOST_SEVERE_INJURY']

In [68]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.1, random_state = 1, stratify=y)
smte = SMOTE()
X_train, y_train = smte.fit_resample(X_train, y_train)

ValueError: could not convert string to float: 'TRAFFIC SIGNAL'
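
The `ValueError` arises because `X_train` still contains string columns such as TRAFFIC_CONTROL_DEVICE, and SMOTE (like the decision tree) expects numeric input. One possible fix, sketched below on toy data, is to encode the categorical columns after the split, fitting the encoder on the training portion only. The toy frames, the choice of `OrdinalEncoder`, and the `-1` code for unseen categories are assumptions made for illustration, not the notebook's actual preprocessing.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OrdinalEncoder
from sklearn.tree import DecisionTreeClassifier

# Toy stand-in for the crash features (values are illustrative).
X_train_toy = pd.DataFrame({
    "POSTED_SPEED_LIMIT": [30, 30, 25, 35],
    "TRAFFIC_CONTROL_DEVICE": ["NO CONTROLS", "TRAFFIC SIGNAL",
                               "NO CONTROLS", "TRAFFIC SIGNAL"],
})
y_train_toy = pd.Series(["NO INDICATION OF INJURY", "FATAL",
                         "NO INDICATION OF INJURY", "FATAL"])
X_test_toy = pd.DataFrame({
    "POSTED_SPEED_LIMIT": [30],
    "TRAFFIC_CONTROL_DEVICE": ["STOP SIGN/FLASHER"],  # unseen in training
})

# Every non-numeric column is treated as categorical.
categorical_cols = list(X_train_toy.select_dtypes(exclude="number").columns)

# Encoded categorical columns come first in the output, then the
# untouched numeric columns (remainder="passthrough").
preprocess = ColumnTransformer(
    [("cat",
      OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1),
      categorical_cols)],
    remainder="passthrough",
)

# Fit on the training split only, then reuse the fitted mapping on the test split.
X_train_enc = preprocess.fit_transform(X_train_toy)
X_test_enc = preprocess.transform(X_test_toy)

# The encoded arrays are numeric, so the classifier no longer raises.
clf_toy = DecisionTreeClassifier(random_state=1).fit(X_train_enc, y_train_toy)
print(clf_toy.predict(X_test_enc))
```

Once the features are numeric, `fit_resample` no longer fails on string values. Note that plain SMOTE interpolates between neighbouring rows, which can produce in-between values for integer category codes; imbalanced-learn also provides `SMOTENC` for data that mixes categorical and continuous features, which may be the better fit here.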

In [None]:
clf = DecisionTreeClassifier(max_depth = None, random_state = 1)
clf = clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))
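
Because "NO INDICATION OF INJURY" dominates the target, a single accuracy figure can look high even when the rare classes are never predicted correctly. The sketch below shows one way to evaluate a decision tree with stratified cross-validation and class-sensitive metrics. It uses synthetic data from `make_classification` as a stand-in for the encoded crash features, and the `max_depth=5` setting and metric choice are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data (about 90% / 10%) standing in for the real features.
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=10, n_informative=5,
    weights=[0.9, 0.1], random_state=1,
)

# Stratified folds keep the class proportions the same in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

metric_names = ["accuracy", "balanced_accuracy", "f1_macro"]
scores = cross_validate(
    DecisionTreeClassifier(max_depth=5, random_state=1),
    X_demo, y_demo, cv=cv, scoring=metric_names,
)

# Report mean and spread across folds for each metric.
for name in metric_names:
    fold_scores = scores[f"test_{name}"]
    print(f"{name}: mean={fold_scores.mean():.3f}, std={fold_scores.std():.3f}")
```

If resampling such as SMOTE is used, it should be applied inside each training fold only, so that synthetic samples never leak into the fold used for validation.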

#### Data Pre-processing

#### Model Selection

#### Data Splitting

#### Training and Test

#### Interpretation