# Toronto Traffic Collisions EDA

In this notebook, we explore and analyze Toronto Traffic Collision Data from 2014 to 2024 to identify key trends, patterns, and insights related to traffic incidents in the city. The dataset provides detailed information about collision dates, times, locations, vehicle types involved, and the severity of outcomes such as injuries and fatalities.

**Objectives:**
- Understand Collision Trends: Analyze how collision rates change over time (by year, month, and time of day).
- Location-Based Insights: Identify neighborhoods or intersections with higher collision frequencies.
- Collision Types: Investigate the involvement of different modes of transportation (e.g., automobiles, bicycles, pedestrians).
- Severity Analysis: Examine the frequency and patterns of fatal and non-fatal collisions.
- Data Cleaning: Handle missing data and outliers to ensure the analysis is robust.
  
**Why This Analysis Matters:**  
Traffic collisions can have a significant impact on public safety and urban planning. By analyzing historical data, we can gain insights to help inform policies, improve traffic management, and potentially reduce future incidents.

<hr>

## Data Import and Overview

### Import necessary libraries

In [6]:
# import libraries 
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

### Load dataset and display first 5 rows

In [2]:
df = pd.read_csv("/Users/raghadibrahim/Desktop/toronto-collisions-eda/data/Traffic_Collisions.csv")
df.head() # show first five rows

| | _id | OCC_DATE | OCC_MONTH | OCC_DOW | OCC_YEAR | OCC_HOUR | DIVISION | FATALITIES | INJURY_COLLISIONS | FTR_COLLISIONS | ... | HOOD_158 | NEIGHBOURHOOD_158 | LONG_WGS84 | LAT_WGS84 | AUTOMOBILE | MOTORCYCLE | PASSENGER | BICYCLE | PEDESTRIAN | geometry |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1388552400000 | January | Wednesday | 2014 | 13 | D23 | NaN | NO | NO | ... | 006 | Kingsview Village-The Westway (6) | -79.558639 | 43.694246 | YES | NO | NO | NO | NO | {"type": "MultiPoint", "coordinates": [[-79.55... |
| 1 | 2 | 1388552400000 | January | Wednesday | 2014 | 19 | D42 | NaN | NO | YES | ... | 128 | Agincourt South-Malvern West (128) | -79.281506 | 43.784746 | YES | NO | NO | NO | NO | {"type": "MultiPoint", "coordinates": [[-79.28... |
| 2 | 3 | 1388552400000 | January | Wednesday | 2014 | 2 | NSA | NaN | YES | NO | ... | NSA | NSA | 0.0 | 0.0 | YES | NO | NO | NO | NO | {"type": "MultiPoint", "coordinates": [[5.6843... |
| 3 | 4 | 1388552400000 | January | Wednesday | 2014 | 3 | NSA | NaN | NO | NO | ... | NSA | NSA | 0.0 | 0.0 | YES | NO | NO | NO | NO | {"type": "MultiPoint", "coordinates": [[5.6843... |
| 4 | 5 | 1388552400000 | January | Wednesday | 2014 | 5 | NSA | NaN | YES | NO | ... | NSA | NSA | 0.0 | 0.0 | YES | NO | NO | NO | NO | {"type": "MultiPoint", "coordinates": [[5.6843... |


### Display important info

In [4]:
# show dataframe info
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 704704 entries, 0 to 704703
Data columns (total 21 columns):
 #   Column             Non-Null Count   Dtype  
---  ------             --------------   -----  
 0   _id                704704 non-null  int64  
 1   OCC_DATE           704704 non-null  int64  
 2   OCC_MONTH          704704 non-null  object 
 3   OCC_DOW            704704 non-null  object 
 4   OCC_YEAR           704704 non-null  int64  
 5   OCC_HOUR           704704 non-null  int64  
 6   DIVISION           704704 non-null  object 
 7   FATALITIES         606 non-null     float64
 8   INJURY_COLLISIONS  704700 non-null  object 
 9   FTR_COLLISIONS     704700 non-null  object 
 10  PD_COLLISIONS      704700 non-null  object 
 11  HOOD_158           704704 non-null  object 
 12  NEIGHBOURHOOD_158  704704 non-null  object 
 13  LONG_WGS84         704704 non-null  float64
 14  LAT_WGS84          704704 non-null  float64
 15  AUTOMOBILE         704700 non-null  object 
 16  MO

<br>

The output above shows that the dataset has $704{,}704$ rows and $21$ columns. Each row represents a single reported traffic collision in Toronto and records when, where, and how it occurred, along with its severity and the types of vehicles and people involved.
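
The `df.info()` output lists `OCC_DATE` as `int64`, and the first rows show values such as `1388552400000`. The sketch below assumes these integers are milliseconds since the Unix epoch and that Toronto local time is the intended reading. Neither is documented in this notebook. It uses a toy frame named `sample` (a hypothetical name) rather than the full `df`.

```python
import pandas as pd

# Toy frame reusing the OCC_DATE value shown in the first rows above.
sample = pd.DataFrame({"OCC_DATE": [1388552400000]})

# Assumption: the integers are milliseconds since the Unix epoch (UTC).
occ_utc = pd.to_datetime(sample["OCC_DATE"], unit="ms", utc=True)

# Assumption: Toronto local time is the intended interpretation.
sample["OCC_DATETIME"] = occ_utc.dt.tz_convert("America/Toronto")

print(sample["OCC_DATETIME"].iloc[0])
print(sample["OCC_DATETIME"].dt.day_name().iloc[0])
```

Under these assumptions the converted value falls on a Wednesday in January 2014, which agrees with the `OCC_DOW`, `OCC_MONTH`, and `OCC_YEAR` columns of the same row.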

### Key Statistics

In [3]:
# dataframe key statistics
df.describe()

| | _id | OCC_DATE | OCC_YEAR | OCC_HOUR | FATALITIES | LONG_WGS84 | LAT_WGS84 |
|---|---|---|---|---|---|---|---|
| count | 704704.0 | 704704.0 | 704704.0 | 704704.0 | 606.0 | 704704.0 | 704704.0 |
| mean | 352352.5 | 1553559000000.0 | 2018.729875 | 13.468612 | 1.014851 | -66.342691 | 36.528215 |
| std | 203430.666387 | 97843820000.0 | 3.09903 | 4.976665 | 0.145831 | 29.423231 | 16.200397 |
| min | 1.0 | 1388552000000.0 | 2014.0 | 0.0 | 1.0 | -79.639247 | 0.0 |
| 25% | 176176.75 | 1472706000000.0 | 2016.0 | 10.0 | 1.0 | -79.444829 | 43.644346 |
| 50% | 352352.5 | 1545109000000.0 | 2018.0 | 14.0 | 1.0 | -79.370469 | 43.6925 |
| 75% | 528528.25 | 1642568000000.0 | 2022.0 | 17.0 | 1.0 | -79.258521 | 43.75148 |
| max | 704704.0 | 1727672000000.0 | 2024.0 | 23.0 | 3.0 | 0.0 | 43.853164 |
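
The coordinate columns deserve a closer look. `LONG_WGS84` has a maximum of `0.0` and `LAT_WGS84` a minimum of `0.0`, and the first rows show `0.0, 0.0` alongside `NSA` in the neighbourhood columns. This pulls the means (about -66.3 and 36.5) well away from the quartiles, which sit near -79.4 and 43.7. A minimal sketch of one way to handle this follows. It assumes `(0.0, 0.0)` is a placeholder for an unknown location, which the notebook does not state. The toy frame `sample` and the mask name `no_location` are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy frame: two coordinate pairs from the first rows above plus one (0.0, 0.0) pair.
sample = pd.DataFrame({
    "LONG_WGS84": [-79.558639, 0.0, -79.281506],
    "LAT_WGS84": [43.694246, 0.0, 43.784746],
})

# Assumption: a (0.0, 0.0) pair marks an unknown location, not a real point.
no_location = (sample["LONG_WGS84"] == 0) & (sample["LAT_WGS84"] == 0)

# Replace placeholder coordinates with NaN so summary statistics skip them.
sample.loc[no_location, ["LONG_WGS84", "LAT_WGS84"]] = np.nan

print(no_location.sum())
print(sample[["LONG_WGS84", "LAT_WGS84"]].mean())
```

Setting the values to `NaN` rather than dropping the rows keeps those collisions available for analyses that do not depend on location.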


## Data Cleaning and Preprocessing

### Missing Values

In [5]:
df.isnull().sum()

_id                       0
OCC_DATE                  0
OCC_MONTH                 0
OCC_DOW                   0
OCC_YEAR                  0
OCC_HOUR                  0
DIVISION                  0
FATALITIES           704098
INJURY_COLLISIONS         4
FTR_COLLISIONS            4
PD_COLLISIONS             4
HOOD_158                  0
NEIGHBOURHOOD_158         0
LONG_WGS84                0
LAT_WGS84                 0
AUTOMOBILE                4
MOTORCYCLE                4
PASSENGER                 4
BICYCLE                   4
PEDESTRIAN                4
geometry                  0
dtype: int64
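
Two patterns stand out in these counts. `FATALITIES` is missing in 704,098 of 704,704 rows, and its recorded minimum is `1.0`, so a value of zero never appears. The eight collision-flag columns are each missing exactly 4 values. The sketch below shows one possible treatment, under two assumptions that the notebook does not confirm: a missing `FATALITIES` value means no fatality was recorded, and the 4 missing flags fall in the same rows. The toy frame `sample` and the names `flag_cols` and `cleaned` are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy frame with the same kinds of gaps as the real data.
sample = pd.DataFrame({
    "FATALITIES": [np.nan, 1.0, np.nan, np.nan],
    "INJURY_COLLISIONS": ["NO", "YES", None, "NO"],
    "AUTOMOBILE": ["YES", "YES", None, "N/R"],
})

# Assumption: a missing FATALITIES value means no fatality was recorded.
sample["FATALITIES"] = sample["FATALITIES"].fillna(0).astype(int)

# Assumption: rows missing every flag carry no usable collision detail.
flag_cols = ["INJURY_COLLISIONS", "AUTOMOBILE"]
cleaned = sample.dropna(subset=flag_cols, how="all")

print(cleaned)
```

Using `how="all"` removes a row only when every listed flag is missing, so rows with partial information are kept.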

### Data Types
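
The `df.info()` output shows the month, day-of-week, and flag columns stored as `object`. A minimal sketch of one way to tighten these types follows, assuming an ordered categorical is wanted for months and a plain categorical for flags. The toy frame `sample` and the list `month_order` are hypothetical names.

```python
import pandas as pd

# Toy frame with one text column for months and one flag column.
sample = pd.DataFrame({
    "OCC_MONTH": ["March", "January", "January"],
    "BICYCLE": ["NO", "YES", "N/R"],
})

month_order = [
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
]

# An ordered categorical makes months sort by calendar order, not alphabetically.
sample["OCC_MONTH"] = pd.Categorical(
    sample["OCC_MONTH"], categories=month_order, ordered=True
)

# A plain categorical stores the repeated flag labels more compactly.
sample["BICYCLE"] = sample["BICYCLE"].astype("category")

print(sample.dtypes)
print(sample.sort_values("OCC_MONTH"))
```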

### Check for Duplicates
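
The `_id` summary in the key statistics (count 704,704, minimum 1, maximum 704,704) is consistent with unique identifiers, although it does not prove uniqueness. A minimal sketch of a duplicate check follows, shown on a toy frame named `sample` (hypothetical) that contains one repeated row.

```python
import pandas as pd

# Toy frame in which the row with _id 2 appears twice.
sample = pd.DataFrame({
    "_id": [1, 2, 2, 3],
    "OCC_HOUR": [13, 19, 19, 2],
})

# Rows that repeat an earlier row in every column.
full_dupes = sample.duplicated().sum()

# Rows that repeat an earlier identifier, even if other columns differ.
id_dupes = sample.duplicated(subset="_id").sum()

# Keep the first occurrence of each fully duplicated row.
deduped = sample.drop_duplicates()

print(full_dupes, id_dupes, len(deduped))
```

Checking both the full row and the identifier separates exact copies from records that share an `_id` but disagree elsewhere.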

## Exploratory Data Analysis (EDA)

In [10]:
df['MOTORCYCLE'].value_counts()

MOTORCYCLE
NO     694183
N/R      6246
YES      4271
Name: count, dtype: int64

In [11]:
df['AUTOMOBILE'].value_counts()

AUTOMOBILE
YES    695103
N/R      6246
NO       3351
Name: count, dtype: int64

In [12]:
df['BICYCLE'].value_counts()

BICYCLE
NO     687298
YES     11156
N/R      6246
Name: count, dtype: int64

In [13]:
df['PEDESTRIAN'].value_counts()

PEDESTRIAN
NO     680900
YES     17554
N/R      6246
Name: count, dtype: int64

In [14]:
df['PASSENGER'].value_counts()

PASSENGER
NO     644971
YES     53483
N/R      6246
Name: count, dtype: int64

In [15]:
df['INJURY_COLLISIONS'].value_counts()

INJURY_COLLISIONS
NO     610082
YES     94618
Name: count, dtype: int64

In [16]:
df['FATALITIES'].value_counts()

FATALITIES
1.0    599
2.0      5
3.0      2
Name: count, dtype: int64

In [17]:
df['PD_COLLISIONS'].value_counts()

PD_COLLISIONS
YES    508411
NO     196289
Name: count, dtype: int64
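
The per-column counts above can also be collected into one table, which makes it easier to see that `N/R` appears 6,246 times in each of the five road-user columns. The notebook does not define `N/R`; "not recorded" is a plausible reading but remains an assumption. The sketch below builds such a table on a toy frame. The names `sample`, `flag_cols`, and `counts` are hypothetical.

```python
import pandas as pd

# Toy frame with the same three labels as the real flag columns.
sample = pd.DataFrame({
    "AUTOMOBILE": ["YES", "YES", "N/R", "NO"],
    "BICYCLE": ["NO", "YES", "N/R", "NO"],
    "PEDESTRIAN": ["NO", "NO", "N/R", "YES"],
})

flag_cols = ["AUTOMOBILE", "BICYCLE", "PEDESTRIAN"]

# One row per column, one column per label; labels a column lacks become 0.
counts = (
    pd.DataFrame({col: sample[col].value_counts() for col in flag_cols})
    .T.fillna(0)
    .astype(int)
)

# Share of rows flagged YES, relative to all rows with any label.
counts["YES_SHARE"] = counts["YES"] / counts[["N/R", "NO", "YES"]].sum(axis=1)

print(counts)
```

On the full data, the same pattern would be applied to `df` with all five road-user columns in `flag_cols`.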

## Insights and Findings

## Conclusion

## Next Steps