## City of Toronto Collisions Data


The Total Collisions Dataset contains detailed records of motor vehicle collisions within the City of Toronto. It is described as a CSV file; this notebook loads the records from a GeoJSON file. The dataset uses the WGS84 coordinate reference system, so collision locations are stored as longitude and latitude. Key attributes include the location of each collision, whether it resulted in a fatality or injury, and the date and time of the event. Our analysis focuses on data from 2021 to 2024 to align with recent census data. The dataset also includes fields such as road surface conditions, visibility, lighting, and the types of vehicles involved, which describe the circumstances of each collision. By analyzing this dataset, we aim to identify high-risk areas and contributing factors in order to inform preventative strategies and improve road safety.

## Setup Notebook

In [1]:
# Import 3rd party libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import geopandas as gpd
import folium
from IPython.display import display
from shapely.geometry import Point

# Configure Notebook
%matplotlib inline
plt.style.use('fivethirtyeight')
sns.set_context("notebook")
import warnings
warnings.filterwarnings('ignore')

## Import GeoJSON Data

In [2]:
# Create a base map
map_2 = folium.Map(location=[43.6426, -79.3871], 
                   tiles='OpenStreetMap', 
                   zoom_start=10)

# Path to the GeoJSON file
geojson_file_path = "FATALS_KSI_4359710384762535516.geojson"

# Add the collision layer; report a clear error if the file is missing
try:
    folium.GeoJson(geojson_file_path, name="Collision Data").add_to(map_2)
except FileNotFoundError:
    print(f"Error: The file '{geojson_file_path}' was not found. Please verify the path.")

# Display the map directly in the notebook
display(map_2)

In [4]:
# Import dataset as a GeoDataFrame
collision = gpd.read_file('FATALS_KSI_4359710384762535516.geojson')

# View DataFrame
collision.head()

| | OBJECTID | INDEX_ | ACCNUM | DATE | TIME | STREET1 | STREET2 | OFFSET | ROAD_CLASS | DISTRICT | ... | AG_DRIV | REDLIGHT | ALCOHOL | DISABILITY | HOOD_158 | NEIGHBOURHOOD_158 | HOOD_140 | NEIGHBOURHOOD_140 | DIVISION | geometry |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 3363207 | 882024 | Sat, 07 Jan 2006 05:00:00 GMT | 2325 | STEELES AVE E | NINTH LINE ST | | Minor Arterial | Scarborough | ... | | | | | 144 | Morningside Heights | 131 | Rouge (131) | D42 | POINT (-79.22479 43.84275) |
| 1 | 2 | 3363869 | 882497 | Sun, 08 Jan 2006 05:00:00 GMT | 1828 | ISLINGTON AVE | GOLFDOWN DR | | Major Arterial | Etobicoke York | ... | Yes | | | | 5 | Elms-Old Rexdale | 5 | Elms-Old Rexdale (5) | D23 | POINT (-79.55809 43.72145) |
| 2 | 3 | 3363416 | 882174 | Mon, 09 Jan 2006 05:00:00 GMT | 1435 | KENNEDY RD | GLAMORGAN AVE | | Major Arterial | Scarborough | ... | | | | | 126 | Dorset Park | 126 | Dorset Park (126) | D41 | POINT (-79.28229 43.76945) |
| 3 | 4 | 3363879 | 882501 | Wed, 11 Jan 2006 05:00:00 GMT | 1120 | BARTLEY DR | JINNAH CRT | | Collector | North York | ... | Yes | | | | 43 | Victoria Village | 43 | Victoria Village (43) | D55 | POINT (-79.30799 43.72205) |
| 4 | 5 | 3371161 | 886230 | Sat, 21 Jan 2006 05:00:00 GMT | 1829 | MIDLAND AVE | GOODLAND GT | | Major Arterial | Scarborough | ... | Yes | | | | 128 | Agincourt South-Malvern West | 128 | Agincourt South-Malvern West (128) | D42 | POINT (-79.27559 43.77935) |


## Data Analysis

In [5]:
# Check the number of columns and rows
collision.shape

(976, 53)

In [6]:
# Check the columns in DataFrame
collision.columns

Index(['OBJECTID', 'INDEX_', 'ACCNUM', 'DATE', 'TIME', 'STREET1', 'STREET2',
       'OFFSET', 'ROAD_CLASS', 'DISTRICT', 'LATITUDE', 'LONGITUDE', 'ACCLOC',
       'TRAFFCTL', 'VISIBILITY', 'LIGHT', 'RDSFCOND', 'ACCLASS', 'IMPACTYPE',
       'INVTYPE', 'INVAGE', 'INJURY', 'FATAL_NO', 'INITDIR', 'VEHTYPE',
       'MANOEUVER', 'DRIVACT', 'DRIVCOND', 'PEDTYPE', 'PEDACT', 'PEDCOND',
       'CYCLISTYPE', 'CYCACT', 'CYCCOND', 'PEDESTRIAN', 'CYCLIST',
       'AUTOMOBILE', 'MOTORCYCLE', 'TRUCK', 'TRSN_CITY_VEH', 'EMERG_VEH',
       'PASSENGER', 'SPEEDING', 'AG_DRIV', 'REDLIGHT', 'ALCOHOL', 'DISABILITY',
       'HOOD_158', 'NEIGHBOURHOOD_158', 'HOOD_140', 'NEIGHBOURHOOD_140',
       'DIVISION', 'geometry'],
      dtype='object')

Based on the Toronto Police Service's Traffic Collisions Open Data (ASR-T-TBL-001), here is a description of each column in the dataset:

- OBJECTID: Unique identifier for each record in the dataset.
- INDEX_: Sequential number assigned to each collision event.
- ACCNUM: Unique accident number assigned by the police.
- DATE: Date when the collision occurred.
- TIME: Time of day when the collision occurred.
- STREET1: Primary street where the collision took place.
- STREET2: Secondary street involved in the collision (if applicable).
- OFFSET: Distance from the intersection or reference point.
- ROAD_CLASS: Classification of the road (e.g., arterial, collector).
- DISTRICT: City district where the collision occurred (e.g., Scarborough, North York).
- LATITUDE: Geographic latitude coordinate of the collision location.
- LONGITUDE: Geographic longitude coordinate of the collision location.
- ACCLOC: Specific location details of the accident.
- TRAFFCTL: Type of traffic control present at the collision site.
- VISIBILITY: Visibility conditions at the time of the collision.
- LIGHT: Lighting conditions during the collision (e.g., daylight, dark).
- RDSFCOND: Road surface conditions at the time of the collision.
- ACCLASS: Classification of the accident (e.g., fatal, non-fatal injury).
- IMPACTYPE: Type of impact during the collision (e.g., rear-end, side).
- INVTYPE: Type of individuals involved (e.g., driver, pedestrian).
- INVAGE: Age of the individuals involved in the collision.
- INJURY: Severity of injuries sustained (e.g., none, minor, fatal).
- FATAL_NO: Number of fatalities resulting from the collision.
- INITDIR: Initial direction of travel of the vehicles involved.
- VEHTYPE: Type of vehicles involved in the collision.
- MANOEUVER: Maneuver being performed by the vehicle at the time of collision.
- DRIVACT: Driver's action leading up to the collision.
- DRIVCOND: Driver's condition at the time of the collision (e.g., normal, impaired).
- PEDTYPE: Type of pedestrian involved (if applicable).
- PEDACT: Pedestrian's action leading up to the collision.
- PEDCOND: Pedestrian's condition at the time of the collision.
- CYCLISTYPE: Type of cyclist involved (if applicable).
- CYCACT: Cyclist's action leading up to the collision.
- CYCCOND: Cyclist's condition at the time of the collision.
- PEDESTRIAN: Indicator if a pedestrian was involved.
- CYCLIST: Indicator if a cyclist was involved.
- AUTOMOBILE: Indicator if an automobile was involved.
- MOTORCYCLE: Indicator if a motorcycle was involved.
- TRUCK: Indicator if a truck was involved.
- TRSN_CITY_VEH: Indicator if a city transit vehicle was involved.
- EMERG_VEH: Indicator if an emergency vehicle was involved.
- PASSENGER: Indicator if passengers were involved.
- SPEEDING: Indicator if speeding was a factor.
- AG_DRIV: Indicator if aggressive driving was a factor.
- REDLIGHT: Indicator if running a red light was a factor.
- ALCOHOL: Indicator if alcohol was a factor.
- DISABILITY: Indicator if a disability was a factor.
- HOOD_158: Neighborhood identifier based on 158 neighborhood divisions.
- NEIGHBOURHOOD_158: Name of the neighborhood (158 divisions).
- HOOD_140: Neighborhood identifier based on 140 neighborhood divisions.
- NEIGHBOURHOOD_140: Name of the neighborhood (140 divisions).
- DIVISION: Police division responsible for the area.
- geometry: Point geometry (longitude, latitude) of the collision location in WGS84.


To identify collision patterns within an area, not all columns in the dataset are essential. Below is an analysis of columns that can be dropped because they do not directly contribute to the understanding of collision patterns:
- OBJECTID: Unique record identifier; not useful for analysis.
- INDEX_: Sequential numbering of collisions; redundant information.
- ACCNUM: Police-assigned accident number; not relevant for pattern analysis.
- OFFSET: Distance from a reference point; not essential for broad collision patterns.
- HOOD_158: Numeric neighborhood identifier (already represented by NEIGHBOURHOOD_158).
- NEIGHBOURHOOD_158: Neighborhood name with 158 divisions; redundant if you use the 140-division version.
- HOOD_140: Numeric neighborhood identifier (already represented by NEIGHBOURHOOD_140).
- geometry: Point built from the same coordinates as LATITUDE and LONGITUDE; redundant while those two columns are kept.
- INITDIR: Initial direction of travel; unlikely to impact spatial collision patterns.
- TRAFFCTL: Type of traffic control; may not directly impact patterns in broader spatial or temporal analysis.
- INVTYPE: Type of individuals involved; focuses on the individuals rather than collision patterns.
- PEDTYPE, PEDACT, PEDCOND: Focus on pedestrian-specific details; drop if your analysis isn’t focused on pedestrian collisions.
- CYCLISTYPE, CYCACT, CYCCOND: Focus on cyclist-specific details; drop if your analysis isn’t focused on cyclist collisions.
- PASSENGER: Indicator for passengers involved; does not directly contribute to spatial or temporal patterns.
- TRSN_CITY_VEH: Indicator for transit vehicles; drop unless public transit-related collisions are of interest.
- DIVISION: Police division; not directly linked to spatial collision patterns if geographic data is already included.
- STREET1, STREET2, DISTRICT: Not needed, because the coordinates are used instead and can be joined to the Wards dataset.


Columns to Retain for Identifying Collision Patterns:
- DATE and TIME: Critical for understanding temporal collision patterns.
- LATITUDE and LONGITUDE: Key for mapping spatial collision patterns.
- ROAD_CLASS: Helps identify patterns based on road types.
- VISIBILITY and LIGHT: Key environmental factors affecting collisions.
- RDSFCOND: Road surface conditions, which can reveal environmental hazards.
- ACCLASS: Classification of the accident (fatal, non-fatal injury) for severity analysis.
- IMPACTYPE: Helps identify common collision types within an area.
- FATAL_NO: Indicates fatalities, important for severity analysis.
- SPEEDING, AG_DRIV, REDLIGHT, ALCOHOL: Critical behavioral factors contributing to collision patterns.
- NEIGHBOURHOOD_140: Simplified neighborhood representation for spatial analysis.
- AUTOMOBILE, MOTORCYCLE, TRUCK: Helps identify vehicle types commonly involved in collisions.

In [7]:
# Drop columns 
collision = collision.drop(columns = ['OBJECTID', 'INDEX_', 'ACCNUM', 'OFFSET', 'HOOD_158', 'NEIGHBOURHOOD_158','HOOD_140', 'geometry', 'INITDIR', 
                   'TRAFFCTL', 'INVTYPE', 'PEDTYPE', 'PEDACT', 'PEDCOND', 'CYCLISTYPE', 'CYCACT', 'CYCCOND', 'PASSENGER', 'TRSN_CITY_VEH', 
                   'DIVISION', 'STREET1', 'STREET2', 'DISTRICT', ], errors='ignore')

# Check if columns are removed
collision.columns

Index(['DATE', 'TIME', 'ROAD_CLASS', 'LATITUDE', 'LONGITUDE', 'ACCLOC',
       'VISIBILITY', 'LIGHT', 'RDSFCOND', 'ACCLASS', 'IMPACTYPE', 'INVAGE',
       'INJURY', 'FATAL_NO', 'VEHTYPE', 'MANOEUVER', 'DRIVACT', 'DRIVCOND',
       'PEDESTRIAN', 'CYCLIST', 'AUTOMOBILE', 'MOTORCYCLE', 'TRUCK',
       'EMERG_VEH', 'SPEEDING', 'AG_DRIV', 'REDLIGHT', 'ALCOHOL', 'DISABILITY',
       'NEIGHBOURHOOD_140'],
      dtype='object')

In [8]:
# Check data types per column
print(collision.dtypes)

DATE                  object
TIME                  object
ROAD_CLASS            object
LATITUDE             float64
LONGITUDE            float64
ACCLOC                object
VISIBILITY            object
LIGHT                 object
RDSFCOND              object
ACCLASS               object
IMPACTYPE             object
INVAGE                object
INJURY                object
FATAL_NO             float64
VEHTYPE               object
MANOEUVER             object
DRIVACT               object
DRIVCOND              object
PEDESTRIAN            object
CYCLIST               object
AUTOMOBILE            object
MOTORCYCLE            object
TRUCK                 object
EMERG_VEH             object
SPEEDING              object
AG_DRIV               object
REDLIGHT              object
ALCOHOL               object
DISABILITY            object
NEIGHBOURHOOD_140     object
dtype: object
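
The dtypes above show DATE and TIME stored as `object`, so they need parsing before any temporal analysis or before restricting the data to 2021 to 2024. The following is a minimal sketch on toy rows, assuming DATE follows the `"Sat, 07 Jan 2006 05:00:00 GMT"` pattern seen in the `head()` output and TIME is an HHMM value without leading zeros. The helper `add_time_features` and the YEAR and HOUR columns are hypothetical names, not part of the original notebook.

```python
import pandas as pd

# Toy rows in the same shape as the DATE and TIME columns shown above;
# the values are illustrative, not taken from the dataset.
sample = pd.DataFrame({
    "DATE": ["Sat, 07 Jan 2006 05:00:00 GMT",
             "Fri, 12 Mar 2021 05:00:00 GMT",
             "Tue, 30 Jul 2024 04:00:00 GMT"],
    "TIME": ["2325", "815", "1435"],
})

def add_time_features(df):
    """Return a copy with DATE parsed to datetime plus YEAR and HOUR columns."""
    out = df.copy()
    out["DATE"] = pd.to_datetime(out["DATE"], format="%a, %d %b %Y %H:%M:%S GMT")
    out["YEAR"] = out["DATE"].dt.year
    # Assumption: TIME is HHMM without leading zeros, e.g. "815" means 08:15
    out["HOUR"] = out["TIME"].astype(str).str.zfill(4).str[:2].astype(int)
    return out

parsed = add_time_features(sample)

# Keep only the study period mentioned in the introduction
recent = parsed[parsed["YEAR"].between(2021, 2024)]
print(recent[["DATE", "YEAR", "HOUR"]])
```

Working on a copy keeps the original columns untouched, so the parsing step can be rerun safely while the format assumption is being checked against the real file.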


## Data Cleaning

Data cleaning for the collision dataset ensures:

- Accuracy: Removes errors and inconsistencies.
- Efficiency: Streamlines the dataset for quicker and easier analysis.
- Reliability: Produces trustworthy insights and recommendations.
- Focus: Tailors the data for the specific analysis of collision patterns and trends.

In [9]:
# Check for missing values
print(collision.isnull().sum())

DATE                   0
TIME                   0
ROAD_CLASS            25
LATITUDE               0
LONGITUDE              0
ACCLOC               262
VISIBILITY            12
LIGHT                  2
RDSFCOND              13
ACCLASS                1
IMPACTYPE              1
INVAGE                 0
INJURY                 0
FATAL_NO             111
VEHTYPE              417
MANOEUVER            642
DRIVACT              697
DRIVCOND             699
PEDESTRIAN           430
CYCLIST              930
AUTOMOBILE           162
MOTORCYCLE           874
TRUCK                868
EMERG_VEH            975
SPEEDING             782
AG_DRIV              534
REDLIGHT             915
ALCOHOL              930
DISABILITY           956
NEIGHBOURHOOD_140      0
dtype: int64


In [10]:
# Function to handle missing data for both numerical and categorical columns
def handle_missing_data(collision):
    for column in collision.columns:
        if collision[column].isnull().sum() > 0:  # Check for missing values
            if pd.api.types.is_numeric_dtype(collision[column]):
                # Numerical data: use mean or median based on skewness
                skewness = collision[column].skew()
                if abs(skewness) < 0.5:  # Roughly symmetric distribution
                    impute_value = collision[column].mean()
                    print(f"Imputing missing values in numerical column '{column}' with mean: {impute_value:.2f}")
                else:  # Skewed distribution
                    impute_value = collision[column].median()
                    print(f"Imputing missing values in numerical column '{column}' with median: {impute_value:.2f}")
                # Assign back rather than calling fillna(inplace=True) on a column
                # selection, which newer pandas versions warn about
                collision[column] = collision[column].fillna(impute_value)
            else:
                # Categorical data: use the mode (most frequent value)
                mode_value = collision[column].mode()[0]
                print(f"Imputing missing values in categorical column '{column}' with mode: '{mode_value}'")
                collision[column] = collision[column].fillna(mode_value)
    return collision

# Handle missing data (the function modifies `collision` itself and returns it)
collision_handled = handle_missing_data(collision)


Imputing missing values in categorical column 'ROAD_CLASS' with mode: 'Major Arterial'
Imputing missing values in categorical column 'ACCLOC' with mode: 'At Intersection'
Imputing missing values in categorical column 'VISIBILITY' with mode: 'Clear'
Imputing missing values in categorical column 'LIGHT' with mode: 'Daylight'
Imputing missing values in categorical column 'RDSFCOND' with mode: 'Dry'
Imputing missing values in categorical column 'ACCLASS' with mode: 'Fatal'
Imputing missing values in categorical column 'IMPACTYPE' with mode: 'Pedestrian Collisions'
Imputing missing values in numerical column 'FATAL_NO' with mean: 28.87
Imputing missing values in categorical column 'VEHTYPE' with mode: 'Other'
Imputing missing values in categorical column 'MANOEUVER' with mode: 'Going Ahead'
Imputing missing values in categorical column 'DRIVACT' with mode: 'Lost control'
Imputing missing values in categorical column 'DRIVCOND' with mode: 'Unknown'
Imputing missing values in categorical colu
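
The log above shows mode imputation applied to every categorical column. For indicator columns such as SPEEDING or ALCOHOL, the `head()` output suggests the only recorded value is 'Yes'. In that case the mode is also 'Yes', and every blank would become a positive flag. Below is a minimal sketch of an alternative on toy data, assuming a blank means the factor did not apply. The helper `fill_indicators`, the list `indicator_columns`, and the 'No' label are hypothetical, and the assumption should be checked against the dataset's documentation.

```python
import pandas as pd

# Toy frame mimicking two indicator columns and one ordinary categorical column;
# the values are illustrative only.
flags = pd.DataFrame({
    "SPEEDING": ["Yes", None, None, None],
    "ALCOHOL": [None, None, "Yes", None],
    "ROAD_CLASS": ["Major Arterial", None, "Collector", "Major Arterial"],
})

# Assumption: these columns hold 'Yes' when the factor applied and are blank otherwise
indicator_columns = ["SPEEDING", "ALCOHOL"]

def fill_indicators(df, columns, absent_label="No"):
    """Return a copy where blanks in the given indicator columns become absent_label."""
    out = df.copy()
    out[columns] = out[columns].fillna(absent_label)
    return out

filled = fill_indicators(flags, indicator_columns)
print(filled["SPEEDING"].value_counts())
```

Columns outside `indicator_columns` (ROAD_CLASS here) are left untouched, so they could still go through the mode-based function afterwards.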

In [11]:
# Verify that missing values have been handled
print(collision_handled.isnull().sum())

DATE                 0
TIME                 0
ROAD_CLASS           0
LATITUDE             0
LONGITUDE            0
ACCLOC               0
VISIBILITY           0
LIGHT                0
RDSFCOND             0
ACCLASS              0
IMPACTYPE            0
INVAGE               0
INJURY               0
FATAL_NO             0
VEHTYPE              0
MANOEUVER            0
DRIVACT              0
DRIVCOND             0
PEDESTRIAN           0
CYCLIST              0
AUTOMOBILE           0
MOTORCYCLE           0
TRUCK                0
EMERG_VEH            0
SPEEDING             0
AG_DRIV              0
REDLIGHT             0
ALCOHOL              0
DISABILITY           0
NEIGHBOURHOOD_140    0
dtype: int64


All missing (null) values have now been imputed. Next, let's remove duplicate rows.

In [12]:
# Remove Duplicates
collision.drop_duplicates(inplace=True)

Now let's remove outliers with the Interquartile Range method. 

In [13]:
# Function to calculate IQR and remove outliers
def remove_outliers(collision,column):
    Q1 = collision[column].quantile(0.25)  # 25th percentile
    Q3 = collision[column].quantile(0.75)  # 75th percentile
    IQR = Q3 - Q1                   # Inter-Quartile Range
    
    # Define lower and upper bounds
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    
    # Filter out rows with outliers
    return collision[(collision[column] >= lower_bound) & (collision[column] <= upper_bound)]

# List of numeric columns
numeric_columns = collision.select_dtypes(include=['float64', 'int64']).columns

# Remove outliers using the IQR method
for column in numeric_columns:
    collision = remove_outliers(collision, column)
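
Because the loop above filters on every numeric column, including LATITUDE and LONGITUDE, it can help to see how many rows each column would remove before applying it. The following is a minimal sketch on toy data that uses the same 1.5 × IQR rule. The helpers `iqr_bounds` and `count_outliers` are hypothetical names introduced for illustration.

```python
import pandas as pd

def iqr_bounds(series, k=1.5):
    """Return the (lower, upper) bounds of the IQR rule for a numeric Series."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def count_outliers(df):
    """Return, per numeric column, how many rows fall outside the IQR bounds."""
    counts = {}
    for column in df.select_dtypes(include="number").columns:
        lower, upper = iqr_bounds(df[column])
        outside = (df[column] < lower) | (df[column] > upper)
        counts[column] = int(outside.sum())
    return pd.Series(counts, name="outliers")

# Toy data: one clearly extreme value in an otherwise tight column
toy = pd.DataFrame({"value": [1, 2, 3, 4, 100], "label": list("abcde")})
print(count_outliers(toy))
```

Calling `count_outliers(collision)` before the filtering loop would show which columns drive the row loss, which is useful when a column holds coordinates rather than a measured quantity.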

That is all for data cleaning. Let's see the new size of the dataset.

In [14]:
collision.shape

(972, 30)

## Import Ward Data

To predict the number of collisions in Toronto, the collision data must be spatially joined with the ward boundaries to determine which ward each collision occurred in. This process involves using the geographic coordinates (latitude and longitude) from the collision data and mapping them to the corresponding ward polygons in the ward dataset. After assigning collisions to their respective wards, we can aggregate the number of collisions per ward to identify trends and high-risk areas. This data can then be enriched with additional ward-specific features, such as population, road density, and traffic volume, to build a predictive model. By training a machine learning model with these features, we can forecast collision counts and provide actionable insights for city planning and traffic safety improvements.

In [3]:
# Load the ward shapefile
ward_shapefile_path = "WARD_WGS84.shp"
ward_data = gpd.read_file(ward_shapefile_path)

# Create a base map centered on Toronto
map_wards = folium.Map(location=[43.7, -79.4], zoom_start=11)

# Add the ward boundaries to the map
ward_data_json = ward_data.to_json()  # Convert GeoDataFrame to GeoJSON
folium.GeoJson(ward_data_json, name="Wards").add_to(map_wards)

# Display the map (a bare expression in the middle of a cell is not rendered)
display(map_wards)

# View GeoDataFrame
ward_data.head()

| | AREA_ID | AREA_TYPE | AREA_S_CD | AREA_L_CD | AREA_NAME | X | Y | LONGITUDE | LATITUDE | geometry |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2551040 | WD18 | 16 | 16 | Don Valley East | 318237.29 | 4844000.0 | -79.33298 | 43.739716 | POLYGON ((-79.31335 43.71699, -79.31950 43.715... |
| 1 | 2551044 | WD18 | 3 | 3 | Etobicoke-Lakeshore | 303099.474 | 4831000.0 | -79.52087 | 43.621646 | POLYGON ((-79.49777 43.65198, -79.49725 43.651... |
| 2 | 2551048 | WD18 | 15 | 15 | Don Valley West | 314825.876 | 4843000.0 | -79.37536 | 43.728396 | POLYGON ((-79.35232 43.71573, -79.35209 43.715... |
| 3 | 2551052 | WD18 | 23 | 23 | Scarborough North | 324522.149 | 4852000.0 | -79.25467 | 43.809672 | POLYGON ((-79.22591 43.83960, -79.22556 43.839... |
| 4 | 2551056 | WD18 | 11 | 11 | University-Rosedale | 313306.543 | 4837000.0 | -79.39432 | 43.671139 | POLYGON ((-79.39004 43.69050, -79.39004 43.690... |


## Overlay Collision Data onto Ward Data 


Overlaying the collision data onto the ward data is essential for spatial analysis and understanding where collisions are occurring within the city. By mapping each collision to a specific ward, we can identify patterns and trends in collision occurrences relative to geographic boundaries. This allows for aggregating the number of collisions per ward, which is critical for targeted analysis, policy-making, and resource allocation. For example, high-collision wards can be prioritized for road safety improvements or public awareness campaigns. Additionally, integrating collision data with ward-specific attributes such as population, traffic volume, or road density enables more accurate predictive modeling and helps address traffic safety issues more effectively.


Converting collision data into a GeoDataFrame is essential for spatial analysis, as it allows for operations like spatial joins, overlays, and mapping. By creating a geometry column from LATITUDE and LONGITUDE, each collision is represented as a precise point in space. Assigning a Coordinate Reference System (CRS), such as EPSG:4326 (WGS84), ensures the data aligns accurately with other spatial datasets, like ward boundaries. This conversion enables mapping collisions to specific wards, visualizing spatial patterns, and ensuring compatibility with geospatial tools, making it a critical step for reliable and accurate analysis.

In [24]:
# Convert collision data to a GeoDataFrame with point geometry in WGS84
collision = gpd.GeoDataFrame(
    collision,
    geometry=gpd.points_from_xy(collision["LONGITUDE"], collision["LATITUDE"]),
    crs="EPSG:4326",
)
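
With both layers available as GeoDataFrames, the ward assignment and per-ward counts described above can be sketched as follows. This is a minimal illustration on toy geometries, not the notebook's own join. It assumes a geopandas version that accepts the `predicate` argument of `gpd.sjoin`, and that the ward name lives in an AREA_NAME column as in the ward table. The helper `collisions_per_ward` and the `collision_count` label are hypothetical.

```python
import geopandas as gpd
from shapely.geometry import Point, box

# Toy stand-ins for the ward polygons and collision points (illustrative values)
wards = gpd.GeoDataFrame(
    {"AREA_NAME": ["Ward A", "Ward B"]},
    geometry=[box(0, 0, 1, 1), box(1, 0, 2, 1)],
    crs="EPSG:4326",
)
points = gpd.GeoDataFrame(
    {"ACCLASS": ["Fatal", "Fatal", "Non-Fatal Injury"]},
    geometry=[Point(0.5, 0.5), Point(0.2, 0.8), Point(1.5, 0.5)],
    crs="EPSG:4326",
)

def collisions_per_ward(points, wards, name_column="AREA_NAME"):
    """Assign each point to the ward containing it and count points per ward."""
    # Reproject the points if the two layers use different coordinate systems
    if points.crs != wards.crs:
        points = points.to_crs(wards.crs)
    joined = gpd.sjoin(
        points, wards[[name_column, "geometry"]], how="left", predicate="within"
    )
    return joined.groupby(name_column).size().rename("collision_count")

counts = collisions_per_ward(points, wards)
print(counts)
```

A left join keeps collisions that fall outside every ward polygon (their ward name is missing), so they can be inspected separately instead of disappearing silently; the `groupby` then counts only the matched points.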
