# California Traffic Collision Data from SWITRS

This data comes from the California Highway Patrol (CHP) and covers collisions from January 1, 2001 to mid-December 2020. Full database dumps have been requested from the CHP five times, once each in 2016, 2017, 2018, 2020, and 2021.

There are three main tables:

`collisions`: Contains information about each collision: where it happened and which vehicles were involved.  
`parties`: Contains information about the groups of people involved in the collision, including age, sex, and sobriety.  
`victims`: Contains information about the injuries of specific people involved in the collision.  

There is also a table called `case_ids`, which is used to build the other tables. It records which of the five original database dumps each row came from.
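As a rough illustration of how `case_ids` links rows to a database dump, the sketch below counts rows per `db_year`. It runs against a small in-memory stand-in rather than the real `switrs.sqlite`; the connection name `demo_con` and the inserted rows are made up for the example, and storing both columns as text is an assumption.

```python
import sqlite3

import pandas as pd

# In-memory stand-in for switrs.sqlite; these rows are invented for illustration.
demo_con = sqlite3.connect(":memory:")
demo_con.execute("CREATE TABLE case_ids (case_id TEXT, db_year TEXT)")
demo_con.executemany(
    "INSERT INTO case_ids VALUES (?, ?)",
    [("1", "2016"), ("2", "2016"), ("3", "2018"),
     ("4", "2021"), ("5", "2021"), ("6", "2021")],
)

# Count how many case IDs were taken from each database dump.
rows_per_dump = pd.read_sql_query(
    """
    SELECT db_year, COUNT(*) AS n_rows
    FROM case_ids
    GROUP BY db_year
    ORDER BY db_year
    """,
    demo_con,
)
print(rows_per_dump)
```

Pointing the same query at the real connection should give the number of rows contributed by each of the five dumps.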

In [1]:
import pandas as pd
import numpy as np
import sqlite3
import plotly.express as px
import folium
from folium import plugins
from folium.plugins import HeatMap
import pygal 
import matplotlib.pyplot as plt
import seaborn as sns
import plotly
import plotly.graph_objects as go
from pygal.style import Style
from IPython.display import display, HTML
plotly.offline.init_notebook_mode(connected=True)

In [2]:
# Create a SQL connection to SQLite database
con = sqlite3.connect('/Users/sunhe/Desktop/NUS_Semester1/DSA5104/project/data/switrs.sqlite')

In [3]:
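# List the tables in the database.
# Note: `sqlite_schema` needs SQLite 3.33 or newer; `sqlite_master` is the older, equivalent name.
table = pd.read_sql_query(
    """
    SELECT name
    FROM sqlite_schema
    WHERE type = 'table'
    """, con)
table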

Unnamed: 0,name
0,case_ids
1,collisions
2,victims
3,parties


In [4]:
# Load the full case_ids table to inspect its columns
case_ids = pd.read_sql_query(
    """
    SELECT *
    FROM case_ids;
    """, con)

In [5]:
case_ids.shape

(9424334, 2)
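With roughly 9.4 million rows, loading a whole table only to look at its columns is expensive. A lighter option, sketched below, is to ask SQLite for the column list with `PRAGMA table_info` and to preview a few rows with `LIMIT`. The example uses a hypothetical in-memory table (`demo_con`) in place of the real database, and the `TEXT` column types are an assumption.

```python
import sqlite3

import pandas as pd

# Hypothetical stand-in with the same two columns as case_ids.
demo_con = sqlite3.connect(":memory:")
demo_con.execute("CREATE TABLE case_ids (case_id TEXT, db_year TEXT)")

# PRAGMA table_info returns one row per column (cid, name, type, notnull, dflt_value, pk).
schema = pd.read_sql_query("PRAGMA table_info(case_ids)", demo_con)
print(schema[["name", "type"]])

# Preview a handful of rows instead of reading the whole table.
preview = pd.read_sql_query("SELECT * FROM case_ids LIMIT 5", demo_con)
print(preview)
```

The same two queries should work on `collisions`, `parties`, and `victims`, where a full `SELECT *` would be considerably heavier.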

In [6]:
case_ids.head()

Unnamed: 0,case_id,db_year
0,81715,2021
1,726202,2021
2,3858022,2021
3,3899441,2021
4,3899442,2021


In [7]:
# check missing values
case_ids.isnull().sum()

case_id    0
db_year    0
dtype: int64
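Because `case_ids` records where each row came from, it may also be worth checking whether any `case_id` appears more than once, alongside the missing-value check. The sketch below shows one way to do that on a small made-up frame (`demo`); whether the real table contains duplicates is not known from this notebook.

```python
import pandas as pd

# Made-up sample: case_id "2" appears in two different dumps.
demo = pd.DataFrame({
    "case_id": ["1", "2", "2", "3"],
    "db_year": ["2016", "2016", "2021", "2021"],
})

# Number of rows whose case_id repeats an earlier row.
n_dup = int(demo["case_id"].duplicated().sum())

# IDs that are attributed to more than one database dump.
years_per_id = demo.groupby("case_id")["db_year"].nunique()
conflicting_ids = years_per_id[years_per_id > 1].index.tolist()

print(n_dup, conflicting_ids)
```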

We clean and visualize each table separately. The `case_ids` table is handled in this notebook; the processing for the other three tables is saved in `Data processing of Collisions.ipynb`, `Data processing of Victims.ipynb`, and `Data processing of Parties.ipynb`.

In [8]:
# Save cleaned data ('w' overwrites the file, so re-running the cell does not append a second copy)
with open('/Users/sunhe/Desktop/NUS_Semester1/DSA5104/project/data/clean_data/clean_case_ids.csv',
          'w', encoding='utf8', newline="") as f:
    case_ids.to_csv(f, header=True, index=False)
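A quick round-trip check can confirm that the saved file reads back as expected. The sketch below writes a tiny made-up frame to a temporary directory and reloads it; the leading-zero ID is hypothetical and is there only to show why `dtype=str` can matter if identifiers are stored as text.

```python
import os
import tempfile

import pandas as pd

# Made-up sample; the leading-zero ID is hypothetical.
demo = pd.DataFrame({
    "case_id": ["0081715", "726202"],
    "db_year": ["2021", "2021"],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "clean_case_ids.csv")
    demo.to_csv(out_path, index=False, encoding="utf8")
    # dtype=str keeps identifiers as text, so leading zeros are not dropped.
    restored = pd.read_csv(out_path, dtype=str)

print(restored)
```

Without `dtype=str`, pandas would infer integer columns here and the hypothetical ID `0081715` would come back as `81715`.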