# Crashlytics

In [21]:
try:
    import pyspark
except ModuleNotFoundError:
    !pip3 install pyspark
    import pyspark
try:
    import pandas as pd
except ModuleNotFoundError:
    !pip3 install pandas
    import pandas as pd
import csv  # standard library; imported unconditionally rather than only when pandas is missing
try:
    import matplotlib.pyplot as plt
except ModuleNotFoundError:
    !pip3 install matplotlib
    import matplotlib.pyplot as plt

In [2]:
from pyspark import SparkContext
from pyspark.sql import SparkSession
from pyspark.sql.types import StructField, StructType, StringType, LongType, IntegerType, FloatType
import pyspark.sql.functions as F
from pyspark.sql.types import *
from pyspark.ml import Pipeline
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.feature import OneHotEncoder, StringIndexer, VectorAssembler, IndexToString, PCA
from pyspark.ml.evaluation import MulticlassClassificationEvaluator, BinaryClassificationEvaluator

In [3]:
# Spark runs on the JVM, so Java may need to be installed for this to work
ss = SparkSession.builder.master("local").appName("crashlytics").getOrCreate()

23/11/09 13:36:06 WARN Utils: Your hostname, elliottmac.local resolves to a loopback address: 127.0.0.1; using 104.39.47.166 instead (on interface en0)
23/11/09 13:36:06 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
23/11/09 13:36:07 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


In [4]:
raw_df = ss.read.csv("smallest_crash_data.csv", header=True, inferSchema=True)

                                                                                

In [5]:
raw_df.printSchema()
raw_df.show(5)

root
 |-- ID: string (nullable = true)
 |-- Source: string (nullable = true)
 |-- Severity: integer (nullable = true)
 |-- Start_Time: timestamp (nullable = true)
 |-- End_Time: timestamp (nullable = true)
 |-- Start_Lat: double (nullable = true)
 |-- Start_Lng: double (nullable = true)
 |-- End_Lat: string (nullable = true)
 |-- End_Lng: string (nullable = true)
 |-- Distance(mi): double (nullable = true)
 |-- Description: string (nullable = true)
 |-- Street: string (nullable = true)
 |-- City: string (nullable = true)
 |-- County: string (nullable = true)
 |-- State: string (nullable = true)
 |-- Zipcode: string (nullable = true)
 |-- Country: string (nullable = true)
 |-- Timezone: string (nullable = true)
 |-- Airport_Code: string (nullable = true)
 |-- Weather_Timestamp: timestamp (nullable = true)
 |-- Temperature(F): double (nullable = true)
 |-- Wind_Chill(F): double (nullable = true)
 |-- Humidity(%): double (nullable = true)
 |-- Pressure(in): double (nullable = true)
 |-- V

23/11/09 13:36:21 WARN SparkStringUtils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.


+---+-------+--------+-------------------+-------------------+-----------------+------------------+-------+-------+------------+--------------------+--------------------+------------+----------+-----+----------+-------+----------+------------+-------------------+--------------+-------------+-----------+------------+--------------+--------------+---------------+-----------------+-----------------+-------+-----+--------+--------+--------+-------+-------+----------+-------+-----+---------------+--------------+------------+--------------+--------------+-----------------+---------------------+
| ID| Source|Severity|         Start_Time|           End_Time|        Start_Lat|         Start_Lng|End_Lat|End_Lng|Distance(mi)|         Description|              Street|        City|    County|State|   Zipcode|Country|  Timezone|Airport_Code|  Weather_Timestamp|Temperature(F)|Wind_Chill(F)|Humidity(%)|Pressure(in)|Visibility(mi)|Wind_Direction|Wind_Speed(mph)|Precipitation(in)|Weather_Condition|Ameni

## Imbalance in Data

Severity 2 and 3 together account for 199 of the 200 crashes in this sample, and the crash duration (how long traffic was affected) is recorded as exactly 30 minutes for 151 of them. The data is therefore heavily imbalanced, which we will have to account for when building our classifier.

In [6]:
severity_count = raw_df.groupBy(F.col("Severity")).count()
severity_count.show()

# Duration of each crash's traffic impact, in minutes (End_Time - Start_Time)
duration_min = (F.col("End_Time").cast("long") - F.col("Start_Time").cast("long")) / 60
time_count = raw_df.withColumn("time_diff", duration_min).groupBy("time_diff").count()
time_count.orderBy(F.col("count").desc()).show()

+--------+-----+
|Severity|count|
+--------+-----+
|       1|    1|
|       3|   81|
|       2|  118|
+--------+-----+

+------------------+-----+
|         time_diff|count|
+------------------+-----+
|              30.0|  151|
|              45.0|   21|
|              60.0|    2|
|             170.0|    1|
|             235.0|    1|
|             180.0|    1|
|             350.0|    1|
|              98.0|    1|
|             144.0|    1|
|             177.0|    1|
|             409.0|    1|
|             871.0|    1|
|             314.0|    1|
|             137.0|    1|
| 72.98333333333333|    1|
|             186.0|    1|
|             156.0|    1|
|             173.0|    1|
|              72.0|    1|
|48.983333333333334|    1|
+------------------+-----+
only showing top 20 rows
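
One common way to account for this kind of imbalance is inverse-frequency class weighting, where rare classes receive larger weights. The following is a minimal plain-Python sketch, not part of the Spark pipeline: the helper name `inverse_frequency_weights` is hypothetical, and the counts are copied from the severity table above.

```python
def inverse_frequency_weights(class_counts):
    """Return a weight per class so that rarer classes count for more.

    weight(c) = total_rows / (number_of_classes * count(c))
    """
    total = sum(class_counts.values())
    n_classes = len(class_counts)
    return {label: total / (n_classes * count) for label, count in class_counts.items()}


# Severity counts copied from the groupBy output above
severity_counts = {1: 1, 3: 81, 2: 118}
weights = inverse_frequency_weights(severity_counts)

for label in sorted(weights):
    print(label, round(weights[label], 3))
```

Weights like these could be attached to each row as an extra column and handed to a classifier that accepts per-row weights. Whether weighting, resampling, or merging the single Severity 1 row into another class is the better remedy here is an open choice.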



## TODO: clean data

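The string columns have to be removed or converted to numbers before modeling. The options are to index them as integer categories, to drop them, or to feed them to a separate model that understands text. The cell below indexes four categorical columns and drops the rest.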
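
To make "turning them into integer categories" concrete, here is a minimal plain-Python sketch of frequency-based indexing, in which the most frequent string receives index 0. It is meant to mirror the idea behind `StringIndexer`, not to reproduce its implementation; the function `index_by_frequency` and the sample values are hypothetical.

```python
from collections import Counter


def index_by_frequency(values):
    """Map each distinct string to a float index, most frequent first."""
    counts = Counter(values)
    # Ties are broken alphabetically so the mapping is deterministic
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {value: float(i) for i, value in enumerate(ordered)}


# Hypothetical sample of a "State" column
sample_states = ["OH", "CA", "OH", "PA", "CA", "OH"]
mapping = index_by_frequency(sample_states)
encoded = [mapping[s] for s in sample_states]

print(mapping)
print(encoded)
```

The resulting integers carry no natural order. If that arbitrary ordering turns out to matter for a given model, the `OneHotEncoder` imported earlier is one alternative to consider.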

In [18]:
# Index the categorical string columns we keep as integer category IDs
labelIndexer = StringIndexer(inputCols=["State", "Street", "Weather_Condition", "Sunrise_Sunset"],
                             outputCols=["StateId", "StreetId", "Weather_Id", "DaytimeId"]).fit(raw_df)
transformed_data = labelIndexer.transform(raw_df)
# Drop identifiers, timestamps, free text, and the raw string columns indexed above
cleaned_df = transformed_data.drop("ID", "Airport_Code", "Zipcode", "Source", "Start_Time", "End_Time", "End_Lat", "End_Lng",
                                   "Description", "City", "County", "Country", "Timezone", "Weather_Timestamp", "Wind_Direction",
                                   "Civil_Twilight", "Nautical_Twilight", "Astronomical_Twilight", "Turning_Loop", "State", "Weather_Condition", "Street", "Sunrise_Sunset")

cleaned_df.show()

+--------+-----------------+------------------+------------+--------------+-------------+-----------+------------+--------------+---------------+-----------------+-------+-----+--------+--------+--------+-------+-------+----------+-------+-----+---------------+--------------+-------+--------+----------+---------+
|Severity|        Start_Lat|         Start_Lng|Distance(mi)|Temperature(F)|Wind_Chill(F)|Humidity(%)|Pressure(in)|Visibility(mi)|Wind_Speed(mph)|Precipitation(in)|Amenity| Bump|Crossing|Give_Way|Junction|No_Exit|Railway|Roundabout|Station| Stop|Traffic_Calming|Traffic_Signal|StateId|StreetId|Weather_Id|DaytimeId|
+--------+-----------------+------------------+------------+--------------+-------------+-----------+------------+--------------+---------------+-----------------+-------+-----+--------+--------+--------+-------+-------+----------+-------+-----+---------------+--------------+-------+--------+----------+---------+
|       3|        39.865147|        -84.058723|        

## PCA and K-Means Clustering

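This dataset has a large number of features, so we will perform PCA to reduce its dimensionality and to see which features account for most of the variation in the data. PCA is unsupervised, so it summarizes how the features vary together rather than directly measuring their effect on crash severity.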
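
As a rough illustration of what PCA reports, the sketch below standardizes a small feature matrix and computes the share of variance captured by each principal component. It uses NumPy instead of the notebook's `pyspark.ml` classes, and both the helper `explained_variance_ratio` and the toy matrix are hypothetical. Standardizing first is an assumption made here because the crash features are on very different scales (for example pressure in inches versus true/false road flags).

```python
import numpy as np


def explained_variance_ratio(X):
    """Share of total variance captured by each principal component."""
    X = np.asarray(X, dtype=float)
    std = X.std(axis=0, ddof=1)
    std[std == 0] = 1.0  # leave constant columns unscaled to avoid dividing by zero
    Z = (X - X.mean(axis=0)) / std
    # Singular values of the standardized data give the component variances
    singular_values = np.linalg.svd(Z, compute_uv=False)
    variances = singular_values ** 2 / (len(Z) - 1)
    return variances / variances.sum()


# Toy data: the first two columns are strongly correlated, the third is not
toy = [
    [1.0, 2.1, 0.0],
    [2.0, 3.9, 1.0],
    [3.0, 6.2, 0.0],
    [4.0, 7.8, 1.0],
]
ratios = explained_variance_ratio(toy)
print(ratios)
```

When a few components capture most of the variance, as happens with the correlated columns in this toy matrix, the data can be described with fewer dimensions than it has features.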

In [20]:
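# Use every column except the label (Severity) as an input feature, so the target does not leak into the features
column_names = [c for c in cleaned_df.columns if c != "Severity"]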
assembler = VectorAssembler(inputCols=column_names, outputCol="features", handleInvalid="keep")
assembled_data = assembler.transform(cleaned_df)
assembled_data.show(5)

+--------+-----------------+------------------+------------+--------------+-------------+-----------+------------+--------------+---------------+-----------------+-------+-----+--------+--------+--------+-------+-------+----------+-------+-----+---------------+--------------+-------+--------+----------+---------+--------------------+
|Severity|        Start_Lat|         Start_Lng|Distance(mi)|Temperature(F)|Wind_Chill(F)|Humidity(%)|Pressure(in)|Visibility(mi)|Wind_Speed(mph)|Precipitation(in)|Amenity| Bump|Crossing|Give_Way|Junction|No_Exit|Railway|Roundabout|Station| Stop|Traffic_Calming|Traffic_Signal|StateId|StreetId|Weather_Id|DaytimeId|            features|
+--------+-----------------+------------------+------------+--------------+-------------+-----------+------------+--------------+---------------+-----------------+-------+-----+--------+--------+--------+-------+-------+----------+-------+-----+---------------+--------------+-------+--------+----------+---------+--------------

## Decision Tree Classifier

## Classification with XGBoost