# Spark Random Forest Implementation

## Chicago Crime Use Case

You are provided with a dataset of crime records from Chicago. The dataset belongs to the Chicago Police Department and reflects reported incidents of crime that occurred in the City of Chicago from 2012 to 2017. The data is extracted from the Chicago Police Department's CLEAR (Citizen Law Enforcement Analysis and Reporting) system.

## Dataset Understanding

## Objective

Our objective is to use this information to build a system that classifies the **FBI Code** of each crime based on the given information.

Columns in the Dataset:

**ID** - Unique identifier for the record.

**Case Number** - The Chicago Police Department RD Number (Records Division Number), which is unique to the incident.

**Date** - Date when the incident occurred. This is sometimes a best estimate.

**Block** - The partially redacted address where the incident occurred, placing it on the same block as the actual address.

**IUCR** - The Illinois Uniform Crime Reporting code. This is directly linked to the Primary Type and Description.

**Primary Type** - The primary description of the IUCR code.

**Description** - The secondary description of the IUCR code, a subcategory of the primary description.

**Location Description** - Description of the location where the incident occurred.

**Arrest** - Indicates whether an arrest was made.

**Domestic** - Indicates whether the incident was domestic-related as defined by the Illinois Domestic Violence Act.

**Beat** - Indicates the beat where the incident occurred. A beat is the smallest police geographic area – each beat has a dedicated police beat car. Three to five beats make up a police sector, and three sectors make up a police district. The Chicago Police Department has 22 police districts.

**District** - Indicates the police district where the incident occurred.

**Ward** - The ward (City Council district) where the incident occurred.

**Community Area** - Indicates the community area where the incident occurred. Chicago has 77 community areas.

**FBI Code** - Indicates the crime classification as outlined in the FBI's National Incident-Based Reporting System (NIBRS).

**X Coordinate** - The x coordinate of the location where the incident occurred in State Plane Illinois East NAD 1983 projection. This location is shifted from the actual location for partial redaction but falls on the same block.

**Y Coordinate** - The y coordinate of the location where the incident occurred in State Plane Illinois East NAD 1983 projection. This location is shifted from the actual location for partial redaction but falls on the same block.

**Year** - Year the incident occurred.

**Updated On** - Date and time the record was last updated.

**Latitude** - The latitude of the location where the incident occurred. This location is shifted from the actual location for partial redaction but falls on the same block.

**Longitude** - The longitude of the location where the incident occurred. This location is shifted from the actual location for partial redaction but falls on the same block.

**Location** - The location where the incident occurred in a format that allows for creation of maps and other geographic operations on this data portal. This location is shifted from the actual location for partial redaction but falls on the same block.

### Initialising the Spark session

In [1]:
%%configure -f
{ "conf":{
          "spark.pyspark.python": "python3",
          "spark.pyspark.virtualenv.enabled": "true",
          "spark.pyspark.virtualenv.type":"native",
          "spark.pyspark.virtualenv.bin.path":"/usr/bin/virtualenv"
         }
}

In [2]:
from pyspark import SparkContext, SparkConf

Starting Spark application


ID,YARN Application ID,Kind,State,Spark UI,Driver log,Current session?
1,application_1609123214595_0002,pyspark,idle,Link,Link,✔


SparkSession available as 'spark'.


In [3]:
sc = SparkContext.getOrCreate()

In [4]:
sc

<SparkContext master=yarn appName=livy-session-1>

In [5]:
sc.list_packages()

Package                    Version
-------------------------- -------------------
beautifulsoup4             4.8.1
boto                       2.49.0
jmespath                   0.9.4
lxml                       4.4.2
mysqlclient                1.4.6
nltk                       3.4.5
nose                       1.3.4
numpy                      1.14.5
pip                        20.3.3
py-dateutil                2.2
python36-sagemaker-pyspark 1.2.6
pytz                       2019.3
PyYAML                     3.11
setuptools                 51.1.0.post20201221
six                        1.13.0
soupsieve                  1.9.5
wheel                      0.36.2
windmill                   1.6

In [6]:
sc.install_pypi_package("pandas==0.25.1")

Collecting pandas==0.25.1
  Downloading pandas-0.25.1-cp36-cp36m-manylinux1_x86_64.whl (10.5 MB)
Collecting python-dateutil>=2.6.1
  Downloading python_dateutil-2.8.1-py2.py3-none-any.whl (227 kB)
Installing collected packages: python-dateutil, pandas
Successfully installed pandas-0.25.1 python-dateutil-2.8.1

In [7]:
sc.install_pypi_package("matplotlib==3.1.1", "https://pypi.org/simple")

Collecting matplotlib==3.1.1
  Downloading matplotlib-3.1.1-cp36-cp36m-manylinux1_x86_64.whl (13.1 MB)
Collecting cycler>=0.10
  Downloading cycler-0.10.0-py2.py3-none-any.whl (6.5 kB)
Collecting kiwisolver>=1.0.1
  Downloading kiwisolver-1.3.1-cp36-cp36m-manylinux1_x86_64.whl (1.1 MB)
Collecting pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.1
  Downloading pyparsing-2.4.7-py2.py3-none-any.whl (67 kB)
Installing collected packages: pyparsing, kiwisolver, cycler, matplotlib
Successfully installed cycler-0.10.0 kiwisolver-1.3.1 matplotlib-3.1.1 pyparsing-2.4.7

In [8]:
sc.list_packages()

Package                    Version
-------------------------- -------------------
beautifulsoup4             4.8.1
boto                       2.49.0
cycler                     0.10.0
jmespath                   0.9.4
kiwisolver                 1.3.1
lxml                       4.4.2
matplotlib                 3.1.1
mysqlclient                1.4.6
nltk                       3.4.5
nose                       1.3.4
numpy                      1.14.5
pandas                     0.25.1
pip                        20.3.3
py-dateutil                2.2
pyparsing                  2.4.7
python-dateutil            2.8.1
python36-sagemaker-pyspark 1.2.6
pytz                       2019.3
PyYAML                     3.11
setuptools                 51.1.0.post20201221
six                        1.13.0
soupsieve                  1.9.5
wheel                      0.36.2
windmill                   1.6

### Loading the dataset

In [9]:
df = spark.read.csv('s3a://rf-dataset/Chicago_Crimes_2012_to_2017.csv', header = True, inferSchema = False)

In [10]:
# Printing the first row
df.head(1)

[Row(_c0='3', ID='10508693', Case Number='HZ250496', Date='05/03/2016 11:40:00 PM', Block='013XX S SAWYER AVE', IUCR='0486', Primary Type='BATTERY', Description='DOMESTIC BATTERY SIMPLE', Location Description='APARTMENT', Arrest='True', Domestic='True', Beat='1022', District='10.0', Ward='24.0', Community Area='29.0', FBI Code='08B', X Coordinate='1154907.0', Y Coordinate='1893681.0', Year='2016', Updated On='05/10/2016 03:56:50 PM', Latitude='41.864073157', Longitude='-87.706818608', Location='(41.864073157, -87.706818608)')]

In [11]:
# Schema of the dataset
df.printSchema()

root
 |-- _c0: string (nullable = true)
 |-- ID: string (nullable = true)
 |-- Case Number: string (nullable = true)
 |-- Date: string (nullable = true)
 |-- Block: string (nullable = true)
 |-- IUCR: string (nullable = true)
 |-- Primary Type: string (nullable = true)
 |-- Description: string (nullable = true)
 |-- Location Description: string (nullable = true)
 |-- Arrest: string (nullable = true)
 |-- Domestic: string (nullable = true)
 |-- Beat: string (nullable = true)
 |-- District: string (nullable = true)
 |-- Ward: string (nullable = true)
 |-- Community Area: string (nullable = true)
 |-- FBI Code: string (nullable = true)
 |-- X Coordinate: string (nullable = true)
 |-- Y Coordinate: string (nullable = true)
 |-- Year: string (nullable = true)
 |-- Updated On: string (nullable = true)
 |-- Latitude: string (nullable = true)
 |-- Longitude: string (nullable = true)
 |-- Location: string (nullable = true)

In [12]:
# Count the total number of rows
df.count()


1456714

In [14]:
# Print the first 5 rows
df.head(5)

[Row(_c0='3', ID='10508693', Case Number='HZ250496', Date='05/03/2016 11:40:00 PM', Block='013XX S SAWYER AVE', IUCR='0486', Primary Type='BATTERY', Description='DOMESTIC BATTERY SIMPLE', Location Description='APARTMENT', Arrest='True', Domestic='True', Beat='1022', District='10.0', Ward='24.0', Community Area='29.0', FBI Code='08B', X Coordinate='1154907.0', Y Coordinate='1893681.0', Year='2016', Updated On='05/10/2016 03:56:50 PM', Latitude='41.864073157', Longitude='-87.706818608', Location='(41.864073157, -87.706818608)'), Row(_c0='89', ID='10508695', Case Number='HZ250409', Date='05/03/2016 09:40:00 PM', Block='061XX S DREXEL AVE', IUCR='0486', Primary Type='BATTERY', Description='DOMESTIC BATTERY SIMPLE', Location Description='RESIDENCE', Arrest='False', Domestic='True', Beat='313', District='3.0', Ward='20.0', Community Area='42.0', FBI Code='08B', X Coordinate='1183066.0', Y Coordinate='1864330.0', Year='2016', Updated On='05/10/2016 03:56:50 PM', Latitude='41.782921527', Longi

In [15]:
# Show 5 rows in tabular format
df.show(5)

+---+--------+-----------+--------------------+-------------------+----+--------------------+--------------------+--------------------+------+--------+----+--------+----+--------------+--------+------------+------------+----+--------------------+------------+-------------+--------------------+
|_c0|      ID|Case Number|                Date|              Block|IUCR|        Primary Type|         Description|Location Description|Arrest|Domestic|Beat|District|Ward|Community Area|FBI Code|X Coordinate|Y Coordinate|Year|          Updated On|    Latitude|    Longitude|            Location|
+---+--------+-----------+--------------------+-------------------+----+--------------------+--------------------+--------------------+------+--------+----+--------+----+--------------+--------+------------+------------+----+--------------------+------------+-------------+--------------------+
|  3|10508693|   HZ250496|05/03/2016 11:40:...| 013XX S SAWYER AVE|0486|             BATTERY|DOMESTIC BATTERY ...| 

In [16]:
# get all columns
df.columns

['_c0', 'ID', 'Case Number', 'Date', 'Block', 'IUCR', 'Primary Type', 'Description', 'Location Description', 'Arrest', 'Domestic', 'Beat', 'District', 'Ward', 'Community Area', 'FBI Code', 'X Coordinate', 'Y Coordinate', 'Year', 'Updated On', 'Latitude', 'Longitude', 'Location']

### Data Exploration and Cleaning

In [17]:
df.select("Date").show(10, truncate = False)

+----------------------+
|Date                  |
+----------------------+
|05/03/2016 11:40:00 PM|
|05/03/2016 09:40:00 PM|
|05/03/2016 11:31:00 PM|
|05/03/2016 10:10:00 PM|
|05/03/2016 10:00:00 PM|
|05/03/2016 10:35:00 PM|
|05/03/2016 10:30:00 PM|
|05/03/2016 09:30:00 PM|
|05/03/2016 04:00:00 PM|
|05/03/2016 10:30:00 PM|
+----------------------+
only showing top 10 rows

In [18]:
# Column type
df.select("Date").dtypes

[('Date', 'string')]

The Date column has the string data type, so we convert it to a timestamp.

In [19]:
# Changing the type of column Date to timestamp
from pyspark.sql.functions import to_timestamp


df = df.withColumn("Date_Time",to_timestamp('Date',"MM/dd/yyyy hh:mm:ss a"))

In [20]:
df.select("Date_Time").show(10, truncate = False)

+-------------------+
|Date_Time          |
+-------------------+
|2016-05-03 23:40:00|
|2016-05-03 21:40:00|
|2016-05-03 23:31:00|
|2016-05-03 22:10:00|
|2016-05-03 22:00:00|
|2016-05-03 22:35:00|
|2016-05-03 22:30:00|
|2016-05-03 21:30:00|
|2016-05-03 16:00:00|
|2016-05-03 22:30:00|
+-------------------+
only showing top 10 rows

In [21]:
df.select("Date_Time").dtypes

[('Date_Time', 'timestamp')]
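
As a quick sanity check on the pattern `MM/dd/yyyy hh:mm:ss a`, the same string can be parsed in plain Python. This is only a sketch: the `strptime` format `%m/%d/%Y %I:%M:%S %p` is assumed to be the equivalent of the Spark pattern, and the sample value is the first `Date` shown above.

```python
from datetime import datetime

# First "Date" value from the dataset, as shown earlier.
raw = "05/03/2016 11:40:00 PM"

# Assumed strptime equivalent of Spark's "MM/dd/yyyy hh:mm:ss a":
# %I is the hour on a 12-hour clock and %p is the AM/PM marker.
parsed = datetime.strptime(raw, "%m/%d/%Y %I:%M:%S %p")

print(parsed)       # 2016-05-03 23:40:00
print(parsed.hour)  # 23
```

The parsed value matches the first `Date_Time` row, and its hour (23) is in 24-hour form.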

In [22]:
# Extracting 'hour' from the dataset
from pyspark.sql.functions import hour

df = df.withColumn('hour', hour(df["Date_Time"]))

In [23]:
df.select('hour').show(5)

+----+
|hour|
+----+
|  23|
|  21|
|  23|
|  22|
|  22|
+----+
only showing top 5 rows

In [24]:
# Extract the day of week from Date_Time (1 = Sunday, ..., 7 = Saturday)
from pyspark.sql.functions import dayofweek

# Create a new column for the day of week
df = df.withColumn('day_of_week', dayofweek(df["Date_Time"]))

In [25]:
df.select('day_of_week').show(5)

+-----------+
|day_of_week|
+-----------+
|          3|
|          3|
|          3|
|          3|
|          3|
+-----------+
only showing top 5 rows
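
Spark's `dayofweek` numbers the days from 1 (Sunday) to 7 (Saturday), which differs from Python's own conventions. The sketch below reproduces that numbering with the standard library to confirm that 3 May 2016 maps to 3 (Tuesday). The helper name `spark_style_day_of_week` is hypothetical and only for illustration.

```python
from datetime import date

def spark_style_day_of_week(d: date) -> int:
    """Return 1 for Sunday through 7 for Saturday, like Spark's dayofweek."""
    # isoweekday() gives Monday=1 ... Sunday=7, so shift Sunday to the front.
    return d.isoweekday() % 7 + 1

day_names = ["Sunday", "Monday", "Tuesday", "Wednesday",
             "Thursday", "Friday", "Saturday"]

code = spark_style_day_of_week(date(2016, 5, 3))
print(code, day_names[code - 1])  # 3 Tuesday
```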

In [27]:
# Show 'hour' and 'day_of_week'
df.select('hour','day_of_week').show(5)

+----+-----------+
|hour|day_of_week|
+----+-----------+
|  23|          3|
|  21|          3|
|  23|          3|
|  22|          3|
|  22|          3|
+----+-----------+
only showing top 5 rows

In [28]:
df.show(5)

+---+--------+-----------+--------------------+-------------------+----+--------------------+--------------------+--------------------+------+--------+----+--------+----+--------------+--------+------------+------------+----+--------------------+------------+-------------+--------------------+-------------------+----+-----------+
|_c0|      ID|Case Number|                Date|              Block|IUCR|        Primary Type|         Description|Location Description|Arrest|Domestic|Beat|District|Ward|Community Area|FBI Code|X Coordinate|Y Coordinate|Year|          Updated On|    Latitude|    Longitude|            Location|          Date_Time|hour|day_of_week|
+---+--------+-----------+--------------------+-------------------+----+--------------------+--------------------+--------------------+------+--------+----+--------+----+--------------+--------+------------+------------+----+--------------------+------------+-------------+--------------------+-------------------+----+-----------+
|  3

In [29]:
# Dropping the columns: Date & Date_Time
df = df.drop('Date', 'Date_Time')

In [30]:
df.show(5)

+---+--------+-----------+-------------------+----+--------------------+--------------------+--------------------+------+--------+----+--------+----+--------------+--------+------------+------------+----+--------------------+------------+-------------+--------------------+----+-----------+
|_c0|      ID|Case Number|              Block|IUCR|        Primary Type|         Description|Location Description|Arrest|Domestic|Beat|District|Ward|Community Area|FBI Code|X Coordinate|Y Coordinate|Year|          Updated On|    Latitude|    Longitude|            Location|hour|day_of_week|
+---+--------+-----------+-------------------+----+--------------------+--------------------+--------------------+------+--------+----+--------+----+--------------+--------+------------+------------+----+--------------------+------------+-------------+--------------------+----+-----------+
|  3|10508693|   HZ250496| 013XX S SAWYER AVE|0486|             BATTERY|DOMESTIC BATTERY ...|           APARTMENT|  True|    Tr

### Hour statistical analysis

In [31]:
# Number of crimes that happened in each hour

df.groupBy('hour').count().show(10)

+----+-----+
|hour|count|
+----+-----+
|  12|83930|
|  22|75824|
|   1|43771|
|  13|69666|
|  16|76065|
|   6|24609|
|   3|31048|
|  20|80826|
|   5|20233|
|  19|84193|
+----+-----+
only showing top 10 rows

In [32]:
# Storing in a pandas dataframe for visualisation 
# store in descending order
hour_df = df.groupBy('hour').count().orderBy('count',ascending=False).toPandas()

In [33]:
# print 10 rows
hour_df.head(10)

   hour  count
0    19  84193
1    12  83930
2    18  82414
3    20  80826
4    15  79930
5    21  76543
6    16  76065
7    22  75824
8    17  75556
9    14  73698
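
For intuition, the `groupBy('hour').count().orderBy('count', ascending=False)` chain does what `collections.Counter` does on a small in-memory list. The sketch below uses a made-up handful of hour values, not the real column.

```python
from collections import Counter

# Hypothetical sample of 'hour' values; the real column has about 1.4 million rows.
hours = [23, 21, 23, 22, 22, 19, 19, 19, 12]

# Equivalent of groupBy('hour').count()
counts = Counter(hours)

# Equivalent of orderBy('count', ascending=False)
ranked = counts.most_common()
print(ranked[0])  # (19, 3)
```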

In [34]:
# import matplotlib 
import matplotlib.pyplot as plt

### What time of the day are criminals the busiest?

In [None]:
# Plot crime counts by hour; figsize is passed to pandas so that it applies to the bar chart
hour_df.plot(x='hour', y='count', kind='bar', color='blue', figsize=(14, 10))

plt.title('Number of Crimes by Hour')
plt.ylabel('Number of Crimes')
plt.xlabel('Hour')

# display the plot
%matplot plt


### Day of week statistical analysis

In [None]:
df.groupBy("day_of_week").count().show()


In [None]:
dayofweek_df = df.groupBy("day_of_week").count().orderBy("count", ascending = False).toPandas()

In [None]:
dayofweek_df.head(7)

### Which day of the week are criminals busiest?

(1 = Sunday, 2 = Monday, ..., 7 = Saturday)

In [None]:
# Plot crime counts by day of week; figsize is passed to pandas so that it applies to the bar chart
dayofweek_df.plot(x = 'day_of_week', y = 'count', kind='bar', color = "pink", figsize=(14, 10))

plt.title('Number of Crimes by day_of_week')
plt.ylabel('Number of Crimes')
plt.xlabel('day_of_week')

# display the plot
%matplot plt


### Year statistical analysis

In [None]:
df.groupBy("Year").count().show()


In [None]:
# Use the column name exactly as it appears in the schema ('Year')
year_df = df.groupBy("Year").count().orderBy("count", ascending = False).toPandas()

In [None]:
year_df.head(7)

### How the number of crimes changes over the years

In [None]:
# Plot crime counts by year; figsize is passed to pandas so that it applies to the bar chart
year_df.plot(x = 'Year', y = 'count', kind='bar', color = "red", figsize=(14, 10))

plt.title('Number of Crimes by Year')
plt.ylabel('Number of Crimes')
plt.xlabel('Year')

# display the plot
%matplot plt


### Primary Type statistical analysis

In [None]:
df.groupBy("Primary Type").count().show()

In [None]:
primarytype_df = df.groupBy("Primary Type").count().orderBy("count", ascending = False).toPandas()

In [None]:
primarytype_df.head()

### Most frequently reported primary crime types

In [None]:
# Plot the 14 most frequent primary types as a horizontal bar chart
primarytype_df.head(14).plot(x = 'Primary Type', y = 'count', kind='barh', figsize=(20,20), color = "#b35900")

# With kind='barh' the counts run along the x-axis and the categories along the y-axis
plt.title('Number of Crimes by Primary Type')
plt.xlabel('Number of Crimes')
plt.ylabel('Primary Type')

# display the plot
%matplot plt



### Location Description statistical analysis

In [None]:
df.groupBy("Location Description").count().show()


In [None]:
location_df = df.groupBy("Location Description").count().orderBy("count", ascending = False).toPandas()

In [None]:
location_df.head()

### Top locations by number of crimes

In [None]:
%matplotlib inline

In [None]:
# Plot the 20 most frequent locations as a horizontal bar chart
location_df.head(20).plot(x = 'Location Description', y = 'count', kind='barh', figsize=(20,20), color = "green")

# With kind='barh' the counts run along the x-axis and the categories along the y-axis
plt.title('Number of Crimes by Location Description')
plt.xlabel('Number of Crimes')
plt.ylabel('Location Description')

# display the plot
%matplot plt


### How many arrests happened

In [None]:
df.groupBy('Arrest').count().show()

### In what percentage of crimes was an arrest made?

In [None]:
# Arrest was loaded as a string column, so compare against the string "True"
df.filter(df["Arrest"] == "True").count() / df.count() * 100

### How many crimes are domestic

In [None]:
df.groupBy("Domestic").count().show()

### Calculating the percentage of domestic crimes

In [None]:
# Domestic was loaded as a string column, so compare against the string "True"
df.filter(df["Domestic"] == "True").count() / df.count() * 100

### How many narcotics cases are there in the dataset?

In [None]:
df.where(df["Primary Type"]=="NARCOTICS").count()

### Calculating the percentage of narcotics cases in the dataset

In [None]:
df.where(df["Primary Type"] == "NARCOTICS").count()/df.count() * 100

### How many domestic assaults are there?

In [None]:
df.filter((df["Primary Type"] == "ASSAULT") & (df["Domestic"] == "True")).count()


### Calculating percentage of domestic assault cases in the dataset

In [None]:
df.filter((df["Primary Type"] == "ASSAULT") & (df["Domestic"] == "True")).count()/df.count() * 100
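
The percentage calculations above all follow the same pattern: count the matching rows, divide by the total, and multiply by 100. Here is a minimal plain-Python sketch of that arithmetic. The helper `percentage` and the sample flags are hypothetical, chosen to mirror how the flags appear as the strings `'True'` and `'False'` in this dataset.

```python
def percentage(flags, target="True"):
    """Share of entries equal to `target`, as a percentage of all entries."""
    flags = list(flags)
    if not flags:
        return 0.0  # avoid dividing by zero on an empty input
    matches = sum(1 for flag in flags if flag == target)
    return 100 * matches / len(flags)

# Hypothetical 'Arrest' values; the CSV was loaded with inferSchema=False,
# so the flags are strings rather than booleans.
arrest = ["True", "False", "False", "True", "False"]
print(percentage(arrest))  # 40.0
```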

## Drop columns which are not required for model building

In [None]:
# show 5 rows
df.show(5)

In [None]:
# get columns
df.columns

**Dropping columns that are identifiers or text fields and won't help the model learn:**


'_c0', 'ID', 'Case Number': record identifiers

'Block', 'Description': long text values, such as the address

'Updated On': not needed

'Location': a combination of Latitude and Longitude, so it is redundant

In [35]:
df = df.drop("_c0", "ID", "Case Number",'Block', 'Description', "Updated On", 'Location')

In [36]:
df.show(5)

+----+--------------------+--------------------+------+--------+----+--------+----+--------------+--------+------------+------------+----+------------+-------------+----+-----------+
|IUCR|        Primary Type|Location Description|Arrest|Domestic|Beat|District|Ward|Community Area|FBI Code|X Coordinate|Y Coordinate|Year|    Latitude|    Longitude|hour|day_of_week|
+----+--------------------+--------------------+------+--------+----+--------+----+--------------+--------+------------+------------+----+------------+-------------+----+-----------+
|0486|             BATTERY|           APARTMENT|  True|    True|1022|    10.0|24.0|          29.0|     08B|   1154907.0|   1893681.0|2016|41.864073157|-87.706818608|  23|          3|
|0486|             BATTERY|           RESIDENCE| False|    True| 313|     3.0|20.0|          42.0|     08B|   1183066.0|   1864330.0|2016|41.782921527| -87.60436317|  21|          3|
|0470|PUBLIC PEACE VIOL...|              STREET| False|   False|1524|    15.0|37.0|  

In [37]:
df.columns

['IUCR', 'Primary Type', 'Location Description', 'Arrest', 'Domestic', 'Beat', 'District', 'Ward', 'Community Area', 'FBI Code', 'X Coordinate', 'Y Coordinate', 'Year', 'Latitude', 'Longitude', 'hour', 'day_of_week']

**We're now left with many categorical columns. We need to check how many distinct values each column has: if a column has a very large number of distinct values, one-hot encoding it will create a correspondingly large number of new columns.**

### Unique Values

In [38]:
for c in df.columns:
    print(c)

IUCR
Primary Type
Location Description
Arrest
Domestic
Beat
District
Ward
Community Area
FBI Code
X Coordinate
Y Coordinate
Year
Latitude
Longitude
hour
day_of_week

In [39]:
# Checking the number of distinct values in each attribute
from pyspark.sql.functions import col, countDistinct


df.agg(*(countDistinct(col(c)).alias(c) for c in df.columns)).show()

+----+------------+--------------------+------+--------+----+--------+----+--------------+--------+------------+------------+----+--------+---------+----+-----------+
|IUCR|Primary Type|Location Description|Arrest|Domestic|Beat|District|Ward|Community Area|FBI Code|X Coordinate|Y Coordinate|Year|Latitude|Longitude|hour|day_of_week|
+----+------------+--------------------+------+--------+----+--------+----+--------------+--------+------------+------------+----+--------+---------+----+-----------+
| 365|          33|                 142|     2|       2| 302|      24|  50|            78|      26|       67714|      111555|   6|  368076|   367942|  24|          7|
+----+------------+--------------------+------+--------+----+--------+----+--------------+--------+------------+------------+----+--------+---------+----+-----------+

In [40]:
# get columns
df.columns

['IUCR', 'Primary Type', 'Location Description', 'Arrest', 'Domestic', 'Beat', 'District', 'Ward', 'Community Area', 'FBI Code', 'X Coordinate', 'Y Coordinate', 'Year', 'Latitude', 'Longitude', 'hour', 'day_of_week']

**Based on the distinct-count analysis, we can drop a few more columns that have a large number of distinct values, since one-hot encoding them would create that many new columns.**

***'IUCR', 'Beat', 'Ward', 'Community Area': these columns can be dropped. 'Beat', 'Ward' and 'Community Area' can be inferred from the coordinate and latitude/longitude columns, which are more granular, and 'IUCR' is directly linked to 'Primary Type', which we keep.***
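
To put numbers on this, the sketch below totals the one-hot slots that the non-coordinate columns would need, using the distinct counts printed above. It assumes one slot per distinct value. Spark's encoder drops the last category by default, so the real vectors would be one slot shorter per column, and the counts may shift slightly once rows with nulls are removed.

```python
# Distinct-value counts copied from the countDistinct output above
# (coordinate and latitude/longitude columns are left out).
distinct_counts = {
    "IUCR": 365, "Primary Type": 33, "Location Description": 142,
    "Arrest": 2, "Domestic": 2, "Beat": 302, "District": 24,
    "Ward": 50, "Community Area": 78, "Year": 6, "hour": 24,
    "day_of_week": 7,
}

dropped = ["IUCR", "Beat", "Ward", "Community Area"]

def one_hot_width(columns):
    """Total slots if every listed column gets one slot per distinct value."""
    return sum(distinct_counts[c] for c in columns)

all_width = one_hot_width(distinct_counts)
kept_width = one_hot_width(c for c in distinct_counts if c not in dropped)

print(all_width)               # 1035
print(kept_width)              # 240
print(all_width - kept_width)  # 795
```

Under these assumptions, dropping the four columns removes roughly three quarters of the encoded width.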

In [41]:
df = df.drop('IUCR', 'Beat','Ward','Community Area')

In [None]:
df.show(5)

In [None]:
df.columns

#### Handling null values

In [None]:
# Counting the number of null values in each column
from pyspark.sql.functions import when, count, col, isnull


df.select([count(when(isnull(c), c)).alias(c) for c in df.columns]).show()



**As we can see, many rows are missing the coordinate and latitude/longitude details. Without this information it will be difficult to predict the FBI Code, so we'll drop these rows.**

In [42]:
# Dropping the rows with null values
df = df.na.drop()

In [None]:
# Check if the null values are dropped

df.select([count(when(isnull(c), c)).alias(c) for c in df.columns]).show()

In [None]:

print((df.count(), len(df.columns)))
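
The null check and `df.na.drop()` can be pictured on a few in-memory rows. This is a plain-Python sketch with made-up rows, where `None` stands in for a null. With its default arguments, `na.drop()` removes a row if any of its columns is null.

```python
# Hypothetical rows; None stands for a null (missing) value.
rows = [
    {"Primary Type": "BATTERY", "Latitude": "41.864073157", "Longitude": "-87.706818608"},
    {"Primary Type": "THEFT", "Latitude": None, "Longitude": None},
    {"Primary Type": "ASSAULT", "Latitude": "41.782921527", "Longitude": "-87.60436317"},
]

# Null count per column, like the count(when(isnull(c), c)) query.
null_counts = {c: sum(r[c] is None for r in rows) for c in rows[0]}

# Keep only rows with no nulls at all, like df.na.drop() with default arguments.
clean_rows = [r for r in rows if all(v is not None for v in r.values())]

print(null_counts)      # {'Primary Type': 0, 'Latitude': 1, 'Longitude': 1}
print(len(clean_rows))  # 2
```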

#### Correction in column type

In [None]:
# Column type
df.dtypes

In [None]:
df.printSchema()

In [None]:
df.show(3)

**We need to change the data type of the latitude, longitude, coordinate, district and year columns from string to float/integer.**

In [43]:
# Changing the required columns from string type to numerical
from pyspark.sql.types import FloatType, IntegerType

df = df.withColumn('District', df['District'].cast(IntegerType()))
df = df.withColumn('X Coordinate', df['X Coordinate'].cast(FloatType()))
df = df.withColumn('Y Coordinate', df['Y Coordinate'].cast(FloatType()))
df = df.withColumn('Longitude', df['Longitude'].cast(FloatType()))
df = df.withColumn('Latitude', df['Latitude'].cast(FloatType()))
df = df.withColumn('Year', df['Year'].cast(IntegerType()))


In [None]:
df.dtypes

In [None]:
df.show(3)
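
One thing to keep in mind with `FloatType` is that it is a 32-bit type, so latitude and longitude values lose some of their trailing digits. The sketch below uses the standard library to round-trip the first row's latitude through 32-bit storage and measure the difference. The helper `to_float32` is hypothetical. If the extra digits matter, `DoubleType` would be the usual alternative, although that is not what the notebook uses.

```python
import struct

def to_float32(value: float) -> float:
    """Round-trip a Python float through 32-bit single-precision storage."""
    return struct.unpack("f", struct.pack("f", value))[0]

# Latitude of the first row, parsed from its string form.
latitude = float("41.864073157")
stored = to_float32(latitude)
error = abs(stored - latitude)

print(stored)
print(0 < error < 1e-5)  # True

# Plain-Python analogue of the District cast: '10.0' becomes the integer 10.
district = int(float("10.0"))
print(district)  # 10
```

The error is a tiny fraction of a degree, so it is unlikely to matter for this model, but it explains why `df.show()` prints fewer digits after the cast.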

## Exploring the target variable: FBI Code

In [None]:
df.groupBy("FBI Code").count().show()

In [None]:
# Storing in a pandas dataframe for visualisation
fbi_df = df.groupBy("FBI Code").count().orderBy("count", ascending = False).toPandas()

In [None]:
fbi_df.head()

In [None]:
# Plot the 15 most frequent FBI codes; figsize is passed to pandas so that it applies to the bar chart
fbi_df.head(15).plot(x = 'FBI Code', y = 'count', kind='bar', color = "violet", figsize=(14, 10))

plt.title('Number of Crimes by FBI Code')
plt.ylabel('Number of Crimes')
plt.xlabel('FBI Code')

# display the plot
%matplot plt

## Feature Generation & Vector Creation

In [None]:
# Identifying the categorical columns for indexing
df.columns

In [None]:
len(df.columns)

In [None]:
df.show(3)

In [44]:
# Storing the categorical and continuous columns in different lists


categorical_features = ['Primary Type', 'Location Description', 'Arrest', 'Domestic', 'District','Year','hour','day_of_week' ]


continuous_features = ['X Coordinate', 'Y Coordinate', 'Latitude', 'Longitude']



### Using the Spark Pipeline concept

In [45]:
# Importing the libraries for data transformation
from pyspark.ml.feature import OneHotEncoderEstimator, StringIndexer, VectorAssembler

In [46]:
# Initialising the variable 'stages' to store every step for building a pipeline
stages = []

### StringIndexer: converts string-valued features to numerical indices

### OneHotEncoderEstimator: converts each indexed categorical variable into a one-hot encoded vector

In [47]:
# Adding an indexing stage and a one-hot encoding stage for every categorical variable

for categoricalCol in categorical_features:
    print(categoricalCol)
    stringIndexer = StringIndexer(inputCol = categoricalCol, outputCol = categoricalCol + '_Index')
    encoder = OneHotEncoderEstimator(inputCols=[stringIndexer.getOutputCol()], outputCols=[categoricalCol + "_encoded"])
    stages += [stringIndexer, encoder]

Primary Type
Location Description
Arrest
Domestic
District
Year
hour
day_of_week
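
To make the two transformers concrete, here is a minimal pure-Python sketch of what they compute, assuming the default settings (frequency-descending indexing for `StringIndexer` and `dropLast=True` for the encoder). The helpers `fit_string_index` and `one_hot` and the toy values are hypothetical illustrations, not part of the Spark API, and the alphabetical tie-breaking is an assumption.

```python
from collections import Counter

def fit_string_index(values):
    """Map each distinct value to an index, most frequent value first."""
    counts = Counter(values)
    # Ties are broken alphabetically here (an assumption made for determinism)
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {value: float(i) for i, value in enumerate(ordered)}

def one_hot(index, num_categories, drop_last=True):
    """Encode an index as a 0/1 list; with drop_last the last category is all zeros."""
    size = num_categories - 1 if drop_last else num_categories
    vec = [0.0] * size
    if int(index) < size:
        vec[int(index)] = 1.0
    return vec

# Toy column with hypothetical values
primary_type = ["THEFT", "BATTERY", "THEFT", "ASSAULT", "THEFT", "BATTERY"]
mapping = fit_string_index(primary_type)
encoded = [one_hot(mapping[v], len(mapping)) for v in primary_type]

print(mapping)
print(encoded[0], encoded[3])
```

With `dropLast=True`, a two-valued column such as `Arrest` needs only one slot, since the second value is represented by the all-zero vector.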

In [48]:
# Encoding the target variable as label

label_stringIdx = StringIndexer(inputCol = 'FBI Code', outputCol = 'label')

stages += [label_stringIdx]

### VectorAssembler: Combines all the feature columns into a single feature vector

In [49]:
# Adding a stage that assembles all encoded and continuous features into one vector

assemblerInputs = [c + "_encoded" for c in categorical_features] + continuous_features


assembler = VectorAssembler(inputCols=assemblerInputs, outputCol="features")


stages += [assembler]
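
As a rough illustration of what the assembler does, the sketch below concatenates per-column encoded vectors and continuous values into one flat list and records where each column's slots begin. The `assemble` helper, the row values, and the slot widths are hypothetical; the real widths depend on how many distinct categories each column has.

```python
def assemble(row, input_cols):
    """Concatenate vector-valued and scalar columns into one flat feature list."""
    features, offsets = [], {}
    for col in input_cols:
        offsets[col] = len(features)  # index of this column's first slot
        value = row[col]
        if isinstance(value, (list, tuple)):
            features.extend(float(v) for v in value)
        else:
            features.append(float(value))
    return features, offsets

# Hypothetical row after indexing and one-hot encoding
row = {
    "Arrest_encoded": [1.0],
    "Domestic_encoded": [0.0],
    "day_of_week_encoded": [0.0, 0.0, 1.0, 0.0, 0.0, 0.0],
    "Latitude": 41.88,
    "Longitude": -87.63,
}
cols = ["Arrest_encoded", "Domestic_encoded", "day_of_week_encoded", "Latitude", "Longitude"]
features, offsets = assemble(row, cols)

print(len(features))
print(offsets)
```

Because the inputs are concatenated in list order, the continuous features end up in the last slots of the vector, after all the one-hot blocks.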

### A Pipeline runs all the stages in order

**`stages` is a list of transformers and estimators that is passed as input to the pipeline**

In [50]:
# Loading all the steps in a pipeline
from pyspark.ml import Pipeline


In [51]:
pipeline = Pipeline(stages = stages)

In [52]:
pipeline

Pipeline_9e212bafbbe1
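
The fit/transform chaining that a pipeline performs can be sketched in plain Python. This is only a conceptual model under simplifying assumptions: the `Toy*` classes are hypothetical, rows are plain dictionaries, and the toy indexer orders values alphabetically rather than by frequency.

```python
class ToyIndexer:
    """Estimator: learns a value -> index mapping for one column."""
    def __init__(self, input_col, output_col):
        self.input_col, self.output_col = input_col, output_col

    def fit(self, rows):
        distinct = sorted({r[self.input_col] for r in rows})
        mapping = {v: i for i, v in enumerate(distinct)}
        return ToyIndexerModel(self.input_col, self.output_col, mapping)

class ToyIndexerModel:
    """Transformer: applies the learned mapping and adds the output column."""
    def __init__(self, input_col, output_col, mapping):
        self.input_col, self.output_col, self.mapping = input_col, output_col, mapping

    def transform(self, rows):
        return [{**r, self.output_col: self.mapping[r[self.input_col]]} for r in rows]

class ToyPipeline:
    """Fits each stage in order, feeding every stage the output of the previous one."""
    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            model = stage.fit(rows) if hasattr(stage, "fit") else stage
            rows = model.transform(rows)
            fitted.append(model)
        return ToyPipelineModel(fitted)

class ToyPipelineModel:
    """Holds the fitted stages and replays their transforms in order."""
    def __init__(self, stages):
        self.stages = stages

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows

rows = [{"Arrest": "false", "Domestic": "true"}, {"Arrest": "true", "Domestic": "true"}]
toy_stages = [ToyIndexer("Arrest", "Arrest_Index"), ToyIndexer("Domestic", "Domestic_Index")]
toy_model = ToyPipeline(toy_stages).fit(rows)
transformed = toy_model.transform(rows)
print(transformed)
```

The key point is that fitting produces a separate model object, which can then be applied to any dataset with the same input columns.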

### Fit & Transform DF

In [53]:
# Fitting the steps on the dataFrame
pipelineModel = pipeline.fit(df)

In [54]:
# Transforming the dataframe
df = pipelineModel.transform(df)

In [55]:
# show rows
df.show(5)

+--------------------+--------------------+------+--------+--------+--------+------------+------------+----+---------+---------+----+-----------+------------------+--------------------+--------------------------+----------------------------+------------+--------------+--------------+----------------+--------------+----------------+----------+-------------+----------+---------------+-----------------+-------------------+-----+--------------------+
|        Primary Type|Location Description|Arrest|Domestic|District|FBI Code|X Coordinate|Y Coordinate|Year| Latitude|Longitude|hour|day_of_week|Primary Type_Index|Primary Type_encoded|Location Description_Index|Location Description_encoded|Arrest_Index|Arrest_encoded|Domestic_Index|Domestic_encoded|District_Index|District_encoded|Year_Index| Year_encoded|hour_Index|   hour_encoded|day_of_week_Index|day_of_week_encoded|label|            features|
+--------------------+--------------------+------+--------+--------+--------+------------+--------

In [56]:
# Checking the schema of transformed dataFrame
df.printSchema()

root
 |-- Primary Type: string (nullable = true)
 |-- Location Description: string (nullable = true)
 |-- Arrest: string (nullable = true)
 |-- Domestic: string (nullable = true)
 |-- District: integer (nullable = true)
 |-- FBI Code: string (nullable = true)
 |-- X Coordinate: float (nullable = true)
 |-- Y Coordinate: float (nullable = true)
 |-- Year: integer (nullable = true)
 |-- Latitude: float (nullable = true)
 |-- Longitude: float (nullable = true)
 |-- hour: integer (nullable = true)
 |-- day_of_week: integer (nullable = true)
 |-- Primary Type_Index: double (nullable = false)
 |-- Primary Type_encoded: vector (nullable = true)
 |-- Location Description_Index: double (nullable = false)
 |-- Location Description_encoded: vector (nullable = true)
 |-- Arrest_Index: double (nullable = false)
 |-- Arrest_encoded: vector (nullable = true)
 |-- Domestic_Index: double (nullable = false)
 |-- Domestic_encoded: vector (nullable = true)
 |-- District_Index: double (nullable = false)
 |

In [57]:
df.groupBy("label").count().orderBy("count", ascending = False).show()

+-----+------+
|label| count|
+-----+------+
|  0.0|321960|
|  1.0|222988|
|  2.0|152816|
|  3.0|134437|
|  4.0|125762|
|  5.0| 81671|
|  6.0| 66801|
|  7.0| 59858|
|  8.0| 59250|
|  9.0| 56093|
| 10.0| 35956|
| 11.0| 23380|
| 12.0| 17080|
| 13.0| 17070|
| 14.0|  7783|
| 15.0|  7585|
| 16.0|  6756|
| 17.0|  6303|
| 18.0|  5387|
| 19.0|  2578|
+-----+------+
only showing top 20 rows

### Split data into train & test

In [63]:
# Splitting the dataFrame into training and testing set

train, test = df.randomSplit([0.7, 0.3], seed = 100)

In [64]:
print("Training Dataset Count: " + str(train.count()))

Training Dataset Count: 992198

In [65]:
print("Test Dataset Count: " + str(test.count()))

Test Dataset Count: 426206
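
`randomSplit` assigns rows to the splits by random sampling, so the resulting sizes only approximate the requested 70/30 weights. A quick check using the two counts printed above:

```python
train_count = 992198  # training count printed above
test_count = 426206   # test count printed above

total = train_count + test_count
train_share = train_count / total
test_share = test_count / total

print(f"total={total}, train={train_share:.4f}, test={test_share:.4f}")
```

The shares come out at roughly 0.6995 and 0.3005, close to the requested weights but not exact.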

## Spark Random Forest

In [66]:
from pyspark.ml.classification import RandomForestClassifier

In [67]:
# Building the RF model

rf = RandomForestClassifier(featuresCol = 'features', labelCol = 'label',
                            maxDepth=5, impurity='gini', numTrees=25, seed=100)

In [68]:
# Fitting the model over the training set
rfmodel = rf.fit(train)

In [69]:
# Printing the forest obtained from the model
print(rfmodel.toDebugString)

RandomForestClassificationModel (uid=RandomForestClassifier_5b00db469a95) with 25 trees
  Tree 0 (weight 1.0):
    If (feature 3 in {0.0})
     If (feature 46 in {0.0})
      If (feature 38 in {0.0})
       If (feature 7 in {0.0})
        Predict: 0.0
       Else (feature 7 not in {0.0})
        Predict: 8.0
      Else (feature 38 not in {0.0})
       If (feature 12 in {0.0})
        If (feature 15 in {0.0})
         Predict: 9.0
        Else (feature 15 not in {0.0})
         Predict: 16.0
       Else (feature 12 not in {0.0})
        Predict: 12.0
     Else (feature 46 not in {0.0})
      If (feature 179 in {0.0})
       If (feature 219 in {0.0})
        If (feature 7 in {0.0})
         Predict: 0.0
        Else (feature 7 not in {0.0})
         Predict: 8.0
       Else (feature 219 not in {0.0})
        If (feature 180 in {0.0})
         Predict: 0.0
        Else (feature 180 not in {0.0})
         Predict: 5.0
      Else (feature 179 not in {0.0})
       If (feature 197 in {0.0})
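
One way to read this printout: each test `feature k in {0.0}` asks whether slot `k` of the feature vector is zero, which for a one-hot slot means "the row does not belong to that category". The sketch below transcribes only the part of Tree 0 that is fully visible above into a Python function. The function name is hypothetical, and it returns `None` wherever the printed tree is cut off.

```python
def tree0_visible_branch(features):
    """Follow the fully printed part of Tree 0.

    `features` maps slot index -> value; missing slots count as 0.0,
    like the implicit zeros of a sparse vector.
    Returns None when the path enters a branch not shown in the printout.
    """
    def f(i):
        return features.get(i, 0.0)

    if f(3) != 0.0:
        return None  # the 'feature 3 not in {0.0}' branch is not shown
    if f(46) == 0.0:
        if f(38) == 0.0:
            return 0.0 if f(7) == 0.0 else 8.0
        if f(12) == 0.0:
            return 9.0 if f(15) == 0.0 else 16.0
        return 12.0
    if f(179) == 0.0:
        if f(219) == 0.0:
            return 0.0 if f(7) == 0.0 else 8.0
        return 0.0 if f(180) == 0.0 else 5.0
    return None  # the 'feature 179 not in {0.0}' branch is cut off

print(tree0_visible_branch({}))                 # all visible slots are zero
print(tree0_visible_branch({7: 1.0}))           # slot 7 is set
print(tree0_visible_branch({38: 1.0, 15: 1.0})) # slots 38 and 15 are set
```

A forest prediction combines the outputs of all 25 such trees, so a single tree like this one is only part of the final decision.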
 

### Model Prediction

In [70]:
# Applying the model on test set
predictions = rfmodel.transform(test)

In [71]:
predictions

DataFrame[Primary Type: string, Location Description: string, Arrest: string, Domestic: string, District: int, FBI Code: string, X Coordinate: float, Y Coordinate: float, Year: int, Latitude: float, Longitude: float, hour: int, day_of_week: int, Primary Type_Index: double, Primary Type_encoded: vector, Location Description_Index: double, Location Description_encoded: vector, Arrest_Index: double, Arrest_encoded: vector, Domestic_Index: double, Domestic_encoded: vector, District_Index: double, District_encoded: vector, Year_Index: double, Year_encoded: vector, hour_Index: double, hour_encoded: vector, day_of_week_Index: double, day_of_week_encoded: vector, label: double, features: vector, rawPrediction: vector, probability: vector, prediction: double]

In [72]:
predictions.show()

+------------+--------------------+------+--------+--------+--------+------------+------------+----+---------+----------+----+-----------+------------------+--------------------+--------------------------+----------------------------+------------+--------------+--------------+----------------+--------------+----------------+----------+-------------+----------+---------------+-----------------+-------------------+-----+--------------------+--------------------+--------------------+----------+
|Primary Type|Location Description|Arrest|Domestic|District|FBI Code|X Coordinate|Y Coordinate|Year| Latitude| Longitude|hour|day_of_week|Primary Type_Index|Primary Type_encoded|Location Description_Index|Location Description_encoded|Arrest_Index|Arrest_encoded|Domestic_Index|Domestic_encoded|District_Index|District_encoded|Year_Index| Year_encoded|hour_Index|   hour_encoded|day_of_week_Index|day_of_week_encoded|label|            features|       rawPrediction|         probability|prediction|
+-----

In [73]:
# Printing the required columns
predictions.select('label', 'rawPrediction', 'prediction', 'probability').show(10)

+-----+--------------------+----------+--------------------+
|label|       rawPrediction|prediction|         probability|
+-----+--------------------+----------+--------------------+
| 21.0|[5.90314750459841...|       0.0|[0.23612590018393...|
| 21.0|[5.90314750459841...|       0.0|[0.23612590018393...|
| 21.0|[5.90314750459841...|       0.0|[0.23612590018393...|
| 21.0|[5.90314750459841...|       0.0|[0.23612590018393...|
| 21.0|[5.90314750459841...|       0.0|[0.23612590018393...|
| 21.0|[4.54629889653536...|       0.0|[0.18185195586141...|
| 21.0|[4.54629889653536...|       0.0|[0.18185195586141...|
| 21.0|[5.77726835296736...|       0.0|[0.23109073411869...|
| 21.0|[5.66200951273795...|       0.0|[0.22648038050951...|
| 21.0|[5.66200951273795...|       0.0|[0.22648038050951...|
+-----+--------------------+----------+--------------------+
only showing top 10 rows
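
The two vector columns are related: for a random forest, `rawPrediction` accumulates the per-tree class votes, and `probability` is that vector normalized to sum to 1. With 25 equally weighted trees, dividing a raw value by 25 should reproduce the corresponding probability. The sketch below checks this on the first vector entries printed above, as far as the truncated digits allow.

```python
num_trees = 25  # numTrees used when building the model

# First entries of rawPrediction and probability as printed (both truncated)
raw_first = [5.90314750459841, 4.54629889653536, 5.77726835296736, 5.66200951273795]
prob_first = [0.23612590018393, 0.18185195586141, 0.23109073411869, 0.22648038050951]

max_gap = 0.0
for raw, prob in zip(raw_first, prob_first):
    normalized = raw / num_trees
    max_gap = max(max_gap, abs(normalized - prob))
    print(f"{normalized:.12f} vs {prob:.12f}")

print("largest difference:", max_gap)
```

Note that every row shown has label 21.0 but is predicted as class 0.0 with a probability of only about 0.18 to 0.24, so these particular predictions are both wrong and low-confidence.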

### Model Evaluation

In [74]:
# Model evaluation
from pyspark.ml.evaluation import MulticlassClassificationEvaluator


evaluator = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction", metricName="accuracy")



In [75]:

accuracy = evaluator.evaluate(predictions)

In [78]:
# Model Accuracy
print(accuracy)

0.8108726195474472

In [79]:
# Test Error
print("Test Error = %g" % (1.0 - accuracy))

Test Error = 0.189127
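
To put the accuracy in context, it helps to compare it with a majority-class baseline, i.e. always predicting label 0.0. The sketch below estimates that baseline from numbers printed earlier. It assumes that the label counts were computed on the same rows that were later split, and that the test set has roughly the same class mix as the full dataset.

```python
majority_count = 321960        # count for label 0.0 in the label table above
total_rows = 992198 + 426206   # train + test counts printed above

baseline_accuracy = majority_count / total_rows
model_accuracy = 0.8108726195474472  # accuracy reported above

print(f"majority-class baseline: {baseline_accuracy:.4f}")
print(f"model accuracy:          {model_accuracy:.4f}")
print(f"improvement:             {model_accuracy - baseline_accuracy:.4f}")
```

Accuracy weights every row equally, so rare classes barely influence it. Setting `metricName="f1"` on the same `MulticlassClassificationEvaluator` is one way to get a metric that reflects per-class precision and recall.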

### Feature Importance

In [80]:
# Feature Importance
rfmodel.featureImportances

SparseVector(234, {0: 0.1229, 1: 0.1334, 2: 0.0996, 3: 0.0897, 4: 0.0622, 5: 0.0228, 6: 0.1022, 7: 0.0398, 8: 0.0733, 9: 0.0412, 10: 0.0101, 11: 0.0269, 12: 0.0026, 13: 0.001, 14: 0.0091, 15: 0.0015, 16: 0.0029, 17: 0.0025, 18: 0.0016, 19: 0.0005, 21: 0.0002, 23: 0.0, 24: 0.0, 32: 0.0129, 33: 0.0051, 34: 0.0086, 35: 0.0137, 38: 0.0002, 40: 0.001, 42: 0.0054, 43: 0.0008, 46: 0.0043, 47: 0.0025, 59: 0.0004, 61: 0.0, 64: 0.0, 71: 0.0, 75: 0.0004, 76: 0.0, 171: 0.0475, 172: 0.0405, 173: 0.0009, 176: 0.0, 177: 0.0, 178: 0.0, 179: 0.0, 180: 0.0, 182: 0.0001, 183: 0.0001, 185: 0.0002, 186: 0.0, 187: 0.0, 190: 0.0, 194: 0.0, 197: 0.0, 198: 0.0, 200: 0.0001, 202: 0.0, 207: 0.0, 208: 0.0, 211: 0.0, 214: 0.0001, 216: 0.0, 217: 0.0, 218: 0.0, 219: 0.0, 220: 0.0001, 221: 0.0, 222: 0.0, 223: 0.0001, 224: 0.0, 228: 0.0, 229: 0.0, 230: 0.0014, 231: 0.0037, 232: 0.001, 233: 0.0031})

In [81]:
# Defining a function to extract feature names along with their importance scores
import pandas as pd

def ExtractFeatureImp(featureImp, dataset, featuresCol):
    # The assembled vector column stores slot metadata (idx, name) grouped by attribute type
    attrs = dataset.schema[featuresCol].metadata["ml_attr"]["attrs"]
    list_extract = []
    for attr_type in attrs:
        list_extract = list_extract + attrs[attr_type]
    varlist = pd.DataFrame(list_extract)
    # Look up the importance score of each slot by its index in the feature vector
    varlist['score'] = varlist['idx'].apply(lambda x: featureImp[x])
    return varlist.sort_values('score', ascending = False)

In [82]:
# Printing the feature importance scores
ExtractFeatureImp(rfmodel.featureImportances, predictions, "features").head(10)

     idx                                      name     score
5      1              Primary Type_encoded_BATTERY  0.133352
4      0                Primary Type_encoded_THEFT  0.122880
10     6             Primary Type_encoded_BURGLARY  0.102159
6      2      Primary Type_encoded_CRIMINAL DAMAGE  0.099569
7      3            Primary Type_encoded_NARCOTICS  0.089723
12     8  Primary Type_encoded_MOTOR VEHICLE THEFT  0.073267
8      4              Primary Type_encoded_ASSAULT  0.062191
175  171                      Arrest_encoded_False  0.047516
13     9              Primary Type_encoded_ROBBERY  0.041180
176  172                    Domestic_encoded_False  0.040514
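
Because one-hot encoding spreads each categorical column over many slots, it can be easier to read the importances after summing them back per source column. The sketch below does this for the ten rows shown above only, so the totals are lower bounds rather than full column importances. Splitting on the `_encoded` suffix relies on the naming used in this notebook.

```python
from collections import defaultdict

# (name, score) pairs copied from the top-10 table above
top_features = [
    ("Primary Type_encoded_BATTERY", 0.133352),
    ("Primary Type_encoded_THEFT", 0.122880),
    ("Primary Type_encoded_BURGLARY", 0.102159),
    ("Primary Type_encoded_CRIMINAL DAMAGE", 0.099569),
    ("Primary Type_encoded_NARCOTICS", 0.089723),
    ("Primary Type_encoded_MOTOR VEHICLE THEFT", 0.073267),
    ("Primary Type_encoded_ASSAULT", 0.062191),
    ("Arrest_encoded_False", 0.047516),
    ("Primary Type_encoded_ROBBERY", 0.041180),
    ("Domestic_encoded_False", 0.040514),
]

by_column = defaultdict(float)
for name, score in top_features:
    column = name.split("_encoded")[0]  # strip the one-hot suffix and category
    by_column[column] += score

for column, total_score in sorted(by_column.items(), key=lambda kv: -kv[1]):
    print(f"{column}: {total_score:.6f}")
```

On these ten rows alone, `Primary Type` accounts for about 0.72 of the total importance, which suggests the model relies mostly on the crime type to infer the FBI Code.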