<a href="https://colab.research.google.com/github/shubhamgundawarNYU/Big-Data-Project-Group-16/blob/main/misc-datasets-notebooks/NYPD_Shooting_Incident_Data_(Historic).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **BIG DATA PROJECT**

### NYPD Shooting Incident Data (Historic)
Link to Dataset (https://data.cityofnewyork.us/Public-Safety/NYPD-Shooting-Incident-Data-Historic-/833y-fsy8)

### DATA CLEANING AT SCALE

#### Mounting Google Drive in the Google Colab Notebook to Load the Data Set

Make sure the dataset is in your Google Drive and that the drive is mounted in Colab.

The file should be at the following path: `gdrive/MyDrive/Big Data/NYPD_Shooting_Incident_Data__Historic_.csv`


In [None]:
from google.colab import drive 
drive.mount('/content/gdrive')

Mounted at /content/gdrive


#### Importing the Libraries Required for Cleaning the Data Set

In [None]:
import numpy as np
import pandas as pd
import io

In [None]:
!pip install pyspark




# **Running Pyspark in Colab**

To run Spark in Colab, we first need to install its dependencies in the Colab environment: Apache Spark 3.2.0 with Hadoop 3.2, Java 8, and findspark, which locates Spark on the system. The installation can be carried out from inside the Colab notebook. One note for those new to Spark: some users have reported Python compatibility issues with Spark 2.4.0, so that version is best avoided.
Follow these steps to install the dependencies:

In [None]:
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q https://dlcdn.apache.org/spark/spark-3.2.0/spark-3.2.0-bin-hadoop3.2.tgz
!tar xf spark-3.2.0-bin-hadoop3.2.tgz
!pip install -q findspark

Now that Spark and Java are installed in Colab, set the environment paths that let PySpark run in the Colab environment. Set the location of Java and Spark by running the following code:

In [None]:
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.2.0-bin-hadoop3.2"

Run a local spark session to test your installation:

In [None]:
import findspark
findspark.init()
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import *
spark = SparkSession.builder.getOrCreate()

#### Reading the Data Set CSV File using `spark.read.csv()` Function

In [None]:
df = spark.read.csv("/content/gdrive/MyDrive/Big Data/NYPD_Shooting_Incident_Data__Historic_.csv", inferSchema=True, header =True)

In [None]:
df.count()

23585

#### Get Data Type for each column present in the Data Set




In [None]:
df.printSchema()

root
 |-- INCIDENT_KEY: integer (nullable = true)
 |-- OCCUR_DATE: string (nullable = true)
 |-- OCCUR_TIME: string (nullable = true)
 |-- BORO: string (nullable = true)
 |-- PRECINCT: integer (nullable = true)
 |-- JURISDICTION_CODE: integer (nullable = true)
 |-- LOCATION_DESC: string (nullable = true)
 |-- STATISTICAL_MURDER_FLAG: boolean (nullable = true)
 |-- PERP_AGE_GROUP: string (nullable = true)
 |-- PERP_SEX: string (nullable = true)
 |-- PERP_RACE: string (nullable = true)
 |-- VIC_AGE_GROUP: string (nullable = true)
 |-- VIC_SEX: string (nullable = true)
 |-- VIC_RACE: string (nullable = true)
 |-- X_COORD_CD: double (nullable = true)
 |-- Y_COORD_CD: double (nullable = true)
 |-- Latitude: double (nullable = true)
 |-- Longitude: double (nullable = true)
 |-- Lon_Lat: string (nullable = true)



#### Outputting the List of Columns in the Data Set

In [None]:
df.columns

['INCIDENT_KEY',
 'OCCUR_DATE',
 'OCCUR_TIME',
 'BORO',
 'PRECINCT',
 'JURISDICTION_CODE',
 'LOCATION_DESC',
 'STATISTICAL_MURDER_FLAG',
 'PERP_AGE_GROUP',
 'PERP_SEX',
 'PERP_RACE',
 'VIC_AGE_GROUP',
 'VIC_SEX',
 'VIC_RACE',
 'X_COORD_CD',
 'Y_COORD_CD',
 'Latitude',
 'Longitude',
 'Lon_Lat']

#### Get top 10 rows of the shooting incidents dataframe

In [None]:
df.show(n=10)

+------------+----------+----------+--------+--------+-----------------+-------------+-----------------------+--------------+--------+---------+-------------+-------+--------------+------------+-------------+------------------+------------------+--------------------+
|INCIDENT_KEY|OCCUR_DATE|OCCUR_TIME|    BORO|PRECINCT|JURISDICTION_CODE|LOCATION_DESC|STATISTICAL_MURDER_FLAG|PERP_AGE_GROUP|PERP_SEX|PERP_RACE|VIC_AGE_GROUP|VIC_SEX|      VIC_RACE|  X_COORD_CD|   Y_COORD_CD|          Latitude|         Longitude|             Lon_Lat|
+------------+----------+----------+--------+--------+-----------------+-------------+-----------------------+--------------+--------+---------+-------------+-------+--------------+------------+-------------+------------------+------------------+--------------------+
|    24050482|08/27/2006|  05:35:00|   BRONX|      52|                0|         null|                   true|          null|    null|     null|        25-44|      F|BLACK HISPANIC|1017541.5625|  

## We see that the columns `X_COORD_CD`, `Y_COORD_CD`, `Latitude` and `Longitude` convey the same location data as `Lon_Lat`.

#### Hence, we drop those columns and keep only the `Lon_Lat` column in our cleaned dataset.

In [None]:
df = df.drop('X_COORD_CD','Y_COORD_CD', 'Latitude', 'Longitude')
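
If latitude and longitude are needed again later, they can be recovered from `Lon_Lat`, which the `show()` output renders as a `POINT (...)` string. The following is a minimal sketch, assuming the value follows the well-known-text form `POINT (<longitude> <latitude>)`. The helper name `parse_point` and the sample coordinates are illustrative and not taken from the dataset.

```python
import re
from typing import Optional, Tuple

# Matches WKT points such as "POINT (-73.9 40.7)"; longitude comes first.
_POINT_RE = re.compile(
    r"^\s*POINT\s*\(\s*(-?\d+(?:\.\d+)?)\s+(-?\d+(?:\.\d+)?)\s*\)\s*$"
)


def parse_point(wkt: Optional[str]) -> Optional[Tuple[float, float]]:
    """Return (longitude, latitude) from a WKT POINT string, or None if it does not parse."""
    if wkt is None:
        return None
    match = _POINT_RE.match(wkt)
    if match is None:
        return None
    return float(match.group(1)), float(match.group(2))


# Hypothetical value, for illustration only.
print(parse_point("POINT (-73.9 40.7)"))
```

Values that are missing or malformed come back as `None`, so they can be counted separately instead of raising an error.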

In [None]:
df.columns

['INCIDENT_KEY',
 'OCCUR_DATE',
 'OCCUR_TIME',
 'BORO',
 'PRECINCT',
 'JURISDICTION_CODE',
 'LOCATION_DESC',
 'STATISTICAL_MURDER_FLAG',
 'PERP_AGE_GROUP',
 'PERP_SEX',
 'PERP_RACE',
 'VIC_AGE_GROUP',
 'VIC_SEX',
 'VIC_RACE',
 'Lon_Lat']

#### Removing all the **duplicate** entries

In [None]:
df = df.drop_duplicates()

In [None]:
df.count()

23585

In [None]:
df.distinct().count()

23585

## Find Count of Null, None, NaN of All DataFrame Columns

In [None]:
df.select([count(when(isnan(c) | col(c).isNull(), c)).alias(c) for c in df.columns]
   ).show()
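
One caveat with the cell above: Spark's `isnan` is defined for float and double columns, so applying it to every column may fail on other types such as the boolean `STATISTICAL_MURDER_FLAG`. A hedged sketch of a workaround builds one SQL expression per column from `df.dtypes` and adds the `isnan` test only for floating-point columns. The helper name `null_count_exprs` is hypothetical; the resulting strings are meant to be passed to `df.selectExpr(*exprs)`.

```python
from typing import List, Tuple

FLOAT_TYPES = {"float", "double"}


def null_count_exprs(dtypes: List[Tuple[str, str]]) -> List[str]:
    """Build Spark SQL expressions that count missing values per column.

    `dtypes` has the shape of `df.dtypes`: a list of (column name, type string).
    """
    exprs = []
    for name, dtype in dtypes:
        condition = f"`{name}` IS NULL"
        if dtype in FLOAT_TYPES:
            # NaN only exists for floating-point columns.
            condition += f" OR isnan(`{name}`)"
        exprs.append(f"count(CASE WHEN {condition} THEN 1 END) AS `{name}`")
    return exprs


# A subset of the column types from the schema printed earlier.
sample_dtypes = [
    ("BORO", "string"),
    ("STATISTICAL_MURDER_FLAG", "boolean"),
    ("Latitude", "double"),
]
for expr in null_count_exprs(sample_dtypes):
    print(expr)

# With a live session (assumption): df.selectExpr(*null_count_exprs(df.dtypes)).show()
```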

#### Get top 5 rows where the occurrence date is null

In [None]:
df.where(col('OCCUR_DATE').isNull()).show(n=5)

NameError: ignored

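The saved output above is a `NameError` rather than a table (most likely the cell ran before `col` was imported). We still expect no rows where the occurrence date is missing, since every incident happens on a particular day, and the next cell filters out any such rows.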

In [None]:
df = df.filter(df.OCCUR_DATE.isNotNull())

In [None]:
df.show(100)

+------------+----------+----------+-------------+--------+-----------------+--------------------+-----------------------+--------------+--------+--------------+-------------+-------+--------------+--------------------+
|INCIDENT_KEY|OCCUR_DATE|OCCUR_TIME|         BORO|PRECINCT|JURISDICTION_CODE|       LOCATION_DESC|STATISTICAL_MURDER_FLAG|PERP_AGE_GROUP|PERP_SEX|     PERP_RACE|VIC_AGE_GROUP|VIC_SEX|      VIC_RACE|             Lon_Lat|
+------------+----------+----------+-------------+--------+-----------------+--------------------+-----------------------+--------------+--------+--------------+-------------+-------+--------------+--------------------+
|    74925351|10/05/2010|  16:04:00|     BROOKLYN|      67|                0|     COMMERCIAL BLDG|                  false|         25-44|       M|         BLACK|        18-24|      M|         BLACK|POINT (-73.943490...|
|   198404597|06/12/2019|  19:37:00|     BROOKLYN|      77|                0|      GROCERY/BODEGA|                   tru

Check whether the minimum and maximum values of the occurrence date are valid.

1. The minimum value should not be earlier than the start of the period the dataset covers (the published dataset goes back to 2006).

2. The maximum value should not be later than the current date.

In [None]:
## Minimum value of the column in pyspark
## OCCUR_DATE is an MM/dd/yyyy string, so parse it to a date before aggregating
df.agg(min(to_date(col('OCCUR_DATE'), 'MM/dd/yyyy')).alias('min(OCCUR_DATE)')).show()

In [None]:
## Maximum value of the column in pyspark
df.agg(max(to_date(col('OCCUR_DATE'), 'MM/dd/yyyy')).alias('max(OCCUR_DATE)')).show()
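
The date column `OCCUR_DATE` holds `MM/dd/yyyy` strings (see the schema), and the minimum or maximum of such strings is lexicographic rather than chronological. The sketch below illustrates the difference in plain Python with made-up dates; in Spark the analogous step would be parsing with `to_date(col, 'MM/dd/yyyy')` before aggregating.

```python
from datetime import date, datetime
from typing import Iterable, Tuple


def parse_us_date(text: str) -> date:
    """Parse an MM/dd/yyyy string into a date object."""
    return datetime.strptime(text, "%m/%d/%Y").date()


def date_range(values: Iterable[str]) -> Tuple[date, date]:
    """Return the chronological (earliest, latest) pair."""
    parsed = [parse_us_date(v) for v in values]
    return min(parsed), max(parsed)


# Made-up dates, for illustration only.
sample = ["12/31/2006", "01/15/2019", "06/01/2010"]

# String comparison orders by month first: prints 01/15/2019 12/31/2006
print(min(sample), max(sample))

# Parsed comparison is chronological: prints 2006-12-31 2019-01-15
earliest, latest = date_range(sample)
print(earliest, latest)

# The upper-bound check from the list above.
print(latest <= date.today())
```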



## TODO: Revisit
**Some basic data quality checks are as below:**
1. Check that there are no garbage values in the perpetrator sex column. The values observed below are `F`, `M`, `U` and null.
2. Check that there are no misspellings in the borough name. There should be 5 distinct boroughs: Manhattan, Bronx, Queens, Brooklyn, Staten Island. We list the distinct values; in case of misspellings, multiple variants of the same borough would be returned.
3. Check that the age group and race columns contain only the expected categories.
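
Checks of this kind all reduce to comparing a column's distinct values against an allowed set. Here is a minimal sketch in plain Python, assuming the values have already been collected (for instance from a `distinct()` query). The helper name `invalid_values` and the sample values are illustrative.

```python
from collections import Counter
from typing import Dict, Iterable, Optional, Set


def invalid_values(values: Iterable[Optional[str]], allowed: Set[str]) -> Dict[str, int]:
    """Count values that are neither missing nor in the allowed set."""
    counts = Counter(values)
    return {v: n for v, n in counts.items() if v is not None and v not in allowed}


boroughs = {"MANHATTAN", "BRONX", "QUEENS", "BROOKLYN", "STATEN ISLAND"}

# "BROOKLN" is a made-up misspelling, used only to show a failing case.
sample = ["BRONX", "QUEENS", "BROOKLN", None, "BRONX"]
print(invalid_values(sample, boroughs))  # {'BROOKLN': 1}
```

An empty dictionary means the column passed the check; missing values are deliberately left to the separate null count.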

In [None]:
df.select('PERP_SEX').distinct().show()

+--------+
|PERP_SEX|
+--------+
|       F|
|    null|
|       M|
|       U|
+--------+



In [None]:
df.groupBy('PERP_SEX').count().show()

+--------+-----+
|PERP_SEX|count|
+--------+-----+
|       F|  335|
|       M|13490|
|       U| 1499|
|    null| 8261|
+--------+-----+



### Checks for Borough Name

In [None]:
df.select('BORO').distinct().show()

+-------------+
|         BORO|
+-------------+
|       QUEENS|
|     BROOKLYN|
|        BRONX|
|    MANHATTAN|
|STATEN ISLAND|
+-------------+



We can see there are no invalid values for the Borough names and thus no need for additional data correction for the same.

In [None]:
df.where(col('BORO').isNull()).show()

+------------+----------+----------+----+--------+-----------------+-------------+-----------------------+--------------+--------+---------+-------------+-------+--------+-------+
|INCIDENT_KEY|OCCUR_DATE|OCCUR_TIME|BORO|PRECINCT|JURISDICTION_CODE|LOCATION_DESC|STATISTICAL_MURDER_FLAG|PERP_AGE_GROUP|PERP_SEX|PERP_RACE|VIC_AGE_GROUP|VIC_SEX|VIC_RACE|Lon_Lat|
+------------+----------+----------+----+--------+-----------------+-------------+-----------------------+--------------+--------+---------+-------------+-------+--------+-------+
+------------+----------+----------+----+--------+-----------------+-------------+-----------------------+--------------+--------+---------+-------------+-------+--------+-------+



#### Dropping Rows where Borough Name is NULL

In [None]:
df = df.filter(df.BORO.isNotNull())

In [None]:
df.count()

23585

In [None]:
df.filter(df.BORO.isNull()).show()

+------------+----------+----------+----+--------+-----------------+-------------+-----------------------+--------------+--------+---------+-------------+-------+--------+-------+
|INCIDENT_KEY|OCCUR_DATE|OCCUR_TIME|BORO|PRECINCT|JURISDICTION_CODE|LOCATION_DESC|STATISTICAL_MURDER_FLAG|PERP_AGE_GROUP|PERP_SEX|PERP_RACE|VIC_AGE_GROUP|VIC_SEX|VIC_RACE|Lon_Lat|
+------------+----------+----------+----+--------+-----------------+-------------+-----------------------+--------------+--------+---------+-------------+-------+--------+-------+
+------------+----------+----------+----+--------+-----------------+-------------+-----------------------+--------------+--------+---------+-------------+-------+--------+-------+



In [None]:
pip install openclean humanfriendly

Collecting openclean
  Downloading openclean-0.2.1-py3-none-any.whl (5.2 kB)
Collecting humanfriendly
  Downloading humanfriendly-10.0-py2.py3-none-any.whl (86 kB)
Collecting openclean-core==0.4.1
  Downloading openclean_core-0.4.1-py3-none-any.whl (267 kB)
Collecting refdata>=0.2.0
  Downloading refdata-0.2.0-py3-none-any.whl (37 kB)
Collecting histore>=0.4.0
  Downloading histore-0.4.1-py3-none-any.whl (109 kB)
Collecting jsonschema>=3.2.0
  Downloading jsonschema-4.2.1-py3-none-any.whl (69 kB)
Collecting flowserv-core>=0.8.0
  Downloading flowserv_core-0.9.2-py3-none-any.whl (260 kB)
Collecting jellyfish
  Downloading jellyfish-0.8.9.tar.gz (137 kB)

In [None]:
import openclean
from openclean.data.source.socrata import Socrata

# Download the 'NYPD Shooting Incident Data (Historic)' dataset
# (Socrata identifier 833y-fsy8). The compressed file is small,
# about 1.22 MB according to the output below.

import gzip
import humanfriendly
import os

dataset = Socrata().dataset('833y-fsy8')

# Local path for the downloaded, gzip-compressed TSV file.
datafile = './833y-fsy8.tsv.gz'


# Download file only if it does not exist already.
if not os.path.isfile(datafile):
    with gzip.open(datafile, 'wb') as f:
        print('Downloading ...\n')
        dataset.write(f)


fsize = humanfriendly.format_size(os.stat(datafile).st_size)
print("Using '{}' in file {} of size {}".format(dataset.name, datafile, fsize))

Downloading ...

Using 'NYPD Shooting Incident Data (Historic)' in file ./833y-fsy8.tsv.gz of size 1.22 MB


In [None]:
# We use openclean's stream operator so that the dataset does not
# have to be loaded into main memory.

from openclean.pipeline import stream

ds = stream(datafile)

In [None]:
bor = ds.select('BORO')

bor1 = bor.to_df()

In [None]:
bor1.dropna(inplace=True)

In [None]:
bor1.isnull().values.any()

False

In [None]:
from openclean.function.matching.base import DefaultStringMatcher
from openclean.function.matching.fuzzy import FuzzySimilarity
from openclean.data.mapping import Mapping

VOCABULARY = ['BROOKLYN','MANHATTAN','STATEN ISLAND','BRONX','QUEENS']

matcher = DefaultStringMatcher(
    vocabulary=VOCABULARY,
    similarity=FuzzySimilarity()
)

map = Mapping()
for query in bor1['BORO']:
    map.add(query, matcher.find_matches(query))

print(map)

Mapping(<class 'list'>, {'BRONX': [StringMatch(term='BRONX', score=1), StringMatch(term='BRONX', score=1), StringMatch(term='BRONX', score=1), StringMatch(term='BRONX', score=1), StringMatch(term='BRONX', score=1), StringMatch(term='BRONX', score=1), StringMatch(term='BRONX', score=1), StringMatch(term='BRONX', score=1), StringMatch(term='BRONX', score=1), StringMatch(term='BRONX', score=1), StringMatch(term='BRONX', score=1), StringMatch(term='BRONX', score=1), StringMatch(term='BRONX', score=1), StringMatch(term='BRONX', score=1), StringMatch(term='BRONX', score=1), StringMatch(term='BRONX', score=1), StringMatch(term='BRONX', score=1), StringMatch(term='BRONX', score=1), StringMatch(term='BRONX', score=1), StringMatch(term='BRONX', score=1), StringMatch(term='BRONX', score=1), StringMatch(term='BRONX', score=1), StringMatch(term='BRONX', score=1), StringMatch(term='BRONX', score=1), StringMatch(term='BRONX', score=1), StringMatch(term='BRONX', score=1), StringMatch(term='BRONX', sco

In [None]:
from openclean.function.eval.domain import Lookup
from openclean.operator.transform.update import update
from openclean.function.eval.base import Col


fixed = update(bor1, 'BORO', Lookup(columns=['BORO'], mapping=map.to_lookup(), default=Col('BORO')))

print(fixed['BORO'].unique())

  'update the map to contain only 1 match per key'.format(k, len(v)))


['BRONX' 'QUEENS' 'BROOKLYN' 'MANHATTAN' 'STATEN ISLAND']
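
The same idea can be tried without openclean. The sketch below swaps in Python's standard-library `difflib.get_close_matches` in place of `FuzzySimilarity`, so scores and tie-breaking will differ from the cells above. The `cutoff` of 0.8, the helper name `closest_borough` and the misspelled inputs are assumptions chosen for illustration.

```python
from difflib import get_close_matches
from typing import Optional

VOCABULARY = ["BROOKLYN", "MANHATTAN", "STATEN ISLAND", "BRONX", "QUEENS"]


def closest_borough(name: str, cutoff: float = 0.8) -> Optional[str]:
    """Return the best vocabulary match for `name`, or None if nothing is close enough."""
    matches = get_close_matches(name.upper().strip(), VOCABULARY, n=1, cutoff=cutoff)
    return matches[0] if matches else None


# Made-up inputs: one exact value, two misspellings, one unrelated name.
for raw in ["BRONX", "BROOKLN", "Manhatten", "NEWARK"]:
    print(raw, "->", closest_borough(raw))
```

Keeping a single best match per input also avoids the "only 1 match per key" warning that the openclean lookup prints when a key has several matches.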


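#### Key code check: all key codes are valid 3-digit numbers (note that `KY_CD` is not among the columns in the schema printed above, so this cell and its output appear to be carried over from another notebook in the project)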

In [None]:
df.filter((df.KY_CD < 100) | (df.KY_CD > 999)).count()

0

## Defining checks for outliers in age group

In [None]:
df.select('PERP_AGE_GROUP').distinct().show()

+--------------+
|PERP_AGE_GROUP|
+--------------+
|           940|
|           <18|
|         25-44|
|          null|
|           224|
|       UNKNOWN|
|           65+|
|         18-24|
|          1020|
|         45-64|
+--------------+



#### The output contains invalid age groups (`940`, `224` and `1020`) as well as `UNKNOWN` and null entries.

#### Let's find all the invalid age groups and replace them with `NaN`

In [None]:
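# Valid buckets; np.nan is used because the np.NaN alias was removed in NumPy 2.0
valid_age_groups = ['<18','18-24','25-44','45-64','65+',np.nan]
# PERP_AGE_GROUP is a string column, so the fallback value is stored as the string 'NaN'
df = df.withColumn('PERP_AGE_GROUP', when(df.PERP_AGE_GROUP.isin(valid_age_groups), df.PERP_AGE_GROUP).otherwise(np.nan))
df.show()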

+------------+----------+----------+-------------+--------+-----------------+--------------------+-----------------------+--------------+--------+--------------+-------------+-------+--------------+--------------------+
|INCIDENT_KEY|OCCUR_DATE|OCCUR_TIME|         BORO|PRECINCT|JURISDICTION_CODE|       LOCATION_DESC|STATISTICAL_MURDER_FLAG|PERP_AGE_GROUP|PERP_SEX|     PERP_RACE|VIC_AGE_GROUP|VIC_SEX|      VIC_RACE|             Lon_Lat|
+------------+----------+----------+-------------+--------+-----------------+--------------------+-----------------------+--------------+--------+--------------+-------------+-------+--------------+--------------------+
|    74925351|10/05/2010|  16:04:00|     BROOKLYN|      67|                0|     COMMERCIAL BLDG|                  false|         25-44|       M|         BLACK|        18-24|      M|         BLACK|POINT (-73.943490...|
|   198404597|06/12/2019|  19:37:00|     BROOKLYN|      77|                0|      GROCERY/BODEGA|                   tru

In [None]:
df.select('PERP_AGE_GROUP').distinct().show()

+--------------+
|PERP_AGE_GROUP|
+--------------+
|           <18|
|         25-44|
|           65+|
|           NaN|
|         18-24|
|         45-64|
+--------------+
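
Comparing the two `distinct()` outputs for `PERP_AGE_GROUP`, the `null` and `UNKNOWN` entries no longer appear after the replacement, which suggests they were folded into `NaN` together with `940`, `224` and `1020`. If a true missing value is preferred over the string `NaN`, a sketch like the following maps everything outside the five valid buckets to `None`. It is plain Python with a hypothetical helper `clean_age_group`, and treating `UNKNOWN` as missing is an assumption about intent. In Spark, leaving out `.otherwise(...)` after `when(...)` yields null for unmatched rows.

```python
from typing import Optional

VALID_AGE_GROUPS = {"<18", "18-24", "25-44", "45-64", "65+"}


def clean_age_group(value: Optional[str]) -> Optional[str]:
    """Return the age bucket if it is valid, otherwise None (missing)."""
    if value is None:
        return None
    value = value.strip()
    return value if value in VALID_AGE_GROUPS else None


# The distinct values shown in the first output above.
observed = ["940", "<18", "25-44", None, "224", "UNKNOWN", "65+", "18-24", "1020", "45-64"]
cleaned = [clean_age_group(v) for v in observed]

# Only the five valid buckets survive.
print(sorted({c for c in cleaned if c is not None}))
```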



### Check for Race Values 

In [None]:
df.select('PERP_RACE').distinct().show()

+--------------------+
|           PERP_RACE|
+--------------------+
|               WHITE|
|               BLACK|
|AMERICAN INDIAN/A...|
|                null|
|      BLACK HISPANIC|
|      WHITE HISPANIC|
|             UNKNOWN|
|ASIAN / PACIFIC I...|
+--------------------+



#### Replace all `UNKNOWN` values with `NaN`

In [None]:
from pyspark.sql.functions import regexp_replace

df = df.withColumn("PERP_RACE",
  regexp_replace("PERP_RACE", "UNKNOWN", "NaN"))
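
`regexp_replace` substitutes every match of the pattern inside the string, not only whole-cell matches. For the race values listed above this makes no difference, but an anchored pattern is a safer habit. The sketch uses Python's `re` module to illustrate the behaviour with a made-up value; the anchored pattern `^UNKNOWN$` should carry over to `regexp_replace` unchanged. The helper name `replace_unknown` is hypothetical.

```python
import re


def replace_unknown(value: str, anchored: bool) -> str:
    """Replace 'UNKNOWN' with 'NaN', either anywhere in the string or only as a whole value."""
    pattern = r"^UNKNOWN$" if anchored else r"UNKNOWN"
    return re.sub(pattern, "NaN", value)


print(replace_unknown("UNKNOWN", anchored=False))        # NaN
print(replace_unknown("UNKNOWN/OTHER", anchored=False))  # NaN/OTHER
print(replace_unknown("UNKNOWN/OTHER", anchored=True))   # UNKNOWN/OTHER
```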

In [None]:
df.show(100)

+------------+----------+----------+-------------+--------+-----------------+--------------------+-----------------------+--------------+--------+--------------+-------------+-------+--------------+--------------------+
|INCIDENT_KEY|OCCUR_DATE|OCCUR_TIME|         BORO|PRECINCT|JURISDICTION_CODE|       LOCATION_DESC|STATISTICAL_MURDER_FLAG|PERP_AGE_GROUP|PERP_SEX|     PERP_RACE|VIC_AGE_GROUP|VIC_SEX|      VIC_RACE|             Lon_Lat|
+------------+----------+----------+-------------+--------+-----------------+--------------------+-----------------------+--------------+--------+--------------+-------------+-------+--------------+--------------------+
|    74925351|10/05/2010|  16:04:00|     BROOKLYN|      67|                0|     COMMERCIAL BLDG|                  false|         25-44|       M|         BLACK|        18-24|      M|         BLACK|POINT (-73.943490...|
|   198404597|06/12/2019|  19:37:00|     BROOKLYN|      77|                0|      GROCERY/BODEGA|                   tru

In [None]:
df.select('PERP_RACE').distinct().show()

+--------------------+
|           PERP_RACE|
+--------------------+
|               WHITE|
|               BLACK|
|AMERICAN INDIAN/A...|
|                null|
|      BLACK HISPANIC|
|      WHITE HISPANIC|
|                 NaN|
|ASIAN / PACIFIC I...|
+--------------------+



### Checks for Perpetrator Sex

In [None]:
df.select('PERP_SEX').distinct().show()

+--------+
|PERP_SEX|
+--------+
|       F|
|    null|
|       M|
|       U|
+--------+



#### Checking values in perpetrator sex

In [None]:
df.groupBy('PERP_SEX').count().orderBy('count', ascending=False).show()

+--------+-----+
|PERP_SEX|count|
+--------+-----+
|       M|13490|
|    null| 8261|
|       U| 1499|
|       F|  335|
+--------+-----+



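#### Get unique values of offense description in sorted order (note that `OFNS_DESC` and `KY_CD` are not among the columns in the schema printed above, so the offense cells below appear to be carried over from another notebook in the project)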

In [None]:
df.select('OFNS_DESC').distinct().orderBy('OFNS_DESC', ascending=True).show()

+--------------------+
|           OFNS_DESC|
+--------------------+
|                null|
| ADMINISTRATIVE CODE|
|ADMINISTRATIVE CODES|
|AGRICULTURE & MRK...|
|ALCOHOLIC BEVERAG...|
|ANTICIPATORY OFFE...|
|               ARSON|
|ASSAULT 3 & RELAT...|
|     BURGLAR'S TOOLS|
|            BURGLARY|
|CHILD ABANDONMENT...|
|CRIMINAL MISCHIEF...|
|   CRIMINAL TRESPASS|
|     DANGEROUS DRUGS|
|   DANGEROUS WEAPONS|
|  DISORDERLY CONDUCT|
|ENDAN WELFARE INCOMP|
|            ESCAPE 3|
|      FELONY ASSAULT|
|   FELONY SEX CRIMES|
+--------------------+
only showing top 20 rows



#### Getting Total Count of Offense Description

In [None]:
df.select('OFNS_DESC').distinct().count()

64

In [None]:
df.groupBy('OFNS_DESC').count().show()

+--------------------+-----+
|           OFNS_DESC|count|
+--------------------+-----+
|OTHER TRAFFIC INF...| 1209|
|ANTICIPATORY OFFE...|   56|
|   FELONY SEX CRIMES|    2|
|OTHER OFFENSES RE...|  971|
|VEHICLE AND TRAFF...| 3995|
|KIDNAPPING & RELA...|   62|
|HOMICIDE-NEGLIGEN...|    6|
|OFF. AGNST PUB OR...| 2548|
|      FELONY ASSAULT|11472|
|ALCOHOLIC BEVERAG...|  279|
|OFFENSES RELATED ...|    7|
|CRIMINAL MISCHIEF...| 7007|
|         THEFT-FRAUD|  149|
|   THEFT OF SERVICES|  137|
|MURDER & NON-NEGL...| 1157|
|            JOSTLING|    4|
|MISCELLANEOUS PEN...| 7629|
|LOITERING/GAMBLIN...|    4|
|               ARSON|   86|
|OFFENSES AGAINST ...|  737|
+--------------------+-----+
only showing top 20 rows



**Map Key Codes with Offense Description**

In [None]:
key_off_mapping = df.groupBy('KY_CD').agg(collect_set('OFNS_DESC').alias('OFNS_DESCS')).orderBy('KY_CD')
key_off_mapping.show()

+-----+--------------------+
|KY_CD|          OFNS_DESCS|
+-----+--------------------+
| null|                  []|
|  101|[MURDER & NON-NEG...|
|  102|[HOMICIDE-NEGLIGE...|
|  103|[HOMICIDE-NEGLIGE...|
|  104|              [RAPE]|
|  105|           [ROBBERY]|
|  106|    [FELONY ASSAULT]|
|  107|          [BURGLARY]|
|  109|     [GRAND LARCENY]|
|  110|[GRAND LARCENY OF...|
|  111|[POSSESSION OF ST...|
|  112|       [THEFT-FRAUD]|
|  113|           [FORGERY]|
|  114|             [ARSON]|
|  115|[PROSTITUTION & R...|
|  116|[SEX CRIMES, FELO...|
|  117|   [DANGEROUS DRUGS]|
|  118| [DANGEROUS WEAPONS]|
|  119|[INTOXICATED/IMPA...|
|  120|[ENDAN WELFARE IN...|
+-----+--------------------+
only showing top 20 rows



In [None]:
key_off_mapping.count()

67

In [None]:
df.select('KY_CD').distinct().count()

67

#### Each key code represents a particular offense description. There is a one-to-one mapping, so we would use the key code for future analysis instead of the offense description.
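
Equal counts of key codes and grouped rows do not by themselves show that each key code has a single description, because `groupBy('KY_CD')` always returns one row per key code. What needs checking is that no key code collects more than one description in its `OFNS_DESCS` set. Below is a sketch of that check in plain Python, with a hypothetical helper `ambiguous_keys` and made-up pairs; in Spark a similar filter on the size of the collected set should work.

```python
from collections import defaultdict
from typing import Dict, Iterable, Optional, Set, Tuple


def ambiguous_keys(pairs: Iterable[Tuple[int, Optional[str]]]) -> Dict[int, Set[str]]:
    """Return key codes that map to more than one non-null description."""
    seen: Dict[int, Set[str]] = defaultdict(set)
    for key, desc in pairs:
        if desc is not None:
            seen[key].add(desc)
    return {k: v for k, v in seen.items() if len(v) > 1}


# Hypothetical (key code, description) pairs; the trailing space in the
# last entry stands in for an inconsistent spelling.
rows = [(104, "RAPE"), (105, "ROBBERY"), (105, "ROBBERY"), (114, "ARSON"), (114, "ARSON ")]
print(ambiguous_keys(rows))
```

An empty result supports the claim that a key code can stand in for its description.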

#### Calculating the null values present in the data columnwise (with respect to the features)

In [None]:
df.select([count(when(isnan(c) | col(c).isNull(), c)).alias(c) for c in df.columns]).show()

In [None]:
amount_missing_df = df.select([(count(when(isnan(c) | col(c).isNull(), c))/count(lit(1))).alias(c) for c in df.columns])
amount_missing_df.show()

#### Some variables such as `LOCATION_DESC`, `PERP_AGE_GROUP`, `PERP_SEX` and `PERP_RACE` can legitimately be missing, as seen above (for example, `PERP_SEX` is null in 8261 rows), presumably because the perpetrator was not identified.

### Number of columns in Clean Data

In [None]:
len(df.columns)

15

### Number of rows in Clean Data

In [None]:
df.count()

23585

In [None]:
df.printSchema()

root
 |-- INCIDENT_KEY: integer (nullable = true)
 |-- OCCUR_DATE: string (nullable = true)
 |-- OCCUR_TIME: string (nullable = true)
 |-- BORO: string (nullable = true)
 |-- PRECINCT: integer (nullable = true)
 |-- JURISDICTION_CODE: integer (nullable = true)
 |-- LOCATION_DESC: string (nullable = true)
 |-- STATISTICAL_MURDER_FLAG: boolean (nullable = true)
 |-- PERP_AGE_GROUP: string (nullable = true)
 |-- PERP_SEX: string (nullable = true)
 |-- PERP_RACE: string (nullable = true)
 |-- VIC_AGE_GROUP: string (nullable = true)
 |-- VIC_SEX: string (nullable = true)
 |-- VIC_RACE: string (nullable = true)
 |-- Lon_Lat: string (nullable = true)



### **Exporting Clean Data in CSV**

The cleaned data set will be saved as `NYPD_Shooting_Incident_Data_Historic_Cleaned_Spark.csv`.

In [None]:
pd_df = df.toPandas()
pd_df.to_csv("NYPD_Shooting_Incident_Data_Historic_Cleaned_Spark.csv")