
## **BIG DATA PROJECT**

### NYPD Shooting Incident Data (Year To Date)
Link to Dataset (https://data.cityofnewyork.us/Public-Safety/NYPD-Shooting-Incident-Data-Year-To-Date-/5ucz-vwe8)

### DATA CLEANING AT SCALE

#### Mounting Google Drive in the Google Colab Notebook to Load the Dataset

Make sure the dataset is in your Google Drive and that the drive is mounted in Colab.

The file should be at the following path: `/content/gdrive/MyDrive/Big Data/NYPD_Shooting_Incident_Data__Year_To_Date_.csv`


In [None]:
from google.colab import drive 
drive.mount('/content/gdrive')

Mounted at /content/gdrive


#### Importing the Libraries Required for Cleaning the Dataset

In [None]:
import numpy as np
import pandas as pd
import io

In [None]:
!pip install pyspark

Collecting pyspark
  Downloading pyspark-3.2.0.tar.gz (281.3 MB)
Collecting py4j==0.10.9.2
  Downloading py4j-0.10.9.2-py2.py3-none-any.whl (198 kB)
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.2.0-py2.py3-none-any.whl size=281805912 sha256=19f12757fc5024a75592f5909a085d474dc245fb4a302dd5467292b7b82d3725
  Stored in directory: /root/.cache/pip/wheels/0b/de/d2/9be5d59d7331c6c2a7c1b6d1a4f463ce107332b1ecd4e80718
Successfully built pyspark
Installing collected packages: py4j, pyspark
Successfully installed py4j-0.10.9.2 pyspark-3.2.0



# **Running PySpark in Colab**

To run Spark in Colab, we first need to install its dependencies in the Colab environment: Apache Spark 3.2.0 built for Hadoop 3.2, Java 8, and findspark, which locates the Spark installation on the system. All of these can be installed from inside the Colab notebook. The advice sometimes given to newcomers to avoid Spark 2.4.0, because of reported compatibility issues with Python, does not affect this notebook, which uses 3.2.0.
Follow the steps below to install the dependencies:

In [None]:
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
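# dlcdn.apache.org hosts only current releases; if 3.2.0 is no longer there, the same archive is kept under https://archive.apache.org/dist/spark/spark-3.2.0/
!wget -q https://dlcdn.apache.org/spark/spark-3.2.0/spark-3.2.0-bin-hadoop3.2.tgz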
!tar xf spark-3.2.0-bin-hadoop3.2.tgz
!pip install -q findspark

Now that Spark and Java are installed in Colab, set the environment paths so that PySpark can run in your Colab environment. Set the locations of Java and Spark by running the following code:

In [None]:
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.2.0-bin-hadoop3.2"

Run a local spark session to test your installation:

In [None]:
import findspark
findspark.init()
from pyspark.sql import SparkSession
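# Import only the functions the notebook uses; a wildcard import would shadow Python builtins such as min, max and sum.
from pyspark.sql.functions import col, count, isnan, lit, when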
from pyspark.sql.types import *
spark = SparkSession.builder.getOrCreate()

#### Reading the Data Set CSV File using `spark.read.csv()` Function

In [None]:
df = spark.read.csv("/content/gdrive/MyDrive/Big Data/NYPD_Shooting_Incident_Data__Year_To_Date_.csv", inferSchema=True, header=True)

In [None]:
df.count()

1531

#### Get Data Type for each column present in the Data Set




In [None]:
df.printSchema()

root
 |-- INCIDENT_KEY: integer (nullable = true)
 |-- OCCUR_DATE: string (nullable = true)
 |-- OCCUR_TIME: string (nullable = true)
 |-- BORO: string (nullable = true)
 |-- PRECINCT: integer (nullable = true)
 |-- JURISDICTION_CODE: integer (nullable = true)
 |-- LOCATION_DESC: string (nullable = true)
 |-- STATISTICAL_MURDER_FLAG: boolean (nullable = true)
 |-- PERP_AGE_GROUP: string (nullable = true)
 |-- PERP_SEX: string (nullable = true)
 |-- PERP_RACE: string (nullable = true)
 |-- VIC_AGE_GROUP: string (nullable = true)
 |-- VIC_SEX: string (nullable = true)
 |-- VIC_RACE: string (nullable = true)
 |-- X_COORD_CD: integer (nullable = true)
 |-- Y_COORD_CD: integer (nullable = true)
 |-- Latitude: double (nullable = true)
 |-- Longitude: double (nullable = true)
 |-- New Georeferenced Column: string (nullable = true)



#### Outputting the List of Columns in the Dataset

In [None]:
df.columns

['INCIDENT_KEY',
 'OCCUR_DATE',
 'OCCUR_TIME',
 'BORO',
 'PRECINCT',
 'JURISDICTION_CODE',
 'LOCATION_DESC',
 'STATISTICAL_MURDER_FLAG',
 'PERP_AGE_GROUP',
 'PERP_SEX',
 'PERP_RACE',
 'VIC_AGE_GROUP',
 'VIC_SEX',
 'VIC_RACE',
 'X_COORD_CD',
 'Y_COORD_CD',
 'Latitude',
 'Longitude',
 'New Georeferenced Column']

#### Get the top 10 rows of the shooting incidents DataFrame

In [None]:
df.show(n=10)

+------------+----------+----------+---------+--------+-----------------+--------------------+-----------------------+--------------+--------+--------------+-------------+-------+--------------+----------+----------+------------------+------------------+------------------------+
|INCIDENT_KEY|OCCUR_DATE|OCCUR_TIME|     BORO|PRECINCT|JURISDICTION_CODE|       LOCATION_DESC|STATISTICAL_MURDER_FLAG|PERP_AGE_GROUP|PERP_SEX|     PERP_RACE|VIC_AGE_GROUP|VIC_SEX|      VIC_RACE|X_COORD_CD|Y_COORD_CD|          Latitude|         Longitude|New Georeferenced Column|
+------------+----------+----------+---------+--------+-----------------+--------------------+-----------------------+--------------+--------+--------------+-------------+-------+--------------+----------+----------+------------------+------------------+------------------------+
|   230162224|06/27/2021|  04:34:00| BROOKLYN|      75|                0|                null|                  false|          null|    null|          null|   

## The columns `X_COORD_CD`, `Y_COORD_CD`, `Latitude` and `Longitude` convey the same location information as `New Georeferenced Column`.

#### Hence, we drop those four columns and keep only `New Georeferenced Column` in our cleaned dataset.

In [None]:
df = df.drop('X_COORD_CD','Y_COORD_CD','Latitude','Longitude')

In [None]:
df.columns

['INCIDENT_KEY',
 'OCCUR_DATE',
 'OCCUR_TIME',
 'BORO',
 'PRECINCT',
 'JURISDICTION_CODE',
 'LOCATION_DESC',
 'STATISTICAL_MURDER_FLAG',
 'PERP_AGE_GROUP',
 'PERP_SEX',
 'PERP_RACE',
 'VIC_AGE_GROUP',
 'VIC_SEX',
 'VIC_RACE',
 'New Georeferenced Column']

#### Removing all the **duplicate** entries

In [None]:
df = df.drop_duplicates()

In [None]:
df.count()

1531

In [None]:
df.distinct().count()

1531

#### **Checking** whether the incident key (`INCIDENT_KEY`) is unique

In [None]:
df.select('INCIDENT_KEY').distinct().count()

1195

#### As we can see, `INCIDENT_KEY` is not unique: the 1531 rows contain only 1195 distinct keys.
#### Let's see which keys are duplicated.

In [None]:
# Incident keys that appear on more than one row
df1 = df.groupBy('INCIDENT_KEY').count().filter("count > 1")
df1.count()

180

In [None]:
df1.sort('INCIDENT_KEY').show(n = 10)

+------------+-----+
|INCIDENT_KEY|count|
+------------+-----+
|   222524733|    2|
|   222539215|    2|
|   222560300|    2|
|   222563766|    4|
|   222932692|    8|
|   223198985|    4|
|   223201958|    2|
|   223505522|    2|
|   223786434|    2|
|   223827862|    2|
+------------+-----+
only showing top 10 rows



#### Check the rows for incident key `223786434`

In [None]:
df.filter('INCIDENT_KEY = 223786434').show()

+------------+----------+----------+---------+--------+-----------------+-------------+-----------------------+--------------+--------+--------------+-------------+-------+--------------+------------------------+
|INCIDENT_KEY|OCCUR_DATE|OCCUR_TIME|     BORO|PRECINCT|JURISDICTION_CODE|LOCATION_DESC|STATISTICAL_MURDER_FLAG|PERP_AGE_GROUP|PERP_SEX|     PERP_RACE|VIC_AGE_GROUP|VIC_SEX|      VIC_RACE|New Georeferenced Column|
+------------+----------+----------+---------+--------+-----------------+-------------+-----------------------+--------------+--------+--------------+-------------+-------+--------------+------------------------+
|   223786434|01/31/2021|  07:30:00|MANHATTAN|      34|                0|         null|                  false|           <18|       M|BLACK HISPANIC|        25-44|      M|WHITE HISPANIC|    POINT (-73.921487...|
|   223786434|01/31/2021|  07:30:00|MANHATTAN|      34|                0|         null|                  false|         18-24|       M|WHITE HISPANI

#### The incident key is therefore not a unique row identifier. Rows that share an incident key carry different information in other columns (in the example above, the perpetrator age group differs), so we cannot drop them as duplicates.
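
The distinction between exact duplicate rows and repeated keys can be reproduced on a handful of rows in plain Python. This is only a sketch: the rows are illustrative (the key `223786434` and its two perpetrator age groups come from the output above, everything else is invented), and `keys_with_repeats` is a hypothetical helper, not part of the notebook.

```python
from collections import Counter

# Illustrative rows: (INCIDENT_KEY, PERP_AGE_GROUP, VIC_AGE_GROUP).
# Only the key 223786434 and its two perpetrator age groups mirror the
# output above; the remaining values are invented for the example.
rows = [
    (223786434, "<18", "25-44"),
    (223786434, "18-24", "25-44"),
    (111111111, "25-44", "18-24"),
    (111111111, "25-44", "18-24"),  # exact duplicate of the previous row
]

def keys_with_repeats(rows):
    """Return {key: row_count} for keys that appear on more than one row."""
    counts = Counter(key for key, *_ in rows)
    return {key: n for key, n in counts.items() if n > 1}

# Step 1: drop exact duplicate rows, keeping the original order.
unique_rows = list(dict.fromkeys(rows))

# Step 2: keys that still repeat differ in at least one other column.
repeated = keys_with_repeats(unique_rows)
print(len(unique_rows), repeated)
```

After step 1 the invented key `111111111` appears once, while `223786434` still appears twice because its rows are not identical.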

## Find Count of Null, None, NaN of All DataFrame Columns

In [None]:
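# isnan() is defined for float/double columns and cannot take the boolean STATISTICAL_MURDER_FLAG; no float/double columns remain after the drop above, so a null check is sufficient.
df.select([count(when(col(c).isNull(), c)).alias(c) for c in df.columns]
   ).show()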

#### Get the top 5 rows where `OCCUR_DATE` is null

In [None]:
df.where(col('OCCUR_DATE').isNull()).show(n=5)

+------------+----------+----------+----+--------+-----------------+-------------+-----------------------+--------------+--------+---------+-------------+-------+--------+------------------------+
|INCIDENT_KEY|OCCUR_DATE|OCCUR_TIME|BORO|PRECINCT|JURISDICTION_CODE|LOCATION_DESC|STATISTICAL_MURDER_FLAG|PERP_AGE_GROUP|PERP_SEX|PERP_RACE|VIC_AGE_GROUP|VIC_SEX|VIC_RACE|New Georeferenced Column|
+------------+----------+----------+----+--------+-----------------+-------------+-----------------------+--------------+--------+---------+-------------+-------+--------+------------------------+
+------------+----------+----------+----+--------+-----------------+-------------+-----------------------+--------------+--------+---------+-------------+-------+--------+------------------------+



### Drop rows where either the occurrence date or the occurrence time is null

1. There may be null values in the `LOCATION_DESC`, `PERP_AGE_GROUP`, `PERP_SEX` and `PERP_RACE` columns.

2. These columns need not be filled in for every record, so they cannot be used to eliminate records.

3. However, `OCCUR_DATE` and `OCCUR_TIME` cannot have null values. We drop the rows where either `OCCUR_DATE` or `OCCUR_TIME` is null.
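
Dropping rows where the date or the time is null is the same as keeping rows where the date and the time are both present. The sketch below checks this on four illustrative value pairs in plain Python; `keep_row` is a hypothetical helper and `None` stands in for a null cell. In PySpark the corresponding filter joins the two `isNotNull()` tests with `&`.

```python
# Illustrative (OCCUR_DATE, OCCUR_TIME) pairs; None stands for a null cell.
samples = [
    ("06/27/2021", "04:34:00"),
    ("06/27/2021", None),
    (None, "04:34:00"),
    (None, None),
]

def keep_row(occur_date, occur_time):
    """Keep a row only when both fields are present.

    Dropping rows where date OR time is null is equivalent to keeping
    rows where date AND time are both non-null (De Morgan's law).
    """
    return occur_date is not None and occur_time is not None

kept = [row for row in samples if keep_row(*row)]
print(kept)
```

Only the first pair survives; joining the two tests with "or" instead would keep the second and third pairs as well.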

In [None]:
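# Preview only: na.drop() returns a new DataFrame without the rows whose OCCUR_DATE or OCCUR_TIME is null; the result is not assigned here.
df.na.drop(subset=['OCCUR_DATE', 'OCCUR_TIME'])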

DataFrame[INCIDENT_KEY: int, OCCUR_DATE: string, OCCUR_TIME: string, BORO: string, PRECINCT: int, JURISDICTION_CODE: int, LOCATION_DESC: string, STATISTICAL_MURDER_FLAG: boolean, PERP_AGE_GROUP: string, PERP_SEX: string, PERP_RACE: string, VIC_AGE_GROUP: string, VIC_SEX: string, VIC_RACE: string, New Georeferenced Column: string]

In [None]:
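# Keep a row only if both fields are present (& rather than |), which drops rows where either one is null.
df = df.filter(df.OCCUR_DATE.isNotNull() & df.OCCUR_TIME.isNotNull())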

In [None]:
df.show(100)

+------------+----------+----------+-------------+--------+-----------------+--------------------+-----------------------+--------------+--------+--------------------+-------------+-------+--------------------+------------------------+
|INCIDENT_KEY|OCCUR_DATE|OCCUR_TIME|         BORO|PRECINCT|JURISDICTION_CODE|       LOCATION_DESC|STATISTICAL_MURDER_FLAG|PERP_AGE_GROUP|PERP_SEX|           PERP_RACE|VIC_AGE_GROUP|VIC_SEX|            VIC_RACE|New Georeferenced Column|
+------------+----------+----------+-------------+--------+-----------------+--------------------+-----------------------+--------------+--------+--------------------+-------------+-------+--------------------+------------------------+
|   227221494|04/21/2021|  17:14:00|       QUEENS|     105|                0|                null|                  false|          null|    null|                null|        18-24|      M|               BLACK|    POINT (-73.748024...|
|   225285872|03/07/2021|  06:05:00|     BROOKLYN|      

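Check whether the minimum and maximum values of the date and time columns are valid. A time of day must lie between 00:00:00 and 23:59:59, so 24:00:00 is not allowed. Both columns are stored as strings, so `min` and `max` compare them as text; for `OCCUR_DATE` (`MM/DD/YYYY`) this matches chronological order only as long as every row falls in the same year, as expected for a year-to-date extract.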

In [None]:
## Minimum value of the column in pyspark
df.agg({'OCCUR_TIME': 'min'}).show()

+---------------+
|min(OCCUR_TIME)|
+---------------+
|       00:00:00|
+---------------+



In [None]:
## Maximum value of the column in pyspark
df.agg({'OCCUR_TIME': 'max'}).show()

+---------------+
|max(OCCUR_TIME)|
+---------------+
|       23:57:00|
+---------------+
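
The minimum and maximum only bound the values; they do not prove that every entry is a well-formed time. A per-value check could look like the sketch below, written in plain Python with a hypothetical helper `is_valid_time` and assuming the `HH:MM:SS` layout seen in the output above.

```python
from datetime import datetime

def is_valid_time(text):
    """Return True if `text` parses as an HH:MM:SS time of day.

    strptime rejects hours outside 0-23, so '24:00:00' is invalid.
    None (a null cell) is reported as invalid as well.
    """
    try:
        datetime.strptime(text, "%H:%M:%S")
        return True
    except (TypeError, ValueError):
        return False

for candidate in ["00:00:00", "23:57:00", "24:00:00", None]:
    print(candidate, is_valid_time(candidate))
```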



In [None]:
df.agg({'OCCUR_DATE': 'min'}).show()

+---------------+
|min(OCCUR_DATE)|
+---------------+
|     01/01/2021|
+---------------+



In [None]:
df.agg({'OCCUR_DATE': 'max'}).show()

+---------------+
|max(OCCUR_DATE)|
+---------------+
|     09/30/2021|
+---------------+
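
Because `OCCUR_DATE` is a string in `MM/DD/YYYY` form, the aggregates above order the values as text. The sketch below uses made-up dates spanning two years to show how text order and date order can disagree. It is plain Python; in PySpark, parsing the column with `to_date` and the pattern `MM/dd/yyyy` before aggregating would be one way to get true date ordering.

```python
from datetime import datetime

# Made-up dates spanning two years, in the dataset's MM/DD/YYYY string form.
dates = ["12/31/2020", "01/01/2021", "09/30/2021"]

# sorted() is used instead of min()/max() so the snippet also works in a
# session where those builtins are shadowed by PySpark column functions.
by_text = sorted(dates)
string_min, string_max = by_text[0], by_text[-1]

# Parsing first gives chronological order.
by_date = sorted(datetime.strptime(d, "%m/%d/%Y").date() for d in dates)
date_min, date_max = by_date[0], by_date[-1]

print(string_min, string_max)  # 01/01/2021 12/31/2020
print(date_min, date_max)      # 2020-12-31 2021-09-30
```

Text comparison looks at the month first, so the December 2020 date is reported as the maximum even though it is the earliest.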



**Some basic data quality checks are as below:**
1. Check that there are no garbage values in the location description column (`LOCATION_DESC`) by listing its distinct values.
2. Check that there are no misspellings in the borough name (`BORO`). There should be 5 distinct boroughs: Manhattan, Bronx, Queens, Brooklyn, Staten Island. We list the distinct values; in case of misspellings, multiple values for the same borough would be returned.
3. Check that the age group columns (`PERP_AGE_GROUP`, `VIC_AGE_GROUP`) contain only valid age groups.
4. Check that the race columns (`PERP_RACE`, `VIC_RACE`) contain no placeholder values such as `UNKNOWN`.
5. Check the distinct values of the sex columns (`PERP_SEX`, `VIC_SEX`).
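
The borough check in the list above amounts to comparing the distinct values of a column against a set of allowed values. A minimal plain-Python sketch of that comparison follows; `unexpected_values` is a hypothetical helper, and the sample column with its misspelling is invented.

```python
VALID_BOROUGHS = {"MANHATTAN", "BRONX", "QUEENS", "BROOKLYN", "STATEN ISLAND"}

def unexpected_values(values, allowed):
    """Return the sorted distinct non-null values that are not in `allowed`."""
    return sorted({v for v in values if v is not None and v not in allowed})

# Invented column containing one misspelling and one null.
boro_column = ["QUEENS", "BROOKLYN", "BROOKLN", None, "BRONX", "QUEENS"]
print(unexpected_values(boro_column, VALID_BOROUGHS))  # ['BROOKLN']
```

An empty result means every non-null value is one of the expected categories; nulls are left for the separate null check.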

In [None]:
df.select('LOCATION_DESC').distinct().show()

+--------------------+
|       LOCATION_DESC|
+--------------------+
|     COMMERCIAL BLDG|
|            HOSPITAL|
|           FAST FOOD|
|                null|
|                BANK|
|      GROCERY/BODEGA|
|         GAS STATION|
|MULTI DWELL - PUB...|
|          DEPT STORE|
|         HOTEL/MOTEL|
|    RESTAURANT/DINER|
|           PVT HOUSE|
|      BAR/NIGHT CLUB|
|MULTI DWELL - APT...|
|   BEAUTY/NAIL SALON|
+--------------------+



### Checks for Borough Name

In [None]:
df.select('BORO').distinct().show()

+-------------+
|         BORO|
+-------------+
|       QUEENS|
|     BROOKLYN|
|        BRONX|
|    MANHATTAN|
|STATEN ISLAND|
+-------------+



We can see there are no misspellings for the Borough names and thus no need for additional data correction for the same.

In [None]:
df.where(col('BORO').isNull()).show()

+------------+----------+----------+----+--------+-----------------+-------------+-----------------------+--------------+--------+---------+-------------+-------+--------+------------------------+
|INCIDENT_KEY|OCCUR_DATE|OCCUR_TIME|BORO|PRECINCT|JURISDICTION_CODE|LOCATION_DESC|STATISTICAL_MURDER_FLAG|PERP_AGE_GROUP|PERP_SEX|PERP_RACE|VIC_AGE_GROUP|VIC_SEX|VIC_RACE|New Georeferenced Column|
+------------+----------+----------+----+--------+-----------------+-------------+-----------------------+--------------+--------+---------+-------------+-------+--------+------------------------+
+------------+----------+----------+----+--------+-----------------+-------------+-----------------------+--------------+--------+---------+-------------+-------+--------+------------------------+



#### Dropping Rows where Borough Name is NULL

In [None]:
df = df.filter(df.BORO.isNotNull())

In [None]:
df.count()

1531

In [None]:
df.filter(df.BORO.isNull()).show()

+------------+----------+----------+----+--------+-----------------+-------------+-----------------------+--------------+--------+---------+-------------+-------+--------+------------------------+
|INCIDENT_KEY|OCCUR_DATE|OCCUR_TIME|BORO|PRECINCT|JURISDICTION_CODE|LOCATION_DESC|STATISTICAL_MURDER_FLAG|PERP_AGE_GROUP|PERP_SEX|PERP_RACE|VIC_AGE_GROUP|VIC_SEX|VIC_RACE|New Georeferenced Column|
+------------+----------+----------+----+--------+-----------------+-------------+-----------------------+--------------+--------+---------+-------------+-------+--------+------------------------+
+------------+----------+----------+----+--------+-----------------+-------------+-----------------------+--------------+--------+---------+-------------+-------+--------+------------------------+



## Defining checks for outliers in age group

In [None]:
df.select('PERP_AGE_GROUP').distinct().show()

+--------------+
|PERP_AGE_GROUP|
+--------------+
|           <18|
|         25-44|
|          null|
|           65+|
|         18-24|
|         45-64|
+--------------+



#### In this extract, every non-null `PERP_AGE_GROUP` value is one of the valid age groups; the only other value is null.

#### Let's replace anything outside the valid age groups, including null, with `NaN`

In [None]:
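# np.nan replaces the np.NaN alias, which NumPy 2.0 removed. Nulls match nothing in the list, so they also become NaN.
valid_age_groups = ['<18','18-24','25-44','45-64','65+',np.nan]
df = df.withColumn('PERP_AGE_GROUP', when(df.PERP_AGE_GROUP.isin(valid_age_groups), df.PERP_AGE_GROUP).otherwise(np.nan))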
df.show()

+------------+----------+----------+---------+--------+-----------------+--------------------+-----------------------+--------------+--------+--------------------+-------------+-------+--------------------+------------------------+
|INCIDENT_KEY|OCCUR_DATE|OCCUR_TIME|     BORO|PRECINCT|JURISDICTION_CODE|       LOCATION_DESC|STATISTICAL_MURDER_FLAG|PERP_AGE_GROUP|PERP_SEX|           PERP_RACE|VIC_AGE_GROUP|VIC_SEX|            VIC_RACE|New Georeferenced Column|
+------------+----------+----------+---------+--------+-----------------+--------------------+-----------------------+--------------+--------+--------------------+-------------+-------+--------------------+------------------------+
|   227221494|04/21/2021|  17:14:00|   QUEENS|     105|                0|                null|                  false|           NaN|    null|                null|        18-24|      M|               BLACK|    POINT (-73.748024...|
|   225285872|03/07/2021|  06:05:00| BROOKLYN|      73|                0

In [None]:
df.select('PERP_AGE_GROUP').distinct().show()

+--------------+
|PERP_AGE_GROUP|
+--------------+
|           <18|
|         25-44|
|           65+|
|           NaN|
|         18-24|
|         45-64|
+--------------+



In [None]:
df.show(n=5)

+------------+----------+----------+---------+--------+-----------------+--------------------+-----------------------+--------------+--------+---------+-------------+-------+--------------+------------------------+
|INCIDENT_KEY|OCCUR_DATE|OCCUR_TIME|     BORO|PRECINCT|JURISDICTION_CODE|       LOCATION_DESC|STATISTICAL_MURDER_FLAG|PERP_AGE_GROUP|PERP_SEX|PERP_RACE|VIC_AGE_GROUP|VIC_SEX|      VIC_RACE|New Georeferenced Column|
+------------+----------+----------+---------+--------+-----------------+--------------------+-----------------------+--------------+--------+---------+-------------+-------+--------------+------------------------+
|   227221494|04/21/2021|  17:14:00|   QUEENS|     105|                0|                null|                  false|           NaN|    null|     null|        18-24|      M|         BLACK|    POINT (-73.748024...|
|   225285872|03/07/2021|  06:05:00| BROOKLYN|      73|                0|MULTI DWELL - APT...|                  false|           NaN|    nul

In [None]:
df.select('VIC_AGE_GROUP').distinct().show()

+-------------+
|VIC_AGE_GROUP|
+-------------+
|          <18|
|        25-44|
|      UNKNOWN|
|          65+|
|        18-24|
|        45-64|
+-------------+



#### Here the only value outside the valid age groups is `UNKNOWN`.

#### Let's replace it with `NaN`

In [None]:
# np.nan replaces the np.NaN alias, which NumPy 2.0 removed.
valid_age_groups = ['<18','18-24','25-44','45-64','65+',np.nan]
df = df.withColumn('VIC_AGE_GROUP', when(df.VIC_AGE_GROUP.isin(valid_age_groups), df.VIC_AGE_GROUP).otherwise(np.nan))
df.show()

+------------+----------+----------+---------+--------+-----------------+--------------------+-----------------------+--------------+--------+--------------------+-------------+-------+--------------------+------------------------+
|INCIDENT_KEY|OCCUR_DATE|OCCUR_TIME|     BORO|PRECINCT|JURISDICTION_CODE|       LOCATION_DESC|STATISTICAL_MURDER_FLAG|PERP_AGE_GROUP|PERP_SEX|           PERP_RACE|VIC_AGE_GROUP|VIC_SEX|            VIC_RACE|New Georeferenced Column|
+------------+----------+----------+---------+--------+-----------------+--------------------+-----------------------+--------------+--------+--------------------+-------------+-------+--------------------+------------------------+
|   227221494|04/21/2021|  17:14:00|   QUEENS|     105|                0|                null|                  false|           NaN|    null|                null|        18-24|      M|               BLACK|    POINT (-73.748024...|
|   225285872|03/07/2021|  06:05:00| BROOKLYN|      73|                0

In [None]:
df.select('VIC_AGE_GROUP').distinct().show()

+-------------+
|VIC_AGE_GROUP|
+-------------+
|          <18|
|        25-44|
|          65+|
|          NaN|
|        18-24|
|        45-64|
+-------------+
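
The replacement rule used for both age group columns can be written as a small function: keep a value if it is a valid age group, otherwise substitute a missing-value marker. The plain-Python sketch below uses `None` as the marker rather than the `NaN` text shown above. That is an alternative chosen for illustration, not what the notebook does, and `normalize_age_group` and the sample values are hypothetical.

```python
VALID_AGE_GROUPS = {"<18", "18-24", "25-44", "45-64", "65+"}

def normalize_age_group(value, missing=None):
    """Return `value` if it is a valid age group, else the `missing` marker."""
    return value if value in VALID_AGE_GROUPS else missing

# Invented sample: two valid groups, a placeholder, a null and a garbage value.
raw = ["18-24", "UNKNOWN", None, "65+", "1020"]
cleaned = [normalize_age_group(v) for v in raw]
print(cleaned)  # ['18-24', None, None, '65+', None]
```

Passing `missing="NaN"` instead reproduces the marker that appears in the tables above.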



In [None]:
df.show(n=5)

+------------+----------+----------+---------+--------+-----------------+--------------------+-----------------------+--------------+--------+---------+-------------+-------+--------------+------------------------+
|INCIDENT_KEY|OCCUR_DATE|OCCUR_TIME|     BORO|PRECINCT|JURISDICTION_CODE|       LOCATION_DESC|STATISTICAL_MURDER_FLAG|PERP_AGE_GROUP|PERP_SEX|PERP_RACE|VIC_AGE_GROUP|VIC_SEX|      VIC_RACE|New Georeferenced Column|
+------------+----------+----------+---------+--------+-----------------+--------------------+-----------------------+--------------+--------+---------+-------------+-------+--------------+------------------------+
|   227221494|04/21/2021|  17:14:00|   QUEENS|     105|                0|                null|                  false|           NaN|    null|     null|        18-24|      M|         BLACK|    POINT (-73.748024...|
|   225285872|03/07/2021|  06:05:00| BROOKLYN|      73|                0|MULTI DWELL - APT...|                  false|           NaN|    nul

### Check for Race Values 

In [None]:
df.select('PERP_RACE').distinct().show()

+--------------------+
|           PERP_RACE|
+--------------------+
|               WHITE|
|               BLACK|
|                null|
|      BLACK HISPANIC|
|      WHITE HISPANIC|
|ASIAN / PACIFIC I...|
+--------------------+



#### Replace all `UNKNOWN` values with `NaN` (`PERP_RACE` contains none in this extract, so it is left unchanged; nulls are not affected)

In [None]:
from pyspark.sql.functions import regexp_replace

df = df.withColumn("PERP_RACE",
  regexp_replace("PERP_RACE", "UNKNOWN", "NaN"))

In [None]:
df.show(100)

+------------+----------+----------+-------------+--------+-----------------+--------------------+-----------------------+--------------+--------+--------------------+-------------+-------+--------------------+------------------------+
|INCIDENT_KEY|OCCUR_DATE|OCCUR_TIME|         BORO|PRECINCT|JURISDICTION_CODE|       LOCATION_DESC|STATISTICAL_MURDER_FLAG|PERP_AGE_GROUP|PERP_SEX|           PERP_RACE|VIC_AGE_GROUP|VIC_SEX|            VIC_RACE|New Georeferenced Column|
+------------+----------+----------+-------------+--------+-----------------+--------------------+-----------------------+--------------+--------+--------------------+-------------+-------+--------------------+------------------------+
|   227221494|04/21/2021|  17:14:00|       QUEENS|     105|                0|                null|                  false|           NaN|    null|                null|        18-24|      M|               BLACK|    POINT (-73.748024...|
|   225285872|03/07/2021|  06:05:00|     BROOKLYN|      

In [None]:
df.select('PERP_RACE').distinct().show()

+--------------------+
|           PERP_RACE|
+--------------------+
|               WHITE|
|               BLACK|
|                null|
|      BLACK HISPANIC|
|      WHITE HISPANIC|
|ASIAN / PACIFIC I...|
+--------------------+



In [None]:
df.select('VIC_RACE').distinct().show()

+--------------------+
|            VIC_RACE|
+--------------------+
|               WHITE|
|               BLACK|
|      BLACK HISPANIC|
|      WHITE HISPANIC|
|             UNKNOWN|
|ASIAN / PACIFIC I...|
+--------------------+



In [None]:
from pyspark.sql.functions import regexp_replace

df = df.withColumn("VIC_RACE",
  regexp_replace("VIC_RACE", "UNKNOWN", "NaN"))

In [None]:
df.select('VIC_RACE').distinct().show()

+--------------------+
|            VIC_RACE|
+--------------------+
|               WHITE|
|               BLACK|
|      BLACK HISPANIC|
|      WHITE HISPANIC|
|                 NaN|
|ASIAN / PACIFIC I...|
+--------------------+



### Checks for Perpetrator & Victim Sex

In [None]:
df.select('PERP_SEX').distinct().show()

+--------+
|PERP_SEX|
+--------+
|       F|
|    null|
|       M|
+--------+



#### Checking values in perpetrator sex

In [None]:
df.groupBy('PERP_SEX').count().orderBy('count', ascending=False).show()

+--------+-----+
|PERP_SEX|count|
+--------+-----+
|    null|  811|
|       M|  696|
|       F|   24|
+--------+-----+



In [None]:
df.select('VIC_SEX').distinct().show()

+-------+
|VIC_SEX|
+-------+
|      F|
|      M|
|      U|
+-------+



In [None]:
df.groupBy('VIC_SEX').count().orderBy('count', ascending=False).show()

+-------+-----+
|VIC_SEX|count|
+-------+-----+
|      M| 1369|
|      F|  161|
|      U|    1|
+-------+-----+



In [None]:
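# Count nulls and the 'NaN' placeholders written above; isnan() is avoided because it cannot take the boolean STATISTICAL_MURDER_FLAG column.
amount_missing_df = df.select([(count(when(col(c).isNull() | (col(c).cast('string') == 'NaN'), c))/count(lit(1))).alias(c) for c in df.columns])
amount_missing_df.show()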

#### This gives the fraction of missing values per column after cleaning. Columns such as `LOCATION_DESC`, `PERP_AGE_GROUP`, `PERP_SEX` and `PERP_RACE` are frequently empty in this dataset (for example, 811 of the 1531 rows have no `PERP_SEX`), so missing values there are expected.
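
The per-column missing fraction is the number of rows whose value is null or a placeholder, divided by the total number of rows. The plain-Python sketch below makes that arithmetic explicit on four invented rows; `missing_fraction` is a hypothetical helper and assumes a non-empty list of rows.

```python
def missing_fraction(rows, columns, placeholders=("NaN",)):
    """Fraction of rows per column whose value is None or a placeholder string.

    Assumes `rows` is a non-empty list of dicts keyed by column name.
    """
    total = len(rows)
    fractions = {}
    for c in columns:
        missing = [r for r in rows if r[c] is None or r[c] in placeholders]
        fractions[c] = len(missing) / total
    return fractions

# Invented rows; None stands for a null cell, 'NaN' for the placeholder text.
rows = [
    {"PERP_SEX": "M", "VIC_SEX": "M", "VIC_RACE": "BLACK"},
    {"PERP_SEX": None, "VIC_SEX": "F", "VIC_RACE": "NaN"},
    {"PERP_SEX": None, "VIC_SEX": "M", "VIC_RACE": "WHITE"},
    {"PERP_SEX": "F", "VIC_SEX": "M", "VIC_RACE": "BLACK"},
]
fractions = missing_fraction(rows, ["PERP_SEX", "VIC_SEX", "VIC_RACE"])
print(fractions)  # {'PERP_SEX': 0.5, 'VIC_SEX': 0.0, 'VIC_RACE': 0.25}
```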

Row count per `LOCATION_DESC`

In [None]:
df.groupBy('LOCATION_DESC').count().show()

+--------------------+-----+
|       LOCATION_DESC|count|
+--------------------+-----+
|     COMMERCIAL BLDG|   28|
|            HOSPITAL|    7|
|           FAST FOOD|    1|
|                BANK|    2|
|      GROCERY/BODEGA|   38|
|         GAS STATION|    8|
|MULTI DWELL - PUB...|  250|
|          DEPT STORE|    4|
|    RESTAURANT/DINER|    6|
|         HOTEL/MOTEL|    7|
|           PVT HOUSE|   16|
|      BAR/NIGHT CLUB|   15|
|MULTI DWELL - APT...|   99|
|   BEAUTY/NAIL SALON|    5|
|                null| 1045|
+--------------------+-----+



### Number of columns in Clean Data

In [None]:
len(df.columns)

15

### Number of rows in Clean Data

In [None]:
df.count()

1531

In [None]:
df.printSchema()

root
 |-- INCIDENT_KEY: integer (nullable = true)
 |-- OCCUR_DATE: string (nullable = true)
 |-- OCCUR_TIME: string (nullable = true)
 |-- BORO: string (nullable = true)
 |-- PRECINCT: integer (nullable = true)
 |-- JURISDICTION_CODE: integer (nullable = true)
 |-- LOCATION_DESC: string (nullable = true)
 |-- STATISTICAL_MURDER_FLAG: boolean (nullable = true)
 |-- PERP_AGE_GROUP: string (nullable = true)
 |-- PERP_SEX: string (nullable = true)
 |-- PERP_RACE: string (nullable = true)
 |-- VIC_AGE_GROUP: string (nullable = true)
 |-- VIC_SEX: string (nullable = true)
 |-- VIC_RACE: string (nullable = true)
 |-- New Georeferenced Column: string (nullable = true)



### **Exporting Clean Data in CSV**

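The cleaned dataset will be saved as `NYPD_Complaint_Data_Historic_Cleaned_Spark.csv` (the file name is kept exactly as written in the code below, even though the data is the shooting incident dataset).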

In [None]:
pd_df = df.toPandas()
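# index=False keeps the pandas row index out of the file, so the CSV has the same 15 columns as the cleaned DataFrame.
pd_df.to_csv("NYPD_Complaint_Data_Historic_Cleaned_Spark.csv", index=False)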