<a href="https://colab.research.google.com/github/shubhamgundawarNYU/Big-Data-Project-Group-16/blob/main/misc-datasets-notebooks/NYPD_Criminal_Court_Summons_Incident_Level_Data_(Year_To_Date).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **BIG DATA PROJECT**

### NYPD Criminal Court Summons Incident Level Data (Year To Date) 
[Link to Dataset](https://data.cityofnewyork.us/Public-Safety/NYPD-Criminal-Court-Summons-Incident-Level-Data-Ye/mv4k-y93f)

### DATA CLEANING AT SCALE

#### Mounting Google Drive in the Google Colab Notebook to Load the Dataset

Make sure the dataset is in your Google Drive and that the drive is mounted in Colab.

After mounting, the file should be at the following path: `/content/gdrive/MyDrive/NYPD_Criminal_Court_Summons_Incident_Level_Data_Year_To_Date.csv`


In [None]:
from google.colab import drive 
drive.mount('/content/gdrive')

Mounted at /content/gdrive


#### Importing the Libraries Required for Cleaning the Dataset

In [None]:
import numpy as np
import pandas as pd
import io

In [None]:
!pip install pyspark

Collecting pyspark
  Downloading pyspark-3.2.0.tar.gz (281.3 MB)
Collecting py4j==0.10.9.2
  Downloading py4j-0.10.9.2-py2.py3-none-any.whl (198 kB)
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.2.0-py2.py3-none-any.whl size=281805912 sha256=cae4bced050650d1c8ece9c93e75291d98d6d262d9edecf59bf354b88c90a4a4
  Stored in directory: /root/.cache/pip/wheels/0b/de/d2/9be5d59d7331c6c2a7c1b6d1a4f463ce107332b1ecd4e80718
Successfully built pyspark
Installing collected packages: py4j, pyspark
Successfully installed py4j-0.10.9.2 pyspark-3.2.0



# **Running PySpark in Colab**

To run Spark in Colab, we first install the dependencies in the Colab environment: Java 8, Apache Spark 3.2.0 built for Hadoop 3.2, and findspark, which locates the Spark installation on the system. The installation can be carried out from notebook cells. Older guides advise newcomers to avoid Spark 2.4.0 because of reported compatibility issues with Python; that advice does not concern the 3.2.0 release installed here.
Follow these steps to install the dependencies:

In [None]:
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
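# Spark 3.2.0 is an older release; if the dlcdn.apache.org link below no longer resolves,
# the same file is kept under https://archive.apache.org/dist/spark/spark-3.2.0/.
!wget -q https://dlcdn.apache.org/spark/spark-3.2.0/spark-3.2.0-bin-hadoop3.2.tgz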
!tar xf spark-3.2.0-bin-hadoop3.2.tgz
!pip install -q findspark

Now that Spark and Java are installed in Colab, set the environment variables that let PySpark find them. Set the locations of Java and Spark by running the following code:

In [None]:
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.2.0-bin-hadoop3.2"

Run a local Spark session to test your installation:

In [None]:
import findspark
findspark.init()
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import *
spark = SparkSession.builder.getOrCreate()

#### Reading the Data Set CSV File using `spark.read.csv()` Function

In [None]:
df = spark.read.csv("/content/gdrive/MyDrive/NYPD_Criminal_Court_Summons_Incident_Level_Data_Year_To_Date.csv", inferSchema=True, header=True)

In [None]:
df.count()

35297

#### Get Data Type for each column present in the Data Set




In [None]:
df.printSchema()

root
 |-- SUMMONS_KEY: integer (nullable = true)
 |-- SUMMONS_DATE: string (nullable = true)
 |-- OFFENSE_DESCRIPTION: string (nullable = true)
 |-- LAW_SECTION_NUMBER: string (nullable = true)
 |-- LAW_DESCRIPTION: string (nullable = true)
 |-- SUMMONS_CATEGORY_TYPE: string (nullable = true)
 |-- AGE_GROUP: string (nullable = true)
 |-- SEX: string (nullable = true)
 |-- RACE: string (nullable = true)
 |-- JURISDICTION_CODE: integer (nullable = true)
 |-- BORO: string (nullable = true)
 |-- PRECINCT_OF_OCCUR: integer (nullable = true)
 |-- X_COORDINATE_CD: integer (nullable = true)
 |-- Y_COORDINATE_CD: integer (nullable = true)
 |-- Latitude: double (nullable = true)
 |-- Longitude: double (nullable = true)
 |-- New Georeferenced Column: string (nullable = true)



#### Outputing the List of Columns in the Data Set

In [None]:
df.columns

['SUMMONS_KEY',
 'SUMMONS_DATE',
 'OFFENSE_DESCRIPTION',
 'LAW_SECTION_NUMBER',
 'LAW_DESCRIPTION',
 'SUMMONS_CATEGORY_TYPE',
 'AGE_GROUP',
 'SEX',
 'RACE',
 'JURISDICTION_CODE',
 'BORO',
 'PRECINCT_OF_OCCUR',
 'X_COORDINATE_CD',
 'Y_COORDINATE_CD',
 'Latitude',
 'Longitude',
 'New Georeferenced Column']

#### Get top 10 rows of the summons dataframe

In [None]:
df.show(n=10)

+-----------+------------+--------------------+------------------+-------------------+---------------------+---------+----+--------------+-----------------+--------+-----------------+---------------+---------------+-----------------+------------------+------------------------+
|SUMMONS_KEY|SUMMONS_DATE| OFFENSE_DESCRIPTION|LAW_SECTION_NUMBER|    LAW_DESCRIPTION|SUMMONS_CATEGORY_TYPE|AGE_GROUP| SEX|          RACE|JURISDICTION_CODE|    BORO|PRECINCT_OF_OCCUR|X_COORDINATE_CD|Y_COORDINATE_CD|         Latitude|         Longitude|New Georeferenced Column|
+-----------+------------+--------------------+------------------+-------------------+---------------------+---------+----+--------------+-----------------+--------+-----------------+---------------+---------------+-----------------+------------------+------------------------+
|  234450913|  09/30/2021|KNIVES; PUBLIC PO...|         10-133(B)|               null|                 null|    25-44|   M|         BLACK|                2|   BRONX| 

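## We see that the columns `X_COORDINATE_CD`, `Y_COORDINATE_CD`, `Latitude`, and `Longitude` convey the same location as `New Georeferenced Column`.

#### Hence, we intend to drop those columns and keep only `New Georeferenced Column` in our cleaned dataset.

In [None]:
# Note: the coordinate columns are named 'X_COORDINATE_CD' and 'Y_COORDINATE_CD',
# so 'X_COORD_CD' and 'Y_COORD_CD' match nothing. DataFrame.drop ignores unknown
# names, which is why only Latitude and Longitude are removed in the output below.
df = df.drop('X_COORD_CD','Y_COORD_CD','Latitude','Longitude')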

In [None]:
df.columns

['SUMMONS_KEY',
 'SUMMONS_DATE',
 'OFFENSE_DESCRIPTION',
 'LAW_SECTION_NUMBER',
 'LAW_DESCRIPTION',
 'SUMMONS_CATEGORY_TYPE',
 'AGE_GROUP',
 'SEX',
 'RACE',
 'JURISDICTION_CODE',
 'BORO',
 'PRECINCT_OF_OCCUR',
 'X_COORDINATE_CD',
 'Y_COORDINATE_CD',
 'New Georeferenced Column']
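
The column list above still contains `X_COORDINATE_CD` and `Y_COORDINATE_CD`, which suggests that two of the names passed to `drop` did not match any column; Spark's `DataFrame.drop` treats unknown names as a no-op instead of raising an error. Below is a minimal sketch, in plain Python and independent of Spark, of a guard that reports such names before dropping. The helper `unknown_columns` is hypothetical and not part of any library.

```python
# Column names as printed by df.columns before the drop.
columns = [
    'SUMMONS_KEY', 'SUMMONS_DATE', 'OFFENSE_DESCRIPTION', 'LAW_SECTION_NUMBER',
    'LAW_DESCRIPTION', 'SUMMONS_CATEGORY_TYPE', 'AGE_GROUP', 'SEX', 'RACE',
    'JURISDICTION_CODE', 'BORO', 'PRECINCT_OF_OCCUR', 'X_COORDINATE_CD',
    'Y_COORDINATE_CD', 'Latitude', 'Longitude', 'New Georeferenced Column',
]


def unknown_columns(existing, requested):
    """Return the requested names that are absent from `existing` (hypothetical helper)."""
    existing_set = set(existing)
    return [name for name in requested if name not in existing_set]


# Names passed to drop in the cell above.
to_drop = ['X_COORD_CD', 'Y_COORD_CD', 'Latitude', 'Longitude']
missing = unknown_columns(columns, to_drop)
print(missing)  # ['X_COORD_CD', 'Y_COORD_CD']
```

Running such a check before `df.drop(...)` would flag the mismatch early; whether to raise an error or only log it is a design choice left open here.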

#### Removing all the **duplicate** entries

In [None]:
df = df.drop_duplicates()

In [None]:
df.count()

35297

In [None]:
df.distinct().count()

35297

#### **Checking** if the Summons Key is unique or not

In [None]:
df.select('SUMMONS_KEY').distinct().count()

35294

#### `SUMMONS_KEY` should be unique, but there are only 35,294 distinct keys for 35,297 rows.
#### Let's see which keys are duplicated.

In [None]:
df1 = df.groupBy('SUMMONS_KEY').count().filter("count > 1")
df1.drop('count').count()

3

In [None]:
df1.sort('SUMMONS_KEY').show(n = 10)

+-----------+-----+
|SUMMONS_KEY|count|
+-----------+-----+
|    6754670|    2|
|  150769181|    2|
|  169683528|    2|
+-----------+-----+
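
The `groupBy` / `count` / `filter("count > 1")` pattern above amounts to counting how often each key occurs and keeping the keys seen more than once. A minimal sketch of the same idea in plain Python, using a small hypothetical sample of keys, not the full dataset:

```python
from collections import Counter

# Hypothetical sample of summons keys; 6754670 appears twice, as in the table above.
summons_keys = [6754670, 234450913, 6754670, 228901095]

# Count occurrences per key, then keep only keys that occur more than once.
counts = Counter(summons_keys)
duplicated = {key: n for key, n in counts.items() if n > 1}
print(duplicated)  # {6754670: 2}
```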



#### Check the rows for summons key `6754670`

In [None]:
df.filter('SUMMONS_KEY = 6754670').show()

+-----------+------------+--------------------+------------------+------------------+---------------------+---------+----+-------+-----------------+--------+-----------------+---------------+---------------+------------------------+
|SUMMONS_KEY|SUMMONS_DATE| OFFENSE_DESCRIPTION|LAW_SECTION_NUMBER|   LAW_DESCRIPTION|SUMMONS_CATEGORY_TYPE|AGE_GROUP| SEX|   RACE|JURISDICTION_CODE|    BORO|PRECINCT_OF_OCCUR|X_COORDINATE_CD|Y_COORDINATE_CD|New Georeferenced Column|
+-----------+------------+--------------------+------------------+------------------+---------------------+---------+----+-------+-----------------+--------+-----------------+---------------+---------------+------------------------+
|    6754670|  09/15/2021|FEDERAL MOTOR VEH...|            CFR 49|NYS Transportation|            NYS TRANS|  UNKNOWN|null|   null|                0|BROOKLYN|               76|         999701|         195491|    POINT (-73.944275...|
|    6754670|  09/15/2021|FEDERAL MOTOR VEH...|            CFR 49|NY

#### TODO: Revisit

#### We find that the summons key is not strictly unique. The dataset has rows that share a summons key but differ in at least one other column. Thus, we cannot drop the rows with duplicated summons keys.

## Find Count of Null, None, NaN of All DataFrame Columns

In [None]:
df.select([count(when(isnan(c) | col(c).isNull(), c)).alias(c) for c in df.columns]
   ).show()

+-----------+------------+-------------------+------------------+---------------+---------------------+---------+-----+-----+-----------------+----+-----------------+---------------+---------------+------------------------+
|SUMMONS_KEY|SUMMONS_DATE|OFFENSE_DESCRIPTION|LAW_SECTION_NUMBER|LAW_DESCRIPTION|SUMMONS_CATEGORY_TYPE|AGE_GROUP|  SEX| RACE|JURISDICTION_CODE|BORO|PRECINCT_OF_OCCUR|X_COORDINATE_CD|Y_COORDINATE_CD|New Georeferenced Column|
+-----------+------------+-------------------+------------------+---------------+---------------------+---------+-----+-----+-----------------+----+-----------------+---------------+---------------+------------------------+
|          0|           0|                  0|                 0|           8936|                 8936|        0|17436|17437|                0|   0|                0|            181|            181|                     181|
+-----------+------------+-------------------+------------------+---------------+---------------------+-

#### Get top 5 rows where `SUMMONS_DATE` is null

In [None]:
df.where(col('SUMMONS_DATE').isNull()).show(n=5)

+-----------+------------+-------------------+------------------+---------------+---------------------+---------+---+----+-----------------+----+-----------------+---------------+---------------+------------------------+
|SUMMONS_KEY|SUMMONS_DATE|OFFENSE_DESCRIPTION|LAW_SECTION_NUMBER|LAW_DESCRIPTION|SUMMONS_CATEGORY_TYPE|AGE_GROUP|SEX|RACE|JURISDICTION_CODE|BORO|PRECINCT_OF_OCCUR|X_COORDINATE_CD|Y_COORDINATE_CD|New Georeferenced Column|
+-----------+------------+-------------------+------------------+---------------+---------------------+---------+---+----+-----------------+----+-----------------+---------------+---------------+------------------------+
+-----------+------------+-------------------+------------------+---------------+---------------------+---------+---+----+-----------------+----+-----------------+---------------+---------------+------------------------+



In [None]:
df = df.filter(df.SUMMONS_DATE.isNotNull())

In [None]:
df.show(100)

+-----------+------------+--------------------+------------------+-------------------+---------------------+---------+----+--------------------+-----------------+-------------+-----------------+---------------+---------------+------------------------+
|SUMMONS_KEY|SUMMONS_DATE| OFFENSE_DESCRIPTION|LAW_SECTION_NUMBER|    LAW_DESCRIPTION|SUMMONS_CATEGORY_TYPE|AGE_GROUP| SEX|                RACE|JURISDICTION_CODE|         BORO|PRECINCT_OF_OCCUR|X_COORDINATE_CD|Y_COORDINATE_CD|New Georeferenced Column|
+-----------+------------+--------------------+------------------+-------------------+---------------------+---------+----+--------------------+-----------------+-------------+-----------------+---------------+---------------+------------------------+
|  228901095|  05/31/2021|        NO TAX STAMP|            11-809|Administrative Code|                  TLC|  UNKNOWN|null|                null|                0|     BROOKLYN|               76|        1020325|         240688|    POINT (-73.869

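Check whether the minimum and maximum dates are valid. `SUMMONS_DATE` is a string in `MM/dd/yyyy` format, so `min` and `max` compare text; this matches chronological order only when all dates fall in the same year, as expected for a year-to-date dataset.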

1. The minimum value cannot be lower than Jan 1, 2021.

2. The maximum value cannot be greater than the current date.

In [None]:
df.agg({'SUMMONS_DATE': 'min'}).show()

+-----------------+
|min(SUMMONS_DATE)|
+-----------------+
|       01/01/2021|
+-----------------+



In [None]:
df.agg({'SUMMONS_DATE': 'max'}).show()

+-----------------+
|max(SUMMONS_DATE)|
+-----------------+
|       09/30/2021|
+-----------------+
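
Because `SUMMONS_DATE` is read as a string, the aggregates above compare text and do not compare dates. The following is a minimal sketch in plain Python of why parsing first is safer; the sample values and the helper names `parse_summons_date` and `in_valid_range` are hypothetical. In Spark, the analogous step would be converting the column with `to_date` and the `MM/dd/yyyy` pattern before aggregating.

```python
from datetime import date, datetime


def parse_summons_date(text):
    """Parse an MM/dd/yyyy string into a date object (hypothetical helper)."""
    return datetime.strptime(text, "%m/%d/%Y").date()


def in_valid_range(d, start=date(2021, 1, 1), end=None):
    """Check the two rules above: not before Jan 1, 2021 and not after today."""
    end = end or date.today()
    return start <= d <= end


# Hypothetical sample values spanning two years.
samples = ["12/31/2020", "01/01/2021", "09/30/2021"]

# Text comparison: '01/01/2021' sorts before '12/31/2020'.
string_min = min(samples)

# Date comparison: the 2020 value is correctly identified as the earliest.
parsed = [parse_summons_date(s) for s in samples]
date_min, date_max = min(parsed), max(parsed)

print(string_min)                # 01/01/2021
print(date_min, date_max)        # 2020-12-31 2021-09-30
print(in_valid_range(date_min))  # False
```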



**Some basic data quality checks follow:**
1. Check that there are no misspellings in the borough name. There should be 5 distinct boroughs: Manhattan, Bronx, Queens, Brooklyn, Staten Island. We list the distinct values; a misspelling would show up as an extra variant of the same borough.

### Checks for Borough Name

In [None]:
df.select('BORO').distinct().show()

+-------------+
|         BORO|
+-------------+
|       QUEENS|
|     BROOKLYN|
|        BRONX|
|     NEW YORK|
|    MANHATTAN|
|STATEN ISLAND|
+-------------+
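
The output lists six distinct values, while five boroughs are expected. A minimal sketch, in plain Python, of comparing the observed values with the expected set; the variable names are illustrative. Whether `NEW YORK` is an alternative label for Manhattan is not confirmed by the notebook and is left as an open question here.

```python
# The five boroughs named in the check above, in the dataset's upper-case spelling.
EXPECTED_BOROUGHS = {'MANHATTAN', 'BRONX', 'QUEENS', 'BROOKLYN', 'STATEN ISLAND'}

# Distinct BORO values as printed above.
observed = {'QUEENS', 'BROOKLYN', 'BRONX', 'NEW YORK', 'MANHATTAN', 'STATEN ISLAND'}

# Values present in the data but not expected, and expected values never seen.
unexpected = observed - EXPECTED_BOROUGHS
not_seen = EXPECTED_BOROUGHS - observed

print(unexpected)  # {'NEW YORK'}
print(not_seen)    # set()
```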



Creating streaming data using the openclean library

In [None]:
!pip install openclean humanfriendly

Collecting openclean
  Downloading openclean-0.2.1-py3-none-any.whl (5.2 kB)
Collecting humanfriendly
  Downloading humanfriendly-10.0-py2.py3-none-any.whl (86 kB)
Collecting openclean-core==0.4.1
  Downloading openclean_core-0.4.1-py3-none-any.whl (267 kB)
Collecting flowserv-core>=0.8.0
  Downloading flowserv_core-0.9.2-py3-none-any.whl (260 kB)
Collecting refdata>=0.2.0
  Downloading refdata-0.2.0-py3-none-any.whl (37 kB)
Collecting histore>=0.4.0
  Downloading histore-0.4.1-py3-none-any.whl (109 kB)
Collecting jsonschema>=3.2.0
  Downloading jsonschema-4.2.1-py3-none-any.whl (69 kB)
Collecting jellyfish
  Downloading jellyfish-0.8.9.tar.gz (137 kB)

In [None]:
import openclean
from openclean.data.source.socrata import Socrata

for dataset in Socrata().catalog(domain='data.cityofnewyork.us'):
    if 'complaint' in dataset.name.lower() or 'NYPD' in dataset.name or 'crime' in dataset.name.lower():
        print(f'{dataset.identifier}\t{dataset.domain}\t{dataset.name}')

qgea-i56i	data.cityofnewyork.us	NYPD Complaint Data Historic
uip8-fykc	data.cityofnewyork.us	NYPD Arrest Data (Year to Date)
eabe-havv	data.cityofnewyork.us	DOB Complaints Received
5uac-w243	data.cityofnewyork.us	NYPD Complaint Data Current (Year To Date)
8h9b-rp9u	data.cityofnewyork.us	NYPD Arrests Data (Historic)
833y-fsy8	data.cityofnewyork.us	NYPD Shooting Incident Data (Historic)
5ucz-vwe8	data.cityofnewyork.us	NYPD Shooting Incident Data (Year To Date)
uwyv-629c	data.cityofnewyork.us	Housing Maintenance Code Complaints
a2nx-4u46	data.cityofnewyork.us	Complaint Problems
bqiq-cu78	data.cityofnewyork.us	NYPD Hate Crimes
sv2w-rv3k	data.cityofnewyork.us	NYPD Criminal Court Summons (Historic)
nre2-6m2s	data.cityofnewyork.us	Consumer Services Mediated Complaints
mv4k-y93f	data.cityofnewyork.us	NYPD Criminal Court Summons Incident Level Data (Year To Date)
6v9u-ndjg	data.cityofnewyork.us	Building Complaint Disposition Codes
9jgj-bmct	data.cityofnewyork.us	DOHMH Indoor Environmental Compl

In [None]:
# Download the 'NYPD Criminal Court Summons Incident Level Data (Year To Date)'
# dataset (Socrata identifier mv4k-y93f) as a gzipped TSV file.

import gzip
import humanfriendly
import os

dataset = Socrata().dataset('mv4k-y93f')

# Note: the file name reuses the identifier qgea-i56i, which the catalog listing
# above assigns to 'NYPD Complaint Data Historic'. The file written here holds
# the summons dataset mv4k-y93f.
datafile = './qgea-i56i.tsv.gz'


# Download file only if it does not exist already.
if not os.path.isfile(datafile):
    with gzip.open(datafile, 'wb') as f:
        print('Downloading ...\n')
        dataset.write(f)


fsize = humanfriendly.format_size(os.stat(datafile).st_size)
print("Using '{}' in file {} of size {}".format(dataset.name, datafile, fsize))

Downloading ...

Using 'NYPD Criminal Court Summons Incident Level Data (Year To Date)' in file ./qgea-i56i.tsv.gz of size 1.3 MB


In [None]:
# openclean's stream operator reads the file row by row, so the dataset
# does not have to be loaded into main memory.

from openclean.pipeline import stream

ds = stream(datafile)

In [None]:
bor = ds.select('BORO')

In [None]:
bor1 = bor.to_df()

In [None]:
type(bor1)

pandas.core.frame.DataFrame

In [None]:
bor1.dropna(inplace=True)

In [None]:
bor1['BORO'].unique()

array(['BRONX', 'BROOKLYN', 'QUEENS', 'STATEN ISLAND', 'MANHATTAN',
       'NEW YORK'], dtype=object)

In [None]:
bor1.isnull().values.any()

False

In [None]:
from openclean.function.matching.base import DefaultStringMatcher
from openclean.function.matching.fuzzy import FuzzySimilarity
from openclean.data.mapping import Mapping

VOCABULARY = ['BROOKLYN','MANHATTAN','STATEN ISLAND','BRONX','QUEENS','NEW YORK']

matcher = DefaultStringMatcher(
    vocabulary=VOCABULARY,
    similarity=FuzzySimilarity()
)

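# Note: this loop visits every row, so each borough key collects one match per row.
# That is why the printed mapping repeats the same StringMatch, and it is the likely
# cause of the "only 1 match per key" warnings in the update step further down.
# Looping over bor1['BORO'].unique() instead would add each key once.
map = Mapping()
for query in bor1['BORO']:
    map.add(query, matcher.find_matches(query))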

print(map)

Mapping(<class 'list'>, {'BRONX': [StringMatch(term='BRONX', score=1), StringMatch(term='BRONX', score=1), StringMatch(term='BRONX', score=1), StringMatch(term='BRONX', score=1), StringMatch(term='BRONX', score=1), StringMatch(term='BRONX', score=1), StringMatch(term='BRONX', score=1), StringMatch(term='BRONX', score=1), StringMatch(term='BRONX', score=1), StringMatch(term='BRONX', score=1), StringMatch(term='BRONX', score=1), StringMatch(term='BRONX', score=1), StringMatch(term='BRONX', score=1), StringMatch(term='BRONX', score=1), StringMatch(term='BRONX', score=1), StringMatch(term='BRONX', score=1), StringMatch(term='BRONX', score=1), StringMatch(term='BRONX', score=1), StringMatch(term='BRONX', score=1), StringMatch(term='BRONX', score=1), StringMatch(term='BRONX', score=1), StringMatch(term='BRONX', score=1), StringMatch(term='BRONX', score=1), StringMatch(term='BRONX', score=1), StringMatch(term='BRONX', score=1), StringMatch(term='BRONX', score=1), StringMatch(term='BRONX', sco

In [None]:
from openclean.function.eval.domain import Lookup
from openclean.operator.transform.update import update
from openclean.function.eval.base import Col


fixed = update(bor1, 'BORO', Lookup(columns=['BORO'], mapping=map.to_lookup(), default=Col('BORO')))

print(fixed['BORO'].unique())

['BRONX' 'BROOKLYN' 'QUEENS' 'STATEN ISLAND' 'MANHATTAN' 'NEW YORK']


  'update the map to contain only 1 match per key'.format(k, len(v)))


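We can see there are no misspellings in the borough names, so no spelling correction is needed. Note, however, that the column holds six distinct values where five are expected: `NEW YORK` appears alongside `MANHATTAN`. The two may refer to the same borough (Manhattan is New York County), so they are candidates for merging in a later step.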
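
To illustrate how fuzzy matching would catch a misspelling if one were present, here is a minimal sketch that uses `difflib` from the Python standard library in place of openclean's `FuzzySimilarity`. The misspelled value, the cutoff of 0.8, and the helper `best_match` are assumptions made for illustration; none of them come from the dataset or from openclean.

```python
from difflib import get_close_matches

# Same vocabulary as in the openclean cell above.
VOCABULARY = ['BROOKLYN', 'MANHATTAN', 'STATEN ISLAND', 'BRONX', 'QUEENS', 'NEW YORK']


def best_match(value, vocabulary=VOCABULARY, cutoff=0.8):
    """Return the closest vocabulary term, or None if nothing is similar enough."""
    matches = get_close_matches(value, vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else None


# Hypothetical observed values; 'MANHATAN' is an invented misspelling.
observed = ['BRONX', 'BROOKLYN', 'BROOKLYN', 'QUEENS', 'MANHATAN']

# Build the mapping from distinct values only, so each key gets exactly one match.
mapping = {value: best_match(value) for value in set(observed)}

print(mapping['MANHATAN'])  # MANHATTAN
print(mapping['BRONX'])     # BRONX
print(best_match('XYZ'))    # None
```

Building the mapping from distinct values keeps one match per key, which is the shape a lookup table needs.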

In [None]:
df.where(col('BORO').isNull()).show()

+-----------+------------+-------------------+------------------+---------------+---------------------+---------+---+----+-----------------+----+-----------------+---------------+---------------+------------------------+
|SUMMONS_KEY|SUMMONS_DATE|OFFENSE_DESCRIPTION|LAW_SECTION_NUMBER|LAW_DESCRIPTION|SUMMONS_CATEGORY_TYPE|AGE_GROUP|SEX|RACE|JURISDICTION_CODE|BORO|PRECINCT_OF_OCCUR|X_COORDINATE_CD|Y_COORDINATE_CD|New Georeferenced Column|
+-----------+------------+-------------------+------------------+---------------+---------------------+---------+---+----+-----------------+----+-----------------+---------------+---------------+------------------------+
+-----------+------------+-------------------+------------------+---------------+---------------------+---------+---+----+-----------------+----+-----------------+---------------+---------------+------------------------+



#### Dropping Rows where Borough Name is NULL

In [None]:
df = df.filter(df.BORO.isNotNull())

In [None]:
df.count()

35297

In [None]:
df.filter(df.BORO.isNull()).show()

+-----------+------------+-------------------+------------------+---------------+---------------------+---------+---+----+-----------------+----+-----------------+---------------+---------------+------------------------+
|SUMMONS_KEY|SUMMONS_DATE|OFFENSE_DESCRIPTION|LAW_SECTION_NUMBER|LAW_DESCRIPTION|SUMMONS_CATEGORY_TYPE|AGE_GROUP|SEX|RACE|JURISDICTION_CODE|BORO|PRECINCT_OF_OCCUR|X_COORDINATE_CD|Y_COORDINATE_CD|New Georeferenced Column|
+-----------+------------+-------------------+------------------+---------------+---------------------+---------+---+----+-----------------+----+-----------------+---------------+---------------+------------------------+
+-----------+------------+-------------------+------------------+---------------+---------------------+---------+---+----+-----------------+----+-----------------+---------------+---------------+------------------------+



## Defining checks for outliers in age group

In [None]:
df.select('AGE_GROUP').distinct().show()

+---------+
|AGE_GROUP|
+---------+
|      <18|
|    25-44|
|  UNKNOWN|
|      65+|
|    18-24|
|    45-64|
+---------+



#### The only value outside the expected age groups is `UNKNOWN`; there are no negative or unrealistically high age groups in this dataset.

#### Let's replace every value that is not a valid age group with `NaN`

In [None]:
# np.nan is used in place of the np.NaN alias, which was removed in NumPy 2.0.
valid_age_groups = ['<18','18-24','25-44','45-64','65+',np.nan]
df = df.withColumn('AGE_GROUP', when(df.AGE_GROUP.isin(valid_age_groups), df.AGE_GROUP).otherwise(np.nan))
df.show()

+-----------+------------+--------------------+------------------+-------------------+---------------------+---------+----+--------------------+-----------------+-------------+-----------------+---------------+---------------+------------------------+
|SUMMONS_KEY|SUMMONS_DATE| OFFENSE_DESCRIPTION|LAW_SECTION_NUMBER|    LAW_DESCRIPTION|SUMMONS_CATEGORY_TYPE|AGE_GROUP| SEX|                RACE|JURISDICTION_CODE|         BORO|PRECINCT_OF_OCCUR|X_COORDINATE_CD|Y_COORDINATE_CD|New Georeferenced Column|
+-----------+------------+--------------------+------------------+-------------------+---------------------+---------+----+--------------------+-----------------+-------------+-----------------+---------------+---------------+------------------------+
|  228901095|  05/31/2021|        NO TAX STAMP|            11-809|Administrative Code|                  TLC|      NaN|null|                null|                0|     BROOKLYN|               76|        1020325|         240688|    POINT (-73.869

In [None]:
df.select('AGE_GROUP').distinct().show()

+---------+
|AGE_GROUP|
+---------+
|      <18|
|    25-44|
|      65+|
|      NaN|
|    18-24|
|    45-64|
+---------+
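
The cleaning step above is an allow-list check: a value is kept if it is one of the known age groups and replaced otherwise. A minimal sketch of that logic in plain Python follows; the helper `clean_age_group` is hypothetical. It uses `None` as the replacement, which is an assumption that differs from the notebook: the final schema still lists `AGE_GROUP` as a string column, so the `NaN` shown above is stored as text.

```python
# Age groups considered valid in the cell above.
VALID_AGE_GROUPS = {'<18', '18-24', '25-44', '45-64', '65+'}


def clean_age_group(value):
    """Keep a valid age group; map anything else to None (hypothetical helper)."""
    return value if value in VALID_AGE_GROUPS else None


# Distinct AGE_GROUP values as printed before cleaning.
observed = ['<18', '25-44', 'UNKNOWN', '65+', '18-24', '45-64']
cleaned = [clean_age_group(v) for v in observed]

print(cleaned)  # ['<18', '25-44', None, '65+', '18-24', '45-64']
```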



In [None]:
df.show(n=5)

+-----------+------------+--------------------+------------------+-------------------+---------------------+---------+----+--------------------+-----------------+---------+-----------------+---------------+---------------+------------------------+
|SUMMONS_KEY|SUMMONS_DATE| OFFENSE_DESCRIPTION|LAW_SECTION_NUMBER|    LAW_DESCRIPTION|SUMMONS_CATEGORY_TYPE|AGE_GROUP| SEX|                RACE|JURISDICTION_CODE|     BORO|PRECINCT_OF_OCCUR|X_COORDINATE_CD|Y_COORDINATE_CD|New Georeferenced Column|
+-----------+------------+--------------------+------------------+-------------------+---------------------+---------+----+--------------------+-----------------+---------+-----------------+---------------+---------------+------------------------+
|  228901095|  05/31/2021|        NO TAX STAMP|            11-809|Administrative Code|                  TLC|      NaN|null|                null|                0| BROOKLYN|               76|        1020325|         240688|    POINT (-73.869649...|
|  22988

### Check for Race Values 

In [None]:
df.select('RACE').distinct().show()

+--------------------+
|                RACE|
+--------------------+
|               WHITE|
|               BLACK|
|AMERICAN INDIAN/A...|
|                null|
|      BLACK HISPANIC|
|      WHITE HISPANIC|
|             UNKNOWN|
|               OTHER|
|ASIAN / PACIFIC I...|
+--------------------+



#### Replace all `UNKNOWN` values with `NaN`

In [None]:
from pyspark.sql.functions import regexp_replace

df = df.withColumn("RACE",
  regexp_replace("RACE", "UNKNOWN", "NaN"))

In [None]:
df.show(100)

+-----------+------------+--------------------+------------------+-------------------+---------------------+---------+----+--------------------+-----------------+-------------+-----------------+---------------+---------------+------------------------+
|SUMMONS_KEY|SUMMONS_DATE| OFFENSE_DESCRIPTION|LAW_SECTION_NUMBER|    LAW_DESCRIPTION|SUMMONS_CATEGORY_TYPE|AGE_GROUP| SEX|                RACE|JURISDICTION_CODE|         BORO|PRECINCT_OF_OCCUR|X_COORDINATE_CD|Y_COORDINATE_CD|New Georeferenced Column|
+-----------+------------+--------------------+------------------+-------------------+---------------------+---------+----+--------------------+-----------------+-------------+-----------------+---------------+---------------+------------------------+
|  228901095|  05/31/2021|        NO TAX STAMP|            11-809|Administrative Code|                  TLC|      NaN|null|                null|                0|     BROOKLYN|               76|        1020325|         240688|    POINT (-73.869

In [None]:
df.select('RACE').distinct().show()

+--------------------+
|                RACE|
+--------------------+
|               WHITE|
|               BLACK|
|AMERICAN INDIAN/A...|
|                null|
|      BLACK HISPANIC|
|      WHITE HISPANIC|
|                 NaN|
|               OTHER|
|ASIAN / PACIFIC I...|
+--------------------+
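
`regexp_replace` rewrites every match of the pattern inside a value, so the unanchored pattern `UNKNOWN` would also alter a longer value that merely contains that word. None of the distinct `RACE` values shown here is affected, so the result is the same either way. The sketch below uses Python's `re` module as a stand-in to show the difference an anchored pattern makes; the sample value `UNKNOWN/OTHER` is hypothetical.

```python
import re

# Hypothetical value that contains the word but is not exactly 'UNKNOWN'.
value = "UNKNOWN/OTHER"

# Unanchored pattern: the substring is replaced.
substring_result = re.sub("UNKNOWN", "NaN", value)

# Anchored pattern: only a value that is exactly 'UNKNOWN' is replaced.
anchored_result = re.sub(r"^UNKNOWN$", "NaN", value)
exact_result = re.sub(r"^UNKNOWN$", "NaN", "UNKNOWN")

print(substring_result)  # NaN/OTHER
print(anchored_result)   # UNKNOWN/OTHER
print(exact_result)      # NaN
```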



### Checks for Sex

In [None]:
df.select('SEX').distinct().show()

+----+
| SEX|
+----+
|   F|
|null|
|   M|
|   U|
+----+



#### Counting the values in `SEX`

In [None]:
df.groupBy('SEX').count().orderBy('count', ascending=False).show()

+----+-----+
| SEX|count|
+----+-----+
|null|17436|
|   M|15830|
|   F| 1978|
|   U|   53|
+----+-----+
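
A quick sanity check on a grouped count is that the groups add up to the total number of rows. A minimal sketch using the figures printed above and the row count of 35297 reported by `df.count()`:

```python
# Counts per SEX value as printed above (None stands for null).
sex_counts = {None: 17436, 'M': 15830, 'F': 1978, 'U': 53}

# Row count reported by df.count().
total_rows = 35297

# The group counts should account for every row.
counts_total = sum(sex_counts.values())
print(counts_total == total_rows)  # True
```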



#### Get unique values of offense description in sorted order

In [None]:
df.select('OFFENSE_DESCRIPTION').distinct().orderBy('OFFENSE_DESCRIPTION', ascending=True).show()

+--------------------+
| OFFENSE_DESCRIPTION|
+--------------------+
|"DRIVER DOES NOT ...|
|"FAIL TO DISPLAY ...|
|"FAIL TO POST ""N...|
|"FAIL TO POST SIG...|
|"FAILURE TO CONSP...|
|"FAILURE TO DISPL...|
|"FAILURE TO DISPL...|
|"FAILURE TO POST ...|
|"NO ""BOOK OF DAI...|
|"NO ""CERTIFICATE...|
|"NO ""NO CHOKING"...|
|"NO SIGN STATING ...|
|           20-217(I)|
|      ACCEPT ON HAIL|
|ADVERTISE AS HACK...|
|AFTER HOURS CONSU...|
|AGG. UNLICENSED O...|
|AGGRESSIVE PANHAN...|
|AIR PISTOL/RIFLE;...|
| ALCOHOL IN THE PARK|
+--------------------+
only showing top 20 rows



#### Getting the Number of Distinct Offense Descriptions

In [None]:
df.select('OFFENSE_DESCRIPTION').distinct().count()

388

In [None]:
df.groupBy('OFFENSE_DESCRIPTION').count().show()

+--------------------+-----+
| OFFENSE_DESCRIPTION|count|
+--------------------+-----+
|IN A PUBLIC PLACE...|   14|
|REFUSE INSPECTION...|    6|
|      ACCEPT ON HAIL|   30|
|TRIP SHEET; WITH ...|    1|
|AIR PISTOL/RIFLE;...|   26|
|    RECKLESS DRIVING|  912|
|UNREASONABLE NOIS...|    2|
|  OTHER BUSINESS LAW|    7|
|"NO ""BOOK OF DAI...|   25|
|DISCON: REFUSE LA...|   28|
|UNREASONABLE NOIS...|    1|
|RECEIVING MONEY O...|   11|
|MARIJUANA, UNLAWF...|   44|
|HOURS; PERMITTED ...|   49|
|LITTERING PROHIBITED|   13|
|BOOK TO BE OPEN T...|    2|
|  NOISE (HORN/ALARM)|    6|
|TOW TRUCK; RATE/I...|    2|
|    PUBLIC URINATION|  174|
|TRIP SHEET; WITH ...|    5|
+--------------------+-----+
only showing top 20 rows



#### Calculating the null values present in the data columnwise (with respect to the features)

In [None]:
df.select([count(when(isnan(c) | col(c).isNull(), c)).alias(c) for c in df.columns]).show()

+-----------+------------+-------------------+------------------+---------------+---------------------+---------+-----+-----+-----------------+----+-----------------+---------------+---------------+------------------------+
|SUMMONS_KEY|SUMMONS_DATE|OFFENSE_DESCRIPTION|LAW_SECTION_NUMBER|LAW_DESCRIPTION|SUMMONS_CATEGORY_TYPE|AGE_GROUP|  SEX| RACE|JURISDICTION_CODE|BORO|PRECINCT_OF_OCCUR|X_COORDINATE_CD|Y_COORDINATE_CD|New Georeferenced Column|
+-----------+------------+-------------------+------------------+---------------+---------------------+---------+-----+-----+-----------------+----+-----------------+---------------+---------------+------------------------+
|          0|           0|                  0|                 0|           8936|                 8936|    17500|17436|17713|                0|   0|                0|            181|            181|                     181|
+-----------+------------+-------------------+------------------+---------------+---------------------+-

In [None]:
amount_missing_df = df.select([(count(when(isnan(c) | col(c).isNull(), c))/count(lit(1))).alias(c) for c in df.columns])
amount_missing_df.show()

+-----------+------------+-------------------+------------------+-------------------+---------------------+------------------+-------------------+------------------+-----------------+----+-----------------+--------------------+--------------------+------------------------+
|SUMMONS_KEY|SUMMONS_DATE|OFFENSE_DESCRIPTION|LAW_SECTION_NUMBER|    LAW_DESCRIPTION|SUMMONS_CATEGORY_TYPE|         AGE_GROUP|                SEX|              RACE|JURISDICTION_CODE|BORO|PRECINCT_OF_OCCUR|     X_COORDINATE_CD|     Y_COORDINATE_CD|New Georeferenced Column|
+-----------+------------+-------------------+------------------+-------------------+---------------------+------------------+-------------------+------------------+-----------------+----+-----------------+--------------------+--------------------+------------------------+
|        0.0|         0.0|                0.0|               0.0|0.25316599144403207|  0.25316599144403207|0.4957928435844406|0.49397965832790325|0.5018273507663541|             
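
Each fraction above is the number of missing entries in a column divided by the number of rows. A minimal sketch of that computation in plain Python; the helpers `is_missing` and `missing_fraction` and the four-element sample are hypothetical, while the counts 8936 and 35297 are taken from the outputs in this notebook.

```python
import math


def is_missing(value):
    """True for None and for float NaN (hypothetical helper for plain Python values)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def missing_fraction(values):
    """Fraction of entries that are missing; expects a non-empty sequence."""
    values = list(values)
    return sum(is_missing(v) for v in values) / len(values)


# Hypothetical column with two missing entries out of four.
sample = ['M', None, float('nan'), 'F']
sample_fraction = missing_fraction(sample)
print(sample_fraction)  # 0.5

# 8936 missing LAW_DESCRIPTION values out of 35297 rows, as reported earlier.
law_description_fraction = 8936 / 35297
print(round(law_description_fraction, 4))  # 0.2532
```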

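#### Compared with the counts taken before cleaning, the missing values for `AGE_GROUP` (0 to 17,500) and `RACE` (17,437 to 17,713) have gone up, because `UNKNOWN` entries are now stored as `NaN` and counted as missing; the other columns are unchanged. About 25% of rows lack `LAW_DESCRIPTION` and `SUMMONS_CATEGORY_TYPE`, and roughly half lack `AGE_GROUP`, `SEX`, and `RACE`.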

Count of rows per `JURISDICTION_CODE`

In [None]:
df.groupBy('JURISDICTION_CODE').count().show()

+-----------------+-----+
|JURISDICTION_CODE|count|
+-----------------+-----+
|                1| 1225|
|                2| 1562|
|                0|32510|
+-----------------+-----+



### Number of columns in Clean Data

In [None]:
len(df.columns)

15

### Number of rows in Clean Data

In [None]:
df.count()

35297

In [None]:
df.printSchema()

root
 |-- SUMMONS_KEY: integer (nullable = true)
 |-- SUMMONS_DATE: string (nullable = true)
 |-- OFFENSE_DESCRIPTION: string (nullable = true)
 |-- LAW_SECTION_NUMBER: string (nullable = true)
 |-- LAW_DESCRIPTION: string (nullable = true)
 |-- SUMMONS_CATEGORY_TYPE: string (nullable = true)
 |-- AGE_GROUP: string (nullable = true)
 |-- SEX: string (nullable = true)
 |-- RACE: string (nullable = true)
 |-- JURISDICTION_CODE: integer (nullable = true)
 |-- BORO: string (nullable = true)
 |-- PRECINCT_OF_OCCUR: integer (nullable = true)
 |-- X_COORDINATE_CD: integer (nullable = true)
 |-- Y_COORDINATE_CD: integer (nullable = true)
 |-- New Georeferenced Column: string (nullable = true)



### **Exporting Clean Data in CSV**

The cleaned dataset will be saved as `NYPD_Criminal_Court_Summons_Incident_Level_Data_Year_To_Date_Cleaned_Spark.csv`

In [None]:
pd_df = df.toPandas()
pd_df.to_csv("NYPD_Criminal_Court_Summons_Incident_Level_Data_Year_To_Date_Cleaned_Spark.csv")