# NYC 311 Service Requests from 2010 to present - Data Profiling and Data Cleaning


The dataset that is used for all the examples is the [311 Service Requests from 2010 to Present](https://data.cityofnewyork.us/Social-Services/311-Service-Requests-from-2010-to-Present/erm2-nwe9), which contains 311 calls or service requests made from 2010 to present.

The dataset consists of over 27 million rows with information about the time, complaint type, location, and status of each 311 call.
The dataset was accessed with the [Socrata Open Data API (SODA)](https://dev.socrata.com/).

# **NULL Value Handling**

In [None]:
#from google.colab import drive
#drive.mount('/content/drive/')



Installing required libraries

In [None]:
!pip install pyspark
!pip install openclean
!pip install uszipcode
!pip install humanfriendly



In [None]:
import pandas as pd
import uszipcode
import gzip
import humanfriendly
import os

In [None]:
from openclean.data.source.socrata import Socrata
from openclean.profiling.column import DefaultColumnProfiler
from openclean.profiling.anomalies.sklearn import DBSCANOutliers
from openclean.cluster.key import KeyCollision
from openclean.function.value.key.fingerprint import Fingerprint

In [None]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark import SparkContext
from pyspark.sql.types import *
from pyspark.ml.feature import *

spark = SparkSession.builder.appName("BDProject").getOrCreate()
sc = spark.sparkContext

In [None]:
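# Note: Spark's CSV reader does not recognise "infer_schema" or
# "first_row_is_header" and silently ignores them, so only "header" takes
# effect and every column is read as a string (see the schema below).
# Use .option("inferSchema", "true") if typed columns are wanted.
df_full_data = (
    spark.read.format("csv")
    .option("header", "true")
    .load(r"../data/311_Service_Requests_from_2010_to_Present.csv")
)
df_full_data.createOrReplaceTempView("df_full_data")
df_full_data.show()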

+----------+--------------------+--------------------+------+--------------------+--------------------+--------------------+--------------------+------------+--------------------+-----------------+---------------+--------------+---------------------+---------------------+------------+----------------+--------+-------------+------+--------------------+----------------------+------------------------------+------------------+----------+---------+--------------------------+--------------------------+----------------------+------------------+------------+------------+--------------------+---------------------+-------------------+------------------------+---------+----------------------+------------------+------------------+--------------------+
|Unique Key|        Created Date|         Closed Date|Agency|         Agency Name|      Complaint Type|          Descriptor|       Location Type|Incident Zip|    Incident Address|      Street Name| Cross Street 1|Cross Street 2|Intersection Street 1|

In [None]:
df_full_data.count()

27130961

In [None]:
df_full_data.printSchema()

root
 |-- Unique Key: string (nullable = true)
 |-- Created Date: string (nullable = true)
 |-- Closed Date: string (nullable = true)
 |-- Agency: string (nullable = true)
 |-- Agency Name: string (nullable = true)
 |-- Complaint Type: string (nullable = true)
 |-- Descriptor: string (nullable = true)
 |-- Location Type: string (nullable = true)
 |-- Incident Zip: string (nullable = true)
 |-- Incident Address: string (nullable = true)
 |-- Street Name: string (nullable = true)
 |-- Cross Street 1: string (nullable = true)
 |-- Cross Street 2: string (nullable = true)
 |-- Intersection Street 1: string (nullable = true)
 |-- Intersection Street 2: string (nullable = true)
 |-- Address Type: string (nullable = true)
 |-- City: string (nullable = true)
 |-- Landmark: string (nullable = true)
 |-- Facility Type: string (nullable = true)
 |-- Status: string (nullable = true)
 |-- Due Date: string (nullable = true)
 |-- Resolution Description: string (nullable = true)
 |-- Resolution Action

In [None]:
df_full_data = df_full_data.withColumnRenamed("Unique Key","Unique_Key")

In [None]:
df_full_data = df_full_data.withColumnRenamed("Created Date","Created_Date")
df_full_data = df_full_data.withColumnRenamed("Closed Date","Closed_Date")
df_full_data = df_full_data.withColumnRenamed("Agency Name","Agency_Name")
df_full_data = df_full_data.withColumnRenamed("Complaint Type","Complaint_Type")
df_full_data = df_full_data.withColumnRenamed("Location Type","Location_Type")
df_full_data = df_full_data.withColumnRenamed("Incident Zip","Incident_Zip")
df_full_data = df_full_data.withColumnRenamed("Incident Address","Incident_Address")
df_full_data = df_full_data.withColumnRenamed("Street Name","Street_Name")
df_full_data = df_full_data.withColumnRenamed("Cross Street 1","Cross_Street_1")
df_full_data = df_full_data.withColumnRenamed("Cross Street 2","Cross_Street_2")
df_full_data = df_full_data.withColumnRenamed("Intersection Street 1","Intersection_Street_1")
df_full_data = df_full_data.withColumnRenamed("Intersection Street 2","Intersection_Street_2")
df_full_data = df_full_data.withColumnRenamed("Address Type","Address_Type")
df_full_data = df_full_data.withColumnRenamed("Facility Type","Facility_Type")
df_full_data = df_full_data.withColumnRenamed("Due Date","Due_Date")
df_full_data = df_full_data.withColumnRenamed("Resolution Description","Resolution_Description")
df_full_data = df_full_data.withColumnRenamed("Resolution Action Updated Date","Resolution_Action_Updated_Date")
df_full_data = df_full_data.withColumnRenamed("Community Board","Community_Board")
df_full_data = df_full_data.withColumnRenamed("Park Borough","Park_Borough")
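
The renames above all follow one rule: replace each space in a column name with an underscore. A minimal sketch of that rule as a plain-Python helper follows. `underscore_names` is a hypothetical name that does not appear in the notebook, and the sample list is only a handful of names from the schema printed earlier.

```python
def underscore_names(columns):
    """Map each column name to a version with spaces replaced by underscores."""
    return {name: name.replace(" ", "_") for name in columns}


# A few column names taken from the schema printed above.
sample_columns = ["Unique Key", "Created Date", "Agency", "Resolution Action Updated Date"]
renames = underscore_names(sample_columns)

for old, new in renames.items():
    print(f"{old!r} -> {new!r}")

# With a Spark DataFrame the mapping could be applied in a loop, e.g.:
#   for old, new in renames.items():
#       df_full_data = df_full_data.withColumnRenamed(old, new)
```

Names without spaces, such as `Agency`, map to themselves, so the helper is safe to run over every column.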

In [None]:
df_full_data.printSchema()

root
 |-- Unique_Key: string (nullable = true)
 |-- Created_Date: string (nullable = true)
 |-- Closed_Date: string (nullable = true)
 |-- Agency: string (nullable = true)
 |-- Agency_Name: string (nullable = true)
 |-- Complaint_Type: string (nullable = true)
 |-- Descriptor: string (nullable = true)
 |-- Location_Type: string (nullable = true)
 |-- Incident_Zip: string (nullable = true)
 |-- Incident_Address: string (nullable = true)
 |-- Street_Name: string (nullable = true)
 |-- Cross_Street_1: string (nullable = true)
 |-- Cross_Street_2: string (nullable = true)
 |-- Intersection_Street_1: string (nullable = true)
 |-- Intersection_Street_2: string (nullable = true)
 |-- Address_Type: string (nullable = true)
 |-- City: string (nullable = true)
 |-- Landmark: string (nullable = true)
 |-- Facility_Type: string (nullable = true)
 |-- Status: string (nullable = true)
 |-- Due_Date: string (nullable = true)
 |-- Resolution_Description: string (nullable = true)
 |-- Resolution_Action

## **Working with NULL Values**

Drop rows in which every column value is null.

In [None]:
df_full_data = df_full_data.na.drop("all")

As the count below shows, the data did not contain any rows in which every column was null.

In [None]:
df_full_data.count()

27130961
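
The unchanged count follows from how `na.drop("all")` works: a row is removed only when every column is null, whereas the default `"any"` mode removes a row as soon as one column is null. The sketch below mimics both modes on plain Python dictionaries. The rows and the `drop_null_rows` helper are made up for illustration and are not part of the notebook or of PySpark.

```python
def drop_null_rows(rows, how="any"):
    """Mimic DataFrame.na.drop on a list of dicts, treating None as null."""
    if how not in ("any", "all"):
        raise ValueError("how must be 'any' or 'all'")
    check = any if how == "any" else all
    # Keep a row unless any/all of its values are null.
    return [row for row in rows if not check(v is None for v in row.values())]


rows = [
    {"Unique_Key": "1", "Status": "Closed", "Landmark": None},
    {"Unique_Key": "2", "Status": None, "Landmark": None},
    {"Unique_Key": None, "Status": None, "Landmark": None},
]

kept_all = drop_null_rows(rows, how="all")  # drops only the fully null row
kept_any = drop_null_rows(rows, how="any")  # drops every row with a null
print(len(kept_all), len(kept_any))
```

On sparse data such as this dataset, the `"any"` mode would discard most rows, which is presumably why the notebook uses `"all"` here.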

Drop Columns containing more than 75% null values.

In [None]:
total_rows = df_full_data.count()  # count once, not once per column
null_col_data = df_full_data.select([(count(when(col(c).isNull(), c))/total_rows).alias(c) for c in df_full_data.columns]).collect()
#null_col_data

`null_col_data` holds the fraction of null values in each column of the dataset, expressed in the range 0 to 1 rather than as a percentage.

We will remove columns containing more than 75% Null values.
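
To make the computation concrete, the null fraction of a column is the number of null entries divided by the total row count. A minimal pure-Python sketch on invented rows follows; the `null_fractions` helper and the sample values are hypothetical and not taken from the dataset.

```python
def null_fractions(rows, columns):
    """Return {column: fraction of rows where the column is None}."""
    total = len(rows)
    if total == 0:
        return {c: 0.0 for c in columns}
    return {c: sum(1 for r in rows if r.get(c) is None) / total for c in columns}


# Invented rows, for illustration only.
rows = [
    {"Agency": "NYPD", "Landmark": None, "City": "BRONX"},
    {"Agency": "HPD", "Landmark": None, "City": None},
    {"Agency": "DOT", "Landmark": None, "City": "BROOKLYN"},
    {"Agency": "DSNY", "Landmark": "CENTRAL PARK", "City": "NEW YORK"},
]

fractions = null_fractions(rows, ["Agency", "Landmark", "City"])
print(fractions)
```

Note that a column sitting exactly at 0.75 would not be removed by a strict "more than 75%" rule.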

In [None]:
#Current number of columns in the dataset including the index column.
len(df_full_data.columns)

41

In [None]:
#Converting Null value data to a dictionary
null_dict_list = [row.asDict() for row in null_col_data]
null_dict = null_dict_list[0] 

In [None]:
null_dict

{'Address_Type': 0.52,
 'Agency': 0.0,
 'Agency_Name': 0.0,
 'BBL': 0.5459406666666666,
 'Borough': 0.012646,
 'Bridge Highway Direction': 0.9948233333333333,
 'Bridge Highway Name': 0.9947166666666667,
 'Bridge Highway Segment': 0.9946873333333334,
 'City': 0.3392153333333333,
 'Closed_Date': 0.04949,
 'Community_Board': 0.012646,
 'Complaint_Type': 0.0,
 'Created_Date': 0.0,
 'Cross_Street_1': 0.5335506666666666,
 'Cross_Street_2': 0.54303,
 'Descriptor': 0.0032593333333333333,
 'Due_Date': 0.682708,
 'Facility_Type': 0.44323,
 'Incident_Address': 0.34485266666666664,
 'Incident_Zip': 0.3192426666666667,
 'Intersection_Street_1': 0.6206986666666666,
 'Intersection_Street_2': 0.6230033333333334,
 'Landmark': 0.762316,
 'Latitude': 0.47852466666666665,
 'Location': 0.47851266666666664,
 'Location_Type': 0.305324,
 'Longitude': 0.47852466666666665,
 'Open Data Channel Type': 1.2666666666666667e-05,
 'Park Facility Name': 1.2666666666666667e-05,
 'Park_Borough': 0.012646,
 'Resolution_Ac

Finding Columns containing more than 75% null values.

In [None]:
col_null_75p=list({i for i in null_dict if null_dict[i] > 0.75})
print(col_null_75p)

['Bridge Highway Segment', 'Road Ramp', 'Landmark', 'Vehicle Type', 'Bridge Highway Name', 'Taxi Pick Up Location', 'Taxi Company Borough', 'Bridge Highway Direction']
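
The selection step can also be written so that the result order is deterministic, which a set comprehension does not guarantee. A sketch using a handful of the fractions printed earlier; `columns_above` is a hypothetical helper name.

```python
def columns_above(null_fractions, threshold=0.75):
    """Return, sorted by name, the columns whose null fraction exceeds threshold."""
    return sorted(c for c, frac in null_fractions.items() if frac > threshold)


# Subset of the null fractions shown in the notebook output above.
sample = {
    "Agency": 0.0,
    "Due_Date": 0.682708,
    "Landmark": 0.762316,
    "Bridge Highway Name": 0.9947166666666667,
    "Bridge Highway Segment": 0.9946873333333334,
}

to_drop = columns_above(sample)
print(to_drop)
```

Sorting makes the printed list stable across runs, which is convenient when comparing notebook outputs.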


In [None]:
df_data = df_full_data.drop(*col_null_75p)

Specified columns dropped.

In [None]:
df_data.show()

+---+----------+--------------------+--------------------+------+--------------------+--------------------+--------------------+-------------------+------------+--------------------+----------------+----------------+-------------------+---------------------+---------------------+------------+----------------+-------------+-----------+--------+----------------------+------------------------------+------------------+----------+-----------+--------------------------+--------------------------+----------------------+------------------+------------+------------------+------------------+--------------------+
|_c0|Unique_Key|        Created_Date|         Closed_Date|Agency|         Agency_Name|      Complaint_Type|          Descriptor|      Location_Type|Incident_Zip|    Incident_Address|     Street_Name|  Cross_Street_1|     Cross_Street_2|Intersection_Street_1|Intersection_Street_2|Address_Type|            City|Facility_Type|     Status|Due_Date|Resolution_Description|Resolution_Action_Upda

Dropping rows with null values in selected columns.

Certain columns are essential to the data: a row that is missing any of these values has lost important information.

In [None]:
# DataFrames are immutable, so the filtered result must be assigned back.
required_cols = ["Unique_Key", "Status", "Created_Date", "Latitude", "Longitude", "Location"]
df_data = df_data.na.drop(subset=required_cols)
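
The intent of the filters above is to keep only rows in which every listed column is non-null. Since the null fractions printed earlier put `Latitude` at roughly 0.48, it can help to check how many rows such a filter keeps before relying on it. A minimal sketch on invented rows follows; `keep_complete` is a hypothetical helper that mirrors what `na.drop(subset=...)` does in PySpark.

```python
REQUIRED = ["Unique_Key", "Status", "Created_Date", "Latitude", "Longitude", "Location"]


def keep_complete(rows, required=REQUIRED):
    """Keep only rows where every required column is present and not None."""
    return [r for r in rows if all(r.get(c) is not None for c in required)]


# Invented rows, for illustration only.
rows = [
    {"Unique_Key": "1", "Status": "Closed", "Created_Date": "2015-01-01",
     "Latitude": "40.7", "Longitude": "-73.9", "Location": "(40.7, -73.9)"},
    {"Unique_Key": "2", "Status": "Open", "Created_Date": "2015-01-02",
     "Latitude": None, "Longitude": None, "Location": None},
]

complete = keep_complete(rows)
print(f"kept {len(complete)} of {len(rows)} rows")
```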

Dropping the Park_Borough column, since it holds the same data as the Borough column.
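
Before dropping a column as redundant, the claim that it duplicates another can be checked row by row. A minimal sketch on invented rows follows, with a hypothetical `columns_match` helper; in PySpark a comparable null-safe comparison is available through `Column.eqNullSafe`.

```python
def columns_match(rows, left, right):
    """Return True if both columns hold equal values in every row (None == None)."""
    return all(r.get(left) == r.get(right) for r in rows)


# Invented rows, for illustration only.
rows = [
    {"Borough": "BRONX", "Park_Borough": "BRONX"},
    {"Borough": "QUEENS", "Park_Borough": "QUEENS"},
    {"Borough": None, "Park_Borough": None},
]

same_before = columns_match(rows, "Borough", "Park_Borough")
print(same_before)

# A single differing row is enough to show the columns are not duplicates.
rows.append({"Borough": "BROOKLYN", "Park_Borough": "Unspecified"})
same_after = columns_match(rows, "Borough", "Park_Borough")
print(same_after)
```

The identical null fractions reported earlier for `Borough` and `Park_Borough` (0.012646 each) are consistent with the two columns being duplicates, but they do not prove it; a row-level check like this one does.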

In [None]:
df_data = df_data.drop("Park_Borough")

In [None]:
df_data.show()

+---+----------+--------------------+--------------------+------+--------------------+--------------------+--------------------+-------------------+------------+--------------------+----------------+----------------+-------------------+---------------------+---------------------+------------+----------------+-------------+-----------+--------+----------------------+------------------------------+------------------+----------+-----------+--------------------------+--------------------------+----------------------+------------------+------------------+------------------+--------------------+
|_c0|Unique_Key|        Created_Date|         Closed_Date|Agency|         Agency_Name|      Complaint_Type|          Descriptor|      Location_Type|Incident_Zip|    Incident_Address|     Street_Name|  Cross_Street_1|     Cross_Street_2|Intersection_Street_1|Intersection_Street_2|Address_Type|            City|Facility_Type|     Status|Due_Date|Resolution_Description|Resolution_Action_Updated_Date|   C