# Profiling 

This notebook goes over the problems that we found while exploring the sample dataset (~5 million rows). We group the problems into the following sections: **Uniformity, Accuracy, Inconsistency, Completeness, and Outlier**. Each section contains cells that can be run to explore the dataset and observe the problems that we found.

## Mount Google Drive & Create Dataset Folder

**NOTE: THIS IS FOR GOOGLE COLAB!**
(If you are not running on Colab, you should manually create the directories, upload the dataset, and set the **WORKING_DIRECTORY** and **datafile** paths.)

Before we start exploring the dataset, we need to mount our Google Drive to this notebook. Run the following cells to mount Google Drive, create the directories leading to the dataset, and define the path to the dataset.

In [None]:
# Mount Google Drive
from google.colab import drive

drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# Create evq_311_service_proj and evq_311_service_proj/dataset folders
import os

WORKING_DIRECTORY = "/content/drive/MyDrive/evq_311_service_proj"

# Create both folders if they do not already exist; exist_ok=True also covers
# the case where the project folder exists but dataset/ does not
os.makedirs(os.path.join(WORKING_DIRECTORY, "dataset"), exist_ok=True)

os.chdir(WORKING_DIRECTORY)  # Change directory to WORKING_DIRECTORY
os.getcwd()     # Display current working directory
                # (Note: this should display WORKING_DIRECTORY)

'/content/drive/MyDrive/evq_311_service_proj'

In [None]:
# Define dataset path (NOTE: replace "erm2-nwe9_5M.csv.gz" with your dataset name)
datafile = "./dataset/erm2-nwe9_5M.csv.gz"

By the end of this section, the current directory should be the **WORKING_DIRECTORY** path and the **datafile** path should be set. The final step is to **upload the dataset** into the folder expected by the **datafile** path, if this has not already been done.

## Set Up pyspark

Now that the dataset path is set up and the dataset is uploaded, we need to set up pyspark to begin profiling and exploring our dataset. 

Run through the following cells to set up pyspark.

In [None]:
# Run this cell if pyspark is not already installed
%pip install pyspark

Collecting pyspark
  Downloading pyspark-3.2.0.tar.gz (281.3 MB)
Collecting py4j==0.10.9.2
  Downloading py4j-0.10.9.2-py2.py3-none-any.whl (198 kB)
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.2.0-py2.py3-none-any.whl size=281805912 sha256=35981a8bbe6861ea30c0658186038a76e78090b2261799c3f8e622ee3ee9b16b
  Stored in directory: /root/.cache/pip/wheels/0b/de/d2/9be5d59d7331c6c2a7c1b6d1a4f463ce107332b1ecd4e80718
Successfully built pyspark
Installing collected packages: py4j, pyspark
Successfully installed py4j-0.10.9.2 pyspark-3.2.0


In [None]:
# Set up pyspark session
from pyspark.sql import SparkSession

spark = SparkSession \
            .builder \
            .appName("Python Spark SQL basic example") \
            .config("spark.some.config.option", "some-value") \
            .config("spark.executor.memory", "20g") \
            .config("spark.driver.memory", "20g") \
            .getOrCreate()

df = spark.read.csv(datafile, header=True)

**WARNING:** If you run into a Java heap memory error, increase the memory settings in the following lines of the cell above:


*  .config("spark.executor.memory", "30g") \
*  .config("spark.driver.memory", "30g") \

Change the number in front of the g (e.g. from 20g to 30g).

Changing this number could also change the amount of RAM needed to download the final file.

Run the following cell to check that pyspark is set up properly (you should be able to view the schema of the dataset that you loaded).

In [None]:
# Run this code to see the schema of the loaded dataset 
df.printSchema()

root
 |-- Unique Key: string (nullable = true)
 |-- Created Date: string (nullable = true)
 |-- Closed Date: string (nullable = true)
 |-- Agency: string (nullable = true)
 |-- Agency Name: string (nullable = true)
 |-- Complaint Type: string (nullable = true)
 |-- Descriptor: string (nullable = true)
 |-- Location Type: string (nullable = true)
 |-- Incident Zip: string (nullable = true)
 |-- Incident Address: string (nullable = true)
 |-- Street Name: string (nullable = true)
 |-- Cross Street 1: string (nullable = true)
 |-- Cross Street 2: string (nullable = true)
 |-- Intersection Street 1: string (nullable = true)
 |-- Intersection Street 2: string (nullable = true)
 |-- Address Type: string (nullable = true)
 |-- City: string (nullable = true)
 |-- Landmark: string (nullable = true)
 |-- Facility Type: string (nullable = true)
 |-- Status: string (nullable = true)
 |-- Due Date: string (nullable = true)
 |-- Resolution Description: string (nullable = true)
 |-- Resolution Action

Notice in the result of running the above cell that every column in the schema is of type string, even where that is not the expected type. To give each column the type we would expect, we cast every column that should not be of type string.

Run the following cell to complete the type casting and view the updated schema.

In [None]:
from pyspark.sql.types import IntegerType, LongType, DoubleType
from pyspark.sql.functions import to_timestamp

# Date columns in the raw file look like "01/31/2015 09:15:00 PM"
TIMESTAMP_FORMAT = "MM/dd/yyyy hh:mm:ss a"

# Type casting to expected types
df = df.withColumn("Unique Key", df["Unique Key"].cast(IntegerType()))
df = df.withColumn("Due Date", to_timestamp(df["Due Date"], TIMESTAMP_FORMAT))
df = df.withColumn("Created Date", to_timestamp(df["Created Date"], TIMESTAMP_FORMAT))
df = df.withColumn("Closed Date", to_timestamp(df["Closed Date"], TIMESTAMP_FORMAT))
df = df.withColumn("Incident Zip", df["Incident Zip"].cast(IntegerType()))
# BBL is a 10-digit identifier, which can overflow a 32-bit integer and would
# then be cast to null, so use LongType instead of IntegerType
df = df.withColumn("BBL", df["BBL"].cast(LongType()))
df = df.withColumn("X Coordinate (State Plane)", df["X Coordinate (State Plane)"].cast(IntegerType()))
df = df.withColumn("Y Coordinate (State Plane)", df["Y Coordinate (State Plane)"].cast(IntegerType()))
df = df.withColumn("Latitude", df["Latitude"].cast(DoubleType()))
df = df.withColumn("Longitude", df["Longitude"].cast(DoubleType()))
df = df.withColumn("Resolution Action Updated Date", to_timestamp(df["Resolution Action Updated Date"], TIMESTAMP_FORMAT))

df.printSchema()

root
 |-- Unique Key: integer (nullable = true)
 |-- Created Date: timestamp (nullable = true)
 |-- Closed Date: timestamp (nullable = true)
 |-- Agency: string (nullable = true)
 |-- Agency Name: string (nullable = true)
 |-- Complaint Type: string (nullable = true)
 |-- Descriptor: string (nullable = true)
 |-- Location Type: string (nullable = true)
 |-- Incident Zip: integer (nullable = true)
 |-- Incident Address: string (nullable = true)
 |-- Street Name: string (nullable = true)
 |-- Cross Street 1: string (nullable = true)
 |-- Cross Street 2: string (nullable = true)
 |-- Intersection Street 1: string (nullable = true)
 |-- Intersection Street 2: string (nullable = true)
 |-- Address Type: string (nullable = true)
 |-- City: string (nullable = true)
 |-- Landmark: string (nullable = true)
 |-- Facility Type: string (nullable = true)
 |-- Status: string (nullable = true)
 |-- Due Date: timestamp (nullable = true)
 |-- Resolution Description: string (nullable = true)
 |-- Resolu

Now that pyspark is set up and the columns of the dataset are updated to types that we expect, we can start using pyspark to explore the dataset!

## I. Uniformity

We describe uniform data as data that use the same format and unit of measure. As we explored our dataset, we found five columns with values that have non-uniform casing (i.e. some values are fully uppercased, some are fully lowercased, and some only have the first letter of each word uppercased). 
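
One way to quantify this kind of problem is to label each distinct value by its casing style and count the labels. The following is a minimal sketch in plain Python rather than pyspark; `casing_style` and `sample_values` are illustrative names that are not part of the original notebook, and the sample strings are values that appear in the columns explored in this section.

```python
from collections import Counter


def casing_style(value):
    """Label a string as 'upper', 'lower', 'title', 'mixed', or 'no letters'."""
    if value is None or not any(ch.isalpha() for ch in value):
        return "no letters"
    if value.isupper():
        return "upper"
    if value.islower():
        return "lower"
    if value == value.title():
        return "title"
    return "mixed"


# Illustrative sample of values seen in this section (None stands for a null)
sample_values = [
    "Traffic Signal Condition",
    "SAFETY",
    "Cranes and Derricks",
    "ELECTRIC",
    "east gun hill road",
    None,
]

# Count how many values fall into each casing style
style_counts = Counter(casing_style(v) for v in sample_values)
print(style_counts)
```

A column whose counts are spread over several styles is a candidate for case normalization. Note that "Cranes and Derricks" is labeled "mixed" because Python's `str.title()` would capitalize "and".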

Run through the following cells to observe the problem in the five columns. 

"Complaint Type" column has values that have nonuniform casing (ex: one row is "Traffic Signal Condition" while another row is "SAFETY"). 

In [None]:
# View column "Complaint Type" 
df.select("Complaint Type").distinct().collect()

[Row(Complaint Type='Traffic Signal Condition'),
 Row(Complaint Type='SAFETY'),
 Row(Complaint Type='Cranes and Derricks'),
 Row(Complaint Type='ELECTRIC'),
 Row(Complaint Type='Animal-Abuse'),
 Row(Complaint Type='Street Sweeping Complaint'),
 Row(Complaint Type='Tanning'),
 Row(Complaint Type='DOOR/WINDOW'),
 Row(Complaint Type='Comments'),
 Row(Complaint Type='Noise - Helicopter'),
 Row(Complaint Type='STRUCTURAL'),
 Row(Complaint Type='Broken Parking Meter'),
 Row(Complaint Type='Fire Alarm - New System'),
 Row(Complaint Type='Window Guard'),
 Row(Complaint Type='Broken Muni Meter'),
 Row(Complaint Type='Highway Condition'),
 Row(Complaint Type='FLOORING/STAIRS'),
 Row(Complaint Type='Street Condition'),
 Row(Complaint Type='Hazardous Materials'),
 Row(Complaint Type='Illegal Dumping'),
 Row(Complaint Type='Vending'),
 Row(Complaint Type='Ferry Permit'),
 Row(Complaint Type='PAINT - PLASTER'),
 Row(Complaint Type='OUTSIDE BUILDING'),
 Row(Complaint Type='Taxi Report'),
 Row(Complai

"Descriptor" column has the same problem of nonuniform casing, as previously observed in "Complaint Type" column.

In [None]:
# View column "Descriptor" 
df.select("Descriptor").distinct().collect()

[Row(Descriptor='Lamppost Damaged'),
 Row(Descriptor='Plate Condition - Shifted'),
 Row(Descriptor='BBQ Outside Authorized Area'),
 Row(Descriptor='Ready NY - Seniors and Disabled - Spanish'),
 Row(Descriptor='4 DSNY Spillage'),
 Row(Descriptor='Containers Set Out Too Early'),
 Row(Descriptor='Hydrant Running Full (WA4)'),
 Row(Descriptor='Miscellaneous'),
 Row(Descriptor='E10 Street Obstruction'),
 Row(Descriptor='Glassware Hanging'),
 Row(Descriptor='Personal Veteran Exemption'),
 Row(Descriptor='Inside Apartment'),
 Row(Descriptor='High Grass'),
 Row(Descriptor='SCRIE Assistance'),
 Row(Descriptor='ER9 FOAM VIOL. RETAILER'),
 Row(Descriptor='Apartment'),
 Row(Descriptor='SHOWER-STALL'),
 Row(Descriptor='Broken Curb'),
 Row(Descriptor='Tortured'),
 Row(Descriptor='Tax Commission Rules'),
 Row(Descriptor='Garbage'),
 Row(Descriptor='Fire Alarm Lamp Out'),
 Row(Descriptor='List of Outstanding Tickets'),
 Row(Descriptor='Illegal Activity by Phone'),
 Row(Descriptor='Home Instruction'),


"Location Type" column is generally uniform except for a single row with the value "RESIDENTIAL BUILDING".

In [None]:
# View column "Location Type"
df.select("Location Type").distinct().collect()

[Row(Location Type='Apartment'),
 Row(Location Type='House and Store'),
 Row(Location Type='Ferry'),
 Row(Location Type='Private School'),
 Row(Location Type='Other (Explain Below)'),
 Row(Location Type='Public Park/Garden'),
 Row(Location Type='Roadway'),
 Row(Location Type='Loft Residence'),
 Row(Location Type='Condo Unit'),
 Row(Location Type='Cafeteria'),
 Row(Location Type='House of Worship'),
 Row(Location Type='Store/Commercial'),
 Row(Location Type='1-3 Family Mixed Use Building'),
 Row(Location Type='Building (Non-Residential)'),
 Row(Location Type='Public Area'),
 Row(Location Type='Golf'),
 Row(Location Type='Home'),
 Row(Location Type='Grocery Store'),
 Row(Location Type='Street Vendor'),
 Row(Location Type='Cafeteria - Private School'),
 Row(Location Type='Tanning Salon'),
 Row(Location Type='Abandoned Building'),
 Row(Location Type='Cemetery'),
 Row(Location Type='Subway'),
 Row(Location Type='Street Fair Vendor'),
 Row(Location Type='Co-Op Unit'),
 Row(Location Type='Par

"Street Name" column has nonuniform casing as well (ex: one row has "east gun hill road" while the majority of the rows are uppercased like "ALBERTA AVE").

In [None]:
# View column "Street Name" 
df.select("Street Name").distinct().collect()

[Row(Street Name='47 AVENUE'),
 Row(Street Name='WEST  148 STREET'),
 Row(Street Name='148 AVENUE'),
 Row(Street Name='KING STREET'),
 Row(Street Name='THORNHILL AVENUE'),
 Row(Street Name='SHARON AVENUE'),
 Row(Street Name='150 PLACE'),
 Row(Street Name='BARCLAY AVENUE'),
 Row(Street Name='WELDON STREET'),
 Row(Street Name='STARR STREET'),
 Row(Street Name='RUPERT AVENUE'),
 Row(Street Name='WEST   46 STREET'),
 Row(Street Name='132 AVENUE'),
 Row(Street Name='EAST 232 STREET'),
 Row(Street Name='205 PLACE'),
 Row(Street Name='HONEYWELL STREET'),
 Row(Street Name='EAST 145 STREET'),
 Row(Street Name='MUNDY LANE'),
 Row(Street Name='COURT SQUARE'),
 Row(Street Name='MACORMAC PLACE'),
 Row(Street Name='EAST  121 STREET'),
 Row(Street Name='WEST 12 STREET'),
 Row(Street Name='ANNA DALE ROAD'),
 Row(Street Name='BEACH  105 STREET'),
 Row(Street Name='ABINGDON ROAD'),
 Row(Street Name='BETHUNE STREET'),
 Row(Street Name='WILLETT AVENUE'),
 Row(Street Name="ST ANTHONY'S PLACE"),
 Row(Street

"City" column has nonuniform casing (ex: one row has "New York" while the majority of the rows are all uppercased like "ROCKVILLE CENTER").

In [None]:
# View column "City"
df.select("City").distinct().collect()

[Row(City='PRINCETON'),
 Row(City='HUDSON COUNTY'),
 Row(City='HUSTON'),
 Row(City='NESCONSET'),
 Row(City='WATERTOWN'),
 Row(City='RARITAN'),
 Row(City='SOUTH FLORAL PARK'),
 Row(City='Corona'),
 Row(City='VALLEY STREAM'),
 Row(City='STAMFORD'),
 Row(City='ISLAND'),
 Row(City='SELINA'),
 Row(City='CHINO'),
 Row(City='RIDGEFIELD PARK'),
 Row(City='ELMHERST'),
 Row(City='BOLTIMORE'),
 Row(City='ALPHARATTA'),
 Row(City='DEAR PARK'),
 Row(City='LIDO BEACH'),
 Row(City='SOMERVILLE'),
 Row(City='NEW JERSEY'),
 Row(City='WANTGAH'),
 Row(City='VALLY STREAM'),
 Row(City='BROOKYLN'),
 Row(City='WESTBURY'),
 Row(City='ORLANDO'),
 Row(City='SAVANNAH'),
 Row(City='SOUTH GATE'),
 Row(City='DEER PARK'),
 Row(City='Queens'),
 Row(City='NS'),
 Row(City='STE'),
 Row(City='Rego Park'),
 Row(City='FAIR LAWN'),
 Row(City='NEWARK AIRPORT'),
 Row(City='NEWHYDE PARK'),
 Row(City='HEWLETT TOWN'),
 Row(City='RIDGEFIELD'),
 Row(City='TUXEDO PARK'),
 Row(City='HUNGTINGTON'),
 Row(City='SPRING VALLEY'),
 Row(City

## II. Accuracy

We describe accurate data as data that is close to the true, expected values. As we explored our dataset, we found several inaccurate values in the City column. To explore the accuracy of the City column, we found and downloaded a dataset ([uszips.csv](https://drive.google.com/file/d/1qd2cXgTx-h-hRd0C7z2s_U4O8VYLAXA7/view?usp=sharing)) that contains information such as US zip code, state name, and city. We use this dataset as a baseline against which to compare our dataset and find inaccuracies, such as misspellings.

Before we start, make sure to download and upload the [uszips.csv](https://drive.google.com/file/d/1qd2cXgTx-h-hRd0C7z2s_U4O8VYLAXA7/view?usp=sharing) dataset into the same folder/directory as where our dataset resides (i.e. for this colab notebook environment, the current working directory path should be /content/drive/MyDrive/evq_311_service_proj and the datasets should all be placed in the dataset/ folder inside the current working directory).

In [None]:
# Define path for US zip dataset
uszip_path = "./dataset/uszips.csv"

In [None]:
# Read the US zip dataset
us = spark.read.csv(uszip_path, header=True)
us = us.withColumn("zip",us["zip"].cast(IntegerType()))
us = us.withColumn("lat",us["lat"].cast(DoubleType()))
us = us.withColumn("lng",us["lng"].cast(DoubleType()))
us.show()

+---+--------+---------+-------------+--------+-----------+----+-----------+----------+-------+-----------+-------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|zip|     lat|      lng|         city|state_id| state_name|zcta|parent_zcta|population|density|county_fips|  county_name|      county_weights|    county_names_all|     county_fips_all|           imprecise|            military|            timezone|
+---+--------+---------+-------------+--------+-----------+----+-----------+----------+-------+-----------+-------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|601|18.18005|-66.75218|     Adjuntas|      PR|Puerto Rico|TRUE|       null|     17113|  102.7|      72001|     Adjuntas|"{""72001"": ""99...| ""72141"": ""0.5...|     Adjuntas|Utuado|         72001|72141|               FALSE|               FALSE|
|602|18.

Once the dataset is uploaded, run the following cells to observe the problem.

In [None]:
# All distinct cities present in the dataset
temp_df_1 = df.select("City").distinct() 
list_of_cities = list(temp_df_1.toPandas()['City']) 

# All cities present in the US zip dataset
list_of_correct_cities =  list(us.select("city").distinct().toPandas()['city']) 

In [None]:
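# Collect the cities that have no exact match in the US zip dataset
# (set membership is an exact, case-sensitive comparison)
correct_cities = set(list_of_correct_cities)
list_of_wrong_cities = [city for city in list_of_cities if city not in correct_cities]

# First 20 cities without an exact match
list_of_wrong_cities[:20]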

['PRINCETON',
 'HUDSON COUNTY',
 'HUSTON',
 'NESCONSET',
 'WATERTOWN',
 'RARITAN',
 'SOUTH FLORAL PARK',
 'VALLEY STREAM',
 'STAMFORD',
 'ISLAND',
 'SELINA',
 'CHINO',
 'RIDGEFIELD PARK',
 'ELMHERST',
 'BOLTIMORE',
 'ALPHARATTA',
 'DEAR PARK',
 'LIDO BEACH',
 'SOMERVILLE',
 'NEW JERSEY']

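As the results show, the City column of our 311 service dataset contains inaccurate values. For instance, "NEW JERSEY" is a state, not a city, and "HUSTON" is a misspelling of Houston, a city in Texas. Note that the comparison is an exact, case-sensitive match and uszips.csv stores city names in title case (e.g. "Adjuntas"), so correctly spelled uppercase values such as "PRINCETON" and "STAMFORD" are most likely flagged only because of their casing. The casing problem from the Uniformity section has to be resolved before the remaining mismatches can be read as misspellings.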
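
To separate casing differences from genuine misspellings, the comparison can be made case-insensitive, and each remaining mismatch can be paired with its closest reference name. The following is a rough sketch using Python's standard `difflib` module; the function name `find_suspect_cities`, the 0.8 similarity cutoff, and the small `reference` and `observed` lists are assumptions made for illustration, and in the notebook the inputs would be `list_of_correct_cities` and `list_of_cities`.

```python
import difflib


def find_suspect_cities(observed, reference, cutoff=0.8):
    """Map each observed city with no case-insensitive match to its closest
    reference city (uppercased), or to None when nothing is similar enough."""
    ref_upper = sorted({c.strip().upper() for c in reference if c})
    ref_set = set(ref_upper)
    suspects = {}
    for city in observed:
        if not city:
            continue  # nulls are a completeness problem, not a spelling one
        key = city.strip().upper()
        if key in ref_set:
            continue  # differs from the reference by casing only
        matches = difflib.get_close_matches(key, ref_upper, n=1, cutoff=cutoff)
        suspects[city] = matches[0] if matches else None
    return suspects


# Hypothetical stand-ins for list_of_correct_cities and list_of_cities
reference = ["Houston", "Princeton", "Elmhurst", "Deer Park", "Baltimore"]
observed = ["PRINCETON", "HUSTON", "ELMHERST", "DEAR PARK", "NEW JERSEY", None]

suspects = find_suspect_cities(observed, reference)
for city, suggestion in suspects.items():
    print(city, "->", suggestion)
```

Values mapped to `None`, such as "NEW JERSEY" here, are not near-misses of any reference city and need a manual decision. The suggested corrections are only candidates: a similarity cutoff can pair two different real places with similar names.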

## III. Inconsistency

We describe inconsistent data as data that contains values that contradict each other or values that are not what we would expect based on the column they are in. As we explored our dataset, we observed inconsistency problems in both the Agency and Agency Name columns of the 311 service dataset. We attempted to find a dataset containing all agency names and information, but we could not find a complete one. Instead, we decided to create our own custom dataset using the data from https://www1.nyc.gov/nyc-resources/agencies.page, with the addition of other agencies that showed up on the NYC open data website. Our dataset can be downloaded/accessed via this link: [NYC-Agency-Names.csv](https://drive.google.com/file/d/1EHpyXNOwCpv-NM0OTTqFBHeKikKpBtd1/view?usp=sharing). We use this dataset as a baseline to compare against the corresponding columns in our 311 service dataset.

Just as before, make sure to download and upload the [NYC-Agency-Names.csv](https://drive.google.com/file/d/1EHpyXNOwCpv-NM0OTTqFBHeKikKpBtd1/view?usp=sharing) dataset into the same folder/directory as where our dataset resides (i.e. for this colab notebook environment, the current working directory path should be /content/drive/MyDrive/evq_311_service_proj and the datasets should all be placed in the dataset/ folder inside the current working directory).

### Explore Agency column

In [None]:
# Define path for NYC Agency Names dataset
nycAgencyNames_path = "./dataset/NYC-Agency-Names.csv"

In [None]:
# Read NYC Agency Names dataset 
agency_df = spark.read.csv(nycAgencyNames_path, header=True)

In [None]:
# Show Schema for NYC Agency Names dataset
agency_df.printSchema()

root
 |-- Agency Acronym: string (nullable = true)
 |-- Agency: string (nullable = true)



Once the dataset is uploaded, run the following cells to observe the problem.

In [None]:
# View data from Agency column
from pyspark.sql.functions import asc

df.select("Agency").distinct().orderBy(asc("Agency")).show(df.count(), False)

+---------------------------------------+
|Agency                                 |
+---------------------------------------+
|3-1-1                                  |
|ACS                                    |
|CEO                                    |
|COIB                                   |
|DCA                                    |
|DCAS                                   |
|DCP                                    |
|DEP                                    |
|DFTA                                   |
|DHS                                    |
|DOB                                    |
|DOE                                    |
|DOF                                    |
|DOHMH                                  |
|DOITT                                  |
|DORIS                                  |
|DOT                                    |
|DPR                                    |
|DSNY                                   |
|DVS                                    |
|EDC                              

In [None]:
# All distinct Agency present in the dataset
temp_df2 = df.select("Agency").distinct() 
list_of_agencies = list(temp_df2.toPandas()['Agency']) 

# All Agency present in the NYC Agency Names dataset
list_of_correct_agencies = list(agency_df.select("Agency Acronym").distinct().toPandas()['Agency Acronym']) 

In [None]:
# Show list of Agency that were not in the dataset with uniform names
list_of_wrong_agencies = []

for i in list_of_agencies:
  if i not in list_of_correct_agencies:
    list_of_wrong_agencies.append(i)

# Show the first 20 in the list
list_of_wrong_agencies[:20] 

['MAYORâ\x80\x99S OFFICE OF SPECIAL ENFORCEMENT', '3-1-1', 'TAX', 'MOC', 'CEO']

In [None]:
# Count how many rows have MAYORâ in the Agency column
df.filter(df["Agency"][0:6] == "MAYORâ").count()

14237

As we can observe from the results of running the previous cells, the data is inconsistent: every other value in the column is an acronym, but we also have "MAYORâ\x80\x99S OFFICE OF SPECIAL ENFORCEMENT", which is not an acronym. By counting, we find that 14,237 rows in the Agency column have a value that starts with "MAYORâ". Theoretically, we would solve the issue by finding out whether the corresponding "Agency Name" is in our database, but for this smaller dataset and for simplicity's sake, we plan to just filter these values out.

Additionally, there is one minor problem: the apostrophe in "Mayor's" was not stored correctly and turned into gibberish. The sequence "â\x80\x99" is what the UTF-8 bytes of a right single quotation mark (’) look like when they are decoded as Latin-1. Since this only happens in the one specific case of "MAYORâS OFFICE OF SPECIAL ENFORCEMENT", the fix can be very specific to this value.
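
The garbled apostrophe can be repaired by reversing the wrong decoding step. The following is a small sketch in plain Python, assuming the value was UTF-8 text read as Latin-1; `repair_mojibake` is a hypothetical helper name, not something defined elsewhere in this notebook.

```python
def repair_mojibake(text):
    """Undo UTF-8 text that was mistakenly decoded as Latin-1.

    The input is returned unchanged when it does not fit that pattern.
    """
    if text is None:
        return None
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not Latin-1 encodable, or the bytes are not valid UTF-8: leave as-is
        return text


garbled = "MAYORâ\x80\x99S OFFICE OF SPECIAL ENFORCEMENT"
repaired = repair_mojibake(garbled)
print(repaired)
```

The repaired string contains the typographic apostrophe (U+2019). If the baseline dataset uses a plain ASCII apostrophe, an extra `replace("\u2019", "'")` would be needed before comparing the two.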

There also might be some inaccuracies which resulted in the other values like "TAX" and "CEO". Note that even though "3-1-1" showed up in the list above, it is actually correct; this is because it is listed in the NYC Agency Names dataset as "311" without the dashes. 
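
False positives such as "3-1-1" versus "311" can be avoided by comparing normalized keys instead of raw strings. The sketch below strips punctuation and casing before the membership test; `agency_key` and the short `reference_acronyms` list are illustrative assumptions, and the real reference values would come from the "Agency Acronym" column of the NYC Agency Names dataset.

```python
def agency_key(value):
    """Reduce an agency label to uppercase letters and digits only."""
    return "".join(ch for ch in value.upper() if ch.isalnum())


# Hypothetical excerpt of the reference acronyms and of the observed values
reference_acronyms = ["311", "DOT", "NYPD"]
observed_agencies = ["3-1-1", "DOT", "TAX", "nypd"]

reference_keys = {agency_key(a) for a in reference_acronyms}

# Keep only the observed values whose normalized key has no reference match
unmatched = [a for a in observed_agencies if agency_key(a) not in reference_keys]
print(unmatched)
```

With this normalization, "3-1-1" is no longer reported, while values like "TAX" still are.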

### Explore Agency Name column

In [None]:
# View data from Agency Name column
df.select("Agency Name").distinct().orderBy(asc("Agency Name")).show(df.count(), False)

+--------------------------------------------------------------------------------------------------+
|Agency Name                                                                                       |
+--------------------------------------------------------------------------------------------------+
|3-1-1                                                                                             |
|3-1-1 Call Center                                                                                 |
|311 Administrative Supervisor                                                                     |
|A - Bronx                                                                                         |
|A - Brooklyn                                                                                      |
|A - Canine Task Force Citywide                                                                    |
|A - Illegal Posting Manhattan and Bronx                                                   

In [None]:
# All distinct Agency Name present in the dataset
temp_df3 = df.select("Agency Name").distinct() 
list_of_agency_names = list(temp_df3.toPandas()['Agency Name']) 

# All Agency present in the NYC Agency Names dataset
list_of_correct_agency_names=  list(agency_df.select("Agency").distinct().toPandas()['Agency']) 

In [None]:
# Show list of non uniform names in the Agency Name Column
list_of_wrong_agency_names = []

for i in list_of_agency_names:
  if (i[0:6] != "School") and i not in list_of_correct_agency_names:
    list_of_wrong_agency_names.append(i)

# Show entire list
for j in list_of_wrong_agency_names:
  print(j)

Exemption Unit
Manhattan 02
Early Care and Education Information Unit
NYCEM Property Damage
Brooklyn South 10
BCC - Brooklyn North
Alternative Superintendency
A - Manhattan
ACS
DOT
Taxi and Limousine Commission
DCA
HPD
DPR
Property Exec Office
Division of Alternative Management
Child Care and Camps Complaint Unit
TLC
NYC Emergency Management
External Affairs Unit
BCC - Queens West
Valuation Policy
Correspondence - Taxi and Limousine Commission
Department of Housing Preservation and Development
Queens East 10
Commercial Exemption Unit
Queens West 03
DCP
Discrepancy and Billing
DOF
Queens West 06
LinkNYC Unit
Department of Homeless Services
3-1-1
DHS Advantage Programs
Operations Unit - Department of Homeless Services
NYPD
TAX
DOB Inspections - Queens
DOF Legal Affairs - Taxpayer Services Unit
MOC
DVS
Fire Department of New York
3-1-1 Call Center
Investigation Review Section
DOB
BCC - Staten Island
Queens East 12
Office of the Sheriff
Tax Lien Unit
COIB
Manhattan 07
CFC - Bronx
P - Queen

In [None]:
# Count how many rows have an Agency Name value that starts with "School"
df.filter((df["Agency Name"][0:6] == "School")).count()

3144

In [None]:
# Count how many rows have an Agency Name that starts with "School" and "DOE" in the Agency column
df.filter((df["Agency Name"][0:6] == "School") & (df["Agency"] == "DOE")).count()

3144

As we can observe from the results above, there are several inconsistencies and other problems, including misspellings, uneven uppercase/lowercase usage, different titles (ex: "The Department of..." vs "Department of..."), and acronyms used even though the department has a full name (ex: NYPD vs New York Police Department). Many schools are also listed as agency names; they are not technically agencies themselves but fall under the DOE. For this specific case, if a school is listed with DOE as the Agency in the same row, we leave it alone, because it is most likely correct: the number of rows whose agency name is a school (3,144) equals the number of rows whose agency name is a school and whose agency is DOE (3,144), so every school row is already paired with DOE. This school-level detail could potentially be used for other purposes, so we decided that it should not be taken out of the dataset.

## IV. Completeness

We describe a complete dataset as a dataset that contains all required data. While exploring our dataset, we found that several values in the city column are null (i.e. missing).

Run the following cells to view the observation.

### Explore City column

In [None]:
from pyspark.sql.functions import desc

# Display number of values in city column
df.groupBy('City').count().orderBy(desc("count")).show(df.count(), False)

+-------------------------+-------+
|City                     |count  |
+-------------------------+-------+
|BROOKLYN                 |1592913|
|NEW YORK                 |1024723|
|BRONX                    |990824 |
|null                     |337731 |
|STATEN ISLAND            |262032 |
|JAMAICA                  |75304  |
|FLUSHING                 |56319  |
|Jamaica                  |54074  |
|ASTORIA                  |52854  |
|RIDGEWOOD                |41676  |
|Flushing                 |37942  |
|Astoria                  |32108  |
|CORONA                   |28253  |
|Ridgewood                |25369  |
|WOODSIDE                 |25196  |
|FRESH MEADOWS            |25104  |
|OZONE PARK               |23299  |
|ELMHURST                 |22163  |
|EAST ELMHURST            |21381  |
|LONG ISLAND CITY         |20901  |
|SOUTH RICHMOND HILL      |20650  |
|FAR ROCKAWAY             |20220  |
|FOREST HILLS             |19267  |
|QUEENS VILLAGE           |19098  |
|SOUTH OZONE PARK         |1

From the results above, we can see that there are 337,731 rows in the City column that contain null values. 

### Explore Address Type column

Another aspect of completeness that we should look out for involves the Address Type column. This column is integral to figuring out which columns hold the location information we are looking for. For example, the address type "ADDRESS" most likely corresponds to the Incident Address field.

To figure out which Address Type corresponds to which location columns, we would have to count how many times non-null values appear in each column for each address type. If a column is populated for a significant share of the rows of a certain address type, then that column is most likely associated with that type.

Ideally, the Address Type column should not be null, and if it is, then the columns that could hold information about the address should also be null.

If the Address Type column is not null, then we should also check that the corresponding address columns are filled while the rest of the address columns are null.
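
The mapping described above can be sketched as a per-type count of non-null values. The example below is plain Python over a handful of made-up rows; `non_null_counts_by_type` and the row values are hypothetical, and only the column names come from the dataset schema.

```python
from collections import defaultdict


def non_null_counts_by_type(rows, type_col, location_cols):
    """For each value of type_col, count the non-null entries per location column."""
    counts = defaultdict(lambda: dict.fromkeys(location_cols, 0))
    for row in rows:
        per_type = counts[row.get(type_col)]
        for col in location_cols:
            if row.get(col) is not None:
                per_type[col] += 1
    return dict(counts)


# Made-up rows for illustration; None stands for a null value
rows = [
    {"Address Type": "ADDRESS", "Incident Address": "1 MAIN STREET", "Intersection Street 1": None},
    {"Address Type": "ADDRESS", "Incident Address": "2 MAIN STREET", "Intersection Street 1": None},
    {"Address Type": "INTERSECTION", "Incident Address": None, "Intersection Street 1": "KING STREET"},
    {"Address Type": None, "Incident Address": "3 MAIN STREET", "Intersection Street 1": None},
]

address_counts = non_null_counts_by_type(
    rows, "Address Type", ["Incident Address", "Intersection Street 1"]
)
for address_type, per_column in address_counts.items():
    print(address_type, per_column)
```

A row like the last one, where Address Type is null but an address column is filled, is the kind of incompleteness the checks above are meant to surface.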

In [None]:
# Show the distinct address types
df.select("Address Type").distinct().show(df.count(), False)

+------------+
|Address Type|
+------------+
|INTERSECTION|
|UNRECOGNIZED|
|null        |
|BLOCKFACE   |
|LATLONG     |
|ADDRESS     |
|PLACENAME   |
+------------+



In [None]:
from pyspark.sql.functions import desc

# Count how many rows of each address types there are
df.groupBy('Address Type').count().orderBy(desc("count")).show(df.count(), False)

+------------+-------+
|Address Type|count  |
+------------+-------+
|ADDRESS     |3724266|
|null        |853003 |
|INTERSECTION|682565 |
|BLOCKFACE   |115632 |
|LATLONG     |36369  |
|PLACENAME   |2098   |
|UNRECOGNIZED|758    |
+------------+-------+



In [None]:
# Check whether Incident Address is populated for rows with the ADDRESS type
df.filter((df["Address Type"] == "ADDRESS") & df["Incident Address"].isNotNull()).count()

3610393

## V. Outlier

We describe an outlier as data that is significantly different from all other data in the dataset. To find potential outliers, we looked at the minimum and maximum dates in columns with timestamp types (i.e. "Created Date", "Closed Date", "Due Date", and "Resolution Action Updated Date"). We found outliers in three of the four columns with timestamp types. 

Run the following cells to explore the columns with timestamp types.

In [None]:
from pyspark.sql import Row
from pyspark.sql.functions import min, max

"Created Date" column seems to have the expected minimum and maximum values for dates in our 311 service dataset.

In [None]:
# View min and max dates for Created Date
df.select(min("Created Date"),max("Created Date")).show(df.count(), False)

+-------------------+-------------------+
|min(Created Date)  |max(Created Date)  |
+-------------------+-------------------+
|2010-01-01 00:00:00|2021-11-13 02:10:18|
+-------------------+-------------------+



"Closed Date", "Due Date", and "Resolution Action Updated Date" columns can be observed to have outlier dates that clearly should not exist in our dataset. 

In [None]:
# View min and max dates for Closed Date
df.select(min("Closed Date"),max("Closed Date")).show(df.count(), False)

+-------------------+-------------------+
|min(Closed Date)   |max(Closed Date)   |
+-------------------+-------------------+
|1899-12-31 19:00:00|2201-03-25 00:00:00|
+-------------------+-------------------+



In [None]:
# View min and max dates for Due Date
df.select(min("Due Date"),max("Due Date")).show(df.count(), False)

+-------------------+-------------------+
|min(Due Date)      |max(Due Date)      |
+-------------------+-------------------+
|1900-01-02 00:00:00|2021-12-12 13:43:56|
+-------------------+-------------------+



In [None]:
# View min and max dates for Resolution Action Updated Date
df.select(min("Resolution Action Updated Date"),max("Resolution Action Updated Date")).show(df.count(), False)

+-----------------------------------+-----------------------------------+
|min(Resolution Action Updated Date)|max(Resolution Action Updated Date)|
+-----------------------------------+-----------------------------------+
|2000-12-31 23:21:26                |2926-10-30 11:51:00                |
+-----------------------------------+-----------------------------------+
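
Beyond inspecting the minimum and maximum, the individual outliers can be listed by testing each timestamp against a plausible window. The following sketch uses plain Python `datetime` values; the lower bound is the minimum "Created Date" observed above, while the upper bound (end of 2021) is an assumption chosen for illustration, and `out_of_range` is a hypothetical helper.

```python
from datetime import datetime


def out_of_range(timestamps, earliest, latest):
    """Return the non-null timestamps that fall outside [earliest, latest]."""
    return [
        ts for ts in timestamps
        if ts is not None and not (earliest <= ts <= latest)
    ]


EARLIEST = datetime(2010, 1, 1)          # minimum Created Date in the dataset
LATEST = datetime(2021, 12, 31, 23, 59)  # assumed cutoff, adjust as needed

# The two extreme Closed Date values from above, plus a made-up normal value
closed_dates = [
    datetime(1899, 12, 31, 19, 0),
    datetime(2015, 6, 1, 12, 0),
    datetime(2201, 3, 25),
    None,
]

flagged = out_of_range(closed_dates, EARLIEST, LATEST)
print(flagged)
```

Null timestamps are skipped here because missing dates are a completeness issue rather than an outlier. For "Due Date", a later upper bound may be appropriate, since a due date can legitimately fall after the date the data was extracted.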



## Conclusion

In this notebook, we explored different aspects of our dataset and uncovered problems that need to be cleaned up. The next step is to clean the data and create a new version of the dataset. The cleaning of the dataset is shown in the [Cleaning_The_Dataset.ipynb](https://colab.research.google.com/drive/1_EYqXb2oN889RPqRc8jwQaWygmpzKBiF?usp=sharing) notebook.