# **IMPORTANT NOTE:**

This notebook is for VIEW ONLY. To test and run it, please download the Zeppelin notebook, upload it to Peel, and run it there: [Cleaning_Dataset_peel_Inconsistency.zpln](https://github.com/qyc206/evq_big_data_project/blob/main/notebooks/part2/Cleaning_Dataset_peel_Inconsistency.zpln).

## Upload the dataset to Peel cluster & Define dataset path

Before continuing, make sure your dataset is available on Peel HDFS. If your dataset is on your local machine, you can copy it to the login node of the cluster and then put it in your user directory in HDFS using the following commands:

```
# Copy file from local machine to login node of the cluster
mylaptop$ scp -r [FILENAME] <net_id>@peel.hpc.nyu.edu:~/

# Move file from cluster login node to your user directory in HDFS 
# (your file will be in the path "/user/[netid]/[FILENAME]")
hfs -put [FILENAME]
```

Make sure you can locate your dataset before continuing onwards.

In [None]:
%pyspark
# Define path to dataset on Peel HDFS (NOTE: replace file name with your own if different)
dataset_path = "/user/CS-GY-6513/project_data/data-cityofnewyork-us.erm2-nwe9.csv"

## Set up Spark Session

Now that the dataset is uploaded and the path is defined, we need to set up pyspark to begin profiling and exploring our dataset. 

If this notebook is run in an environment where pyspark is not yet installed, please add a new cell BEFORE the next cell and run the following command:

```
# Run this command if pyspark is not already installed
%pip install pyspark
```

In [None]:
%pyspark

# Set up pyspark session
from pyspark.sql import SparkSession

spark = SparkSession \
            .builder \
            .appName("Python Spark SQL basic example") \
            .config("spark.some.config.option", "some-value") \
            .config("spark.executor.memory", "50g") \
            .config("spark.driver.memory", "50g") \
            .getOrCreate()

## Load dataset using spark

Run the following lines to load the dataset using spark and test to make sure that dataset is properly loaded.

In [None]:
%pyspark

# Load dataset
df = spark.read.format('csv').options(header='true',inferschema='true').load(dataset_path)
# (Note: change "311_service_report" to a name that better suits your dataset, if different)
df.createOrReplaceTempView("311_service_report") 

In [None]:
%pyspark

from pyspark.sql.types import IntegerType, DoubleType
from pyspark.sql.functions import to_timestamp

# Type casting to expected types
df = df.withColumn("Unique Key",df["Unique Key"].cast(IntegerType()))
df = df.withColumn("Due Date",to_timestamp(df["Due Date"],"MM/dd/yyyy hh:mm:ss a"))
df = df.withColumn("Created Date", to_timestamp(df["Created Date"],"MM/dd/yyyy hh:mm:ss a"))
df = df.withColumn("Closed Date",to_timestamp(df["Closed Date"],"MM/dd/yyyy hh:mm:ss a"))
df = df.withColumn("Incident Zip",df["Incident Zip"].cast(IntegerType()))
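# BBL is a 10-digit identifier (borough + block + lot), which can exceed the 32-bit integer range, so cast to long
df = df.withColumn("BBL",df["BBL"].cast("long"))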
df = df.withColumn("X Coordinate (State Plane)",df["X Coordinate (State Plane)"].cast(IntegerType()))
df = df.withColumn("Y Coordinate (State Plane)",df["Y Coordinate (State Plane)"].cast(IntegerType()))
df = df.withColumn("Resolution Action Updated Date",to_timestamp(df["Resolution Action Updated Date"],"MM/dd/yyyy hh:mm:ss a"))
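
The pattern string `MM/dd/yyyy hh:mm:ss a` used in the cell above describes a 12-hour clock followed by an AM/PM marker. As a rough illustration outside Spark, the sketch below parses a made-up sample value with Python's `datetime.strptime`, using what should be the equivalent format. The sample string is hypothetical and not taken from the dataset.

```python
from datetime import datetime

# Hypothetical sample value laid out like the 311 date columns (assumption)
sample = "01/15/2020 02:30:00 PM"

# %m/%d/%Y %I:%M:%S %p mirrors Spark's "MM/dd/yyyy hh:mm:ss a":
# %I is the hour on a 12-hour clock and %p is the AM/PM marker
parsed = datetime.strptime(sample, "%m/%d/%Y %I:%M:%S %p")

print(parsed.isoformat())
```

If a value does not follow this layout, `strptime` raises a `ValueError`, whereas Spark's `to_timestamp` typically yields null instead, so null counts after casting are worth checking.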

## III. Inconsistency

As seen during profiling, most of the incorrect data is in the "Agency Name" column, but both the "Agency Name" and "Agency" columns need cleaning.

The "Agency" column needs a few targeted changes. For example, there were formatting errors in values beginning with "Mayorâ" in both columns; because these are specific cases, we fix them manually. For the other errors found in "Agency", we could either filter out the affected rows or check whether the row's "Agency Name" has a match in the "correct" dataset and change "Agency" accordingly. However, "Agency Name" has significantly more errors than "Agency", so it is our main focus, and these other "Agency" changes are not implemented in this notebook for now.

"Agency Name" is fixed by checking whether the "Agency" value in the same row also appears in the "correct" dataset. If there is a match, we replace the agency name with the name from that dataset. One agency is excluded: the DOE. This is because of the aforementioned "School" agency name problem, where the name of the school reported under the DOE could be important. The 3-1-1 names are also left untouched for the same reason.

Run the following cells to fix the “Mayorâ” problem.

In [None]:
%pyspark
# Fix the Mayora Problem by finding "MAYORâ" and replacing it in Agency
from pyspark.sql.functions import when

df = df.withColumn("Agency", \
              when(df["Agency"][0:6] == "MAYORâ", "OSE").otherwise(df["Agency"]))

In [None]:
%pyspark

# Fix the Mayora Problem by finding "Mayorâ" and replacing it in Agency Name

df = df.withColumn("Agency Name", \
              when(df["Agency Name"][0:6] == "Mayorâ", "Mayor's Special Office of Enforcement").otherwise(df["Agency Name"]))
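
One plausible source of the "Mayorâ" values, stated here as an assumption rather than something verified against the raw file, is a curly apostrophe (U+2019) whose UTF-8 bytes were decoded with a single-byte encoding such as Windows-1252. The sketch below reproduces that effect in plain Python on an illustrative string.

```python
# Illustrative string with a curly apostrophe (U+2019); not read from the dataset
original = "MAYOR\u2019S OFFICE"

# Encode as UTF-8, then (incorrectly) decode the bytes as Windows-1252
garbled = original.encode("utf-8").decode("cp1252")

# The first six characters are the prefix that the fix above matches on
print(garbled[:6])

# Reversing the two steps recovers the original text
repaired = garbled.encode("cp1252").decode("utf-8")
print(repaired == original)
```

If this is indeed the cause, matching on the six-character prefix is enough to identify the affected rows, since the garbled bytes always follow the letters "MAYOR".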

Before moving on, download the [NYC-Agency-Names.csv](https://drive.google.com/file/d/1EHpyXNOwCpv-NM0OTTqFBHeKikKpBtd1/view?usp=sharing) dataset and upload it to your user directory on Peel HDFS, following the same steps used for the main dataset (i.e. the file should end up at a path such as /user/[netid]/NYC-Agency-Names.csv).

This dataset is needed for fixing inconsistencies in "Agency Name" and "Agency".

In [None]:
%pyspark

# Define path for NYC Agency Names dataset
# (Note: make sure to update to your netid and dataset name)
nycAgencyNames_path = "/user/emc689/NYC-Agency-Names.csv"

In [None]:
%pyspark

# Read NYC Agency Names dataset 
agency_df = spark.read.csv(nycAgencyNames_path, header=True)

# Show Schema for NYC Agency Names dataset
agency_df.printSchema()

Run the following cells to fix inconsistencies in “Agency Name” and “Agency”.

In [None]:
%pyspark
from pyspark.sql.functions import coalesce, when

# Look up the reference agency name for every row by joining on the acronym,
# so that spelling and formatting mistakes can be corrected in a single pass
result = df.join(agency_df, df["Agency"] == agency_df["Agency Acronym"], "left") \
           .select(df["Unique Key"], df["Agency"],
                   agency_df["Agency"].alias("agency_name_df"),
                   df["Agency Name"].alias("df_Agency"))

# Set agency_name_df to null for DOE rows whose original name starts with "School",
# so that coalesce below falls back to the original school name
result = result.withColumn("agency_name_df", when((result["Agency"] == "DOE") & (result["df_Agency"][0:6] == "School"), None).otherwise(result["agency_name_df"]))
temp = result.select(result["Unique Key"].alias("Temp Key"), coalesce(result["agency_name_df"], result["df_Agency"]).alias("Final Agency"))

In [None]:
%pyspark
# Update our Agency Name Column with the new Data
df = df.join(temp,df["Unique Key"] ==  temp["Temp Key"],how="left")
df = df.drop("Temp Key")
df = df.withColumn("Agency Name", df["Final Agency"])
df = df.drop("Final Agency")
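
The join-and-coalesce logic in the cells above can also be sketched in plain Python, which may make the rule easier to reason about. The function name `resolve_agency_name`, the lookup table, and the sample names are hypothetical stand-ins for the reference dataset. Only the logic mirrors the notebook: prefer the reference name, but keep the original for DOE rows whose name starts with "School" and for acronyms without a match.

```python
from typing import Optional

# Hypothetical excerpt of the reference data: acronym -> official agency name
REFERENCE_NAMES = {
    "DOE": "Department of Education",
    "NYPD": "New York City Police Department",
}

def resolve_agency_name(acronym: Optional[str], current_name: Optional[str]) -> Optional[str]:
    """Return the reference name for the acronym, with the DOE school exception."""
    # Keep school names reported under the DOE untouched
    if acronym == "DOE" and current_name is not None and current_name.startswith("School"):
        return current_name
    # Otherwise prefer the reference name and fall back to the current one
    return REFERENCE_NAMES.get(acronym, current_name)

print(resolve_agency_name("NYPD", "NYPD"))             # replaced by the reference name
print(resolve_agency_name("DOE", "School - PS 123"))   # school name is kept
print(resolve_agency_name("3-1-1", "3-1-1"))           # no match, so the name is kept
```

The fallback to the current name plays the role of `coalesce` in the Spark version: a row never loses its agency name just because the acronym is missing from the reference data.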

### Run the following cells to see the improvements!


In [None]:
%pyspark
# Show that the "MAYORâ" problem is fixed
df.select("Agency", "Agency Name").filter(df["Agency"] == "OSE").show(20,False)

We can observe in the following result that the Agency Name looks more consistent and more accurate.

Run the following cells to further confirm that the dataset has improved by listing the remaining incorrect values. There are far fewer inconsistencies than we saw during profiling (Note: we are not counting 3-1-1 and its variations as inconsistent).

In [None]:
%pyspark

# All distinct agency names present in the dataset
df5 = df.select("Agency Name").distinct() 
list_of_agency_names = list(df5.toPandas()['Agency Name']) 

# All correct agency names from the NYC Agency Names reference dataset
list_of_correct_agency_names = list(agency_df.select("Agency").distinct().toPandas()['Agency']) 

In [None]:
%pyspark
from pyspark.sql.functions import asc

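# Show new values for Agency Name (size the output by the number of distinct names, not by every row)
distinct_agency_names = df.select("Agency Name").distinct().orderBy(asc("Agency Name"))
distinct_agency_names.show(distinct_agency_names.count(), False)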

In [None]:
%pyspark

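# Collect agency names that are neither school names nor in the reference list
# (non-string values such as nulls are skipped so the prefix check cannot fail)
correct_names = set(list_of_correct_agency_names)
list_of_wrong_agency_names_final = []
for name in list_of_agency_names:
    if isinstance(name, str) and not name.startswith("School") and name not in correct_names:
        list_of_wrong_agency_names_final.append(name)
for name in list_of_wrong_agency_names_final:
    print(name)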