# **IMPORTANT NOTE:**

This notebook is for VIEW ONLY. To test and run it, download the Zeppelin notebook [Cleaning_Dataset_peel_Uniformity.zpln](https://github.com/qyc206/evq_big_data_project/blob/main/notebooks/part2/Cleaning_Dataset_peel_Uniformity.zpln), then upload and run it on Peel.

## Upload the dataset to Peel cluster & Define dataset path

Before continuing, make sure your dataset is available on Peel HDFS. If your dataset is on your local machine, you can copy it to the login node of the cluster and then put it in your user directory in HDFS using the following commands:

```
# Copy file from local machine to login node of the cluster
mylaptop$ scp -r [FILENAME] <net_id>@peel.hpc.nyu.edu:~/

# Copy file from cluster login node to your user directory in HDFS
# (your file will be at the path "/user/<net_id>/[FILENAME]")
hfs -put [FILENAME]
```

Make sure you can locate your dataset before continuing onwards.

In [None]:
%pyspark
# Define path to dataset on Peel HDFS (NOTE: replace file name with your own if different)
dataset_path = "/user/CS-GY-6513/project_data/data-cityofnewyork-us.erm2-nwe9.csv"

In [None]:
%pyspark

# Set up pyspark session
from pyspark.sql import SparkSession

spark = SparkSession \
            .builder \
            .appName("Python Spark SQL basic example") \
            .config("spark.some.config.option", "some-value") \
            .config("spark.executor.memory", "50g") \
            .config("spark.driver.memory", "50g") \
            .getOrCreate()

## Load dataset using spark

Run the following lines to load the dataset with Spark and check that it loaded properly.

In [None]:
%pyspark

# Load dataset
df = spark.read.format('csv').options(header='true',inferschema='true').load(dataset_path)
# (Note: change "311_service_report" to a name that better suits your dataset, if different)
df.createOrReplaceTempView("311_service_report") 

Notice in the result of running the above cell that most columns in the schema are of type string, even where a different type is expected. To give each column the type we expect, we cast every column that should not be of type string.
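
One way to see which columns are still strings is to filter the `(name, type)` pairs that PySpark exposes through the `DataFrame.dtypes` attribute. The helper below is a minimal sketch: the function name `string_columns` and the sample pairs are hypothetical, and a plain list stands in for `df.dtypes` so the snippet runs without a cluster.

```python
def string_columns(dtypes):
    """Return the names of columns whose Spark type name is 'string'.

    `dtypes` is a list of (column_name, type_name) pairs, the shape
    exposed by the DataFrame.dtypes attribute in PySpark.
    """
    return [name for name, type_name in dtypes if type_name == "string"]


# Hypothetical sample standing in for df.dtypes
sample_dtypes = [
    ("Unique Key", "int"),
    ("Created Date", "string"),
    ("Incident Zip", "string"),
    ("Latitude", "double"),
]

print(string_columns(sample_dtypes))
```

In the notebook, calling `string_columns(df.dtypes)` would list the candidates to review for casting.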

**NOTE:** The following cell is specific to the 311 service report dataset; if your dataset is different, modify the cell to include the type casts your dataset needs.

In [None]:
%pyspark

from pyspark.sql.types import IntegerType, LongType, DoubleType
from pyspark.sql.functions import to_timestamp

# Type casting to expected types
df = df.withColumn("Unique Key", df["Unique Key"].cast(IntegerType()))
df = df.withColumn("Due Date", to_timestamp(df["Due Date"], "MM/dd/yyyy hh:mm:ss a"))
df = df.withColumn("Created Date", to_timestamp(df["Created Date"], "MM/dd/yyyy hh:mm:ss a"))
df = df.withColumn("Closed Date", to_timestamp(df["Closed Date"], "MM/dd/yyyy hh:mm:ss a"))
df = df.withColumn("Incident Zip", df["Incident Zip"].cast(IntegerType()))
# BBL is a 10-digit identifier that can exceed the 32-bit IntegerType range, so cast to LongType
df = df.withColumn("BBL", df["BBL"].cast(LongType()))
df = df.withColumn("X Coordinate (State Plane)", df["X Coordinate (State Plane)"].cast(IntegerType()))
df = df.withColumn("Y Coordinate (State Plane)", df["Y Coordinate (State Plane)"].cast(IntegerType()))
df = df.withColumn("Resolution Action Updated Date", to_timestamp(df["Resolution Action Updated Date"], "MM/dd/yyyy hh:mm:ss a"))


# (Note: change "311_service_report" to a name that better suits your dataset, if different)
df.createOrReplaceTempView("311_service_report")
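
The pattern string `"MM/dd/yyyy hh:mm:ss a"` passed to `to_timestamp` describes a month/day/year date followed by a 12-hour clock time and an AM/PM marker. As a rough illustration outside Spark, the sketch below parses a made-up value with the closest Python standard-library format string. The sample timestamp is hypothetical, and Spark's pattern letters are not identical to `strptime` directives, so treat this as an analogy rather than a drop-in equivalent.

```python
from datetime import datetime

# Closest strptime counterpart of the Spark pattern "MM/dd/yyyy hh:mm:ss a"
PY_FORMAT = "%m/%d/%Y %I:%M:%S %p"

# Hypothetical sample value in the dataset's date format
raw = "03/15/2021 02:30:00 PM"

# %I reads the hour on a 12-hour clock and %p reads the AM/PM marker
parsed = datetime.strptime(raw, PY_FORMAT)
print(parsed.isoformat())
```

A value that does not match the format makes `strptime` raise an error. Depending on the Spark version and its ANSI settings, `to_timestamp` may instead return null, so counting nulls in the date columns after the cast is a reasonable sanity check.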

## I. Uniformity

As observed during profiling, several columns of type string contain values with non-uniform casing. The problem is confined to the following five columns: "Complaint Type", "Descriptor", "Location Type", "Street Name", and "City". To solve it, we write a function called **oneColUniformCasing** that takes a column name (a string) and a DataFrame, and returns a DataFrame in which the values of that column are trimmed and formatted so that the first character of every word is uppercased and the remaining characters are lowercased.

Run the following two cells, which define a helper for counting distinct values and the **oneColUniformCasing** function.

In [None]:
%pyspark
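def calculate_distinct(col, dataoriginal, get_option="count"):
    """Return the distinct values of column `col` in `dataoriginal`, or their count."""
    distinct_vals = dataoriginal.select(col).distinct()
    if get_option == "count":
        return distinct_vals.count()
    elif get_option == "distinct":
        return distinct_vals
    # Fail loudly instead of silently returning None for an unknown option
    raise ValueError("get_option must be 'count' or 'distinct'")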

In [None]:
%pyspark 
from pyspark.sql.functions import initcap, col, trim

def oneColUniformCasing(col_name, dataoriginal):
    """Return `dataoriginal` with the values of `col_name` trimmed and title-cased."""
    # initcap uppercases the first letter of each word and lowercases the rest;
    # trim strips leading and trailing spaces.
    # withColumn overwrites the column in place, so the column order is preserved.
    return dataoriginal.withColumn(col_name, trim(initcap(col(col_name))))
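
To make the effect of `trim(initcap(...))` concrete without a cluster, here is a minimal pure-Python sketch that approximates it. It assumes that Spark's `initcap` treats only spaces as word boundaries and that `trim` strips only space characters. The helper name `initcap_trim` and the sample values are hypothetical and serve only as an illustration.

```python
def initcap_trim(value):
    """Approximate Spark's trim(initcap(value)) for a single Python string."""
    if value is None:
        return None  # nulls are passed through unchanged
    # Capitalize the first character of each space-separated word and
    # lowercase the rest, then strip leading and trailing spaces.
    words = [word.capitalize() for word in value.split(" ")]
    return " ".join(words).strip(" ")


# Hypothetical sample values with inconsistent casing and stray spaces
samples = ["NOISE - RESIDENTIAL", "Noise - Residential", " noise - residential "]
cleaned = [initcap_trim(s) for s in samples]

print(len(set(samples)), "distinct values before")
print(len(set(cleaned)), "distinct values after")
print(initcap_trim("HEAT/HOT WATER"))
```

Under these assumptions, a value such as "HEAT/HOT WATER" would come out as "Heat/hot Water", because "/" does not start a new word. It is therefore worth inspecting the distinct values after cleaning, for example with `calculate_distinct`.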

The function is then applied to four of the non-uniform columns; "Location Type" is handled separately below.

In [None]:
%pyspark
# Apply oneColUniformCasing to "Complaint Type"
df = oneColUniformCasing("Complaint Type", df)

# Apply oneColUniformCasing to "Descriptor"
df = oneColUniformCasing("Descriptor", df)

# Apply oneColUniformCasing to "Street Name"
df = oneColUniformCasing("Street Name", df)

# Apply oneColUniformCasing to "City"
df = oneColUniformCasing("City", df)

Since the "Location Type" column has only one non-uniform value, that value is corrected directly.

In [None]:
%pyspark
from pyspark.sql.functions import regexp_replace

# Only fix the "RESIDENTIAL BUILDING" value in the "Location Type" column
df = df.withColumn('Location Type', regexp_replace('Location Type',
                                                   'RESIDENTIAL BUILDING',
                                                   'Residential Building'))
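
Note that `regexp_replace` treats its pattern argument as a regular expression and replaces every match, including matches inside longer strings. The sketch below uses Python's `re.sub` as a stand-in to show the difference between an unanchored and an anchored pattern. The sample values are hypothetical. Spark uses Java regular expressions, and `^` and `$` are assumed to behave the same way for these single-line values.

```python
import re

# Hypothetical "Location Type" values, for illustration only
values = ["RESIDENTIAL BUILDING", "Residential Building", "RESIDENTIAL BUILDING/HOUSE"]

# Unanchored pattern: also rewrites the match inside a longer value
unanchored = [re.sub("RESIDENTIAL BUILDING", "Residential Building", v) for v in values]

# Anchored pattern: only rewrites values that match in full
anchored = [re.sub(r"^RESIDENTIAL BUILDING$", "Residential Building", v) for v in values]

print(unanchored)
print(anchored)
```

If only exact matches should change, anchoring the pattern in the same way in the `regexp_replace` call is one option to consider.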

### Try some columns to see the improvement!

Below are a few cells that display some of the columns that had uniformity problems. Running these cells shows that the values in each column now have uniform casing (i.e., the first letter of each word is uppercased).

(The output is currently hidden; press the show output button if you want to see our results.)

In [None]:
%pyspark
# View column "Complaint Type" 
df.select("Complaint Type").distinct().collect()

In [None]:
%pyspark
# View column "Location Type"
df.select("Location Type").distinct().collect()

In [None]:
%pyspark
# View column "City"
df.select("City").distinct().collect()