# **IMPORTANT NOTE:**

This notebook is for VIEW ONLY. To test and run it, download the Zeppelin notebook, upload it to Peel, and run it there: [Cleaning_Dataset_peel_Outlier.zpln](https://github.com/qyc206/evq_big_data_project/blob/main/notebooks/part2/Cleaning_Dataset_peel_Outlier.zpln).

## V. Outlier

As shown in our profiling, there are several outliers in three timestamp columns: "Closed Date", "Due Date", and "Resolution Action Updated Date". We want to filter out rows with dates that do not belong in the dataset (i.e., any date before 2010 or after 2021). We filter these rows out rather than try to fix them because there is no way to recover the correct dates for these outliers.

**NOTE:** A row count follows each filter to show how many rows that filter removed from the dataset.

The minimum and maximum dates are also displayed after each filter. The resulting date ranges are reasonable (i.e., between 2010 and the present/2021).
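
The check described above (a count and a min/max after each filter) can be sketched in plain Python, without Spark. The helper name `summarize_filter` and the sample values are hypothetical and used only for illustration; the notebook itself performs the same check with `df.count()` and `min`/`max` on the Spark dataframe.

```python
from datetime import datetime

def summarize_filter(before, after):
    """Report how many rows a filter removed and the remaining date range.

    `before` and `after` are lists of datetime-or-None values, standing in
    for one timestamp column before and after filtering (hypothetical data).
    """
    present = [d for d in after if d is not None]  # nulls carry no date
    return {
        "removed": len(before) - len(after),
        "min": min(present) if present else None,
        "max": max(present) if present else None,
    }

# Made-up sample values, not taken from the dataset
before = [datetime(1900, 1, 1), datetime(2015, 6, 1), None, datetime(2100, 1, 1)]
# Keep nulls and any date whose year lies in the expected range
after = [d for d in before if d is None or 2010 <= d.year <= 2021]
summary = summarize_filter(before, after)
```

Here `summary["removed"]` gives the number of dropped rows, and the `min`/`max` entries show whether the remaining dates fall inside the expected range.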


## Upload the dataset to Peel cluster & Define dataset path

Before continuing, make sure your dataset is available on Peel HDFS. If your dataset is on your local machine, you can copy it to the login node of the cluster and then move it to your user directory in HDFS using the following commands:

```
# Copy file from local machine to login node of the cluster
mylaptop$ scp -r [FILENAME] <net_id>@peel.hpc.nyu.edu:~/

# Move file from cluster login node to your user directory in HDFS 
# (your file will be in the path "/user/[netid]/[FILENAME]")
hfs -put [FILENAME]
```

Make sure you can locate your dataset before continuing onwards.

In [None]:
%pyspark
# Define path to dataset on Peel HDFS (NOTE: replace file name with your own if different)
dataset_path = "/user/CS-GY-6513/project_data/data-cityofnewyork-us.erm2-nwe9.csv"

## Set up Spark Session

Now that the dataset is uploaded and the path is defined, we need to set up pyspark to begin profiling and exploring our dataset. 

If this notebook is run in an environment where pyspark is not yet installed, please add a new cell BEFORE the next cell and run the following command:

```
# Run this command if pyspark is not already installed
%pip install pyspark
```

In [None]:
%pyspark

# Set up pyspark session
from pyspark.sql import SparkSession

spark = SparkSession \
            .builder \
            .appName("Python Spark SQL basic example") \
            .config("spark.some.config.option", "some-value") \
            .config("spark.executor.memory", "35g") \
            .config("spark.driver.memory", "35g") \
            .getOrCreate()

## Load dataset using spark

Run the following lines to load the dataset using Spark and register it as a temporary view.

In [None]:
%pyspark

# Load dataset
df = spark.read.format('csv').options(header='true',inferschema='true').load(dataset_path)
# (Note: change "311_service_report" to a name that better suits your dataset, if different)
df.createOrReplaceTempView("311_service_report") 

### Generalizing Formatting

For many datasets, extracting information from a time-related column works best when the column is converted to a timestamp type. To convert a column, however, the data in the column must match the format passed to the `to_timestamp()` function (`to_timestamp(dataset[column], format)`). It is therefore best to generalize this formatting step so that all of our date columns are uniform. This matters even more because some of our solutions involve dates.
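
To illustrate why the format must match the data, here is a minimal plain-Python sketch (not Spark). It assumes that the Spark pattern `MM/dd/yyyy hh:mm:ss a` used in this notebook corresponds to the `strptime` pattern `%m/%d/%Y %I:%M:%S %p`. The helper `parse_or_none` is hypothetical; it mimics the behavior Spark typically shows with default (non-ANSI) settings, where a value that does not match the format becomes null.

```python
from datetime import datetime

# Assumed Python equivalent of the Spark pattern "MM/dd/yyyy hh:mm:ss a"
PY_FORMAT = "%m/%d/%Y %I:%M:%S %p"

def parse_or_none(text, fmt=PY_FORMAT):
    """Parse text with fmt, returning None when the text does not match."""
    try:
        return datetime.strptime(text, fmt)
    except (ValueError, TypeError):
        # ValueError: wrong layout; TypeError: text is None
        return None

matching = parse_or_none("01/15/2020 02:30:00 PM")   # matches the pattern
mismatched = parse_or_none("2020-01-15 14:30:00")    # ISO layout, does not match
```

A value in the wrong layout is not repaired; it simply fails to parse, which is why every date column should be converted with a format that matches its contents.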

In [None]:
%pyspark

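from pyspark.sql.functions import to_timestamp

# Convert column `col` of `dataset` to a timestamp using the pattern `DateForm`
def formatDate(dataset, col, DateForm):
    formatedData = dataset.withColumn(col,to_timestamp(dataset[col],DateForm))
    return formatedData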

In [None]:
%pyspark

from pyspark.sql.types import IntegerType, LongType, DoubleType
from pyspark.sql.functions import to_timestamp

# Type casting to expected types
df = df.withColumn("Unique Key",df["Unique Key"].cast(IntegerType()))
df = formatDate(df,"Due Date","MM/dd/yyyy hh:mm:ss a")
df = formatDate(df,"Created Date","MM/dd/yyyy hh:mm:ss a")
df = formatDate(df,"Closed Date","MM/dd/yyyy hh:mm:ss a")
df = df.withColumn("Incident Zip",df["Incident Zip"].cast(IntegerType()))
# BBL is a 10-digit identifier and can exceed the 32-bit integer range, so use LongType
df = df.withColumn("BBL",df["BBL"].cast(LongType()))
df = df.withColumn("X Coordinate (State Plane)",df["X Coordinate (State Plane)"].cast(IntegerType()))
df = df.withColumn("Y Coordinate (State Plane)",df["Y Coordinate (State Plane)"].cast(IntegerType()))
df = formatDate(df,"Resolution Action Updated Date","MM/dd/yyyy hh:mm:ss a")


# (Note: change "311_service_report" to a name that better suits your dataset, if different)
df.createOrReplaceTempView("311_service_report")

df.printSchema()

In [None]:
%pyspark

# Run to remove cache
df.unpersist()

## Cleaning

Now that pyspark is set up and the columns of the dataset are updated to types that we expect, we can start using pyspark to explore and clean the dataset!

In [None]:
%pyspark

from pyspark.sql import Row
from pyspark.sql.functions import min, max

This type of error is one of the more prevalent errors found across datasets. We therefore provide a generalized version of the code so that this simple fix can be run on other datasets, given the dataframe, the minimum and maximum years, and the column name.

For cases where only a minimum OR a maximum year needs to be enforced, but not both, separate functions are provided.
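
The filtering rule can be sketched in plain Python, independent of Spark. The helper `keep_date` and its optional `min_year`/`max_year` parameters are hypothetical and only illustrate the logic: a row is kept when its date is missing, or when its year satisfies whichever bounds are supplied. The sample dates are made up.

```python
from datetime import datetime

def keep_date(value, min_year=None, max_year=None):
    """Return True if a row with this date should be kept."""
    if value is None:
        return True  # missing dates are not outliers, so keep them
    if min_year is not None and value.year < min_year:
        return False
    if max_year is not None and value.year > max_year:
        return False
    return True

# Made-up sample values covering both boundaries and a missing date
rows = [
    datetime(2009, 12, 31),
    datetime(2010, 1, 1),
    None,
    datetime(2021, 12, 31),
    datetime(2022, 1, 1),
]

both = [d for d in rows if keep_date(d, 2010, 2021)]         # min and max
min_only = [d for d in rows if keep_date(d, min_year=2010)]  # lower bound only
max_only = [d for d in rows if keep_date(d, max_year=2021)]  # upper bound only
```

Because the comparison is on the year, dates on the first day of the minimum year and the last day of the maximum year are both kept.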

### Generalized functions

In [None]:
%pyspark

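from pyspark.sql.functions import year

# minDate and maxDate are years (e.g. 2010); rows with a null date are always kept.
def removeOutlierDates(df, minDate, maxDate, col):
    # Keep rows whose year falls within [minDate, maxDate]
    df = df.filter(df[col].isNull() | ((year(col) >= minDate) & (year(col) <= maxDate)))
    return df

def filterMinOnlyDates(df, minDate, col):
    # Keep rows whose year is at least minDate
    df = df.filter(df[col].isNull() | (year(col) >= minDate))
    return df

def filterMaxOnlyDates(df, maxDate, col):
    # Keep rows whose year is at most maxDate
    df = df.filter(df[col].isNull() | (year(col) <= maxDate))
    return df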

### Closed Date column

In [None]:
%pyspark

# Fixing dates from Closed Date
from pyspark.sql.functions import year, desc

df = removeOutlierDates(df, 2018, 2021, "Closed Date")

In [None]:
%pyspark

# Display results
df.select(min("Closed Date"),max("Closed Date")).show(truncate=False)

In [None]:
%pyspark

# Count of the number of overall rows currently in the data
df.count()

### Due Date column

In [None]:
%pyspark

# Fixing dates from Due Dates
from pyspark.sql.functions import year, desc

df = removeOutlierDates(df, 2018, 2021, "Due Date")

In [None]:
%pyspark

# Display results
df.select(min("Due Date"),max("Due Date")).show(truncate=False)

In [None]:
%pyspark

# Count of the number of overall rows currently in the data to check we didn't get rid of too many rows
df.count()

### Resolution Action Updated Date column

In [None]:
%pyspark

# Fixing dates from Resolution Action Updated Date
from pyspark.sql.functions import year, desc

df = removeOutlierDates(df, 2018, 2021, "Resolution Action Updated Date")

In [None]:
%pyspark

# Display results
df.select(min("Resolution Action Updated Date"),max("Resolution Action Updated Date")).show(truncate=False)

In [None]:
%pyspark

# Count of the number of overall rows currently in the data
df.count()