# **IMPORTANT NOTE:**

This notebook is for VIEW ONLY. To test and run it, please download the Zeppelin notebook, then upload and run it on Peel: [Profiling_Dataset_peel.zpln](https://github.com/qyc206/evq_big_data_project/blob/main/notebooks/part2/Profiling_Dataset_peel.zpln).

# Profiling 

This notebook goes over the problems that we found while exploring the sample dataset (i.e. the sample dataset with ~5 million rows). We broke down the problems that we found into the following sections: Uniformity, Accuracy, Inconsistency, Completeness, and Outlier. Each section will contain cells that can be run to explore the dataset and observe the problems that we found.

This notebook is also used to profile the datasets that are found to be similar to the initial 311 service report dataset profiled and cleaned in part one of the project. 

## Upload the dataset to Peel cluster & Define dataset path

Before continuing, make sure your dataset is available on Peel HDFS. If your dataset is on your local machine, you can copy it to the login node of the cluster and move it to your user directory in HDFS using the following commands:

```
# Copy file from local machine to login node of the cluster
mylaptop$ scp -r [FILENAME] <net_id>@peel.hpc.nyu.edu:~/

# Move file from cluster login node to your user directory in HDFS 
# (your file will be in the path "/user/[netid]/[FILENAME]")
hfs -put [FILENAME]
```

Make sure you can locate your dataset before continuing onwards.

In [None]:
%pyspark

# Define path to dataset on Peel HDFS (NOTE: replace file name with your own if different)
dataset_path = "/user/CS-GY-6513/project_data/data-cityofnewyork-us.erm2-nwe9.csv"

## Set up Spark Session

Now that the dataset is uploaded and the path is defined, we need to set up pyspark to begin profiling and exploring our dataset. 

If this notebook is run in an environment where pyspark is not yet installed, please add a new cell BEFORE the next cell and run the following command:

```
# Run this command if pyspark is not already installed
%pip install pyspark
```


In [None]:
%pyspark 

# Set up pyspark session
from pyspark.sql import SparkSession

spark = SparkSession \
            .builder \
            .appName("Python Spark SQL basic example") \
            .config("spark.some.config.option", "some-value") \
            .config("spark.executor.memory", "20g") \
            .config("spark.driver.memory", "20g") \
            .getOrCreate()

**WARNING:** If you run into a Java heap memory error, change the following lines in the cell above:

.config("spark.executor.memory", "30g")
.config("spark.driver.memory", "30g")

Change the number in front of the `g` (ex: 20g).

Changing this number could also change the amount of RAM needed to download the final file.


## Load dataset using spark

Run the following lines to load the dataset using spark and test to make sure that dataset is properly loaded.

In [None]:
%pyspark

# Load dataset
df = spark.read.format('csv').options(header='true',inferschema='true').load(dataset_path)
# (Note: change "311_service_report" to a name that better suits your dataset, if different)
df.createOrReplaceTempView("311_service_report") 

In [None]:
%pyspark

# Check if the dataset is properly loaded by printing its schema
df.printSchema()

Notice in the result of running the above cell that most columns in the schema are of type string, even where that is not the expected type. To modify the dataset so that each column has the type we would expect, we perform type casting on each column that should not be of type string.

**NOTE:** the following cell is specific to the 311 service report dataset; make sure to modify it to include the type casting necessary for your dataset, if different.

In [None]:
%pyspark

from pyspark.sql.types import IntegerType, LongType, DoubleType
from pyspark.sql.functions import to_timestamp

# Type casting to expected types
df = df.withColumn("Unique Key",df["Unique Key"].cast(IntegerType()))
df = df.withColumn("Due Date",to_timestamp(df["Due Date"],"MM/dd/yyyy hh:mm:ss a"))
df = df.withColumn("Created Date", to_timestamp(df["Created Date"],"MM/dd/yyyy hh:mm:ss a"))
df = df.withColumn("Closed Date",to_timestamp(df["Closed Date"],"MM/dd/yyyy hh:mm:ss a"))
df = df.withColumn("Incident Zip",df["Incident Zip"].cast(IntegerType()))
# BBL is a 10-digit identifier that can exceed the 32-bit IntegerType range, so use LongType
df = df.withColumn("BBL",df["BBL"].cast(LongType()))
df = df.withColumn("X Coordinate (State Plane)",df["X Coordinate (State Plane)"].cast(IntegerType()))
df = df.withColumn("Y Coordinate (State Plane)",df["Y Coordinate (State Plane)"].cast(IntegerType()))
df = df.withColumn("Resolution Action Updated Date",to_timestamp(df["Resolution Action Updated Date"],"MM/dd/yyyy hh:mm:ss a"))

# (Note: change "311_service_report" to a name that better suits your dataset, if different)
df.createOrReplaceTempView("311_service_report")

df.printSchema()

Now that pyspark is set up and the columns of the dataset are updated to types that we expect, we can start using pyspark to explore the dataset!
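One detail worth checking when picking an integer type for identifier columns such as BBL: Spark's `IntegerType` is a signed 32-bit integer, so its largest value is 2,147,483,647, and with ANSI mode off a string that does not fit is cast to null instead of raising an error. The plain-Python sketch below uses made-up 10-digit identifiers (not values taken from the dataset) to show which ones would fit; `fits_int32` is a hypothetical helper written for this illustration.

```python
INT32_MAX = 2**31 - 1  # 2,147,483,647, upper bound of a signed 32-bit integer
INT32_MIN = -2**31


def fits_int32(text):
    """Return True if the numeric string fits in a signed 32-bit integer."""
    value = int(text)
    return INT32_MIN <= value <= INT32_MAX


# Hypothetical 10-digit identifiers, for illustration only
sample_ids = ["1000010001", "2147483647", "3012340001", "5080500012"]

# Identifiers that would not survive a cast to a 32-bit integer
overflowing = [s for s in sample_ids if not fits_int32(s)]
print(overflowing)
```

Any identifier reported here would need a 64-bit type (`LongType` in Spark) or would have to stay a string.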

## I. Uniformity

We describe uniform data as data that use the same format and unit of measure. As we explored the 311 service report dataset, we found five columns with values that have non-uniform casing (i.e. some values are fully uppercased, some are fully lowercased, and some only have the first letter of each word uppercased).

Run through the following cells to observe the problem in the five columns.

**NOTE:** to profile for uniformity in columns specific to your dataset, run the following command:
```
# Replace the [COLUMN_NAME] with the column that you want to profile
df.select([COLUMN_NAME]).distinct().collect()

# Use .head([NUMBER_OF_ROWS_TO_DISPLAY]) instead if the result is too large for display
df.select([COLUMN_NAME]).distinct().head(20)
```
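When the list of distinct values is long, it can help to summarize casing instead of reading every value. The following is a minimal plain-Python sketch, assuming the distinct values have already been collected into a list; `casing_style` is a hypothetical helper, and the sample values are the ones quoted in this notebook.

```python
from collections import Counter


def casing_style(value):
    """Classify a string as 'upper', 'lower', 'title', or 'mixed'.

    Strings without any letters fall into 'mixed'.
    """
    if value.isupper():
        return "upper"
    if value.islower():
        return "lower"
    if value.istitle():
        return "title"
    return "mixed"


# Values quoted in this notebook, used here as a small local sample
sample_values = [
    "RESIDENTIAL BUILDING",
    "3+ Family ApT",
    "3+ Family Apt.",
    "Lenox Road",
    "ALBERTA AVE",
]

# Tally how many values use each casing style
style_counts = Counter(casing_style(v) for v in sample_values)
print(style_counts)
```

A column whose tally is spread over several styles is a candidate for case normalization.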

Run the following cell and notice that the "Descriptor" column has the same problem of non-uniform casing, as previously observed in the "Complaint Type" column. If observed carefully, a row with "Sidewalk CafÃ©" can be found. This problem will be brought up again in the Accuracy section.

In [None]:
%pyspark

# View column "Descriptor" 
df.select("Descriptor").distinct().collect()

Run the following cell and notice that the "Location Type" column is generally uniform except for the rows with the value "RESIDENTIAL BUILDING", "3+ Family ApT", and "3+ Family Apt.".

In [None]:
%pyspark

# View column "Location Type"
df.select("Location Type").distinct().collect()

Run the following cell and notice that the "Street Name" column has non-uniform casing as well (ex: one row has "Lenox Road" while the majority of the rows are uppercased like "ALBERTA AVE").

In [None]:
%pyspark

# View column "Street Name" 
df.select("Street Name").distinct().head(200)

Run the following cell and notice that the "City" column has non-uniform casing (ex: one row has "New York" while the majority of the rows are all uppercased like "ROCKVILLE CENTER").

In [None]:
%pyspark

# View column "City"
df.select("City").distinct().collect()

## II. Accuracy

We describe accurate data as data that is close to the true, expected values. As we explored our dataset, we found that there are several inaccurate cities in the city column. To explore the accuracy of the city column, we found and downloaded a dataset ([uszips.csv](https://drive.google.com/file/d/1qd2cXgTx-h-hRd0C7z2s_U4O8VYLAXA7/view?usp=sharing)) that contains information such as US zip code, state name, city, etc. We use this dataset as a baseline to compare against our dataset and find any inaccuracies, such as misspellings.

Before we start, make sure to download and upload the [uszips.csv](https://drive.google.com/file/d/1qd2cXgTx-h-hRd0C7z2s_U4O8VYLAXA7/view?usp=sharing) dataset into HDFS just as described in the previous "Upload the dataset to Peel cluster & Define dataset path" section. After the reference dataset is downloaded and uploaded into HDFS, run the cells to define the dataset path and make sure the dataset can be read.

In [None]:
%pyspark

# Define path for US zip dataset
# (Note: make sure to update to your netid and dataset name)
uszip_path = "/user/qyc206/uszips.csv"

In [None]:
%pyspark

# Read the US zip dataset
us = spark.read.csv(uszip_path, header=True)
us = us.withColumn("zip",us["zip"].cast(IntegerType()))
us = us.withColumn("lat",us["lat"].cast(DoubleType()))
us = us.withColumn("lng",us["lng"].cast(DoubleType()))
us.show()

Once the dataset is uploaded, run the following cells to observe the problem.

**NOTE:** if your city column has a different column name, update "City" to your column name.

### Exploring City column

In [None]:
%pyspark

# All distinct cities present in the dataset
temp_df_1 = df.select("City").distinct() 
list_of_cities = list(temp_df_1.toPandas()['City']) 

# All cities present in the US zip dataset
list_of_correct_cities =  list(us.select("city").distinct().toPandas()['city']) 

In [None]:
%pyspark

list_of_wrong_cities = []

for i in list_of_cities:
  if i not in list_of_correct_cities:
    list_of_wrong_cities.append(i)

# First 20 cities that were not found in the reference list
list_of_wrong_cities[:20] 

As we can observe from the results of running the cells, there are inaccuracies in the cities that are listed in the city column of our 311 service dataset. For instance, "NEW JERSEY" is a state, not a city, and "HUSTON" is a misspelling of Houston, a city in Texas.
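The membership test above is case-sensitive and also reports null values, so correctly spelled cities may be flagged if the two datasets use different casing. The sketch below is one possible refinement in plain Python, assuming the two lists have already been pulled out of Spark; `find_unmatched` is a hypothetical helper, the sample lists are made up, and `difflib.get_close_matches` from the standard library is used only to suggest a likely intended spelling.

```python
import difflib


def find_unmatched(values, reference):
    """Return values with no case-insensitive match in reference, skipping nulls."""
    known = {r.strip().casefold() for r in reference if r is not None}
    return [
        v for v in values
        if v is not None and v.strip().casefold() not in known
    ]


# Hypothetical samples; the real lists come from the two datasets
dataset_cities = ["NEW YORK", "HUSTON", "NEW JERSEY", None]
reference_cities = ["New York", "Houston"]

unmatched = find_unmatched(dataset_cities, reference_cities)
print(unmatched)

# Suggest the closest reference spelling for each unmatched value, if any
upper_reference = [r.upper() for r in reference_cities]
suggestions = {
    v: difflib.get_close_matches(v, upper_reference, n=1, cutoff=0.8)
    for v in unmatched
}
print(suggestions)
```

A value with a close suggestion is probably a misspelling, while a value with none (such as a state name) is more likely a wrong entry altogether.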

### Exploring Complaint Type column

Another problem with accuracy that we find is in the "Complaint Type" column, as observed also during the profiling for uniformity problems. Run the following cells to further explore this problem.

**NOTE:** you can replace "Complaint Type" in the following cell with your specific column name to search for accuracy problems in your dataset, if different. If you are using a different column for your own dataset, make sure to modify the pattern (of valid symbols) based on your expected contents.

In [None]:
%pyspark
from pyspark.sql.functions import col

In [None]:
%pyspark

# Define regular expression pattern
# Raw string so that Python passes the backslashes through to the regex engine unchanged
pattern = r"^[-a-zA-Z0-9\s\(\)\.\/]*$"

In [None]:
%pyspark

# Get distinct values of column "Complaint Type" 
df_distinct_complaints = df.select("Complaint Type").distinct()

In [None]:
%pyspark

# View rows with special characters
df_distinct_complaints.filter(~col("Complaint Type").rlike(pattern)).collect()
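To see what the whitelist pattern accepts and rejects before running it on the full column, it can be tried on a few strings locally. This is a small sketch using Python's `re` module with the same character class as above; the sample values are hypothetical and `has_unexpected_chars` is a helper invented for this illustration.

```python
import re

# Same character whitelist as the pattern used for "Complaint Type"
whitelist = re.compile(r"^[-a-zA-Z0-9\s\(\)\.\/]*$")


def has_unexpected_chars(value):
    """Return True if the value contains a character outside the whitelist."""
    return whitelist.search(value) is None


# Hypothetical sample values, for illustration only
samples = [
    "Noise - Residential",
    "Street Sign (Missing)",
    "Taxi Complaint??",
    "Caf\u00c3\u00a9 Permit",  # contains mis-encoded characters
]

flagged = [s for s in samples if has_unexpected_chars(s)]
print(flagged)
```

Only the values containing a character outside the allowed set are reported, which mirrors the `~col(...).rlike(pattern)` filter in the cell above.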

### Exploring Descriptor column

Run the following cell to observe similar problems as shown for the Complaint Type column. 

NOTE: make sure to run the first cell under the previous "Exploring Complaint Type column" section BEFORE running the following cells. If you are using a different column for your own dataset, make sure to modify the pattern (of valid symbols) based on your expected contents.

In [None]:
%pyspark

# Define regular expression pattern
pattern2 = r"^[-a-zA-Z0-9\s\(\)\.\/\,\:\*\'\&\"]*$"

In [None]:
%pyspark

# Get distinct values of column "Descriptor" 
df_distinct_descriptors = df.select("Descriptor").distinct()

In [None]:
%pyspark

# View rows with special characters
df_distinct_descriptors.filter(~col("Descriptor").rlike(pattern2)).collect()

## III. Inconsistency

We describe inconsistent data as data that contains values that contradict each other, or values that are not what we would expect based on the column they are in. As we explored our dataset, we observed inconsistency problems in both the Agency and Agency Name columns of the 311 service dataset. We attempted to find a dataset containing all agency names and information, but we could not find a complete one. Instead, we decided to create our own custom dataset using the data from https://www1.nyc.gov/nyc-resources/agencies.page, plus other agencies that showed up on the NYC open data website. Our dataset can be downloaded/accessed via this link: [NYC-Agency-Names.csv](https://drive.google.com/file/d/1EHpyXNOwCpv-NM0OTTqFBHeKikKpBtd1/view?usp=sharing). We use this dataset as a baseline to compare against the corresponding columns in our 311 service dataset.

As before, make sure to download the [NYC-Agency-Names.csv](https://drive.google.com/file/d/1EHpyXNOwCpv-NM0OTTqFBHeKikKpBtd1/view?usp=sharing) dataset and upload it into HDFS as described in the previous “Upload the dataset to Peel cluster & Define dataset path” section. Once the reference dataset is in HDFS, run the cells to define the dataset path and make sure the dataset can be read.

In [None]:
%pyspark

# Define path for NYC Agency Names dataset
# (Note: make sure to update to your netid and dataset name)
nycAgencyNames_path = "/user/qyc206/NYC-Agency-Names.csv"

In [None]:
%pyspark

# Read NYC Agency Names dataset 
agency_df = spark.read.csv(nycAgencyNames_path, header=True)

# Show Schema for NYC Agency Names dataset
agency_df.printSchema()

Once the dataset is uploaded, run the following cells to profile for problems.

**NOTE:** the following approaches can be used on corresponding columns from similar datasets, but make sure to change the column name accordingly in the cells.

### Exploring Agency column

In [None]:
%pyspark

# View data from Agency column
from pyspark.sql.functions import asc

df.select("Agency").distinct().orderBy(asc("Agency")).show(df.count(), False)

In [None]:
%pyspark

# All distinct Agency present in the dataset
temp_df2 = df.select("Agency").distinct() 
list_of_agencies = list(temp_df2.toPandas()['Agency']) 

# All Agency present in the NYC Agency Names dataset
list_of_correct_agencies = list(agency_df.select("Agency Acronym").distinct().toPandas()['Agency Acronym']) 

In [None]:
%pyspark

# Collect the Agency values that do not appear in the NYC Agency Names dataset
list_of_wrong_agencies = []

for i in list_of_agencies:
  if i not in list_of_correct_agencies:
    list_of_wrong_agencies.append(i)

# Show the first 20 in the list
list_of_wrong_agencies[:20] 

In [None]:
%pyspark

# Count how many rows have MAYORâ in the Agency column
df.filter(df["Agency"][0:6] == "MAYORâ").count()

As we can observe from the results of running the previous cells, the data is inconsistent: every other value in the column is an acronym, but we also have "MAYORâ\x80\x99S OFFICE OF SPECIAL ENFORCEMENT", which is not an acronym. By counting, we find that 70,685 rows in the Agency column have "MAYORâ" as part of the value. In principle, we would resolve this by finding out whether the corresponding "Agency Name" is in our database, but for this smaller dataset and for simplicity's sake, we plan to just filter these values out.

Additionally, there is one minor problem where the apostrophe in "Mayor's" was not stored correctly and turned into gibberish. Since this only happens in the one specific case of "MAYORâS OFFICE OF SPECIAL ENFORCEMENT", the fix for this problem will be very specific.

There also might be some inaccuracies which resulted in the other values like "TAX" and "CEO". Note that even though "3-1-1" showed up in the list above, it is actually correct; this is because it is listed in the NYC Agency Names dataset as "311" without the dashes.
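The garbled apostrophe described above has a recognizable shape: the three characters `â`, `\x80`, `\x99` are exactly what the UTF-8 bytes of a right single quotation mark (U+2019) look like when decoded as Latin-1. One plausible explanation, which is an assumption and not something verified against the source file, is that the text was UTF-8 read with the wrong encoding somewhere upstream. Under that assumption, the value can be repaired as sketched below; `repair_mojibake` is a hypothetical helper.

```python
def repair_mojibake(text):
    """Undo a UTF-8-read-as-Latin-1 mix-up; return the input unchanged if that fails."""
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # The text was not garbled in this particular way, so leave it alone
        return text


# The garbled agency value quoted above, written with explicit escapes
garbled = "MAYOR\u00e2\u0080\u0099S OFFICE OF SPECIAL ENFORCEMENT"

repaired = repair_mojibake(garbled)
print(repaired)
```

Plain ASCII values such as "NYPD" pass through unchanged, so the helper is safe to apply to a whole column of mostly clean values.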

### Exploring Agency Name column

In [None]:
%pyspark

# View data from Agency Name column
df.select("Agency Name").distinct().orderBy(asc("Agency Name")).show(df.count(), False)

In [None]:
%pyspark

# All distinct Agency Name present in the dataset
temp_df3 = df.select("Agency Name").distinct() 
list_of_agency_names = list(temp_df3.toPandas()['Agency Name']) 

# All Agency present in the NYC Agency Names dataset
list_of_correct_agency_names=  list(agency_df.select("Agency").distinct().toPandas()['Agency']) 

In [None]:
%pyspark

# Show list of non uniform names in the Agency Name Column
list_of_wrong_agency_names = []

for i in list_of_agency_names:
  if (i[0:6] != "School") and i not in list_of_correct_agency_names:
    list_of_wrong_agency_names.append(i)

# Show entire list
for j in list_of_wrong_agency_names:
  print(j)

In [None]:
%pyspark

# Count how many rows have "School" as part of the value in the Agency Name column
df.filter((df["Agency Name"][0:6] == "School")).count()

In [None]:
%pyspark

# Count how many rows have "School" as part of the value in Agency Name and "DOE" in Agency column
df.filter((df["Agency Name"][0:6] == "School") & (df["Agency"] == "DOE")).count()

As we can observe from the results above, there are several inconsistencies and other problems, including misspellings, uneven uppercase/lowercase usage, different titles (ex: "The Department of..." vs "Department of..."), and acronyms used in place of the department's full name (ex: NYPD vs New York Police Department). Many schools are also listed; these are not technically agencies themselves but fall under the DOE. For this specific case, if a school is listed with DOE as the Agency in the same row, we leave it alone, since it is most likely correct: the number of rows whose agency name is a school is equal to the number of rows whose agency name is a school and whose agency is DOE. This school data is detailed and can potentially be used for other purposes, so we decided that it should not be taken out of the dataset.
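The reasoning behind the two counts can be illustrated on a handful of rows. This is a plain-Python sketch with made-up (agency, agency name) pairs; `school_counts` is a hypothetical helper that mirrors the two `count()` cells above.

```python
# Hypothetical (agency, agency_name) pairs, for illustration only
rows = [
    ("DOE", "School - PS 001"),
    ("DOE", "School - IS 002"),
    ("NYPD", "New York City Police Department"),
    ("DOE", "Department of Education"),
]


def school_counts(rows):
    """Count school-named rows, and those that also list DOE as the agency."""
    school_rows = [
        (agency, name) for agency, name in rows
        if name is not None and name.startswith("School")
    ]
    doe_school_rows = [r for r in school_rows if r[0] == "DOE"]
    return len(school_rows), len(doe_school_rows)


total_schools, doe_schools = school_counts(rows)

# Equal counts mean every school-named row is attributed to DOE
print(total_schools, doe_schools, total_schools == doe_schools)
```

If the two numbers differed, the difference would be the number of school rows attributed to some other agency, which would be worth inspecting.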

## IV. Completeness

We describe a complete dataset as a dataset that contains all required data. While exploring our dataset, we found that several values in the city column are null (i.e. missing).

Run the following cells to view the observation.

**NOTE:** the following approaches can be used on corresponding columns from similar datasets, but make sure to change the column name accordingly in the cells.

### Exploring City column

In [None]:
%pyspark

from pyspark.sql.functions import desc

# Display number of values in city column
df.groupBy('City').count().orderBy(desc("count")).show(df.count(), False)

From the results above, we can see that there are 1,690,360 rows in the City column that contain null values.
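The same kind of count can be taken for several columns at once to compare how complete they are. The sketch below shows the idea in plain Python on made-up rows; `null_counts` is a hypothetical helper, and treating empty strings as missing is an assumption that may not suit every column.

```python
def null_counts(rows, columns):
    """Count missing (None or empty-string) values per column."""
    counts = {c: 0 for c in columns}
    for row in rows:
        for c in columns:
            value = row.get(c)
            if value is None or value == "":
                counts[c] += 1
    return counts


# Hypothetical rows, for illustration only
rows = [
    {"City": "BROOKLYN", "Incident Zip": 11201},
    {"City": None, "Incident Zip": 10001},
    {"City": "", "Incident Zip": None},
]

missing = null_counts(rows, ["City", "Incident Zip"])
print(missing)
```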

### Exploring Address Type column

Another aspect of completeness to look out for involves the Address Type column. This column is integral to figuring out which columns hold the location information we are looking for. For example, the address type "ADDRESS" most likely corresponds to the Incident Address field.

To figure out which Address Type corresponds to which location values, we would have to tally how many times non-null values appear in each column for each address type. If a column is populated for a significant share of the rows with a certain address type, then that column is most likely associated with that type.

Ideally, the Address Type column should not be null, and if it is, then the columns that could hold address information should also be null.

If the Address Type column is not null, then we should also check that the correct address columns are filled while the rest of the address columns are null.

In [None]:
%pyspark

# Show the distinct address types
df.select("Address Type").distinct().show(df.count(), False)

In [None]:
%pyspark

from pyspark.sql.functions import desc

# Count how many rows of each address types there are
df.groupBy('Address Type').count().orderBy(desc("count")).show(df.count(), False)

In [None]:
%pyspark

# Check to see if Incident Address is associated with the ADDRESS type
df.filter((df["Address Type"] == "ADDRESS") & df["Incident Address"].isNotNull()).count()
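The mapping idea described in this section (tallying non-null location values per address type) can be sketched in plain Python as follows. The rows are made up, `non_null_by_type` is a hypothetical helper, and the column names other than "Address Type" and "Incident Address" are assumptions about which location columns a dataset might have.

```python
from collections import Counter, defaultdict

# Candidate location columns; all but "Incident Address" are assumed names
LOCATION_COLUMNS = ["Incident Address", "Intersection Street 1", "Landmark"]


def non_null_by_type(rows, type_column="Address Type", columns=LOCATION_COLUMNS):
    """For each address type, count how often each location column is filled."""
    table = defaultdict(Counter)
    for row in rows:
        address_type = row.get(type_column)
        for c in columns:
            if row.get(c) is not None:
                table[address_type][c] += 1
    return {k: dict(v) for k, v in table.items()}


# Hypothetical rows, for illustration only
rows = [
    {"Address Type": "ADDRESS", "Incident Address": "1 MAIN ST"},
    {"Address Type": "ADDRESS", "Incident Address": "2 MAIN ST"},
    {"Address Type": "INTERSECTION", "Intersection Street 1": "BROADWAY"},
    {"Address Type": None, "Landmark": "CENTRAL PARK"},
]

mapping = non_null_by_type(rows)
print(mapping)
```

In the resulting table, the column with the dominant count for an address type is the likely home of that type's location data, and any entry under the `None` key points to rows where address information exists without an Address Type.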

## V. Outlier

We describe an outlier as data that is significantly different from all other data in the dataset. To find potential outliers, we looked at the minimum and maximum dates in columns with timestamp types (i.e. "Created Date", "Closed Date", "Due Date", and "Resolution Action Updated Date"). We found outliers in three of the four columns with timestamp types.

Run the following cells to explore the columns with timestamp types.

**NOTE:** the following approaches can be used on corresponding columns from similar datasets, but make sure to change the column name accordingly in the cells.

In [None]:
%pyspark

# Import libraries
from pyspark.sql import Row
from pyspark.sql.functions import min, max

"Created Date" column seems to have the expected minimum and maximum values for dates in our 311 service dataset. Run the following cell to view this observation.

In [None]:
%pyspark

# View min and max dates for Created Date
df.select(min("Created Date"),max("Created Date")).show(truncate=False)

"Closed Date", "Due Date", and "Resolution Action Updated Date" columns can be observed to have outlier dates that clearly should not exist in our dataset. Run the following cells to view this observation.

In [None]:
%pyspark

# View min and max dates for Closed Date
df.select(min("Closed Date"),max("Closed Date")).show(truncate=False)

In [None]:
%pyspark

# View min and max dates for Due Date
df.select(min("Due Date"),max("Due Date")).show(truncate=False)

In [None]:
%pyspark

# View min and max dates for Resolution Action Updated Date
df.select(min("Resolution Action Updated Date"),max("Resolution Action Updated Date")).show(truncate=False)
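Minimum and maximum values reveal that outliers exist, but not how many there are. A natural follow-up is to flag every timestamp outside an expected window. The sketch below shows the idea in plain Python; the sample strings and the window bounds are hypothetical, `out_of_range` is a helper written for this illustration, and `DATE_FORMAT` is the `strptime` equivalent of the Spark pattern `MM/dd/yyyy hh:mm:ss a` used earlier in this notebook.

```python
from datetime import datetime

# Python equivalent of the Spark pattern "MM/dd/yyyy hh:mm:ss a"
DATE_FORMAT = "%m/%d/%Y %I:%M:%S %p"


def out_of_range(values, earliest, latest):
    """Return parsed timestamps that fall outside [earliest, latest]."""
    parsed = [datetime.strptime(v, DATE_FORMAT) for v in values]
    return [d for d in parsed if d < earliest or d > latest]


# Hypothetical values and bounds, for illustration only
sample_dates = [
    "01/01/1900 12:00:00 AM",
    "06/15/2019 03:30:00 PM",
    "12/31/2100 11:59:59 PM",
]

outliers = out_of_range(sample_dates, datetime(2010, 1, 1), datetime(2022, 1, 1))
print(outliers)
```

The appropriate window depends on the dataset, so the bounds should be chosen from what is known about when its records were collected.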

## Conclusion

In this notebook, we explored different aspects of our dataset and uncovered problems that need to be cleaned up. The next step is to clean the data and create a new version of the dataset.