# PySpark - Preparing the Data For Modeling

## Introduction

This project explores methods for preparing a dataset for modeling. Any dataset should be treated as dirty until proven otherwise, and it should be shown to be sufficiently clean before it is used. No dataset can be entirely clean, however. The sections below list some of the problems that can occur in a dataset. As a common rule of thumb, around 80% of the work goes into getting familiar with the dataset and cleaning it up, and the remaining 20% goes into building the model.

For this project, the dataset consists of only 22 records. The goal is to get a feel for data cleaning with PySpark, and the techniques should transfer to other datasets.

## Problems that a Dataset can have:
- __Duplicated Observations__: These duplications come from system faults and operator errors.
- __Missing Observations__: These can arise from sensor problems, data corruption, or participants who are unwilling to provide answers.
- __Anomalous Observations__: Observations that stand out when compared to the rest of the dataset, such as outliers.
- __Encoding__: Text fields that are not normalised, are in different languages, or contain gibberish input, as well as date and datetime fields that are not encoded consistently.
- __Untrustworthy answers__: This applies especially to surveys, where a response may be a lie for any number of reasons. This type is much harder to work with and clean up.


## Breakdown of this Notebook

- Handling Duplicates in data records
- Handling missing observations in dataset
- Handling outliers
- Exploring the descriptive statistics
- Computing Correlations
- Drawing Histograms to describe the data
- Visualising the interactions between features


## 1 PySpark Machine Configuration:

The session is set up by the following code, which sets "executorCores" to 4, the number of cores used by each executor.

In [1]:
%%configure
{
    "executorCores" : 4
}

In [2]:
from pyspark.sql.types import *

Starting Spark application


ID,YARN Application ID,Kind,State,Spark UI,Driver log,Current session?
3,,pyspark,idle,,,✔



SparkSession available as 'spark'.



## 2 Set Up the Correct Directory:

In [3]:
import os

# Change the Path:
path = '++++your working directory here++++/Datasets/'
os.chdir(path)
folder_pathway = os.getcwd()

# print(folder_pathway)


## 3 Create the Dataset: 22 samples

In [5]:
# Define the Dirty Dataset:
dirty_data = spark.createDataFrame(
    [(1,'Porsche','Boxster S','Turbo',2.5,4,22,None),
     (2,'Aston Martin','Vanquish','Aspirated',6.0,12,16,None),
     (3,'Porsche','911 Carrera 4S Cabriolet','Turbo',3.0,6,24,None),
     (3,'General Motors','SPARK ACTIV','Aspirated',1.4,None,32,None),
     (5,'BMW','COOPER S HARDTOP 2 DOOR','Turbo',2.0,4,26,None),
     (6,'BMW','330i','Turbo',2.0,None,27,None),
     (7,'BMW','440i Coupe','Turbo',3.0,6,23,None),
     (8,'BMW','440i Coupe','Turbo',3.0,6,23,None),
     (9,'Mercedes-Benz',None,None,None,None,27,None),
     (10,'Mercedes-Benz','CLS 550','Turbo',4.7,8,21,79231),
     (11,'Volkswagen','GTI','Turbo',2.0,4,None,None),
     (12,'Ford Motor Company','FUSION AWD','Turbo',2.7,6,20,None),
     (13,'Nissan','Q50 AWD RED SPORT','Turbo',3.0,6,22,None),
     (14,'Nissan','Q70 AWD','Aspirated',5.6,8,18,None),
     (15,'Kia','Stinger RWD','Turbo',2.0,4,25,None),
     (16,'Toyota','CAMRY HYBRID LE','Aspirated',2.5,4,46,None),
     (16,'Toyota','CAMRY HYBRID LE','Aspirated',2.5,4,46,None),
     (18,'FCA US LLC','300','Aspirated',3.6,6,23,None),
     (19,'Hyundai','G80 AWD','Turbo',3.3,6,20,None),
     (20,'Hyundai','G80 AWD','Turbo',3.3,6,20,None),
     (21,'BMW','X5 M','Turbo',4.4,8,18,121231),
     (22,'GE','K1500 SUBURBAN 4WD','Aspirated',5.3,8,18,None) ],
    schema = ['Id','Manufacturer','Model','EngineType','Displacement',
     'Cylinders','FuelEconomy','MSRP'])


In [7]:
# Inspect:
dirty_data.take(1)


[Row(Id=1, Manufacturer='Porsche', Model='Boxster S', EngineType='Turbo', Displacement=2.5, Cylinders=4, FuelEconomy=22, MSRP=None)]

## 4 Handling duplicates of data records:

Duplicates can be very hard to spot, and they occur all the time. PySpark DataFrames provide the .dropDuplicates() transformation to help remove them.

In [8]:
# First is to check for duplicated rows:
dirty_data.count(), dirty_data.distinct().count()


(22, 21)

#### From this "(22, 21)" output, it can be determined that there is one record of data that has a duplicate.
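
The arithmetic behind this conclusion can be checked without a Spark session. The sketch below is plain Python rather than PySpark, and it uses only three rows copied from the dataset above; `sample_rows` and the other variable names are introduced here for illustration.

```python
# Plain-Python illustration (not PySpark): each row is a tuple.
sample_rows = [
    (15, 'Kia', 'Stinger RWD', 'Turbo', 2.0, 4, 25, None),
    (16, 'Toyota', 'CAMRY HYBRID LE', 'Aspirated', 2.5, 4, 46, None),
    (16, 'Toyota', 'CAMRY HYBRID LE', 'Aspirated', 2.5, 4, 46, None),
]

total_rows = len(sample_rows)          # plays the role of .count()
distinct_rows = len(set(sample_rows))  # plays the role of .distinct().count()

# The difference is the number of surplus copies of exact duplicates.
surplus_rows = total_rows - distinct_rows
print(total_rows, distinct_rows, surplus_rows)
```

A difference of zero would mean that no row is an exact copy of another.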

#### To check which record it is:
- First, use the .groupBy() function to define which of the dataset columns to group by. Here all the columns were chosen.
- Next, count the number of times each record occurs with the .count() function.
- Next, use the .filter() method to select the rows that occur more than once.
- Lastly, print these records out with the .show() function.

In [10]:
# Inspect the dataset for duplicates:
(
    dirty_data
    .groupBy(dirty_data.columns)
    .count()
    .filter('count > 1')
    .show()
)


+---+------------+---------------+----------+------------+---------+-----------+----+-----+
| Id|Manufacturer|          Model|EngineType|Displacement|Cylinders|FuelEconomy|MSRP|count|
+---+------------+---------------+----------+------------+---------+-----------+----+-----+
| 16|      Toyota|CAMRY HYBRID LE| Aspirated|         2.5|        4|         46|null|    2|
+---+------------+---------------+----------+------------+---------+-----------+----+-----+

It can be seen that __"Id 16"__ is the duplicate record.
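
The group, count, and filter steps used above can be mirrored in plain Python to see what each one contributes. This is only a sketch of the logic, not of how Spark executes it; it uses `collections.Counter` on four rows copied from the dataset, and the variable names are illustrative.

```python
from collections import Counter

# Plain-Python illustration (not PySpark): each row is a tuple.
sample_rows = [
    (14, 'Nissan', 'Q70 AWD', 'Aspirated', 5.6, 8, 18, None),
    (16, 'Toyota', 'CAMRY HYBRID LE', 'Aspirated', 2.5, 4, 46, None),
    (16, 'Toyota', 'CAMRY HYBRID LE', 'Aspirated', 2.5, 4, 46, None),
    (18, 'FCA US LLC', '300', 'Aspirated', 3.6, 6, 23, None),
]

# Group by the whole row and count occurrences,
# similar in spirit to .groupBy(all columns).count()
row_counts = Counter(sample_rows)

# Keep only the rows seen more than once, similar to .filter('count > 1')
repeated = {row: n for row, n in row_counts.items() if n > 1}

# Print the repeated rows with their counts, similar to .show()
for row, n in repeated.items():
    print(row, n)
```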

#### Next, remove the duplicate row:

In [11]:
# Remove the duplicates:
fully_removed_dat = dirty_data.dropDuplicates()


## 4.1 Duplicates: data IDs

If the data is collected over time, the same record may be stored more than once under different IDs.

#### To check for Duplicate IDs:
- First, group by all the columns except the "Id" column.
- Next, count the number of records in each group.
- Next, extract the groups that have a count above one ("count > 1").
- Finally, print out the data.

In [13]:
# Inspect if the Dataset has duplicate IDs:
(
    fully_removed_dat
    .groupBy( [col for col in fully_removed_dat.columns if col != 'Id'] )
    .count()
    .filter('count > 1')
    .show()
)


+------------+----------+----------+------------+---------+-----------+----+-----+
|Manufacturer|     Model|EngineType|Displacement|Cylinders|FuelEconomy|MSRP|count|
+------------+----------+----------+------------+---------+-----------+----+-----+
|         BMW|440i Coupe|     Turbo|         3.0|        6|         23|null|    2|
|     Hyundai|   G80 AWD|     Turbo|         3.3|        6|         20|null|    2|
+------------+----------+----------+------------+---------+-----------+----+-----+

#### Compare the counts, as in the previous section:

In [14]:
# Save the data as a separate copy:
no_ids_dat = (
    fully_removed_dat
    .select( [col for col in fully_removed_dat.columns if col != "Id"] )
)

# Compare the count:
no_ids_dat.count(), no_ids_dat.distinct().count()


(21, 19)

#### From the output "(21, 19)", two rows are surplus: two records each appear twice under different IDs, so four rows are involved in total.
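
The two ways of counting these duplicates (rows involved versus surplus rows) are easy to confuse, so a small worked example may help. The sketch below is plain Python rather than PySpark, uses five rows copied from the dataset, and assumes the Id is the first field of each tuple; all names are illustrative.

```python
from collections import Counter

# Plain-Python illustration (not PySpark): each row is a tuple, Id first.
sample_rows = [
    (7, 'BMW', '440i Coupe', 'Turbo', 3.0, 6, 23, None),
    (8, 'BMW', '440i Coupe', 'Turbo', 3.0, 6, 23, None),
    (19, 'Hyundai', 'G80 AWD', 'Turbo', 3.3, 6, 20, None),
    (20, 'Hyundai', 'G80 AWD', 'Turbo', 3.3, 6, 20, None),
    (21, 'BMW', 'X5 M', 'Turbo', 4.4, 8, 18, 121231),
]

# Drop the Id and count how often each remaining combination occurs.
counts = Counter(row[1:] for row in sample_rows)
repeated = {key: n for key, n in counts.items() if n > 1}

# Rows that take part in a duplicated group (both copies counted).
rows_involved = sum(repeated.values())

# Rows that would disappear if one copy per group were kept.
surplus_rows = sum(n - 1 for n in repeated.values())

# The surplus equals total minus distinct, as in the count comparison.
assert surplus_rows == len(sample_rows) - len(counts)
print(rows_involved, surplus_rows)
```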

In [16]:
# Remove these duplicates:
id_removed_dat = fully_removed_dat.dropDuplicates(
    subset = [col for col in fully_removed_dat.columns if col != "Id"]
)

# Count the number of rows:
id_removed_dat.count()


19
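
To make the effect of the "subset" argument concrete, the following plain-Python sketch keeps one row per combination of non-Id fields. It is an illustration under stated assumptions, not a description of Spark's internals: it keeps the first row seen, whereas PySpark, as far as its documentation states, makes no promise about which of the matching rows survives, so the Id kept there may differ. The variable names are illustrative.

```python
# Plain-Python illustration (not PySpark): each row is a tuple, Id first.
sample_rows = [
    (7, 'BMW', '440i Coupe', 'Turbo', 3.0, 6, 23, None),
    (8, 'BMW', '440i Coupe', 'Turbo', 3.0, 6, 23, None),
    (19, 'Hyundai', 'G80 AWD', 'Turbo', 3.3, 6, 20, None),
    (20, 'Hyundai', 'G80 AWD', 'Turbo', 3.3, 6, 20, None),
    (21, 'BMW', 'X5 M', 'Turbo', 4.4, 8, 18, 121231),
]

# Keep the first row seen for each combination of non-Id fields.
first_seen = {}
for row in sample_rows:
    first_seen.setdefault(row[1:], row)

deduplicated = list(first_seen.values())
kept_ids = [row[0] for row in deduplicated]
print(len(deduplicated), kept_ids)
```

If it matters which Id is retained (for example, the lowest one), that choice has to be made explicitly rather than left to the deduplication step.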

## 4.2 Duplicates: ID Collisions.

