# PySpark - Preparing the Data For Modeling

## Introduction

This project builds machine learning models with PySpark's MLlib module (pyspark.mllib). Note that this RDD-based API is in maintenance mode, with new development taking place in the DataFrame-based ML module (pyspark.ml). However, if the datasets are to be stored as RDDs, MLlib can still be used for machine learning work.

## Breakdown of this Notebook

- Loading the Dataset
- Exploring the Dataset
- Testing the Dataset
- Transforming the Dataset
- Standardising the Dataset
- Creating RDDs for Training
- Predicting hours of work for census respondents
- Forecasting the income level of census respondents
- Building a clustering model
- Computing the performance statistics


## Dataset:

For this project, the dataset can be found in the "Datasets" folder, where it is sourced from http://archive.ics.uci.edu/ml/datasets/Census+Income. The CSV file that is used is called "". \
Abstract: Predict whether income exceeds $50K/yr based on census data. Also known as "Adult" dataset. 

Alternative source: https://www.kaggle.com/uciml/adult-census-income/data

Included in the "Datasets" folder are two files:
- adults.csv
- census_income_dataset.csv

## 1 PySpark Machine Configuration:

The session is limited to the number of executor cores set in the following cell (four in this configuration).

In [1]:
%%configure
{
    "executorCores" : 4
}

In [2]:
from pyspark.sql.types import *

Starting Spark application


ID,YARN Application ID,Kind,State,Spark UI,Driver log,Current session?
0,,pyspark,idle,,,✔


SparkSession available as 'spark'.


## 2 Setup the Correct Directory:

In [3]:
import os

# Change the Path:
path = '++++your working directory here++++/Datasets/'
os.chdir(path)
folder_pathway = os.getcwd()

# print(folder_pathway)

## 3 Loading in the Dataset:

The dataset will be the 1994 census income data.

In [4]:
# Set the path to the dataset:
dataset_path = path + 'census_income_dataset.csv'  # 'path' already ends with 'Datasets/'

# Load in the data:
census_income_dat = spark.read.csv(dataset_path, header = True, inferSchema = True)


In [5]:
# Inspect: 
census_income_dat.printSchema()

root
 |-- age: integer (nullable = true)
 |-- workclass: string (nullable = true)
 |-- fnlwgt: integer (nullable = true)
 |-- education: string (nullable = true)
 |-- education-num: integer (nullable = true)
 |-- marital-status: string (nullable = true)
 |-- occupation: string (nullable = true)
 |-- relationship: string (nullable = true)
 |-- race: string (nullable = true)
 |-- sex: string (nullable = true)
 |-- capital-gain: integer (nullable = true)
 |-- capital-loss: integer (nullable = true)
 |-- hours-per-week: integer (nullable = true)
 |-- native-country: string (nullable = true)
 |-- label: string (nullable = true)

Checking the CSV file directly in Excel confirms that the data type of each column was detected correctly with the "inferSchema" parameter set to "True".

In [7]:
# Inspect the row data:
census_income_dat.take(2)

[Row(age=39, workclass=' State-gov', fnlwgt=77516, education=' Bachelors', education-num=13, marital-status=' Never-married', occupation=' Adm-clerical', relationship=' Not-in-family', race=' White', sex=' Male', capital-gain=2174, capital-loss=0, hours-per-week=40, native-country=' United-States', label=' <=50K'), Row(age=50, workclass=' Self-emp-not-inc', fnlwgt=83311, education=' Bachelors', education-num=13, marital-status=' Married-civ-spouse', occupation=' Exec-managerial', relationship=' Husband', race=' White', sex=' Male', capital-gain=0, capital-loss=0, hours-per-week=13, native-country=' United-States', label=' <=50K')]

#### Problem:

The printout of the Row() objects above shows that the string values carry a leading white space, for example workclass=' State-gov'. __This needs to be corrected__.
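
The effect of the stray space can be checked without Spark. The following is a minimal plain-Python sketch; the `row` dictionary is a hypothetical stand-in holding three values copied from the first Row() above, and it is not part of the notebook's pipeline.

```python
# Hypothetical sample: three string fields copied from the first Row() above.
row = {"workclass": " State-gov", "education": " Bachelors", "label": " <=50K"}

# A value with a leading space does not match the clean category name.
print(row["workclass"] == "State-gov")  # False

# Find the fields whose value changes when surrounding white space is stripped.
padded = [key for key, value in row.items() if value != value.strip()]
print(padded)

# Stripping both sides gives the clean values.
cleaned = {key: value.strip() for key, value in row.items()}
print(cleaned)
```

Left uncorrected, the padded and clean spellings would be treated as two different categories in any later grouping or encoding step.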

## 3.1 Fix the Leading White Spaces:

### Import the required libraries:

In [8]:
import pyspark.sql.functions as f

In [10]:
# Loop through the string columns and strip the white space on both the left and the right.
for col, typ in census_income_dat.dtypes:
    if ( typ == 'string' ):
        census_income_dat = census_income_dat.withColumn(col, f.ltrim(f.rtrim(census_income_dat[col])))

In [11]:
# Inspect the row data:
census_income_dat.take(2)

[Row(age=39, workclass='State-gov', fnlwgt=77516, education='Bachelors', education-num=13, marital-status='Never-married', occupation='Adm-clerical', relationship='Not-in-family', race='White', sex='Male', capital-gain=2174, capital-loss=0, hours-per-week=40, native-country='United-States', label='<=50K'), Row(age=50, workclass='Self-emp-not-inc', fnlwgt=83311, education='Bachelors', education-num=13, marital-status='Married-civ-spouse', occupation='Exec-managerial', relationship='Husband', race='White', sex='Male', capital-gain=0, capital-loss=0, hours-per-week=13, native-country='United-States', label='<=50K')]

#### Observation:

It can be seen that the problem has now been fixed. For example, workclass='State-gov'.

## 4 Exploring the Dataset:

This section will explore the different aspects of the dataset.

## 4.1 Select the Numerical and Categorical features:

To do this:
- First, check out the columns' dtypes.
- Create a list of columns that are to be kept.
- Here, the 'label' column is important because it records each respondent's income bracket (<=50K or >50K).
- Create a DataFrame with the selected columns and extract the numerical and categorical columns.

In [13]:
# Inspect the dtypes of each column:
census_income_dat.dtypes

[('age', 'int'), ('workclass', 'string'), ('fnlwgt', 'int'), ('education', 'string'), ('education-num', 'int'), ('marital-status', 'string'), ('occupation', 'string'), ('relationship', 'string'), ('race', 'string'), ('sex', 'string'), ('capital-gain', 'int'), ('capital-loss', 'int'), ('hours-per-week', 'int'), ('native-country', 'string'), ('label', 'string')]

In [15]:
# List all the columns of interests:
cols_keep = census_income_dat.dtypes

cols_keep = (
    ['label', 'age', 'capital-gain', 'capital-loss', 'hours-per-week'] + 
    [element[0] for element in cols_keep[:-1] if element[1] == 'string']
)

In [16]:
# Separate the numerical and categorical features:
census_subset = census_income_dat.select(cols_keep)

# Numerical:
cols_num = [element[0] for element in census_subset.dtypes if element[1] == 'int']

# Categorical:
cols_cat = [element[0] for element in census_subset.dtypes if element[1] == 'string']
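
The selection logic can be traced without a Spark session. The sketch below is plain Python and uses the dtypes list copied from the output above; the names `fixed`, `type_of` and `subset_dtypes` are illustrative helpers that do not appear in the notebook, and the sketch assumes that `.select()` preserves the column order it is given.

```python
# dtypes as reported by census_income_dat.dtypes in the output above.
dtypes = [
    ('age', 'int'), ('workclass', 'string'), ('fnlwgt', 'int'),
    ('education', 'string'), ('education-num', 'int'),
    ('marital-status', 'string'), ('occupation', 'string'),
    ('relationship', 'string'), ('race', 'string'), ('sex', 'string'),
    ('capital-gain', 'int'), ('capital-loss', 'int'),
    ('hours-per-week', 'int'), ('native-country', 'string'),
    ('label', 'string'),
]

fixed = ['label', 'age', 'capital-gain', 'capital-loss', 'hours-per-week']
# String columns except the last one ('label'), which is already in `fixed`.
cols_keep = fixed + [name for name, typ in dtypes[:-1] if typ == 'string']

# Assumed: the subset keeps the order of cols_keep, so its dtypes follow it.
type_of = dict(dtypes)
subset_dtypes = [(name, type_of[name]) for name in cols_keep]

cols_num = [name for name, typ in subset_dtypes if typ == 'int']
cols_cat = [name for name, typ in subset_dtypes if typ == 'string']
print(cols_num)
print(cols_cat)
```

Two things are worth noticing: 'fnlwgt' and 'education-num' are dropped because they are integer columns that are not in the fixed list, and 'label' ends up as the first entry of `cols_cat` because it is a string column.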

## 4.2 Numerical Features:

For numerical features, some basic descriptive statistics can be calculated.

### Import the required Libraries:

In [17]:
import pyspark.mllib.stat as st
import numpy as np

### Descriptive Statistics:

To do this:

- First, subset the data to the numerical columns only and extract the underlying RDD. Every element of the RDD is a Row() object, so .map() is used to convert each one into a list.
- Pass the RDD of numerical values to the .colStats() function from MLlib's statistics package.
- Print out the descriptive statistics.
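
As a rough illustration of what these steps compute, the sketch below derives the same four statistics per column in plain Python. The first two sample rows are copied from the Row() output earlier in the notebook and the last two are made up; the sketch also assumes that the variance reported by .colStats() is the unbiased sample variance (dividing by n - 1), which is worth confirming against the Spark documentation for the version in use.

```python
import math
import statistics

names = ["age", "capital-gain", "capital-loss", "hours-per-week"]

# Sample rows: the first two come from the notebook output, the rest are made up.
rows = [
    [39, 2174, 0, 40],
    [50, 0, 0, 13],
    [38, 0, 0, 40],
    [53, 0, 0, 40],
]

# Transpose the rows so that each tuple holds the values of one column.
columns = list(zip(*rows))

summary = {}
for name, values in zip(names, columns):
    summary[name] = {
        "min": min(values),
        "mean": statistics.mean(values),
        "max": max(values),
        # Sample variance (divides by n - 1); assumed to match colStats().
        "variance": statistics.variance(values),
    }

for name, s in summary.items():
    print("{0}: min->{1:.1f}, mean->{2:.1f}, max->{3:.1f}, stdev->{4:.1f}".format(
        name, s["min"], s["mean"], s["max"], math.sqrt(s["variance"])))
```

Running the same calculation on a small local sample is a convenient way to sanity-check the figures returned by Spark.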

In [18]:
# Create the RDD:
rdd_num = (
    census_subset
    .select(cols_num)
    .rdd
    .map(lambda row: [element for element in row])
)

# Run the MLlib statistics package:
stats_num = st.Statistics.colStats(rdd_num)

In [20]:
# Print the descriptive statistics:
for col, min_, mean_, max_, var_ in zip(cols_num, stats_num.min(), stats_num.mean(), stats_num.max(), stats_num.variance()):
    print('{0}: min->{1: .1f}, mean->{2: .1f}, max->{3: .1f}, stdev->{4: .1f}'.format(col, min_, mean_, max_, np.sqrt(var_)))
    

age: min-> 17.0, mean-> 38.6, max-> 90.0, stdev-> 13.6
capital-gain: min-> 0.0, mean-> 1077.6, max-> 99999.0, stdev-> 7385.3
capital-loss: min-> 0.0, mean-> 87.3, max-> 4356.0, stdev-> 403.0
hours-per-week: min-> 1.0, mean-> 40.4, max-> 99.0, stdev-> 12.3

#### Observation:

The average age is about 39 years, and the maximum of 90 years looks like an outlier.
Comparing capital gain and capital loss, the mean gain (1077.6) is far larger than the mean loss (87.3), so on average respondents gained more than they lost.
The average working week is about 40 hours.
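
One quick way to judge how extreme the maxima are is to express each one as a distance from the mean in standard deviations. The sketch below uses only the figures printed above; the name `z_max` is illustrative, and a z-score is only a crude indicator for heavily skewed columns such as capital-gain.

```python
# Summary statistics copied from the printed output above: (mean, stdev, max).
stats = {
    "age": (38.6, 13.6, 90.0),
    "capital-gain": (1077.6, 7385.3, 99999.0),
    "capital-loss": (87.3, 403.0, 4356.0),
    "hours-per-week": (40.4, 12.3, 99.0),
}

# Distance of each maximum from the mean, in units of standard deviation.
z_max = {name: (mx - mean) / sd for name, (mean, sd, mx) in stats.items()}

for name, z in z_max.items():
    print("{0}: max is {1:.1f} standard deviations above the mean".format(name, z))
```

On these figures the maximum age sits a little under four standard deviations above the mean, while the capital-gain maximum is more than thirteen above it, which suggests the capital columns deserve at least as much attention as age.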

## 4.3 Categorical Features:

For categorical features, it is __not possible__ to calculate simple descriptive statistics. Therefore, the frequency of each distinct value in the categorical columns is calculated instead.

To do this:
- First, subset the data to the categorical columns, which include the 'label' column (the income bracket). Extract the underlying RDD and transform each of the Row() objects into a list.
- Next, store these results as a dictionary.
- Loop through all the columns and aggregate the data with the .groupBy() function.
- Create a list of pairs where the first element is the value, "element[0]", and the second element is its frequency, the length of "element[1]".
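
The counting logic in these steps can be sketched in plain Python with `collections.Counter`. The sample rows below are hypothetical and only two columns are used; this illustrates the idea and is not the notebook's Spark code.

```python
from collections import Counter

# Hypothetical sample rows: [workclass, label].
rows = [
    ["State-gov", "<=50K"],
    ["Self-emp-not-inc", "<=50K"],
    ["Private", "<=50K"],
    ["Private", ">50K"],
    ["Private", "<=50K"],
]
names = ["workclass", "label"]

results = {}
for i, name in enumerate(names):
    # Count how often each distinct value occurs in column i.
    counts = Counter(row[i] for row in rows)
    # most_common() returns (value, frequency) pairs, highest frequency first.
    results[name] = counts.most_common()

print(results["workclass"])
print(results["label"])
```

On the full RDD, grouping and then taking the length of each group gives the same counts. An alternative that may avoid materialising the groups is `rdd.map(lambda row: row[i]).countByValue()`, though it has not been benchmarked here.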

In [21]:
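# Create the RDD. cols_cat already contains 'label', so adding it again
# makes the label value appear at both ends of each row: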
rdd_cat = (
    census_subset
    .select(cols_cat + ['label'])
    .rdd
    .map(lambda row: [element for element in row])
)

In [25]:
# Inspect:
rdd_cat.take(1)

[['<=50K', 'State-gov', 'Bachelors', 'Never-married', 'Adm-clerical', 'Not-in-family', 'White', 'Male', 'United-States', '<=50K']]

In [22]:
# Save to dictionary:
results_cat = {}

for i, col in enumerate(cols_cat + ['label']):
    results_cat[col] = (
        rdd_cat
        .groupBy(lambda row: row[i])
        .map(lambda element: [element[0], len(element[1])])
        .collect()
    )

In [34]:
# Inspect:
for i in results_cat:
    print(i, results_cat[i], "\n")

label [['<=50K', 24720], ['>50K', 7841]] 

workclass [['State-gov', 1298], ['Self-emp-not-inc', 2541], ['Private', 22696], ['Federal-gov', 960], ['Local-gov', 2093], ['?', 1836], ['Self-emp-inc', 1116], ['Without-pay', 14], ['Never-worked', 7]] 

education [['Bachelors', 5355], ['HS-grad', 10501], ['11th', 1175], ['Masters', 1723], ['9th', 514], ['Some-college', 7291], ['Assoc-acdm', 1067], ['Assoc-voc', 1382], ['7th-8th', 646], ['Doctorate', 413], ['Prof-school', 576], ['5th-6th', 333], ['10th', 933], ['1st-4th', 168], ['Preschool', 51], ['12th', 433]] 

marital-status [['Never-married', 10683], ['Married-civ-spouse', 14976], ['Divorced', 4443], ['Married-spouse-absent', 418], ['Separated', 1025], ['Married-AF-spouse', 23], ['Widowed', 993]] 

occupation [['Adm-clerical', 3770], ['Exec-managerial', 4066], ['Handlers-cleaners', 1370], ['Prof-specialty', 4140], ['Other-service', 3295], ['Sales', 3650], ['Craft-repair', 4099], ['Transport-moving', 1597], ['Farming-fishing', 994], ['Machi

#### Observation:

The value and frequency pairs in each column are not sorted. This is fixed by sorting each list by frequency (the second element) in descending order rather than by value.

In [42]:
for i in results_cat:
    print(i, sorted(results_cat[i], key = lambda element: element[1], reverse = True), "\n")

label [['<=50K', 24720], ['>50K', 7841]] 

workclass [['Private', 22696], ['Self-emp-not-inc', 2541], ['Local-gov', 2093], ['?', 1836], ['State-gov', 1298], ['Self-emp-inc', 1116], ['Federal-gov', 960], ['Without-pay', 14], ['Never-worked', 7]] 

education [['HS-grad', 10501], ['Some-college', 7291], ['Bachelors', 5355], ['Masters', 1723], ['Assoc-voc', 1382], ['11th', 1175], ['Assoc-acdm', 1067], ['10th', 933], ['7th-8th', 646], ['Prof-school', 576], ['9th', 514], ['12th', 433], ['Doctorate', 413], ['5th-6th', 333], ['1st-4th', 168], ['Preschool', 51]] 

marital-status [['Married-civ-spouse', 14976], ['Never-married', 10683], ['Divorced', 4443], ['Separated', 1025], ['Widowed', 993], ['Married-spouse-absent', 418], ['Married-AF-spouse', 23]] 

occupation [['Prof-specialty', 4140], ['Craft-repair', 4099], ['Exec-managerial', 4066], ['Adm-clerical', 3770], ['Sales', 3650], ['Other-service', 3295], ['Machine-op-inspct', 2002], ['?', 1843], ['Transport-moving', 1597], ['Handlers-cleaners'

#### Observation:

Looking at "sex", there seems to be a gender imbalance, with roughly twice as many males as females. Under "race", the data is also skewed, with a much greater count of "White" respondents. Additionally, "label" shows that significantly more respondents earn at most $50K (24720) than earn more than $50K (7841).
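
The size of the imbalance in "label" can be quantified directly from the counts printed above. This is a small arithmetic sketch; the names `label_counts` and `shares` are illustrative.

```python
# Label counts copied from the sorted output above.
label_counts = {"<=50K": 24720, ">50K": 7841}

total = sum(label_counts.values())
shares = {value: count / total for value, count in label_counts.items()}

for value, share in shares.items():
    print("{0}: {1:.1%}".format(value, share))
```

Roughly three quarters of the respondents fall in the "<=50K" class, so a model that always predicts "<=50K" would already be right about 76% of the time. That figure is a useful baseline when reading accuracy numbers later on.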

## 4.4 Correlations with MLlib:

This section will explore the correlations between numerical variables.

Note: when given a single RDD of rows, the .corr() method returns a NumPy matrix in which each element is a correlation coefficient between two columns, either Pearson (the default) or Spearman.
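
For intuition about what each entry of such a matrix holds, the following plain-Python sketch computes Pearson coefficients by hand. The `pearson` helper and the sample columns are hypothetical and are not part of MLlib; the sketch only illustrates the formula.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical sample columns.
age = [25, 32, 47, 51, 62]
hours = [20, 40, 45, 40, 30]
columns = [age, hours]

# Correlation matrix: symmetric, with ones on the diagonal.
matrix = [[pearson(a, b) for b in columns] for a in columns]
for row in matrix:
    print(["{0:.3f}".format(value) for value in row])
```

The matrix is symmetric, which is why every pair shows up twice when all off-diagonal entries are printed.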

In [44]:
# Calculate the correlations:
corr_num = st.Statistics.corr(rdd_num)

In [62]:
# Inspect:
corr_num

array([[ 1.        ,  0.0776745 ,  0.05777454,  0.06875571],
       [ 0.0776745 ,  1.        , -0.03161506,  0.07840862],
       [ 0.05777454, -0.03161506,  1.        ,  0.05425636],
       [ 0.06875571,  0.07840862,  0.05425636,  1.        ]])

In [64]:
cols_num

['age', 'capital-gain', 'capital-loss', 'hours-per-week']

In [67]:
# Print the correlations: for each feature, list every other feature whose
# absolute correlation with it exceeds 0.05 (the diagonal is skipped).
for i, el_i in enumerate(abs(corr_num) > 0.05):
    print(cols_num[i])

    for j, el_j in enumerate(el_i):
        if el_j and j != i:
            print("     ", cols_num[j], corr_num[i][j])
    print()

age
      capital-gain 0.07767449816599513
      capital-loss 0.05777453947897402
      hours-per-week 0.06875570750946958

capital-gain
      age 0.07767449816599513
      hours-per-week 0.07840861539013481

capital-loss
      age 0.05777453947897402
      hours-per-week 0.054256362272651674

hours-per-week
      age 0.06875570750946958
      capital-gain 0.07840861539013481
      capital-loss 0.054256362272651674

#### Observation:

All of the correlations found are below 0.1 in absolute value. This means that the numerical features are not highly correlated with each other, so each one can contribute separate information to the model. This avoids the issue of __multicollinearity__.

The following is taken from: https://www.statisticshowto.datasciencecentral.com/multicollinearity/

Multicollinearity generally occurs when there are high correlations between two or more predictor variables. In other words, one predictor variable can be used to predict the other. This creates redundant information, skewing the results in a regression model.
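
A simple screen for multicollinearity is to list the feature pairs whose absolute correlation exceeds a chosen cut-off. The sketch below reuses the matrix printed earlier; the `correlated_pairs` helper and the 0.8 cut-off are assumptions made for illustration, since the appropriate threshold depends on the model.

```python
# Column names and correlation matrix copied from the output above.
cols_num = ['age', 'capital-gain', 'capital-loss', 'hours-per-week']
corr_num = [
    [1.0, 0.0776745, 0.05777454, 0.06875571],
    [0.0776745, 1.0, -0.03161506, 0.07840862],
    [0.05777454, -0.03161506, 1.0, 0.05425636],
    [0.06875571, 0.07840862, 0.05425636, 1.0],
]


def correlated_pairs(names, matrix, threshold):
    """Return (name_i, name_j, r) for each pair with |r| above the threshold.

    Only the upper triangle (j > i) is scanned, so each pair is listed once.
    """
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if abs(matrix[i][j]) > threshold:
                pairs.append((names[i], names[j], matrix[i][j]))
    return pairs


# Hypothetical cut-off for "highly correlated" features.
high = correlated_pairs(cols_num, corr_num, threshold=0.8)
# The 0.05 cut-off used in the notebook, for comparison.
weak = correlated_pairs(cols_num, corr_num, threshold=0.05)

print(high)
print(len(weak))
```

With this matrix no pair comes anywhere near the 0.8 cut-off, which is consistent with the observation that multicollinearity is not a concern for these four features.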

## 5 Testing the Data:


