# GBA 6430 - Big Data Technology in Business
# Dr. Mohammad Salehan
# Feature Engineering with Spark

As of Spark 2.0, the RDD-based APIs in the spark.mllib package have entered maintenance mode. The primary Machine Learning API for Spark is now the DataFrame-based API in the spark.ml package.

If you get a cell_monitor error, you can ignore it. It is a Jupyter cell error, not a Spark error.

In this notebook, you will learn how to apply the following data preprocessing techniques using PySpark.
* Dummies
* Discretizing continuous variables
* Standardization using z-score (i.e., normalization)<br/>
You will also learn about Spark's vectorization pipelines.

## Creating Dummies

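The code below creates dummies for two categorical variables named `TYPE` and `CODE` using `ps.get_dummies()`. With `drop_first=False`, every category gets its own 0/1 column; with `drop_first=True`, the first category of each variable is dropped and serves as the baseline.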

In [None]:
import pyspark.pandas as ps
df = ps.DataFrame([
    (1, "A", "X1"),
    (2, "B", "X2"),
    (3, "B", "X3"),
    (1, "B", "X3"),
    (2, "C", "X2"),
    (3, "C", "X2"),
    (1, "C", "X1"),
    (1, "B", "X1"),
], columns=["ID", "TYPE", "CODE"])
df.head()

In [None]:
dummies = ps.get_dummies(df, columns=["TYPE", "CODE"], drop_first=False)
dummies.head()

In [None]:
dummies = ps.get_dummies(df, columns=["TYPE", "CODE"], drop_first=True)
dummies.head()
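
To make the effect of `drop_first` concrete, here is a minimal plain-Python sketch of dummy coding. The helper `one_hot` is hypothetical and written only for illustration; it is not how `ps.get_dummies()` is implemented, and it assumes categories are ordered alphabetically.

```python
def one_hot(values, prefix, drop_first=False):
    """Return one dict of 0/1 indicator columns per input value (illustrative helper)."""
    categories = sorted(set(values))
    if drop_first:
        # The dropped category becomes the baseline: all of its indicators are 0.
        categories = categories[1:]
    return [
        {f"{prefix}_{cat}": int(value == cat) for cat in categories}
        for value in values
    ]

types = ["A", "B", "B", "C"]
full = one_hot(types, prefix="TYPE")
reduced = one_hot(types, prefix="TYPE", drop_first=True)

print(full[0])     # {'TYPE_A': 1, 'TYPE_B': 0, 'TYPE_C': 0}
print(reduced[0])  # {'TYPE_B': 0, 'TYPE_C': 0}
```

Dropping one level avoids perfectly collinear dummy columns, which matters for models such as linear regression.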

## Discretizing continuous variables using quantiles
If you suspect that some features have a nonlinear relationship with your outcome variable, you can consider discretizing them, i.e., binning their values into a small number of ordered buckets.

In [None]:
import matplotlib.pyplot as plt
ps.set_option('plotting.backend', 'matplotlib')

In [None]:
signal_df = ps.read_csv('s3://cis4567-salehan/Spark/Data/fourier_signal.csv')
signal_df.head()

In [None]:
signal_df.describe()
#the mean is almost zero. 

In [None]:
signal_df.plot.line(y='signal')
%matplot plt

In [None]:
import pyspark.ml.feature as feat
steps = feat.QuantileDiscretizer(
       numBuckets=10,
       inputCol='signal',
       outputCol='discretized')

#.pandas_api() is the same as .to_pandas_on_spark() which has been deprecated
transformed = (
    steps
    .fit(signal_df.to_spark())
    .transform(signal_df.to_spark())
).pandas_api()
transformed.head()

In [None]:
transformed.describe()

In [None]:
plt.clf()
transformed.plot.line(y='discretized')
%matplot plt
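
The idea behind quantile-based discretization can be sketched in plain Python. This is only an illustration under simplifying assumptions: it uses exact quantiles from the standard library, whereas Spark's `QuantileDiscretizer` computes approximate quantiles on distributed data. The helper `quantile_buckets` and the sample data are hypothetical.

```python
from bisect import bisect_right
from collections import Counter
from statistics import quantiles

def quantile_buckets(values, num_buckets):
    """Assign each value a bucket index in 0..num_buckets-1 using quantile cut points."""
    # quantiles() returns num_buckets - 1 cut points that split the data
    # into groups of roughly equal size.
    cuts = quantiles(values, n=num_buckets)
    # A value equal to a cut point falls into the upper bucket.
    return [bisect_right(cuts, v) for v in values]

signal = [float(v) for v in range(1, 101)]  # made-up data: 1.0, 2.0, ..., 100.0
buckets = quantile_buckets(signal, num_buckets=4)
counts = Counter(buckets)
print(sorted(counts.items()))
```

Because the cut points are quantiles, each bucket receives about the same number of observations, regardless of how the raw values are distributed.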

## Vectorization
* Nearly every estimator (in other words, an ML model) in Spark's MLlib expects its input in a single column; that column should hold a vector of all the features you want the model to use.
* The `VectorAssembler` transformer, as the name suggests, combines multiple feature columns into a single vector column.

In [None]:
vectorAssembler = (
    feat.VectorAssembler(
        inputCols=['signal', 'discretized'], 
        outputCol='feat'
    )
)

# Combine the 'signal' and 'discretized' columns into a single vector column named 'feat'
signal_vectorized  = vectorAssembler.transform(transformed.to_spark()).pandas_api()
signal_vectorized.head()
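
Conceptually, the assembler adds one new column whose value in each row is the list of the selected feature values. The plain-Python sketch below mimics that behavior on two made-up rows; the `assemble` helper and the sample values are hypothetical and are not part of Spark.

```python
def assemble(rows, input_cols, output_col):
    """Return copies of the rows with the input columns collected into one list column."""
    return [
        {**row, output_col: [row[col] for col in input_cols]}
        for row in rows
    ]

# Made-up rows for illustration only.
rows = [
    {"signal": 0.25, "discretized": 6.0},
    {"signal": -1.5, "discretized": 0.0},
]
assembled = assemble(rows, input_cols=["signal", "discretized"], output_col="feat")
print(assembled[0]["feat"])  # [0.25, 6.0]
```

As in Spark, the original columns are kept and the vector column is added alongside them.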

## Standardizing continuous variables

In [None]:
vec = feat.VectorAssembler(
    inputCols=['signal']
    , outputCol='signal_vec'
)

signal_vectorized  = vec.transform(signal_df.to_spark())

norm = feat.StandardScaler(
    inputCol=vec.getOutputCol()
    , outputCol='signal_norm'
    , withMean=True
    , withStd=True
)

signal_norm = (
    norm
    .fit(signal_vectorized)
    .transform(signal_vectorized)
).pandas_api()

signal_norm.head()

In [None]:
from pyspark.ml.functions import vector_to_array
signal_norm = signal_norm.to_spark().select('signal', 
                                 vector_to_array('signal_norm')[0].alias('signal_norm')
                             ).pandas_api()
signal_norm.describe()
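
With `withMean=True` and `withStd=True`, the scaler above centers each value on the column mean and divides by the sample standard deviation, i.e., it computes a z-score. The following plain-Python sketch shows the same arithmetic on a small made-up list; the `z_scores` helper is hypothetical and exists only for illustration.

```python
from statistics import mean, stdev

def z_scores(values):
    """Standardize values to mean 0 and sample standard deviation 1."""
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation (divides by n - 1)
    return [(v - mu) / sigma for v in values]

signal = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # made-up data
standardized = z_scores(signal)

print(round(mean(standardized), 6), round(stdev(standardized), 6))
```

After standardization the mean is (up to rounding error) zero and the standard deviation is one, which is what `describe()` should show for `signal_norm`.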

In [None]:
plt.clf()
fig, ax = plt.subplots(2, 1)
fig.tight_layout(pad=1.5)
for i, col in enumerate(['signal', 'signal_norm']):
    signal_norm.plot.line(ax=ax[i], y=col, title=col)
%matplot plt

## Pipelines
* The `Pipeline` class sequences the separate steps that lead to an estimated model: it chains multiple Transformers and Estimators into a single sequential workflow.
* Pipelines are useful because you do not have to explicitly create a new transformed dataset at each step as the data moves through the transformation and model estimation process.
* Instead, a Pipeline abstracts the intermediate stages by automating the data flow through the workflow.
* This makes the code more readable and maintainable, since it describes the system at a higher level of abstraction, and it makes debugging easier.

In [None]:
from pyspark.ml import Pipeline
vec = feat.VectorAssembler(
    inputCols=['signal']
    , outputCol='signal_vec'
)

norm = feat.StandardScaler(
    inputCol=vec.getOutputCol()
    , outputCol='signal_norm'
    , withMean=True
    , withStd=True
)

norm_pipeline = Pipeline(stages=[vec, norm])
signal_norm = (
    norm_pipeline
    .fit(signal_df.to_spark())
    .transform(signal_df.to_spark())
).pandas_api()

signal_norm.head()
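
To see what fitting a pipeline does behind the scenes, the toy sketch below imitates the pattern in plain Python: each stage is fitted (if it is an estimator) on the output of the previous stage, and the fitted stages are then applied in order. All `Toy...` classes are hypothetical stand-ins written for this illustration; they are not Spark classes and ignore details such as vector columns and distributed execution.

```python
from statistics import mean, stdev

class ToyAssembler:
    """Transformer stand-in: has transform() only, nothing to learn."""
    def transform(self, rows):
        return [{**r, "signal_vec": [r["signal"]]} for r in rows]

class ToyScalerModel:
    """Fitted scaler: applies the mean and standard deviation learned in fit()."""
    def __init__(self, mu, sigma):
        self.mu, self.sigma = mu, sigma
    def transform(self, rows):
        return [
            {**r, "signal_norm": (r["signal_vec"][0] - self.mu) / self.sigma}
            for r in rows
        ]

class ToyScaler:
    """Estimator stand-in: fit() learns parameters and returns a fitted model."""
    def fit(self, rows):
        values = [r["signal_vec"][0] for r in rows]
        return ToyScalerModel(mean(values), stdev(values))

class ToyPipelineModel:
    """Fitted pipeline: runs every fitted stage in order."""
    def __init__(self, stages):
        self.stages = stages
    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows

class ToyPipeline:
    def __init__(self, stages):
        self.stages = stages
    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            # Estimators are fitted first; transformers are used as they are.
            model = stage.fit(rows) if hasattr(stage, "fit") else stage
            rows = model.transform(rows)  # feed the output to the next stage
            fitted.append(model)
        return ToyPipelineModel(fitted)

rows = [{"signal": 1.0}, {"signal": 2.0}, {"signal": 3.0}]  # made-up data
result = ToyPipeline([ToyAssembler(), ToyScaler()]).fit(rows).transform(rows)
print([r["signal_norm"] for r in result])
```

The caller never handles the intermediate `signal_vec` dataset directly, which is the convenience the real `Pipeline` provides.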