## **Spark ML Assignment**
The goal of this assignment is to (1) use Spark to analyze and process data and (2) train a Spark ML model.

I've provided this template as a potentially helpful framework, but feel free to use other data processing and ML algorithms based on your dataset and on your personal preferences. 

<br>

**Assignment Tasks:**

1) Load the CSV into a Spark DataFrame. I've added code that automatically downloads "banking_attrition.csv" so that you have it locally. If you prefer to use another dataset for your analysis, feel free to use that instead.

2) Preprocess the data, applying at least one data exploration or cleaning technique.

3) Split the historical data into train and test.

4) Train the model. (Please predict "attrition" probability if you use the "banking_attrition.csv" dataset.)

5) Test the model performance against your test dataset.

6) Display model evaluation / model fit statistics so that I can see the performance. For this assignment, I'm not focused on getting good model performance. Instead, I'm evaluating your process and the Spark code that you write to preprocess the data and train the model.



## **Install PySpark Dependencies**

In [None]:
# Install Spark dependencies
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!rm -f spark-3.2.3-bin-hadoop3.2.tgz
!wget --no-cookies --no-check-certificate https://dlcdn.apache.org/spark/spark-3.2.3/spark-3.2.3-bin-hadoop3.2.tgz
!tar zxvf spark-3.2.3-bin-hadoop3.2.tgz
!pip install -q findspark
!pip install pyspark==3.2.3

## **Load Data**

In [None]:
!wget https://raw.githubusercontent.com/zaratsian/Datasets/master/banking_attrition.csv

## **Import Python / Spark Libraries**

In [3]:
import os
os.environ["JAVA_HOME"]  = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.2.3-bin-hadoop3.2"

import datetime, time
import re, random, sys

# Note - Not all of these will be used, but I've added them for your reference as a "getting started"
from pyspark import SparkContext
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, ArrayType, IntegerType, StringType, FloatType, LongType, DateType
from pyspark.sql.functions import struct, array, lit, monotonically_increasing_id, col, expr, when, concat, udf, split, size, lag, count, isnull
from pyspark.sql import Window
from pyspark.ml.linalg import Vectors
from pyspark.ml.regression import GBTRegressor, LinearRegression, GeneralizedLinearRegression, RandomForestRegressor
from pyspark.ml.classification import GBTClassifier, RandomForestClassifier
from pyspark.ml.feature import VectorIndexer, VectorAssembler, StringIndexer, IndexToString
from pyspark.ml import Pipeline, PipelineModel
from pyspark.ml.evaluation import MulticlassClassificationEvaluator, RegressionEvaluator

## **Create Spark Session**

In [4]:
spark = SparkSession.builder.appName("Spark ML Assignment").master("local[*]").getOrCreate()

## **Load CSV Data into Spark Dataframe**

## **Data Exploration**
Perform at least one data exploration of your choice. This could be a basic show(), an aggregation/[groupby](https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame.groupBy), a [correlation](https://spark.apache.org/docs/latest/ml-statistics.html#correlation), a [summarizer](https://spark.apache.org/docs/latest/ml-statistics.html#summarizer), etc.

## **Split the Spark Dataframe into Train and Test**
You could use a [randomsplit](https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame.randomSplit) here, a [Cross Validator](https://spark.apache.org/docs/latest/ml-tuning.html#cross-validation), or another approach of your choice.

## **Feature Engineering**
During this step, I'd like to see you convert at least one STRING variable (such as gender, membership, education or another variable of your choice) into a numeric representation so that you can use it as one of the model inputs. You can convert the string to a numeric value by using [one-hot encoding](https://spark.apache.org/docs/latest/ml-features.html#onehotencoderestimator), a [stringindexer](https://spark.apache.org/docs/latest/ml-features.html#stringindexer), etc.

You will also want to define an ML model object. An example of this would be a random forest, gradient boosting, or some other approach listed [here](https://spark.apache.org/docs/latest/ml-classification-regression.html).

## **Fit/Train ML Model**

## **Make Predictions**
Use your model to make predictions against the Test (holdout) DataFrame.

## **Evaluate Model against Test Dataframe**
Display model fit statistics, such as RMSE or MSE for a regression model, or accuracy or AUC for a classifier.

## Save the Model Object (Optional)

Write Spark code that saves your model object.

<br>

Context: For production purposes, it's often required to save the model object so that it can be deployed as a stand-alone, compressed binary object. The model object is typically wrapped in a container and served as a REST or gRPC endpoint.