## **Spark ML Assignment**
The goal of this assignment is to train a Spark ML Model. 

You may do this any way you like, but I've provided this template as a starting framework.

Assignment Tasks:

1) Load the CSV into a Spark DataFrame. I've added code to automatically download "banking_attrition.csv"; however, if you have another dataset you'd like to analyze, feel free to use that instead.

2) Preprocess the data, applying any data exploration or cleaning techniques. 

3) Split the data into train and test sets.

4) Train the model. (Please predict "attrition" probability if you use the "banking_attrition.csv" file.)

5) Apply the model against your test DataFrame.

6) Display model evaluation 



## **Install PySpark Dependencies**

In [None]:
# Install Spark dependencies
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
# -f avoids an error on the first run, when no earlier download exists
!rm -f spark-3.2.1-bin-hadoop3.2.tgz
!wget --no-cookies --no-check-certificate https://dlcdn.apache.org/spark/spark-3.2.1/spark-3.2.1-bin-hadoop3.2.tgz
!tar zxvf spark-3.2.1-bin-hadoop3.2.tgz
!pip install -q findspark
!pip install pyspark

## **Load Data**

In [None]:
!wget https://raw.githubusercontent.com/zaratsian/Datasets/master/banking_attrition.csv

## **Import Python / Spark Libraries**

In [None]:
import os
os.environ["JAVA_HOME"]  = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.2.1-bin-hadoop3.2"

import datetime, time
import re, random, sys

# Note - Not all of these will be used, but I've added them for your reference as a "getting started"
from pyspark import SparkContext
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, ArrayType, IntegerType, StringType, FloatType, LongType, DateType
from pyspark.sql.functions import struct, array, lit, monotonically_increasing_id, col, expr, when, concat, udf, split, size, lag, count, isnull
from pyspark.sql import Window
from pyspark.ml.linalg import Vectors
from pyspark.ml.regression import GBTRegressor, LinearRegression, GeneralizedLinearRegression, RandomForestRegressor
from pyspark.ml.classification import GBTClassifier, RandomForestClassifier
from pyspark.ml.feature import VectorIndexer, VectorAssembler, StringIndexer, IndexToString
from pyspark.ml import Pipeline, PipelineModel
from pyspark.ml.evaluation import MulticlassClassificationEvaluator, RegressionEvaluator

## **Create Spark Session**

In [None]:
spark = SparkSession.builder.appName("Spark ML Assignment").master("local[*]").getOrCreate()

## **Load CSV Data into Spark Dataframe**

## **Data Exploration**
Perform at least one data exploration of your choice. This could be a basic show(), an aggregation/[groupBy](https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame.groupBy), a [correlation](https://spark.apache.org/docs/latest/ml-statistics.html#correlation), a [summarizer](https://spark.apache.org/docs/latest/ml-statistics.html#summarizer), etc.

## **Split the Spark Dataframe into Train and Test**
You could use [randomSplit](https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame.randomSplit) here, a [CrossValidator](https://spark.apache.org/docs/latest/ml-tuning.html#cross-validation), or another approach of your choice.

## **Feature Engineering**
During this step, I'd like to see you convert at least one STRING variable (such as gender, membership, education, or another variable of your choice) into a numeric representation so that you can use it as one of the model inputs. You can convert the string to a numeric value by using [one-hot encoding](https://spark.apache.org/docs/latest/ml-features.html#onehotencoderestimator), a [StringIndexer](https://spark.apache.org/docs/latest/ml-features.html#stringindexer), etc.

You will also want to define an ML model object. An example of this would be a random forest, gradient boosting, or some other approach listed [here](https://spark.apache.org/docs/latest/ml-classification-regression.html).

## **Fit/Train ML Model**

## **Make Predictions**
Use your model to make predictions against the Test (holdout) DataFrame.

## **Evaluate Model against Test Dataframe**
Display model fit statistics, such as RMSE or MSE for a regression model, or accuracy or area under the ROC curve for a classification model.