# Lab 16 Assignment 3 - Group Assignment - Group O-1-7

When creating ML models, the concept of efficiency has three sides:
1. The time dedicated by the analyst to build the model
2. The computer time and resources needed by the final model
3. The accuracy of the final model

Efficiency is a combination of all three.

In this assignment, you are asked to be efficient. Spark is the best tool to build models over massive datasets.

If you need to create Spark + Python machine learning models that "run fast" on the cluster, you must avoid running plain Python code or working with RDDs from Python. Try to use the existing methods that already do what you need (do not reinvent the wheel).

Therefore, try to use the objects and methods implemented in the Spark SQL and ML modules. They are very fast because they run as compiled Java/Scala code. Try to use: DataFrames, Feature Transformers, Estimators, Pipelines, GridSearch, CV, ...

For this assignment, you are asked to create a classification model that:
1. Uses the variables in the dataset (train.csv) to predict the label "loan_status".
2. Is delivered as a Python script that:
    - Reads the "train.csv" and "test.csv" files, transforming and selecting variables as you wish.
    - Trains/fits your model using "train.csv".
    - Applies your model to "test.csv" (you should generate a file with your predictions).
    - I will use a different test dataset (with the true loan_status).

Your work will be evaluated under the following scoring scheme:
- (40%) ETL process
- (40%) Model train process
- (10%) Code Readability 
- (10%) AUC on the test set (at least 50%)

Enjoy it and best of luck!!


This assignment is based on the Kaggle competition https://www.kaggle.com/c/loan-default-prediction, from which a sub-dataset has been taken.

### File Description
**train.csv** - the training set (to use for building a model)

**test.csv** - the test set (to use for generating predictions)

**sample_submission.csv** - a template for the submission file

### Data Description (contained in LendingClub_DataDescription.csv)
**ID**: A unique LC assigned ID for the loan listing.

**loan_amnt**: The listed amount of the loan applied for by the borrower. If at some point in time, the credit department reduces the loan amount, then it will be reflected in this value.

**loan_status**: Current status of the loan (**Target**: 1 = Charged Off, 0 = Fully Paid).

**term**: The number of payments on the loan. Values are in months and can be either 36 or 60.

**int_rate**: Interest Rate on the loan.

**installment**: The monthly payment owed by the borrower if the loan originates.

**emp_length**: Employment length in years. Possible values are between 0 and 10 where 0 means less than one year and 10 means ten or more years.

**home_ownership**: The home ownership status provided by the borrower during registration. Our values are: OTHER/NONE, MORTGAGE, OWN, RENT.

**annual_inc**: The self-reported annual income provided by the borrower during registration.

**purpose**: A category provided by the borrower for the loan request.

**title**: The loan title provided by the borrower.

**STATE**: The state provided by the borrower in the loan application.

**delinq_2yrs**: The number of 30+ days past-due incidences of delinquency in the borrower's credit file for the past 2 years.

**revol_bal**: Total credit revolving balance.

**revol_util**: Revolving line utilization rate, or the amount of credit the borrower is using relative to all available revolving credit.

**total_pymnt**: Indicates total payment at the end of the loan.

In [3]:
import os
import sys

os.environ['SPARK_HOME'] = "/Users/stavrostsentemeidis/Desktop/Install_Spark/spark-2.3.2-bin-hadoop2.7/"

# Create a variable for our root path
SPARK_HOME = os.environ['SPARK_HOME']

#Add the following paths to the system path. Please check your installation
#to make sure that these zip files actually exist. The names might change
#as versions change.
sys.path.insert(0,os.path.join(SPARK_HOME,"python"))
sys.path.insert(0,os.path.join(SPARK_HOME,"python","lib"))
sys.path.insert(0,os.path.join(SPARK_HOME,"python","lib","pyspark.zip"))
sys.path.insert(0,os.path.join(SPARK_HOME,"python","lib","py4j-0.10.7-src.zip"))

#Initialize SparkSession and SparkContext
from pyspark.sql import SparkSession
# Explicit imports instead of `import *`, which would shadow Python built-ins such as sum, min and max
from pyspark.sql.functions import avg, col, count, isnan, regexp_replace, when
from pyspark.ml import Pipeline, PipelineModel
from pyspark.ml.classification import (GBTClassifier, LogisticRegression,
                                       RandomForestClassificationModel, RandomForestClassifier)
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.feature import OneHotEncoderEstimator, StringIndexer, VectorAssembler
from pyspark.ml.param import Params
from pyspark.ml.tuning import CrossValidator, CrossValidatorModel, ParamGridBuilder
from pyspark.mllib.evaluation import MulticlassMetrics

#Create a Spark Session
MySparkSession = SparkSession \
    .builder \
    .master("local[2]") \
    .appName("MiPrimer") \
    .config("spark.executor.memory", "6g") \
    .config("spark.cores.max","4") \
    .getOrCreate()


#Get the Spark Context from Spark Session    
MySparkContext = MySparkSession.sparkContext

### Reading & Displaying Files

In [5]:
loanDF = MySparkSession.read.format('csv') \
                .option("inferSchema", "true") \
                .option("delimiter", ";") \
                .option('header','true') \
                .load('../data/train.csv') 

testDF = MySparkSession.read.format('csv') \
                .option("inferSchema", "true") \
                .option("delimiter", ";") \
                .option('header','true') \
                .load('../data/test.csv')

AnalysisException: 'Path does not exist: file:/Users/stavrostsentemeidis/Documents/GitHub/data/train.csv;'

In [None]:
loanDF.limit(10).toPandas()

In [None]:
testDF.limit(10).toPandas()

### EDA | Null Values | Cross Table Distribution | Covariances

#### Summary of Columns

In [None]:
loanDF.printSchema()
testDF.printSchema()

#### Renaming | Describing | Changing Data Type

In [None]:
# Strip the percent sign so int_rate can be cast to a number
loanDF = loanDF.withColumn('int_rate', regexp_replace('int_rate', '%', ''))
# Remove literal dots from title; the dot must be escaped, otherwise it matches every character
loanDF = loanDF.withColumn('title', regexp_replace('title', '\\.', ''))
# decimal(10,4) keeps the fractional part; decimal(10,0) would round it away
loanDF = loanDF.withColumn("int_rate", loanDF["int_rate"].cast("decimal(10,4)"))
loanDF = loanDF.withColumn("revol_util", loanDF["revol_util"].cast("decimal(10,4)"))

# Same cleaning steps for the test set
testDF = testDF.withColumn('int_rate', regexp_replace('int_rate', '%', ''))
testDF = testDF.withColumn('title', regexp_replace('title', '\\.', ''))
testDF = testDF.withColumn("int_rate", testDF["int_rate"].cast("decimal(10,4)"))
testDF = testDF.withColumn("revol_util", testDF["revol_util"].cast("decimal(10,4)"))
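
A note on the pattern used for dots: `regexp_replace` takes a regular expression, and an unescaped `.` matches any character. The sketch below uses Python's `re` module purely as an illustration (Spark uses Java regular expressions, but `.` has the same meaning there). The sample strings are made up and only imitate the style of `int_rate` and `title`.

```python
import re

sample_rate = "10.65%"                   # made-up value in the style of int_rate
sample_title = "Debt.consolidation..."   # made-up value in the style of title

# Unescaped dot: every character matches, so nothing is left.
wiped = re.sub(".", "", sample_rate)

# Escaped dot: only literal dots are removed.
no_dots = re.sub(r"\.", "", sample_title)

# The percent sign is not a metacharacter and needs no escaping.
no_percent = re.sub("%", "", sample_rate)

print(repr(wiped), repr(no_dots), repr(no_percent))
```

This is why a pattern of `'.'` applied to a numeric string leaves an empty string, which then casts to null.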

In [None]:
# Old name -> new name. Columns that keep their name (term, home_ownership,
# loan_status) need no entry.
column_renames = {
    "ID": "id",
    "loan_amnt": "loan_amount",
    "int_rate": "interest_rate",
    "installment": "monthly_payment",
    "emp_length": "employment_time",
    "delinq_2yrs": "deliquency_past_2years",
    "revol_bal": "revolving_balance",
    "revol_util": "revolving_utilization_rate",
    "total_pymnt": "total_payment",
    "purpose": "loan_purpose",
    "annual_inc": "annual_income",
    "STATE": "state",
}

# Apply the same renames to both frames so train and test stay aligned
for old_name, new_name in column_renames.items():
    loanDF = loanDF.withColumnRenamed(old_name, new_name)
    testDF = testDF.withColumnRenamed(old_name, new_name)

In [None]:
numeric_df = ['loan_amount','interest_rate', 'monthly_payment','annual_income', 'deliquency_past_2years',
              'total_payment','revolving_balance','revolving_utilization_rate']

categorical_Df = ['term','employment_time', 'home_ownership', 'loan_purpose', 'title','state']

loanDF.describe('loan_amount','term','interest_rate','title','employment_time','home_ownership').show()
loanDF.describe('annual_income','loan_purpose','monthly_payment','state','deliquency_past_2years').show()
loanDF.describe('revolving_balance','revolving_utilization_rate','total_payment','loan_status').show()

In [None]:
# show() prints the summary table; display() on a Spark DataFrame only prints its schema string
testDF.describe().show()

In [None]:
print(loanDF.count())
print(testDF.count())

#### Checking Null Values


In [None]:
remove_loanDF = loanDF.na.drop() 
remove_testDF = testDF.na.drop()  
print(remove_loanDF.count())
print(remove_testDF.count())

#### Cross Table Distribution


In [None]:
# cross tables distribution
loanDF.stat.crosstab('term','employment_time').show()
loanDF.stat.crosstab('term','home_ownership').show()
loanDF.stat.crosstab('term','loan_purpose').show()
loanDF.stat.crosstab('term','state').show()
loanDF.stat.crosstab('employment_time','home_ownership').show()
loanDF.stat.crosstab('employment_time','loan_purpose').show()
loanDF.stat.crosstab('employment_time','state').show()
loanDF.stat.crosstab('home_ownership','loan_purpose').show()
loanDF.stat.crosstab('home_ownership','state').show()
loanDF.stat.crosstab('loan_purpose','state').show()


#### Covariance


In [None]:
# covariance
for i in numeric_df:
    for j in numeric_df:
        print('Covariance of ' + i + ' and '+ j )
        print(loanDF.stat.cov(i, j))
        print("")
    


### ETL summary |  Spark code for Imputing 

Below we present the steps we decided to follow during our EDA in order to prepare our dataset for our machine learning implementation. It is worth mentioning that some of the steps had to be included in the previous part of the assignment in order to have a complete overview of the *cross table distributions* and the *covariances*.

1. **interest_rate** : remove special % character and change datatype from *string* to *decimal*.
2. **revol_util** : change datatype from *string* to *decimal*.
3. **Rename** columns to be better represented.
4. **Trim** the variable title as there are multiple unnecessary dots.
5. **Checking for NA**: 
    1. Number of rows with NA for *loanDF*:    29755
    2. Number of rows without NA for *loanDF*: 27773
    3. Number of rows with NA for *testDF*:    10024
    4. Number of rows without NA for *testDF*: 9116
6. **Filling Na values**:
    1. **title** :                      unknown
    2. **loan_purpose** :               unknown
    3. **state** :                      unknown
    4. **deliquency_past_2years** :     -1
    5. **revolving_balance** :          13350.529071398512 (avg)
    6. **revolving_utilization_rate** : 0.5054 (avg)
    7. **total_payment** :              12143.791490200982 (avg)
7. **Drop** variables *title* and *state* because they have too many unique values that cannot be grouped in order to apply one-hot encoding.


####  Imputing Null Values

In [None]:
# Use the same fill values (and the same spelling of 'unknown') for train and test,
# so that both frames end up with identical categories
loanDF = loanDF.na.fill({'title': 'unknown', 'loan_purpose': 'unknown', 'state': 'unknown',
                         'deliquency_past_2years':-1,'revolving_balance':13350.529071398512,
                         'revolving_utilization_rate':0.5054 , 'total_payment': 12143.791490200982})

testDF = testDF.na.fill({'title': 'unknown', 'loan_purpose': 'unknown', 'state': 'unknown',
                         'deliquency_past_2years':-1,'revolving_balance':13350.529071398512,
                         'revolving_utilization_rate':0.5054 , 'total_payment': 12143.791490200982})
print(loanDF.count())
print(testDF.count())
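
The averages used as fill values are hard-coded in the cell. A minimal sketch of the idea behind them, in plain Python with made-up rows: compute the fill value on the training rows only and reuse it for the test rows. `mean_ignoring_missing` and `fill_missing` are hypothetical helpers written for this illustration; in Spark the same statistic could be obtained with `avg` and passed to `na.fill`.

```python
def mean_ignoring_missing(rows, column):
    """Average of a column over the rows where it is present."""
    values = [r[column] for r in rows if r.get(column) is not None]
    return sum(values) / len(values)


def fill_missing(rows, fill_values):
    """Return copies of the rows with missing entries replaced."""
    return [
        {k: (fill_values[k] if v is None and k in fill_values else v)
         for k, v in row.items()}
        for row in rows
    ]


# Made-up rows, not taken from train.csv or test.csv
train_rows = [{"revolving_balance": 100.0},
              {"revolving_balance": None},
              {"revolving_balance": 300.0}]
test_rows = [{"revolving_balance": None}]

# The statistic is learned from the training rows only...
fill_values = {
    "revolving_balance": mean_ignoring_missing(train_rows, "revolving_balance")
}
# ...and then applied unchanged to both sets.
train_filled = fill_missing(train_rows, fill_values)
test_filled = fill_missing(test_rows, fill_values)
print(fill_values, test_filled)
```

Computing the value in code rather than pasting it keeps the fill consistent if the training file changes.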

####  Double checking final clean dataset

Checking all our columns to verify that we have:
   * Distinct user id so we do not need to group by.
   * Clean data format.
   * Clean cell values, (for example no Na and nan, which mean the same thing) 

In [None]:
loanDF.describe('loan_amount','term','interest_rate','title','employment_time','home_ownership').show()
loanDF.describe('annual_income','loan_purpose','monthly_payment','state','deliquency_past_2years').show()
loanDF.describe('revolving_balance','revolving_utilization_rate','total_payment','loan_status').show()

In [None]:
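# we have unique id = number of rows of dataset, so we do not need to group by.
for i in loanDF.columns:
    distinct_values = loanDF.select(i).distinct()
    print(i, distinct_values.count())
    # show() prints the table itself and returns None, so it is not wrapped in print()
    distinct_values.show()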


At this stage we check the final structure of the table and drop the **title** and **state** features, as they have so many distinct values that they are difficult to use in our model.

In [None]:
loanDF = loanDF.drop("title", 'state')
testDF = testDF.drop("title", 'state')
loanDF.printSchema()
testDF.printSchema()

### Creating VectorAssembler

In [None]:
# stages = []

# term_index = StringIndexer() \
#                     .setInputCol("term") \
#                     .setOutputCol("term_index") \
#                     .fit(loanDF)
# stages.append(term_index)

# employment_time_index = StringIndexer() \
#                 .setInputCol("employment_time") \
#                 .setOutputCol("employment_time_index") \
#                 .fit(loanDF)
# stages.append(employment_time_index)

# home_ownership_index = StringIndexer() \
#                     .setInputCol("home_ownership") \
#                     .setOutputCol("home_ownership_index") \
#                     .fit(loanDF)
# stages.append(home_ownership_index)

# loan_purpose_index = StringIndexer() \
#                 .setInputCol("loan_purpose") \
#                 .setOutputCol("loan_purpose_index") \
#                 .fit(loanDF)
# stages.append(loan_purpose_index)

# # state_index = StringIndexer() \
# #                     .setInputCol("state") \
# #                     .setOutputCol("state_index") \
# #                     .fit(loanDF)
# # stages.append(state_index)

# label_index = StringIndexer() \
#                 .setInputCol("loan_status") \
#                 .setOutputCol("status") \
#                 .fit(loanDF)
# stages.append(label_index)

# assembler = VectorAssembler(inputCols=["loan_amount", 'term_index', 'interest_rate','monthly_payment',
#                                       'employment_time_index', 'home_ownership_index', 'annual_income',
#                                       'loan_purpose_index', 'deliquency_past_2years',
#                                       'revolving_balance', 'revolving_utilization_rate','total_payment'],
#                                         outputCol='features')

# stages.append(assembler)

# rf = RandomForestClassifier() \
#         .setFeaturesCol("features") \
#         .setLabelCol("status") \
#         .setSeed(100)
# stages.append(rf)


# pipeline = Pipeline(stages=stages)

In [None]:
# evaluator = BinaryClassificationEvaluator() \
#                 .setLabelCol("loan_status") \
#                 .setRawPredictionCol("rawPrediction")

# print("We are using metric: " + evaluator.getMetricName())

### 2.4 Create Pipeline for train and test data

In [None]:
(trainingData, validationData) = loanDF.randomSplit([0.7, 0.3], seed=100)
print(trainingData.count())
print(validationData.count())

### Logistic Regression Model

#### Write a function "metrics" which has a LogisticRegressionModel.summary as input attribute and produces an output of: 
1. Area under ROC
2. False Positive Rate By Label
3. True Positive Rate By Label
4. Precision By Label
5. Recall By Label
6. fMeasure By Label
7. Accuracy
8. False Positive Rate
9. True Positive Rate
10. fMeasure
11. Precision
12. Recall
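
For reference, the per-label quantities in this list can all be derived from confusion-matrix counts. A minimal plain-Python sketch with made-up counts follows; the function name `binary_metrics` is hypothetical, and Spark's summary object computes these values itself, so this only illustrates the definitions.

```python
def binary_metrics(tp, fp, tn, fn):
    """Metrics for the positive class from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0   # also the true positive rate
    fpr = fp / (fp + tn) if fp + tn else 0.0      # false positive rate
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return {
        "precision": precision,
        "recall": recall,
        "false_positive_rate": fpr,
        "f_measure": f_measure,
        "accuracy": accuracy,
    }


# Made-up counts for illustration only
m = binary_metrics(tp=30, fp=10, tn=50, fn=10)
print(m)
```

The "by label" variants repeat the same computation once per class, treating that class as the positive one.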

In [None]:
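def metrics(trainingSummary):
    """Print the metrics exposed by a binary LogisticRegressionModel.summary."""
    print("AUC: " + str(trainingSummary.areaUnderROC))
    print("False Positive Rate by Label: " + str(trainingSummary.falsePositiveRateByLabel))
    print("True Positive Rate by Label: " + str(trainingSummary.truePositiveRateByLabel))
    print("Precision by Label: " + str(trainingSummary.precisionByLabel))
    print("Recall by Label: " + str(trainingSummary.recallByLabel))
    print("fMeasure by Label: " + str(trainingSummary.fMeasureByLabel()))
    print("Accuracy: " + str(trainingSummary.accuracy))
    # The overall rates below are the label-weighted averages
    print("False Positive Rate: " + str(trainingSummary.weightedFalsePositiveRate))
    print("True Positive Rate: " + str(trainingSummary.weightedTruePositiveRate))
    print("fMeasure: " + str(trainingSummary.weightedFMeasure()))
    print("Precision: " + str(trainingSummary.weightedPrecision))
    print("Recall: " + str(trainingSummary.weightedRecall))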

#### Apply a Logistic Regression Base Model and show the metrics by the function above

At this point we define the pipeline for the **Logistic Regression** model.

In [None]:
stages_lr = []

term_index = StringIndexer() \
                    .setInputCol("term") \
                    .setOutputCol("term_index") \
                    .fit(loanDF)
stages_lr.append(term_index)

employment_time_index = StringIndexer() \
                .setInputCol("employment_time") \
                .setOutputCol("employment_time_index") \
                .fit(loanDF)
stages_lr.append(employment_time_index)

home_ownership_index = StringIndexer() \
                    .setInputCol("home_ownership") \
                    .setOutputCol("home_ownership_index") \
                    .fit(loanDF)
stages_lr.append(home_ownership_index)

loan_purpose_index = StringIndexer() \
                .setInputCol("loan_purpose") \
                .setOutputCol("loan_purpose_index") \
                .fit(loanDF)
stages_lr.append(loan_purpose_index)

# state_index = StringIndexer() \
#                     .setInputCol("state") \
#                     .setOutputCol("state_index") \
#                     .fit(loanDF)
# stages.append(state_index)

label_index = StringIndexer() \
                .setInputCol("loan_status") \
                .setOutputCol("status") \
                .fit(loanDF)
stages_lr.append(label_index)

assembler_lr = VectorAssembler(inputCols=["loan_amount", 'term_index', 'interest_rate','monthly_payment',
                                      'employment_time_index', 'home_ownership_index', 'annual_income',
                                      'loan_purpose_index', 'deliquency_past_2years',
                                      'revolving_balance', 'revolving_utilization_rate'],
                                        outputCol='features_lr')

stages_lr.append(assembler_lr)

lr = LogisticRegression(featuresCol = "features_lr", labelCol = "status")

stages_lr.append(lr)


pipeline_lr = Pipeline(stages=stages_lr)

In [None]:
# Evaluate against "status", the indexed label the models are trained on
evaluator = BinaryClassificationEvaluator() \
                .setLabelCol("status") \
                .setRawPredictionCol("rawPrediction")

print("We are using metric: " + evaluator.getMetricName())
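
The default metric of `BinaryClassificationEvaluator` is `areaUnderROC`. As a rough intuition for what that number means, the sketch below computes AUC in plain Python as the fraction of (positive, negative) pairs that the scores rank correctly. The labels and scores are made up, `roc_auc` is a hypothetical helper, and Spark's own computation may differ in detail (for example in how it bins thresholds).

```python
def roc_auc(labels, scores):
    """Share of positive/negative pairs ranked correctly; ties count as half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Made-up labels and scores for illustration only
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(labels, scores)
print(auc)
```

A value of 0.5 corresponds to random ranking, which is why the scoring scheme asks for at least 50% on the test set.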

In [None]:
# Fit on the training split only; the validation split is kept for evaluation
model_lr = pipeline_lr.fit(trainingData)
pipeline_results_lr = model_lr.transform(validationData)
print("AUC: " + str(evaluator.evaluate(pipeline_results_lr)))
print()
print('Model Parameters')
print('----------------')
print(lr.extractParamMap())

#### We are going to try to improve our model:
1. Use a `weight column` in our Logistic Regression Model (take into account that we are working with an unbalanced dataset)
2. Define a `ParamGridBuilder` with at least `regParam`, `elasticNetParam` and `maxIter`
3. Define a `BinaryClassificationEvaluator`
4. Use Cross Validation with a 5-fold `CrossValidator`

Questions to answer:
1. Have we improved the ROC-AUC?
2. Which are the average ROC-AUC measurements in the different cross validation runs?
3. Which are the parameters of the best model in the 5 k-fold runs?
4. Which are the metrics of the best model (training) in the 5 k-fold runs? (Use the function above)
5. Which is the ROC-AUC on validation dataset?
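
One common way to build the weight column mentioned in step 1 is to weight each class inversely to its frequency. The sketch below shows only the arithmetic, in plain Python with made-up class counts; `balanced_class_weights` is a hypothetical helper. In Spark, the resulting weights could be attached as a column with `when(...).otherwise(...)` and passed to `LogisticRegression` through its `weightCol` parameter.

```python
def balanced_class_weights(label_counts):
    """Weight each class by total / (n_classes * class_count)."""
    total = sum(label_counts.values())
    n_classes = len(label_counts)
    return {label: total / (n_classes * count)
            for label, count in label_counts.items()}


# Made-up counts: 80 fully paid loans (0) and 20 charged-off loans (1)
counts = {0: 80, 1: 20}
weights = balanced_class_weights(counts)
print(weights)

# After weighting, both classes contribute the same total weight
weighted_totals = {label: counts[label] * weights[label] for label in counts}
print(weighted_totals)
```

With these weights the minority class is no longer outvoted simply because it has fewer rows.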


In [None]:
paramGrid_lr = ParamGridBuilder() \
                .addGrid(lr.regParam, [0.1, 0.5, 2.0]) \
                .addGrid(lr.elasticNetParam, [0.0, 0.5, 1.0])\
                .addGrid(lr.maxIter, [1, 5, 10])\
                .build()

print("Param Grid: " + str(paramGrid_lr))

        
cv_lr = CrossValidator() \
        .setEstimator(pipeline_lr) \
        .setEvaluator(evaluator) \
        .setEstimatorParamMaps(paramGrid_lr) \
        .setNumFolds(5)  # 5 folds, as the task list above specifies
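
To get a feel for the cost of this search, the number of model fits is the number of grid combinations times the number of folds. A small plain-Python sketch using the grid values from the cell above and assuming the 5 folds named in the task list; the variable names are illustrative only.

```python
from itertools import product

# Same candidate values as in paramGrid_lr
grid = {
    "regParam": [0.1, 0.5, 2.0],
    "elasticNetParam": [0.0, 0.5, 1.0],
    "maxIter": [1, 5, 10],
}

# Every combination of one value per parameter
combinations = [dict(zip(grid, values)) for values in product(*grid.values())]

num_folds = 5  # assumption: the fold count stated in the task list
total_fits = len(combinations) * num_folds
print(len(combinations), total_fits)
```

With these values the search trains 27 * 5 = 135 models, and `CrossValidator` then refits the best combination once on the full training set.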

In [None]:
cv_lr_model = cv_lr.fit(trainingData)
cv_lr_results = cv_lr_model.transform(validationData)

# Before we had 0.9127 and now
print("AUC: " + str(evaluator.evaluate(cv_lr_results)))

# Mean cross-validated AUC for each parameter combination in the grid
print("Means of metrics: " + str(cv_lr_model.avgMetrics))

### 4. Random Forest Model
1. Define a `ParamGridBuilder` with `maxDepth`, `numTrees` and `maxIter` at least
2. Define a `BinaryClassificationEvaluator` (You can use the above one)
3. Using Cross Validation with a 5-fold `CrossValidator`

Questions to answer:

1. Have we improved the ROC-AUC?
2. Which are the average ROC-AUC measurements in the different cross validation runs?
3. Which are the parameters of the best model in the 5 k-fold runs?
4. Which is the importance of the features?
5. Print full description of model.
6. Which is the ROC-AUC on validation dataset?
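
Regarding question 4: Spark's tree models expose `featureImportances` as a vector without names, in the order of the assembler's input columns. A small sketch of pairing the two, in plain Python; `rank_features` is a hypothetical helper and the numbers are placeholders, not results from this dataset.

```python
def rank_features(feature_names, importances):
    """Pair names with importances and sort from most to least important."""
    if len(feature_names) != len(importances):
        raise ValueError("names and importances must have the same length")
    return sorted(zip(feature_names, importances),
                  key=lambda pair: pair[1], reverse=True)


# Placeholder values for illustration only
names = ["loan_amount", "interest_rate", "annual_income"]
placeholder_importances = [0.2, 0.5, 0.3]

ranked = rank_features(names, placeholder_importances)
for name, importance in ranked:
    print(name, importance)
```

Once a pipeline has been fitted, the names would come from the assembler's `getInputCols()` and the values from the fitted forest's `featureImportances.toArray()`.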

In [None]:
stages = []

term_index = StringIndexer() \
                    .setInputCol("term") \
                    .setOutputCol("term_index") \
                    .fit(loanDF)
stages.append(term_index)

employment_time_index = StringIndexer() \
                .setInputCol("employment_time") \
                .setOutputCol("employment_time_index") \
                .fit(loanDF)
stages.append(employment_time_index)

home_ownership_index = StringIndexer() \
                    .setInputCol("home_ownership") \
                    .setOutputCol("home_ownership_index") \
                    .fit(loanDF)
stages.append(home_ownership_index)

loan_purpose_index = StringIndexer() \
                .setInputCol("loan_purpose") \
                .setOutputCol("loan_purpose_index") \
                .fit(loanDF)
stages.append(loan_purpose_index)

# state_index = StringIndexer() \
#                     .setInputCol("state") \
#                     .setOutputCol("state_index") \
#                     .fit(loanDF)
# stages.append(state_index)

label_index = StringIndexer() \
                .setInputCol("loan_status") \
                .setOutputCol("status") \
                .fit(loanDF)
stages.append(label_index)

assembler = VectorAssembler(inputCols=["loan_amount", 'term_index', 'interest_rate','monthly_payment',
                                      'employment_time_index', 'home_ownership_index', 'annual_income',
                                      'loan_purpose_index', 'deliquency_past_2years',
                                      'revolving_balance', 'revolving_utilization_rate'],
                                        outputCol='features')

stages.append(assembler)

rf = RandomForestClassifier() \
        .setFeaturesCol("features") \
        .setLabelCol("status") \
        .setSeed(100)
stages.append(rf)


pipeline = Pipeline(stages=stages)

In [None]:
# Evaluate against "status", the indexed label the models are trained on
evaluator = BinaryClassificationEvaluator() \
                .setLabelCol("status") \
                .setRawPredictionCol("rawPrediction")

print("We are using metric: " + evaluator.getMetricName())

In [None]:
model = pipeline.fit(trainingData)

pipeline_results = model.transform(validationData)
print("AUC: " + str(evaluator.evaluate(pipeline_results)))

print()
print('Model Parameters')
print('----------------')
print(rf.extractParamMap())

By setting the **paramGrid_rf** variable we define the set of different parameters we would like to test in order to tune our algorithm.

In [None]:
paramGrid_rf = ParamGridBuilder() \
                .addGrid(rf.maxDepth, [10,20]) \
                .addGrid(rf.numTrees, [10,20]) \
                .build()

print("Param Grid: " + str(paramGrid_rf))

cv_rf = CrossValidator() \
        .setEstimator(pipeline) \
        .setEvaluator(evaluator) \
        .setEstimatorParamMaps(paramGrid_rf) \
        .setNumFolds(5)  # 5 folds, as the task list above specifies

At this point we apply a **Cross Validation** strategy.

In [None]:
# cv_rf_model = cv_rf.fit(trainingData)
# cv_rf_results = cv_rf_model.transform(validationData)

# # Before we had 0.675 and now
# print("AUC: " + str(evaluator.evaluate(cv_rf_results)))

# # Means of model accuracy
# print("Means of metrics: " + str(cv_rf_model.avgMetrics))

We see that we still get lower results, so we stick with our initial model.

### 5. Gradient Boosting Model
1. Defining a `ParamGridBuilder` with `maxDepth`, `numTrees` and `maxIter` at least (You can use the above one)
2. Define a `BinaryClassificationEvaluator` (You can use the above one)
3. Using Cross Validation with a 5-fold `CrossValidator`

Questions to answer:

1. Have we improved the ROC-AUC?
2. Which are the average ROC-AUC measurements in the different cross validation runs?
3. Which are the parameters of the best model in the 5 k-fold runs?
4. Which is the importance of the features?
5. Print full description of model.
6. Which is the ROC-AUC on validation dataset?

In [None]:
stages_gbm = []

term_index = StringIndexer() \
                    .setInputCol("term") \
                    .setOutputCol("term_index") \
                    .fit(loanDF)
stages_gbm.append(term_index)

employment_time_index = StringIndexer() \
                .setInputCol("employment_time") \
                .setOutputCol("employment_time_index") \
                .fit(loanDF)
stages_gbm.append(employment_time_index)

home_ownership_index = StringIndexer() \
                    .setInputCol("home_ownership") \
                    .setOutputCol("home_ownership_index") \
                    .fit(loanDF)
stages_gbm.append(home_ownership_index)

loan_purpose_index = StringIndexer() \
                .setInputCol("loan_purpose") \
                .setOutputCol("loan_purpose_index") \
                .fit(loanDF)
stages_gbm.append(loan_purpose_index)

# state_index = StringIndexer() \
#                     .setInputCol("state") \
#                     .setOutputCol("state_index") \
#                     .fit(loanDF)
# stages.append(state_index)

label_index = StringIndexer() \
                .setInputCol("loan_status") \
                .setOutputCol("status") \
                .fit(loanDF)
stages_gbm.append(label_index)

assembler_gbm = VectorAssembler(inputCols=["loan_amount", 'term_index', 'interest_rate','monthly_payment',
                                      'employment_time_index', 'home_ownership_index', 'annual_income',
                                      'loan_purpose_index', 'deliquency_past_2years',
                                      'revolving_balance', 'revolving_utilization_rate'],
                                        outputCol='features_gbm')

stages_gbm.append(assembler_gbm)

gbm = GBTClassifier(labelCol="status", featuresCol="features_gbm", maxIter=10)

stages_gbm.append(gbm)


pipeline_gbm = Pipeline(stages=stages_gbm)

In [None]:
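# Fit on the training split only; the validation split is kept for evaluation
model_gbm = pipeline_gbm.fit(trainingData)
pipeline_results_gbm = model_gbm.transform(validationData)
print("AUC: " + str(evaluator.evaluate(pipeline_results_gbm)))
print()
print('Model Parameters')
print('----------------')
# Print the parameters of the GBT estimator (not the logistic regression one)
print(gbm.extractParamMap())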

In [None]:
# Reference the params on the gbm instance used in the pipeline, not on the class
paramGrid_gbm = ParamGridBuilder() \
                    .addGrid(gbm.maxDepth, [2,5])\
                    .addGrid(gbm.maxIter, [10,20])\
                    .build()

print("Param Grid: " + str(paramGrid_gbm))

cv_gbm = CrossValidator() \
        .setEstimator(pipeline_gbm) \
        .setEvaluator(evaluator) \
        .setEstimatorParamMaps(paramGrid_gbm) \
        .setNumFolds(5)  # 5 folds, as the task list above specifies

In [None]:
cv_gbm_model = cv_gbm.fit(trainingData)
cv_gbm_results = cv_gbm_model.transform(validationData)

# Before we had 0.675 and now
print("AUC: " + str(evaluator.evaluate(cv_gbm_results)))

# Mean cross-validated AUC for each parameter combination in the grid
print("Means of metrics: " + str(cv_gbm_model.avgMetrics))

### Apply your best model to send the predictions on test
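
A minimal sketch of this step, assuming a fitted pipeline or cross-validation model and a test DataFrame that contains every column the pipeline stages read (including the label column, if the label indexer is part of the pipeline). `write_predictions` is a hypothetical helper and the output path is a placeholder; the function relies only on the standard `transform`, `select`, `coalesce` and `write.csv` DataFrame calls.

```python
def write_predictions(fitted_model, test_df, output_path):
    """Score test_df with a fitted model and save id + prediction as CSV."""
    # Keep only the identifier and the predicted class
    predictions = fitted_model.transform(test_df).select("id", "prediction")
    # coalesce(1) keeps the output in a single part file inside output_path
    predictions.coalesce(1).write.csv(output_path, header=True, mode="overwrite")
    return predictions
```

A call could look like `write_predictions(cv_lr_model, testDF, "../data/predictions")`, where the path is only an example. Spark writes a directory containing the part file rather than a single named CSV file.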

In [None]:
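# Stop the session created at the top of the notebook (no variable named sc is defined here)
MySparkSession.stop()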