## DONT USE GET RID OF

File 08 - Gradient Boosted Tree
##### Group 12:

##### Hannah Schmuckler, mmc4cv

##### Rob Schwartz, res7cd

In this file, we create an ML pipeline based on the output from File 02 (Feature creation).

The files needed are `/processed_data/train.parquet` and `/processed_data/test.parquet`.

We create a gradient boosted tree model and fine-tune. 

In [1]:
%%time
from pyspark.sql import SparkSession
from pyspark.sql import types as T
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import GBTRegressor
from pyspark.mllib.util import MLUtils
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
from pyspark.sql.functions import col
from pyspark.sql.functions import log
from pyspark.ml.stat import Correlation


import pandas as pd
import numpy as np
import copy

spark = SparkSession.builder \
        .appName("project") \
        .getOrCreate()

sc = spark.sparkContext

CPU times: user 658 ms, sys: 439 ms, total: 1.1 s
Wall time: 8.47 s


### Read in dataframes for train and test sets

This data should have been previously generated: we can find it in the `processed_data` folder.

In [2]:
%%time
trainDF = spark.read.parquet("./processed_data/train.parquet")
testDF = spark.read.parquet("./processed_data/test.parquet")


CPU times: user 3.25 ms, sys: 2.03 ms, total: 5.27 ms
Wall time: 4.49 s


### Set up Spark ML pipeline training for random forest

We create the function `generatePipeline(inputCols, outputCol)`, Then, we train the pipeline using this function. 

In [3]:
%%time

def generatePipeline(inputCols, outputCol):
    
    # Select input columns for random forest regression
    vecAssembler = VectorAssembler(inputCols=inputCols, outputCol="features")

    # Select output column for random forest regression
    gb = GBTRegressor(featuresCol="features", labelCol=outputCol, seed = 42)#, numTrees=5, maxDepth=5)
    
    pipeline = Pipeline(stages=[vecAssembler, gb])
    return pipeline



CPU times: user 8 µs, sys: 7 µs, total: 15 µs
Wall time: 18.6 µs


In [4]:
# Calculate adjusted r2 (https://towardsdatascience.com/machine-learning-linear-regression-using-pyspark-9d5d5c772b42)
def adj_r2(r2, inputCols, testDF):
    n = testDF.count()
    p = len(inputCols)
    adjusted_r2 = 1-(((1-r2)*(n-1))/(n-p-1))
    return adjusted_r2

In [5]:
def getEvaluationMetrics(pipelineMode,outputCol,testDF,inputCols):
    predDF = pipelineModel.transform(testDF)
    predDF.select(outputCol, "prediction").show(10)
    
    regressionEvaluator = RegressionEvaluator(
    predictionCol="prediction",
    labelCol=outputCol,
    metricName="rmse")
    rmse = regressionEvaluator.evaluate(predDF)

    regressionEvaluator = RegressionEvaluator(
    predictionCol="prediction",
    labelCol=outputCol,
    metricName="r2")
    r2 = regressionEvaluator.evaluate(predDF)
    
    # Manually calculate Adjusted r2
    adjusted_r2 = adj_r2(r2, inputCols, testDF)
    
    return rmse, r2, adjusted_r2



In [6]:
def modelInfo(inputCols, pipelineModel):
    modelCols = pipelineModel.stages[-2].getInputCols()
    
    feature_importance = pipelineModel.stages[-1].featureImportances
    
    return pd.DataFrame(list(zip(modelCols, feature_importance)), columns = ['Column name', 'Importance']).sort_values(by="Importance", ascending = False)

In [11]:
print("** All inputs, normal output **")
inputCols = ["total_spend", "total_events", "total_sessions", "avg_session_length", "sd_session_length", "avg_interactions_per_session", 
             "sd_interactions_per_session", "max_interactions_per_session", "purchase_pct_of_total_events", "view_pct_of_total_events",
             "cart_pct_of_total_events", "avg_purchases_per_session", "sd_purchases_per_session", "cart_events", "purchase_events", 
             "sessions_with_purchase", "sessions_with_cart", "sessions_with_view", "pct_sessions_end_purchase", 
             "pct_sessions_end_cart"]
outputCol = "T_total_spend"

pipeline = generatePipeline(inputCols, outputCol)
pipelineModel = pipeline.fit(trainDF)

print(modelInfo(inputCols, pipelineModel))

evaluationMetrics = getEvaluationMetrics(pipelineModel,outputCol,testDF, inputCols)
print(f"RMSE is {evaluationMetrics[0]:.1f}")
print(f"R^2 is {evaluationMetrics[1]:.5f}")
print(f"Adjusted R^2 is {evaluationMetrics[2]:.5f}")

** All inputs, normal output **
                     Column name  Importance
15        sessions_with_purchase    0.149377
0                    total_spend    0.120364
19         pct_sessions_end_cart    0.107172
7   max_interactions_per_session    0.072532
5   avg_interactions_per_session    0.067686
8   purchase_pct_of_total_events    0.061044
3             avg_session_length    0.060804
18     pct_sessions_end_purchase    0.051655
10      cart_pct_of_total_events    0.051229
6    sd_interactions_per_session    0.046800
16            sessions_with_cart    0.046030
9       view_pct_of_total_events    0.037471
2                 total_sessions    0.033459
11     avg_purchases_per_session    0.023684
13                   cart_events    0.018355
12      sd_purchases_per_session    0.014163
14               purchase_events    0.012375
1                   total_events    0.012209
17            sessions_with_view    0.006955
4              sd_session_length    0.006636
+------------------+---

In [11]:
%%time
inputCols = ["total_spend","total_events","cart_events", "view_events"]

outputCol = "T_total_spend_log"

pipeline = generatePipeline(inputCols, outputCol)
pipelineModel = pipeline.fit(trainDF)

vecAssembler = VectorAssembler(inputCols=inputCols, outputCol="features")

gb = GBTRegressor(featuresCol="features", labelCol=outputCol, seed = 42)

pipeline = Pipeline(stages=[vecAssembler, gb])

# Set parameters to test
paramGrid = ParamGridBuilder() \
    .addGrid(gb.maxIter, [10, 20, 30]) \
    .addGrid(gb.stepSize, [.01, .1, .5]) \
    .build()

crossval = CrossValidator(estimator=pipeline,
                          estimatorParamMaps=paramGrid,
                          evaluator=RegressionEvaluator().setLabelCol("T_total_spend_log"),
                          numFolds=4)

CPU times: user 23 ms, sys: 4.32 ms, total: 27.3 ms
Wall time: 7.18 s


In [12]:
%%time
cvModel = crossval.fit(trainDF)

CPU times: user 2.08 s, sys: 610 ms, total: 2.69 s
Wall time: 5min 18s


In [13]:
print()
print(cvModel.getEstimatorParamMaps()[np.argmax(cvModel.avgMetrics)])

evaluationMetrics = getEvaluationMetrics(cvModel.bestModel,"T_total_spend",testDF,inputCols)
print(f"RMSE is {evaluationMetrics[0]:.1f}")
print(f"R^2 is {evaluationMetrics[1]:.5f}")
print(f"Adjusted R^2 is {evaluationMetrics[2]:.5f}")


{Param(parent='GBTRegressor_d3ddeb2a7ab6', name='maxIter', doc='max number of iterations (>= 0).'): 10, Param(parent='GBTRegressor_d3ddeb2a7ab6', name='stepSize', doc='Step size (a.k.a. learning rate) in interval (0, 1] for shrinking the contribution of each estimator.'): 0.01}
+-----------------+------------------+
|    T_total_spend|        prediction|
+-----------------+------------------+
|              0.0|-5.401702215435658|
|              0.0|-5.116493226972563|
|              0.0|-4.613897781605019|
|              0.0|-5.756182796571637|
|              0.0|-5.572453561657388|
|              0.0|-5.736000545638466|
|              0.0|-5.182997024887596|
|              0.0|-5.402200848194742|
|79118.00024795532|11.953560429151835|
|663.1999969482422| 8.014701964359105|
+-----------------+------------------+
only showing top 10 rows

RMSE is 71326.9
R^2 is -0.03782
Adjusted R^2 is -0.03788


#### For some reason, doing cross validation/parameter tuning doesn't work at all on the gradient boosted tree. 