## File 08 - Gradient Boosted Tree
##### Group 12:

##### Hannah Schmuckler, mmc4cv

##### Rob Schwartz, res7cd

In this file, we create an ML pipeline based on the output from File 02 (Feature creation).

The files needed are `/processed_data/train.parquet` and `/processed_data/test.parquet`.

We create a gradient boosted tree model and fine-tune. 

In [1]:
%%time
from pyspark.sql import SparkSession
from pyspark.sql import types as T
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import GBTRegressor
from pyspark.mllib.util import MLUtils
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
from pyspark.sql.functions import col
from pyspark.sql.functions import log
from pyspark.ml.stat import Correlation


import pandas as pd
import numpy as np
import copy

spark = SparkSession.builder \
        .appName("project") \
        .getOrCreate()

sc = spark.sparkContext

CPU times: user 530 ms, sys: 403 ms, total: 933 ms
Wall time: 5.71 s


### Read in dataframes for train and test sets

This data should have been previously generated: we can find it in the `processed_data` folder.

In [2]:
%%time
trainDF = spark.read.parquet("./processed_data/train.parquet")
testDF = spark.read.parquet("./processed_data/test.parquet")
trainDF.show(5)

+---------+------------------+-----------------+------------+--------------+------------------+------------------+----------------------------+---------------------------+----------------------------+----------------------------+------------------------+------------------------+-------------------------+------------------------+-----------+---------------+-----------+----------------------+------------------+------------------+-------------------------+---------------------+------------------+------------------+--------------------+------------------+------------------+--------------------+--------------------+--------------------+--------------------+
|  user_id|     T_total_spend|      total_spend|total_events|total_sessions|avg_session_length| sd_session_length|avg_interactions_per_session|sd_interactions_per_session|max_interactions_per_session|purchase_pct_of_total_events|view_pct_of_total_events|cart_pct_of_total_events|avg_purchases_per_session|sd_purchases_per_session|cart_even

### Set up Spark ML pipeline training for random forest

We create the function `generatePipeline(inputCols, outputCol)`, Then, we train the pipeline using this function. 

In [3]:
%%time

def generatePipeline(inputCols, outputCol):
    
    # Select input columns for random forest regression
    vecAssembler = VectorAssembler(inputCols=inputCols, outputCol="features")

    # Select output column for random forest regression
    gb = GBTRegressor(featuresCol="features", labelCol=outputCol, seed = 42)#, numTrees=5, maxDepth=5)
    
    pipeline = Pipeline(stages=[vecAssembler, gb])
    return pipeline



CPU times: user 4 µs, sys: 0 ns, total: 4 µs
Wall time: 6.68 µs


In [4]:
# Calculate adjusted r2 (https://towardsdatascience.com/machine-learning-linear-regression-using-pyspark-9d5d5c772b42)
def adj_r2(r2, inputCols, testDF):
    n = testDF.count()
    p = len(inputCols)
    adjusted_r2 = 1-(((1-r2)*(n-1))/(n-p-1))
    return adjusted_r2

In [5]:
def getEvaluationMetrics(pipelineMode,outputCol,testDF,inputCols):
    predDF = pipelineModel.transform(testDF)
    predDF.select(outputCol, "prediction").show(10)
    
    regressionEvaluator = RegressionEvaluator(
    predictionCol="prediction",
    labelCol=outputCol,
    metricName="rmse")
    rmse = regressionEvaluator.evaluate(predDF)

    regressionEvaluator = RegressionEvaluator(
    predictionCol="prediction",
    labelCol=outputCol,
    metricName="r2")
    r2 = regressionEvaluator.evaluate(predDF)
    
    # Manually calculate Adjusted r2
    adjusted_r2 = adj_r2(r2, inputCols, testDF)
    
    return rmse, r2, adjusted_r2



In [6]:
def modelInfo(inputCols, pipelineModel):
    modelCols = pipelineModel.stages[-2].getInputCols()
    
    feature_importance = pipelineModel.stages[-1].featureImportances
    
    return pd.DataFrame(list(zip(modelCols, feature_importance)), columns = ['Column name', 'Importance']).sort_values(by="Importance", ascending = False)

In [7]:
print("** All normal inputs, normal output **")
inputCols = ["total_spend","total_events","purchase_events", "total_sessions", "avg_session_length", "avg_interactions_per_session", "max_interactions_per_session",
             "purchase_pct_of_total_events", "view_pct_of_total_events", "cart_pct_of_total_events","avg_purchases_per_session", "cart_events", "purchase_events",
             "view_events", "sessions_with_purchase", "sessions_with_cart","sessions_with_view", "pct_sessions_end_purchase", "pct_sessions_end_cart", 'sd_session_length', 
             'sd_interactions_per_session', 'sd_purchases_per_session']

outputCol = "T_total_spend"

pipeline = generatePipeline(inputCols, outputCol)
pipelineModel = pipeline.fit(trainDF)

print(modelInfo(inputCols, pipelineModel))

evaluationMetrics = getEvaluationMetrics(pipelineModel,outputCol,testDF, inputCols)
print(f"RMSE is {evaluationMetrics[0]:.1f}")
print(f"R^2 is {evaluationMetrics[1]:.5f}")
print(f"Adjusted R^2 is {evaluationMetrics[2]:.5f}")

** All normal inputs, normal output **
                     Column name  Importance
1                   total_events    0.436859
0                    total_spend    0.167924
7   purchase_pct_of_total_events    0.075284
2                purchase_events    0.046957
8       view_pct_of_total_events    0.026099
13                   view_events    0.026002
11                   cart_events    0.022433
4             avg_session_length    0.019699
15            sessions_with_cart    0.019117
6   max_interactions_per_session    0.019015
10     avg_purchases_per_session    0.018866
21      sd_purchases_per_session    0.018790
5   avg_interactions_per_session    0.017099
9       cart_pct_of_total_events    0.016905
17     pct_sessions_end_purchase    0.016491
18         pct_sessions_end_cart    0.015591
14        sessions_with_purchase    0.015496
19             sd_session_length    0.010190
16            sessions_with_view    0.007199
20   sd_interactions_per_session    0.002907
3               

In [8]:
print("** All normal inputs + log inputs (25 total), normal output **")
inputCols = ["total_spend","total_events","purchase_events", "total_sessions", "avg_session_length", "avg_interactions_per_session", "max_interactions_per_session",
             "purchase_pct_of_total_events", "view_pct_of_total_events", "cart_pct_of_total_events","avg_purchases_per_session", "cart_events", 
             "view_events", "sessions_with_purchase", "sessions_with_cart","sessions_with_view", "pct_sessions_end_purchase", "pct_sessions_end_cart", 'sd_session_length', 
             'sd_interactions_per_session', 'sd_purchases_per_session', 'total_spend_log', 'total_events_log', 'purchase_events_log', 'total_sessions_log']

outputCol = "T_total_spend"

pipeline = generatePipeline(inputCols, outputCol)
pipelineModel = pipeline.fit(trainDF)

print(modelInfo(inputCols, pipelineModel))

evaluationMetrics = getEvaluationMetrics(pipelineModel,outputCol,testDF, inputCols)
print(f"RMSE is {evaluationMetrics[0]:.1f}")
print(f"R^2 is {evaluationMetrics[1]:.5f}")
print(f"Adjusted R^2 is {evaluationMetrics[2]:.5f}")

** All normal inputs + log inputs (25 total), normal output **
                     Column name  Importance
1                   total_events    0.435210
0                    total_spend    0.169810
7   purchase_pct_of_total_events    0.073837
2                purchase_events    0.039631
12                   view_events    0.026925
11                   cart_events    0.025445
8       view_pct_of_total_events    0.023211
4             avg_session_length    0.021745
14            sessions_with_cart    0.019770
6   max_interactions_per_session    0.018971
10     avg_purchases_per_session    0.018866
20      sd_purchases_per_session    0.018752
16     pct_sessions_end_purchase    0.016491
13        sessions_with_purchase    0.015649
17         pct_sessions_end_cart    0.015591
9       cart_pct_of_total_events    0.015384
5   avg_interactions_per_session    0.013464
18             sd_session_length    0.009477
15            sessions_with_view    0.007592
3                 total_sessions    0

#### How interesting! Three of the four log values wound up with importances of 0. How does a log output work for this one?

In [9]:
print("** All normal inputs + log inputs (25 total), log output **")
inputCols = ["total_spend","total_events","purchase_events", "total_sessions", "avg_session_length", "avg_interactions_per_session", "max_interactions_per_session",
             "purchase_pct_of_total_events", "view_pct_of_total_events", "cart_pct_of_total_events","avg_purchases_per_session", "cart_events", 
             "view_events", "sessions_with_purchase", "sessions_with_cart","sessions_with_view", "pct_sessions_end_purchase", "pct_sessions_end_cart", 'sd_session_length', 
             'sd_interactions_per_session', 'sd_purchases_per_session', 'total_spend_log', 'total_events_log', 'purchase_events_log', 'total_sessions_log']

outputCol = "T_total_spend_log"

pipeline = generatePipeline(inputCols, outputCol)
pipelineModel = pipeline.fit(trainDF)

print(modelInfo(inputCols, pipelineModel))

evaluationMetrics = getEvaluationMetrics(pipelineModel,outputCol,testDF, inputCols)
print(f"RMSE is {evaluationMetrics[0]:.1f}")
print(f"R^2 is {evaluationMetrics[1]:.5f}")
print(f"Adjusted R^2 is {evaluationMetrics[2]:.5f}")

** All normal inputs + log inputs (25 total), log output **
                     Column name  Importance
1                   total_events    0.380251
12                   view_events    0.262296
0                    total_spend    0.158733
11                   cart_events    0.076981
22              total_events_log    0.025315
2                purchase_events    0.023530
8       view_pct_of_total_events    0.016044
7   purchase_pct_of_total_events    0.014833
4             avg_session_length    0.014179
3                 total_sessions    0.008440
5   avg_interactions_per_session    0.003375
9       cart_pct_of_total_events    0.003367
10     avg_purchases_per_session    0.002819
6   max_interactions_per_session    0.002395
15            sessions_with_view    0.002045
18             sd_session_length    0.001559
20      sd_purchases_per_session    0.001318
13        sessions_with_purchase    0.001147
14            sessions_with_cart    0.001102
17         pct_sessions_end_cart    0.00

#### Dang! The log output actually worked significantly better this time. We'll use that for the remainder. Time for tuning...

In [10]:
print("** Smaller model, log output **")
inputCols = ["total_spend","total_events","cart_events", "view_events"]

outputCol = "T_total_spend_log"

pipeline = generatePipeline(inputCols, outputCol)
pipelineModel = pipeline.fit(trainDF)

print(modelInfo(inputCols, pipelineModel))

evaluationMetrics = getEvaluationMetrics(pipelineModel,outputCol,testDF, inputCols)
print(f"RMSE is {evaluationMetrics[0]:.1f}")
print(f"R^2 is {evaluationMetrics[1]:.5f}")
print(f"Adjusted R^2 is {evaluationMetrics[2]:.5f}")

** Smaller model, log output **
    Column name  Importance
1  total_events    0.428642
3   view_events    0.265078
0   total_spend    0.169663
2   cart_events    0.136617
+------------------+------------------+
| T_total_spend_log|        prediction|
+------------------+------------------+
|-6.907755278982137|-5.401702215435658|
|-6.907755278982137|-5.116493226972563|
|-6.907755278982137|-4.613897781605019|
|-6.907755278982137|-5.756182796571637|
|-6.907755278982137|-5.572453561657388|
|-6.907755278982137|-5.736000545638466|
|-6.907755278982137|-5.182997024887596|
|-6.907755278982137|-5.402200848194742|
|11.278695703691795|11.953560429151835|
|6.4970781070591626| 8.014701964359105|
+------------------+------------------+
only showing top 10 rows

RMSE is 4.5
R^2 is 0.60888
Adjusted R^2 is 0.60886


#### The champion model with the best adjusted r-square (.60886) has only 4 input columns, and uses the log output. But is it using the best hyperparameters? 

In [11]:
%%time
inputCols = ["total_spend","total_events","cart_events", "view_events"]

outputCol = "T_total_spend_log"

pipeline = generatePipeline(inputCols, outputCol)
pipelineModel = pipeline.fit(trainDF)

vecAssembler = VectorAssembler(inputCols=inputCols, outputCol="features")

gb = GBTRegressor(featuresCol="features", labelCol=outputCol, seed = 42)

pipeline = Pipeline(stages=[vecAssembler, gb])

# Set parameters to test
paramGrid = ParamGridBuilder() \
    .addGrid(gb.maxIter, [10, 20, 30]) \
    .addGrid(gb.stepSize, [.01, .1, .5]) \
    .build()

crossval = CrossValidator(estimator=pipeline,
                          estimatorParamMaps=paramGrid,
                          evaluator=RegressionEvaluator().setLabelCol("T_total_spend_log"),
                          numFolds=4)

CPU times: user 23 ms, sys: 4.32 ms, total: 27.3 ms
Wall time: 7.18 s


In [12]:
%%time
cvModel = crossval.fit(trainDF)

CPU times: user 2.08 s, sys: 610 ms, total: 2.69 s
Wall time: 5min 18s


In [13]:
print()
print(cvModel.getEstimatorParamMaps()[np.argmax(cvModel.avgMetrics)])

evaluationMetrics = getEvaluationMetrics(cvModel.bestModel,"T_total_spend",testDF,inputCols)
print(f"RMSE is {evaluationMetrics[0]:.1f}")
print(f"R^2 is {evaluationMetrics[1]:.5f}")
print(f"Adjusted R^2 is {evaluationMetrics[2]:.5f}")


{Param(parent='GBTRegressor_d3ddeb2a7ab6', name='maxIter', doc='max number of iterations (>= 0).'): 10, Param(parent='GBTRegressor_d3ddeb2a7ab6', name='stepSize', doc='Step size (a.k.a. learning rate) in interval (0, 1] for shrinking the contribution of each estimator.'): 0.01}
+-----------------+------------------+
|    T_total_spend|        prediction|
+-----------------+------------------+
|              0.0|-5.401702215435658|
|              0.0|-5.116493226972563|
|              0.0|-4.613897781605019|
|              0.0|-5.756182796571637|
|              0.0|-5.572453561657388|
|              0.0|-5.736000545638466|
|              0.0|-5.182997024887596|
|              0.0|-5.402200848194742|
|79118.00024795532|11.953560429151835|
|663.1999969482422| 8.014701964359105|
+-----------------+------------------+
only showing top 10 rows

RMSE is 71326.9
R^2 is -0.03782
Adjusted R^2 is -0.03788


#### For some reason, doing cross validation/parameter tuning doesn't work at all on the gradient boosted tree. 