## File 07 - Random Forest

##### Group 12:

##### Hannah Schmuckler, mmc4cv

##### Rob Schwartz, res7cd

In this file, we create an ML pipeline based on the output from File 02 (Feature creation).

The files needed are `/processed_data/train.parquet` and `/processed_data/test.parquet`.

We create a random forest model and fine-tune. 

### Set up Spark session

We can specify more options in the SparkSession creator, but currently the options are at the default settings.

In [1]:
%%time
from pyspark.sql import SparkSession
from pyspark.sql import types as T
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import RandomForestRegressor
from pyspark.mllib.util import MLUtils
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
from pyspark.sql.functions import col
from pyspark.sql.functions import log
from pyspark.ml.stat import Correlation


import pandas as pd
import numpy as np
import copy

spark = SparkSession.builder \
        .appName("project") \
        .getOrCreate()

sc = spark.sparkContext

CPU times: user 503 ms, sys: 421 ms, total: 925 ms
Wall time: 5.71 s


### Read in dataframes for train and test sets

This data should have been previously generated: we can find it in the `processed_data` folder.

In [2]:
%%time
trainDF = spark.read.parquet("./processed_data/train.parquet")
testDF = spark.read.parquet("./processed_data/test.parquet")
trainDF.show(5)

+---------+------------------+-----------------+------------+--------------+------------------+------------------+----------------------------+---------------------------+----------------------------+----------------------------+------------------------+------------------------+-------------------------+------------------------+-----------+---------------+-----------+----------------------+------------------+------------------+-------------------------+---------------------+------------------+------------------+--------------------+------------------+------------------+--------------------+--------------------+--------------------+--------------------+
|  user_id|     T_total_spend|      total_spend|total_events|total_sessions|avg_session_length| sd_session_length|avg_interactions_per_session|sd_interactions_per_session|max_interactions_per_session|purchase_pct_of_total_events|view_pct_of_total_events|cart_pct_of_total_events|avg_purchases_per_session|sd_purchases_per_session|cart_even

### Set up Spark ML pipeline training for random forest

We create the function `generatePipeline(inputCols, outputCol)`, Then, we train the pipeline using this function. 

In [3]:
%%time

def generatePipeline(inputCols, outputCol):
    
    # Select input columns for random forest regression
    vecAssembler = VectorAssembler(inputCols=inputCols, outputCol="features")

    # Select output column for random forest regression
    rf = RandomForestRegressor(featuresCol="features", labelCol=outputCol, seed = 42)#, numTrees=5, maxDepth=5)
    
    pipeline = Pipeline(stages=[vecAssembler, rf])
    return pipeline



CPU times: user 4 µs, sys: 0 ns, total: 4 µs
Wall time: 6.68 µs


In [4]:
# Calculate adjusted r2 (https://towardsdatascience.com/machine-learning-linear-regression-using-pyspark-9d5d5c772b42)
def adj_r2(r2, inputCols, testDF):
    n = testDF.count()
    p = len(inputCols)
    adjusted_r2 = 1-(((1-r2)*(n-1))/(n-p-1))
    return adjusted_r2

In [5]:
def getEvaluationMetrics(pipelineMode,outputCol,testDF,inputCols):
    predDF = pipelineModel.transform(testDF)
    predDF.select(outputCol, "prediction").show(10)
    
    regressionEvaluator = RegressionEvaluator(
    predictionCol="prediction",
    labelCol=outputCol,
    metricName="rmse")
    rmse = regressionEvaluator.evaluate(predDF)

    regressionEvaluator = RegressionEvaluator(
    predictionCol="prediction",
    labelCol=outputCol,
    metricName="r2")
    r2 = regressionEvaluator.evaluate(predDF)
    
    # Manually calculate Adjusted r2
    adjusted_r2 = adj_r2(r2, inputCols, testDF)
    
    return rmse, r2, adjusted_r2



In [6]:
def modelInfo(inputCols, pipelineModel):
    modelCols = pipelineModel.stages[-2].getInputCols()
    
    feature_importance = pipelineModel.stages[-1].featureImportances
    
    return pd.DataFrame(list(zip(modelCols, feature_importance)), columns = ['Column name', 'Importance']).sort_values(by="Importance", ascending = False)

In [7]:
print("** All normal inputs, normal output **")
inputCols = ["total_spend","total_events","purchase_events", "total_sessions", "avg_session_length", "avg_interactions_per_session", "max_interactions_per_session",
             "purchase_pct_of_total_events", "view_pct_of_total_events", "cart_pct_of_total_events","avg_purchases_per_session", "cart_events",
             "view_events", "sessions_with_purchase", "sessions_with_cart","sessions_with_view", "pct_sessions_end_purchase", "pct_sessions_end_cart", 'sd_session_length', 
             'sd_interactions_per_session', 'sd_purchases_per_session']

outputCol = "T_total_spend"

pipeline = generatePipeline(inputCols, outputCol)
pipelineModel = pipeline.fit(trainDF)

print(modelInfo(inputCols, pipelineModel))

evaluationMetrics = getEvaluationMetrics(pipelineModel,outputCol,testDF, inputCols)
print(f"RMSE is {evaluationMetrics[0]:.1f}")
print(f"R^2 is {evaluationMetrics[1]:.5f}")
print(f"Adjusted R^2 is {evaluationMetrics[2]:.5f}")

** All normal inputs, normal output **
                     Column name  Importance
1                   total_events    0.496159
0                    total_spend    0.268796
13        sessions_with_purchase    0.050056
14            sessions_with_cart    0.041980
7   purchase_pct_of_total_events    0.029680
2                purchase_events    0.028271
15            sessions_with_view    0.015511
8       view_pct_of_total_events    0.014895
12                   view_events    0.013449
3                 total_sessions    0.008379
11                   cart_events    0.007676
16     pct_sessions_end_purchase    0.005491
9       cart_pct_of_total_events    0.005483
10     avg_purchases_per_session    0.003669
6   max_interactions_per_session    0.003664
17         pct_sessions_end_cart    0.001570
20      sd_purchases_per_session    0.001437
19   sd_interactions_per_session    0.001403
4             avg_session_length    0.000941
18             sd_session_length    0.000930
5   avg_interact

In [8]:
print("** All normal inputs + log inputs (25 total), normal output **")
inputCols = ["total_spend","total_events","purchase_events", "total_sessions", "avg_session_length", "avg_interactions_per_session", "max_interactions_per_session",
             "purchase_pct_of_total_events", "view_pct_of_total_events", "cart_pct_of_total_events","avg_purchases_per_session", "cart_events", 
             "view_events", "sessions_with_purchase", "sessions_with_cart","sessions_with_view", "pct_sessions_end_purchase", "pct_sessions_end_cart", 'sd_session_length', 
             'sd_interactions_per_session', 'sd_purchases_per_session', 'total_spend_log', 'total_events_log', 'purchase_events_log', 'total_sessions_log']

outputCol = "T_total_spend"

pipeline = generatePipeline(inputCols, outputCol)
pipelineModel = pipeline.fit(trainDF)

print(modelInfo(inputCols, pipelineModel))

evaluationMetrics = getEvaluationMetrics(pipelineModel,outputCol,testDF, inputCols)
print(f"RMSE is {evaluationMetrics[0]:.1f}")
print(f"R^2 is {evaluationMetrics[1]:.5f}")
print(f"Adjusted R^2 is {evaluationMetrics[2]:.5f}")

** All normal inputs + log inputs (25 total), normal output **
                     Column name  Importance
1                   total_events    0.392177
21               total_spend_log    0.183318
22              total_events_log    0.145389
0                    total_spend    0.134207
7   purchase_pct_of_total_events    0.041035
8       view_pct_of_total_events    0.019149
12                   view_events    0.012591
2                purchase_events    0.009776
13        sessions_with_purchase    0.009383
10     avg_purchases_per_session    0.008511
14            sessions_with_cart    0.008126
16     pct_sessions_end_purchase    0.007888
9       cart_pct_of_total_events    0.006889
3                 total_sessions    0.005107
15            sessions_with_view    0.002572
24            total_sessions_log    0.002305
17         pct_sessions_end_cart    0.002038
20      sd_purchases_per_session    0.001948
6   max_interactions_per_session    0.001811
19   sd_interactions_per_session    0

#### Interesting - adding the log columns seems to have split the importance of their corresponding non-log inputs, but has increased the adjusted r^2 by approximately 1.5. 

In [9]:
print("** All normal inputs + log inputs, log output **")
inputCols = ["total_spend","total_events","purchase_events", "total_sessions", "avg_session_length", "avg_interactions_per_session", "max_interactions_per_session",
             "purchase_pct_of_total_events", "view_pct_of_total_events", "cart_pct_of_total_events","avg_purchases_per_session", "cart_events", 
             "view_events", "sessions_with_purchase", "sessions_with_cart","sessions_with_view", "pct_sessions_end_purchase", "pct_sessions_end_cart", 'sd_session_length', 
             'sd_interactions_per_session', 'sd_purchases_per_session', 'total_spend_log', 'total_events_log', 'purchase_events_log', 'total_sessions_log']

outputCol = "T_total_spend_log"

pipeline = generatePipeline(inputCols, outputCol)
pipelineModel = pipeline.fit(trainDF)

print(modelInfo(inputCols, pipelineModel))

evaluationMetrics = getEvaluationMetrics(pipelineModel,outputCol,testDF, inputCols)
print(f"RMSE is {evaluationMetrics[0]:.1f}")
print(f"R^2 is {evaluationMetrics[1]:.5f}")
print(f"Adjusted R^2 is {evaluationMetrics[2]:.5f}")

** All normal inputs + log inputs, log output **
                     Column name  Importance
1                   total_events    0.215407
21               total_spend_log    0.210539
22              total_events_log    0.149537
0                    total_spend    0.148796
12                   view_events    0.142919
2                purchase_events    0.022436
23           purchase_events_log    0.020274
7   purchase_pct_of_total_events    0.020154
13        sessions_with_purchase    0.013405
6   max_interactions_per_session    0.011910
15            sessions_with_view    0.010489
3                 total_sessions    0.007956
18             sd_session_length    0.006692
11                   cart_events    0.005487
24            total_sessions_log    0.005112
19   sd_interactions_per_session    0.003684
4             avg_session_length    0.001849
5   avg_interactions_per_session    0.001203
14            sessions_with_cart    0.000837
20      sd_purchases_per_session    0.000748
10    

#### Log output is not the way to go for this model either. 

#### Can we improve adg r-square by tuning predictors? Yes. It turns out that removing sd_session_length improves the R^2

In [10]:
print("** Tuned inputs, normal output **")
inputCols = ["total_spend","total_events","purchase_events", "total_sessions", "avg_session_length", "avg_interactions_per_session", "max_interactions_per_session",
             "purchase_pct_of_total_events", "view_pct_of_total_events", "cart_pct_of_total_events","avg_purchases_per_session", "cart_events", 
             "view_events", "sessions_with_purchase", "sessions_with_cart","sessions_with_view", "pct_sessions_end_purchase", "pct_sessions_end_cart", 
             'sd_interactions_per_session', 'sd_purchases_per_session', 'total_spend_log', 'total_events_log', 'purchase_events_log', 'total_sessions_log']

outputCol = "T_total_spend"

pipeline = generatePipeline(inputCols, outputCol)
pipelineModel = pipeline.fit(trainDF)

print(modelInfo(inputCols, pipelineModel))

evaluationMetrics = getEvaluationMetrics(pipelineModel,outputCol,testDF, inputCols)
print(f"RMSE is {evaluationMetrics[0]:.1f}")
print(f"R^2 is {evaluationMetrics[1]:.5f}")
print(f"Adjusted R^2 is {evaluationMetrics[2]:.5f}")

** Tuned inputs, normal output **
                     Column name  Importance
1                   total_events    0.315854
21              total_events_log    0.206459
20               total_spend_log    0.186679
0                    total_spend    0.138182
7   purchase_pct_of_total_events    0.042733
8       view_pct_of_total_events    0.017886
12                   view_events    0.015309
2                purchase_events    0.014784
16     pct_sessions_end_purchase    0.011693
13        sessions_with_purchase    0.009729
9       cart_pct_of_total_events    0.008567
10     avg_purchases_per_session    0.006986
11                   cart_events    0.005369
23            total_sessions_log    0.004449
3                 total_sessions    0.003977
19      sd_purchases_per_session    0.002283
22           purchase_events_log    0.002095
18   sd_interactions_per_session    0.001740
6   max_interactions_per_session    0.001733
14            sessions_with_cart    0.001337
5   avg_interactions_

#### How can we do with only a few predictors? Much more computationally efficient!

In [11]:
print("** Minimized, normal output **")
inputCols = ["total_spend","total_events","purchase_events", 'total_spend_log', 'total_events_log', 'purchase_events_log','view_events']

outputCol = "T_total_spend"

pipeline = generatePipeline(inputCols, outputCol)
pipelineModel = pipeline.fit(trainDF)

print(modelInfo(inputCols, pipelineModel))

evaluationMetrics = getEvaluationMetrics(pipelineModel,outputCol,testDF, inputCols)
print(f"RMSE is {evaluationMetrics[0]:.1f}")
print(f"R^2 is {evaluationMetrics[1]:.5f}")
print(f"Adjusted R^2 is {evaluationMetrics[2]:.5f}")

** 7 predictors, normal output **
           Column name  Importance
1         total_events    0.366643
4     total_events_log    0.279050
3      total_spend_log    0.154637
0          total_spend    0.121474
6          view_events    0.038532
5  purchase_events_log    0.031764
2      purchase_events    0.007900
+-----------------+------------------+
|    T_total_spend|        prediction|
+-----------------+------------------+
|              0.0| 1625.740230559058|
|              0.0| 5473.763240887481|
|              0.0| 5348.133433025998|
|              0.0| 1074.849415024602|
|              0.0| 1305.673974874158|
|              0.0| 1074.849415024602|
|              0.0| 1074.849415024602|
|              0.0|1494.5617479304678|
|79118.00024795532| 137855.7158498556|
|663.1999969482422| 1305.673974874158|
+-----------------+------------------+
only showing top 10 rows

RMSE is 48641.8
R^2 is 0.51735
Adjusted R^2 is 0.51730


### The maximum achievable adjusted r-square seems to be 0.52293, with 23 predictors. However, it is possible to achieve an adjusted r-square of 0.51730 with just 7 predictors.

#### Test out different combinations of hyperparameters on minimized model

In [21]:
%%time
inputCols = ["total_spend","total_events","purchase_events", 'total_spend_log', 'total_events_log', 'purchase_events_log','view_events']

outputCol = "T_total_spend"

pipeline = generatePipeline(inputCols, outputCol)
pipelineModel = pipeline.fit(trainDF)

vecAssembler = VectorAssembler(inputCols=inputCols, outputCol="features")

rf = RandomForestRegressor(featuresCol="features", labelCol=outputCol)

pipeline = Pipeline(stages=[vecAssembler, rf])

paramGrid = ParamGridBuilder() \
    .addGrid(rf.numTrees, [5, 10, 20]) \
    .addGrid(rf.maxDepth, [2, 4, 10]) \
    .addGrid(rf.featureSubsetStrategy, ['sqrt', 'log2', 'onethird']) \
    .build()

crossval = CrossValidator(estimator=pipeline,
                          estimatorParamMaps=paramGrid,
                          evaluator=RegressionEvaluator().setLabelCol("T_total_spend"),
                          numFolds=4)

# Run cross-validation, and choose the best set of parameters.
#cvModel = crossval.fit(trainDF)

CPU times: user 25 ms, sys: 7.2 ms, total: 32.2 ms
Wall time: 2.13 s


In [22]:
%%time
cvModel = crossval.fit(trainDF)

CPU times: user 5.48 s, sys: 1.64 s, total: 7.12 s
Wall time: 3min 36s


In [23]:
print()
print(cvModel.getEstimatorParamMaps()[np.argmax(cvModel.avgMetrics)])

evaluationMetrics = getEvaluationMetrics(cvModel.bestModel,"T_total_spend",testDF,inputCols)
print(f"RMSE is {evaluationMetrics[0]:.1f}")
print(f"R^2 is {evaluationMetrics[1]:.5f}")
print(f"Adjusted R^2 is {evaluationMetrics[2]:.5f}")


{Param(parent='RandomForestRegressor_480fb57e65a5', name='numTrees', doc='Number of trees to train (>= 1).'): 5, Param(parent='RandomForestRegressor_480fb57e65a5', name='maxDepth', doc='Maximum depth of the tree. (>= 0) E.g., depth 0 means 1 leaf node; depth 1 means 1 internal node + 2 leaf nodes.'): 2, Param(parent='RandomForestRegressor_480fb57e65a5', name='featureSubsetStrategy', doc="The number of features to consider for splits at each tree node. Supported options: 'auto' (choose automatically for task: If numTrees == 1, set to 'all'. If numTrees > 1 (forest), set to 'sqrt' for classification and to 'onethird' for regression), 'all' (use all features), 'onethird' (use 1/3 of the features), 'sqrt' (use sqrt(number of features)), 'log2' (use log2(number of features)), 'n' (when n is in the range (0, 1.0], use n * number of features. When n is in the range (1, number of features), use n features). default = 'auto'"): 'sqrt'}
+-----------------+------------------+
|    T_total_spend|

#### Test out different combinations of hyperparameters on best adj-r-square model

In [16]:
%%time
inputCols = ["total_spend","total_events","purchase_events", "total_sessions", "avg_session_length", "avg_interactions_per_session", "max_interactions_per_session",
             "purchase_pct_of_total_events", "view_pct_of_total_events", "cart_pct_of_total_events","avg_purchases_per_session", "cart_events", 
             "view_events", "sessions_with_purchase", "sessions_with_cart","sessions_with_view", "pct_sessions_end_purchase", "pct_sessions_end_cart", 
             'sd_interactions_per_session', 'sd_purchases_per_session', 'total_spend_log', 'total_events_log', 'purchase_events_log', 'total_sessions_log']


outputCol = "T_total_spend"

pipeline = generatePipeline(inputCols, outputCol)
pipelineModel = pipeline.fit(trainDF)

vecAssembler = VectorAssembler(inputCols=inputCols, outputCol="features")

rf = RandomForestRegressor(featuresCol="features", labelCol=outputCol)

pipeline = Pipeline(stages=[vecAssembler, rf])

#Set parameters to test
paramGrid = ParamGridBuilder() \
    .addGrid(rf.numTrees, [5, 10, 20]) \
    .addGrid(rf.maxDepth, [2, 4, 10]) \
    .addGrid(rf.featureSubsetStrategy, ['sqrt', 'log2', 'onethird']) \
    .build()

crossval = CrossValidator(estimator=pipeline,
                          estimatorParamMaps=paramGrid,
                          evaluator=RegressionEvaluator().setLabelCol("T_total_spend"),
                          numFolds=4)

CPU times: user 22.4 ms, sys: 3.32 ms, total: 25.7 ms
Wall time: 2.39 s


In [17]:
%%time
cvModel = crossval.fit(trainDF)

CPU times: user 5.86 s, sys: 1.85 s, total: 7.71 s
Wall time: 4min 32s


In [18]:
print()
print(cvModel.getEstimatorParamMaps()[np.argmax(cvModel.avgMetrics)])

evaluationMetrics = getEvaluationMetrics(cvModel.bestModel,"T_total_spend",testDF,inputCols)
print(f"RMSE is {evaluationMetrics[0]:.1f}")
print(f"R^2 is {evaluationMetrics[1]:.5f}")
print(f"Adjusted R^2 is {evaluationMetrics[2]:.5f}")


{Param(parent='RandomForestRegressor_4a5f0be87ebf', name='numTrees', doc='Number of trees to train (>= 1).'): 10, Param(parent='RandomForestRegressor_4a5f0be87ebf', name='maxDepth', doc='Maximum depth of the tree. (>= 0) E.g., depth 0 means 1 leaf node; depth 1 means 1 internal node + 2 leaf nodes.'): 2, Param(parent='RandomForestRegressor_4a5f0be87ebf', name='featureSubsetStrategy', doc="The number of features to consider for splits at each tree node. Supported options: 'auto' (choose automatically for task: If numTrees == 1, set to 'all'. If numTrees > 1 (forest), set to 'sqrt' for classification and to 'onethird' for regression), 'all' (use all features), 'onethird' (use 1/3 of the features), 'sqrt' (use sqrt(number of features)), 'log2' (use log2(number of features)), 'n' (when n is in the range (0, 1.0], use n * number of features. When n is in the range (1, number of features), use n features). default = 'auto'"): 'sqrt'}
+-----------------+------------------+
|    T_total_spend

#### The cross-validation does not change our results. It did select different parameters than the default, but the performance remains the same. 