## File 07 - Random Forest

##### Group 12:

##### Hannah Schmuckler, mmc4cv

##### Rob Schwartz, res7cd

In this file, we create an ML pipeline based on the output from File 02 (Feature creation).

The files needed are `/processed_data/train.parquet` and `/processed_data/test.parquet`.

We create a random forest model and fine-tune. 

### Set up Spark session

We can specify more options in the SparkSession creator, but currently the options are at the default settings.

In [1]:
%%time
from pyspark.sql import SparkSession
from pyspark.sql import types as T
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import RandomForestRegressor
from pyspark.mllib.util import MLUtils
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
from pyspark.sql.functions import col
from pyspark.sql.functions import log
from pyspark.ml.stat import Correlation


import pandas as pd
import numpy as np
import copy

spark = SparkSession.builder \
        .appName("project") \
        .getOrCreate()

sc = spark.sparkContext

CPU times: user 513 ms, sys: 415 ms, total: 928 ms
Wall time: 5.8 s


### Read in dataframes for train and test sets

This data should have been previously generated: we can find it in the `processed_data` folder.

In [2]:
%%time
trainDF = spark.read.parquet("./processed_data/train.parquet")
testDF = spark.read.parquet("./processed_data/test.parquet")

CPU times: user 1.87 ms, sys: 2.88 ms, total: 4.75 ms
Wall time: 3.14 s


### Set up Spark ML pipeline training for random forest

We create the function `generatePipeline(inputCols, outputCol)`, Then, we train the pipeline using this function. 

In [3]:
%%time

def generatePipeline(inputCols, outputCol):
    
    # Select input columns for random forest regression
    vecAssembler = VectorAssembler(inputCols=inputCols, outputCol="features")

    # Select output column for random forest regression
    rf = RandomForestRegressor(featuresCol="features", labelCol=outputCol, seed = 42)#, numTrees=5, maxDepth=5)
    
    pipeline = Pipeline(stages=[vecAssembler, rf])
    return pipeline



CPU times: user 2 µs, sys: 2 µs, total: 4 µs
Wall time: 6.91 µs


In [4]:
# Calculate adjusted r2 (https://towardsdatascience.com/machine-learning-linear-regression-using-pyspark-9d5d5c772b42)
def adj_r2(r2, inputCols, testDF):
    n = testDF.count()
    p = len(inputCols)
    adjusted_r2 = 1-(((1-r2)*(n-1))/(n-p-1))
    return adjusted_r2

In [5]:
def getEvaluationMetrics(pipelineMode,outputCol,testDF,inputCols):
    predDF = pipelineModel.transform(testDF)
    predDF.select(outputCol, "prediction").show(10)
    
    regressionEvaluator = RegressionEvaluator(
    predictionCol="prediction",
    labelCol=outputCol,
    metricName="rmse")
    rmse = regressionEvaluator.evaluate(predDF)

    regressionEvaluator = RegressionEvaluator(
    predictionCol="prediction",
    labelCol=outputCol,
    metricName="r2")
    r2 = regressionEvaluator.evaluate(predDF)
    
    # Manually calculate Adjusted r2
    adjusted_r2 = adj_r2(r2, inputCols, testDF)
    
    return rmse, r2, adjusted_r2



In [6]:
def modelInfo(inputCols, pipelineModel):
    modelCols = pipelineModel.stages[-2].getInputCols()
    
    feature_importance = pipelineModel.stages[-1].featureImportances
    
    return pd.DataFrame(list(zip(modelCols, feature_importance)), columns = ['Column name', 'Importance']).sort_values(by="Importance", ascending = False)

In [19]:
print("** All inputs + normal output **")
inputCols = ["total_spend", "total_events", "total_sessions", "avg_session_length", "sd_session_length", "avg_interactions_per_session", 
             "sd_interactions_per_session", "max_interactions_per_session", "purchase_pct_of_total_events", "view_pct_of_total_events",
             "cart_pct_of_total_events", "avg_purchases_per_session", "sd_purchases_per_session", "cart_events", "purchase_events", 
             "sessions_with_purchase", "sessions_with_cart", "sessions_with_view", "pct_sessions_end_purchase", 
             "pct_sessions_end_cart", "total_spend_log", "total_events_log", "purchase_events_log", "total_sessions_log",
             "avg_session_length_log"]
outputCol = "T_total_spend" #Performance is about .106 with log output. 

pipeline = generatePipeline(inputCols, outputCol)
pipelineModel = pipeline.fit(trainDF)

print(modelInfo(inputCols, pipelineModel))

evaluationMetrics = getEvaluationMetrics(pipelineModel,outputCol,testDF, inputCols)
print(f"RMSE is {evaluationMetrics[0]:.1f}")
print(f"R^2 is {evaluationMetrics[1]:.5f}")
print(f"Adjusted R^2 is {evaluationMetrics[2]:.5f}")

** All inputs + normal output **
                     Column name  Importance
20               total_spend_log    0.171475
15        sessions_with_purchase    0.162734
0                    total_spend    0.134540
16            sessions_with_cart    0.088704
10      cart_pct_of_total_events    0.052779
22           purchase_events_log    0.044009
3             avg_session_length    0.037823
14               purchase_events    0.034736
6    sd_interactions_per_session    0.030049
13                   cart_events    0.029198
9       view_pct_of_total_events    0.028643
24        avg_session_length_log    0.028189
19         pct_sessions_end_cart    0.026727
4              sd_session_length    0.024929
8   purchase_pct_of_total_events    0.022557
5   avg_interactions_per_session    0.020685
7   max_interactions_per_session    0.017503
12      sd_purchases_per_session    0.014780
2                 total_sessions    0.012836
11     avg_purchases_per_session    0.007375
23            total_se

## Champion features

In [20]:
print("** Reduced inputs, normal output **")
inputCols = ["total_spend", "total_sessions", "avg_session_length", "sd_session_length", "avg_interactions_per_session", 
             "sd_interactions_per_session", "max_interactions_per_session", "purchase_pct_of_total_events", "view_pct_of_total_events",
             "cart_pct_of_total_events", "avg_purchases_per_session", "sd_purchases_per_session", "cart_events", "purchase_events", 
             "sessions_with_purchase", "sessions_with_cart", "pct_sessions_end_purchase", 
             "pct_sessions_end_cart", "total_spend_log", "purchase_events_log", "total_sessions_log",
             "avg_session_length_log"]

outputCol = "T_total_spend"

pipeline = generatePipeline(inputCols, outputCol)
pipelineModel = pipeline.fit(trainDF)

print(modelInfo(inputCols, pipelineModel))

evaluationMetrics = getEvaluationMetrics(pipelineModel,outputCol,testDF, inputCols)
print(f"RMSE is {evaluationMetrics[0]:.1f}")
print(f"R^2 is {evaluationMetrics[1]:.5f}")
print(f"Adjusted R^2 is {evaluationMetrics[2]:.5f}")

** Reduced inputs, normal output **
                     Column name  Importance
18               total_spend_log    0.144169
0                    total_spend    0.132870
14        sessions_with_purchase    0.126674
15            sessions_with_cart    0.103955
19           purchase_events_log    0.088216
13               purchase_events    0.053811
2             avg_session_length    0.046355
7   purchase_pct_of_total_events    0.040926
9       cart_pct_of_total_events    0.040770
3              sd_session_length    0.035845
21        avg_session_length_log    0.034031
8       view_pct_of_total_events    0.027884
4   avg_interactions_per_session    0.021731
17         pct_sessions_end_cart    0.021007
11      sd_purchases_per_session    0.014427
6   max_interactions_per_session    0.012821
5    sd_interactions_per_session    0.012192
12                   cart_events    0.011661
10     avg_purchases_per_session    0.009966
20            total_sessions_log    0.008608
1                 t

#### Test out different combinations of hyperparameters on best features model

In [22]:
%%time
inputCols = ["total_spend", "total_sessions", "avg_session_length", "sd_session_length", "avg_interactions_per_session", 
             "sd_interactions_per_session", "max_interactions_per_session", "purchase_pct_of_total_events", "view_pct_of_total_events",
             "cart_pct_of_total_events", "avg_purchases_per_session", "sd_purchases_per_session", "cart_events", "purchase_events", 
             "sessions_with_purchase", "sessions_with_cart", "pct_sessions_end_purchase", 
             "pct_sessions_end_cart", "total_spend_log", "purchase_events_log", "total_sessions_log",
             "avg_session_length_log"]

outputCol = "T_total_spend"

pipeline = generatePipeline(inputCols, outputCol)
pipelineModel = pipeline.fit(trainDF)

vecAssembler = VectorAssembler(inputCols=inputCols, outputCol="features")

rf = RandomForestRegressor(featuresCol="features", labelCol=outputCol)

pipeline = Pipeline(stages=[vecAssembler, rf])

paramGrid = ParamGridBuilder() \
    .addGrid(rf.numTrees, [5, 10, 20]) \
    .addGrid(rf.maxDepth, [2, 4, 10]) \
    .addGrid(rf.featureSubsetStrategy, ['sqrt', 'log2', 'onethird']) \
    .build()

crossval = CrossValidator(estimator=pipeline,
                          estimatorParamMaps=paramGrid,
                          evaluator=RegressionEvaluator().setLabelCol("T_total_spend"),
                          numFolds=4)

# Run cross-validation, and choose the best set of parameters.
#cvModel = crossval.fit(trainDF)

CPU times: user 23.2 ms, sys: 4.25 ms, total: 27.5 ms
Wall time: 2.5 s


In [23]:
%%time
cvModel = crossval.setParallelism(8).fit(trainDF)

CPU times: user 6.91 s, sys: 1.81 s, total: 8.72 s
Wall time: 5min 14s


In [24]:
print()
print(cvModel.getEstimatorParamMaps()[np.argmax(cvModel.avgMetrics)])

evaluationMetrics = getEvaluationMetrics(cvModel.bestModel,"T_total_spend",testDF,inputCols)
print(f"RMSE is {evaluationMetrics[0]:.1f}")
print(f"R^2 is {evaluationMetrics[1]:.5f}")
print(f"Adjusted R^2 is {evaluationMetrics[2]:.5f}")


{Param(parent='RandomForestRegressor_94d8fbe9cceb', name='numTrees', doc='Number of trees to train (>= 1).'): 5, Param(parent='RandomForestRegressor_94d8fbe9cceb', name='maxDepth', doc='Maximum depth of the tree. (>= 0) E.g., depth 0 means 1 leaf node; depth 1 means 1 internal node + 2 leaf nodes.'): 10, Param(parent='RandomForestRegressor_94d8fbe9cceb', name='featureSubsetStrategy', doc="The number of features to consider for splits at each tree node. Supported options: 'auto' (choose automatically for task: If numTrees == 1, set to 'all'. If numTrees > 1 (forest), set to 'sqrt' for classification and to 'onethird' for regression), 'all' (use all features), 'onethird' (use 1/3 of the features), 'sqrt' (use sqrt(number of features)), 'log2' (use log2(number of features)), 'n' (when n is in the range (0, 1.0], use n * number of features. When n is in the range (1, number of features), use n features). default = 'auto'"): 'onethird'}
+------------------+------------------+
|     T_total