## File 07 - Random Forest

##### Group 12:

##### Hannah Schmuckler, mmc4cv

##### Rob Schwartz, res7cd

In this file, we create an ML pipeline based on the output from File 02 (Feature creation).

The files needed are `/processed_data/train.parquet` and `/processed_data/test.parquet`.

We create a random forest model and fine-tune. 

### Set up Spark session

We can specify more options in the SparkSession creator, but currently the options are at the default settings.

In [1]:
%%time
from pyspark.sql import SparkSession
from pyspark.sql import types as T
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import RandomForestRegressor
from pyspark.mllib.util import MLUtils
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
from pyspark.sql.functions import col
from pyspark.sql.functions import log
from pyspark.ml.stat import Correlation


import pandas as pd
import numpy as np
import copy

spark = SparkSession.builder \
        .appName("project") \
        .getOrCreate()

sc = spark.sparkContext

CPU times: user 470 ms, sys: 365 ms, total: 836 ms
Wall time: 8.47 s


### Read in dataframes for train and test sets

This data should have been previously generated: we can find it in the `processed_data` folder.

In [2]:
%%time
trainDF = spark.read.parquet("./processed_data/train.parquet")
testDF = spark.read.parquet("./processed_data/test.parquet")

CPU times: user 1.95 ms, sys: 2.04 ms, total: 3.99 ms
Wall time: 4.13 s


### Set up Spark ML pipeline training for random forest

We create the function `generatePipeline(inputCols, outputCol)`, Then, we train the pipeline using this function. 

In [3]:
%%time

def generatePipeline(inputCols, outputCol):
    
    # Select input columns for random forest regression
    vecAssembler = VectorAssembler(inputCols=inputCols, outputCol="features")

    # Select output column for random forest regression
    rf = RandomForestRegressor(featuresCol="features", labelCol=outputCol, seed = 42)#, numTrees=5, maxDepth=5)
    
    pipeline = Pipeline(stages=[vecAssembler, rf])
    return pipeline



CPU times: user 2 µs, sys: 1 µs, total: 3 µs
Wall time: 5.01 µs


In [4]:
# Calculate adjusted r2 (https://towardsdatascience.com/machine-learning-linear-regression-using-pyspark-9d5d5c772b42)
def adj_r2(r2, inputCols, testDF):
    n = testDF.count()
    p = len(inputCols)
    adjusted_r2 = 1-(((1-r2)*(n-1))/(n-p-1))
    return adjusted_r2

In [5]:
def getEvaluationMetrics(pipelineMode,outputCol,testDF,inputCols):
    predDF = pipelineModel.transform(testDF)
    predDF.select(outputCol, "prediction").show(10)
    
    regressionEvaluator = RegressionEvaluator(
    predictionCol="prediction",
    labelCol=outputCol,
    metricName="rmse")
    rmse = regressionEvaluator.evaluate(predDF)

    regressionEvaluator = RegressionEvaluator(
    predictionCol="prediction",
    labelCol=outputCol,
    metricName="r2")
    r2 = regressionEvaluator.evaluate(predDF)
    
    # Manually calculate Adjusted r2
    adjusted_r2 = adj_r2(r2, inputCols, testDF)
    
    return rmse, r2, adjusted_r2



In [6]:
def modelInfo(inputCols, pipelineModel):
    modelCols = pipelineModel.stages[-2].getInputCols()
    
    feature_importance = pipelineModel.stages[-1].featureImportances
    
    return pd.DataFrame(list(zip(modelCols, feature_importance)), columns = ['Column name', 'Importance']).sort_values(by="Importance", ascending = False)

In [7]:
print("** All inputs + normal output **")
inputCols = ["total_spend", "total_events", "total_sessions", "avg_session_length", "sd_session_length", "avg_interactions_per_session", 
             "sd_interactions_per_session", "max_interactions_per_session", "purchase_pct_of_total_events", "view_pct_of_total_events",
             "cart_pct_of_total_events", "avg_purchases_per_session", "sd_purchases_per_session", "cart_events", "purchase_events", 
             "sessions_with_purchase", "sessions_with_cart", "sessions_with_view", "pct_sessions_end_purchase", 
             "pct_sessions_end_cart", "total_spend_log", "total_events_log", "purchase_events_log", "total_sessions_log",
             "avg_session_length_log"]
outputCol = "T_total_spend" #Performance is about .106 with log output. 

pipeline = generatePipeline(inputCols, outputCol)
pipelineModel = pipeline.fit(trainDF)

print(modelInfo(inputCols, pipelineModel))

evaluationMetrics = getEvaluationMetrics(pipelineModel,outputCol,testDF, inputCols)
print(f"RMSE is {evaluationMetrics[0]:.1f}")
print(f"R^2 is {evaluationMetrics[1]:.5f}")
print(f"Adjusted R^2 is {evaluationMetrics[2]:.5f}")

** All inputs + normal output **
                     Column name  Importance
20               total_spend_log    0.171629
15        sessions_with_purchase    0.162941
0                    total_spend    0.138880
16            sessions_with_cart    0.091195
10      cart_pct_of_total_events    0.048967
22           purchase_events_log    0.047236
3             avg_session_length    0.046953
14               purchase_events    0.038467
6    sd_interactions_per_session    0.032558
24        avg_session_length_log    0.031682
19         pct_sessions_end_cart    0.028497
5   avg_interactions_per_session    0.028233
9       view_pct_of_total_events    0.023288
4              sd_session_length    0.022964
8   purchase_pct_of_total_events    0.021720
7   max_interactions_per_session    0.018826
2                 total_sessions    0.008926
12      sd_purchases_per_session    0.008833
18     pct_sessions_end_purchase    0.007736
11     avg_purchases_per_session    0.006570
23            total_se

## Champion features

In [8]:
print("** Reduced inputs, normal output **")
inputCols = ["total_spend", "total_sessions", "avg_session_length", "sd_session_length", "avg_interactions_per_session", 
             "sd_interactions_per_session", "max_interactions_per_session", "purchase_pct_of_total_events", "view_pct_of_total_events",
             "cart_pct_of_total_events", "avg_purchases_per_session", "sd_purchases_per_session", "cart_events", "purchase_events", 
             "sessions_with_purchase", "sessions_with_cart", "pct_sessions_end_purchase", 
             "pct_sessions_end_cart", "total_spend_log", "purchase_events_log", "total_sessions_log",
             "avg_session_length_log"]

outputCol = "T_total_spend"

pipeline = generatePipeline(inputCols, outputCol)
pipelineModel = pipeline.fit(trainDF)

print(modelInfo(inputCols, pipelineModel))

evaluationMetrics = getEvaluationMetrics(pipelineModel,outputCol,testDF, inputCols)
print(f"RMSE is {evaluationMetrics[0]:.1f}")
print(f"R^2 is {evaluationMetrics[1]:.5f}")
print(f"Adjusted R^2 is {evaluationMetrics[2]:.5f}")

** Reduced inputs, normal output **
                     Column name  Importance
18               total_spend_log    0.140809
0                    total_spend    0.136926
14        sessions_with_purchase    0.120653
15            sessions_with_cart    0.105322
19           purchase_events_log    0.087959
13               purchase_events    0.057052
2             avg_session_length    0.044503
9       cart_pct_of_total_events    0.043786
7   purchase_pct_of_total_events    0.039523
3              sd_session_length    0.034424
21        avg_session_length_log    0.033345
8       view_pct_of_total_events    0.028719
4   avg_interactions_per_session    0.024396
6   max_interactions_per_session    0.016200
11      sd_purchases_per_session    0.015675
5    sd_interactions_per_session    0.013474
17         pct_sessions_end_cart    0.011887
12                   cart_events    0.011849
10     avg_purchases_per_session    0.010792
20            total_sessions_log    0.008766
16     pct_sessions

#### Test out different combinations of hyperparameters on best features model. We did this in small batches because the random forests take so much memory to run. 

In [12]:
%%time
inputCols = ["total_spend", "total_sessions", "avg_session_length", "sd_session_length", "avg_interactions_per_session", 
             "sd_interactions_per_session", "max_interactions_per_session", "purchase_pct_of_total_events", "view_pct_of_total_events",
             "cart_pct_of_total_events", "avg_purchases_per_session", "sd_purchases_per_session", "cart_events", "purchase_events", 
             "sessions_with_purchase", "sessions_with_cart", "pct_sessions_end_purchase", 
             "pct_sessions_end_cart", "total_spend_log", "purchase_events_log", "total_sessions_log",
             "avg_session_length_log"]

outputCol = "T_total_spend"

pipeline = generatePipeline(inputCols, outputCol)
pipelineModel = pipeline.fit(trainDF)

vecAssembler = VectorAssembler(inputCols=inputCols, outputCol="features")

rf = RandomForestRegressor(featuresCol="features", labelCol=outputCol)

pipeline = Pipeline(stages=[vecAssembler, rf])

paramGrid = ParamGridBuilder() \
    .addGrid(rf.numTrees, [5, 10, 20]) \
    .addGrid(rf.maxDepth, [4, 5, 10]) \
    .addGrid(rf.featureSubsetStrategy, ['sqrt', 'log2', 'onethird']) \
    .build()

crossval = CrossValidator(estimator=pipeline,
                          estimatorParamMaps=paramGrid,
                          evaluator=RegressionEvaluator().setLabelCol("T_total_spend"),
                          numFolds=4)



CPU times: user 9.63 ms, sys: 4.72 ms, total: 14.4 ms
Wall time: 2.75 s


In [13]:
%%time
cvModel = crossval.setParallelism(8).fit(trainDF)

CPU times: user 3.76 s, sys: 771 ms, total: 4.53 s
Wall time: 4min 3s


In [14]:
evaluationMetrics = getEvaluationMetrics(cvModel.bestModel,"T_total_spend",testDF,inputCols)
print(f"RMSE is {evaluationMetrics[0]:.1f}")
print(f"R^2 is {evaluationMetrics[1]:.5f}")
print(f"Adjusted R^2 is {evaluationMetrics[2]:.5f}")

print()
print(cvModel.getEstimatorParamMaps()[np.argmax(cvModel.avgMetrics)])
print()
print(cvModel.avgMetrics)

+------------------+------------------+
|     T_total_spend|        prediction|
+------------------+------------------+
|               0.0|   128.96718512969|
|               0.0|130.50915088811848|
|               0.0|126.93085988451864|
|               0.0|130.50915088811848|
|               0.0| 164.1481583696438|
|               0.0|209.09686460782777|
| 312.4200134277344|165.69012412807226|
|  6009.78010559082| 529.9442741093114|
|1627.1900024414062|203.73672310971511|
|1393.8800354003906|155.98333182100924|
+------------------+------------------+
only showing top 10 rows

RMSE is 2260.1
R^2 is 0.15874
Adjusted R^2 is 0.15848

{Param(parent='RandomForestRegressor_fc1ab56f1913', name='numTrees', doc='Number of trees to train (>= 1).'): 5, Param(parent='RandomForestRegressor_fc1ab56f1913', name='maxDepth', doc='Maximum depth of the tree. (>= 0) E.g., depth 0 means 1 leaf node; depth 1 means 1 internal node + 2 leaf nodes.'): 10, Param(parent='RandomForestRegressor_fc1ab56f1913', na