## File 05 - Linear Regression

In this file, we create a small ML pipeline based on the output from File 02 (Feature creation).

We create a linear regression model, tune it, then compare it to a linear regression model created with downsampled data to see how performance compares. 

### Set up Spark session

We can specify more options in the SparkSession creator, but currently the options are at the default settings.
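
As a rough sketch of what non-default options could look like: each option is passed with `SparkSession.builder.config(key, value)`. The option values below are illustrative assumptions, not settings used in this project, and `build_session` is a hypothetical helper.

```python
# Illustrative option values only; tune them for your own cluster.
SPARK_OPTIONS = {
    "spark.sql.shuffle.partitions": "64",
    "spark.executor.memory": "4g",
}


def build_session(app_name, options):
    """Hypothetical helper: create a SparkSession with extra config options."""
    # Imported lazily so this sketch can be loaded without Spark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name)
    for key, value in options.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()


# Usage (requires a working Spark installation):
# spark = build_session("project", SPARK_OPTIONS)
```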

In [1]:
%%time
from pyspark.sql import SparkSession
from pyspark.sql import types as T
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression, LinearRegressionModel, LinearRegressionSummary
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
from pyspark.sql.functions import col
from pyspark.sql.functions import log
from pyspark.ml.stat import Correlation

import pandas as pd
import numpy as np
import copy

spark = SparkSession.builder \
        .appName("project") \
        .getOrCreate()

sc = spark.sparkContext

CPU times: user 530 ms, sys: 412 ms, total: 942 ms
Wall time: 7.07 s


### Read in dataframes for train and test sets

This data should have been previously generated: we can find it in the `processed_data` folder.

In [2]:
%%time
trainDF = spark.read.parquet("./processed_data/train.parquet")
testDF = spark.read.parquet("./processed_data/test.parquet")
# trainDF.show(5)

CPU times: user 1.81 ms, sys: 2.86 ms, total: 4.67 ms
Wall time: 2.89 s


### Set up Spark ML pipeline training for linear regression

Here we decide which input columns should be used in order to create our training pipeline. To implement this step, we create the function `generatePipeline(inputCols, outputCol)`. Then, we train the pipeline using this function.

In [3]:
%%time

def generatePipeline(inputCols, outputCol):
    # Select input columns for linear regression
    vecAssembler = VectorAssembler(inputCols=inputCols, outputCol="features")

    # Select output column for linear regression
    lr = LinearRegression(featuresCol="features", labelCol=outputCol)

    # Chaining both stages in a Pipeline replaces calling
    # vecAssembler.transform(trainDF) by hand before fitting.
 
    pipeline = Pipeline(stages=[vecAssembler, lr])
    return pipeline
    
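# inputCols must be defined before this cell runs (the feature lists appear in the cells further down)
pipeline = generatePipeline(inputCols, "T_total_spend")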
pipelineModel = pipeline.fit(trainDF)



CPU times: user 15.3 ms, sys: 4.19 ms, total: 19.5 ms
Wall time: 4.52 s


### View the model information

Print out the model coefficients and view the p-values, RMSE, R^2, and adjusted R^2. We define the functions `modelInfo(inputCols, pipelineModel)` and `getEvaluationMetrics(pipelineModel, outputCol, testDF, inputCols)` to report this information.

In [4]:
def modelInfo(inputCols, pipelineModel):
    # Create a zipped list containing the coefficients and the data
    modelCols = copy.deepcopy(inputCols)
    modelCoeffs = list(pipelineModel.stages[-1].coefficients)
    modelCoeffs.insert(0,pipelineModel.stages[-1].intercept)
    modelCols.insert(0,"intercept")
    modelZippedList = list(map(list, zip(modelCols, modelCoeffs)))
    
    # Add in the p-values
    pvals = pipelineModel.stages[-1].summary.pValues
    
    # Create the pandas DataFrame
    modelDF = pd.DataFrame(modelZippedList, columns = ['Column name', 'Coefficient'])
    modelDF['pValues'] = pvals
    return modelDF

In [5]:
# Calculate adjusted r2 (https://towardsdatascience.com/machine-learning-linear-regression-using-pyspark-9d5d5c772b42)
# The built-in r2 metric does not adjust for the number of predictors, so we do it by hand.
# The k argument lets us account for the number of PCA components when we do PCA later.
def adj_r2(r2, inputCols, testDF, k = 0):
    n = testDF.count()
    if k == 0:
        p = len(inputCols)
    else: 
        p = len(inputCols) + k - 1
    
    adjusted_r2 = 1-(((1-r2)*(n-1))/(n-p-1))
    return adjusted_r2
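
To make the adjustment concrete, here is a minimal standalone sketch of the same formula in plain Python. The inputs (an R^2 of 0.62 on 1,000 rows) are made up for illustration, and `adjusted_r2` is a hypothetical name that takes `n` and `p` directly instead of a DataFrame.

```python
def adjusted_r2(r2, n, p):
    """Adjusted R^2 for n observations and p predictors."""
    return 1 - ((1 - r2) * (n - 1)) / (n - p - 1)


# Made-up inputs: the same raw R^2 is penalized more as predictors are added.
few = adjusted_r2(0.62, n=1000, p=4)
many = adjusted_r2(0.62, n=1000, p=21)
print(f"4 predictors:  {few:.5f}")
print(f"21 predictors: {many:.5f}")
```

When `n` is much larger than `p`, the penalty becomes very small, so the adjusted value stays close to the raw R^2.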

In [33]:
def getEvaluationMetrics(pipelineModel,outputCol,testDF,inputCols):
    predDF = pipelineModel.transform(testDF)
    print()
    print('type preddf' + str(type(predDF)))
    print()
    predDF.select(outputCol, "prediction").show(10)
    
    print(predDF)
    
    # RMSE on the test set
    regressionEvaluator = RegressionEvaluator(
        predictionCol="prediction",
        labelCol=outputCol,
        metricName="rmse")
    rmse = regressionEvaluator.evaluate(predDF)

    # R^2 on the test set
    regressionEvaluator = RegressionEvaluator(
        predictionCol="prediction",
        labelCol=outputCol,
        metricName="r2")
    r2 = regressionEvaluator.evaluate(predDF)
      
    # Manually calculate Adjusted r2
    adjusted_r2 = adj_r2(r2, inputCols, testDF)
    
    return rmse, r2, adjusted_r2
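
For readers who want to see what the two metrics measure, here is a small plain-Python sketch using the standard definitions; we assume `RegressionEvaluator`'s `rmse` and `r2` metrics follow them. The labels and predictions are toy values, and `rmse_and_r2` is a hypothetical helper, not part of the pipeline.

```python
import math


def rmse_and_r2(labels, predictions):
    """Compute RMSE and R^2 from paired label and prediction lists."""
    n = len(labels)
    ss_res = sum((y - p) ** 2 for y, p in zip(labels, predictions))
    mean_label = sum(labels) / n
    ss_tot = sum((y - mean_label) ** 2 for y in labels)
    return math.sqrt(ss_res / n), 1 - ss_res / ss_tot


# Toy values, not taken from the notebook's data.
rmse, r2 = rmse_and_r2([0.0, 0.0, 10.0, 20.0], [1.0, 2.0, 8.0, 18.0])
print(f"RMSE is {rmse:.3f}")
print(f"R^2 is {r2:.5f}")
```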


### Test out different combinations of features and outputs

Here we test out models in order to find the best combination of features.

In [34]:
print("** All original inputs, original output **")
inputCols = ["total_spend","total_events","purchase_events", "total_sessions", "avg_session_length", "avg_interactions_per_session", "max_interactions_per_session",
             "purchase_pct_of_total_events", "view_pct_of_total_events", "cart_pct_of_total_events","avg_purchases_per_session", "cart_events", "view_events", 
             "sessions_with_purchase", "sessions_with_cart","sessions_with_view", "pct_sessions_end_purchase", "pct_sessions_end_cart", 'sd_session_length', 
             'sd_interactions_per_session', 'sd_purchases_per_session']


outputCol = "T_total_spend"

pipeline = generatePipeline(inputCols, outputCol)
pipelineModel = pipeline.fit(trainDF)

print("Model coefficients")
print(modelInfo(inputCols, pipelineModel))

evaluationMetrics = getEvaluationMetrics(pipelineModel,outputCol,testDF, inputCols)
print(f"RMSE is {evaluationMetrics[0]:.1f}")
print(f"R^2 is {evaluationMetrics[1]:.5f}")
print(f"Adjusted R^2 is {evaluationMetrics[2]:.5f}")

** All original inputs, original output **
Model coefficients
                     Column name  Coefficient       pValues
0                      intercept -3205.991128  0.000000e+00
1                    total_spend     4.102997  0.000000e+00
2                   total_events   225.554473  0.000000e+00
3                purchase_events -2619.098410  2.442491e-15
4                 total_sessions  1212.473221  3.236858e-02
5             avg_session_length    -0.093983  1.403162e-06
6   avg_interactions_per_session  -165.476869  1.663153e-07
7   max_interactions_per_session    97.135508  9.999787e-01
8   purchase_pct_of_total_events -8248.562191  9.999853e-01
9       view_pct_of_total_events  5662.164627  1.000000e+00
10      cart_pct_of_total_events -1563.829248  5.654318e-01
11     avg_purchases_per_session  -240.537590  4.574254e-01
12                   cart_events   -22.617590  0.000000e+00
13                   view_events  -200.201096  2.001908e-05
14        sessions_with_purchase  -763

### There's likely collinearity between some of these features. Let's take a look at a correlation matrix:


In [8]:
#Create function to view correlation matrices
# https://stackoverflow.com/questions/52214404/how-to-get-the-correlation-matrix-of-a-pyspark-data-frame
def generateCorrMatrix(inputCols, dataframe):
    # Select input columns for Correlation Matrix & transform
    vector_col = 'corr_features'
    corrAssembler = VectorAssembler(inputCols=inputCols, outputCol=vector_col)
    df_vector = corrAssembler.transform(dataframe).select(vector_col)
    
    #get correlation matrix
    matrix = Correlation.corr(df_vector, vector_col)
    result = matrix.collect()[0]["pearson({})".format(vector_col)].values
    readable = pd.DataFrame(result.reshape(-1, len(inputCols)), columns=inputCols, index=inputCols)
    
    return readable

In [9]:
generateCorrMatrix(inputCols, trainDF)

Unnamed: 0,total_spend,total_events,purchase_events,total_sessions,avg_session_length,avg_interactions_per_session,max_interactions_per_session,purchase_pct_of_total_events,view_pct_of_total_events,cart_pct_of_total_events,...,cart_events,view_events,sessions_with_purchase,sessions_with_cart,sessions_with_view,pct_sessions_end_purchase,pct_sessions_end_cart,sd_session_length,sd_interactions_per_session,sd_purchases_per_session
total_spend,1.0,0.564621,0.481098,0.151065,0.043725,-0.004885,0.042578,0.083175,-0.087756,0.073183,...,0.338576,0.066552,0.488184,0.403708,0.148906,0.069628,0.009892,0.050417,0.019942,0.175506
total_events,0.564621,1.0,0.367146,0.467376,0.109398,0.136702,0.380164,-0.182647,0.186159,-0.150776,...,0.384654,0.514898,0.387877,0.440534,0.465337,-0.171989,0.04845,0.134815,0.273639,0.089593
purchase_events,0.481098,0.367146,1.0,0.33877,0.046947,0.059073,0.173412,0.09662,-0.100247,0.082442,...,0.712189,0.227308,0.932698,0.776645,0.339643,0.087018,0.013407,0.060594,0.112997,0.359093
total_sessions,0.151065,0.467376,0.33877,1.0,0.174913,-0.039855,0.394534,-0.407273,0.412775,-0.332673,...,0.444593,0.75085,0.383884,0.577977,0.996996,-0.491223,0.09044,0.229781,0.242894,-0.000881
avg_session_length,0.043725,0.109398,0.046947,0.174913,1.0,0.025044,0.092508,-0.090967,0.063875,-0.031355,...,0.130374,0.134185,0.053639,0.147385,0.156466,-0.090676,0.053622,0.948501,0.05924,-0.009554
avg_interactions_per_session,-0.004885,0.136702,0.059073,-0.039855,0.025044,1.0,0.637865,-0.277896,0.2286,-0.146542,...,0.151535,0.304647,-0.000782,-0.009603,-0.03602,0.121906,-0.014604,-0.009738,0.550659,0.041945
max_interactions_per_session,0.042578,0.380164,0.173412,0.394534,0.092508,0.637865,1.0,-0.464993,0.456032,-0.356704,...,0.322928,0.737453,0.140828,0.229791,0.397881,-0.315549,0.055,0.092205,0.876179,0.124785
purchase_pct_of_total_events,0.083175,-0.182647,0.09662,-0.407273,-0.090967,-0.277896,-0.464993,1.0,-0.832517,0.542351,...,-0.146767,-0.420846,0.074052,-0.119763,-0.408404,0.767058,-0.253101,-0.090528,-0.478889,-0.130114
view_pct_of_total_events,-0.087756,0.186159,-0.100247,0.412775,0.063875,0.2286,0.456032,-0.832517,1.0,-0.91696,...,-0.083772,0.461413,-0.08266,0.015405,0.419957,-0.721475,-0.019379,0.067755,0.460357,0.082648
cart_pct_of_total_events,0.073183,-0.150776,0.082442,-0.332673,-0.031355,-0.146542,-0.356704,0.542351,-0.91696,1.0,...,0.232741,-0.396659,0.072024,0.062888,-0.342749,0.541713,0.211667,-0.037556,-0.353255,-0.031631


#### Many of these features are highly correlated! We removed features iteratively, checking p-values at each step to decide which ones to keep. The adjusted R^2 of the model below is nearly as good as that of the model above, with only 4 features!
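
One way to spot the strongest relationships without scanning the whole matrix by eye is to list the pairs above a cutoff. This is a minimal sketch: `high_corr_pairs` is a hypothetical helper, the 0.9 threshold is an assumption, and the 4x4 matrix is an excerpt of correlation values reported in this notebook for `trainDF`.

```python
def high_corr_pairs(names, matrix, threshold=0.9):
    """Return (feature_a, feature_b, r) for pairs with |r| >= threshold."""
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):  # upper triangle only
            r = matrix[i][j]
            if abs(r) >= threshold:
                pairs.append((names[i], names[j], r))
    # Strongest correlations first
    return sorted(pairs, key=lambda pair: abs(pair[2]), reverse=True)


# Excerpt of the correlation values reported for trainDF.
names = ["total_sessions", "sessions_with_view",
         "purchase_events", "sessions_with_purchase"]
matrix = [
    [1.0,      0.996996, 0.33877,  0.383884],
    [0.996996, 1.0,      0.339643, 0.384296],
    [0.33877,  0.339643, 1.0,      0.932698],
    [0.383884, 0.384296, 0.932698, 1.0],
]
for a, b, r in high_corr_pairs(names, matrix):
    print(f"{a} ~ {b}: {r:.3f}")
```

The same function could be applied to the values of the pandas DataFrame returned by `generateCorrMatrix`, for example via `readable.values.tolist()`.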

In [10]:
# Redo, iteratively removing and adding features to find the best combination. In the real world,
# sacrificing a fraction of a percent of performance to dramatically reduce the number of features can be very useful.
# Intermediate stages are not shown because this was all done in the same cell.
print("** Best original inputs, original output **")
inputCols = ["total_spend","total_events","purchase_events", "view_events"]

             
# Note: including "purchase_events" together with every other feature breaks this for some reason; p-values are not generated.
outputCol = "T_total_spend"

pipeline = generatePipeline(inputCols, outputCol)
pipelineModel = pipeline.fit(trainDF)

print("Model coefficients")
print(modelInfo(inputCols, pipelineModel))

evaluationMetrics = getEvaluationMetrics(pipelineModel,outputCol,testDF, inputCols)
print(f"RMSE is {evaluationMetrics[0]:.1f}")
print(f"R^2 is {evaluationMetrics[1]:.5f}")
print(f"Adjusted R^2 is {evaluationMetrics[2]:.5f}")

** Best original inputs, original output **
Model coefficients
       Column name  Coefficient  pValues
0        intercept  1313.218380      0.0
1      total_spend     4.088648      0.0
2     total_events   227.102068      0.0
3  purchase_events -2556.784673      0.0
4      view_events  -166.467129      0.0
+-----------------+------------------+
|    T_total_spend|        prediction|
+-----------------+------------------+
|              0.0| 3968.343537712809|
|              0.0|  3155.42960828224|
|              0.0|1088.1524167513835|
|              0.0| 813.7975938765567|
|              0.0| 2406.689900135093|
|              0.0| 158.3496209132586|
|              0.0| 1540.409627549518|
|              0.0|  6403.87690864553|
|79118.00024795532|143457.33819014102|
|663.1999969482422| 6457.116320074398|
+-----------------+------------------+
only showing top 10 rows

DataFrame[user_id: int, T_total_spend: double, total_spend: double, total_events: bigint, total_sessions: bigint, avg_s

In [11]:
# This final feature set has an adjusted r-square that is nearly as good and has dramatically reduced the number of coefficients. 
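
The feature selection in this notebook was done by hand. As a hedged illustration of one step of that kind of process, the sketch below ranks features whose p-value exceeds a cutoff, so the least significant one can be dropped before refitting. `elimination_candidates` is a hypothetical helper and the 0.05 cutoff is an assumption; the p-values are copied from a subset of the first model's output.

```python
def elimination_candidates(p_values, alpha=0.05):
    """Return feature names whose p-value exceeds alpha, least significant first."""
    weak = [(name, p) for name, p in p_values.items() if p > alpha]
    ranked = sorted(weak, key=lambda item: item[1], reverse=True)
    return [name for name, _ in ranked]


# p-values copied from the first model's output (a subset of the 21 features).
p_values = {
    "total_spend": 0.0,
    "total_sessions": 3.236858e-02,
    "max_interactions_per_session": 9.999787e-01,
    "purchase_pct_of_total_events": 9.999853e-01,
    "view_pct_of_total_events": 1.0,
    "cart_pct_of_total_events": 5.654318e-01,
    "avg_purchases_per_session": 4.574254e-01,
}
print(elimination_candidates(p_values))
```

Because the features are collinear, p-values shift after each removal, so it is safer to drop one feature at a time and refit than to drop every candidate at once.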

In [12]:
print("** Original + Log-transformed inputs, normal output **")
inputCols = ["total_spend","total_events", "total_sessions", "avg_session_length", "avg_interactions_per_session", "max_interactions_per_session",
             "purchase_pct_of_total_events", "view_pct_of_total_events", "cart_pct_of_total_events","avg_purchases_per_session", "cart_events", "purchase_events",
             "view_events", "sessions_with_purchase", "sessions_with_cart","sessions_with_view", "pct_sessions_end_purchase", "pct_sessions_end_cart", 'sd_session_length', 
             'sd_interactions_per_session', 'sd_purchases_per_session', "total_spend_log", "total_events_log", "purchase_events_log", "total_sessions_log"]
outputCol = "T_total_spend"

pipeline = generatePipeline(inputCols, outputCol)
pipelineModel = pipeline.fit(trainDF)

print("Model coefficients")
print(modelInfo(inputCols, pipelineModel))

evaluationMetrics = getEvaluationMetrics(pipelineModel,outputCol,testDF, inputCols)
print(f"RMSE is {evaluationMetrics[0]:.1f}")
print(f"R^2 is {evaluationMetrics[1]:.5f}")
print(f"Adjusted R^2 is {evaluationMetrics[2]:.5f}")

** Original + Log-transformed inputs, normal output **
Model coefficients
                     Column name   Coefficient       pValues
0                      intercept  64676.487526  0.000000e+00
1                    total_spend      3.969448  0.000000e+00
2                   total_events    221.150198  0.000000e+00
3                 total_sessions   1479.276643  2.195719e-02
4             avg_session_length     -0.100486  1.302602e-11
5   avg_interactions_per_session   -243.069177  2.713388e-06
6   max_interactions_per_session     88.190723  9.998417e-01
7   purchase_pct_of_total_events -61705.024243  9.998736e-01
8       view_pct_of_total_events -49266.125091  9.998530e-01
9       cart_pct_of_total_events -57300.531692  0.000000e+00
10     avg_purchases_per_session  -4571.156833  1.591293e-01
11                   cart_events    -42.840964  0.000000e+00
12               purchase_events  -2008.129461  0.000000e+00
13                   view_events   -196.392595  0.000000e+00
14        s

In [13]:
generateCorrMatrix(inputCols, trainDF)

Unnamed: 0,total_spend,total_events,total_sessions,avg_session_length,avg_interactions_per_session,max_interactions_per_session,purchase_pct_of_total_events,view_pct_of_total_events,cart_pct_of_total_events,avg_purchases_per_session,...,sessions_with_view,pct_sessions_end_purchase,pct_sessions_end_cart,sd_session_length,sd_interactions_per_session,sd_purchases_per_session,total_spend_log,total_events_log,purchase_events_log,total_sessions_log
total_spend,1.0,0.564621,0.151065,0.043725,-0.004885,0.042578,0.083175,-0.087756,0.073183,0.119181,...,0.148906,0.069628,0.009892,0.050417,0.019942,0.175506,0.536203,0.379234,0.441939,0.165674
total_events,0.564621,1.0,0.467376,0.109398,0.136702,0.380164,-0.182647,0.186159,-0.150776,-0.105877,...,0.465337,-0.171989,0.04845,0.134815,0.273639,0.089593,0.350289,0.644777,0.361981,0.405858
total_sessions,0.151065,0.467376,1.0,0.174913,-0.039855,0.394534,-0.407273,0.412775,-0.332673,-0.387873,...,0.996996,-0.491223,0.09044,0.229781,0.242894,-0.000881,0.116135,0.635677,0.319133,0.793039
avg_session_length,0.043725,0.109398,0.174913,1.0,0.025044,0.092508,-0.090967,0.063875,-0.031355,-0.066332,...,0.156466,-0.090676,0.053622,0.948501,0.05924,-0.009554,0.026239,0.136689,0.053213,0.146432
avg_interactions_per_session,-0.004885,0.136702,-0.039855,0.025044,1.0,0.637865,-0.277896,0.2286,-0.146542,0.251451,...,-0.03602,0.121906,-0.014604,-0.009738,0.550659,0.041945,-0.019712,0.295554,0.085442,-0.089399
max_interactions_per_session,0.042578,0.380164,0.394534,0.092508,0.637865,1.0,-0.464993,0.456032,-0.356704,-0.176445,...,0.397881,-0.315549,0.055,0.092205,0.876179,0.124785,0.007458,0.608817,0.201523,0.426848
purchase_pct_of_total_events,0.083175,-0.182647,-0.407273,-0.090967,-0.277896,-0.464993,1.0,-0.832517,0.542351,0.662773,...,-0.408404,0.767058,-0.253101,-0.090528,-0.478889,-0.130114,0.186874,-0.617351,0.109589,-0.630544
view_pct_of_total_events,-0.087756,0.186159,0.412775,0.063875,0.2286,0.456032,-0.832517,1.0,-0.91696,-0.621138,...,0.419957,-0.721475,-0.019379,0.067755,0.460357,0.082648,-0.20344,0.56255,-0.119874,0.604896
cart_pct_of_total_events,0.073183,-0.150776,-0.332673,-0.031355,-0.146542,-0.356704,0.542351,-0.91696,1.0,0.464654,...,-0.342749,0.541713,0.211667,-0.037556,-0.353255,-0.031631,0.173938,-0.408516,0.102868,-0.463234
avg_purchases_per_session,0.119181,-0.105877,-0.387873,-0.066332,0.251451,-0.176445,0.662773,-0.621138,0.464654,1.0,...,-0.3867,0.829093,-0.244417,-0.085614,-0.256859,-0.027285,0.242444,-0.351623,0.28599,-0.630276


In [14]:
# Again, we iteratively tested features to find the best smaller set. 

print("** Smaller Original + Log-transformed inputs, normal output **")
inputCols = ["total_spend","total_events", "avg_session_length", "max_interactions_per_session", 
              "cart_pct_of_total_events","avg_purchases_per_session",  "purchase_events", 'sd_purchases_per_session',
              "view_events", "sessions_with_purchase", "sessions_with_view", "pct_sessions_end_purchase", "pct_sessions_end_cart", 'sd_session_length', 
            "total_spend_log","total_events_log", "purchase_events_log", "total_sessions_log"]
outputCol = "T_total_spend"

pipeline = generatePipeline(inputCols, outputCol)
pipelineModel = pipeline.fit(trainDF)

print("Model coefficients")
print(modelInfo(inputCols, pipelineModel))

evaluationMetrics = getEvaluationMetrics(pipelineModel,outputCol,testDF, inputCols)
print(f"RMSE is {evaluationMetrics[0]:.1f}")
print(f"R^2 is {evaluationMetrics[1]:.5f}")
print(f"Adjusted R^2 is {evaluationMetrics[2]:.5f}")

** Smaller Original + Log-transformed inputs, normal output **
Model coefficients
                     Column name   Coefficient       pValues
0                      intercept  11901.046535  0.000000e+00
1                    total_spend      3.958880  0.000000e+00
2                   total_events    222.124485  3.457475e-05
3             avg_session_length     -0.180450  5.185470e-02
4   max_interactions_per_session     20.193832  9.076295e-12
5       cart_pct_of_total_events  -6907.426805  0.000000e+00
6      avg_purchases_per_session  -5080.676490  0.000000e+00
7                purchase_events  -1970.098478  0.000000e+00
8       sd_purchases_per_session  -6956.063457  0.000000e+00
9                    view_events   -196.872666  0.000000e+00
10        sessions_with_purchase  -1476.328092  0.000000e+00
11            sessions_with_view    350.738617  0.000000e+00
12     pct_sessions_end_purchase -14673.353940  3.778907e-02
13         pct_sessions_end_cart   1367.757411  3.774758e-15
14 

In [15]:
generateCorrMatrix(inputCols, trainDF)

Unnamed: 0,total_spend,total_events,avg_session_length,max_interactions_per_session,cart_pct_of_total_events,avg_purchases_per_session,purchase_events,sd_purchases_per_session,view_events,sessions_with_purchase,sessions_with_view,pct_sessions_end_purchase,pct_sessions_end_cart,sd_session_length,total_spend_log,total_events_log,purchase_events_log,total_sessions_log
total_spend,1.0,0.564621,0.043725,0.042578,0.073183,0.119181,0.481098,0.175506,0.066552,0.488184,0.148906,0.069628,0.009892,0.050417,0.536203,0.379234,0.441939,0.165674
total_events,0.564621,1.0,0.109398,0.380164,-0.150776,-0.105877,0.367146,0.089593,0.514898,0.387877,0.465337,-0.171989,0.04845,0.134815,0.350289,0.644777,0.361981,0.405858
avg_session_length,0.043725,0.109398,1.0,0.092508,-0.031355,-0.066332,0.046947,-0.009554,0.134185,0.053639,0.156466,-0.090676,0.053622,0.948501,0.026239,0.136689,0.053213,0.146432
max_interactions_per_session,0.042578,0.380164,0.092508,1.0,-0.356704,-0.176445,0.173412,0.124785,0.737453,0.140828,0.397881,-0.315549,0.055,0.092205,0.007458,0.608817,0.201523,0.426848
cart_pct_of_total_events,0.073183,-0.150776,-0.031355,-0.356704,1.0,0.464654,0.082442,-0.031631,-0.396659,0.072024,-0.342749,0.541713,0.211667,-0.037556,0.173938,-0.408516,0.102868,-0.463234
avg_purchases_per_session,0.119181,-0.105877,-0.066332,-0.176445,0.464654,1.0,0.226345,-0.027285,-0.304312,0.112621,-0.3867,0.829093,-0.244417,-0.085614,0.242444,-0.351623,0.28599,-0.630276
purchase_events,0.481098,0.367146,0.046947,0.173412,0.082442,0.226345,1.0,0.359093,0.227308,0.932698,0.339643,0.087018,0.013407,0.060594,0.464972,0.381236,0.793098,0.317658
sd_purchases_per_session,0.175506,0.089593,-0.009554,0.124785,-0.031631,-0.027285,0.359093,1.0,0.002269,0.219623,0.000465,-0.303561,0.202615,-0.01375,0.229082,0.233302,0.426708,0.215818
view_events,0.066552,0.514898,0.134185,0.737453,-0.396659,-0.304312,0.227308,0.002269,1.0,0.235664,0.755356,-0.400806,0.050192,0.168932,0.030189,0.637956,0.222389,0.617183
sessions_with_purchase,0.488184,0.387877,0.053639,0.140828,0.072024,0.112621,0.932698,0.219623,0.235664,1.0,0.384296,0.078945,0.018408,0.070994,0.475812,0.405391,0.790096,0.367441


### Tested Models Adjusted R-Square Summary

- All original inputs, original output: 0.61426 (21 variables)
- Best original inputs, original output: 0.61334 (4 variables)
- Original + Log-transformed inputs, normal output: 0.61582 (25 variables)
- Smaller Original + Log-transformed inputs, normal output: 0.61617 (18 variables)


#### Our champion model has an adjusted R^2 of 0.61589 and uses 17 variables. However, it is possible to achieve an adjusted R^2 of 0.61334 using just 4 variables.

Can we beat that by using a log-transformed output? (The attempt is shown below, but the answer is no.)
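
To put the trade-off in numbers, this short sketch computes how much adjusted R^2 the 4-variable model gives up relative to the best value in the summary list above. The dictionary keys are shortened labels of our own; the values are copied from that list.

```python
# Adjusted R^2 values from the summary list above.
adj_r2_by_model = {
    "All original inputs": 0.61426,
    "Best original inputs (4 variables)": 0.61334,
    "Original + log-transformed inputs": 0.61582,
    "Smaller original + log-transformed inputs": 0.61617,
}

best = max(adj_r2_by_model.values())
four_var = adj_r2_by_model["Best original inputs (4 variables)"]

# Share of the best adjusted R^2 given up by the 4-variable model
relative_drop = (best - four_var) / best
print(f"Relative drop: {relative_drop:.2%}")
```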

In [16]:
# Again, we iteratively tested features to find the best smaller set. Ultimately, we determined that the full set of variables gives the best performance. 
# The log-transformed response does not work nearly as well as the raw response. 

print("** All inputs, log-transformed output **")
inputCols = ["total_spend", "total_sessions", "avg_session_length", "avg_interactions_per_session", "max_interactions_per_session",
             "purchase_pct_of_total_events", "view_pct_of_total_events", "cart_pct_of_total_events","avg_purchases_per_session", "cart_events", "purchase_events",
              "sessions_with_purchase","sessions_with_view", "pct_sessions_end_purchase", "pct_sessions_end_cart", 'sd_session_length', 
             'sd_interactions_per_session', 'sd_purchases_per_session', "total_spend_log", "total_events_log", "purchase_events_log", "total_sessions_log"]
outputCol = "T_total_spend_log"

pipeline = generatePipeline(inputCols, outputCol)
pipelineModel = pipeline.fit(trainDF)

print("Model coefficients")
print(modelInfo(inputCols, pipelineModel))

evaluationMetrics = getEvaluationMetrics(pipelineModel,outputCol,testDF, inputCols)
print(f"RMSE is {evaluationMetrics[0]:.1f}")
print(f"R^2 is {evaluationMetrics[1]:.5f}")
print(f"Adjusted R^2 is {evaluationMetrics[2]:.5f}")

** All inputs, log-transformed output **
Model coefficients
                     Column name  Coefficient       pValues
0                      intercept    10.837386  0.000000e+00
1                    total_spend    -0.000126  0.000000e+00
2                 total_sessions     0.414863  4.051154e-05
3             avg_session_length    -0.000021  0.000000e+00
4   avg_interactions_per_session    -0.326426  0.000000e+00
5   max_interactions_per_session    -0.026669  9.999937e-01
6   purchase_pct_of_total_events     0.319616  9.993103e-01
7       view_pct_of_total_events   -30.691944  9.994139e-01
8       cart_pct_of_total_events   -26.082067  0.000000e+00
9      avg_purchases_per_session    -2.249517  0.000000e+00
10                   cart_events    -0.085133  7.152812e-05
11               purchase_events    -0.049146  0.000000e+00
12        sessions_with_purchase     0.338287  0.000000e+00
13            sessions_with_view    -0.418970  0.000000e+00
14     pct_sessions_end_purchase    -4.6

### Test out different combinations of hyperparameters

Here we test 25 combinations of hyperparameters (5 x 5) on our best linear regression model using 4-fold cross-validation.

- Regularization strength (`regParam`): [0, 0.01, 0.2, 1, 10]
- Elastic net mixing (`elasticNetParam`, where 0 is pure ridge and 1 is pure LASSO): [0, 0.25, 0.5, 0.75, 1]
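
The sketch below enumerates the same grid in plain Python and shows how the two parameters combine into a single penalty. The penalty form is the elastic net expression as we understand it from the Spark ML documentation; `elastic_net_penalty` is a hypothetical helper and the weights are toy values.

```python
from itertools import product

reg_params = [0, 0.01, 0.2, 1, 10]
elastic_net_params = [0, 0.25, 0.5, 0.75, 1]

# Every (regParam, elasticNetParam) pair that the grid search evaluates
grid = list(product(reg_params, elastic_net_params))
print(len(grid))


def elastic_net_penalty(weights, reg_param, alpha):
    """Penalty reg_param * (alpha * L1 + (1 - alpha) / 2 * L2^2)."""
    l1 = sum(abs(w) for w in weights)
    l2_squared = sum(w * w for w in weights)
    return reg_param * (alpha * l1 + (1 - alpha) / 2 * l2_squared)


# Toy weights: alpha=0 gives the ridge penalty, alpha=1 the LASSO penalty.
weights = [3.0, -4.0]
print(elastic_net_penalty(weights, reg_param=1, alpha=0))
print(elastic_net_penalty(weights, reg_param=1, alpha=1))
```

Note that when the regularization strength is 0, the mixing value has no effect, so five of the 25 grid points describe the same unregularized model.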


In [40]:
# Change evaluation metrics so they work for CV
# def getEvaluationMetricsCV(cvModel,outputCol,testDF,inputCols):
#     predDF = cvModel.transform(testDF)
#     predDF.select(outputCol, "prediction").show(10)
    
#     print(predDF)
    
#     regressionEvaluator = RegressionEvaluator(
#     predictionCol="prediction",
#     labelCol=outputCol,
#     metricName="rmse")
#     rmse = regressionEvaluator.evaluate(predDF)

#     regressionEvaluator = RegressionEvaluator(
#     predictionCol="prediction",
#     labelCol=outputCol,
#     metricName="r2")
#     r2 = regressionEvaluator.evaluate(predDF)
      
#     # Manually calculate Adjusted r2
#     adjusted_r2 = adj_r2(r2, inputCols, testDF)
    
#     return rmse, r2, adjusted_r2

In [41]:
# Requesting p-values breaks for the cross-validated model, so this variant of modelInfo leaves them out
def modelInfo2(inputCols, pipelineModel):
    # Create a zipped list containing the coefficients and the data
    modelCols = copy.deepcopy(inputCols)
    modelCoeffs = list(pipelineModel.stages[-1].coefficients)
    modelCoeffs.insert(0,pipelineModel.stages[-1].intercept)
    modelCols.insert(0,"intercept")
    modelZippedList = list(map(list, zip(modelCols, modelCoeffs)))
        
    # Create the pandas DataFrame
    modelDF = pd.DataFrame(modelZippedList, columns = ['Column name', 'Coefficient'])
    
    return modelDF

In [42]:
%%time
print("** Champion adj-r-square **")
inputCols = ["total_spend", "total_events","avg_session_length", "avg_interactions_per_session", "view_events",
             'sd_interactions_per_session', 'sd_purchases_per_session', "total_spend_log", "total_events_log", "purchase_events",  "purchase_events_log", "total_sessions_log",
             "max_interactions_per_session",  "total_sessions", "view_pct_of_total_events", "sd_session_length", "pct_sessions_end_purchase"]
outputCol = "T_total_spend"
# The pipeline is built inline here so that the `lr` stage is in scope for ParamGridBuilder below.
# generatePipeline creates `lr` internally and does not expose it.
vecAssembler = VectorAssembler(inputCols=inputCols, outputCol="features")

# Select output column for linear regression
lr = LinearRegression(featuresCol="features", labelCol=outputCol)

pipeline = Pipeline(stages=[vecAssembler, lr])
    
paramGrid = ParamGridBuilder() \
    .addGrid(lr.regParam, [0, 0.01, 0.2, 1, 10]) \
    .addGrid(lr.elasticNetParam, [0, 0.25, 0.5, 0.75, 1]) \
    .build()

crossval = CrossValidator(estimator=pipeline,
                          estimatorParamMaps=paramGrid,
                          evaluator=RegressionEvaluator().setLabelCol("T_total_spend"),
                          numFolds=4)

# Run cross-validation, and choose the best set of parameters.
cvModel = crossval.fit(trainDF)

** Champion adj-r-square **
CPU times: user 4.44 s, sys: 1.34 s, total: 5.78 s
Wall time: 1min 28s


In [43]:
# print(cvModel.getEstimatorParamMaps()[np.argmax(cvModel.avgMetrics)])

In [44]:
print("Best model coefficients")
print(modelInfo2(inputCols, cvModel.bestModel))

#evaluationMetrics = getEvaluationMetricsCV(cvModel.bestModel,"T_total_spend",testDF,inputCols)
evaluationMetrics = getEvaluationMetrics(cvModel.bestModel,"T_total_spend",testDF,inputCols)
print(f"RMSE is {evaluationMetrics[0]:.1f}")
print(f"R^2 is {evaluationMetrics[1]:.5f}")
print(f"Adjusted R^2 is {evaluationMetrics[2]:.5f}")

print()
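# RegressionEvaluator defaults to RMSE, which CrossValidator minimizes, so the winning
# parameter map sits at the smallest average metric (argmax would return the worst one).
print(cvModel.getEstimatorParamMaps()[np.argmin(cvModel.avgMetrics)])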

Best model coefficients
                     Column name   Coefficient
0                      intercept   3339.589946
1                    total_spend      3.949472
2                   total_events    221.689208
3             avg_session_length     -0.147394
4   avg_interactions_per_session   -270.164405
5                    view_events   -198.215613
6    sd_interactions_per_session    -89.245144
7       sd_purchases_per_session  -5429.616294
8                total_spend_log   1469.072840
9               total_events_log   1302.806603
10               purchase_events  -2801.345903
11           purchase_events_log   5849.824343
12            total_sessions_log  -8681.086075
13  max_interactions_per_session    100.144892
14                total_sessions    353.863859
15      view_pct_of_total_events   8164.812837
16             sd_session_length      0.077834
17     pct_sessions_end_purchase -17286.792616
+-----------------+-------------------+
|    T_total_spend|         prediction|
+--
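
When reading the winning parameter map out of `avgMetrics`, the direction of the metric matters: `RegressionEvaluator` defaults to RMSE, where lower is better, so the best map is the one with the smallest average. A minimal sketch with hypothetical values (the parameter maps and averages below are made up, not results from this notebook):

```python
# Hypothetical average RMSE per parameter map (lower is better); not real results.
param_maps = [
    {"regParam": 0, "elasticNetParam": 0},
    {"regParam": 0.01, "elasticNetParam": 0.5},
    {"regParam": 10, "elasticNetParam": 1},
]
avg_rmse = [105.2, 104.9, 131.7]

# Index of the smallest and largest average metric
best_index = min(range(len(avg_rmse)), key=avg_rmse.__getitem__)
worst_index = max(range(len(avg_rmse)), key=avg_rmse.__getitem__)

print("best:", param_maps[best_index])
print("worst:", param_maps[worst_index])
```

For a metric where higher is better, such as `r2`, the roles of the two indices are swapped.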

### Double-check whether the values truly come out the same with the selected parameters

In [47]:
%%time
# This pipeline model contains the optimized parameters chosen by our model
def generateOptimizedPipeline(inputCols, outputCol):
    # Select input columns for linear regression
    vecAssembler = VectorAssembler(inputCols=inputCols, outputCol="features")

    # Select output column for linear regression
    lr = LinearRegression(featuresCol="features", labelCol=outputCol, regParam = 1.0, elasticNetParam = .25)

    # Chaining both stages in a Pipeline replaces calling
    # vecAssembler.transform(trainDF) by hand before fitting.
 
    pipeline = Pipeline(stages=[vecAssembler, lr])
    return pipeline
    
# Fit the pipeline that carries the cross-validated parameters
pipeline = generateOptimizedPipeline(inputCols, "T_total_spend")
pipelineModel = pipeline.fit(trainDF)



CPU times: user 17.1 ms, sys: 6.41 ms, total: 23.5 ms
Wall time: 697 ms


In [49]:
print("** All original inputs, original output **")
inputCols = ["total_spend", "total_events","avg_session_length", "avg_interactions_per_session", "view_events",
             'sd_interactions_per_session', 'sd_purchases_per_session', "total_spend_log", "total_events_log", "purchase_events",  "purchase_events_log", "total_sessions_log",
             "max_interactions_per_session",  "total_sessions", "view_pct_of_total_events", "sd_session_length", "pct_sessions_end_purchase"]

outputCol = "T_total_spend"

pipeline = generatePipeline(inputCols, outputCol)
pipelineModel = pipeline.fit(trainDF)

print("Model coefficients")
print(modelInfo(inputCols, pipelineModel))

evaluationMetrics = getEvaluationMetrics(pipelineModel,outputCol,testDF, inputCols)
print(f"RMSE is {evaluationMetrics[0]:.1f}")
print(f"R^2 is {evaluationMetrics[1]:.5f}")
print(f"Adjusted R^2 is {evaluationMetrics[2]:.5f}")

** All original inputs, original output **
Model coefficients
                     Column name   Coefficient       pValues
0                      intercept   3687.902619  0.000000e+00
1                    total_spend      3.947723  0.000000e+00
2                   total_events    221.630879  6.005486e-04
3             avg_session_length     -0.150080  0.000000e+00
4   avg_interactions_per_session   -276.349453  0.000000e+00
5                    view_events   -198.527398  4.753246e-02
6    sd_interactions_per_session    -82.836690  0.000000e+00
7       sd_purchases_per_session  -5558.501378  0.000000e+00
8                total_spend_log   1462.262768  5.295764e-13
9               total_events_log   1355.420234  0.000000e+00
10               purchase_events  -2800.436050  0.000000e+00
11           purchase_events_log   5966.777828  0.000000e+00
12            total_sessions_log  -8878.314122  7.793396e-08
13  max_interactions_per_session     99.208006  0.000000e+00
14                total

#### In this case, the choice of regularization parameters appears to be inconsequential. The model performed essentially the same with the default parameters as with the parameters the cross-validator selected as best. This is not because the cross-validator picked the defaults: it selected a regParam of 1 and an elasticNetParam of 0.25, so the regularization itself makes only a minute difference to performance.
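
For reference, the two tuned parameters enter the objective through the elastic-net penalty. The sketch below follows the form given in the Spark ML documentation, `regParam * (elasticNetParam * L1 + (1 - elasticNetParam) / 2 * L2^2)`. The helper name `elasticNetPenalty` and the example coefficients are made up for illustration; they are not taken from the fitted model.

```python
def elasticNetPenalty(coefficients, regParam, elasticNetParam):
    """Elastic-net penalty for a coefficient vector (hypothetical helper)."""
    l1 = sum(abs(w) for w in coefficients)          # L1 norm
    l2Squared = sum(w * w for w in coefficients)    # squared L2 norm
    return regParam * (elasticNetParam * l1 + (1.0 - elasticNetParam) / 2.0 * l2Squared)

# Illustrative coefficients only (not from the model above),
# evaluated at the parameters the cross-validator selected
exampleCoefficients = [3.0, -4.0]
print(elasticNetPenalty(exampleCoefficients, regParam=1.0, elasticNetParam=0.25))
```

With `regParam=0` the penalty vanishes and the fit reduces to ordinary least squares, which is the default configuration the tuned model is compared against.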

### Save pipeline model and get model size
##### CHECK NEW SIZE
The model size is ____ kB, according to the file explorer in Linux.

In [20]:
pipelinePath = "models/lr-pipeline-model_NewData"
cvModel.bestModel.write().overwrite().save(pipelinePath)
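
Instead of reading the size from the file explorer, the saved model directory could be measured in code. This is a minimal sketch using only the standard library. The helper `directorySizeKB` is hypothetical, and it assumes the model was saved to the local filesystem (not HDFS or object storage) at `pipelinePath`.

```python
import os

def directorySizeKB(path):
    """Sum the sizes of all files under `path`, in kB (1 kB = 1024 bytes)."""
    totalBytes = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            totalBytes += os.path.getsize(os.path.join(root, name))
    return totalBytes / 1024

pipelinePath = "models/lr-pipeline-model_NewData"
if os.path.isdir(pipelinePath):
    print(f"Model size is {directorySizeKB(pipelinePath):.1f} kB")
```

The result may differ slightly from what a file explorer reports, since this sums file contents rather than disk blocks used.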

#### Our data is extremely zero-inflated. If we downsample the data so that the response variable equals 0 only about half the time, can we beat the model performance above? The number to beat is an adjusted R^2 of 0.61589.
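
To make the downsampling target concrete, the sketch below works out what fraction of the zero-response rows to keep so that zeros make up a chosen share of the result. The function name and the row counts are illustrative assumptions. The actual `ds_train.parquet` and `ds_test.parquet` files were generated elsewhere, and their exact procedure is not shown here.

```python
def zeroKeepFraction(nZero, nNonZero, targetZeroShare=0.5):
    """Fraction f of zero rows to keep so that
    f * nZero / (f * nZero + nNonZero) == targetZeroShare."""
    if nZero <= 0:
        raise ValueError("nZero must be positive")
    if not 0 < targetZeroShare < 1:
        raise ValueError("targetZeroShare must be strictly between 0 and 1")
    fraction = targetZeroShare * nNonZero / ((1 - targetZeroShare) * nZero)
    return min(1.0, fraction)  # cannot keep more zero rows than exist

# Made-up counts: 900,000 zero rows and 100,000 non-zero rows
fraction = zeroKeepFraction(900_000, 100_000)
print(f"Keep {fraction:.3f} of the zero rows")

# In Spark this fraction could be applied roughly as follows (untested sketch):
# zeros = df.filter(col("T_total_spend") == 0).sample(fraction=fraction, seed=42)
# downsampled = zeros.union(df.filter(col("T_total_spend") != 0))
```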

In [21]:
%%time
trainDF_ds = spark.read.parquet("./processed_data/ds_train.parquet")
testDF_ds = spark.read.parquet("./processed_data/ds_test.parquet")

CPU times: user 3.33 ms, sys: 535 µs, total: 3.86 ms
Wall time: 191 ms


In [22]:
trainDF_ds = trainDF_ds \
          .withColumn("total_spend_log", log(col("total_spend")+0.001)) \
          .withColumn("total_events_log", log(col("total_events")+0.001)) \
          .withColumn("purchase_events_log", log(col("purchase_events")+0.001)) \
          .withColumn("total_sessions_log", log(col("total_sessions")+0.001)) \
          .withColumn("T_total_spend_log", log(col("T_total_spend")+0.001))

testDF_ds = testDF_ds \
          .withColumn("total_spend_log", log(col("total_spend")+0.001)) \
          .withColumn("total_events_log", log(col("total_events")+0.001)) \
          .withColumn("purchase_events_log", log(col("purchase_events")+0.001)) \
          .withColumn("total_sessions_log", log(col("total_sessions")+0.001)) \
          .withColumn("T_total_spend_log", log(col("T_total_spend")+0.001))
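
The `+0.001` offset keeps the logarithm defined when a column is 0. As a small worked example in plain Python (the helper names are hypothetical), a zero maps to log(0.001), roughly -6.9, and a value on the log scale can be mapped back by inverting the transform.

```python
import math

OFFSET = 0.001  # same offset as in the cell above

def logWithOffset(x, offset=OFFSET):
    """Natural log of x shifted by a small offset, so that x = 0 is allowed."""
    return math.log(x + offset)

def inverseLogWithOffset(y, offset=OFFSET):
    """Undo logWithOffset: map a log-scale value back to the original scale."""
    return math.exp(y) - offset

print(f"{logWithOffset(0):.4f}")      # zero maps to log(0.001)
print(f"{logWithOffset(100):.4f}")
print(f"{inverseLogWithOffset(logWithOffset(100)):.4f}")  # round trip back to about 100
```

Because every zero lands on the same large negative value, a zero-inflated column stays heavily concentrated at one point after the transform.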


In [23]:
# "cart_pct_of_total_events" is left out: oddly, including it breaks the run. Not important, as it's not a good predictor anyway.

print("** Original + Log-transformed inputs, normal output **")
inputCols = ["total_spend", "total_events", "purchase_events", "total_sessions", "avg_session_length",
             "avg_interactions_per_session", "max_interactions_per_session", "purchase_pct_of_total_events",
             "view_pct_of_total_events", "avg_purchases_per_session", "cart_events", "view_events",
             "sessions_with_purchase", "sessions_with_cart", "sessions_with_view", "pct_sessions_end_purchase",
             "pct_sessions_end_cart", "sd_session_length", "sd_interactions_per_session", "sd_purchases_per_session"]
outputCol = "T_total_spend"

pipeline = generatePipeline(inputCols, outputCol)
pipelineModel = pipeline.fit(trainDF_ds)

print("Model coefficients")
print(modelInfo(inputCols, pipelineModel))

evaluationMetrics = getEvaluationMetrics(pipelineModel,outputCol,testDF_ds, inputCols)
print(f"RMSE is {evaluationMetrics[0]:.1f}")
print(f"R^2 is {evaluationMetrics[1]:.5f}")
print(f"Adjusted R^2 is {evaluationMetrics[2]:.5f}")

** Original + Log-transformed inputs, normal output **
Model coefficients
                     Column name   Coefficient       pValues
0                      intercept  -8061.148243  0.000000e+00
1                    total_spend      4.360910  0.000000e+00
2                   total_events    215.828050  0.000000e+00
3                purchase_events  -3723.415567  5.391370e-05
4                 total_sessions   1070.623981  2.627864e-01
5             avg_session_length     -0.091686  1.613769e-05
6   avg_interactions_per_session   -300.344446  3.289889e-08
7   max_interactions_per_session    200.190515  7.547815e-03
8   purchase_pct_of_total_events -14144.069142  1.023383e-07
9       view_pct_of_total_events  13252.029420  6.534423e-01
10     avg_purchases_per_session   -394.147528  1.805919e-02
11                   cart_events    137.187457  0.000000e+00
12                   view_events   -175.830322  2.449210e-03
13        sessions_with_purchase  -1006.174796  6.189360e-10
14         

#### Interestingly, downsampling the inflated 0s doesn't seem to help at all on the test set!
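
For reference, the adjusted R^2 used in these comparisons is conventionally computed from R^2, the number of rows n, and the number of predictors p. The sketch below assumes `getEvaluationMetrics` uses this standard formula, which is not confirmed here. The function name and the example numbers are illustrative.

```python
def adjustedR2(r2, nRows, nPredictors):
    """Standard adjusted R^2: 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    if nRows - nPredictors - 1 <= 0:
        raise ValueError("need more rows than predictors + 1")
    return 1 - (1 - r2) * (nRows - 1) / (nRows - nPredictors - 1)

# Made-up values: R^2 of 0.60 on 1,000 rows with 20 predictors
print(f"{adjustedR2(0.60, 1000, 20):.5f}")
```

With many more rows than predictors the correction is small, so adjusted R^2 and R^2 stay close to each other.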