## File 05 - Basic Regression

In this file, we create a small ML pipeline based on the output from File 03 (Basic Preprocessed Output).
The files needed are `./processed_data/train.parquet` and `./processed_data/test.parquet`.

The goal of this file is to provide:
- The type of model
- Best hyperparameters used
- Size of the saved model
- Performance metrics

### Set up Spark session

We can specify more options on the `SparkSession` builder, but currently all options are left at their default settings.

In [1]:
%%time
from pyspark.sql import SparkSession
from pyspark.sql import types as T
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
from pyspark.sql.functions import col
from pyspark.sql.functions import log
from pyspark.ml.stat import Correlation

import pandas as pd
import numpy as np
import copy

spark = SparkSession.builder \
        .appName("project") \
        .getOrCreate()

sc = spark.sparkContext

CPU times: user 508 ms, sys: 415 ms, total: 923 ms
Wall time: 5.75 s


### Read in dataframes for train and test sets

This data should have been previously generated: we can find it in the `processed_data` folder.

In [2]:
%%time
trainDF = spark.read.parquet("./processed_data/train.parquet")
testDF = spark.read.parquet("./processed_data/test.parquet")
trainDF.show(5)

+---------+------------------+------------------+------------+--------------+------------------+------------------+----------------------------+---------------------------+----------------------------+----------------------------+------------------------+------------------------+-------------------------+------------------------+-----------+---------------+-----------+----------------------+------------------+------------------+-------------------------+---------------------+--------------------+
|  user_id|     T_total_spend|       total_spend|total_events|total_sessions|avg_session_length| sd_session_length|avg_interactions_per_session|sd_interactions_per_session|max_interactions_per_session|purchase_pct_of_total_events|view_pct_of_total_events|cart_pct_of_total_events|avg_purchases_per_session|sd_purchases_per_session|cart_events|purchase_events|view_events|sessions_with_purchase|sessions_with_cart|sessions_with_view|pct_sessions_end_purchase|pct_sessions_end_cart|       pca_purchas

### Set up Spark ML pipeline training for linear regression

Here we decide which input columns should be used in order to create our training pipeline. To implement this step, we create the function `generatePipeline(inputCols, outputCol)`. Then, we train the pipeline using this function.

In [3]:
%%time

inputCols = ["total_spend","total_events","purchase_events","total_sessions"]

def generatePipeline(inputCols, outputCol):
    # Select input columns for linear regression
    vecAssembler = VectorAssembler(inputCols=inputCols, outputCol="features")

    # Select output column for linear regression
    lr = LinearRegression(featuresCol="features", labelCol=outputCol)

    # Chaining both stages in a Pipeline replaces the manual step
    # vecTrainDF = vecAssembler.transform(trainDF); the caller fits the pipeline.

    pipeline = Pipeline(stages=[vecAssembler, lr])
    return pipeline
    
pipeline = generatePipeline(inputCols, "T_total_spend")
pipelineModel = pipeline.fit(trainDF)

CPU times: user 16.7 ms, sys: 1.9 ms, total: 18.6 ms
Wall time: 2.39 s


### View the model information

Print out the model coefficients and view the RMSE, R^2, and adjusted R^2. We define the functions `modelInfo(inputCols, pipelineModel)` and `getEvaluationMetrics(pipelineModel, outputCol, testDF, inputCols)` to report this information.

In [4]:
def modelInfo(inputCols, pipelineModel):
    # Create a zipped list containing the coefficients and the data
    modelCols = copy.deepcopy(inputCols)
    modelCoeffs = list(pipelineModel.stages[-1].coefficients)
    modelCoeffs.insert(0,pipelineModel.stages[-1].intercept)
    modelCols.insert(0,"intercept")
    modelZippedList = list(map(list, zip(modelCols, modelCoeffs)))

    # Create the pandas DataFrame
    modelDF = pd.DataFrame(modelZippedList, columns = ['Column name', 'Coefficient'])
    return modelDF

print("Model coefficients")
print(modelInfo(inputCols, pipelineModel))


Model coefficients
       Column name  Coefficient
0        intercept  1494.143909
1      total_spend     4.279921
2     total_events   208.218305
3  purchase_events -2602.249440
4   total_sessions  -718.697638


In [5]:
# Calculate adjusted r2 (https://towardsdatascience.com/machine-learning-linear-regression-using-pyspark-9d5d5c772b42)
def adj_r2(r2, inputCols, testDF):
    n = testDF.count()
    p = len(inputCols)
    adjusted_r2 = 1-(((1-r2)*(n-1))/(n-p-1))
    return adjusted_r2
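
The adjusted R^2 above penalizes R^2 for the number of predictors `p` relative to the number of rows `n`. As a minimal sketch of the same formula without Spark (the helper name `adjusted_r2_from_counts` and the numbers are illustrative assumptions, not values from this dataset), passing `n` directly avoids a `testDF.count()` call on every evaluation:

```python
def adjusted_r2_from_counts(r2, n, p):
    """Adjusted R^2 for n observations and p predictors (hypothetical helper)."""
    if n - p - 1 <= 0:
        raise ValueError("need more than p + 1 observations")
    return 1 - ((1 - r2) * (n - 1)) / (n - p - 1)


# Illustrative numbers only: R^2 = 0.5 on 101 rows with 4 predictors.
example = adjusted_r2_from_counts(0.5, n=101, p=4)
print(f"{example:.4f}")
```

With thousands of rows and only a handful of predictors, the correction is tiny, which is consistent with R^2 and adjusted R^2 agreeing to about three decimals in the outputs of this notebook.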

In [6]:
def getEvaluationMetrics(pipelineModel,outputCol,testDF,inputCols):
    predDF = pipelineModel.transform(testDF)
    predDF.select(outputCol, "prediction").show(10)

    regressionEvaluator = RegressionEvaluator(
        predictionCol="prediction",
        labelCol=outputCol,
        metricName="rmse")
    rmse = regressionEvaluator.evaluate(predDF)

    regressionEvaluator = RegressionEvaluator(
        predictionCol="prediction",
        labelCol=outputCol,
        metricName="r2")
    r2 = regressionEvaluator.evaluate(predDF)
    
    # Manually calculate Adjusted r2
    adjusted_r2 = adj_r2(r2, inputCols, testDF)
    
    return rmse, r2, adjusted_r2

evaluationMetrics = getEvaluationMetrics(pipelineModel,"T_total_spend",testDF,inputCols)
print(f"RMSE is {evaluationMetrics[0]:.1f}")
print(f"R^2 is {evaluationMetrics[1]:.5f}")
print(f"Adjusted R^2 is {evaluationMetrics[2]:.5f}")

+------------------+-------------------+
|     T_total_spend|         prediction|
+------------------+-------------------+
| 56475.99983215332|  164657.5830154838|
|1639.5600357055664|-303.32782546206204|
|1975.2000427246094| 1213.7693599057573|
|               0.0|-180.91606436870802|
|               0.0|  7309.178585129883|
| 61383.30047607422|  57233.91785304816|
|               0.0|-1848.6527619242759|
| 53453.60915565491|  95008.50069317133|
|               0.0|  1945.984141302463|
|               0.0|  6666.573428944875|
+------------------+-------------------+
only showing top 10 rows

RMSE is 35497.8
R^2 is 0.61312
Adjusted R^2 is 0.61290
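
For reference, the two reported metrics can be reproduced by hand. The sketch below uses the textbook definitions (RMSE as the root of the mean squared residual, R^2 as 1 - SSE/SST), which we assume match what `RegressionEvaluator` computes for `rmse` and `r2`. The three data points are made up for illustration.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SSE / SST."""
    mean_y = sum(y_true) / len(y_true)
    sse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    sst = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - sse / sst


# Made-up values, including a negative prediction for a zero-spend user.
y_true = [0.0, 0.0, 10.0]
y_pred = [1.0, -1.0, 8.0]
print(rmse(y_true, y_pred), r2(y_true, y_pred))
```

Note that an unconstrained linear model can predict negative spend, as several rows in the table above show. RMSE treats those errors like any others.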


### Train linear regression model using log-transformed features

Now that we have used our new functions to test out the model accuracy on untransformed features and untransformed output, let's retrain the linear regression model on transformed features and/or transformed output (log scale). First we create these new features.

In [7]:
trainDF = trainDF \
          .withColumn("total_spend_log", log(col("total_spend")+0.001)) \
          .withColumn("total_events_log", log(col("total_events")+0.001)) \
          .withColumn("purchase_events_log", log(col("purchase_events")+0.001)) \
          .withColumn("total_sessions_log", log(col("total_sessions")+0.001)) \
          .withColumn("T_total_spend_log", log(col("T_total_spend")+0.001))

testDF = testDF \
          .withColumn("total_spend_log", log(col("total_spend")+0.001)) \
          .withColumn("total_events_log", log(col("total_events")+0.001)) \
          .withColumn("purchase_events_log", log(col("purchase_events")+0.001)) \
          .withColumn("total_sessions_log", log(col("total_sessions")+0.001)) \
          .withColumn("T_total_spend_log", log(col("T_total_spend")+0.001))

trainDF.show(2)
testDF.show(2)


+---------+-----------------+------------------+------------+--------------+------------------+-----------------+----------------------------+---------------------------+----------------------------+----------------------------+------------------------+------------------------+-------------------------+------------------------+-----------+---------------+-----------+----------------------+------------------+------------------+-------------------------+---------------------+--------------------+------------------+------------------+--------------------+------------------+------------------+
|  user_id|    T_total_spend|       total_spend|total_events|total_sessions|avg_session_length|sd_session_length|avg_interactions_per_session|sd_interactions_per_session|max_interactions_per_session|purchase_pct_of_total_events|view_pct_of_total_events|cart_pct_of_total_events|avg_purchases_per_session|sd_purchases_per_session|cart_events|purchase_events|view_events|sessions_with_purchase|sessions_wi
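
The `+0.001` offset keeps `log` defined for users with zero spend or zero events, but it maps every zero to the same large negative value. The plain-Python sketch below (no Spark; the sample values are illustrative) shows that value next to `log1p`, which is one possible alternative and not what this notebook uses.

```python
import math

OFFSET = 0.001  # same offset as in the withColumn calls above

values = [0.0, 1639.56, 56476.0]  # illustrative spend values
with_offset = [math.log(v + OFFSET) for v in values]
with_log1p = [math.log1p(v) for v in values]  # log(1 + x), maps 0 to 0

for v, a, b in zip(values, with_offset, with_log1p):
    print(f"{v:>10.2f}  log(x+0.001)={a:8.3f}  log1p(x)={b:8.3f}")
```

The zero rows land at about -6.908 with the offset, which is the value that shows up for zero-spend users in the `T_total_spend_log` column of the outputs further down. Because that is far from the log of any positive spend, zero rows can dominate the log-scale error.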

### Test out different combinations of log-transformed features and outputs

Here we test out four additional transformations of the dataset in order to evaluate the linear regression model performance.

Total tested model formulations (test-set metrics as printed by the cells in this notebook):
1. Normal inputs, normal output: `RMSE is 35497.8, R^2 is 0.61312`
2. Log-transformed inputs, normal output: `RMSE is 48065.7, R^2 is 0.29068`
3. Log-transformed inputs, log-transformed output: `RMSE is 5.7, R^2 is 0.34015` (note: cannot be compared to non-log-transformed outputs)
4. Normal inputs, log-transformed output: `RMSE is 6.4, R^2 is 0.17808` (note: cannot be compared to non-log-transformed outputs)
5. Normal and log-transformed inputs, normal output: RMSE and R^2 not shown (the output of that cell is cut off in this export)

Models 3 and 4 are scored on the log scale, so only models 1, 2, and 5 can be compared directly. Among these, log-transforming the inputs alone (model 2) is clearly worse than using the untransformed inputs (model 1). We would suggest adopting the last model, which uses both sets of inputs: the cross-validated fit on this feature set later in the notebook reports `RMSE is 35425.7`, marginally below model 1. Although ordinary least squares does not take hyperparameters as such, we have made the pre-training choice to log-transform features and include them as well. In the future, it would be useful to do a more in-depth feature selection (perhaps a stepwise feature selection), and use AIC to determine model complexity.
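
One hedged way to put the log-output models on the same footing as the others is to map their predictions back to the original scale before computing RMSE. The sketch below is plain Python with made-up numbers. It assumes the same `+0.001` offset that is used to create the `_log` columns, and it ignores the retransformation bias that exponentiating a log-scale prediction introduces.

```python
import math

OFFSET = 0.001  # assumed to match the offset used for the *_log columns


def rmse(y_true, y_pred):
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def back_transform(log_preds, offset=OFFSET):
    """Invert log(x + offset) so predictions are on the original scale."""
    return [math.exp(z) - offset for z in log_preds]


# Made-up targets and log-scale predictions, for illustration only.
y_true = [0.0, 1500.0, 60000.0]
log_preds = [-5.0, 7.0, 10.5]

rmse_log_scale = rmse([math.log(y + OFFSET) for y in y_true], log_preds)
rmse_original_scale = rmse(y_true, back_transform(log_preds))
print(f"log scale: {rmse_log_scale:.2f}, original scale: {rmse_original_scale:.1f}")
```

A small RMSE on the log scale can correspond to a very large RMSE in currency units, which is why the single-digit RMSE values of the log-output models say nothing about how they rank against the others.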

In [8]:
print("** Log-transformed inputs, normal output **")
inputCols = ["total_spend_log","total_events_log","purchase_events_log","total_sessions_log"]
outputCol = "T_total_spend"

pipeline = generatePipeline(inputCols, outputCol)
pipelineModel = pipeline.fit(trainDF)

print("Model coefficients")
print(modelInfo(inputCols, pipelineModel))

evaluationMetrics = getEvaluationMetrics(pipelineModel,outputCol,testDF, inputCols)
print(f"RMSE is {evaluationMetrics[0]:.1f}")
print(f"R^2 is {evaluationMetrics[1]:.5f}")
print(f"Adjusted R^2 is {evaluationMetrics[2]:.5f}")

** Log-transformed inputs, normal output **
Model coefficients
           Column name    Coefficient
0            intercept -119879.797176
1      total_spend_log   10609.023993
2     total_events_log   30439.674660
3  purchase_events_log   -6182.748393
4   total_sessions_log  -18318.133550
+------------------+-------------------+
|     T_total_spend|         prediction|
+------------------+-------------------+
| 56475.99983215332| 134704.45886734946|
|1639.5600357055664|    8960.3610047689|
|1975.2000427246094| 29692.507574293268|
|               0.0|-18382.545542118547|
|               0.0|   9524.66758815627|
| 61383.30047607422|   90036.9597024736|
|               0.0|-18460.991457053926|
| 53453.60915565491|  98218.09779670386|
|               0.0| -4444.113256183744|
|               0.0|  16446.46954234285|
+------------------+-------------------+
only showing top 10 rows

RMSE is 48065.7
R^2 is 0.29068
Adjusted R^2 is 0.29028


In [9]:
print("** Log-transformed inputs, log-transformed output **")
inputCols = ["total_spend_log","total_events_log","purchase_events_log","total_sessions_log"]
outputCol = "T_total_spend_log"

pipeline = generatePipeline(inputCols, outputCol)
pipelineModel = pipeline.fit(trainDF)

print("Model coefficients")
print(modelInfo(inputCols, pipelineModel))

evaluationMetrics = getEvaluationMetrics(pipelineModel,outputCol,testDF, inputCols)
print(f"RMSE is {evaluationMetrics[0]:.1f}")
print(f"R^2 is {evaluationMetrics[1]:.5f}")
print(f"Adjusted R^2 is {evaluationMetrics[2]:.5f}")

** Log-transformed inputs, log-transformed output **
Model coefficients
           Column name  Coefficient
0            intercept   -18.561402
1      total_spend_log     1.218765
2     total_events_log     4.049588
3  purchase_events_log    -1.132621
4   total_sessions_log    -2.954687
+------------------+-------------------+
| T_total_spend_log|         prediction|
+------------------+-------------------+
|10.941571062864051| 12.268538622236054|
| 7.402183823835461|-3.9072332769012306|
| 7.588425465939199|-1.4047602741376721|
|-6.907755278982137|  -6.01305773148562|
|-6.907755278982137|-3.2750433438952626|
|11.024893114216223|  6.764613065045317|
|-6.907755278982137| -7.061106794727561|
| 10.88656945684123|  7.180066795636673|
|-6.907755278982137|  -4.77134115434772|
|-6.907755278982137| -2.379090146231327|
+------------------+-------------------+
only showing top 10 rows

RMSE is 5.7
R^2 is 0.34015
Adjusted R^2 is 0.33978


In [10]:
print("** Normal inputs, log-transformed output **")
inputCols = ["total_spend","total_events","purchase_events","total_sessions"]
outputCol = "T_total_spend_log"

pipeline = generatePipeline(inputCols, outputCol)
pipelineModel = pipeline.fit(trainDF)

print("Model coefficients")
print(modelInfo(inputCols, pipelineModel))

evaluationMetrics = getEvaluationMetrics(pipelineModel,outputCol,testDF, inputCols)
print(f"RMSE is {evaluationMetrics[0]:.1f}")
print(f"R^2 is {evaluationMetrics[1]:.5f}")
print(f"Adjusted R^2 is {evaluationMetrics[2]:.5f}")

** Normal inputs, log-transformed output **
Model coefficients
       Column name  Coefficient
0        intercept    -3.757140
1      total_spend     0.000184
2     total_events     0.012134
3  purchase_events    -0.003103
4   total_sessions    -0.039294
+------------------+-------------------+
| T_total_spend_log|         prediction|
+------------------+-------------------+
|10.941571062864051|  5.573987557929431|
| 7.402183823835461| -3.592627931116834|
| 7.588425465939199| -3.037739917570958|
|-6.907755278982137| -3.712810774320358|
|-6.907755278982137| -3.246808307852624|
|11.024893114216223| -0.699170772228991|
|-6.907755278982137|-3.8041690483589896|
| 10.88656945684123|  1.953111401964024|
|-6.907755278982137|-3.5722478445728134|
|-6.907755278982137|-3.2926316700403957|
+------------------+-------------------+
only showing top 10 rows

RMSE is 6.4
R^2 is 0.17808
Adjusted R^2 is 0.17761


In [11]:
print("** Normal and log-transformed inputs, normal output **")
inputCols = ["total_spend","total_events","purchase_events","total_sessions",
             "total_spend_log","total_events_log","purchase_events_log","total_sessions_log"]
outputCol = "T_total_spend"

pipeline = generatePipeline(inputCols, outputCol)
pipelineModel = pipeline.fit(trainDF)
    
print("Model coefficients")
print(modelInfo(inputCols, pipelineModel))

evaluationMetrics = getEvaluationMetrics(pipelineModel,outputCol,testDF, inputCols)
print(f"RMSE is {evaluationMetrics[0]:.1f}")
print(f"R^2 is {evaluationMetrics[1]:.5f}")
print(f"Adjusted R^2 is {evaluationMetrics[2]:.5f}")

** Normal and log-transformed inputs, normal output **
Model coefficients
           Column name  Coefficient
0            intercept -7165.231520
1          total_spend     4.073738
2         total_events   213.188174
3      purchase_events -2853.439562
4       total_sessions  -879.492699
5      total_spend_log  2230.632977
6     total_events_log -2463.650416
7  purchase_events_log  -356.121139
8   total_sessions_log  3902.620330
+------------------+-------------------+
|     T_total_spend|         prediction|
+------------------+-------------------+
| 56475.99983215332| 166623.25313624274|
|1639.5600357055664| 3347.6782072002334|
|1975.2000427246094|  573.4937887732995|
|               0.0|-2174.8218132244338|
|               0.0|  4062.986798932512|
| 61383.30047607422|  57671.71190932955|
|               0.0| 1477.3501192694475|
| 53453.60915565491|  97870.18484984158|
|               0.0|  -1308.75972203509|
|               0.0|  4828.875724919522|
+------------------+-------------

### Test out different combinations of hyperparameters

Here we test out 25 combinations of hyperparameters under 4-fold cross-validation.

- Regularization parameters (`regParam`): [0, 0.01, 0.2, 1, 10]
- Amount of LASSO-ness (`elasticNetParam`): [0, 0.25, 0.5, 0.75, 1]

The results that we find are that a regularization of 0.2 is desirable with an amount of LASSO-ness of 1 (L1 regularization only). Given that the test RMSE is quite similar to the previous models (`RMSE is 35425.7`), it is not clear whether this is legitimately an improvement. Because the default `RegressionEvaluator` metric is RMSE, where smaller is better, the selected combination should be the one with the lowest average cross-validation metric.
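
To make the grid concrete, the sketch below enumerates the 25 combinations in plain Python and evaluates the elastic-net penalty as we understand it from the Spark documentation: `regParam * (elasticNetParam * L1 + (1 - elasticNetParam) / 2 * L2^2)`. The helper names and the example weights are illustrative assumptions, and Spark's default feature standardization is ignored here.

```python
from itertools import product

reg_params = [0, 0.01, 0.2, 1, 10]
elastic_net_params = [0, 0.25, 0.5, 0.75, 1]
grid = list(product(reg_params, elastic_net_params))


def elastic_net_penalty(weights, reg_param, elastic_net_param):
    """Penalty term: reg * (alpha * L1 + (1 - alpha) / 2 * L2^2)."""
    l1 = sum(abs(w) for w in weights)
    l2_squared = sum(w * w for w in weights)
    return reg_param * (elastic_net_param * l1 + (1 - elastic_net_param) / 2 * l2_squared)


def best_by_rmse(param_maps, avg_rmse):
    """RMSE is lower-is-better, so pick the map with the smallest average."""
    best_index = min(range(len(avg_rmse)), key=avg_rmse.__getitem__)
    return param_maps[best_index]


print(len(grid))
print(elastic_net_penalty([3.0, -4.0], reg_param=0.2, elastic_net_param=1))
```

With `regParam = 0` the penalty vanishes whatever `elasticNetParam` is, so five of the 25 grid points describe the same unregularized model.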


In [12]:
%%time
print("** Normal and log-transformed inputs, normal output **")
inputCols = ["total_spend","total_events","purchase_events","total_sessions",
             "total_spend_log","total_events_log","purchase_events_log","total_sessions_log"]
pipeline = generatePipeline(inputCols, "T_total_spend")
pipelineModel = pipeline.fit(trainDF)


# The param grid must reference the LinearRegression instance that is inside the
# pipeline, so we take it from the pipeline returned by generatePipeline.
lr = pipeline.getStages()[-1]
    
paramGrid = ParamGridBuilder() \
    .addGrid(lr.regParam, [0, 0.01, 0.2, 1, 10]) \
    .addGrid(lr.elasticNetParam, [0, 0.25, 0.5, 0.75, 1]) \
    .build()

crossval = CrossValidator(estimator=pipeline,
                          estimatorParamMaps=paramGrid,
                          evaluator=RegressionEvaluator().setLabelCol("T_total_spend"),
                          numFolds=4)

# Run cross-validation, and choose the best set of parameters.
cvModel = crossval.fit(trainDF)

** Normal and log-transformed inputs, normal output **
CPU times: user 4 s, sys: 1.03 s, total: 5.03 s
Wall time: 41.6 s


In [13]:
print("Best model coefficients")
print(modelInfo(inputCols, cvModel.bestModel))

evaluationMetrics = getEvaluationMetrics(cvModel.bestModel,"T_total_spend",testDF,inputCols)
print(f"RMSE is {evaluationMetrics[0]:.1f}")
print(f"R^2 is {evaluationMetrics[1]:.5f}")
print(f"Adjusted R^2 is {evaluationMetrics[2]:.5f}")

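print()
# RegressionEvaluator defaults to RMSE (lower is better), so the best map has the smallest average metric
print(cvModel.getEstimatorParamMaps()[np.argmin(cvModel.avgMetrics)])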

Best model coefficients
           Column name  Coefficient
0            intercept -7167.335048
1          total_spend     4.073367
2         total_events   212.922809
3      purchase_events -2852.231913
4       total_sessions  -871.583599
5      total_spend_log  2210.545557
6     total_events_log -2387.855526
7  purchase_events_log  -316.864567
8   total_sessions_log  3773.876845
+------------------+-------------------+
|     T_total_spend|         prediction|
+------------------+-------------------+
| 56475.99983215332|  166616.3018898134|
|1639.5600357055664|  3288.047947050788|
|1975.2000427246094|  625.2935185752003|
|               0.0| -2146.726932788386|
|               0.0| 4074.5094642090526|
| 61383.30047607422| 57769.319171859046|
|               0.0| 1367.3526390093584|
| 53453.60915565491|  97863.32779524822|
|               0.0|-1301.7518908486736|
|               0.0|  4829.366123719056|
+------------------+-------------------+
only showing top 10 rows

RMSE is 35425.7


### Save pipeline model and get model size

The model size is 13.9 kB, according to the file explorer in Linux.

In [14]:
pipelinePath = "models/lr-pipeline-model_NewData"
cvModel.bestModel.write().overwrite().save(pipelinePath)
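
Spark writes the pipeline model as a directory of metadata and data files, so its size is the sum over those files. As a minimal sketch for a model saved to the local filesystem (the helper name `directory_size_bytes` is an illustrative assumption, and the demo runs on a temporary directory instead of the real model path):

```python
import os
import tempfile


def directory_size_bytes(path):
    """Sum the sizes of all files below path (local filesystem only)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total


# Self-contained demo: two small files in a temporary directory tree.
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "stages"))
    with open(os.path.join(tmp, "stages", "part-00000"), "wb") as fh:
        fh.write(b"x" * 1500)
    with open(os.path.join(tmp, "metadata"), "wb") as fh:
        fh.write(b"y" * 500)
    demo_size = directory_size_bytes(tmp)

print(f"{demo_size / 1000:.1f} kB")
```

Calling `directory_size_bytes(pipelinePath)` after the save would give a number to compare with the file explorer, which may round or count disk blocks differently.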

### New models using more features

In [15]:
trainDF.printSchema()

root
 |-- user_id: integer (nullable = true)
 |-- T_total_spend: double (nullable = true)
 |-- total_spend: double (nullable = true)
 |-- total_events: long (nullable = true)
 |-- total_sessions: long (nullable = true)
 |-- avg_session_length: double (nullable = true)
 |-- sd_session_length: double (nullable = true)
 |-- avg_interactions_per_session: double (nullable = true)
 |-- sd_interactions_per_session: double (nullable = true)
 |-- max_interactions_per_session: long (nullable = true)
 |-- purchase_pct_of_total_events: double (nullable = true)
 |-- view_pct_of_total_events: double (nullable = true)
 |-- cart_pct_of_total_events: double (nullable = true)
 |-- avg_purchases_per_session: double (nullable = true)
 |-- sd_purchases_per_session: double (nullable = true)
 |-- cart_events: long (nullable = true)
 |-- purchase_events: long (nullable = true)
 |-- view_events: long (nullable = true)
 |-- sessions_with_purchase: long (nullable = true)
 |-- sessions_with_cart: long (nullable =

In [16]:
print("** All normal inputs, normal output **")
inputCols = ["total_spend","total_events","purchase_events", "total_sessions", "avg_session_length", "avg_interactions_per_session", "max_interactions_per_session",
             "purchase_pct_of_total_events", "view_pct_of_total_events", "cart_pct_of_total_events","avg_purchases_per_session", "cart_events", "purchase_events",
             "view_events", "sessions_with_purchase", "sessions_with_cart","sessions_with_view", "pct_sessions_end_purchase", "pct_sessions_end_cart", 'sd_session_length', 
             'sd_interactions_per_session', 'sd_purchases_per_session']

# Note: "purchase_events" is listed twice in inputCols, so it appears twice (with identical coefficients) in the table below
outputCol = "T_total_spend"

pipeline = generatePipeline(inputCols, outputCol)
pipelineModel = pipeline.fit(trainDF)
    
print("Model coefficients")
print(modelInfo(inputCols, pipelineModel))

evaluationMetrics = getEvaluationMetrics(pipelineModel,outputCol,testDF, inputCols)
print(f"RMSE is {evaluationMetrics[0]:.1f}")
print(f"R^2 is {evaluationMetrics[1]:.5f}")
print(f"Adjusted R^2 is {evaluationMetrics[2]:.5f}")

** All normal inputs, normal output **
Model coefficients
                     Column name   Coefficient
0                      intercept    978.323124
1                    total_spend      3.744338
2                   total_events    232.763047
3                purchase_events  -1337.286294
4                 total_sessions    652.858800
5             avg_session_length      0.666471
6   avg_interactions_per_session   -184.916908
7   max_interactions_per_session     84.682862
8   purchase_pct_of_total_events -10307.204370
9       view_pct_of_total_events   2852.989534
10      cart_pct_of_total_events  -1247.398724
11     avg_purchases_per_session   1345.884707
12                   cart_events    -86.834132
13               purchase_events  -1337.286294
14                   view_events   -222.503874
15        sessions_with_purchase    -34.359770
16            sessions_with_cart    268.406486
17            sessions_with_view   -505.754050
18     pct_sessions_end_purchase   -919.995859
19

In [17]:
#Create function to view correlation matrices
# https://stackoverflow.com/questions/52214404/how-to-get-the-correlation-matrix-of-a-pyspark-data-frame
def generateCorrMatrix(inputCols, dataframe):
    # Select input columns for Correlation Matrix & transform
    vector_col = 'corr_features'
    corrAssembler = VectorAssembler(inputCols=inputCols, outputCol=vector_col)
    df_vector = corrAssembler.transform(dataframe).select(vector_col)
    
    #get correlation matrix
    matrix = Correlation.corr(df_vector, vector_col)
    result = matrix.collect()[0]["pearson({})".format(vector_col)].values
    readable = pd.DataFrame(result.reshape(-1, len(inputCols)), columns=inputCols, index=inputCols)
    
    return readable

In [18]:
inputCols = ["total_spend","total_events","purchase_events", "total_sessions", "avg_session_length", "avg_interactions_per_session", "max_interactions_per_session",
             "purchase_pct_of_total_events", "view_pct_of_total_events", "cart_pct_of_total_events","avg_purchases_per_session", "cart_events", "purchase_events",
             "view_events", "sessions_with_purchase", "sessions_with_cart","sessions_with_view", "pct_sessions_end_purchase", "pct_sessions_end_cart", 'sd_session_length', 
             'sd_interactions_per_session', 'sd_purchases_per_session']

generateCorrMatrix(inputCols, trainDF)

Unnamed: 0,total_spend,total_events,purchase_events,total_sessions,avg_session_length,avg_interactions_per_session,max_interactions_per_session,purchase_pct_of_total_events,view_pct_of_total_events,cart_pct_of_total_events,...,purchase_events.1,view_events,sessions_with_purchase,sessions_with_cart,sessions_with_view,pct_sessions_end_purchase,pct_sessions_end_cart,sd_session_length,sd_interactions_per_session,sd_purchases_per_session
total_spend,1.0,0.577131,0.45737,0.149447,0.009732,-0.006148,0.033399,0.079635,-0.089586,0.078943,...,0.45737,0.059072,0.467091,0.399781,0.147379,0.069003,0.014678,0.020136,0.015963,0.162989
total_events,0.577131,1.0,0.358637,0.468312,0.090887,0.150442,0.384838,-0.180878,0.182567,-0.147464,...,0.358637,0.520823,0.373904,0.429579,0.467509,-0.164458,0.050054,0.113839,0.276446,0.091032
purchase_events,0.45737,0.358637,1.0,0.354513,0.025969,0.05581,0.160869,0.096594,-0.100802,0.083783,...,1.0,0.220501,0.941079,0.809692,0.354763,0.087254,0.015317,0.051327,0.104282,0.338999
total_sessions,0.149447,0.468312,0.354513,1.0,0.120711,-0.037086,0.394514,-0.408405,0.417469,-0.340953,...,0.354513,0.751192,0.390998,0.578193,0.998658,-0.490129,0.08767,0.196585,0.239764,0.002438
avg_session_length,0.009732,0.090887,0.025969,0.120711,1.0,0.165815,0.159632,-0.111095,0.097057,-0.067617,...,0.025969,0.154164,0.017547,0.079219,0.111216,-0.047512,0.026652,0.90142,0.123538,-0.004241
avg_interactions_per_session,-0.006148,0.150442,0.05581,-0.037086,0.165815,1.0,0.654099,-0.28542,0.234601,-0.151255,...,0.05581,0.309149,-0.002827,-0.004015,-0.034241,0.119163,-0.004887,0.017753,0.575194,0.049781
max_interactions_per_session,0.033399,0.384838,0.160869,0.394514,0.159632,0.654099,1.0,-0.469619,0.459395,-0.36062,...,0.160869,0.724573,0.128485,0.223111,0.396117,-0.315602,0.062282,0.104335,0.881275,0.128174
purchase_pct_of_total_events,0.079635,-0.180878,0.096594,-0.408405,-0.111095,-0.28542,-0.469619,1.0,-0.836946,0.552771,...,0.096594,-0.408124,0.076663,-0.110976,-0.409171,0.769048,-0.25488,-0.084501,-0.475822,-0.13257
view_pct_of_total_events,-0.089586,0.182567,-0.100802,0.417469,0.097057,0.234601,0.459395,-0.836946,1.0,-0.918711,...,-0.100802,0.445437,-0.0856,0.016629,0.421103,-0.72861,-0.00791,0.075667,0.455208,0.08646
cart_pct_of_total_events,0.078943,-0.147464,0.083783,-0.340953,-0.067617,-0.151255,-0.36062,0.552771,-0.918711,1.0,...,0.083783,-0.383742,0.075018,0.054763,-0.345935,0.554472,0.19597,-0.054239,-0.349767,-0.035984


#### Highly correlated pairs

Values read from the matrix above:
* sessions with view / total sessions: .999
* sessions with purchase / purchase events: .941
* sessions with cart / purchase events: .810
* view events / total sessions: .751
* view events / max interactions per session: .725
* view pct of total events / purchase pct of total events: -.837
* pct sessions end purchase / purchase pct of total events: .769
* pct sessions end purchase / view pct of total events: -.729
* cart pct of total events / view pct of total events: -.919

Values from an earlier run (these pairs are hidden in the truncated matrix above, so the numbers may have changed):
* cart events / purchase events: .723
* pct sessions end purchase / avg purchases per session: .83
* sessions with purchase / cart events: .70
* sessions with cart / cart events: .816
* sessions with view / view events: .77
* sessions with cart / sessions with purchase: .86
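
Rather than reading pairs off the matrix by eye, a small helper can list every pair above a chosen threshold. The sketch below is plain Python; the function name and the 0.7 threshold are assumptions, and the three-column excerpt uses values that are visible in the matrix above.

```python
def highly_correlated_pairs(names, matrix, threshold=0.7):
    """Return (name_a, name_b, r) for each unordered pair with |r| >= threshold."""
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if names[i] == names[j]:
                continue  # skip a column that was listed twice
            r = matrix[i][j]
            if abs(r) >= threshold:
                pairs.append((names[i], names[j], r))
    # Strongest correlations first
    return sorted(pairs, key=lambda item: -abs(item[2]))


# Excerpt of the correlation matrix above (three columns only).
names = ["purchase_events", "total_sessions", "sessions_with_purchase"]
matrix = [
    [1.0, 0.354513, 0.941079],
    [0.354513, 1.0, 0.390998],
    [0.941079, 0.390998, 1.0],
]
print(highly_correlated_pairs(names, matrix))
```

For the full pandas result of `generateCorrMatrix`, something like `highly_correlated_pairs(list(corr.columns), corr.values.tolist())` should work, where `corr` is a hypothetical variable holding the returned DataFrame.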


In [19]:
print("** All normal inputs, log output **")
inputCols = ["total_spend","total_events","purchase_events", "total_sessions", "avg_session_length", "avg_interactions_per_session", "max_interactions_per_session",
             "purchase_pct_of_total_events", "view_pct_of_total_events", "cart_pct_of_total_events","avg_purchases_per_session", "cart_events", "purchase_events",
             "view_events", "sessions_with_purchase", "sessions_with_cart","sessions_with_view", "pct_sessions_end_purchase", "pct_sessions_end_cart", 'sd_session_length', 
             'sd_interactions_per_session', 'sd_purchases_per_session']

outputCol = "T_total_spend_log"

pipeline = generatePipeline(inputCols, outputCol)
pipelineModel = pipeline.fit(trainDF)
    
print("Model coefficients")
print(modelInfo(inputCols, pipelineModel))

evaluationMetrics = getEvaluationMetrics(pipelineModel,outputCol,testDF, inputCols)
print(f"RMSE is {evaluationMetrics[0]:.1f}")
print(f"R^2 is {evaluationMetrics[1]:.5f}")
print(f"Adjusted R^2 is {evaluationMetrics[2]:.5f}")

** All normal inputs, log output **
Model coefficients
                     Column name  Coefficient
0                      intercept    -5.233751
1                    total_spend     0.000128
2                   total_events     0.013940
3                purchase_events    -0.307621
4                 total_sessions     0.700348
5             avg_session_length    -0.000316
6   avg_interactions_per_session    -0.051705
7   max_interactions_per_session     0.018756
8   purchase_pct_of_total_events    -2.914336
9       view_pct_of_total_events     0.640357
10      cart_pct_of_total_events     0.032913
11     avg_purchases_per_session     0.984941
12                   cart_events    -0.047029
13               purchase_events    -0.307621
14                   view_events    -0.014685
15        sessions_with_purchase     0.908058
16            sessions_with_cart    -0.005433
17            sessions_with_view    -0.677006
18     pct_sessions_end_purchase     0.316246
19         pct_sessions_e

In [20]:
print("** Champion model **")
inputCols = ["total_spend", "total_events", "purchase_events"]

outputCol = "T_total_spend"

pipeline = generatePipeline(inputCols, outputCol)
pipelineModel = pipeline.fit(trainDF)
    
print("Model coefficients")
print(modelInfo(inputCols, pipelineModel))

evaluationMetrics = getEvaluationMetrics(pipelineModel,outputCol,testDF, inputCols)
print(f"RMSE is {evaluationMetrics[0]:.1f}")
print(f"R^2 is {evaluationMetrics[1]:.5f}")
print(f"Adjusted R^2 is {evaluationMetrics[2]:.5f}")

** Champion model **
Model coefficients
       Column name  Coefficient
0        intercept -1388.983783
1      total_spend     4.590589
2     total_events   191.918327
3  purchase_events -3198.822949
+------------------+------------------+
|     T_total_spend|        prediction|
+------------------+------------------+
| 56475.99983215332|156915.25256631614|
|1639.5600357055664|447.00299320217096|
|1975.2000427246094|-74.31680559173196|
|               0.0|-2979.518947389361|
|               0.0| 9824.490284141324|
| 61383.30047607422|55951.747641149086|
|               0.0|-1133.880156229647|
| 53453.60915565491| 93874.90193753804|
|               0.0|1568.4674542642185|
|               0.0| 7939.062065458902|
+------------------+------------------+
only showing top 10 rows

RMSE is 35570.5
R^2 is 0.61154
Adjusted R^2 is 0.61137


We achieve an adjusted R^2 of .611 with just 3 predictors: `total_spend`, `total_events`, and `purchase_events`. Additional improvement is marginal at best: when all predictors are used, the adjusted R^2 only rises to .620, and there is a dramatic decrease in interpretability due to multicollinearity. Log-transforming the response variable clearly does not work in this instance (R^2 of 0.18 and 0.34 on the log scale for the four-feature models above).
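
The multicollinearity mentioned above can be given a rough number with the variance inflation factor (VIF). For a predictor regressed on just one other predictor, VIF = 1 / (1 - r^2), which is a lower bound on that predictor's VIF in the full model. The sketch below applies this to two correlations taken from the matrix earlier in this file; the helper name is an illustrative assumption.

```python
def pairwise_vif(r):
    """VIF of a predictor explained by a single other predictor with correlation r."""
    return 1.0 / (1.0 - r * r)


# Correlations read from the matrix above.
vif_sessions = pairwise_vif(0.998658)  # sessions_with_view vs total_sessions
vif_purchase = pairwise_vif(0.941079)  # sessions_with_purchase vs purchase_events
print(round(vif_sessions), round(vif_purchase, 1))
```

A VIF in the hundreds means the two columns carry almost the same information, so their individual coefficients (and signs) in the all-predictor model should not be interpreted on their own.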

### PCA

In [21]:
print("** PCA and Linear Regression **")
inputCols = ["pca_purchases"]

outputCol = "T_total_spend"

pipeline = generatePipeline(inputCols, outputCol)
pipelineModel = pipeline.fit(trainDF)
    
print("Model coefficients")
print(modelInfo(inputCols, pipelineModel))

evaluationMetrics = getEvaluationMetrics(pipelineModel,outputCol,testDF, inputCols)
print(f"RMSE is {evaluationMetrics[0]:.1f}")
print(f"R^2 is {evaluationMetrics[1]:.5f}")
# print(f"Adjusted R^2 is {evaluationMetrics[2]:.5f}") #Adj R^2 won't calculate properly because it doesn't know how to count the PCA columns

** PCA and Linear Regression **
Model coefficients
     Column name  Coefficient
0      intercept  3194.762414
1  pca_purchases -5261.925901
+------------------+------------------+
|     T_total_spend|        prediction|
+------------------+------------------+
| 56475.99983215332|15402.639264436859|
|1639.5600357055664|29308.068689616477|
|1975.2000427246094| 23285.44863503289|
|               0.0| 7264.054697750803|
|               0.0| 3194.762414407774|
| 61383.30047607422|30002.585412207725|
|               0.0|14006.683153860173|
| 53453.60915565491|15402.639264436859|
|               0.0|3175.9210720189053|
|               0.0| 7264.054697750803|
+------------------+------------------+
only showing top 10 rows

RMSE is 55447.9
R^2 is 0.05607


PCA with k=10 performs laughably poorly, with an R^2 of .056.

### Question - shouldn't each PCA column wind up with its own column coefficient above? 
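
A likely explanation, assuming `pca_purchases` is a single vector column holding all k=10 components: the fitted model does have one coefficient per component, but `modelInfo` zips the coefficient list with `inputCols`, which contains only one name, and `zip` stops at the shorter sequence. The plain-Python sketch below mimics that behaviour with made-up coefficients; the helper names are illustrative.

```python
def zipped_rows(input_cols, intercept, coefficients):
    """Mimic modelInfo: prepend the intercept, then zip names with values."""
    names = ["intercept"] + list(input_cols)
    values = [intercept] + list(coefficients)
    return list(zip(names, values))  # zip truncates to the shorter list


def expanded_names(input_cols, vector_sizes):
    """Give each slot of a vector column its own name, e.g. pca_purchases[0]."""
    names = []
    for col_name in input_cols:
        size = vector_sizes.get(col_name, 1)
        if size == 1:
            names.append(col_name)
        else:
            names.extend(f"{col_name}[{i}]" for i in range(size))
    return names


# Made-up coefficients: one vector column with k = 10 slots.
coeffs = [float(i) for i in range(10)]
truncated = zipped_rows(["pca_purchases"], 1.5, coeffs)
full = zipped_rows(expanded_names(["pca_purchases"], {"pca_purchases": 10}), 1.5, coeffs)
print(len(truncated), len(full))
```

Under the same assumption, the adjusted R^2 would need `p` to be the number of fitted coefficients (for example `len(pipelineModel.stages[-1].coefficients)`) instead of `len(inputCols)`.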