## File 05 - Basic Regression

In this file, we create a small ML pipeline based on the output from File 03 (Basic Preprocessed Output).
The files needed are `/processed_data/train_output.parquet` and `/processed_data/test_output.parquet`.

The goal of this file is to provide:
- The type of model
- Best hyperparameters used
- Size of the saved model
- Performance metrics

### Set up Spark session

Additional options can be passed to `SparkSession.builder`, but for now we keep the default settings.

In [1]:
%%time
from pyspark.sql import SparkSession
from pyspark.sql import types as T
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
from pyspark.sql.functions import col
from pyspark.sql.functions import log
from pyspark.ml.stat import Correlation

import pandas as pd
import numpy as np
import copy

spark = SparkSession.builder \
        .appName("project") \
        .getOrCreate()

sc = spark.sparkContext

CPU times: user 518 ms, sys: 407 ms, total: 925 ms
Wall time: 5.73 s


### Read in dataframes for train and test sets

This data should have been previously generated: we can find it in the `processed_data` folder.

In [2]:
%%time
trainDF = spark.read.parquet("./processed_data/train.parquet")
testDF = spark.read.parquet("./processed_data/test.parquet")
trainDF.show(5)

+---------+------------------+------------------+------------+--------------+------------------+------------------+----------------------------+---------------------------+----------------------------+----------------------------+------------------------+------------------------+-------------------------+------------------------+-----------+---------------+-----------+----------------------+------------------+------------------+-------------------------+---------------------+--------------------+
|  user_id|     T_total_spend|       total_spend|total_events|total_sessions|avg_session_length| sd_session_length|avg_interactions_per_session|sd_interactions_per_session|max_interactions_per_session|purchase_pct_of_total_events|view_pct_of_total_events|cart_pct_of_total_events|avg_purchases_per_session|sd_purchases_per_session|cart_events|purchase_events|view_events|sessions_with_purchase|sessions_with_cart|sessions_with_view|pct_sessions_end_purchase|pct_sessions_end_cart|       pca_purchas

### Set up Spark ML pipeline training for linear regression

Here we decide which input columns should be used in order to create our training pipeline. To implement this step, we create the function `generatePipeline(inputCols, outputCol)`. Then, we train the pipeline using this function.

In [3]:
%%time

inputCols = ["total_spend","total_events","purchase_events","total_sessions"]

def generatePipeline(inputCols, outputCol):
    # Select input columns for linear regression
    vecAssembler = VectorAssembler(inputCols=inputCols, outputCol="features")

    # Select output column for linear regression
    lr = LinearRegression(featuresCol="features", labelCol=outputCol)

    # Chaining both stages in a Pipeline replaces calling the assembler by hand:
    # vecTrainDF = vecAssembler.transform(trainDF)
 
    pipeline = Pipeline(stages=[vecAssembler, lr])
    return pipeline
    
pipeline = generatePipeline(inputCols, "T_total_spend")
pipelineModel = pipeline.fit(trainDF)

CPU times: user 15.2 ms, sys: 3.87 ms, total: 19 ms
Wall time: 2.58 s


### View the model information

Print out the model coefficients and view the RMSE and R^2. We define the functions `modelInfo(inputCols, pipelineModel)` and `getEvaluationMetrics(pipelineModel,outputCol,testDF)` to report this information.

In [4]:
def modelInfo(inputCols, pipelineModel):
    # Create a zipped list containing the coefficients and the data
    modelCols = copy.deepcopy(inputCols)
    modelCoeffs = list(pipelineModel.stages[-1].coefficients)
    modelCoeffs.insert(0,pipelineModel.stages[-1].intercept)
    modelCols.insert(0,"intercept")
    modelZippedList = list(map(list, zip(modelCols, modelCoeffs)))

    # Create the pandas DataFrame
    modelDF = pd.DataFrame(modelZippedList, columns = ['Column name', 'Coefficient'])
    return modelDF

print("Model coefficients")
print(modelInfo(inputCols, pipelineModel))


Model coefficients
       Column name  Coefficient
0        intercept -1063.828796
1      total_spend     4.661704
2     total_events   203.901146
3  purchase_events -2372.306954
4   total_sessions  -384.117163
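
To make the coefficient table concrete, the following minimal sketch shows how a linear regression prediction is formed from it: the intercept plus the sum of each coefficient times its feature value. The coefficients are copied from the output above; the `example_row` values are invented for illustration and do not come from the dataset.

```python
# Coefficients copied from the table printed above.
intercept = -1063.828796
coefficients = {
    "total_spend": 4.661704,
    "total_events": 203.901146,
    "purchase_events": -2372.306954,
    "total_sessions": -384.117163,
}

def predict(row):
    """Return intercept + sum(coefficient * feature value) for one user."""
    return intercept + sum(coefficients[name] * row[name] for name in coefficients)

# Hypothetical user, invented for illustration only (not taken from the dataset).
example_row = {
    "total_spend": 1000.0,
    "total_events": 50,
    "purchase_events": 2,
    "total_sessions": 10,
}
print(f"Predicted T_total_spend: {predict(example_row):.2f}")
```

Because the intercept and some coefficients are negative, nothing prevents the model from predicting a negative spend for low-activity users.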


In [5]:
def getEvaluationMetrics(pipelineModel,outputCol,testDF):
    # Score the test set and preview the first predictions
    predDF = pipelineModel.transform(testDF)
    predDF.select(outputCol, "prediction").show(10)

    # Root mean squared error (same units as the label)
    regressionEvaluator = RegressionEvaluator(
        predictionCol="prediction",
        labelCol=outputCol,
        metricName="rmse")
    rmse = regressionEvaluator.evaluate(predDF)

    # Coefficient of determination
    regressionEvaluator = RegressionEvaluator(
        predictionCol="prediction",
        labelCol=outputCol,
        metricName="r2")
    r2 = regressionEvaluator.evaluate(predDF)

    return rmse, r2

evaluationMetrics = getEvaluationMetrics(pipelineModel,"T_total_spend",testDF)
print(f"RMSE is {evaluationMetrics[0]:.1f}")
print(f"R^2 is {evaluationMetrics[1]:.5f}")

+------------------+-------------------+
|     T_total_spend|         prediction|
+------------------+-------------------+
|24301.800384521484| 29379.162498772836|
| 185378.8254852295| 242684.40486044425|
|  91750.2392578125| 50054.022681519025|
|               0.0| 1239.0322413093036|
|               0.0|-2302.7153914925743|
|               0.0| -8180.234729164908|
|               0.0| -724.7054219984248|
|21416.200317382812|  6313.630742502146|
|               0.0|-3794.3101468262357|
|               0.0|-3375.4808417876866|
+------------------+-------------------+
only showing top 10 rows

RMSE is 45339.0
R^2 is 0.62258
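
For reference, the two metrics reported above can be reproduced by hand from label/prediction pairs. This is a plain-Python sketch of the standard definitions (RMSE as the square root of the mean squared error, R^2 as one minus the ratio of residual to total sum of squares); the toy numbers are made up and are not taken from the test set.

```python
import math

def rmse(labels, predictions):
    """Root mean squared error between labels and predictions."""
    squared_errors = [(y - p) ** 2 for y, p in zip(labels, predictions)]
    return math.sqrt(sum(squared_errors) / len(labels))

def r2(labels, predictions):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean_label = sum(labels) / len(labels)
    ss_residual = sum((y - p) ** 2 for y, p in zip(labels, predictions))
    ss_total = sum((y - mean_label) ** 2 for y in labels)
    return 1.0 - ss_residual / ss_total

# Toy values, invented for illustration.
toy_labels = [1.0, 2.0, 3.0]
toy_predictions = [1.0, 2.0, 4.0]
print(rmse(toy_labels, toy_predictions))
print(r2(toy_labels, toy_predictions))
```

RMSE is expressed in the units of the label, so it changes when the label is rescaled or log-transformed, whereas R^2 is unitless.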


### Train linear regression model using log-transformed features

Now that we have used our new functions to test out the model accuracy on untransformed features and untransformed output, let's retrain the linear regression model on transformed features and/or transformed output (log scale). First we create these new features.

In [6]:
# Columns to log-transform; 0.001 is added so that zero values stay finite.
logCols = ["total_spend", "total_events", "purchase_events", "total_sessions", "T_total_spend"]

def addLogColumns(df, cols, offset=0.001):
    # Append a "<name>_log" column for every column in cols
    for c in cols:
        df = df.withColumn(f"{c}_log", log(col(c) + offset))
    return df

trainDF = addLogColumns(trainDF, logCols)
testDF = addLogColumns(testDF, logCols)

trainDF.show(2)
testDF.show(2)


+---------+------------------+------------------+------------+--------------+------------------+------------------+----------------------------+---------------------------+----------------------------+----------------------------+------------------------+------------------------+-------------------------+------------------------+-----------+---------------+-----------+----------------------+------------------+------------------+-------------------------+---------------------+--------------------+-----------------+-----------------+-------------------+------------------+------------------+
|  user_id|     T_total_spend|       total_spend|total_events|total_sessions|avg_session_length| sd_session_length|avg_interactions_per_session|sd_interactions_per_session|max_interactions_per_session|purchase_pct_of_total_events|view_pct_of_total_events|cart_pct_of_total_events|avg_purchases_per_session|sd_purchases_per_session|cart_events|purchase_events|view_events|sessions_with_purchase|sessions_w
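
The 0.001 offset keeps the logarithm finite for users whose spend or event count is zero. As a small sketch of the arithmetic, a zero value maps to `log(0.001)`, which is the -6.9077... value that shows up for zero-spend users in the log-scale outputs later in this file. The helper names below are illustrative and are not part of the notebook.

```python
import math

OFFSET = 0.001  # same constant that is added before taking the log

def log_with_offset(x):
    """Natural log of x + OFFSET, defined for x >= 0."""
    return math.log(x + OFFSET)

def inverse_log_with_offset(v):
    """Undo log_with_offset: return to the original scale."""
    return math.exp(v) - OFFSET

print(log_with_offset(0.0))                                # log(0.001), about -6.9078
print(inverse_log_with_offset(log_with_offset(24301.8)))   # round trip back to about 24301.8
```

The gap between `log(0.001)` and the log of even a small positive spend is large, so the choice of offset has a visible effect on log-scale errors.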

### Test out different combinations of log-transformed features and outputs

Here we test out four additional transformations of the dataset in order to evaluate the linear regression model performance.

Total tested model formulations:
1. Normal inputs, normal output: `RMSE is 107094.8, R^2 is 0.91824`
2. Log-transformed inputs, normal output: `RMSE is 372221.8, R^2 is 0.01233`
3. Log-transformed inputs, log-transformed output: `RMSE is 3.1, R^2 is 0.20218` (note: cannot be compared to non-log-transformed outputs)
4. Normal inputs, log-transformed output: `RMSE is 3.5, R^2 is 0.01027` (note: cannot be compared to non-log-transformed outputs)
5. Normal and log-transformed inputs, normal output: `RMSE is 106947.5, R^2 is 0.91846`

Because it has the largest R^2 and the smallest RMSE of the models with untransformed output, we suggest adopting the last model. Linear regression does not take hyperparameters as such here, but adding log-transformed copies of the features is a modeling choice made before training. In the future, it would be useful to do a more in-depth feature selection (perhaps a stepwise feature selection) and to use AIC to determine model complexity.
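
As a rough sketch of how AIC could support that comparison: for a least-squares fit with Gaussian errors, AIC can be written up to an additive constant as `n * ln(RSS / n) + 2k`, where `RSS` is the residual sum of squares, `n` the number of rows and `k` the number of fitted parameters. The function name and the numbers below are hypothetical, and `k` is assumed here to count the coefficients plus the intercept.

```python
import math

def gaussian_aic(rss, n, k):
    """AIC (up to an additive constant) for a least-squares fit.

    rss: residual sum of squares on the training data
    n:   number of training rows
    k:   number of fitted parameters (assumed: coefficients + intercept)
    """
    return n * math.log(rss / n) + 2 * k

# Hypothetical numbers: the larger model must reduce RSS enough
# to pay for its four extra parameters.
aic_small = gaussian_aic(rss=5000.0, n=1000, k=5)
aic_large = gaussian_aic(rss=4990.0, n=1000, k=9)
print("prefer larger model" if aic_large < aic_small else "prefer smaller model")
```

Lower AIC is better, and values are only comparable between models fitted to the same label on the same rows, so log-output and normal-output models cannot be ranked against each other this way.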

In [7]:
print("** Log-transformed inputs, normal output **")
inputCols = ["total_spend_log","total_events_log","purchase_events_log","total_sessions_log"]
outputCol = "T_total_spend"

pipeline = generatePipeline(inputCols, outputCol)
pipelineModel = pipeline.fit(trainDF)

print("Model coefficients")
print(modelInfo(inputCols, pipelineModel))

evaluationMetrics = getEvaluationMetrics(pipelineModel,outputCol,testDF)
print(f"RMSE is {evaluationMetrics[0]:.1f}")
print(f"R^2 is {evaluationMetrics[1]:.5f}")

** Log-transformed inputs, normal output **
Model coefficients
           Column name    Coefficient
0            intercept -116177.649253
1      total_spend_log   10395.598238
2     total_events_log   28923.053074
3  purchase_events_log   -5434.010726
4   total_sessions_log  -17050.958910
+------------------+-------------------+
|     T_total_spend|         prediction|
+------------------+-------------------+
|24301.800384521484|  55535.42344065395|
| 185378.8254852295|  91825.59319505643|
|  91750.2392578125|  73658.04613299514|
|               0.0|  7135.450385168835|
|               0.0|-19225.857843683072|
|               0.0| 11704.102485638228|
|               0.0| -6121.923113683035|
|21416.200317382812| 18462.667550880637|
|               0.0|-18658.426169930026|
|               0.0| -9847.713276101349|
+------------------+-------------------+
only showing top 10 rows

RMSE is 62775.3
R^2 is 0.27646


In [8]:
print("** Log-transformed inputs, log-transformed output **")
inputCols = ["total_spend_log","total_events_log","purchase_events_log","total_sessions_log"]
outputCol = "T_total_spend_log"

pipeline = generatePipeline(inputCols, outputCol)
pipelineModel = pipeline.fit(trainDF)

print("Model coefficients")
print(modelInfo(inputCols, pipelineModel))

evaluationMetrics = getEvaluationMetrics(pipelineModel,outputCol,testDF)
print(f"RMSE is {evaluationMetrics[0]:.1f}")
print(f"R^2 is {evaluationMetrics[1]:.5f}")

** Log-transformed inputs, log-transformed output **
Model coefficients
           Column name  Coefficient
0            intercept   -18.781989
1      total_spend_log     1.260336
2     total_events_log     3.992153
3  purchase_events_log    -1.067006
4   total_sessions_log    -2.855866
+------------------+-------------------+
| T_total_spend_log|         prediction|
+------------------+-------------------+
|10.098305757631335| 2.3834235267380315|
|12.130156721089524|  7.179141824126017|
|11.426825384527149|  4.676708432312527|
|-6.907755278982137|-3.6102758156974772|
|-6.907755278982137| -7.195849199330947|
|-6.907755278982137| -4.272291048918227|
|-6.907755278982137| -5.129443089026873|
| 9.971902985482062| -2.799605687517097|
|-6.907755278982137| -6.827290327714877|
|-6.907755278982137| -6.305145292235611|
+------------------+-------------------+
only showing top 10 rows

RMSE is 5.7
R^2 is 0.35707
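
The RMSE above is measured in log units, so it cannot be set side by side with the RMSE of the normal-output models. One way to make them comparable, sketched here under the assumption that simply inverting the transform is acceptable, is to map labels and predictions back with `exp(v) - 0.001` and recompute the error on the original scale. The pairs are copied from the first four rows of the table above; the helper names are illustrative.

```python
import math

OFFSET = 0.001  # offset used when the log columns were created

def to_original_scale(log_value):
    """Invert log(x + OFFSET)."""
    return math.exp(log_value) - OFFSET

def rmse(labels, predictions):
    squared_errors = [(y - p) ** 2 for y, p in zip(labels, predictions)]
    return math.sqrt(sum(squared_errors) / len(labels))

# (label, prediction) pairs on the log scale, first four rows of the table above.
log_pairs = [
    (10.098305757631335, 2.3834235267380315),
    (12.130156721089524, 7.179141824126017),
    (11.426825384527149, 4.676708432312527),
    (-6.907755278982137, -3.6102758156974772),
]

labels = [to_original_scale(y) for y, _ in log_pairs]
predictions = [to_original_scale(p) for _, p in log_pairs]
print(f"RMSE on the original scale for these rows: {rmse(labels, predictions):.1f}")
```

Four rows are far too few to draw conclusions from; in the notebook the same back-transformation would be applied to the whole `prediction` column before evaluating.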


In [9]:
print("** Normal inputs, log-transformed output **")
inputCols = ["total_spend","total_events","purchase_events","total_sessions"]
outputCol = "T_total_spend_log"

pipeline = generatePipeline(inputCols, outputCol)
pipelineModel = pipeline.fit(trainDF)

print("Model coefficients")
print(modelInfo(inputCols, pipelineModel))

evaluationMetrics = getEvaluationMetrics(pipelineModel,outputCol,testDF)
print(f"RMSE is {evaluationMetrics[0]:.1f}")
print(f"R^2 is {evaluationMetrics[1]:.5f}")

** Normal inputs, log-transformed output **
Model coefficients
       Column name  Coefficient
0        intercept    -3.885144
1      total_spend     0.000172
2     total_events     0.013070
3  purchase_events     0.051208
4   total_sessions    -0.037638
+------------------+-------------------+
| T_total_spend_log|         prediction|
+------------------+-------------------+
|10.098305757631335| -1.744206028543831|
|12.130156721089524| 11.824974714269732|
|11.426825384527149|-1.1171852665973976|
|-6.907755278982137|-3.6413991655287314|
|-6.907755278982137| -3.860818682181449|
|-6.907755278982137| -3.209436377269393|
|-6.907755278982137|-3.7279273845118133|
| 9.971902985482062| -3.379138591510047|
|-6.907755278982137| -3.701163904970816|
|-6.907755278982137| -3.757405882215644|
+------------------+-------------------+
only showing top 10 rows

RMSE is 6.5
R^2 is 0.18056


In [10]:
print("** Normal and log-transformed inputs, normal output **")
inputCols = ["total_spend","total_events","purchase_events","total_sessions",
             "total_spend_log","total_events_log","purchase_events_log","total_sessions_log"]
outputCol = "T_total_spend"

pipeline = generatePipeline(inputCols, outputCol)
pipelineModel = pipeline.fit(trainDF)
    
print("Model coefficients")
print(modelInfo(inputCols, pipelineModel))

evaluationMetrics = getEvaluationMetrics(pipelineModel,"T_total_spend",testDF)
print(f"RMSE is {evaluationMetrics[0]:.1f}")
print(f"R^2 is {evaluationMetrics[1]:.5f}")

** Normal and log-transformed inputs, normal output **
Model coefficients
           Column name  Coefficient
0            intercept -3587.239517
1          total_spend     4.504743
2         total_events   210.319487
3      purchase_events -2728.472375
4       total_sessions  -174.645844
5      total_spend_log  1777.220660
6     total_events_log -2704.662284
7  purchase_events_log   960.486488
8   total_sessions_log  -112.924031
+------------------+------------------+
|     T_total_spend|        prediction|
+------------------+------------------+
|24301.800384521484|27174.913520633276|
| 185378.8254852295| 243024.7014901357|
|  91750.2392578125| 51692.94063729918|
|               0.0| 113.2267734650818|
|               0.0|-1891.432901908523|
|               0.0|-7867.924680837309|
|               0.0|-2077.989268229793|
|21416.200317382812| 9820.642217604262|
|               0.0|-5448.053764195429|
|               0.0|-2246.360237295643|
+------------------+------------------+
only s

### Test out different combinations of hyperparameters

Here we test out 25 combinations of hyperparameters under cross validation.

- Regularization parameter (`regParam`): [0, 0.01, 0.2, 1, 10]
- Elastic-net mixing parameter, i.e. the amount of LASSO-ness (`elasticNetParam`): [0, 0.25, 0.5, 0.75, 1]

We find that the best combination is a regularization parameter of 0.2 with an amount of LASSO-ness of 1 (L1 regularization only). Because the R^2 and RMSE (`RMSE is 106998.3, R^2 is 0.91839`) are very close to those of the previous models, it is unclear whether this is a genuine improvement.
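
To illustrate what the grid covers, this sketch enumerates the 25 combinations and evaluates the elastic-net penalty that the two parameters control, `regParam * (alpha * ||w||_1 + (1 - alpha) / 2 * ||w||_2^2)` with `alpha = elasticNetParam`. The function and the example weights are illustrative, and the fold count of 4 is taken from the `numFolds` setting in the cell below.

```python
from itertools import product

reg_params = [0, 0.01, 0.2, 1, 10]
elastic_net_params = [0, 0.25, 0.5, 0.75, 1]
num_folds = 4  # numFolds used by the CrossValidator in this notebook

grid = list(product(reg_params, elastic_net_params))
print(len(grid))              # 25 parameter combinations
print(len(grid) * num_folds)  # 100 model fits during cross-validation

def elastic_net_penalty(weights, reg_param, elastic_net_param):
    """Penalty added to the loss: a mix of L1 (lasso) and L2 (ridge) terms."""
    l1 = sum(abs(w) for w in weights)
    l2_squared = sum(w * w for w in weights)
    return reg_param * (elastic_net_param * l1 + (1 - elastic_net_param) / 2 * l2_squared)

# Illustrative weights, not taken from the fitted model.
print(elastic_net_penalty([1.0, -2.0], reg_param=0.2, elastic_net_param=1))  # pure L1
print(elastic_net_penalty([1.0, -2.0], reg_param=0.2, elastic_net_param=0))  # pure L2
```

With `regParam = 0` the mixing parameter has no effect, so five of the 25 combinations describe the same unregularized model.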


In [14]:
%%time
print("** Normal and log-transformed inputs, normal output **")
inputCols = ["total_spend","total_events","purchase_events","total_sessions",
             "total_spend_log","total_events_log","purchase_events_log","total_sessions_log"]
pipeline = generatePipeline(inputCols, "T_total_spend")
pipelineModel = pipeline.fit(trainDF)


# The pipeline is rebuilt by hand here because the parameter grid needs a reference to `lr`,
# which generatePipeline() does not return (pipeline.getStages()[-1] would also provide it).
vecAssembler = VectorAssembler(inputCols=inputCols, outputCol="features")

# Select output column for linear regression
lr = LinearRegression(featuresCol="features", labelCol="T_total_spend")

pipeline = Pipeline(stages=[vecAssembler, lr])
    
paramGrid = ParamGridBuilder() \
    .addGrid(lr.regParam, [0, 0.01, 0.2, 1, 10]) \
    .addGrid(lr.elasticNetParam, [0, 0.25, 0.5, 0.75, 1]) \
    .build()

crossval = CrossValidator(estimator=pipeline,
                          estimatorParamMaps=paramGrid,
                          evaluator=RegressionEvaluator().setLabelCol("T_total_spend"),
                          numFolds=4)

# Run cross-validation, and choose the best set of parameters.
cvModel = crossval.fit(trainDF)

** Normal and log-transformed inputs, normal output **
CPU times: user 4.09 s, sys: 1.23 s, total: 5.32 s
Wall time: 33.3 s


In [15]:
print("Best model coefficients")
print(modelInfo(inputCols, cvModel.bestModel))

evaluationMetrics = getEvaluationMetrics(cvModel.bestModel,"T_total_spend",testDF)
print(f"RMSE is {evaluationMetrics[0]:.1f}")
print(f"R^2 is {evaluationMetrics[1]:.5f}")

print()
# RegressionEvaluator defaults to RMSE, where smaller is better, so the best map is at argmin.
print(cvModel.getEstimatorParamMaps()[np.argmin(cvModel.avgMetrics)])

Best model coefficients
           Column name  Coefficient
0            intercept -3680.922612
1          total_spend     4.503478
2         total_events   210.146125
3      purchase_events -2710.056879
4       total_sessions  -174.173339
5      total_spend_log  1777.280332
6     total_events_log -2667.909874
7  purchase_events_log   870.525546
8   total_sessions_log  -124.665755
+------------------+-------------------+
|     T_total_spend|         prediction|
+------------------+-------------------+
|24301.800384521484|  27191.99400658255|
| 185378.8254852295| 242924.34106940366|
|  91750.2392578125| 51667.672084040976|
|               0.0| 138.42861747800907|
|               0.0|-1892.8861593576212|
|               0.0|  -7890.78345866171|
|               0.0| -2064.496748817861|
|21416.200317382812|   9793.75983382773|
|               0.0|-5484.0595850186555|
|               0.0|-2282.0977534489493|
+------------------+-------------------+
only showing top 10 rows

RMSE is 45247.8
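
When looking up the winning parameter map from the averaged cross-validation metrics, the direction of the metric matters: `RegressionEvaluator` uses RMSE by default, for which smaller is better, while for R^2 larger is better. A minimal sketch of the selection logic, with invented metric values:

```python
def best_param_index(avg_metrics, larger_is_better):
    """Index of the best entry in avg_metrics for the given metric direction."""
    pick = max if larger_is_better else min
    return pick(range(len(avg_metrics)), key=lambda i: avg_metrics[i])

# Hypothetical cross-validated RMSE values, one per parameter map.
hypothetical_avg_rmse = [45300.0, 45250.0, 45900.0]
print(best_param_index(hypothetical_avg_rmse, larger_is_better=False))  # 1
```

In PySpark the direction can be read from the evaluator itself via `isLargerBetter()`, which avoids hard-coding it.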


### Save pipeline model and get model size

The model size is 13.9 kB, according to the file explorer in Linux.

In [16]:
pipelinePath = "models/lr-pipeline-model_NewData"
cvModel.bestModel.write().overwrite().save(pipelinePath)
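
The model size can also be measured programmatically instead of reading it from the file explorer. This sketch assumes the model was saved to the local file system (not HDFS or cloud storage) and sums the sizes of all files under the model directory; `directory_size_bytes` is a helper introduced here for illustration.

```python
import os

def directory_size_bytes(path):
    """Total size in bytes of all files below path (0 if path does not exist)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total

pipelinePath = "models/lr-pipeline-model_NewData"
print(f"{directory_size_bytes(pipelinePath) / 1000:.1f} kB")
```

The result may differ slightly from the file explorer's figure depending on whether it reports units of 1000 or 1024 bytes and whether it counts disk blocks rather than file contents.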

### New models using more features

In [17]:
trainDF.printSchema()

root
 |-- user_id: integer (nullable = true)
 |-- T_total_spend: double (nullable = true)
 |-- total_spend: double (nullable = true)
 |-- total_events: long (nullable = true)
 |-- total_sessions: long (nullable = true)
 |-- avg_session_length: double (nullable = true)
 |-- sd_session_length: double (nullable = true)
 |-- avg_interactions_per_session: double (nullable = true)
 |-- sd_interactions_per_session: double (nullable = true)
 |-- max_interactions_per_session: long (nullable = true)
 |-- purchase_pct_of_total_events: double (nullable = true)
 |-- view_pct_of_total_events: double (nullable = true)
 |-- cart_pct_of_total_events: double (nullable = true)
 |-- avg_purchases_per_session: double (nullable = true)
 |-- sd_purchases_per_session: double (nullable = true)
 |-- cart_events: long (nullable = true)
 |-- purchase_events: long (nullable = true)
 |-- view_events: long (nullable = true)
 |-- sessions_with_purchase: long (nullable = true)
 |-- sessions_with_cart: long (nullable =

In [18]:
print("** All normal inputs, normal output **")
inputCols = ["total_spend","total_events","purchase_events", "total_sessions", "avg_session_length", "avg_interactions_per_session", "max_interactions_per_session",
             "purchase_pct_of_total_events", "view_pct_of_total_events", "cart_pct_of_total_events","avg_purchases_per_session", "cart_events", "purchase_events",
            "view_events", "sessions_with_purchase", "sessions_with_cart","sessions_with_view", "pct_sessions_end_purchase", "pct_sessions_end_cart"]

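# Excluded: sd_session_length, sd_interactions_per_session and sd_purchases_per_session make the fit fail
# (possibly because they contain nulls, which VectorAssembler rejects by default).
# Note: "purchase_events" is listed twice above; the second copy receives a zero coefficient.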
outputCol = "T_total_spend"

pipeline = generatePipeline(inputCols, outputCol)
pipelineModel = pipeline.fit(trainDF)
    
print("Model coefficients")
print(modelInfo(inputCols, pipelineModel))

evaluationMetrics = getEvaluationMetrics(pipelineModel,"T_total_spend",testDF)
print(f"RMSE is {evaluationMetrics[0]:.1f}")
print(f"R^2 is {evaluationMetrics[1]:.5f}")

** All normal inputs, normal output **
Model coefficients
                     Column name   Coefficient
0                      intercept -33691.770695
1                    total_spend      4.313394
2                   total_events    222.520976
3                purchase_events   -631.034384
4                 total_sessions   1564.266372
5             avg_session_length     -0.548247
6   avg_interactions_per_session    223.372803
7   max_interactions_per_session   -202.459437
8   purchase_pct_of_total_events  27481.992632
9       view_pct_of_total_events  35312.933177
10      cart_pct_of_total_events  27490.727449
11     avg_purchases_per_session  -3905.745836
12                   cart_events     77.242054
13               purchase_events      0.000000
14                   view_events   -135.331556
15        sessions_with_purchase  -4093.842066
16            sessions_with_cart   1408.032444
17            sessions_with_view  -1354.217348
18     pct_sessions_end_purchase   7225.541938
19

In [19]:
# Create a function to view correlation matrices
# https://stackoverflow.com/questions/52214404/how-to-get-the-correlation-matrix-of-a-pyspark-data-frame
def generateCorrMatrix(inputCols, dataframe):
    # Assemble the input columns into a single vector column
    vector_col = 'corr_features'
    corrAssembler = VectorAssembler(inputCols=inputCols, outputCol=vector_col)
    df_vector = corrAssembler.transform(dataframe).select(vector_col)

    # Compute the Pearson correlation matrix and convert it to a labelled pandas DataFrame
    matrix = Correlation.corr(df_vector, vector_col)
    result = matrix.collect()[0]["pearson({})".format(vector_col)].values
    readable = pd.DataFrame(result.reshape(-1, len(inputCols)), columns=inputCols, index=inputCols)

    return readable

In [20]:
inputCols = ["total_spend","total_events","purchase_events", "total_sessions", "avg_session_length", "avg_interactions_per_session", "max_interactions_per_session",
             "purchase_pct_of_total_events", "view_pct_of_total_events", "cart_pct_of_total_events","avg_purchases_per_session", "cart_events", "purchase_events",
            "view_events", "sessions_with_purchase", "sessions_with_cart","sessions_with_view", "pct_sessions_end_purchase", "pct_sessions_end_cart"]

generateCorrMatrix(inputCols, trainDF)

Unnamed: 0,total_spend,total_events,purchase_events,total_sessions,avg_session_length,avg_interactions_per_session,max_interactions_per_session,purchase_pct_of_total_events,view_pct_of_total_events,cart_pct_of_total_events,avg_purchases_per_session,cart_events,purchase_events.1,view_events,sessions_with_purchase,sessions_with_cart,sessions_with_view,pct_sessions_end_purchase,pct_sessions_end_cart
total_spend,1.0,0.56775,0.434259,0.131699,0.026031,-0.000169,0.036287,0.083838,-0.089142,0.07489,0.12097,0.315624,0.434259,0.060493,0.449563,0.383779,0.130005,0.078247,0.009015
total_events,0.56775,1.0,0.33529,0.451657,0.109332,0.154336,0.386002,-0.188944,0.193785,-0.157966,-0.105568,0.370126,0.33529,0.516343,0.360146,0.425592,0.451405,-0.170532,0.053071
purchase_events,0.434259,0.33529,1.0,0.338797,0.004958,0.067084,0.165092,0.098018,-0.097714,0.077668,0.229013,0.723629,1.0,0.24209,0.943705,0.804674,0.339058,0.097851,0.00744
total_sessions,0.131699,0.451657,0.338797,1.0,0.106648,-0.039435,0.400709,-0.416238,0.430069,-0.352808,-0.394018,0.43382,0.338797,0.767029,0.376308,0.559079,0.998607,-0.499987,0.080642
avg_session_length,0.026031,0.109332,0.004958,0.106648,1.0,0.179003,0.162756,-0.115486,0.101956,-0.071485,-0.016802,0.073513,0.004958,0.146593,0.002718,0.058192,0.099585,-0.046768,0.022698
avg_interactions_per_session,-0.000169,0.154336,0.067084,-0.039435,0.179003,1.0,0.630542,-0.285883,0.232477,-0.146687,0.269083,0.160399,0.067084,0.3112,0.006263,0.001274,-0.036203,0.138301,-0.016858
max_interactions_per_session,0.036287,0.386002,0.165092,0.400709,0.162756,0.630542,1.0,-0.460342,0.455324,-0.359308,-0.172021,0.312408,0.165092,0.749668,0.132378,0.230559,0.402686,-0.308314,0.050518
purchase_pct_of_total_events,0.083838,-0.188944,0.098018,-0.416238,-0.115486,-0.285883,-0.460342,1.0,-0.834152,0.544971,0.656895,-0.145894,0.098018,-0.433315,0.075551,-0.112927,-0.416702,0.758016,-0.246265
view_pct_of_total_events,-0.089142,0.193785,-0.097714,0.430069,0.101956,0.232477,0.455324,-0.834152,1.0,-0.917025,-0.618391,-0.077817,-0.097714,0.476743,-0.081413,0.016925,0.434125,-0.722507,-0.022181
cart_pct_of_total_events,0.07489,-0.157966,0.077668,-0.352808,-0.071485,-0.146687,-0.359308,0.544971,-0.917025,1.0,0.465074,0.223799,0.077668,-0.411414,0.069133,0.055931,-0.358639,0.550231,0.211801


#### Highly correlated feature pairs
* sessions with view / total sessions: 0.998
* cart events / purchase events: 0.723
* sessions with purchase / purchase events: 0.944
* sessions with cart / purchase events: 0.805
* view events / total sessions: 0.767
* view events / max interactions per session: 0.750
* view pct of total events / purchase pct of total events: -0.83
* pct sessions end purchase / purchase pct of total events: 0.758
* pct sessions end purchase / view pct of total events: -0.722
* cart pct of total events / view pct of total events: -0.917
* pct sessions end purchase / avg purchases per session: 0.83
* sessions with purchase / cart events: 0.70
* sessions with cart / cart events: 0.816
* sessions with view / view events: 0.77
* sessions with cart / sessions with purchase: 0.86
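
A list like the one above can be extracted automatically instead of being read off the matrix by eye. The sketch below scans the upper triangle of a correlation matrix and returns the pairs whose absolute correlation reaches a threshold. The function name and the default threshold of 0.7 are assumptions; the small matrix is an excerpt of the first four rows and columns of the output above, and a lower threshold is used in the demo so that some pairs are returned.

```python
def highly_correlated_pairs(columns, matrix, threshold=0.7):
    """Return (col_a, col_b, r) for each pair with |r| >= threshold, strongest first."""
    pairs = []
    for i in range(len(columns)):
        for j in range(i + 1, len(columns)):  # upper triangle only: skips the diagonal and mirrored pairs
            r = matrix[i][j]
            if abs(r) >= threshold:
                pairs.append((columns[i], columns[j], r))
    return sorted(pairs, key=lambda pair: -abs(pair[2]))

# Excerpt of the correlation matrix printed above.
columns = ["total_spend", "total_events", "purchase_events", "total_sessions"]
matrix = [
    [1.0, 0.56775, 0.434259, 0.131699],
    [0.56775, 1.0, 0.33529, 0.451657],
    [0.434259, 0.33529, 1.0, 0.338797],
    [0.131699, 0.451657, 0.338797, 1.0],
]

for col_a, col_b, r in highly_correlated_pairs(columns, matrix, threshold=0.4):
    print(f"{col_a} / {col_b}: {r:.3f}")
```

With the pandas DataFrame returned by `generateCorrMatrix`, the same function could be called with `list(readable.columns)` and `readable.values.tolist()`.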


In [21]:
print("** All normal inputs, log output **")
inputCols = ["total_spend","total_events","purchase_events", "total_sessions", "avg_session_length", "avg_interactions_per_session", "max_interactions_per_session",
             "purchase_pct_of_total_events", "view_pct_of_total_events", "cart_pct_of_total_events","avg_purchases_per_session", "cart_events", "purchase_events",
            "view_events", "sessions_with_purchase", "sessions_with_cart","sessions_with_view", "pct_sessions_end_purchase", "pct_sessions_end_cart"]

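# Excluded: sd_session_length, sd_interactions_per_session and sd_purchases_per_session make the fit fail
# (possibly because they contain nulls, which VectorAssembler rejects by default).
# Note: "purchase_events" is listed twice above; the second copy receives a zero coefficient.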
outputCol = "T_total_spend_log"

pipeline = generatePipeline(inputCols, outputCol)
pipelineModel = pipeline.fit(trainDF)
    
print("Model coefficients")
print(modelInfo(inputCols, pipelineModel))

evaluationMetrics = getEvaluationMetrics(pipelineModel,"T_total_spend_log",testDF)
print(f"RMSE is {evaluationMetrics[0]:.1f}")
print(f"R^2 is {evaluationMetrics[1]:.5f}")

** All normal inputs, log output **
Model coefficients
                     Column name  Coefficient
0                      intercept   -98.210187
1                    total_spend     0.000118
2                   total_events     0.014945
3                purchase_events    -0.387410
4                 total_sessions     0.548049
5             avg_session_length    -0.000015
6   avg_interactions_per_session    -0.114620
7   max_interactions_per_session     0.026767
8   purchase_pct_of_total_events    91.747879
9       view_pct_of_total_events    94.483438
10      cart_pct_of_total_events    94.815845
11     avg_purchases_per_session     1.376313
12                   cart_events    -0.032164
13               purchase_events     0.000000
14                   view_events    -0.016075
15        sessions_with_purchase     0.831651
16            sessions_with_cart    -0.090700
17            sessions_with_view    -0.537831
18     pct_sessions_end_purchase    -1.032730
19         pct_sessions_e