## File 05 - Linear Regression
##### Group 12:

##### Hannah Schmuckler, mmc4cv

##### Rob Schwartz, res7cd
In this file, we create a small ML pipeline based on the output from File 02 (Feature creation).

We create a linear regression model, tune it, then compare it to a linear regression model created with downsampled data to see how performance compares. 

### Set up Spark session

We can specify more options in the SparkSession creator, but currently the options are at the default settings.

In [1]:
%%time
from pyspark.sql import SparkSession
from pyspark.sql import types as T
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler, StandardScaler
from pyspark.ml.regression import LinearRegression, LinearRegressionModel, LinearRegressionSummary
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
from pyspark.sql.functions import col
from pyspark.sql.functions import log
from pyspark.ml.stat import Correlation

import pandas as pd
import numpy as np
import copy

spark = SparkSession.builder \
        .appName("project") \
        .getOrCreate()

sc = spark.sparkContext

CPU times: user 394 ms, sys: 327 ms, total: 722 ms
Wall time: 4.49 s


### Read in dataframes for train and test sets

This data should have been previously generated: we can find it in the `processed_data` folder.

In [2]:
%%time
trainDF = spark.read.parquet("./processed_data/train.parquet")
testDF = spark.read.parquet("./processed_data/test.parquet")
# trainDF.show(5)

CPU times: user 2.6 ms, sys: 0 ns, total: 2.6 ms
Wall time: 2.3 s


In [3]:
trainDF.printSchema()

root
 |-- user_id: integer (nullable = true)
 |-- total_spend: double (nullable = true)
 |-- total_events: long (nullable = true)
 |-- total_sessions: long (nullable = true)
 |-- T_total_spend: double (nullable = true)
 |-- avg_session_length: double (nullable = true)
 |-- sd_session_length: double (nullable = true)
 |-- avg_interactions_per_session: double (nullable = true)
 |-- sd_interactions_per_session: double (nullable = true)
 |-- max_interactions_per_session: long (nullable = true)
 |-- purchase_pct_of_total_events: double (nullable = true)
 |-- view_pct_of_total_events: double (nullable = true)
 |-- cart_pct_of_total_events: double (nullable = true)
 |-- avg_purchases_per_session: double (nullable = true)
 |-- sd_purchases_per_session: double (nullable = true)
 |-- cart_events: long (nullable = true)
 |-- purchase_events: long (nullable = true)
 |-- view_events: long (nullable = true)
 |-- sessions_with_purchase: long (nullable = true)
 |-- sessions_with_cart: long (nullable =

### Set up Spark ML pipeline training for linear regression

Here we decide which input columns should be used in order to create our training pipeline. To implement this step, we create the function `generatePipeline(inputCols, outputCol)`. Then, we train the pipeline using this function.

In [4]:
%%time

def generatePipeline(inputCols, outputCol):
    # Select input columns for linear regression
    vecAssembler = VectorAssembler(inputCols=inputCols, outputCol="unscaled_features")
    # Scale features
    ss = StandardScaler(inputCol = "unscaled_features", outputCol = "features", withMean = True, withStd = True)
    # Select output column for linear regression
    lr = LinearRegression(featuresCol="features", labelCol=outputCol)

 
    pipeline = Pipeline(stages=[vecAssembler, ss, lr])
    return pipeline
    

CPU times: user 0 ns, sys: 2 µs, total: 2 µs
Wall time: 4.77 µs


### View the model information

Print out the model coefficients and view the pValues, RMSE and R^2. We define the functions `modelInfo(inputCols, pipelineModel)` and `getEvaluationMetrics(pipelineModel,outputCol,testDF)` to report this information.

In [5]:
def modelInfo(inputCols, pipelineModel):
    # Create a zipped list containing the coefficients and the data
    modelCols = copy.deepcopy(inputCols)
    modelCoeffs = list(pipelineModel.stages[-1].coefficients)
    modelCoeffs.insert(0,pipelineModel.stages[-1].intercept)
    modelCols.insert(0,"intercept")
    modelZippedList = list(map(list, zip(modelCols, modelCoeffs)))
    
    # Add in the p-values
    pvals = pipelineModel.stages[-1].summary.pValues
    
    # Create the pandas DataFrame
    modelDF = pd.DataFrame(modelZippedList, columns = ['Column name', 'Coefficient'])
    modelDF['pValues'] = pvals
    return modelDF

In [6]:
# Calculate adjusted r2 (https://towardsdatascience.com/machine-learning-linear-regression-using-pyspark-9d5d5c772b42)
# This function will allow us to calculate the adjusted r-square value when we do PCA later. The default r-square function does not take into account k. 
def adj_r2(r2, inputCols, testDF, k = 0):
    n = testDF.count()
    if k == 0:
        p = len(inputCols)
    else: 
        p = len(inputCols) + k - 1
    
    adjusted_r2 = 1-(((1-r2)*(n-1))/(n-p-1))
    return adjusted_r2

In [7]:
def getEvaluationMetrics(pipelineModel,outputCol,testDF,inputCols):
    predDF = pipelineModel.transform(testDF)
    print()
    print('type preddf' + str(type(predDF)))
    print()
    predDF.select(outputCol, "prediction").show(10)
    
    print(predDF)
    
    regressionEvaluator = RegressionEvaluator(
    predictionCol="prediction",
    labelCol=outputCol,
    metricName="rmse")
    rmse = regressionEvaluator.evaluate(predDF)

    regressionEvaluator = RegressionEvaluator(
    predictionCol="prediction",
    labelCol=outputCol,
    metricName="r2")
    r2 = regressionEvaluator.evaluate(predDF)
      
    # Manually calculate Adjusted r2
    adjusted_r2 = adj_r2(r2, inputCols, testDF)
    
    return rmse, r2, adjusted_r2


In [8]:
# Comprehensive
inputCols = ["total_spend", "total_events", "total_sessions", "avg_session_length", "sd_session_length", "avg_interactions_per_session", 
             "sd_interactions_per_session", "max_interactions_per_session", "purchase_pct_of_total_events", "view_pct_of_total_events",
             "cart_pct_of_total_events", "avg_purchases_per_session", "sd_purchases_per_session", "cart_events", "purchase_events", 
             "sessions_with_purchase", "sessions_with_cart", "sessions_with_view", "pct_sessions_end_purchase", 
             "pct_sessions_end_cart", "total_spend_log", "total_events_log", "purchase_events_log", "total_sessions_log",
             "avg_session_length_log"]

### There's likely collinearity between some of these features. Let's take a look at a correlation matrix:


In [9]:
#Create function to view correlation matrices
# https://stackoverflow.com/questions/52214404/how-to-get-the-correlation-matrix-of-a-pyspark-data-frame
def generateCorrMatrix(inputCols, dataframe):
    # Select input columns for Correlation Matrix & transform
    vector_col = 'corr_features'
    corrAssembler = VectorAssembler(inputCols=inputCols, outputCol=vector_col)
    df_vector = corrAssembler.transform(dataframe).select(vector_col)
    
    #get correlation matrix
    matrix = Correlation.corr(df_vector, vector_col)
    result = matrix.collect()[0]["pearson({})".format(vector_col)].values
    readable = pd.DataFrame(result.reshape(-1, len(inputCols)), columns=inputCols, index=inputCols)
    
    return readable

In [10]:
generateCorrMatrix(inputCols, trainDF)

Unnamed: 0,total_spend,total_events,total_sessions,avg_session_length,sd_session_length,avg_interactions_per_session,sd_interactions_per_session,max_interactions_per_session,purchase_pct_of_total_events,view_pct_of_total_events,...,sessions_with_purchase,sessions_with_cart,sessions_with_view,pct_sessions_end_purchase,pct_sessions_end_cart,total_spend_log,total_events_log,purchase_events_log,total_sessions_log,avg_session_length_log
total_spend,1.0,0.267576,0.297334,0.074837,0.099307,0.009718,0.038839,0.092811,0.107863,-0.117347,...,0.709111,0.626564,0.285545,0.088643,0.016506,0.487793,0.201547,0.514174,0.204204,0.07757
total_events,0.267576,1.0,0.789305,0.054993,0.111465,0.291275,0.526505,0.722697,-0.369753,0.380738,...,0.42281,0.533934,0.793259,-0.348755,0.077966,0.115701,0.736466,0.368157,0.635699,0.255801
total_sessions,0.297334,0.789305,1.0,0.082139,0.164686,-0.033912,0.237057,0.397175,-0.369811,0.365643,...,0.482276,0.638601,0.993768,-0.448364,0.101794,0.14612,0.694987,0.386997,0.782182,0.181336
avg_session_length,0.074837,0.054993,0.082139,1.0,0.834726,-0.008058,0.018618,0.028854,-0.037597,-0.026544,...,0.075846,0.127129,0.065021,-0.045,0.082164,0.04435,0.073224,0.071939,0.085964,0.570257
sd_session_length,0.099307,0.111465,0.164686,0.834726,1.0,-0.029591,0.034662,0.054289,-0.072183,0.001362,...,0.111145,0.19997,0.137877,-0.096415,0.11391,0.055072,0.132356,0.103713,0.163273,0.685703
avg_interactions_per_session,0.009718,0.291275,-0.033912,-0.008058,-0.029591,1.0,0.555439,0.629323,-0.27495,0.229046,...,0.003659,-0.008019,-0.028986,0.122442,-0.01966,-0.013982,0.351459,0.083284,-0.087016,0.320714
sd_interactions_per_session,0.038839,0.526505,0.237057,0.018618,0.034662,0.555439,1.0,0.873582,-0.474157,0.454473,...,0.075191,0.117955,0.242911,-0.388312,0.082515,-0.017801,0.628156,0.159523,0.383093,0.268427
max_interactions_per_session,0.092811,0.722697,0.397175,0.028854,0.054289,0.629323,0.873582,1.0,-0.451875,0.44254,...,0.155269,0.220849,0.40339,-0.306974,0.054627,0.017614,0.695921,0.22367,0.430422,0.311904
purchase_pct_of_total_events,0.107863,-0.369753,-0.369811,-0.037597,-0.072183,-0.27495,-0.474157,-0.451875,1.0,-0.828854,...,0.078746,-0.0712,-0.374835,0.769086,-0.2523,0.209643,-0.720444,0.11524,-0.613472,-0.296796
view_pct_of_total_events,-0.117347,0.380738,0.365643,-0.026544,0.001362,0.229046,0.454473,0.44254,-0.828854,1.0,...,-0.090975,-0.021904,0.380505,-0.716757,-0.030546,-0.230949,0.652967,-0.130654,0.578633,0.206277


### Test out different combinations of features and outputs

Here we test out models in order to find the best combination of features.

In [11]:
print("** All original inputs, original output **")
# For comparison, with all inputs. If you use T_total_spend_log for this one as the response instead, the adjusted R^2 is about .1
# Comprehensive
inputCols = ["total_spend", "total_events", "total_sessions", "avg_session_length", "sd_session_length", "avg_interactions_per_session", 
             "sd_interactions_per_session", "max_interactions_per_session", "purchase_pct_of_total_events", "view_pct_of_total_events",
             "cart_pct_of_total_events", "avg_purchases_per_session", "sd_purchases_per_session", "cart_events", "purchase_events", 
             "sessions_with_purchase", "sessions_with_cart", "sessions_with_view", "pct_sessions_end_purchase", 
             "pct_sessions_end_cart", "total_spend_log", "total_events_log", "purchase_events_log", "total_sessions_log",
             "avg_session_length_log"]

outputCol = "T_total_spend"

pipeline = generatePipeline(inputCols, outputCol)
pipelineModel = pipeline.fit(trainDF)

print("Model coefficients")
print(modelInfo(inputCols, pipelineModel))

evaluationMetrics = getEvaluationMetrics(pipelineModel,outputCol,testDF, inputCols)
print(f"RMSE is {evaluationMetrics[0]:.1f}")
print(f"R^2 is {evaluationMetrics[1]:.5f}")
print(f"Adjusted R^2 is {evaluationMetrics[2]:.5f}")

** All original inputs, original output **
Model coefficients
                     Column name  Coefficient       pValues
0                      intercept   423.176077  0.000000e+00
1                    total_spend   959.039237  1.100538e-01
2                   total_events   -18.103645  0.000000e+00
3                 total_sessions   729.648895  7.544119e-02
4             avg_session_length   -13.818762  3.463034e-10
5              sd_session_length    59.255441  3.439369e-01
6   avg_interactions_per_session    -8.792587  2.841095e-01
7    sd_interactions_per_session   -12.175377  7.300872e-02
8   max_interactions_per_session    24.406808  1.000000e+00
9   purchase_pct_of_total_events   157.067100  9.999936e-01
10      view_pct_of_total_events   392.046266  1.000000e+00
11      cart_pct_of_total_events   220.883964  0.000000e+00
12     avg_purchases_per_session  -134.142461  4.582408e-05
13      sd_purchases_per_session   -34.303431  3.564709e-10
14                   cart_events    72

In [12]:
# Redo, iteratively removing & adding features to find the best combination. In the real world, 
# sacraficing a fraction of a percent of performance to dramatically reduce the number of features can be very useful. 
# Intermediate stages not shown because this was all done in the same cell. 

### Put this through cv below. It chose regparam = 0, sending some of these to 0, so they were removed from the final model. 
print("** Champion features, original output **")
inputCols = ["total_spend", "total_events", "total_sessions", "avg_session_length", "sd_session_length",
             "cart_pct_of_total_events", "avg_purchases_per_session", "sd_purchases_per_session", "cart_events", "purchase_events", 
             "sessions_with_purchase", "sessions_with_cart", "sessions_with_view", "pct_sessions_end_purchase", 
             "pct_sessions_end_cart", "total_spend_log", "total_events_log", "purchase_events_log", "total_sessions_log"]
#"view_events" breaks it - probably because it creates a linear combination when added :)
outputCol = "T_total_spend"

pipeline = generatePipeline(inputCols, outputCol)
pipelineModel = pipeline.fit(trainDF)

print("Model coefficients")
print(modelInfo(inputCols, pipelineModel))

evaluationMetrics = getEvaluationMetrics(pipelineModel,outputCol,testDF, inputCols)
print(f"RMSE is {evaluationMetrics[0]:.1f}")
print(f"R^2 is {evaluationMetrics[1]:.5f}")
print(f"Adjusted R^2 is {evaluationMetrics[2]:.5f}")

** Champion features, original output **
Model coefficients
                  Column name  Coefficient       pValues
0                   intercept   423.176077  0.000000e+00
1                 total_spend   958.357005  8.912611e-02
2                total_events   -14.848439  0.000000e+00
3              total_sessions   722.700369  7.938251e-02
4          avg_session_length   -13.635267  3.129694e-07
5           sd_session_length    40.949070  1.898210e-07
6    cart_pct_of_total_events   -38.377257  0.000000e+00
7   avg_purchases_per_session  -141.221003  2.229956e-06
8    sd_purchases_per_session   -37.828465  1.185112e-10
9                 cart_events    74.626408  0.000000e+00
10            purchase_events   175.317209  3.190048e-11
11     sessions_with_purchase  -142.707805  0.000000e+00
12         sessions_with_cart   174.569769  0.000000e+00
13         sessions_with_view  -717.313050  3.410221e-01
14  pct_sessions_end_purchase    17.228519  2.475769e-07
15      pct_sessions_end_car

In [13]:
# Turns out, almost all of our performance is given by total_spend
print("** One input, original output **")
inputCols = ["total_spend"]
             
#"view_events" breaks it - probably because it creates a linear combination when added :)
outputCol = "T_total_spend"

pipeline = generatePipeline(inputCols, outputCol)
pipelineModel = pipeline.fit(trainDF)

print("Model coefficients")
print(modelInfo(inputCols, pipelineModel))

evaluationMetrics = getEvaluationMetrics(pipelineModel,outputCol,testDF, inputCols)
print(f"RMSE is {evaluationMetrics[0]:.1f}")
print(f"R^2 is {evaluationMetrics[1]:.5f}")
print(f"Adjusted R^2 is {evaluationMetrics[2]:.5f}")

** One input, original output **
Model coefficients
   Column name  Coefficient  pValues
0    intercept   423.176077      0.0
1  total_spend  1127.862791      0.0

type preddf<class 'pyspark.sql.dataframe.DataFrame'>

+------------------+------------------+
|     T_total_spend|        prediction|
+------------------+------------------+
|               0.0| 88.44964229068427|
|               0.0| 59.72016388535616|
|               0.0| 48.47638984971593|
|               0.0|129.85495440029206|
|               0.0| 166.4244580091202|
|               0.0| 57.01169580682296|
| 312.4200134277344| 97.79956244132069|
|  6009.78010559082| 698.6332202129912|
|1627.1900024414062| 74.39881482764747|
|1393.8800354003906|128.29836363018774|
+------------------+------------------+
only showing top 10 rows

DataFrame[user_id: int, total_spend: double, total_events: bigint, total_sessions: bigint, T_total_spend: double, avg_session_length: double, sd_session_length: double, avg_interactions_per_sessio

### Hyperparameter tuning on best features:

In [14]:
# pvalues breaks when used with this so taking that out
def modelInfo2(inputCols, pipelineModel):
    # Create a zipped list containing the coefficients and the data
    modelCols = copy.deepcopy(inputCols)
    modelCoeffs = list(pipelineModel.stages[-1].coefficients)
    modelCoeffs.insert(0,pipelineModel.stages[-1].intercept)
    modelCols.insert(0,"intercept")
    modelZippedList = list(map(list, zip(modelCols, modelCoeffs)))
        
    # Create the pandas DataFrame
    modelDF = pd.DataFrame(modelZippedList, columns = ['Column name', 'Coefficient'])
    
    return modelDF

# CHAMPION MODEL
### Test out different combinations of hyperparameters

Here we test out 25 combinations of hyperparameters on our best linear regression model under cross validation.

- Regularization parameters: [0, 0.01, 0.2, 1, 10]
- Amount of LASSO-ness: [0, 0.25, 0.5, 0.75, 1]


In [15]:
%%time
print("** Champion Model **")
inputCols = ["total_spend", "sd_session_length",
             "cart_pct_of_total_events", "avg_purchases_per_session", "cart_events", "purchase_events", 
             "sessions_with_cart", "sessions_with_view", 
             "pct_sessions_end_cart", "total_events_log"]

outputCol = "T_total_spend"
# Below creation of pipeline is necessary for crossval to run. I wonder if there's a way to get it to run on the generate pipeline fn? 
vecAssembler = VectorAssembler(inputCols=inputCols, outputCol="unscaled_features")
#Scale features
ss = StandardScaler(inputCol = "unscaled_features", outputCol = "features", withMean = True, withStd = True)
# Select output column for linear regression
lr = LinearRegression(featuresCol="features", labelCol="T_total_spend")

pipeline = Pipeline(stages=[vecAssembler, ss, lr])
    
paramGrid = ParamGridBuilder() \
    .addGrid(lr.regParam, [0, 0.01, 0.2, 1, 10]) \
    .addGrid(lr.elasticNetParam, [0, 0.25, 0.5, 0.75, 1]) \
    .build()

crossval = CrossValidator(estimator=pipeline,
                          estimatorParamMaps=paramGrid,
                          evaluator=RegressionEvaluator().setLabelCol("T_total_spend"),
                          numFolds=4)

# Run cross-validation, and choose the best set of parameters.
cvModel = crossval.setParallelism(8).fit(trainDF)

** Champion Model **
CPU times: user 3.94 s, sys: 806 ms, total: 4.75 s
Wall time: 2min 21s


In [16]:
print("Best model coefficients")
print(modelInfo2(inputCols, cvModel.bestModel))


evaluationMetrics = getEvaluationMetrics(cvModel.bestModel,"T_total_spend",testDF,inputCols)
print(f"RMSE is {evaluationMetrics[0]:.1f}")
print(f"R^2 is {evaluationMetrics[1]:.5f}")
print(f"Adjusted R^2 is {evaluationMetrics[2]:.5f}")

print()

print(cvModel.getEstimatorParamMaps()[np.argmax(cvModel.avgMetrics)])
print()
print(cvModel.avgMetrics)

Best model coefficients
                  Column name  Coefficient
0                   intercept   423.176077
1                 total_spend   945.620418
2           sd_session_length    37.008892
3    cart_pct_of_total_events   -29.968641
4   avg_purchases_per_session   -59.315484
5                 cart_events    98.199506
6             purchase_events    17.886250
7          sessions_with_cart   214.567903
8          sessions_with_view   -29.259443
9       pct_sessions_end_cart   -36.063384
10           total_events_log   -74.075208

type preddf<class 'pyspark.sql.dataframe.DataFrame'>

+------------------+-------------------+
|     T_total_spend|         prediction|
+------------------+-------------------+
|               0.0|  96.93146014143309|
|               0.0|  71.14974932833758|
|               0.0| -6.397339358116994|
|               0.0| 158.54518067553505|
|               0.0| 16.040471542388445|
|               0.0|0.21607380805846788|
| 312.4200134277344| -32.53853729089

### Single predictor CV

In [17]:
%%time
print("** Single Predictor **")
inputCols = ["total_spend"]

outputCol = "T_total_spend"
# Below creation of pipeline is necessary for crossval to run. I wonder if there's a way to get it to run on the generate pipeline fn? 
vecAssembler = VectorAssembler(inputCols=inputCols, outputCol="features")
#ss = StandardScaler(inputCol = "inputCols", outputCol = "features", withMean = True, withStd = False) #Doesn't work with one feature :)
# Select output column for linear regression
lr = LinearRegression(featuresCol="features", labelCol="T_total_spend")

pipeline = Pipeline(stages=[vecAssembler, lr])
    
paramGrid = ParamGridBuilder() \
    .addGrid(lr.regParam, [0, 0.01, 0.2, 1, 10]) \
    .addGrid(lr.elasticNetParam, [0, 0.25, 0.5, 0.75, 1]) \
    .build()

crossval = CrossValidator(estimator=pipeline,
                          estimatorParamMaps=paramGrid,
                          evaluator=RegressionEvaluator().setLabelCol("T_total_spend"),
                          numFolds=4)

# Run cross-validation, and choose the best set of parameters.
cvModel = crossval.setParallelism(8).fit(trainDF)

** Single Predictor **
CPU times: user 2.84 s, sys: 578 ms, total: 3.42 s
Wall time: 2min 24s


In [18]:
print("Single Predictor coefficients")
print(modelInfo2(inputCols, cvModel.bestModel))


evaluationMetrics = getEvaluationMetrics(cvModel.bestModel,"T_total_spend",testDF,inputCols)
print(f"RMSE is {evaluationMetrics[0]:.1f}")
print(f"R^2 is {evaluationMetrics[1]:.5f}")
print(f"Adjusted R^2 is {evaluationMetrics[2]:.5f}")

print()

print(cvModel.getEstimatorParamMaps()[np.argmax(cvModel.avgMetrics)])
print()
print(cvModel.avgMetrics)

Single Predictor coefficients
   Column name  Coefficient
0    intercept    47.187462
1  total_spend     0.516834

type preddf<class 'pyspark.sql.dataframe.DataFrame'>

+------------------+------------------+
|     T_total_spend|        prediction|
+------------------+------------------+
|               0.0| 89.75906626702661|
|               0.0|61.141975363855934|
|               0.0|49.942186104439656|
|               0.0| 131.0024039913568|
|               0.0| 167.4288505245308|
|               0.0| 58.44410260340425|
| 312.4200134277344| 99.07241025166809|
|  6009.78010559082|  697.555653126959|
|1627.1900024414062| 75.76320455657108|
|1393.8800354003906|129.45190248420616|
+------------------+------------------+
only showing top 10 rows

DataFrame[user_id: int, total_spend: double, total_events: bigint, total_sessions: bigint, T_total_spend: double, avg_session_length: double, sd_session_length: double, avg_interactions_per_session: double, sd_interactions_per_session: double, m

## Other things we tried are below

### Save pipeline model and get model size
##### CHECK NEW SIZE
The model size is ____ kB, according to the file explorer in Linux.

In [19]:
pipelinePath = "models/lr-pipeline-model_NewData"
cvModel.bestModel.write().overwrite().save(pipelinePath)

#### Our data is extremely zero-inflated. If we downsample the data so our response variable =0 only about half the time, can we beat the model performance above? NO

In [20]:
%%time
trainDF_ds = spark.read.parquet("./processed_data/ds_train.parquet")
testDF_ds = spark.read.parquet("./processed_data/ds_test.parquet")

CPU times: user 2.57 ms, sys: 147 µs, total: 2.72 ms
Wall time: 452 ms


In [40]:
#"cart_pct_of_total_events" oddly this one breaks it? Not important, it's not a good predictor anyway

print("** Original + Log-transformed inputs, normal output **")
inputCols = ["total_spend", "total_events", "total_sessions", "avg_session_length", "sd_session_length", "avg_interactions_per_session", 
             "sd_interactions_per_session", "max_interactions_per_session", "purchase_pct_of_total_events", "view_pct_of_total_events",
             "avg_purchases_per_session", "sd_purchases_per_session", "cart_events", "purchase_events", 
             "sessions_with_purchase", "sessions_with_cart", "sessions_with_view", "pct_sessions_end_purchase", 
             "pct_sessions_end_cart", "total_spend_log", "total_events_log", "purchase_events_log", "total_sessions_log",
             "avg_session_length_log"]
outputCol = "T_total_spend"

pipeline = generatePipeline(inputCols, outputCol)
pipelineModel = pipeline.fit(trainDF_ds)

print("Model coefficients")
print(modelInfo(inputCols, pipelineModel))

evaluationMetrics = getEvaluationMetrics(pipelineModel,outputCol,testDF_ds, inputCols)
print(f"RMSE is {evaluationMetrics[0]:.1f}")
print(f"R^2 is {evaluationMetrics[1]:.5f}")
print(f"Adjusted R^2 is {evaluationMetrics[2]:.5f}")

** Original + Log-transformed inputs, normal output **
Model coefficients
                     Column name  Coefficient       pValues
0                      intercept   838.205909  0.000000e+00
1                    total_spend  1445.836981  3.480280e-02
2                   total_events   -46.516061  9.139583e-02
3                 total_sessions   177.628187  1.250111e-01
4             avg_session_length   -24.427766  1.396487e-05
5              sd_session_length    86.003322  9.969550e-01
6   avg_interactions_per_session    -0.068504  4.946507e-01
7    sd_interactions_per_session   -14.918812  1.063055e-01
8   max_interactions_per_session    41.050507  4.625457e-01
9   purchase_pct_of_total_events   -18.679707  1.962957e-03
10      view_pct_of_total_events    65.627334  0.000000e+00
11     avg_purchases_per_session  -191.164887  5.555238e-01
12      sd_purchases_per_session    -9.398940  2.142610e-04
13                   cart_events    89.807460  0.000000e+00
14               purchase_

#### Interestingly, downsampling the inflated 0s doesn't seem to help at all on the test set!