## File 05 - Basic Regression

In this file, we create a small ML pipeline based on the output from File 03 (Basic Preprocessed Output).
The files needed are `./processed_data/train.parquet` and `./processed_data/test.parquet`, which are the paths read by the code below.

The goal of this file is to provide:
- The type of model
- Best hyperparameters used
- Size of the saved model
- Performance metrics

### Set up Spark session

More options can be passed to the `SparkSession` builder; for now, everything except the application name is left at its default setting.

In [1]:
%%time
from pyspark.sql import SparkSession
from pyspark.sql import types as T
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
from pyspark.sql.functions import col, log

import pandas as pd
import numpy as np
import copy

spark = SparkSession.builder \
        .appName("project") \
        .getOrCreate()

sc = spark.sparkContext

CPU times: user 535 ms, sys: 415 ms, total: 950 ms
Wall time: 5.76 s


### Read in dataframes for train and test sets

This data should have been previously generated: we can find it in the `processed_data` folder.

In [5]:
%%time
trainDF = spark.read.parquet("./processed_data/train.parquet")
testDF = spark.read.parquet("./processed_data/test.parquet")
trainDF.show(5)

+---------+------------------+------------------+---------------+------------------+----------------+------------------+----------+------------------------+----------------------------+----------------------+----------------------------+----------------------------+------------------------+------------------------+-------------------------+-------------------------+----------------+--------------------+----------------+--------------------+---------------+---------------+-------------+------------+
|  user_id|    m2_total_spend|    m1_total_spend|m1_total_events|m1_purchase_events|m1_user_sessions|num_sessions_month|AvgSessLen|stddev_SessionLengthSecs|avg_interactions_per_session|stddev_int_per_session|max_interactions_one_session|purchase_pct_of_total_events|cart_pct_of_total_events|view_pct_of_total_events|avg_purchases_per_session|std_purchases_per_session|monthlyCartTotal|monthlyPurchaseTotal|monthlyViewTotal|NumSessWithPurchases|NumSessWithCart|NumSessWithView|ses_end_purch|ses_en

### Set up Spark ML pipeline training for linear regression

Here we decide which input columns should be used to create our training pipeline. To implement this step, we create the function `generatePipeline(inputCols, outputCol)`, which returns an unfitted pipeline. Then, we fit that pipeline on the training set.

In [6]:
%%time

inputCols = ["m1_total_spend","m1_total_events","m1_purchase_events","m1_user_sessions"]

def generatePipeline(inputCols, outputCol):
    # Select input columns for linear regression
    vecAssembler = VectorAssembler(inputCols=inputCols, outputCol="features")

    # Select output column for linear regression
    lr = LinearRegression(featuresCol="features", labelCol=outputCol)

    # Chaining the stages in a Pipeline replaces calling
    # vecAssembler.transform(trainDF) by hand; the caller fits the returned pipeline.

    pipeline = Pipeline(stages=[vecAssembler, lr])
    return pipeline
    
pipeline = generatePipeline(inputCols, "m2_total_spend")
pipelineModel = pipeline.fit(trainDF)

CPU times: user 16.6 ms, sys: 4.17 ms, total: 20.7 ms
Wall time: 3.17 s


### View the model information

Print out the model coefficients and view the RMSE and R^2. We define the functions `modelInfo(inputCols, pipelineModel)` and `getEvaluationMetrics(pipelineModel,outputCol,testDF)` to report this information.

In [7]:
def modelInfo(inputCols, pipelineModel):
    # Create a zipped list containing the coefficients and the data
    modelCols = copy.deepcopy(inputCols)
    modelCoeffs = list(pipelineModel.stages[-1].coefficients)
    modelCoeffs.insert(0,pipelineModel.stages[-1].intercept)
    modelCols.insert(0,"intercept")
    modelZippedList = list(map(list, zip(modelCols, modelCoeffs)))

    # Create the pandas DataFrame
    modelDF = pd.DataFrame(modelZippedList, columns = ['Column name', 'Coefficient'])
    return modelDF

print("Model coefficients")
print(modelInfo(inputCols, pipelineModel))


Model coefficients
          Column name  Coefficient
0           intercept -1917.439876
1      m1_total_spend     4.944613
2     m1_total_events   318.076873
3  m1_purchase_events -2021.617824
4    m1_user_sessions  -424.612619
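
To make the table concrete, here is a minimal plain-Python sketch of how a fitted linear regression turns these numbers into a prediction: the intercept plus the sum of each coefficient times its feature value. The coefficients are copied from the output above; the `predict` helper and the feature rows are hypothetical and only serve as an illustration.

```python
# Coefficients copied from the printed table; feature rows below are made up.
intercept = -1917.439876
coefficients = {
    "m1_total_spend": 4.944613,
    "m1_total_events": 318.076873,
    "m1_purchase_events": -2021.617824,
    "m1_user_sessions": -424.612619,
}


def predict(row):
    """Return intercept + sum(coefficient * feature value) for one row."""
    return intercept + sum(coefficients[name] * row[name] for name in coefficients)


# Hypothetical user with no activity at all in month 1
zero_row = {name: 0.0 for name in coefficients}
print(predict(zero_row))

# Hypothetical user with a single session and nothing else
one_session_row = dict(zero_row, m1_user_sessions=1.0)
print(predict(one_session_row))
```

Because nothing constrains the output of a linear model, a negative intercept combined with negative coefficients can produce negative predicted spend, which is visible in some of the prediction tables in this file.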


In [8]:
def getEvaluationMetrics(pipelineModel,outputCol,testDF):
    predDF = pipelineModel.transform(testDF)
    predDF.select(outputCol, "prediction").show(10)

    # Root mean squared error on the given DataFrame
    regressionEvaluator = RegressionEvaluator(
        predictionCol="prediction",
        labelCol=outputCol,
        metricName="rmse")
    rmse = regressionEvaluator.evaluate(predDF)

    # Coefficient of determination (R^2) on the same predictions
    regressionEvaluator = RegressionEvaluator(
        predictionCol="prediction",
        labelCol=outputCol,
        metricName="r2")
    r2 = regressionEvaluator.evaluate(predDF)
    
    return rmse, r2

evaluationMetrics = getEvaluationMetrics(pipelineModel,"m2_total_spend",testDF)
print(f"RMSE is {evaluationMetrics[0]:.1f}")
print(f"R^2 is {evaluationMetrics[1]:.5f}")

+-----------------+-------------------+
|   m2_total_spend|         prediction|
+-----------------+-------------------+
|              0.0|-2023.9756215550901|
|              0.0| 12528.061264367017|
|              0.0|   4336.03147487643|
|              0.0|-2023.9756215550901|
|              0.0|  4417.654334931182|
|              0.0|  520.6393628286082|
|936.8799743652344|    2561.5397477047|
|              0.0|-1705.8987485071277|
|              0.0| 13875.276937259849|
|              0.0|-2023.9756215550901|
+-----------------+-------------------+
only showing top 10 rows

RMSE is 41364.8
R^2 is 0.59463
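
For reference, the two metrics reported here can be reproduced by hand on a small list of labels and predictions. The sketch below uses the standard textbook definitions in plain Python; it is not Spark's implementation, and the helper names and toy values are assumptions made for illustration.

```python
import math


def rmse(labels, predictions):
    """Root mean squared error: sqrt(mean((y - y_hat)^2))."""
    squared_errors = [(y - p) ** 2 for y, p in zip(labels, predictions)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def r2(labels, predictions):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_label = sum(labels) / len(labels)
    ss_res = sum((y - p) ** 2 for y, p in zip(labels, predictions))
    ss_tot = sum((y - mean_label) ** 2 for y in labels)
    return 1.0 - ss_res / ss_tot


# Toy values, not taken from the dataset
toy_labels = [0.0, 0.0, 10.0, 30.0]
toy_predictions = [1.0, -1.0, 12.0, 28.0]
print(f"RMSE is {rmse(toy_labels, toy_predictions):.4f}")
print(f"R^2 is {r2(toy_labels, toy_predictions):.5f}")
```

RMSE is expressed in the units of the label, so it is only meaningful relative to the scale of `m2_total_spend`, whereas R^2 is unitless.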


### Train linear regression model using log-transformed features

Now that we have used our new functions to measure model accuracy on untransformed features and untransformed output, let's retrain the linear regression model on log-transformed features and/or log-transformed output. First we create these new columns, adding a small offset of 0.001 before taking the logarithm so that zero values remain defined.

In [9]:
trainDF = trainDF \
          .withColumn("m1_total_spend_log", log(col("m1_total_spend")+0.001)) \
          .withColumn("m1_total_events_log", log(col("m1_total_events")+0.001)) \
          .withColumn("m1_purchase_events_log", log(col("m1_purchase_events")+0.001)) \
          .withColumn("m1_user_sessions_log", log(col("m1_user_sessions")+0.001)) \
          .withColumn("m2_total_spend_log", log(col("m2_total_spend")+0.001))

testDF = testDF \
          .withColumn("m1_total_spend_log", log(col("m1_total_spend")+0.001)) \
          .withColumn("m1_total_events_log", log(col("m1_total_events")+0.001)) \
          .withColumn("m1_purchase_events_log", log(col("m1_purchase_events")+0.001)) \
          .withColumn("m1_user_sessions_log", log(col("m1_user_sessions")+0.001)) \
          .withColumn("m2_total_spend_log", log(col("m2_total_spend")+0.001))

trainDF.show(2)
testDF.show(2)


+---------+--------------+--------------+---------------+------------------+----------------+------------------+----------+------------------------+----------------------------+----------------------+----------------------------+----------------------------+------------------------+------------------------+-------------------------+-------------------------+----------------+--------------------+----------------+--------------------+---------------+---------------+-------------+------------+------------------+-------------------+----------------------+--------------------+------------------+
|  user_id|m2_total_spend|m1_total_spend|m1_total_events|m1_purchase_events|m1_user_sessions|num_sessions_month|AvgSessLen|stddev_SessionLengthSecs|avg_interactions_per_session|stddev_int_per_session|max_interactions_one_session|purchase_pct_of_total_events|cart_pct_of_total_events|view_pct_of_total_events|avg_purchases_per_session|std_purchases_per_session|monthlyCartTotal|monthlyPurchaseTotal|mont
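
As a quick sanity check of the transformation, the following plain-Python sketch applies the same `log(x + 0.001)` formula to two spend values that appear in the prediction tables of this file. The helper name `log_with_offset` is hypothetical; only the formula and the offset come from the cell above.

```python
import math

OFFSET = 0.001  # same offset as in the withColumn calls


def log_with_offset(value):
    """Natural log of (value + OFFSET), mirroring log(col(...) + 0.001)."""
    return math.log(value + OFFSET)


# A user with zero spend maps to log(0.001)
print(log_with_offset(0.0))

# The one non-zero spend shown in the prediction tables
print(log_with_offset(936.8799743652344))
```

A zero spend therefore becomes roughly -6.9078, which is the value repeated for most rows of the `m2_total_spend_log` column in the log-scale prediction tables.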

### Test out different combinations of log-transformed features and outputs

Here we test four additional combinations of untransformed and log-transformed inputs and outputs in order to evaluate the linear regression model performance.

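Total tested model formulations (the metrics listed here differ from those printed by the cells in this file; for example, the first model's cell prints `RMSE is 41364.8, R^2 is 0.59463`):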
1. Normal inputs, normal output: `RMSE is 107094.8, R^2 is 0.91824`
2. Log-transformed inputs, normal output: `RMSE is 372221.8, R^2 is 0.01233`
3. Log-transformed inputs, log-transformed output: `RMSE is 3.1, R^2 is 0.20218` (note: cannot be compared to non-log-transformed outputs)
4. Normal inputs, log-transformed output: `RMSE is 3.5, R^2 is 0.01027` (note: cannot be compared to non-log-transformed outputs)
5. Normal and log-transformed inputs, normal output: `RMSE is 106947.5, R^2 is 0.91846`

Because it has the largest R^2 and the smallest RMSE among the models with untransformed output, we suggest adopting the last model. Plain linear regression has no hyperparameters as such, but including the log-transformed features alongside the original ones is a modeling choice made before training. In the future, it would be useful to do a more in-depth feature selection (perhaps stepwise selection) and to use AIC to balance fit against model complexity.
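
One way to put a log-output model on the same footing as the others would be to map its predictions back to the original scale before computing RMSE. The sketch below illustrates the idea in plain Python under that assumption; the helper names and the toy spend values are hypothetical, and this step is not part of the notebook.

```python
import math

OFFSET = 0.001  # same offset used when creating the *_log columns


def to_log_scale(value):
    return math.log(value + OFFSET)


def from_log_scale(log_value):
    """Inverse of to_log_scale: exp(v) - OFFSET."""
    return math.exp(log_value) - OFFSET


def rmse(labels, predictions):
    squared_errors = [(y - p) ** 2 for y, p in zip(labels, predictions)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical spends and log-scale predictions, for illustration only
spend = [0.0, 100.0, 1000.0]
log_predictions = [to_log_scale(0.0), to_log_scale(150.0), to_log_scale(800.0)]

rmse_log = rmse([to_log_scale(s) for s in spend], log_predictions)
rmse_original = rmse(spend, [from_log_scale(p) for p in log_predictions])
print(f"log-scale RMSE: {rmse_log:.3f}, original-scale RMSE: {rmse_original:.1f}")
```

The same set of predictions yields very different RMSE values depending on the scale, which is why the log-output results in the list above cannot be compared directly with the others.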

In [10]:
print("** Log-transformed inputs, normal output **")
inputCols = ["m1_total_spend_log","m1_total_events_log","m1_purchase_events_log","m1_user_sessions_log"]
outputCol = "m2_total_spend"

pipeline = generatePipeline(inputCols, outputCol)
pipelineModel = pipeline.fit(trainDF)

print("Model coefficients")
print(modelInfo(inputCols, pipelineModel))

evaluationMetrics = getEvaluationMetrics(pipelineModel,outputCol,testDF)
print(f"RMSE is {evaluationMetrics[0]:.1f}")
print(f"R^2 is {evaluationMetrics[1]:.5f}")

** Log-transformed inputs, normal output **
Model coefficients
              Column name    Coefficient
0               intercept  405548.650250
1      m1_total_spend_log  -78721.412670
2     m1_total_events_log    9869.487577
3  m1_purchase_events_log  139117.122722
4    m1_user_sessions_log   -5530.440014
+-----------------+-------------------+
|   m2_total_spend|         prediction|
+-----------------+-------------------+
|              0.0| -11645.79780449369|
|              0.0| 115486.35930115735|
|              0.0| 12450.714808698162|
|              0.0| -11645.79780449369|
|              0.0|-146342.30326582986|
|              0.0| 10030.914857241209|
|936.8799743652344| 173359.04247102424|
|              0.0| -4809.721362801618|
|              0.0| 16505.776933381218|
|              0.0| -11645.79780449369|
+-----------------+-------------------+
only showing top 10 rows

RMSE is 64376.4
R^2 is 0.01814


In [11]:
print("** Log-transformed inputs, log-transformed output **")
inputCols = ["m1_total_spend_log","m1_total_events_log","m1_purchase_events_log","m1_user_sessions_log"]
outputCol = "m2_total_spend_log"

pipeline = generatePipeline(inputCols, outputCol)
pipelineModel = pipeline.fit(trainDF)

print("Model coefficients")
print(modelInfo(inputCols, pipelineModel))

evaluationMetrics = getEvaluationMetrics(pipelineModel,outputCol,testDF)
print(f"RMSE is {evaluationMetrics[0]:.1f}")
print(f"R^2 is {evaluationMetrics[1]:.5f}")

** Log-transformed inputs, log-transformed output **
Model coefficients
              Column name  Coefficient
0               intercept     1.637632
1      m1_total_spend_log    -1.334214
2     m1_total_events_log     0.907192
3  m1_purchase_events_log     2.625446
4    m1_user_sessions_log    -0.644856
+------------------+--------------------+
|m2_total_spend_log|          prediction|
+------------------+--------------------+
|-6.907755278982137|  -7.281618523265613|
|-6.907755278982137| -1.2343610559668314|
|-6.907755278982137|  -5.255827046948227|
|-6.907755278982137|  -7.281618523265613|
|-6.907755278982137|   -6.39052370920676|
|-6.907755278982137|  -5.289120033539898|
| 6.842556245744051|-0.02871557635196398|
|-6.907755278982137|  -6.653254241362046|
|-6.907755278982137|  -5.008147167038023|
|-6.907755278982137|  -7.281618523265613|
+------------------+--------------------+
only showing top 10 rows

RMSE is 3.1
R^2 is 0.20314


In [12]:
print("** Normal inputs, log-transformed output **")
inputCols = ["m1_total_spend","m1_total_events","m1_purchase_events","m1_user_sessions"]
outputCol = "m2_total_spend_log"

pipeline = generatePipeline(inputCols, outputCol)
pipelineModel = pipeline.fit(trainDF)

print("Model coefficients")
print(modelInfo(inputCols, pipelineModel))

evaluationMetrics = getEvaluationMetrics(pipelineModel,outputCol,testDF)
print(f"RMSE is {evaluationMetrics[0]:.1f}")
print(f"R^2 is {evaluationMetrics[1]:.5f}")

** Normal inputs, log-transformed output **
Model coefficients
          Column name  Coefficient
0           intercept    -6.164281
1      m1_total_spend     0.000010
2     m1_total_events     0.000742
3  m1_purchase_events    -0.007871
4    m1_user_sessions     0.020023
+------------------+-------------------+
|m2_total_spend_log|         prediction|
+------------------+-------------------+
|-6.907755278982137| -6.143516268567878|
|-6.907755278982137|  -5.74495533254868|
|-6.907755278982137| -6.065642770917223|
|-6.907755278982137| -6.143516268567878|
|-6.907755278982137| -6.133748908762303|
|-6.907755278982137|  -6.13758107993571|
| 6.842556245744051| -6.118261665017763|
|-6.907755278982137|-6.1427743699888575|
|-6.907755278982137| -5.917314761406123|
|-6.907755278982137| -6.143516268567878|
+------------------+-------------------+
only showing top 10 rows

RMSE is 3.5
R^2 is 0.02057


In [13]:
print("** Normal and log-transformed inputs, normal output **")
inputCols = ["m1_total_spend","m1_total_events","m1_purchase_events","m1_user_sessions",
             "m1_total_spend_log","m1_total_events_log","m1_purchase_events_log","m1_user_sessions_log"]
outputCol = "m2_total_spend"

pipeline = generatePipeline(inputCols, outputCol)
pipelineModel = pipeline.fit(trainDF)
    
print("Model coefficients")
print(modelInfo(inputCols, pipelineModel))

evaluationMetrics = getEvaluationMetrics(pipelineModel,"m2_total_spend",testDF)
print(f"RMSE is {evaluationMetrics[0]:.1f}")
print(f"R^2 is {evaluationMetrics[1]:.5f}")

** Normal and log-transformed inputs, normal output **
Model coefficients
              Column name   Coefficient
0               intercept  46034.196767
1          m1_total_spend      4.967977
2         m1_total_events    320.032916
3      m1_purchase_events  -2054.062406
4        m1_user_sessions   -378.173218
5      m1_total_spend_log  -7887.879815
6     m1_total_events_log  -2700.519187
7  m1_purchase_events_log  14332.576358
8    m1_user_sessions_log   -296.789909
+-----------------+-------------------+
|   m2_total_spend|         prediction|
+-----------------+-------------------+
|              0.0| 1454.6740854820819|
|              0.0| 17220.281521277124|
|              0.0|-1100.3160772484116|
|              0.0| 1454.6740854820819|
|              0.0| -9482.492119901959|
|              0.0|-1916.3105846395993|
|936.8799743652344| 15489.803254065293|
|              0.0| -95.80101092906261|
|              0.0|  6024.080353453886|
|              0.0| 1454.6740854820819|
+-----

### Test out different combinations of hyperparameters

Here we test 25 combinations of hyperparameters under 4-fold cross-validation.

- Regularization parameters (`regParam`): [0, 0.01, 0.2, 1, 10]
- Amount of LASSO-ness (`elasticNetParam`; 0 is L2 only, 1 is L1 only): [0, 0.25, 0.5, 0.75, 1]

We find that the selected model uses a regularization parameter of 0.2 with a LASSO-ness of 1 (L1 regularization only). Given that its R^2 and RMSE (`RMSE is 106998.3, R^2 is 0.91839`) are nearly identical to those of the previous models, it is unclear whether this is a genuine improvement.
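
To show where the count of 25 comes from and what the two hyperparameters control, here is a small plain-Python sketch. It enumerates the grid and evaluates the elastic net penalty in the form described in the Spark documentation, `regParam * (elasticNetParam * L1 + (1 - elasticNetParam) / 2 * L2^2)`. The function name and the example weights are assumptions for illustration.

```python
from itertools import product

reg_params = [0, 0.01, 0.2, 1, 10]
elastic_net_params = [0, 0.25, 0.5, 0.75, 1]
num_folds = 4

# Every (regParam, elasticNetParam) pair is one candidate
grid = list(product(reg_params, elastic_net_params))
print(len(grid), "parameter combinations")
print(len(grid) * num_folds, "model fits during cross-validation")


def elastic_net_penalty(weights, reg_param, elastic_net_param):
    """Penalty added to the least-squares loss for a given weight vector."""
    l1 = sum(abs(w) for w in weights)
    l2_squared = sum(w * w for w in weights)
    return reg_param * (
        elastic_net_param * l1 + (1 - elastic_net_param) / 2 * l2_squared
    )


# Hypothetical weights: pure L1 (LASSO) versus pure L2 (ridge) penalty
example_weights = [1.0, -2.0]
print(elastic_net_penalty(example_weights, 0.2, 1))
print(elastic_net_penalty(example_weights, 0.2, 0))
```

With a regularization parameter of 0, the penalty vanishes regardless of the mixing parameter, so the five grid points with `regParam = 0` all correspond to ordinary least squares.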


In [14]:
%%time
print("** Normal and log-transformed inputs, normal output **")
inputCols = ["m1_total_spend","m1_total_events","m1_purchase_events","m1_user_sessions",
             "m1_total_spend_log","m1_total_events_log","m1_purchase_events_log","m1_user_sessions_log"]
# Build the stages explicitly (instead of calling generatePipeline) because
# the parameter grid below needs a reference to the `lr` estimator.
vecAssembler = VectorAssembler(inputCols=inputCols, outputCol="features")

# Select output column for linear regression
lr = LinearRegression(featuresCol="features", labelCol="m2_total_spend")

pipeline = Pipeline(stages=[vecAssembler, lr])
    
paramGrid = ParamGridBuilder() \
    .addGrid(lr.regParam, [0, 0.01, 0.2, 1, 10]) \
    .addGrid(lr.elasticNetParam, [0, 0.25, 0.5, 0.75, 1]) \
    .build()

# The evaluator uses RMSE (the RegressionEvaluator default) to compare candidates
crossval = CrossValidator(estimator=pipeline,
                          estimatorParamMaps=paramGrid,
                          evaluator=RegressionEvaluator(labelCol="m2_total_spend",
                                                        metricName="rmse"),
                          numFolds=4)

# Run cross-validation, and choose the best set of parameters.
cvModel = crossval.fit(trainDF)

** Normal and log-transformed inputs, normal output **
CPU times: user 4 s, sys: 1.25 s, total: 5.25 s
Wall time: 59.2 s


In [15]:
print("Best model coefficients")
print(modelInfo(inputCols, cvModel.bestModel))

evaluationMetrics = getEvaluationMetrics(cvModel.bestModel,"m2_total_spend",testDF)
print(f"RMSE is {evaluationMetrics[0]:.1f}")
print(f"R^2 is {evaluationMetrics[1]:.5f}")

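print()
# The evaluator's default metric is RMSE, where lower is better, so the selected
# parameter map is the one with the smallest average cross-validation metric.
print(cvModel.getEstimatorParamMaps()[np.argmin(cvModel.avgMetrics)])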

Best model coefficients
              Column name   Coefficient
0               intercept  31457.416889
1          m1_total_spend      4.997810
2         m1_total_events    322.192469
3      m1_purchase_events  -2085.126513
4        m1_user_sessions   -377.625687
5      m1_total_spend_log  -4755.139513
6     m1_total_events_log  -2499.824344
7  m1_purchase_events_log   9120.782572
8    m1_user_sessions_log   -515.153737
+-----------------+-------------------+
|   m2_total_spend|         prediction|
+-----------------+-------------------+
|              0.0|  1242.176310043753|
|              0.0| 16080.312874361229|
|              0.0|  -916.075658715039|
|              0.0|  1242.176310043753|
|              0.0|-1009.4002857744053|
|              0.0|-1670.7385947712792|
|936.8799743652344| 12053.742415288976|
|              0.0|-167.12844183439302|
|              0.0| 6279.1061103897555|
|              0.0|  1242.176310043753|
+-----------------+-------------------+
only showing top
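
When looking up the winning parameter map from the averaged cross-validation metrics, the direction of the metric matters: for RMSE the smallest value wins, while for R^2 the largest does. The plain-Python sketch below illustrates this; the `best_index` helper and the metric values are hypothetical. In PySpark, the evaluator's `isLargerBetter()` method should report the direction, though that is worth confirming for the installed version.

```python
def best_index(avg_metrics, larger_is_better):
    """Return the position of the best average metric."""
    pick = max if larger_is_better else min
    return pick(range(len(avg_metrics)), key=lambda i: avg_metrics[i])


# Hypothetical average metrics for three parameter maps
avg_rmse = [105.0, 98.5, 120.3]
avg_r2 = [0.90, 0.92, 0.85]

print(best_index(avg_rmse, larger_is_better=False))  # smallest RMSE wins
print(best_index(avg_r2, larger_is_better=True))     # largest R^2 wins
```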

### Save pipeline model and get model size

The model size is 13.9 kB, according to the file explorer in Linux.

In [16]:
pipelinePath = "models/lr-pipeline-model"
cvModel.bestModel.write().overwrite().save(pipelinePath)
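
The size can also be measured programmatically instead of through the file explorer. Spark saves a pipeline model as a directory of files, so the sketch below sums the sizes of all files under the save path using only the standard library. The `directory_size_bytes` helper is hypothetical, and the result may differ slightly from what a file explorer reports, since some tools count disk blocks rather than file bytes.

```python
import os


def directory_size_bytes(path):
    """Sum the sizes of all regular files under `path`, recursively."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total


pipeline_path = "models/lr-pipeline-model"  # same value as pipelinePath above
if os.path.isdir(pipeline_path):
    print(f"Model size: {directory_size_bytes(pipeline_path) / 1000:.1f} kB")
```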