## File 07 - Basic Regression

In this file, we create a small ML pipeline based on the output from File 03 (Basic Preprocessed Output).
The files needed are `./processed_data/train_output.parquet` and `./processed_data/test_output.parquet` (paths relative to this notebook).

The goal of this file is to provide:
- The type of model
- Best hyperparameters used
- Size of the saved model
- Performance metrics

### Set up Spark session

We can pass more options to the `SparkSession` builder, but for now everything is left at its default setting.
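
As a rough sketch of what non-default options might look like, the helper below applies a dictionary of settings through `SparkSession.builder.config`. The helper name `buildSession` and the values in `sparkOptions` are illustrative assumptions, not settings used in this project; the keys themselves are standard Spark configuration properties.

```python
# Hypothetical option values; adjust them to the machine and data at hand.
sparkOptions = {
    "spark.sql.shuffle.partitions": "8",   # partitions used after shuffles (Spark default: 200)
    "spark.sql.session.timeZone": "UTC",   # time zone used for timestamp handling
}

def buildSession(appName, options):
    """Create (or reuse) a SparkSession with the given configuration."""
    # Imported lazily so the option dictionary can be inspected without Spark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(appName)
    for key, value in options.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()

# Usage: spark = buildSession("project", sparkOptions)
```

For a dataset this small, lowering the shuffle partition count is the kind of change that may shorten wall time, but that is untested here.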

In [1]:
%%time
from pyspark.sql import SparkSession
from pyspark.sql import types as T
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
from pyspark.sql.functions import col
from pyspark.sql.functions import log

import pandas as pd
import numpy as np
import copy

spark = SparkSession.builder \
        .appName("project") \
        .getOrCreate()

sc = spark.sparkContext

CPU times: user 463 ms, sys: 393 ms, total: 856 ms
Wall time: 5.19 s


### Read in dataframes for train and test sets

This data should have been previously generated: we can find it in the `processed_data` folder.

In [2]:
%%time
trainDF = spark.read.parquet("./processed_data/train_output.parquet")
testDF = spark.read.parquet("./processed_data/test_output.parquet")
trainDF.show(5)

+---------+--------------+--------------+---------------+------------------+----------------+
|  user_id|m2_total_spend|m1_total_spend|m1_total_events|m1_purchase_events|m1_user_sessions|
+---------+--------------+--------------+---------------+------------------+----------------+
|216064734|           0.0|           0.0|              7|                 0|               3|
|276954781|           0.0|           0.0|              1|                 0|               1|
|300004940|           0.0|           0.0|             91|                 0|              15|
|308982710|           0.0|           0.0|              7|                 0|               6|
|324078599|           0.0|           0.0|              5|                 0|               3|
+---------+--------------+--------------+---------------+------------------+----------------+
only showing top 5 rows

CPU times: user 2.76 ms, sys: 1.86 ms, total: 4.62 ms
Wall time: 4.43 s


### Set up Spark ML pipeline training for linear regression

Here we decide which input columns to use for the training pipeline. The function `generatePipeline(inputCols, outputCol)` builds an unfitted pipeline (a `VectorAssembler` followed by a `LinearRegression`), which we then fit on `trainDF`.

In [3]:
%%time

inputCols = ["m1_total_spend","m1_total_events","m1_purchase_events","m1_user_sessions"]

def generatePipeline(inputCols, outputCol):
    # Select input columns for linear regression
    vecAssembler = VectorAssembler(inputCols=inputCols, outputCol="features")

    # Select output column for linear regression
    lr = LinearRegression(featuresCol="features", labelCol=outputCol)

    # Chaining both stages in a Pipeline replaces calling
    # vecAssembler.transform(trainDF) and fitting lr by hand.

    pipeline = Pipeline(stages=[vecAssembler, lr])
    return pipeline
    
pipeline = generatePipeline(inputCols, "m2_total_spend")
pipelineModel = pipeline.fit(trainDF)

CPU times: user 9.14 ms, sys: 4.03 ms, total: 13.2 ms
Wall time: 5.49 s


### View the model information

Print out the model coefficients and view the RMSE and R^2. We define the functions `modelInfo(inputCols, pipelineModel)` and `getEvaluationMetrics(pipelineModel,outputCol,testDF)` to report this information.

In [4]:
def modelInfo(inputCols, pipelineModel):
    # Pair each column name with its coefficient, with the intercept first
    modelCols = copy.deepcopy(inputCols)
    modelCoeffs = list(pipelineModel.stages[-1].coefficients)
    modelCoeffs.insert(0,pipelineModel.stages[-1].intercept)
    modelCols.insert(0,"intercept")
    modelZippedList = list(map(list, zip(modelCols, modelCoeffs)))

    # Create the pandas DataFrame
    modelDF = pd.DataFrame(modelZippedList, columns = ['Column name', 'Coefficient'])
    return modelDF

print("Model coefficients")
print(modelInfo(inputCols, pipelineModel))


Model coefficients
          Column name  Coefficient
0           intercept -2881.958103
1      m1_total_spend     5.285700
2     m1_total_events   313.911430
3  m1_purchase_events -1679.682063
4    m1_user_sessions  -248.626789


In [5]:
def getEvaluationMetrics(pipelineModel,outputCol,testDF):
    # Score the test set and preview a few predictions next to the labels
    predDF = pipelineModel.transform(testDF)
    predDF.select(outputCol, "prediction").show(10)

    # Root mean squared error (in the units of outputCol)
    regressionEvaluator = RegressionEvaluator(
        predictionCol="prediction",
        labelCol=outputCol,
        metricName="rmse")
    rmse = regressionEvaluator.evaluate(predDF)

    # Coefficient of determination
    regressionEvaluator = RegressionEvaluator(
        predictionCol="prediction",
        labelCol=outputCol,
        metricName="r2")
    r2 = regressionEvaluator.evaluate(predDF)

    return rmse, r2

evaluationMetrics = getEvaluationMetrics(pipelineModel,"m2_total_spend",testDF)
print(f"RMSE is {evaluationMetrics[0]:.1f}")
print(f"R^2 is {evaluationMetrics[1]:.5f}")

+--------------+-------------------+
|m2_total_spend|         prediction|
+--------------+-------------------+
|           0.0|  950.2636956517554|
|           0.0|  6979.865502618586|
|           0.0|-1809.6545312152025|
|           0.0| -2751.388820554537|
|           0.0|  11623.25230816421|
|           0.0| 2479.5597553504995|
|           0.0| -2751.388820554537|
|           0.0| 14852.674799063938|
|           0.0|-2620.8195382524455|
|           0.0|-2816.6734617055827|
+--------------+-------------------+
only showing top 10 rows

RMSE is 107094.8
R^2 is 0.91824
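
For reference, the two metrics reported by `RegressionEvaluator` can be written out in a few lines of NumPy. This is a minimal sketch on made-up numbers (the arrays are illustrative, not project data), and the function names `rmse` and `r2` are chosen only for this example.

```python
import numpy as np

def rmse(yTrue, yPred):
    """Root mean squared error: sqrt(mean((y - yhat)^2))."""
    yTrue = np.asarray(yTrue, dtype=float)
    yPred = np.asarray(yPred, dtype=float)
    return float(np.sqrt(np.mean((yTrue - yPred) ** 2)))

def r2(yTrue, yPred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    yTrue = np.asarray(yTrue, dtype=float)
    yPred = np.asarray(yPred, dtype=float)
    ssRes = np.sum((yTrue - yPred) ** 2)          # unexplained variation
    ssTot = np.sum((yTrue - yTrue.mean()) ** 2)   # total variation around the mean
    return float(1.0 - ssRes / ssTot)

# Made-up labels and predictions; every prediction is off by exactly 2
yTrue = [0.0, 0.0, 10.0, 30.0]
yPred = [2.0, -2.0, 12.0, 28.0]

print(rmse(yTrue, yPred))  # 2.0
print(r2(yTrue, yPred))    # 1 - 16/600, roughly 0.973
```

Because RMSE is expressed in the units of the label, it is only comparable between models that predict the same output column.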


### Train linear regression model using log-transformed features

Now that we have used our new functions to test out the model accuracy on untransformed features and untransformed output, let's retrain the linear regression model on transformed features and/or transformed output (log scale). First we create these new features.
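
The `+0.001` offset used in the cell below is needed because `log(0)` is undefined and many rows contain zeros. A quick check with Python's `math` module shows where the recurring value `-6.907755...` in the output comes from. `math.log1p` is included only as a commonly used alternative; this notebook does not use it.

```python
import math

OFFSET = 0.001  # same constant as in the withColumn calls

zeroLog = math.log(0 + OFFSET)    # value produced for every zero entry
sevenLog = math.log(7 + OFFSET)   # value produced for m1_total_events = 7

print(zeroLog)        # about -6.9078
print(sevenLog)       # about 1.9461
print(math.log1p(0))  # 0.0: log(1 + x) maps zero to zero instead
```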

In [6]:
# Columns to log-transform; the 0.001 offset keeps log() defined for zero values
logCols = ["m1_total_spend","m1_total_events","m1_purchase_events",
           "m1_user_sessions","m2_total_spend"]

def addLogColumns(df):
    # Add a "<column>_log" column for every entry in logCols
    for c in logCols:
        df = df.withColumn(c + "_log", log(col(c) + 0.001))
    return df

trainDF = addLogColumns(trainDF)
testDF = addLogColumns(testDF)

trainDF.show(2)
testDF.show(2)


+---------+--------------+--------------+---------------+------------------+----------------+------------------+--------------------+----------------------+--------------------+------------------+
|  user_id|m2_total_spend|m1_total_spend|m1_total_events|m1_purchase_events|m1_user_sessions|m1_total_spend_log| m1_total_events_log|m1_purchase_events_log|m1_user_sessions_log|m2_total_spend_log|
+---------+--------------+--------------+---------------+------------------+----------------+------------------+--------------------+----------------------+--------------------+------------------+
|216064734|           0.0|           0.0|              7|                 0|               3|-6.907755278982137|  1.9460529959950605|    -6.907755278982137|  1.0989455664582302|-6.907755278982137|
|276954781|           0.0|           0.0|              1|                 0|               1|-6.907755278982137|9.995003330834232E-4|    -6.907755278982137|9.995003330834232E-4|-6.907755278982137|
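+---------+--------------+--------------+---------------+------------------+----------------+------------------+--------------------+----------------------+--------------------+------------------+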

### Test out different combinations of log-transformed features and outputs

Here we test out four additional formulations of the dataset in order to evaluate the linear regression model performance.

Total tested model formulations:
1. Normal inputs, normal output: `RMSE is 107094.8, R^2 is 0.91824`
2. Log-transformed inputs, normal output: `RMSE is 372221.8, R^2 is 0.01233`
3. Log-transformed inputs, log-transformed output: `RMSE is 3.1, R^2 is 0.20218` (note: cannot be compared to non-log-transformed outputs)
4. Normal inputs, log-transformed output: `RMSE is 3.5, R^2 is 0.01027` (note: cannot be compared to non-log-transformed outputs)
5. Normal and log-transformed inputs, normal output: `RMSE is 106947.5, R^2 is 0.91846`

Model 5 has the highest R^2 and the lowest RMSE among the models with an untransformed output, so we would suggest adopting it, although its margin over model 1 is small (RMSE 106947.5 vs. 107094.8). Plain (unregularized) linear regression has no hyperparameters as such, but the decision to include log-transformed features alongside the originals is a comparable pre-training choice. In the future, it would be useful to do a more in-depth feature selection (perhaps a stepwise feature selection) and use AIC to weigh goodness of fit against model complexity.
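
As a starting point for the AIC comparison mentioned above, the sketch below uses the common Gaussian-error form `AIC = n * ln(RSS / n) + 2k` (up to an additive constant), where `RSS` is the residual sum of squares on the training data and `k` is the number of fitted parameters including the intercept. The function name `gaussianAic` and all numbers are hypothetical and are not results from this notebook.

```python
import math

def gaussianAic(rss, n, k):
    """AIC of a least-squares fit under Gaussian errors, up to an additive constant.

    rss: residual sum of squares, n: number of rows, k: number of fitted parameters.
    """
    return n * math.log(rss / n) + 2 * k

# Illustrative numbers only: four extra features reduce the RSS very slightly.
n = 1000
aicSmall = gaussianAic(rss=5000.0, n=n, k=5)   # 4 features + intercept
aicLarge = gaussianAic(rss=4990.0, n=n, k=9)   # 8 features + intercept

# Lower AIC is preferred; here the small gain in fit does not pay for 4 more parameters.
print(aicSmall < aicLarge)  # True
```

Since `RSS = n * RMSE^2`, the inputs can be derived from a training-set RMSE and the row count.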

In [12]:
print("** Log-transformed inputs, normal output **")
inputCols = ["m1_total_spend_log","m1_total_events_log","m1_purchase_events_log","m1_user_sessions_log"]
outputCol = "m2_total_spend"

pipeline = generatePipeline(inputCols, outputCol)
pipelineModel = pipeline.fit(trainDF)

print("Model coefficients")
print(modelInfo(inputCols, pipelineModel))

evaluationMetrics = getEvaluationMetrics(pipelineModel,outputCol,testDF)
print(f"RMSE is {evaluationMetrics[0]:.1f}")
print(f"R^2 is {evaluationMetrics[1]:.5f}")

** Log-transformed inputs, normal output **
Model coefficients
              Column name    Coefficient
0               intercept  423132.734213
1      m1_total_spend_log  -82482.258824
2     m1_total_events_log   11326.038662
3  m1_purchase_events_log  145541.653360
4    m1_user_sessions_log   -7671.589245
+--------------+-------------------+
|m2_total_spend|         prediction|
+--------------+-------------------+
|           0.0| 16577.787751273194|
|           0.0| 21814.416458436986|
|           0.0| 443.31442810606677|
|           0.0| -9931.233204243472|
|           0.0| 31133.362531157385|
|           0.0|   6326.17963634443|
|           0.0| -9931.233204243472|
|           0.0|  15922.40951142332|
|           0.0| -7399.075163899048|
|           0.0|-12462.478659115674|
+--------------+-------------------+
only showing top 10 rows

RMSE is 372221.8
R^2 is 0.01233


In [13]:
print("** Log-transformed inputs, log-transformed output **")
inputCols = ["m1_total_spend_log","m1_total_events_log","m1_purchase_events_log","m1_user_sessions_log"]
outputCol = "m2_total_spend_log"

pipeline = generatePipeline(inputCols, outputCol)
pipelineModel = pipeline.fit(trainDF)

print("Model coefficients")
print(modelInfo(inputCols, pipelineModel))

evaluationMetrics = getEvaluationMetrics(pipelineModel,outputCol,testDF)
print(f"RMSE is {evaluationMetrics[0]:.1f}")
print(f"R^2 is {evaluationMetrics[1]:.5f}")

** Log-transformed inputs, log-transformed output **
Model coefficients
              Column name  Coefficient
0               intercept     1.538954
1      m1_total_spend_log    -1.322550
2     m1_total_events_log     0.923090
3  m1_purchase_events_log     2.600790
4    m1_user_sessions_log    -0.668795
+------------------+------------------+
|m2_total_spend_log|        prediction|
+------------------+------------------+
|-6.907755278982137|-4.923735095132409|
|-6.907755278982137|-4.527105440221794|
|-6.907755278982137|-6.268883690890332|
|-6.907755278982137|-7.114425553958757|
|-6.907755278982137|-3.737432978301927|
|-6.907755278982137|-5.859492201162338|
|-6.907755278982137|-7.114425553958757|
|-6.907755278982137|-5.085323561619605|
|-6.907755278982137|-6.938225364445513|
|-6.907755278982137|-7.290562241221904|
+------------------+------------------+
only showing top 10 rows

RMSE is 3.1
R^2 is 0.20218
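
One way to put a log-output model on the same footing as the untransformed-output models is to back-transform its predictions with `exp(prediction) - 0.001` and compute RMSE in the original units. The sketch below does this with NumPy on made-up values; the helper `backTransform` and the arrays are assumptions for illustration, not output from this notebook.

```python
import numpy as np

OFFSET = 0.001  # must match the offset used when creating the *_log columns

def backTransform(logPred):
    """Invert log(x + OFFSET) to return predictions in the original units."""
    return np.exp(np.asarray(logPred, dtype=float)) - OFFSET

# Hypothetical true spend and log-scale predictions
ySpend = np.array([0.0, 0.0, 50.0, 400.0])
logPred = np.log(np.array([0.0, 1.0, 40.0, 500.0]) + OFFSET)

spendPred = backTransform(logPred)
rmseOriginal = float(np.sqrt(np.mean((ySpend - spendPred) ** 2)))

print(spendPred)      # approximately [0, 1, 40, 500]
print(rmseOriginal)   # RMSE in spend units, comparable with models 1, 2 and 5
```

Exponentiating a log-scale prediction tends to underestimate the mean on the original scale, so a bias correction may be worth considering before drawing conclusions from such a comparison.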


In [14]:
print("** Normal inputs, log-transformed output **")
inputCols = ["m1_total_spend","m1_total_events","m1_purchase_events","m1_user_sessions"]
outputCol = "m2_total_spend_log"

pipeline = generatePipeline(inputCols, outputCol)
pipelineModel = pipeline.fit(trainDF)

print("Model coefficients")
print(modelInfo(inputCols, pipelineModel))

evaluationMetrics = getEvaluationMetrics(pipelineModel,outputCol,testDF)
print(f"RMSE is {evaluationMetrics[0]:.1f}")
print(f"R^2 is {evaluationMetrics[1]:.5f}")

** Normal inputs, log-transformed output **
Model coefficients
          Column name  Coefficient
0           intercept    -6.113765
1      m1_total_spend     0.000002
2     m1_total_events     0.000773
3  m1_purchase_events    -0.003500
4    m1_user_sessions     0.002602
+------------------+-------------------+
|m2_total_spend_log|         prediction|
+------------------+-------------------+
|-6.907755278982137| -6.101116067603466|
|-6.907755278982137|-6.0830573477300405|
|-6.907755278982137| -6.104696865948726|
|-6.907755278982137| -6.107015385757871|
|-6.907755278982137| -6.074839509766491|
|-6.907755278982137| -6.068424698054745|
|-6.907755278982137| -6.107015385757871|
|-6.907755278982137| -6.031534418371512|
|-6.907755278982137|-6.1002658635935205|
|-6.907755278982137| -6.110390146840046|
+------------------+-------------------+
only showing top 10 rows

RMSE is 3.5
R^2 is 0.01027


In [15]:
print("** Normal and log-transformed inputs, normal output **")
inputCols = ["m1_total_spend","m1_total_events","m1_purchase_events","m1_user_sessions",
             "m1_total_spend_log","m1_total_events_log","m1_purchase_events_log","m1_user_sessions_log"]
outputCol = "m2_total_spend"

pipeline = generatePipeline(inputCols, outputCol)
pipelineModel = pipeline.fit(trainDF)
    
print("Model coefficients")
print(modelInfo(inputCols, pipelineModel))

evaluationMetrics = getEvaluationMetrics(pipelineModel,"m2_total_spend",testDF)
print(f"RMSE is {evaluationMetrics[0]:.1f}")
print(f"R^2 is {evaluationMetrics[1]:.5f}")

** Normal and log-transformed inputs, normal output **
Model coefficients
              Column name   Coefficient
0               intercept -16570.114363
1          m1_total_spend      5.288267
2         m1_total_events    315.512164
3      m1_purchase_events  -1686.180334
4        m1_user_sessions   -228.815114
5      m1_total_spend_log   4125.214436
6     m1_total_events_log  -3332.968078
7  m1_purchase_events_log  -6892.338972
8    m1_user_sessions_log   -729.879780
+--------------+-------------------+
|m2_total_spend|         prediction|
+--------------+-------------------+
|           0.0| -2132.532440776513|
|           0.0|   338.647900184842|
|           0.0|-2206.7153234176476|
|           0.0|-100.28359720380467|
|           0.0|  4311.541875639628|
|           0.0| -4265.069252114428|
|           0.0|-100.28359720380467|
|           0.0|  4844.746269128675|
|           0.0| -2742.025703185145|
|           0.0| 2627.1409886582915|
+--------------+-------------------+
only showing top 10 rows

RMSE is 106947.5
R^2 is 0.91846

### Test out different combinations of hyperparameters

Here we test out 25 combinations of hyperparameters under 4-fold cross-validation.

- Regularization strength (`regParam`): [0, 0.01, 0.2, 1, 10]
- Elastic net mixing (`elasticNetParam`, where 0 is ridge/L2 only and 1 is LASSO/L1 only): [0, 0.25, 0.5, 0.75, 1]

Cross-validation selects a regularization strength of 0.2 with an elastic net mixing of 1 (L1 regularization only). On the test set, the tuned model gives `RMSE is 106998.3, R^2 is 0.91839`, which is nearly identical to the unregularized model 5 (`RMSE is 106947.5, R^2 is 0.91846`) and in fact marginally worse, so regularization does not appear to be a legitimate improvement here.
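
To make the two grid axes concrete: in Spark's linear models, `regParam` is the overall penalty strength and `elasticNetParam` is the mixing weight, giving a penalty of `regParam * (elasticNetParam * ||w||_1 + (1 - elasticNetParam) / 2 * ||w||_2^2)`. The sketch below enumerates the 25 combinations and evaluates that penalty for a made-up coefficient vector; `elasticNetPenalty` is a helper written for this illustration and is not part of Spark.

```python
from itertools import product

regParams = [0, 0.01, 0.2, 1, 10]
elasticNetParams = [0, 0.25, 0.5, 0.75, 1]

# Every (regParam, elasticNetParam) pair that the grid search will try
grid = list(product(regParams, elasticNetParams))
print(len(grid))  # 25

def elasticNetPenalty(weights, regParam, elasticNetParam):
    """Elastic net penalty: a weighted mix of the L1 norm and half the squared L2 norm."""
    l1 = sum(abs(w) for w in weights)
    l2Squared = sum(w * w for w in weights)
    return regParam * (elasticNetParam * l1 + (1 - elasticNetParam) / 2 * l2Squared)

weights = [3.0, -4.0]  # made-up coefficients
pureL1 = elasticNetPenalty(weights, 0.2, 1)  # 0.2 * (3 + 4), about 1.4
pureL2 = elasticNetPenalty(weights, 0.2, 0)  # 0.2 * (9 + 16) / 2, about 2.5
print(pureL1, pureL2)
```

With 4 folds, the 25 combinations amount to 100 pipeline fits plus one final refit of the best setting on the full training set, which goes some way to explaining the multi-minute wall time.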


In [16]:
%%time
print("** Normal and log-transformed inputs, normal output **")
inputCols = ["m1_total_spend","m1_total_events","m1_purchase_events","m1_user_sessions",
             "m1_total_spend_log","m1_total_events_log","m1_purchase_events_log","m1_user_sessions_log"]
# Build the stages inline (rather than via generatePipeline) so that `lr`
# is available when defining the parameter grid below.
vecAssembler = VectorAssembler(inputCols=inputCols, outputCol="features")

# Select output column for linear regression
lr = LinearRegression(featuresCol="features", labelCol="m2_total_spend")

pipeline = Pipeline(stages=[vecAssembler, lr])
    
paramGrid = ParamGridBuilder() \
    .addGrid(lr.regParam, [0, 0.01, 0.2, 1, 10]) \
    .addGrid(lr.elasticNetParam, [0, 0.25, 0.5, 0.75, 1]) \
    .build()

crossval = CrossValidator(estimator=pipeline,
                          estimatorParamMaps=paramGrid,
                          evaluator=RegressionEvaluator().setLabelCol("m2_total_spend"),
                          numFolds=4)

# Run cross-validation, and choose the best set of parameters.
cvModel = crossval.fit(trainDF)

** Normal and log-transformed inputs, normal output **
CPU times: user 2.44 s, sys: 815 ms, total: 3.26 s
Wall time: 5min 30s


In [17]:
print("Best model coefficients")
print(modelInfo(inputCols, cvModel.bestModel))

evaluationMetrics = getEvaluationMetrics(cvModel.bestModel,"m2_total_spend",testDF)
print(f"RMSE is {evaluationMetrics[0]:.1f}")
print(f"R^2 is {evaluationMetrics[1]:.5f}")

print()
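# avgMetrics holds RMSE (the RegressionEvaluator default), where lower is better,
# so the selected parameter map is the one with the smallest average metric.
print(cvModel.getEstimatorParamMaps()[np.argmin(cvModel.avgMetrics)])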

Best model coefficients
              Column name   Coefficient
0               intercept -10920.001611
1          m1_total_spend      5.290461
2         m1_total_events    315.345439
3      m1_purchase_events  -1686.488895
4        m1_user_sessions   -228.595785
5      m1_total_spend_log   2882.170527
6     m1_total_events_log  -3411.894171
7  m1_purchase_events_log  -4846.929282
8    m1_user_sessions_log   -673.152098
+--------------+-------------------+
|m2_total_spend|         prediction|
+--------------+-------------------+
|           0.0|-2229.3050498460416|
|           0.0|  204.5318315520326|
|           0.0|-2187.2374375699965|
|           0.0| -8.009957100601241|
|           0.0|   4107.66931780401|
|           0.0| -4282.908267246825|
|           0.0| -8.009957100601241|
|           0.0|  4754.234076957451|
|           0.0|-2665.0280750342354|
|           0.0|  2734.737700567528|
+--------------+-------------------+
only showing top 10 rows

RMSE is 106998.3
R^2 is 0.91839
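
Because the default `RegressionEvaluator` metric is RMSE, the best entry in `avgMetrics` is the smallest one. A small helper such as the hypothetical `rankParamMaps` below pairs each parameter setting with its cross-validated score and sorts them, which makes it easier to see how close the candidates are. The inputs shown are made-up stand-ins for `cvModel.getEstimatorParamMaps()` and `cvModel.avgMetrics`.

```python
def rankParamMaps(paramMaps, avgMetrics, largerIsBetter=False):
    """Return (paramMap, metric) pairs sorted from best to worst."""
    paired = list(zip(paramMaps, avgMetrics))
    return sorted(paired, key=lambda pair: pair[1], reverse=largerIsBetter)

# Made-up stand-ins for the cross-validation results
paramMaps = [
    {"regParam": 0, "elasticNetParam": 0},
    {"regParam": 0.2, "elasticNetParam": 1},
    {"regParam": 10, "elasticNetParam": 0.5},
]
avgMetrics = [105.0, 101.5, 130.2]  # average RMSE per setting, lower is better

ranked = rankParamMaps(paramMaps, avgMetrics)
for params, metric in ranked:
    print(f"{metric:8.1f}  {params}")
```

For a metric such as R^2, where larger is better, pass `largerIsBetter=True`.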


### Save pipeline model and get model size

The model size is 13.9 kB, according to the file explorer in Linux.

In [None]:
pipelinePath = "models/lr-pipeline-model"
cvModel.bestModel.write().overwrite().save(pipelinePath)
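
Rather than reading the size from a file explorer, the saved model directory can be measured from Python. The sketch below sums file sizes with `os.walk`; the helper name `directorySize` is an assumption for this example, and the result may differ slightly from what a file explorer reports, since that can include block-level disk usage.

```python
import os

def directorySize(path):
    """Total size in bytes of all files under path, including subdirectories."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total

# Usage (assumes the model has already been saved to pipelinePath):
# print(f"{directorySize('models/lr-pipeline-model') / 1000:.1f} kB")
```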