## File 05 - Basic Regression

In this file, we create a small ML pipeline based on the output from File 03 (Basic Preprocessed Output).
The files needed are `./processed_data/train_output.parquet` and `./processed_data/test_output.parquet`.

The goal of this file is to provide:
- The type of model (here, linear regression)
- Best hyperparameters used
- Size of the saved model
- Performance metrics

### Set up Spark session

We can specify more options in the `SparkSession` builder, but for now we keep the default settings.

In [1]:
%%time
from pyspark.sql import SparkSession
from pyspark.sql import types as T
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.sql.functions import col
from pyspark.sql.functions import log

import pandas as pd
import copy

spark = SparkSession.builder \
        .appName("project") \
        .getOrCreate()

sc = spark.sparkContext

CPU times: user 371 ms, sys: 321 ms, total: 692 ms
Wall time: 4.38 s


### Read in dataframes for train and test sets

This data should have been previously generated: we can find it in the `processed_data` folder.

In [2]:
%%time
trainDF = spark.read.parquet("./processed_data/train_output.parquet")
testDF = spark.read.parquet("./processed_data/test_output.parquet")
trainDF.show(5)

+--------+--------------+--------------+---------------+------------------+----------------+
| user_id|m2_total_spend|m1_total_spend|m1_total_events|m1_purchase_events|m1_user_sessions|
+--------+--------------+--------------+---------------+------------------+----------------+
|22165363|          0.00|          0.00|              2|                 0|               2|
|32978429|          0.00|          0.00|              8|                 0|               2|
|38661019|          0.00|          0.00|              1|                 0|               1|
|49484535|          0.00|          0.00|             22|                 0|              18|
|62336140|          0.00|          0.00|              1|                 0|               1|
+--------+--------------+--------------+---------------+------------------+----------------+
only showing top 5 rows

CPU times: user 2.53 ms, sys: 345 µs, total: 2.88 ms
Wall time: 3.09 s


### Set up Spark ML pipeline training for linear regression

Here we decide which input columns should be used in order to create our training pipeline. To implement this step, we create the function `generateLRModel(inputCols, outputCol, trainDF)`. Then, we train the pipeline using this function.

In [3]:
%%time

inputCols = ["m1_total_spend","m1_total_events","m1_purchase_events","m1_user_sessions"]

def generateLRModel(inputCols, outputCol, trainDF):
    # Select input columns for linear regression
    vecAssembler = VectorAssembler(inputCols=inputCols, outputCol="features")

    # Select output column for linear regression
    lr = LinearRegression(featuresCol="features", labelCol=outputCol)

    # The following lines (pipeline creation and fitting) replace these two commented-out lines.
    # vecTrainDF = vecAssembler.transform(trainDF)
    # lrModel = lr.fit(vecTrainDF)
    pipeline = Pipeline(stages=[vecAssembler, lr])
    pipelineModel = pipeline.fit(trainDF)
    return pipelineModel
    
pipelineModel = generateLRModel(inputCols, "m2_total_spend", trainDF)

CPU times: user 8.21 ms, sys: 3.4 ms, total: 11.6 ms
Wall time: 11.9 s


### View the model information

Print out the model coefficients and view the RMSE and R^2. We define the functions `modelInfo(inputCols, pipelineModel)` and `getEvaluationMetrics(pipelineModel,outputCol,testDF)` to report this information.

In [4]:
def modelInfo(inputCols, pipelineModel):
    # Create a zipped list containing the coefficients and the data
    modelCols = copy.deepcopy(inputCols)
    modelCoeffs = list(pipelineModel.stages[-1].coefficients)
    modelCoeffs.insert(0,pipelineModel.stages[-1].intercept)
    modelCols.insert(0,"intercept")
    modelZippedList = list(map(list, zip(modelCols, modelCoeffs)))

    # Create the pandas DataFrame
    modelDF = pd.DataFrame(modelZippedList, columns = ['Column name', 'Coefficient'])
    return modelDF

print("Model coefficients")
print(modelInfo(inputCols, pipelineModel))


Model coefficients
          Column name  Coefficient
0           intercept -2041.194264
1      m1_total_spend     5.120766
2     m1_total_events   274.648877
3  m1_purchase_events -1424.596141
4    m1_user_sessions  -297.351526
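
The pairing of column names with coefficients in `modelInfo` can also be sketched without mutating any list. This is only an illustrative alternative in plain Python; `coefficient_table` is a hypothetical helper, and the numbers are copied from the printed output above.

```python
def coefficient_table(input_cols, coefficients, intercept):
    """Pair each name with its value, putting the intercept first."""
    names = ["intercept"] + list(input_cols)
    values = [intercept] + list(coefficients)
    return list(zip(names, values))

# Values copied from the "Model coefficients" output above
input_cols = ["m1_total_spend", "m1_total_events", "m1_purchase_events", "m1_user_sessions"]
coefficients = [5.120766, 274.648877, -1424.596141, -297.351526]
intercept = -2041.194264

for name, value in coefficient_table(input_cols, coefficients, intercept):
    print(f"{name:>20} {value:14.6f}")
```

Building new lists instead of calling `insert` on a copy means `copy.deepcopy` is not needed, since `input_cols` is never modified.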


In [5]:
def getEvaluationMetrics(pipelineModel,outputCol,testDF):
    predDF = pipelineModel.transform(testDF)
    predDF.select(outputCol, "prediction").show(10)

    # Root mean squared error on the test set
    regressionEvaluator = RegressionEvaluator(
        predictionCol="prediction",
        labelCol=outputCol,
        metricName="rmse")
    rmse = regressionEvaluator.evaluate(predDF)

    # Coefficient of determination on the test set
    regressionEvaluator = RegressionEvaluator(
        predictionCol="prediction",
        labelCol=outputCol,
        metricName="r2")
    r2 = regressionEvaluator.evaluate(predDF)
    
    return rmse, r2

evaluationMetrics = getEvaluationMetrics(pipelineModel,"m2_total_spend",testDF)
print(f"RMSE is {evaluationMetrics[0]:.1f}")
print(f"R^2 is {evaluationMetrics[1]:.5f}")

+--------------+-------------------+
|m2_total_spend|         prediction|
+--------------+-------------------+
|          0.00|-2063.8969131678823|
|          0.00|-2063.8969131678823|
|          0.00|-2063.8969131678823|
|          0.00|-2086.5995624395064|
|          0.00|-2086.5995624395064|
|          0.00| -781.4631257672772|
|          0.00|-1811.9506855420864|
|          0.00|-2063.8969131678823|
|          0.00| -2132.004860982755|
|          0.00|-1857.3559840853345|
+--------------+-------------------+
only showing top 10 rows

RMSE is 121897.8
R^2 is 0.82886
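
To make the two reported metrics concrete, here is a minimal plain-Python sketch of the textbook definitions of RMSE and R^2. We assume these match what `RegressionEvaluator` reports for `metricName="rmse"` and `metricName="r2"`; the toy values are made up and are not taken from the dataset.

```python
import math

def rmse(labels, predictions):
    """Root mean squared error: sqrt(mean((y - y_hat)^2))."""
    n = len(labels)
    return math.sqrt(sum((y - p) ** 2 for y, p in zip(labels, predictions)) / n)

def r2(labels, predictions):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(labels) / len(labels)
    ss_res = sum((y - p) ** 2 for y, p in zip(labels, predictions))
    ss_tot = sum((y - mean_y) ** 2 for y in labels)
    return 1.0 - ss_res / ss_tot

# Made-up toy values for illustration only
toy_labels = [0.0, 10.0, 20.0, 30.0]
toy_predictions = [2.0, 8.0, 22.0, 28.0]

print(f"RMSE is {rmse(toy_labels, toy_predictions):.1f}")
print(f"R^2 is {r2(toy_labels, toy_predictions):.5f}")
```

RMSE is expressed in the units of the label column, so it depends on the scale of `m2_total_spend`, whereas R^2 is unitless.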


### Train linear regression model using log-transformed features

Now that we have used our new functions to test the model accuracy on untransformed features and untransformed output, let's retrain the linear regression model on log-transformed features and/or log-transformed output. First we create these new columns; we add 0.001 before taking the log so that zero values stay defined.

In [6]:
trainDF = trainDF \
          .withColumn("m1_total_spend_log", log(col("m1_total_spend")+0.001)) \
          .withColumn("m1_total_events_log", log(col("m1_total_events")+0.001)) \
          .withColumn("m1_purchase_events_log", log(col("m1_purchase_events")+0.001)) \
          .withColumn("m1_user_sessions_log", log(col("m1_user_sessions")+0.001)) \
          .withColumn("m2_total_spend_log", log(col("m2_total_spend")+0.001))

testDF = testDF \
          .withColumn("m1_total_spend_log", log(col("m1_total_spend")+0.001)) \
          .withColumn("m1_total_events_log", log(col("m1_total_events")+0.001)) \
          .withColumn("m1_purchase_events_log", log(col("m1_purchase_events")+0.001)) \
          .withColumn("m1_user_sessions_log", log(col("m1_user_sessions")+0.001)) \
          .withColumn("m2_total_spend_log", log(col("m2_total_spend")+0.001))

trainDF.show(2)
testDF.show(2)


+--------+--------------+--------------+---------------+------------------+----------------+------------------+-------------------+----------------------+--------------------+------------------+
| user_id|m2_total_spend|m1_total_spend|m1_total_events|m1_purchase_events|m1_user_sessions|m1_total_spend_log|m1_total_events_log|m1_purchase_events_log|m1_user_sessions_log|m2_total_spend_log|
+--------+--------------+--------------+---------------+------------------+----------------+------------------+-------------------+----------------------+--------------------+------------------+
|22165363|          0.00|          0.00|              2|                 0|               2|-6.907755278982137| 0.6936470556015963|    -6.907755278982137|  0.6936470556015963|-6.907755278982137|
|32978429|          0.00|          0.00|              8|                 0|               2|-6.907755278982137|  2.079566533867987|    -6.907755278982137|  0.6936470556015963|-6.907755278982137|
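+--------+--------------+--------------+---------------+------------------+----------------+------------------+-------------------+----------------------+--------------------+------------------+
only showing top 2 rows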
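
The values in the `_log` columns follow directly from the 0.001 offset used in the cell above. The sketch below reproduces the arithmetic in plain Python, using the natural logarithm as `pyspark.sql.functions.log` does when given a single column; `log_with_offset` is a hypothetical helper name used only for illustration.

```python
import math

OFFSET = 0.001  # same constant that is added before log() in the cell above

def log_with_offset(value, offset=OFFSET):
    """Natural log of (value + offset); defined even when value is 0."""
    return math.log(value + offset)

print(log_with_offset(0))  # zero spend maps to log(0.001), about -6.9078
print(log_with_offset(2))  # 2 events map to log(2.001), about 0.6936
print(log_with_offset(8))  # 8 events map to log(8.001), about 2.0796
```

Every zero in the raw columns therefore becomes the same large negative value, which is why `-6.907755278982137` appears so often in the table.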

### Test out different combinations of log-transformed features and outputs

Here we test four additional formulations of the dataset (models 2 to 5 below) in order to evaluate the linear regression model performance.

Total tested model formulations:
1. Normal inputs, normal output: `RMSE is 121897.8, R^2 is 0.82886`
2. Log-transformed inputs, normal output: `RMSE is 292861.0, R^2 is 0.01218`
3. Log-transformed inputs, log-transformed output: `RMSE is 3.1, R^2 is 0.20528` (note: cannot be compared to non-log-transformed outputs)
4. Normal inputs, log-transformed output: `RMSE is 3.5, R^2 is 0.01118` (note: cannot be compared to non-log-transformed outputs)
5. Normal and log-transformed inputs, normal output: `RMSE is 121674.3, R^2 is 0.82949`

Because it has the (slightly) larger R^2 and smaller RMSE among the models with untransformed output, we suggest adopting the last model (5). Ordinary least-squares linear regression has no hyperparameters as such, and we left Spark's `LinearRegression` settings at their defaults, but we did make the pre-training choice to include log-transformed features alongside the original ones. In the future, it would be useful to do a more in-depth feature selection (perhaps stepwise feature selection) and to use AIC to determine model complexity.
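
As a rough illustration of the AIC idea mentioned above, the sketch below uses the common least-squares form `n * ln(RSS / n) + 2k`, which holds up to an additive constant under the assumption of Gaussian errors. The function name `gaussian_aic` and all numbers are hypothetical; the notebook does not compute a training residual sum of squares, so nothing here is a result from the dataset.

```python
import math

def gaussian_aic(rss, n, num_params):
    """AIC (up to an additive constant) for a least-squares fit.

    rss        -- residual sum of squares on the training data
    n          -- number of training rows
    num_params -- number of fitted parameters (coefficients plus intercept)
    """
    return n * math.log(rss / n) + 2 * num_params

# Hypothetical numbers for illustration only
n_rows = 1000
aic_small = gaussian_aic(rss=5200.0, n=n_rows, num_params=5)  # 4 features + intercept
aic_large = gaussian_aic(rss=5150.0, n=n_rows, num_params=9)  # 8 features + intercept

# Lower AIC is preferred: extra features must reduce RSS enough to pay the 2k penalty
print(f"AIC (4 features): {aic_small:.2f}")
print(f"AIC (8 features): {aic_large:.2f}")
```

The comparison is only meaningful between models fitted to the same label column on the same rows.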

In [7]:
print("** Log-transformed inputs, normal output **")
inputCols = ["m1_total_spend_log","m1_total_events_log","m1_purchase_events_log","m1_user_sessions_log"]
pipelineModel = generateLRModel(inputCols, "m2_total_spend", trainDF)

print("Model coefficients")
print(modelInfo(inputCols, pipelineModel))

evaluationMetrics = getEvaluationMetrics(pipelineModel,"m2_total_spend",testDF)
print(f"RMSE is {evaluationMetrics[0]:.1f}")
print(f"R^2 is {evaluationMetrics[1]:.5f}")

** Log-transformed inputs, normal output **
Model coefficients
              Column name    Coefficient
0               intercept  460262.326282
1      m1_total_spend_log  -89945.506680
2     m1_total_events_log   12206.931534
3  m1_purchase_events_log  158480.854321
4    m1_user_sessions_log   -8806.028622
+--------------+-------------------+
|m2_total_spend|         prediction|
+--------------+-------------------+
|          0.00|-13159.683979370282|
|          0.00|-13159.683979370282|
|          0.00|-13159.683979370282|
|          0.00| -10804.05689219205|
|          0.00| -10804.05689219205|
|          0.00|  771.1186771667562|
|          0.00| -5856.605719513434|
|          0.00|-13159.683979370282|
|          0.00|  -8447.58053492289|
|          0.00| -5724.292691155686|
+--------------+-------------------+
only showing top 10 rows

RMSE is 292861.0
R^2 is 0.01218


In [8]:
print("** Log-transformed inputs, log-transformed output **")
inputCols = ["m1_total_spend_log","m1_total_events_log","m1_purchase_events_log","m1_user_sessions_log"]
pipelineModel = generateLRModel(inputCols, "m2_total_spend_log", trainDF)

print("Model coefficients")
print(modelInfo(inputCols, pipelineModel))

evaluationMetrics = getEvaluationMetrics(pipelineModel,"m2_total_spend_log",testDF)
print(f"RMSE is {evaluationMetrics[0]:.1f}")
print(f"R^2 is {evaluationMetrics[1]:.5f}")

** Log-transformed inputs, log-transformed output **
Model coefficients
              Column name  Coefficient
0               intercept     1.503228
1      m1_total_spend_log    -1.317085
2     m1_total_events_log     0.920986
3  m1_purchase_events_log     2.590074
4    m1_user_sessions_log    -0.666778
+------------------+-------------------+
|m2_total_spend_log|         prediction|
+------------------+-------------------+
|-6.907755278982137| -7.290018879251713|
|-6.907755278982137| -7.290018879251713|
|-6.907755278982137| -7.290018879251713|
|-6.907755278982137|-7.1139419292199975|
|-6.907755278982137|-7.1139419292199975|
|-6.907755278982137| -6.242802257791032|
|-6.907755278982137| -6.740667570247851|
|-6.907755278982137| -7.290018879251713|
|-6.907755278982137| -6.937801498475139|
|-6.907755278982137| -6.732335393060614|
+------------------+-------------------+
only showing top 10 rows

RMSE is 3.1
R^2 is 0.20528
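
These metrics are on the log scale, so they cannot be set side by side with the metrics of the untransformed-output models. One possible way to compare, sketched here in plain Python, is to map the predictions back with `exp(prediction) - 0.001` and recompute the RMSE against the raw label. The helper names and toy values are assumptions for illustration, not part of the notebook.

```python
import math

OFFSET = 0.001  # must match the offset used when the log columns were created

def to_original_scale(log_predictions, offset=OFFSET):
    """Invert log(y + offset): y = exp(log_y) - offset."""
    return [math.exp(p) - offset for p in log_predictions]

def rmse(labels, predictions):
    """Root mean squared error between two equal-length sequences."""
    squared_errors = [(y - p) ** 2 for y, p in zip(labels, predictions)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Made-up toy values for illustration only (not taken from the dataset)
toy_spend = [0.0, 0.0, 50.0]
toy_log_predictions = [-7.0, -6.5, 3.5]

toy_log_labels = [math.log(y + OFFSET) for y in toy_spend]
print(f"RMSE on the log scale: {rmse(toy_log_labels, toy_log_predictions):.3f}")
print(f"RMSE on the original scale: {rmse(toy_spend, to_original_scale(toy_log_predictions)):.3f}")
```

A simple back-transform like this tends to underestimate the mean on the original scale (retransformation bias), so it should be treated as a rough comparison only.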


In [9]:
print("** Normal inputs, log-transformed output **")
inputCols = ["m1_total_spend","m1_total_events","m1_purchase_events","m1_user_sessions"]
pipelineModel = generateLRModel(inputCols, "m2_total_spend_log", trainDF)

print("Model coefficients")
print(modelInfo(inputCols, pipelineModel))

evaluationMetrics = getEvaluationMetrics(pipelineModel,"m2_total_spend_log",testDF)
print(f"RMSE is {evaluationMetrics[0]:.1f}")
print(f"R^2 is {evaluationMetrics[1]:.5f}")

** Normal inputs, log-transformed output **
Model coefficients
          Column name  Coefficient
0           intercept    -6.116609
1      m1_total_spend     0.000002
2     m1_total_events     0.000767
3  m1_purchase_events    -0.003722
4    m1_user_sessions     0.002832
+------------------+-------------------+
|m2_total_spend_log|         prediction|
+------------------+-------------------+
|-6.907755278982137| -6.113010575597168|
|-6.907755278982137| -6.113010575597168|
|-6.907755278982137| -6.113010575597168|
|-6.907755278982137| -6.109411936212896|
|-6.907755278982137| -6.109411936212896|
|-6.907755278982137|  -6.09478209746954|
|-6.907755278982137| -6.108645152094788|
|-6.907755278982137| -6.113010575597168|
|-6.907755278982137| -6.102214657444352|
|-6.907755278982137|-6.1014478733262445|
+------------------+-------------------+
only showing top 10 rows

RMSE is 3.5
R^2 is 0.01118


In [10]:
print("** Normal and log-transformed inputs, normal output **")
inputCols = ["m1_total_spend","m1_total_events","m1_purchase_events","m1_user_sessions",
             "m1_total_spend_log","m1_total_events_log","m1_purchase_events_log","m1_user_sessions_log"]
pipelineModel = generateLRModel(inputCols, "m2_total_spend", trainDF)

print("Model coefficients")
print(modelInfo(inputCols, pipelineModel))

evaluationMetrics = getEvaluationMetrics(pipelineModel,"m2_total_spend",testDF)
print(f"RMSE is {evaluationMetrics[0]:.1f}")
print(f"R^2 is {evaluationMetrics[1]:.5f}")

** Normal and log-transformed inputs, normal output **
Model coefficients
              Column name  Coefficient
0               intercept  1791.628848
1          m1_total_spend     5.122784
2         m1_total_events   275.439568
3      m1_purchase_events -1428.450224
4        m1_user_sessions  -285.095136
5      m1_total_spend_log   371.336103
6     m1_total_events_log -2251.076418
7  m1_purchase_events_log  -334.958025
8    m1_user_sessions_log  -614.021730
+--------------+-------------------+
|m2_total_spend|         prediction|
+--------------+-------------------+
|          0.00| 1527.8187484631812|
|          0.00| 1527.8187484631812|
|          0.00| 1527.8187484631812|
|          0.00| -466.3400480942919|
|          0.00| -466.3400480942919|
|          0.00|-3302.6147649497266|
|          0.00|-1103.2584007598834|
|          0.00| 1527.8187484631812|
|          0.00| -2470.869882475308|
|          0.00|-2697.6309727164603|
+--------------+-------------------+
only showing top 10 rows

RMSE is 121674.3
R^2 is 0.82949


### Save pipeline model and get model size

The model size is 14.0 kB, according to the file explorer in Linux.

In [11]:
pipelinePath = "models/lr-pipeline-model"
pipelineModel.write().overwrite().save(pipelinePath)
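
Since the pipeline model is saved as a directory rather than a single file, its size can also be measured from Python instead of the file explorer. The following is a minimal sketch using only the standard library; `directory_size_bytes` is a hypothetical helper, and the total it reports (sum of file sizes) may differ slightly from what a file explorer shows as disk usage.

```python
import os

def directory_size_bytes(path):
    """Sum the sizes of all regular files under path, recursively."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total

pipelinePath = "models/lr-pipeline-model"

# Only report a size if the model has actually been saved at this location
if os.path.isdir(pipelinePath):
    print(f"Model size: {directory_size_bytes(pipelinePath) / 1000:.1f} kB")
```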