### File 06 - PCA Linear Regression

In this file, we create a small ML pipeline based on the output from File 02 (Feature creation).

We again create a linear regression model, but this time we use the PCA features that were created in File 02. We compare the results across several values of k, the number of principal components.

In [1]:
%%time
from pyspark.sql import SparkSession
from pyspark.sql import types as T
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression, LinearRegressionModel, LinearRegressionSummary
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
from pyspark.sql.functions import col, log
from pyspark.ml.stat import Correlation

import pandas as pd
import numpy as np
import copy

spark = SparkSession.builder \
        .appName("project") \
        .getOrCreate()

sc = spark.sparkContext

CPU times: user 504 ms, sys: 415 ms, total: 919 ms
Wall time: 5.75 s


### Read in dataframes for train and test sets

This data should have been previously generated: we can find it in the `processed_data` folder.

In [2]:
%%time
trainDF = spark.read.parquet("./processed_data/train.parquet")
testDF = spark.read.parquet("./processed_data/test.parquet")
# trainDF.show(5)

CPU times: user 1.49 ms, sys: 3.58 ms, total: 5.06 ms
Wall time: 3.34 s


### Set up Spark ML pipeline training for linear regression

Here we decide which input columns should be used in order to create our training pipeline. To implement this step, we create the function `generatePipeline(inputCols, outputCol)`. Then, we train the pipeline using this function.

In [3]:
%%time

inputCols = ["total_spend","total_events","purchase_events","total_sessions"]

def generatePipeline(inputCols, outputCol):
    # Select input columns for linear regression
    vecAssembler = VectorAssembler(inputCols=inputCols, outputCol="features")

    # Select output column for linear regression
    lr = LinearRegression(featuresCol="features", labelCol=outputCol)

    # Wrapping both stages in a Pipeline replaces calling
    # vecAssembler.transform(trainDF) by hand before fitting the regression.

    pipeline = Pipeline(stages=[vecAssembler, lr])
    return pipeline
    
pipeline = generatePipeline(inputCols, "T_total_spend")
pipelineModel = pipeline.fit(trainDF)



CPU times: user 13.7 ms, sys: 5.73 ms, total: 19.4 ms
Wall time: 4.66 s


### View the model information

Print out the model coefficients and view the RMSE, R^2 and adjusted R^2. We define the functions `modelInfo(inputCols, pipelineModel)` and `getEvaluationMetrics(pipelineModel, outputCol, testDF, inputCols, k)` to report this information. p-values are not reported in this file because they do not work with the PCA features.

In [4]:
def modelInfo(inputCols, pipelineModel):
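    # Create a zipped list pairing each column name with its coefficient.
    # Note: zip() stops at the shorter list, so a vector column (such as a PCA
    # feature) is listed with only the first of its coefficients.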
    modelCols = copy.deepcopy(inputCols)
    modelCoeffs = list(pipelineModel.stages[-1].coefficients)
    modelCoeffs.insert(0,pipelineModel.stages[-1].intercept)
    modelCols.insert(0,"intercept")
    modelZippedList = list(map(list, zip(modelCols, modelCoeffs)))
    
    # p-values don't work with PCA
    #pvals = pipelineModel.stages[-1].summary.pValues
    
    # Create the pandas DataFrame
    modelDF = pd.DataFrame(modelZippedList, columns = ['Column name', 'Coefficient'])
    # modelDF['pValues'] = pvals
    return modelDF
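
A vector column such as `pca_purchases10` is flattened by `VectorAssembler` into k slots, so the fitted model holds k coefficients for it while `inputCols` holds a single name. A fuller coefficient table would need one name per slot. The sketch below shows one way to build such a list in plain Python. The helper `expand_feature_names` and its `vector_sizes` argument are hypothetical and not part of the notebook, and the `_0`, `_1`, ... suffixes are only an illustrative naming choice.

```python
def expand_feature_names(input_cols, vector_sizes):
    """Return one name per slot of the assembled feature vector.

    vector_sizes maps a vector-typed column name to its length
    (hypothetical argument); columns not listed are treated as scalars.
    """
    names = []
    for col_name in input_cols:
        size = vector_sizes.get(col_name, 1)
        if size == 1:
            # Scalar column: keep the name unchanged
            names.append(col_name)
        else:
            # Vector column: one suffixed name per component
            names.extend(f"{col_name}_{i}" for i in range(size))
    return names


names = expand_feature_names(
    ["total_spend", "pca_purchases10"],
    {"pca_purchases10": 10},
)
print(len(names))   # 1 scalar + 10 PCA components = 11
print(names[:3])    # ['total_spend', 'pca_purchases10_0', 'pca_purchases10_1']
```

Passing the expanded list to `modelInfo` in place of `inputCols` should then pair every coefficient with a name, assuming the sizes given match the actual vector lengths.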


In [5]:
# Calculate adjusted R^2 (https://towardsdatascience.com/machine-learning-linear-regression-using-pyspark-9d5d5c772b42)
# The default R^2 metric does not account for the number of predictors, so this function computes the adjusted R^2 by hand.
# k is the number of components in a PCA vector column (0 when no PCA column is used).
def adj_r2(r2, inputCols, testDF, k = 0):
    n = testDF.count()
    if k == 0:
        p = len(inputCols)
    else: 
        p = len(inputCols) + k - 1
    
    adjusted_r2 = 1-(((1-r2)*(n-1))/(n-p-1))
    return adjusted_r2
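
To make the role of k concrete, here is a small standalone sketch of the same arithmetic as `adj_r2`, without Spark. The values of `r2` and `n` are made up for illustration and are not results from this project. The helper names `predictor_count` and `adjusted_r2` are hypothetical.

```python
def predictor_count(num_input_cols, k=0):
    # A PCA vector column counts as k predictors instead of 1,
    # which is where the "+ k - 1" in adj_r2 comes from.
    return num_input_cols if k == 0 else num_input_cols + k - 1


def adjusted_r2(r2, n, p):
    # Same formula as adj_r2: 1 - (1 - R^2)(n - 1) / (n - p - 1)
    return 1 - ((1 - r2) * (n - 1)) / (n - p - 1)


# Illustrative values only (not taken from the notebook output)
r2, n = 0.60, 1000
for k in (0, 10, 100):
    p = predictor_count(1, k)
    print(f"k={k:3d}  p={p:3d}  adjusted R^2={adjusted_r2(r2, n, p):.5f}")
```

For a fixed R^2 and sample size, the adjusted value shrinks as k grows, so a larger k has to earn its place through a better raw fit.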

In [6]:
def getEvaluationMetrics(pipelineModel,outputCol,testDF,inputCols,k):
    predDF = pipelineModel.transform(testDF)
    predDF.select(outputCol, "prediction").show(10)
    
    print(predDF)
    
    regressionEvaluator = RegressionEvaluator(
        predictionCol="prediction",
        labelCol=outputCol,
        metricName="rmse")
    rmse = regressionEvaluator.evaluate(predDF)

    regressionEvaluator = RegressionEvaluator(
        predictionCol="prediction",
        labelCol=outputCol,
        metricName="r2")
    r2 = regressionEvaluator.evaluate(predDF)
      
    # Manually calculate Adjusted r2
    adjusted_r2 = adj_r2(r2, inputCols, testDF, k)
    
    return rmse, r2, adjusted_r2
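
For reference, the two metrics requested from `RegressionEvaluator` can be written out by hand. The sketch below uses the standard definitions (root mean squared error, and one minus the ratio of residual to total sum of squares), which to our understanding match the evaluator's default behavior. The toy labels and predictions are invented for illustration.

```python
import math


def rmse(labels, predictions):
    # Root of the mean squared difference between label and prediction
    squared_errors = [(y - p) ** 2 for y, p in zip(labels, predictions)]
    return math.sqrt(sum(squared_errors) / len(labels))


def r2(labels, predictions):
    # 1 - (residual sum of squares / total sum of squares)
    mean_y = sum(labels) / len(labels)
    ss_res = sum((y - p) ** 2 for y, p in zip(labels, predictions))
    ss_tot = sum((y - mean_y) ** 2 for y in labels)
    return 1 - ss_res / ss_tot


# Toy data, not taken from the project
labels = [0.0, 10.0, 20.0, 30.0]
predictions = [2.0, 8.0, 22.0, 28.0]

print(rmse(labels, predictions))          # 2.0
print(round(r2(labels, predictions), 3))  # 0.968
```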


### PCA
First, we'll see how each PCA variant performs on its own. We've created four variations with different k values: 10, 20, 50, and 100. Our PCA represents the categories of purchases that customers made in month one. 

In [7]:
print("** PCA 10 **")
inputCols = ["pca_purchases10"]

# sd_session_length breaks it, as does sd_interactions_per_session, sd_purchases_per_session
outputCol = "T_total_spend"

pipeline = generatePipeline(inputCols, outputCol)
pipelineModel = pipeline.fit(trainDF)
    
print("Model coefficients")
print(modelInfo(inputCols, pipelineModel))

evaluationMetrics = getEvaluationMetrics(pipelineModel,outputCol,testDF, inputCols, 10)
print(f"RMSE is {evaluationMetrics[0]:.1f}")
print(f"R^2 is {evaluationMetrics[1]:.5f}")
print(f"Adjusted R^2 is {evaluationMetrics[2]:.5f}")

** PCA 10 **
Model coefficients
       Column name  Coefficient
0        intercept  4249.783142
1  pca_purchases10 -6018.619738
+-----------------+------------------+
|    T_total_spend|        prediction|
+-----------------+------------------+
|              0.0|16222.022803958636|
|              0.0|14165.292918462947|
|              0.0| 7054.707380808632|
|              0.0| 4261.523596944663|
|              0.0| 4443.930340058431|
|              0.0|4843.5200384033615|
|              0.0|14165.292918462947|
|              0.0| 9207.538030410084|
|79118.00024795532|139349.27246711947|
|663.1999969482422|10506.865630229699|
+-----------------+------------------+
only showing top 10 rows

DataFrame[user_id: int, T_total_spend: double, total_spend: double, total_events: bigint, total_sessions: bigint, avg_session_length: double, sd_session_length: double, avg_interactions_per_session: double, sd_interactions_per_session: double, max_interactions_per_session: bigint, purchase_pct_of_to

In [8]:
print("** PCA 20 **")
inputCols = ["pca_purchases20"]

# sd_session_length breaks it, as does sd_interactions_per_session, sd_purchases_per_session
outputCol = "T_total_spend"

pipeline = generatePipeline(inputCols, outputCol)
pipelineModel = pipeline.fit(trainDF)
    
print("Model coefficients")
print(modelInfo(inputCols, pipelineModel))

evaluationMetrics = getEvaluationMetrics(pipelineModel,outputCol,testDF, inputCols, 20)
print(f"RMSE is {evaluationMetrics[0]:.1f}")
print(f"R^2 is {evaluationMetrics[1]:.5f}")
print(f"Adjusted R^2 is {evaluationMetrics[2]:.5f}")

** PCA 20 **
Model coefficients
       Column name  Coefficient
0        intercept  3302.769792
1  pca_purchases20 -6090.548809
+-----------------+------------------+
|    T_total_spend|        prediction|
+-----------------+------------------+
|              0.0|14924.016233146831|
|              0.0|13261.655393385969|
|              0.0|25212.340462443914|
|              0.0| 3423.800684836333|
|              0.0|3529.2899271059546|
|              0.0| 4945.947730154665|
|              0.0|13261.655393385969|
|              0.0| 8282.212592473055|
|79118.00024795532|133486.25812927008|
|663.1999969482422| 9460.859597112536|
+-----------------+------------------+
only showing top 10 rows

DataFrame[user_id: int, T_total_spend: double, total_spend: double, total_events: bigint, total_sessions: bigint, avg_session_length: double, sd_session_length: double, avg_interactions_per_session: double, sd_interactions_per_session: double, max_interactions_per_session: bigint, purchase_pct_of_to

In [9]:
print("** PCA 50 **")
inputCols = ["pca_purchases50"]

# sd_session_length breaks it, as does sd_interactions_per_session, sd_purchases_per_session
outputCol = "T_total_spend"

pipeline = generatePipeline(inputCols, outputCol)
pipelineModel = pipeline.fit(trainDF)
    
print("Model coefficients")
print(modelInfo(inputCols, pipelineModel))

evaluationMetrics = getEvaluationMetrics(pipelineModel,outputCol,testDF, inputCols, 50)
print(f"RMSE is {evaluationMetrics[0]:.1f}")
print(f"R^2 is {evaluationMetrics[1]:.5f}")
print(f"Adjusted R^2 is {evaluationMetrics[2]:.5f}")

** PCA 50 **
Model coefficients
       Column name  Coefficient
0        intercept  2243.838346
1  pca_purchases50 -6164.415169
+-----------------+------------------+
|    T_total_spend|        prediction|
+-----------------+------------------+
|              0.0| 12467.35327157172|
|              0.0| 12296.52457444936|
|              0.0|20367.420154728115|
|              0.0|2645.8068345480833|
|              0.0|3691.1328516170292|
|              0.0| 16615.53500024911|
|              0.0| 12296.52457444936|
|              0.0| 7270.181460086474|
|79118.00024795532|123987.13502881095|
|663.1999969482422| 8260.856816430161|
+-----------------+------------------+
only showing top 10 rows

DataFrame[user_id: int, T_total_spend: double, total_spend: double, total_events: bigint, total_sessions: bigint, avg_session_length: double, sd_session_length: double, avg_interactions_per_session: double, sd_interactions_per_session: double, max_interactions_per_session: bigint, purchase_pct_of_to

In [10]:
print("** PCA 100 **")
inputCols = ["pca_purchases100"]

# sd_session_length breaks it, as does sd_interactions_per_session, sd_purchases_per_session
outputCol = "T_total_spend"

pipeline = generatePipeline(inputCols, outputCol)
pipelineModel = pipeline.fit(trainDF)
    
print("Model coefficients")
print(modelInfo(inputCols, pipelineModel))

evaluationMetrics = getEvaluationMetrics(pipelineModel,outputCol,testDF, inputCols, 100)
print(f"RMSE is {evaluationMetrics[0]:.1f}")
print(f"R^2 is {evaluationMetrics[1]:.5f}")
print(f"Adjusted R^2 is {evaluationMetrics[2]:.5f}")

** PCA 100 **
Model coefficients
        Column name  Coefficient
0         intercept  1687.288407
1  pca_purchases100 -6209.014438
+-----------------+------------------+
|    T_total_spend|        prediction|
+-----------------+------------------+
|              0.0| 11876.49285967455|
|              0.0|11847.232049525555|
|              0.0|23495.729228865268|
|              0.0| 18688.53330831621|
|              0.0| 15815.21653862747|
|              0.0|14739.728192962637|
|              0.0|11847.232049525555|
|              0.0| 6767.260228149837|
|79118.00024795532| 123238.8723546263|
|663.1999969482422|7709.3533120768025|
+-----------------+------------------+
only showing top 10 rows

DataFrame[user_id: int, T_total_spend: double, total_spend: double, total_events: bigint, total_sessions: bigint, avg_session_length: double, sd_session_length: double, avg_interactions_per_session: double, sd_interactions_per_session: double, max_interactions_per_session: bigint, purchase_pct_o

#### The adjusted R-squared is highest for the PCA with the largest k value, 100. However, all four are very close: a very small sacrifice in performance buys a large reduction in dimensionality.
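
The cells that compare k values differ only in the PCA column name and in k, so they could be collapsed into a loop. The sketch below is one possible shape for that loop, not the code that produced the output above. `compare_pca_variants` and its `fit_and_evaluate` callback are hypothetical names, and a stand-in callback with placeholder numbers is used so the example runs without Spark.

```python
def compare_pca_variants(k_values, fit_and_evaluate, base_cols=()):
    """Run one experiment per k and collect the metrics in a list of dicts."""
    results = []
    for k in k_values:
        # Column names follow the notebook's pattern: pca_purchases10, ...
        input_cols = list(base_cols) + [f"pca_purchases{k}"]
        rmse, r2, adjusted_r2 = fit_and_evaluate(input_cols, k)
        results.append(
            {"k": k, "rmse": rmse, "r2": r2, "adjusted_r2": adjusted_r2}
        )
    return results


# In the notebook, the callback could wrap the existing helpers, e.g.:
#   def fit_and_evaluate(input_cols, k):
#       model = generatePipeline(input_cols, "T_total_spend").fit(trainDF)
#       return getEvaluationMetrics(model, "T_total_spend", testDF, input_cols, k)

# Stand-in callback: returns placeholder metrics, not real results
def fake_fit_and_evaluate(input_cols, k):
    return (0.0, 0.0, 0.0)


for row in compare_pca_variants([10, 20, 50, 100], fake_fit_and_evaluate):
    print(row)
```

Passing the non-PCA feature names as `base_cols` would cover the "PCA + all other features" runs with the same loop, and the list of dicts can be handed to `pd.DataFrame` for a side-by-side table.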

In [11]:
print("** PCA 10 + all other features **")
inputCols = ["total_spend","total_events", "total_sessions", "avg_session_length", "avg_interactions_per_session", "max_interactions_per_session",
             "purchase_pct_of_total_events", "view_pct_of_total_events", "cart_pct_of_total_events","avg_purchases_per_session", "cart_events", "purchase_events",
             "view_events", "sessions_with_purchase", "sessions_with_cart","sessions_with_view", "pct_sessions_end_purchase", "pct_sessions_end_cart", 'sd_session_length', 
             'sd_interactions_per_session', 'sd_purchases_per_session', "total_spend_log", "total_events_log", "purchase_events_log", "total_sessions_log",
            'pca_purchases10']

# sd_session_length breaks it, as does sd_interactions_per_session, sd_purchases_per_session
outputCol = "T_total_spend"

pipeline = generatePipeline(inputCols, outputCol)
pipelineModel = pipeline.fit(trainDF)
    
print("Model coefficients")
print(modelInfo(inputCols, pipelineModel))

evaluationMetrics = getEvaluationMetrics(pipelineModel,outputCol,testDF, inputCols, 10)
print(f"RMSE is {evaluationMetrics[0]:.1f}")
print(f"R^2 is {evaluationMetrics[1]:.5f}")
print(f"Adjusted R^2 is {evaluationMetrics[2]:.5f}")

** PCA 10 + all other features **
Model coefficients
                     Column name   Coefficient
0                      intercept  63423.298618
1                    total_spend      3.972754
2                   total_events    221.079053
3                 total_sessions   1453.161990
4             avg_session_length     -0.103075
5   avg_interactions_per_session   -242.107532
6   max_interactions_per_session     88.161212
7   purchase_pct_of_total_events -61019.196518
8       view_pct_of_total_events -48319.275145
9       cart_pct_of_total_events -56517.727969
10     avg_purchases_per_session  -4470.158508
11                   cart_events    -33.960747
12               purchase_events  -2188.408853
13                   view_events   -195.934449
14        sessions_with_purchase  -2118.196622
15            sessions_with_cart    772.924790
16            sessions_with_view  -1132.466683
17     pct_sessions_end_purchase -14325.948417
18         pct_sessions_end_cart  -3588.031521
19     

In [12]:
print("** PCA 20 + all other features **")
inputCols = ["total_spend","total_events", "total_sessions", "avg_session_length", "avg_interactions_per_session", "max_interactions_per_session",
             "purchase_pct_of_total_events", "view_pct_of_total_events", "cart_pct_of_total_events","avg_purchases_per_session", "cart_events", "purchase_events",
             "view_events", "sessions_with_purchase", "sessions_with_cart","sessions_with_view", "pct_sessions_end_purchase", "pct_sessions_end_cart", 'sd_session_length', 
             'sd_interactions_per_session', 'sd_purchases_per_session', "total_spend_log", "total_events_log", "purchase_events_log", "total_sessions_log",
            'pca_purchases20']

# sd_session_length breaks it, as does sd_interactions_per_session, sd_purchases_per_session
outputCol = "T_total_spend"

pipeline = generatePipeline(inputCols, outputCol)
pipelineModel = pipeline.fit(trainDF)
    
print("Model coefficients")
print(modelInfo(inputCols, pipelineModel))

evaluationMetrics = getEvaluationMetrics(pipelineModel,outputCol,testDF, inputCols, 20)
print(f"RMSE is {evaluationMetrics[0]:.1f}")
print(f"R^2 is {evaluationMetrics[1]:.5f}")
print(f"Adjusted R^2 is {evaluationMetrics[2]:.5f}")

** PCA 20 + all other features **
Model coefficients
                     Column name   Coefficient
0                      intercept  64070.428142
1                    total_spend      3.971888
2                   total_events    221.183961
3                 total_sessions   1467.302454
4             avg_session_length     -0.101948
5   avg_interactions_per_session   -242.848078
6   max_interactions_per_session     87.440270
7   purchase_pct_of_total_events -61456.848064
8       view_pct_of_total_events -48755.388167
9       cart_pct_of_total_events -56867.865279
10     avg_purchases_per_session  -4522.307171
11                   cart_events    -36.308621
12               purchase_events  -2164.340779
13                   view_events   -195.819222
14        sessions_with_purchase  -2147.008530
15            sessions_with_cart    775.241793
16            sessions_with_view  -1143.978272
17     pct_sessions_end_purchase -14468.809940
18         pct_sessions_end_cart  -3614.575561
19     

In [13]:
print("** PCA 50 + all other features **")
inputCols = ["total_spend","total_events", "total_sessions", "avg_session_length", "avg_interactions_per_session", "max_interactions_per_session",
             "purchase_pct_of_total_events", "view_pct_of_total_events", "cart_pct_of_total_events","avg_purchases_per_session", "cart_events", "purchase_events",
             "view_events", "sessions_with_purchase", "sessions_with_cart","sessions_with_view", "pct_sessions_end_purchase", "pct_sessions_end_cart", 'sd_session_length', 
             'sd_interactions_per_session', 'sd_purchases_per_session', "total_spend_log", "total_events_log", "purchase_events_log", "total_sessions_log",
            'pca_purchases50']

# sd_session_length breaks it, as does sd_interactions_per_session, sd_purchases_per_session
outputCol = "T_total_spend"

pipeline = generatePipeline(inputCols, outputCol)
pipelineModel = pipeline.fit(trainDF)
    
print("Model coefficients")
print(modelInfo(inputCols, pipelineModel))

evaluationMetrics = getEvaluationMetrics(pipelineModel,outputCol,testDF, inputCols, 50)
print(f"RMSE is {evaluationMetrics[0]:.1f}")
print(f"R^2 is {evaluationMetrics[1]:.5f}")
print(f"Adjusted R^2 is {evaluationMetrics[2]:.5f}")

** PCA 50 + all other features **
Model coefficients
                     Column name   Coefficient
0                      intercept  64025.534032
1                    total_spend      3.962949
2                   total_events    221.505175
3                 total_sessions   1471.251424
4             avg_session_length     -0.096650
5   avg_interactions_per_session   -245.127099
6   max_interactions_per_session     84.540344
7   purchase_pct_of_total_events -61084.457760
8       view_pct_of_total_events -48648.969022
9       cart_pct_of_total_events -56786.703303
10     avg_purchases_per_session  -4596.230400
11                   cart_events    -34.680544
12               purchase_events  -2134.384139
13                   view_events   -196.033375
14        sessions_with_purchase  -2226.040924
15            sessions_with_cart    782.260437
16            sessions_with_view  -1155.352317
17     pct_sessions_end_purchase -14261.465905
18         pct_sessions_end_cart  -3645.229388
19     

In [14]:
print("** PCA 100 + all other features **")
inputCols = ["total_spend","total_events", "total_sessions", "avg_session_length", "avg_interactions_per_session", "max_interactions_per_session",
             "purchase_pct_of_total_events", "view_pct_of_total_events", "cart_pct_of_total_events","avg_purchases_per_session", "cart_events", "purchase_events",
             "view_events", "sessions_with_purchase", "sessions_with_cart","sessions_with_view", "pct_sessions_end_purchase", "pct_sessions_end_cart", 'sd_session_length', 
             'sd_interactions_per_session', 'sd_purchases_per_session', "total_spend_log", "total_events_log", "purchase_events_log", "total_sessions_log",
            'pca_purchases100']

# sd_session_length breaks it, as does sd_interactions_per_session, sd_purchases_per_session
outputCol = "T_total_spend"

pipeline = generatePipeline(inputCols, outputCol)
pipelineModel = pipeline.fit(trainDF)
    
print("Model coefficients")
print(modelInfo(inputCols, pipelineModel))

evaluationMetrics = getEvaluationMetrics(pipelineModel,outputCol,testDF, inputCols, 100)
print(f"RMSE is {evaluationMetrics[0]:.1f}")
print(f"R^2 is {evaluationMetrics[1]:.5f}")
print(f"Adjusted R^2 is {evaluationMetrics[2]:.5f}")

** PCA 100 + all other features **
Model coefficients
                     Column name   Coefficient
0                      intercept  64544.011873
1                    total_spend      3.966504
2                   total_events    221.610719
3                 total_sessions   1481.892329
4             avg_session_length     -0.098597
5   avg_interactions_per_session   -244.208334
6   max_interactions_per_session     85.977061
7   purchase_pct_of_total_events -61917.383850
8       view_pct_of_total_events -49176.515283
9       cart_pct_of_total_events -57237.630080
10     avg_purchases_per_session  -4541.246077
11                   cart_events    -37.863972
12               purchase_events  -1668.499650
13                   view_events   -195.915067
14        sessions_with_purchase  -2151.265483
15            sessions_with_cart    767.437917
16            sessions_with_view  -1156.984208
17     pct_sessions_end_purchase -14458.359573
18         pct_sessions_end_cart  -3614.741810
19    

#### Compared against a linear regression without the PCA features, the models with PCA perform worse. This is not just because the additional features are penalized in the adjusted R-squared. The feature we have encoded with PCA does not appear to be a good predictor.

In [18]:
print("** Champion Linear Regression Model with PCA (k=10) Added **")

inputCols = ["total_spend","total_events", "avg_session_length", "max_interactions_per_session", 
              "cart_pct_of_total_events","avg_purchases_per_session",  "purchase_events", 'sd_purchases_per_session',
              "view_events", "sessions_with_purchase", "sessions_with_view", "pct_sessions_end_purchase", "pct_sessions_end_cart", 'sd_session_length', 
            "total_spend_log","total_events_log", "purchase_events_log", "total_sessions_log",
            'pca_purchases10']

# sd_session_length breaks it, as does sd_interactions_per_session, sd_purchases_per_session
outputCol = "T_total_spend"

pipeline = generatePipeline(inputCols, outputCol)
pipelineModel = pipeline.fit(trainDF)
    
print("Model coefficients")
print(modelInfo(inputCols, pipelineModel))

evaluationMetrics = getEvaluationMetrics(pipelineModel,outputCol,testDF, inputCols, 10)
print(f"RMSE is {evaluationMetrics[0]:.1f}")
print(f"R^2 is {evaluationMetrics[1]:.5f}")
print(f"Adjusted R^2 is {evaluationMetrics[2]:.5f}")

** Champion Linear Regression Model with PCA (k=10) Added **
Model coefficients
                     Column name   Coefficient
0                      intercept  11428.893823
1                    total_spend      3.963334
2                   total_events    221.957675
3             avg_session_length     -0.181925
4   max_interactions_per_session     22.151790
5       cart_pct_of_total_events  -6886.856672
6      avg_purchases_per_session  -4976.049138
7                purchase_events  -2098.678209
8       sd_purchases_per_session  -6846.984863
9                    view_events   -196.568342
10        sessions_with_purchase  -1477.524897
11            sessions_with_view    344.505094
12     pct_sessions_end_purchase -14394.190953
13         pct_sessions_end_cart   1441.524961
14             sd_session_length      0.088787
15               total_spend_log   1521.500097
16              total_events_log   1066.565885
17           purchase_events_log   7531.094591
18            total_session

### When pca_purchases10 is added, the adjusted R^2 of this regression falls from 0.61617 to 0.61588. The raw R^2 value also falls.