## File 10 - Logistic + Linear Regression

In this file, we create a small ML pipeline based on the output from File 02 (Feature creation).

We first fit a logistic regression model that predicts whether a user spends at all, then fit a linear regression model on the users who did spend, and combine the two to predict spend for every user in the test set.

### Set up Spark session

We can pass more options to the `SparkSession` builder, but for now we keep the default settings.

In [1]:
%%time
from pyspark.sql import SparkSession
from pyspark.sql import types as T
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression, LinearRegressionModel, LinearRegressionSummary
from pyspark.ml.evaluation import RegressionEvaluator, BinaryClassificationEvaluator
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
from pyspark.sql.functions import col
from pyspark.sql.functions import log
from pyspark.ml.stat import Correlation
from pyspark.sql.functions import when
from pyspark.ml.classification import LogisticRegression
from handyspark import *
from matplotlib import pyplot as plt
from pyspark.mllib.evaluation import BinaryClassificationMetrics, MulticlassMetrics
from pyspark.sql.types import IntegerType, DoubleType
from pyspark.sql.functions import lit

import pandas as pd
import numpy as np
import copy

spark = SparkSession.builder \
        .appName("project") \
        .getOrCreate()

sc = spark.sparkContext

CPU times: user 949 ms, sys: 743 ms, total: 1.69 s
Wall time: 6.4 s


### Read in dataframes for train and test sets

This data should have been previously generated: we can find it in the `processed_data` folder.

In [2]:
%%time
trainDF = spark.read.parquet("./processed_data/train.parquet")
testDF = spark.read.parquet("./processed_data/test.parquet")
# trainDF.show(5)

CPU times: user 2.12 ms, sys: 2.18 ms, total: 4.3 ms
Wall time: 3.17 s


### Label trainDF and testDF with 0 if the user did not spend in m2 and 1 if they did

In [3]:
trainDF = trainDF.withColumn('logistic_T_total_spend', when(trainDF.T_total_spend == 0, 0)
                                                       .otherwise(1))
testDF = testDF.withColumn('logistic_T_total_spend', when(testDF.T_total_spend == 0, 0)
                                                       .otherwise(1))

trainDF = trainDF.withColumn('logistic_T_total_spend', col("logistic_T_total_spend").cast(DoubleType()))
testDF = testDF.withColumn('logistic_T_total_spend', col("logistic_T_total_spend").cast(DoubleType()))

### Set up Spark ML pipeline training for logistic regression

Here we decide which input columns to use and build the training pipeline. To implement this step, we create the function `generateLogisticPipeline(inputCols, outputCol, threshold)`. The pipeline itself is fit in a later cell, once `inputCols` is defined.

In [4]:
%%time

def generateLogisticPipeline(inputCols, outputCol, threshold):
    # Assemble the input columns into a single feature vector
    vecAssembler = VectorAssembler(inputCols=inputCols, outputCol="features")

    # Logistic regression on the assembled features; outputCol is the 0/1 label.
    # Note: rawPredictionCol holds the raw margins, not probabilities, despite the name used here.
    logR = LogisticRegression(featuresCol="features", labelCol=outputCol,
                              rawPredictionCol="Logistic_Probabilities",
                              predictionCol="Logistic_Predictions",
                              threshold=threshold)

    # Chaining both stages in a Pipeline replaces calling vecAssembler.transform(trainDF) by hand
    pipeline = Pipeline(stages=[vecAssembler, logR])
    return pipeline

# The pipeline is fit in a later cell, once inputCols and a threshold are chosen:
#pipeline = generateLogisticPipeline(inputCols, "logistic_T_total_spend", threshold)
#pipelineModel = pipeline.fit(trainDF)



CPU times: user 4 µs, sys: 0 ns, total: 4 µs
Wall time: 7.63 µs
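
The `threshold` argument controls how a predicted probability is turned into a class label. The plain-Python sketch below illustrates that decision rule outside of Spark. The helper name `apply_threshold` and the sample probabilities are hypothetical, and the strict `>` comparison is an assumption about how ties at the threshold are handled.

```python
def apply_threshold(probabilities, threshold=0.5):
    """Map positive-class probabilities to 0.0/1.0 labels.

    Assumption: a row is labelled 1.0 only when its probability is
    strictly greater than the threshold.
    """
    return [1.0 if p > threshold else 0.0 for p in probabilities]


# Made-up positive-class probabilities, for illustration only
sample_probs = [0.15, 0.55, 0.62, 0.90]

print(apply_threshold(sample_probs, threshold=0.5))  # [0.0, 1.0, 1.0, 1.0]
print(apply_threshold(sample_probs, threshold=0.6))  # [0.0, 0.0, 1.0, 1.0]
```

Raising the threshold makes the classifier more conservative about predicting a spender: the row with probability 0.55 flips from 1.0 to 0.0.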


### View the model information

Print out the model coefficients, then the confusion matrix and accuracy on the test set. We define the functions `modelInfo(inputCols, pipelineModel)` and `getLogisticEvaluationMetrics(pipelineModel, outputCol, testDF, inputCols)` to report this information, plus `adj_r2(r2, inputCols, testDF, k)` for the linear regression step later on.

In [5]:
def modelInfo(inputCols, pipelineModel):
    # Create a zipped list pairing each column name with its coefficient (intercept first)
    modelCols = copy.deepcopy(inputCols)
    modelCoeffs = list(pipelineModel.stages[-1].coefficients)
    modelCoeffs.insert(0,pipelineModel.stages[-1].intercept)
    modelCols.insert(0,"intercept")
    modelZippedList = list(map(list, zip(modelCols, modelCoeffs)))
    
    # Add in the p-values
    #pvals = pipelineModel.stages[-1].summary.pValues
    
    # Create the pandas DataFrame
    modelDF = pd.DataFrame(modelZippedList, columns = ['Column name', 'Coefficient'])
    #modelDF['pValues'] = pvals
    return modelDF

In [6]:
# Calculate adjusted r2 (https://towardsdatascience.com/machine-learning-linear-regression-using-pyspark-9d5d5c772b42)
# This function lets us calculate the adjusted R^2 when we do PCA later. The default R^2 metric does not take k into account.
def adj_r2(r2, inputCols, testDF, k = 0):
    n = testDF.count()
    if k == 0:
        p = len(inputCols)
    else: 
        p = len(inputCols) + k - 1
    
    adjusted_r2 = 1-(((1-r2)*(n-1))/(n-p-1))
    return adjusted_r2
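
With `k = 0`, the function above evaluates adjusted R^2 = 1 - (1 - R^2)(n - 1) / (n - p - 1), where n is the number of rows and p the number of predictors. A minimal standalone sketch of the same arithmetic, without Spark, is shown here; the function name `adjusted_r2` and the numbers are made up for illustration.

```python
def adjusted_r2(r2, n, p):
    """Adjusted R^2 for n observations and p predictors."""
    return 1 - ((1 - r2) * (n - 1)) / (n - p - 1)


# Hypothetical numbers: R^2 = 0.5 on 101 rows with 20 predictors
print(adjusted_r2(0.5, n=101, p=20))  # 0.375
```

The penalty grows as p approaches n: with the same R^2, more predictors or fewer rows give a lower adjusted value.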

In [7]:
# https://towardsdatascience.com/binary-classifier-evaluation-made-easy-with-handyspark-3b1e69c12b4f
def getLogisticEvaluationMetrics(pipelineModel,outputCol,testDF,inputCols):
    predDF = pipelineModel.transform(testDF)
    # MulticlassMetrics expects (prediction, label) pairs, in that order
    preds_only = predDF.select("Logistic_Predictions", outputCol)
    preds_only.show(5)
    
    preds_only = preds_only.rdd
    
    metrics = MulticlassMetrics(preds_only)
    # Confusion matrix: rows are actual classes, columns are predicted classes
    print(metrics.confusionMatrix().toArray())
    print("Accuracy: " + str(metrics.accuracy))

    return predDF
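
To make the printed confusion matrix easier to read, here is a plain-Python sketch of how a 2x2 confusion matrix and accuracy can be derived from (prediction, label) pairs. The function name `confusion_and_accuracy` and the sample pairs are hypothetical, and the sketch assumes a layout with actual classes as rows and predicted classes as columns.

```python
def confusion_and_accuracy(prediction_and_labels):
    """Return a 2x2 matrix (rows = actual, columns = predicted) and accuracy."""
    matrix = [[0, 0], [0, 0]]
    for prediction, label in prediction_and_labels:
        matrix[int(label)][int(prediction)] += 1
    correct = matrix[0][0] + matrix[1][1]
    total = sum(sum(row) for row in matrix)
    return matrix, correct / total


# Made-up (prediction, label) pairs, for illustration only
pairs = [(0.0, 0.0), (0.0, 0.0), (1.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
matrix, accuracy = confusion_and_accuracy(pairs)
print(matrix)    # [[2, 2], [1, 1]]
print(accuracy)  # 0.5
```

Swapping the order of each pair leaves the accuracy unchanged but transposes the matrix, which is why the column order passed to the metrics object matters.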


In [8]:
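# Note: "purchase_events" appears twice in this list, so it is assembled twice (rows 3 and 13 of the coefficient table)
inputCols = ["total_spend","total_events","purchase_events", "total_sessions", "avg_session_length", "avg_interactions_per_session", "max_interactions_per_session",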
             "purchase_pct_of_total_events", "view_pct_of_total_events", "cart_pct_of_total_events","avg_purchases_per_session", "cart_events", "purchase_events",
             "view_events", "sessions_with_purchase", "sessions_with_cart","sessions_with_view", "pct_sessions_end_purchase", "pct_sessions_end_cart", 'sd_session_length', 
             'sd_interactions_per_session', 'sd_purchases_per_session']
outputCol= "logistic_T_total_spend"

pipeline = generateLogisticPipeline(inputCols, outputCol, .6)
pipelineModel = pipeline.fit(trainDF)

print(modelInfo(inputCols, pipelineModel))

logistic_predictions = getLogisticEvaluationMetrics(pipelineModel,outputCol,testDF, inputCols)
#print(f"RMSE is {evaluationMetrics[0]:.1f}")
#print(f"R^2 is {evaluationMetrics[1]:.5f}")
#print(f"Adjusted R^2 is {evaluationMetrics[2]:.5f}")

                     Column name  Coefficient
0                      intercept    -1.184833
1                    total_spend     0.000047
2                   total_events     0.879037
3                purchase_events    -0.496094
4                 total_sessions     0.289156
5             avg_session_length     0.000017
6   avg_interactions_per_session    -0.030066
7   max_interactions_per_session     0.010927
8   purchase_pct_of_total_events    -1.052250
9       view_pct_of_total_events    -1.453185
10      cart_pct_of_total_events    -1.826852
11     avg_purchases_per_session    -0.288964
12                   cart_events    -0.882949
13               purchase_events    -0.496094
14                   view_events    -0.877355
15        sessions_with_purchase     0.196496
16            sessions_with_cart    -0.014244
17            sessions_with_view    -0.278276
18     pct_sessions_end_purchase     0.715227
19         pct_sessions_end_cart     0.188995
20             sd_session_length  

#### Threshold of 0.6 maximizes accuracy

#### Next, we'll train a linear regression model using only the training rows with a positive label, and generate a regression prediction for every test row that the logistic model predicted as positive.

In [9]:
# Training rows that actually spent (label == 1) are used to fit the regression
part_2_train = trainDF.filter(col("logistic_T_total_spend") == 1)
# Test rows the logistic model predicted as spenders get a regression prediction
part_2_test = logistic_predictions.filter(col("Logistic_Predictions") == 1).withColumnRenamed('probability', "ProbabilityLogistic")

# Test rows predicted as non-spenders are set aside and unioned back in later
# with part_2_test's regression predictions to compute the overall metrics
part_1_test = logistic_predictions.filter(col("Logistic_Predictions") == 0).withColumnRenamed('probability', "ProbabilityLogistic") \
    .withColumn("", lit(None))

In [10]:
### Add empty columns to pad it to union later


In [11]:
part_2_train.show(1)

+---------+------------------+-----------------+------------+--------------+------------------+------------------+----------------------------+---------------------------+----------------------------+----------------------------+------------------------+------------------------+-------------------------+------------------------+-----------+---------------+-----------+----------------------+------------------+------------------+-------------------------+---------------------+-----------------+----------------+-------------------+------------------+-----------------+--------------------+--------------------+--------------------+--------------------+----------------------+
|  user_id|     T_total_spend|      total_spend|total_events|total_sessions|avg_session_length| sd_session_length|avg_interactions_per_session|sd_interactions_per_session|max_interactions_per_session|purchase_pct_of_total_events|view_pct_of_total_events|cart_pct_of_total_events|avg_purchases_per_session|sd_purchases_per

In [12]:
part_2_test.show(1)

+---------+-----------------+----------------+------------+--------------+------------------+-----------------+----------------------------+---------------------------+----------------------------+----------------------------+------------------------+------------------------+-------------------------+------------------------+-----------+---------------+-----------+----------------------+------------------+------------------+-------------------------+---------------------+-----------------+-----------------+-------------------+------------------+------------------+--------------------+--------------------+--------------------+--------------------+----------------------+--------------------+----------------------+--------------------+--------------------+
|  user_id|    T_total_spend|     total_spend|total_events|total_sessions|avg_session_length|sd_session_length|avg_interactions_per_session|sd_interactions_per_session|max_interactions_per_session|purchase_pct_of_total_events|view_pct_o

In [13]:
part_1_test.show(1)

+---------+-------------+-----------------+------------+--------------+------------------+------------------+----------------------------+---------------------------+----------------------------+----------------------------+------------------------+------------------------+-------------------------+------------------------+-----------+---------------+-----------+----------------------+------------------+------------------+-------------------------+---------------------+-----------------+------------------+--------------------+------------------+------------------+--------------------+--------------------+--------------------+--------------------+----------------------+--------------------+----------------------+--------------------+--------------------+----+
|  user_id|T_total_spend|      total_spend|total_events|total_sessions|avg_session_length| sd_session_length|avg_interactions_per_session|sd_interactions_per_session|max_interactions_per_session|purchase_pct_of_total_events|view_pc

In [14]:
print(part_2_train.count())
print(part_2_test.count())

69465
11587


In [15]:
%%time

def generateLinearPipeline(inputCols, outputCol):
    # Assemble the input columns into a single feature vector
    vecAssembler = VectorAssembler(inputCols=inputCols, outputCol="features2")

    # Linear regression on the assembled features; outputCol is the label
    lr = LinearRegression(featuresCol="features2", labelCol=outputCol)

    # Chaining both stages in a Pipeline replaces calling vecAssembler.transform(trainDF) by hand
 
    pipeline = Pipeline(stages=[vecAssembler, lr])
    return pipeline
    
pipeline = generateLinearPipeline(inputCols, "T_total_spend")
pipelineModel = pipeline.fit(trainDF)



CPU times: user 17.8 ms, sys: 2.98 ms, total: 20.8 ms
Wall time: 1.92 s


In [42]:
def getLinearEvaluationMetrics(linearPipelineModel,outputCol,testDF,inputCols):
    # Regression predictions for the rows the logistic model flagged as spenders
    new_preds = linearPipelineModel.transform(testDF)

    # Add back the rows predicted as non-spenders (relies on the global part_1_test);
    # columns missing on either side are filled with nulls
    predDF = new_preds.unionByName(part_1_test, allowMissingColumns=True)

    # Final prediction: 0.0 for predicted non-spenders, the regression output otherwise
    predDF = predDF.withColumn("FinalPredictions", when(col("Logistic_Predictions") == 0, 0.0)
                               .otherwise(col("prediction")))

    predDF.select(outputCol, "FinalPredictions").show()

    regressionEvaluator = RegressionEvaluator(
        predictionCol="FinalPredictions",
        labelCol=outputCol,
        metricName="rmse")
    rmse = regressionEvaluator.evaluate(predDF)

    regressionEvaluator = RegressionEvaluator(
        predictionCol="FinalPredictions",
        labelCol=outputCol,
        metricName="r2")
    r2 = regressionEvaluator.evaluate(predDF)

    # Manually calculate Adjusted r2
    adjusted_r2 = adj_r2(r2, inputCols, predDF)
    
    return rmse, r2, adjusted_r2, predDF
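
The two-stage logic in this function can be illustrated without Spark. The sketch below combines a classifier's 0/1 output with regression predictions and then scores the result with RMSE and R^2. All function names and numbers are hypothetical, and `None` stands in for the null regression prediction of rows that never reached the second stage.

```python
import math


def combine_predictions(class_preds, regression_preds):
    """Use 0.0 where the classifier predicts no spend, else the regression value."""
    return [0.0 if c == 0 else r for c, r in zip(class_preds, regression_preds)]


def rmse(labels, preds):
    """Root mean squared error between labels and predictions."""
    return math.sqrt(sum((y - p) ** 2 for y, p in zip(labels, preds)) / len(labels))


def r2(labels, preds):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(labels) / len(labels)
    ss_res = sum((y - p) ** 2 for y, p in zip(labels, preds))
    ss_tot = sum((y - mean_y) ** 2 for y in labels)
    return 1 - ss_res / ss_tot


# Made-up data: actual spend, classifier output, regression output
labels = [0.0, 0.0, 10.0, 20.0]
class_preds = [0, 1, 1, 1]
regression_preds = [None, 4.0, 12.0, 16.0]

final = combine_predictions(class_preds, regression_preds)
print(final)                # [0.0, 4.0, 12.0, 16.0]
print(rmse(labels, final))  # 3.0
print(r2(labels, final))
```

Because the metrics are computed on the combined column, errors from both stages count: a non-spender misclassified as a spender contributes its regression prediction as error.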

In [44]:
print("** All original inputs, original output **")
inputCols = ["total_spend","total_events","purchase_events", "total_sessions", "avg_session_length", "avg_interactions_per_session", "max_interactions_per_session",
             "purchase_pct_of_total_events", "view_pct_of_total_events", "cart_pct_of_total_events","avg_purchases_per_session", "cart_events", "view_events", 
             "sessions_with_purchase", "sessions_with_cart","sessions_with_view", "pct_sessions_end_purchase", "pct_sessions_end_cart", 'sd_session_length', 
             'sd_interactions_per_session', 'sd_purchases_per_session']


outputCol = "T_total_spend"

pipeline = generateLinearPipeline(inputCols, outputCol)
pipelineModel = pipeline.fit(part_2_train)

print("Model coefficients")
print(modelInfo(inputCols, pipelineModel))

evaluationMetrics = getLinearEvaluationMetrics(pipelineModel,outputCol,part_2_test, inputCols)
print(f"RMSE is {evaluationMetrics[0]:.1f}")
print(f"R^2 is {evaluationMetrics[1]:.5f}")
print(f"Adjusted R^2 is {evaluationMetrics[2]:.5f}")

** All original inputs, original output **
Model coefficients
                     Column name   Coefficient
0                      intercept     42.791957
1                    total_spend      4.715632
2                   total_events    198.240063
3                purchase_events  -4257.149567
4                 total_sessions    340.222601
5             avg_session_length      0.008373
6   avg_interactions_per_session   -331.880827
7   max_interactions_per_session    329.062660
8   purchase_pct_of_total_events -47159.037863
9       view_pct_of_total_events  14460.859569
10      cart_pct_of_total_events -10157.956380
11     avg_purchases_per_session  -3629.363791
12                   cart_events    443.925907
13                   view_events   -139.962083
14        sessions_with_purchase  -2455.489738
15            sessions_with_cart   1901.689269
16            sessions_with_view    134.391390
17     pct_sessions_end_purchase   7923.069760
18         pct_sessions_end_cart  -6030.98114

In [32]:
df = evaluationMetrics[3]

In [39]:
df.select('T_total_spend', "FinalPredictions").sort("T_total_spend").show(100)

+-------------+-------------------+
|T_total_spend|   FinalPredictions|
+-------------+-------------------+
|          0.0|                0.0|
|          0.0|                0.0|
|          0.0|                0.0|
|          0.0|                0.0|
|          0.0|                0.0|
|          0.0|                0.0|
|          0.0|                0.0|
|          0.0|                0.0|
|          0.0|                0.0|
|          0.0|                0.0|
|          0.0|                0.0|
|          0.0|                0.0|
|          0.0|                0.0|
|          0.0|                0.0|
|          0.0|                0.0|
|          0.0|                0.0|
|          0.0|                0.0|
|          0.0|                0.0|
|          0.0|                0.0|
|          0.0|                0.0|
|          0.0|                0.0|
|          0.0|                0.0|
|          0.0|                0.0|
|          0.0|                0.0|
|          0.0|             