## File 08 - Two-part model (Logistic + Linear Regression)

##### Group 12:

##### Hannah Schmuckler, mmc4cv

##### Rob Schwartz, res7cd

In this file, we create a small ML pipeline based on the output from File 02 (Feature creation).

We create a two part model made up of a logistic and a linear regression.

### Set up Spark session

We can specify more options in the SparkSession creator, but currently the options are at the default settings.

In [1]:
%%time
from pyspark.sql import SparkSession
from pyspark.sql import types as T
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler, StandardScaler
from pyspark.ml.regression import LinearRegression, LinearRegressionModel, LinearRegressionSummary
from pyspark.ml.evaluation import RegressionEvaluator, BinaryClassificationEvaluator
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
from pyspark.sql.functions import col
from pyspark.sql.functions import log
from pyspark.ml.stat import Correlation
from pyspark.sql.functions import when
from pyspark.ml.classification import LogisticRegression
from handyspark import *
from matplotlib import pyplot as plt
from pyspark.mllib.evaluation import BinaryClassificationMetrics, MulticlassMetrics
from pyspark.sql.types import IntegerType, DoubleType
from pyspark.sql.functions import lit

import pandas as pd
import numpy as np
import copy

spark = SparkSession.builder \
        .appName("project") \
        .getOrCreate()

sc = spark.sparkContext

CPU times: user 877 ms, sys: 672 ms, total: 1.55 s
Wall time: 14.5 s


### Read in dataframes for train and test sets

This data should have been previously generated: we can find it in the `processed_data` folder.

In [2]:
%%time
trainDF = spark.read.parquet("./processed_data/train.parquet")
testDF = spark.read.parquet("./processed_data/test.parquet")
# trainDF.show(5)

CPU times: user 1.9 ms, sys: 1.68 ms, total: 3.58 ms
Wall time: 6.84 s


### Label trainDF with 0 if they did not spend in m2 and 1 if they did

In [3]:
trainDF = trainDF.withColumn('logistic_T_total_spend', when(trainDF.T_total_spend == 0, 0)
                                                       .otherwise(1))
testDF = testDF.withColumn('logistic_T_total_spend', when(testDF.T_total_spend == 0, 0)
                                                       .otherwise(1))

trainDF = trainDF.withColumn('logistic_T_total_spend', col("logistic_T_total_spend").cast(DoubleType()))
testDF = testDF.withColumn('logistic_T_total_spend', col("logistic_T_total_spend").cast(DoubleType()))

### Set up Spark ML pipeline training for logistic regression

Here we decide which input columns should be used in order to create our training pipeline. To implement this step, we create the function `generatePipeline(inputCols, outputCol)`. Then, we train the pipeline using this function.

In [4]:
%%time

def generateLogisticPipeline(inputCols, outputCol, threshold):
    # Select input columns for linear regression
    vecAssembler = VectorAssembler(inputCols=inputCols, outputCol="unscaled_features")
    # Standard Scale
    ss = StandardScaler(inputCol="unscaled_features", outputCol="features")
    # Select output column for linear regression
    logR = LogisticRegression(featuresCol="features", labelCol=outputCol, rawPredictionCol = "Logistic_Probabilities", predictionCol = "Logistic_Predictions", threshold = threshold)

    # The following lines (pipeline creation and fitting) replace these two commented-out lines.
    # vecTrainDF = vecAssembler.transform(trainDF)
 
    pipeline = Pipeline(stages=[vecAssembler, ss, logR])
    return pipeline
    

CPU times: user 1e+03 ns, sys: 1e+03 ns, total: 2 µs
Wall time: 5.01 µs


### View the model information

Create function to print out the model coefficients for the overall model, RMSE and R^2. We define the functions `modelInfo(inputCols, pipelineModel)` and `getEvaluationMetrics(pipelineModel,outputCol,testDF)` to report this information.

In [5]:
def modelInfo(inputCols, pipelineModel):
    # Create a zipped list containing the coefficients and the data
    modelCols = copy.deepcopy(inputCols)
    modelCoeffs = list(pipelineModel.stages[-1].coefficients)
    modelCoeffs.insert(0,pipelineModel.stages[-1].intercept)
    modelCols.insert(0,"intercept")
    modelZippedList = list(map(list, zip(modelCols, modelCoeffs)))
    
    
    # Create the pandas DataFrame
    modelDF = pd.DataFrame(modelZippedList, columns = ['Column name', 'Coefficient'])
    #modelDF['pValues'] = pvals
    return modelDF

In [6]:
# Calculate adjusted r2 (https://towardsdatascience.com/machine-learning-linear-regression-using-pyspark-9d5d5c772b42)
# This function will allow us to calculate the adjusted r-square value when we do PCA later. The default r-square function does not take into account k. 
def adj_r2(r2, inputCols, testDF, k = 0):
    n = testDF.count()
    if k == 0:
        p = len(inputCols)
    else: 
        p = len(inputCols) + k - 1
    
    adjusted_r2 = 1-(((1-r2)*(n-1))/(n-p-1))
    return adjusted_r2

In [7]:
# https://towardsdatascience.com/binary-classifier-evaluation-made-easy-with-handyspark-3b1e69c12b4f
def getLogisticEvaluationMetrics(pipelineModel,outputCol,testDF,inputCols):
    predDF = pipelineModel.transform(testDF)
    preds_only = predDF.select(outputCol, "Logistic_Predictions")
    preds_only.show(5)
    
    preds_only = preds_only.rdd
    
    metrics = MulticlassMetrics(preds_only)
    print(metrics.confusionMatrix().toArray())
    print("Accuracy: " + str(metrics.accuracy))

    return predDF


## Part 1: Logistic regression

In [8]:
%%time
inputCols = ["total_spend", "total_events", "total_sessions", "avg_session_length", "sd_session_length", "avg_interactions_per_session", 
             "sd_interactions_per_session", "max_interactions_per_session", "purchase_pct_of_total_events", "view_pct_of_total_events",
             "cart_pct_of_total_events", "avg_purchases_per_session", "sd_purchases_per_session", "cart_events", "purchase_events", 
             "sessions_with_purchase", "sessions_with_cart", "sessions_with_view", "pct_sessions_end_purchase", 
             "pct_sessions_end_cart", "total_spend_log", "total_events_log", "purchase_events_log", "total_sessions_log",
             "avg_session_length_log"]
outputCol= "logistic_T_total_spend"

pipeline = generateLogisticPipeline(inputCols, outputCol, .2) #Iterative testing determined that a threshold of .2 led to the highest overal adj R^2 for the model
pipelineModel = pipeline.fit(trainDF)

print(modelInfo(inputCols, pipelineModel))

logistic_predictions = getLogisticEvaluationMetrics(pipelineModel,outputCol,testDF, inputCols)

                     Column name  Coefficient
0                      intercept    -1.087308
1                    total_spend     0.027652
2                   total_events     0.166859
3                 total_sessions     1.422216
4             avg_session_length    -0.089043
5              sd_session_length     0.267300
6   avg_interactions_per_session    -0.058958
7    sd_interactions_per_session     0.025095
8   max_interactions_per_session    -0.045492
9   purchase_pct_of_total_events    -0.187097
10      view_pct_of_total_events    -0.078184
11      cart_pct_of_total_events    -0.095201
12     avg_purchases_per_session    -0.129099
13      sd_purchases_per_session     0.059970
14                   cart_events    -0.049034
15               purchase_events    -0.192963
16        sessions_with_purchase     0.220615
17            sessions_with_cart    -0.071309
18            sessions_with_view    -1.529007
19     pct_sessions_end_purchase     0.469537
20         pct_sessions_end_cart  

#### Threshold of .2 maximizes model adj-r^2, though it does not maximize the accuracy of this logistic regression. 

## Part 2: Linear Regression

#### Next, we'll train a linear regression model using only the true positive training data, and generate a regression prediction for all test data that had a prediction of positive. 

In [9]:
part_2_train = trainDF.filter(col("logistic_T_total_spend") == 1)
part_2_test = logistic_predictions.filter(col("Logistic_Predictions") == 1).withColumnRenamed('probability', "ProbabilityLogistic")

part_1_test = logistic_predictions.filter(col("Logistic_Predictions") == 0).withColumnRenamed('probability', "ProbabilityLogistic") \
    .withColumn("", lit(None)) ## For overall accuracy metrics later, will be joined with output of part_2_test's regression predictions

In [10]:
print(part_2_train.count())
print(part_2_test.count())

74751
36564


In [11]:
%%time

def generateLinearPipeline(inputCols, outputCol):
    # Select input columns for linear regression
    vecAssembler = VectorAssembler(inputCols=inputCols, outputCol="unscaled_features2")
    # Standard Scale
    ss = StandardScaler(inputCol="unscaled_features2", outputCol="features2")
    # Select output column for linear regression
    lr = LinearRegression(featuresCol="features2", labelCol=outputCol)

    # The following lines (pipeline creation and fitting) replace these two commented-out lines.
    # vecTrainDF = vecAssembler.transform(trainDF)
 
    pipeline = Pipeline(stages=[vecAssembler, ss, lr])
    return pipeline
    
pipeline = generateLinearPipeline(inputCols, "T_total_spend")
pipelineModel = pipeline.fit(trainDF)



CPU times: user 20.5 ms, sys: 1.46 ms, total: 22 ms
Wall time: 6.84 s


In [12]:
def getLinearEvaluationMetrics(linearPipelineModel,outputCol,testDF,inputCols):
    new_preds = pipelineModel.transform(testDF)
    
    # Merge the predictions from the linear regression with predictions from the logistic regression performed above. 
    # That will allow this to output complete results. 
    predDF = new_preds.unionByName(part_1_test, allowMissingColumns=True)

    predDF = predDF.withColumn("FinalPredictions", when(col("Logistic_Predictions") == 0, 0.0)
                               .otherwise(col("prediction")))

    predDF.select(outputCol, "FinalPredictions").show()
    
    regressionEvaluator = RegressionEvaluator(
    predictionCol="FinalPredictions",
    labelCol=outputCol,
    metricName="rmse")
    rmse = regressionEvaluator.evaluate(predDF)

    regressionEvaluator = RegressionEvaluator(
    predictionCol="FinalPredictions",
    labelCol=outputCol,
    metricName="r2")
    r2 = regressionEvaluator.evaluate(predDF)
      
    # Manually calculate Adjusted r2
    adjusted_r2 = adj_r2(r2, inputCols, predDF)
    
    return rmse, r2, adjusted_r2, predDF

In [13]:
print("** All original inputs, original output **")
inputCols = ["total_spend","total_events","purchase_events", "total_sessions", "avg_session_length", "avg_interactions_per_session", "max_interactions_per_session",
             "purchase_pct_of_total_events", "view_pct_of_total_events", "cart_pct_of_total_events","avg_purchases_per_session", "cart_events", "view_events", 
             "sessions_with_purchase", "sessions_with_cart","sessions_with_view", "pct_sessions_end_purchase", "pct_sessions_end_cart", 'sd_session_length', 
             'sd_interactions_per_session', 'sd_purchases_per_session']


outputCol = "T_total_spend"

pipeline = generateLinearPipeline(inputCols, outputCol)
pipelineModel = pipeline.fit(part_2_train)

print("Model coefficients")
print(modelInfo(inputCols, pipelineModel))

evaluationMetrics = getLinearEvaluationMetrics(pipelineModel,outputCol,part_2_test, inputCols)
print(f"RMSE is {evaluationMetrics[0]:.1f}")
print(f"R^2 is {evaluationMetrics[1]:.5f}")
print(f"Adjusted R^2 is {evaluationMetrics[2]:.5f}")

** All original inputs, original output **
Model coefficients
                     Column name  Coefficient
0                      intercept   759.390549
1                    total_spend  2149.124576
2                   total_events   -48.135840
3                purchase_events   303.019611
4                 total_sessions  -789.070612
5             avg_session_length   -29.179753
6   avg_interactions_per_session   -65.282682
7   max_interactions_per_session   115.651063
8   purchase_pct_of_total_events    35.111501
9       view_pct_of_total_events   -18.064595
10      cart_pct_of_total_events     3.097761
11     avg_purchases_per_session  -152.356488
12                   cart_events    74.457028
13                   view_events   -92.113428
14        sessions_with_purchase  -907.340709
15            sessions_with_cart   928.968113
16            sessions_with_view   724.985263
17     pct_sessions_end_purchase   142.424649
18         pct_sessions_end_cart  -130.378715
19             sd_

In [14]:
%%time
print("** Champion Model **")
inputCols = ["total_spend","purchase_events","avg_interactions_per_session", 
             "purchase_pct_of_total_events","cart_events", "view_events", 
             "sessions_with_purchase", "sessions_with_cart","pct_sessions_end_purchase", "pct_sessions_end_cart",  
             'sd_purchases_per_session']

outputCol = "T_total_spend"
# Below creation of pipeline is necessary for crossval to run.
vecAssembler = VectorAssembler(inputCols=inputCols, outputCol="unscaled_features")
# Standard Scale
ss = StandardScaler(inputCol="unscaled_features", outputCol="features")
# Select output column for linear regression
lr = LinearRegression(featuresCol="features", labelCol="T_total_spend")

pipeline = Pipeline(stages=[vecAssembler,ss, lr])
    
paramGrid = ParamGridBuilder() \
    .addGrid(lr.regParam, [0, 0.01, 0.2, 1, 10]) \
    .addGrid(lr.elasticNetParam, [0, 0.25, 0.5, 0.75, 1]) \
    .build()

crossval = CrossValidator(estimator=pipeline,
                          estimatorParamMaps=paramGrid,
                          evaluator=RegressionEvaluator().setLabelCol("T_total_spend"),
                          numFolds=4)

# Run cross-validation, and choose the best set of parameters.
cvModel = crossval.setParallelism(8).fit(part_2_train)

** Champion Model **
CPU times: user 3.92 s, sys: 694 ms, total: 4.62 s
Wall time: 1min 43s


In [15]:
print("Best model coefficients")
print(modelInfo(inputCols, cvModel.bestModel))


evaluationMetrics = getLinearEvaluationMetrics(cvModel.bestModel,"T_total_spend",part_2_test,inputCols)
print(f"RMSE is {evaluationMetrics[0]:.1f}")
print(f"R^2 is {evaluationMetrics[1]:.5f}")
print(f"Adjusted R^2 is {evaluationMetrics[2]:.5f}")

print()
print(cvModel.getEstimatorParamMaps()[np.argmax(cvModel.avgMetrics)])
print()
print(cvModel.avgMetrics)

Best model coefficients
                     Column name  Coefficient
0                      intercept   722.584726
1                    total_spend  2143.824564
2                purchase_events   129.044353
3   avg_interactions_per_session   -82.542527
4   purchase_pct_of_total_events    26.390023
5                    cart_events   151.070671
6                    view_events   -62.329827
7         sessions_with_purchase  -426.917763
8             sessions_with_cart   456.417060
9      pct_sessions_end_purchase    33.579882
10         pct_sessions_end_cart   -56.555954
11      sd_purchases_per_session    78.266746
+------------------+------------------+
|     T_total_spend|  FinalPredictions|
+------------------+------------------+
|               0.0| 624.7580396565033|
|               0.0| 526.3177991687103|
| 312.4200134277344| 557.3234942313471|
|  6009.78010559082| 1652.212288595837|
|1627.1900024414062| 854.8646853826301|
| 573.9199829101562| 782.6884600712948|
| 622.400024414062