## File 06 - PCA Linear Regression
##### Group 12:

##### Hannah Schmuckler, mmc4cv

##### Rob Schwartz, res7cd

In this file, we create a small ML pipeline based on the output from File 02 (Feature creation).

We create a linear regression model, but this time use the PCA features that were created in File 02. We compare models across the PCA k values (the number of principal components retained).

In [1]:
%%time
from pyspark.sql import SparkSession
from pyspark.sql import types as T
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression, LinearRegressionModel, LinearRegressionSummary
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
from pyspark.sql.functions import col
from pyspark.sql.functions import log
from pyspark.ml.stat import Correlation

import pandas as pd
import numpy as np
import copy

spark = SparkSession.builder \
        .appName("project") \
        .getOrCreate()

sc = spark.sparkContext

CPU times: user 568 ms, sys: 428 ms, total: 996 ms
Wall time: 8.14 s


### Read in dataframes for train and test sets

This data should have been previously generated: we can find it in the `processed_data` folder.

In [2]:
%%time
trainDF = spark.read.parquet("./processed_data/train.parquet")
testDF = spark.read.parquet("./processed_data/test.parquet")
# trainDF.show(5)

CPU times: user 3.76 ms, sys: 272 µs, total: 4.03 ms
Wall time: 4.19 s


### Set up Spark ML pipeline training for linear regression

Here we decide which input columns should be used in order to create our training pipeline. To implement this step, we create the function `generatePipeline(inputCols, outputCol)`. Then, we train the pipeline using this function.

In [3]:
%%time


def generatePipeline(inputCols, outputCol):
    # Select input columns for linear regression
    vecAssembler = VectorAssembler(inputCols=inputCols, outputCol="features")

    # Select output column for linear regression
    lr = LinearRegression(featuresCol="features", labelCol=outputCol)

    # Chain the assembler and the regression into one pipeline, so a single
    # fit() call replaces a manual vecAssembler.transform(trainDF) step.

    pipeline = Pipeline(stages=[vecAssembler, lr])
    return pipeline


CPU times: user 2 µs, sys: 2 µs, total: 4 µs
Wall time: 6.44 µs


### View the model information

Print the model coefficients and report the RMSE, R^2 and adjusted R^2 on the test set. We define the functions `modelInfo(inputCols, pipelineModel)` and `getEvaluationMetrics(pipelineModel, outputCol, testDF, inputCols, k)` to report this information. p-values are left out because they do not work with the PCA features.

In [4]:
def modelInfo(inputCols, pipelineModel):
    # Create a zipped list containing the coefficients and the data
    modelCols = copy.deepcopy(inputCols)
    modelCoeffs = list(pipelineModel.stages[-1].coefficients)
    modelCoeffs.insert(0,pipelineModel.stages[-1].intercept)
    modelCols.insert(0,"intercept")
    modelZippedList = list(map(list, zip(modelCols, modelCoeffs)))
    
    # p-values don't work with PCA
    #pvals = pipelineModel.stages[-1].summary.pValues
    
    # Create the pandas DataFrame
    modelDF = pd.DataFrame(modelZippedList, columns = ['Column name', 'Coefficient'])
    # modelDF['pValues'] = pvals
    return modelDF
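
One caveat with `modelInfo`: assuming `pca_purchases10` is a 10-element vector column, `VectorAssembler` flattens it into ten feature slots, so the fitted model has more coefficients than `inputCols` has names, and `zip` silently stops at the shorter list. That would explain why only one coefficient is printed for the PCA column. A minimal sketch of one way to label every slot; `expand_feature_names` and its `vector_sizes` argument are hypothetical helpers, not part of the notebook or of PySpark:

```python
def expand_feature_names(input_cols, vector_sizes):
    """Return one label per assembled feature slot.

    vector_sizes (hypothetical argument) maps each vector-valued column name
    to its length; columns missing from the mapping are treated as scalars.
    """
    names = []
    for name in input_cols:
        size = vector_sizes.get(name, 1)
        if size == 1:
            names.append(name)
        else:
            # One label per component of the vector column
            names.extend(f"{name}[{i}]" for i in range(size))
    return names


# Illustrative call: one scalar column plus a 10-component PCA column
labels = expand_feature_names(
    ["total_spend", "pca_purchases10"],
    {"pca_purchases10": 10},
)
print(len(labels))
print(labels[:3])
```

The resulting list could be passed to `modelInfo` in place of `inputCols`, so that every coefficient gets its own row.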


In [5]:
# Calculate adjusted R^2 (https://towardsdatascience.com/machine-learning-linear-regression-using-pyspark-9d5d5c772b42)
# The default R^2 metric does not account for the number of features, so we adjust it by hand.
# A PCA column counts as k features rather than one.
def adj_r2(r2, inputCols, testDF, k=0):
    n = testDF.count()  # number of test observations
    if k == 0:
        # No PCA column: every input column is a single feature
        p = len(inputCols)
    else:
        # One of the input columns is a PCA vector holding k features
        p = len(inputCols) + k - 1

    adjusted_r2 = 1 - (((1 - r2) * (n - 1)) / (n - p - 1))
    return adjusted_r2
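
To see how the penalty behaves, here is a small Spark-free sketch of the same formula. The values of `n` and `r2` are made up for illustration, and `adjusted_r2` is a stand-in that takes `n` and `p` directly instead of counting rows in a DataFrame:

```python
def adjusted_r2(r2, n, p):
    """Adjusted R^2 for n observations and p predictors."""
    return 1 - ((1 - r2) * (n - 1)) / (n - p - 1)


n = 1001   # illustrative number of test rows (assumption)
r2 = 0.20  # illustrative raw R^2 (assumption)

for k in (10, 20, 50, 100):
    # A single PCA input column contributes k features: p = 1 + k - 1
    p = 1 + k - 1
    print(f"k={k:3d}  adjusted R^2 = {adjusted_r2(r2, n, p):.5f}")
```

With the raw R^2 held fixed, the adjusted value falls as k grows, so a larger k has to earn its place through a genuinely better fit.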

In [6]:
def getEvaluationMetrics(pipelineModel,outputCol,testDF,inputCols,k):
    predDF = pipelineModel.transform(testDF)
    predDF.select(outputCol, "prediction").show(10)
    
    print(predDF)
    
    regressionEvaluator = RegressionEvaluator(
        predictionCol="prediction",
        labelCol=outputCol,
        metricName="rmse")
    rmse = regressionEvaluator.evaluate(predDF)

    regressionEvaluator = RegressionEvaluator(
        predictionCol="prediction",
        labelCol=outputCol,
        metricName="r2")
    r2 = regressionEvaluator.evaluate(predDF)
      
    # Manually calculate Adjusted r2
    adjusted_r2 = adj_r2(r2, inputCols, testDF, k)
    
    return rmse, r2, adjusted_r2
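
For reference, the two metrics requested above can be sketched with NumPy on a handful of values. These are the textbook definitions, which we assume match what `RegressionEvaluator` reports. The numbers are made up, and `rmse_and_r2` is a hypothetical helper:

```python
import numpy as np


def rmse_and_r2(labels, predictions):
    """Return (RMSE, R^2) using the standard definitions."""
    labels = np.asarray(labels, dtype=float)
    predictions = np.asarray(predictions, dtype=float)
    residuals = labels - predictions

    # Root mean squared error
    rmse = np.sqrt(np.mean(residuals ** 2))

    # R^2 = 1 - (residual sum of squares / total sum of squares)
    ss_res = np.sum(residuals ** 2)
    ss_tot = np.sum((labels - labels.mean()) ** 2)
    r2 = 1 - ss_res / ss_tot
    return rmse, r2


# Made-up labels and predictions, for illustration only
rmse, r2 = rmse_and_r2([0.0, 0.0, 100.0, 300.0], [10.0, 20.0, 90.0, 250.0])
print(f"RMSE is {rmse:.1f}")
print(f"R^2 is {r2:.5f}")
```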


### PCA
First, we'll see how each PCA variant performs on its own. We've created four variations with different k values: 10, 20, 50, and 100. Our PCA represents the categories of purchases that customers made in month one. 
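
As a reminder of what k controls, the sketch below runs PCA via an SVD on a synthetic count matrix and prints the cumulative explained variance for the same four k values. The data is random and only stands in for a users-by-purchase-category matrix. It does not reproduce the File 02 features, so the printed numbers say nothing about our dataset:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in: 500 "users" x 120 "purchase categories" (assumption)
X = rng.poisson(lam=1.0, size=(500, 120)).astype(float)

# Center the columns, then take singular values (returned in descending order)
Xc = X - X.mean(axis=0)
singular_values = np.linalg.svd(Xc, compute_uv=False)

# Share of total variance captured by each principal component
explained = singular_values ** 2 / np.sum(singular_values ** 2)
cumulative = np.cumsum(explained)

for k in (10, 20, 50, 100):
    print(f"k={k:3d}: cumulative explained variance = {cumulative[k - 1]:.3f}")
```

Keeping more components always retains more variance, but each extra component is one more regression feature.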

In [7]:
print("** PCA 10 **")
inputCols = ["pca_purchases10"]

# sd_session_length breaks it, as does sd_interactions_per_session, sd_purchases_per_session
outputCol = "T_total_spend"

pipeline = generatePipeline(inputCols, outputCol)
pipelineModel = pipeline.fit(trainDF)
    
print("Model coefficients")
print(modelInfo(inputCols, pipelineModel))

evaluationMetrics = getEvaluationMetrics(pipelineModel,outputCol,testDF, inputCols, 10)
print(f"RMSE is {evaluationMetrics[0]:.1f}")
print(f"R^2 is {evaluationMetrics[1]:.5f}")
print(f"Adjusted R^2 is {evaluationMetrics[2]:.5f}")

** PCA 10 **
Model coefficients
       Column name  Coefficient
0        intercept    16.498896
1  pca_purchases10  -274.866115
+------------------+------------------+
|     T_total_spend|        prediction|
+------------------+------------------+
|               0.0|18.388107610880304|
|               0.0|24.058767198489555|
|               0.0| 17.91693081574608|
|               0.0|338.05151648298573|
|               0.0|254.21827080566473|
|               0.0| 38.56716882864596|
| 312.4200134277344| 20.43665840632384|
|  6009.78010559082|254.21827080566473|
|1627.1900024414062| 54.57791769584591|
|1393.8800354003906|131.87719543591749|
+------------------+------------------+
only showing top 10 rows

DataFrame[user_id: int, total_spend: double, total_events: bigint, total_sessions: bigint, T_total_spend: double, avg_session_length: double, sd_session_length: double, avg_interactions_per_session: double, sd_interactions_per_session: double, max_interactions_per_session: bigint, purc

In [8]:
print("** PCA 20 **")
inputCols = ["pca_purchases20"]

# sd_session_length breaks it, as does sd_interactions_per_session, sd_purchases_per_session
outputCol = "T_total_spend"

pipeline = generatePipeline(inputCols, outputCol)
pipelineModel = pipeline.fit(trainDF)
    
print("Model coefficients")
print(modelInfo(inputCols, pipelineModel))

evaluationMetrics = getEvaluationMetrics(pipelineModel,outputCol,testDF, inputCols, 20)
print(f"RMSE is {evaluationMetrics[0]:.1f}")
print(f"R^2 is {evaluationMetrics[1]:.5f}")
print(f"Adjusted R^2 is {evaluationMetrics[2]:.5f}")

** PCA 20 **
Model coefficients
       Column name  Coefficient
0        intercept     3.000018
1  pca_purchases20  -274.857092
+------------------+------------------+
|     T_total_spend|        prediction|
+------------------+------------------+
|               0.0|7.6885225618468205|
|               0.0|15.150733461172578|
|               0.0| 5.440875896002195|
|               0.0| 321.6890969666492|
|               0.0|240.34247634880285|
|               0.0|38.242885316940225|
| 312.4200134277344|  58.0081992636654|
|  6009.78010559082|240.34247634880285|
|1627.1900024414062| 364.3507162823513|
|1393.8800354003906|116.51253004292738|
+------------------+------------------+
only showing top 10 rows

DataFrame[user_id: int, total_spend: double, total_events: bigint, total_sessions: bigint, T_total_spend: double, avg_session_length: double, sd_session_length: double, avg_interactions_per_session: double, sd_interactions_per_session: double, max_interactions_per_session: bigint, purc

In [9]:
print("** PCA 50 **")
inputCols = ["pca_purchases50"]

# sd_session_length breaks it, as does sd_interactions_per_session, sd_purchases_per_session
outputCol = "T_total_spend"

pipeline = generatePipeline(inputCols, outputCol)
pipelineModel = pipeline.fit(trainDF)
    
print("Model coefficients")
print(modelInfo(inputCols, pipelineModel))

evaluationMetrics = getEvaluationMetrics(pipelineModel,outputCol,testDF, inputCols, 50)
print(f"RMSE is {evaluationMetrics[0]:.1f}")
print(f"R^2 is {evaluationMetrics[1]:.5f}")
print(f"Adjusted R^2 is {evaluationMetrics[2]:.5f}")

** PCA 50 **
Model coefficients
       Column name  Coefficient
0        intercept   -11.911242
1  pca_purchases50  -274.922295
+------------------+--------------------+
|     T_total_spend|          prediction|
+------------------+--------------------+
|               0.0|-0.08604014342354027|
|               0.0|   11.73863094435069|
|               0.0|  -9.883890142817528|
|               0.0|  303.58809235670697|
|               0.0|   227.0088763163214|
|               0.0|  259.31563447469205|
| 312.4200134277344|   62.03615349120954|
|  6009.78010559082|   227.0088763163214|
|1627.1900024414062|  345.28597710450634|
|1393.8800354003906|   99.48662692145429|
+------------------+--------------------+
only showing top 10 rows

DataFrame[user_id: int, total_spend: double, total_events: bigint, total_sessions: bigint, T_total_spend: double, avg_session_length: double, sd_session_length: double, avg_interactions_per_session: double, sd_interactions_per_session: double, max_interactio

In [10]:
print("** PCA 100 **")
inputCols = ["pca_purchases100"]

# sd_session_length breaks it, as does sd_interactions_per_session, sd_purchases_per_session
outputCol = "T_total_spend"

pipeline = generatePipeline(inputCols, outputCol)
pipelineModel = pipeline.fit(trainDF)
    
print("Model coefficients")
print(modelInfo(inputCols, pipelineModel))

evaluationMetrics = getEvaluationMetrics(pipelineModel,outputCol,testDF, inputCols, 100)
print(f"RMSE is {evaluationMetrics[0]:.1f}")
print(f"R^2 is {evaluationMetrics[1]:.5f}")
print(f"Adjusted R^2 is {evaluationMetrics[2]:.5f}")

** PCA 100 **
Model coefficients
        Column name  Coefficient
0         intercept   -18.129547
1  pca_purchases100  -274.978438
+------------------+-------------------+
|     T_total_spend|         prediction|
+------------------+-------------------+
|               0.0|  128.0118099202071|
|               0.0|-1.7165766838571699|
|               0.0|  658.0872617356608|
|               0.0|  299.2365078186722|
|               0.0| 220.59970954830678|
|               0.0| 218.19452970664932|
| 312.4200134277344| 61.612941328389134|
|  6009.78010559082| 220.59970954830678|
|1627.1900024414062|  493.0728778636065|
|1393.8800354003906|  92.17758433384682|
+------------------+-------------------+
only showing top 10 rows

DataFrame[user_id: int, total_spend: double, total_events: bigint, total_sessions: bigint, T_total_spend: double, avg_session_length: double, sd_session_length: double, avg_interactions_per_session: double, sd_interactions_per_session: double, max_interactions_per_ses

#### The adjusted R^2 is highest for the PCA with the largest k value, 100. However, all of them are very close: a very small sacrifice in performance buys a large reduction in dimensionality.

In [11]:
print("** PCA 10 + all other features **")
inputCols = ["total_spend", "sd_session_length",
             "cart_pct_of_total_events", "avg_purchases_per_session", "cart_events", "purchase_events", 
             "sessions_with_cart", "sessions_with_view", "pct_sessions_end_cart", "total_spend_log", "total_events_log", "pca_purchases10"]

# sd_session_length breaks it, as does sd_interactions_per_session, sd_purchases_per_session
outputCol = "T_total_spend"

pipeline = generatePipeline(inputCols, outputCol)
pipelineModel = pipeline.fit(trainDF)
    
print("Model coefficients")
print(modelInfo(inputCols, pipelineModel))

evaluationMetrics = getEvaluationMetrics(pipelineModel,outputCol,testDF, inputCols, 10)
print(f"RMSE is {evaluationMetrics[0]:.1f}")
print(f"R^2 is {evaluationMetrics[1]:.5f}")
print(f"Adjusted R^2 is {evaluationMetrics[2]:.5f}")

** PCA 10 + all other features **
Model coefficients
                  Column name  Coefficient
0                   intercept   351.774375
1                 total_spend     0.430083
2           sd_session_length     0.000447
3    cart_pct_of_total_events  -265.405678
4   avg_purchases_per_session  -117.318675
5                 cart_events    10.360983
6             purchase_events   -28.830292
7          sessions_with_cart    79.836426
8          sessions_with_view    -6.146621
9       pct_sessions_end_cart  -402.860244
10            total_spend_log   -18.379311
11           total_events_log   -63.313512
12            pca_purchases10   -19.453177
+------------------+------------------+
|     T_total_spend|        prediction|
+------------------+------------------+
|               0.0|107.44056659916882|
|               0.0|104.66517793217096|
|               0.0|22.727672460024735|
|               0.0|234.79518845076026|
|               0.0|-54.97627819008051|
|               0.0| -24.

In [12]:
print("** PCA 20 + all other features **")
inputCols = ["total_spend", "sd_session_length",
             "cart_pct_of_total_events", "avg_purchases_per_session", "cart_events", "purchase_events", 
             "sessions_with_cart", "sessions_with_view", "pct_sessions_end_cart", "total_spend_log", "total_events_log", "pca_purchases20"]

# sd_session_length breaks it, as does sd_interactions_per_session, sd_purchases_per_session
outputCol = "T_total_spend"

pipeline = generatePipeline(inputCols, outputCol)
pipelineModel = pipeline.fit(trainDF)
    
print("Model coefficients")
print(modelInfo(inputCols, pipelineModel))

evaluationMetrics = getEvaluationMetrics(pipelineModel,outputCol,testDF, inputCols, 20)
print(f"RMSE is {evaluationMetrics[0]:.1f}")
print(f"R^2 is {evaluationMetrics[1]:.5f}")
print(f"Adjusted R^2 is {evaluationMetrics[2]:.5f}")

** PCA 20 + all other features **
Model coefficients
                  Column name  Coefficient
0                   intercept   352.336150
1                 total_spend     0.432883
2           sd_session_length     0.000443
3    cart_pct_of_total_events  -258.179977
4   avg_purchases_per_session  -117.771595
5                 cart_events    10.171157
6             purchase_events   -25.245002
7          sessions_with_cart    79.393788
8          sessions_with_view    -6.157408
9       pct_sessions_end_cart  -398.926873
10            total_spend_log   -18.818522
11           total_events_log   -63.310927
12            pca_purchases20   -14.600050
+------------------+------------------+
|     T_total_spend|        prediction|
+------------------+------------------+
|               0.0|111.73347536960472|
|               0.0|105.55782701319663|
|               0.0|25.761377134798295|
|               0.0|233.00177689299971|
|               0.0|-57.37355443047761|
|               0.0|-27.5

In [13]:
print("** PCA 50 + all other features **")
inputCols = ["total_spend", "sd_session_length",
             "cart_pct_of_total_events", "avg_purchases_per_session", "cart_events", "purchase_events", 
             "sessions_with_cart", "sessions_with_view", "pct_sessions_end_cart", "total_spend_log", "total_events_log", "pca_purchases50"]


# sd_session_length breaks it, as does sd_interactions_per_session, sd_purchases_per_session
outputCol = "T_total_spend"

pipeline = generatePipeline(inputCols, outputCol)
pipelineModel = pipeline.fit(trainDF)
    
print("Model coefficients")
print(modelInfo(inputCols, pipelineModel))

evaluationMetrics = getEvaluationMetrics(pipelineModel,outputCol,testDF, inputCols, 50)
print(f"RMSE is {evaluationMetrics[0]:.1f}")
print(f"R^2 is {evaluationMetrics[1]:.5f}")
print(f"Adjusted R^2 is {evaluationMetrics[2]:.5f}")

** PCA 50 + all other features **
Model coefficients
                  Column name  Coefficient
0                   intercept   348.089712
1                 total_spend     0.435719
2           sd_session_length     0.000435
3    cart_pct_of_total_events  -258.080567
4   avg_purchases_per_session  -118.222579
5                 cart_events     9.941050
6             purchase_events   -31.788890
7          sessions_with_cart    78.834710
8          sessions_with_view    -6.242064
9       pct_sessions_end_cart  -394.991110
10            total_spend_log   -18.185279
11           total_events_log   -62.878437
12            pca_purchases50   -21.465918
+------------------+------------------+
|     T_total_spend|        prediction|
+------------------+------------------+
|               0.0|107.76882961835136|
|               0.0|28.794762828693933|
|               0.0| 15.61780352501313|
|               0.0|232.15708102267246|
|               0.0|-58.16794994987646|
|               0.0| 78.1

In [14]:
print("** PCA 100 + all other features **")
inputCols = ["total_spend", "sd_session_length",
             "cart_pct_of_total_events", "avg_purchases_per_session", "cart_events", "purchase_events", 
             "sessions_with_cart", "sessions_with_view", "pct_sessions_end_cart", "total_spend_log", "total_events_log", "pca_purchases100"]


# sd_session_length breaks it, as does sd_interactions_per_session, sd_purchases_per_session
outputCol = "T_total_spend"

pipeline = generatePipeline(inputCols, outputCol)
pipelineModel = pipeline.fit(trainDF)
    
print("Model coefficients")
print(modelInfo(inputCols, pipelineModel))

evaluationMetrics = getEvaluationMetrics(pipelineModel,outputCol,testDF, inputCols, 100)
print(f"RMSE is {evaluationMetrics[0]:.1f}")
print(f"R^2 is {evaluationMetrics[1]:.5f}")
print(f"Adjusted R^2 is {evaluationMetrics[2]:.5f}")

** PCA 100 + all other features **
Model coefficients
                  Column name  Coefficient
0                   intercept   324.340614
1                 total_spend     0.433205
2           sd_session_length     0.000414
3    cart_pct_of_total_events  -245.778856
4   avg_purchases_per_session  -118.868935
5                 cart_events     9.242983
6             purchase_events   -56.249895
7          sessions_with_cart    82.631315
8          sessions_with_view    -6.485063
9       pct_sessions_end_cart  -415.141783
10            total_spend_log   -14.360580
11           total_events_log   -61.594333
12           pca_purchases100   -48.727836
+------------------+------------------+
|     T_total_spend|        prediction|
+------------------+------------------+
|               0.0|130.17943376318345|
|               0.0|30.057751146321664|
|               0.0| 605.8833443458468|
|               0.0|235.37122692489302|
|               0.0|-61.09414031890759|
|               0.0| 68.

# Champion Model
#### When compared against a linear regression without the PCA features, performance of the models with PCA suffers. This is not just because the adjusted R^2 penalizes the additional features. The feature we encoded with PCA is probably not a good predictor.

In [15]:
# pvalues breaks when used with this so taking that out
def modelInfo2(inputCols, pipelineModel):
    # Create a zipped list containing the coefficients and the data
    modelCols = copy.deepcopy(inputCols)
    modelCoeffs = list(pipelineModel.stages[-1].coefficients)
    modelCoeffs.insert(0,pipelineModel.stages[-1].intercept)
    modelCols.insert(0,"intercept")
    modelZippedList = list(map(list, zip(modelCols, modelCoeffs)))
        
    # Create the pandas DataFrame
    modelDF = pd.DataFrame(modelZippedList, columns = ['Column name', 'Coefficient'])
    
    return modelDF

In [16]:
%%time
print("** Champion Model **")
inputCols = ["total_spend", "sd_session_length",
              "avg_purchases_per_session", "cart_events", 
             "sessions_with_cart", "sessions_with_view", "pct_sessions_end_cart", "total_spend_log", "total_events_log", "pca_purchases10"]

outputCol = "T_total_spend"
# The pipeline is built inline here (rather than with generatePipeline) because the parameter grid below needs a reference to lr.
vecAssembler = VectorAssembler(inputCols=inputCols, outputCol="features")

# Select output column for linear regression
lr = LinearRegression(featuresCol="features", labelCol="T_total_spend")

pipeline = Pipeline(stages=[vecAssembler, lr])
    
paramGrid = ParamGridBuilder() \
    .addGrid(lr.regParam, [0, 0.01, 0.2, 1, 10]) \
    .addGrid(lr.elasticNetParam, [0, 0.25, 0.5, 0.75, 1]) \
    .build()

crossval = CrossValidator(estimator=pipeline,
                          estimatorParamMaps=paramGrid,
                          evaluator=RegressionEvaluator().setLabelCol("T_total_spend"),
                          numFolds=4)

# Run cross-validation, and choose the best set of parameters.
cvModel = crossval.setParallelism(8).fit(trainDF)

** Champion Model **
CPU times: user 5.34 s, sys: 1.4 s, total: 6.74 s
Wall time: 2min 10s


In [17]:
print("Best model coefficients")
print(modelInfo2(inputCols, cvModel.bestModel))


evaluationMetrics = getEvaluationMetrics(cvModel.bestModel,"T_total_spend",testDF,inputCols, 10)
print(f"RMSE is {evaluationMetrics[0]:.1f}")
print(f"R^2 is {evaluationMetrics[1]:.5f}")
print(f"Adjusted R^2 is {evaluationMetrics[2]:.5f}")

print()
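# RegressionEvaluator defaults to RMSE (lower is better), so the best parameter map has the smallest average metric
print(cvModel.getEstimatorParamMaps()[np.argmin(cvModel.avgMetrics)])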

Best model coefficients
                  Column name  Coefficient
0                   intercept   146.439957
1                 total_spend     0.425120
2           sd_session_length     0.000367
3   avg_purchases_per_session   -91.112432
4                 cart_events     5.532628
5          sessions_with_cart    62.295721
6          sessions_with_view    -2.155294
7       pct_sessions_end_cart  -231.925664
8             total_spend_log    -7.004726
9            total_events_log   -38.530261
10            pca_purchases10     0.000000
+------------------+-------------------+
|     T_total_spend|         prediction|
+------------------+-------------------+
|               0.0|  97.14451746520089|
|               0.0| 55.620751388341716|
|               0.0|-3.7926072597744565|
|               0.0|  164.2302977535853|
|               0.0| 23.700282810460692|
|               0.0| 59.802637858039375|
| 312.4200134277344|-12.774374240771948|
|  6009.78010559082|  574.1374007991707|
|1627.190
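
When reading `avgMetrics`, keep in mind that `RegressionEvaluator` defaults to RMSE, where smaller is better, so the parameter map behind `bestModel` is the one with the minimum average metric. A minimal sketch with made-up metric values; the dictionaries below are plain stand-ins for Spark's parameter maps, and their ordering is not claimed to match `ParamGridBuilder`:

```python
import numpy as np

reg_params = [0, 0.01, 0.2, 1, 10]
elastic_net_params = [0, 0.25, 0.5, 0.75, 1]

# Plain-dict stand-ins for the 25 parameter maps in the grid
param_maps = [
    {"regParam": r, "elasticNetParam": e}
    for r in reg_params
    for e in elastic_net_params
]

# Made-up cross-validated RMSE values, one per parameter map
rng = np.random.default_rng(1)
avg_metrics = rng.uniform(900.0, 1000.0, size=len(param_maps))

best = param_maps[int(np.argmin(avg_metrics))]   # lowest RMSE
worst = param_maps[int(np.argmax(avg_metrics))]  # highest RMSE

print("best :", best)
print("worst:", worst)
```

For a metric where larger is better, such as `r2`, the selection flips to `argmax`.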

### CV for best PCA ONLY

In [22]:
%%time
print("** PCA Only **")
inputCols = ["pca_purchases50"]

outputCol = "T_total_spend"
# The pipeline is built inline here (rather than with generatePipeline) because the parameter grid below needs a reference to lr.
vecAssembler = VectorAssembler(inputCols=inputCols, outputCol="features")

# Select output column for linear regression
lr = LinearRegression(featuresCol="features", labelCol="T_total_spend")

pipeline = Pipeline(stages=[vecAssembler, lr])
    
paramGrid = ParamGridBuilder() \
    .addGrid(lr.regParam, [0, 0.01, 0.2, 1, 10]) \
    .addGrid(lr.elasticNetParam, [0, 0.25, 0.5, 0.75, 1]) \
    .build()

crossval = CrossValidator(estimator=pipeline,
                          estimatorParamMaps=paramGrid,
                          evaluator=RegressionEvaluator().setLabelCol("T_total_spend"),
                          numFolds=4)

# Run cross-validation, and choose the best set of parameters.
cvModel = crossval.setParallelism(8).fit(trainDF)

** PCA Only **
CPU times: user 5.11 s, sys: 1.44 s, total: 6.55 s
Wall time: 1min 30s


In [24]:
print("Best model coefficients")
print(modelInfo2(inputCols, cvModel.bestModel))


evaluationMetrics = getEvaluationMetrics(cvModel.bestModel,"T_total_spend",testDF,inputCols, 50)
print(f"RMSE is {evaluationMetrics[0]:.1f}")
print(f"R^2 is {evaluationMetrics[1]:.5f}")
print(f"Adjusted R^2 is {evaluationMetrics[2]:.5f}")

print()
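# RegressionEvaluator defaults to RMSE (lower is better), so the best parameter map has the smallest average metric
print(cvModel.getEstimatorParamMaps()[np.argmin(cvModel.avgMetrics)])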

Best model coefficients
       Column name  Coefficient
0        intercept    -9.953911
1  pca_purchases50  -274.589055
+------------------+------------------+
|     T_total_spend|        prediction|
+------------------+------------------+
|               0.0|1.6128141870923454|
|               0.0|  8.90735816111607|
|               0.0| -7.95072766014445|
|               0.0| 304.5789532264374|
|               0.0|228.79933755629847|
|               0.0|245.02677473623731|
| 312.4200134277344| 58.86015972470783|
|  6009.78010559082|228.79933755629847|
|1627.1900024414062|340.74907101879154|
|1393.8800354003906|100.66348844773286|
+------------------+------------------+
only showing top 10 rows

DataFrame[user_id: int, total_spend: double, total_events: bigint, total_sessions: bigint, T_total_spend: double, avg_session_length: double, sd_session_length: double, avg_interactions_per_session: double, sd_interactions_per_session: double, max_interactions_per_session: bigint, purchase_pct