# Portland Trail Blazers - Customer Lifetime Value
* StellarAlgo Data Science
* Ryan Kazmerik
* Jul 14, 2022

In [2]:
import getpass
import pyodbc

import pandas as pd
import matplotlib.pyplot as plt

from pycaret.regression import *

## Hypothesis:

**Customer Lifetime Value (CLTV)** is the total amount of money a customer is expected to spend with a business over their lifetime as a customer. It is an important metric to monitor because it informs how much to invest in acquiring new customers and retaining existing ones.

We propose to build a CLTV regression model, trained on RFM (recency, frequency, monetary) data from past seasons, that predicts CLTV for the next season.
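As a minimal sketch of what RFM data looks like, the cell below derives recency, frequency and monetary values per customer from a few toy purchases. The toy rows, the `snapshot` date and the output column names are assumptions made for illustration; only `dimcustomermasterid`, `purchasedate` and `revenue` come from the dataset used in this notebook.

```python
import pandas as pd

# Toy purchases (illustrative values, not real data).
purchases = pd.DataFrame({
    "dimcustomermasterid": [3, 16, 16, 48],
    "purchasedate": pd.to_datetime(
        ["2017-11-21", "2018-02-02", "2018-03-10", "2019-02-05"]
    ),
    "revenue": [75.0, 85.0, 40.0, 10.0],
})

# Assumed snapshot date: the day after the last purchase in the data.
snapshot = purchases["purchasedate"].max() + pd.Timedelta(days=1)

rfm = purchases.groupby("dimcustomermasterid").agg(
    # Recency: days between the snapshot and the customer's last purchase.
    recency=("purchasedate", lambda d: (snapshot - d.max()).days),
    # Frequency: number of purchase records.
    frequency=("purchasedate", "count"),
    # Monetary: total revenue from the customer.
    monetary=("revenue", "sum"),
)
print(rfm)
```

Each customer ends up with one row, which is the shape a regression model expects.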

In [6]:
df = pd.read_parquet("./data/ptb_ticket_purchases_all.parquet")
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3949410 entries, 0 to 3949409
Data columns (total 8 columns):
 #   Column               Dtype  
---  ------               -----  
 0   dimcustomermasterid  int64  
 1   purchasedate         object 
 2   ticketcount          int64  
 3   revenue              float64
 4   isplanproduct        bool   
 5   producttype          object 
 6   subproductname       object 
 7   seasonyear           int64  
dtypes: bool(1), float64(1), int64(3), object(3)
memory usage: 214.7+ MB


In [7]:
df.head()

| | dimcustomermasterid | purchasedate | ticketcount | revenue | isplanproduct | producttype | subproductname | seasonyear |
|---|---|---|---|---|---|---|---|---|
| 0 | 3 | 2017-11-21 | 3 | 75.0 | False | Other | Group | 2017 |
| 1 | 16 | 2018-02-02 | 2 | 85.0 | True | Package | Half Season | 2017 |
| 2 | 16 | 2018-02-02 | 2 | 85.0 | True | Package | Half Season | 2017 |
| 3 | 48 | 2019-02-05 | 1 | 10.0 | False | Individual | Individual | 2018 |
| 4 | 87 | 2021-05-28 | 2 | 144.0 | False | Individual | Individual | 2020 |


### Exploratory Data Analysis (EDA)

Let's have a look at the data and decide whether we need any data cleaning and data transformation for further analysis.

By reading the profiling report, we can see that the following actions should be taken to improve the dataset quality:
* Remove missing values from revenue (23.3%)
* Remove zero values from revenue (6.8%)
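The shares of missing and zero values can also be checked directly with pandas. The sketch below uses a hypothetical helper, `revenue_quality`, and a toy frame; it is not how the profiling report itself computes these figures.

```python
import numpy as np
import pandas as pd

def revenue_quality(frame: pd.DataFrame, column: str = "revenue") -> pd.Series:
    """Return the percentage of missing and zero values in a column."""
    values = frame[column]
    return pd.Series({
        "missing_pct": values.isna().mean() * 100,
        # NaN == 0 evaluates to False, so missing values are not counted as zeros.
        "zero_pct": (values == 0).mean() * 100,
    })

# Toy frame (illustrative values only): one missing and one zero out of four rows.
toy = pd.DataFrame({"revenue": [75.0, np.nan, 0.0, 85.0]})
quality = revenue_quality(toy)
print(quality)
```

Calling `revenue_quality(df)` on the full dataframe before and after cleaning would show whether both issues are resolved.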

In [None]:
df.dropna(subset=['revenue'], inplace=True)

Let's confirm that no NULL records remain in the revenue column

In [None]:
df['revenue'].isnull().sum()

Let's also drop any rows where revenue = 0, and then check the value counts to ensure the zero values are gone

In [None]:
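# .copy() gives an independent frame, so later column assignments do not trigger SettingWithCopyWarning
df = df[df['revenue'] > 0].copy()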
df['revenue'].value_counts(bins=[0, 1000, 10000, 100000])

### Data Types

Let's look at the data types in our dataframe to see if we need to convert any values to a different type

In [None]:
df.dtypes

Purchase date should be a datetime, not an object (string). All of the other inferred data types look correct.

In [None]:
df['purchasedate'] = pd.to_datetime(df['purchasedate'], format='%Y-%m-%d')

### Distributions

Let's look at the distributions for our three key fields: purchasedate, ticketcount and revenue

In [None]:
plt.hist(df["purchasedate"], bins=20, color='dodgerblue', edgecolor='black')
plt.title("Recency", fontsize=16)
plt.xlabel("Year", fontsize=14)
plt.ylabel("Fans", fontsize=14)

### Period of Time

In [None]:
print(f"Start Date: {df['purchasedate'].min()}")
print(f"End Date: {df['purchasedate'].max()}")

Purchasing was abnormally low during the 2020 season because of stadium closures due to the COVID-19 pandemic. There are also some outlier purchases from before 2017.

We may want to remove these from our training dataset, but let's leave them in for now
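If we do decide to remove them later, a filter along the following lines would work. This is only a sketch: the cutoff date, the rule of dropping the whole 2020 season, and the toy rows are assumptions, not decisions made in this notebook.

```python
import pandas as pd

# Toy rows (illustrative values only).
toy = pd.DataFrame({
    "purchasedate": pd.to_datetime(
        ["2015-03-01", "2017-11-21", "2021-01-15", "2021-11-02"]
    ),
    "seasonyear": [2014, 2017, 2020, 2021],
})

# Assumed rule: keep purchases from 2017 onward and drop the 2020 season.
cutoff = pd.Timestamp("2017-01-01")
keep = (toy["purchasedate"] >= cutoff) & (toy["seasonyear"] != 2020)
filtered = toy[keep].reset_index(drop=True)
print(filtered)
```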

In [None]:
plt.hist(df["ticketcount"], bins=20, color='dodgerblue', edgecolor='black')
plt.title("Ticket Count", fontsize=16)
plt.xlabel("No. Tickets", fontsize=14)
plt.ylabel("Fans", fontsize=14)

In [None]:
df['ticketcount'].value_counts(bins=[0, 5, 10, 50, 100, 10000])

We can see that ticket count is highly skewed toward 1 ticket. We may want to remove the outliers here, but let's leave them in for now.

In [None]:
plt.hist(df["revenue"], bins=20, color='dodgerblue', edgecolor='black')
plt.title("Revenue", fontsize=16)
plt.xlabel("Spend ($)", fontsize=14)
plt.ylabel("Fans", fontsize=14)

In [None]:
df.revenue.value_counts(bins=[0, 100, 1000, 10000, 100000, 1000000])

In [None]:
df_big_spenders = df[df['revenue'] > 10000]
df_big_spenders.head()

Most fans spend between 1 and 1000 dollars on a purchase, but some spend much more (see df_big_spenders). We may want to remove these outliers, but let's leave them in for now.
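One common way to trim such outliers, should we choose to, is a quantile cutoff. The helper name `drop_upper_outliers` and the quantile values below are assumptions for illustration; the right threshold depends on the business question.

```python
import pandas as pd

def drop_upper_outliers(frame: pd.DataFrame, column: str, quantile: float = 0.99) -> pd.DataFrame:
    """Keep rows whose value is at or below the given quantile of the column."""
    cutoff = frame[column].quantile(quantile)
    return frame[frame[column] <= cutoff].copy()

# Toy frame (illustrative values only) with one extreme purchase.
toy = pd.DataFrame({"revenue": [10.0, 20.0, 30.0, 40.0, 50000.0]})

# A low quantile is used here only because the toy frame is tiny.
trimmed = drop_upper_outliers(toy, "revenue", quantile=0.8)
print(trimmed)
```

The same helper could be applied to `ticketcount` by passing a different column name.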

### Data Preprocessing

We are going to build a model that predicts CLTV over a 3-month period. First, let's slice the data into 3-month chunks and take the last chunk as the target for predictions.

In [None]:
def groupby_mean(x):
    return x.mean()

def groupby_count(x):
    return x.count()

groupby_mean.__name__ = 'avg'
groupby_count.__name__ = 'count'

In [None]:
clv_freq = '3M'

df_data = df.groupby(
    ['dimcustomermasterid',pd.Grouper(key='purchasedate', freq=clv_freq)]
).agg({
    'revenue': ['sum', groupby_mean, groupby_count]  # 'sum' as a string avoids passing the Python builtin to agg
})

In [None]:
df_data.columns = ['_'.join(col).lower() for col in df_data.columns]
df_data = df_data.reset_index()

df_data.info()

In [None]:
df_data.head()
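The plan stated at the start of this section, taking the last 3-month chunk as the prediction target, can be sketched as follows. The toy `periods` frame stands in for `df_data`, and the names `hist_revenue_sum`, `hist_periods` and `target_revenue` are hypothetical; treating customers with no purchase in the last period as a target of 0 is also an assumption.

```python
import pandas as pd

# Toy stand-in for df_data: one row per customer and 3-month period.
periods = pd.DataFrame({
    "dimcustomermasterid": [3, 3, 16, 16, 48],
    "purchasedate": pd.to_datetime(
        ["2021-09-30", "2021-12-31", "2021-09-30", "2021-12-31", "2021-09-30"]
    ),
    "revenue_sum": [75.0, 30.0, 170.0, 85.0, 10.0],
})

# The most recent period supplies the target; earlier periods supply features.
last_period = periods["purchasedate"].max()
history = periods[periods["purchasedate"] < last_period]
target = periods[periods["purchasedate"] == last_period]

features = history.groupby("dimcustomermasterid")["revenue_sum"].agg(
    hist_revenue_sum="sum",
    hist_periods="count",
)
next_revenue = (
    target.set_index("dimcustomermasterid")["revenue_sum"].rename("target_revenue")
)

# Assumption: customers absent from the last period spent nothing in it.
dataset = features.join(next_revenue).fillna({"target_revenue": 0.0})
print(dataset)
```

Separating features and target by time like this keeps information from the target period out of the model inputs.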

### Model Training
We need to create training and evaluation datasets to train our regression model and evaluate the model performance against unseen data points


In [None]:
# Assumption: the 3-month aggregates in df_data are the modeling dataset
df_dataset = df_data

df_train = df_dataset.sample(frac=0.85, random_state=786)
df_eval = df_dataset.drop(df_train.index)

df_train.reset_index(drop=True, inplace=True)
df_eval.reset_index(drop=True, inplace=True)

print('Data for Modeling: ' + str(df_train.shape))
print('Unseen Data For Predictions: ' + str(df_eval.shape), end="\n\n")

In [None]:
# Assumption: 'revenue_sum' stands in for the 'monetary' target.
# target and ignore_features must name columns that exist in df_train.
setup(
    data=df_train,
    date_features=["purchasedate"],
    ignore_features=["dimcustomermasterid"],
    target='revenue_sum',
    silent=True,
    verbose=True,
    session_id=123
);

Let's look at the regression models we can experiment with

In [None]:
models()

We could experiment with different model types, but for now let's choose linear regression.

In [None]:
model_matrix = compare_models(
    fold=10,
    include=["lr"]
)

In [None]:
best_model = create_model(model_matrix)

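We can see the model performance in the scoring table. The R2 metric measures how well the model fits our dataset: 1 is a perfect fit (possibly a sign of overfitting), 0 means the model does no better than always predicting the mean, and negative values are possible when it does worse than that.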
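As a worked illustration of R2, the sketch below computes it by hand as 1 minus the ratio of the residual sum of squares to the total sum of squares. The function name `r_squared` and the toy values are assumptions for illustration; PyCaret reports this metric for us.

```python
import numpy as np

def r_squared(y_true, y_pred) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation around the mean
    return float(1.0 - ss_res / ss_tot)

actual = [10.0, 20.0, 30.0]

perfect = r_squared(actual, actual)                 # predictions match exactly
baseline = r_squared(actual, [20.0, 20.0, 20.0])    # always predict the mean
worse = r_squared(actual, [30.0, 20.0, 10.0])       # worse than predicting the mean

print(perfect, baseline, worse)
```

The third case shows that R2 is not bounded below by 0: a model can score negative on data it fits badly.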

We can plot the prediction error: predicted values against actual values, with the line of best fit shown next to the identity line (where predicted equals actual)

In [None]:
plot_model(best_model, plot = 'error')

We can also plot the feature importance to see which features affect the model's predictions the most

In [None]:
plot_model(best_model, plot='feature')

### Observations

* Observation 1
* Observation 2
* Observation 3

### Conclusion

Add a conclusion to the experiment here