# NHL Flames - Event Propensity Model
* Ryan Kazmerik & Joey Lai
* Oct 14, 2022

## Hypothesis
Each team has different tiers (or qualities) of games, determined by day of the week, time of the season, opponent, etc. We think that previous buyer behaviour can be used to predict whether a fan will purchase for the next game.

## Experiment
This section details our experiment including querying data, data transformations, feature selection and modelling.

In [None]:
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

from pycaret.classification import *
from shared_utilities import helpers

### Let's connect to Redshift and run a stored procedure to get our dataset:

In [None]:
CLUSTER = "prod-app"
DATABASE = "stlrflames"
LKUPCLIENTID = "36"

In [None]:
df = helpers.get_event_propensity_training_dataset(
    cluster=CLUSTER,
    database=DATABASE,
    lkupclientid=LKUPCLIENTID,
    start_year=2010,
    end_year=2021
)

df.shape

In [None]:
df_train = df
df_train.info()

In [None]:
df_train.head()

### Now we can model the data as a binary classification problem on the target field (`did_purchase`) to see how likely a customer is to purchase:

In [None]:
setup(
    data=df_train,
    target="did_purchase",
    train_size=0.90,
    data_split_shuffle=True,
    # NOTE: "inmarket" is also listed in ignore_features below;
    # keep it in only one of the two lists.
    categorical_features=[
        "inmarket"
    ],
    date_features=[
        "eventdate"
    ],
    ignore_features=[
        "count_merchowned",
        "dimcustomermasterid",
        "eventname",
        "inmarket",
        "mindaysout",
        "maxdaysout"
    ],
    silent=True,
    verbose=False,
    numeric_features=[
        "distancetovenue",
        "events_purchased",
        "frequency_eventday",
        "frequency_opponent",
        "frequency_eventtime",
        "recent_clickrate",
        "recent_openrate",
        "tenure"
    ]
);
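
Because `setup` takes several column lists, it is easy to name the same column in two of them. The following is a minimal sketch of a pre-flight check, assuming the lists above are copied into plain Python variables. The helper `find_overlaps` is hypothetical and not part of PyCaret.

```python
from itertools import combinations


def find_overlaps(feature_lists):
    """Return {(list_a, list_b): shared columns} for every pair of lists that overlaps."""
    overlaps = {}
    for (name_a, cols_a), (name_b, cols_b) in combinations(feature_lists.items(), 2):
        shared = sorted(set(cols_a) & set(cols_b))
        if shared:
            overlaps[(name_a, name_b)] = shared
    return overlaps


# Column lists copied from the setup() call above.
feature_lists = {
    "categorical_features": ["inmarket"],
    "date_features": ["eventdate"],
    "ignore_features": [
        "count_merchowned", "dimcustomermasterid", "eventname",
        "inmarket", "mindaysout", "maxdaysout",
    ],
    "numeric_features": [
        "distancetovenue", "events_purchased", "frequency_eventday",
        "frequency_opponent", "frequency_eventtime", "recent_clickrate",
        "recent_openrate", "tenure",
    ],
}

overlaps = find_overlaps(feature_lists)
print(overlaps)
```

Run against the lists above, this reports `inmarket`, which appears under both `categorical_features` and `ignore_features`.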

In [None]:
# Only logistic regression ("lr") is included, so it is returned as the best model.
model_matrix = compare_models(
    fold=10,
    include=["lr"]
)

### The top model is performing well, so let's create it with 10-fold cross-validation and review its scores:

In [None]:
final_model = create_model(model_matrix, fold=10)
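
`fold=10` requests 10-fold cross-validation on the training split. As a rough illustration of the idea only, the sketch below partitions row indices into 10 contiguous folds so that each row is used for validation exactly once. The function `k_fold_indices` is hypothetical; PyCaret's own splitter may differ, for example by stratifying on the target.

```python
def k_fold_indices(n_rows, k=10):
    """Yield (train_idx, valid_idx) pairs; each row is validated exactly once."""
    # Spread any remainder over the first folds so fold sizes differ by at most 1.
    fold_sizes = [n_rows // k + (1 if i < n_rows % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        valid_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_rows))
        yield train_idx, valid_idx
        start += size


folds = list(k_fold_indices(n_rows=25, k=10))
for train_idx, valid_idx in folds[:3]:
    print(len(train_idx), valid_idx)
```

The model is fitted once per fold on `train_idx` and scored on `valid_idx`; the reported metric is the average over the folds.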

### We can also see the correlation between the features and the target variable:

In [None]:
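# Correlate numeric and boolean columns only; other columns (e.g. names, dates)
# have no Pearson correlation.
corr = df.select_dtypes(include=["number", "bool"]).corr()

# Styler.format(precision=...) replaces the deprecated Styler.set_precision (pandas >= 1.3).
corr.style.background_gradient(cmap='coolwarm').format(precision=2)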
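
`DataFrame.corr()` computes the Pearson coefficient by default. The sketch below shows that calculation in plain Python on made-up numbers; the `tenure` and `did_purchase` values are illustrative only and are not drawn from the dataset.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation: covariance divided by the product of standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Illustrative values only (assumption): longer tenure coincides with purchasing.
tenure = [1, 2, 3, 4, 5, 6]
did_purchase = [0, 0, 0, 1, 1, 1]

r = pearson(tenure, did_purchase)
print(round(r, 3))
```

A value near +1 or -1 indicates a strong linear relationship with the target; a value near 0 indicates little linear relationship.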

### This plot ranks the features by their importance to the model's predictions (either Score_0 or Score_1):

In [None]:
plot_model(final_model, plot='feature')

### This confusion matrix helps us understand where the model predicted correctly or incorrectly on the evaluation data:

In [None]:
plot_model(final_model, plot='confusion_matrix')
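
The four cells of the confusion matrix can be tallied directly from actual and predicted labels. The sketch below uses made-up labels (an assumption, not model output) and derives precision and recall from the counts.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels (1 = purchase)."""
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for actual, predicted in zip(y_true, y_pred):
        if actual == 1 and predicted == 1:
            counts["tp"] += 1
        elif actual == 0 and predicted == 1:
            counts["fp"] += 1
        elif actual == 1 and predicted == 0:
            counts["fn"] += 1
        else:
            counts["tn"] += 1
    return counts


# Illustrative labels only.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

counts = confusion_counts(y_true, y_pred)
precision = counts["tp"] / (counts["tp"] + counts["fp"])
recall = counts["tp"] / (counts["tp"] + counts["fn"])
print(counts, precision, recall)
```

Precision answers "of the fans predicted to purchase, how many did?", while recall answers "of the fans who purchased, how many did the model find?".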

### The ROC curve shows how well the model distinguishes between classes. The larger the area under the curve (AUC), the better the model is at distinguishing classes:

In [None]:
plot_model(final_model, plot='auc')
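
One way to read the AUC is as the probability that a randomly chosen purchaser receives a higher score than a randomly chosen non-purchaser. The sketch below computes it by comparing every positive/negative pair; the labels and scores are made up for illustration, and the function `roc_auc` is hypothetical rather than a PyCaret call.

```python
def roc_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Illustrative values only.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

auc = roc_auc(y_true, scores)
print(auc)
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation. The pairwise loop is quadratic, so it suits small illustrations rather than full datasets.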

### Let's load some real event data to see how the model scores it:

In [None]:
df_inference = helpers.get_event_propensity_scoring_dataset(
    cluster=CLUSTER,
    database=DATABASE,
    lkupclientid=LKUPCLIENTID,
    game_date="2022-10-22"
)

df_inference.shape

In [None]:
df_inference.info()

In [None]:
# Rename the camelCase scoring columns to the lowercase names used in training:
df_inference = df_inference.rename(columns={
    "daysOut": "daysout",
    "dimCustomerMasterId": "dimcustomermasterid",
    "eventDate": "eventdate",
    "frequency_eventDay": "frequency_eventday",
    "frequency_eventTime": "frequency_eventtime",
    "inMarket": "inmarket",
    "distanceToVenue": "distancetovenue",
    "recent_openRate": "recent_openrate",
    "recent_clickRate": "recent_clickrate"
})
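
Every rename above only lowercases the column name, so the mapping could also be generated rather than typed out. A minimal sketch, assuming no two columns differ only by case (the check below guards against that):

```python
# Column names as they appear in the rename above.
scoring_columns = [
    "daysOut", "dimCustomerMasterId", "eventDate", "frequency_eventDay",
    "frequency_eventTime", "inMarket", "distanceToVenue",
    "recent_openRate", "recent_clickRate",
]

# Build the old-name -> new-name mapping by lowercasing each column.
rename_map = {name: name.lower() for name in scoring_columns}

# Lowercasing must not merge two distinct columns into one name.
if len(set(rename_map.values())) != len(rename_map):
    raise ValueError("Lowercasing would create duplicate column names")

print(rename_map)
```

The resulting dictionary can be passed to `DataFrame.rename(columns=rename_map)` in place of the hand-written mapping.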

### Let's run the scoring dataset through the model for predictions:

In [None]:
df_scores = predict_model(final_model, data=df_inference, raw_score=True)
df_scores.head()
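
With `raw_score=True`, the output carries a score per class alongside the predicted label. For a logistic regression, the relationship between these columns can be sketched as below. The weights, intercept, feature values and the 0.5 threshold are all assumptions for illustration; they are not the fitted model's parameters.

```python
from math import exp


def sigmoid(z):
    """Map a real-valued linear score onto the (0, 1) interval."""
    return 1.0 / (1.0 + exp(-z))


def score_row(features, weights, intercept, threshold=0.5):
    """Return the label and per-class scores for one row of features."""
    z = intercept + sum(weights[name] * value for name, value in features.items())
    score_1 = sigmoid(z)
    return {
        "Label": int(score_1 >= threshold),
        "Score_0": 1.0 - score_1,
        "Score_1": score_1,
    }


# Hypothetical weights and feature values, for illustration only.
weights = {"tenure": 0.4, "recent_openrate": 1.5, "distancetovenue": -0.02}
row = {"tenure": 3, "recent_openrate": 0.6, "distancetovenue": 25.0}

result = score_row(row, weights, intercept=-1.0)
print(result)
```

`Score_0` and `Score_1` sum to 1, and the label follows from comparing `Score_1` with the threshold.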

### Here are some summary metrics on predicted purchasers and non-purchasers:

In [None]:
label_counts = df_scores["Label"].value_counts()

# .get() avoids a KeyError when a class is absent from the predictions.
did_purchase = label_counts.get(1, 0)
did_not_purchase = label_counts.get(0, 0)
total_rows = df_scores["Label"].count()
purchase_percentage = round((did_purchase / total_rows) * 100, 2)

print(f"Would purchase: {did_purchase}")
print(f"Would not purchase: {did_not_purchase}")
print(f"Purchase percentage: {purchase_percentage}%")
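
A tally of predicted labels should not fail when one class is absent from the predictions, for instance when no fan is predicted to purchase. The sketch below shows the same summary in plain Python on made-up labels; the helper `purchase_summary` is hypothetical.

```python
from collections import Counter


def purchase_summary(labels):
    """Summarise predicted labels (1 = would purchase, 0 = would not)."""
    counts = Counter(labels)
    total = len(labels)
    did_purchase = counts.get(1, 0)  # 0 when the class is missing
    did_not_purchase = counts.get(0, 0)
    percentage = round(did_purchase / total * 100, 2) if total else 0.0
    return {
        "did_purchase": did_purchase,
        "did_not_purchase": did_not_purchase,
        "purchase_percentage": percentage,
    }


# Illustrative labels only.
summary = purchase_summary([0, 0, 1, 0, 0, 0, 1, 0])
print(summary)

# A batch with no predicted purchasers still produces a valid summary.
empty_class = purchase_summary([0, 0, 0])
print(empty_class)
```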

### Here we can see the distribution of purchase scores (Score_1) across fans:

In [None]:
sns.histplot(data=df_scores, x='Score_1', bins=20, kde=True)

## Observations
Here you can talk about next steps, did the experiment work? If yes, what to do next? If no, why?