# Yankees - Event Propensity - Next Event Buyer
* StellarAlgo Data Science
* Ryan Kazmerik, Nakisa Rad, Joey Lai, Shawn Sutherland, Matt Bahler, Pat Faith
* Feb 09, 2022

## Hypothesis
Each team has different tiers (or quality levels) of games, based on day of the week, time of the season, opponent, etc. We think that by using a fan's previous buying behaviour we can predict whether that fan will purchase for the next game.

## Experiment
This section details our experiment including querying data, data transformations, feature selection and modelling.

In [None]:
import getpass
import matplotlib.pyplot as plt
import pyodbc
import pandas as pd
import warnings

from pycaret.classification import *

warnings.filterwarnings('ignore')
%config InlineBackend.figure_format = 'retina'

In [None]:
%pip install pycaret==2.3.3

### Let's connect to MSSQL and run a stored proc to get our dataset:

In [None]:
# connect to SQL Server.
SERVER = '54.164.224.129'  
DATABASE = 'stlrYankees' 
USERNAME = 'dsAdminWrite' 
PASSWORD = getpass.getpass(prompt='Enter your password')
CNXN = pyodbc.connect('DRIVER={ODBC Driver 17 for SQL Server};SERVER='+SERVER+';DATABASE='+DATABASE+';UID='+USERNAME+';PWD='+ PASSWORD)
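The connection string above is built by plain concatenation, which can break if the password contains `;` or `}`. A minimal sketch of a safer builder follows, assuming the same ODBC driver; the helper name `build_connection_string` and the `STLR_SQL_*` environment variable names are hypothetical and not part of the original notebook.

```python
import os


def build_connection_string(server, database, username, password,
                            driver="ODBC Driver 17 for SQL Server"):
    """Assemble an ODBC connection string from its parts."""
    # Wrap the password in braces so ';' is read literally;
    # a '}' inside a braced value is escaped by doubling it.
    safe_password = "{" + password.replace("}", "}}") + "}"
    parts = {
        "DRIVER": "{" + driver + "}",
        "SERVER": server,
        "DATABASE": database,
        "UID": username,
        "PWD": safe_password,
    }
    return ";".join(f"{key}={value}" for key, value in parts.items())


# Hypothetical environment variables; the fallbacks are placeholders
# so the sketch runs without any configuration.
conn_str = build_connection_string(
    server=os.environ.get("STLR_SQL_SERVER", "localhost"),
    database=os.environ.get("STLR_SQL_DATABASE", "stlrYankees"),
    username=os.environ.get("STLR_SQL_USER", "readonly_user"),
    password=os.environ.get("STLR_SQL_PASSWORD", "change-me"),
)
```

The resulting string could then be passed to `pyodbc.connect(conn_str)` in place of the concatenated literal.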

In [None]:
# pull the full training set; a read-only SELECT needs no cursor or commit
query = """
    SELECT * FROM datascience.yankees.event_propensity_training_noFirstPurchases2
    """

df = pd.read_sql(query, CNXN)

In [None]:
# pairwise correlations between the numeric columns of the dataset
corr = df.corr()
corr.style.background_gradient(cmap='coolwarm').set_precision(2)
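A heat map is hard to scan once the table has many columns. As a rough sketch, the strongest pairwise correlations can also be listed directly. The helper `top_correlated_pairs` and the toy values below are illustrative assumptions; only the column names are borrowed from this notebook.

```python
import numpy as np
import pandas as pd


def top_correlated_pairs(frame, n=5):
    """Return the n feature pairs with the largest absolute correlation."""
    corr = frame.select_dtypes("number").corr().abs()
    # Keep only the upper triangle so each pair appears once
    # and the diagonal (self-correlation) is excluded.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack().dropna()
    return pairs.sort_values(ascending=False).head(n)


# Toy data (made up) standing in for the real training table
toy = pd.DataFrame({
    "events_purchased": [1, 2, 3, 4, 5],
    "frequency_opponent": [2, 4, 6, 8, 10],  # exact multiple of events_purchased
    "distanceToVenue": [5, 3, 8, 1, 9],
})

top = top_correlated_pairs(toy, n=3)
print(top)
```

Pairs near 1.0 are candidates for dropping one of the two features before modelling.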

In [None]:
df.info()

In [None]:
#profile = ProfileReport(df, minimal=True)
#profile.to_file(output_file="yankees_pandas_profile_events.html")

### We should split the dataset into training and evaluation sets:

In [None]:
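# copy your main dataframe so the original stays untouched
df_dataset = df.copy()

# hold out 15% of the rows as unseen evaluation data
df_train = df_dataset.sample(frac=0.85, random_state=786)
df_eval = df_dataset.drop(df_train.index)

# reset_index returns a new dataframe, so assign the result back
df_train = df_train.reset_index(drop=True)
df_eval = df_eval.reset_index(drop=True)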

# print out the number of records for training and eval
print('Data for Modeling: ' + str(df_train.shape))
print('Unseen Data For Predictions: ' + str(df_eval.shape), end="\n\n")
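The sample-then-drop split relies on the original index labels, so the order of operations matters. The sketch below uses a made-up 20-row frame (not the Yankees data) to check that the two pieces are disjoint and together cover every row.

```python
import pandas as pd

# Toy frame standing in for the training table (values are made up)
toy = pd.DataFrame({
    "dimCustomerMasterId": range(20),
    "did_purchase": [0, 1] * 10,
})

train = toy.sample(frac=0.85, random_state=786)
holdout = toy.drop(train.index)

# Check the split before resetting the index: drop() matches on
# the original row labels, which reset_index would discard.
no_overlap = train.index.intersection(holdout.index).empty
covers_all = len(train) + len(holdout) == len(toy)

# reset_index returns a new frame, so the result is assigned back
train = train.reset_index(drop=True)
holdout = holdout.reset_index(drop=True)

print(train.shape, holdout.shape, no_overlap, covers_all)
```

Resetting the index before calling `drop` would remove the wrong rows, because the sampled labels would no longer match.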

In [None]:
df_train.head()

### Now we can model the data using a binary classification prediction for the target field to see how likely a customer is to purchase:

In [None]:
setup(
    data= df_train, 
    target="did_purchase", 
    train_size = 0.80,
    data_split_shuffle=True,
    categorical_features=["inMarket"],
    date_features=["eventDate"],
    ignore_features=["dimCustomerMasterId","minDaysOut","maxDaysOut"],
    silent=True,
    verbose=False,
    numeric_features=[
        "distanceToVenue",
        "events_purchased",
        "frequency_eventDay",
        "frequency_opponent",
        "frequency_eventTime",
        "recent_clickRate",
        "recent_openRate"
    ]
);

In [None]:
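model_matrix = compare_models(
    fold=10,
    include=["xgboost"]
)

# cross-validate the selected estimator, then refit it on the full training set;
# best_model and final_model are used by the cells below
best_model = create_model(model_matrix)
final_model = finalize_model(best_model)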

### Let's load in our evaluation data and get propensity scores using the model:

In [None]:
df_inference = predict_model(final_model, data=df_eval, raw_score=True)
df_inference.head()

In [None]:
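label_counts = df_inference["Label"].value_counts()
# .get avoids a KeyError when one class is never predicted
did_purchase = label_counts.get(1, 0)
did_not_purchase = label_counts.get(0, 0)
total_rows = df_inference["Label"].count()
# scale to a percentage before rounding to avoid floating-point artifacts
purchase_percentage = round(did_purchase / total_rows * 100, 2)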

print("Would purchase:", did_purchase)
print("Would not purchase:", did_not_purchase)
print("Purchase percentage:", purchase_percentage)
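The same summary can be computed in one pass with `value_counts(normalize=True)`. This is a minimal sketch on made-up labels; the helper name `purchase_summary` is hypothetical, and `reindex` keeps both classes present even when the model never predicts one of them.

```python
import pandas as pd


def purchase_summary(labels):
    """Return (would_purchase, would_not_purchase, purchase_pct) for 0/1 labels."""
    counts = labels.value_counts().reindex([0, 1], fill_value=0)
    shares = labels.value_counts(normalize=True).reindex([0, 1], fill_value=0.0)
    purchase_pct = round(float(shares.loc[1]) * 100, 2)
    return int(counts.loc[1]), int(counts.loc[0]), purchase_pct


# Made-up predictions standing in for df_inference["Label"]
example_labels = pd.Series([0, 0, 1, 0, 1, 0, 0], name="Label")
would_buy, would_not_buy, pct = purchase_summary(example_labels)
print("Would purchase:", would_buy)
print("Would not purchase:", would_not_buy)
print("Purchase percentage:", pct)
```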

### Score_0 = Did Not Purchase, Score_1 = Did Purchase

In [None]:
df_inference.hist(column=['Score_0', 'Score_1'], figsize=(30,5), layout=(1,3));
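Because `Score_0` and `Score_1` are the two class probabilities of a binary classifier, they sum to 1 for every row, so `Score_1` alone is enough to rank fans by propensity. The sketch below uses made-up scores; the 0.40 cut-off and the `target` column are hypothetical choices, not values from this experiment.

```python
import pandas as pd

# Made-up scores in the shape predict_model(..., raw_score=True) returns
scores = pd.DataFrame({
    "dimCustomerMasterId": [101, 102, 103, 104],
    "Score_0": [0.90, 0.35, 0.55, 0.10],
    "Score_1": [0.10, 0.65, 0.45, 0.90],
})

# The two class probabilities should sum to 1 on every row
sums_to_one = ((scores["Score_0"] + scores["Score_1"] - 1).abs() < 1e-9).all()

THRESHOLD = 0.40  # hypothetical cut-off; tune it against campaign capacity

# Rank fans from most to least likely to buy, then flag those above the cut-off
ranked = scores.sort_values("Score_1", ascending=False).reset_index(drop=True)
ranked["target"] = ranked["Score_1"] >= THRESHOLD
print(ranked[["dimCustomerMasterId", "Score_1", "target"]])
```

Lowering the threshold below the default 0.5 trades precision for reach, which may suit a low-cost channel such as email.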

In [None]:
plot_model(best_model, plot='feature')

In [None]:
plot_model(best_model, plot='confusion_matrix')

## Observations
Here you can document some ideas on the results from above


## Conclusions
Here you can talk about next steps, did the experiment work? If yes, what to do next? If no, why?