# Yankees - Event Propensity - Next Event Buyer Model
* StellarAlgo Data Science
* Ryan Kazmerik, Nakisa Rad, Joey Lai, Shawn Sutherland, Matt Bahler, Pat Faith
* Feb 09, 2022

## Hypothesis
Each team has different tiers (or quality levels) of games, depending on the day of the week, the time of the season, the opponent, and so on. We think that a fan's previous buying behaviour can be used to predict whether that fan will purchase tickets for the next game.

## Experiment
This section details our experiment including querying data, data transformations, feature selection and modelling.

In [5]:
import getpass
import matplotlib.pyplot as plt
import pyodbc
import pandas as pd
import warnings

from pycaret.classification import *

warnings.filterwarnings('ignore')
%config InlineBackend.figure_format = 'retina'

### Let's connect to MSSQL and query our training dataset:

In [None]:
# Connect to SQL Server; the password is prompted for so it is never stored in the notebook.
SERVER = '54.164.224.129'
DATABASE = 'stlrYankees'
USERNAME = 'dsAdminWrite'
PASSWORD = getpass.getpass(prompt='Enter your password')
CNXN = pyodbc.connect(
    'DRIVER={ODBC Driver 17 for SQL Server};'
    f'SERVER={SERVER};DATABASE={DATABASE};UID={USERNAME};PWD={PASSWORD}'
)
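The connection string above is assembled by plain concatenation, which can break if a value (most likely the password) contains a character such as `;`. The sketch below shows one way to build the same string more defensively. The helper names `odbc_value` and `build_connection_string` are hypothetical, and the brace-quoting rule is our understanding of the ODBC connection string convention rather than something this notebook relies on.

```python
def odbc_value(value: str) -> str:
    """Wrap a value in braces when it contains characters special to ODBC strings."""
    needs_quoting = any(ch in value for ch in ';{}') or value != value.strip()
    if needs_quoting:
        # Inside braces, a closing brace is escaped by doubling it.
        return '{' + value.replace('}', '}}') + '}'
    return value


def build_connection_string(server, database, username, password,
                            driver='ODBC Driver 17 for SQL Server'):
    """Assemble a DRIVER/SERVER/DATABASE/UID/PWD connection string."""
    parts = {
        'DRIVER': '{' + driver + '}',
        'SERVER': odbc_value(server),
        'DATABASE': odbc_value(database),
        'UID': odbc_value(username),
        'PWD': odbc_value(password),
    }
    return ';'.join(f'{key}={value}' for key, value in parts.items())


# Placeholder values only; no real credentials are used here.
example = build_connection_string('host', 'db', 'user', 'p;w}d')
print(example)
```

The resulting string could then be passed to `pyodbc.connect(...)` in place of the concatenated one.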

In [3]:
# Load the training set straight into a DataFrame; pd.read_sql manages its own cursor.
query = "SELECT * FROM datascience.yankees.event_propensity_training_noFirstPurchases2"

df = pd.read_sql(query, CNXN)

# End the transaction opened by the read.
CNXN.commit()

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 266386 entries, 0 to 266385
Data columns (total 16 columns):
 #   Column               Non-Null Count   Dtype         
---  ------               --------------   -----         
 0   daysOut              266386 non-null  object        
 1   minDaysOut           266386 non-null  int64         
 2   maxDaysOut           147819 non-null  float64       
 3   dimCustomerMasterId  266386 non-null  int64         
 4   recent_openRate      266386 non-null  float64       
 5   recent_clickRate     266386 non-null  float64       
 6   eventDate            266386 non-null  datetime64[ns]
 7   eventName            266386 non-null  object        
 8   inMarket             242800 non-null  object        
 9   distanceToVenue      242800 non-null  float64       
 10  tenure               266386 non-null  int64         
 11  did_purchase         266386 non-null  int64         
 12  events_purchased     266386 non-null  int64         
 13  frequency_oppo

### We should separate out some data for training the model and some for evaluating:

In [5]:
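# copy the main dataframe so the original stays untouched
df_dataset = df.copy()

# hold out 15% of the rows as unseen data for evaluation
df_train = df_dataset.sample(frac=0.85, random_state=786)
df_eval = df_dataset.drop(df_train.index)

# NOTE: reset_index returns a new DataFrame, so these two calls have no effect
# unless their results are assigned; both frames keep their original index labels.
df_train.reset_index(drop=True)
df_eval.reset_index(drop=True)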

# print out the number of records for training and eval
print('Data for Modeling: ' + str(df_train.shape))
print('Unseen Data For Predictions: ' + str(df_eval.shape), end="\n\n")

df_train.info()

Data for Modeling: (226428, 16)
Unseen Data For Predictions: (39958, 16)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 226428 entries, 158559 to 178255
Data columns (total 16 columns):
 #   Column               Non-Null Count   Dtype         
---  ------               --------------   -----         
 0   daysOut              226428 non-null  object        
 1   minDaysOut           226428 non-null  int64         
 2   maxDaysOut           125558 non-null  float64       
 3   dimCustomerMasterId  226428 non-null  int64         
 4   recent_openRate      226428 non-null  float64       
 5   recent_clickRate     226428 non-null  float64       
 6   eventDate            226428 non-null  datetime64[ns]
 7   eventName            226428 non-null  object        
 8   inMarket             206404 non-null  object        
 9   distanceToVenue      206404 non-null  float64       
 10  tenure               226428 non-null  int64         
 11  did_purchase         226428 non-null  int64        
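To make the holdout logic concrete, here is a minimal plain-Python sketch of the same idea: draw 85% of the row positions at random with a fixed seed, and treat everything that was not drawn as the evaluation set. The function name `holdout_split` is hypothetical, and the standard-library random generator will not pick the same rows as pandas does; only the sizes and the disjointness of the two sets are meant to match.

```python
import random


def holdout_split(n_rows, frac, seed):
    """Return (train, eval) row positions; eval is everything not drawn for train."""
    rng = random.Random(seed)
    n_train = round(n_rows * frac)
    train_idx = set(rng.sample(range(n_rows), n_train))
    eval_idx = [i for i in range(n_rows) if i not in train_idx]
    return sorted(train_idx), eval_idx


train_idx, eval_idx = holdout_split(n_rows=266386, frac=0.85, seed=786)

# The two sets never overlap and together cover every row.
print(len(train_idx), len(eval_idx))
```

The printed sizes agree with the shapes reported above (226428 rows for modeling, 39958 unseen rows).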

### Now we can model the data as a binary classification problem on the target field `did_purchase` to see how likely a customer is to purchase:

In [6]:
setup(
    data= df_train, 
    target="did_purchase", 
    train_size = 0.80,
    data_split_shuffle=True,
    categorical_features=["inMarket"],
    date_features=["eventDate"],
    ignore_features=[
        "dimCustomerMasterId",
        "eventName",
        "minDaysOut",
        "maxDaysOut"
    ],
    silent=True,
    verbose=False,
    numeric_features=[
        "distanceToVenue",
        "events_purchased",
        "frequency_eventDay",
        "frequency_opponent",
        "frequency_eventTime",
        "recent_clickRate",
        "recent_openRate",
        "tenure"
    ]
);
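Note that `train_size = 0.80` is applied to `df_train`, which is already an 85% sample of the full dataset, so the data is split twice. The arithmetic below sketches how many rows end up in each bucket. The exact rounding PyCaret applies internally is an assumption here (we round the inner training set down), so the inner counts may differ by a row.

```python
import math

total_rows = 266386                              # rows returned by the query
modeling_rows = round(total_rows * 0.85)         # rows passed to setup() as df_train
unseen_rows = total_rows - modeling_rows         # rows kept aside as df_eval

# Inner split made by setup(train_size=0.80); rounding down is an assumption.
train_rows = math.floor(modeling_rows * 0.80)
test_rows = modeling_rows - train_rows

print(f"unseen holdout : {unseen_rows}")
print(f"inner train    : {train_rows} ({train_rows / total_rows:.0%} of all rows)")
print(f"inner test     : {test_rows} ({test_rows / total_rows:.0%} of all rows)")
```

Under these assumptions, roughly 68% of all rows are used to fit the model, 17% form PyCaret's internal test set, and 15% remain unseen.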

In [7]:
model_matrix = compare_models(
    fold= 10, 
    include= ["lr"]
)

| | Model | Accuracy | AUC | Recall | Prec. | F1 | Kappa | MCC | TT (Sec) |
|---|---|---|---|---|---|---|---|---|---|
| lr | Logistic Regression | 0.8017 | 0.8787 | 0.7834 | 0.8132 | 0.798 | 0.6035 | 0.6039 | 4.957 |
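For readers less familiar with the metric columns, the sketch below computes Accuracy, Precision, Recall, F1, Cohen's Kappa and MCC from a binary confusion matrix. The counts are hypothetical, chosen only so that the results land near the values reported above; they are not the model's actual confusion matrix.

```python
import math


def binary_metrics(tp, fp, fn, tn):
    """Compute common binary classification metrics from confusion-matrix counts."""
    n = tp + fp + fn + tn
    accuracy = (tp + tn) / n
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)

    # Cohen's kappa: observed agreement corrected for agreement expected by chance.
    expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n ** 2
    kappa = (accuracy - expected) / (1 - expected)

    # Matthews correlation coefficient.
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {
        "Accuracy": accuracy, "Prec.": precision, "Recall": recall,
        "F1": f1, "Kappa": kappa, "MCC": mcc,
    }


# Hypothetical counts for illustration only.
metrics = binary_metrics(tp=80, fp=20, fn=20, tn=80)
for name, value in metrics.items():
    print(f"{name:9s} {value:.4f}")
```

With balanced classes, an accuracy near 0.80 corresponds to a Kappa and MCC near 0.60, which is the pattern seen in the results table.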


### Let's choose the best performing model from our model matrix:

In [8]:
best_model = create_model(
    model_matrix, 
    fold= 10
)

| Fold | Accuracy | AUC | Recall | Prec. | F1 | Kappa | MCC |
|---|---|---|---|---|---|---|---|
| 0 | 0.8026 | 0.878 | 0.7809 | 0.8163 | 0.7982 | 0.6052 | 0.6058 |
| 1 | 0.8029 | 0.8796 | 0.786 | 0.8134 | 0.7995 | 0.6057 | 0.6061 |
| 2 | 0.7997 | 0.8777 | 0.7768 | 0.814 | 0.795 | 0.5993 | 0.5999 |
| 3 | 0.8032 | 0.8802 | 0.7881 | 0.8127 | 0.8002 | 0.6065 | 0.6068 |
| 4 | 0.7945 | 0.8759 | 0.7801 | 0.8031 | 0.7915 | 0.5889 | 0.5892 |
| 5 | 0.7981 | 0.8752 | 0.7815 | 0.8082 | 0.7946 | 0.5961 | 0.5964 |
| 6 | 0.809 | 0.8819 | 0.7909 | 0.8205 | 0.8054 | 0.618 | 0.6184 |
| 7 | 0.8066 | 0.8836 | 0.7846 | 0.8206 | 0.8022 | 0.6132 | 0.6138 |
| 8 | 0.7989 | 0.876 | 0.7798 | 0.8107 | 0.7949 | 0.5978 | 0.5982 |
| 9 | 0.802 | 0.8789 | 0.7853 | 0.8124 | 0.7986 | 0.6041 | 0.6044 |
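As a quick consistency check, the per-fold scores above can be summarised by their mean and standard deviation. The sketch below does this for Accuracy and AUC using the ten fold values copied from the output; the variable names are ours.

```python
from statistics import mean, stdev

# Per-fold scores copied from the cross-validation output above.
fold_accuracy = [0.8026, 0.8029, 0.7997, 0.8032, 0.7945,
                 0.7981, 0.8090, 0.8066, 0.7989, 0.8020]
fold_auc = [0.8780, 0.8796, 0.8777, 0.8802, 0.8759,
            0.8752, 0.8819, 0.8836, 0.8760, 0.8789]

mean_accuracy = mean(fold_accuracy)
mean_auc = mean(fold_auc)

print(f"Accuracy: mean={mean_accuracy:.4f}, sd={stdev(fold_accuracy):.4f}")
print(f"AUC:      mean={mean_auc:.4f}, sd={stdev(fold_auc):.4f}")
```

The fold means agree with the Accuracy and AUC reported for logistic regression in the model matrix, and the small spread across folds suggests the scores are stable across splits.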
