# Yankees - Propensity Event - NextTierBuyer
* StellarAlgo Data Science
* Ryan Kazmerik & Nakisa Rad
* Feb 03, 2022

## Hypothesis


## Experiment
This section details our experiment including querying data, data transformations, feature selection and modelling.

In [7]:
import getpass
import pyodbc
import pandas as pd
import warnings
import matplotlib.pyplot as plt

from pycaret.classification import *

warnings.filterwarnings('ignore')

### Let's connect to MSSQL and run a stored proc to get our dataset:

In [8]:
# connect to SQL Server; the password is prompted for rather than stored in the notebook
SERVER = '54.164.224.129'
DATABASE = 'stlrRays'
USERNAME = 'nrad'
PASSWORD = getpass.getpass(prompt='Enter your password')
CNXN = pyodbc.connect('DRIVER={ODBC Driver 17 for SQL Server};SERVER='+SERVER+';DATABASE='+DATABASE+';UID='+USERNAME+';PWD='+ PASSWORD)
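
The cell above builds the ODBC connection string by concatenating literals. As a minimal sketch, the same string could be assembled from environment variables so that no credential lives in the notebook. The variable names `STLR_SQL_SERVER`, `STLR_SQL_DATABASE`, `STLR_SQL_USER` and `STLR_SQL_PASSWORD` and the helper `build_connection_string` are hypothetical. Wrapping the password in braces follows the usual ODBC convention for values that contain special characters.

```python
import os


def _quote(value: str) -> str:
    """Wrap an ODBC attribute value in braces, doubling any closing brace."""
    return "{" + value.replace("}", "}}") + "}"


def build_connection_string(env=os.environ) -> str:
    """Assemble a SQL Server ODBC connection string from environment variables.

    The STLR_SQL_* names are hypothetical; use whatever your deployment defines.
    """
    return (
        "DRIVER={ODBC Driver 17 for SQL Server};"
        f"SERVER={env['STLR_SQL_SERVER']};"
        f"DATABASE={env['STLR_SQL_DATABASE']};"
        f"UID={env['STLR_SQL_USER']};"
        f"PWD={_quote(env['STLR_SQL_PASSWORD'])}"
    )


# Usage (assumes pyodbc is installed and the variables are set):
# CNXN = pyodbc.connect(build_connection_string())
```

Passing `env` as a parameter keeps the helper easy to exercise with a plain dictionary instead of the real environment.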

In [10]:
lkupclientid = 53 # Yankees

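# run the stored procedure that returns the propensity scoring dataset
storedProc = (
    f"""Exec [stlrYankees].[ds].[getPropensityEventScoring_new] {lkupclientid}"""
)

# pd.read_sql manages its own cursor, so no explicit cursor is needed
df = pd.read_sql(storedProc, CNXN)

# make sure the season year is numeric so it can be filtered on later
df["year"] = pd.to_numeric(df["year"])

CNXN.commit()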

df.shape

(96777, 28)

In [11]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 96777 entries, 0 to 96776
Data columns (total 28 columns):
 #   Column                  Non-Null Count  Dtype         
---  ------                  --------------  -----         
 0   lkupClientId            96777 non-null  int64         
 1   dimCustomerMasterId     96777 non-null  int64         
 2   eventName               96777 non-null  object        
 3   inMarket                96777 non-null  bool          
 4   year                    96777 non-null  int64         
 5   productGrouping         96777 non-null  object        
 6   totalSpent              96777 non-null  float64       
 7   recentDate              96777 non-null  datetime64[ns]
 8   attendancePercent       96777 non-null  float64       
 9   renewedBeforeDays       96777 non-null  int64         
 10  isBuyer                 96777 non-null  object        
 11  source_tenure           96777 non-null  int64         
 12  tenure                  96777 non-null  int64 

In [12]:
df.shape

(96777, 28)

In [13]:
df.to_csv('yankees-data-export.csv')

### We should specify the features used in our model:

In [14]:
# choose the features for the next-tier propensity model
features = ["dimCustomerMasterId",
        "attendancePercent",
        "click_link",
        "clickToSendRatio",
        "clickToOpenRatio",
        "distToVenue",
        "fill_out_form",
        "inMarket",
        "open_email" ,
        "openToSendRatio",
        "recency",
        "renewedBeforeDays",
        "send_email",
        "source_tenure",
        "tenure",
        "totalGames",
        "unsubscribe_email",
        "nextYearTier",
        "year"
]

# copy the main dataframe, keeping only the model features
# (.copy() avoids modifying a view of df and the SettingWithCopyWarning that comes with it)
df_dataset = df[features].copy()

# keep seasons up to and including 2019 for training and evaluation
df_dataset["year"] = pd.to_numeric(df_dataset["year"])
df_dataset = df_dataset.loc[df_dataset["year"] <= 2019]
#df_dataset = df_dataset[df_dataset["Tier"].isin([1, 2, 3])]

df_train = df_dataset.sample(frac=0.85, random_state=786)
df_eval = df_dataset.drop(df_train.index)

df_train.reset_index(drop=True, inplace=True)
df_eval.reset_index(drop=True, inplace=True)

# print out the number of records for training and eval
print('Data for Modeling: ' + str(df_train.shape))
print('Unseen Data For Predictions: ' + str(df_eval.shape), end="\n\n")

Data for Modeling: (82260, 19)
Unseen Data For Predictions: (14517, 19)
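
The printed shapes can be checked by hand. Below is a minimal sketch of the arithmetic, assuming `DataFrame.sample(frac=...)` rounds `frac * len(df)` to the nearest integer. The two splits add back up to all 96,777 rows returned by the stored procedure, which suggests the `year <= 2019` filter removed nothing and that the dataset holds no seasons after 2019.

```python
TOTAL_ROWS = 96777        # df.shape[0] reported after the query
TRAIN_FRAC = 0.85         # frac passed to df_dataset.sample
REPORTED_TRAIN = 82260    # "Data for Modeling" row count printed above
REPORTED_EVAL = 14517     # "Unseen Data For Predictions" row count printed above

# rows that survived the year <= 2019 filter
rows_after_filter = REPORTED_TRAIN + REPORTED_EVAL

# rows the filter dropped from the original query result
rows_removed_by_year_filter = TOTAL_ROWS - rows_after_filter

# training rows expected from an 85% sample of the filtered data
expected_train = round(TRAIN_FRAC * rows_after_filter)

print("rows removed by year filter:", rows_removed_by_year_filter)
print("expected training rows:", expected_train)
```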



### Now we can model the data using a multiclass classification prediction for the nextYearTier field to see which tier a customer is likely to buy next year.

In [15]:
setup(
    data= df_train, 
    target="nextYearTier", 
    train_size = 0.80,
    data_split_shuffle=True,
    ignore_features=["dimCustomerMasterId","year"],
    silent=True,
    verbose=False,
    numeric_features=[
        "attendancePercent",
        "renewedBeforeDays",
        "source_tenure",
        "tenure",
        "distToVenue",
        "totalGames",
        "recency",
        "click_link",
        "fill_out_form",
        "open_email" ,
        "send_email",
        "unsubscribe_email",
        "openToSendRatio",
        "clickToSendRatio",
        "clickToOpenRatio"
    ]
);

In [16]:
model_matrix = compare_models(
    fold=10,
    include=["xgboost"]
)

| ID | Model | Accuracy | AUC | Recall | Prec. | F1 | Kappa | MCC | TT (Sec) |
|---|---|---|---|---|---|---|---|---|---|
| xgboost | Extreme Gradient Boosting | 0.5161 | 0.7494 | 0.3603 | 0.5634 | 0.4698 | 0.2484 | 0.2834 | 10.33 |


In [17]:
best_model = create_model(model_matrix)
final_model = finalize_model(best_model)

| Fold | Accuracy | AUC | Recall | Prec. | F1 | Kappa | MCC |
|---|---|---|---|---|---|---|---|
| 0 | 0.5179 | 0.7509 | 0.3663 | 0.5634 | 0.475 | 0.2541 | 0.2858 |
| 1 | 0.5247 | 0.7609 | 0.3631 | 0.5711 | 0.4762 | 0.261 | 0.2989 |
| 2 | 0.5074 | 0.7475 | 0.3481 | 0.5537 | 0.4576 | 0.2328 | 0.2681 |
| 3 | 0.5115 | 0.7436 | 0.3488 | 0.5481 | 0.4614 | 0.2393 | 0.2754 |
| 4 | 0.5267 | 0.7522 | 0.3732 | 0.5784 | 0.4842 | 0.2664 | 0.3015 |
| 5 | 0.5128 | 0.75 | 0.365 | 0.5573 | 0.4686 | 0.2452 | 0.2782 |
| 6 | 0.5198 | 0.7429 | 0.3643 | 0.5731 | 0.474 | 0.254 | 0.2904 |
| 7 | 0.511 | 0.7468 | 0.3529 | 0.5572 | 0.4619 | 0.2389 | 0.2752 |
| 8 | 0.5106 | 0.7504 | 0.3535 | 0.5653 | 0.4641 | 0.2389 | 0.2733 |
| 9 | 0.5184 | 0.7491 | 0.3679 | 0.5668 | 0.475 | 0.2538 | 0.2877 |
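
The per-fold rows can be summarised to confirm they agree with the single `compare_models` row. A minimal sketch using only the standard library: the means of the ten Accuracy and AUC values round to 0.5161 and 0.7494, matching the figures reported earlier. The standard deviation here is the sample standard deviation, which is not necessarily the convention PyCaret uses in its own summary.

```python
from statistics import mean, stdev

# per-fold metrics copied from the create_model output above
accuracy = [0.5179, 0.5247, 0.5074, 0.5115, 0.5267, 0.5128, 0.5198, 0.5110, 0.5106, 0.5184]
auc = [0.7509, 0.7609, 0.7475, 0.7436, 0.7522, 0.7500, 0.7429, 0.7468, 0.7504, 0.7491]

for name, values in (("Accuracy", accuracy), ("AUC", auc)):
    # the mean summarises performance; the sample standard deviation shows fold-to-fold spread
    print(f"{name}: mean={mean(values):.4f} sd={stdev(values):.4f}")
```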




### Let's load in the season data from 2021 onward and get next-year tier scores using the model:

In [18]:
df_inference = df.loc[df["year"] >= 2021]
df_inference = df_inference.fillna(0)
df_inference.shape

(0, 28)

In [19]:
df_inference

Empty DataFrame (0 rows × 28 columns) with columns: lkupClientId, dimCustomerMasterId, eventName, inMarket, year, productGrouping, totalSpent, recentDate, attendancePercent, renewedBeforeDays, ..., open_email, send_email, unsubscribe_email, openToSendRatio, clickToSendRatio, clickToOpenRatio, credits_after_refund, NumberofGamesPerSeason, isNextGameBuyer, nextYearTier


In [20]:
new_predictions = predict_model(final_model, data=df_inference, raw_score=True)
new_predictions.head()

ValueError: Found array with 0 sample(s) (shape=(0, 15)) while a minimum of 1 is required.
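
The `ValueError` above arises because `df_inference` has zero rows: no record satisfies `year >= 2021`. Below is a minimal sketch of a guard that could run before `predict_model`. `require_rows` is a hypothetical helper, not part of pandas or PyCaret. It relies only on `len()`, so it works for a DataFrame as well as a plain list.

```python
def require_rows(frame, description: str) -> int:
    """Return the row count of `frame`, raising a clear error if it is empty."""
    n_rows = len(frame)
    if n_rows == 0:
        raise ValueError(
            f"{description} is empty; check the filter that produced it "
            "before calling the model."
        )
    return n_rows


# Usage in the notebook (hypothetical):
# require_rows(df_inference, "df_inference (year >= 2021)")
# new_predictions = predict_model(final_model, data=df_inference, raw_score=True)
```

Failing early with a message that names the filter points at the cause, whereas the scikit-learn error only reports the shape of the empty array.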

In [None]:
new_predictions["Label"].value_counts()

In [None]:
new_predictions["Score_1"].value_counts(bins=[0, 0.2, 0.4, 0.6, 0.8, 1.0])

In [None]:
new_predictions["Score_2"].value_counts(bins=[0, 0.2, 0.4, 0.6, 0.8, 1.0])

In [None]:
new_predictions["Score_3"].value_counts(bins=[0, 0.2, 0.4, 0.6, 0.8, 1.0])

In [None]:
new_predictions["Score_4"].value_counts(bins=[0, 0.2, 0.4, 0.6, 0.8, 1.0])

In [None]:
new_predictions["Score_5"].value_counts(bins=[0, 0.2, 0.4, 0.6, 0.8, 1.0])
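
The cells above bucket each class score into fifths. Below is a minimal sketch of the same bucketing in plain Python, assuming right-closed intervals with the lowest edge included, which to my understanding is how pandas bins values when `value_counts` is given `bins`. `bucket_counts` is a hypothetical helper, and the scores are made-up illustration values, not model output.

```python
from bisect import bisect_left

EDGES = [0, 0.2, 0.4, 0.6, 0.8, 1.0]


def bucket_counts(scores, edges=EDGES):
    """Count scores per interval (edges[i], edges[i+1]]; edges[0] falls in the first bin."""
    counts = [0] * (len(edges) - 1)
    for score in scores:
        if not edges[0] <= score <= edges[-1]:
            continue  # outside the binned range, so not counted
        # bisect_left on the upper edges puts a score equal to an edge in the lower bin
        index = bisect_left(edges[1:], score)
        counts[index] += 1
    return counts


# made-up scores for illustration only
example_scores = [0.0, 0.05, 0.2, 0.21, 0.55, 0.8, 1.0]
print(bucket_counts(example_scores))
```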

In [None]:
new_predictions[new_predictions["Label"]==1][["Score_1"]].hist(bins=30, figsize=(10,5), range=[0,1])

In [None]:
new_predictions[new_predictions["Label"]==2][["Score_2"]].hist(bins=30, figsize=(10,5), range=[0,1])

In [None]:
new_predictions[new_predictions["Label"]==3][["Score_3"]].hist(bins=30, figsize=(10,5), range=[0,1])

In [None]:
new_predictions[new_predictions["Label"]==4][["Score_4"]].hist(bins=30, figsize=(10,5), range=[0,1])

In [None]:
new_predictions[new_predictions["Label"]==5][["Score_5"]].hist(bins=30, figsize=(10,5), range=[0,1])

In [None]:
plot_model(best_model, plot='feature')

In [None]:
plot_model(best_model, plot='confusion_matrix')

## Observations
Here you can document some ideas on the results from above


## Conclusions
Here you can talk about next steps, did the experiment work? If yes, what to do next? If no, why?

* We should ask CS/CI which factor they consider most significant in predicting a next game buyer.