# Arizona Coyotes - Venue Relocation
* StellarAlgo Data Science
* Ryan Kazmerik & Nakisa Rad
* Mar 11, 2022

## Hypothesis
With the Coyotes relocating to a new venue, we expect the distToVenue (distance to venue) feature, which appears as the `distance` column in the training data below, to have a significant impact on package buyers. This notebook measures the importance of that feature according to the SA product propensity model.

## Experiment
This section details our experiment including querying data, data transformations, feature selection and modelling.

In [1]:
import getpass
import pyodbc
import pandas as pd
import matplotlib.pyplot as plt
import warnings

from pycaret.classification import *

warnings.filterwarnings('ignore')
%config InlineBackend.figure_format = 'retina'

### Let's connect to MSSQL and run a stored proc to get our dataset:

In [2]:
# connect to SQL Server.
SERVER = '34.206.73.189' 
DATABASE = 'datascience' 
USERNAME = 'dsAdminWrite' 
PASSWORD = getpass.getpass(prompt='Enter your password')
CNXN = pyodbc.connect(
    f'DRIVER={{ODBC Driver 17 for SQL Server}};SERVER={SERVER};DATABASE={DATABASE};UID={USERNAME};PWD={PASSWORD}'
)

Enter your password ··················


In [None]:
#lkupclientid = 55 # coyotes
#cursor = CNXN.cursor()

#storedProc = (
#    f"""Exec [stlrCoyotes].[ds].[getRetentionScoringModelData] {lkupclientid}"""
#)

#df = pd.read_sql(storedProc, CNXN)

# apply some data transformations
#df["year"] = pd.to_numeric(df["year"])

#CNXN.commit()
#cursor.close()

#df.shape

In [3]:
df = pd.read_csv("./2022 Coyotes Product Propensity Training Data.csv")

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 333265 entries, 0 to 333264
Data columns (total 10 columns):
 #   Column               Non-Null Count   Dtype  
---  ------               --------------   -----  
 0   dimCustomerMasterId  333265 non-null  int64  
 1   distance             333265 non-null  float64
 2   seasonYear           333265 non-null  int64  
 3   events_prior         333265 non-null  int64  
 4   attended_prior       333265 non-null  int64  
 5   events_last          333265 non-null  int64  
 6   attended_last        333265 non-null  int64  
 7   tenure               333265 non-null  int64  
 8   atp_last             333265 non-null  float64
 9   product_current      333265 non-null  object 
dtypes: float64(2), int64(7), object(1)
memory usage: 25.4+ MB


### Let's specify the default SA features for our model:

In [4]:
# choose the features for the stellar base product propensity model
features = [
    "atp_last", 
    "attended_last", 
    "attended_prior", 
    "dimCustomerMasterId",
    "distance", 
    "events_prior", 
    "events_last", 
    "product_current", 
    "seasonYear",
    "tenure"
]

# copy your main dataframe (an explicit copy, so the original df is never modified)
df_dataset = df.copy()

# choose the features & train year & test year
df_dataset = df_dataset[features]
df_dataset = df_dataset.loc[df_dataset["seasonYear"] <= 2019]

df_train = df_dataset.sample(frac=0.85, random_state=786)
df_eval = df_dataset.drop(df_train.index)

df_train.reset_index(drop=True, inplace=True)
df_eval.reset_index(drop=True, inplace=True)

# print out the number of records for training and eval
print('Data for Modeling: ' + str(df_train.shape))
print('Unseen Data For Predictions: ' + str(df_eval.shape), end="\n\n")

Data for Modeling: (187008, 10)
Unseen Data For Predictions: (33001, 10)
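
The hold-out logic above can be checked on a small example. This is a minimal sketch on a made-up frame (`toy` and its `row_id` column are hypothetical, not part of the Coyotes data). It illustrates that sampling 85% of the rows and then dropping the sampled index leaves a disjoint 15% remainder, provided `drop` runs before the index is reset.

```python
import pandas as pd

# Hypothetical toy frame standing in for df_dataset (values are made up).
toy = pd.DataFrame({
    "row_id": range(40),
    "distance": [float(i) for i in range(40)],
    "seasonYear": [2018 + i % 4 for i in range(40)],  # 2018..2021
})

# Keep only seasons up to 2019, as in the cell above (20 rows remain).
toy = toy.loc[toy["seasonYear"] <= 2019]

# Sample 85% for training; the rows left over form the evaluation set.
toy_train = toy.sample(frac=0.85, random_state=786)
toy_eval = toy.drop(toy_train.index)  # must happen before reset_index

toy_train = toy_train.reset_index(drop=True)
toy_eval = toy_eval.reset_index(drop=True)

print(toy_train.shape, toy_eval.shape)
```

Note that this is a random split within the filtered seasons, not a split by season, so rows from the same customer may land on both sides.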



In [None]:
df_train.head()

### Now we can model the data as a multiclass classification problem on the product_current field to see which product a customer is most likely to purchase.

In [11]:
setup(
    data= df_train, 
    fix_imbalance=True,
    silent=True,
    ignore_features=["dimCustomerMasterId","seasonYear"],
    numeric_features=[
        "atp_last",
        "attended_last",
        "attended_prior",
        "distance",
        "events_prior",
        "events_last",
        "tenure"
    ],
    target="product_current", 
    train_size = 0.85,
    verbose=False
);

In [None]:
model_matrix = compare_models(
    fold=10,
    include=["gbc"]
)



### Now that we have selected the model (only gbc was compared), we can train it with 10-fold cross-validation:

In [None]:
best_model = create_model(
    model_matrix, 
    fold= 10
)

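### We can see the correlation between the numeric features (the categorical target, product_current, is excluded):

In [None]:
# correlation is only defined for numeric columns, so select those first
corr = df.select_dtypes(include="number").corr()

# Styler.set_precision was removed in pandas 2.0; format(precision=...) needs pandas >= 1.3
corr.style.background_gradient(cmap='coolwarm').format(precision=2)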

### This plot lists the most important features for a correct prediction made by the model:

In [None]:
plot_model(best_model, plot='feature')
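
For tree ensembles such as gbc (scikit-learn's `GradientBoostingClassifier`), a feature plot like the one above is typically built from the fitted estimator's `feature_importances_` attribute. The sketch below shows how the importance of `distance` could be read off as a number. It uses synthetic data, not the Coyotes data: the labels `"package"` and `"none"` are hypothetical, and the target is deliberately constructed to depend on distance only.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
n = 400

# Synthetic stand-in features (made-up values, same names as three real columns).
X = pd.DataFrame({
    "distance": rng.uniform(0, 100, n),
    "tenure": rng.integers(0, 10, n),
    "atp_last": rng.uniform(20, 200, n),
})

# Assumption for illustration only: nearby customers buy a package.
y = np.where(X["distance"] < 30, "package", "none")

model = GradientBoostingClassifier(random_state=0).fit(X, y)

# Impurity-based importances, normalized to sum to 1.
importance = pd.Series(model.feature_importances_, index=X.columns)
importance = importance.sort_values(ascending=False)
print(importance)
```

On the real model, the same ranking would indicate where `distance` falls relative to the attendance and tenure features. Impurity-based importances can favor high-cardinality numeric features, so they are best read as a relative ordering rather than an effect size.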

### Using a confusion matrix can also help us understand where the model is predicting correctly and where it's missing:

In [None]:
plot_model(best_model, plot='confusion_matrix')
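
To make the confusion matrix concrete, here is a minimal sketch that builds one by hand with `pd.crosstab`. The product labels and predictions are made up for illustration and are not model output.

```python
import pandas as pd

# Hypothetical actual and predicted product labels (not real results).
actual = pd.Series(
    ["Full Season", "Full Season", "Half Season", "No Package",
     "No Package", "No Package", "Half Season", "Full Season"],
    name="actual",
)
predicted = pd.Series(
    ["Full Season", "Half Season", "Half Season", "No Package",
     "No Package", "Full Season", "Half Season", "Full Season"],
    name="predicted",
)

# Rows are the true class, columns are the predicted class.
cm = pd.crosstab(actual, predicted)

# The diagonal holds correct predictions; everything else is a miss.
accuracy = (actual == predicted).mean()

print(cm)
print(f"accuracy: {accuracy:.2f}")
```

Off-diagonal cells show which products are confused with each other, for example a true "Full Season" buyer predicted as "Half Season".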

### The ROC curve shows how well the model distinguishes between classes. The larger the area under the curve (AUC), the better the model is at distinguishing classes:

In [None]:
plot_model(best_model, plot="auc")
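
One way to read the AUC value: for a single class treated one-vs-rest, it equals the probability that a randomly chosen member of that class receives a higher score than a randomly chosen non-member. The sketch below computes it directly from that definition on made-up scores. The helper `auc_from_scores` is hypothetical and written for illustration, not a PyCaret function.

```python
def auc_from_scores(labels, scores):
    """AUC as P(positive score > negative score); ties count as half a win."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Made-up example: 1 = bought the product, 0 = did not.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.2]

auc = auc_from_scores(labels, scores)
print(round(auc, 3))
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect separation. In the multiclass case, one such value is computed per product class.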

### Let's load in our evaluation data and get product propensity scores using the model:

In [None]:
df_inference = predict_model(best_model, data=df_eval, raw_score=True)
df_inference.head()

### Here are the counts of predicted products, which show how many of each package the model expects to be purchased:

In [None]:
df_inference["Label"].value_counts()
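
Raw counts are easier to compare as shares. A minimal sketch, assuming a `Label` column like the one above and using made-up predictions (`toy_inference` and its product names are hypothetical):

```python
import pandas as pd

# Hypothetical predicted labels standing in for df_inference["Label"].
toy_inference = pd.DataFrame({
    "Label": ["No Package", "No Package", "Full Season", "No Package",
              "Half Season", "No Package", "Full Season", "No Package"],
})

counts = toy_inference["Label"].value_counts()
shares = toy_inference["Label"].value_counts(normalize=True)

# Side-by-side view of how many customers fall into each predicted product.
summary = pd.DataFrame({"count": counts, "share": shares})
print(summary)
```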

## Observations