# Yankees - Extended Feature Selection
* StellarAlgo Data Science
* Ryan Kazmerik & Nakisa Rad
* Jan 12, 2022

## Hypothesis
Feature selection and feature engineering are two important components of a machine learning model. Our hypothesis is that adding more features to the StellarAlgo retention model could improve its performance. This notebook tests the standard StellarAlgo retention model features.

## Experiment
This section details our experiment including querying data, data transformations, feature selection and modelling.

In [1]:
import getpass
import pyodbc
import pandas as pd
import warnings
import matplotlib.pyplot as plt

from pycaret.classification import *

warnings.filterwarnings('ignore')

### Let's connect to MSSQL and run a stored proc to get our dataset:

In [2]:
# connect to SQL Server; the password is prompted for at run time, not stored in the notebook
SERVER = '54.164.224.129'
DATABASE = 'stlrYankees'
USERNAME = 'dsAdminWrite'
PASSWORD = getpass.getpass(prompt='Enter your password')
CNXN = pyodbc.connect('DRIVER={ODBC Driver 17 for SQL Server};SERVER='+SERVER+';DATABASE='+DATABASE+';UID='+USERNAME+';PWD='+PASSWORD)
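
One way to keep credentials out of the notebook source is to read them from environment variables. The sketch below assembles the same ODBC connection string that way. The helper name `build_connection_string` and the variable names `STLR_DB_USER` and `STLR_DB_PASSWORD` are hypothetical and not part of the original notebook.

```python
import os


def build_connection_string(server, database, env=os.environ):
    """Assemble an ODBC connection string with credentials taken from env."""
    # STLR_DB_USER and STLR_DB_PASSWORD are assumed names for illustration;
    # a KeyError is raised if either one is missing.
    username = env["STLR_DB_USER"]
    password = env["STLR_DB_PASSWORD"]
    return (
        "DRIVER={ODBC Driver 17 for SQL Server};"
        f"SERVER={server};DATABASE={database};"
        f"UID={username};PWD={password}"
    )
```

The returned string could be passed to `pyodbc.connect(...)` in place of the concatenated string above.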

In [3]:
lkupclientid = 53 # Yankees

# build the stored procedure call for this client
storedProc = f"Exec [stlrYankees].[ds].[getPropensityEventScoring] {lkupclientid}"

df = pd.read_sql(storedProc, CNXN)

# apply some data transformations
df["year"] = pd.to_numeric(df["year"])

CNXN.commit()

df.shape

DatabaseError: Execution failed on sql 'Exec [stlrYankees].[ds].[getPropensityEventScoring] 53': ('42000', "[42000] [Microsoft][ODBC Driver 17 for SQL Server][SQL Server]The EXECUTE permission was denied on the object 'getPropensityEventScoring', database 'stlrYankees', schema 'ds'. (229) (SQLExecDirectW)")
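
The error above says the login lacks EXECUTE permission on the stored procedure. One way to confirm this before re-running the query is SQL Server's built-in `HAS_PERMS_BY_NAME` function. The helper below is only a sketch: `can_execute` is a hypothetical name, and it assumes a DB-API connection such as `CNXN` that accepts `?` parameter markers, as pyodbc does.

```python
def can_execute(cnxn, proc_name):
    """Return True if the current login may EXECUTE the named object."""
    cursor = cnxn.cursor()
    try:
        cursor.execute(
            "SELECT HAS_PERMS_BY_NAME(?, 'OBJECT', 'EXECUTE')",
            (proc_name,),
        )
        row = cursor.fetchone()
    finally:
        cursor.close()
    # The function yields 1 when the permission is held, 0 when it is not,
    # and NULL when the check itself fails; only 1 counts as permitted.
    return row is not None and row[0] == 1


# example call (needs a live connection):
# can_execute(CNXN, "ds.getPropensityEventScoring")
```

If this returns `False`, the login needs to be granted EXECUTE on the procedure before the rest of the notebook can run.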

In [None]:
df.info()

### We should specify the features used in our model:

In [None]:
# choose the features for the stellar base retention model
features = [
    "dimCustomerMasterId",
    "inMarket",
    "totalSpent",
    "attendancePercent",
    "source_tenure",
    "tenure",
    "distToVenue",
    "totalGames",
    "recency",
    "click_link",
    "open_email",
    "send_email",
    "isNextGameBuyer",
    "year"
]

# copy the selected feature columns from the main dataframe
df_dataset = df[features].copy()

# keep seasons up to and including 2019 for training and evaluation
df_dataset["year"] = pd.to_numeric(df_dataset["year"])
df_dataset = df_dataset.loc[df_dataset["year"] <= 2019]

df_train = df_dataset.sample(frac=0.85, random_state=786)
df_eval = df_dataset.drop(df_train.index)

df_train.reset_index(drop=True, inplace=True)
df_eval.reset_index(drop=True, inplace=True)

# print out the number of records for training and eval
print('Data for Modeling: ' + str(df_train.shape))
print('Unseen Data For Predictions: ' + str(df_eval.shape), end="\n\n")
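
The `sample` / `drop` pattern above yields two disjoint pieces that together cover every row. The sketch below checks this on a toy dataframe; the `toy` frame and its values are made up for illustration and stand in for `df_dataset`.

```python
import pandas as pd

# toy stand-in for df_dataset; the values are invented for illustration
toy = pd.DataFrame({
    "dimCustomerMasterId": range(20),
    "isNextGameBuyer": [0, 1] * 10,
})

# same split as above: 85% for modelling, the remainder held back as unseen data
toy_train = toy.sample(frac=0.85, random_state=786)
toy_eval = toy.drop(toy_train.index)

# no row lands in both pieces, and no row is lost
overlap = set(toy_train.index) & set(toy_eval.index)
print(len(toy_train), len(toy_eval), len(overlap))
```

The indices are compared before `reset_index` is called, since resetting discards the original row labels.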

### Now we can model the data using a binary classification prediction for the isNextGameBuyer field to see how likely a customer is to re-purchase.

In [None]:
setup(
    data=df_train,
    target="isNextGameBuyer",
    train_size=0.80,
    data_split_shuffle=True,
    ignore_features=["dimCustomerMasterId", "year"],
    silent=True,
    verbose=False,
    numeric_features=[
        "inMarket",
        "totalSpent",
        "attendancePercent",
        "source_tenure",
        "tenure",
        "distToVenue",
        "totalGames",
        "recency",
        "click_link",
        "open_email",
        "send_email"
    ]
);
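
Because `df_train` is already an 85% sample and `setup` holds out a further 20% of it (`train_size = 0.80`), the model is fit on roughly 68% of the rows from 2019 and earlier. The sketch below works through that arithmetic. `split_sizes` is a hypothetical helper, the row count is made up, and PyCaret's own rounding may differ by a row.

```python
def split_sizes(n_rows, modeling_frac=0.85, train_size=0.80):
    """Approximate row counts for the fit, hold-out test and unseen sets."""
    n_modeling = round(n_rows * modeling_frac)  # rows that go into df_train
    n_unseen = n_rows - n_modeling              # rows that go into df_eval
    n_fit = round(n_modeling * train_size)      # rows setup() uses for training
    n_test = n_modeling - n_fit                 # rows setup() holds out for testing
    return n_fit, n_test, n_unseen


print(split_sizes(10000))  # -> (6800, 1700, 1500)
```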

In [None]:
model_matrix = compare_models(
    fold=10,
    include=["lr", "xgboost"]
)

In [None]:
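# compare_models returns the top-ranked estimator; re-create it to see its cross-validated scores
best_model = create_model(model_matrix)
# refit the chosen model on the full modelling set, including the hold-out portion
final_model = finalize_model(best_model)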

### Let's load in our 2021 season data and get retention scores using the model:

In [None]:
df_inference = df.loc[df["year"] >= 2021]
df_inference = df_inference.fillna(0)
df_inference.shape

In [None]:
new_predictions = predict_model(final_model, data=df_inference, raw_score=True)
new_predictions.head()

In [None]:
new_predictions["Label"].value_counts()

In [None]:
#new_predictions = new_predictions.loc[new_predictions["productGrouping"] == "Full Season"]

In [None]:
new_predictions[new_predictions["Label"]==1][["Score_1"]].hist(bins=30, figsize=(10,5), range=[0,1])

In [None]:
plot_model(best_model, plot='feature')

In [None]:
plot_model(best_model, plot='confusion_matrix')

## Observations
Here you can document some ideas on the results from above.

## Conclusions
Here you can talk about next steps. Did the experiment work? If yes, what should we do next? If no, why not?