# grizzlies - Extended Feature Selection
* StellarAlgo Data Science
* Ryan Kazmerik & Nakisa Rad
* Feb 22, 2022

## Hypothesis
Feature selection and feature engineering are two of the most important components of a machine learning model. Our hypothesis is that adding more features to the StellarAlgo retention model could improve its performance. This notebook tests the standard StellarAlgo retention model features.

## Experiment
This section details our experiment including querying data, data transformations, feature selection and modelling.

In [1]:
import getpass
import pyodbc
import pandas as pd
import warnings
import matplotlib.pyplot as plt

from pycaret.classification import *

warnings.filterwarnings('ignore')

### Let's connect to MSSQL and run a stored proc to get our dataset:

In [2]:
# connect to SQL Server.
SERVER = '34.206.73.189' 
DATABASE = 'datascience' 
USERNAME = 'nrad' 
PASSWORD = getpass.getpass(prompt='Enter your password')
CNXN = pyodbc.connect('DRIVER={ODBC Driver 17 for SQL Server};SERVER='+SERVER+';DATABASE='+DATABASE+';UID='+USERNAME+';PWD='+ PASSWORD)

In [3]:
lkupclientid = 27 # grizzlies

cursor = CNXN.cursor()

storedProc = (
    f"""Exec [stlrMILB].[ds].[getRetentionScoringModelData] {lkupclientid}"""
)

df = pd.read_sql(storedProc, CNXN)

# apply some data transformations
df["year"] = pd.to_numeric(df["year"])

CNXN.commit()
cursor.close()

df.head()
#df.info()

index,lkupClientId,dimCustomerMasterId,customerNumber,year,productGrouping,totalSpent,recentDate,attendancePercent,renewedBeforeDays,isBuyer,source_tenure,tenure,distToVenue,totalGames,recency,missed_games_1,missed_games_2,missed_games_over_2,click_link,fill_out_form,open_email,send_email,unsubscribe_email,openToSendRatio,clickToSendRatio,clickToOpenRatio,posting_records,resale_records,resale_atp,forward_records,cancel_records,email,inbound_email,inbound_phonecall,inperson_contact,internal_note,left_message,outbound_email,outbound_phonecall,phonecall,text,unknown,gender,childrenPresentInHH,maritalStatus,lengthOfResidenceInYrs,annualHHIncome,education,urbanicity,credits_after_refund,is_Lockdown,NumberofGamesPerSeason,CNTPostponedGames,isNextYear_Buyer
0,27,323074094,325825,2014,Half Season,1656.0,2014-08-16,0.267606,-4.0,True,1460,177.0,13.51,9,3,0,2,4,0,0,0,0,0,0.0,0.0,0.0,0,0,0.0,0,0,0,0,0,0,0,0,0,0,0,0,0,Unknown,1,1,,,,,0.0,0,78,,1
1,27,323077639,315195,2014,Half Season,900.0,2014-08-29,0.728571,-77.0,True,1460,177.0,14.13,28,1,3,1,2,0,0,0,0,0,0.0,0.0,0.0,0,0,0.0,0,0,0,0,0,0,0,0,0,0,0,0,0,Unknown,1,1,,,,,0.0,0,78,,1
2,27,352013305,315448,2014,Full Season,1584.0,2014-08-31,0.637681,114.0,True,1460,257.0,14.13,66,0,2,3,8,0,0,0,0,0,0.0,0.0,0.0,0,0,0.0,0,0,0,0,0,0,0,0,0,0,0,0,0,Unknown,1,1,,,,,0.0,0,78,,1
3,27,352013343,306343,2014,Club Seats,1200.0,2014-04-19,0.055556,49.0,True,1460,190.0,34.37,1,16,1,0,1,0,0,0,0,0,0.0,0.0,0.0,0,0,0.0,0,0,0,0,0,0,0,0,0,0,0,0,0,Unknown,1,1,,,,,0.0,0,78,,1
4,27,352013516,314649,2014,Club Seats,825.0,2014-09-01,0.85,20.0,True,1460,135.0,11.52,9,0,1,1,0,0,0,0,0,0,0.0,0.0,0.0,0,0,0.0,0,0,0,0,0,0,0,0,0,0,0,0,0,Unknown,1,1,,,,,0.0,0,78,,1
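
The stored procedure call above interpolates the client id into the SQL text with an f-string. pandas' `read_sql` also accepts a `params` argument, and pyodbc uses `?` placeholders, so the id could instead be bound as a parameter. The sketch below illustrates the idea against an in-memory SQLite database, which uses the same `?` placeholder style. The `retention` table and its rows are made up for illustration and are not the stored procedure's real output.

```python
import sqlite3

import pandas as pd

# In-memory SQLite database standing in for the SQL Server connection (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE retention (lkupClientId INTEGER, year TEXT, totalSpent REAL)"
)
conn.executemany(
    "INSERT INTO retention VALUES (?, ?, ?)",
    [(27, "2014", 1656.0), (27, "2015", 900.0), (99, "2014", 120.0)],
)

lkupclientid = 27  # grizzlies

# The client id is bound to the `?` placeholder rather than formatted into the SQL string.
toy_df = pd.read_sql(
    "SELECT * FROM retention WHERE lkupClientId = ?",
    conn,
    params=(lkupclientid,),
)

# Same transformation as in the notebook: make sure `year` is numeric.
toy_df["year"] = pd.to_numeric(toy_df["year"])

conn.close()
```

With the notebook's connection, the analogous call would presumably be `pd.read_sql("Exec [stlrMILB].[ds].[getRetentionScoringModelData] ?", CNXN, params=[lkupclientid])`, though that has not been run here.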


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6328 entries, 0 to 6327
Data columns (total 54 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   lkupClientId            6328 non-null   int64  
 1   dimCustomerMasterId     6328 non-null   int64  
 2   customerNumber          6328 non-null   object 
 3   year                    6328 non-null   int64  
 4   productGrouping         6328 non-null   object 
 5   totalSpent              6328 non-null   float64
 6   recentDate              6328 non-null   object 
 7   attendancePercent       6328 non-null   float64
 8   renewedBeforeDays       6292 non-null   float64
 9   isBuyer                 6328 non-null   object 
 10  source_tenure           6328 non-null   int64  
 11  tenure                  6292 non-null   float64
 12  distToVenue             6328 non-null   float64
 13  totalGames              6328 non-null   int64  
 14  recency                 6328 non-null   

### Let's also drop the features that only have a single value, as they won't add much differentiation to our model:

In [5]:
for col in df.columns:
    if len(df[col].unique()) == 1:
        df.drop(col,inplace=True,axis=1)
        
df.shape

(6328, 19)
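
As a rough illustration of the step above, the same filter can be written without a loop. This is a minimal sketch on a made-up four-row frame (the column values are hypothetical). It uses `nunique(dropna=False)` so that missing values count as a value, which mirrors how `len(df[col].unique())` behaves: a column that is entirely missing has one unique value and is dropped too.

```python
import pandas as pd

nan = float("nan")

# Toy frame (hypothetical values) with one constant and one all-missing column.
toy = pd.DataFrame(
    {
        "lkupClientId": [27, 27, 27, 27],            # constant -> dropped
        "totalSpent": [1656.0, 900.0, 1584.0, 1200.0],
        "CNTPostponedGames": [nan, nan, nan, nan],   # only NaN -> one unique value -> dropped
        "tenure": [177.0, nan, 257.0, 177.0],        # 177, NaN, 257 -> three values -> kept
    }
)

# Count distinct values per column, treating NaN as a value.
n_distinct = toy.nunique(dropna=False)

# Columns with exactly one distinct value carry no information for the model.
constant_cols = n_distinct[n_distinct == 1].index.tolist()
toy_reduced = toy.drop(columns=constant_cols)
```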

### We should also drop features that have a low correlation with the target label, as they won't be useful for prediction. We'll only keep features whose absolute correlation with the target is above a set threshold:

In [6]:
cor = df.corr()

threshold = 0.05

#Correlation with output variable
cor_target = abs(cor["isNextYear_Buyer"])

#Selecting highly correlated features
relevant_features = cor_target[cor_target > threshold]

# names of the features that passed the threshold (the target itself is included)
feats = relevant_features.index.tolist()

df_correlated = df[feats]

df_correlated.shape

df_correlated.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6328 entries, 0 to 6327
Data columns (total 12 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   dimCustomerMasterId  6328 non-null   int64  
 1   year                 6328 non-null   int64  
 2   totalSpent           6328 non-null   float64
 3   attendancePercent    6328 non-null   float64
 4   source_tenure        6328 non-null   int64  
 5   tenure               6292 non-null   float64
 6   totalGames           6328 non-null   int64  
 7   missed_games_1       6328 non-null   int64  
 8   missed_games_2       6328 non-null   int64  
 9   missed_games_over_2  6328 non-null   int64  
 10  is_Lockdown          6328 non-null   int64  
 11  isNextYear_Buyer     6328 non-null   int64  
dtypes: float64(3), int64(9)
memory usage: 593.4 KB
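
The correlation filter can be wrapped in a small helper so the threshold is easy to vary. This is a sketch under assumptions: the function name `select_by_target_correlation` and the toy data are hypothetical. It restricts the frame to numeric columns explicitly, because recent pandas versions no longer drop non-numeric columns from `corr()` silently.

```python
import pandas as pd


def select_by_target_correlation(frame, target, threshold=0.05):
    """Return numeric columns whose |Pearson r| with `target` exceeds `threshold`."""
    numeric = frame.select_dtypes(include="number")
    cor_target = numeric.corr()[target].abs()
    return cor_target[cor_target > threshold].index.tolist()


# Toy data (hypothetical): `noise` is built to have zero correlation with the target.
toy = pd.DataFrame(
    {
        "totalGames": [9, 28, 66, 1, 9, 40],
        "noise": [1, 2, 3, 2, 3, 1],
        "productGrouping": ["a", "b", "c", "d", "e", "f"],  # non-numeric, ignored
        "isNextYear_Buyer": [0, 1, 1, 0, 0, 1],
    }
)

selected = select_by_target_correlation(toy, "isNextYear_Buyer", threshold=0.05)
```

As in the notebook, the target column correlates perfectly with itself and therefore stays in the selected list.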


### Now that we have the right features, we can look at the correlations between them. If features are highly correlated with each other, it might negatively impact the model:

In [7]:
corr = df_correlated.corr()
corr.style.background_gradient(cmap='coolwarm').set_precision(2)

feature,dimCustomerMasterId,year,totalSpent,attendancePercent,source_tenure,tenure,totalGames,missed_games_1,missed_games_2,missed_games_over_2,is_Lockdown,isNextYear_Buyer
dimCustomerMasterId,1.0,-0.34,-0.05,0.09,0.53,0.19,0.08,0.07,0.06,0.1,-0.21,0.14
year,-0.34,1.0,0.1,-0.22,-0.26,0.64,-0.19,-0.15,-0.13,-0.2,0.54,-0.21
totalSpent,-0.05,0.1,1.0,-0.15,-0.02,0.15,0.4,0.1,0.16,0.39,0.1,0.14
attendancePercent,0.09,-0.22,-0.15,1.0,0.08,-0.26,0.32,0.35,0.2,-0.16,-0.35,0.12
source_tenure,0.53,-0.26,-0.02,0.08,1.0,0.14,0.09,0.07,0.06,0.08,-0.18,0.13
tenure,0.19,0.64,0.15,-0.26,0.14,1.0,0.01,-0.01,-0.0,0.01,0.4,0.05
totalGames,0.08,-0.19,0.4,0.32,0.09,0.01,1.0,0.68,0.57,0.54,-0.24,0.38
missed_games_1,0.07,-0.15,0.1,0.35,0.07,-0.01,0.68,1.0,0.52,0.26,-0.19,0.27
missed_games_2,0.06,-0.13,0.16,0.2,0.06,-0.0,0.57,0.52,1.0,0.35,-0.18,0.27
missed_games_over_2,0.1,-0.2,0.39,-0.16,0.08,0.01,0.54,0.26,0.35,1.0,-0.28,0.34
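
Reading a heat map by eye gets harder as the feature count grows, so it can help to list the strongly correlated pairs directly. This is a minimal sketch: the helper name `correlated_pairs` and the 0.6 cut-off are assumptions for illustration, and the matrix is a five-feature subset of the values printed above.

```python
import pandas as pd


def correlated_pairs(corr, threshold=0.6):
    """List (feature_a, feature_b, r) for each distinct pair with |r| >= threshold."""
    cols = list(corr.columns)
    pairs = []
    for i, a in enumerate(cols):
        # Only look at columns after `a`, so each pair appears once
        # and the diagonal (a feature with itself) is skipped.
        for b in cols[i + 1:]:
            r = float(corr.loc[a, b])
            if abs(r) >= threshold:
                pairs.append((a, b, r))
    return pairs


# Subset of the correlation matrix shown above.
cols = ["year", "tenure", "totalGames", "missed_games_1", "missed_games_2"]
values = [
    [1.00, 0.64, -0.19, -0.15, -0.13],
    [0.64, 1.00, 0.01, -0.01, 0.00],
    [-0.19, 0.01, 1.00, 0.68, 0.57],
    [-0.15, -0.01, 0.68, 1.00, 0.52],
    [-0.13, 0.00, 0.57, 0.52, 1.00],
]
corr_subset = pd.DataFrame(values, index=cols, columns=cols)

strong_pairs = correlated_pairs(corr_subset, threshold=0.6)
```

With this cut-off, the pairs flagged in the subset are year/tenure (0.64) and totalGames/missed_games_1 (0.68).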


### To compare two sets of features, we need to create some datasets for training and evaluation:


In [8]:
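# select 80% of the data for training (keep the original index for now)
df_train = df_correlated.sample(frac=0.8, random_state=786)

# hold out the remaining 20% as unseen evaluation data;
# this must happen before df_train's index is reset, otherwise
# drop() removes the first 5062 rows instead of the sampled ones
df_eval = df_correlated.drop(df_train.index).reset_index(drop=True)

# now it is safe to reset the training index
df_train = df_train.reset_index(drop=True)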

# print out the number of records for training and eval
print('Data for Modeling: ' + str(df_train.shape))
print('Unseen Data For Predictions: ' + str(df_eval.shape), end="\n\n")

Data for Modeling: (5062, 12)
Unseen Data For Predictions: (1266, 12)
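
Because `drop()` selects rows by index label, the hold-out set is only disjoint from the training sample if the sampled frame still carries its original index when `drop()` is called. This is a small sketch of a sanity check on a made-up ten-row frame. The `row_id` column is hypothetical and exists only so overlap can be detected after the indexes are reset.

```python
import pandas as pd

# Toy frame with an explicit identifier so rows can be traced after reset_index.
toy = pd.DataFrame({"row_id": range(100, 110)})

# Sample 80% for training, keeping the original index labels.
train = toy.sample(frac=0.8, random_state=786)

# Rows whose labels were not sampled become the evaluation set.
evaluation = toy.drop(train.index)

# Only now reset both indexes.
train = train.reset_index(drop=True)
evaluation = evaluation.reset_index(drop=True)

# No row should appear in both sets, and together they should cover everything.
overlap = set(train["row_id"]) & set(evaluation["row_id"])
covered = set(train["row_id"]) | set(evaluation["row_id"])
```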



### Now we can model the data as a binary classification problem on the isNextYear_Buyer field to see how likely a customer is to re-purchase.

In [11]:
setup(
    data= df_train, 
    target="isNextYear_Buyer", 
    train_size = 0.80,
    data_split_shuffle=True,
    silent=True,
    numeric_features=[
        "totalSpent",
        "attendancePercent",
        "source_tenure",
        "tenure",
        "totalGames",
        "missed_games_1",
        "missed_games_2",
        "missed_games_over_2",
        "is_Lockdown"
    ]
)

index,Description,Value
0,session_id,8651
1,Target,isNextYear_Buyer
2,Target Type,Binary
3,Label Encoded,"0: 0, 1: 1"
4,Original Data,"(5062, 12)"
5,Missing Values,True
6,Numeric Features,10
7,Categorical Features,1
8,Ordinal Features,False
9,High Cardinality Features,False


('ca90',
 -1,
 'lightgbm',
 None,
       dimCustomerMasterId   totalSpent  attendancePercent  source_tenure  \
 2601          352031968.0   900.000000           0.375000         1460.0   
 497           352023808.0   900.000000           0.157143         1460.0   
 4941          352018400.0  3600.000000           0.117647         1460.0   
 3757          352027488.0  1060.000000           0.096154         1460.0   
 3561          323099840.0  1960.000000           0.000000         1460.0   
 ...                   ...          ...                ...            ...   
 62            352019136.0   748.000000           0.125000         1460.0   
 4316          352033696.0    75.400002           1.000000         1460.0   
 3928          352034368.0   270.000000           0.285714         1460.0   
 1814          352033760.0    80.000000           0.875000         1460.0   
 4021          352032704.0  1296.000000           0.119403         1460.0   
 
       tenure  totalGames  missed_games_

In [12]:
# compare_models cross-validates each listed model and returns the best-performing estimator
model_matrix = compare_models(
    fold=10,
    include=["lr", "xgboost"]
)

ID,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
xgboost,Extreme Gradient Boosting,0.7256,0.8077,0.7509,0.7155,0.7326,0.4512,0.4519,0.383
lr,Logistic Regression,0.5009,0.5347,1.0,0.5009,0.6674,0.0,0.0,0.313
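
The logistic regression row has a recognizable shape: recall of 1.0, precision equal to accuracy, and Kappa and MCC of 0. That pattern is consistent with a model that labels every customer as a buyer, which would be worth confirming from its predictions. The sketch below computes the same metrics from a confusion matrix for such a classifier. The counts (501 buyers out of 1000) are hypothetical and chosen only to resemble a roughly balanced target.

```python
import math


def binary_metrics(tp, fp, fn, tn):
    """Compute common binary classification metrics from confusion-matrix counts."""
    n = tp + fp + fn + tn
    accuracy = (tp + tn) / n
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )

    # Cohen's kappa: observed agreement compared with agreement expected by chance.
    p_yes = ((tp + fp) / n) * ((tp + fn) / n)
    p_no = ((fn + tn) / n) * ((fp + tn) / n)
    p_e = p_yes + p_no
    kappa = (accuracy - p_e) / (1 - p_e) if p_e != 1 else 0.0

    # Matthews correlation coefficient; defined as 0 here when the denominator is 0.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0

    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "kappa": kappa,
        "mcc": mcc,
    }


# Hypothetical counts: every row predicted as a buyer, so fn = tn = 0.
always_buyer = binary_metrics(tp=501, fp=499, fn=0, tn=0)
```

For these counts, accuracy and precision both equal the share of buyers (0.501), recall is 1.0, and Kappa and MCC are 0, matching the shape of the row above.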


In [13]:
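# model_matrix holds the best estimator returned by compare_models; re-fit it with cross-validation
best_model = create_model(model_matrix)
# finalize_model re-trains the model on the full dataset, including PyCaret's hold-out split
final_model = finalize_model(best_model)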

Fold,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,0.7457,0.8173,0.798,0.7232,0.7588,0.4912,0.4939
1,0.7259,0.8007,0.7291,0.7255,0.7273,0.4518,0.4518
2,0.7086,0.7991,0.7438,0.6959,0.719,0.4172,0.4182
3,0.7185,0.8074,0.7291,0.715,0.722,0.437,0.4371
4,0.6914,0.7749,0.734,0.6773,0.7045,0.3826,0.3839
5,0.7654,0.844,0.7833,0.7571,0.77,0.5308,0.5311
6,0.7506,0.8315,0.7882,0.7339,0.7601,0.5011,0.5025
7,0.763,0.8373,0.7833,0.7536,0.7681,0.5259,0.5263
8,0.6716,0.7637,0.6881,0.6651,0.6764,0.3433,0.3435
9,0.7153,0.8009,0.7327,0.7081,0.7202,0.4307,0.431
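
To summarize the spread across folds, the per-fold scores above can be averaged directly. This short sketch uses only the Accuracy and AUC columns copied from the table; the variable names are illustrative.

```python
from statistics import mean, stdev

# Per-fold scores copied from the cross-validation table above (folds 0 to 9).
fold_accuracy = [0.7457, 0.7259, 0.7086, 0.7185, 0.6914,
                 0.7654, 0.7506, 0.7630, 0.6716, 0.7153]
fold_auc = [0.8173, 0.8007, 0.7991, 0.8074, 0.7749,
            0.8440, 0.8315, 0.8373, 0.7637, 0.8009]

mean_accuracy = mean(fold_accuracy)
mean_auc = mean(fold_auc)

# Sample standard deviation gives a rough sense of fold-to-fold variability.
std_accuracy = stdev(fold_accuracy)
std_auc = stdev(fold_auc)

# Best and worst folds by accuracy.
accuracy_range = (min(fold_accuracy), max(fold_accuracy))
```

The fold means work out to 0.7256 for accuracy and 0.8077 for AUC, which agree with the xgboost row reported by `compare_models`. Fold accuracy ranges from 0.6716 to 0.7654.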




## Observations
Here you can document some ideas on the results above.

## Conclusions
Here you can talk about next steps, did the experiment work? If yes, what to do next? If no, why?