# rays - Extended Feature Selection
* StellarAlgo Data Science
* Ryan Kazmerik & Nakisa Rad
* Feb 22, 2022

## Hypothesis
Feature selection and feature engineering are two of the most important components of a machine learning model. Our hypothesis is that adding more features to the StellarAlgo retention model could improve its performance. This notebook tests the standard StellarAlgo retention model features.

## Experiment
This section details our experiment including querying data, data transformations, feature selection and modelling.

In [1]:
import getpass
import pyodbc
import pandas as pd
import warnings
import matplotlib.pyplot as plt

from pycaret.classification import *

warnings.filterwarnings('ignore')

### Let's connect to MSSQL and run a stored proc to get our dataset:

In [4]:
# connect to SQL Server.
SERVER = '54.164.224.129'  
DATABASE = 'stlrRays' 
USERNAME = 'nrad' 
PASSWORD = getpass.getpass(prompt='Enter your password')
CNXN = pyodbc.connect('DRIVER={ODBC Driver 17 for SQL Server};SERVER='+SERVER+';DATABASE='+DATABASE+';UID='+USERNAME+';PWD='+ PASSWORD)

In [5]:
lkupclientid = 45 # rays

cursor = CNXN.cursor()

storedProc = (
    f"""Exec [stlrRays].[ds].[getRetentionScoringModelData] {lkupclientid}"""
)

df = pd.read_sql(storedProc, CNXN)

# apply some data transformations
df["year"] = pd.to_numeric(df["year"])

CNXN.commit()
cursor.close()

df.head()
#df.info()

Unnamed: 0,lkupClientId,dimCustomerMasterId,customerNumber,year,productGrouping,totalSpent,recentDate,attendancePercent,renewedBeforeDays,isBuyer,source_tenure,tenure,distToVenue,totalGames,recency,missed_games_1,missed_games_2,missed_games_over_2,click_link,fill_out_form,open_email,send_email,unsubscribe_email,openToSendRatio,clickToSendRatio,clickToOpenRatio,posting_records,resale_records,resale_atp,forward_records,cancel_records,email,inbound_email,inbound_phonecall,inperson_contact,internal_note,left_message,outbound_email,outbound_phonecall,phonecall,text,unknown,gender,childrenPresentInHH,maritalStatus,lengthOfResidenceInYrs,annualHHIncome,education,urbanicity,credits_after_refund,is_Lockdown,NumberofGamesPerSeason,CNTPostponedGames,isNextYear_Buyer
0,45,28367424,123987,2016,Full Season,19220.0,2016-09-25,0.796875,81,True,9855,256,3.39,70,0,9,6,2,0,0,0,0,0,0.0,0.0,0.0,0,0,0.0,0,0,0,0,0,0,0,0,0,0,0,0,0,Unknown,1,1,,,,,0.0,0,81,,1
1,45,28367705,312728,2016,Full Season,14226.0,2016-09-25,0.825,81,True,6205,256,6.02,67,0,13,0,1,0,0,0,0,0,0.0,0.0,0.0,0,0,0.0,0,0,0,0,0,0,0,0,0,1,1,0,0,Unknown,1,1,,,,,0.0,0,81,,1
2,45,28368680,9196196,2016,Full Season,7124.0,2016-09-25,0.63125,81,True,3285,256,24.14,51,0,12,4,2,0,0,0,0,0,0.0,0.0,0.0,0,0,0.0,0,0,0,0,0,0,0,0,0,2,2,0,0,Unknown,1,1,,,,,0.0,0,81,,0
3,45,28368997,100271479,2016,Full Season,7124.0,2016-09-23,0.5625,81,True,1825,256,18.33,45,1,10,5,3,0,0,0,0,0,0.0,0.0,0.0,0,0,0.0,0,0,0,0,0,0,0,0,0,2,2,0,0,Unknown,1,1,,,,,0.0,0,81,,0
4,45,28375130,3240689,2016,Full Season,2408.0,2016-09-25,0.525,81,True,5110,256,72.53,33,1,7,5,4,0,0,0,0,0,0.0,0.0,0.0,0,0,0.0,0,0,0,0,0,0,0,0,0,0,0,0,0,Unknown,1,1,,,,,0.0,0,81,,0


In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10233 entries, 0 to 10232
Data columns (total 54 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   lkupClientId            10233 non-null  int64  
 1   dimCustomerMasterId     10233 non-null  int64  
 2   customerNumber          10231 non-null  object 
 3   year                    10233 non-null  int64  
 4   productGrouping         10233 non-null  object 
 5   totalSpent              10233 non-null  float64
 6   recentDate              10233 non-null  object 
 7   attendancePercent       10233 non-null  float64
 8   renewedBeforeDays       10233 non-null  int64  
 9   isBuyer                 10233 non-null  object 
 10  source_tenure           10233 non-null  int64  
 11  tenure                  10233 non-null  int64  
 12  distToVenue             10233 non-null  float64
 13  totalGames              10233 non-null  int64  
 14  recency                 10233 non-null

### Let's also drop the features that only have a single value, as they won't add much differentiation to our model:

In [7]:
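# collect columns that hold a single value (NaN counts as a value, as it does with unique())
constant_cols = [col for col in df.columns if df[col].nunique(dropna=False) == 1]

# drop them in one call instead of mutating df while iterating over its columns
df = df.drop(columns=constant_cols)
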
df.shape

(10233, 31)

### We should also drop features that have a low correlation with the target label, as they won't be useful for prediction. We'll only keep features whose absolute correlation is above a set threshold:

In [8]:
# pairwise correlations between the columns
cor = df.corr()

threshold = 0.05

# absolute correlation of each feature with the target
cor_target = cor["isNextYear_Buyer"].abs()

# keep features whose absolute correlation exceeds the threshold
# (the target itself is kept, since its correlation with itself is 1)
relevant_features = cor_target[cor_target > threshold]
feats = relevant_features.index.tolist()

df_correlated = df[feats]

df_correlated.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10233 entries, 0 to 10232
Data columns (total 21 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   dimCustomerMasterId     10233 non-null  int64  
 1   year                    10233 non-null  int64  
 2   totalSpent              10233 non-null  float64
 3   attendancePercent       10233 non-null  float64
 4   renewedBeforeDays       10233 non-null  int64  
 5   source_tenure           10233 non-null  int64  
 6   tenure                  10233 non-null  int64  
 7   distToVenue             10233 non-null  float64
 8   totalGames              10233 non-null  int64  
 9   missed_games_1          10233 non-null  int64  
 10  missed_games_2          10233 non-null  int64  
 11  missed_games_over_2     10233 non-null  int64  
 12  click_link              10233 non-null  int64  
 13  open_email              10233 non-null  int64  
 14  send_email              10233 non-null
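
The thresholding step can be checked on a small toy frame. The sketch below is illustrative only: the column names (`strong`, `flat`, `negative`, `target`) and the helper `select_by_target_correlation` are hypothetical and not part of the retention dataset. It selects numeric columns explicitly, which is an assumption made here so the call behaves the same across pandas versions.

```python
import pandas as pd


def select_by_target_correlation(frame, target, threshold):
    """Keep columns whose absolute correlation with `target` exceeds `threshold`."""
    # restrict to numeric columns so corr() does not depend on pandas defaults
    numeric = frame.select_dtypes(include="number")
    corr_with_target = numeric.corr()[target].abs()
    keep = corr_with_target[corr_with_target > threshold].index.tolist()
    return frame[keep]


# hypothetical toy data: "flat" has zero correlation with the target by construction
toy = pd.DataFrame({
    "strong": [1, 2, 3, 4, 5, 6],
    "flat": [1, 2, 3, 1, 2, 3],
    "negative": [6, 5, 4, 3, 2, 1],
    "target": [0, 0, 0, 1, 1, 1],
})

selected = select_by_target_correlation(toy, "target", threshold=0.05)
print(list(selected.columns))
```

Because the comparison uses the absolute value, the negatively correlated column survives the filter alongside the positively correlated one, and the target column is always retained.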

### Now that we have the right features, we can look at the correlations between them. If features are highly correlated with each other, it might negatively impact the model:

In [9]:
corr = df_correlated.corr()
# Styler.set_precision was deprecated in pandas 1.3; format(precision=...) is its replacement
corr.style.background_gradient(cmap='coolwarm').format(precision=2)

Unnamed: 0,dimCustomerMasterId,year,totalSpent,attendancePercent,renewedBeforeDays,source_tenure,tenure,distToVenue,totalGames,missed_games_1,missed_games_2,missed_games_over_2,click_link,open_email,send_email,openToSendRatio,clickToSendRatio,clickToOpenRatio,is_Lockdown,NumberofGamesPerSeason,isNextYear_Buyer
dimCustomerMasterId,1.0,0.21,-0.02,-0.02,0.17,0.16,0.22,-0.06,0.04,0.04,0.01,-0.04,0.15,0.15,0.12,0.14,0.15,0.04,0.14,0.12,0.16
year,0.21,1.0,-0.14,-0.59,0.44,-0.31,0.72,0.05,-0.46,-0.32,-0.33,-0.3,0.47,0.51,0.69,0.43,0.41,0.15,0.84,0.76,-0.39
totalSpent,-0.02,-0.14,1.0,0.08,-0.03,0.24,-0.06,0.09,0.44,0.26,0.25,0.28,-0.1,-0.08,-0.12,-0.05,-0.08,-0.04,-0.13,-0.1,0.06
attendancePercent,-0.02,-0.59,0.08,1.0,-0.39,0.27,-0.35,-0.1,0.59,0.35,0.19,-0.07,-0.2,-0.26,-0.36,-0.17,-0.13,-0.06,-0.71,-0.59,0.24
renewedBeforeDays,0.17,0.44,-0.03,-0.39,1.0,0.0,0.42,-0.01,-0.15,-0.08,-0.11,-0.11,0.25,0.29,0.34,0.23,0.21,0.04,0.39,0.29,-0.09
source_tenure,0.16,-0.31,0.24,0.27,0.0,1.0,0.11,-0.12,0.47,0.41,0.35,0.27,-0.08,-0.08,-0.21,-0.05,-0.06,-0.04,-0.31,-0.27,0.22
tenure,0.22,0.72,-0.06,-0.35,0.42,0.11,1.0,-0.07,-0.26,-0.17,-0.18,-0.16,0.45,0.46,0.59,0.36,0.38,0.14,0.54,0.5,-0.24
distToVenue,-0.06,0.05,0.09,-0.1,-0.01,-0.12,-0.07,1.0,-0.03,-0.06,-0.04,0.02,-0.04,-0.02,-0.01,-0.01,-0.03,-0.01,0.07,0.07,-0.06
totalGames,0.04,-0.46,0.44,0.59,-0.15,0.47,-0.26,-0.03,1.0,0.75,0.54,0.34,-0.2,-0.22,-0.32,-0.14,-0.15,-0.07,-0.51,-0.43,0.27
missed_games_1,0.04,-0.32,0.26,0.35,-0.08,0.41,-0.17,-0.06,0.75,1.0,0.57,0.27,-0.18,-0.18,-0.24,-0.11,-0.13,-0.05,-0.34,-0.28,0.22
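
A heatmap like this is easier to act on when the strongly correlated pairs are listed explicitly (for instance, `year` and `is_Lockdown` show 0.84 in the matrix above). The following is a minimal sketch on toy data; the helper `correlated_pairs`, the cutoff of 0.8, and the columns `a`, `b`, `c` are assumptions chosen for illustration, not values used by this notebook.

```python
import numpy as np
import pandas as pd


def correlated_pairs(frame, cutoff):
    """Return feature pairs whose absolute correlation exceeds `cutoff`."""
    corr = frame.corr().abs()
    # keep the strict upper triangle so each pair is listed once and the diagonal is skipped
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    # NaN entries (masked cells) never pass the comparison, so they are filtered out here
    return pairs[pairs > cutoff].sort_values(ascending=False)


# hypothetical toy data: b is a multiple of a, c is only weakly related to either
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 10],
    "c": [5, 1, 4, 2, 3],
})

pairs = correlated_pairs(toy, cutoff=0.8)
print(pairs)
```

Whether to drop one feature from each flagged pair is a modelling decision; tree-based models such as XGBoost are generally less sensitive to collinearity than logistic regression.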


### In order to compare two sets of features, we need to create some datasets for training and evaluation:


In [10]:
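# sample 80% of the data for training (keep the original index for now)
df_train = df_correlated.sample(frac=0.8, random_state=786)

# hold out the remaining rows for evaluation; drop() matches on index labels,
# so it must run before the training index is reset
df_eval = df_correlated.drop(df_train.index).reset_index(drop=True)
df_train = df_train.reset_index(drop=True)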

# print out the number of records for training and eval
print('Data for Modeling: ' + str(df_train.shape))
print('Unseen Data For Predictions: ' + str(df_eval.shape), end="\n\n")

Data for Modeling: (8186, 21)
Unseen Data For Predictions: (2047, 21)
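
One detail worth verifying in a split like this is that the held-out rows are truly disjoint from the training rows. `DataFrame.drop` removes rows by index label, so it has to use the sampled rows' original labels; if the training frame's index has already been reset, the labels no longer identify the sampled rows. A minimal sketch on toy data, where the helper `split_train_eval` and the `row_id` column are hypothetical:

```python
import pandas as pd


def split_train_eval(frame, frac, seed):
    """Randomly split `frame` into disjoint train and eval frames."""
    train = frame.sample(frac=frac, random_state=seed)
    # drop by the ORIGINAL index labels, then reset both indexes
    held_out = frame.drop(train.index)
    return train.reset_index(drop=True), held_out.reset_index(drop=True)


# hypothetical toy data with a row identifier to track membership
toy = pd.DataFrame({"row_id": range(10)})
train, held_out = split_train_eval(toy, frac=0.8, seed=786)

train_ids = set(train["row_id"])
eval_ids = set(held_out["row_id"])
print(len(train_ids), len(eval_ids), train_ids & eval_ids)
```

An empty intersection confirms that no evaluation row was also used for training.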



### Now we can model the data using binary classification on the isNextYear_Buyer field to see how likely a customer is to re-purchase:

In [11]:
setup(
    data= df_train, 
    target="isNextYear_Buyer", 
    train_size = 0.80,
    data_split_shuffle=True,
    silent=True,
    numeric_features=[
        "totalSpent",
        "attendancePercent",
        "renewedBeforeDays",
        "source_tenure",
        "tenure",
        "distToVenue",
        "totalGames",
        "missed_games_1",
        "missed_games_2",
        "missed_games_over_2",
        "click_link",
        "open_email",
        "send_email",
        "openToSendRatio",
        "clickToSendRatio",
        "clickToOpenRatio",
        "is_Lockdown",
        "NumberofGamesPerSeason"
    ]
)

Unnamed: 0,Description,Value
0,session_id,8735
1,Target,isNextYear_Buyer
2,Target Type,Binary
3,Label Encoded,"0: 0, 1: 1"
4,Original Data,"(8186, 21)"
5,Missing Values,False
6,Numeric Features,19
7,Categorical Features,1
8,Ordinal Features,False
9,High Cardinality Features,False


(None,
 'a311',
       dimCustomerMasterId  year  totalSpent  attendancePercent  \
 0                46265196  2017      1851.5           0.650000   
 1                46258269  2019      5610.0           0.762500   
 2                46253086  2021       500.0           0.000000   
 3                46258125  2019      1627.5           0.736842   
 4                46258280  2017      3071.0           0.303797   
 ...                   ...   ...         ...                ...   
 8181             46258779  2017      6749.5           0.092105   
 8182             28588117  2017      4652.0           0.881579   
 8183             46241976  2018      1371.0           0.954545   
 8184             46258622  2017       810.5           0.904762   
 8185             46241404  2016      2994.0           0.545455   
 
       renewedBeforeDays  source_tenure  tenure  distToVenue  totalGames  \
 0                   124           9855     614        36.53          13   
 1                   170  

In [12]:
model_matrix = compare_models(
    fold=10,
    include=["lr", "xgboost"]
)

Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
xgboost,Extreme Gradient Boosting,0.8488,0.9281,0.8593,0.8488,0.8539,0.6973,0.6975,0.598
lr,Logistic Regression,0.5371,0.6112,0.9401,0.5362,0.6754,0.0506,0.051,0.29


In [13]:
best_model = create_model(model_matrix)
final_model = finalize_model(best_model)

Unnamed: 0,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,0.8351,0.9234,0.8427,0.8378,0.8402,0.6699,0.6699
1,0.8489,0.9253,0.8427,0.8606,0.8516,0.6976,0.6978
2,0.8519,0.9323,0.8516,0.8593,0.8554,0.7036,0.7037
3,0.8458,0.9272,0.8368,0.8598,0.8481,0.6916,0.6919
4,0.8198,0.9128,0.8457,0.812,0.8285,0.6389,0.6395
5,0.8733,0.9395,0.8932,0.8649,0.8788,0.7461,0.7465
6,0.8595,0.9407,0.8813,0.851,0.8659,0.7185,0.719
7,0.858,0.93,0.8813,0.8486,0.8646,0.7155,0.716
8,0.8547,0.9294,0.869,0.8513,0.8601,0.7091,0.7093
9,0.841,0.9202,0.8482,0.8432,0.8457,0.6817,0.6817
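
The per-fold table above has no summary row, so the fold averages can be recomputed by hand as a sanity check against the `xgboost` row reported by `compare_models`. This sketch simply copies the Accuracy and AUC columns from the table; the variable names are illustrative.

```python
from statistics import mean, stdev

# per-fold values copied from the cross-validation table above
fold_accuracy = [0.8351, 0.8489, 0.8519, 0.8458, 0.8198,
                 0.8733, 0.8595, 0.858, 0.8547, 0.841]
fold_auc = [0.9234, 0.9253, 0.9323, 0.9272, 0.9128,
            0.9395, 0.9407, 0.93, 0.9294, 0.9202]

mean_accuracy = round(mean(fold_accuracy), 4)
mean_auc = round(mean(fold_auc), 4)

# sample standard deviation across folds gives a rough sense of stability
print("Accuracy:", mean_accuracy, "+/-", round(stdev(fold_accuracy), 4))
print("AUC:", mean_auc, "+/-", round(stdev(fold_auc), 4))
```

The rounded means agree with the 0.8488 accuracy and 0.9281 AUC shown for XGBoost in the model comparison.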




## Observations
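In 10-fold cross-validation on the training split, XGBoost clearly outperformed logistic regression: accuracy 0.8488 vs. 0.5371 and AUC 0.9281 vs. 0.6112. Logistic regression reached a high recall (0.9401) but a precision of only 0.5362 and a Kappa of 0.0506, which suggests it predicts the positive class for most customers. Across the XGBoost folds, accuracy ranged from 0.8198 to 0.8733 and AUC from 0.9128 to 0.9407.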

## Conclusions
Here you can talk about next steps, did the experiment work? If yes, what to do next? If no, why?