# galaxy - Extended Feature Selection
* StellarAlgo Data Science
* Ryan Kazmerik & Nakisa Rad
* Feb 22, 2022

## Hypothesis
Feature selection and feature engineering are two of the most important components of a machine learning model. Our hypothesis is that adding more features to the StellarAlgo retention model could improve its performance. This notebook tests the standard StellarAlgo retention model features.

## Experiment
This section details our experiment, including querying the data, data transformations, feature selection, and modelling.

In [1]:
import getpass
import pyodbc
import pandas as pd
import warnings
import matplotlib.pyplot as plt

from pycaret.classification import *

warnings.filterwarnings('ignore')

### Let's connect to MSSQL and run a stored proc to get our dataset:

In [None]:
# connect to SQL Server.
SERVER = '34.206.73.189' 
DATABASE = 'datascience' 
USERNAME = 'nrad' 
PASSWORD = getpass.getpass(prompt='Enter your password')
CNXN = pyodbc.connect('DRIVER={ODBC Driver 17 for SQL Server};SERVER='+SERVER+';DATABASE='+DATABASE+';UID='+USERNAME+';PWD='+ PASSWORD)

In [3]:
lkupclientid = 6 # galaxy

# stored procedure that returns the retention scoring dataset for this client
storedProc = (
    f"""Exec [stlrMLS].[ds].[getRetentionScoringModelData] {lkupclientid}"""
)

# pandas runs the query on the open connection, so no separate cursor is needed
df = pd.read_sql(storedProc, CNXN)

# apply some data transformations
df["year"] = pd.to_numeric(df["year"])

CNXN.commit()

df.head()

Unnamed: 0,lkupClientId,dimCustomerMasterId,customerNumber,year,productGrouping,totalSpent,recentDate,attendancePercent,renewedBeforeDays,isBuyer,source_tenure,tenure,distToVenue,totalGames,recency,missed_games_1,missed_games_2,missed_games_over_2,click_link,fill_out_form,open_email,send_email,unsubscribe_email,openToSendRatio,clickToSendRatio,clickToOpenRatio,posting_records,resale_records,resale_atp,forward_records,cancel_records,email,inbound_email,inbound_phonecall,inperson_contact,internal_note,left_message,outbound_email,outbound_phonecall,phonecall,text,unknown,gender,childrenPresentInHH,maritalStatus,lengthOfResidenceInYrs,annualHHIncome,education,urbanicity,credits_after_refund,is_Lockdown,NumberofGamesPerSeason,CNTPostponedGames,isNextYear_Buyer
0,6,450019520,8034385,2016,Full Season,425.0,2016-10-23,0.75,157.0,True,388.0,388.0,4.59,12,0,3,1,0,0,0,0,0,0,0.0,0.0,0.0,0,0,0.0,0,0,0,0,0,0,0,0,0,0,0,0,0,F,1,1,,99999,Completed College,,0.0,0,17,,1
1,6,450019565,9228651,2016,Full Season,1428.0,2016-10-23,1.015625,156.0,True,387.0,387.0,63.61,17,0,1,0,0,0,0,0,0,0,0.0,0.0,0.0,0,0,0.0,0,0,0,0,0,0,0,0,0,0,0,0,0,Unknown,1,1,,149999,Completed High School,,0.0,0,17,,1
2,6,450019578,8108698,2016,Full Season,425.0,2016-10-23,1.0625,199.0,True,430.0,430.0,9.61,15,0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,0,0,0.0,0,0,0,0,0,0,0,0,0,0,0,0,0,Unknown,1,0,,19999,Completed High School,,0.0,0,17,,0
3,6,450019580,8037737,2016,Full Season,2006.0,2016-09-03,0.703125,195.0,True,426.0,426.0,58.53,12,2,0,1,1,0,0,0,0,0,0.0,0.0,0.0,0,0,0.0,0,0,0,0,0,0,0,0,0,0,0,0,0,F,1,1,,74999,Completed College,,0.0,0,17,,0
4,6,450019638,8268659,2016,Full Season,1207.0,2016-09-11,0.875,194.0,True,425.0,425.0,63.32,14,1,2,0,0,0,0,0,0,0,0.0,0.0,0.0,0,0,0.0,0,0,0,0,0,0,0,0,0,0,0,0,0,M,1,1,,29999,Completed College,,0.0,0,17,,0


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25439 entries, 0 to 25438
Data columns (total 54 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   lkupClientId            25439 non-null  int64  
 1   dimCustomerMasterId     25439 non-null  int64  
 2   customerNumber          25439 non-null  object 
 3   year                    25439 non-null  int64  
 4   productGrouping         25439 non-null  object 
 5   totalSpent              25439 non-null  float64
 6   recentDate              25439 non-null  object 
 7   attendancePercent       25439 non-null  float64
 8   renewedBeforeDays       25421 non-null  float64
 9   isBuyer                 25439 non-null  object 
 10  source_tenure           25421 non-null  float64
 11  tenure                  25421 non-null  float64
 12  distToVenue             25439 non-null  float64
 13  totalGames              25439 non-null  int64  
 14  recency                 25439 non-null

### Let's also drop the features that only have a single value, as they won't add much differentiation to our model:

In [5]:
# keep only the columns with more than one distinct value (NaN counts as a value)
df = df.loc[:, df.nunique(dropna=False) > 1]

df.shape

(25439, 34)
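
The single-value filter can be illustrated on a small frame. This is only a sketch: the `toy` DataFrame and its values are made up for illustration, and it assumes that an all-missing column should be treated as constant, which is how `unique()` behaves in the cell above.

```python
import pandas as pd

# Toy frame (hypothetical values): one constant column and one all-missing column
toy = pd.DataFrame({
    "lkupClientId": [6, 6, 6],
    "totalSpent": [425.0, 1428.0, 2006.0],
    "urbanicity": [None, None, None],
})

# dropna=False counts a missing value as a value, so an all-missing column
# has exactly one distinct value and is flagged as constant
n_unique = toy.nunique(dropna=False)
constant_cols = n_unique[n_unique == 1].index.tolist()

toy_reduced = toy.drop(columns=constant_cols)

print(constant_cols)
print(toy_reduced.shape)
```

Listing the dropped column names before removing them makes it easier to check that nothing useful was discarded.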

### We should also drop features that have a low correlation with the target label, as they won't be useful for prediction. We'll keep only the features whose absolute correlation is above a set threshold:

In [6]:
cor = df.corr()

threshold = 0.05

# absolute correlation of each numeric feature with the target label
cor_target = cor["isNextYear_Buyer"].abs()

# keep the features whose absolute correlation exceeds the threshold
# (the target itself is kept, since its correlation with itself is 1)
relevant_features = cor_target[cor_target > threshold]
feats = relevant_features.index.tolist()

df_correlated = df[feats]

df_correlated.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25439 entries, 0 to 25438
Data columns (total 14 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   totalSpent         25439 non-null  float64
 1   renewedBeforeDays  25421 non-null  float64
 2   source_tenure      25421 non-null  float64
 3   tenure             25421 non-null  float64
 4   totalGames         25439 non-null  int64  
 5   recency            25439 non-null  int64  
 6   missed_games_1     25439 non-null  int64  
 7   missed_games_2     25439 non-null  int64  
 8   click_link         25439 non-null  int64  
 9   open_email         25439 non-null  int64  
 10  send_email         25439 non-null  int64  
 11  openToSendRatio    25439 non-null  float64
 12  clickToSendRatio   25439 non-null  float64
 13  isNextYear_Buyer   25439 non-null  int64  
dtypes: float64(6), int64(8)
memory usage: 2.7 MB
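
To show why the filter uses the absolute value of the correlation, here is a minimal sketch on made-up data. The column names `signal`, `inverse` and `noise` are hypothetical; only the 0.05 threshold is taken from the cell above.

```python
import pandas as pd

# Toy data (hypothetical): one positively related feature, one negatively
# related feature, and one feature that is uncorrelated with the target
toy = pd.DataFrame({
    "signal":  [1, 3, 2, 4, 1, 5, 2, 3],
    "inverse": [5, 1, 4, 2, 5, 1, 4, 2],
    "noise":   [1, 1, -1, -1, 1, 1, -1, -1],
    "target":  [0, 1, 0, 1, 0, 1, 0, 1],
})

threshold = 0.05

# absolute correlation with the target, so strong negative relationships are kept too
cor_target = toy.corr()["target"].abs()
selected = cor_target[cor_target > threshold].index.tolist()

print(cor_target.round(2))
print(selected)
```

Without `.abs()`, the `inverse` column would be dropped even though it separates the two classes well.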


### Now that we have the right features, we can look at the correlations between them. If features are highly correlated with each other, it might negatively impact the model:

In [7]:
corr = df_correlated.corr()
corr.style.background_gradient(cmap='coolwarm').set_precision(2)

Unnamed: 0,totalSpent,renewedBeforeDays,source_tenure,tenure,totalGames,recency,missed_games_1,missed_games_2,click_link,open_email,send_email,openToSendRatio,clickToSendRatio,isNextYear_Buyer
totalSpent,1.0,0.08,0.02,0.02,0.21,-0.0,0.09,0.07,-0.01,-0.01,0.0,-0.04,-0.01,0.07
renewedBeforeDays,0.08,1.0,0.38,0.38,0.38,0.07,0.25,0.17,0.11,0.2,0.22,0.08,0.03,0.44
source_tenure,0.02,0.38,1.0,1.0,-0.0,-0.05,-0.03,-0.01,0.36,0.53,0.61,0.35,0.14,0.2
tenure,0.02,0.38,1.0,1.0,-0.0,-0.05,-0.03,-0.01,0.36,0.53,0.61,0.35,0.14,0.2
totalGames,0.21,0.38,-0.0,-0.0,1.0,-0.1,0.46,0.17,0.04,0.06,0.05,-0.06,-0.01,0.4
recency,-0.0,0.07,-0.05,-0.05,-0.1,1.0,0.07,0.16,-0.11,-0.11,-0.09,-0.12,-0.08,-0.09
missed_games_1,0.09,0.25,-0.03,-0.03,0.46,0.07,1.0,0.18,-0.09,-0.07,-0.06,-0.14,-0.08,0.16
missed_games_2,0.07,0.17,-0.01,-0.01,0.17,0.16,0.18,1.0,-0.09,-0.07,-0.06,-0.11,-0.07,0.06
click_link,-0.01,0.11,0.36,0.36,0.04,-0.11,-0.09,-0.09,1.0,0.68,0.56,0.48,0.61,0.13
open_email,-0.01,0.2,0.53,0.53,0.06,-0.11,-0.07,-0.07,0.68,1.0,0.82,0.71,0.27,0.18


### In order to compare two sets of features, we need to create some datasets for training and evaluation:


In [8]:
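# select 80% of the data for training (keep the original index labels for now)
df_train = df_correlated.sample(frac=0.8, random_state=786)

# hold out the remaining rows for evaluation; dropping by the original index
# labels ensures the eval set does not overlap the training sample
df_eval = df_correlated.drop(df_train.index).reset_index(drop=True)
df_train = df_train.reset_index(drop=True)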

# print out the number of records for training and eval
print('Data for Modeling: ' + str(df_train.shape))
print('Unseen Data For Predictions: ' + str(df_eval.shape), end="\n\n")

Data for Modeling: (20351, 14)
Unseen Data For Predictions: (5088, 14)
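
When splitting with `sample` and `drop`, it is worth checking that the two sets do not share rows. A minimal sketch on a made-up frame follows (the `toy` data and the `x` column are hypothetical). It assumes the split relies on index labels, so the index is reset only after the holdout rows have been selected.

```python
import pandas as pd

# Toy frame (hypothetical) with a unique value per row so overlap is easy to check
toy = pd.DataFrame({"x": range(10), "y": [0, 1] * 5})

# sample first and keep the original index labels
train = toy.sample(frac=0.8, random_state=786)

# drop by those labels to get the rows that were not sampled
holdout = toy.drop(train.index)

# only now is it safe to reset the index on both sets; resetting before the
# drop would make it remove the first rows by position instead
train = train.reset_index(drop=True)
holdout = holdout.reset_index(drop=True)

overlap = set(train["x"]) & set(holdout["x"])
print(len(train), len(holdout), overlap)
```

An empty `overlap` set confirms that the holdout rows were never seen in training.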



### Now we can model the data using a binary classification prediction for the isNextYear_Buyer field to see how likely a customer is to re-purchase:

In [9]:
setup(
    data= df_train, 
    target="isNextYear_Buyer", 
    train_size = 0.80,
    data_split_shuffle=True,
    silent=True,
    numeric_features=[
        "totalSpent",
        "renewedBeforeDays",
        "source_tenure",
        "tenure",
        "recency",
        "totalGames",
        "missed_games_1",
        "missed_games_2",
        "click_link",
        "open_email",
        "send_email",
        "openToSendRatio",
        "clickToSendRatio"
    ]
)

Unnamed: 0,Description,Value
0,session_id,8905
1,Target,isNextYear_Buyer
2,Target Type,Binary
3,Label Encoded,"0: 0, 1: 1"
4,Original Data,"(20351, 14)"
5,Missing Values,True
6,Numeric Features,13
7,Categorical Features,0
8,Ordinal Features,False
9,High Cardinality Features,False
10,High Cardinality Method,None
11,Transformed Train Set,"(16280, 12)"
12,Transformed Test Set,"(4071, 12)"
13,Shuffle Train-Test,True

In [10]:
model_matrix = compare_models(
    fold=10,
    include=["lr", "xgboost"]
)

Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
xgboost,Extreme Gradient Boosting,0.8024,0.8668,0.8625,0.8082,0.8344,0.5901,0.5923,0.828
lr,Logistic Regression,0.7603,0.8051,0.8172,0.7784,0.7972,0.5044,0.5057,0.564


In [11]:
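# re-create the best model from the comparison to get its per-fold scores
best_model = create_model(model_matrix)

# refit the model on the full dataset passed to setup, including its hold-out split
final_model = finalize_model(best_model)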

Unnamed: 0,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,0.8213,0.8812,0.8669,0.8306,0.8484,0.6309,0.6318
1,0.8084,0.8676,0.8605,0.817,0.8382,0.6036,0.6048
2,0.7918,0.8574,0.8456,0.8036,0.8241,0.5694,0.5705
3,0.7942,0.8567,0.8562,0.8008,0.8276,0.5732,0.5751
4,0.8047,0.8722,0.8765,0.8029,0.8381,0.5931,0.5968
5,0.8108,0.876,0.8637,0.8184,0.8404,0.6086,0.6099
6,0.7991,0.8689,0.8713,0.799,0.8336,0.5815,0.585
7,0.7973,0.8634,0.8383,0.8157,0.8269,0.5825,0.5828
8,0.8028,0.865,0.8819,0.7979,0.8378,0.588,0.5928
9,0.7936,0.8597,0.8638,0.7961,0.8286,0.5704,0.5734




## Observations
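XGBoost outperformed logistic regression in the 10-fold comparison (accuracy 0.8024 vs. 0.7603, AUC 0.8668 vs. 0.8051). Of the selected features, renewedBeforeDays (0.44) and totalGames (0.40) have the strongest correlation with isNextYear_Buyer. source_tenure and tenure have a correlation of 1.0 with each other, so one of them is likely redundant.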

## Conclusions
Here you can talk about next steps, did the experiment work? If yes, what to do next? If no, why?