# Yankees - Extended Feature Selection
* StellarAlgo Data Science
* Ryan Kazmerik & Nakisa Rad
* Jan 12, 2022

## Hypothesis
Feature selection and feature engineering are two important components of a machine learning model. Our hypothesis is that adding more features to the StellarAlgo retention model could improve its performance. This notebook tests the standard StellarAlgo retention model features.

## Experiment
This section details our experiment including querying data, data transformations, feature selection and modelling.

In [1]:
import getpass
import pyodbc
import pandas as pd
import warnings
import matplotlib.pyplot as plt

from pycaret.classification import setup, compare_models, create_model, finalize_model

warnings.filterwarnings('ignore')

### Let's connect to MSSQL and run a stored proc to get our dataset:

In [2]:
# Connect to SQL Server. The password is prompted for at run time
# rather than stored in the notebook.
SERVER = '52.44.171.130'
DATABASE = 'datascience'
USERNAME = 'nrad'
PASSWORD = getpass.getpass(prompt='Enter your password: ')
CNXN = pyodbc.connect(
    'DRIVER={ODBC Driver 17 for SQL Server};'
    f'SERVER={SERVER};DATABASE={DATABASE};UID={USERNAME};PWD={PASSWORD}'
)
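
For scheduled or non-interactive runs, one option is to read the credentials from environment variables rather than from the notebook. The sketch below only builds the ODBC connection string. The helper name `build_connection_string`, the variable names `DS_SQL_USER` and `DS_SQL_PASSWORD`, and the demo server name are assumptions for illustration, not part of the original notebook.

```python
import os


def build_connection_string(server, database, env=os.environ):
    """Build a SQL Server ODBC connection string from environment variables.

    DS_SQL_USER and DS_SQL_PASSWORD are hypothetical variable names.
    """
    username = env["DS_SQL_USER"]      # raises KeyError if the variable is unset
    password = env["DS_SQL_PASSWORD"]
    return (
        "DRIVER={ODBC Driver 17 for SQL Server};"
        f"SERVER={server};DATABASE={database};UID={username};PWD={password}"
    )


# demo with a stand-in environment; the result would be passed to pyodbc.connect(...)
demo_env = {"DS_SQL_USER": "analyst", "DS_SQL_PASSWORD": "example-password"}
conn_str = build_connection_string("sql.example.internal", "datascience", env=demo_env)
```

Passing the environment as an argument keeps the helper easy to exercise without touching real credentials.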

In [3]:
lkupclientid = 53 # Yankees

# run the stored procedure and load its result set into a DataFrame
storedProc = (
    f"""Exec [stlrYankees].[ds].[getPropensityEventScoring] {lkupclientid}"""
)

df = pd.read_sql(storedProc, CNXN)

# apply some data transformations
df["year"] = pd.to_numeric(df["year"])

CNXN.commit()

df.head()

Unnamed: 0,lkupClientId,dimCustomerMasterId,dimEventId,inMarket,customerNumber,year,productGrouping,totalSpent,recentDate,attendancePercent,renewedBeforeDays,isBuyer,source_tenure,tenure,distToVenue,totalGames,recency,click_link,fill_out_form,open_email,send_email,unsubscribe_email,openToSendRatio,clickToSendRatio,clickToOpenRatio,posting_records,resale_records,resale_atp,forward_records,cancel_records,email,inbound_email,inbound_phonecall,inperson_contact,internal_note,left_message,outbound_email,outbound_phonecall,phonecall,text,unknown,credits_after_refund,isNextGameBuyer
0,53,13,343,False,23070972,2017,Online Individual Game,0.0,2017-09-28,1.0,41,False,3285,181,30.03,2,0,0,0,0,0,0,0.0,0.0,0.0,0,0,0.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0
1,53,28,319,False,4417707,2017,Online Individual Game,0.0,2017-05-27,1.0,17,False,1825,203,31.29,1,0,0,0,0,0,0,0.0,0.0,0.0,0,0,0.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0
2,53,40,673,False,22768012,2017,Online Individual Game,20.0,2017-10-01,0.0,204,True,3285,221,38.62,10,0,0,0,0,0,0,0.0,0.0,0.0,0,0,0.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0
3,53,48,321,False,19513838,2017,Online Individual Game,12.0,1970-01-01,0.0,167,True,3650,191,29.99,0,0,0,0,0,0,0,0.0,0.0,0.0,0,0,0.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0
4,53,60,309,True,13552174,2017,Online Individual Game,5.01,1970-01-01,0.0,23,True,4380,23,23.37,2,0,0,0,0,0,0,0.0,0.0,0.0,0,0,0.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1162194 entries, 0 to 1162193
Data columns (total 43 columns):
 #   Column                Non-Null Count    Dtype  
---  ------                --------------    -----  
 0   lkupClientId          1162194 non-null  int64  
 1   dimCustomerMasterId   1162194 non-null  int64  
 2   dimEventId            1162194 non-null  int64  
 3   inMarket              1162194 non-null  bool   
 4   customerNumber        1162194 non-null  object 
 5   year                  1162194 non-null  int64  
 6   productGrouping       1162194 non-null  object 
 7   totalSpent            1162192 non-null  float64
 8   recentDate            1162194 non-null  object 
 9   attendancePercent     1162194 non-null  float64
 10  renewedBeforeDays     1162194 non-null  int64  
 11  isBuyer               1162194 non-null  object 
 12  source_tenure         1162194 non-null  int64  
 13  tenure                1162194 non-null  int64  
 14  distToVenue           1162194 non-

### Let's also drop the features that only have a single value, as they won't add much differentiation to our model:

In [5]:
# keep only columns with more than one distinct value (NaN counts as a value)
df = df.loc[:, df.nunique(dropna=False) > 1]

df.shape

(1162194, 26)

### We should also drop features that have a low correlation with the target label, as they won't be useful for prediction. We'll only keep features whose correlation with the target is above a set threshold:

In [6]:
cor = df.corr()

threshold = 0.05

# absolute correlation of each column with the target
cor_target = abs(cor["isNextGameBuyer"])

# keep columns whose correlation with the target exceeds the threshold
# (the target itself is kept, since its self-correlation is 1)
relevant_features = cor_target[cor_target > threshold]
feats = relevant_features.index.tolist()

df_correlated = df[feats]

df_correlated.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1162194 entries, 0 to 1162193
Data columns (total 13 columns):
 #   Column               Non-Null Count    Dtype  
---  ------               --------------    -----  
 0   dimCustomerMasterId  1162194 non-null  int64  
 1   inMarket             1162194 non-null  bool   
 2   totalSpent           1162192 non-null  float64
 3   attendancePercent    1162194 non-null  float64
 4   source_tenure        1162194 non-null  int64  
 5   tenure               1162194 non-null  int64  
 6   distToVenue          1162194 non-null  float64
 7   totalGames           1162194 non-null  int64  
 8   recency              1162194 non-null  int64  
 9   click_link           1162194 non-null  int64  
 10  open_email           1162194 non-null  int64  
 11  send_email           1162194 non-null  int64  
 12  isNextGameBuyer      1162194 non-null  int64  
dtypes: bool(1), float64(3), int64(9)
memory usage: 107.5 MB
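
To make the threshold filter easier to reason about, here is a minimal sketch of the same idea on a toy frame. The function name `select_by_target_correlation` and the toy columns are assumptions for illustration; only the filtering logic mirrors the cell above.

```python
import pandas as pd


def select_by_target_correlation(frame, target, threshold):
    """Return columns whose absolute Pearson correlation with `target` exceeds `threshold`."""
    cor_target = frame.corr()[target].abs()
    return cor_target[cor_target > threshold].index.tolist()


toy = pd.DataFrame({
    "strong":   [0, 5, 1, 6],   # rises with the label
    "noise":    [1, 1, 2, 2],   # uncorrelated with the label
    "negative": [6, 1, 5, 0],   # falls as the label rises
    "label":    [0, 1, 0, 1],
})

selected = select_by_target_correlation(toy, "label", threshold=0.05)
print(selected)
```

Because the absolute value is used, a strongly negative feature passes the filter just like a strongly positive one, and the target column is always returned along with the features.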


### Now that we have the right features, we can look at the correlations between them. If features are highly correlated with each other, it might negatively impact the model:

In [7]:
corr = df_correlated.corr()
corr.style.background_gradient(cmap='coolwarm').set_precision(2)

Unnamed: 0,dimCustomerMasterId,inMarket,totalSpent,attendancePercent,source_tenure,tenure,distToVenue,totalGames,recency,click_link,open_email,send_email,isNextGameBuyer
dimCustomerMasterId,1.0,0.06,0.04,-0.03,0.04,0.32,-0.11,0.11,0.05,0.14,0.19,0.15,0.06
inMarket,0.06,1.0,-0.02,-0.06,-0.05,0.07,-0.29,0.07,0.04,0.01,0.0,-0.03,0.06
totalSpent,0.04,-0.02,1.0,0.03,0.07,0.03,0.03,0.27,-0.01,-0.0,0.01,0.0,0.06
attendancePercent,-0.03,-0.06,0.03,1.0,0.06,-0.18,0.06,0.14,-0.1,-0.14,-0.13,-0.27,-0.1
source_tenure,0.04,-0.05,0.07,0.06,1.0,0.32,-0.16,0.14,0.05,0.22,0.24,0.31,0.08
tenure,0.32,0.07,0.03,-0.18,0.32,1.0,-0.16,0.11,0.07,0.28,0.35,0.41,0.1
distToVenue,-0.11,-0.29,0.03,0.06,-0.16,-0.16,1.0,-0.04,-0.05,-0.09,-0.08,-0.13,-0.06
totalGames,0.11,0.07,0.27,0.14,0.14,0.11,-0.04,1.0,0.04,0.04,0.03,-0.0,0.26
recency,0.05,0.04,-0.01,-0.1,0.05,0.07,-0.05,0.04,1.0,0.1,0.09,0.1,0.26
click_link,0.14,0.01,-0.0,-0.14,0.22,0.28,-0.09,0.04,0.1,1.0,0.63,0.5,0.12
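
Reading a large heatmap by eye is error-prone, so one option is to list the feature pairs whose absolute correlation exceeds a cutoff (in the table above, for example, `click_link` and `open_email` sit at 0.63). The helper name `correlated_pairs`, the cutoff value and the toy data below are assumptions for illustration.

```python
import pandas as pd


def correlated_pairs(frame, cutoff):
    """List (col_a, col_b, |r|) for each distinct column pair with |r| above `cutoff`."""
    corr = frame.corr().abs()
    cols = list(corr.columns)
    pairs = []
    for i, col_a in enumerate(cols):
        for col_b in cols[i + 1:]:          # upper triangle only, skips the diagonal
            r = float(corr.loc[col_a, col_b])
            if r > cutoff:
                pairs.append((col_a, col_b, round(r, 2)))
    return pairs


toy = pd.DataFrame({
    "a": [1, 2, 3, 4],
    "b": [2, 4, 6, 8],     # exactly 2 * a, so almost perfectly correlated with a
    "c": [1, -1, -1, 1],   # uncorrelated with both a and b
})

pairs = correlated_pairs(toy, cutoff=0.6)
print(pairs)
```

On the real data this would be called on `df_correlated` with the target column dropped first, so that only feature-to-feature pairs are reported.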


### In order to compare two sets of features, we need to create some datasets for training and evaluation:


In [8]:
# select 80% of the data for training, keeping the original index labels for now
df_sample = df_correlated.sample(frac=0.8, random_state=786)
df_train = df_sample.reset_index(drop=True)

# create the eval dataset from the rows that were not sampled for training
df_eval = df_correlated.drop(df_sample.index).reset_index(drop=True)

# print out the number of records for training and eval
print('Data for Modeling: ' + str(df_train.shape))
print('Unseen Data For Predictions: ' + str(df_eval.shape), end="\n\n")

Data for Modeling: (929755, 13)
Unseen Data For Predictions: (232439, 13)
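
One thing to watch when a hold-out set is built with `sample` followed by `drop`: the two sets are only disjoint if `drop` receives the index labels of the sampled rows. After `reset_index`, those labels are replaced by 0..n-1, and `drop` would remove the first n rows instead. A minimal sketch on a toy frame (the names `frame`, `sampled`, `train` and `holdout` are illustrative):

```python
import pandas as pd

frame = pd.DataFrame({"value": range(10)})

# keep the sampled rows' original index labels until the hold-out set is built
sampled = frame.sample(frac=0.8, random_state=786)
train = sampled.reset_index(drop=True)
holdout = frame.drop(sampled.index).reset_index(drop=True)

# rows that appear in both sets; this should be empty for a clean split
overlap = set(train["value"]) & set(holdout["value"])
print(len(train), len(holdout), overlap)
```

The same overlap check can be run on an identifier column of the real data to confirm that the evaluation rows were never seen in training.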



### Now we can model the data using binary classification on the isNextGameBuyer field to see how likely a customer is to re-purchase:

In [9]:
setup(
    data= df_train, 
    target="isNextGameBuyer", 
    train_size = 0.80,
    data_split_shuffle=True,
    silent=True,
    numeric_features=[
        "inMarket",
        "totalSpent",
        "attendancePercent",
        "source_tenure",
        "tenure",
        "distToVenue",
        "totalGames",
        "recency",
        "click_link",
        "open_email",
        "send_email" 
    ]
)

Unnamed: 0,Description,Value
0,session_id,2847
1,Target,isNextGameBuyer
2,Target Type,Binary
3,Label Encoded,"0: 0, 1: 1"
4,Original Data,"(929755, 13)"
5,Missing Values,True
6,Numeric Features,12
7,Categorical Features,0
8,Ordinal Features,False
9,High Cardinality Features,False


(None,
 StratifiedKFold(n_splits=10, random_state=None, shuffle=False),
 False,
 <MLUsecase.CLASSIFICATION: 1>,
 'clf-default-name',
 False,
 {'parameter': 'Hyperparameters',
  'auc': 'AUC',
  'confusion_matrix': 'Confusion Matrix',
  'threshold': 'Threshold',
  'pr': 'Precision Recall',
  'error': 'Prediction Error',
  'class_report': 'Class Report',
  'rfe': 'Feature Selection',
  'learning': 'Learning Curve',
  'manifold': 'Manifold Learning',
  'calibration': 'Calibration Curve',
  'vc': 'Validation Curve',
  'dimension': 'Dimensions',
  'feature': 'Feature Importance',
  'feature_all': 'Feature Importance (All)',
  'boundary': 'Decision Boundary',
  'lift': 'Lift Chart',
  'gain': 'Gain Chart',
  'tree': 'Decision Tree',
  'ks': 'KS Statistic Plot'},
 [],
 'isNextGameBuyer',
 False,
 -1,
 'lightgbm',
 False,
         dimCustomerMasterId  inMarket  totalSpent  attendancePercent  \
 0                  17143004     False      680.00           1.000000   
 1                    619693 
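
The setup summary reports 12 numeric features, while `numeric_features` in the call lists 11. A quick way to see which column is picked up without being declared is to compare the column list from `df_correlated.info()` with the declared features. The variable names below are illustrative; the column names are copied from the cells above.

```python
# columns of df_correlated, as listed by df_correlated.info()
columns = [
    "dimCustomerMasterId", "inMarket", "totalSpent", "attendancePercent",
    "source_tenure", "tenure", "distToVenue", "totalGames", "recency",
    "click_link", "open_email", "send_email", "isNextGameBuyer",
]

# features declared in numeric_features in the setup() call
numeric_features = [
    "inMarket", "totalSpent", "attendancePercent", "source_tenure", "tenure",
    "distToVenue", "totalGames", "recency", "click_link", "open_email",
    "send_email",
]

target = "isNextGameBuyer"

# columns that reach setup() without being declared as a feature or the target
undeclared = [c for c in columns if c != target and c not in numeric_features]
print(undeclared)
```

If the customer identifier is not meant to be a predictor, it could be dropped from `df_train` before calling `setup`, or passed through the `ignore_features` argument of `setup`.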

In [10]:
model_matrix = compare_models(
    fold=10,
    include=["lr", "xgboost"]
)

Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
xgboost,Extreme Gradient Boosting,0.9713,0.9466,0.3796,0.6931,0.4905,0.477,0.5001,20.232
lr,Logistic Regression,0.9624,0.4645,0.0087,0.1702,0.0165,0.0129,0.0307,1.304
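
As a reading aid for the table above, F1 is the harmonic mean of precision and recall, so a model with decent precision but low recall still ends up with a modest F1. The sketch below recomputes it from the mean precision and recall reported for xgboost; the function name `f1_score_from` is illustrative. The table's F1 is averaged across folds, so a value recomputed from the mean precision and recall can differ slightly in the last digit.

```python
def f1_score_from(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# mean precision and recall reported for xgboost in the comparison table
xgb_f1 = f1_score_from(precision=0.6931, recall=0.3796)
print(round(xgb_f1, 4))
```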


In [11]:
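# compare_models returned the best model (xgboost): re-create it with
# cross-validation, then refit it on the full dataset passed to setup
best_model = create_model(model_matrix)
final_model = finalize_model(best_model)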

Unnamed: 0,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,0.9713,0.9497,0.3879,0.6861,0.4956,0.482,0.5029
1,0.9718,0.9465,0.3791,0.7118,0.4947,0.4816,0.5071
2,0.9718,0.944,0.3831,0.7067,0.4969,0.4837,0.5078
3,0.9712,0.9461,0.3831,0.6865,0.4918,0.4782,0.4999
4,0.9711,0.9468,0.3733,0.6902,0.4845,0.471,0.4947
5,0.9718,0.9478,0.3848,0.7065,0.4982,0.485,0.5089
6,0.9705,0.9463,0.3581,0.6783,0.4688,0.4551,0.4798
7,0.9707,0.9462,0.3751,0.6751,0.4823,0.4685,0.49
8,0.9718,0.9439,0.3888,0.7007,0.5001,0.4868,0.5093
9,0.9713,0.9486,0.3828,0.6886,0.492,0.4785,0.5005
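
The per-fold scores can be summarised with a mean and a spread to judge how stable the model is across folds. The sketch below uses the AUC and recall columns copied from the fold table above. It reports the sample standard deviation from Python's `statistics` module, which may differ slightly from the SD PyCaret would display.

```python
from statistics import mean, stdev

# per-fold AUC and recall copied from the fold table
fold_auc = [0.9497, 0.9465, 0.944, 0.9461, 0.9468,
            0.9478, 0.9463, 0.9462, 0.9439, 0.9486]
fold_recall = [0.3879, 0.3791, 0.3831, 0.3831, 0.3733,
               0.3848, 0.3581, 0.3751, 0.3888, 0.3828]

summary = {
    "auc": (mean(fold_auc), stdev(fold_auc)),
    "recall": (mean(fold_recall), stdev(fold_recall)),
}

for name, (avg, sd) in summary.items():
    print(f"{name}: mean={avg:.4f} sd={sd:.4f}")
```

The means agree with the xgboost row of the model comparison (AUC 0.9466, recall 0.3796), which is expected since both come from 10-fold cross-validation of the same model.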




## Observations
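XGBoost clearly outperformed logistic regression in the 10-fold comparison (AUC 0.9466 vs 0.4645, F1 0.4905 vs 0.0165). Accuracy is high for both models (0.9713 and 0.9624), but XGBoost recall is only 0.3796 at a precision of 0.6931, which suggests that accuracy alone overstates how well next-game buyers are identified. The XGBoost fold scores are stable, with AUC ranging from 0.9439 to 0.9497 across the 10 folds.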

## Conclusions
Here you can talk about next steps, did the experiment work? If yes, what to do next? If no, why?