# Florida Panthers - Feature Selection
* StellarAlgo Data Science
* Ryan Kazmerik & Nakisa Rad
* October 8, 2021

## Hypothesis
Feature selection and feature engineering are two of the most important parts of building a machine learning model. Our hypothesis is that adding more features to the StellarAlgo retention model could improve its performance.

## Experiment
This section details our experiment including querying data, data transformations, feature selection and modelling.

In [1]:
import getpass
import pyodbc
import pandas as pd
import warnings
import matplotlib.pyplot as plt

from pycaret.classification import *
from ngboost import NGBClassifier

warnings.filterwarnings('ignore')

### Let's connect to MSSQL and run a stored proc to get our dataset:

In [2]:
# connect to SQL Server.
SERVER = '34.206.73.189' 
DATABASE = 'datascience' 
USERNAME = 'dsAdminWrite' 
PASSWORD = getpass.getpass(prompt='Enter your password')
CNXN = pyodbc.connect('DRIVER={ODBC Driver 17 for SQL Server};SERVER='+SERVER+';DATABASE='+DATABASE+';UID='+USERNAME+';PWD='+ PASSWORD)

In [4]:
lkupclientid = 93 # panthers

cursor = CNXN.cursor()

storedProc = (
    f"""Exec [stlrNHLPanthers].[ds].[getRetentionScoringModelData] {lkupclientid}"""
)

df = pd.read_sql(storedProc, CNXN)

CNXN.commit()
cursor.close()

df.shape

(14066, 55)

### Let's drop the features that have lots of null values, and any ID columns, as these won't be useful to our model:

In [5]:
df.drop([
    'dimCustomerMasterId',
    'distToVenue',
    'lengthOfResidenceInYrs', 
    'annualHHIncome',
    'education',
    'urbanicity',
    'isnextyear_buyer', 
    'isnextyear_samepkg_buyer',
    'pkgupgrade_status',
    'auto_renewal'],
    axis=1, 
    inplace=True
)

df.shape

(14066, 45)
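
Which columns count as having "lots of null values" can be checked rather than eyeballed. The sketch below is one possible way to do that with pandas, run on toy data: the `null_fraction` helper, the `NULL_THRESHOLD` cut-off and the toy rows are assumptions for illustration, not values taken from this notebook.

```python
import numpy as np
import pandas as pd


def null_fraction(frame: pd.DataFrame) -> pd.Series:
    """Return the share of missing values per column, highest first."""
    return frame.isna().mean().sort_values(ascending=False)


# toy data standing in for the query result (values are made up)
toy = pd.DataFrame({
    "distToVenue": [np.nan, np.nan, 12.5, np.nan],
    "tenure": [1, 3, 2, 5],
    "education": [None, "college", None, None],
})

NULL_THRESHOLD = 0.5  # hypothetical cut-off, not from the notebook
fractions = null_fraction(toy)
mostly_null = fractions[fractions > NULL_THRESHOLD].index.tolist()

print(fractions)
print(mostly_null)
```

Columns listed in `mostly_null` would be candidates for the drop list; the right threshold depends on how much imputation the model can tolerate.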

### Let's also drop the features that only have a single value, as they won't add much differentiation to our model:

In [6]:
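# collect the constant columns first, then drop them in a single call
# (dropna=False counts NaN as a value, matching len(unique()))
constant_cols = [col for col in df.columns if df[col].nunique(dropna=False) == 1]
df.drop(columns=constant_cols, inplace=True)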
        
df.shape

(14066, 26)

### We should also drop features that have a low correlation with the target label, as they won't be useful for prediction. We'll keep only the features whose absolute correlation with the target is above a set threshold:

In [7]:
cor = df.corr()

threshold = 0.05

# absolute correlation of every column with the target label
cor_target = cor["isNextYear_Buyer"].abs()

# keep the columns above the threshold; the target itself is kept too,
# since its correlation with itself is 1
relevant_features = cor_target[cor_target > threshold]

feats = relevant_features.index.tolist()

df_correlated = df[feats]

df_correlated.shape

df_correlated.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14066 entries, 0 to 14065
Data columns (total 10 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   attendancePercent    14066 non-null  float64
 1   source_tenure        14066 non-null  int64  
 2   tenure               14066 non-null  int64  
 3   totalGames           14066 non-null  int64  
 4   missed_games_1       14066 non-null  int64  
 5   missed_games_2       14066 non-null  int64  
 6   missed_games_over_2  14066 non-null  int64  
 7   openToSendRatio      14066 non-null  float64
 8   clickToOpenRatio     14066 non-null  float64
 9   isNextYear_Buyer     14066 non-null  int64  
dtypes: float64(3), int64(7)
memory usage: 1.1 MB


### Now that we have the right features, we can look at the correlations between them. If features are highly correlated with each other, it might negatively impact the model:

In [8]:
corr = df_correlated.corr()
corr.style.background_gradient(cmap='coolwarm').set_precision(2)

| | attendancePercent | source_tenure | tenure | totalGames | missed_games_1 | missed_games_2 | missed_games_over_2 | openToSendRatio | clickToOpenRatio | isNextYear_Buyer |
|---|---|---|---|---|---|---|---|---|---|---|
| attendancePercent | 1.0 | -0.05 | -0.12 | 0.27 | 0.0 | -0.17 | -0.38 | 0.01 | 0.02 | 0.12 |
| source_tenure | -0.05 | 1.0 | 0.24 | 0.15 | 0.1 | 0.12 | 0.12 | 0.03 | 0.03 | 0.17 |
| tenure | -0.12 | 0.24 | 1.0 | -0.21 | -0.17 | -0.08 | -0.07 | 0.26 | 0.22 | -0.2 |
| totalGames | 0.27 | 0.15 | -0.21 | 1.0 | 0.57 | 0.32 | 0.15 | -0.1 | -0.03 | 0.49 |
| missed_games_1 | 0.0 | 0.1 | -0.17 | 0.57 | 1.0 | 0.37 | 0.14 | -0.11 | -0.07 | 0.28 |
| missed_games_2 | -0.17 | 0.12 | -0.08 | 0.32 | 0.37 | 1.0 | 0.32 | -0.06 | -0.06 | 0.16 |
| missed_games_over_2 | -0.38 | 0.12 | -0.07 | 0.15 | 0.14 | 0.32 | 1.0 | -0.07 | -0.05 | 0.14 |
| openToSendRatio | 0.01 | 0.03 | 0.26 | -0.1 | -0.11 | -0.06 | -0.07 | 1.0 | 0.06 | -0.1 |
| clickToOpenRatio | 0.02 | 0.03 | 0.22 | -0.03 | -0.07 | -0.06 | -0.05 | 0.06 | 1.0 | -0.08 |
| isNextYear_Buyer | 0.12 | 0.17 | -0.2 | 0.49 | 0.28 | 0.16 | 0.14 | -0.1 | -0.08 | 1.0 |
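
Reading a correlation matrix by eye gets harder as features are added, so it can help to list only the pairs above a chosen cut-off. In the matrix above the largest off-diagonal value is 0.57 (`totalGames` and `missed_games_1`). The sketch below shows one way to extract such pairs; the `correlated_pairs` helper, the 0.9 threshold and the toy columns are assumptions for illustration only.

```python
import numpy as np
import pandas as pd


def correlated_pairs(frame: pd.DataFrame, threshold: float) -> pd.Series:
    """List column pairs whose absolute Pearson correlation exceeds threshold."""
    corr = frame.corr().abs()
    # keep only the strict upper triangle so each pair appears once
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    return pairs[pairs > threshold].sort_values(ascending=False)


# toy data: "b" is an exact multiple of "a", "c" is only weakly related
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 4.0, 6.0, 8.0, 10.0],
    "c": [5.0, 1.0, 4.0, 2.0, 3.0],
})

high = correlated_pairs(toy, threshold=0.9)
print(high)
```

For any pair that is returned, one of the two features is usually a candidate for removal, since it carries little information the other does not.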


### In order to compare two sets of features, we need to create some datasets for training and evaluation:

In [9]:
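# select 80% of the data for training (keep the original index for now)
df_train = df_correlated.sample(frac=0.8, random_state=786)

# the eval dataset is every row that was not sampled for training;
# drop by the original index labels before resetting either index
df_eval = df_correlated.drop(df_train.index).reset_index(drop=True)
df_train = df_train.reset_index(drop=True)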

# print out the number of records for training and eval
print('Data for Modeling: ' + str(df_train.shape))
print('Unseen Data For Predictions: ' + str(df_eval.shape), end="\n\n")

Data for Modeling: (11253, 10)
Unseen Data For Predictions: (2813, 10)
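
A holdout set is only "unseen" if no row appears in both splits, and matching shapes alone do not prove that. The sketch below is one way to build the split and check it on toy data; `split_train_eval` and the `row_id` column are hypothetical names used only for this illustration.

```python
import pandas as pd


def split_train_eval(frame: pd.DataFrame, frac: float, seed: int):
    """Sample a training set and keep the remaining rows for evaluation."""
    train = frame.sample(frac=frac, random_state=seed)
    # drop by the ORIGINAL index labels, before any reset_index call
    evaluation = frame.drop(train.index)
    # the two splits must not share any index label
    assert train.index.intersection(evaluation.index).empty
    return train.reset_index(drop=True), evaluation.reset_index(drop=True)


# toy data with an explicit row identifier so overlap is easy to inspect
toy = pd.DataFrame({"row_id": range(10), "label": [0, 1] * 5})
train, evaluation = split_train_eval(toy, frac=0.8, seed=786)

print(train.shape, evaluation.shape)
print(set(train["row_id"]) & set(evaluation["row_id"]))
```

The ordering matters: if the index is reset before `drop` is called, the labels no longer identify the sampled rows and the two sets can overlap.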



### Now we can model the data using binary classification on the isNextYear_Buyer field to predict how likely a customer is to re-purchase:

In [26]:
from sklearn.impute import SimpleImputer
setup(
    data= df_train, 
    target="isNextYear_Buyer", 
    train_size = 0.80,
    data_split_shuffle=True,
    silent=True,
    numeric_features=[
        "attendancePercent",
        "source_tenure",
        "tenure",
        "totalGames",
        "missed_games_1",
        "missed_games_2",
        "missed_games_over_2",
        "openToSendRatio",
        "clickToOpenRatio"
    ]
)

| | Description | Value |
|---|---|---|
| 0 | session_id | 5541 |
| 1 | Target | isNextYear_Buyer |
| 2 | Target Type | Binary |
| 3 | Label Encoded | 0: 0, 1: 1 |
| 4 | Original Data | (11253, 10) |
| 5 | Missing Values | False |
| 6 | Numeric Features | 9 |
| 7 | Categorical Features | 0 |
| 8 | Ordinal Features | False |
| 9 | High Cardinality Features | False |


AttributeError: 'Simple_Imputer' object has no attribute 'fill_value_categorical'

In [20]:
# adding an extra classifier ngboost
ngc = NGBClassifier()
ngboost = create_model(ngc)

model_matrix = compare_models(
    fold=10,
    include=["ada","dt","gbc","et","knn","lightgbm","lr","rf",ngboost,"xgboost"]
)

| | Model | Accuracy | AUC | Recall | Prec. | F1 | Kappa | MCC | TT (Sec) |
|---|---|---|---|---|---|---|---|---|---|
| 5 | Light Gradient Boosting Machine | 0.7996 | 0.8761 | 0.7651 | 0.7727 | 0.7687 | 0.5919 | 0.5922 | 0.041 |
| 2 | Gradient Boosting Classifier | 0.7919 | 0.8702 | 0.7556 | 0.7642 | 0.7597 | 0.5763 | 0.5765 | 0.205 |
| 9 | Extreme Gradient Boosting | 0.7919 | 0.8688 | 0.761 | 0.7613 | 0.761 | 0.5768 | 0.5769 | 0.482 |
| 7 | Random Forest Classifier | 0.787 | 0.8706 | 0.7462 | 0.7605 | 0.7531 | 0.5659 | 0.5662 | 0.322 |
| 8 | NGBClassifier | 0.7796 | 0.0 | 0.7676 | 0.7373 | 0.752 | 0.5539 | 0.5544 | 2.733 |
| 3 | Extra Trees Classifier | 0.777 | 0.8619 | 0.7383 | 0.7469 | 0.7424 | 0.5459 | 0.5461 | 0.283 |
| 0 | Ada Boost Classifier | 0.7706 | 0.8564 | 0.7724 | 0.7209 | 0.7457 | 0.5372 | 0.5384 | 0.075 |
| 4 | K Neighbors Classifier | 0.7296 | 0.7968 | 0.6997 | 0.6858 | 0.6927 | 0.4513 | 0.4515 | 0.054 |
| 1 | Decision Tree Classifier | 0.7252 | 0.7203 | 0.6824 | 0.6852 | 0.6838 | 0.4408 | 0.4408 | 0.015 |
| 6 | Logistic Regression | 0.7027 | 0.7963 | 0.6128 | 0.6754 | 0.6422 | 0.389 | 0.3906 | 0.044 |


### The top model is performing well, so let's evaluate it on our unseen eval dataset:

In [18]:
best_model = create_model(model_matrix)

unseen_predictions = predict_model(best_model, data=df_eval)

| | Accuracy | AUC | Recall | Prec. | F1 | Kappa | MCC |
|---|---|---|---|---|---|---|---|
| 0 | 0.7836 | 0.8692 | 0.7519 | 0.75 | 0.751 | 0.5596 | 0.5596 |
| 1 | 0.808 | 0.8837 | 0.7801 | 0.7781 | 0.7791 | 0.6093 | 0.6093 |
| 2 | 0.8067 | 0.8897 | 0.7692 | 0.7812 | 0.7752 | 0.6056 | 0.6057 |
| 3 | 0.8 | 0.8799 | 0.7641 | 0.772 | 0.768 | 0.5923 | 0.5923 |
| 4 | 0.78 | 0.8591 | 0.7308 | 0.754 | 0.7422 | 0.5504 | 0.5506 |
| 5 | 0.8022 | 0.8836 | 0.7641 | 0.776 | 0.77 | 0.5966 | 0.5966 |
| 6 | 0.7756 | 0.8637 | 0.7154 | 0.7541 | 0.7342 | 0.5402 | 0.5408 |
| 7 | 0.7922 | 0.8747 | 0.7564 | 0.7623 | 0.7593 | 0.5765 | 0.5766 |
| 8 | 0.7878 | 0.8646 | 0.759 | 0.7532 | 0.7561 | 0.5683 | 0.5683 |
| 9 | 0.8167 | 0.8853 | 0.7903 | 0.7883 | 0.7893 | 0.627 | 0.627 |


In [21]:
plot_model(best_model, plot='feature')

Finished loading model, total used 100 iterations


ImportError: cannot import name 'safe_indexing' from 'sklearn.utils' (/Users/stellaralgo/opt/anaconda3/lib/python3.7/site-packages/sklearn/utils/__init__.py)

In [29]:
plot_model(best_model, plot='confusion_matrix')

Finished loading model, total used 100 iterations


ImportError: cannot import name 'safe_indexing' from 'sklearn.utils' (/Users/stellaralgo/opt/anaconda3/lib/python3.7/site-packages/sklearn/utils/__init__.py)
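
Because `plot_model` fails here with an import error, the confusion matrix can still be tabulated directly from a frame of predictions. The sketch below uses pandas only and runs on toy rows. The `confusion_table` helper is hypothetical, and the `Label` column name is an assumption about what `predict_model` returns in the installed PyCaret version, so check the actual column names before reusing it.

```python
import pandas as pd


def confusion_table(actual: pd.Series, predicted: pd.Series) -> pd.DataFrame:
    """Cross-tabulate actual vs. predicted labels (rows are the actual class)."""
    return pd.crosstab(actual, predicted, rownames=["actual"], colnames=["predicted"])


# toy predictions; "Label" is ASSUMED to be the predicted-label column
scored = pd.DataFrame({
    "isNextYear_Buyer": [1, 0, 1, 1, 0, 0],
    "Label":            [1, 0, 0, 1, 1, 0],
})

table = confusion_table(scored["isNextYear_Buyer"], scored["Label"])
print(table)
```

With real output, `scored` would be replaced by the frame returned from scoring the eval dataset, which avoids the plotting dependency entirely.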

## Results

## Observations
Here you can document some ideas on the results from above

## Conclusions
Here you can talk about next steps, did the experiment work? If yes, what to do next? If no, why?