# Sounders2 Feature Selection
* StellarAlgo Data Science
* Ryan Kazmerik & Nakisa Rad
* October 27, 2021

## Hypothesis
Write about the hunch you have and why you're running this experiment.

## Experiment
Document the experiment, including data selection, data transformations, feature engineering, and modelling.

In [1]:
import getpass
import pyodbc
import pandas as pd
import warnings
from pycaret.classification import *
from ngboost import NGBClassifier

warnings.filterwarnings('ignore')

### Let's connect to MSSQL and run a stored proc to get our dataset:

In [2]:
SERVER = '34.206.73.189' 
DATABASE = 'datascience' 
USERNAME = 'nrad' 
PASSWORD = getpass.getpass(prompt='Enter your password')
CNXN = pyodbc.connect(
    'DRIVER={ODBC Driver 17 for SQL Server};'
    f'SERVER={SERVER};DATABASE={DATABASE};UID={USERNAME};PWD={PASSWORD}'
)


In [5]:
lkupclientid = 18 # sounders2

storedProc = (
    f"""Exec [stlrUSL].[ds].[getRetentionScoringModelData] {lkupclientid}"""
)

# pd.read_sql executes the statement itself, so no separate cursor is needed
df = pd.read_sql(storedProc, CNXN)

CNXN.commit()

df.head()
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 820 entries, 0 to 819
Data columns (total 55 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   lkupClientId              820 non-null    int64  
 1   dimCustomerMasterId       820 non-null    int64  
 2   customerNumber            820 non-null    object 
 3   year                      820 non-null    object 
 4   productGrouping           820 non-null    object 
 5   totalSpent                820 non-null    float64
 6   recentDate                820 non-null    object 
 7   attendancePercent         820 non-null    float64
 8   renewedBeforeDays         820 non-null    int64  
 9   isBuyer                   820 non-null    object 
 10  source_tenure             820 non-null    int64  
 11  tenure                    820 non-null    int64  
 12  distToVenue               820 non-null    float64
 13  totalGames                820 non-null    int64  
 14  recency   
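
The stored procedure call above interpolates `lkupclientid` into the SQL text with an f-string. A minimal sketch of the parameterized alternative follows. It uses an in-memory SQLite database as a stand-in for SQL Server so that it runs anywhere, and the table `retention_demo` and its rows are invented for illustration. With `pyodbc`, the same `?` placeholder style would be expected to apply to the `Exec` statement, but that is an assumption to verify against the driver in use.

```python
import sqlite3

import pandas as pd

# In-memory SQLite stands in for the MSSQL connection (assumption for this demo).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE retention_demo (lkupClientId INTEGER, totalSpent REAL)")
conn.executemany(
    "INSERT INTO retention_demo VALUES (?, ?)",
    [(18, 120.0), (18, 75.5), (7, 300.0)],
)
conn.commit()

lkupclientid = 18  # sounders2

# The value travels separately from the SQL text via `params`.
demo_df = pd.read_sql(
    "SELECT * FROM retention_demo WHERE lkupClientId = ?",
    conn,
    params=(lkupclientid,),
)
conn.close()

print(demo_df.shape)
```

Passing the value through `params` keeps the query text constant, which avoids quoting problems if the argument ever comes from user input.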

### Let's drop the features that have lots of null values:

In [6]:
df.drop([ 
    'lengthOfResidenceInYrs',
    'annualHHIncome',
    'education',
    'urbanicity',
    'isnextyear_buyer',
    'isnextyear_samepkg_buyer',
    'pkgupgrade_status',
    'auto_renewal'],
    axis=1, 
    inplace=True
)

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 820 entries, 0 to 819
Data columns (total 47 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   lkupClientId          820 non-null    int64  
 1   dimCustomerMasterId   820 non-null    int64  
 2   customerNumber        820 non-null    object 
 3   year                  820 non-null    object 
 4   productGrouping       820 non-null    object 
 5   totalSpent            820 non-null    float64
 6   recentDate            820 non-null    object 
 7   attendancePercent     820 non-null    float64
 8   renewedBeforeDays     820 non-null    int64  
 9   isBuyer               820 non-null    object 
 10  source_tenure         820 non-null    int64  
 11  tenure                820 non-null    int64  
 12  distToVenue           820 non-null    float64
 13  totalGames            820 non-null    int64  
 14  recency               820 non-null    int64  
 15  missed_games_1        8
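
The columns dropped above were listed by hand. As a minimal sketch, a similar list could be derived from the share of missing values per column. The toy frame, its values, and the `null_threshold` of 0.5 are assumptions for illustration, not figures from the Sounders2 data.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the stored-proc result (values are invented).
toy = pd.DataFrame(
    {
        "totalSpent": [100.0, 250.0, 80.0, 40.0],
        "education": [np.nan, np.nan, "college", np.nan],
        "annualHHIncome": [np.nan, np.nan, np.nan, np.nan],
    }
)

null_threshold = 0.5  # hypothetical cutoff: drop columns with more than 50% nulls

# isna().mean() gives the fraction of missing values in each column.
null_fraction = toy.isna().mean()
mostly_null = null_fraction[null_fraction > null_threshold].index.tolist()

toy_clean = toy.drop(columns=mostly_null)
print(mostly_null)
```

Deriving the list this way makes the cutoff explicit and keeps the drop step valid if the stored procedure later returns different columns.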

### In order to compare two sets of features, we need to create some datasets for training and evaluation:

In [7]:
df_train_A = df.sample(frac=0.9, random_state=786)
df_train_B = df.sample(frac=0.9, random_state=786)

df_eval_A = df.drop(df_train_A.index)
df_eval_B = df.drop(df_train_B.index)

print('Data for Modeling (A Class): ' + str(df_train_A.shape))
print('Unseen Data For Predictions: ' + str(df_eval_A.shape))

print('Data for Modeling (B Class): ' + str(df_train_B.shape))
print('Unseen Data For Predictions: ' + str(df_eval_B.shape))

Data for Modeling (A Class): (738, 47)
Unseen Data For Predictions: (82, 47)
Data for Modeling (B Class): (738, 47)
Unseen Data For Predictions: (82, 47)


### Let's also drop the features that only have a single value, as they won't add much differentiation to our model:

In [8]:
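# nunique(dropna=False) counts NaN as a value, matching len(df[col].unique())
single_value_cols = [col for col in df.columns if df[col].nunique(dropna=False) == 1]
df.drop(columns=single_value_cols, inplace=True)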
        
df.shape

(820, 19)
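
Single-value columns can also be identified for the whole frame in one call. The sketch below runs on an invented toy frame and shows how a column that is entirely missing counts as a single value, while a column mixing one value with `NaN` does not.

```python
import numpy as np
import pandas as pd

# Toy frame; column names echo the dataset but the values are invented.
toy = pd.DataFrame(
    {
        "lkupClientId": [18, 18, 18],      # constant
        "tenure": [1, 3, 2],               # varies
        "flag": [np.nan, np.nan, np.nan],  # all missing, so a single value
        "partial": [1.0, np.nan, 1.0],     # two values once NaN is counted
    }
)

# dropna=False counts NaN as its own value.
n_values = toy.nunique(dropna=False)
constant_cols = n_values[n_values == 1].index.tolist()

toy_varied = toy.drop(columns=constant_cols)
print(constant_cols)
```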

### We should also drop features that have a low correlation with the target label, as they won't be useful for prediction. We'll only keep features whose absolute correlation is above a set threshold:

In [9]:
cor = df.corr()

threshold = 0.05

# Absolute correlation of every numeric column with the target
cor_target = cor["isNextYear_Buyer"].abs()

# Keep columns above the threshold (the target itself always qualifies)
relevant_features = cor_target[cor_target > threshold]

feats = relevant_features.index.tolist()

df_correlated = df[feats]

df_correlated.shape

df_correlated.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 820 entries, 0 to 819
Data columns (total 13 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   dimCustomerMasterId  820 non-null    int64  
 1   totalSpent           820 non-null    float64
 2   attendancePercent    820 non-null    float64
 3   renewedBeforeDays    820 non-null    int64  
 4   source_tenure        820 non-null    int64  
 5   tenure               820 non-null    int64  
 6   distToVenue          820 non-null    float64
 7   totalGames           820 non-null    int64  
 8   recency              820 non-null    int64  
 9   missed_games_1       820 non-null    int64  
 10  missed_games_2       820 non-null    int64  
 11  missed_games_over_2  820 non-null    int64  
 12  isNextYear_Buyer     820 non-null    int64  
dtypes: float64(3), int64(10)
memory usage: 83.4 KB
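
The thresholding step can be checked on a small frame where the answer is known in advance. In this sketch the columns `noise` and `label` and all values are invented. `noise` is built to have zero correlation with `label`, so only `totalGames` should survive.

```python
import pandas as pd

# Toy data: totalGames rises with the label, noise is uncorrelated by construction.
toy = pd.DataFrame(
    {
        "totalGames": [1, 2, 3, 4, 5, 6, 7, 8],
        "noise": [1, -1, -1, 1, 1, -1, -1, 1],
        "label": [0, 0, 0, 0, 1, 1, 1, 1],
    }
)

threshold = 0.05
target = "label"

# Absolute Pearson correlation of each column with the target.
cor_target = toy.corr()[target].abs()
selected = cor_target[cor_target > threshold].index.tolist()

# The target correlates perfectly with itself, so separate it from the features.
features_only = [col for col in selected if col != target]
print(features_only)
```

Note that the selection always includes the target column, which is why it appears in the selected frame alongside the features.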


### Now that we have the right features, we can look at the correlations between them. If features are highly correlated with each other, it might negatively impact the model:

In [10]:
corr = df_correlated.corr()
# set_precision() is deprecated in newer pandas; a format string gives the same 2-decimal display
corr.style.background_gradient(cmap='coolwarm').format('{:.2f}')

| | dimCustomerMasterId | totalSpent | attendancePercent | renewedBeforeDays | source_tenure | tenure | distToVenue | totalGames | recency | missed_games_1 | missed_games_2 | missed_games_over_2 | isNextYear_Buyer |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| dimCustomerMasterId | 1.0 | 0.1 | -0.17 | -0.09 | 0.02 | 0.45 | -0.0 | -0.09 | -0.23 | -0.12 | -0.14 | -0.2 | 0.39 |
| totalSpent | 0.1 | 1.0 | 0.0 | -0.01 | 0.27 | 0.13 | -0.03 | 0.22 | -0.04 | 0.07 | 0.13 | 0.14 | 0.09 |
| attendancePercent | -0.17 | 0.0 | 1.0 | 0.06 | -0.0 | -0.43 | -0.09 | 0.78 | -0.27 | 0.45 | 0.21 | -0.2 | 0.3 |
| renewedBeforeDays | -0.09 | -0.01 | 0.06 | 1.0 | 0.12 | -0.03 | -0.03 | 0.17 | 0.02 | 0.13 | 0.07 | 0.1 | 0.06 |
| source_tenure | 0.02 | 0.27 | -0.0 | 0.12 | 1.0 | 0.12 | -0.09 | 0.11 | 0.01 | 0.07 | 0.12 | 0.14 | 0.12 |
| tenure | 0.45 | 0.13 | -0.43 | -0.03 | 0.12 | 1.0 | -0.05 | -0.27 | -0.17 | -0.19 | -0.17 | -0.2 | -0.13 |
| distToVenue | -0.0 | -0.03 | -0.09 | -0.03 | -0.09 | -0.05 | 1.0 | -0.1 | 0.02 | -0.07 | -0.04 | 0.0 | -0.05 |
| totalGames | -0.09 | 0.22 | 0.78 | 0.17 | 0.11 | -0.27 | -0.1 | 1.0 | -0.22 | 0.53 | 0.32 | 0.01 | 0.34 |
| recency | -0.23 | -0.04 | -0.27 | 0.02 | 0.01 | -0.17 | 0.02 | -0.22 | 1.0 | -0.1 | -0.03 | 0.37 | -0.1 |
| missed_games_1 | -0.12 | 0.07 | 0.45 | 0.13 | 0.07 | -0.19 | -0.07 | 0.53 | -0.1 | 1.0 | 0.18 | -0.13 | 0.13 |
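
Reading a matrix this size by eye is error-prone, so the highly correlated pairs can be listed programmatically. The sketch below uses three features whose values are copied from the correlation table above. The `cutoff` of 0.7 is a hypothetical choice, not one made in this experiment.

```python
from itertools import combinations

import pandas as pd

names = ["attendancePercent", "totalGames", "missed_games_1"]

# Correlations copied from the table above for these three features.
corr_subset = pd.DataFrame(
    [
        [1.00, 0.78, 0.45],
        [0.78, 1.00, 0.53],
        [0.45, 0.53, 1.00],
    ],
    index=names,
    columns=names,
)

cutoff = 0.7  # hypothetical cutoff for "highly correlated"

# combinations() visits each unordered pair once and skips the diagonal.
high_pairs = [
    (a, b, corr_subset.loc[a, b])
    for a, b in combinations(names, 2)
    if abs(corr_subset.loc[a, b]) > cutoff
]
print(high_pairs)
```

With this cutoff, only `attendancePercent` and `totalGames` are flagged, which matches the 0.78 entry in the table.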


### Next, we split the selected features into datasets for training and evaluation:


In [11]:
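# select 80% of the data for training, keeping the original index for now
df_train = df_correlated.sample(frac=0.8, random_state=786)

# the eval dataset is every row that was not sampled for training
df_eval = df_correlated.drop(df_train.index).reset_index(drop=True)

# reset the training index only after the eval rows have been selected
df_train = df_train.reset_index(drop=True)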

# print out the number of records for training and eval
print('Data for Modeling: ' + str(df_train.shape))
print('Unseen Data For Predictions: ' + str(df_eval.shape), end="\n\n")

Data for Modeling: (656, 13)
Unseen Data For Predictions: (164, 13)
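
A split built from `sample` and `drop` only holds out unseen rows if `drop` receives the original index labels of the sampled rows. Resetting the index before the drop would make `drop` remove the first rows by position instead. A minimal sketch on a toy frame follows; the frame and the name `eval_` are illustrative.

```python
import pandas as pd

toy = pd.DataFrame({"x": range(10)})

# Sample first and keep the original index labels.
train = toy.sample(frac=0.8, random_state=786)

# Drop the sampled labels, so eval holds exactly the rows not in train.
eval_ = toy.drop(train.index)

# Resetting the index is only safe after the drop.
train = train.reset_index(drop=True)
eval_ = eval_.reset_index(drop=True)

# No value should appear in both sets.
overlap = set(train["x"]) & set(eval_["x"])
print(len(train), len(eval_), overlap)
```

Checking that the overlap is empty is a cheap guard against leakage between the training and evaluation sets.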



### Now we can model the data using binary classification on the isNextYear_Buyer field to see how likely a customer is to re-purchase:

In [12]:
setup(
    data=df_train,
    target="isNextYear_Buyer", 
    train_size = 0.80,
    data_split_shuffle=True,
    silent=True,
    numeric_features=["totalSpent",
        "attendancePercent",
        "source_tenure",
        "tenure",
        "renewedBeforeDays",
        "distToVenue",
        "recency",
        "totalGames",
        "missed_games_1",
        "missed_games_2",
        "missed_games_over_2"
    ]
)

| | Description | Value |
|---|---|---|
| 0 | session_id | 7816 |
| 1 | Target | isNextYear_Buyer |
| 2 | Target Type | Binary |
| 3 | Label Encoded | 0: 0, 1: 1 |
| 4 | Original Data | (656, 13) |
| 5 | Missing Values | False |
| 6 | Numeric Features | 12 |
| 7 | Categorical Features | 0 |
| 8 | Ordinal Features | False |
| 9 | High Cardinality Features | False |



In [13]:
# adding an extra classifier ngboost
ngc = NGBClassifier()
ngboost = create_model(ngc)

model_matrix = compare_models(
    fold=10,
    include=["ada","dt","gbc","et","knn","lightgbm","lr","rf",ngboost,"xgboost"]
)

| | Model | Accuracy | AUC | Recall | Prec. | F1 | Kappa | MCC | TT (Sec) |
|---|---|---|---|---|---|---|---|---|---|
| 8 | NGBClassifier | 0.8396 | 0.8925 | 0.6167 | 0.7808 | 0.6855 | 0.5807 | 0.5903 | 0.362 |
| 2 | Gradient Boosting Classifier | 0.8359 | 0.8785 | 0.6829 | 0.733 | 0.7048 | 0.5916 | 0.5942 | 0.034 |
| 3 | Extra Trees Classifier | 0.8358 | 0.8778 | 0.6571 | 0.7463 | 0.6945 | 0.5837 | 0.5889 | 0.175 |
| 5 | Light Gradient Boosting Machine | 0.8358 | 0.8691 | 0.6829 | 0.7333 | 0.7042 | 0.5912 | 0.5941 | 0.019 |
| 7 | Random Forest Classifier | 0.8302 | 0.8692 | 0.6304 | 0.7372 | 0.6758 | 0.563 | 0.5682 | 0.186 |
| 0 | Ada Boost Classifier | 0.8245 | 0.841 | 0.5917 | 0.7663 | 0.6625 | 0.5468 | 0.5588 | 0.029 |
| 9 | Extreme Gradient Boosting | 0.8187 | 0.8603 | 0.6567 | 0.6989 | 0.675 | 0.5498 | 0.5519 | 0.242 |
| 1 | Decision Tree Classifier | 0.788 | 0.7473 | 0.6504 | 0.6295 | 0.6373 | 0.4881 | 0.4902 | 0.005 |
| 4 | K Neighbors Classifier | 0.7519 | 0.7451 | 0.4983 | 0.586 | 0.5294 | 0.3653 | 0.3724 | 0.045 |
| 6 | Logistic Regression | 0.7099 | 0.3142 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.006 |


### The top model is performing well, so let's evaluate it on our unseen eval dataset:


In [14]:
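# compare_models returned the top-ranked model; create_model re-fits it and prints per-fold scores
best_model = create_model(model_matrix)

# score the held-out eval rows with the fitted model
unseen_predictions = predict_model(best_model, data=df_eval)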

| Fold | Accuracy | AUC | Recall | Prec. | F1 | Kappa | MCC |
|---|---|---|---|---|---|---|---|
| 0 | 0.8302 | 0.9211 | 0.5333 | 0.8 | 0.64 | 0.5346 | 0.5535 |
| 1 | 0.7736 | 0.8228 | 0.5333 | 0.6154 | 0.5714 | 0.4186 | 0.4206 |
| 2 | 0.8679 | 0.902 | 0.625 | 0.9091 | 0.7407 | 0.6562 | 0.6769 |
| 3 | 0.9245 | 0.9358 | 0.875 | 0.875 | 0.875 | 0.8209 | 0.8209 |
| 4 | 0.8077 | 0.9027 | 0.5333 | 0.7273 | 0.6154 | 0.4912 | 0.5017 |
| 5 | 0.8462 | 0.9369 | 0.6667 | 0.7692 | 0.7143 | 0.6098 | 0.6127 |
| 6 | 0.9038 | 0.9117 | 0.8 | 0.8571 | 0.8276 | 0.761 | 0.7619 |
| 7 | 0.7308 | 0.7604 | 0.4 | 0.5455 | 0.4615 | 0.2877 | 0.2938 |
| 8 | 0.8269 | 0.8694 | 0.5333 | 0.8 | 0.64 | 0.532 | 0.5509 |
| 9 | 0.8846 | 0.9622 | 0.6667 | 0.9091 | 0.7692 | 0.6947 | 0.7096 |
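
The F1 column in the per-fold results is the harmonic mean of the precision and recall columns, which can be verified from the reported values. The sketch below recomputes it for folds 0, 2 and 3. The helper name `f1_score_from` is hypothetical, and small differences are expected because the table values are rounded to four decimals.

```python
def f1_score_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) taken from folds 0, 2 and 3 above.
folds = [
    (0.8, 0.5333, 0.64),
    (0.9091, 0.625, 0.7407),
    (0.875, 0.875, 0.875),
]

recomputed = [round(f1_score_from(p, r), 4) for p, r, _ in folds]
print(recomputed)
```

Because F1 is computed per fold, the F1 of the averaged precision and recall will generally not equal the averaged F1.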


In [15]:
plot_model(best_model, plot='feature')

ValueError: Data must be 1-dimensional
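
One possible cause of this error, which the notebook itself does not confirm, is that the model exposes feature importances as a 2-D array (one row per distribution parameter, as NGBoost documents for its `feature_importances_` attribute) while the plotting code expects a 1-D vector. The sketch below uses an invented stand-in array rather than a fitted model and shows how selecting one row produces a vector that can be ranked.

```python
import numpy as np
import pandas as pd

feature_names = ["totalSpent", "tenure", "recency"]

# Stand-in for an importance array of shape (n_params, n_features);
# the numbers are invented and no model is fitted here.
importances_2d = np.array([[0.5, 0.3, 0.2]])

# Selecting a single row yields the 1-D vector that a pandas Series expects.
importances_1d = importances_2d[0]

ranked = pd.Series(importances_1d, index=feature_names).sort_values(ascending=False)
print(ranked)
```

If this is the cause, plotting the importances manually from the selected row would be one workaround. Checking the shape of the fitted model's importances is the first step to confirm it.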

In [21]:
plot_model(best_model, plot='confusion_matrix')

Finished loading model, total used 100 iterations


AttributeError: 'Pipeline' object has no attribute 'fig'

## Conclusions
Here you can talk about next steps, did the experiment work? If yes, what to do next? If no, why?