# Ports Feature Selection
* StellarAlgo Data Science
* Ryan Kazmerik & Nakisa Rad
* October 27, 2021

## Hypothesis
Write about the hunch you have and why you're running this experiment.

## Experiment
Document the experiment including selecting data, data transformations, feature engineering and modelling

In [1]:
import getpass
import pyodbc
import pandas as pd
import warnings
from pycaret.classification import *
from ngboost import NGBClassifier

warnings.filterwarnings('ignore')

### Let's connect to MSSQL and run a stored proc to get our dataset:

In [4]:
SERVER = '34.206.73.189' 
DATABASE = 'datascience' 
USERNAME = 'nrad' 
PASSWORD = getpass.getpass(prompt='Enter your password')
CNXN = pyodbc.connect('DRIVER={ODBC Driver 17 for SQL Server};SERVER='+SERVER+';DATABASE='+DATABASE+';UID='+USERNAME+';PWD='+ PASSWORD)

Enter your password········
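
The connection string above is built by concatenation, which can break if the password contains a `;`. As a minimal sketch, a small helper can assemble the same string and wrap the password in braces, following the ODBC convention for values with special characters. The helper name `build_connection_string` and the server, user, and password values below are hypothetical placeholders, not part of the original notebook.

```python
def build_connection_string(server, database, username, password,
                            driver="ODBC Driver 17 for SQL Server"):
    """Assemble an ODBC connection string from its parts (hypothetical helper)."""
    # Braces protect values containing ';'; a literal '}' is escaped by doubling it.
    safe_password = "{" + password.replace("}", "}}") + "}"
    parts = {
        "DRIVER": "{" + driver + "}",
        "SERVER": server,
        "DATABASE": database,
        "UID": username,
        "PWD": safe_password,
    }
    return ";".join(f"{key}={value}" for key, value in parts.items())


# placeholder values for illustration only
conn_str = build_connection_string("db.example.com", "datascience", "analyst", "p;ss}word")
# pyodbc.connect(conn_str) would then open the connection
```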


In [6]:
lkupclientid = 25 # ports

cursor = CNXN.cursor()

storedProc = (
    f"""Exec [stlrMILB].[ds].[getRetentionScoringModelData] {lkupclientid}"""
)

df = pd.read_sql(storedProc, CNXN)

CNXN.commit()
cursor.close()

df.head()
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1962 entries, 0 to 1961
Data columns (total 55 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   lkupClientId              1962 non-null   int64  
 1   dimCustomerMasterId       1962 non-null   int64  
 2   customerNumber            1962 non-null   object 
 3   year                      1962 non-null   object 
 4   productGrouping           1962 non-null   object 
 5   totalSpent                1962 non-null   float64
 6   recentDate                1962 non-null   object 
 7   attendancePercent         1962 non-null   float64
 8   renewedBeforeDays         1962 non-null   int64  
 9   isBuyer                   1962 non-null   object 
 10  source_tenure             1962 non-null   int64  
 11  tenure                    1962 non-null   int64  
 12  distToVenue               1962 non-null   float64
 13  totalGames                1962 non-null   int64  
 14  recency 

### Let's drop the features that have lots of null values:

In [7]:
df.drop([ 
    'lengthOfResidenceInYrs',
    'annualHHIncome',
    'education',
    'urbanicity',
    'isnextyear_buyer',
    'isnextyear_samepkg_buyer',
    'pkgupgrade_status',
    'auto_renewal'],
    axis=1, 
    inplace=True
)

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1962 entries, 0 to 1961
Data columns (total 47 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   lkupClientId          1962 non-null   int64  
 1   dimCustomerMasterId   1962 non-null   int64  
 2   customerNumber        1962 non-null   object 
 3   year                  1962 non-null   object 
 4   productGrouping       1962 non-null   object 
 5   totalSpent            1962 non-null   float64
 6   recentDate            1962 non-null   object 
 7   attendancePercent     1962 non-null   float64
 8   renewedBeforeDays     1962 non-null   int64  
 9   isBuyer               1962 non-null   object 
 10  source_tenure         1962 non-null   int64  
 11  tenure                1962 non-null   int64  
 12  distToVenue           1962 non-null   float64
 13  totalGames            1962 non-null   int64  
 14  recency               1962 non-null   int64  
 15  missed_games_1       
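
The columns above are listed by hand. As a hedged alternative, the sparse columns could be found programmatically from the fraction of missing values per column. The helper name `drop_sparse_columns`, the 0.5 cut-off, and the toy data are assumptions for illustration only.

```python
import numpy as np
import pandas as pd


def drop_sparse_columns(frame, max_null_frac=0.5):
    """Return a copy of frame without columns whose null fraction exceeds max_null_frac."""
    null_frac = frame.isna().mean()  # fraction of missing values per column
    sparse_cols = null_frac[null_frac > max_null_frac].index
    return frame.drop(columns=sparse_cols)


# toy data: 'education' is 75% null, 'tenure' is 25% null
toy = pd.DataFrame({
    "totalSpent": [100.0, 250.0, 80.0, 40.0],
    "education": [np.nan, np.nan, np.nan, "college"],
    "tenure": [1, 2, np.nan, 4],
})
dense = drop_sparse_columns(toy, max_null_frac=0.5)
print(list(dense.columns))
```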

### In order to compare two sets of features, we need to create some datasets for training and evaluation:

In [8]:
df_train_A = df.sample(frac=0.9, random_state=786)
df_train_B = df.sample(frac=0.9, random_state=786)

df_eval_A = df.drop(df_train_A.index)
df_eval_B = df.drop(df_train_B.index)

print('Data for Modeling (A Class): ' + str(df_train_A.shape))
print('Unseen Data For Predictions: ' + str(df_eval_A.shape))

print('Data for Modeling (B Class): ' + str(df_train_B.shape))
print('Unseen Data For Predictions: ' + str(df_eval_B.shape))

Data for Modeling (A Class): (1766, 47)
Unseen Data For Predictions: (196, 47)
Data for Modeling (B Class): (1766, 47)
Unseen Data For Predictions: (196, 47)


### Let's also drop the features that only have a single value, as they won't add much differentiation to our model:

In [9]:
for col in df.columns:
    if len(df[col].unique()) == 1:
        df.drop(col,inplace=True,axis=1)
        
df.shape

(1962, 17)
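
The same filter can be written without a loop. This is a small sketch on toy data (the column values are made up), using `nunique(dropna=False)` so that NaN counts as a value, as it does with `len(col.unique())`.

```python
import pandas as pd

# toy data: two constant columns and one that varies
toy = pd.DataFrame({
    "lkupClientId": [25, 25, 25],
    "totalGames": [10, 12, 7],
    "productGrouping": ["Full", "Full", "Full"],
})

# flag columns that hold a single distinct value (NaN counted as a value)
is_constant = toy.nunique(dropna=False) == 1
varying = toy.loc[:, ~is_constant]
print(list(varying.columns))
```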

### We should also drop features that have a low correlation with the target label, as they won't be useful for prediction. We'll only keep features whose absolute correlation with the target is above a set threshold:

In [10]:
cor = df.corr()

threshold = 0.05

# absolute correlation of every numeric column with the target
cor_target = abs(cor["isNextYear_Buyer"])

# keep the columns above the threshold (the target itself is included, r = 1)
relevant_features = cor_target[cor_target > threshold]

feats = list(relevant_features.index)

df_correlated = df[feats]

df_correlated.shape

df_correlated.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1962 entries, 0 to 1961
Data columns (total 9 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   totalSpent           1962 non-null   float64
 1   renewedBeforeDays    1962 non-null   int64  
 2   source_tenure        1962 non-null   int64  
 3   tenure               1962 non-null   int64  
 4   totalGames           1962 non-null   int64  
 5   missed_games_1       1962 non-null   int64  
 6   missed_games_2       1962 non-null   int64  
 7   missed_games_over_2  1962 non-null   int64  
 8   isNextYear_Buyer     1962 non-null   int64  
dtypes: float64(1), int64(8)
memory usage: 138.1 KB
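
To make the thresholding step concrete, here is a minimal sketch on toy data. The helper name `correlated_features` and the toy values are assumptions; unlike the cell above, the helper leaves the target out of the returned list.

```python
import pandas as pd


def correlated_features(frame, target, threshold=0.05):
    """Names of columns whose absolute Pearson r with target exceeds threshold."""
    corr_with_target = frame.corr()[target].abs().drop(target)
    return list(corr_with_target[corr_with_target > threshold].index)


# toy data: 'totalSpent' tracks the target, 'noise' is uncorrelated with it
toy = pd.DataFrame({
    "totalSpent": [10, 20, 30, 40, 50, 60],
    "noise": [1, 2, 1, 1, 2, 1],
    "isNextYear_Buyer": [0, 0, 0, 1, 1, 1],
})
selected = correlated_features(toy, "isNextYear_Buyer", threshold=0.05)
print(selected)
```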


### Now that we have the right features, we can look at the correlations between them. If features are highly correlated with each other, it might negatively impact the model:

In [11]:
corr = df_correlated.corr()
corr.style.background_gradient(cmap='coolwarm').set_precision(2)

| | totalSpent | renewedBeforeDays | source_tenure | tenure | totalGames | missed_games_1 | missed_games_2 | missed_games_over_2 | isNextYear_Buyer |
|---|---|---|---|---|---|---|---|---|---|
| totalSpent | 1.0 | 0.12 | -0.02 | -0.02 | 0.4 | 0.09 | 0.15 | 0.58 | 0.18 |
| renewedBeforeDays | 0.12 | 1.0 | 0.07 | 0.07 | 0.16 | 0.14 | 0.17 | 0.17 | 0.29 |
| source_tenure | -0.02 | 0.07 | 1.0 | 1.0 | 0.03 | 0.03 | -0.02 | -0.06 | -0.18 |
| tenure | -0.02 | 0.07 | 1.0 | 1.0 | 0.03 | 0.03 | -0.02 | -0.06 | -0.18 |
| totalGames | 0.4 | 0.16 | 0.03 | 0.03 | 1.0 | 0.73 | 0.57 | 0.38 | 0.16 |
| missed_games_1 | 0.09 | 0.14 | 0.03 | 0.03 | 0.73 | 1.0 | 0.56 | 0.15 | 0.14 |
| missed_games_2 | 0.15 | 0.17 | -0.02 | -0.02 | 0.57 | 0.56 | 1.0 | 0.26 | 0.12 |
| missed_games_over_2 | 0.58 | 0.17 | -0.06 | -0.06 | 0.38 | 0.15 | 0.26 | 1.0 | 0.16 |
| isNextYear_Buyer | 0.18 | 0.29 | -0.18 | -0.18 | 0.16 | 0.14 | 0.12 | 0.16 | 1.0 |
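
In the matrix above, `source_tenure` and `tenure` correlate at 1.0, so one of them is redundant. A possible way to surface such pairs automatically is sketched below. The helper name `highly_correlated_pairs`, the 0.9 cut-off, and the toy values are assumptions for illustration.

```python
import numpy as np
import pandas as pd


def highly_correlated_pairs(frame, cutoff=0.9):
    """List (feature_a, feature_b, r) for pairs with |r| above cutoff, each pair once."""
    corr = frame.corr().abs()
    # keep only the strict upper triangle so each pair appears once
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    # the comparison also filters out any NaN left by the mask
    strong = pairs[pairs > cutoff]
    return [(a, b, round(float(r), 2)) for (a, b), r in strong.items()]


# toy data: 'tenure' duplicates 'source_tenure'
toy = pd.DataFrame({
    "source_tenure": [1, 2, 3, 4, 5],
    "tenure": [1, 2, 3, 4, 5],
    "totalGames": [3, 1, 4, 1, 5],
})
redundant = highly_correlated_pairs(toy, cutoff=0.9)
print(redundant)
```

Dropping the second member of each reported pair is one option; which feature to keep is a modelling decision.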


### Next, we split the selected features into datasets for training and evaluation:


In [12]:
# select 80% of the data for training, keeping the original index for now
df_train = df_correlated.sample(frac=0.8, random_state=786)

# the eval dataset is every row that was not sampled for training;
# drop by the original index labels before resetting either index
df_eval = df_correlated.drop(df_train.index).reset_index(drop=True)
df_train = df_train.reset_index(drop=True)

# print out the number of records for training and eval
print('Data for Modeling: ' + str(df_train.shape))
print('Unseen Data For Predictions: ' + str(df_eval.shape), end="\n\n")

Data for Modeling: (1570, 9)
Unseen Data For Predictions: (392, 9)
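
The order of operations matters in this split. If the training frame's index is reset before calling `drop`, the rows removed are the first N by position rather than the sampled ones, and the eval set can overlap with training. A minimal sketch on toy data (the frame and column name are made up) shows a split that stays disjoint:

```python
import pandas as pd

toy = pd.DataFrame({"x": range(10)})

# sample while the original index is still intact
train = toy.sample(frac=0.8, random_state=786)
# drop the sampled rows by their original labels, then renumber both frames
evaluation = toy.drop(train.index).reset_index(drop=True)
train = train.reset_index(drop=True)

# no value should appear in both frames
overlap = set(train["x"]) & set(evaluation["x"])
print(len(train), len(evaluation), len(overlap))
```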



### Now we can model the data using binary classification on the isNextYear_Buyer field to see how likely a customer is to re-purchase:

In [13]:
# configure the PyCaret experiment; 20% of df_train is held out internally
setup(
    data= df_train, 
    target="isNextYear_Buyer", 
    train_size = 0.80,
    data_split_shuffle=True,
    silent=True,
    numeric_features=[
        "totalSpent",
        "renewedBeforeDays",
        "source_tenure",
        "tenure",
        "totalGames",
        "missed_games_1",
        "missed_games_2",
        "missed_games_over_2"
    ]
)

| | Description | Value |
|---|---|---|
| 0 | session_id | 5032 |
| 1 | Target | isNextYear_Buyer |
| 2 | Target Type | Binary |
| 3 | Label Encoded | 0: 0, 1: 1 |
| 4 | Original Data | (1570, 9) |
| 5 | Missing Values | False |
| 6 | Numeric Features | 8 |
| 7 | Categorical Features | 0 |
| 8 | Ordinal Features | False |
| 9 | High Cardinality Features | False |



In [14]:
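# add NGBoost as an extra classifier alongside PyCaret's built-in models
ngc = NGBClassifier()
ngboost = create_model(ngc)

# cross-validate every candidate; compare_models returns the top-ranked model
model_matrix = compare_models(
    fold=10,
    include=["ada","dt","gbc","et","knn","lightgbm","lr","rf",ngboost,"xgboost"]
)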

| | Model | Accuracy | AUC | Recall | Prec. | F1 | Kappa | MCC | TT (Sec) |
|---|---|---|---|---|---|---|---|---|---|
| 8 | NGBClassifier | 0.7126 | 0.7926 | 0.6738 | 0.5764 | 0.6182 | 0.3911 | 0.397 | 0.762 |
| 2 | Gradient Boosting Classifier | 0.7118 | 0.7988 | 0.5773 | 0.5882 | 0.5806 | 0.3617 | 0.3633 | 0.038 |
| 0 | Ada Boost Classifier | 0.7055 | 0.7909 | 0.5728 | 0.579 | 0.5741 | 0.3496 | 0.3508 | 0.031 |
| 6 | Logistic Regression | 0.7022 | 0.7377 | 0.4136 | 0.6031 | 0.4861 | 0.2887 | 0.3006 | 0.013 |
| 7 | Random Forest Classifier | 0.7006 | 0.7765 | 0.5429 | 0.5719 | 0.555 | 0.3305 | 0.3319 | 0.196 |
| 9 | Extreme Gradient Boosting | 0.6999 | 0.7714 | 0.5474 | 0.5703 | 0.5572 | 0.3308 | 0.3318 | 0.335 |
| 3 | Extra Trees Classifier | 0.6839 | 0.7578 | 0.5103 | 0.545 | 0.5226 | 0.2886 | 0.2911 | 0.181 |
| 5 | Light Gradient Boosting Machine | 0.6823 | 0.7717 | 0.5337 | 0.5419 | 0.5355 | 0.2951 | 0.2964 | 0.024 |
| 4 | K Neighbors Classifier | 0.6688 | 0.7271 | 0.5169 | 0.5239 | 0.5189 | 0.2668 | 0.2677 | 0.047 |
| 1 | Decision Tree Classifier | 0.6569 | 0.628 | 0.5337 | 0.5056 | 0.5177 | 0.2523 | 0.2535 | 0.005 |
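
For readers less familiar with the metric columns, the sketch below shows how accuracy, precision, recall, and F1 follow from confusion-matrix counts. The function name and the counts are illustrative assumptions, not values from this experiment.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# made-up counts: 30 true positives, 20 false positives, 10 false negatives, 40 true negatives
metrics = classification_metrics(tp=30, fp=20, fn=10, tn=40)
print(metrics)
```

A gap between recall and precision, as in several rows of the table, means the model trades missed renewals against false alarms differently.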


### The top model (NGBClassifier, accuracy 0.71, AUC 0.79) is performing reasonably well, so let's evaluate it on our unseen eval dataset:


In [15]:
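# re-fit the top-ranked model with cross-validation (prints the per-fold scores)
best_model = create_model(model_matrix)

# score the held-out eval rows with the fitted model
unseen_predictions = predict_model(best_model, data=df_eval)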

| Fold | Accuracy | AUC | Recall | Prec. | F1 | Kappa | MCC |
|---|---|---|---|---|---|---|---|
| 0 | 0.6984 | 0.7889 | 0.5455 | 0.5714 | 0.5581 | 0.3294 | 0.3296 |
| 1 | 0.7937 | 0.8721 | 0.7955 | 0.6731 | 0.7292 | 0.5644 | 0.5695 |
| 2 | 0.6429 | 0.751 | 0.7045 | 0.4921 | 0.5794 | 0.2857 | 0.2997 |
| 3 | 0.7143 | 0.7942 | 0.6136 | 0.587 | 0.6 | 0.3779 | 0.3782 |
| 4 | 0.6984 | 0.7597 | 0.6136 | 0.5625 | 0.587 | 0.3502 | 0.351 |
| 5 | 0.7381 | 0.7988 | 0.6279 | 0.6136 | 0.6207 | 0.4207 | 0.4208 |
| 6 | 0.696 | 0.7429 | 0.814 | 0.5385 | 0.6481 | 0.3995 | 0.4261 |
| 7 | 0.728 | 0.8313 | 0.6279 | 0.6 | 0.6136 | 0.4039 | 0.4042 |
| 8 | 0.744 | 0.8327 | 0.7209 | 0.6078 | 0.6596 | 0.4568 | 0.4611 |
| 9 | 0.672 | 0.754 | 0.6744 | 0.5179 | 0.5859 | 0.322 | 0.3297 |
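
The per-fold scores can be summarized with a mean and a spread. The sketch below copies the ten accuracy values from the table above. The mean works out to about 0.7126, in line with the NGBClassifier accuracy in the comparison table. The use of the sample standard deviation here is an assumption about how one might report the spread.

```python
from statistics import mean, stdev

# accuracy per fold, copied from the cross-validation table
fold_accuracy = [0.6984, 0.7937, 0.6429, 0.7143, 0.6984,
                 0.7381, 0.6960, 0.7280, 0.7440, 0.6720]

mean_accuracy = mean(fold_accuracy)
spread = stdev(fold_accuracy)  # sample standard deviation across folds
print(f"mean={mean_accuracy:.4f} sd={spread:.4f}")
```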


In [16]:
plot_model(best_model, plot='feature')


ValueError: Data must be 1-dimensional

In [21]:
plot_model(best_model, plot='confusion_matrix')


Finished loading model, total used 100 iterations


AttributeError: 'Pipeline' object has no attribute 'fig'
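
Since the confusion-matrix plot fails here, one possible fallback is to tabulate predictions against actuals directly with pandas. This is a sketch under assumptions: the toy frame stands in for the `predict_model` output, and `Label` as the name of the prediction column is assumed and may differ between PyCaret versions.

```python
import pandas as pd

# stand-in for predict_model output; the 'Label' column name is an assumption
preds = pd.DataFrame({
    "isNextYear_Buyer": [1, 0, 1, 1, 0, 0],
    "Label": [1, 0, 0, 1, 1, 0],
})

# rows are actual classes, columns are predicted classes
confusion = pd.crosstab(
    preds["isNextYear_Buyer"], preds["Label"],
    rownames=["actual"], colnames=["predicted"],
)
print(confusion)
```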

## Conclusions
Here you can talk about next steps, did the experiment work? If yes, what to do next? If no, why?