# Portland Trail Blazers - Feature Selection
* StellarAlgo Data Science
* Ryan Kazmerik & Nakisa Rad
* October 8, 2021

## Hypothesis
Feature selection and feature engineering are two of the most important components of a machine learning model. Our hypothesis is that adding more features to the StellarAlgo retention model could improve its performance.

## Experiment
This section details our experiment, including data querying, data transformation, feature selection and modelling.

In [49]:
import getpass
import pyodbc
import pandas as pd
import warnings

from pycaret.classification import *
from ngboost import NGBClassifier

warnings.filterwarnings('ignore')

### Let's connect to MSSQL and run a stored proc to get our dataset:

In [50]:
SERVER = '34.206.73.189' 
DATABASE = 'datascience' 
USERNAME = 'dsAdminWrite' 
PASSWORD = getpass.getpass(prompt='Enter your password')
CNXN = pyodbc.connect('DRIVER={ODBC Driver 17 for SQL Server};SERVER='+SERVER+';DATABASE='+DATABASE+';UID='+USERNAME+';PWD='+ PASSWORD)

Enter your password ··················


In [51]:
lkupclientid = 5 # Portland Trail Blazers

# build the call to the stored procedure for this client
storedProc = (
    f"""Exec [stlrTrailBlazers].[ds].[getRetentionScoringModelData] {lkupclientid}"""
)

# pandas runs the query on the connection directly, so no separate cursor or commit is needed
df = pd.read_sql(storedProc, CNXN)

### Let's drop the features that have lots of null values, as these won't be useful to our model:

In [52]:
df.drop([
    'dimCustomerMasterId',
    'source_tenure',
    'urbanicity', 
    'isnextyear_buyer', 
    'isnextyear_samepkg_buyer',
    'pkgupgrade_status',
    'auto_renewal'],
    axis=1, 
    inplace=True
)
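One way to decide which columns count as having "lots of null values" is to measure the share of nulls per column and compare it against a cut-off. The sketch below is a minimal illustration on toy data; the helper name `high_null_columns`, the 0.5 threshold and the toy values are assumptions for demonstration, not the criteria used to build the drop list above.

```python
import pandas as pd

def high_null_columns(frame: pd.DataFrame, threshold: float = 0.5) -> list:
    """Return the columns whose share of null values exceeds `threshold`."""
    null_share = frame.isna().mean()  # fraction of nulls per column
    return null_share[null_share > threshold].index.tolist()

# toy data standing in for the stored procedure output (hypothetical values)
toy = pd.DataFrame({
    "tenure": [1, 2, 3, 4],
    "urbanicity": [None, None, None, "urban"],
    "auto_renewal": [None, None, 1.0, 0.0],
})

print(high_null_columns(toy, threshold=0.5))
```

Running the same helper on `df` before the drop would give a reproducible list of candidate columns to review.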

### Let's also drop the features that only have a single value, as they won't add much differentiation to our model:

In [53]:
# keep only the columns with more than one distinct value (NaN counts as a value)
df = df.loc[:, df.nunique(dropna=False) > 1]
        
df.shape

(11004, 30)

### To compare two sets of features, we need to create datasets for training and evaluation:

In [54]:
# choose the features for the stellar base retention model
features = [
    "attendancePercent",
    "distToVenue",
    "education",
    "fill_out_form",
    "gender",
    "isNextYear_Buyer",
    "missed_games_1",
    "missed_games_2",
    "missed_games_over_2",
    "recency",
    "renewedBeforeDays",
    "resale_atp",
    "resale_records",
    "totalSpent",
    "tenure"
]

# select 80% of the data for training
df_train = df.sample(frac=0.8, random_state=786)

# hold out the remaining 20% for evaluation (drop by the original index, before resetting it)
df_eval = df.drop(df_train.index).reset_index(drop=True)
df_train = df_train.reset_index(drop=True)

# keep only the selected features in the training dataset
df_train = df_train[features]

# keep only the selected features in the eval dataset
df_eval = df_eval[features]

# print out the number of records for training and eval
print('Data for Modeling: ' + str(df_train.shape))
print('Unseen Data For Predictions: ' + str(df_eval.shape), end="\n\n")

Data for Modeling: (8803, 15)
Unseen Data For Predictions: (2201, 15)
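One detail worth checking in a sample-then-drop split is that the sampled frame still carries the original row labels when `drop` is called. If the index is reset first, `drop` removes the first rows of the frame rather than the sampled ones, and the two sets can overlap. The sketch below shows the check on toy data; the helper name `split_train_eval` and the `customer` column are hypothetical.

```python
import pandas as pd

def split_train_eval(frame: pd.DataFrame, frac: float = 0.8, seed: int = 786):
    """Sample `frac` of the rows for training and hold out the rest."""
    train = frame.sample(frac=frac, random_state=seed)
    # drop by the ORIGINAL index labels, before any reset_index call
    holdout = frame.drop(train.index)
    return train, holdout

# toy frame with one unique id per row (hypothetical data)
toy = pd.DataFrame({"customer": range(10)})
train, holdout = split_train_eval(toy)

# no row should appear in both sets, and together they should cover the frame
overlap = set(train["customer"]) & set(holdout["customer"])
covered = set(train["customer"]) | set(holdout["customer"])
print(len(train), len(holdout), len(overlap), covered == set(toy["customer"]))
```

The same overlap check can be run on the real split by comparing a unique row identifier across the training and eval frames before the feature columns are selected.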



### Now we can model the data as a binary classification problem, predicting the isNextYear_Buyer field to see how likely a customer is to re-purchase:

In [None]:
setup(
    data= df_train, 
    target="isNextYear_Buyer", 
    train_size = 0.80,
    data_split_shuffle=True,
    silent=True,
    numeric_features=[
        "attendancePercent",
        "distToVenue",
        "fill_out_form",
        "missed_games_1",
        "missed_games_2",
        "missed_games_over_2",
        "recency",
        "renewedBeforeDays",
        "resale_atp",
        "resale_records",
        "tenure",
        "totalSpent"
    ]
)

In [56]:
# adding an extra classifier ngboost
ngc = NGBClassifier()
ngboost = create_model(ngc)

model_matrix = compare_models(
    fold=10,
    include=["ada","dt","gbc","et","knn","lightgbm","lr","rf",ngboost,"xgboost"]
)

| | Model | Accuracy | AUC | Recall | Prec. | F1 | Kappa | MCC | TT (Sec) |
|---|---|---|---|---|---|---|---|---|---|
| 8 | NGBClassifier | 0.9125 | 0.9591 | 0.9819 | 0.7991 | 0.881 | 0.813 | 0.8239 | 1.433 |
| 2 | Gradient Boosting Classifier | 0.9121 | 0.9595 | 0.9591 | 0.8098 | 0.878 | 0.8101 | 0.8173 | 0.078 |
| 0 | Ada Boost Classifier | 0.9094 | 0.9552 | 0.9474 | 0.8103 | 0.8734 | 0.8035 | 0.8097 | 0.043 |
| 5 | Light Gradient Boosting Machine | 0.9074 | 0.9564 | 0.9306 | 0.8151 | 0.8689 | 0.7978 | 0.8022 | 0.731 |
| 7 | Random Forest Classifier | 0.9064 | 0.9563 | 0.9087 | 0.8253 | 0.8649 | 0.7936 | 0.7958 | 0.147 |
| 9 | Extreme Gradient Boosting | 0.9061 | 0.9554 | 0.9216 | 0.8173 | 0.8662 | 0.7943 | 0.7979 | 0.321 |
| 3 | Extra Trees Classifier | 0.8996 | 0.9491 | 0.8944 | 0.818 | 0.8545 | 0.7781 | 0.78 | 0.128 |
| 6 | Logistic Regression | 0.8894 | 0.9413 | 0.9362 | 0.776 | 0.8483 | 0.7626 | 0.7712 | 0.12 |
| 1 | Decision Tree Classifier | 0.8757 | 0.8596 | 0.8022 | 0.8175 | 0.8096 | 0.7174 | 0.7177 | 0.009 |
| 4 | K Neighbors Classifier | 0.8418 | 0.9086 | 0.8388 | 0.7251 | 0.7776 | 0.6559 | 0.6604 | 0.032 |
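To make the metric columns in the comparison easier to read, the sketch below derives accuracy, precision, recall, F1, Cohen's kappa and MCC from the four cells of a binary confusion matrix, using their standard definitions. The counts are made up for illustration and are not taken from the Trail Blazers data; the function name `binary_metrics` is also an assumption. AUC is left out because it needs predicted scores rather than counts.

```python
from math import sqrt

def binary_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute common binary classification metrics from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp)  # share of predicted buyers who actually bought
    recall = tp / (tp + fn)     # share of actual buyers the model found
    f1 = 2 * precision * recall / (precision + recall)
    # Cohen's kappa: agreement beyond what chance alone would give
    expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / total**2
    kappa = (accuracy - expected) / (1 - expected)
    # Matthews correlation coefficient
    mcc = (tp * tn - fp * fn) / sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "Accuracy": accuracy, "Prec.": precision, "Recall": recall,
        "F1": f1, "Kappa": kappa, "MCC": mcc,
    }

# hypothetical counts: 90 true positives, 20 false positives,
# 10 false negatives, 80 true negatives
metrics = binary_metrics(tp=90, fp=20, fn=10, tn=80)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

A pattern of high recall with lower precision, as in the top rows of the comparison, corresponds to a model that rarely misses a renewing customer but flags some non-renewers as likely buyers.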


### The top model is performing well, so let's evaluate it on our unseen eval dataset:

In [57]:
best_model = create_model(model_matrix)

unseen_predictions = predict_model(best_model, data=df_eval)

| Fold | Accuracy | AUC | Recall | Prec. | F1 | Kappa | MCC |
|---|---|---|---|---|---|---|---|
| 0 | 0.9035 | 0.9553 | 0.9871 | 0.7789 | 0.8707 | 0.7955 | 0.8097 |
| 1 | 0.9092 | 0.9593 | 0.9871 | 0.7904 | 0.8779 | 0.807 | 0.8197 |
| 2 | 0.902 | 0.9519 | 0.9914 | 0.7744 | 0.8696 | 0.7929 | 0.8085 |
| 3 | 0.9162 | 0.9517 | 0.9828 | 0.8057 | 0.8854 | 0.8204 | 0.8305 |
| 4 | 0.9176 | 0.9652 | 0.9871 | 0.8063 | 0.8876 | 0.8236 | 0.8341 |
| 5 | 0.9134 | 0.9588 | 0.9828 | 0.8 | 0.882 | 0.8147 | 0.8254 |
| 6 | 0.9119 | 0.9621 | 0.9655 | 0.8058 | 0.8784 | 0.8103 | 0.8184 |
| 7 | 0.9247 | 0.9585 | 0.9828 | 0.8231 | 0.8959 | 0.8376 | 0.8457 |
| 8 | 0.9077 | 0.9589 | 0.9698 | 0.7951 | 0.8738 | 0.8021 | 0.812 |
| 9 | 0.919 | 0.9693 | 0.9828 | 0.8114 | 0.8889 | 0.8261 | 0.8355 |
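The ten per-fold scores above are easier to compare once they are reduced to a mean and a spread. The sketch below does this for the Accuracy and AUC columns, with the values copied by hand from the fold output. It uses the sample standard deviation from Python's `statistics` module, which is an assumption of this example and may differ slightly from the convention PyCaret uses in its own summary rows.

```python
from statistics import mean, stdev

# per-fold scores copied from the cross-validation output above
accuracy = [0.9035, 0.9092, 0.902, 0.9162, 0.9176,
            0.9134, 0.9119, 0.9247, 0.9077, 0.919]
auc = [0.9553, 0.9593, 0.9519, 0.9517, 0.9652,
       0.9588, 0.9621, 0.9585, 0.9589, 0.9693]

def summarize(scores):
    """Return the mean and sample standard deviation, rounded to 4 decimals."""
    return round(mean(scores), 4), round(stdev(scores), 4)

acc_mean, acc_sd = summarize(accuracy)
auc_mean, auc_sd = summarize(auc)
print(f"Accuracy: mean={acc_mean}, sd={acc_sd}")
print(f"AUC:      mean={auc_mean}, sd={auc_sd}")
```

The mean accuracy (0.9125) and mean AUC (0.9591) agree with the NGBClassifier row of the model comparison, which suggests the fold output belongs to the same top-ranked model.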


In [None]:
plot_model(best_model, plot='feature')

In [None]:
plot_model(best_model, plot='confusion_matrix')

## Results

## Observations
Here you can document some ideas on the results from above

## Conclusions
Here you can talk about next steps, did the experiment work? If yes, what to do next? If no, why?