# Portland Trail Blazers - Feature Selection
* StellarAlgo Data Science
* Ryan Kazmerik & Nakisa Rad
* October 8, 2021

## Hypothesis
Feature selection and feature engineering are two of the most important components of a machine learning model. Our hypothesis is that adding more features to the StellarAlgo retention model will improve its performance.

## Experiment
This section details our experiment, including querying the data, data transformations, feature selection, and modelling.

In [101]:
import getpass
import pyodbc
import pandas as pd
import warnings

from pycaret.classification import *
from ngboost import NGBClassifier

warnings.filterwarnings('ignore')

### Let's connect to MSSQL and run a stored proc to get our dataset:

In [103]:
SERVER = '34.206.73.189' 
DATABASE = 'datascience' 
USERNAME = 'dsAdminWrite' 
PASSWORD = getpass.getpass(prompt='Enter your password')
CNXN = pyodbc.connect('DRIVER={ODBC Driver 17 for SQL Server};SERVER='+SERVER+';DATABASE='+DATABASE+';UID='+USERNAME+';PWD='+ PASSWORD)

Enter your password ··················


In [104]:
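lkupclientid = 5 # Portland Trail Blazers

# the stored proc returns the retention scoring dataset for one client
storedProc = (
    f"""Exec [stlrTrailBlazers].[ds].[getRetentionScoringModelData] {lkupclientid}"""
)

# pandas opens its own cursor on the connection, so a separate one is not needed
df = pd.read_sql(storedProc, CNXN)

CNXN.commit()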
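Interpolating the client id into the SQL text with an f-string works for a hard-coded value, but passing it as a query parameter is generally safer. The following is a minimal sketch, assuming the stored procedure takes the client id as its single positional argument; the helper name `build_exec_statement` is hypothetical and not part of the original notebook.

```python
def build_exec_statement(procedure: str, n_params: int) -> str:
    """Build an EXEC statement with one '?' placeholder per parameter."""
    placeholders = ", ".join("?" for _ in range(n_params))
    return f"EXEC {procedure} {placeholders}".strip()


# Hypothetical usage with the notebook's stored procedure
statement = build_exec_statement(
    "[stlrTrailBlazers].[ds].[getRetentionScoringModelData]", n_params=1
)
print(statement)

# With an open pyodbc connection, the value is then passed separately:
# df = pd.read_sql(statement, CNXN, params=[lkupclientid])
```

The `?` placeholder follows pyodbc's parameter style, so the driver handles quoting of the value rather than the query string.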

### Let's drop the features that have lots of null values, as these won't be useful to our model:

In [105]:
df.drop([
    'urbanicity', 
    'isnextyear_buyer', 
    'isnextyear_samepkg_buyer',
    'pkgupgrade_status',
    'auto_renewal'],
    axis=1, 
    inplace=True
)
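The columns above were chosen by hand. As a rough sketch of how sparse columns could be found programmatically, the snippet below measures the share of missing values per column on a hypothetical toy frame; the toy values and the 0.5 threshold are assumptions for illustration, not figures from the real dataset.

```python
import pandas as pd

# Hypothetical toy data; values are illustrative only
toy = pd.DataFrame({
    "attendancePercent": [0.9, 0.5, 0.7, 0.8],
    "urbanicity": [None, None, None, "urban"],
    "auto_renewal": [None, 1.0, None, None],
})

# Fraction of missing values in each column
null_fraction = toy.isna().mean()

# Columns where more than half of the values are missing (threshold is an assumption)
THRESHOLD = 0.5
sparse_columns = null_fraction[null_fraction > THRESHOLD].index.tolist()
print(sparse_columns)

# Drop the sparse columns without modifying the original frame
toy_clean = toy.drop(columns=sparse_columns)
```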

### In order to compare two sets of features, we need to create some datasets for training and evaluation:

In [121]:
# choose the features that include the extended stellar retention features
features = [
    "dimCustomerMasterId",
    "attendancePercent",
    "distToVenue",
    "education",
    "fill_out_form",
    "forward_records",
    "gender",
    "isNextYear_Buyer",
    "missed_games_1",
    "missed_games_2",
    "missed_games_over_2",
    "posting_records",
    "productGrouping",
    "recency",
    "renewedBeforeDays",
    "resale_atp",
    "resale_records",
    "source_tenure",
    "totalSpent",
    "tenure",
    "year"
]

# select 90% of the data for training
df_train = df.sample(frac=0.9, random_state=786)

# create the eval dataset from the remaining 10% of rows
# (drop by the original index labels before resetting any index,
# otherwise the wrong rows are removed and eval overlaps with train)
df_eval = df.drop(df_train.index).reset_index(drop=True)
df_train = df_train.reset_index(drop=True)

# choose features for the train dataset
df_train = df_train[features]

# choose features for the eval dataset
df_eval = df_eval[features]

# print out the number of records for training and eval
print('Data for Modeling: ' + str(df_train.shape))
print('Unseen Data For Predictions: ' + str(df_eval.shape), end="\n\n")

Data for Modeling: (9896, 21)
Unseen Data For Predictions: (1100, 21)
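A holdout set is only useful if none of its rows were also used for training. Below is a minimal sketch of checking that on a hypothetical toy frame (the `toy` data is illustrative, not taken from the dataset). It drops the sampled rows by their original index labels before any index is reset, then confirms the two sets share no customer ids.

```python
import pandas as pd

# Hypothetical toy frame standing in for the customer dataset
toy = pd.DataFrame({"dimCustomerMasterId": range(100, 120), "tenure": range(20)})

# Sample 90% for training, keeping the original index labels for now
toy_train = toy.sample(frac=0.9, random_state=786)

# Drop the sampled rows by label, and only then reset both indexes
toy_eval = toy.drop(toy_train.index).reset_index(drop=True)
toy_train = toy_train.reset_index(drop=True)

# No customer should appear in both sets
overlap = set(toy_train["dimCustomerMasterId"]) & set(toy_eval["dimCustomerMasterId"])
print(len(toy_train), len(toy_eval), len(overlap))
```

Resetting the training index before calling `drop` would make its labels `0..n-1`, so `drop` would remove the first `n` rows of the source frame rather than the sampled ones.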



### Now we can model the data as a binary classification problem, predicting the isNextYear_Buyer field to see how likely a customer is to re-purchase:

In [None]:
setup(
    data= df_train, 
    target="isNextYear_Buyer", 
    train_size = 0.80,
    data_split_shuffle=True,
    silent=True,
    ignore_features=["dimCustomerMasterId","productGrouping","year"],
    categorical_features=[
        "education","gender"
    ],
    numeric_features=[
        "attendancePercent",
        "distToVenue",
        "fill_out_form",
        "forward_records",
        "missed_games_1",
        "missed_games_2",
        "missed_games_over_2",
        "posting_records",
        "recency",
        "renewedBeforeDays",
        "resale_atp",
        "resale_records",
        "source_tenure",
        "totalSpent"
    ]
)

In [108]:
# adding an extra classifier ngboost
ngc = NGBClassifier()
ngboost = create_model(ngc)

model_matrix = compare_models(
    fold=10,
    include=["ada","dt","gbc","et","knn","lightgbm","lr","rf",ngboost,"xgboost"]
)

| | Model | Accuracy | AUC | Recall | Prec. | F1 | Kappa | MCC | TT (Sec) |
|---|---|---|---|---|---|---|---|---|---|
| 8 | NGBClassifier | 0.9162 | 0.9622 | 0.9832 | 0.8067 | 0.8862 | 0.821 | 0.8311 | 2.872 |
| 0 | Ada Boost Classifier | 0.9157 | 0.96 | 0.9611 | 0.8171 | 0.8832 | 0.818 | 0.8247 | 0.252 |
| 2 | Gradient Boosting Classifier | 0.9157 | 0.9617 | 0.9604 | 0.8176 | 0.8831 | 0.8179 | 0.8245 | 0.117 |
| 5 | Light Gradient Boosting Machine | 0.9155 | 0.9605 | 0.9371 | 0.8302 | 0.8803 | 0.8153 | 0.8191 | 0.047 |
| 9 | Extreme Gradient Boosting | 0.9119 | 0.9589 | 0.9287 | 0.8271 | 0.8748 | 0.8073 | 0.8107 | 0.644 |
| 7 | Random Forest Classifier | 0.9055 | 0.9569 | 0.8982 | 0.8307 | 0.863 | 0.7911 | 0.7927 | 0.171 |
| 3 | Extra Trees Classifier | 0.9007 | 0.9516 | 0.8819 | 0.8295 | 0.8548 | 0.7795 | 0.7804 | 0.157 |
| 6 | Logistic Regression | 0.8804 | 0.9329 | 0.9367 | 0.7617 | 0.8393 | 0.7459 | 0.7571 | 0.306 |
| 1 | Decision Tree Classifier | 0.8733 | 0.8567 | 0.7984 | 0.8158 | 0.8068 | 0.7125 | 0.7129 | 0.011 |
| 4 | K Neighbors Classifier | 0.8482 | 0.9103 | 0.8441 | 0.7369 | 0.7867 | 0.6697 | 0.6736 | 0.04 |
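As a quick sanity check on the comparison metrics, F1 is the harmonic mean of precision and recall. The sketch below recomputes it for the NGBClassifier row; the helper name `f1_score_from` is hypothetical. Because the reported values are averages across folds, the recomputed figure should be close to the reported F1 but need not match it exactly.

```python
def f1_score_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Precision and recall taken from the NGBClassifier row of the comparison output
ngb_f1 = f1_score_from(precision=0.8067, recall=0.9832)
print(round(ngb_f1, 4))
```

The high recall and lower precision in that row suggest the model rarely misses a renewing customer but flags some non-renewers as likely buyers.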


### The top model is performing well, so let's cross-validate it and then score our unseen eval dataset:

In [118]:
best_model = create_model(model_matrix)

unseen_predictions = predict_model(best_model, data=df_eval)

| Fold | Accuracy | AUC | Recall | Prec. | F1 | Kappa | MCC |
|---|---|---|---|---|---|---|---|
| 0 | 0.9116 | 0.9589 | 0.9924 | 0.7933 | 0.8818 | 0.8126 | 0.8256 |
| 1 | 0.9192 | 0.9609 | 0.9886 | 0.81 | 0.8904 | 0.8274 | 0.8377 |
| 2 | 0.9255 | 0.9719 | 0.981 | 0.8269 | 0.8974 | 0.8396 | 0.8471 |
| 3 | 0.9104 | 0.9646 | 0.9658 | 0.8038 | 0.8774 | 0.8077 | 0.8161 |
| 4 | 0.9167 | 0.9593 | 0.9771 | 0.8101 | 0.8858 | 0.8211 | 0.83 |
| 5 | 0.9192 | 0.9685 | 0.9809 | 0.8133 | 0.8893 | 0.8265 | 0.8355 |
| 6 | 0.909 | 0.959 | 0.9847 | 0.7914 | 0.8776 | 0.8065 | 0.8187 |
| 7 | 0.9064 | 0.9541 | 0.9809 | 0.7883 | 0.8741 | 0.8011 | 0.8132 |
| 8 | 0.9204 | 0.9668 | 0.9924 | 0.81 | 0.8919 | 0.8299 | 0.8406 |
| 9 | 0.9241 | 0.9582 | 0.9885 | 0.8196 | 0.8962 | 0.8373 | 0.8464 |
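The per-fold output above has no summary rows. A minimal sketch of summarising it, using the ten accuracy values copied from the fold output, is shown below. It uses the sample standard deviation, which is an assumption about how spread should be reported here.

```python
from statistics import mean, stdev

# Accuracy per fold, copied from the cross-validation output
fold_accuracy = [0.9116, 0.9192, 0.9255, 0.9104, 0.9167,
                 0.9192, 0.9090, 0.9064, 0.9204, 0.9241]

mean_accuracy = mean(fold_accuracy)
# Sample standard deviation across the ten folds
sd_accuracy = stdev(fold_accuracy)

print(f"mean={mean_accuracy:.4f} sd={sd_accuracy:.4f}")
```

The mean is close to the accuracy reported for NGBClassifier in the model comparison, and the small spread suggests performance is stable across folds.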


## Results

## Observations
Here you can document some ideas on the results from above

## Conclusions
Here you can talk about next steps, did the experiment work? If yes, what to do next? If no, why?