# Portland Trail Blazers - Feature Selection
* StellarAlgo Data Science
* Ryan Kazmerik & Nakisa Rad
* October 8, 2021

## Hypothesis
Feature selection and feature engineering are two of the most important components of a machine learning model. Our hypothesis is that adding more features to the StellarAlgo retention model could improve its performance.

## Experiment
This section details our experiment, including querying the data, data transformations, feature selection, and modelling.

In [1]:
import getpass
import pyodbc
import pandas as pd
import warnings

from pycaret.classification import *
from ngboost import NGBClassifier

warnings.filterwarnings('ignore')

### Let's connect to MSSQL and run a stored proc to get our dataset:

In [3]:
SERVER = '34.206.73.189' 
DATABASE = 'datascience' 
USERNAME = 'dsAdminWrite' 
PASSWORD = getpass.getpass(prompt='Enter your password')
CNXN = pyodbc.connect('DRIVER={ODBC Driver 17 for SQL Server};SERVER='+SERVER+';DATABASE='+DATABASE+';UID='+USERNAME+';PWD='+ PASSWORD)


In [4]:
lkupclientid = 5 # Portland Trail Blazers

storedProc = (
    f"""Exec [stlrTrailBlazers].[ds].[getRetentionScoringModelData] {lkupclientid}"""
)

# pd.read_sql opens its own cursor, so no explicit cursor is needed
df = pd.read_sql(storedProc, CNXN)

CNXN.commit()

### Let's drop the features that have lots of null values, as these won't be useful to our model:

In [5]:
df.drop([
    'urbanicity', 
    'isnextyear_buyer', 
    'isnextyear_samepkg_buyer',
    'pkgupgrade_status',
    'auto_renewal'],
    axis=1, 
    inplace=True
)
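
The column list above is hard-coded. As a minimal sketch of how such a list could be derived instead, the snippet below drops every column whose share of nulls exceeds a threshold. The helper name `drop_sparse_columns`, the 50% threshold, and the toy data are assumptions for illustration, not part of the original notebook.

```python
import pandas as pd


def drop_sparse_columns(frame, max_null_fraction=0.5):
    """Return a copy of `frame` without columns that are mostly null.

    Hypothetical helper: the 0.5 default threshold is an assumption.
    """
    # isna().mean() gives the fraction of null values per column
    null_fraction = frame.isna().mean()
    sparse_columns = null_fraction[null_fraction > max_null_fraction].index.tolist()
    return frame.drop(columns=sparse_columns), sparse_columns


# toy data: 'urbanicity' is 75% null, 'recency' is 25% null
toy = pd.DataFrame({
    "totalSpent": [100.0, 250.0, 80.0, 40.0],
    "urbanicity": [None, None, None, "urban"],
    "recency": [1.0, None, 3.0, 0.0],
})

cleaned, dropped = drop_sparse_columns(toy, max_null_fraction=0.5)
print(dropped)
```

Only `urbanicity` crosses the threshold in this toy frame, so `recency` is kept despite having one missing value.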

### To compare two sets of features, we need to create some datasets for training and evaluation:

In [6]:
# choose the features for the stellar base retention model
features = [
    "dimCustomerMasterId",
    "attendancePercent",
    "distToVenue",
    "isNextYear_Buyer",
    "missed_games_1",
    "missed_games_2",
    "missed_games_over_2",
    "productGrouping",
    "recency",
    "renewedBeforeDays",
    "source_tenure",
    "totalSpent",
    "year"
]

# select 90% of the data for training (keep the original index for now)
df_train = df.sample(frac=0.9, random_state=786)

# create the eval dataset from the rows that were not sampled for training;
# the drop must use the original index labels, so reset the index afterwards
df_eval = df.drop(df_train.index).reset_index(drop=True)
df_train = df_train.reset_index(drop=True)

# choose features for the train dataset
df_train = df_train[features]

# choose features for the eval dataset
df_eval = df_eval[features]

# print out the number of records for training and eval
print('Data for Modeling: ' + str(df_train.shape))
print('Unseen Data For Predictions: ' + str(df_eval.shape), end="\n\n")

Data for Modeling: (9896, 13)
Unseen Data For Predictions: (1100, 13)
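
One detail of a hold-out split like this is worth checking: `df.drop(df_train.index)` only removes the sampled rows if `df_train` still carries its original index labels at that point. The following is a minimal sketch on a toy frame (the `customer` column and the frame size are hypothetical) that confirms the two sets do not overlap.

```python
import pandas as pd

# toy frame standing in for the real dataset
toy = pd.DataFrame({"customer": range(20), "totalSpent": range(100, 120)})

# sample the training rows but keep the original index labels for now
train = toy.sample(frac=0.9, random_state=786)

# drop by those original labels so the hold-out has only unsampled rows
holdout = toy.drop(train.index)

# reset the index only once both frames exist
train = train.reset_index(drop=True)
holdout = holdout.reset_index(drop=True)

# no customer should appear in both sets
overlap = set(train["customer"]) & set(holdout["customer"])
print(len(train), len(holdout), len(overlap))
```

If the index were reset before the `drop`, the labels would become `0..n-1` and the hold-out would simply be the last rows of the original frame, some of which may also be in the training sample.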



### Now we can model the data as a binary classification problem, predicting the isNextYear_Buyer field to see how likely a customer is to re-purchase:

In [None]:
setup(
    data= df_train, 
    target="isNextYear_Buyer", 
    train_size = 0.80,
    data_split_shuffle=True,
    silent=True,
    ignore_features=["dimCustomerMasterId","productGrouping","year"],
    numeric_features=[
        "attendancePercent",
        "distToVenue",
        "missed_games_1",
        "missed_games_2",
        "missed_games_over_2",
        "recency",
        "renewedBeforeDays",
        "source_tenure",
        "totalSpent"
    ]
)

In [8]:
# adding an extra classifier ngboost
ngc = NGBClassifier()
ngboost = create_model(ngc)

model_matrix = compare_models(
    fold=10,
    include=["ada","dt","gbc","et","knn","lightgbm","lr","rf",ngboost,"xgboost"]
)

| | Model | Accuracy | AUC | Recall | Prec. | F1 | Kappa | MCC | TT (Sec) |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Ada Boost Classifier | 0.8863 | 0.9423 | 0.8908 | 0.7939 | 0.8393 | 0.7518 | 0.7551 | 0.177 |
| 2 | Gradient Boosting Classifier | 0.8859 | 0.9433 | 0.9037 | 0.7864 | 0.8407 | 0.7525 | 0.7573 | 0.129 |
| 5 | Light Gradient Boosting Machine | 0.8835 | 0.9413 | 0.9063 | 0.7803 | 0.8384 | 0.7482 | 0.7536 | 0.04 |
| 8 | NGBClassifier | 0.8796 | 0.9402 | 0.9909 | 0.738 | 0.8459 | 0.7506 | 0.7726 | 3.474 |
| 9 | Extreme Gradient Boosting | 0.879 | 0.9393 | 0.8935 | 0.7773 | 0.831 | 0.7375 | 0.7422 | 0.367 |
| 7 | Random Forest Classifier | 0.8729 | 0.9362 | 0.8612 | 0.7806 | 0.8187 | 0.7213 | 0.7236 | 0.176 |
| 6 | Logistic Regression | 0.8658 | 0.9271 | 0.9317 | 0.7367 | 0.8225 | 0.7171 | 0.7303 | 0.115 |
| 3 | Extra Trees Classifier | 0.8612 | 0.9278 | 0.8286 | 0.772 | 0.799 | 0.6932 | 0.6945 | 0.181 |
| 4 | K Neighbors Classifier | 0.8513 | 0.9128 | 0.854 | 0.7402 | 0.7928 | 0.6778 | 0.6823 | 0.043 |
| 1 | Decision Tree Classifier | 0.843 | 0.8252 | 0.7524 | 0.7709 | 0.7613 | 0.6443 | 0.6447 | 0.249 |
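
When reading the Recall, Prec. and F1 columns together, it helps to remember that F1 is the harmonic mean of precision and recall. The sketch below recomputes F1 from the precision and recall values reported for two of the models. The reported figures are presumably averages over the 10 folds, so the recomputed values should only match the F1 column approximately.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs copied from the comparison table
table_rows = {
    "Ada Boost Classifier": (0.7939, 0.8908),
    "NGBClassifier": (0.7380, 0.9909),
}

# F1 of the averaged metrics; an approximation of the averaged per-fold F1
approx_f1 = {name: f1_score(p, r) for name, (p, r) in table_rows.items()}

for name, value in approx_f1.items():
    print(f"{name}: {value:.3f}")
```

This shows how NGBClassifier reaches the highest F1 in the table despite a lower accuracy: its recall is very high while its precision is among the lowest.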


### The top model is performing well, so let's evaluate it on our unseen eval dataset:

In [10]:
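# re-create the top model returned by compare_models (prints per-fold scores)
best_model = create_model(model_matrix)

# score the held-out eval dataset with the trained model
unseen_predictions = predict_model(best_model, data=df_eval)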

| Fold | Accuracy | AUC | Recall | Prec. | F1 | Kappa | MCC |
|---|---|---|---|---|---|---|---|
| 0 | 0.8927 | 0.9483 | 0.8826 | 0.8118 | 0.8457 | 0.7637 | 0.7652 |
| 1 | 0.8838 | 0.9415 | 0.8523 | 0.8094 | 0.8303 | 0.7421 | 0.7426 |
| 2 | 0.8876 | 0.9446 | 0.9091 | 0.7869 | 0.8436 | 0.7566 | 0.7614 |
| 3 | 0.8864 | 0.9431 | 0.8788 | 0.8 | 0.8375 | 0.7505 | 0.7524 |
| 4 | 0.9003 | 0.9466 | 0.9129 | 0.8114 | 0.8592 | 0.7824 | 0.7856 |
| 5 | 0.8636 | 0.934 | 0.8977 | 0.7453 | 0.8144 | 0.7081 | 0.7158 |
| 6 | 0.8887 | 0.9444 | 0.8977 | 0.7953 | 0.8434 | 0.7576 | 0.761 |
| 7 | 0.8913 | 0.9436 | 0.9163 | 0.7902 | 0.8486 | 0.7645 | 0.7696 |
| 8 | 0.8786 | 0.9387 | 0.8669 | 0.7889 | 0.8261 | 0.7332 | 0.7351 |
| 9 | 0.89 | 0.9386 | 0.8935 | 0.7993 | 0.8438 | 0.7593 | 0.7621 |
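
The ten rows above appear to be per-fold cross-validation scores rather than scores on the eval dataset, and the output as exported has no summary row. As a minimal sketch, the fold values can be summarised with the standard library; the numbers are copied from the Accuracy and AUC columns. The sample standard deviation is used here, which is an assumption and may differ slightly from what PyCaret itself reports.

```python
from statistics import mean, stdev

# per-fold scores copied from the Accuracy and AUC columns
fold_accuracy = [0.8927, 0.8838, 0.8876, 0.8864, 0.9003,
                 0.8636, 0.8887, 0.8913, 0.8786, 0.8900]
fold_auc = [0.9483, 0.9415, 0.9446, 0.9431, 0.9466,
            0.9340, 0.9444, 0.9436, 0.9387, 0.9386]

# the mean summarises the level, the standard deviation the fold-to-fold spread
print("accuracy mean:", round(mean(fold_accuracy), 4))
print("accuracy sd:  ", round(stdev(fold_accuracy), 4))
print("auc mean:     ", round(mean(fold_auc), 4))
print("auc sd:       ", round(stdev(fold_auc), 4))
```

The two means round to 0.8863 and 0.9423, which agree with the Ada Boost Classifier row of the model comparison.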


## Results

## Observations
Here you can document some ideas on the results from above

## Conclusions
Here you can talk about next steps, did the experiment work? If yes, what to do next? If no, why?