# Portland Trail Blazers - Feature Selection
* StellarAlgo Data Science
* Ryan Kazmerik & Nakisa Rad
* October 8, 2021

## Hypothesis
Feature selection and feature engineering are two important parts of building a machine learning model. Our hypothesis is that adding more features to the StellarAlgo retention model could improve its performance.

## Experiment
This section details our experiment, including querying data, data transformations, feature selection, and modelling.

In [1]:
import getpass
import pyodbc
import pandas as pd
import warnings

from pycaret.classification import *
from ngboost import NGBClassifier

warnings.filterwarnings('ignore')

### Let's connect to MSSQL and run a stored proc to get our dataset:

In [2]:
SERVER = '34.206.73.189' 
DATABASE = 'datascience' 
USERNAME = 'dsAdminWrite' 
PASSWORD = getpass.getpass(prompt='Enter your password')
CNXN = pyodbc.connect('DRIVER={ODBC Driver 17 for SQL Server};SERVER='+SERVER+';DATABASE='+DATABASE+';UID='+USERNAME+';PWD='+ PASSWORD)



In [3]:
lkupclientid = 5 # Portland Trail Blazers

storedProc = (
    f"""Exec [stlrTrailBlazers].[ds].[getRetentionScoringModelData] {lkupclientid}"""
)

# pd.read_sql runs the query on the connection directly, so no separate cursor is needed
df = pd.read_sql(storedProc, CNXN)

CNXN.commit()

### Let's drop the features that have lots of null values, as these won't be useful to our model:

In [4]:
df.drop([
    'urbanicity', 
    'isnextyear_buyer', 
    'isnextyear_samepkg_buyer',
    'pkgupgrade_status',
    'auto_renewal'],
    axis=1, 
    inplace=True
)
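
One way to find such columns, rather than listing them by hand, is to measure the share of missing values per column. The sketch below uses a small made-up frame and an assumed 50% cut-off; the `null_threshold` value and the toy data are illustrative and are not taken from the Trail Blazers dataset.

```python
import pandas as pd

# toy data for illustration only; not the real retention dataset
toy = pd.DataFrame({
    "attendancePercent": [0.8, 0.5, 0.9, 0.7],
    "urbanicity": [None, None, None, "urban"],
    "auto_renewal": [None, None, None, None],
})

null_threshold = 0.5  # assumed cut-off: drop columns that are more than 50% null

# fraction of missing values in each column
null_fraction = toy.isna().mean()

# columns whose null fraction exceeds the threshold
sparse_columns = null_fraction[null_fraction > null_threshold].index.tolist()

toy_clean = toy.drop(columns=sparse_columns)
print(sparse_columns)
```

The right threshold depends on the dataset, so the value here should be treated as a starting point rather than a rule.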

### In order to compare two sets of features, we need to create some datasets for training and evaluation:

In [5]:
# choose the features that include the extended stellar retention features
features = [
    "dimCustomerMasterId",
    "attendancePercent",
    "distToVenue",
    "education",
    "fill_out_form",
    "forward_records",
    "gender",
    "isNextYear_Buyer",
    "missed_games_1",
    "missed_games_2",
    "missed_games_over_2",
    "posting_records",
    "productGrouping",
    "recency",
    "renewedBeforeDays",
    "resale_atp",
    "resale_records",
    "source_tenure",
    "totalSpent",
    "tenure",
    "year"
]

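# reduced feature set; this assignment overrides the extended list above,
# so only these columns are used in the rest of the notebook
features = [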
    "dimCustomerMasterId",
    "attendancePercent",
    "distToVenue",
    "isNextYear_Buyer"
]

# select 90% of the data for training (keep the original index labels for now)
df_train = df.sample(frac=0.9, random_state=786)

# use the remaining 10% as the eval dataset; drop by the original index labels
# so that no training rows leak into eval, then reset both indexes
df_eval = df.drop(df_train.index).reset_index(drop=True)
df_train = df_train.reset_index(drop=True)

# choose features for the train dataset
df_train = df_train[features]

# choose features for the eval dataset
df_eval = df_eval[features]

# print out the number of records for training and eval
print('Data for Modeling: ' + str(df_train.shape))
print('Unseen Data For Predictions: ' + str(df_eval.shape), end="\n\n")

Data for Modeling: (9896, 4)
Unseen Data For Predictions: (1100, 4)
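
A quick sanity check for a hold-out split is to confirm that the training and eval rows are disjoint and together cover the whole frame. The sketch below uses a made-up frame with a hypothetical `row_id` column; it samples the training rows, drops them by their original index labels, and only then resets the indexes.

```python
import pandas as pd

# toy frame for illustration; row_id is a hypothetical identifier column
toy = pd.DataFrame({"row_id": range(100), "value": range(100, 200)})

# sample 90% for training, keeping the original index labels
toy_train = toy.sample(frac=0.9, random_state=786)

# the rows that were not sampled form the eval set
toy_eval = toy.drop(toy_train.index)

# reset the indexes only after the split is done
toy_train = toy_train.reset_index(drop=True)
toy_eval = toy_eval.reset_index(drop=True)

# the two sets should share no rows and together cover every row
overlap = set(toy_train["row_id"]) & set(toy_eval["row_id"])
print(len(toy_train), len(toy_eval), len(overlap))
```

If the index were reset before the `drop`, the labels passed to `drop` would no longer identify the sampled rows, and the two sets could overlap.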



### Now we can model the data using a binary classification prediction for the isNextYear_Buyer field to see how likely a customer is to re-purchase:

In [None]:
setup(
    data= df_train, 
    target="isNextYear_Buyer", 
    train_size = 0.80,
    data_split_shuffle=True,
    silent=True,
    ignore_features=["dimCustomerMasterId"],
    numeric_features=[
        "attendancePercent",
        "distToVenue"
    ]
)

In [7]:
# adding an extra classifier ngboost
ngc = NGBClassifier()
ngboost = create_model(ngc)

model_matrix = compare_models(
    fold=10,
    include=["ada","dt","gbc","et","knn","lightgbm","lr","rf",ngboost,"xgboost"]
)

| Index | Model | Accuracy | AUC | Recall | Prec. | F1 | Kappa | MCC | TT (Sec) |
|---|---|---|---|---|---|---|---|---|---|
| 8 | NGBClassifier | 0.6741 | 0.5168 | 0.0396 | 0.7685 | 0.0751 | 0.044 | 0.1211 | 0.966 |
| 0 | Ada Boost Classifier | 0.6729 | 0.5067 | 0.0392 | 0.7309 | 0.0742 | 0.0415 | 0.1123 | 0.152 |
| 2 | Gradient Boosting Classifier | 0.6713 | 0.5021 | 0.0365 | 0.6889 | 0.0693 | 0.0366 | 0.0999 | 0.034 |
| 5 | Light Gradient Boosting Machine | 0.6699 | 0.4954 | 0.0377 | 0.6448 | 0.0709 | 0.0345 | 0.0904 | 0.022 |
| 6 | Logistic Regression | 0.6647 | 0.4902 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.009 |
| 9 | Extreme Gradient Boosting | 0.6644 | 0.488 | 0.0309 | 0.5211 | 0.0578 | 0.0192 | 0.0512 | 0.257 |
| 1 | Decision Tree Classifier | 0.6585 | 0.4774 | 0.0298 | 0.3818 | 0.055 | 0.007 | 0.0159 | 0.009 |
| 3 | Extra Trees Classifier | 0.6585 | 0.4789 | 0.0309 | 0.3884 | 0.0571 | 0.0078 | 0.0179 | 0.109 |
| 7 | Random Forest Classifier | 0.6539 | 0.4815 | 0.0366 | 0.349 | 0.0657 | 0.0023 | 0.0048 | 0.136 |
| 4 | K Neighbors Classifier | 0.5814 | 0.4979 | 0.2399 | 0.3309 | 0.2743 | -0.0065 | -0.0065 | 0.03 |
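
Logistic Regression predicts no positives here (recall 0.0) and still reaches 0.6647 accuracy, which suggests that roughly two thirds of the customers are in the negative class. A model that rarely predicts the positive class can therefore score close to the top on accuracy. The sketch below computes the same metrics from a made-up confusion matrix; the counts are hypothetical, chosen only to resemble the pattern in the table, and `classification_metrics` is an illustrative helper rather than part of PyCaret.

```python
def classification_metrics(tp, fp, fn, tn):
    """Return accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# hypothetical counts: 1000 customers, 335 positives, 665 negatives
metrics = classification_metrics(tp=13, fp=4, fn=322, tn=661)

# accuracy obtained by always predicting the negative class
majority_baseline = (4 + 661) / 1000

for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
print(f"majority baseline: {majority_baseline:.4f}")
```

With counts like these, accuracy sits barely above the majority baseline, so AUC, recall, and Kappa are more informative for comparing the models.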


### The top model (NGBClassifier) has the highest accuracy, but its AUC (0.5168) is close to chance and its recall (0.0396) is low. Let's refit it with cross-validation and score it against our unseen eval dataset:

In [8]:
best_model = create_model(model_matrix)

unseen_predictions = predict_model(best_model, data=df_eval)

| Fold | Accuracy | AUC | Recall | Prec. | F1 | Kappa | MCC |
|---|---|---|---|---|---|---|---|
| 0 | 0.6705 | 0.5166 | 0.0302 | 0.6667 | 0.0578 | 0.0296 | 0.0873 |
| 1 | 0.6793 | 0.5163 | 0.0491 | 0.8667 | 0.0929 | 0.0591 | 0.1567 |
| 2 | 0.6705 | 0.5344 | 0.0451 | 0.6316 | 0.0842 | 0.0413 | 0.0982 |
| 3 | 0.6705 | 0.4916 | 0.0263 | 0.7778 | 0.0509 | 0.0296 | 0.1003 |
| 4 | 0.6793 | 0.5328 | 0.0489 | 0.9286 | 0.0929 | 0.0613 | 0.1683 |
| 5 | 0.6717 | 0.5311 | 0.0338 | 0.75 | 0.0647 | 0.0368 | 0.1088 |
| 6 | 0.6814 | 0.5269 | 0.0604 | 0.8421 | 0.1127 | 0.071 | 0.1685 |
| 7 | 0.6751 | 0.5278 | 0.0415 | 0.7857 | 0.0789 | 0.0468 | 0.1282 |
| 8 | 0.6738 | 0.4732 | 0.0377 | 0.7692 | 0.0719 | 0.0419 | 0.1189 |
| 9 | 0.6688 | 0.517 | 0.0226 | 0.6667 | 0.0438 | 0.0223 | 0.0754 |
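
A short way to summarize the ten folds is the mean and spread of a metric. The sketch below copies the AUC column from the table above and uses only the standard library; the mean should agree with the AUC reported for the NGBClassifier in the model comparison.

```python
from statistics import mean, stdev

# per-fold AUC values copied from the table above
fold_auc = [0.5166, 0.5163, 0.5344, 0.4916, 0.5328,
            0.5311, 0.5269, 0.5278, 0.4732, 0.517]

auc_mean = mean(fold_auc)
auc_std = stdev(fold_auc)  # sample standard deviation across folds

print(f"AUC mean: {auc_mean:.4f}, std: {auc_std:.4f}")
```

Comparing the spread with the distance between the mean and 0.5 gives a rough sense of whether the model is distinguishable from chance on these two features.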


In [9]:
plot_model(best_model, plot='feature')


ValueError: Data must be 1-dimensional

## Results

## Observations
Here you can document some ideas on the results from above

## Conclusions
Here you can talk about next steps, did the experiment work? If yes, what to do next? If no, why?