Using the Scikit-Learn framework, build a set of functioning binary classifiers using provided training data. Use comments to document your major steps (what and why) and any important decisions you make along the way. You can choose any performance metric(s) you like but please include AUC score as an output for the grader.



To access the training data, clone this public GitHub repo and unzip the file called 'testing_data.csv'

https://github.com/MarletteFunding/marlette-ds-challenge



The data columns beginning with 'var' are numeric, and 'cat' are categorical. The target column is labeled 'target'.



Requirements
1. Build (3) Binary Classifiers using Logistic, Catboost, XGBoost algorithms. Pick the best & measure the metrics (accuracy, precision & recall, AUC score) using the provided training data. You may ensemble them if you wish. 

2. Modify the objective & evaluation functions of algorithm of choice to beat the precision of the model in the first requirement

3. Develop code to find a sample weighting scheme that produces better precision than the model in the second requirement

4. Document the key steps in your workflow (what and why)

5. Your code should be replicable so a grader can run your code and achieve the same results



How to submit
Please upload the code for this project to GitHub, and post a link to your repository below. Make sure it is publicly accessible.

# Imports

In [1]:
!pip install catboost



In [2]:
import pandas as pd
import numpy as np

# plot_confusion_matrix is unused here and was removed in scikit-learn 1.2
from sklearn.metrics import precision_recall_fscore_support, roc_auc_score
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from catboost import CatBoostClassifier
from xgboost import XGBClassifier


# set a seed for reproducibility
RANDOM_STATE = 99

# Feature engineering

In [3]:
df = pd.read_csv('training_data.csv')

In [4]:
df.head()

Unnamed: 0,ID,var1,var2,var3,var4,var5,var6,var7,var8,var9,...,var189,var190,var191,var192,cat1,cat2,cat3,cat4,cat5,target
0,44686,86.52893,80.79771,75.25887,74.02016,69.01476,65.61648,63.23896,59.07834,56.80397,...,85.133333,84.45,85.2,85.9,S,H,C,B,C,0
1,44687,68.56225,72.05599,69.52573,68.79211,65.48515,63.00976,61.19186,57.85757,55.94791,...,90.533333,86.55,87.24,87.3,S,I,C,B,C,0
2,44688,77.88821,76.6227,73.11046,72.20956,68.26166,65.34046,63.19467,59.25676,57.01834,...,93.933333,90.2,89.84,88.6,S,I,C,B,C,0
3,44689,81.11949,78.43038,74.59578,73.63714,69.4554,66.35951,64.07976,59.88543,57.50303,...,93.2,88.15,88.48,87.766667,S,I,C,B,C,0
4,44690,62.18698,68.60618,67.86709,67.44987,65.15601,63.13671,61.52867,58.35072,56.4246,...,92.733333,88.15,88.0,88.566667,S,I,C,B,C,0


In [5]:
df.describe(include='all')

Unnamed: 0,ID,var1,var2,var3,var4,var5,var6,var7,var8,var9,...,var189,var190,var191,var192,cat1,cat2,cat3,cat4,cat5,target
count,14193.0,14193.0,14193.0,14193.0,14193.0,14193.0,14193.0,14193.0,14193.0,14193.0,...,14193.0,14193.0,14193.0,14193.0,14193,14193,14193,14193,14193,14193.0
unique,,,,,,,,,,,...,,,,,20,9,3,3,3,
top,,,,,,,,,,,...,,,,,M,H,B,B,B,
freq,,,,,,,,,,,...,,,,,716,1577,7091,12814,11367,
mean,60691.679067,49.770732,49.816368,49.842512,49.847619,49.866617,49.877695,49.883878,49.889778,49.889927,...,64.371634,64.739135,65.117889,65.51435,,,,,,0.009441
std,9537.84035,19.287088,13.258854,11.027864,10.617691,9.063626,8.01557,7.24892,5.79651,4.95848,...,51.892179,51.490296,51.202395,51.035665,,,,,,0.09671
min,44686.0,1.96209,6.80249,10.04946,10.79022,14.12184,16.95182,19.40626,25.20809,29.42134,...,1.666667,2.05,2.16,2.066667,,,,,,0.0
25%,52284.0,35.13129,40.74508,42.3778,42.68846,43.73788,44.45517,44.99521,46.01659,46.57202,...,22.533333,22.95,22.88,23.4,,,,,,0.0
50%,60726.0,49.79851,49.92422,50.05118,50.09362,50.08018,50.10192,50.0633,49.96799,49.86434,...,50.866667,51.05,51.4,51.266667,,,,,,0.0
75%,69174.0,64.21333,59.01531,57.29067,57.03203,55.97237,55.19449,54.66191,53.79012,53.20811,...,92.066667,92.9,93.76,94.433333,,,,,,0.0


The describe output for the 5 categorical variables shows that cat1 has the most categories, with 20. One-hot encoding the 5 variables and dropping one category per variable yields 33 columns (38 categories in total, minus 5) in place of the original 5. Because the data already contains 192 numeric variables, the extra dummy columns will not make the feature matrix overly sparse, so one-hot encoding is sufficient for modeling purposes.
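The column count above can be checked with a few lines of arithmetic. This is a minimal sketch: the cardinalities are copied from the `unique` row of the describe output, and the `toy` frame and its values are hypothetical, used only to show that `pd.get_dummies(..., drop_first=True)` follows the same "k - 1 columns per variable" rule.

```python
import pandas as pd

# Cardinalities taken from the 'unique' row of df.describe(include='all')
cardinalities = {'cat1': 20, 'cat2': 9, 'cat3': 3, 'cat4': 3, 'cat5': 3}

# Dropping one level per variable leaves (k - 1) dummy columns for each
n_dummy_cols = sum(k - 1 for k in cardinalities.values())
print(n_dummy_cols)

# Hypothetical toy frame: two 3-level variables give 2 + 2 dummy columns
toy = pd.DataFrame({'cat3': ['A', 'B', 'C', 'B'],
                    'cat4': ['A', 'B', 'C', 'A']})
toy_encoded = pd.get_dummies(toy, drop_first=True)
print(list(toy_encoded.columns))
```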

In [6]:
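# first scale all the numerical data (columns var1..var192)
# note: the scaler is fit on all rows, i.e. before the train/test split

scaler = StandardScaler()
X = df.iloc[:, 1:193].copy()  # copy so the assignment below does not touch a view of df
X[X.columns] = scaler.fit_transform(X)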

In [7]:
# Encoding the categorical columns

# one hot encoding the 5 categorical columns
# get list of categorical columns
cat_vars = ['cat1', 'cat2', 'cat3', 'cat4', 'cat5']

# dummify each categorical variable, drop the first column,
# and add it to the numerical columns
for i in cat_vars:
  X = X.join(pd.get_dummies(df[i]).iloc[:, 1:], rsuffix=i)


# Getting the target column as its own series
y = df.target

# get a sense of whether or not y is imbalanced
print(sum(y) / len(y))

0.009441273867399421


We can see from the above that the training sample is highly imbalanced, with less than 1% of the samples with a target of 1.
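One consequence of this imbalance is that accuracy alone says very little. As a rough sketch, using only the row count and positive rate printed above, a model that always predicts 0 already reaches about 99% accuracy while finding none of the positives. The array `y_toy` is a synthetic stand-in for `y`, not the real target column.

```python
import numpy as np

n_samples = 14193                    # row count from the describe output
pos_rate = 0.009441273867399421      # printed share of target == 1

# Synthetic stand-in for y with the same size and positive rate
n_pos = round(n_samples * pos_rate)
y_toy = np.zeros(n_samples, dtype=int)
y_toy[:n_pos] = 1

# A trivial classifier that always predicts the majority class
always_zero = np.zeros_like(y_toy)

accuracy = (always_zero == y_toy).mean()
recall = (always_zero[y_toy == 1] == 1).mean()
print(n_pos, accuracy, recall)
```

This is why the later cells report precision, recall, and AUC alongside accuracy.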

In [8]:
# Getting testing and training set

x_train, x_test, y_train, y_test =  train_test_split(X, y,
                                                     test_size=.33,
                                                     random_state = RANDOM_STATE)

In [9]:
train_label_allocation = sum(y_train) / len(y_train)
test_label_allocation = sum(y_test) / len(y_test)

print('Percentage of ones in training set is: ', train_label_allocation)
print('Percentage of ones in testing set is: ', test_label_allocation)

Percentage of ones in training set is:  0.008833736460195604
Percentage of ones in testing set is:  0.01067463706233988


In [10]:
(train_label_allocation - test_label_allocation) / train_label_allocation

-0.2083943312594039

The share of positive labels in the testing set is roughly 20% higher, in relative terms, than in the training set. With so few positives this mismatch could be problematic, but it can be mitigated, for example by stratifying the split on the target.
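One possible mitigation is a stratified split, which `train_test_split` supports through its `stratify` argument. The sketch below uses synthetic data (`X_toy`, `y_toy` are hypothetical stand-ins with a 1% positive rate), so it illustrates the mechanism rather than reproducing the notebook's numbers.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 10,000 rows, 1% positives (an assumption)
rng = np.random.default_rng(99)
X_toy = rng.normal(size=(10_000, 3))
y_toy = np.zeros(10_000, dtype=int)
y_toy[:100] = 1

# stratify=y_toy keeps the class proportions equal in both partitions
x_tr, x_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.33, random_state=99, stratify=y_toy)

print('train positive rate:', y_tr.mean())
print('test positive rate: ', y_te.mean())
```

In the notebook this would amount to adding `stratify=y` to the existing `train_test_split` call; the scores reported in later cells would then change.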

# Building baseline binary classifiers

In [11]:
# Defining a function to easily get scores in a df

score_cols=['Train_acc', 'Train_prec', 'Train_recall', 'Train_fscore', 'Train_auc',
            'Test_acc','Test_prec', 'Test_recall', 'Test_fscore', 'Test_auc',
            'Model', 'Model_version']
all_scores = pd.DataFrame(columns=score_cols)

def get_scores(train_pred, test_pred, y_train, y_test, mod, version):
  scores_train = precision_recall_fscore_support(y_train, train_pred,
                                                 average='binary')
  acc_train = sum(train_pred == y_train) / len(y_train)
  prec_train = scores_train[0]
  recall_train = scores_train[1]
  f_score_train = scores_train[2]
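  # note: AUC here (and for the test set below) is computed from hard 0/1
  # predictions, not from predicted probabilities
  auc_train = roc_auc_score(y_train, train_pred)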

  scores_test = precision_recall_fscore_support(y_test, test_pred,
                                                average='binary')
  acc_test = sum(test_pred == y_test) / len(y_test)
  prec_test = scores_test[0]
  recall_test = scores_test[1]
  f_score_test = scores_test[2]
  auc_test = roc_auc_score(y_test, test_pred)

  scores = [acc_train, prec_train, recall_train, f_score_train, 
            auc_train, acc_test, prec_test, recall_test, f_score_test,
            auc_test, mod, version]

  # build a one-row frame directly; this avoids np.reshape casting the mixed
  # numeric/string list to a single string dtype
  df_score = pd.DataFrame([scores], columns=score_cols)
  return df_score

In [12]:
# Looking at benchmark logistic regression

# using a very low value for C and increasing max_iter so the model can converge
lr_bench = LogisticRegression(random_state=RANDOM_STATE, C=.00001, max_iter=1000)
lr_bench.fit(x_train, y_train)

# getting predictions for LR model
lr_train_pred = lr_bench.predict(x_train)
lr_test_pred = lr_bench.predict(x_test)

In [13]:
# Looking at benchmark CatBoost

# using a default catboost classifier
cat_bench = CatBoostClassifier(random_state=RANDOM_STATE)
cat_bench.fit(x_train, y_train)

# getting predictions for CatBoost model
cat_train_pred = cat_bench.predict(x_train)
cat_test_pred = cat_bench.predict(x_test)

Learning rate set to 0.026952
0:	learn: 0.6216228	total: 183ms	remaining: 3m 2s
1:	learn: 0.5613878	total: 305ms	remaining: 2m 32s
2:	learn: 0.5066779	total: 413ms	remaining: 2m 17s
3:	learn: 0.4601435	total: 538ms	remaining: 2m 14s
4:	learn: 0.4190732	total: 651ms	remaining: 2m 9s
5:	learn: 0.3808890	total: 763ms	remaining: 2m 6s
6:	learn: 0.3466707	total: 879ms	remaining: 2m 4s
7:	learn: 0.3184302	total: 980ms	remaining: 2m 1s
8:	learn: 0.2925518	total: 1.15s	remaining: 2m 6s
9:	learn: 0.2676237	total: 1.31s	remaining: 2m 9s
10:	learn: 0.2466513	total: 1.42s	remaining: 2m 7s
11:	learn: 0.2252152	total: 1.53s	remaining: 2m 6s
12:	learn: 0.2075687	total: 1.64s	remaining: 2m 4s
13:	learn: 0.1928473	total: 1.77s	remaining: 2m 4s
14:	learn: 0.1793594	total: 1.89s	remaining: 2m 4s
15:	learn: 0.1671384	total: 2.01s	remaining: 2m 3s
16:	learn: 0.1565863	total: 2.14s	remaining: 2m 3s
17:	learn: 0.1445894	total: 2.28s	remaining: 2m 4s
18:	learn: 0.1353641	total: 2.42s	remaining: 2m 5s

In [14]:
# Looking at benchmark XGBoost

# using the default xgboost classifier
xgb_bench = XGBClassifier(random_state=RANDOM_STATE)
xgb_bench.fit(x_train, y_train)

# getting predictions for XGBoost model
xgb_train_pred = xgb_bench.predict(x_train)
xgb_test_pred = xgb_bench.predict(x_test)

In [15]:
# DataFrame.append was removed in pandas 2.0, so collect the rows with pd.concat
all_scores = pd.concat([
    all_scores,
    get_scores(lr_train_pred, lr_test_pred, y_train, y_test, 'lr', 'benchmark'),
    get_scores(cat_train_pred, cat_test_pred, y_train, y_test, 'cat', 'benchmark'),
    get_scores(xgb_train_pred, xgb_test_pred, y_train, y_test, 'xgb', 'benchmark'),
])

all_scores.set_index(['Model_version', 'Model'])

  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


Model_version,Model,Train_acc,Train_prec,Train_recall,Train_fscore,Train_auc,Test_acc,Test_prec,Test_recall,Test_fscore,Test_auc
benchmark,lr,0.9911662635398044,0.0,0.0,0.0,0.5,0.9893253629376602,0.0,0.0,0.0,0.5
benchmark,cat,0.9991586917656956,1.0,0.9047619047619048,0.95,0.9523809523809524,0.988471391972673,0.0,0.0,0.0,0.4995684074233923
benchmark,xgb,0.9919024082448208,1.0,0.0833333333333333,0.1538461538461538,0.5416666666666666,0.9891118701964132,0.0,0.0,0.0,0.4998921018558481


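Every benchmark model performs poorly on the test set: all three have zero test precision and recall and a test AUC of about 0.5, so their roughly 99% accuracy simply reflects the class imbalance. CatBoost scores well on the training set (recall of about 0.90), which suggests it is overfitting. Next we can try hyperparameter tuning on the baseline classifiers to see if they can be improved.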
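Part of the reason the AUC values sit at 0.5 is that `get_scores` passes hard 0/1 predictions to `roc_auc_score`. When a model predicts 0 for every row, those labels carry no ranking information, even if the underlying probabilities rank the positives well. The sketch below illustrates this with hand-made numbers (`y_true` and `scores` are hypothetical, not taken from the models above).

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical labels and predicted probabilities of class 1
y_true = np.array([0, 0, 0, 0, 1, 0, 0, 1])
scores = np.array([0.01, 0.02, 0.03, 0.05, 0.20, 0.04, 0.10, 0.30])

# Thresholding at 0.5 turns every prediction into 0
hard_labels = (scores >= 0.5).astype(int)

auc_from_labels = roc_auc_score(y_true, hard_labels)  # constant input
auc_from_scores = roc_auc_score(y_true, scores)       # uses the ranking

print(auc_from_labels, auc_from_scores)
```

For the fitted models, the probability input would come from something like `lr_bench.predict_proba(x_test)[:, 1]`; whether that changes the reported AUC materially can only be determined by rerunning the notebook.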

# Hyperparameter tuning on baseline classifiers

In [16]:
# first tune lr hyperparameters

lr_grid_values = {'C': [0.00001, 0.000001, .0000001]}
lr_tuned = GridSearchCV(LogisticRegression(max_iter=1000, random_state=RANDOM_STATE),
                        param_grid=lr_grid_values)
lr_tuned.fit(x_train, y_train)
lr_tuned.best_params_

# This is the same regularizer that we used in the baseline model, so the
# results will end up the same, so we will skip this prediction

{'C': 1e-05}
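By default, `GridSearchCV` ranks classifier candidates by accuracy, which barely distinguishes models on data this imbalanced. A possible alternative, sketched here on synthetic data, is to pass `scoring='roc_auc'` and an explicit stratified splitter. The dataset (`X_toy`, `y_toy`), the 95/5 class weights, and the grid of `C` values are all assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic imbalanced data standing in for x_train / y_train
X_toy, y_toy = make_classification(n_samples=2000, n_features=10,
                                   weights=[0.95, 0.05], random_state=99)

# Shuffled stratified folds keep the positive rate similar in each fold
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=99)

grid = {'C': [0.001, 0.01, 0.1, 1.0]}  # illustrative values only
search = GridSearchCV(LogisticRegression(max_iter=1000, random_state=99),
                      param_grid=grid,
                      scoring='roc_auc',  # rank candidates by AUC, not accuracy
                      cv=cv)
search.fit(X_toy, y_toy)

print(search.best_params_, search.best_score_)
```

With `scoring='roc_auc'`, the search scores each candidate on its predicted scores rather than its thresholded labels, so candidates that predict all zeros are no longer tied at the majority-class accuracy.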

In [20]:
# tuning catboost hyperparameters

cat_grid_values = {'depth': [4,7,10],
                   'learning_rate' : [0.01,0.03],
                   'iterations'    : [20,40,60,80,100]
                 }
cat_tuned = GridSearchCV(CatBoostClassifier(random_state=RANDOM_STATE),
                         param_grid=cat_grid_values,
                         cv=3 # not using default cv as the catboost training time is already high
                         )
cat_tuned.fit(x_train, y_train)
cat_tuned.best_params_

# getting the predictions
cat_tuned_train_pred = cat_tuned.predict(x_train)
cat_tuned_test_pred = cat_tuned.predict(x_test)

Streaming output truncated to the last 5000 lines.
0:	learn: 0.6636060	total: 47.5ms	remaining: 2.8s
1:	learn: 0.6403104	total: 86.1ms	remaining: 2.5s
2:	learn: 0.6186845	total: 136ms	remaining: 2.58s
3:	learn: 0.5947876	total: 171ms	remaining: 2.39s
4:	learn: 0.5750209	total: 233ms	remaining: 2.56s
5:	learn: 0.5554274	total: 299ms	remaining: 2.69s
6:	learn: 0.5365667	total: 363ms	remaining: 2.75s
7:	learn: 0.5187418	total: 411ms	remaining: 2.67s
8:	learn: 0.5019048	total: 465ms	remaining: 2.63s
9:	learn: 0.4839404	total: 540ms	remaining: 2.7s
10:	learn: 0.4676159	total: 606ms	remaining: 2.7s
11:	learn: 0.4492475	total: 662ms	remaining: 2.65s
12:	learn: 0.4326671	total: 727ms	remaining: 2.63s
13:	learn: 0.4188498	total: 833ms	remaining: 2.74s
14:	learn: 0.4037393	total: 906ms	remaining: 2.72s
15:	learn: 0.3905791	total: 962ms	remaining: 2.64s
16:	learn: 0.3761005	total: 993ms	remaining: 2.51s
17:	learn: 0.3644532	total: 1.04s	remaining: 2.43s

In [None]:
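# tuning xgboost hyperparameters

# XGBClassifier uses max_depth / n_estimators
# (depth / iterations are the CatBoost parameter names)
xgb_grid_values = {'max_depth': [4,7,10],
                   'learning_rate' : [0.01,0.03],
                   'n_estimators'  : [20,40,60,80,100]
                 }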
xgb_tuned = GridSearchCV(XGBClassifier(random_state=RANDOM_STATE),
                         param_grid=xgb_grid_values
                         )
xgb_tuned.fit(x_train, y_train)
xgb_tuned.best_params_

# getting the predictions
xgb_tuned_train_pred = xgb_tuned.predict(x_train)
xgb_tuned_test_pred = xgb_tuned.predict(x_test)

In [None]:
# add the tuned models; logistic regression is skipped because tuning
# selected the same C as the benchmark model
all_scores = pd.concat([
    all_scores,
    get_scores(cat_tuned_train_pred, cat_tuned_test_pred, y_train, y_test, 'cat', 'tuned'),
    get_scores(xgb_tuned_train_pred, xgb_tuned_test_pred, y_train, y_test, 'xgb', 'tuned'),
])

all_scores.set_index(['Model_version', 'Model'])