In this project we will build a classification model and design a strategy using the model.

General steps in building a ML model (see chapter 10):

1. Model Design
2. Data Collection
3. Data Cleaning
4. Test-Train Split
5. Data Processing
6. Feature Reduction
7. Model Training

For this dataset, steps 1 to 3 are already done. We will discuss those steps later, in other projects.


In [2]:
import pandas as pd

In [3]:
from google.colab import drive
drive.mount("/content/drive")

Mounted at /content/drive


In [4]:
data = pd.read_csv("drive/My Drive/Classification_Data.csv")

1. Data Exploration

In [5]:
data.shape

(27354, 28)

In [6]:
data.columns

# Y is the target variable. The V columns are principal components (PCA) based on some original features.

Index(['V1', 'V2', 'V3', 'V4', 'V5', 'V6', 'V7', 'V8', 'V9', 'V10', 'V11',
       'V12', 'V13', 'V14', 'V15', 'V16', 'V17', 'V18', 'V19', 'V20', 'V21',
       'V22', 'V23', 'V24', 'V25', 'V26', 'V2.1', 'Y'],
      dtype='object')

In [7]:
data.tail(5)

Unnamed: 0,V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,...,V19,V20,V21,V22,V23,V24,V25,V26,V2.1,Y
27349,103000.0,0.0,0.0,,,15.0,140044.0,54.1,34.0,25124.18,...,,0.0,686060.0,,,,,,,0
27350,75000.0,0.0,0.0,,,12.0,6566.0,17.6,26.0,10437.99,...,,433.0,340761.0,,,,,,,1
27351,45000.0,0.0,1.0,,67.0,11.0,6204.0,22.5,15.0,14819.54,...,,0.0,18227.0,,,,,,,0
27352,173000.0,0.0,1.0,,,10.0,36754.0,68.3,19.0,43359.98,...,,403.0,58540.0,,,,,,,0
27353,32000.0,0.0,0.0,,,13.0,10514.0,63.3,23.0,11891.89,...,,0.0,170250.0,,,,,,,0


In [8]:
data.dtypes

# all features are numerical, so no need for one-hot encoding

V1      float64
V2      float64
V3      float64
V4      float64
V5      float64
V6      float64
V7      float64
V8      float64
V9      float64
V10     float64
V11     float64
V12     float64
V13     float64
V14     float64
V15     float64
V16     float64
V17     float64
V18     float64
V19     float64
V20     float64
V21     float64
V22     float64
V23     float64
V24     float64
V25     float64
V26     float64
V2.1    float64
Y         int64
dtype: object

In [9]:
data.describe().transpose()

# class discussion - what information you get from

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
V1,27354.0,82915.987355,69495.442428,8160.0,52000.0,71000.0,98000.0,6000000.0
V2,27354.0,0.349821,0.905096,0.0,0.0,0.0,0.0,15.0
V3,27354.0,0.475104,0.77737,0.0,0.0,0.0,1.0,5.0
V4,13744.0,33.006476,21.798169,0.0,14.0,30.0,48.0,121.0
V5,3648.0,66.332785,25.302058,0.0,51.0,67.0,82.0,119.0
V6,27354.0,12.601082,5.790563,1.0,9.0,12.0,16.0,55.0
V7,27354.0,22602.580719,27405.841487,0.0,9735.0,16516.0,27742.5,1630818.0
V8,27348.0,59.199759,22.986396,0.0,42.7,60.45,77.2,130.5
V9,27354.0,25.984938,11.709076,4.0,18.0,24.0,33.0,127.0
V10,27354.0,19547.685243,7466.802783,1376.49,13583.04,18281.81,23970.45,56694.02


In [None]:
# in-class assignment: how do you show the number of missing values, by column, in a pandas DataFrame?
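
One possible answer, sketched on a small made-up frame (the values below are for illustration only, not the course data): `isna()` marks the missing cells, and `sum()` or `mean()` then aggregates them by column.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for `data`; the values are made up for illustration.
toy = pd.DataFrame({
    "V1": [1.0, 2.0, 3.0, 4.0],
    "V4": [np.nan, 14.0, np.nan, 30.0],
    "V5": [np.nan, np.nan, np.nan, 67.0],
})

# Number of missing values in each column.
missing_count = toy.isna().sum()

# Share of missing values in each column, as a fraction of rows.
missing_share = toy.isna().mean()

print(missing_count)
print(missing_share)
```

On the real data the same two lines would be applied to `data` instead of `toy`.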

2. Test-Train Split: We define two test samples. Each test sample contains 15% of observations.

Samples are chosen randomly. A better approach is to split based on a feature like time. Why is a random split not a good idea?
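
A minimal sketch of what a time-based split could look like. This dataset has no date column, so `obs_date` and the toy values below are hypothetical. The idea is to train on the oldest observations and test on the most recent ones, which is closer to how the model will be used after deployment.

```python
import numpy as np
import pandas as pd

# Toy data with a hypothetical date column (`obs_date` is not in the course data).
rng = np.random.default_rng(42)
toy = pd.DataFrame({
    "obs_date": pd.date_range("2020-01-01", periods=100, freq="D"),
    "V1": rng.normal(size=100),
    "Y": rng.integers(0, 2, size=100),
})

# Sort by time, then cut at fixed positions: oldest 70% for training,
# next 15% for test 1, most recent 15% for test 2.
toy = toy.sort_values("obs_date").reset_index(drop=True)
n = len(toy)
train_end = n * 70 // 100
test_1_end = n * 85 // 100

train = toy.iloc[:train_end]
test_1 = toy.iloc[train_end:test_1_end]
test_2 = toy.iloc[test_1_end:]

print(train.shape, test_1.shape, test_2.shape)
```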

In [10]:
Y = data.Y
X = data.drop(["Y"], axis = 1)

from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=42)
X_test_1, X_test_2, Y_test_1, Y_test_2 = train_test_split(X_test, Y_test, test_size=0.5, random_state=42)

In [11]:
# check

print (X_train.shape)
print (Y_train.shape)
print (X_test_1.shape)
print (Y_test_1.shape)
print (X_test_2.shape)
print (Y_test_2.shape)

(19147, 27)
(19147,)
(4103, 27)
(4103,)
(4104, 27)
(4104,)


3. Data Processing: Main steps in data processing are:

*   One-Hot Encoding
*   Outlier Treatment
*   Feature Scaling
*   Missing Value Imputation
*   Solve for Collinearity

All the fields are numerical, so no one-hot encoding is needed. The XGBoost package we will use can handle missing values automatically. Outlier treatment and feature scaling are also often unnecessary in a tree-based model (why?). We do not address collinearity here.

In summary, no need for data processing.
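
A hint for the "why?" above, as a small sketch. To keep it short it swaps in a single scikit-learn `DecisionTreeClassifier` for XGBoost, and the data is made up. A tree only asks whether a feature is above or below a cut-off, so rescaling a feature moves the cut-offs but should leave the resulting groups, and therefore the predictions, unchanged.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Made-up data: two integer-valued features and a simple rule for the label.
rng = np.random.default_rng(0)
X = rng.integers(0, 100, size=(300, 2)).astype(float)
y = (X[:, 0] + X[:, 1] > 100).astype(int)

# Same data, but the first feature is multiplied by 1000 (a change of units).
X_scaled = X.copy()
X_scaled[:, 0] = X_scaled[:, 0] * 1000.0

tree_raw = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
tree_scaled = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_scaled, y)

# Compare predictions of the two trees, each on its own version of the data.
same_predictions = np.array_equal(tree_raw.predict(X), tree_scaled.predict(X_scaled))
print(same_predictions)
```

The same reasoning applies to outliers: an extreme value still falls on one side of a cut-off, so it does not pull the split the way it would pull a regression coefficient.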

4. Feature Reduction: We are ready to do grid search, but before that we will reduce the number of features by removing those that have no explanatory power. This will significantly speed up the grid search and make the process more efficient.

There are several methods for feature reduction. Here we will build a simple XGB model and will keep only features with feature importance higher than 1% (a subjective threshold).

Note that in this project we have only 27 features, so there is really no need for feature reduction; we are just showing the process and the concept.

In [12]:
from xgboost import XGBClassifier


# for this step, we don't tune the parameters of XGB, and just use the defaults
model_for_feature_reduction = XGBClassifier()
model_for_feature_reduction.fit(X_train, Y_train)

In [13]:
Feature_Importance = pd.DataFrame(columns = ["Feature", "Feature_Importance"])
Feature_Importance.Feature = X_train.columns
Feature_Importance.Feature_Importance = model_for_feature_reduction.feature_importances_
Feature_Importance.sort_values(by=["Feature_Importance"], inplace=True, ascending=False)
Feature_Importance

Unnamed: 0,Feature,Feature_Importance
13,V14,0.174039
15,V16,0.072002
10,V11,0.066542
9,V10,0.062199
11,V12,0.045655
12,V13,0.040358
2,V3,0.032835
26,V2.1,0.032244
21,V22,0.03177
19,V20,0.03119


In [14]:
features_to_drop = Feature_Importance[Feature_Importance.Feature_Importance < 0.01]["Feature"]
features_to_drop

14    V15
Name: Feature, dtype: object

In [15]:
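# drop the low-importance features from all three samples
# (reassigning instead of inplace=True avoids pandas copy warnings on split data)
X_train = X_train.drop(columns=features_to_drop)
X_test_1 = X_test_1.drop(columns=features_to_drop)
X_test_2 = X_test_2.drop(columns=features_to_drop)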

In [16]:
# check
X_test_1.columns

Index(['V1', 'V2', 'V3', 'V4', 'V5', 'V6', 'V7', 'V8', 'V9', 'V10', 'V11',
       'V12', 'V13', 'V14', 'V16', 'V17', 'V18', 'V19', 'V20', 'V21', 'V22',
       'V23', 'V24', 'V25', 'V26', 'V2.1'],
      dtype='object')

5. Grid Search (Hyper-parameter tuning): Now we are ready to train the XGB model and do grid search.

In [17]:
from sklearn.metrics import roc_auc_score

In [18]:
Grid_Search_Results = pd.DataFrame(columns = ["Model Number", "Number Trees", "Learning Rate", "Tree Depth",
                                              "AUC Train", "AUC Test 1", "AUC Test 2"])
Counter = 0
for num_trees in [10, 20, 50, 100]:
    for learning_rate in [0.01, 0.1, 0.5]:
        for depth in [2, 3, 4]:
          xgb_instance = XGBClassifier(n_estimators=num_trees, learning_rate = learning_rate, max_depth = depth)
          model = xgb_instance.fit(X_train, Y_train)

          Grid_Search_Results.loc[Counter,"Model Number"] = Counter
          Grid_Search_Results.loc[Counter,"Number Trees"] = num_trees
          Grid_Search_Results.loc[Counter,"Learning Rate"] = learning_rate
          Grid_Search_Results.loc[Counter,"Tree Depth"] = depth
          Grid_Search_Results.loc[Counter,"AUC Train"] = roc_auc_score(Y_train, model.predict_proba(X_train)[:,1])
          Grid_Search_Results.loc[Counter,"AUC Test 1"] = roc_auc_score(Y_test_1, model.predict_proba(X_test_1)[:,1])
          Grid_Search_Results.loc[Counter,"AUC Test 2"] = roc_auc_score(Y_test_2, model.predict_proba(X_test_2)[:,1])

          Counter = Counter + 1

In [19]:
# Analyze the results in Excel

Grid_Search_Results.to_csv("drive/My Drive/results.csv")
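
The same comparison can also be done in pandas. The sketch below uses a made-up table with the same column names as `Grid_Search_Results` (the AUC numbers are invented for illustration). The ranking rule, sorting by the lower of the two test AUCs and checking the gap to the train AUC, is one reasonable choice and an assumption here, not the only way to pick a model.

```python
import pandas as pd

# Made-up results with the same columns as Grid_Search_Results (not real output).
toy_results = pd.DataFrame({
    "Model Number": [0, 1, 2, 3],
    "Number Trees": [10, 20, 50, 100],
    "Learning Rate": [0.1, 0.5, 0.1, 0.5],
    "Tree Depth": [2, 3, 3, 4],
    "AUC Train": [0.78, 0.85, 0.84, 0.97],
    "AUC Test 1": [0.77, 0.82, 0.80, 0.79],
    "AUC Test 2": [0.76, 0.81, 0.82, 0.78],
})

# Values filled in cell by cell may be stored as object dtype; make them numeric.
auc_columns = ["AUC Train", "AUC Test 1", "AUC Test 2"]
toy_results[auc_columns] = toy_results[auc_columns].astype(float)

# Worst-case test performance, and how far the train AUC sits above it.
toy_results["Worst Test AUC"] = toy_results[["AUC Test 1", "AUC Test 2"]].min(axis=1)
toy_results["Overfit Gap"] = toy_results["AUC Train"] - toy_results["Worst Test AUC"]

ranked = toy_results.sort_values("Worst Test AUC", ascending=False)
best_model = ranked.iloc[0]
print(ranked[["Model Number", "Worst Test AUC", "Overfit Gap"]])
```

In this toy table, model 3 has the highest train AUC but a large gap to its test AUCs, which is the usual sign of overfitting.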

In [25]:
# Build final model with optimum hyper-parameters

xgb_instance = XGBClassifier(n_estimators=20, learning_rate = 0.5, max_depth = 3)
final_model = xgb_instance.fit(X_train, Y_train)

6. Strategy Development: To develop a strategy, we often need an estimate of the benefit of a True Positive and a True Negative, and of the cost of a False Positive and a False Negative. Separate analysis is needed to estimate these components. Check sample questions for examples.

For this analysis, assume cost of FP is 10, cost of FN is 0, benefit from TP is 1, and benefit from TN is 2. We want to find the best threshold to classify customers as Positive or Negative based on the model's output.
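
To make the arithmetic concrete, here is a small worked example. The costs and benefits are the ones stated above; the confusion-matrix counts are hypothetical. The last lines compute a break-even probability, which only holds under the assumption that the model's scores are well-calibrated probabilities.

```python
# Costs and benefits as stated in the text.
TP_benefit, TN_benefit, FP_cost, FN_cost = 1, 2, 10, 0

# Hypothetical confusion-matrix counts at some threshold (illustration only).
TP_count, TN_count, FP_count, FN_count = 30, 900, 5, 65

profit = (TP_count * TP_benefit + TN_count * TN_benefit
          - FP_count * FP_cost - FN_count * FN_cost)
print(profit)  # 30 + 1800 - 50 - 0 = 1780

# Labeling a customer Positive pays off only if
#   p * TP_benefit - (1 - p) * FP_cost  >  (1 - p) * TN_benefit - p * FN_cost,
# where p is the probability that the customer is truly positive.
break_even = (TN_benefit + FP_cost) / (TP_benefit + TN_benefit + FP_cost + FN_cost)
print(break_even)  # 12 / 13, roughly 0.92
```

With a false positive this expensive, it is no surprise if the best threshold turns out to be high.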

In [26]:
# first we create dataframes containing Y and Y-Hat, for all three samples.

strategy_train = pd.DataFrame(columns = ["Y", "Y_Hat"])
strategy_test_1 = pd.DataFrame(columns = ["Y", "Y_Hat"])
strategy_test_2 = pd.DataFrame(columns = ["Y", "Y_Hat"])

strategy_train.Y = Y_train
strategy_train.Y_Hat = final_model.predict_proba(X_train)[:,1]

strategy_test_1.Y = Y_test_1
strategy_test_1.Y_Hat = final_model.predict_proba(X_test_1)[:,1]

strategy_test_2.Y = Y_test_2
strategy_test_2.Y_Hat = final_model.predict_proba(X_test_2)[:,1]

In [30]:
#check

roc_auc_score(strategy_test_2.Y, strategy_test_2.Y_Hat)

0.8126710794804162

In [31]:
# Next check the range of Y-Hat to have an idea of possible threshold values

strategy_train.Y_Hat.describe()

count    19147.000000
mean         0.039054
std          0.080005
min          0.001896
25%          0.013254
50%          0.019033
75%          0.031993
max          0.990794
Name: Y_Hat, dtype: float64

In [40]:
# We will use numbers from 0 to 1, increments of 0.01, and calculate profit for each threshold across all three samples.
# To do so, we define a function that calculates Profit based on the associated cost and benefits.
# The function is written for a binary classification model with the target coded as 0/1, where 1 is the positive class.

def profit_calculator (data, actual_column, prediction_column, threshold, TP_benefit, TN_benefit, FP_cost, FN_cost):
  TP_count = data[data[prediction_column] >= threshold][actual_column].sum()
  FP_count = data[data[prediction_column] >= threshold].shape[0] - TP_count

  FN_count = data[data[prediction_column] < threshold][actual_column].sum()
  TN_count = data[data[prediction_column] < threshold].shape[0] - FN_count

  Profit = TP_count*TP_benefit + TN_count*TN_benefit - FN_count*FN_cost - FP_count*FP_cost

  return Profit
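
A quick sanity check of the same logic on five made-up rows. `profit_at_threshold` is a hypothetical, compact restatement of the function above using boolean masks; it is included only so the example runs on its own.

```python
import pandas as pd

def profit_at_threshold(y, y_hat, threshold, TP_benefit, TN_benefit, FP_cost, FN_cost):
    # Rows scored at or above the threshold are classified as Positive.
    predicted_positive = y_hat >= threshold
    TP = int((predicted_positive & (y == 1)).sum())
    FP = int((predicted_positive & (y == 0)).sum())
    FN = int((~predicted_positive & (y == 1)).sum())
    TN = int((~predicted_positive & (y == 0)).sum())
    return TP * TP_benefit + TN * TN_benefit - FP * FP_cost - FN * FN_cost

# Made-up actuals and scores (illustration only).
toy = pd.DataFrame({"Y": [1, 0, 1, 0, 0], "Y_Hat": [0.95, 0.92, 0.40, 0.10, 0.05]})

# At 0.90 one negative is wrongly flagged: TP=1, FP=1, FN=1, TN=2.
profit_090 = profit_at_threshold(toy["Y"], toy["Y_Hat"], 0.90, 1, 2, 10, 0)
# At 0.93 the false positive disappears: TP=1, FP=0, FN=1, TN=3.
profit_093 = profit_at_threshold(toy["Y"], toy["Y_Hat"], 0.93, 1, 2, 10, 0)

print(profit_090, profit_093)
```

A single false positive is enough to turn the profit negative here, which shows how sensitive the result is to the threshold.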

In [42]:
# estimating profits
Profits = pd.DataFrame(columns = ["Threshold", "Train Profit", "Test 1 Profit", "Test 2 Profit"])

import numpy as np

TP_benefit = 1
TN_benefit = 2
FP_cost = 10
FN_cost = 0

Counter = 0
for threshold in np.arange(0.0, 1.0, 0.01):
  Profits.loc[Counter,"Threshold"] = threshold
  Profits.loc[Counter,"Train Profit"] = profit_calculator(strategy_train, "Y", "Y_Hat", threshold, TP_benefit, TN_benefit, FP_cost, FN_cost)
  Profits.loc[Counter,"Test 1 Profit"] = profit_calculator(strategy_test_1, "Y", "Y_Hat", threshold, TP_benefit, TN_benefit, FP_cost, FN_cost)
  Profits.loc[Counter,"Test 2 Profit"] = profit_calculator(strategy_test_2, "Y", "Y_Hat", threshold, TP_benefit, TN_benefit, FP_cost, FN_cost)

  Counter = Counter + 1

In [44]:
# Analyze profits in Excel. It looks like 0.9 is the optimum threshold.

Profits.to_csv("drive/My Drive/profits.csv")
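
The best threshold can also be read off in pandas. The sketch below uses a made-up table with the same column names as `Profits` (the numbers are invented). Choosing the threshold by combined profit on the two test samples is an assumption made for this example.

```python
import pandas as pd

# Made-up table with the same columns as Profits (not real output).
toy_profits = pd.DataFrame({
    "Threshold": [0.70, 0.80, 0.90, 0.95],
    "Train Profit": [100.0, 140.0, 165.0, 160.0],
    "Test 1 Profit": [20.0, 30.0, 36.0, 34.0],
    "Test 2 Profit": [22.0, 29.0, 34.0, 35.0],
})

# Values filled in cell by cell may be stored as object dtype; make them numeric.
profit_columns = ["Train Profit", "Test 1 Profit", "Test 2 Profit"]
toy_profits[profit_columns] = toy_profits[profit_columns].astype(float)

# Threshold that maximizes profit in each sample separately.
best_per_sample = {
    column: toy_profits.loc[toy_profits[column].idxmax(), "Threshold"]
    for column in profit_columns
}

# One possible single choice: maximize combined profit on the two test samples.
toy_profits["Test Total"] = toy_profits["Test 1 Profit"] + toy_profits["Test 2 Profit"]
best_threshold = toy_profits.loc[toy_profits["Test Total"].idxmax(), "Threshold"]

print(best_per_sample)
print(best_threshold)
```

If the samples disagree on the best threshold, as Test 2 does in this toy table, that is a sign the profit curve is flat or noisy near the optimum.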