In this project we will build a classification model, and will design a stratgy using the model.

General steps in building a ML model (see chapter 10):

1. Model Design
2. Data Collection
3. Data Cleaning
4. Test-Train Split
5. Data Processing
6. Feature Reduction
7. Model Training

In this data, steps 1 to 3 are already done. We will discuss those steps later, in other projects.


In [1]:
import pandas as pd

In [2]:
from google.colab import drive
drive.mount("/content/drive")

Mounted at /content/drive


In [3]:
data = pd.read_csv("drive/My Drive/Classification_Data.csv")

1. Data Exploration

In [4]:
data.shape

(27354, 28)

In [17]:
data.columns

# Class is the target variable. Vs are PCAs based on some original features.

Index(['V1', 'V2', 'V3', 'V4', 'V5', 'V6', 'V7', 'V8', 'V9', 'V10', 'V11',
       'V12', 'V13', 'V14', 'V15', 'V16', 'V17', 'V18', 'V19', 'V20', 'V21',
       'V22', 'V23', 'V24', 'V25', 'V26', 'V27', 'Y'],
      dtype='object')

In [16]:
# Change the wierd column name.
data.rename({"V2.1":"V27"}, inplace = True, axis = 1)

In [18]:
data. tail(5)

Unnamed: 0,V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,...,V19,V20,V21,V22,V23,V24,V25,V26,V27,Y
27349,103000.0,0.0,0.0,,,15.0,140044.0,54.1,34.0,25124.18,...,,0.0,686060.0,,,,,,,0
27350,75000.0,0.0,0.0,,,12.0,6566.0,17.6,26.0,10437.99,...,,433.0,340761.0,,,,,,,1
27351,45000.0,0.0,1.0,,67.0,11.0,6204.0,22.5,15.0,14819.54,...,,0.0,18227.0,,,,,,,0
27352,173000.0,0.0,1.0,,,10.0,36754.0,68.3,19.0,43359.98,...,,403.0,58540.0,,,,,,,0
27353,32000.0,0.0,0.0,,,13.0,10514.0,63.3,23.0,11891.89,...,,0.0,170250.0,,,,,,,0


In [19]:
data.dtypes

# all features are numerical, so no need for one-hot encoding

V1     float64
V2     float64
V3     float64
V4     float64
V5     float64
V6     float64
V7     float64
V8     float64
V9     float64
V10    float64
V11    float64
V12    float64
V13    float64
V14    float64
V15    float64
V16    float64
V17    float64
V18    float64
V19    float64
V20    float64
V21    float64
V22    float64
V23    float64
V24    float64
V25    float64
V26    float64
V27    float64
Y        int64
dtype: object

In [20]:
# Look at summary stat

data.describe().transpose()

# class discussion - what information you get from

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
V1,27354.0,82915.987355,69495.442428,8160.0,52000.0,71000.0,98000.0,6000000.0
V2,27354.0,0.349821,0.905096,0.0,0.0,0.0,0.0,15.0
V3,27354.0,0.475104,0.77737,0.0,0.0,0.0,1.0,5.0
V4,13744.0,33.006476,21.798169,0.0,14.0,30.0,48.0,121.0
V5,3648.0,66.332785,25.302058,0.0,51.0,67.0,82.0,119.0
V6,27354.0,12.601082,5.790563,1.0,9.0,12.0,16.0,55.0
V7,27354.0,22602.580719,27405.841487,0.0,9735.0,16516.0,27742.5,1630818.0
V8,27348.0,59.199759,22.986396,0.0,42.7,60.45,77.2,130.5
V9,27354.0,25.984938,11.709076,4.0,18.0,24.0,33.0,127.0
V10,27354.0,19547.685243,7466.802783,1376.49,13583.04,18281.81,23970.45,56694.02


In [None]:
# in class assignmnet - how to show number of missing values, by column, in a pandas dataframe?

2. Test-Train Split: We define two test samples. Each test sample contains 15% of observations.

Samples are chosen randomly. A better approach is to split based on a feature like time. Why random split is not a good idea?

In [22]:
Y = data.Y
X = data.drop(["Y"], axis = 1)

from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=42)
X_test_1, X_test_2, Y_test_1, Y_test_2 = train_test_split(X_test, Y_test, test_size=0.5, random_state=42)

In [23]:
# check

print (X_train.shape)
print (Y_train.shape)
print (X_test_1.shape)
print (Y_test_1.shape)
print (X_test_2.shape)
print (Y_test_2.shape)

(19147, 27)
(19147,)
(4103, 27)
(4103,)
(4104, 27)
(4104,)


3. Data Processing: Main steps in data processing are:

*   One-Hot Encoding
*   Outlier Treatment
*   Feature Scaling
*   Missing Value Imputation
*   Solve for Collinearity


In [None]:
# One-Hot Encoding: All columns in this dataset are numerical; so no need for one-hot encoding in this dataset.

In [31]:
# Outlier Treatment: Looking at summary stat (above), there seems to be outliers in V1, V7, V14, V16, V20, V21, V27.
# In these columns, we replace any value less than P1 with P1, and any value higher than P99 with P99.
# So, first step is to get P1 and P99 from the train sample.

outlier = pd.DataFrame(columns = ["Column Name", "P1", "P99"])

counter = 0
for feature in ["V1", "V7", "V14", "V16", "V20", "V21", "V27"]:
  outlier.loc[counter, "Column Name"] = feature
  outlier.loc[counter, "P1"] = data[feature].quantile(0.01)
  outlier.loc[counter, "P99"] = data[feature].quantile(0.99)
  counter = counter + 1

outlier

Unnamed: 0,Column Name,P1,P99
0,V1,27000.0,257066.59
1,V7,831.59,121135.99
2,V14,0.0,76.6564
3,V16,208.675,1067.2686
4,V20,0.0,3951.4
5,V21,7265.89,673232.58
6,V27,0.0,226082.78


In [40]:
# Next we replace outlers with P1 and P99
import numpy as np

for counter in range (outlier.shape[0]):
  X_train[outlier.loc[counter, "Column Name"]] = np.where(X_train[outlier.loc[counter, "Column Name"]] < outlier.loc[counter, "P1"],
                                                       outlier.loc[counter, "P1"], X_train[outlier.loc[counter, "Column Name"]])

  X_train[outlier.loc[counter, "Column Name"]] = np.where(X_train[outlier.loc[counter, "Column Name"]] > outlier.loc[counter, "P99"],
                                                       outlier.loc[counter, "P99"], X_train[outlier.loc[counter, "Column Name"]])


In [39]:
# check
X_train.describe().transpose()

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
V1,19147.0,81441.109797,41777.778947,27000.0,52295.0,71000.0,98000.0,257066.59
V2,19147.0,0.349924,0.882396,0.0,0.0,0.0,0.0,13.0
V3,19147.0,0.472554,0.776709,0.0,0.0,0.0,1.0,5.0
V4,9614.0,32.830664,21.821592,0.0,14.0,29.0,48.0,121.0
V5,2548.0,66.003925,25.592315,0.0,51.0,67.0,82.0,119.0
V6,19147.0,12.593983,5.775111,1.0,9.0,12.0,16.0,53.0
V7,19147.0,21833.436827,19319.784956,831.59,9755.5,16487.0,27624.0,121135.99
V8,19143.0,59.328062,22.977417,0.0,42.9,60.8,77.3,130.5
V9,19147.0,26.020264,11.749986,4.0,18.0,24.0,33.0,127.0
V10,19147.0,19523.793886,7484.634553,1376.49,13549.145,18257.45,23979.305,53684.284671


In [41]:
# Next we do the same for test samples. Note we use the same P1/P99 that we got from train sample.
# Test sample represents unseen data, and should not be used in any stage of the model, including data processing.

for counter in range (outlier.shape[0]):
  X_test_1[outlier.loc[counter, "Column Name"]] = np.where(X_test_1[outlier.loc[counter, "Column Name"]] < outlier.loc[counter, "P1"],
                                                       outlier.loc[counter, "P1"], X_test_1[outlier.loc[counter, "Column Name"]])

  X_test_2[outlier.loc[counter, "Column Name"]] = np.where(X_test_2[outlier.loc[counter, "Column Name"]] < outlier.loc[counter, "P1"],
                                                       outlier.loc[counter, "P1"], X_test_2[outlier.loc[counter, "Column Name"]])

  X_test_1[outlier.loc[counter, "Column Name"]] = np.where(X_test_1[outlier.loc[counter, "Column Name"]] > outlier.loc[counter, "P99"],
                                                       outlier.loc[counter, "P99"], X_test_1[outlier.loc[counter, "Column Name"]])

  X_test_2[outlier.loc[counter, "Column Name"]] = np.where(X_test_2[outlier.loc[counter, "Column Name"]] > outlier.loc[counter, "P99"],
                                                       outlier.loc[counter, "P99"], X_test_2[outlier.loc[counter, "Column Name"]])


In [50]:
# Feature Scaling: We will use StandardScaler. There are other scaling options such as Min-Max Scaler.
# No matter which technique to use, again scaling parameters (here mean and STD) should come from the train sample.
# To find scaling parameters, we use a sklearn package.

# get scaling parameters
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
sc.fit(X_train)

# scale features
X_train = pd.DataFrame(sc.transform(X_train), columns = X_train.columns)
X_test_1 = pd.DataFrame(sc.transform(X_test_1), columns = X_test_1.columns)
X_test_2 = pd.DataFrame(sc.transform(X_test_2), columns = X_test_2.columns)

In [51]:
#check
X_test_1.describe().transpose()

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
V1,4103.0,-0.018727,1.013846,-1.303146,-0.728663,-0.273863,0.392728,4.203911
V2,4103.0,-0.020092,1.023592,-0.396572,-0.396572,-0.396572,-0.396572,16.603044
V3,4103.0,0.027335,1.02022,-0.608422,-0.608422,-0.608422,0.679096,5.829166
V4,2035.0,0.038726,0.996808,-1.504582,-0.817154,-0.083897,0.741017,3.444902
V5,586.0,0.023723,0.97498,-2.423231,-0.508218,0.07801,0.625157,2.071187
V6,4103.0,0.008774,0.999143,-2.00763,-0.622339,-0.102855,0.416629,6.650438
V7,4103.0,0.005692,0.991008,-1.087093,-0.623804,-0.26349,0.318002,5.140075
V8,4103.0,-0.026768,1.000841,-2.582084,-0.754154,0.016188,0.751712,2.862535
V9,4103.0,-0.003405,0.978487,-1.874116,-0.682594,-0.171942,0.508928,5.870776
V10,4103.0,-0.016303,0.975322,-2.026575,-0.782969,-0.170554,0.548033,4.966335


In [52]:
# For missing value imputation, we replace all missing values with 0
X_train.fillna(0,inplace=True)
X_test_1.fillna(0,inplace=True)
X_test_2.fillna(0,inplace=True)

4. Feature Reduction: We are ready to do grid search; but before that we will reduce number features, and remove those features that have no explanation power. This will significantly improve the speed of grid search and will make the process more efficient.

There are several methods for feature reduction. Here we will build a simple XGB model and will keep only features with feature importance higher than 1% (a subjective threshold).

Note that in this project we have only 29 features; so really there is no need for feature reduction, and we are just trying to show the process and the concept.

In [53]:
from xgboost import XGBClassifier


# for this step, we don't play with parameters of RF, and just use the
model_for_feature_reduction = XGBClassifier()
model_for_feature_reduction.fit(X_train, Y_train)

In [54]:
Feature_Importance = pd.DataFrame(columns = ["Feature", "Feature_Importance"])
Feature_Importance.Feature = X_train.columns
Feature_Importance.Feature_Importance = model_for_feature_reduction.feature_importances_
Feature_Importance.sort_values(by=["Feature_Importance"], inplace=True, ascending=False)
Feature_Importance

Unnamed: 0,Feature,Feature_Importance
13,V14,0.164442
15,V16,0.070819
10,V11,0.070813
9,V10,0.05884
11,V12,0.056697
12,V13,0.053177
26,V27,0.035563
3,V4,0.035327
2,V3,0.034932
8,V9,0.034812


In [55]:
features_to_drop = Feature_Importance[Feature_Importance.Feature_Importance < 0.01]["Feature"]
features_to_drop

14    V15
18    V19
Name: Feature, dtype: object

In [56]:
X_train.drop(features_to_drop, axis = 1, inplace=True)
X_test_1.drop(features_to_drop, axis = 1, inplace=True)
X_test_2.drop(features_to_drop, axis = 1, inplace=True)

In [57]:
# check
X_test_1.columns

Index(['V1', 'V2', 'V3', 'V4', 'V5', 'V6', 'V7', 'V8', 'V9', 'V10', 'V11',
       'V12', 'V13', 'V14', 'V16', 'V17', 'V18', 'V20', 'V21', 'V22', 'V23',
       'V24', 'V25', 'V26', 'V27'],
      dtype='object')

5. Grid Search (Hyper-parameter tuning): Now we are ready to train the NN model and do grid search.

In [58]:
pip install tensorflow



In [59]:
pip install keras



Let's first build a single NN just to get familiar with the syntax.

In [60]:
import tensorflow.keras as keras
from keras.models import Sequential
from keras.layers import Dense

In [66]:
classifier = Sequential()

# add the first hidden layer
classifier.add(Dense(units=6,kernel_initializer='glorot_uniform',
                    activation = 'relu'))

# add the second hidden layer
classifier.add(Dense(units=4,kernel_initializer='glorot_uniform',
                activation = 'relu'))

# add the output layer
classifier.add(Dense(units=1,kernel_initializer='glorot_uniform',
                    activation = 'sigmoid'))

# add additional parameters
classifier.compile(optimizer='adam',loss='binary_crossentropy',metrics=['accuracy', 'FalseNegatives'])

# train the model
classifier.fit(X_train,Y_train,batch_size=1000,epochs=20)

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


<keras.src.callbacks.History at 0x79b083465270>

Now we do Grid Search. Here we don't change number fo hidden layers. To do so, you will need a separate for loop.

In [64]:
from sklearn.metrics import roc_auc_score

In [69]:
# Here we do grid search on #nodes in each layer, activation function, and batch size,
# and look at the AUC as performance metric.

Grid_Search_Results = pd.DataFrame(columns = ["Model Number", "Number of Nodes", "Activation Function", "Batch Size",
                                              "AUC Train", "AUC Test 1", "AUC Test 2"])

Counter = 0
for num_nodes in [2, 5, 10]:
  for activation in ['relu', 'sigmoid']:
    for batch_size in [100, 1000, 10000]:
      model = Sequential()
      model.add(Dense(units=num_nodes, kernel_initializer='glorot_uniform',
                    activation = activation))
      model.add(Dense(units=num_nodes,kernel_initializer='glorot_uniform',
                activation = activation))
      model.add(Dense(units=1,kernel_initializer='glorot_uniform',
                    activation = 'sigmoid'))
      model.compile(optimizer='adam',loss='binary_crossentropy',metrics=['accuracy', 'FalseNegatives'])
      model.fit(X_train,Y_train,batch_size=batch_size,epochs=20,verbose=0)

      Grid_Search_Results.loc[Counter,"Model Number"] = Counter
      Grid_Search_Results.loc[Counter,"Number of Nodes"] = num_nodes
      Grid_Search_Results.loc[Counter,"Activation Function"] = activation
      Grid_Search_Results.loc[Counter,"Batch Size"] = batch_size

      Grid_Search_Results.loc[Counter,"AUC Train"] = roc_auc_score(Y_train, model.predict(X_train))
      Grid_Search_Results.loc[Counter,"AUC Test 1"] = roc_auc_score(Y_test_1, model.predict(X_test_1))
      Grid_Search_Results.loc[Counter,"AUC Test 2"] = roc_auc_score(Y_test_2, model.predict(X_test_2))

      Counter = Counter + 1



In [70]:
# Analyze the results in Excel

Grid_Search_Results.to_csv("drive/My Drive/NN_results.csv")

In [72]:
# Build final model with optimum thresholds

final_model = Sequential()

# add the first hidden layer
final_model.add(Dense(units=5,kernel_initializer='glorot_uniform',
                    activation = 'relu'))

# add the second hidden layer
final_model.add(Dense(units=5,kernel_initializer='glorot_uniform',
                activation = 'relu'))

# add the output layer
final_model.add(Dense(units=1,kernel_initializer='glorot_uniform',
                    activation = 'sigmoid'))

# add additional parameters
final_model.compile(optimizer='adam',loss='binary_crossentropy',metrics=['accuracy', 'FalseNegatives'])

# train the model
final_model.fit(X_train,Y_train,batch_size=100,epochs=20)

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


<keras.src.callbacks.History at 0x79b07311bf70>

6. Strategy Development: To develop strategy, we often need to have an setimate of benefit of True Positive and True Negative, and cost of False Positive and False Negative. Separate analysis is needed to have an estimate of these components. Check sample questions for examples.

For this analysis, assume cost of FP is 10, cost of FN is 0, benefit from TP is 1, and benefit from TN is 2. We wnat to find the best threshold to classify customers to Positive and Negative based on the model's output.

In [73]:
# first we create dataframes containing Y and Y-Hat, for all three samples.

strategy_train = pd.DataFrame(columns = ["Y", "Y_Hat"])
strategy_test_1 = pd.DataFrame(columns = ["Y", "Y_Hat"])
strategy_test_2 = pd.DataFrame(columns = ["Y", "Y_Hat"])

strategy_train.Y = Y_train
strategy_train.Y_Hat = final_model.predict(X_train)

strategy_test_1.Y = Y_test_1
strategy_test_1.Y_Hat = final_model.predict(X_test_1)

strategy_test_2.Y = Y_test_2
strategy_test_2.Y_Hat = final_model.predict(X_test_2)



In [74]:
#check

roc_auc_score(strategy_test_2.Y, strategy_test_2.Y_Hat)

0.768535239706792

In [75]:
# Next check the range of Y-Hat to have an idea of possible threshold values

strategy_train.Y_Hat.describe()

count    19147.000000
mean         0.038326
std          0.050384
min          0.000002
25%          0.014705
50%          0.024629
75%          0.039139
max          0.337873
Name: Y_Hat, dtype: float64

In [76]:
# We will use numbers from 0 to 1, increments of 0.01, and calculate profit for each threshold across all three samples.
# To do so, we define a function that calculates Profit based on the associated cost and benefits.
# The function is written for a binary classification model with responses coded as 1.

def profit_calculator (data, actual_column, prediction_column, threshold, TP_benefit, TN_benefit, FP_cost, FN_cost):
  TP_count = data[data[prediction_column] >= threshold][actual_column].sum()
  FP_count = data[data[prediction_column] >= threshold].shape[0] - TP_count

  FN_count = data[data[prediction_column] < threshold][actual_column].sum()
  TN_count = data[data[prediction_column] < threshold].shape[0] - FN_count

  Profit = TP_count*TP_benefit + TN_count*TN_benefit - FN_count*FN_cost - FP_count*FP_cost

  return Profit

In [77]:
# estimating profits
Profits = pd.DataFrame(columns = ["Threshold", "Train Profit", "Test 1 Profit", "Test 2 Profit"])

import numpy as np

TP_benefit = 1
TN_benefit = 2
FP_cost = 10
FN_cost = 0

Counter = 0
for threshold in np.arange(0.0, 1.0, 0.01):
  Profits.loc[Counter,"Threshold"] = threshold
  Profits.loc[Counter,"Train Profit"] = profit_calculator(strategy_train, "Y", "Y_Hat", threshold, TP_benefit, TN_benefit, FP_cost, FN_cost)
  Profits.loc[Counter,"Test 1 Profit"] = profit_calculator(strategy_test_1, "Y", "Y_Hat", threshold, TP_benefit, TN_benefit, FP_cost, FN_cost)
  Profits.loc[Counter,"Test 2 Profit"] = profit_calculator(strategy_test_2, "Y", "Y_Hat", threshold, TP_benefit, TN_benefit, FP_cost, FN_cost)

  Counter = Counter + 1

In [78]:
# Analyze profits in Excel. Sounds like 0.9 is the optimum threshold.

Profits.to_csv("drive/My Drive/NN_profits.csv")