# Part 3: Supervised Learning Model

**Now that we've found which parts of the population are more likely to be customers of the mail-order company, it's time to build a prediction model.** 

In [1]:
import time

import pandas as pd
import numpy as np

from utils import *

utils loaded: this module contains helpful data cleaning functions


**Each of the rows in the "MAILOUT" data files represents an individual that was targeted for a mailout campaign. Ideally, we should be able to use the demographic information from each individual to decide whether or not it will be worth it to include that person in the campaign.**

In [73]:
location = "data\\" 

In [74]:
# load in the data file
mailout_train = pd.read_csv(location+'Udacity_MAILOUT_052018_TRAIN.csv', sep=';', low_memory=False)

In [75]:
# Have a quick look
print(mailout_train.shape)
mailout_train.head()

(42962, 367)


Unnamed: 0,LNR,AGER_TYP,AKT_DAT_KL,ALTER_HH,ALTER_KIND1,ALTER_KIND2,ALTER_KIND3,ALTER_KIND4,ALTERSKATEGORIE_FEIN,ANZ_HAUSHALTE_AKTIV,ANZ_HH_TITEL,ANZ_KINDER,ANZ_PERSONEN,ANZ_STATISTISCHE_HAUSHALTE,ANZ_TITEL,...,UNGLEICHENN_FLAG,VERDICHTUNGSRAUM,VERS_TYP,VHA,VHN,VK_DHT4A,VK_DISTANZ,VK_ZG11,W_KEIT_KIND_HH,WOHNDAUER_2008,WOHNLAGE,ZABEOTYP,RESPONSE,ANREDE_KZ,ALTERSKATEGORIE_GROB
0,1763,2,1.0,8.0,,,,,8.0,15.0,0.0,0.0,1.0,13.0,0.0,...,0.0,4.0,2,1.0,2.0,5.0,2.0,1.0,6.0,9.0,3.0,3,0,2,4
1,1771,1,4.0,13.0,,,,,13.0,1.0,0.0,0.0,2.0,1.0,0.0,...,0.0,0.0,1,1.0,3.0,1.0,2.0,1.0,4.0,9.0,7.0,1,0,2,3
2,1776,1,1.0,9.0,,,,,7.0,0.0,,0.0,0.0,1.0,0.0,...,0.0,10.0,1,4.0,1.0,6.0,4.0,2.0,,9.0,2.0,3,0,1,4
3,1460,2,1.0,6.0,,,,,6.0,4.0,0.0,0.0,2.0,4.0,0.0,...,0.0,5.0,2,1.0,4.0,8.0,11.0,11.0,6.0,9.0,1.0,3,0,2,4
4,1783,2,1.0,9.0,,,,,9.0,53.0,0.0,0.0,1.0,44.0,0.0,...,0.0,4.0,1,0.0,4.0,2.0,2.0,1.0,6.0,9.0,3.0,3,0,1,3


**The "TRAIN" partition of "MAILOUT" data allows us to verify our model.**

**It includes a column, "RESPONSE", that states whether or not a person became a customer of the company following the campaign.**

In [6]:
mailout_train.RESPONSE.value_counts()

0    42430
1      532
Name: RESPONSE, dtype: int64

In [7]:
round(100 * mailout_train.RESPONSE.value_counts()[0] / mailout_train.RESPONSE.value_counts().sum(), 1)

98.8

In [8]:
round(100 * mailout_train.RESPONSE.value_counts()[1] / mailout_train.RESPONSE.value_counts().sum(), 2)

1.24

- There is a significant imbalance here: most values (~99%) are 0. 

- Because the data is imbalanced, accuracy will be the wrong metric to use.
   - The reason is that it is trivially easy to get a high accuracy. Just predict that all the responses will be zero!)

- We shall use the AUC (Area Under the Curve) ROC (Receiver Operating Characteristic) score instead. 

- The ROC curve represents the diagnostic ability of a binary classifier and using the sci-kit learn implementation we can determine probabilities for 0 (probability of not becoming a customer) and for 1 (probability of becoming a customer.)

# Data cleaning

In [9]:
# Load the documentation in the Excel files
location = "D:\\UDACITY-CAPSTONE-ARVATO\\"
attributes_values = pd.read_excel(location+'DIAS Attributes - Values 2017.xlsx', header=1, usecols='B:E')
unknowns = attributes_values[attributes_values.Meaning.isin(["unknown / no main age detectable", "unknown"])]
value_of_unknown = {}
for i, row in unknowns.iterrows():
    value_of_unknown[row.Attribute] = str(row.Value).split(", ")

Replace unknown values with nan.

In [10]:
symbols=["XX", "X"]
mailout_train = replace_symbols_with_nan(mailout_train, mailout_train.columns[[18,19]], symbols)
mailout_train = unknown_to_nan(mailout_train, value_of_unknown)

In [11]:
print(mailout_train.shape)

(42962, 367)


Drop columns and rows containing too many nans

In [12]:
mailout_train = drop_columns_and_rows_percent(mailout_train, missing_columns_percent=20.0, missing_rows_percent = 10.)
print(mailout_train.shape)

77 columns are missing 20.0 of their values
Dropping these columns: ['AGER_TYP', 'ALTER_HH', 'ALTER_KIND1', 'ALTER_KIND2', 'ALTER_KIND3', 'ALTER_KIND4', 'EXTSEL992', 'HH_DELTA_FLAG', 'KBA05_ALTER1', 'KBA05_ALTER2', 'KBA05_ALTER3', 'KBA05_ALTER4', 'KBA05_ANHANG', 'KBA05_ANTG1', 'KBA05_ANTG2', 'KBA05_ANTG3', 'KBA05_ANTG4', 'KBA05_AUTOQUOT', 'KBA05_BAUMAX', 'KBA05_CCM1', 'KBA05_CCM2', 'KBA05_CCM3', 'KBA05_CCM4', 'KBA05_DIESEL', 'KBA05_FRAU', 'KBA05_GBZ', 'KBA05_HERST1', 'KBA05_HERST2', 'KBA05_HERST3', 'KBA05_HERST4', 'KBA05_HERST5', 'KBA05_KRSAQUOT', 'KBA05_KRSHERST1', 'KBA05_KRSHERST2', 'KBA05_KRSHERST3', 'KBA05_KRSKLEIN', 'KBA05_KRSOBER', 'KBA05_KRSVAN', 'KBA05_KRSZUL', 'KBA05_KW1', 'KBA05_KW2', 'KBA05_KW3', 'KBA05_MAXAH', 'KBA05_MAXBJ', 'KBA05_MAXHERST', 'KBA05_MAXSEG', 'KBA05_MAXVORB', 'KBA05_MOD1', 'KBA05_MOD2', 'KBA05_MOD3', 'KBA05_MOD4', 'KBA05_MOD8', 'KBA05_MOTOR', 'KBA05_MOTRAD', 'KBA05_SEG1', 'KBA05_SEG10', 'KBA05_SEG2', 'KBA05_SEG3', 'KBA05_SEG4', 'KBA05_SEG5', 'KBA05_SEG6', 'K

In [13]:
mailout_train_columns = list(mailout_train.columns) # Keep this for later
mailout_train_columns.remove('RESPONSE')

In [14]:
categories_not_in_mailout_train = set(categorical_variables)-set(mailout_train.columns)
categories_in_mailout_train = set(categorical_variables)-set(categories_not_in_mailout_train)

Handle categorical variables by making dummy columns

In [15]:
mailout_train_new = handle_categorical(df_original=mailout_train, categories_to_handle=categories_in_mailout_train)

Original columns are dropped


In [16]:
mailout_train_new.shape

(34998, 402)

Completing missing values with the median value

In [17]:
from sklearn.impute import SimpleImputer
imp_median = SimpleImputer(missing_values=np.nan, strategy='median') 

In [18]:
print(mailout_train_new.isna().sum().sum())
mailout_train_new = pd.DataFrame(imp_median.fit_transform(mailout_train_new), columns = mailout_train_new.columns)
print(mailout_train_new.isna().sum().sum())

21060
0


In [19]:
X = mailout_train_new.drop(['RESPONSE'], axis=1)
y = mailout_train_new['RESPONSE']

In [20]:
print(X.shape)
print(y.shape)

(34998, 401)
(34998,)


# Model training & testing

Import the potential classifiers

In [21]:
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
import xgboost

In [22]:
from sklearn.model_selection import train_test_split, GridSearchCV, learning_curve
from sklearn.metrics import roc_auc_score

Create testing and training sets

In [23]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

In [24]:
def test_algorithms(algorithms, X_train, y_train, X_test, y_test):
    '''This function allows us to test many algorithms at once'''
    results = []
    print("Algorithm, ROC_AUC_score, Time taken to train")
    for i, algorithm in algorithms:
        t0 = time.time()
        model =  algorithm.fit(X_train, y_train.values.ravel())
        y_predictions = model.predict_proba(X_test)[:,1]
        roc_auc = roc_auc_score(y_test, y_predictions)
        results.append(roc_auc)
        print("Algorithm, ROC_AUC_score, Time taken to train")
        print(i, roc_auc, (time.time() - t0))
        print()
    return results

I have chosen a few algorithms I have worked with successfully in the past.

In [25]:
algorithms = [
    ("XGBClassifier", xgboost.XGBClassifier(random_state=1, eval_metric = 'auc')),
    ("AdaBoost", AdaBoostClassifier(random_state=1)),
    ("GradientBoostingClassifier", GradientBoostingClassifier(random_state=1))
]

In [26]:
results = test_algorithms(algorithms, X_train, y_train, X_test, y_test)

Algorithm, ROC_AUC_score, Time taken to train




Algorithm, ROC_AUC_score, Time taken to train
XGBClassifier 0.7321439331854994 5.944477558135986

Algorithm, ROC_AUC_score, Time taken to train
AdaBoost 0.7391738273717702 8.44746208190918

Algorithm, ROC_AUC_score, Time taken to train
GradientBoostingClassifier 0.7764571984670265 32.50508666038513



In [27]:
for i in range(len(algorithms)):
    print(algorithms[i][0], results[i])

XGBClassifier 0.7321439331854994
AdaBoost 0.7391738273717702
GradientBoostingClassifier 0.7764571984670265


Although GradientBoostingClassifier gives a better score, it takes ~5 times longer to train, so I'll use AdaBoost and tune its parameters.

The second best algorithm was AdaBoost. We will use GridSearch to tune the hyper-parameters of AdaBoost.

In [28]:
adaboost_params = { 'n_estimators': [25, 50, 100, 150, 200], 'learning_rate' : [0.01,0.05,0.1,0.3,0.5,1] }

ada_boost_grid = GridSearchCV(  estimator = AdaBoostClassifier(random_state=1), 
                                param_grid = adaboost_params,
                                scoring = "roc_auc",
                                n_jobs = -1,
                                verbose = 3)

In [29]:
t0 = time.time()
ada_boost_grid.fit(X_train, y_train.values)
print("Time taken {}s".format(time.time() - t0))

Fitting 5 folds for each of 30 candidates, totalling 150 fits
Time taken 508.3334023952484s


In [30]:
print("The best score found is ", ada_boost_grid.best_score_)

The best score found is  0.7813392411234951


In [31]:
print("The best parameters found are ", ada_boost_grid.best_params_)

The best parameters found are  {'learning_rate': 0.05, 'n_estimators': 150}


Let us check if this can be fine tuned further.

In [32]:
adaboost_params = { 'n_estimators': [75, 100, 125], 'learning_rate' : [0.03,0.05,0.07] }

ada_boost_grid = GridSearchCV(  estimator = AdaBoostClassifier(random_state=1), 
                                param_grid = adaboost_params,
                                scoring = "roc_auc",
                                n_jobs = -1,
                                verbose = 3)

In [33]:
t0 = time.time()
ada_boost_grid.fit(X_train, y_train.values)
print("Time taken {}s".format(time.time() - t0))

Fitting 5 folds for each of 9 candidates, totalling 45 fits
Time taken 192.54086208343506s


In [34]:
print("The best score found is ", ada_boost_grid.best_score_)
print("The best parameters found are ", ada_boost_grid.best_params_)

The best score found is  0.7813007547818083
The best parameters found are  {'learning_rate': 0.05, 'n_estimators': 125}


In [35]:
# Store this for use in the next section
AdaBoostClassifierBest = ada_boost_grid.best_estimator_

# Model deployment

We are now in a position to make predictions for the new data (once we clean the new data.)

In [76]:
location = "data\\"

In [77]:
mailout_test = pd.read_csv(location+'Udacity_MAILOUT_052018_TEST.csv', sep=';', low_memory=False)

In [79]:
mailout_test.head()

Unnamed: 0,LNR,AGER_TYP,AKT_DAT_KL,ALTER_HH,ALTER_KIND1,ALTER_KIND2,ALTER_KIND3,ALTER_KIND4,ALTERSKATEGORIE_FEIN,ANZ_HAUSHALTE_AKTIV,ANZ_HH_TITEL,ANZ_KINDER,ANZ_PERSONEN,ANZ_STATISTISCHE_HAUSHALTE,ANZ_TITEL,...,UMFELD_JUNG,UNGLEICHENN_FLAG,VERDICHTUNGSRAUM,VERS_TYP,VHA,VHN,VK_DHT4A,VK_DISTANZ,VK_ZG11,W_KEIT_KIND_HH,WOHNDAUER_2008,WOHNLAGE,ZABEOTYP,ANREDE_KZ,ALTERSKATEGORIE_GROB
0,1754,2,1.0,7.0,,,,,6.0,2.0,0.0,0.0,2.0,2.0,0.0,...,5.0,0.0,23.0,1,1.0,4.0,5.0,6.0,3.0,6.0,9.0,3.0,3,1,4
1,1770,-1,1.0,0.0,,,,,0.0,20.0,0.0,0.0,1.0,21.0,0.0,...,3.0,0.0,0.0,1,1.0,1.0,5.0,2.0,1.0,6.0,9.0,5.0,3,1,4
2,1465,2,9.0,16.0,,,,,11.0,2.0,0.0,0.0,4.0,2.0,0.0,...,5.0,1.0,15.0,1,1.0,3.0,9.0,6.0,3.0,2.0,9.0,4.0,3,2,4
3,1470,-1,7.0,0.0,,,,,0.0,1.0,0.0,0.0,0.0,1.0,0.0,...,5.0,0.0,10.0,2,1.0,2.0,6.0,6.0,3.0,,9.0,2.0,3,2,4
4,1478,1,1.0,21.0,,,,,13.0,1.0,0.0,0.0,4.0,1.0,0.0,...,5.0,0.0,0.0,1,1.0,1.0,2.0,4.0,3.0,3.0,9.0,7.0,4,2,4


Clean data

In [51]:
# Replace unknown values with nan.
symbols=["XX", "X"]
mailout_test = replace_symbols_with_nan(mailout_test, mailout_train.columns[[18,19]], symbols)
mailout_test = unknown_to_nan(mailout_test, value_of_unknown)
print(mailout_test.shape)

# Ensure the columns in the test set are the same as the training set.
mailout_test_new = mailout_test[mailout_train_columns]
print(mailout_test_new.shape)

# Remove rows with too many missing values
mailout_test_new = drop_columns_and_rows(mailout_test_new, columns_to_drop=[], missing_values_threshold_row=36.0)
print(len(mailout_test_new.columns))
print(mailout_test_new.shape)

# Handle categorical variables
categories_not_in_mailout_test = set(categorical_variables)-set(mailout_test_new.columns)
categories_in_mailout_test = set(categorical_variables)-set(categories_not_in_mailout_test)
mailout_test_new = handle_categorical(df_original=mailout_test_new, categories_to_handle=categories_in_mailout_test)

# Impute median values
print(mailout_test_new.isna().sum().sum())
mailout_test_new = pd.DataFrame(imp_median.fit_transform(mailout_test_new), columns = mailout_test_new.columns)
print(mailout_test_new.isna().sum().sum())

# Ensure the columns in the test set are the same as the training set.
to_drop = set(mailout_test_new.columns) - set(mailout_train_new.columns) 
mailout_test_new = mailout_test_new.drop(to_drop, axis = 1)

print(X.shape)
print(mailout_test_new.shape)

(42833, 367)
(42833, 289)
289
(34991, 289)
Original columns are dropped
20892
0
(34998, 401)
(34991, 401)


Make predictions

In [52]:
RESPONSE = AdaBoostClassifierBest.predict_proba(mailout_test_new)

The probability that any particular individual **will** become a customer:

In [53]:
CUSTOMER_PROBABILITY = RESPONSE[:,1]

The probability that any particular individual **will not** become a customer:

In [54]:
NON_CUSTOMER_PROBABILITY = RESPONSE[:,0]

Let us do a final check (probabilities must add to 1):

In [55]:
set(NON_CUSTOMER_PROBABILITY + CUSTOMER_PROBABILITY) 

{0.9999999999999999, 1.0, 1.0000000000000002}

These are all equal to 1.0 (within a small error introduced because of floating point (im)precision)

In [56]:
CUSTOMER_PROBABILITY

array([0.36488395, 0.35944006, 0.28434235, ..., 0.35541688, 0.34407794,
       0.35561989])

In [63]:
(sorted(CUSTOMER_PROBABILITY))[::-1]

[0.37462597705557643,
 0.37462597705557643,
 0.37462597705557643,
 0.3742256294655508,
 0.37379542750826433,
 0.3736783745397096,
 0.3730959559237464,
 0.3730959559237464,
 0.3730959559237464,
 0.3730959559237464,
 0.3730959559237464,
 0.3730959559237464,
 0.3730959559237464,
 0.3730959559237464,
 0.3730959559237464,
 0.3730959559237464,
 0.3730959559237464,
 0.3730959559237464,
 0.3730959559237464,
 0.3730959559237464,
 0.3730959559237464,
 0.3730959559237464,
 0.3730959559237464,
 0.3730959559237464,
 0.3730959559237464,
 0.3730959559237464,
 0.3730959559237464,
 0.3730959559237464,
 0.3730959559237464,
 0.3730959559237464,
 0.3730959559237464,
 0.3730959559237464,
 0.3730959559237464,
 0.3730959559237464,
 0.3730959559237464,
 0.3730959559237464,
 0.3730959559237464,
 0.3730959559237464,
 0.3730959559237464,
 0.3730959559237464,
 0.3730959559237464,
 0.3730959559237464,
 0.3730959559237464,
 0.3730959559237464,
 0.3730959559237464,
 0.3730959559237464,
 0.3730959559237464,
 0.373095