# Creating and evaluating a classifier

We can guess that the two classes will not be evenly balanced around the 50K threshold, and that the decision boundary will sit toward one extreme of the feature ranges. The data also mixes continuous and categorical features. For these reasons, we use methods based on decision trees, which generally work well with this kind of mixed data. We try both bagging strategies (Random Forests and Extremely Randomized Trees) and a boosting strategy (Gradient Boosting).

We try these ensembles because each one addresses a weakness of a single decision tree: bagging averages many randomized trees, which mainly reduces variance, while boosting adds trees sequentially so that each new tree corrects part of the remaining bias. However, since we have no real-time performance requirement, we might not use boosting, which is fast at prediction time but can be less accurate.

We use Gini impurity as the split criterion for training. To estimate the predictor's performance, we use accuracy, precision, and recall. We might try cross-validation as well.

In [1]:
import pandas as pd
import os


In [2]:
dir_path = '/Users/macbookpro/Documents/'
cols = ['age',
        'workclass',
        'fnlwgt',
        'education',
        'education-num',
        'marital-status',
        'occupation',
        'relationship',
        'race',
        'sex',
        'capital-gain',
        'capital-loss',
        'hours-per-week',
        'native-country',
        'class'
       ]
data = pd.read_csv(dir_path + 'adult.data.csv', names=cols)


In [3]:
data = data.sample(frac=1)

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,class
11357,24,Private,179423,HS-grad,9,Never-married,Adm-clerical,Own-child,White,Female,0,0,40,United-States,<=50K
27665,30,?,103651,11th,7,Married-civ-spouse,?,Husband,White,Male,0,0,35,United-States,<=50K
19293,51,Private,280093,HS-grad,9,Separated,Adm-clerical,Other-relative,White,Female,0,0,40,United-States,<=50K
9305,28,Private,162343,HS-grad,9,Never-married,Other-service,Not-in-family,White,Male,0,0,40,Puerto-Rico,<=50K
7115,22,Private,71469,Assoc-acdm,12,Never-married,Sales,Own-child,White,Female,0,0,25,United-States,<=50K
7885,25,Private,108658,HS-grad,9,Never-married,Exec-managerial,Not-in-family,White,Male,0,0,45,United-States,<=50K
8399,19,Private,231962,HS-grad,9,Never-married,Other-service,Unmarried,White,Male,0,0,40,United-States,<=50K
31258,27,Private,492263,10th,6,Separated,Machine-op-inspct,Own-child,White,Male,0,0,35,Mexico,<=50K
22341,29,Private,354558,HS-grad,9,Married-civ-spouse,Sales,Husband,White,Male,0,0,60,United-States,<=50K
26130,53,Self-emp-not-inc,46704,Masters,14,Married-civ-spouse,Prof-specialty,Husband,White,Male,0,0,50,United-States,>50K


In [4]:
classes = data['class']

In [5]:
unclassified_data = data.drop(columns='class')

We split the data into categorical and continuous columns.

In [6]:
categorical_cols = [
        'workclass',
        'education',
        'marital-status',
        'occupation',
        'relationship',
        'race',
        'sex',
        'native-country',
       ]
categorical_data = unclassified_data[categorical_cols]

In [7]:
non_categorical_cols = [col for col in cols if col not in categorical_cols and col != 'class']

In [8]:
continuous_data = unclassified_data[non_categorical_cols]

We want to avoid overfitting on the training set. For this reason, we remove any implied order relationship within the categorical data by one-hot encoding it with `get_dummies`. If an ordering is useful for the classification, the classifier can rebuild it from the indicator columns; otherwise, it is not influenced by an arbitrary one.

In [9]:
categorical_as_dummies = pd.get_dummies(categorical_data, prefix=categorical_cols)
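
As a minimal sketch of what `get_dummies` produces, the toy frame below (hypothetical values, not taken from the dataset) expands one categorical column into one indicator column per category:

```python
import pandas as pd

# Toy data (assumption): a single categorical column with three distinct values.
toy = pd.DataFrame({'workclass': ['Private', 'State-gov', 'Private', 'Local-gov']})

# One indicator column per category; no ordering between categories is implied.
toy_dummies = pd.get_dummies(toy, prefix=['workclass'])
print(list(toy_dummies.columns))
print(toy_dummies.sum().to_dict())
```

Each row has exactly one active indicator, so the encoding carries the category itself and nothing about its rank.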

In [10]:
[(col, unclassified_data[col].dtype) for col in unclassified_data.columns]

[('age', dtype('int64')),
 ('workclass', dtype('O')),
 ('fnlwgt', dtype('int64')),
 ('education', dtype('O')),
 ('education-num', dtype('int64')),
 ('marital-status', dtype('O')),
 ('occupation', dtype('O')),
 ('relationship', dtype('O')),
 ('race', dtype('O')),
 ('sex', dtype('O')),
 ('capital-gain', dtype('int64')),
 ('capital-loss', dtype('int64')),
 ('hours-per-week', dtype('int64')),
 ('native-country', dtype('O'))]

In [11]:
classes_as_numbers = classes.replace(' <=50K', 0).replace(' >50K', 1)

Conveniently, Random Forest bagging handles one-hot encoded categorical data as well as continuous data. For this reason, we can use a single classifier on the whole dataset.

In [12]:
full_data = pd.concat([continuous_data, categorical_as_dummies], axis=1)


Let's use the scikit-learn random forest implementation. We try a few values for the maximum tree depth (a limited "grid search") and compare them using the cross-validated ROC AUC score. Depths of 20-21 give the best values.

In [13]:
import scipy
from sklearn.ensemble import (AdaBoostClassifier, ExtraTreesClassifier,
                              GradientBoostingClassifier, RandomForestClassifier)

benchmarked_classifiers = [AdaBoostClassifier, ExtraTreesClassifier,
                           GradientBoostingClassifier, RandomForestClassifier]

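# Candidate hyperparameters for the bagging classifiers (not searched below).
bagging_param_grid = [
  {
      'n_estimators': [n for n in range(20, 100) if (n % 3 == 0 or n % 10 == 0)],
      'criterion': ['gini', 'entropy'],  # valid split criteria for forest classifiers
      'min_samples_leaf': [n for n in range(1, 50) if n % 3 == 1],
      'n_jobs': [2],
      'random_state': [None, 0, 1, 2, 3, 4],
      'oob_score': [True, False]  # True requires bootstrap=True
  }
]

# Placeholder: no grid was defined for the boosting classifiers.
boosting_param_grid = [
    {}
]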

# loss, min_impurity_split and presort are not passed: recent scikit-learn
# releases renamed or removed these arguments, and the default loss is unchanged.
clf = GradientBoostingClassifier(learning_rate=0.1, n_estimators=400, subsample=1.0,
                                 criterion='friedman_mse', min_samples_split=6, min_samples_leaf=1,
                                 min_weight_fraction_leaf=0.0, max_depth=3, min_impurity_decrease=0.0,
                                 init=None, random_state=None, max_features=None,
                                 verbose=0, max_leaf_nodes=None, warm_start=False)
clf.fit(full_data, classes_as_numbers)

GradientBoostingClassifier(criterion='friedman_mse', init=None,
              learning_rate=0.1, loss='deviance', max_depth=3,
              max_features=None, max_leaf_nodes=None,
              min_impurity_decrease=0.0, min_impurity_split=None,
              min_samples_leaf=1, min_samples_split=6,
              min_weight_fraction_leaf=0.0, n_estimators=400,
              presort='auto', random_state=None, subsample=1.0, verbose=0,
              warm_start=False)
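
The limited grid search over the maximum tree depth mentioned earlier could look roughly like the sketch below. It runs on synthetic data from `make_classification`, and the depth values, the number of trees, and the three folds are illustrative assumptions, not the settings used in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the census features (assumption: not the real data).
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Small illustrative grid over the maximum tree depth.
param_grid = {'max_depth': [3, 5, 10]}

search = GridSearchCV(
    RandomForestClassifier(n_estimators=20, random_state=0),
    param_grid,
    scoring='roc_auc',  # same metric as in the text
    cv=3,
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```

`search.cv_results_` keeps the mean score of every candidate, which helps to check whether the best depth is a clear winner or only marginally ahead.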

Since Gradient Boosting performs well, and better than Random Forests and Extremely Randomized Trees, we keep only this classifier.

In [14]:
print(list(zip(full_data.columns, clf.feature_importances_)))


[('age', 0.10206480778725541), ('fnlwgt', 0.13188088702608744), ('education-num', 0.065607222775722951), ('capital-gain', 0.10597754767555731), ('capital-loss', 0.096009691381413867), ('hours-per-week', 0.090860682616692723), ('workclass_ ?', 0.0), ('workclass_ Federal-gov', 0.0071405651217611018), ('workclass_ Local-gov', 0.01056237963971266), ('workclass_ Never-worked', 0.0), ('workclass_ Private', 0.0070893626486215108), ('workclass_ Self-emp-inc', 0.0072862032151652153), ('workclass_ Self-emp-not-inc', 0.01108158029582198), ('workclass_ State-gov', 0.0086382293741332546), ('workclass_ Without-pay', 0.0018957642907502833), ('education_ 10th', 0.00039726512104029237), ('education_ 11th', 0.00083324776801138687), ('education_ 12th', 0.00058530799246263602), ('education_ 1st-4th', 0.00050492873584257988), ('education_ 5th-6th', 0.00092028743984407804), ('education_ 7th-8th', 0.00026465097721238047), ('education_ 9th', 0.00050867965528338869), ('education_ Assoc-acdm', 0.003689561664699

As we could guess, fnlwgt, capital loss, and capital gain are fairly relevant features for predicting income. Some other columns contribute very little and could be removed to see whether this improves the results.
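
To spot weak columns more easily than in the raw list, the importances can be sorted. The sketch below uses a small model fitted on synthetic data, and the feature names `f0` to `f4` are hypothetical.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic data and hypothetical feature names (assumptions for illustration).
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
columns = ['f0', 'f1', 'f2', 'f3', 'f4']

model = GradientBoostingClassifier(n_estimators=20, random_state=0).fit(X, y)

# Importances sum to 1; sorting puts the weakest columns at the bottom.
importances = pd.Series(model.feature_importances_, index=columns)
ranked = importances.sort_values(ascending=False)
print(ranked)
```

With the real model, the same two lines would apply to `clf.feature_importances_` and `full_data.columns`.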

In [15]:
test = pd.read_csv(dir_path+'adult.test.txt', names=cols)
y_true = test['class'].replace(' <=50K.', 0).replace(' >50K.', 1)
test_categorical_data = test[categorical_cols]
test_continuous_data = test[non_categorical_cols]
test_categorical_as_dummies = pd.get_dummies(test_categorical_data, prefix=categorical_cols)
test_full_data = pd.concat([test_continuous_data, test_categorical_as_dummies], axis=1)
test_full_data.columns

full_data.columns

[(i, col) for i, col in enumerate(full_data.columns) if col not in test_full_data.columns]


[(81, 'native-country_ Holand-Netherlands')]

Unfortunately, no one in the test set is from the Netherlands. We add the corresponding dummy column, filled with zeros, at position 81 of the DataFrame.

In [16]:
test_full_data.insert(81, 'native-country_ Holand-Netherlands', [0]*len(test_full_data.index))
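
An alternative that avoids hard-coding the position is to reindex the test frame on the training columns. This is a sketch on toy frames with hypothetical column names, not the notebook's data.

```python
import pandas as pd

# Toy one-hot frames (assumption): the test frame lacks a category seen in training.
train_toy = pd.DataFrame({'age': [30, 40], 'country_A': [1, 0], 'country_B': [0, 1]})
test_toy = pd.DataFrame({'age': [25], 'country_A': [1]})

# Reorder to the training columns and fill any missing column with zeros.
test_aligned = test_toy.reindex(columns=train_toy.columns, fill_value=0)
print(test_aligned)
```

Note that `reindex` also drops any column that exists only in the test frame, which may or may not be desirable.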

In [17]:
y_pred = clf.predict(test_full_data)
y_pred

array([0, 0, 0, ..., 1, 0, 1])

In [18]:
from sklearn.model_selection import cross_val_score
cross_val_score(clf, full_data, classes_as_numbers, scoring='roc_auc') 

array([ 0.92518826,  0.92721748,  0.92982492])
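
The three scores correspond to three folds. The default number of folds in `cross_val_score` depends on the scikit-learn version, so it can help to state the folds explicitly. The sketch below does so on synthetic data; the fold count, seed, and model size are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the training data (assumption).
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
model = GradientBoostingClassifier(n_estimators=20, random_state=0)

# Three stratified, shuffled folds with a fixed seed for repeatable scores.
folds = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, scoring='roc_auc', cv=folds)
print(scores.mean(), scores.std())
```

Stratified folds keep the class proportions roughly equal across folds, which matters when the classes are unbalanced.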

In [20]:
comp = pd.concat([y_true, pd.DataFrame(y_pred)], axis=1)
comp.columns = ['t', 'p']

# To measure the performance of the model, we use accuracy, precision
# and recall.
# They give us an idea of how much the model overfits, so that we can
# tune the hyperparameters accordingly.
# The cross-validation score on the training set is not used here because
# the model overfits when trained on too small a subset of the data,
# meaning that its score decreases while the best result on the test
# dataset has not yet been reached.
# We used a validation/train split of 0.3.
# A smaller validation proportion might be a better indicator.
inaccurate = comp['t'] != comp['p']
accuracy = 1 - inaccurate.sum() / len(y_true)
print("accuracy: " + str(accuracy))
true_pos = ((comp['t'] + comp['p']) == 2).sum()
false_pos = (comp['p'] == 1).sum() - true_pos
false_negs = (comp['t'] == 1).sum() - true_pos
precision = true_pos / (true_pos + false_pos)
print("precision: " + str(precision))
recall = true_pos / (true_pos + false_negs)
print("recall: " + str(recall))

accuracy: 0.8770960014741109
precision: 0.787293677982
recall: 0.657306292252
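
The same three metrics are also available in `sklearn.metrics`, which can serve as a cross-check of the manual formulas. The labels below are toy values chosen for illustration, not the notebook's predictions.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Toy labels (hypothetical): 1 stands for >50K, 0 for <=50K.
y_true_toy = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred_toy = [1, 0, 0, 1, 0, 1, 0, 0]

# Here: 2 true positives, 1 false positive, 2 false negatives, 3 true negatives.
acc = accuracy_score(y_true_toy, y_pred_toy)     # (TP + TN) / total
prec = precision_score(y_true_toy, y_pred_toy)   # TP / (TP + FP)
rec = recall_score(y_true_toy, y_pred_toy)       # TP / (TP + FN)
print(acc, prec, rec)
```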


Accuracy is very good, but precision is not as good, meaning that there is still a lot of noise in the predictions. To reduce this noise, we can try removing the columns that are not useful.

Recall is quite low, meaning that we miss roughly a third of the actual >50K cases. For this reason, we should consider applying transformations to the important continuous variables (fnlwgt, capital loss, capital gain) in order to zoom in on the important ranges and make Gradient Boosting learn more precisely on this part of the data. We could, for example, exponentiate capital gain and capital loss, since highly positive values of these seem to be correlated with high income.

Let's save the best classifier, the predictions made on the test set, and the metrics in order to use them in the API.

In [21]:
import joblib

joblib.dump(clf, dir_path+'gradient_boosting.joblib.pkl')

['/Users/macbookpro/Documents/gradient_boosting.joblib.pkl']
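
Before relying on the saved file in the API, it may be worth checking that the model survives a save/load round trip. The sketch below does this with a small model on synthetic data and a temporary directory, both of which are assumptions for illustration.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Small model on synthetic data (assumption: stands in for the real classifier).
X, y = make_classification(n_samples=100, n_features=5, random_state=0)
model = GradientBoostingClassifier(n_estimators=10, random_state=0).fit(X, y)

# Save to a temporary directory and load the model back.
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, 'gradient_boosting.joblib.pkl')
    joblib.dump(model, model_path)
    restored = joblib.load(model_path)

# The restored model should reproduce the original predictions.
same_predictions = bool((restored.predict(X) == model.predict(X)).all())
print(same_predictions)
```

A file saved this way is generally only safe to load with the same scikit-learn version that produced it.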

In [22]:
import pickle

with open(dir_path + 'test_results.pkl', 'wb') as f:
    pickle.dump({'accuracy': accuracy, 'precision': precision, 'recall': recall}, f)

In [24]:
y_pred = pd.DataFrame(y_pred).replace(0, '<=50K').replace(1, '>50K')
y_pred.columns = ['classifier output']
df = pd.concat([test, y_pred], axis=1)


df.to_csv(dir_path + 'classifier_output.csv')
print(dir_path + 'classifier_output.csv')

/Users/macbookpro/Documents/classifier_output.csv
