In [1]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline 

from sklearn.metrics import log_loss
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.datasets import make_hastie_10_2
from sklearn.model_selection import train_test_split

In [4]:
# The sigmoid maps a raw score (log-odds) to a probability in (0, 1)
def sigmoid(x):
    return 1 / (1 + np.exp(-x))
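
As a quick illustration of what the sigmoid does, the sketch below pushes a few made-up raw scores through it and then thresholds the resulting probabilities at 0.5 to get hard class labels. The score values and the names `scores`, `probs` and `labels` are assumptions chosen for the example, not part of the notebook.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

scores = np.array([-4.0, 0.0, 4.0])    # illustrative raw scores (log-odds)
probs = sigmoid(scores)                # probabilities of the positive class
labels = np.where(probs > 0.5, 1, -1)  # threshold at 0.5 to get hard labels

print(probs)
print(labels.tolist())
```

A score of zero maps to a probability of exactly 0.5, and scores of equal magnitude and opposite sign map to probabilities that sum to 1.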

A simple synthetic dataset is created below. Its single feature is drawn from a 1D standard normal distribution, and the goal is to predict class '1' if the item is positive and class '-1' otherwise.

In [8]:
# Generate an array of 5000 random numbers that are normally distributed
X_all = np.random.randn(5000, 1)

Comparing each item with zero produces a boolean mask: True if the item is greater than zero, False otherwise. Multiplying the mask by 2 and subtracting 1 converts the booleans into the labels 1 and -1.

In [9]:
y_all = (X_all[:, 0] > 0)*2 - 1
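
To make the label arithmetic concrete, here is a minimal sketch on four hand-picked values. The array `X_demo` is hypothetical and only stands in for `X_all`.

```python
import numpy as np

X_demo = np.array([[-1.5], [0.3], [2.0], [-0.1]])  # hypothetical feature values
mask = X_demo[:, 0] > 0    # boolean mask: True where the value is positive
y_demo = mask * 2 - 1      # False -> 0*2 - 1 = -1, True -> 1*2 - 1 = 1

print(mask.tolist())       # [False, True, True, False]
print(y_demo.tolist())     # [-1, 1, 1, -1]
```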

Use scikit-learn's [train_test_split](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) function. Half of the data is used for training and the remaining half for testing.

In [10]:
X_train, X_test, y_train, y_test = train_test_split(X_all, y_all, test_size=0.5, random_state=42)
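
A small sanity check on the split can be sketched as follows. It uses a smaller, separately generated dataset (the names `X_demo`, `X_tr`, `X_te` and the size of 1000 are assumptions for the example) and confirms that `test_size=0.5` produces two halves of equal size.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
X_demo = rng.randn(1000, 1)
y_demo = (X_demo[:, 0] > 0) * 2 - 1

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.5, random_state=42
)

print(X_tr.shape, X_te.shape)                   # sizes of the two halves
print((y_tr == 1).mean(), (y_te == 1).mean())   # share of positive labels in each half
```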

You will notice that this dataset can be solved with a single decision stump, so we set max_depth to 1.

In [11]:
clf = DecisionTreeClassifier(max_depth=1)
clf.fit(X_train, y_train)

print('Accuracy for a single decision stump: {}'.format(clf.score(X_test, y_test)))

Accuracy for a single decision stump: 1.0
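
To see why one stump is enough, we can look at the split it learns. The sketch below refits a stump on freshly generated data (the names `X_demo`, `stump` and the seed are assumptions for the example) and reads the root split value from the fitted tree's `tree_.threshold` array; the value should land very close to zero, which is the true class boundary.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
X_demo = rng.randn(5000, 1)
y_demo = (X_demo[:, 0] > 0) * 2 - 1

stump = DecisionTreeClassifier(max_depth=1).fit(X_demo, y_demo)
threshold = stump.tree_.threshold[0]   # split value at the root node

print("Learned split threshold:", threshold)
print("Training accuracy:", stump.score(X_demo, y_demo))
```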


The decision tree classifier needed only one stump, but a gradient boosting classifier needs far more, on the order of 800 trees, to classify the data correctly. We use sklearn's GradientBoostingClassifier, fit below with n_estimators=5000.

In [12]:
clf = GradientBoostingClassifier(n_estimators=5000, learning_rate=0.01, max_depth=3, random_state=0)
clf.fit(X_train, y_train)

GradientBoostingClassifier(criterion='friedman_mse', init=None,
              learning_rate=0.01, loss='deviance', max_depth=3,
              max_features=None, max_leaf_nodes=None,
              min_impurity_decrease=0.0, min_impurity_split=None,
              min_samples_leaf=1, min_samples_split=2,
              min_weight_fraction_leaf=0.0, n_estimators=5000,
              n_iter_no_change=None, presort='auto', random_state=0,
              subsample=1.0, tol=0.0001, validation_fraction=0.1,
              verbose=0, warm_start=False)

In [15]:
print('Accuracy for Gradient Boosting: {}'.format(clf.score(X_test, y_test)))

Accuracy for Gradient Boosting: 1.0
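
One way to explore how the number of trees affects the result is `staged_predict`, which yields the ensemble's predictions after each boosting stage. The sketch below is an illustration under assumptions, not the notebook's own experiment: it uses a smaller dataset and only 200 trees so that it runs quickly, and the names `gbm` and `staged_acc` are made up for the example.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
X_demo = rng.randn(2000, 1)
y_demo = (X_demo[:, 0] > 0) * 2 - 1
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.5, random_state=42
)

# Far fewer trees than in the notebook, purely to keep the sketch fast
gbm = GradientBoostingClassifier(
    n_estimators=200, learning_rate=0.01, max_depth=3, random_state=0
)
gbm.fit(X_tr, y_tr)

# Test accuracy of the partial ensemble after each boosting stage
staged_acc = [np.mean(pred == y_te) for pred in gbm.staged_predict(X_te)]
for n_trees in (1, 10, 50, 200):
    print(n_trees, staged_acc[n_trees - 1])
```

The last entry of `staged_acc` corresponds to the full ensemble, so it should agree with `gbm.score(X_te, y_te)`.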


The predict_proba method returns the probability that each data point belongs to each class. As our metric, we will use the log_loss function from sklearn. Find more about it [here](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.log_loss.html)

In [17]:
y_pred = clf.predict_proba(X_test)[:, 1]
print("Test logloss: {}".format(log_loss(y_test, y_pred)))

Test logloss: 0.00031395706515999623
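
The sigmoid helper defined at the top ties these pieces together. For a binary GradientBoostingClassifier trained with the default log-loss objective, `decision_function` returns raw scores, and passing them through the sigmoid should reproduce the class-1 column of `predict_proba`. The sketch below checks this and also recomputes the log loss by hand. It runs on a smaller dataset with 200 trees to stay fast; the names `gbm`, `raw`, `proba`, `y01` and `manual` are assumptions for the example.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import log_loss

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

rng = np.random.RandomState(0)
X_demo = rng.randn(2000, 1)
y_demo = (X_demo[:, 0] > 0) * 2 - 1
X_tr, y_tr = X_demo[:1000], y_demo[:1000]
X_te, y_te = X_demo[1000:], y_demo[1000:]

gbm = GradientBoostingClassifier(
    n_estimators=200, learning_rate=0.01, max_depth=3, random_state=0
)
gbm.fit(X_tr, y_tr)

proba = gbm.predict_proba(X_te)[:, 1]   # probability of class 1
raw = gbm.decision_function(X_te)       # raw scores (log-odds) before the sigmoid
print(np.allclose(sigmoid(raw), proba))

# Log loss by hand: map labels {-1, 1} to {0, 1} and average the negative log-likelihood
y01 = (y_te == 1).astype(float)
manual = -np.mean(y01 * np.log(proba) + (1 - y01) * np.log(1 - proba))
print(manual, log_loss(y_te, proba))
```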
