# Histogram Gradient Boosting Classification
This is a basic test of the *histogram-based gradient boosting classification tree* estimator, which is available in scikit-learn as [sklearn.ensemble.HistGradientBoostingClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.HistGradientBoostingClassifier.html) (experimental before scikit-learn 1.0). On large datasets (roughly 10,000 samples or more) this estimator is much faster than `GradientBoostingClassifier`.
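
The speed comes largely from binning: each continuous feature is first mapped to a small number of integer bins (at most 255 by default in scikit-learn, via the `max_bins` parameter), so a tree node only needs to evaluate one candidate split per bin rather than one per distinct feature value. The following is a rough NumPy sketch of that idea, not scikit-learn's actual implementation. The helper names `bin_feature` and `best_split_bin` are hypothetical, the data is synthetic, and the gain is a simplified squared-gradient criterion that assumes unit hessians.

```python
import numpy as np

def bin_feature(values, max_bins=255):
    """Map a continuous feature to integer bin indices using quantile edges."""
    # Interior quantiles give (max_bins - 1) candidate edges; np.unique
    # removes duplicates that appear when a feature has few distinct values.
    quantiles = np.linspace(0, 1, max_bins + 1)[1:-1]
    edges = np.unique(np.quantile(values, quantiles))
    # Bin b holds the values v with edges[b - 1] <= v < edges[b].
    binned = np.searchsorted(edges, values, side="right").astype(np.uint8)
    return binned, edges

def best_split_bin(binned, gradients, n_bins):
    """Return the bin b whose split (bins <= b go left) maximises a simplified gain."""
    # One pass over the samples builds per-bin gradient sums and counts.
    grad_hist = np.bincount(binned, weights=gradients, minlength=n_bins)
    count_hist = np.bincount(binned, minlength=n_bins)
    # Every candidate split is then scored from cumulative sums over bins.
    grad_left = np.cumsum(grad_hist)[:-1]
    n_left = np.cumsum(count_hist)[:-1]
    grad_right = grad_hist.sum() - grad_left
    n_right = count_hist.sum() - n_left
    valid = (n_left > 0) & (n_right > 0)
    gain = np.full(n_bins - 1, -np.inf)
    gain[valid] = (grad_left[valid] ** 2 / n_left[valid]
                   + grad_right[valid] ** 2 / n_right[valid])
    return int(np.argmax(gain))

rng = np.random.default_rng(0)
x = rng.normal(size=10_000)
gradients = np.where(x > 0.5, 1.0, -1.0)  # toy gradients with a step at 0.5

binned, edges = bin_feature(x)
split = best_split_bin(binned, gradients, n_bins=len(edges) + 1)
threshold = edges[split]  # samples with x < threshold go to the left child
print(f"{len(edges) + 1} bins, chosen threshold: {threshold:.3f}")
```

With 10,000 samples the split search scans at most 254 bin boundaries instead of up to 9,999 sorted sample boundaries, and the chosen threshold lands close to the true step at 0.5.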

We use the [Santander Customer Satisfaction](https://www.kaggle.com/c/santander-customer-satisfaction) data. The winning Private Score for this competition was `0.82907`, achieved with an ensemble solution. Here we obtain a Private Score of `0.82066` with no feature engineering whatsoever.

In [None]:
import numpy  as np
import pandas as pd

train_data = pd.read_csv('../input/santander-customer-satisfaction/train.csv',index_col=0)
test_data  = pd.read_csv('../input/santander-customer-satisfaction/test.csv', index_col=0)
sample     = pd.read_csv('../input/santander-customer-satisfaction/sample_submission.csv')

# Features are every column except the final `TARGET` column
X_train = train_data.drop(columns='TARGET')
y_train = train_data['TARGET']

Now for the histogram gradient boosting classifier:

In [None]:
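# scikit-learn < 1.0 only: this experimental estimator must be enabled explicitly.
# From version 1.0 onward the estimator is stable and this import is no longer needed.
from sklearn.experimental import enable_hist_gradient_boosting  # noqa
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.model_selection import cross_val_score

classifier = HistGradientBoostingClassifier()

# Estimate performance with 2-fold cross-validation (each fold fits a fresh clone)
CVscores = cross_val_score(classifier, X_train, y_train, scoring='roc_auc', cv=2)
print("The average CV score is", CVscores.mean())

# Fit on the full training set, then predict the probability of the positive class
classifier.fit(X_train, y_train)
predictions = classifier.predict_proba(test_data)[:,1]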
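
The `roc_auc` scorer used in the cross-validation above reports the area under the ROC curve. One way to read that number is as the probability that a randomly chosen positive sample receives a higher predicted score than a randomly chosen negative one, with ties counted as one half. Below is a minimal NumPy sketch of that pairwise definition. The helper name `roc_auc_pairwise` is hypothetical, the labels and scores are toy values, and the pairwise comparison matrix makes it practical only for small arrays; scikit-learn's `roc_auc_score` is the sensible choice for real data.

```python
import numpy as np

def roc_auc_pairwise(y_true, y_score):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score, dtype=float)
    pos = y_score[y_true == 1]
    neg = y_score[y_true == 0]
    # Compare every positive score with every negative score.
    diff = pos[:, None] - neg[None, :]
    correct = (diff > 0).sum()
    ties = (diff == 0).sum()
    return float((correct + 0.5 * ties) / diff.size)

# Toy example: three of the four (positive, negative) pairs are ordered correctly.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(roc_auc_pairwise(labels, scores))
```

Because the metric depends only on how the scores rank the samples, the raw probabilities from `predict_proba` can be submitted without thresholding.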

Finally, write out a `submission.csv` file:

In [None]:
# Replace the placeholder TARGET column with the predicted probabilities
sample['TARGET'] = predictions
sample.to_csv('submission.csv',index=False)

# Related links
* [Aleksei Guryanov "Histogram-Based Algorithm for Building Gradient Boosting Ensembles of Piecewise Linear Decision Trees", In: van der Aalst W. et al. (eds) Analysis of Images, Social Networks and Texts. AIST 2019. Lecture Notes in Computer Science, vol 11832. Springer (2019)](https://link.springer.com/chapter/10.1007%2F978-3-030-37334-4_4)