# Histogram-Based Gradient Boosting

This is a form of gradient boosted trees that works not with the raw continuous values of the variables, but with histogram-based binned values.

Typically, some power of 2 is used as the number of bins for each continuous variable, often 256.

When constructing trees, each binned variable has at most 255 bin boundaries (between 256 bins) to consider as split points, rather than one candidate per distinct value. This is a modest loss in resolution, but a huge gain in speed, particularly with very large data sets.
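As a rough illustration of the binning step, the sketch below bins one synthetic continuous variable into 256 quantile-based bins with NumPy. The names (`x_raw`, `edges`, `bin_codes`) are illustrative only, and scikit-learn's internal binning differs in detail.

```python
import numpy as np

rng = np.random.default_rng(0)
x_raw = rng.lognormal(mean=0.0, sigma=1.0, size=100_000)  # synthetic continuous variable

n_bins = 256
# interior bin edges at evenly spaced quantiles: 255 edges separate 256 bins
edges = np.quantile(x_raw, np.linspace(0, 1, n_bins + 1)[1:-1])

# every value is replaced by a small integer bin code in 0..255
bin_codes = np.searchsorted(edges, x_raw, side="right").astype(np.uint8)

print(len(edges))                          # candidate split points after binning
print(bin_codes.min(), bin_codes.max())    # range of the bin codes
print(len(np.unique(x_raw)))               # distinct raw values before binning
```

After binning, a split search only has to scan the bin boundaries instead of every distinct raw value, which is where the speed gain comes from.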

Histogram-based trees can also work with missing data without resorting to imputation: the missing values are simply assigned to a dedicated bin. At each split point, the algorithm tries placing the missing-value bin on either side and assigns it to the side that produces the greatest improvement in the loss.

You do not need to one-hot encode categorical variables; the model will run with integer-coded categories.

Like other tree methods, it may be used for regression or classification.

There is an implementation of histogram gradient boosting in scikit-learn:

https://scikit-learn.org/stable/modules/ensemble.html#histogram-based-gradient-boosting

A widely used histogram-based gradient boosting implementation was produced by Microsoft and is distributed as the LightGBM package. LightGBM can be used from within Python, Dask, or R, or as a stand-alone piece of software. According to the scikit-learn documentation, its histogram gradient boosting estimators are inspired by LightGBM; they are a separate implementation, not an API to the Microsoft code.

LightGBM can be configured to train models on a GPU, which would be a reason to consider using LightGBM directly instead of the scikit-learn estimators.
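As a hedged sketch of what that configuration might look like: LightGBM documents a `device_type` parameter whose default is `"cpu"`. The other values below are placeholders chosen for illustration, and actually fitting the model requires a GPU-enabled LightGBM build, which is assumed rather than shown here.

```python
# Parameters for LightGBM's scikit-learn style estimator (illustrative values).
gpu_params = {
    "device_type": "gpu",    # LightGBM option; "cpu" is the default
    "max_bin": 255,          # maximum number of histogram bins per feature
    "n_estimators": 200,
    "learning_rate": 0.1,
}

try:
    import lightgbm as lgb
    # Constructing the estimator does not touch the GPU; calling fit() would.
    gpu_model = lgb.LGBMRegressor(**gpu_params)
except ImportError:
    gpu_model = None         # lightgbm is not installed in this environment
```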

Checked 1/17/2023

Let's try running a histogram gradient boosting model on the Lyon housing data set, since it is a relatively large and somewhat challenging data set.

No need to do one-hot encoding either...

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
# Load the data from a local file; change the path to match your machine.
# See the following cells for loading from Google Drive instead.

infile= "C:\\Users\\hdavi\\Dropbox\\Data_Analytics\\DAT_514_Machine_Learning\\Example_data\\Lyons_Housing\\lyon_housing.csv"
lyon=pd.read_csv(infile)
lyon.head()

Unnamed: 0,date_transaction,type_achat,type_bien,nombre_pieces,surface_logement,surface_carrez_logement,surface_terrain,nombre_parkings,prix,adresse,commune,latitude,longitude,date_construction,anciennete
0,2019-10-31,ancien,maison,5,100.0,,247.0,0,530000.0,6 PAS DES ANTONINS,Villeurbanne,45.781673,4.879333,2003-06-11 11:38:24,16.387783
1,2018-11-26,ancien,maison,2,52.0,,156.0,0,328550.0,12 RUE DU LUIZET,Villeurbanne,45.78324,4.884683,2003-06-11 11:38:24,15.459633
2,2016-08-04,ancien,appartement,1,28.0,28.2,0.0,1,42500.0,4 RUE DE L ESPOIR,Villeurbanne,45.781488,4.883474,2003-06-11 11:38:24,13.148839
3,2016-11-18,ancien,appartement,3,67.0,66.3,0.0,1,180900.0,6 RUE DE L ESPOIR,Villeurbanne,45.781488,4.883474,2003-06-11 11:38:24,13.439058
4,2016-12-16,ancien,appartement,1,28.0,,0.0,1,97000.0,163 AV ROGER SALENGRO,Villeurbanne,45.781488,4.883474,2003-06-11 11:38:24,13.515719


In [2]:
# this is the process for importing the data from your Google drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
path = "/content/drive/MyDrive/DAT514_data/lyon_housing.csv"

lyon=pd.read_csv(path)
lyon.head()

Unnamed: 0,date_transaction,type_achat,type_bien,nombre_pieces,surface_logement,surface_carrez_logement,surface_terrain,nombre_parkings,prix,adresse,commune,latitude,longitude,date_construction,anciennete
0,2019-10-31,ancien,maison,5,100.0,,247.0,0,530000.0,6 PAS DES ANTONINS,Villeurbanne,45.781673,4.879333,2003-06-11 11:38:24,16.387783
1,2018-11-26,ancien,maison,2,52.0,,156.0,0,328550.0,12 RUE DU LUIZET,Villeurbanne,45.78324,4.884683,2003-06-11 11:38:24,15.459633
2,2016-08-04,ancien,appartement,1,28.0,28.2,0.0,1,42500.0,4 RUE DE L ESPOIR,Villeurbanne,45.781488,4.883474,2003-06-11 11:38:24,13.148839
3,2016-11-18,ancien,appartement,3,67.0,66.3,0.0,1,180900.0,6 RUE DE L ESPOIR,Villeurbanne,45.781488,4.883474,2003-06-11 11:38:24,13.439058
4,2016-12-16,ancien,appartement,1,28.0,,0.0,1,97000.0,163 AV ROGER SALENGRO,Villeurbanne,45.781488,4.883474,2003-06-11 11:38:24,13.515719


In [4]:
lyon['date_transaction']=pd.to_datetime(lyon['date_transaction'])
lyon['year_transaction']=lyon['date_transaction'].dt.year
lyon['date_construction']=pd.to_datetime(lyon['date_construction'])
lyon['year_construction']=lyon['date_construction'].dt.year
lyon.head()

Unnamed: 0,date_transaction,type_achat,type_bien,nombre_pieces,surface_logement,surface_carrez_logement,surface_terrain,nombre_parkings,prix,adresse,commune,latitude,longitude,date_construction,anciennete,year_transaction,year_construction
0,2019-10-31,ancien,maison,5,100.0,,247.0,0,530000.0,6 PAS DES ANTONINS,Villeurbanne,45.781673,4.879333,2003-06-11 11:38:24,16.387783,2019,2003
1,2018-11-26,ancien,maison,2,52.0,,156.0,0,328550.0,12 RUE DU LUIZET,Villeurbanne,45.78324,4.884683,2003-06-11 11:38:24,15.459633,2018,2003
2,2016-08-04,ancien,appartement,1,28.0,28.2,0.0,1,42500.0,4 RUE DE L ESPOIR,Villeurbanne,45.781488,4.883474,2003-06-11 11:38:24,13.148839,2016,2003
3,2016-11-18,ancien,appartement,3,67.0,66.3,0.0,1,180900.0,6 RUE DE L ESPOIR,Villeurbanne,45.781488,4.883474,2003-06-11 11:38:24,13.439058,2016,2003
4,2016-12-16,ancien,appartement,1,28.0,,0.0,1,97000.0,163 AV ROGER SALENGRO,Villeurbanne,45.781488,4.883474,2003-06-11 11:38:24,13.515719,2016,2003


In [5]:
temp=pd.cut(lyon.anciennete,bins=[-5,0,5,10,20,30,40],labels=['UnderConstruction','0-5','5-10','10-20','20-30','30+'])
lyon['age']=temp
lyon.head(3)

Unnamed: 0,date_transaction,type_achat,type_bien,nombre_pieces,surface_logement,surface_carrez_logement,surface_terrain,nombre_parkings,prix,adresse,commune,latitude,longitude,date_construction,anciennete,year_transaction,year_construction,age
0,2019-10-31,ancien,maison,5,100.0,,247.0,0,530000.0,6 PAS DES ANTONINS,Villeurbanne,45.781673,4.879333,2003-06-11 11:38:24,16.387783,2019,2003,10-20
1,2018-11-26,ancien,maison,2,52.0,,156.0,0,328550.0,12 RUE DU LUIZET,Villeurbanne,45.78324,4.884683,2003-06-11 11:38:24,15.459633,2018,2003,10-20
2,2016-08-04,ancien,appartement,1,28.0,28.2,0.0,1,42500.0,4 RUE DE L ESPOIR,Villeurbanne,45.781488,4.883474,2003-06-11 11:38:24,13.148839,2016,2003,10-20
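One detail of `pd.cut` worth keeping in mind with these bins: the intervals are closed on the right, and any value outside the outer edges becomes NaN. The toy series below (made-up ages, not taken from the Lyon data) shows that a value above 40 would not land in the '30+' bin.

```python
import numpy as np
import pandas as pd

toy_ages = pd.Series([-1.0, 0.0, 3.2, 16.4, 40.0, 55.0, np.nan])
toy_binned = pd.cut(
    toy_ages,
    bins=[-5, 0, 5, 10, 20, 30, 40],
    labels=['UnderConstruction', '0-5', '5-10', '10-20', '20-30', '30+'],
)
print(toy_binned.tolist())

# rows that fell outside every bin, plus rows that were already missing
n_unbinned = int(toy_binned.isna().sum())
print(n_unbinned)
```

If `anciennete` ever exceeds 40 in the real data, widening the last bin edge (for example to `np.inf`) would keep those rows in the '30+' category.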


# Ordinal Encoding

To use the histogram gradient boosting estimator here, we do need to encode the categorical variables as integer codes.

The pandas category option did not work here.

In [6]:
from sklearn.preprocessing import OrdinalEncoder

# one encoder per categorical column, so each mapping can be inspected or inverted later
enc_achat=OrdinalEncoder()
enc_bien=OrdinalEncoder()
enc_commune=OrdinalEncoder()
enc_age=OrdinalEncoder()

# replace each categorical column with int32 category codes
lyon.type_achat=enc_achat.fit_transform(lyon[['type_achat']]).astype("int32")
lyon.type_bien=enc_bien.fit_transform(lyon[['type_bien']]).astype("int32")
lyon.commune=enc_commune.fit_transform(lyon[['commune']]).astype("int32")
lyon.age=enc_age.fit_transform(lyon[['age']]).astype("int32")
lyon.head()

Unnamed: 0,date_transaction,type_achat,type_bien,nombre_pieces,surface_logement,surface_carrez_logement,surface_terrain,nombre_parkings,prix,adresse,commune,latitude,longitude,date_construction,anciennete,year_transaction,year_construction,age
0,2019-10-31,1,1,5,100.0,,247.0,0,530000.0,6 PAS DES ANTONINS,9,45.781673,4.879333,2003-06-11 11:38:24,16.387783,2019,2003,1
1,2018-11-26,1,1,2,52.0,,156.0,0,328550.0,12 RUE DU LUIZET,9,45.78324,4.884683,2003-06-11 11:38:24,15.459633,2018,2003,1
2,2016-08-04,1,0,1,28.0,28.2,0.0,1,42500.0,4 RUE DE L ESPOIR,9,45.781488,4.883474,2003-06-11 11:38:24,13.148839,2016,2003,1
3,2016-11-18,1,0,3,67.0,66.3,0.0,1,180900.0,6 RUE DE L ESPOIR,9,45.781488,4.883474,2003-06-11 11:38:24,13.439058,2016,2003,1
4,2016-12-16,1,0,1,28.0,,0.0,1,97000.0,163 AV ROGER SALENGRO,9,45.781488,4.883474,2003-06-11 11:38:24,13.515719,2016,2003,1
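To see how the integer codes are assigned, the sketch below fits a single `OrdinalEncoder` on two toy columns (made-up rows, not the real data). With the default settings the categories of each column are sorted, and the position in `categories_` is the code, which is why 'appartement' maps to 0 and 'maison' to 1 in the output above.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

toy = pd.DataFrame({
    "type_bien": ["maison", "appartement", "appartement"],
    "commune": ["Villeurbanne", "Lyon", "Villeurbanne"],   # toy values
})

cat_cols = ["type_bien", "commune"]
toy_enc = OrdinalEncoder()
toy_codes = toy_enc.fit_transform(toy[cat_cols]).astype("int32")

print(toy_enc.categories_)   # sorted category values for each column
print(toy_codes)             # integer codes, one column per encoded variable

# the fitted encoder can map codes back to the original labels
toy_decoded = toy_enc.inverse_transform(toy_codes)
print(toy_decoded)
```

A single encoder over several columns is an alternative to one encoder per column; both give the same codes.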


In [7]:
lyon.columns

Index(['date_transaction', 'type_achat', 'type_bien', 'nombre_pieces',
       'surface_logement', 'surface_carrez_logement', 'surface_terrain',
       'nombre_parkings', 'prix', 'adresse', 'commune', 'latitude',
       'longitude', 'date_construction', 'anciennete', 'year_transaction',
       'year_construction', 'age'],
      dtype='object')

In [8]:
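# NOTE: 'year_transaction' is listed twice, so X carries that column twice (see the dtypes output below);
# 'year_construction' may have been intended for one of the two entries
lyon_pred=['type_achat', 'type_bien','commune','year_transaction','nombre_pieces','surface_logement', 'surface_terrain', 'nombre_parkings' ,'anciennete', 'year_transaction','age']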

X=lyon[lyon_pred]
X.head()
y=lyon.prix

In [9]:
X.dtypes

type_achat            int32
type_bien             int32
commune               int32
year_transaction      int64
nombre_pieces         int64
surface_logement    float64
surface_terrain     float64
nombre_parkings       int64
anciennete          float64
year_transaction      int64
age                   int32
dtype: object

In [10]:
from sklearn.ensemble import HistGradientBoostingRegressor

# the int32 columns are the ordinal-encoded categorical variables
my_hgbr=HistGradientBoostingRegressor(
    loss="squared_error",
    learning_rate=0.1,
    max_iter=200,
    max_leaf_nodes=21,
    min_samples_leaf=15,
    l2_regularization=0,
    categorical_features=(X.dtypes=="int32"),
)
my_hgbr.fit(X,y)


HistGradientBoostingRegressor(categorical_features=type_achat           True
type_bien            True
commune              True
year_transaction    False
nombre_pieces       False
surface_logement    False
surface_terrain     False
nombre_parkings     False
anciennete          False
year_transaction    False
age                  True
dtype: bool,
                              l2_regularization=0, max_iter=200,
                              max_leaf_nodes=21, min_samples_leaf=15)

In [11]:
y_pred=my_hgbr.predict(X)

In [12]:
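from sklearn.metrics import explained_variance_score

# both scores are computed on the training data (in-sample)
print(explained_variance_score(y,y_pred))   # explained variance

print(np.mean( (y-y_pred)**2)**0.5)         # root mean squared error (RMSE)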

0.8175513715995163
65989.7113119613
