# 7. Binning Numeric Columns

Numeric columns such as year are sometimes better represented as a nominal categorical variable, which is suitable for one-hot encoding. Encoding each unique year separately would yield a large number of columns, many of them with very few non-zero entries. Instead, we can group the years into a much smaller number of bins. The `KBinsDiscretizer` transformer, introduced in scikit-learn version 0.20, does this.
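
To make the column-count argument concrete, here is a minimal sketch on made-up data (one row for each year from 1900 through 2009, not the housing sample). It compares one-hot encoding every unique year with binning the years first; the choice of six bins is an arbitrary assumption for illustration.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer, OneHotEncoder

# Hypothetical data: one row per year from 1900 through 2009 (110 unique values)
years = np.arange(1900, 2010).reshape(-1, 1)

# One-hot encoding every unique year creates one column per year
per_year = OneHotEncoder().fit_transform(years)

# Binning first creates one column per bin instead
kbd = KBinsDiscretizer(n_bins=6, encode='onehot', strategy='uniform')
binned = kbd.fit_transform(years)

print(per_year.shape)  # (110, 110)
print(binned.shape)    # (110, 6)
```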

In [None]:
import pandas as pd
hs = pd.read_csv('data/housing_sample.csv')
hs.head()

## Using `KBinsDiscretizer`

The `KBinsDiscretizer` transformer is found in the `preprocessing` module. When instantiating it, you choose the number of bins to divide your numeric data into and whether you want one-hot or ordinal encoding. The default value for `encode` is `'onehot'`, which returns a sparse matrix. We choose `'onehot-dense'` to return a regular numpy array. The `strategy` parameter controls how the bins are created: with equally spaced edges (`'uniform'`), with the same number of observations in each bin (`'quantile'`), or with a more complex approach based on k-means clustering (`'kmeans'`).
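
The three `encode` options can be compared on a tiny example. This is only a sketch on five made-up years with two uniform bins, so the values are assumptions chosen to make the output easy to read.

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import KBinsDiscretizer

# Hypothetical data: five evenly spaced years
years = np.array([[1900.], [1925.], [1950.], [1975.], [2000.]])

outputs = {}
for encode in ['onehot', 'onehot-dense', 'ordinal']:
    kbd = KBinsDiscretizer(n_bins=2, encode=encode, strategy='uniform')
    outputs[encode] = kbd.fit_transform(years)

# 'onehot' returns a scipy sparse matrix with one column per bin
print(sparse.issparse(outputs['onehot']), outputs['onehot'].shape)

# 'onehot-dense' returns the same indicator columns as a numpy array
print(outputs['onehot-dense'])

# 'ordinal' returns a single column holding the bin index of each row
print(outputs['ordinal'].ravel())
```

Ordinal encoding keeps a single column and preserves the order of the bins, which can be enough for tree-based models; the one-hot variants treat each bin as an unrelated category.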

In [None]:
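# Two-dimensional array holding only the year built
X = hs[['YearBuilt']].values
# pop removes SalePrice from hs and returns it, so hs no longer contains the target
y = hs.pop('SalePrice').values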

Let's bin and encode the year built.

In [None]:
from sklearn.preprocessing import KBinsDiscretizer
kbd = KBinsDiscretizer(n_bins=6, encode='onehot-dense', strategy='uniform')
kbd.fit_transform(hs[['YearBuilt']])

You can see the edges of the bins with the `bin_edges_` attribute.

In [None]:
kbd.bin_edges_
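
The effect of `strategy` is easiest to see by comparing `bin_edges_` across strategies. The sketch below uses synthetic, deliberately skewed years (an assumption made for illustration, not the housing sample) and prints the edges and the number of rows per bin for each strategy.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Hypothetical skewed data: most houses are recent, a few are much older
rng = np.random.RandomState(0)
recent = rng.randint(1990, 2011, size=900)
old = rng.randint(1880, 1990, size=100)
years = np.concatenate([recent, old]).reshape(-1, 1).astype(float)

edges = {}
for strategy in ['uniform', 'quantile', 'kmeans']:
    kbd = KBinsDiscretizer(n_bins=4, encode='ordinal', strategy=strategy)
    codes = kbd.fit_transform(years)
    # bin_edges_ holds one array of edges per input column
    edges[strategy] = kbd.bin_edges_[0]
    counts = np.bincount(codes.ravel().astype(int), minlength=4)
    print(strategy, np.round(edges[strategy], 1), counts)
```

With skewed data like this, `'uniform'` tends to leave most rows in the last bin, while `'quantile'` moves the edges so that the bins hold roughly equal numbers of rows.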

## Putting it all together

In [None]:
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler, KBinsDiscretizer

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score, KFold, GridSearchCV
from mymetrics import root_mean_squared_log_error

# string pipeline
string_si = SimpleImputer(strategy='constant', fill_value='MISSING')
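# In scikit-learn 1.2 and later this parameter is named sparse_output;
# the old name sparse was removed in 1.4
ohe = OneHotEncoder(sparse=False)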
steps = [('impute', string_si), ('encode', ohe)]
string_pipe = Pipeline(steps)

# numeric pipeline
numeric_si = SimpleImputer(strategy='mean')
ss = StandardScaler()
steps = [('si', numeric_si), ('standardize', ss)]
numeric_pipe = Pipeline(steps)

# year transformation
kbd = KBinsDiscretizer(n_bins=10, encode='onehot-dense', strategy='uniform')

# columns
string_cols = ['Neighborhood', 'Exterior1st']
numeric_cols = ['LotFrontage', 'GrLivArea', 'GarageArea']

transformers = [('string', string_pipe, string_cols), 
                ('numeric', numeric_pipe, numeric_cols), 
                ('year', kbd, ['YearBuilt'])]

ct = ColumnTransformer(transformers)
rfr = RandomForestRegressor()
steps = [('transformers', ct), ('rfr', rfr)]
final_pipe = Pipeline(steps)

kf = KFold(n_splits=5, shuffle=True)
grid = {'transformers__numeric__si__strategy': ['mean', 'median'],
       'rfr__n_estimators': [50, 100], 'rfr__max_depth': range(2, 6)}
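# A callable passed to scoring must have the signature (estimator, X, y);
# GridSearchCV treats higher scores as better
gs = GridSearchCV(final_pipe, grid, cv=kf, scoring=root_mean_squared_log_error)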
gs.fit(hs, y)
gs.best_params_
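
The keys in `grid` are built by joining step names with double underscores. One way to discover the valid names is to call `get_params` on the pipeline. The sketch below rebuilds only the `'year'` part of the pipeline above, so it is a simplified stand-in rather than the full `final_pipe`.

```python
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import KBinsDiscretizer

# Simplified stand-in for final_pipe: only the year transformer and the model
kbd = KBinsDiscretizer(n_bins=10, encode='onehot-dense', strategy='uniform')
ct = ColumnTransformer([('year', kbd, ['YearBuilt'])])
pipe = Pipeline([('transformers', ct), ('rfr', RandomForestRegressor())])

# Every tunable parameter appears under a name of the form step__substep__parameter
year_params = sorted(name for name in pipe.get_params()
                     if name.startswith('transformers__year__'))
print(year_params)
```

Any of the printed names can be used as a key in the grid passed to `GridSearchCV`.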

## Exercise 

Use `KBinsDiscretizer`, experimenting with the number of bins and the binning strategy.