# Stacking ensemble using House Prices data

This is a short example of using the scikit-learn [Stacking Regressor](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.StackingRegressor.html), which implements the stacked generalization technique.

For the ensemble base learners we shall use [XGBoost](https://github.com/dmlc/xgboost), [CatBoost](https://github.com/catboost/catboost), and the [Regularized Greedy Forest (RGF)](https://github.com/RGF-team/rgf/tree/master/python-package) (see my notebook ["Introduction to the Regularized Greedy Forest"](https://www.kaggle.com/carlmcbrideellis/introduction-to-the-regularized-greedy-forest) for more details).
For the meta estimator we shall use the [Random Forest Regressor](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html).
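
To make the roles of the base learners and the meta estimator concrete, here is a minimal hand-rolled sketch of stacked generalization. It is not the scikit-learn implementation: the synthetic data and the `Ridge` and `GradientBoostingRegressor` base learners are stand-in assumptions chosen so the snippet runs without extra packages.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_predict

# Synthetic stand-in data (an assumption; not the House Prices data)
X, y = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=0)

# Stand-in base learners (assumptions; the notebook uses XGBoost, CatBoost and RGF)
base_learners = [
    Ridge(alpha=1.0),
    GradientBoostingRegressor(n_estimators=50, random_state=0),
]

# Level 0: out-of-fold predictions, so every row is predicted by a model
# that did not see that row during fitting
meta_features = np.column_stack(
    [cross_val_predict(model, X, y, cv=5) for model in base_learners]
)

# Level 1: the meta estimator learns how to combine the base predictions
meta_estimator = RandomForestRegressor(n_estimators=50, random_state=0)
meta_estimator.fit(meta_features, y)

# For new rows, refit each base learner on all the data, then feed
# their predictions to the meta estimator
fitted = [model.fit(X, y) for model in base_learners]
X_new = X[:5]
new_meta = np.column_stack([model.predict(X_new) for model in fitted])
stacked_pred = meta_estimator.predict(new_meta)
```

Training the meta estimator on out-of-fold predictions, rather than on predictions for rows the base learners have already seen, is what stops it from simply trusting whichever base learner overfits the most.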

### Install the Regularized Greedy Forest (`rgf_python`):

In [None]:
!pip install rgf_python

### Set up the House Prices competition data

In [None]:
import pandas  as pd
import numpy   as np

#===========================================================================
# read in the data
#===========================================================================
train_data = pd.read_csv('../input/house-prices-advanced-regression-techniques/train.csv')
test_data  = pd.read_csv('../input/house-prices-advanced-regression-techniques/test.csv')

#===========================================================================
# select some features
#===========================================================================
features = ['MSSubClass', 'LotArea', 'OverallQual', 'OverallCond', 
        'YearBuilt', 'YearRemodAdd', 'BsmtFinSF1', 'BsmtFinSF2', 
        'BsmtUnfSF', 'TotalBsmtSF', '1stFlrSF', '2ndFlrSF', 
        'LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 
        'FullBath', 'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 
        'TotRmsAbvGrd', 'Fireplaces', 'GarageCars', 'GarageArea', 
        'WoodDeckSF', 'OpenPorchSF', 'EnclosedPorch', '3SsnPorch', 
        'ScreenPorch', 'PoolArea', 'MiscVal', 'MoSold', 'YrSold']

#===========================================================================
# build the feature matrices and the target
#===========================================================================
X_train       = train_data[features]
y_train       = train_data["SalePrice"]
X_test        = test_data[features]

#===========================================================================
# imputation; substitute any 'NaN' with the training-set column mean,
# so that train and test are filled with the same values
#===========================================================================
X_train      = X_train.fillna(X_train.mean())
X_test       = X_test.fillna(X_train.mean())
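
An alternative to `fillna` is scikit-learn's `SimpleImputer`, which learns the column means once from the training data and reuses them on the test data. The sketch below uses two toy frames with made-up values as stand-ins for `X_train` and `X_test`.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frames standing in for X_train and X_test (hypothetical values)
toy_train = pd.DataFrame({"LotArea":    [8000.0, 9000.0, np.nan],
                          "GarageCars": [1.0, 2.0, 3.0]})
toy_test  = pd.DataFrame({"LotArea":    [np.nan, 7000.0],
                          "GarageCars": [np.nan, 2.0]})

# Learn the column means from the training frame only
imputer = SimpleImputer(strategy="mean")
toy_train_filled = pd.DataFrame(imputer.fit_transform(toy_train),
                                columns=toy_train.columns)

# Reuse those same training means to fill the gaps in the test frame
toy_test_filled = pd.DataFrame(imputer.transform(toy_test),
                               columns=toy_test.columns)
```

The same imputer can also be placed in a scikit-learn `Pipeline` in front of a model, so that the means are re-learned inside each cross-validation fold.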

### Build and run the ensemble

In [None]:
from rgf.sklearn import RGFRegressor
import xgboost as xgb
from catboost import CatBoostRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import StackingRegressor
from sklearn.model_selection import train_test_split

estimators =  [('xgb',xgb.XGBRegressor(n_estimators  = 750,learning_rate = 0.02, max_depth = 5)),
               ('cat',CatBoostRegressor(loss_function='RMSE', verbose=False)),
               ('RGF',RGFRegressor(max_leaf=500, algorithm="RGF_Sib", test_interval=100, loss="LS"))]

ensemble = StackingRegressor(estimators      =  estimators,
                             final_estimator =  RandomForestRegressor())

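# Fit the ensemble on a training split and score it on a hold-out split
# (train_test_split keeps 25% of the rows for testing by default).
# StackingRegressor itself uses 5-fold cross-validation internally to
# produce the predictions on which the final estimator is trained.
X_tr, X_te, y_tr, y_te = train_test_split(X_train, y_train)
ensemble.fit(X_tr, y_tr)
print("Hold-out R^2:", ensemble.score(X_te, y_te))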

# Prediction
predictions = ensemble.predict(X_test)
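
The `score` method of a scikit-learn regressor returns R², whereas the House Prices competition is evaluated on the RMSE between the logarithm of the predicted price and the logarithm of the observed price. A small helper for that metric might look as follows; the name `rmse_of_logs` and the demo prices are hypothetical and not part of the original notebook.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

def rmse_of_logs(y_true, y_pred):
    """RMSE between log(y_true) and log(y_pred); both must be strictly positive."""
    return float(np.sqrt(mean_squared_error(np.log(y_true), np.log(y_pred))))

# Made-up prices, purely to show the call
y_true_demo = np.array([200000.0, 150000.0, 320000.0])
y_pred_demo = np.array([210000.0, 140000.0, 300000.0])
print(rmse_of_logs(y_true_demo, y_pred_demo))
```

With the variables defined above this would be called as `rmse_of_logs(y_te, ensemble.predict(X_te))`. Because the metric works on logarithms, it penalizes relative rather than absolute errors, so an error of 10% counts the same on a cheap house as on an expensive one.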

### Now write out the `submission.csv` file:

In [None]:
output = pd.DataFrame({"Id":test_data.Id, "SalePrice":predictions})
output.to_csv('submission.csv', index=False)

# Links
* [David H. Wolpert, "Stacked generalization", Neural Networks, Vol. 5, pp. 241-259 (1992)](https://www.sciencedirect.com/science/article/abs/pii/S0893608005800231)
