In [1]:
import pandas as pd
import numpy as np
import xgboost as xgb
from sklearn.model_selection import train_test_split

In [2]:
housing_data = pd.read_csv('../data/ames_housing.csv')

Untuned Example

In [7]:
# Features are every column except the last; the target is the last column
X, y = housing_data.iloc[:, :-1], housing_data.iloc[:, -1]
housing_dmatrix = xgb.DMatrix(data=X, label=y)
# Only the objective is set; every other parameter keeps its default
untuned_params = {"objective": "reg:squarederror"}
untuned_cv_results_rmse = xgb.cv(dtrain=housing_dmatrix, params=untuned_params, nfold=4, metrics="rmse", as_pandas=True,
                                 seed=123)
# Take the final boosting round's mean test RMSE as a scalar
print("Untuned RMSE:%f" % untuned_cv_results_rmse["test-rmse-mean"].iloc[-1])

Untuned RMSE:33238.397179




Tuned Example

In [8]:
# Same objective, plus column sampling per level, a smaller learning rate, and a depth limit
tuned_params = {"objective": "reg:squarederror", "colsample_bylevel": 0.3, "learning_rate": 0.1, "max_depth": 5}
tuned_cv_results_rmse = xgb.cv(dtrain=housing_dmatrix, params=tuned_params, nfold=4, metrics="rmse",
                               num_boost_round=400, as_pandas=True, seed=123)
# Take the final boosting round's mean test RMSE as a scalar
print("Tuned RMSE: %f" % tuned_cv_results_rmse["test-rmse-mean"].iloc[-1])


Tuned RMSE: 29665.755488
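
To put the two cross-validation scores side by side, the relative improvement can be computed directly from the printed values. The snippet below is a small sketch that hard-codes the two RMSE figures shown in the outputs above; the variable names are illustrative and not part of the original notebook. Note that the tuned run also sets num_boost_round=400 while the untuned run leaves it at its default, so the difference reflects both the parameter choices and the larger number of boosting rounds.

```python
# RMSE values copied from the printed cell outputs above
untuned_rmse = 33238.397179
tuned_rmse = 29665.755488

# Absolute and relative reduction in cross-validated RMSE
absolute_gain = untuned_rmse - tuned_rmse
relative_gain = absolute_gain / untuned_rmse

print("RMSE reduction: %.2f (%.1f%%)" % (absolute_gain, 100 * relative_gain))
```

With these two numbers the reduction works out to roughly 10.7% of the untuned RMSE.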




Common tunable tree parameters
- learning_rate: Controls how quickly the model fits the residual error using additional base learners.
  A low learning rate requires more boosting rounds to achieve the same reduction in residual error as an XGBoost model with a high learning rate.
- gamma: Minimum loss reduction required to create a new tree split.
- lambda: L2 regularization on leaf weights.
- alpha: L1 regularization on leaf weights.
- max_depth: Must be a positive integer; it limits how deep each tree is allowed to grow during any given boosting round.
- subsample: Must be a value between 0 and 1; it is the fraction of the total training set that can be used for any given boosting round.
  If the value is low, only a small fraction of the training data is used per boosting round and the model may underfit; a very high value can lead to overfitting.
- colsample_bytree: Must be a value between 0 and 1; it is the fraction of the features that can be selected from during any given boosting round.
  A large value means that almost all features can be used to build a tree during a given boosting round, whereas a small value means that only a small fraction of the features can be selected from.
  In general, a smaller colsample_bytree value can be thought of as providing additional regularization to the model, whereas using all columns may in certain cases overfit the trained model.
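
The learning-rate trade-off described above can be illustrated with a toy boosting loop. This is a minimal sketch in plain Python, not XGBoost's implementation: the base learner is assumed to be a constant equal to the mean residual, and the function name `rounds_to_fit`, the target values, and the 1% tolerance are all hypothetical choices made for the illustration.

```python
def rounds_to_fit(learning_rate, targets, tolerance=0.01, max_rounds=10_000):
    """Count boosting rounds until the mean residual falls below
    `tolerance` times its initial size (toy setup, not XGBoost itself)."""
    predictions = [0.0] * len(targets)
    initial = abs(sum(targets) / len(targets))
    for round_number in range(1, max_rounds + 1):
        residuals = [t - p for t, p in zip(targets, predictions)]
        # Toy base learner: a constant equal to the mean residual
        step = sum(residuals) / len(residuals)
        # Shrink the learner's contribution by the learning rate
        predictions = [p + learning_rate * step for p in predictions]
        remaining = abs(sum(t - p for t, p in zip(targets, predictions)) / len(targets))
        if remaining <= tolerance * initial:
            return round_number
    return max_rounds


targets = [10.0, 20.0, 30.0]
for eta in (0.5, 0.3, 0.1):
    print("learning_rate=%.1f -> %d rounds" % (eta, rounds_to_fit(eta, targets)))
```

In this setup each round removes a fraction of the remaining mean residual equal to the learning rate, so smaller rates need more rounds to reach the same tolerance.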


Linear tunable parameters
For a linear base learner, the number of tunable parameters is significantly smaller. You only have access to:
- lambda: L2 regularization on weights
- alpha: L1 regularization on weights
- lambda_bias: L2 regularization on the bias term
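
Because the linear booster exposes so few knobs, a small grid over them is easy to enumerate. The sketch below builds candidate parameter dictionaries for the "gblinear" booster; the candidate values and the name `linear_param_grid` are assumptions for illustration only, and lambda_bias is left out to keep the grid small.

```python
from itertools import product

# Candidate values are illustrative, not recommendations
lambda_values = [0.0, 1.0, 10.0]
alpha_values = [0.0, 0.1]

linear_param_grid = [
    {
        "objective": "reg:squarederror",
        "booster": "gblinear",  # switch from the default tree booster to the linear one
        "lambda": l2,
        "alpha": l1,
    }
    for l2, l1 in product(lambda_values, alpha_values)
]

for params in linear_param_grid:
    # Each dict could be passed as params= to xgb.cv, as in the cells above
    print(params)
```

Each dictionary can then be scored with the same cross-validation call used earlier, keeping the combination with the lowest test RMSE.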
