
**Objective (loss) functions and base learners**

Objective (loss function) names in XGBoost:
- reg:linear - use for regression problems (deprecated in newer XGBoost releases in favor of reg:squarederror)
- reg:logistic - use for classification when you want only a decision, not a probability
- binary:logistic - use when you want a probability rather than just a decision

https://xgboost.readthedocs.io/en/latest/
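
Each objective is a loss whose first and second derivatives with respect to the current prediction guide the next boosting round. The following is a minimal sketch of those derivatives for squared error and for logistic loss, written in plain Python. The function names are hypothetical and this is not XGBoost's internal code.

```python
import math

def squared_error_grad_hess(pred, label):
    """Gradient and hessian of 0.5 * (pred - label)**2 with respect to pred."""
    return pred - label, 1.0

def logistic_grad_hess(margin, label):
    """Gradient and hessian of the log loss with respect to the raw margin."""
    prob = 1.0 / (1.0 + math.exp(-margin))  # sigmoid maps the margin to a probability
    return prob - label, prob * (1.0 - prob)

# Regression: a prediction of 3.0 against a target of 5.0
sq_grad, sq_hess = squared_error_grad_hess(3.0, 5.0)

# Binary classification: a raw margin of 0.0 corresponds to probability 0.5
lg_grad, lg_hess = logistic_grad_hess(0.0, 1.0)

print(sq_grad, sq_hess)
print(lg_grad, lg_hess)
```

A negative gradient means the prediction is too low, so the next learner is pushed to raise it.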

**Trees as base learners example: Scikit-learn API**

In [None]:
import xgboost as xgb
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

In [None]:
# Note: load_boston was removed in scikit-learn 1.2, so this cell needs an earlier release
from sklearn.datasets import load_boston
X, y = load_boston(return_X_y=True)

In [None]:
# xgboost and train_test_split are already imported above; only the metric is new
from sklearn.metrics import mean_squared_error

In [None]:
# Create the training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)

In [None]:
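# Fit a 10-tree regressor; reg:squarederror is the current name for the deprecated reg:linear
xg_reg = xgb.XGBRegressor(objective='reg:squarederror', n_estimators=10, random_state=123)
xg_reg.fit(X_train, y_train)
# Predict on the held-out test set
preds = xg_reg.predict(X_test)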

In [None]:
rmse = np.sqrt(mean_squared_error(y_test, preds))
print("RMSE: %f" % (rmse))

RMSE: 9.749041
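
To make the idea of trees as base learners concrete, here is a minimal sketch of boosting with one-split trees (stumps) on a toy one-dimensional dataset. It is an illustration only, not XGBoost's algorithm: the data, the helper names, the learning rate of 0.3, and the constant starting prediction of 0.5 are all assumptions chosen for the example.

```python
def fit_stump(xs, residuals):
    """Find the single threshold split on xs that minimizes squared error."""
    best = None
    for threshold in sorted(set(xs))[1:]:
        left = [r for x, r in zip(xs, residuals) if x < threshold]
        right = [r for x, r in zip(xs, residuals) if x >= threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x < threshold else right_mean

def toy_rmse(targets, predictions):
    """Root mean squared error between two equal-length lists."""
    n = len(targets)
    return (sum((t - p) ** 2 for t, p in zip(targets, predictions)) / n) ** 0.5

xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [10.0, 12.0, 11.0, 30.0, 32.0, 31.0]

learning_rate = 0.3          # assumed shrinkage applied to every stump
current = [0.5] * len(ys)    # assumed constant starting prediction
history = [toy_rmse(ys, current)]

for _ in range(10):
    # Each round fits a stump to what the ensemble still gets wrong
    residuals = [y - p for y, p in zip(ys, current)]
    stump = fit_stump(xs, residuals)
    current = [p + learning_rate * stump(x) for p, x in zip(current, xs)]
    history.append(toy_rmse(ys, current))

print(["%.3f" % h for h in history])
```

Each round adds only a fraction of the stump's correction, so the training error falls gradually rather than in one step.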


**Linear base learners**

Here we use a different base learner available in XGBoost: a linear learner. Although it is used less often than trees, it lets you fit a regularized linear regression through XGBoost's learning API.
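
"Regularized" here means that large weights are penalized. As a minimal sketch of that effect, the snippet below computes the closed-form weight of a one-feature ridge (L2-penalized) regression without an intercept. The data and the `ridge_weight` helper are hypothetical, and this is not how XGBoost's linear booster is implemented.

```python
def ridge_weight(xs, ys, l2):
    """Weight w minimizing sum((y - w*x)**2) + l2 * w**2 for a single feature."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + l2)

# Toy data that roughly follows y = 2 * x
xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
ys = [-4.1, -1.9, 0.2, 2.1, 3.9]

# A larger L2 penalty pulls the fitted weight toward zero
for l2 in (0.0, 1.0, 10.0):
    print(l2, round(ridge_weight(xs, ys, l2), 4))
```

With no penalty the weight is close to the underlying slope; as the penalty grows, the weight shrinks.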

In [None]:
# Convert the training and testing sets into DMatrixes: DM_train, DM_test
DM_train = xgb.DMatrix(data=X_train, label=y_train)
DM_test = xgb.DMatrix(data=X_test, label=y_test)

# Create the parameter dictionary: params
# (reg:squarederror is the current name for the deprecated reg:linear)
params = {"booster": "gblinear", "objective": "reg:squarederror"}

# Train the model: xg_reg
xg_reg = xgb.train(params=params, dtrain=DM_train, num_boost_round=5)

# Predict the labels of the test set: preds
preds = xg_reg.predict(DM_test)

# Compute and print the RMSE
rmse = np.sqrt(mean_squared_error(y_test, preds))
print("RMSE: %f" % (rmse))

RMSE: 6.631172


**Compare the RMSE and MAE of a cross-validated XGBoost model**

Perform 4-fold cross-validation with 5 boosting rounds and "rmse" as the metric.

In [None]:
# Create the DMatrix: housing_dmatrix
housing_dmatrix = xgb.DMatrix(data=X, label=y)

# Create the parameter dictionary: params
# (reg:squarederror is the current name for the deprecated reg:linear)
params = {"objective": "reg:squarederror", "max_depth": 4}

# Perform cross-validation: cv_results
cv_results = xgb.cv(
    dtrain=housing_dmatrix,
    params=params,
    nfold=4,
    num_boost_round=5,
    metrics="rmse",
    as_pandas=True,
    seed=123,
)

# Print cv_results
print(cv_results)

# Extract and print the metric from the final boosting round
print(cv_results["test-rmse-mean"].tail(1))

   train-rmse-mean  train-rmse-std  test-rmse-mean  test-rmse-std
0        17.120438        0.057830       17.151866       0.295723
1        12.353698        0.034427       12.510376       0.372386
2         9.017977        0.038795        9.245965       0.314345
3         6.690101        0.047236        7.060159       0.317659
4         5.069411        0.048644        5.571861       0.252100
4    5.571861
Name: test-rmse-mean, dtype: float64
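
The mean and std columns in the output above summarize one score per fold. The sketch below shows that aggregation on toy data, assuming contiguous unshuffled folds, a stand-in "model" that predicts the training mean, and a population standard deviation. All names and values are hypothetical, and it does not reproduce the internals of `xgb.cv`.

```python
import statistics

def k_fold_indices(n_samples, n_folds):
    """Split range(n_samples) into n_folds contiguous test folds."""
    sizes = [n_samples // n_folds + (1 if i < n_samples % n_folds else 0)
             for i in range(n_folds)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def fold_rmse(targets, predictions):
    """Root mean squared error between two equal-length lists."""
    n = len(targets)
    return (sum((t - p) ** 2 for t, p in zip(targets, predictions)) / n) ** 0.5

targets = [21.0, 24.5, 19.0, 33.2, 28.7, 15.3, 22.9, 27.1]

fold_scores = []
for test_idx in k_fold_indices(len(targets), 4):
    held_out = set(test_idx)
    train = [y for i, y in enumerate(targets) if i not in held_out]
    test = [targets[i] for i in test_idx]
    # Stand-in "model": predict the training mean for every held-out row
    prediction = sum(train) / len(train)
    fold_scores.append(fold_rmse(test, [prediction] * len(test)))

test_rmse_mean = statistics.mean(fold_scores)
test_rmse_std = statistics.pstdev(fold_scores)
print(test_rmse_mean, test_rmse_std)
```

Every row is held out exactly once, so the mean estimates out-of-sample error and the std shows how much that estimate varies across folds.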


Now repeat the cross-validation with "mae" as the metric instead of "rmse".

In [None]:
# Create the DMatrix: housing_dmatrix
housing_dmatrix = xgb.DMatrix(data=X, label=y)

# Create the parameter dictionary: params
# (reg:squarederror is the current name for the deprecated reg:linear)
params = {"objective": "reg:squarederror", "max_depth": 4}

# Perform cross-validation: cv_results
cv_results = xgb.cv(
    dtrain=housing_dmatrix,
    params=params,
    nfold=4,
    num_boost_round=5,
    metrics="mae",
    as_pandas=True,
    seed=123,
)

# Print cv_results
print(cv_results)

# Extract and print the metric from the final boosting round
print(cv_results["test-mae-mean"].tail(1))

   train-mae-mean  train-mae-std  test-mae-mean  test-mae-std
0       15.584812       0.087903      15.567934      0.345122
1       11.036514       0.069404      11.044831      0.347553
2        7.827224       0.052691       7.886081      0.315104
3        5.596108       0.044331       5.718952      0.288004
4        4.062843       0.052193       4.285985      0.175467
4    4.285985
Name: test-mae-mean, dtype: float64
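
At the final round, the cross-validated MAE (4.285985) is lower than the RMSE (5.571861). This is expected: on the same errors RMSE is never smaller than MAE, because squaring gives large errors extra weight. The following sketch, using made-up error values, shows two sets of errors with the same MAE but different RMSE.

```python
def rmse_of(errors):
    """Root mean squared error of a list of residuals."""
    return (sum(e * e for e in errors) / len(errors)) ** 0.5

def mae_of(errors):
    """Mean absolute error of a list of residuals."""
    return sum(abs(e) for e in errors) / len(errors)

even_errors = [2.0, -2.0, 2.0, -2.0]   # every prediction is off by the same amount
one_outlier = [0.5, -0.5, 0.5, -6.5]   # same total absolute error, one large miss

for name, errors in (("even", even_errors), ("outlier", one_outlier)):
    print(name, "MAE=%.3f" % mae_of(errors), "RMSE=%.3f" % rmse_of(errors))
```

Comparing the two metrics on the same model therefore hints at whether a few large misses dominate the error.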
