This exercise is about applying XGBoost. We will again use the Boston Housing Dataset (https://www.cs.toronto.edu/~delve/data/boston/bostonDetail.html). In the first step, you will load and briefly analyze the dataset. In the second step, you will train and apply an XGBoost regression model.

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt

import numpy as np
np.random.seed(0)

Let's load the dataset. Luckily, the Scikit-Learn package provides a very simple way to load the Boston dataset ...

In [None]:
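# NOTE: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this cell requires scikit-learn < 1.2.
from sklearn.datasets import load_boston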

boston = load_boston()

print("Number of instances: {}".format(boston.data.shape[0]))
print("Number of features: {}".format(boston.data.shape[1]))

Let's quickly analyze the main characteristics of the data at hand. The Pandas package (https://pandas.pydata.org/) provides some nice out-of-the-box methods to quickly analyze data and to do some preprocessing (if needed).

In [None]:
import pandas as pd

# build a DataFrame with named feature columns and append the target as the last column
df = pd.DataFrame(boston.data, columns=boston.feature_names)
df['PRICE (REGRESSION TARGET)'] = boston.target
df.head()
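
Beyond head(), summary statistics and correlations with the target are a quick way to get a feel for the data. The following is a minimal sketch on a small hand-made DataFrame: the column names mirror some of the Boston columns, but the values are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical toy values for illustration only (NOT real Boston data).
toy = pd.DataFrame({
    "RM":    [5.0, 6.0, 6.5, 7.0, 8.0],
    "LSTAT": [20.0, 12.0, 10.0, 6.0, 3.0],
    "PRICE": [15.0, 20.0, 23.0, 30.0, 45.0],
})

# count, mean, std, min, quartiles, and max for every numeric column
summary = toy.describe()
print(summary)

# Pearson correlation of each feature with the target column
target_corr = toy.corr()["PRICE"].drop("PRICE")
print(target_corr.sort_values())
```

In the notebook itself, the same two calls could be applied to df, with 'PRICE (REGRESSION TARGET)' as the target column.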

Let us split the data into a training set and a test set. The test set is NOT used while the model is being fitted (which is started by calling the 'fit' function); it is reserved for the final evaluation.

In [None]:
from sklearn.model_selection import train_test_split

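# all columns except the last one are features; the last column is the target
X = df.iloc[:,:-1]
y = df.iloc[:,-1]

# hold out 33% of the instances for testing; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)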
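
To see what the split does, it can help to check the resulting sizes and the effect of random_state. The sketch below uses placeholder arrays rather than the real data; the shape (506 instances, 13 features) is assumed to match the Boston dataset, and the names X_demo and y_demo are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the assumed Boston shape (506 instances, 13 features).
X_demo = np.zeros((506, 13))
y_demo = np.arange(506)  # instance indices, so we can see which rows end up where

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.33, random_state=1)

print("Training instances: {}".format(X_tr.shape[0]))
print("Test instances: {}".format(X_te.shape[0]))

# The same random_state yields exactly the same split again.
_, _, _, y_te_again = train_test_split(X_demo, y_demo, test_size=0.33, random_state=1)
print("Same test instances: {}".format(np.array_equal(y_te, y_te_again)))
```

Fixing random_state is what makes RMSE values comparable between runs of the notebook.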

Next, we will make use of the XGBoost library! You can find a description of the Python API here: https://xgboost.readthedocs.io/en/latest/python/python_api.html

In [None]:
import xgboost

# YOUR TASK: Find parameters to get an RMSE below 3.0

# instantiate the model
# ('reg:squarederror' replaces the deprecated objective name 'reg:linear')
model = xgboost.XGBRegressor(objective = 'reg:squarederror',
                             colsample_bytree = 0.1,
                             learning_rate = 1,
                             max_depth = 2,
                             reg_lambda = 1.0,
                             n_estimators = 10)

# fit the model
model.fit(X_train, y_train)

Next, we evaluate the final performance on the test set using the root-mean-square error (https://en.wikipedia.org/wiki/Root-mean-square_deviation).

In [None]:
from sklearn.metrics import mean_squared_error

preds = model.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, preds))

print("The final RMSE on the test set is: {}".format(rmse))
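
To make the metric concrete: the RMSE is the square root of the mean of the squared differences between true and predicted values, so it is expressed in the same units as the target. The small worked example below uses made-up numbers (not model output) and computes the value step by step with NumPy only.

```python
import numpy as np

# Made-up true values and predictions for illustration.
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

errors = y_true - y_pred       # [0.5, -0.5, 0.0, -1.0]
mse = np.mean(errors ** 2)     # (0.25 + 0.25 + 0.0 + 1.0) / 4 = 0.375
rmse_manual = np.sqrt(mse)     # square root brings the error back to the target's units

print("MSE: {}".format(mse))
print("RMSE: {:.4f}".format(rmse_manual))
```

This is the same computation as np.sqrt(mean_squared_error(y_test, preds)) in the cell above, just written out explicitly.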

For regression scenarios, a scatter plot of true labels vs. predictions is often a useful way to assess the quality of a given model: the closer the points lie to the diagonal, the better the predictions.

In [None]:
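plt.figure(figsize=(5,5))
plt.scatter(y_test, preds, marker='o');
# red diagonal: points on this line are predicted perfectly
plt.plot(range(50), range(50), 'r-');
plt.xlabel('True price')
plt.ylabel('Predicted price');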