
# Which of these is a regression problem?
Here are 4 potential machine learning problems you might encounter in the wild. Pick the one that is a clear example of a regression problem.

**Incorrect Answers**

1. Recommending a restaurant to a user given their past history of restaurant visits and reviews for a dining aggregator app.
 - This is a recommendation problem.

2. Predicting which of several thousand diseases a given person is most likely to have given their symptoms.
 - This is a multi-class classification problem.

3. Tagging an email as spam/not spam based on its content and metadata (sender, time sent, etc.).
 - This is a binary classification problem.

**Correct Answer**

Predicting the expected payout of an auto insurance claim given claim properties (car, accident type, driver prior history, etc.).
 - Well done! This is indeed an example of a regression problem.

# Decision trees as base learners
It's now time to build an XGBoost model to predict house prices - not in Boston, Massachusetts, as you saw in the video, but in Ames, Iowa! This dataset of housing prices has been pre-loaded into a DataFrame called df. If you explore it in the Shell, you'll see that there are a variety of features about the house and its location in the city.

In this exercise, your goal is to use trees as base learners. By default, XGBoost uses trees as base learners, so you don't have to specify that you want to use trees here with booster="gbtree".

xgboost has been imported as xgb and the arrays for the features and the target are available in X and y, respectively.

**Instructions**

- Split df into training and testing sets, holding out 20% for testing. Use a random_state of 123.
- Instantiate the XGBRegressor as xg_reg, using a seed of 123. Specify an objective of "reg:linear" and use 10 trees. Note: You don't have to specify booster="gbtree" as this is the default.
- Fit xg_reg to the training data and predict the labels of the test set. Save the predictions in a variable called preds.
- Compute the rmse using np.sqrt() and the mean_squared_error() function from sklearn.metrics, which has been pre-imported.

**Conclusion**

Well done! Next, you'll train an XGBoost model using linear base learners and XGBoost's learning API. Will it perform better or worse?

In [None]:
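# Imports (pre-loaded in the course environment; X and y are also provided there)
import numpy as np
import xgboost as xgb
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Create the training and test sets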
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)

# Instantiate the XGBRegressor: xg_reg
xg_reg = xgb.XGBRegressor(objective='reg:linear', n_estimators=10, seed=123)

# Fit the regressor to the training set
xg_reg.fit(X_train, y_train)

# Predict the labels of the test set: preds
preds = xg_reg.predict(X_test)

# Compute the rmse: rmse
rmse = np.sqrt(mean_squared_error(y_test, preds))
print("RMSE: %f" % (rmse))

'''
<script.py> output:
    [00:27:14] WARNING: /workspace/src/objective/regression_obj.cu:152: reg:linear is now deprecated in favor of reg:squarederror.
    RMSE: 78847.401758
'''

# Linear base learners
Now that you've used trees as base models in XGBoost, let's use the other kind of base model that can be used with XGBoost - a linear learner. This model, although not as commonly used in XGBoost, allows you to create a regularized linear regression using XGBoost's powerful learning API. Because it's less common, this exercise builds it with XGBoost's own learning API functions, such as xgb.train(), rather than the scikit-learn compatible interface.

In order to do this you must create the parameter dictionary that describes the kind of booster you want to use (similarly to how you created the dictionary in Chapter 1 when you used xgb.cv()). The key-value pair that defines the booster type (base model) you need is "booster":"gblinear".

Once you've created the parameter dictionary, you can train the model with xgb.train() and then call the .predict() method of the resulting model, just like you've done in the past.

Here, the data has already been split into training and testing sets, so you can dive right into creating the DMatrix objects required by the XGBoost learning API.

**Instructions**

- Create two DMatrix objects - DM_train for the training set (X_train and y_train), and DM_test (X_test and y_test) for the test set.

- Create a parameter dictionary that defines the "booster" type you will use ("gblinear") as well as the "objective" you will minimize ("reg:linear").

- Train the model using xgb.train(). You have to specify arguments for the following parameters: params, dtrain, and num_boost_round. Use 5 boosting rounds.

- Predict the labels on the test set using xg_reg.predict(), passing it DM_test. Assign to preds.

- Hit 'Submit Answer' to view the RMSE!

**Conclusion**

Interesting - it looks like linear base learners performed better here, with an RMSE of about 41,929 compared with about 78,847 for the tree-based model!

In [None]:
# Convert the training and testing sets into DMatrixes: DM_train, DM_test
DM_train = xgb.DMatrix(data=X_train, label=y_train)
DM_test = xgb.DMatrix(data=X_test, label=y_test)

# Create the parameter dictionary: params
params = {"booster":"gblinear", "objective":"reg:linear"}

# Train the model: xg_reg
xg_reg = xgb.train(params=params, dtrain=DM_train, num_boost_round=5)

# Predict the labels of the test set: preds
preds = xg_reg.predict(DM_test)

# Compute and print the RMSE
rmse = np.sqrt(mean_squared_error(y_test, preds))
print("RMSE: %f" % (rmse))

'''
<script.py> output:
    [00:47:49] WARNING: /workspace/src/objective/regression_obj.cu:152: reg:linear is now deprecated in favor of reg:squarederror.
    RMSE: 41929.341664
'''

# Evaluating model quality
It's now time to begin evaluating model quality.

Here, you will compare the RMSE and MAE of a cross-validated XGBoost model on the Ames housing data. As in previous exercises, all necessary modules have been pre-loaded and the data is available in the DataFrame df.
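
To make the difference between the two metrics concrete, here is a small sketch that computes both by hand. The prices, the predictions, and the helper names rmse and mae are hypothetical and chosen only for illustration; they are not taken from the Ames data.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error: square root of the mean squared residual."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def mae(y_true, y_pred):
    """Mean absolute error: mean of the absolute residuals."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical sale prices and predictions (illustrative values only)
y_true = [200_000, 150_000, 320_000, 275_000]
y_pred = [210_000, 140_000, 330_000, 195_000]  # the last prediction is off by 80,000

print("RMSE: %.1f" % rmse(y_true, y_pred))
print("MAE:  %.1f" % mae(y_true, y_pred))
```

Three of the four residuals are 10,000 and one is 80,000, so the MAE is 27,500 while the RMSE is noticeably larger. Squaring gives the single large miss more weight, which is why RMSE is never smaller than MAE on the same residuals.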

**Instructions 1/2**

- Perform 4-fold cross-validation with 5 boosting rounds and "rmse" as the metric.

- Extract and print the final boosting round RMSE.

**Instructions 2/2**

- Perform 4-fold cross-validation with 5 boosting rounds and "mae" as the metric.

- Extract and print the final boosting round MAE.

In [None]:
# Create the DMatrix: housing_dmatrix
housing_dmatrix = xgb.DMatrix(data=X, label=y)

# Create the parameter dictionary: params
params = {"objective":"reg:linear", "max_depth":4}

# Perform cross-validation: cv_results
cv_results = xgb.cv(dtrain=housing_dmatrix, params=params, nfold=4, num_boost_round=5, metrics="rmse", as_pandas=True, seed=123)

# Print cv_results
print(cv_results)

# Extract and print final boosting round metric
print((cv_results["test-rmse-mean"]).tail(1))

'''
<script.py> output:
    [00:56:30] WARNING: /workspace/src/objective/regression_obj.cu:152: reg:linear is now deprecated in favor of reg:squarederror.
    [00:56:30] WARNING: /workspace/src/objective/regression_obj.cu:152: reg:linear is now deprecated in favor of reg:squarederror.
    [00:56:30] WARNING: /workspace/src/objective/regression_obj.cu:152: reg:linear is now deprecated in favor of reg:squarederror.
    [00:56:30] WARNING: /workspace/src/objective/regression_obj.cu:152: reg:linear is now deprecated in favor of reg:squarederror.
       train-rmse-mean  train-rmse-std  test-rmse-mean  test-rmse-std
    0    141767.527344      429.450237   142980.429688    1193.794436
    1    102832.542969      322.473304   104891.396485    1223.155480
    2     75872.617187      266.469946    79478.939453    1601.341376
    3     57245.650390      273.626583    62411.920899    2220.150028
    4     44401.296875      316.423413    51348.280274    2963.379319
    4    51348.280274
    Name: test-rmse-mean, dtype: float64
'''

# Create the DMatrix: housing_dmatrix
housing_dmatrix = xgb.DMatrix(data=X, label=y)

# Create the parameter dictionary: params
params = {"objective":"reg:linear", "max_depth":4}

# Perform cross-validation: cv_results
cv_results = xgb.cv(dtrain=housing_dmatrix, params=params, nfold=4, num_boost_round=5, metrics="mae", as_pandas=True, seed=123)

# Print cv_results
print(cv_results)

# Extract and print final boosting round metric
print((cv_results["test-mae-mean"]).tail(1))

'''
<script.py> output:
    [01:00:55] WARNING: /workspace/src/objective/regression_obj.cu:152: reg:linear is now deprecated in favor of reg:squarederror.
    [01:00:56] WARNING: /workspace/src/objective/regression_obj.cu:152: reg:linear is now deprecated in favor of reg:squarederror.
    [01:00:57] WARNING: /workspace/src/objective/regression_obj.cu:152: reg:linear is now deprecated in favor of reg:squarederror.
    [01:00:57] WARNING: /workspace/src/objective/regression_obj.cu:152: reg:linear is now deprecated in favor of reg:squarederror.
       train-mae-mean  train-mae-std  test-mae-mean  test-mae-std
    0   127343.484375     668.335266  127633.982422   2404.005021
    1    89770.050782     456.959206   90122.503906   2107.912235
    2    63580.791992     263.407499   64278.563477   1887.565119
    3    45633.153320     151.884919   46819.166015   1459.819399
    4    33587.092774      86.999100   35670.645508   1140.606558
    4    35670.645508
    Name: test-mae-mean, dtype: float64
'''

# Using regularization in XGBoost
Having seen an example of l1 regularization in the video, you'll now vary the l2 regularization penalty - also known as "lambda" - and see its effect on overall model performance on the Ames housing dataset.

**Instructions**

- Create your DMatrix from X and y as before.

- Create an initial parameter dictionary specifying an "objective" of "reg:linear" and "max_depth" of 3.

- Use xgb.cv() inside of a for loop and systematically vary the "lambda" value by passing in the current l2 value (reg).

- Append the "test-rmse-mean" from the last boosting round for each cross-validated xgboost model.

- Hit 'Submit Answer' to view the results. What do you notice?

**Conclusion**

Nice work! It looks like as the value of 'lambda' increases, so does the RMSE.

In [None]:
# Create the DMatrix: housing_dmatrix
housing_dmatrix = xgb.DMatrix(data=X, label=y)

reg_params = [1, 10, 100]

# Create the initial parameter dictionary for varying l2 strength: params
params = {"objective":"reg:linear","max_depth":3}

# Create an empty list for storing rmses as a function of l2 complexity
rmses_l2 = []

# Iterate over reg_params
for reg in reg_params:

    # Update l2 strength
    params["lambda"] = reg
    
    # Pass this updated param dictionary into cv
    cv_results_rmse = xgb.cv(dtrain=housing_dmatrix, params=params, nfold=2, num_boost_round=5, metrics="rmse", as_pandas=True, seed=123)
    
    # Append best rmse (final round) to rmses_l2
    rmses_l2.append(cv_results_rmse["test-rmse-mean"].tail(1).values[0])

# Look at best rmse per l2 param
print("Best rmse as a function of l2:")
print(pd.DataFrame(list(zip(reg_params, rmses_l2)), columns=["l2", "rmse"]))

'''
<script.py> output:
    [23:09:39] WARNING: /workspace/src/objective/regression_obj.cu:152: reg:linear is now deprecated in favor of reg:squarederror.
    [23:09:40] WARNING: /workspace/src/objective/regression_obj.cu:152: reg:linear is now deprecated in favor of reg:squarederror.
    [23:09:42] WARNING: /workspace/src/objective/regression_obj.cu:152: reg:linear is now deprecated in favor of reg:squarederror.
    [23:09:42] WARNING: /workspace/src/objective/regression_obj.cu:152: reg:linear is now deprecated in favor of reg:squarederror.
    [23:09:42] WARNING: /workspace/src/objective/regression_obj.cu:152: reg:linear is now deprecated in favor of reg:squarederror.
    [23:09:42] WARNING: /workspace/src/objective/regression_obj.cu:152: reg:linear is now deprecated in favor of reg:squarederror.
    Best rmse as a function of l2:
        l2          rmse
    0    1  52275.359375
    1   10  57746.064453
    2  100  76624.625000
'''
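
One way to read the table above: in XGBoost's tree booster, the L2 penalty enters the optimal weight of a leaf as w = -G / (H + lambda), where G and H are the sums of the gradients and Hessians of the samples in that leaf. For squared-error loss this reduces to the sum of the residuals divided by (n + lambda). The sketch below applies that formula to hypothetical residuals; it is a simplification that ignores the learning rate, other penalties, and how splits are chosen, and the function name leaf_weight is made up for illustration.

```python
def leaf_weight(residuals, reg_lambda):
    """Optimal leaf weight for squared-error loss: sum(residuals) / (n + lambda)."""
    return sum(residuals) / (len(residuals) + reg_lambda)

# Hypothetical residuals (target minus current prediction) for the samples in one leaf
residuals = [12_000.0, 8_000.0, 10_000.0, 14_000.0]

# Same lambda values as reg_params in the exercise
for reg_lambda in [1, 10, 100]:
    print(reg_lambda, round(leaf_weight(residuals, reg_lambda), 1))
```

The leaf weight shrinks from 8,800 at lambda=1 to roughly 423 at lambda=100. With only 5 boosting rounds, smaller steps leave more of the target unexplained, which is consistent with the rising RMSE, although this is an interpretation rather than something the exercise states.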

# Visualizing individual XGBoost trees
Now that you've used XGBoost to both build and evaluate regression as well as classification models, you should get a handle on how to visually explore your models. Here, you will visualize individual trees from the fully boosted model that XGBoost creates using the entire housing dataset.

XGBoost has a plot_tree() function that makes this type of visualization easy. Once you train a model using the XGBoost learning API, you can pass it to the plot_tree() function along with the index of the tree you want to plot, using the num_trees argument.

**Instructions**

- Create a parameter dictionary with an "objective" of "reg:linear" and a "max_depth" of 2.

- Train the model using 10 boosting rounds and the parameter dictionary you created. Save the result in xg_reg.

- Plot the first tree using xgb.plot_tree(). It takes in two arguments - the model (in this case, xg_reg), and num_trees, which is 0-indexed. So to plot the first tree, specify num_trees=0.

- Plot the fifth tree.

- Plot the last (tenth) tree sideways. To do this, specify the additional keyword argument rankdir="LR".

**Conclusion**

Excellent! Have a look at each of the plots. They provide insight into how the model arrived at its final decisions and what splits it made to arrive at those decisions. This allows us to identify which features are the most important in determining house price. In the next exercise, you'll learn another way of visualizing feature importances.

In [None]:
# Create the DMatrix: housing_dmatrix
housing_dmatrix = xgb.DMatrix(data=X, label=y)

# Create the parameter dictionary: params
params = {"objective":"reg:linear", "max_depth":2}

# Train the model: xg_reg
xg_reg = xgb.train(params=params, dtrain=housing_dmatrix, num_boost_round=10)

# Plot the first tree
xgb.plot_tree(xg_reg, num_trees=0)
plt.show()

# Plot the fifth tree
xgb.plot_tree(xg_reg, num_trees=4)
plt.show()

# Plot the last tree sideways
xgb.plot_tree(xg_reg, num_trees=9, rankdir="LR")
plt.show()

'''
<script.py> output:
    [23:20:57] WARNING: /workspace/src/objective/regression_obj.cu:152: reg:linear is now deprecated in favor of reg:squarederror.
'''
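
When reading the plots, it helps to remember that each tree contributes a single leaf value to a prediction, and the boosted model's output is the sum of those values on top of a starting score. The sketch below mimics that with three depth-1 trees. The feature names other than GrLivArea, the thresholds, the leaf values, the starting score, and the function name predict_one are all hypothetical; a real model's trees would come from the trained booster.

```python
# Each hypothetical tree is a depth-1 stump: (feature, threshold, left_leaf, right_leaf).
# Feature names, thresholds and leaf values are made up for illustration.
trees = [
    ("GrLivArea", 1500.0, 40_000.0, 70_000.0),
    ("OverallQual", 6.5, 25_000.0, 55_000.0),
    ("GrLivArea", 2200.0, 18_000.0, 35_000.0),
]

def predict_one(row, trees, base_score=0.5):
    """Sum the leaf value selected in every tree, starting from base_score."""
    total = base_score
    for feature, threshold, left_leaf, right_leaf in trees:
        # A row goes left when its feature value is below the threshold
        total += left_leaf if row[feature] < threshold else right_leaf
    return total

# One hypothetical house
house = {"GrLivArea": 1800.0, "OverallQual": 7.0}
print(predict_one(house, trees))
```

For this house the three trees select 70,000, 55,000 and 18,000, so later trees act as corrections added to what the earlier ones predicted.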

# Visualizing feature importances: What features are most important in my dataset?
Another way to visualize your XGBoost models is to examine the importance of each feature column in the original dataset within the model.

One simple way of doing this involves counting the number of times each feature is split on across all boosting rounds (trees) in the model, and then visualizing the result as a bar graph, with the features ordered according to how many times they appear. XGBoost has a plot_importance() function that allows you to do exactly this, and you'll get a chance to use it in this exercise!

**Instructions**

- Create your DMatrix from X and y as before.

- Create a parameter dictionary with appropriate "objective" ("reg:linear") and a "max_depth" of 4.

- Train the model with 10 boosting rounds, exactly as you did in the previous exercise.

- Use xgb.plot_importance() and pass in the trained model to generate the graph of feature importances.

**Conclusion**

Brilliant! It looks like GrLivArea is the most important feature. Congratulations on completing Chapter 2!
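
The counting approach described in this exercise can be sketched without a trained model. The example below assumes a made-up list of trees, each represented only by the features used at its split nodes, and tallies how often each feature appears. The feature names other than GrLivArea and the tree contents are hypothetical. This count corresponds to what XGBoost calls the "weight" importance type.

```python
from collections import Counter

# Hypothetical trees, each listed as the features used at its split nodes
# (made-up structure; a real model would be inspected through the library)
tree_splits = [
    ["GrLivArea", "OverallQual", "GrLivArea"],
    ["GrLivArea", "LotArea"],
    ["OverallQual", "GrLivArea", "LotArea"],
]

# Count how many times each feature is split on across all trees
split_counts = Counter(feature for tree in tree_splits for feature in tree)

# Print a text bar chart, most frequently split feature first
for feature, count in split_counts.most_common():
    print(f"{feature:<12} {'#' * count} ({count})")
```

A feature that is split on often is not necessarily the one that reduces the error the most, so count-based rankings are best read as a first look rather than a final verdict.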