# Non-linear machine learning - trees
#### with thanks to Google's Kaggle

## Trees

The model is now the following:

In [3]:
from sklearn.tree import DecisionTreeRegressor

# Define model. Specify a number for random_state to ensure same results each run
model = DecisionTreeRegressor(random_state=1)

**Task** Re-run the house price analysis with a decision tree. Compare the in-sample and out-of-sample MAE with those of the linear model!

In [17]:
# Solutions
import pandas as pd
import numpy as np
h = pd.read_csv('train.csv')

from sklearn import linear_model
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error

feature_names = ["LotArea", "YearBuilt", "1stFlrSF", "2ndFlrSF",
                      "FullBath", "BedroomAbvGr", "TotRmsAbvGrd"]

X=h[feature_names]
y = h.SalePrice

train_X, val_X, train_y, val_y = train_test_split(X, y, random_state = 2)


# Define model
h_model_ols = linear_model.LinearRegression()
# Fit model
h_model_ols.fit(train_X, train_y)

# Get predicted prices and compare in-sample (training) with out-of-sample (validation) MAE
print('ols')
val_predictions = h_model_ols.predict(val_X)
print(mean_absolute_error(train_y, h_model_ols.predict(train_X)))  # in-sample
print(mean_absolute_error(val_y, val_predictions))                 # out-of-sample


# Define model
h_model_tree = DecisionTreeRegressor(random_state=1)
# Fit model
h_model_tree.fit(train_X, train_y)

# Get predicted prices and compare in-sample (training) with out-of-sample (validation) MAE
print('tree')
val_predictions_tree = h_model_tree.predict(val_X)
print(mean_absolute_error(train_y, h_model_tree.predict(train_X)))  # in-sample
print(mean_absolute_error(val_y, val_predictions_tree))             # out-of-sample

## Note the wildly better in-sample performance of the tree, but its worse out-of-sample performance!

ols
26861.096683034102
30671.232158103503
tree
27.48675799086758
32245.33698630137
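As a reminder of what these numbers measure, here is a minimal sketch of the mean absolute error computed by hand. The function name `mae_by_hand` and the prices are made up for illustration; in the analysis itself we use scikit-learn's `mean_absolute_error`.

```python
import numpy as np

def mae_by_hand(y_true, y_pred):
    """Average absolute gap between actual and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean(np.abs(y_true - y_pred))

# Made-up prices, for illustration only
actual = [200_000, 150_000, 310_000]
predicted = [210_000, 140_000, 300_000]

print(mae_by_hand(actual, predicted))
```

Because the absolute value ignores the sign of each error, swapping the two arguments gives the same result.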


## Underfitting and Overfitting
Experimenting With Different Models

Now that you have a reliable way to measure model accuracy, you can experiment with alternative models and see which gives the best predictions. But what alternatives do you have for models?

You can see in scikit-learn's documentation that the decision tree model has many options (more than you'll want or need for a long time). The most important options determine the tree's depth. A tree's depth is a measure of how many splits it makes before coming to a prediction. This is a relatively shallow tree.

<img src="http://i.imgur.com/R3ywQsR.png" width=40%>

In practice, it's not uncommon for a tree to have 10 splits between the top level (all houses) and a leaf. As the tree gets deeper, the dataset gets sliced up into leaves with fewer houses. If a tree has only 1 split, it divides the data into 2 groups. If each group is split again, we get 4 groups of houses. Splitting each of those again creates 8 groups. If we keep doubling the number of groups by adding more splits at each level, we'll have $2^{10}$ groups of houses by the time we get to the 10th level. That's 1024 leaves.
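The doubling argument can be checked with a few lines of Python. This is only a sketch: the helper `leaves_at_depth` and the training-set size `n_houses` are hypothetical, chosen for illustration.

```python
def leaves_at_depth(depth):
    """Number of leaves when every group is split in two at each level."""
    return 2 ** depth

for depth in [1, 2, 3, 10]:
    print(f"depth {depth:2d}: {leaves_at_depth(depth)} leaves")

# Hypothetical number of training houses, for illustration only
n_houses = 1095
houses_per_leaf = n_houses / leaves_at_depth(10)
print(f"about {houses_per_leaf:.1f} houses per leaf at depth 10")
```

With roughly a thousand houses, a fully grown 10-level tree would leave about one house per leaf.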

When we divide the houses amongst many leaves, we also have fewer houses in each leaf. Leaves with very few houses will make predictions that are quite close to those homes' actual values, but they may make very unreliable predictions for new data (because each prediction is based on only a few houses).

This is a phenomenon called overfitting, where a model matches the training data almost perfectly, but does poorly in validation and other new data. On the flip side, if we make our tree very shallow, it doesn't divide up the houses into very distinct groups.

At an extreme, if a tree divides houses into only 2 or 4 groups, each group still has a wide variety of houses. Resulting predictions may be far off for most houses, even in the training data (and they will be bad in validation too for the same reason). When a model fails to capture important distinctions and patterns in the data, so that it performs poorly even on training data, that is called underfitting.

Since we care about accuracy on new data, which we estimate from our validation data, we want to find the sweet spot between underfitting and overfitting. Visually, we want the low point of the (red) validation curve in the figure below.

<img src="http://i.imgur.com/AXSEOfI.png" width=40%>
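To see the two curves emerge in numbers, here is a small sketch on synthetic data (a noisy sine curve, which is an assumption for illustration and not the house price data). It fits trees with very few, a moderate number, and very many leaves, and records the training and validation MAE for each.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data, for illustration only: a noisy sine curve
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(400, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(scale=0.3, size=400)

X_tr, X_va, y_tr, y_va = train_test_split(X_demo, y_demo, random_state=0)

results = {}
for n_leaves in [2, 10, 300]:
    tree = DecisionTreeRegressor(max_leaf_nodes=n_leaves, random_state=0)
    tree.fit(X_tr, y_tr)
    train_mae = mean_absolute_error(y_tr, tree.predict(X_tr))
    val_mae = mean_absolute_error(y_va, tree.predict(X_va))
    results[n_leaves] = (train_mae, val_mae)
    print(f"{n_leaves:4d} leaves  train MAE {train_mae:.3f}  validation MAE {val_mae:.3f}")
```

The training error keeps falling as leaves are added, reaching zero once every training point has its own leaf, while the validation error does not follow it down.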

**Example**

There are a few alternatives for controlling the tree depth, and many allow for some routes through the tree to have greater depth than other routes. But the max_leaf_nodes argument provides a very sensible way to control overfitting vs underfitting. The more leaves we allow the model to make, the more we move from the underfitting area in the above graph to the overfitting area.

We can use a utility function to help compare MAE scores from different values for max_leaf_nodes:

In [28]:
from sklearn.metrics import mean_absolute_error
from sklearn.tree import DecisionTreeRegressor

def get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y):
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(train_X, train_y)
    preds_val = model.predict(val_X)
    mae = mean_absolute_error(val_y, preds_val)
    return mae

**Task** Using this function, find the optimal number of leaf nodes out of the following options: [5, 50, 500, 5000]

In [27]:
#Solution
for max_leaf_nodes in [5, 50, 500, 5000]:
    my_mae = get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y)
    print("Max leaf nodes: %d  \t\t Mean Absolute Error:  %d" %(max_leaf_nodes, my_mae))
    
#Of the options listed, 500 is the optimal number of leaves.

Max leaf nodes: 5  		 Mean Absolute Error:  40532
Max leaf nodes: 50  		 Mean Absolute Error:  31384
Max leaf nodes: 500  		 Mean Absolute Error:  31339
Max leaf nodes: 5000  		 Mean Absolute Error:  31469
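Rather than reading the winner off the printout, we can also pick it programmatically. The sketch below hard-codes the rounded validation MAE values printed above; the variable names `scores` and `best_tree_size` are illustrative.

```python
# Validation MAE per max_leaf_nodes value, copied from the printed output
scores = {5: 40532, 50: 31384, 500: 31339, 5000: 31469}

# min() over the keys, ranked by their MAE, returns the best leaf count
best_tree_size = min(scores, key=scores.get)
print(best_tree_size)
```

In a live session, the dictionary could be filled by calling `get_mae` for each candidate value instead of typing the numbers in.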


## Random Forests

Decision trees leave you with a difficult decision. A deep tree with lots of leaves will overfit because each prediction is coming from historical data from only the few houses at its leaf. But a shallow tree with few leaves will perform poorly because it fails to capture as many distinctions in the raw data.

Even today's most sophisticated modeling techniques face this tension between underfitting and overfitting. But, many models have clever ideas that can lead to better performance. We'll look at the random forest as an example.

The random forest uses many trees, and it makes a prediction by averaging the predictions of each component tree. It generally has much better predictive accuracy than a single decision tree and it works well with default parameters. If you keep modeling, you can learn more models with even better performance, but many of those are sensitive to getting the right parameters.
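The averaging step can be verified directly. This sketch, on synthetic data invented for illustration, fits a small scikit-learn forest, collects the prediction of each tree in `forest.estimators_`, and compares their mean with the forest's own prediction.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data, for illustration only
rng = np.random.default_rng(1)
X_demo = rng.uniform(0, 10, size=(200, 2))
y_demo = 2 * X_demo[:, 0] + np.sin(X_demo[:, 1]) + rng.normal(scale=0.2, size=200)

forest = RandomForestRegressor(n_estimators=25, random_state=1)
forest.fit(X_demo, y_demo)

# Predict a few new points with every individual tree, then average
X_new = rng.uniform(0, 10, size=(5, 2))
per_tree = np.stack([tree.predict(X_new) for tree in forest.estimators_])
manual_average = per_tree.mean(axis=0)

print(np.allclose(manual_average, forest.predict(X_new)))
```

Each tree is trained on a different bootstrap sample of the data, so the individual predictions differ, and averaging them smooths out the errors of any single tree.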



We build a random forest model similarly to how we built a decision tree in scikit-learn - this time using the RandomForestRegressor class instead of DecisionTreeRegressor.


In [32]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

forest_model = RandomForestRegressor(random_state=1)

**Task** Does the random forest improve our validation MAE?

In [33]:
# Solution
forest_model.fit(train_X, train_y)
preds_rf = forest_model.predict(val_X)
print(mean_absolute_error(val_y, preds_rf))

25050.85186692759


## XGBoost

Gradient boosting is a method that goes through cycles to iteratively add models into an ensemble.

It begins by initializing the ensemble with a single model, whose predictions can be pretty naive. (Even if its predictions are wildly inaccurate, subsequent additions to the ensemble will address those errors.)

Then, we start the cycle:

- First, we use the current ensemble to generate predictions for each observation in the dataset. To make a prediction, we add the predictions from all models in the ensemble.
- These predictions are used to calculate a loss function (like mean squared error, for instance).
- Then, we use the loss function to fit a new model that will be added to the ensemble. Specifically, we determine model parameters so that adding this new model to the ensemble will reduce the loss. (Side note: The "gradient" in "gradient boosting" refers to the fact that we'll use gradient descent on the loss function to determine the parameters in this new model.)
- Finally, we add the new model to the ensemble, and ...
- ... repeat!
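The cycle above can be sketched by hand for the squared-error loss, where the negative gradient is simply the residual (actual minus current prediction). This is plain gradient boosting with small scikit-learn trees on synthetic data, not XGBoost's actual algorithm, which adds regularization and other refinements. The names `learning_rate`, `n_rounds` and `ensemble_pred` are chosen for illustration.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.tree import DecisionTreeRegressor

# Synthetic data, for illustration only
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(300, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(scale=0.2, size=300)

learning_rate = 0.3  # illustrative shrinkage factor applied to each new tree
n_rounds = 50

# Initialize the ensemble with a naive model: predict the mean for everyone
ensemble_pred = np.full_like(y_demo, y_demo.mean())
trees = []
losses = [mean_squared_error(y_demo, ensemble_pred)]

for _ in range(n_rounds):
    # For squared error, the negative gradient of the loss is the residual
    residuals = y_demo - ensemble_pred
    tree = DecisionTreeRegressor(max_depth=2, random_state=0)
    tree.fit(X_demo, residuals)
    trees.append(tree)
    # Add the new model's (scaled) predictions to the ensemble
    ensemble_pred = ensemble_pred + learning_rate * tree.predict(X_demo)
    losses.append(mean_squared_error(y_demo, ensemble_pred))

print(f"training MSE: start {losses[0]:.3f}, after {n_rounds} rounds {losses[-1]:.3f}")
```

Every round fits a tree to what the ensemble still gets wrong, so the training loss can only go down. Whether the validation loss follows is a separate question.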


![](https://i.imgur.com/MvCGENh.png)

In [40]:
from xgboost import XGBRegressor

my_model = XGBRegressor()
my_model.fit(train_X, train_y)
predictions = my_model.predict(val_X)
print("Mean Absolute Error: " + str(mean_absolute_error(predictions, val_y)))

Mean Absolute Error: 26637.48289811644


## Parameter tuning
We can also fine-tune the parameters of our XGBoost model when we train it:

In [46]:
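# Note: passing early_stopping_rounds to fit() is deprecated since XGBoost 1.6;
# newer releases expect it in the XGBRegressor(...) constructor instead.
my_model = XGBRegressor(n_estimators=1000, learning_rate=0.05, n_jobs=4)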
my_model.fit(train_X, train_y, 
             early_stopping_rounds=5, 
             eval_set=[(val_X, val_y)], 
             verbose=False)
predictions = my_model.predict(val_X)
print("Mean Absolute Error: " + str(mean_absolute_error(predictions, val_y)))

Mean Absolute Error: 26115.46934931507




### Details
XGBoost has a few parameters that can dramatically affect accuracy and training speed. The first parameters you should understand are:


#### n_estimators

n_estimators specifies how many times to go through the modeling cycle described above. It is equal to the number of models that we include in the ensemble.

- Too low a value causes underfitting, which leads to inaccurate predictions on both training data and test data.
- Too high a value causes overfitting, which leads to accurate predictions on training data but inaccurate predictions on test data (which is what we care about).

Typical values range from 100-1000, though this depends a lot on the learning_rate parameter discussed below.


#### early_stopping_rounds

early_stopping_rounds offers a way to automatically find the ideal value for n_estimators. Early stopping causes the model to stop iterating when the validation score stops improving, even if we aren't at the hard stop for n_estimators. It's smart to set a high value for n_estimators and then use early_stopping_rounds to find the optimal time to stop iterating.

Since random chance sometimes causes a single round where validation scores don't improve, you need to specify how many consecutive rounds without improvement to allow before stopping. Setting early_stopping_rounds=5 is a reasonable choice. In this case, we stop after 5 straight rounds in which the validation score fails to improve on its best value so far.

When using early_stopping_rounds, you also need to set aside some data for calculating the validation scores - this is done by setting the eval_set parameter.
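The stopping rule itself is simple enough to sketch in plain Python. The function `best_round_with_patience` and the validation errors below are made up for illustration. This shows the general idea and is not XGBoost's internal code.

```python
def best_round_with_patience(val_scores, patience):
    """Return (best_round, stopped_at) for per-round validation errors (lower is better)."""
    best_round = 0
    best_score = float("inf")
    for round_idx, score in enumerate(val_scores):
        if score < best_score:
            # New best validation error: remember it and keep going
            best_score = score
            best_round = round_idx
        elif round_idx - best_round >= patience:
            # No improvement for `patience` rounds in a row: stop here
            return best_round, round_idx
    return best_round, len(val_scores) - 1

# Made-up validation errors, one per boosting round
val_scores = [10.0, 8.0, 7.0, 7.2, 6.9, 7.0, 7.1, 7.3, 7.0, 7.4, 6.0]

print(best_round_with_patience(val_scores, patience=5))
```

Here the best error so far occurs at round 4, and training stops at round 9 after five rounds without improvement. The lower error in the final round is never reached, which is the price of stopping early.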

#### learning_rate

Instead of getting predictions by simply adding up the predictions from each component model, we can multiply the predictions from each model by a small number (known as the learning rate) before adding them in.

This means each tree we add to the ensemble helps us less. So, we can set a higher value for n_estimators without overfitting. If we use early stopping, the appropriate number of trees will be determined automatically.

In general, a small learning rate and a large number of estimators will yield more accurate XGBoost models, though the model will also take longer to train since it does more iterations through the cycle. By default, current XGBoost releases use learning_rate=0.3 (older versions of its scikit-learn wrapper defaulted to 0.1).
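The scaling described in this section can be written compactly. The symbols are assumptions introduced here for illustration: $F_m$ is the ensemble after $m$ rounds, $f_m$ is the model added in round $m$, $\eta$ is the learning rate, and $M$ is n_estimators.

$$F_m(x) = F_{m-1}(x) + \eta \, f_m(x), \qquad m = 1, \dots, M$$

Unrolling the recursion gives $F_M(x) = F_0(x) + \eta \sum_{m=1}^{M} f_m(x)$. With $\eta = 1$ we recover plain addition of the component predictions, and with a smaller $\eta$ each model contributes less, so more rounds are needed to reach the same fit.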

#### n_jobs

On larger datasets where runtime is a consideration, you can use parallelism to build your models faster. It's common to set the parameter n_jobs equal to the number of cores on your machine. On smaller datasets, this won't help.

The resulting model won't be any better, so micro-optimizing for fitting time is typically nothing but a distraction. But, it's useful in large datasets where you would otherwise spend a long time waiting during the fit command.
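If you do want to match n_jobs to your machine, the number of cores can be looked up with the standard library. A minimal sketch, where the variable name `n_cores` is illustrative:

```python
import os

# os.cpu_count() can return None when the count cannot be determined, so fall back to 1
n_cores = os.cpu_count() or 1
print(f"n_jobs could be set to {n_cores}")
```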

## Conclusion

If you're interested in this, I highly recommend you learn these - which build on what I've shown in class:
- Imputation: https://www.kaggle.com/code/alexisbcook/missing-values
- Cleaning pipelines: https://www.kaggle.com/code/alexisbcook/pipelines

And then try and enter this competition!
https://www.kaggle.com/c/home-data-for-ml-course