# Random forests for house price prediction

Your Name

The purpose of this assignment is to give you experience with ensemble methods while answering a few questions:
- How do random forests perform, compared to single trees?
- What are the hyperparameters of random forests?  Does tuning take a long time?
- What does the concept of "feature importance" in random forests mean? 

We'll examine these questions using the California Housing dataset.  You can get information about the dataset here: https://scikit-learn.org/stable/datasets/real_world.html#california-housing-dataset

You will probably want to look at that web page to understand the variables in the dataset.

There are three problems to solve.

__Instructions__:
- Read this notebook to get a feeling for its structure.  Don't modify the top-level structure.
- Don't modify the cell that loads the data.
- Write code to answer the 3 problems.  Look for # YOUR CODE HERE comments.  You also need to write a summary at the end of each problem.
- Please make sure to read the grading rubric.

In [None]:
# add more imports as needed
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from scipy.stats import zscore

In [None]:
sns.set()
sns.set_context('talk')
sns.set_style('whitegrid')

## Loading the data

From 8 numeric predictors we will try to predict the median house value of each district (the target is expressed in units of $100,000).

In [None]:
# df is a data frame of the predictors; target is a Series with the target
bunch = fetch_california_housing(as_frame=True)
df = bunch.data
target = bunch.target

## Data exploration

You can add more data exploration if you like, but it's not needed.

In [None]:
df.info()

In [None]:
df.describe()

In [None]:
target.plot.hist()
plt.title('Histogram of house values');

#### Check for outlier values of predictor variables

In [None]:
# normalize to make it easier to see outliers
dfs = df.apply(zscore)

# https://stackoverflow.com/questions/41328633 explains formatting
with pd.option_context('display.float_format', '{:.2f}'.format):
    print(dfs.describe(percentiles=[0.5, 0.95, 0.99]))

The result of `describe` on the z-score normalized values shows some extreme outliers in the variables AveRooms, AveBedrms, Population, and AveOccup.

## Data preprocessing

In this section we first remove the outliers found earlier, then perform a test/train split and scale the data.

Also, a smaller version of the dataset is created to speed up hyperparameter tuning.

#### Remove outliers in columns that have significant outliers.

In [None]:
high_vals = df.quantile(0.99)
for col in ['AveRooms', 'AveBedrms', 'Population', 'AveOccup']:
    mask = df[col] <= high_vals[col]
    df = df[mask]
    target = target[mask]

#### How many rows after outlier removal? 

In [None]:
print(f'Number of rows in data frame: {df.shape[0]}')

#### Transform the data to NumPy arrays, perform a train/test split, and then scale the data.

The test data is not used in fitting the scaler -- that would "leak" information about the test data.

In [None]:
X = df.values
y = target.values

# 20% should be enough for the test set, as the data set has 20K rows
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=0)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

#### Make a smaller version of the training data to allow for faster hyperparameter tuning.

In [None]:
# Use 50% of the data in the sample
np.random.seed(0)
m = X_train.shape[0]
rows = np.random.choice(m, size=int(m*0.5), replace=False)
X_train_s = X_train[rows]
y_train_s = y_train[rows]

In [None]:
X_train_s.shape

## Problem 1. Regression with a single regression tree

In this problem you will tune a regression tree using GridSearchCV, then look at the importance of the features.

### Hyperparameter tuning using GridSearchCV
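
As a rough illustration of the GridSearchCV workflow, here is a minimal sketch on synthetic data. The names `X_demo` and `y_demo` and the candidate values in the grid are assumptions made for this example only; they are not the assignment data or recommended settings. The point to notice is the scoring convention: scikit-learn maximizes scores, so RMSE is reported as a negative number and must be negated to read it as an error.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (an assumption; not the housing data).
X_demo, y_demo = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)

# Illustrative candidate values only.
param_grid = {
    'max_depth': [2, 4, 8, None],
    'max_leaf_nodes': [10, 50, None],
}

search = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_grid,
    scoring='neg_root_mean_squared_error',  # higher is better, so RMSE is negated
    cv=5,
)
search.fit(X_demo, y_demo)

best_rmse = -search.best_score_  # flip the sign to recover the RMSE
print(f'Best hyperparameters: {search.best_params_}')
print(f'Best CV RMSE: {best_rmse:.2f}')
```

If the best value found sits at the edge of the grid, that is usually a sign the grid should be extended in that direction.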

In [None]:
# Use GridSearchCV to tune a DecisionTreeRegressor.  
# You will need to create a dictionary, which is often given variable name 'param_grid'.
# Only tune the DecisionTreeRegressor hyperparameters 'max_leaf_nodes' and 'max_depth'.  
# Be sure the grid includes large enough candidate values.

# Hint: create a GridSearchCV object and then use the .fit() method.
# Hint: don't forget to specify the scoring parameter when you use GridSearchCV.

# YOUR CODE HERE

### Report on the best parameters and the score associated with them.

In [None]:
# Print the best hyperparameter values and the best CV RMSE value.
# Use two print statements.
# Hint: you can get the values you want from your trained GridSearchCV object.

# YOUR CODE HERE

### Feature importance
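
To give a feeling for what the numbers mean: a fitted tree's `feature_importances_` are impurity-based, i.e. each feature's share of the total reduction in the splitting criterion, normalized to sum to 1. The sketch below uses made-up data in which only one column (hypothetically named `signal`) drives the target, so that column should receive almost all of the importance. The data and column names are assumptions for this example.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 3))
# Only the first column drives the target; the other two columns are pure noise.
y_demo = 5.0 * X_demo[:, 0] + rng.normal(scale=0.1, size=500)

tree = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X_demo, y_demo)

# Label the importances with (hypothetical) column names for readability.
importances = pd.Series(tree.feature_importances_, index=['signal', 'noise_1', 'noise_2'])
print(importances.sort_values(ascending=False))
print(f'Sum of importances: {importances.sum():.2f}')
```

A Series like this can be drawn as a horizontal bar plot with `importances.plot.barh()`.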

In [None]:
# Create a horizontal bar plot showing the feature importance for each predictor variable.
# Give the plot an appropriate title and x/y axis labels.

# Hint: you can get the feature importances from your GridSearchCV object by
# first accessing the best estimator that the GridSearchCV found.

# YOUR CODE HERE

### Summary

Replace this text with your own discussion about what you learned on this problem.

## Problem 2. Regression with a random forest

For this problem you tune a random forest and see how well it performs.  Does it outperform a single tree?

### Hyperparameter tuning

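Note that a `max_features` value of 1.0 (a float, meaning all features are considered at every split) means "bagged trees".  The integer value 1 means something different: only one feature is considered per split.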
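
The float/integer distinction for `max_features` is easy to check on a small example. The sketch below, on synthetic data with 8 features (an assumption chosen to match the number of predictors here), fits tiny forests and reads the `max_features_` attribute of the first fitted tree, which records how many features that tree actually considers at each split.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data with 8 features (not the housing data).
X_demo, y_demo = make_regression(n_samples=200, n_features=8, random_state=0)

features_per_split = {}
for setting in [1.0, 'sqrt', 1]:
    forest = RandomForestRegressor(n_estimators=5, max_features=setting, random_state=0)
    forest.fit(X_demo, y_demo)
    # Each fitted tree stores the resolved number of features tried per split.
    # str() keeps the float 1.0 and the integer 1 as separate dictionary keys.
    features_per_split[str(setting)] = forest.estimators_[0].max_features_

print(features_per_split)
```

With 8 features, `1.0` resolves to all 8 features (bagging), `'sqrt'` resolves to 2, and the integer `1` resolves to a single feature.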

In [None]:
# Perform GridSearchCV again, but this time with a RandomForestRegressor.
# Your param_grid should contain values for only max_depth and max_features.
# For max_depth, I recommend you look at some small integer values, plus value None.
# For max_features, I recommend you look at values 1.0 and 'sqrt'.
# Read the RandomForestRegressor documentation to understand what these hyperparameter values mean.

# Hint: don't consider too many values in your param grid, because this can take a while to run.

# YOUR CODE HERE

### Report on the best parameters and the score associated with them.

In [None]:
# Produce output similar to what you did for a single regression tree.

# YOUR CODE HERE

### Feature importance

In [None]:
# RandomForestRegressor also supports the concept of feature importance.
# Print a horizontal bar plot like you did for your single regression tree.

# YOUR CODE HERE

### Summary

Replace this text with your own discussion about what you learned on this problem.  Be sure to compare your results for random forests with your results for a single regression tree.

## Problem 3. Tuning the number of base regressors

Now we find the best number of base regressors, keeping the hyperparameter values found in the last problem unchanged.

In [None]:
# Create a param_grid dictionary based on the best hyperparameter values you found in the last problem.
# The dictionary should have a list containing only one value for those hyperparameters,
# but should additionally have a 'n_estimators' key that has as its value a list with multiple elements.

# Hint: to create a param_grid from a dictionary of best hyperparameter values, you can
# use a dictionary comprehension, like this:
# param_grid = {k: [v] for k, v in best_hyper_vals.items()}

# YOUR CODE HERE

In [None]:
# leave this cell alone
print(param_grid)

In [None]:
# Perform a grid search again.  Now the only parameter being tuned is n_estimators.

# YOUR CODE HERE

In [None]:
# As before, print the best hyperparameter values and the best CV RMSE.

# YOUR CODE HERE

#### Examine how score varies by the number of estimators
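
One possible way to get from a fitted grid search to a table of RMSE by forest size is sketched below. It runs on synthetic data, and the fixed settings and `n_estimators` values in `param_grid` are placeholders assumed for this example, not the values you should use. The `cv_results_` attribute is a dictionary that converts directly to a data frame; `param_n_estimators` and `mean_test_score` are the columns of interest.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (not the housing data).
X_demo, y_demo = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=0)

# Placeholder fixed settings plus a few illustrative forest sizes.
param_grid = {'max_depth': [None], 'max_features': ['sqrt'], 'n_estimators': [5, 20, 50]}

search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    scoring='neg_root_mean_squared_error',
    cv=3,
)
search.fit(X_demo, y_demo)

# One row per parameter combination; negate the score to recover the RMSE.
results = pd.DataFrame(search.cv_results_)
rmse_by_size = pd.DataFrame({
    'n_estimators': results['param_n_estimators'].astype(int),
    'rmse': -results['mean_test_score'],
})
print(rmse_by_size)
```

A frame in this shape can be plotted with `rmse_by_size.plot.bar(x='n_estimators', y='rmse')`.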

In [None]:
# Create a bar plot showing RMSE (y axis) by number of trees (x axis).

# Hint: I found it useful to use pd.DataFrame() on the cross validation results.

# YOUR CODE HERE

### Summary

Replace this text with your own comments on what you learned in this problem.