# *Predicting Boston Housing Prices using Random Forest*
#### Authors: Tom Sharp, Troy Sattgast


## Agenda ##
Part 1: Data Import, Exploration, and Cleaning <br>
Part 2: Decision Tree - The Building Block of Random Forest <br>
Part 3: Random Forest <br>
Part 4: Forest Simplification <br>
Part 5: Tuning the Forest

<br>
## Part 0: Environment Setup

In [None]:
# Import the os library, the pandas library (aliased as pd), and the numpy library (aliased as np)

import os
import pandas as pd
import numpy as np

In [None]:
print(os.getcwd())

In [None]:
# Store the paths to frequently used files

parent_path = os.getcwd()
data_path = os.path.join(parent_path,  'data', 'Boston Housing Prices.csv')
data_dict_path = os.path.join(parent_path, 'data', 'boston_data_dict.csv')
image_path = os.path.join(parent_path, 'images')


print(parent_path)
print(data_path)
print(data_dict_path)
print(image_path)

## Part 1: Data Import, Exploration, and Cleaning

During any analysis, it is always important to first examine your data. This involves looking at the data itself, the column names, and some summary statistics about the data.

In [None]:
# Read in the data using the pandas package. The data is stored in what is called a dataframe (similar to a spreadsheet)

data = pd.read_csv(data_path)

In [None]:
# Examine number of rows and columns 

print("num of rows, num of columns = ", data.shape)

In [None]:
# That's a lot of rows. Let's just look at the first three rows of the data, instead of all of them

print(data.head(3))

In [None]:
# We can list all the column names by calling the "columns" attribute of "data" 
# Def: Attribute - describes the data (an adjective)

print(list(data.columns))

In [None]:
# We can view summary statistics about the data by calling the "describe()" method of "data"
# Def: Method - take an action on the data (a verb)

print(data.describe())

In [None]:
#What do all these fields mean? Let's use the data dictionary to find out

data_dict = pd.read_csv(data_dict_path)
print(data_dict)

**This last value, *cmedv*, is what we would like to predict using machine learning. Before we can predict, we need to make sure we clean the data.**

In [None]:
# check for NAs, NaNs, etc. 
any(data.isnull().sum(axis=1))
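
The check above returns a single True/False. To see where missing values sit, it can help to count them per column. The snippet below is a minimal sketch on a made-up toy dataframe (the name `toy` and its values are illustrative assumptions, not the housing data).

```python
import pandas as pd

# Hypothetical toy dataframe standing in for the housing data
toy = pd.DataFrame({
    'rooms': [6.5, None, 7.1],
    'river': ['no', 'yes', None],
    'cmedv': [24.0, 21.6, 34.7],
})

# Count the missing values in each column
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# True if at least one value anywhere in the dataframe is missing
has_missing = bool(toy.isnull().values.any())
print(has_missing)
```

Running the same two calls on `data` would show which columns, if any, need cleaning.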

In [None]:
# Clean the data - do all of this at once 

# fillna returns a new dataframe, so assign the result back
data = data.fillna(0)
# Convert the yes/no river flag to booleans (True where the value is 'yes')
data['river'] = data['river'] == 'yes'
# Drop the town column
data = data.drop(['town'], axis = 1)

print("Data clean!")

In [None]:
data.head()

In [None]:
# Calculate the average cmedv (stored in thousands of dollars)

val_mean = data['cmedv'].mean()*1000
print('The average housing price (in 1980) in Boston was ${:,.2f}'.format(val_mean))

*Side Note - In most applications of data science and ML, we would take a closer look at cleaning the data. Data gathering and cleansing usually consumes more than 80% of the DS/ML process; however, this dataset happened to be extremely clean when it was retrieved from its source online.*

## Part 2: Decision Tree - The Building Block of Random Forest

<img src="images/tree_joke.jpg" height="500" align="center"/>


### Conceptual Introduction

In machine learning, the columns to be used as inputs (X) are referred to as the **features**, and the output (y) value is referred to as the **target** or the **label**.
<br>

Since we are given the target/label values in this dataset, the type of machine learning we will be doing is called **supervised**. 
<br>
In particular, we will be using a random forest. Before we jump into that, we need to understand the basic building block of that model, known as the decision tree. 
<br>
<br>
A decision tree is one of the easiest machine learning models to comprehend, since it is easily visualized. The graphic below is an example of a simple decision tree. Notice that each *node* contains a yes/no question, and each *branch* leads to a new node, unless it leads to an answer. These answers are called *leaves* or *leaf nodes*.

<img src="images/decision_tree_example.jpg" width="500" height="500" align="center"/>

How are these questions determined? The decision tree is given several features (inputs) and determines which questions to ask to *gain the most information about the outcome*, i.e., to maximize **information gain**. You can think of a decision tree like a game of *Guess Who?*. Each round, you ask the one question that gets the most information out of the opposing player. 
<br>
<br>
For example, a popular first round question is, *"Is your character a man or woman?"*. This gives you a lot more information than asking *"Is your character Joe?"*.
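
The *Guess Who?* intuition can be put into numbers. The sketch below assumes a hypothetical board of 8 equally likely characters, half of them men, and measures information in bits using entropy. The function names are made up for illustration. Note also that scikit-learn's regression trees choose splits by reducing squared error rather than entropy, so this shows the idea, not the exact criterion used later in this notebook.

```python
import math

def entropy(n_candidates):
    """Uncertainty (in bits) when n equally likely candidates remain."""
    return math.log2(n_candidates) if n_candidates > 0 else 0.0

def information_gain(n_total, n_yes):
    """Expected drop in uncertainty after a yes/no question.

    n_yes is the number of candidates for which the answer is 'yes'.
    """
    n_no = n_total - n_yes
    remaining = (n_yes / n_total) * entropy(n_yes) + (n_no / n_total) * entropy(n_no)
    return entropy(n_total) - remaining

# Hypothetical board: 8 characters, 4 of them men, exactly one named Joe
gain_gender = information_gain(8, 4)
gain_joe = information_gain(8, 1)

print("Gain from the man/woman question: {:.3f} bits".format(gain_gender))
print("Gain from the 'Is it Joe?' question: {:.3f} bits".format(gain_joe))
```

The even split removes a full bit of uncertainty, while the "Joe" question removes only about half a bit on average, which is why the broader question is the better opening move.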

<img src="images/guess_who.jpg" width="500" height="500" align="center"/>

### Splitting the Data

In order to perform supervised learning, we will **train** (aka, fit) our model, and then **test** our model to see how accurate it is. We do this by first dividing the data into the **training data** and the **testing data**. In order for our model to be trained adequately, we would like it to have as much data as possible. Therefore, we take 80% of our current dataset to be the training data, and the remaining 20% to be the testing data. This is somewhat arbitrary, but the split usually lies around 75 / 25 or 80 / 20. 
<br>

Also recall from above that the input (X) values are referred to as **features** and the output (y) values are referred to as **targets** or **labels**. We need to store the columns in our dataset into these variables before we can split our data.


<img src="images/splitting_data.png" width="700" height="700" align="center"/>

We will first split our data into **features** and **labels**, and then **training data** and **testing data**.

*Features vs. Labels*

In [None]:
# Drop the cmedv column - the features are all the columns except this one
features = data.drop(['cmedv'], axis=1)
feature_list = list(data.drop(['cmedv'], axis=1).columns)

# Select the cmedv column - cmedv is the only label
labels = data['cmedv']

# Convert to numpy arrays - these are similar to dataframes but have less structure.
# sklearn also accepts dataframes, but plain arrays keep the later steps simple
features = np.array(features)
labels = np.array(labels)

*Training Data vs. Testing Data*

In [None]:
# Use scikit-learn to split the data and store the data into variables. 
# Notice we specify test_size = 0.2. This gives the 80/20 split as explained above

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(features, labels, test_size = 0.2, random_state = 1)

### The Supervised Learning Approach

As we know, a decision tree is a supervised learning model, since we have labels that help the algorithm learn. The following picture depicts the supervised learning approach.

<img src="images/Supervised_Learning.png" width="700" height="700" align="center"/>


### Step 1: Train the Model

With scikit-learn, it takes only 3 lines of code to train the model.

In [None]:
# import
from sklearn import tree

# instantiate 
decision_tree = tree.DecisionTreeRegressor(random_state = 8)

# train/fit
decision_tree = decision_tree.fit(X_train, y_train)

Let's see what our *Trained Model* looks like by converting the tree into an image.

In [None]:
# View the picture of the tree (it was converted to png beforehand)
from IPython.display import Image
Image(filename="images/tree.png")

*Side-Note: To convert the dot file on your own, you need to use some command line magic. I converted the file beforehand, so you can view the tree by running the code block above.
For anyone interested, the command is below (make sure you are cd'd into the random_forest/images directory and are in the dm_ml environment)*

>\> *dot -Tpng tree.dot -o tree.png*
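
The notebook does not show how a dot file like `tree.dot` is created. One way to produce it is scikit-learn's `export_graphviz`. The sketch below is an assumption about that step, not necessarily what the authors ran. It uses a small synthetic dataset and a shallow tree (`X_demo`, `y_demo`, and `demo_tree` are illustrative names) so it runs on its own.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_graphviz

# Synthetic stand-in data: the target depends mostly on the first feature
rng = np.random.RandomState(0)
X_demo = rng.rand(50, 2)
y_demo = 3 * X_demo[:, 0] + rng.normal(scale=0.1, size=50)

# A shallow tree keeps the exported graph small
demo_tree = DecisionTreeRegressor(max_depth=2, random_state=8).fit(X_demo, y_demo)

# With out_file=None the Graphviz description is returned as a string
dot_text = export_graphviz(demo_tree, out_file=None,
                           feature_names=['x0', 'x1'], rounded=True)
print(dot_text[:200])
```

Passing a path such as `out_file='images/tree.dot'` instead of `None` writes the description to disk, where the `dot` command above can turn it into a png.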

### Step 2: Test the Model

Here, we will use the features from the testing data to generate predictions on the housing prices, and then compare those predictions to the testing labels. Let's see what the model comes up with.

In [None]:
# Use the tree's predict method on the test data
tree_predictions = decision_tree.predict(X_test)

# Format and display the predictions next to the actual values
tree_predictions = pd.Series(tree_predictions)
pd.DataFrame(data = {'predicted value': tree_predictions, 'actual value': y_test.ravel()})

<br>
As you can see, the predicted values differ from the actual values, and the size of the error varies from row to row. To better understand our model, however, we want a single measure of error across all the rows. We will use the root mean squared error (RMSE).

<img src="images/rmse_formula.png" width="200" height="200"/>

### Step 3: Calculate the Accuracy

Let's compare the actual y values from the test data (y_test) with the model's predicted values (tree_predictions) by calculating the RMSE. The lower the RMSE, the better the model.

In [None]:
# Import some functions from sklearn
from sklearn.metrics import mean_squared_error, mean_absolute_error

# Calculate the ROOT-mse (mean square error), aka the RMSE
tree_rmse = np.sqrt(mean_squared_error(y_test, tree_predictions))

# Print the results
print("Our tree's TEST RMSE is {:.2f} or ${:,.2f}".format(tree_rmse, tree_rmse*1000))
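
The RMSE can also be written out by hand: take the difference between each actual and predicted value, square it, average the squares, and take the square root. The sketch below does this on three made-up prices (in thousands of dollars, like cmedv). The function name `rmse` and the numbers are illustrative.

```python
import numpy as np

def rmse(actual, predicted):
    """Root mean squared error between two equally long sequences."""
    # Flatten both inputs so that shapes (n, 1) and (n,) line up row by row
    actual = np.asarray(actual, dtype=float).ravel()
    predicted = np.asarray(predicted, dtype=float).ravel()
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

# Made-up values, in thousands of dollars
actual_prices = [20.0, 30.0, 25.0]
predicted_prices = [22.0, 27.0, 25.0]

# Errors are -2, 3 and 0, so the mean squared error is (4 + 9 + 0) / 3
example_rmse = rmse(actual_prices, predicted_prices)
print("RMSE: {:.3f}".format(example_rmse))
```

The flattening step matters when doing this by hand: subtracting an array of shape (n,) from one of shape (n, 1) would broadcast to an n-by-n matrix instead of n row-wise errors.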

This seems like a pretty good accuracy, except we **overfit** the model...

In [None]:
# What happens if we try to predict the y's on our training data?
predictions_training = decision_tree.predict(X_train)

# Calculate the RMSE (the square root of the mean squared error)
training_rmse = np.sqrt(mean_squared_error(y_train, predictions_training))

print("Our tree's TRAIN RMSE is {:.2f}.".format(training_rmse))

By fitting the model "out of the box", we allowed the tree to grow as large as possible. We can see this because there was absolutely no error when it predicted the y values for the training data: the model predicted every y value exactly. This causes the tree to overfit the data. 

Overfitting is when the model follows the *"noise"* of the **training data** too closely, and therefore won't generalize well to new input data later on. 

<img src="images/overfitting_underfitting.png" width="700" height="700" align="center"/>

There are ways to combat overfitting by tuning the model. One way to do this is to decrease the depth of the tree (either during or after fitting - research *pruning*). We won't get into that here. Instead, we will show another way to more accurately (and powerfully) predict our outcomes - the random forest.
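
As a brief illustration of the depth idea mentioned above, the sketch below fits one unrestricted tree and one tree limited to `max_depth=3` on synthetic data and prints the train and test RMSE for each. The data and variable names are made up for illustration. Whether the shallower tree actually scores better on test data depends on the dataset, so treat the printed numbers as something to explore rather than a guarantee.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic noisy data: a sine curve plus random noise
rng = np.random.RandomState(1)
X_demo = rng.uniform(0, 10, size=(200, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(scale=0.3, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=1)

results = {}
for depth in [None, 3]:  # None lets the tree grow as large as possible
    model = DecisionTreeRegressor(max_depth=depth, random_state=8).fit(X_tr, y_tr)
    train_rmse = np.sqrt(mean_squared_error(y_tr, model.predict(X_tr)))
    test_rmse = np.sqrt(mean_squared_error(y_te, model.predict(X_te)))
    results[depth] = (train_rmse, test_rmse)
    print("max_depth={}: train RMSE {:.3f}, test RMSE {:.3f}".format(depth, train_rmse, test_rmse))
```

The unrestricted tree reproduces its training data exactly, the same symptom seen with the housing data, while the depth-limited tree cannot memorize every training point.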

*Side note - If we set the random states to different numbers, the results would be different. While this decision tree is fairly accurate, we can possibly improve accuracy using the random forest model. The random forest model essentially builds multiple decision trees, takes the outputs from all of those trees, and determines the final prediction by taking the average (regression) or the mode (classification) of the outputs.*

## Part 3: Random Forest

<img src="images/random-forest.jpg" width="700" height="700" align="center"/>


The random forest is an **ensemble model**, i.e., it combines multiple models into one larger model. By combining multiple decision trees, the random forest is able to improve the prediction accuracy. 

<br> The random forest combines multiple decision trees by using a concept called **bootstrap aggregating**, or **bagging** for short. This method builds multiple (often hundreds or thousands of) decision trees during the *Train the Model* step, each trained on a random sample of the training data drawn with replacement. When we *Test the Model*, each decision tree predicts the output and the random forest combines all the outputs into a *single* output. It does this by either taking a majority vote (in classification) or by aggregating the values (in regression, which is our case) by use of a mean, median, etc. 

This is all done behind the scenes within sklearn. The same 3-step process is used (recall that the data was originally split above).
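
To make the aggregation step concrete, the sketch below fits a small forest on synthetic data and checks that its prediction matches the plain average of its individual trees' predictions. The data and names (`X_demo`, `forest`, `X_new`) are illustrative assumptions. `estimators_` is the fitted forest's list of trees.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data with three features
rng = np.random.RandomState(0)
X_demo = rng.rand(100, 3)
y_demo = 2 * X_demo[:, 0] + X_demo[:, 1] + rng.normal(scale=0.1, size=100)

# A small forest keeps the example fast
forest = RandomForestRegressor(n_estimators=25, random_state=10).fit(X_demo, y_demo)

# Ask every tree for its own prediction on five new rows
X_new = rng.rand(5, 3)
per_tree = np.array([one_tree.predict(X_new) for one_tree in forest.estimators_])

# Average across the trees and compare with the forest's own prediction
manual_average = per_tree.mean(axis=0)
matches = bool(np.allclose(manual_average, forest.predict(X_new)))
print(per_tree.shape)
print(matches)
```

Each row of `per_tree` holds one tree's predictions, so averaging down the rows reproduces what the regression forest reports.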

### Step 1: Train the Model

In [None]:
# Again, we import, instantiate, and then fit
# Here, n_estimators is the number of decision trees in our random forest

# import 
from sklearn.ensemble import RandomForestRegressor

# instantiate 
rf = RandomForestRegressor(n_estimators = 1000, random_state = 10)

# train/fit
# ravel() flattens the labels to a 1-D array, which is the shape sklearn expects
rf.fit(X_train, y_train.ravel())

### Step 2: Test the Model

In [None]:
# Now let's predict and see the predictions next to the actual y values

# predict
rf_predictions = rf.predict(X_test)

# format and print
pd.DataFrame(data = {'predicted value': rf_predictions, 'actual value': y_test.ravel()}).head()

In [None]:
# Display the image of one tree from the random forest
from IPython.display import Image
Image(filename="images/tree_from_random_forest_output.png")

### Step 3: Calculate the Accuracy

In [None]:
# Calculate accuracy (RMSE)
rf_rmse = np.sqrt(mean_squared_error(y_test, rf_predictions))

# Format and print
print("Our forest's RMSE is {:.2f} or ${:,.2f}. \n...getting better".format(rf_rmse, rf_rmse*1000))

Although we see that simply using a random forest is better than a single decision tree, there is still room for improvement. The next two sections will show you some techniques to both simplify and tune your model in order to improve its accuracy.

## Part 4: Forest Simplification

Let's see which factors of a neighborhood influence its price the most. We can do this using a few more advanced techniques in Python. 
<br>

I won't go into the details of these techniques here. The code below draws on code written by William Koehrsen in his article, which can be found here: https://towardsdatascience.com/random-forest-in-python-24d0893d51c0

In [None]:
# Get numerical feature importances
importances = list(rf.feature_importances_)

# List of tuples with variable and importance
feature_importances = [(feature, round(importance, 2)) for feature, importance in zip(feature_list, importances)]

# Sort the feature importances by most important first
feature_importances = sorted(feature_importances, key = lambda x: x[1], reverse = True)

# Print out the feature and importances 
for pair in feature_importances:
    print('Variable: {:20} Importance: {}'.format(*pair))

It looks like lstat and rooms make up about 80% of the importance in predicting the housing price.<br>
Does this make sense? It is always important to look at your model output and determine whether it logically matches the context of the problem.

In [None]:
sorted_importances = [importance[1] for importance in feature_importances]
sorted_features = [importance[0] for importance in feature_importances]
cumulative_importances = np.round(np.cumsum(sorted_importances),2)

importance_df = pd.DataFrame({'features': sorted_features, 'importance': sorted_importances, 'cumulative importance': cumulative_importances})
importance_df

We can pick what we want our cumulative importance cutoff to be. Here I chose 95%.

In [None]:
# This cell requires matplotlib - install it first if it is not already in your environment

from matplotlib import pyplot as plt
%matplotlib inline

features = importance_df['features']
importance =  importance_df['importance']
cumulative_importance = importance_df['cumulative importance']


fig, ax = plt.subplots(figsize = (15,8))
ax.scatter(features, importance, label="importance")
ax.plot(features, cumulative_importance, label="cumulative importance")
ax.legend()

plt.show()

In [None]:
# Return only enough features to give us 95% importance 

new_df = importance_df[importance_df['cumulative importance'] <= 0.95]
new_df

Let's say we only want to use these features. We can re-run the random forest with only these

In [None]:
# Split the data - features vs labels
features = data[new_df.features]
labels = data['cmedv']
print(list(features.columns))

# Convert to numpy arrays (numpy was already imported as np above)
features = np.array(features)
labels = np.array(labels)

# Split the data - training vs testing (train_test_split was already imported above)
X_train_simple, X_test_simple, y_train_simple, y_test_simple = train_test_split(
    features, labels, test_size = 0.2, random_state = 1)

### Step 1: Train the Model

In [None]:
# import - we already did this above!

# instantiate
rf_simple = RandomForestRegressor(n_estimators = 1000, random_state = 10)

# train/fit
rf_simple.fit(X_train_simple, y_train_simple)

### Step 2: Test the Model

In [None]:
# Predict 
rf_simple_predictions = rf_simple.predict(X_test_simple)

# Format and print
print(pd.DataFrame({'predictions': rf_simple_predictions, 'actual values': y_test_simple}).head())

### Step 3: Calculate the Accuracy

In [None]:
# Calculate accuracy (RMSE)
rf_simple_rmse = np.sqrt(mean_squared_error(y_test_simple, rf_simple_predictions))

# Format and print
print("Our simplified forest's RMSE is {:.2f} or ${:,.2f}.".format(rf_simple_rmse, rf_simple_rmse*1000))

Remember that our first random forest's RMSE was $2,787.01. Our model got worse!
<br>
<br>

But only slightly... we lost a little accuracy, but were able to cut the number of inputs into our model by about half. This suggests that (1) those other inputs added almost no value, and (2) we don't always need a super complex model in machine learning. 

<br>
*Side note - What we also gained here is decreased runtime - it took less computation time to get almost the same accuracy. This trade-off is extremely important in data science, especially when developing a model that will scale and/or will potentially be deployed in production.*

## Part 5: Tuning the Forest

"While model parameters are learned during training — such as the slope and intercept in a linear regression — hyperparameters must be set by the data scientist before training." - William Koehrsen
<br>


In [None]:
# Number of trees in random forest
n_estimators = [1000, 2000, 3000]

# Number of features to consider at every split
max_features =  ['sqrt', 'log2']

# Maximum number of levels in tree
max_depth = [None, 1, 2, 4]

# Minimum number of samples required to split a node
min_samples_split = [2, 4, 8]

# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 4]
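
The names used above match the hyperparameters that `RandomForestRegressor` exposes. As a quick sketch, `get_params()` lists the settings of a model before any training has happened, while the learned part of the model (the trees in `estimators_`) only exists after `fit` is called. The variable names below are illustrative.

```python
from sklearn.ensemble import RandomForestRegressor

# Hyperparameters are fixed when the model is created, before any training
model = RandomForestRegressor(n_estimators=100, random_state=10)
hyperparameters = model.get_params()

for name in ['n_estimators', 'max_features', 'max_depth',
             'min_samples_split', 'min_samples_leaf']:
    print(name, '=', hyperparameters[name])

# The fitted trees are learned from data, so they do not exist yet
is_trained = hasattr(model, 'estimators_')
print("Has fitted trees:", is_trained)
```

Any key returned by `get_params()` can be used as a key in the search grid.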


In [None]:
#create a grid
cv_grid = {'n_estimators': n_estimators,
           'max_features': max_features,
           'max_depth': max_depth,
           'min_samples_split': min_samples_split,
           'min_samples_leaf': min_samples_leaf
          }

In [None]:
# utility function to report best scores
def report(results, n_top=3):
    for i in range(1, n_top + 1):
        candidates = np.flatnonzero(results['rank_test_score'] == i)
        for candidate in candidates:
            print("Model with rank: {0}".format(i))
            print("Mean validation score: {0:.3f} (std: {1:.3f})".format(
                  results['mean_test_score'][candidate],
                  results['std_test_score'][candidate]))
            print("Parameters: {0}".format(results['params'][candidate]))
            print("")


*This will take some time to run....*

In [None]:
# run grid search
from sklearn.model_selection import GridSearchCV
from time import time

rf = RandomForestRegressor(max_depth=2, random_state=10)
grid_search = GridSearchCV(rf, param_grid=cv_grid, cv=5)
start = time()
grid_search.fit(X_train, y_train.ravel())

print("GridSearchCV took %.2f seconds for %d candidate parameter settings."
      % (time() - start, len(grid_search.cv_results_['params'])))
report(grid_search.cv_results_)
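
Once a grid search has finished, the winning settings and a model refit with them are available as `best_params_` and `best_estimator_`. The sketch below shows how the tuned model could then be scored on held-out data with the same RMSE measure used earlier. It runs on synthetic data with a deliberately tiny grid (`small_grid` and the other names are illustrative) so that it finishes quickly. Note that, unless a `scoring` argument is given, `GridSearchCV` ranks regressors by their default score (R squared), not by RMSE.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data with four features
rng = np.random.RandomState(0)
X_demo = rng.rand(120, 4)
y_demo = 5 * X_demo[:, 0] + 2 * X_demo[:, 1] + rng.normal(scale=0.2, size=120)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=1)

# A tiny grid: 2 x 2 = 4 candidate settings, each cross-validated 3 times
small_grid = {'n_estimators': [10, 30], 'max_depth': [None, 3]}
search = GridSearchCV(RandomForestRegressor(random_state=10),
                      param_grid=small_grid, cv=3)
search.fit(X_tr, y_tr)

# The best settings, and the model refit on all training data with them
print(search.best_params_)
best_model = search.best_estimator_

# Evaluate the tuned model on the held-out test data
tuned_rmse = np.sqrt(mean_squared_error(y_te, best_model.predict(X_te)))
print("Tuned test RMSE: {:.3f}".format(tuned_rmse))
```

Applied to the housing data, the same last two steps with `grid_search.best_estimator_` and `X_test` would give an RMSE that can be compared with the earlier forests.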

<br>
<br>
## Resources 

<img src="images/data_science.jpg" width="400" height="400" align="right"/>

### Introductory Topics ###

*How to Become a Data Scientist* <br>
https://towardsdatascience.com/how-to-learn-data-science-if-youre-broke-7ecc408b53c7 <br>
https://www.class-central.com/subject/data-science <br>

*Jupyter* <br>
https://jupyter-notebook-beginner-guide.readthedocs.io/en/latest/what_is_jupyter.html

*Pandas* <br>
https://www.datacamp.com/community/tutorials/pandas-tutorial-dataframe-python


### Deep Dive Topics ###

*Information Gain and Entropy* <br>
https://www.saedsayad.com/decision_tree.htm <br>

*Ensemble Models - The Power of Crowds and Aggregated Predictions* <br>
https://www.npr.org/sections/money/2015/08/07/429720443/17-205-people-guessed-the-weight-of-a-cow-heres-how-they-did <br>

*Random Forest - Feature Information* <br>
http://explained.ai/rf-importance/index.html <br>
http://www.scikit-yb.org/en/latest/api/features/importances.html <br>

*Grid Search* <br>
https://www.quora.com/Machine-Learning-How-does-grid-search-work <br>


### General ###

*Good Reads* <br>
https://towardsdatascience.com/train-test-split-and-cross-validation-in-python-80b61beca4b6 <br>
https://medium.com/machine-learning-for-humans/why-machine-learning-matters-6164faf1df12 <br>
https://medium.com/@williamkoehrsen/random-forest-simple-explanation-377895a60d2d <br>
https://towardsdatascience.com/random-forest-in-python-24d0893d51c0 <br>
https://github.com/WillKoehrsen/Data-Analysis/tree/master/random_forest_explained <br>
https://towardsdatascience.com/hyperparameter-tuning-the-random-forest-in-python-using-scikit-learn-28d2aa77dd74

*Data Source* <br>
https://www.cs.toronto.edu/~delve/data/boston/bostonDetail.html