# *Predicting Boston Housing Prices using Random Forest*
#### Authors: Tom Sharp, Troy Sattgast

<br>
## Part 0: Environment Setup

In [None]:
# Import the os library, the pandas library (aliased as pd), and the numpy library (aliased as np)

import os
import pandas as pd
import numpy as np

Please update the **username** variable below with your username on your computer.

In [None]:
username = "tomsharp"

# Store the paths to frequently used files

parent_path = r"C:/Users/" +str(username)+"/Desktop/Demystifying_ML/random_forest/"
data_path = parent_path + "data/Boston Housing Prices.csv"
data_dict_path = parent_path + "data/boston_data_dict.csv"
image_path = parent_path + 'images/'

print(parent_path)
print(data_path)
print(data_dict_path)
print(image_path)

## Part 1: Data Import, Exploration, and Cleaning

During any analysis, it is always important to first examine your data. This involves looking at the data itself, the column names, and some summary statistics about the data.

In [None]:
# Read in the data using the pandas package. The data is stored in what is called a dataframe (similar to a spreadsheet)

data = pd.read_csv(data_path)

In [None]:
# Examine number of rows and columns 

print("(num of rows, num of columns) = ", data.shape)

In [None]:
# That's a lot of rows. Let's just look at the first three rows of the data, instead of all of them

print(data.head(3))

In [None]:
# We can list all the column names by calling the "columns" attribute of "data" 
# Def: Attribute - describes the data (an adjective)

print(data.columns)

In [None]:
# We can view summary statistics about the data by calling the "describe()" method of "data"
# Def: Method - takes an action on the data (a verb)

print(data.describe())

In [None]:
#What do all these fields mean? Let's use the data dictionary to find out

data_dict = pd.read_csv(data_dict_path)
print(data_dict)

**This last value, *cmedv*, is what we would like to predict using machine learning. Before we can predict, we need to make sure we clean the data.**

In [None]:
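# Clean the data - do all of this at once

# fillna returns a new dataframe, so assign the result back to keep the change
data = data.fillna(0)
# Convert the yes/no strings in the river column to booleans
data['river'] = data['river'].replace({'no': False, 'yes': True}).astype(bool)
# Drop the town name column
data = data.drop(columns=['town'])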

print(data)
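
A small aside on the cleaning step: most pandas methods, `fillna` included, return a new dataframe and leave the original untouched unless the result is assigned back. The sketch below uses a made-up three-row frame (the `toy` name and its values are purely illustrative, not part of the Boston data) to show the difference.

```python
import numpy as np
import pandas as pd

# A tiny, made-up dataframe with one missing value (illustrative only)
toy = pd.DataFrame({"rooms": [6.0, np.nan, 7.0], "river": ["no", "yes", "no"]})

# Calling fillna without assigning the result leaves toy unchanged
toy.fillna(0)
missing_before = int(toy["rooms"].isna().sum())

# Assigning the result back keeps the filled values
toy = toy.fillna(0)
missing_after = int(toy["rooms"].isna().sum())

print("missing before assignment:", missing_before)
print("missing after assignment:", missing_after)
```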

*Side Note - In most applications of data science and machine learning, we would take a closer look at cleaning the data.
Data gathering and cleansing is often said to consume 80% or more of the DS/ML process; however, this dataset happened to be extremely clean when it was retrieved from its source online.*

## Part 2: Decision Tree - The Building Block of Random Forest

<img src="https://github.com/tmsharp/random_forest/blob/master/images/tree_joke.jpg?raw=true" height="500" align="center"/>


### Conceptual Introduction

In machine learning, the columns to be used as inputs (X) are referred to as the **features**, and the output (y) value is referred to as the **target** or the **label**.
<br>
Since we are given the target/label values in this dataset, the type of machine learning we will be doing is called **supervised**. 
<br>
In particular, we will be using a random forest. Before we jump into that, we need to understand the basic building block of that model, known as the decision tree. 
<br>
<br>
A decision tree is one of the easiest machine learning models to comprehend, since it is easily visualized. The graphic below is an example of a simple decision tree. Notice that each *node* contains a yes/no question, and each *branch* leads to a new node, unless it leads to an answer. These answers are called *leaves* or *leaf nodes*.

<img src="https://cdn-images-1.medium.com/max/1200/0*Yclq0kqMAwCQcIV_.jpg" width="500" height="500" align="center"/>

How are these questions determined? The decision tree is given several features (inputs) and determines which questions to ask to *gain the most information about the outcome*, i.e., to maximize **information gain**. You can think of a decision tree like a game of *Guess Who?*. Each round, you ask the one question that gets the most information out of the opposing player.
<br>
<br>
For example, a popular first-round question is, *"Is your character a man or woman?"*. This gives you a lot more information than asking *"Is your character Joe?"*.

<img src="http://nothingbutnostalgia.com/wp-content/uploads/2017/09/Guess-Who-300x190.jpg" width="500" height="500" align="center"/>
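
To make the idea of "asking the most informative question" concrete, here is a minimal sketch of how a single split could be scored for a regression problem like ours. It assumes the tree measures information as the reduction in variance of the target after the split, which is one common criterion for regression trees (classification trees typically use entropy or Gini impurity instead). The function name `variance_reduction` and the toy numbers are hypothetical.

```python
import numpy as np

def variance_reduction(feature, target, threshold):
    """Score the question 'is feature <= threshold?' by how much it reduces target variance."""
    left = target[feature <= threshold]
    right = target[feature > threshold]
    # A split that leaves one side empty tells us nothing
    if len(left) == 0 or len(right) == 0:
        return 0.0
    # Weighted average of the variance on each side of the split
    weighted = (len(left) * left.var() + len(right) * right.var()) / len(target)
    return target.var() - weighted

# Made-up data: number of rooms and house price (in thousands)
rooms = np.array([4.0, 5.0, 5.0, 7.0, 8.0, 8.0])
price = np.array([15.0, 17.0, 16.0, 30.0, 33.0, 32.0])

# Try each candidate threshold and keep the most informative one
candidates = [4.5, 6.0, 7.5]
scores = {t: variance_reduction(rooms, price, t) for t in candidates}
best_threshold = max(scores, key=scores.get)

print(scores)
print("best question: rooms <=", best_threshold)
```

In this toy data the cheap and expensive houses separate cleanly around six rooms, so that question removes the most variance, much like the first-round question in *Guess Who?* that splits the board roughly in half.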

### Splitting the Data

In order to perform supervised learning, we will **train/fit** our model, and then **test** our model to see how accurate it is. We do this by first dividing the data into the **training data** and the **testing data**. In order for our model to be trained adequately, we would like it to have as much data as possible. Therefore, we take 80% of our current dataset to be the training data, and the remaining 20% to be the testing data. This is somewhat arbitrary, but the split usually lies around 75 / 25 or 80 / 20. 
<br>

Also recall from above that the input (X) values are referred to as **features** and the output (y) values are referred to as **targets** or **labels**. We need to store the columns in our dataset into these variables before we can split our data.


<img src="https://github.com/tmsharp/random_forest/blob/master/images/splitting_data.png?raw=true" width="700" height="700" align="center"/>

We will split our data first between **features** and **labels**, and then between **training data** and **testing data**.

Features vs. Labels

In [None]:
# Drop the cmedv column from the dataset to give all the features
features = data.drop('cmedv', axis=1)

# Select only the cmedv column to give the labels
labels = data['cmedv']

# Convert to numpy arrays - these are similar to dataframes but have less structure.
# sklearn also accepts dataframes; arrays are used here to keep the later cells simple
features = np.array(features)
labels = np.array(labels)

Training Data vs. Testing Data

In [None]:
# Use scikit-learn to split the data and store the data into variables. Notice we specify test_size = 0.2, as explained above

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(features, labels, test_size = 0.2, random_state = 1)
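
As a quick sanity check of the 80/20 split, the sketch below runs `train_test_split` on a made-up array of ten rows (not the housing data) and prints how many rows end up on each side. The `toy_` names are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten made-up rows with two features each, and one label per row
toy_features = np.arange(20).reshape(10, 2)
toy_labels = np.arange(10)

# test_size = 0.2 keeps 20% of the rows for testing; random_state makes the shuffle repeatable
X_tr, X_te, y_tr, y_te = train_test_split(toy_features, toy_labels, test_size=0.2, random_state=1)

print("training rows:", X_tr.shape[0])
print("testing rows:", X_te.shape[0])
```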

### The Supervised Learning Approach

As we know, a decision tree is a supervised learning model, since we have labels that help the algorithm learn. The following picture depicts the supervised learning approach.

<img src="https://github.com/tmsharp/random_forest/blob/master/images/Supervised_Learning.png?raw=true" width="700" height="700" align="center"/>


### Step 1: Train the Model

With scikit-learn, training the model (Step 1) takes three lines of code: import, instantiate, and fit.

In [None]:
#import
from sklearn import tree

#instantiate 
decision_tree = tree.DecisionTreeRegressor(random_state = 8)

#train/fit
decision_tree = decision_tree.fit(X_train, y_train)

Let's see what our *Trained Model* looks like by converting the tree into an image.

In [None]:
# View the picture after converting to png (I did this already)
from IPython.display import Image, display
display(Image(filename=image_path + "tree.png"))

*Side-Note: To convert this dot file on your own, you need to use some command line magic. I converted the file beforehand, so you can view the tree by running the code block above.
For anyone interested, the command is below (make sure you have changed into the random_forest/images directory and have activated the dm_ml environment)*

>\> *dot -Tpng tree.dot -o tree.png*
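
For reference, a dot file like the one mentioned above can be produced with scikit-learn's `export_graphviz` function. The sketch below is an assumption about how such a file could be generated, not a record of how `tree.dot` was actually created; it fits a small tree on made-up data and uses placeholder feature names so that it runs on its own.

```python
import numpy as np
from sklearn import tree

# Made-up training data so the example is self-contained
rng = np.random.RandomState(0)
X_demo = rng.rand(50, 2)
y_demo = 10 * X_demo[:, 0] + rng.rand(50)

demo_tree = tree.DecisionTreeRegressor(max_depth=2, random_state=8)
demo_tree.fit(X_demo, y_demo)

# With out_file=None the dot source is returned as a string;
# pass a file name such as "tree.dot" instead to write it to disk
dot_source = tree.export_graphviz(demo_tree, out_file=None,
                                  feature_names=["feature_a", "feature_b"],
                                  filled=True, rounded=True)

print(dot_source[:80])
```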

### Step 2: Test the Model

Here, we will use the features from the testing data to generate predictions on the housing prices. The testing labels are held back so that we can compare them with the predictions afterwards. Let's see what the model comes up with.

In [None]:
# Use the tree's predict method on the test data

predictions = decision_tree.predict(X_test)

# Format and print
predictions = pd.Series(predictions)
print(predictions)

### Step 3: Calculate the Accuracy

Let's view the actual y values from the test data (y_test) next to the model's predicted values (predictions)

In [None]:
y_tst = pd.Series(y_test.flatten())
compare = pd.DataFrame(data = {'y_test': y_tst, 'predictions': predictions})
print(compare)

<br>
As you can see, the predicted values differ from the actual y values, and the size of the error varies from row to row. To better understand our model, however, we want a single overall score across all the rows.

In [None]:
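# For a regression model, score() returns R^2 (the coefficient of determination),
# which this notebook loosely refers to as accuracy.
# accuracy_score is meant for classification, so it is not imported here
accuracy = decision_tree.score(np.array(X_test), np.array(y_test))

print("Our tree's accuracy (R^2) is " + str(round(accuracy*100,2)) + "%")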
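
For a regressor, scikit-learn's `score` method returns the coefficient of determination, R², rather than a percentage of correct answers. R² compares the model's squared errors with those of a baseline that always predicts the mean of the actual values: 1.0 is a perfect fit, and 0 means no better than that baseline. The sketch below computes it by hand on made-up numbers and checks the result against `sklearn.metrics.r2_score`.

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up actual and predicted prices (in thousands)
actual = np.array([20.0, 25.0, 30.0, 35.0])
predicted = np.array([22.0, 24.0, 29.0, 37.0])

# Sum of squared errors of the model
ss_res = np.sum((actual - predicted) ** 2)
# Sum of squared errors of a baseline that always predicts the mean
ss_tot = np.sum((actual - actual.mean()) ** 2)

r2_by_hand = 1 - ss_res / ss_tot

print("R^2 by hand:", r2_by_hand)
print("R^2 from sklearn:", r2_score(actual, predicted))
```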

This seems like a pretty good accuracy, except we **overfit** the model...
<br>

By fitting the model "out of the box", we allowed the tree to grow as large as possible. This causes the tree to overfit the data.
Overfitting is when the model follows the *"noise"* of the **training data** too closely, and therefore won't generalize well to new input data later on. 

<img src="https://github.com/tmsharp/random_forest/blob/master/images/overfitting_underfitting.png?raw=true" width="700" height="700" align="center"/>
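
One way to spot overfitting is to compare the score on the training data with the score on the testing data. The sketch below uses made-up noisy data (not the housing data) and an unconstrained tree; a large gap between the two scores is the typical symptom.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Made-up data: a simple trend plus random noise
rng = np.random.RandomState(0)
X_demo = rng.rand(200, 1) * 10
y_demo = 2 * X_demo[:, 0] + rng.normal(scale=3.0, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=1)

# No depth limit, so the tree can keep splitting until it memorizes the training rows
deep_tree = DecisionTreeRegressor(random_state=8).fit(X_tr, y_tr)

train_score = deep_tree.score(X_tr, y_tr)
test_score = deep_tree.score(X_te, y_te)

print("training score:", round(train_score, 3))
print("testing score:", round(test_score, 3))
```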

There are ways to combat overfitting by tuning the model. One way to do this is to decrease the depth of the tree (either during or after fitting; research *pruning*). We won't get into that here. Instead, we will show another way to predict our outcomes more accurately (and powerfully): the random forest.

## Part 3: Random Forest

<img src="https://github.com/tmsharp/random_forest/blob/master/images/random-forest.jpg?raw=true" width="700" height="700" align="center"/>


The random forest is an **ensemble model**, i.e., it combines multiple models into one larger model. By combining multiple decision trees, the random forest is usually able to improve on the prediction accuracy of a single tree.

<br> The random forest combines multiple decision trees by using a concept called **bootstrap aggregating**, or **bagging** for short. This method builds multiple (often hundreds or thousands of) decision trees during the *Train the Model* step, each trained on a random sample of the training data drawn with replacement. When we *Test the Model*, each decision tree predicts the output and the random forest combines all the outputs into a *single* output. It does this by either taking a majority vote (in classification) or by averaging the values (in regression, which is our case).

This is all done behind the scenes within sklearn. The same 3-step process is used (recall that the data was originally split above).
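
To peek behind the scenes, the sketch below fits a small forest on made-up data and checks that its prediction matches the plain average of the predictions of its individual trees, which scikit-learn exposes through the fitted `estimators_` attribute. The data and the small `n_estimators` value are chosen only to keep the example fast.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Made-up data so the example is self-contained
rng = np.random.RandomState(0)
X_demo = rng.rand(100, 3)
y_demo = 5 * X_demo[:, 0] + rng.rand(100)

small_forest = RandomForestRegressor(n_estimators=25, random_state=10)
small_forest.fit(X_demo, y_demo)

# Ask every tree in the forest for its own prediction
per_tree = np.array([t.predict(X_demo) for t in small_forest.estimators_])

# For regression, the forest's answer is the average over the trees
averaged = per_tree.mean(axis=0)
forest_prediction = small_forest.predict(X_demo)

print("trees in forest:", len(small_forest.estimators_))
print("matches forest prediction:", np.allclose(averaged, forest_prediction))
```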

### Step 1: Train the Model

In [None]:
# Again, we import, instantiate, and then fit
# Here, n_estimators is the number of decision trees in our random forest

from sklearn.ensemble import RandomForestRegressor
rf = RandomForestRegressor(n_estimators = 1000, random_state = 10)
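# np.ravel makes sure the labels are a flat 1-D array, which is the shape the forest expects
rf.fit(X_train, np.ravel(y_train))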

### Step 2: Test the Model

In [None]:
# Now let's predict and see the predictions next to the actual y values

predictions = rf.predict(X_test)
y_tst = pd.Series(y_test.flatten())
pd.DataFrame(data = {'y_test': y_tst, 'predictions':predictions}).head(10)

### Step 3: Calculate the Accuracy

In [None]:
# score() returns R^2 for a regression model, which this notebook loosely calls accuracy
accuracy = rf.score(np.array(X_test), np.array(y_test))

print("Our forest's accuracy (R^2) is " + str(round(accuracy*100,2)) + "%")

Our accuracy improved! Let's take a deeper look at how close our model comes to predicting actual housing prices.

In [None]:
# Calculate the average error for the predicted results
# Flatten y_test first: subtracting a column-shaped array from a flat one would broadcast into a matrix
absolute_errors = abs(predictions - np.ravel(y_test))
mean_absolute_error = np.mean(absolute_errors)
mean_absolute_error = round(mean_absolute_error, 4)

# Remember that our housing prices are in thousands of dollars, so let's show that here
print('Mean Absolute Error: $', round(1000*mean_absolute_error, 2), sep = '')
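
The same kind of number can be obtained with scikit-learn's `metrics.mean_absolute_error` function. The sketch below uses made-up prices and also shows why the shapes of the two arrays matter when subtracting by hand: a flat array minus a column-shaped array broadcasts into a full matrix instead of a row-by-row difference. The function is accessed through the `metrics` module here so that it does not clash with a variable of the same name.

```python
import numpy as np
from sklearn import metrics

# Made-up prices in thousands of dollars
actual_column = np.array([[20.0], [25.0], [30.0]])   # shape (3, 1), like a one-column dataframe
predicted = np.array([22.0, 24.0, 33.0])             # shape (3,), like the output of predict()

# Subtracting directly broadcasts to a 3 x 3 matrix, which is not what we want
broadcast_shape = (predicted - actual_column).shape

# Flattening the column first gives one error per row
mae_by_hand = np.mean(np.abs(predicted - actual_column.ravel()))
mae_sklearn = metrics.mean_absolute_error(actual_column.ravel(), predicted)

print("shape without flattening:", broadcast_shape)
print("MAE by hand:", mae_by_hand)
print("MAE from sklearn:", mae_sklearn)
```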

In [None]:
# Open image file
from IPython.display import Image, display
display(Image(filename=image_path + "tree_from_random_forest_output.png"))

Let's see which factors of a neighborhood influence its price the most. We can do this using a few more complex techniques in Python. I won't go into the details here; the code below was written by William Koehrsen for his article, which can be found here: https://towardsdatascience.com/random-forest-in-python-24d0893d51c0

In [None]:
# Names of the feature columns, in the same order as the columns the model was trained on
feature_list = list(data.drop('cmedv', axis=1).columns)

# Get numerical feature importances
importances = list(rf.feature_importances_)

# List of tuples with variable and importance
feature_importances = [(feature, round(importance, 2)) for feature, importance in zip(feature_list, importances)]

# Sort the feature importances by most important first
feature_importances = sorted(feature_importances, key = lambda x: x[1], reverse = True)

# Print out the feature and importances
for pair in feature_importances:
    print('Variable: {:20} Importance: {}'.format(*pair))

It looks like lstat and rooms make up about 80% of the importance in predicting the housing price.<br>
Does this make sense? It is always important to look at your model output and determine whether it logically matches the context of the problem.

## Part 4: Model Simplification

In [None]:
sorted_importances = [importance[1] for importance in feature_importances]
sorted_features = [importance[0] for importance in feature_importances]
cumulative_importances = np.round(np.cumsum(sorted_importances),2)

importance_df = pd.DataFrame({'features': sorted_features, 'importance': sorted_importances, 'cumulative importance': cumulative_importances})
importance_df

In [None]:
# We can pick what we want our cumulative importance to be. Here I chose 95%
new_df = importance_df[importance_df['cumulative importance'] <= 0.95]
new_df

In [None]:
# Let's say we only want to use these features. We can re-run the random forest with only these

#split the data
features = data[new_df.features]
labels = data['cmedv']
print(features.columns)

#convert to numpy arrays
import numpy as np
features = np.array(features)
labels = np.array(labels)

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(features, labels, test_size = 0.2, random_state = 1)

In [None]:
rf = RandomForestRegressor(n_estimators = 1000, random_state = 10)
rf.fit(X_train, y_train)
predictions = rf.predict(X_test)

# score() returns R^2 for a regression model, which this notebook loosely calls accuracy
accuracy = rf.score(np.array(X_test), np.array(y_test))
print("Our forest's accuracy (R^2) is " + str(round(accuracy*100,2)) + "%")

# Calculate the average error for the predicted results
absolute_errors = abs(predictions - y_test)
mean_absolute_error = np.mean(absolute_errors)
mean_absolute_error = round(mean_absolute_error, 4)
print('Mean Absolute Error: $', 1000*mean_absolute_error, sep = '')

Remember that our accuracy before was about 92% and that the MAE was about $2,100.
While our new model may seem only just as good as before, remember that we dropped more than half of our features.
That suggests the other features really weren't adding much value.
This is just one example of how models do not have to be super complex to provide good results. What we also gained here is decreased runtime: it took less computation time to get almost the same accuracy. When scaling a model to production, this is a trade-off we would likely want to make.

## Part 5: Tuning the Forest

"While model parameters are learned during training — such as the slope and intercept in a linear regression — hyperparameters must be set by the data scientist before training." - William Koehrsen
<br>
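
As a hedged illustration of what setting hyperparameters before training can look like, the sketch below uses scikit-learn's `GridSearchCV` to try a few combinations of `n_estimators` and `max_depth` on made-up data and keep the best one according to cross-validation. The grid values are arbitrary assumptions chosen to keep the example fast; they are not recommendations for the housing data.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Made-up data so the example is self-contained
rng = np.random.RandomState(0)
X_demo = rng.rand(120, 4)
y_demo = 5 * X_demo[:, 0] + 2 * X_demo[:, 1] + rng.normal(scale=0.5, size=120)

# Hyperparameter values to try (arbitrary, illustrative choices)
param_grid = {
    "n_estimators": [10, 50],
    "max_depth": [2, 5, None],
}

# Each combination is trained and scored with 3-fold cross-validation
search = GridSearchCV(RandomForestRegressor(random_state=10), param_grid, cv=3)
search.fit(X_demo, y_demo)

print("best hyperparameters:", search.best_params_)
print("best cross-validated score:", round(search.best_score_, 3))
```

After fitting, `search.best_estimator_` holds a forest retrained on all of the supplied data with the winning combination, so it can be used for prediction in the same way as `rf` above.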


<br>
<br>
## Resources 

<img src="https://github.com/tmsharp/random_forest/blob/master/images/data_science.jpg?raw=true" width="400" height="400" align="right"/>
**Data Source** <br>
https://www.cs.toronto.edu/~delve/data/boston/bostonDetail.html

**Jupyter** <br>
https://jupyter-notebook-beginner-guide.readthedocs.io/en/latest/what_is_jupyter.html

**Pandas** <br>
https://www.datacamp.com/community/tutorials/pandas-tutorial-dataframe-python

**Information Gain and Entropy** <br>
https://www.saedsayad.com/decision_tree.htm <br>

**How to Become a Data Scientist** <br>
https://towardsdatascience.com/how-to-learn-data-science-if-youre-broke-7ecc408b53c7 <br>
https://www.class-central.com/subject/data-science <br>


**General Sources & Good Reads** <br>
https://towardsdatascience.com/train-test-split-and-cross-validation-in-python-80b61beca4b6 <br>
https://medium.com/machine-learning-for-humans/why-machine-learning-matters-6164faf1df12 <br>
https://medium.com/@williamkoehrsen/random-forest-simple-explanation-377895a60d2d <br>
https://towardsdatascience.com/random-forest-in-python-24d0893d51c0 <br>
https://github.com/WillKoehrsen/Data-Analysis/tree/master/random_forest_explained <br>
https://towardsdatascience.com/hyperparameter-tuning-the-random-forest-in-python-using-scikit-learn-28d2aa77dd74