# Random Forests

Using a more sophisticated machine learning algorithm. 

## Introduction

Decision trees leave you with a difficult decision. A deep tree with lots of leaves will overfit because each prediction comes from historical data on only the few houses at its leaf. But a shallow tree with few leaves will perform poorly because it fails to capture as many distinctions in the raw data.

Even today's most sophisticated modeling techniques face this tension between underfitting and overfitting. But many models have clever ideas that can lead to better performance. We'll look at the **random forest** as an example.
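To make the tension concrete, here is a minimal sketch that compares training and validation error for trees of different sizes. It runs on synthetic data from `make_regression` rather than the housing data, and the helper `tree_mae` is a hypothetical name introduced only for this illustration.

```python
from sklearn.datasets import make_regression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the housing data (an assumption for illustration)
X_demo, y_demo = make_regression(n_samples=500, n_features=5, noise=20.0, random_state=0)
tr_X, va_X, tr_y, va_y = train_test_split(X_demo, y_demo, random_state=0)

def tree_mae(max_leaf_nodes):
    """Return (training MAE, validation MAE) for a tree limited to max_leaf_nodes leaves."""
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(tr_X, tr_y)
    train_mae = mean_absolute_error(tr_y, model.predict(tr_X))
    val_mae = mean_absolute_error(va_y, model.predict(va_X))
    return train_mae, val_mae

# None places no limit on the number of leaves, so the tree can memorize the training rows
for leaves in [2, 10, 50, None]:
    train_mae, val_mae = tree_mae(leaves)
    print(f"max_leaf_nodes={leaves}: train MAE={train_mae:.1f}, validation MAE={val_mae:.1f}")
```

On data like this, the unrestricted tree should show a training error near zero while its validation error stays well above it, which is the overfitting pattern described above. The very small tree has high error on both sets, which is underfitting.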

The random forest uses many trees, and it makes a prediction by averaging the predictions of each component tree. It generally has much better predictive accuracy than a single decision tree, and it works well with default parameters. If you keep modeling, you can learn more models with even better performance, but many of those are sensitive to getting the right parameters.
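The averaging step can be checked directly. The sketch below, which uses synthetic data and illustrative variable names, fits a small scikit-learn forest and compares its prediction with the mean of the predictions from the individual trees stored in `estimators_`.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic data, used only to illustrate the averaging behaviour
X_demo, y_demo = make_regression(n_samples=200, n_features=4, noise=10.0, random_state=0)

forest = RandomForestRegressor(n_estimators=10, random_state=0)
forest.fit(X_demo, y_demo)

# Each fitted component tree is available in forest.estimators_
per_tree_preds = np.array([tree.predict(X_demo[:5]) for tree in forest.estimators_])
manual_average = per_tree_preds.mean(axis=0)

# The two printed arrays should agree up to floating-point rounding
print(manual_average)
print(forest.predict(X_demo[:5]))
```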

**Example**

We've already seen the code to load the data a few times. At the end of data-loading, we have the following variables:

- train_X
- val_X
- train_y
- val_y

In [5]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Load data 


melbourne_path = '/Users/jeanzayas/Desktop/Divergence/DATA ANALYSIS/Learning/ML kaggle learn/Data_ML_try/Melbourne Housing/melb_data.csv'

melbourne_data = pd.read_csv(melbourne_path)

# Filter rows with missing values
melbourne_data = melbourne_data.dropna(axis=0)

# Choose target and features 
y = melbourne_data.Price
melbourne_features = ['Rooms', 'Bathroom', 'Landsize', 'BuildingArea', 'YearBuilt', 'Lattitude', 'Longtitude']

X = melbourne_data[melbourne_features]


#--------------------
# Split data into training and validation data, for both features and target
# The split is based on a random number generator.
# Supplying a numeric value to the random_state argument guarantees we get the same split every time we run this script.
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=0)
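As a small illustration of the `random_state` comment above, the following sketch splits a made-up DataFrame twice with the same seed and checks that the same rows end up in the training set. The frame and variable names are assumptions for this example only.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Tiny made-up frame, used only to illustrate the behaviour
demo = pd.DataFrame({"feature": range(20), "target": range(100, 120)})

split_a = train_test_split(demo[["feature"]], demo["target"], random_state=0)
split_b = train_test_split(demo[["feature"]], demo["target"], random_state=0)

# Same seed: the same rows land in the training set on both runs
same_rows = split_a[0].index.equals(split_b[0].index)
print(same_rows)

# With the default settings, a quarter of the rows are held out for validation
print(len(split_a[0]), len(split_a[1]))
```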


We build a random forest model similarly to how we built a decision tree in scikit-learn, this time using the RandomForestRegressor class instead of DecisionTreeRegressor.

In [6]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

forest_model = RandomForestRegressor(random_state=1)
forest_model.fit(train_X, train_y) 
melb_preds = forest_model.predict(val_X)
print('The MAE is: ', mean_absolute_error(val_y, melb_preds))

The MAE is:  191669.7536453626


There is likely room for further improvement, but this is a big improvement over the best decision tree error of 250,000. There are parameters that let you change the performance of the Random Forest, much as we changed the maximum depth of the single decision tree. But one of the best features of Random Forest models is that they generally work reasonably well even without this tuning.
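For readers who want to try such tuning, here is a hedged sketch that loops over two `RandomForestRegressor` parameters, `n_estimators` and `max_leaf_nodes`, and records the validation MAE for each combination. The data is synthetic and the candidate values are arbitrary choices for illustration, not recommended settings.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the housing data (an assumption for illustration)
X_demo, y_demo = make_regression(n_samples=400, n_features=6, noise=15.0, random_state=0)
tr_X, va_X, tr_y, va_y = train_test_split(X_demo, y_demo, random_state=0)

# Map each (n_estimators, max_leaf_nodes) pair to its validation MAE
results = {}
for n_trees in [10, 50, 100]:
    for leaves in [20, None]:
        model = RandomForestRegressor(n_estimators=n_trees, max_leaf_nodes=leaves, random_state=1)
        model.fit(tr_X, tr_y)
        results[(n_trees, leaves)] = mean_absolute_error(va_y, model.predict(va_X))

# Pick the setting with the lowest validation error
best_setting = min(results, key=results.get)
print(best_setting, results[best_setting])
```

The same loop can be pointed at `train_X`, `val_X`, `train_y`, and `val_y` from the example above to see how much tuning changes the result on the real data.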

# _________________

## Recap
Here's the code we've written so far.

In [7]:
# Path of the file to read

iowa_file_path = '/Users/jeanzayas/Desktop/Divergence/DATA ANALYSIS/Learning/ML kaggle learn/Data_ML_try/Housing Prices/train.csv'

home_data = pd.read_csv(iowa_file_path)

# Create target object and call it y
y = home_data.SalePrice

# Create X
features = ['LotArea', 'YearBuilt', '1stFlrSF', '2ndFlrSF', 'FullBath', 'BedroomAbvGr', 'TotRmsAbvGrd']
X = home_data[features]

# Split into validation and training data
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)

# Specify Model
iowa_model = DecisionTreeRegressor(random_state=1)
# Fit Model
iowa_model.fit(train_X, train_y)

# Make validation predictions and calculate mean absolute error
val_predictions = iowa_model.predict(val_X)
val_mae = mean_absolute_error(val_y, val_predictions)
print("Validation MAE when not specifying max_leaf_nodes: {:,.0f}".format(val_mae))

# Using best value for max_leaf_nodes
iowa_model = DecisionTreeRegressor(max_leaf_nodes=100, random_state=1)
iowa_model.fit(train_X, train_y)
val_predictions = iowa_model.predict(val_X)
val_mae = mean_absolute_error(val_y, val_predictions)
print("Validation MAE for best value of max_leaf_nodes: {:,.0f}".format(val_mae))




Validation MAE when not specifying max_leaf_nodes: 29,653
Validation MAE for best value of max_leaf_nodes: 27,283


Data science isn't always this easy. But replacing the decision tree with a Random Forest is going to be an easy win.

## Use a Random Forest


In [8]:
from sklearn.ensemble import RandomForestRegressor

# Define the model. Set random_state to 1
rf_model = RandomForestRegressor(random_state=1)

# fit your model
rf_model.fit(train_X, train_y)
# Calculate the mean absolute error of your Random Forest model on the validation data
rf_val_prediction = rf_model.predict(val_X)
rf_val_mae = mean_absolute_error(val_y, rf_val_prediction)

print("Validation MAE for Random Forest Model: {}".format(rf_val_mae))


Validation MAE for Random Forest Model: 21857.15912981083


## Train a model for the competition

The code cell above trains a Random Forest model on "train_X" and "train_y".
Use the code cell below to build a Random Forest model and train it on all of "X" and "y".

In [9]:
# To improve accuracy, create a new Random Forest model which you will train on all training data
rf_model_on_full_data = RandomForestRegressor(random_state=0)

# fit rf_model_on_full_data on all data from the training data
rf_model_on_full_data.fit(X, y)


# path to file you will use for predictions
test_data_path = '/Users/jeanzayas/Desktop/Divergence/DATA ANALYSIS/Learning/ML kaggle learn/Data_ML_try/Housing Prices/test.csv'

# read test data file using pandas
test_data = pd.read_csv(test_data_path)

# create test_X which comes from test_data but includes only the columns you used for prediction.
# The list of columns is stored in a variable called features

test_X = test_data[features]

# make predictions which we will submit. 
test_preds = rf_model_on_full_data.predict(test_X)


### Generate a submission

Run the code cell below to generate a CSV file with your predictions that you can use to submit to the competition.

In [10]:
# Run the code to save predictions in the format used for competition scoring

output = pd.DataFrame({'Id': test_data.Id,
                       'SalePrice': test_preds})
output.to_csv('submission.csv', index=False)
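Before uploading, it can be worth reading the file back to confirm it has the expected shape. The sketch below is one way to do that. It writes made-up predictions to a temporary directory so that it runs on its own, and all values and variable names in it are assumptions for illustration.

```python
import os
import tempfile

import pandas as pd

# Made-up values standing in for test_data.Id and test_preds
demo_output = pd.DataFrame({"Id": [1, 2, 3],
                            "SalePrice": [125000.0, 158000.5, 183250.0]})

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "submission.csv")
    demo_output.to_csv(demo_path, index=False)
    # Read the file back exactly as it was written to disk
    check = pd.read_csv(demo_path)

# The file should contain only the two expected columns and no missing values
has_expected_columns = list(check.columns) == ["Id", "SalePrice"]
has_no_missing = not check.isnull().values.any()
print(has_expected_columns, has_no_missing, len(check))
```

To check the real file, read `submission.csv` with `pd.read_csv` and apply the same two checks, along with a comparison of `len(check)` against `len(test_data)`.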