# Random Forest Regression

Random Forest Regression is an ensemble learning technique that combines multiple decision tree regressors to improve prediction accuracy and control overfitting. Here’s an in-depth look at how it works:

### Basic Concept

A random forest is a collection of decision tree regressors. Each tree is trained on a bootstrap sample of the data, and at each split it considers only a random subset of the features. The final prediction is the average of the predictions of all individual trees.

### Steps Involved

#### 1. Bootstrapping

The first step in building a random forest is to create multiple bootstrap samples of the original dataset. In bootstrapping:
- **Random Sampling with Replacement**: Each bootstrap sample is created by randomly drawing samples from the original dataset with replacement, typically as many draws as the dataset has samples. This means some samples may appear multiple times, and some may not appear at all.
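
Sampling with replacement can be sketched in a few lines of NumPy. The dataset size `N = 10` and the seed are arbitrary choices for illustration, and the indices stand in for rows of a real dataset.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

N = 10                   # hypothetical dataset size
indices = np.arange(N)   # stand-in for the row indices of the dataset

# Draw N row indices with replacement: this is one bootstrap sample
bootstrap_idx = rng.choice(indices, size=N, replace=True)

# Rows that were never drawn are left out of this particular sample
left_out_idx = np.setdiff1d(indices, bootstrap_idx)

print("bootstrap sample:", bootstrap_idx)
print("left out        :", left_out_idx)
```

For large $N$, the probability that a given row appears at least once in a bootstrap sample is $1 - (1 - 1/N)^N \approx 1 - 1/e \approx 0.632$, so roughly a third of the rows are left out of each sample.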

#### 2. Building Decision Trees

Each bootstrapped subset is used to train a separate decision tree. During the training of each tree:
- **Random Feature Selection**: At each split in the tree, a random subset of features is considered rather than evaluating all features. This introduces more diversity among the trees.
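
In scikit-learn, the number of features considered at each split is controlled by the `max_features` parameter of `RandomForestRegressor`. The sketch below uses synthetic data and an arbitrary choice of 2 out of 6 features; both are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))  # synthetic data: 200 samples, 6 features
y = X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

# max_features=2: two candidate features are drawn at random at every split
forest = RandomForestRegressor(n_estimators=20, max_features=2, random_state=0)
forest.fit(X, y)

# Each fitted tree records the number of features it considered per split
print(forest.estimators_[0].max_features_)
```

In recent scikit-learn releases the regressor's default is to consider all features at each split, so the feature randomness described here has to be requested explicitly. It is worth checking the documentation of the installed version.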

#### 3. Aggregating Predictions

Once all trees are trained, each tree makes its own prediction for a given input, and the final prediction of the random forest is the average of these predictions.
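
The averaging step can be checked against scikit-learn directly: a fitted `RandomForestRegressor` exposes its trees as `estimators_`, and averaging their predictions by hand should reproduce `forest.predict`. The data below is synthetic and only serves the check.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(100, 1))  # synthetic single-feature data
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=100)

forest = RandomForestRegressor(n_estimators=10, random_state=0).fit(X, y)

x_new = np.array([[6.5]])

# One prediction per tree, then the plain average
per_tree = np.array([tree.predict(x_new)[0] for tree in forest.estimators_])
manual_average = per_tree.mean()

forest_prediction = forest.predict(x_new)[0]
print(np.isclose(manual_average, forest_prediction))
```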

### Detailed Process

#### 1. Data Preparation
- **Original Dataset**: Assume we have a dataset $D$ with $N$ samples and $M$ features.

#### 2. Bootstrapping
- **Creating Bootstrapped Subsets**: For $B$ trees, create $B$ bootstrapped subsets $D_1, D_2, \ldots, D_B$, each containing $N$ samples randomly chosen with replacement from $D$.

#### 3. Building Trees
- **Training Individual Trees**: For each bootstrapped subset $D_b$:
  - Start with a root node.
  - For each node in the tree, select a random subset of $k$ features (where $k \leq M$).
  - Among the selected features, find the split (feature and threshold) that minimizes a criterion such as the Mean Squared Error (MSE) of the resulting child nodes.
  - Split the node at the best split and repeat the process recursively for the child nodes.
  - Stop when a stopping criterion is met (e.g., maximum depth, minimum samples per leaf); if none is set, each tree $T_b$ is grown to full depth.
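
The bootstrapping and tree-building steps can be combined into a minimal from-scratch sketch. It delegates the growing of each tree to scikit-learn's `DecisionTreeRegressor`, and the helper names `fit_forest` and `predict_forest` are hypothetical, introduced only for this illustration. It is not how `RandomForestRegressor` is implemented internally.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_forest(X, y, n_trees=25, k=None, seed=0):
    """Train n_trees trees, each on its own bootstrap sample (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    n_samples = X.shape[0]
    trees = []
    for _ in range(n_trees):
        # Bootstrap sample D_b: n_samples row indices drawn with replacement
        idx = rng.integers(0, n_samples, size=n_samples)
        # max_features=k: k random candidate features at every split
        tree = DecisionTreeRegressor(
            max_features=k,
            random_state=int(rng.integers(0, 2**31 - 1)),
        )
        tree.fit(X[idx], y[idx])
        trees.append(tree)
    return trees

def predict_forest(trees, X):
    """Average the predictions of all trees (hypothetical helper)."""
    return np.mean([tree.predict(X) for tree in trees], axis=0)

# Synthetic data: 300 samples, 4 features
rng = np.random.default_rng(42)
X = rng.uniform(-3, 3, size=(300, 4))
y = X[:, 0] ** 2 + X[:, 1] + rng.normal(scale=0.2, size=300)

trees = fit_forest(X, y, n_trees=25, k=2)
preds = predict_forest(trees, X[:5])
print(preds)
```

No stopping criterion is passed to the trees here, so each one is grown to full depth.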

#### 4. Making Predictions
- **Individual Tree Prediction**: For a new data point $x$, each tree $T_b$ makes a prediction $\hat{y}_b(x)$.
- **Aggregating Predictions**: The final prediction $\hat{y}(x)$ of the random forest is the average of the predictions of all trees:
  $$ \hat{y}(x) = \frac{1}{B} \sum_{b=1}^{B} \hat{y}_b(x) $$

### Advantages of Random Forest Regression

1. **Improved Accuracy**: Averaging the predictions of many trees reduces variance, so random forests overfit less and are generally more accurate than an individual decision tree.
2. **Robustness**: Random forests are less sensitive to outliers and noise in the data due to the averaging process.
3. **Feature Importance**: They provide a measure of feature importance, which helps in understanding the contribution of each feature to the prediction.
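
Feature importance can be read from a fitted scikit-learn forest through the `feature_importances_` attribute, which holds impurity-based scores normalized to sum to 1. The two-feature dataset below is synthetic: one feature drives the target and the other is pure noise.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 500
informative = rng.uniform(0, 1, size=n)    # drives the target
noise_feature = rng.uniform(0, 1, size=n)  # unrelated to the target
X = np.column_stack([informative, noise_feature])
y = 3.0 * informative + rng.normal(scale=0.1, size=n)

forest = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

importances = forest.feature_importances_
for name, score in zip(["informative", "noise_feature"], importances):
    print(f"{name}: {score:.3f}")
```

The informative feature should receive nearly all of the importance. Impurity-based scores are only one way of measuring importance and can be misleading for some kinds of features, so they are best treated as a first indication.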

### Disadvantages of Random Forest Regression

1. **Complexity**: Random forests are more complex and require more computational resources than individual decision trees.
2. **Interpretability**: The ensemble nature makes them harder to interpret compared to a single decision tree.
3. **Training Time**: Training multiple trees can be time-consuming, especially with large datasets.
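
Because the trees are trained independently of each other, training can be spread over several CPU cores. In scikit-learn this is done with the `n_jobs` parameter; the sketch below, on synthetic data, only checks that the parallel and sequential fits agree for a fixed `random_state`. No timing is claimed, since the speed-up depends on the machine and the dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 8))  # synthetic data
y = X[:, 0] * X[:, 1] + rng.normal(scale=0.1, size=1000)

# Same forest, trained with one worker and with two workers
sequential = RandomForestRegressor(n_estimators=50, random_state=0, n_jobs=1).fit(X, y)
parallel = RandomForestRegressor(n_estimators=50, random_state=0, n_jobs=2).fit(X, y)

same = np.allclose(sequential.predict(X[:10]), parallel.predict(X[:10]))
print(same)
```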

### Example Scenario

Imagine we want to predict the price of houses based on various features such as size, location, number of rooms, and age of the house. Here's how a random forest regression model would work:

1. **Bootstrapping**: Create multiple bootstrapped subsets of the house dataset.
2. **Building Trees**: Train a decision tree on each subset, randomly selecting a subset of features at each split.
3. **Aggregating Predictions**: For a new house, each tree in the forest makes a price prediction, and the final predicted price is the average of these predictions.
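
The scenario can be sketched end to end with scikit-learn. All data below is synthetic: the column names, the price formula, and the noise level are assumptions made up for this illustration, with `distance_to_center_km` standing in for location.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 500

# Synthetic houses (hypothetical columns, not real market data)
houses = pd.DataFrame({
    "size_sqm": rng.uniform(40, 200, size=n),
    "n_rooms": rng.integers(1, 7, size=n),
    "age_years": rng.uniform(0, 80, size=n),
    "distance_to_center_km": rng.uniform(0, 30, size=n),
})
# Made-up price formula plus noise
price = (
    3000 * houses["size_sqm"]
    + 10000 * houses["n_rooms"]
    - 500 * houses["age_years"]
    - 4000 * houses["distance_to_center_km"]
    + rng.normal(scale=15000, size=n)
)

X_train, X_test, y_train, y_test = train_test_split(
    houses, price, test_size=0.2, random_state=0
)

forest = RandomForestRegressor(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)
r2 = forest.score(X_test, y_test)  # R^2 on held-out houses
print(f"test R^2: {r2:.3f}")

# Predict the price of one new house (same columns, same order)
new_house = pd.DataFrame([{
    "size_sqm": 120.0,
    "n_rooms": 4,
    "age_years": 15.0,
    "distance_to_center_km": 5.0,
}])
predicted_price = forest.predict(new_house)[0]
print(f"predicted price: {predicted_price:.0f}")
```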

### Mathematical Formulation

#### 1. Bootstrapping

Given a dataset $D$ with $N$ samples, create $B$ bootstrapped datasets $D_1, D_2, \ldots, D_B$, each containing $N$ samples randomly selected with replacement from $D$.

#### 2. Random Feature Selection

For each tree $T_b$:
- At each split, select a random subset of $k$ features from the $M$ available features.
- Evaluate the best split among the $k$ features based on criteria such as MSE.
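
The split search at a single node can be sketched from scratch. The function name `best_split` and the choice of midpoints between consecutive feature values as candidate thresholds are assumptions for this illustration; library implementations are more elaborate.

```python
import numpy as np

def best_split(X, y, k, rng):
    """Return (feature, threshold, weighted_mse) of the best split among k random features."""
    n_samples, n_features = X.shape
    # Draw k distinct candidate features for this node
    candidates = rng.choice(n_features, size=k, replace=False)
    best = (None, None, np.inf)
    for j in candidates:
        values = np.unique(X[:, j])
        # Candidate thresholds: midpoints between consecutive distinct values
        thresholds = (values[:-1] + values[1:]) / 2.0
        for t in thresholds:
            left = y[X[:, j] <= t]
            right = y[X[:, j] > t]
            # MSE of the two children, weighted by their sizes
            score = (left.var() * left.size + right.var() * right.size) / n_samples
            if score < best[2]:
                best = (int(j), float(t), float(score))
    return best

# Synthetic node data: the target is a step function of feature 0
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(50, 3))
y = (X[:, 0] > 0.5).astype(float)

# k = 3 considers every feature here, so feature 0 is always a candidate
feature, threshold, score = best_split(X, y, k=3, rng=rng)
print(feature, threshold, score)
```

With `k` smaller than the number of features, feature 0 would sometimes not be among the candidates and the node would settle for a worse split. That is the source of diversity between trees.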

#### 3. Tree Prediction

For a new input $x$, each tree $T_b$ makes a prediction $\hat{y}_b(x)$.

#### 4. Aggregated Prediction

The final prediction $\hat{y}(x)$ of the random forest is:
$$ \hat{y}(x) = \frac{1}{B} \sum_{b=1}^{B} \hat{y}_b(x) $$

In summary, Random Forest Regression combines multiple decision trees, each trained on a bootstrap sample of the data with random feature subsets at each split, to provide more robust and accurate predictions. It improves the bias-variance tradeoff mainly by reducing variance: averaging many trees lowers overfitting while largely keeping the low bias of the individual trees.
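
Why averaging helps can be stated more precisely under a simplifying assumption that is not part of the text above: suppose the $B$ tree predictions $\hat{y}_b(x)$ are identically distributed with variance $\sigma^2$ and pairwise correlation $\rho$ (both symbols are introduced here for illustration). The variance of their average is then:

$$ \operatorname{Var}\left( \frac{1}{B} \sum_{b=1}^{B} \hat{y}_b(x) \right) = \rho \sigma^2 + \frac{1 - \rho}{B} \sigma^2 $$

The second term shrinks as $B$ grows, while the first depends only on how correlated the trees are. Random feature selection is meant to lower that correlation.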

### Example

```python
# Import Libraries

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
```

```python
# Import Data

data = pd.read_csv('Position_Salaries.csv')
X = data.iloc[:, 1:-1].values  # features: every column except the first and the last
y = data.iloc[:, -1].values    # target: the last column
```

### Training the Random Forest Regression model on the whole dataset

```python
from sklearn.ensemble import RandomForestRegressor

# 10 trees; random_state fixes the randomness so the result is reproducible
regressor = RandomForestRegressor(n_estimators=10, random_state=0)
regressor.fit(X, y)
```

### Predict Results

```python
regressor.predict([[6.5]])
```

```
array([167000.])
```
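
One property worth keeping in mind when reading such a prediction: each tree predicts an average of training targets, so a random forest cannot predict outside the range of targets it was trained on. The sketch below uses stand-in data, not the contents of `Position_Salaries.csv`, to show this for an input far beyond the training range.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Stand-in data (NOT the contents of Position_Salaries.csv): level -> target
X = np.arange(1, 11, dtype=float).reshape(-1, 1)
y = 1000.0 * X[:, 0] ** 2

regressor = RandomForestRegressor(n_estimators=10, random_state=0)
regressor.fit(X, y)

inside = regressor.predict([[6.5]])[0]         # within the training range
far_outside = regressor.predict([[100.0]])[0]  # far beyond the largest level

# Both predictions stay between the smallest and largest training target
print(inside, far_outside, y.max())
```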