# Module 1: Introduction to Scikit-Learn

## Section 2: Supervised Learning Algorithms

### Part 17: Random Forests

In this section, we will explore Random Forest, an ensemble learning method used for both classification and regression tasks.

### 17.1 Understanding Random Forest

Random Forest is an ensemble learning method for classification and regression. Ensemble learning means combining the predictions of multiple machine learning models, here decision trees, to obtain more accurate predictions than any single model typically achieves. Decision trees are simple yet powerful models: each tree makes predictions by recursively splitting the data based on feature values. A Random Forest builds many such trees during training and combines their predictions, which improves accuracy and reduces overfitting.
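To make "recursively splitting the data based on feature values" concrete, here is a minimal sketch that fits one shallow decision tree on the iris dataset and prints its learned rules. The depth limit of 2 and `random_state=0` are illustrative choices, not values taken from this section.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# A deliberately shallow tree (max_depth=2 is an arbitrary choice) so the rules stay readable
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(iris.data, iris.target)

# Each indented line in the output is one threshold test on a feature value
rules = export_text(tree, feature_names=iris.feature_names)
print(rules)
```

A Random Forest trains many trees of this kind, usually much deeper, and combines their outputs.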

### 17.2 Training and Evaluation

During training, Random Forest uses a technique called bootstrapping: it creates multiple subsets of the training data by randomly sampling with replacement, and each subset is used to train a separate decision tree. In addition, when growing each tree, Random Forest considers only a random subset of features at each split. This randomness reduces the correlation among trees and improves generalization.
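The two sources of randomness described above can be sketched with plain NumPy. This is only an illustration of the sampling idea, not scikit-learn's internal implementation; the sample count, feature count, and the square-root rule for the feature subset size are assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features, n_trees = 1000, 16, 5  # illustrative sizes

unique_fractions = []
for i in range(n_trees):
    # Bootstrapping: draw n_samples row indices WITH replacement
    bootstrap_idx = rng.integers(0, n_samples, size=n_samples)
    # Some rows repeat, so only part of the original data appears in each sample
    unique_fractions.append(np.unique(bootstrap_idx).size / n_samples)

# Feature subsampling: at a split, only a random subset of features is considered.
# The square root of the feature count is a common choice for classification.
k = int(np.sqrt(n_features))
candidate_features = rng.choice(n_features, size=k, replace=False)

print("Fraction of distinct rows per bootstrap sample:", unique_fractions)
print("Features considered at one split:", sorted(candidate_features))
```

Each bootstrap sample contains roughly 63% of the distinct training rows on average; the rows left out are often called out-of-bag samples.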

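For classification tasks, each tree in the forest makes a prediction, and the final prediction is determined by majority voting. (scikit-learn's `RandomForestClassifier` averages the trees' predicted class probabilities and picks the class with the highest average, rather than counting hard votes.) For regression tasks, the trees' predictions are averaged to obtain the final result.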
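The aggregation step can be checked by hand using the fitted trees that scikit-learn exposes in `estimators_`. The sketch below reproduces the forest's output from the individual trees. Note that `RandomForestClassifier` averages class probabilities rather than counting hard votes. The toy regression target (predicting petal width from the other three iris features) is an assumption made only for this illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

X, y = load_iris(return_X_y=True)

# Classification: average the per-tree class probabilities, then take the top class
clf = RandomForestClassifier(n_estimators=25, random_state=0).fit(X, y)
tree_probas = np.stack([tree.predict_proba(X) for tree in clf.estimators_])
manual_proba = tree_probas.mean(axis=0)
manual_labels = clf.classes_[manual_proba.argmax(axis=1)]
print(np.allclose(manual_proba, clf.predict_proba(X)))
print(np.array_equal(manual_labels, clf.predict(X)))

# Regression: the forest prediction is the mean of the per-tree predictions
X_reg, y_reg = X[:, :3], X[:, 3]  # toy target: petal width
reg = RandomForestRegressor(n_estimators=25, random_state=0).fit(X_reg, y_reg)
tree_preds = np.stack([tree.predict(X_reg) for tree in reg.estimators_])
manual_pred = tree_preds.mean(axis=0)
print(np.allclose(manual_pred, reg.predict(X_reg)))
```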

Random Forest can also provide a measure of feature importance, indicating which features are most influential in making predictions.
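As a rough sketch of how these scores can be inspected, the block below prints the impurity-based `feature_importances_` of a fitted forest next to permutation importances computed on held-out data with `sklearn.inspection.permutation_importance`. Permutation importance is not mentioned in this section; it is included here only as a commonly used cross-check, and the split and seed values are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42
)
clf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_train, y_train)

# Impurity-based importances: computed from the training data, normalized to sum to 1
impurity = clf.feature_importances_

# Permutation importances: drop in test accuracy when one feature's values are shuffled
perm = permutation_importance(clf, X_test, y_test, n_repeats=10, random_state=42)

for name, imp, pm in zip(iris.feature_names, impurity, perm.importances_mean):
    print(f"{name:20s} impurity={imp:.3f}  permutation={pm:.3f}")
```

The two measures answer slightly different questions, so their rankings need not agree exactly.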

Training individual decision trees in a Random Forest can be parallelized, making it efficient for large datasets.
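In scikit-learn, this parallelism is controlled by the `n_jobs` parameter. The following sketch, on a synthetic dataset whose size is an arbitrary choice, fits the same forest serially and with all available cores and checks that both give the same predictions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data; the sizes here are illustrative only
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# n_jobs=1 trains trees one after another; n_jobs=-1 uses all available CPU cores
serial = RandomForestClassifier(n_estimators=50, random_state=0, n_jobs=1).fit(X, y)
parallel = RandomForestClassifier(n_estimators=50, random_state=0, n_jobs=-1).fit(X, y)

# With the same random_state, parallel training should not change the model's output
same_output = np.allclose(serial.predict_proba(X), parallel.predict_proba(X))
print("Same predictions:", same_output)
```

The actual speed-up depends on the number of cores, the dataset size, and the number of trees.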

Random Forest has several hyperparameters to tune, such as the number of trees in the forest, the maximum depth of each tree, and the number of features to consider at each split.
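In scikit-learn these three hyperparameters correspond to `n_estimators`, `max_depth`, and `max_features`. As a minimal sketch, the block below compares a few hand-picked settings with 5-fold cross-validation on iris; the specific values are arbitrary examples, not recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# Hypothetical candidate settings, chosen only to show the parameter names
candidate_settings = [
    {"n_estimators": 10, "max_depth": 2, "max_features": 1},
    {"n_estimators": 100, "max_depth": None, "max_features": "sqrt"},
    {"n_estimators": 200, "max_depth": 5, "max_features": 2},
]

results = []
for params in candidate_settings:
    model = RandomForestClassifier(random_state=0, **params)
    scores = cross_val_score(model, X, y, cv=5)  # accuracy by default for classifiers
    results.append((params, scores.mean()))
    print(params, f"mean CV accuracy: {scores.mean():.3f}")
```

For a systematic search over many combinations, `GridSearchCV` automates this loop.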

#### Random Forest Classification Example

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load the iris dataset and hold out 30% of it for testing
iris = load_iris()
X = iris.data
y = iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train a forest of 100 trees and evaluate it on the held-out test set
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")

# Plot the impurity-based feature importances (they sum to 1)
feature_importances = clf.feature_importances_
plt.figure(figsize=(6, 4))
plt.bar(range(len(feature_importances)), feature_importances, tick_label=iris.feature_names)
plt.title("Feature Importances")
plt.xlabel("Features")
plt.ylabel("Importance")
plt.show()
```

This code uses a Random Forest Classifier to classify iris flower species based on their features. It splits the dataset into training and testing sets, then trains the classifier with 100 decision trees. After making predictions on the test data, it calculates the accuracy. It also plots the feature importances, showing which features contribute most to the classification. On this small, well-separated dataset, the Random Forest Classifier typically reaches high test accuracy.

#### Random Forest Regression Example

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Set a random seed for reproducibility
np.random.seed(0)

# Generate synthetic data
X = np.linspace(0, 10, 100).reshape(-1, 1)  # 100 points in [0, 10] as a 2D feature matrix
y = 2 * X.ravel() + 1 + np.random.normal(0, 1, 100)  # Linear trend plus Gaussian noise

# Hold out a test set so the metrics are computed on data the model has not seen
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Create a Random Forest Regressor (fixed random_state for reproducible trees)
regressor = RandomForestRegressor(random_state=0)

# Define a grid of hyperparameters to search
param_grid = {
    'n_estimators': [10, 50, 100],     # Number of trees in the forest
    'max_depth': [None, 10, 20, 30],   # Maximum depth of the trees
    'min_samples_split': [2, 5, 10],   # Minimum number of samples required to split an internal node
    'min_samples_leaf': [1, 2, 4]      # Minimum number of samples required to be at a leaf node
}

# Create the GridSearchCV object (5-fold cross-validation on the training set)
grid_search = GridSearchCV(regressor, param_grid, cv=5, scoring='neg_mean_squared_error')

# Fit the grid search to the training data
grid_search.fit(X_train, y_train)

# Print the best hyperparameters found
print("Best Hyperparameters:")
print(grid_search.best_params_)

# Get the best Random Forest Regressor model (refit on the full training set)
best_regressor = grid_search.best_estimator_

# Predict target values for the held-out test set
y_pred = best_regressor.predict(X_test)

# Calculate evaluation metrics on the test set
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

# Print evaluation metrics
print(f"Mean Absolute Error (MAE): {mae:.2f}")
print(f"Mean Squared Error (MSE): {mse:.2f}")
print(f"R-squared (R²): {r2:.2f}")

# Plot the synthetic dataset and the model's predictions over the full X range
plt.figure(figsize=(6, 4))
plt.scatter(X, y, label="Synthetic Data", c='b', s=20)
plt.plot(X, best_regressor.predict(X), c='r', label="Predictions", linewidth=2)
plt.xlabel("X")
plt.ylabel("y")
plt.title("Synthetic Dataset and Predictions")
plt.legend()
plt.grid(True)
plt.show()
```

In this example, a synthetic dataset is created with a linear relationship between X and y, contaminated with random noise. A Random Forest Regressor is trained to predict y from X, and a hyperparameter grid search with 5-fold cross-validation selects the best model configuration. The best model's predictions are then evaluated using Mean Absolute Error (MAE), Mean Squared Error (MSE), and R-squared (R²). The plot shows the predictions following the underlying linear trend as a step-like curve, because each tree predicts a constant value within each leaf.
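One consequence of the tree-based structure is worth checking on the same kind of data: a forest cannot extrapolate a linear trend beyond the range it was trained on. The sketch below uses assumed settings (100 trees, fixed seeds) and queries the model at X values outside [0, 10].

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Same kind of synthetic data as above: y = 2x + 1 plus Gaussian noise
rng = np.random.default_rng(0)
X = np.linspace(0, 10, 100).reshape(-1, 1)
y = 2 * X.ravel() + 1 + rng.normal(0, 1, 100)

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# Query points beyond the training range
X_outside = np.array([[12.0], [15.0], [20.0]])
preds = model.predict(X_outside)

print("Forest predictions:", preds)
print("Underlying trend:  ", 2 * X_outside.ravel() + 1)
```

All three queries fall into the rightmost leaf of every tree, so the forest returns the same value for each of them instead of continuing the line.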

### 17.3 Summary

Random Forest is an ensemble learning method widely used in machine learning for both classification and regression tasks. It constructs multiple decision trees during training and combines their outputs to make more robust predictions. Each tree is trained on a bootstrap sample of the data and considers a random subset of features at each split, which reduces overfitting. In classification tasks, it selects the most frequent class among the trees (majority voting), while in regression tasks, it averages the outputs. Random Forests generally perform well with little tuning, handle high-dimensional data, and are less prone to overfitting than a single decision tree. They also provide feature importances, which can aid feature selection.