# Unit 2 Random Forest in Machine Learning

Hey there! Today, we're going to dive into a powerful tool in machine learning called **Random Forest**. Just like a forest made up of many trees, a **Random Forest** is made up of many decision trees working together. This helps make more accurate predictions and reduces the risk of mistakes.

Our goal for this lesson is to understand how to load a dataset, split it into training and testing sets, train a **Random Forest** classifier, and use it to make predictions. Ready? Let's go!

-----

## RandomForestClassifier vs BaggingClassifier

The **RandomForestClassifier** is closely related to the **BaggingClassifier**. Both are ensemble methods that fit multiple models on various sub-samples of the dataset. The key difference is that **RandomForestClassifier** introduces an additional layer of randomization: at each split in a decision tree, it considers only a random subset of the features. A **BaggingClassifier** built on decision trees, with its default settings, lets every tree consider all features at every split.
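To make the difference concrete, here is a minimal sketch that trains both ensembles side by side. It uses the wine dataset and the same split that the rest of this lesson introduces step by step; setting `max_features="sqrt"` explicitly is our choice for illustration, to make the per-split feature sampling visible.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Bagging: each tree is trained on a bootstrap sample and may use every feature at each split
bagging_clf = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100, random_state=42)

# Random forest: bootstrap samples plus a random subset of features considered at each split
forest_clf = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=42)

for name, model in [("Bagging", bagging_clf), ("Random Forest", forest_clf)]:
    model.fit(X_train, y_train)
    print(f"{name} accuracy: {model.score(X_test, y_test):.2f}")
```

On a small dataset like this one, the two scores may well be close; the extra randomization tends to matter more when there are many correlated features.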

Why use **Random Forest**? Here are a few reasons:

  * **Reduces Overfitting**: Each tree is trained on a different random sample of the data, so averaging many trees smooths out the noise that a single tree might memorize instead of the actual pattern.
  * **Improves Accuracy**: Combining multiple predictions generally leads to better accuracy.
  * **Handles Large Feature Spaces**: **Random Forests** can manage many input features effectively.

-----

## Loading the Dataset

Let's dive into some code by loading a dataset. We'll use the wine dataset from scikit-learn, a popular machine learning library. This dataset includes measurements of wines that help classify them into different categories.

```python
from sklearn.datasets import load_wine

# Load the wine dataset
X, y = load_wine(return_X_y=True)
```

In this code, `X` represents input features (measurements of wines) and `y` represents labels (categories of wine).
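If you'd like to see what was loaded, a quick check along these lines prints the size of the feature matrix and how many wines fall into each category. The use of NumPy here is our addition for inspection only.

```python
import numpy as np
from sklearn.datasets import load_wine

X, y = load_wine(return_X_y=True)

print(X.shape)         # (178, 13): 178 wines, 13 measurements each
print(np.unique(y))    # [0 1 2]: three wine categories
print(np.bincount(y))  # [59 71 48]: number of wines per category
```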

Before training our model, we need to split our dataset into training and testing sets. This way, we can train our model on one part and test its accuracy on another.

```python
from sklearn.model_selection import train_test_split

# Splitting the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
```
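Here, `test_size=0.2` holds out 20% of the wines for testing, and `random_state=42` makes the split reproducible. As a small sketch, we can confirm the resulting sizes. The `stratify=y` variant at the end is an optional extra that the lesson itself does not use; it keeps the class proportions similar in both sets.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(X_train.shape, X_test.shape)  # (142, 13) (36, 13)

# Optional variation (not used in this lesson): preserve class proportions in both sets
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
```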

-----

## Training the Random Forest Classifier

Now, let's train our **Random Forest** classifier. A classifier assigns labels to data points. Our classifier will decide the category of the wine based on its features.

```python
from sklearn.ensemble import RandomForestClassifier

# Training a random forest classifier
rf_clf = RandomForestClassifier(n_estimators=100, random_state=42)
rf_clf.fit(X_train, y_train)
```

Here, we create a **Random Forest** with 100 trees and fit it to our training data. Note that you can also configure the individual trees: the **RandomForestClassifier** class accepts the same tree-level parameters as **DecisionTreeClassifier**, such as `max_depth` and `min_samples_split`.

For example, here is how we can control the maximum depth of each tree in the forest:

```python
rf_clf = RandomForestClassifier(n_estimators=100, random_state=42, max_depth=3)
```

Yep, it's that simple! Now every tree in the forest will be limited to `max_depth=3`.
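If you want to verify this yourself, the fitted trees are available through the `estimators_` attribute. The following sketch fits the depth-limited forest and checks how deep the trees actually grew.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

rf_clf = RandomForestClassifier(n_estimators=100, random_state=42, max_depth=3)
rf_clf.fit(X_train, y_train)

# Each element of estimators_ is a fitted DecisionTreeClassifier
depths = [tree.get_depth() for tree in rf_clf.estimators_]
print(f"Number of trees: {len(rf_clf.estimators_)}")
print(f"Deepest tree: {max(depths)} levels")
```

No tree should be deeper than 3, although some may stop earlier if their nodes become pure.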

-----

## Evaluating the Model

Now, we will evaluate the **Random Forest** model on the test set and compare its accuracy with that of a simple Decision Tree classifier.

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Training a decision tree classifier for comparison
dt_clf = DecisionTreeClassifier(random_state=42)
dt_clf.fit(X_train, y_train)

# Making predictions with both classifiers
y_pred_rf = rf_clf.predict(X_test)
y_pred_dt = dt_clf.predict(X_test)

# Calculating accuracy for both models
accuracy_rf = accuracy_score(y_test, y_pred_rf)
accuracy_dt = accuracy_score(y_test, y_pred_dt)

print(f"Random Forest Accuracy: {accuracy_rf:.2f}")
print(f"Decision Tree Accuracy: {accuracy_dt:.2f}")

# Random Forest Accuracy: 1.00
# Decision Tree Accuracy: 0.94
```

Here, we trained a **DecisionTreeClassifier** for comparison. We then made predictions on the test set using both the **Random Forest** and **Decision Tree** models, and calculated their accuracies. As you can see, on this split the **Random Forest** outperforms the single **Decision Tree** and classifies every test sample correctly (100% accuracy). Keep in mind that the test set holds only 36 wines, so a single misclassification changes the accuracy by almost 3 percentage points.
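Because one small test set can be lucky or unlucky, it can help to repeat the comparison on several splits. The sketch below uses 5-fold cross-validation for that; the choice of `cross_val_score` and `cv=5` is ours and is not part of the lesson's original workflow.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True)

models = {
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
}

cv_scores = {}
for name, model in models.items():
    # Each model is trained and scored on 5 different train/validation splits
    scores = cross_val_score(model, X, y, cv=5)
    cv_scores[name] = scores
    print(f"{name}: mean accuracy {scores.mean():.2f} (std {scores.std():.2f})")
```

Comparing the mean scores gives a steadier picture than a single train/test split.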

-----

## Lesson Summary

Great job! Let's recap:

  * **Understanding Random Forest**: A **Random Forest** is an ensemble of decision trees that combine their votes to make a prediction.
  * **RandomForestClassifier vs BaggingClassifier**: **RandomForestClassifier** adds random feature selection to the bagging method.
  * **Advantages**: **Random Forests** reduce overfitting, improve accuracy, and handle large feature spaces.
  * **Loading and Splitting Data**: We loaded a dataset and split it into training and testing sets.
  * **Training the Model**: We trained a **Random Forest** classifier using **RandomForestClassifier**, with important parameters like `n_estimators` and `random_state`.
  * **Model Evaluation**: We evaluated model performance and found that the **Random Forest** often outperforms a single **Decision Tree**.

Now that you understand **Random Forests**, it's time to practice. In the upcoming session, you'll get hands-on experience implementing and tuning a **Random Forest** model using your new skills. Get ready to experiment with different parameters and see how they affect the model's performance. Happy coding!

## Adjusting Random Forest Tree Depth

Now let's tweak the `RandomForestClassifier`. The current version seems to underperform. Find better values for `n_estimators` and `max_depth`. Make sure to reach an accuracy of 1.00; it is possible with this model and this dataset.

Go ahead and train that perfect classifier!

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load the wine dataset
X, y = load_wine(return_X_y=True)

# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train and evaluate the Random Forest classifier
rf_clf = RandomForestClassifier(n_estimators=25, random_state=42, max_depth=3)
rf_clf.fit(X_train, y_train)
accuracy_rf = accuracy_score(y_test, rf_clf.predict(X_test))

print(f"Random Forest Accuracy: {accuracy_rf:.2f}")

```

You're right! With the wine dataset, it's definitely possible to achieve very high accuracy, sometimes even 1.00, with a well-tuned Random Forest. The initial parameters might be a bit too constrained.

Let's adjust the `n_estimators` (number of trees) and `max_depth` (maximum depth of each tree) to find a combination that yields optimal performance. We'll iterate through a range of values for both to see what works best.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load the wine dataset
X, y = load_wine(return_X_y=True)

# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

best_accuracy = 0
best_n_estimators = 0
best_max_depth = 0

# Experiment with different n_estimators and max_depth values
for n_estimators in range(50, 201, 25): # Try n_estimators from 50 to 200
    for max_depth in range(5, 21, 5): # Try max_depth from 5 to 20
        # Train the Random Forest classifier
        rf_clf = RandomForestClassifier(n_estimators=n_estimators, max_depth=max_depth, random_state=42)
        rf_clf.fit(X_train, y_train)
        
        # Calculate accuracy
        accuracy_rf = accuracy_score(y_test, rf_clf.predict(X_test))
        
        # Check if current parameters give better accuracy
        if accuracy_rf > best_accuracy:
            best_accuracy = accuracy_rf
            best_n_estimators = n_estimators
            best_max_depth = max_depth

print(f"Best Random Forest Accuracy: {best_accuracy:.2f}")
print(f"Best n_estimators: {best_n_estimators}")
print(f"Best max_depth: {best_max_depth}")

# Train the "perfect" classifier with the best found parameters
perfect_rf_clf = RandomForestClassifier(n_estimators=best_n_estimators, max_depth=best_max_depth, random_state=42)
perfect_rf_clf.fit(X_train, y_train)
perfect_accuracy = accuracy_score(y_test, perfect_rf_clf.predict(X_test))

print(f"\nAccuracy of the perfect classifier: {perfect_accuracy:.2f}")

```
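One caveat about the search above: it picks the parameters by looking at test-set accuracy, so the reported score can be optimistic. As a hedged alternative, here is a sketch that chooses the parameters with cross-validation on the training data only and touches the test set once at the end. The use of `GridSearchCV`, the 5 folds, and the slightly smaller grid are our assumptions, not part of the original exercise.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# A reduced grid (illustrative values) to keep the search quick
param_grid = {
    "n_estimators": [50, 100, 200],
    "max_depth": [5, 10, 20],
}

# Every parameter combination is scored with 5-fold cross-validation on the training set
search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=5,
    scoring="accuracy",
)
search.fit(X_train, y_train)

print(f"Best parameters: {search.best_params_}")
print(f"Best cross-validated accuracy: {search.best_score_:.2f}")

# The test set is used only once, for the final check
test_accuracy = accuracy_score(y_test, search.predict(X_test))
print(f"Test accuracy: {test_accuracy:.2f}")
```

By default, `GridSearchCV` refits the best model on the whole training set, which is why `search.predict` can be called directly.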

## Complete the Random Forest Classifier for Wine Dataset

Hey, Space Explorer! Your mission is to complete the Random Forest script. Fill in the missing pieces of code to load the wine dataset, split it into training and testing sets, and train a classifier. Let's see your skills in action!

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.datasets import load_wine

# TODO: Load the wine dataset and assign the features and labels to X and y
X, y = load_wine(return_X_y=True)

# Splitting the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# TODO: Train the RandomForestClassifier with 100 estimators and max_depth of 3

# TODO: Make predictions and calculate accuracy

# TODO: Print accuracy


```

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.datasets import load_wine

# Load the wine dataset and assign the features and labels to X and y
X, y = load_wine(return_X_y=True)

# Splitting the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train the RandomForestClassifier with 100 estimators and max_depth of 3
rf_clf = RandomForestClassifier(n_estimators=100, max_depth=3, random_state=42)
rf_clf.fit(X_train, y_train)

# Make predictions and calculate accuracy
y_pred = rf_clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

# Print accuracy
print(f"Random Forest Classifier Accuracy: {accuracy:.2f}")
```

## Improving Random Forest for Wine Classification

Hey, Stellar Navigator! Ready to add some code and see the magic?

Complete the code by addressing all the TODO comments, and see the graph showing how the model's accuracy depends on the number of estimators used.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Load the wine dataset
X, y = load_wine(return_X_y=True)

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=42)

# TODO: Define the list of n_estimators to try, including integers from 1 to 40 (inclusive)
accuracies = []

# Train and test models with different n_estimators
for n in n_estimators_list:
    # TODO: train RandomForestClassifier with the given number of estimators and make predictions
    accuracies.append(accuracy_score(y_test, y_pred))

# Plot the results
plt.plot(n_estimators_list, accuracies, marker='o')
plt.xlabel('Number of Estimators')
plt.ylabel('Accuracy')
plt.title('Random Forest Accuracy vs Number of Estimators')
plt.show()

```

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Load the wine dataset
X, y = load_wine(return_X_y=True)

# Split data into training and testing sets
# Note: test_size=0.5 is a large test set, which might show more variance.
# For typical model training, a smaller test_size like 0.2 or 0.3 is common.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=42)

# Define the list of n_estimators to try, including integers from 1 to 40 (inclusive)
n_estimators_list = list(range(1, 41))
accuracies = []

# Train and test models with different n_estimators
for n in n_estimators_list:
    # train RandomForestClassifier with the given number of estimators and make predictions
    rf_clf = RandomForestClassifier(n_estimators=n, random_state=42)
    rf_clf.fit(X_train, y_train)
    y_pred = rf_clf.predict(X_test)
    
    accuracies.append(accuracy_score(y_test, y_pred))

# Plot the results
plt.figure(figsize=(10, 6)) # Optional: make the plot a bit larger
plt.plot(n_estimators_list, accuracies, marker='o', linestyle='-')
plt.xlabel('Number of Estimators')
plt.ylabel('Accuracy')
plt.title('Random Forest Accuracy vs Number of Estimators')
plt.grid(True) # Optional: add a grid for better readability
plt.xticks(n_estimators_list[::4]) # Show fewer x-axis ticks for cleaner plot
plt.tight_layout() # Adjust layout to prevent labels from overlapping
plt.show()
```

## Evaluate Random Forest Accuracy with Varying Depths