# Random Forest

## Problem Type
**Random Forest** is a **supervised** learning method used for:
- **Classification** problems
- **Regression** problems

### How Random Forest Works
- **Ensemble method:** Combines multiple decision trees (typically trained with the "bagging" method) to improve the model’s robustness and accuracy.
- **Bootstrap Aggregation (Bagging):** 
  - Creates multiple bootstrap samples of the training data by random sampling with replacement; each sample is usually the same size as the original training set.
  - Trains each decision tree on a different bootstrap sample.
- **Random feature selection:**
  - At each split in a tree, a random subset of features is considered.
  - Helps in reducing the correlation between individual trees.
- **Aggregation of results:**
  - **Classification:** Takes a majority vote from all trees to make the final prediction.
  - **Regression:** Averages the predictions of all trees.
- **Reduces overfitting:** Due to the averaging of multiple trees, Random Forest tends to generalize better than a single decision tree.
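
The steps above can be sketched by hand with plain decision trees. This is a minimal illustration, not how any particular library implements the algorithm: the wine dataset, the choice of 25 trees, and all variable names are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

rng = np.random.default_rng(42)
n_trees = 25  # illustrative value, not a recommendation
trees = []
for _ in range(n_trees):
    # Bagging: draw row indices with replacement (a bootstrap sample).
    idx = rng.integers(0, len(X_train), size=len(X_train))
    # Random feature selection: max_features="sqrt" makes every split
    # consider only a random subset of the features.
    tree = DecisionTreeClassifier(
        max_features="sqrt", random_state=int(rng.integers(0, 2**31 - 1))
    )
    tree.fit(X_train[idx], y_train[idx])
    trees.append(tree)

# Aggregation: one row of predictions per tree, then a majority vote per sample.
all_preds = np.stack([t.predict(X_test) for t in trees]).astype(int)
n_classes = len(np.unique(y))
votes = np.apply_along_axis(np.bincount, 0, all_preds, minlength=n_classes)
y_vote = votes.argmax(axis=0)

manual_accuracy = float((y_vote == y_test).mean())
print(f"Manual bagging accuracy: {manual_accuracy:.2f}")
```

Note that scikit-learn's `RandomForestClassifier` averages the trees' predicted class probabilities instead of counting hard votes, so its predictions can differ slightly from this sketch.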

### Key Tuning Parameters
- **`n_estimators`:**
  - **Description:** Number of trees in the forest.
  - **Impact:** More trees generally improve performance but increase computational cost.
  - **Default:** `100` trees.
- **`max_depth`:**
  - **Description:** Maximum depth of the trees.
  - **Impact:** Deeper trees can capture more detail but risk overfitting; shallower trees are less prone to overfitting.
  - **Default:** `None` (nodes are expanded until all leaves are pure or contain fewer than `min_samples_split` samples).
- **`min_samples_split`:**
  - **Description:** Minimum number of samples required to split an internal node.
  - **Impact:** Higher values prevent the model from learning overly specific patterns, reducing overfitting.
  - **Default:** `2`.
- **`min_samples_leaf`:**
  - **Description:** Minimum number of samples required to be at a leaf node.
  - **Impact:** Larger values create smoother models and help prevent overfitting.
  - **Default:** `1`.
- **`max_features`:**
  - **Description:** Number of features to consider when looking for the best split.
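  - **Impact:** Smaller subsets of features reduce correlation between trees, which often improves generalization; `sqrt(n_features)` is a common choice for classification.
  - **Default:** `'sqrt'` for `RandomForestClassifier` and `1.0` (all features) for `RandomForestRegressor` in recent scikit-learn releases; the older `'auto'` value was deprecated and later removed.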
- **`bootstrap`:**
  - **Description:** Whether bootstrap samples are used when building trees.
  - **Impact:** `True` enables bagging, which is crucial for Random Forest’s performance.
  - **Default:** `True`.
- **`oob_score`:**
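  - **Description:** Whether to use out-of-bag samples (the training rows a given tree did not see in its bootstrap sample) to estimate generalization performance. Requires `bootstrap=True`.
  - **Impact:** Provides a built-in validation estimate, similar in spirit to cross-validation, without needing a separate validation set.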
  - **Default:** `False`.
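
One common way to choose values for these parameters is a cross-validated grid search. The sketch below uses scikit-learn's `GridSearchCV` on the wine dataset; the candidate values in `param_grid`, the 3-fold setting, and the fixed `n_estimators=50` are assumptions picked to keep the example fast, not tuned recommendations.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_wine(return_X_y=True)

# Illustrative grid: 2 x 2 x 2 = 8 candidate settings.
param_grid = {
    "max_depth": [None, 5],
    "min_samples_leaf": [1, 3],
    "max_features": ["sqrt", 0.5],
}

search = GridSearchCV(
    RandomForestClassifier(n_estimators=50, random_state=42),
    param_grid,
    cv=3,                # 3-fold stratified cross-validation for classifiers
    scoring="accuracy",
)
search.fit(X, y)

print("Best parameters:", search.best_params_)
print(f"Best cross-validated accuracy: {search.best_score_:.2f}")
```

For larger grids, `RandomizedSearchCV` samples a fixed number of settings instead of trying every combination, which usually keeps the cost manageable.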

### Pros vs Cons

| Pros                                                  | Cons                                                   |
|-------------------------------------------------------|--------------------------------------------------------|
| Reduces overfitting by averaging multiple trees       | Computationally expensive, especially with many trees  |
| Can handle both numerical and categorical data (scikit-learn requires categorical features to be encoded numerically) | Harder to interpret than a single decision tree |
| Robust to outliers and noise                          | Can still overfit if trees are too deep and `n_estimators` is too low |
| Some implementations handle missing values natively; others need imputation first | Requires more memory and storage than single decision trees |
| Good performance with default parameters              | Slower to predict compared to simpler models           |
| Provides feature importance ranking                   | Prediction latency grows with the number and depth of trees, which can be a problem in real-time systems |
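
The first row of the table can be checked empirically by comparing a single decision tree with a forest under the same cross-validation splits. This is a rough sketch on the wine dataset; the 5-fold setting and the variable names are assumptions, and the size of the gap will vary with the data.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True)

# Same data and folds for both models, so the scores are comparable.
tree_scores = cross_val_score(DecisionTreeClassifier(random_state=42), X, y, cv=5)
forest_scores = cross_val_score(
    RandomForestClassifier(n_estimators=100, random_state=42), X, y, cv=5
)

print(f"Single tree:   {tree_scores.mean():.3f} +/- {tree_scores.std():.3f}")
print(f"Random forest: {forest_scores.mean():.3f} +/- {forest_scores.std():.3f}")
```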

### Evaluation Metrics
- **Accuracy (Classification):**
  - **Description:** Ratio of correct predictions to total predictions.
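  - **Good Value:** Higher is better, but what counts as good depends on the problem; compare against a baseline such as always predicting the most frequent class.
  - **Bad Value:** Accuracy at or below that baseline suggests the model has learned little; accuracy can also be misleading on imbalanced datasets.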
- **Precision (Classification):**
  - **Description:** Proportion of positive identifications that were actually correct.
  - **Good Value:** Higher values indicate fewer false positives, especially important in imbalanced datasets.
  - **Bad Value:** Low values suggest many false positives.
- **Recall (Classification):**
  - **Description:** Proportion of actual positives that were correctly identified.
  - **Good Value:** Higher values indicate fewer false negatives, crucial when missing positive cases is costly.
  - **Bad Value:** Low values suggest many false negatives.
- **F1 Score (Classification):**
  - **Description:** Harmonic mean of Precision and Recall.
  - **Good Value:** Higher values indicate a good balance between Precision and Recall.
  - **Bad Value:** Low values mean that Precision, Recall, or both are low, i.e. the model produces many false positives, many false negatives, or both.
- **Mean Squared Error (MSE) (Regression):**
  - **Description:** Average of the squared differences between predicted and actual values.
  - **Good Value:** Lower values are better, indicating a closer fit to the data.
  - **Bad Value:** Higher values indicate poor prediction accuracy.
- **R-squared (R²) (Regression):**
  - **Description:** Proportion of variance in the dependent variable predictable from the independent variables.
  - **Good Value:** Closer to 1 is better, indicating a strong explanatory power.
  - **Bad Value:** Values close to 0 suggest the model explains little of the variance; negative values mean it performs worse than always predicting the mean.
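- **Feature Importance:**
  - **Description:** Measures how much each feature contributes to the model’s predictions. It is a diagnostic for interpreting the model rather than a performance metric. The impurity-based values in scikit-learn’s `feature_importances_` sum to 1 and can be biased toward high-cardinality features.
  - **Good Value:** High values for relevant features indicate the model relies on significant predictors.
  - **Bad Value:** High values for irrelevant features suggest the model might be overfitting or picking up spurious patterns.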
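
To make the definitions above concrete, the following sketch computes each metric on tiny hand-made arrays where the expected values can be checked by counting. The labels and predictions are invented for illustration only.

```python
from sklearn.metrics import (
    accuracy_score,
    f1_score,
    mean_squared_error,
    precision_score,
    r2_score,
    recall_score,
)

# Toy binary classification: 4 positives, 6 negatives (made-up data).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
# Confusion counts: TP=3, FN=1, FP=1, TN=5

accuracy = accuracy_score(y_true, y_pred)    # (3 + 5) / 10 = 0.8
precision = precision_score(y_true, y_pred)  # 3 / (3 + 1) = 0.75
recall = recall_score(y_true, y_pred)        # 3 / (3 + 1) = 0.75
f1 = f1_score(y_true, y_pred)                # harmonic mean of 0.75 and 0.75 = 0.75

# Baseline for comparison: always predict the majority class (0).
baseline_accuracy = accuracy_score(y_true, [0] * len(y_true))  # 6 / 10 = 0.6

# Toy regression (made-up data).
y_true_reg = [3.0, 5.0, 7.0, 9.0]
y_pred_reg = [2.0, 5.0, 8.0, 9.0]
mse = mean_squared_error(y_true_reg, y_pred_reg)  # (1 + 0 + 1 + 0) / 4 = 0.5
r2 = r2_score(y_true_reg, y_pred_reg)             # 1 - 2 / 20 = 0.9

print(accuracy, precision, recall, f1, baseline_accuracy, mse, r2)
```

The majority-class baseline already reaches 0.6 accuracy here, which is why an accuracy figure is easier to judge next to a baseline than on its own.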



In [None]:
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.tree import plot_tree
import matplotlib.pyplot as plt

In [None]:
# Load the wine dataset
wine = datasets.load_wine()

# Separate features and target variable
X = wine.data  # Features
y = wine.target  # Target labels (wine type)

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

In [None]:
# Create and train a Random Forest Classifier
model = RandomForestClassifier(
    n_estimators=100,  # with very few trees, some samples may never be out-of-bag
    max_depth=None,
    min_samples_split=2,
    min_samples_leaf=2,
    max_features='sqrt',
    bootstrap=True,
    oob_score=True,
    random_state=42,
)
model.fit(X_train, y_train)

# Make predictions and evaluate accuracy
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred, target_names=wine.target_names)

print(f'Accuracy: {accuracy:.2f}')
print(f'Out-of-bag score: {model.oob_score_:.2f}')
print('Classification Report:')
print(report)

In [None]:
importances = model.feature_importances_

# Print impurity-based feature importances, most important first
for feature, importance in sorted(
    zip(wine.feature_names, importances), key=lambda pair: pair[1], reverse=True
):
    print(f"{feature}: {importance:.2f}")
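
Impurity-based importances are computed from the training data, so a complementary check is permutation importance on held-out data, which measures how much the test score drops when one feature is shuffled. The sketch below is self-contained and uses scikit-learn's `permutation_importance`; the name `forest`, the 100 trees, and `n_repeats=10` are assumptions for the example.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

wine = load_wine()
X_train, X_test, y_train, y_test = train_test_split(
    wine.data, wine.target, test_size=0.2, random_state=42
)

forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(X_train, y_train)

# Shuffle each feature 10 times and record the drop in test accuracy.
result = permutation_importance(
    forest, X_test, y_test, n_repeats=10, random_state=42
)

# Report features from largest to smallest mean drop.
for i in result.importances_mean.argsort()[::-1]:
    name = wine.feature_names[i]
    mean = result.importances_mean[i]
    std = result.importances_std[i]
    print(f"{name}: {mean:.3f} +/- {std:.3f}")
```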

In [None]:
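# Visualize the first tree in the forest
plt.figure(figsize=(15, 10))
plot_tree(
    model.estimators_[0],
    feature_names=wine.feature_names,
    class_names=list(wine.target_names),
    filled=True,
    rounded=True,
)
plt.show()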