## Random Forest Classifier
Use Case: Classification problems (e.g., spam detection, stock movement: up/down)

Scaling: Not required (tree splits depend only on the ordering of feature values, so monotonic rescaling of a feature does not change them)

Loss/Score Functions: Accuracy, F1-score, ROC-AUC, Log Loss (probabilistic)

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score, log_loss
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification

# Generate sample data
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, random_state=42)

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Model
clf = RandomForestClassifier(
    n_estimators=100,        # number of trees
    max_depth=None,          # max depth of each tree (None = no limit)
    min_samples_split=2,     # min samples to split an internal node
    min_samples_leaf=1,      # min samples at a leaf
    max_features='sqrt',     # number of features to consider at each split
    random_state=42
)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
y_proba = clf.predict_proba(X_test)

# Evaluation
print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
print(f"F1 Score: {f1_score(y_test, y_pred):.4f}")
print(f"ROC AUC: {roc_auc_score(y_test, y_proba[:,1]):.4f}")
print(f"Log Loss: {log_loss(y_test, y_proba):.4f}")
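The claim that scaling is not required can be checked empirically. The following is a minimal sketch, assuming the same synthetic data as above: it fits two identically seeded forests, one on raw features and one on standardized features, and measures how often their predictions agree. The names `rf_raw`, `rf_scaled`, and `agreement` are illustrative. Agreement should be at or very near 1.0; small differences, if any, would come from floating-point rounding.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Forest fitted on the raw features
rf_raw = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_train, y_train)

# Identically configured forest fitted on standardized features
scaler = StandardScaler().fit(X_train)
rf_scaled = RandomForestClassifier(n_estimators=100, random_state=42).fit(
    scaler.transform(X_train), y_train
)

pred_raw = rf_raw.predict(X_test)
pred_scaled = rf_scaled.predict(scaler.transform(X_test))

# Fraction of test points on which the two forests agree
agreement = np.mean(pred_raw == pred_scaled)
print(f"Prediction agreement: {agreement:.4f}")
```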


## Random Forest Regressor
Use Case: Regression problems (e.g., stock price prediction, house prices)

Scaling:  Not required

Loss/Score Functions: MSE, RMSE, MAE, R²

In [None]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_regression

# Generate sample regression data
X, y = make_regression(n_samples=1000, n_features=10, noise=0.3, random_state=42)

# Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Model
reg = RandomForestRegressor(
    n_estimators=100,
    max_depth=None,
    min_samples_split=2,
    min_samples_leaf=1,
    max_features='sqrt',
    random_state=42
)
reg.fit(X_train, y_train)
y_pred = reg.predict(X_test)

# Evaluation
mse = mean_squared_error(y_test, y_pred)
rmse = mse ** 0.5
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"MSE: {mse:.4f}, RMSE: {rmse:.4f}, MAE: {mae:.4f}, R²: {r2:.4f}")
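To make the relationship between the four regression metrics concrete, here is a small sketch that computes them by hand with NumPy and compares the results with scikit-learn. The arrays `y_true` and `y_hat` are made-up values chosen only for illustration, not output from the model above.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Hypothetical targets and predictions, chosen only for illustration
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_hat = np.array([2.5, 0.0, 2.0, 8.0])

residuals = y_true - y_hat
mse_manual = np.mean(residuals ** 2)            # mean squared error
rmse_manual = np.sqrt(mse_manual)               # root of MSE, in the target's units
mae_manual = np.mean(np.abs(residuals))         # mean absolute error
ss_res = np.sum(residuals ** 2)                 # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
r2_manual = 1.0 - ss_res / ss_tot               # coefficient of determination

print(f"MSE: {mse_manual:.4f}, RMSE: {rmse_manual:.4f}, MAE: {mae_manual:.4f}, R²: {r2_manual:.4f}")

# The manual values should match scikit-learn's implementations
print(np.isclose(mse_manual, mean_squared_error(y_true, y_hat)))
print(np.isclose(mae_manual, mean_absolute_error(y_true, y_hat)))
print(np.isclose(r2_manual, r2_score(y_true, y_hat)))
```

RMSE and MAE are in the same units as the target, which usually makes them easier to interpret than MSE; R² is unitless and compares the model against always predicting the mean.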


## Key Parameters

| Parameter | Meaning |
|---|---|
| `n_estimators` | Number of trees in the forest |
| `max_depth` | Maximum depth of each tree |
| `min_samples_split` | Minimum samples required to split an internal node |
| `min_samples_leaf` | Minimum samples at a leaf node |
| `max_features` | Number of features to consider when looking for the best split |
| `random_state` | Ensures reproducibility |
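As a rough illustration of what these parameters control, the sketch below fits two small forests that differ only in `max_depth` and inspects the fitted trees. The dataset size and the variable names `shallow` and `deep` are assumptions made for this example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=10, n_informative=5, random_state=42)

# Two forests that differ only in max_depth
shallow = RandomForestClassifier(
    n_estimators=20, max_depth=3, max_features='sqrt', random_state=42
).fit(X, y)
deep = RandomForestClassifier(
    n_estimators=20, max_depth=None, max_features='sqrt', random_state=42
).fit(X, y)

# estimators_ holds the fitted trees, one per n_estimators
print("Trees in forest:", len(shallow.estimators_))

# get_depth() reports the actual depth each tree reached
shallow_depths = [tree.get_depth() for tree in shallow.estimators_]
deep_depths = [tree.get_depth() for tree in deep.estimators_]
print("Deepest tree with max_depth=3:   ", max(shallow_depths))
print("Deepest tree with max_depth=None:", max(deep_depths))

# max_features='sqrt' on 10 features resolves to int(sqrt(10)) = 3 features per split
features_per_split = shallow.estimators_[0].max_features_
print("Features considered per split:", features_per_split)
```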

## Random Forest Classifier – GridSearchCV

Parameters we're tuning:

- **`n_estimators`**: Number of trees
- **`max_depth`**: Maximum depth of each tree
- **`min_samples_split`**: Minimum number of samples required to split a node
- **`min_samples_leaf`**: Minimum number of samples at a leaf node

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Dataset
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Model
clf = RandomForestClassifier(random_state=42)

# Grid
param_grid = {
    'n_estimators': [50, 100],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5],
    'min_samples_leaf': [1, 2]
}

# Grid Search
grid_clf = GridSearchCV(clf, param_grid, cv=5, scoring='accuracy', n_jobs=-1)
grid_clf.fit(X_train, y_train)

# Best parameters
print("Best parameters:", grid_clf.best_params_)

# Final evaluation
y_pred = grid_clf.predict(X_test)
print(classification_report(y_test, y_pred))
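Beyond the single best combination, it can help to see how every candidate scored. The sketch below is one way to do that using the `cv_results_` attribute of a fitted `GridSearchCV`. It uses a deliberately small grid (`small_grid`) and a smaller dataset than the cell above so that it runs quickly; both are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=300, n_features=10, random_state=42)

# A deliberately small grid so the sketch runs quickly
small_grid = {'n_estimators': [10, 20], 'max_depth': [None, 5]}
search = GridSearchCV(
    RandomForestClassifier(random_state=42), small_grid, cv=3, scoring='accuracy'
)
search.fit(X, y)

results = search.cv_results_

# rank_test_score is 1 for the best candidate, so sorting by it lists best first
order = np.argsort(results['rank_test_score'])
for i in order:
    mean = results['mean_test_score'][i]
    std = results['std_test_score'][i]
    print(f"{mean:.4f} +/- {std:.4f}  {results['params'][i]}")
```

If the top candidates are within one standard deviation of each other, the differences between them may not be meaningful, and the simpler or cheaper setting is often a reasonable choice.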


## Random Forest Regressor – GridSearchCV

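Parameters we're tuning: the same four as in the classifier grid search above (`n_estimators`, `max_depth`, `min_samples_split`, `min_samples_leaf`), now applied to a regressor.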

In [None]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error

# Dataset
X, y = make_regression(n_samples=1000, n_features=10, noise=0.3, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Model
reg = RandomForestRegressor(random_state=42)

# Grid
param_grid = {
    'n_estimators': [50, 100],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5],
    'min_samples_leaf': [1, 2]
}

# Grid Search
grid_reg = GridSearchCV(reg, param_grid, cv=5, scoring='neg_mean_squared_error', n_jobs=-1)
grid_reg.fit(X_train, y_train)

# Best parameters
print("Best parameters:", grid_reg.best_params_)

# Final evaluation
y_pred = grid_reg.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
rmse = mse ** 0.5
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"MSE: {mse:.4f}, RMSE: {rmse:.4f}, MAE: {mae:.4f}, R²: {r2:.4f}")


### Hyperparameter Tuning with GridSearchCV for Random Forest Regressor

#### 1. **`param_grid = { ... }`**

This dictionary defines the **hyperparameters** and the **values you want to test** for each one during the tuning process. GridSearchCV will try all combinations of these values and return the best one.

- **`n_estimators`**: Number of trees in the random forest. We are testing values 50 and 100.
- **`max_depth`**: Maximum depth of each tree. We are testing `None` (no limit), 10, and 20.
- **`min_samples_split`**: Minimum number of samples required to split an internal node. We are testing values 2 and 5.
- **`min_samples_leaf`**: Minimum number of samples required to be at a leaf node. We are testing values 1 and 2.

This results in **3 × 2 × 2 × 2 = 24** different combinations of hyperparameters.
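The count can be verified with scikit-learn's `ParameterGrid`, which enumerates the same combinations that `GridSearchCV` iterates over. This is a small sketch; `combos` and `n_folds` are names introduced here for illustration.

```python
from sklearn.model_selection import ParameterGrid

param_grid = {
    'n_estimators': [50, 100],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5],
    'min_samples_leaf': [1, 2]
}

# Every combination of the listed values, as a list of dicts
combos = list(ParameterGrid(param_grid))
print("Combinations:", len(combos))

# With 5-fold cross-validation, each combination is fitted once per fold
n_folds = 5
total_fits = len(combos) * n_folds
print("Cross-validation fits:", total_fits)

# Peek at one candidate
print("First candidate:", combos[0])
```

With 24 combinations and 5 folds, the search performs 120 cross-validation fits, so the cost grows quickly as more values are added to the grid.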

#### 2. **Creating the GridSearchCV Object**

```python
grid_reg = GridSearchCV(reg, param_grid, cv=5, scoring='neg_mean_squared_error', n_jobs=-1)
```

- **`reg`**: The base model you want to tune, which is a `RandomForestRegressor`.
- **`param_grid`**: The dictionary from above, specifying the parameters to be tested.
- **`cv=5`**: We are using 5-fold cross-validation, meaning the training data is split into 5 parts and the model is trained and evaluated 5 times, with each part acting as the validation set once.
- **`scoring='neg_mean_squared_error'`**: The negative mean squared error (MSE) is used as the scoring function. It is negative because scikit-learn always maximizes the score, so maximizing negative MSE is the same as minimizing MSE.
- **`n_jobs=-1`**: Uses all available CPU cores to speed up the search.
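The sign convention is easy to see with `cross_val_score`, which accepts the same `scoring` string. The sketch below assumes a smaller dataset and forest than the cells above so that it runs quickly; `neg_mse` and `fold_mse` are illustrative names.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=300, n_features=10, noise=0.3, random_state=42)
reg = RandomForestRegressor(n_estimators=20, random_state=42)

# One score per fold; each is the negated MSE on that fold's held-out part
neg_mse = cross_val_score(reg, X, y, cv=5, scoring='neg_mean_squared_error')
print("Scores per fold:", neg_mse)

# Flip the sign to recover MSE, then take the root for RMSE
fold_mse = -neg_mse
print(f"Mean CV MSE:  {fold_mse.mean():.4f}")
print(f"Mean CV RMSE: {np.sqrt(fold_mse).mean():.4f}")
```

Every score is zero or negative, and a score closer to zero means a lower error.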

#### 3. **Fitting the GridSearchCV**

```python
grid_reg.fit(X_train, y_train)
```

This runs the grid search to find the best combination of parameters.

- It tries all 24 combinations of the hyperparameters in `param_grid`.
- Each combination is trained and evaluated with 5-fold cross-validation, for 24 × 5 = 120 fits in total.
- The grid search then picks the combination with the highest mean negative MSE, that is, the lowest mean MSE across the folds.

#### 4. **Accessing the Results**

You can retrieve the best-performing model and its parameters.

Best parameters:

```python
grid_reg.best_params_      # Best hyperparameter combination found
```

Best score (negative MSE):

```python
grid_reg.best_score_       # Mean negative MSE across the folds for the best combination
```

Best estimator (trained model with the best parameters):

```python
grid_reg.best_estimator_   # Model refit on the full training set with the best combination
```
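The three attributes can be seen together in the following sketch. It again uses a small grid and dataset (both assumptions for illustration) and relies on the default `refit=True`, under which `GridSearchCV` refits the best combination on all the data passed to `fit`.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=300, n_features=10, noise=0.3, random_state=42)

# A deliberately small grid so the sketch runs quickly
small_grid = {'n_estimators': [10, 20], 'min_samples_leaf': [1, 2]}
grid = GridSearchCV(
    RandomForestRegressor(random_state=42), small_grid,
    cv=3, scoring='neg_mean_squared_error'
)
grid.fit(X, y)

# best_score_ is the highest mean negated MSE, so negate it to read it as an MSE
cv_mse = -grid.best_score_
print("Best params:", grid.best_params_)
print(f"CV MSE: {cv_mse:.4f}, CV RMSE: {np.sqrt(cv_mse):.4f}")

# grid.predict delegates to best_estimator_, so the two should give identical output
same = np.allclose(grid.predict(X), grid.best_estimator_.predict(X))
print("grid.predict matches best_estimator_.predict:", same)
```

Note that `best_score_` is a cross-validation estimate on the training data; the held-out test set remains the fairer measure of final performance.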
#### 5. **Summary**

- `param_grid` specifies the values to test for each hyperparameter.
- `GridSearchCV` performs an exhaustive search over every combination in the grid.
- The grid search uses 5-fold cross-validation and evaluates the models based on negative mean squared error.
- After fitting, the best-performing model and its hyperparameters can be retrieved using `best_params_`, `best_score_`, and `best_estimator_`.

