
# 🌲 **Random Forest in Machine Learning**

**Random Forest** is an **ensemble learning algorithm** that uses **many decision trees** to make better predictions.

Think of it like this:

➡️ **1 tree** = can make mistakes
➡️ **100 trees** = combine their decisions → **more accurate**, **more stable**

---

## ⭐ **How Random Forest Works**

1. It creates **many decision trees**.
2. Each tree is trained on a **random sample of rows** (drawn with replacement) and considers a **random subset of features** at each split.
3. All trees give their predictions.
4. Random Forest takes a **majority vote** (classification) or **average** (regression).
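
The four steps above can be sketched by hand with plain decision trees. This is a minimal illustration, not how any particular library implements the algorithm: the synthetic dataset, the tree count of 25, and the variable names are all assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (illustrative only)
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

rng = np.random.default_rng(0)
n_trees = 25  # odd number, so a 0/1 vote can never tie
trees = []

for i in range(n_trees):
    # Steps 1-2: draw a bootstrap sample of rows (with replacement) ...
    rows = rng.integers(0, len(X), size=len(X))
    # ... and let the tree consider a random subset of features at each split
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=i)
    tree.fit(X[rows], y[rows])
    trees.append(tree)

# Step 3: every tree predicts for every sample -> shape (n_trees, n_samples)
all_preds = np.array([tree.predict(X) for tree in trees])

# Step 4: majority vote (labels are 0/1, so the mean is the share of trees voting 1)
forest_pred = (all_preds.mean(axis=0) > 0.5).astype(int)

print("Shape of per-tree predictions:", all_preds.shape)
print("Training accuracy of the voted prediction:", (forest_pred == y).mean())
```

For regression, step 4 would average the per-tree predictions instead of voting. Note that scikit-learn's `RandomForestClassifier` averages the trees' predicted class probabilities rather than counting hard votes, so its output can differ slightly from a strict majority vote.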

---

## 🎯 **Why Random Forest is Good**

* ✔️ Often high accuracy with little tuning
* ✔️ Reduces **overfitting** compared with a single decision tree
* ✔️ Works well on big datasets
* ✔️ Can handle missing values in some implementations (others need imputation first)
* ✔️ Works for both **classification** & **regression**
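
One way to see the overfitting point is to compare train and test accuracy for a single fully grown tree and for a forest. The sketch below uses a synthetic, deliberately noisy dataset (an assumption for illustration); the exact numbers depend on the data and the seed, so treat the result as indicative only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with 10% label noise (illustrative only)
X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=5, flip_y=0.1, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# A single unconstrained tree versus a forest of 100 trees
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

# Store (train accuracy, test accuracy) for each model
scores = {
    "single tree": (tree.score(X_train, y_train), tree.score(X_test, y_test)),
    "random forest": (forest.score(X_train, y_train), forest.score(X_test, y_test)),
}
for name, (train_acc, test_acc) in scores.items():
    print(f"{name}: train={train_acc:.3f} test={test_acc:.3f}")
```

A large gap between train and test accuracy is the usual sign of overfitting. The single tree memorises the training set, while the forest typically gives up little on the training side and gains on the test side.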

---

## 🔥 **Where It Is Used**

* Fraud detection
* Loan approval
* Medical diagnosis
* Stock prediction
* Customer churn prediction

---

## 🟢 **Example (Simple)**

If 1 tree says **spam**, another says **not spam**, but
**80 out of 100 trees say spam → final answer = spam**.
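
The same vote can be written in a few lines of Python. The list of votes is hypothetical and simply mirrors the 80-versus-20 split described above.

```python
from collections import Counter

# Hypothetical votes from 100 trees: 80 say "spam", 20 say "not spam"
votes = ["spam"] * 80 + ["not spam"] * 20

# Count the votes and pick the most common label
counts = Counter(votes)
final_answer, n_votes = counts.most_common(1)[0]

print("Final answer:", final_answer, f"({n_votes} of {len(votes)} votes)")
# prints: Final answer: spam (80 of 100 votes)
```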



In [203]:
# import libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.metrics import classification_report, confusion_matrix, mean_squared_error, mean_absolute_error, r2_score, accuracy_score

In [204]:
# load dataset
df = sns.load_dataset('tips')
df.head()

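|   | total_bill | tip | sex | smoker | day | time | size |
|---|---|---|---|---|---|---|---|
| 0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
| 1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 |
| 2 | 21.01 | 3.50 | Male | No | Sun | Dinner | 3 |
| 3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 |
| 4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 |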


In [205]:
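# encode categorical columns (object or category dtype) as integers
le = LabelEncoder()

# pass both dtypes as a list; 'object' and 'category' would evaluate to just 'category'
for col in df.select_dtypes(include=['object', 'category']).columns:
    df[col] = le.fit_transform(df[col])  # refit per column; labels are coded in sorted order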

In [206]:
df.head()

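|   | total_bill | tip | sex | smoker | day | time | size |
|---|---|---|---|---|---|---|---|
| 0 | 16.99 | 1.01 | 0 | 0 | 2 | 0 | 2 |
| 1 | 10.34 | 1.66 | 1 | 0 | 2 | 0 | 3 |
| 2 | 21.01 | 3.50 | 1 | 0 | 2 | 0 | 3 |
| 3 | 23.68 | 3.31 | 1 | 0 | 2 | 0 | 2 |
| 4 | 24.59 | 3.61 | 0 | 0 | 2 | 0 | 4 |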
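
To read the integer codes in the output above, it helps to know how `LabelEncoder` assigns them: it sorts the distinct labels and uses each label's position as its code. The sketch below shows this on a small hand-written sample (the `days` series is an assumption for illustration, not the full column).

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Small illustrative sample of day labels
days = pd.Series(["Sun", "Sat", "Thur", "Fri", "Sun"])

encoder = LabelEncoder()
codes = encoder.fit_transform(days)

# classes_ is sorted, and each class's position is its integer code
mapping = {label: code for code, label in enumerate(encoder.classes_)}

print(mapping)                           # {'Fri': 0, 'Sat': 1, 'Sun': 2, 'Thur': 3}
print(codes)                             # [2 1 3 0 2]
print(encoder.inverse_transform(codes))  # back to the original labels
```

This is why `Sun` appears as `2` in the encoded table. Because the codes imply an order that the categories do not really have, one-hot encoding is a common alternative for nominal features, although tree-based models are usually tolerant of integer codes.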


In [207]:
# split the data into X and y for classification
X = df.drop('sex', axis=1)
y = df['sex']

# split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# create the model
model = RandomForestClassifier(
    random_state=42,
    n_estimators=100,
    criterion='entropy',
    max_depth=5,
    min_samples_split=4,
    min_samples_leaf=2
)

# fit the model on the training set
model.fit(X_train, y_train)

# predict on the test set
y_pred = model.predict(X_test)

# evaluate the model
print("Accuracy:", accuracy_score(y_test, y_pred))
print('--------------------------------')
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
print('--------------------------------')
print("\nClassification Report:\n", classification_report(y_test, y_pred))
print('--------------------------------')
# note: with 0/1 labels, MAE and MSE both equal the misclassification rate (1 - accuracy)
print("Mean Absolute Error:", mean_absolute_error(y_test, y_pred))
print('--------------------------------')
print("Mean Squared Error:", mean_squared_error(y_test, y_pred))


Accuracy: 0.673469387755102
--------------------------------
Confusion Matrix:
 [[ 6 13]
 [ 3 27]]
--------------------------------

Classification Report:
               precision    recall  f1-score   support

           0       0.67      0.32      0.43        19
           1       0.68      0.90      0.77        30

    accuracy                           0.67        49
   macro avg       0.67      0.61      0.60        49
weighted avg       0.67      0.67      0.64        49

--------------------------------
Mean Absolute Error: 0.32653061224489793
--------------------------------
Mean Squared Error: 0.32653061224489793
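
The figures in the classification report can be recomputed directly from the confusion matrix printed above. The sketch below assumes scikit-learn's convention that rows are true classes and columns are predicted classes.

```python
import numpy as np

# Confusion matrix from the output above: rows = true class, columns = predicted class
cm = np.array([[6, 13],
               [3, 27]])

accuracy = np.trace(cm) / cm.sum()          # correct predictions / all predictions
precision = np.diag(cm) / cm.sum(axis=0)    # correct / everything predicted as that class
recall = np.diag(cm) / cm.sum(axis=1)       # correct / everything truly in that class
f1 = 2 * precision * recall / (precision + recall)

print("accuracy :", round(float(accuracy), 2))
print("precision:", precision.round(2))
print("recall   :", recall.round(2))
print("f1       :", f1.round(2))
```

The low recall for class 0 (6 of 19) shows that the model labels most test rows as class 1, so the headline accuracy of about 0.67 is only slightly above what always predicting class 1 would give (30 of 49, about 0.61).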


# **`Use random forest for regression problem`**

In [209]:
# split the data into X and y for regression on the 'tip' column
X = df.drop('tip', axis=1)
y = df['tip']

# split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# create the model (a regressor this time, not a classifier)
# RandomForestRegressor was already imported in the first cell

model = RandomForestRegressor(
    random_state=42,
    n_estimators=100,
    criterion='squared_error',   # correct for regression
    max_depth=5,
    min_samples_split=4,
    min_samples_leaf=2
)

# fit the model
model.fit(X_train, y_train)

# predict on the test set
y_pred = model.predict(X_test)

# evaluate the model
print("R2 Score:", r2_score(y_test, y_pred))
print('--------------------------------')
print("Mean Absolute Error:", mean_absolute_error(y_test, y_pred))
print('--------------------------------')
print("Mean Squared Error:", mean_squared_error(y_test, y_pred))


R2 Score: 0.25830303523883347
--------------------------------
Mean Absolute Error: 0.7542516374630264
--------------------------------
Mean Squared Error: 0.9270999528272701
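
To make the three regression metrics above concrete, here is how each is computed by hand. The four values in `y_true` and `y_hat` are made up for illustration and are not taken from the tips data; only the final line uses the MSE reported above.

```python
import numpy as np

# Small made-up values for illustration only (not from the tips data)
y_true = np.array([3.0, 2.0, 4.0, 1.0])
y_hat = np.array([2.5, 2.0, 3.0, 2.0])

errors = y_true - y_hat

mae = np.abs(errors).mean()      # average absolute error
mse = (errors ** 2).mean()       # average squared error
rmse = np.sqrt(mse)              # same units as the target
# R2 = 1 - (residual sum of squares / total sum of squares around the mean)
r2 = 1 - (errors ** 2).sum() / ((y_true - y_true.mean()) ** 2).sum()

print("MAE:", mae)    # 0.625
print("MSE:", mse)    # 0.5625
print("RMSE:", rmse)  # 0.75
print("R2:", r2)

# RMSE for the model above, from its reported MSE
print("Model RMSE:", np.sqrt(0.9270999528272701))
```

Taking the square root of the reported MSE gives an RMSE of roughly 0.96, in the same units as `tip`. An R2 of about 0.26 means the model explains roughly a quarter of the variance in tips on the test set.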

