## Random Forest Hyperparameters

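| Parameter | Meaning | Effect |
|-----------|---------|--------|
| **n_estimators** | Number of trees | More trees ↓ variance (with diminishing returns) but ↑ training time |
| **max_depth** | Maximum tree depth | Shallower trees limit overfitting; `None` lets trees grow until the other stopping rules apply |
| **min_samples_split** | Minimum samples required to split a node | Regularization: larger values give simpler trees |
| **min_samples_leaf** | Minimum samples required in a leaf | Smooths predictions |
| **max_features** | Features considered per split | Smaller values reduce the correlation between trees |
| **bootstrap** | Sample rows with replacement for each tree | Optional (on by default in scikit-learn); required for `oob_score` |
| **oob_score** | Out-of-bag validation | Accuracy estimate without a separate validation set |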
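
The effects listed for `n_estimators` and `max_features` can be made concrete with the standard result for the variance of an average of `B` identically distributed estimates, each with variance `sigma2` and pairwise correlation `rho`: `rho * sigma2 + (1 - rho) * sigma2 / B`. The sketch below only evaluates that formula; the values of `sigma2` and `rho` are illustrative assumptions, not measurements from a real forest, and `ensemble_variance` is a hypothetical helper name.

```python
def ensemble_variance(sigma2: float, rho: float, n_trees: int) -> float:
    """Variance of the mean of n_trees equally correlated estimates."""
    return rho * sigma2 + (1.0 - rho) * sigma2 / n_trees


sigma2 = 1.0  # assumed variance of a single tree's prediction
for rho in (0.8, 0.3):  # assumed pairwise correlation between trees
    for n_trees in (1, 10, 100, 1000):
        var = ensemble_variance(sigma2, rho, n_trees)
        print(f"rho={rho}, n_trees={n_trees:4d}: variance={var:.3f}")
```

Adding trees shrinks only the second term, so the variance levels off at `rho * sigma2`. Lowering `max_features` is one way to reduce `rho` itself.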

### Typical Defaults
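- **Classification:** `max_features = sqrt(p)`, where `p` is the number of features
- **Regression:** `max_features = p/3` (the classic rule of thumb; scikit-learn's `RandomForestRegressor` instead defaults to using all `p` features)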
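
For a sense of scale, the two rules of thumb can be evaluated for a dataset with 30 features, which is the size of the breast cancer dataset used in the classification example. This is a minimal sketch; truncating to a whole number of features (with a floor of 1) is an assumption about how the rule is applied in practice.

```python
import math

p = 30  # number of input features

# Classification rule of thumb: sqrt(p), truncated to a whole number of features
max_features_clf = max(1, int(math.sqrt(p)))

# Regression rule of thumb: p/3, truncated the same way
max_features_reg = max(1, p // 3)

print("Classification:", max_features_clf)  # 5
print("Regression:", max_features_reg)      # 10
```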

---

## Random Forest vs Decision Tree (Summary Table)

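| Aspect | Decision Tree (DT) | Random Forest (RF) |
|--------|-------------------|-------------------|
| **Accuracy** | Medium | Usually higher |
| **Overfitting** | High risk when the tree is grown deep | Lower, because averaging many trees reduces variance |
| **Stability** | Poor: small changes in the data can change the tree | Much better: individual tree changes average out |
| **Interpretability** | High (a single tree can be read directly) | Low (many trees to inspect) |
| **Training Speed** | Fast | Slower (many trees to fit) |
| **Scaling Needed** | No | No |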
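
One way to check the accuracy and stability rows on real data is to cross-validate both models on the same dataset. The sketch below assumes the breast cancer dataset and 5-fold cross-validation; the model settings are illustrative choices, and the scores will vary with the dataset and scikit-learn version. No feature scaling is applied to either model.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

models = {
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

results = {}
for name, model in models.items():
    # Accuracy on each of the 5 held-out folds
    scores = cross_val_score(model, X, y, cv=5)
    results[name] = (scores.mean(), scores.std())
    print(f"{name}: mean accuracy = {scores.mean():.3f}, std = {scores.std():.3f}")
```

The mean reflects accuracy, and the standard deviation across folds gives a rough view of stability.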


# Python Code: Classification (sklearn)

In [23]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
from sklearn.datasets import load_breast_cancer

# Load dataset
X, y = load_breast_cancer(return_X_y=True)

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Random Forest Model
rf = RandomForestClassifier(
    n_estimators=200,
    max_depth=None,
    max_features='sqrt',
    oob_score=True,
    random_state=42
)

# Train model
rf.fit(X_train, y_train)

# Predictions
y_pred = rf.predict(X_test)

# Evaluation
print("Accuracy:", accuracy_score(y_test, y_pred))
print("OOB Score:", rf.oob_score_)
print("\nClassification Report:\n")
print(classification_report(y_test, y_pred))


Accuracy: 0.9649122807017544
OOB Score: 0.9604395604395605

Classification Report:

              precision    recall  f1-score   support

           0       0.98      0.93      0.95        43
           1       0.96      0.99      0.97        71

    accuracy                           0.96       114
   macro avg       0.97      0.96      0.96       114
weighted avg       0.97      0.96      0.96       114
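
The numbers in the report can be traced back to a confusion matrix. The counts below are inferred from the printed supports (43 and 71), the rounded recalls, and the accuracy of 110/114; the notebook never prints the matrix itself, so treat them as a reconstruction rather than recorded output.

```python
# Inferred counts (assumption): rows are the true class, columns the prediction
c00, c01 = 40, 3   # true class 0 (support 43): predicted 0, predicted 1
c10, c11 = 1, 70   # true class 1 (support 71): predicted 0, predicted 1

total = c00 + c01 + c10 + c11
accuracy = (c00 + c11) / total

precision_0 = c00 / (c00 + c10)  # share of "predicted 0" that is truly 0
recall_0 = c00 / (c00 + c01)     # share of true 0 that was found
precision_1 = c11 / (c11 + c01)
recall_1 = c11 / (c11 + c10)


def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


print(f"accuracy = {accuracy:.4f}")
print(f"class 0: precision={precision_0:.2f} recall={recall_0:.2f} f1={f1(precision_0, recall_0):.2f}")
print(f"class 1: precision={precision_1:.2f} recall={recall_1:.2f} f1={f1(precision_1, recall_1):.2f}")
```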



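# Visualization

# Load and prepare data

A `Pipeline` combines preprocessing and model training into a single workflow, which helps prevent data leakage because the preprocessing steps are fitted on the training data only.
A `ColumnTransformer` handles the two kinds of columns separately: one preprocessing branch for the numeric features and one for the categorical features.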
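
To see why fitting the preprocessing on the training rows matters, here is a minimal sketch on toy data. The values are hypothetical and do not come from `plan_purchase.csv`. The test row has a missing `Age` and a `PlanType` that never appeared during fitting.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy data (hypothetical values)
train = pd.DataFrame({
    "Age": [25.0, np.nan, 45.0],
    "PlanType": ["Basic", "Premium", "Basic"],
})
test = pd.DataFrame({
    "Age": [np.nan],       # missing value to be imputed
    "PlanType": ["Gold"],  # category never seen during fit
})

preprocessor = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), ["Age"]),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ]), ["PlanType"]),
])

preprocessor.fit(train)  # statistics are learned from the training rows only
out = preprocessor.transform(test)
# The result may be sparse or dense depending on its density; normalise to dense
out = out.toarray() if hasattr(out, "toarray") else np.asarray(out)
print(out)
```

The missing `Age` is filled with the training median (35.0), and the unseen plan is encoded as all zeros because of `handle_unknown="ignore"`. Nothing about the test row influences the fitted statistics.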

In [24]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.ensemble import RandomForestClassifier


df = pd.read_csv('plan_purchase.csv')
print(df.head(5))
X = df.drop("Purchase", axis=1)
y = df['Purchase'].map({"No": 0, "Yes": 1})  

categorical_features = X.select_dtypes(include='object').columns
numeric_features = X.select_dtypes(exclude='object').columns
print("Categorical Features:", list(categorical_features))
print("Numerical Features:", list(numeric_features))

numerical_pipe = Pipeline([
    ('impute', SimpleImputer(strategy='median'))
])

categorical_pipe = Pipeline([
    ('impute', SimpleImputer(strategy='most_frequent')),
    ('encode', OneHotEncoder(handle_unknown='ignore'))
])

preprocessor = ColumnTransformer([
    ('num', numerical_pipe, numeric_features),
    ('cat', categorical_pipe, categorical_features)
])


# Full pipeline: preprocessing + model
pipeline = Pipeline([
    ('preprocessing', preprocessor),
    ('model', RandomForestClassifier(
        n_estimators=100,
        random_state=42,
        oob_score=True
    ))
])

# Inspect columns and basic info
print('Columns:', df.columns.tolist())
print('\nInfo:')
print(df.info())  # df.info() prints its own summary and returns None, so this also prints "None"
print("\nDescribe:")
display(df.describe())

   Age  MonthlyIncome  PlanType  UsageScore Purchase
0   56          81476  Standard          90      Yes
1   46          64811  Standard          92      Yes
2   32          56208     Basic          71      Yes
3   25          40150   Premium          82      Yes
4   38          63286  Standard          34       No
Categorical Features: ['PlanType']
Numerical Features: ['Age', 'MonthlyIncome', 'UsageScore']
Columns: ['Age', 'MonthlyIncome', 'PlanType', 'UsageScore', 'Purchase']

Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Data columns (total 5 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   Age            500 non-null    int64 
 1   MonthlyIncome  500 non-null    int64 
 2   PlanType       500 non-null    object
 3   UsageScore     500 non-null    int64 
 4   Purchase       500 non-null    object
dtypes: int64(3), object(2)
memory usage: 19.7+ KB
None

Describe:


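| | Age | MonthlyIncome | UsageScore |
|-------|-----|---------------|------------|
| count | 500.0 | 500.0 | 500.0 |
| mean | 39.326 | 52753.62 | 60.082 |
| std | 12.200386 | 20181.171598 | 19.938967 |
| min | 18.0 | 20055.0 | 0.0 |
| 25% | 29.0 | 35309.5 | 46.0 |
| 50% | 41.0 | 52286.0 | 61.0 |
| 75% | 50.0 | 70364.25 | 75.0 |
| max | 59.0 | 89896.0 | 100.0 |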


In [25]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.3,
    random_state=42,
    stratify=y
)

print("Train class distribution:")
print(y_train.value_counts(normalize=True))

print("\nTest class distribution:")
print(y_test.value_counts(normalize=True))



Train class distribution:
Purchase
0    0.562857
1    0.437143
Name: proportion, dtype: float64

Test class distribution:
Purchase
0    0.566667
1    0.433333
Name: proportion, dtype: float64
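
The printed proportions show what `stratify=y` does: both splits keep nearly the same class balance as the full dataset. The counts below are inferred from those proportions and the 500-row total (350 training rows, 150 test rows); the notebook prints proportions only, so the counts are an assumption.

```python
# Inferred class counts (assumption), keyed by class label
train_counts = {0: 197, 1: 153}  # 350 training rows
test_counts = {0: 85, 1: 65}     # 150 test rows


def proportions(counts: dict) -> dict:
    """Convert class counts to class shares."""
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


overall_counts = {
    label: train_counts[label] + test_counts[label] for label in train_counts
}

print("Overall:", proportions(overall_counts))
print("Train:  ", proportions(train_counts))
print("Test:   ", proportions(test_counts))
```

The shares cannot match exactly because each split must contain whole rows, but both stay within a fraction of a percentage point of the overall balance.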


# Train

In [26]:
pipeline.fit(X_train, y_train)
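
The forest inside the pipeline was created with `oob_score=True`, so an out-of-bag estimate is available once the pipeline is fitted. The sketch below shows one way to read it through `named_steps`. It uses synthetic data from `make_classification` and a simplified pipeline as stand-ins, so the names `X_demo`, `y_demo`, and `demo_pipeline` are hypothetical and the printed score is not the score for `plan_purchase.csv`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

# Synthetic stand-in data (not the plan_purchase.csv dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=42)

demo_pipeline = Pipeline([
    ("preprocessing", SimpleImputer(strategy="median")),
    ("model", RandomForestClassifier(
        n_estimators=100,
        oob_score=True,
        random_state=42,
    )),
])
demo_pipeline.fit(X_demo, y_demo)

# The fitted forest is a named step; the OOB estimate is stored on that step
oob = demo_pipeline.named_steps["model"].oob_score_
print(f"OOB score: {oob:.3f}")
```

For the notebook's own pipeline, the equivalent lookup after fitting would be `pipeline.named_steps['model'].oob_score_`.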

# Predict

In [27]:
from sklearn.metrics import accuracy_score, classification_report

y_pred = pipeline.predict(X_test)

print('Accuracy:', accuracy_score(y_test, y_pred))
print('Report:\n')
print(classification_report(y_test, y_pred))


Accuracy: 0.9866666666666667
Report:

              precision    recall  f1-score   support

           0       0.98      1.00      0.99        85
           1       1.00      0.97      0.98        65

    accuracy                           0.99       150
   macro avg       0.99      0.98      0.99       150
weighted avg       0.99      0.99      0.99       150



In [29]:
new_customer = pd.DataFrame({
    "Age": [30],
    "MonthlyIncome": [55000],
    "PlanType": ["Premium"],
    "UsageScore": [65]
})

# Predict the class and the class probabilities for the new customer
prediction = pipeline.predict(new_customer)
probability = pipeline.predict_proba(new_customer)

result = "Yes" if prediction[0] == 1 else "No"

print("Purchase Prediction Result")
print("-" * 30)
print(f"Predicted Purchase: {result}")
print(f"Probability of Purchase: {probability[0][1]:.2%}")

print("\nPrediction (0 = No, 1 = Yes):", prediction)
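
For a random forest in scikit-learn, `predict_proba` returns the mean of the individual trees' class probabilities, and `predict` returns the class with the highest mean. The sketch below reproduces that relationship by hand. The per-tree probabilities are hypothetical values chosen for illustration, not output from the fitted pipeline.

```python
# Hypothetical per-tree probabilities [P(No), P(Yes)] for one customer;
# a real forest has one such row per tree in the ensemble.
tree_probas = [
    [0.2, 0.8],
    [0.6, 0.4],
    [0.1, 0.9],
    [0.3, 0.7],
]

n_trees = len(tree_probas)

# Average each class column across the trees
forest_proba = [sum(column) / n_trees for column in zip(*tree_probas)]

# The predicted class is the index of the largest averaged probability
predicted_class = max(range(len(forest_proba)), key=forest_proba.__getitem__)

print("Averaged probabilities:", [round(p, 2) for p in forest_proba])
print("Predicted class (0 = No, 1 = Yes):", predicted_class)
```

One tree votes "No" here, yet the averaged probability of "Yes" is still 0.70, which is the kind of value `probability[0][1]` reports.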