KNN bagging & stacking

1. What is the fundamental idea behind ensemble techniques? How does
bagging differ from boosting in terms of approach and objective?

-The fundamental idea behind ensemble techniques is to combine multiple individual models (base learners) to form a stronger, more accurate, and more stable model. The assumption is that a group of diverse models whose errors are not strongly correlated will usually outperform any single one of them.

Difference Between Bagging and Boosting

Aspect	Bagging (Bootstrap Aggregating)	Boosting
Approach	Trains multiple models independently on different bootstrapped samples (random samples drawn with replacement).	Trains models sequentially, where each new model focuses on correcting the errors of the previous ones.
Objective	Reduce variance and avoid overfitting by averaging predictions.	Reduce bias by creating a strong learner from several weak learners that focus on mistakes.
Model Weighting	All models contribute equally (majority vote or plain average).	Models are weighted by their performance; in AdaBoost, misclassified samples also get higher weight in the next round.
Examples	Random Forest	AdaBoost, Gradient Boosting
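
The contrast in the table can be sketched with scikit-learn. This is a minimal illustration rather than a benchmark: the synthetic dataset from make_classification, the choice of 50 estimators, and all variable names are assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (an assumption, used only for illustration)
X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Bagging: fully grown trees trained independently on bootstrap samples
bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50, random_state=0)
bagging.fit(X_train, y_train)

# Boosting: decision stumps trained sequentially on re-weighted samples
boosting = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=1), n_estimators=50, random_state=0
)
boosting.fit(X_train, y_train)

bagging_acc = bagging.score(X_test, y_test)
boosting_acc = boosting.score(X_test, y_test)
print("Bagging accuracy: ", bagging_acc)
print("Boosting accuracy:", boosting_acc)
```

Bagging starts from high-variance learners (deep trees) and averages them, while boosting starts from high-bias learners (stumps) and adds them up one at a time.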



2. Explain how the Random Forest Classifier reduces overfitting compared to
a single decision tree. Mention the role of two key hyperparameters in this process.

-A single decision tree tends to overfit because it learns patterns too specific to the training data. A Random Forest reduces overfitting by training many decision trees, each on a bootstrap sample of the data and with a random subset of features considered at each split, and then averaging their predictions (majority vote for classification). This randomness increases model diversity and improves generalization.

Two Key Hyperparameters:

1. n_estimators

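Number of trees in the forest.
More trees reduce the variance of the averaged prediction and improve stability, with diminishing returns as the number grows.
Adding trees does not by itself cause overfitting; it mainly increases training and prediction cost.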

2. max_features

Number of features considered at each split.
Limiting features introduces randomness, reduces correlation among trees, and lowers overfitting.

Other helpful hyperparameters: max_depth, min_samples_split, min_samples_leaf (all limit how far individual trees can grow).
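
A small sketch of the effect, assuming a synthetic noisy dataset (make_classification with flip_y=0.1) and illustrative settings n_estimators=200 and max_features="sqrt". The exact numbers depend on the data; the point is the gap between train and test accuracy.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with 10% label noise (an assumption for illustration)
X, y = make_classification(
    n_samples=600, n_features=20, n_informative=5, flip_y=0.1, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# A single unpruned tree memorizes the training set, including its noise
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# The forest averages many de-correlated trees
forest = RandomForestClassifier(
    n_estimators=200,     # number of trees to average
    max_features="sqrt",  # features considered at each split
    random_state=0,
).fit(X_train, y_train)

tree_train_acc = tree.score(X_train, y_train)
for name, model in [("Single tree", tree), ("Random forest", forest)]:
    print(name, "train:", model.score(X_train, y_train),
          "test:", model.score(X_test, y_test))
```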



3. What is Stacking in ensemble learning? How does it differ from traditional
bagging/boosting methods? Provide a simple example use case.

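-Stacking (Stacked Generalization) is an ensemble technique where predictions from multiple base models are used as input features for a meta-model (or final model), which makes the final prediction. To avoid leakage, the meta-model is normally trained on out-of-fold (cross-validated) predictions of the base models rather than on predictions for rows the base models were fitted on.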

How Stacking Differs:

Bagging / Boosting	Stacking
Combine predictions by voting or (weighted) averaging.	Combine predictions by training a meta-learner on model outputs.
Focus on reducing variance (bagging) or bias (boosting).	Focus on leveraging strengths of different model types.
Uses many similar models.	Uses diverse models (e.g., SVM + Decision Tree + Logistic Regression).


Simple Use Case:
Predicting employee attrition by stacking:
Base models: Random Forest, Logistic Regression, Gradient Boosting.
Meta-model: Logistic Regression to learn how to best combine the predictions.
This often matches or improves on the best single base model, although the gain is not guaranteed and should be checked on held-out data.
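
The use case can be sketched with scikit-learn's StackingClassifier. No attrition dataset is given here, so the synthetic data below is a stand-in, and the estimator settings are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    GradientBoostingClassifier,
    RandomForestClassifier,
    StackingClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for an employee attrition dataset (an assumption)
X, y = make_classification(n_samples=500, n_features=15, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Base models of different types
base_models = [
    ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
    ("lr", LogisticRegression(max_iter=1000)),
    ("gb", GradientBoostingClassifier(n_estimators=50, random_state=0)),
]

# The meta-model is trained on 5-fold out-of-fold predictions of the base models
stack = StackingClassifier(
    estimators=base_models,
    final_estimator=LogisticRegression(),
    cv=5,
)
stack.fit(X_train, y_train)

stack_acc = stack.score(X_test, y_test)
print("Stacking accuracy:", stack_acc)
```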


4. What is the OOB Score in Random Forest, and why is it useful? How does
it help in model evaluation without a separate validation set?

-The OOB (out-of-bag) score is a built-in estimate of a Random Forest's generalization performance. Each tree is trained on a bootstrap sample, which leaves out roughly one third of the training rows (about 36.8% for large datasets); these left-out rows are that tree's out-of-bag samples.

Why It Is Useful:
It gives a validation-style estimate without holding data back, so all rows remain available for training.
It comes almost for free, because it reuses the trees that are already trained instead of refitting the model as cross-validation does.
It is convenient for small datasets and for quick comparisons between hyperparameter settings.

How It Works Without a Separate Validation Set

1. Bootstrap Sampling:
Each tree draws its training rows with replacement, so some rows are never seen by that tree.

2. Out-of-Bag Prediction:
Each row is predicted using only the trees that did not train on it.

3. Aggregation:
The OOB predictions are combined (majority vote for classification, average for regression) and compared with the true labels.
The resulting accuracy (or R-squared for regression) is the OOB score; in scikit-learn it is enabled with oob_score=True and read from the oob_score_ attribute.
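
A minimal sketch of the OOB score that this question asks about, using scikit-learn's oob_score=True option. The synthetic dataset and the 200-tree setting are assumptions; a test split is kept here only to show that the OOB estimate can be compared with held-out accuracy.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic data (an assumption for illustration)
X, y = make_classification(n_samples=600, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# oob_score=True scores each training row using only the trees
# whose bootstrap sample did not contain that row
forest = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
forest.fit(X_train, y_train)

oob_acc = forest.oob_score_               # estimate from training data alone
test_acc = forest.score(X_test, y_test)   # held-out check, for comparison
print("OOB accuracy: ", oob_acc)
print("Test accuracy:", test_acc)
```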


5. Compare AdaBoost and Gradient Boosting in terms of:
● How they handle errors from weak learners
● Weight adjustment mechanism
● Typical use cases

-1. Handling Errors from Weak Learners

AdaBoost:
Focuses on misclassified samples by increasing their weights so that the next weak learner focuses more on these difficult cases.

Gradient Boosting:
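Fits each new learner to the residual errors of the current ensemble (the difference between actual and predicted values). For squared-error loss these residuals are exactly the negative gradient of the loss; for other losses the learner is fit to the negative gradient (pseudo-residuals) instead.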

2. Weight Adjustment Mechanism

AdaBoost:
Misclassified samples → weight increases
Correctly classified samples → weight decreases
Each weak learner is assigned a weight (alpha) based on its weighted error; more accurate learners get a larger say in the final vote.

Gradient Boosting:
No re-weighting of samples between rounds.
Instead, each new learner is fit to the negative gradient of a differentiable loss function (gradient descent in function space).
Tree predictions are scaled by the learning rate before being added to the ensemble.

3. Typical Use Cases

AdaBoost:
Suitable for clean, low-noise datasets, since outliers and mislabeled samples keep gaining weight
Binary classification, face detection, spam classification
Works well with weak learners like small-depth decision trees

Gradient Boosting:
Regression & classification
Complex datasets with non-linear relationships
Real-world use: credit scoring, insurance risk, churn prediction
More flexible than AdaBoost because it can optimize any differentiable loss function
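
The two update rules compared above can be sketched side by side. This is a simplified illustration under stated assumptions: labels in {-1, +1} for the AdaBoost step, squared-error loss for the gradient boosting loop, and hypothetical helper names adaboost_reweight and gradient_boost_fit that are not part of any library.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor


def adaboost_reweight(weights, y_true, y_pred):
    """One AdaBoost round for labels in {-1, +1}; assumes 0 < error < 1."""
    miss = y_true != y_pred
    err = np.sum(weights[miss])               # weighted error of this learner
    alpha = 0.5 * np.log((1 - err) / err)     # learner weight: larger when err is small
    new_w = weights * np.exp(-alpha * y_true * y_pred)  # up-weight mistakes
    return alpha, new_w / new_w.sum()         # renormalize to sum to 1


def gradient_boost_fit(X, y, n_rounds=20, learning_rate=0.1):
    """Squared-error gradient boosting: each tree is fit to current residuals."""
    pred = np.full(len(y), y.mean())          # start from the mean prediction
    mse_history = [np.mean((y - pred) ** 2)]
    for _ in range(n_rounds):
        residuals = y - pred                  # negative gradient of squared error
        tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, residuals)
        pred = pred + learning_rate * tree.predict(X)   # shrunken update
        mse_history.append(np.mean((y - pred) ** 2))
    return mse_history


# AdaBoost: five equally weighted samples, one of them misclassified
y_true = np.array([1, 1, -1, -1, 1])
y_pred = np.array([1, -1, -1, -1, 1])
alpha, new_weights = adaboost_reweight(np.full(5, 0.2), y_true, y_pred)
print("alpha:", alpha, "new weights:", new_weights)

# Gradient boosting: training MSE shrinks as trees are added
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)
mse_history = gradient_boost_fit(X, y)
print("MSE before:", mse_history[0], "after:", mse_history[-1])
```

In the AdaBoost step the single misclassified sample ends up carrying half of the total weight, which is what forces the next learner to attend to it.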


6. Why does CatBoost perform well on categorical features without requiring
extensive preprocessing? Briefly explain its handling of categorical variables.

-CatBoost performs well on categorical features because it handles them internally using advanced encoding techniques, eliminating the need for manual preprocessing like One-Hot Encoding or Label Encoding.

How CatBoost Handles Categorical Variables:

1. Ordered Target Encoding (Target Statistics):
CatBoost converts categorical values into numerical values by using statistics (mean target value) computed in an ordered fashion to avoid target leakage.
2. Permutation-Based Processing:
CatBoost randomly permutes the training rows and encodes each row using only the rows that precede it in the permutation, so a row's own target never leaks into its encoding.
3. Efficient Handling of High-Cardinality Features:
It handles features with many categories (e.g., city names, product IDs) without the column explosion that One-Hot Encoding would cause.
4. Built-in Preprocessing:
Once the categorical columns are identified (for example through the cat_features parameter), CatBoost applies the encoding internally during training, so no separate encoding step is needed.
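
The ordered target statistic described above can be sketched in a few lines of NumPy. This is a simplified illustration, not CatBoost's actual implementation (which, among other things, uses several permutations); the function name ordered_target_encode, the prior value, and the smoothing weight a are assumptions made for the example.

```python
import numpy as np


def ordered_target_encode(categories, targets, prior, order=None, a=1.0, seed=0):
    """Encode each row from the targets of rows that precede it in `order`.

    encoding = (sum of earlier targets in the same category + a * prior)
               / (number of earlier rows in the same category + a)
    """
    n = len(categories)
    if order is None:
        order = np.random.default_rng(seed).permutation(n)  # random row order
    sums, counts = {}, {}
    encoded = np.empty(n)
    for i in order:
        cat = categories[i]
        s, c = sums.get(cat, 0.0), counts.get(cat, 0)
        encoded[i] = (s + a * prior) / (c + a)  # uses earlier rows only
        sums[cat] = s + targets[i]              # the row's own target is added afterwards
        counts[cat] = c + 1
    return encoded


categories = ["a", "b", "a", "a", "b"]
targets = [1, 0, 0, 1, 1]

# A fixed order (0, 1, 2, 3, 4) keeps the example reproducible by hand
encoded = ordered_target_encode(categories, targets, prior=0.5, order=range(5))
print(encoded)
```

The first row of each category sees no history and falls back to the prior; later rows blend the prior with the running target mean of their category.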




In [None]:
#7. KNN Classifier Assignment: Wine Dataset Analysis with
#Optimization
#Task:
#1. Load the Wine dataset (sklearn.datasets.load_wine()).
#2. Split data into 70% train and 30% test.
#3. Train a KNN classifier (default K=5) without scaling and evaluate using:
#a. Accuracy
#b. Precision, Recall, F1-Score (print classification report)
#4. Apply StandardScaler, retrain KNN, and compare metrics.
#5. Use GridSearchCV to find the best K (test K=1 to 20) and distance metric
#(Euclidean, Manhattan)
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report, accuracy_score

# 1. Load the Wine Dataset
data = load_wine()
X, y = data.data, data.target

# 2. Split into train (70%) and test (30%)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42, stratify=y
)

# ---------------------------------------------------------
# Unscaled KNN
# ---------------------------------------------------------
knn_default = KNeighborsClassifier()
knn_default.fit(X_train, y_train)
pred_unscaled = knn_default.predict(X_test)

print("Accuracy (Unscaled KNN):", accuracy_score(y_test, pred_unscaled))
print("\nClassification Report (Unscaled KNN):\n")
print(classification_report(y_test, pred_unscaled))

# ---------------------------------------------------------
# Scaling (StandardScaler)
# ---------------------------------------------------------
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

knn_scaled = KNeighborsClassifier()
knn_scaled.fit(X_train_scaled, y_train)
pred_scaled = knn_scaled.predict(X_test_scaled)

print("\nAccuracy (Scaled KNN):", accuracy_score(y_test, pred_scaled))
print("\nClassification Report (Scaled KNN):\n")
print(classification_report(y_test, pred_scaled))

# ---------------------------------------------------------
# GridSearchCV Optimization
# k = 1 to 20
# Distance metrics: Euclidean, Manhattan
# ---------------------------------------------------------

param_grid = {
    "n_neighbors": list(range(1, 21)),
    "metric": ["euclidean", "manhattan"]
}

grid = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
grid.fit(X_train_scaled, y_train)

print("\nBest Parameters from GridSearchCV:", grid.best_params_)

# Evaluate optimized model
best_knn = grid.best_estimator_
pred_best = best_knn.predict(X_test_scaled)

print("\nAccuracy (Optimized KNN):", accuracy_score(y_test, pred_best))
print("\nClassification Report (Optimized KNN):\n")
print(classification_report(y_test, pred_best))

In [None]:
#8.PCA + KNN with Variance Analysis and Visualization
#Task:
#1. Load the Breast Cancer dataset (sklearn.datasets.load_breast_cancer()).
#2. Apply PCA and plot the scree plot (explained variance ratio).
#3. Retain 95% variance and transform the dataset.
#4. Train KNN on the original data and PCA-transformed data, then compare
#accuracy.
#5. Visualize the first two principal components using a scatter plot (color by class).
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# 1. Load the dataset
data = load_breast_cancer()
X, y = data.data, data.target

# 2. Split Data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42, stratify=y
)

# 3. Standardize (PCA and KNN are both sensitive to feature scale)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# 4. PCA Fit
pca = PCA().fit(X_train_scaled)

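# Scree plot: per-component and cumulative explained variance ratio
components = np.arange(1, len(pca.explained_variance_ratio_) + 1)
plt.figure(figsize=(8,5))
plt.plot(components, pca.explained_variance_ratio_, marker='o', label="Per component")
plt.plot(components, np.cumsum(pca.explained_variance_ratio_), marker='s', label="Cumulative")
plt.xlabel("Number of Components")
plt.ylabel("Explained Variance Ratio")
plt.title("Scree Plot - Breast Cancer Dataset")
plt.legend()
plt.grid(True)
plt.show()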

# 5. Retain 95% Variance
pca_95 = PCA(n_components=0.95)
X_train_pca = pca_95.fit_transform(X_train_scaled)
X_test_pca = pca_95.transform(X_test_scaled)

print("Number of components for 95% variance:", pca_95.n_components_)

# 6. Train KNN (Original vs PCA)
knn = KNeighborsClassifier(n_neighbors=5)

# Original (scaled)
knn.fit(X_train_scaled, y_train)
orig_pred = knn.predict(X_test_scaled)
orig_acc = accuracy_score(y_test, orig_pred)

# PCA-transformed
knn.fit(X_train_pca, y_train)
pca_pred = knn.predict(X_test_pca)
pca_acc = accuracy_score(y_test, pca_pred)

print("Accuracy (Original Scaled Data):", orig_acc)
print("Accuracy (PCA-Transformed Data):", pca_acc)

# 7. Visualization: First 2 Principal Components
pca_2 = PCA(n_components=2)
X_2 = pca_2.fit_transform(X_train_scaled)

plt.figure(figsize=(7,5))
plt.scatter(X_2[:,0], X_2[:,1], c=y_train, cmap='coolwarm', s=40)
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.title("PCA Visualization (First 2 Components)")
plt.colorbar(label="Class")
plt.show()

In [None]:
#9.KNN Regressor with Distance Metrics and K-Value
#Analysis
#Task:
#1. Generate a synthetic regression dataset
#(sklearn.datasets.make_regression(n_samples=500, n_features=10)).
#2. Train a KNN regressor with:
#a. Euclidean distance (K=5)
#b. Manhattan distance (K=5)
#c. Compare Mean Squared Error (MSE) for both.
#3. Test K=1, 5, 10, 20, 50 and plot K vs. MSE to analyze bias-variance tradeoff
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_regression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

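# 1. Generate a synthetic regression dataset (sizes as specified in the task;
#    with a single feature, Euclidean and Manhattan distance would be identical)
X, y = make_regression(n_samples=500, n_features=10, noise=15, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 2. Test different K values (K=5 gives the Euclidean vs. Manhattan comparison)
k_values = [1, 5, 10, 20, 50]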
mse_euclidean = []
mse_manhattan = []

for k in k_values:
    # Euclidean Distance
    knn_e = KNeighborsRegressor(n_neighbors=k, metric='euclidean')
    knn_e.fit(X_train, y_train)
    pred_e = knn_e.predict(X_test)
    mse_euclidean.append(mean_squared_error(y_test, pred_e))

    # Manhattan Distance
    knn_m = KNeighborsRegressor(n_neighbors=k, metric='manhattan')
    knn_m.fit(X_train, y_train)
    pred_m = knn_m.predict(X_test)
    mse_manhattan.append(mean_squared_error(y_test, pred_m))

# Print Results
print("K Values:", k_values)
print("MSE Euclidean:", mse_euclidean)
print("MSE Manhattan:", mse_manhattan)

# 3. Plot MSE values
plt.plot(k_values, mse_euclidean, label="Euclidean")
plt.plot(k_values, mse_manhattan, label="Manhattan")
plt.xlabel("K Value")
plt.ylabel("MSE")
plt.title("K vs MSE for Euclidean & Manhattan Distance")
plt.legend()
plt.show()


In [None]:
#10.KNN with KD-Tree/Ball Tree, Imputation, and Real-World
#Data
#Task:
#1. Load the Pima Indians Diabetes dataset (contains missing values).
#2. Use KNN Imputation (sklearn.impute.KNNImputer) to fill missing values.
#3. Train KNN using:
#a. Brute-force method
#b. KD-Tree
#c. Ball Tree
#4. Compare their training time and accuracy.
#5. Plot the decision boundary for the best-performing method (use 2 most important
#features).
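import time
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.impute import KNNImputer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# 1. Load dataset
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
columns = ["Preg", "Plasma", "BP", "Skin", "Insulin", "BMI", "Pedigree", "Age", "Outcome"]
df = pd.read_csv(url, names=columns)

# Missing values in this file are stored as 0 in columns where 0 is not a
# plausible measurement, so mark them as NaN before imputing.
zero_as_missing = ["Plasma", "BP", "Skin", "Insulin", "BMI"]
df[zero_as_missing] = df[zero_as_missing].replace(0, np.nan)

X = df.drop("Outcome", axis=1)
y = df["Outcome"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 2. Imputation with KNN (features only, fitted on the training split to avoid leakage)
imputer = KNNImputer(n_neighbors=5)
X_train = pd.DataFrame(imputer.fit_transform(X_train), columns=X.columns, index=X_train.index)
X_test = pd.DataFrame(imputer.transform(X_test), columns=X.columns, index=X_test.index)

# 3-4. Train KNN with each neighbor-search algorithm; compare fit time and accuracy
results = {}
for algo in ["brute", "kd_tree", "ball_tree"]:
    knn = KNeighborsClassifier(n_neighbors=5, algorithm=algo)
    start = time.perf_counter()
    knn.fit(X_train, y_train)
    fit_time = time.perf_counter() - start
    acc = accuracy_score(y_test, knn.predict(X_test))
    results[algo] = (fit_time, acc)
    print(f"{algo}: fit time = {fit_time:.4f} s, accuracy = {acc:.4f}")

# 5. Decision boundary for the best-performing method (using 2 important features)
best_algo = max(results, key=lambda a: results[a][1])
feat1 = "Plasma"
feat2 = "BMI"

# A 2-D boundary needs a model trained on just the two plotted features
knn_2d = KNeighborsClassifier(n_neighbors=5, algorithm=best_algo)
knn_2d.fit(X_train[[feat1, feat2]].to_numpy(), y_train)

# Predict on a grid covering the feature range to draw the class regions
xx, yy = np.meshgrid(
    np.linspace(X_train[feat1].min() - 1, X_train[feat1].max() + 1, 200),
    np.linspace(X_train[feat2].min() - 1, X_train[feat2].max() + 1, 200),
)
zz = knn_2d.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

plt.contourf(xx, yy, zz, alpha=0.3, cmap="coolwarm")
plt.scatter(X_test[feat1], X_test[feat2], c=y_test, cmap="coolwarm", edgecolor="k")
plt.xlabel(feat1)
plt.ylabel(feat2)
plt.title(f"Decision Boundary ({best_algo} KNN)")
plt.show()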