BOOSTING

1. What is the fundamental idea behind ensemble techniques? How does
bagging differ from boosting in terms of approach and objective?

- The fundamental idea behind ensemble techniques is to combine multiple weak or base learners (such as decision trees) into a more accurate, robust, and better-generalizing model. By aggregating the predictions of several models, ensemble methods can reduce variance, reduce bias, or both, which improves overall predictive performance.

Bagging (Bootstrap Aggregating):

Approach: Trains multiple models independently on different random subsets of the training data (created through bootstrapping).
Objective: Mainly reduces variance, which helps limit overfitting.
Example: Random Forest.
Combination: Uses averaging (for regression) or majority voting (for classification).


Boosting:

Approach: Trains models sequentially, where each new model focuses on correcting the errors of the previous one.
Objective: Mainly reduces bias and improves accuracy by building a strong learner from many weak ones.
Examples: AdaBoost, Gradient Boosting, XGBoost.
Combination: Weighted sum of model outputs.
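
The contrast between the two approaches can be sketched with scikit-learn. This is only an illustration under stated assumptions: the synthetic dataset, the tree depths, and the number of estimators are arbitrary choices, not values taken from the question.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, assumed purely for illustration
X, y = make_classification(
    n_samples=600, n_features=20, n_informative=8, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Bagging: fully grown trees trained independently on bootstrap samples,
# combined by voting
bagging = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=50, random_state=0
)
bagging.fit(X_train, y_train)

# Boosting: shallow stumps trained one after another, each focusing on the
# samples the previous ones got wrong, combined by a weighted vote
boosting = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=1), n_estimators=50, random_state=0
)
boosting.fit(X_train, y_train)

bagging_acc = bagging.score(X_test, y_test)
boosting_acc = boosting.score(X_test, y_test)
print("Bagging accuracy:", round(bagging_acc, 4))
print("Boosting accuracy:", round(boosting_acc, 4))
```

Which of the two scores higher depends on the data; the point is only the difference in how the base learners are trained and combined.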


2. Explain how the Random Forest Classifier reduces overfitting compared to
a single decision tree. Mention the role of two key hyperparameters in this process.

- The Random Forest Classifier reduces overfitting by combining multiple decision trees trained on random subsets of data and features. This randomness ensures that individual trees are less correlated and capture different patterns, making the ensemble more generalizable and less prone to overfitting.

Two key hyperparameters that help in this process are:

1. n_estimators:

The number of trees in the forest.
A higher number generally improves performance and stability by averaging out errors from individual trees.

2. max_features:

The number of features to consider when splitting a node.
Considering fewer randomly chosen features at each split increases diversity among trees, reducing their correlation and the risk of overfitting.
Together, these parameters create a balance between bias and variance, improving overall model generalization.
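
A minimal sketch of this effect, assuming the Breast Cancer dataset and the specific values `n_estimators=200` and `max_features="sqrt"` (both chosen only for illustration), compares a single unrestricted tree with a forest:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# A single fully grown tree tends to memorize the training set
tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

# A forest averages many decorrelated trees:
#   n_estimators  -> how many trees are averaged
#   max_features  -> how many features each split may choose from
forest = RandomForestClassifier(
    n_estimators=200, max_features="sqrt", random_state=42
).fit(X_train, y_train)

for name, model in [("Single tree", tree), ("Random forest", forest)]:
    print(
        f"{name}: train = {model.score(X_train, y_train):.4f}, "
        f"test = {model.score(X_test, y_test):.4f}"
    )
```

A large gap between training and test accuracy is a sign of overfitting; the forest typically narrows that gap, though the exact numbers depend on the split.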


3. What is Stacking in ensemble learning? How does it differ from traditional
bagging/boosting methods? Provide a simple example use case.

- Stacking (Stacked Generalization) is an ensemble learning technique where predictions from multiple base models (often called level-0 models) are combined using a meta-model (the level-1 model) that learns the best way to blend their outputs.

Difference from Bagging/Boosting:
Bagging trains models independently and averages their predictions.
Boosting trains models sequentially to correct previous errors.
Stacking combines diverse models (e.g., Decision Tree, Logistic Regression, SVM) and uses another model (e.g., Logistic Regression for classification or Linear Regression for regression) to learn the best combination of their outputs.

Example Use Case:
In a loan approval prediction system, stacking could combine:
A Random Forest (captures non-linear relationships),
A Logistic Regression (captures linear trends),
And a Gradient Boosting model.
Because loan approval is a classification task, a meta-model such as Logistic Regression would then learn how to weight these models’ predictions to achieve the best overall accuracy.
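
The use case above could be sketched with scikit-learn's `StackingClassifier`. No loan dataset is given here, so the synthetic data below is a stand-in, and the sketch assumes a logistic regression as the meta-model because the task is classification.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    GradientBoostingClassifier,
    RandomForestClassifier,
    StackingClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a loan approval dataset (hypothetical data)
X, y = make_classification(
    n_samples=500, n_features=12, n_informative=6, random_state=1
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y
)

# Diverse base models
base_models = [
    ("rf", RandomForestClassifier(n_estimators=100, random_state=1)),
    ("lr", make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))),
    ("gb", GradientBoostingClassifier(random_state=1)),
]

# The meta-model is trained on out-of-fold predictions of the base models
# (cv=5), so it does not simply learn from predictions on data they have seen
stack = StackingClassifier(
    estimators=base_models, final_estimator=LogisticRegression(), cv=5
)
stack.fit(X_train, y_train)

stack_acc = stack.score(X_test, y_test)
print("Stacking accuracy:", round(stack_acc, 4))
```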




4. What is the OOB Score in Random Forest, and why is it useful? How does
it help in model evaluation without a separate validation set?

- The OOB (Out-Of-Bag) score is an internal performance estimate used in Random Forests. It measures how well the model predicts unseen data without requiring a separate validation or test set.

Explanation:
When building each tree in a Random Forest, only a bootstrap sample (a random sample drawn with replacement) of the training data is used. On average, about 63% of the distinct training instances end up in a tree's bootstrap sample, while the remaining 37% or so (the Out-Of-Bag samples) are left out.
These OOB samples serve as unseen data for that particular tree. Once the forest is trained, each sample’s prediction is aggregated over all trees for which it was OOB, and accuracy is then calculated over the entire dataset. This gives the OOB score.
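
The 63% / 37% split follows from sampling with replacement: the chance that a given instance is never drawn in n draws is (1 - 1/n)^n, which approaches 1/e ≈ 0.368. A quick check, with the sample size of 10,000 chosen arbitrarily for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 10_000

# One bootstrap sample: n_samples indices drawn with replacement
idx = rng.integers(0, n_samples, size=n_samples)

in_bag_fraction = np.unique(idx).size / n_samples
oob_fraction = 1 - in_bag_fraction

# Probability that a given instance is never drawn in n_samples draws
theoretical_oob = (1 - 1 / n_samples) ** n_samples

print("In-bag fraction (simulated):", round(in_bag_fraction, 3))
print("OOB fraction (simulated):   ", round(oob_fraction, 3))
print("OOB fraction (theoretical): ", round(theoretical_oob, 3))
```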

Why it is useful:
It provides an approximately unbiased (often slightly conservative) estimate of model performance, similar to cross-validation.
It saves time and data since no additional validation set is needed.
It helps in tuning hyperparameters efficiently during training.

In short:
OOB score acts like an internal cross-validation mechanism, giving a reliable estimate of how well the Random Forest generalizes to unseen data.
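
In scikit-learn the OOB estimate is available by setting `oob_score=True`. The sketch below assumes the Wine dataset and 300 trees, chosen only for illustration; the forest is fit on all the data, with no separate validation split.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier

X, y = load_wine(return_X_y=True)

# bootstrap=True is required for OOB estimation (it is the default)
forest = RandomForestClassifier(
    n_estimators=300, oob_score=True, bootstrap=True, random_state=42
)
forest.fit(X, y)

# Accuracy computed from predictions made only by trees that did not see
# the sample in question during training
print("OOB score:", round(forest.oob_score_, 4))

# Per-sample class probabilities aggregated over the OOB trees
print("OOB decision function shape:", forest.oob_decision_function_.shape)
```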


5. Compare AdaBoost and Gradient Boosting in terms of:
● How they handle errors from weak learners
● Weight adjustment mechanism
● Typical use cases

- Comparison of AdaBoost and Gradient Boosting:

How they handle errors from weak learners:
AdaBoost: Increases the weights of misclassified samples so that the next weak learner focuses more on those difficult samples.
Gradient Boosting: Fits the next learner to the residual errors of the current ensemble. For squared-error loss these are the differences between actual and predicted values; in general they are the negative gradients of the loss.

Weight adjustment mechanism:
AdaBoost: Sample weights are adjusted exponentially based on classification errors, with higher weight given to incorrectly predicted samples.
Gradient Boosting: No explicit sample re-weighting. The loss is minimized by gradient descent in function space, with each new learner moving the predictions in the direction that reduces the loss.

Typical use cases:
AdaBoost: Mostly used for classification problems (binary or multiclass). Examples: spam detection, face recognition.
Gradient Boosting: Used for both regression and classification. Examples: predicting house prices, customer churn.
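
The two mechanisms can be sketched side by side. The first part runs one round of the classic (discrete) AdaBoost weight update on five hand-made labels; the second runs a bare-bones gradient boosting loop for squared-error loss. The toy labels, the sine-wave data, the learning rate, and the tree depth are all assumptions made for illustration, and this is not how the scikit-learn estimators are implemented internally.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# --- AdaBoost: one round of sample re-weighting (labels in {-1, +1}) ---
y_true = np.array([1, 1, -1, -1, 1])
y_pred = np.array([1, -1, -1, -1, 1])  # the weak learner gets sample 1 wrong
weights = np.full(len(y_true), 1 / len(y_true))

error = weights[y_pred != y_true].sum()    # weighted error rate
alpha = 0.5 * np.log((1 - error) / error)  # vote weight of this learner
# Correct samples are scaled by exp(-alpha), wrong ones by exp(+alpha)
weights = weights * np.exp(-alpha * y_true * y_pred)
weights /= weights.sum()
print("AdaBoost weights after one round:", np.round(weights, 3))

# --- Gradient boosting: fit each new tree to the current residuals ---
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.1
prediction = np.full_like(y, y.mean())  # start from a constant model
mse_history = [np.mean((y - prediction) ** 2)]

for _ in range(50):
    # For squared-error loss the negative gradient is the residual
    residuals = y - prediction
    tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, residuals)
    prediction += learning_rate * tree.predict(X)
    mse_history.append(np.mean((y - prediction) ** 2))

print("Training MSE before boosting:", round(mse_history[0], 4))
print("Training MSE after boosting: ", round(mse_history[-1], 4))
```

After the AdaBoost round, the single misclassified sample carries half of the total weight, which is what forces the next learner to focus on it.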

6. Why does CatBoost perform well on categorical features without requiring
extensive preprocessing? Briefly explain its handling of categorical variables?

- CatBoost automatically handles categorical variables using a technique called ordered target statistics (or target encoding with permutations).
It replaces categorical values with numerical representations based on how the target variable behaves for each category, while preventing data leakage by using permutations.
This eliminates the need for manual encoding (like one-hot or label encoding) and reduces overfitting, allowing CatBoost to work efficiently with categorical data.
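
The idea of ordered target statistics can be sketched in plain NumPy. This is a simplified illustration, not CatBoost's actual implementation: the function name `ordered_target_statistic`, the single permutation, and the `prior_weight` smoothing parameter are assumptions of this sketch. Each row is encoded using only the targets of rows that come before it in a random order, so a row's own label never leaks into its encoding.

```python
import numpy as np


def ordered_target_statistic(categories, targets, prior, prior_weight=1.0, seed=0):
    """Encode each category value using only rows seen earlier in a random order."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(categories))

    sums = {}    # running sum of targets per category
    counts = {}  # running number of rows per category
    encoded = np.empty(len(categories), dtype=float)

    for i in order:
        cat = categories[i]
        s = sums.get(cat, 0.0)
        c = counts.get(cat, 0)
        # Smoothed mean of the targets seen so far for this category
        encoded[i] = (s + prior_weight * prior) / (c + prior_weight)
        # Only now add the current row's own target to the running statistics
        sums[cat] = s + targets[i]
        counts[cat] = c + 1

    return encoded


# Hypothetical toy data
categories = ["red", "blue", "red", "red", "blue", "green"]
targets = [1, 0, 1, 0, 1, 0]
prior = sum(targets) / len(targets)  # global target mean

encoded = ordered_target_statistic(categories, targets, prior)
for cat, value in zip(categories, encoded):
    print(f"{cat:>5} -> {value:.3f}")
```

A category that appears only once (here "green") is always encoded with the prior, since no earlier row of that category exists to learn from.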

#7.: KNN Classifier Assignment: Wine Dataset Analysis with
#Optimization
#Task:
#1. Load the Wine dataset (sklearn.datasets.load_wine()).
#2. Split data into 70% train and 30% test.
#3. Train a KNN classifier (default K=5) without scaling and evaluate using:
#a. Accuracy
#b. Precision, Recall, F1-Score (print classification report)
#4. Apply StandardScaler, retrain KNN, and compare metrics.
#5. Use GridSearchCV to find the best K (test K=1 to 20) and distance metric
#(Euclidean, Manhattan).

# Import necessary libraries
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report

# 1. Load the Wine dataset
data = load_wine()
X, y = data.data, data.target

# 2. Split into 70% train and 30% test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# 3. Train KNN (default k=5) without scaling
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)

print("=== Without Scaling ===")
print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))

# 4. Apply StandardScaler and retrain KNN
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

knn_scaled = KNeighborsClassifier(n_neighbors=5)
knn_scaled.fit(X_train_scaled, y_train)
y_pred_scaled = knn_scaled.predict(X_test_scaled)

print("\n=== With Scaling ===")
print("Accuracy:", accuracy_score(y_test, y_pred_scaled))
print(classification_report(y_test, y_pred_scaled))

# 5. Use GridSearchCV to find best K (1 to 20) and distance metric
param_grid = {
    'n_neighbors': range(1, 21),
    'metric': ['euclidean', 'manhattan']
}

grid = GridSearchCV(
    KNeighborsClassifier(),
    param_grid,
    cv=5,
    scoring='accuracy'
)
grid.fit(X_train_scaled, y_train)

print("\n=== Best Parameters from GridSearchCV ===")
print(grid.best_params_)
print("Best Cross-validation Accuracy:", grid.best_score_)

# 6. Train optimized KNN with best params and compare
best_knn = grid.best_estimator_
y_pred_best = best_knn.predict(X_test_scaled)

print("\n=== Optimized KNN on Scaled Data ===")
print("Accuracy:", accuracy_score(y_test, y_pred_best))
print(classification_report(y_test, y_pred_best))

#8.: PCA + KNN with Variance Analysis and Visualization
#Task:
#1. Load the Breast Cancer dataset (sklearn.datasets.load_breast_cancer()).
#2. Apply PCA and plot the scree plot (explained variance ratio).
#3. Retain 95% variance and transform the dataset.
#4. Train KNN on the original data and PCA-transformed data, then compare
#accuracy.
#5. Visualize the first two principal components using a scatter plot (color by class)

# Import libraries
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report

# 1. Load dataset
data = load_breast_cancer()
X, y = data.data, data.target

# 2. Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Scale data before PCA or KNN
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# 3. Apply PCA and plot explained variance ratio
pca = PCA().fit(X_train_scaled)
plt.figure(figsize=(6,4))
plt.plot(range(1, len(pca.explained_variance_ratio_)+1),
         pca.explained_variance_ratio_, marker='o')
plt.title("Scree Plot – Explained Variance Ratio")
plt.xlabel("Principal Component")
plt.ylabel("Explained Variance Ratio")
plt.grid(True)
plt.show()

# 4. Retain 95% variance
pca_95 = PCA(0.95)
X_train_pca = pca_95.fit_transform(X_train_scaled)
X_test_pca = pca_95.transform(X_test_scaled)
print(f"Number of components to retain 95% variance: {pca_95.n_components_}")

# 5. Train KNN on original (scaled) data
knn_original = KNeighborsClassifier(n_neighbors=5)
knn_original.fit(X_train_scaled, y_train)
y_pred_orig = knn_original.predict(X_test_scaled)
acc_orig = accuracy_score(y_test, y_pred_orig)
print("\n=== KNN on Original Scaled Data ===")
print("Accuracy:", acc_orig)
print(classification_report(y_test, y_pred_orig))

# 6. Train KNN on PCA-transformed data
knn_pca = KNeighborsClassifier(n_neighbors=5)
knn_pca.fit(X_train_pca, y_train)
y_pred_pca = knn_pca.predict(X_test_pca)
acc_pca = accuracy_score(y_test, y_pred_pca)
print("\n=== KNN on PCA-Transformed Data ===")
print("Accuracy:", acc_pca)
print(classification_report(y_test, y_pred_pca))

# 7. Compare
print(f"\nAccuracy Comparison:\nOriginal = {acc_orig:.4f}, PCA = {acc_pca:.4f}")

# 8. Visualize first two principal components
pca_2 = PCA(n_components=2)
X_2D = pca_2.fit_transform(X_train_scaled)

plt.figure(figsize=(6,5))
plt.scatter(X_2D[:,0], X_2D[:,1], c=y_train, cmap='coolwarm', alpha=0.7)
plt.title("PCA – First Two Principal Components")
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.colorbar(label='Class')
plt.show()

#9.KNN Regressor with Distance Metrics and K-Value
#Analysis
#Task:
#1. Generate a synthetic regression dataset
#(sklearn.datasets.make_regression(n_samples=500, n_features=10)).
#2. Train a KNN regressor with:
#a. Euclidean distance (K=5)
#b. Manhattan distance (K=5)
#c. Compare Mean Squared Error (MSE) for both.
#3. Test K=1, 5, 10, 20, 50 and plot K vs. MSE to analyze bias-variance tradeoff.

# Import required libraries
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error

# 1. Generate synthetic regression dataset
X, y = make_regression(n_samples=500, n_features=10, noise=15, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Scale data for distance-based models
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# 2. Train KNN Regressor with Euclidean and Manhattan (K=5)
knn_euclidean = KNeighborsRegressor(n_neighbors=5, metric='euclidean')
knn_manhattan = KNeighborsRegressor(n_neighbors=5, metric='manhattan')

knn_euclidean.fit(X_train_scaled, y_train)
knn_manhattan.fit(X_train_scaled, y_train)

y_pred_euc = knn_euclidean.predict(X_test_scaled)
y_pred_man = knn_manhattan.predict(X_test_scaled)

mse_euc = mean_squared_error(y_test, y_pred_euc)
mse_man = mean_squared_error(y_test, y_pred_man)

print("=== MSE Comparison (K=5) ===")
print("Euclidean Distance:", round(mse_euc, 3))
print("Manhattan Distance:", round(mse_man, 3))

# 3. Test K = 1, 5, 10, 20, 50 (as the task specifies) and plot K vs MSE
k_values = [1, 5, 10, 20, 50]
mse_values = []

for k in k_values:
    knn = KNeighborsRegressor(n_neighbors=k, metric='euclidean')
    knn.fit(X_train_scaled, y_train)
    y_pred = knn.predict(X_test_scaled)
    mse_values.append(mean_squared_error(y_test, y_pred))

plt.figure(figsize=(6,4))
plt.plot(k_values, mse_values, marker='o')
plt.title("K vs Mean Squared Error (Bias–Variance Tradeoff)")
plt.xlabel("Number of Neighbors (K)")
plt.ylabel("Mean Squared Error")
plt.grid(True)
plt.show()

best_k = k_values[np.argmin(mse_values)]
print(f"Best K based on lowest MSE: {best_k}")

#10.KNN with KD-Tree/Ball Tree, Imputation, and Real-World
#Data
#Task:
#1. Load the Pima Indians Diabetes dataset (contains missing values).
#2. Use KNN Imputation (sklearn.impute.KNNImputer) to fill missing values.
#3. Train KNN using:
#a. Brute-force method
#b. KD-Tree
#c. Ball Tree
#4. Compare their training time and accuracy.
#5. Plot the decision boundary for the best-performing method (use 2 most important
#Features).

# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import time
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.impute import KNNImputer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report

# 1. Load Pima Indians Diabetes dataset
# (Dataset link: https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv)
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
cols = [
    "Pregnancies","Glucose","BloodPressure","SkinThickness",
    "Insulin","BMI","DiabetesPedigreeFunction","Age","Outcome"
]
df = pd.read_csv(url, names=cols)

# 2. Replace invalid 0 values with NaN (for selected features)
features_with_zero = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]
df[features_with_zero] = df[features_with_zero].replace(0, np.nan)

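# Apply KNN Imputer to the feature columns only, so that the target
# (Outcome) does not influence the imputed values
feature_cols = cols[:-1]
imputer = KNNImputer(n_neighbors=5)
df_imputed = df.copy()
df_imputed[feature_cols] = imputer.fit_transform(df[feature_cols])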

# Split data
X = df_imputed.drop("Outcome", axis=1)
y = df_imputed["Outcome"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# 3. Train KNN using different algorithms
methods = ["brute", "kd_tree", "ball_tree"]
results = {}

for method in methods:
    start = time.perf_counter()  # monotonic, high-resolution timer
    knn = KNeighborsClassifier(n_neighbors=5, algorithm=method)
    knn.fit(X_train_scaled, y_train)
    y_pred = knn.predict(X_test_scaled)
    end = time.perf_counter()

    acc = accuracy_score(y_test, y_pred)
    results[method] = {"Accuracy": acc, "Time (s)": end - start}
    print(f"\n=== {method.upper()} ===")
    print("Accuracy:", round(acc, 4))
    print("Training + Prediction Time:", round(end - start, 4), "seconds")
    print(classification_report(y_test, y_pred))

# 4. Compare performance
result_df = pd.DataFrame(results).T
print("\nPerformance Comparison:\n", result_df)

# 5. Plot decision boundary (using 2 features)
# 'Glucose' and 'BMI' are picked here by assumption; their importance is not computed
from matplotlib.colors import ListedColormap

X2 = df_imputed[["Glucose", "BMI"]]
y2 = df_imputed["Outcome"]

# Stratify so the class balance matches the split used above
X2_train, X2_test, y2_train, y2_test = train_test_split(
    X2, y2, test_size=0.3, random_state=42, stratify=y2
)
scaler2 = StandardScaler()
X2_train_scaled = scaler2.fit_transform(X2_train)
X2_test_scaled = scaler2.transform(X2_test)

# Use best-performing method (from result_df)
best_method = result_df["Accuracy"].idxmax()
best_knn = KNeighborsClassifier(n_neighbors=5, algorithm=best_method)
best_knn.fit(X2_train_scaled, y2_train)

# Create mesh grid for plotting
x_min, x_max = X2_train_scaled[:, 0].min() - 1, X2_train_scaled[:, 0].max() + 1
y_min, y_max = X2_train_scaled[:, 1].min() - 1, X2_train_scaled[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.02),
                     np.arange(y_min, y_max, 0.02))
Z = best_knn.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)

# Plot decision boundary
plt.figure(figsize=(7,5))
plt.contourf(xx, yy, Z, cmap=ListedColormap(['#FFAAAA','#AAFFAA']), alpha=0.6)
plt.scatter(X2_train_scaled[:,0], X2_train_scaled[:,1], c=y2_train, cmap='bwr', edgecolors='k', alpha=0.8)
plt.title(f"KNN Decision Boundary ({best_method.upper()} method)")
plt.xlabel("Glucose (scaled)")
plt.ylabel("BMI (scaled)")
plt.show()

