1. What is Boosting in Machine Learning?
Boosting is an ensemble technique that combines multiple weak learners (usually decision trees) to form a strong learner. It works sequentially, where each new model focuses on correcting the errors made by the previous ones.

2. How does Boosting differ from Bagging?
Aspect	Bagging	Boosting
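Training Order	Parallel (models trained independently)	Sequential (each model depends on the previous ones)
Goal	Reduce variance	Reduce bias (and variance)
Training Data	Bootstrapped samples	Reweighted samples or residuals that emphasize errors of prior models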
Examples	Random Forest	AdaBoost, Gradient Boosting

3. What is the key idea behind AdaBoost?
AdaBoost (Adaptive Boosting) improves model performance by:

Assigning weights to training instances.

Increasing weights on misclassified points.

Building subsequent weak learners focused on hard-to-classify points.

Combining learners with weighted voting.

4. Explain the working of AdaBoost with an example
Example:
Suppose you want to classify emails as spam/not spam.

Step 1: Initialize equal weights for all training samples.

Step 2: Train the first weak classifier (e.g., a decision stump).

Step 3: Evaluate errors:

Misclassified samples → Increase their weights.

Correctly classified samples → Decrease their weights.

Step 4: Train the next model with the updated weights.

Step 5: Repeat steps 2–4 for the chosen number of rounds.

Step 6: Final prediction: weighted majority vote of all weak learners.
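The steps above can be sketched from scratch in plain Python. This is a minimal illustration under stated assumptions, not scikit-learn's implementation: the toy 1-D dataset, the fixed threshold list, and the helper names (stump_predict, best_stump, adaboost_fit, adaboost_predict) are all hypothetical, and labels are encoded as -1/+1.

```python
import math

# Toy 1-D dataset (hypothetical); labels are -1 or +1.
# No single threshold separates the classes, so one stump is not enough.
X = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1, 1, -1, -1, 1, 1]
THRESHOLDS = [0.5, 1.5, 2.5, 3.5, 4.5, 5.5, 6.5]


def stump_predict(x, threshold, polarity):
    """Decision stump: predict `polarity` below the threshold, else the opposite."""
    return polarity if x < threshold else -polarity


def best_stump(X, y, weights):
    """Return (weighted_error, threshold, polarity) with the lowest weighted error."""
    best = None
    for threshold in THRESHOLDS:
        for polarity in (1, -1):
            error = sum(w for x, label, w in zip(X, y, weights)
                        if stump_predict(x, threshold, polarity) != label)
            if best is None or error < best[0]:
                best = (error, threshold, polarity)
    return best


def adaboost_fit(X, y, n_rounds):
    n = len(X)
    weights = [1.0 / n] * n                       # equal initial weights
    ensemble = []
    for _ in range(n_rounds):
        error, threshold, polarity = best_stump(X, y, weights)
        if error >= 0.5:                          # no better than chance: stop
            break
        error = max(error, 1e-10)                 # avoid division by zero
        alpha = 0.5 * math.log((1 - error) / error)   # vote weight of this stump
        ensemble.append((alpha, threshold, polarity))
        # Raise weights of misclassified samples, lower weights of correct ones
        weights = [w * math.exp(-alpha * label * stump_predict(x, threshold, polarity))
                   for x, label, w in zip(X, y, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]    # renormalize to sum to 1
    return ensemble


def adaboost_predict(ensemble, x):
    """Weighted majority vote of all stumps."""
    score = sum(alpha * stump_predict(x, t, p) for alpha, t, p in ensemble)
    return 1 if score >= 0 else -1


ensemble = adaboost_fit(X, y, n_rounds=3)
predictions = [adaboost_predict(ensemble, x) for x in X]
print(predictions)
```

On this toy data the best single stump still misclassifies a third of the points, while the weighted vote of three stumps fits all six.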

5. What is Gradient Boosting, and how is it different from AdaBoost?
Gradient Boosting:

Uses gradients (first-order derivatives) of a loss function to minimize prediction error.

Fits each new model to the residuals of the current ensemble (more generally, to the negative gradient of the loss, often called pseudo-residuals).

Difference:
AdaBoost adjusts sample weights.

Gradient Boosting fits new learners to residual errors.

Gradient Boosting is more flexible in choosing the loss function.
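To make the residual-fitting idea concrete, here is a small from-scratch sketch for regression with squared error. It is an illustration only, assuming a single numeric feature and one-split regression stumps as the weak learners; the data and the function names (fit_stump, gradient_boost_fit) are hypothetical and do not mirror any library's internals.

```python
def fit_stump(X, residuals):
    """Fit a one-split regression stump to the residuals (1-D feature)."""
    best = None
    values = sorted(set(X))
    for low, high in zip(values, values[1:]):
        threshold = (low + high) / 2
        left = [r for x, r in zip(X, residuals) if x < threshold]
        right = [r for x, r in zip(X, residuals) if x >= threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]


def gradient_boost_fit(X, y, n_rounds, learning_rate):
    base = sum(y) / len(y)                 # initial prediction: mean of targets
    preds = [base] * len(y)
    stumps, mse_history = [], []
    for _ in range(n_rounds):
        # For squared error, the negative gradient is simply the residual
        residuals = [target - p for target, p in zip(y, preds)]
        threshold, left_mean, right_mean = fit_stump(X, residuals)
        stumps.append((threshold, left_mean, right_mean))
        # Add a shrunken copy of the new stump to the running prediction
        preds = [p + learning_rate * (left_mean if x < threshold else right_mean)
                 for x, p in zip(X, preds)]
        mse_history.append(sum((t - p) ** 2 for t, p in zip(y, preds)) / len(y))
    return base, stumps, mse_history


X = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [1.2, 1.9, 3.1, 3.9, 5.2, 5.8, 7.1, 8.0]
base, stumps, mse_history = gradient_boost_fit(X, y, n_rounds=20, learning_rate=0.3)
print(f"MSE after 1 round: {mse_history[0]:.4f}, after 20 rounds: {mse_history[-1]:.4f}")
```

Unlike AdaBoost, no sample weights are updated here; each round only changes the targets (the residuals) that the next learner sees.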

6. What is the loss function in Gradient Boosting?
It depends on the task:

Task	Loss Function
Regression	Mean Squared Error (MSE)
Classification	Log Loss (binary cross-entropy)
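The quantity each new tree is fitted to is the negative gradient of the chosen loss with respect to the current prediction. The sketch below writes this out for the two losses in the table and checks the formulas numerically. It assumes squared error is scaled by 1/2 (a common convention that only rescales the gradient) and that the classifier's raw output f is a log-odds score; the function names are hypothetical.

```python
import math


def squared_error(y, f):
    return 0.5 * (y - f) ** 2


def squared_error_neg_grad(y, f):
    return y - f                      # the ordinary residual


def sigmoid(f):
    return 1.0 / (1.0 + math.exp(-f))


def log_loss(y, f):
    """Binary cross-entropy for label y in {0, 1} and raw score f (log-odds)."""
    p = sigmoid(f)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


def log_loss_neg_grad(y, f):
    return y - sigmoid(f)             # label minus predicted probability


def numeric_neg_grad(loss, y, f, eps=1e-6):
    """Central finite difference, used only to check the formulas above."""
    return -(loss(y, f + eps) - loss(y, f - eps)) / (2 * eps)


print(squared_error_neg_grad(3.0, 2.5), numeric_neg_grad(squared_error, 3.0, 2.5))
print(log_loss_neg_grad(1, 0.3), numeric_neg_grad(log_loss, 1, 0.3))
```

In both cases the pseudo-residual has the form "target minus current prediction", which is why fitting residuals and following the gradient coincide for these losses.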

7. How does XGBoost improve over traditional Gradient Boosting?
XGBoost (Extreme Gradient Boosting) improvements:

Regularization to prevent overfitting.

Parallel processing for speed.

Tree pruning and approximate splits for efficiency.

Handling missing values natively.

Cache-aware access patterns for faster computation.

8. What is the difference between XGBoost and CatBoost?
Feature	XGBoost	CatBoost
Categorical Data	Usually encoded first (e.g., one-hot); recent versions add optional native support via enable_categorical	Handles natively
Training Speed	Fast	Competitive, optimized for GPU
Default Settings	Requires tuning	Better out-of-the-box performance
Optimization and Tree Structure	Second-order (gradient and Hessian) approximation of the loss	Symmetric (oblivious) trees with ordered boosting

9. What are some real-world applications of Boosting techniques?
Fraud detection (finance)

Click-through rate prediction (ads)

Credit scoring (banking)

Customer churn prediction

Medical diagnosis

Search ranking and recommendation engines

10. How does regularization help in XGBoost?
Regularization prevents overfitting by:

Penalizing complex trees (L1/L2 regularization).

Controlling tree depth, leaf scores, and number of leaves.

Reducing variance at the cost of a small increase in bias.
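A small numeric sketch can show how the penalties act on a single split. It uses the leaf-weight and split-gain formulas from the XGBoost paper with only the L2 term (lambda) and the per-leaf penalty (gamma); the L1 term is left out for brevity. G and H stand for the sums of first and second derivatives of the loss over the samples in a node, and the numbers and function names below are made up for illustration.

```python
def leaf_weight(G, H, reg_lambda):
    """Optimal leaf score: larger lambda shrinks it toward zero."""
    return -G / (H + reg_lambda)


def split_gain(G_left, H_left, G_right, H_right, reg_lambda, gamma):
    """Loss reduction from a split; the split is kept only if this is positive."""
    def score(G, H):
        return G * G / (H + reg_lambda)

    parent = score(G_left + G_right, H_left + H_right)
    return 0.5 * (score(G_left, H_left) + score(G_right, H_right) - parent) - gamma


# Hypothetical gradient statistics for the two children of a candidate split
G_left, H_left, G_right, H_right = -4.0, 2.0, 3.0, 2.0

print(leaf_weight(G_left, H_left, reg_lambda=0.0))   # unregularized leaf score
print(leaf_weight(G_left, H_left, reg_lambda=1.0))   # shrunk by L2 penalty
print(split_gain(G_left, H_left, G_right, H_right, reg_lambda=0.0, gamma=0.0))
print(split_gain(G_left, H_left, G_right, H_right, reg_lambda=1.0, gamma=0.0))
print(split_gain(G_left, H_left, G_right, H_right, reg_lambda=1.0, gamma=5.0))
```

With these numbers, raising lambda lowers both the leaf score and the gain, and a large enough gamma makes the gain negative, so the split would be pruned.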

11. What are some hyperparameters to tune in Gradient Boosting models?
learning_rate – Shrinks contribution of each tree.

n_estimators – Number of boosting rounds.

max_depth – Depth of each tree.

subsample – Fraction of samples used per tree.

colsample_bytree – Fraction of features per tree.

min_child_weight – Minimum sum of instance weight in a child.

gamma – Minimum loss reduction to make a split.

lambda, alpha – L2 and L1 regularization (reg_lambda and reg_alpha in XGBoost's scikit-learn API).

12. What is the concept of Feature Importance in Boosting?
Feature importance measures how valuable each feature is in making predictions. In boosting:

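The simplest measure is how often a feature is used for splitting.

XGBoost reports several importance types:

Gain: Average reduction in loss from splits on the feature (total_gain gives the sum).

Cover: Number of samples affected by splits on the feature.

Frequency (called weight in XGBoost): Number of times a feature appears in trees.

CatBoost reports its own measures, such as PredictionValuesChange and LossFunctionChange.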
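As a rough illustration of how gain, cover, and frequency differ, the sketch below tallies them from a list of split records. The records, feature names, and the importance_summary helper are invented for the example; real libraries derive these statistics from their fitted trees, and their exact definitions (for instance averaged versus summed gain) may differ.

```python
from collections import defaultdict

# Hypothetical split records: (feature used, loss reduction, samples reaching the split)
splits = [
    ("age", 4.0, 100),
    ("income", 2.5, 60),
    ("age", 1.5, 40),
    ("tenure", 0.5, 25),
]


def importance_summary(splits):
    """Aggregate total gain, total cover, and split frequency per feature."""
    stats = defaultdict(lambda: {"gain": 0.0, "cover": 0, "frequency": 0})
    for feature, gain, n_samples in splits:
        stats[feature]["gain"] += gain
        stats[feature]["cover"] += n_samples
        stats[feature]["frequency"] += 1
    return dict(stats)


summary = importance_summary(splits)
for feature, values in sorted(summary.items(), key=lambda kv: -kv[1]["gain"]):
    print(feature, values)
```

A feature can rank high by frequency yet low by gain if it is used in many splits that each reduce the loss only slightly, which is why the three measures are usually reported separately.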

13. Why is CatBoost efficient for categorical data?
CatBoost is designed to natively handle categorical variables:

Uses ordered boosting to avoid target leakage.

Converts categories using target statistics with permutations.

Requires minimal preprocessing—no need for one-hot or label encoding.
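The idea behind ordered target statistics can be sketched in a few lines: each row's category is replaced by a smoothed average of the targets of rows that come before it in a random permutation, so a row never sees its own label. This is a simplified illustration, not CatBoost's actual implementation; the function name, the smoothing parameter a, and the prior value are assumptions made for the example.

```python
import random


def ordered_target_stats(categories, targets, prior, a=1.0, order=None, seed=0):
    """Encode each category using only targets of rows earlier in the permutation."""
    if order is None:
        order = list(range(len(categories)))
        random.Random(seed).shuffle(order)        # random permutation of the rows
    sums, counts = {}, {}
    encoded = [0.0] * len(categories)
    for i in order:
        c = categories[i]
        s, n = sums.get(c, 0.0), counts.get(c, 0)
        encoded[i] = (s + a * prior) / (n + a)    # smoothed mean of earlier targets
        sums[c] = s + targets[i]                  # only now reveal this row's target
        counts[c] = n + 1
    return encoded


categories = ["a", "b", "a", "a"]
targets = [1, 0, 0, 1]
# A fixed order is passed here so the result is easy to follow by hand
print(ordered_target_stats(categories, targets, prior=0.5, order=[0, 1, 2, 3]))
```

The first row of each category falls back to the prior because no earlier rows exist, which is the mechanism that keeps the row's own target from leaking into its encoding.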


#Q14 Train an AdaBoost Classifier on a sample dataset and print model accuracy

from sklearn.ensemble import AdaBoostClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load dataset
data = load_breast_cancer()
X = data.data
y = data.target

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train AdaBoost classifier
model = AdaBoostClassifier(n_estimators=50, learning_rate=1.0, random_state=42)
model.fit(X_train, y_train)

# Predict and evaluate
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

print(f"AdaBoost Model Accuracy: {accuracy:.4f}")

n_estimators: Number of weak learners (default 50).

learning_rate: Weight applied to each classifier (default 1.0).

accuracy_score: Compares true vs predicted labels.


#Q15 Train an AdaBoost Regressor and evaluate performance using Mean Absolute Error (MAE)

from sklearn.ensemble import AdaBoostRegressor
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

# Load dataset
data = fetch_california_housing()
X = data.data
y = data.target

# Split into train/test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train AdaBoost Regressor
model = AdaBoostRegressor(n_estimators=100, learning_rate=0.5, random_state=42)
model.fit(X_train, y_train)

# Predict and evaluate
y_pred = model.predict(X_test)
mae = mean_absolute_error(y_test, y_pred)

print(f"AdaBoost Regressor MAE: {mae:.4f}")

#AdaBoostRegressor: Uses a series of weak regressors (default: DecisionTreeRegressor with max_depth=3).

#n_estimators: Number of weak learners.

#learning_rate: Step size shrinkage to prevent overfitting.

#mean_absolute_error: Measures average absolute prediction error.

#Q16- Train a Gradient Boosting Classifier on the Breast Cancer dataset and print feature importance

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
import pandas as pd
import matplotlib.pyplot as plt

# Load dataset
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train Gradient Boosting Classifier
model = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42)
model.fit(X_train, y_train)

# Get feature importances
importances = model.feature_importances_
feature_importance_df = pd.DataFrame({
    'Feature': X.columns,
    'Importance': importances
}).sort_values(by='Importance', ascending=False)

# Print feature importances
print("Feature Importances:\n", feature_importance_df)

# Optional: Plot the importances
plt.figure(figsize=(10, 6))
plt.barh(feature_importance_df['Feature'], feature_importance_df['Importance'], color='teal')
plt.gca().invert_yaxis()
plt.title('Gradient Boosting Feature Importances (Breast Cancer Dataset)')
plt.xlabel('Importance')
plt.tight_layout()
plt.show()

#Q17 Train a Gradient Boosting Regressor and evaluate using R-Squared Score

from sklearn.ensemble import GradientBoostingRegressor
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

# Load dataset
data = fetch_california_housing()
X = data.data
y = data.target

# Split into train/test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train Gradient Boosting Regressor
model = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42)
model.fit(X_train, y_train)

# Predict and evaluate
y_pred = model.predict(X_test)
r2 = r2_score(y_test, y_pred)

print(f"Gradient Boosting Regressor R-squared Score: {r2:.4f}")

📌 What is R² Score?
R² = 1: Perfect predictions

R² = 0: Model performs no better than the mean

R² < 0: Model is worse than guessing the mean
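The three cases above follow directly from the definition R² = 1 − SS_res / SS_tot. A short hand-rolled check (the r2 helper and the toy values are only for illustration; scikit-learn's r2_score is what the cell above uses):

```python
def r2(y_true, y_pred):
    """R-squared: 1 minus residual sum of squares over total sum of squares."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


y_true = [1.0, 2.0, 3.0, 4.0]

print(r2(y_true, [1.0, 2.0, 3.0, 4.0]))   # perfect predictions
print(r2(y_true, [2.5, 2.5, 2.5, 2.5]))   # always predicting the mean
print(r2(y_true, [4.0, 3.0, 2.0, 1.0]))   # worse than predicting the mean
```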

#Q18- Train an XGBoost Classifier on a dataset and compare accuracy with Gradient Boosting

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score

# Load dataset
data = load_breast_cancer()
X = data.data
y = data.target

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Gradient Boosting Classifier
gb_model = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42)
gb_model.fit(X_train, y_train)
gb_preds = gb_model.predict(X_test)
gb_accuracy = accuracy_score(y_test, gb_preds)

# XGBoost Classifier
xgb_model = XGBClassifier(n_estimators=100, learning_rate=0.1, max_depth=3, eval_metric='logloss', random_state=42)
xgb_model.fit(X_train, y_train)
xgb_preds = xgb_model.predict(X_test)
xgb_accuracy = accuracy_score(y_test, xgb_preds)

# Print accuracies
print(f"Gradient Boosting Classifier Accuracy: {gb_accuracy:.4f}")
print(f"XGBoost Classifier Accuracy        : {xgb_accuracy:.4f}")


#Q19 - Train a CatBoost Classifier and evaluate using F1-Score

from catboost import CatBoostClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

# Load dataset
data = load_breast_cancer()
X = data.data
y = data.target

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Initialize and train CatBoost Classifier (silent mode)
model = CatBoostClassifier(iterations=100, learning_rate=0.1, depth=3, verbose=0, random_state=42)
model.fit(X_train, y_train)

# Predict
y_pred = model.predict(X_test)

# Evaluate using F1-score
f1 = f1_score(y_test, y_pred)

print(f"CatBoost Classifier F1-Score: {f1:.4f}")

#Q20- Train an XGBoost Regressor and evaluate using Mean Squared Error (MSE)

from xgboost import XGBRegressor
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Load dataset
data = fetch_california_housing()
X = data.data
y = data.target

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train XGBoost Regressor
model = XGBRegressor(n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42)
model.fit(X_train, y_train)

# Predict
y_pred = model.predict(X_test)

# Evaluate using MSE
mse = mean_squared_error(y_test, y_pred)

print(f"XGBoost Regressor Mean Squared Error (MSE): {mse:.4f}")

#Q21 Train an AdaBoost Classifier and visualize feature importance
from sklearn.ensemble import AdaBoostClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
import pandas as pd
import matplotlib.pyplot as plt

# Load dataset
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train AdaBoost Classifier
model = AdaBoostClassifier(n_estimators=50, learning_rate=1.0, random_state=42)
model.fit(X_train, y_train)

# Get feature importances
importances = model.feature_importances_
importance_df = pd.DataFrame({
    'Feature': X.columns,
    'Importance': importances
}).sort_values(by='Importance', ascending=False)

# Print feature importances
print("Feature Importances:\n", importance_df)

# Plot feature importances
plt.figure(figsize=(10, 6))
plt.barh(importance_df['Feature'], importance_df['Importance'], color='salmon')
plt.gca().invert_yaxis()
plt.xlabel("Importance Score")
plt.title("AdaBoost Classifier Feature Importances (Breast Cancer Dataset)")
plt.tight_layout()
plt.show()

#Q22- Train a Gradient Boosting Regressor and plot learning curves

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Load dataset
data = fetch_california_housing()
X = data.data
y = data.target

# Split into train and validation sets
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=42)

# Initialize model
model = GradientBoostingRegressor(n_estimators=200, learning_rate=0.1, max_depth=3, random_state=42)

# Fit the model, then record training and validation errors after each boosting stage
model.fit(X_train, y_train)

train_errors = []
val_errors = []

for y_pred_train in model.staged_predict(X_train):
    train_errors.append(mean_squared_error(y_train, y_pred_train))

for y_pred_val in model.staged_predict(X_val):
    val_errors.append(mean_squared_error(y_val, y_pred_val))

# Plot learning curves
plt.figure(figsize=(10, 6))
plt.plot(train_errors, label='Training MSE')
plt.plot(val_errors, label='Validation MSE')
plt.xlabel('Number of Trees')
plt.ylabel('Mean Squared Error')
plt.title('Learning Curves for Gradient Boosting Regressor')
plt.legend()
plt.grid(True)
plt.show()

#Q23 - Train an XGBoost Classifier and visualize feature importance
from xgboost import XGBClassifier, plot_importance
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt

# Load dataset
data = load_breast_cancer()
X = data.data
y = data.target

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train XGBoost Classifier
model = XGBClassifier(n_estimators=100, learning_rate=0.1, max_depth=3, eval_metric='logloss', random_state=42)
model.fit(X_train, y_train)

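# Plot feature importance on an explicitly sized axes
# (plot_importance creates its own figure when no ax is passed)
fig, ax = plt.subplots(figsize=(10, 8))
plot_importance(model, ax=ax, max_num_features=15, importance_type='gain', title='XGBoost Feature Importance (Gain)')
plt.show()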

#Q24 Train a CatBoost Classifier and plot the confusion matrix

from catboost import CatBoostClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plt

# Load dataset
data = load_breast_cancer()
X = data.data
y = data.target

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train CatBoost Classifier
model = CatBoostClassifier(iterations=100, learning_rate=0.1, depth=3, verbose=0, random_state=42)
model.fit(X_train, y_train)

# Predict
y_pred = model.predict(X_test)

# Compute confusion matrix
cm = confusion_matrix(y_test, y_pred, labels=model.classes_)

# Plot confusion matrix
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=model.classes_)
disp.plot(cmap=plt.cm.Blues)
plt.title('CatBoost Classifier Confusion Matrix')
plt.show()


#Q25- Train an AdaBoost Classifier with different numbers of estimators and compare accuracy

from sklearn.ensemble import AdaBoostClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt

# Load dataset
data = load_breast_cancer()
X = data.data
y = data.target

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Different values for n_estimators to test
estimators_list = [10, 50, 100, 150, 200]

accuracies = []

for n in estimators_list:
    model = AdaBoostClassifier(n_estimators=n, learning_rate=1.0, random_state=42)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    acc = accuracy_score(y_test, y_pred)
    accuracies.append(acc)
    print(f"Estimators: {n} - Accuracy: {acc:.4f}")

# Plotting the results
plt.figure(figsize=(8,5))
plt.plot(estimators_list, accuracies, marker='o', color='navy')
plt.title('AdaBoost Accuracy vs Number of Estimators')
plt.xlabel('Number of Estimators')
plt.ylabel('Accuracy')
plt.grid(True)
plt.show()

#Q26- Train a Gradient Boosting Classifier and visualize the ROC curve

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_curve, auc
import matplotlib.pyplot as plt

# Load dataset
data = load_breast_cancer()
X = data.data
y = data.target

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train Gradient Boosting Classifier
model = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42)
model.fit(X_train, y_train)

# Predict probabilities for the positive class
y_proba = model.predict_proba(X_test)[:, 1]

# Compute ROC curve and ROC area
fpr, tpr, thresholds = roc_curve(y_test, y_proba)
roc_auc = auc(fpr, tpr)

# Plot ROC curve
plt.figure(figsize=(8,6))
plt.plot(fpr, tpr, color='darkorange', lw=2, label=f'ROC curve (AUC = {roc_auc:.4f})')
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve - Gradient Boosting Classifier')
plt.legend(loc='lower right')
plt.grid(True)
plt.show()

#Q27- Train an XGBoost Regressor and tune the learning rate using GridSearchCV

from xgboost import XGBRegressor
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import mean_squared_error

# Load dataset
data = fetch_california_housing()
X = data.data
y = data.target

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Initialize model (other params fixed, tune learning_rate)
xgb = XGBRegressor(n_estimators=100, max_depth=3, random_state=42)

# Define parameter grid for learning_rate
param_grid = {
    'learning_rate': [0.01, 0.05, 0.1, 0.2, 0.3]
}

# Setup GridSearchCV
grid_search = GridSearchCV(
    estimator=xgb,
    param_grid=param_grid,
    scoring='neg_mean_squared_error',  # minimizing MSE
    cv=5,
    verbose=1,
    n_jobs=-1
)

# Run grid search
grid_search.fit(X_train, y_train)

# Best parameter and score
print(f"Best learning_rate: {grid_search.best_params_['learning_rate']}")
print(f"Best CV MSE: {-grid_search.best_score_:.4f}")

# Evaluate on test set using the best model
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)
test_mse = mean_squared_error(y_test, y_pred)
print(f"Test Set Mean Squared Error: {test_mse:.4f}")

#Q28- Train a CatBoost Classifier on an imbalanced dataset and compare performance with class weighting
from catboost import CatBoostClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, f1_score

# Create imbalanced dataset
X, y = make_classification(
    n_samples=5000,
    n_features=20,
    n_informative=2,
    n_redundant=10,
    n_clusters_per_class=1,
    weights=[0.9, 0.1],  # 90% of class 0, 10% of class 1
    flip_y=0,
    random_state=42
)

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Compute class weights (inversely proportional to class frequencies)
from collections import Counter
counter = Counter(y_train)
total = sum(counter.values())
class_weights = {cls: total/count for cls, count in counter.items()}

print("Class weights:", class_weights)

# Model without class weights
model_no_weights = CatBoostClassifier(iterations=100, learning_rate=0.1, depth=4, verbose=0, random_state=42)
model_no_weights.fit(X_train, y_train)
y_pred_no_weights = model_no_weights.predict(X_test)
f1_no_weights = f1_score(y_test, y_pred_no_weights)

# Model with class weights
model_with_weights = CatBoostClassifier(
    iterations=100,
    learning_rate=0.1,
    depth=4,
    class_weights=[class_weights[0], class_weights[1]],
    verbose=0,
    random_state=42
)
model_with_weights.fit(X_train, y_train)
y_pred_with_weights = model_with_weights.predict(X_test)
f1_with_weights = f1_score(y_test, y_pred_with_weights)

# Print results
print(f"F1-Score without class weights: {f1_no_weights:.4f}")
print(f"F1-Score with class weights:    {f1_with_weights:.4f}")

#Q29- Train an AdaBoost Classifier and analyze the effect of different learning rates

from sklearn.ensemble import AdaBoostClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt

# Load dataset
data = load_breast_cancer()
X = data.data
y = data.target

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Different learning rates to test
learning_rates = [0.01, 0.05, 0.1, 0.5, 1, 2]

accuracies = []

for lr in learning_rates:
    model = AdaBoostClassifier(n_estimators=50, learning_rate=lr, random_state=42)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    acc = accuracy_score(y_test, y_pred)
    accuracies.append(acc)
    print(f"Learning Rate: {lr} - Accuracy: {acc:.4f}")

# Plot the results
plt.figure(figsize=(8,5))
plt.plot(learning_rates, accuracies, marker='o', color='teal')
plt.title('AdaBoost Accuracy vs Learning Rate')
plt.xlabel('Learning Rate')
plt.ylabel('Accuracy')
plt.grid(True)
plt.show()

#Q30- Train an XGBoost Classifier for multi-class classification and evaluate using log-loss.

from xgboost import XGBClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import log_loss

# Load Iris dataset (3 classes)
data = load_iris()
X = data.data
y = data.target

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train XGBoost Classifier for multi-class classification
model = XGBClassifier(
    objective='multi:softprob',  # outputs probabilities for each class
    num_class=3,                 # number of classes
    n_estimators=100,
    learning_rate=0.1,
    max_depth=3,
    eval_metric='mlogloss',
    random_state=42
)
model.fit(X_train, y_train)

# Predict class probabilities on test set
y_proba = model.predict_proba(X_test)

# Calculate Log-Loss
logloss = log_loss(y_test, y_proba)
print(f"Multi-class Log-Loss: {logloss:.4f}")
