Q1. Can we use Bagging for regression problems?
Ans. Yes. Bagging (Bootstrap Aggregating) is not limited to classification; it works well for regression too, where averaging many models reduces variance and stabilizes predictions. In scikit-learn it is available as BaggingRegressor.

How Bagging Works for Regression:
It creates multiple bootstrap samples from the training data.
A base regression model (e.g., Decision Tree Regressor) is trained on each sample.
The final prediction is obtained by averaging the predictions from all models.

In [None]:
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Generate sample regression data
X, y = make_regression(n_samples=1000, n_features=5, noise=0.2, random_state=42)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define the base estimator (e.g., Decision Tree Regressor)
base_model = DecisionTreeRegressor()

# Create the Bagging Regressor
# (scikit-learn >= 1.2 names this parameter `estimator`; older versions call it `base_estimator`)
bagging_regressor = BaggingRegressor(estimator=base_model, n_estimators=50, random_state=42)

# Train the model
bagging_regressor.fit(X_train, y_train)

# Predict on test data
y_pred = bagging_regressor.predict(X_test)

# Evaluate model performance
mse = mean_squared_error(y_test, y_pred)
print(f'Mean Squared Error: {mse}')


Q2. What is the difference between multiple model training and single model training?
Ans. Difference Between Single Model Training and Multiple Model Training
1. Definition
Single Model Training: Train one model on the entire dataset.
Multiple Model Training: Train multiple models independently or as an ensemble.
2. Complexity
Single Model: Easier to implement and requires less coordination.
Multiple Models: More complex, requiring additional computation and resources.
3. Training Time
Single Model: Faster as only one model is trained.
Multiple Models: Slower due to training multiple models and aggregating results.
4. Performance & Generalization
Single Model: Can suffer from overfitting (high variance) or underfitting (high bias).
Multiple Models: Typically improves generalization and reduces bias/variance.
5. Robustness to Noise & Outliers
Single Model: More sensitive to noise, may struggle with unstable predictions.
Multiple Models: More stable, as different models compensate for each other's weaknesses.
6. Common Techniques
Single Model: Logistic Regression, Decision Trees, Neural Networks.
Multiple Models:
Bagging (e.g., Random Forest) → Reduces variance.
Boosting (e.g., XGBoost, AdaBoost) → Reduces bias.
Stacking & Voting → Combines multiple models for better performance.
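
The variance argument in point 4 can be illustrated with a small simulation. This is only a sketch under a strong assumption: each "model" is represented by the hypothetical function noisy_model, which returns the true value plus independent Gaussian noise. Real ensemble members are correlated, so the reduction in practice is smaller.

```python
import random
import statistics

random.seed(0)

def noisy_model(true_value, noise_sd=1.0):
    """Stand-in for one trained model: the true value plus independent noise."""
    return random.gauss(true_value, noise_sd)

def ensemble_prediction(true_value, n_models):
    """Average the outputs of n_models independent noisy models."""
    outputs = [noisy_model(true_value) for _ in range(n_models)]
    return statistics.mean(outputs)

true_value = 5.0
single = [noisy_model(true_value) for _ in range(2000)]
averaged = [ensemble_prediction(true_value, n_models=25) for _ in range(2000)]

var_single = statistics.pvariance(single)
var_averaged = statistics.pvariance(averaged)
print(f"variance of a single model:     {var_single:.3f}")
print(f"variance of a 25-model average: {var_averaged:.3f}")
```

With independent errors the variance of the average shrinks roughly in proportion to the number of models, which is the effect Bagging tries to approximate.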

Q3. Explain the concept of feature randomness in Random Forest?
Ans. Feature Randomness in Random Forest
Feature randomness is a key concept in Random Forest, which helps improve generalization and reduce overfitting. It ensures that different trees in the forest use different subsets of features, making the ensemble more diverse and robust.

How Feature Randomness Works in Random Forest?
Random Subset of Features at Each Split:

When training each decision tree in the Random Forest, only a random subset of features (instead of all features) is considered at each split.
This prevents trees from always picking the most dominant feature, leading to more diverse trees in the ensemble.
Controlled by max_features Parameter:

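In sklearn.ensemble.RandomForestClassifier and RandomForestRegressor, the max_features parameter controls feature randomness:
For Classification (RandomForestClassifier):
Default: "sqrt", i.e. sqrt(n_features) candidate features per split.
For Regression (RandomForestRegressor):
Default: 1.0 in recent scikit-learn versions, i.e. all features are considered at each split. A value around n_features / 3 is a common heuristic from the Random Forest literature, but it has to be set explicitly.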
Effect on Model Performance:

Less Feature Randomness (max_features = total_features)
Every split can choose among all features, so trees differ only through their bootstrap samples → Less diversity, more correlation among trees.
Averaging then removes less variance, which can lead to overfitting (especially if a few features dominate).
More Feature Randomness (max_features is small)
Trees become more diverse → Reduces overfitting.
However, if max_features is too low, individual trees may become too weak.
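
A minimal sketch of the per-split feature sampling described above. The helper candidate_features is hypothetical and written only for illustration; it is not scikit-learn's internal code, and it covers only the "sqrt" option and integer values of max_features.

```python
import math
import random

def candidate_features(n_features, max_features, rng):
    """Return the feature indices that one split is allowed to consider."""
    if max_features == "sqrt":
        k = max(1, int(math.sqrt(n_features)))
    else:
        k = int(max_features)
    # Sample k distinct feature indices without replacement.
    return sorted(rng.sample(range(n_features), k))

rng = random.Random(42)
n_features = 16
for split in range(3):
    subset = candidate_features(n_features, "sqrt", rng)
    print(f"split {split}: candidate features {subset}")
```

Each split draws a fresh subset, so a dominant feature is unavailable at many splits and other features get a chance to be used.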


Q4. What is OOB (Out-of-Bag) Score?
Ans. OOB (Out-of-Bag) Score in Random Forest
The Out-of-Bag (OOB) Score is a performance metric used in Random Forest to estimate model accuracy without using a separate validation set. It helps in evaluating how well the model generalizes to unseen data.

How OOB Score Works?
Bootstrap Sampling:

Random Forest uses Bootstrap Aggregation (Bagging), where each decision tree is trained on a random subset of the dataset (with replacement).
Some data points are not included in this training sample—these are called Out-of-Bag (OOB) samples.
Prediction on OOB Samples:

Since each tree in the forest does not see certain data points, those OOB samples can be used as a validation set for that tree.
The final OOB prediction for a sample is obtained by averaging (regression) or majority voting (classification) over all trees where the sample was OOB.
Computing the OOB Score:

The OOB score is simply the accuracy (classification) or R² score (regression) calculated from OOB predictions.
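
The bootstrap step can be sketched in plain Python to show how many samples end up out-of-bag. The helper names bootstrap_indices and oob_indices are assumptions made for this illustration. For n samples, the chance that a given sample is never drawn is (1 - 1/n)^n, which approaches 1/e (about 0.368) as n grows.

```python
import random

def bootstrap_indices(n_samples, rng):
    """Draw n_samples indices with replacement (one bootstrap sample)."""
    return [rng.randrange(n_samples) for _ in range(n_samples)]

def oob_indices(n_samples, in_bag):
    """Indices that were never drawn: the out-of-bag samples for this tree."""
    in_bag_set = set(in_bag)
    return [i for i in range(n_samples) if i not in in_bag_set]

rng = random.Random(0)
n_samples = 10_000
in_bag = bootstrap_indices(n_samples, rng)
oob = oob_indices(n_samples, in_bag)

oob_fraction = len(oob) / n_samples
expected = (1 - 1 / n_samples) ** n_samples
print(f"observed OOB fraction: {oob_fraction:.3f}")
print(f"expected OOB fraction: {expected:.3f}")
```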

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Load dataset
X, y = load_iris(return_X_y=True)

# Train Random Forest with OOB score enabled
rf = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=42)
rf.fit(X, y)  # No need for a separate train-test split

# Print OOB Score
print("OOB Score:", rf.oob_score_)


Q5. How can you measure the importance of features in a Random Forest model?
Ans. Measuring Feature Importance in a Random Forest Model
Feature importance in Random Forest helps us understand which features have the most impact on predictions. It allows for better interpretability, feature selection, and model optimization.

Methods to Measure Feature Importance in Random Forest
Mean Decrease in Impurity (MDI) – Gini Importance
Permutation Importance (Mean Decrease in Accuracy)
SHAP (SHapley Additive exPlanations) Values
1. Mean Decrease in Impurity (MDI) – Gini Importance
Each decision tree in the Random Forest splits nodes based on the feature that reduces impurity (e.g., Gini index or entropy) the most.
The higher the impurity reduction, the more important the feature.
The Random Forest model averages these impurity reductions across all trees to assign importance scores.

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris
import pandas as pd

# Load dataset
X, y = load_iris(return_X_y=True)
feature_names = load_iris().feature_names

# Train Random Forest
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X, y)

# Get Feature Importance
importance = rf.feature_importances_

# Display in DataFrame
importance_df = pd.DataFrame({'Feature': feature_names, 'Importance': importance})
importance_df = importance_df.sort_values(by='Importance', ascending=False)
print(importance_df)
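
To make the impurity reduction behind MDI concrete, here is a small hand-rolled sketch for a single split. The functions gini and impurity_decrease are written for this example only. The full MDI score additionally weights each split by the fraction of samples reaching the node and averages over all trees, which this sketch leaves out.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def impurity_decrease(parent, left, right):
    """Parent impurity minus the size-weighted impurity of the two children."""
    n = len(parent)
    weighted_children = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(parent) - weighted_children

parent = ["a"] * 5 + ["b"] * 5

# A split that separates the classes perfectly.
perfect = impurity_decrease(parent, ["a"] * 5, ["b"] * 5)

# A split that barely changes the class mix.
poor = impurity_decrease(parent, ["a"] * 3 + ["b"] * 2, ["a"] * 2 + ["b"] * 3)

print(f"decrease for the perfect split: {perfect:.2f}")
print(f"decrease for the poor split:    {poor:.2f}")
```

A feature that is repeatedly chosen for splits like the first one accumulates a high importance score.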


2. Permutation Importance (Mean Decrease in Accuracy)
Shuffles each feature randomly and checks how much the model performance (accuracy or RMSE) drops.
If a feature is important, shuffling it will significantly lower accuracy.
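Less biased toward high-cardinality features than MDI and usable with any fitted model, but correlated features can share or mask each other's importance. It is best computed on held-out data.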

In [None]:
from sklearn.inspection import permutation_importance

# Compute Permutation Importance
perm_importance = permutation_importance(rf, X, y, n_repeats=10, random_state=42)

# Display in DataFrame
perm_importance_df = pd.DataFrame({'Feature': feature_names, 'Importance': perm_importance.importances_mean})
perm_importance_df = perm_importance_df.sort_values(by='Importance', ascending=False)
print(perm_importance_df)
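
The shuffling idea can also be written out by hand. The sketch below is a simplified illustration, not scikit-learn's implementation: permutation_importance_sketch is a hypothetical helper, the "model" is a plain function, and importance is measured as the increase in mean squared error.

```python
import random

def mse(model, X, y):
    """Mean squared error of a model (a callable on one row) over a dataset."""
    return sum((model(row) - target) ** 2 for row, target in zip(X, y)) / len(y)

def permutation_importance_sketch(model, X, y, n_repeats=5, seed=0):
    """Average increase in MSE after shuffling each column in turn."""
    rng = random.Random(seed)
    baseline = mse(model, X, y)
    importances = []
    for col in range(len(X[0])):
        increases = []
        for _ in range(n_repeats):
            shuffled = [row[col] for row in X]
            rng.shuffle(shuffled)
            # Rebuild the dataset with only this one column permuted.
            X_perm = [row[:col] + [value] + row[col + 1:]
                      for row, value in zip(X, shuffled)]
            increases.append(mse(model, X_perm, y) - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances

# Toy data: the target depends only on feature 0; feature 1 is pure noise.
data_rng = random.Random(1)
X_toy = [[data_rng.uniform(-1, 1), data_rng.uniform(-1, 1)] for _ in range(200)]
y_toy = [3.0 * row[0] for row in X_toy]

def toy_model(row):
    """A 'fitted' model that uses feature 0 and ignores feature 1."""
    return 3.0 * row[0]

toy_importances = permutation_importance_sketch(toy_model, X_toy, y_toy)
print("importance of feature 0:", round(toy_importances[0], 3))
print("importance of feature 1:", round(toy_importances[1], 3))
```

Shuffling the feature the model relies on raises the error, while shuffling the ignored feature leaves it unchanged.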


Q6. Explain the working principle of a Bagging Classifier?
Ans. Working Principle of a Bagging Classifier
A Bagging Classifier (Bootstrap Aggregating) is an ensemble learning method that improves model stability and accuracy by reducing variance. It trains multiple base models on different bootstrap samples of the data and combines their predictions.

Steps in the Working of a Bagging Classifier:
Bootstrap Sampling (Random Sampling with Replacement):

The dataset is randomly sampled with replacement to create multiple bootstrap samples.
Each bootstrap sample has the same size as the original dataset but may contain duplicate samples.
Some original data points may not be included in a bootstrap sample.
Train Multiple Base Models (Weak Learners):

A base model (e.g., Decision Tree, SVM, etc.) is trained on each bootstrap sample independently.
Each model learns slightly different patterns due to data variation.
Aggregation of Predictions (Majority Voting or Averaging):

For Classification:
Each model makes a prediction.
The majority vote (most frequent class label) is chosen as the final prediction.
For Regression:
The final prediction is obtained by averaging all model predictions.


In [None]:
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load dataset
X, y = load_iris(return_X_y=True)

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create a base model (weak learner)
base_model = DecisionTreeClassifier()

# Create a Bagging Classifier with multiple decision trees
# (scikit-learn >= 1.2 names this parameter `estimator`; older versions call it `base_estimator`)
bagging_clf = BaggingClassifier(estimator=base_model, n_estimators=50, random_state=42)

# Train the Bagging Classifier
bagging_clf.fit(X_train, y_train)

# Make predictions
y_pred = bagging_clf.predict(X_test)

# Evaluate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Bagging Classifier Accuracy: {accuracy:.4f}")


Q7. How do you evaluate a Bagging Classifier’s performance?
Ans. Evaluating a Bagging Classifier’s Performance
To assess the effectiveness of a Bagging Classifier, we use various performance metrics and validation techniques.

1. Accuracy (for Classification Problems)
Measures the percentage of correctly classified instances.
Works best for balanced datasets.

In [None]:
from sklearn.metrics import accuracy_score

# Compute Accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.4f}")


2. Precision, Recall, and F1-Score (for Imbalanced Data)
Precision: How many of the predicted positives are actually positive?
Recall: How many actual positives were correctly predicted?
F1-Score: Harmonic mean of Precision and Recall (useful for imbalanced datasets).

In [None]:
from sklearn.metrics import classification_report

# Compute Classification Report
print(classification_report(y_test, y_pred))
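
The three metrics above can be computed by hand from the counts of true positives, false positives and false negatives. The function precision_recall_f1 and the toy labels below are made up for illustration. They show how accuracy can look acceptable on imbalanced data while recall is poor.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Toy labels: 4 positives and 6 negatives.
toy_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
toy_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

toy_accuracy = sum(t == p for t, p in zip(toy_true, toy_pred)) / len(toy_true)
toy_precision, toy_recall, toy_f1 = precision_recall_f1(toy_true, toy_pred)

print(f"accuracy:  {toy_accuracy:.3f}")
print(f"precision: {toy_precision:.3f}")
print(f"recall:    {toy_recall:.3f}")
print(f"f1:        {toy_f1:.3f}")
```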


3. ROC-AUC Score (for Binary Classification)
Receiver Operating Characteristic (ROC) Curve evaluates model performance at different thresholds.
AUC (Area Under the Curve): Measures how well the model separates classes.

In [None]:
from sklearn.metrics import roc_auc_score

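# Compute AUC Score
# Binary case: roc_auc_score(y_test, bagging_clf.predict_proba(X_test)[:, 1])
# The Iris labels used here have three classes, so use one-vs-rest AUC instead
auc_score = roc_auc_score(y_test, bagging_clf.predict_proba(X_test), multi_class='ovr')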
print(f"AUC Score: {auc_score:.4f}")
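
For the binary case, AUC has a simple interpretation: it is the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one. The sketch below computes it by comparing every positive/negative pair. The helper auc_by_pairs is written for this example and is only practical for small inputs.

```python
def auc_by_pairs(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))

toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_auc = auc_by_pairs(toy_labels, toy_scores)
print(f"AUC: {toy_auc:.2f}")  # 3 of the 4 pairs are ranked correctly
```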


4. Cross-Validation (for More Reliable Evaluation)
Splits data into multiple train-test folds to evaluate stability.
Reduces bias due to random train-test splits.

In [None]:
from sklearn.model_selection import cross_val_score

# Perform 5-Fold Cross-Validation
cv_scores = cross_val_score(bagging_clf, X, y, cv=5, scoring='accuracy')
print(f"Cross-Validation Accuracy: {cv_scores.mean():.4f} ± {cv_scores.std():.4f}")
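
How the folds are formed can be sketched without any library. The generator k_fold_indices below is a simplified illustration: it produces plain, unshuffled folds. Note that for classifiers, cross_val_score with an integer cv uses stratified folds, which additionally keep the class proportions similar in every fold.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for k consecutive folds."""
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

folds = list(k_fold_indices(10, 3))
for fold_number, (train, test) in enumerate(folds):
    print(f"fold {fold_number}: train={train} test={test}")
```

Every sample appears in exactly one test fold, so each one is used for evaluation once and for training k - 1 times.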


Q8. How does a Bagging Regressor work?
Ans. Working of a Bagging Regressor
A Bagging Regressor is an ensemble learning technique that combines multiple regression models (weak learners) to improve prediction accuracy and reduce overfitting. It follows the Bootstrap Aggregation (Bagging) principle to create an ensemble of models.

Steps in the Working of a Bagging Regressor
Bootstrap Sampling (Random Sampling with Replacement)

The dataset is randomly sampled with replacement to create multiple bootstrap samples.
Each sample has the same size as the original dataset but may contain duplicate values.
Training Multiple Base Regressors

A base regressor (e.g., Decision Tree Regressor, Linear Regression, etc.) is trained on each bootstrap sample.
Each regressor learns a slightly different relationship from the data due to variation in training samples.
Aggregating Predictions (Averaging Output)

The final regression prediction is obtained by averaging the predictions from all base models.
This helps in reducing variance and improving generalization.
$$\hat{y} = \frac{1}{n} \sum_{i=1}^{n} f_i(X)$$
where $f_i(X)$ is the prediction from the $i$-th base regressor and $n$ is the number of base regressors.

In [None]:
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

# Generate synthetic regression data
X, y = make_regression(n_samples=1000, n_features=5, noise=0.2, random_state=42)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create a base regressor (weak learner)
base_regressor = DecisionTreeRegressor()

# Create a Bagging Regressor
# (scikit-learn >= 1.2 names this parameter `estimator`; older versions call it `base_estimator`)
bagging_regressor = BaggingRegressor(estimator=base_regressor, n_estimators=50, random_state=42)

# Train the model
bagging_regressor.fit(X_train, y_train)

# Make predictions
y_pred = bagging_regressor.predict(X_test)

# Evaluate Performance
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"Mean Squared Error: {mse:.4f}")
print(f"R² Score: {r2:.4f}")
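
The three steps (bootstrap, train, average) can also be written from scratch. The sketch below is an illustration under simplifying assumptions, not how BaggingRegressor is implemented: the data are one-dimensional, the true relationship is assumed to be y = 2x plus noise, and the base learner is a 1-nearest-neighbour regressor chosen because it is simple and has high variance. The function names are made up for this example.

```python
import random

def fit_1nn(points):
    """'Train' a 1-nearest-neighbour regressor: it memorises (x, y) pairs."""
    def predict(x):
        nearest = min(points, key=lambda pair: abs(pair[0] - x))
        return nearest[1]
    return predict

def fit_bagged(points, n_estimators, rng):
    """Fit one base regressor per bootstrap sample and average their outputs."""
    models = []
    for _ in range(n_estimators):
        bootstrap = [rng.choice(points) for _ in range(len(points))]
        models.append(fit_1nn(bootstrap))
    def predict(x):
        return sum(model(x) for model in models) / len(models)
    return predict

def mse_against_truth(predict, xs):
    """Mean squared error against the noise-free target 2 * x."""
    return sum((predict(x) - 2.0 * x) ** 2 for x in xs) / len(xs)

rng = random.Random(0)
train_points = []
for _ in range(300):
    x = rng.uniform(0.0, 10.0)
    train_points.append((x, 2.0 * x + rng.gauss(0.0, 1.0)))
test_xs = [i / 10 for i in range(1, 100)]

single_model = fit_1nn(train_points)
bagged_model = fit_bagged(train_points, n_estimators=30, rng=rng)

mse_single = mse_against_truth(single_model, test_xs)
mse_bagged = mse_against_truth(bagged_model, test_xs)
print(f"single 1-NN MSE: {mse_single:.3f}")
print(f"bagged 1-NN MSE: {mse_bagged:.3f}")
```

The single model reproduces the noise of whichever training point is nearest; the bagged average spreads the prediction over several nearby points, which lowers the variance.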


Q9. What is the main advantage of ensemble techniques?
Ans. Main Advantages of Ensemble Techniques
Ensemble techniques combine multiple models to improve performance, making predictions more accurate, stable, and generalizable. Here are the key advantages:

1. Higher Accuracy
By aggregating predictions from multiple models, ensembles reduce individual model errors.
Example: Random Forest (an ensemble of decision trees) usually performs better than a single decision tree, although on a small, easy dataset such as Iris the two can score the same.

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load dataset
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train Decision Tree
dt = DecisionTreeClassifier()
dt.fit(X_train, y_train)
y_pred_dt = dt.predict(X_test)

# Train Random Forest (Ensemble)
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
y_pred_rf = rf.predict(X_test)

# Compare Accuracy
print(f"Decision Tree Accuracy: {accuracy_score(y_test, y_pred_dt):.4f}")
print(f"Random Forest Accuracy: {accuracy_score(y_test, y_pred_rf):.4f}")


2. Reduces Overfitting
Averaging multiple models reduces overfitting, because the errors of the individual models partly cancel out.
Bagging methods (like Random Forest) reduce variance by training on different subsets of data.
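3. Works Well with Noisy Data
Averaging over many models (as in Bagging) dampens the effect of noisy samples on the final prediction.
Boosting methods (like AdaBoost, Gradient Boosting) focus on hard-to-predict samples, which lowers bias but can make them sensitive to label noise unless they are regularized.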
4. Improves Generalization (Better Performance on Unseen Data)
Since ensemble models learn diverse patterns, they generalize better than single models.
This is useful for complex real-world datasets where a single model may fail.
5. Handles Bias-Variance Tradeoff Efficiently
Bagging reduces variance (useful for high-variance models like Decision Trees).
Boosting reduces bias (useful for underfitting models like Logistic Regression).

Q10. What is the main challenge of ensemble methods?
Ans. Main Challenges of Ensemble Methods
While ensemble methods improve accuracy and generalization, they also come with several challenges:

1. Increased Computational Cost
Training multiple models requires more time and computational power.
Example: A Random Forest with 100 trees takes longer to train than a single Decision Tree.
Solution: Use parallel processing (e.g., n_jobs=-1 in scikit-learn's Random Forest) and optimized libraries such as XGBoost for faster training.

2. Complexity in Interpretation
Individual models like Decision Trees are easy to interpret, but ensembles (e.g., Random Forest, Gradient Boosting) act as black boxes.
It is harder to explain why a model made a specific prediction.
Solution: Use feature importance (feature_importances_ in Random Forest) or tools like SHAP (SHapley Additive Explanations) for interpretability.

In [None]:
from sklearn.ensemble import RandomForestClassifier
import matplotlib.pyplot as plt

# Train a Random Forest model
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

# Plot feature importance
importances = rf.feature_importances_
plt.bar(range(len(importances)), importances)
plt.xlabel("Feature Index")
plt.ylabel("Importance Score")
plt.title("Feature Importance in Random Forest")
plt.show()


3. Risk of Overfitting in Boosting
Boosting methods (AdaBoost, Gradient Boosting, XGBoost) can overfit if trained with too many estimators.
Unlike Bagging (which reduces variance), Boosting focuses on hard-to-learn samples, sometimes learning noise instead of patterns.
Solution: Use early stopping to prevent overfitting.

In [None]:
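from xgboost import XGBClassifier

# Train XGBoost with early stopping.
# eval_set and verbose are arguments of fit(), not of the constructor;
# passing early_stopping_rounds to the constructor assumes xgboost >= 1.6.
# "mlogloss" is used because the Iris labels above have three classes.
# Ideally, stop on a separate validation split rather than the test set.
xgb = XGBClassifier(n_estimators=1000, early_stopping_rounds=50, eval_metric="mlogloss")
xgb.fit(X_train, y_train, eval_set=[(X_test, y_test)], verbose=False)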
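
The early-stopping rule itself is simple: keep training while the validation loss improves, and stop once it has not improved for a fixed number of rounds. The sketch below is a generic illustration with made-up loss values. The function early_stopping_round is hypothetical and is not XGBoost's implementation.

```python
def early_stopping_round(val_losses, patience):
    """Return (best_round, best_loss), stopping after `patience` rounds
    without improvement in the validation loss."""
    best_loss = float("inf")
    best_round = -1
    for round_idx, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_round = loss, round_idx
        elif round_idx - best_round >= patience:
            break  # no improvement for `patience` rounds
    return best_round, best_loss

# Made-up validation losses: they improve, then drift upward (overfitting).
toy_val_losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.51, 0.53, 0.56, 0.60]
best_round, best_loss = early_stopping_round(toy_val_losses, patience=3)
print(f"best round: {best_round}, best validation loss: {best_loss}")
```

Training stops at round 6 here, and the model from round 3 (the lowest validation loss) is the one to keep.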


4. Requires More Data for Best Performance
Ensemble methods work best with large datasets.
Small datasets may not benefit much, and simple models like Logistic Regression might perform just as well.
Solution: If data is limited, evaluate with cross-validation and consider a simpler single model, or prefer Bagging over Boosting.

5. Harder to Tune Hyperparameters
Ensembles have more hyperparameters than single models.
Example: Random Forest requires tuning n_estimators, max_depth, min_samples_split, etc.
Example: Gradient Boosting has learning_rate, n_estimators, max_depth, etc.
Solution: Use Grid Search or Random Search for tuning.

In [None]:
from sklearn.model_selection import GridSearchCV

param_grid = {
    "n_estimators": [50, 100, 200],
    "max_depth": [None, 10, 20],
    "min_samples_split": [2, 5, 10]
}

grid_search = GridSearchCV(RandomForestClassifier(), param_grid, cv=5)
grid_search.fit(X_train, y_train)

print(f"Best Parameters: {grid_search.best_params_}")


Q11. Explain the key idea behind ensemble techniques?
Ans. Key Idea Behind Ensemble Techniques
Ensemble techniques combine multiple models to improve overall performance. Instead of relying on a single model, ensemble methods aggregate predictions from multiple models to reduce errors, increase accuracy, and improve generalization.

Key Principles of Ensemble Learning
Diversity of Models

Different models make different types of errors. When those errors are not strongly correlated, combining the models lets them partly cancel out, improving overall accuracy.
Example: Using Decision Trees, Support Vector Machines (SVMs), and Neural Networks together in an ensemble.
Reducing Variance (Bagging)

Bagging (Bootstrap Aggregating) reduces variance by training multiple models on different random subsets of data and averaging their predictions.
Example: Random Forest is a bagging technique that combines multiple Decision Trees.
Reducing Bias (Boosting)

Boosting reduces bias by training models sequentially, where each new model focuses on correcting the errors made by the previous model.
Example: Gradient Boosting, AdaBoost, and XGBoost.
Combining Weak Learners to Form a Strong Learner

A weak learner is a model that performs slightly better than random guessing.
Combining multiple weak learners results in a highly accurate model.
Example: Decision Stumps (one-level Decision Trees) in AdaBoost.
Aggregating Predictions for Stability

Different ensemble methods aggregate predictions differently:
Averaging (Regression Models) → Bagging, Random Forest.
Majority Voting (Classification Models) → Voting Classifier.
Weighted Sum (Boosting Models) → AdaBoost, Gradient Boosting.
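
The weighted-sum aggregation can be sketched for the binary case. The example assumes the classic binary AdaBoost formulation, in which a learner with weighted error rate err receives the weight alpha = 0.5 * ln((1 - err) / err) and labels are encoded as +1 and -1. The error rates and predictions below are made up.

```python
import math

def learner_weight(error_rate):
    """Classic binary AdaBoost weight: alpha = 0.5 * ln((1 - err) / err)."""
    return 0.5 * math.log((1.0 - error_rate) / error_rate)

def weighted_vote(predictions, alphas):
    """Combine +1/-1 predictions by the sign of their alpha-weighted sum."""
    score = sum(alpha * pred for alpha, pred in zip(alphas, predictions))
    return 1 if score >= 0 else -1

# Three weak learners: one fairly accurate, two barely better than chance.
toy_errors = [0.10, 0.40, 0.45]
toy_alphas = [learner_weight(err) for err in toy_errors]

# The accurate learner says +1, the two weak ones say -1.
toy_predictions = [+1, -1, -1]
weighted_result = weighted_vote(toy_predictions, toy_alphas)
majority_result = 1 if sum(toy_predictions) >= 0 else -1

print("alphas:", [round(alpha, 3) for alpha in toy_alphas])
print("weighted vote:", weighted_result)
print("simple majority vote:", majority_result)
```

The accurate learner outweighs the other two, so the weighted vote differs from a simple majority vote.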


In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.datasets import load_iris

# Load dataset
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a Random Forest (Bagging)
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

# Predict and evaluate
y_pred = rf.predict(X_test)
print("Random Forest Accuracy:", accuracy_score(y_test, y_pred))


In [None]:
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

# Base model (weak learner)
base_model = DecisionTreeClassifier(max_depth=1)

# Train an AdaBoost model
# (scikit-learn >= 1.2 names this parameter `estimator`; older versions call it `base_estimator`)
ada = AdaBoostClassifier(estimator=base_model, n_estimators=50, random_state=42)
ada.fit(X_train, y_train)

# Predict and evaluate
y_pred = ada.predict(X_test)
print("AdaBoost Accuracy:", accuracy_score(y_test, y_pred))


Q12. What is a Random Forest Classifier?
Ans. Random Forest Classifier
A Random Forest Classifier is an ensemble learning method that combines multiple Decision Trees to improve accuracy, reduce overfitting, and enhance generalization. It is a bagging-based technique, meaning it trains multiple models on different subsets of the data and aggregates their predictions.

Key Features of Random Forest Classifier
Ensemble of Decision Trees

Random Forest consists of multiple Decision Trees, each trained on a different subset of data.
The final prediction is made using majority voting (for classification) or averaging (for regression).
Feature Randomness (Random Feature Selection at Each Split)

At each split, a tree considers only a random subset of features, preventing trees from being too similar.
This increases diversity among models and improves performance.
Handles Overfitting

Unlike a single Decision Tree, which can overfit, Random Forest reduces overfitting by averaging multiple trees.
Handling of Missing Data

Averaging over trees does not by itself handle missing values. Some implementations accept them natively; with others, missing values must be imputed before training.
Can Handle Large Datasets

Suitable for high-dimensional data with many features and samples.
How Random Forest Works
Bootstrap Sampling (Bagging)

Randomly select multiple subsets of the training data with replacement.
Each subset is used to train a separate Decision Tree.
Random Feature Selection

At each split, only a random subset of features is considered instead of all features.
Helps create diverse trees and reduces correlation among them.
Decision Tree Training

Each tree is trained independently on its subset of data.
The trees are not pruned, meaning they grow fully.
Majority Voting (Classification) or Averaging (Regression)

For classification, each tree votes for a class, and the most common class is chosen.
For regression, the final prediction is the average of all trees.

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load dataset
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train Random Forest Classifier
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

# Predict and evaluate
y_pred = rf.predict(X_test)
print("Random Forest Accuracy:", accuracy_score(y_test, y_pred))


Q13. What are the main types of ensemble techniques?
Ans. Main Types of Ensemble Techniques
Ensemble techniques combine multiple models to improve accuracy, reduce variance, and enhance generalization. The main types of ensemble methods are:

1. Bagging (Bootstrap Aggregating)
Key Idea:

Reduces variance by training multiple models on different random subsets of the dataset.
Each model is trained independently in parallel, and predictions are combined (averaging for regression, majority voting for classification).
Example Algorithms:

Random Forest (combines multiple Decision Trees).
Bagging Classifier (wrapper for any base model).

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.datasets import load_iris

# Load dataset
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a Random Forest model (Bagging)
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

# Predict and evaluate
y_pred = rf.predict(X_test)
print("Random Forest Accuracy:", accuracy_score(y_test, y_pred))


2. Boosting
Key Idea:

Reduces bias by sequentially training weak models, where each model corrects errors of the previous one.
Models are trained sequentially, unlike bagging, which trains them in parallel.
Example Algorithms:

AdaBoost (Adaptive Boosting)
Gradient Boosting (GBM, XGBoost, LightGBM, CatBoost)

In [None]:
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

# Base model (weak learner)
base_model = DecisionTreeClassifier(max_depth=1)

# Train an AdaBoost model
# (scikit-learn >= 1.2 names this parameter `estimator`; older versions call it `base_estimator`)
ada = AdaBoostClassifier(estimator=base_model, n_estimators=50, random_state=42)
ada.fit(X_train, y_train)

# Predict and evaluate
y_pred = ada.predict(X_test)
print("AdaBoost Accuracy:", accuracy_score(y_test, y_pred))


3. Stacking (Stacked Generalization)
Key Idea:

Combines predictions from multiple base models using a meta-model (final model) to make the final prediction.
Each base model makes predictions, and these predictions are used as input features for the meta-model.
Example Algorithms:

Stacking Classifier (combining SVM, Decision Trees, Neural Networks, etc.)
Stacking Regressor (for regression tasks)

In [None]:
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Define base models
base_models = [
    ('svm', SVC(probability=True)),
    ('tree', DecisionTreeClassifier())
]

# Meta-model
meta_model = LogisticRegression()

# Stacking Classifier
stack = StackingClassifier(estimators=base_models, final_estimator=meta_model)

# Train and evaluate
stack.fit(X_train, y_train)
y_pred = stack.predict(X_test)
print("Stacking Classifier Accuracy:", accuracy_score(y_test, y_pred))


Q14. What is ensemble learning in machine learning?
Ans. Ensemble Learning in Machine Learning
Definition
Ensemble learning is a technique in machine learning where multiple models (weak learners) are combined to produce a stronger and more accurate predictive model. The goal is to improve performance by reducing errors, increasing accuracy, and enhancing generalization.

Why Use Ensemble Learning?
Reduces Overfitting (Variance Reduction) – Combining multiple models prevents over-reliance on a single model, making the final prediction more stable.
Increases Accuracy – Multiple models working together often perform better than any single one of them.
Improves Generalization – Works well on unseen data by reducing bias and variance.
Handles Noisy Data – Since multiple models contribute, the impact of noise is minimized.
Types of Ensemble Learning Techniques
1. Bagging (Bootstrap Aggregating)
Reduces variance by training multiple models on random subsets of data (with replacement).
Models are trained independently in parallel, and final predictions are combined (averaging for regression, majority voting for classification).
Example Algorithm: Random Forest
2. Boosting
Reduces bias by training models sequentially, where each new model focuses on correcting the errors of the previous one.
Example Algorithms: AdaBoost, Gradient Boosting, XGBoost
3. Stacking (Stacked Generalization)
Uses multiple base models and a meta-model to make final predictions.
Unlike bagging and boosting, stacking learns how to combine models optimally.
Example Algorithm: Stacking Classifier
4. Voting Ensemble
Combines multiple models and selects the final output using:
Hard Voting – Chooses the most common class label.
Soft Voting – Uses probability-weighted predictions.
Example Algorithm: Voting Classifier
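
Hard and soft voting can disagree, as the following sketch shows. The functions hard_vote and soft_vote and the class probabilities are made up for illustration. Soft voting here takes a plain average of the predicted probabilities, with equal weight for every model.

```python
from collections import Counter

def hard_vote(probas, classes):
    """Each model votes for its most probable class; the majority wins."""
    votes = [classes[p.index(max(p))] for p in probas]
    return Counter(votes).most_common(1)[0][0]

def soft_vote(probas, classes):
    """Average the class probabilities and pick the class with the highest mean."""
    n_models = len(probas)
    averaged = [sum(p[i] for p in probas) / n_models for i in range(len(classes))]
    return classes[averaged.index(max(averaged))]

toy_classes = ["cat", "dog"]
# Two models lean slightly towards "dog"; one is very confident in "cat".
toy_probas = [
    [0.45, 0.55],
    [0.45, 0.55],
    [0.95, 0.05],
]

hard_result = hard_vote(toy_probas, toy_classes)
soft_result = soft_vote(toy_probas, toy_classes)
print("hard voting:", hard_result)
print("soft voting:", soft_result)
```

Hard voting counts only the winning labels, while soft voting also takes each model's confidence into account.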

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load dataset
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train Random Forest (Bagging)
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

# Predict and evaluate
y_pred = rf.predict(X_test)
print("Random Forest Accuracy:", accuracy_score(y_test, y_pred))


In [None]:
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

# Base model (weak learner)
base_model = DecisionTreeClassifier(max_depth=1)

# Train AdaBoost model
# (scikit-learn >= 1.2 names this parameter `estimator`; older versions call it `base_estimator`)
ada = AdaBoostClassifier(estimator=base_model, n_estimators=50, random_state=42)
ada.fit(X_train, y_train)

# Predict and evaluate
y_pred = ada.predict(X_test)
print("AdaBoost Accuracy:", accuracy_score(y_test, y_pred))


Q15. When should we avoid using ensemble methods?
Ans. When to Avoid Using Ensemble Methods
While ensemble methods improve accuracy and generalization, they are not always the best choice. Here are situations where you should avoid using them:

1. When a Single Model is Sufficient
If a simple model (like Logistic Regression or a single Decision Tree) provides good accuracy, using an ensemble is unnecessary.
Example: If a dataset is linearly separable, Logistic Regression or SVM may perform well without needing ensembles.
2. When Interpretability is Important
Ensemble models like Random Forest, XGBoost, and Stacking are complex and hard to interpret.
Example: In healthcare or finance, where model decisions impact real lives, decision trees or linear models may be preferred for transparency.
3. When Computational Resources are Limited
Ensemble models (especially Boosting and Stacking) require high memory and CPU/GPU power.
Example: If running on edge devices (like IoT sensors or mobile phones), simpler models are more efficient.
4. When the Training Data is Small
If the dataset is too small, ensemble methods can overfit instead of improving performance.
Example: Training a Random Forest with 100 trees on a dataset with only 100 samples may lead to overfitting.
5. When Prediction Speed is Crucial
Some ensemble methods (like Boosting and Stacking) have slow inference times because they need to evaluate multiple models.
Example: Real-time applications like autonomous driving or fraud detection require fast predictions, making ensembles less suitable.
6. When the Problem is Not Complex
If a dataset is well-structured with clear patterns, a single model like SVM, Logistic Regression, or Decision Tree may work just as well.
Example: A spam detection system with simple word frequency counts may perform well with Naïve Bayes instead of an ensemble.
7. When Ensemble Performance Gains Are Minimal
If ensemble learning improves accuracy by only 1-2%, but increases training time significantly, it may not be worth using.
Example: If a Decision Tree gives 94% accuracy and Random Forest gives 95%, the improvement may not justify the extra complexity.

Q16. How does Bagging help in reducing overfitting?
Ans. Bagging (Bootstrap Aggregating) is an ensemble learning technique that helps reduce overfitting by decreasing model variance. It works by training multiple instances of the same model on different random subsets of the dataset and then aggregating their predictions.

How Bagging Reduces Overfitting?
1. Introduces Data Variability (Bootstrap Sampling)
Bagging creates multiple training datasets by randomly sampling data with replacement.
Each model trains on a slightly different dataset, reducing dependence on any particular sample.
This prevents overfitting to the noise in the original data.
2. Reduces Model Variance
Overfitting happens when a model is too sensitive to minor variations in data.
Since bagging averages predictions from multiple models, it smooths out extreme variations and reduces overfitting.
3. Reduces the Impact of Outliers
If a dataset has outliers, a single model (like a Decision Tree) may get misled by them.
In Bagging, since each model sees only a subset of the data, outliers affect only a few models, preventing overfitting.
4. Creates Independent Decision Boundaries
Overfitting occurs when a model memorizes training data instead of generalizing.
Bagging ensures that each model learns slightly different decision boundaries, leading to a more generalized ensemble model.
5. Aggregates Predictions to Improve Stability
Bagging averages predictions (for regression) or uses majority voting (for classification).
This aggregation ensures that individual models’ errors do not dominate, leading to a more stable and robust model.
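
The variance-reduction argument above can be checked empirically. The following is a minimal sketch, assuming a synthetic regression dataset and arbitrary sizes (300 samples, 20 resampled training sets, 25 trees per ensemble): it refits a single tree and a bagged ensemble on resampled training sets and compares how much their predictions change from one resample to the next.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (an assumption for illustration)
X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)
X_train, y_train = X[:200], y[:200]
X_test = X[200:]

rng = np.random.RandomState(0)
tree_preds, bag_preds = [], []

# Refit both models on 20 resampled versions of the training set
for _ in range(20):
    idx = rng.randint(0, len(X_train), len(X_train))
    Xb, yb = X_train[idx], y_train[idx]

    tree = DecisionTreeRegressor(random_state=0).fit(Xb, yb)
    bag = BaggingRegressor(DecisionTreeRegressor(), n_estimators=25, random_state=0).fit(Xb, yb)

    tree_preds.append(tree.predict(X_test))
    bag_preds.append(bag.predict(X_test))

# Variance of the prediction at each test point across refits, averaged over test points
tree_var = np.var(tree_preds, axis=0).mean()
bag_var = np.var(bag_preds, axis=0).mean()

print(f"Single tree prediction variance:  {tree_var:.2f}")
print(f"Bagged trees prediction variance: {bag_var:.2f}")
```

A lower value for the bagged model means its predictions depend less on which particular training samples were drawn, which is the sense in which Bagging reduces variance.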


In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load dataset
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a Random Forest model (Bagging technique)
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

# Predict and evaluate
y_pred = rf.predict(X_test)
print("Random Forest Accuracy:", accuracy_score(y_test, y_pred))


Q17.Why is Random Forest better than a single Decision Tree?
Ans.Why is Random Forest Better Than a Single Decision Tree?
Random Forest is an ensemble method that combines multiple Decision Trees to improve accuracy, generalization, and robustness. Here’s why Random Forest outperforms a single Decision Tree:

1. Reduces Overfitting (Lower Variance)
A single Decision Tree tends to memorize the training data, leading to overfitting and poor generalization on unseen data.
Random Forest reduces overfitting by training multiple trees on different random subsets of data (Bagging) and averaging their predictions, leading to a more generalized model.
2. More Stable and Robust
A single Decision Tree is highly sensitive to small changes in the dataset. If you change the training data slightly, the tree structure may change completely.
Random Forest, by combining multiple trees, is much more stable and resistant to data variations.
3. Handles Noisy Data Better
A single Decision Tree can be easily misled by noise in the training data, making incorrect splits.
Random Forest, by aggregating multiple trees, reduces the impact of noise, making predictions more reliable.
4. Works Well with High-Dimensional Data
In datasets with many features, a single Decision Tree may struggle to find the best splits and may overfit.
Random Forest uses Feature Randomness, considering only a random subset of features at each split, which spreads the splits across more features and improves performance.
5. Handling Missing Values
Decision Trees may struggle with missing values, leading to biased splits.
Random Forest does not remove this problem by itself: whether missing values are accepted depends on the implementation (recent scikit-learn releases add native support for tree-based models), so imputing missing values before training is the portable option.
6. Reduces Correlation Between Trees
A single Decision Tree relies entirely on the structure of one model.
Random Forest ensures trees are decorrelated by training each one on a random subset of data and features, making it more independent and diverse.
7. More Accurate Predictions
Random Forest aggregates predictions from multiple trees, using:
Majority voting (for classification)
Averaging (for regression)
This ensemble approach leads to higher accuracy compared to a single Decision Tree.
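
Whether a Random Forest accepts missing values directly depends on the library and its version, so a common portable approach is to impute first. The sketch below assumes the Iris data with about 10% of the values blanked out artificially and a median imputation strategy; both choices are for illustration only.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

X, y = load_iris(return_X_y=True)

# Artificially blank out roughly 10% of the values (illustration only)
rng = np.random.RandomState(42)
X_missing = X.copy()
mask = rng.rand(*X_missing.shape) < 0.10
X_missing[mask] = np.nan

X_train, X_test, y_train, y_test = train_test_split(
    X_missing, y, test_size=0.2, random_state=42
)

# Impute each feature with its training-set median, then fit the forest
model = make_pipeline(
    SimpleImputer(strategy="median"),
    RandomForestClassifier(n_estimators=100, random_state=42),
)
model.fit(X_train, y_train)

acc = model.score(X_test, y_test)
print(f"Random Forest accuracy with imputed missing values: {acc:.4f}")
```

Putting the imputer inside the pipeline ensures the medians are learned from the training split only.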

In [None]:
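from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load dataset
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a single Decision Tree
dt = DecisionTreeClassifier(random_state=42)
dt.fit(X_train, y_train)

# Train a Random Forest on the same split for comparison
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

# Predict and evaluate both models
y_pred_dt = dt.predict(X_test)
y_pred_rf = rf.predict(X_test)
print("Decision Tree Accuracy:", accuracy_score(y_test, y_pred_dt))
print("Random Forest Accuracy:", accuracy_score(y_test, y_pred_rf))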


Q18.What is the role of bootstrap sampling in Bagging?
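Ans.Bootstrap sampling draws random samples from the training set with replacement, so each base model in Bagging is trained on a slightly different dataset. Its role is as follows:
1. Reduces Overfitting (Variance Reduction)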
A single model (e.g., Decision Tree) tends to memorize data, leading to overfitting.
By training multiple models on different sampled datasets, Bagging prevents overfitting and improves generalization.
2. Increases Model Stability
Since each model is trained on a different dataset, they make different errors.
Combining predictions reduces the impact of outliers and noise.
3. Introduces Diversity in Models
The randomness introduced by bootstrap sampling ensures that no two models are identical.
This decorrelation between models makes the final ensemble more effective.
4. Works Well with High Variance Models
Bagging is particularly effective with high-variance models like Decision Trees, as it smooths their predictions.
Example: Random Forest uses Bagging to train multiple Decision Trees, leading to better performance.
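
What a bootstrap sample looks like can be shown with plain NumPy. This is a minimal sketch, assuming a dataset of 10,000 rows represented only by their indices: sampling with replacement leaves roughly 63% of the rows in each sample (about 1 - 1/e), and the rows that are never drawn are the out-of-bag rows for that model.

```python
import numpy as np

rng = np.random.default_rng(42)
n_samples = 10_000  # assumed dataset size, for illustration
indices = np.arange(n_samples)

# Draw a bootstrap sample: same size as the data, sampled with replacement
bootstrap_idx = rng.choice(indices, size=n_samples, replace=True)

# Rows that were never drawn are "out-of-bag" for this model
oob_idx = np.setdiff1d(indices, bootstrap_idx)

unique_fraction = np.unique(bootstrap_idx).size / n_samples
oob_fraction = oob_idx.size / n_samples

print(f"Fraction of distinct rows in the bootstrap sample: {unique_fraction:.3f}")
print(f"Fraction of out-of-bag rows: {oob_fraction:.3f}")
```

Repeating the draw with a different seed gives a different sample, which is where the diversity between base models comes from.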

In [None]:
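from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score

# Load dataset
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a single Decision Tree (baseline without bootstrap sampling)
dt = DecisionTreeClassifier(random_state=42)
dt.fit(X_train, y_train)

# Train bagged Decision Trees, each fitted on its own bootstrap sample
bagging = BaggingClassifier(estimator=DecisionTreeClassifier(), n_estimators=50,
                            bootstrap=True, random_state=42)
bagging.fit(X_train, y_train)

# Predict and evaluate
y_pred_dt = dt.predict(X_test)
y_pred_bag = bagging.predict(X_test)
print("Decision Tree Accuracy:", accuracy_score(y_test, y_pred_dt))
print("Bagging (bootstrap) Accuracy:", accuracy_score(y_test, y_pred_bag))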


Q19.What are some real-world applications of ensemble techniques?
Ans.Real-World Applications of Ensemble Techniques
Ensemble techniques like Bagging, Boosting, and Stacking are widely used in various industries to improve predictive accuracy, reduce overfitting, and enhance model stability. Here are some key real-world applications:

1. Finance and Banking
✔ Fraud Detection:

Problem: Detecting fraudulent transactions in real-time.
Solution: Random Forest & XGBoost analyze transaction patterns to distinguish fraud from legitimate transactions.
✔ Credit Risk Assessment:

Problem: Predicting whether a loan applicant will default.
Solution: Gradient Boosting Models (GBM) & AdaBoost improve classification accuracy in predicting high-risk customers.
2. Healthcare and Medical Diagnosis
✔ Disease Prediction and Diagnosis:

Problem: Predicting diseases like diabetes, cancer, and heart disease.
Solution: Bagging and Boosting combine models (e.g., Decision Trees, SVM) for better diagnostic accuracy.
✔ Medical Image Analysis:

Problem: Detecting tumors or anomalies in MRI/X-ray images.
Solution: Convolutional Neural Networks (CNNs) + Ensemble Learning improve image classification performance.
3. E-commerce and Retail
✔ Recommendation Systems:

Problem: Predicting user preferences for personalized recommendations.
Solution: Stacking & Random Forest combine multiple models (e.g., collaborative filtering, content-based models) to improve recommendations.
✔ Customer Churn Prediction:

Problem: Identifying customers likely to stop using a service.
Solution: XGBoost & LightGBM analyze customer behavior to predict churn and improve retention strategies.
4. Autonomous Vehicles and Transportation
✔ Self-Driving Cars:

Problem: Identifying objects, pedestrians, and traffic signals in real-time.
Solution: Bagging & Boosting with Deep Learning Models improve accuracy in object detection.
✔ Traffic Flow Prediction:

Problem: Predicting congestion and suggesting optimal routes.
Solution: Gradient Boosting & Random Forest analyze traffic patterns for better predictions.
5. Cybersecurity
✔ Intrusion Detection Systems (IDS):

Problem: Detecting malware and network intrusions.
Solution: Ensemble techniques like Random Forest & XGBoost analyze system logs and network patterns for anomaly detection.
✔ Spam Email Filtering:

Problem: Identifying and filtering spam emails.
Solution: Bagging (Random Forest) & Boosting (AdaBoost) improve email classification accuracy.
6. Social Media and Sentiment Analysis
✔ Fake News Detection:

Problem: Identifying misinformation in news articles.
Solution: Boosting algorithms (XGBoost, CatBoost) improve text classification accuracy.
✔ Sentiment Analysis:

Problem: Understanding public opinion on social media.
Solution: Ensemble learning improves the classification of positive, negative, or neutral sentiments.
7. Weather Forecasting and Climate Modeling
✔ Predicting Natural Disasters:

Problem: Forecasting hurricanes, earthquakes, or floods.
Solution: Bagging with Random Forest & Boosting models process historical climate data to predict weather patterns.
✔ Air Quality Monitoring:

Problem: Predicting pollution levels in different cities.
Solution: Random Forest & Gradient Boosting analyze pollutant levels and weather conditions for forecasting.

Q20.What is the difference between Bagging and Boosting?
Ans.Difference Between Bagging and Boosting
Bagging (Bootstrap Aggregating) and Boosting are both ensemble learning techniques, but they work in different ways to improve model performance. Here’s a detailed comparison:

1. Definition
Bagging: Trains multiple models in parallel on different random subsets of data (Bootstrap Sampling) and aggregates their predictions (e.g., majority voting for classification, averaging for regression).
Boosting: Trains multiple models sequentially, where each model corrects the errors of the previous one, focusing more on misclassified samples.
2. Working Principle
Bagging (Parallel Training & Aggregation)
✔ Uses random sampling with replacement (bootstrap sampling) to create multiple training datasets.
✔ Trains independent base models (e.g., Decision Trees) in parallel.
✔ Aggregates predictions using majority voting (classification) or averaging (regression).
✔ Reduces variance and prevents overfitting.

Boosting (Sequential Training & Weighted Learning)
✔ Models are trained sequentially, where each new model focuses on correcting the mistakes of the previous one.
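✔ Assigns higher weights to misclassified instances (as in AdaBoost) or fits each new model to the remaining errors (as in Gradient Boosting), making the ensemble learn from hard-to-predict cases.
✔ The final prediction is made by weighted voting (classification) or a weighted sum of the models' outputs (regression).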
✔ Reduces bias and improves accuracy.
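
The boosting side of this comparison can be sketched with scikit-learn's AdaBoostClassifier. The dataset, the split, and the number of estimators are assumptions chosen to mirror the other examples in this notebook; the default weak learner is left unchanged.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Load dataset
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# AdaBoost trains weak learners one after another, re-weighting the
# training samples so that later learners focus on earlier mistakes
boosting = AdaBoostClassifier(n_estimators=50, random_state=42)
boosting.fit(X_train, y_train)

# Predict and evaluate
y_pred_boost = boosting.predict(X_test)
boost_acc = accuracy_score(y_test, y_pred_boost)
print("AdaBoost Classifier Accuracy:", boost_acc)

# Each fitted learner carries its own weight in the final vote
print("Learner weights (first 5):", boosting.estimator_weights_[:5])
```

Unlike Bagging, the learners here cannot be trained in parallel, because each one depends on the sample weights produced by the previous one.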



In [None]:
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load dataset
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train Bagging Classifier (the `estimator` parameter was named `base_estimator` before scikit-learn 1.2)
bagging = BaggingClassifier(estimator=DecisionTreeClassifier(), n_estimators=100, random_state=42)
bagging.fit(X_train, y_train)

# Predict and evaluate
y_pred_bag = bagging.predict(X_test)
print("Bagging Classifier Accuracy:", accuracy_score(y_test, y_pred_bag))


Q21.Train a Bagging Classifier using Decision Trees on a sample dataset and print model accuracy?
Ans.Here is a Python implementation to train a Bagging Classifier using Decision Trees on a sample dataset and print the model accuracy.

Step 1: Load the Dataset
We'll use the Iris dataset, a commonly used dataset for classification problems.

Step 2: Train the Bagging Classifier
We'll use Decision Trees as the base estimator in the Bagging ensemble.

Step 3: Evaluate the Model
We'll calculate and print the accuracy score of the trained model on the test set.



In [None]:
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load dataset
X, y = load_iris(return_X_y=True)

# Split dataset into train and test sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create a Bagging Classifier with Decision Trees as base estimator
bagging_clf = BaggingClassifier(estimator=DecisionTreeClassifier(),
                                n_estimators=100,  # Number of trees
                                bootstrap=True,    # Enable bootstrap sampling
                                random_state=42)

# Train the model
bagging_clf.fit(X_train, y_train)

# Predict on test data
y_pred = bagging_clf.predict(X_test)

# Calculate and print accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Bagging Classifier Accuracy: {accuracy:.4f}")


Q22.Train a Bagging Regressor using Decision Trees and evaluate using Mean Squared Error (MSE)?
Ans.Here’s a Python implementation to train a Bagging Regressor using Decision Trees and evaluate it using Mean Squared Error (MSE).

Steps to Implement
Load Dataset: We'll use the California Housing dataset (or any regression dataset).
Train the Bagging Regressor: Using Decision Trees as base estimators.
Evaluate the Model: Using Mean Squared Error (MSE).

In [None]:
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Load a sample regression dataset (California Housing)
data = fetch_california_housing()
X, y = data.data, data.target

# Split dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create a Bagging Regressor with Decision Trees
bagging_reg = BaggingRegressor(estimator=DecisionTreeRegressor(),
                               n_estimators=100,  # Number of trees
                               bootstrap=True,    # Enable bootstrap sampling
                               random_state=42)

# Train the model
bagging_reg.fit(X_train, y_train)

# Predict on test data
y_pred = bagging_reg.predict(X_test)

# Evaluate using Mean Squared Error (MSE)
mse = mean_squared_error(y_test, y_pred)
print(f"Bagging Regressor MSE: {mse:.4f}")


Q23.Train a Random Forest Classifier on the Breast Cancer dataset and print feature importance scores?
Ans.Here’s a Python implementation to train a Random Forest Classifier on the Breast Cancer dataset and print feature importance scores.

Steps to Implement
Load Dataset: Use the Breast Cancer dataset from sklearn.datasets.
Train the Random Forest Classifier.
Evaluate Feature Importance: Print feature importance scores.

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import pandas as pd

# Load the Breast Cancer dataset
data = load_breast_cancer()
X, y = data.data, data.target
feature_names = data.feature_names  # Get feature names

# Split dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create and train the Random Forest Classifier
rf_clf = RandomForestClassifier(n_estimators=100, random_state=42)
rf_clf.fit(X_train, y_train)

# Predict on test data
y_pred = rf_clf.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Random Forest Classifier Accuracy: {accuracy:.4f}")

# Get feature importance scores
feature_importance = rf_clf.feature_importances_

# Convert to DataFrame for better readability
importance_df = pd.DataFrame({'Feature': feature_names, 'Importance': feature_importance})
importance_df = importance_df.sort_values(by='Importance', ascending=False)  # Sort by importance

# Print feature importance scores
print("\nFeature Importance Scores:")
print(importance_df)


Q24.Train a Random Forest Regressor and compare its performance with a single Decision Tree?
Ans.Here’s a Python implementation to train a Random Forest Regressor and compare its performance with a single Decision Tree Regressor using Mean Squared Error (MSE).

Steps to Implement
Load Dataset: Use the California Housing dataset (or any regression dataset).
Train Models: Train both Decision Tree Regressor and Random Forest Regressor.
Evaluate Performance: Compare MSE of both models.


In [None]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Load California Housing dataset
data = fetch_california_housing()
X, y = data.data, data.target

# Split dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train Decision Tree Regressor
dt_reg = DecisionTreeRegressor(random_state=42)
dt_reg.fit(X_train, y_train)
y_pred_dt = dt_reg.predict(X_test)

# Train Random Forest Regressor
rf_reg = RandomForestRegressor(n_estimators=100, random_state=42)
rf_reg.fit(X_train, y_train)
y_pred_rf = rf_reg.predict(X_test)

# Calculate Mean Squared Error (MSE) for both models
mse_dt = mean_squared_error(y_test, y_pred_dt)
mse_rf = mean_squared_error(y_test, y_pred_rf)

# Print MSE values
print(f"Decision Tree Regressor MSE: {mse_dt:.4f}")
print(f"Random Forest Regressor MSE: {mse_rf:.4f}")


Q25.Compute the Out-of-Bag (OOB) Score for a Random Forest Classifier?
Ans.Computing the Out-of-Bag (OOB) Score for a Random Forest Classifier
The Out-of-Bag (OOB) score is an internal validation score used in Random Forest models. Since bootstrap sampling (random sampling with replacement) is used for training, some data points remain out-of-bag and can be used to estimate model accuracy without needing a separate validation set.

Steps to Implement
Load Dataset: Use the Breast Cancer dataset from sklearn.datasets.
Train a Random Forest Classifier with oob_score=True.
Print the OOB Score to estimate model performance.

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

# Load the Breast Cancer dataset
data = load_breast_cancer()
X, y = data.data, data.target

# Split dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a Random Forest Classifier with OOB Score enabled
rf_clf = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=42, bootstrap=True)
rf_clf.fit(X_train, y_train)

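# Print the OOB Score
print(f"Out-of-Bag (OOB) Score: {rf_clf.oob_score_:.4f}")

# Compare with accuracy on the held-out test set
print(f"Test Accuracy: {rf_clf.score(X_test, y_test):.4f}")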


Q26. Train a Bagging Classifier using SVM as a base estimator and print accuracy?
Ans.

In [None]:
from sklearn.ensemble import BaggingClassifier
from sklearn.svm import SVC
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load Breast Cancer dataset
data = load_breast_cancer()
X, y = data.data, data.target

# Split dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create a Bagging Classifier with SVM as base estimator
bagging_svm = BaggingClassifier(estimator=SVC(),
                                n_estimators=50,
                                bootstrap=True,
                                random_state=42)

# Train the model
bagging_svm.fit(X_train, y_train)

# Predict on test data
y_pred = bagging_svm.predict(X_test)

# Calculate and print accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Bagging Classifier (SVM) Accuracy: {accuracy:.4f}")


Q27. Train a Random Forest Classifier with different numbers of trees and compare accuracy?
Ans.Steps to Implement
Load Dataset: Use the Breast Cancer dataset.
Train Multiple Random Forest Models with different n_estimators (number of trees).
Compare Accuracy for different numbers of trees.


In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt

# Load Breast Cancer dataset
data = load_breast_cancer()
X, y = data.data, data.target

# Split dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Different numbers of trees to test
tree_counts = [1, 5, 10, 50, 100, 200, 500]
accuracies = []

# Train and evaluate models with different numbers of trees
for n in tree_counts:
    rf_clf = RandomForestClassifier(n_estimators=n, random_state=42)
    rf_clf.fit(X_train, y_train)
    y_pred = rf_clf.predict(X_test)
    acc = accuracy_score(y_test, y_pred)
    accuracies.append(acc)
    print(f"Random Forest with {n} trees - Accuracy: {acc:.4f}")

# Plot results
plt.figure(figsize=(8, 5))
plt.plot(tree_counts, accuracies, marker='o', linestyle='-')
plt.xlabel("Number of Trees")
plt.ylabel("Accuracy")
plt.title("Random Forest Accuracy vs. Number of Trees")
plt.grid()
plt.show()


Q28.Train a Bagging Classifier using Logistic Regression as a base estimator and print AUC score?
Ans.Steps to Implement
Load Dataset: Use the Breast Cancer dataset.
Train a Bagging Classifier with Logistic Regression as the base estimator.
Evaluate Performance using AUC score.

In [None]:
from sklearn.ensemble import BaggingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# Load Breast Cancer dataset
data = load_breast_cancer()
X, y = data.data, data.target

# Split dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create a Bagging Classifier with Logistic Regression as the base estimator
bagging_lr = BaggingClassifier(estimator=LogisticRegression(max_iter=5000),
                               n_estimators=50,
                               bootstrap=True,
                               random_state=42)

# Train the model
bagging_lr.fit(X_train, y_train)

# Predict probabilities on test data
y_prob = bagging_lr.predict_proba(X_test)[:, 1]  # Get probability of class 1

# Calculate and print AUC Score
auc_score = roc_auc_score(y_test, y_prob)
print(f"Bagging Classifier (Logistic Regression) AUC Score: {auc_score:.4f}")


Q29.Train a Random Forest Regressor and analyze feature importance scores?
Ans.Steps to Implement
Load Dataset: Use the California Housing dataset for regression.
Train a Random Forest Regressor.
Extract Feature Importance Scores and visualize them.


In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestRegressor
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Load California Housing dataset
data = fetch_california_housing()
X, y = data.data, data.target
feature_names = data.feature_names  # Get feature names

# Split dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a Random Forest Regressor
rf_reg = RandomForestRegressor(n_estimators=100, random_state=42)
rf_reg.fit(X_train, y_train)

# Predict on test data
y_pred = rf_reg.predict(X_test)

# Calculate Mean Squared Error (MSE)
mse = mean_squared_error(y_test, y_pred)
print(f"Random Forest Regressor MSE: {mse:.4f}")

# Get feature importance scores
feature_importance = rf_reg.feature_importances_

# Convert to DataFrame for better readability
importance_df = pd.DataFrame({'Feature': feature_names, 'Importance': feature_importance})
importance_df = importance_df.sort_values(by='Importance', ascending=False)

# Print feature importance scores
print("\nFeature Importance Scores:")
print(importance_df)

# Plot feature importance
plt.figure(figsize=(8, 5))
plt.barh(importance_df['Feature'], importance_df['Importance'], color='skyblue')
plt.xlabel("Feature Importance")
plt.ylabel("Features")
plt.title("Feature Importance in Random Forest Regressor")
plt.gca().invert_yaxis()  # Highest importance on top
plt.show()


Q30. Train an ensemble model using both Bagging and Random Forest and compare accuracy
Ans.Steps to Implement
Load Dataset: Use the Breast Cancer dataset for classification.
Train Two Ensemble Models:
Bagging Classifier with Decision Trees.
Random Forest Classifier.
Evaluate Accuracy on the test set.
Compare Performance of both models.

In [None]:
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load Breast Cancer dataset
data = load_breast_cancer()
X, y = data.data, data.target

# Split dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train Bagging Classifier with Decision Tree as base estimator
bagging_clf = BaggingClassifier(estimator=DecisionTreeClassifier(),
                                n_estimators=50,
                                bootstrap=True,
                                random_state=42)
bagging_clf.fit(X_train, y_train)
y_pred_bagging = bagging_clf.predict(X_test)
accuracy_bagging = accuracy_score(y_test, y_pred_bagging)

# Train Random Forest Classifier
rf_clf = RandomForestClassifier(n_estimators=50, random_state=42)
rf_clf.fit(X_train, y_train)
y_pred_rf = rf_clf.predict(X_test)
accuracy_rf = accuracy_score(y_test, y_pred_rf)

# Print accuracy scores
print(f"Bagging Classifier Accuracy: {accuracy_bagging:.4f}")
print(f"Random Forest Classifier Accuracy: {accuracy_rf:.4f}")


Q31.Train a Random Forest Classifier and tune hyperparameters using GridSearchCV?
Ans.

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score

# Load Breast Cancer dataset
data = load_breast_cancer()
X, y = data.data, data.target

# Split dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define a base Random Forest Classifier
rf_clf = RandomForestClassifier(random_state=42)

# Define hyperparameters for tuning
param_grid = {
    'n_estimators': [50, 100, 200],  # Number of trees
    'max_depth': [None, 10, 20],  # Depth of trees
    'min_samples_split': [2, 5, 10],  # Minimum samples required to split a node
    'min_samples_leaf': [1, 2, 4],  # Minimum samples required at a leaf node
    'criterion': ['gini', 'entropy']  # Splitting criterion
}

# Use GridSearchCV for hyperparameter tuning
grid_search = GridSearchCV(estimator=rf_clf, param_grid=param_grid, cv=5, scoring='accuracy', n_jobs=-1)
grid_search.fit(X_train, y_train)

# Print best hyperparameters
print("Best Hyperparameters:", grid_search.best_params_)

# Train the best model
best_rf = grid_search.best_estimator_
y_pred = best_rf.predict(X_test)

# Calculate and print accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Optimized Random Forest Accuracy: {accuracy:.4f}")


Q32.Train a Bagging Regressor with different numbers of base estimators and compare performance?
Ans.Steps to Implement
Load Dataset: Use the California Housing dataset for regression.
Train Bagging Regressors with different numbers of base estimators (n_estimators).
Compare Performance using Mean Squared Error (MSE).
Visualize the results.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Load California Housing dataset
data = fetch_california_housing()
X, y = data.data, data.target

# Split dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Different numbers of base estimators to test
n_estimators_list = [1, 5, 10, 50, 100, 200]
mse_scores = []

# Train and evaluate Bagging Regressors with different numbers of base estimators
for n in n_estimators_list:
    bagging_reg = BaggingRegressor(estimator=DecisionTreeRegressor(),
                                   n_estimators=n,
                                   random_state=42,
                                   n_jobs=-1)
    bagging_reg.fit(X_train, y_train)
    y_pred = bagging_reg.predict(X_test)

    # Calculate MSE
    mse = mean_squared_error(y_test, y_pred)
    mse_scores.append(mse)

    print(f"Bagging Regressor with {n} estimators - MSE: {mse:.4f}")

# Plot MSE vs. Number of Base Estimators
plt.figure(figsize=(8, 5))
plt.plot(n_estimators_list, mse_scores, marker='o', linestyle='-')
plt.xlabel("Number of Base Estimators")
plt.ylabel("Mean Squared Error (MSE)")
plt.title("Bagging Regressor Performance vs. Number of Estimators")
plt.grid()
plt.show()


Q33.Train a Random Forest Classifier and analyze misclassified samples?
Ans.

In [None]:
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix

# Load Breast Cancer dataset
data = load_breast_cancer()
X, y = data.data, data.target
feature_names = data.feature_names  # Feature names
target_names = data.target_names  # Class labels

# Split dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a Random Forest Classifier
rf_clf = RandomForestClassifier(n_estimators=100, random_state=42)
rf_clf.fit(X_train, y_train)

# Predict on test data
y_pred = rf_clf.predict(X_test)

# Calculate and print accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Random Forest Classifier Accuracy: {accuracy:.4f}")

# Identify misclassified samples
misclassified_indices = np.where(y_pred != y_test)[0]
print(f"\nNumber of Misclassified Samples: {len(misclassified_indices)}")

# Analyze misclassified samples
misclassified_df = pd.DataFrame(X_test[misclassified_indices], columns=feature_names)
misclassified_df['Actual Label'] = [target_names[label] for label in y_test[misclassified_indices]]
misclassified_df['Predicted Label'] = [target_names[label] for label in y_pred[misclassified_indices]]

print("\nMisclassified Samples Analysis:")
print(misclassified_df)

# Display Confusion Matrix
conf_matrix = confusion_matrix(y_test, y_pred)
print("\nConfusion Matrix:")
print(conf_matrix)


Q34.Train a Bagging Classifier and compare its performance with a single Decision Tree Classifier?
Ans.

In [None]:
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load Breast Cancer dataset
data = load_breast_cancer()
X, y = data.data, data.target

# Split dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a Single Decision Tree Classifier
dt_clf = DecisionTreeClassifier(random_state=42)
dt_clf.fit(X_train, y_train)
y_pred_dt = dt_clf.predict(X_test)
accuracy_dt = accuracy_score(y_test, y_pred_dt)

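# Train a Bagging Classifier with Decision Tree as base estimator.
# The estimator is passed positionally because the keyword was renamed from
# base_estimator to estimator in scikit-learn 1.2.
bagging_clf = BaggingClassifier(DecisionTreeClassifier(),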
                                n_estimators=50,
                                random_state=42,
                                n_jobs=-1)
bagging_clf.fit(X_train, y_train)
y_pred_bagging = bagging_clf.predict(X_test)
accuracy_bagging = accuracy_score(y_test, y_pred_bagging)

# Print accuracy scores
print(f"Decision Tree Classifier Accuracy: {accuracy_dt:.4f}")
print(f"Bagging Classifier Accuracy: {accuracy_bagging:.4f}")


Q35.Train a Random Forest Classifier and visualize the confusion matrix
Ans.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix

# Load Breast Cancer dataset
data = load_breast_cancer()
X, y = data.data, data.target

# Split dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a Random Forest Classifier
rf_clf = RandomForestClassifier(n_estimators=100, random_state=42)
rf_clf.fit(X_train, y_train)

# Predict on the test set
y_pred = rf_clf.predict(X_test)

# Compute accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Random Forest Classifier Accuracy: {accuracy:.4f}")

# Compute confusion matrix
conf_matrix = confusion_matrix(y_test, y_pred)

# Visualize confusion matrix
plt.figure(figsize=(6, 5))
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues', xticklabels=data.target_names, yticklabels=data.target_names)
plt.xlabel("Predicted Label")
plt.ylabel("Actual Label")
plt.title("Confusion Matrix - Random Forest Classifier")
plt.show()


Q36.Train a Stacking Classifier using Decision Trees, SVM, and Logistic Regression, and compare accuracy?
Ans.

In [None]:
import numpy as np
from sklearn.ensemble import StackingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load Breast Cancer dataset
data = load_breast_cancer()
X, y = data.data, data.target

# Split dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define base classifiers
decision_tree = DecisionTreeClassifier(random_state=42)
svm_clf = SVC(probability=True, random_state=42)  # Enable probability estimates for stacking
log_reg = LogisticRegression(max_iter=10000, random_state=42)  # Higher max_iter gives lbfgs room to converge on unscaled features

# Define stacking classifier with logistic regression as the meta-learner
stacking_clf = StackingClassifier(
    estimators=[('dt', decision_tree), ('svm', svm_clf), ('lr', log_reg)],
    final_estimator=LogisticRegression(),
    passthrough=False  # Use base model predictions for final model
)

# Train individual models and Stacking Classifier
classifiers = {'Decision Tree': decision_tree, 'SVM': svm_clf, 'Logistic Regression': log_reg, 'Stacking Classifier': stacking_clf}
accuracy_scores = {}

for name, model in classifiers.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    accuracy_scores[name] = accuracy
    print(f"{name} Accuracy: {accuracy:.4f}")

# Compare accuracy of different models
best_model = max(accuracy_scores, key=accuracy_scores.get)
print(f"\nBest Performing Model: {best_model} with Accuracy: {accuracy_scores[best_model]:.4f}")
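
SVM and Logistic Regression are sensitive to feature scale, and the breast cancer features span very different ranges. As a hedged variant of the cell above (not the original setup), the sketch below wraps those two base learners in a StandardScaler pipeline before stacking. The names scaled_svm, scaled_lr, and scaled_stack are illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Same data and split as the cell above
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scale inside the pipeline so the scaler is fit only on training folds
scaled_svm = make_pipeline(StandardScaler(), SVC(random_state=42))
scaled_lr = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000, random_state=42))

scaled_stack = StackingClassifier(
    estimators=[
        ('dt', DecisionTreeClassifier(random_state=42)),  # trees do not need scaling
        ('svm', scaled_svm),
        ('lr', scaled_lr),
    ],
    final_estimator=LogisticRegression(max_iter=1000),
)

scaled_stack.fit(X_train, y_train)
scaled_accuracy = accuracy_score(y_test, scaled_stack.predict(X_test))
print(f"Stacking Classifier (scaled SVM and LR) Accuracy: {scaled_accuracy:.4f}")
```

Keeping the scaler inside each pipeline means the internal cross-validation used by StackingClassifier never sees statistics from the held-out fold.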


Q37.Train a Random Forest Classifier and print the top 5 most important features
Ans.

In [None]:
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

# Load Breast Cancer dataset
data = load_breast_cancer()
X, y = data.data, data.target
feature_names = data.feature_names  # Feature names

# Split dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a Random Forest Classifier
rf_clf = RandomForestClassifier(n_estimators=100, random_state=42)
rf_clf.fit(X_train, y_train)

# Extract feature importance scores
feature_importance = rf_clf.feature_importances_

# Create a DataFrame for better visualization
importance_df = pd.DataFrame({'Feature': feature_names, 'Importance': feature_importance})

# Sort features by importance in descending order
importance_df = importance_df.sort_values(by='Importance', ascending=False)

# Print the top 5 most important features
print("Top 5 Most Important Features in Random Forest Classifier:")
print(importance_df.head(5))
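
The feature_importances_ attribute is impurity-based and computed from the training data, so it can be worth cross-checking against permutation importance on the held-out test set. The following is a minimal sketch under the same data split as above; n_repeats=10 and the column names are illustrative choices.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Same data, split, and model settings as the cell above
data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42)

rf_clf = RandomForestClassifier(n_estimators=100, random_state=42)
rf_clf.fit(X_train, y_train)

# Shuffle each feature in turn and measure the drop in test accuracy
perm = permutation_importance(rf_clf, X_test, y_test, n_repeats=10, random_state=42)

perm_df = pd.DataFrame({
    'Feature': data.feature_names,
    'Mean Importance': perm.importances_mean,
    'Std': perm.importances_std,
}).sort_values(by='Mean Importance', ascending=False)

print("Top 5 Features by Permutation Importance (test set):")
print(perm_df.head(5))
```

The two rankings need not agree: strongly correlated features can share impurity-based importance while each showing only a small permutation effect.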


Q38.Train a Bagging Classifier and evaluate performance using Precision, Recall, and F1-score
Ans.

In [None]:
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score

# Load Breast Cancer dataset
data = load_breast_cancer()
X, y = data.data, data.target

# Split dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

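# Train a Bagging Classifier with Decision Trees.
# The estimator is passed positionally because the keyword was renamed from
# base_estimator to estimator in scikit-learn 1.2.
bagging_clf = BaggingClassifier(DecisionTreeClassifier(),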
                                n_estimators=50,
                                random_state=42,
                                n_jobs=-1)
bagging_clf.fit(X_train, y_train)

# Predict on test data
y_pred = bagging_clf.predict(X_test)

# Evaluate performance
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

# Print evaluation metrics
print("Bagging Classifier Performance:")
print(f"Accuracy  : {accuracy:.4f}")
print(f"Precision : {precision:.4f}")
print(f"Recall    : {recall:.4f}")
print(f"F1-score  : {f1:.4f}")
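
The scores above describe only the positive class (label 1, 'benign' in this dataset). As a small complementary sketch, classification_report prints precision, recall, and F1-score for both classes at once; the variable name report is illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Same data, split, and model settings as the cell above
data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42)

bagging_clf = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50, random_state=42)
bagging_clf.fit(X_train, y_train)

# Per-class precision, recall, F1-score, and support in one table
report = classification_report(
    y_test, bagging_clf.predict(X_test), target_names=list(data.target_names))
print(report)
```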


Q39.Train a Random Forest Classifier and analyze the effect of max_depth on accuracy
Ans.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load Breast Cancer dataset
data = load_breast_cancer()
X, y = data.data, data.target

# Split dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define max_depth values to test
max_depth_values = range(1, 21)
accuracy_scores = []

# Train and evaluate Random Forest with different max_depth values
for depth in max_depth_values:
    rf_clf = RandomForestClassifier(n_estimators=100, max_depth=depth, random_state=42)
    rf_clf.fit(X_train, y_train)
    y_pred = rf_clf.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    accuracy_scores.append(accuracy)

# Plot max_depth vs. accuracy
plt.figure(figsize=(8, 5))
plt.plot(max_depth_values, accuracy_scores, marker='o', linestyle='-', color='b')
plt.xlabel("Max Depth of Trees")
plt.ylabel("Accuracy")
plt.title("Effect of max_depth on Random Forest Accuracy")
plt.grid(True)
plt.show()


Q40.Train a Bagging Regressor using different base estimators (DecisionTree and KNeighbors) and compare performance
Ans.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Load the Diabetes dataset (load_boston was removed in scikit-learn 1.2)
data = load_diabetes()
X, y = data.data, data.target

# Split dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define base estimators
dt_regressor = DecisionTreeRegressor(random_state=42)
knn_regressor = KNeighborsRegressor()

# Train Bagging Regressors with different base estimators
# The estimator is passed positionally (keyword renamed from base_estimator to estimator in scikit-learn 1.2)
bagging_dt = BaggingRegressor(dt_regressor, n_estimators=50, random_state=42, n_jobs=-1)
bagging_knn = BaggingRegressor(knn_regressor, n_estimators=50, random_state=42, n_jobs=-1)

# Fit models
bagging_dt.fit(X_train, y_train)
bagging_knn.fit(X_train, y_train)

# Predict on test set
y_pred_dt = bagging_dt.predict(X_test)
y_pred_knn = bagging_knn.predict(X_test)

# Evaluate performance using Mean Squared Error (MSE)
mse_dt = mean_squared_error(y_test, y_pred_dt)
mse_knn = mean_squared_error(y_test, y_pred_knn)

# Print MSE scores
print(f"Bagging Regressor (Decision Tree) - MSE: {mse_dt:.4f}")
print(f"Bagging Regressor (K-Neighbors) - MSE: {mse_knn:.4f}")

# Plot comparison
models = ['Bagging (Decision Tree)', 'Bagging (K-Neighbors)']
mse_scores = [mse_dt, mse_knn]

plt.figure(figsize=(7, 5))
plt.bar(models, mse_scores, color=['blue', 'green'])
plt.xlabel("Model")
plt.ylabel("Mean Squared Error (MSE)")
plt.title("Comparison of Bagging Regressors with Different Base Estimators")
plt.show()


Q41. Train a Random Forest Classifier and evaluate its performance using ROC-AUC Score?
Ans.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, roc_curve

# Load Breast Cancer dataset
data = load_breast_cancer()
X, y = data.data, data.target

# Split dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a Random Forest Classifier
rf_clf = RandomForestClassifier(n_estimators=100, random_state=42)
rf_clf.fit(X_train, y_train)

# Predict probabilities for the positive class
y_prob = rf_clf.predict_proba(X_test)[:, 1]

# Compute ROC-AUC score
roc_auc = roc_auc_score(y_test, y_prob)

# Compute ROC curve
fpr, tpr, _ = roc_curve(y_test, y_prob)

# Print ROC-AUC score
print(f"Random Forest Classifier ROC-AUC Score: {roc_auc:.4f}")

# Plot ROC Curve
plt.figure(figsize=(7, 5))
plt.plot(fpr, tpr, color='blue', label=f'ROC Curve (AUC = {roc_auc:.4f})')
plt.plot([0, 1], [0, 1], linestyle='--', color='gray')  # Random classifier line
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC Curve for Random Forest Classifier")
plt.legend()
plt.grid(True)
plt.show()


Q42.Train a Bagging Classifier and evaluate its performance using cross-validation.
Ans.

In [None]:
import numpy as np
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score, StratifiedKFold

# Load Breast Cancer dataset
data = load_breast_cancer()
X, y = data.data, data.target

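# Define Bagging Classifier with Decision Tree as base estimator.
# The estimator is passed positionally because the keyword was renamed from
# base_estimator to estimator in scikit-learn 1.2.
bagging_clf = BaggingClassifier(DecisionTreeClassifier(),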
                                n_estimators=50,
                                random_state=42,
                                n_jobs=-1)

# Perform cross-validation with Stratified K-Fold (5 folds)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_scores = cross_val_score(bagging_clf, X, y, cv=cv, scoring='accuracy')

# Print Cross-Validation Results
print(f"Cross-Validation Accuracy Scores: {cv_scores}")
print(f"Mean Accuracy: {np.mean(cv_scores):.4f}")
print(f"Standard Deviation: {np.std(cv_scores):.4f}")


Q43.Train a Random Forest Classifier and plot the Precision-Recall curve
Ans.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_recall_curve, auc

# Load Breast Cancer dataset
data = load_breast_cancer()
X, y = data.data, data.target

# Split dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a Random Forest Classifier
rf_clf = RandomForestClassifier(n_estimators=100, random_state=42)
rf_clf.fit(X_train, y_train)

# Predict probabilities for the positive class
y_prob = rf_clf.predict_proba(X_test)[:, 1]

# Compute Precision-Recall curve
precision, recall, _ = precision_recall_curve(y_test, y_prob)

# Compute PR AUC Score
pr_auc = auc(recall, precision)

# Print PR AUC Score
print(f"Random Forest Classifier PR AUC Score: {pr_auc:.4f}")

# Plot Precision-Recall Curve
plt.figure(figsize=(7, 5))
plt.plot(recall, precision, color='blue', label=f'PR Curve (AUC = {pr_auc:.4f})')
plt.xlabel("Recall")
plt.ylabel("Precision")
plt.title("Precision-Recall Curve for Random Forest Classifier")
plt.legend()
plt.grid(True)
plt.show()


Q44.Train a Stacking Classifier with Random Forest and Logistic Regression and compare accuracy?
Ans.

In [None]:
import numpy as np
from sklearn.ensemble import StackingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load Breast Cancer dataset
data = load_breast_cancer()
X, y = data.data, data.target

# Split dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define Base Learners
base_learners = [
    ('rf', RandomForestClassifier(n_estimators=100, random_state=42)),
    ('lr', LogisticRegression(max_iter=500))
]

# Define Stacking Classifier with Logistic Regression as Meta-Learner
stacking_clf = StackingClassifier(estimators=base_learners, final_estimator=LogisticRegression(), n_jobs=-1)

# Train Stacking Classifier
stacking_clf.fit(X_train, y_train)

# Predict and evaluate accuracy
y_pred_stacking = stacking_clf.predict(X_test)
stacking_accuracy = accuracy_score(y_test, y_pred_stacking)

# Train Individual Models for Comparison
rf_clf = RandomForestClassifier(n_estimators=100, random_state=42)
rf_clf.fit(X_train, y_train)
rf_accuracy = accuracy_score(y_test, rf_clf.predict(X_test))

lr_clf = LogisticRegression(max_iter=500)
lr_clf.fit(X_train, y_train)
lr_accuracy = accuracy_score(y_test, lr_clf.predict(X_test))

# Print Accuracy Scores
print(f"Random Forest Accuracy: {rf_accuracy:.4f}")
print(f"Logistic Regression Accuracy: {lr_accuracy:.4f}")
print(f"Stacking Classifier Accuracy: {stacking_accuracy:.4f}")


Q45.Train a Bagging Regressor with different levels of bootstrap samples and compare performance.
Ans.

In [None]:
import numpy as np
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Load California Housing dataset
data = fetch_california_housing()
X, y = data.data, data.target

# Split dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Different bootstrap sample levels
bootstrap_samples = [0.5, 0.7, 1.0]
mse_scores = {}

for sample in bootstrap_samples:
    # Train Bagging Regressor with different bootstrap sample sizes
    # (estimator passed positionally; keyword renamed from base_estimator in scikit-learn 1.2)
    bagging_reg = BaggingRegressor(DecisionTreeRegressor(),
                                   n_estimators=50,
                                   max_samples=sample,
                                   random_state=42,
                                   n_jobs=-1)
    bagging_reg.fit(X_train, y_train)

    # Predict and evaluate MSE
    y_pred = bagging_reg.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)

    mse_scores[sample] = mse
    print(f"Bagging Regressor (max_samples={sample}) - MSE: {mse:.4f}")

# Compare MSE values
print("\nPerformance Comparison:")
for sample, mse in mse_scores.items():
    print(f"Bootstrap Sample {sample:.0%}: MSE = {mse:.4f}")
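
With the default bootstrap=True, a float max_samples sets how many rows each base estimator draws with replacement, so even max_samples=1.0 does not expose a tree to every distinct training row. The NumPy-only sketch below simulates that sampling to show roughly what fraction of distinct rows one estimator sees at each level; n_train is a hypothetical training-set size, not taken from the dataset above.

```python
import numpy as np

rng = np.random.default_rng(42)
n_train = 10_000  # hypothetical training-set size, for illustration only

unique_fractions = {}
for max_samples in [0.5, 0.7, 1.0]:
    # Number of rows one base estimator draws for this max_samples level
    n_draw = int(max_samples * n_train)

    # Bootstrap: sample row indices with replacement
    idx = rng.integers(0, n_train, size=n_draw)

    # Fraction of distinct training rows that this estimator actually sees
    unique_fractions[max_samples] = np.unique(idx).size / n_train
    print(f"max_samples={max_samples}: distinct rows seen = {unique_fractions[max_samples]:.3f}")
```

For large training sets the fraction approaches 1 - exp(-max_samples), which is about 0.632 at max_samples=1.0. Smaller values make the individual trees more diverse but give each one less data to learn from.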
