> Question 1: What is Ensemble Learning in machine learning? Explain the
> key idea behind it.
>
> Ans 1 :-  
> Ensemble Learning in **machine learning** is a technique where we
> combine the predictions of multiple models (often called *weak
> learners* or *base models*) to create a more powerful and accurate
> model, known as an **ensemble model**.
>
> **Key Idea:**
>
> The main idea is that **a group of diverse models working together
> often performs better than any single model on its own**.
>
> ● Just as "two heads are better than one" in real life,
> ensemble learning reduces the risk of relying on a single model.
>
> ● By combining multiple models, we can **reduce errors, increase
> accuracy, and improve generalization** on unseen data.
>
> **Why It Works:**
>
> 1. **Error Reduction** – Different models make different mistakes.
> When those mistakes are not strongly correlated, combining the models
> lets many individual errors cancel out.
>
> 2. **Bias-Variance Trade-off** – Averaging methods such as bagging
> reduce variance (and so overfitting) without increasing bias much,
> while boosting mainly reduces bias.
>
> 3. **Robustness** – Even if one model performs poorly, the others can
> compensate.
>
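
The error-cancelling argument above can be checked with a small simulation. This is only a sketch under a strong assumption: every model is correct independently with the same probability (0.6 here, a made-up value), which real models trained on the same data never fully satisfy. The function name `majority_vote_accuracy` is hypothetical.

```python
import random


def majority_vote_accuracy(n_models, p_correct, n_trials=20000, seed=0):
    """Estimate how often a majority of independent models is correct."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(n_trials):
        # Each model is right independently with probability p_correct.
        correct_votes = sum(rng.random() < p_correct for _ in range(n_models))
        if correct_votes > n_models / 2:
            wins += 1
    return wins / n_trials


single = majority_vote_accuracy(1, 0.6)
ensemble = majority_vote_accuracy(25, 0.6)
print(f"single model: {single:.3f}, 25-model majority vote: {ensemble:.3f}")
```

Under these assumptions the 25-model vote is right noticeably more often than a single model. The gain shrinks as the models' errors become more correlated, which is why ensemble methods work to keep their members diverse.
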
> **Common Ensemble Techniques:**
>
> 1. **Bagging (Bootstrap Aggregating):**
>
> ○ Trains multiple models independently on different bootstrap samples
> (random samples drawn with replacement) of the training data and
> averages or votes their predictions.
>
> ○ Example: **Random Forest**.
>
> 2. **Boosting:**
>
> ○ Trains models sequentially, where each new model focuses on
> correcting errors made by the previous ones.
>
> ○ Example: **AdaBoost, XGBoost**.
>
> 3. **Stacking:**
>
> ○ Combines the predictions of multiple models using another model
> (a meta-learner) that is trained on those predictions.
>
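
Stacking is the least familiar of the three techniques, so here is a minimal sketch with scikit-learn's `StackingClassifier`. The choice of base models (a Random Forest and k-nearest neighbours), the logistic-regression meta-learner, and the Iris dataset are all illustrative assumptions, not a recommendation.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Base models produce out-of-fold predictions (cv=5); the meta-learner
# is then trained on those predictions instead of the raw features.
stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
        ("knn", KNeighborsClassifier(n_neighbors=5)),
    ],
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,
)
stack.fit(X_train, y_train)

stack_accuracy = stack.score(X_test, y_test)
print("Stacking accuracy on the test split:", round(stack_accuracy, 4))
```
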
> Question 2: What is the difference between Bagging and Boosting?
>
> Ans 2 :-  
> **Bagging (Bootstrap Aggregating):**
>
> 1. Models are trained **in parallel** (independently).  
> 2. Uses **random sampling with replacement** (bootstrap samples).  
> 3. Focuses on **reducing variance** (helps with overfitting).  
> 4. All models have **equal weight** in the final prediction.  
> 5. Works best with **high-variance, low-bias models** (e.g., deep Decision Trees).  
> 6. Example: **Random Forest**.
>
> **Boosting:**
>
> 1. Models are trained **sequentially** (one after another).  
> 2. Uses the **entire dataset**, but makes each new model concentrate on the
> current errors: AdaBoost gives **higher weight to misclassified samples**,
> while gradient boosting fits the residual errors of the ensemble so far.  
> 3. Focuses mainly on **reducing bias** (it turns weak learners into a strong
> one); variance can also drop, but boosting can overfit noisy data.  
> 4. Models are **weighted based on performance** in the final prediction
> (as in AdaBoost) or scaled by a learning rate (as in gradient boosting).  
> 5. Works best with **weak models** (e.g., shallow Decision Trees).  
> 6. Examples: **AdaBoost, Gradient Boosting, XGBoost, LightGBM**.
>
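
To make "higher weight to misclassified samples" concrete, here is one round of the classic (discrete) AdaBoost weight update for labels in {-1, +1}. The labels and predictions are made-up toy values; gradient boosting methods such as XGBoost do not use this update and fit residuals instead.

```python
import math

# Toy labels and the predictions of one weak learner (hypothetical values).
y_true = [1, 1, -1, -1, 1, -1, 1, -1, 1, 1]
y_pred = [1, 1, -1, -1, 1, -1, 1, -1, -1, -1]  # the last two are wrong

n = len(y_true)
weights = [1 / n] * n  # start with uniform sample weights

# Weighted error of the weak learner (0.2 here: 2 of 10 equally weighted samples).
error = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p)

# Learner weight: larger when the weighted error is smaller.
alpha = 0.5 * math.log((1 - error) / error)

# Scale up misclassified samples, scale down correct ones, then normalize.
weights = [w * math.exp(-alpha * t * p) for w, t, p in zip(weights, y_true, y_pred)]
total = sum(weights)
weights = [w / total for w in weights]

print("alpha:", round(alpha, 3))
print("weight of a correct sample:", round(weights[0], 4))
print("weight of a misclassified sample:", round(weights[-1], 4))
```

After the update each misclassified sample carries weight 0.25 and each correct one 0.0625, so the two mistakes together hold half of the total weight and the next weak learner is pushed to get them right.
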
> Question 3: What is bootstrap sampling and what role does it play in
> Bagging methods like Random Forest?
>
> Ans 3 :-
>
> **Bootstrap Sampling**  
> ●​ **Definition:** Bootstrap sampling is a technique where we create
> multiple **random** **samples from the original dataset with
> replacement**.​
>
> ●​ "With replacement" means the same data point can appear multiple
> times in a sample, while some points may not appear at all.​
>
> ●​ Each bootstrap sample is usually the **same size as the original
> dataset**, but contains a slightly different mix of data points.
>
> **Role in Bagging (e.g., Random Forest):**
>
> 1.​ **Diversity of Models** – Each model (e.g., decision tree) is
> trained on a different bootstrap sample, so they see slightly
> different data.​
>
> 2.​ **Reduces Variance** – Since individual models are trained on
> varied samples, their errors are less likely to be correlated.
> Combining them (majority vote/averaging) smooths out randomness.​
>
> 3. **Reduces Overfitting** – A single decision tree might overfit, but
> averaging across many bootstrap-trained trees leads to a more
> generalizable model.
>
> 4. **Foundation of Random Forest** – Random Forest = Bagging with
> Decision Trees + random feature selection at each split. Bootstrap
> sampling gives each tree its own resampled version of the training
> data, adding randomness and robustness.
>
> **Example:**
>
> If your dataset has 100 rows, a bootstrap sample of 100 rows is
> created by picking rows **randomly with replacement**.
>
> ●​ Some rows may appear 2–3 times.​
>
> ●​ Some rows may not appear at all.​  
> Each tree in Random Forest gets a **different bootstrap sample** →
> leading to diverse trees.​
>
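
The 100-row example can be reproduced with the standard library alone. This sketch draws one bootstrap sample of row indices; the seed and variable names are arbitrary choices for illustration.

```python
import random
from collections import Counter

rng = random.Random(42)
n_rows = 100
row_ids = list(range(n_rows))

# One bootstrap sample: draw n_rows indices with replacement.
sample = rng.choices(row_ids, k=n_rows)

counts = Counter(sample)
unique_rows = len(counts)                                  # rows drawn at least once
repeated_rows = sum(1 for c in counts.values() if c > 1)   # rows drawn 2+ times
left_out = n_rows - unique_rows                            # rows never drawn

print(f"unique rows: {unique_rows}, repeated rows: {repeated_rows}, left out: {left_out}")
```

Re-running with a different seed gives a different mix of repeated and left-out rows, which is exactly what makes each tree in the forest see slightly different data.
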
> Question 4: What are Out-of-Bag (OOB) samples and how is OOB score
> used to evaluate ensemble models?
>
> Ans 4 :-
>
> **Out-of-Bag (OOB) Samples**
>
> ●​ In **bootstrap sampling**, each model (like a decision tree in
> Random Forest) is trained on a sample created **with replacement**.​
>
> ● Because of this, when the bootstrap sample has the same size as the
> dataset, only **about 63% of the distinct data points** appear in it
> (on average).  
> ● The remaining **\~37% of data points are not selected** → these are
> called **Out-of-Bag (OOB) samples** for that model.
>
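
Where the 63% / 37% split comes from: a given row is missed by one draw with probability 1 - 1/n, so it is missed by all n draws with probability (1 - 1/n)^n, which approaches 1/e ≈ 0.368 as n grows. A short check (the function name is made up for this sketch):

```python
import math


def expected_in_bag_fraction(n):
    """Probability that a given row appears at least once in n draws with replacement."""
    return 1 - (1 - 1 / n) ** n


for n in (10, 100, 1000, 100000):
    print(n, round(expected_in_bag_fraction(n), 4))

limit = 1 - 1 / math.e
print("limit as n grows:", round(limit, 4))
```

For n = 100 the in-bag fraction is about 0.634, already close to the limiting value of about 0.632.
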
> **OOB Score (Evaluation Metric)**  
> ●​ Each trained model (tree) can be tested on the **OOB samples** that
> were not used for its training.​  
> ●​ By aggregating predictions of all trees for their corresponding OOB
> samples, we can measure performance.​  
> ●​ The result is called the **OOB Score**.
>
> **Why OOB Score is Useful?**
>
> 1. **Acts like cross-validation** → no need for a separate validation
> set.  
> 2. Provides a **nearly unbiased (often slightly pessimistic) estimate
> of test accuracy** as a by-product of training.  
> 3. Saves data → every data point is used for training by some trees
> and for testing by the trees that did not see it.
>
> **Example in Random Forest:**  
> ● Suppose you train 100 trees with bootstrap sampling.  
> ● Each tree sees only about 63% of the distinct data points.  
> ● Each data point is therefore out-of-bag for roughly 37 of the 100
> trees, and only those trees vote on its OOB prediction.  
> ● The accuracy of these aggregated OOB predictions over all data
> points = **OOB score**.
>
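
In scikit-learn the OOB score does not have to be computed by hand: passing `oob_score=True` to `RandomForestClassifier` makes it available as `oob_score_` after fitting. A minimal sketch on the Breast Cancer dataset (the dataset and the number of trees are arbitrary choices here):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)

rf = RandomForestClassifier(
    n_estimators=200,
    bootstrap=True,   # OOB estimates require bootstrap sampling
    oob_score=True,   # compute the OOB score during fit
    random_state=0,
)
rf.fit(X, y)

# Accuracy of the aggregated OOB predictions over all training samples.
print("OOB score:", round(rf.oob_score_, 4))

# Per-sample class probabilities, averaged only over the trees
# for which that sample was out-of-bag.
print("OOB decision function shape:", rf.oob_decision_function_.shape)
```

No train/test split is needed for this estimate, which is the "saves data" point above. With very few trees some samples may never be out-of-bag, so the estimate is only meaningful with a reasonably large forest.
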
> Question 5: Compare feature importance analysis in a single Decision
> Tree vs. a Random Forest.
>
> Ans 5 :-
>
> **Feature Importance in Decision Tree vs Random Forest**
>
> **1. Decision Tree**  
> ●​ A Decision Tree selects features at each split based on a
> **criterion** (like Gini impurity or Information Gain).​  
> ●​ **Feature importance** is calculated as:​  
> ○​ How much each feature **reduces impurity** (Gini/Entropy) across all
> the splits it is used in.​  
> ○​ Larger impurity reduction = higher importance.​  
> ● **Limitation:**  
> ○ A single tree may be **unstable** → small changes in the data can
> change the structure drastically, so feature importance can vary a lot.  
> ○ Impurity-based importance is biased towards high-cardinality features
> (continuous variables or categorical variables with many levels).
>
> **2. Random Forest**  
> ●​ Random Forest builds **many trees** (bagging + feature
> randomness).​  
> ●​ Feature importance is computed by **averaging the importance scores
> across all** **trees**.​  
> ●​ This makes it **more stable and reliable** compared to a single
> tree.​  
> ● Two common ways Random Forest measures feature importance:  
> 1. **Mean Decrease in Impurity (MDI):** Average impurity reduction
> across all trees.  
> 2. **Mean Decrease in Accuracy (MDA), also called permutation
> importance:** Measures how much accuracy drops when that feature's
> values are randomly shuffled.
>
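
Both importance measures are available in scikit-learn: MDI through the fitted forest's `feature_importances_`, and the permutation-based measure through `sklearn.inspection.permutation_importance`. The sketch below compares their top-5 lists; the split ratio, seed, and `n_repeats` value are illustrative assumptions.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.3, random_state=0, stratify=data.target
)

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

# MDI: impurity-based importance, computed from the training data.
mdi = rf.feature_importances_

# Permutation importance: drop in score when one feature is shuffled,
# measured here on held-out data.
perm = permutation_importance(rf, X_test, y_test, n_repeats=5, random_state=0)

top_mdi = [data.feature_names[i] for i in mdi.argsort()[::-1][:5]]
top_perm = [data.feature_names[i] for i in perm.importances_mean.argsort()[::-1][:5]]

print("Top 5 by MDI:        ", top_mdi)
print("Top 5 by permutation:", top_perm)
```

The two rankings need not agree. When features are strongly correlated, as many are in this dataset, shuffling one of them may barely change accuracy because the model can lean on the others.
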
> Question 6: Write a Python program to:  
> ● Load the Breast Cancer dataset using
> sklearn.datasets.load_breast_cancer()  
> ● Train a Random Forest Classifier  
> ● Print the top 5 most important features based on feature importance
> scores
>
> Ans 6 :-  
> \# Import libraries  
> import numpy as np  
> import pandas as pd
>
> from sklearn.datasets import load_breast_cancer  
> from sklearn.ensemble import RandomForestClassifier
>
> \# Load dataset  
> data = load_breast_cancer()  
> X = data.data  
> y = data.target  
> feature_names = data.feature_names
>
> \# Train Random Forest Classifier  
> rf = RandomForestClassifier(n_estimators=100, random_state=42)  
> rf.fit(X, y)
>
> \# Get feature importance scores  
> importances = rf.feature_importances\_
>
> \# Create a DataFrame for better readability  
> feature_importance_df = pd.DataFrame({  
> "Feature": feature_names,  
> "Importance": importances  
> })
>
> \# Sort by importance (descending)  
> feature_importance_df = feature_importance_df.sort_values(by="Importance", ascending=False)
>
> \# Print top 5 features  
> print("Top 5 Most Important Features:")  
> print(feature_importance_df.head(5))
>
> OUTPUT :-  
> Top 5 Most Important Features:

<table>
<thead>
<tr class="header">
<th></th>
<th>Feature</th>
<th>Importance</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>23</td>
<td>worst area</td>
<td>0.139357</td>
</tr>
<tr class="even">
<td>27</td>
<td>worst concave points</td>
<td>0.132225</td>
</tr>
<tr class="odd">
<td>7</td>
<td>mean concave points</td>
<td>0.107046</td>
</tr>
<tr class="even">
<td>20</td>
<td>worst radius</td>
<td>0.082848</td>
</tr>
<tr class="odd">
<td>22</td>
<td>worst perimeter</td>
<td>0.080850</td>
</tr>
</tbody>
</table>

> So, the **top 5 most important features** for predicting breast cancer
> in this dataset are:
>
> 1.​ Worst area​
>
> 2.​ Worst concave points​
>
> 3.​ Mean concave points​
>
> 4.​ Worst radius​
>
> 5.​ Worst perimeter
>
> Question 7: Write a Python program to:  
> ● Train a Bagging Classifier using Decision Trees on the Iris dataset  
> ● Evaluate its accuracy and compare with a single Decision Tree
>
> Ans 7 :-  
> \# Import libraries  
> import numpy as np  
> from sklearn.datasets import load_iris  
> from sklearn.model_selection import train_test_split  
> from sklearn.tree import DecisionTreeClassifier  
> from sklearn.ensemble import BaggingClassifier  
> from sklearn.metrics import accuracy_score
>
> \# Load Iris dataset  
> iris = load_iris()  
> X, y = iris.data, iris.target
>
> \# Split into train and test sets  
> X_train, X_test, y_train, y_test = train_test_split(  
> X, y, test_size=0.3, random_state=42, stratify=y  
> )
>
> \# Train a single Decision Tree  
> dt = DecisionTreeClassifier(random_state=42)  
> dt.fit(X_train, y_train)  
> y_pred_dt = dt.predict(X_test)  
> dt_accuracy = accuracy_score(y_test, y_pred_dt)
>
> \# Train a Bagging Classifier with Decision Trees  
> \# (scikit-learn 1.2+ names this argument estimator; older releases used base_estimator)  
> bagging_clf = BaggingClassifier(  
> estimator=DecisionTreeClassifier(),  
> n_estimators=50, \# number of trees  
> random_state=42  
> )  
> bagging_clf.fit(X_train, y_train)  
> y_pred_bagging = bagging_clf.predict(X_test)  
> bagging_accuracy = accuracy_score(y_test, y_pred_bagging)
>
> \# Print results  
> print("Accuracy of Single Decision Tree:", dt_accuracy)  
> print("Accuracy of Bagging Classifier:", bagging_accuracy)
>
> OUTPUT :-
>
> Accuracy of Single Decision Tree: 0.9333  
> Accuracy of Bagging Classifier: 0.9333
>
> On this run, both the **single Decision Tree** and the **Bagging
> Classifier** achieved the same accuracy (**93.3%**).
>
> However, in practice, **Bagging often performs better** on more
> complex or noisy datasets because it reduces variance and overfitting.
>
> Question 8: Write a Python program to:  
> ● Train a Random Forest Classifier  
> ● Tune hyperparameters max_depth and n_estimators using GridSearchCV  
> ● Print the best parameters and final accuracy
>
> Ans 8 :-  
> \# Import libraries  
> import numpy as np  
> from sklearn.datasets import load_iris  
> from sklearn.model_selection import train_test_split, GridSearchCV  
> from sklearn.ensemble import RandomForestClassifier  
> from sklearn.metrics import accuracy_score
>
> \# Load dataset (Iris dataset)  
> iris = load_iris()  
> X, y = iris.data, iris.target
>
> \# Split into train and test sets  
> X_train, X_test, y_train, y_test = train_test_split(  
> X, y, test_size=0.3, random_state=42, stratify=y  
> )
>
> \# Define Random Forest and parameter grid  
> rf = RandomForestClassifier(random_state=42)  
> param_grid = {  
> "n_estimators": \[50, 100, 150\],  
> "max_depth": \[None, 3, 5, 7\]  
> }
>
> \# GridSearchCV for hyperparameter tuning  
> grid_search = GridSearchCV(  
> estimator=rf,  
> param_grid=param_grid,  
> cv=5,  
> scoring="accuracy",  
> n_jobs=-1  
> )  
> grid_search.fit(X_train, y_train)
>
> \# Best parameters  
> best_params = grid_search.best_params\_
>
> \# Evaluate on test set  
> best_model = grid_search.best_estimator\_  
> y_pred = best_model.predict(X_test)  
> final_accuracy = accuracy_score(y_test, y_pred)
>
> \# Print results  
> print("Best Parameters:", best_params)  
> print("Final Accuracy on Test Set:", final_accuracy)
>
> OUTPUT :-  
> Best Parameters: {'max_depth': None, 'n_estimators': 100}  
> Final Accuracy on Test Set: 0.9778
>
> So, the tuned Random Forest with max_depth=None and n_estimators=100
> achieved about **97.8% accuracy** on the held-out Iris test set.
>
> Question 9: Write a Python program to:  
> ● Train a Bagging Regressor and a Random Forest Regressor on the
> California Housing dataset  
> ● Compare their Mean Squared Errors (MSE)  
> Ans 9 :-  
> \# Import libraries  
> import numpy as np  
> from sklearn.datasets import fetch_california_housing  
> from sklearn.model_selection import train_test_split  
> from sklearn.tree import DecisionTreeRegressor  
> from sklearn.ensemble import BaggingRegressor, RandomForestRegressor  
> from sklearn.metrics import mean_squared_error
>
> \# Load California Housing dataset  
> housing = fetch_california_housing()
>
> X, y = housing.data, housing.target
>
> \# Split into train and test sets  
> X_train, X_test, y_train, y_test = train_test_split(  
> X, y, test_size=0.3, random_state=42  
> )
>
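> \# Train Bagging Regressor with Decision Tree base  
> \# (scikit-learn 1.2+ names this argument estimator; older releases used base_estimator)  
> bagging_reg = BaggingRegressor(  
> estimator=DecisionTreeRegressor(),  
> n_estimators=50,  
> random_state=42,  
> )  
> \# NOTE: the original listing is cut off at this point. The lines below are a  
> \# reconstruction from the question and the printed output, assuming default settings.  
> bagging_reg.fit(X_train, y_train)  
> bagging_mse = mean_squared_error(y_test, bagging_reg.predict(X_test))
>
> \# Train Random Forest Regressor (n_estimators=100 is an assumed value)  
> rf_reg = RandomForestRegressor(n_estimators=100, random_state=42)  
> rf_reg.fit(X_train, y_train)  
> rf_mse = mean_squared_error(y_test, rf_reg.predict(X_test))
>
> \# Compare Mean Squared Errors  
> print("Mean Squared Error (Bagging Regressor):", bagging_mse)  
> print("Mean Squared Error (Random Forest Regressor):", rf_mse)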
>
> OUTPUT :-
>
> Mean Squared Error (Bagging Regressor): \~0.25  
> Mean Squared Error (Random Forest Regressor): \~0.21
>
> Interpretation:
>
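> ● Both models perform well. A **Random Forest Regressor can achieve a
> lower MSE** because, in addition to bagging, it can apply
> **feature randomness at each split**, which makes the trees less
> correlated. In scikit-learn this only takes effect when max_features is
> set below the total number of features: with the default
> (max_features=1.0) RandomForestRegressor considers every feature at
> each split and behaves much like bagged trees.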
>
> Question 10: You are working as a data scientist at a financial
> institution to predict loan default.
>
> You have access to customer demographic and transaction history data.
> You decide to use ensemble techniques to increase model performance.
> Explain your step-by-step approach to:  
> ● Choose between Bagging or Boosting  
> ● Handle overfitting  
> ● Select base models  
> ● Evaluate performance using cross-validation  
> ● Justify how ensemble learning improves decision-making in this
> real-world context.
>
> Ans 10 :-
>
> **Step-by-Step Approach: Loan Default Prediction with Ensembles**
>
> **1. Choose between Bagging or Boosting**
>
> ●​ **Bagging** (e.g., Random Forest) is good for reducing **variance**
> → useful if base models (like decision trees) tend to overfit.​  
> ●​ **Boosting** (e.g., XGBoost, LightGBM, AdaBoost) is good for
> reducing **bias** → useful if data is complex and single models
> underfit.​  
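> ● **For loan default prediction:**  
> ○ Data is typically **imbalanced** (few defaults vs many
> non-defaults).  
> ○ Boosting often works well on such tabular data because each round
> concentrates on the **hard-to-classify cases** (often the defaults):
> AdaBoost re-weights misclassified samples, and gradient boosting fits
> the remaining errors. Class weights or resampling are usually still
> needed to handle the imbalance itself.  
> ○ I'd start with **Gradient Boosting (XGBoost or LightGBM)** and
> compare with **Random Forest**.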
>
> **2. Handle Overfitting**  
> ● Use **regularization techniques**:  
> ○ For Random Forest → limit max_depth, increase min_samples_split, use
> fewer features per split.  
> ○ For Boosting → use a small learning rate (eta), limit n_estimators, and
> control tree depth.  
> ● Apply **early stopping** in boosting models (stop adding trees when the
> validation metric, e.g. log loss or ROC-AUC, stops improving).  
> ● Perform **cross-validation** to check that the model generalizes well.
>
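
As a sketch of early stopping, scikit-learn's `GradientBoostingClassifier` can hold out part of the training data and stop adding trees once the validation score stops improving. The synthetic imbalanced dataset below stands in for real loan data, and all parameter values are illustrative assumptions; XGBoost and LightGBM offer their own early-stopping options, which are not shown here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic stand-in for loan data: roughly 10% positives ("defaults").
X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=5,
    weights=[0.9, 0.1], random_state=0,
)

gb = GradientBoostingClassifier(
    n_estimators=500,         # upper bound on the number of trees
    learning_rate=0.05,       # small steps regularize the ensemble
    max_depth=3,              # shallow trees as weak learners
    validation_fraction=0.2,  # held out internally for early stopping
    n_iter_no_change=10,      # stop after 10 rounds without improvement
    random_state=0,
)
gb.fit(X, y)

print("Trees actually fitted:", gb.n_estimators_)
```

If `n_estimators_` comes out well below the 500-tree cap, training stopped early because extra trees were no longer helping on the held-out fraction.
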
> **3. Select Base Models**  
> ●​ **Decision Trees** are the most common base learners (they capture
> non-linear relationships well).​  
> ● For Bagging → use **deep decision trees** (high variance, benefit
> from averaging).
>
> ● For Boosting → use **shallow trees** (from depth-1 stumps up to a few
> levels deep; weak learners that improve gradually).
>
> ● Optionally test **Logistic Regression, SVM, or
> Neural Nets** as base models in stacking ensembles.
>
> **4. Evaluate Performance Using Cross-Validation**  
> ● Use **Stratified k-Fold Cross-Validation** (to preserve the
> default/non-default ratio in every fold).  
> ● Metrics to evaluate:  
> ○ **ROC-AUC Score** (handles imbalance better than accuracy).  
> ○ **Precision-Recall Curve / PR-AUC** (focuses on default detection).  
> ○ **F1-score** (balance between precision & recall).  
> ● Compare Bagging vs Boosting across folds → pick the model with the
> best and most consistent cross-validated scores.
>
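
The evaluation plan above could look roughly like this with scikit-learn. The data is synthetic (`make_classification` with about 10% positives) because no real loan dataset is part of this answer, and scikit-learn's own `GradientBoostingClassifier` is used as a stand-in for XGBoost or LightGBM.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic, imbalanced stand-in for loan default data.
X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=5,
    weights=[0.9, 0.1], random_state=0,
)

# Stratified folds keep the default/non-default ratio in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scoring = ["roc_auc", "average_precision", "f1"]

models = {
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "gradient_boosting": GradientBoostingClassifier(random_state=0),
}

results = {}
for name, model in models.items():
    scores = cross_validate(model, X, y, cv=cv, scoring=scoring)
    # Mean of each metric across the 5 folds.
    results[name] = {m: scores[f"test_{m}"].mean() for m in scoring}
    print(name, {m: round(v, 3) for m, v in results[name].items()})
```

Here `average_precision` summarizes the precision-recall curve in a single number. Looking at the spread across folds, not just the mean, helps judge how stable each model is.
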
> **5. Justify How Ensemble Learning Improves Decision-Making**  
> ●​ **Loan default prediction is high-stakes**: a wrong decision can
> cause financial loss (false negative: predicting someone won’t default
> when they actually do).​  
> ● Ensembles provide:  
> 1. **Higher accuracy** → typically fewer false positives and false
> negatives than a single model.  
> 2. **Robustness** → one weak model might fail, but multiple models
> together reduce errors.  
> 3. **Better generalization** → less overfitting to past customer
> data.  
> 4. **More consistent decision-making** → a decision no longer hinges on
> the quirks of a single model, much like several credit analysts
> voting. This reduces statistical bias and variance; it does not by
> itself guarantee fairness across customer groups, which has to be
> checked separately.
>
> **In summary:**
>
> ●​ Use **Boosting (XGBoost/LightGBM)** as the main technique, but
> compare with **Bagging** **(Random Forest)**.​  
> ●​ Prevent **overfitting** using regularization, early stopping, and
> cross-validation.​  
> ●​ Choose **decision trees** as base models.​  
> ●​ Evaluate with **Stratified k-Fold CV** using **ROC-AUC, Precision,
> Recall, F1**.​  
> ● Ensemble learning improves financial decision-making by making
> predictions **more accurate, reliable, and risk-aware**.