Question 6: Write a Python program to:
● Load the Breast Cancer dataset using sklearn.datasets.load_breast_cancer()
● Train a Random Forest Classifier
● Print the top 5 most important features based on feature importance scores.


In [1]:
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

# 1. Load the Breast Cancer dataset
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target

# 2. Train a Random Forest Classifier
# We'll use 100 trees and a random_state for reproducibility
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X, y)

# 3. Get feature importance scores
importances = rf_model.feature_importances_

# Create a DataFrame to view feature names and their scores together
feature_importance_df = pd.DataFrame({
    'Feature': data.feature_names,
    'Importance': importances
})

# Sort by importance and get the top 5
top_5_features = feature_importance_df.sort_values(by='Importance', ascending=False).head(5)

# Print the results
print("Top 5 Most Important Features:")
print("-" * 30)
print(top_5_features.to_string(index=False))

Top 5 Most Important Features:
------------------------------
             Feature  Importance
          worst area    0.139357
worst concave points    0.132225
 mean concave points    0.107046
        worst radius    0.082848
     worst perimeter    0.080850
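The scores above are impurity-based importances averaged over the trees in the forest, and scikit-learn normalizes them to sum to 1. As a rough sketch (the names `per_tree`, `summary`, and `StdAcrossTrees` are illustrative, not part of the original answer), the spread across trees gives a sense of how stable each ranking is:

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target

rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X, y)

# Importance of every feature in every individual tree: shape (n_trees, n_features)
per_tree = np.array([tree.feature_importances_ for tree in rf_model.estimators_])

summary = pd.DataFrame({
    "Feature": data.feature_names,
    "Importance": rf_model.feature_importances_,
    "StdAcrossTrees": per_tree.std(axis=0),
}).sort_values(by="Importance", ascending=False)

total_importance = summary["Importance"].sum()  # normalized scores sum to 1
top_5 = summary.head(5)
print(top_5.to_string(index=False))
print(f"Sum of all importances: {total_importance:.4f}")
```

A large standard deviation relative to the mean suggests that correlated features (for example the radius, perimeter, and area measurements) are sharing importance between them.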


Question 7: Write a Python program to:
● Train a Bagging Classifier using Decision Trees on the Iris dataset
● Evaluate its accuracy and compare with a single Decision Tree

In [2]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score

# 1. Load the Iris dataset
data = load_iris()
X, y = data.data, data.target

# Split into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 2. Train a Single Decision Tree
single_tree = DecisionTreeClassifier(random_state=42)
single_tree.fit(X_train, y_train)
tree_preds = single_tree.predict(X_test)
tree_acc = accuracy_score(y_test, tree_preds)

# 3. Train a Bagging Classifier using Decision Trees
# n_estimators=50 builds an ensemble of 50 trees, each fit on a bootstrap sample
bagging_model = BaggingClassifier(estimator=DecisionTreeClassifier(),
                                  n_estimators=50,
                                  random_state=42)
bagging_model.fit(X_train, y_train)
bagging_preds = bagging_model.predict(X_test)
bagging_acc = accuracy_score(y_test, bagging_preds)

# 4. Compare Results
print(f"Accuracy of a Single Decision Tree: {tree_acc:.4f}")
print(f"Accuracy of Bagging Classifier (50 trees): {bagging_acc:.4f}")

if bagging_acc > tree_acc:
    print("\nResult: The Bagging ensemble outperformed the single tree.")
elif bagging_acc == tree_acc:
    print("\nResult: Both models achieved the same accuracy on this split.")
else:
    print("\nResult: The single tree performed better (this can happen on very small/simple datasets).")

Accuracy of a Single Decision Tree: 1.0000
Accuracy of Bagging Classifier (50 trees): 1.0000

Result: Both models achieved the same accuracy on this split.
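Both models score 1.0 here because the test split holds only 30 samples, so a single split cannot separate them. A minimal sketch of a cross-validated comparison, assuming 10-fold stratified CV is acceptable for this dataset (the fold count and variable names are my own choices):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)

single_tree = DecisionTreeClassifier(random_state=42)
# BaggingClassifier uses a decision tree as its base estimator by default
bagging_model = BaggingClassifier(n_estimators=50, random_state=42)

tree_scores = cross_val_score(single_tree, X, y, cv=cv, scoring="accuracy")
bagging_scores = cross_val_score(bagging_model, X, y, cv=cv, scoring="accuracy")

print(f"Single tree: {tree_scores.mean():.4f} +/- {tree_scores.std():.4f}")
print(f"Bagging:     {bagging_scores.mean():.4f} +/- {bagging_scores.std():.4f}")
```

Reporting the mean together with the standard deviation across folds shows whether any gap between the two models is larger than the fold-to-fold noise.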


Question 8: Write a Python program to:
● Train a Random Forest Classifier
● Tune hyperparameters max_depth and n_estimators using GridSearchCV
● Print the best parameters and final accuracy

In [3]:
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# 1. Load the dataset
data = load_breast_cancer()
X, y = data.data, data.target

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 2. Define the model and the parameter grid
rf = RandomForestClassifier(random_state=42)

param_grid = {
    'n_estimators': [50, 100, 200],  # Number of trees in the forest
    'max_depth': [None, 10, 20, 30]   # Maximum depth of the trees
}

# 3. Initialize GridSearchCV
# cv=5 means 5-fold cross-validation
grid_search = GridSearchCV(estimator=rf, param_grid=param_grid, cv=5, scoring='accuracy')

# 4. Train/Fit the model
grid_search.fit(X_train, y_train)

# 5. Extract Best Parameters and Evaluate
best_params = grid_search.best_params_
best_model = grid_search.best_estimator_

# Predict using the best model found by the grid search
y_pred = best_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

print("Best Parameters Found:")
print(f"- max_depth: {best_params['max_depth']}")
print(f"- n_estimators: {best_params['n_estimators']}")
print("-" * 30)
print(f"Final Accuracy on Test Set: {accuracy:.4f}")

Best Parameters Found:
- max_depth: None
- n_estimators: 200
------------------------------
Final Accuracy on Test Set: 0.9649
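The best parameters alone do not show how close the other candidates were. A sketch that ranks every combination from `cv_results_`, assuming a smaller grid than the one above to keep it quick (the grid values and the `results` name are illustrative):

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Reduced grid (assumption): 2 x 2 = 4 candidates instead of the 12 used above
param_grid = {"n_estimators": [25, 50], "max_depth": [None, 5]}
grid_search = GridSearchCV(RandomForestClassifier(random_state=42), param_grid,
                           cv=5, scoring="accuracy")
grid_search.fit(X_train, y_train)

# One row per parameter combination, best-ranked first
results = pd.DataFrame(grid_search.cv_results_)[
    ["param_n_estimators", "param_max_depth",
     "mean_test_score", "std_test_score", "rank_test_score"]
].sort_values("rank_test_score")
print(results.to_string(index=False))
```

If the top candidates differ by less than one standard deviation, the smaller and cheaper forest is usually a reasonable choice.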


Question 9: Write a Python program to:
● Train a Bagging Regressor and a Random Forest Regressor on the California Housing dataset
● Compare their Mean Squared Errors (MSE)

In [4]:
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

# 1. Load the California Housing dataset
housing = fetch_california_housing()
X, y = housing.data, housing.target

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 2. Train a Bagging Regressor
# We use 100 Decision Trees as our base estimators
bagging_reg = BaggingRegressor(estimator=DecisionTreeRegressor(),
                               n_estimators=100,
                               random_state=42)
bagging_reg.fit(X_train, y_train)
bag_pred = bagging_reg.predict(X_test)
bag_mse = mean_squared_error(y_test, bag_pred)

# 3. Train a Random Forest Regressor
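# Random Forest also uses 100 trees. Note that scikit-learn's RandomForestRegressor
# defaults to max_features=1.0 (all features considered at every split), so with
# default settings it behaves like bagged trees; any small MSE gap mostly reflects
# different random draws. Set max_features below 1.0 to add per-split feature randomness.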
rf_reg = RandomForestRegressor(n_estimators=100, random_state=42)
rf_reg.fit(X_train, y_train)
rf_pred = rf_reg.predict(X_test)
rf_mse = mean_squared_error(y_test, rf_pred)

# 4. Compare Results
print(f"Mean Squared Error (Bagging Regressor): {bag_mse:.4f}")
print(f"Mean Squared Error (Random Forest Regressor): {rf_mse:.4f}")

# Calculate percentage improvement
improvement = ((bag_mse - rf_mse) / bag_mse) * 100
print(f"\nRandom Forest reduced error by {improvement:.2f}% compared to standard Bagging.")

Mean Squared Error (Bagging Regressor): 0.2559
Mean Squared Error (Random Forest Regressor): 0.2554

Random Forest reduced error by 0.22% compared to standard Bagging.
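The near-identical MSE values are expected: scikit-learn's `RandomForestRegressor` considers all features at each split by default (`max_features=1.0`), which makes it essentially bagged trees. A rough sketch of the effect of restricting `max_features`, using synthetic `make_regression` data as a stand-in for California Housing so it runs offline (the data, its sizes, and the 1/3 fraction are assumptions for illustration):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): 8 features, like California Housing
X, y = make_regression(n_samples=1000, n_features=8, n_informative=5,
                       noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

models = {
    # BaggingRegressor uses a decision tree as its base estimator by default
    "Bagging (decision trees)": BaggingRegressor(n_estimators=100, random_state=42),
    # All features at every split: bootstrap sampling is the only randomness
    "RF, max_features=1.0": RandomForestRegressor(
        n_estimators=100, max_features=1.0, random_state=42),
    # A random third of the features at every split: adds feature randomness
    "RF, max_features=1/3": RandomForestRegressor(
        n_estimators=100, max_features=1/3, random_state=42),
}

mse = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    mse[name] = mean_squared_error(y_test, model.predict(X_test))
    print(f"{name}: MSE = {mse[name]:.2f}")
```

Whether the restricted forest wins depends on the data; `max_features` is worth tuning rather than assuming.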


Question 10: You are working as a data scientist at a financial institution to predict loan
default. You have access to customer demographic and transaction history data.
You decide to use ensemble techniques to increase model performance.
Explain your step-by-step approach to:
● Choose between Bagging or Boosting
● Handle overfitting
● Select base models
● Evaluate performance using cross-validation
● Justify how ensemble learning improves decision-making in this real-world context.


Answer: In the financial sector, predicting loan defaults is a high-stakes task where both accuracy (identifying potential defaulters) and interpretability (knowing why a loan was denied) are crucial.

Here is the step-by-step strategic approach:

1. Choosing Between Bagging and Boosting
For loan default prediction, I would choose Boosting (specifically XGBoost or LightGBM).

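Reasoning: Loan datasets are often imbalanced (most borrowers do not default). Boosting builds trees sequentially, each one fitted to the errors of the ensemble so far, so it concentrates on the hard-to-classify cases that distinguish a risky borrower from a safe one. Boosting mainly reduces bias, whereas Bagging mainly reduces variance. Because boosting is more sensitive to noisy labels, I would still benchmark a Random Forest as a baseline and handle the imbalance explicitly (class weights or resampling).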
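Whether Boosting actually beats Bagging is an empirical question for a given dataset. A minimal sketch that benchmarks the two families side by side, assuming simulated data from `make_classification` in place of real loan records and F1 on the default class as the metric (both are assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Simulated imbalanced data: roughly 80% non-default (0), 20% default (1)
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15,
                           n_redundant=5, weights=[0.8, 0.2], random_state=42)

candidates = {
    "Bagging-style (Random Forest)": RandomForestClassifier(
        n_estimators=100, class_weight="balanced", random_state=42),
    "Boosting (Gradient Boosting)": GradientBoostingClassifier(
        n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42),
}

# Same stratified folds for both models so the comparison is like for like
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
mean_f1 = {}
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=skf, scoring="f1")
    mean_f1[name] = scores.mean()
    print(f"{name}: mean F1 = {mean_f1[name]:.4f}")
```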

2. Handling Overfitting
Ensemble models can overfit if allowed to grow too complex. I would manage this by:

Tree Constraints: Limiting the max_depth to prevent trees from memorizing specific transactions.

Learning Rate (Shrinkage): Using a small learning rate (e.g., 0.01 to 0.1) together with more trees, so each tree contributes only a small correction and the model generalizes better.

Early Stopping: Monitoring validation error and stopping training when the error stops decreasing.
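The three controls above can be combined in one model. A minimal sketch using scikit-learn's `GradientBoostingClassifier` on simulated data (the dataset and the specific values 500, 0.05, 0.2, and 10 are illustrative assumptions, not tuned settings):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Simulated imbalanced data standing in for loan records (assumption)
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15,
                           n_redundant=5, weights=[0.8, 0.2], random_state=42)

model = GradientBoostingClassifier(
    n_estimators=500,         # upper bound on the number of trees
    learning_rate=0.05,       # shrinkage: each tree adds a small correction
    max_depth=3,              # shallow trees limit memorization
    validation_fraction=0.2,  # share of the training data held out for monitoring
    n_iter_no_change=10,      # stop after 10 rounds without validation improvement
    random_state=42,
)
model.fit(X, y)

# n_estimators_ is the number of trees kept once early stopping triggered
trees_used = model.n_estimators_
print(f"Trees actually fitted: {trees_used} of {model.n_estimators}")
```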

3. Selecting Base Models
I would use Decision Trees as base models.

Reasoning: They handle non-linear relationships well (e.g., a borrower might be "low risk" if they have a medium income AND low debt, but "high risk" if they have high income AND high debt).

4. Evaluating Performance via Cross-Validation
I would use Stratified K-Fold Cross-Validation.

Since defaults are rare, "stratified" ensures each fold has the same percentage of defaulters as the original data, providing a more reliable estimate of how the model will perform on new customers.
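To illustrate what stratification buys, the sketch below compares the default rate in each validation fold under plain `KFold` and under `StratifiedKFold`. The simulated data and the helper `fold_default_rates` are hypothetical, introduced only for this example:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, StratifiedKFold

# Simulated imbalanced data: roughly 20% defaulters (assumption)
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15,
                           n_redundant=5, weights=[0.8, 0.2], random_state=42)
overall_rate = y.mean()


def fold_default_rates(splitter):
    """Return the share of positive (default) labels in each validation fold."""
    return np.array([y[test_idx].mean() for _, test_idx in splitter.split(X, y)])


plain_rates = fold_default_rates(KFold(n_splits=5, shuffle=True, random_state=42))
stratified_rates = fold_default_rates(
    StratifiedKFold(n_splits=5, shuffle=True, random_state=42))

print(f"Overall default rate: {overall_rate:.3f}")
print(f"KFold fold rates:           {np.round(plain_rates, 3)}")
print(f"StratifiedKFold fold rates: {np.round(stratified_rates, 3)}")
```

The stratified folds stay within a fraction of a percentage point of the overall rate, while plain folds can drift, and the drift matters more as defaults get rarer.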

5. Justifying Ensemble Learning in Finance
Ensemble learning reduces the "model risk" associated with any single algorithm. In a bank, a single decision tree might deny a loan based on one arbitrary cutoff (e.g., Credit Score < 650). An ensemble aggregates hundreds of trees, so the final decision reflects a consensus of evidence rather than one threshold, which typically yields more stable and more accurate risk assessment.

In [5]:
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score, StratifiedKFold

# 1. Simulate a loan default dataset
# (1000 customers, 20 features like income, debt-to-income ratio, etc.)
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15,
                           n_redundant=5, weights=[0.8, 0.2], random_state=42)

# 2. Define the Boosting Model (scikit-learn's GradientBoostingClassifier, used here as a stand-in for XGBoost/LightGBM)
# We limit max_depth and use a learning_rate to handle overfitting
boosting_model = GradientBoostingClassifier(
    n_estimators=100,
    learning_rate=0.1,
    max_depth=3,
    random_state=42
)

# 3. Evaluate using Stratified Cross-Validation
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_scores = cross_val_score(boosting_model, X, y, cv=skf, scoring='f1')

# 4. Fit on the full (simulated) dataset to inspect feature importances
boosting_model.fit(X, y)

# Output the results
print(f"Cross-Validation F1-Scores: {cv_scores}")
print(f"Mean F1-Score: {np.mean(cv_scores):.4f}")
print("\nFeature Importance Scores (first 5 features, not ranked):")
for i in range(5):
    print(f"Feature {i}: {boosting_model.feature_importances_[i]:.4f}")

Cross-Validation F1-Scores: [0.6875     0.64516129 0.61290323 0.71232877 0.6875    ]
Mean F1-Score: 0.6691

Feature Importance Scores (first 5 features, not ranked):
Feature 0: 0.0343
Feature 1: 0.0629
Feature 2: 0.0751
Feature 3: 0.0221
Feature 4: 0.0719
