In [None]:
THEORETICAL QUESTIONS

In [None]:
Ensemble Learning Questions
#1. Can we use Bagging for regression problems?
Answer: Yes, Bagging can be used for regression problems. The BaggingRegressor in scikit-learn is an example. It reduces variance by averaging predictions from multiple base regressors, typically Decision Trees.


#2. What is the difference between multiple model training and single model training?
Answer: Single model training: One model is trained on the dataset and used for predictions.
Multiple model training: Multiple models are trained, and their outputs are combined to improve accuracy and reduce overfitting, as seen in ensemble methods like Bagging and Boosting.


#3. Explain the concept of feature randomness in Random Forest.
Answer: Random Forest introduces feature randomness by considering only a random subset of features at each split of each Decision Tree. This reduces the correlation between trees, so averaging their predictions lowers variance and improves generalization.
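
A minimal sketch of feature randomness in scikit-learn, assuming the `max_features` parameter of `RandomForestClassifier` sets how many features are considered at each split. The dataset, tree count, and the two settings compared are illustrative choices, not part of the original answer.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

results = {}
# "sqrt": each split considers a random subset of sqrt(n_features) features.
# None:   each split considers every feature, so only bootstrap sampling adds randomness.
for max_features in ["sqrt", None]:
    forest = RandomForestClassifier(n_estimators=50, max_features=max_features, random_state=42)
    forest.fit(X_train, y_train)
    results[max_features] = forest.score(X_test, y_test)
    print(f"max_features={max_features}: test accuracy {results[max_features]:.3f}")
```

Which setting scores higher depends on the dataset; the point is only where the feature subset size is controlled.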

#4. What is OOB (Out-of-Bag) Score?
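Answer: The OOB score is an estimate of generalization performance for Bagging-based methods such as Random Forest. Each tree is trained on a bootstrap sample, so roughly a third of the training rows are left out of that tree's sample. Each row is predicted using only the trees that never saw it, and the resulting score (accuracy for classifiers, R² for regressors in scikit-learn) is the OOB score. It works like a built-in validation estimate without a separate hold-out set.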
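
Bootstrap sampling always leaves some training rows out of each sample. The NumPy sketch below estimates how large that out-of-bag share is for a single bootstrap draw; the sample size and seed are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)
n_samples = 10_000

# One bootstrap sample: n_samples row indices drawn with replacement.
bootstrap_idx = rng.integers(0, n_samples, size=n_samples)

# Out-of-bag rows are the ones that were never drawn.
in_bag = np.zeros(n_samples, dtype=bool)
in_bag[bootstrap_idx] = True
oob_fraction = 1.0 - in_bag.mean()

# The chance that a given row is never drawn is (1 - 1/n)^n, which approaches 1/e.
print(f"Out-of-bag fraction: {oob_fraction:.3f}")
```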

#5. How can you measure the importance of features in a Random Forest model?
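Answer: Feature importance is commonly measured using:
Mean Decrease in Impurity (MDI) – Measures how much each feature reduces impurity across all splits and trees. It is computed from the training data and exposed in scikit-learn as feature_importances_.
Permutation Importance – Measures the drop in model performance when a feature's values are randomly shuffled, usually on held-out data.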
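
A short sketch comparing the two measures side by side, assuming scikit-learn's `feature_importances_` attribute for MDI and `sklearn.inspection.permutation_importance` for the shuffling approach. The dataset and `n_repeats=5` are illustrative choices.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42
)

forest = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_train, y_train)

# MDI: impurity-based importances learned during training.
mdi = forest.feature_importances_

# Permutation importance: mean drop in test accuracy when one feature is shuffled.
perm = permutation_importance(forest, X_test, y_test, n_repeats=5, random_state=42)

# Show the five features ranked highest by MDI with both scores.
top = mdi.argsort()[::-1][:5]
for i in top:
    print(f"{data.feature_names[i]}: MDI {mdi[i]:.4f}, permutation {perm.importances_mean[i]:.4f}")
```

The two rankings need not agree, particularly when features are strongly correlated.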

#6. Explain the working principle of a Bagging Classifier.
Answer: A Bagging Classifier trains multiple base learners (e.g., Decision Trees) independently, each on a different bootstrap sample of the training data, and aggregates their predictions through majority voting. The regression counterpart averages the predictions instead.
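
The principle can be sketched by hand, without `BaggingClassifier`. This is a simplified illustration rather than scikit-learn's implementation: it uses a plain majority vote over binary labels, and the dataset, seed, and number of trees are assumptions.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

rng = np.random.default_rng(42)
n_estimators = 25  # odd, so a binary majority vote cannot tie

all_votes = []
for _ in range(n_estimators):
    # Bootstrap sample: draw training rows with replacement.
    idx = rng.integers(0, len(X_train), size=len(X_train))
    tree = DecisionTreeClassifier(random_state=0).fit(X_train[idx], y_train[idx])
    all_votes.append(tree.predict(X_test))

votes = np.array(all_votes)  # shape: (n_estimators, n_test_samples)

# Majority vote for 0/1 labels: predict 1 when more than half of the trees say 1.
y_pred = (votes.mean(axis=0) > 0.5).astype(int)
accuracy = (y_pred == y_test).mean()
print(f"Hand-rolled bagging accuracy: {accuracy:.3f}")
```

scikit-learn's `BaggingClassifier` averages predicted class probabilities when the base estimator provides `predict_proba`, so its output can differ slightly from this hard vote.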

#7. How do you evaluate a Bagging Classifier’s performance?
Answer: Performance is evaluated using:
Accuracy (for classification tasks)
Precision, Recall, F1-score (for imbalanced datasets)
ROC-AUC Score (for probabilistic outputs)


#8. How does a Bagging Regressor work?
Answer: A Bagging Regressor trains multiple base regressors (e.g., Decision Trees) on bootstrap samples and averages their predictions to reduce variance and improve generalization.


#9. What is the main advantage of ensemble techniques?
Answer: Ensemble techniques improve accuracy, reduce overfitting, and provide robustness by combining multiple models instead of relying on a single model.


#10. What is the main challenge of ensemble methods?
Answer: The main challenge is increased computational complexity, as multiple models must be trained, requiring more time and resources.

#11. Explain the key idea behind ensemble techniques.
Answer: Ensemble techniques combine predictions from multiple models to improve accuracy and stability, leveraging diversity among base learners.

#12. What is a Random Forest Classifier?
Answer: A Random Forest Classifier is an ensemble model that builds multiple Decision Trees, each on a bootstrap sample with a random subset of features considered at each split, and aggregates their predictions using majority voting.

#13. What are the main types of ensemble techniques?
Answer: The main types include:

Bagging – Reduces variance by training models independently on bootstrap samples and aggregating their outputs.
Boosting – Reduces bias by training models sequentially, each one focusing on the errors of its predecessors.
Stacking – Combines the predictions of several different models using a meta-learner.

#14. What is ensemble learning in machine learning?
Answer: Ensemble learning is a technique that combines multiple models to improve predictive performance beyond what a single model can achieve.

#15. When should we avoid using ensemble methods?
Answer: Ensemble methods should be avoided when:

Computational resources are limited.
The dataset is very small, so a complex ensemble may overfit and adds cost for little benefit.
A single model already performs well.

#16. How does Bagging help in reducing overfitting?
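Answer: Bagging reduces overfitting by training multiple models on different bootstrap samples and averaging their outputs. Each individual model may still fit noise in its own sample, but those errors differ from model to model and partly cancel out when the predictions are averaged, which lowers the variance of the ensemble.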
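
The variance-reduction effect can be illustrated with a small simulation. It rests on an idealized assumption that does not hold exactly for real bagged trees: each "model" here is the true value plus independent unit-variance noise. Real base learners are correlated, so the reduction in practice is smaller.

```python
import numpy as np

rng = np.random.default_rng(0)
true_value = 3.0
n_models = 25
n_trials = 20_000

# Each row is one trial; each column is one model's noisy prediction of true_value.
preds = true_value + rng.normal(0.0, 1.0, size=(n_trials, n_models))

single_var = preds[:, 0].var()         # variance of one model's prediction
bagged_var = preds.mean(axis=1).var()  # variance of the average of n_models predictions

# With independent errors, the averaged variance is about single_var / n_models.
print(f"Single model variance: {single_var:.3f}")
print(f"Averaged variance:     {bagged_var:.3f}")
```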

#17. Why is Random Forest better than a single Decision Tree?
Answer: Random Forest reduces overfitting by averaging multiple trees, leading to better generalization and stability compared to a single Decision Tree.

#18. What is the role of bootstrap sampling in Bagging?
Answer: Bootstrap sampling creates diverse training datasets by randomly selecting samples with replacement, ensuring models learn different aspects of the data.

#19. What are some real-world applications of ensemble techniques?
Answer:
Finance: Fraud detection
Healthcare: Disease diagnosis
E-commerce: Recommendation systems
Self-driving cars: Object detection

#20. What is the difference between Bagging and Boosting?
Answer: Bagging trains multiple models independently (and in parallel if desired) on bootstrap samples and averages or votes on their predictions, mainly to reduce variance.
Boosting trains models sequentially, where each model corrects errors from the previous one, mainly to reduce bias.
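
A brief side-by-side sketch, assuming `BaggingClassifier` over full-depth trees as the bagging example and `GradientBoostingClassifier` over shallow trees as the boosting example. The dataset and hyperparameters are illustrative, and the scores are not a general ranking of the two approaches.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier, GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

models = {
    # Bagging: 50 independent trees, each fit on its own bootstrap sample.
    # The base estimator is passed positionally because its keyword name
    # changed between scikit-learn versions.
    "Bagging": BaggingClassifier(DecisionTreeClassifier(), n_estimators=50, random_state=42),
    # Boosting: 50 shallow trees fit one after another on the remaining errors.
    "Boosting": GradientBoostingClassifier(n_estimators=50, max_depth=2, random_state=42),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)
    print(f"{name}: test accuracy {scores[name]:.3f}")
```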

In [None]:
PRACTICAL QUESTIONS

In [None]:

#Q21. Train a Bagging Classifier using Decision Trees on a sample dataset and print model accuracy.

from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load dataset
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train Bagging Classifier (the `estimator` parameter was named `base_estimator` before scikit-learn 1.2)
bagging_clf = BaggingClassifier(estimator=DecisionTreeClassifier(), n_estimators=10, random_state=42)
bagging_clf.fit(X_train, y_train)

# Predictions and accuracy
y_pred = bagging_clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Bagging Classifier Accuracy:", accuracy)




#Q22. Train a Bagging Regressor using Decision Trees and evaluate using Mean Squared Error (MSE).
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.datasets import load_diabetes
from sklearn.metrics import mean_squared_error

# Load dataset
X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train Bagging Regressor (the `estimator` parameter was named `base_estimator` before scikit-learn 1.2)
bagging_reg = BaggingRegressor(estimator=DecisionTreeRegressor(), n_estimators=10, random_state=42)
bagging_reg.fit(X_train, y_train)

# Predictions and MSE
y_pred = bagging_reg.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
print("Bagging Regressor MSE:", mse)



#Q23. Train a Random Forest Classifier on the Breast Cancer dataset and print feature importance scores.
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_breast_cancer

# Load dataset
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train Random Forest Classifier
rf_clf = RandomForestClassifier(n_estimators=100, random_state=42)
rf_clf.fit(X_train, y_train)
# Feature Importance
importances = rf_clf.feature_importances_
for i, v in enumerate(importances):
    print(f"Feature {i}: Importance {v:.4f}")



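#Q24. Train a Random Forest Regressor and compare its performance with a single Decision Tree.
from sklearn.ensemble import RandomForestRegressor

# Reload the diabetes regression data under separate names: X_train/y_train currently hold the
# breast cancer classification split from Q23, which the later classifier cells still rely on
X_reg, y_reg = load_diabetes(return_X_y=True)
X_train_reg, X_test_reg, y_train_reg, y_test_reg = train_test_split(X_reg, y_reg, test_size=0.2, random_state=42)

# Train Decision Tree Regressor
tree_reg = DecisionTreeRegressor(random_state=42)
tree_reg.fit(X_train_reg, y_train_reg)
tree_pred = tree_reg.predict(X_test_reg)
tree_mse = mean_squared_error(y_test_reg, tree_pred)

# Train Random Forest Regressor
rf_reg = RandomForestRegressor(n_estimators=100, random_state=42)
rf_reg.fit(X_train_reg, y_train_reg)
rf_pred = rf_reg.predict(X_test_reg)
rf_mse = mean_squared_error(y_test_reg, rf_pred)

print("Decision Tree MSE:", tree_mse)
print("Random Forest MSE:", rf_mse)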


#Q25. Compute the Out-of-Bag (OOB) Score for a Random Forest Classifier.

rf_clf_oob = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=42)
rf_clf_oob.fit(X_train, y_train)
print("OOB Score:", rf_clf_oob.oob_score_)


#Q26. Train a Bagging Classifier using SVM as a base estimator and print accuracy.
from sklearn.svm import SVC

# The `estimator` parameter was named `base_estimator` before scikit-learn 1.2
bagging_svm = BaggingClassifier(estimator=SVC(), n_estimators=10, random_state=42)
bagging_svm.fit(X_train, y_train)

y_pred = bagging_svm.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Bagging Classifier with SVM Accuracy:", accuracy)


#Q27. Train a Random Forest Classifier with different numbers of trees and compare accuracy.

for n in [10, 50, 100, 200]:
    rf_clf = RandomForestClassifier(n_estimators=n, random_state=42)
    rf_clf.fit(X_train, y_train)
    acc = accuracy_score(y_test, rf_clf.predict(X_test))
    print(f"Random Forest with {n} trees: Accuracy {acc}")



#Q28. Train a Bagging Classifier using Logistic Regression as a base estimator and print AUC score.

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

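# max_iter is raised so the solver can converge on the unscaled features;
# `estimator` was named `base_estimator` before scikit-learn 1.2
bagging_lr = BaggingClassifier(estimator=LogisticRegression(max_iter=10000), n_estimators=10, random_state=42)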
bagging_lr.fit(X_train, y_train)

y_pred_proba = bagging_lr.predict_proba(X_test)[:, 1]
roc_auc = roc_auc_score(y_test, y_pred_proba)
print("Bagging Classifier with Logistic Regression AUC Score:", roc_auc)


#Q29. Train a Random Forest Regressor and analyze feature importance scores.

importances = rf_reg.feature_importances_
for i, v in enumerate(importances):
    print(f"Feature {i}: Importance {v:.4f}")


#Q30. Train an ensemble model using both Bagging and Random Forest and compare accuracy.

# The `estimator` parameter was named `base_estimator` before scikit-learn 1.2
bagging_clf = BaggingClassifier(estimator=DecisionTreeClassifier(), n_estimators=10, random_state=42)
bagging_clf.fit(X_train, y_train)

rf_clf = RandomForestClassifier(n_estimators=100, random_state=42)
rf_clf.fit(X_train, y_train)

bagging_acc = accuracy_score(y_test, bagging_clf.predict(X_test))
rf_acc = accuracy_score(y_test, rf_clf.predict(X_test))

print("Bagging Classifier Accuracy:", bagging_acc)
print("Random Forest Classifier Accuracy:", rf_acc)


#31. Train a Random Forest Classifier and tune hyperparameters using GridSearchCV

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Define hyperparameter grid
param_grid = {
    'n_estimators': [50, 100, 150],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5, 10]}

# Perform GridSearchCV
grid_search = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=5, n_jobs=-1)
grid_search.fit(X_train, y_train)

print("Best Parameters:", grid_search.best_params_)
print("Best Accuracy:", grid_search.best_score_)


#32. Train a Bagging Regressor with different numbers of base estimators and compare performance

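from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split

# Use the diabetes regression data under separate names, so the breast cancer
# classification split (X_train, y_train, ...) is not reused for a regressor
X_reg, y_reg = load_diabetes(return_X_y=True)
X_train_reg, X_test_reg, y_train_reg, y_test_reg = train_test_split(X_reg, y_reg, test_size=0.2, random_state=42)

for n in [5, 10, 20, 50]:
    # `estimator` was named `base_estimator` before scikit-learn 1.2
    bagging_reg = BaggingRegressor(estimator=DecisionTreeRegressor(), n_estimators=n, random_state=42)
    bagging_reg.fit(X_train_reg, y_train_reg)
    mse = mean_squared_error(y_test_reg, bagging_reg.predict(X_test_reg))
    print(f"Bagging Regressor with {n} estimators - MSE: {mse}")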


#33. Train a Random Forest Classifier and analyze misclassified samples

import numpy as np

y_pred = rf_clf.predict(X_test)
misclassified = np.where(y_pred != y_test)[0]
print("Misclassified samples:", misclassified)


#34. Train a Bagging Classifier and compare its performance with a single Decision Tree Classifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score

dt_clf = DecisionTreeClassifier(random_state=42)
dt_clf.fit(X_train, y_train)
dt_acc = accuracy_score(y_test, dt_clf.predict(X_test))

# The `estimator` parameter was named `base_estimator` before scikit-learn 1.2
bagging_clf = BaggingClassifier(estimator=DecisionTreeClassifier(), n_estimators=10, random_state=42)
bagging_clf.fit(X_train, y_train)
bagging_acc = accuracy_score(y_test, bagging_clf.predict(X_test))

print("Decision Tree Accuracy:", dt_acc)
print("Bagging Classifier Accuracy:", bagging_acc)


#35. Train a Random Forest Classifier and visualize the confusion matrix
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix

y_pred = rf_clf.predict(X_test)
cm = confusion_matrix(y_test, y_pred)

sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.xlabel("Predicted Label")
plt.ylabel("True Label")
plt.title("Confusion Matrix")
plt.show()



#36. Train a Stacking Classifier using Decision Trees, SVM, and Logistic Regression, and compare accuracy
from sklearn.ensemble import StackingClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression

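# Base models feed their predictions to the final Logistic Regression meta-learner.
# max_iter is raised so Logistic Regression can converge on the unscaled features.
stacking_clf = StackingClassifier(
    estimators=[
        ('dt', DecisionTreeClassifier(random_state=42)),
        ('svm', SVC(probability=True, random_state=42)),
        ('lr', LogisticRegression(max_iter=10000))
    ],
    final_estimator=LogisticRegression())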

stacking_clf.fit(X_train, y_train)
stacking_acc = accuracy_score(y_test, stacking_clf.predict(X_test))
print("Stacking Classifier Accuracy:", stacking_acc)


#37. Train a Random Forest Classifier and print the top 5 most important features

import numpy as np

importances = rf_clf.feature_importances_
sorted_indices = np.argsort(importances)[::-1]

print("Top 5 Most Important Features:")
for i in range(5):
    print(f"Feature {sorted_indices[i]}: Importance {importances[sorted_indices[i]]:.4f}")


#38. Train a Bagging Classifier and evaluate performance using Precision, Recall, and F1-score
from sklearn.metrics import precision_score, recall_score, f1_score

y_pred = bagging_clf.predict(X_test)

print("Precision:", precision_score(y_test, y_pred, average='macro'))
print("Recall:", recall_score(y_test, y_pred, average='macro'))
print("F1-score:", f1_score(y_test, y_pred, average='macro'))


#39. Train a Random Forest Classifier and analyze the effect of max_depth on accuracy
for depth in [5, 10, 15, None]:
    rf_clf = RandomForestClassifier(n_estimators=100, max_depth=depth, random_state=42)
    rf_clf.fit(X_train, y_train)
    acc = accuracy_score(y_test, rf_clf.predict(X_test))
    print(f"Random Forest with max_depth={depth}: Accuracy {acc}")


#40. Train a Bagging Regressor using different base estimators (DecisionTree and KNeighbors) and compare performance
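from sklearn.neighbors import KNeighborsRegressor

# Use the diabetes regression data under separate names, so the breast cancer
# classification split (X_train, y_train, ...) is not reused for a regressor
X_reg, y_reg = load_diabetes(return_X_y=True)
X_train_reg, X_test_reg, y_train_reg, y_test_reg = train_test_split(X_reg, y_reg, test_size=0.2, random_state=42)

base_estimators = {
    "Decision Tree": DecisionTreeRegressor(),
    "KNeighbors": KNeighborsRegressor()}

for name, estimator in base_estimators.items():
    # `estimator` was named `base_estimator` before scikit-learn 1.2
    bagging_reg = BaggingRegressor(estimator=estimator, n_estimators=10, random_state=42)
    bagging_reg.fit(X_train_reg, y_train_reg)
    mse = mean_squared_error(y_test_reg, bagging_reg.predict(X_test_reg))
    print(f"{name} as base estimator - MSE: {mse}")


#41. Train a Random Forest Classifier and evaluate its performance using ROC-AUC Score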

from sklearn.metrics import roc_auc_score

y_pred_proba = rf_clf.predict_proba(X_test)[:, 1]
roc_auc = roc_auc_score(y_test, y_pred_proba)
print("ROC-AUC Score:", roc_auc)




#42. Train a Bagging Classifier and evaluate its performance using cross-validation

from sklearn.model_selection import cross_val_score

scores = cross_val_score(bagging_clf, X, y, cv=5)
print("Cross-validation Accuracy:", scores.mean())



#43. Train a Random Forest Classifier and plot the Precision-Recall curve
from sklearn.metrics import precision_recall_curve

precision, recall, _ = precision_recall_curve(y_test, y_pred_proba)

plt.plot(recall, precision, marker='.')
plt.xlabel("Recall")
plt.ylabel("Precision")
plt.title("Precision-Recall Curve")
plt.show()


#44. Train a Stacking Classifier with Random Forest and Logistic Regression and compare accuracy

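# random_state makes the forest reproducible; max_iter is raised so
# Logistic Regression can converge on the unscaled features
stacking_clf = StackingClassifier(
    estimators=[
        ('rf', RandomForestClassifier(random_state=42)),
        ('lr', LogisticRegression(max_iter=10000))
    ],
    final_estimator=LogisticRegression())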

stacking_clf.fit(X_train, y_train)
stacking_acc = accuracy_score(y_test, stacking_clf.predict(X_test))
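print("Stacking Classifier Accuracy:", stacking_acc)


#45. Train a Bagging Regressor with and without bootstrap sampling and compare performance
# Use the diabetes regression data under separate names, so the breast cancer
# classification split (X_train, y_train, ...) is not reused for a regressor
X_reg, y_reg = load_diabetes(return_X_y=True)
X_train_reg, X_test_reg, y_train_reg, y_test_reg = train_test_split(X_reg, y_reg, test_size=0.2, random_state=42)

for bootstrap in [True, False]:
    # `estimator` was named `base_estimator` before scikit-learn 1.2
    bagging_reg = BaggingRegressor(estimator=DecisionTreeRegressor(), n_estimators=10, bootstrap=bootstrap, random_state=42)
    bagging_reg.fit(X_train_reg, y_train_reg)
    mse = mean_squared_error(y_test_reg, bagging_reg.predict(X_test_reg))
    print(f"Bagging Regressor with bootstrap={bootstrap} - MSE: {mse}")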
