# **THEORY QUESTIONS**

Q1. Can we use Bagging for regression problems?

- Yes, Bagging can be used for regression problems.
Bagging, or Bootstrap Aggregating, is an ensemble method that can be applied to both classification and regression tasks.

 In regression, Bagging works by training multiple models (often decision trees) on different subsets of the training data and then averaging their predictions to improve accuracy and reduce variance.

Q2. What is the difference between multiple model training and single model training?

- Multiple Model Training:

 Involves training several models, either independently (as in bagging) or sequentially (as in boosting), on the same dataset or on different subsets of it.
 Aims to improve performance by combining the predictions of multiple models (e.g., through averaging or voting).
 Typically reduces the risk of overfitting and increases robustness.

- Single Model Training:

 Involves training one model on the entire dataset.

 Simpler and faster but may lead to overfitting, especially with complex models.

 Performance is solely dependent on the single model's ability to generalize from the training data.

Q3. Explain the concept of feature randomness in Random Forest.

- Feature Randomness:

 In Random Forest, feature randomness means that each split in each decision tree considers only a random subset of the features rather than all of them.

 This randomness helps to ensure that the trees are diverse and reduces the correlation between them, which enhances the overall model's performance.

 By using different subsets of features, Random Forest can capture a wider range of patterns in the data, leading to better generalization and reduced overfitting.
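A minimal sketch of per-split feature selection in plain Python, in the spirit of what the `max_features` setting controls in scikit-learn's Random Forest. The function name `candidate_features` and the square-root rule for the subset size are assumptions made for illustration, not something stated above.

```python
import math
import random

def candidate_features(n_features, rng, max_features=None):
    """Return the feature indices that one split is allowed to consider."""
    if max_features is None:
        # Assumed rule of thumb: about sqrt(n_features) candidates per split
        max_features = max(1, int(math.sqrt(n_features)))
    # Sample without replacement so no feature is offered twice
    return sorted(rng.sample(range(n_features), max_features))

rng = random.Random(0)
# Three successive splits each draw their own random subset of 16 features
splits = [candidate_features(16, rng) for _ in range(3)]
for i, feats in enumerate(splits):
    print(f"split {i}: candidate features {feats}")
```

Because each split only sees a few candidate features, a single dominant feature cannot be chosen at the top of every tree, which is what keeps the trees from becoming near copies of each other.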

Q4. What is OOB (Out-of-Bag) Score?

- Definition:

 The Out-of-Bag (OOB) Score is a validation technique used in Random Forest models to estimate the model's performance without needing a separate validation dataset.

- How it Works:

 During the training of each decision tree, some samples are not included in the bootstrap sample (the training set for that tree). These samples are referred to as "out-of-bag" samples.

 The OOB Score is calculated by predicting each sample using only the trees that did not see it during training and then scoring those predictions, which gives an approximately unbiased estimate of the model's performance on unseen data.

- Advantages:

 No data leakage, as the OOB samples are not used in training the trees.

 Provides a reliable estimate of model performance, especially useful for small to medium-sized datasets.
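The bookkeeping behind the OOB score can be sketched in plain Python. This is an illustrative sketch only: the helper `bootstrap_indices` and the tiny sizes (10 samples, 5 trees) are assumptions, and no actual trees are trained. It only shows which trees would be allowed to score which sample.

```python
import random

def bootstrap_indices(n_samples, rng):
    """Draw n_samples row indices with replacement (one tree's training set)."""
    return [rng.randrange(n_samples) for _ in range(n_samples)]

rng = random.Random(42)
n_samples, n_trees = 10, 5

# Rows each tree trained on (duplicates collapse in the set)
in_bag = [set(bootstrap_indices(n_samples, rng)) for _ in range(n_trees)]
# A sample is out-of-bag for a tree if that tree never drew it
oob = [set(range(n_samples)) - bag for bag in in_bag]

# For each sample, the trees that are allowed to score it
voters = {i: [t for t in range(n_trees) if i in oob[t]] for i in range(n_samples)}
for i, trees in voters.items():
    print(f"sample {i}: scored by trees {trees}")
```

A sample that happens to land in every bag gets no OOB prediction at all, which is one reason OOB estimates become more dependable as the number of trees grows.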

Q5. How can you measure the importance of features in a Random Forest model?

- Feature Importance Measurement:

 Random Forest provides several methods to measure the importance of features:

- Mean Decrease Impurity (MDI):

 Measures the total decrease in node impurity (e.g., Gini impurity or entropy) brought by a feature across all trees in the forest.

 Features that lead to larger decreases in impurity are considered more important.

- Mean Decrease Accuracy (MDA), also known as Permutation Importance:

 Shuffles the values of one feature at a time and measures the resulting change in model performance, ideally on held-out or out-of-bag data.

 A significant drop in performance indicates that the feature is important for the model's predictions; a negligible drop indicates that the model relies on it very little.
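The shuffling procedure can be sketched from scratch as follows. The toy data, the stand-in `predict` function, and the helper names are all hypothetical. In practice the model would be a trained forest and the data a held-out set.

```python
import random

def accuracy(predict, X, y):
    return sum(predict(row) == label for row, label in zip(X, y)) / len(y)

def permutation_importance(predict, X, y, feature, rng):
    """Drop in accuracy after shuffling one feature column."""
    baseline = accuracy(predict, X, y)
    column = [row[feature] for row in X]
    rng.shuffle(column)
    # Rebuild the rows with only the chosen column shuffled
    X_perm = [row[:feature] + [v] + row[feature + 1:] for row, v in zip(X, column)]
    return baseline - accuracy(predict, X_perm, y)

rng = random.Random(0)
# Toy data: the label depends only on feature 0; feature 1 is pure noise
X = [[rng.random(), rng.random()] for _ in range(200)]
y = [int(row[0] > 0.5) for row in X]

def predict(row):
    # Hypothetical stand-in for a trained model
    return int(row[0] > 0.5)

imp0 = permutation_importance(predict, X, y, 0, rng)
imp1 = permutation_importance(predict, X, y, 1, rng)
print(f"feature 0: {imp0:.3f}, feature 1: {imp1:.3f}")
```

Shuffling the noise feature leaves every prediction unchanged, so its importance is exactly zero, while shuffling the informative feature pushes accuracy toward chance.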

Q6. Explain the working principle of a Bagging Classifier.

- Definition: Bagging, or Bootstrap Aggregating, is an ensemble learning technique that aims to improve the stability and accuracy of machine learning algorithms.

 Working Steps:

 Data Sampling: Create multiple subsets of the training dataset using bootstrap sampling (random sampling with replacement). Each subset may contain duplicate instances.

 Model Training: Train a separate model (often a high-variance learner such as an unpruned decision tree) on each subset of the data independently.

 Prediction Aggregation: For classification tasks, the final prediction is made by majority voting among the predictions of all models. For regression tasks, the predictions are averaged.

 Goal: The primary goal of bagging is to reduce variance and prevent overfitting by averaging the predictions of multiple models.
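The three working steps can be sketched from scratch in plain Python. This is a toy illustration under stated assumptions: the base learner is a one-feature threshold rule (`fit_stump`) rather than a full decision tree, and all function names are hypothetical.

```python
import random
from collections import Counter

def fit_stump(X, y):
    """Base learner: the threshold on a 1-D feature with the fewest training errors."""
    best = None
    for t in sorted(set(X)):
        for left, right in ((0, 1), (1, 0)):
            errors = sum((left if x <= t else right) != label for x, label in zip(X, y))
            if best is None or errors < best[0]:
                best = (errors, t, left, right)
    _, t, left, right = best
    return lambda x: left if x <= t else right

def fit_bagging(X, y, n_estimators, rng):
    models = []
    for _ in range(n_estimators):
        # Step 1: bootstrap sample (same size, drawn with replacement)
        idx = [rng.randrange(len(X)) for _ in range(len(X))]
        # Step 2: train one model on that sample
        models.append(fit_stump([X[i] for i in idx], [y[i] for i in idx]))
    return models

def predict_bagging(models, x):
    # Step 3: majority vote over the individual predictions
    votes = Counter(m(x) for m in models)
    return votes.most_common(1)[0][0]

rng = random.Random(1)
X = [rng.uniform(0, 10) for _ in range(100)]
y = [int(x > 5) for x in X]          # class 1 when x is above 5
models = fit_bagging(X, y, n_estimators=11, rng=rng)
print(predict_bagging(models, 2.0), predict_bagging(models, 8.0))
```

An odd number of estimators is used here so that a two-class vote cannot tie.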

Q7. How do you evaluate a Bagging Classifier’s performance?

- Common Evaluation Metrics:

 Accuracy: The proportion of correct predictions made by the model.

 Precision: The ratio of true positive predictions to the total predicted positives.

 Recall (Sensitivity): The ratio of true positive predictions to the total actual positives.

 F1 Score: The harmonic mean of precision and recall, providing a balance between the two.

 ROC-AUC: The area under the Receiver Operating Characteristic curve, which evaluates the trade-off between true positive rate and false positive rate.

 Cross-Validation: Use k-fold cross-validation to assess the model's performance on different subsets of the data, ensuring that the evaluation is robust and not dependent on a single train-test split.

 Confusion Matrix: Analyze the confusion matrix to understand the types of errors made by the classifier.
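As a worked example of how these metrics relate to the confusion-matrix counts, here is a small sketch for the binary case. The helper name `binary_metrics` is hypothetical, and returning 0.0 when a ratio is undefined is an assumed convention.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall and F1 from raw labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = len(pairs) - tp - fp - fn
    precision = tp / (tp + fp) if tp + fp else 0.0   # assumed convention when undefined
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# 4 actual positives, 6 actual negatives; one miss and one false alarm
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
m = binary_metrics(y_true, y_pred)
print(m)
```

Here TP = 3, FN = 1, FP = 1 and TN = 5, so accuracy is 0.8 while precision, recall and F1 are all 0.75.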

Q8. How does a Bagging Regressor work?

- Definition: A Bagging Regressor is similar to a Bagging Classifier but is used for regression tasks.

 Working Steps:

 Data Sampling: Generate multiple bootstrap samples from the original dataset.

 Model Training: Train a separate regression model (e.g., decision tree regressor) on each bootstrap sample independently.

 Prediction Aggregation: Combine the predictions from all models by averaging them to produce the final output.

 Goal: The main objective is to reduce variance in the predictions, leading to a more stable and accurate regression model.
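For regression the only change from the classifier is the aggregation step: predictions are averaged rather than voted on. The sketch below is illustrative only. It assumes a 1-nearest-neighbour rule (`fit_1nn`) as a deliberately high-variance base learner in place of a decision tree regressor, and all helper names are hypothetical.

```python
import random
from statistics import mean

def fit_1nn(X, y):
    """Base learner: return the target of the closest training point."""
    pairs = list(zip(X, y))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]

def fit_bagging_regressor(X, y, n_estimators, rng):
    models = []
    for _ in range(n_estimators):
        # Bootstrap sample, then one base model per sample
        idx = [rng.randrange(len(X)) for _ in range(len(X))]
        models.append(fit_1nn([X[i] for i in idx], [y[i] for i in idx]))
    return models

def predict_bagging_regressor(models, x):
    # Aggregate by averaging instead of voting
    return mean(m(x) for m in models)

rng = random.Random(0)
X = [i / 10 for i in range(100)]             # inputs from 0.0 to 9.9
y = [2 * x + rng.gauss(0, 1) for x in X]     # noisy line y = 2x

models = fit_bagging_regressor(X, y, n_estimators=25, rng=rng)
pred = predict_bagging_regressor(models, 5.0)
print(f"bagged prediction at x=5.0: {pred:.2f} (noise-free value is 10.0)")
```

Each individual 1-NN model simply copies one noisy target, so averaging over bootstrap samples is what smooths the prediction.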

Q9. What is the main advantage of ensemble techniques?

- Improved Accuracy: Ensemble techniques often lead to better predictive performance than individual models by combining their strengths.

 Reduction of Overfitting: Techniques like bagging help reduce overfitting by averaging predictions from multiple models, which stabilizes the output.

 Robustness: Ensemble methods are generally more robust to noise and outliers in the data, as they leverage multiple models to make predictions.

 Flexibility: They can be applied to various types of models and problems, making them versatile in different machine learning scenarios.

Q10. What is the main challenge of ensemble methods?

- Increased Complexity: Ensemble methods can be more complex to implement and interpret compared to single models, making them harder to debug and understand.

 Computational Cost: Training multiple models can be computationally expensive and time-consuming, especially with large datasets.

 Risk of Overfitting: While ensemble methods reduce overfitting, they can still overfit if the base models are too complex or if the ensemble is not properly tuned.

 Dependency on Base Models: The performance of ensemble methods heavily relies on the choice of base models. Poorly chosen models can lead to suboptimal performance.

Q11. Explain the key idea behind ensemble techniques.

- Definition: Ensemble techniques combine multiple models (classifiers or regressors) to improve overall performance compared to individual models.

 Key Ideas:

 Diversity: By using different models or training on different subsets of data, ensemble methods can capture a wider range of patterns.

 Reduction of Overfitting: Combining predictions helps to mitigate the risk of overfitting that individual models may suffer from.

 Improved Accuracy: The final prediction is often more accurate as it averages out the errors of individual models.
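The "averaging out errors" idea can be checked with a short simulation. As a simplifying assumption, each hypothetical model below returns the true value plus independent unit-variance noise. Real ensemble members are correlated, so the gain in practice is smaller.

```python
import random
from statistics import mean, pvariance

rng = random.Random(0)
TRUE_VALUE = 10.0

def noisy_model():
    """Hypothetical model: the true value plus independent unit-variance noise."""
    return TRUE_VALUE + rng.gauss(0, 1)

# 2000 predictions from a single model vs. 2000 averages of 25 models
single = [noisy_model() for _ in range(2000)]
ensemble = [mean(noisy_model() for _ in range(25)) for _ in range(2000)]

var_single = pvariance(single)
var_ensemble = pvariance(ensemble)
print(f"variance of one model:     {var_single:.3f}")
print(f"variance of 25-model mean: {var_ensemble:.3f}")
```

With independent errors the variance of the average shrinks roughly in proportion to the number of models, which is the idealized case that diversity in an ensemble tries to approach.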

Q12. What is a Random Forest Classifier?

- Definition: A Random Forest Classifier is an ensemble learning method that constructs a multitude of decision trees during training and outputs the mode of their predictions for classification tasks.

 Key Features:

 Bagging Technique: It uses bootstrapping (sampling with replacement) to create multiple subsets of the training data.

 Random Feature Selection: At each split in the decision trees, a random subset of features is considered, which enhances diversity among the trees.

 Final Prediction: For classification, the final output is determined by majority voting among the trees.

Q13. What are the main types of ensemble techniques?

- Bagging:

 Combines predictions from multiple models trained on different subsets of the data.

 Example: Random Forest.

- Boosting:

 Sequentially builds models, where each new model focuses on correcting errors made by the previous ones.

 Examples: AdaBoost, Gradient Boosting.

- Stacking:

 Combines multiple models (base learners) and uses their predictions as input for a higher-level model (meta-learner).

- Voting:

 Aggregates predictions from multiple models, either by majority voting (for classification) or averaging (for regression).

Q14. What is ensemble learning in machine learning?

- Definition: Ensemble learning is a machine learning paradigm where multiple models are trained to solve the same problem and their predictions are combined to produce a more accurate and robust final prediction.

 Purpose:

 To improve model performance by leveraging the strengths of various algorithms.

 To reduce the likelihood of poor predictions by averaging out errors.

 Applications: Used in various domains such as finance, healthcare, and image recognition to enhance predictive accuracy.

Q15. When should we avoid using ensemble methods?

- Small Datasets:

 When the dataset is small, simpler models may perform better due to lower complexity and reduced risk of overfitting.

 Real-Time Predictions:

 Ensemble methods, especially those with many trees, can be computationally expensive and slow, making them unsuitable for applications requiring real-time predictions.

 Limited Resources:

 In environments with limited computational resources, the overhead of training multiple models may not be feasible.

 High Dimensionality:

 When the number of features is very large and only a few are informative, tree ensembles can add complexity without a matching gain in performance; feature selection or a simpler regularized model may be a better first step.

Q16. How does Bagging help in reducing overfitting?

- Variance Reduction:

 Bagging (Bootstrap Aggregating) reduces the variance of the model by averaging the predictions of multiple models trained on different subsets of the data.

 Diversity Among Models:

 By training each model on a different bootstrap sample (random samples with replacement), Bagging introduces diversity among the base learners, which helps in generalizing better to unseen data.

 Stability:

 The aggregation of predictions from multiple models leads to a more stable and robust final prediction, which is less likely to overfit to the noise in the training data.

 Independence of Models:

 Each model is trained separately on its own bootstrap sample, so the errors made by individual models are only partly correlated. Averaging cancels the uncorrelated part of those errors, further reducing the risk of overfitting.
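The variance argument can be made concrete with a standard result for the average of identically distributed estimators. As assumptions for illustration (these symbols are not used elsewhere in this notebook), let each of the $B$ base models $f_b$ have prediction variance $\sigma^2$ at a point $x$, and let any two of them have correlation $\rho$:

```latex
\operatorname{Var}\!\left(\frac{1}{B}\sum_{b=1}^{B} f_b(x)\right)
  = \frac{1}{B^{2}}\left[B\,\sigma^{2} + B(B-1)\,\rho\,\sigma^{2}\right]
  = \rho\,\sigma^{2} + \frac{1-\rho}{B}\,\sigma^{2}
```

The second term shrinks as more models are added, while the first term remains. This is why lowering the correlation between models matters as much as increasing their number.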

Q17. Why is Random Forest better than a single Decision Tree?

- Reduction of Overfitting:

 Random Forest mitigates the overfitting problem commonly associated with single decision trees by averaging the results of multiple trees, which smooths out the predictions.

- Feature Randomness:

 At each split in the decision trees, Random Forest considers only a random subset of features. This extra randomness decorrelates the trees, so averaging them reduces variance more effectively than averaging near-identical trees would.

- Improved Accuracy:

 The ensemble approach of combining multiple trees generally leads to better accuracy compared to a single decision tree, especially in complex datasets.

- Robustness:

 Random Forest is less sensitive to noise and outliers in the data, making it a more robust model for various applications.

Q18. What is the role of bootstrap sampling in Bagging?

- Data Subsets Creation:

 Bootstrap sampling involves creating multiple subsets of the training data by sampling with replacement. This means that some instances may appear multiple times in a subset, while others may not appear at all.

 Independence of Models:

 Each model in Bagging is trained on a different bootstrap sample. Because the samples overlap, the models are not statistically independent, but they differ enough to produce diverse predictions.

 Variance Reduction:

 By training on different subsets, bootstrap sampling helps in reducing the overall variance of the model, which is crucial for improving generalization to unseen data.

 Aggregation of Predictions:

 The final prediction is made by aggregating the predictions from all models trained on the bootstrap samples, which enhances the stability and accuracy of the model.
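How many instances are left out of a bootstrap sample can be worked out directly. Assuming, as is conventional, that each bootstrap sample has the same size $n$ as the original training set, the probability that a given instance is never drawn is:

```latex
P(\text{instance } i \text{ is left out})
  = \left(1 - \frac{1}{n}\right)^{n}
  \;\xrightarrow{\,n \to \infty\,}\; e^{-1} \approx 0.368
```

So for large $n$ each bootstrap sample contains roughly 63.2% of the distinct training instances, and the remaining 36.8% or so are unseen by that model.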

Q19. What are some real-world applications of ensemble techniques?

- Finance:

 Credit scoring and risk assessment models often use ensemble techniques to improve prediction accuracy.

 Healthcare:

 Ensemble methods are used in disease prediction and diagnosis, where combining multiple models can lead to better outcomes.

 Image Recognition:

 In computer vision, ensemble techniques enhance the performance of models in tasks like object detection and image classification.

 Natural Language Processing:

 Sentiment analysis and text classification benefit from ensemble methods to improve accuracy and robustness.

 Fraud Detection:

 Ensemble techniques are employed in detecting fraudulent transactions by combining multiple models to identify patterns indicative of fraud.

Q20. What is the difference between Bagging and Boosting?

- Methodology:

 Bagging: Builds multiple models independently and combines their predictions (e.g., averaging or voting).

 Boosting: Builds models sequentially, where each new model focuses on correcting the errors made by the previous ones.

 Model Independence:

 Bagging: Models are trained independently, which helps in reducing variance.

 Boosting: Models are dependent on each other, as each model is trained based on the performance of the previous ones.

 Error Handling:

 Bagging: Reduces variance by averaging predictions from multiple models.

 Boosting: Reduces bias by focusing on difficult-to-predict instances, either by increasing the weights of misclassified examples (AdaBoost) or by fitting each new model to the remaining errors (gradient boosting).

 Performance:

 Bagging: Generally performs better with high-variance models (like decision trees).

 Boosting: Often yields better performance on complex datasets by reducing bias and improving accuracy.
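To make the "adjusting the weights" step concrete, here is a sketch of one round of the classic discrete AdaBoost update for labels in {-1, +1}. The function name `adaboost_reweight` is hypothetical, and the sketch assumes the incoming weights sum to 1 and the weighted error is strictly between 0 and 1.

```python
import math

def adaboost_reweight(weights, y_true, y_pred):
    """One round of the discrete AdaBoost weight update for labels in {-1, +1}."""
    # Weighted error of the current weak learner (weights are assumed to sum to 1)
    err = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p)
    # Learner weight: larger when the weighted error is smaller
    alpha = 0.5 * math.log((1 - err) / err)
    # t * p is +1 for a correct prediction and -1 for a mistake,
    # so mistakes are scaled up and correct points are scaled down
    new = [w * math.exp(-alpha * t * p) for w, t, p in zip(weights, y_true, y_pred)]
    total = sum(new)
    return [w / total for w in new], alpha

y_true = [1, 1, 1, -1, -1, -1, 1, -1]
y_pred = [1, 1, -1, -1, -1, 1, 1, -1]   # two mistakes, at positions 2 and 5
weights = [1 / len(y_true)] * len(y_true)

new_weights, alpha = adaboost_reweight(weights, y_true, y_pred)
print([round(w, 3) for w in new_weights], round(alpha, 3))
```

After the update the two misclassified points carry half of the total weight between them, so the next learner is pushed to get them right. Bagging has no counterpart to this step, since its models never see each other's errors.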

# **PRACTICAL QUESTIONS**

In [None]:
# Q21. Train a Bagging Classifier using Decision Trees on a sample dataset and print model accuracy
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load dataset
X, y = load_iris(return_X_y=True)

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Define base estimator
base_estimator = DecisionTreeClassifier()

# Create Bagging Classifier
# (base model passed positionally: its keyword is `estimator` in scikit-learn >= 1.2, `base_estimator` before)
model = BaggingClassifier(base_estimator, n_estimators=10, random_state=42)

# Train model
model.fit(X_train, y_train)

# Predict
y_pred = model.predict(X_test)

# Print accuracy
print("Accuracy:", accuracy_score(y_test, y_pred))


In [None]:
# Q22. Train a Bagging Regressor using Decision Trees and evaluate using Mean Squared Error (MSE)
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Load dataset
X, y = load_diabetes(return_X_y=True)

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Define base estimator
base_estimator = DecisionTreeRegressor()

# Create Bagging Regressor
# (base model passed positionally: its keyword is `estimator` in scikit-learn >= 1.2, `base_estimator` before)
model = BaggingRegressor(base_estimator, n_estimators=10, random_state=42)

# Train model
model.fit(X_train, y_train)

# Predict
y_pred = model.predict(X_test)

# Print MSE
print("Mean Squared Error:", mean_squared_error(y_test, y_pred))


In [None]:
# Q23. Train a Random Forest Classifier on the Breast Cancer dataset and print feature importance scores
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
import pandas as pd

# Load dataset
data = load_breast_cancer()
X, y = data.data, data.target
feature_names = data.feature_names

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train Random Forest
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Get feature importances
importances = model.feature_importances_
importance_df = pd.DataFrame({'Feature': feature_names, 'Importance': importances})
importance_df = importance_df.sort_values(by='Importance', ascending=False)

# Print feature importance scores
print(importance_df)


In [None]:
# Q24.  Train a Random Forest Regressor and compare its performance with a single Decision Tree
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Load dataset
X, y = load_diabetes(return_X_y=True)

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train Decision Tree Regressor
dt_model = DecisionTreeRegressor(random_state=42)
dt_model.fit(X_train, y_train)
dt_pred = dt_model.predict(X_test)
dt_mse = mean_squared_error(y_test, dt_pred)

# Train Random Forest Regressor
rf_model = RandomForestRegressor(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)
rf_pred = rf_model.predict(X_test)
rf_mse = mean_squared_error(y_test, rf_pred)

# Print MSE for both models
print("Decision Tree MSE:", dt_mse)
print("Random Forest MSE:", rf_mse)


In [None]:
# Q25. Compute the Out-of-Bag (OOB) Score for a Random Forest Classifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Load dataset
X, y = load_iris(return_X_y=True)

# Split data (not required for OOB but done for clarity)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train Random Forest with OOB enabled
model = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=42)
model.fit(X_train, y_train)

# Print OOB score
print("OOB Score:", model.oob_score_)


In [None]:
# Q26. Train a Bagging Classifier using SVM as a base estimator and print accuracy
from sklearn.ensemble import BaggingClassifier
from sklearn.svm import SVC
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load dataset
X, y = load_iris(return_X_y=True)

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Define base estimator
base_estimator = SVC(probability=True, kernel='rbf')

# Create Bagging Classifier
# (base model passed positionally: its keyword is `estimator` in scikit-learn >= 1.2, `base_estimator` before)
model = BaggingClassifier(base_estimator, n_estimators=10, random_state=42)

# Train model
model.fit(X_train, y_train)

# Predict
y_pred = model.predict(X_test)

# Print accuracy
print("Accuracy:", accuracy_score(y_test, y_pred))


In [None]:
# Q27. Train a Random Forest Classifier with different numbers of trees and compare accuracy
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load dataset
X, y = load_iris(return_X_y=True)

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Define different number of trees
n_estimators_list = [1, 10, 50, 100, 200]

# Train and evaluate models
for n in n_estimators_list:
    model = RandomForestClassifier(n_estimators=n, random_state=42)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    print(f"Accuracy with {n} trees: {accuracy}")


In [None]:
# Q28. Train a Bagging Classifier using Logistic Regression as a base estimator and print AUC score.
from sklearn.ensemble import BaggingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# Load dataset
X, y = load_breast_cancer(return_X_y=True)

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Define base estimator
base_estimator = LogisticRegression(max_iter=1000, solver='liblinear')

# Create Bagging Classifier
# (base model passed positionally: its keyword is `estimator` in scikit-learn >= 1.2, `base_estimator` before)
model = BaggingClassifier(base_estimator, n_estimators=10, random_state=42)

# Train model
model.fit(X_train, y_train)

# Predict probabilities
y_proba = model.predict_proba(X_test)[:, 1]

# Print AUC score
print("AUC Score:", roc_auc_score(y_test, y_proba))


In [None]:
# Q29. Train a Random Forest Regressor and analyze feature importance scores
from sklearn.ensemble import RandomForestRegressor
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
import pandas as pd

# Load dataset
data = load_diabetes()
X, y = data.data, data.target
feature_names = data.feature_names

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train Random Forest Regressor
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Get feature importances
importances = model.feature_importances_
importance_df = pd.DataFrame({'Feature': feature_names, 'Importance': importances})
importance_df = importance_df.sort_values(by='Importance', ascending=False)

# Print feature importance scores
print(importance_df)


In [None]:
# Q30. Train an ensemble model using both Bagging and Random Forest and compare accuracy
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load dataset
X, y = load_iris(return_X_y=True)

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Bagging with Decision Tree
# (base model passed positionally: its keyword is `estimator` in scikit-learn >= 1.2, `base_estimator` before)
bagging_model = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50, random_state=42)
bagging_model.fit(X_train, y_train)
bagging_pred = bagging_model.predict(X_test)
bagging_acc = accuracy_score(y_test, bagging_pred)

# Random Forest
rf_model = RandomForestClassifier(n_estimators=50, random_state=42)
rf_model.fit(X_train, y_train)
rf_pred = rf_model.predict(X_test)
rf_acc = accuracy_score(y_test, rf_pred)

# Compare accuracy
print("Bagging Accuracy:", bagging_acc)
print("Random Forest Accuracy:", rf_acc)


In [None]:
# Q31. Train a Random Forest Classifier and tune hyperparameters using GridSearchCV
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import accuracy_score

# Load dataset
data = load_iris()
X = data.data
y = data.target

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Define the model
rf = RandomForestClassifier(random_state=42)

# Define hyperparameters to tune
param_grid = {
    'n_estimators': [50, 100, 150],
    'max_depth': [None, 5, 10],
    'min_samples_split': [2, 4],
    'min_samples_leaf': [1, 2]
}

# Setup GridSearchCV
grid_search = GridSearchCV(estimator=rf, param_grid=param_grid,
                           cv=5, n_jobs=-1, scoring='accuracy')

# Fit the model
grid_search.fit(X_train, y_train)

# Best parameters and best estimator
best_params = grid_search.best_params_
best_rf = grid_search.best_estimator_

# Predict and evaluate
y_pred = best_rf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

# Print results
print("Best Parameters:", best_params)
print("Test Accuracy:", accuracy)



In [None]:
# Q32. Train a Bagging Regressor with different numbers of base estimators and compare performance
from sklearn.datasets import load_diabetes  # load_boston was removed in scikit-learn 1.2
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import matplotlib.pyplot as plt

# Load dataset
X, y = load_diabetes(return_X_y=True)

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Test different numbers of base estimators
estimators_range = [1, 5, 10, 20, 50, 100]
mse_scores = []

for n_estimators in estimators_range:
    # Base model passed positionally: its keyword is `estimator` in scikit-learn >= 1.2
    bagging = BaggingRegressor(DecisionTreeRegressor(),
                               n_estimators=n_estimators, random_state=42)
    bagging.fit(X_train, y_train)
    y_pred = bagging.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    mse_scores.append(mse)

# Plot the results
plt.figure(figsize=(8, 5))
plt.plot(estimators_range, mse_scores, marker='o')
plt.xlabel('Number of Base Estimators')
plt.ylabel('Mean Squared Error')
plt.title('Performance of Bagging Regressor with Varying Estimators')
plt.grid(True)
plt.show()


In [None]:
# Q33. Train a Random Forest Classifier and analyze misclassified samples
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report
import numpy as np
import pandas as pd

# Load dataset
data = load_iris()
X = data.data
y = data.target
feature_names = data.feature_names
target_names = data.target_names

# Split dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train Random Forest Classifier
clf = RandomForestClassifier(random_state=42)
clf.fit(X_train, y_train)

# Predict
y_pred = clf.predict(X_test)

# Analyze misclassified samples
misclassified_indices = np.where(y_test != y_pred)[0]
misclassified_samples = X_test[misclassified_indices]
actual_labels = y_test[misclassified_indices]
predicted_labels = y_pred[misclassified_indices]

# Display misclassified samples
df_misclassified = pd.DataFrame(misclassified_samples, columns=feature_names)
df_misclassified['Actual Label'] = [target_names[i] for i in actual_labels]
df_misclassified['Predicted Label'] = [target_names[i] for i in predicted_labels]

# Output results
print("Classification Report:\n", classification_report(y_test, y_pred, target_names=target_names))
print("\nMisclassified Samples:")
print(df_misclassified)


In [None]:
# Q34. Train a Bagging Classifier and compare its performance with a single Decision Tree Classifier
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load dataset
X, y = load_iris(return_X_y=True)

# Split dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train single Decision Tree Classifier
dt = DecisionTreeClassifier(random_state=42)
dt.fit(X_train, y_train)
dt_pred = dt.predict(X_test)
dt_accuracy = accuracy_score(y_test, dt_pred)

# Train Bagging Classifier with Decision Trees
# (base model passed positionally: its keyword is `estimator` in scikit-learn >= 1.2, `base_estimator` before)
bagging = BaggingClassifier(DecisionTreeClassifier(),
                            n_estimators=50, random_state=42)
bagging.fit(X_train, y_train)
bagging_pred = bagging.predict(X_test)
bagging_accuracy = accuracy_score(y_test, bagging_pred)

# Print and compare performance
print(f"Decision Tree Accuracy: {dt_accuracy:.4f}")
print(f"Bagging Classifier Accuracy: {bagging_accuracy:.4f}")


In [None]:
# Q35. Train a Random Forest Classifier and visualize the confusion matrix
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plt

# Load dataset
X, y = load_iris(return_X_y=True)

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train Random Forest Classifier
rf = RandomForestClassifier(random_state=42)
rf.fit(X_train, y_train)

# Predict
y_pred = rf.predict(X_test)

# Compute and visualize confusion matrix
cm = confusion_matrix(y_test, y_pred)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=rf.classes_)
disp.plot(cmap='Blues')
plt.title("Confusion Matrix - Random Forest Classifier")
plt.show()


In [None]:
# Q36. Train a Stacking Classifier using Decision Trees, SVM, and Logistic Regression, and compare accuracy
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import StackingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Load dataset
X, y = load_iris(return_X_y=True)

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Define base estimators
base_estimators = [
    ('decision_tree', DecisionTreeClassifier(random_state=42)),
    ('svm', SVC(probability=True, random_state=42))
]

# Define final estimator
final_estimator = LogisticRegression(max_iter=1000)

# Build stacking classifier
stacking_clf = StackingClassifier(
    estimators=base_estimators,
    final_estimator=final_estimator,
    cv=5
)

# Train model
stacking_clf.fit(X_train, y_train)

# Predict and evaluate
y_pred = stacking_clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

# Compare base models
dt = DecisionTreeClassifier(random_state=42)
svm = SVC(probability=True, random_state=42)
lr = LogisticRegression(max_iter=1000)

dt.fit(X_train, y_train)
svm.fit(X_train, y_train)
lr.fit(X_train, y_train)

# Print accuracy scores
print(f"Decision Tree Accuracy: {accuracy_score(y_test, dt.predict(X_test)):.4f}")
print(f"SVM Accuracy: {accuracy_score(y_test, svm.predict(X_test)):.4f}")
print(f"Logistic Regression Accuracy: {accuracy_score(y_test, lr.predict(X_test)):.4f}")
print(f"Stacking Classifier Accuracy: {accuracy:.4f}")


In [None]:
# Q37. Train a Random Forest Classifier and print the top 5 most important features
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np

# Load dataset
data = load_iris()
X = data.data
y = data.target
feature_names = data.feature_names

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train Random Forest Classifier
rf = RandomForestClassifier(random_state=42)
rf.fit(X_train, y_train)

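# Get feature importances
importances = rf.feature_importances_

# Rank features by importance and print the top 5
# (Iris has only 4 features, so all of them are listed)
top_idx = np.argsort(importances)[::-1][:5]
top_features = pd.DataFrame({'Feature': np.array(feature_names)[top_idx],
                             'Importance': importances[top_idx]})
print(top_features)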


In [None]:
# Q38. Train a Bagging Classifier and evaluate performance using Precision, Recall, and F1-score
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Load dataset
X, y = load_iris(return_X_y=True)

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train Bagging Classifier
# (base model passed positionally: its keyword is `estimator` in scikit-learn >= 1.2, `base_estimator` before)
bagging_clf = BaggingClassifier(DecisionTreeClassifier(),
                                n_estimators=50, random_state=42)
bagging_clf.fit(X_train, y_train)

# Predict
y_pred = bagging_clf.predict(X_test)

# Evaluate performance
report = classification_report(y_test, y_pred)
print("Classification Report (Precision, Recall, F1-score):\n")
print(report)


In [None]:
# Q39. Train a Random Forest Classifier and analyze the effect of max_depth on accuracy
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt

# Load dataset
X, y = load_iris(return_X_y=True)

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Range of max_depth values to test
max_depth_values = [1, 2, 3, 5, 10, None]
accuracies = []

# Train model with different max_depth values
for depth in max_depth_values:
    clf = RandomForestClassifier(max_depth=depth, random_state=42)
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    acc = accuracy_score(y_test, y_pred)
    accuracies.append(acc)

# Plot results
plt.figure(figsize=(8, 5))
depth_labels = ['1', '2', '3', '5', '10', 'None']
plt.plot(depth_labels, accuracies, marker='o')
plt.xlabel('max_depth')
plt.ylabel('Accuracy')
plt.title('Effect of max_depth on Random Forest Accuracy')
plt.grid(True)
plt.show()


In [None]:
# Q40. Train a Bagging Regressor using different base estimators (DecisionTree and KNeighbors) and compare performance
# load_boston was removed in scikit-learn 1.2, so the bundled diabetes regression dataset is used instead
from sklearn.datasets import load_diabetes
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Load dataset
X, y = load_diabetes(return_X_y=True)

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Bagging Regressor with Decision Tree
bagging_dt = BaggingRegressor(DecisionTreeRegressor(), n_estimators=50, random_state=42)
bagging_dt.fit(X_train, y_train)
y_pred_dt = bagging_dt.predict(X_test)
mse_dt = mean_squared_error(y_test, y_pred_dt)

# Bagging Regressor with KNeighbors Regressor
bagging_knn = BaggingRegressor(KNeighborsRegressor(), n_estimators=50, random_state=42)
bagging_knn.fit(X_train, y_train)
y_pred_knn = bagging_knn.predict(X_test)
mse_knn = mean_squared_error(y_test, y_pred_knn)

# Print comparison
print(f"Mean Squared Error - Decision Tree Base Estimator: {mse_dt:.4f}")
print(f"Mean Squared Error - KNeighbors Base Estimator:   {mse_knn:.4f}")


In [None]:
# Q41. Train a Random Forest Classifier and evaluate its performance using ROC-AUC Score
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# Load dataset
X, y = load_iris(return_X_y=True)

# Split data (keep the original integer labels; roc_auc_score performs the
# one-vs-rest binarization itself when multi_class='ovr')
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train Random Forest Classifier
rf = RandomForestClassifier(random_state=42)
rf.fit(X_train, y_train)

# Predict class probabilities, shape (n_samples, n_classes)
y_proba = rf.predict_proba(X_test)

# Calculate ROC-AUC score (macro average, one-vs-rest)
roc_auc = roc_auc_score(y_test, y_proba, average='macro', multi_class='ovr')

# Print ROC-AUC score
print(f"ROC-AUC Score (Macro, Multi-class): {roc_auc:.4f}")
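
The macro one-vs-rest ROC-AUC can also be reproduced by hand, which makes it clearer what the single number summarizes. The sketch below is only an illustration: it assumes the Iris data and a default Random Forest, and the stratified split and the names `per_class_auc`, `manual_macro_auc`, and `builtin_macro_auc` are choices made for this example. It binarizes the labels for evaluation only, computes one binary AUC per class, and averages them.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import label_binarize

X, y = load_iris(return_X_y=True)

# Stratify so every class is present in the test set (an assumption of this sketch)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

rf = RandomForestClassifier(random_state=42)
rf.fit(X_train, y_train)
y_proba = rf.predict_proba(X_test)  # shape (n_samples, n_classes)

# Binarize the true labels for evaluation only; the model is trained on the raw labels
y_test_bin = label_binarize(y_test, classes=rf.classes_)

# One binary "class k vs. rest" AUC per class
per_class_auc = [
    roc_auc_score(y_test_bin[:, k], y_proba[:, k])
    for k in range(len(rf.classes_))
]
manual_macro_auc = float(np.mean(per_class_auc))

# Built-in macro one-vs-rest score for comparison
builtin_macro_auc = roc_auc_score(y_test, y_proba, multi_class='ovr', average='macro')

for k, auc in zip(rf.classes_, per_class_auc):
    print(f"Class {k} vs. rest AUC: {auc:.4f}")
print(f"Manual macro average:  {manual_macro_auc:.4f}")
print(f"Built-in macro (ovr):  {builtin_macro_auc:.4f}")
```

Printing the per-class values alongside the average shows whether one class is noticeably harder to separate than the others, which the macro score alone would hide.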


In [None]:
# Q42. Train a Bagging Classifier and evaluate its performance using cross-validation
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score
import numpy as np

# Load dataset
X, y = load_iris(return_X_y=True)

# Initialize Bagging Classifier
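# The base estimator is passed positionally because the keyword was renamed
# from `base_estimator` to `estimator` in scikit-learn 1.2
bagging_clf = BaggingClassifier(DecisionTreeClassifier(),
                                n_estimators=50,
                                random_state=42)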

# Evaluate using cross-validation
cv_scores = cross_val_score(bagging_clf, X, y, cv=5, scoring='accuracy')

# Print results
print("Cross-Validation Accuracy Scores:", cv_scores)
print("Mean Accuracy:", np.mean(cv_scores))
print("Standard Deviation:", np.std(cv_scores))



In [None]:
# Q43. Train a Random Forest Classifier and plot the Precision-Recall curve
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import label_binarize
from sklearn.metrics import precision_recall_curve, average_precision_score
from sklearn.metrics import PrecisionRecallDisplay
import matplotlib.pyplot as plt

# Load dataset
X, y = load_iris(return_X_y=True)
y_binarized = label_binarize(y, classes=[0, 1, 2])

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y_binarized, test_size=0.3, random_state=42)

# Train Random Forest Classifier
rf = RandomForestClassifier(random_state=42)
rf.fit(X_train, y_train)

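# Predict probabilities. Because y was binarized before training, the forest is fit as a
# multilabel model and predict_proba returns a list with one (n_samples, 2) array per class
y_scores = rf.predict_proba(X_test)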

# Plot Precision-Recall curve for each class
plt.figure(figsize=(8, 6))
for i in range(y_binarized.shape[1]):
    precision, recall, _ = precision_recall_curve(y_test[:, i], y_scores[i][:, 1])
    ap_score = average_precision_score(y_test[:, i], y_scores[i][:, 1])
    plt.plot(recall, precision, label=f'Class {i} (AP={ap_score:.2f})')

plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('Precision-Recall Curve - Random Forest Classifier')
plt.legend()
plt.grid(True)
plt.show()


In [None]:
# Q44. Train a Stacking Classifier with Random Forest and Logistic Regression and compare accuracy
from sklearn.datasets import load_iris
from sklearn.ensemble import StackingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load dataset
X, y = load_iris(return_X_y=True)

# Split dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Define base estimators
base_estimators = [
    ('random_forest', RandomForestClassifier(n_estimators=100, random_state=42))
]

# Define final estimator
final_estimator = LogisticRegression(max_iter=1000)

# Create Stacking Classifier
stacking_clf = StackingClassifier(
    estimators=base_estimators,
    final_estimator=final_estimator,
    cv=5
)

# Train stacking classifier
stacking_clf.fit(X_train, y_train)

# Predict and evaluate
y_pred_stack = stacking_clf.predict(X_test)
stack_acc = accuracy_score(y_test, y_pred_stack)

# Train and evaluate individual models
rf = RandomForestClassifier(n_estimators=100, random_state=42)
lr = LogisticRegression(max_iter=1000)

rf.fit(X_train, y_train)
lr.fit(X_train, y_train)

rf_acc = accuracy_score(y_test, rf.predict(X_test))
lr_acc = accuracy_score(y_test, lr.predict(X_test))

# Print accuracy scores
print(f"Random Forest Accuracy: {rf_acc:.4f}")
print(f"Logistic Regression Accuracy: {lr_acc:.4f}")
print(f"Stacking Classifier Accuracy: {stack_acc:.4f}")


In [None]:
# Q45. Train a Bagging Regressor with different levels of bootstrap samples and compare performance
# load_boston was removed in scikit-learn 1.2, so the bundled diabetes regression dataset is used instead
from sklearn.datasets import load_diabetes
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import matplotlib.pyplot as plt

# Load dataset
X, y = load_diabetes(return_X_y=True)

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Different bootstrap sample sizes (as a fraction of the training set)
max_samples_list = [0.3, 0.5, 0.7, 0.9, 1.0]
mse_scores = []

# Train and evaluate Bagging Regressor with different bootstrap sample sizes
for max_samples in max_samples_list:
    # Base estimator passed positionally (keyword renamed to `estimator` in scikit-learn 1.2)
    bagging = BaggingRegressor(DecisionTreeRegressor(),
                               n_estimators=50,
                               max_samples=max_samples,
                               bootstrap=True,
                               random_state=42)
    bagging.fit(X_train, y_train)
    y_pred = bagging.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    mse_scores.append(mse)

# Plot the performance
plt.figure(figsize=(8, 5))
plt.plot(max_samples_list, mse_scores, marker='o')
plt.xlabel('Bootstrap Sample Fraction (max_samples)')
plt.ylabel('Mean Squared Error')
plt.title('Effect of Bootstrap Sample Size on Bagging Regressor Performance')
plt.grid(True)
plt.show()
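
As a complement to the held-out test set, the same comparison can be made with the out-of-bag (OOB) score described in Q4, which needs no separate split. The following is a minimal sketch rather than a definitive setup: it assumes the bundled diabetes dataset (not necessarily the data used above), and the names `fractions` and `oob_r2` are introduced only for this example. For `BaggingRegressor`, `oob_score_` is an R² value, so higher is better (the opposite direction to MSE).

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor

# Assumed dataset for this sketch: the bundled diabetes regression data
X, y = load_diabetes(return_X_y=True)

fractions = [0.3, 0.5, 0.7, 0.9, 1.0]
oob_r2 = {}

for frac in fractions:
    model = BaggingRegressor(
        DecisionTreeRegressor(random_state=0),  # base estimator, passed positionally
        n_estimators=50,
        max_samples=frac,
        bootstrap=True,   # OOB estimation requires bootstrap sampling
        oob_score=True,   # score each sample using only trees that did not see it
        random_state=42,
    )
    model.fit(X, y)
    oob_r2[frac] = model.oob_score_

for frac, score in oob_r2.items():
    print(f"max_samples={frac:.1f} -> OOB R^2: {score:.4f}")
```

Because every sample is scored only by the trees that never trained on it, the whole dataset can be used for fitting, which is convenient when data is scarce.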
