# Ensemble Methods for Optimal Machine Learning Results

This notebook demonstrates how to use ensemble methods for a binary classification task. We’ll generate a synthetic dataset, standardize its features, and train three ensembles: Random Forest, Gradient Boosting, and a Voting Classifier that combines both with Logistic Regression. Each model is evaluated on the same held-out test set so their performance can be compared directly.

## Objectives
- Generate a sample dataset
- Preprocess the data
- Train and evaluate individual ensemble models
- Combine models using a Voting Classifier
- Compare results

## Libraries Used
- `numpy` and `pandas` for data manipulation
- `sklearn` for machine learning models and metrics
- `matplotlib` and `seaborn` for visualization

In [None]:
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
import matplotlib.pyplot as plt
import seaborn as sns

np.random.seed(42)

## Step 1: Generate Synthetic Dataset

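We’ll create a synthetic binary classification dataset with `make_classification`: 1000 samples and 20 features, of which 15 are informative and 5 are redundant (linear combinations of the informative ones). We then hold out 20% of the samples as a test set and standardize the features. The scaler is fit on the training split only, so no test-set statistics leak into training.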

In [None]:
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, n_classes=2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
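
As a quick sanity check on the split and scaling described above, the sketch below rebuilds the same data and inspects the result. The names `X_train_scaled`, `X_test_scaled`, `train_means`, and `train_stds` are illustrative and do not appear elsewhere in the notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Same generation and split settings as the cell above
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15,
                           n_redundant=5, n_classes=2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Fit on the training split only, then apply the same statistics to the test split
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print("Train shape:", X_train_scaled.shape)  # (800, 20)
print("Test shape:", X_test_scaled.shape)    # (200, 20)
print("Class counts (train):", np.bincount(y_train))

# After scaling, every training feature has mean ~0 and standard deviation ~1
train_means = X_train_scaled.mean(axis=0)
train_stds = X_train_scaled.std(axis=0)
print("Largest absolute training mean:", np.abs(train_means).max())
```

The test split is only approximately centered, because it is transformed with the training statistics rather than its own.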

## Step 2: Individual Ensemble Models

We’ll train two tree-based ensemble models. Random Forest trains many decision trees independently on bootstrap samples and averages their predictions. Gradient Boosting adds shallow trees one at a time, each fit to correct the errors of the ensemble built so far.

In [None]:
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)
rf_pred = rf_model.predict(X_test)
rf_accuracy = accuracy_score(y_test, rf_pred)

gb_model = GradientBoostingClassifier(n_estimators=100, random_state=42)
gb_model.fit(X_train, y_train)
gb_pred = gb_model.predict(X_test)
gb_accuracy = accuracy_score(y_test, gb_pred)
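
To make the "aggregate predictions from multiple decision trees" idea concrete, here is a minimal sketch that reproduces a Random Forest's output by averaging the class probabilities of its individual trees. It uses a smaller demo dataset and forest (`X_demo`, `y_demo`, `forest` are illustrative names) so that it runs quickly.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Small demo dataset, separate from the notebook's main data
X_demo, y_demo = make_classification(n_samples=300, n_features=20, n_informative=15,
                                     n_redundant=5, random_state=0)
forest = RandomForestClassifier(n_estimators=25, random_state=0).fit(X_demo, y_demo)

# Collect each tree's class probabilities: shape (n_trees, n_samples, n_classes)
tree_probas = np.stack([tree.predict_proba(X_demo) for tree in forest.estimators_])

# The forest's probability is the mean over trees; its prediction is the argmax
manual_proba = tree_probas.mean(axis=0)
manual_pred = forest.classes_[manual_proba.argmax(axis=1)]

matches_proba = np.allclose(manual_proba, forest.predict_proba(X_demo))
matches_pred = np.array_equal(manual_pred, forest.predict(X_demo))
print("Probabilities match:", matches_proba)
print("Predictions match:", matches_pred)
```

Gradient Boosting aggregates differently: it sums the scaled outputs of trees fit in sequence, so the same averaging check does not apply to `GradientBoostingClassifier`.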

## Step 3: Voting Classifier (Ensemble of Models)

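To combine different model families, we’ll use a Voting Classifier that integrates Random Forest, Gradient Boosting, and Logistic Regression. We’ll use soft voting, which averages the class probabilities predicted by each model and picks the class with the highest average. This lets a confident model outweigh uncertain ones, unlike hard voting, which counts one vote per model.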

In [None]:
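lr_model = LogisticRegression(random_state=42)
# VotingClassifier fits clones of the estimators it is given, so rf_model and
# gb_model are retrained here and the fitted copies from Step 2 are left unchanged
voting_model = VotingClassifier(estimators=[
    ('rf', rf_model),
    ('gb', gb_model),
    ('lr', lr_model)
], voting='soft')
voting_model.fit(X_train, y_train)
voting_pred = voting_model.predict(X_test)
voting_accuracy = accuracy_score(y_test, voting_pred)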
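
The soft-voting rule can be checked by hand. The sketch below fits a smaller Voting Classifier on a demo dataset (`X_demo`, `y_demo`, and `voter` are illustrative names, and the reduced `n_estimators` is only there to keep it fast), averages the member probabilities manually, and compares the result with the classifier's own output.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier, RandomForestClassifier,
                              VotingClassifier)
from sklearn.linear_model import LogisticRegression

X_demo, y_demo = make_classification(n_samples=300, n_features=20, n_informative=15,
                                     n_redundant=5, random_state=0)

voter = VotingClassifier(estimators=[
    ('rf', RandomForestClassifier(n_estimators=25, random_state=0)),
    ('gb', GradientBoostingClassifier(n_estimators=25, random_state=0)),
    ('lr', LogisticRegression(max_iter=1000)),
], voting='soft').fit(X_demo, y_demo)

# estimators_ holds the fitted clones; stack their probabilities:
# shape (n_models, n_samples, n_classes)
member_probas = np.stack([est.predict_proba(X_demo) for est in voter.estimators_])

# Soft voting with no weights: unweighted mean of probabilities, then argmax
manual_proba = member_probas.mean(axis=0)
manual_pred = voter.classes_[manual_proba.argmax(axis=1)]

proba_matches = np.allclose(manual_proba, voter.predict_proba(X_demo))
soft_matches = np.array_equal(manual_pred, voter.predict(X_demo))
print("Averaged probabilities match:", proba_matches)
print("Predictions match:", soft_matches)
```

If one member should count for more, `VotingClassifier` accepts a `weights` argument, in which case the mean above becomes a weighted average.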

## Step 4: Evaluation and Comparison

We’ll evaluate each model using accuracy and a detailed classification report (precision, recall, F1-score). Then, we’ll visualize the results for comparison.

In [None]:
print("Random Forest Accuracy:", rf_accuracy)
print("Gradient Boosting Accuracy:", gb_accuracy)
print("Voting Classifier Accuracy:", voting_accuracy)

print("\nRandom Forest Classification Report:\n", classification_report(y_test, rf_pred))
print("Gradient Boosting Classification Report:\n", classification_report(y_test, gb_pred))
print("Voting Classifier Classification Report:\n", classification_report(y_test, voting_pred))

results = pd.DataFrame({
    'Model': ['Random Forest', 'Gradient Boosting', 'Voting Classifier'],
    'Accuracy': [rf_accuracy, gb_accuracy, voting_accuracy]
})

plt.figure(figsize=(8, 6))
sns.barplot(x='Model', y='Accuracy', data=results, palette='viridis')
plt.title('Model Accuracy Comparison')
plt.ylim(0, 1)
plt.show()
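
A single 200-sample test split gives a fairly noisy accuracy estimate, so small differences between models may not be meaningful. One way to get a steadier comparison is k-fold cross-validation. The following is a sketch under assumed settings (5 stratified folds, a smaller demo dataset, and reduced tree counts for speed), not part of the original workflow. Scaling sits inside a pipeline so it is refit on each training fold.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_cv, y_cv = make_classification(n_samples=400, n_features=20, n_informative=15,
                                 n_redundant=5, random_state=42)

# Stratified folds keep the class ratio similar in every split
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

candidates = {
    'Random Forest': RandomForestClassifier(n_estimators=50, random_state=42),
    'Gradient Boosting': GradientBoostingClassifier(n_estimators=50, random_state=42),
}

cv_summary = {}
for name, model in candidates.items():
    # The scaler is refit inside each fold, so no validation data leaks into it
    pipeline = make_pipeline(StandardScaler(), model)
    scores = cross_val_score(pipeline, X_cv, y_cv, cv=cv, scoring='accuracy')
    cv_summary[name] = (scores.mean(), scores.std())
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

The spread across folds gives a rough sense of how much of an accuracy gap could be due to the choice of split alone.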

## Conclusion

This notebook implemented and compared three ensemble approaches:
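- **Random Forest**: A bagging method that reduces variance by averaging many trees trained on bootstrap samples.
- **Gradient Boosting**: A boosting method that reduces bias by adding trees sequentially, each correcting the errors of the previous ones.
- **Voting Classifier**: A meta-ensemble that averages the predicted probabilities of Random Forest, Gradient Boosting, and Logistic Regression.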

The Voting Classifier can match or outperform its individual members when their errors are not strongly correlated, but this is not guaranteed, so check the accuracies above before choosing a model. You can adapt this code to your specific dataset by replacing the synthetic data generation with your own data loading step.
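
As a rough sketch of that data loading step, the example below reads a table with pandas and produces the same `X_train`, `X_test`, `y_train`, `y_test` variables the rest of the notebook expects. The column name `target`, the feature names, and the inline CSV text are all hypothetical stand-ins. With a real file you would pass its path to `pd.read_csv` instead.

```python
import io

import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for a real CSV file with a label column named "target"
csv_text = (
    "f1,f2,target\n"
    "0.10,1.20,0\n0.40,0.90,1\n0.30,1.10,0\n0.80,0.20,1\n"
    "0.70,0.40,1\n0.20,1.00,0\n0.90,0.10,1\n0.15,1.30,0\n"
)
df = pd.read_csv(io.StringIO(csv_text))

# Separate features from the label column
X = df.drop(columns="target").to_numpy()
y = df["target"].to_numpy()

# stratify=y keeps the class ratio the same in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42)
print("Train shape:", X_train.shape)
print("Test shape:", X_test.shape)
```

Real datasets often need extra handling that the synthetic data did not, such as missing values or categorical columns, before the scaling and modeling steps apply.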