# E - Candidates Selection for Ensembles

## **Ensemble Candidate Selection and Quadrant-Based Classification**

In this stage of the notebook, we aim to identify the most promising models for ensemble learning by evaluating their **agreement** and **diversity**. This ensures that only models with **complementary strengths** are selected, improving the ensemble’s overall robustness and generalization.

### **Step 1: Merging and Filtering Models**
We begin by **concatenating** the datasets `DNN_models_combination_metrics.csv` and `CNN_models_combination_metrics.csv`, consolidating the performance metrics of both deep neural networks (DNNs) and convolutional neural networks (CNNs) into a single table.

To ensure that only **reliable models** are included in the ensemble selection process, we apply the following filtering criterion:
- **Cohen’s Kappa Score** ≤ 0.40 → The model is removed, as it indicates weak agreement and inconsistent predictions.
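As a minimal sketch of this filter, the toy table below uses the notebook's `val2_Cohen_Kappa_Score` column; the three rows and their scores are made up for illustration. Note that the cut-off is exclusive: a model scoring exactly 0.40 is dropped.

```python
import pandas as pd

# Hypothetical metrics table; only the column name comes from the notebook
metrics = pd.DataFrame({
    "Model": ["DNN", "CNN", "DNN"],
    "val2_Cohen_Kappa_Score": [0.35, 0.40, 0.62],
})

# Keep only models strictly above the 0.40 threshold
kept = metrics[metrics["val2_Cohen_Kappa_Score"] > 0.40].reset_index(drop=True)
print(kept)
```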

### **Step 2: Generating Ensemble Combinations**
After filtering, we generate **all possible ensemble combinations** of **size 2 and 3**. Each ensemble is evaluated based on two key metrics:

**1. Ensemble Agreement Between Models**
   - This metric is computed using **Cohen’s Kappa**, which quantifies the level of agreement between model predictions. For ensembles of three models, the Kappa values of all model pairs are averaged.
   - High values (≥ 0.5) indicate strong agreement, while low values (< 0.5) suggest independent or divergent decision patterns.

**2. Ensemble Diversity in Correct Prediction**
   - Calculated as **1 - Jaccard Similarity** over the models' per-sample correctness vectors (intersection over union of the samples each model classifies correctly), averaged over all model pairs in the ensemble.
   - A higher value (≥ 0.4) indicates that the models get different samples right, so their correct predictions complement each other, which is beneficial for ensemble learning.
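
To make the two metrics concrete, here is a small from-scratch sketch for a single pair of models. The helper names `cohen_kappa` and `jaccard_diversity` and the toy vectors are illustrative assumptions; the notebook itself relies on `sklearn.metrics.cohen_kappa_score` and NumPy logical operations.

```python
from collections import Counter


def cohen_kappa(preds_a, preds_b):
    """Cohen's kappa between two label sequences: (p_o - p_e) / (1 - p_e)."""
    n = len(preds_a)
    # Observed agreement: share of samples where both models predict the same label
    p_o = sum(a == b for a, b in zip(preds_a, preds_b)) / n
    # Chance agreement: product of the two models' marginal label frequencies
    freq_a, freq_b = Counter(preds_a), Counter(preds_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    # Assumption for this sketch: identical constant predictions count as full agreement
    return (p_o - p_e) / (1 - p_e) if p_e != 1 else 1.0


def jaccard_diversity(correct_a, correct_b):
    """1 - |A and B| / |A or B| over per-sample correctness flags (1 = correct)."""
    both = sum(1 for a, b in zip(correct_a, correct_b) if a and b)
    either = sum(1 for a, b in zip(correct_a, correct_b) if a or b)
    # Same convention as the notebook: an empty union yields a diversity of 1
    return 1 - both / either if either else 1.0


# Toy predictions of two models on eight samples
kappa = cohen_kappa([1, 0, 1, 1, 0, 1, 0, 0], [1, 0, 1, 0, 0, 1, 1, 0])

# Toy correctness vectors of two models on six samples
diversity = jaccard_diversity([1, 1, 0, 1, 0, 1], [1, 0, 1, 1, 0, 0])

print(kappa, diversity)
```

With these toy inputs the models agree on 6 of 8 samples against a chance agreement of 0.5, giving κ = 0.5, and they share 2 of the 5 samples that at least one of them gets right, giving a diversity of 0.6.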

### **Step 3: Quadrant-Based Classification**
Using these two metrics, each ensemble is categorized into one of **four quadrants**, which helps determine its suitability for ensemble learning.

| **Quadrant** | **Diversity Score (1 - Jaccard)** | **Cohen’s Kappa** | **Interpretation** |
|-------------|----------------------------------|------------------|--------------------|
| **Q1** | High (≥ 0.4) | Low (< 0.5) | **Models are highly diverse and complement each other well. Ideal for ensemble learning.** |
| **Q2** | High (≥ 0.4) | High (≥ 0.5) | **Models share some similarities but still retain complementary strengths. Can be a good ensemble.** |
| **Q3** | Low (< 0.4) | Low (< 0.5) | **Weak and unstable models. Not recommended for ensemble learning.** |
| **Q4** | Low (< 0.4) | High (≥ 0.5) | **Models are too similar, leading to redundancy. Not beneficial for ensemble learning.** |
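
The table can be read as a small decision function. The sketch below uses the same thresholds (0.4 for diversity, 0.5 for Kappa); the function name `assign_quadrant` and its keyword arguments are illustrative and not part of the notebook.

```python
def assign_quadrant(diversity, kappa, diversity_threshold=0.4, kappa_threshold=0.5):
    """Map an ensemble's (diversity, kappa) pair to a quadrant label Q1..Q4."""
    high_diversity = diversity >= diversity_threshold
    high_agreement = kappa >= kappa_threshold
    if high_diversity:
        # Q2: diverse and agreeing; Q1: diverse with low agreement
        return "Q2" if high_agreement else "Q1"
    # Q4: redundant models; Q3: low diversity and low agreement
    return "Q4" if high_agreement else "Q3"


# Values exactly on a threshold fall on the "high" side
examples = [(0.6, 0.3), (0.4, 0.5), (0.2, 0.1), (0.1, 0.9)]
labels = [assign_quadrant(d, k) for d, k in examples]
print(labels)
```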

### **Conclusion**
By classifying ensemble candidates into these quadrants, we can efficiently select the best combinations that **maximize diversity while maintaining a reasonable agreement**. Ensembles in **Q1 and Q2** are prioritized, as they offer **strong generalization potential** and improved robustness against overfitting.

This process ensures that **only the best-performing and complementary models** contribute to the final ensemble, ultimately leading to a more effective and reliable predictive model. 🚀





In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import os
import itertools
import numpy as np
from tqdm import tqdm  # Progress bar
from sklearn.metrics import cohen_kappa_score
from scipy.spatial.distance import jaccard
import ast


In [None]:
# Define output directory and file paths
OUTPUT_DIR = "ENSEMBLE_CANDIDATES"
os.makedirs(OUTPUT_DIR, exist_ok=True)
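FILE1 = "DNN_MODEL_TRAINING/DNN_models_combination_metrics_SAMPLE.csv"
# NOTE: FILE2 currently repeats the DNN path, so every model is loaded twice;
# point it at the CNN metrics CSV to merge both model families.
FILE2 = "DNN_MODEL_TRAINING/DNN_models_combination_metrics_SAMPLE.csv"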
OUTPUT_FILE = os.path.join(OUTPUT_DIR, 'ensemble_candidates.csv')

In [None]:
# Load CSVs into pandas DataFrames
df1 = pd.read_csv(FILE1)
df2 = pd.read_csv(FILE2)

# Concatenate both DataFrames
df = pd.concat([df1, df2], ignore_index=True)

# Filter models with val2_Cohen_Kappa_Score > 0.40
df = df[df["val2_Cohen_Kappa_Score"] > 0.40].reset_index(drop=True)

# Convert string representations of lists to actual lists
def safe_eval_list(value):
    """Safely converts a string representation of a list into a real list."""
    try:
        return ast.literal_eval(value) if isinstance(value, str) else value
    except (SyntaxError, ValueError):
        return []

list_columns = [
    "val2_y_pred", "val2_accuracy_vector", "val2_y_true", "val2_y_proba",
    "val2_Confusion_Matrix", "val2_error_indices"
]

for col in list_columns:
    df[col] = df[col].apply(safe_eval_list)

# Define chunk size for saving intermediate results
CHUNK_SIZE = 100000  # Save every 100k combinations to avoid running out of memory

# Compute the total number of combinations without materializing them in memory
import math
total_combinations = sum(math.comb(len(df), size) for size in [2, 3])

def generate_combinations():
    """Generator function to yield model combinations in batches."""
    count = 0  # Running counter used to build a unique ID for each ensemble
    for size in [2, 3]:
        for combo in itertools.combinations(df.iterrows(), size):
            indices, models = zip(*combo)

            # Zero-padded counter keeps ensemble names unique and fixed-width
            ensemble_name = f"ensemble_{count:09d}"
            count += 1

            # Extract model names and features
            models_type = [m["Model"] for m in models]
            models_features = [[m["Feature Group"]] for m in models]

            # Extract all individual metrics as lists
            individual_metrics = {
                "train_accuracy": [m["train_accuracy"] for m in models],
                "val_accuracy": [m["val_accuracy"] for m in models],
                "val2_accuracy": [m["val2_accuracy"] for m in models],
                "gap": [m["gap"] for m in models],
                "val2_recall": [m["val2_recall"] for m in models],
                "val2_precision": [m["val2_precision"] for m in models],
                "val2_f1": [m["val2_f1"] for m in models],
                "val2_model_path": [m["val2_model_path"] for m in models],
                "val2_confusion_matrix": [m["val2_Confusion_Matrix"] for m in models],
                "val2_error_indices": [m["val2_error_indices"] for m in models]
            }

            # Extract predictions and accuracy vectors
            y_preds = [m["val2_y_pred"] for m in models]
            accuracy_vectors = [m["val2_accuracy_vector"] for m in models]

            # Compute Cohen's Kappa (agreement between models)
            pairwise_kappas = []
            for i in range(len(y_preds)):
                for j in range(i + 1, len(y_preds)):
                    try:
                        kappa = cohen_kappa_score(y_preds[i], y_preds[j])
                        pairwise_kappas.append(kappa)
                    except ValueError:
                        pairwise_kappas.append(0)  # Default if Cohen's Kappa cannot be computed

            ensemble_kappa = np.mean(pairwise_kappas) if pairwise_kappas else 0

            # Compute Jaccard diversity (1 - Jaccard similarity)
            pairwise_jaccards = []
            for i in range(len(accuracy_vectors)):
                for j in range(i + 1, len(accuracy_vectors)):
                    if len(accuracy_vectors[i]) == len(accuracy_vectors[j]):
                        intersection = np.sum(np.logical_and(accuracy_vectors[i], accuracy_vectors[j]))
                        union = np.sum(np.logical_or(accuracy_vectors[i], accuracy_vectors[j]))
                        jaccard_score = 1 - (intersection / union) if union != 0 else 1
                        pairwise_jaccards.append(jaccard_score)

            ensemble_diversity = np.mean(pairwise_jaccards) if pairwise_jaccards else 0

            # Assign the ensemble to a quadrant
            if ensemble_diversity >= 0.4 and ensemble_kappa < 0.5:
                quadrant = "Q1"
            elif ensemble_diversity >= 0.4 and ensemble_kappa >= 0.5:
                quadrant = "Q2"
            elif ensemble_diversity < 0.4 and ensemble_kappa < 0.5:
                quadrant = "Q3"
            else:
                quadrant = "Q4"

            # Store the ensemble data
            yield {
                "ensemble_name": ensemble_name,
                "ensemble_length": size,
                "models_type": models_type,
                "models_features": models_features,
                "ensemble_agreement_between_models": ensemble_kappa,
                "ensemble_diversity_in_correct_prediction": ensemble_diversity,
                "quadrant": quadrant,
                **individual_metrics  # Add all individual metrics
            }

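# Process and save in chunks
# Remove output from a previous run so that append mode does not duplicate rows
if os.path.exists(OUTPUT_FILE):
    os.remove(OUTPUT_FILE)

with tqdm(total=total_combinations, desc="Generating Ensembles", unit="combination") as pbar: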
    chunk = []
    for i, ensemble_data in enumerate(generate_combinations()):
        chunk.append(ensemble_data)

        # Save periodically
        if len(chunk) >= CHUNK_SIZE:
            pd.DataFrame(chunk).to_csv(OUTPUT_FILE, mode='a', header=not os.path.exists(OUTPUT_FILE), index=False)
            chunk = []  # Clear memory

        pbar.update(1)

    # Save remaining data
    if chunk:
        pd.DataFrame(chunk).to_csv(OUTPUT_FILE, mode='a', header=not os.path.exists(OUTPUT_FILE), index=False)

print(f"✅ Ensemble candidates saved to {OUTPUT_FILE}")

In [None]:
# Load dataset
df = pd.read_csv(OUTPUT_FILE)

In [None]:
# Extract relevant columns
x = df["ensemble_agreement_between_models"]  # Cohen’s Kappa
y = df["ensemble_diversity_in_correct_prediction"]  # 1 - Jaccard Similarity

# Define threshold values
kappa_threshold = 0.5
diversity_threshold = 0.4

# Assign quadrants based on thresholds
df["quadrant"] = "Q3"  # Default: Low diversity, low agreement (Q3 - Blue)
df.loc[(y >= diversity_threshold) & (x < kappa_threshold), "quadrant"] = "Q1"  # High diversity, low agreement (Green)
df.loc[(y >= diversity_threshold) & (x >= kappa_threshold), "quadrant"] = "Q2"  # High diversity, high agreement (Orange)
df.loc[(y < diversity_threshold) & (x >= kappa_threshold), "quadrant"] = "Q4"  # Low diversity, high agreement (Red)

# Define color map for quadrants
colors = {
    "Q1": "green",  # Best ensemble candidates
    "Q2": "orange",  # Complementary models
    "Q3": "blue",  # Weak and unstable models
    "Q4": "red"  # Very similar models
}

# Scatter plot
plt.figure(figsize=(10, 8))
for quadrant, color in colors.items():
    subset = df[df["quadrant"] == quadrant]
    plt.scatter(
        subset["ensemble_agreement_between_models"], 
        subset["ensemble_diversity_in_correct_prediction"], 
        color=color, label=f"{quadrant}: {color.capitalize()} Quadrant", alpha=0.6, s=10
    )

# Add decision boundaries
plt.axhline(y=diversity_threshold, color="gray", linestyle="--", alpha=0.7, label="Threshold D(A,B) = 0.4")
plt.axvline(x=kappa_threshold, color="gray", linestyle="--", alpha=0.7, label="Threshold κ = 0.5")

# Labels and title
plt.xlabel("Agreement Between Models (Cohen's κ)")
plt.ylabel("Diversity in Correct Predictions (D(A,B))")
plt.title("Ensemble Candidates by Agreement and Diversity")
plt.legend()
plt.grid(True, linestyle="--", alpha=0.5)

# Show plot
plt.show()


## **Final Selection of Ensemble Candidates**

In this step, we refine the ensemble selection by filtering and ranking the best ensemble candidates. The goal is to ensure that only the most **diverse and complementary** models are chosen for the final ensemble set.

### **Step 1: Filtering Ensembles from Quadrants Q1 and Q2**
From the previously generated ensemble candidates, we select only those that fall into the **Q1 and Q2 quadrants**. These quadrants represent:
- **Q1 (High Diversity, Low Agreement)** → Highly complementary models with diverse correct predictions.
- **Q2 (High Diversity, High Agreement)** → Models that share some similarities but still retain complementary strengths.

By focusing on **Q1 and Q2**, we ensure that the final selection consists of models with strong generalization capabilities.

### **Step 2: Ranking Ensembles**
To determine the **best ensembles**, we sort the filtered candidates based on:

**Primary Metric for Ordering: `ensemble_diversity_in_correct_prediction` (Descending)**
- **Why?**  
  - Higher diversity ensures that the ensemble benefits from a **broader range of learned patterns**.
  - It reduces redundancy and improves the overall robustness of the model.
  - **Cohen’s Kappa**, while important, may favor overly similar models, leading to diminished diversity.

### **Step 3: Selecting the Top 10,000 Ensembles**
After sorting, we **select the top 10,000 ensembles**, i.e. the Q1 and Q2 candidates with the highest diversity. Agreement is not used for ranking; it only enters through the quadrant assignment.

### **Step 4: Saving the Final Ensemble Candidates**
The final selection is saved as `ensemble_candidates_final.csv`, which will serve as the basis for the next steps in ensemble learning.
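
The filter, sort, and cut-off described in Steps 1 to 3 can be sketched on a toy table as follows. The column names match the notebook; the four rows and the small `TOP_N` are made up for illustration.

```python
import pandas as pd

# Hypothetical candidates; only the column names come from the notebook
candidates = pd.DataFrame({
    "ensemble_name": [
        "ensemble_000000000", "ensemble_000000001",
        "ensemble_000000002", "ensemble_000000003",
    ],
    "quadrant": ["Q1", "Q4", "Q2", "Q1"],
    "ensemble_diversity_in_correct_prediction": [0.45, 0.10, 0.70, 0.55],
})

TOP_N = 2  # the notebook keeps 10,000; 2 keeps the toy output small

selected = (
    candidates[candidates["quadrant"].isin(["Q1", "Q2"])]  # Step 1: keep Q1 and Q2
    .sort_values("ensemble_diversity_in_correct_prediction", ascending=False)  # Step 2
    .head(TOP_N)  # Step 3: keep the most diverse candidates
    .reset_index(drop=True)
)
print(selected)
```

If fewer than `TOP_N` candidates survive the quadrant filter, `head` simply returns all of them.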

---
✅ **Outcome:**  
- The most **promising, diverse, and complementary** ensemble candidates are saved to `ensemble_candidates_final.csv`.


In [None]:
# Path of the CSV that will hold the final ensemble candidates
OUTPUT_FINAL_FILE = "ENSEMBLE_CANDIDATES/ensemble_candidates_final.csv"

# Filter only Q1 and Q2 ensembles
df_filtered = df[df["quadrant"].isin(["Q1", "Q2"])]

# Sort by ensemble_diversity_in_correct_prediction (descending)
df_sorted = df_filtered.sort_values(by="ensemble_diversity_in_correct_prediction", ascending=False)

# Select the top 10,000 ensembles
df_final = df_sorted.head(10000)

# Save to CSV
df_final.to_csv(OUTPUT_FINAL_FILE, index=False)

print(f"✅ Top 10,000 ensemble candidates saved to {OUTPUT_FINAL_FILE}")
