# Annotation Voting

Each method used to predict cell type annotations contributes to the vote as follows.

Let $w_i$ be the weight of method $m_i \in M$ and $k$ a cell type annotation in $K$:

$$P(k) = \sum_{i=1}^M w_i \, P(k \mid m_i)$$


Each annotation method $m_i \in M$ provides a predicted label $k_i \in K \cup \{\text{unknown}\}$ and an associated confidence score $c_i \in [0,1]$.

We define a normalized weight $w_i$ for each method, reflecting its global concordance with the other methods.

Then, for a given cell, the ensemble assigns a probability to each possible cell type $k \in K$ as:

$$P(k)=\frac{1}{Z} \sum_{i=1}^M w_i \, P(k \mid m_i)$$

where $Z=\sum_{i=1}^M w_i$ is a normalization constant ensuring $\sum_k P(k)=1$.
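
As a minimal sketch of the formula above, assume each method supplies a full distribution $P(k \mid m_i)$ over the labels in $K$ (each row summing to 1). The helper name `ensemble_probabilities` and the toy numbers are illustrative assumptions, not part of the notebook's pipeline.

```python
import numpy as np

def ensemble_probabilities(weights, method_probs):
    """Combine per-method label distributions into one ensemble distribution.

    weights      : shape (M,), non-negative method weights w_i
    method_probs : shape (M, K), row i is P(k | m_i) and sums to 1
    """
    weights = np.asarray(weights, dtype=float)
    method_probs = np.asarray(method_probs, dtype=float)
    z = weights.sum()  # normalization constant Z
    return weights @ method_probs / z

# Toy example (assumed values): 3 methods, 3 candidate labels
weights = np.array([0.5, 0.3, 0.2])
method_probs = np.array([
    [0.9, 0.1, 0.0],
    [0.2, 0.7, 0.1],
    [0.6, 0.2, 0.2],
])
p = ensemble_probabilities(weights, method_probs)
print(p)  # ensemble probability for each label
```

Because every row of `method_probs` sums to 1, dividing by $Z$ is enough to make the ensemble probabilities sum to 1 as well.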


## Final ensemble decision

The ensemble’s predicted label is:

$$\hat{k} = \text{arg max}_{k \in K}(P(k))$$

and uncertainty (entropy) can be quantified as:

$$H = -\sum_{k \in K} P(k) \log{P(k)}$$
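
The two quantities above can be sketched for a single cell as follows. The label names and probabilities are made-up illustrative values; zero-probability labels are skipped so that $0 \log 0$ is treated as 0.

```python
import numpy as np

# Hypothetical label set and ensemble probabilities for one cell
labels = ["T cell", "B cell", "NK cell"]
p = np.array([0.63, 0.30, 0.07])

# Final decision: label with the highest ensemble probability
k_hat = labels[int(np.argmax(p))]

# Shannon entropy (natural log), ignoring zero-probability labels
nonzero = p[p > 0]
h = float(-np.sum(nonzero * np.log(nonzero)))

# Upper bound, reached when all labels are equally likely
h_max = float(np.log(len(labels)))

print(k_hat, h, h_max)
```

Comparing `h` with `h_max` gives a rough sense of how far a cell is from the maximally uncertain case.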


In [5]:
path_to_predictions = "../data/ann_integration/predictions.csv"
path_to_adata = "../data/ann_integration/adata.h5ad"

In [None]:
import pandas as pd
import numpy as np
from itertools import combinations
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score, silhouette_score
from statsmodels.stats.inter_rater import fleiss_kappa
from scipy.stats import entropy
from sklearn.preprocessing import LabelEncoder
import scanpy as sc


## Metrics for Predictions

#### - Fleiss' Kappa
Measures the degree of agreement among several categorical predictors, corrected for the agreement expected by chance.

>Let $N$ be the number of cells, $M$ the number of models, and $K$ the number of possible labels.
>For each cell $j$, let $n_{jk}$ be the number of models that assigned label $k$.
>Then the agreement for that cell is:
>$$P_j = \frac{1}{M(M-1)} \sum_{k=1}^{K} n_{jk}(n_{jk} - 1)$$
>and the mean agreement across all cells:
>$$\bar{P} = \frac{1}{N} \sum_{j=1}^{N} P_j$$
>Let the expected agreement by chance be:
>$$\bar{P}_e = \sum_{k=1}^{K} p_k^2, \quad \text{where} \quad p_k = \frac{1}{N M} \sum_{j=1}^{N} n_{jk}$$
>Then Fleiss’ Kappa is given by:
>$$\kappa = \frac{\bar{P} - \bar{P}_e}{1 - \bar{P}_e}$$
>A value of $\kappa = 1$ indicates perfect agreement, $\kappa = 0$ corresponds to random labeling, and negative values indicate systematic disagreement.
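
To make the definition concrete, here is a small NumPy sketch that follows the formulas above step by step. The function name `fleiss_kappa_from_counts` and the toy count matrices are assumptions for illustration; it expects every cell to be labeled by the same number of models.

```python
import numpy as np

def fleiss_kappa_from_counts(counts):
    """Fleiss' kappa from counts[j, k] = number of models giving label k to cell j.

    Assumes every row sums to the same number of models M (M >= 2).
    """
    counts = np.asarray(counts, dtype=float)
    n_cells = counts.shape[0]
    n_models = counts[0].sum()
    # Per-cell agreement P_j
    p_j = (counts * (counts - 1)).sum(axis=1) / (n_models * (n_models - 1))
    p_bar = p_j.mean()
    # Chance agreement from the overall label proportions p_k
    p_k = counts.sum(axis=0) / (n_cells * n_models)
    p_e = np.sum(p_k ** 2)
    return float((p_bar - p_e) / (1 - p_e))

# 4 cells, 3 models, 2 labels: all models agree on every cell
perfect = np.array([[3, 0], [0, 3], [3, 0], [0, 3]])
# Same label proportions, but the models are split 2 vs 1 on every cell
split = np.array([[2, 1], [1, 2], [2, 1], [1, 2]])

kappa_perfect = fleiss_kappa_from_counts(perfect)
kappa_split = fleiss_kappa_from_counts(split)
print(kappa_perfect, kappa_split)
```

The second matrix yields a negative value because the observed agreement falls below what the label proportions alone would produce by chance.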


#### - Pairwise Agreement

Quantifies the proportion of cells for which two models give the same label, averaged over all model pairs.

>For models $m_a, m_b \in M$:
>$$A_{ab} = \frac{1}{N} \sum_{j=1}^{N} [k_{a,j} = k_{b,j}]$$
>The overall pairwise agreement is the mean over all unique model pairs:
>$$\bar{A} = \frac{2}{M(M-1)} \sum_{a < b} A_{ab}$$
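>This metric can be computed globally, per cluster, or per label to identify systematic disagreements between models. Note that `_pairwise_agreement` in this notebook summarizes each model pair with the adjusted Rand index (ARI) and normalized mutual information (NMI) rather than the raw matching fraction $A_{ab}$.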
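
The matching fraction $A_{ab}$ and its mean $\bar{A}$ can be sketched directly with pandas. The helper `mean_pairwise_agreement` and the toy predictions are hypothetical; in this sketch a missing label never matches anything, so it counts as a disagreement.

```python
from itertools import combinations

import numpy as np
import pandas as pd

def mean_pairwise_agreement(preds):
    """Return per-pair matching fractions A_ab and their mean.

    preds : DataFrame with one column per model and one row per cell
            (at least two columns).
    """
    pair_scores = {}
    for a, b in combinations(preds.columns, 2):
        # Fraction of cells where both models give the same label
        pair_scores[(a, b)] = float((preds[a] == preds[b]).mean())
    return pair_scores, float(np.mean(list(pair_scores.values())))

# Toy predictions (assumed values): 4 cells, 3 models
preds = pd.DataFrame({
    "model_a": ["T", "T", "B", "NK"],
    "model_b": ["T", "B", "B", "NK"],
    "model_c": ["T", "T", "B", "B"],
})
pair_scores, a_bar = mean_pairwise_agreement(preds)
print(pair_scores, a_bar)
```

Passing a subset of rows (for example, the cells of one cluster) gives the per-cluster variant mentioned above.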


#### - Ensemble Entropy

Measures the uncertainty of the ensemble prediction for each cell, given the distribution of predicted labels across models.

>For a cell $j$, let $P_j(k)$ be the ensemble probability of label $k$, defined as:
>$$P_j(k) = \frac{1}{Z_j} \sum_{i=1}^{M} w_i \, c_{ij} \, [k_{ij} = k]$$
>where $w_i$ is the model’s weight, $c_{ij}$ its confidence score for that cell, and $Z_j$ the normalization constant.
>Then the entropy for cell $j$ is:
>$$H_j = - \sum_{k=1}^{K} P_j(k) \, \log P_j(k)$$
>Averaging across all cells gives the global ensemble entropy:
>$$\bar{H} = \frac{1}{N} \sum_{j=1}^{N} H_j$$
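>Low entropy indicates consistent predictions (high agreement), while high entropy reflects uncertainty or label conflict among models. Note that `_ensemble_entropy` in this notebook estimates $P_j(k)$ from unweighted vote fractions, i.e. with $w_i = c_{ij} = 1$.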
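
A minimal sketch of the weighted definition for a single cell is shown below. The helper names, label set, weights, and confidences are illustrative assumptions; $Z_j$ is taken to be the sum of the weighted scores, and at least one score is assumed to be positive.

```python
import numpy as np

def weighted_label_distribution(labels, confidences, weights, classes):
    """P_j(k) for one cell: sum of w_i * c_ij over models voting for label k."""
    scores = np.zeros(len(classes))
    for lab, c, w in zip(labels, confidences, weights):
        scores[classes.index(lab)] += w * c
    return scores / scores.sum()  # divide by Z_j (assumed > 0)

def shannon_entropy(p):
    """Entropy in nats, treating 0 * log(0) as 0."""
    nonzero = p[p > 0]
    return float(-np.sum(nonzero * np.log(nonzero)))

classes = ["T cell", "B cell"]   # hypothetical label set
weights = [0.5, 0.3, 0.2]        # hypothetical model weights

# All three models agree: the distribution is concentrated on one label
agree = weighted_label_distribution(
    ["T cell", "T cell", "T cell"], [0.9, 0.8, 0.7], weights, classes)
# One heavy model against two lighter ones: the weighted scores tie
split = weighted_label_distribution(
    ["T cell", "B cell", "B cell"], [0.5, 0.5, 0.5], weights, classes)

h_agree = shannon_entropy(agree)
h_split = shannon_entropy(split)
print(h_agree, h_split)
```

The second cell shows that two lower-weight models can exactly balance one higher-weight model, which yields the maximum entropy for two labels.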



In [6]:

def _compute_fleiss_kappa(preds):
    """Compute Fleiss' Kappa for a set of categorical predictions."""
    preds = preds.dropna(how="all")
    if preds.empty or preds.shape[1] < 2:
        return np.nan
    
    # Map all labels to integers
    cats = pd.Categorical(pd.concat([preds[c] for c in preds], axis=0).dropna()).categories
    label_to_int = {cat: i for i, cat in enumerate(cats)}
    # Column-wise map (DataFrame.applymap is deprecated in recent pandas); unmapped values become NaN
    ratings = preds.apply(lambda col: col.map(label_to_int)).dropna()
    
    n_items = len(ratings)
    n_models = ratings.shape[1]
    n_cats = len(cats)
    
    # Build category count matrix
    counts = np.zeros((n_items, n_cats), dtype=int)
    for j in range(n_models):
        for i, val in enumerate(ratings.iloc[:, j]):
            if not np.isnan(val):
                counts[i, int(val)] += 1
    try:
        return fleiss_kappa(counts)
    except Exception:
        return np.nan


def _pairwise_agreement(preds):
    """Compute mean ARI and NMI across all pairs."""
    preds = preds.dropna(how="all")
    if preds.shape[1] < 2:
        return np.nan, np.nan
    ari, nmi = [], []
    for c1, c2 in combinations(preds.columns, 2):
        common = preds[[c1, c2]].dropna()
        if len(common) > 5:
            ari.append(adjusted_rand_score(common[c1], common[c2]))
            nmi.append(normalized_mutual_info_score(common[c1], common[c2]))
    return np.nanmean(ari) if ari else np.nan, np.nanmean(nmi) if nmi else np.nan


def _ensemble_entropy(preds):
    """Compute per-cell entropy of model predictions."""
    valid = preds.dropna(axis=1, how='all')
    cats = pd.Categorical(pd.concat([valid[c] for c in valid], axis=0).dropna()).categories
    n_cats = len(cats)
    label_to_int = {cat: i for i, cat in enumerate(cats)}
    
    counts = np.zeros((len(valid), n_cats))
    for j in range(valid.shape[1]):
        for i, val in enumerate(valid.iloc[:, j]):
            if val in label_to_int:
                counts[i, label_to_int[val]] += 1
    probs = counts / np.maximum(counts.sum(axis=1, keepdims=True), 1)
    ent = entropy(probs.T)
    consensus = np.max(probs, axis=1)
    return ent, consensus

### Model's Weight Calculation

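Each model's weight is the product of its mean prediction score and its mean agreement (ARI) with the other models, normalized so that the weights sum to 1.

$$w_m = \frac{\text{mean(PredScore)}_m \times \text{mean(agreement)}_m}{\sum_{m'} \text{mean(PredScore)}_{m'} \times \text{mean(agreement)}_{m'}}$$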
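
As a worked example of this formula, the sketch below uses made-up mean scores and mean agreements for three hypothetical models and normalizes their products.

```python
# Hypothetical per-model summaries (assumed values, not from the data)
mean_score = {"model_a": 0.9, "model_b": 0.8, "model_c": 0.5}
mean_agreement = {"model_a": 0.6, "model_b": 0.5, "model_c": 0.2}

# Unnormalized weight: mean prediction score times mean agreement
raw = {m: mean_score[m] * mean_agreement[m] for m in mean_score}

# Normalize so the weights sum to 1
total = sum(raw.values())
weights = {m: r / total for m, r in raw.items()}
print(weights)
```

A model that is both confident and concordant with the others dominates the vote, while a model that is low on either factor is down-weighted.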

In [7]:
def _compute_model_weights(df, model_names):
    """Compute per-model weights based on confidence and mutual agreement."""
    # Mean confidence
    confs = {
        m: df[f"{m}_pred_score"].mean(skipna=True)
        for m in model_names if f"{m}_pred_score" in df.columns
    }
    # Agreement: average ARI vs others
    aris = {}
    for m1 in model_names:
        others = [m2 for m2 in model_names if m2 != m1 and f"{m2}_pred" in df.columns]
        ari_vals = []
        for m2 in others:
            c1, c2 = f"{m1}_pred", f"{m2}_pred"
            common = df[[c1, c2]].dropna()
            if len(common) > 5:
                ari_vals.append(adjusted_rand_score(common[c1], common[c2]))
        aris[m1] = np.nanmean(ari_vals) if ari_vals else np.nan
    # Combine
    weights = {}
    for m in model_names:
        w = (confs.get(m, 1) * aris.get(m, 1))
        weights[m] = w if not np.isnan(w) else 1
    # Normalize
    total = np.sum(list(weights.values()))
    return {m: w / total if total > 0 else 1/len(weights) for m, w in weights.items()}


def _weighted_majority_vote(df, model_names, weights):
    """Compute weighted ensemble label based on model confidence and agreement."""
    preds = df[[f"{m}_pred" for m in model_names if f"{m}_pred" in df.columns]]
    all_labels = pd.unique(preds.values.ravel()[~pd.isnull(preds.values.ravel())])
    weighted_labels = []
    for i, row in preds.iterrows():
        label_weights = {lab: 0 for lab in all_labels}
        for m in model_names:
            val = row.get(f"{m}_pred", np.nan)
            if pd.notna(val):
                label_weights[val] += weights.get(m, 1)
        # Pick label with max total weight
        weighted_labels.append(max(label_weights, key=label_weights.get))
    return pd.Series(weighted_labels, index=df.index, name="ensemble_label")



In [8]:

def evaluate_ensemble(df, embedding=None, cluster_col="cluster"):
    """
    Evaluate ensemble model consistency for scRNA-seq cross-annotation.
    
    Parameters
    ----------
    df : pd.DataFrame
        Must contain '{model}_pred' columns (and optionally '{model}_pred_score').
        Must also contain a cluster column ('cluster' or specified cluster_col).
    embedding : np.ndarray, optional
        PCA or Harmony embedding (n_cells x n_dims) for silhouette evaluation.
    cluster_col : str, default 'cluster'
        Column name defining cell clusters (e.g., 'leiden').
        
    Returns
    -------
    dict with:
      - global_metrics : pd.Series
      - model_weights : pd.Series
      - per_cell : pd.DataFrame
      - per_cluster : pd.DataFrame
      - per_label : pd.DataFrame
    """

    model_names = sorted({c.split('_pred')[0] for c in df.columns if c.endswith('_pred')})
    pred_cols = [f"{m}_pred" for m in model_names if f"{m}_pred" in df.columns]
    preds = df[pred_cols]
    clusters = df[cluster_col]
    
    # --- COMPUTE MODEL WEIGHTS ---
    weights = _compute_model_weights(df, model_names)
    
    # --- WEIGHTED ENSEMBLE LABEL ---
    df["ensemble_label"] = _weighted_majority_vote(df, model_names, weights)
    
    # --- GLOBAL METRICS ---
    ent, consensus = _ensemble_entropy(preds)
    ari, nmi = _pairwise_agreement(preds)
    global_metrics = pd.Series({
        "fleiss_kappa": _compute_fleiss_kappa(preds),
        "mean_ARI": ari,
        "mean_NMI": nmi,
        "mean_entropy": np.nanmean(ent),
        "entropy_std": np.nanstd(ent),
        "mean_consensus": np.nanmean(consensus)
    })
    if embedding is not None:
        labels_int = LabelEncoder().fit_transform(df["ensemble_label"])
        try:
            global_metrics["silhouette_score"] = silhouette_score(embedding, labels_int)
        except Exception:
            global_metrics["silhouette_score"] = np.nan
    
    # --- PER CELL ---
    per_cell = pd.DataFrame({
        "entropy": ent,
        "consensus": consensus,
        cluster_col: clusters,
        "ensemble_label": df["ensemble_label"]
    }, index=df.index)
    
    # --- PER CLUSTER ---
    cluster_results = []
    for cluster_id, subset in df.groupby(cluster_col):
        preds_sub = subset[pred_cols]
        ent_sub, cons_sub = _ensemble_entropy(preds_sub)
        ari_sub, nmi_sub = _pairwise_agreement(preds_sub)
        cluster_results.append({
            cluster_col: cluster_id,
            "fleiss_kappa": _compute_fleiss_kappa(preds_sub),
            "mean_ARI": ari_sub,
            "mean_NMI": nmi_sub,
            "mean_entropy": np.nanmean(ent_sub),
            "entropy_std": np.nanstd(ent_sub),
            "mean_consensus": np.nanmean(cons_sub)
        })
    per_cluster = pd.DataFrame(cluster_results)
    
    # --- PER ENSEMBLE LABEL ---
    label_results = []
    for label, subset in df.groupby("ensemble_label"):
        preds_sub = subset[pred_cols]
        ent_sub, cons_sub = _ensemble_entropy(preds_sub)
        ari_sub, nmi_sub = _pairwise_agreement(preds_sub)
        label_results.append({
            "ensemble_label": label,
            "fleiss_kappa": _compute_fleiss_kappa(preds_sub),
            "mean_ARI": ari_sub,
            "mean_NMI": nmi_sub,
            "mean_entropy": np.nanmean(ent_sub),
            "entropy_std": np.nanstd(ent_sub),
            "mean_consensus": np.nanmean(cons_sub),
            "n_cells": len(subset)
        })
    per_label = pd.DataFrame(label_results)
    
    return {
        "global_metrics": global_metrics,
        "model_weights": pd.Series(weights),
        "per_cell": per_cell,
        "per_cluster": per_cluster,
        "per_label": per_label
    }

## Data import

In [None]:
df = pd.read_csv(path_to_predictions)

In [None]:
adata = sc.read(path_to_adata)

In [None]:
# Replace NaNs in predictions with 'unknown'
for col in df.columns:
    if col.endswith('_pred'):
        df[col] = df[col].fillna('unknown')


In [9]:
# Assuming df has:
# ['scanvi_pred', 'scanvi_pred_score', 'celltypist_pred', 'celltypist_pred_score', 'cluster']

results = evaluate_ensemble(df, embedding=adata.obsm['X_pca'], cluster_col='cluster')

print("Global metrics:\n", results['global_metrics'])
results['per_cluster'].to_csv("ensemble_per_cluster.csv", index=False)
results['per_label'].to_csv("ensemble_per_label.csv", index=False)

