# Multilingual Symmetry Evaluation with Sliced KS

This notebook implements the core pieces of a **multilingual symmetry** evaluation pipeline:

1. Generate multiple answers per prompt, per `(model, language)` condition.
2. Encode answers with a **shared multilingual embedding model**.
3. Compute a **Sliced Kolmogorov–Smirnov (sKS)** distance between the embedding distributions of each pair of conditions.
4. Aggregate sKS across random projection directions to get a **mean symmetry score with uncertainty**.

The code is structured so you can plug in your own LLM backends and your own prompt sets.

In [1]:

# Optional: install dependencies (uncomment if needed)
# You can skip this if you already have these packages in your environment.

# !pip install sentence-transformers scipy numpy pandas tqdm


In [None]:

import numpy as np
import pandas as pd
from dataclasses import dataclass
from typing import List, Dict, Tuple, Optional
from tqdm.auto import tqdm

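# SciPy is optional: _ks_1d falls back to a NumPy implementation when it is missing.
try:
    from scipy.stats import ks_2samp
except ImportError:
    ks_2samp = None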

np.random.seed(12345)


In [None]:
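# For multilingual embeddings.
# If the package is missing, load_embedding_model raises a descriptive ImportError.
try:
    from sentence_transformers import SentenceTransformer
except ImportError:
    SentenceTransformer = None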

In [None]:

# ---------------------------------------------------------
# Configuration
# ---------------------------------------------------------

# Example prompts for quick testing (replace with dataset-driven prompts later)
PROMPTS = [
    # Factual
    {
        "id": "fact_1",
        "en": "Who discovered penicillin?",
        "fr": "Qui a découvert la pénicilline ?",
        "type": "factual",
    },
    {
        "id": "fact_2",
        "en": "What is the capital of Japan?",
        "fr": "Quelle est la capitale du Japon ?",
        "type": "factual",
    },
    # Open-ended / intent-like
    {
        "id": "open_1",
        "en": "How would you resolve a conflict between two colleagues at work?",
        "fr": "Comment résoudriez-vous un conflit entre deux collègues au travail ?",
        "type": "open",
    },
]

# Define conditions you want to compare.
# For example, same model but different languages:
CONDITIONS = [
    {"name": "modelA_en", "model": "MODEL_A_NAME", "lang": "en"},
    {"name": "modelA_fr", "model": "MODEL_A_NAME", "lang": "fr"},
    # You can add more, e.g. another model:
    # {"name": "modelB_en", "model": "MODEL_B_NAME", "lang": "en"},
    # {"name": "modelB_fr", "model": "MODEL_B_NAME", "lang": "fr"},
]

# Pairs of conditions to compare with the symmetry metric
CONDITION_PAIRS = [
    ("modelA_en", "modelA_fr"),
    # e.g. ("modelA_en", "modelB_en"),
    # e.g. ("modelA_fr", "modelB_fr"),
]

# Number of stochastic samples per (prompt, condition)
N_SAMPLES_PER_CONDITION = 16

# Number of random projection directions for sliced KS
N_DIRECTIONS = 64

# Random seed for reproducibility
RANDOM_STATE = 123


## Answer generation

In this section, we define a stub for generating multiple answers from each model.
You should plug in your own LLM backend (OpenAI, Cohere, Anthropic, local models, etc.).

By default, we use a dummy implementation that echoes the prompt and appends the sample index,
so that the rest of the pipeline can be tested without external APIs.
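
One way to keep the backend pluggable is to hide each provider behind a callable that maps a prompt string to a single sampled answer. The sketch below is only an illustration of that idea, not part of the pipeline: `SampleFn`, `sample_n`, and `make_fake_backend` are hypothetical names, and the fake backend stands in for a real API call.

```python
import random
from typing import Callable, List

# Hypothetical type: any function that maps a prompt to one sampled answer.
SampleFn = Callable[[str], str]


def sample_n(sample_fn: SampleFn, prompt: str, n_samples: int) -> List[str]:
    """Call the backend n_samples times and collect the answers."""
    return [sample_fn(prompt) for _ in range(n_samples)]


def make_fake_backend(seed: int) -> SampleFn:
    """Stand-in for a real LLM call: echoes the prompt with a random suffix."""
    rng = random.Random(seed)

    def fake_backend(prompt: str) -> str:
        return f"{prompt} [{rng.randint(0, 9999)}]"

    return fake_backend


answers = sample_n(make_fake_backend(seed=0), "Who discovered penicillin?", 4)
print(len(answers))
```

With this shape, swapping in a real provider would mean writing one function of type `SampleFn` per backend and leaving the sampling loop unchanged.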

In [None]:

def generate_answers_for_condition(
    prompt_en: str,
    prompt_fr: str,
    condition: Dict,
    n_samples: int = 8,
) -> List[str]:
    """Generate answers for a single condition.

    Parameters
    ----------
    prompt_en : str
        English version of the prompt (semantic base).
    prompt_fr : str
        French version of the prompt (semantic base).
    condition : dict
        Contains at least 'name', 'model', 'lang' keys.
    n_samples : int
        How many answers to sample.

    Returns
    -------
    List[str]
        List of model outputs as text.

    Notes
    -----
    Replace the body of this function with actual calls to your LLM(s).
    For example:
        - OpenAI client.chat.completions.create
        - Cohere co.chat
        - Anthropic client.messages.create
    """
    lang = condition["lang"]
    model_name = condition["model"]

    # Choose the prompt according to language
    if lang == "en":
        prompt = prompt_en
    elif lang == "fr":
        prompt = prompt_fr
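    else:
        # Extend as needed for other languages. Failing loudly avoids silently
        # comparing English prompts under a non-English condition name.
        raise ValueError(f"Unsupported language: {lang!r}")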

    # TODO: Replace this dummy implementation with real API calls.
    # ------------------------------------------------------------------
    # Example pseudo-code for Cohere (commented out):
    #
    # from cohere import Client
    # co = Client(api_key="YOUR_API_KEY")
    # outputs = []
    # for _ in range(n_samples):
    #     resp = co.chat(
    #         model=model_name,
    #         message=prompt,
    #         temperature=0.7,
    #     )
    #     outputs.append(resp.text)
    # return outputs
    # ------------------------------------------------------------------

    # Dummy implementation: echo the prompt, tagged with the condition name and sample index
    outputs = []
    for i in range(n_samples):
        outputs.append(f"[{condition['name']}] {prompt} (sample {i})")
    return outputs


## Multilingual embedding model

We now load a single **multilingual sentence encoder** that will embed **all** answers
into a shared semantic vector space. This decouples the evaluation geometry from any particular LLM.
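
A quick way to check that an encoder really provides a shared space is to compare the cosine similarity of a translated pair against that of an unrelated pair. The sketch below uses hand-written placeholder vectors rather than real encoder output, and `cosine_similarity` is a hypothetical helper; with a real model the vectors would come from encoding actual answers.

```python
import numpy as np


def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom > 0 else 0.0


# Placeholder vectors standing in for encoder output (assumed, not real embeddings).
emb_en = np.array([0.9, 0.1, 0.0])
emb_fr = np.array([0.8, 0.2, 0.1])     # translation of the same answer
emb_other = np.array([0.0, 0.1, 0.9])  # unrelated answer

sim_pair = cosine_similarity(emb_en, emb_fr)
sim_unrelated = cosine_similarity(emb_en, emb_other)
print(sim_pair > sim_unrelated)
```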

In [None]:

EMBEDDING_MODEL_NAME = "sentence-transformers/LaBSE"  # or any multilingual model you like

def load_embedding_model(model_name: str = EMBEDDING_MODEL_NAME):
    if SentenceTransformer is None:
        raise ImportError(
            "sentence-transformers is not installed. "
            "Install it with `pip install sentence-transformers`."
        )
    print(f"Loading embedding model: {model_name}")
    return SentenceTransformer(model_name)

embedding_model = load_embedding_model()


In [None]:

def embed_texts(texts: List[str]):
    """Encode a list of strings into embeddings (np.ndarray of shape [N, D])."""
    if not texts:
        return np.zeros((0, embedding_model.get_sentence_embedding_dimension()))
    emb = embedding_model.encode(texts, convert_to_numpy=True, show_progress_bar=False)
    return emb


## Sliced Kolmogorov–Smirnov (sKS) distance

We now implement the **sliced KS** metric:

1. Draw `L` random unit vectors in embedding space.
2. Project the two sets of embeddings onto each vector.
3. Compute the 1D KS distance per direction.
4. Aggregate over directions to get a mean, standard deviation, and confidence interval.
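
In symbols, with unit directions $v_1, \dots, v_L$, the mean reported in step 4 is $\mathrm{sKS}(A, B) = \frac{1}{L} \sum_{l=1}^{L} \sup_t |F_{A,l}(t) - F_{B,l}(t)|$, where $F_{A,l}$ is the empirical CDF of set A projected onto $v_l$. The toy example below is a minimal sketch of these steps on synthetic Gaussian data, written with NumPy only; `ks_statistic` and `toy_sliced_ks` are hypothetical names used for illustration, and the data are not real embeddings.

```python
import numpy as np


def ks_statistic(a: np.ndarray, b: np.ndarray) -> float:
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    a_sorted, b_sorted = np.sort(a), np.sort(b)
    grid = np.concatenate([a_sorted, b_sorted])
    cdf_a = np.searchsorted(a_sorted, grid, side="right") / a_sorted.size
    cdf_b = np.searchsorted(b_sorted, grid, side="right") / b_sorted.size
    return float(np.max(np.abs(cdf_a - cdf_b)))


def toy_sliced_ks(x: np.ndarray, y: np.ndarray, n_directions: int, seed: int) -> float:
    """Mean 1D KS statistic over random unit directions."""
    rng = np.random.default_rng(seed)
    values = []
    for _ in range(n_directions):
        v = rng.normal(size=x.shape[1])
        v /= np.linalg.norm(v)
        values.append(ks_statistic(x @ v, y @ v))
    return float(np.mean(values))


rng = np.random.default_rng(0)
base = rng.normal(size=(200, 8))
same = rng.normal(size=(200, 8))           # same distribution as `base`
shifted = rng.normal(size=(200, 8)) + 2.0  # mean shifted in every coordinate

d_same = toy_sliced_ks(base, same, n_directions=64, seed=1)
d_shifted = toy_sliced_ks(base, shifted, n_directions=64, seed=1)
print("same:   ", d_same)
print("shifted:", d_shifted)
```

The shifted sample should score clearly higher than the matched one. The matched score is small but not zero, because two finite samples from the same distribution never have identical empirical CDFs.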

In [None]:

@dataclass
class SlicedKSMetrics:
    mean: float
    std: float
    sem: float
    ci_low: float
    ci_high: float
    per_direction: np.ndarray


def _ks_1d(a: np.ndarray, b: np.ndarray) -> float:
    """1D Kolmogorov–Smirnov distance between two samples.

    If scipy is available, we use ks_2samp; otherwise we fall back to a simple implementation.
    """
    if a.size == 0 or b.size == 0:
        return 0.0

    if ks_2samp is not None:
        # Defaults are two-sided with automatic method selection; omitting the
        # keywords keeps the call valid across SciPy versions.
        return float(ks_2samp(a, b).statistic)

    # Fallback: manual KS
    a_sorted = np.sort(a)
    b_sorted = np.sort(b)
    data_all = np.concatenate([a_sorted, b_sorted])

    # Empirical CDFs
    cdf_a = np.searchsorted(a_sorted, data_all, side="right") / a_sorted.size
    cdf_b = np.searchsorted(b_sorted, data_all, side="right") / b_sorted.size

    return float(np.max(np.abs(cdf_a - cdf_b)))


def sliced_ks_distance(
    emb_A: np.ndarray,
    emb_B: np.ndarray,
    n_directions: int = 64,
    random_state: Optional[int] = None,
) -> SlicedKSMetrics:
    """Compute Sliced Kolmogorov–Smirnov distance between two embedding sets.

    Parameters
    ----------
    emb_A : np.ndarray
        Embeddings of set A, shape (N_A, D).
    emb_B : np.ndarray
        Embeddings of set B, shape (N_B, D).
    n_directions : int
        Number of random projection directions.
    random_state : int, optional
        Seed for reproducibility.

    Returns
    -------
    SlicedKSMetrics
        Container with mean, std, sem, CI, and per-direction KS values.
    """
    if emb_A.size == 0 or emb_B.size == 0:
        return SlicedKSMetrics(
            mean=0.0, std=0.0, sem=0.0, ci_low=0.0, ci_high=0.0,
            per_direction=np.zeros(n_directions)
        )

    assert emb_A.shape[1] == emb_B.shape[1], "Embedding dimensions must match."
    d = emb_A.shape[1]

    rng = np.random.default_rng(random_state)

    ks_values = []
    for _ in range(n_directions):
        # Sample a random direction on the unit sphere
        v = rng.normal(size=d)
        v /= np.linalg.norm(v) + 1e-12

        proj_A = emb_A @ v
        proj_B = emb_B @ v

        ks_val = _ks_1d(proj_A, proj_B)
        ks_values.append(ks_val)

    ks_values = np.array(ks_values)
    mean = float(ks_values.mean())
    std = float(ks_values.std(ddof=1)) if ks_values.size > 1 else 0.0
    sem = float(std / np.sqrt(ks_values.size)) if ks_values.size > 0 else 0.0
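    # Approximate 95% CI (normal approximation) for the mean over projection directions.
    # It reflects variation across directions only, not sampling noise in the answers.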
    ci_low = max(0.0, mean - 1.96 * sem)
    ci_high = min(1.0, mean + 1.96 * sem)

    return SlicedKSMetrics(
        mean=mean,
        std=std,
        sem=sem,
        ci_low=ci_low,
        ci_high=ci_high,
        per_direction=ks_values,
    )


In [None]:

def symmetry_from_sks(metrics: SlicedKSMetrics) -> Dict[str, float]:
    """Convert sliced KS metrics into a symmetry-oriented view.

    Symmetry is defined as 1 - mean KS, with the same uncertainty structure.
    """
    sym_mean = 1.0 - metrics.mean
    sym_ci_low = 1.0 - metrics.ci_high
    sym_ci_high = 1.0 - metrics.ci_low
    return {
        "sym_mean": sym_mean,
        "sym_std": metrics.std,
        "sym_sem": metrics.sem,
        "sym_ci_low": sym_ci_low,
        "sym_ci_high": sym_ci_high,
    }
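
As a small worked example of the conversion, note that the interval bounds swap: the upper KS bound becomes the lower symmetry bound. The numbers below are made up for illustration and do not come from any model run.

```python
# Hypothetical sKS summary for one prompt (numbers chosen for illustration).
ks_mean, ks_ci_low, ks_ci_high = 0.25, 0.20, 0.30

sym_mean = 1.0 - ks_mean
# The bounds swap: the upper KS bound gives the lower symmetry bound.
sym_ci_low = 1.0 - ks_ci_high
sym_ci_high = 1.0 - ks_ci_low

print(sym_mean, sym_ci_low, sym_ci_high)
```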


## Evaluation loop

This cell ties everything together:

1. Iterate over prompts.
2. For each `(condition, prompt)` pair, generate multiple answers and embed them.
3. For each **condition pair**, compute the sliced KS metrics and corresponding symmetry scores.
4. Collect results in a DataFrame for further analysis.
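
Before running the loop, it may be worth checking that every name used in the condition pairs is actually defined, since a typo would otherwise surface as a `KeyError` partway through generation. The helper below is a hypothetical sketch of such a check; `find_unknown_conditions` and the example lists are illustrative and only mirror the configuration format used above.

```python
from typing import Dict, List, Tuple


def find_unknown_conditions(
    conditions: List[Dict], pairs: List[Tuple[str, str]]
) -> List[str]:
    """Return condition names used in `pairs` that are not defined in `conditions`."""
    known = {c["name"] for c in conditions}
    unknown = {name for pair in pairs for name in pair if name not in known}
    return sorted(unknown)


# Small example mirroring the configuration format used above.
example_conditions = [
    {"name": "modelA_en", "model": "MODEL_A_NAME", "lang": "en"},
    {"name": "modelA_fr", "model": "MODEL_A_NAME", "lang": "fr"},
]
example_pairs = [("modelA_en", "modelA_fr"), ("modelA_en", "modelB_en")]

unknown = find_unknown_conditions(example_conditions, example_pairs)
print(unknown)
```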

In [None]:

def build_condition_lookup(conditions: List[Dict]) -> Dict[str, Dict]:
    return {c["name"]: c for c in conditions}


condition_lookup = build_condition_lookup(CONDITIONS)

results = []

for prompt in tqdm(PROMPTS, desc="Prompts"):
    prompt_id = prompt["id"]
    prompt_type = prompt.get("type", "unknown")
    en_text = prompt["en"]
    fr_text = prompt["fr"]

    # Cache answers and embeddings per condition
    answers_by_condition: Dict[str, List[str]] = {}
    emb_by_condition: Dict[str, np.ndarray] = {}

    # Generate answers and embeddings for each condition
    for cond in CONDITIONS:
        cond_name = cond["name"]
        answers = generate_answers_for_condition(
            prompt_en=en_text,
            prompt_fr=fr_text,
            condition=cond,
            n_samples=N_SAMPLES_PER_CONDITION,
        )
        answers_by_condition[cond_name] = answers
        emb_by_condition[cond_name] = embed_texts(answers)

    # Now compute symmetry metrics for each condition pair
    for cond_a, cond_b in CONDITION_PAIRS:
        emb_A = emb_by_condition[cond_a]
        emb_B = emb_by_condition[cond_b]

        sks_metrics = sliced_ks_distance(
            emb_A,
            emb_B,
            n_directions=N_DIRECTIONS,
            random_state=RANDOM_STATE,
        )
        sym = symmetry_from_sks(sks_metrics)

        results.append(
            {
                "prompt_id": prompt_id,
                "prompt_type": prompt_type,
                "condition_A": cond_a,
                "condition_B": cond_b,
                "ks_mean": sks_metrics.mean,
                "ks_std": sks_metrics.std,
                "ks_sem": sks_metrics.sem,
                "ks_ci_low": sks_metrics.ci_low,
                "ks_ci_high": sks_metrics.ci_high,
                "sym_mean": sym["sym_mean"],
                "sym_std": sym["sym_std"],
                "sym_sem": sym["sym_sem"],
                "sym_ci_low": sym["sym_ci_low"],
                "sym_ci_high": sym["sym_ci_high"],
            }
        )

results_df = pd.DataFrame(results)
results_df


## Aggregate symmetry scores

We can now aggregate symmetry metrics across prompts, or restrict to factual vs open-ended prompts.

In [None]:

# Aggregate by condition pair and prompt type
agg = (
    results_df
    .groupby(["condition_A", "condition_B", "prompt_type"])
    .agg(
        sym_mean=("sym_mean", "mean"),
        sym_std=("sym_mean", "std"),  # NaN when a group contains a single prompt
        n_prompts=("prompt_id", "nunique"),
    )
    .reset_index()
)

agg


You can extend this section with:

- Boxplots or violin plots of `sym_mean` per prompt,
- Separate reporting for factual vs open-ended prompts,
- Per-prompt inspection of the most asymmetric cases (lowest `sym_mean`).
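
The last two items can be sketched with plain pandas operations. The example below runs on a made-up table that only mimics the column layout of `results_df`; the scores are illustrative, not measured.

```python
import pandas as pd

# Made-up scores in the same column layout as `results_df` (illustrative only).
toy_results = pd.DataFrame(
    {
        "prompt_id": ["fact_1", "fact_2", "open_1"],
        "prompt_type": ["factual", "factual", "open"],
        "sym_mean": [0.82, 0.91, 0.64],
    }
)

# Rows with the lowest symmetry first.
most_asymmetric = toy_results.nsmallest(2, "sym_mean")
print(most_asymmetric[["prompt_id", "sym_mean"]])

# Mean symmetry per prompt type.
by_type = toy_results.groupby("prompt_type")["sym_mean"].mean()
print(by_type)
```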