# Ties Matter: Meta-Evaluating Modern Metrics with Pairwise Accuracy and Tie Calibration
This Colab notebook contains the code for reproducing the results from the publication "Ties Matter: Meta-Evaluating Modern Metrics with Pairwise Accuracy and Tie Calibration" by Daniel Deutsch, George Foster, and Markus Freitag.

If you just want to use tie calibration and pairwise accuracy as proposed in the paper, here is an example of the most direct way of doing so:

```python
import numpy as np

from mt_metrics_eval import tau_optimization
from mt_metrics_eval import stats

# M is the number of groups, N is the number of observations per group.
# For instance, in the case of the group-by-item correlations, M is the number
# of items (or segments) and N is the number of systems. If you have no
# groups, M=1.
M = 10
N = 20

# Generate fake data for this example. X should be the matrix of metric scores
# and Y should be the matrix of human scores
X = np.random.rand(M, N)
Y = np.random.rand(M, N)

# Run tie calibration (called tau_optimization in this notebook). The sample_rate
# parameter indicates what proportion of all possible pairs of observations
# to use when searching for the optimal epsilon. If 1.0, all will be used. If
# you have a very large number of pairs, you may want to lower this. We found
# that small values (0.1) actually tend to yield relatively stable results
# (see Appendix E in the paper).
sample_rate = 1.0
result = tau_optimization.tau_optimization(
  X, Y, tau_optimization.TauSufficientStats.acc_23, sample_rate=sample_rate,
)

# The result object has various information, including `best_threshold` (equal
# to the best epsilon), `best_tau` (equal to the best `acc_23` score), plus
# `thresholds` and `taus` (they contain the various epsilons and corresponding
# accuracy scores that were used in the search).
print(result.best_threshold)
print(result.best_tau)

# If you already have an epsilon and you want to compute an accuracy score
# with that epsilon with two vectors, then you can do so with the following:
x = np.random.rand(N)  # the metric scores
y = np.random.rand(N)  # the human scores
epsilon = 0.05
accuracy, _ = stats.KendallVariants(
  y, x, variant="acc23", epsilon=epsilon
)
print(accuracy)
```
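
For intuition about what the snippet above computes, here is a minimal pure-Python sketch of pairwise accuracy with a tie threshold. It is not the MTME implementation: the helper name `pairwise_accuracy_sketch` is hypothetical, and the sketch assumes that a pair counts as a metric tie when the absolute score difference is at most epsilon, while human ties are exact.

```python
import itertools

import numpy as np


def pairwise_accuracy_sketch(metric, human, epsilon=0.0):
    """Fraction of pairs whose metric relation (<, =, >) matches the human one.

    Assumption: a metric pair is tied when |m_i - m_j| <= epsilon.
    """
    correct = 0
    total = 0
    for i, j in itertools.combinations(range(len(metric)), 2):
        m_diff = metric[i] - metric[j]
        h_diff = human[i] - human[j]
        # Metric relation: 0 for a (thresholded) tie, otherwise -1 or +1.
        m_rel = 0.0 if abs(m_diff) <= epsilon else np.sign(m_diff)
        # Human relation: exact ties only.
        h_rel = np.sign(h_diff)
        correct += int(m_rel == h_rel)
        total += 1
    return correct / total


toy_metric = [0.10, 0.12, 0.50]
toy_human = [1, 1, 2]
acc_without_ties = pairwise_accuracy_sketch(toy_metric, toy_human, epsilon=0.0)
acc_with_ties = pairwise_accuracy_sketch(toy_metric, toy_human, epsilon=0.05)
```

In this toy example the first two translations are tied according to the human scores but differ slightly in metric score, so the metric is only credited for that pair once epsilon is large enough to treat the small difference as a tie.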

If you use this meta-evaluation methodology, please cite the following paper:
```
@misc{deutsch2023ties,
      title={{Ties Matter: Meta-Evaluating Modern Metrics with Pairwise Accuracy and Tie Calibration}},
      author={Daniel Deutsch and George Foster and Markus Freitag},
      year={2023},
      eprint={2305.14324},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```

## Environment Setup
Installs the MTME library.

In [1]:
!git clone https://github.com/google-research/mt-metrics-eval.git && cd mt-metrics-eval && git checkout d18c3ebe91a004c124c179ad5614b8dba96f1f48 && pip install .

Cloning into 'mt-metrics-eval'...
remote: Enumerating objects: 281, done.
remote: Counting objects: 100% (104/104), done.
remote: Compressing objects: 100% (66/66), done.
remote: Total 281 (delta 65), reused 56 (delta 38), pack-reused 177 (from 1)
Receiving objects: 100% (281/281), 254.09 KiB | 7.26 MiB/s, done.
Resolving deltas: 100% (171/171), done.
Note: switching to 'd18c3ebe91a004c124c179ad5614b8dba96f1f48'.

You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by switching back to a branch.

If you want to create a new branch to retain commits you create, you may
do so (now or later) by using -c with the switch command. Example:

  git switch -c <new-branch-name>

Or undo this operation with:

  git switch -

Turn off this advice by setting config variable advice.detachedHead to false

HEAD is now at d18c3eb Add dependency.
Processing /Use

## Download Data
Downloads the WMT'22 metrics scores and the GEMBA metric outputs.

In [2]:
# MTME data
!python3 -m mt_metrics_eval.mtme --download

# GEMBA data
!mkdir -p gemba/wmt22/metric-scores/en-de gemba/wmt22/metric-scores/en-ru gemba/wmt22/metric-scores/zh-en
!wget https://raw.githubusercontent.com/MicrosoftTranslator/GEMBA/main/mt-metrics-eval-v2/wmt22/metric-scores/en-de/GEMBA-Dav3-DA-refA.seg.score -O gemba/wmt22/metric-scores/en-de/GEMBA-Dav3-DA-refA.seg.score
!wget https://raw.githubusercontent.com/MicrosoftTranslator/GEMBA/main/mt-metrics-eval-v2/wmt22/metric-scores/en-ru/GEMBA-Dav3-DA-refA.seg.score -O gemba/wmt22/metric-scores/en-ru/GEMBA-Dav3-DA-refA.seg.score
!wget https://raw.githubusercontent.com/MicrosoftTranslator/GEMBA/main/mt-metrics-eval-v2/wmt22/metric-scores/zh-en/GEMBA-Dav3-DA-refA.seg.score -O gemba/wmt22/metric-scores/zh-en/GEMBA-Dav3-DA-refA.seg.score
!wget https://raw.githubusercontent.com/MicrosoftTranslator/GEMBA/main/mt-metrics-eval-v2/wmt22/metric-scores/en-de/GEMBA-GPT4-DA-refA.seg.score -O gemba/wmt22/metric-scores/en-de/GEMBA-GPT4-DA-refA.seg.score
!wget https://raw.githubusercontent.com/MicrosoftTranslator/GEMBA/main/mt-metrics-eval-v2/wmt22/metric-scores/en-ru/GEMBA-GPT4-DA-refA.seg.score -O gemba/wmt22/metric-scores/en-ru/GEMBA-GPT4-DA-refA.seg.score
!wget https://raw.githubusercontent.com/MicrosoftTranslator/GEMBA/main/mt-metrics-eval-v2/wmt22/metric-scores/zh-en/GEMBA-GPT4-DA-refA.seg.score -O gemba/wmt22/metric-scores/zh-en/GEMBA-GPT4-DA-refA.seg.score

Downloading data into /Users/zhangran/.mt-metrics-eval
--2025-03-21 17:35:00--  https://raw.githubusercontent.com/MicrosoftTranslator/GEMBA/main/mt-metrics-eval-v2/wmt22/metric-scores/en-de/GEMBA-Dav3-DA-refA.seg.score
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 2606:50c0:8000::154, 2606:50c0:8001::154, 2606:50c0:8002::154, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|2606:50c0:8000::154|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 489316 (478K) [text/plain]
Saving to: ‘gemba/wmt22/metric-scores/en-de/GEMBA-Dav3-DA-refA.seg.score’


2025-03-21 17:35:01 (10,4 MB/s) - ‘gemba/wmt22/metric-scores/en-de/GEMBA-Dav3-DA-refA.seg.score’ saved [489316/489316]

--2025-03-21 17:35:01--  https://raw.githubusercontent.com/MicrosoftTranslator/GEMBA/main/mt-metrics-eval-v2/wmt22/metric-scores/en-ru/GEMBA-Dav3-DA-refA.seg.score
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 2606:50c0:8000::154, 2606:50c0:

## Imports

In [4]:
import functools
import matplotlib.pyplot as plt
from matplotlib.lines import Line2D
from matplotlib.patches import Patch
from mt_metrics_eval import data as mtme_data
from mt_metrics_eval import stats as mtme_stats
from mt_metrics_eval import tau_optimization
import numpy as np
import pandas as pd
import scipy.stats
from typing import Any

## Load WMT'22 Evaluation Sets

In [8]:
eval_sets = {
    "en-de": mtme_data.EvalSet("wmt22", "en-de", read_stored_metric_scores=True, path=["gemba", "/gemba/wmt22/metric-scores"]),
    "en-ru": mtme_data.EvalSet("wmt22", "en-ru", read_stored_metric_scores=True, path=["gemba", "/gemba/wmt22/metric-scores"]),
    "zh-en": mtme_data.EvalSet("wmt22", "zh-en", read_stored_metric_scores=True, path=["gemba", "/gemba/wmt22/metric-scores"]),
}

FileNotFoundError: [Errno 2] No such file or directory: 'gemba/wmt22/documents/en-de.docs'

## Utility Functions
Implements helpers for retrieving and filtering scores and for calculating the correlation variants that are not checked in to the MTME library.
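
The `kendall` helper in this section reduces every variant to five pair counts: concordant, discordant, tied only in the first argument, tied only in the second, and tied in both. The sketch below shows how those counts combine, using made-up counts for illustration. The function name `kendall_variants_from_counts` is hypothetical, and variant "c" is omitted because it also needs the sample size and the number of distinct ranks.

```python
import math


def kendall_variants_from_counts(con, dis, tx, ty, txy):
    """Combines pair counts the same way as the `kendall` helper.

    Assumption: `tx` counts pairs tied only in the first argument and `ty`
    counts pairs tied only in the second argument.
    """
    total = con + dis + tx + ty + txy
    return {
        "a": (con - dis) / total,
        "b": (con - dis) / math.sqrt((con + dis + tx) * (con + dis + ty)),
        "10": (con - dis - ty) / (con + dis + ty),
        "13": (con - dis) / (con + dis),
        "14": (con - dis) / (con + dis + ty),
        "acc_eq": (con + txy) / total,
    }


# Made-up counts for illustration only.
example = kendall_variants_from_counts(con=6, dis=2, tx=1, ty=1, txy=0)
```

With these counts, variants "13" and "a" differ only in whether tied pairs enter the denominator, which is why the choice of variant matters most when ties are frequent.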

In [None]:
def get_metric_scores(evs: mtme_data.EvalSet, metric: str) -> dict[str, list[float]]:
  scores_dict = evs.Scores("seg", metric)
  bad_systems = evs.outlier_sys_names | {evs.std_ref}
  return {
      system: scores for system, scores in scores_dict.items() if system not in bad_systems
  }


def kendall(x, y, variant):
  x = np.asarray(x).ravel()
  y = np.asarray(y).ravel()

  if x.size != y.size:
    raise ValueError(
        'All inputs to `kendall` must be of the same '
        f'size, found x-size {x.size} and y-size {y.size}'
    )
  elif not x.size or not y.size:
    raise ValueError('x or y are empty')

  # check both x and y
  cnx = np.any(np.isnan(x))
  cny = np.any(np.isnan(y))
  contains_nan = cnx or cny
  if contains_nan:
    raise ValueError('x or y contains NaN')

  def count_rank_tie(ranks):
    cnt = np.bincount(ranks).astype('int64', copy=False)
    cnt = cnt[cnt > 1]
    return (
        (cnt * (cnt - 1) // 2).sum(),
        (cnt * (cnt - 1.0) * (cnt - 2)).sum(),
        (cnt * (cnt - 1.0) * (2 * cnt + 5)).sum(),
    )

  size = x.size
  perm = np.argsort(y)  # sort on y and convert y to dense ranks
  x, y = x[perm], y[perm]
  y = np.r_[True, y[1:] != y[:-1]].cumsum(dtype=np.intp)

  # stable sort on x and convert x to dense ranks
  perm = np.argsort(x, kind='mergesort')
  x, y = x[perm], y[perm]
  x = np.r_[True, x[1:] != x[:-1]].cumsum(dtype=np.intp)

  dis = scipy.stats._stats._kendall_dis(x, y)  # discordant pairs

  obs = np.r_[True, (x[1:] != x[:-1]) | (y[1:] != y[:-1]), True]
  cnt = np.diff(np.nonzero(obs)[0]).astype('int64', copy=False)

  ntie = (cnt * (cnt - 1) // 2).sum()  # joint ties
  xtie, _, _ = count_rank_tie(x)  # ties in x, stats
  ytie, _, _ = count_rank_tie(y)  # ties in y, stats

  tot = (size * (size - 1)) // 2
  con = tot - ((xtie - ntie) + (ytie - ntie) + ntie + dis)

  minclasses = min(len(set(x)), len(set(y)))

  tx = xtie - ntie
  ty = ytie - ntie
  txy = ntie

  if variant == "a":
    return (con - dis) / (con + dis + tx + ty + txy), 0
  elif variant == "b":
    return (con - dis) / np.sqrt((con + dis + tx) * (con + dis + ty)), 0
  elif variant == "c":
    minclasses = min(len(set(x)), len(set(y)))
    return 2 * (con - dis) / (size**2 * (minclasses - 1) / minclasses), 0
  elif variant == "10":
    return (con - dis - ty) / (con + dis + ty), 0
  elif variant == "13":
    return (con - dis) / (con + dis), 0
  elif variant == "14":
    return (con - dis) / (con + dis + ty), 0
  elif variant == "acc_eq":
    # Accuracy assuming tie optimization is done.
    return (con + txy) / (con + dis + tx + ty + txy), 0
  raise ValueError(f"Unknown variant: {variant}")


def custom_kendall(corr, variant: str, average_by: str = "none"):
  cf = corr.AverageCorrelation(
      kendall, average_by, variant=variant)
  return cf(corr.gold_scores, corr.metric_scores)


def calculate_correlations(
    evs: mtme_data.EvalSet,
    mqm_scores: dict[str, list[float]],
    metric_scores: dict[str, list[float]],
    coef: str,
) -> dict[str, float]:
  corr = evs.Correlation(mqm_scores, metric_scores)
  if coef == "kendall-a":
    corr_fn = functools.partial(custom_kendall, corr=corr, variant="a")
  elif coef == "kendall-b":
    corr_fn = functools.partial(custom_kendall, corr=corr, variant="b")
  elif coef == "kendall-c":
    corr_fn = functools.partial(custom_kendall, corr=corr, variant="c")
  elif coef == "kendall-10":
    corr_fn = functools.partial(custom_kendall, corr=corr, variant="10")
  elif coef == "kendall-13":
    corr_fn = functools.partial(custom_kendall, corr=corr, variant="13")
  elif coef == "kendall-14":
    corr_fn = functools.partial(custom_kendall, corr=corr, variant="14")
  elif coef == "accuracy-eq_no_calib":
    corr_fn = functools.partial(custom_kendall, corr=corr, variant="acc_eq")
  elif coef == "accuracy-eq":
    corr_fn = functools.partial(
        corr.KendallWithTiesOpt,
        sample_rate=1.0
    )
  elif coef == "pearson":
    corr_fn = corr.Pearson
  else:
    raise ValueError(coef)

  no_grouping = corr_fn()[0]
  group_by_item, _, num_items = corr_fn(average_by="item")
  group_by_system = corr_fn(average_by="sys")[0]
  return {
      "no_grouping": no_grouping,
      "group_by_item": group_by_item,
      "group_by_item_num_items": num_items,
      "group_by_system": group_by_system,
  }


## Analyze Ties
These results correspond to Tables 3, 4, and 10 from the paper.
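
The tie statistics in this section are counted with nested loops over all pairs. As a cross-check, the same counts for a single group of scores can be obtained from value frequencies, since a value that occurs c times contributes c(c-1)/2 tied pairs. This is a minimal sketch with a hypothetical helper name and made-up scores, not part of the notebook's pipeline.

```python
from collections import Counter


def count_tied_pairs(scores):
    """Returns (num_pairs, num_tied_pairs, num_zero_tied_pairs) for one group."""
    n = len(scores)
    num_pairs = n * (n - 1) // 2
    counts = Counter(scores)
    # Each value seen c times yields c * (c - 1) / 2 tied pairs.
    num_tied = sum(c * (c - 1) // 2 for c in counts.values())
    zeros = counts.get(0.0, 0)
    num_zero_tied = zeros * (zeros - 1) // 2
    return num_pairs, num_tied, num_zero_tied


# Made-up scores: three zeros, two equal non-zero scores, one unique score.
toy_counts = count_tied_pairs([0.0, 0.0, 0.0, -1.0, -1.0, -5.0])
```

For grouped settings, the per-group counts would be summed across items or systems, as the loops in `analyze_ties` do.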

In [None]:
def analyze_ties(grouping: str, metric: str) -> pd.DataFrame:
  df = []
  for lp, evs in eval_sets.items():
    mqm_dict = get_metric_scores(evs, "mqm")
    scores_dict = get_metric_scores(evs, metric)
    num_translations = 0
    num_pairs = 0
    num_tied_pairs = 0
    num_zero_pairs = 0

    if grouping == "no_grouping":
      all_scores = []
      for system, scores in scores_dict.items():
        for i, score in enumerate(scores):
          if score is not None and mqm_dict[system][i] is not None:
            all_scores.append(score)

      num_translations = len(all_scores)
      for i in range(len(all_scores)):
        for j in range(i + 1, len(all_scores)):
          num_pairs += 1
          if all_scores[i] == all_scores[j]:
            num_tied_pairs += 1
            if all_scores[i] == 0.0:
              num_zero_pairs += 1

    elif grouping == "group_by_item":
      for i in range(len(evs.src)):
        item_scores = []
        for system, scores in scores_dict.items():
          if scores[i] is not None and mqm_dict[system][i] is not None:
            item_scores.append(scores[i])

        num_translations += len(item_scores)
        for j in range(len(item_scores)):
          for k in range(j + 1, len(item_scores)):
            num_pairs += 1
            if item_scores[j] == item_scores[k]:
              num_tied_pairs += 1
              if item_scores[j] == 0.0:
                num_zero_pairs += 1

    elif grouping == "group_by_system":
      for system, scores in scores_dict.items():
        system_scores = []
        for i, score in enumerate(scores):
          if score is not None and mqm_dict[system][i] is not None:
            system_scores.append(score)

        num_translations += len(system_scores)
        for i in range(len(system_scores)):
          for j in range(i + 1, len(system_scores)):
            num_pairs += 1
            if system_scores[i] == system_scores[j]:
              num_tied_pairs += 1
              if system_scores[i] == 0.0:
                num_zero_pairs += 1

    df.append({
        "lp": lp,
        "num_translations": num_translations,
        "num_pairs": num_pairs,
        "num_tied_pairs": num_tied_pairs,
        "percent_of_pairs_tied": num_tied_pairs / num_pairs * 100,
        "num_zero_pairs": num_zero_pairs,
        # Undefined when there are no tied pairs; report NaN instead of
        # raising ZeroDivisionError.
        "percent_of_tied_pairs_zero_tied": (
            num_zero_pairs / num_tied_pairs * 100
            if num_tied_pairs else float("nan")
        ),
        "percent_of_pairs_zero_tied": num_zero_pairs / num_pairs * 100,
    })
  return pd.DataFrame(df)


### MQM

In [None]:
analyze_ties("no_grouping", "mqm")

In [None]:
analyze_ties("group_by_item", "mqm")

In [None]:
analyze_ties("group_by_system", "mqm")

### Metrics

In [None]:
analyze_ties("group_by_item", "metricx_xxl_MQM_2020-refA")

In [None]:
analyze_ties("group_by_item", "COMET-22-refA")

In [None]:
analyze_ties("group_by_item", "MATESE-refA")

In [None]:
analyze_ties("group_by_item", "GEMBA-Dav3-DA-refA")

In [None]:
analyze_ties("group_by_item", "GEMBA-GPT4-DA-refA")

## Equal Width Buckets Experiments
This experiment maps a metric score into k buckets of equal width, simulating what would happen if the metric predicted a larger number of ties than it actually does.
This experiment produces Figure 3.
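
To make the bucketing concrete, the following sketch applies the same `np.digitize` construction to a handful of made-up scores. The helper name `equal_width_bucket_ids` and the toy values are illustrative assumptions.

```python
import numpy as np


def equal_width_bucket_ids(scores, num_buckets):
    """Maps scores to integer bucket ids over equal-width score ranges."""
    scores = np.asarray(scores, dtype=float)
    lo, hi = scores.min(), scores.max()
    width = (hi - lo) / num_buckets
    # num_buckets - 1 interior edges split [lo, hi] into num_buckets ranges.
    edges = [lo + width * (i + 1) for i in range(num_buckets - 1)]
    return np.digitize(scores, edges)


toy_scores = [0.0, 0.1, 0.5, 0.9, 1.0]
two_buckets = equal_width_bucket_ids(toy_scores, 2)
four_buckets = equal_width_bucket_ids(toy_scores, 4)
```

With two buckets, 0.0 and 0.1 collapse into one tie group and the remaining three scores into another; with four buckets, fewer scores share an id, so fewer ties are introduced.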

In [None]:
def map_to_equal_width_buckets(
    evs: mtme_data.EvalSet,
    scores_dict: dict[str, list[float]],
    num_buckets: int,
) -> dict[str, list[float]]:
  """Maps the scores to integer buckets where each bucket represents an equal score range."""
  all_scores = [x for scores in scores_dict.values() for x in scores]
  min_value = min(all_scores)
  max_value = max(all_scores)
  width = (max_value - min_value) / num_buckets
  bins = [min_value + width * (i + 1) for i in range(num_buckets - 1)]
  return {
      system: np.digitize(scores, bins)  for system, scores in scores_dict.items()
  }

def run_equal_width_buckets_experiment(lp: str, metric: str, coef: str):
  evs = eval_sets[lp]
  num_buckets_list = [
      2, 3, 4, 5, 10, 25, 50, 100, 200, 500, 1000, 5000, 10000, 20000, 30000
  ]
  mqm_dict = get_metric_scores(evs, "mqm")
  scores_dict = get_metric_scores(evs, metric)

  # Calculate the correlation when the scores are bucketed
  correlations = []
  num_non_nan_segments = []
  for num_buckets in num_buckets_list:
    bucketed_scores = map_to_equal_width_buckets(evs, scores_dict, num_buckets)
    correlations_dict = calculate_correlations(evs, mqm_dict, bucketed_scores, coef)
    correlations.append(correlations_dict["group_by_item"])
    num_non_nan_segments.append(correlations_dict["group_by_item_num_items"])

  # Calculate the original correlation
  original_dict = calculate_correlations(evs, mqm_dict, scores_dict, coef)
  original = original_dict["group_by_item"]

  return {
      "num_buckets": num_buckets_list,
      "bucketed_correlations": correlations,
      "num_segments": num_non_nan_segments,
      "original_correlation": original,
      "original_correlation_num_segments": original_dict["group_by_item_num_items"],
  }


In [None]:
def plot_bucketed_correlations(
    lp: str,
    metric: str,
    data: dict[str, Any],
):

  fig, axes = plt.subplots(nrows=2, ncols=1, figsize=(10, 8))

  num_buckets = data["num_buckets"]
  x = np.log10(num_buckets)

  y1_bucketed = data["bucketed_correlations"]
  y1_original = data["original_correlation"]

  y2_bucketed = data["num_segments"]
  y2_original = data["original_correlation_num_segments"]

  axes[0].plot(x, y1_bucketed, label="Bucketed Scores", color="blue", marker="o", markersize=10)
  axes[0].axhline(y1_original, label="Original Scores", linestyle="dashed", color="orange")

  axes[1].plot(x, y2_bucketed, label="Bucketed Scores", color="blue", marker="o", markersize=10)
  axes[1].axhline(y2_original, label="Original Scores", linestyle="dashed", color="orange")

  axes[0].set_ylabel("Group-by-Item $\\tau_b$")
  axes[1].set_xlabel("log$_{10}$(Number of Buckets)")
  axes[1].set_ylabel("#Non-NaN Groups")

  handles, labels = axes[0].get_legend_handles_labels()

  # Add legend with the modified handles and labels. Disable the legend box,
  # which only takes up space and doesn't look good anyway.
  # Reshape the line symbol a bit.
  axes[0].legend(handles, labels, frameon=False, handlelength=2.2,
                 handletextpad=0.5, borderpad=1)

  # Modify space between subplots
  fig.subplots_adjust(hspace=0.2)

  fig.show()


In [None]:
lp = "en-de"
metric = "metricx_xxl_MQM_2020-refA"
results = run_equal_width_buckets_experiment(lp, metric, "kendall-b")
plot_bucketed_correlations(lp, metric, results)

## Tie Calibration Experiments
This reproduces the main result of the paper.
It ranks metrics by different correlation coefficients, including the pairwise accuracy with tie calibration.
These results correspond to Tables 6 and 12-20.
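
Conceptually, tie calibration searches for the epsilon that maximizes pairwise accuracy. The brute-force sketch below illustrates that idea on a single group of scores: it tries epsilon = 0 and every observed absolute score difference, and keeps the smallest epsilon with the highest accuracy. It is far slower than the MTME routine and is not that implementation; the function names and the convention that a pair is tied when the difference is at most epsilon are assumptions.

```python
import itertools

import numpy as np


def accuracy_at(metric, human, epsilon):
    """Pairwise accuracy where metric ties are |m_i - m_j| <= epsilon."""
    pairs = list(itertools.combinations(range(len(metric)), 2))
    correct = 0
    for i, j in pairs:
        m_diff = metric[i] - metric[j]
        m_rel = 0.0 if abs(m_diff) <= epsilon else np.sign(m_diff)
        correct += int(m_rel == np.sign(human[i] - human[j]))
    return correct / len(pairs)


def brute_force_tie_calibration(metric, human):
    """Returns (best_epsilon, best_accuracy) over all candidate thresholds."""
    # Accuracy can only change at an observed pairwise difference.
    candidates = {0.0}
    for i, j in itertools.combinations(range(len(metric)), 2):
        candidates.add(abs(metric[i] - metric[j]))
    best_eps, best_acc = 0.0, -1.0
    for eps in sorted(candidates):
        acc = accuracy_at(metric, human, eps)
        if acc > best_acc:  # strict: ties go to the smaller epsilon
            best_eps, best_acc = eps, acc
    return best_eps, best_acc


best_eps, best_acc = brute_force_tie_calibration([0.10, 0.12, 0.50], [1, 1, 2])
```

In the toy example, the selected epsilon is the difference between the two translations that humans scored equally, which is the smallest threshold that lets the metric predict that tie.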

In [None]:
def select_optimal_epsilon(
    evs: mtme_data.EvalSet,
    metric: str,
    grouping: str,
):
  mqm_scores = get_metric_scores(evs, "mqm")
  metric_scores = get_metric_scores(evs, metric)
  if not metric_scores:
    return None

  if grouping == "no_grouping":
    average_by = "none"
    sample_rate = 0.1
  elif grouping == "group_by_item":
    average_by = "item"
    sample_rate = 1.0
  elif grouping == "group_by_system":
    average_by = "sys"
    sample_rate = 0.1
  else:
    raise ValueError(grouping)

  corr = evs.Correlation(mqm_scores, metric_scores)

  return mtme_stats.KendallWithTiesOpt(
      corr.gold_scores,
      corr.metric_scores,
      num_sys=corr.num_sys,
      average_by=average_by,
      sample_rate=sample_rate,
  )


def run_tie_calibration_experiment(
    evs: mtme_data.EvalSet,
    grouping: str,
    metrics: list[str],
):
  mqm_scores = get_metric_scores(evs, "mqm")
  coefs = ["kendall-a", "kendall-b", "kendall-c", "kendall-10", "kendall-13", "kendall-14", "accuracy-eq_no_calib"]

  df = []
  for metric in metrics:
    # Run the optimization
    opt_results = select_optimal_epsilon(
        evs, metric, grouping,
    )
    if not opt_results:
      continue

    # Calculate baseline metric scores
    metric_scores = get_metric_scores(evs, metric)
    correlations = {
        coef: calculate_correlations(
            evs,
            mqm_scores,
            metric_scores,
            coef,
        ) for coef in coefs
    }

    results = {
        "metric": metric,
        "best_epsilon": opt_results[1],
        "best_acc": opt_results[0],
    }
    for coef in coefs:
      results[coef] = correlations[coef][grouping]
    df.append(results)

  df = pd.DataFrame(df)
  for coef in ["best_acc"] + coefs:
    df[f"{coef}_rank"] = df[coef].rank(ascending=False)

  return df.sort_values(by=["best_acc_rank"])


In [None]:
ende_metrics = [
    "metricx_xxl_MQM_2020-refA",
    "UniTE-ref-refA",
    "COMET-22-refA",
    "MATESE-refA",
    "UniTE-src-src",
    "GEMBA-GPT4-DA-refA",
    "MATESE-QE-src",
    "COMETKiwi-src",
    "MS-COMET-22-refA",
    "COMET-QE-src",
    "SEScore-refA",
    "HWTSC-Teacher-Sim-src",
    "GEMBA-Dav3-DA-refA",
    "MEE-refA",
    "REUSE-src",
    "BLEURT-20-refA",
]

In [None]:
# Too slow to run on the free Colab server
# run_tie_calibration_experiment(
#     eval_sets["en-de"],
#     "no_grouping",
#     ende_metrics,
# )

In [None]:
run_tie_calibration_experiment(
    eval_sets["en-de"],
    "group_by_item",
    ende_metrics,
)

In [None]:
# run_tie_calibration_experiment(
#     eval_sets["en-de"],
#     "group_by_system",
#     ende_metrics,
# )

In [None]:
enru_metrics = [
    "metricx_xxl_MQM_2020-refA",
    "UniTE-ref-refA",
    "COMET-22-refA",
    "MATESE-refA",
    "UniTE-src-src",
    "GEMBA-GPT4-DA-refA",
    "MATESE-QE-src",
    "COMETKiwi-src",
    "MS-COMET-22-refA",
    "COMET-QE-src",
    "HWTSC-Teacher-Sim-src",
    "GEMBA-Dav3-DA-refA",
    "MEE-refA",
    "REUSE-src",
    "BLEURT-20-refA",
]

In [None]:
# Too slow/too much memory for public Colab
# run_tie_calibration_experiment(
#     eval_sets["en-ru"],
#     "no_grouping",
#     enru_metrics,
# )

In [None]:
run_tie_calibration_experiment(
    eval_sets["en-ru"],
    "group_by_item",
    enru_metrics,
)

In [None]:
# Too slow/too much memory for public Colab
# run_tie_calibration_experiment(
#     eval_sets["en-ru"],
#     "group_by_system",
#     enru_metrics,
# )

In [None]:
zhen_metrics = [
    "metricx_xxl_MQM_2020-refA",
    "UniTE-ref-refA",
    "COMET-22-refA",
    "MATESE-refA",
    "UniTE-src-src",
    "GEMBA-GPT4-DA-refA",
    "MATESE-QE-src",
    "COMETKiwi-src",
    "MS-COMET-22-refA",
    "COMET-QE-src",
    "SEScore-refA",
    "HWTSC-Teacher-Sim-src",
    "GEMBA-Dav3-DA-refA",
    "MEE-refA",
    "REUSE-src",
    "BLEURT-20-refA",
]

In [None]:
# Too slow/too much memory for public Colab
# run_tie_calibration_experiment(
#     eval_sets["zh-en"],
#     "no_grouping",
#     zhen_metrics,
# )

In [None]:
run_tie_calibration_experiment(
    eval_sets["zh-en"],
    "group_by_item",
    zhen_metrics,
)

In [None]:
# Too slow/too much memory for public Colab
# run_tie_calibration_experiment(
#     eval_sets["zh-en"],
#     "group_by_system",
#     zhen_metrics,
# )

## Analyze Epsilon Across Years
This experiment analyzes whether or not the epsilon selected from one year of WMT generalizes to other years.
This corresponds to Figure 4.
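
The generalization check boils down to scoring one dataset with an epsilon chosen on another and comparing against the accuracy obtained with the dataset's own epsilon. The sketch below illustrates the absolute and relative deltas on made-up scores for a single group. The helper names, the toy data, and the two epsilon values are all illustrative assumptions, not values from the paper.

```python
import numpy as np


def pairwise_accuracy(metric, human, epsilon):
    """Vectorized pairwise accuracy; metric ties are |m_i - m_j| <= epsilon."""
    metric = np.asarray(metric, dtype=float)
    human = np.asarray(human, dtype=float)
    i, j = np.triu_indices(len(metric), k=1)
    m_diff = metric[i] - metric[j]
    m_rel = np.where(np.abs(m_diff) <= epsilon, 0.0, np.sign(m_diff))
    h_rel = np.sign(human[i] - human[j])
    return float(np.mean(m_rel == h_rel))


def transfer_deltas(acc_own_eps, acc_other_eps):
    """Absolute delta (in accuracy points) and relative delta."""
    abs_delta = (acc_own_eps - acc_other_eps) * 100
    rel_delta = (acc_own_eps - acc_other_eps) / acc_own_eps
    return abs_delta, rel_delta


# Made-up data standing in for one year; the epsilons are hypothetical.
metric_b = [0.20, 0.25, 0.90, 0.91]
human_b = [0, 0, 3, 3]
acc_own = pairwise_accuracy(metric_b, human_b, epsilon=0.10)
acc_transferred = pairwise_accuracy(metric_b, human_b, epsilon=0.02)
abs_delta, rel_delta = transfer_deltas(acc_own, acc_transferred)
```

A small delta indicates that the transferred epsilon loses little accuracy relative to the one selected on the same data.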

In [None]:
def convert_to_matrices(mqm_scores: dict[str, list[float]], metric_scores: dict[str, list[float]]):
  X, Y = [], []
  for system in mqm_scores.keys():
    if system not in metric_scores:
      continue
    x = metric_scores[system]
    y = mqm_scores[system]
    if not y or not any(score is not None for score in y):
      continue
    assert len(x) == len(y)
    X.append([x[i] for i in range(len(x)) if y[i] is not None])
    Y.append([y[i] for i in range(len(y)) if y[i] is not None])
  return np.array(X), np.array(Y)


def calculate_group_by_item_acc(X: np.ndarray, Y: np.ndarray, epsilon: float):
  accs = []
  for x, y in zip(X.T, Y.T):
    accs.append(mtme_stats.KendallVariants(x, y, variant="acc23", epsilon=epsilon)[0])
  return np.mean(accs)


def _compare_epsilson_across_years(lp: str, metric: str, ax, legend: bool = True, xlabel = True, ylabel = True):
  if lp == "en-de":
    if metric == "bleurt":
      metric21 = "bleurt-20-refC"
      metric22 = "BLEURT-20-refA"
    elif metric == "comet":
      metric21 = "COMET-DA_2020-refC"
      metric22 = "COMET-20-refA"
    elif metric == "bleu":
      metric21 = "sentBLEU-refC"
      metric22 = "BLEU-refA"
    elif metric == "bertscore":
      metric21 = "BERTScore-refC"
      metric22 = "BERTScore-refA"
  elif lp == "zh-en":
    if metric == "bleurt":
      metric21 = "bleurt-20-refB"
      metric22 = "BLEURT-20-refA"
    elif metric == "comet":
      metric21 = "COMET-DA_2020-refB"
      metric22 = "COMET-20-refA"
    elif metric == "bleu":
      metric21 = "sentBLEU-refB"
      metric22 = "BLEU-refA"
    elif metric == "bertscore":
      metric21 = "BERTScore-refB"
      metric22 = "BERTScore-refA"

  evs21 = mtme_data.EvalSet("wmt21.news", lp, read_stored_metric_scores=True)
  evs22 = mtme_data.EvalSet("wmt22", lp, read_stored_metric_scores=True)

  mqm_scores_21 = get_metric_scores(evs21, "mqm")
  mqm_scores_22 = get_metric_scores(evs22, "mqm")
  metric_scores_21 = get_metric_scores(evs21, metric21)
  metric_scores_22 = get_metric_scores(evs22, metric22)

  if lp == "en-de":
    del mqm_scores_21["refB"]
    del metric_scores_21["refB"]

  X_21, Y_21 = convert_to_matrices(mqm_scores_21, metric_scores_21)
  X_22, Y_22 = convert_to_matrices(mqm_scores_22, metric_scores_22)

  res21 = tau_optimization.tau_optimization(
      X_21.T, Y_21.T, tau_optimization.TauSufficientStats.acc_23
  )
  res22 = tau_optimization.tau_optimization(
      X_22.T, Y_22.T, tau_optimization.TauSufficientStats.acc_23
  )

  acc21_threshold22 = calculate_group_by_item_acc(X_21, Y_21, res22.best_threshold)
  acc22_threshold21 = calculate_group_by_item_acc(X_22, Y_22, res21.best_threshold)

  print(metric)
  print(f"WMT'21 eps={res21.best_threshold}, WMT'21 acc={res21.best_tau}, WMT'22 acc={acc22_threshold21}")
  print(f"WMT'22 eps={res22.best_threshold}, WMT'22 acc={res22.best_tau}, WMT'21 acc={acc21_threshold22}")
  print(f"WMT'21 accuracy abs delta: {(res21.best_tau - acc21_threshold22) * 100}")
  print(f"WMT'22 accuracy abs delta: {(res22.best_tau - acc22_threshold21) * 100}")
  print(f"WMT'21 accuracy rel delta: {(res21.best_tau - acc21_threshold22) / res21.best_tau:.2%}")
  print(f"WMT'22 accuracy rel delta: {(res22.best_tau - acc22_threshold21) / res22.best_tau:.2%}")
  print(f"WMT'21 -> 22 epsilon abs delta: {res21.best_threshold - res22.best_threshold}")
  print(f"WMT'22 -> 21 epsilon abs delta: {res22.best_threshold - res21.best_threshold}")
  print(f"WMT'21 -> 22 epsilon rel delta: {(res21.best_threshold - res22.best_threshold) / res21.best_threshold:.2%}")
  print(f"WMT'22 -> 21 epsilon rel delta: {(res22.best_threshold - res21.best_threshold) / res22.best_threshold:.2%}")


  ax.plot(res21.thresholds, res21.taus, label="WMT'21", color="blue")
  ax.plot(res22.thresholds, res22.taus, label="WMT'22", color="orange")
  ax.axvline(res21.best_threshold, color="blue", linestyle="dashed")
  ax.axvline(res22.best_threshold, color="orange", linestyle="dashed")
  if ylabel:
    ax.set_ylabel("Pairwise Accuracy")
  if xlabel:
    ax.set_xlabel("Epsilon")
  ax.set_title(lp)

  elements = [
      Patch(facecolor="blue", edgecolor="blue", label="WMT'21"),
      Patch(facecolor="orange", edgecolor="orange", label="WMT'22"),
      Line2D([0], [0], lw=4, color="black", label="Accuracy"),
      Line2D([0], [0], lw=4, color="black", label="Best Epsilon", linestyle="dashed"),
  ]
  if legend:
    ax.legend(handles=elements, frameon=False, handlelength=2.2, handletextpad=0.5, borderpad=0.2)


def compare_epsilons_across_years(metric: str):
  fig, axes = plt.subplots(nrows=2, ncols=1, figsize=(10, 6), sharex=True)
  _compare_epsilson_across_years("en-de", metric, axes[0], xlabel=False, ylabel=False)
  _compare_epsilson_across_years("zh-en", metric, axes[1], legend=False, ylabel=False)
  # Modify space between subplots
  fig.subplots_adjust(hspace=0.2)
  # plt.xlim(0, 0.4)

  fig.text(0.04, 0.5, 'Pairwise Accuracy', va='center', rotation='vertical')


  plt.show()


In [None]:
compare_epsilons_across_years("bleurt")

## Ties-F1 Experiment
This experiment analyzes what happens if you decompose the accuracy score into F1 scores instead.
It corresponds to Figure 6.
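
As a worked example of the decomposition, the sketch below computes the ties F1 and the correct-rank F1 from one set of made-up pair counts, using the same precision and recall definitions as the helper functions in this section. The `PairCounts` container is a hypothetical stand-in for the library's sufficient-statistics object.

```python
from collections import namedtuple

# Hypothetical container; ties_human / ties_metric count pairs tied only in
# the human scores / only in the metric scores.
PairCounts = namedtuple(
    "PairCounts", ["con", "dis", "ties_human", "ties_metric", "ties_both"]
)


def harmonic_mean(precision, recall):
    denom = precision + recall
    return 2 * precision * recall / denom if denom > 0 else 0.0


def ties_f1(ss):
    # Precision: of the pairs the metric calls tied, how many humans tie.
    p_den = ss.ties_both + ss.ties_metric
    # Recall: of the pairs humans tie, how many the metric also ties.
    r_den = ss.ties_both + ss.ties_human
    precision = ss.ties_both / p_den if p_den > 0 else 1.0
    recall = ss.ties_both / r_den if r_den > 0 else 1.0
    return harmonic_mean(precision, recall)


def correct_rank_f1(ss):
    # Precision: of the pairs the metric ranks, how many are concordant.
    p_den = ss.con + ss.dis + ss.ties_human
    # Recall: of the pairs humans rank, how many are concordant.
    r_den = ss.con + ss.dis + ss.ties_metric
    precision = ss.con / p_den if p_den > 0 else 1.0
    recall = ss.con / r_den if r_den > 0 else 1.0
    return harmonic_mean(precision, recall)


counts = PairCounts(con=10, dis=4, ties_human=2, ties_metric=1, ties_both=3)
toy_ties_f1 = ties_f1(counts)
toy_rank_f1 = correct_rank_f1(counts)
toy_accuracy = (counts.con + counts.ties_both) / sum(counts)
```

For these counts the accuracy is 13/20, the ties F1 is 2/3, and the correct-rank F1 is 20/31, which shows how a single accuracy value can hide different behavior on tie prediction and on ranking.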

In [None]:
def _ties_precision(ss) -> float:
  """Calculates the precision of metric tie predictions."""
  denom = ss.ties_both + ss.ties_metric
  return ss.ties_both / denom if denom > 0 else 1.0

def _ties_recall(ss) -> float:
  """Calculates the recall of human ties."""
  denom = ss.ties_both + ss.ties_human
  return ss.ties_both / denom if denom > 0 else 1.0

def _ties_f1(ss) -> float:
  precision = _ties_precision(ss)
  recall = _ties_recall(ss)
  denom = precision + recall
  return 2 * (precision * recall) / denom if denom > 0 else 0.0

def _correct_rank_precision(ss) -> float:
  denom = ss.con + ss.dis + ss.ties_human
  return ss.con / denom if denom > 0 else 1.0

def _correct_rank_recall(ss) -> float:
  denom = ss.con + ss.dis + ss.ties_metric
  return ss.con / denom if denom > 0 else 1.0

def _correct_rank_f1(ss) -> float:
  """Calculates the correct rank F1."""
  precision = _correct_rank_precision(ss)
  recall = _correct_rank_recall(ss)
  denom = precision + recall
  return 2 * (precision * recall) / denom if denom > 0 else 0.0


def plot_other_f1s(lp: str, metric: str):
  mqm_scores = get_metric_scores(eval_sets[lp], "mqm")
  metric_scores = get_metric_scores(eval_sets[lp], metric)

  X, Y = convert_to_matrices(mqm_scores, metric_scores)

  res = tau_optimization.tau_optimization(
      X.T, Y.T, tau_optimization.TauSufficientStats.acc_23
  )
  thresholds = []
  taus_subset = []
  ties_p_list, ties_r_list, ties_f1_list = [], [], []
  rank_p_list, rank_r_list, rank_f1_list = [], [], []
  for i in range(0, len(res.thresholds), 1000):
    threshold = res.thresholds[i]
    taus_subset.append(res.taus[i])
    thresholds.append(threshold)

    ties_p, ties_r, ties_f1 = [], [], []
    rank_p, rank_r, rank_f1 = [], [], []

    for x, y in zip(X.T, Y.T):
      con, dis, t_x, t_y, t_xy = mtme_stats._MatrixSufficientStatistics(x, y, threshold, None, None)
      ss = tau_optimization.TauSufficientStats(con, dis, t_y, t_x, t_xy)
      ties_p.append(_ties_precision(ss))
      ties_r.append(_ties_recall(ss))
      ties_f1.append(_ties_f1(ss))
      rank_p.append(_correct_rank_precision(ss))
      rank_r.append(_correct_rank_recall(ss))
      rank_f1.append(_correct_rank_f1(ss))

    ties_p_list.append(np.mean(ties_p))
    ties_r_list.append(np.mean(ties_r))
    ties_f1_list.append(np.mean(ties_f1))
    rank_p_list.append(np.mean(rank_p))
    rank_r_list.append(np.mean(rank_r))
    rank_f1_list.append(np.mean(rank_f1))

  fig, ax = plt.subplots(nrows=1, ncols=1, figsize=(10, 4.5))
  axes = [ax]
  plt.plot(thresholds, taus_subset, label="Accuracy", color="blue")
  plt.plot(thresholds, ties_f1_list, label="Ties F1", color="orange")
  plt.plot(thresholds, rank_f1_list, label="Correct Rank F1", color="green")
  plt.axvline(res.best_threshold, color="blue", linestyle="dashed", label="Best Epsilon")
  plt.xlabel("Epsilon")
  plt.xlim(-0.1, 2.0)


  handles, labels = axes[0].get_legend_handles_labels()

  # Add legend with the modified handles and labels. Disable the legend box,
  # which only takes up space and doesn't look good anyway.
  # Reshape the line symbol a bit.
  axes[0].legend(handles, labels, frameon=False, handlelength=2.2,
                 handletextpad=0.5, borderpad=1)

  plt.show()


In [None]:
plot_other_f1s("en-de", "COMET-22-refA")