Quality and Quantity of Machine Translation References for Automated Metrics [paper]

This is a repository for two papers: Quality and Quantity of Machine Translation References for Automated Metrics [paper] - effect of reference quality and quantity on automatic metric performance, and Evaluating Optimal Reference Translations [paper] - creation of the data and human aspects of annotation and translation.

The compiled dataset is available on Hugging Face. This repository contains the same data, together with the raw data and all processing scripts.

Quality and Quantity of Machine Translation References for Automated Metrics [paper]

Abstract: Automatic machine translation metrics often use human translations to determine the quality of system translations. Common wisdom in the field dictates that the human references should be of very high quality. However, there are no cost-benefit analyses that could be used to guide practitioners who plan to collect references for machine translation evaluation. We find that higher-quality references lead to better metric correlations with humans at the segment-level. Having up to 7 references per segment and taking their average helps. Interestingly, the references from vendors of different qualities can be mixed together and improve metric success. Higher quality references, however, cost more to create and we frame this as an optimization problem: given a specific budget, what types of references should be collected to maximize metric success. These findings can be used by evaluators of shared tasks when references need to be created under a certain budget.

Cite this paper as:

@misc{zouhar2024quality,
      title={Quality and Quantity of Machine Translation References for Automated Metrics}, 
      author={Vilém Zouhar and Ondřej Bojar},
      year={2024},
      eprint={2401.01283},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

Results

Higher-quality translations lead to better segment-level correlations. Very high-quality translations (R4, which come from translatologists) contain translation shifts and are not the best references. Using up to 7 references per segment helps.
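A minimal sketch of what "using several references and taking their average" can look like in practice. The metric here (`unigram_f1`) is a toy stand-in chosen so the example is self-contained; it is not one of the metrics evaluated in the paper, and both function names are hypothetical.

```python
from collections import Counter
from statistics import mean


def unigram_f1(hypothesis: str, reference: str) -> float:
    """Toy stand-in metric: F1 over lowercased whitespace tokens."""
    hyp = Counter(hypothesis.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((hyp & ref).values())  # clipped token matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(hyp.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def multi_reference_score(hypothesis: str, references: list[str]) -> float:
    """Score the hypothesis against each reference separately, then average."""
    return mean(unigram_f1(hypothesis, ref) for ref in references)


hypothesis = "Sony a Disney pracují na třetím filmu o Spider-Manovi"
references = [
    "Sony a Disney opět pracují na třetím filmu o Spider-Manovi",
    "Sony a Disney opět spolupracují na třetím filmu o Spider-Manovi",
]
score = multi_reference_score(hypothesis, references)
print(f"averaged score over {len(references)} references: {score:.3f}")
```

Any segment-level metric that takes one hypothesis and one reference could be swapped in for `unigram_f1` without changing the averaging step.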

A heuristic-based algorithm can select which references to invest in. It is controlled by a hyperparameter which balances between quality and quantity.
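To illustrate the idea of a budget-constrained selection controlled by one hyperparameter, here is a generic greedy sketch. It is not the algorithm from the paper: the `Vendor` class, the `select_references` function, the `alpha` parameter, and all quality and cost values are hypothetical and serve only to show how a single knob can trade quality against quantity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Vendor:
    name: str
    quality: float  # hypothetical quality estimate in [0, 1]
    cost: float     # hypothetical cost of one reference, arbitrary units


def select_references(vendors, budget, alpha):
    """Greedily buy references in priority order while the budget allows.

    alpha = 1.0 prioritizes quality only; alpha = 0.0 prioritizes low cost,
    which in effect maximizes the number of references bought.
    """
    max_cost = max(v.cost for v in vendors)

    def priority(v):
        return alpha * v.quality - (1 - alpha) * v.cost / max_cost

    chosen, remaining = [], budget
    for v in sorted(vendors, key=priority, reverse=True):
        if v.cost <= remaining:
            chosen.append(v.name)
            remaining -= v.cost
    return chosen


vendors = [
    Vendor("cheap", quality=0.5, cost=1.0),
    Vendor("standard", quality=0.7, cost=2.0),
    Vendor("premium", quality=0.9, cost=4.0),
]
quality_first = select_references(vendors, budget=4.0, alpha=1.0)
quantity_first = select_references(vendors, budget=4.0, alpha=0.0)
print(quality_first)   # one expensive reference
print(quantity_first)  # two cheaper references
```

With the same budget, the two extreme settings of `alpha` spend it on a single expensive reference or on two cheaper ones.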

Evaluating Optimal Reference Translations [paper]

Abstract: The overall translation quality reached by current machine translation (MT) systems for high-resourced language pairs is remarkably good. Standard methods of evaluation are not suitable nor intended to uncover the many translation errors and quality deficiencies that still persist. Furthermore, the quality of standard reference translations is commonly questioned and comparable quality levels have been reached by MT alone in several language pairs. Navigating further research in these high-resource settings is thus difficult. In this article, we propose a methodology for creating more reliable document-level human reference translations, called "optimal reference translations," with the simple aim to raise the bar of what should be deemed "human translation quality." We evaluate the obtained document-level optimal reference translations in comparison with "standard" ones, confirming a significant quality increase and also documenting the relationship between evaluation and translation editing.

This is a project at ETH Zürich and ÚFAL, Charles University. The paper is to be published in Natural Language Engineering (2024). For now, cite as:

@misc{zouhar2023evaluating,
      title={Evaluating Optimal Reference Translations}, 
      author={Vilém Zouhar and Věra Kloudová and Martin Popel and Ondřej Bojar},
      year={2023},
      eprint={2311.16787},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

The collected human evaluation data for English-to-Czech translation are in data/ort_human.json. The rest of this repository contains data preparation and evaluation code. Our data is based on WMT2020 data and can thus also be used, for example, to evaluate the quality of various translations as references. The data was created as follows:

P1, P2, and P3 are independent translations from English to Czech. N1 is an expert translation by a translatologist.
All the human translations are evaluated in detail at the document and segment level (see data/ort_human.json) by different types of human annotators (laypeople, translatology students, professional translators). If a translation is not perfect, the annotators provide a post-edited version to which they would assign the highest grade (6).

Note: If you also want to use the WMT2020 system submissions, please contact Vilém Zouhar. The code is here, just not pretty yet. 🙂

Example usage

# in Python
from datasets import load_dataset
data = load_dataset("zouharvi/optimal-reference-translations", 'ort_human')["train"]

# 220 annotated documents
len(data)

# 1760 annotated source lines
sum([len(doc["lines"]) for doc in data])

# 7040 annotated translations
sum([sum([len(line["translations"]) for line in doc["lines"]]) for doc in data])

# 11 annotators
len(set(doc["uid"] for doc in data))

import numpy as np
# Average document-level for N1: 5.865
np.average([doc["rating"]["4"]["overall"] for doc in data])

# Average document-level for P3: 4.810
np.average([doc["rating"]["3"]["overall"] for doc in data])

Results

It makes sense to have multiple rounds of translation post-editing.

Translatology students, professionals and laypeople perceive quality differently.
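One way to look at such differences is to group the document-level `overall` ratings by the annotator's `expertise` field. The sketch below assumes the record layout described under "Data structure"; the helper name `average_overall_by_expertise` is hypothetical, and `toy_docs` contains made-up values (including the expertise label "professional", which is an assumption) purely to keep the example self-contained.

```python
from collections import defaultdict
from statistics import mean


def average_overall_by_expertise(docs, translation_id):
    """Average document-level 'overall' rating of one translation per expertise group."""
    grouped = defaultdict(list)
    for doc in docs:
        rating = doc["rating"].get(translation_id)
        if rating is not None:  # skip documents where this translation was not rated
            grouped[doc["expertise"]].append(rating["overall"])
    return {expertise: mean(scores) for expertise, scores in grouped.items()}


# Toy records with made-up ratings, not taken from the dataset.
toy_docs = [
    {"expertise": "student", "rating": {"4": {"overall": 5.7}}},
    {"expertise": "student", "rating": {"4": {"overall": 5.3}}},
    {"expertise": "professional", "rating": {"4": {"overall": 6.0}}},
]
averages = average_overall_by_expertise(toy_docs, "4")  # "4" = N1
print(averages)
```

The same function should work on the list of documents loaded in "Example usage", provided each record exposes `expertise` and `rating` as shown there.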

Data structure

Beginning of data/ort_human.json:

[
    {
        "uid": "sahara",
        "expertise": "student",
        "doc": "huffingtonpost.com.19385",
        "time": 210.0,                             # self-reported in minutes
        "rating": {
            "2": {                                 # 2 = P2
                "spelling": 4.0,                   # ranges from 0 to 6
                "terminology": 5.5,
                "grammar": 5.5,
                "meaning": 5.0,
                "style": 4.5,
                "pragmatics": 6.0,
                "overall": 4.5
            },
            "4": {                                 # 4 = N1
                "spelling": 6.0,
                "terminology": 6.0,
                "grammar": 6.0,
                "meaning": 5.0,
                "style": 5.0,
                "pragmatics": 6.0,
                "overall": 5.7
            },
            "1": {                                 # 1 = P1
                "spelling": 6.0,
                "terminology": 5.9,
                "grammar": 5.4,
                "meaning": 4.7,
                "style": 4.6,
                "pragmatics": 5.8,
                "overall": 5.0
            },
            "3": {                                 # 3 = P3
                "spelling": 4.5,
                "terminology": 4.7,
                "grammar": 5.0,
                "meaning": 4.5,
                "style": 5.0,
                "pragmatics": 6.0,
                "overall": 4.6
            }
        },
        "lines": [
            {
                "source": "Sony, Disney Back To Work On Third Spider-Man Film",               # source sentence
                "comment": null,
                "translations": {
                    "2": {
                        "orig": "Sony a Disney opět pracují na třetím filmu o Spider-Manovi", # original translation
                        "done": "Sony a Disney pracují na třetím filmu o Spider-Manovi",      # post-edited translation
                        "rating": {
                            "spelling": 6.0,
                            "terminology": 6.0,
                            "grammar": 6.0,
                            "meaning": 5.0,
                            "style": 6.0,
                            "pragmatics": 6.0,
                            "overall": 5.0
                        }
                    },
                    "4": {
                        "orig": "Sony a Disney opět spolupracují na třetím filmu o Spider-Manovi",
                        "done": "Sony a Disney opět spolupracují na třetím filmu o Spider-Manovi",
                        "rating": {
                            "spelling": 6.0,
                            "terminology": 6.0,
                            "grammar": 6.0,
                            "meaning": 6.0,
                            "style": 6.0,
                            "pragmatics": 6.0,
                            "overall": 6.0
                        }
                    },
...
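As a sketch of how this structure can be traversed, the snippet below counts how many segment-level translations were changed during post-editing by comparing `orig` with `done`. The helper name `count_post_edited` is hypothetical; the sample record reuses only the two translations shown above and omits the other fields.

```python
def count_post_edited(docs):
    """Return (edited, total): translations whose post-edited text differs from the original."""
    total = 0
    edited = 0
    for doc in docs:
        for line in doc["lines"]:
            for translation in line["translations"].values():
                total += 1
                if translation["orig"] != translation["done"]:
                    edited += 1
    return edited, total


# Trimmed-down copy of the record shown above (ratings and metadata omitted).
sample = [
    {
        "lines": [
            {
                "source": "Sony, Disney Back To Work On Third Spider-Man Film",
                "translations": {
                    "2": {
                        "orig": "Sony a Disney opět pracují na třetím filmu o Spider-Manovi",
                        "done": "Sony a Disney pracují na třetím filmu o Spider-Manovi",
                    },
                    "4": {
                        "orig": "Sony a Disney opět spolupracují na třetím filmu o Spider-Manovi",
                        "done": "Sony a Disney opět spolupracují na třetím filmu o Spider-Manovi",
                    },
                },
            }
        ]
    }
]
edited, total = count_post_edited(sample)
print(f"{edited} of {total} translations were post-edited")
```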
