rbawden/DiaBLa-dataset

DiaBLa English-French MT dialogue dataset

(Dialogue BiLingue "Bilingual Dialogue")

English-French dataset for the evaluation of Machine Translation (MT) for informal, written bilingual dialogue.

The test set contains 5,700+ sentences from 144 spontaneous, written dialogues between English and French speakers. The dialogues are mediated by one of two neural MT systems (a baseline RNN and a lightly contextual RNN model that uses the previous sentence). Each dialogue is associated with one of twelve varied scenarios, which are listed in meta_info/scenarios.txt. In addition, the participants evaluated in real time the quality of the MT systems from a monolingual point of view. Once collected, the dialogues were anonymised, and manually normalised versions of sentences were produced where necessary. Importantly, reference translations were produced for all sentences.

Dialogue annotations:

  • fine-grained sentence-level judgments of MT quality, produced by the dialogue participants themselves
  • manually produced reference translations
  • manually normalised versions of source sentences

👉 Visit the website here

Use as a test set

Raw source and reference files in DiaBLa-corpus/raw-corpus:

  • Source files: diabla.en2fr_orig and diabla.fr2en_orig.
  • Reference files: diabla.en2fr_ref and diabla.fr2en_ref.
  • Info file: diabla.info

Each source file contains the entire dialogue from the point of view of the speaker of the source language: their own original sentences plus the machine-translated versions of the other speaker's utterances. This matters for contextual translation, since it gives access to all the context that was present when the source utterances were produced.

The reference files contain only those sentences that should be evaluated (i.e. the sentences that were originally in the source language). Once you have translated the entire source file, filter your translations as follows:

bash scripts/filter-sents-for-eval.sh <YOUR_TRANSLATION_FILE> DiaBLa-corpus/raw-corpus/diabla.{en2fr,fr2en}.eval-filter > OUT

Then evaluate the filtered sentences against the reference translations using your favourite metric.
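
Before scoring, it can help to confirm that the filtered output and the reference file line up one-to-one. The following is a minimal Python sketch, assuming both files are plain UTF-8 text with one sentence per line. The function name load_pairs is illustrative and not part of the repository.

```python
from pathlib import Path


def load_pairs(hyp_path, ref_path):
    """Return (hypothesis, reference) pairs, one pair per line.

    Assumes one sentence per line in both files. A mismatch in line
    counts raises an error, since the two files could not be aligned.
    """
    hyps = Path(hyp_path).read_text(encoding="utf-8").splitlines()
    refs = Path(ref_path).read_text(encoding="utf-8").splitlines()
    if len(hyps) != len(refs):
        raise ValueError(
            f"{len(hyps)} hypotheses but {len(refs)} references"
        )
    return list(zip(hyps, refs))
```

For example, load_pairs("OUT", "DiaBLa-corpus/raw-corpus/diabla.en2fr_ref") would pair the filtered English-to-French translations with their references, ready to pass to a metric of your choice.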

Citation

If you use this corpus, please cite:

@article{bawden_DiaBLa:-A-Corpus-of_2021,
  author = {Bawden, Rachel and Bilinski, Eric and Lavergne, Thomas and Rosset, Sophie},
  doi = {10.1007/s10579-020-09514-4},
  title = {DiaBLa: A Corpus of Bilingual Spontaneous Written Dialogues for Machine Translation},
  year = {2021},
  journal = {Language Resources and Evaluation},
  publisher = {Springer Verlag},
  volume = {55},
  pages = {635--660},
  url = {https://hal.inria.fr/hal-03021633},
  pdf = {https://hal.inria.fr/hal-03021633/file/diabla-lre-personal-formatting.pdf},
}

Licence

The dataset is distributed under a CC BY-SA 4.0 licence.

Structure and content

.json formatted corpus (containing all annotations)

The corpus also exists in .json format, containing all annotations and information. All dialogue files are found in dialogues/ and information for each user is found in users/.

Each dialogue file has the following dialogue-level information:

 "start_time": <DATETIME>,
 "scenario": [
            [SCENARIO DESCRIPTION (IN ENGLISH), SCENARIO DESCRIPTION (IN FRENCH)],
            [ROLE 1 (IN ENGLISH), ROLE 1 (IN FRENCH)],
            [ROLE 2 (IN ENGLISH), ROLE 2 (IN FRENCH)]
           ],
 "user1": {
    "idnum": <ID>,
    "gender": "male/female",
    "age": "18-24|25-34|35-44|55-64|65-74",
    "turn_number": "1|2",
    "role": [TEXT DESCRIPTION OF ROLE (IN ENGLISH), TEXT DESCRIPTION OF ROLE (IN FRENCH)],
    "lang": "french" or "english"
    },
  "user2": {
    "turn_number": "1|2",
    "role": [TEXT DESCRIPTION OF ROLE (IN ENGLISH), TEXT DESCRIPTION OF ROLE (IN FRENCH)],
    "lang": "french" or "english"
    },
 "translation_model": "baseline" or "contextual",
 "utterances": {
    "0": {...},
    "1": {...},
     ...
    }
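
As an illustration of reading these dialogue-level fields, here is a minimal Python sketch. It assumes each dialogue file is valid JSON with a single top-level object holding the keys listed above. The function name summarise_dialogue is hypothetical.

```python
import json


def summarise_dialogue(path):
    """Return a few dialogue-level fields from one dialogue file."""
    with open(path, encoding="utf-8") as f:
        dialogue = json.load(f)
    # The first entry of "scenario" is the [English, French] description pair.
    scenario_en, scenario_fr = dialogue["scenario"][0]
    return {
        "model": dialogue["translation_model"],
        "scenario_en": scenario_en,
        "scenario_fr": scenario_fr,
        "n_utterances": len(dialogue["utterances"]),
    }
```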

Each utterance in the dialogue (ids start at 0) is structured as follows:

id : {
  "language": "english/french",
  "original_text": <ORIGINAL TEXT>,
  "normalised_version": <NORMALISED TEXT (IF NECESSARY)>,
  "preprocessed_text": <PREPROCESSED TEXT>,
  "translated_text": <TRANSLATED TEXT>,
  "postprocessed_text": <TRANSLATED, POSTPROCESSED TEXT>,
  "reference_translation": <HUMAN TRANSLATION>,
  "composition_time": <DATETIME>,
  "preprocessing_begin": <DATETIME>,
  "preprocessing_end": <DATETIME>,
  "translation_begin": <DATETIME>,
  "translation_end": <DATETIME>,
  "postprocessing_begin": <DATETIME>,
  "postprocessing_end": <DATETIME>,
  "sent_time": <DATETIME (TRANSLATION SENT)>,
  "eval": {
     "judgment": "poor/medium/perfect",
     "judgment history": [
       ["poor/medium/perfect", <DATETIME>], [...]
     ],
     "verbatim": <FREE EVALUATION COMMENT>,
     "verbatim_history": [
       [<VERBATIM>, <DATETIME>], [...]
     ], 
     "problems": ["word choice", "grammaticality", "coherence", "style"], # list of problems present (if any)
     "problem_history": [
         [<PROBLEM>, <DATETIME>, true/false], [...]
      ]
  }
}
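
A minimal Python sketch of walking the utterances in dialogue order follows. It assumes the dialogue has already been loaded into a dict with the structure above and that utterance ids are the string keys "0", "1", and so on, which is why they are sorted numerically (a plain string sort would place "10" before "2"). The function name evaluation_rows is hypothetical, and treating a missing "eval" block as "no judgment" is an assumption.

```python
def evaluation_rows(dialogue):
    """Yield (source, MT output, reference, judgment) in dialogue order."""
    utterances = dialogue["utterances"]
    for key in sorted(utterances, key=int):  # numeric order of string ids
        utt = utterances[key]
        yield (
            utt["original_text"],
            utt["translated_text"],
            utt["reference_translation"],
            # None when no judgment was recorded (assumed possible)
            utt.get("eval", {}).get("judgment"),
        )
```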

User files are structured as follows:

{
   "idnum": 1,
   "age": "18-24|25-34|35-44|55-64|65-74",
   "gender": "male|female",
   "english_ability": "poor|medium|good|near-native|native",
   "french_ability": "poor|medium|good|near-native|native",
   "otherlangs": <LIST OF OTHER LANGUAGES SPOKEN>,
   "worked_in_research": true|false,
   "worked_in_NLP": true|false,
   "agreed_to_terms_and_conditions": true,
   "creation_date": <DATETIME>
}
}
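
To link a dialogue's "user1" idnum back to a user profile, the user files can be indexed by "idnum". This is a minimal Python sketch, assuming the files in users/ carry a .json extension and each holds one object shaped as above. Neither the extension nor the function name load_users is confirmed by the repository.

```python
import json
from pathlib import Path


def load_users(users_dir):
    """Return a dict mapping each user's idnum to their profile."""
    users = {}
    # Assumption: one JSON object per file, files named *.json
    for path in sorted(Path(users_dir).glob("*.json")):
        with open(path, encoding="utf-8") as f:
            user = json.load(f)
        users[user["idnum"]] = user
    return users
```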

N.B. Changes to the evaluation of the machine translated sentences are logged in the history fields: they record when each sentence was evaluated and whether the participants changed their mind.
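
The logged history can be used to spot revised judgments. The sketch below is a minimal Python example, assuming each history entry is a [judgment, datetime] pair as in the utterance structure above. The default key "judgment history" is copied from this README as printed and may differ in the actual files, so it is left as a parameter. The function name changed_mind is hypothetical.

```python
def changed_mind(eval_block, history_key="judgment history"):
    """Return True if the history holds more than one distinct judgment."""
    history = eval_block.get(history_key, [])
    # Each entry is assumed to be [judgment, datetime]
    distinct = {entry[0] for entry in history}
    return len(distinct) > 1
```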
