MT@BigScience

Evaluation results for Machine Translation within the BigScience project. Evaluation is carried out using the BigScience fork of lm-evaluation-harness coupled with the eval-hackathon branch of PromptSource. N.B. Updates to the latest versions are ongoing and will be available shortly.

Citation

This repository contains the code and outputs accompanying the paper "Investigating the Translation Performance of a Large Multilingual Language Model: the Case of BLOOM". Please cite the following:

@inproceedings{bawden-yvon-bloom-mt-2023,
    author = {Bawden, Rachel and Yvon, François},
    title = {Investigating the Translation Performance of a Large Multilingual
Language Model: the Case of {BLOOM}},
    booktitle = {Proceedings of the 24th Annual Conference of the European Association for Machine Translation},
    url = {https://arxiv.org/abs/2303.01911},
    year = {2023},
    note = {To appear}
}

Outputs and evaluation

Extract all predictions and evaluate

python scripts/process_results_{flores,diabla,wmt}.py
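
The braces denote alternatives; there is one processing script per dataset:

python scripts/process_results_flores.py
python scripts/process_results_diabla.py
python scripts/process_results_wmt.py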

This extracts all predictions from the .jsonl files into .tsv files and computes BLEU and COMET scores, which are written to the following folders:

  • outputs/{wmt14_hi_en,wmt14_fr_en,flores-101}/{0,1,2,5}-shot/{comet,bleu}-results.tsv for WMT and Flores-101
  • outputs/diabla/{0,1}-shot/{comet,bleu}-results.{English-French,French-English}.tsv for DiaBLa.
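
For reference, the BLEU side of the scoring can be reproduced with sacrebleu, roughly as in this minimal sketch (the repository's scripts handle the actual file parsing and may pass different options, e.g. the spm tokenizer used for spBLEU):

# Minimal sketch: corpus-level BLEU with sacrebleu (illustrative only;
# not the repository's exact invocation).
import sacrebleu

hypotheses = ["Le chat est assis sur le tapis."]  # system outputs, one per segment
references = [["Le chat est sur le tapis."]]      # one inner list per reference stream
bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(bleu.score)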

Three versions of each output are generated:

  1. The original outputs
  2. The outputs truncated at the first newline (newline-cut)
  3. The outputs truncated at the first newline or before the first repetition of the 'xglm' prompt (newline-cut-custom-truncate). This corresponds to the truncated outputs from the paper.
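
A minimal sketch of the two truncation strategies, assuming each output is a plain string (function names are illustrative, not the repository's actual code):

# Illustrative sketch of the two truncation strategies described above.
def newline_cut(output):
    """Keep only the text up to the first newline."""
    return output.split("\n", 1)[0]

def newline_cut_custom_truncate(output, prompt_marker):
    """As above, but also cut before the first repetition of the
    prompt text (e.g. the 'xglm' prompt string), whichever comes first."""
    truncated = newline_cut(output)
    idx = truncated.find(prompt_marker)
    return truncated[:idx] if idx != -1 else truncated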

Generate LaTeX tables:

python scripts/make-tables-{flores,diabla,wmt}.py

Results (BLEU scores)

Cross-dataset and model comparison (focus on English↔French and English↔Hindi)

WMT14 results (Original outputs)

Lang. dir  #shots  BLOOM  T0     mT0-xxl  OPT
en→fr      0       14.91   1.21    29.27  12.95
en→fr      1       27.83   1.41    25.24  21.92
fr→en      0       15.52  25.79    32.88  15.54
fr→en      1       34.61  21.01    30.03  24.55
en→hi      0        6.80   0.16    11.20   0.14
en→hi      1       13.62   0.12     9.50   0.08
hi→en      0       12.05   0.00    26.13   0.42
hi→en      1       25.04   0.01    20.15   0.58

DiaBLa results (Original outputs)

Lang. dir  #shots  BLOOM  T0     mT0-xxl  OPT
en→fr      0        0.88   0.52    28.44   0.53
en→fr      1        5.70   0.61    21.03  15.52
fr→en      0        0.85  25.51    34.96   0.83
fr→en      1       12.05  20.57    26.88  12.05

Flores-101 results (Original outputs)

Lang. dir  #shots  BLOOM  T0     mT0-xxl  OPT
en→fr      0        2.77   1.86    55.45   2.76
en→fr      1       44.99   2.13    53.53  24.36
fr→en      0        2.73  31.90    60.10   2.59
fr→en      1       45.59  24.86    58.22  16.74
en→hi      0        1.29   0.15    67.69   0.07
en→hi      1       27.25   0.06    54.66   0.12
hi→en      0        3.40   0.00    59.55   0.10
hi→en      1       35.06   0.19    57.32   0.45

WMT14 results (Truncated outputs)

Lang. dir  #shots  BLOOM  T0     mT0-xxl  OPT
en→fr      0       32.25   1.21    29.24  18.86
en→fr      1       36.29   1.41    25.19  22.31
fr→en      0       37.16  25.80    32.87  33.18
fr→en      1       38.18  21.07    29.95  33.25
en→hi      0       12.10   0.16    11.20   0.11
en→hi      1       15.73   0.12     9.50   0.08
hi→en      0       24.29   0.00    26.06   0.51
hi→en      1       25.04   0.01    20.06   0.61

DiaBLa results (Truncated outputs)

Lang. dir  #shots  BLOOM  T0     mT0-xxl  OPT
en→fr      0       24.23   0.52    28.44  17.42
en→fr      1       37.57   0.61    21.89  20.71
fr→en      0       22.94  25.51    34.92  36.80
fr→en      1       41.36  21.09    27.20  37.63

Flores-101 results (Truncated outputs)

Lang. dir  #shots  BLOOM  T0     mT0-xxl  OPT
en→fr      0       26.91   1.85    55.34  21.40
en→fr      1       49.32   2.13    53.40  28.41
fr→en      0       40.28  31.90    60.01  39.41
fr→en      1       47.24  25.20    58.24  39.82
en→hi      0        7.74   0.15    67.69   0.12
en→hi      1       29.52   0.06    54.66   0.12
hi→en      0       30.19   0.00    59.55   0.23
hi→en      1       35.06   0.19    57.27   0.50

Flores-101: High-resource, 1-shot

(Original outputs with no postprocessing)

Src↓ Trg→  Model  ar     en     es     fr     zh
ar         BLOOM  --     40.28  23.32  33.12  17.68
ar         M2M    --     25.50  16.74  25.69  13.10
en         BLOOM  28.21  --     29.42  44.99  26.69
en         M2M    17.92  --     25.57  41.99  19.33
es         BLOOM  18.76  32.70  --     24.80  20.92
es         M2M    12.11  25.09  --     29.33  14.86
fr         BLOOM  23.44  45.59  27.51  --     23.15
fr         M2M    15.36  37.17  25.60  --     17.61
zh         BLOOM  15.05  30.50  20.54  26.01  --
zh         M2M    11.55  20.91  16.92  24.32  --

Flores-101: High-mid resource, 1-shot

(Original outputs with no postprocessing)

Src↓ Trg→  Model  en     fr     hi     id     vi
en         BLOOM  --     44.99  27.25  39.00  28.54
en         M2M    --     41.99  28.15  37.26  35.10
fr         BLOOM  45.59  --     18.47  31.44  32.76
fr         M2M    37.17  --     22.91  29.14  30.26
hi         BLOOM  35.06  27.62  --     --     --
hi         M2M    27.89  25.88  --     --     --
id         BLOOM  43.25  30.35  --     --     --
id         M2M    33.74  30.81  --     --     --
vi         BLOOM  38.71  26.85  --     --     --
vi         M2M    29.51  25.82  --     --     --

Flores-101: Low-resource, 1-shot

(Original outputs with no postprocessing)

Src↓ Trg→  Model  bn     en     hi     sw     yo
en         BLOOM  24.65  --     27.25  20.51  2.60
en         M2M    23.04  --     28.15  26.95  2.17
bn         BLOOM  --     29.91  16.34  --     --
bn         M2M    --     22.86  21.76  --     --
hi         BLOOM  23.77  35.06  --     --     --
hi         M2M    21.77  27.89  --     --     --
sw         BLOOM  --     37.40  --     --     1.31
sw         M2M    --     30.43  --     --     1.29
yo         BLOOM  --      4.08  --     0.89   --
yo         M2M    --      4.18  --     1.93   --

Flores-101: Romance languages, 1-shot

(Original outputs with no postprocessing)

Src↓ Trg→  Model  ca     es     fr     gl     it     pt
ca         BLOOM  --     28.92  33.79  19.24  19.85  33.05
ca         M2M    --     25.17  35.08  33.42  25.50  35.17
es         BLOOM  31.16  --     24.80  23.28  16.49  29.11
es         M2M    23.12  --     29.33  27.54  23.87  28.10
fr         BLOOM  37.16  27.51  --     24.92  23.97  38.94
fr         M2M    28.74  25.60  --     32.82  28.56  37.84
gl         BLOOM  37.49  27.09  33.77  --     18.26  32.16
gl         M2M    30.07  27.65  37.06  --     26.87  34.81
it         BLOOM  31.00  25.40  31.36  20.16  --     29.15
it         M2M    25.20  29.23  34.39  29.23  --     31.47
pt         BLOOM  39.56  28.07  40.34  27.10  20.06  --
pt         M2M    30.69  26.88  40.17  33.77  28.09  --

Flores-101: Bengali→English MT, transfer using a 1-shot example from a different language direction

(Original outputs with no postprocessing)

1-shot example type        1-shot direction  spBLEU (orig.)  COMET (orig.)  spBLEU (trunc.)  COMET (trunc.)
Same                       bn→en              29.91           0.4440         29.91            0.4440
Opposite                   en→bn              21.81           0.3132         29.42            0.4143
Related source             hi→en              30.14           0.4492         30.54            0.4603
Related source (from WMT)  hi→en              29.06           0.4216         29.07            0.4274
HR unrelated source        fr→en              17.19           0.3147         29.68            0.3960
HR unrelated source        fr→ar               8.44          -0.1025         27.99            0.3218

DiaBLa context results (1-shot with differing source of context)

The 1-shot example can be:

  • from anywhere in the document (Origin=Rand.) or from the previous sentence (Origin=Prev.)
  • from any language direction (en→fr or fr→en) regardless of the direction of the current example (Direction=rand.), from the same direction as the current example (Direction=same), or from the opposite direction (Direction=opposite).
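
A rough sketch of these selection settings (all names are hypothetical; the actual sampling is done in the evaluation harness):

import random

# Hypothetical sketch of the 1-shot selection settings described above.
# `dialogue` is a list of utterances, each with an English and a French
# side, e.g. {"en": "...", "fr": "..."}; `current_dir` is ("en", "fr")
# or ("fr", "en").
def pick_one_shot(dialogue, current_idx, origin, direction, current_dir):
    if origin == "prev":
        example = dialogue[current_idx - 1]
    else:  # origin == "rand": any other utterance in the document
        candidates = dialogue[:current_idx] + dialogue[current_idx + 1:]
        example = random.choice(candidates)
    if direction == "same":
        src, trg = current_dir
    elif direction == "opposite":
        trg, src = current_dir
    else:  # direction == "rand": either direction
        src, trg = random.choice([current_dir, tuple(reversed(current_dir))])
    return example[src], example[trg]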
Origin  Direction  Truncated  en→fr BLEU  en→fr COMET  fr→en BLEU  fr→en COMET
Rand.   rand.      no          5.70       0.3421       12.05       0.6138
Rand.   rand.      yes        37.57       0.6343       41.36       0.7576
Prev.   rand.      no          6.10       0.3280       12.34       0.6166
Prev.   rand.      yes        38.51       0.6139       41.57       0.7513
Prev.   same       no         19.32       0.5965       20.71       0.7190
Prev.   same       yes        38.95       0.6325       42.10       0.7607
Prev.   opposite   no          3.64       0.0635        8.56       0.5184
Prev.   opposite   yes        37.76       0.5898       41.20       0.7423
