COCO-SM

This repository contains the scripts and data for evaluating the multilingual capabilities of Vision-Language models in 20 languages.

COCO-SM (COCO Synthetic Multilingual Evaluation) includes translations of the COCO Karpathy test split into 20 languages using 3 different neural translation systems: Google Translate, Microsoft Bing Translator and NLLB.

Data Description

The original data is the COCO Karpathy test split. We excluded 7 samples because NLLB could not translate some of their captions. The full list of excluded files is in meta/excluded.txt.

We made three different versions of the translation:

  1. Google Translate – meta/google.json
  2. Microsoft Bing Translator – meta/bing.json
  3. NLLB – meta/nllb.json
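
As a minimal sketch of how these files can be loaded (the exact JSON schema is not documented in this README, so only the raw load is shown):

import json

# Load one of the translation meta files; the internal schema is whatever
# eval.py expects and is not documented here.
with open('meta/google.json', 'r', encoding='utf-8') as f:
    google_meta = json.load(f)

# List of samples excluded because NLLB could not translate their captions.
with open('meta/excluded.txt', 'r', encoding='utf-8') as f:
    excluded = [line.strip() for line in f if line.strip()]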

Supported Languages

Language Speakers Language Speakers
Arabic (ar) 274 M Korean (ko) 81 M
Armenian (hy) 4 M Persian (fa) 77 M
Chinese (zh) 1'118 M Polish (pl) 41 M
French (fr) 274 M Portuguese (pt) 257 M
German (de) 134 M Russian (ru) 258 M
Hebrew (he) 9 M Spanish (es) 548 M
Hindi (hi) 602 M Thai (th) 61 M
Indonesian (id) 199 M Turkish (tr) 88 M
Italian (it) 67 M Ukrainian (uk) 41 M
Japanese (ja) 125 M Vietnamese (vi) 85 M

Methodology and Metrics

Since any neural translation system introduces noise into the evaluation due to its imperfection and the inherent ambiguity of the translation task, three translations produced by three different NT systems are used for metric calculation to obtain a more robust estimate of model performance.

COCO-SM is primarily intended for text-image retrieval evaluation. We use the following metrics:

  • Recall@(1 / 5 / 10): the percentage of queries whose relevant result appears among the top-1/5/10 retrieved candidates. It reflects the retrieval performance of the model in a particular language. Recall is in the [0, 1] range; the higher, the better.

  • NDCG@20 (Normalized Discounted Cumulative Gain at 20): a measure of the similarity between two rankings. It compares how the model ranks the top-20 candidates retrieved by an English query with how it ranks the candidates retrieved by the same query in Language X. NDCG is in the [0, 1] range; the higher, the better.

A high NDCG on Language X indicates that the model behaves consistently on English and on Language X. In that case, we can assess model performance on Language X via the English metrics, which gives a robust estimate because the English data is not synthetic.
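
To make the metrics concrete, here is a minimal, hedged sketch of how Recall@K and the cross-lingual NDCG@20 could be computed. The relevance assignment in ndcg_at_20 (English top-20 positions mapped to decreasing relevance scores) is an assumption for illustration; eval.py may define it differently.

import numpy as np

def recall_at_k(ranked_ids, relevant_ids, k):
    # Fraction of queries whose ground-truth item appears in the top-k candidates.
    hits = [1.0 if relevant_ids[q] in ranked_ids[q][:k] else 0.0
            for q in range(len(ranked_ids))]
    return float(np.mean(hits))

def ndcg_at_20(english_ranking, target_ranking, k=20):
    # Assumed relevance: the item ranked i-th by the English query gets
    # relevance k - i; items outside the English top-k get relevance 0.
    relevance = {item: k - pos for pos, item in enumerate(english_ranking[:k])}
    dcg = sum(relevance.get(item, 0) / np.log2(pos + 2)
              for pos, item in enumerate(target_ranking[:k]))
    ideal_dcg = sum((k - pos) / np.log2(pos + 2) for pos in range(k))
    return dcg / ideal_dcg

# Identical English and Language-X rankings give NDCG = 1.0.
ranking = list(range(20))
print(ndcg_at_20(ranking, ranking))  # 1.0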

Usage

First of all, a Python file that defines the image and text encoding functions and instantiates the model is needed.

Consider the example for UForm models, located in modules/uform.py:

import uform

def image_forward_fn(model, images, device, transform):
    # Preprocess a batch of PIL images and return their embeddings.
    images = model.preprocess_image(images).to(device)
    return model.encode_image(images)

def text_forward_fn(model, texts, device, transform):
    # Tokenize a batch of caption strings and return their embeddings.
    texts = model.preprocess_text(texts)
    texts = {k: v.to(device) for k, v in texts.items()}
    return model.encode_text(texts)

embedding_dim = 256
image_preprocess = None  # UForm handles preprocessing inside the forward functions
text_preprocess = None

model = uform.get_model(
    'unum-cloud/uform-vl-multilingual-v2'
)

  • image_forward_fn is the function which will be called to get image embeddings
  • text_forward_fn is the function which will be called to get text embeddings

Both functions accept the following arguments: model – an instance of your model; images/texts – a list of PIL Image instances / a list of strings; device – the id of the device on which evaluation will run; transform – an optional image/text transform.

  • embedding_dim – dimension of image/text embeddings
  • image_preprocess – optional transform for image preprocessing
  • text_preprocess – optional transform for text preprocessing
  • model – model that will be evaluated

An example of an OpenCLIP model file is also available in the modules directory.
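
For reference, here is a hedged sketch of what such a module could look like, following the same interface as modules/uform.py. The open_clip checkpoint name ('xlm-roberta-base-ViT-B-32' with 'laion5b_s13b_b90k' weights) and the embedding dimension are assumptions; the actual file in the modules directory may differ.

import open_clip
import torch

def image_forward_fn(model, images, device, transform):
    # `transform` is the torchvision preprocessing pipeline created below.
    batch = torch.stack([transform(image) for image in images]).to(device)
    return model.encode_image(batch)

def text_forward_fn(model, texts, device, transform):
    # `transform` is the open_clip tokenizer created below.
    tokens = transform(texts).to(device)
    return model.encode_text(tokens)

embedding_dim = 512  # assumed projection size of the ViT-B/32 checkpoint

model, _, image_preprocess = open_clip.create_model_and_transforms(
    'xlm-roberta-base-ViT-B-32', pretrained='laion5b_s13b_b90k'
)
text_preprocess = open_clip.get_tokenizer('xlm-roberta-base-ViT-B-32')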

After the model file is ready, place it in the modules directory. Finally, evaluation can be run via:

cd coco-sm

python eval.py \
    --model_name {model file name without .py} \
    --image_dir_path {path to the directory with COCO val images} \
    --meta_files_paths {paths to meta files with translations, located in the meta directory} \
    --batch_size {size of batch} \
    --device {device id} \
    --report_name {name of the file with the report, will be located in the reports directory}

For instance, to evaluate the UForm model on all translations, execute this command:

python3 eval.py --model_name 'uform' --image_dir_path 'val2014' --meta_files_paths 'meta/google.json' 'meta/bing.json' 'meta/nllb.json' --batch_size 512 --device 'cuda:0' --report_name 'uform-multilingual-v2'

The evaluation script produces two files:

  • reports/{report_name}.csv with metrics for all translations
  • reports/{report_name}_reduced.csv with averaged metrics across translations/languages

The evaluation results for UForm and OpenCLIP models can be found in the reports directory.
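
The reports can be inspected with any CSV tool; below is a minimal sketch using pandas, assuming the report name from the command above (the exact column layout is produced by eval.py and is not documented here).

import pandas as pd

# Per-translation metrics and the averaged summary produced by eval.py.
full = pd.read_csv('reports/uform-multilingual-v2.csv')
reduced = pd.read_csv('reports/uform-multilingual-v2_reduced.csv')
print(reduced.head())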

Results

Recall

Target Language OpenCLIP @ 1 UForm @ 1 OpenCLIP @ 5 UForm @ 5 OpenCLIP @ 10 UForm @ 10 Speakers
Arabic 22.7 31.7 44.9 57.8 55.8 69.2 274 M
Armenian 5.6 22.0 14.3 44.7 20.2 56.0 4 M
Chinese 27.3 32.2 51.3 59.0 62.1 70.5 1'118 M
English 37.8 37.7 63.5 65.0 73.5 75.9 1'452 M
French 31.3 35.4 56.5 62.6 67.4 73.3 274 M
German 31.7 35.1 56.9 62.2 67.4 73.3 134 M
Hebrew 23.7 26.7 46.3 51.8 57.0 63.5 9 M
Hindi 20.7 31.3 42.5 57.9 53.7 69.6 602 M
Indonesian 26.9 30.7 51.4 57.0 62.7 68.6 199 M
Italian 31.3 34.9 56.7 62.1 67.1 73.1 67 M
Japanese 27.4 32.6 51.5 59.2 62.6 70.6 125 M
Korean 24.4 31.5 48.1 57.8 59.2 69.2 81 M
Persian 24.0 28.8 47.0 54.6 57.8 66.2 77 M
Polish 29.2 33.6 53.9 60.1 64.7 71.3 41 M
Portuguese 31.6 32.7 57.1 59.6 67.9 71.0 257 M
Russian 29.9 33.9 54.8 60.9 65.8 72.0 258 M
Spanish 32.6 35.6 58.0 62.8 68.8 73.7 548 M
Thai 21.5 28.7 43.0 54.6 53.7 66.0 61 M
Turkish 25.5 33.0 49.1 59.6 60.3 70.8 88 M
Ukrainian 26.0 30.6 49.9 56.7 60.9 68.1 41 M
Vietnamese 25.4 28.3 49.2 53.9 60.3 65.5 85 M
Mean 26.5±6.4 31.8±3.5 49.8±9.8 58.1±4.5 60.4±10.6 69.4±4.3 -
Google Translate 27.4±6.3 31.5±3.5 51.1±9.5 57.8±4.4 61.7±10.3 69.1±4.3 -
Microsoft Translator 27.2±6.4 31.4±3.6 50.8±9.8 57.7±4.7 61.4±10.6 68.9±4.6 -
Meta NLLB 24.9±6.7 32.4±3.5 47.5±10.3 58.9±4.5 58.2±11.2 70.2±4.3 -

NDCG@20

Target Language OpenCLIP NDCG@20 UForm NDCG@20
Arabic 0.639 0.868
Armenian 0.204 0.691
Chinese 0.731 0.880
French 0.823 0.932
German 0.806 0.927
Hebrew 0.657 0.791
Hindi 0.616 0.879
Indonesian 0.733 0.870
Italian 0.811 0.930
Japanese 0.737 0.885
Korean 0.686 0.869
Persian 0.667 0.831
Polish 0.764 0.897
Portuguese 0.832 0.897
Russian 0.777 0.906
Spanish 0.849 0.939
Thai 0.606 0.822
Turkish 0.701 0.898
Ukrainian 0.704 0.851
Vietnamese 0.697 0.818
Mean (all) 0.716 ± 0.149 0.875 ± 0.064
Mean (Google Translate) 0.732 ± 0.145 0.869 ± 0.063
Mean (Microsoft Translator) 0.730 ± 0.149 0.869 ± 0.066
Mean (NLLB) 0.686 ± 0.158 0.888 ± 0.064