An analysis of linguistic similarity between Spanish locale variants in Mozilla translation files.
By Kekoa Riggin
Read the full report here
Presented at the #LocWorldWide43 conference as "How Similar Are Your Spanish Locales: A Data-driven Analysis"
Scripts for Mozilla's translation memories and for Microsoft's My Visual Studio Translation and UI Strings Glossaries are located in the `src` folder. To run the scripts on Microsoft data, replace the `moz` argument with `ms`.
- Put translation data in the `src` folder.
  - For Mozilla data: Get TMX files from https://transvision.mozfr.org/. Save files with the naming convention `mozilla_en-US_es-AR.tmx` in a folder called `moz_tmx`.
  - For Microsoft data: Get CSV files for Translation and UI Strings Glossaries from My Visual Studio downloads. Save all files to folders with the naming convention `es-mx_MS`.
- Run `python3 tojson_moz.py` or `python3 tojson_ms.py` to convert the translation files to JSON files (performs some data cleaning).
- Run `python3 perfect.py moz` to get perfect matches and create JSON files for non-perfect matches.
- Run `python3 diff_baseline.py moz` to get a baseline BLEU score (X to X Locale).
- Run `python3 max_distance.py moz` to get a baseline poor BLEU score (Source to Target).
- Run `python3 edit_distance.py moz` to get the BLEU score between locales (X to Y Locale).
- Run `python3 template_generate.py moz` to generate language templates.
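
The conversion and perfect-match steps can be pictured with a small sketch. This is not the code in `tojson_moz.py` or `perfect.py`; it is an illustration that assumes a standard TMX layout (`tu` units holding one `tuv` per language, each with a `seg`). The function names `tmx_to_pairs` and `split_perfect` and the inline sample data are hypothetical.

```python
import json
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def tmx_to_pairs(tmx_text, source_lang="en-US"):
    """Map each source segment to its translation in a TMX document."""
    root = ET.fromstring(tmx_text)
    pairs = {}
    for tu in root.iter("tu"):
        source, target = None, None
        for tuv in tu.iter("tuv"):
            seg = tuv.find("seg")
            text = "".join(seg.itertext()).strip() if seg is not None else ""
            if tuv.get(XML_LANG) == source_lang:
                source = text
            else:
                target = text
        # Skip units that lack either side (a very light form of cleaning).
        if source and target:
            pairs[source] = target
    return pairs


def split_perfect(locale_a, locale_b):
    """Split shared source strings into identical and differing translations."""
    shared = locale_a.keys() & locale_b.keys()
    perfect = {s: locale_a[s] for s in shared if locale_a[s] == locale_b[s]}
    differing = {
        s: (locale_a[s], locale_b[s]) for s in shared if locale_a[s] != locale_b[s]
    }
    return perfect, differing


# Hypothetical sample data standing in for two downloaded TMX files.
SAMPLE_AR = """<tmx version="1.4"><header/><body>
<tu><tuv xml:lang="en-US"><seg>Save file</seg></tuv><tuv xml:lang="es-AR"><seg>Guardar archivo</seg></tuv></tu>
<tu><tuv xml:lang="en-US"><seg>Your computer</seg></tuv><tuv xml:lang="es-AR"><seg>Tu computadora</seg></tuv></tu>
</body></tmx>"""

SAMPLE_ES = """<tmx version="1.4"><header/><body>
<tu><tuv xml:lang="en-US"><seg>Save file</seg></tuv><tuv xml:lang="es-ES"><seg>Guardar archivo</seg></tuv></tu>
<tu><tuv xml:lang="en-US"><seg>Your computer</seg></tuv><tuv xml:lang="es-ES"><seg>Tu ordenador</seg></tuv></tu>
</body></tmx>"""

perfect, differing = split_perfect(tmx_to_pairs(SAMPLE_AR), tmx_to_pairs(SAMPLE_ES))
print(json.dumps({"perfect": perfect, "differing": differing}, indent=2))
```

In this sketch only the differing pairs would be passed on to the BLEU steps, since identical translations need no further comparison.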
Requirements:
- Python 3.x
- NLTK Tokenize
- NLTK BLEU
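
As a rough illustration of how the two NLTK components fit together, the sketch below scores one locale's translation against another's for the same source string. It is not taken from the repository's scripts: the `locale_bleu` helper is hypothetical, and the choice of `wordpunct_tokenize`, lowercasing, and `SmoothingFunction().method1` are assumptions made so the example runs without downloading extra NLTK data.

```python
from nltk.tokenize import wordpunct_tokenize
from nltk.translate.bleu_score import SmoothingFunction, sentence_bleu


def locale_bleu(reference, hypothesis):
    """Sentence-level BLEU of one locale's string against another's."""
    ref_tokens = wordpunct_tokenize(reference.lower())
    hyp_tokens = wordpunct_tokenize(hypothesis.lower())
    # Smoothing avoids a zero score when a higher-order n-gram has no match.
    smoothing = SmoothingFunction().method1
    return sentence_bleu([ref_tokens], hyp_tokens, smoothing_function=smoothing)


# X to X: the same string compared with itself gives the upper baseline.
same = locale_bleu(
    "Guardar el archivo en tu computadora",
    "Guardar el archivo en tu computadora",
)

# X to Y: two locales that differ in a single word score lower.
different = locale_bleu(
    "Guardar el archivo en tu computadora",
    "Guardar el archivo en tu ordenador",
)

print(f"X to X: {same:.3f}, X to Y: {different:.3f}")
```

Comparing a string with itself mirrors the X to X baseline, while comparing an English source with its Spanish target would play the role of the poor baseline.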