Add cross-language translation data #23

Closed · 2 tasks done
andrewtavis opened this issue Nov 5, 2022 · 15 comments
Assignees: andrewtavis
Labels: -priority- High priority · data (Relates to data or Wikidata) · help wanted (Extra attention is needed)

Comments

@andrewtavis
Member

andrewtavis commented Nov 5, 2022

Terms

Languages: All languages

Description

This issue is the interim step towards adding full translation support to Scribe apps. It entails finding sufficiently accurate models on 🤗 Hugging Face to translate between all currently supported languages and English. The format_translations.py script for each language will then need to be edited to run each model over a basic corpus, generating seven different translations.json files per language.

From there, an option for which base language to translate from can be added to Scribe-iOS' menu, which will be developed in scribe-org/Scribe-iOS#16.
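
For concreteness, one of those seven files could hold a key-value structure along these lines; the schema is an assumption for illustration, since only the file count and purpose are specified here:

```python
# Hypothetical contents of a German keyboard's translations.json for the
# source language French (file layout and schema assumed, not confirmed):
translations_from_french = {
    "bonjour": "hallo",
    "merci": "danke",
    "maison": "Haus",  # German nouns keep their capitalization
}
```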

andrewtavis added the help wanted (Extra attention is needed) and data (Relates to data or Wikidata) labels Nov 5, 2022
@andrewtavis
Member Author

andrewtavis commented Nov 5, 2022

Included in this issue:

  • Finding viable translation models for each direction of each language pair
  • Updating the format_translations.py files to generate one file for each input language
  • Exploring running the translation models on Google Colab or another platform that gives us GPU access in order to speed up the model runs
  • Updating the readmes with the ranges of the least and most translations available
  • Editing the data_table.txt output to report average translations

@andrewtavis
Member Author

andrewtavis commented Nov 5, 2022

We should only use models that can translate both ways. The following are the language models from 🤗 Hugging Face that we can use to generate the translations (checked when implemented):

@abhijeet78880

Hey 👋🏻
@andrewtavis, I have gone through the issue and done some research regarding this, but could you please explain more about it?

@andrewtavis
Member Author

Hey @abhijeet78880 👋 Yes, I definitely expected that more information would be needed. This is a tough one, but as I said, it just needs some persistence 😊

Will write more later today! :)

@andrewtavis
Member Author

andrewtavis commented Apr 14, 2023

Ok, sooooo :) There's plenty to do here, with the first part of it being some research you/we could do. This comment is an ongoing list of all machine translation models available from 🤗 Hugging Face that we're using (I just edited it to add links to the models). As of now we only translate from English to the language that a user is typing in, but we want to expand this so that the user can translate from any of Scribe's supported languages to any other. This will be an option in the new menu we'll build in this iOS issue, with the designs for that being found here.

Once we have a translation model, we make a file like this one that translates from English to German. In this file we load a JSON of words queried from Wikidata in the source language (for now only English), make a list of words from the JSON, set the model variable to one of the models documented in the comment I mentioned above, and then loop through the words, translating each one and adding them all into a new JSON. This JSON is then used in Scribe apps to provide translations 😊 It's not perfect, but it's what we're doing for now 🙃
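
As a rough illustration of that flow, here's a minimal sketch using the 🤗 Transformers translation pipeline; the file names are hypothetical, and the real script in the repo may differ:

```python
import json

from transformers import pipeline

# Hypothetical file names; the actual script names these differently.
SOURCE_WORDS_PATH = "words_to_translate.json"
OUTPUT_PATH = "translations.json"

# English to German, mirroring the file linked above.
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-en-de")

with open(SOURCE_WORDS_PATH, encoding="utf-8") as f:
    words = json.load(f)  # assumed to be a list of source-language words

translations = {}
for word in words:
    # Each pipeline call returns a list of dicts with a "translation_text" key.
    translations[word] = translator(word)[0]["translation_text"]

with open(OUTPUT_PATH, "w", encoding="utf-8") as f:
    json.dump(translations, f, ensure_ascii=False, indent=2)
```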

What would be great is if you/we could look for the missing models that we need for the new translations. A lot of these can just be Helsinki-NLP/opus-mt models, a giant collection of machine translation (hence mt) models from researchers in Helsinki. As we're using Helsinki-NLP/opus-mt-en-fr for English to French, we can use Helsinki-NLP/opus-mt-fr-en for French to English 🚀
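
Given that naming symmetry, the reverse model name for any pair is mechanical to build. A small sketch, assuming ISO 639-1 codes; note that not every direction actually exists on the Hub, so each generated name still needs to be checked:

```python
def opus_mt_model_names(lang_a: str, lang_b: str) -> tuple[str, str]:
    """Build the Helsinki-NLP model names for both directions of a pair,
    e.g. ("Helsinki-NLP/opus-mt-en-fr", "Helsinki-NLP/opus-mt-fr-en").
    Not every pair is published, so availability must be verified."""
    return (
        f"Helsinki-NLP/opus-mt-{lang_a}-{lang_b}",
        f"Helsinki-NLP/opus-mt-{lang_b}-{lang_a}",
    )

print(opus_mt_model_names("en", "fr"))
```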

At this point I think it's best to check in with you and see how the above sounds. If you want to contribute in a simple way at first, going through and finding links for the models we'd need would be best. I could then give a more in-depth explanation of the translation file from before and show you how to set up new ones for each of the models we find. We'd then run them, and bam, we'd have translation data that we could reference, giving users the option to translate from Spanish to German and every other combination 😊

This has been a lot! Again, just let me know how it sounds and feel free to ask questions. Everything we write here will help make things easier going forward :) :)

@andrewtavis
Member Author

@abhijeet78880, FYI I also made #35 just now which might be a nice first issue for you 😊 That's checking if the word Scribe is in the database for a given language and adding it if not :)

@andrewtavis
Member Author

Note that I've updated the models comment with further models we could use from Helsinki-NLP. There do appear to be some holes in their translation model coverage, so for some pairs we'll need to look harder for other models.

At this point we'd be ready to start copying over some translation files 😊 For this, the script from any Helsinki-NLP translation file can be used; only the model name needs to be changed :)

@andrewtavis
Member Author

The following two models could help us plug some of the holes in the above translation coverage:

@andrewtavis
Member Author

andrewtavis commented Jun 21, 2023

Neither of the above models was what we were looking for. After playing around with T5 a bit more and getting some very subpar German-Portuguese translations, I was able to get some strong results on a dummy JSON dataset using m2m100_418M. I'd say the small model is enough for our purposes, as the single-word or short-phrase translation we're doing isn't going to be improved by a larger model (or only marginally, since larger models take advantage of contextual information that our short input strings lack).

m2m100_418M should be able to handle all the missing language pairs for Scribe keyboard languages. A general thought would be to create a data pipeline that uses it as the sole model and just switches the input and output languages, as well as the input data, during the run. Another thing to factor in is that the outputs I was getting were capitalized, as I assume the model expects and returns a sentence. This can be remedied with the metadata that comes from Wikidata, though: we'll be querying a base translation corpus that includes word type, so we'd know whether a word is a proper noun that needs to be capitalized (or any noun in German), or whether to just lower-case it.
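
A minimal sketch of that single-model idea, using the M2M100 classes from 🤗 Transformers; the lower-casing rule at the end is an illustrative assumption rather than settled behavior:

```python
from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

model = M2M100ForConditionalGeneration.from_pretrained("facebook/m2m100_418M")
tokenizer = M2M100Tokenizer.from_pretrained("facebook/m2m100_418M")


def translate(word: str, src: str, tgt: str, is_proper_noun: bool = False) -> str:
    """Translate a single word between any two supported languages by
    switching the source and forced target language on the one model."""
    tokenizer.src_lang = src
    encoded = tokenizer(word, return_tensors="pt")
    generated = model.generate(
        **encoded, forced_bos_token_id=tokenizer.get_lang_id(tgt)
    )
    translation = tokenizer.batch_decode(generated, skip_special_tokens=True)[0]

    # The model tends to capitalize outputs as if they were sentences, so
    # lower-case unless word-type metadata from Wikidata says otherwise
    # (illustrative rule; German nouns etc. would need their own handling).
    return translation if is_proper_noun else translation.lower()


print(translate("Haus", src="de", tgt="pt"))
```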

Will continue to fine-tune the current example and then present the results at the next Scribe Weekly 😊

andrewtavis self-assigned this Jul 17, 2023
@nyfz18

nyfz18 commented Aug 22, 2023

Hi! Sorry for the delay -- I had to figure out a bunch of stuff on my end. Where should I start?

@andrewtavis
Member Author

andrewtavis commented Aug 26, 2023

No stress on a delay, @nyfz18! Sorry for mine as well :) Let me organize some stuff and I'll send along some pseudocode for how this would be written, as I said I would 😊 Generally the steps would be (see the sketch after this list):

  • I'll write some SPARQL scripts to get words from each language
    • We'll call those scripts to update the data each time before the translations are done
  • I'll write a base script so that we can pass a language to the process like 'English'
    • This allows us to test it more easily
  • We then add the following steps:
    • For any language that we're passing, or all languages if we haven't passed one:
      • Run the script to update the data for this language (the source language)
      • Load in the appropriate translation model
      • Translate from the source language to all the other languages
      • As we translate, update a dictionary
      • Save this dictionary to a JSON
      • Continue to the next language (if necessary)
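
A rough Python sketch of those steps; the file names, the word-update hook, and the language list are assumptions for illustration, not settled decisions:

```python
import argparse
import json

# Assumed ISO 639-1 codes for Scribe's languages; adjust as support changes.
LANGUAGES = {"english": "en", "french": "fr", "german": "de",
             "italian": "it", "portuguese": "pt", "russian": "ru",
             "spanish": "es", "swedish": "sv"}


def update_words(language: str) -> list[str]:
    """Placeholder for the SPARQL step that refreshes the word list."""
    with open(f"{language}_words.json", encoding="utf-8") as f:  # name assumed
        return json.load(f)


def translate_word(word: str, src: str, tgt: str) -> str:
    """Placeholder for the model call; wire in the m2m100 translate()
    from the sketch above. The identity stub keeps this runnable."""
    return word


def main() -> None:
    parser = argparse.ArgumentParser()
    parser.add_argument("language", nargs="?", help="optional source language")
    args = parser.parse_args()

    # One source language if passed, otherwise all of them.
    sources = [args.language.lower()] if args.language else list(LANGUAGES)

    for source in sources:
        words = update_words(source)
        translations = {}
        for word in words:
            translations[word] = {
                tgt: translate_word(word, LANGUAGES[source], LANGUAGES[tgt])
                for tgt in LANGUAGES
                if tgt != source
            }
        with open(f"{source}_translations.json", "w", encoding="utf-8") as f:
            json.dump(translations, f, ensure_ascii=False, indent=2)


if __name__ == "__main__":
    main()
```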

Do you have any questions on the above, @nyfz18? Btw, I messaged on Matrix to see if a check-in call would help for this 🙃

@nyfz18

nyfz18 commented Aug 27, 2023

Okay, sounds good. I sort of understand, but a check-in call might be more helpful!

@andrewtavis
Member Author

andrewtavis commented Sep 8, 2023

Hey there @nyfz18! 👋 You now have extract_transform/translate.py at your disposal; it loads in the model, checks for arguments if you'd like to pass them, and prints out the ISO codes at the end. Following the working code there's also some pseudocode outlining the steps we discussed in the call 😊 Let me know if you have any questions/comments!

@andrewtavis
Member Author

@nyfz18, we'll be doing the conversion of the JSON data that's being produced here in the new issue #46. @lillian-mo will work on that one 🚀

@andrewtavis
Member Author

Closing this issue as individual ones have been made for each language that can be worked on as a part of Google Summer of Code ☀️ Thanks all for the discussion here! Help on the individual issues would be welcome 😊
