shreydan/multilingual-translation

Multilingual Machine Translation with Transformers

This project was for learning purposes only, so the focus was on getting decent results rather than building an alternative to existing multilingual models.

  • Implemented a 7M-parameter model.
  • Trained a BERT-style tokenizer.
  • Trained on the Opus100 dataset, using the en-hi and en-te subsets.
  • The full walkthrough is available on Kaggle.
ENGLISH ----> HINDI
          |
          --> TELUGU

Working

  • The model knows which language to translate into from the beginning-of-sentence (BOS) token that precedes the sentence:
    • English sentences start with the <s-en> token
    • Hindi sentences start with the <s-hi> token
    • Telugu sentences start with the <s-te> token
    • all sentences end with the </s> token
  • The model is trained as a sequence-to-sequence Transformer with an encoder-decoder architecture. The encoder handles English and the decoder handles both Hindi and Telugu.
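
The token scheme above can be sketched in a few lines of plain Python. This is only an illustration of how sentences might be wrapped before tokenization, not the repository's actual preprocessing code: the helper names `wrap_sentence`, `make_pair`, and `decoder_start` are hypothetical, and only the four special tokens come from the list above.

```python
# Language-specific beginning-of-sentence tokens and the shared end token,
# as listed above.
BOS_TOKENS = {"en": "<s-en>", "hi": "<s-hi>", "te": "<s-te>"}
EOS_TOKEN = "</s>"


def wrap_sentence(text: str, lang: str) -> str:
    """Hypothetical helper: add the language BOS token and the end token."""
    if lang not in BOS_TOKENS:
        raise ValueError(f"unsupported language: {lang!r}")
    return f"{BOS_TOKENS[lang]} {text.strip()} {EOS_TOKEN}"


def make_pair(source_en: str, target: str, target_lang: str) -> tuple[str, str]:
    """Hypothetical helper: build one (encoder input, decoder target) pair."""
    return wrap_sentence(source_en, "en"), wrap_sentence(target, target_lang)


def decoder_start(target_lang: str) -> str:
    """Assumed inference behaviour: seed the decoder with the target BOS token."""
    return BOS_TOKENS[target_lang]


src, tgt = make_pair("how are you?", "आप कैसे हैं?", "hi")
print(src)                  # <s-en> how are you? </s>
print(tgt)                  # <s-hi> आप कैसे हैं? </s>
print(decoder_start("te"))  # <s-te>
```

Because the target language is carried entirely by the first decoder token, a single decoder can serve both Hindi and Telugu without any extra language-ID input.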

Model Config

config = {
    'dim': 128,
    'n_heads': 4,
    'attn_dropout': 0.1,
    'mlp_dropout': 0.1,
    'depth': 8,
    'vocab_size': 30000,
    'max_len': 128
}
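
Two quantities follow directly from this config and can be checked with a quick sketch: the width of each attention head and the size of a single token-embedding table. How these relate to the 7M total depends on details not stated here (for example, whether the encoder and decoder share embeddings), so the snippet deliberately stops short of a full parameter count.

```python
config = {
    'dim': 128,
    'n_heads': 4,
    'attn_dropout': 0.1,
    'mlp_dropout': 0.1,
    'depth': 8,
    'vocab_size': 30000,
    'max_len': 128
}

# The model width must split evenly across the attention heads.
assert config['dim'] % config['n_heads'] == 0
head_dim = config['dim'] // config['n_heads']

# One token-embedding table has vocab_size rows of width dim.
embedding_params = config['vocab_size'] * config['dim']

print(head_dim)                 # 32
print(f"{embedding_params:,}")  # 3,840,000
```

With a 30,000-token vocabulary and a width of only 128, one embedding table alone accounts for a large share of a 7M-parameter budget.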

Inference Results

python inference.py --text 'how are you?' -l hi -s
>>> आप कैसे हैं?

python inference.py --text 'please call me' -l hi
>>> कृपया मुझे पुकारो

python inference.py --text 'what are you doing?' -l te -s -t 0.5
>>> మీరు ఏం చేస్తున్నారు?

python inference.py --text "what's wrong?" -l te -s
>>> ఏమి తప్పు?

The results are kinda hilarious, but at least it works.

If you really want good-quality multilingual Indic translation, here's the SOTA model: ai4bharat/indictrans2-indic-en-1B. It's even used officially by the Govt. of India.

I have refrained my feet from every evil way,
That I might keep thy word.
                                Psalm 119:101

About

Training a transformer for multilingual translation from scratch. Translates English to Hindi or Telugu. Trained on the Opus100 dataset for learning purposes.
