The code works with three main components:
The segmenter splits a block of text into sentences, accounting for some Indian-language delimiters. It is somewhat crude and rule-based, and was contributed by Binu Jasim.
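As a rough illustration of what rule-based segmentation with Indian-language delimiters can look like, here is a minimal sketch. It is not the contributed implementation: the function name `segment` and the delimiter set (ASCII sentence enders plus the Devanagari danda `।` and double danda `॥`) are assumptions made for this example.

```python
import re

# Assumed delimiter set (not taken from the project): ASCII sentence enders
# plus the Devanagari danda (U+0964) and double danda (U+0965).
_SENTENCE_END = re.compile(r"(?<=[.!?\u0964\u0965])\s+")


def segment(text):
    """Split text into sentences at whitespace that follows a delimiter."""
    return [s.strip() for s in _SENTENCE_END.split(text) if s.strip()]


if __name__ == "__main__":
    for sentence in segment("यह पहला वाक्य है। यह दूसरा वाक्य है। This is English."):
        print(sentence)
```

A rule like this will mis-split abbreviations such as "Dr. Rao", which is the kind of crudeness to expect from a purely rule-based approach.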
We use SentencePiece as an unsupervised tokenizer for Indian languages, which works surprisingly well in our experiments. Models trained on whatever corpora we could find for each language are in sentencepiece/models, with vocabularies of 4000 and 8000 units.
Training a joint SentencePiece model over all languages led to character-level tokenization for under-represented languages. Since the scripts differ and there is little to gain from sharing, we use an individual tokenizer for each language. The combined vocabulary, however, has fewer than 4000 x |#languages| units, because some common English code-mixed tokens appear in more than one language's vocabulary. This overlap also makes the MT system robust, in some sense, to code-mixed inputs.
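The vocabulary-size argument can be checked with a small sketch. The vocabularies below are made-up toy sets, not the project's trained models, and `combined_vocab_size` is a hypothetical helper: it compares the sum of the per-language vocabulary sizes with the size of their union, which is smaller whenever tokens (for example English code-mixed words) are shared.

```python
def combined_vocab_size(vocabs):
    """Return (sum of per-language sizes, size of the union of all vocabularies)."""
    total = sum(len(v) for v in vocabs.values())
    union = set().union(*vocabs.values())
    return total, len(union)


# Toy vocabularies (illustrative only); "▁the" and "▁ok" stand in for
# English tokens that show up in both languages through code mixing.
vocabs = {
    "hi": {"▁यह", "▁है", "▁the", "▁ok"},
    "ta": {"▁இது", "▁the", "▁ok", "▁bus"},
}

total, union = combined_vocab_size(vocabs)
print(total, union)
```

With real per-language models the same comparison would use the token lists exported by each model, and the gap between the two numbers is the count of shared tokens.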
Translator is a wrapper around fairseq, which we have reused for some web interfaces.