<a href="https://colab.research.google.com/github/ymoslem/Adaptive-MT-LLM/blob/main/MT/NLLB.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Translation with NLLB
* Repository: https://github.com/facebookresearch/fairseq/tree/nllb
* Paper: https://arxiv.org/abs/2207.04672

# Method 1: CTranslate2 for translation and SentencePiece for tokenization (faster)

## Download the NLLB model(s)

In [None]:
# Models required for using NLLB with CTranslate2

# NLLB 600M - CTranslate2 int8
# !wget https://pretrained-nmt-models.s3.us-west-2.amazonaws.com/CTranslate2/nllb/nllb-200_600M_int8_ct2.zip

# NLLB 3.3B - CTranslate2 int8
# !wget https://pretrained-nmt-models.s3.us-west-2.amazonaws.com/CTranslate2/nllb/nllb-200_3.3B_int8_ct2.zip

# SentencePiece
# !wget https://pretrained-nmt-models.s3.us-west-2.amazonaws.com/CTranslate2/nllb/flores200_sacrebleu_tokenizer_spm.model
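
The CTranslate2 models above are distributed as zip archives, so they must be extracted before `ctranslate2.Translator` can load them. A minimal sketch using Python's standard `zipfile` module follows. The helper name `extract_model` and the `ct2` output directory are assumptions for illustration, and the folder name inside each archive is not stated here, so inspect the returned top-level entries before setting the model path.

```python
import zipfile
from pathlib import Path


def extract_model(zip_path, out_dir="ct2"):
    """Extract a model archive and return its top-level entry names (hypothetical helper)."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(out_dir)
        # Collect the first path component of every archive member
        top_level = sorted({name.split("/")[0] for name in archive.namelist()})
    return top_level


# Example (assumes the archive was downloaded by the cell above):
# print(extract_model("nllb-200_600M_int8_ct2.zip"))
```

The printed entry names show which directory to point the model path at after extraction.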


In [None]:
# Example of converting an NLLB model to CTranslate2 with int8 quantization
# Note: You do not need this step if you downloaded the NLLB model in the previous cell.

# !ct2-transformers-converter --model facebook/nllb-200-distilled-600M --quantization int8 --output_dir ct2/nllb-200-distilled-600M-int8


In [None]:
# Load the model and tokenizer
# Important: This should be done only once

import ctranslate2
import sentencepiece as spm

device = "cuda"  # or "cpu"

# [Modify] Set paths to the CTranslate2 and SentencePiece models
ct_model_path = "ct2/nllb-200-3.3B-int8"
sp_model_path = "flores200_sacrebleu_tokenizer_spm.model"

# Load the SentencePiece model (NLLB uses one shared model for source and target)
sp = spm.SentencePieceProcessor()
sp.load(sp_model_path)

translator = ctranslate2.Translator(ct_model_path, device)
# translator = ctranslate2.Translator(ct_model_path, device="cuda", device_index=[0,1])  # multiple GPUs

In [None]:
print(ctranslate2.__version__)

3.0.1


## Translation of a list of sentences

In [None]:
# Translate a list of sentences

# Source and target language codes
src_lang = "eng_Latn"
tgt_lang = "spa_Latn"

beam_size = 4

source_sents = ['Chinese clinical trials in Wuhan and Shenzhen claimed to show that favipiravir was "clearly effective".',
                "We can see that the real-time PCR test for nucleic acid in respiratory tract or blood samples was added to the second (18 January 2020) and third (22 January 2020) editions.",
                "The SARS-CoV-2 virus is the cause of COVID-19 (coronavirus disease 2019), a contagious respiratory disease that was first identified in December 2019, in Wuhan, Hubei, China."
               ]

source_sents = [sent.strip() for sent in source_sents]
target_prefix = [[tgt_lang]] * len(source_sents)

# Subword the source sentences
source_sents_subworded = sp.encode_as_pieces(source_sents)
source_sents_subworded = [[src_lang] + sent + ["</s>"] for sent in source_sents_subworded]

# Translate the source sentences (reuses the translator loaded earlier, so the model is not reloaded)
translations = translator.translate_batch(source_sents_subworded, batch_type="tokens", max_batch_size=2048, beam_size=beam_size, target_prefix=target_prefix)
translations = [translation.hypotheses[0] for translation in translations]

# Desubword the target sentences
translations_desubword = sp.decode(translations)
translations_desubword = [sent[len(tgt_lang):].strip() for sent in translations_desubword]


print("Translations:", *translations_desubword, sep="\n• ")


Translations:
•  Los ensayos clínicos chinos en Wuhan y Shenzhen afirmaron haber demostrado que el favipiravir era "claramente efectivo".
•  Podemos ver que la prueba de PCR en tiempo real para el ácido nucleico en el tracto respiratorio o muestras de sangre se agregó a la segunda (18 de enero de 2020) y tercera (22 de enero de 2020) ediciones.
•  El virus SARS-CoV-2 es la causa de COVID-19 (enfermedad por coronavirus 2019), una enfermedad respiratoria contagiosa que se identificó por primera vez en diciembre de 2019, en Wuhan, Hubei, China.
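
The cell above relies on a specific token layout: each source sentence becomes `[src_lang] + pieces + ["</s>"]`, and each hypothesis starts with the target language token passed via `target_prefix`. The sketch below isolates that formatting in two plain-Python helpers so it can be checked without loading a model. The function names `add_nllb_tags` and `strip_target_prefix` are hypothetical, and the example pieces are made up for illustration rather than real SentencePiece output.

```python
def add_nllb_tags(pieces, src_lang):
    """Wrap subword pieces as [src_lang] + pieces + ["</s>"]."""
    return [src_lang] + list(pieces) + ["</s>"]


def strip_target_prefix(hypothesis, tgt_lang):
    """Drop the leading target language token, if present."""
    if hypothesis and hypothesis[0] == tgt_lang:
        return list(hypothesis[1:])
    return list(hypothesis)


# Illustrative pieces only, not actual tokenizer output
source = add_nllb_tags(["▁Hello", "▁world"], "eng_Latn")
print(source)

hypothesis = ["spa_Latn", "▁Hola", "▁mundo"]
print(strip_target_prefix(hypothesis, "spa_Latn"))
```

Stripping the prefix at the token level, before calling `sp.decode`, is an alternative to slicing the decoded string by `len(tgt_lang)`, since it does not depend on the character length of the language code.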


## File Translation

In [None]:
# Translate a file

# Source and target language codes
# Other example codes: arb_Arab deu_Latn ita_Latn rus_Cyrl spa_Latn zho_Hans
src_lang = "eng_Latn"
tgt_lang = "spa_Latn"

beam_size = 4

file_path = "tico/tico-19.final.en.test"

with open(file_path, encoding="utf-8") as source_file:
    source_sents = source_file.readlines()

source_sents = [sent.strip() for sent in source_sents]
print(src_lang, source_sents[0], sep=" --> ")
target_prefix = [[tgt_lang]] * len(source_sents)

# Subword the source sentences
source_sents_subworded = sp.encode_as_pieces(source_sents)
source_sents_subworded = [[src_lang] + sent + ["</s>"] for sent in source_sents_subworded]

# Translate the source sentences (reuses the translator loaded earlier, so the model is not reloaded)
translations = translator.translate_batch(source_sents_subworded, batch_type="tokens", max_batch_size=2048, beam_size=beam_size, target_prefix=target_prefix)
translations = [translation.hypotheses[0] for translation in translations]

# Desubword the target sentences
translations_desubword = sp.decode(translations)
translations_desubword = [sent[len(tgt_lang):].strip() for sent in translations_desubword]
print(tgt_lang, translations_desubword[0], sep=" --> ")


In [None]:
# Save the translations to a file

target_file_path = file_path + "." + tgt_lang

with open(target_file_path, "w+", encoding="utf-8") as target:
    for line in translations_desubword:
        target.write(line.strip() + "\n")

print("Done! Target file saved at:", target_file_path)
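
Because the saved translations are meant to stay line-aligned with the source file, a quick line-count comparison can catch truncated or misaligned output before evaluation. This is a minimal sketch; the helper name `count_lines` is an assumption, and the commented usage relies on `file_path` and `target_file_path` from the cells above.

```python
def count_lines(path):
    """Count the lines in a UTF-8 text file."""
    with open(path, encoding="utf-8") as f:
        return sum(1 for _ in f)


# Usage with the variables defined in the cells above:
# assert count_lines(file_path) == count_lines(target_file_path), "Line counts differ"
```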

# Method 2: Hugging Face for both tokenization and translation

In [None]:
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
import torch

src_lang = "eng_Latn"
tgt_lang = "spa_Latn"

tokenizer = AutoTokenizer.from_pretrained("facebook/nllb-200-3.3B", src_lang=src_lang)
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/nllb-200-3.3B",
                                              torch_dtype=torch.float16,
                                              low_cpu_mem_usage=True,)
# The model is already loaded in float16 via torch_dtype, so no extra .half() call is needed
model.to("cuda")

source_text = 'Chinese clinical trials in Wuhan and Shenzhen claimed to show that favipiravir was "clearly effective".'
inputs = tokenizer(source_text, return_tensors="pt").to("cuda")

translated_tokens = model.generate(
        **inputs, forced_bos_token_id=tokenizer.convert_tokens_to_ids(tgt_lang), num_beams=5, max_length=100
)
tokenizer.batch_decode(translated_tokens, skip_special_tokens=True)[0]

'Los ensayos clínicos chinos en Wuhan y Shenzhen afirmaron mostrar que el favipiravir era "claramente efectivo".'

# Method 3: CTranslate2 for translation, but Hugging Face for tokenization

In [None]:
# Load the model and tokenizer
# Important: This should be done only once

import ctranslate2
import transformers

src_lang = "eng_Latn"
tgt_lang = "spa_Latn"

device = "cuda"  # or "cpu"
beam_size = 4

translator = ctranslate2.Translator("ct2/nllb-200-3.3B-int8", device)
tokenizer = transformers.AutoTokenizer.from_pretrained("facebook/nllb-200-3.3B", src_lang=src_lang)

In [None]:
source_text = 'Chinese clinical trials in Wuhan and Shenzhen claimed to show that favipiravir was "clearly effective".'

source = tokenizer.convert_ids_to_tokens(tokenizer.encode(source_text))
target_prefix = [tgt_lang]
results = translator.translate_batch([source], target_prefix=[target_prefix], beam_size=beam_size)
target = results[0].hypotheses[0][1:]

print(tokenizer.decode(tokenizer.convert_tokens_to_ids(target)))

Los ensayos clínicos chinos en Wuhan y Shenzhen afirmaron mostrar que el favipiravir era "claramente efectivo".
