### This notebook shows a full pipeline for text language identification and translation using two Facebook models: fastText and No Language Left Behind (NLLB).

First, we take an input text in any language and detect its language code with fastText.

Then we feed the text and the predicted language code to NLLB, which translates the text from the detected language into any target language that NLLB supports.

# Language Identification

In [None]:
# Download the pretrained fastText language identification model (about 1.1 GB)
!wget https://dl.fbaipublicfiles.com/nllb/lid/lid218e.bin

--2024-05-03 19:10:30--  https://dl.fbaipublicfiles.com/nllb/lid/lid218e.bin
Resolving dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)... 108.157.254.15, 108.157.254.124, 108.157.254.102, ...
Connecting to dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)|108.157.254.15|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1176355829 (1.1G) [application/octet-stream]
Saving to: ‘lid218e.bin’


2024-05-03 19:10:34 (323 MB/s) - ‘lid218e.bin’ saved [1176355829/1176355829]



In [None]:
!pip install fasttext

Collecting fasttext
  Downloading fasttext-0.9.2.tar.gz (68 kB)
  Preparing metadata (setup.py) ... done
Collecting pybind11>=2.2 (from fasttext)
  Using cached pybind11-2.12.0-py3-none-any.whl (234 kB)
Building wheels for collected packages: fasttext
  Building wheel for fasttext (setup.py) ... done
  Created wheel for fasttext: filename=fasttext-0.9.2-cp310-cp310-linux_x86_64.whl size=4227137 sha256=77b9c8dcac02715a6f88dc42d5ffcc16a75f8a14f3b9b0d56d8f9d81b676a2de
  Stored in directory: /root/.cache/pip/wheels/a5/13/75/f811c84a8ab36eedbaef977a6a58a98990e8e0f1967f98f394
Successfully built fasttext
Installing collected packages: pybind11, fasttext
Successfully installed fasttext-0.9.2 pybind11-2.12.0


In [None]:
import fasttext

pretrained_lang_model = "/content/lid218e.bin" # path of pretrained model file
model = fasttext.load_model(pretrained_lang_model)



Now let's enter a test text in the source language. Here we translate from Arabic to Danish.

In [None]:
text = "صباح الخير، الجو جميل اليوم والسماء صافية."
# text = "Mange tilføjer rå blegselleri til salater. Det første spørgsmål er til skatteministeren af hr."

In [None]:
predictions = model.predict(text, k=1)
print(predictions)

(('__label__arb_Arab',), array([0.99960977]))
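fastText's `predict` processes one line at a time and raises a `ValueError` when the input contains a newline, so multi-line text needs to be flattened before the call above. A minimal sketch, where `clean_for_lid` is a hypothetical helper name and not part of fastText:

```python
def clean_for_lid(text: str) -> str:
    """Collapse newlines, tabs, and repeated spaces into single spaces.

    Hypothetical helper (not part of fastText): fastText's predict()
    rejects strings that contain a newline character.
    """
    return " ".join(text.split())


# A multi-line variant of the test sentence used in this notebook.
multi_line = "صباح الخير،\nالجو جميل اليوم\n\nوالسماء صافية."
print(clean_for_lid(multi_line))
```

The cleaned string can then be passed to `model.predict(..., k=1)` exactly as before.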


In [None]:
input_lang = predictions[0][0].replace('__label__', '')
print(input_lang)

arb_Arab
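The label handling above can be wrapped in a small function that also checks the confidence score, which may help with short or mixed-language inputs where the top prediction is less reliable. This is only a sketch: `parse_prediction` and the 0.5 threshold are illustrative assumptions, not part of fastText or NLLB.

```python
from typing import Optional, Sequence, Tuple


def parse_prediction(
    predictions: Tuple[Sequence[str], Sequence[float]],
    min_confidence: float = 0.5,  # illustrative threshold, tune for your data
) -> Optional[Tuple[str, str, float]]:
    """Return (language, script, score) for the top label, or None.

    Expects the (labels, scores) pair returned by model.predict(text, k=...).
    """
    labels, scores = predictions
    if len(labels) == 0:
        return None
    score = float(scores[0])
    if score < min_confidence:
        return None
    code = labels[0].replace("__label__", "", 1)  # e.g. "arb_Arab"
    language, _, script = code.partition("_")
    return language, script, score


# Same shape as the output printed earlier, with a plain list for the scores.
example = (("__label__arb_Arab",), [0.99960977])
print(parse_prediction(example))
```

Returning `None` for low-confidence input lets the caller decide whether to skip translation or fall back to a default source language.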


The target language for this run is Danish, whose NLLB code is dan_Latn.

# Text Translation

In [None]:
!pip install -U pip transformers

Collecting pip
  Downloading pip-24.0-py3-none-any.whl (2.1 MB)
Installing collected packages: pip
  Attempting uninstall: pip
    Found existing installation: pip 23.1.2
    Uninstalling pip-23.1.2:
      Successfully uninstalled pip-23.1.2
Successfully installed pip-24.0


In [None]:
!pip install sentencepiece

In [None]:
checkpoint = 'facebook/nllb-200-distilled-600M'
# checkpoint = 'facebook/nllb-200-1.3B'
# checkpoint = 'facebook/nllb-200-3.3B'
# checkpoint = 'facebook/nllb-200-distilled-1.3B'

In [None]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, pipeline
model_translate = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

In [None]:
# target_lang = 'spa_Latn'  # Spanish
target_lang = 'dan_Latn'  # Danish
translation_pipeline = pipeline('translation',
                                model=model_translate,
                                tokenizer=tokenizer,
                                src_lang=input_lang,  # language code detected by fastText above
                                tgt_lang=target_lang,
                                max_length=400)
output = translation_pipeline(text)
print(output[0]['translation_text'])

Godmorgen, det er godt vejr og himlen er ren.
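The two stages can be chained into one function so that the source language does not have to be fixed when the pipeline is built. This is a sketch under assumptions: `detect_and_translate` is a hypothetical name, `lid_predict` stands in for `model.predict`, and `translator` stands in for a translation pipeline that accepts `src_lang` and `tgt_lang` as keyword arguments at call time.

```python
from typing import Callable, Tuple


def detect_and_translate(
    text: str,
    lid_predict: Callable,   # assumed to behave like fastText's model.predict
    translator: Callable,    # assumed to behave like the translation pipeline
    target_lang: str,
) -> Tuple[str, str]:
    """Detect the source language, then translate into target_lang.

    Returns (detected_language_code, translated_text).
    """
    # fastText's predict() rejects newlines, so flatten the text first.
    cleaned = " ".join(text.split())
    labels, _scores = lid_predict(cleaned, k=1)
    src_lang = labels[0].replace("__label__", "", 1)

    # Nothing to translate if the text is already in the target language.
    if src_lang == target_lang:
        return src_lang, cleaned

    result = translator(cleaned, src_lang=src_lang, tgt_lang=target_lang)
    return src_lang, result[0]["translation_text"]
```

With the objects defined in this notebook, the call would look like `detect_and_translate(text, model.predict, translation_pipeline, 'dan_Latn')`.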


Comparison of the two translations of the same sentence:

From NLLB: Godmorgen, det er godt vejr og himlen er ren.

From Google: Godmorgen, vejret er smukt i dag og himlen er klar.