In this notebook, we use the BLEU metric to compare the translation quality of two approaches: the open-source NLLB model from Hugging Face and Google Translate (accessed through the googletrans library).


In [17]:
# !pip install -q googletrans==3.1.0a0
# !pip install -q evaluate==0.4.2
# !pip install -q transformers==4.42.4

In [2]:
from googletrans import Translator
import transformers
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, pipeline
import evaluate


In [3]:
#Sentences to Translate.
sentences = [
    "In the previous chapters, you've mainly seen how to work with OpenAI models, and you've had a very practical introduction to Hugging Face's open-source models, the use of embeddings, vector databases, and agents.",
    "These have been very practical chapters in which I've tried to gradually introduce concepts that have allowed you, or at least I hope so, to scale up your knowledge and start creating projects using the current technology stack of large language models."
    ]

In [4]:
#Spanish Translation References.
reference_translations = [
    ["En los capítulos anteriores has visto mayoritariamente como trabajar con los modelos de OpenAI, y has tenido una introducción muy práctica a los modelos Open Source de Hugging Face, al uso de embeddings, las bases de datos vectoriales, los agentes."],
    ["Han sido capítulos muy prácticos en los que he intentado ir introduciendo conceptos que te han permitido, o eso espero, ir escalando en tus conocimientos y empezar a crear proyectos usando el stack tecnológico actual de los grandes modelos de lenguaje."]
    ]
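Each entry in `reference_translations` is itself a list because BLEU accepts several acceptable translations per source sentence; here every sentence has exactly one. As a minimal sketch, the hypothetical helper below (`normalize_references` is not part of any library) wraps bare strings into one-element lists and checks that predictions and references line up. The sample strings are illustrative only.

```python
def normalize_references(predictions, references):
    """Return references as a list of lists of strings (hypothetical helper)."""
    if len(predictions) != len(references):
        raise ValueError(
            f"{len(predictions)} predictions but {len(references)} reference entries"
        )
    normalized = []
    for refs in references:
        # A bare string means "one reference"; wrap it so every entry is a list.
        if isinstance(refs, str):
            refs = [refs]
        if not refs:
            raise ValueError("each prediction needs at least one reference")
        normalized.append(list(refs))
    return normalized


# Illustrative data: the second entry is a bare string and gets wrapped.
sample_predictions = ["hola mundo", "buenos días"]
sample_references = [["hola mundo", "hola, mundo"], "buenos días"]
normalized_refs = normalize_references(sample_predictions, sample_references)
print(normalized_refs)
```

Running a check like this before scoring makes a mismatch between the number of translations and the number of reference entries fail early with a readable message.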

In [5]:
model_id = "facebook/nllb-200-distilled-600M"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

In [6]:
translator = pipeline('translation', model=model, tokenizer=tokenizer,
                        src_lang="eng_Latn", tgt_lang="spa_Latn")

Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.


In [7]:
translations_nllb = []

for text in sentences:
  print("to translate: " + text)
  # The pipeline returns a list with one dict per input: [{"translation_text": "..."}]
  translation = translator(text)

  # Add the translated text to the translations list
  translations_nllb.append(translation[0]["translation_text"])

to translate: In the previous chapters, you've mainly seen how to work with OpenAI models, and you've had a very practical introduction to Hugging Face's open-source models, the use of embeddings, vector databases, and agents.
to translate: These have been very practical chapters in which I've tried to gradually introduce concepts that have allowed you, or at least I hope so, to scale up your knowledge and start creating projects using the current technology stack of large language models.


Now we have the translations stored in the list 'translations_nllb'.

In [8]:
translations_nllb

['En los capítulos anteriores, han visto principalmente cómo trabajar con modelos OpenAI, y han tenido una introducción muy práctica a los modelos de código abierto de Hugging Face, el uso de embebidos, bases de datos vectoriales y agentes.',
 'Estos han sido capítulos muy prácticos en los que he intentado introducir gradualmente conceptos que han permitido, o al menos espero que lo hagan, ampliar sus conocimientos y comenzar a crear proyectos utilizando la tecnología actual de los modelos de lenguaje grande.']

In [9]:
translator_google = Translator()

In [10]:
translations_google = []

for text in sentences:
  print("to translate: " + text)
  # Translate into Spanish ("es"); the source language is auto-detected
  translation = translator_google.translate(text, dest="es")

  # Add the translated text to the translations list
  translations_google.append(translation.text)
  print(translation.text)

to translate: In the previous chapters, you've mainly seen how to work with OpenAI models, and you've had a very practical introduction to Hugging Face's open-source models, the use of embeddings, vector databases, and agents.
En los capítulos anteriores, vio principalmente cómo trabajar con modelos OpenAI y tuvo una introducción muy práctica a los modelos de código abierto de Hugging Face, el uso de incrustaciones, bases de datos vectoriales y agentes.
to translate: These have been very practical chapters in which I've tried to gradually introduce concepts that have allowed you, or at least I hope so, to scale up your knowledge and start creating projects using the current technology stack of large language models.
Estos han sido capítulos muy prácticos en los que he intentado introducir gradualmente conceptos que te han permitido, o al menos eso espero, ampliar tus conocimientos y empezar a crear proyectos utilizando la tecnología actual de grandes modelos de lenguaje.


The list 'translations_google' now holds the translations produced by Google Translate.

In [11]:
translations_google

['En los capítulos anteriores, vio principalmente cómo trabajar con modelos OpenAI y tuvo una introducción muy práctica a los modelos de código abierto de Hugging Face, el uso de incrustaciones, bases de datos vectoriales y agentes.',
 'Estos han sido capítulos muy prácticos en los que he intentado introducir gradualmente conceptos que te han permitido, o al menos eso espero, ampliar tus conocimientos y empezar a crear proyectos utilizando la tecnología actual de grandes modelos de lenguaje.']

## Evaluate translations with BLEU

We will use the BLEU implementation from the Evaluate library by Hugging Face.
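To make the metric less of a black box, here is a minimal sketch of the quantity at the core of BLEU: the modified (clipped) n-gram precision. It assumes plain whitespace tokenization and a single reference, which is a simplification (the library applies its own tokenizer), and the function names and example sentences are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(candidate, reference, n):
    """Clipped n-gram precision of a candidate against one reference."""
    cand_counts = Counter(ngrams(candidate.split(), n))
    ref_counts = Counter(ngrams(reference.split(), n))
    # Clip each candidate n-gram count by its count in the reference,
    # so repeating a correct word cannot inflate the score.
    matched = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    total = sum(cand_counts.values())
    return matched / total if total else 0.0


candidate = "el gato está en la alfombra"
reference = "el gato está sobre la alfombra"
p1 = modified_precision(candidate, reference, 1)  # 5 of 6 unigrams match
p2 = modified_precision(candidate, reference, 2)  # 3 of 5 bigrams match
print(p1, p2)
```

BLEU combines these precisions for n = 1 to 4 with a geometric mean and multiplies the result by a brevity penalty that penalizes translations shorter than the reference.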

In [12]:
bleu = evaluate.load('bleu')

In [13]:
results_nllb = bleu.compute(predictions=translations_nllb, references=reference_translations)


In [14]:
results_google = bleu.compute(predictions=translations_google, references=reference_translations)

In [15]:
print(results_nllb)

{'bleu': 0.3686324165619373, 'precisions': [0.7159090909090909, 0.47674418604651164, 0.30952380952380953, 0.18292682926829268], 'brevity_penalty': 0.988700685876667, 'length_ratio': 0.9887640449438202, 'translation_length': 88, 'reference_length': 89}


In [16]:
print(results_google)

{'bleu': 0.44975901966417653, 'precisions': [0.7710843373493976, 0.5679012345679012, 0.4177215189873418, 0.2987012987012987], 'brevity_penalty': 0.9302618655343314, 'length_ratio': 0.9325842696629213, 'translation_length': 83, 'reference_length': 89}
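
To see how the reported fields fit together, the sketch below recombines them: BLEU is the brevity penalty multiplied by the geometric mean of the four n-gram precisions. The helper name `bleu_from_parts` is hypothetical and not part of the Evaluate library; the numbers are copied from the two outputs above.

```python
import math


def bleu_from_parts(precisions, translation_length, reference_length):
    """Recombine BLEU from n-gram precisions and lengths (hypothetical helper)."""
    # Brevity penalty: 1.0 when the translation is longer than the reference,
    # otherwise exp(1 - reference_length / translation_length).
    if translation_length > reference_length:
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - reference_length / translation_length)
    # Geometric mean of the precisions (assumes every precision is > 0).
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / len(precisions))
    return brevity_penalty * geo_mean


nllb_bleu = bleu_from_parts(
    [0.7159090909090909, 0.47674418604651164, 0.30952380952380953, 0.18292682926829268],
    translation_length=88,
    reference_length=89,
)
google_bleu = bleu_from_parts(
    [0.7710843373493976, 0.5679012345679012, 0.4177215189873418, 0.2987012987012987],
    translation_length=83,
    reference_length=89,
)
print(f"NLLB:   {nllb_bleu:.4f}")    # 0.3686
print(f"Google: {google_bleu:.4f}")  # 0.4498
```

Google's translation is shorter (83 tokens against 89 in the reference), so it takes a larger brevity penalty (0.930 versus 0.989), but its higher precisions at every n-gram order still give it the better overall score. With only two sentences and one reference each, the gap is indicative rather than conclusive.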
