# T5 Model

The T5 model is pretrained to perform several tasks, including:

- Translation
- CoLA / linguistic acceptability (whether a sentence is grammatically acceptable)
- Semantic similarity
- Paragraph summarization
- Question answering with context
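
Each of these tasks is selected with a plain-text prefix on the input rather than with a separate model head. As a rough sketch, the prefixes used in the cells of this notebook can be collected in one place. The `TASK_TEMPLATES` dictionary and the `build_prompt` helper are illustrative names, not part of `transformers`.

```python
# Illustrative helper (not part of transformers): the templates mirror the
# text prefixes used in the cells of this notebook.
TASK_TEMPLATES = {
    "summarize": "summarize: {text}",
    "translate_en_de": "translate English to German: {text}",
    "cola": "cola sentence: {text}",
    "stsb": "stsb sentence1: {sentence1} sentence2: {sentence2}",
    "qa": "question: {question} context: {context}",
}


def build_prompt(task: str, **fields: str) -> str:
    """Fill in the template for `task`.

    Raises KeyError for an unknown task or a missing field.
    """
    return TASK_TEMPLATES[task].format(**fields)


print(build_prompt("cola", text="I love eating pizza."))
# cola sentence: I love eating pizza.
```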

In [1]:
from transformers import T5ForConditionalGeneration, T5Tokenizer

In [2]:
# Only needed if you run into SSL certificate problems
import os
import certifi
os.environ['REQUESTS_CA_BUNDLE'] = certifi.where()
# Note: set HF_HOME before importing transformers; the cache location is read at import time
os.environ['HF_HOME'] = 'D:\\huggingface_cache'  # Change this path to whichever you prefer

In [5]:
# Check whether we are running on CUDA or CPU,
# and print the device in use along with its name

import torch
device = "cuda" if torch.cuda.is_available() else "cpu"
print("Device in use:", device)
if device == "cuda":
    print("Device name:", torch.cuda.get_device_name(0))

Device in use: cpu


In [6]:
base_tokenizer = T5Tokenizer.from_pretrained("t5-base")
base_model = T5ForConditionalGeneration.from_pretrained("t5-base").to(device)

In [7]:
# Add a sample text to summarize
text_for_summary = """The Hugging Face Hub is a platform that allows users to share and discover machine
learning models and datasets. It provides a central repository for pre-trained models, making it easy for
developers to access and use state-of-the-art NLP models without having to train them from scratch."""

preprocessed_text = text_for_summary.strip().replace("\n", " ")
print("Original text:", preprocessed_text)

Original text: The Hugging Face Hub is a platform that allows users to share and discover machine learning models and datasets. It provides a central repository for pre-trained models, making it easy for developers to access and use state-of-the-art NLP models without having to train them from scratch.
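
The `replace("\n", " ")` call handles the line breaks in this example, but text copied from other sources may also contain tabs or runs of spaces. A slightly more general sketch, using only the standard library (the `normalize_whitespace` name is made up for illustration):

```python
def normalize_whitespace(text: str) -> str:
    """Collapse every run of whitespace (newlines, tabs, spaces) into one space."""
    # str.split() with no arguments splits on any whitespace and drops empty pieces
    return " ".join(text.split())


sample = """  The Hugging Face Hub is a platform
that allows users\tto share   models.  """
print(normalize_whitespace(sample))
# The Hugging Face Hub is a platform that allows users to share models.
```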


In [8]:
# Use a task prefix in the prompt to tell the model we want a summary
input_text = "summarize: " + preprocessed_text

input_ids = base_tokenizer.encode(input_text, return_tensors="pt").to(device)
# early_stopping only applies to beam search, so it is omitted from this greedy call
summary_ids = base_model.generate(input_ids,
                                  no_repeat_ngram_size=3,
                                  min_length=30,
                                  max_length=50)

output = base_tokenizer.decode(summary_ids[0], skip_special_tokens=True)
print(f"Generated summary:\n{output}")

Generated summary:
the Hugging Face Hub is a platform that allows users to share and discover machine learning models and datasets . it provides a central repository for pre-trained models .
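
Every task cell in this notebook follows the same encode, generate, decode pattern, so it can be wrapped in one function. The sketch below is one possible way to organize it, not an API from `transformers`: `run_t5` is a hypothetical name, and it expects a tokenizer and model that behave like `base_tokenizer` and `base_model`. Because `early_stopping` only affects beam search, the helper drops that option when `num_beams` is 1.

```python
def run_t5(model, tokenizer, prompt, device="cpu", **generate_kwargs):
    """Encode `prompt`, generate with `model`, and decode the first sequence."""
    # early_stopping is a beam-search option; remove it for greedy decoding
    if generate_kwargs.get("num_beams", 1) <= 1:
        generate_kwargs.pop("early_stopping", None)

    input_ids = tokenizer.encode(prompt, return_tensors="pt").to(device)
    output_ids = model.generate(input_ids, **generate_kwargs)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

With the objects loaded earlier, a call such as `run_t5(base_model, base_tokenizer, "summarize: " + preprocessed_text, device, min_length=30, max_length=50)` could stand in for the three separate steps.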


## English-to-German Translation

In [9]:
input_ids = base_tokenizer.encode("translate English to German: I live in Germany",
                                  return_tensors="pt").to(device)

translation_ids = base_model.generate(input_ids,
                                      num_beams=4,
                                      no_repeat_ngram_size=3,
                                      max_length=20,
                                      early_stopping=True)

output = base_tokenizer.decode(translation_ids[0], skip_special_tokens=True)
print(f"Generated translation:\n{output}")

Generated translation:
Ich lebe in Deutschland


## CoLA: Corpus of Linguistic Acceptability

In [10]:
input_ids = base_tokenizer.encode("cola sentence: I love eating pizza.", return_tensors="pt").to(device)

cola_ids = base_model.generate(input_ids,
                               num_beams=4,
                               no_repeat_ngram_size=3,
                               max_length=20,
                               early_stopping=True)

output = base_tokenizer.decode(cola_ids[0], skip_special_tokens=True)
print(f"Is the sentence acceptable?:\n{output}")

Is the sentence acceptable?:
acceptable


In [11]:
input_ids = base_tokenizer.encode("cola sentence: The want jumping pizza.", return_tensors="pt").to(device)

cola_ids = base_model.generate(input_ids,
                               num_beams=4,
                               no_repeat_ngram_size=3,
                               max_length=20,
                               early_stopping=True)

output = base_tokenizer.decode(cola_ids[0], skip_special_tokens=True)
print(f"Is the sentence acceptable?:\n{output}")

Is the sentence acceptable?:
unacceptable
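
For CoLA the model answers with a text label rather than a class index. If the result feeds into other code, it may be convenient to turn the label into a boolean. A minimal sketch, assuming the only two labels are the ones shown above (`is_acceptable` is a hypothetical helper):

```python
def is_acceptable(label: str) -> bool:
    """Map T5's CoLA text label to a boolean; reject anything unexpected."""
    normalized = label.strip().lower()
    if normalized == "acceptable":
        return True
    if normalized == "unacceptable":
        return False
    raise ValueError(f"Unexpected CoLA label: {label!r}")


print(is_acceptable("acceptable"))    # True
print(is_acceptable("unacceptable"))  # False
```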


## STSB - Semantic Textual Similarity Benchmark

In [12]:
sentence_one = "I love eating pizza."
sentence_two = "Pizza is one of my favorite foods."

input_ids = base_tokenizer.encode(f"stsb sentence1: {sentence_one} sentence2: {sentence_two}", 
                                  return_tensors="pt").to(device)

# Greedy decoding; early_stopping only applies to beam search, so it is omitted
stsb_ids = base_model.generate(input_ids,
                               max_length=3)

output = base_tokenizer.decode(stsb_ids[0], skip_special_tokens=True)
print(f"Semantic similarity (0-5):\n{output}")

Semantic similarity (0-5):
4.4


In [13]:
sentence_one = "I love eating pizza."
sentence_two = "My dog likes to eat meat."

input_ids = base_tokenizer.encode(f"stsb sentence1: {sentence_one} sentence2: {sentence_two}", 
                                  return_tensors="pt").to(device)

# Greedy decoding; early_stopping only applies to beam search, so it is omitted
stsb_ids = base_model.generate(input_ids,
                               max_length=3)

output = base_tokenizer.decode(stsb_ids[0], skip_special_tokens=True)
print(f"Semantic similarity (0-5):\n{output}")

Semantic similarity (0-5):
0.0
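
The similarity score also comes back as text. STS-B scores range from 0 to 5, so a small parser can convert the string to a number and, optionally, rescale it to the 0-1 range. This is only a sketch: `parse_stsb_score` is an illustrative name, and clamping out-of-range values is a design assumption rather than something the model guarantees.

```python
def parse_stsb_score(output: str, normalize: bool = False) -> float:
    """Convert the generated text (e.g. "4.4") to a float on the 0-5 scale."""
    score = float(output.strip())      # raises ValueError if the text is not numeric
    score = min(max(score, 0.0), 5.0)  # clamp to the STS-B range (assumption)
    return score / 5.0 if normalize else score


print(parse_stsb_score("4.4"))                  # 4.4
print(parse_stsb_score("0.0", normalize=True))  # 0.0
```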


## Q/A

This task answers a question given a specific context.

In [14]:
input_ids = base_tokenizer.encode(
    "question: Where did Othon study engineering? "
    "context: Othon studied engineering at IPN, but he also took some "
    "german courses at UNAM, and also took swimming lessons in Queretaro.",
    return_tensors="pt").to(device)

# Keep the generated ids in their own variable instead of overwriting input_ids;
# early_stopping only applies to beam search, so it is omitted
answer_ids = base_model.generate(input_ids)
output = base_tokenizer.decode(answer_ids[0], skip_special_tokens=True)
print(f"Answer:\n{output}")

Answer:
IPN


In [15]:
input_ids = base_tokenizer.encode(
    "question: Where did Othon learned german? "
    "context: Othon studied engineering at IPN, but he also took some "
    "german courses at UNAM, and also took swimming lessons in Queretaro.",
    return_tensors="pt").to(device)

# Keep the generated ids in their own variable instead of overwriting input_ids;
# early_stopping only applies to beam search, so it is omitted
answer_ids = base_model.generate(input_ids)
output = base_tokenizer.decode(answer_ids[0], skip_special_tokens=True)
print(f"Answer:\n{output}")

Answer:
UNAM
