<a href="https://colab.research.google.com/github/spinto88/Clases_y_tutoriales/blob/main/Transformers.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Transformers and HuggingFace

In this notebook we will explore some pretrained models that can be downloaded from the [HuggingFace](https://huggingface.co/) platform through the *transformers* library.

In many cases it is enough to call the *pipeline* function with the name of the model to download. Often it is even enough to name the task we want to perform (sentiment analysis, entity recognition, etc.), and the library will download a suitable default model for that task. When *pipeline* is not enough, the HuggingFace site hosts documentation for many models, along with instructions on how to use them.

In [None]:
# Start by installing the transformers library and importing the pipeline function
!pip install transformers
from transformers import pipeline

Collecting transformers
  Downloading transformers-4.34.1-py3-none-any.whl (7.7 MB)
Collecting huggingface-hub<1.0,>=0.16.4 (from transformers)
  Downloading huggingface_hub-0.18.0-py3-none-any.whl (301 kB)
Collecting tokenizers<0.15,>=0.14 (from transformers)
  Downloading tokenizers-0.14.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.8 MB)
Collecting safetensors>=0.3.1 (from transformers)
  Downloading safetensors-0.4.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)

In [None]:
# Import pandas to present some of the outputs
import pandas as pd

### Common tasks and default models

Keep in mind that most models work well for English. To work with Spanish text, it is probably better to state explicitly which model to download; otherwise, naming the task we want to perform is enough.
Let's try different tasks on the same test text:

In [None]:
text = """Dear Amazon, last week I ordered an Optimus Prime action figure
from your online store in Germany. Unfortunately, when I opened the package,
I discovered to my horror that I had been sent an action figure of Megatron
instead! As a lifelong enemy of the Decepticons, I hope you can understand my
dilemma. To resolve the issue, I demand an exchange of Megatron for the
Optimus Prime figure I ordered. Enclosed are copies of my records concerning
this purchase. I expect to hear from you soon. Sincerely, Bumblebee."""

#### Sentiment analysis

In [None]:
# Load a default model that can do sentiment analysis
# While the model loads, note the link printed in the output,
# which tells us which model is actually being used
sentiment_classifier = pipeline('sentiment-analysis')

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.



In [None]:
# Run the model on the text
ans = sentiment_classifier(text)
print(ans)

[{'label': 'NEGATIVE', 'score': 0.9015460014343262}]


We can get a better-formatted result by converting the output into a pandas DataFrame:

In [None]:
df = pd.DataFrame(ans)
df

      label     score
0  NEGATIVE  0.901546


#### Entity recognition

This task recognizes places, people, and organizations and classifies them into those categories:

In [None]:
# ner: named entity recognition
entities_recognition = pipeline('ner', aggregation_strategy = 'average')

No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english and revision f2482bf (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english).
Using a pipeline without specifying a model name and revision in production is not recommended.



Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.weight', 'bert.pooler.dense.bias']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).



In [None]:
# Entities in the text (look at the categories it returns)
ans_ner = entities_recognition(text)
print(ans_ner)

[{'entity_group': 'ORG', 'score': 0.8790103, 'word': 'Amazon', 'start': 5, 'end': 11}, {'entity_group': 'MISC', 'score': 0.9900365, 'word': 'Optimus Prime', 'start': 36, 'end': 49}, {'entity_group': 'LOC', 'score': 0.9997547, 'word': 'Germany', 'start': 90, 'end': 97}, {'entity_group': 'PER', 'score': 0.4993918, 'word': 'Megatron', 'start': 208, 'end': 216}, {'entity_group': 'ORG', 'score': 0.5012481, 'word': 'Decepticons', 'start': 253, 'end': 264}, {'entity_group': 'MISC', 'score': 0.77536213, 'word': 'Megatron', 'start': 350, 'end': 358}, {'entity_group': 'MISC', 'score': 0.9866426, 'word': 'Optimus Prime', 'start': 367, 'end': 380}, {'entity_group': 'PER', 'score': 0.8120962, 'word': 'Bumblebee', 'start': 502, 'end': 511}]


In [None]:
df = pd.DataFrame(ans_ner)
df

  entity_group     score           word  start  end
0          ORG  0.879010         Amazon      5   11
1         MISC  0.990036  Optimus Prime     36   49
2          LOC  0.999755        Germany     90   97
3          PER  0.499392       Megatron    208  216
4          ORG  0.501248    Decepticons    253  264
5         MISC  0.775362       Megatron    350  358
6         MISC  0.986643  Optimus Prime    367  380
7          PER  0.812096      Bumblebee    502  511
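
The scores above vary widely: `Megatron` as `PER` and `Decepticons` as `ORG` sit close to 0.5. Below is a minimal sketch of one way to post-process the result. The helper `confident_entities` and its cutoff `min_score = 0.7` are hypothetical choices made for illustration, not part of the library. The entities are copied from the output above (scores rounded), so the snippet runs without downloading a model. It also shows that `start` and `end` are character offsets into the input text.

```python
# First two lines of the test text, enough to cover the entities kept below
text = """Dear Amazon, last week I ordered an Optimus Prime action figure
from your online store in Germany."""

# A subset of the pipeline output shown above, with rounded scores
ans_ner = [
    {"entity_group": "ORG", "score": 0.879010, "word": "Amazon", "start": 5, "end": 11},
    {"entity_group": "MISC", "score": 0.990036, "word": "Optimus Prime", "start": 36, "end": 49},
    {"entity_group": "LOC", "score": 0.999755, "word": "Germany", "start": 90, "end": 97},
    {"entity_group": "PER", "score": 0.499392, "word": "Megatron", "start": 208, "end": 216},
]


def confident_entities(entities, min_score=0.7):
    """Keep only the entities whose score reaches min_score (hypothetical cutoff)."""
    return [e for e in entities if e["score"] >= min_score]


kept = confident_entities(ans_ner)
print([e["word"] for e in kept])

# start/end are character offsets, so slicing the text recovers each entity
for e in kept:
    print(e["word"], "->", repr(text[e["start"]:e["end"]]))
```

The right cutoff depends on the model and on how costly a wrong entity is, so 0.7 here is only a starting point.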


#### Text summarization
We can also get a summary of the text:

In [None]:
# Create a model that summarizes the text
summarizer = pipeline('summarization', min_length = 5, max_length = 50)

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.


In [None]:
# Summarized text
ans_sum = summarizer(text)
print(ans_sum)

[{'summary_text': ' Bumblebee ordered an Optimus Prime action figure from your online store in Germany . Unfortunately, when I opened the package, I discovered to my horror that I had been sent an action figure of Megatron instead .'}]


#### Question answering

We can pose a question and have the model answer it!

In [None]:
# Default model for question answering
reader = pipeline("question-answering")

No model was supplied, defaulted to distilbert-base-cased-distilled-squad and revision 626af31 (https://huggingface.co/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.



In [None]:
# Define the question
question = "What does the customer want?"

# Answer returned by the model
ans = reader(question=question, context=text)
print(ans)

{'score': 0.631291925907135, 'start': 335, 'end': 358, 'answer': 'an exchange of Megatron'}
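
As with entity recognition, `start` and `end` are character offsets into `context`. The sketch below reuses the result printed above instead of calling the model, recovers the answer from the offsets, and prints the line of the text that contains it. The helper `surrounding_line` is hypothetical and only serves this illustration.

```python
text = """Dear Amazon, last week I ordered an Optimus Prime action figure
from your online store in Germany. Unfortunately, when I opened the package,
I discovered to my horror that I had been sent an action figure of Megatron
instead! As a lifelong enemy of the Decepticons, I hope you can understand my
dilemma. To resolve the issue, I demand an exchange of Megatron for the
Optimus Prime figure I ordered. Enclosed are copies of my records concerning
this purchase. I expect to hear from you soon. Sincerely, Bumblebee."""

# Result copied from the output above
ans = {"score": 0.631291925907135, "start": 335, "end": 358,
       "answer": "an exchange of Megatron"}


def surrounding_line(context, start, end):
    """Return the line(s) of context that contain the span [start, end)."""
    line_start = context.rfind("\n", 0, start) + 1
    line_end = context.find("\n", end)
    if line_end == -1:
        line_end = len(context)
    return context[line_start:line_end]


# Slicing the context with the offsets gives back the answer string
print(text[ans["start"]:ans["end"]])

# The surrounding line shows what the extracted span leaves out
print(surrounding_line(text, ans["start"], ans["end"]))
```

Looking at the surrounding text is a cheap way to check whether the extracted span cuts the answer short.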


#### Classification into predefined labels

Here we can classify a text into predefined labels, such as labels tied to a given topic, for example "economy", "politics", "sports", etc. This task is called "zero-shot classification".

In [None]:
zero_shot_classifier = pipeline('zero-shot-classification')

No model was supplied, defaulted to facebook/bart-large-mnli and revision c626438 (https://huggingface.co/facebook/bart-large-mnli).
Using a pipeline without specifying a model name and revision in production is not recommended.



In [None]:
text = """ Trump asks Supreme Court to intervene in classifed documents case.
Donald Trump’s filing came after an appeals court granted the Justice Department’s
request to keep about 100 classified documents seized from the
former president’s Florida residence separate from an independent review."""

In [None]:
zero_shot_classifier(text, candidate_labels = ['politics', 'justice', 'sports'])

{'sequence': ' Trump asks Supreme Court to intervene in classifed documents case.\nDonald Trump’s filing came after an appeals court granted the Justice Department’s\nrequest to keep about 100 classified documents seized from the\nformer president’s Florida residence separate from an independent review.',
 'labels': ['politics', 'justice', 'sports'],
 'scores': [0.7066016793251038, 0.26491713523864746, 0.028481263667345047]}
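
The result is a dictionary in which `labels` and `scores` are parallel lists. A short sketch, reusing the values printed above, that pairs each label with its score and picks the best one. In this default mode the scores are normalized across the candidate labels, so they add up to roughly 1.

```python
# Labels and scores copied from the output above
result = {
    "labels": ["politics", "justice", "sports"],
    "scores": [0.7066016793251038, 0.26491713523864746, 0.028481263667345047],
}

# Pair each label with its score
label_scores = dict(zip(result["labels"], result["scores"]))

# Pick the label with the highest score
best_label = max(label_scores, key=label_scores.get)
print(best_label, round(label_scores[best_label], 3))

# The scores are normalized across the candidate labels
print(round(sum(result["scores"]), 3))
```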

With multi_label = True we tell the algorithm that the labels are not mutually exclusive, so more than one of them may fit a text well. Each label is then scored independently, which is why the scores no longer need to add up to 1.

In [None]:
zero_shot_classifier(text, candidate_labels = ['politics', 'green', 'yellow'], multi_label = True)

{'sequence': ' Trump asks Supreme Court to intervene in classifed documents case.\nDonald Trump’s filing came after an appeals court granted the Justice Department’s\nrequest to keep about 100 classified documents seized from the\nformer president’s Florida residence separate from an independent review.',
 'labels': ['politics', 'green', 'yellow'],
 'scores': [0.8853618502616882, 0.3144931495189667, 0.1560424417257309]}
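
To make the difference between the two modes concrete, here is a minimal sketch of how scores can be derived in each case, as far as we understand the pipeline's post-processing. Each candidate label is turned into a hypothesis and scored by a natural language inference model. The logits below are hypothetical numbers chosen for illustration, not values produced by the model.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    top = max(values)
    exps = [math.exp(v - top) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical (contradiction, entailment) logits for each candidate label
logits = {
    "politics": (-2.0, 3.0),
    "justice": (-1.0, 2.0),
    "sports": (2.5, -1.5),
}
labels = list(logits)

# Single-label mode: softmax over the entailment logits across all labels,
# so the labels compete and the scores add up to 1
single = softmax([logits[label][1] for label in labels])

# Multi-label mode: for each label separately, softmax over its own
# (contradiction, entailment) pair and keep the entailment probability
multi = [softmax(list(logits[label]))[1] for label in labels]

for label, s, m in zip(labels, single, multi):
    print(f"{label:10s} single={s:.3f} multi={m:.3f}")
```

In the multi-label case two labels can both score close to 1, which is not possible when the labels compete for a single probability mass.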

### Other models

Searching the HuggingFace site, we can find models already trained on a specific task that interests us. In particular, we can look for models specific to Spanish.

For example, the *pysentimiento* library that we saw in the sentiment analysis class (in Spanish) is also available on HuggingFace, and we can call it with *pipeline* by stating explicitly that we want this model:

In [None]:
# For example, load the pysentimiento sentiment classifier
sentiment = pipeline(model = "pysentimiento/robertuito-sentiment-analysis", task = 'sentiment-analysis')


In [None]:
# Output of the pysentimiento sentiment classifier
# (the input is colloquial Spanish, roughly "You nailed it, congratulations")
sentiment("Rompiste todo, te felicito")

[{'label': 'POS', 'score': 0.9326732754707336}]

##### Entity recognition in Spanish:


In [None]:
# Spanish-specific NER model
ner_spanish = pipeline(model = 'mrm8488/bert-spanish-cased-finetuned-ner', aggregation_strategy = 'average')


Some weights of the model checkpoint at mrm8488/bert-spanish-cased-finetuned-ner were not used when initializing BertForTokenClassification: ['bert.pooler.dense.weight', 'bert.pooler.dense.bias']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).



In [None]:
text = """Coldplay confirma sus diez shows en la Argentina.
La banda comunicó la postergación de sus shows en Brasil,
en la ciudades de Río de Janeiro y San Pablo,
a través de sus redes sociales y de su sitio oficial;
el cantante de la banda Chris Martin deberá guardar estricto reposo
tras recibir el diagnóstico de que padece una afección pulmonar grave."""

In [None]:
# Output of the Spanish entity recognizer
ans_ner = ner_spanish(text)

df = pd.DataFrame(ans_ner)
print(df)

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


  entity_group     score            word  start  end
0          ORG  0.572822        Coldplay      0    8
1          LOC  0.999662       Argentina     39   48
2          LOC  0.999721          Brasil    100  106
3          LOC  0.999094  Río de Janeiro    126  140
4          LOC  0.999338       San Pablo    143  152
5          PER  0.999473    Chris Martin    232  244
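
A common next step is to collect the recognized entities by type. The sketch below does this with the standard library, using the words and groups from the table above (scores and offsets omitted for brevity) so that it runs without loading the model.

```python
from collections import defaultdict

# Entities copied from the table above (group and word only)
ans_ner = [
    {"entity_group": "ORG", "word": "Coldplay"},
    {"entity_group": "LOC", "word": "Argentina"},
    {"entity_group": "LOC", "word": "Brasil"},
    {"entity_group": "LOC", "word": "Río de Janeiro"},
    {"entity_group": "LOC", "word": "San Pablo"},
    {"entity_group": "PER", "word": "Chris Martin"},
]

# Map each entity type to the list of words found for it, in text order
by_type = defaultdict(list)
for entity in ans_ner:
    by_type[entity["entity_group"]].append(entity["word"])

for group, words in by_type.items():
    print(group, words)
```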


##### Topic classification in Spanish with predefined labels:

In [None]:
zero_shot_spanish = pipeline("zero-shot-classification",
                             model="Recognai/bert-base-spanish-wwm-cased-xnli")


In [None]:
# Output of the label classifier
zero_shot_spanish(text, candidate_labels = ['música', 'deporte', 'política'])

{'sequence': 'Coldplay confirma sus diez shows en la Argentina.\nLa banda comunicó la postergación de sus shows en Brasil,\nen la ciudades de Río de Janeiro y San Pablo,\na través de sus redes sociales y de su sitio oficial;\nel cantante de la banda Chris Martin deberá guardar estricto reposo\ntras recibir el diagnóstico de que padece una afección pulmonar grave.',
 'labels': ['música', 'deporte', 'política'],
 'scores': [0.5459451079368591, 0.24547189474105835, 0.20858292281627655]}