
Zero-shot learning en un problema de clasificación
==================================================

Introducción
------------

Los grandes modelos de lenguaje son capaces de resolver problemas de clasificación al utilizar determinadas estructuras del idioma.

### Para ejecutar este notebook

Para ejecutar este notebook, instale las siguientes librerias:

In [1]:
import locale
locale.getpreferredencoding = lambda: "UTF-8"

In [2]:
!wget https://raw.githubusercontent.com/santiagxf/M72109/master/NLP/Datasets/mascorpus/tweets_marketing.csv \
    --quiet --no-clobber --directory-prefix ./Datasets/mascorpus/

!wget https://raw.githubusercontent.com/santiagxf/M72109/master/NLP/Utils/TextDataset.py \
    --quiet --no-clobber --directory-prefix ./Utils/
    
!pip install transformers huggingface_hub sentencepiece setfit

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.28.1-py3-none-any.whl (7.0 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m7.0/7.0 MB[0m [31m22.7 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting huggingface_hub
  Downloading huggingface_hub-0.14.1-py3-none-any.whl (224 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m224.5/224.5 kB[0m [31m10.1 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting sentencepiece
  Downloading sentencepiece-0.1.98-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.3/1.3 MB[0m [31m7.2 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting setfit
  Downloading setfit-0.7.0-py3-none-any.whl (45 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m45.9/45.9 kB[0m [31m3.1 MB/s[0m eta [36m0:00:00[0m
Collecting tokenizers!=0.11.3,<0.

In [1]:
import warnings
warnings.filterwarnings('ignore')

In [2]:
import torch

Cargamos el set de datos

In [3]:
import pandas as pd

tweets = pd.read_csv('Datasets/mascorpus/tweets_marketing.csv')

In [4]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(tweets['TEXTO'], tweets['SECTOR'], 
                                                    test_size=0.33, 
                                                    stratify=tweets['SECTOR'])

### Verificando el hardware disponible

In [5]:
import torch
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')

print("Este notebook se está ejecutando en", device)

Este notebook se está ejecutando en cuda


## Creando un modelo de clasificación utilizando zero-shot learning

Trataremos de resolver entonces el mismo problema de clasificación con el que veniamos trabajando: clasificar los tweets dependiendo del sector al que pertenecen.Recordemos que tenemos 7 categorias distintas:

In [6]:
labels = tweets['SECTOR'].unique().tolist()
labels

['RETAIL',
 'TELCO',
 'ALIMENTACION',
 'AUTOMOCION',
 'BANCA',
 'BEBIDAS',
 'DEPORTES']

In [7]:
from transformers import pipeline

In [43]:
model_name = "facebook/bart-large-mnli"

In [8]:
classifier = pipeline(task="zero-shot-classification", model=model_name, device=0)

Downloading (…)lve/main/config.json:   0%|          | 0.00/1.15k [00:00<?, ?B/s]

Downloading pytorch_model.bin:   0%|          | 0.00/1.63G [00:00<?, ?B/s]

Downloading (…)okenizer_config.json:   0%|          | 0.00/26.0 [00:00<?, ?B/s]

Downloading (…)olve/main/vocab.json:   0%|          | 0.00/899k [00:00<?, ?B/s]

Downloading (…)olve/main/merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

Downloading (…)/main/tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

Tomemos un tweet del conjunto de datos como ejemplo:

In [19]:
example = tweets.iloc[2131]
print(example["TEXTO"], "\n", example["SECTOR"])

El BBVA debería hacer nuevos comerciales con Claudio Bravo.
Aprovechando que ahora es el rey de la banca.
@alebattocchio 
 BANCA


In [20]:
sequence = example["TEXTO"]
candidate_labels = labels

In [21]:
classifier(sequence, candidate_labels)

{'sequence': 'El BBVA debería hacer nuevos comerciales con Claudio Bravo.\nAprovechando que ahora es el rey de la banca.\n@alebattocchio',
 'labels': ['BANCA',
  'BEBIDAS',
  'ALIMENTACION',
  'AUTOMOCION',
  'DEPORTES',
  'RETAIL',
  'TELCO'],
 'scores': [0.48729321360588074,
  0.34352853894233704,
  0.05284114554524422,
  0.04914722219109535,
  0.03771361708641052,
  0.01828593574464321,
  0.011190303601324558]}

Vemos que en este caso es capaz de predecir la etiqueta correcta. ¿Sucede con todos los casos? ¿Puede verificar como se comporta este model en otras situaciones?

### Mejorando la plantilla para nuestro conjunto de datos

In [26]:
hypothesis_template = "Este tweet se refiere a {}."

In [27]:
classifier(sequence, candidate_labels, hypothesis_template=hypothesis_template)

{'sequence': 'El BBVA debería hacer nuevos comerciales con Claudio Bravo.\nAprovechando que ahora es el rey de la banca.\n@alebattocchio',
 'labels': ['BANCA',
  'BEBIDAS',
  'ALIMENTACION',
  'AUTOMOCION',
  'RETAIL',
  'DEPORTES',
  'TELCO'],
 'scores': [0.5197097659111023,
  0.17679853737354279,
  0.0929044634103775,
  0.06869390606880188,
  0.0685601457953453,
  0.04455644264817238,
  0.028776705265045166]}

Vemos que cambiar el prompt que utilizamos aumentó la probabilidad de la etiqueta correcta. Sin embargo, ¿afecta este cambio la performance del clasificador en general? 

Verifiquemos la performance de este clasificador en el conjunto de evaluación.

In [28]:
predictions = classifier(X_test.tolist(), candidate_labels, hypothesis_template=hypothesis_template, batch_size=100)

In [29]:
predictions_label = [pred["labels"][0] for pred in predictions]

In [30]:
from sklearn.metrics import classification_report

print(classification_report(y_test, predictions_label))

              precision    recall  f1-score   support

ALIMENTACION       0.08      0.21      0.12       110
  AUTOMOCION       0.45      0.84      0.58       148
       BANCA       0.72      0.25      0.37       198
     BEBIDAS       0.20      0.23      0.22       223
    DEPORTES       0.08      0.01      0.02       216
      RETAIL       0.38      0.47      0.42       268
       TELCO       0.13      0.03      0.04        79

    accuracy                           0.30      1242
   macro avg       0.29      0.29      0.25      1242
weighted avg       0.32      0.30      0.27      1242



### Mejorando las etiquetas

Es válido pensar que la etiqueta "ALIMENTACIÓN" es un tanto extraña dentro de una frase. El modelo de lenguaje que estamos utilizando es un Masked Language Model. Esto significa que el mismo predice la probabilidad de la palabra que representa la etiqueta dentro del texto. Esto tiene dos supociones importantes:

1. Que la etiqueta es parte del vocabulario del modelo.
2. Que la etiqueta puede utilizarce exitosamente en la posición en donde estamos ubicando la palabra a predecir.

Veamos como se comporta el modelo si cambiamos la etiquetas por otras palabras un poco más representativas:

In [31]:
label_mapping = {
    "alimentos": "ALIMENTACION",
    "automobiles": "AUTOMOCION",
    "bancos": "BANCA",
    "bebidas": "BEBDIDAS",
    "deportes": "DEPORTES",
    "supermercados": "RETAIL",
    "telefonía": "TELCO"
}

In [32]:
candidate_labels = list(label_mapping.keys())
candidate_labels

['alimentos',
 'automobiles',
 'bancos',
 'bebidas',
 'deportes',
 'supermercados',
 'telefonía']

In [33]:
predictions = classifier(X_test.tolist(), candidate_labels, hypothesis_template=hypothesis_template, batch_size=100)

In [34]:
predictions_label = [label_mapping[pred["labels"][0]] for pred in predictions]

In [35]:
from sklearn.metrics import classification_report

print(classification_report(y_test, predictions_label))

              precision    recall  f1-score   support

ALIMENTACION       0.22      0.54      0.31       110
  AUTOMOCION       0.54      0.95      0.69       148
       BANCA       0.87      0.88      0.87       198
    BEBDIDAS       0.00      0.00      0.00         0
     BEBIDAS       0.00      0.00      0.00       223
    DEPORTES       0.42      0.18      0.25       216
      RETAIL       0.97      0.13      0.22       268
       TELCO       0.43      0.77      0.55        79

    accuracy                           0.41      1242
   macro avg       0.43      0.43      0.36      1242
weighted avg       0.53      0.41      0.38      1242



### Buscando la etiquetas automáticamente

In [36]:
!git clone https://github.com/ucinlp/autoprompt

Cloning into 'autoprompt'...
remote: Enumerating objects: 4642, done.[K
remote: Counting objects: 100% (49/49), done.[K
remote: Compressing objects: 100% (44/44), done.[K
remote: Total 4642 (delta 7), reused 9 (delta 5), pack-reused 4593[K
Receiving objects: 100% (4642/4642), 78.06 MiB | 16.42 MiB/s, done.
Resolving deltas: 100% (3451/3451), done.


In [None]:
%pip install spacy termcolor colorama matplotlib

In [None]:
!python -m spacy download en

In [46]:
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name, return_dict=True, output_hidden_states=True)

Some weights of the model checkpoint at facebook/bart-large-mnli were not used when initializing BartForConditionalGeneration: ['classification_head.dense.weight', 'classification_head.out_proj.bias', 'classification_head.out_proj.weight', 'classification_head.dense.bias']
- This IS expected if you are initializing BartForConditionalGeneration from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BartForConditionalGeneration from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [70]:
class ForwardPassWrapper():
    """
    This object stores the intermediate gradients of the output a the given PyTorch module, which
    otherwise might not be retained.
    """
    def __init__(self, module):
        self._output = None
        module.register_forward_hook(self.hook)

    def hook(self, module, input, output):
        self._output = output

    def get_output(self):
        return self._output

In [71]:
encoder_embedding_output = ForwardPassWrapper(model.model.encoder.layernorm_embedding)

In [64]:
decoder_embeddings_weights = model.lm_head.weight

In [66]:
label2id = { label: idx for idx, label in enumerate(labels) }
id2label = { value: key for key, value in label2id.items() }

In [69]:
projection = torch.nn.Linear(model.config.hidden_size, len(label2id))
projection.to(model.device)

Linear(in_features=1024, out_features=7, bias=True)

## Few-shot learning

In [None]:
import torch
from transformers import AutoTokenizer, XGLMForCausalLM

In [None]:
tokenizer = AutoTokenizer.from_pretrained("facebook/xglm-564M")
model = XGLMForCausalLM.from_pretrained("facebook/xglm-564M")

In [None]:
# Example task: predict whether a sentence implies a cause or an effect
prompt = "A new study has found that "
effect = "mice that were fed a high-fat diet gained more weight."
cause = "the mice that were fed a high-fat diet were more active."

# Combine the prompt and the examples into batches
batch = tokenizer(prompt + effect, prompt + cause, return_tensors='pt')

# Predict the labels for the examples
outputs = model(**batch)
logits = outputs.logits
predicted_labels = logits.argmax(dim=1)

In [None]:
import torch
import torch.nn.functional as F

def get_logprobs(prompt):
    inputs = tokenizer(prompt, return_tensors="pt")
    input_ids, output_ids = inputs["input_ids"], inputs["input_ids"][:, 1:]
    outputs = model(**inputs, labels=input_ids)
    logits = outputs.logits
    logprobs = torch.gather(F.log_softmax(logits, dim=2), 2, output_ids.unsqueeze(2))
    return logprobs

def get_prompt_prob(prompt):
    return get_logprobs(prompt).sum()


def eval(prompt, alternative1, alternative2):
    lprob1 = get_prompt_prob(prompt + " " + alternative1)
    lprob2 = get_prompt_prob(prompt + " " + alternative2)

    print(alternative1 if lprob1 > lprob2 else alternative2)

In [None]:
eval(prompt, effect, cause)

mice that were fed a high-fat diet gained more weight.


## Fine-tune few-shot learning

In [None]:
import torch
from setfit import SetFitModel, SetFitTrainer

This is a sentence-transformers model: It maps sentences & paragraphs to a 768 dimensional dense vector space and can be used for tasks like clustering or semantic search.

In [None]:
model = SetFitModel.from_pretrained("sentence-transformers/paraphrase-mpnet-base-v2", num_labels=7)

model_head.pkl not found on HuggingFace Hub, initialising classification head with random weights. You should TRAIN this model on a downstream task to use it for predictions and inference.


In [None]:
tweets = pd.read_csv('Datasets/mascorpus/tweets_marketing.csv')
tweets = tweets[["TEXTO", "SECTOR"]].rename(columns={"TEXTO": "text", "SECTOR": "label"})

In [None]:
tweets['label'] = tweets['label'].astype("category")

In [None]:
candidate_labels = list(tweets['label'].cat.categories)
tweets['label'] = tweets['label'].values.codes

In [None]:
ds = Dataset.from_pandas(tweets, features=features, preserve_index=False).train_test_split(test_size=0.33, stratify_by_column="label")

In [None]:
train_ds = ds[Split.TRAIN].shuffle(seed=42)
test_ds = ds[Split.TEST]

In [None]:
from setfit import get_templated_dataset

examples_ds = get_templated_dataset(candidate_labels=candidate_labels, template=hypothesis_template, sample_size=8)

In [None]:
from datasets import Dataset, Split, Features, Value, ClassLabel

In [None]:
features = Features(text=Value("string"), label=ClassLabel(names=candidate_labels))

In [None]:
trainer = SetFitTrainer(
    model=model,
    train_dataset=examples_ds,
    eval_dataset=test_ds,
    loss_class=CosineSimilarityLoss,
    batch_size=16,
    num_iterations=20, # Number of text pairs to generate for contrastive learning
    num_epochs=1 # Number of epochs to use for contrastive learning
)

In [None]:
trainer.train()
metrics = trainer.evaluate()

Generating Training Pairs:   0%|          | 0/20 [00:00<?, ?it/s]

***** Running training *****
  Num examples = 2240
  Num epochs = 1
  Total optimization steps = 140
  Total train batch size = 16


Epoch:   0%|          | 0/1 [00:00<?, ?it/s]

Iteration:   0%|          | 0/140 [00:00<?, ?it/s]

***** Running evaluation *****


In [None]:
metrics

{'accuracy': 0.538647342995169}

In [None]:
trainer.model.save_pretrained('finetuned4')

In [None]:
from transformers import pipeline

In [None]:
id2label = { key:value for key, value in enumerate(candidate_labels) }

In [None]:
pipe = pipeline(model='finetuned4', task="text-classification", model_kwargs={ "id2label": id2label })

Some weights of the model checkpoint at finetuned4 were not used when initializing MPNetForSequenceClassification: ['pooler.dense.weight', 'pooler.dense.bias']
- This IS expected if you are initializing MPNetForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing MPNetForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of MPNetForSequenceClassification were not initialized from the model checkpoint at finetuned4 and are newly initialized: ['classifier.out_proj.bias', 'classifier.out_proj.weight', 'classifier.dense.weight', 'classifier.dense.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and

In [None]:
predictions = [label_mapping[pred['label']] for pred in pipe.predict(test_ds['text'])]

In [None]:
from sklearn.metrics import classification_report

print(classification_report(y_test, predictions))

              precision    recall  f1-score   support

ALIMENTACION       0.07      0.03      0.04       110
  AUTOMOCION       0.13      0.29      0.18       148
       BANCA       0.25      0.02      0.04       198
    BEBDIDAS       0.00      0.00      0.00         0
     BEBIDAS       0.00      0.00      0.00       223
    DEPORTES       0.21      0.15      0.17       216
      RETAIL       0.20      0.25      0.22       268
       TELCO       0.08      0.22      0.11        79

    accuracy                           0.13      1242
   macro avg       0.12      0.12      0.10      1242
weighted avg       0.15      0.13      0.12      1242

