
Zero-shot learning en un problema de clasificación
==================================================

Introducción
------------

Los grandes modelos de lenguaje son capaces de resolver problemas de clasificación al utilizar determinadas estructuras del idioma.

### Para ejecutar este notebook

Para ejecutar este notebook, instale las siguientes librerias:

In [2]:
import locale
locale.getpreferredencoding = lambda: "UTF-8"

In [4]:
!wget https://raw.githubusercontent.com/santiagxf/M72109/master/NLP/Datasets/mascorpus/tweets_marketing.csv \
    --quiet --no-clobber --directory-prefix ./Datasets/mascorpus/

!wget https://raw.githubusercontent.com/santiagxf/M72109/master/NLP/Utils/TextDataset.py \
    --quiet --no-clobber --directory-prefix ./Utils/
    
!pip install transformers huggingface_hub sentencepiece setfit

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting setfit
  Downloading setfit-0.7.0-py3-none-any.whl (45 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m45.9/45.9 kB[0m [31m2.6 MB/s[0m eta [36m0:00:00[0m
Collecting datasets>=2.3.0
  Downloading datasets-2.11.0-py3-none-any.whl (468 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m468.7/468.7 kB[0m [31m12.1 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting evaluate>=0.3.0
  Downloading evaluate-0.4.0-py3-none-any.whl (81 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m81.4/81.4 kB[0m [31m9.7 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting sentence-transformers>=2.2.1
  Downloading sentence-transformers-2.2.2.tar.gz (85 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m86.0/86.0 kB[0m [31m9.4 MB/s[0m eta [36m0:00:00[0m
[?25h  Preparing metadata (setup.py) ... [?25l[?25hdone
Collecting aiohttp

In [1]:
import warnings
warnings.filterwarnings('ignore')

Cargamos el set de datos

In [35]:
import pandas as pd

tweets = pd.read_csv('Datasets/mascorpus/tweets_marketing.csv')

In [36]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(tweets['TEXTO'], tweets['SECTOR'], 
                                                    test_size=0.33, 
                                                    stratify=tweets['SECTOR'])

### Verificando el hardware disponible

In [10]:
import torch
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')

print("Este notebook se está ejecutando en", device)

Este notebook se está ejecutando en cuda


## Creando un modelo de clasificación utilizando zero-shot learning

Trataremos de resolver entonces el mismo problema de clasificación con el que veniamos trabajando: clasificar los tweets dependiendo del sector al que pertenecen.Recordemos que tenemos 7 categorias distintas:

In [11]:
labels = tweets['SECTOR'].unique().tolist()
labels

['RETAIL',
 'TELCO',
 'ALIMENTACION',
 'AUTOMOCION',
 'BANCA',
 'BEBIDAS',
 'DEPORTES']

In [12]:
from transformers import pipeline

In [None]:
classifier = pipeline(task="zero-shot-classification", model="facebook/bart-large-mnli", device=0)

Tomemos un tweet del conjunto de datos como ejemplo:

In [None]:
example = tweets.iloc[2330]
print(example["TEXTO"], "\n", example["SECTOR"])

Urinarios en el Banco Sabadell? https://t.co/yCWx4exUpt 
 BANCA


In [None]:
sequence = example["TEXTO"]
candidate_labels = labels

In [None]:
classifier(sequence, candidate_labels)

{'sequence': 'Urinarios en el Banco Sabadell? https://t.co/yCWx4exUpt',
 'labels': ['BEBIDAS',
  'BANCA',
  'DEPORTES',
  'ALIMENTACION',
  'AUTOMOCION',
  'RETAIL',
  'TELCO'],
 'scores': [0.5823943614959717,
  0.15687517821788788,
  0.0855400487780571,
  0.08173932880163193,
  0.04727163910865784,
  0.03416234999895096,
  0.012017052620649338]}

### Mejorando la plantilla para nuestro conjunto de datos

In [154]:
hypothesis_template = "Este tweet trata de {}."

In [None]:
classifier(sequence, candidate_labels, hypothesis_template=hypothesis_template)

{'sequence': 'Urinarios en el Banco Sabadell? https://t.co/yCWx4exUpt',
 'labels': ['BEBIDAS',
  'BANCA',
  'ALIMENTACION',
  'AUTOMOCION',
  'RETAIL',
  'DEPORTES',
  'TELCO'],
 'scores': [0.5933619737625122,
  0.11397405713796616,
  0.10416935384273529,
  0.06070404127240181,
  0.05527542904019356,
  0.054225485771894455,
  0.01828962005674839]}

In [None]:
predictions = classifier(X_test.tolist(), candidate_labels, hypothesis_template=hypothesis_template, batch_size=100)

In [None]:
predictions_label = [pred["labels"][0] for pred in predictions]

In [None]:
from sklearn.metrics import classification_report

print(classification_report(y_test, predictions_label))

              precision    recall  f1-score   support

ALIMENTACION       0.12      0.40      0.18       110
  AUTOMOCION       0.45      0.85      0.59       148
       BANCA       0.73      0.28      0.40       198
     BEBIDAS       0.19      0.17      0.18       223
    DEPORTES       0.12      0.01      0.02       216
      RETAIL       0.46      0.48      0.47       268
       TELCO       0.33      0.01      0.02        79

    accuracy                           0.32      1242
   macro avg       0.34      0.32      0.27      1242
weighted avg       0.36      0.32      0.29      1242



### Mejorando las etiquetas

In [150]:
label_mapping = {
    "alimentos": "ALIMENTACION",
    "automobiles": "AUTOMOCION",
    "bancos": "BANCA",
    "bebidas": "BEBDIDAS",
    "deportes": "DEPORTES",
    "supermercados": "RETAIL",
    "telefonía": "TELCO"
}

In [153]:
candidate_labels = list(label_mapping.keys())
candidate_labels

['alimentos',
 'automobiles',
 'bancos',
 'bebidas',
 'deportes',
 'supermercados',
 'telefonía']

In [None]:
predictions = classifier(X_test.tolist(), candidate_labels, hypothesis_template=hypothesis_template, batch_size=100)

In [None]:
predictions_label = [label_mapping[pred["labels"][0]] for pred in predictions]

In [None]:
from sklearn.metrics import classification_report

print(classification_report(y_test, predictions_label))

              precision    recall  f1-score   support

ALIMENTACION       0.25      0.49      0.34       110
  AUTOMOCION       0.49      0.95      0.64       148
       BANCA       0.94      0.87      0.91       198
    BEBDIDAS       0.00      0.00      0.00         0
     BEBIDAS       0.00      0.00      0.00       223
    DEPORTES       0.43      0.37      0.40       216
      RETAIL       0.82      0.12      0.21       268
       TELCO       0.43      0.70      0.53        79

    accuracy                           0.43      1242
   macro avg       0.42      0.44      0.38      1242
weighted avg       0.51      0.43      0.40      1242



## Few-shot learning

In [1]:
import torch
from transformers import AutoTokenizer, XGLMForCausalLM

In [5]:
tokenizer = AutoTokenizer.from_pretrained("facebook/xglm-564M")
model = XGLMForCausalLM.from_pretrained("facebook/xglm-564M")

In [None]:
# Example task: predict whether a sentence implies a cause or an effect
prompt = "A new study has found that "
effect = "mice that were fed a high-fat diet gained more weight."
cause = "the mice that were fed a high-fat diet were more active."

# Combine the prompt and the examples into batches
batch = tokenizer(prompt + effect, prompt + cause, return_tensors='pt')

# Predict the labels for the examples
outputs = model(**batch)
logits = outputs.logits
predicted_labels = logits.argmax(dim=1)

In [None]:
import torch
import torch.nn.functional as F

def get_logprobs(prompt):
    inputs = tokenizer(prompt, return_tensors="pt")
    input_ids, output_ids = inputs["input_ids"], inputs["input_ids"][:, 1:]
    outputs = model(**inputs, labels=input_ids)
    logits = outputs.logits
    logprobs = torch.gather(F.log_softmax(logits, dim=2), 2, output_ids.unsqueeze(2))
    return logprobs

def get_prompt_prob(prompt):
    return get_logprobs(prompt).sum()


def eval(prompt, alternative1, alternative2):
    lprob1 = get_prompt_prob(prompt + " " + alternative1)
    lprob2 = get_prompt_prob(prompt + " " + alternative2)

    print(alternative1 if lprob1 > lprob2 else alternative2)

In [None]:
eval(prompt, effect, cause)

mice that were fed a high-fat diet gained more weight.


## Fine-tune few-shot learning

In [3]:
import torch
from setfit import SetFitModel, SetFitTrainer

This is a sentence-transformers model: It maps sentences & paragraphs to a 768 dimensional dense vector space and can be used for tasks like clustering or semantic search.

In [124]:
model = SetFitModel.from_pretrained("sentence-transformers/paraphrase-mpnet-base-v2", num_labels=7)

model_head.pkl not found on HuggingFace Hub, initialising classification head with random weights. You should TRAIN this model on a downstream task to use it for predictions and inference.


In [125]:
tweets = pd.read_csv('Datasets/mascorpus/tweets_marketing.csv')
tweets = tweets[["TEXTO", "SECTOR"]].rename(columns={"TEXTO": "text", "SECTOR": "label"})

In [140]:
tweets['label'] = tweets['label'].astype("category")

In [141]:
candidate_labels = list(tweets['label'].cat.categories)
tweets['label'] = tweets['label'].values.codes

In [159]:
ds = Dataset.from_pandas(tweets, features=features, preserve_index=False).train_test_split(test_size=0.33, stratify_by_column="label")

In [194]:
train_ds = ds[Split.TRAIN].shuffle(seed=42)
test_ds = ds[Split.TEST]

In [201]:
from setfit import get_templated_dataset

examples_ds = get_templated_dataset(candidate_labels=candidate_labels, template=hypothesis_template, sample_size=8)

In [157]:
from datasets import Dataset, Split, Features, Value, ClassLabel

In [158]:
features = Features(text=Value("string"), label=ClassLabel(names=candidate_labels))

In [161]:
trainer = SetFitTrainer(
    model=model,
    train_dataset=examples_ds,
    eval_dataset=test_ds,
    loss_class=CosineSimilarityLoss,
    batch_size=16,
    num_iterations=20, # Number of text pairs to generate for contrastive learning
    num_epochs=1 # Number of epochs to use for contrastive learning
)

In [162]:
trainer.train()
metrics = trainer.evaluate()

Generating Training Pairs:   0%|          | 0/20 [00:00<?, ?it/s]

***** Running training *****
  Num examples = 2240
  Num epochs = 1
  Total optimization steps = 140
  Total train batch size = 16


Epoch:   0%|          | 0/1 [00:00<?, ?it/s]

Iteration:   0%|          | 0/140 [00:00<?, ?it/s]

***** Running evaluation *****


In [163]:
metrics

{'accuracy': 0.538647342995169}

In [169]:
trainer.model.save_pretrained('finetuned4')

In [170]:
from transformers import pipeline

In [176]:
id2label = { key:value for key, value in enumerate(candidate_labels) }

In [180]:
pipe = pipeline(model='finetuned4', task="text-classification", model_kwargs={ "id2label": id2label })

Some weights of the model checkpoint at finetuned4 were not used when initializing MPNetForSequenceClassification: ['pooler.dense.weight', 'pooler.dense.bias']
- This IS expected if you are initializing MPNetForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing MPNetForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of MPNetForSequenceClassification were not initialized from the model checkpoint at finetuned4 and are newly initialized: ['classifier.out_proj.bias', 'classifier.out_proj.weight', 'classifier.dense.weight', 'classifier.dense.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and

In [183]:
predictions = [label_mapping[pred['label']] for pred in pipe.predict(test_ds['text'])]

In [185]:
from sklearn.metrics import classification_report

print(classification_report(y_test, predictions))

              precision    recall  f1-score   support

ALIMENTACION       0.07      0.03      0.04       110
  AUTOMOCION       0.13      0.29      0.18       148
       BANCA       0.25      0.02      0.04       198
    BEBDIDAS       0.00      0.00      0.00         0
     BEBIDAS       0.00      0.00      0.00       223
    DEPORTES       0.21      0.15      0.17       216
      RETAIL       0.20      0.25      0.22       268
       TELCO       0.08      0.22      0.11        79

    accuracy                           0.13      1242
   macro avg       0.12      0.12      0.10      1242
weighted avg       0.15      0.13      0.12      1242

