
Zero-shot learning in a classification problem
==============================================

Introduction
------------

Large language models can solve classification problems by exploiting certain structures of language.

### Running this notebook

To run this notebook, install the following libraries:

In [14]:
import locale
locale.getpreferredencoding = lambda: "UTF-8"

In [None]:
!wget https://raw.githubusercontent.com/santiagxf/M72109/master/NLP/Datasets/mascorpus/tweets_marketing.csv \
    --quiet --no-clobber --directory-prefix ./Datasets/mascorpus/

!wget https://raw.githubusercontent.com/santiagxf/M72109/master/NLP/Utils/TextDataset.py \
    --quiet --no-clobber --directory-prefix ./Utils/
    
!pip install transformers mlflow huggingface_hub sentencepiece

In [2]:
import warnings
warnings.filterwarnings('ignore')

Load the dataset:

In [3]:
import pandas as pd

tweets = pd.read_csv('Datasets/mascorpus/tweets_marketing.csv')

In [4]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(tweets['TEXTO'], tweets['SECTOR'], 
                                                    test_size=0.33, 
                                                    stratify=tweets['SECTOR'])
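
Because the split is stratified on `SECTOR`, each class should appear in roughly the same proportion in both partitions. The helper below is a sketch for checking that; `class_proportions` is a hypothetical name (not part of scikit-learn) and the toy labels stand in for `y_train` and `y_test`.

```python
from collections import Counter

def class_proportions(labels):
    """Return the share of each class in an iterable of labels."""
    labels = list(labels)
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}

# Toy data standing in for y_train / y_test (illustrative only)
toy_train = ["BANCA", "BANCA", "RETAIL", "RETAIL", "TELCO", "TELCO"]
toy_test = ["BANCA", "RETAIL", "TELCO"]

print(class_proportions(toy_train))
print(class_proportions(toy_test))
```

In the notebook, the same function could be applied to `y_train` and `y_test` and the two dictionaries compared side by side.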

### Checking the available hardware

In [5]:
import torch
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')

print("This notebook is running on", device)

This notebook is running on cuda


## Creating a classification model using zero-shot learning

We will try to solve the same classification problem we have been working on: classifying tweets by the sector they belong to. Recall that we have 7 distinct categories:

In [6]:
labels = tweets['SECTOR'].unique().tolist()
labels

['RETAIL',
 'TELCO',
 'ALIMENTACION',
 'AUTOMOCION',
 'BANCA',
 'BEBIDAS',
 'DEPORTES']

In [7]:
from transformers import pipeline

In [8]:
classifier = pipeline(task="zero-shot-classification", model="facebook/bart-large-mnli", device=0)
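
The cell above passes `device=0`, which selects the first GPU and fails on a machine without CUDA. The pipeline's `device` argument also accepts `-1` for CPU, so the index can be derived from the hardware check. A minimal sketch; `pipeline_device` is a hypothetical helper, not part of `transformers`.

```python
def pipeline_device(cuda_available: bool) -> int:
    """Map a CUDA availability flag to the integer the pipeline expects:
    0 for the first GPU, -1 for CPU."""
    return 0 if cuda_available else -1

print(pipeline_device(True))   # 0
print(pipeline_device(False))  # -1
```

In this notebook it could be used as `pipeline(..., device=pipeline_device(torch.cuda.is_available()))`.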

Let's take a tweet from the dataset as an example:

In [28]:
example = tweets.iloc[2330]
print(example["TEXTO"], "\n", example["SECTOR"])

Urinarios en el Banco Sabadell? https://t.co/yCWx4exUpt 
 BANCA


In [29]:
sequence = example["TEXTO"]
candidate_labels = labels

In [30]:
classifier(sequence, candidate_labels)

{'sequence': 'Urinarios en el Banco Sabadell? https://t.co/yCWx4exUpt',
 'labels': ['BEBIDAS',
  'BANCA',
  'DEPORTES',
  'ALIMENTACION',
  'AUTOMOCION',
  'RETAIL',
  'TELCO'],
 'scores': [0.5823943614959717,
  0.15687517821788788,
  0.0855400487780571,
  0.08173932880163193,
  0.04727163910865784,
  0.03416234999895096,
  0.012017052620649338]}
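
The zero-shot pipeline treats the text as an NLI premise and turns each candidate label into a hypothesis using a template, which defaults to the English sentence `This example is {}.`; the labels are then ranked by their entailment scores. The sketch below illustrates only that mechanism. `build_hypotheses` and `rank_labels` are hypothetical helpers, and the entailment logits are made-up numbers, not model output.

```python
import math

def build_hypotheses(labels, template="This example is {}."):
    """Fill the template once per candidate label."""
    return [template.format(label) for label in labels]

def rank_labels(labels, entailment_logits):
    """Softmax the per-label entailment logits and sort labels by score."""
    peak = max(entailment_logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in entailment_logits]
    total = sum(exps)
    scores = [e / total for e in exps]
    return sorted(zip(labels, scores), key=lambda pair: pair[1], reverse=True)

toy_labels = ["BANCA", "BEBIDAS", "TELCO"]
print(build_hypotheses(toy_labels))
# Made-up logits, for illustration only
print(rank_labels(toy_labels, [2.0, 0.5, -1.0]))
```

With a Spanish tweet as the premise, the default template yields an English hypothesis wrapped around an upper-case Spanish label, which may partly explain the weak result above.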

### Improving the template for our dataset

In [31]:
hypothesis_template = "Este tweet trata de {}."

In [32]:
classifier(sequence, candidate_labels, hypothesis_template=hypothesis_template)

{'sequence': 'Urinarios en el Banco Sabadell? https://t.co/yCWx4exUpt',
 'labels': ['BEBIDAS',
  'BANCA',
  'ALIMENTACION',
  'AUTOMOCION',
  'RETAIL',
  'DEPORTES',
  'TELCO'],
 'scores': [0.5933619737625122,
  0.11397405713796616,
  0.10416935384273529,
  0.06070404127240181,
  0.05527542904019356,
  0.054225485771894455,
  0.01828962005674839]}

In [33]:
predictions = classifier(X_test.tolist(), candidate_labels, hypothesis_template=hypothesis_template, batch_size=100)

In [34]:
predictions_label = [pred["labels"][0] for pred in predictions]

In [35]:
from sklearn.metrics import classification_report

print(classification_report(y_test, predictions_label))

              precision    recall  f1-score   support

ALIMENTACION       0.12      0.40      0.18       110
  AUTOMOCION       0.45      0.85      0.59       148
       BANCA       0.73      0.28      0.40       198
     BEBIDAS       0.19      0.17      0.18       223
    DEPORTES       0.12      0.01      0.02       216
      RETAIL       0.46      0.48      0.47       268
       TELCO       0.33      0.01      0.02        79

    accuracy                           0.32      1242
   macro avg       0.34      0.32      0.27      1242
weighted avg       0.36      0.32      0.29      1242
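
The report gives per-class scores but does not show which classes are being confused with each other. One way to inspect that is to count (true, predicted) pairs. A minimal sketch, with `confusion_counts` as a hypothetical helper and toy labels in place of `y_test` and `predictions_label`:

```python
from collections import Counter

def confusion_counts(y_true, y_pred):
    """Count how often each (true label, predicted label) pair occurs."""
    return Counter(zip(y_true, y_pred))

# Toy data (illustrative only)
toy_true = ["BANCA", "BANCA", "DEPORTES", "TELCO"]
toy_pred = ["BANCA", "BEBIDAS", "BEBIDAS", "TELCO"]

for (true_label, predicted), n in confusion_counts(toy_true, toy_pred).most_common():
    print(f"{true_label:>10} -> {predicted:<10} {n}")
```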



### Improving the labels

In [39]:
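label_mapping = {
    "alimentos": "ALIMENTACION",
    "automobiles": "AUTOMOCION",
    "bancos": "BANCA",
    # Must match the dataset label exactly; the report below was produced with
    # this value misspelled as "BEBDIDAS", hence its spurious extra row.
    "bebidas": "BEBIDAS",
    "deportes": "DEPORTES",
    "supermercados": "RETAIL",
    "telefonía": "TELCO"
}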

In [42]:
candidate_labels = list(label_mapping.keys())
candidate_labels

['alimentos',
 'automobiles',
 'bancos',
 'bebidas',
 'deportes',
 'supermercados',
 'telefonía']
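
The mapped values are compared against the dataset labels as plain strings, so a misspelled value silently becomes an extra class in the report. A quick check before predicting can catch this. A sketch, with `unknown_targets` as a hypothetical helper and a toy mapping:

```python
def unknown_targets(mapping, known_labels):
    """Return the mapping values that are not among the known labels."""
    known = set(known_labels)
    return sorted({target for target in mapping.values() if target not in known})

toy_known = ["BANCA", "BEBIDAS"]
# The second value is misspelled on purpose to show what the check reports
toy_mapping = {"bancos": "BANCA", "bebidas": "BEBDIDAS"}

print(unknown_targets(toy_mapping, toy_known))  # ['BEBDIDAS']
```

In the notebook, `unknown_targets(label_mapping, labels)` should return an empty list.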

In [43]:
predictions = classifier(X_test.tolist(), candidate_labels, hypothesis_template=hypothesis_template, batch_size=100)

In [44]:
predictions_label = [label_mapping[pred["labels"][0]] for pred in predictions]

In [45]:
from sklearn.metrics import classification_report

print(classification_report(y_test, predictions_label))

              precision    recall  f1-score   support

ALIMENTACION       0.25      0.49      0.34       110
  AUTOMOCION       0.49      0.95      0.64       148
       BANCA       0.94      0.87      0.91       198
    BEBDIDAS       0.00      0.00      0.00         0
     BEBIDAS       0.00      0.00      0.00       223
    DEPORTES       0.43      0.37      0.40       216
      RETAIL       0.82      0.12      0.21       268
       TELCO       0.43      0.70      0.53        79

    accuracy                           0.43      1242
   macro avg       0.42      0.44      0.38      1242
weighted avg       0.51      0.43      0.40      1242



## Few-shot learning

In [21]:
from transformers import AutoTokenizer, XGLMForCausalLM

In [23]:
tokenizer = AutoTokenizer.from_pretrained("facebook/xglm-564M")
model = XGLMForCausalLM.from_pretrained("facebook/xglm-564M")

In [28]:
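# Example task: compare two candidate continuations of a prompt (an effect and a cause)
prompt = "A new study has found that "
effect = "mice that were fed a high-fat diet gained more weight."
cause = "the mice that were fed a high-fat diet were more active."

# Tokenize both candidates as a padded batch of two sequences
# (passing them as two positional arguments would encode them as one text pair)
batch = tokenizer([prompt + effect, prompt + cause], return_tensors='pt', padding=True)

# Run the model; a causal LM returns next-token logits, not class labels
with torch.no_grad():
    outputs = model(**batch)
logits = outputs.logits  # shape: (batch, sequence_length, vocab_size)
predicted_token_ids = logits.argmax(dim=-1)  # most likely next token at each position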

In [38]:
import torch
import torch.nn.functional as F

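def get_logprobs(prompt):
    inputs = tokenizer(prompt, return_tensors="pt")
    # Targets are the input tokens shifted left by one: position t predicts token t+1
    input_ids, output_ids = inputs["input_ids"], inputs["input_ids"][:, 1:]
    with torch.no_grad():
        outputs = model(**inputs, labels=input_ids)
    logits = outputs.logits
    # Pick the log-probability the model assigned to each actual next token
    logprobs = torch.gather(F.log_softmax(logits, dim=2), 2, output_ids.unsqueeze(2))
    return logprobs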

def get_prompt_prob(prompt):
    return get_logprobs(prompt).sum()


def eval(prompt, alternative1, alternative2):
    lprob1 = get_prompt_prob(prompt + " " + alternative1)
    lprob2 = get_prompt_prob(prompt + " " + alternative2)

    print(alternative1 if lprob1 > lprob2 else alternative2)

In [43]:
eval(prompt, effect, cause)

mice that were fed a high-fat diet gained more weight.
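
The same likelihood comparison could be extended to the tweet task by placing a few labelled examples in the prompt and scoring each candidate label as the continuation. The sketch below is an assumption about how that might look, not something run in this notebook. `build_few_shot_prompt` and `classify_few_shot` are hypothetical helpers, the prompt layout is one choice among many, the example tweets are invented, and `score_fn` is any function that returns a log-probability for a string (for instance `lambda s: get_prompt_prob(s).item()`).

```python
def build_few_shot_prompt(examples, query):
    """Format labelled (text, label) pairs followed by the unlabelled query."""
    blocks = [f"Tweet: {text}\nSector: {label}" for text, label in examples]
    blocks.append(f"Tweet: {query}\nSector:")
    return "\n\n".join(blocks)

def classify_few_shot(query, examples, candidate_labels, score_fn):
    """Return the label whose completion of the prompt scores highest."""
    prompt = build_few_shot_prompt(examples, query)
    return max(candidate_labels, key=lambda label: score_fn(prompt + " " + label))

# Invented examples, for illustration only
toy_examples = [
    ("Nueva cuenta sin comisiones", "bancos"),
    ("Oferta en refrescos", "bebidas"),
]
print(build_few_shot_prompt(toy_examples, "Hipoteca a tipo fijo"))
```

Because summed log-probabilities penalise longer continuations, labels that tokenize into more pieces are at a disadvantage; dividing by the number of label tokens is a common adjustment.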
