# CLICKBAIT RESPONSE MODEL

This is the implementation of the Clickbait Spoiling project for the Textual Data Mining course.

The goal is to explore different techniques covered in class.

## Install and import libraries

In [23]:
%%capture

!pip install tldextract
!pip install simpletransformers
!pip install bert_score

In [24]:
import pandas as pd
import numpy as np
import json
import torch
import tldextract
import re
import warnings

from statistics import mean

from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.preprocessing import LabelEncoder
import sklearn
import logging
import wandb
from simpletransformers.classification import ClassificationArgs, ClassificationModel
from simpletransformers.question_answering import QuestionAnsweringModel

from bert_score import score
from nltk.translate.bleu_score import corpus_bleu
from nltk.translate.bleu_score import sentence_bleu
from nltk.tokenize import word_tokenize

In [25]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


Use the GPU if one is available:

In [26]:
device = "cuda" if torch.cuda.is_available() else "cpu"

## Loading the data

The dataset is available on the following web page: [Clickbait challenge](https://pan.webis.de/semeval23/pan23-web/clickbait-challenge.html)

In [None]:
train_dir = '/content/drive/MyDrive/Colab Notebooks/MDT/PROYECTO/DATA/train.jsonl'
val_dir = '/content/drive/MyDrive/Colab Notebooks/MDT/PROYECTO/DATA/validation.jsonl'
test_dir = '/content/drive/MyDrive/Colab Notebooks/MDT/PROYECTO/DATA/test.jsonl'

In [None]:
def create_df_json(file_dir):
  data = []
  with open(file_dir, "r") as file:
    for line in file:
      json_data = json.loads(line)
      data.append(json_data)

  df = pd.DataFrame(data)

  return df

train_df = create_df_json(train_dir)
val_df = create_df_json(val_dir)
test_df = create_df_json(test_dir)

The first step is to explore the data, to understand what we are dealing with and whether any cleaning or transformation will be needed.

In [None]:
print(train_df.loc[0])

uuid                              0af11f6b-c889-4520-9372-66ba25cb7657
postId                                                          532quh
postText             [Wes Welker Wanted Dinner With Tom Brady, But ...
postPlatform                                                    reddit
targetParagraphs     [It’ll be just like old times this weekend for...
targetTitle          Wes Welker Wanted Dinner With Tom Brady, But P...
targetDescription    It'll be just like old times this weekend for ...
targetKeywords         new england patriots, ricky doyle, top stories,
targetMedia          [http://pixel.wp.com/b.gif?v=noscript, http://...
targetUrl            http://nesn.com/2016/09/wes-welker-wanted-dinn...
provenance           {'source': 'anonymized', 'humanSpoiler': 'They...
spoiler                          [how about that morning we go throw?]
spoilerPositions                                [[[3, 151], [3, 186]]]
tags                                                         [passage]
Name: 

Here are the sizes of the dataset splits:

In [None]:
print(train_df.shape)
print(val_df.shape)
print(test_df.shape)

(3200, 14)
(800, 14)
(1000, 14)


Next we check how many null values each variable has.

In [None]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3200 entries, 0 to 3199
Data columns (total 14 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   uuid               3200 non-null   object
 1   postId             3200 non-null   object
 2   postText           3200 non-null   object
 3   postPlatform       3200 non-null   object
 4   targetParagraphs   3200 non-null   object
 5   targetTitle        3200 non-null   object
 6   targetDescription  2933 non-null   object
 7   targetKeywords     2116 non-null   object
 8   targetMedia        2685 non-null   object
 9   targetUrl          2717 non-null   object
 10  provenance         3200 non-null   object
 11  spoiler            3200 non-null   object
 12  spoilerPositions   3200 non-null   object
 13  tags               3200 non-null   object
dtypes: object(14)
memory usage: 350.1+ KB


None of the columns has an excessive number of nulls, and 'targetKeywords', which has the most, is not needed for the tasks.

We drop the columns that are not useful for the task and apply a few transformations that will come in handy.

In [None]:
def filter_df(df):
  df['targetUrl'] = df['targetUrl'].astype(str)
  df = df.fillna("")
  df['targetDescription'] = df['targetDescription'].astype(str)
  df['postText'] = df['postText'].astype(str)
  return df.drop(columns=['postId', 'targetDescription', 'targetKeywords', 'targetMedia'])

train_df = filter_df(train_df)
val_df = filter_df(val_df)
test_df = filter_df(test_df)

I also add a variable that holds the full context, built from targetParagraphs. The paragraphs come as separate strings, and we want them all in a single string.

In [None]:
def create_context(df):
  string_list = []
  for index, row in df.iterrows():
    string = ""
    for k in row["targetParagraphs"]:
      string += k + " "
    string_list.append(string)
  df["full_context"] = string_list
  return df

train_df = create_context(train_df)
val_df = create_context(val_df)
test_df = create_context(test_df)
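
The concatenation performed by `create_context` can also be written with `str.join`. The following is a minimal sketch on a toy list of paragraphs (the list contents are hypothetical), showing that both forms produce the same string for a non-empty list:

```python
# Toy stand-in for one row's "targetParagraphs" value (hypothetical content).
paragraphs = ["First paragraph.", "Second paragraph.", "Third one."]

# Loop-based construction, as in create_context: a trailing space follows each paragraph.
loop_context = ""
for paragraph in paragraphs:
    loop_context += paragraph + " "

# Join-based construction: spaces between paragraphs plus the same trailing space.
joined_context = " ".join(paragraphs) + " "

print(loop_context == joined_context)
```

For an empty paragraph list the two forms differ (the loop yields an empty string, the join yields a single space), so the equivalence only holds when at least one paragraph is present.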

In [None]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3200 entries, 0 to 3199
Data columns (total 11 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   uuid              3200 non-null   object
 1   postText          3200 non-null   object
 2   postPlatform      3200 non-null   object
 3   targetParagraphs  3200 non-null   object
 4   targetTitle       3200 non-null   object
 5   targetUrl         3200 non-null   object
 6   provenance        3200 non-null   object
 7   spoiler           3200 non-null   object
 8   spoilerPositions  3200 non-null   object
 9   tags              3200 non-null   object
 10  full_context      3200 non-null   object
dtypes: object(11)
memory usage: 275.1+ KB


Here is how each split is distributed across the labels (phrase, passage, or multi):

In [None]:
print('On training dataset: \n',train_df.tags.value_counts(), '\n')
print('On validation dataset: \n',val_df.tags.value_counts(), '\n')
print('On test dataset: \n',test_df.tags.value_counts(), '\n')

On training dataset: 
 tags
[phrase]     1367
[passage]    1274
[multi]       559
Name: count, dtype: int64 

On validation dataset: 
 tags
[phrase]     335
[passage]    322
[multi]      143
Name: count, dtype: int64 

On test dataset: 
 tags
[phrase]     423
[passage]    403
[multi]      174
Name: count, dtype: int64 



We need to transform the data into something we can feed to the model. I therefore build a single text from the dataset's predictor variables that may help, and keep the label alongside it.

A NER method could be used to extract more information and build a richer prompt for each instance, but an initial experiment showed that it did not improve the classification results.

In [None]:
def combinar_features(df):
  textos_df = []
  labels = []

  for index, row in df.iterrows():
    texto = ''
    texto += f'Title: {row["postText"][1:-1]}. '
    if re.match(r".*\d+\s*[\.\)].+\d+?\s*[\.\)].+?\d+\s*[\.\)]", row["full_context"], re.MULTILINE | re.IGNORECASE):
      texto += 'Enumeration or multi-line. '

    texto += f'Publishing Platform: {row["postPlatform"]}. '
    texto += f'Source Website {row["targetUrl"]}. '
    texto += f'Context: {row["full_context"]}. '
    texto = texto.replace('"', "'")
    textos_df.append(texto)
    labels.append(row["tags"][0])

  return textos_df, labels
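
To make the enumeration check in `combinar_features` easier to follow, here is a small sketch that applies the same pattern to two toy sentences. The helper name `has_enumeration` and the example texts are hypothetical. The flags are left out because the pattern contains no anchors or letters, so they should not change the result:

```python
import re

# Same pattern as in combinar_features, written as a raw string.
ENUM_PATTERN = re.compile(r".*\d+\s*[\.\)].+\d+?\s*[\.\)].+?\d+\s*[\.\)]")

def has_enumeration(text):
    """Return True when three 'number followed by . or )' markers occur in sequence."""
    # re.match anchors at the start of the text and '.' does not cross newlines,
    # so only the first line of the text is inspected.
    return ENUM_PATTERN.match(text) is not None

examples = [
    "Tips: 1. Eat well 2. Sleep 3. Exercise.",  # three numbered markers
    "NASA expects recovery by 2070.",           # a single number followed by a period
]

for text in examples:
    print(has_enumeration(text), "-", text)
```

Because the pattern only looks for digits followed by `.` or `)`, it is a heuristic: text with several decimal numbers or years at sentence ends may also trigger it.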

The following function builds the final dataframe that I use for spoiler classification:

In [None]:
def crear_df_final(df):
  textos, labels = combinar_features(df)
  train_tsv = pd.DataFrame(list(zip(textos, labels)), columns=['text','label'])
  return train_tsv

train_tsv = crear_df_final(train_df)
val_tsv = crear_df_final(val_df)
test_tsv = crear_df_final(test_df)

In [None]:
train_tsv.text[1]

"Title: 'NASA sets date for full recovery of ozone hole'. Publishing Platform: Twitter. Source Website http://huff.to/1cH672Z. Context: 2070 is shaping up to be a great year for Mother Earth. That's when NASA scientists are predicting the hole in the ozone layer might finally make a full recovery. Researchers announced their conclusion, in addition to other findings, in a presentation Wednesday during the annual American Geophysical Union meeting in San Francisco. The team of scientists specifically looked at the chemical composition of the ozone hole, which has shifted in both size and depth since the passing of the Montreal Protocol in 1987. The agreement banned its 197 signatory countries from using chemicals, like chlorofluorocarbons (CFCs), that break down into chlorine in the upper atmosphere and harm the ozone layer. They found that, while levels of chlorine in the atmosphere have indeed decreased as a result of the protocol, it's too soon to tie them to a healthier ozone layer.

Now I save the newly created datasets so that this process does not have to be repeated every time they are needed.

In [None]:
train_tsv.to_csv("/content/drive/MyDrive/Colab Notebooks/MDT/PROYECTO/DATA/train2.tsv", sep="\t", encoding='utf-8', index=False)
val_tsv.to_csv("/content/drive/MyDrive/Colab Notebooks/MDT/PROYECTO/DATA/val2.tsv", sep="\t", encoding='utf-8', index=False)
test_tsv.to_csv("/content/drive/MyDrive/Colab Notebooks/MDT/PROYECTO/DATA/test2.tsv", sep="\t", encoding='utf-8', index=False)


## Task 1: Spoiler type classification

As mentioned in the report, we try different models and collect their results on the spoiler classification task.

Load the saved datasets, so that the previous cells do not need to be run again.

In [None]:
train_df = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/MDT/PROYECTO/DATA/train2.tsv",sep='\t')
val_df = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/MDT/PROYECTO/DATA/val2.tsv",sep='\t')
test_df = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/MDT/PROYECTO/DATA/test2.tsv",sep='\t')

expanded_df = pd.concat([train_df, val_df], axis=0)

In [None]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3200 entries, 0 to 3199
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   text    3200 non-null   object
 1   label   3200 non-null   object
dtypes: object(2)
memory usage: 50.1+ KB


The expanded set contains the training and validation data concatenated. This experiment did not improve the results.

In [None]:
expanded_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 4000 entries, 0 to 799
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   text    4000 non-null   object
 1   label   4000 non-null   object
dtypes: object(2)
memory usage: 93.8+ KB


Function to transform the dataframe labels into numeric values. We keep the mapping so that the original values can be recovered later.

In [None]:
def encode_labels(df):
  le = LabelEncoder()
  df['label'] = le.fit_transform(df.label.values)
  le_name_mapping = dict(zip(le.classes_, le.transform(le.classes_)))

  return df, le_name_mapping

train_df_clas, le_name_mapping = encode_labels(train_df)
val_df_clas, _ = encode_labels(val_df)
test_df_clas, _ = encode_labels(test_df)

expanded_df_clas, _ = encode_labels(expanded_df)

In [None]:
le_name_mapping

{'multi': 0, 'passage': 1, 'phrase': 2}
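
`encode_labels` fits a new `LabelEncoder` on every split. The mappings coincide here because all three labels occur in every split, but they would diverge if a split lacked a class. The sketch below illustrates this in plain Python, assuming the mapping is built from the sorted unique labels (which is how `LabelEncoder` orders `classes_`). The helper names and toy labels are hypothetical:

```python
def fit_label_mapping(labels):
    """Map each distinct label to an integer, following sorted label order."""
    return {label: index for index, label in enumerate(sorted(set(labels)))}

def encode(labels, mapping):
    """Encode labels with a mapping that was fitted elsewhere."""
    return [mapping[label] for label in labels]

train_labels = ["passage", "phrase", "multi", "phrase"]
val_labels = ["phrase", "passage"]  # hypothetical split in which 'multi' is absent

# Fit once on the training labels and reuse the mapping for the other split.
mapping = fit_label_mapping(train_labels)
print(mapping)
print(encode(val_labels, mapping))

# Fitting separately on the smaller split assigns different integers.
print(fit_label_mapping(val_labels))
```

Fitting once and reusing the mapping keeps the integer codes consistent across splits regardless of which classes each split contains.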

We can see that the 'label' variable has changed to type int64.

In [None]:
train_df_clas.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3200 entries, 0 to 3199
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   text    3200 non-null   object
 1   label   3200 non-null   int64 
dtypes: int64(1), object(1)
memory usage: 50.1+ KB


We extract the texts as plain lists, the format needed to make predictions once the model is trained.

In [None]:
train_df_texts = train_df_clas['text'].tolist()
val_df_texts = val_df_clas['text'].tolist()
test_df_texts = test_df_clas['text'].tolist()

expanded_df_texts = expanded_df_clas['text'].tolist()

Now we set up the model used for classification. First, the model parameters must be defined. I tuned them on the validation set with the RoBERTa-base model, and the resulting hyperparameters are:

In [None]:
model_args = ClassificationArgs()
model_args.evaluate_during_training = False
model_args.save_eval_checkpoints = False
model_args.save_model_every_epoch = False
model_args.learning_rate = 1e-5
model_args.max_seq_length = 200
model_args.num_train_epochs = 5
model_args.overwrite_output_dir = True
model_args.labels_list = [0, 1, 2]
model_args.reprocess_input_data = True

Check that the GPU is available.

In [None]:
cuda_available = torch.cuda.is_available()
print(cuda_available)

True


We then create the model instance using the simpletransformers library, which simplifies working with Transformer models. To train a different model, just change the model type and the path to its HuggingFace repository.

In [None]:
model = ClassificationModel(
    "roberta",
    "FacebookAI/roberta-base",
    use_cuda= cuda_available,
    num_labels=3,
    args=model_args
)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


config.json:   0%|          | 0.00/481 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/499M [00:00<?, ?B/s]

Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at FacebookAI/roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


tokenizer_config.json:   0%|          | 0.00/25.0 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/899k [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

We train the model on the training set:

In [None]:
model.train_model(train_df_clas)
#model.train_model(expanded_df_clas)

  self.pid = os.fork()


  0%|          | 0/6 [00:00<?, ?it/s]

Epoch:   0%|          | 0/5 [00:00<?, ?it/s]

Running Epoch 1 of 5:   0%|          | 0/400 [00:00<?, ?it/s]

Running Epoch 2 of 5:   0%|          | 0/400 [00:00<?, ?it/s]

Running Epoch 3 of 5:   0%|          | 0/400 [00:00<?, ?it/s]

Running Epoch 4 of 5:   0%|          | 0/400 [00:00<?, ?it/s]

Running Epoch 5 of 5:   0%|          | 0/400 [00:00<?, ?it/s]

(2000, 0.6811846494674683)

We make predictions on the validation set:

In [None]:
preds, model_outputs = model.predict(val_df_texts)

  0%|          | 0/1 [00:00<?, ?it/s]

  0%|          | 0/8 [00:00<?, ?it/s]

Let's look at the overall accuracy and the accuracy for each label:

In [None]:
warnings.filterwarnings('ignore')

label_map = {0: 'multi', 1: 'passage', 2: 'phrase'}

y_true = val_df['label'].tolist()

accuracy_total = accuracy_score(y_true, preds)
precision_total = precision_score(y_true, preds, average='weighted')
recall_total = recall_score(y_true, preds, average='weighted')
f1_total = f1_score(y_true, preds, average='weighted')

accuracy_por_clase = {}
precision_por_clase = {}
recall_por_clase = {}
f1_por_clase = {}

y_true_mapped = [label_map[label] for label in y_true]
preds_mapped = [label_map[label] for label in preds]

print(y_true_mapped)
print(preds_mapped)

for label in label_map.values():
  predicted_labels_for_class = [preds_mapped[i] for i in range(len(y_true_mapped)) if y_true_mapped[i] == label]
  true_labels_for_class = [y_true_mapped[i] for i in range(len(y_true_mapped)) if y_true_mapped[i] == label]

  accuracy_por_clase[label] = accuracy_score(true_labels_for_class, predicted_labels_for_class)
  precision_por_clase[label] = precision_score(true_labels_for_class, predicted_labels_for_class, average='weighted')
  recall_por_clase[label] = recall_score(true_labels_for_class, predicted_labels_for_class, average='weighted')
  f1_por_clase[label] = f1_score(true_labels_for_class, predicted_labels_for_class, average='weighted')

print("Total accuracy:", accuracy_total)
print("Total precision:", precision_total)
print("Total recall:", recall_total)
print("Total F1-score:", f1_total)

print("\n")

print("Accuracy per class:", accuracy_por_clase)
print("Precision per class:", precision_por_clase)
print("Recall per class:", recall_por_clase)
print("F1-score per class:", f1_por_clase)

['passage', 'multi', 'phrase', 'multi', 'passage', 'phrase', 'phrase', 'passage', 'passage', 'passage', 'multi', 'phrase', 'passage', 'multi', 'multi', 'phrase', 'phrase', 'passage', 'passage', 'passage', 'phrase', 'passage', 'passage', 'passage', 'phrase', 'phrase', 'multi', 'passage', 'passage', 'phrase', 'passage', 'phrase', 'phrase', 'phrase', 'multi', 'passage', 'phrase', 'multi', 'passage', 'phrase', 'multi', 'passage', 'phrase', 'phrase', 'phrase', 'passage', 'multi', 'multi', 'phrase', 'passage', 'phrase', 'passage', 'phrase', 'phrase', 'phrase', 'passage', 'phrase', 'phrase', 'passage', 'phrase', 'phrase', 'passage', 'multi', 'phrase', 'multi', 'phrase', 'passage', 'phrase', 'passage', 'phrase', 'multi', 'phrase', 'phrase', 'passage', 'phrase', 'phrase', 'phrase', 'phrase', 'multi', 'phrase', 'phrase', 'passage', 'passage', 'phrase', 'multi', 'passage', 'passage', 'passage', 'phrase', 'passage', 'multi', 'multi', 'phrase', 'multi', 'phrase', 'phrase', 'phrase', 'passage', 'phr

Once the hyperparameters are fixed, we evaluate the model on the test set:

In [None]:
preds_test, model_outputs = model.predict(test_df_texts)

  0%|          | 0/2 [00:00<?, ?it/s]

  0%|          | 0/10 [00:00<?, ?it/s]

In [None]:
label_map = {0: 'multi', 1: 'passage', 2: 'phrase'}

y_true = test_df['label'].tolist()


accuracy_total = accuracy_score(y_true, preds_test)
precision_total = precision_score(y_true, preds_test, average='weighted')
recall_total = recall_score(y_true, preds_test, average='weighted')
f1_total = f1_score(y_true, preds_test, average='weighted')
accuracy_por_clase = {}

y_true_mapped = [label_map[label] for label in y_true]
preds_mapped = [label_map[label] for label in preds_test]

for label in label_map.values():
  predicted_labels_for_class = [preds_mapped[i] for i in range(len(y_true_mapped)) if y_true_mapped[i] == label]
  true_labels_for_class = [y_true_mapped[i] for i in range(len(y_true_mapped)) if y_true_mapped[i] == label]

  accuracy_por_clase[label] = accuracy_score(true_labels_for_class, predicted_labels_for_class)


print("Total accuracy:", accuracy_total)
print("Total precision:", precision_total)
print("Total recall:", recall_total)
print("Total F1-score:", f1_total)

print("Accuracy per class:", accuracy_por_clase)

Total accuracy: 0.733
Total precision: 0.7356338120580302
Total recall: 0.733
Total F1-score: 0.7310072682227511
Accuracy per class: {'multi': 0.5804597701149425, 'passage': 0.7344913151364765, 'phrase': 0.7943262411347518}
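
The per-class accuracy reported above is computed only over the instances whose true label is the given class, which makes it the recall of that class. A one-vs-rest view of precision, recall and F1 can be obtained from the counts of (true, predicted) pairs. The following is a minimal sketch with hypothetical toy labels and a hypothetical helper name, not the evaluation used for the reported numbers:

```python
from collections import Counter

def per_class_report(y_true, y_pred, labels):
    """Compute one-vs-rest precision, recall and F1 for each label."""
    pair_counts = Counter(zip(y_true, y_pred))
    report = {}
    for label in labels:
        tp = pair_counts[(label, label)]
        fp = sum(c for (t, p), c in pair_counts.items() if p == label and t != label)
        fn = sum(c for (t, p), c in pair_counts.items() if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        report[label] = {"precision": precision, "recall": recall, "f1": f1}
    return report

# Toy labels (hypothetical), encoded as 0 = multi, 1 = passage, 2 = phrase.
y_true = [0, 0, 1, 1, 1, 2, 2, 2]
y_pred = [0, 1, 1, 1, 2, 2, 2, 0]
report = per_class_report(y_true, y_pred, labels=[0, 1, 2])

# Accuracy restricted to the instances of one true class equals that class's recall.
preds_for_multi = [p for t, p in zip(y_true, y_pred) if t == 0]
subset_accuracy = sum(p == 0 for p in preds_for_multi) / len(preds_for_multi)

print(report)
print(subset_accuracy == report[0]["recall"])
```

Precision needs the predictions made for the other classes as well, so it cannot be recovered from the true-class subset alone.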


## Task 2: Spoiler generation



First of all, we need to convert the data into the right format for the approach we take.

Spoiler generation can be approached in two different ways:

1) Based on the spoiler type classification, use either a QA model or a passage retrieval model.

2) If the classification is not good enough, generate the spoiler directly with a QA model and ignore the classification.

To begin with, I go with the second approach.

The dataset format therefore has to change. The best-known dataset in the literature for the Question Answering task is SQuAD, developed at Stanford University, in which the answer to each question is a segment of a text. That makes it well suited to this task. In addition, most of the best-known QA models have been fine-tuned on this dataset.

In [None]:
train_dir = '/content/drive/MyDrive/Colab Notebooks/MDT/PROYECTO/DATA/train.jsonl'
val_dir = '/content/drive/MyDrive/Colab Notebooks/MDT/PROYECTO/DATA/validation.jsonl'
test_dir = '/content/drive/MyDrive/Colab Notebooks/MDT/PROYECTO/DATA/test.jsonl'

In [None]:
train_df = pd.read_json(path_or_buf=train_dir, lines=True)
val_df = pd.read_json(path_or_buf=val_dir, lines=True)
test_df = pd.read_json(path_or_buf=test_dir, lines=True)

We remove the training instances that would require abstractive QA, that is, those whose spoiler does not appear verbatim in the text.

In [None]:
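def drop_abstractive(df):
  counter = 0      # counts spoiler parts not found, so a row can be counted more than once
  index_list = []  # may hold repeated indices; df.drop removes each row once

  for i, j in df.iterrows():
    # Concatenate the paragraphs without a separator
    string = ""
    for k in j["targetParagraphs"]:
      string += k
    # Mark the row when a spoiler part is not a verbatim substring of the text
    for l in j["spoiler"]:
      if l not in string:
        index_list.append(i)
        counter +=1
  print('Drop:', counter)
  return df.drop(index=index_list)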

train_df = drop_abstractive(train_df)

Drop: 152
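
The filtering criterion can be expressed as a row-level check: a row is extractive when every spoiler part appears verbatim in the concatenated paragraphs. The sketch below uses hypothetical toy rows and a hypothetical helper name. It also shows that the number of missing spoiler parts can be larger than the number of dropped rows:

```python
def is_extractive(paragraphs, spoiler_parts):
    """Return True when every spoiler part appears verbatim in the text."""
    text = "".join(paragraphs)  # same concatenation as drop_abstractive (no separator)
    return all(part in text for part in spoiler_parts)

# Hypothetical rows with the same two fields used by drop_abstractive.
rows = [
    {"targetParagraphs": ["The answer is 42.", "Nothing else."], "spoiler": ["42"]},
    {"targetParagraphs": ["Three tips follow."], "spoiler": ["Sleep more", "Eat well"]},
]

kept = [row for row in rows if is_extractive(row["targetParagraphs"], row["spoiler"])]
dropped_rows = len(rows) - len(kept)

# Counting missing parts instead of rows gives a larger number for multi-part spoilers.
missing_parts = sum(
    part not in "".join(row["targetParagraphs"])
    for row in rows
    for part in row["spoiler"]
)

print(dropped_rows, missing_parts)
```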


The following function converts the dataset to the SQuAD format required by Question Answering models.

In [None]:
def convert_to_squad_format(jsonObj):
  list_of_spoiler_dicts = []
  for _, j in jsonObj.iterrows():
    spoiler_dict = {}
    spoiler_dict["title"] = j["targetTitle"]
    context = ""
    for k in j["targetParagraphs"]:
      context += k + " "
    list_ans = []
    dict_ans = {}
    dict_ans["question"] = j["postText"][0]
    dict_ans["id"] = j["uuid"]
    list_ans_2 = []
    for l in range(len(j["spoiler"])):
      dict_ans_2 = {}
      dict_ans_2["text"] = j["spoiler"][l]
      offset = 0
      counter = 0
      for k in j["targetParagraphs"]:
        if counter == j["spoilerPositions"][l][0][0]:
          break
        offset += len(k) + 1
        counter += 1
      if len(j["spoilerPositions"][l]) == 2:
        if j["spoilerPositions"][l][0][0] == j["spoilerPositions"][l][1][0]:
          if context[j["spoilerPositions"][l][0][1] + offset : j["spoilerPositions"][l][0][1] + offset + len(j["spoiler"][l])].startswith(" "):
            dict_ans_2["answer_start"] = j["spoilerPositions"][l][0][1] + offset + 1
          elif context[j["spoilerPositions"][l][0][1] + offset : j["spoilerPositions"][l][0][1] + offset + len(j["spoiler"][l])].endswith(" "):
            dict_ans_2["answer_start"] = j["spoilerPositions"][l][0][1] + offset + 2
          else:
            dict_ans_2["answer_start"] = j["spoilerPositions"][l][0][1] + offset
        else:
          dict_ans_2["answer_start"] = j["spoilerPositions"][l][0][1] - 1
      elif len(j["spoilerPositions"][l]) == 1:
        dict_ans_2["answer_start"] = j["spoilerPositions"][l][0][1]
      list_ans_2.append(dict_ans_2)
    dict_ans["answers"] = list_ans_2
    list_ans.append(dict_ans)
    dict_final = {}
    dict_final["title"] = j["targetTitle"]
    dict_final["context"] = context
    dict_final["qas"] = list_ans
    list_of_spoiler_dicts.append(dict_final)

  return list_of_spoiler_dicts
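
Since the offset arithmetic in `convert_to_squad_format` is intricate, it can help to verify that each `answer_start` really points at its answer text. The following sketch shows the shape of one converted record and a simple alignment check. The record contents and the helper name `find_misaligned_answers` are hypothetical:

```python
def find_misaligned_answers(records):
    """Return (question id, answer text) pairs whose answer_start does not point at the answer."""
    misaligned = []
    for record in records:
        context = record["context"]
        for qa in record["qas"]:
            for answer in qa["answers"]:
                start = answer["answer_start"]
                end = start + len(answer["text"])
                if context[start:end] != answer["text"]:
                    misaligned.append((qa["id"], answer["text"]))
    return misaligned

# Hypothetical record with the same keys that convert_to_squad_format produces.
context = "It will be like old times. He asked: how about that morning we go throw? "
answer_text = "how about that morning we go throw?"
record = {
    "title": "Toy title",
    "context": context,
    "qas": [{
        "question": "Toy clickbait post",
        "id": "toy-0",
        "answers": [{"text": answer_text, "answer_start": context.index(answer_text)}],
    }],
}

# Same record with the offset shifted by one character, to show a detected mismatch.
shifted = {
    **record,
    "qas": [{
        **record["qas"][0],
        "answers": [{"text": answer_text, "answer_start": context.index(answer_text) + 1}],
    }],
}

print(find_misaligned_answers([record]))   # empty list when offsets line up
print(find_misaligned_answers([shifted]))
```

Running such a check on the converted training data would show how many instances have offsets that do not match their spoiler text.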

In [None]:
dict_train = convert_to_squad_format(train_df)
dict_val = convert_to_squad_format(val_df)
dict_test = convert_to_squad_format(test_df)

In [None]:
cuda_available = torch.cuda.is_available()
print(cuda_available)

True


We load the model we want to use. To try different models, just change the model name.

In [None]:
model_name = "FacebookAI/roberta-base"

model = QuestionAnsweringModel('roberta',
                               model_name,
                               args={'reprocess_input_data': True,
                                     'overwrite_output_dir': True,
                                     'learning_rate': 5e-5,
                                     'num_train_epochs': 3,
                                     'max_seq_length': 192,
                                     'doc_stride': 64,
                                     'train_batch_size': 8,
                                     'fp16': False,
                                    },
                              use_cuda=cuda_available)

Some weights of RobertaForQuestionAnswering were not initialized from the model checkpoint at FacebookAI/roberta-base and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Now we can train it with the train_model function from the simpletransformers library.

In [None]:
model.train_model(dict_train)

convert squad examples to features: 100%|██████████| 3070/3070 [03:55<00:00, 13.02it/s]
add example index and unique id: 100%|██████████| 3070/3070 [00:00<00:00, 118474.44it/s]


Epoch:   0%|          | 0/3 [00:00<?, ?it/s]

Running Epoch 1 of 3:   0%|          | 0/3525 [00:00<?, ?it/s]

We then make predictions on the test set. To find the best parameters, I first predicted on the validation set and then switched to the test set.

In [None]:
predictions = model.predict(dict_test)

The following function puts the predictions in the right format for evaluation. The predictions contain a list of candidate answers for each instance, the first being the one with the highest associated probability, so we keep that first one.

In [None]:
def cambiar_predicciones(predictions):
  # predictions is a pair; its first element holds the answers for every instance
  lista_inst = predictions[0]
  predicciones = []

  for inst in lista_inst:
    inst_id = inst['id']
    # Keep the top-ranked answer, or None when no answer was returned
    answer_mas_prob = inst['answer'][0] if inst['answer'] else None
    predicciones.append({'id': inst_id, 'answer': answer_mas_prob})

  return predicciones
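
As an illustration of the selection step, the sketch below applies the same "keep the first candidate" rule to a hand-written structure. The toy data and the helper name `top_answers` are hypothetical, and the second element of the pair (the probabilities) is only assumed to have this shape:

```python
def top_answers(answer_list):
    """Keep the first (highest-ranked) candidate for each instance, or None if there is none."""
    return [
        {"id": item["id"], "answer": item["answer"][0] if item["answer"] else None}
        for item in answer_list
    ]

# Hypothetical stand-in for the pair of lists handled by cambiar_predicciones.
predictions = (
    [{"id": "a", "answer": ["first candidate", "second candidate"]},
     {"id": "b", "answer": []}],
    [{"id": "a", "probability": [0.9, 0.1]},
     {"id": "b", "probability": []}],
)

selected = top_answers(predictions[0])
print(selected)
```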

Save the predictions:

In [None]:
predicciones = cambiar_predicciones(predictions)
preds_df = pd.DataFrame(predicciones)
preds_df.to_csv('predicciones_roberta_base.csv', index=False)

We rebuild the test dictionary for evaluation and extract the reference answers.

In [None]:
dict_test = convert_to_squad_format(test_df)

In [None]:
answers_test = []

for instance in dict_test:
  qas = instance['qas']

  answers_instance = []

  for qa in qas:
    answers_instance.extend([answer['text'] for answer in qa['answers']])

  answers_test.append(answers_instance)

We compute the BLEU metric as follows:

In [None]:
reference_tokens = [[word_tokenize(sentence.lower()) for sentence in solution] for solution in answers_test]

hypotheses = []

for index, row in preds_df.iterrows():
    prediction = row['answer']
    # A missing prediction is treated as an empty string
    prediction = prediction if isinstance(prediction, str) else ""
    prediction_tokens = word_tokenize(prediction.lower())
    hypotheses.append(prediction_tokens)

bleu_score = corpus_bleu(reference_tokens, hypotheses)

print("BLEU score:", bleu_score)

Function to compute exact matches:

In [None]:
def calculate_exact_matches(predictions_df, solutions_list):
  total_exact_matches = 0
  for index, prediction in predictions_df.iterrows():
    exact_matches = 0
    solutions = solutions_list[index]
    found_solutions = set()
    for solution in solutions:
      if solution in found_solutions:
        continue
      if prediction.iloc[1] == solution:
        exact_matches += 1
        found_solutions.add(solution)
    total_exact_matches += exact_matches
  return total_exact_matches

In [None]:
print('Proportion of exact matches: ', calculate_exact_matches(preds_df, answers_test)/len(preds_df))

Compute the BERTScore. I take the F1 as the final value.

In [None]:
preds_df['answer'] = preds_df['answer'].fillna("")
candidates = preds_df['answer'].tolist()
P, R, F1 = score(candidates, answers_test, lang='en', verbose=True)
print('Bertscore P: ', P.mean())
print('Bertscore R: ', R.mean())
print('Bertscore F1: ', F1.mean())

Now I run an experiment to see how the metrics behave.

First, I load the predictions of the models I want to compare.

In [None]:
preds_roberta_squad_dir = '/content/drive/MyDrive/Colab Notebooks/MDT/PROYECTO/DATA/predicciones/predicciones_roberta_base_squad.csv'
preds_electra_dir = '/content/drive/MyDrive/Colab Notebooks/MDT/PROYECTO/DATA/predicciones/predicciones_electra_base.csv'

In [None]:
preds_roberta_squad = pd.read_csv(preds_roberta_squad_dir)
preds_electra = pd.read_csv(preds_electra_dir)

In [None]:
train_df = pd.read_json(path_or_buf=train_dir, lines=True)
val_df = pd.read_json(path_or_buf=val_dir, lines=True)
test_df = pd.read_json(path_or_buf=test_dir, lines=True)

In [None]:
val_df.iloc[795]

uuid                              1189d343-42eb-47e7-8395-ff978a683875
postId                                              428006164904034305
postText             [This is what happens when you leave a hotel c...
postPlatform                                                   Twitter
targetParagraphs     [Instead of encountering a mound of dirty towe...
targetTitle          This Is What Happens When You Leave A Hotel Cl...
targetDescription    Instead of encountering a mound of dirty towel...
targetKeywords       givebackfilms,give back films,video,random act...
targetMedia          [http://s.m.huffpost.com/assets/Logo_Huffingto...
targetUrl                                       http://huff.to/1ebARdm
provenance           {'source': 'anonymized', 'humanSpoiler': 'She ...
spoiler              [The video below shows the stunned cleaner ini...
spoilerPositions                                  [[[3, 0], [3, 150]]]
tags                                                         [passage]
Name: 

Let's look at the predictions of the two models and the actual answer.

In [None]:
caso_roberta_squad = preds_roberta_squad.iloc[795]['answer']
caso_electra = preds_electra.iloc[795]['answer']
respuesta = val_df.iloc[795]['spoiler'][0]

In [None]:
print('RoBERTa-squad prediction: ', caso_roberta_squad)
print('\nElectra prediction: ', caso_electra)
print('\nActual answer: ', respuesta)

RoBERTa-squad prediction:  The video below shows the stunned cleaner initially refusing to accept the tip, before another hotel worker reassured her by saying, "You deserve it."

Electra prediction:  The video below shows the stunned cleaner initially refusing to accept the tip,

Actual answer:  The video below shows the stunned cleaner initially refusing to accept the tip, before another hotel worker reassured her by saying, "You deserve it."


We compute the BLEU score of each model.

In [None]:
caso_rs_token = caso_roberta_squad.split()
caso_el_token = caso_electra.split()
res_token = respuesta.split()

# Note: sentence_bleu expects (references, hypothesis). Here each prediction is
# passed as the reference and the actual answer as the hypothesis.
bleu_score_rs = sentence_bleu([caso_rs_token], res_token)
bleu_score_el = sentence_bleu([caso_el_token], res_token)

print("BLEU score of the RoBERTa-squad prediction:", bleu_score_rs)
print("BLEU score of the Electra prediction:", bleu_score_el)

BLEU score of the RoBERTa-squad prediction: 1.0
BLEU score of the Electra prediction: 0.5093121744590026
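
To see where a value of about 0.51 comes from, the sketch below implements a simplified single-reference BLEU (uniform weights up to 4-grams, brevity penalty, no smoothing). It is an illustrative re-implementation under those assumptions, not NLTK's code, and `simple_bleu` is a hypothetical helper. NLTK's `sentence_bleu` takes the references first and the hypothesis second, and the cell above passes the prediction as the reference and the actual answer as the hypothesis, so the sketch evaluates both argument orders:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def simple_bleu(reference, hypothesis, max_n=4):
    """Single-reference BLEU with uniform weights and brevity penalty, no smoothing."""
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_counts = ngrams(hypothesis, n)
        ref_counts = ngrams(reference, n)
        # Clipped overlap: each hypothesis n-gram counts at most as often as in the reference.
        overlap = sum(min(count, ref_counts[gram]) for gram, count in hyp_counts.items())
        total = sum(hyp_counts.values())
        if overlap == 0 or total == 0:
            return 0.0
        log_precisions.append(math.log(overlap / total))
    # Brevity penalty: hypotheses no longer than the reference are penalized.
    if len(hypothesis) > len(reference):
        bp = 1.0
    else:
        bp = math.exp(1 - len(reference) / len(hypothesis))
    return bp * math.exp(sum(log_precisions) / max_n)

gold = ('The video below shows the stunned cleaner initially refusing to accept the tip, '
        'before another hotel worker reassured her by saying, "You deserve it."').split()
short = 'The video below shows the stunned cleaner initially refusing to accept the tip,'.split()

# Argument order used in the cell above: prediction as reference, actual answer as hypothesis.
as_in_notebook = simple_bleu(reference=short, hypothesis=gold)
# Conventional order: actual answer as reference, prediction as hypothesis.
conventional = simple_bleu(reference=gold, hypothesis=short)

print(as_in_notebook, conventional)
```

In the first order the score is driven by n-gram precision, because only about half of the longer text is covered by the shorter one. In the conventional order every n-gram of the truncated prediction matches, and the score is determined by the brevity penalty alone.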


And now the BERTScore:

In [None]:
lista_rs = [caso_roberta_squad]
lista_el = [caso_electra]
lista_res = [respuesta]

p, r, f1 = score(lista_rs, lista_res, lang="en")
p2, r2, f1_2 = score(lista_el, lista_res, lang="en")

print('BERTScore of the RoBERTa-squad prediction: ', f1)
print('BERTScore of the Electra prediction: ', f1_2)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


tokenizer_config.json:   0%|          | 0.00/25.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/482 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/899k [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/1.42G [00:00<?, ?B/s]

Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


BERTScore of the RoBERTa-squad prediction:  tensor([1.])
BERTScore of the Electra prediction:  tensor([0.9443])
