<a href="https://colab.research.google.com/github/rocabrera/language-uncertainty/blob/master/geracao_dataset_QA_para_incerteza.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip install --quiet datasets transformers sentencepiece==0.1.96

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
datascience 0.10.6 requires fol

In [2]:
!nvidia-smi

Wed Jul  6 21:35:34 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla P100-PCIE...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   44C    P0    28W / 250W |      0MiB / 16280MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [3]:
# Set the allocator option in the kernel process itself so that PyTorch sees it.
%env PYTORCH_CUDA_ALLOC_CONF=garbage_collection_threshold:0.6

# Generating a dataset to fine-tune a model for uncertainty estimation

The goal of this notebook is to generate paraphrases from a known QA dataset and observe how a language model such as T5 behaves when it is given an example with the same meaning but written in a different way. To do that, we use the t5-base model on an extractive QA task.

## 1. Evaluation metrics for QA

QA evaluation is still an evolving field, since most of the metrics used for QA were inherited from machine translation. Some metrics used to evaluate extractive QA are:

- Exact Match
- F1 Score
- METEOR
- BERTScore
- SAS

To keep our experiments simple, we use the F1 score to evaluate our results: it is easy to compute and our time is limited.

References:

- https://arxiv.org/pdf/2108.06130.pdf
- https://aclanthology.org/D19-5817.pdf
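
For comparison with F1, Exact Match can be sketched as follows. The normalization steps (lowercasing, stripping punctuation and English articles, collapsing whitespace) follow the convention of the SQuAD evaluation script. The names `normalize_answer` and `exact_match` are illustrative and are not used elsewhere in this notebook.

```python
import re
import string


def normalize_answer(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    punctuation = set(string.punctuation)
    text = "".join(ch for ch in text if ch not in punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, references: list) -> int:
    """Return 1 if the normalized prediction equals any normalized reference, else 0."""
    pred = normalize_answer(prediction)
    return int(any(pred == normalize_answer(ref) for ref in references))


print(exact_match("The Monastic Brotherhood.", ["monastic brotherhood"]))
print(exact_match("two delegates", ["three delegates"]))
```

Exact Match is all-or-nothing, which is why F1 is often reported next to it: a prediction that overlaps the reference only partially still gets partial credit under F1.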

In [None]:
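import ast
from collections import Counter


def compute_f1(predict_text: str, label_text: str) -> float:
    """Token-level F1 between a predicted answer and one reference answer."""
    pred_tokens = predict_text.split()
    truth_tokens = label_text.split()

    # If either side is empty (no answer), F1 is 1 when both are empty and 0 otherwise.
    if len(pred_tokens) == 0 or len(truth_tokens) == 0:
        return float(pred_tokens == truth_tokens)

    # Count shared tokens with multiplicity so repeated words are matched one-to-one.
    num_common = sum((Counter(pred_tokens) & Counter(truth_tokens)).values())

    # No shared tokens means F1 = 0.
    if num_common == 0:
        return 0.0

    prec = num_common / len(pred_tokens)
    rec = num_common / len(truth_tokens)

    return 2 * (prec * rec) / (prec + rec)


def custom_f1s(x):
    """Return the F1 against every reference answer and the best of them."""
    predicted_answer = x["predicted_answer"]
    # `answers` is stored as the string form of a dict after a CSV round trip.
    answers: dict = ast.literal_eval(x["answers"])
    f1s = [compute_f1(predicted_answer, answer) for answer in answers["text"]]
    return f1s, max(f1s)


# `df` must already hold the `predicted_answer` column produced in section 4.
df["f1s"], df["max_f1"] = zip(*df.apply(custom_f1s, axis=1))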

## 2. Exploring an extractive QA dataset

As the dataset below shows, an extractive QA example consists of a context, a question (whose answer should, in theory, be found inside the context), and the answer.

Since we use T5 as the base for our fine-tuning and SQuAD was one of the datasets used in its training, we chose a dataset for our task that was generated in a similar way but comes from a different domain.

We therefore chose the squadshifts dataset.
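
To make the structure concrete, here is a minimal sketch of one extractive QA record and a check that each answer really is a span of the context. The record is a toy example with invented values, and the `answer_start` field is an assumption based on the usual SQuAD-style layout; `answer_spans_are_valid` is a hypothetical helper, not part of the notebook.

```python
# Toy record in the SQuAD-style layout (values invented for illustration).
example = {
    "context": "Each brotherhood elects two delegates. They attend the assembly.",
    "question": "How many delegates does each brotherhood elect?",
    "answers": {"text": ["two delegates"], "answer_start": [24]},
}


def answer_spans_are_valid(record: dict) -> bool:
    """Check that every answer is the literal substring at its start offset."""
    context = record["context"]
    answers = record["answers"]
    return all(
        context[start:start + len(text)] == text
        for text, start in zip(answers["text"], answers["answer_start"])
    )


print(answer_spans_are_valid(example))
```

A check like this matters once contexts are paraphrased: the original offsets no longer point at the answer, and the answer text itself may not survive the rewrite.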

In [4]:
import pandas as pd
from datasets import load_dataset, Dataset

In [5]:
domains = ['new_wiki', 'nyt', 'reddit', 'amazon']
data = load_dataset("squadshifts", domains[0])


In [6]:
# Creating a dataframe from the previous structure
df = pd.DataFrame()
df['id'] = data['test']['id']
df['title'] = data['test']['title']
df['context'] = data['test']['context']
df['question'] = data['test']['question']
df['answers'] = data['test']['answers']
df.head()

| | id | title | context | question | answers |
|---|---|---|---|---|---|
| 0 | 5d6571572b22cd4dfcfbc8ea | Armenian_Apostolic_Church | The Monastic Brotherhood consists of the celib... | is there a delegate? | {'text': ['Each brotherhood elects two delegat... |
| 1 | 5d6571572b22cd4dfcfbc8e9 | Armenian_Apostolic_Church | The Monastic Brotherhood consists of the celib... | how does the brotherhood make decisions? | {'text': ['The brotherhood makes decisions con... |
| 2 | 5d6571572b22cd4dfcfbc8e8 | Armenian_Apostolic_Church | The Monastic Brotherhood consists of the celib... | how does an Armenian priest become a member of... | {'text': ['Each Armenian celibate priest becom... |
| 3 | 5d6571572b22cd4dfcfbc8e6 | Armenian_Apostolic_Church | The Monastic Brotherhood consists of the celib... | what does the monastic brotherhood consist of | {'text': ['the celibate clergy of the monaster... |
| 4 | 5d6571572b22cd4dfcfbc8e7 | Armenian_Apostolic_Church | The Monastic Brotherhood consists of the celib... | how many brotherhoods are in the Armenian church? | {'text': ['Mother See of Holy Etchmiadzin, the... |


## 3. Generating prompt variance

We want to take a prompt and generate another prompt with the same meaning but different words, that is, a paraphrase. We tested two ways of generating these paraphrases:

- Using OpenAI's GPT-3 API
- Using a Hugging Face (HF) model trained for paraphrasing

GPT-3 gave good results, since we could feed it the whole context. However, the API cost would exceed 500 dollars, which made it unfeasible. We therefore went with the second option.

After some tests, we noticed that the model we chose performs better on short sentences. We therefore pass one or two sentences at a time and combine the results to compose the final texts. We paraphrase only the contexts and leave paraphrasing the questions as an open possibility.
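
The chunk-and-recombine idea can be sketched without any model. In the sketch below, `fake_paraphraser` is a placeholder that stands in for the real paraphrasing model, and `chunk_sentences` and `build_variants` are hypothetical names. Only the control flow (groups of one or two sentences, one random candidate per group) mirrors the approach described above.

```python
import random


def chunk_sentences(context: str, rng: random.Random) -> list:
    """Split on '.' and group consecutive sentences into chunks of one or two."""
    sentences = [s.strip() for s in context.split(".") if s.strip()]
    chunks, i = [], 0
    while i < len(sentences):
        size = rng.randint(1, 2)
        chunks.append(". ".join(sentences[i:i + size]) + ".")
        i += size
    return chunks


def fake_paraphraser(chunk: str, n: int) -> list:
    """Placeholder for the real model: returns n tagged copies of the chunk."""
    return [f"[v{k}] {chunk}" for k in range(n)]


def build_variants(context: str, n_variants: int, seed: int = 0) -> list:
    """Pick one candidate per chunk, at random, for each new context."""
    rng = random.Random(seed)
    alternatives = [fake_paraphraser(c, 3) for c in chunk_sentences(context, rng)]
    return [" ".join(rng.choice(options) for options in alternatives)
            for _ in range(n_variants)]


for variant in build_variants("A is first. B is second. C is third.", n_variants=2):
    print(variant)
```

Splitting on `.` is a deliberately naive choice here: abbreviations and decimal numbers would be cut in the wrong place, so a proper sentence tokenizer may be preferable for real text.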

In [7]:
import gc
import torch
import random
from tqdm.notebook import tqdm
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
from google.colab import drive

drive.mount('/content/drive')
gc.collect()
torch.cuda.empty_cache()

Mounted at /content/drive


In [8]:
generate_paraphrased_dataset = True

In [9]:
if generate_paraphrased_dataset:

  def generate_input(input_text: str) -> str:
    # Task prefix used by the paraphraser; the end-of-sequence token is appended explicitly.
    output_text = f"paraphrase: {input_text} </s>"
    return output_text
  
  def generate_paraphrase(text: str, n: int) -> dict:
    """Return the input text together with n beam-search paraphrases of it."""
    input_text = generate_input(text)
    encoding = TOKENIZER.encode_plus(input_text, max_length=254, truncation=True, return_tensors="pt")
    input_ids, attention_mask = encoding["input_ids"].to(device), encoding["attention_mask"].to(device)
    MODEL.eval()
    beam_outputs = MODEL.generate(
        input_ids=input_ids, attention_mask=attention_mask,
        max_length=254,
        early_stopping=True,
        num_beams=15,
        num_return_sequences=n
    )
  
    output = {'input': text, 'output': []}
    for beam_output in beam_outputs:
      sent = TOKENIZER.decode(beam_output, skip_special_tokens=True, clean_up_tokenization_spaces=True)
      # Strip the "paraphrasedoutput: " prefix from each generation.
      output['output'].append(sent.replace('paraphrasedoutput: ', ''))
    return output

  MODEL = AutoModelForSeq2SeqLM.from_pretrained("ramsrigouthamg/t5-large-paraphraser-diverse-high-quality")
  TOKENIZER = AutoTokenizer.from_pretrained("ramsrigouthamg/t5-large-paraphraser-diverse-high-quality")

  device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
  print("device ", device)
  MODEL = MODEL.to(device)

  new_df = []
  examples_processed = 0
  for _, row in tqdm(df.iloc[2000:2082, :].iterrows(), total=82):
    example_id = row['id']
    question = row['question']
    answers = row['answers']
    original_context = row['context']
    # Keep the original context as the first row for this id.
    new_df.append([example_id, original_context, question, answers])

    # Naive sentence split; the piece after the final period is dropped.
    sentences = original_context.split('.')[:-1]
    alternatives = []
    i = 0
    while i < len(sentences):
      # Paraphrase one or two sentences at a time.
      end = i + random.randint(1, 2)
      sentence = '.'.join(sentences[i:end])
      i = end
      paraphrases = generate_paraphrase(sentence + '.', 5)
      # Deduplicate the candidates for this chunk.
      alternatives.append(list(set(paraphrases['output'])))

    # Build 5 new contexts by picking one random candidate per chunk.
    for _ in range(5):
      context = ' '.join(random.choice(alternative) for alternative in alternatives)
      new_df.append([example_id, context, question, answers])
    examples_processed += 1
    # Checkpoint to Drive after every example.
    pd.DataFrame(new_df, columns=['id', 'context', 'question', 'answers']).\
      to_csv('/content/drive/MyDrive/squadshifts_parafraseado_4.csv', index=False)

In [12]:
generate_paraphrase('Each brotherhood elects two delegates who take part in the National Ecclesiastical Assembly.', 3)

{'input': 'Each brotherhood elects two delegates who take part in the National Ecclesiastical Assembly.',
 'output': ['Two delegates are elected by each brotherhood to serve in the National Ecclesiastical Assembly.',
  'Two delegates are elected by each brotherhood in the National Ecclesiastical Assembly.',
  'Each brotherhood elects two delegates from the National Ecclesiastical Assembly.']}

In [10]:
df_squadshifts_paraphrased = pd.read_csv('/content/drive/MyDrive/squadshifts_paraphrased.csv')

## 4. Predicting answer for paraphrased text using T5

In [15]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

T5_TOKENIZER = AutoTokenizer.from_pretrained("t5-base")
T5_MODEL = AutoModelForSeq2SeqLM.from_pretrained("t5-base")

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print ("device ",device)
T5_MODEL = T5_MODEL.to(device)

For now, this behavior is kept to avoid breaking backwards compatibility when padding/encoding with `truncation is True`.
- Be aware that you SHOULD NOT rely on t5-base automatically truncating your input to 512 when padding/encoding.
- If you want to encode/pad to sequences longer than 512 you can either instantiate this tokenizer with `model_max_length` or pass `max_length` when encoding/padding.


device  cuda


In [18]:
scored_df_data = []
T5_MODEL.eval()
for _, row in tqdm(df_squadshifts_paraphrased.iterrows(), total=df_squadshifts_paraphrased.shape[0]):
  example_id = row['id']
  context = row['context']
  question = row['question']
  answers = row['answers']
  # Text-to-text input format that T5 uses for SQuAD-style questions.
  input_text = f'question: {question} context: {context}'

  encoding = T5_TOKENIZER(input_text, max_length=4000, truncation=True, return_tensors="pt")
  input_ids = encoding["input_ids"].to(device)
  
  output = T5_MODEL.generate(
      input_ids=input_ids,
      max_length=4000,
      return_dict_in_generate=True,
      output_scores=True
  )

  # output.scores holds one score tensor per generated token; average the
  # highest softmax probability of each step as a rough confidence value.
  mean_token_prob = sum([torch.nn.functional.softmax(score[0], dim=0).max().item() for score in output.scores]) / len(output.scores)
  predicted_answer = T5_TOKENIZER.decode(output.sequences[0], skip_special_tokens=True, clean_up_tokenization_spaces=True)
  scored_df_data.append([example_id, context, question, answers, mean_token_prob, predicted_answer])

  0%|          | 0/48120 [00:00<?, ?it/s]
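
The `prob` value computed in the loop above can be illustrated with plain Python. This is a sketch under the assumption of greedy decoding, where the largest softmax probability at each step is also the probability of the emitted token. The logits are invented and `mean_top_probability` is a hypothetical name.

```python
import math


def softmax(logits: list) -> list:
    """Numerically stable softmax over a list of raw scores."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mean_top_probability(step_logits: list) -> float:
    """Average, over decoding steps, of the largest softmax probability."""
    top_probs = [max(softmax(logits)) for logits in step_logits]
    return sum(top_probs) / len(top_probs)


# Two decoding steps over a toy 3-token vocabulary (invented numbers).
confident = [[5.0, 0.0, 0.0], [0.0, 6.0, 0.0]]
hesitant = [[1.0, 1.0, 1.0], [0.5, 0.4, 0.6]]
print(mean_top_probability(confident) > mean_top_probability(hesitant))
```

Averaging per-token probabilities is only one possible confidence proxy. The product of the token probabilities (the sequence likelihood) penalizes long answers more strongly.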

In [19]:
df_scored = pd.DataFrame(scored_df_data, columns = ['id', 'context', 'question', 'answers', 'prob', 'predicted_answer'])
df_scored.to_csv('/content/drive/MyDrive/scored_squadshifts_paraphrased.csv', index=False)

## 5. Computing and aggregating F1 score per id

In [None]:
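# Mean of the best F1 over all contexts (original and paraphrased) that share an id.
# The `max_f1` column comes from `custom_f1s` (section 1).
id_mean = df_scored.groupby("id", as_index=False)["max_f1"].mean().rename(columns={"max_f1": "mean_f1"})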

In [None]:
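df = df.dropna().merge(id_mean, on="id")
# Uncertainty is the complement of the mean F1: answers that stay correct under paraphrase get a low value.
df["uncertainty"] = (1 - df["mean_f1"]).round(2)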
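
As a worked example of the aggregation, the sketch below reproduces the arithmetic with plain Python on invented scores: group the best F1 of each context by `id`, take the mean, and subtract it from 1. The helper name `uncertainty_per_id` is hypothetical.

```python
from collections import defaultdict


def uncertainty_per_id(rows: list) -> dict:
    """Map each id to 1 - mean(max_f1) over all rows sharing that id."""
    scores = defaultdict(list)
    for example_id, max_f1 in rows:
        scores[example_id].append(max_f1)
    return {
        example_id: round(1 - sum(values) / len(values), 2)
        for example_id, values in scores.items()
    }


# Invented scores: one original context plus paraphrases per id.
rows = [("q1", 1.0), ("q1", 1.0), ("q1", 0.5), ("q2", 0.0), ("q2", 0.5)]
print(uncertainty_per_id(rows))  # {'q1': 0.17, 'q2': 0.75}
```

Here `q1` is answered well under most rewordings (mean F1 of about 0.83), so its uncertainty is low, while `q2` fails on most of them.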

In [None]:
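import ast


def create_not_bucket_uncertainty_label(x):
  """Append the numeric uncertainty to every reference answer."""
  uncertainty = x["uncertainty"]
  # `answers` is stored as the string form of a dict; parse it without executing code.
  answers: dict = ast.literal_eval(x["answers"])
  true_labels = [f"{answer} Uncertainty: {uncertainty}" for answer in answers["text"]]
  return {"text": true_labels}


df["answers_not_bucket_uncertainty"] = df.apply(create_not_bucket_uncertainty_label, axis=1)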

In [None]:
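# Uncertainty is rounded to two decimals, so no value falls in the gaps between these closed intervals.
bins = pd.IntervalIndex.from_tuples([(0.0, 0.32), (0.33, 0.65), (0.66, 1.)], closed="both")
bucket_uncertainty = pd.cut(df["uncertainty"], bins=bins)
print(bucket_uncertainty.cat.categories)
# Rename the interval categories to readable labels.
bucket_uncertainty = bucket_uncertainty.cat.rename_categories(["low", "medium", "high"])
print(bucket_uncertainty.cat.categories)
df["bucket_uncertainty"] = bucket_uncertainty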
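
The same bucketing can be written as a plain function, which makes the thresholds explicit. This is a sketch using the interval edges from the cell above; the function name is hypothetical, and values are rounded to two decimals first so that nothing can fall between 0.32 and 0.33 or between 0.65 and 0.66.

```python
def bucket_uncertainty_label(value: float) -> str:
    """Map an uncertainty in [0, 1] to a coarse label after rounding to two decimals."""
    if not 0.0 <= value <= 1.0:
        raise ValueError(f"uncertainty out of range: {value}")
    value = round(value, 2)
    if value <= 0.32:
        return "low"
    if value <= 0.65:
        return "medium"
    return "high"


for u in (0.0, 0.32, 0.33, 0.65, 0.66, 1.0):
    print(u, bucket_uncertainty_label(u))
```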

In [None]:
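import ast


def create_bucket_uncertainty_label(x):
  """Append the bucketed uncertainty (low/medium/high) to every reference answer."""
  uncertainty = x["bucket_uncertainty"]
  # `answers` is stored as the string form of a dict; parse it without executing code.
  answers: dict = ast.literal_eval(x["answers"])
  true_labels = [f"{answer} Uncertainty: {uncertainty}" for answer in answers["text"]]
  return {"text": true_labels}

df["answers_bucket_uncertainty"] = df.apply(create_bucket_uncertainty_label, axis=1)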

In [None]:
from typing import Tuple

import numpy as np


def split_dataset(df: pd.DataFrame, approximated_train_pct: float, approximated_eval_pct: float) -> Tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame]:
  """Split by context so that all rows sharing a context land in the same split."""
  df = df.sample(frac=1)
  # One integer code per distinct context.
  df["context_codes"] = df["context"].astype("category").cat.codes
  
  dataset_max = df["context_codes"].max()
  # Contexts with a code up to this threshold go to train.
  max_train_idx = int(np.ceil(dataset_max*approximated_train_pct))
  train_df = df.query(f"context_codes<={max_train_idx}").copy()
  aux = df.query(f"context_codes > {max_train_idx}").copy()
  # The remaining contexts are divided between eval and test.
  max_eval_index = int(np.ceil(max_train_idx + (dataset_max - max_train_idx)*approximated_eval_pct))
  eval_df = aux.query(f"context_codes<={max_eval_index}").copy()
  test_df = aux.query(f"context_codes > {max_eval_index}").copy()

  return train_df, eval_df, test_df

approximated_eval_pct = 0.5
approximated_train_pct = 0.85
train_df, eval_df, test_df = split_dataset(df, approximated_train_pct, approximated_eval_pct)
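
The point of splitting on context codes is that no context ends up in two splits. The sketch below shows the same idea on plain lists. It is a variant rather than the function above: it shuffles the distinct contexts before cutting, and `split_by_context` and its `seed` parameter are hypothetical.

```python
import random


def split_by_context(contexts: list, train_pct: float, eval_pct: float, seed: int = 0) -> dict:
    """Assign whole contexts to train/eval/test so no context spans two splits."""
    unique = sorted(set(contexts))
    random.Random(seed).shuffle(unique)
    n_train = round(len(unique) * train_pct)
    n_eval = round((len(unique) - n_train) * eval_pct)
    groups = {
        "train": set(unique[:n_train]),
        "eval": set(unique[n_train:n_train + n_eval]),
        "test": set(unique[n_train + n_eval:]),
    }
    # Return row indices per split, keeping every row of a context together.
    return {name: [i for i, c in enumerate(contexts) if c in members]
            for name, members in groups.items()}


contexts = [f"context {i // 3}" for i in range(60)]  # 20 contexts, 3 rows each
splits = split_by_context(contexts, train_pct=0.85, eval_pct=0.5)
print({name: len(rows) for name, rows in splits.items()})
```

Because the cut is made over contexts rather than rows, the row-level percentages are only approximate whenever contexts have different numbers of questions.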

In [None]:
print(f"Approximately {round(approximated_train_pct,3)} of the dataset goes to training")
print(f"Train Percentage: {round(len(train_df)/len(df),3)}")
print(f"Approximately {round(1-approximated_train_pct,3)} of the dataset is split between test and validation, with {approximated_eval_pct} of it going to validation.")
print(f"Eval Percentage: {round(len(eval_df)/len(df),3)}")
print(f"Test Percentage: {round(len(test_df)/len(df),3)}")

In [None]:
print("Showing that the splits share no contexts")
print(set(train_df.context_codes.unique()).intersection(test_df.context_codes.unique()))
print(set(train_df.context_codes.unique()).intersection(eval_df.context_codes.unique()))
print(set(eval_df.context_codes.unique()).intersection(test_df.context_codes.unique()))

In [None]:
train_df.to_csv("/content/drive/MyDrive/squadshifts_aggregated_train.csv", index=False)
eval_df.to_csv("/content/drive/MyDrive/squadshifts_aggregated_eval.csv", index=False)
test_df.to_csv("/content/drive/MyDrive/squadshifts_aggregated_test.csv", index=False)