# Document Expansion with T5 doc2query evaluated on TREC-COVID with nDCG@10

**Train a seq2seq model (starting from T5-base) on the document expansion task using doc2query**

- Train on the "tiny" MS MARCO dataset for the doc2query task
  - https://storage.googleapis.com/unicamp-dl/ia368dd_2023s1/msmarco/msmarco_triples.train.tiny.tsv
- doc2query: the input is the passage and the target is the query
- Note that only (query, relevant passage) pairs are used for training.
- Training is relatively fast (<1 hour).
- Validate every X steps using sacreBLEU
- The slow part of this exercise is pre-indexing: for each document in the collection, we have to generate one or more queries, which are then concatenated to the original document, and this "expanded" document is indexed.
- Evaluate on TREC-COVID (171K docs), since it is smaller than MS MARCO/TREC-DL 2020 (8.8M passages).
- Inverted index for TREC-COVID in Pyserini: beir-v1.0.0-trec-covid-flat
- Corpus and queries on Hugging Face: https://huggingface.co/datasets/BeIR/trec-covid
- qrels: https://huggingface.co/datasets/BeIR/trec-covid-qrels
- Use nDCG@10
- Compare BM25 with and without the documents expanded by doc2query
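As a rough illustration of how the training pairs could be built, the sketch below reads a triples TSV and keeps only the (passage, query) pair from each row. It assumes each row is laid out as query, positive passage, negative passage; that layout and the helper name `doc2query_pairs` are assumptions, and the sample rows are made up for illustration.

```python
import csv
import io


def doc2query_pairs(tsv_file):
    """Yield (passage, query) training pairs from a triples TSV stream.

    Assumes each row is: query <TAB> positive passage <TAB> negative passage.
    """
    reader = csv.reader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) < 2:
            continue  # skip malformed or empty rows
        query, positive = row[0], row[1]
        # doc2query: the passage is the model input, the query is the target.
        # The negative passage (third column) is not used.
        yield positive, query


# Tiny in-memory stand-in for the real file (made-up rows for illustration).
sample = io.StringIO(
    "what is python\tPython is a programming language.\tA python is a snake.\n"
    "capital of france\tParis is the capital of France.\tBerlin is in Germany.\n"
)
pairs = list(doc2query_pairs(sample))
for passage, query in pairs:
    print(f"input: {passage!r} -> target: {query!r}")
```

With the real file, the same function could be fed an `open(path, encoding="utf-8", newline="")` handle instead of the in-memory sample.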

[![google colab link](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/tcvieira/IA368-DD-012023/blob/main/assingments/05/notebook.ipynb)

# Installs

In [1]:
!pip install transformers -q
!pip install sentencepiece -q
#!pip install datasets -q
!pip install ipython-autotime -q
%load_ext autotime

time: 579 µs (started: 2023-04-02 15:51:11 +00:00)


In [2]:
#!pip install numba -q

time: 270 µs (started: 2023-04-02 15:51:11 +00:00)


# Imports

In [3]:
# import torch

# device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
# device

time: 364 µs (started: 2023-04-02 15:51:11 +00:00)


# Dataset

In [4]:
# from google.colab import drive
# drive.mount('/content/drive', force_remount=True)

time: 400 µs (started: 2023-04-02 15:51:11 +00:00)


In [5]:
# PATH_DATASET = '/content/drive/MyDrive/unicamp/IA368DD/class_5'

time: 634 µs (started: 2023-04-02 15:51:11 +00:00)


In [6]:
#!gsutil cp gs://unicamp-dl/ia025a_2022s1/aula9/sample-1gb.txt {PATH_DATASET}/sample-1gb.txt

time: 341 µs (started: 2023-04-02 15:51:11 +00:00)


## small dataset for testing

In [7]:
#!sed -n '1,100p' {PATH_DATASET}/sample-1gb.txt > {PATH_DATASET}/sample_small.txt

time: 352 µs (started: 2023-04-02 15:51:11 +00:00)


# Select Dataset

In [8]:
#dataset = small_dataset
#dataset = base_dataset

time: 386 µs (started: 2023-04-02 15:51:11 +00:00)


# Parameters

In [9]:
MODEL_NAME = 't5-base'
MAX_SEQ_LENGTH=512
BATCH_SIZE=8
EPOCHS=2
# MODEL_OUTPUT_FOLDER=f'{PATH_DATASET}/model_output'
# MODEL_SAVE_FOLDER=f'{PATH_DATASET}/model_save'
# TOKENIZER_SAVE_FOLDER=f'{PATH_DATASET}/tokenizer_save'

time: 815 µs (started: 2023-04-02 15:51:11 +00:00)


# Tokenizer

In [12]:
from transformers import T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained(MODEL_NAME)

For now, this behavior is kept to avoid breaking backwards compatibility when padding/encoding with `truncation is True`.
- Be aware that you SHOULD NOT rely on t5-base automatically truncating your input to 512 when padding/encoding.
- If you want to encode/pad to sequences longer than 512 you can either instantiate this tokenizer with `model_max_length` or pass `max_length` when encoding/padding.


time: 8.44 s (started: 2023-04-02 15:51:34 +00:00)


# Model

## Initial Tests with simple T5-base

In [15]:
from transformers import T5ForConditionalGeneration

# Load the pre-trained T5 model (the tokenizer was loaded in the previous cell)
model = T5ForConditionalGeneration.from_pretrained(MODEL_NAME)
# Encode a prompt and decode the best beam. "generate queries: " is not a task
# prefix that t5-base was trained on, so the model simply echoes the input.
text = "This is an example document."
input_ids = tokenizer.encode("generate queries: " + text, return_tensors='pt')
output = model.generate(input_ids=input_ids, max_length=100, num_beams=4, early_stopping=True)
queries = tokenizer.decode(output[0], skip_special_tokens=True)
print("Generated queries:", queries)

Generated queries: generate queries: This is an example document.
time: 8.23 s (started: 2023-04-02 16:05:50 +00:00)


## Initial tests with T5 with prefix

In [16]:
from transformers import T5Tokenizer, T5ForConditionalGeneration

model_name = 'doc2query/all-with_prefix-t5-base-v1'
tokenizer = T5Tokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)

prefix = "answer2question"
text = "Python is an interpreted, high-level and general-purpose programming language. Python's design philosophy emphasizes code readability with its notable use of significant whitespace. Its language constructs and object-oriented approach aim to help programmers write clear, logical code for small and large-scale projects."

text = prefix+": "+text

input_ids = tokenizer.encode(text, max_length=384, truncation=True, return_tensors='pt')
outputs = model.generate(
    input_ids=input_ids,
    max_length=64,
    do_sample=True,
    top_p=0.95,
    num_return_sequences=5)

print("Text:")
print(text)

print("\nGenerated Queries:")
for i, output in enumerate(outputs, start=1):
    query = tokenizer.decode(output, skip_special_tokens=True)
    print(f'{i}: {query}')

Text:
answer2question: Python is an interpreted, high-level and general-purpose programming language. Python's design philosophy emphasizes code readability with its notable use of significant whitespace. Its language constructs and object-oriented approach aim to help programmers write clear, logical code for small and large-scale projects.

Generated Queries:
1: Why is Python not good in CS?
2: What is python as a programming language?
3: What does python use in our daily lives?
4: Why would you choose Python as a programming language?
5: What is the language called for Python?
time: 1min 1s (started: 2023-04-02 16:06:12 +00:00)
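The expansion step described in the task list (concatenating generated queries to the original document before indexing) could look roughly like the sketch below. The helper `expand_document` is hypothetical, and dropping duplicate queries is an assumption of this sketch rather than something the assignment requires; sampled generations often repeat, so it seemed a reasonable default.

```python
def expand_document(text, queries):
    """Append de-duplicated generated queries to the original document text."""
    seen = set()
    unique = []
    for q in queries:
        key = q.strip().lower()
        if key and key not in seen:
            seen.add(key)
            unique.append(q.strip())
    # The expanded document is the original text followed by the queries.
    return " ".join([text.strip()] + unique)


doc = "Python is an interpreted, high-level and general-purpose programming language."
generated = [
    "What is python as a programming language?",
    "Why would you choose Python as a programming language?",
    "What is python as a programming language?",  # duplicate, dropped
]
expanded = expand_document(doc, generated)
print(expanded)
```

The resulting string is what would be written to the collection that BM25 indexes, in place of the original document text.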


In [None]:
model_size = sum(t.numel() for t in model.parameters())
print(f"T5 size: {model_size/1000**2:.1f}M parameters")

OPT-125m size: 125.2M parameters
time: 1.55 ms (started: 2023-03-30 02:44:54 +00:00)


# Tokenization

In [None]:
tokenized_dataset = dataset.map(lambda x: tokenizer(x["text"], 
                                      truncation=True, 
                                      padding="max_length", 
                                      max_length=MAX_SEQ_LENGTH), 
                                      batched=True, 
                                      num_proc=4, 
                                      remove_columns=["text"],
                                      #return_overflowing_tokens=True,
                                      #return_length=True,
                                     )



time: 96.9 ms (started: 2023-03-30 02:44:54 +00:00)


In [None]:
#tokenized_dataset.save_to_disk(TOKENIZER_SAVE_FOLDER)

time: 3.05 s (started: 2023-03-30 02:44:54 +00:00)


In [None]:
#tokenized_dataset = load_dataset(TOKENIZER_SAVE_FOLDER)

time: 227 µs (started: 2023-03-30 02:41:52 +00:00)


In [None]:
# print(tokenized_dataset)
# print(f"{len(tokenized_dataset['train']['input_ids'][0])} tokens - {tokenized_dataset['train']['input_ids'][0]}")

time: 228 µs (started: 2023-03-30 02:27:06 +00:00)


# Training

In [None]:
# Check which GPU is available (the run captured below used an A100 with 40 GB of memory)
!nvidia-smi

Thu Mar 30 02:44:57 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12    Driver Version: 525.85.12    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  NVIDIA A100-SXM...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   40C    P0    55W / 400W |   2507MiB / 40960MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
                                                                               

In [None]:
from numba import cuda 

device = cuda.get_current_device()
device.reset()

time: 762 ms (started: 2023-03-30 02:35:44 +00:00)


In [None]:
from transformers import TrainingArguments

training_args = TrainingArguments(output_dir=MODEL_OUTPUT_FOLDER,
                                  num_train_epochs=EPOCHS, 
                                  per_device_train_batch_size=BATCH_SIZE,
                                  per_device_eval_batch_size=BATCH_SIZE, 
                                  evaluation_strategy="epoch", # TODO: change to steps
                                  save_strategy="epoch", # TODO: change to steps
                                  logging_strategy="epoch",  # TODO: change to steps
                                  learning_rate=2e-5,  # TODO: change to the value used in docTTTTTquery
                                  weight_decay=0.01, # TODO: change to the value used in docTTTTTquery
                                  fp16=True # Use mixed precision
                                )

time: 2.8 ms (started: 2023-03-30 02:44:57 +00:00)


In [None]:
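from transformers import DataCollatorForLanguageModeling, Trainer

# NOTE: DataCollatorForLanguageModeling(mlm=False) builds causal-LM labels from input_ids.
# For T5 doc2query training (passage -> query), DataCollatorForSeq2Seq with explicit
# `labels` would be the matching collator.
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
trainer = Trainer(model=model, 
                  args=training_args, 
                  train_dataset=tokenized_dataset["train"],
                  eval_dataset=tokenized_dataset["validation"], 
                  data_collator=data_collator)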

time: 5.62 ms (started: 2023-03-30 02:44:57 +00:00)


In [None]:
trainer.train()

You're using a GPT2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Epoch,Training Loss,Validation Loss
1,2.571,2.339543
2,2.3275,2.269327


Exception ignored in: <generator object Json._generate_tables at 0x7f5ed3aeb200>
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/dist-packages/datasets/packaged_modules/json/json.py", line 158, in _generate_tables
    batch_idx += 1
OSError: [Errno 107] Transport endpoint is not connected


TrainOutput(global_step=50000, training_loss=2.449279375, metrics={'train_runtime': 5575.0933, 'train_samples_per_second': 71.748, 'train_steps_per_second': 8.968, 'total_flos': 1.045168128e+17, 'train_loss': 2.449279375, 'epoch': 2.0})

time: 1h 32min 55s (started: 2023-03-30 02:44:57 +00:00)


In [None]:
model.save_pretrained(MODEL_SAVE_FOLDER)
tokenizer.save_pretrained(MODEL_SAVE_FOLDER)

('/content/drive/MyDrive/unicamp/IA368DD/class_4/model_save/tokenizer_config.json',
 '/content/drive/MyDrive/unicamp/IA368DD/class_4/model_save/special_tokens_map.json',
 '/content/drive/MyDrive/unicamp/IA368DD/class_4/model_save/vocab.json',
 '/content/drive/MyDrive/unicamp/IA368DD/class_4/model_save/merges.txt',
 '/content/drive/MyDrive/unicamp/IA368DD/class_4/model_save/added_tokens.json',
 '/content/drive/MyDrive/unicamp/IA368DD/class_4/model_save/tokenizer.json')

time: 1.6 s (started: 2023-03-30 04:19:47 +00:00)


In [None]:
#!zip model {MODEL_SAVE_FOLDER}/*

time: 283 µs (started: 2023-03-29 22:55:03 +00:00)


# Evaluation
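
The task asks for nDCG@10. As a reference for what that metric computes, here is a minimal pure-Python sketch using linear gains and a log2 rank discount. This is one common formulation; evaluation tools may differ in details such as the gain function, so treat it as an illustration rather than the exact scorer used for reported numbers. The function names and the toy qrels are assumptions for this example.

```python
import math


def dcg_at_k(gains, k=10):
    """Discounted cumulative gain over the first k graded relevance values."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains[:k]))


def ndcg_at_k(ranked_doc_ids, qrels, k=10):
    """nDCG@k for one query; qrels maps doc_id -> graded relevance."""
    gains = [qrels.get(doc_id, 0) for doc_id in ranked_doc_ids]
    ideal = sorted(qrels.values(), reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0


# Toy example: "d4" is unjudged and is treated as non-relevant.
qrels = {"d1": 2, "d2": 1, "d3": 0}
ranking = ["d2", "d4", "d1"]
score = ndcg_at_k(ranking, qrels)
print(f"nDCG@10 = {score:.4f}")
```

The score for a run would then be the mean of this per-query value over all TREC-COVID queries, computed once for plain BM25 and once for BM25 over the expanded documents.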

# Results

Train/validation split: 200,000 / 50,000 examples

| seq length | epochs | batch size | gpu | time/epoch | val_ppl |
|:----------:|:------:|:----------:|:---:|:----------:|:-------:|
|     -      |   -    |     -      |  -  |     -      |    -    |
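
The `val_ppl` column can be derived from the validation loss, assuming that loss is a mean per-token cross-entropy in nats (which is an assumption about how the loss was reported). Under that assumption, perplexity is simply the exponential of the loss. The sketch below applies this to the two validation losses logged during training above.

```python
import math


def perplexity(mean_cross_entropy_loss):
    """Perplexity from a mean per-token cross-entropy loss (natural log)."""
    return math.exp(mean_cross_entropy_loss)


# Validation losses per epoch, copied from the training log in this notebook.
val_losses = {1: 2.339543, 2: 2.269327}
for epoch, loss in val_losses.items():
    print(f"epoch {epoch}: val_loss={loss:.4f} val_ppl={perplexity(loss):.2f}")
```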

----