# Document Expansion with T5 doc2query evaluated on TREC-COVID with nDCG@10

**Train a seq2seq model (starting from T5-base) on the document expansion task using doc2query**

- Use the "tiny" MS MARCO dataset as training data for the doc2query task
  - https://storage.googleapis.com/unicamp-dl/ia368dd_2023s1/msmarco/msmarco_triples.train.tiny.tsv
- doc2query: the input is the passage and the target is the query
- Note that only (query, relevant passage) pairs are used for training.
- Training is relatively fast (<1 hour).
- Validate every X steps using sacreBLEU
- The slow part of this exercise is the pre-indexing step: for each document in the collection, we have to generate one or more queries, which are then concatenated to the original document, and this "expanded" document is what gets indexed.
- Evaluate on TREC-COVID (171K docs), since it is smaller than MS MARCO/TREC-DL 2020 (8.8M passages).
- TREC-COVID inverted index in Pyserini: beir-v1.0.0-trec-covid-flat
- Corpus and queries on HF: https://huggingface.co/datasets/BeIR/trec-covid
- qrels: https://huggingface.co/datasets/BeIR/trec-covid-qrels
- Use nDCG@10
- Compare BM25 with and without the documents expanded by doc2query
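
The doc2query setup described above (passage as input, query as target, positive pairs only) can be sketched as follows. This is a minimal illustration, not the notebook's actual loading code. It assumes each row of the triples file is `query <TAB> positive passage <TAB> negative passage`, which is the usual MS MARCO triples layout but is not confirmed here. The function name `read_doc2query_pairs` and the sample rows are hypothetical.

```python
import csv
import io


def read_doc2query_pairs(tsv_file):
    """Yield (passage, query) pairs for doc2query training.

    ASSUMPTION: every row is query <TAB> positive passage <TAB> negative passage.
    The negative passage is ignored, since only relevant pairs are used.
    """
    reader = csv.reader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    seen = set()
    for row in reader:
        if len(row) < 2:
            continue  # skip malformed rows
        query, positive = row[0].strip(), row[1].strip()
        if (query, positive) in seen:
            continue  # the same positive pair can appear in several triples
        seen.add((query, positive))
        # doc2query: the passage is the model input, the query is the target
        yield positive, query


# Tiny in-memory sample standing in for the real TSV file (made-up rows).
sample = io.StringIO(
    "what is bm25\tBM25 is a ranking function.\tCats are mammals.\n"
    "what is bm25\tBM25 is a ranking function.\tParis is in France.\n"
)
pairs = list(read_doc2query_pairs(sample))
print(pairs)
```

With the real file, the same function could be called on `open(path, encoding="utf-8", newline="")`. Deduplicating the pairs is a design choice made for this sketch. Whether repeated positives should be kept is left open.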

[![google colab link](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/tcvieira/IA368-DD-012023/blob/main/assingments/05/notebook.ipynb)

# Installs

In [1]:
!pip install transformers -q
#!pip install datasets -q
!pip install ipython-autotime -q
%load_ext autotime

time: 732 µs (started: 2023-04-02 12:35:21 -03:00)


In [30]:
!pip install numba -q

time: 4.2 s (started: 2023-03-30 02:44:39 +00:00)


# Imports

In [31]:
import torch

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
device

device(type='cuda')

time: 3.38 ms (started: 2023-03-30 02:44:43 +00:00)


# Dataset

In [32]:
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

Mounted at /content/drive
time: 5.14 s (started: 2023-03-30 02:44:43 +00:00)


In [33]:
PATH_DATASET = '/content/drive/MyDrive/unicamp/IA368DD/class_5'

time: 410 µs (started: 2023-03-30 02:44:48 +00:00)


In [34]:
#!gsutil cp gs://unicamp-dl/ia025a_2022s1/aula9/sample-1gb.txt {PATH_DATASET}/sample-1gb.txt

time: 241 µs (started: 2023-03-30 02:44:48 +00:00)


## small dataset for testing

In [37]:
#!sed -n '1,100p' {PATH_DATASET}/sample-1gb.txt > {PATH_DATASET}/sample_small.txt

time: 252 µs (started: 2023-03-30 02:44:49 +00:00)


# Select Dataset

In [41]:
#dataset = small_dataset
dataset = base_dataset

time: 341 µs (started: 2023-03-30 02:44:51 +00:00)


# Parameters

In [42]:
MODEL_NAME = 't5-base'
MAX_SEQ_LENGTH=512
BATCH_SIZE=8
EPOCHS=2
MODEL_OUTPUT_FOLDER=f'{PATH_DATASET}/model_output'
MODEL_SAVE_FOLDER=f'{PATH_DATASET}/model_save'
TOKENIZER_SAVE_FOLDER=f'{PATH_DATASET}/tokenizer_save'

time: 519 µs (started: 2023-03-30 02:44:51 +00:00)


# Tokenizer

In [None]:
from transformers import T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained(MODEL_NAME)

time: 415 ms (started: 2023-03-30 02:44:51 +00:00)


# Model

## Initial Tests with T5

In [44]:
from transformers import T5ForConditionalGeneration

# Load the pre-trained T5 model (the tokenizer was loaded in the previous section)
model = T5ForConditionalGeneration.from_pretrained(MODEL_NAME)
# Set the input text and generate a query with beam search
text = "This is an example document."
input_ids = tokenizer.encode("generate queries: " + text, return_tensors='pt')
output = model.generate(input_ids=input_ids, max_length=100, num_beams=4, early_stopping=True)
queries = tokenizer.decode(output[0], skip_special_tokens=True)
print("Generated queries:", queries)

OPTForCausalLM(
  (model): OPTModel(
    (decoder): OPTDecoder(
      (embed_tokens): Embedding(50272, 768, padding_idx=1)
      (embed_positions): OPTLearnedPositionalEmbedding(2050, 768)
      (final_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (layers): ModuleList(
        (0): OPTDecoderLayer(
          (self_attn): OPTAttention(
            (k_proj): Linear(in_features=768, out_features=768, bias=True)
            (v_proj): Linear(in_features=768, out_features=768, bias=True)
            (q_proj): Linear(in_features=768, out_features=768, bias=True)
            (out_proj): Linear(in_features=768, out_features=768, bias=True)
          )
          (activation_fn): ReLU()
          (self_attn_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (fc1): Linear(in_features=768, out_features=3072, bias=True)
          (fc2): Linear(in_features=3072, out_features=768, bias=True)
          (final_layer_norm): LayerNorm((768,), eps=1e-05,

time: 2.07 s (started: 2023-03-30 02:44:52 +00:00)
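
Once queries can be generated for a passage, the expansion step from the task statement amounts to appending them to the original text before indexing. The sketch below shows one way to build such a record. The `{"id", "contents"}` layout follows Pyserini's JSON collection format, and the helper name `expand_document` and the example text are hypothetical. The query generation itself is not shown.

```python
import json


def expand_document(doc_id, text, queries):
    """Return an index-ready record with the generated queries appended."""
    # Drop empty strings and exact duplicates while keeping generation order.
    unique_queries = list(dict.fromkeys(q.strip() for q in queries if q.strip()))
    contents = " ".join([text.strip()] + unique_queries)
    return {"id": doc_id, "contents": contents}


# Made-up document and queries, standing in for model.generate() output.
record = expand_document(
    "doc1",
    "Coronaviruses are enveloped RNA viruses.",
    ["what is a coronavirus", "what is a coronavirus", ""],
)
jsonl_line = json.dumps(record)  # one line of a JSONL collection file
print(jsonl_line)
```

Removing duplicate queries is an assumption of this sketch. Keeping repeats would raise the term frequency of the repeated words, which changes the BM25 scores.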


In [45]:
model_size = sum(t.numel() for t in model.parameters())
print(f"OPT-125m size: {model_size/1000**2:.1f}M parameters")

OPT-125m size: 125.2M parameters
time: 1.55 ms (started: 2023-03-30 02:44:54 +00:00)


# Tokenization

In [46]:
tokenized_dataset = dataset.map(lambda x: tokenizer(x["text"], 
                                      truncation=True, 
                                      padding="max_length", 
                                      max_length=MAX_SEQ_LENGTH), 
                                      batched=True, 
                                      num_proc=4, 
                                      remove_columns=["text"],
                                      #return_overflowing_tokens=True,
                                      #return_length=True,
                                     )



time: 96.9 ms (started: 2023-03-30 02:44:54 +00:00)


In [47]:
#tokenized_dataset.save_to_disk(TOKENIZER_SAVE_FOLDER)

Saving the dataset (0/2 shards):   0%|          | 0/200000 [00:00<?, ? examples/s]

Saving the dataset (0/1 shards):   0%|          | 0/50000 [00:00<?, ? examples/s]

time: 3.05 s (started: 2023-03-30 02:44:54 +00:00)


In [22]:
#tokenized_dataset = load_dataset(TOKENIZER_SAVE_FOLDER)

time: 227 µs (started: 2023-03-30 02:41:52 +00:00)


In [20]:
# print(tokenized_dataset)
# print(f"{len(tokenized_dataset['train']['input_ids'][0])} tokens - {tokenized_dataset['train']['input_ids'][0]}")

time: 228 µs (started: 2023-03-30 02:27:06 +00:00)


# Training

In [48]:
# Check which GPU the runtime provides (the output recorded below reports an A100 with 40 GB)
!nvidia-smi

Thu Mar 30 02:44:57 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12    Driver Version: 525.85.12    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  NVIDIA A100-SXM...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   40C    P0    55W / 400W |   2507MiB / 40960MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [29]:
from numba import cuda 

device = cuda.get_current_device()
device.reset()

time: 762 ms (started: 2023-03-30 02:35:44 +00:00)


In [49]:
from transformers import TrainingArguments

training_args = TrainingArguments(output_dir=MODEL_OUTPUT_FOLDER,
                                  num_train_epochs=EPOCHS, 
                                  per_device_train_batch_size=BATCH_SIZE,
                                  per_device_eval_batch_size=BATCH_SIZE, 
                                  evaluation_strategy="epoch", # TODO: change to steps
                                  save_strategy="epoch", # TODO: change to steps
                                  logging_strategy="epoch",  # TODO: change to steps
                                  learning_rate=2e-5,  # TODO: match the value used in docTTTTTquery
                                  weight_decay=0.01, # TODO: match the value used in docTTTTTquery
                                  fp16=True # Use mixed precision
                                )

time: 2.8 ms (started: 2023-03-30 02:44:57 +00:00)


In [50]:
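from transformers import DataCollatorForLanguageModeling, Trainer

# NOTE: DataCollatorForLanguageModeling(mlm=False) copies input_ids into the labels (causal LM).
# A seq2seq model such as T5 would normally use DataCollatorForSeq2Seq with the query as labels.
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
trainer = Trainer(model=model, 
                  args=training_args, 
                  train_dataset=tokenized_dataset["train"],
                  eval_dataset=tokenized_dataset["validation"], 
                  data_collator=data_collator)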

time: 5.62 ms (started: 2023-03-30 02:44:57 +00:00)


In [51]:
trainer.train()

You're using a GPT2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Epoch,Training Loss,Validation Loss
1,2.571,2.339543
2,2.3275,2.269327


Exception ignored in: <generator object Json._generate_tables at 0x7f5ed3aeb200>
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/dist-packages/datasets/packaged_modules/json/json.py", line 158, in _generate_tables
    batch_idx += 1
OSError: [Errno 107] Transport endpoint is not connected


TrainOutput(global_step=50000, training_loss=2.449279375, metrics={'train_runtime': 5575.0933, 'train_samples_per_second': 71.748, 'train_steps_per_second': 8.968, 'total_flos': 1.045168128e+17, 'train_loss': 2.449279375, 'epoch': 2.0})

time: 1h 32min 55s (started: 2023-03-30 02:44:57 +00:00)


In [52]:
model.save_pretrained(MODEL_SAVE_FOLDER)
tokenizer.save_pretrained(MODEL_SAVE_FOLDER)

('/content/drive/MyDrive/unicamp/IA368DD/class_4/model_save/tokenizer_config.json',
 '/content/drive/MyDrive/unicamp/IA368DD/class_4/model_save/special_tokens_map.json',
 '/content/drive/MyDrive/unicamp/IA368DD/class_4/model_save/vocab.json',
 '/content/drive/MyDrive/unicamp/IA368DD/class_4/model_save/merges.txt',
 '/content/drive/MyDrive/unicamp/IA368DD/class_4/model_save/added_tokens.json',
 '/content/drive/MyDrive/unicamp/IA368DD/class_4/model_save/tokenizer.json')

time: 1.6 s (started: 2023-03-30 04:19:47 +00:00)


In [58]:
#!zip model {MODEL_SAVE_FOLDER}/*

time: 283 µs (started: 2023-03-29 22:55:03 +00:00)


# Evaluation
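
The task statement asks for nDCG@10 on TREC-COVID. As a reference for what that metric computes, here is a minimal pure-Python sketch for a single query. It uses linear gain and a log2 rank discount. Evaluation toolkits may use a different gain (for example exponential), so absolute values can differ from theirs. The function names and the toy qrels are hypothetical.

```python
import math


def dcg_at_k(gains, k=10):
    """Discounted cumulative gain over the first k gains (linear gain)."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains[:k]))


def ndcg_at_k(ranked_doc_ids, qrels, k=10):
    """nDCG@k for one query.

    ranked_doc_ids: doc ids in the order returned by the retriever.
    qrels: dict mapping doc id -> graded relevance; unjudged docs count as 0.
    Negative judgments are clamped to 0 (an assumption of this sketch).
    """
    gains = [max(qrels.get(doc_id, 0), 0) for doc_id in ranked_doc_ids]
    ideal = sorted((max(rel, 0) for rel in qrels.values()), reverse=True)
    idcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / idcg if idcg > 0 else 0.0


# Toy judgments for one query (made-up values).
toy_qrels = {"a": 2, "b": 1, "c": 0}
perfect = ndcg_at_k(["a", "b", "c"], toy_qrels)
swapped = ndcg_at_k(["b", "a", "c"], toy_qrels)
print(perfect, swapped)
```

The per-query scores would then be averaged over all TREC-COVID queries, once for plain BM25 and once for BM25 over the expanded documents.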

# Results

Train/validation split: 200,000 / 50,000 examples

| seq length | epochs | batch size |    gpu    | time/epoch | val_ppl |
|:----------:|:------:|:----------:|:---------:|:----------:|:-------:|
|     -    |    -   |      -     |  -  |     -    |  -  |

----