# Class 8 Code Assignment: InPars

[![google colab link](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/tcvieira/IA368-DD-012023/blob/main/assingments/08-InPars/notebook.ipynb)

Thiago Coelho Vieira

```md
Exercise Instructions - InPars

**Objective**: generate a dataset for training search models using the InPars technique, and evaluate a reranker trained on this dataset on TREC-COVID.

**Input**: 3-5 few-shot examples + a document sampled from the TREC-COVID collection
**Output**: a query that is relevant to the sampled document

The filtering step described in the paper, which keeps the queries with the highest probability, is optional.

As the generator model, use one of the following:

- ChatGPT-3.5-turbo: ~1 USD per 1k examples
- FLAN-T5 (base, large or XL), LLAMA-(7,13B), Alpaca-(7/13B), which can run on Colab Pro.
- There is also the HF inference API: https://huggingface.co/inference-api.

Except for LLAMA, zero-shot can be used instead of few-shot.

Given 1k-10k <synthetic query; document> pairs, train a miniLM reranker like the one from classes 2/3.

Negative examples (i.e., <synthetic query; non-relevant doc>) come from BM25: given the synthetic query, retrieve the top 1000 with BM25 and randomly sample a few documents as negatives.

Start training from the miniLM already trained on MS MARCO.

Evaluate on TREC-COVID and compare with the reranker trained only on MS MARCO.

Note: Also use your classmates' datasets to obtain a diversity of examples. Once you have generated the synthetic dataset, please add it to the spreadsheet so that other people can use it.

- To increase randomness, the seed used must be your number in the spreadsheet.

> Put the dataset in jsonlines format:
> {"query": query, "positive_doc_id": doc_id, "negative_doc_ids": [optional]}\n
``` 
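The JSON Lines format requested in the instructions can be sketched with the standard library alone. The helper names `write_jsonl` and `read_jsonl` and the sample record are illustrative assumptions; the document IDs are placeholders, not real TREC-COVID entries.

```python
import json
import os
import tempfile

def write_jsonl(records, path):
    """Write one JSON object per line (JSON Lines)."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")

def read_jsonl(path):
    """Read a JSON Lines file back into a list of dicts."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

# Hypothetical record; the IDs are placeholders.
records = [
    {
        "query": "example query",
        "positive_doc_id": "doc_a",
        "negative_doc_ids": ["doc_b", "doc_c"],
    },
]

path = os.path.join(tempfile.mkdtemp(), "example.jsonl")
write_jsonl(records, path)
loaded = read_jsonl(path)
```

Each line is an independent JSON object, so files from different contributors can be concatenated without re-parsing.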

🥇 kudos to Júlia Tessler - https://colab.research.google.com/drive/1DvFvghq3rpDfj0qh1rdpF_Tr0RXa3yV0?usp=sharing#scrollTo=V5_cloZ6YOCy

🥇 kudos to 


# Setup

In [1]:
from google.colab import drive
drive.mount('/content/gdrive', force_remount=True)

Mounted at /content/gdrive


In [95]:
import numpy as np
import pandas as pd
from getpass import getpass
import os
import time

seed = 20
workdir = '/content/gdrive/MyDrive/unicamp/IA368DD/class_inpars'

np.random.seed(seed)

In [77]:
!pip install langchain -q
!pip install huggingface_hub -q
!pip install datasets -q
!pip install ftfy -q


# Load Dataset

In [60]:
import json
from datasets import load_dataset

In [61]:
trec_covid_queries = load_dataset("BeIR/trec-covid", 'queries')
trec_covid_corpus = load_dataset("BeIR/trec-covid", 'corpus')




In [74]:
trec_covid_queries

DatasetDict({
    queries: Dataset({
        features: ['_id', 'title', 'text'],
        num_rows: 50
    })
})

In [76]:
print(json.dumps(trec_covid_queries['queries'][:3], indent=2))

{
  "_id": [
    "1",
    "2",
    "3"
  ],
  "title": [
    "",
    "",
    ""
  ],
  "text": [
    "what is the origin of COVID-19",
    "how does the coronavirus respond to changes in the weather",
    "will SARS-CoV2 infected people develop immunity? Is cross protection possible?"
  ]
}


In [62]:
trec_covid_corpus

DatasetDict({
    corpus: Dataset({
        features: ['_id', 'title', 'text'],
        num_rows: 171332
    })
})

In [63]:
print(json.dumps(trec_covid_corpus['corpus'][:3], indent=2))

{
  "_id": [
    "ug7v899j",
    "02tnwd4m",
    "ejv2xln0"
  ],
  "title": [
    "Clinical features of culture-proven Mycoplasma pneumoniae infections at King Abdulaziz University Hospital, Jeddah, Saudi Arabia",
    "Nitric oxide: a pro-inflammatory mediator in lung disease?",
    "Surfactant protein-D and pulmonary host defense"
  ],
  "text": [
    "OBJECTIVE: This retrospective chart review describes the epidemiology and clinical features of 40 patients with culture-proven Mycoplasma pneumoniae infections at King Abdulaziz University Hospital, Jeddah, Saudi Arabia. METHODS: Patients with positive M. pneumoniae cultures from respiratory specimens from January 1997 through December 1998 were identified through the Microbiology records. Charts of patients were reviewed. RESULTS: 40 patients were identified, 33 (82.5%) of whom required admission. Most infections (92.5%) were community-acquired. The infection affected all age groups but was most common in infants (32.5%) and pre-school

## Sample

In [64]:
queries_ids = np.random.randint(len(trec_covid_corpus['corpus']), size = 1000)
queries_ids.shape

(1000,)

In [93]:
test_queries_ids = np.random.randint(len(trec_covid_corpus['corpus']), size = 5)
test_queries_ids.shape

(5,)

In [66]:
len(np.intersect1d(queries_ids, test_queries_ids))

0
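Note that `np.random.randint` draws with replacement, so the 1000 sampled indices may contain repeats, and the empty intersection above depends on the particular draw. A minimal sketch that guarantees unique and disjoint index sets, using only the standard library; the function name `sample_disjoint_indices` is a hypothetical helper, not part of the notebook.

```python
import random

def sample_disjoint_indices(population_size, train_size, test_size, seed):
    """Return two disjoint lists of unique indices drawn without replacement."""
    rng = random.Random(seed)
    picked = rng.sample(range(population_size), train_size + test_size)
    return picked[:train_size], picked[train_size:]

# 171332 is the corpus size reported above; 20 is the seed used in this notebook.
train_ids, test_ids = sample_disjoint_indices(171332, 1000, 5, seed=20)
```

Drawing both sets from a single sample makes the disjointness hold by construction rather than by chance.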

# Positive Queries Generation with LLM

In [79]:
from langchain.chat_models import ChatOpenAI
from langchain import (
    PromptTemplate, 
    LLMChain
)
import ftfy

In [68]:
OPENAI_API_KEY = getpass()

··········


In [69]:
os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY

In [85]:
llm = ChatOpenAI(model_name = 'gpt-3.5-turbo', temperature = 0.7)

In [86]:
template = """Generate a short and objective query in the way a human user would in search engines that would help him find more information about the main topic on the following document:

Document: {title}
{text}"""

prompt = PromptTemplate(template = template, 
                        input_variables = ["title", "text"])
llm_chain = LLMChain(prompt = prompt, llm = llm)

title = trec_covid_corpus['corpus']['title'][1001]
text = trec_covid_corpus['corpus']['text'][1001]

print(title)
print(text)
print()
print(llm_chain.run({'title': title, 'text': text}))

Automatic Detection and Quantification of Tree-in-Bud (TIB) Opacities from CT Scans
This study presents a novel computer-assisted detection (CAD) system for automatically detecting and precisely quantifying abnormal nodular branching opacities in chest computed tomography (CT), termed tree-in-bud (TIB) opacities by radiology literature. The developed CAD system in this study is based on 1) fast localization of candidate imaging patterns using local scale information of the images, and 2) Möbius invariant feature extraction method based on learned local shape and texture properties of TIB patterns. For fast localization of candidate imaging patterns, we use ball-scale filtering and, based on the observation of the pattern of interest, a suitable scale selection is used to retain only small size patterns. Once candidate abnormality patterns are identified, we extract proposed shape features from regions where at least one candidate pattern occupies. The comparative evaluation of the prop

Add "(based on trec-covid dataset)" to the prompt so that the generated queries are closer in style to the ones in the dataset.

In [87]:
template = """Generate a short and objective query in the way a human user would in search engines (based on trec-covid dataset) that would help him find more information about the main topic on the following document:

Document: {title}
{text}"""

prompt = PromptTemplate(template = template, 
                        input_variables = ["title", "text"])
llm_chain = LLMChain(prompt = prompt, llm = llm)

title = trec_covid_corpus['corpus']['title'][1001]
text = trec_covid_corpus['corpus']['text'][1001]

print(title)
print(text)
print()
print(llm_chain.run({'title': title, 'text': text}))

Automatic Detection and Quantification of Tree-in-Bud (TIB) Opacities from CT Scans
This study presents a novel computer-assisted detection (CAD) system for automatically detecting and precisely quantifying abnormal nodular branching opacities in chest computed tomography (CT), termed tree-in-bud (TIB) opacities by radiology literature. The developed CAD system in this study is based on 1) fast localization of candidate imaging patterns using local scale information of the images, and 2) Möbius invariant feature extraction method based on learned local shape and texture properties of TIB patterns. For fast localization of candidate imaging patterns, we use ball-scale filtering and, based on the observation of the pattern of interest, a suitable scale selection is used to retain only small size patterns. Once candidate abnormality patterns are identified, we extract proposed shape features from regions where at least one candidate pattern occupies. The comparative evaluation of the prop
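The assignment also allows few-shot prompting, whereas the prompts above are zero-shot. A sketch of how a few-shot prompt could be assembled with plain string formatting; `build_few_shot_prompt`, the `max_chars` cutoff, and the example pairs are hypothetical and are not what this notebook ran.

```python
def build_few_shot_prompt(examples, title, text, max_chars=2000):
    """Assemble a few-shot prompt: solved examples first, then the target document."""
    parts = []
    for example in examples:
        parts.append(f"Document: {example['document']}\nQuery: {example['query']}\n")
    # Truncate long documents so the prompt stays within the model's context window.
    parts.append(f"Document: {title}\n{text[:max_chars]}\nQuery:")
    return "\n".join(parts)

# Hypothetical example pairs, for illustration only.
examples = [
    {"document": "Placeholder abstract about topic A.", "query": "what is topic A"},
    {"document": "Placeholder abstract about topic B.", "query": "how does topic B work"},
]
prompt = build_few_shot_prompt(examples, "Some title", "Some abstract text.")
```

Ending the prompt with an open `Query:` slot nudges the model to complete it with a single query in the same style as the examples.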

In [73]:
# To make it easier to organize things, I created a Pandas DataFrame to fill the data
df = pd.DataFrame()
pos_doc_ids = []

for idx in test_queries_ids:
  pos_doc_ids.append(trec_covid_corpus['corpus']['_id'][idx])

df['positive_doc_id'] = pos_doc_ids
df.head()

| | positive_doc_id |
|---|---|
| 0 | yos0djuo |
| 1 | 4na7m3sr |
| 2 | ytmj9wga |
| 3 | fyqxablr |
| 4 | gwhfcg4i |


In [96]:
# Inspired by Gustavo Guedes 
# https://colab.research.google.com/drive/1QE6xVgoZiRzksRdPRmKEZLfTNXZv0wNG#scrollTo=mBMmmB4Osnqw
from tqdm import tqdm

# Pause briefly after this many requests to avoid hitting the API rate limit
MAX_REQUEST_PER_MINUTE = 60

def generate_queries(llm_chain, samples, save_file_path):
  corpus = trec_covid_corpus['corpus']
  request_count = 0
  generated_queries = []
  pos_doc_ids = []

  for sample in tqdm(samples):
    # Fetch the row once; int() converts NumPy integers to plain Python ints
    doc = corpus[int(sample)]
    title = ftfy.fix_text(doc['title'])
    text = ftfy.fix_text(doc['text'])

    generated_queries.append(llm_chain.run({'title': title, 'text': text}))
    pos_doc_ids.append(doc['_id'])

    request_count += 1
    if request_count == MAX_REQUEST_PER_MINUTE:
      print(f"{request_count} requests. Sleep")
      time.sleep(5)
      request_count = 0

  df = pd.DataFrame({'query': generated_queries, 'positive_doc_id': pos_doc_ids})
  df.to_csv(save_file_path)
  print(f'Saved to {save_file_path}')
  return df
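The loop above pauses for 5 seconds after every 60 requests, which does not by itself guarantee staying under a per-minute quota. A sketch of a sliding-window limiter follows; the class name `RateLimiter` is hypothetical, and the clock and sleep functions are injectable so the example can be exercised without actually waiting.

```python
import time
from collections import deque

class RateLimiter:
    """Allow at most `max_calls` calls within any `period`-second window."""

    def __init__(self, max_calls, period=60.0, clock=time.monotonic, sleep=time.sleep):
        self.max_calls = max_calls
        self.period = period
        self.clock = clock
        self.sleep = sleep
        self.calls = deque()  # timestamps of recent calls

    def _drop_expired(self, now):
        # Forget calls that have left the current window.
        while self.calls and now - self.calls[0] >= self.period:
            self.calls.popleft()

    def wait(self):
        now = self.clock()
        self._drop_expired(now)
        if len(self.calls) >= self.max_calls:
            # Sleep until the oldest recorded call leaves the window.
            self.sleep(self.period - (now - self.calls[0]))
            now = self.clock()
            self._drop_expired(now)
        self.calls.append(now)

class FakeClock:
    """Simulated clock so the demo below runs instantly."""

    def __init__(self):
        self.now = 0.0

    def time(self):
        return self.now

    def sleep(self, seconds):
        self.now += seconds

fake = FakeClock()
limiter = RateLimiter(max_calls=3, period=60.0, clock=fake.time, sleep=fake.sleep)
for _ in range(4):
    limiter.wait()  # the fourth call has to wait for the window to clear
```

In the generation loop, calling `limiter.wait()` right before each `llm_chain.run(...)` would replace the manual request counter.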

In [97]:
generated_queries = generate_queries(llm_chain, queries_ids, f'{workdir}/generated_queries.csv')

  6%|▌         | 59/1000 [02:45<41:29,  2.65s/it]

60 requests. Sleep


100%|██████████| 1000/1000 [48:42<00:00,  2.92s/it]

Saved to /content/gdrive/MyDrive/unicamp/IA368DD/class_inpars/generated_queries.csv





In [98]:
generated_queries.head()

| | query | positive_doc_id |
|---|---|---|
| 0 | What are the management and outcomes of a vasc... | z2n01mh1 |
| 1 | What are the clinical outcomes of using titani... | nms24qf5 |
| 2 | What is power-aware scheduling and how can it ... | axfoktwu |
| 3 | What are parents' and healthcare professionals... | y1atprxa |
| 4 | What is the 2-5A system and how does it modula... | 7yrmd2ig |


# Negative Documents

For each synthetic query, retrieve the top 1000 documents with BM25 and randomly sample 5 of them as negatives.

In [99]:
!pip install pyserini -q
!pip install faiss-cpu  -q

In [100]:
from pyserini.search.lucene import LuceneSearcher
import numpy as np
import json

In [101]:
# Inspired by Manoel Veríssimo dos Santos Neto
# https://github.com/verissimomanoel/P_IA368DD_2023S1/blob/main/Exercicio8/generate_dataset.py
from functools import lru_cache

@lru_cache(maxsize=1)
def get_searcher():
  # Load the prebuilt index once and reuse it for every query
  return LuceneSearcher.from_prebuilt_index('beir-v1.0.0-trec-covid.flat')

def generate_random_numbers(n = 5, k = 1000):
  # Draw n distinct indices in the range [0, k)
  n = min(n, k)
  random_list = []
  while len(random_list) < n:
    index = np.random.randint(0, k)

    # Prevent duplicated index
    if index not in random_list:
      random_list.append(index)

  return random_list

def search_with_bm25(query, n = 5, k = 1000):
  hits = get_searcher().search(query, k)
  # BM25 may return fewer than k hits, so sample only from what came back
  random_list = generate_random_numbers(n = n, k = len(hits))
  random_ids = []

  for index in random_list:
    jsondoc = json.loads(hits[index].raw)
    random_ids.append(jsondoc["_id"])

  return random_ids
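One caveat: because each synthetic query was generated from its positive document, that document is likely to appear among the BM25 hits and could be drawn as a negative. A sketch that excludes it before sampling; `sample_negatives` is a hypothetical helper that works on a plain list of document IDs rather than on Pyserini hit objects.

```python
import random

def sample_negatives(hit_ids, positive_id, n=5, seed=None):
    """Sample up to n distinct negatives from BM25 hits, skipping the positive document."""
    rng = random.Random(seed)
    # dict.fromkeys removes duplicate IDs while preserving rank order.
    candidates = [doc_id for doc_id in dict.fromkeys(hit_ids) if doc_id != positive_id]
    return rng.sample(candidates, min(n, len(candidates)))

# Placeholder IDs standing in for the IDs of retrieved hits.
hits = ["d1", "d2", "d3", "d4", "d5", "d6", "d7"]
negatives = sample_negatives(hits, positive_id="d3", n=5, seed=20)
```

Capping the sample size at the number of candidates avoids an error when BM25 returns only a handful of documents.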

In [102]:
generated_queries['negative_doc_ids'] = generated_queries['query'].apply(search_with_bm25)



In [103]:
generated_queries.head()

| | query | positive_doc_id | negative_doc_ids |
|---|---|---|---|
| 0 | What are the management and outcomes of a vasc... | z2n01mh1 | [bfzfll01, eproxsg3, kmlios1x, uit0hlwn, a4od0... |
| 1 | What are the clinical outcomes of using titani... | nms24qf5 | [mmt9bjl0, gn1sk0qu, 0iim6d5b, sjrw00xo, h6msl... |
| 2 | What is power-aware scheduling and how can it ... | axfoktwu | [v3blnh02, 7izskero, xtymq6m9, jcnvnn2k, 8uswr... |
| 3 | What are parents' and healthcare professionals... | y1atprxa | [2u379fh3, nh9z3x15, 5wk7469e, rxvxf34z, jsm0j... |
| 4 | What is the 2-5A system and how does it modula... | 7yrmd2ig | [9i8t4dhg, z4ypiuoa, xd9wzzv6, fat4ldy8, pndby... |


In [110]:
import polars as pl

data = pl.from_pandas(generated_queries)
data.write_ndjson(f'{workdir}/thiago_vieira_1k_queries.jsonl')

In [114]:
!wc -l {workdir}/thiago_vieira_1k_queries.jsonl

1000 /content/gdrive/MyDrive/unicamp/IA368DD/class_inpars/thiago_vieira_1k_queries.jsonl


In [111]:
!head {workdir}/thiago_vieira_1k_queries.jsonl

{"query":"What are the management and outcomes of a vascular surgery department with a sudden medical staff outbreak of COVID-19?","positive_doc_id":"z2n01mh1","negative_doc_ids":["bfzfll01","eproxsg3","kmlios1x","uit0hlwn","a4od06yo"]}
{"query":"What are the clinical outcomes of using titanium-coated mesh compared to polypropylene mesh in laparoscopic inguinal hernia repair?","positive_doc_id":"nms24qf5","negative_doc_ids":["mmt9bjl0","gn1sk0qu","0iim6d5b","sjrw00xo","h6msl5ng"]}
{"query":"What is power-aware scheduling and how can it be implemented using submission data in HPC systems?","positive_doc_id":"axfoktwu","negative_doc_ids":["v3blnh02","7izskero","xtymq6m9","jcnvnn2k","8uswrers"]}
{"query":"What are parents' and healthcare professionals' experiences of care after stillbirth in low- and middle-income countries according to a systematic review and meta-summary?","positive_doc_id":"y1atprxa","negative_doc_ids":["2u379fh3","nh9z3x15","5wk7469e","rxvxf34z","jsm0jfx8"]}
{"query":
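Before sharing the file, a quick sanity check on each record can catch malformed rows. The sketch below is one possible set of checks; the function name `validate_record` and the rules themselves (required keys, no duplicate negatives, positive not listed as a negative) are assumptions about what counts as a valid record.

```python
import json

def validate_record(record):
    """Return a list of problems found in one dataset record (empty if none)."""
    problems = []
    for key in ("query", "positive_doc_id"):
        if not isinstance(record.get(key), str) or not record[key].strip():
            problems.append(f"missing or empty '{key}'")
    negatives = record.get("negative_doc_ids", [])  # this field is optional
    if len(set(negatives)) != len(negatives):
        problems.append("duplicate negative_doc_ids")
    if record.get("positive_doc_id") in negatives:
        problems.append("positive_doc_id also listed as negative")
    return problems

# Hypothetical line with a deliberate error: the positive is repeated as a negative.
line = '{"query": "example query", "positive_doc_id": "doc_a", "negative_doc_ids": ["doc_b", "doc_a"]}'
issues = validate_record(json.loads(line))
```

Running this over every line of the `.jsonl` file and printing the non-empty results gives a cheap pre-upload check.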

## Send to HF dataset

In [112]:
from huggingface_hub import login
login()


In [113]:
import datasets
ds = datasets.load_dataset('unicamp-dl/trec-covid-experiment')

Downloading and preparing dataset trec-covid-experiment/default to /root/.cache/huggingface/datasets/unicamp-dl___trec-covid-experiment/default/0.0.0/a3a6e18e08b6f3c017157e867f9c7b995c84f6b85e10e9e62facec5b6ea60707...

Generating splits: example, example2, eduseiti_100_queries_expansion_20230501_01, leandro_carisio_01, thales_1k_generated_queries_20230429, manoel_1k_generated_queries_20230430, manoel_2k_generated_queries_20230501, thiago_laitz_1k_queries, mirelle_1k_generated_queries_20230501, hugo_padovani_query_generation, marcus_borela_1k_gptj6b_20230501, juliatessler_1000_queries, pedro_holanda_1k_generated_queries_20230502, leonardo_avila_queries_v1, marcus_borela_1k_gptj6b_20230501_v2, gustavo_1k_cohere, marcospiau_1k_v1, pedrogengo_queries_inparsv1, ricardo_primi_1k, thiago_vieira_1k_queries

Dataset trec-covid-experiment downloaded and prepared to /root/.cache/huggingface/datasets/unicamp-dl___trec-covid-experiment/default/0.0.0/a3a6e18e08b6f3c017157e867f9c7b995c84f6b85e10e9e62facec5b6ea60707. Subsequent calls will reuse this data.

In [115]:
ds

DatasetDict({
    example: Dataset({
        features: ['query', 'positive_doc_id', 'negative_doc_ids'],
        num_rows: 3
    })
    example2: Dataset({
        features: ['query', 'positive_doc_id', 'negative_doc_ids'],
        num_rows: 3
    })
    eduseiti_100_queries_expansion_20230501_01: Dataset({
        features: ['query', 'positive_doc_id', 'negative_doc_ids'],
        num_rows: 463
    })
    leandro_carisio_01: Dataset({
        features: ['query', 'positive_doc_id', 'negative_doc_ids'],
        num_rows: 1001
    })
    thales_1k_generated_queries_20230429: Dataset({
        features: ['query', 'positive_doc_id', 'negative_doc_ids'],
        num_rows: 1000
    })
    manoel_1k_generated_queries_20230430: Dataset({
        features: ['query', 'positive_doc_id', 'negative_doc_ids'],
        num_rows: 1000
    })
    manoel_2k_generated_queries_20230501: Dataset({
        features: ['query', 'positive_doc_id', 'negative_doc_ids'],
        num_rows: 2000
    })
    thiago_l

# Fine-tune

In [107]:
# https://huggingface.co/cross-encoder/ms-marco-MiniLM-L-6-v2
model_name = 'cross-encoder/ms-marco-MiniLM-L-6-v2'

In [57]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
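
To fine-tune the cross-encoder, each dataset record has to be expanded into labeled (query, document) pairs. A minimal sketch of that step, assuming a binary label (1 for the positive, 0 for each negative); `build_training_pairs` and the `doc_texts` lookup are hypothetical and this notebook does not show the actual training code.

```python
def build_training_pairs(record, doc_texts):
    """Expand one record into (query, document_text, label) triples."""
    pairs = [(record["query"], doc_texts[record["positive_doc_id"]], 1)]
    for doc_id in record.get("negative_doc_ids", []):
        pairs.append((record["query"], doc_texts[doc_id], 0))
    return pairs

# Hypothetical texts keyed by document ID, for illustration only.
doc_texts = {
    "doc_a": "relevant text",
    "doc_b": "unrelated text",
    "doc_c": "other text",
}
record = {
    "query": "example query",
    "positive_doc_id": "doc_a",
    "negative_doc_ids": ["doc_b", "doc_c"],
}
pairs = build_training_pairs(record, doc_texts)
```

The query and document of each triple would then be tokenized together as a single sequence pair before being passed to the model.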
