# Class 8 Code Assignment: InPars Query Generation with an LLM

[![google colab link](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/tcvieira/IA368-DD-012023/blob/main/assingments/08-InPars/notebook.ipynb)

Thiago Coelho Vieira

```md
Exercise Instructions - InPars

**Goal**: generate a dataset for training search models using the InPars technique, and evaluate a reranker trained on this dataset on TREC-COVID:

**Input**: 3-5 few-shot examples + a document sampled from the TREC-COVID collection
**Output**: a query that is relevant to the sampled document

The filtering step described in the paper, which keeps the queries with the highest probability, is optional.

As the generator model, use one of the following:

- ChatGPT-3.5-turbo: ~1 USD per 1k examples
- FLAN-T5 (base, large or XL), LLAMA-(7/13B), Alpaca-(7/13B), which can be run on Colab Pro.
- There is also the HF inference API: https://huggingface.co/inference-api.

Except for LLAMA, zero-shot can be used instead of few-shot.

Given 1k-10k <synthetic query; document> pairs, train a miniLM reranker like the one from class 2/3.

Negative examples (i.e., <synthetic query; non-relevant doc>) come from BM25: given the synthetic query, retrieve the top 1000 with BM25 and randomly sample some of those documents as negatives.

Start training from the miniLM already trained on MS MARCO.

Evaluate on TREC-COVID and compare with the reranker trained only on MS MARCO.

Note: also use your classmates' datasets to get more diverse examples. As soon as you have generated the synthetic dataset, please add it to the spreadsheet so that others can use it.

- To increase randomness, the seed used should be your number in the spreadsheet.

> Put the dataset in jsonlines format:
> {"query": query, "positive_doc_id": doc_id, "negative_doc_ids": [optional]}\n 
``` 

🥇 kudos to Júlia Tessler - https://colab.research.google.com/drive/1DvFvghq3rpDfj0qh1rdpF_Tr0RXa3yV0?usp=sharing#scrollTo=V5_cloZ6YOCy

🥇 kudos to 


# Setup

In [1]:
from google.colab import drive
drive.mount('/content/gdrive', force_remount=True)

Mounted at /content/gdrive


In [2]:
import numpy as np
from getpass import getpass

seed = 20

np.random.seed(seed)

In [11]:
!pip install langchain -q
!pip install huggingface_hub -q
!pip install datasets -q

# Load Dataset

In [3]:
import json
from datasets import load_dataset

In [4]:
trec_covid_queries = load_dataset("BeIR/trec-covid", 'queries')
trec_covid_corpus = load_dataset("BeIR/trec-covid", 'corpus')




In [5]:
trec_covid_corpus

DatasetDict({
    corpus: Dataset({
        features: ['_id', 'title', 'text'],
        num_rows: 171332
    })
})

In [6]:
print(json.dumps(trec_covid_corpus['corpus'][:3], indent=2))

{
  "_id": [
    "ug7v899j",
    "02tnwd4m",
    "ejv2xln0"
  ],
  "title": [
    "Clinical features of culture-proven Mycoplasma pneumoniae infections at King Abdulaziz University Hospital, Jeddah, Saudi Arabia",
    "Nitric oxide: a pro-inflammatory mediator in lung disease?",
    "Surfactant protein-D and pulmonary host defense"
  ],
  "text": [
    "OBJECTIVE: This retrospective chart review describes the epidemiology and clinical features of 40 patients with culture-proven Mycoplasma pneumoniae infections at King Abdulaziz University Hospital, Jeddah, Saudi Arabia. METHODS: Patients with positive M. pneumoniae cultures from respiratory specimens from January 1997 through December 1998 were identified through the Microbiology records. Charts of patients were reviewed. RESULTS: 40 patients were identified, 33 (82.5%) of whom required admission. Most infections (92.5%) were community-acquired. The infection affected all age groups but was most common in infants (32.5%) and pre-school

## Sample

In [7]:
queries_ids = np.random.randint(len(trec_covid_corpus['corpus']), size = 1000)
queries_ids.shape

(1000,)

In [8]:
test_queries_ids = np.random.randint(len(trec_covid_corpus['corpus']), size = 5)
test_queries_ids.shape

(5,)

In [9]:
len(np.intersect1d(queries_ids, test_queries_ids))

0
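Note that `np.random.randint` draws with replacement, so the 1000 sampled indices can contain repeats, and the empty intersection above is a matter of chance rather than a guarantee. A minimal sketch of an alternative, assuming it is acceptable to switch to NumPy's `Generator` API: draw all indices in one call without replacement and split them afterwards. The names `rng`, `all_ids`, `n_docs` and the `_unique` suffixes are illustrative only, and the indices produced this way differ from the ones used in the rest of this notebook.

```python
import numpy as np

seed = 20
n_docs = 171332  # corpus size reported by the dataset above

rng = np.random.default_rng(seed)

# Draw 1005 distinct indices in one call, then split them,
# so both sets are free of repeats and disjoint by construction.
all_ids = rng.choice(n_docs, size=1005, replace=False)
queries_ids_unique = all_ids[:1000]
test_queries_ids_unique = all_ids[1000:]

print(len(np.unique(queries_ids_unique)), len(test_queries_ids_unique))
```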

# Positive Queries Generation with LLM

In [54]:
from langchain.chat_models import ChatOpenAI
from langchain import (
    PromptTemplate, 
    LLMChain
)

In [20]:
OPENAI_API_KEY = getpass()

··········


In [21]:
import os
os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY

In [22]:
# temperature is a top-level ChatOpenAI argument; a value near 0 keeps the output close to deterministic
llm = ChatOpenAI(model_name = 'gpt-3.5-turbo', temperature = 1e-10)

In [43]:
template = """Generate a short and objective query in the way a human user would in search engines that would help him find more information about the main topic on the following document:

Document: {title}
{text}"""

prompt = PromptTemplate(template = template, 
                        input_variables = ["title", "text"])
llm_chain = LLMChain(prompt = prompt, llm = llm)

title = trec_covid_corpus['corpus']['title'][1001]
text = trec_covid_corpus['corpus']['text'][1001]

print(title)
print(text)
print()
print(llm_chain.run({'title': title, 'text': text}))

Automatic Detection and Quantification of Tree-in-Bud (TIB) Opacities from CT Scans
This study presents a novel computer-assisted detection (CAD) system for automatically detecting and precisely quantifying abnormal nodular branching opacities in chest computed tomography (CT), termed tree-in-bud (TIB) opacities by radiology literature. The developed CAD system in this study is based on 1) fast localization of candidate imaging patterns using local scale information of the images, and 2) Möbius invariant feature extraction method based on learned local shape and texture properties of TIB patterns. For fast localization of candidate imaging patterns, we use ball-scale filtering and, based on the observation of the pattern of interest, a suitable scale selection is used to retain only small size patterns. Once candidate abnormality patterns are identified, we extract proposed shape features from regions where at least one candidate pattern occupies. The comparative evaluation of the prop

Next, add "(based on trec-covid dataset)" to the prompt so that the generated queries look more like the queries in the dataset.

In [42]:
template = """Generate a short and objective query in the way a human user would in search engines (based on trec-covid dataset) that would help him find more information about the main topic on the following document:

Document: {title}
{text}"""

prompt = PromptTemplate(template = template, 
                        input_variables = ["title", "text"])
llm_chain = LLMChain(prompt = prompt, llm = llm)

title = trec_covid_corpus['corpus']['title'][1001]
text = trec_covid_corpus['corpus']['text'][1001]

print(title)
print(text)
print()
print(llm_chain.run({'title': title, 'text': text}))

Automatic Detection and Quantification of Tree-in-Bud (TIB) Opacities from CT Scans
This study presents a novel computer-assisted detection (CAD) system for automatically detecting and precisely quantifying abnormal nodular branching opacities in chest computed tomography (CT), termed tree-in-bud (TIB) opacities by radiology literature. The developed CAD system in this study is based on 1) fast localization of candidate imaging patterns using local scale information of the images, and 2) Möbius invariant feature extraction method based on learned local shape and texture properties of TIB patterns. For fast localization of candidate imaging patterns, we use ball-scale filtering and, based on the observation of the pattern of interest, a suitable scale selection is used to retain only small size patterns. Once candidate abnormality patterns are identified, we extract proposed shape features from regions where at least one candidate pattern occupies. The comparative evaluation of the prop
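The abstracts in the corpus vary in length, and a long one could exceed the model's context window. A sketch, assuming a simple character budget, of filling the same template with plain `str.format` after truncating the document text. `build_prompt` and `max_chars` are hypothetical names, and the 4000-character default is an arbitrary illustrative value, not a limit used in this notebook.

```python
TEMPLATE = (
    "Generate a short and objective query in the way a human user would in "
    "search engines (based on trec-covid dataset) that would help him find "
    "more information about the main topic on the following document:\n\n"
    "Document: {title}\n{text}"
)

def build_prompt(title: str, text: str, max_chars: int = 4000) -> str:
    """Fill the template, cutting the document text at a word boundary if needed."""
    if len(text) > max_chars:
        # Cut at the last space before the limit so that no word is split
        cut = text.rfind(" ", 0, max_chars)
        text = text[: cut if cut > 0 else max_chars].rstrip() + " ..."
    return TEMPLATE.format(title=title, text=text)

# Placeholder inputs, only to show the truncation
prompt_text = build_prompt("Some title", "word " * 2000, max_chars=50)
print(prompt_text)
```

Truncating by characters is only a rough proxy for the model's token limit; a tokenizer-based budget would be more precise.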

In [None]:
# To make it easier to organize things, I created a Pandas DataFrame to fill the data
import pandas as pd

df = pd.DataFrame()
corpus_ids = trec_covid_corpus['corpus']['_id']  # load the id column once
pos_doc_ids = []

for idx in test_queries_ids:
  pos_doc_ids.append(corpus_ids[idx])

df['positive_doc_id'] = pos_doc_ids
df.head()
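The cells further down iterate over a `responses` list of three-element tuples, but the loop that builds it does not appear in this notebook. A minimal sketch of one way such a list could be assembled, assuming each tuple is `(doc_id, title, query)` (the middle element is a guess, since the later cells discard it). `build_responses` and `fake_generate_query` are hypothetical names; in practice the generator would wrap the LLM chain call.

```python
from typing import Callable, Iterable, List, Tuple

def build_responses(
    docs: Iterable[dict],
    generate_query: Callable[[str, str], str],
) -> List[Tuple[str, str, str]]:
    """Return (doc_id, title, query) tuples, one per document with a non-empty query."""
    responses = []
    for doc in docs:
        query = generate_query(doc["title"], doc["text"]).strip()
        if query:  # skip documents for which nothing was generated
            responses.append((doc["_id"], doc["title"], query))
    return responses

# Stand-in generator so the sketch runs without an API key
def fake_generate_query(title: str, text: str) -> str:
    return f"what is known about {title.lower()}"

# Placeholder document, not taken from the corpus
sample_docs = [{"_id": "doc-1", "title": "Example Title", "text": "Example text."}]
responses = build_responses(sample_docs, fake_generate_query)
print(responses)
```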

# Negative Documents

For each generated query, retrieve the top 1000 documents with BM25 and randomly sample 5 of them as negative documents.

In [49]:
!pip install pyserini -q
!pip install faiss-cpu  -q


In [50]:
from pyserini.search.lucene import LuceneSearcher
import numpy as np
import json

In [51]:
# Inspired by Manoel Veríssimo dos Santos Neto
# https://github.com/verissimomanoel/P_IA368DD_2023S1/blob/main/Exercicio8/generate_dataset.py

# Load the prebuilt index once instead of on every search call
searcher = LuceneSearcher.from_prebuilt_index('beir-v1.0.0-trec-covid.flat')

def generate_random_numbers(n_samples = 5, k = 1000):
  # Draw n_samples distinct indices from [0, k); never ask for more than k
  n_samples = min(n_samples, k)
  random_list = []
  while len(random_list) < n_samples:
    n = np.random.randint(0, k)

    # Prevent duplicated index
    if n not in random_list:
      random_list.append(n)

  return random_list

def search_with_bm25(query, n_samples = 5, k = 1000):
  hits = searcher.search(query, k)
  # BM25 may return fewer than k hits, so sample indices within len(hits)
  random_list = generate_random_numbers(n_samples = n_samples, k = len(hits))
  random_ids = []

  for index in random_list:
    jsondoc = json.loads(hits[index].raw)
    random_ids.append(jsondoc["_id"])

  return random_ids
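The sampling above picks indices without checking whether one of them is the positive document, which BM25 may well rank inside the top 1000 for its own synthetic query. A minimal sketch of a sampler that skips it, written in plain Python over a list of ranked ids so that it does not depend on the index. `sample_negatives` and its parameters are hypothetical names, not part of Pyserini.

```python
import random
from typing import List, Optional

def sample_negatives(
    ranked_doc_ids: List[str],
    positive_doc_id: str,
    n_samples: int = 5,
    rng: Optional[random.Random] = None,
) -> List[str]:
    """Sample up to n_samples distinct ids from a ranked list, skipping the positive."""
    rng = rng or random.Random()
    # Drop the positive and any repeated ids while keeping rank order
    candidates = list(dict.fromkeys(d for d in ranked_doc_ids if d != positive_doc_id))
    return rng.sample(candidates, min(n_samples, len(candidates)))

# Placeholder ids standing in for the ids of the BM25 hits
ranked = [f"doc-{i}" for i in range(1000)]
negatives = sample_negatives(ranked, positive_doc_id="doc-3", rng=random.Random(20))
print(negatives)
```

Passing a seeded `random.Random` keeps the draw reproducible without touching NumPy's global state.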

In [52]:
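from tqdm import tqdm

negative_samples = []

# `responses` is expected to hold (doc_id, _, query) tuples from the generation step;
# it was not defined when this cell ran, which is what raised the NameError
for _, _, query in tqdm(responses):
  negative_doc_ids = search_with_bm25(query)
  negative_samples.append(negative_doc_ids)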

NameError: ignored

In [53]:
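from tqdm import tqdm

# These names were undefined when the cell was first run
num_max_hits = 1000        # size of the BM25 candidate pool
num_negative_example = 5   # negatives sampled per query
searcher = LuceneSearcher.from_prebuilt_index('beir-v1.0.0-trec-covid.flat')

negative_ids = []
# Assumes df has a 'generated_query' column holding the synthetic queries
for idx, registro in tqdm(df.iterrows(), total=len(df)):
    query_text = registro['generated_query']
    hits = searcher.search(query_text, num_max_hits)
    if len(hits)>0:
        # Sample without replacement so that a document is not picked twice
        size = min(num_negative_example, len(hits))
        indices_selecao = np.random.choice(len(hits), size=size, replace=False)
        negative_ids.append([hits[ndx].docid for ndx in indices_selecao])
    else:
        negative_ids.append([])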

NameError: ignored

In [None]:
negative_samples[0:10]

In [None]:
# main_dir must point to an existing output directory; it is not defined in the cells shown
with open(f"{main_dir}/trec-covid/thiago_vieira_1k_queries.jsonl", "w") as f:
  for i, (doc_id, _, query) in enumerate(responses):
    json.dump({"query":query, "positive_doc_id":doc_id, "negative_doc_ids":negative_samples[i]}, f)
    f.write("\n")

In [None]:
!head {main_dir}/trec-covid/thiago_vieira_1k_queries.jsonl
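Before sharing the file, it can help to read it back and confirm that every line follows the jsonlines format from the instructions. A small sketch of such a check, assuming `query` and `positive_doc_id` are mandatory and `negative_doc_ids` is optional. `read_jsonl` and `REQUIRED_KEYS` are hypothetical helpers, and the records written here are placeholders, not generated data.

```python
import json
import os
import tempfile

REQUIRED_KEYS = {"query", "positive_doc_id"}

def read_jsonl(path):
    """Load one JSON object per non-empty line and check the required keys."""
    records = []
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            if not line.strip():
                continue
            record = json.loads(line)
            missing = REQUIRED_KEYS - record.keys()
            if missing:
                raise ValueError(f"line {line_no}: missing keys {sorted(missing)}")
            # negative_doc_ids is optional in the assignment's format
            record.setdefault("negative_doc_ids", [])
            records.append(record)
    return records

# Round trip on a temporary file with placeholder values
example = {"query": "example query", "positive_doc_id": "doc-1",
           "negative_doc_ids": ["doc-2", "doc-3"]}
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.jsonl")
    with open(path, "w", encoding="utf-8") as f:
        f.write(json.dumps(example) + "\n")
        f.write(json.dumps({"query": "q2", "positive_doc_id": "doc-4"}) + "\n")
    records = read_jsonl(path)
print(records)
```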

## Send to HF dataset

In [44]:
from huggingface_hub import login
login()


In [45]:
import datasets
ds = datasets.load_dataset('unicamp-dl/trec-covid-experiment')

Downloading and preparing dataset trec-covid-experiment/default to /root/.cache/huggingface/datasets/unicamp-dl___trec-covid-experiment/default/0.0.0/b4916ab469ccacf895d77d33bd1c846bb5cfdd8b4c50a7d5ee10f01f77e0310a...

Splits generated from the 18 data files:

- `example`
- `example2`
- `eduseiti_100_queries_expansion_20230501_01`
- `leandro_carisio_01`
- `thales_1k_generated_queries_20230429`
- `manoel_1k_generated_queries_20230430`
- `manoel_2k_generated_queries_20230501`
- `thiago_laitz_1k_queries`
- `mirelle_1k_generated_queries_20230501`
- `hugo_padovani_query_generation`
- `marcus_borela_1k_gptj6b_20230501`
- `juliatessler_1000_queries`
- `pedro_holanda_1k_generated_queries_20230502`
- `leonardo_avila_queries_v1`
- `marcus_borela_1k_gptj6b_20230501_v2`
- `gustavo_1k_cohere`
- `marcospiau_1k_v1`
- `pedrogengo_queries_inparsv1`

Dataset trec-covid-experiment downloaded and prepared to /root/.cache/huggingface/datasets/unicamp-dl___trec-covid-experiment/default/0.0.0/b4916ab469ccacf895d77d33bd1c846bb5cfdd8b4c50a7d5ee10f01f77e0310a. Subsequent calls will reuse this data.

# Fine-tune

In [56]:
model_name = 'cross-encoder/ms-marco-MiniLM-L-6-v2'

In [57]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
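Fine-tuning a cross-encoder needs labelled (query, document) pairs, whereas the jsonlines records store one positive id and a list of negative ids per query. A minimal sketch of that expansion, assuming a label of 1.0 for the positive and 0.0 for each negative, and a `doc_text` dictionary mapping document ids to their text. `to_training_pairs` and `doc_text` are hypothetical names, and the records below are placeholders.

```python
from typing import Dict, List, Tuple

def to_training_pairs(
    records: List[dict],
    doc_text: Dict[str, str],
) -> List[Tuple[str, str, float]]:
    """Expand each record into (query, document text, label) triples."""
    pairs = []
    for rec in records:
        pairs.append((rec["query"], doc_text[rec["positive_doc_id"]], 1.0))
        for neg_id in rec.get("negative_doc_ids", []):
            if neg_id != rec["positive_doc_id"]:  # never label the positive as a negative
                pairs.append((rec["query"], doc_text[neg_id], 0.0))
    return pairs

# Placeholder data in the jsonlines format from the instructions
records = [{"query": "q", "positive_doc_id": "d1", "negative_doc_ids": ["d2", "d1"]}]
doc_text = {"d1": "text of d1", "d2": "text of d2"}
pairs = to_training_pairs(records, doc_text)
print(pairs)
```

Each triple can then be tokenized as a query-document pair and fed to the model together with its label.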
