# Exercise 1

This week's exercise:

1. Use the BM25 implementation in pyserini to run the TREC-DL 2020 queries ([reference documentation](https://github.com/castorini/pyserini/blob/master/docs/experiments-msmarco-passage.md)).
2. Implement a Boolean/bag-of-words searcher.
3. Implement a TF-IDF searcher.
4. Evaluate implementations 1, 2, and 3 on TREC-DL 2020 and compute nDCG@10.

For items 2 and 3: (i) the implementation must support efficient search over millions of documents, and (ii) libraries such as sklearn, which already implement BoW and TF-IDF, may not be used.
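
Since all three systems are compared on nDCG@10, it may help to see the metric written out. The following is a minimal sketch, assuming linear gains (the relevance grade itself) and a log2(rank + 1) discount; the names `dcg_at_k` and `ndcg_at_k` are illustrative, and the notebook itself relies on `trec_eval` for the reported numbers.

```python
import math

def dcg_at_k(gains, k=10):
    """Discounted cumulative gain over the first k gains (rank starts at 1)."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains[:k], start=1))

def ndcg_at_k(ranked_doc_ids, qrels, k=10):
    """nDCG@k for one query.

    ranked_doc_ids: doc ids in the order returned by the searcher.
    qrels: dict mapping doc id -> graded relevance for this query.
    """
    gains = [qrels.get(doc_id, 0) for doc_id in ranked_doc_ids]
    # The ideal ranking places all judged documents in decreasing relevance order
    ideal_gains = sorted(qrels.values(), reverse=True)
    idcg = dcg_at_k(ideal_gains, k)
    return dcg_at_k(gains, k) / idcg if idcg > 0 else 0.0

# Hypothetical judgments for a single query
toy_qrels = {"a": 3, "b": 2, "c": 0}
perfect = ndcg_at_k(["a", "b", "c"], toy_qrels)
swapped = ndcg_at_k(["b", "a", "c"], toy_qrels)
```

The corpus-level figure is the mean of the per-query values.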

## Package installation

In [2]:
# !pip install pyserini
# !pip install faiss-cpu -q

In [1]:
# Used only to run on Google Colab
# from google.colab import drive
# drive.mount('/content/gdrive')

# Change the path to your drive
# base_path = "gdrive/MyDrive/Colab_Notebooks/P_IA368DD_2023S1/Exercicio1/"
base_path = ""

In [2]:
import os

## Downloading the data

In [None]:
!wget https://msmarco.blob.core.windows.net/msmarcoranking/collectionandqueries.tar.gz -P "{base_path}data/"
!tar xvfz "{base_path}data/collectionandqueries.tar.gz" -C "{base_path}data/"

Convert the files from TSV to JSON

## Installing pyserini and the pyserini tools

In [None]:
os.makedirs(f"{base_path}tools", exist_ok=True)  # do not fail when the cell is re-run

In [None]:
!wget -q https://github.com/castorini/anserini-tools/archive/refs/heads/master.zip -O "{base_path}tools/anserini-tools.zip"

In [None]:
!unzip -q "{base_path}tools/anserini-tools.zip" -d "{base_path}tools"

In [None]:
!python "{base_path}tools/anserini-tools-master/scripts/msmarco/convert_collection_to_jsonl.py" \
 --collection-path "{base_path}data/collection.tsv" \
 --output-folder "{base_path}data/collection_jsonl"

In [None]:
!wget -q https://github.com/castorini/pyserini/archive/refs/heads/master.zip -O pyserini.zip

In [None]:
!unzip -q pyserini.zip -d "{base_path}pyserini"

## Building the document index

In [None]:
!python -m pyserini.index.lucene \
  --collection JsonCollection \
  --input "{base_path}data/collection_jsonl" \
  --index "{base_path}data/indexes/lucene-index-msmarco-passage" \
  --generator DefaultLuceneDocumentGenerator \
  --threads 9 \
  --storePositions --storeDocvectors --storeRaw

In [None]:
!head "{base_path}tools/anserini-tools-master/topics-and-qrels/topics.msmarco-passage.dev-subset.txt"

1048585	what is paula deen's brother
2	 Androgen receptor define
524332	treating tension headaches without medication
1048642	what is paranoid sc
524447	treatment of varicose veins in legs
786674	what is prime rate in canada
1048876	who plays young dr mallard on ncis
1048917	what is operating system misconfiguration
786786	what is priority pass
524699	tricare service number


In [None]:
!python -m pyserini.search.lucene \
  --index "{base_path}data/indexes/lucene-index-msmarco-passage" \
  --topics msmarco-passage-dev-subset \
  --output runs/run.msmarco-passage.bm25tuned.txt \
  --output-format msmarco \
  --hits 1000 \
  --bm25 --k1 0.82 --b 0.68

Using pre-defined topic order for msma
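
The search above sets `k1 = 0.82` and `b = 0.68`. To illustrate what these two parameters control, here is a minimal sketch of the textbook BM25 term score. It is an assumption-laden illustration rather than the exact formula Lucene evaluates internally, and the function name `bm25_term_score` is hypothetical.

```python
import math

def bm25_term_score(tf, df, doc_len, avg_doc_len, num_docs, k1=0.82, b=0.68):
    """Textbook BM25 contribution of one query term to one document.

    tf: occurrences of the term in the document
    df: number of documents that contain the term
    """
    # Smoothed IDF that stays positive even for very common terms
    idf = math.log(1.0 + (num_docs - df + 0.5) / (df + 0.5))
    # Length normalization: b=0 ignores document length, b=1 normalizes fully
    norm = 1.0 - b + b * (doc_len / avg_doc_len)
    # k1 controls how quickly repeated occurrences saturate
    return idf * tf * (k1 + 1.0) / (tf + k1 * norm)

# Hypothetical collection statistics, for illustration only
short_doc = bm25_term_score(tf=2, df=100, doc_len=40, avg_doc_len=60, num_docs=1_000_000)
long_doc = bm25_term_score(tf=2, df=100, doc_len=200, avg_doc_len=60, num_docs=1_000_000)
```

With the same term frequency, the shorter document receives the higher score, and the score of a document is the sum of this quantity over the query terms.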

In [None]:
!python "{base_path}tools/anserini-tools-master/scripts/msmarco/msmarco_passage_eval.py" \
   "{base_path}tools/anserini-tools-master/topics-and-qrels/qrels.msmarco-passage.dev-subset.txt" runs/run.msmarco-passage.bm25tuned.txt

#####################
MRR @10: 0.18741227770955546
QueriesRanked: 6980
#####################
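
For reference, MRR@10 is the mean over queries of the reciprocal rank of the first relevant passage within the top 10, counting 0 when none appears. The sketch below assumes binary relevance and averages over the queries present in the run; `mrr_at_k` and the toy data are illustrative names, not part of the official evaluation script.

```python
def mrr_at_k(run, qrels, k=10):
    """Mean reciprocal rank at cutoff k.

    run: dict mapping query id -> list of doc ids in ranked order
    qrels: dict mapping query id -> set of relevant doc ids
    """
    total = 0.0
    for qid, ranked in run.items():
        relevant = qrels.get(qid, set())
        for rank, doc_id in enumerate(ranked[:k], start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(run) if run else 0.0

# Hypothetical two-query run: q1 hits at rank 2, q2 has no relevant hit
toy_run = {"q1": ["d3", "d1"], "q2": ["d9"]}
toy_rel = {"q1": {"d1"}, "q2": {"d2"}}
toy_mrr = mrr_at_k(toy_run, toy_rel)
```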


Official TREC evaluation

In [None]:
!python -m pyserini.eval.convert_msmarco_run_to_trec_run \
   --input runs/run.msmarco-passage.bm25tuned.txt \
   --output runs/run.msmarco-passage.bm25tuned.trec

Done!
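
The conversion step maps each MS MARCO run line (`qid`, `docid`, `rank`, tab-separated) to a six-column TREC run line (`qid Q0 docid rank score tag`). A rough sketch of that mapping follows; the synthetic score of `1 / rank`, the default tag, and the function name `msmarco_line_to_trec` are assumptions made for illustration and may differ from what the pyserini converter writes.

```python
def msmarco_line_to_trec(line, tag="bm25tuned"):
    """Convert one 'qid<TAB>docid<TAB>rank' line to TREC run format."""
    qid, doc_id, rank = line.rstrip("\n").split("\t")
    # MS MARCO runs carry no score, so derive a decreasing one from the rank
    score = 1.0 / int(rank)
    return f"{qid} Q0 {doc_id} {rank} {score:.6f} {tag}"

# The doc id below is made up for the example
example = msmarco_line_to_trec("1048585\t7187158\t1\n")
```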


In [None]:
!python {base_path}tools/anserini-tools-master/scripts/msmarco/convert_msmarco_to_trec_qrels.py \
   --input {base_path}tools/anserini-tools-master/topics-and-qrels/qrels.msmarco-passage.dev-subset.txt \
   --output {base_path}data/qrels.dev.small.trec

Done!


In [None]:
!python {base_path}pyserini/pyserini-master/pyserini/eval/trec_eval.py -c -mrecall.1000 -mmap \
   {base_path}data/qrels.dev.small.trec runs/run.msmarco-passage.bm25tuned.trec

Downloading https://search.maven.org/r

nDCG@10 computed with pyserini

In [None]:
!python {base_path}pyserini/pyserini-master/pyserini/eval/trec_eval.py -c -mndcg_cut.10 \
   {base_path}data/qrels.dev.small.trec runs/run.msmarco-passage.bm25tuned.trec

Downloading https://search.maven.org/r

## Bag-of-words searcher

Reference: https://colab.research.google.com/drive/1hELJYqsvUyja9HPeDzc9FU8okqdIjODE?usp=sharing


In [13]:
import nltk
from nltk.tokenize import word_tokenize
import numpy as np
import pandas as pd
from tqdm.notebook import tqdm
from collections import Counter
import json
import pickle
import array
from pyserini.analysis import Analyzer, get_lucene_analyzer
from pyserini.search import get_topics
import os
import gc

In [8]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /Users/manny/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [9]:
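def preprocess_string(txt):
    """
    Lowercase, tokenize, and stem a text, then drop stop words.

    Relies on the notebook-level `stemmer` and `stop_words` objects, which
    must be defined before the first call.

    Args:
        txt (str): original text to process

    Returns:
        set[str]: unique stemmed tokens. Because a set is returned, repeated
        terms are collapsed and term frequencies are not preserved.
    """
    txt = txt.lower()
    tokens = word_tokenize(txt)
    # Keep alphabetic tokens only and reduce each one to its Porter stem
    tokens = [stemmer.stem(word) for word in tokens if word.isalpha()]
    # Stop words are removed after stemming, so only stop words whose stem
    # equals the original word are filtered out
    tokens = set(tokens).difference(stop_words)

    return tokens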

In [10]:
def generate_index():
    """Build the inverted index and per-document token counts, then pickle both."""
    collection_path = f'{base_path}data/collection.tsv'
    with open(collection_path, encoding='utf-8') as f:
        num_lines = sum(1 for _ in f)
    index = {}
    doc_size = {}

    with open(collection_path, encoding='utf-8') as f:
        for line in tqdm(f, total=num_lines):
            doc_id, text = line.rstrip('\n').split('\t', 1)
            doc_id = int(doc_id)
            # Alternative: Analyzer(get_lucene_analyzer(stemmer='porter')).analyze(text)
            tokens = preprocess_string(text)

            # preprocess_string returns a set, so every count below is 1
            tokens_doc = Counter(tokens)
            for token, n_ocorrencias in tokens_doc.items():
                # Postings are two parallel typed arrays, which keeps memory low
                postings = index.setdefault(
                    token, {"doc_id": array.array("L"), "n_ocurr": array.array("L")})
                postings['doc_id'].append(doc_id)
                postings['n_ocurr'].append(n_ocorrencias)

            doc_size[doc_id] = len(tokens)

    with open(f'{base_path}data/inverted_index_nltk.pickle', 'wb') as f:
        pickle.dump(index, f)

    # Free the index before serializing the second structure
    del index
    gc.collect()

    with open(f'{base_path}data/doc_size_nltk.pickle', 'wb') as f:
        pickle.dump(doc_size, f)

    del doc_size
    gc.collect()
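
To make the layout of the index concrete, here is a toy version of the same structure built from three made-up documents. Whitespace splitting stands in for the real preprocessing, and `toy_docs` and `toy_index` are illustrative names.

```python
import array
from collections import Counter

# Three made-up documents keyed by integer doc id
toy_docs = {0: "monk theme song", 1: "song song lyrics", 2: "theme park"}

toy_index = {}
for doc_id, text in toy_docs.items():
    # Whitespace tokenization keeps the example dependency-free
    for token, count in Counter(text.split()).items():
        postings = toy_index.setdefault(
            token, {"doc_id": array.array("L"), "n_ocurr": array.array("L")})
        postings["doc_id"].append(doc_id)
        postings["n_ocurr"].append(count)

print(list(toy_index["song"]["doc_id"]), list(toy_index["song"]["n_ocurr"]))
# [0, 1] [1, 2]
```

Each token maps to two parallel arrays: the documents that contain it and the count in each of those documents.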

In [11]:
def search_bow(query, tokenizer, stop_words):
    """
    Rank documents by the summed occurrence counts of the query tokens.

    `tokenizer` and `stop_words` are only needed by the alternative Lucene
    analyzer path (tokenizer.analyze(query) minus the stop words); the active
    path uses preprocess_string. The inverted index is read from the global
    `index`.
    """
    tokens = preprocess_string(query)

    # No searchable tokens: return an empty result list
    if not tokens:
        return []

    docs_score = {}

    for token in tokens:
        # Skip query tokens that do not occur in the index
        if token in index:
            docs_found = index[token]['doc_id']
            n_ocurr_doc = index[token]['n_ocurr']

            for id_doc, n_ocurr in zip(docs_found, n_ocurr_doc):
                docs_score[id_doc] = docs_score.get(id_doc, 0) + n_ocurr

    # Sort from most relevant to least relevant
    return sorted(docs_score.items(), key=lambda x: x[1], reverse=True)
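
When only the first few hits are needed, sorting every scored document is more work than necessary. One possible alternative, sketched here with the hypothetical helper `top_k`, is to select the best `k` entries with `heapq.nlargest`, which avoids ordering the full candidate set.

```python
import heapq

def top_k(docs_score, k=10):
    """Return the k highest-scoring (doc_id, score) pairs, best first."""
    return heapq.nlargest(k, docs_score.items(), key=lambda item: item[1])

# Made-up scores for four documents
best_two = top_k({1: 3, 2: 5, 3: 1, 4: 4}, k=2)
```

The result matches the first `k` elements of the fully sorted list, so the run files would be unchanged.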

In [15]:
stop_words = nltk.corpus.stopwords.words('english')
stemmer = nltk.stem.PorterStemmer()

In [20]:
# Load (or generate, then load) the inverted index and the per-document token counts
# path_index = f'{base_path}data/inverted_index.pickle'
# path_doc_size = f'{base_path}data/doc_size.pickle'

path_index = f'{base_path}data/inverted_index_nltk.pickle'
path_doc_size = f'{base_path}data/doc_size_nltk.pickle'

# generate_index() only writes the pickles, so load them afterwards in both cases
if not os.path.exists(path_index):
    generate_index()

with open(path_index, 'rb') as f:
    index = pickle.load(f)

with open(path_doc_size, 'rb') as f:
    doc_size = pickle.load(f)

In [17]:
tokenizer = Analyzer(get_lucene_analyzer(stemmer='porter'))
stop_words = nltk.corpus.stopwords.words('english')

In [18]:
topics = get_topics('dl20')

In [21]:
query = topics[1051399]["title"]
print(f'Query: {query}')
print(f'Query tokens: {tokenizer.analyze(query)}')
resultado = search_bow(query, tokenizer, stop_words)
print(resultado[0:10])

Query: who sings monk theme song
Query tokens: ['who', 'sing', 'monk', 'theme', 'song']
[(10076, 3), (10083, 3), (11193, 3), (28358, 3), (28359, 3), (49533, 3), (58138, 3), (58139, 3), (69817, 3), (242238, 3)]


In [23]:
def run_all_queries_search_bow(file_name, topics):
    """Write the top 30 bag-of-words hits of every topic in TREC run format."""
    with open(file_name, 'w') as file:
        for topic_id in tqdm(topics):
            query = topics[topic_id]['title']
            hits = search_bow(query, tokenizer, stop_words)
            for rank, (doc_id, score) in enumerate(hits[:30], start=1):
                file.write(f'{topic_id} Q0 {doc_id} {rank} {score:.6f} bow\n')

In [44]:
run_all_queries_search_bow('run-search-bow_nltk.dl20.txt', topics)
!python -m pyserini.eval.trec_eval -c -m ndcg_cut.10 dl20-passage run-search-bow_nltk.dl20.txt

Downloading https://search.maven.org/remotecontent?filepath=uk/ac/gla/dcs/terrierteam/jtreceval/0.0.5/jtreceval-0.0.5-jar-with-dependencies.jar to /Users/manny/.cache/pyserini/eval/jtreceval-0.0.5-jar-with-dependencies.jar...
/Users/manny/.cache/pyserini/eval/jtreceval-0.0.5-jar-with-dependencies.jar already exists!
Skipping download.
Running command: ['java', '-jar', '/Users/manny/.cache/pyserini/eval/jtreceval-0.0.5-jar-with-dependencies.jar', '-c', '-m', 'ndcg_cut.10', '/Users/manny/.cache/pyserini/topics-and-qrels/qrels.dl20-passage.txt', 'run-search-bow_nltk.dl20.txt']
Results:
ndcg_cut_10           	all	0.3151


## TF-IDF searcher

Reference: https://colab.research.google.com/drive/1hELJYqsvUyja9HPeDzc9FU8okqdIjODE?usp=sharing

In [39]:
class TfidfSearcher:
    def __init__(self, inverted_index, doc_size, stop_words):
        self.inverted_index = inverted_index
        self.doc_size = doc_size
        self.total_docs = len(doc_size)
        self.tokenizer = Analyzer(get_lucene_analyzer(stemmer='porter'))
        self.stop_words = stop_words

    def idf(self, token):
        """Return log10(N / df) for the token, caching it in the index entry."""
        entry = self.inverted_index[token]
        if 'idf' not in entry:
            docs_with_token = len(entry['doc_id'])
            entry['idf'] = np.log10(self.total_docs / docs_with_token)

        return entry['idf']

    def tf_idf(self, doc_id, freq, token):
        """Return the length-normalized term frequency times the IDF."""
        # Use the instance's doc_size rather than the notebook-level global
        tf = freq / self.doc_size[doc_id]

        return tf * self.idf(token)

    def _tokenize(self, text):
        # Alternative: self.tokenizer.analyze(text) minus self.stop_words
        return preprocess_string(text)

    def vec_query_tf(self, tokenized_query):
        """Return the term counts of the query."""
        return Counter(tokenized_query)

    def query(self, query):
        tokenized_query = self._tokenize(query)

        scores = {}

        for token in tokenized_query:
            # Query tokens absent from the index cannot match any document
            if token not in self.inverted_index:
                continue

            docs_with_tokens = self.inverted_index[token]['doc_id']
            freqs = self.inverted_index[token]['n_ocurr']

            for doc_id, freq in zip(docs_with_tokens, freqs):
                scores[doc_id] = scores.get(doc_id, 0.0) + self.tf_idf(doc_id, freq, token)

        # Sort from most relevant to least relevant
        return sorted(scores.items(), key=lambda x: x[1], reverse=True)
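
The weighting used by the class can be checked by hand on made-up numbers: `tf` is the term count divided by the document length, and `idf` is `log10(N / df)`. The helper `toy_tf_idf` below is an illustrative stand-in, not part of the notebook.

```python
import math

def toy_tf_idf(freq, doc_len, num_docs, df):
    """Length-normalized term frequency times log10 inverse document frequency."""
    tf = freq / doc_len
    idf = math.log10(num_docs / df)
    return tf * idf

# Hypothetical numbers: a term occurring 3 times in a 60-token document,
# present in 10 of 1000 documents: tf = 0.05, idf = 2, tf-idf = 0.1
rare = toy_tf_idf(freq=3, doc_len=60, num_docs=1000, df=10)
# The same counts for a term present in 500 of 1000 documents
common = toy_tf_idf(freq=3, doc_len=60, num_docs=1000, df=500)
```

A term that appears in every document gets `idf = 0` and therefore contributes nothing to the score.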

In [40]:
searcher = TfidfSearcher(index, doc_size, stop_words)

In [41]:
query = topics[1051399]["title"]
print(f'Query: {query}')
print(f'Query tokens: {tokenizer.analyze(query)}')
resultado = searcher.query(query)
print(resultado[0:10])

Query: who sings monk theme song
Query tokens: ['who', 'sing', 'monk', 'theme', 'song']
[(1989028, 1.108891170783285), (69813, 1.0672879788485814), (69818, 1.0672879788485814), (4359243, 1.0008311730685004), (2319136, 0.8929669447961761), (2866978, 0.8536729666367812), (5329142, 0.8536729666367812), (3466639, 0.8340259775570836), (3943227, 0.8034992723261856), (4245382, 0.768305669973103)]


In [42]:
# Run all queries in topics and write up to 1000 hits per query in TREC run format
def run_all_queries(file, topics, searcher):
    with open(file, 'w') as runfile:
        for topic_id in tqdm(topics):
            query = topics[topic_id]['title']
            hits = searcher.query(query)
            for rank, (doc_id, score) in enumerate(hits[:1000], start=1):
                runfile.write(f'{topic_id} Q0 {doc_id} {rank} {score:.6f} TFIDF\n')

In [43]:
run_all_queries('run-search-tfidf_nltk.dl20.txt', topics, searcher)
!python -m pyserini.eval.trec_eval -c -m ndcg_cut.10 dl20-passage run-search-tfidf_nltk.dl20.txt

Downloading https://search.maven.org/remotecontent?filepath=uk/ac/gla/dcs/terrierteam/jtreceval/0.0.5/jtreceval-0.0.5-jar-with-dependencies.jar to /Users/manny/.cache/pyserini/eval/jtreceval-0.0.5-jar-with-dependencies.jar...
/Users/manny/.cache/pyserini/eval/jtreceval-0.0.5-jar-with-dependencies.jar already exists!
Skipping download.
Running command: ['java', '-jar', '/Users/manny/.cache/pyserini/eval/jtreceval-0.0.5-jar-with-dependencies.jar', '-c', '-m', 'ndcg_cut.10', '/Users/manny/.cache/pyserini/topics-and-qrels/qrels.dl20-passage.txt', 'run-search-tfidf_nltk.dl20.txt']
Results:
ndcg_cut_10           	all	0.1050
