# Pyserini: BM25 Baseline for MS MARCO Passage Retrieval


Based on https://github.com/castorini/pyserini/blob/master/docs/experiments-msmarco-passage.md

> Dataset - https://microsoft.github.io/msmarco/TREC-Deep-Learning-2020.html#passage-ranking-task

## Imports / Installs

In [6]:
%%shell
pip install pyserini
apt-get install maven -qq
git clone --recurse-submodules https://github.com/castorini/pyserini.git
cd pyserini
cd tools/eval && tar xvfz trec_eval.9.0.4.tar.gz && cd trec_eval.9.0.4 && make && cd ../../..
cd tools/eval/ndeval && make && cd ../../..




## Dataset

In [3]:
# download and extract the dataset
# -p creates the missing parent directory (a plain mkdir fails when collections/ does not exist yet)
!mkdir -p collections/msmarco-passage

!wget https://msmarco.blob.core.windows.net/msmarcoranking/collectionandqueries.tar.gz -P collections/msmarco-passage

# Alternative mirror:
# wget https://www.dropbox.com/s/9f54jg2f71ray3b/collectionandqueries.tar.gz -P collections/msmarco-passage

!tar xvfz collections/msmarco-passage/collectionandqueries.tar.gz -C collections/msmarco-passage

--2023-03-09 01:59:33--  https://msmarco.blob.core.windows.net/msmarcoranking/collectionandqueries.tar.gz
Resolving msmarco.blob.core.windows.net (msmarco.blob.core.windows.net)... 20.150.34.4
Connecting to msmarco.blob.core.windows.net (msmarco.blob.core.windows.net)|20.150.34.4|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1057717952 (1009M) [application/gzip]
Saving to: ‘collections/msmarco-passage/collectionandqueries.tar.gz’


2023-03-09 02:00:52 (12.8 MB/s) - ‘collections/msmarco-passage/collectionandqueries.tar.gz’ saved [1057717952/1057717952]

collection.tsv
qrels.dev.small.tsv
qrels.train.tsv
queries.dev.small.tsv
queries.dev.tsv
queries.eval.small.tsv
queries.eval.tsv
queries.train.tsv


> **TODO**: exploratory data analysis (EDA) of the dataset is still missing 😞

> **TODO**: better understand the MS MARCO dataset and TREC evaluation
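
As a possible starting point for the missing EDA, here is a minimal sketch that computes passage-length statistics. It assumes `collection.tsv` holds two tab-separated columns (passage id, passage text); the function name `passage_length_stats` and the sample rows are made up for illustration.

```python
import csv
import io
import statistics


def passage_length_stats(tsv_lines):
    """Return word-count statistics for (passage id, passage text) TSV rows."""
    lengths = []
    # QUOTE_NONE keeps quotation marks inside passages from confusing the parser.
    for row in csv.reader(tsv_lines, delimiter="\t", quoting=csv.QUOTE_NONE):
        if len(row) != 2:
            continue  # skip malformed rows
        _pid, text = row
        lengths.append(len(text.split()))
    return {
        "passages": len(lengths),
        "min_words": min(lengths),
        "max_words": max(lengths),
        "mean_words": statistics.mean(lengths),
        "median_words": statistics.median(lengths),
    }


# Made-up rows standing in for collection.tsv (not real dataset content).
sample = io.StringIO(
    "0\tThe presence of communication amid scientific minds\n"
    "1\tThe Manhattan Project and its atomic bomb\n"
    "2\tEssay on The Manhattan Project\n"
)
stats = passage_length_stats(sample)
print(stats)
```

For the real collection, the same function could be fed an open file handle, e.g. `open("collections/msmarco-passage/collection.tsv", encoding="utf-8")`.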

In [8]:
# convert the MS MARCO tsv collection into Pyserini's jsonl files
# should generate 9 jsonl files in collections/msmarco-passage/collection_jsonl
!python pyserini/tools/scripts/msmarco/convert_collection_to_jsonl.py \
 --collection-path collections/msmarco-passage/collection.tsv \
 --output-folder collections/msmarco-passage/collection_jsonl

Converting collection...
Converted 0 docs, writing into file 1
Converted 100,000 docs, writing into file 1
Converted 200,000 docs, writing into file 1
Converted 300,000 docs, writing into file 1
Converted 400,000 docs, writing into file 1
Converted 500,000 docs, writing into file 1
Converted 600,000 docs, writing into file 1
Converted 700,000 docs, writing into file 1
Converted 800,000 docs, writing into file 1
Converted 900,000 docs, writing into file 1
Converted 1,000,000 docs, writing into file 2
Converted 1,100,000 docs, writing into file 2
Converted 1,200,000 docs, writing into file 2
Converted 1,300,000 docs, writing into file 2
Converted 1,400,000 docs, writing into file 2
Converted 1,500,000 docs, writing into file 2
Converted 1,600,000 docs, writing into file 2
Converted 1,700,000 docs, writing into file 2
Converted 1,800,000 docs, writing into file 2
Converted 1,900,000 docs, writing into file 2
Converted 2,000,000 docs, writing into file 3
Converted 2,100,000 docs, writing i

## Create Lucene Index

In [10]:
# create the index with 8,841,823 documents
!python -m pyserini.index.lucene \
  --collection JsonCollection \
  --input collections/msmarco-passage/collection_jsonl \
  --index indexes/lucene-index-msmarco-passage \
  --generator DefaultLuceneDocumentGenerator \
  --threads 9 \
  --storePositions --storeDocvectors --storeRaw

2023-03-09 02:15:05,262 INFO  [main] index.IndexCollection (IndexCollection.java:391) - Setting log level to INFO
2023-03-09 02:15:05,275 INFO  [main] index.IndexCollection (IndexCollection.java:394) - Starting indexer...
2023-03-09 02:15:05,276 INFO  [main] index.IndexCollection (IndexCollection.java:396) - DocumentCollection path: collections/msmarco-passage/collection_jsonl
2023-03-09 02:15:05,277 INFO  [main] index.IndexCollection (IndexCollection.java:397) - CollectionClass: JsonCollection
2023-03-09 02:15:05,280 INFO  [main] index.IndexCollection (IndexCollection.java:398) - Generator: DefaultLuceneDocumentGenerator
2023-03-09 02:15:05,281 INFO  [main] index.IndexCollection (IndexCollection.java:399) - Threads: 9
2023-03-09 02:15:05,281 INFO  [main] index.IndexCollection (IndexCollection.java:400) - Language: en
2023-03-09 02:15:05,281 INFO  [main] index.IndexCollection (IndexCollection.java:401) - Stemmer: porter
2023-03-09 02:15:05,282 INFO  [main] index.IndexCollection (IndexC

## Development queries

> The development subset has 6,980 queries, stored in pyserini/tools/topics-and-qrels/topics.msmarco-passage.dev-subset.txt

Each line contains a tab-delimited (query id, query) pair. Pyserini already knows how to load and iterate through these pairs.

In [12]:
!head pyserini/tools/topics-and-qrels/topics.msmarco-passage.dev-subset.txt

1048585	what is paula deen's brother
2	 Androgen receptor define
524332	treating tension headaches without medication
1048642	what is paranoid sc
524447	treatment of varicose veins in legs
786674	what is prime rate in canada
1048876	who plays young dr mallard on ncis
1048917	what is operating system misconfiguration
786786	what is priority pass
524699	tricare service number
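
The tab-delimited layout shown above can also be parsed by hand. This is a minimal sketch that does not use Pyserini's own loader; `parse_topics` is a hypothetical helper, and the sample lines are copied from the `head` output.

```python
def parse_topics(lines):
    """Parse tab-delimited (query id, query) lines into a {qid: query} dict."""
    topics = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # ignore blank lines
        qid, query = line.split("\t", 1)
        # Some queries carry stray leading whitespace, so strip it.
        topics[int(qid)] = query.strip()
    return topics


sample_lines = [
    "1048585\twhat is paula deen's brother\n",
    "2\t Androgen receptor define\n",
    "524332\ttreating tension headaches without medication\n",
]
topics = parse_topics(sample_lines)
for qid, query in topics.items():
    print(qid, query)
```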


## Search using development queries

> Using BM25 with parameters k1=0.82, b=0.68.

> The option **--output-format msmarco** generates output in the **MS MARCO** output format. The option **--hits** sets the number of documents to return per query, so the output file should have about 6,980 × 1,000 = 6.98M lines.
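
To make the roles of k1 and b concrete, here is a sketch of a textbook BM25 scoring function. It is an illustration only, not Lucene's exact implementation (details of the idf and term-frequency parts can differ between variants); the helper name `bm25_score` and the toy corpus are assumptions.

```python
import math
from collections import Counter


def bm25_score(query_terms, doc_terms, doc_freqs, num_docs, avg_doc_len,
               k1=0.82, b=0.68):
    """Score one document against a query with a textbook BM25 formula."""
    tf = Counter(doc_terms)
    doc_len = len(doc_terms)
    # b controls how strongly document length is normalised;
    # k1 controls how quickly repeated terms saturate.
    norm = k1 * (1 - b + b * doc_len / avg_doc_len)
    score = 0.0
    for term in query_terms:
        if tf[term] == 0:
            continue  # terms absent from the document contribute nothing
        df = doc_freqs.get(term, 0)
        idf = math.log(1 + (num_docs - df + 0.5) / (df + 0.5))
        score += idf * tf[term] * (k1 + 1) / (tf[term] + norm)
    return score


# Toy corpus, tokenised by whitespace (illustrative only).
docs = [
    "prime rate in canada".split(),
    "tension headaches treatment".split(),
    "canada prime minister".split(),
]
doc_freqs = Counter(term for doc in docs for term in set(doc))
avg_len = sum(len(d) for d in docs) / len(docs)
query = "prime rate".split()
scores = [bm25_score(query, d, doc_freqs, len(docs), avg_len) for d in docs]
print(scores)
```

In this toy corpus the first document matches both query terms and ranks highest, while the second matches neither and scores zero.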

In [15]:
!python -m pyserini.search.lucene \
  --index indexes/lucene-index-msmarco-passage \
  --topics msmarco-passage-dev-subset \
  --output runs/run.msmarco-passage.bm25tuned.txt \
  --output-format msmarco \
  --hits 1000 \
  --bm25 --k1 0.82 --b 0.68 \
  --threads 16 \
  --batch-size 64

2023-03-09 02:35:18.448760: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-03-09 02:35:21.815429: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2023-03-09 02:35:21.816151: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
Using pre-defined topic order for msma

## Evaluate the results using the official MS MARCO evaluation script

In [17]:
!python pyserini/tools/scripts/msmarco/msmarco_passage_eval.py \
   pyserini/tools/topics-and-qrels/qrels.msmarco-passage.dev-subset.txt runs/run.msmarco-passage.bm25tuned.txt

#####################
MRR @10: 0.18741227770955546
QueriesRanked: 6980
#####################
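
For intuition about the number above, MRR@10 can be sketched in a few lines. This is not the official script: `mrr_at_k` is a hypothetical helper, the ids are made up, and dividing by the number of judged queries is an assumption about the normalisation.

```python
def mrr_at_k(qrels, run, k=10):
    """Mean reciprocal rank of the first relevant hit within the top k.

    qrels: {qid: set of relevant docids}
    run:   {qid: list of docids in ranked order}
    """
    total = 0.0
    for qid, relevant in qrels.items():
        for rank, docid in enumerate(run.get(qid, [])[:k], start=1):
            if docid in relevant:
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(qrels)


# Made-up judgments and rankings.
qrels = {"q1": {"d2"}, "q2": {"d7"}, "q3": {"d9"}}
run = {
    "q1": ["d1", "d2", "d3"],  # first relevant hit at rank 2 -> 1/2
    "q2": ["d7", "d8"],        # first relevant hit at rank 1 -> 1
    "q3": ["d4", "d5"],        # no relevant hit -> 0
}
mrr = mrr_at_k(qrels, run)
print(mrr)
```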


## Evaluate with the official TREC evaluation tool

In [18]:
# convert the run file into TREC format

!python -m pyserini.eval.convert_msmarco_run_to_trec_run \
   --input runs/run.msmarco-passage.bm25tuned.txt \
   --output runs/run.msmarco-passage.bm25tuned.trec

!python pyserini/tools/scripts/msmarco/convert_msmarco_to_trec_qrels.py \
   --input pyserini/tools/topics-and-qrels/qrels.msmarco-passage.dev-subset.txt \
   --output collections/msmarco-passage/qrels.dev.small.trec

Done!
Done!
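
The run conversion is essentially a change of line layout. The sketch below assumes MS MARCO run lines are `qid<TAB>docid<TAB>rank` and emits the six-column TREC layout `qid Q0 docid rank score tag`. Using 1/rank as the score and `bm25` as the run tag are assumptions made for illustration, and the document ids are made up.

```python
def msmarco_to_trec(lines, run_tag="bm25"):
    """Convert 'qid<TAB>docid<TAB>rank' lines to TREC run lines."""
    converted = []
    for line in lines:
        qid, docid, rank = line.rstrip("\n").split("\t")
        # The MS MARCO format has no score column, so derive a score that
        # decreases with rank (an assumption, see the note above).
        score = 1.0 / int(rank)
        converted.append(f"{qid} Q0 {docid} {rank} {score:.6f} {run_tag}")
    return converted


# Made-up document ids for query 2.
trec_lines = msmarco_to_trec(["2\t111\t1\n", "2\t222\t2\n"])
for trec_line in trec_lines:
    print(trec_line)
```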


In [20]:
# run trec_eval
! pyserini/tools/eval/trec_eval.9.0.4/trec_eval -c -m recall.1000 -m ndcg_cut.10 -m map \
   collections/msmarco-passage/qrels.dev.small.trec runs/run.msmarco-passage.bm25tuned.trec

map                   	all	0.1957
recall_1000           	all	0.8573
ndcg_cut_10           	all	0.2340


> Average precision or AP (also called mean average precision, MAP) and recall@1000 (recall at rank 1000) are the two metrics we care about the most. AP captures aspects of both precision and recall in a single metric, and is the most common metric used by information retrieval researchers. On the other hand, recall@1000 provides the upper bound on the effectiveness of downstream reranking modules (i.e., rerankers are useless if there isn't a relevant document in the results).

> **TODO**: better understand those metrics
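
As a small step towards that understanding, here is a sketch of both metrics for a single query. MAP is the mean of the per-query AP values. The function names and the toy ranking are illustrative assumptions, not trec_eval's code.

```python
def average_precision(relevant, ranking):
    """AP for one query: mean of precision values at each relevant hit."""
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, docid in enumerate(ranking, start=1):
        if docid in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    # Relevant documents that were never retrieved contribute zero.
    return precision_sum / len(relevant)


def recall_at_k(relevant, ranking, k=1000):
    """Fraction of the relevant documents found in the top k results."""
    if not relevant:
        return 0.0
    return len(relevant & set(ranking[:k])) / len(relevant)


# Toy example: two relevant documents, retrieved at ranks 1 and 4.
relevant = {"d1", "d4"}
ranking = ["d1", "d2", "d3", "d4", "d5"]
ap = average_precision(relevant, ranking)  # (1/1 + 2/4) / 2
print(ap, recall_at_k(relevant, ranking, k=3), recall_at_k(relevant, ranking, k=5))
```

A reranker that only sees the top 3 results here could never surface `d4`, which is why recall at the retrieval cutoff bounds what reranking can achieve.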

# Using the Python library and supporting libraries - WIP

In [1]:
!pip install trectools
!pip install pyserini
!pip install faiss-cpu


# Boolean / bag-of-words IR - WIP

# TF-IDF IR - WIP

# Results