# MSMARCO Passage Reranking using PyTerrier and Sentence Transformers

This notebook demonstrates the easy application of Sentence Transformers using PyTerrier.

In [1]:
%pip install -q python-terrier sentence-transformers

[K     |████████████████████████████████| 102 kB 39.6 MB/s 
[K     |████████████████████████████████| 85 kB 4.9 MB/s 
[K     |████████████████████████████████| 1.3 MB 50.2 MB/s 
[K     |████████████████████████████████| 69 kB 8.2 MB/s 
[K     |████████████████████████████████| 311 kB 50.9 MB/s 
[K     |████████████████████████████████| 46 kB 3.9 MB/s 
[K     |████████████████████████████████| 286 kB 45.4 MB/s 
[K     |████████████████████████████████| 45 kB 3.4 MB/s 
[K     |████████████████████████████████| 72 kB 1.8 MB/s 
[?25h  Installing build dependencies ... [?25l[?25hdone
  Getting requirements to build wheel ... [?25l[?25hdone
    Preparing wheel metadata ... [?25l[?25hdone
[K     |████████████████████████████████| 294 kB 61.5 MB/s 
[K     |████████████████████████████████| 1.2 MB 56.1 MB/s 
[K     |████████████████████████████████| 126 kB 72.2 MB/s 
[K     |████████████████████████████████| 5.5 MB 51.8 MB/s 
[K     |████████████████████████████████| 1.3 MB

In [2]:
import pyterrier as pt
from pyterrier.measures import *

terrier-assemblies 5.7 jar-with-dependencies not found, downloading to /root/.pyterrier...
Done
terrier-python-helper 0.0.7 jar not found, downloading to /root/.pyterrier...
Done


PyTerrier 0.9.1 has loaded Terrier 5.7 (built by craigm on 2022-11-10 18:30) and terrier-helper 0.0.7



## BM25

We use a pre-built index for MSMARCO v1 passage corpus, obtained direct from the [Terrier data repository](http://data.terrier.org/). On Colab, this [particular index](http://data.terrier.org/msmarco_passage.dataset.html#terrier_stemmed_text) takes around 4 minutes to download.

In [3]:
bm25 = pt.terrier.Retriever.from_dataset('msmarco_passage', 'terrier_stemmed_text', wmodel='BM25', metadata=['docno', 'text'])

Downloading msmarco_passage index to /root/.pyterrier/corpora/msmarco_passage/index/terrier_stemmed_text


  warn("Downloading index of > 2GB.")


data.direct.bf:   0%|          | 0.00/486M [00:00<?, ?iB/s]

data.document.fsarrayfile:   0%|          | 0.00/177M [00:00<?, ?iB/s]

data.inverted.bf:   0%|          | 0.00/377M [00:00<?, ?iB/s]

data.lexicon.fsomapfile:   0%|          | 0.00/100M [00:00<?, ?iB/s]

data.lexicon.fsomaphash:   0%|          | 0.00/0.99k [00:00<?, ?iB/s]

data.lexicon.fsomapid:   0%|          | 0.00/4.47M [00:00<?, ?iB/s]

data.meta-0.fsomapfile:   0%|          | 0.00/295M [00:00<?, ?iB/s]

data.meta.idx:   0%|          | 0.00/67.5M [00:00<?, ?iB/s]

data.meta.zdata:   0%|          | 0.00/1.91G [00:00<?, ?iB/s]

data.properties:   0%|          | 0.00/4.33k [00:00<?, ?iB/s]

md5sums:   0%|          | 0.00/537 [00:00<?, ?iB/s]

18:37:34.127 [main] WARN org.terrier.structures.BaseCompressingMetaIndex - Structure meta reading data file directly from disk (SLOW) - try index.meta.data-source=fileinmem in the index properties file. 1.9 GiB of memory would be required.


This is the dataset we are will use for evaluation.

In [5]:
dataset = pt.get_dataset("trec-deep-learning-passages")

We now instantiate the SentenceTransformers models. Both of these model files download the necessary files from HuggingFace. 

We `pt.apply.doc_score()` transformers for applying these models. The use of a `batch_size` kwarg allows the models to be applied on multiple query-document pairs at once.

Finally, we use a `pt.Experiment` for evaluation. We combine the cross-encoder and bi-encoders to re-rank the output of BM25. We

In [7]:
import pandas as pd
from sentence_transformers import CrossEncoder, SentenceTransformer
crossmodel = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-12-v2', max_length=512)
bimodel = SentenceTransformer('paraphrase-MiniLM-L6-v2')

def _crossencoder_apply(df : pd.DataFrame):
    return crossmodel.predict(list(zip(df['query'].values, df['text'].values)))

cross_encT = pt.apply.doc_score(_crossencoder_apply, batch_size=128)

def _biencoder_apply(df : pd.DataFrame):
    from sentence_transformers.util import cos_sim
    query_embs = bimodel.encode(df['query'].values)
    doc_embs = bimodel.encode(df['text'].values)
    scores =  cos_sim(query_embs, doc_embs)
    return scores[0]

bi_encT = pt.apply.doc_score(_biencoder_apply, batch_size=128)

pt.Experiment(
    [ bm25, bm25 >> bi_encT, bm25 >> cross_encT ],
    dataset.get_topics("test-2019"),
    dataset.get_qrels("test-2019"),
    [RR(rel=2), nDCG@10, nDCG@100, AP(rel=2)],
    names=["BM25", "BM25 >> BiEncoder", "BM25 >> CrossEncoder"],
    filter_by_qrels=True
)

pt.apply:   0%|          | 0/43 [00:00<?, ?row/s]

pt.apply:   0%|          | 0/43 [00:00<?, ?row/s]

Unnamed: 0,name,RR(rel=2),nDCG@10,nDCG@100,AP(rel=2)
0,BM25,0.641565,0.47954,0.487416,0.286448
1,BM25 >> BiEncoder,0.79335,0.605352,0.54994,0.362597
2,BM25 >> CrossEncoder,0.914729,0.738378,0.669271,0.488546
