# NIR 2022 - Lab 5: Neural Re-Ranking with Transformers

In this lab, we will look at neural approaches for re-ranking.
The common pipeline consists of two stages: first, an efficient retrieval system (e.g. BM25) is used to retrieve a number of relevant documents (e.g. 1000) from the entire collection (e.g. 1M documents).
Then, a slower, but more accurate neural system is used to re-rank (i.e. re-order) the documents returned by the system in the first stage.

![Pipeline](https://drive.google.com/uc?id=1rj4d0ko4817WfptNd0FozQ3BluudWwI-)

The gains of neural models come from the dense representations they produce.
Lexical search (e.g. BM25) matches query terms literally, so it does not recognize synonyms, acronyms or spelling variations.
In contrast, semantic search (or dense retrieval) encodes the search query into vector space and retrieves the document embeddings that are close in vector space.
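
To make the contrast concrete, here is a minimal sketch in plain Python. The three-dimensional vectors are hand-picked, hypothetical embeddings (they do not come from any model), and `term_overlap` is only a crude stand-in for lexical matching, not BM25.

```python
import math

def term_overlap(query: str, doc: str) -> int:
    """Count query terms that literally occur in the document (lexical matching)."""
    doc_terms = set(doc.lower().split())
    return sum(1 for term in query.lower().split() if term in doc_terms)

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical 3-dimensional embeddings, hand-picked for illustration only.
toy_embeddings = {
    "car": [0.9, 0.1, 0.0],
    "automobile": [0.85, 0.15, 0.05],
    "banana": [0.0, 0.2, 0.95],
}

# The synonym shares no term with the query, so lexical matching finds nothing.
lexical_score = term_overlap("car", "automobile")

# In the toy vector space, the synonym is close and the unrelated word is far away.
sim_synonym = cosine_similarity(toy_embeddings["car"], toy_embeddings["automobile"])
sim_unrelated = cosine_similarity(toy_embeddings["car"], toy_embeddings["banana"])
print(lexical_score, round(sim_synonym, 3), round(sim_unrelated, 3))
```

With real models the embeddings have several hundred dimensions, but the comparison step is the same.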

Today, we will build this two-stage pipeline.
In the second stage, we will use contextualized (i.e. from Transformers) document representations.
We will make use of the [SentenceTransformers](https://www.sbert.net/) library for this.

The material for this lab is largely based on the SentenceTransformers documentation.

## Data and PyTerrier Setup

In [1]:
# Load the data
import pandas as pd

BASEDIR = './'
# corpus
docs_df = pd.read_csv(BASEDIR + 'data/lab_docs.csv', dtype=str)
print(docs_df.shape)
print(docs_df.head())

# topics
topics_df = pd.read_csv(BASEDIR + 'data/lab_topics.csv', dtype=str)
print(topics_df.shape)
print(topics_df.head())

# Load qrels
qrels_df = pd.read_csv(BASEDIR + 'data/lab_qrels.csv', dtype=str)
print(qrels_df.shape)
print(qrels_df.head())

(2453, 2)
     docno                                               text
0   935016  he emigrated to france with his family in 1956...
1  2360440  after being ambushed by the germans in novembe...
2   347765  she was the second ship named for captain alex...
3  1969335  world war ii was a global war that was under w...
4  1576938  the ship was ordered on 2 april 1942 laid down...
(9, 2)
       qid                 query
0  1015979    president of chile
1     2674    computer animation
2   340095  2020 summer olympics
3  1502917         train station
4     2574       chinese cuisine
(2454, 4)
       qid    docno label iteration
0  1015979  1015979     2         0
1  1015979  2226456     1         0
2  1015979  1514612     1         0
3  1015979  1119171     1         0
4  1015979  1053174     1         0


In [2]:
# Init PyTerrier
import pyterrier as pt
if not pt.started():
    pt.init()

PyTerrier 0.8.1 has loaded Terrier 5.6 (built by craigmacdonald on 2021-09-17 13:27)

No etc/terrier.properties, using terrier.default.properties for bootstrap configuration.


In [4]:
# Build index

'''
def generate_dataset():
    with open('data/lab_docs_title.csv','r') as fin:
        for line in fin:
            fields = line.split('\t')
            docno = fields[0]
            text = fields[1]
            title = fields[2]

            yield {"docno": docno, "title": title, "text": text}
iter_indexer = pt.IterDictIndexer(
    "./indexes/iterindex",
    overwrite=True,
    meta=["docno", "title", "text"],
    meta_lengths=[20, 100, 4096],
)
indexref = iter_indexer.index(generate_dataset(), fields=["title", "text"])

'''
index = pt.IndexFactory.of("./indexes/iterindex")
print(index.getCollectionStatistics())

Number of documents: 2453
Number of terms: 23784
Number of postings: 208792
Number of fields: 2
Number of tokens: 280639
Field names: [title, text]
Positions:   false



In [5]:
# Build IR systems
tf = pt.BatchRetrieve(index, wmodel="Tf")
tfidf = pt.BatchRetrieve(index, wmodel="TF_IDF")
bm25 = pt.BatchRetrieve(index, wmodel="BM25")

In [6]:
# Evaluate systems on all topics using the PyTerrier Experiment interface
qrels_df = qrels_df.astype({'label':'int32'})
pt.Experiment(
    retr_systems=[tf, tfidf, bm25,],
    names=['TF', 'TF-IDF', 'BM25'],
    topics=topics_df,
    qrels=qrels_df,
    eval_metrics=["map", "ndcg", "ndcg_cut_10", "P_10"])

     name       map      ndcg  ndcg_cut_10      P_10
0      TF  0.614762  0.791533     0.845332  0.788889
1  TF-IDF  0.625618  0.799612     0.842121  0.766667
2    BM25  0.632451  0.802556     0.844729  0.766667


In [13]:
text_pipeline = bm25 >> pt.text.get_text(index,'text')

text_pipeline.search("hello world")


     qid  docid    docno  rank      score        query  text
0      1    516  1731988     0  10.285461  hello world  directed by welles the show starred welles ors...
1      1   2278  1830554     1   2.316304  hello world  aniruddha d joshi m d medicine rhuematologist ...
2      1    161  2372290     2   2.210342  hello world  it will be the nation s eighth consecutive app...
3      1   1825  1168468     3   2.198790  hello world  juliette gordon low was committed to offering ...
4      1   2270  2376896     4   2.196002  hello world  it will be the nation s seventh consecutive ap...
..   ...    ...      ...   ...        ...          ...  ...
736    1   1196   139634   736   1.123616  hello world  she was named for the battle of san jacinto du...
737    1   2178   941872   737   1.123616  hello world  tonne was credited with 122 aerial victories b...
738    1   2213  1038068   738   1.112537  hello world  the ship was ordered and laid down as uss pce ...
739    1   1504  1147781   739   1.108892  hello world  created in part by actor james stewart the fie...


## Sentence Transformers

Before implementing the 2-stage pipeline, let's first look at how to use [Sentence Transformers](https://github.com/UKPLab/sentence-transformers) to compute the similarity between two sentences (e.g. a query and a document).

This library is based on PyTorch and [HuggingFace Transformers](https://github.com/huggingface/transformers) and offers a large collection of pre-trained models tuned for various tasks. 

This framework provides an easy method to compute dense vector representations for sentences and paragraphs. The models are based on transformer networks such as BERT and RoBERTa. Text is embedded in vector space such that similar texts are close together and can be found efficiently using cosine similarity.

You can also fine-tune models on your own data. See [this page](https://www.sbert.net/docs/training/overview.html) for more details.

In [38]:
# Install SentenceTransformers
# !pip install -U sentence-transformers

The SentenceTransformers library defines two types of neural architectures for sentence representations: bi-encoders and cross-encoders.

![Bi_vs_Cross-Encoder.png](https://drive.google.com/uc?id=1D-bcfKooVKUpfIs1fl8TrGGLnbGBa1-p)

- A bi-encoder embeds two sentences separately and then computes their cosine similarity from the pooled representation.
- A cross-encoder processes two sentences together by concatenating them into a single input. Their similarity is then given as the output of a classifier.
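
The difference is easiest to see in the function signatures. The sketch below uses hypothetical stand-ins (`toy_encode`, `toy_bi_score`, `toy_cross_score`) built from word counts; no neural network is involved, and the `[SEP]` token only mimics the idea of concatenation.

```python
from collections import Counter

# Hypothetical stand-ins for the two architectures; no neural network is involved.
VOCAB = ["london", "population", "area", "big", "inhabitants"]

def toy_encode(text: str) -> list:
    """Bi-encoder style: map ONE text to a fixed-size vector, independent of any other text."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in VOCAB]

def toy_bi_score(query_vec: list, doc_vec: list) -> float:
    """Compare two independently produced vectors (dot product for brevity)."""
    return float(sum(q * d for q, d in zip(query_vec, doc_vec)))

def toy_cross_score(query: str, doc: str) -> float:
    """Cross-encoder style: the scorer sees BOTH texts at once and returns one number."""
    joint_input = query.lower().split() + ["[SEP]"] + doc.lower().split()
    sep = joint_input.index("[SEP]")
    query_terms, doc_terms = set(joint_input[:sep]), set(joint_input[sep + 1:])
    return len(query_terms & doc_terms) / len(query_terms)

docs = ["london population inhabitants", "london area"]
doc_vectors = [toy_encode(d) for d in docs]  # can be computed once and stored

query = "london area"
bi_scores = [toy_bi_score(toy_encode(query), v) for v in doc_vectors]
cross_scores = [toy_cross_score(query, d) for d in docs]  # one joint pass per pair
print(bi_scores, cross_scores)
```

Note that `toy_encode` never sees the query when it encodes a document, whereas `toy_cross_score` cannot be called without both texts.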


In the following, we load Transformer models that have been trained on [MS MARCO](https://microsoft.github.io/msmarco/).
MS MARCO is a large-scale information retrieval corpus built from real user search queries issued to the Bing search engine.
The training data consists of over 500k examples.

The SentenceTransformer models can be used to find relevant documents given keywords, a search phrase or a question.

### Bi-Encoder

A Bi-Encoder produces a sentence embedding for a given sentence. We pass the sentences A and B independently to a BERT model, which results in the sentence embeddings u and v. These sentence embeddings can then be compared using cosine similarity.

Bi-Encoders are used whenever you need a sentence embedding in a vector space for efficient comparison. 

In [39]:
from sentence_transformers import SentenceTransformer, util

# Load pre-trained model
model = SentenceTransformer('msmarco-distilbert-base-v3')

# Sentences are encoded by calling model.encode()
query_embedding = model.encode('How big is London')
doc_embedding = model.encode('London has 9,787,426 inhabitants at the 2011 census')

# Compute cosine similarity between query and document representations
cos_sim = util.pytorch_cos_sim(query_embedding, doc_embedding)
print("Cosine-Similarity:", cos_sim)

Cosine-Similarity: tensor([[0.6082]])


In [40]:
# We can compute the similarity between a query and a list of sentences
documents = [
    'London has 9,787,426 inhabitants at the 2011 census',
    'London encompasses a total area of 1,583 square kilometres'
]

query_embedding = model.encode('How big is London')
docs_embedding = model.encode(documents)

cos_sim = util.pytorch_cos_sim(query_embedding, docs_embedding)
print("Cosine-Similarity:", cos_sim)

Cosine-Similarity: tensor([[0.6083, 0.7642]])


### Cross-Encoder

In contrast, for a Cross-Encoder, we pass both sentences simultaneously to the Transformer network. It then produces an output value indicating the similarity of the input sentence pair.

A Cross-Encoder does not produce a sentence embedding. Also, we are not able to pass individual sentences to a Cross-Encoder.

Cross-Encoders achieve better performance than Bi-Encoders. However, for many applications they are not practical, as they do not produce embeddings we could index or efficiently compare using cosine similarity.

Cross-Encoders can be used whenever you have a pre-defined set of sentence pairs you want to score.
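
A rough back-of-the-envelope count shows why this matters. The sketch below uses the sizes from this lab (2453 documents, 9 topics, and the cutoff K = 100 used for re-ranking) and assumes, as a simplification, that cost is one model forward pass per input; batching and sequence length are ignored.

```python
# Numbers taken from this lab: collection size, number of topics, and the cutoff K.
num_docs = 2453
num_queries = 9
top_k = 100

# Bi-encoder: each document and each query is encoded once, on its own.
bi_encoder_passes = num_docs + num_queries

# Cross-encoder over the full collection: one pass per (query, document) pair.
cross_encoder_full_passes = num_queries * num_docs

# Cross-encoder used only as a re-ranker on at most K candidates per query.
cross_encoder_rerank_passes = num_queries * top_k

print(bi_encoder_passes, cross_encoder_full_passes, cross_encoder_rerank_passes)
```

The document passes of a bi-encoder can also be done offline, before any query arrives, while every cross-encoder pass depends on the query.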


In [41]:
from sentence_transformers import CrossEncoder
model = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-12-v2')
scores = model.predict([('Query1', 'Paragraph1'), ('Query2', 'Paragraph2')])

scores

array([-7.2216372, -6.009411 ], dtype=float32)

In [42]:
# Cross-Encoder example
query_doc_pairs = [
    ('How many people live in Berlin?', 'Berlin had a population of 3,520,031 registered inhabitants in an area of 891.82 square kilometers.'), 
    ('What is the size of New York?', 'New York City is famous for the Metropolitan Museum of Art.')
]
scores = model.predict(query_doc_pairs)
scores

array([ 9.219173 , -6.8903775], dtype=float32)
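
The scores above are unbounded: only their order matters for ranking. If a value between 0 and 1 is easier to read, a logistic sigmoid is a common choice. The sketch below assumes the model outputs raw logits and copies the two scores by hand; it does not call the model.

```python
import math

def sigmoid(x: float) -> float:
    """Logistic function: maps any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

# The two scores printed above, copied by hand.
raw_scores = [9.219173, -6.8903775]
probabilities = [sigmoid(s) for s in raw_scores]
print(probabilities)
```

Because the sigmoid is monotonic, applying it does not change the ranking.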

## Pipeline

Now, we build our two-stage pipeline: For a given query, the top-K most similar documents in the collection are retrieved using BM25. These K documents are then re-ranked using a Transformer model.
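
Before running the real thing, the control flow can be sketched on a toy collection. Both scoring functions below are hypothetical stand-ins: `cheap_score` plays the role of BM25 and `expensive_score` the role of the Transformer; neither reflects how those models actually score.

```python
def cheap_score(query: str, doc: str) -> int:
    """Stage 1 stand-in for BM25: number of query terms found in the document."""
    doc_terms = set(doc.lower().split())
    return sum(term in doc_terms for term in query.lower().split())

def expensive_score(query: str, doc: str) -> float:
    """Stage 2 stand-in for a neural re-ranker: overlap normalised by document length."""
    doc_terms = doc.lower().split()
    return cheap_score(query, doc) / len(doc_terms)

def two_stage_search(query, collection, k):
    # Stage 1: score every document cheaply and keep only the top-k candidates.
    candidates = sorted(
        collection, key=lambda d: cheap_score(query, collection[d]), reverse=True
    )[:k]
    # Stage 2: re-score only those candidates with the expensive model and re-order them.
    return sorted(
        candidates, key=lambda d: expensive_score(query, collection[d]), reverse=True
    )

toy_collection = {
    "d1": "chinese cuisine is the cuisine of china and includes many regional styles of cooking",
    "d2": "chinese cuisine",
    "d3": "the train station is closed",
}
reranked = two_stage_search("chinese cuisine", toy_collection, k=2)
print(reranked)
```

The expensive scorer only ever sees the `k` candidates, so a document dropped in stage 1 can never be recovered in stage 2.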

In [43]:
# Retrieve the top-K documents using BM25
K = 100
topk_bm25 = bm25 % K

stage1 = topk_bm25.transform(topics_df)
stage1

          qid  docid    docno  rank      score               query
0     1015979    205  1015979     0  21.082803  president of chile
1     1015979   2435   229754     1  18.819140  president of chile
2     1015979   2417  1186821     2  18.584047  president of chile
3     1015979    546  2226456     3  14.697990  president of chile
4     1015979    549  1514612     4  13.276190  president of chile
...       ...    ...      ...   ...        ...                 ...
2279     8438    619  1595351    28   6.600165     mexican cuisine
2280     8438    752  2454055    29   6.600165     mexican cuisine
2281     8438   2262  1365636    30   6.531712     mexican cuisine
2282     8438   2173   843418    31   6.420724     mexican cuisine


In [44]:
# Re-rank the documents using a Cross-Encoder
cross_run = []
model_name = 'cross'
for i in range(len(topics_df)):
    qid, query = topics_df.iloc[i]

    # Get top-K document ids for a given query
    qid_docnos = stage1[stage1['qid'] == qid]['docno']
    # Get top-K documents (docno and text in the same order) 
    qid_docnos = docs_df[docs_df['docno'].isin(qid_docnos)]['docno']
    qid_docs = docs_df[docs_df['docno'].isin(qid_docnos)]['text']

    # Concatenate the query and documents and predict the scores for the pairs [query, passage]
    model_inputs = [[query, doc] for doc in qid_docs]
    docno_inputs = [docno for docno in qid_docnos]
    scores = model.predict(model_inputs)

    # Sort the scores in decreasing order
    results = [{'input': inp, 'docno': docno, 'score': score} for inp, docno, score in zip(model_inputs, docno_inputs, scores)]
    results = sorted(results, key=lambda x: x['score'], reverse=True)

    # Save the results in TREC format
    for rank, hit in enumerate(results):
        docno = hit['docno']
        score = hit['score']
        row_str = f"{qid} 0 {docno} {rank} {score} {model_name}"
        cross_run.append(row_str)    
    
# Store ranking on disk in TREC format
with open(BASEDIR + 'lab6/' + f"outputs/{model_name}.run", "w") as f:
    for l in cross_run:
        f.write(l + "\n")
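
Each line of a TREC run file has six whitespace-separated columns: query id, an unused constant (conventionally `Q0`), document id, rank, score, and a run tag. The sketch below writes and reads such lines in plain Python. The ids are hypothetical, and `parse_run_lines` is an illustrative helper, not a library function; it produces the nested `{qid: {docno: score}}` shape that we assume the evaluator consumes.

```python
def format_run_line(qid, docno, rank, score, tag):
    """Six whitespace-separated columns: qid, an unused constant, docno, rank, score, run tag."""
    return f"{qid} Q0 {docno} {rank} {score} {tag}"

def parse_run_lines(lines):
    """Build {qid: {docno: score}}, ignoring the unused column, the rank, and the tag."""
    run = {}
    for line in lines:
        qid, _unused, docno, _rank, score, _tag = line.split()
        run.setdefault(qid, {})[docno] = float(score)
    return run

# Hypothetical query and document ids, for illustration only.
lines = [
    format_run_line("q1", "d1", 0, 9.2, "cross"),
    format_run_line("q1", "d2", 1, -6.9, "cross"),
]
parsed = parse_run_lines(lines)
print(lines[0])
print(parsed)
```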

## Evaluate

Finally, we evaluate the performance of our neural pipeline using `pytrec_eval` and compare it to the standard approaches we set up at the beginning of the lab.
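
To see what the reported numbers mean, here is a minimal sketch of precision@k and nDCG@k for a single query. It uses one common formulation (linear gains, log2 discount). We have not checked that it reproduces the exact values computed by `pytrec_eval`, and the judgments and ranking are hypothetical.

```python
import math

def precision_at_k(ranked_docnos, relevance, k):
    """Fraction of the top-k documents with a relevance label greater than zero."""
    hits = sum(1 for docno in ranked_docnos[:k] if relevance.get(docno, 0) > 0)
    return hits / k

def dcg_at_k(gains, k):
    """Discounted cumulative gain: the gain at 1-based rank r is divided by log2(r + 1)."""
    return sum(g / math.log2(r + 1) for r, g in enumerate(gains[:k], start=1))

def ndcg_at_k(ranked_docnos, relevance, k):
    """DCG of the ranking divided by the DCG of an ideal ordering of the judged documents."""
    gains = [relevance.get(docno, 0) for docno in ranked_docnos]
    ideal = sorted(relevance.values(), reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0

# Hypothetical judgments and ranking for a single query; "d4" is unjudged.
toy_qrels = {"d1": 2, "d2": 1, "d3": 0}
toy_ranking = ["d2", "d4", "d1"]

toy_p3 = precision_at_k(toy_ranking, toy_qrels, k=3)
toy_ndcg3 = ndcg_at_k(toy_ranking, toy_qrels, k=3)
print(toy_p3, toy_ndcg3)
```

Unjudged documents are treated as non-relevant here, and the per-query values are later averaged over all queries.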

In [45]:
# !pip install pytrec_eval

In [46]:
import pytrec_eval

# Load qrels in a dictionary
qrels_dict = dict()
for _, r in qrels_df.iterrows():
    qid, docno, label, iteration = r
    if qid not in qrels_dict:
        qrels_dict[qid] = dict()
    qrels_dict[qid][docno] = int(label)

# Build evaluator based on the qrels and metrics
metrics = {"ndcg_cut_5", "ndcg_cut_10", "P_5", "P_10"}
my_qrel = {q: d for q, d in qrels_dict.items()}
evaluator = pytrec_eval.RelevanceEvaluator(my_qrel, metrics)

In [47]:
# Load Cross-Encoder run
with open(BASEDIR + "lab6/" + "outputs/cross.run", 'r') as f_run:
    cross_run = pytrec_eval.parse_run(f_run)

In [48]:
# Evaluate Cross-Encoder model
cross_evals = evaluator.evaluate(cross_run)

# Compute performance in different metrics for each query
cross_metric2vals = {m: [] for m in metrics}
for q, d in cross_evals.items():
    for m, val in d.items():
        cross_metric2vals[m].append(val)

# Average results by query
cross_metric2avg = dict()
for m in metrics:
    val = pytrec_eval.compute_aggregated_measure(m, cross_metric2vals[m])
    cross_metric2avg[m] = val
    # print(m, '\t', val)

In [49]:
# Compare system performance
experiment = pt.Experiment(
    retr_systems=[tf, tfidf, bm25,],
    names=['TF', 'TF-IDF', 'BM25'],
    topics=topics_df,
    qrels=qrels_df,
    eval_metrics=metrics)

cross_metric2avg['name'] = 'BM25 >> Cross-Encoder'
# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result
pd.concat([experiment, pd.DataFrame([cross_metric2avg])], ignore_index=True)


                    name  ndcg_cut_10       P_5  ndcg_cut_5      P_10
0                     TF     0.845332  0.866667    0.861466  0.788889
1                 TF-IDF     0.842121  0.888889    0.878503  0.766667
2                   BM25     0.844729  0.911111    0.889389  0.766667
3  BM25 >> Cross-Encoder     0.917072  0.933333    0.918763  0.866667


In [51]:
# Re-rank the BM25 results with monoT5 (pyterrier_t5 plugin)
from pyterrier_t5 import MonoT5ReRanker
monoT5 = MonoT5ReRanker()
mono_pipeline = bm25 >> pt.text.get_text(index,'text') >> monoT5

print(index.getCollectionStatistics().toString())
experiment = pt.Experiment(
    retr_systems=[tf, tfidf, bm25,mono_pipeline],
    names=['TF', 'TF-IDF', 'BM25', 'monoT5'],
    topics=topics_df,
    qrels=qrels_df,
    eval_metrics=metrics)

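cross_metric2avg['name'] = 'BM25 >> Cross-Encoder'
# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result
pd.concat([experiment, pd.DataFrame([cross_metric2avg])], ignore_index=True)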




Number of documents: 2453
Number of terms: 23784
Number of postings: 208792
Number of fields: 2
Number of tokens: 280639
Field names: [title, text]
Positions:   false



monoT5: 100%|██████████| 571/571 [00:14<00:00, 39.06batches/s]


                    name  ndcg_cut_10       P_5  ndcg_cut_5      P_10
0                     TF     0.845332  0.866667    0.861466  0.788889
1                 TF-IDF     0.842121  0.888889    0.878503  0.766667
2                   BM25     0.844729  0.911111    0.889389  0.766667
3                 monoT5     0.869293  0.911111    0.915579  0.788889
4  BM25 >> Cross-Encoder     0.917072  0.933333    0.918763  0.866667
