# NIR 2022 - Lab 3: Evaluation Metrics

In Lab 2, we saw how to index a collection of documents and how to search the index with different systems in PyTerrier.
At the end of Lab 2, we also saw how to evaluate the performance of the different systems using standard metrics such as MAP and NDCG.

Today, we will take a closer look at standard evaluation metrics.
In particular, we will see how to use `pytrec_eval`, a Python library for evaluating systems on TREC-style data, whether or not you use PyTerrier.

## Systems Setup

We will start by building an index of our data collection and a few systems in PyTerrier.
This step is only required to obtain system outputs.

As we will see shortly, `pytrec_eval` only needs access to the systems' output files, which can be produced by any retrieval tool.

In [1]:
# Load the data
import pandas as pd

# corpus
docs_df = pd.read_csv('data/lab_docs.csv', dtype=str)
print(docs_df.shape)
print(docs_df.head())

# topics
topics_df = pd.read_csv('data/lab_topics.csv', dtype=str)
print(topics_df.shape)
print(topics_df.head())

(2453, 2)
     docno                                               text
0   935016  he emigrated to france with his family in 1956...
1  2360440  after being ambushed by the germans in novembe...
2   347765  she was the second ship named for captain alex...
3  1969335  world war ii was a global war that was under w...
4  1576938  the ship was ordered on 2 april 1942 laid down...
(9, 2)
       qid                 query
0  1015979    president of chile
1     2674    computer animation
2   340095  2020 summer olympics
3  1502917         train station
4     2574       chinese cuisine


In [2]:
# Init PyTerrier
import pyterrier as pt
if not pt.started():
    pt.init()

PyTerrier 0.8.1 has loaded Terrier 5.6 (built by craigmacdonald on 2021-09-17 13:27)

No etc/terrier.properties, using terrier.default.properties for bootstrap configuration.


In [3]:
# Build index
indexer = pt.DFIndexer("./indexes/default", overwrite=True, blocks=True)
index_ref = indexer.index(docs_df["text"], docs_df["docno"])
index = pt.IndexFactory.of(index_ref)
print(index.getCollectionStatistics().toString())

Number of documents: 2453
Number of terms: 23693
Number of postings: 208487
Number of fields: 0
Number of tokens: 273373
Field names: []
Positions:   true



In [4]:
# Build IR systems
tf = pt.BatchRetrieve(index, wmodel="Tf")
tfidf = pt.BatchRetrieve(index, wmodel="TF_IDF")
bm25 = pt.BatchRetrieve(index, wmodel="BM25")

## Search and Evaluate in PyTerrier

In PyTerrier, we can use `search()` to search for documents relevant to a given query.

In [5]:
# Search the index for a query using TF-IDF model
tfidf.search("black wall").head(10)

Unnamed: 0,qid,docid,docno,rank,score,query
0,1,1357,1221197,0,8.052834,black wall
1,1,203,1455841,1,6.727179,black wall
2,1,1172,679402,2,6.412557,black wall
3,1,2110,2391064,3,5.677738,black wall
4,1,1452,702865,4,5.161479,black wall
5,1,1845,1151865,5,4.991523,black wall
6,1,293,243238,6,4.941458,black wall
7,1,2028,163616,7,4.894899,black wall
8,1,592,692168,8,4.808818,black wall
9,1,1335,1872141,9,4.593691,black wall


We can also search for multiple queries at once by grouping them in a Pandas DataFrame and then using the `transform()` method.

In [6]:
# Search the index for multiple queries using TF-IDF model
queries = pd.DataFrame([["q1", "dragon"], ["q2", "wall"]], columns=["qid", "query"])
tfidf.transform(queries).head(10)

Unnamed: 0,qid,docid,docno,rank,score,query
0,q1,759,1782559,0,7.858059,dragon
1,q1,201,1076935,1,4.848641,dragon
2,q1,2194,654718,2,4.689944,dragon
3,q1,1383,1323966,3,4.656079,dragon
4,q1,1576,630588,4,4.639329,dragon
5,q1,26,1610206,5,4.573517,dragon
6,q2,1172,679402,0,6.412557,wall
7,q2,2110,2391064,1,5.677738,wall
8,q2,1452,702865,2,5.161479,wall
9,q2,1357,1221197,3,5.055549,wall


Finally, PyTerrier provides an interface for evaluating the performance of IR systems through the `Experiment` abstraction.
Behind the scenes, `pt.Experiment` uses the `pytrec_eval` library!

In [7]:
qrels_df = pd.read_csv('data/lab_qrels.csv', dtype=str)
qrels_df.head()

Unnamed: 0,qid,docno,label,iteration
0,1015979,1015979,2,0
1,1015979,2226456,1,0
2,1015979,1514612,1,0
3,1015979,1119171,1,0
4,1015979,1053174,1,0


In [8]:
topics_df.head()

Unnamed: 0,qid,query
0,1015979,president of chile
1,2674,computer animation
2,340095,2020 summer olympics
3,1502917,train station
4,2574,chinese cuisine


In [9]:
# Evaluate systems on the first three topics using the PyTerrier Experiment interface
qrels_df = qrels_df.astype({'label': 'int32'})
pt.Experiment(
    retr_systems=[tf, tfidf, bm25],
    names=['TF', 'TF-IDF', 'BM25'],
    topics=topics_df[:3],
    qrels=qrels_df,
    eval_metrics=["map", "ndcg", "ndcg_cut_10", "P_10"])

Unnamed: 0,name,map,ndcg,ndcg_cut_10,P_10
0,TF,0.727657,0.879601,0.943447,0.833333
1,TF-IDF,0.777422,0.881052,0.933542,0.833333
2,BM25,0.777422,0.881052,0.933542,0.833333
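
To make the metrics in the table above more concrete, here is a minimal pure-Python sketch of precision at k (`P_10` is the k=10 case) and average precision for a single query, assuming binary relevance. The function names and the toy ranking are hypothetical and only for illustration; `trec_eval` additionally handles graded labels, ties and missing queries. MAP is then the mean of the per-query average precision values.

```python
def precision_at_k(ranking, relevant, k):
    """Fraction of the top-k positions occupied by relevant documents."""
    top_k = ranking[:k]
    return sum(1 for doc in top_k if doc in relevant) / k


def average_precision(ranking, relevant):
    """Sum of precision values at the ranks of relevant documents,
    divided by the total number of relevant documents."""
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant)


# Toy data (hypothetical): d5 is relevant but never retrieved
ranking = ["d1", "d3", "d2", "d7", "d9"]
relevant = {"d1", "d2", "d5"}

p_at_3 = precision_at_k(ranking, relevant, 3)
ap = average_precision(ranking, relevant)
print(p_at_3, ap)
```

Note that the relevant document that is never retrieved (`d5`) still counts in the denominator of average precision, which is why missing relevant documents lower the score.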


## Transformers & Operators

You'll have noted that BatchRetrieve has a `transform()` method that takes as input a dataframe, and returns another dataframe, which is somehow a *transformation* of the earlier dataframe (e.g., a retrieval transformation). In fact, `BatchRetrieve` is just one of many similar objects in PyTerrier, which we call [transformers](https://pyterrier.readthedocs.io/en/latest/transformer.html) (represented by the `TransformerBase` class).

Let's take a look at a `BatchRetrieve` transformer, starting with one for the TF_IDF weighting model.

In [10]:
# check tfidf is a transformer...
print(isinstance(tfidf, pt.transformer.TransformerBase))

True


In [11]:
# this prints the type hierarchy (method resolution order) of the class of `tfidf`
tfidf.__class__.__mro__

(pyterrier.batchretrieve.BatchRetrieve,
 pyterrier.batchretrieve.BatchRetrieveBase,
 pyterrier.transformer.TransformerBase,
 pyterrier.transformer.Transformer,
 matchpy.expressions.expressions.Symbol,
 matchpy.expressions.expressions.Atom,
 matchpy.expressions.expressions.Expression,
 object)

The interesting capability of all transformers is that they can be combined using Python operators (this is called operator overloading).

Concretely, imagine that you want to chain transformers together, e.g. rank documents first by Tf and then re-rank the exact same documents by TF_IDF. We can do this using the `>>` operator, which we call composition, or "then".
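
To illustrate what operator overloading means here, the following is a toy sketch in plain Python. It is not PyTerrier's implementation: the `Step` class and the two example stages are hypothetical, and they operate on lists of numbers rather than dataframes. The point is only that defining `__rshift__` lets `a >> b` build a new object that runs `a` first and then passes its output to `b`.

```python
class Step:
    """A toy pipeline stage wrapping a function from list to list."""

    def __init__(self, fn):
        self.fn = fn

    def __call__(self, items):
        return self.fn(items)

    def __rshift__(self, other):
        # `a >> b` returns a new Step that applies a, then feeds the result to b
        return Step(lambda items: other(self(items)))


# Two hypothetical stages: a filter and a re-ordering
keep_even = Step(lambda xs: [x for x in xs if x % 2 == 0])
sort_desc = Step(lambda xs: sorted(xs, reverse=True))

toy_pipeline = keep_even >> sort_desc
result = toy_pipeline([3, 8, 1, 4, 6])
print(result)  # [8, 6, 4]
```

The second stage only ever sees what the first stage returned, which mirrors the "rank, then re-rank the same documents" behaviour described above.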

In [12]:
# now let's define a pipeline
pipeline = tf >> tfidf
# the composed pipeline is itself a transformer
print(isinstance(pipeline, pt.transformer.TransformerBase))

True


In [13]:
print(tf.search("black wall"))
print(pipeline.search("black wall"))

   qid  docid    docno  rank  score       query
0    1   1172   679402     0    5.0  black wall
1    1   2028   163616     1    4.0  black wall
2    1   1335  1872141     2    3.0  black wall
3    1   1357  1221197     3    3.0  black wall
4    1   2110  2391064     4    3.0  black wall
..  ..    ...      ...   ...    ...         ...
79   1   2166   430722    79    1.0  black wall
80   1   2290    86366    80    1.0  black wall
81   1   2305   993780    81    1.0  black wall
82   1   2337   427183    82    1.0  black wall
83   1   2414  2177292    83    1.0  black wall

[84 rows x 6 columns]
   qid  docid    docno  rank     score       query
0    1   1357  1221197     0  8.052834  black wall
1    1    203  1455841     1  6.727179  black wall
2    1   1172   679402     2  6.412557  black wall
3    1   2110  2391064     3  5.677738  black wall
4    1   1452   702865     4  5.161479  black wall
..  ..    ...      ...   ...       ...         ...
79   1   1142   659277    79  2.787666  blac

## Practice Task – Pipeline Construction

Create a ranker that performs the following:
 - obtains the top 10 highest scoring documents by term frequency (`wmodel="Tf"`)
 - obtains the top 10 highest scoring documents by TF.IDF (`wmodel="TF_IDF"`)
 - reranks only those documents found in BOTH of the previous retrieval settings using BM25.

How many documents are retrieved by this full pipeline for the query `"black wall"`?

If you obtain the correct solution, the document with docid `'1357'` should have a score of 14.5976.

In [14]:
# Todo


### Saving system outputs

We now save the output of each system for every topic to disk so we can later evaluate it with `pytrec_eval`.

In [15]:
topics_df

Unnamed: 0,qid,query
0,1015979,president of chile
1,2674,computer animation
2,340095,2020 summer olympics
3,1502917,train station
4,2574,chinese cuisine
5,14082,world war ii
6,1250390,painting
7,5597,house
8,8438,mexican cuisine


In [16]:
# -p avoids an error if the directory already exists
!mkdir -p outputs



In [17]:
# Save system rankings in TREC format
# qid Q0 docno rank score tag
def save_run(system, tag, path):
    """Run every topic through `system` and write the rankings to `path` in TREC format."""
    run = []
    for _, row in topics_df.iterrows():
        qid, query = row
        res_df = system.search(query)
        for _, res_row in res_df.iterrows():
            _, docid, docno, rank, score, _ = res_row
            # write the literal "Q0" and the system's own tag, as in the format above
            run.append(f"{qid} Q0 {docno} {rank} {score} {tag}")
    with open(path, "w") as f:
        for line in run:
            f.write(line + "\n")
    return run

tf_run = save_run(tf, "tf", "outputs/tf.run")
tfidf_run = save_run(tfidf, "tfidf", "outputs/tfidf.run")
bm25_run = save_run(bm25, "bm25", "outputs/bm25.run")

bm25_run[0]

'1015979 Q0 1015979 0 20.927815031462014 bm25'

## pytrec_eval

[pytrec_eval](https://github.com/cvangysel/pytrec_eval) is a Python interface to TREC's evaluation tool [`trec_eval`](https://github.com/usnistgov/trec_eval).
You can install it as follows.

In [18]:
!pip install pytrec_eval


`pytrec_eval` works with three inputs:
- qrel: a dictionary mapping each query id to its judged documents and their relevance labels. For example:
```python
qrel = {
    'q1': {'d1': 0, 'd2': 1, 'd3': 0},
    'q2': {'d2': 1, 'd3': 1},
}
```
- metrics: a set of standard metrics to be used to assess your system. See [here](http://www.rafaelglater.com/en/post/learn-how-to-use-trec_eval-to-evaluate-your-information-retrieval-system) for a list of available metrics.
- run: similar to `qrel`, this is a dictionary for a given run which maps each query id to the retrieved documents and their scores. For example:
```python
run = {
    'q1': {'d1': 1.0, 'd2': 0.0, 'd3': 1.5},
    'q2': {'d1': 1.5, 'd2': 0.2, 'd3': 0.5}
}
```
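
To show how the TREC-format lines saved earlier (`qid Q0 docno rank score tag`) relate to this `run` dictionary, here is a minimal sketch of a parser written by hand. The function `parse_run_lines` and the example lines are hypothetical; in practice `pytrec_eval.parse_run()` does this job. Only the query id, document id and score are kept, on the assumption that the evaluation orders documents by score rather than by the rank column.

```python
def parse_run_lines(lines):
    """Turn 'qid Q0 docno rank score tag' lines into {qid: {docno: score}}."""
    run = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        qid, _q0, docno, _rank, score, _tag = line.split()
        run.setdefault(qid, {})[docno] = float(score)
    return run


# Hypothetical run lines for two queries
example_lines = [
    "q1 Q0 d3 0 1.5 toy",
    "q1 Q0 d1 1 1.0 toy",
    "q2 Q0 d1 0 1.5 toy",
]
toy_run = parse_run_lines(example_lines)
print(toy_run)
```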

In [19]:
# Load qrels
qrels_df = pd.read_csv('data/lab_qrels.csv', dtype=str)
print(qrels_df.shape)
print(qrels_df.head())

qrels_dict = dict()
for _, r in qrels_df.iterrows():
    qid, docno, label, iteration = r
    if qid not in qrels_dict:
        qrels_dict[qid] = dict()
    qrels_dict[qid][docno] = int(label)

(2454, 4)
       qid    docno label iteration
0  1015979  1015979     2         0
1  1015979  2226456     1         0
2  1015979  1514612     1         0
3  1015979  1119171     1         0
4  1015979  1053174     1         0


Check out `pytrec_eval.parse_qrel()` to quickly load qrels files in TREC format (as in your project).

In [20]:
import pytrec_eval

In [21]:
# Build evaluator based on the qrels and metrics
metrics = {"map", "ndcg", "ndcg_cut_10", "P_10"}
my_qrel = {q: d for q, d in qrels_dict.items() if q in {'1015979', '2674', '340095'}}  # let's evaluate the first 3 topics to compare with PyTerrier above
evaluator = pytrec_eval.RelevanceEvaluator(my_qrel, metrics)

In [26]:
# Load run
with open("outputs/tf.run", 'r') as f_run:
    tf_run = pytrec_eval.parse_run(f_run)

In [23]:
# Evaluate tf model
tf_evals = evaluator.evaluate(tf_run)
tf_evals

{'1015979': {'map': 0.6884429327286471,
  'P_10': 0.5,
  'ndcg': 0.9080664916757495,
  'ndcg_cut_10': 0.830339843386306},
 '2674': {'map': 0.5050126570739595,
  'P_10': 1.0,
  'ndcg': 0.7326486367413882,
  'ndcg_cut_10': 1.0},
 '340095': {'map': 0.9895163758800121,
  'P_10': 1.0,
  'ndcg': 0.9980874339288609,
  'ndcg_cut_10': 1.0}}
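
For intuition about the `ndcg` values above, here is a hedged sketch of NDCG for a single query. It assumes the gain is the relevance label itself and the discount is `log2(rank + 1)`, with the ideal ranking built from all judged labels; the function names and toy judgments are hypothetical, and other gain or discount choices exist.

```python
import math


def dcg(labels, k=None):
    """Discounted cumulative gain of relevance labels given in rank order."""
    if k is not None:
        labels = labels[:k]
    return sum(label / math.log2(rank + 1) for rank, label in enumerate(labels, start=1))


def ndcg(ranking, judged, k=None):
    """DCG of the ranking divided by the DCG of the ideal ordering of the judged labels."""
    gains = [judged.get(doc, 0) for doc in ranking]  # unjudged documents count as 0
    ideal = sorted(judged.values(), reverse=True)
    ideal_dcg = dcg(ideal, k)
    return dcg(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0


# Toy graded judgments (hypothetical)
judged = {"d1": 2, "d2": 1, "d3": 0}
perfect = ndcg(["d1", "d2", "d3"], judged)
swapped = ndcg(["d2", "d1", "d3"], judged)
print(perfect, swapped)
```

Swapping the two relevant documents lowers the score because the highly relevant document receives a larger discount at rank 2.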

In [24]:
tf_metric2vals = {m: [] for m in metrics}
for q, d in tf_evals.items():
    for m, val in d.items():
        tf_metric2vals[m].append(val)

In [25]:
# Compute average across topics
for m in metrics:
    print(m, '\t', pytrec_eval.compute_aggregated_measure(m, tf_metric2vals[m]))

ndcg_cut_10 	 0.943446614462102
map 	 0.7276573218942062
P_10 	 0.8333333333333334
ndcg 	 0.879600854115333
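
For the metrics used here, the aggregated value is the arithmetic mean of the per-topic scores. As a quick sanity check, the sketch below recomputes two of the averages with the standard library only, using the per-topic `map` and `P_10` values printed earlier for the TF model; the variable names are hypothetical.

```python
from statistics import fmean

# Per-topic scores copied from the TF evaluation output above
per_topic = {
    "1015979": {"map": 0.6884429327286471, "P_10": 0.5},
    "2674": {"map": 0.5050126570739595, "P_10": 1.0},
    "340095": {"map": 0.9895163758800121, "P_10": 1.0},
}

# Average each metric across topics
averages = {
    metric: fmean([scores[metric] for scores in per_topic.values()])
    for metric in ("map", "P_10")
}
print(averages)
```

The results agree with the `map` and `P_10` values reported by both `pytrec_eval` and `pt.Experiment` for the TF system.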
