# NIR 2022 - Lab 2: Introduction to PyTerrier

## 0: Libraries for Indexing and Ranking

There exist several software libraries that can help you develop a search engine.
They usually provide highly optimized indexes, scoring algorithms (e.g. BM25) and other features such as text pre-processing.

Popular libraries include:
- [INDRI](https://www.lemurproject.org/indri/)
- [Elasticsearch](https://github.com/elastic/elasticsearch)
- [Anserini](https://github.com/castorini/anserini)
- [Terrier](https://github.com/terrier-org/terrier-core)
- [Whoosh](https://pypi.org/project/Whoosh/)
- [BEIR](https://github.com/beir-cellar/beir.git)

As Python is becoming the standard programming language in Data Science and Deep Learning, Python interfaces have been added to several libraries:
- [Elasticsearch-py](https://github.com/elastic/elasticsearch-py) for Elasticsearch
- [Pyserini](https://github.com/castorini/pyserini/) for Anserini
- [PyTerrier](https://github.com/terrier-org/pyterrier) for Terrier


The role of these Python interfaces is to translate your Python calls into operations on the underlying library or service.
For example, PyTerrier is a Python layer above Terrier:
![PyTerrier](figures/PyTerrier.png "PyTerrier")

## 1: PyTerrier

In our first labs, we will be using PyTerrier, a new Python framework that makes it easy to perform information retrieval experiments in Python while using the Java-based Terrier platform for the expensive indexing and retrieval operations.

In particular, the material of the lab is heavily based on the [ECIR 2021 tutorial](https://github.com/terrier-org/ecir2021tutorial) on the PyTerrier and [OpenNIR](https://github.com/Georgetown-IR-Lab/OpenNIR) search toolkits.

Another useful resource is the [PyTerrier documentation](https://pyterrier.readthedocs.io/_/downloads/en/latest/pdf/).

NB. You can choose any library you prefer to develop your final project. 
Our labs aim to provide guidance on applying the theoretical content of the lectures in practice.
As such, we only use one library and a small dataset during the lab sessions.

<div class="alert alert-warning">
PyTerrier does not yet easily install on Windows.

If you do not have a Linux or macOS device, one of the easiest options is to use <a href="https://colab.research.google.com/">Google Colab</a>.

Get in touch with the TA if you need help setting up Google Colab!

</div>

## 2: Pre-requisites

PyTerrier requires:
- Python 3.6 or newer
    + You can check your Python version by running `python --version` or `python3 --version` on the terminal
    + You can download a newer version from [here](https://www.python.org/downloads/)
    + Once you have a valid Python version installed, make sure your Jupyter notebook is using it (`Kernel -> Change Kernel`)
- Java 11 or newer
    + You can check your Java version by running `java --version` (or `java -version` on Java 8 and older) on the terminal
    + You can download and install Java 11 from [JDK 11](https://www.oracle.com/java/technologies/javase-jdk11-downloads.html) or [OpenJDK 11](http://jdk.java.net/archive/). Several tutorials exist to help you in this task, such as [this one for Linux](https://computingforgeeks.com/how-to-install-java-11-on-ubuntu-debian-linux/) and [this one for macOS](https://mkyong.com/java/how-to-install-java-on-mac-osx/)

## 3: Installation

PyTerrier can be easily installed from the terminal using pip:
```bash
pip install python-terrier
```

In [1]:
# !pip install python-terrier

## 4: Configuration

To use PyTerrier, we need to both import it and initialize it.
The initialization with the `init()` method makes PyTerrier download Terrier's JAR file as well as start the Java virtual machine.
To avoid calling `init()` more than once, we can check whether PyTerrier has already been initialized through the `started()` method.

In [2]:
import pyterrier as pt
if not pt.started():
    pt.init()

PyTerrier 0.8.1 has loaded Terrier 5.6 (built by craigmacdonald on 2021-09-17 13:27)

No etc/terrier.properties, using terrier.default.properties for bootstrap configuration.


## 5: Data

In our labs, we will be using a subset of the small version of the [WikIR](https://www.aclweb.org/anthology/2020.lrec-1.237.pdf) dataset for English.

The data is located inside the `data/` folder, and consists of:
- `lab_docs.csv`: CSV file of document number and document text
- `lab_topics.csv`: CSV file of query id and query text
- `lab_qrels.csv`: CSV file of annotations with schema `qid, docno, label, iteration`

In [3]:
import pandas as pd

In [4]:
docs_df = pd.read_csv('data/lab_docs.csv', dtype=str)
print(docs_df.shape)
docs_df.head()

(2453, 2)


Unnamed: 0,docno,text
0,935016,he emigrated to france with his family in 1956...
1,2360440,after being ambushed by the germans in novembe...
2,347765,she was the second ship named for captain alex...
3,1969335,world war ii was a global war that was under w...
4,1576938,the ship was ordered on 2 april 1942 laid down...


In [5]:
topics_df = pd.read_csv('data/lab_topics.csv', dtype=str)
print(topics_df.shape)
topics_df.head()

(9, 2)


Unnamed: 0,qid,query
0,1015979,president of chile
1,2674,computer animation
2,340095,2020 summer olympics
3,1502917,train station
4,2574,chinese cuisine


In [6]:
qrels_df = pd.read_csv('data/lab_qrels.csv', dtype=str)
print(qrels_df.shape)
qrels_df.head()

(2454, 4)


Unnamed: 0,qid,docno,label,iteration
0,1015979,1015979,2,0
1,1015979,2226456,1,0
2,1015979,1514612,1,0
3,1015979,1119171,1,0
4,1015979,1053174,1,0


In [7]:
title_df = pd.read_csv("data/lab_titles", dtype=str, names=["title"])
print(title_df.shape)
title_df.head()


(2453, 1)


Unnamed: 0,title
0,he emigrated to france with
1,after being ambushed by the germ
2,she was the second ship named for
3,world war ii was a global war
4,the ship was ordered on 2 apr


In [8]:
doc_title_df = pd.concat([docs_df, title_df], axis=1)
print(doc_title_df.shape)
doc_title_df.head()

# Uncomment to export the tab-separated file (no header, no index) that generate_dataset() reads:
# doc_title_df.to_csv("data/lab_docs_title.csv", index=None, header=None, sep='\t')

(2453, 3)


Unnamed: 0,docno,text,title
0,935016,he emigrated to france with his family in 1956...,he emigrated to france with
1,2360440,after being ambushed by the germans in novembe...,after being ambushed by the germ
2,347765,she was the second ship named for captain alex...,she was the second ship named for
3,1969335,world war ii was a global war that was under w...,world war ii was a global war
4,1576938,the ship was ordered on 2 april 1942 laid down...,the ship was ordered on 2 apr


## 6: Indexing and Indexes

To perform the task of retrieving relevant documents for a given query, a search engine needs to know which documents are available and index them to efficiently retrieve them.

In PyTerrier, we can create an index from a pandas DataFrame with the `DFIndexer` class.
The index, with all its data structures, is written into a directory (e.g. `indexes/default`).

Note that `DFIndexer` is just an example suited to this small dataset. For a large dataset, it is better to use `IterDictIndexer`, which consumes documents one at a time from an iterator of dictionaries.
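
To see why an iterator-based indexer suits large collections, here is a minimal sketch (standard library only, no PyTerrier) of a generator that yields one document dictionary at a time from tab-separated text. The helper name `iter_docs` and the in-memory sample are illustrative assumptions; the column order (docno, text, title) is assumed to match the exported file used in this lab.

```python
import csv
import io

def iter_docs(handle):
    """Yield one {"docno", "title", "text"} dict per tab-separated row."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for docno, text, title in reader:
        yield {"docno": docno, "title": title, "text": text}

# Hypothetical two-row sample standing in for a large file on disk
sample = io.StringIO("1\tfirst document text\tfirst document\n2\tsecond one\tsecond\n")
docs = iter_docs(sample)
first = next(docs)   # rows are produced on demand, not all at once
rest = list(docs)
print(first)
print(len(rest))
```

Because rows are produced on demand, the consumer never needs the whole collection in memory, which is the property an indexer can exploit.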

In [9]:
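def generate_dataset():
    # Stream the tab-separated file one document at a time (columns: docno, text, title)
    with open('data/lab_docs_title.csv', 'r') as fin:
        for line in fin:
            # Strip the trailing newline so it does not end up in the last field
            fields = line.rstrip('\n').split('\t')
            docno = fields[0]
            text = fields[1]
            title = fields[2]

            yield {"docno": docno, "title": title, "text": text}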
            

In [10]:
iter_indexer = pt.IterDictIndexer(
    "./indexes/iterindex",
    overwrite=True,
    meta=["docno", "title", "text"],
    meta_lengths=[20, 100, 4096],
)

indexref = iter_indexer.index(generate_dataset(), fields=["title", "text"])
index = pt.IndexFactory.of("./indexes/iterindex")
print(index.getCollectionStatistics())

Number of documents: 2453
Number of terms: 23784
Number of postings: 208792
Number of fields: 2
Number of tokens: 280639
Field names: [title, text]
Positions:   false



In [11]:
pipe1 = pt.BatchRetrieve(index, metadata=["docno","title",'text'])
pipe1.search("wall").head()

Unnamed: 0,qid,docid,docno,title,text,rank,score,query
0,1,1172,679402,prior to the construction of the,prior to the construction of the berlin wall i...,0,6.756589,wall
1,1,2110,2391064,in 2013 the rifle range which was,in 2013 the rifle range which was constructed ...,1,5.92444,wall
2,1,1452,702865,it was one of a number of,it was one of a number of highly experimental ...,2,5.232783,wall
3,1,1915,2242428,the monument is placed in the wall,the monument is placed in the wall surrounding...,3,5.176962,wall
4,1,1357,1221197,he was inspired to climb during a,he was inspired to climb during a cycling holi...,4,5.170168,wall


In [12]:
# The same kind of index built with DFIndexer, passing the whole DataFrame as metadata
indexer = pt.DFIndexer("./indexes/default", overwrite=True)
indexer.setProperty("FieldTags.process", "text,title")
index_ref = indexer.index(doc_title_df['text'], doc_title_df)
index = pt.IndexFactory.of("./indexes/default")
print(index.getCollectionStatistics())


Number of documents: 2453
Number of terms: 23693
Number of postings: 208487
Number of fields: 2
Number of tokens: 273373
Field names: [text, title]
Positions:   false



In [13]:
pipe1 = pt.BatchRetrieve(index, metadata=["docno","title",'text'])
pipe1.search("wall").head()

Unnamed: 0,qid,docid,docno,title,text,rank,score,query
0,1,1172,679402,prior to the construction of the,prior to the construction of the berlin wall i...,0,6.766137,wall
1,1,2110,2391064,in 2013 the rifle range which was,in 2013 the rifle range which was constructed ...,1,5.944105,wall
2,1,1452,702865,it was one of a number of,it was one of a number of highly experimental ...,2,5.241024,wall
3,1,1357,1221197,he was inspired to climb during a,he was inspired to climb during a cycling holi...,3,5.184876,wall
4,1,1845,1151865,designed in the shape of a five,designed in the shape of a five pointed americ...,4,5.151107,wall


In [14]:
indexer = pt.DFIndexer("./indexes/default", overwrite=True)
index_ref = indexer.index(docs_df["text"], docs_df["docno"])
index = pt.IndexFactory.of("./indexes/default")
print(index.getCollectionStatistics())

Number of documents: 2453
Number of terms: 23693
Number of postings: 208487
Number of fields: 2
Number of tokens: 273373
Field names: [text, title]
Positions:   false



The returned `IndexRef` is basically a string saying where an index is stored.
A PyTerrier index contains several files:

In [15]:
!ls -lh indexes/default

total 808K
-rw-r--r-- 1 wzm289 users 322K Aug 12 09:38 data.direct.bf
-rw-r--r-- 1 wzm289 users  60K Aug 12 09:38 data.document.fsarrayfile
-rw-r--r-- 1 wzm289 users 332K Aug 12 09:38 data.inverted.bf
-rw-r--r-- 1 wzm289 users 2.2M Aug 12 09:38 data.lexicon.fsomapfile
-rw-r--r-- 1 wzm289 users 1017 Aug 12 09:38 data.lexicon.fsomaphash
-rw-r--r-- 1 wzm289 users  93K Aug 12 09:38 data.lexicon.fsomapid
-rw-r--r-- 1 wzm289 users  63K Aug 12 09:38 data.meta-0.fsomapfile
-rw-r--r-- 1 wzm289 users  20K Aug 12 09:38 data.meta.idx
-rw-r--r-- 1 wzm289 users  55K Aug 12 09:38 data.meta.zdata
-rw-r--r-- 1 wzm289 users 4.3K Aug 12 09:38 data.properties


These files represent several data structures:
- Lexicon: Records the list of all unique terms and their statistics
- Document index: Records the statistics of all documents (e.g. document length)
- Inverted index: Records the mapping between terms and documents
- Meta index: Records document metadata (e.g. document number, URL, raw text, etc., as passed through `indexer.index()`)
- Direct index: Records terms for each document
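
As a rough mental model of these structures, the sketch below builds toy versions of them with plain Python containers. It is not how Terrier stores them on disk; the documents and all variable names are made up for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical, already pre-processed documents (docno -> list of terms)
toy_docs = {
    "d1": ["berlin", "wall", "wall"],
    "d2": ["great", "wall"],
    "d3": ["berlin", "airport"],
}

docnos = list(toy_docs)                 # meta index: docid -> docno
doc_lengths = []                        # document index: docid -> length
direct = []                             # direct index: docid -> {term: tf}
inverted = defaultdict(list)            # inverted index: term -> [(docid, tf)]

for docid, docno in enumerate(docnos):
    counts = Counter(toy_docs[docno])
    doc_lengths.append(sum(counts.values()))
    direct.append(dict(counts))
    for term, tf in counts.items():
        inverted[term].append((docid, tf))

# Lexicon: per-term statistics (document frequency Nt, collection frequency TF)
lexicon = {
    term: {"Nt": len(postings), "TF": sum(tf for _, tf in postings)}
    for term, postings in sorted(inverted.items())
}

print(lexicon["wall"])
print(inverted["wall"])
print(docnos[1], doc_lengths[1])
```

Note how the inverted index only stores internal integer ids; the external document number is recovered through the meta structure (`docnos` here).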

Once we have an `IndexRef`, we can load it into an actual index:

In [16]:
index = pt.IndexFactory.of(index_ref)

# Let's see what type index is
type(index)

jnius.reflect.org.terrier.structures.Index

Ok, so this object refers to Terrier's [`Index`](http://terrier.org/docs/current/javadoc/org/terrier/structures/Index.html) type. 

Looking at the linked Javadoc, we can see that this Java object has methods such as:
 - `getCollectionStatistics()`
 - `getInvertedIndex()`
 - `getLexicon()`

Let's see what is returned by the `getCollectionStatistics()` method:

In [17]:
print(index.getCollectionStatistics().toString())

Number of documents: 2453
Number of terms: 23693
Number of postings: 208487
Number of fields: 2
Number of tokens: 273373
Field names: [text, title]
Positions:   false



In [18]:
index.getEnd()
index.getStart()

0

In [19]:
di = index.getDirectIndex()
doi = index.getDocumentIndex()
lex = index.getLexicon()
docid = 0

# Walk the direct index: print each (pre-processed) term of document 0 with its frequency
for posting in di.getPostings(doi.getDocumentEntry(docid)):
    termid = posting.getId()
    lee = lex.getLexiconEntry(termid)
    print(lee.getKey(), posting.getFrequency())

come 1
semin 1
nolt 1
bush 1
protect 2
pa 1
cloud 1
cole 1
custom 2
detain 2
entitl 2
franc 2
deport 1
cbp 1
politiqu 1
current 1
countri 1
agent 2
director 1
22 1
1956 1
georg 1
rieur 1
june 1
inspir 1
notabl 1
februari 1
serv 1
vergangenheit 1
book 1
ne 1
antisemit 1
hour 1
doesn 1
10 1
articl 1
talk 2
2017 1
stipend 1
scientif 1
texa 1
enter 1
nearli 1
coin 1
ernst 1
de 2
want 1
normal 1
work 1
centr 1
era 1
qui 1
phrase 1
vergehen 1
saint 1
commonli 1
french 1
cnr 1
due 1
studi 1
vichi 2
rousso 3
give 1
includ 1
egypt 1
express 1
be 1
nicht 1
institut 1
etud 1
awai 1
famili 1
airport 1
arriv 1
past 2
pass 4
involv 1
visa 1
univers 1
border 2
6 1
1987 1
nation 1
houston 1
possibl 1
intercontinent 1
emigr 1
pari 1
sorbonn 1
1986 1
paid 1
tourist 1
1981 1
rise 1
will 1
die 2
syndrom 1
research 2


In [20]:
index.getMetaIndex().getAllItems(2)

['347765']

### Lexicon

What is our vocabulary of terms?

This is the [Lexicon](http://terrier.org/docs/current/javadoc/org/terrier/structures/Lexicon.html), which can be iterated easily from Python:

In [21]:
ix_range = range(1900, 1905)
for ix, kv in enumerate(index.getLexicon()):
    if ix in ix_range:
        print(f"{kv.getKey()} -> {kv.getValue().toString()}")
    elif ix > ix_range[-1]:
        break

agenda -> term9700 Nt=3 TF=3 maxTF=1 @{0 35867 5} TFf=0,0
agent -> term17 Nt=40 TF=51 maxTF=3 @{0 35875 1} TFf=0,0
aggi -> term9651 Nt=1 TF=1 maxTF=1 @{0 35942 6} TFf=0,0
aggrav -> term18279 Nt=2 TF=2 maxTF=1 @{0 35945 4} TFf=0,0
aggreg -> term9189 Nt=3 TF=3 maxTF=1 @{0 35951 2} TFf=0,0


Here, iterating over the Lexicon returns pairs of a String term and a [LexiconEntry](http://terrier.org/docs/current/javadoc/org/terrier/structures/LexiconEntry.html) object (which is itself an [EntryStatistics](http://terrier.org/docs/current/javadoc/org/terrier/structures/EntryStatistics.html)). The entry contains information including the statistics of that term:
- `Nt` is the number of unique documents that the term occurs in (this is useful for calculating IDF)
- `TF` is the total number of occurrences of the term in the collection; some weighting models use this instead of `Nt`
- The numbers in the `@{}` are pointers for Terrier to find that term in the inverted index
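
To illustrate why `Nt` matters, the sketch below turns the document frequencies printed above into an inverse document frequency. The plain `log(N / Nt)` formula is only one of several IDF variants and is an assumption here, not necessarily what a given Terrier weighting model computes.

```python
import math

N = 2453  # number of documents, from the collection statistics above

# Document frequencies (Nt) copied from the lexicon output above
nt = {"agenda": 3, "agent": 40}

def idf(n_t, n_docs=N):
    """Plain log(N / Nt) inverse document frequency (one of several variants)."""
    return math.log(n_docs / n_t)

for term, n_t in nt.items():
    print(f"{term}: Nt={n_t} idf={idf(n_t):.3f}")
```

The rarer term ("agenda") receives the larger weight, which is the intuition behind IDF.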

### Inverted Index

The inverted index tells us in which _documents_ each term occurs.

The LexiconEntry is also the pointer to find the postings (i.e. occurrences) for that term in the inverted index.

In [22]:
pointer = index.getLexicon()["agenda"]
for posting in index.getInvertedIndex().getPostings(pointer):
    print(f"{posting.toString()} doclen={posting.getDocumentLength()}")

(520,1,F[0,0]) doclen=117
(709,1,F[0,0]) doclen=112
(1052,1,F[0,0]) doclen=107


Ok, so we can see that `"agenda"` occurs once in documents with ids 520, 709 and 1052.

Note that these are internal document ids of Terrier.
We can find out which documents these are (i.e. the string "docno" in the corpus DataFrame) through the meta index:

In [23]:
meta = index.getMetaIndex()
pointer = index.getLexicon()["agenda"]
for posting in index.getInvertedIndex().getPostings(pointer):
    docno = meta.getItem("docno", posting.getId())
    print(f"{posting.toString()} doclen={posting.getDocumentLength()} docno={docno}")

(520,1,F[0,0]) doclen=117 docno=254370
(709,1,F[0,0]) doclen=112 docno=626787
(1052,1,F[0,0]) doclen=107 docno=305924


### Text Pre-processing

Looking at the terms in the Lexicon, do you think the index applied any text pre-processing?

What happens if we look up a very frequent term?

In [24]:
index.getLexicon()["agenda"].toString()

'term9700 Nt=3 TF=3 maxTF=1 @{0 35867 5} TFf=0,0'

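Indeed, Terrier removes standard stopwords and applies Porter's stemmer by default: the lexicon stores stems such as `aggrav` and `aggreg` rather than full words, and stopwords such as "the" have no entry at all.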
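
The effect of such a pipeline can be sketched in a few lines of plain Python. This is only a toy: the stopword list is a tiny made-up sample, and `crude_stem` is a hypothetical suffix stripper, not Porter's algorithm.

```python
import re

# Tiny illustrative stopword list; Terrier's real list is much longer
STOPWORDS = {"the", "of", "to", "a", "in", "was", "by"}

def crude_stem(token):
    """Strip a few common suffixes. This is NOT Porter's algorithm."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def term_pipeline(text):
    tokens = re.findall(r"[a-z0-9]+", text.lower())       # tokenise
    kept = [t for t in tokens if t not in STOPWORDS]       # remove stopwords
    return [crude_stem(t) for t in kept]                   # stem

print(term_pipeline("The agents were detained by the border guards"))
```

The same pipeline has to be applied to queries at retrieval time, otherwise query terms would not match the stored stems.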

### Index Variants

We can modify the pre-processing transformations applied by Terrier when creating an index by changing its `termpipelines` property.

In [25]:
# No pre-processing
# todo

In [26]:
# Stopwords removal
# todo

See the [org.terrier.terms](http://terrier.org/docs/current/javadoc/org/terrier/terms/package-summary.html) package for a list of the available term pipeline objects provided by Terrier.

Similarly, tokenization is controlled by the _“tokeniser”_ property. For example:
```python
indexer.setProperty("tokeniser", "UTFTokeniser")
```

[EnglishTokeniser](http://terrier.org/docs/current/javadoc/org/terrier/indexing/tokenisation/EnglishTokeniser.html) is the default tokeniser. Other tokenisers are listed in the [org.terrier.indexing.tokenisation](http://terrier.org/docs/current/javadoc/org/terrier/indexing/tokenisation/package-summary.html) package.

Finally, we can also use the `blocks=True` argument for the index to store position information of every term in each document:

In [27]:
# Store position information (blocks=True)

indexer = pt.DFIndexer("./indexes/position", overwrite=True, blocks=True)
index_ref = indexer.index(docs_df["text"], docs_df["docno"])
index = pt.IndexFactory.of(index_ref)
print(index.getCollectionStatistics().toString())


Number of documents: 2453
Number of terms: 23693
Number of postings: 208487
Number of fields: 2
Number of tokens: 273373
Field names: [text, title]
Positions:   true



In [28]:
meta = index.getMetaIndex()
pointer = index.getLexicon()["agenda"]
for posting in index.getInvertedIndex().getPostings(pointer):
    docno = meta.getItem("docno", posting.getId())
    print(f"{posting.toString()} doclen={posting.getDocumentLength()} docno={docno}")

(520,1,F[0,0],B[42]) doclen=117 docno=254370
(709,1,F[0,0],B[108]) doclen=112 docno=626787
(1052,1,F[0,0],B[67]) doclen=107 docno=305924
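
The extra `B[...]` part of each posting holds the positions at which the term occurs. As a minimal sketch of what positions make possible, the toy code below (plain Python, made-up documents, hypothetical helper names) stores positions per term and uses them to match a two-word phrase.

```python
from collections import defaultdict

def positional_index(docs):
    """Map term -> {docid: [positions]} for tokenised documents."""
    index = defaultdict(dict)
    for docid, tokens in enumerate(docs):
        for pos, term in enumerate(tokens):
            index[term].setdefault(docid, []).append(pos)
    return index

def phrase_match(index, first, second):
    """Return docids where `second` occurs immediately after `first`."""
    hits = []
    for docid, positions in index.get(first, {}).items():
        following = set(index.get(second, {}).get(docid, []))
        if any(p + 1 in following for p in positions):
            hits.append(docid)
    return hits

toy_docs = [
    ["berlin", "wall", "fell"],
    ["wall", "near", "berlin"],
]
pos_index = positional_index(toy_docs)
print(pos_index["wall"])
print(phrase_match(pos_index, "berlin", "wall"))
```

Both documents contain both terms, but only the first contains them as an adjacent phrase, a distinction that an index without positions cannot make.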


### Loading an Index

Creating an index can take significant time for large document collections.
We can load an index that we previously built by pointing to its `data.properties` file.

In [29]:
index_ref = pt.IndexRef.of("./indexes/default/data.properties")
index = pt.IndexFactory.of(index_ref)
print(index.getCollectionStatistics().toString())

Number of documents: 2453
Number of terms: 23693
Number of postings: 208487
Number of fields: 2
Number of tokens: 273373
Field names: [text, title]
Positions:   false



## 7: Searching an Index

Now that we have an index, let's perform retrieval on it!

In PyTerrier, search is done through the `BatchRetrieve` class.
`BatchRetrieve` takes two main arguments:
- an index
- a weighting model

For instance, we can run the first three topics against our index with a term frequency (`Tf`) model:

In [30]:
tf = pt.BatchRetrieve(index, wmodel="Tf") >> pt.text.get_text(index, "docno")
# Run the first three topics; a single query string goes through search(), e.g. pipe1.search("wall") above
tf(topics_df[:3])

Unnamed: 0,qid,docid,docno,rank,score,query
0,1015979,205,1015979,0,10.0,president of chile
1,1015979,2417,1186821,1,8.0,president of chile
2,1015979,2435,229754,2,7.0,president of chile
3,1015979,215,1052471,3,5.0,president of chile
4,1015979,549,1514612,4,5.0,president of chile
...,...,...,...,...,...,...
289,340095,2238,223627,123,1.0,2020 summer olympics
290,340095,2277,215330,124,1.0,2020 summer olympics
291,340095,2320,24390,125,1.0,2020 summer olympics
292,340095,2352,1911430,126,1.0,2020 summer olympics


The retrieval step returns a DataFrame with columns:
 - `qid`: The query identifier, taken from the input topics (a single query string passed to `search()` gets the qid "1")
 - `docid`: This is Terrier's internal integer for each document
 - `docno`: This is the external (string) unique identifier for each document
 - `rank`: This shows the descending order by score of retrieved documents
 - `score`: Since we use the `Tf` weighting model, this score corresponds to the total frequency of the query terms in each document
 - `query`: The input query
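
To make the `score` and `rank` columns concrete, here is a minimal pure-Python sketch of term-frequency ranking. The function name `tf_rank` and the toy documents are illustrative assumptions; it mimics the idea of the `Tf` model, not Terrier's implementation.

```python
from collections import Counter

def tf_rank(query_terms, docs):
    """Score each document by the summed frequency of the query terms."""
    rows = []
    for docno, tokens in docs.items():
        counts = Counter(tokens)
        score = sum(counts[t] for t in query_terms)
        if score > 0:                      # only matching documents are returned
            rows.append((docno, float(score)))
    rows.sort(key=lambda r: -r[1])         # descending score; stable for ties
    return [
        {"docno": docno, "rank": rank, "score": score}
        for rank, (docno, score) in enumerate(rows)
    ]

# Hypothetical pre-processed documents
toy_docs = {
    "d1": ["presid", "chile", "presid"],
    "d2": ["chile", "wine"],
    "d3": ["train", "station"],
}
for row in tf_rank(["presid", "chile"], toy_docs):
    print(row)
```

Documents that contain none of the query terms never appear in the result, which is why the number of rows differs from query to query.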

We can also pass a DataFrame of one or more queries, here numbered "q1" and "q2", to the `transform()` method (rather than a single query string to the `search()` method).

In [31]:
queries = pd.DataFrame([["q1", "dragon"], ["q2", "wall"]], columns=["qid", "query"])
tf.transform(queries)

Unnamed: 0,qid,docid,docno,rank,score,query
0,q1,759,1782559,0,4.0,dragon
1,q1,26,1610206,1,1.0,dragon
2,q1,201,1076935,2,1.0,dragon
3,q1,1383,1323966,3,1.0,dragon
4,q1,1576,630588,4,1.0,dragon
5,q1,2194,654718,5,1.0,dragon
6,q2,1172,679402,0,5.0,wall
7,q2,2110,2391064,1,3.0,wall
8,q2,293,243238,2,2.0,wall
9,q2,592,692168,3,2.0,wall


Moreover, since `transform()` is the default method of a BatchRetrieve object `br`, we can directly write `br(queries)`:

In [32]:
tf(queries)

Unnamed: 0,qid,docid,docno,rank,score,query
0,q1,759,1782559,0,4.0,dragon
1,q1,26,1610206,1,1.0,dragon
2,q1,201,1076935,2,1.0,dragon
3,q1,1383,1323966,3,1.0,dragon
4,q1,1576,630588,4,1.0,dragon
5,q1,2194,654718,5,1.0,dragon
6,q2,1172,679402,0,5.0,wall
7,q2,2110,2391064,1,3.0,wall
8,q2,293,243238,2,2.0,wall
9,q2,592,692168,3,2.0,wall


Finally, while we have used the simple `"Tf"` ranking function in the example above, Terrier supports many other models that can be used by simply changing the `wmodel="Tf"` argument of `BatchRetrieve` (e.g. `wmodel="BM25"` for BM25 scoring).
A list of supported models is available in the [documentation](http://terrier.org/docs/current/javadoc/org/terrier/matching/models/package-summary.html).

We can also tune internal Terrier configurations through the `properties` and `controls` arguments.
For example, we can tune [BM25](http://terrier.org/docs/current/javadoc/org/terrier/matching/models/BM25.html)'s $b$, $k_1$ and $k_3$ parameters (cf. Equation 4 [here](http://ir.dcs.gla.ac.uk/smooth/he-ecir05.pdf)) as follows:

In [33]:
bm25 = pt.BatchRetrieve(index, wmodel="BM25")  # default parameters
bm25v2 = pt.BatchRetrieve(index, wmodel="BM25", controls={"c": 0.1, "bm25.k_1": 2.0, "bm25.k_3": 10})
bm25v3 = pt.BatchRetrieve(index, wmodel="BM25", controls={"c": 8, "bm25.k_1": 1.4, "bm25.k_3": 10})

Here, the $b$ parameter is set via the generic `"c"` control parameter.
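
To build intuition for what these parameters do, the sketch below scores a single query term with a textbook BM25 variant. The exact formula and the default values in the signature are assumptions for illustration and may differ from Terrier's implementation; the function name and the numbers in the example are hypothetical.

```python
import math

def bm25_term_score(tf, doc_len, avg_doc_len, n_docs, n_t,
                    k_1=1.2, b=0.75, k_3=8.0, qtf=1):
    """Score one query term for one document with a textbook BM25 variant."""
    idf = math.log((n_docs - n_t + 0.5) / (n_t + 0.5))
    # b controls how strongly long documents are penalised
    norm = k_1 * ((1 - b) + b * doc_len / avg_doc_len)
    doc_part = (k_1 + 1) * tf / (norm + tf)        # saturates as tf grows
    query_part = (k_3 + 1) * qtf / (k_3 + qtf)     # repeated query terms
    return idf * doc_part * query_part

# Same term frequency in a short and a long document (made-up lengths)
common = dict(tf=3, avg_doc_len=110.0, n_docs=2453, n_t=40)
short_doc = bm25_term_score(doc_len=55, **common)
long_doc = bm25_term_score(doc_len=220, **common)
no_norm_short = bm25_term_score(doc_len=55, b=0.0, **common)
no_norm_long = bm25_term_score(doc_len=220, b=0.0, **common)
print(short_doc > long_doc, no_norm_short == no_norm_long)
```

With $b = 0$ document length is ignored entirely, while larger values of $b$ increasingly favour shorter documents; $k_1$ and $k_3$ control how quickly repeated occurrences in the document and in the query stop adding to the score.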

## 8: Measuring Retrieval Performance

Ranking metrics allow us to decide which search engine models are better than others for our application.

While we will look into evaluation metrics in a future lab, we can use PyTerrier's `Experiment` abstraction to evaluate multiple (BatchRetrieve) systems on queries "Q" and labels "RA":
```python
pt.Experiment([br1, br2], Q, RA, eval_metrics=["map", "ndcg"])
```

For instance, we can evaluate the MAP and NDCG metrics of the models we defined so far on the first three topics of our collection as follows:

In [36]:
# qrels_df = qrels_df.astype({'label': 'int32'})
pt.Experiment(
    retr_systems=[tf, bm25, bm25v2, bm25v3],
    names=['TF', 'BM25', 'BM25 (0.1, 2.0, 10)', 'BM25 (8, 1.4, 10)'],
    topics=topics_df[:3],
    qrels=qrels_df,
    eval_metrics=["map", "ndcg"])

Unnamed: 0,name,map,ndcg
0,TF,0.727657,0.879601
1,BM25,0.52517,0.686304
2,"BM25 (0.1, 2.0, 10)",0.762116,0.876775
3,"BM25 (8, 1.4, 10)",0.522156,0.685531
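
For readers curious about what these two numbers measure, the sketch below computes average precision (MAP is its mean over all queries) and NDCG for a single toy ranking. It uses one common formulation (linear gain, log2 discount); the exact definitions used by the evaluation tool behind `pt.Experiment` may differ, and the labels and ranking are made up.

```python
import math

def average_precision(ranking, relevant):
    """Mean of precision@k over the ranks k where a relevant doc appears."""
    hits, total = 0, 0.0
    for k, docno in enumerate(ranking, start=1):
        if docno in relevant:
            hits += 1
            total += hits / k
    return total / len(relevant) if relevant else 0.0

def ndcg(ranking, gains):
    """DCG with log2 discount, normalised by the ideal ordering's DCG."""
    def dcg(values):
        return sum(g / math.log2(i + 2) for i, g in enumerate(values))
    ideal = dcg(sorted(gains.values(), reverse=True))
    return dcg([gains.get(d, 0) for d in ranking]) / ideal if ideal else 0.0

# Hypothetical graded labels for one query (docno -> label), as in the qrels
gains = {"d1": 2, "d2": 1, "d3": 1}
ranking = ["d1", "d9", "d2", "d3"]   # system output, best first

ap = average_precision(ranking, set(gains))
print(round(ap, 4), round(ndcg(ranking, gains), 4))
```

Average precision only looks at whether a document is relevant, whereas NDCG also rewards placing the documents with higher labels nearer the top.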
