# SI 650 / EECS 549: Homework 3 Part 1
## Introduction to PyTerrier 

This homework is intended to expose you to other types of information retrieval and to demonstrate the use of another state-of-the-art IR library, [PyTerrier](https://github.com/terrier-org/pyterrier). 

The overall learning goals of the assignment across all three parts are:
  - Learn how to use PyTerrier
  - Understand how to train and use a Learning to Rank model
  - Understand how to train and use a dense vector retrieval model (using deep learning)
  - Understand how to use document augmentation
  - Gain additional programming and debugging skills when working with modern IR libraries
  - Learn how to use the [Great Lakes cluster](https://arc.umich.edu/greatlakes/)
  
  
The Great Lakes cluster is a collection of high performance computers at the University of Michigan. The big advantage for this course is the ability to use its GPUs for doing deep learning. You will have access to this cluster for Homework 3 _and_ for your course project, which can expand the type of methods you can try. When launching jobs for this course, be sure to have your job use the `si650f21_class` account.

For this assignment, we'll be using the [CORD19 test collection](https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge), which is a collection of documents about Covid-19 produced by AI2. In places, we've pretrained models and precomputed indices for you (which can take large amounts of time), but we'll ask you to try out the commands on a small scale so you'll know how to run them.

Homework 3 Part 1 will have you working on the following tasks to get you started:
  - PyTerrier installation & configuration
  - indexing a collection
  - accessing an index
  - using the `BatchRetrieve` transformer for searching an index
  - conducting an `Experiment` 

You can run all parts of the homework on your local computer, given enough time. However, for Part 3, you will see a *significant* speed-up by running the notebooks on Great Lakes with a GPU. The three parts are designed to be completed in order, as they build on each other conceptually.

For each notebook, all the tasks that you will need to complete are marked with **Task** in a cell title comment.

Note that, just like Pyserini, PyTerrier uses a Java-based library underneath, the [Terrier information retrieval toolkit](http://terrier.org), so you will need to set `JAVA_HOME` accordingly. PyTerrier makes it easy to perform IR experiments in Python, which could come in handy for your course project.
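Before starting PyTerrier, it can help to confirm that `JAVA_HOME` is visible to Python. The following is a minimal sketch using only the standard library; the helper name `describe_java_home` is hypothetical, and the check only inspects the environment variable rather than verifying that the Java installation actually works.

```python
import os

def describe_java_home(env=None):
    """Return a short human-readable status for the JAVA_HOME setting."""
    # Use the real environment unless a dict is passed in (handy for testing)
    env = os.environ if env is None else env
    java_home = env.get("JAVA_HOME")
    if not java_home:
        return "JAVA_HOME is not set"
    if not os.path.isdir(java_home):
        return "JAVA_HOME points to a missing directory: " + java_home
    return "JAVA_HOME = " + java_home

print(describe_java_home())
```

If the variable is missing, set it in your shell (or via `os.environ["JAVA_HOME"] = ...` before importing PyTerrier) to the path of your own Java installation.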

See the [PyTerrier documentation](https://pyterrier.readthedocs.io/en/latest/) for many more details.

PyTerrier is a Python framework, but it uses the underlying [Terrier information retrieval toolkit](http://terrier.org) for many indexing and retrieval operations. While PyTerrier was new in 2020, Terrier is written in Java and has a long history dating back to 2001. PyTerrier makes it easy to perform IR experiments in Python while using the mature Terrier platform for the expensive indexing and retrieval operations. 

In the following, we introduce everything you need to know about PyTerrier, and also provide appropriate links to relevant parts of the [PyTerrier documentation](https://pyterrier.readthedocs.io/en/latest/).


### Imports

In [1]:
import pandas as pd
# Helpful for showing indexing information
pd.set_option('display.max_colwidth', 150)

import pyterrier as pt
import os

### Starting PyTerrier

The first step is to initialize PyTerrier using its `init()` method. The `init()` method will download Terrier's jar file (if it has not already been downloaded) and then start the Java Virtual Machine. To avoid downstream complications, we check `started()` prior to calling `init()` to prevent multiple Terrier instances from running concurrently.

In [2]:
if not pt.started():
    pt.init()

PyTerrier 0.7.1 has loaded Terrier 5.6 (built by craigmacdonald on 2021-09-17 13:27)


No etc/terrier.properties, using terrier.default.properties for bootstrap configuration.


### Documents, Indexing and Indexes

PyTerrier typically works with Pandas dataframes for inputs. Let's create a toy set of documents in a dataframe to test. Note that the column name of `docno` is a special PyTerrier name that is the unique identifier for each document.

In [3]:
docs_df = pd.DataFrame([
        ["d1", "this is the first document of many documents"],
        ["d2", "this is another document"],
        ["d3", "the topic of this document is unknown"]
    ], columns=["docno", "text"])

docs_df

Unnamed: 0,docno,text
0,d1,this is the first document of many documents
1,d2,this is another document
2,d3,the topic of this document is unknown


Before any search engine can estimate which documents are most likely to be relevant for a given query, it must index the documents. 

In the following cell, we index the dataframe's documents. The index, with all its data structures, is written into a directory called `toydocs_index`. 

In [4]:
index_dir = './toydocs_index'
indexer = pt.DFIndexer(index_dir, overwrite=True)
index_ref = indexer.index(docs_df["text"], docs_df["docno"])
index_ref.toString()

'./toydocs_index/data.properties'

PyTerrier will generate an index in the `toydocs_index` directory, and we can list the files to see what kind of internal structure and files it made:

In [5]:
os.listdir(index_dir)

['data.lexicon.fsomaphash',
 'data.meta.zdata',
 'data.meta-0.fsomapfile',
 'data.direct.bf',
 'data.inverted.bf',
 'data.lexicon.fsomapfile',
 'data.properties',
 'data.meta.idx',
 'data.document.fsarrayfile']

Once we've generated the files associated with `index_ref`, we can load the information into an actual PyTerrier index using the method `pt.IndexFactory.of()`. 

In [6]:
index = pt.IndexFactory.of(index_ref)

See Terrier's [`Index`](http://terrier.org/docs/current/javadoc/org/terrier/structures/Index.html) object for documentation, which is written in Java's Javadoc format. We can call these methods on our index object as well. Important methods to note are:
 - `getCollectionStatistics()`
 - `getInvertedIndex()`
 - `getLexicon()`

Let's see what is returned by the `getCollectionStatistics()` method:

In [7]:
print(index.getCollectionStatistics().toString())

Number of documents: 3
Number of terms: 4
Number of postings: 6
Number of fields: 0
Number of tokens: 7
Field names: []
Positions:   false



Let's unpack the statistics a bit more. We have 3 documents but why do we have only 4 unique terms? We can look at which terms we have by getting the [`Lexicon`](http://terrier.org/docs/current/javadoc/org/terrier/structures/Lexicon.html) object, which contains our vocabulary. We can iterate over the `Lexicon` from Python like a dictionary to see which terms are present and what information there is about each term after indexing.

In [8]:
for kv in index.getLexicon():
    # Let's also print the type information of each to get a sense of what we're working with
    print("%s (%s) -> %s (%s)" % (kv.getKey(), type(kv.getKey()), kv.getValue().toString(), type(kv.getValue()) ) )

document (<class 'str'>) -> term0 Nt=3 TF=4 maxTF=2 @{0 0 0} (<class 'jnius.reflect.org.terrier.structures.LexiconEntry'>)
first (<class 'str'>) -> term1 Nt=1 TF=1 maxTF=1 @{0 0 7} (<class 'jnius.reflect.org.terrier.structures.LexiconEntry'>)
topic (<class 'str'>) -> term2 Nt=1 TF=1 maxTF=1 @{0 1 1} (<class 'jnius.reflect.org.terrier.structures.LexiconEntry'>)
unknown (<class 'str'>) -> term3 Nt=1 TF=1 maxTF=1 @{0 1 5} (<class 'jnius.reflect.org.terrier.structures.LexiconEntry'>)


Iterating over the `Lexicon` shows that we're mapping a `String` term to a [`LexiconEntry`](http://terrier.org/docs/current/javadoc/org/terrier/structures/LexiconEntry.html) object, which itself is an [`EntryStatistics`](http://terrier.org/docs/current/javadoc/org/terrier/structures/EntryStatistics.html). The `LexiconEntry` contains information including the statistics of that term.

Looking at what we indexed reveals that PyTerrier is removing stopwords for us, much like Pyserini did. PyTerrier is also doing some token normalization, so that we only have "document" in our index even though document `d1` has the token "documents"! By default, Terrier removes standard stopwords and applies Porter's stemmer (which we talked about in class), though these behaviors can be configured.

The `EntryStatistics` also provides a few other fields that offer insights:
 - `Nt` is the number of unique documents that each term occurs in – this is useful for calculating IDF.
 - `TF` is the total number of occurrences – some weighting models use this instead of Nt.
 - The numbers in the `@{}` are a pointer – they tell Terrier where the postings are for that term in the inverted index data structure.
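To see where the collection statistics and the `Nt`/`TF` values come from, the following pure-Python sketch recomputes them for the toy documents. It is only an illustration: the stopword set is an assumption hand-picked to mirror the output above (it is not Terrier's actual stopword list), and `toy_stem` is a crude plural stripper standing in for Porter's stemmer.

```python
from collections import Counter

docs = {
    "d1": "this is the first document of many documents",
    "d2": "this is another document",
    "d3": "the topic of this document is unknown",
}

# Assumption: a tiny stopword set that covers this toy corpus only
TOY_STOPWORDS = {"this", "is", "the", "of", "many", "another"}

def toy_stem(token):
    # Crude plural stripping; a stand-in for Porter's stemmer, not equivalent to it
    return token[:-1] if token.endswith("s") and len(token) > 3 else token

def analyze(text):
    # Remove stopwords first, then stem the remaining tokens
    return [toy_stem(t) for t in text.split() if t not in TOY_STOPWORDS]

# Per-document term counts after stopword removal and stemming
doc_terms = {docno: Counter(analyze(text)) for docno, text in docs.items()}

# term -> (Nt, TF): number of documents containing the term, total occurrences
lexicon = {}
for counts in doc_terms.values():
    for term, tf in counts.items():
        nt, total = lexicon.get(term, (0, 0))
        lexicon[term] = (nt + 1, total + tf)

num_tokens = sum(sum(c.values()) for c in doc_terms.values())
num_postings = sum(nt for nt, _ in lexicon.values())

for term in sorted(lexicon):
    print(term, "Nt=%d TF=%d" % lexicon[term])
print("terms:", len(lexicon), "postings:", num_postings, "tokens:", num_tokens)
```

Under these assumptions the sketch reproduces the 4 terms, 6 postings, and 7 tokens reported by `getCollectionStatistics()`, which makes it clear that the "missing" words were dropped as stopwords and that "documents" was folded into "document".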

PyTerrier also supports directly looking up a word using the `[]` operator, much like we would if we were looking up a key's value in a dictionary. Let's look up the value for the word "document":

In [9]:
print(index.getLexicon()["document"])

term0 Nt=3 TF=4 maxTF=2 @{0 0 0}


We can also use the information in the `Lexicon` to look up documents. Remember from class that an inverted index is a mapping from each term to the *documents* in which that term occurs. The `LexiconEntry` for a word contains the pointer to where to find the documents for that word in the inverted index. 

The object retrieved from using the `[]` operator with a `Lexicon` is a pointer that we can use with the inverted index.

In [10]:
pointer = index.getLexicon()["document"]
for posting in index.getInvertedIndex().getPostings(pointer):
    print(str(posting) + " doclen=%d" % posting.getDocumentLength())

ID(0) TF(2) doclen=3
ID(1) TF(1) doclen=1
ID(2) TF(1) doclen=3


From this output, we can see that the term "document" occurs in all three documents, as well as how long those documents are. Note that PyTerrier starts counting indexed documents with `int` values starting from 0 (called *docids*). These *docids* are then mapped back to *docnos*, which are the unique string identifiers for a document, e.g., the "`d1`", "`d2`" we used. This mapping is stored in a separate data structure called the *metaindex*, though you likely won't need to use that.
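The relationship between postings, docids, and docnos can be sketched with plain Python structures. This is a toy model, not Terrier's on-disk format: the `processed` list is an assumption that hard-codes the terms left after stopword removal and stemming, and the `docnos` list plays the role of the metaindex.

```python
# Assumption: terms as they look after stopword removal and stemming
processed = [
    ("d1", ["first", "document", "document"]),
    ("d2", ["document"]),
    ("d3", ["topic", "document", "unknown"]),
]

docnos = []        # toy "metaindex": position (docid) -> docno
doc_lengths = []   # docid -> number of indexed tokens
inverted = {}      # term -> list of [docid, tf] postings, in docid order

for docid, (docno, terms) in enumerate(processed):
    docnos.append(docno)
    doc_lengths.append(len(terms))
    for term in terms:
        postings = inverted.setdefault(term, [])
        if postings and postings[-1][0] == docid:
            postings[-1][1] += 1          # term seen again in the same document
        else:
            postings.append([docid, 1])   # first occurrence in this document

for docid, tf in inverted["document"]:
    print("ID(%d) TF(%d) doclen=%d docno=%s"
          % (docid, tf, doc_lengths[docid], docnos[docid]))
```

The integer docids are simply positions assigned at indexing time, which is why a separate lookup is needed to recover the string docnos.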

## Searching an Index

Our way into search in PyTerrier is called `BatchRetrieve`. `BatchRetrieve` is configured by specifying an index and a weighting model. Here, we'll use the `Tf` weighting, which is just term frequency; there are multiple possible weighting schemes, as we'll see later. Using a `BatchRetrieve` object, we will search for a single-word query, `"document"`.

In [11]:
br = pt.BatchRetrieve(index, wmodel="Tf")
br.search("document")

Unnamed: 0,qid,docid,docno,rank,score,query
0,1,0,d1,0,2.0,document
1,1,1,d2,1,1.0,document
2,1,2,d3,2,1.0,document


The `search()` method returns a Pandas dataframe with columns:
 - `qid`: this is the query id, which is by default "1", since we issued only one query
 - `docid`: Terrier's internal integer for each document
 - `docno`: the external (string) unique identifier for each document
 - `score`: since we use the `Tf` weighting model, this score corresponds to the total frequency of the query (terms) in each document
 - `rank`: A handy attribute showing the descending order by score
 - `query`: the input query

As expected, the `Tf` weighting model used here only counts the frequencies of the query terms in each document, i.e.:
$$
score(d,q) = \sum_{t \in q} tf_{t,d}
$$

Hence, it's clear that document `d1` should be the highest scored document with two occurrences (cf. `'document'` and `'documents'`).  
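The `Tf` formula is simple enough to check by hand. The sketch below applies it to the toy corpus; the `doc_terms` counts are an assumption that hard-codes the terms remaining after stopword removal and stemming, and `tf_score` and `rank` are hypothetical helper names, not PyTerrier functions.

```python
from collections import Counter

# Assumption: per-document term counts after stopword removal and stemming
doc_terms = {
    "d1": Counter(["first", "document", "document"]),
    "d2": Counter(["document"]),
    "d3": Counter(["topic", "document", "unknown"]),
}

def tf_score(query_terms, term_counts):
    # score(d, q) = sum over query terms of tf_{t,d}; missing terms count as 0
    return sum(term_counts[t] for t in query_terms)

def rank(query_terms):
    scored = [(docno, float(tf_score(query_terms, counts)))
              for docno, counts in doc_terms.items()]
    # Keep only matching documents and sort by descending score (ties keep input order)
    scored = [(docno, score) for docno, score in scored if score > 0]
    return sorted(scored, key=lambda pair: -pair[1])

print(rank(["document"]))
print(rank(["first", "document"]))
```

For the query "document" this gives 2.0 for `d1` and 1.0 for the other two documents, in line with the `BatchRetrieve` output above.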

### Searching with multiple queries

We can search for more than one query at a time using the  `transform()` method rather than the `search()` method. PyTerrier uses the notion of transformers, which we'll describe much more in Part 2, but for now, you can think of this function as transforming some input to some output. In our case, we'll create a Pandas DataFrame with our queries, which we'll provide as input to the `BatchRetrieve` object, to "transform" into results.

Note that we not only need to provide queries, but also query identifiers in the `qid` column. These `qid` values will let us distinguish which results go to which query.

In [12]:
queries = pd.DataFrame([["q1", "document"], ["q2", "first document"]], columns=["qid", "query"])
queries

Unnamed: 0,qid,query
0,q1,document
1,q2,first document


Now we can pass this queries dataframe into `transform()` to get the results:

In [13]:
br.transform(queries)

Unnamed: 0,qid,docid,docno,rank,score,query
0,q1,0,d1,0,2.0,document
1,q1,1,d2,1,1.0,document
2,q1,2,d3,2,1.0,document
3,q2,0,d1,0,3.0,first document
4,q2,1,d2,1,1.0,first document
5,q2,2,d3,2,1.0,first document


Most common operations in PyTerrier have been overloaded so that you can call them using Python syntax (called _operator overloading_). We'll discuss this more in Part 2, but for now, know that you can call `br.transform(queries)` using just `br(queries)`. Here, the `()` operator has been overloaded so that it calls `transform()` for us! You will see this usage very frequently in examples and documentation, so it's worth remembering that the two are equivalent. As an example:

In [14]:
br(queries)

Unnamed: 0,qid,docid,docno,rank,score,query
0,q1,0,d1,0,2.0,document
1,q1,1,d2,1,1.0,document
2,q1,2,d3,2,1.0,document
3,q2,0,d1,0,3.0,first document
4,q2,1,d2,1,1.0,first document
5,q2,2,d3,2,1.0,first document
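In Python, this kind of equivalence is typically achieved by defining the special method `__call__`. The class below is a hypothetical `ToyTransformer` written only to illustrate the mechanism; it is not PyTerrier's implementation, and it works on plain lists of `(qid, query)` tuples rather than dataframes.

```python
class ToyTransformer:
    """A stand-in transformer that upper-cases every query string."""

    def transform(self, queries):
        # queries: list of (qid, query) tuples
        return [(qid, query.upper()) for qid, query in queries]

    def __call__(self, queries):
        # Calling the object like a function delegates to transform()
        return self.transform(queries)

toy = ToyTransformer()
toy_queries = [("q1", "document"), ("q2", "first document")]

print(toy.transform(toy_queries))
print(toy(toy_queries))  # same result via the overloaded () operator
```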


## Scaling up to Covid-19 Data

Let's move on to our full dataset, CORD19, which is easily accessible online. We'll use PyTerrier's `get_dataset()` function to download this corpus automatically and then to index it.

### Task 1: Indexing data (5 points)

Your first task will be to write three lines of code that create the index using an indexer or, if the index was already created, load the created index from file. 

In [15]:
cord19 = pt.datasets.get_dataset('irds:cord19/trec-covid')
pt_index_path = './terrier_trec_covid'

if not os.path.exists(pt_index_path + "/data.properties"):

    # create the index, using the IterDictIndexer indexer 

    # TODO
    indexer = pt.IterDictIndexer(pt_index_path)

    # we give the dataset get_corpus_iter() directly to the indexer
    # while specifying the fields to index and the metadata to record
    
    # TODO
    index_ref = indexer.index(cord19.get_corpus_iter(), fields=('abstract',), meta=('docno',))
else:
    # if you already have the index, create an IndexRef from the data in pt_index_path
    # that we can use to load using the IndexFactory
    
    # TODO
    index_ref = pt.IndexRef.of(pt_index_path + "/data.properties")
    
index = pt.IndexFactory.of(index_ref)


### Task 2: 3 points
- Print out the statistics of the index

In [16]:
# TODO
print(index.getCollectionStatistics().toString())

Number of documents: 192509
Number of terms: 151235
Number of postings: 11554033
Number of fields: 1
Number of tokens: 17728468
Field names: [abstract]
Positions:   false



As a curated collection, CORD19 has a corresponding set of queries, referred to as _topics_, and the relevance assessments for each query (i.e., topic), referred to as _qrels_. Together with the documents, these form a *test collection* that we can use for evaluation. PyTerrier allows us to easily access the topics (queries) and qrels from the dataset. Like many of the inputs and outputs, these are expressed as dataframes as well:

In [17]:
cord19.get_topics(variant='title').head(5)


Unnamed: 0,qid,query
0,1,coronavirus origin
1,2,coronavirus response to weather changes
2,3,coronavirus immunity
3,4,how do people die from the coronavirus
4,5,animal models of covid 19


In [18]:
cord19.get_qrels().head(5)

Unnamed: 0,qid,docno,label,iteration
0,1,005b2j4b,2,4.5
1,1,00fmeepz,1,4.0
2,1,010vptx3,2,0.5
3,1,0194oljo,1,2.5
4,1,021q9884,1,4.0


### Weighting Models

In the earlier example, we used the simple "`Tf`" as our ranking function for document retrieval in BatchRetrieve. However, we can use other models such as `"TF_IDF"` by simply changing the `wmodel="Tf"` keyword argument in the constructor of `BatchRetrieve`:

In [19]:
tfidf = pt.BatchRetrieve(index, wmodel="TF_IDF")
tfidf.search("chemical reactions")

Unnamed: 0,qid,docid,docno,rank,score,query
0,1,18717,iavwkdpr,0,11.035982,chemical reactions
1,1,171636,v3blnh02,1,10.329726,chemical reactions
2,1,147193,ei4rb8fr,2,10.317138,chemical reactions
3,1,121217,msdycum2,3,9.653734,chemical reactions
4,1,170863,sj8i9ss2,4,9.500211,chemical reactions
...,...,...,...,...,...,...
995,1,2428,38aabxh1,995,3.790183,chemical reactions
996,1,14752,u709r8ss,996,3.790183,chemical reactions
997,1,20074,wxi1xsbo,997,3.790183,chemical reactions
998,1,117156,ts3obwts,998,3.790183,chemical reactions


Note that, as expected, because we switched the ranking function, the scores of documents ranked by `TF_IDF` are no longer integers. You can see the exact TF-IDF formula used by Terrier in [the GitHub repo](https://github.com/terrier-org/terrier-core/blob/5.x/modules/core/src/main/java/org/terrier/matching/models/TF_IDF.java#L79), which is sometimes helpful to know since there are multiple ways of defining TF-IDF! Terrier supports many weighting models, and the documentation contains [a list of supported models](http://terrier.org/docs/current/javadoc/org/terrier/matching/models/package-summary.html).
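To illustrate why TF-IDF scores are generally not integers, here is a sketch of one common textbook variant, `tf * log(N / df)`, applied to the toy corpus. This is explicitly not the formula Terrier uses (see the linked source for that); the `doc_terms` counts and the helper names `idf` and `tfidf_score` are assumptions made for the example.

```python
import math
from collections import Counter

# Assumption: per-document term counts after stopword removal and stemming
doc_terms = {
    "d1": Counter(["first", "document", "document"]),
    "d2": Counter(["document"]),
    "d3": Counter(["topic", "document", "unknown"]),
}

def idf(term):
    # Textbook inverse document frequency: log(N / df); 0 for unseen terms
    n_docs = len(doc_terms)
    df = sum(1 for counts in doc_terms.values() if term in counts)
    return math.log(n_docs / df) if df else 0.0

def tfidf_score(query_terms, counts):
    # Sum of tf * idf over the query terms
    return sum(counts[t] * idf(t) for t in query_terms)

for docno, counts in doc_terms.items():
    print(docno, tfidf_score(["first", "document"], counts))
```

In this variant, "document" occurs in every toy document, so its IDF is zero and it contributes nothing to the score; only the rarer term "first" separates `d1` from the rest. Other TF-IDF definitions smooth the IDF or dampen the term frequency, which changes the scores and potentially the ranking.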

## Evaluating and Comparing IR Models

How do we know which of the models we've made so far are good IR models? PyTerrier provides a robust and extensive framework to help us automate the evaluation of IR models once we've defined them.

As a first pass, let's take a look at the relevance scores in the dataset. To do this, we'll merge (`join`) the `qrels` with the results of our ranker to produce a dataframe that has both the ranking model's predictions (`"score"`) and the actual relevance score (`"label"`). 

In [20]:
qrels = cord19.get_qrels()

def get_res_with_labels(ranker, df):
    # get the results for the query or queries
    results = ranker( df )
    # left outer join with the qrels
    with_labels = results.merge(qrels, on=["qid", "docno"], how="left").fillna(0)
    return with_labels

# let's get the TF-IDF results for the first query
get_res_with_labels(tfidf, cord19.get_topics(variant='title').head(1))


Unnamed: 0,qid,docid,docno,rank,score,query,label,iteration
0,1,175892,zy8qjaai,0,7.080599,coronavirus origin,1.0,1
1,1,82224,8ccl9aui,1,6.775667,coronavirus origin,2.0,1
2,1,135326,ne5r4d4b,2,6.683114,coronavirus origin,0.0,1.5
3,1,122804,75773gwg,3,6.590340,coronavirus origin,2.0,5
4,1,122805,kn2z7lho,4,6.590340,coronavirus origin,2.0,3
...,...,...,...,...,...,...,...,...
995,1,180809,0y0hau9l,995,4.214228,coronavirus origin,0.0,0
996,1,148967,f8vbflx6,996,4.212887,coronavirus origin,0.0,0
997,1,183189,uadfehr6,997,4.210201,coronavirus origin,2.0,1.5
998,1,67321,n5hnx2c3,998,4.202319,coronavirus origin,0.0,0


### Running an Experiment

We don't actually need to produce that dataframe to do our evaluation though! PyTerrier lets us evaluate one or more models with an [Experiment](https://pyterrier.readthedocs.io/en/latest/experiments.html), which will compare the models according to the evaluation metrics we specify. Here, let's run an experiment to evaluate the `tfidf` model that we created earlier:

In [21]:
pt.Experiment(
    [tfidf],
    cord19.get_topics(variant='title'),
    cord19.get_qrels(),
    eval_metrics=["map", "ndcg"])


Unnamed: 0,name,map,ndcg
0,BR(TF_IDF),0.180002,0.370767
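To build intuition for what the reported numbers measure, the sketch below computes average precision (MAP is its mean over all queries) and NDCG for a single ranked list. It uses one common formulation (linear gain, `log2(rank + 1)` discount) and hypothetical helper names; the evaluation tooling behind `pt.Experiment` may differ in details, so treat this as an illustration rather than a reimplementation.

```python
import math

def average_precision(ranked_labels, num_relevant):
    """ranked_labels: relevance labels in rank order; num_relevant: all relevant docs for the query."""
    hits, total = 0, 0.0
    for rank, label in enumerate(ranked_labels, start=1):
        if label > 0:
            hits += 1
            total += hits / rank  # precision at this relevant document's rank
    return total / num_relevant if num_relevant else 0.0

def dcg(labels):
    # Discounted cumulative gain with linear gain and log2(rank + 1) discount
    return sum(label / math.log2(rank + 1) for rank, label in enumerate(labels, start=1))

def ndcg_at_k(ranked_labels, judged_labels, k):
    """Normalize by the best possible ordering of all judged labels for the query."""
    ideal_dcg = dcg(sorted(judged_labels, reverse=True)[:k])
    return dcg(ranked_labels[:k]) / ideal_dcg if ideal_dcg else 0.0

# A toy ranking: labels of the retrieved documents, in rank order
ranking = [2, 0, 1]
print(average_precision(ranking, num_relevant=2))
print(ndcg_at_k(ranking, judged_labels=[2, 1, 0], k=3))
```

Average precision treats any positive label as relevant, while NDCG rewards placing the more highly graded documents nearer the top, which is why the two metrics can disagree about which model is better.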


## Task 3: Define new models and evaluate them in an Experiment (28 points)

Now comes the fun part! Your task is to define **three** new [`BatchRetrieve`](https://pyterrier.readthedocs.io/en/latest/terrier-retrieval.html#batchretrieve) objects with different weighting models. You are welcome to set the hyperparameters, but all models should be sufficiently different. You are definitely welcome (encouraged, even!) to compare _more_ than three models too.

Once you have defined your three `BatchRetrieve` objects, conduct an `Experiment` using all of them _at once_ (not three separate `Experiment` runs!) to evaluate the results. Your experiment should include the two metrics used above, as well as NDCG for the top-5 and top-10 results. You are welcome to include other metrics as well.

Print the results of the Experiment and then write 2-3 sentences (or more) about what you see in the performance. Is there a clearly better model? Would you expect better performance with some hyperparameter tuning?

In [22]:
# TODO
bm25 = pt.BatchRetrieve(index, wmodel="BM25")
pl2 = pt.BatchRetrieve(index, wmodel="PL2")
dlh = pt.BatchRetrieve(index, wmodel="DLH")

In [23]:
pt.Experiment([bm25, pl2, dlh],
    cord19.get_topics(variant='title'),
    cord19.get_qrels(),
    eval_metrics=["map", "ndcg", "ndcg_cut_5", "ndcg_cut_10"])


Unnamed: 0,name,map,ndcg,ndcg_cut_5,ndcg_cut_10
0,BR(BM25),0.181478,0.373328,0.611724,0.583665
1,BR(PL2),0.169525,0.358597,0.58385,0.554204
2,BR(DLH),0.158592,0.349725,0.528784,0.517947


In the above experiment, BM25 has a higher MAP, NDCG, NDCG@5, and NDCG@10 than PL2 and DLH, and PL2 performs better than DLH on every metric. This indicates that BM25 has the highest ranking quality of the three, so it is the clearly better model here. In addition, since the model is untuned here but its performance is already acceptable, I would expect better performance with some hyperparameter tuning.
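As a rough illustration of what hyperparameter tuning would change, the sketch below implements the textbook Okapi BM25 score for a single term, with its two usual parameters `k1` (term-frequency saturation) and `b` (length normalization). This is an assumption-laden sketch: Terrier's BM25 implementation differs in its details, the helper name `bm25_term_score` is hypothetical, and the numbers passed in are made up for the example.

```python
import math

def bm25_term_score(tf, doc_len, avg_doc_len, n_docs, doc_freq, k1=1.2, b=0.75):
    """Textbook Okapi BM25 contribution of one query term to one document."""
    # Smoothed IDF that stays positive even for very common terms
    idf = math.log(1 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    # Length normalization: b=0 ignores document length, b=1 normalizes fully
    norm = k1 * (1 - b + b * doc_len / avg_doc_len)
    return idf * tf * (k1 + 1) / (tf + norm)

# Same term frequency in a short and a long document (made-up statistics)
common = dict(tf=2, avg_doc_len=100, n_docs=1000, doc_freq=50)
for b in (0.0, 0.75):
    short_doc = bm25_term_score(doc_len=50, b=b, **common)
    long_doc = bm25_term_score(doc_len=500, b=b, **common)
    print("b=%.2f short=%.4f long=%.4f" % (b, short_doc, long_doc))
```

With `b=0` the two documents score identically, while larger values of `b` increasingly favor the shorter document; tuning explores such trade-offs against the qrels to find the setting that ranks best on this collection.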