# PyTerrier ECIR 2021 Tutorial Notebook - Part 4.1 - DeepCT and Doc2Query

This notebook gives attendees hands-on experience creating indexing pipelines in [PyTerrier](https://github.com/terrier-org/pyterrier). All experiments are conducted using the [CORD19 corpus](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7251955/) and the [TREC Covid test collection](https://ir.nist.gov/covidSubmit/).

This notebook demonstrates the use of docT5query and DeepCT for end-to-end indexing and retrieval in PyTerrier, as provided by the PyTerrier plugins [*PyTerrier doc2query*](https://github.com/terrierteam/pyterrier_doc2query) and [*PyTerrier deepct*](https://github.com/terrierteam/pyterrier_deepct).


In this notebook, you will experience:
 - indexing and retrieval using [*PyTerrier doc2query*](https://github.com/terrierteam/pyterrier_doc2query).
 - indexing and retrieval using [*PyTerrier deepct*](https://github.com/terrierteam/pyterrier_deepct).


In [None]:
%tensorflow_version 1.x
!pip install --upgrade python-terrier
!pip install --upgrade git+https://github.com/terrierteam/pyterrier_deepct.git
!pip install --upgrade git+https://github.com/terrierteam/pyterrier_doc2query.git

These lines make DeepCT and TensorFlow less verbose:

In [None]:
import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2' 
import tensorflow as tf
assert tf.__version__.startswith("1"), "TF 1 is required by DeepCT; on Colab, use %tensorflow_version 1.x"
tf.logging.set_verbosity(tf.logging.ERROR)

Load up PyTerrier!

In [None]:
import pyterrier as pt
if not pt.started():
  pt.init()

## DeepCT

Recall that the DeepCT model repeats terms based on their estimated importance. This repetition boosts the importance of those terms in an inverted index structure.
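
To make the repetition idea concrete, here is a minimal sketch in plain Python. The importance scores are invented for illustration, and `repeat_by_importance` and its `scale` parameter are hypothetical names, not part of `pyterrier_deepct`. The sketch assumes each term is repeated in proportion to its score and that terms whose count rounds to zero disappear from the text.

```python
from collections import Counter

def repeat_by_importance(term_weights, scale=100):
    """Repeat each term round(weight * scale) times (hypothetical helper).

    Terms whose repetition count rounds to zero are dropped entirely.
    """
    expanded = []
    for term, weight in term_weights.items():
        expanded.extend([term] * round(weight * scale))
    return " ".join(expanded)

# Invented importance scores in [0, 1]; a real model would predict these.
toy_weights = {"conference": 0.20, "retrieval": 0.15, "virtually": 0.02, "the": 0.001}

toy_text = repeat_by_importance(toy_weights)
toy_tf = Counter(toy_text.split())  # term frequencies an indexer would record
print(toy_tf.most_common())
```

Because the indexer only sees the repeated text, the term frequencies it stores reflect the estimated importance rather than the original counts.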

We provide an interface to the DeepCT model in the `pyterrier_deepct` package:

In [None]:
import pyterrier_deepct

### Loading a pre-trained model

We will load the pre-trained version of DeepCT provided by the authors.

In [None]:
if not os.path.exists("marco.zip"):
  !wget http://boston.lti.cs.cmu.edu/appendices/arXiv2019-DeepCT-Zhuyun-Dai/outputs/marco.zip
  !unzip marco.zip
if not os.path.exists("uncased_L-12_H-768_A-12.zip"):
  !wget https://storage.googleapis.com/bert_models/2020_02_20/uncased_L-12_H-768_A-12.zip
  !unzip uncased_L-12_H-768_A-12.zip
  !mkdir -p bert-base-uncased
  !mv vocab.txt bert_* bert-base-uncased/

Loading a model is as simple as specifying the model configuration and weight file:

In [None]:
deepct = pyterrier_deepct.DeepCTTransformer("bert-base-uncased/bert_config.json", "marco/model.ckpt-65816")

### Running on sample text

We can transform a dataframe with a sample document to observe the effect of DeepCT:

In [None]:
import pandas as pd
df = pd.DataFrame([{"docno" : "d1", "text" :"The 43rd European Conference on Information Retrieval (ECIR 2021) is held virtually due to the COVID-19 pandemic."}])
df.iloc[0].text

In [None]:
deepct_df = deepct(df)
deepct_df.iloc[0].text

(You may need to expand the text using the \[...\] button at the end of the text.)

Interesting, right? We can see a lot of terms are expanded. Let's use `Counter` to see which are the most important terms.

In [None]:
from collections import Counter
Counter(deepct_df.iloc[0].text.split()).most_common()

As you can see, DeepCT considers "Conference", "Information", and "Retrieval" to be the most important terms in the document. Not bad choices. However, it places little emphasis on numeric text (it completely removes 19 from COVID-19) as well as the word "virtually".

### Loading an index of DeepCT documents

It takes too long to run DeepCT over the entire CORD19 collection in a tutorial setting, so we provide a version of the index for download.

If you would like to index the collection with DeepCT yourself, you can use:

```python
dataset = pt.get_dataset("irds:cord19/trec-covid")
indexer = (
  pt.apply.generic(lambda df: df.rename(columns={'abstract': 'text'})) # rename "abstract" column to "text"
  >> deepct # apply DeepCT transformation
  >> pt.IterDictIndexer("./deepct_index_path")) # index the modified documents
indexref = indexer.index(dataset.get_corpus_iter())
```

In [None]:
if not os.path.exists('deepct_marco_cord19.zip'):
  !wget http://www.dcs.gla.ac.uk/~craigm/ecir2021-tutorial/deepct_marco_cord19.zip
  !unzip deepct_marco_cord19.zip
deepct_indexref = pt.IndexRef.of('./deepct_index_path/data.properties')

Let's look at the top results for each of the TREC COVID queries.

In [None]:
dataset = pt.get_dataset('irds:cord19/trec-covid')
pipeline = pt.BatchRetrieve(deepct_indexref) % 1 >> pt.text.get_text(dataset, 'title')
res = pipeline(dataset.get_topics('title'))
res.merge(dataset.get_qrels(), how='left').head()
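
The left merge with the relevance judgements can be illustrated on toy data. This is a sketch with invented query and document identifiers. It assumes the two frames share the `qid` and `docno` columns, which pandas then uses as the join keys by default.

```python
import pandas as pd

# Invented top-1 results for three queries.
toy_res = pd.DataFrame([
    {"qid": "1", "docno": "docA", "rank": 0},
    {"qid": "2", "docno": "docB", "rank": 0},
    {"qid": "3", "docno": "docC", "rank": 0},
])
# Invented judgements: docC has not been assessed for query 3.
toy_qrels = pd.DataFrame([
    {"qid": "1", "docno": "docA", "label": 2},
    {"qid": "2", "docno": "docB", "label": 0},
])

# how="left" keeps every retrieved row, even those without a judgement.
toy_merged = toy_res.merge(toy_qrels, how="left")
print(toy_merged)
```

A label of 0 means the document was judged non-relevant, while a missing label (NaN) means it was never judged for that query.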

Ouch: the top results for queries 2 and 4 are non-relevant. Let's dig deeper into those source documents.

In [None]:
df = pd.DataFrame(doc for doc in dataset.get_corpus_iter() if doc['docno'] in ('g8grcy5j', 'c76g2p8a'))
df = df.rename(columns={'abstract': 'text'})
deepct_df = deepct(df)
print('deepct-transformed documents')
for deepct_text, docno, text in zip(deepct_df['text'], deepct_df['docno'], df['text']):
  print(docno)
  print(Counter(deepct_text.split()).most_common(10))
  print(text)

As we can see, the document ranked highest for "*coronavirus response to weather changes*" (g8grcy5j) discusses the potential for change in climate policy as a result of COVID-19, not how the virus responds to weather. DeepCT picks up on this theme and gives weather-related words high importance.

The top document for "*how do people die from the coronavirus*" (c76g2p8a) discusses the number of COVID-19 fatalities, not the causes of death. DeepCT gives terms like "die" high importance scores.

## doc2query

Recall that doc2query augments an inverted index structure by predicting queries that may be used to search for the document, and appending those to the document text.
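
The effect of appending predicted queries can be sketched in plain Python. The queries below are invented for illustration and `append_queries` is a hypothetical helper, not part of `pyterrier_doc2query`.

```python
from collections import Counter

def append_queries(text, predicted_queries):
    """Append predicted queries to the document text (hypothetical helper)."""
    return text + " " + " ".join(predicted_queries)

toy_doc = "ECIR 2021 is held virtually due to the pandemic"
# Invented queries; a real model would generate these from the document.
toy_queries = ["where is ecir 2021 held", "is ecir an online conference"]

toy_expanded = append_queries(toy_doc, toy_queries)
before = Counter(toy_doc.lower().split())
after = Counter(toy_expanded.lower().split())

# Terms that only exist because of the appended queries.
new_terms = sorted(set(after) - set(before))
print(new_terms)
print(before["ecir"], after["ecir"])
```

The expansion has two effects: it adds terms that were absent from the document (helping with vocabulary mismatch), and it raises the frequency of terms that were already present.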

We provide an interface to doc2query using the `pyterrier_doc2query` package:

In [None]:
import pyterrier_doc2query

### Loading a pre-trained model

We will again use a version of the doc2query model released by the authors that is trained on the MS MARCO collection.

In [None]:
import os
if not os.path.exists("t5-base.zip"):
  !wget https://git.uwaterloo.ca/jimmylin/doc2query-data/raw/master/T5-passage/t5-base.zip
  !unzip t5-base.zip

We can load the model weights by specifying the checkpoint.

In [None]:
doc2query = pyterrier_doc2query.Doc2Query('model.ckpt-1004000', batch_size=8)

### Running on sample text

Let's see what queries it predicts for the sample document:

In [None]:
import pandas as pd
df = pd.DataFrame([{"docno" : "d1", "text" :"The 43rd European Conference on Information Retrieval (ECIR 2021) is held virtually due to the COVID-19 pandemic."}])
df.iloc[0].text

In [None]:
doc2query_df = doc2query(df)
doc2query_df.iloc[0].querygen

Looks like it's having a bit of trouble with our sample document.

### Loading an index of doc2query documents

Let's see how it does on TREC COVID. Again, it takes too long to index in a tutorial setting, so we provide an index.

If you would like to index the collection with doc2query yourself, you can use:

```python
dataset = pt.get_dataset("irds:cord19/trec-covid")
indexer = (
  pyterrier_doc2query.Doc2Query('model.ckpt-1004000', doc_attr='abstract', batch_size=8, append=True) # apply doc2query on abstracts and append
  >> pt.apply.generic(lambda df: df.rename(columns={'abstract': 'text'})) # rename "abstract" column to "text" for indexing
  >> pt.IterDictIndexer("./doc2query_index_path")) # index the expanded documents
indexref = indexer.index(dataset.get_corpus_iter())
```


In [None]:
if not os.path.exists('doc2query_marco_cord19.zip'):
  !wget http://www.dcs.gla.ac.uk/~craigm/ecir2021-tutorial/doc2query_marco_cord19.zip
  !unzip doc2query_marco_cord19.zip
doc2query_indexref = pt.IndexRef.of('./doc2query_index_path/data.properties')

And again, the top results on TREC COVID:

In [None]:
dataset = pt.get_dataset('irds:cord19/trec-covid')
pipeline = pt.BatchRetrieve(doc2query_indexref) % 1 >> pt.text.get_text(dataset, 'title')
res = pipeline(dataset.get_topics('title'))
res.merge(dataset.get_qrels(), how='left').head()

Let's take a look at what queries it generates for some of these documents:

In [None]:
df = pd.DataFrame(doc for doc in dataset.get_corpus_iter() if doc['docno'] in ('3sepefqa', 'l5fxswfz'))
df = df.rename(columns={'abstract': 'text'})
doc2query_df = doc2query(df)
for querygen, docno, text in zip(doc2query_df['querygen'], doc2query_df['docno'], df['text']):
  print(docno)
  print(querygen)
  print(text)

## Putting it all together!

Let's compare DeepCT and doc2query! We'll first download a plain old index of CORD19 (no DeepCT or doc2query) to serve as our baseline:

In [None]:
if not os.path.exists('terrier_index.zip'):
  !wget http://www.dcs.gla.ac.uk/~craigm/ecir2021-tutorial/terrier_index.zip
  !unzip terrier_index.zip
indexref = pt.IndexRef.of('./terrier_cord19/data.properties')

Now we can run our experiment! **Remember that both DeepCT and doc2query were trained on the MS MARCO collection**, so this represents a zero-shot transfer setting.

In [None]:
pt.Experiment([
  pt.BatchRetrieve(indexref, wmodel="BM25"),
  pt.BatchRetrieve(deepct_indexref, wmodel="BM25"),
  pt.BatchRetrieve(doc2query_indexref, wmodel="BM25"),
  ],
  dataset.get_topics('title'),
  dataset.get_qrels(),
  baseline=0,
  names=["BM25", "BM25_deepct", "BM25_doc2query"],
  eval_metrics=["map", "ndcg", "ndcg_cut.10"]
)

In [None]:
pt.Experiment([
  pt.BatchRetrieve(indexref, wmodel="BM25"),
  pt.BatchRetrieve(deepct_indexref, wmodel="BM25"),
  pt.BatchRetrieve(doc2query_indexref, wmodel="BM25"),
  ],
  dataset.get_topics('description'),
  dataset.get_qrels(),
  baseline=0,
  names=["BM25", "BM25_deepct", "BM25_doc2query"],
  eval_metrics=["map", "ndcg", "ndcg_cut.10"]
)
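
For readers unfamiliar with the `ndcg_cut.10` measure used in these experiments, the following is a minimal sketch of nDCG at a rank cutoff. It assumes the common formulation with linear gain and a log2 rank discount; the evaluation tool used by PyTerrier may differ in details. The function names `dcg` and `ndcg_at_k` are hypothetical.

```python
import math

def dcg(labels):
    """Discounted cumulative gain: the label at 0-based rank i is divided by log2(i + 2)."""
    return sum(label / math.log2(rank + 2) for rank, label in enumerate(labels))

def ndcg_at_k(ranked_labels, judged_labels, k=10):
    """nDCG@k for one query.

    ranked_labels: relevance labels of the retrieved documents, in rank order.
    judged_labels: all relevance labels known for the query (used for the ideal ranking).
    """
    ideal = dcg(sorted(judged_labels, reverse=True)[:k])
    return dcg(ranked_labels[:k]) / ideal if ideal > 0 else 0.0

# Invented labels: the system ranks a non-relevant document above a relevant one.
toy_score = ndcg_at_k([2, 0, 1], judged_labels=[2, 1, 0])
print(round(toy_score, 4))
```

A system-level score is then the mean of the per-query values over all topics.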

# That's all folks

There are separate notebooks in Part 4 for ANCE and ColBERT. Refer to the [tutorial repository](https://github.com/terrier-org/ecir2021tutorial).

Once you have finished all of the Part 4 notebooks, please don't forget to complete our exit quiz: https://forms.office.com/r/2WbpLiQmWV