# Doc2Query-- Reproduction Notebook

This notebook reproduces the main results table of the paper "Doc2Query--: When Less is More".

The notebook is split into sections for Preparation, Last-Metre Reproduction, Last-Mile Reproduction, Complete Reproduction, and Pipeline Usage.

## Preparation

In [None]:
!pip install python-terrier
!pip install pyterrier_pisa
!pip install git+https://github.com/terrierteam/pyterrier_doc2query.git
!pip install git+https://github.com/terrierteam/pyterrier_dr.git
!pip install git+https://github.com/terrierteam/pyterrier_t5.git

In [2]:
import pyterrier as pt
if not pt.started():
    pt.init(tqdm='notebook')
from pyterrier.measures import *
from pyterrier_pisa import PisaIndex
from pyterrier_doc2query import Doc2QueryStore, QueryScoreStore, QueryFilter

In [3]:
def evaluate(idxs):
    bm25_base = idxs['base'].bm25()
    bm25_d2q_n40 = idxs['n40'].bm25(k1=2., b=1.)
    bm25_d2q_n40_electra_p30 = idxs['n40_electra'].bm25(k1=1.75, b=0.6)
    bm25_d2q_n40_monot5_p40 = idxs['n40_monot5'].bm25(k1=1.75, b=0.6)
    bm25_d2q_n40_tct_p50 = idxs['n40_tct'].bm25(k1=1.75, b=0.6)
    bm25_d2q_n80 = idxs['n80'].bm25(k1=2., b=1.)
    bm25_d2q_n80_electra_p30 = idxs['n80_electra'].bm25(k1=1.75, b=0.6)
    bm25_d2q_n80_monot5_p40 = idxs['n80_monot5'].bm25(k1=1.75, b=0.6)
    bm25_d2q_n80_tct_p50 = idxs['n80_tct'].bm25(k1=1.75, b=0.6)
    for did, meas in [
        ('irds:msmarco-passage/dev/small', RR@10),
        ('irds:msmarco-passage/dev/2', RR@10),
        ('irds:msmarco-passage/trec-dl-2019/judged', nDCG@10),
        ('irds:msmarco-passage/trec-dl-2020/judged', nDCG@10),
    ]:
        print(f'\n\nRunning on {did}...')
        dataset = pt.get_dataset(did)
        print(pt.Experiment(
            [bm25_base],
            dataset.get_topics(),
            dataset.get_qrels(),
            [meas],
            round=3,
            names=['BM25']
        ))
        print(pt.Experiment(
            [bm25_d2q_n40, bm25_d2q_n40_electra_p30, bm25_d2q_n40_monot5_p40, bm25_d2q_n40_tct_p50],
            dataset.get_topics(),
            dataset.get_qrels(),
            [meas],
            baseline=0,
            round=3,
            names=['Doc2Query (n=40)', 'w/ ELECTRA Filter (30%)', 'w/ MonoT5 Filter (40%)', 'w/ TCT Filter (50%)']
        ))
        print(pt.Experiment(
            [bm25_d2q_n80, bm25_d2q_n80_electra_p30, bm25_d2q_n80_monot5_p40, bm25_d2q_n80_tct_p50],
            dataset.get_topics(),
            dataset.get_qrels(),
            [meas],
            baseline=0,
            round=3,
            names=['Doc2Query (n=80)', 'w/ ELECTRA Filter (30%)', 'w/ MonoT5 Filter (40%)', 'w/ TCT Filter (50%)']
        ))
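The `evaluate` helper reports RR@10 on the MS MARCO dev sets and nDCG@10 on the TREC DL sets. As a reminder of what RR@10 measures, here is a minimal, library-free sketch. The function names and toy ids are hypothetical and used only for illustration; the notebook itself relies on `pyterrier.measures`, and this is not that implementation.

```python
def reciprocal_rank_at_k(ranked_docnos, relevant_docnos, k=10):
    """Return 1/rank of the first relevant document within the top k, else 0.0."""
    for rank, docno in enumerate(ranked_docnos[:k], start=1):
        if docno in relevant_docnos:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank_at_k(runs, qrels, k=10):
    """Average RR@k over all queries that appear in `runs`."""
    scores = [
        reciprocal_rank_at_k(ranking, qrels.get(qid, set()), k)
        for qid, ranking in runs.items()
    ]
    return sum(scores) / len(scores)


# Toy data (hypothetical query and document ids).
runs = {
    'q1': ['d3', 'd7', 'd1'],  # first relevant document at rank 2
    'q2': ['d9', 'd2', 'd4'],  # first relevant document at rank 1
    'q3': ['d5', 'd6', 'd8'],  # no relevant document retrieved
}
qrels = {'q1': {'d7'}, 'q2': {'d9', 'd4'}, 'q3': {'d0'}}

mrr = mean_reciprocal_rank_at_k(runs, qrels)
print(round(mrr, 3))  # (0.5 + 1.0 + 0.0) / 3 = 0.5
```

Only the first relevant document counts, so RR@10 rewards systems that place a correct passage near the top of the ranking.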

## Last-Metre Reproduction

In Last-Metre Reproduction, only final-stage artefacts are used. Here, we download and use pre-built PISA indexes.

In [4]:
idxs = {}
idxs['base'] = PisaIndex.from_dataset('msmarco_passage', 'pisa_porter2')
idxs['n40'] = PisaIndex.from_dataset('msmarco_passage', 'pisa_d2q_n40', stops='none')
idxs['n40_electra'] = PisaIndex.from_dataset('msmarco_passage', 'pisa_d2qmm_n40_electra_p70', stops='none')
idxs['n40_monot5'] = PisaIndex.from_dataset('msmarco_passage', 'pisa_d2qmm_n40_monot5_p60', stops='none')
idxs['n40_tct'] = PisaIndex.from_dataset('msmarco_passage', 'pisa_d2qmm_n40_tct_p50', stops='none')
idxs['n80'] = PisaIndex.from_dataset('msmarco_passage', 'pisa_d2q_n80', stops='none')
idxs['n80_electra'] = PisaIndex.from_dataset('msmarco_passage', 'pisa_d2qmm_n80_electra_p70', stops='none')
idxs['n80_monot5'] = PisaIndex.from_dataset('msmarco_passage', 'pisa_d2qmm_n80_monot5_p60', stops='none')
idxs['n80_tct'] = PisaIndex.from_dataset('msmarco_passage', 'pisa_d2qmm_n80_tct_p50', stops='none')
evaluate(idxs)



Running on irds:msmarco-passage/dev/small...
   name  RR@10
0  BM25  0.185
                      name  RR@10  RR@10 +  RR@10 -  RR@10 p-value
0         Doc2Query (n=40)  0.277      NaN      NaN            NaN
1  w/ ELECTRA Filter (30%)  0.316   2040.0    917.0   5.134067e-42
2   w/ MonoT5 Filter (40%)  0.308   1794.0    790.0   3.244273e-35
3      w/ TCT Filter (50%)  0.287   1402.0    836.0   2.990016e-06
                      name  RR@10  RR@10 +  RR@10 -  RR@10 p-value
0         Doc2Query (n=80)  0.279      NaN      NaN            NaN
1  w/ ELECTRA Filter (30%)  0.323   2102.0    900.0   8.165379e-48
2   w/ MonoT5 Filter (40%)  0.311   1747.0    790.0   7.635617e-38
3      w/ TCT Filter (50%)  0.293   1434.0    797.0   3.496333e-10


Running on irds:msmarco-passage/dev/2...
   name  RR@10
0  BM25  0.181
                      name  RR@10  RR@10 +  RR@10 -  RR@10 p-value
0         Doc2Query (n=40)  0.263      NaN      NaN            NaN
1  w/ ELECTRA Filter (30%)  0.309   1312.0    

## Last-Mile Reproduction

Last-Mile Reproduction checks whether the same conclusions can be drawn from an intermediate stage of processing (i.e., using some of the released data artefacts). Here, we run filtering, indexing, and retrieval ourselves, while using the pre-computed filtering scores released by the authors.
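To make the filtering step concrete, the following sketch shows one way a percentile-based threshold can be derived from a pool of relevance scores and then applied to each passage's generated queries. It is an illustration under stated assumptions, not the `pyterrier_doc2query` implementation: the helper names and toy data are hypothetical, the percentile uses the simple nearest-rank method, and queries scoring at or above the threshold are assumed to be kept.

```python
def percentile_threshold(scores, p):
    """Nearest-rank p-th percentile of `scores`; `p` is an integer in 1..100."""
    ordered = sorted(scores)
    # Integer ceiling division avoids floating-point surprises such as 0.7 * 10.
    rank = max(1, (p * len(ordered) + 99) // 100)
    return ordered[rank - 1]


def filter_queries(queries, scores, t):
    """Keep only the queries whose score is at least the threshold t."""
    return [q for q, s in zip(queries, scores) if s >= t]


# Hypothetical generated queries and relevance scores for two passages.
corpus = {
    'd1': (['q a', 'q b', 'q c', 'q d', 'q e'], [0.9, 0.1, 0.5, 0.3, 0.7]),
    'd2': (['q f', 'q g', 'q h', 'q i', 'q j'], [0.2, 0.8, 0.4, 0.6, 1.0]),
}

# The threshold is computed once over the scores of the whole corpus.
all_scores = [s for _, scores in corpus.values() for s in scores]
t = percentile_threshold(all_scores, 70)

kept = {docno: filter_queries(qs, ss, t) for docno, (qs, ss) in corpus.items()}
print(t)     # 0.7
print(kept)  # {'d1': ['q a', 'q e'], 'd2': ['q g', 'q j']}
```

Because the threshold is global rather than per passage, some passages may keep many of their generated queries while others keep none.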

In [5]:
# Data
dataset = pt.get_dataset('irds:msmarco-passage')
d2q = Doc2QueryStore.from_repo('https://huggingface.co/datasets/macavaney/d2q-msmarco-passage')
electra = QueryScoreStore.from_repo('https://huggingface.co/datasets/macavaney/d2q-msmarco-passage-scores-electra')
monot5 = QueryScoreStore.from_repo('https://huggingface.co/datasets/macavaney/d2q-msmarco-passage-scores-monot5')
tct = QueryScoreStore.from_repo('https://huggingface.co/datasets/macavaney/d2q-msmarco-passage-scores-tct')

idxs = {}

# Build base index
idxs['base'] = PisaIndex('repro_base.pisa')
if not idxs['base'].built():
    idxs['base'].index(dataset.get_corpus_iter())

for n in [40, 80]:
    # Build the Doc2Query index with n generated queries per passage
    name = f'n{n}'
    idxs[name] = PisaIndex(f'repro_d2q_n{n}.pisa', stops='none')
    if not idxs[name].built():
        pipeline = d2q.generator(n, append=True) >> idxs[name]
        pipeline.index(dataset.get_corpus_iter())
    
    for m, fltr, p in [
        ('electra', electra, 70),
        ('monot5', monot5, 60),
        ('tct', tct, 50),
    ]:
        # Build the Doc2Query-- index with n generated queries per passage and the given filter
        name = f'n{n}_{m}'
        idxs[name] = PisaIndex(f'repro_d2q_n{n}_{m}_p{p}.pisa', stops='none')
        if not idxs[name].built():
            pipeline = fltr.query_scorer(n) >> QueryFilter(t=fltr.percentile(p)) >> idxs[name]
            pipeline.index(dataset.get_corpus_iter())

evaluate(idxs)



Running on irds:msmarco-passage/dev/small...
   name  RR@10
0  BM25  0.185
                      name  RR@10  RR@10 +  RR@10 -  RR@10 p-value
0         Doc2Query (n=40)  0.277      NaN      NaN            NaN
1  w/ ELECTRA Filter (30%)  0.316   2040.0    917.0   5.134067e-42
2   w/ MonoT5 Filter (40%)  0.308   1794.0    790.0   3.244273e-35
3      w/ TCT Filter (50%)  0.287   1402.0    836.0   2.990016e-06
                      name  RR@10  RR@10 +  RR@10 -  RR@10 p-value
0         Doc2Query (n=80)  0.279      NaN      NaN            NaN
1  w/ ELECTRA Filter (30%)  0.323   2102.0    900.0   8.165379e-48
2   w/ MonoT5 Filter (40%)  0.311   1747.0    790.0   7.635617e-38
3      w/ TCT Filter (50%)  0.293   1434.0    797.0   3.496333e-10


Running on irds:msmarco-passage/dev/2...
   name  RR@10
0  BM25  0.181
                      name  RR@10  RR@10 +  RR@10 -  RR@10 p-value
0         Doc2Query (n=40)  0.263      NaN      NaN            NaN
1  w/ ELECTRA Filter (30%)  0.309   1312.0    

## Complete Reproduction

To completely reproduce the results, the query-passage relevance scores need to be re-calculated rather than taken from the pre-computed scores released by the authors. The code below re-creates the `electra`, `monot5`, and `tct` stores used in the previous step. These can then be used with the Last-Mile code above to reproduce the results table.

Note that this step will take a long time, as detailed in Section 5 of the paper.

In [None]:
# electra
from pyterrier_dr import ElectraScorer
from pyterrier_doc2query import Doc2QueryStore, QueryScoreStore, QueryScorer

d2q = Doc2QueryStore.from_repo('https://huggingface.co/datasets/macavaney/d2q-msmarco-passage').generator()
electra = QueryScoreStore('path/to/store')
pipeline = d2q >> QueryScorer(ElectraScorer()) >> electra

dataset = pt.get_dataset('irds:msmarco-passage')
pipeline.index(dataset.get_corpus_iter())

In [None]:
# monot5
from pyterrier_t5 import MonoT5ReRanker
from pyterrier_doc2query import Doc2QueryStore, QueryScoreStore, QueryScorer

d2q = Doc2QueryStore.from_repo('https://huggingface.co/datasets/macavaney/d2q-msmarco-passage').generator()
monot5 = QueryScoreStore('path/to/store')
pipeline = d2q >> QueryScorer(MonoT5ReRanker()) >> monot5

dataset = pt.get_dataset('irds:msmarco-passage')
pipeline.index(dataset.get_corpus_iter())

In [None]:
# tct
from pyterrier_dr import TctColBert
from pyterrier_doc2query import Doc2QueryStore, QueryScoreStore, QueryScorer

d2q = Doc2QueryStore.from_repo('https://huggingface.co/datasets/macavaney/d2q-msmarco-passage').generator()
tct = QueryScoreStore('path/to/store')
pipeline = d2q >> QueryScorer(TctColBert('castorini/tct_colbert-v2-hnp-msmarco')) >> tct

dataset = pt.get_dataset('irds:msmarco-passage')
pipeline.index(dataset.get_corpus_iter())

## Pipeline Usage

If you want to use Doc2Query-- in an arbitrary retrieval pipeline (e.g., with a different dataset), you can use the following code, or explore [the live demonstration on HuggingFace Spaces](https://huggingface.co/spaces/terrierteam/doc2query).

In [4]:
import pandas as pd
from pyterrier_doc2query import Doc2Query, QueryScorer, QueryFilter
from pyterrier_dr import ElectraScorer

doc2query = Doc2Query('macavaney/doc2query-t5-base-msmarco', append=False, num_samples=20)
scorer = ElectraScorer('crystina-z/monoELECTRA_LCE_nneg31')

# inspection
pipeline = doc2query >> QueryScorer(scorer) >> QueryFilter(append=False, t=3.21484375) # 30% electra filter

pipeline(pd.DataFrame([
  {'docno': '0', 'text': 'The presence of communication amid scientific minds was equally important to the success of the Manhattan Project as scientific intellect was. The only cloud hanging over the impressive achievement of the atomic researchers and engineers is what their success truly meant; hundreds of thousands of innocent lives obliterated.'},
  {'docno': '100', 'text': "Antonín Dvorák (1841–1904) Antonin Dvorak was a son of butcher, but he did not follow his father's trade. While assisting his father part-time, he studied music, and graduated from the Prague Organ School in 1859."},
  {'docno': '1000', 'text': 'QuickFacts Matanuska-Susitna Borough, Alaska; UNITED STATES QuickFacts provides statistics for all states and counties, and for cities and towns with a population of 5,000 or more.'},
]))

|   | docno | text | querygen | querygen_score |
|---|-------|------|----------|----------------|
| 0 | 0 | The presence of communication amid scientific ... | what was important in the success of the manha... | [4.5231795, 3.5483692, 3.2334197, 3.7340453] |
| 1 | 100 | Antonín Dvorák (1841–1904) Antonin Dvorak was ... | who is dvorak\nwhen was Antonin dvorak born an... | [3.441968, 3.417485, 4.4049354, 4.4049354, 3.4... |
| 2 | 1000 | QuickFacts Matanuska-Susitna Borough, Alaska; ... |  | [] |
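In the inspection output, `querygen` holds the surviving queries joined by newlines and `querygen_score` holds their scores; a passage whose queries all fall below the threshold ends up with an empty string and an empty list. The sketch below mimics that behaviour on a single row to contrast `append=False` (inspect the surviving queries) with `append=True` (add them to the text that gets indexed). It is an assumption inferred from the output shown here, not the library's code: `apply_query_filter` is a hypothetical helper, and the way the real `QueryFilter` joins queries to the text or handles the `querygen` columns when appending may differ.

```python
def apply_query_filter(row, t, append=False):
    """Keep generated queries scoring >= t; optionally append them to the text."""
    queries = row['querygen'].split('\n') if row['querygen'] else []
    kept = [(q, s) for q, s in zip(queries, row['querygen_score']) if s >= t]
    out = dict(row)
    out['querygen'] = '\n'.join(q for q, _ in kept)
    out['querygen_score'] = [s for _, s in kept]
    if append:
        # Assumed behaviour: surviving queries are added to the passage text.
        out['text'] = ' '.join([row['text']] + [q for q, _ in kept])
    return out


# Hypothetical row in the same shape as the table above.
row = {
    'docno': '0',
    'text': 'passage text',
    'querygen': 'good query\nbad query\nok query',
    'querygen_score': [4.5, 1.2, 3.3],
}
t = 3.21484375  # the threshold used in the pipeline above

inspected = apply_query_filter(row, t, append=False)
indexed = apply_query_filter(row, t, append=True)
print(inspected['querygen_score'])  # [4.5, 3.3]
print(indexed['text'])              # passage text good query ok query
```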


In [None]:
# indexing
index = PisaIndex('./my_index.pisa')
pipeline = doc2query >> QueryScorer(scorer) >> QueryFilter(append=True, t=3.21484375) >> index
corpus = pt.get_dataset('irds:vaswani').get_corpus_iter() # or any corpus you like!
pipeline.index(corpus)