## TREC Robust 04

This notebook demonstrates baseline experiments on the TREC Robust04 test collection. More information is provided [in the PyTerrier documentation](https://pyterrier.readthedocs.io/en/latest/experiments/Robust04.html).


Install PyTerrier. This installs the latest release of the `python-terrier` package from PyPI.

In [1]:
!pip install python-terrier


Start PyTerrier. By using `version='snapshot'`, we use the latest version of Terrier from its own [GitHub repository](http://github.com/terrier-org/terrier-core/). The `terrier-prf` boot package provides the RM3 query expansion used later in this notebook.

In [2]:
import pyterrier as pt
if not pt.started():
    pt.init(mem=8000, version='snapshot', tqdm='notebook', 
            boot_packages=["com.github.terrierteam:terrier-prf:-SNAPSHOT"]
           )

Downloading terrier-assemblies 5.x-SNAPSHOT  jar-with-dependencies to /users/craigm/.pyterrier...
Done


Update this configuration to specify:
 - where your copy of the TREC Disks 4 and 5 corpus is located
 - where you wish to store your index.

In [3]:
DISK45_PATH="/local/collections/TRECdisk45/"
INDEX_DIR="/local/indices/disk45"

## Indexing

This indexes the corpus; it took around 8 minutes using a single thread.

In [8]:
import os
if os.path.exists(os.path.join(INDEX_DIR, "data.properties")):
    indexref = pt.IndexRef.of(os.path.join(INDEX_DIR, "data.properties"))
else:    
    files = pt.io.find_files(DISK45_PATH)
    # The Congressional Record (directory /CR/) is conventionally excluded from Robust04 indices;
    # indeed, recent copies from NIST don't contain it.
    # We also remove some other files that are not needed (auxiliary and README files).
    bad = ['/CR/', '/AUX/', 'READCHG', 'READMEFB', 'READFRCG', 'READMEFR', 'READMEFT', 'READMELA']
    for b in bad:
        files = list(filter(lambda f: b not in f, files))
    indexer = pt.TRECCollectionIndexer(INDEX_DIR, verbose=True)
    indexref = indexer.index(files)
    # processing the files took 7 minutes; the total indexing process took 7m40

index = pt.IndexFactory.of(indexref)
print(index.getCollectionStatistics().toString())

11:11:28.315 [main] WARN  o.t.i.MultiDocumentFileCollection - trec.encoding is not set; resorting to platform default (ISO-8859-1). Indexing may be platform dependent. Recommend trec.encoding=UTF-8




Number of documents: 528155
Number of terms: 738439
Number of fields: 0
Number of tokens: 156321446
Field names: []
Positions:   false
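
The file filtering in the indexing cell relies on plain substring matching against each path. The following is a minimal sketch of that step in isolation, so it can be checked without a copy of the corpus. The sample paths and the helper name `keep_file` are hypothetical and only illustrate the matching; the real directory layout of Disks 4 and 5 may differ.

```python
# Substrings that mark files to skip, copied from the indexing cell.
EXCLUDED = ['/CR/', '/AUX/', 'READCHG', 'READMEFB', 'READFRCG',
            'READMEFR', 'READMEFT', 'READMELA']


def keep_file(path, excluded=EXCLUDED):
    """Return True if none of the excluded substrings occurs in the path."""
    return not any(marker in path for marker in excluded)


# Hypothetical paths, for illustration only.
sample = [
    "/local/collections/TRECdisk45/FT/FT911/FT911_1",
    "/local/collections/TRECdisk45/CR/CR93E1",
    "/local/collections/TRECdisk45/FBIS/READMEFB",
    "/local/collections/TRECdisk45/LATIMES/LA010189",
]

kept = [p for p in sample if keep_file(p)]
for p in kept:
    print(p)
```

A single pass with `any(...)` gives the same result as the repeated `filter` calls in the cell; which form to use is a matter of taste.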



## Retrieval - Simple Weighting Models

In [9]:

BM25 = pt.BatchRetrieve(index, wmodel="BM25")
DPH  = pt.BatchRetrieve(index, wmodel="DPH")
PL2  = pt.BatchRetrieve(index, wmodel="PL2")
DLM  = pt.BatchRetrieve(index, wmodel="DirichletLM")


In [11]:
pt.Experiment(
    [BM25, DPH, PL2, DLM],
    pt.get_dataset("trec-robust-2004").get_topics(),
    pt.get_dataset("trec-robust-2004").get_qrels(),
    eval_metrics=["map", "P_10", "P_20", "ndcg_cut_20"],
    names=["BM25", "DPH", "PL2", "Dirichlet QL"]
)


13:24:04.061 [main] WARN  o.t.a.batchquerying.TRECQuery - trec.encoding is not set; resorting to platform default (ISO-8859-1). Retrieval may be platform dependent. Recommend trec.encoding=UTF-8


| | name | map | P_10 | P_20 | ndcg_cut_20 |
|---|---|---|---|---|---|
| 0 | BM25 | 0.241763 | 0.426104 | 0.349398 | 0.408061 |
| 1 | DPH | 0.251307 | 0.44739 | 0.361446 | 0.422524 |
| 2 | PL2 | 0.229386 | 0.420884 | 0.343775 | 0.402179 |
| 3 | Dirichlet QL | 0.236826 | 0.407631 | 0.337952 | 0.39687 |
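
To make the `map` and `P_10` columns above easier to read, here is a minimal sketch of how precision at k and average precision are computed for a single query. It assumes binary relevance and a toy ranking. The document identifiers and function names are hypothetical, and this is not the evaluation code that `pt.Experiment` runs, which may differ in details such as tie-breaking and the handling of graded judgments.

```python
def precision_at_k(ranked_docnos, relevant, k):
    """Fraction of the top-k ranked documents that are relevant."""
    top_k = ranked_docnos[:k]
    return sum(1 for docno in top_k if docno in relevant) / k


def average_precision(ranked_docnos, relevant):
    """Mean of the precision values at each rank where a relevant document
    appears, divided by the total number of relevant documents."""
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, docno in enumerate(ranked_docnos, start=1):
        if docno in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


# Toy example: five retrieved documents, three relevant ones (one never retrieved).
ranking = ["d3", "d7", "d1", "d9", "d4"]
relevant = {"d3", "d1", "d8"}

p5 = precision_at_k(ranking, relevant, 5)
ap = average_precision(ranking, relevant)
print(f"P@5 = {p5:.4f}, AP = {ap:.4f}")
```

MAP is the mean of the per-query average precision over all topics, so a relevant document that is never retrieved (here `d8`) lowers the score.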


## Retrieval - Query Expansion

In [12]:
Bo1 = pt.rewrite.Bo1QueryExpansion(index)
KL = pt.rewrite.KLQueryExpansion(index)
RM3 = pt.rewrite.RM3(index)


In [13]:
pt.Experiment(
    [
        BM25,
        BM25 >> Bo1 >> BM25,
        BM25 >> KL >> BM25,
        BM25 >> RM3 >> BM25,
    ],
    pt.get_dataset("trec-robust-2004").get_topics(),
    pt.get_dataset("trec-robust-2004").get_qrels(),
    eval_metrics=["map", "P_10", "P_20", "ndcg_cut_20"],
    names=["BM25", "+Bo1", "+KL", "+RM3"]
)


13:24:51.441 [main] WARN  o.t.a.batchquerying.TRECQuery - trec.encoding is not set; resorting to platform default (ISO-8859-1). Retrieval may be platform dependent. Recommend trec.encoding=UTF-8


| | name | map | P_10 | P_20 | ndcg_cut_20 |
|---|---|---|---|---|---|
| 0 | BM25 | 0.241763 | 0.426104 | 0.349398 | 0.408061 |
| 1 | +Bo1 | 0.279458 | 0.448996 | 0.378916 | 0.436533 |
| 2 | +KL | 0.279401 | 0.444177 | 0.378313 | 0.435196 |
| 3 | +RM3 | 0.276544 | 0.453815 | 0.379518 | 0.430367 |


In [14]:
pt.Experiment(
    [
        DPH,
        DPH >> Bo1 >> DPH,
        DPH >> KL >> DPH,
        DPH >> RM3 >> DPH,
    ],
    pt.get_dataset("trec-robust-2004").get_topics(),
    pt.get_dataset("trec-robust-2004").get_qrels(),
    eval_metrics=["map", "P_10", "P_20", "ndcg_cut_20"],
    names=["DPH", "+Bo1", "+KL", "+RM3"]
)

13:26:53.533 [main] WARN  o.t.a.batchquerying.TRECQuery - trec.encoding is not set; resorting to platform default (ISO-8859-1). Retrieval may be platform dependent. Recommend trec.encoding=UTF-8


| | name | map | P_10 | P_20 | ndcg_cut_20 |
|---|---|---|---|---|---|
| 0 | DPH | 0.251307 | 0.44739 | 0.361446 | 0.422524 |
| 1 | +Bo1 | 0.285334 | 0.458635 | 0.387952 | 0.444528 |
| 2 | +KL | 0.28572 | 0.458635 | 0.386948 | 0.442636 |
| 3 | +RM3 | 0.281796 | 0.461044 | 0.38996 | 0.441863 |
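
The effect of query expansion is easier to compare across the two weighting models as a relative change in MAP. The sketch below recomputes that from the MAP values reported in the two tables above. The dictionary layout and the helper name `relative_gain` are illustrative choices, not part of PyTerrier.

```python
# MAP values copied from the BM25 and DPH query expansion tables.
MAP_SCORES = {
    "BM25": {"baseline": 0.241763, "+Bo1": 0.279458, "+KL": 0.279401, "+RM3": 0.276544},
    "DPH": {"baseline": 0.251307, "+Bo1": 0.285334, "+KL": 0.28572, "+RM3": 0.281796},
}


def relative_gain(baseline, score):
    """Percentage change of score relative to baseline."""
    return 100.0 * (score - baseline) / baseline


# For each weighting model, compute the gain of every expansion method.
gains = {
    model: {
        name: relative_gain(scores["baseline"], value)
        for name, value in scores.items()
        if name != "baseline"
    }
    for model, scores in MAP_SCORES.items()
}

for model, per_method in gains.items():
    for name, gain in per_method.items():
        print(f"{model} {name}: {gain:+.2f}% MAP")
```

Relative gains alone say nothing about statistical significance. `pt.Experiment` accepts a `baseline` argument that adds paired significance tests against the chosen system, which is worth using before drawing conclusions from differences this small between Bo1, KL and RM3.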
