# Anserini baseline

Anserini provides Pyserini, a Python package that wraps Anserini (Java). Java 11 must be installed to run this notebook.


For official examples, see [Anserini-notebooks](https://github.com/castorini/anserini-notebooks).

In [5]:
# download topics if not already done
!wget -P ../tmp https://ir.nist.gov/covidSubmit/data/topics-rnd4.xml
    
# download pre-built index
# See https://github.com/castorini/anserini/blob/master/docs/experiments-cord19.md#pre-built-indexes-all-versions
!wget -P ../tmp https://www.dropbox.com/s/jza7sdesjn7iqye/lucene-index-cord19-abstract-2020-07-16.tar.gz
!tar -C ../tmp -zxf ../tmp/lucene-index-cord19-abstract-2020-07-16.tar.gz 

# download human relevance judgments (qrels)
!wget -P ../tmp https://ir.nist.gov/covidSubmit/data/qrels-covid_d4_j0.5-4.txt

--2020-08-04 22:14:48--  https://ir.nist.gov/covidSubmit/data/topics-rnd4.xml
Resolving ir.nist.gov (ir.nist.gov)... 129.6.24.92, 2610:20:6005:24::92
Connecting to ir.nist.gov (ir.nist.gov)|129.6.24.92|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 16327 (16K) [application/xml]
Saving to: ‘../tmp/topics-rnd4.xml’


2020-08-04 22:14:49 (136 KB/s) - ‘../tmp/topics-rnd4.xml’ saved [16327/16327]



In [8]:
from xml.etree import ElementTree
from pyserini.search import SimpleSearcher  # requires Java 11
import pytrec_eval
import os

def read_topics(path_to_topic):
    tree = ElementTree.parse(path_to_topic)
    topics = list()
    for topic in tree.getroot():
        d = dict()
        d["number"] = topic.attrib["number"]
        for field in topic:
            d[field.tag] = field.text
        topics.append(d)

    return topics
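
To see what `read_topics` returns without downloading anything, the same parsing logic can be run on an in-memory string. This is a minimal sketch: `SAMPLE_XML` and `parse_topics` are hypothetical names introduced here, and the sample topics are made up to mirror the `number` attribute and child fields that `read_topics` collects.

```python
from xml.etree import ElementTree

# Hypothetical two-topic sample (assumption: real topic files follow
# the same <topic number="..."> layout with one child element per field).
SAMPLE_XML = """
<topics>
  <topic number="1">
    <query>coronavirus origin</query>
    <question>what is the origin of COVID-19</question>
  </topic>
  <topic number="2">
    <query>coronavirus response to weather changes</query>
    <question>how does the coronavirus respond to changes in the weather</question>
  </topic>
</topics>
"""


def parse_topics(xml_text):
    """Same logic as read_topics, but reading from a string instead of a file."""
    root = ElementTree.fromstring(xml_text)
    topics = []
    for topic in root:
        # The topic number is an attribute; every child element becomes a key.
        entry = {"number": topic.attrib["number"]}
        for field in topic:
            entry[field.tag] = field.text
        topics.append(entry)
    return topics


sample_topics = parse_topics(SAMPLE_XML)
print(sample_topics[0]["number"], sample_topics[0]["query"])
```

Each topic becomes a flat dictionary, which is why later cells can look up the query text with `topics[0]["query"]`.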

In [2]:
topics = read_topics("../tmp/topics-rnd4.xml")

In [3]:
# test
searcher = SimpleSearcher("../tmp/lucene-index-cord19-abstract-2020-07-16")
query = topics[0]["query"]
hits = searcher.search(query)
print(hits[0].docid)

8ccl9aui


# Run Anserini on abstracts

In [4]:
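lines = list()
# TREC run format: topic_id Q0 docid rank score run_tag
template = "{} Q0 {} {} {} anserini_baseline_abstract"
queries = [e["query"] for e in topics]

for i, query in enumerate(queries):
    seen = set()
    # Request more than 1000 hits so that up to 1000 unique docids remain after de-duplication
    hits = searcher.search(query, 1200)
    for hit in hits:
        if hit.docid in seen:
            continue
        seen.add(hit.docid)
        # Rank counts unique documents, so it stays contiguous when duplicates are skipped
        lines.append(template.format(i + 1, hit.docid, len(seen), hit.score))
        if len(seen) == 1000:
            break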
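
The de-duplication and formatting step above can be tried without an index. The sketch below assumes a hypothetical `Hit` tuple standing in for search results (only `docid` and `score` are used) and a hypothetical helper `format_run`. It numbers ranks by counting unique documents, which is a choice made for this sketch.

```python
from collections import namedtuple

# Hypothetical stand-in for a search hit; only docid and score are needed.
Hit = namedtuple("Hit", ["docid", "score"])


def format_run(topic_id, hits, run_tag, max_docs=1000):
    """Return TREC run lines for one topic, skipping duplicate docids."""
    lines = []
    seen = set()
    for hit in hits:
        if hit.docid in seen:
            continue  # keep only the first (highest-scoring) occurrence
        seen.add(hit.docid)
        rank = len(seen)  # contiguous rank over unique documents
        lines.append(f"{topic_id} Q0 {hit.docid} {rank} {hit.score} {run_tag}")
        if len(seen) == max_docs:
            break
    return lines


fake_hits = [Hit("doc_a", 4.5), Hit("doc_b", 3.0), Hit("doc_a", 2.5), Hit("doc_c", 1.0)]
demo_lines = format_run(1, fake_hits, "demo_run", max_docs=3)
print("\n".join(demo_lines))
```

Each line has six whitespace-separated fields: topic id, the literal `Q0`, document id, rank, score, and run tag.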

    

In [5]:
# see example
print(lines[0])

1 Q0 8ccl9aui 1 4.121799945831299 anserini_baseline_abstract


In [6]:
with open("../tmp/baseline.txt", 'w') as f:
    f.write("\n".join(lines))

# Evaluation

Prerequisite: [trec_eval](https://github.com/usnistgov/trec_eval) must be installed. Alternatively, use the Python interface, [pytrec_eval](https://github.com/cvangysel/pytrec_eval).

To install the original trec_eval:
1. Go to the parent directory of this repo
2. git clone https://github.com/usnistgov/trec_eval.git
3. cd trec_eval && make
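
For orientation, two of the measures that trec_eval reports, precision at k (`P_k`) and average precision (averaged over topics as `map`), can be sketched in plain Python. This is an illustrative toy on made-up document ids, not a replacement for trec_eval, which applies its own handling of relevance grades and ties. The function names are hypothetical.

```python
def precision_at_k(ranked_docids, relevant, k):
    """Fraction of the top-k ranked documents that are relevant."""
    top_k = ranked_docids[:k]
    return sum(1 for docid in top_k if docid in relevant) / k


def average_precision(ranked_docids, relevant):
    """Sum of precision at each relevant hit, divided by the number of relevant documents."""
    if not relevant:
        return 0.0
    found = 0
    total = 0.0
    for rank, docid in enumerate(ranked_docids, start=1):
        if docid in relevant:
            found += 1
            total += found / rank
    # Relevant documents that were never retrieved contribute zero.
    return total / len(relevant)


# Toy data: five retrieved documents, three relevant ones (one not retrieved).
ranking = ["d1", "d2", "d3", "d4", "d5"]
relevant_docs = {"d1", "d3", "d6"}

p_at_5 = precision_at_k(ranking, relevant_docs, 5)
ap = average_precision(ranking, relevant_docs)
print(p_at_5, round(ap, 4))
```

Here two of the five retrieved documents are relevant, and the unretrieved relevant document `d6` lowers the average precision because the sum is divided by all three relevant documents.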


In [31]:
# No change needed if you followed the instructions above
PATH_TO_TREC = "../../trec_eval/trec_eval"


In [32]:

os.system(PATH_TO_TREC + " -c -m all_trec ../tmp/qrels-covid_d4_j0.5-4.txt  ../tmp/baseline.txt > ../tmp/out.txt")

with open("../tmp/out.txt", "r") as f:
    print(f.read())

runid                 	all	anserini_baseline_abstract
num_q                 	all	45
num_ret               	all	45000
num_rel               	all	15765
num_rel_ret           	all	5550
map                   	all	0.1396
gm_map                	all	0.0733
Rprec                 	all	0.2371
bpref                 	all	0.3419
recip_rank            	all	0.6778
iprec_at_recall_0.00  	all	0.7627
iprec_at_recall_0.10  	all	0.3574
iprec_at_recall_0.20  	all	0.2880
iprec_at_recall_0.30  	all	0.2153
iprec_at_recall_0.40  	all	0.1358
iprec_at_recall_0.50  	all	0.0946
iprec_at_recall_0.60  	all	0.0555
iprec_at_recall_0.70  	all	0.0241
iprec_at_recall_0.80  	all	0.0067
iprec_at_recall_0.90  	all	0.0066
iprec_at_recall_1.00  	all	0.0019
P_5                   	all	0.4800
P_10                  	all	0.4422
P_15                  	all	0.4237
P_20                  	all	0.4167
P_30                  	all	0.4015
P_100                 	all	0.3256
P_200                 	all	0.2709
P_500                 	all	0.1843
P_