# Anserini baseline

Anserini provides Pyserini, a Python package that wraps Anserini (Java). Java 11 must be installed to run this notebook.
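
Since the notebook depends on Java 11, it can help to check the installed version before importing Pyserini. The following is a minimal sketch, not part of Pyserini: the helper names `parse_java_major` and `installed_java_major` are hypothetical, and it assumes the `java -version` banner contains a quoted version string such as `"11.0.8"` or the legacy `"1.8.0_261"`.

```python
import re
import shutil
import subprocess


def parse_java_major(version_output):
    """Extract the major Java version from `java -version` output, or None."""
    match = re.search(r'version "([^"]+)"', version_output)
    if match is None:
        return None
    parts = match.group(1).split(".")
    # Legacy scheme: "1.8.0_261" means Java 8; modern scheme: "11.0.8" means Java 11.
    major = parts[1] if parts[0] == "1" and len(parts) > 1 else parts[0]
    digits = re.match(r"\d+", major)
    return int(digits.group()) if digits else None


def installed_java_major():
    """Return the major version of the `java` found on PATH, or None if absent."""
    java = shutil.which("java")
    if java is None:
        return None
    # `java -version` writes its banner to stderr, not stdout.
    result = subprocess.run([java, "-version"], capture_output=True, text=True)
    return parse_java_major(result.stderr)
```

Calling `installed_java_major()` in a cell should return `11` on a suitable setup; a value of `None` suggests that no `java` executable is on the `PATH`.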


For official examples, see [Anserini-notebooks](https://github.com/castorini/anserini-notebooks).

In [5]:
# download topics if not already done
!wget -P ../tmp https://ir.nist.gov/covidSubmit/data/topics-rnd4.xml
    
# download the pre-built index.
# See https://github.com/castorini/anserini/blob/master/docs/experiments-cord19.md#pre-built-indexes-all-versions
!wget -P ../tmp https://www.dropbox.com/s/jza7sdesjn7iqye/lucene-index-cord19-abstract-2020-07-16.tar.gz
!tar -C ../tmp -zxf ../tmp/lucene-index-cord19-abstract-2020-07-16.tar.gz 


--2020-08-04 22:14:48--  https://ir.nist.gov/covidSubmit/data/topics-rnd4.xml
Resolving ir.nist.gov (ir.nist.gov)... 129.6.24.92, 2610:20:6005:24::92
Connecting to ir.nist.gov (ir.nist.gov)|129.6.24.92|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 16327 (16K) [application/xml]
Saving to: ‘../tmp/topics-rnd4.xml’


2020-08-04 22:14:49 (136 KB/s) - ‘../tmp/topics-rnd4.xml’ saved [16327/16327]



In [2]:
from xml.etree import ElementTree
from pyserini.search import SimpleSearcher  # requires Java 11

def read_topics(path_to_topic):
    tree = ElementTree.parse(path_to_topic)
    topics = list()
    for topic in tree.getroot():
        d = dict()
        d["number"] = topic.attrib["number"]
        for field in topic:
            d[field.tag] = field.text
        topics.append(d)

    return topics
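
To see what `read_topics` returns without downloading anything, the same parsing logic can be run on a small in-memory sample. This is only a sketch: the sample XML is made up, its element names (`query`, `question`) are assumed to mirror the layout of `topics-rnd4.xml`, and `read_topics_from_file` is a hypothetical variant that takes a file-like object instead of a path.

```python
import io
from xml.etree import ElementTree

# Hypothetical two-topic sample; element names are assumed to match topics-rnd4.xml.
SAMPLE_XML = """<topics task="sample">
  <topic number="1">
    <query>example query one</query>
    <question>What is example question one?</question>
  </topic>
  <topic number="2">
    <query>example query two</query>
    <question>What is example question two?</question>
  </topic>
</topics>"""


def read_topics_from_file(file_obj):
    # Same logic as read_topics, but reads from any file-like object.
    tree = ElementTree.parse(file_obj)
    topics = []
    for topic in tree.getroot():
        # The topic number comes from the XML attribute and stays a string.
        d = {"number": topic.attrib["number"]}
        # Every child element becomes a key, e.g. d["query"].
        for field in topic:
            d[field.tag] = field.text
        topics.append(d)
    return topics


sample_topics = read_topics_from_file(io.StringIO(SAMPLE_XML))
print(sample_topics[0])
```

Each topic becomes a flat dictionary keyed by tag name, so the query text is available as `topic["query"]`. Note that `number` is kept as a string, not an integer.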

In [17]:
topics = read_topics("../tmp/topics-rnd4.xml")

In [21]:
# smoke test: search with the first topic's query and print the top document ID
searcher = SimpleSearcher("../tmp/lucene-index-cord19-abstract-2020-07-16")
query = topics[0]["query"]
hits = searcher.search(query)
print(hits[0].docid)

8ccl9aui


# Run Anserini on the abstract index

In [22]:
lines = list()
# TREC run format: topic_id Q0 docid rank score run_tag
template = "{} Q0 {} {} {} anserini_baseline_abstract"

for topic in topics:
    seen = set()
    # request more than 1000 hits so that up to 1000 unique documents remain after de-duplication
    hits = searcher.search(topic["query"], 1200)
    for hit in hits:
        if hit.docid in seen:
            continue
        seen.add(hit.docid)
        # rank by position among unique documents so that ranks stay contiguous
        lines.append(template.format(topic["number"], hit.docid, len(seen), hit.score))
        if len(seen) == 1000:
            break

    

In [25]:
# inspect the first line of the run
print(lines[0])

1 Q0 8ccl9aui 1 4.121799945831299 anserini_baseline_abstract
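
The line above has six whitespace-separated columns: topic number, the constant `Q0`, document ID, rank, score, and run tag. As a rough illustration, a line can be split back into typed fields as follows; `RunEntry` and `parse_run_line` are hypothetical names introduced here, not part of Pyserini.

```python
from typing import NamedTuple


class RunEntry(NamedTuple):
    topic_id: str
    iteration: str  # the constant "Q0" placeholder column
    docid: str
    rank: int
    score: float
    run_tag: str


def parse_run_line(line):
    """Split one run line into its six fields and convert rank and score."""
    fields = line.split()
    if len(fields) != 6:
        raise ValueError(f"expected 6 fields, found {len(fields)}: {line!r}")
    topic_id, iteration, docid, rank, score, run_tag = fields
    return RunEntry(topic_id, iteration, docid, int(rank), float(score), run_tag)


entry = parse_run_line("1 Q0 8ccl9aui 1 4.121799945831299 anserini_baseline_abstract")
print(entry.docid, entry.rank, entry.score)
```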


In [26]:
with open("../tmp/baseline.txt", 'w') as f:
    f.write("\n".join(lines))
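
Before submitting or evaluating the run, a quick sanity check can catch malformed lines, duplicate documents within a topic, or topics with too many results. The sketch below is one possible check under stated assumptions: `check_run` is a hypothetical helper, and its default `max_per_topic=1000` simply mirrors the cutoff used in the loop above.

```python
from collections import defaultdict


def check_run(lines, max_per_topic=1000):
    """Return a list of human-readable problems found in the run lines."""
    problems = []
    docs_by_topic = defaultdict(set)
    for line_no, line in enumerate(lines, start=1):
        fields = line.split()
        if len(fields) != 6:
            problems.append(f"line {line_no}: expected 6 fields, found {len(fields)}")
            continue
        topic_id, docid = fields[0], fields[2]
        # A document may appear at most once per topic.
        if docid in docs_by_topic[topic_id]:
            problems.append(f"line {line_no}: duplicate docid {docid} for topic {topic_id}")
        docs_by_topic[topic_id].add(docid)
    for topic_id, docs in docs_by_topic.items():
        if len(docs) > max_per_topic:
            problems.append(f"topic {topic_id}: {len(docs)} documents exceed {max_per_topic}")
    return problems


# Tiny in-memory example with one duplicated document.
example = [
    "1 Q0 docA 1 2.0 demo_tag",
    "1 Q0 docB 2 1.5 demo_tag",
    "1 Q0 docA 3 1.0 demo_tag",
]
print(check_run(example))
```

To check the file written above, the same function can be applied to `open("../tmp/baseline.txt").read().splitlines()`; an empty list means no problem was detected by these particular checks.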