# Anserini / Pyserini Installation
Anserini is a Java toolkit for indexing and search, created for research in information retrieval (IR). Pyserini is a Python wrapper around Anserini.
* [Anserini](https://github.com/castorini/anserini)
* [Pyserini](https://github.com/castorini/pyserini)

## I. Installation
### A. Anserini
1. Install openjdk-11-jdk, i.e., `sudo apt-get install openjdk-11-jdk`
2. Install Maven version 3.*
3. Set the JAVA_HOME environment variable to the folder that contains the JDK
4. Install make
5. Install gcc
6. Clone the Anserini repo with the `--recurse-submodules` option.
7. Change directory to the top-level folder of the Anserini git repo (it contains `pom.xml`, the Maven build file).
8. Run `mvn clean package appassembler:assemble` to build the Maven project.
9. Run the following commands to build the evaluation tools and complete the build:

```
cd tools/eval && tar xvfz trec_eval.9.0.4.tar.gz && cd trec_eval.9.0.4 && make && cd ../../..
cd tools/eval/ndeval && make && cd ../../..
```
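
Before starting the build, it can help to confirm that the prerequisites from the steps above are in place. The following is a minimal sketch, not part of Anserini or Pyserini: the function name `check_prerequisites` is hypothetical, and it only checks that the tools are on the `PATH` and that `JAVA_HOME` points at an existing folder, not that the versions are correct.

```python
import os
import shutil


def check_prerequisites(tools=("java", "mvn", "make", "gcc")):
    """Report whether each build tool is on the PATH and JAVA_HOME is usable."""
    # shutil.which returns None when the executable cannot be found.
    status = {tool: shutil.which(tool) is not None for tool in tools}
    # JAVA_HOME must be set and must point at an existing directory.
    java_home = os.environ.get("JAVA_HOME", "")
    status["JAVA_HOME"] = bool(java_home) and os.path.isdir(java_home)
    return status


for name, ok in check_prerequisites().items():
    print(f"{name:10} {'ok' if ok else 'MISSING'}")
```

Any entry reported as `MISSING` corresponds to one of the installation steps above.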

### B. Pyserini
1. Ensure the VM has Python 3.6 or higher.
2. Run `pip install pyserini==0.12.0`.

## II. Test Installation

In [26]:
# Import
from pyserini.search import SimpleSearcher

Pyserini can download pre-built indices and search on common IR datasets. Running the following cell verifies the system is working.

In [3]:
searcher = SimpleSearcher.from_prebuilt_index('msmarco-passage')
hits = searcher.search('what is a lobster roll?')

for i in range(0, 10):
    print(f'{i+1:2} {hits[i].docid:7} {hits[i].score:.5f}')

index-msmarco-passage-20201117-f87c94.tar.gz: 0.00B [00:00, ?B/s]

Attempting to initialize pre-built index msmarco-passage.
Downloading index at https://git.uwaterloo.ca/jimmylin/anserini-indexes/raw/master/index-msmarco-passage-20201117-f87c94.tar.gz...


index-msmarco-passage-20201117-f87c94.tar.gz: 2.07GB [02:40, 13.8MB/s]                               


Extracting /home/ubuntu/.cache/pyserini/indexes/index-msmarco-passage-20201117-f87c94.tar.gz into /home/ubuntu/.cache/pyserini/indexes/index-msmarco-passage-20201117-f87c94.1efad4f1ae6a77e235042eff4be1612d...
Initializing msmarco-passage...
 1 7157707 11.00830
 2 6034357 10.94310
 3 5837606 10.81740
 4 7157715 10.59820
 5 6034350 10.48360
 6 2900045 10.31190
 7 7157713 10.12300
 8 1584344 10.05290
 9 533614  9.96350
10 6234461 9.92200


## III. Build Our Own Index
We'll build an index from the DeepCT collection of passages. The passages are stored in text files that hold one JSON document per line, each with an `id` and a `contents` field:
```
{"id": "1", "contents": "document 1 contents ..."}
{"id": "2", "contents": "document 2 contents ..."}
...
```
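
To try the indexing step on a small collection of your own, files in this layout can be written with the standard `json` module. This is a minimal sketch: the function name `write_collection` and the default file name are assumptions made for illustration, not part of Pyserini.

```python
import json
import os


def write_collection(docs, out_dir, filename="1.json"):
    """Write (id, contents) pairs as one JSON object per line."""
    os.makedirs(out_dir, exist_ok=True)
    path = os.path.join(out_dir, filename)
    with open(path, "wt", encoding="utf-8") as out:
        for docid, contents in docs:
            # Ids are written as strings, matching the example layout above.
            out.write(json.dumps({"id": str(docid), "contents": contents}) + "\n")
    return path
```

For example, `write_collection([(1, "document 1 contents"), (2, "document 2 contents")], "my_collection")` creates `my_collection/1.json` with two lines.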

This collection is stored at `data/deepCT/org_collection_berttoken` on the elastic file share, in two text files: `1.json` and `2.json`. We'll put the index in the `deepCT_idx` subfolder.

The index is built from the command line with the following command. This step can take a while, so we recommend running it in a shell outside of Jupyter.
```
python -m pyserini.index -collection JsonCollection \
                         -generator DefaultLuceneDocumentGenerator \
                         -threads 1 \
                         -input ~/efs/data/deepCT/org_collection_berttoken \
                         -index deepCT_idx \
                         -storePositions
```
* `-collection`: Specifies which Anserini Java class will be used to read the documents. The available classes are at https://github.com/castorini/anserini/tree/master/src/main/java/io/anserini/collection.
* `-generator`: Specifies which Anserini Java class is used to prepare the documents for indexing. The generator used here is defined at https://github.com/castorini/anserini/blob/master/src/main/java/io/anserini/index/generator/DefaultLuceneDocumentGenerator.java.
* `-input`: Location of document collection
* `-index`: Location at which to store index
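* `-storePositions`: Stores term positions in the index, which phrase queries need. With this flag alone the index does not keep the document text, so searches return only document IDs and scores. Additionally specify `-storeDocvectors` to store document vectors or `-storeRaw` to store the raw documents in the index.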
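
If the indexing step needs to be scripted, the same command can be assembled as an argument list in Python. The sketch below only builds and prints the command using the flags shown above; the helper name `build_index_command` is hypothetical, and launching the build still requires Pyserini to be installed.

```python
import os


def build_index_command(input_dir, index_dir, threads=1):
    """Assemble the pyserini.index command shown above as an argument list."""
    return [
        "python", "-m", "pyserini.index",
        "-collection", "JsonCollection",
        "-generator", "DefaultLuceneDocumentGenerator",
        "-threads", str(threads),
        # Expand "~" here because no shell will do it for an argument list.
        "-input", os.path.expanduser(input_dir),
        "-index", index_dir,
        "-storePositions",
    ]


cmd = build_index_command("~/efs/data/deepCT/org_collection_berttoken", "deepCT_idx")
print(" ".join(cmd))
```

Passing the list to `subprocess.run(cmd, check=True)` would start the build and raise an error if it fails.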

## IV. Search Example

In [2]:
searcher = SimpleSearcher('deepCT_idx')
hits = searcher.search('soccer ball')

for i in range(len(hits)):
    print(f'{i+1:2} {hits[i].docid:4} {hits[i].score:.5f}')

 1 827743 12.05080
 2 5892041 11.58790
 3 5892040 11.58320
 4 7690688 11.40820
 5 3496456 11.35480
 6 1198032 10.93510
 7 5240751 10.87360
 8 1198033 10.81280
 9 5240752 10.81060
10 7690690 10.79250


In [54]:
import os

# Extremely **inefficient** document retrieval function:
# it scans the collection files line by line on every call.
def get_doc(docid, collection="../efs/data/deepCT/org_collection_berttoken"):
    # Every line starts with {"id": "<docid>", so a plain prefix test is enough
    # and avoids building a regex from unescaped text.
    prefix = '{"id": "' + str(docid) + '"'

    for file in os.listdir(collection):
        with open(os.path.join(collection, file), "rt") as cfile:
            for line in cfile:
                if line.startswith(prefix):
                    print(line)
                    return  # stop scanning the remaining files once found

get_doc(40)

{"id": "40", "contents": "medical tours costa rica : medical tourism made easy ! ano other firm has helped more patients . receive care over the last 15 yearsa"}
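
Because `get_doc` rescans the collection on every call, repeated lookups are slow. One possible alternative, sketched below, is to scan the files once, remember the byte offset of each document's line, and then seek directly to it. The names `build_offset_index` and `get_doc_fast` are hypothetical, and the sketch assumes every non-empty line parses as valid JSON with an `id` field.

```python
import json
import os


def build_offset_index(collection):
    """Scan the collection once and map each id to (file path, byte offset)."""
    offsets = {}
    for name in sorted(os.listdir(collection)):
        path = os.path.join(collection, name)
        # Binary mode keeps tell() and seek() positions exact.
        with open(path, "rb") as f:
            while True:
                pos = f.tell()
                line = f.readline()
                if not line:
                    break  # end of file
                if line.strip():
                    offsets[json.loads(line)["id"]] = (path, pos)
    return offsets


def get_doc_fast(docid, offsets):
    """Return the parsed document by seeking straight to its line."""
    path, pos = offsets[str(docid)]
    with open(path, "rb") as f:
        f.seek(pos)
        return json.loads(f.readline())
```

The one-time scan costs about as much as a single worst-case `get_doc` call, after which each lookup reads only one line. The offset table is held in memory, so its size grows with the number of documents.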



In [53]:
for docid in [hit.docid for hit in hits]:
    get_doc(docid)

{"id": "827743", "contents": "this is soccer . comas guide to choosing the right soccer ball for your game . soccer . com carries training soccer balls , match soccer balls , professional match soccer balls , beach soccer balls , street soccer balls , indoor soccer balls , turf balls , futsal soccer balls , mini / skills soccer balls and medicine balls ."}

{"id": "5892041", "contents": "soccer . com . you can play soccer without a jersey , you can play soccer without soccer shorts , and you can even play without cleats . but without the ball , there is no game . the ball is the only essential piece of equipment in the game of soccer . we have a wide selection of soccer balls including fifa approved premium match balls , training balls , futsal balls , mini balls and many more . he ball is the only essential piece of equipment in the game of soccer . we have a wide selection of soccer balls including fifa approved premium match balls , training balls , futsal balls , mini balls and man