# Pyserini Index Searches
This notebook provides examples of how to search a Lucene index using Pyserini. The paths in this notebook are specific to our virtual machines; anyone else running this demo will need to adjust them to match their own system configuration.

Anserini and Pyserini must be installed to use this demo. See the main README file for guidance, or refer to the [anserini](https://github.com/castorini/anserini) and [pyserini](https://github.com/castorini/pyserini) GitHub repositories.

In [1]:
# Imports
import collections
import json
import os
import os.path
import sys

from pyserini.search import SimpleSearcher
from pyserini.index import IndexReader

dirname = os.getcwd()
repo_root = os.path.abspath(os.path.join(dirname, '../..'))
sys.path.insert(0, repo_root)
import util.indexer

## Access to MSMARCO Document Corpus
### Initialize Access
Run the following cell to set up access to the entire MSMARCO document corpus.
Once this is done, any document can quickly be retrieved by its MSMARCO document
ID number.

In [2]:
# Path to the MSMARCO documents dataset file. The file is too large for the GitHub repo; download it from https://microsoft.github.io/msmarco/
corpus = '/home/ubuntu/efs/data/msmarco_docs/msmarco-docs.tsv'
docs = util.indexer.IndexedFile(corpus, 'id')

Reading index from  /home/ubuntu/efs/data/msmarco_docs/msmarco-docs.tsv.pickled-index
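
`util.indexer.IndexedFile` is this repository's own helper, and its source is not shown in this notebook. The sketch below illustrates one way such ID-based lookup can work, assuming a tab-separated file whose first field is the document ID: record the byte offset of each line once, then `seek` to that offset on demand. The class name `OffsetIndexedTsv` and the demo records are hypothetical; this is not the project's implementation.

```python
import os
import tempfile


class OffsetIndexedTsv:
    """Hypothetical sketch: random access to TSV lines by their first field."""

    def __init__(self, path):
        self.path = path
        self.offsets = {}
        # One pass over the file to record where each record starts.
        with open(path, 'rb') as f:
            while True:
                pos = f.tell()
                line = f.readline()
                if not line:
                    break
                key = line.split(b'\t', 1)[0].decode('utf-8')
                self.offsets[key] = pos

    def __getitem__(self, key):
        # Jump straight to the record instead of scanning the whole file.
        with open(self.path, 'rb') as f:
            f.seek(self.offsets[key])
            return f.readline().decode('utf-8').rstrip('\r\n')


# Tiny demonstration file with made-up records.
with tempfile.NamedTemporaryFile('w', suffix='.tsv', delete=False,
                                 encoding='utf-8') as tmp:
    tmp.write('D1\thttp://example.com/a\tTitle A\tBody A\n')
    tmp.write('D2\thttp://example.com/b\tTitle B\tBody B\n')

demo_docs = OffsetIndexedTsv(tmp.name)
record = demo_docs['D2']
os.remove(tmp.name)
print(record)
```

Building the offset table costs one full scan of the file, which is presumably why the real helper caches its index on disk (the log line above mentions a `.pickled-index` file).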


### Test Document Access

In [3]:
# Test Access to MSMARCO documents
docs['D1573673'][:1000]

'D1573673\thttp://www.iroczone.com/about/iroc-z-history/\tIntroduction\tIntroduction Where do we begin? 1973 sounds like a good round figure … I suppose … how about October 27, 1973. Where? Riverside International Raceway, Riverside, CA of course! This is the date and location of the first, ever, IROC race. The International Race of Champions is a race comprised of top notch drivers from IRL, NASCAR Winston Cup, NASCAR Busch, World of Outlaws … all sorts of racing … to compete in identical cars and try their driving skills at winning the IROC Championship of the season. What did they drive? Oddly enough, for the 1974 season (IROC I), the car of choice was the Porsche Carerra RSR. Due to costs of running the Porsche, IROC went to the Chevrolet Camaro for the 1975 season (IROC II). IROC stuck with the Camaro until 1990, when the Camaro was replaced by the Dodge Daytona. Okay, so where does the Camaro IROC-Z tie into this? 1980 was the last season for IROC until it was “reborn” in 1984. F

## Pyserini Queries
### Initialize Searcher
Pyserini's `SimpleSearcher()` object can be used to submit queries and retrieve documents from the index.

Set the `path_to_idx` variable to the absolute path of the desired Lucene index. The TRESPI team's Lucene index can be downloaded from AWS S3: `s3://trespi.nir.ucb.2021/trespi_lucene_idx8.tar.gz`. Expand the tar file, then pass the path to the expanded folder to `SimpleSearcher()`.

In [4]:
path_to_idx = '/home/ubuntu/efs/msmarco_leaderboard_attempt1/lucene_idx_07172021'
searcher = SimpleSearcher(path_to_idx)

### Run a Search

In [5]:
hits = searcher.search('Chevrolet camero commemoritive nascar race iroc 1984')

### View Results
**NOTE:** Our Lucene index is not built from the original document text, so Pyserini functions can retrieve only the document ID, not the original text. We use the `util.indexer.IndexedFile` object created earlier to view the original document text, URL, and title.

In [6]:
for hit in hits:
    print(hit.docid, hit.score)
    print(docs[hit.docid][:100])
    print('-----------------')

D1422896 33.47589874267578
D1422896	http://www.iroc-z.com/birthofirocz/birthofirocz.htm	.	How the Chevrolet Camaro Became an IR
-----------------
D1694606 31.94070053100586
D1694606	http://www.chevyhardcore.com/tech-stories/engine/camaro-engines-through-the-years-third-gen
-----------------
D2468666 26.682100296020508
D2468666	https://www.yahoo.com/news/BarrettJackson-Celebrates-iw-1259392312.html	Barrett-Jackson Cel
-----------------
D1422894 24.212400436401367
D1422894	http://www.superchevy.com/features/1407-the-history-of-iroc-racing-never-before-seen-photos
-----------------
D1422891 23.698999404907227
D1422891	http://www.gm-efi.com/news/the-history-of-iroc-international-race-of-champions/	The History
-----------------
D1573672 21.480600357055664
D1573672	https://en.wikipedia.org/wiki/Chevrolet_Camaro_(fifth_generation)	Chevrolet Camaro (fifth g
-----------------
D2547951 21.041099548339844
D2547951	http://www.motortrend.com/cars/chevrolet/camaro/2017/	2017Chevrolet Camaro	Buyer’s 
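
Each record returned by `docs[...]` is a single tab-separated string. Judging from the output above, the fields appear to be document ID, URL, title, and body, in that order. A small sketch that splits a record into named fields under that assumption; the names `MsmarcoDoc` and `parse_msmarco_record` are hypothetical and not part of this repository:

```python
from collections import namedtuple

MsmarcoDoc = namedtuple('MsmarcoDoc', ['docid', 'url', 'title', 'body'])


def parse_msmarco_record(record):
    """Split one tab-separated record into (docid, url, title, body)."""
    # maxsplit=3 keeps any tabs inside the body text intact.
    parts = record.rstrip('\r\n').split('\t', 3)
    # Pad short records so missing trailing fields become empty strings.
    parts += [''] * (4 - len(parts))
    return MsmarcoDoc(*parts)


# Record copied from the output above, where it was cut off at 100 characters.
sample = ('D1422891\thttp://www.gm-efi.com/news/'
          'the-history-of-iroc-international-race-of-champions/\tThe History')
doc = parse_msmarco_record(sample)
print(doc.docid, doc.url, doc.title, sep=' | ')
```

In the notebook, the same helper could be applied as `parse_msmarco_record(docs[hit.docid])` to print titles and URLs instead of raw slices.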

### Initialize Index Reader
The `SimpleSearcher` object is great for getting query results, but the `IndexReader()` object is needed to directly inspect the contents of the Lucene index.

In [7]:
index_reader = IndexReader(path_to_idx)

### Get Index Data

In [8]:
# Displays, for each term, the number of documents that contain it and the number of times it occurs in the collection.
for trm in ['iroc', 'camaro', 'race', 'speed']:
    print(index_reader.get_term_counts(trm))

(18, 691)
(929, 25306)
(13214, 151691)
(35709, 696967)
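
Each tuple is (document frequency, collection frequency). One quick figure to derive from it is the average number of occurrences per document that contains the term. The sketch below hard-codes the counts printed above so that it runs without an index; the variable names are illustrative only.

```python
# Counts copied from the output above:
# term -> (document frequency, collection frequency).
term_counts = {
    'iroc': (18, 691),
    'camaro': (929, 25306),
    'race': (13214, 151691),
    'speed': (35709, 696967),
}

# Average occurrences per document that contains the term.
avg_per_doc = {term: cf / df for term, (df, cf) in term_counts.items()}

for term, (df, cf) in term_counts.items():
    print(f'{term}: df={df}, cf={cf}, cf/df={avg_per_doc[term]:.1f}')
```

A term such as `iroc` that appears in few documents but many times in each of them has a high ratio, which hints that it is concentrated in a handful of topical documents.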


In [9]:
# Displays the document vector (analyzed term -> term frequency) for a specific document.
doc_vector = index_reader.get_document_vector("D1573673")
doc_vector

{'standard': 3,
 'horsepow': 7,
 'held': 1,
 'year': 12,
 'fuel': 1,
 'hp': 1,
 'hq': 1,
 'l98': 2,
 'iroc': 139,
 'good': 2,
 'when': 3,
 '350': 2,
 'put': 1,
 'factori': 2,
 'jeep': 2,
 'model': 1,
 'us': 1,
 'new': 2,
 'invert': 2,
 'made': 19,
 '1990': 1,
 'come': 58,
 'market': 1,
 'most': 1,
 '1986': 6,
 'brake': 2,
 'introduc': 11,
 'irock': 2,
 'system': 3,
 'driver': 3,
 'light': 4,
 'were': 3,
 'plant': 1,
 'name': 4,
 'irac': 8,
 'drive': 11,
 'engin': 16,
 'motor': 4,
 'muscl': 9,
 'vehicl': 1,
 'origin': 4,
 'roc': 7,
 'imoc': 21,
 'becom': 3,
 'best': 3,
 'speed': 3,
 'out': 8,
 'densiti': 3,
 'releas': 2,
 'car': 81,
 'camaro': 17,
 'get': 1,
 'came': 4,
 'irc': 11,
 'power': 17,
 'injector': 1,
 'make': 5,
 'win': 2,
 'greatest': 1,
 'icoc': 12,
 'dodg': 2,
 'race': 2,
 'built': 9,
 'gm': 3,
 'perform': 1,
 'start': 6,
 '1le': 1,
 'tpi': 4,
 'differ': 10,
 'chevi': 1,
 'speedomet': 1,
 'what': 1,
 'nascar': 2,
 'chang': 15,
 'z': 8,
 'opel': 2,
 '86': 5,
 'did': 90,
 'f
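
The returned dictionary maps each analyzed (stemmed) term to its frequency in the document, in no particular order. A minimal sketch for listing the most frequent terms with `collections.Counter` follows. The dictionary in it is only an excerpt of the truncated output above, so the ranking reflects that excerpt rather than the full document vector.

```python
from collections import Counter

# Excerpt of the document vector shown above (not the complete vector).
doc_vector_excerpt = {
    'iroc': 139, 'did': 90, 'car': 81, 'come': 58,
    'imoc': 21, 'made': 19, 'camaro': 17, 'power': 17, 'engin': 16,
}

# most_common(n) returns the n highest-frequency (term, count) pairs.
top_terms = Counter(doc_vector_excerpt).most_common(3)
print(top_terms)  # [('iroc', 139), ('did', 90), ('car', 81)]
```

In the notebook, `Counter(doc_vector).most_common(10)` would apply the same idea to the full vector.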

In [10]:
# Displays a document's score relative to a specific query.
index_reader.compute_query_document_score('D1573673', 'porsche in nascar')

5.129743576049805
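
`compute_query_document_score` scores one document at a time. To score several documents against the same query and rank them, a small wrapper can help. In the sketch below, `rescore_hits` is a hypothetical helper name, and a stand-in scoring function with made-up numbers replaces the index reader so the example runs on its own.

```python
def rescore_hits(docids, query, score_fn):
    """Return (docid, score) pairs sorted by score, highest first.

    score_fn is any callable taking (docid, query) and returning a number.
    """
    scored = [(docid, score_fn(docid, query)) for docid in docids]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Stand-in scorer with made-up numbers so the sketch runs without an index.
fake_scores = {'D1': 1.5, 'D2': 4.0, 'D3': 2.25}
ranked = rescore_hits(fake_scores, 'any query',
                      lambda docid, query: fake_scores[docid])
print(ranked)  # [('D2', 4.0), ('D3', 2.25), ('D1', 1.5)]
```

In the notebook, the call would presumably look like `rescore_hits([hit.docid for hit in hits], 'porsche in nascar', index_reader.compute_query_document_score)`, which matches the `(docid, query)` argument order used in the cell above.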

In [11]:
hits = searcher.search('Chevrolet camero commemoritive nascar race iroc 1984')

### View Results

In [12]:
for hit in hits:
    print(hit.docid, hit.score)
    print(docs[hit.docid][:500])
    print('-----------------')

D1422896 33.47589874267578
D1422896	http://www.iroc-z.com/birthofirocz/birthofirocz.htm	.	How the Chevrolet Camaro Became an IROC-ZFrom 1985 until very early in 1990 Chevrolet produced the IROC-Z Camaro. It began with its inception in 1985 as an option add-on to the Z28 model then becoming its own model in 1988, eliminating the Z28 model, to its eventual cease of production in early 1990 and Chevrolet reintroducing the Z28. The IROC-Z model of the Chevrolet Camaro first made its public appearance in 1985 as an add on opti
-----------------
D1694606 31.94070053100586
D1694606	http://www.chevyhardcore.com/tech-stories/engine/camaro-engines-through-the-years-third-generation/	Camaro Engines Through The Years – Third Generation	The third-generation Camaro has never gotten the respect from enthusiasts like the first two generations of the nameplate. Part of the reason the third-gens suffered so greatly, is because they were a complete redesign from the first two generations. Built without f