# Generating a CIFF file from a Terrier index

This notebook demonstrates generating a CIFF (Common Index File Format) file from a Terrier index using PyTerrier.

## Setup

In [1]:
!pip install python-terrier

Collecting python-terrier
  Downloading https://files.pythonhosted.org/packages/10/0e/1756a1892b8b2aa0152ac532c7f85de802bda25772108ab8196259ea9d4f/python-terrier-0.5.0.tar.gz (74kB)
Collecting wget
  Downloading https://files.pythonhosted.org/packages/47/6a/62e288da7bcda82b935ff0c6cfe542970f04e29c756b0e147251b2fb251f/wget-3.2.zip
Collecting pytrec_eval>=0.5
  Downloading https://files.pythonhosted.org/packages/2e/03/e6e84df6a7c1265579ab26bbe30ff7f8c22745

This downloads the CIFF protobuf definition from the CIFF repository.

In [2]:
!wget https://raw.githubusercontent.com/osirrc/ciff/master/src/main/protobuf/CommonIndexFileFormat.proto

--2021-03-29 11:28:29--  https://raw.githubusercontent.com/osirrc/ciff/master/src/main/protobuf/CommonIndexFileFormat.proto
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2313 (2.3K) [text/plain]
Saving to: ‘CommonIndexFileFormat.proto’


2021-03-29 11:28:30 (34.4 MB/s) - ‘CommonIndexFileFormat.proto’ saved [2313/2313]



Generate the Python bindings for CIFF from the protobuf definition.

In [3]:
!mkdir -p ciff
!protoc --python_out=ciff ./CommonIndexFileFormat.proto

Start PyTerrier, passing the terrier-ciff plugin as a package so that Terrier loads it.

In [4]:
import pyterrier as pt
pt.init(packages=["com.github.terrierteam:terrier-ciff:-SNAPSHOT"])


terrier-assemblies 5.4  jar-with-dependencies not found, downloading to /root/.pyterrier...
Done
terrier-python-helper 0.0.5  jar not found, downloading to /root/.pyterrier...
Done
PyTerrier 0.5.0 has loaded Terrier 5.4 (built by craigm on 2021-01-16 14:17)


Let's use PyTerrier's prebuilt index for the small Vaswani test collection (~11k abstracts).

In [5]:
dataset = pt.get_dataset("vaswani")
index = pt.IndexFactory.of(dataset.get_index())

Downloading vaswani index to /root/.pyterrier/corpora/vaswani/index


## Generating the CIFF file

In [7]:
from ciff.CommonIndexFileFormat_pb2 import Header, PostingsList, Posting, DocRecord

In [20]:
!rm -f ./ciff.vaswani.gz
file = pt.io.autoopen("./ciff.vaswani.gz", "wb")

In [21]:
def write_delimited(f, o):
  # Write one length-prefixed message: a varint byte count, then the serialized bytes.
  # _VarintBytes is a private protobuf helper and may change between protobuf releases.
  from google.protobuf.internal.encoder import _VarintBytes
  data = o.SerializeToString()
  f.write(_VarintBytes(len(data)))
  f.write(data)
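
The length-prefixed framing that `write_delimited` produces can be illustrated without protobuf. The following is a minimal sketch, assuming the standard base-128 varint encoding for the length prefix. `encode_varint`, `write_frame`, and `read_frames` are hypothetical helper names that are not part of PyTerrier or protobuf, and raw byte strings stand in for serialized messages.

```python
import io


def encode_varint(n):
    """Encode a non-negative int as a base-128 varint (low 7 bits first)."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def write_frame(f, payload):
    """Write one record: varint(len(payload)) followed by the payload."""
    f.write(encode_varint(len(payload)))
    f.write(payload)


def read_frames(f):
    """Yield each payload from a stream of length-prefixed records."""
    while True:
        size = 0
        shift = 0
        while True:
            b = f.read(1)
            if not b:
                if shift == 0:
                    return  # clean end of stream between records
                raise EOFError("stream ended inside a length prefix")
            size |= (b[0] & 0x7F) << shift
            shift += 7
            if not b[0] & 0x80:
                break  # last byte of the varint
        payload = f.read(size)
        if len(payload) != size:
            raise EOFError("stream ended inside a payload")
        yield payload


# Round trip through an in-memory buffer (stand-in payloads, not real CIFF messages).
buf = io.BytesIO()
for payload in (b"header", b"x" * 300, b""):
    write_frame(buf, payload)
buf.seek(0)
frames = list(read_frames(buf))
```

When reading a real CIFF file, each yielded payload would presumably be handed to the matching generated class via `ParseFromString`, in the order the notebook writes them: one `Header`, then the postings lists, then the document records.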


In [22]:
header = Header()
header.version = 1
header.num_postings_lists = index.getCollectionStatistics().getNumberOfUniqueTerms()
header.num_docs = index.getCollectionStatistics().getNumberOfDocuments()
header.total_docs = index.getCollectionStatistics().getNumberOfDocuments()
header.total_postings_lists = index.getCollectionStatistics().getNumberOfUniqueTerms()
header.total_terms_in_collection = index.getCollectionStatistics().getNumberOfTokens()
header.average_doclength  = index.getCollectionStatistics().getAverageDocumentLength()
header.description = index.getIndexRef().toString()
write_delimited(file, header)

In [23]:
lexicon = index.getLexicon()
inv = index.getInvertedIndex()
for kv in pt.tqdm(lexicon, unit="t"):
  pl = PostingsList()
  pl.term = kv.getKey()
  pl.df = kv.getValue().getDocumentFrequency()
  pl.cf = kv.getValue().getFrequency()
  # Docids are gap-encoded: the first posting stores the docid itself,
  # each later posting stores the difference from the previous docid.
  prevDocid = -1
  for p in inv.getPostings(kv.getValue()):
    posting = Posting()
    curDocid = p.getId()
    posting.docid = curDocid if prevDocid == -1 else curDocid - prevDocid
    posting.tf = p.getFrequency()
    pl.postings.append(posting)
    prevDocid = curDocid

  write_delimited(file, pl)
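
The docid handling in the loop above is a gap (delta) encoding: the first posting keeps its docid and every later posting stores the distance from the previous one. A minimal standalone sketch of that scheme and its inverse follows. `to_gaps` and `from_gaps` are hypothetical names used only for illustration, and the docid list is made up.

```python
from itertools import accumulate


def to_gaps(docids):
    """Turn ascending docids into [first, gap, gap, ...] as in the loop above."""
    gaps = []
    prev = None
    for docid in docids:
        gaps.append(docid if prev is None else docid - prev)
        prev = docid
    return gaps


def from_gaps(gaps):
    """Recover the original docids by taking a running sum of the gaps."""
    return list(accumulate(gaps))


docids = [3, 7, 8, 20]          # illustrative values, not taken from the Vaswani index
gaps = to_gaps(docids)          # [3, 4, 1, 12]
restored = from_gaps(gaps)      # [3, 7, 8, 20]
```

A consumer of the CIFF file needs the `from_gaps` direction to rebuild absolute docids before looking up the matching document records.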



In [24]:
metaindex = index.getMetaIndex()
docindex = index.getDocumentIndex()
for docid in pt.tqdm(range(0, index.getCollectionStatistics().getNumberOfDocuments()), unit="d"):
  doc = DocRecord()
  doc.docid = docid
  doc.collection_docid = metaindex.getItem("docno", docid)
  doc.doclength = docindex.getDocumentLength(docid)
  write_delimited(file, doc)



In [25]:
file.close()

## Making a Terrier index from the CIFF file

Here we use the terrier-ciff plugin for Terrier to ingest the CIFF file into a new index.

In [26]:
!mkdir -p from_ciff_index
pt.run("ciff-ingest", ["-I", "./from_ciff_index/data.properties", "./ciff.vaswani.gz"])

Finished ingesting ./ciff.vaswani.gz (11429 documents)
New index at /content/from_ciff_index//data.properties


## Validating the effectiveness of the CIFF-derived index against the original Terrier index

In [27]:
pt.Experiment(
    [pt.BatchRetrieve(index), pt.BatchRetrieve("./from_ciff_index/data.properties")],
    dataset.get_topics(),
    dataset.get_qrels(),
    ["map"]
)

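   name     map
0  BR(DPH)  0.283621
1  BR(DPH)  0.283621

Row 0 is the original index and row 1 is the index rebuilt from the CIFF file. Both obtain the same MAP (0.283621), which suggests that the CIFF round trip preserved the information these retrieval runs depend on.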
