Installing PyTerrier is easy: it can be installed from the command line in the usual way using pip.

In [None]:
!pip install python-terrier

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting python-terrier
  Downloading python-terrier-0.9.2.tar.gz (104 kB)
  Preparing metadata (setup.py) ... done
Collecting wget
  Downloading wget-3.2.zip (10 kB)
  Preparing metadata (setup.py) ... done
Collecting pyjnius>=1.4.2
  Downloading pyjnius-1.4.2-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.5 MB)
Collecting matchpy
  Downloading matchpy-0.5.5-py3-none-any.whl (69 kB)
Collecting deprecated
  Downloading Deprecated-1.2.13-py2.py3-none-any.whl (9.6 kB)
Collecting chest
  

If you want the latest version of PyTerrier, you can install it directly from the GitHub repository:

In [None]:
!pip install --upgrade git+https://github.com/terrier-org/pyterrier.git

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting git+https://github.com/terrier-org/pyterrier.git
  Cloning https://github.com/terrier-org/pyterrier.git to /tmp/pip-req-build-zctt1vwf
  Running command git clone --filter=blob:none --quiet https://github.com/terrier-org/pyterrier.git /tmp/pip-req-build-zctt1vwf
  Resolved https://github.com/terrier-org/pyterrier.git to commit e47970a3a419e0580f02763b06c996f9c1ed5701
  Preparing metadata (setup.py) ... done


# All uses of PyTerrier start by importing PyTerrier and starting it using the `init()` method

In [None]:
import pyterrier as pt

if not pt.started():
    pt.init(boot_packages=["com.github.terrierteam:terrier-prf:-SNAPSHOT"])


terrier-assemblies 5.7 jar-with-dependencies not found, downloading to /root/.pyterrier...
Done
terrier-python-helper 0.0.7 jar not found, downloading to /root/.pyterrier...
Done
terrier-prf -SNAPSHOT jar not found, downloading to /root/.pyterrier...
Done


PyTerrier 0.9.2 has loaded Terrier 5.7 (built by craigm on 2022-11-10 18:30) and terrier-helper 0.0.7



## Indexing a Pandas dataframe

Sometimes we have the documents that we want to index in memory. Terrier makes it easy to index standard Python data structures, particularly [Pandas dataframes](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html).

To do this, we can use a `pt.DFIndexer()` object.

In [None]:
import pandas as pd
!rm -rf ./pd_index
pd_indexer = pt.DFIndexer("./pd_index")

# optionally modify properties
# index_properties = {"block.indexing":"true", "invertedfile.lexiconscanner":"pointers"}
# pd_indexer.setProperties(**index_properties)

In [None]:
df = pd.DataFrame({
    'docno': ['1', '2', '3'],
    'url': ['url1', 'url2', 'url3'],
    'text': [
        'He ran out of money, so he had to stop playing',
        'The waves were crashing on the shore; it was a',
        'The body may perhaps compensates for the loss',
    ],
})
df

  docno   url                                            text
0     1  url1  He ran out of money, so he had to stop playing
1     2  url2  The waves were crashing on the shore; it was a
2     3  url3   The body may perhaps compensates for the loss


There are a number of ways to index the dataframe.    
The first argument should always be a `pandas.Series` of strings, which specifies the body of each document.    
Any arguments after that specify metadata.

In [None]:
indexref = pd_indexer.index(df["text"], df)



In [None]:
index = pt.IndexFactory.of(indexref)
print(index.getCollectionStatistics().toString())

for kv in index.getLexicon():
    print(kv.getKey() + "\t" + kv.getValue().toString())

index.getLexicon()["monei"].toString()

Number of documents: 3
Number of terms: 10
Number of postings: 10
Number of fields: 0
Number of tokens: 10
Field names: []
Positions:   false

bodi	term7 Nt=1 TF=1 maxTF=1 @{0 0 0}
compens	term9 Nt=1 TF=1 maxTF=1 @{0 0 4}
crash	term4 Nt=1 TF=1 maxTF=1 @{0 1 0}
loss	term6 Nt=1 TF=1 maxTF=1 @{0 1 4}
mai	term8 Nt=1 TF=1 maxTF=1 @{0 2 0}
monei	term1 Nt=1 TF=1 maxTF=1 @{0 2 4}
plai	term2 Nt=1 TF=1 maxTF=1 @{0 2 6}
ran	term0 Nt=1 TF=1 maxTF=1 @{0 3 0}
shore	term3 Nt=1 TF=1 maxTF=1 @{0 3 2}
wave	term5 Nt=1 TF=1 maxTF=1 @{0 3 6}


'term1 Nt=1 TF=1 maxTF=1 @{0 2 4}'

## Retrieval

Let's see how we can use this index for retrieval. Retrieval takes place using the `BatchRetrieve` object, by invoking its `transform()` method on one or more queries. For a quick test, you can pass a single query string to `search()` instead.

BatchRetrieve will return the results as a Pandas dataframe.


In [None]:
pt.BatchRetrieve(indexref).search("playing")

  qid  docid docno  rank     score    query
0   1      0     1     0  0.615607  playing


However, most IR experiments use a set of queries. You can pass such a set as a dataframe with `qid` and `query` columns.
A query may have multiple parts, such as "stop playing" below.

In [None]:
import pandas as pd
topics = pd.DataFrame([["3", "compensates"], ["2", "stop playing"]],columns=['qid','query'])
pt.BatchRetrieve(indexref).transform(topics)

NameError: ignored

Specifying models for search

In [None]:
import pandas as pd
topics = pd.DataFrame([["3", "compensates"], ["2", "playing"]],columns=['qid','query'])
result1 = pt.BatchRetrieve(indexref, wmodel="Hiemstra_LM")
pl2 = pt.BatchRetrieve(indexref, wmodel="PL2")
pipeline = (result1 % 100) >> pl2
pipeline.transform(topics)

  qid  docid docno  rank     score        query
0   3      2     3     0  0.888287  compensates
1   2      0     1     0  1.025506      playing
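
The pipeline above combines two PyTerrier operators: `%` is the rank cutoff operator (keep the top k results per query) and `>>` feeds the output of one stage into the next, so PL2 re-scores the documents that Hiemstra_LM retrieved. As a rough illustration of that retrieve-then-rerank flow, here is a minimal pure-Python sketch. The function names and the toy scores are hypothetical and are not part of PyTerrier.

```python
from collections import defaultdict

def rank_cutoff(results, k):
    """Keep the k highest-scoring rows per query (conceptually what `% k` does)."""
    by_query = defaultdict(list)
    for row in results:
        by_query[row["qid"]].append(row)
    kept = []
    for rows in by_query.values():
        rows = sorted(rows, key=lambda r: r["score"], reverse=True)
        kept.extend(rows[:k])
    return kept

def rerank(results, scorer):
    """Re-score every row with a second model and recompute ranks per query."""
    by_query = defaultdict(list)
    for row in results:
        by_query[row["qid"]].append(dict(row, score=scorer(row["qid"], row["docno"])))
    out = []
    for rows in by_query.values():
        rows.sort(key=lambda r: r["score"], reverse=True)
        for rank, row in enumerate(rows):
            out.append(dict(row, rank=rank))
    return out

# Toy first-stage results (hypothetical scores).
first_stage = [
    {"qid": "2", "docno": "1", "score": 0.9},
    {"qid": "2", "docno": "2", "score": 0.5},
    {"qid": "2", "docno": "3", "score": 0.1},
]
# Hypothetical second-stage scores, keyed by (qid, docno).
second_stage_scores = {("2", "1"): 0.2, ("2", "2"): 0.7, ("2", "3"): 0.9}

top2 = rank_cutoff(first_stage, k=2)
final = rerank(top2, lambda qid, docno: second_stage_scores[(qid, docno)])
print([(row["docno"], row["rank"]) for row in final])
```

Document 3 never reaches the second stage because the cutoff removed it, even though the second model would have scored it highest. That is the trade-off a small cutoff introduces.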


In [None]:
import pandas as pd
topics = pd.DataFrame([["3", "compensates"], ["2", "playing"]],columns=['qid','query'])
pt.BatchRetrieve(indexref, wmodel="TF_IDF").transform(topics)

  qid  docid docno  rank     score        query
0   3      2     3     0  1.008403  compensates
1   2      0     1     0  1.137441      playing


Stopword removal and Stemming

In [None]:
import pandas as pd
pd_indexer = pt.DFIndexer("./pd_index4", overwrite=True)
pd_indexer.setProperty( "termpipelines", "Stopwords,PorterStemmer")

df = pd.DataFrame({ 
'docno':
['1', '2', '3'],
'url': 
['url1', 'url2', 'url3'],
'text': 
['He ran out of money, so he had to stop playing',
'The waves were crashing on the shore; it was a',
'The body may perhaps compensates for the loss']
})

indexer = pd_indexer.index(df["text"], df)
# Printing Stats from index
print(pt.IndexFactory.of( indexer ).getCollectionStatistics().toString())
for kv in pt.IndexFactory.of( indexer ).getLexicon():
  print("%s (%s) -> %s (%s)" % (kv.getKey(), type(kv.getKey()), kv.getValue().toString(), type(kv.getValue()) ) )

queries = pd.DataFrame([["3", "compensates"], ["2", "playing"]],columns=['qid','query'])
pt.set_property("termpipelines", "")
pt.BatchRetrieve(indexer, wmodel="TF_IDF").transform(queries)


bm25 = pt.BatchRetrieve(indexer, wmodel="BM25")
bm25.search("compensates")



Number of documents: 3
Number of terms: 10
Number of postings: 10
Number of fields: 0
Number of tokens: 10
Field names: []
Positions:   false

bodi (<class 'str'>) -> term7 Nt=1 TF=1 maxTF=1 @{0 0 0} (<class 'jnius.reflect.org.terrier.structures.LexiconEntry'>)
compens (<class 'str'>) -> term9 Nt=1 TF=1 maxTF=1 @{0 0 4} (<class 'jnius.reflect.org.terrier.structures.LexiconEntry'>)
crash (<class 'str'>) -> term4 Nt=1 TF=1 maxTF=1 @{0 1 0} (<class 'jnius.reflect.org.terrier.structures.LexiconEntry'>)
loss (<class 'str'>) -> term6 Nt=1 TF=1 maxTF=1 @{0 1 4} (<class 'jnius.reflect.org.terrier.structures.LexiconEntry'>)
mai (<class 'str'>) -> term8 Nt=1 TF=1 maxTF=1 @{0 2 0} (<class 'jnius.reflect.org.terrier.structures.LexiconEntry'>)
monei (<class 'str'>) -> term1 Nt=1 TF=1 maxTF=1 @{0 2 4} (<class 'jnius.reflect.org.terrier.structures.LexiconEntry'>)
plai (<class 'str'>) -> term2 Nt=1 TF=1 maxTF=1 @{0 2 6} (<class 'jnius.reflect.org.terrier.structures.LexiconEntry'>)
ran (<class 'str'>) 

  qid  docid docno  rank     score        query
0   1      2     3     0  0.681229  compensates


Method to read the similar questions

In [None]:
import csv

def read_tsv_test_data(file_path):
    """Read the TSV test file of similar questions.

    Returns a dictionary with a question id as the key and the list of
    question ids similar to it as the value. It also returns the list of
    all question ids that have at least one similar question.
    """
    dic_similar_questions = {}
    lst_all_test = []
    with open(file_path, encoding="utf-8") as fd:
        rd = csv.reader(fd, delimiter="\t", quotechar='"')
        for row in rd:
            question_id = int(row[0])
            lst_similar = list(map(int, row[1:]))
            dic_similar_questions[question_id] = lst_similar
            lst_all_test.append(question_id)
            lst_all_test.extend(lst_similar)
    return dic_similar_questions, lst_all_test
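
The function expects a tab-separated file in which each row holds a question id followed by the ids of its similar questions. The sketch below applies the same parsing to a throwaway file. The ids are made up for illustration and the variable names are hypothetical.

```python
import csv
import os
import tempfile

# Two hypothetical rows: question 10 is similar to 11 and 12; question 20 to 21.
sample = "10\t11\t12\n20\t21\n"

with tempfile.NamedTemporaryFile(
    "w", suffix=".tsv", delete=False, encoding="utf-8", newline=""
) as tmp:
    tmp.write(sample)
    path = tmp.name

similar = {}
all_ids = []
with open(path, encoding="utf-8") as fd:
    for row in csv.reader(fd, delimiter="\t", quotechar='"'):
        question_id = int(row[0])
        similar[question_id] = [int(value) for value in row[1:]]
        all_ids.append(question_id)
        all_ids.extend(similar[question_id])
os.remove(path)

print(similar)
print(all_ids)
```

Note that the second structure holds both the leading question ids and the ids of their similar questions, in file order.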

## IterDictIndexer
Use this indexer if you wish to index an iterable of dicts (possibly with multiple fields). This version is optimized by
using multiple threads and POSIX FIFOs to transfer data, which ends up being much faster.

#### Parameters

*   index_path (str) – Directory to store the index. Ignored for IndexingType.MEMORY.
*   meta (Dict[str,int]) – What metadata to record for each document in the index, and what length to reserve. Defaults to {"docno": 20}.
*   meta_reverse (List[str]) – What metadata should be resolvable back to a docid. Defaults to ["docno"].

#### index(it, fields=('text',), meta=None, meta_lengths=None)

Parameters
*   it (iter[dict]) – an iterable of document dicts to be indexed
*   fields (list[str]) – keys to be indexed as fields
*   meta (list[str]) – keys to be considered as metadata
*   meta_lengths (list[int]) – length of metadata, defaults to 512 characters
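
As a small illustration of the input this indexer takes, the sketch below builds an iterable of dicts with a generator and measures the longest value per key, which may help when choosing the lengths to reserve in `meta`. The helper names and sample records are hypothetical, and lengths are counted in characters.

```python
def iter_docs(records):
    """Yield one dict per document; the keys can be used as fields or metadata."""
    for doc_id, title, body in records:
        yield {"docno": str(doc_id), "title": title, "body": body}

def max_field_lengths(docs, keys):
    """Return the longest value (in characters) seen for each key."""
    lengths = {key: 0 for key in keys}
    for doc in docs:
        for key in keys:
            lengths[key] = max(lengths[key], len(doc[key]))
    return lengths

# Hypothetical sample records: (id, title, body).
records = [
    (1, "Can a landlord keep a deposit", "My landlord refuses to return it."),
    (2, "Is a verbal contract binding", "We agreed on the phone."),
]

first_doc = next(iter_docs(records))
lengths = max_field_lengths(iter_docs(records), ["docno", "title", "body"])
print(first_doc)
print(lengths)
```

A generator like `iter_docs(records)` could then be passed as the `it` argument, for example `iter_indexer.index(iter_docs(records), fields=['title', 'body'])`, which avoids holding every document in a list.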


In [8]:
from post_parser_record import PostParserRecord
import pandas as pd
post_reader = PostParserRecord("Posts_law.xml")

# Collect the questions (title and body) that will be indexed
questions = []
for question_id in post_reader.map_questions:
  question = post_reader.map_questions[question_id]
  title = question.title#.replace("©", " ")
  body = question.body#.replace("©", " ")
  questions.append({'docno':str(question_id), 'title': title, 'body': body})

# Prepare the queries: titles of the test questions, with some punctuation replaced by spaces
lst_queries = []
dic_similar_questions, lst_all_test = read_tsv_test_data("duplicate_questions.tsv")
for question_id in dic_similar_questions:
  query = post_reader.map_questions[question_id].title
  query = query.replace("?", " ")
  query = query.replace("\"", " ")
  query = query.replace("\'", " ")
  query = query.replace("/", " ")
  lst_queries.append([str(question_id), query])

iter_indexer = pt.IterDictIndexer("./index", meta={'docno': 20, 'title': 10000, 'body':20000}, overwrite=True)
RETRIEVAL_FIELDS = ['title', 'body']
index = iter_indexer.index(questions, fields=RETRIEVAL_FIELDS)


#bm25 = pt.BatchRetrieve(index, num_results=200, wmodel="BM25")
#dph = pt.BatchRetrieve(index, wmodel="DPH")
#pipeline = (bm25 % 20) >> dph
# questions + titles
# {'precision@1': 0.028368794326241134, 'mrr@20': 0.13756442330441077}

# Retrieval models
bm25 = pt.BatchRetrieve(index, num_results=200, wmodel="BM25")
dph = pt.BatchRetrieve(index, wmodel="DPH")

# Query expansion techniques
bo1 = pt.rewrite.Bo1QueryExpansion(index)


pipeline = (bm25 % 20) >> bo1 >> dph
#{'precision@1': 0.03900709219858156, 'mrr@20': 0.15685095668298357}


queries = pd.DataFrame(lst_queries,columns=['qid','query'])
result = pipeline.transform(queries)

pt.io.write_results(result, "similar_questions_results.txt", format='trec')
print(result)

          qid  docid  docno  rank      score  \
0       11532   3484  11532     0  35.804581   
1       11532  20723  73830     1  19.734624   
2       11532   9488  31440     2  18.048025   
3       11532  16967  57169     3  16.368443   
4       11532  21332  76618     4  16.226852   
...       ...    ...    ...   ...        ...   
281995   9488   4552  15663   995   3.333565   
281996   9488   9402  31181   996   3.333565   
281997   9488  19948  70471   997   3.332240   
281998   9488   8796  29320   998   3.332197   
281999   9488  22788  80762   999   3.332197   

                                                  query_0  \
0       Is a (UK) retail company obliged to compensate...   
1       Is a (UK) retail company obliged to compensate...   
2       Is a (UK) retail company obliged to compensate...   
3       Is a (UK) retail company obliged to compensate...   
4       Is a (UK) retail company obliged to compensate...   
...                                                   ...
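
The query preparation above replaces `?`, `"`, `'` and `/` with spaces before retrieval, presumably because such characters can trip up Terrier's query parser (that motivation is an assumption; the notebook does not state it). An equivalent sketch using `str.translate`, with a hypothetical helper name:

```python
# Characters replaced with a space, matching the four replace() calls in the notebook.
QUERY_CHARS = "?\"'/"
TRANSLATION = str.maketrans({ch: " " for ch in QUERY_CHARS})

def clean_query(text):
    """Replace each of the characters in QUERY_CHARS with a single space."""
    return text.translate(TRANSLATION)

cleaned = clean_query('Is "fair use" a defence/excuse?')
print(repr(cleaned))
```

Each character is replaced rather than deleted, so the cleaned query keeps its word boundaries but may contain runs of spaces.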

Using the ranx Python library for evaluation

In [9]:
! pip install ranx

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting ranx
  Downloading ranx-0.3.7-py3-none-any.whl (95 kB)
Collecting orjson
  Downloading orjson-3.8.10-cp39-cp39-manylinux_2_28_x86_64.whl (140 kB)
Collecting cbor2
  Downloading cbor2-5.4.6-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (223 kB)
Installing collected packages: orjson, cbor2, ranx
Successfully installed cbor2-5.4.6 orjson-3.8.10 ranx-0.3.7


In [10]:
import csv

file_path = "duplicate_questions.tsv"
result_qrel = "qrel_similar.tsv"

with open(result_qrel, 'w', newline='') as tsvfile:
    writer = csv.writer(tsvfile, delimiter='\t', lineterminator='\n')
    with open(file_path, encoding="utf-8") as fd:
        rd = csv.reader(fd, delimiter="\t", quotechar='"')
        for row in rd:
            question_id = int(row[0])
            lst_similar = list(map(int, row[1:]))
            for sim_qid in lst_similar:
                writer.writerow([str(question_id), "0", str(sim_qid), "1"])
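
Each row written above follows the TREC qrels layout: query id, an iteration column (conventionally `0`), document id, and relevance label. The sketch below produces the same rows in memory from a hypothetical duplicates mapping, so the layout can be inspected without touching the real files.

```python
import csv
import io

# Hypothetical mapping: question id -> ids of questions marked as similar.
duplicates = {101: [205, 317], 102: [411]}

buffer = io.StringIO()
writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
for question_id, similar_ids in duplicates.items():
    for sim_qid in similar_ids:
        # Columns: query id, iteration (unused), document id, relevance.
        writer.writerow([str(question_id), "0", str(sim_qid), "1"])

qrel_lines = buffer.getvalue().splitlines()
for line in qrel_lines:
    print(line.split("\t"))
```

Every similar question is given relevance `1`; questions not listed for a query are treated as non-relevant by the evaluation.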

In [11]:
from ranx import Qrels, Run, evaluate

# Note running ranx for the first time will be a bit slow
qrels = Qrels.from_file("qrel_similar.tsv", kind="trec")
run = Run.from_file("similar_questions_results.txt", kind="trec")

print(evaluate(qrels, run, "precision@1"))
evaluate(qrels, run, ["precision@1", "mrr@20"])
# default results
# {'precision@1': 0.02127659574468085, 'mrr@20': 0.13619857875012345}

0.03900709219858156


{'precision@1': 0.03900709219858156, 'mrr@20': 0.15685095668298357}
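
To make the two reported metrics concrete, the sketch below computes them by hand on toy data. It assumes the usual definitions: precision@1 is 1 when the top-ranked document is relevant, and MRR@20 is the reciprocal rank of the first relevant document within the top 20 (0 if there is none), both averaged over queries. The data and function names are hypothetical, and this is not ranx's implementation.

```python
def precision_at_1(relevant, ranking):
    """1.0 if the top-ranked document is relevant, else 0.0."""
    return 1.0 if ranking and ranking[0] in relevant else 0.0

def mrr_at_k(relevant, ranking, k=20):
    """Reciprocal rank of the first relevant document in the top k, else 0.0."""
    for position, docno in enumerate(ranking[:k], start=1):
        if docno in relevant:
            return 1.0 / position
    return 0.0

# Hypothetical judgments (qid -> relevant docnos) and rankings (qid -> ranked docnos).
toy_qrels = {"q1": {"d3"}, "q2": {"d7", "d9"}}
toy_run = {"q1": ["d3", "d1", "d2"], "q2": ["d4", "d5", "d6", "d9"]}

mean_p1 = sum(precision_at_1(toy_qrels[q], toy_run[q]) for q in toy_qrels) / len(toy_qrels)
mean_mrr = sum(mrr_at_k(toy_qrels[q], toy_run[q]) for q in toy_qrels) / len(toy_qrels)
print(mean_p1, mean_mrr)
```

Here `q1` has its relevant document at rank 1 and `q2` has its first relevant document at rank 4, so the means are 0.5 for precision@1 and (1 + 0.25) / 2 = 0.625 for MRR@20.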

In [12]:
#per query results
evaluate(qrels, run, ["precision@1", "mrr@20"], return_mean=False)
# 5 10 25 all 1

{'precision@1': array([0., 0., 0., 0., 1., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 1., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0.,
        0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0