# PyTerrier Notebook for Full-Rank Submissions

This notebook serves as a baseline full-rank submission for [TIRA](https://tira.io)/[TIREx](https://tira.io/tirex). It builds a PyTerrier index with a custom stopword list, retrieves an initial ranking with BM25, expands the queries with Bo1, and creates the final run with PL2.

### Step 1: Ensure Libraries are Imported

In [24]:
import os

# Detect whether we are in the TIRA sandbox.
# Install the required dependencies only if we are not in the sandbox.
if 'TIRA_DATASET_ID' not in os.environ:
    !pip3 install python-terrier tira==0.0.88 ir_datasets nltk spacy scikit-learn
    !python -m spacy download en_core_web_sm
else:
    print('We are in the TIRA sandbox.')

Collecting en-core-web-sm==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl (12.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12.8/12.8 MB 44.3 MB/s eta 0:00:00
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [25]:
from tira.third_party_integrations import ensure_pyterrier_is_loaded, persist_and_normalize_run

# This loads and starts PyTerrier so that it also works in the TIRA sandbox.
ensure_pyterrier_is_loaded()

# PyTerrier must be imported after the call to ensure_pyterrier_is_loaded in TIRA.
import pyterrier as pt

if not pt.started():
    pt.init(boot_packages=['mam10eks:custom-terrier-token-processing:0.0.1', 'com.github.terrierteam:terrier-prf:-SNAPSHOT'])
    from jnius import autoclass


Due to execution in TIRA, I have patched ir_datasets to always return the single input dataset mounted to the sandbox.


#### Step 1.1: Load Stopword-List

In [26]:
import nltk
from nltk.corpus import stopwords
import spacy
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

# generate custom stopword list
nltk.download('stopwords')
nltk_stopwords = set(stopwords.words('english'))
nlp = spacy.load("en_core_web_sm")
spacy_stopwords = set(nlp.Defaults.stop_words)
sklearn_stopwords = set(ENGLISH_STOP_WORDS)
combined_stopwords = set.union(nltk_stopwords, spacy_stopwords, sklearn_stopwords)

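# Remove any stale index left over from a previous run.
!rm -Rf /tmp/index
file_path = "custom_stopwords.txt"

# Write one stopword per line (sorted so that the file is reproducible).
with open(file_path, 'w') as file:
    for element in sorted(combined_stopwords):
        file.write(element + "\n")

# Tell Terrier to use the custom list instead of its default stopword file.
pt.set_property('stopwords.filename', './custom_stopwords.txt')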

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/codespace/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
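
The cell above merges three stopword lists and writes them to a file, one word per line. A minimal sketch of the same idea follows, using tiny made-up lists (`nltk_like`, `spacy_like`, `sklearn_like`) and a hypothetical helper `write_stopwords` that is not part of the notebook. As an assumption beyond the original cell, it also lowercases and sorts the entries so that the file is identical across runs.

```python
import os
import tempfile

# Hypothetical miniature lists standing in for the NLTK, spaCy and
# scikit-learn stopword lists used in the notebook.
nltk_like = {"the", "and", "of"}
spacy_like = {"The", "whereas", "of"}
sklearn_like = {"and", "thereby"}

# Set union removes exact duplicates across the three sources.
combined = nltk_like | spacy_like | sklearn_like


def write_stopwords(words, path):
    """Write one lowercased stopword per line, sorted for reproducibility."""
    cleaned = sorted({w.strip().lower() for w in words if w.strip()})
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("\n".join(cleaned) + "\n")
    return cleaned


# Write to a temporary location so the sketch does not touch the real file.
demo_path = os.path.join(tempfile.gettempdir(), "demo_stopwords.txt")
written = write_stopwords(combined, demo_path)
print(written)
```

Lowercasing before deduplication matters here because `"The"` and `"the"` would otherwise both end up in the file.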


### Step 2: Load the data

In [27]:
data = pt.get_dataset('irds:ir-lab-jena-leipzig-wise-2023/validation-20231104-training')


In [35]:
print('See the first two queries:')
topics = data.get_topics('title')
print(topics.head(2))

See the first two queries:
       qid              query
0  q072224     purchase money
1  q072226  purchase used car


### Step 3: Build the Index

In [36]:
print('Build index:')
indexer = pt.IterDictIndexer("/tmp/index", overwrite=True, blocks=True, meta={'docno': 100, 'text': 20480}, stemmer='PorterStemmer')
!rm -Rf /tmp/index
index_ref = indexer.index(data.get_corpus_iter())

print('Done. Index is created')

Build index:


No settings given in /home/codespace/.tira/.tira-settings.json. I will use defaults.

ir-lab-jena-leipzig-wise-2023/validation-20231104-training documents: 100%|██████████| 61307/61307 [00:59<00:00, 1035.59it/s]


Done. Index is created


### Step 4: Create the Retrieval Pipeline

In [30]:
index = pt.IndexFactory.of(index_ref)

bm25 = pt.BatchRetrieve(index, wmodel="BM25", verbose=True)
pl2 = pt.BatchRetrieve(index, wmodel="PL2", verbose=True)

#### Step 4.1: Add Query Expansion

In [37]:
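# Query expansion with Bo1, using the top-ranked documents as pseudo-relevance feedback
bo1 = pt.rewrite.Bo1QueryExpansion(index)

# Pipeline: BM25 (top 100 per query) >> Bo1 query rewriting >> PL2 retrieval with the expanded query
pipe = (bm25 % 100) >> bo1 >> pl2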
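
To give an intuition for what the Bo1 stage does, here is a rough sketch of the Bo1 term weight as it is commonly stated for the divergence-from-randomness framework: `w(t) = tf_x * log2((1 + P_n) / P_n) + log2(1 + P_n)` with `P_n = F / N`, where `tf_x` is the frequency of the term in the feedback documents, `F` its frequency in the collection, and `N` the number of documents. The function `bo1_weight` and all term statistics below are hypothetical illustrations, not values from this index; only the document count 61307 comes from the indexing log above.

```python
import math


def bo1_weight(tf_feedback, collection_freq, num_docs):
    """Bo1 expansion weight (sketch; assumes the textbook DFR formulation)."""
    p_n = collection_freq / num_docs
    return tf_feedback * math.log2((1 + p_n) / p_n) + math.log2(1 + p_n)


NUM_DOCS = 61307  # number of documents reported by the indexing progress bar

# Hypothetical statistics: term -> (frequency in feedback docs, frequency in collection)
stats = {
    "loan": (6, 120),
    "car": (4, 900),
    "page": (9, 250000),
}

weights = {term: bo1_weight(tf, cf, NUM_DOCS) for term, (tf, cf) in stats.items()}

# Terms that are frequent in the feedback set but rare in the collection rank first.
ranked_terms = sorted(weights, key=weights.get, reverse=True)
print(ranked_terms)
```

In this toy setting the very common term `page` receives the lowest weight even though it occurs most often in the feedback documents, which is the behaviour one would want from an expansion model.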

### Step 5: Create the Run and Persist the Run

In [38]:
print('Create run')

run = pipe(topics)

print('Done, run was created')

Create run


BR(BM25): 100%|██████████| 50/50 [00:25<00:00,  1.95q/s]
BR(PL2): 100%|██████████| 50/50 [00:28<00:00,  1.78q/s]

Done, run was created





### Step 6: Run Experiments

In [45]:
pt.Experiment(
    [bm25, pipe],
    data.get_topics()[:50],
    data.get_qrels(),
    eval_metrics=["map", "recip_rank", "P_10", "recall_10", "ndcg"],
    names=["BM25", "Spotted Turtle"],
    baseline=0
)

There are multiple query fields available: ('text', 'title', 'query', 'description', 'narrative'). To use with pyterrier, provide variant or modify dataframe to add query column.


BR(BM25): 100%|██████████| 50/50 [00:25<00:00,  1.99q/s]
BR(BM25): 100%|██████████| 50/50 [00:26<00:00,  1.91q/s]
BR(PL2): 100%|██████████| 50/50 [00:27<00:00,  1.81q/s]


| | name | map | recip_rank | P_10 | recall_10 | ndcg | map + | map - | map p-value | recip_rank + | ... | recip_rank p-value | P_10 + | P_10 - | P_10 p-value | recall_10 + | recall_10 - | recall_10 p-value | ndcg + | ndcg - | ndcg p-value |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | BM25 | 0.137711 | 0.235739 | 0.082 | 0.233167 | 0.290184 | | | | | ... | | | | | | | | | | |
| 1 | Spotted Turtle | 0.144571 | 0.232261 | 0.084 | 0.2425 | 0.29751 | 25.0 | 20.0 | 0.638853 | 16.0 | ... | 0.909288 | 4.0 | 3.0 | 0.709499 | 4.0 | 3.0 | 0.782041 | 24.0 | 21.0 | 0.594109 |
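
With `baseline=0`, the experiment reports for each metric how many queries improved (`+`) or degraded (`-`) relative to BM25, together with a p-value from a paired significance test. All p-values shown here lie between roughly 0.59 and 0.91, so none of the differences is significant at the 5% level. The following sketch shows how such counts and a paired t statistic could be derived from per-query scores; the two score lists are made up for illustration and the variable names are not part of PyTerrier.

```python
import math
from statistics import mean, stdev

# Hypothetical per-query average precision for five queries.
baseline_scores = [0.10, 0.25, 0.00, 0.40, 0.15]
system_scores = [0.12, 0.20, 0.00, 0.45, 0.20]

# Per-query differences (system minus baseline).
diffs = [s - b for s, b in zip(system_scores, baseline_scores)]

improved = sum(d > 0 for d in diffs)   # corresponds to the "+" column
degraded = sum(d < 0 for d in diffs)   # corresponds to the "-" column

# Paired t statistic: mean difference divided by its standard error.
t_stat = mean(diffs) / (stdev(diffs) / math.sqrt(len(diffs)))

print(improved, degraded, round(t_stat, 3))
```

Turning the t statistic into a p-value requires the t distribution with `len(diffs) - 1` degrees of freedom, which is left out here to keep the sketch free of third-party dependencies.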


In [None]:
persist_and_normalize_run(run, 'bm25-custom-stopwords')
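
For orientation, this is a small sketch of the standard six-column TREC run format (`qid Q0 docno rank score tag`) that run files in shared tasks typically follow. The helper `to_trec_lines` and the document identifiers are hypothetical; the sketch is not the implementation of `persist_and_normalize_run` and only illustrates what a normalized run could look like.

```python
from collections import defaultdict


def to_trec_lines(rows, tag):
    """Format (qid, docno, score) triples as TREC run lines (illustrative only)."""
    by_query = defaultdict(list)
    for qid, docno, score in rows:
        by_query[qid].append((docno, score))

    lines = []
    for qid in sorted(by_query):
        # Sort by descending score; break ties by docno so the output is deterministic.
        ranked = sorted(by_query[qid], key=lambda pair: (-pair[1], pair[0]))
        for rank, (docno, score) in enumerate(ranked, start=1):
            lines.append(f"{qid} Q0 {docno} {rank} {score} {tag}")
    return lines


# Query ids taken from the topics shown earlier; docnos and scores are made up.
rows = [
    ("q072224", "doc-b", 7.5),
    ("q072224", "doc-a", 9.25),
    ("q072226", "doc-c", 3.0),
]
trec_lines = to_trec_lines(rows, "bm25-custom-stopwords")
print("\n".join(trec_lines))
```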