# PyTerrier Notebook for Full-Rank Submissions

This notebook serves as a baseline full-rank submission for [TIRA](https://tira.io)/[TIREx](https://tira.io/tirex) that builds a PyTerrier index and subsequently creates a run with BM25.

### Step 1: Ensure Libraries are Imported

In [3]:
import os

# Detect if we are in the TIRA sandbox
# Install the required dependencies if we are not in the sandbox.
if 'TIRA_DATASET_ID' not in os.environ:
    !pip3 install python-terrier tira==0.0.88 ir_datasets
else:
    print('We are in the TIRA sandbox.')

Defaulting to user installation because normal site-packages is not writeable


In [4]:
from tira.third_party_integrations import ensure_pyterrier_is_loaded, persist_and_normalize_run
import nltk
from nltk.corpus import stopwords
import spacy
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
import pyterrier as pt
nltk.download('stopwords')
ensure_pyterrier_is_loaded()
#pt.set_option('display.max_colwidth',0)
if not pt.started():
    pt.init(boot_packages=['mam10eks:custom-terrier-token-processing:0.0.1', 'com.github.terrierteam:terrier-prf:-SNAPSHOT'])
    from jnius import autoclass

nltk_stopwords = set(stopwords.words('english'))
nlp = spacy.load("en_core_web_sm")
spacy_stopwords = set(nlp.Defaults.stop_words)
sklearn_stopwords = set(ENGLISH_STOP_WORDS)
combined_stopwords = set.union(nltk_stopwords, spacy_stopwords, sklearn_stopwords)
print(combined_stopwords)
!rm -Rf /tmp/index
file_path = "custom_stopwords.txt"
#if os.path.exists(file_path):
#   os.remove(file_path)
with open(file_path, 'w') as file:
    for element in combined_stopwords:
        file.write(element+ "\n")

!cat custom_stopwords.txt 
pt.set_property('stopwords.filename','./custom_stopwords.txt')




# this loads and starts pyterrier so that it also works in the TIRA


# PyTerrier must be imported after the call to ensure_pyterrier_is_loaded in TIRA.

def lemmatize(t):
    lemmatizer = autoclass("org.terrier.terms.PorterStemmer")
    return lemmatizer.stem(t)

Due to execution in TIRA, I have patched ir_datasets to always return the single input dataset mounted to the sandbox.
Start PyTerrier with version=5.7, helper_version=0.0.7, no_download=True


[nltk_data] Downloading package stopwords to
[nltk_data]     /home/timniederhausen/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
PyTerrier 0.10.0 has loaded Terrier 5.7 (built by craigm on 2022-11-10 18:30) and terrier-helper 0.0.7

No etc/terrier.properties, using terrier.default.properties for bootstrap configuration.


{'throughout', 'what', 'cant', 'such', 'do', 'nobody', 'further', 'inc', 'to', 'various', "should've", 'anything', 'she', 'name', 'even', '’re', 'con', 'out', 'besides', 'thus', 'however', 'mill', 'under', 'hereby', 'serious', 'ie', 'were', "isn't", 'doesn', 'anyway', 'empty', 'full', 'there', 'won', 'wasn', "'d", 'somehow', 'when', 'one', '‘re', 'you', "hasn't", 'being', 'others', 'wherever', 'eleven', 'mostly', 'thru', 'somewhere', 'own', 'most', 'put', 'around', 'her', 'becoming', 'cry', 'along', 'neither', '‘ll', 'system', 'perhaps', 'last', 'may', '’d', 'they', '‘d', 'themselves', 'did', 'isn', 'this', 'four', "shan't", 'as', '‘s', "couldn't", 'above', 'll', 'be', 'toward', 'via', 'while', 'become', 'please', 'so', "won't", 'because', 'without', 'another', 'these', 'alone', 'onto', 'sometime', 'whose', 'n‘t', 'ten', 'every', 'o', 'thin', "you'll", 'bottom', 'describe', 'three', 'if', 'nothing', "n't", 'six', 'detail', 'behind', 'should', "haven't", 'next', 'together', 'us', 'upon'

### Step 2: Load the data

In [5]:
data = pt.get_dataset('irds:ir-lab-jena-leipzig-wise-2023/validation-20231104-training')


Load ir_dataset "ir-lab-jena-leipzig-wise-2023/validation-20231104-training" from tira.


In [6]:
print('See the first two queries:')
topics = data.get_topics('title')
print(topics.head(2))

See the first two queries:


No settings given in /home/timniederhausen/.tira/.tira-settings.json. I will use defaults.
No settings given in /home/timniederhausen/.tira/.tira-settings.json. I will use defaults.
       qid              query
0  q072224     purchase money
1  q072226  purchase used car


### Step 3: Build the Index

In [7]:
print('Build index:')
indexer = pt.IterDictIndexer("/tmp/index", overwrite = True, blocks = True,meta = {'docno':100, 'text': 20480}, stemmer = 'PorterStemmer')
!rm -Rf /tmp/index
index_ref = indexer.index(data.get_corpus_iter())

#iter_indexer = pt.IterDictIndexer("/tmp/index", meta={'docno': 100}, verbose=True)
#!rm -Rf /tmp/index
#indexref = iter_indexer.index(data.get_corpus_iter())

#print('Done. Index is created')

Build index:


No settings given in /home/timniederhausen/.tira/.tira-settings.json. I will use defaults.
No settings given in /home/timniederhausen/.tira/.tira-settings.json. I will use defaults.


ir-lab-jena-leipzig-wise-2023/validation-20231104-training documents:   0%|          | 0/61307 [00:00<?, ?it/s]

No settings given in /home/timniederhausen/.tira/.tira-settings.json. I will use defaults.
No settings given in /home/timniederhausen/.tira/.tira-settings.json. I will use defaults.


ir-lab-jena-leipzig-wise-2023/validation-20231104-training documents: 100%|██████████| 61307/61307 [01:38<00:00, 620.65it/s] 


### Step 4: Create the Retrieval Pipeline

In [8]:
index = pt.IndexFactory.of(index_ref)
bm25 = pt.BatchRetrieve(index, wmodel="BM25", verbose=True)

### Step 5: Create the Run and Persist the Run

In [9]:
#Query Expansion
bo1_expansion = bm25 >> pt.rewrite.Bo1QueryExpansion(index)
bo1_expansion(topics)
bm25_b01 = bo1_expansion >> bm25
print('Create run')
#run = bm25(topics)
print('Done, run was created')

BR(BM25):   0%|          | 0/882 [00:00<?, ?q/s]

BR(BM25): 100%|██████████| 882/882 [07:11<00:00,  2.04q/s]


Create run
Done, run was created


In [10]:
#persist_and_normalize_run(run, 'bm25-baseline')