# Final Retrieval System

This is a (slightly) improved retrieval system built on BM25 retrieval.
It implements a custom index with lemmatization, query expansion, a sequential dependence model and a simple pipeline.

### Step 1: Import Libraries

Imports TIRA, PyTerrier, spaCy and NLTK's WordNet corpus reader, and ensures PyTerrier is loaded.

In [22]:
from tira.third_party_integrations import ensure_pyterrier_is_loaded, persist_and_normalize_run

# PyTerrier must be imported after the call to ensure_pyterrier_is_loaded in TIRA.
import pyterrier as pt

import spacy

from nltk.corpus import wordnet

# Ensure PyTerrier is loaded
ensure_pyterrier_is_loaded()


### Step 2: Setup for Stopwords

Loads the English stopwords from spaCy, saves them to the file "stopwords.txt", and points PyTerrier's stopword list at that file.

In [23]:
# Load Spacy and set up stopwords
nlp = spacy.load("en_core_web_sm")
spacy_stopwords = set(nlp.Defaults.stop_words)


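file_path = "stopwords.txt"
# Write one stopword per line; sorting keeps the file stable across runs
with open(file_path, 'w') as file:
    for element in sorted(spacy_stopwords):
        file.write(element + "\n")

# Point Terrier at the custom stopword list
pt.set_property('stopwords.filename', './stopwords.txt')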

### Step 3: Import Dataset

Loads the ir_datasets dataset from TIRA and prints the first two queries as a sanity check.

In [13]:
data = pt.get_dataset('irds:ir-lab-sose-2024/ir-acl-anthology-20240504-training')
print('See the first two queries:')
topics = data.get_topics('title')
print(topics.head(2))

See the first two queries:
  qid                                     query
0   1  retrieval system improving effectiveness
1   2  machine learning language identification


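### Step 4: Add Synonym Recognition

Defines helpers that look up WordNet synonyms for each word of a text. They are defined here but are not called in the later steps.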

In [26]:
import nltk

# Download the WordNet corpus that nltk.corpus.wordnet reads from
nltk.download('wordnet')

def get_synonyms(word):
    # Collect the lemma names of every WordNet synset that contains the word
    synonyms = set()
    for syn in wordnet.synsets(word):
        for lemma in syn.lemmas():
            synonyms.add(lemma.name())
    return synonyms

def expand_with_synonyms(text):
    # Keep each word and append its synonyms directly after it
    words = text.split()
    expanded_text = []
    for word in words:
        expanded_text.append(word)
        expanded_text.extend(get_synonyms(word))
    return " ".join(expanded_text)
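The expansion above appends every synonym it finds, which can make the expanded text long and repetitive. A minimal sketch of a more restrained variant follows; it is not part of the original notebook. The function name `expand_query`, the `max_synonyms` limit and the toy lookup table are all hypothetical. The lookup is passed in as a function so the sketch runs without WordNet; WordNet lemma names join multi-word expressions with underscores, so the sketch replaces them with spaces.

```python
from typing import Callable, Iterable, List


def expand_query(text: str,
                 lookup: Callable[[str], Iterable[str]],
                 max_synonyms: int = 3) -> str:
    """Append at most max_synonyms synonyms per word, without duplicates."""
    expanded: List[str] = []
    seen = set()
    for word in text.split():
        # Sorting makes the choice of synonyms deterministic
        candidates = [word] + sorted(lookup(word))[:max_synonyms]
        for term in candidates:
            # Multi-word lemma names use underscores; turn them into spaces
            term = term.replace("_", " ").lower()
            if term not in seen:
                seen.add(term)
                expanded.append(term)
    return " ".join(expanded)


# Hypothetical synonym table standing in for a WordNet lookup
toy_synonyms = {"car": {"auto", "motor_vehicle"}, "fast": {"quick"}}
result = expand_query("fast car", lambda w: toy_synonyms.get(w, ()))
print(result)
```

With the notebook's own helper, the call would be `expand_query(query, get_synonyms)`.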

Defines a function that lemmatizes a text with spaCy and drops the stopwords loaded in Step 2.

In [14]:
# Function to lemmatize text
def lemmatize_text(text):
    doc = nlp(text)
    return " ".join([token.lemma_ for token in doc if token.text.lower() not in spacy_stopwords])



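### Step 5: Build an Index with Lemmatization

Builds a custom index from the corpus iterator.
The "process" method is intended to lemmatize each document and remove the previously defined stopwords.
Block indexing ("blocks=True") is enabled so that term positions are stored, which the Sequential Dependence Model in Step 6 needs.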

In [15]:
class LemmaIndexer(pt.IterDictIndexer):
    def process(self, doc):
        doc['text'] = lemmatize_text(doc['text'])
        return doc

indexer = LemmaIndexer("/tmp/index", overwrite=True, blocks=True, meta={'docno': 100, 'text': 20480})
index_ref = indexer.index(data.get_corpus_iter())
print('Done. Index is created')

ir-lab-sose-2024/ir-acl-anthology-20240504-training documents: 100%|██████████| 126958/126958 [00:23<00:00, 5517.94it/s] 


17:49:55.853 [ForkJoinPool-3-worker-3] WARN org.terrier.structures.indexing.Indexer - Indexed 4 empty documents
Done. Index is created
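An alternative that does not depend on how the indexer calls into a subclass is to transform the documents in a generator before they reach the indexer. The following is a minimal sketch under that assumption, not the notebook's method; `preprocess_docs` is a hypothetical helper name, and `str.lower` stands in for the lemmatizer so the example runs without spaCy.

```python
from typing import Callable, Dict, Iterable, Iterator


def preprocess_docs(docs: Iterable[Dict[str, str]],
                    transform: Callable[[str], str]) -> Iterator[Dict[str, str]]:
    """Yield copies of the documents with the 'text' field transformed."""
    for doc in docs:
        processed = dict(doc)  # copy, so the source document stays unchanged
        processed["text"] = transform(doc["text"])
        yield processed


# Hypothetical documents; str.lower stands in for lemmatize_text
sample = [{"docno": "d1", "text": "Running Systems"},
          {"docno": "d2", "text": ""}]
processed = list(preprocess_docs(sample, str.lower))
print(processed)
```

With the names from this notebook, the generator would wrap the corpus iterator, for example `indexer.index(preprocess_docs(data.get_corpus_iter(), lemmatize_text))`. Running spaCy over the whole corpus this way will take considerably longer than indexing the raw text.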


### Step 6: Setup for Retrieval-Pipeline

First, defines the index and the retrieval models used.
"sdm" rewrites queries according to the Sequential Dependence Model.

In [16]:
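index = pt.IndexFactory.of(index_ref)
# Weighting-model parameters are set through Terrier controls
bm25 = pt.BatchRetrieve(index, wmodel="BM25", controls={"bm25.k_1": 1.5, "bm25.b": 0.75}, verbose=True)
pl2 = pt.BatchRetrieve(index, wmodel="PL2", controls={"c": 7.0}, verbose=True)
# Query rewriter that adds proximity operators to the query
sdm = pt.rewrite.SequentialDependence()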

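Second, defines the query expansion and the actual pipeline: BM25 retrieves the top 100 documents per query, RM3 expands the query based on them, "sdm" rewrites the expanded query, and PL2 produces the final ranking.

In [17]:
# Query expansion (bo1 is defined as an alternative but is not used in the pipeline)
bo1 = pt.rewrite.Bo1QueryExpansion(index)
rm3 = pt.rewrite.RM3(index)

# Pipeline
pipe = (bm25 % 100) >> rm3 >> sdm >> pl2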

### Step 7: Create the Run

Runs the pipeline on the queries from the dataset to create the run.

In [18]:
print('Create run')

run = pipe(topics)

print('Done, run was created')

Create run


BR(BM25): 100%|██████████| 68/68 [00:11<00:00,  5.97q/s]


17:50:07.501 [main] WARN org.terrier.querying.RM1 - Did not identify any usable candidate expansion terms from docid 125137 among 6 possibilities
17:50:07.594 [main] WARN org.terrier.querying.RM1 - Did not identify any usable candidate expansion terms from docid 116910 among 5 possibilities


BR(PL2): 100%|██████████| 68/68 [00:13<00:00,  5.00q/s]

Done, run was created





### Step 8: Save the Runfile

Normalizes the run and stores it as a run file under "../runs".

In [19]:
persist_and_normalize_run(run, 'retrieval_system', default_output='../runs')

The run file is normalized outside the TIRA sandbox, I will store it at "../runs".
Done. run file is stored under "../runs/run.txt".
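To inspect the stored run, the file can be read back line by line. The sketch below assumes the file follows the standard six-column TREC run format (query ID, the literal `Q0`, document ID, rank, score, system tag); that format is an assumption here, not something the log output confirms. The helper names `parse_run_line` and `top_documents` and the sample lines are hypothetical.

```python
from typing import Iterable, List, NamedTuple


class RunEntry(NamedTuple):
    qid: str
    docno: str
    rank: int
    score: float
    tag: str


def parse_run_line(line: str) -> RunEntry:
    """Parse one whitespace-separated TREC run line."""
    qid, _q0, docno, rank, score, tag = line.split()
    return RunEntry(qid, docno, int(rank), float(score), tag)


def top_documents(lines: Iterable[str], qid: str, k: int = 10) -> List[str]:
    """Return the docnos of the k best-ranked documents for one query."""
    entries = [parse_run_line(line) for line in lines if line.strip()]
    ranked = sorted((e for e in entries if e.qid == qid), key=lambda e: e.rank)
    return [e.docno for e in ranked[:k]]


# Hypothetical run lines; the docnos and scores are made up for illustration
sample_lines = [
    "1 Q0 doc-b 2 11.5 retrieval_system",
    "1 Q0 doc-a 1 12.0 retrieval_system",
    "2 Q0 doc-c 1 9.3 retrieval_system",
]
top = top_documents(sample_lines, "1", k=2)
print(top)
```

For the file written above, the lines would come from `open("../runs/run.txt")` instead of `sample_lines`.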
