# First Term Project: Cranfield Collection
“The Cranfield collection [...] was the pioneering test collection in allowing precise quantitative measures of information retrieval effectiveness [...]. Collected in the United Kingdom starting in the late 1950s, it contains 1398 abstracts of aerodynamics journal articles, a set of 225 queries, and exhaustive relevance judgments of all (query, document) pairs.” [1, Section 8.2]

Your tasks, reviewed by your colleagues and the course instructors, are the following:

1.   *Implement an unsupervised ranked retrieval system* [1, Chapter 6], which will produce a list of documents from the Cranfield collection in descending order of relevance to a query from the Cranfield collection. You MUST NOT use relevance judgements from the Cranfield collection in your information retrieval system. Relevance judgements MUST only be used for the evaluation of your information retrieval system.

2.   *Document your code* in accordance with [PEP 257](https://www.python.org/dev/peps/pep-0257/), ideally using [the NumPy style guide](https://numpydoc.readthedocs.io/en/latest/format.html#docstring-standard) as seen in the code from exercises.  
     *Stick to a consistent coding style* in accordance with [PEP 8](https://www.python.org/dev/peps/pep-0008/).

3.   *Reach at least 22% mean average precision* [1, Section 8.4] with your system on the Cranfield collection. You MUST record your score either in [the public leaderboard](https://docs.google.com/spreadsheets/d/e/2PACX-1vT0FoFzCptIYKDsbcv8LebhZDe_20GFeBAPmS-VyImlWbqET0T7I2iWy59p9SHbUe3LX1yJMhALPcCY/pubhtml) or in this Jupyter notebook. You are encouraged to use techniques for tokenization [1, Section 2.2], document representation [1, Section 6.4], tolerant retrieval [1, Chapter 3], relevance feedback and query expansion [1, Chapter 9], and others discussed in the course.

4.   _[Upload an .ipynb file](https://is.muni.cz/help/komunikace/spravcesouboru#k_ss_1) with this Jupyter notebook to the homework vault in IS MU._ You MAY also include a brief description of your information retrieval system and a link to an external service such as [Google Colaboratory](https://colab.research.google.com/), [DeepNote](https://deepnote.com/), or [JupyterHub](https://iirhub.cloud.e-infra.cz/).

[1] Manning, Christopher D., Prabhakar Raghavan, and Hinrich Schütze. [*Introduction to Information Retrieval*](https://nlp.stanford.edu/IR-book/pdf/irbookonlinereading.pdf). Cambridge University Press, 2008.

## Loading the Cranfield collection

First, we will install [our library](https://gitlab.fi.muni.cz/xstefan3/pv211-utils) and load the Cranfield collection.

In [1]:
# !pip install --upgrade git+https://github.com/MIR-MU/pv211-utils.git@spring2025

In [2]:
# !export PYTHONPATH=/home/jovyan/.local/bin:${PYTHONPATH}

In [3]:
# !pip install --upgrade nltk

### Loading the documents

Next, we will define a class named `Document` that will represent a preprocessed document from the Cranfield collection. Tokenization and preprocessing of the `title` and `body` attributes of the individual documents as well as the creative use of the `authors`, `bibliography`, and `title` attributes is left to your imagination and craftsmanship.

In [4]:
from pv211_utils.cranfield.entities import CranfieldDocumentBase

class Document(CranfieldDocumentBase):
    """
    A preprocessed Cranfield collection document.

    Parameters
    ----------
    document_id : str
        A unique identifier of the document.
    authors : list of str
        Unique identifiers of the authors of the document.
    bibliography : str
        The bibliographical entry for the document.
    title : str
        The title of the document.
    body : str
        The abstract of the document.

    """
    def __init__(self, document_id: str, authors: str, bibliography: str, title: str, body: str):
        super().__init__(document_id, authors, bibliography, title, body)

    def __str__(self):
        return f"{self.title} {self.title} {self.title} {self.body}"
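
The `__str__` method above repeats the title three times before the body, which presumably gives title terms more weight in a bag-of-words model. The following minimal sketch illustrates the effect on raw term counts. The names `weighted_text` and `title_weight` are hypothetical and are not part of `pv211_utils`.

```python
from collections import Counter


def weighted_text(title: str, body: str, title_weight: int = 3) -> str:
    """Concatenate the title `title_weight` times, followed by the body."""
    return " ".join([title] * title_weight + [body])


text = weighted_text("delta wing", "the wing oscillates")
counts = Counter(text.split())
print(counts["delta"], counts["wing"])
# prints: 3 4
```

Note that in document 200, shown further below, the body already starts with the title, so title terms there are counted once more than the repetition factor suggests.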

We will load documents into the `documents` [ordered dictionary](https://docs.python.org/3.8/library/collections.html#collections.OrderedDict). Each document is an instance of the `Document` class that we have just defined.

In [5]:
from pv211_utils.datasets import CranfieldDataset

cranfield = CranfieldDataset()

documents = cranfield.load_documents(Document)



In [6]:
print('\n'.join(repr(document) for document in list(documents.values())[:3]))
print('...')
print('\n'.join(repr(document) for document in list(documents.values())[-3:]))

<Document 1 “experimental investigation of the aerodynamics of  ...”>
<Document 2 “simple shear flow past a flat plate in an incompre ...”>
<Document 3 “the boundary layer in simple shear flow past a fla ...”>
...
<Document 1398 “stability of rectangular plates under shear and be ...”>
<Document 1399 “buckling of transverse stiffened plates under shea ...”>
<Document 1400 “the buckling shear stress of simply-supported infi ...”>


In [7]:
document = documents['200']
document

<Document 200 “calculation of derivatives for a cropped delta win ...”>

In [8]:
print(document.authors)

watson,j.


In [9]:
print(document.bibliography)

arc r + m 3060, 1958.


In [10]:
print(document.title)

calculation of derivatives for a cropped delta wing with subsonic leading edges oscillating in a supersonic airstream .


In [11]:
print(document.body)

calculation of derivatives for a cropped delta wing with subsonic leading edges oscillating in a supersonic airstream .   the lift, pitching moment and full-span constant-chord control hinge-moment are derived for a cropped delta wing describing harmonic plunging and pitching oscillations of small amplitude and low-frequency parameter in a supersonic air stream .  it is assumed that (a) the wing has subsonic leading edges, (b) the wing is sufficiently thin and the mach number sufficiently supersonic to permit the use of linearised theory .   expressions for the various derivative coefficients are obtained for a particular delta wing of aspect ratio 1.8 and taper ratio these are avaluated and tabulated for mach numbers 1.1, 1.15, 1.2, 1.3, 1.4, 1.5, 1.6 and 1.944 .


### Loading the queries
Next, we will define a class named `Query` that will represent a preprocessed query from the Cranfield collection. Tokenization and preprocessing of the `body` attribute of the individual queries is left to your craftsmanship.

In [12]:
from pv211_utils.cranfield.entities import CranfieldQueryBase

class Query(CranfieldQueryBase):
    """
    A preprocessed Cranfield collection query.

    Parameters
    ----------
    query_id : int
        A unique identifier of the query.
    body : str
        The text of the query.

    """
    def __init__(self, query_id: int, body: str):
        super().__init__(query_id, body)

    def __str__(self):
        return self.body

We will load queries into the `queries` [ordered dictionary](https://docs.python.org/3.8/library/collections.html#collections.OrderedDict). Each query is an instance of the `Query` class that we have just defined.

In [13]:
queries = cranfield.load_test_queries(Query)

In [14]:
queries

OrderedDict([(12,
              <Query 12 “how can the aerodynamic performance of channel flo ...”>),
             (31,
              <Query 31 “what size of end plate can be safely used to simul ...”>),
             (84,
              <Query 84 “references on the methods available for accurately ...”>),
             (32,
              <Query 32 “to find an approximate correction for thickness in ...”>),
             (204,
              <Query 204 “do viscous effects seriously modify pressure distr ...”>),
             (47,
              <Query 47 “what are the existing solutions for hypersonic vis ...”>),
             (118,
              <Query 118 “what are the aerodynamic interference effects on t ...”>),
             (156,
              <Query 156 “what qualitative and quantitative material is avai ...”>),
             (117,
              <Query 117 “is there any information on how the addition of a  ...”>),
             (179,
              <Query 179 “has a theory of quasi-conical

In [15]:
print('\n'.join(repr(query) for query in list(queries.values())[:3]))
print('...')
print('\n'.join(repr(query) for query in list(queries.values())[-3:]))

<Query 12 “how can the aerodynamic performance of channel flo ...”>
<Query 31 “what size of end plate can be safely used to simul ...”>
<Query 84 “references on the methods available for accurately ...”>
...
<Query 16 “can the transverse potential flow about a body of  ...”>
<Query 78 “has anyone explained the kink in the surge line of ...”>
<Query 146 “does a membrane theory exist by which the behaviou ...”>


In [16]:
query = queries[14]
query

<Query 14 “papers on shock-sound wave interaction .”>

In [17]:
print(query.body)

papers on shock-sound wave interaction .


In [18]:
from nltk import download

download('punkt_tab')

[nltk_data] Downloading package punkt_tab to /home/jovyan/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

In [20]:
import numpy as np

In [21]:
import re
import string

from gensim.utils import deaccent
from nltk import download
from nltk.corpus import stopwords, wordnet
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tag import pos_tag
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer

# download('averaged_perceptron_tagger_eng')
# download('stopwords')
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))  # Convert to set for fast lookup
stemmer = PorterStemmer()

def deacc(text):
    """Strip accents from every space-separated word of the text."""
    return ' '.join([deaccent(word) for word in text.split(' ')])

def split_hyphenated(tokens):
    """Split hyphenated words into individual words."""
    new_tokens = []
    for token in tokens:
        if "-" in token:
            new_tokens.extend(token.split("-"))  # Split into separate words
        else:
            new_tokens.append(token)
    return new_tokens

def remove_possessives(tokens):
    """Strip the possessive suffix 's from tokens (a bare 's token becomes empty)."""
    return [re.sub(r"'s\b", "", token) for token in tokens]

def get_wordnet_pos(word):
    """Map POS tag to first character for WordNetLemmatizer."""
    tag = pos_tag([word])[0][1][0].upper()
    return {'J': wordnet.ADJ, 'V': wordnet.VERB, 'N': wordnet.NOUN, 'R': wordnet.ADV}.get(tag, wordnet.NOUN)


def stem(tokens: list):
    """Apply stemming using Porter Stemmer."""
    return [stemmer.stem(word) for word in tokens]


def tokenize(text: str):
    """Normalize the text with regular expressions and tokenize it with NLTK."""
    text = re.sub(r'\d+', '', text)  # Remove digits
    text = text.replace('/', ' ')  # Treat slashes as word separators
    # Collapse dotted abbreviations such as "u.s.a." into "usa"
    text = re.sub(r'\b(\w\.){2,}', lambda m: m.group(0).replace('.', ''), text)
    text = re.sub(r'\s+', ' ', text).strip()  # Normalize spaces
    return word_tokenize(text)

def remove_punctuation(tokens):
    """Remove punctuation from tokenized words."""
    # `in` on a string is a substring test, so empty tokens are dropped as well.
    return [word for word in tokens if word not in string.punctuation]

def remove_stopwords(tokens: list, stop_words: set):
    """Remove the tokens that appear in `stop_words`."""
    return [word for word in tokens if word not in stop_words]


def lemmatize(tokens: list):
    """Lemmatize words using POS tags."""
    return [lemmatizer.lemmatize(word, get_wordnet_pos(word)) for word in tokens]


def preprocessing(text: str, corpus=None, query=False):
    """
    Full preprocessing pipeline with NLTK English stopword removal.

    The `corpus` and `query` parameters are currently unused.
    """
    text = text.lower()
    text = deacc(text)
    tokens = tokenize(text)
    tokens = remove_possessives(tokens)
    tokens = remove_punctuation(tokens)
    tokens = split_hyphenated(tokens)
    tokens = remove_stopwords(tokens, stop_words)  # Module-level NLTK stopword set
    tokens = lemmatize(tokens)
    # tokens = [token for token in tokens if token and not (len(token) == 1) and not (len(token)!=2 and token.endswith('.') )]
    return tokens

print(preprocessing("gugu gaga ahoj U.S.A. 132 reader's j."))

['gugu', 'gaga', 'ahoj', 'usa', 'reader', 'j']
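
The regular expressions in `tokenize` can be tried in isolation. The sketch below uses only the standard library and, as a simplifying assumption, replaces `word_tokenize` with whitespace splitting, so its output will not match the full pipeline on punctuation. The helper name `normalize` is hypothetical.

```python
import re


def normalize(text: str) -> list:
    """Apply the regex steps of `tokenize`, then split on spaces and hyphens."""
    text = re.sub(r'\d+', '', text)  # drop digits
    text = text.replace('/', ' ')  # slashes separate words
    # collapse dotted abbreviations: "u.s.a." -> "usa"
    text = re.sub(r'\b(\w\.){2,}', lambda m: m.group(0).replace('.', ''), text)
    text = re.sub(r'\s+', ' ', text).strip()  # normalize whitespace
    tokens = []
    for token in text.split(' '):
        tokens.extend(token.split('-'))  # "shock-sound" -> "shock", "sound"
    return tokens


print(normalize("u.s.a. 132 shock-sound a/b"))
# prints: ['usa', 'shock', 'sound', 'a', 'b']
```

A lone initial such as `j.` is left untouched by the abbreviation pattern, because the pattern requires at least two letter-dot pairs.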


In [22]:
# for idx in documents:
#     # print(documents[idx].body)
#     print(preprocessing(documents[idx].body))
#     if idx == '1000':
#         break

## Implementation of your information retrieval system

You can try the [preprocessing][1] and [systems][2] that are [available in our library][1], but feel free to implement your own.

 [1]: https://github.com/MIR-MU/pv211-utils/tree/main/pv211_utils/preprocessing
 [2]: https://github.com/MIR-MU/pv211-utils/tree/main/pv211_utils/systems
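
The system in this notebook relies on `BM25Plus` from the `rank_bm25` package. As a rough illustration of what such a scorer computes, here is a minimal sketch of the BM25+ formula as it is commonly stated, without the query term frequency component. It is not the library's implementation, which may differ in details such as how the delta term is applied to documents that lack a query term. The function name `bm25_plus_score` and the toy corpus are assumptions made for this example.

```python
import math
from collections import Counter


def bm25_plus_score(query, doc, corpus, k1=1.25, b=0.75, delta=1.0):
    """Score one tokenized document against a tokenized query with BM25+."""
    n_docs = len(corpus)
    avgdl = sum(len(d) for d in corpus) / n_docs
    tf = Counter(doc)
    score = 0.0
    for term in set(query):
        if tf[term] == 0:
            continue  # only matching terms contribute in this sketch
        df = sum(1 for d in corpus if term in d)
        idf = math.log((n_docs + 1) / df)
        # document length normalization of the term frequency
        norm = k1 * (1 - b + b * len(doc) / avgdl)
        score += idf * ((k1 + 1) * tf[term] / (norm + tf[term]) + delta)
    return score


corpus = [
    ["shock", "wave", "interaction"],
    ["boundary", "layer", "flow"],
    ["shock", "shock", "tube", "flow"],
]
query = ["shock", "interaction"]
scores = [bm25_plus_score(query, doc, corpus) for doc in corpus]
ranking = sorted(range(len(corpus)), key=lambda i: scores[i], reverse=True)
print(ranking)
# prints: [0, 2, 1]
```

The first document wins because it contains both query terms, including the rarer `interaction`, which has the higher inverse document frequency.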

In [23]:
from pv211_utils.entities import QueryBase, DocumentBase
from pv211_utils.irsystem import IRSystemBase
from rank_bm25 import BM25Plus
from sklearn.feature_extraction.text import TfidfVectorizer

from collections import OrderedDict
from typing import Iterable

In [24]:
for i in queries:
    queries[i].body = ' '.join(preprocessing(queries[i].body, query=True))

In [25]:
class BM25PlusSystemWithRerank(IRSystemBase):
    """
    A BM25+ ranking system with pseudo-relevance feedback.

    BM25+ is an extension of BM25, a bag-of-words retrieval function that ranks
    documents based on the query terms appearing in each document, regardless of
    their proximity within the document.

    Parameters
    ----------
    documents : OrderedDict
        The input documents.
    preprocessing : callable
        A function that turns a text into a list of tokens.
    k1 : float
        The BM25 k1 parameter, which controls term frequency saturation.
    b : float
        The BM25 b parameter. A larger b amplifies the effect of the document
        length relative to the average document length.
    d : float
        The delta parameter of BM25+.

    Attributes
    ----------
    bm25 : BM25Plus
        The ranking model.
    corpus : list of list of str
        The preprocessed documents without collection-specific stopwords.
    stop_words : set of str
        The collection-specific stopwords removed from the corpus.
    index : dict of (int, Document)
        A mapping from indexed document numbers to documents.

    """

    def __init__(self, documents: OrderedDict, preprocessing,
                 k1: float = 1.25, b: float = 0.75, d: float = 1):
        self.preprocessing = preprocessing

        docs_values = documents.values()

        self.corpus = [self.preprocessing(str(document)) for document in docs_values]
        self.stop_words = self.extract_collection_stopwords(self.corpus)
        print(self.stop_words)
        self.corpus = [[word for word in doc if word not in self.stop_words] for doc in self.corpus]
        self.bm25 = BM25Plus(self.corpus, k1=k1, b=b, delta=d)
        self.index = dict(enumerate(docs_values))

    @staticmethod
    def extract_weighted_expansion_terms(top_docs, query_terms, num_terms=3):
        """
        Extract expansion terms from top-ranked documents by summed TF-IDF weight.

        Unlike `extract_expansion_terms`, `top_docs` must be an iterable of strings.
        """
        vectorizer = TfidfVectorizer()
        tfidf_matrix = vectorizer.fit_transform(top_docs)
        feature_names = vectorizer.get_feature_names_out()

        term_scores = tfidf_matrix.sum(axis=0).A1
        term_ranking = sorted(zip(feature_names, term_scores), key=lambda x: x[1], reverse=True)

        expansion_terms = [term for term, _ in term_ranking if term not in query_terms][:num_terms]
        return expansion_terms

    @staticmethod
    def extract_expansion_terms(top_docs, query_terms, num_terms=3):
        """
        Extract the most frequent non-query terms from top-ranked documents.
        """
        term_scores = {}

        for doc in top_docs:
            # print(doc)
            for term in doc:
                if term in query_terms:
                    continue
                if term not in term_scores:
                    term_scores[term] = 0
                term_scores[term] += 1  # Count term occurrences

        sorted_terms = sorted(term_scores.items(), key=lambda x: x[1], reverse=True)
        expansion_terms = [term for term, _ in sorted_terms[:num_terms]]
        return expansion_terms

    def search(self, query: QueryBase, top_n=1, num_expansion_terms=3) -> Iterable[DocumentBase]:
        """
        Perform BM25+ search with pseudo-relevance feedback (PRF).

        Parameters
        ----------
        query : QueryBase
            The query to search for.
        top_n : int
            Number of top documents to consider for PRF.
        num_expansion_terms : int
            Number of expansion terms to add to the query.

        Yields
        ------
        DocumentBase
            The documents in descending order of relevance to the expanded query.

        """
        query = self.preprocessing(str(query))
        # First pass: rank all documents by the original query.
        scores = self.bm25.get_scores(query)
        top_doc_indices = scores.argsort()[::-1][:top_n]
        top_docs = [self.corpus[idx] for idx in top_doc_indices]

        # Expand the query with frequent non-query terms from the top documents.
        expansion_terms = self.extract_expansion_terms(top_docs, set(query), num_expansion_terms)
        expanded_query = " ".join(query) + " " + " ".join(expansion_terms)
        # expanded_query = self.expand_query_with_synonyms(expanded_query)
        # Second pass: rank all documents by the expanded query.
        new_scores = self.bm25.get_scores(expanded_query.split())
        final_ranked_docs = new_scores.argsort()[::-1]
        for doc in final_ranked_docs:
            yield self.index[doc]

    @staticmethod
    def get_synonyms(word):
        """Return the WordNet lemma names from all synsets of the word."""
        synonyms = set()
        for syn in wordnet.synsets(word):
            for lemma in syn.lemmas():
                synonyms.add(lemma.name())
        return list(synonyms)

    def expand_query_with_synonyms(self, query):
        """Append at most one WordNet synonym per query term to the query string."""
        expanded_terms = []
        for term in query.split():
            expanded_terms.extend(self.get_synonyms(term)[:1])
        return query + " " + " ".join(expanded_terms)

    @staticmethod
    def extract_collection_stopwords(corpus, max_df=0.90, min_df=3, top_n=250):
        """
        Identify collection-specific stopwords using TF-IDF.

        Parameters
        ----------
        corpus : list of list of str
            The preprocessed (tokenized) documents.
        max_df : float
            Terms that appear in more than this fraction of documents are ignored.
        min_df : int
            Terms that appear in fewer than this number of documents are ignored.
        top_n : int
            The number of remaining terms with the lowest summed TF-IDF weight
            to return as stopwords.

        Returns
        -------
        set of str
            The collection-specific stopwords.

        """
        corpus = [" ".join(doc) for doc in corpus]

        vectorizer = TfidfVectorizer(max_df=max_df, min_df=min_df, stop_words=None)
        X = vectorizer.fit_transform(corpus)

        word_scores = np.asarray(X.sum(axis=0)).flatten()
        words = vectorizer.get_feature_names_out()

        sorted_words = [word for _, word in sorted(zip(word_scores, words))]

        return set(sorted_words[:top_n])

system = BM25PlusSystemWithRerank(documents, preprocessing, k1=2.1, b=0.9, d=1.0)


{'paramount', 'basically', 'log', 'shanley', 'peculiarity', 'alleviation', 'prevention', 'constitutes', 'correspondingly', 'get', 'recommendation', 'aero', 'premature', 'answer', 'box', 'ray', 'chamber', 'unfortunately', 'irregular', 'monotonically', 'distinguish', 'amplify', 'predominantly', 'flexibly', 'insofar', 'resemble', 'salient', 'sutherland', 'caloric', 'shoulder', 'advantageous', 'denotes', 'sort', 'overlap', 'brass', 'intimate', 'favor', 'abrupt', 'scientific', 'nuclear', 'gaussian', 'tangency', 'ture', 'arrhenius', 'virtually', 'afterburning', 'desirability', 'implies', 'seventh', 'formally', 'necessitate', 'precipitate', 'adjustable', 'biconvex', 'vacuum', 'augmentation', 'sears', 'viscoelastic', 'sin', 'and', 'tensile', 'similarly', 'pronounce', 'dive', 'counter', 'unlikely', 'compact', 'success', 'whirl', 'proposes', 'key', 'equip', 'penetrate', 'evident', 'undershoot', 'go', 'bessel', 'retains', 'logical', 'bubble', 'british', 'expressible', 'stein', 'supersonically', '
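
The `search` method implements pseudo-relevance feedback: it ranks the documents once, takes the most frequent non-query terms from the top-ranked documents, and ranks again with the expanded query. The sketch below isolates the expansion step using only the standard library. The function `expand_query` and the toy documents are hypothetical, and the term selection is a simplified stand-in for `extract_expansion_terms`.

```python
from collections import Counter


def expand_query(query_tokens, top_docs, num_terms=3):
    """Append the most frequent non-query terms of `top_docs` to the query."""
    query_terms = set(query_tokens)
    counts = Counter(
        term for doc in top_docs for term in doc if term not in query_terms
    )
    expansion = [term for term, _ in counts.most_common(num_terms)]
    return list(query_tokens) + expansion


top_docs = [
    ["shock", "wave", "interaction", "wave", "sound"],
    ["sound", "wave", "interaction", "jet"],
]
expanded = expand_query(["shock", "sound"], top_docs, num_terms=2)
print(expanded)
# prints: ['shock', 'sound', 'wave', 'interaction']
```

With `top_n=1`, the default of `search`, only the single best document from the first pass contributes expansion terms.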

## Evaluation
Finally, we will evaluate your information retrieval system using [the Mean Average Precision](https://en.wikipedia.org/wiki/Evaluation_measures_(information_retrieval)#Mean_average_precision) (MAP) evaluation measure.
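
For intuition, mean average precision [1, Section 8.4] can be computed by hand as in the following minimal sketch. It is independent of `CranfieldEvaluation`, whose internals may differ; the function names and the toy rankings are made up for illustration.

```python
def average_precision(ranking, relevant):
    """Average of the precision values at the ranks of the relevant documents."""
    if not relevant:
        return 0.0
    hits, precision_sum = 0, 0.0
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank
    # relevant documents that were never retrieved contribute a precision of 0
    return precision_sum / len(relevant)


def mean_average_precision(rankings, judgements):
    """Mean of the average precision over all queries."""
    total = sum(average_precision(rankings[q], judgements[q]) for q in rankings)
    return total / len(rankings)


rankings = {"q1": ["d1", "d2", "d3", "d4"], "q2": ["d2", "d1", "d3"]}
judgements = {"q1": {"d1", "d3"}, "q2": {"d3"}}
print(round(mean_average_precision(rankings, judgements), 4))
# prints: 0.5833
```

Here the first query scores (1/1 + 2/3) / 2 and the second scores 1/3, so the mean is 7/12.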

In [27]:
from pv211_utils.cranfield.loader import load_judgements
from pv211_utils.cranfield.leaderboard import CranfieldLeaderboard
from pv211_utils.cranfield.eval import CranfieldEvaluation

submit_result = False
author_name = 'Balek, Vojtěch'

test_judgements = load_judgements(queries, documents)
leaderboard = CranfieldLeaderboard()
evaluation = CranfieldEvaluation(system, test_judgements, leaderboard=leaderboard, author_name=author_name)
evaluation.evaluate(queries, submit_result)

Your system achieved **44.24% MAP score**.

Congratulations, you passed the **22%** minimum! 🥳

Your result has been submitted to [the leaderboard](https://docs.google.com/spreadsheets/d/e/2PACX-1vSLY-jk70GJZSZjJYMKxh6CMBl47KDP6OFjrY_zIMUF9YRwTLl_DSU1mXCrBPiHyUxqav0URYtVP2PK/pubhtml)! 🏆