<a href="https://colab.research.google.com/github/tamteaa/CS5100B-Project/blob/main/reranking.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Reranking Retrieval Results

In this notebook, you will continue using the [Pyserini](http://pyserini.io/) library's indexing and retrieval models.  This time, however, you will get an initial set of retrieval results and then write your own reranking code to try to move relevant documents higher in the list.

As before, we start by installing the Python interface. Since it calls the underlying Lucene search engine, which is written in Java, we make sure we point to an appropriate Java installation. If, like Colab, your environment does not have Java 21, run the following code, or do whatever makes sense for your platform.

In [1]:
## Install Java 21 on Colab (comment this cell out if Java 21 is already available)
!apt-get install openjdk-21-jre-headless -qq > /dev/null
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-21-openjdk-amd64"
!update-alternatives --set java /usr/lib/jvm/java-21-openjdk-amd64/bin/java
!java -version

openjdk version "21.0.6" 2025-01-21
OpenJDK Runtime Environment (build 21.0.6+7-Ubuntu-122.04.1)
OpenJDK 64-Bit Server VM (build 21.0.6+7-Ubuntu-122.04.1, mixed mode, sharing)


In [2]:
!pip install pyserini
# You can change this to gpu if you have one.
# It's a pyserini dependency, but we won't need it until the next assignment.
!pip install faiss-cpu

Collecting pyserini
  Downloading pyserini-0.44.0.tar.gz (195.3 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 195.3/195.3 MB 6.3 MB/s eta 0:00:00
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting pyjnius>=1.6.0 (from pyserini)
  Downloading pyjnius-1.6.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (10 kB)
Collecting onnxruntime>=1.8.1 (from pyserini)
  Downloading onnxruntime-1.21.1-cp311-cp311-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (4.5 kB)
Collecting tiktoken>=0.4.0 (from pyserini)
  Downloading tiktoken-0.9.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.7 kB)
Collecting coloredlogs (from onnxruntime>=1.8.1->pyserini)
  Downloading coloredlogs-15.0.1-py2.py3-none-any.whl.metadata (12 kB)
Collecting nvidia-cuda-nvrtc-cu12==12.4.127 (from to

We initialize the searcher with a pre-built index for the Robust04 collection, which Pyserini will automatically download if it hasn't already. Note that the index takes up 1.6GB of disk.

In [3]:
from pyserini.search.lucene import LuceneSearcher

searcher = LuceneSearcher.from_prebuilt_index('robust04')

Downloading index at https://rgw.cs.uwaterloo.ca/pyserini/indexes/lucene/lucene-inverted.disk45.20240803.36f7e3.tar.gz...


lucene-inverted.disk45.20240803.36f7e3.tar.gz: 1.66GB [00:20, 85.6MB/s]                           


Now we can search for a query and inspect the results:

In [4]:
hits = searcher.search('black bear attacks', 1000)

# Prints the first 10 hits
for i in range(0, 10):
    print(f'{i+1:2} {hits[i].docid:15} {hits[i].score:.5f}')

 1 LA092790-0015   7.06680
 2 LA081689-0039   6.89020
 3 FBIS4-16530     6.61630
 4 LA102589-0076   6.46450
 5 FT932-15491     6.25090
 6 FBIS3-12276     6.24630
 7 LA091090-0085   6.17030
 8 FT922-13519     6.04270
 9 LA052790-0205   5.94060
10 LA103089-0041   5.90650


The `LuceneIndexReader` class provides various methods to read the index directly. For example, we can fetch a raw document from the index given its `docid`:

In [5]:
from pyserini.index import LuceneIndexReader
from IPython.display import display, HTML

reader = LuceneIndexReader.from_prebuilt_index('robust04')

doc = reader.doc('LA092790-0015').raw()
display(HTML('<div style="font-family: Times New Roman; padding-bottom:10px">' + doc + '</div>'))

Note that this is the raw text of the top-ranked hit from the search above. Given the raw text, we can obtain its analyzed form (i.e., tokenized, stemmed, and with stopwords removed). Here we show the first ten tokens:

In [6]:
analyzed = reader.analyze(doc)
analyzed[0:10]

['date',
 'p',
 'septemb',
 '27',
 '1990',
 'thursdai',
 'ventura',
 'counti',
 'edit',
 'p']

## Retrieving Initial Ranked Lists

We can load some standard evaluation sets such as Robust04, which contains 250 queries, or "topics" as the TREC conferences call them.

In [7]:
from pyserini.search import get_topics
topics = get_topics('robust04')
print(f'{len(topics)} queries total')

250 queries total


The topics are in a dictionary, whose keys are integers uniquely identifying each query. Each topic contains the following fields:

* `title`: TREC's term for the brief query a user might actually type;
* `description`: a longer form of the query in the form of a complete sentence; and
* `narrative`: a description of what the user is looking for and what kinds of results would be relevant or non-relevant.

In [8]:
topics[301]

{'narrative': 'A relevant document must as a minimum identify the organization and the type of illegal activity (e.g., Columbian cartel exporting cocaine). Vague references to international drug trade without identification of the organization(s) involved would not be relevant.',
 'description': 'Identify organizations that participate in international criminal activity, the activity, and, if possible, collaborating organizations and the countries involved.',
 'title': 'International Organized Crime'}

For the purpose of your experiments, we'll divide the topics into a development set and a test set.

In [9]:
dev_topics = {k:topics[k] for k in list(topics.keys())[:125]}
test_topics = {k:topics[k] for k in list(topics.keys())[125:]}

Now, we'll fetch the relevance judgments for the Robust04 queries, which TREC calls "qrels".

In [10]:
from urllib.request import urlopen

qfile = 'https://github.com/castorini/anserini-tools/blob/63ceeab1dd94c1221f29b931d868e8fab67cc25c/topics-and-qrels/qrels.robust04.txt?raw=true'
qrels = []
for line in urlopen(qfile):
  # Each line holds four byte fields: query id, round (always 0), docid, relevance.
  # The round is bound to _round so the built-in round() is not shadowed.
  qid, _round, docid, score = line.strip().split()
  qrels.append([int(qid), 0, docid.decode('UTF-8'), int(score)])

Each record in the qrel contains four fields:

1. the numeric identifier of the query;
2. the round of relevance feedback, which is here always 0;
3. the identifier of a document that has been judged; and
4. the relevance score of that document.

In Robust04, all relevance judgments are binary, i.e., 1 or 0. Note that not all non-relevant documents are recorded. The qrel file only contains those documents the annotators actually looked at; the vast majority of documents in the collection have not been judged. In IR evaluation, we assume that unannotated documents are non-relevant.
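The convention that unjudged documents count as non-relevant can be made explicit with a lookup that defaults to 0. The following is a minimal sketch on a few sample lines in the four-column layout described above; the `judgments` dictionary and the `relevance` helper are illustrative names, not part of Pyserini.

```python
from collections import defaultdict

# Sample lines in the four-column qrels layout: query id, round, docid, relevance.
sample_lines = [
    "301 0 FBIS3-10082 1",
    "301 0 FBIS3-10169 0",
    "301 0 FBIS3-10243 1",
]

# Map each query id to {docid: relevance} for the judged documents only.
judgments = defaultdict(dict)
for line in sample_lines:
    qid, _round, docid, rel = line.split()
    judgments[int(qid)][docid] = int(rel)

def relevance(qid, docid):
    """Return the judged relevance, or 0 if the document was never judged."""
    return judgments.get(qid, {}).get(docid, 0)

print(relevance(301, "FBIS3-10082"))    # judged relevant
print(relevance(301, "FBIS3-10169"))    # judged non-relevant
print(relevance(301, "LA092790-0015"))  # absent from the sample, so treated as non-relevant
```

Using `.get` with a default, rather than indexing the `defaultdict` directly, avoids creating empty entries for queries that have no judgments.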

In [11]:
qrels[0:10]

[[301, 0, 'FBIS3-10082', 1],
 [301, 0, 'FBIS3-10169', 0],
 [301, 0, 'FBIS3-10243', 1],
 [301, 0, 'FBIS3-10319', 0],
 [301, 0, 'FBIS3-10397', 1],
 [301, 0, 'FBIS3-10491', 1],
 [301, 0, 'FBIS3-10555', 0],
 [301, 0, 'FBIS3-10622', 1],
 [301, 0, 'FBIS3-10634', 0],
 [301, 0, 'FBIS3-10635', 0]]

We collect the top 1000 hits for both the dev and test sets.

In [12]:
# Compute top-k lists (default k=1000) for every query in the given topics
def topic_hits(searcher, topics, k=1000):
  hits = {}
  for topic, info in topics.items():
    print(topic, info['title'])
    hits[topic] = [(hit.docid, hit.score) for hit in searcher.search(info['title'], k)]
  return hits

dev_hits = topic_hits(searcher, dev_topics)
test_hits = topic_hits(searcher, test_topics)

350 Health and Computer Terminals
351 Falkland petroleum exploration
352 British Chunnel impact
353 Antarctica exploration
354 journalist risks
355 ocean remote sensing
356 postmenopausal estrogen Britain
357 territorial waters dispute
358 blood-alcohol fatalities
359 mutual fund predictors
360 drug legalization benefits
361 clothing sweatshops
362 human smuggling
363 transportation tunnel disasters
364 rabies
365 El Nino
366 commercial cyanide uses
367 piracy
368 in vitro fertilization
369 anorexia nervosa bulimia
370 food/drug laws
371 health insurance holistic
372 Native American casino
373 encryption equipment export
374 Nobel prize winners
375 hydrogen energy
376 World Court
377 cigar smoking
378 euro opposition
379 mainstreaming
380 obesity medical treatment
381 alternative medicine
382 hydrogen fuel automobiles
383 mental illness drugs
384 space station moon
385 hybrid fuel cars
386 teaching disabled children
387 radioactive waste
388 organic soil enhancement
389 illegal technol

## Evaluating Initial Ranked Lists



When reranking, an important metric is the _recall_ of the initial set of results. This tells us the upper bound or "headroom" on the improvements that reranking can achieve. If the recall in the initial ranked lists is too low, we know we need to optimize the initial retrieval model.

For this assignment, you will work with fixed initial ranked lists from pyserini's BM25 model, but it's still useful to see how much room there is for improvement during reranking.
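To make the idea of headroom concrete, the sketch below computes recall for a single query and lists the relevant documents that the initial retrieval missed. The document identifiers are hypothetical toy data, and `recall_and_missed` is an illustrative helper rather than anything provided by Pyserini.

```python
def recall_and_missed(candidates, relevant):
    """Return (recall, missed) for one query.

    candidates: ranked list of docids from the initial retrieval.
    relevant:   set of docids judged relevant for the query.
    """
    found = relevant & set(candidates)
    missed = relevant - found
    recall = len(found) / len(relevant) if relevant else None
    return recall, missed

# Hypothetical toy data: four relevant documents, two of which were retrieved.
toy_candidates = ["d7", "d2", "d9", "d4", "d1"]
toy_relevant = {"d2", "d4", "d5", "d8"}

recall, missed = recall_and_missed(toy_candidates, toy_relevant)
print(f"recall = {recall:.2f}")      # 2 of the 4 relevant documents were retrieved
print(f"missed = {sorted(missed)}")  # no reordering of toy_candidates can surface these
```

Any reranker that only reorders `toy_candidates` can promote `d2` and `d4`, but it can never return `d5` or `d8`.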

As before, you should process the `qrels` data to find the relevant results for each query.

In [13]:
## TODO [15 points]: Compute Recall@1000 for the dev_hits and test_hits data
## and print it out.

from collections import defaultdict

relevant_docs_all = defaultdict(set)
for qid, _, docid, score in qrels:
    if score > 0:
        relevant_docs_all[qid].add(docid)

def calculate_recall_at_k(qid, hits_list, relevant_docs, k):
    retrieved_docids = {docid for docid, score in hits_list[:k]}
    relevant_set = relevant_docs.get(qid, set())
    total_relevant = len(relevant_set)

    if total_relevant == 0:
        return None

    num_relevant_retrieved = len(retrieved_docids.intersection(relevant_set))
    return num_relevant_retrieved / total_relevant

def average_recall_at_k(topic_hits_dict, relevant_docs, topics_dict, k=1000):
    total_recall = 0
    num_queries_evaluated = 0
    for qid in topics_dict.keys():
        if qid in topic_hits_dict:
            if qid in relevant_docs:
                 recall = calculate_recall_at_k(qid, topic_hits_dict[qid], relevant_docs, k)
                 if recall is not None:
                     total_recall += recall
                     num_queries_evaluated += 1

    if num_queries_evaluated == 0:
        return 0.0

    return total_recall / num_queries_evaluated

recall_dev = average_recall_at_k(dev_hits, relevant_docs_all, dev_topics, k=1000)
recall_test = average_recall_at_k(test_hits, relevant_docs_all, test_topics, k=1000)

print(f"Recall@1000 (Dev): {recall_dev:.4f}")
print(f"Recall@1000 (Test): {recall_test:.4f}")

Recall@1000 (Dev): 0.6985
Recall@1000 (Test): 0.6993


For a given set of top-1000 lists, Recall@1000 will not change after reranking. What will change are ranking-based metrics like MAP and NDCG. You should compute MAP@1000 for the initial `dev_hits` and `test_hits` data.
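The difference between set-based and rank-based metrics can be checked on a toy example. The sketch below uses hypothetical document identifiers and compact, illustrative versions of the two metrics. It reorders one candidate list so that every retrieved relevant document comes first: recall is unchanged, average precision rises, and under this ideal ordering average precision equals recall.

```python
def recall_at_k(ranking, relevant, k):
    """Fraction of the relevant documents that appear in the top k."""
    return len(set(ranking[:k]) & relevant) / len(relevant)

def average_precision_at_k(ranking, relevant, k):
    """Sum of precision at each relevant rank, divided by the number of relevant docs."""
    hits, total = 0, 0.0
    for rank, docid in enumerate(ranking[:k], 1):
        if docid in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)

# Hypothetical toy data: "z" is relevant but was never retrieved.
toy_relevant = {"a", "b", "c", "z"}
initial = ["x", "a", "y", "b", "w", "c"]

# Ideal reranking: a stable sort that moves every relevant candidate to the top.
ideal = sorted(initial, key=lambda docid: docid not in toy_relevant)

for name, ranking in [("initial", initial), ("ideal", ideal)]:
    r = recall_at_k(ranking, toy_relevant, k=6)
    ap = average_precision_at_k(ranking, toy_relevant, k=6)
    print(f"{name:8} recall@6 = {r:.3f}  AP@6 = {ap:.3f}")
```

With the relevant candidates at ranks 1 through 3, every precision term is 1, so the average precision is 3/4, which is exactly the recall.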

In [14]:
## TODO [10 points]: Adapt your code from Homework 3 to compute MAP@1000 for
## the dev_hits and test_hits data and print it out.

def compute_ap_at_k(hits_list, relevant_docs_for_query, k=1000):
    num_relevant = len(relevant_docs_for_query)
    if num_relevant == 0:
        return 0.0

    hits_at_k = hits_list[:k]
    num_correct = 0
    sum_precisions = 0.0

    for rank, hit in enumerate(hits_at_k, 1):
        doc_id = hit[0] # hit is (docid, score)
        if doc_id in relevant_docs_for_query:
            num_correct += 1
            precision_at_rank = num_correct / rank
            sum_precisions += precision_at_rank

    return sum_precisions / num_relevant

def compute_map_at_k(topic_hits_dict, relevant_docs_all, topics_dict, k=1000):
    aps = []
    for qid in topics_dict.keys():
        if qid in topic_hits_dict and qid in relevant_docs_all:
            relevant_docs_for_query = relevant_docs_all[qid]
            if relevant_docs_for_query:
              ap = compute_ap_at_k(topic_hits_dict[qid], relevant_docs_for_query, k)
              aps.append(ap)

    if not aps:
        return 0.0
    return sum(aps) / len(aps)

map_dev_initial = compute_map_at_k(dev_hits, relevant_docs_all, dev_topics, k=1000)
map_test_initial = compute_map_at_k(test_hits, relevant_docs_all, test_topics, k=1000)

print(f"Initial MAP@1000 (Dev): {map_dev_initial:.4f}")
print(f"Initial MAP@1000 (Test): {map_test_initial:.4f}")

Initial MAP@1000 (Dev): 0.2426
Initial MAP@1000 (Test): 0.2637


## Reranking Search Results

In this final part of the assignment, you should implement a ranking function that, hopefully, improves on the baseline BM25 ranking. You may use the BM25 score for each document as input, as well as the query, of course, and any other properties of the documents you look up with the `reader` object.  After computing a new score for each candidate, re-sort the top-1000 results by your model's score.

You may use anything you've learned in this course (or in another course) to build your ranking function. For example, you could implement pseudo-relevance feedback or a relevance model, which would treat the top of each ranked list (e.g., the top 100) as if it were truly relevant and retrain model parameters. You could tune different BM25, query likelihood, or sequential dependence model parameters. You could try to learn different weights or embeddings for different fields in documents. You could use implementations of transformer language models such as BERT or SentenceBERT to score the compatibility of queries and documents. To be clear, you don't have to use any of these approaches; you are free to try whatever ideas you like.
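As one illustration of the pseudo-relevance feedback idea, the sketch below reranks a toy candidate list by interpolating each first-pass score with a feedback score. It is only a sketch under stated assumptions: `doc_terms` is a hypothetical stand-in for term-frequency vectors read from an index, and the length normalization, min-max scaling, `n_feedback`, and `beta` are illustrative choices rather than a prescribed method.

```python
from collections import Counter

def prf_rerank(hits, doc_terms, n_feedback=2, beta=0.5):
    """Rerank (docid, score) pairs by interpolating with a feedback score.

    hits:      list of (docid, first_pass_score), best first.
    doc_terms: dict mapping docid -> Counter of term frequencies.
    """
    # Pool term counts from the top-n documents, which are assumed relevant.
    pooled = Counter()
    for docid, _ in hits[:n_feedback]:
        pooled.update(doc_terms[docid])
    total = sum(pooled.values()) or 1
    centroid = {term: count / total for term, count in pooled.items()}

    def feedback_score(docid):
        # Length-normalized overlap between the document and the centroid.
        terms = doc_terms[docid]
        length = sum(terms.values()) or 1
        return sum(w * terms[t] / length for t, w in centroid.items())

    def min_max(values):
        lo, hi = min(values), max(values)
        return [0.0 if hi == lo else (v - lo) / (hi - lo) for v in values]

    # Put both score types on a [0, 1] scale before mixing them.
    first = min_max([score for _, score in hits])
    feedback = min_max([feedback_score(docid) for docid, _ in hits])
    rescored = [(docid, (1 - beta) * a + beta * b)
                for (docid, _), a, b in zip(hits, first, feedback)]
    return sorted(rescored, key=lambda pair: pair[1], reverse=True)

# Hypothetical toy data: d3 is off-topic but initially outranks d4.
toy_hits = [("d1", 3.0), ("d2", 2.5), ("d3", 2.0), ("d4", 1.5)]
toy_terms = {
    "d1": Counter(bear=3, attack=2),
    "d2": Counter(bear=2, hiker=2),
    "d3": Counter(stock=4, market=2),
    "d4": Counter(bear=2, attack=1, hiker=1),
}
reranked = prf_rerank(toy_hits, toy_terms)
print([docid for docid, _ in reranked])
```

Because the output contains exactly the same documents as the input, Recall@1000 is unaffected; only the order, and therefore MAP, can change.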

If your reranking model has tunable parameters, you should tune them on the `dev_hits` set. In the end, you will also evaluate MAP@1000 on the `test_hits` set.

**TODO**: Put any explanation of your reranking function here.

In [35]:
# --- Parameters for QE (Tune on dev set) ---
QE_PARAMS = {
    'N': 10,  # Number of feedback documents to analyze
    'M': 5   # Number of top terms to add to the query
}

In [41]:
## TODO [70 points]: Implement a reranking function that takes a query, the
## reader, and an initial ranking and computes new scores.
## Like BM25, higher should be better.
## If you train parameters or set hyperparameters for this ranking function,
## do that here, as well.


import math
from collections import defaultdict

# --- Parameters ---
PRF_PARAMS = {
    'N': 10,     # Feedback docs
    'ALPHA': 1.0, # Original score weight
    'BETA': 0.01   # Feedback score weight
}

# --- Helper Functions ---

def build_feedback_vector(feedback_docs_data, reader):

    aggregated_terms = defaultdict(int)
    for f_docid_str, _ in feedback_docs_data:
        try:
            doc_vector = reader.get_document_vector(f_docid_str)
            if doc_vector:
                for term, count in doc_vector.items():
                    aggregated_terms[term] += count
        except Exception:
            pass
    return aggregated_terms

def score_doc_vs_vector(docid_str, query_vector, reader):

    score = 0.0
    try:
        doc_vector = reader.get_document_vector(docid_str)
        if doc_vector:
             vector_len = sum(query_vector.values())
             if vector_len > 0:
                 for term, weight in query_vector.items():
                     if term in doc_vector:
                         score += (weight / vector_len) * doc_vector[term]
    except Exception:
        pass
    return score

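def calculate_new_score(docid_str, feedback_vector, original_bm25_score, reader, params):
    # Interpolate the original BM25 score with the feedback score.
    # ALPHA defaults to 1.0 so parameter dicts without it behave as before.
    alpha = params.get('ALPHA', 1.0)
    beta = params['BETA']
    prf_component_score = score_doc_vs_vector(docid_str, feedback_vector, reader)
    return alpha * original_bm25_score + beta * prf_component_score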


In [51]:
# --- Main Query Expansion Reranking Logic ---

def rerank(topic_hits_dict, topics_dict, reader, searcher, params):
    reranked_hits_all = {}
    total_queries = len(topic_hits_dict)
    processed_queries = 0

    N = params['N'] # Number of feedback docs
    M = params['M'] # Number of expansion terms

    # Get collection statistics needed for IDF
    try:
        stats = reader.stats()
        total_docs = stats['documents']
        if total_docs == 0:
             print("Warning: Total documents reported as 0. IDF calculation might fail.")
             total_docs = 1 # Avoid division by zero, though results might be odd
    except Exception as e:
        print(f"Error getting reader stats: {e}")
        raise

    for qid, initial_hits in topic_hits_dict.items():
        processed_queries += 1
        if processed_queries % 10 == 0 or processed_queries == total_queries:
            print(f'  QE-TFIDF Reranking query {processed_queries}/{total_queries} (ID: {qid})...')

        if not initial_hits:
            reranked_hits_all[qid] = []
            continue

        query_info = topics_dict[qid]
        original_query_text = query_info['title']

        # Step 1: Get Top N Docs for Feedback
        feedback_docs_data = initial_hits[:N]

        # Step 2: Build Aggregated Term Vector from Feedback Docs
        feedback_vector = build_feedback_vector(feedback_docs_data, reader) # Uses the existing helper

        # Step 3: Select Top M Expansion Terms using TF * IDF Weight
        expansion_terms = []
        if feedback_vector:
            term_scores = []
            for term, tf_agg in feedback_vector.items():
                try:
                    # Get document frequency (df) for IDF calculation
                    df, cf = reader.get_term_counts(term, analyzer=None) # Use analyzer=None if terms are already stemmed/processed
                    if df > 0:
                        idf = math.log(total_docs / df)
                        # Score term by aggregated TF * IDF
                        score = tf_agg * idf
                        term_scores.append((term, score))
                    # else: ignore terms with df=0 (shouldn't happen for terms in the index)
                except Exception as e:
                    # print(f"    Warning: Could not get counts for term '{term}': {e}")
                    pass # Skip terms if counts fail

            # Sort terms by TF*IDF score in descending order
            term_scores.sort(key=lambda item: item[1], reverse=True)
            # Get the top M terms
            expansion_terms = [term for term, score in term_scores[:M]]
            # print(f"    Expansion terms (TF-IDF): {expansion_terms}") # Optional

        # Step 4: Construct Expanded Query
        if expansion_terms:
            # Filter out original query terms from expansion terms to avoid duplication issues? Optional.
            # original_analyzed = set(reader.analyze(original_query_text))
            # expansion_terms_filtered = [t for t in expansion_terms if t not in original_analyzed]
            # expanded_query_text = original_query_text + " " + " ".join(expansion_terms_filtered)
            expanded_query_text = original_query_text + " " + " ".join(expansion_terms)
        else:
            expanded_query_text = original_query_text

        # print(f"    Expanded query: {expanded_query_text}") # Optional

        # Step 5: Re-run the Search with the Expanded Query
        try:
            new_hits_list = searcher.search(expanded_query_text, k=1000)
            reranked_hits_all[qid] = [(hit.docid, hit.score) for hit in new_hits_list]
        except Exception as e:
            print(f"    Error searching with expanded query for qid {qid}: {e}")
            reranked_hits_all[qid] = initial_hits # Fallback

    return reranked_hits_all
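Because `rerank` issues a fresh search with the expanded query, the list it returns may contain documents that were not among the initial top 1000, so Recall@1000 is no longer guaranteed to stay fixed. If a strict reranking of the original candidates is wanted, one possible approach is to keep only the initial candidates and re-sort them by the new scores. The helper below is a hypothetical sketch of that idea; how it scores candidates missing from the second search is an assumption made purely to preserve their order.

```python
def restrict_to_candidates(initial_hits, new_hits):
    """Re-sort the initial candidates using scores from a second search.

    initial_hits: list of (docid, score) from the first-pass retrieval.
    new_hits:     list of (docid, score) from the expanded-query search.
    Candidates that the second search did not return are kept, in their
    original relative order, below every candidate that it did return.
    """
    new_score = dict(new_hits)
    rescored = [(d, new_score[d]) for d, _ in initial_hits if d in new_score]
    rescored.sort(key=lambda pair: pair[1], reverse=True)

    # Give leftovers strictly decreasing scores below the lowest new score.
    floor = min((score for _, score in rescored), default=0.0)
    leftovers = [d for d, _ in initial_hits if d not in new_score]
    tail = [(d, floor - (i + 1)) for i, d in enumerate(leftovers)]
    return rescored + tail

# Hypothetical toy data: d9 is new in the second search and must be dropped.
toy_initial = [("d1", 7.0), ("d2", 6.5), ("d3", 6.0)]
toy_expanded = [("d9", 9.0), ("d3", 8.0), ("d1", 7.5)]

strict = restrict_to_candidates(toy_initial, toy_expanded)
print(strict)
```

Applied to each query's lists, this keeps the candidate set identical to the initial retrieval, so any change in MAP comes from reordering alone.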

In [48]:
# --- Execution ---
try:
    reader_stats = reader.stats()
    searcher_info = searcher
except NameError as e:
    print(f"Error: '{e.name}' not defined. Please run the cells that initialize objects.")
    raise

print("Starting QE reranking...")

reranked_dev_hits = rerank(dev_hits, dev_topics, reader, searcher, QE_PARAMS)
reranked_test_hits = rerank(test_hits, test_topics, reader, searcher, QE_PARAMS)
print("QE Reranking complete.")

Starting QE reranking...
  QE-TFIDF Reranking query 10/125 (ID: 359)...
  QE-TFIDF Reranking query 20/125 (ID: 369)...
  QE-TFIDF Reranking query 30/125 (ID: 379)...
  QE-TFIDF Reranking query 40/125 (ID: 389)...
  QE-TFIDF Reranking query 50/125 (ID: 398)...
  QE-TFIDF Reranking query 60/125 (ID: 609)...
  QE-TFIDF Reranking query 70/125 (ID: 619)...
  QE-TFIDF Reranking query 80/125 (ID: 629)...
  QE-TFIDF Reranking query 90/125 (ID: 639)...
  QE-TFIDF Reranking query 100/125 (ID: 645)...
  QE-TFIDF Reranking query 110/125 (ID: 409)...
  QE-TFIDF Reranking query 120/125 (ID: 655)...
  QE-TFIDF Reranking query 125/125 (ID: 416)...
  QE-TFIDF Reranking query 10/125 (ID: 421)...
  QE-TFIDF Reranking query 20/125 (ID: 666)...
  QE-TFIDF Reranking query 30/125 (ID: 307)...
  QE-TFIDF Reranking query 40/125 (ID: 431)...
  QE-TFIDF Reranking query 50/125 (ID: 676)...
  QE-TFIDF Reranking query 60/125 (ID: 317)...
  QE-TFIDF Reranking query 70/125 (ID: 441)...
  QE-TFIDF Reranking query 80/1

In [49]:
## TODO [5 points]: Compute and print out the MAP@1000 score after reranking
## on dev_hits and test_hits.

map_dev_reranked = compute_map_at_k(reranked_dev_hits, relevant_docs_all, dev_topics, k=1000)
map_test_reranked = compute_map_at_k(reranked_test_hits, relevant_docs_all, test_topics, k=1000)

print(f"Reranked MAP@1000 (Dev): {map_dev_reranked:.4f}")
print(f"Reranked MAP@1000 (Test): {map_test_reranked:.4f}")

Reranked MAP@1000 (Dev): 0.2714
Reranked MAP@1000 (Test): 0.2328


In [52]:
import time

# --- Parameter Tuning Cell for Query Expansion ---


N_values_to_test = [5, 10, 15, 20]
M_values_to_test = [5, 10, 15, 20, 25, 30]

best_dev_map_qe = -1.0
best_params_qe = None

print("--- Starting Parameter Tuning for Query Expansion (QE) ---")

required_vars = ['dev_hits', 'dev_topics', 'reader', 'searcher', 'relevant_docs_all',
                 'rerank', 'compute_map_at_k']
for var_name in required_vars:
    if var_name not in globals():
        raise NameError(f"'{var_name}' is not defined. Please ensure previous cells have been run.")


total_combinations = len(N_values_to_test) * len(M_values_to_test)
combination_count = 0
for n_val in N_values_to_test:
    for m_val in M_values_to_test:
        combination_count += 1
        current_params = {'N': n_val, 'M': m_val}
        print(f"\n[Tuning {combination_count}/{total_combinations}] Testing parameters: {current_params}")
        start_time = time.time()

        temp_reranked_dev_hits = rerank(dev_hits, dev_topics, reader, searcher, current_params)

        current_dev_map = compute_map_at_k(temp_reranked_dev_hits, relevant_docs_all, dev_topics, k=1000)

        end_time = time.time()
        print(f"  Parameters: {current_params} -> Dev MAP@1000: {current_dev_map:.4f} (Time: {end_time - start_time:.2f}s)")

        if current_dev_map > best_dev_map_qe:
            best_dev_map_qe = current_dev_map
            best_params_qe = current_params
            print(f"  *** New best Dev MAP found! ***")

print("\n--- QE Parameter Tuning Complete ---")
if best_params_qe:
    print(f"Best parameters found: {best_params_qe}")
    print(f"Best Dev MAP@1000 achieved: {best_dev_map_qe:.4f}")
else:
    print("No parameters were tested or no valid MAP scores obtained.")

print("\nIMPORTANT: Manually update QE_PARAMS in the main execution cell (Cell 16)")
print(f"to {best_params_qe} before running the final evaluation on the test set.")


--- Starting Parameter Tuning for Query Expansion (QE) ---

[Tuning 1/24] Testing parameters: {'N': 5, 'M': 5}
  QE-TFIDF Reranking query 10/125 (ID: 359)...
  QE-TFIDF Reranking query 20/125 (ID: 369)...
  QE-TFIDF Reranking query 30/125 (ID: 379)...
  QE-TFIDF Reranking query 40/125 (ID: 389)...
  QE-TFIDF Reranking query 50/125 (ID: 398)...
  QE-TFIDF Reranking query 60/125 (ID: 609)...
  QE-TFIDF Reranking query 70/125 (ID: 619)...
  QE-TFIDF Reranking query 80/125 (ID: 629)...
  QE-TFIDF Reranking query 90/125 (ID: 639)...
  QE-TFIDF Reranking query 100/125 (ID: 645)...
  QE-TFIDF Reranking query 110/125 (ID: 409)...
  QE-TFIDF Reranking query 120/125 (ID: 655)...
  QE-TFIDF Reranking query 125/125 (ID: 416)...
  Parameters: {'N': 5, 'M': 5} -> Dev MAP@1000: 0.2636 (Time: 54.01s)
  *** New best Dev MAP found! ***

[Tuning 2/24] Testing parameters: {'N': 5, 'M': 10}
  QE-TFIDF Reranking query 10/125 (ID: 359)...
  QE-TFIDF Reranking query 20/125 (ID: 369)...
  QE-TFIDF Reranking qu