# Getting started

### CLEF 2025 - CheckThat! Lab - Task 4 Scientific Web Discourse - Subtask 4b (Scientific Claim Source Retrieval)

This notebook enables participants of subtask 4b to quickly get started. It includes the following:
- Code to upload data, including:
    - code to upload the collection set (CORD-19 academic papers' metadata)
    - code to upload the query set (tweets with implicit references to CORD-19 papers)
- Code to run a baseline retrieval model (BM25)
- Code to implement a neural re-ranking approach
- Code to evaluate both the baseline and neural models

Participants are free to use this notebook and add their own models for the competition.

# 1) Importing data

In [37]:
import numpy as np
import pandas as pd

## 1.a) Import the collection set
The collection set contains metadata of CORD-19 academic papers.

The preprocessed and filtered CORD-19 dataset is available in the GitLab repository: https://gitlab.com/checkthat_lab/clef2025-checkthat-lab/-/tree/main/task4/subtask_4b

Participants should first download the file, then upload it to the Google Colab session using the following steps.


In [38]:
# 1) Download the collection set from the Gitlab repository: https://gitlab.com/checkthat_lab/clef2025-checkthat-lab/-/tree/main/task4/subtask_4b
# 2) Drag and drop the downloaded file to the "Files" section (left vertical menu on Colab)
# 3) Modify the path to your local file path
PATH_COLLECTION_DATA = 'subtask4b_collection_data.pkl' #MODIFY PATH

In [39]:
df_collection = pd.read_pickle(PATH_COLLECTION_DATA)

In [40]:
df_collection.info()

<class 'pandas.core.frame.DataFrame'>
Index: 7718 entries, 162 to 1056448
Data columns (total 17 columns):
 #   Column            Non-Null Count  Dtype         
---  ------            --------------  -----         
 0   cord_uid          7718 non-null   object        
 1   source_x          7718 non-null   object        
 2   title             7718 non-null   object        
 3   doi               7677 non-null   object        
 4   pmcid             4959 non-null   object        
 5   pubmed_id         6233 non-null   object        
 6   license           7718 non-null   object        
 7   abstract          7718 non-null   object        
 8   publish_time      7715 non-null   object        
 9   authors           7674 non-null   object        
 10  journal           6668 non-null   object        
 11  mag_id            0 non-null      float64       
 12  who_covidence_id  528 non-null    object        
 13  arxiv_id          20 non-null     object        
 14  label             7718 n

In [41]:
df_collection.head()

| | cord_uid | source_x | title | doi | pmcid | pubmed_id | license | abstract | publish_time | authors | journal | mag_id | who_covidence_id | arxiv_id | label | time | timet |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 162 | umvrwgaw | PMC | Professional and Home-Made Face Masks Reduce E... | 10.1371/journal.pone.0002618 | PMC2440799 | 18612429 | cc-by | BACKGROUND: Governments are preparing for a po... | 2008-07-09 | van der Sande, Marianne; Teunis, Peter; Sabel,... | PLoS One | | | | umvrwgaw | 2008-07-09 | 1215561600 |
| 611 | spiud6ok | PMC | The Failure of R (0) | 10.1155/2011/527610 | PMC3157160 | 21860658 | cc-by | The basic reproductive ratio, R (0), is one of... | 2011-08-16 | Li, Jing; Blakeley, Daniel; Smith?, Robert J. | Comput Math Methods Med | | | | spiud6ok | 2011-08-16 | 1313452800 |
| 918 | aclzp3iy | PMC | Pulmonary sequelae in a patient recovered from... | 10.4103/0970-2113.99118 | PMC3424870 | 22919170 | cc-by-nc-sa | The pandemic of swine flu (H1N1) influenza spr... | 2012 | Singh, Virendra; Sharma, Bharat Bhushan; Patel... | Lung India | | | | aclzp3iy | 2012-01-01 | 1325376000 |
| 993 | ycxyn2a2 | PMC | What was the primary mode of smallpox transmis... | 10.3389/fcimb.2012.00150 | PMC3509329 | 23226686 | cc-by | The mode of infection transmission has profoun... | 2012-11-29 | Milton, Donald K. | Front Cell Infect Microbiol | | | | ycxyn2a2 | 2012-11-29 | 1354147200 |
| 1053 | zxe95qy9 | PMC | Lessons from the History of Quarantine, from P... | 10.3201/eid1902.120312 | PMC3559034 | 23343512 | no-cc | In the new millennium, the centuries-old strat... | 2013-02-03 | Tognotti, Eugenia | Emerg Infect Dis | | | | zxe95qy9 | 2013-02-03 | 1359849600 |


## 1.b) Import the query set

The query set contains tweets with implicit references to academic papers from the collection set.

The preprocessed query set is available in the GitLab repository: https://gitlab.com/checkthat_lab/clef2025-checkthat-lab/-/tree/main/task4/subtask_4b

Participants should first download the files, then upload them to the Google Colab session using the following steps.

In [42]:
# 1) Download the query tweets from the Gitlab repository: https://gitlab.com/checkthat_lab/clef2025-checkthat-lab/-/tree/main/task4/subtask_4b?ref_type=heads
# 2) Drag and drop the downloaded file to the "Files" section (left vertical menu on Colab)
# 3) Modify the path to your local file path
PATH_QUERY_TRAIN_DATA = 'subtask4b_query_tweets_train.tsv' #MODIFY PATH
PATH_QUERY_DEV_DATA = 'subtask4b_query_tweets_dev.tsv' #MODIFY PATH

In [43]:
df_query_train = pd.read_csv(PATH_QUERY_TRAIN_DATA, sep = '\t')
df_query_dev = pd.read_csv(PATH_QUERY_DEV_DATA, sep = '\t')

In [44]:
df_query_train.head()

| | post_id | tweet_text | cord_uid |
|---|---|---|---|
| 0 | 0 | Oral care in rehabilitation medicine: oral vul... | htlvpvz5 |
| 1 | 1 | this study isn't receiving sufficient attentio... | 4kfl29ul |
| 2 | 2 | thanks, xi jinping. a reminder that this study... | jtwb17u8 |
| 3 | 3 | Taiwan - a population of 23 million has had ju... | 0w9k8iy1 |
| 4 | 4 | Obtaining a diagnosis of autism in lower incom... | tiqksd69 |


In [45]:
df_query_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12853 entries, 0 to 12852
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   post_id     12853 non-null  int64 
 1   tweet_text  12853 non-null  object
 2   cord_uid    12853 non-null  object
dtypes: int64(1), object(2)
memory usage: 301.4+ KB


In [46]:
df_query_dev.head()

| | post_id | tweet_text | cord_uid |
|---|---|---|---|
| 0 | 16 | covid recovery: this study from the usa reveal... | 3qvh482o |
| 1 | 69 | "Among 139 clients exposed to two symptomatic ... | r58aohnu |
| 2 | 73 | I recall early on reading that researchers who... | sts48u9i |
| 3 | 93 | You know you're credible when NIH website has ... | 3sr2exq9 |
| 4 | 96 | Resistance to antifungal medications is a grow... | ybwwmyqy |


In [47]:
df_query_dev.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1400 entries, 0 to 1399
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   post_id     1400 non-null   int64 
 1   tweet_text  1400 non-null   object
 2   cord_uid    1400 non-null   object
dtypes: int64(1), object(2)
memory usage: 32.9+ KB


# 2) Running the BM25 baseline
The following code runs a BM25 baseline.

In [48]:
!pip install rank_bm25
from rank_bm25 import BM25Okapi



In [49]:
# Create the BM25 corpus: one "title abstract" string per paper
corpus = df_collection[['title', 'abstract']].apply(lambda x: f"{x['title']} {x['abstract']}", axis=1).tolist()
cord_uids = df_collection['cord_uid'].tolist()
# Whitespace tokenization (case-sensitive; punctuation stays attached to tokens)
tokenized_corpus = [doc.split(' ') for doc in corpus]
bm25 = BM25Okapi(tokenized_corpus)
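
The corpus above is tokenized by splitting on single spaces, so case and punctuation stay attached to tokens (for example, `Masks,` and `masks` count as different terms). Below is a minimal sketch of a slightly more forgiving tokenizer. `simple_tokenize` is a hypothetical helper, not part of the notebook or of `rank_bm25`, and whether it improves MRR on this task has not been tested here.

```python
import re

# Hypothetical helper (not in the original notebook): lowercase the text and
# keep only runs of ASCII letters and digits as tokens.
_TOKEN_PATTERN = re.compile(r"[a-z0-9]+")

def simple_tokenize(text):
    return _TOKEN_PATTERN.findall(text.lower())

# The same function has to be applied to documents and queries so terms match.
example_doc = "Face Masks, COVID-19 and transmission."
example_query = "do masks reduce covid-19 transmission?"
doc_tokens = simple_tokenize(example_doc)
query_tokens = simple_tokenize(example_query)
shared_terms = sorted(set(doc_tokens) & set(query_tokens))
```

If this were adopted, both `tokenized_corpus` and the query tokenization inside the retrieval function would need to use it.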

In [50]:
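# Cache of query text -> top-5 cord_uids. It lives outside the function so that
# it persists across calls (a dict created inside the function is always empty).
text2bm25top = {}

def get_top_cord_uids(query):
    """Return the cord_uids of the 5 highest-scoring BM25 documents for a query."""
    if query in text2bm25top:
        return text2bm25top[query]

    tokenized_query = query.split(' ')
    doc_scores = bm25.get_scores(tokenized_query)
    indices = np.argsort(-doc_scores)[:5]
    bm25_topk = [cord_uids[x] for x in indices]

    text2bm25top[query] = bm25_topk
    return bm25_topk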
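
To make the call to `bm25.get_scores` less of a black box, here is a minimal sketch of a textbook BM25 scoring function in plain Python. It is an illustration under stated assumptions, not the `rank_bm25` implementation: the library may differ in details such as IDF smoothing, and the values of `k1` and `b` below are only illustrative.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, tokenized_docs, k1=1.5, b=0.75):
    """Score every document against the query with a textbook BM25 formula."""
    n_docs = len(tokenized_docs)
    avg_len = sum(len(doc) for doc in tokenized_docs) / n_docs
    # Document frequency: number of documents that contain each term
    doc_freq = Counter(term for doc in tokenized_docs for term in set(doc))

    scores = []
    for doc in tokenized_docs:
        term_freq = Counter(doc)
        score = 0.0
        for term in query_tokens:
            tf = term_freq.get(term, 0)
            if tf == 0:
                continue
            # Smoothed IDF that stays positive (the "+ 1" variant)
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            # Term-frequency saturation with document-length normalization
            denom = tf + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf * (k1 + 1) / denom
        scores.append(score)
    return scores

toy_docs = [
    "face masks reduce exposure".split(" "),
    "history of quarantine".split(" "),
    "masks and quarantine policy".split(" "),
]
toy_scores = bm25_scores("face masks".split(" "), toy_docs)
best_doc = max(range(len(toy_docs)), key=lambda i: toy_scores[i])
```

In this toy corpus the first document matches both query terms and ranks first, while the second matches none and scores zero.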

In [51]:
# Retrieve topk candidates using the BM25 model
df_query_train['bm25_topk'] = df_query_train['tweet_text'].apply(lambda x: get_top_cord_uids(x))
df_query_dev['bm25_topk'] = df_query_dev['tweet_text'].apply(lambda x: get_top_cord_uids(x))


# 3) Initial Neural Re-ranking Implementation
The following code implements a neural re-ranking approach to improve the BM25 baseline. We'll use a two-stage retrieval pipeline:

1. First stage: Use BM25 to retrieve candidate documents (efficient lexical matching)
2. Second stage: Re-rank those candidates with a neural model (better semantic understanding)

For the neural model, we'll use the Sentence-BERT framework to encode queries and documents into dense vector representations.
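
The two stages can be sketched independently of any particular model. In the toy example below, `retrieve_then_rerank`, `overlap_score`, and `toy_semantic_score` are hypothetical stand-ins introduced for illustration: the first scorer plays the role of BM25 and the second the role of an embedding similarity.

```python
def retrieve_then_rerank(query, doc_ids, first_stage_score, second_stage_score,
                         n_candidates=20, top_k=5):
    """Keep the n_candidates best documents according to the cheap scorer,
    then reorder only those candidates with the more expensive scorer."""
    candidates = sorted(doc_ids, key=lambda d: first_stage_score(query, d),
                        reverse=True)[:n_candidates]
    reranked = sorted(candidates, key=lambda d: second_stage_score(query, d),
                      reverse=True)
    return reranked[:top_k]

# Toy collection (assumption for illustration only)
toy_texts = {
    "a": "masks reduce virus exposure",
    "b": "masks masks masks fashion trends",
    "c": "history of quarantine",
}

def overlap_score(query, doc_id):
    # Cheap lexical score: total occurrences of query words in the document
    words = toy_texts[doc_id].split()
    return sum(words.count(w) for w in query.split())

def toy_semantic_score(query, doc_id):
    # Stand-in for embedding similarity: share of distinct document words
    # that also appear in the query
    words = set(toy_texts[doc_id].split())
    return len(words & set(query.split())) / len(words)

result = retrieve_then_rerank("masks virus", list(toy_texts), overlap_score,
                              toy_semantic_score, n_candidates=2, top_k=2)
```

Document `b` wins the first stage through raw term repetition, but the second stage moves `a` ahead of it; `c` never reaches the second stage because it is not among the first-stage candidates.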

In [None]:
# Install required packages for neural reranking
!pip install -q sentence-transformers torch

In [None]:
from sentence_transformers import SentenceTransformer, util
import torch



In [None]:
# Extended BM25 function that returns both IDs and scores for a larger candidate pool.
# The cache is keyed by (query, k) and lives outside the function so it persists across calls.
text2bm25top_extended = {}

def get_top_cord_uids_extended(query, k=20):
    key = (query, k)
    if key in text2bm25top_extended:
        return text2bm25top_extended[key]

    tokenized_query = query.split(' ')
    doc_scores = bm25.get_scores(tokenized_query)
    indices = np.argsort(-doc_scores)[:k]
    bm25_topk = [cord_uids[x] for x in indices]
    bm25_scores = [doc_scores[x] for x in indices]

    text2bm25top_extended[key] = (bm25_topk, bm25_scores)
    return bm25_topk, bm25_scores

In [None]:
# Neural Re-ranker class
class NeuralReranker:
    def __init__(self, model_name='all-MiniLM-L6-v2'):
        self.model = SentenceTransformer(model_name)
        self.corpus_embeddings = None
        self.corpus_texts = None
        self.paper_ids = None
        
    def index_collection(self, df_collection):
        # Create text representation for each document
        self.corpus_texts = df_collection[:][['title', 'abstract']].apply(
            lambda x: f"{x['title']} {x['abstract']}", axis=1).tolist()
        self.paper_ids = df_collection[:]['cord_uid'].tolist()
        
        # Calculate embeddings for all documents (this may take some time)
        print("Calculating document embeddings...")
        self.corpus_embeddings = self.model.encode(
            self.corpus_texts, 
            convert_to_tensor=True,
            show_progress_bar=True
        )
        print(f"Created embeddings for {len(self.corpus_texts)} documents")
    
    def rerank_candidates(self, query, candidate_ids, candidate_scores=None, top_k=5):
        """Re-rank the candidate documents for a given query"""
        # Get query embedding
        query_embedding = self.model.encode(query, convert_to_tensor=True)
        
        # Get embeddings for candidate documents
        candidate_indices = [self.paper_ids.index(cid) for cid in candidate_ids]
        candidate_embeddings = self.corpus_embeddings[candidate_indices]
        
        # Calculate cosine similarity between query and candidates
        cos_scores = util.cos_sim(query_embedding, candidate_embeddings)[0]
        
        # If BM25 scores are provided, we can combine the scores
        if candidate_scores is not None:
            # Put BM25 scores on the same dtype/device as the cosine scores
            bm25_scores = torch.tensor(
                candidate_scores, dtype=cos_scores.dtype, device=cos_scores.device)
            # Normalize by the top score; the clamp avoids division by zero
            bm25_scores = bm25_scores / bm25_scores.max().clamp(min=1e-9)
            
            # Combine scores (you can adjust the weights)
            alpha = 0.3  # Weight for BM25 scores
            combined_scores = alpha * bm25_scores + (1-alpha) * cos_scores
        else:
            combined_scores = cos_scores
            
        # Sort by score
        top_results = torch.argsort(-combined_scores)[:top_k].tolist()
        
        # Return re-ranked document IDs
        return [candidate_ids[i] for i in top_results]
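
The score combination inside `rerank_candidates` can be followed by hand on a small example. The sketch below mirrors the same arithmetic in plain Python; `fuse_scores` and the three candidate scores are made up for illustration.

```python
def fuse_scores(bm25_scores, cos_scores, alpha=0.3):
    """Blend max-normalized BM25 scores with cosine similarities."""
    max_bm25 = max(bm25_scores)
    if max_bm25 <= 0:
        # No query term matched any candidate: BM25 contributes nothing
        normalized = [0.0 for _ in bm25_scores]
    else:
        normalized = [s / max_bm25 for s in bm25_scores]
    return [alpha * b + (1 - alpha) * c for b, c in zip(normalized, cos_scores)]

candidate_ids = ["doc1", "doc2", "doc3"]          # in BM25 order
fused = fuse_scores([12.0, 9.0, 6.0], [0.20, 0.80, 0.50])
order = sorted(range(len(fused)), key=lambda i: fused[i], reverse=True)
reranked_ids = [candidate_ids[i] for i in order]
```

With `alpha = 0.3`, `doc1` gets 0.3 * 1.0 + 0.7 * 0.2 = 0.44, `doc2` gets 0.3 * 0.75 + 0.7 * 0.8 = 0.785, and `doc3` gets 0.3 * 0.5 + 0.7 * 0.5 = 0.5, so the BM25 leader drops to last place.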

In [None]:
# Initialize the neural reranker
reranker = NeuralReranker()

# Index the collection (this may take some time depending on collection size)
reranker.index_collection(df_collection)



Calculating document embeddings...


Batches: 100%|██████████| 242/242 [10:59<00:00,  2.72s/it]

Created embeddings for 7718 documents





In [None]:
# Process training and dev queries
def process_queries(df_queries, top_k=5):
    results = []
    
    for _, row in df_queries.iterrows():
        query = row['tweet_text']
        
        # First-stage: Get BM25 candidates
        bm25_candidates, bm25_scores = get_top_cord_uids_extended(query)
        
        # Second-stage: Neural re-ranking
        reranked_candidates = reranker.rerank_candidates(
            query, 
            bm25_candidates, 
            bm25_scores, 
            top_k=top_k
        )
        
        results.append({
            'post_id': row['post_id'],
            'tweet_text': query,
            'cord_uid': row['cord_uid'],
            'bm25_topk': bm25_candidates[:5],  # For comparison
            'neural_reranked': reranked_candidates
        })
    
    return pd.DataFrame(results)

In [None]:
# Process training and dev data
print("Processing training queries...")
processed_train = process_queries(df_query_train)

print("Processing development queries...")
processed_dev = process_queries(df_query_dev)

Processing training queries...
Processing development queries...


# 4) Evaluating the models
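The following code evaluates both the BM25 baseline and the neural re-ranking approach using Mean Reciprocal Rank (MRR@k) for k = 1, 5, and 10. Because both methods return only five candidates per query, MRR@10 is identical to MRR@5 in the results.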

In [None]:
# Evaluate retrieved candidates using MRR@k
def get_performance_mrr(data, col_gold, col_pred, list_k=(1, 5, 10)):
    """Return {k: MRR@k}. A query scores 0 when the gold ID is not in its top k."""
    def reciprocal_rank(row, k):
        preds = list(row[col_pred][:k])
        return 1 / (preds.index(row[col_gold]) + 1) if row[col_gold] in preds else 0

    d_performance = {}
    for k in list_k:
        data["in_topx"] = data.apply(lambda row: reciprocal_rank(row, k), axis=1)
        d_performance[k] = data["in_topx"].mean()
    return d_performance
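
A small worked example may help when reading the MRR numbers. The sketch below recomputes MRR@k on three made-up queries in plain Python; `mrr_at_k` is a hypothetical helper that follows the same definition as `get_performance_mrr` but does not depend on pandas.

```python
def mrr_at_k(gold_ids, ranked_lists, k):
    """Mean of 1/rank of the gold ID within the top k (0 if it is absent)."""
    total = 0.0
    for gold, ranked in zip(gold_ids, ranked_lists):
        top = ranked[:k]
        if gold in top:
            total += 1.0 / (top.index(gold) + 1)
    return total / len(gold_ids)

gold = ["p1", "p2", "p3"]
preds = [
    ["p1", "x", "y"],   # gold at rank 1 -> reciprocal rank 1
    ["x", "p2", "y"],   # gold at rank 2 -> reciprocal rank 1/2
    ["x", "y", "z"],    # gold missing   -> reciprocal rank 0
]
mrr1 = mrr_at_k(gold, preds, 1)
mrr5 = mrr_at_k(gold, preds, 5)
```

Here MRR@1 is 1/3, since only the first query has the gold paper at rank 1, and MRR@5 is (1 + 1/2 + 0) / 3 = 0.5.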

In [None]:
# Compare BM25 baseline with neural re-ranking
bm25_results_train = get_performance_mrr(processed_train, 'cord_uid', 'bm25_topk')
neural_results_train = get_performance_mrr(processed_train, 'cord_uid', 'neural_reranked')

bm25_results_dev = get_performance_mrr(processed_dev, 'cord_uid', 'bm25_topk')
neural_results_dev = get_performance_mrr(processed_dev, 'cord_uid', 'neural_reranked')

print("Training results:")
print(f"BM25: {bm25_results_train}")
print(f"Neural re-ranking: {neural_results_train}")

print("\nDevelopment results:")
print(f"BM25: {bm25_results_dev}")
print(f"Neural re-ranking: {neural_results_dev}")

Training results:
BM25: {1: 0.5079747918773827, 5: 0.5508999196037242, 10: 0.5508999196037242}
Neural re-ranking: {1: 0.5520112036100522, 5: 0.5959179439300811, 10: 0.5959179439300811}

Development results:
BM25: {1: 0.505, 5: 0.5520357142857142, 10: 0.5520357142857142}
Neural re-ranking: {1: 0.5728571428571428, 5: 0.6121071428571428, 10: 0.6121071428571428}


# 5) Additional exploration: fine-tuning different models

### Fine-tuning 'all-MiniLM-L6-v2'

In [None]:
def fine_tune_model(df_train, model_name='all-MiniLM-L6-v2'):
    from sentence_transformers import SentenceTransformer, InputExample, losses
    from torch.utils.data import DataLoader
    
    # Initialize model
    model = SentenceTransformer(model_name)
    
    # Prepare training data
    train_examples = []
    skipped = 0
    
    for _, row in df_train.iterrows():
        query = row['tweet_text']
        positive_paper_id = row['cord_uid']
        
        # Find matching papers in the collection and handle the case when no match is found
        matching_papers = df_collection[df_collection['cord_uid'] == positive_paper_id]
        if matching_papers.empty:
            # Skip this example if no matching paper is found
            skipped += 1
            continue
            
        # Get the text of the positive paper - use .loc instead of .iloc
        positive_index = matching_papers.index[0]
        positive_text = f"{df_collection.loc[positive_index, 'title']} {df_collection.loc[positive_index, 'abstract']}"
        
        # Create a training example
        train_examples.append(InputExample(
            texts=[query, positive_text],
            label=1.0  # Positive pair
        ))
        
        # For each positive, sample some negatives
        bm25_candidates, _ = get_top_cord_uids_extended(query, k=10)
        neg_added = False
        for neg_id in bm25_candidates:
            if neg_id != positive_paper_id:
                matching_neg_papers = df_collection[df_collection['cord_uid'] == neg_id]
                if matching_neg_papers.empty:
                    continue
                    
                neg_index = matching_neg_papers.index[0]
                neg_text = f"{df_collection.loc[neg_index, 'title']} {df_collection.loc[neg_index, 'abstract']}"
                
                train_examples.append(InputExample(
                    texts=[query, neg_text],
                    label=0.0  # Negative pair
                ))
                neg_added = True
                break  # Just add one negative for simplicity
        
        # If we couldn't find any valid negative, try a random paper
        if not neg_added:
            # Get a random paper that's not the positive one
            random_indices = df_collection.sample(5).index
            for idx in random_indices:
                random_id = df_collection.loc[idx, 'cord_uid']
                if random_id != positive_paper_id:
                    random_text = f"{df_collection.loc[idx, 'title']} {df_collection.loc[idx, 'abstract']}"
                    train_examples.append(InputExample(
                        texts=[query, random_text],
                        label=0.0  # Negative pair
                    ))
                    break
    
    print(f"Created {len(train_examples)} training examples. Skipped {skipped} queries with no matching papers.")
    
    # Create data loader
    train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
    
    # Use the cosine similarity loss
    train_loss = losses.CosineSimilarityLoss(model)
    
    # Train the model
    print("Fine-tuning the model...")
    model.fit(
        train_objectives=[(train_dataloader, train_loss)],
        epochs=10,  # You may need more epochs
        warmup_steps=200,
        show_progress_bar=True
    )
    
    return model

# Usage:
try:
    # Only fine-tune on a subset of the data for faster execution
    subset_size = min(1000, len(df_query_train))
    fine_tuned_model = fine_tune_model(df_query_train.head(subset_size))
    
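    # Update the reranker with the fine-tuned model.
    # Note: only the query encoder changes here. reranker.corpus_embeddings still
    # holds the vectors computed with the pre-trained model until
    # reranker.index_collection(df_collection) is called again.
    reranker.model = fine_tuned_model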
    
    # Re-process with the fine-tuned model
    processed_dev_finetuned = process_queries(df_query_dev)
    
    # Evaluate the fine-tuned model
    neural_finetuned_results_dev = get_performance_mrr(processed_dev_finetuned, 'cord_uid', 'neural_reranked')
    print(f"Fine-tuned neural re-ranking results: {neural_finetuned_results_dev}")
    
except Exception as e:
    print(f"Error during fine-tuning: {e}")
    print("Proceeding with the pre-trained model only.")



Created 2000 training examples. Skipped 0 queries with no matching papers.
Fine-tuning the model...


 40%|████      | 500/1250 [52:58<1:25:01,  6.80s/it]

{'loss': 0.1713, 'grad_norm': 1.3768905401229858, 'learning_rate': 1.4285714285714287e-05, 'epoch': 4.0}


 80%|████████  | 1000/1250 [1:49:32<28:14,  6.78s/it] 

{'loss': 0.0985, 'grad_norm': 2.380155563354492, 'learning_rate': 4.761904761904762e-06, 'epoch': 8.0}


100%|██████████| 1250/1250 [2:18:06<00:00,  6.63s/it]


{'train_runtime': 8286.0565, 'train_samples_per_second': 2.414, 'train_steps_per_second': 0.151, 'train_loss': 0.12301238861083984, 'epoch': 10.0}
Fine-tuned neural re-ranking results: {1: 0.5564285714285714, 5: 0.6033095238095239, 10: 0.6033095238095239}
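
The training set above pairs each tweet with its gold paper (label 1.0) and with one BM25 candidate that is not the gold paper (label 0.0), a so-called hard negative. The sketch below isolates that pairing logic in plain Python. `build_training_pairs` and the toy data are hypothetical, and the random-paper fallback from the notebook is left out for brevity.

```python
def build_training_pairs(queries, gold_ids, candidate_lists, doc_texts):
    """Return (query, document text, label) triples: one positive per query and,
    when available, one hard negative (the best-ranked non-gold candidate)."""
    pairs = []
    for query, gold, candidates in zip(queries, gold_ids, candidate_lists):
        if gold not in doc_texts:
            continue  # skip queries whose gold paper is not in the collection
        pairs.append((query, doc_texts[gold], 1.0))
        hard_negative = next(
            (c for c in candidates if c != gold and c in doc_texts), None)
        if hard_negative is not None:
            pairs.append((query, doc_texts[hard_negative], 0.0))
    return pairs

toy_doc_texts = {
    "p1": "Masks reduce exposure",
    "p2": "Mask fashion",
    "p3": "Quarantine history",
}
toy_pairs = build_training_pairs(
    queries=["masks study", "unknown paper"],
    gold_ids=["p1", "p9"],                      # p9 is not in the collection
    candidate_lists=[["p2", "p1", "p3"], ["p3", "p2"]],
    doc_texts=toy_doc_texts,
)
```

Each resulting triple corresponds to one `InputExample(texts=[query, text], label=label)` in the notebook's code.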


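### Fine-tuning 'allenai/scibert_scivocab_uncased' (evaluation failed with a matrix dimension mismatch)

Training finished, but re-ranking then failed: queries were encoded by the fine-tuned SciBERT model (768 dimensions), while the stored corpus embeddings were still the 384-dimensional vectors computed earlier with all-MiniLM-L6-v2.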

In [None]:
def fine_tune_model(df_train, model_name='allenai/scibert_scivocab_uncased'):
    from sentence_transformers import SentenceTransformer, InputExample, losses
    from torch.utils.data import DataLoader
    
    # Initialize model
    model = SentenceTransformer(model_name)
    
    # Prepare training data
    train_examples = []
    skipped = 0
    
    for _, row in df_train.iterrows():
        query = row['tweet_text']
        positive_paper_id = row['cord_uid']
        
        # Find matching papers in the collection and handle the case when no match is found
        matching_papers = df_collection[df_collection['cord_uid'] == positive_paper_id]
        if matching_papers.empty:
            # Skip this example if no matching paper is found
            skipped += 1
            continue
            
        # Get the text of the positive paper - use .loc instead of .iloc
        positive_index = matching_papers.index[0]
        positive_text = f"{df_collection.loc[positive_index, 'title']} {df_collection.loc[positive_index, 'abstract']}"
        
        # Create a training example
        train_examples.append(InputExample(
            texts=[query, positive_text],
            label=1.0  # Positive pair
        ))
        
        # For each positive, sample some negatives
        bm25_candidates, _ = get_top_cord_uids_extended(query, k=10)
        neg_added = False
        for neg_id in bm25_candidates:
            if neg_id != positive_paper_id:
                matching_neg_papers = df_collection[df_collection['cord_uid'] == neg_id]
                if matching_neg_papers.empty:
                    continue
                    
                neg_index = matching_neg_papers.index[0]
                neg_text = f"{df_collection.loc[neg_index, 'title']} {df_collection.loc[neg_index, 'abstract']}"
                
                train_examples.append(InputExample(
                    texts=[query, neg_text],
                    label=0.0  # Negative pair
                ))
                neg_added = True
                break  # Just add one negative for simplicity
        
        # If we couldn't find any valid negative, try a random paper
        if not neg_added:
            # Get a random paper that's not the positive one
            random_indices = df_collection.sample(5).index
            for idx in random_indices:
                random_id = df_collection.loc[idx, 'cord_uid']
                if random_id != positive_paper_id:
                    random_text = f"{df_collection.loc[idx, 'title']} {df_collection.loc[idx, 'abstract']}"
                    train_examples.append(InputExample(
                        texts=[query, random_text],
                        label=0.0  # Negative pair
                    ))
                    break
    
    print(f"Created {len(train_examples)} training examples. Skipped {skipped} queries with no matching papers.")
    
    # Create data loader
    train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
    
    # Use the cosine similarity loss
    train_loss = losses.CosineSimilarityLoss(model)
    
    # Train the model
    print("Fine-tuning the model...")
    model.fit(
        train_objectives=[(train_dataloader, train_loss)],
        epochs=10,  # You may need more epochs
        warmup_steps=200,
        show_progress_bar=True
    )
    
    return model

# Usage:
try:
    # Only fine-tune on a subset of the data for faster execution
    subset_size = min(1000, len(df_query_train))
    fine_tuned_model = fine_tune_model(df_query_train.head(subset_size))
    
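    # Update the reranker with the fine-tuned model.
    # Note: reranker.corpus_embeddings is not recomputed here, so it still holds
    # 384-dimensional all-MiniLM-L6-v2 vectors while SciBERT produces
    # 768-dimensional query vectors; this is what triggers the shape error.
    reranker.model = fine_tuned_model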
    
    # Re-process with the fine-tuned model
    processed_dev_finetuned = process_queries(df_query_dev)
    
    # Evaluate the fine-tuned model
    neural_finetuned_results_dev = get_performance_mrr(processed_dev_finetuned, 'cord_uid', 'neural_reranked')
    print(f"Fine-tuned neural re-ranking results: {neural_finetuned_results_dev}")
    
except Exception as e:
    print(f"Error during fine-tuning: {e}")
    print("Proceeding with the pre-trained model only.")

No sentence-transformers model found with name allenai/scibert_scivocab_uncased. Creating a new one with mean pooling.


Created 2000 training examples. Skipped 0 queries with no matching papers.
Fine-tuning the model...


 40%|████      | 500/1250 [10:29:05<14:21:56, 68.96s/it]

{'loss': 0.1739, 'grad_norm': 1.941263198852539, 'learning_rate': 1.4285714285714287e-05, 'epoch': 4.0}


 80%|████████  | 1000/1250 [20:14:25<4:52:10, 70.12s/it]

{'loss': 0.0415, 'grad_norm': 1.9587656259536743, 'learning_rate': 4.761904761904762e-06, 'epoch': 8.0}


100%|██████████| 1250/1250 [25:28:09<00:00, 73.35s/it]  


{'train_runtime': 91689.2202, 'train_samples_per_second': 0.218, 'train_steps_per_second': 0.014, 'train_loss': 0.08956683120727539, 'epoch': 10.0}
Error during fine-tuning: mat1 and mat2 shapes cannot be multiplied (1x768 and 384x20)
Proceeding with the pre-trained model only.
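
The shape error above (`1x768 and 384x20`) arises when the query encoder is swapped without re-encoding the corpus. One way to rule this out is to make re-encoding part of the model swap. The sketch below shows the idea with toy classes; `ToyEncoder` and `ToyIndex` are hypothetical stand-ins, not `sentence-transformers` APIs.

```python
class ToyEncoder:
    """Stand-in for an embedding model; `dim` plays the role of the embedding size."""
    def __init__(self, dim):
        self.dim = dim

    def encode(self, text):
        # Deterministic toy vector: character codes accumulated into `dim` slots
        vec = [0.0] * self.dim
        for i, ch in enumerate(text):
            vec[i % self.dim] += ord(ch)
        return vec

class ToyIndex:
    def __init__(self, encoder, texts):
        self.texts = texts
        self.set_model(encoder)

    def set_model(self, encoder):
        # Swapping the model always re-encodes the corpus, so query and
        # document vectors are guaranteed to share the same dimensionality.
        self.encoder = encoder
        self.corpus_vectors = [encoder.encode(t) for t in self.texts]

    def score(self, query):
        q = self.encoder.encode(query)
        if len(q) != len(self.corpus_vectors[0]):
            raise ValueError("query/corpus embedding dimension mismatch")
        return [sum(a * b for a, b in zip(q, d)) for d in self.corpus_vectors]

index = ToyIndex(ToyEncoder(dim=4), ["masks reduce exposure", "history of quarantine"])
index.set_model(ToyEncoder(dim=8))   # corpus is re-encoded with the new model
scores = index.score("masks")
```

With the notebook's `NeuralReranker`, the analogous step would be to call `reranker.index_collection(df_collection)` right after assigning a new `reranker.model`.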


# 6) Evaluating different models
This section evaluates other pre-trained models as re-rankers, without further fine-tuning.

### 6.1) SciBERT

In [None]:
# 1. SciBERT implementation
from sentence_transformers import SentenceTransformer, util
import torch

# Reset any previous models to avoid mixing
if 'reranker' in locals():
    del reranker

# Neural Re-ranker using SciBERT
class SciBERTReranker:
    def __init__(self, model_name='allenai/scibert_scivocab_uncased'):
        # Initialize the model using SentenceTransformer wrapper
        self.model = SentenceTransformer(model_name)
        self.corpus_embeddings = None
        self.corpus_texts = None
        self.paper_ids = None
        
    def index_collection(self, df_collection):
        # Create text representation for each document
        self.corpus_texts = df_collection[:][['title', 'abstract']].apply(
            lambda x: f"{x['title']} {x['abstract']}", axis=1).tolist()
        self.paper_ids = df_collection[:]['cord_uid'].tolist()
        
        # Calculate embeddings for all documents
        print("Calculating document embeddings with SciBERT...")
        self.corpus_embeddings = self.model.encode(
            self.corpus_texts, 
            convert_to_tensor=True,
            show_progress_bar=True
        )
        print(f"Created embeddings for {len(self.corpus_texts)} documents")
    
    def rerank_candidates(self, query, candidate_ids, candidate_scores=None, top_k=5):
        """Re-rank the candidate documents for a given query"""
        # Get query embedding
        query_embedding = self.model.encode(query, convert_to_tensor=True)
        
        # Get embeddings for candidate documents
        candidate_indices = [self.paper_ids.index(cid) for cid in candidate_ids]
        candidate_embeddings = self.corpus_embeddings[candidate_indices]
        
        # Calculate cosine similarity between query and candidates
        cos_scores = util.cos_sim(query_embedding, candidate_embeddings)[0]
        
        # If BM25 scores are provided, combine the scores
        if candidate_scores is not None:
            # Put BM25 scores on the same dtype/device as the cosine scores
            bm25_scores = torch.tensor(
                candidate_scores, dtype=cos_scores.dtype, device=cos_scores.device)
            # Normalize by the top score; the clamp avoids division by zero
            bm25_scores = bm25_scores / bm25_scores.max().clamp(min=1e-9)
            
            # Combine scores (you can adjust the weights)
            alpha = 0.3  # Weight for BM25 scores
            combined_scores = alpha * bm25_scores + (1-alpha) * cos_scores
        else:
            combined_scores = cos_scores
            
        # Sort by score
        top_results = torch.argsort(-combined_scores)[:top_k].tolist()
        
        # Return re-ranked document IDs
        return [candidate_ids[i] for i in top_results]

# Initialize SciBERT reranker
scibert_reranker = SciBERTReranker()

# Index the collection
scibert_reranker.index_collection(df_collection)

# Process the queries using SciBERT
def process_queries_scibert(df_queries, top_k=5):
    results = []
    
    for _, row in df_queries.iterrows():
        query = row['tweet_text']
        
        # First-stage: Get BM25 candidates
        bm25_candidates, bm25_scores = get_top_cord_uids_extended(query)
        
        # Second-stage: SciBERT re-ranking
        reranked_candidates = scibert_reranker.rerank_candidates(
            query, 
            bm25_candidates, 
            bm25_scores, 
            top_k=top_k
        )
        
        results.append({
            'post_id': row['post_id'],
            'tweet_text': query,
            'cord_uid': row['cord_uid'],
            'bm25_topk': bm25_candidates[:5],  # For comparison
            'scibert_reranked': reranked_candidates
        })
    
    return pd.DataFrame(results)

# Process dev data with SciBERT
print("Processing development queries with SciBERT...")
processed_dev_scibert = process_queries_scibert(df_query_dev)

# Evaluate SciBERT results
scibert_results_dev = get_performance_mrr(processed_dev_scibert, 'cord_uid', 'scibert_reranked')
print("\nDevelopment results:")
print(f"SciBERT: {scibert_results_dev}")

No sentence-transformers model found with name allenai/scibert_scivocab_uncased. Creating a new one with mean pooling.


Calculating document embeddings with SciBERT...


Batches: 100%|██████████| 242/242 [2:14:56<00:00, 33.45s/it]  


Created embeddings for 7718 documents
Processing development queries with SciBERT...

Development results:
SciBERT: {1: 0.48428571428571426, 5: 0.5314523809523809, 10: 0.5314523809523809}
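
The log above reports that `allenai/scibert_scivocab_uncased` is not a sentence-transformers model, so a mean-pooling layer is added on top of it. Below is a minimal sketch of what masked mean pooling computes, in plain Python; `masked_mean_pool` and the toy vectors are hypothetical. One possible reason for the lower scores, not verified in this notebook, is that such checkpoints were not trained to produce sentence embeddings for similarity search.

```python
def masked_mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask value is 1, ignoring padding."""
    dim = len(token_vectors[0])
    kept = [vec for vec, m in zip(token_vectors, attention_mask) if m == 1]
    if not kept:
        return [0.0] * dim
    return [sum(vec[j] for vec in kept) / len(kept) for j in range(dim)]

# Two real tokens followed by one padding position (mask 0)
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]
sentence_vector = masked_mean_pool(tokens, mask)
```

The padding vector is ignored, so the sentence vector is the average of the first two token vectors: `[2.0, 3.0]`.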


### 6.2) covid-twitter-bert-v2

In [None]:
# 2. COVID-Twitter-BERT implementation
from sentence_transformers import SentenceTransformer, util
import torch

# Reset any previous models to avoid mixing
if 'scibert_reranker' in locals():
    del scibert_reranker

# Neural Re-ranker using COVID-Twitter-BERT
class COVIDTwitterBERTReranker:
    def __init__(self, model_name='digitalepidemiologylab/covid-twitter-bert-v2'):
        # Initialize the model using SentenceTransformer wrapper
        self.model = SentenceTransformer(model_name)
        self.corpus_embeddings = None
        self.corpus_texts = None
        self.paper_ids = None
        
    def index_collection(self, df_collection):
        # Create text representation for each document
        self.corpus_texts = df_collection[:][['title', 'abstract']].apply(
            lambda x: f"{x['title']} {x['abstract']}", axis=1).tolist()
        self.paper_ids = df_collection[:]['cord_uid'].tolist()
        
        # Calculate embeddings for all documents
        print("Calculating document embeddings with COVID-Twitter-BERT...")
        self.corpus_embeddings = self.model.encode(
            self.corpus_texts, 
            convert_to_tensor=True,
            show_progress_bar=True
        )
        print(f"Created embeddings for {len(self.corpus_texts)} documents")
    
    def rerank_candidates(self, query, candidate_ids, candidate_scores=None, top_k=5):
        """Re-rank the candidate documents for a given query"""
        # Get query embedding
        query_embedding = self.model.encode(query, convert_to_tensor=True)
        
        # Get embeddings for candidate documents
        candidate_indices = [self.paper_ids.index(cid) for cid in candidate_ids]
        candidate_embeddings = self.corpus_embeddings[candidate_indices]
        
        # Calculate cosine similarity between query and candidates
        cos_scores = util.cos_sim(query_embedding, candidate_embeddings)[0]
        
        # If BM25 scores are provided, combine the scores
        if candidate_scores is not None:
            # Put BM25 scores on the same dtype/device as the cosine scores
            bm25_scores = torch.tensor(
                candidate_scores, dtype=cos_scores.dtype, device=cos_scores.device)
            # Normalize by the top score; the clamp avoids division by zero
            bm25_scores = bm25_scores / bm25_scores.max().clamp(min=1e-9)
            
            # Combine scores (you can adjust the weights)
            alpha = 0.3  # Weight for BM25 scores
            combined_scores = alpha * bm25_scores + (1-alpha) * cos_scores
        else:
            combined_scores = cos_scores
            
        # Sort by score
        top_results = torch.argsort(-combined_scores)[:top_k].tolist()
        
        # Return re-ranked document IDs
        return [candidate_ids[i] for i in top_results]

# Initialize COVID-Twitter-BERT reranker
covid_twitter_bert_reranker = COVIDTwitterBERTReranker()

# Index the collection
covid_twitter_bert_reranker.index_collection(df_collection)

# Process the queries using COVID-Twitter-BERT
def process_queries_covid_twitter_bert(df_queries, top_k=5):
    results = []
    
    for _, row in df_queries.iterrows():
        query = row['tweet_text']
        
        # First-stage: Get BM25 candidates
        bm25_candidates, bm25_scores = get_top_cord_uids_extended(query)
        
        # Second-stage: COVID-Twitter-BERT re-ranking
        reranked_candidates = covid_twitter_bert_reranker.rerank_candidates(
            query, 
            bm25_candidates, 
            bm25_scores, 
            top_k=top_k
        )
        
        results.append({
            'post_id': row['post_id'],
            'tweet_text': query,
            'cord_uid': row['cord_uid'],
            'bm25_topk': bm25_candidates[:top_k],  # Same cutoff as the re-ranked list, for comparison
            'covid_twitter_bert_reranked': reranked_candidates
        })
    
    return pd.DataFrame(results)

# Process dev data with COVID-Twitter-BERT
print("Processing development queries with COVID-Twitter-BERT...")
processed_dev_covid_twitter_bert = process_queries_covid_twitter_bert(df_query_dev)

# Evaluate COVID-Twitter-BERT results
covid_twitter_bert_results_dev = get_performance_mrr(processed_dev_covid_twitter_bert, 'cord_uid', 'covid_twitter_bert_reranked')
print("\nDevelopment results:")
print(f"COVID-Twitter-BERT: {covid_twitter_bert_results_dev}")

No sentence-transformers model found with name digitalepidemiologylab/covid-twitter-bert-v2. Creating a new one with mean pooling.
To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to see activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development


Calculating document embeddings with COVID-Twitter-BERT...


Batches: 100%|██████████| 242/242 [7:53:01<00:00, 117.28s/it]  


Created embeddings for 7718 documents
Processing development queries with COVID-Twitter-BERT...

Development results:
COVID-Twitter-BERT: {1: 0.475, 5: 0.5396547619047619, 10: 0.5396547619047619}
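
The re-ranker above blends max-normalized BM25 scores with cosine similarities using a weight `alpha = 0.3`. A minimal plain-Python sketch of that fusion may make the arithmetic easier to follow. `fuse_scores` and `top_k_indices` are hypothetical helper names, the input numbers are made up for illustration, and the guard against an all-zero BM25 list is a defensive assumption of this sketch.

```python
def fuse_scores(bm25_scores, cos_scores, alpha=0.3):
    """Blend max-normalized BM25 scores with cosine similarities."""
    if len(bm25_scores) != len(cos_scores):
        raise ValueError("score lists must have the same length")
    max_bm25 = max(bm25_scores, default=0.0)
    # Assumption of this sketch: fall back to a divisor of 1 if all scores are zero
    scale = max_bm25 if max_bm25 > 0 else 1.0
    return [alpha * (b / scale) + (1 - alpha) * c
            for b, c in zip(bm25_scores, cos_scores)]


def top_k_indices(scores, k=5):
    """Return the positions of the k highest scores, best first."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]


# Illustrative (made-up) scores for three candidates
fused = fuse_scores([12.0, 6.0, 3.0], [0.2, 0.9, 0.4])
order = top_k_indices(fused, k=2)
print(fused, order)
```

In this toy case the second candidate overtakes the BM25 winner because its cosine similarity carries the larger weight (`1 - alpha = 0.7`).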


### 6.3) SPECTER

In [None]:
# 3. SPECTER implementation
from sentence_transformers import SentenceTransformer, util
import torch

# Reset any previous models to avoid mixing
if 'covid_twitter_bert_reranker' in locals():
    del covid_twitter_bert_reranker

# Neural Re-ranker using SPECTER
class SPECTERReranker:
    def __init__(self, model_name='allenai/specter'):
        # Initialize the model using SentenceTransformer wrapper
        self.model = SentenceTransformer(model_name)
        self.corpus_embeddings = None
        self.corpus_texts = None
        self.paper_ids = None
        
    def index_collection(self, df_collection):
        # Create text representation for each document - SPECTER works best with title and abstract
        self.corpus_texts = df_collection[['title', 'abstract']].apply(
            lambda x: f"{x['title']} {x['abstract']}", axis=1).tolist()
        self.paper_ids = df_collection['cord_uid'].tolist()
        
        # Calculate embeddings for all documents
        print("Calculating document embeddings with SPECTER...")
        self.corpus_embeddings = self.model.encode(
            self.corpus_texts, 
            convert_to_tensor=True,
            show_progress_bar=True
        )
        print(f"Created embeddings for {len(self.corpus_texts)} documents")
    
    def rerank_candidates(self, query, candidate_ids, candidate_scores=None, top_k=5):
        """Re-rank the candidate documents for a given query"""
        # Get query embedding
        query_embedding = self.model.encode(query, convert_to_tensor=True)
        
        # Get embeddings for candidate documents
        candidate_indices = [self.paper_ids.index(cid) for cid in candidate_ids]
        candidate_embeddings = self.corpus_embeddings[candidate_indices]
        
        # Calculate cosine similarity between query and candidates
        cos_scores = util.cos_sim(query_embedding, candidate_embeddings)[0]
        
        # If BM25 scores are provided, combine the scores
        if candidate_scores is not None:
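            # Normalize BM25 scores; match dtype/device of the cosine scores
            # so the sum below also works when embeddings live on the GPU
            bm25_scores = torch.tensor(candidate_scores, dtype=cos_scores.dtype, device=cos_scores.device)
            # Clamp the divisor to avoid NaNs when every BM25 score is zero
            bm25_scores = bm25_scores / bm25_scores.max().clamp(min=1e-9)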
            
            # Combine scores (you can adjust the weights)
            alpha = 0.3  # Weight for BM25 scores
            combined_scores = alpha * bm25_scores + (1-alpha) * cos_scores
        else:
            combined_scores = cos_scores
            
        # Sort by score
        top_results = torch.argsort(-combined_scores)[:top_k].tolist()
        
        # Return re-ranked document IDs
        return [candidate_ids[i] for i in top_results]

# Initialize SPECTER reranker
specter_reranker = SPECTERReranker()

# Index the collection
specter_reranker.index_collection(df_collection)

# Process the queries using SPECTER
def process_queries_specter(df_queries, top_k=5):
    results = []
    
    for _, row in df_queries.iterrows():
        query = row['tweet_text']
        
        # First-stage: Get BM25 candidates
        bm25_candidates, bm25_scores = get_top_cord_uids_extended(query)
        
        # Second-stage: SPECTER re-ranking
        reranked_candidates = specter_reranker.rerank_candidates(
            query, 
            bm25_candidates, 
            bm25_scores, 
            top_k=top_k
        )
        
        results.append({
            'post_id': row['post_id'],
            'tweet_text': query,
            'cord_uid': row['cord_uid'],
            'bm25_topk': bm25_candidates[:top_k],  # Same cutoff as the re-ranked list, for comparison
            'specter_reranked': reranked_candidates
        })
    
    return pd.DataFrame(results)

# Process dev data with SPECTER
print("Processing development queries with SPECTER...")
processed_dev_specter = process_queries_specter(df_query_dev)

# Evaluate SPECTER results
specter_results_dev = get_performance_mrr(processed_dev_specter, 'cord_uid', 'specter_reranked')
print("\nDevelopment results:")
print(f"SPECTER: {specter_results_dev}")

No sentence-transformers model found with name allenai/specter. Creating a new one with mean pooling.
To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to see activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development


Calculating document embeddings with SPECTER...


Batches: 100%|██████████| 242/242 [2:15:00<00:00, 33.47s/it]  


Created embeddings for 7718 documents
Processing development queries with SPECTER...

Development results:
SPECTER: {1: 0.5364285714285715, 5: 0.5789404761904762, 10: 0.5789404761904762}
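
Both re-rankers report identical values for MRR@5 and MRR@10. A likely explanation is that each prediction list holds only `top_k=5` IDs, so a cutoff of 10 cannot find additional hits. The sketch below illustrates this with a hypothetical `mrr_at_k` helper and made-up IDs; it assumes `get_performance_mrr` computes the reciprocal rank of the gold ID within the first k predictions.

```python
def mrr_at_k(gold_ids, predicted_lists, k):
    """Mean reciprocal rank of the gold ID within the first k predictions."""
    if not gold_ids:
        return 0.0
    total = 0.0
    for gold, preds in zip(gold_ids, predicted_lists):
        top = list(preds)[:k]
        if gold in top:
            total += 1.0 / (top.index(gold) + 1)
    return total / len(gold_ids)


# Made-up example: every prediction list has exactly five entries
gold = ["a", "b", "c"]
preds = [
    ["a", "x", "y", "z", "w"],  # gold at rank 1
    ["x", "b", "y", "z", "w"],  # gold at rank 2
    ["x", "y", "z", "w", "v"],  # gold missing
]
scores = {k: mrr_at_k(gold, preds, k) for k in (1, 5, 10)}
print(scores)
```

To obtain a distinct MRR@10, the query-processing functions would need to be called with `top_k=10` or higher.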


# Results so far:

In [52]:
# BASE MODELS:
print(f"Baseline BM25: {bm25_results_dev}")
print(f"all-MiniLM-L6-v2: {neural_results_dev}")
print(f"SPECTER: {specter_results_dev}")
print(f"COVID-Twitter-BERT: {covid_twitter_bert_results_dev}")
print(f"SciBERT: {scibert_results_dev}")

# FINE-TUNED MODELS:
print(f"Fine-tuned all-MiniLM-L6-v2: {neural_finetuned_results_dev}")

Baseline BM25: {1: 0.505, 5: 0.5520357142857142, 10: 0.5520357142857142}
all-MiniLM-L6-v2: {1: 0.5728571428571428, 5: 0.6121071428571428, 10: 0.6121071428571428}
SPECTER: {1: 0.5364285714285715, 5: 0.5789404761904762, 10: 0.5789404761904762}
COVID-Twitter-BERT: {1: 0.475, 5: 0.5396547619047619, 10: 0.5396547619047619}
SciBERT: {1: 0.48428571428571426, 5: 0.5314523809523809, 10: 0.5314523809523809}
Fine-tuned all-MiniLM-L6-v2: {1: 0.5564285714285714, 5: 0.6033095238095239, 10: 0.6033095238095239}
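
To compare the models at a glance, the printed dictionaries can be sorted by MRR@5. This is a small sketch that copies the numbers from the output above into a hypothetical `results` dictionary; in the notebook itself, the `*_results_dev` variables could be used directly.

```python
# Dev-set MRR values copied from the output above
results = {
    "Baseline BM25": {1: 0.505, 5: 0.5520357142857142, 10: 0.5520357142857142},
    "all-MiniLM-L6-v2": {1: 0.5728571428571428, 5: 0.6121071428571428, 10: 0.6121071428571428},
    "SPECTER": {1: 0.5364285714285715, 5: 0.5789404761904762, 10: 0.5789404761904762},
    "COVID-Twitter-BERT": {1: 0.475, 5: 0.5396547619047619, 10: 0.5396547619047619},
    "SciBERT": {1: 0.48428571428571426, 5: 0.5314523809523809, 10: 0.5314523809523809},
    "Fine-tuned all-MiniLM-L6-v2": {1: 0.5564285714285714, 5: 0.6033095238095239, 10: 0.6033095238095239},
}

# Rank the models by MRR@5, best first
ranking = sorted(results, key=lambda name: results[name][5], reverse=True)

for name in ranking:
    print(f"{name:<28} MRR@1={results[name][1]:.3f}  MRR@5={results[name][5]:.3f}")
```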


Based on these results, all-MiniLM-L6-v2 without fine-tuning scores highest on the dev set (MRR@5 of 0.612). We will revisit it and optimize it in a new notebook:  
neural_re-ranking-submission.ipynb