# Biomedical Semantic Search & Simplification Pipeline


## Summary
1. Required libraries
2. Data preprocessing

A. Encoder part

3. Optimised model configuration
4. Corpus encoding
5. Saving embeddings & corpus

B. Decoder part

6. Decoder configuration
7. Setting up user input for the chatbot


### 1. Required libraries

In [2]:
# library installation
! pip install adapters transformers torch pandas tqdm numpy -q

### 2. Data preprocessing
Steps to load data into Google Colab:

1. Go to your Google Drive.

2. Create a folder called `Projet_NLP` (the folder name that `DATA_PATH` points to later in the notebook).

3. Drag the `OptimusPrime_for_GoogleColab.ipynb` notebook into it and double-click it.

4. The CRUCIAL step: select a GPU runtime.

- In the top right corner, click on the small arrow next to 'Connect' (or 'RAM/Disk').

- Select Change runtime type.

- Choose T4 GPU.
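
Once the runtime has been changed, it can be worth confirming that a GPU is actually visible before loading any model. The sketch below uses only the standard library and assumes that the `nvidia-smi` tool is present on GPU runtimes; the helper name `gpu_runtime_visible` is hypothetical and not part of the notebook.

```python
import shutil
import subprocess

def gpu_runtime_visible():
    """Return True if `nvidia-smi` is on PATH and lists at least one GPU."""
    if shutil.which("nvidia-smi") is None:
        return False
    try:
        # `nvidia-smi -L` prints one line per GPU, e.g. "GPU 0: Tesla T4 (...)"
        result = subprocess.run(
            ["nvidia-smi", "-L"], capture_output=True, text=True, timeout=10
        )
    except (OSError, subprocess.SubprocessError):
        return False
    return result.returncode == 0 and "GPU" in result.stdout

has_gpu = gpu_runtime_visible()
print("GPU runtime detected" if has_gpu else "No GPU detected: check the runtime type")
```

If no GPU is detected, the encoder cells below will fail because they hard-code `device = "cuda"`.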

In [10]:
import os
import urllib.request
from google.colab import drive
from pathlib import Path
from tqdm import tqdm


# Mount Google Drive
drive.mount('/content/drive')

# Create the local data directory
local_path = "/content/data_temp"
os.makedirs(local_path, exist_ok=True)

# Files to download (URL, local filename)
files = [
    ("https://zenodo.org/records/14801641/files/relish_documents.tsv","relish_documents.tsv"),
    ("https://zenodo.org/records/14801641/files/relevance_matrix.tsv", "relevance_matrix.tsv")
]

def download_with_progress(url, destination_path):
    class DownloadProgressBar(tqdm):
        def update_to(self, b=1, bsize=1, tsize=None):
            if tsize is not None:
                self.total = tsize
            self.update(b * bsize - self.n)

    with DownloadProgressBar(unit='B', unit_scale=True, miniters=1, desc=os.path.basename(destination_path)) as t:
        urllib.request.urlretrieve(url, filename=destination_path, reporthook=t.update_to)

for url, filename in files:
    destination_path = os.path.join(local_path, filename)
    if not os.path.exists(destination_path):
        download_with_progress(url, destination_path)
    else:
        print(f"{filename} already exists in {local_path}. Download ignored.")

# Verification
print("Files :")
print(os.listdir(local_path))


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
relish_documents.tsv already exists in /content/data_temp. Download ignored.
relevance_matrix.tsv already exists in /content/data_temp. Download ignored.
Files :
['__MACOSX', 'relish_documents.tsv', 'data_BumbleBee', 'relevance_matrix.tsv']
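
The download cell skips any file that already exists, so a download interrupted halfway would leave a partial file that is never fetched again. A possible safeguard, sketched below under that assumption, is to download to a temporary file and rename it only once complete. The helper `download_atomically` and its `fetch` parameter are hypothetical; `fetch` defaults to `urllib.request.urlretrieve`, as used above.

```python
import os
import tempfile
import urllib.request

def download_atomically(url, destination_path, fetch=urllib.request.urlretrieve):
    """Download to a temporary file, then rename it into place once complete."""
    directory = os.path.dirname(destination_path) or "."
    os.makedirs(directory, exist_ok=True)
    # The temporary file lives in the target directory so the final rename
    # stays on one filesystem
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".part")
    os.close(fd)
    try:
        fetch(url, tmp_path)
        os.replace(tmp_path, destination_path)
    except BaseException:
        # Includes KeyboardInterrupt: never leave a partial file behind
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
    return destination_path

# Usage (not executed here):
# download_atomically(url, os.path.join(local_path, filename))
```

With this pattern, the `os.path.exists` check in the loop only succeeds for files that were downloaded in full.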


In [12]:
import pandas as pd


# Corpus loading
corpus_file = f"{local_path}/relish_documents.tsv"

print("Loading corpus...")
corpus_df = pd.read_csv(corpus_file, sep='\t')

# Inspect the structure
print(f"✓ Corpus: {len(corpus_df)} articles")

# Standardise column names
if 'PMID' in corpus_df.columns:
    corpus_df.rename(columns={'PMID': 'pmid', 'Title': 'title', 'Abstract': 'abstract'},
                     inplace=True)

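# Drop articles without an abstract; reset the index so that row positions
# stay aligned with the rows of the embedding matrix built later
corpus_df = corpus_df.dropna(subset=['abstract']).reset_index(drop=True)
corpus_df['pmid'] = corpus_df['pmid'].astype(str)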

print(f"After cleaning: {len(corpus_df)} articles")
print(f"Columns: {corpus_df.columns.tolist()}")


Loading corpus...
✓ Corpus: 163189 articles
After cleaning: 163189 articles
Columns: ['pmid', 'title', 'abstract']
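
Each embedding row is later matched to a corpus row by position, so it may help to build an explicit PMID-to-position lookup once the corpus is cleaned. The sketch below is one way to do that in plain Python; `build_pmid_lookup` is a hypothetical helper, and apart from the PMID used in the usage example at the end of this notebook, the toy PMIDs are made up.

```python
def build_pmid_lookup(pmids):
    """Map each PMID (as a string) to its row position, keeping the first occurrence."""
    lookup = {}
    for position, pmid in enumerate(pmids):
        # setdefault keeps the first position if a PMID appears more than once
        lookup.setdefault(str(pmid), position)
    return lookup

# Toy values: integers and strings are both normalised to strings
toy_pmids = [27683064, "111", "222"]
lookup = build_pmid_lookup(toy_pmids)
print(lookup.get("27683064"))  # row position of that PMID, or None if absent
```

On the real corpus this would be called as `build_pmid_lookup(corpus_df['pmid'])`; because `enumerate` counts positions, the result does not depend on the DataFrame index labels.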


## A. Encoder part
<a href="https://huggingface.co/allenai/specter2">Specter2</a> is chosen to vectorise the corpus. It can generate task-specific embeddings for scientific documents; the `proximity` adapter loaded below is the one intended for document-similarity search.

### 3. Optimised model configuration

In [13]:
import torch
from tqdm import tqdm
from transformers import AutoTokenizer
from adapters import AutoAdapterModel
from torch.cuda.amp import autocast

# Batch & device configuration
device = "cuda"
BATCH_SIZE = 128

print(f"device: {torch.cuda.get_device_name(0)}")

tokenizer = AutoTokenizer.from_pretrained('allenai/specter2_base')
model = AutoAdapterModel.from_pretrained('allenai/specter2_base')
model.load_adapter("allenai/specter2", source="hf", load_as="proximity", set_active=True)

# Make sure the proximity adapter is the active one
model.set_active_adapters("proximity")

# Check that 'proximity' is reported as the active adapter
print(f"Active adapter?{model.active_adapters}")

model.to(device)

device: Tesla T4



Active adapter?Stack[proximity]


BertAdapterModel(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(31090, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSdpaSelfAttentionWithAdapters(
              (query): LoRALinearTorch(
                in_features=768, out_features=768, bias=True
                (shared_parameters): ModuleDict()
                (loras): ModuleDict()
              )
              (key): LoRALinearTorch(
                in_features=768, out_features=768, bias=True
                (shared_parameters): ModuleDict()
                (loras): ModuleDict()
              )
              (value): LoRALinearTorch(
             

### 4. Corpus encoding

In [16]:
import numpy as np

# Encoding function
def encode_papers_optimized(papers_list):
    # Prepare the inputs
    inputs = tokenizer(
        papers_list,          # raw list of texts
        padding=True,         # pad every text to the longest one in the batch (needed to build a tensor)
        truncation=True,      # cut off the end of any text that exceeds max_length
        return_tensors="pt",  # return PyTorch tensors instead of plain Python lists
        max_length=512        # token limit used by truncation
    ).to(device)

    # Mixed precision: eligible operations run in 16-bit instead of 32-bit
    with torch.no_grad():
        with torch.autocast(device_type="cuda", dtype=torch.float16):
            outputs = model(**inputs)

    # Keep the embedding of the first ([CLS]) token of each document
    return outputs.last_hidden_state[:, 0, :].cpu().numpy().astype(np.float32)

# Input texts: title + [SEP] + abstract (a missing title becomes an empty string)
texts_to_encode = (corpus_df['title'].fillna('') + tokenizer.sep_token + corpus_df['abstract']).tolist()

embeddings = []
total_docs = len(texts_to_encode)

print(f"Encoding {total_docs} documents in batches of {BATCH_SIZE}...")

for i in tqdm(range(0, total_docs, BATCH_SIZE), unit="batch"):
    # Select the batch
    batch_texts = texts_to_encode[i : i + BATCH_SIZE]

    # Encode it
    emb = encode_papers_optimized(batch_texts)
    embeddings.append(emb)

# Stack the per-batch arrays into one matrix
final_embeddings = np.vstack(embeddings)
print(f"Encoding finished! Shape: {final_embeddings.shape}")

Encoding 163189 documents in batches of 128...


  1%|          | 12/1275 [00:15<27:52,  1.32s/batch]


KeyboardInterrupt: 
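
Because `padding=True` pads every text in a batch to the longest one, batches that mix short and long texts spend time on padding tokens. One common way to reduce this, which the notebook itself does not use, is to group texts of similar length and restore the original order afterwards. The sketch below illustrates the idea with character length as a rough proxy for token count; both helper names are hypothetical, and the "encoder" is a stand-in that returns the length of each text.

```python
def length_sorted_batches(texts, batch_size):
    """Yield (original_indices, batch_texts) with texts grouped by similar length."""
    order = sorted(range(len(texts)), key=lambda i: len(texts[i]))
    for start in range(0, len(order), batch_size):
        indices = order[start:start + batch_size]
        yield indices, [texts[i] for i in indices]

def restore_order(batched_results, n_items):
    """Put per-item results back at their original positions."""
    restored = [None] * n_items
    for indices, results in batched_results:
        for i, result in zip(indices, results):
            restored[i] = result
    return restored

toy_texts = ["aaaa", "a", "aaa", "aa", "aaaaa"]
# Stand-in for the encoder: each text is "encoded" as its length
results = [
    (indices, [len(text) for text in batch])
    for indices, batch in length_sorted_batches(toy_texts, batch_size=2)
]
restored = restore_order(results, len(toy_texts))
print(restored)
```

Restoring the order matters here because the rows of the embedding matrix must stay aligned with the rows of `corpus_df`.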

### 5. Saving embeddings & corpus

In [17]:
import os

DATA_PATH = "/content/drive/MyDrive/Projet_NLP"
os.makedirs(DATA_PATH, exist_ok=True)  # create the Drive folder if it is missing

np.save(f"{DATA_PATH}/relish_embeddings_specter_turbo.npy", final_embeddings)

In [None]:
corpus_df.to_pickle(f"{DATA_PATH}/corpus_Optimus.pkl")
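
Since the search step relies on one embedding row per corpus row, a quick check at save time can catch a mismatch, for example after an interrupted encoding run. The sketch below is a hypothetical helper (`save_and_verify` is not part of the notebook) shown on toy data; 768 is the embedding width reported in the model summary above.

```python
import os
import tempfile

import numpy as np

def save_and_verify(embeddings, n_documents, path):
    """Save embeddings to `path` and confirm they reload with one row per document."""
    if embeddings.shape[0] != n_documents:
        raise ValueError(
            f"{embeddings.shape[0]} embedding rows for {n_documents} documents"
        )
    np.save(path, embeddings)
    reloaded = np.load(path)
    if not np.array_equal(reloaded, embeddings):
        raise IOError("Reloaded embeddings differ from the saved array")
    return reloaded.shape

# Toy data standing in for final_embeddings (5 documents, 768 dimensions)
toy_embeddings = np.arange(5 * 768, dtype=np.float32).reshape(5, 768)
toy_path = os.path.join(tempfile.mkdtemp(), "toy_embeddings.npy")
saved_shape = save_and_verify(toy_embeddings, n_documents=5, path=toy_path)
print(saved_shape)
```

In the notebook, the equivalent call would pass `final_embeddings` and `len(corpus_df)`.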

*<p style="color: green;">THE DECODER PART IS INDEPENDENT OF THE ENCODER; IT WORKS IF THE CORPUS AND ITS EMBEDDINGS ARE LOADED FIRST.</p>*


## B. Decoder part
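<a href="https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct">Qwen 2.5</a> (1.5B parameters, instruction-tuned) is selected as the text generator because it is small enough to run locally (about 3 GB in float16) and general-purpose enough to summarise abstracts from a plain-language prompt.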

### 6. Decoder configuration

In [28]:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Configuration
model_id = "Qwen/Qwen2.5-1.5B-Instruct"

print(f"Loading {model_id}...")

tokenizer = AutoTokenizer.from_pretrained(model_id)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto"
)
# Report the memory taken by the model weights
print(f"✓ Model loaded! Memory used: ~{model.get_memory_footprint() / 1e9:.0f} GB")

# Summary function
def summarize_with_qwen(abstract_text):
    """
    Summarise a scientific abstract in plain language with Qwen 2.5
    """
    # Define the role of the chatbot and the user request
    messages = [
        {"role": "system", "content": "You are a helpful scientific assistant. Your goal is to summarize complex medical abstracts for a general audience (non-experts)."},
        {"role": "user", "content": f"Summarize this text in 4 simple sentences:\n\n{abstract_text}"}
    ]

    # Build the prompt with the model's chat template
    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )

    model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

    # Text generation
    with torch.no_grad():
        generated_ids = model.generate(
            **model_inputs,
            max_new_tokens=150,  # maximum length of the summary
            do_sample=True,      # sample tokens so that temperature and top_p take effect
            temperature=0.7,     # lower values give more deterministic text
            top_p=0.9            # nucleus sampling: keep the smallest token set with cumulative probability >= 0.9
        )

    # Decoding: generated_ids contains [prompt + response], so drop the prompt tokens
    input_len = model_inputs.input_ids.shape[1]
    response_ids = generated_ids[0][input_len:]

    response = tokenizer.decode(response_ids, skip_special_tokens=True)
    return response

# TEST
test_abstract = """
Acute myocardial infarction is a medical emergency comprising chest pain
and signs of ischemia on ECG. It is caused by the occlusion of a coronary artery.
Time is muscle, and rapid reperfusion is mandatory to save the patient.
"""

print("TEST SUMMARY")
summary = summarize_with_qwen(test_abstract)
print(summary)

Loading Qwen/Qwen2.5-1.5B-Instruct...
✓ Model loaded! Memory used: ~3 GB
TEST SUMMARY
Myocardial infarction is an urgent heart attack with chest pain and blocked arteries that must be treated quickly to save the heart muscle.
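
The `temperature` and `top_p` arguments passed to `model.generate` shape the distribution from which each next token is sampled. The sketch below is a conceptual illustration in plain Python, not the `transformers` implementation: it scales toy logits by the temperature, keeps the smallest set of tokens whose cumulative probability reaches `top_p`, and renormalises. The function name `top_p_distribution` and the logits are made up for the example.

```python
import math

def top_p_distribution(logits, temperature=0.7, top_p=0.9):
    """Return sampling probabilities after temperature scaling and top-p filtering."""
    # Temperature scaling followed by a numerically stable softmax
    scaled = [x / temperature for x in logits]
    highest = max(scaled)
    exps = [math.exp(x - highest) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep the most probable tokens until their cumulative mass reaches top_p
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = set(), 0.0
    for i in order:
        kept.add(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    # Renormalise over the kept tokens; all others get probability 0
    kept_mass = sum(probs[i] for i in kept)
    return [probs[i] / kept_mass if i in kept else 0.0 for i in range(len(probs))]

dist = top_p_distribution([2.0, 1.0, 0.0, -3.0])
print([round(p, 3) for p in dist])
```

With these toy logits, only the two most likely tokens keep a non-zero probability: unlikely tokens are never sampled, while some variety remains among the likely ones.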


### 7. Setting up user input for the chatbot

In [None]:
import numpy as np
import pandas as pd

# Same path as in section 5, repeated here so that the decoder part can run
# without the encoder cells (Google Drive must be mounted)
DATA_PATH = "/content/drive/MyDrive/Projet_NLP"

DATAFRAME = pd.read_pickle(f"{DATA_PATH}/corpus_Optimus.pkl")
EMBEDDING = np.load(f"{DATA_PATH}/relish_embeddings_specter_turbo.npy")

def moteur_recherche_medical(pmid_entre, dataframe=DATAFRAME, embeddings_matrice=EMBEDDING, k=10):
    # 1. Find the row position of the query article
    pmid_cherche = str(pmid_entre)

    # Positions (not index labels) of the rows whose PMID matches
    positions = np.flatnonzero((dataframe['pmid'] == pmid_cherche).to_numpy())
    if len(positions) == 0:
        return "Sorry, this PMID is not currently in my article database.", []
    index_article = positions[0]

    # 2. Retrieve its (Specter) vector
    vecteur_cible = embeddings_matrice[index_article].reshape(1, -1)

    # 3. Compute the similarity (dot product) with every article at once
    scores = np.dot(embeddings_matrice, vecteur_cible.T).flatten()

    # 4. Sort by decreasing score and keep the top k, excluding the query article itself
    ordre = np.argsort(scores)[::-1]
    indices_voisins = ordre[ordre != index_article][:k]

    # 5. Retrieve the text data of the neighbours
    resultats = dataframe.iloc[indices_voisins]

    # 6. Decoder: summarise the abstract of the closest neighbour (top 1)
    meilleur_match_abstract = resultats.iloc[0]['abstract']
    meilleur_match_titre = resultats.iloc[0]['title']

    print(f"--- The closest article is: {meilleur_match_titre} ---")

    # Call the Qwen summariser (the decoder)
    resume_vulgarise = summarize_with_qwen(meilleur_match_abstract)

    return resume_vulgarise, resultats[['pmid', 'title']]

# --- USAGE EXAMPLE ---
resume, liste_articles = moteur_recherche_medical(27683064)
print(resume)

--- The closest article is: Sclerostin mediates bone response to mechanical unloading through antagonizing Wnt/beta-catenin signaling. ---
The research shows how mechanical stress affects bones. When bones don't get enough force from exercise or gravity, they start to lose calcium. Scientists discovered a protein called sclerostin that helps stop this happening. If you take away sclerostin from animals' bodies, their bones stay healthier when they're inactive. This could be important because it means there might be ways to help people who can't move much keep their bones strong without getting weaker over time.
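
The search above ranks articles by raw dot product, which favours vectors with a large norm when the embeddings are not unit-normalised. A cosine-similarity variant is sketched below on toy data as one possible alternative; it is not what the notebook runs, and `top_k_cosine` is a hypothetical helper.

```python
import numpy as np

def top_k_cosine(embeddings, query_index, k=10):
    """Return the indices and cosine similarities of the k nearest rows, excluding the query."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)  # guard against zero vectors
    scores = unit @ unit[query_index]
    scores[query_index] = -np.inf                    # never return the query itself
    k = min(k, len(scores) - 1)
    top = np.argsort(scores)[::-1][:k]
    return top, scores[top]

# Toy matrix: row 3 has a large norm but points in a different direction from row 0
toy = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [10.0, 10.0]])
neighbours, sims = top_k_cosine(toy, query_index=0, k=2)
print(neighbours.tolist())
```

On this toy matrix, the raw dot product with row 0 gives row 3 a score of 10.0, above the score of 1.0 that row 0 gets with itself, whereas cosine similarity ranks the nearly parallel row 1 first.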
