<a href="https://colab.research.google.com/github/syrrex/MedCompare/blob/main/AIR.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# MedCompare: Using Bio+ClinicalBERT for comparative analysis of medications



## Requirements

In [1]:
!pip install datasets
!pip install transformers
!pip install huggingface_hub


In [2]:
from huggingface_hub import notebook_login

In [3]:
notebook_login()


## 1. Data Processing

In [10]:
import json
import re
import time

import requests
import torch
from datasets import load_dataset
from transformers import AutoModel, AutoTokenizer, pipeline

Load the necessary models and data


In [11]:
# Load Bio+ClinicalBERT tokenizer, model and data
tokenizer = AutoTokenizer.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")
model = AutoModel.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")
dataset = load_dataset("MattBastar/Medicine_Details")


We get the ontology mapping via the open-source platform BioPortal. This requires a file (api-key.txt) that contains a BioPortal API key. To get your own key, create an account at https://bioportal.bioontology.org/.

In [12]:
BASE_URL = "http://data.bioontology.org"
# Read the BioPortal API key from a local file
with open("api-key.txt", "r") as file:
    API_KEY = file.read().strip()

headers = {
    "Authorization": f"apikey token={API_KEY}"
}
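
As a small variation on the cell above, the key could also be read from an environment variable, with the file as a fallback. This is only a sketch: the variable name `BIOPORTAL_API_KEY` and the helpers `load_api_key` and `build_headers` are assumptions made for illustration and are not part of the notebook.

```python
import os
from pathlib import Path


def load_api_key(path="api-key.txt", env_var="BIOPORTAL_API_KEY"):
    """Return the BioPortal API key from the environment or from a file.

    `env_var` is a hypothetical variable name chosen for this sketch.
    """
    key = os.environ.get(env_var, "").strip()
    if key:
        return key
    key_file = Path(path)
    if not key_file.is_file():
        raise FileNotFoundError(
            f"No API key found: set {env_var} or create {path}"
        )
    return key_file.read_text().strip()


def build_headers(api_key):
    # Same header format as in the cell above.
    return {"Authorization": f"apikey token={api_key}"}
```

Failing early with a clear message avoids sending requests with an empty key, which would otherwise only surface as failed lookups later on.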

### 1.1 Clean data


In [None]:
def clean_text(text):
    text = re.sub(r"[^a-zA-Z0-9\s]", "", text)  # Remove special characters
    text = re.sub(r"\b\d+\b", "", text)  # Remove standalone numbers
    text = re.sub(r"\s+", " ", text).strip()  # Remove extra spaces
    return text
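
To make the effect of the three substitutions concrete, the sketch below repeats `clean_text` and applies it to two made-up strings (the sample texts are assumptions, not taken from the dataset).

```python
import re


def clean_text(text):
    text = re.sub(r"[^a-zA-Z0-9\s]", "", text)  # Remove special characters
    text = re.sub(r"\b\d+\b", "", text)  # Remove standalone numbers
    text = re.sub(r"\s+", " ", text).strip()  # Remove extra spaces
    return text


# "500mg" survives because the digits are attached to letters;
# the standalone "2" is removed.
print(clean_text("Paracetamol (500mg), 2 tablets/day!"))
# → Paracetamol 500mg tabletsday

print(clean_text("  Nausea & vomiting 10 "))
# → Nausea vomiting
```

Note that punctuation is deleted rather than replaced by a space, so words separated only by a slash or hyphen are joined (`tablets/day` becomes `tabletsday`).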

### 1.2 Tokenize dataset fields

In [15]:
def tokenize_text(text):
    tokens = tokenizer(text, padding="max_length", truncation=True, max_length=128)
    return tokens["input_ids"]



### 1.3 Named Entity Recognition (NER)
(Only applied to the dataset; not strictly necessary.)

In [65]:
ner_pipeline = pipeline("ner", model="dslim/bert-base-NER")

def extract_entities(text):
    """
    Extracts entities from text using the NER model.
    """
    entities = ner_pipeline(text)
    return [{"word": entity["word"], "entity": entity["entity"], "score": entity["score"]} for entity in entities]


Some weights of the model checkpoint at dslim/bert-base-NER were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
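
With its default settings, the `ner` pipeline reports one entry per WordPiece token, so continuation pieces appear as words starting with `##`. Such fragments are unlikely to match any ontology term. The sketch below shows one possible way to join them before the lookup; the helper `merge_wordpieces` and the sample entities are hypothetical, and the function assumes that a `##` piece always continues the entry directly before it.

```python
def merge_wordpieces(entities):
    """Join WordPiece continuations ("##...") onto the preceding entry.

    The merged entry keeps the label of its first piece and the lowest
    score among its pieces.
    """
    merged = []
    for ent in entities:
        word = ent["word"]
        if word.startswith("##") and merged:
            merged[-1]["word"] += word[2:]
            merged[-1]["score"] = min(merged[-1]["score"], ent["score"])
        else:
            merged.append(dict(ent))  # copy so the input is not modified
    return merged


# Hypothetical pipeline output for illustration
sample_entities = [
    {"word": "Para", "entity": "B-MISC", "score": 0.9},
    {"word": "##ce", "entity": "I-MISC", "score": 0.8},
    {"word": "##tamol", "entity": "I-MISC", "score": 0.7},
    {"word": "Germany", "entity": "B-LOC", "score": 0.99},
]
print(merge_wordpieces(sample_entities))
```

An alternative may be to pass `aggregation_strategy="simple"` when creating the pipeline; in that case the results use the key `entity_group` instead of `entity`, so `extract_entities` would need to be adapted.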


### 1.4 Ontology Mapping using BioPortal API

For the dataset, we use the mapping that we created beforehand (ontology_mappings.json). For user data, we look up the terms using the API.

In [66]:
# Function to map entities to ontology terms
def map_entities_to_ontology(entities, ontology_dict):
    mapped_entities = []
    for entity in entities:
        term = entity["word"].lower()  # Convert to lowercase for consistent matching
        ontology_id = ontology_dict.get(term, "unknown")  # Default to "unknown" if not found
        mapped_entity = {
            "word": entity["word"],
            "ontology_id": ontology_id,  # Ensure this is always a string
            "entity": entity["entity"],
            "score": entity["score"]
        }
        mapped_entities.append(mapped_entity)
    return mapped_entities

For user input:

In [60]:
# Function to look up ontology mappings from BioPortal API

def get_bioportal_mapping(term):

    params = {
        "q": term,
        "require_exact_match": "true"
    }
    response = requests.get(f"{BASE_URL}/search", headers=headers, params=params)

    if response.status_code != 200:
        return {term: "unknown"}  # Default to "unknown" if the API call fails

    data = response.json()

    # Filter relevant mappings based on ontology prefixes
    relevant_prefixes = [
        "http://purl.bioontology.org/ontology",  # BioPortal's main prefix
        "http://www.co-ode.org/ontologies/galen",  # GALEN ontology
        "http://ncicb.nci.nih.gov"  # NCI Thesaurus
    ]

    for result in data.get("collection", []):
        label = result.get("prefLabel")
        ontology_id = result.get("@id")

        if label and ontology_id and any(ontology_id.startswith(prefix) for prefix in relevant_prefixes):
            return {label.lower(): ontology_id}

    # Default
    return {term: "unknown"}
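
The selection logic can be tried without network access by separating it from the HTTP call. The sketch below assumes the response shape used above (a `collection` list whose items carry `prefLabel` and `@id`); the helper `pick_mapping` and the sample response are hypothetical, and the first `@id` in the sample is an invented URL.

```python
RELEVANT_PREFIXES = (
    "http://purl.bioontology.org/ontology",  # BioPortal's main prefix
    "http://www.co-ode.org/ontologies/galen",  # GALEN ontology
    "http://ncicb.nci.nih.gov",  # NCI Thesaurus
)


def pick_mapping(term, data, prefixes=RELEVANT_PREFIXES):
    """Return {label: ontology_id} for the first result with a relevant prefix."""
    for result in data.get("collection", []):
        label = result.get("prefLabel")
        ontology_id = result.get("@id")
        # str.startswith accepts a tuple of candidate prefixes
        if label and ontology_id and ontology_id.startswith(prefixes):
            return {label.lower(): ontology_id}
    return {term: "unknown"}


# Hypothetical response: the first entry has an irrelevant prefix and is skipped
sample = {
    "collection": [
        {"prefLabel": "Nausea", "@id": "http://example.org/other/NAUSEA"},
        {"prefLabel": "Nausea", "@id": "http://purl.bioontology.org/ontology/CST/NAUSEA"},
    ]
}
print(pick_mapping("nausea", sample))
print(pick_mapping("xyz", {"collection": []}))
```

The returned key is the lowercased preferred label, not the query term, so the two can differ when BioPortal's label is spelled differently from the input.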

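### 1.4.1 Creating the ontology_mappings.json
This cell creates ontology_mappings.json, the file containing the mapping for the whole dataset. Running it takes a long time; the finished file is also available on GitHub. Note that this cell redefines get_bioportal_mapping: this version does not filter by ontology prefix and returns an empty dictionary when nothing is found, so re-run the cell in 1.4 before processing user input.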


In [None]:
def get_bioportal_mapping(term):
    params = {
        "q": term,
        "apikey": API_KEY,
        "require_exact_match": "true"
    }
    response = requests.get(f"{BASE_URL}/search", headers=headers, params=params)
    data = response.json()

    # Extract preferred label and ontology ID for the first result, if available
    mappings = {}
    for result in data.get("collection", []):
        label = result.get("prefLabel")
        ontology_id = result.get("@id")
        if label and ontology_id:
            mappings[label.lower()] = ontology_id
            break  # Take only the first result for simplicity

    return mappings

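# Build the ontology dictionary for all unique terms.
# NOTE: `terms` is not defined in this notebook; it is expected to be an
# iterable of the unique terms to look up.
ontology_dict = {}

for term in terms: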
    mappings = get_bioportal_mapping(term)
    ontology_dict.update(mappings)

    # To avoid hitting rate limits, add a delay if needed
    time.sleep(0.5)

print("Ontology dictionary created with mappings for all terms.")

with open("ontology_mappings.json", "w") as f:
    json.dump(ontology_dict, f)

print("Ontology mappings saved to ontology_mappings.json")
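
The loop above iterates over `terms`, which the notebook does not define. One plausible way to build it, assuming the terms are the unique lowercase words of the cleaned `Uses` and `Side_effects` fields, is sketched below. The helper `collect_terms` and the sample records are hypothetical.

```python
def collect_terms(records, fields=("Uses", "Side_effects")):
    """Collect the unique lowercase words found in the given text fields."""
    terms = set()
    for record in records:
        for field in fields:
            text = record.get(field) or ""  # treat missing or None as empty
            terms.update(word.lower() for word in text.split())
    return sorted(terms)  # sorted for a reproducible lookup order


# Hypothetical records for illustration
records = [
    {"Uses": "Treatment of headache", "Side_effects": "Nausea Headache"},
    {"Uses": "Pain relief", "Side_effects": None},
]
terms = collect_terms(records)
print(terms)
# → ['headache', 'nausea', 'of', 'pain', 'relief', 'treatment']
```

Deduplicating first keeps the number of API calls, and therefore the total waiting time from `time.sleep(0.5)`, proportional to the vocabulary size rather than to the number of rows.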

### 1.5 Generate Word Embeddings with Bio+ClinicalBERT

In [67]:
def generate_embeddings(text, tokenizer, model):

    inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=128)
    with torch.no_grad():
        outputs = model(**inputs)
        embeddings = outputs.last_hidden_state[:, 0, :]  # Use the [CLS] token's embedding
    return embeddings.squeeze(0).tolist()
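
Because `generate_embeddings` returns a plain list of floats, two such vectors can be compared without further dependencies. The Similarity Ranking section of this notebook is still empty, so cosine similarity is used here only as an assumption about how the embeddings might be compared; the helper `cosine_similarity` is not part of the notebook.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length lists of floats."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen in this sketch for zero vectors
    return dot / (norm_a * norm_b)


# Toy vectors standing in for the 768-dimensional [CLS] embeddings
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # orthogonal vectors
print(cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))  # parallel vectors
```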

### 1.6 Processing User Input


In [54]:
# TODO: Named Entity Resolution


In [69]:
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")
model = AutoModel.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")

def preprocess_user_input(user_input):
    """
    Preprocesses user input: cleaning, ontology mapping, and embedding generation.
    """
    # Step 1: Clean the input
    cleaned_input = clean_text(user_input)

    # Step 2: Split input into words and look up mappings for each term
    words = cleaned_input.split()
    mapped_terms = {}
    for word in words:
        mapping = get_bioportal_mapping(word)  # Query BioPortal for each term
        mapped_terms.update(mapping)

    # Step 3: Reconstruct the mapped input
    mapped_input = " ".join(mapped_terms.keys())

    # Step 4: Generate embeddings for the mapped input
    inputs = tokenizer(mapped_input, return_tensors="pt", padding=True, truncation=True, max_length=128)
    with torch.no_grad():
        outputs = model(**inputs)
        embeddings = outputs.last_hidden_state[:, 0, :]  # Use [CLS] token for embeddings

    return {
        "cleaned_input": cleaned_input,
        "mapped_terms": mapped_terms,
        "mapped_input": mapped_input,
        "embeddings": embeddings.squeeze(0).tolist()
    }

# Example user input
user_input = "I have severe nausea and headache."
result = preprocess_user_input(user_input)
print(result)

{'cleaned_input': 'I have severe nausea and headache', 'mapped_terms': {'i': 'http://purl.bioontology.org/ontology/SNOMEDCT/257989008', 'have': 'http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#C101282', 'severe': 'http://purl.bioontology.org/ontology/SNOMEDCT/24484000', 'nausea': 'http://purl.bioontology.org/ontology/CST/NAUSEA', 'and': 'http://purl.bioontology.org/ontology/SNOMEDCT/421829000', 'headache': 'http://purl.bioontology.org/ontology/CST/HEADACHE'}, 'mapped_input': 'i have severe nausea and headache', 'embeddings': [0.3151679039001465, 0.16124042868614197, -0.40277090668678284, 0.4848475754261017, 0.3937555253505707, -0.2277817279100418, 0.010176713578402996, -0.20110704004764557, 0.523129403591156, 0.03740783780813217, -0.17707407474517822, -0.01949639618396759, -0.10518905520439148, 0.13322213292121887, -0.5156891942024231, 0.1786038726568222, -0.12779296934604645, -0.4906279146671295, -0.6355778574943542, -0.14065945148468018, -0.03880491480231285, 0.1575649231672287, -

### 1.7 Process Dataset

Instead of processing the whole dataset, you can also load the already processed combined_dataset.json from GitHub (see 1.7.2).

In [None]:
with open("ontology_mappings.json", "r") as f:
    ontology_dict = json.load(f)

for field in ["Composition", "Uses", "Side_effects"]:
    dataset = dataset.map(lambda x: {field: clean_text(x[field]) if x[field] else x[field]})

for field in ["Composition", "Uses", "Side_effects"]:
    dataset = dataset.map(lambda x: {f"{field}_tokens": tokenize_text(x[field])}, batched=True)

# Apply NER on relevant fields
for field in ["Uses", "Side_effects"]:
    dataset = dataset.map(lambda x: {f"{field}_entities": extract_entities(x[field]) if x[field] else []})

# Map entities to ontology IDs for "Uses" and "Side_effects"
for field in ["Uses", "Side_effects"]:
    dataset = dataset.map(lambda x: {
        f"{field}_mapped_entities": map_entities_to_ontology(x[f"{field}_entities"], ontology_dict)
    })

# Generate embeddings for dataset fields
for field in ["Composition", "Uses", "Side_effects"]:
    dataset = dataset.map(lambda x: {
        f"{field}_embeddings": generate_embeddings(x[field], tokenizer, model) if x[field] else None
    })

### 1.7.1 Save dataset

In [29]:
from datasets import concatenate_datasets

if isinstance(dataset, dict):  # If the dataset has splits
    combined_dataset = concatenate_datasets(list(dataset.values()))
else:
    combined_dataset = dataset  # If no splits, use the dataset as is

combined_dataset.to_json("combined_dataset.json")


385932638

### 1.7.2 Load dataset

In [None]:
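# Open the already processed dataset.
# Dataset.to_json writes JSON Lines by default (one record per line),
# so each line is parsed separately.
with open("combined_dataset.json", "r") as f:
    combined_dataset = [json.loads(line) for line in f if line.strip()]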

## 2. Similarity Ranking

## 3. Evaluation