# Embedding - Qwen

This notebook preprocesses medical symptom terms and generates embeddings for them. The workflow includes:
1. **Data Loading**: Load the original data from a JSON file
2. **Data Cleaning**: Filter and preprocess medical terms
3. **Embedding Generation**: Create embeddings for the cleaned terms using Qwen
4. **Output**: Save the embedding results to `qwen3_term_embeddings_4B.npz`

## Overview
The goal is to group similar medical terms/symptoms together based on their semantic similarity using embedding vectors.
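
Semantic similarity between two embedding vectors is commonly measured with cosine similarity. The following is a minimal sketch on toy 2-D vectors; the vectors and the `cosine_similarity` helper are made up for illustration and are not part of this notebook's pipeline.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy vectors standing in for term embeddings (illustrative values only).
term_a = [1.0, 0.0]
term_b = [1.0, 1.0]   # points in a similar direction to term_a
term_c = [0.0, 1.0]   # orthogonal to term_a

sim_close = cosine_similarity(term_a, term_b)
sim_far = cosine_similarity(term_a, term_c)
print(round(sim_close, 3), round(sim_far, 3))
```

Terms whose vectors point in similar directions score close to 1, while unrelated directions score close to 0.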


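## Step 1: Raw Data Preprocessing

Load the original JSON file. It contains a list of records, each with:
- **`mimNumber`**: OMIM identifier of the syndrome
- **`preferredTitle`**: Syndrome name
- **`clinicalSynopsis`**: List of clinical synopsis terms

Clean the data by removing duplicates and null records, then apply content filters:
- **Original records**: 1,972
- **After de-duplication**: 1,716
- **After removing null records**: 1,660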

Output: \
✅ Saved 1,660 records to /content/drive/MyDrive/Colab Notebooks/CS229/Final Project/Embedding/rawdata_clean.jsonl

In [None]:
# Mount Google Drive
from google.colab import drive
from pathlib import Path

drive.mount('/content/drive', force_remount=True)
folder = Path('/content/drive/MyDrive/Colab Notebooks/CS229/Final Project/1-Embedding')
folder.mkdir(parents=True, exist_ok=True)


Mounted at /content/drive


In [None]:
# Load the json file
import json
from pathlib import Path
from pprint import pprint

input_path = folder / 'seizure_epilepsy.json'
try:
    with open(input_path, 'r', encoding='utf-8') as file:
        data = json.load(file)  # Parse the JSON data
    print(f"✔️ Loaded {len(data)} disease entries from OMIM")
except FileNotFoundError:
    print(f"🧱 The file {input_path.name} was not found.")
    raise  # stop here; `data` would be undefined below
except json.JSONDecodeError:
    print(f"🧱 The file {input_path.name} does not contain valid JSON.")
    raise  # stop here; `data` would be undefined below

print("📜 Example: ")
pprint(data[:1])

✔️ Loaded 1972 disease entries from OMIM
📜 Example: 
[{'clinicalSynopsis': ['Delayed development, variable severity, from birth in '
                       'some patients',
                       'Developmental regression in about 50% of patients',
                       'Normal development in some patients',
                       'Seizures, convulsive',
                       'Seizures, tonic-clonic',
                       'Seizures, partial',
                       'Seizures, absence',
                       'Seizures, atonic',
                       'Seizures, myoclonic',
                       'Status epilepticus',
                       'Autistic features',
                       'Aggression',
                       'Psychosis',
                       'Obsessive features',
                       'Carrier males show rigid personality',
                       'Carrier males show obsessive features',
                       'Carrier males show controlling and inflexible traits',
   

In [None]:
# Data Cleaning Step 1 - remove duplicates based on mimNumber
from collections import defaultdict

# 1) Group by exact mimNumber
groups = defaultdict(list)
missing_mim_idx = []

for i, row in enumerate(data):
    mim = row.get("mimNumber")
    if mim is None or mim == "":
        missing_mim_idx.append(i)
    else:
        groups[mim].append(i)

# 2) Summaries
n_rows = len(data)
n_groups = len(groups)
dup_groups = {m: idxs for m, idxs in groups.items() if len(idxs) > 1}
n_dup_groups = len(dup_groups)
n_dup_rows = sum(len(v) - 1 for v in dup_groups.values())

print(f"Total rows: {n_rows}")
print(f"Unique mimNumber (exact match): {n_groups}")
print(f"Groups with duplicates: {n_dup_groups}")
print(f"Duplicate rows to drop (keep first per mimNumber): {n_dup_rows}")
print(f"Rows missing mimNumber: {len(missing_mim_idx)}")

# Show a few duplicate groups
show_n = 5
if n_dup_groups:
    print("\nExamples of duplicate groups:")
    for mim, idxs in list(dup_groups.items())[:show_n]:
        titles = [data[i].get("preferredTitle") for i in idxs]
        print(f"  • mimNumber={mim} -> indices {idxs} | preferredTitle={titles}")

# 3) Build de-duplicated list in memory (keep first occurrence)
keep_idx = set(idxs[0] for idxs in groups.values())
data_dedup = [data[i] for i in sorted(keep_idx)]
print(f"\nAfter de-duplication: {len(data_dedup)} rows")

Total rows: 1972
Unique mimNumber (exact match): 1716
Groups with duplicates: 256
Duplicate rows to drop (keep first per mimNumber): 256
Rows missing mimNumber: 0

Examples of duplicate groups:
  • mimNumber=300088 -> indices [0, 414] | preferredTitle=['DEVELOPMENTAL AND EPILEPTIC ENCEPHALOPATHY 9; DEE9', 'DEVELOPMENTAL AND EPILEPTIC ENCEPHALOPATHY 9; DEE9']
  • mimNumber=300491 -> indices [1, 427] | preferredTitle=['EPILEPSY, X-LINKED 1, WITH VARIABLE LEARNING DISABILITIES AND BEHAVIOR DISORDERS; EPILX1', 'EPILEPSY, X-LINKED 1, WITH VARIABLE LEARNING DISABILITIES AND BEHAVIOR DISORDERS; EPILX1']
  • mimNumber=300607 -> indices [2, 484] | preferredTitle=['DEVELOPMENTAL AND EPILEPTIC ENCEPHALOPATHY 8; DEE8', 'DEVELOPMENTAL AND EPILEPTIC ENCEPHALOPATHY 8; DEE8']
  • mimNumber=300672 -> indices [3, 428] | preferredTitle=['DEVELOPMENTAL AND EPILEPTIC ENCEPHALOPATHY 2; DEE2', 'DEVELOPMENTAL AND EPILEPTIC ENCEPHALOPATHY 2; DEE2']
  • mimNumber=300884 -> indices [4, 486] | preferredTitle=['DE

In [None]:
# Data Cleaning Step 2 - remove null records
source = data_dedup if 'data_dedup' in globals() else data

def is_blank(x):
    """Return True if x is None/empty or only whitespace/empty elements."""
    if x is None:
        return True
    if isinstance(x, str):
        return not x.strip()
    if isinstance(x, (list, tuple, set)):
        return len(x) == 0 or all(is_blank(e) for e in x)
    if isinstance(x, dict):
        return len(x) == 0 or all(is_blank(v) for v in x.values())
    return False  # treat other types as non-blank

before = len(source)
kept = [row for row in source if not is_blank(row.get("clinicalSynopsis"))]
removed = before - len(kept)

print(f"Records before:  {before}")
print(f"Removed (blank clinicalSynopsis): {removed}")
print(f"Records after:   {len(kept)}")

# keep the result for next steps
data_nonnull = kept


Records before:  1716
Removed (blank clinicalSynopsis): 56
Records after:   1660


In [None]:
# Data Cleaning Step 3 - remove non-symptom terms, genetic information, or demographic data
# Terms containing these phrases will be completely EXCLUDED from clustering
del_list = [
    'see also',      # Cross-references, not symptoms
    'SIBS',          # Abbreviation for siblings
    'sibling',       # Family relationships
    'death',         # Outcomes, not symptoms
    'die',           # Outcomes, not symptoms
    'families',      # Demographic info
    'family',        # Demographic info
    'prevalence',    # Statistical info
    'deletion',      # Genetic mutations
    'mutation',      # Genetic mutations
    'increased frequency',  # Statistical info
    'translocation', # Genetic term
    'reported',      # Meta-information
    'report',        # Meta-information
    'clinical information',  # Too generic
    'incidence 1 in', # Statistical info
    'births',        # Demographic info
    'allelic',       # Genetic term
    'four major groups',  # Classification, not symptom
    'consanguineous', # Genetic/demographic term
    'x-inactivation', # Genetic term
    'patient b',     # Specific case reference
    '30% of cases',  # Statistical info
    'two types'      # Classification
]


# Phrases that will be REMOVED from terms (but term is kept if it has other content)
# These are filler phrases that don't change the semantic meaning
rep_list = [
    'some patients show',              # Filler phrase
    'patients may present with',       # Filler phrase
    'increased incidence in individuals',  # Filler phrase
    'may occur',                       # Filler phrase
    'some patients'                    # Filler phrase
]
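
The cleaning function in the next cell matches `del_list` phrases as substrings, so a short phrase such as `'die'` also hits words that merely contain it. The sketch below contrasts that rule with a stricter whole-word alternative. The example terms and both helper functions are hypothetical and only illustrate the difference; they are not the notebook's own filter.

```python
import re

# Hypothetical terms, chosen only to show how the two rules differ.
demo_terms = [
    "ketogenic diet responsive",
    "sudden death in infancy",
    "patients die early",
]
demo_phrases = ["die", "death"]

def substring_hit(term, phrases):
    """Substring rule: flag the term if any phrase occurs anywhere in it."""
    s = term.lower()
    return any(p.lower() in s for p in phrases)

def whole_word_hit(term, phrases):
    """Stricter alternative: phrases must match on word boundaries."""
    alternatives = "|".join(re.escape(p.lower()) for p in phrases)
    return re.search(r"\b(?:" + alternatives + r")\b", term.lower()) is not None

by_substring = [substring_hit(t, demo_phrases) for t in demo_terms]
by_whole_word = [whole_word_hit(t, demo_phrases) for t in demo_terms]
print(by_substring)   # "diet" is flagged because it contains "die"
print(by_whole_word)  # "diet" is kept under the whole-word rule
```

Which rule is preferable depends on how aggressive the filter should be; the substring rule removes more terms, including some that may be genuine symptoms.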

In [None]:
import re, string, unicodedata
from collections import Counter

def clean_clinical_synopsis_dataset(data, del_list, rep_list):
    """
    Input:
      - data: list[dict] with key 'clinicalSynopsis' (str | list[str] | dict)
      - del_list: phrases that cause a term to be dropped if contained (case-insensitive)
      - rep_list: phrases to be removed before cleaning (whole-word, case-insensitive)

    Output:
      - clean_data: list[dict] with cleaned, de-duplicated clinicalSynopsis (list[str])
      - summary: dict with simple counts and example drops
    """

    # punctuation handling: keep only what you want to preserve
    _PRESERVE = set("-/")
    _DROP_PUNCT = "".join(ch for ch in string.punctuation if ch not in _PRESERVE)
    _SPACE_RE = re.compile(r"\s+")

    # sanitize lists
    del_set = {w.lower().strip() for w in del_list if w and w.strip()}
    rep_set = {w.lower().strip() for w in rep_list if w and w.strip()}

    # compile whole-word removal pattern for rep_list
    _rep_pattern = None
    if rep_set:
        reps = sorted((re.escape(p) for p in rep_set), key=len, reverse=True)
        _rep_pattern = re.compile(r"\b(?:" + "|".join(reps) + r")\b", re.IGNORECASE)

    def clean_text(s: str) -> str:
        s = unicodedata.normalize("NFKC", s).strip().lower()
        if _rep_pattern:
            s = _rep_pattern.sub(" ", s)         # remove phrases (whole word)
        if _DROP_PUNCT:
            s = s.translate(str.maketrans("", "", _DROP_PUNCT))
        s = _SPACE_RE.sub(" ", s).strip()
        return s

    def should_delete_term(raw: str) -> bool:
        s = raw.lower()
        if s.startswith("caused by"):            # keep same rule from original
            return True
        return any(dw in s for dw in del_set)     # substring match

    # ---- iterate & clean ----
    reasons = Counter()
    examples = {"del_phrase": [], "too_short": [], "blank_after_clean": []}

    def to_list(x):
        if x is None:
            return []
        if isinstance(x, str):
            return [x]
        if isinstance(x, dict):
            # if dict, take values that are strings
            vals = []
            for v in x.values():
                if isinstance(v, str):
                    vals.append(v)
                elif isinstance(v, list):
                    vals.extend([t for t in v if isinstance(t, str)])
            return vals
        if isinstance(x, (list, tuple, set)):
            return [t for t in x if isinstance(t, str)]
        return []

    clean_data = []
    total_terms_before = 0
    total_terms_after  = 0

    for row in data:
        terms = to_list(row.get("clinicalSynopsis"))
        total_terms_before += len(terms)

        kept = []
        for t in terms:
            if should_delete_term(t):
                reasons["del_phrase_or_caused_by"] += 1
                if len(examples["del_phrase"]) < 5:
                    examples["del_phrase"].append(t)
                continue

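            ct = clean_text(t)
            # Terms of 3 characters or fewer are dropped here. This also catches
            # terms that are empty after cleaning, so the "blank_after_clean"
            # branch below is never reached.
            if len(ct) <= 3: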
                reasons["too_short_after_clean"] += 1
                if len(examples["too_short"]) < 5:
                    examples["too_short"].append((t, ct))
                continue

            if ct:   # non-blank
                kept.append(ct)
            else:
                reasons["blank_after_clean"] += 1
                if len(examples["blank_after_clean"]) < 5:
                    examples["blank_after_clean"].append(t)

        # de-duplicate within record preserving order
        # (terms removed here are not counted in `reasons`)
        seen = set()
        kept = [x for x in kept if not (x in seen or seen.add(x))]

        if kept:
            new_row = dict(row)
            new_row["clinicalSynopsis"] = kept
            clean_data.append(new_row)
            total_terms_after += len(kept)
        else:
            # drop records with no usable terms
            reasons["dropped_record_empty_clinicalSynopsis"] += 1

    summary = {
        "records_in": len(data),
        "records_out": len(clean_data),
        "terms_before": total_terms_before,
        "terms_after": total_terms_after,
        "reasons": dict(reasons),
        "examples": {k: v for k, v in examples.items() if v},
    }
    return clean_data, summary

clean_data, info = clean_clinical_synopsis_dataset(data_nonnull, del_list, rep_list)

def show_clean_summary(info, *, max_examples=3):
    def fmt(n):
        return f"{n:,}"

    rec_in   = info.get("records_in", 0)
    rec_out  = info.get("records_out", 0)
    rec_drop = rec_in - rec_out

    terms_in   = info.get("terms_before", 0)
    terms_out  = info.get("terms_after", 0)
    terms_drop = terms_in - terms_out

    reasons  = info.get("reasons", {}) or {}
    examples = info.get("examples", {}) or {}

    print("==== Clinical Synopsis Cleaning Summary ====")
    print(f"Records  : {fmt(rec_out)} / {fmt(rec_in)} kept  "
          f"({(rec_out/rec_in*100 if rec_in else 0):.1f}%);  "
          f"dropped {fmt(rec_drop)}")
    print(f"Terms    : {fmt(terms_out)} / {fmt(terms_in)} kept  "
          f"({(terms_out/terms_in*100 if terms_in else 0):.1f}%);  "
          f"removed {fmt(terms_drop)}")

    # Reasons table (sorted)
    if reasons:
        print("\nReasons (descending):")
        total_reason_count = sum(reasons.values())
        for k, v in sorted(reasons.items(), key=lambda kv: kv[1], reverse=True):
            share = (v / total_reason_count * 100) if total_reason_count else 0.0
            print(f"  • {k:<35} {fmt(v):>8}  ({share:5.1f}%)")

    # A few examples
    if examples:
        print("\nExamples:")
        for k, vals in examples.items():
            if not vals:
                continue
            print(f"  - {k}:")
            for i, ex in enumerate(vals[:max_examples], 1):
                # ex may be str or tuple
                if isinstance(ex, (list, tuple)):
                    ex_str = " | ".join(map(str, ex))
                else:
                    ex_str = str(ex)
                # Keep each example line compact
                if len(ex_str) > 140:
                    ex_str = ex_str[:137] + "…"
                print(f"      {i:>2}. {ex_str}")


# print results
show_clean_summary(info)

==== Clinical Synopsis Cleaning Summary ====
Records  : 1,660 / 1,660 kept  (100.0%);  dropped 0
Terms    : 49,898 / 53,403 kept  (93.4%);  removed 3,505

Reasons (descending):
  • del_phrase_or_caused_by                2,999  ( 99.8%)
  • too_short_after_clean                      6  (  0.2%)

Examples:
  - del_phrase:
       1. Caused by mutation in the protocadherin 19 gene
       2. Caused by mutation in the synapsin-1 gene
       3. Caused by mutation in the Rho guanine nucleotide exchange factor 9 gene
  - too_short:
       1. ADD | add
       2. ` | 
       3. ADD | add
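
To make the effect of the text cleaning concrete, here is a reduced, standalone sketch of the same steps (Unicode normalization, lowercasing, filler-phrase removal, punctuation stripping, whitespace collapsing). The name `mini_clean` and the shortened filler list are assumptions made for this illustration; the full logic lives in `clean_clinical_synopsis_dataset` above.

```python
import re
import string
import unicodedata

# Same punctuation policy as the cleaning cell: keep "-" and "/", drop the rest.
DROP_PUNCT = "".join(ch for ch in string.punctuation if ch not in "-/")

# A shortened filler list (subset of rep_list), longest phrase first.
FILLERS = sorted(["some patients show", "some patients"], key=len, reverse=True)
FILLER_RE = re.compile(
    r"\b(?:" + "|".join(re.escape(p) for p in FILLERS) + r")\b", re.IGNORECASE
)

def mini_clean(term):
    """Normalize, lowercase, strip fillers and punctuation, collapse spaces."""
    s = unicodedata.normalize("NFKC", term).strip().lower()
    s = FILLER_RE.sub(" ", s)
    s = s.translate(str.maketrans("", "", DROP_PUNCT))
    return re.sub(r"\s+", " ", s).strip()

cleaned = mini_clean("Delayed development, variable severity, from birth in some patients")
print(cleaned)
print(mini_clean("Seizures, tonic-clonic"))
```

Note that removing a filler phrase can leave a dangling word behind (the first example ends in "in"), which is visible in the cleaned terms later in this notebook.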


In [None]:
# Save the cleaned data
folder = Path('/content/drive/MyDrive/Colab Notebooks/CS229/Final Project/1-Embedding')
folder.mkdir(parents=True, exist_ok=True)          # ensure folder exists
output_path = folder / 'rawdata_clean.jsonl'

# Save each record on a separate line (JSON Lines format)
with open(output_path, "w", encoding="utf-8") as f:
    for record in clean_data:
        json.dump(record, f, ensure_ascii=False)
        f.write("\n")

print(f"✅ Saved {len(clean_data):,} records to {output_path}")

✅ Saved 1,660 records to /content/drive/MyDrive/Colab Notebooks/CS229/Final Project/Embedding/rawdata_clean.jsonl
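
A quick way to gain confidence in the JSON Lines format is a round trip: write records one per line, read them back, and compare. The sketch below does this with a temporary file; the sample records are invented and only mimic the shape of the cleaned data.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical records shaped like the cleaned data (values are made up).
sample = [
    {"mimNumber": 1, "preferredTitle": "EXAMPLE A", "clinicalSynopsis": ["seizures absence"]},
    {"mimNumber": 2, "preferredTitle": "EXAMPLE B", "clinicalSynopsis": ["status epilepticus"]},
]

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "sample.jsonl"
    # Write one JSON object per line.
    with path.open("w", encoding="utf-8") as f:
        for record in sample:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
    # Read the file back, skipping blank lines.
    with path.open("r", encoding="utf-8") as f:
        restored = [json.loads(line) for line in f if line.strip()]

print(restored == sample)
```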


## Step 2: Building Embeddings
- Download the `Qwen/Qwen3-Embedding-4B` model from Hugging Face with `sentence-transformers` (the code falls back to `Qwen/Qwen3-Embedding-0.6B` when no GPU is available); the 4B model converts each term into a 2,560-dimensional vector
- Load the preprocessed data (`rawdata_clean.jsonl`)
- Build the embeddings and save them to a file

Output:
✅ Saved 14,477 vectors to /content/drive/MyDrive/Colab Notebooks/CS229/Final Project/Embedding/qwen3_term_embeddings_4B.npz

In [None]:
from pathlib import Path
import json
from google.colab import drive

# Mount Drive
drive.mount('/content/drive', force_remount=True)

# Folder in Drive
folder = Path('/content/drive/MyDrive/Colab Notebooks/CS229/Final Project/1-Embedding')
folder.mkdir(parents=True, exist_ok=True)

# Build the file path
jsonl_path = folder / "rawdata_clean.jsonl"
print("JSONL path:", jsonl_path)
print("Exists?", jsonl_path.exists())

# Load JSONL (one JSON object per line)
records = []
if jsonl_path.exists():
    with jsonl_path.open('r', encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    print(f"Loaded {len(records):,} records")
    if records:
        print("First record keys:", list(records[0].keys())[:10])
else:
    print("⚠️ File not found. Double-check the path above.")

Mounted at /content/drive
JSONL path: /content/drive/MyDrive/Colab Notebooks/CS229/Final Project/Embedding/rawdata_clean.jsonl
Exists? True
Loaded 1,660 records
First record keys: ['mimNumber', 'preferredTitle', 'clinicalSynopsis']


In [None]:
# Collect clinical terms
all_terms = []
for rec in records:
    terms = rec.get("clinicalSynopsis") or []
    if isinstance(terms, list):
        all_terms.extend(t for t in terms if isinstance(t, str) and t.strip())

# Deduplicate while preserving order
seen = set()
unique_terms = [t for t in all_terms if not (t in seen or seen.add(t))]
print(f"Unique terms to embed: {len(unique_terms):,}")
print("Examples:", unique_terms[:5])

Unique terms to embed: 14,477
Examples: ['delayed development variable severity from birth in', 'developmental regression in about 50 of patients', 'normal development in', 'seizures convulsive', 'seizures tonic-clonic']
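
An equivalent way to de-duplicate while preserving order is `dict.fromkeys`, since dictionaries keep insertion order. The sketch below uses a made-up term list and also builds a term-to-row lookup, which may be handy later because each embedding row is aligned with a term's position; the variable names are illustrative.

```python
# Made-up terms with one repeat, standing in for all_terms.
all_terms_demo = ["seizures absence", "status epilepticus", "seizures absence", "psychosis"]

# dict keys are unique and keep first-seen order.
unique_demo = list(dict.fromkeys(all_terms_demo))

# Map each term to the row index its embedding would occupy.
term_to_row = {term: row for row, term in enumerate(unique_demo)}

print(unique_demo)
print(term_to_row["psychosis"])
```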


In [None]:
# ✅ Install PyTorch (CUDA 12.1 wheels) + sentence-transformers
# Works great on A100 / L4 in Colab. The CUDA wheels also run on CPU-only runtimes.
# sentence-transformers is installed in a separate command because it is hosted
# on PyPI, not on the PyTorch wheel index.
!pip -q install --index-url https://download.pytorch.org/whl/cu121 \
    torch torchvision torchaudio
!pip -q install sentence-transformers

# Quick sanity check
import torch
print("Torch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))

Torch: 2.8.0+cu126
CUDA available: True
GPU: NVIDIA A100-SXM4-80GB


In [None]:
import torch
from sentence_transformers import SentenceTransformer
import numpy as np

# Choose model automatically
use_gpu = torch.cuda.is_available()
device = "cuda" if use_gpu else "cpu"
model_id = "Qwen/Qwen3-Embedding-4B" if use_gpu else "Qwen/Qwen3-Embedding-0.6B"

print(f"Loading {model_id} on {device} ...")
model = SentenceTransformer(model_id, device=device)

# Optional speed tips on GPU
if use_gpu:
    torch.set_float32_matmul_precision("high")
    torch.backends.cudnn.benchmark = True

# Encode (tune batch_size if you hit OOM)
batch_size = 128 if use_gpu else 16
embs = model.encode(
    unique_terms,
    batch_size=batch_size,
    normalize_embeddings=True,
    show_progress_bar=True,
    convert_to_numpy=True,
)
print("Embeddings shape:", embs.shape)

# Save NPZ
out_path = folder / f"qwen3_term_embeddings_{'4B' if use_gpu else '0p6B'}.npz"

np.savez_compressed(
    out_path,
    embeddings=np.asarray(embs, dtype=np.float32),   # (N, D)
    terms=np.array(unique_terms, dtype=object),      # list[str]
)

print(f"✅ Saved {embs.shape[0]:,} vectors to {out_path}")

Loading Qwen/Qwen3-Embedding-4B on cuda ...


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Embeddings shape: (14477, 2560)
✅ Saved 14,477 vectors to /content/drive/MyDrive/Colab Notebooks/CS229/Final Project/Embedding/qwen3_term_embeddings_full_4b.npz
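
The cell above stores `terms` as an object array, which is why loading the archive later needs `allow_pickle=True`. As a possible alternative, the sketch below saves the terms as a fixed-width Unicode array, which can be loaded without pickling. The toy vectors, terms, and file name are assumptions for illustration only.

```python
import tempfile
from pathlib import Path

import numpy as np

# Toy data standing in for the real terms and embeddings.
terms_demo = ["seizures absence", "status epilepticus"]
vecs_demo = np.array([[0.6, 0.8], [1.0, 0.0]], dtype=np.float32)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "demo_embeddings.npz"
    # np.array on a list of str yields a Unicode array, so no pickle is involved.
    np.savez_compressed(path, embeddings=vecs_demo, terms=np.array(terms_demo))
    with np.load(path) as archive:  # allow_pickle is False by default
        E_demo = archive["embeddings"]
        loaded_terms = archive["terms"].tolist()

print(E_demo.shape, loaded_terms)
```

The trade-off is that a Unicode array pads every entry to the length of the longest term, so the uncompressed array can be larger than an object array.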


## Step 3: Sanity check

In [None]:
# Load the embedding file
from google.colab import drive
from pathlib import Path
drive.mount('/content/drive', force_remount=True)
folder = Path('/content/drive/MyDrive/Colab Notebooks/CS229/Final Project/1-Embedding')
folder.mkdir(parents=True, exist_ok=True)

import numpy as np
emb_path = folder / "qwen3_term_embeddings_4B.npz"

data = np.load(emb_path, allow_pickle=True)
E = data["embeddings"]          # shape: (N, D)
terms = data["terms"].tolist()  # list[str], length N

print("Loaded:", E.shape)

Loaded: (14477, 2560)


In [None]:
# 1) Basic alignment check
print(len(terms), E.shape)            # len(terms) should equal E.shape[0] (14477)
assert len(terms) == E.shape[0]

# 2) Inspect one vector (first 8 numbers only)
print("First vector (8 vals):", E[0][:8])

# 3) Check dtype and any NaNs/Infs
print("dtype:", E.dtype)
print("has_nan:", np.isnan(E).any(), "has_inf:", np.isinf(E).any())

# 4) If you encoded with normalize_embeddings=True, norms should be ~1.0
norms = np.linalg.norm(E, axis=1)
print("Norms: min/mean/max =", norms.min(), norms.mean(), norms.max())

# 5) Quick cosine similarity among the first few terms
def cos_sim_matrix(X):
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    return Xn @ Xn.T

k = 5
S = cos_sim_matrix(E[:k])
print("Cosine similarity among first", k, "terms:\n", np.round(S, 3))
print("Terms:", terms[:k])

14477 (14477, 2560)
First vector (8 vals): [-0.0002141   0.03261214  0.02332739 -0.0246252  -0.00113236  0.0584226
  0.03064026 -0.00472033]
dtype: float32
has_nan: False has_inf: False
Norms: min/mean/max = 0.9999999 1.0 1.0000001
Cosine similarity among first 5 terms:
 [[1.    0.64  0.732 0.573 0.535]
 [0.64  1.    0.627 0.555 0.548]
 [0.732 0.627 1.    0.545 0.515]
 [0.573 0.555 0.545 1.    0.903]
 [0.535 0.548 0.515 0.903 1.   ]]
Terms: ['delayed development variable severity from birth in', 'developmental regression in about 50 of patients', 'normal development in', 'seizures convulsive', 'seizures tonic-clonic']


In [None]:
# Utilities that may be useful in later steps
import numpy as np

# 1) Pair terms with their vectors (peek at the first 3)
pair_view = list(zip(terms[:3], E[:3]))
print(pair_view[0][0], pair_view[0][1][:6])  # term, first 6 dims

# 2) Cosine similarity
def cos_sim(a, b):
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return a @ b.T

# Similarity between the first 3 terms (3x3). Avoid cos_sim(E, E) on the full
# matrix: a 14477 x 14477 float32 result needs roughly 0.8 GB.
S = cos_sim(E[:3], E[:3])
print(np.round(S, 3))

# 3) Find nearest neighbors for a term
query = "epileptic encephalopathy"
if query in terms:
    i = terms.index(query)
    scores = cos_sim(E[i:i+1], E).ravel()
    rank = np.argsort(-scores)
    for idx in rank[:3]:  # the top hit is normally the query itself
        print(f"{terms[idx]:<30}  {scores[idx]:.3f}")
else:
    print(f"'{query}' is not among the embedded terms")

# 4) Append more later (keeps your file)
# After you encode more terms -> new_terms (list[str]), new_vecs (array of shape (M, D))
if "new_terms" in globals() and "new_vecs" in globals():
    E = np.vstack([E, new_vecs.astype(np.float32)])
    terms.extend(new_terms)
    np.savez_compressed(emb_path, embeddings=E, terms=np.array(terms, dtype=object))
