[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Hawksight-AI/semantica/blob/main/cookbook/use_cases/biomedical/01_Drug_Discovery_Pipeline.ipynb)

# Drug Discovery Pipeline - Vector Similarity Search

## Overview

This notebook builds a **complete drug discovery pipeline** from individual Semantica modules rather than the core orchestrator. The pipeline ingests biomedical literature, extracts drugs, proteins, and their relationships, and then combines vector similarity search with a knowledge graph to predict and explore drug-target interactions.

### Key Features

- **Modular Architecture**: Uses Semantica modules directly (`NERExtractor`, `GraphBuilder`, `EmbeddingGenerator`, `VectorStore`)
- **Multiple Data Sources**: Ingests 26 RSS feeds (15 PubMed searches, 4 preprint-server feeds, and 7 journal feeds)
- **Vector Similarity Search**: Emphasizes embeddings and vector similarity for drug-target interaction prediction
- **Entity Extraction**: Extracts drug compounds, proteins, targets, enzymes, and receptors
- **Knowledge Graph**: Builds structured drug-target relationship graphs
- **GraphRAG**: Hybrid vector + graph retrieval for enhanced querying

### What You'll Learn

- How to use Semantica modules directly (avoiding the core orchestrator)
- How to ingest biomedical data from multiple sources
- How to extract entities using `NERExtractor`
- How to extract relationships using `RelationExtractor`
- How to generate embeddings with `EmbeddingGenerator`
- How to build knowledge graphs with `GraphBuilder`
- How to perform similarity search with `VectorStore`
- How to use GraphRAG with `AgentContext` for hybrid retrieval

### Pipeline Flow

```mermaid
graph LR
    A[Data Ingestion] --> B[Text Processing]
    B --> C[Entity Extraction]
    C --> D[Relationship Extraction]
    D --> E[Deduplication]
    E --> F[Embedding Generation]
    F --> G[Vector Store]
    G --> H[Knowledge Graph]
    H --> I[Similarity Search]
    H --> J[GraphRAG Queries]
    I --> K[Visualization]
    J --> K
```

### Data Sources

**PubMed RSS Feeds:**
- Drug Discovery, Drug Target Interaction, Pharmacokinetics, Pharmacodynamics
- Clinical Trials, Protein Targets, Drug Repurposing, Molecular Docking
- ADME, Drug Metabolism, Drug Safety, Precision Medicine
- Biomarkers, Drug Resistance, Combinatorial Therapy

**Preprint Servers:**
- BioRxiv (Pharmacology & Toxicology, Drug Discovery)
- MedRxiv (Clinical Trials)
- ChemRxiv

**Journal RSS Feeds:**
- Nature (Drug Discovery, Pharmacology)
- Science Translational Medicine
- Cell Chemical Biology
- Journal of Medicinal Chemistry
- Drug Discovery Today
- Trends in Pharmacological Sciences

**Database Links (for reference):**
- [ChEMBL](https://www.ebi.ac.uk/chembl/) - Bioactive molecules
- [DrugBank](https://go.drugbank.com/) - Drug and drug target database
- [PubChem](https://pubchem.ncbi.nlm.nih.gov/) - Chemical compound database
- [UniProt](https://www.uniprot.org/) - Protein sequence database
- [ClinicalTrials.gov](https://clinicaltrials.gov/) - Clinical trial registry
- [FDA Drug Approvals](https://www.fda.gov/drugs/drug-approvals-and-databases) - FDA drug information
- [WHO Drug Information](https://www.who.int/medicines/publications/druginformation/en/) - WHO drug resources

---


## Installation

Install Semantica and required dependencies:


In [None]:
%pip install -qU semantica networkx matplotlib plotly pandas faiss-cpu beautifulsoup4 groq sentence-transformers scikit-learn


## Configuration & Setup

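Set the Groq API key and the constants shared by later cells. `EMBEDDING_DIMENSION` must match the output size of `EMBEDDING_MODEL`, which is 384 for `all-MiniLM-L6-v2`.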


In [None]:
import os

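# Read the Groq API key from the environment. Never commit a real key to a notebook;
# the fallback below is a placeholder that you must replace with your own key.
os.environ["GROQ_API_KEY"] = os.getenv("GROQ_API_KEY", "<your-groq-api-key>")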


In [None]:
EMBEDDING_DIMENSION = 384
EMBEDDING_MODEL = "all-MiniLM-L6-v2"
CHUNK_SIZE = 1000
CHUNK_OVERLAP = 200


## Ingesting Biomedical Data from Multiple Sources

Ingest items from the PubMed, preprint, and journal RSS feeds listed above. Feeds that fail to load are skipped, and if nothing is ingested the cell falls back to a small built-in sample of drug descriptions.


In [None]:
from semantica.ingest import FeedIngestor, FileIngestor
import os
from contextlib import redirect_stderr
from io import StringIO

os.makedirs("data", exist_ok=True)

feed_sources = [
    # PubMed RSS Feeds
    ("PubMed - Drug Discovery", "https://pubmed.ncbi.nlm.nih.gov/rss/search/1?term=drug+discovery&limit=15&sort=pub_date&fc=article_type"),
    ("PubMed - Drug Target Interaction", "https://pubmed.ncbi.nlm.nih.gov/rss/search/1?term=drug+target+interaction&limit=15&sort=pub_date"),
    ("PubMed - Pharmacokinetics", "https://pubmed.ncbi.nlm.nih.gov/rss/search/1?term=pharmacokinetics&limit=15&sort=pub_date"),
    ("PubMed - Pharmacodynamics", "https://pubmed.ncbi.nlm.nih.gov/rss/search/1?term=pharmacodynamics&limit=15&sort=pub_date"),
    ("PubMed - Clinical Trial", "https://pubmed.ncbi.nlm.nih.gov/rss/search/1?term=clinical+trial&limit=15&sort=pub_date"),
    ("PubMed - Protein Target", "https://pubmed.ncbi.nlm.nih.gov/rss/search/1?term=protein+target&limit=15&sort=pub_date"),
    ("PubMed - Drug Repurposing", "https://pubmed.ncbi.nlm.nih.gov/rss/search/1?term=drug+repurposing&limit=15&sort=pub_date"),
    ("PubMed - Molecular Docking", "https://pubmed.ncbi.nlm.nih.gov/rss/search/1?term=molecular+docking&limit=15&sort=pub_date"),
    ("PubMed - ADME", "https://pubmed.ncbi.nlm.nih.gov/rss/search/1?term=ADME&limit=15&sort=pub_date"),
    ("PubMed - Drug Metabolism", "https://pubmed.ncbi.nlm.nih.gov/rss/search/1?term=drug+metabolism&limit=15&sort=pub_date"),
    ("PubMed - Drug Safety", "https://pubmed.ncbi.nlm.nih.gov/rss/search/1?term=drug+safety&limit=15&sort=pub_date"),
    ("PubMed - Precision Medicine", "https://pubmed.ncbi.nlm.nih.gov/rss/search/1?term=precision+medicine&limit=15&sort=pub_date"),
    ("PubMed - Biomarkers", "https://pubmed.ncbi.nlm.nih.gov/rss/search/1?term=biomarkers&limit=15&sort=pub_date"),
    ("PubMed - Drug Resistance", "https://pubmed.ncbi.nlm.nih.gov/rss/search/1?term=drug+resistance&limit=15&sort=pub_date"),
    ("PubMed - Combinatorial Therapy", "https://pubmed.ncbi.nlm.nih.gov/rss/search/1?term=combinatorial+therapy&limit=15&sort=pub_date"),
    
    # Preprint Servers
    ("BioRxiv - Pharmacology", "https://connect.biorxiv.org/biorxiv_xml.php?subject=pharmacology_and_toxicology"),
    ("BioRxiv - Drug Discovery", "https://connect.biorxiv.org/biorxiv_xml.php?subject=drug_discovery"),
    ("MedRxiv - Clinical Trials", "https://connect.medrxiv.org/medrxiv_xml.php?subject=clinical_trials"),
    ("ChemRxiv", "https://chemrxiv.org/engage/chemrxiv/rss.xml"),
    
    # Journal RSS Feeds
    ("Nature - Drug Discovery", "https://www.nature.com/subjects/drug-discovery.rss"),
    ("Nature - Pharmacology", "https://www.nature.com/subjects/pharmacology.rss"),
    ("Science Translational Medicine", "https://www.science.org/action/showFeed?type=etoc&feed=rss&jc=scitransmed"),
    ("Cell Chemical Biology", "https://www.cell.com/chemical-biology.rss"),
    ("Journal of Medicinal Chemistry", "https://pubs.acs.org/action/showFeed?type=etoc&feed=rss&jc=jmcmar"),
    ("Drug Discovery Today", "https://www.sciencedirect.com/journal/drug-discovery-today.rss"),
    ("Trends in Pharmacological Sciences", "https://www.cell.com/trends/pharmacological-sciences.rss"),
]

feed_ingestor = FeedIngestor()
all_documents = []

for feed_name, feed_url in feed_sources:
    try:
        with redirect_stderr(StringIO()):
            feed_data = feed_ingestor.ingest_feed(feed_url, validate=False)
        
        for item in feed_data.items:
            if not item.content:
                item.content = item.description or item.title or ""
            if item.content:
                # Also covers items whose metadata attribute exists but is None.
                if not getattr(item, 'metadata', None):
                    item.metadata = {}
                item.metadata['source'] = feed_name
                all_documents.append(item)
    except Exception:
        continue

if not all_documents:
    sample_drug_data = """
    Aspirin (acetylsalicylic acid) is a medication used to reduce pain, fever, or inflammation. 
    It targets cyclooxygenase enzymes COX-1 and COX-2. Aspirin is commonly used for cardiovascular protection.
    Ibuprofen is a nonsteroidal anti-inflammatory drug (NSAID) that targets COX-1 and COX-2 enzymes.
    Metformin is an antidiabetic medication that targets AMP-activated protein kinase (AMPK).
    Insulin targets the insulin receptor (INSR) to regulate glucose metabolism.
    Warfarin is an anticoagulant that targets vitamin K epoxide reductase complex subunit 1 (VKORC1).
    Atorvastatin is a statin medication that targets HMG-CoA reductase.
    """
    
    with open("data/sample_drugs.txt", "w") as f:
        f.write(sample_drug_data)
    
    file_ingestor = FileIngestor()
    all_documents = file_ingestor.ingest("data/sample_drugs.txt")

documents = all_documents
print(f"Ingested {len(documents)} documents")
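
The fallback used in the loop above (use the item's body text if present, otherwise its title, and skip items with no text) can be tried without network access. The following is a minimal sketch using only the standard library. The inline feed and the helper name `items_with_text` are made up for illustration, and the code does not reflect how `FeedIngestor` is implemented.

```python
import xml.etree.ElementTree as ET

# Hypothetical inline RSS feed, used only for illustration.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item>
      <title>Aspirin and COX-1</title>
      <description>Aspirin irreversibly inhibits COX-1.</description>
    </item>
    <item>
      <title>Title-only item about metformin</title>
    </item>
    <item>
      <description>   </description>
    </item>
  </channel>
</rss>"""


def items_with_text(rss_xml, source):
    """Return one dict per feed item that has usable text."""
    root = ET.fromstring(rss_xml)
    documents = []
    for item in root.iter("item"):
        title = (item.findtext("title") or "").strip()
        description = (item.findtext("description") or "").strip()
        # Prefer the description, fall back to the title, skip empty items.
        content = description or title
        if content:
            documents.append({"content": content, "metadata": {"source": source}})
    return documents


docs = items_with_text(SAMPLE_RSS, source="Example feed")
for doc in docs:
    print(doc["metadata"]["source"], "|", doc["content"])
```

Tagging each document with its feed name at ingestion time, as the notebook does, makes it possible to trace an extracted entity back to its source later.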


## Normalizing and Chunking Documents

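Clean and normalize the text, then split it into chunks with entity-aware chunking so that drug and protein names are not cut at chunk boundaries. If entity-aware splitting fails for a document, the cell falls back to recursive splitting with the same chunk size and overlap.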


In [None]:
from semantica.normalize import TextNormalizer
from semantica.split import TextSplitter

normalizer = TextNormalizer()
splitter = TextSplitter(
    method="entity_aware",
    ner_method="spacy",
    chunk_size=CHUNK_SIZE,
    chunk_overlap=CHUNK_OVERLAP
)

normalized_documents = []
for doc in documents:
    normalized_text = normalizer.normalize(
        doc.content if hasattr(doc, 'content') else str(doc),
        clean_html=True,
        normalize_entities=True,
        remove_extra_whitespace=True,
        lowercase=False
    )
    normalized_documents.append(normalized_text)

chunked_documents = []
for doc_text in normalized_documents:
    try:
        with redirect_stderr(StringIO()):
            chunks = splitter.split(doc_text)
        chunked_documents.extend(chunks)
    except Exception:
        simple_splitter = TextSplitter(method="recursive", chunk_size=CHUNK_SIZE, chunk_overlap=CHUNK_OVERLAP)
        chunks = simple_splitter.split(doc_text)
        chunked_documents.extend(chunks)
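
To see how `CHUNK_SIZE` and `CHUNK_OVERLAP` interact, here is a minimal sketch of plain sliding-window chunking. It assumes sizes are measured in characters, which may not match how `TextSplitter` counts them, and it ignores entity boundaries entirely. The function `chunk_text` is hypothetical and is not part of Semantica.

```python
def chunk_text(text, chunk_size=1000, chunk_overlap=200):
    """Split text into fixed-size character windows that overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")

    step = chunk_size - chunk_overlap  # each new chunk starts this far after the previous one
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this chunk already reaches the end of the text
        start += step
    return chunks


# A 2500-character dummy document.
text = "".join(str(i % 10) for i in range(2500))
chunks = chunk_text(text)
print([len(c) for c in chunks])
```

With the notebook's settings each chunk starts 800 characters after the previous one, so the last 200 characters of one chunk are repeated at the start of the next. The repetition gives an entity mention near a boundary a chance to appear whole in at least one chunk.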


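## Extracting Drug and Protein Entities

Extract drug, compound, protein, target, enzyme, and receptor entities from each chunk with an LLM-backed `NERExtractor`.


In [None]: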
from semantica.semantic_extract import NERExtractor

entity_extractor = NERExtractor(
    method="llm",
    provider="groq",
    llm_model="llama-3.1-8b-instant",
    temperature=0.0
)

all_entities = []
for chunk in chunked_documents:
    chunk_text = chunk.text if hasattr(chunk, 'text') else str(chunk)
    try:
        entities = entity_extractor.extract_entities(
            chunk_text,
            entity_types=["Drug", "Protein", "Target", "Compound", "Enzyme", "Receptor"]
        )
        all_entities.extend(entities)
    except Exception:
        continue

drugs = [e for e in all_entities if e.label in ["Drug", "Compound"] or "drug" in e.label.lower()]
proteins = [e for e in all_entities if e.label in ["Protein", "Target", "Enzyme", "Receptor"] or "protein" in e.label.lower()]

print(f"Extracted {len(drugs)} drugs and {len(proteins)} proteins")


## Extracting Drug-Target Relationships

Extract relationships between drugs and proteins to understand drug-target interactions.


In [None]:
from semantica.semantic_extract import RelationExtractor

relation_extractor = RelationExtractor(
    method="llm",
    provider="groq",
    llm_model="llama-3.1-8b-instant",
    temperature=0.0
)

all_relationships = []
for chunk in chunked_documents:
    chunk_text = chunk.text if hasattr(chunk, 'text') else str(chunk)
    try:
        relationships = relation_extractor.extract_relations(
            chunk_text,
            entities=all_entities,
            relation_types=["targets", "inhibits", "activates", "binds_to", "interacts_with"]
        )
        all_relationships.extend(relationships)
    except Exception:
        continue

print(f"Extracted {len(all_relationships)} relationships")
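
LLM-based extractors can return relation types outside the requested list, or relations whose endpoints were never extracted as entities. The following is a minimal sketch of a post-filter, not the library's behavior. The `Relation` dataclass is a hypothetical stand-in that assumes string `source` and `target` fields, and `filter_relations` and `min_confidence` are illustrative names.

```python
from dataclasses import dataclass

ALLOWED_RELATIONS = {"targets", "inhibits", "activates", "binds_to", "interacts_with"}


@dataclass
class Relation:
    """Hypothetical stand-in for an extracted relationship."""
    source: str
    target: str
    label: str
    confidence: float


def filter_relations(relations, known_entities, min_confidence=0.5):
    """Keep relations with an allowed type, known endpoints, and enough confidence."""
    known = {name.lower() for name in known_entities}
    kept = []
    for rel in relations:
        if rel.label not in ALLOWED_RELATIONS:
            continue  # relation type was not requested
        if rel.source.lower() not in known or rel.target.lower() not in known:
            continue  # an endpoint was never extracted as an entity
        if rel.confidence < min_confidence:
            continue
        kept.append(rel)
    return kept


relations = [
    Relation("Aspirin", "COX-1", "targets", 0.95),
    Relation("Aspirin", "fever", "treats", 0.90),      # type not in the allowed set
    Relation("Metformin", "AMPK", "activates", 0.30),  # below the confidence cutoff
    Relation("Warfarin", "VKORC1", "inhibits", 0.80),
]
entities = ["Aspirin", "COX-1", "Metformin", "AMPK", "Warfarin", "VKORC1"]
kept = filter_relations(relations, entities)
print([(r.source, r.label, r.target) for r in kept])
```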


## Resolving Duplicate Entities

Detect and merge duplicate entities to ensure data quality and consistency.


In [None]:
from semantica.deduplication import DuplicateDetector

duplicate_detector = DuplicateDetector(
    similarity_threshold=0.85,
    method="semantic"
)

deduplicated_entities = duplicate_detector.detect_duplicates(all_entities)
merged_entities = duplicate_detector.merge_duplicates(deduplicated_entities)

print(f"Deduplicated {len(all_entities)} entities to {len(merged_entities)} unique entities")
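
The effect of a 0.85 similarity threshold can be illustrated with a small stand-in. The sketch below swaps semantic similarity for character-level similarity from the standard library's `difflib`, so it is not what `DuplicateDetector` computes with `method="semantic"`. The function `merge_near_duplicates` is hypothetical.

```python
from difflib import SequenceMatcher


def merge_near_duplicates(names, threshold=0.85):
    """Greedy clustering: a name joins the first cluster whose first member is similar enough."""
    clusters = []
    for name in names:
        for cluster in clusters:
            ratio = SequenceMatcher(None, name.lower(), cluster[0].lower()).ratio()
            if ratio >= threshold:
                cluster.append(name)
                break
        else:
            # No existing cluster matched, so this name starts a new one.
            clusters.append([name])
    return clusters


names = ["Aspirin", "aspirin", "Ibuprofen", "COX-2", "COX2", "COX-1"]
clusters = merge_near_duplicates(names)
for cluster in clusters:
    print(cluster)
```

In this example the spelling variants `COX-2` and `COX2` are merged, while `COX-1` stays separate because its similarity to `COX-2` falls just below the threshold. A lower threshold would merge two distinct enzymes, which is why the cutoff deserves a check against a few known pairs before it is trusted.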


## Generating Vector Embeddings

Generate embeddings for drugs and proteins to enable similarity search.


In [None]:
from semantica.embeddings import EmbeddingGenerator
from semantica.vector_store import VectorStore

embedding_gen = EmbeddingGenerator(
    provider="sentence_transformers",
    model=EMBEDDING_MODEL
)

vector_store = VectorStore(backend="faiss", dimension=EMBEDDING_DIMENSION)

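# Re-derive the drug and protein lists from the deduplicated entities so that
# near-duplicate names are not embedded and stored twice.
drugs = [e for e in merged_entities if e.label in ["Drug", "Compound"] or "drug" in e.label.lower()]
proteins = [e for e in merged_entities if e.label in ["Protein", "Target", "Enzyme", "Receptor"] or "protein" in e.label.lower()]

drug_texts = [f"{d.text} {getattr(d, 'description', '')}" for d in drugs]
drug_embeddings = embedding_gen.generate_embeddings(drug_texts)

protein_texts = [f"{p.text} {getattr(p, 'description', '')}" for p in proteins]
protein_embeddings = embedding_gen.generate_embeddings(protein_texts)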

print(f"Generated {len(drug_embeddings)} drug embeddings and {len(protein_embeddings)} protein embeddings")


## Populating Vector Database

Store drug and protein embeddings in the vector database with metadata for efficient similarity search.


In [None]:
drug_ids = vector_store.store_vectors(
    vectors=drug_embeddings,
    metadata=[{"type": "drug", "name": d.text, "label": d.label} for d in drugs]
)

protein_ids = vector_store.store_vectors(
    vectors=protein_embeddings,
    metadata=[{"type": "protein", "name": p.text, "label": p.label} for p in proteins]
)

print(f"Stored {len(drug_ids)} drug vectors and {len(protein_ids)} protein vectors")
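
Conceptually, a vector store pairs each embedding with its metadata and ranks stored vectors against a query. The following brute-force sketch illustrates that idea in pure Python with cosine similarity. `TinyVectorStore` is a hypothetical class that borrows the method names used above. It is not the Semantica or FAISS implementation, and the scoring metric of the real backend may differ.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class TinyVectorStore:
    """Hypothetical brute-force store: keeps vectors in a list and scans them all on search."""

    def __init__(self, dimension):
        self.dimension = dimension
        self._vectors = []
        self._metadata = []

    def store_vectors(self, vectors, metadata):
        if len(vectors) != len(metadata):
            raise ValueError("vectors and metadata must have the same length")
        ids = []
        for vector, meta in zip(vectors, metadata):
            if len(vector) != self.dimension:
                raise ValueError(f"expected {self.dimension} dimensions, got {len(vector)}")
            ids.append(len(self._vectors))  # the id is the insertion position
            self._vectors.append(list(vector))
            self._metadata.append(meta)
        return ids

    def search_vectors(self, query, k=5):
        scored = [
            {"id": i, "score": cosine(query, vector), "metadata": meta}
            for i, (vector, meta) in enumerate(zip(self._vectors, self._metadata))
        ]
        scored.sort(key=lambda result: result["score"], reverse=True)
        return scored[:k]


# Toy 3-dimensional vectors; real embeddings here would have 384 dimensions.
store = TinyVectorStore(dimension=3)
ids = store.store_vectors(
    vectors=[[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 0.0, 1.0]],
    metadata=[
        {"type": "drug", "name": "Aspirin"},
        {"type": "drug", "name": "Ibuprofen"},
        {"type": "protein", "name": "INSR"},
    ],
)
hits = store.search_vectors([1.0, 0.0, 0.0], k=2)
print([(hit["metadata"]["name"], round(hit["score"], 3)) for hit in hits])
```

Because drug and protein vectors share one index, a search can return either kind. The `type` field in the metadata is what allows results to be filtered afterwards.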


## Building Drug-Target Knowledge Graph

Construct a knowledge graph from extracted entities and relationships to enable graph-based reasoning.


In [None]:
from semantica.kg import GraphBuilder

graph_builder = GraphBuilder(
    merge_entities=True,
    resolve_conflicts=True,
    entity_resolution_strategy="fuzzy"
)

kg_sources = [{
    "entities": [{"text": e.text, "type": e.label, "confidence": e.confidence} for e in merged_entities],
    "relationships": [{"source": r.source, "target": r.target, "type": r.label, "confidence": r.confidence} for r in all_relationships]
}]

kg = graph_builder.build(kg_sources)

entities_count = len(kg.get('entities', []))
relationships_count = len(kg.get('relationships', []))
print(f"Graph: {entities_count} entities, {relationships_count} relationships")
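
Graph-based reasoning over drug-target triples amounts to following edges in either direction. The sketch below uses the facts from the built-in sample text and plain dictionaries. The helper names are hypothetical, and the structure is not the one `GraphBuilder` returns.

```python
from collections import defaultdict

# (source, relation, target) triples taken from the sample drug descriptions.
triples = [
    ("Aspirin", "targets", "COX-1"),
    ("Aspirin", "targets", "COX-2"),
    ("Ibuprofen", "targets", "COX-1"),
    ("Ibuprofen", "targets", "COX-2"),
    ("Metformin", "targets", "AMPK"),
    ("Warfarin", "targets", "VKORC1"),
]


def build_index(triples):
    """Index edges by source and by target so both directions can be queried."""
    outgoing = defaultdict(list)
    incoming = defaultdict(list)
    for source, relation, target in triples:
        outgoing[source].append((relation, target))
        incoming[target].append((relation, source))
    return outgoing, incoming


def drugs_targeting(incoming, protein):
    """All sources with a 'targets' edge into the given protein."""
    return sorted(src for rel, src in incoming.get(protein, []) if rel == "targets")


def shared_targets(outgoing, drug_a, drug_b):
    """Targets that two drugs have in common."""
    targets_a = {tgt for rel, tgt in outgoing.get(drug_a, []) if rel == "targets"}
    targets_b = {tgt for rel, tgt in outgoing.get(drug_b, []) if rel == "targets"}
    return sorted(targets_a & targets_b)


outgoing, incoming = build_index(triples)
print(drugs_targeting(incoming, "COX-1"))
print(shared_targets(outgoing, "Aspirin", "Ibuprofen"))
```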


## Finding Similar Drugs via Vector Search

Use vector similarity search to find drugs similar to a query drug based on their embeddings.


In [None]:
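query_drug = "Aspirin"
query_embedding = embedding_gen.generate_embeddings([query_drug])[0]

# The store holds drug and protein vectors together, so over-fetch, keep only drugs, and drop the query itself.
candidates = vector_store.search_vectors(query_embedding, k=25)
similar_drugs = [
    r for r in candidates
    if r['metadata'].get('type') == 'drug'
    and r['metadata'].get('name', '').lower() != query_drug.lower()
][:5]

print(f"Drugs similar to '{query_drug}':")
for i, result in enumerate(similar_drugs, 1):
    print(f"{i}. {result['metadata'].get('name', 'Unknown')} (similarity: {result['score']:.3f})")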


## GraphRAG: Hybrid Vector + Graph Retrieval

Use GraphRAG to combine vector similarity search with knowledge graph traversal for enhanced retrieval and reasoning.


In [None]:
from semantica.context import AgentContext

context = AgentContext(vector_store=vector_store, knowledge_graph=kg)

query = "What drugs target COX enzymes?"
results = context.retrieve(
    query,
    max_results=10,
    use_graph=True,
    expand_graph=True,
    include_entities=True,
    include_relationships=True
)

print(f"GraphRAG query: '{query}'")
print(f"\nRetrieved {len(results)} results:\n")
for i, result in enumerate(results[:5], 1):
    print(f"{i}. Score: {result.get('score', 0):.3f}")
    print(f"   Content: {result.get('content', '')[:200]}...")
    if result.get('related_entities'):
        print(f"   Related entities: {len(result['related_entities'])}")
    print()
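
One way to picture hybrid retrieval is a two-step process: take the top vector hits, then expand each hit by one hop in the graph and give the neighbors a discounted score. The sketch below illustrates that general idea only. It is not the algorithm `AgentContext` uses, and `hybrid_retrieve` and `hop_discount` are hypothetical names.

```python
def hybrid_retrieve(vector_hits, neighbors, max_results=10, hop_discount=0.5):
    """Expand vector hits by one graph hop and rank everything by score.

    vector_hits: list of (entity_name, similarity) pairs from a vector search.
    neighbors:   dict mapping an entity name to the names it is connected to.
    """
    scores = {}
    related = {}
    for name, score in vector_hits:
        scores[name] = max(scores.get(name, 0.0), score)
        for neighbor in neighbors.get(name, []):
            related.setdefault(name, []).append(neighbor)
            # A neighbor inherits a discounted share of the hit's score.
            scores[neighbor] = max(scores.get(neighbor, 0.0), score * hop_discount)

    # Highest score first; ties are broken alphabetically for a stable order.
    ranked = sorted(scores.items(), key=lambda pair: (-pair[1], pair[0]))
    return [
        {"content": name, "score": score, "related_entities": related.get(name, [])}
        for name, score in ranked[:max_results]
    ]


# Pretend the vector search for "COX enzymes" returned the two enzymes.
vector_hits = [("COX-1", 0.9), ("COX-2", 0.8)]
neighbors = {
    "COX-1": ["Aspirin", "Ibuprofen"],
    "COX-2": ["Aspirin", "Ibuprofen"],
}
results = hybrid_retrieve(vector_hits, neighbors)
for result in results:
    print(f"{result['content']}: {result['score']:.2f}")
```

The vector search alone finds only the enzymes. The graph hop is what surfaces the drugs connected to them, which is the answer the question actually asks for.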


## Visualizing the Knowledge Graph

Generate an interactive visualization of the drug-target knowledge graph.


In [None]:
from semantica.visualization import KGVisualizer

visualizer = KGVisualizer()
visualizer.visualize(
    kg,
    output_path="drug_target_kg.html",
    layout="spring",
    node_size=20
)

print("Visualization saved to drug_target_kg.html")


## Exporting Results

Export the knowledge graph to JSON and GraphML for further analysis or integration with other tools.


In [None]:
from semantica.export import GraphExporter

exporter = GraphExporter()
exporter.export(kg, output_path="drug_target_kg.json", format="json")
exporter.export(kg, output_path="drug_target_kg.graphml", format="graphml")

print("Exported knowledge graph to JSON and GraphML formats")
