[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Hawksight-AI/semantica/blob/main/cookbook/use_cases/biomedical/01_Drug_Discovery_Pipeline.ipynb)

# Drug Discovery Pipeline - Vector Similarity Search

## Overview

This notebook builds an **end-to-end drug discovery pipeline** from individual Semantica modules. It ingests biomedical feeds, extracts entities and relationships, and combines vector similarity search with a knowledge graph to explore drug-target interactions.

### Key Features

- **Modular Architecture**: Uses Semantica modules directly (`NERExtractor`, `GraphBuilder`, `EmbeddingGenerator`, `VectorStore`)
- **Multiple Data Sources**: Ingests from 15 RSS/Atom feeds (Nature, PLOS, FDA, NCI, Drugs.com, medical news sites, and arXiv q-bio), with a built-in sample text as a fallback
- **Vector Similarity Search**: Embeds drug and protein names and ranks them by vector similarity to suggest related drugs and targets
- **Entity Extraction**: Extracts drug compounds, proteins, targets, enzymes, and receptors
- **Knowledge Graph**: Builds structured drug-target relationship graphs
- **GraphRAG**: Hybrid vector + graph retrieval for enhanced querying

### What You'll Learn

- How to use Semantica modules directly (avoiding the core orchestrator)
- How to ingest biomedical data from multiple sources
- How to extract entities using `NERExtractor`
- How to extract relationships using `RelationExtractor`
- How to generate embeddings with `EmbeddingGenerator`
- How to build knowledge graphs with `GraphBuilder`
- How to perform similarity search with `VectorStore`
- How to use GraphRAG with `AgentContext` for hybrid retrieval

### Pipeline Flow

```mermaid
graph LR
    A[Data Ingestion] --> B[Text Processing]
    B --> C[Entity Extraction]
    C --> D[Relationship Extraction]
    D --> E[Deduplication]
    E --> F[Embedding Generation]
    F --> G[Vector Store]
    G --> H[Knowledge Graph]
    H --> I[Similarity Search]
    H --> J[GraphRAG Queries]
    I --> K[Visualization]
    J --> K
```

### Data Sources

**Journal and Publisher Feeds:**
- Nature (Drug Discovery, Pharmacology)
- Nature Reviews Drug Discovery
- PLOS ONE, PLOS Biology, PLOS Medicine

**Government Sources:**
- FDA MedWatch
- NCI News

**Drug Information and Medical News:**
- Drugs.com (MedNews, FDA Alerts, Clinical Trials)
- Labroots Health & Medicine
- Biology News Net

**Preprint Servers:**
- arXiv (q-bio, q-bio.BM)


---


## Installation

Install Semantica and required dependencies:


In [None]:
%pip install -qU semantica networkx matplotlib plotly pandas faiss-cpu beautifulsoup4 groq sentence-transformers scikit-learn


## Configuration & Setup

Set up environment variables and configuration constants.


In [None]:
import os

# Keep an existing GROQ_API_KEY if one is set; otherwise fall back to a placeholder.
# Never commit a real API key to a notebook.
os.environ.setdefault("GROQ_API_KEY", "<your-groq-api-key>")


In [None]:
EMBEDDING_DIMENSION = 384
EMBEDDING_MODEL = "all-MiniLM-L6-v2"
CHUNK_SIZE = 1000
CHUNK_OVERLAP = 200
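
To see how `CHUNK_SIZE` and `CHUNK_OVERLAP` interact, here is a minimal sketch of a fixed-size character window. It is not Semantica's entity-aware splitter; the function name `chunk_text` is hypothetical and only illustrates that each chunk starts `chunk_size - chunk_overlap` characters after the previous one.

```python
def chunk_text(text, chunk_size=1000, chunk_overlap=200):
    """Split text into fixed-size character windows that overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # distance between consecutive window starts
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


# A 2,500-character text yields windows starting at 0, 800, and 1600
chunk_lengths = [len(c) for c in chunk_text("a" * 2500)]
print(chunk_lengths)
```

With the defaults above, consecutive chunks share 200 characters, so an entity mention near a chunk boundary is more likely to appear whole in at least one chunk.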


## Ingesting Biomedical Data from Multiple Sources

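Ingest documents from the RSS/Atom feeds listed in the cell below (journals, government sources, drug news, and arXiv preprints). If no feed returns any content, the cell falls back to a small built-in sample text.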


In [None]:
from semantica.ingest import FeedIngestor, FileIngestor
import os
from contextlib import redirect_stderr
from io import StringIO

os.makedirs("data", exist_ok=True)

feed_sources = [
    # Nature Feeds
    ("Nature - Drug Discovery", "https://www.nature.com/subjects/drug-discovery.rss"),
    ("Nature - Pharmacology", "https://www.nature.com/subjects/pharmacology.rss"),
    ("Nature Reviews Drug Discovery", "https://www.nature.com/nrd.rss"),
    
    # FDA & Government Sources
    ("FDA MedWatch", "https://www.fda.gov/AboutFDA/ContactFDA/StayInformed/RSSFeeds/MedWatch/rss.xml"),
    ("NCI News", "https://www.cancer.gov/syndication/rss"),
    
    # Drug Information & News
    ("Drugs.com - MedNews", "https://www.drugs.com/rss/mednews.xml"),
    ("Drugs.com - FDA Alerts", "https://www.drugs.com/rss/fda-alerts.xml"),
    ("Drugs.com - Clinical Trials", "https://www.drugs.com/rss/clinical-trials.xml"),
    
    # Medical News
    ("Labroots Health & Medicine", "http://www.labroots.com/rss/trending/health-and-medicine"),
    ("Biology News Net", "https://www.biologynews.net/rss.php"),
    
    # Open Access Journals
    ("PLOS ONE - Medicine", "https://journals.plos.org/plosone/feed/atom"),
    ("PLOS Biology", "https://journals.plos.org/plosbiology/feed/atom"),
    ("PLOS Medicine", "https://journals.plos.org/plosmedicine/feed/atom"),
    
    # Preprint Servers
    ("arXiv - q-bio", "http://arxiv.org/rss/q-bio"),
    ("arXiv - q-bio.BM", "http://arxiv.org/rss/q-bio.BM"),
]

feed_ingestor = FeedIngestor()
all_documents = []

print(f"Ingesting from {len(feed_sources)} feed sources...")
for i, (feed_name, feed_url) in enumerate(feed_sources, 1):
    try:
        with redirect_stderr(StringIO()):
            feed_data = feed_ingestor.ingest_feed(feed_url, validate=False)
        
        feed_count = 0
        for item in feed_data.items:
            if not item.content:
                item.content = item.description or item.title or ""
            if item.content:
                # Covers both a missing attribute and metadata set to None
                if not getattr(item, 'metadata', None):
                    item.metadata = {}
                item.metadata['source'] = feed_name
                all_documents.append(item)
                feed_count += 1
        
        if feed_count > 0:
            print(f"  [{i}/{len(feed_sources)}] {feed_name}: {feed_count} documents")
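    except Exception as exc:
        # Report the failure instead of silently skipping the feed
        print(f"  [{i}/{len(feed_sources)}] {feed_name}: skipped ({type(exc).__name__})")
        continue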

if not all_documents:
    sample_drug_data = """
    Aspirin (acetylsalicylic acid) is a medication used to reduce pain, fever, or inflammation. 
    It targets cyclooxygenase enzymes COX-1 and COX-2. Aspirin is commonly used for cardiovascular protection.
    Ibuprofen is a nonsteroidal anti-inflammatory drug (NSAID) that targets COX-1 and COX-2 enzymes.
    Metformin is an antidiabetic medication that targets AMP-activated protein kinase (AMPK).
    Insulin targets the insulin receptor (INSR) to regulate glucose metabolism.
    Warfarin is an anticoagulant that targets vitamin K epoxide reductase complex subunit 1 (VKORC1).
    Atorvastatin is a statin medication that targets HMG-CoA reductase.
    """
    
    with open("data/sample_drugs.txt", "w") as f:
        f.write(sample_drug_data)
    
    file_ingestor = FileIngestor()
    all_documents = file_ingestor.ingest("data/sample_drugs.txt")

documents = all_documents
print(f"Ingested {len(documents)} documents")
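
For intuition about what the ingestion loop does with each feed item, the following sketch parses a tiny RSS document with Python's standard library. It is not how `FeedIngestor` is implemented; `parse_rss_items` and `SAMPLE_RSS` are hypothetical names, and the sketch only mirrors the fallback rule used above (use the description, else the title, and drop items with neither).

```python
import xml.etree.ElementTree as ET

# Inline sample feed so the sketch runs without network access
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item>
      <title>Aspirin and COX-1</title>
      <description>Aspirin irreversibly inhibits COX-1.</description>
    </item>
    <item>
      <title>Untitled note</title>
    </item>
    <item>
      <description></description>
    </item>
  </channel>
</rss>"""


def parse_rss_items(xml_text, source):
    """Return one dict per feed item that has usable text."""
    root = ET.fromstring(xml_text)
    docs = []
    for item in root.iter("item"):
        title = (item.findtext("title") or "").strip()
        description = (item.findtext("description") or "").strip()
        content = description or title  # same fallback order as the loop above
        if content:
            docs.append({"content": content, "metadata": {"source": source}})
    return docs


sample_docs = parse_rss_items(SAMPLE_RSS, "Example feed")
print(len(sample_docs), "documents with content")
```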


## Normalizing and Chunking Documents

Clean and normalize text, then split into chunks using entity-aware chunking to preserve drug/protein entity boundaries.


In [None]:
from semantica.normalize import TextNormalizer
from semantica.split import TextSplitter

normalizer = TextNormalizer()
splitter = TextSplitter(
    method="entity_aware",
    ner_method="spacy",
    chunk_size=CHUNK_SIZE,
    chunk_overlap=CHUNK_OVERLAP
)

print(f"Normalizing {len(documents)} documents...")
normalized_documents = []
for i, doc in enumerate(documents, 1):
    normalized_text = normalizer.normalize(
        doc.content if hasattr(doc, 'content') else str(doc),
        clean_html=True,
        normalize_entities=True,
        remove_extra_whitespace=True,
        lowercase=False
    )
    normalized_documents.append(normalized_text)
    if i % 50 == 0 or i == len(documents):
        print(f"  Normalized {i}/{len(documents)} documents...")

print(f"Chunking {len(normalized_documents)} documents...")
chunked_documents = []
for i, doc_text in enumerate(normalized_documents, 1):
    try:
        with redirect_stderr(StringIO()):
            chunks = splitter.split(doc_text)
        chunked_documents.extend(chunks)
    except Exception:
        simple_splitter = TextSplitter(method="recursive", chunk_size=CHUNK_SIZE, chunk_overlap=CHUNK_OVERLAP)
        chunks = simple_splitter.split(doc_text)
        chunked_documents.extend(chunks)
    if i % 50 == 0 or i == len(normalized_documents):
        print(f"  Chunked {i}/{len(normalized_documents)} documents ({len(chunked_documents)} chunks so far)")

print(f"Created {len(chunked_documents)} chunks from {len(normalized_documents)} documents")




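## Extracting Drug and Protein Entities

Extract named entities from each chunk with `NERExtractor`, then sort them into drug and protein candidates.


In [None]: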
from semantica.semantic_extract import NERExtractor

# Use the spaCy-backed ML method with the small general-purpose English model
entity_extractor = NERExtractor(method="ml", model="en_core_web_sm")

all_entities = []
print(f"Extracting entities from {len(chunked_documents)} chunks...")

for i, chunk in enumerate(chunked_documents, 1):
    chunk_text = chunk.text if hasattr(chunk, 'text') else str(chunk)
    try:
        entities = entity_extractor.extract_entities(chunk_text)
        all_entities.extend(entities)
    except Exception:
        continue
    
    if i % 20 == 0 or i == len(chunked_documents):
        remaining = len(chunked_documents) - i
        print(f"  Processed {i}/{len(chunked_documents)} chunks ({len(all_entities)} entities found, {remaining} remaining)")

# Filter entities: en_core_web_sm returns general-purpose labels (PERSON, ORG, PRODUCT, etc.),
# not biomedical ones, so the two lists below are keyword heuristics and can overlap.
drugs = [e for e in all_entities if e.label == "PRODUCT" or (e.label == "ORG" and any(kw in e.text.lower() for kw in ["drug", "pharma", "medication"]))]
proteins = [e for e in all_entities if e.label == "ORG" or (e.label == "PRODUCT" and any(kw in e.text.lower() for kw in ["protein", "enzyme", "receptor", "kinase", "target"]))]

print(f"Extracted {len(drugs)} drugs and {len(proteins)} proteins")
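
The two list comprehensions above can be read more easily as named predicates. The sketch below restates the same rules on a stand-in `Entity` dataclass (a hypothetical type with only `text` and `label`; the real objects returned by `NERExtractor` may carry more fields). It also makes the overlap visible: every `ORG` counts as a protein candidate, and every `PRODUCT` counts as a drug candidate.

```python
from dataclasses import dataclass


@dataclass
class Entity:
    """Stand-in for an extracted entity (assumed shape: text plus spaCy label)."""
    text: str
    label: str


DRUG_KEYWORDS = ("drug", "pharma", "medication")
PROTEIN_KEYWORDS = ("protein", "enzyme", "receptor", "kinase", "target")


def has_keyword(text, keywords):
    lowered = text.lower()
    return any(kw in lowered for kw in keywords)


def is_drug(entity):
    # Any PRODUCT, or an ORG whose text mentions a drug-related keyword
    return entity.label == "PRODUCT" or (
        entity.label == "ORG" and has_keyword(entity.text, DRUG_KEYWORDS)
    )


def is_protein(entity):
    # Any ORG, or a PRODUCT whose text mentions a protein-related keyword
    return entity.label == "ORG" or (
        entity.label == "PRODUCT" and has_keyword(entity.text, PROTEIN_KEYWORDS)
    )


examples = [
    Entity("Acme Pharma", "ORG"),
    Entity("insulin receptor", "PRODUCT"),
    Entity("Jane Doe", "PERSON"),
]
for e in examples:
    print(e.text, "| drug:", is_drug(e), "| protein:", is_protein(e))
```

Because the rules are this permissive, a domain-specific NER model would likely give cleaner drug and protein lists than label remapping.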


## Extracting Drug-Target Relationships

Extract relationships between drugs and proteins to understand drug-target interactions.


In [None]:
from semantica.semantic_extract import RelationExtractor

# Use spaCy dependency parsing with the same model as the NER cell
relation_extractor = RelationExtractor(method="dependency", model="en_core_web_sm")

all_relationships = []
print(f"Extracting relationships from {len(chunked_documents)} chunks...")

for i, chunk in enumerate(chunked_documents, 1):
    chunk_text = chunk.text if hasattr(chunk, 'text') else str(chunk)
    try:
        relationships = relation_extractor.extract_relations(
            chunk_text,
            entities=all_entities,
            relation_types=["targets", "inhibits", "activates", "binds_to", "interacts_with"]
        )
        all_relationships.extend(relationships)
    except Exception:
        continue
    
    if i % 20 == 0 or i == len(chunked_documents):
        print(f"  Processed {i}/{len(chunked_documents)} chunks ({len(all_relationships)} relationships found)")

print(f"Extracted {len(all_relationships)} relationships")
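
As a rough illustration of what a relation extractor produces, the sketch below finds `(drug, relation, target)` triples with a simple trigger-phrase rule. This is a deliberately naive stand-in, not the dependency-based method used by `RelationExtractor`; the function `extract_triples`, the trigger table, and the known-name lists are all assumptions made for the example.

```python
import re

# Trigger phrase -> relation type (a subset of the relation_types used above)
TRIGGERS = {
    "targets": "targets",
    "inhibits": "inhibits",
    "activates": "activates",
    "binds to": "binds_to",
}


def extract_triples(text, drugs, targets):
    """Pair known drugs before a trigger phrase with known targets after it."""
    triples = []
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        lowered = sentence.lower()
        for phrase, rel_type in TRIGGERS.items():
            pos = lowered.find(phrase)
            if pos == -1:
                continue
            subjects = [d for d in drugs if d.lower() in lowered[:pos]]
            objects = [t for t in targets if t.lower() in lowered[pos:]]
            triples.extend((s, rel_type, o) for s in subjects for o in objects)
    return triples


sample_text = (
    "Metformin is an antidiabetic medication that targets AMPK. "
    "Warfarin inhibits VKORC1."
)
triples = extract_triples(sample_text, ["Metformin", "Warfarin"], ["AMPK", "VKORC1"])
print(triples)
```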


## Resolving Duplicate Entities

Detect and merge duplicate entities to ensure data quality and consistency.
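
The notebook does not include a cell for this step. A minimal sketch of name-based deduplication, assuming entities are dictionaries with `name`, `type`, and `confidence` keys (the same shape built for conflict detection in the next section), might look like the following. The function `deduplicate_entities` is hypothetical and is not part of Semantica.

```python
def deduplicate_entities(entities):
    """Merge entities that share a case- and whitespace-normalized name and type.

    When duplicates are found, the entry with the highest confidence is kept.
    """
    merged = {}
    for ent in entities:
        key = (" ".join(ent["name"].lower().split()), ent["type"])
        current = merged.get(key)
        if current is None or ent["confidence"] > current["confidence"]:
            merged[key] = ent
    return list(merged.values())


raw_entities = [
    {"name": "Aspirin", "type": "PRODUCT", "confidence": 0.8},
    {"name": "  aspirin ", "type": "PRODUCT", "confidence": 0.9},
    {"name": "COX-1", "type": "ORG", "confidence": 1.0},
]
unique_entities = deduplicate_entities(raw_entities)
print(f"{len(raw_entities)} entities -> {len(unique_entities)} after deduplication")
```

Exact-match normalization will not catch synonyms such as "acetylsalicylic acid" and "Aspirin"; those would need fuzzy matching or a synonym dictionary.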


## Conflict Detection and Resolution

Detect and resolve conflicts in drug-target relationships from multiple research sources.

- **Detection Method**: Relationship conflict detection identifies discrepancies in drug-target interactions across sources
- **Resolution Strategy**: Credibility-weighted resolution prioritizes higher-credibility sources (e.g., Nature journals over arXiv preprints)
- **Use Case**: Handles conflicting information when multiple sources report different drug-target relationships
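
To make the idea of credibility-weighted resolution concrete, here is a small sketch under stated assumptions: each conflicting claim carries a `source`, a relation `type`, and a `confidence`, and each source has a credibility weight. The weights and the function `resolve_by_credibility` are illustrative only; they are not taken from Semantica's `ConflictResolver`.

```python
from collections import defaultdict

# Hypothetical credibility weights per source (0.0 to 1.0)
SOURCE_CREDIBILITY = {
    "Nature - Drug Discovery": 0.9,
    "arXiv - q-bio": 0.6,
}


def resolve_by_credibility(claims, credibility, default=0.5):
    """Pick the relation type with the highest credibility-weighted support."""
    support = defaultdict(float)
    for claim in claims:
        weight = credibility.get(claim["source"], default)
        support[claim["type"]] += weight * claim["confidence"]
    return max(support, key=support.get)


conflicting_claims = [
    {"source": "Nature - Drug Discovery", "type": "inhibits", "confidence": 0.8},
    {"source": "arXiv - q-bio", "type": "activates", "confidence": 0.9},
]
winner = resolve_by_credibility(conflicting_claims, SOURCE_CREDIBILITY)
print("Resolved relation type:", winner)
```

In this example the journal claim wins (0.9 × 0.8 = 0.72 against 0.6 × 0.9 = 0.54), even though the preprint claim has the higher raw confidence.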


In [None]:
from semantica.conflicts import ConflictDetector, ConflictResolver

detector = ConflictDetector()
resolver = ConflictResolver(default_strategy="credibility_weighted")

# Convert to dict format for conflict detection
entities = [
    {
        "id": ent.text if hasattr(ent, 'text') else str(ent),
        "name": ent.text if hasattr(ent, 'text') else str(ent),
        "type": ent.label if hasattr(ent, 'label') else "ENTITY",
        "confidence": getattr(ent, 'confidence', 1.0),
        "source": ent.metadata.get("source", "unknown") if hasattr(ent, 'metadata') and ent.metadata else "unknown"
    }
    for ent in all_entities if hasattr(ent, 'text') or hasattr(ent, 'label')
]

relationships = [
    {
        "id": f"{rel.subject.text}_{rel.object.text}_{rel.predicate}",
        "source_id": rel.subject.text,
        "target_id": rel.object.text,
        "type": rel.predicate,
        "confidence": getattr(rel, 'confidence', 1.0),
        "source": rel.metadata.get("source", "unknown") if hasattr(rel, 'metadata') and rel.metadata else "unknown"
    }
    for rel in all_relationships if hasattr(rel, 'subject')
]

# Detect and resolve conflicts
print(f"Detecting conflicts in {len(entities)} entities, {len(relationships)} relationships...")
entity_conflicts = detector.detect_conflicts(entities)
relationship_conflicts = detector.detect_relationship_conflicts(relationships)
print(f"Detected {len(entity_conflicts)} entity conflicts, {len(relationship_conflicts)} relationship conflicts")

# Resolve conflicts
if entity_conflicts:
    resolver.resolve_conflicts(entity_conflicts, strategy="credibility_weighted")
    print(f"Resolved {len(entity_conflicts)} entity conflicts")

if relationship_conflicts:
    resolver.resolve_conflicts(relationship_conflicts, strategy="credibility_weighted")
    print(f"Resolved {len(relationship_conflicts)} relationship conflicts")

# Note: the return values of resolve_conflicts are not captured above, and the
# graph-building cell passes all_entities and all_relationships to GraphBuilder() unchanged.
print("Conflict detection and resolution complete.")


## Generating Vector Embeddings

Generate embeddings for drugs and proteins to enable similarity search.


In [None]:
from semantica.embeddings import EmbeddingGenerator
from semantica.vector_store import VectorStore

embedding_gen = EmbeddingGenerator(
    provider="sentence_transformers",
    model=EMBEDDING_MODEL
)

vector_store = VectorStore(backend="faiss", dimension=EMBEDDING_DIMENSION)

print(f"Generating embeddings for {len(drugs)} drugs and {len(proteins)} proteins...")
drug_texts = [d.text for d in drugs]
drug_embeddings = embedding_gen.generate_embeddings(drug_texts)

protein_texts = [p.text for p in proteins]
protein_embeddings = embedding_gen.generate_embeddings(protein_texts)

print(f"Generated {len(drug_embeddings)} drug embeddings and {len(protein_embeddings)} protein embeddings")
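
Similarity search over these embeddings typically relies on cosine similarity. The sketch below computes it in plain Python on toy vectors so the arithmetic is easy to follow; real embeddings from `all-MiniLM-L6-v2` have 384 dimensions, and the helper name `cosine_similarity` is ours.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # define similarity with a zero vector as 0
    return dot / (norm_a * norm_b)


same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
print(f"same direction: {same_direction:.3f}, orthogonal: {orthogonal:.3f}")
```

Note that the cell above embeds only the entity name strings, so similarity reflects how the model represents those names, not pharmacological properties.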


## Populating Vector Database

Store drug and protein embeddings in the vector database with metadata for efficient similarity search.


In [None]:
print(f"Storing {len(drug_embeddings)} drug vectors and {len(protein_embeddings)} protein vectors...")
drug_ids = vector_store.store_vectors(
    vectors=drug_embeddings,
    metadata=[{"type": "drug", "name": d.text, "label": d.label} for d in drugs]
)

protein_ids = vector_store.store_vectors(
    vectors=protein_embeddings,
    metadata=[{"type": "protein", "name": p.text, "label": p.label} for p in proteins]
)

print(f"Stored {len(drug_ids)} drug vectors and {len(protein_ids)} protein vectors")
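
For readers who want to see what a vector store does underneath, the following is a minimal in-memory sketch. It is not the Semantica `VectorStore` or FAISS: the class `InMemoryVectorStore` is hypothetical, it borrows the method names `store_vectors` and `search_vectors` from the cells in this notebook, and it assumes search returns dictionaries with `score` and `metadata` keys. Vectors are L2-normalized on insert, so the dot product at query time equals cosine similarity.

```python
import math


class InMemoryVectorStore:
    """Brute-force store: normalized vectors plus parallel metadata."""

    def __init__(self, dimension):
        self.dimension = dimension
        self._vectors = []
        self._metadata = []

    @staticmethod
    def _normalize(vector):
        norm = math.sqrt(sum(x * x for x in vector))
        return [x / norm for x in vector] if norm else list(vector)

    def store_vectors(self, vectors, metadata):
        if len(vectors) != len(metadata):
            raise ValueError("vectors and metadata must have the same length")
        ids = []
        for vector, meta in zip(vectors, metadata):
            if len(vector) != self.dimension:
                raise ValueError(f"expected dimension {self.dimension}")
            ids.append(f"vec_{len(self._vectors)}")
            self._vectors.append(self._normalize(vector))
            self._metadata.append(meta)
        return ids

    def search_vectors(self, query, k=5):
        q = self._normalize(query)
        scored = [
            {"id": f"vec_{i}", "score": sum(a * b for a, b in zip(q, v)), "metadata": m}
            for i, (v, m) in enumerate(zip(self._vectors, self._metadata))
        ]
        return sorted(scored, key=lambda r: r["score"], reverse=True)[:k]


toy_store = InMemoryVectorStore(dimension=2)
toy_ids = toy_store.store_vectors(
    [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    [{"name": "a"}, {"name": "b"}, {"name": "c"}],
)
toy_hits = toy_store.search_vectors([1.0, 0.1], k=2)
print([hit["metadata"]["name"] for hit in toy_hits])
```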


## Building Drug-Target Knowledge Graph

Construct a knowledge graph from extracted entities and relationships to enable graph-based reasoning.


In [None]:
from semantica.kg import GraphBuilder

graph_builder = GraphBuilder()

print(f"Building graph from {len(all_entities)} entities, {len(all_relationships)} relationships...")
kg = graph_builder.build({
    "entities": all_entities,
    "relationships": all_relationships
})

entities_count = len(kg.get('entities', []))
relationships_count = len(kg.get('relationships', []))
print(f"Graph: {entities_count} entities, {relationships_count} relationships")
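
The cell above reads the graph as a dictionary with `entities` and `relationships` keys. Assuming that shape, a stripped-down builder can be sketched as follows. The functions `build_graph` and `targets_of` and the field names inside each record are illustrative, not the output format of `GraphBuilder`.

```python
def build_graph(triples):
    """Build an entities/relationships dictionary from (source, type, target) triples."""
    entities = {}
    relationships = []
    seen = set()
    for source, rel_type, target in triples:
        for name in (source, target):
            entities.setdefault(name, {"id": name, "name": name})
        key = (source, rel_type, target)
        if key not in seen:  # skip exact duplicate edges
            seen.add(key)
            relationships.append({"source": source, "type": rel_type, "target": target})
    return {"entities": list(entities.values()), "relationships": relationships}


def targets_of(graph, drug):
    """List every node that the given drug points to."""
    return [r["target"] for r in graph["relationships"] if r["source"] == drug]


toy_kg = build_graph([
    ("Aspirin", "targets", "COX-1"),
    ("Aspirin", "targets", "COX-2"),
    ("Aspirin", "targets", "COX-1"),  # duplicate, ignored
])
print(len(toy_kg["entities"]), "entities,", len(toy_kg["relationships"]), "relationships")
print(targets_of(toy_kg, "Aspirin"))
```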


## Finding Similar Drugs via Vector Search

Use vector similarity search to find drugs similar to a query drug based on their embeddings.


In [None]:
query_drug = "Aspirin"
query_embedding = embedding_gen.generate_embeddings([query_drug])[0]
similar_drugs = vector_store.search_vectors(query_embedding, k=5)

print(f"Drugs similar to '{query_drug}':")
for i, result in enumerate(similar_drugs, 1):
    metadata = result.get('metadata', {})
    name = metadata.get('name', 'Unknown') if metadata else 'Unknown'
    score = result.get('score', 0.0)
    print(f"{i}. {name} (similarity: {score:.3f})")


## GraphRAG: Hybrid Vector + Graph Retrieval

Use GraphRAG to combine vector similarity search with knowledge graph traversal for enhanced retrieval and reasoning.


In [None]:
from semantica.context import AgentContext, ContextRetriever

# Option 1: Use AgentContext (high-level, recommended)
context = AgentContext(
    vector_store=vector_store, 
    knowledge_graph=kg,
    hybrid_alpha=0.6,
    max_expansion_hops=2
)

# Option 2: Use ContextRetriever directly (more control)
retriever = ContextRetriever(
    vector_store=vector_store,
    knowledge_graph=kg,
    hybrid_alpha=0.6,
    max_expansion_hops=2
)

# GraphRAG query using AgentContext
query = "What drugs target COX enzymes?"
results = context.retrieve(
    query,
    max_results=10,
    use_graph=True,
    expand_graph=True,
    include_entities=True,
    include_relationships=True
)


print(f"Query: '{query}'")
print(f"Retrieved {len(results)} results:\n")
for i, result in enumerate(results[:5], 1):
    print(f"{i}. Score: {result.get('score', 0):.3f}")
    if result.get('content'):
        print(f"   {result['content'][:250]}")
    if result.get('related_entities'):
        # Use a distinct name so the `entities` list from the conflict cell is not overwritten
        related = result['related_entities']
        names = [e.get('name', e.get('id', '')) for e in related[:3]]
        print(f"   Entities: {', '.join(names)}" + (f" (+{len(related)-3})" if len(related) > 3 else ""))
    if result.get('related_relationships'):
        print(f"   Relationships: {len(result['related_relationships'])}")
    print()
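
One plausible reading of `hybrid_alpha` and `max_expansion_hops` is sketched below: vector scores and graph proximity are blended with weight `alpha`, and graph proximity comes from a breadth-first expansion limited to a number of hops. This interpretation is an assumption for illustration; the actual scoring inside `AgentContext` and `ContextRetriever` may differ, and the functions `expand` and `hybrid_scores` are hypothetical.

```python
from collections import deque


def expand(adjacency, seeds, max_hops):
    """Breadth-first search returning the hop distance of nodes within max_hops."""
    distance = {seed: 0 for seed in seeds}
    queue = deque(seeds)
    while queue:
        node = queue.popleft()
        if distance[node] == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbor in adjacency.get(node, ()):
            if neighbor not in distance:
                distance[neighbor] = distance[node] + 1
                queue.append(neighbor)
    return distance


def hybrid_scores(vector_scores, adjacency, seeds, alpha=0.6, max_hops=2):
    """Blend vector similarity with graph proximity: alpha * vector + (1 - alpha) * graph."""
    distance = expand(adjacency, seeds, max_hops)
    scores = {}
    for node in set(vector_scores) | set(distance):
        graph_score = 1.0 / (1 + distance[node]) if node in distance else 0.0
        scores[node] = alpha * vector_scores.get(node, 0.0) + (1 - alpha) * graph_score
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


toy_adjacency = {
    "COX-1": ["Aspirin", "Ibuprofen"],
    "Aspirin": ["COX-1", "COX-2"],
    "Ibuprofen": ["COX-1"],
    "COX-2": ["Aspirin"],
}
toy_vector_scores = {"Aspirin": 0.5, "Metformin": 0.4}
ranked = hybrid_scores(toy_vector_scores, toy_adjacency, seeds=["COX-1"])
for name, score in ranked:
    print(f"{name}: {score:.3f}")
```

In this toy example, Aspirin ranks first because it has both a vector match and a one-hop link to the seed, while Metformin is found by vector search alone and Ibuprofen by graph expansion alone.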

## Visualizing the Knowledge Graph

Generate an interactive visualization of the drug-target knowledge graph.


In [None]:
from semantica.visualization import KGVisualizer

# Display interactive Plotly graph directly in notebook
visualizer = KGVisualizer(layout="force", node_size=20)
fig = visualizer.visualize_network(kg, output="interactive")

# Display the figure (Plotly will show it automatically in notebook)
if fig:
    fig.show()

## Exporting Results

Export the knowledge graph to various formats for further analysis or integration with other tools.


In [None]:
from semantica.export import GraphExporter

exporter = GraphExporter()
exporter.export(kg, output_path="drug_target_kg.json", format="json")
exporter.export(kg, output_path="drug_target_kg.graphml", format="graphml")

print("Exported knowledge graph to JSON and GraphML formats")
