[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Hawksight-AI/semantica/blob/main/cookbook/use_cases/healthcare/01_Clinical_Reports_Processing.ipynb)

# Clinical Reports Processing - EHR Integration & Triplet Stores

## Overview

This notebook demonstrates **clinical reports processing** using Semantica, with a focus on **EHR integration**, **triplet stores**, and **patient knowledge graphs**. The pipeline ingests data from EHR systems, HL7/FHIR APIs, and other clinical data sources, builds temporal patient knowledge graphs, and stores them in RDF triplet stores for interoperability.

### Key Features

- **EHR Integration**: Ingests data from EHR systems and HL7/FHIR APIs
- **Triplet Store Storage**: Stores patient data in RDF triplet stores (Jena, BlazeGraph)
- **Patient Knowledge Graphs**: Builds comprehensive temporal patient KGs with history tracking
- **Medical Entity Extraction**: Extracts medical entities from clinical reports
- **Structured Data Management**: Parses and normalizes structured clinical data before storage
- **Temporal Analysis**: Tracks patient history and temporal relationships
- **Seed Data Integration**: Uses medical terminology and ICD-10 codes for entity resolution

### Learning Objectives

- Understand how to integrate EHR systems and HL7/FHIR data sources
- Learn to build temporal knowledge graphs for patient history tracking
- Master triplet store storage and RDF export for healthcare interoperability
- Explore structured clinical data parsing and normalization
- Practice patient record deduplication with medical terminology
- Analyze patient network structures and temporal patterns

### Pipeline Flow

```mermaid
graph TD
    A[Data Ingestion] --> B[Seed Data Loading]
    A --> C[Document Parsing]
    B --> D[Text Processing]
    C --> D
    D --> E[Entity Extraction]
    E --> F[Relationship Extraction]
    F --> G[Deduplication]
    G --> H[Temporal KG Construction]
    H --> I[Embedding Generation]
    I --> J[Vector Store]
    H --> K[Triplet Store]
    H --> L[Graph Analytics]
    H --> M[Temporal Queries]
    J --> N[GraphRAG Queries]
    K --> O[Export RDF/TTL]
    L --> P[Visualization]
    M --> P
```


In [None]:
%pip install -qU semantica networkx matplotlib plotly pandas faiss-cpu beautifulsoup4 groq sentence-transformers scikit-learn rdflib


---

## Configuration & Setup

Configure API keys and set up constants for the clinical reports processing pipeline.


In [None]:
import os

os.environ["GROQ_API_KEY"] = os.getenv("GROQ_API_KEY", "your-key-here")

# Configuration constants
EMBEDDING_DIMENSION = 384
EMBEDDING_MODEL = "sentence-transformers/all-MiniLM-L6-v2"
CHUNK_SIZE = 1000
CHUNK_OVERLAP = 200
TEMPORAL_GRANULARITY = "day"  # For patient history tracking


---

## Data Ingestion

Ingest clinical data from multiple sources. The cell below reads public RSS feeds (FDA and PubMed), includes a commented-out example for an authenticated HL7/FHIR API, and falls back to a local sample clinical report when no feed returns documents.


In [None]:
from semantica.ingest import FeedIngestor, WebIngestor, FileIngestor
from contextlib import redirect_stderr
from io import StringIO
import os

os.makedirs("data", exist_ok=True)

documents = []

# FDA RSS feeds
fda_feeds = [
    "https://www.fda.gov/about-fda/contact-fda/stay-informed/rss-feeds/fda-press-releases",
    "https://www.fda.gov/about-fda/contact-fda/stay-informed/rss-feeds/fda-drug-safety-communications"
]

# PubMed RSS feeds (clinical reports)
pubmed_feeds = [
    "https://pubmed.ncbi.nlm.nih.gov/rss/search/1?term=clinical+report&limit=10",
    "https://pubmed.ncbi.nlm.nih.gov/rss/search/1?term=EHR+electronic+health+record&limit=10"
]

# Ingest every feed; report failures instead of hiding them so it is clear
# when the sample data fallback below is being used
for feed_url in fda_feeds + pubmed_feeds:
    try:
        with redirect_stderr(StringIO()):
            feed_ingestor = FeedIngestor()
            feed_docs = feed_ingestor.ingest(feed_url, method="rss")
            documents.extend(feed_docs)
    except Exception as exc:
        print(f"Skipping feed {feed_url}: {exc}")

# Example: Web ingestion from HL7/FHIR API (commented - requires authentication)
# web_ingestor = WebIngestor()
# fhir_docs = web_ingestor.ingest("https://fhir.example.com/Patient", method="api")

# Fallback: Sample clinical report data
if not documents:
    clinical_data = """
    Patient ID: P001, Name: John Doe, DOB: 1980-01-15, MRN: 12345
    Diagnosis: Type 2 Diabetes (ICD-10: E11.9), Date: 2024-01-10
    Treatment: Metformin 500mg twice daily, Started: 2024-01-10, Prescriber: Dr. Smith
    Procedure: Blood glucose test (LOINC: 2339-0), Date: 2024-01-15, Result: 180 mg/dL
    Vital Sign: Blood Pressure, Date: 2024-01-15, Systolic: 140, Diastolic: 90
    Lab Result: HbA1c (LOINC: 4548-4), Date: 2024-01-15, Result: 7.2%
    
    Patient ID: P002, Name: Jane Smith, DOB: 1975-05-20, MRN: 12346
    Diagnosis: Hypertension (ICD-10: I10), Date: 2024-01-12
    Medication: Lisinopril 10mg daily, Started: 2024-01-12
    Procedure: ECG (LOINC: 34551-2), Date: 2024-01-18, Result: Normal sinus rhythm
    """
    with open("data/clinical_report.txt", "w", encoding="utf-8") as f:
        f.write(clinical_data)
    file_ingestor = FileIngestor()
    documents = file_ingestor.ingest("data/clinical_report.txt")

print(f"Ingested {len(documents)} documents")


---

## Seed Data Loading

Load medical terminology and ICD-10 codes as seed data for entity resolution.


In [None]:
from semantica.seed import SeedDataManager

seed_manager = SeedDataManager()

# Load medical terminology seed data
medical_terminology = {
    "diagnoses": ["Type 2 Diabetes", "Hypertension", "Diabetes Mellitus", "High Blood Pressure"],
    "medications": ["Metformin", "Lisinopril", "Aspirin", "Insulin"],
    "procedures": ["Blood glucose test", "ECG", "Blood Pressure measurement"],
    "icd10_codes": ["E11.9", "I10", "E11", "I10.9"]
}

seed_data = seed_manager.load_seed_data(medical_terminology)
print(f"Loaded seed data with {len(seed_data)} entries")


---

## Document Parsing

Parse structured clinical data from various formats including JSON, HTML, and XML (HL7/FHIR).
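
As a rough illustration of what parsing a structured record involves, the sketch below flattens a FHIR `Patient` resource (JSON) into a simple dictionary using only the standard library. The sample resource and the helper name `flatten_fhir_patient` are hypothetical and not part of Semantica; `DocumentParser` may handle FHIR data differently.

```python
import json

# Hypothetical FHIR Patient resource mirroring the sample patient used in this notebook
fhir_patient_json = """
{
  "resourceType": "Patient",
  "id": "P001",
  "identifier": [{"type": {"text": "MRN"}, "value": "12345"}],
  "name": [{"family": "Doe", "given": ["John"]}],
  "birthDate": "1980-01-15"
}
"""


def flatten_fhir_patient(resource: dict) -> dict:
    """Flatten a few fields of a FHIR Patient resource into a flat record."""
    if resource.get("resourceType") != "Patient":
        raise ValueError("Expected a FHIR Patient resource")
    # Use the first name entry; FHIR allows several
    name = (resource.get("name") or [{}])[0]
    full_name = " ".join(name.get("given", []) + [name.get("family", "")]).strip()
    # Pick the identifier labelled as MRN, if any
    mrn = next(
        (
            ident.get("value")
            for ident in resource.get("identifier") or []
            if ident.get("type", {}).get("text") == "MRN"
        ),
        None,
    )
    return {
        "patient_id": resource.get("id"),
        "name": full_name,
        "dob": resource.get("birthDate"),
        "mrn": mrn,
    }


patient_record = flatten_fhir_patient(json.loads(fhir_patient_json))
print(patient_record)
```

A flat record like this can be passed on as text or as a dictionary, depending on what the downstream steps expect.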


In [None]:
from semantica.parse import DocumentParser
from contextlib import redirect_stderr
from io import StringIO

parser = DocumentParser()

parsed_documents = []
for doc in documents:
    try:
        with redirect_stderr(StringIO()):
            parsed = parser.parse(
                doc.content if hasattr(doc, 'content') else str(doc),
                format="auto"
            )
            parsed_documents.append(parsed)
    except Exception:
        parsed_documents.append(doc.content if hasattr(doc, 'content') else str(doc))

print(f"Parsed {len(parsed_documents)} documents")


---

## Text Processing

Normalize medical text and split documents into chunks using recursive chunking to preserve clinical context.
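
To make the idea of recursive chunking concrete, here is a minimal standard-library sketch. It tries coarse separators first (paragraphs, then lines, sentences, and words) and only falls back to finer ones when a piece is still too long. The functions `recursive_split` and `add_overlap` are illustrative assumptions, not Semantica's `TextSplitter` implementation, and the overlap step is simplified.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", ". ", " ")):
    """Split text into pieces of at most chunk_size characters, coarse separators first."""
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    for index, sep in enumerate(separators):
        if sep not in text:
            continue
        # Greedily pack parts into pieces that stay within chunk_size
        pieces, current = [], ""
        for part in text.split(sep):
            candidate = current + sep + part if current else part
            if len(candidate) <= chunk_size:
                current = candidate
            else:
                if current:
                    pieces.append(current)
                current = part
        if current:
            pieces.append(current)
        # Pieces that are still too long are split again with finer separators
        result = []
        for piece in pieces:
            result.extend(recursive_split(piece, chunk_size, separators[index + 1:]))
        return result
    # No separator left: fall back to a hard character split
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


def add_overlap(chunks, overlap):
    """Prefix each chunk with the tail of the previous one (result may exceed chunk_size)."""
    overlapped = []
    for i, chunk in enumerate(chunks):
        prefix = chunks[i - 1][-overlap:] if i > 0 and overlap > 0 else ""
        overlapped.append(prefix + chunk)
    return overlapped


sample_text = "Patient P001 has diabetes.\n\nPatient P002 has hypertension.\n\nFollow-up in two weeks."
chunks = recursive_split(sample_text, chunk_size=40)
overlapped = add_overlap(chunks, overlap=10)
for chunk in overlapped:
    print(repr(chunk))
```

Splitting on paragraph boundaries first keeps each patient's statements together where possible, which is the reason to prefer this over a fixed-width split for clinical text.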


In [None]:
from semantica.normalize import TextNormalizer
from semantica.split import TextSplitter
from contextlib import redirect_stderr
from io import StringIO

normalizer = TextNormalizer()
normalized_docs = []

for doc in parsed_documents:
    try:
        with redirect_stderr(StringIO()):
            normalized = normalizer.normalize(
                doc if isinstance(doc, str) else str(doc),
                clean_html=True,
                normalize_entities=True,
                remove_extra_whitespace=True
            )
            normalized_docs.append(normalized)
    except Exception:
        normalized_docs.append(doc if isinstance(doc, str) else str(doc))

# Use recursive chunking to preserve clinical context
splitter = TextSplitter(
    method="recursive",
    chunk_size=CHUNK_SIZE,
    chunk_overlap=CHUNK_OVERLAP
)

chunked_docs = []
for doc_text in normalized_docs:
    try:
        with redirect_stderr(StringIO()):
            chunks = splitter.split(doc_text)
            chunked_docs.extend([chunk.content if hasattr(chunk, 'content') else str(chunk) for chunk in chunks])
    except Exception:
        chunked_docs.append(doc_text)

print(f"Processed {len(chunked_docs)} text chunks")


---

## Entity Extraction

Extract medical entities such as patients, diagnoses, treatments, procedures, medications, lab results, and vital signs from the text chunks.


In [None]:
from semantica.semantic_extract import NERExtractor
from contextlib import redirect_stderr
from io import StringIO

extractor = NERExtractor(
    provider="groq",
    model="llama-3.1-8b-instant"
)

entity_types = [
    "Patient", "Diagnosis", "Treatment", "Procedure", 
    "Medication", "LabResult", "VitalSign"
]

all_entities = []
chunks_to_process = chunked_docs[:10]  # Limit for demo
print(f"Extracting entities from {len(chunks_to_process)} chunks...")
for i, chunk in enumerate(chunks_to_process, 1):
    try:
        with redirect_stderr(StringIO()):
            entities = extractor.extract(
                chunk,
                entity_types=entity_types
            )
            all_entities.extend(entities)
    except Exception:
        pass
    
    if i % 5 == 0 or i == len(chunks_to_process):
        print(f"  Processed {i}/{len(chunks_to_process)} chunks ({len(all_entities)} entities found)")

print(f"Extracted {len(all_entities)} entities")


---

## Relationship Extraction

Extract clinical relationships between entities such as patient-diagnosis, treatment-procedure, and medication-lab result connections.
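
For intuition about the output shape, the sketch below derives patient-centred relationships from the notebook's sample report format with plain regular expressions. It is a rule-based stand-in for illustration only: the mapping `FIELD_TO_RELATION` and the function `extract_relations` are assumptions, and the LLM-based `RelationExtractor` may return a different structure.

```python
import re

# Excerpt of the sample report format used in the ingestion fallback
sample_report = """Patient ID: P001, Name: John Doe, DOB: 1980-01-15, MRN: 12345
Diagnosis: Type 2 Diabetes (ICD-10: E11.9), Date: 2024-01-10
Treatment: Metformin 500mg twice daily, Started: 2024-01-10, Prescriber: Dr. Smith

Patient ID: P002, Name: Jane Smith, DOB: 1975-05-20, MRN: 12346
Diagnosis: Hypertension (ICD-10: I10), Date: 2024-01-12
Medication: Lisinopril 10mg daily, Started: 2024-01-12
"""

# Hypothetical mapping from report field names to relation types
FIELD_TO_RELATION = {
    "Diagnosis": "has_diagnosis",
    "Treatment": "receives_treatment",
    "Medication": "has_medication",
}


def extract_relations(report):
    """Link each clinical line to the most recent 'Patient ID' line above it."""
    relations = []
    current_patient = None
    for line in report.splitlines():
        line = line.strip()
        patient_match = re.match(r"Patient ID:\s*(\w+)", line)
        if patient_match:
            current_patient = patient_match.group(1)
            continue
        field, _, rest = line.partition(":")
        if current_patient and field in FIELD_TO_RELATION:
            # The target is the text up to the first comma or opening parenthesis
            target = re.split(r"[,(]", rest, maxsplit=1)[0].strip()
            date_match = re.search(r"\d{4}-\d{2}-\d{2}", rest)
            relations.append({
                "source": current_patient,
                "predicate": FIELD_TO_RELATION[field],
                "target": target,
                "date": date_match.group(0) if date_match else None,
            })
    return relations


sample_relations = extract_relations(sample_report)
for relation in sample_relations:
    print(relation)
```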


In [None]:
from semantica.semantic_extract import RelationExtractor
from contextlib import redirect_stderr
from io import StringIO

relation_extractor = RelationExtractor(
    provider="groq",
    model="llama-3.1-8b-instant"
)

relation_types = [
    "has_diagnosis", "receives_treatment", "underwent_procedure",
    "has_medication", "has_lab_result", "has_vital_sign"
]

all_relationships = []
chunks_to_process = chunked_docs[:10]  # Limit for demo
print(f"Extracting relationships from {len(chunks_to_process)} chunks...")
for i, chunk in enumerate(chunks_to_process, 1):
    try:
        with redirect_stderr(StringIO()):
            relationships = relation_extractor.extract(
                chunk,
                relation_types=relation_types
            )
            all_relationships.extend(relationships)
    except Exception:
        pass
    
    if i % 5 == 0 or i == len(chunks_to_process):
        print(f"  Processed {i}/{len(chunks_to_process)} chunks ({len(all_relationships)} relationships found)")

print(f"Extracted {len(all_relationships)} relationships")


---

## Conflict Detection

Detect and resolve conflicts in clinical data from multiple sources. Medical records need credibility weighting to prioritize authoritative sources.


In [None]:
from semantica.conflicts import ConflictDetector, ConflictResolver

# Use value conflict detection for property value disagreements
# credibility_weighted strategy prioritizes authoritative medical sources
conflict_detector = ConflictDetector()
conflict_resolver = ConflictResolver()

print(f"Detecting value conflicts in {len(all_entities)} entities...")
conflicts = conflict_detector.detect_conflicts(
    entities=all_entities,
    relationships=all_relationships,
    method="value"  # Detect property value conflicts
)

print(f"Detected {len(conflicts)} value conflicts")

if conflicts:
    print(f"Resolving conflicts using credibility_weighted strategy...")
    resolved = conflict_resolver.resolve_conflicts(
        conflicts,
        strategy="credibility_weighted"  # Weight by source credibility for medical records
    )
    print(f"Resolved {len(resolved)} conflicts")
else:
    print("No conflicts detected")


---

## Deduplication

Deduplicate patient records using seed data for medical terminology resolution. This is critical for EHR integration where patient records may come from multiple sources.


In [None]:
from semantica.deduplication import DuplicateDetector, EntityMerger
from semantica.deduplication.methods import detect_duplicates

# Use batch method for efficient processing of patient records
# keep_most_complete strategy preserves all medical information from patient records
print(f"Detecting duplicates in {len(all_entities)} entities using batch method...")
# Convert entities to dict format if needed
entity_dicts = [{"name": e.get("name", e.get("text", "")), "type": e.get("type", ""), "confidence": e.get("confidence", 1.0)} for e in all_entities]

# Detect duplicates using batch method (efficient for large datasets)
duplicates = detect_duplicates(entity_dicts, method="batch", similarity_threshold=0.85)

print(f"Detected {len(duplicates)} duplicate candidates")
print(f"Merging duplicates using keep_most_complete strategy...")
# Merge duplicates preserving most complete entity information
merger = EntityMerger()
merge_operations = merger.merge_duplicates(entity_dicts, strategy="keep_most_complete", threshold=0.85)

# Extract merged entities from merge operations
if merge_operations:
    merged_entities = [op.merged_entity for op in merge_operations]
    # Add entities that weren't merged (singletons)
    merged_ids = set()
    for op in merge_operations:
        for source in op.source_entities:
            merged_ids.add(source.get("id") or source.get("name"))
    for entity in entity_dicts:
        entity_id = entity.get("id") or entity.get("name")
        if entity_id not in merged_ids:
            merged_entities.append(entity)
else:
    merged_entities = entity_dicts

all_entities = merged_entities
print(f"Deduplicated to {len(merged_entities)} unique entities")
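

The threshold-based duplicate detection used in the cell above can be illustrated with a small pairwise sketch built on `difflib` from the standard library. This is not Semantica's batch method; the function `find_duplicate_pairs` and the sample entities are assumptions chosen to show how a `0.85` similarity threshold behaves.

```python
from difflib import SequenceMatcher


def find_duplicate_pairs(entities, threshold=0.85):
    """Return (name_a, name_b, score) for same-type entities with similar names."""
    pairs = []
    for i in range(len(entities)):
        for j in range(i + 1, len(entities)):
            a, b = entities[i], entities[j]
            if a["type"] != b["type"]:
                continue  # Never merge a medication into a diagnosis, for example
            score = SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()
            if score >= threshold:
                pairs.append((a["name"], b["name"], round(score, 2)))
    return pairs


# Hypothetical extracted entities with spelling and casing variants
sample_entities = [
    {"name": "Metformin", "type": "Medication"},
    {"name": "metformin", "type": "Medication"},
    {"name": "Type 2 Diabetes", "type": "Diagnosis"},
    {"name": "Type II Diabetes", "type": "Diagnosis"},
    {"name": "Hypertension", "type": "Diagnosis"},
]

duplicate_pairs = find_duplicate_pairs(sample_entities)
for pair in duplicate_pairs:
    print(pair)
```

Pairwise comparison is quadratic in the number of entities, so it only suits small demos. String similarity also cannot tell that "Hypertension" and "High Blood Pressure" are synonyms, which is where terminology seed data helps.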


---

## Temporal Knowledge Graph Construction

Build a temporal knowledge graph with patient history tracking. This enables querying patient data over time.


In [None]:
from semantica.kg import GraphBuilder
from datetime import datetime
import re

builder = GraphBuilder(enable_temporal=True, temporal_granularity=TEMPORAL_GRANULARITY)

# Add temporal metadata to relationships
ISO_DATE = re.compile(r"\d{4}-\d{2}-\d{2}")
temporal_relationships = []
for rel in all_relationships:
    temporal_rel = rel.copy()
    # Heuristic: use the first valid ISO date (YYYY-MM-DD) found in the relationship,
    # otherwise fall back to the current time
    timestamp = datetime.now().isoformat()
    date_match = ISO_DATE.search(str(rel))
    if date_match:
        try:
            timestamp = datetime.strptime(date_match.group(0), "%Y-%m-%d").isoformat()
        except ValueError:
            pass  # Not a real calendar date; keep the fallback
    temporal_rel["timestamp"] = timestamp
    temporal_relationships.append(temporal_rel)

kg = builder.build(
    entities=all_entities,
    relationships=temporal_relationships
)

print(f"Built temporal KG with {len(kg.get('entities', []))} entities and {len(kg.get('relationships', []))} relationships")


---

## Embedding Generation & Vector Store

Generate embeddings for clinical documents and store them in a vector database for semantic search.
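
The retrieval idea behind a vector store can be sketched without any model: represent each text as a vector, then rank stored vectors by cosine similarity to the query vector. The bag-of-words `embed` function below is a toy stand-in for the sentence-transformer embeddings, and the dictionary `index` is a stand-in for FAISS; both are assumptions for illustration.

```python
import math
from collections import Counter


def embed(text):
    """Toy embedding: word counts (a real pipeline would use a dense model)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def search(query, index, top_k=2):
    """Return ids of the top_k stored vectors with non-zero similarity to the query."""
    query_vec = embed(query)
    scored = sorted(((cosine(query_vec, vec), doc_id) for doc_id, vec in index.items()), reverse=True)
    return [doc_id for score, doc_id in scored[:top_k] if score > 0]


# Hypothetical chunks keyed by id, mirroring the id/metadata layout used below
sample_chunks = {
    "0": "metformin prescribed for type 2 diabetes",
    "1": "lisinopril prescribed for hypertension",
    "2": "ecg shows normal sinus rhythm",
}
index = {doc_id: embed(text) for doc_id, text in sample_chunks.items()}

hits = search("diabetes medication metformin", index)
print(hits)
```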


In [None]:
from semantica.embeddings import EmbeddingGenerator
from semantica.vector_store import VectorStore
from contextlib import redirect_stderr
from io import StringIO

embedding_gen = EmbeddingGenerator(
    model_name=EMBEDDING_MODEL,
    dimension=EMBEDDING_DIMENSION
)

# Generate embeddings for chunks, keeping each embedding paired with its chunk
# so that a failed chunk cannot shift the alignment between text and vectors
embedded_chunks = []
for chunk in chunked_docs[:20]:  # Limit for demo
    try:
        with redirect_stderr(StringIO()):
            embedding = embedding_gen.generate(chunk)
        embedded_chunks.append((chunk, embedding))
    except Exception:
        pass  # Skip chunks that fail to embed

embeddings = [embedding for _, embedding in embedded_chunks]

# Create vector store
vector_store = VectorStore(backend="faiss", dimension=EMBEDDING_DIMENSION)

# Add embeddings to vector store
for i, (chunk, embedding) in enumerate(embedded_chunks):
    try:
        vector_store.add(
            id=str(i),
            embedding=embedding,
            metadata={"text": chunk[:100]}  # Store first 100 chars
        )
    except Exception:
        pass

print(f"Generated {len(embeddings)} embeddings and stored in vector database")


---

## Triplet Store Population

Store patient data in an RDF triplet store for healthcare interoperability. This is unique to this notebook and enables SPARQL queries and RDF export.
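
To show what the stored data looks like as RDF, the sketch below serializes subject/predicate/object dictionaries to N-Triples using only the standard library. The base URI `http://example.org/clinical/` and the helpers `to_uri` and `to_ntriples` are assumptions for illustration; a real deployment would use established vocabularies and typed literals, and Semantica's `TripletStore` may represent triplets differently.

```python
from urllib.parse import quote

# Assumed namespace for minting URIs; replace with a real vocabulary in practice
BASE_URI = "http://example.org/clinical/"


def to_uri(value):
    """Mint a URI reference from a label by replacing spaces and percent-encoding."""
    return f"<{BASE_URI}{quote(value.strip().replace(' ', '_'), safe='')}>"


def to_ntriples(triplets):
    """Serialize triplet dicts as N-Triples, one statement per line."""
    return "\n".join(
        f"{to_uri(t['subject'])} {to_uri(t['predicate'])} {to_uri(t['object'])} ."
        for t in triplets
    )


sample_triplets = [
    {"subject": "P001", "predicate": "has_diagnosis", "object": "Type 2 Diabetes"},
    {"subject": "P001", "predicate": "has_medication", "object": "Metformin"},
]

ntriples = to_ntriples(sample_triplets)
print(ntriples)
```

This sketch models every object as a resource. Values such as lab results or dates would normally be written as RDF literals instead.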


In [None]:
from semantica.triplet_store import TripletStore
from contextlib import redirect_stderr
from io import StringIO

# Initialize triplet store (using in-memory for demo)
try:
    with redirect_stderr(StringIO()):
        triplet_store = TripletStore(backend="memory")  # Use "jena" or "blazegraph" in production
except Exception:
    triplet_store = None

if triplet_store:
    # Convert relationships to triplets
    triplets = []
    for rel in kg.get("relationships", []):
        subject = rel.get("source", "")
        predicate = rel.get("predicate", "")
        obj = rel.get("target", "")
        if subject and predicate and obj:
            triplets.append({
                "subject": subject,
                "predicate": predicate,
                "object": obj
            })
    
    # Store triplets
    try:
        with redirect_stderr(StringIO()):
            triplet_store.store_triplets(triplets)
        print(f"Stored {len(triplets)} triplets in triplet store")
    except Exception as exc:
        print(f"Failed to store triplets: {exc}")
else:
    print("Triplet store could not be initialized; skipping triplet storage")


---

## Analyzing Patient Network Structure

Analyze the patient knowledge graph to identify key patients, diagnoses, and treatment patterns.
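
Degree centrality, used below to find highly connected nodes, is the number of neighbours of a node divided by the maximum possible number, `n - 1`. The following standard-library sketch computes it for a small undirected graph. The edge list is hypothetical and the function is an illustration of the metric, not the `CentralityCalculator` implementation.

```python
from collections import defaultdict


def degree_centrality(edges):
    """Degree centrality for an undirected graph given as (source, target) pairs."""
    neighbors = defaultdict(set)
    for source, target in edges:
        neighbors[source].add(target)
        neighbors[target].add(source)
    n = len(neighbors)
    if n < 2:
        return {node: 0.0 for node in neighbors}
    return {node: len(adjacent) / (n - 1) for node, adjacent in neighbors.items()}


# Hypothetical patient graph based on the sample clinical report
sample_edges = [
    ("P001", "Type 2 Diabetes"),
    ("P001", "Metformin"),
    ("P001", "HbA1c"),
    ("P002", "Hypertension"),
    ("P002", "Lisinopril"),
]

centrality = degree_centrality(sample_edges)
most_connected = max(centrality, key=centrality.get)
print(most_connected, centrality[most_connected])
```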


In [None]:
from semantica.kg import GraphAnalyzer, CentralityCalculator
from contextlib import redirect_stderr
from io import StringIO

graph_analyzer = GraphAnalyzer(kg)
centrality_calc = CentralityCalculator(kg)

try:
    with redirect_stderr(StringIO()):
        # Calculate centrality metrics
        degree_centrality = centrality_calc.degree_centrality()
        betweenness_centrality = centrality_calc.betweenness_centrality()
        
        # Find the most connected nodes (high degree centrality); these may be
        # patients, diagnoses, or any other entity type
        if degree_centrality:
            top_nodes = sorted(degree_centrality.items(), key=lambda x: x[1], reverse=True)[:5]
            print(f"Top 5 nodes by degree centrality: {[n[0] for n in top_nodes]}")
        
        # Analyze graph structure
        stats = graph_analyzer.get_statistics()
        print(f"Graph statistics: {stats.get('num_nodes', 0)} nodes, {stats.get('num_edges', 0)} edges")
except Exception as exc:
    print(f"Graph analysis failed: {exc}")


---

## Temporal Graph Queries

Query the temporal knowledge graph to track patient history and temporal patterns in clinical data.
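
Conceptually, a patient history query filters timestamped relationships by patient and time range and returns them in chronological order. The sketch below shows that logic on a plain list of events. The event list reuses dates from the sample report, and `patient_history` is a hypothetical helper, not the `TemporalGraphQuery` API.

```python
from datetime import date

# Hypothetical timestamped relationships taken from the sample clinical report
events = [
    {"source": "P001", "predicate": "has_lab_result", "target": "HbA1c", "timestamp": date(2024, 1, 15)},
    {"source": "P001", "predicate": "has_diagnosis", "target": "Type 2 Diabetes", "timestamp": date(2024, 1, 10)},
    {"source": "P001", "predicate": "has_medication", "target": "Metformin", "timestamp": date(2024, 1, 10)},
    {"source": "P002", "predicate": "has_diagnosis", "target": "Hypertension", "timestamp": date(2024, 1, 12)},
]


def patient_history(events, patient_id, start=None, end=None):
    """Return a patient's events within [start, end], oldest first; None means unbounded."""
    selected = [
        event for event in events
        if event["source"] == patient_id
        and (start is None or event["timestamp"] >= start)
        and (end is None or event["timestamp"] <= end)
    ]
    return sorted(selected, key=lambda event: event["timestamp"])


full_history = patient_history(events, "P001")
recent_history = patient_history(events, "P001", start=date(2024, 1, 11))
for event in full_history:
    print(event["timestamp"], event["predicate"], event["target"])
```

Passing `None` for both bounds returns the full history, which mirrors the `time_range=(None, None)` call in the cell below.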


In [None]:
from semantica.kg import TemporalGraphQuery
from contextlib import redirect_stderr
from io import StringIO

temporal_query = TemporalGraphQuery(kg)

try:
    with redirect_stderr(StringIO()):
        # Query patient history
        if all_entities:
            patient_entities = [e for e in all_entities if e.get("type") == "Patient"]
            if patient_entities:
                patient_id = patient_entities[0].get("name", "")
                if patient_id:
                    history = temporal_query.query_temporal_paths(
                        source=patient_id,
                        time_range=(None, None)
                    )
                    print(f"Retrieved temporal history for patient: {patient_id}")
        
        # Query evolution of diagnoses over time
        evolution = temporal_query.query_evolution(
            entity_type="Diagnosis",
            time_granularity=TEMPORAL_GRANULARITY
        )
        print("Analyzed diagnosis evolution over time")
except Exception as exc:
    print(f"Temporal queries failed: {exc}")


---

## GraphRAG Queries

Use hybrid retrieval combining vector search and graph traversal to answer complex clinical questions.
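
One way to picture hybrid retrieval: vector search proposes seed entities that match the question, and graph traversal then pulls in entities connected to those seeds. The sketch below covers only the graph expansion step. The seed list stands in for a vector search result, and `expand_with_graph` is a hypothetical helper rather than part of `AgentContext`.

```python
def expand_with_graph(seed_entities, relationships, hops=1):
    """Collect entities reachable from the seeds within the given number of hops."""
    found = set(seed_entities)
    frontier = set(seed_entities)
    for _ in range(hops):
        next_frontier = set()
        for rel in relationships:
            # Follow edges in both directions
            if rel["source"] in frontier and rel["target"] not in found:
                next_frontier.add(rel["target"])
            if rel["target"] in frontier and rel["source"] not in found:
                next_frontier.add(rel["source"])
        found |= next_frontier
        frontier = next_frontier
    return found


# Hypothetical relationships from the sample clinical report
sample_relationships = [
    {"source": "P001", "predicate": "has_diagnosis", "target": "Type 2 Diabetes"},
    {"source": "P001", "predicate": "has_medication", "target": "Metformin"},
    {"source": "P002", "predicate": "has_diagnosis", "target": "Hypertension"},
]

# Assume vector search matched the question "What patients have Type 2 Diabetes?"
# to the entity "Type 2 Diabetes"
seeds = ["Type 2 Diabetes"]
one_hop = expand_with_graph(seeds, sample_relationships, hops=1)
two_hops = expand_with_graph(seeds, sample_relationships, hops=2)
print(sorted(one_hop))
print(sorted(two_hops))
```

One hop reaches the patient with the diagnosis. A second hop adds that patient's medication, context that a pure vector search over text chunks might miss.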


In [None]:
from semantica.context import AgentContext
from contextlib import redirect_stderr
from io import StringIO

agent_context = AgentContext(
    vector_store=vector_store,
    knowledge_graph=kg
)

queries = [
    "What patients have Type 2 Diabetes?",
    "What treatments are associated with hypertension?",
    "What lab results are linked to diabetes patients?"
]

for query in queries:
    try:
        with redirect_stderr(StringIO()):
            results = agent_context.query(
                query=query,
                top_k=5
            )
            print(f"Query: {query}")
            print(f"Found {len(results.get('results', []))} relevant results")
    except Exception as exc:
        print(f"Query failed ({query}): {exc}")


---

## Visualization

Visualize the patient knowledge graph to explore relationships and patterns.


In [None]:
from semantica.visualization import KGVisualizer
from contextlib import redirect_stderr
from io import StringIO

visualizer = KGVisualizer()

try:
    with redirect_stderr(StringIO()):
        visualizer.visualize(
            kg,
            output_path="patient_kg.html",
            layout="force_directed"
        )
        print("Knowledge graph visualization saved to patient_kg.html")
except Exception as exc:
    print(f"Visualization failed: {exc}")


---

## Export

Export the knowledge graph in multiple formats including JSON, GraphML, and RDF/TTL for healthcare interoperability.


In [None]:
from semantica.export import GraphExporter
from contextlib import redirect_stderr
from io import StringIO

exporter = GraphExporter()

try:
    with redirect_stderr(StringIO()):
        # Export as JSON
        exporter.export(kg, format="json", output_path="patient_kg.json")
        
        # Export as GraphML
        exporter.export(kg, format="graphml", output_path="patient_kg.graphml")
        
        # Export as RDF/TTL (for healthcare interoperability)
        exporter.export(kg, format="rdf", output_path="patient_kg.ttl")
        
        print("Exported knowledge graph in JSON, GraphML, and RDF/TTL formats")
except Exception as exc:
    print(f"Export failed: {exc}")
