[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Hawksight-AI/semantica/blob/main/cookbook/use_cases/biomedical/02_Genomic_Variant_Analysis.ipynb)

# Genomic Variant Analysis Pipeline

## Overview

This notebook demonstrates a complete genomic variant analysis pipeline: ingest genomic data from multiple sources (a local file, APIs, a database, and feeds), extract variant entities, build a genomic knowledge graph, analyze disease associations, predict variant impact, and perform pathway analysis.


**Documentation**: [API Reference](https://semantica.readthedocs.io/use-cases/)

### Modules Used (20+)

- **Ingestion**: FileIngestor, WebIngestor, FeedIngestor, DBIngestor
- **Parsing**: JSONParser, StructuredDataParser, DocumentParser
- **Extraction**: NERExtractor, RelationExtractor, TripletExtractor, SemanticAnalyzer
- **KG**: GraphBuilder, GraphAnalyzer, CentralityCalculator, CommunityDetector
- **Analytics**: ConnectivityAnalyzer, TemporalGraphQuery, TemporalPatternDetector
- **Ontology**: OntologyGenerator, ClassInferrer, PropertyGenerator, OntologyValidator
- **Reasoning**: InferenceEngine, RuleManager, ExplanationGenerator
- **Quality**: KGQualityAssessor, ConflictDetector
- **Export**: JSONExporter, RDFExporter, OWLExporter, ReportGenerator
- **Visualization**: KGVisualizer, OntologyVisualizer, AnalyticsVisualizer

### Pipeline

**Genomic Data Sources → Parse → Extract Entities (variants, genes, diseases, pathways) → Build Genomic KG → Analyze Associations → Predict Impact → Pathway Analysis → Generate Reports → Visualize**

## Installation

Install Semantica from PyPI:

```bash
pip install semantica
# Or with all optional dependencies:
pip install "semantica[all]"
```

---

## Step 1: Ingest Genomic Data from Multiple Sources

Ingest genomic variant data from a local JSON file, a public API, research feeds, and a database.
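
Before records from different sources are merged, it can help to check that each one has the fields the later steps rely on. The following is a minimal sketch of such a check in plain Python. `validate_variant`, `REQUIRED_FIELDS`, and the allowed chromosome set are hypothetical names chosen for illustration and are not part of Semantica.

```python
# Hypothetical helper for illustration; not part of the Semantica API.
REQUIRED_FIELDS = ("variant_id", "gene_symbol", "chromosome",
                   "position", "ref_allele", "alt_allele")
VALID_CHROMOSOMES = {str(n) for n in range(1, 23)} | {"X", "Y", "MT"}
VALID_BASES = set("ACGT")


def validate_variant(record):
    """Return a list of problems found in one variant record (empty if none)."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS
                if name not in record]
    if problems:
        return problems  # the remaining checks need every field to be present

    if str(record["chromosome"]) not in VALID_CHROMOSOMES:
        problems.append(f"unknown chromosome: {record['chromosome']}")

    position = record["position"]
    if not isinstance(position, int) or position <= 0:
        problems.append(f"position must be a positive integer: {position}")

    for key in ("ref_allele", "alt_allele"):
        allele = record[key]
        if not isinstance(allele, str) or not allele or not set(allele) <= VALID_BASES:
            problems.append(f"{key} must contain only A, C, G, T: {allele!r}")
    return problems


good = {"variant_id": "rs699", "gene_symbol": "AGT", "chromosome": "1",
        "position": 230710048, "ref_allele": "A", "alt_allele": "G"}
bad = dict(good, chromosome="23", alt_allele="N")

print(validate_variant(good))  # no problems expected
for problem in validate_variant(bad):
    print(problem)
```

Returning a list of problems rather than raising on the first one makes it easier to report every issue in a batch of records at once.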


In [None]:
!pip install semantica


In [None]:
from semantica.ingest import WebIngestor, DBIngestor, FeedIngestor, FileIngestor
from semantica.parse import JSONParser, StructuredDataParser, DocumentParser
from semantica.semantic_extract import NERExtractor, RelationExtractor, TripletExtractor, SemanticAnalyzer
from semantica.kg import GraphBuilder, GraphAnalyzer, CentralityCalculator, CommunityDetector
from semantica.kg import ConnectivityAnalyzer, TemporalGraphQuery, TemporalPatternDetector
from semantica.ontology import OntologyGenerator, ClassInferrer, PropertyGenerator, OntologyValidator
from semantica.reasoning import InferenceEngine, RuleManager, ExplanationGenerator
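from semantica.conflicts import ConflictDetector
# KGQualityAssessor is used in Step 6 but was never imported. The module path
# below is an assumption; adjust it to match your installed Semantica version.
from semantica.kg_qa import KGQualityAssessor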
from semantica.export import JSONExporter, RDFExporter, OWLExporter, ReportGenerator
from semantica.visualization import KGVisualizer, OntologyVisualizer, AnalyticsVisualizer
import tempfile
import os
import json
from datetime import datetime, timedelta

web_ingestor = WebIngestor()
db_ingestor = DBIngestor()
feed_ingestor = FeedIngestor()
file_ingestor = FileIngestor()

json_parser = JSONParser()
structured_parser = StructuredDataParser()
document_parser = DocumentParser()

# Real genomic APIs
genomic_apis = [
    "https://rest.ensembl.org/variation/human/rs699",  # Ensembl API
    "https://api.ncbi.nlm.nih.gov/variation/v0/variant/NC_000001.10:g.230710048A%3EG",  # NCBI Variation API
    "https://api.ncbi.nlm.nih.gov/variation/v0/beta/refsnp/699"  # NCBI Variation Services RefSNP (dbSNP) endpoint
]

# Real genomic databases
genomic_databases = [
    "dbSNP",
    "ClinVar",
    "COSMIC"
]

# Real research feeds
genomic_feeds = [
    "https://pubmed.ncbi.nlm.nih.gov/rss/search?term=genomic+variants",
    "https://pubmed.ncbi.nlm.nih.gov/rss/search?term=genetic+variation"
]

# Placeholder database connection for variant annotations (replace with your own host and credentials)
db_connection_string = "postgresql://user:password@localhost:5432/genomic_db"
db_query = "SELECT variant_id, gene_symbol, disease_name, clinical_significance, chromosome, position FROM variants WHERE clinical_significance IN ('Pathogenic', 'Likely Pathogenic') LIMIT 1000"

temp_dir = tempfile.mkdtemp()

# Sample genomic variant data for local ingestion (illustrative values; check them against dbSNP/ClinVar before reuse)
genomic_data_file = os.path.join(temp_dir, "genomic_variants.json")
genomic_data = [
    {
        "variant_id": "rs699",
        "gene_symbol": "AGT",
        "gene_name": "Angiotensinogen",
        "disease_name": "Hypertension",
        "clinical_significance": "Pathogenic",
        "chromosome": "1",
        "position": 230710048,
        "ref_allele": "A",
        "alt_allele": "G",
        "pathway": "Renin-angiotensin system",
        "impact": "High",
        "timestamp": (datetime.now() - timedelta(days=1)).isoformat()
    },
    {
        "variant_id": "rs7412",
        "gene_symbol": "APOE",
        "gene_name": "Apolipoprotein E",
        "disease_name": "Alzheimer's Disease",
        "clinical_significance": "Pathogenic",
        "chromosome": "19",
        "position": 44908822,
        "ref_allele": "C",
        "alt_allele": "T",
        "pathway": "Lipid metabolism",
        "impact": "High",
        "timestamp": (datetime.now() - timedelta(days=2)).isoformat()
    },
    {
        "variant_id": "rs1800566",
        "gene_symbol": "NAT2",
        "gene_name": "N-acetyltransferase 2",
        "disease_name": "Drug Metabolism",
        "clinical_significance": "Likely Pathogenic",
        "chromosome": "8",
        "position": 18248728,
        "ref_allele": "G",
        "alt_allele": "A",
        "pathway": "Drug metabolism",
        "impact": "Moderate",
        "timestamp": (datetime.now() - timedelta(days=3)).isoformat()
    },
    {
        "variant_id": "rs1799853",
        "gene_symbol": "CYP2C9",
        "gene_name": "Cytochrome P450 2C9",
        "disease_name": "Warfarin Sensitivity",
        "clinical_significance": "Pathogenic",
        "chromosome": "10",
        "position": 96741054,
        "ref_allele": "C",
        "alt_allele": "T",
        "pathway": "Drug metabolism",
        "impact": "High",
        "timestamp": (datetime.now() - timedelta(days=4)).isoformat()
    },
    {
        "variant_id": "rs1057910",
        "gene_symbol": "CYP2C9",
        "gene_name": "Cytochrome P450 2C9",
        "disease_name": "Warfarin Sensitivity",
        "clinical_significance": "Pathogenic",
        "chromosome": "10",
        "position": 96741055,
        "ref_allele": "A",
        "alt_allele": "C",
        "pathway": "Drug metabolism",
        "impact": "High",
        "timestamp": (datetime.now() - timedelta(days=5)).isoformat()
    }
]

with open(genomic_data_file, 'w') as f:
    json.dump(genomic_data, f, indent=2)

# Ingest from local file
file_data = file_ingestor.ingest_file(genomic_data_file)
parsed_genomic = structured_parser.parse_json(json.dumps(genomic_data))

# Ingest from genomic APIs (example with public API)
web_content = web_ingestor.ingest_url(genomic_apis[0])  # Ensembl API
if web_content:
    print(f"  Ingested genomic API: {genomic_apis[0]}")

# Ingest from genomic feeds
feed_data_list = []
for feed_url in genomic_feeds:
    feed_data = feed_ingestor.ingest_feed(feed_url)
    if feed_data:
        feed_data_list.append(feed_data)
        print(f"  Ingested feed: {feed_url}")

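# Database ingestion pattern: this needs a reachable PostgreSQL instance,
# so fall back to None and let the rest of the notebook run without one
try:
    db_data = db_ingestor.export_table(
        connection_string=db_connection_string,
        table_name="variants",
        limit=1000
    )
except Exception as exc:
    db_data = None
    print(f"  Database ingestion skipped: {exc}")
print(f"  Query pattern: {db_query}")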

print(f"\n📊 Ingestion Summary:")
print(f"  Local variants: {len(genomic_data)}")
print(f"  Database records: {len(db_data.get('data', [])) if db_data else 0}")
print(f"  Feeds ingested: {len(feed_data_list)}")
print(f"  Web APIs ingested: {1 if web_content else 0} of {len(genomic_apis)}")


## Step 2: Extract Genomic Entities

Extract variants, genes, diseases, and pathways from the ingested data.
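
Variant identifiers such as dbSNP rsIDs follow a fixed pattern, so they can also be pulled out of free text with a regular expression, independently of a named-entity model. The sketch below is a plain-Python illustration. `extract_rsids` and `RSID_PATTERN` are hypothetical names, not Semantica functions.

```python
import re

# Lowercase "rs" followed by digits, bounded so that words like "hers12" do not match.
RSID_PATTERN = re.compile(r"\brs\d+\b")


def extract_rsids(text):
    """Return rsIDs in order of first appearance, without duplicates."""
    seen = {}
    for match in RSID_PATTERN.findall(text):
        seen.setdefault(match, None)  # dicts preserve insertion order
    return list(seen)


sentence = "Variant rs699 in gene AGT; rs7412 and rs699 were also reported."
print(extract_rsids(sentence))  # ['rs699', 'rs7412']
```

A pattern like this only finds identifiers. Gene, disease, and pathway names have no such fixed shape, which is why the cell below builds them from the structured fields of each record.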


In [None]:
ner_extractor = NERExtractor()
relation_extractor = RelationExtractor()
triplet_extractor = TripletExtractor()
semantic_analyzer = SemanticAnalyzer()

all_genomic_texts = []
all_variants = []

# Process parsed genomic data, falling back to the in-memory records
# when the parser does not return a dict with a "data" key
if isinstance(parsed_genomic, dict):
    variants = parsed_genomic.get("data", genomic_data)
else:
    variants = genomic_data

for variant in variants:
    all_variants.append(variant)
    variant_text = f"Variant {variant.get('variant_id', '')} in gene {variant.get('gene_symbol', '')} associated with {variant.get('disease_name', '')} in pathway {variant.get('pathway', '')}"
    all_genomic_texts.append(variant_text)

# Extract entities
all_entities = []
all_relationships = []
all_triplets = []

for text in all_genomic_texts:
    entities = ner_extractor.extract(text)
    all_entities.extend(entities)
    
    relationships = relation_extractor.extract(text, entities)
    all_relationships.extend(relationships)
    
    triplets = triplet_extractor.extract(text)
    all_triplets.extend(triplets)

# Build structured entity list
variant_entities = []
gene_entities = []
disease_entities = []
pathway_entities = []

unique_genes = {}
unique_diseases = {}
unique_pathways = {}

for variant in all_variants:
    variant_entity = {
        "id": variant.get("variant_id", ""),
        "type": "Variant",
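        "properties": {
            "variant_id": variant.get("variant_id", ""),
            "chromosome": variant.get("chromosome", ""),
            "position": variant.get("position", 0),
            "ref_allele": variant.get("ref_allele", ""),
            "alt_allele": variant.get("alt_allele", ""),
            "clinical_significance": variant.get("clinical_significance", ""),
            "impact": variant.get("impact", ""),
            "timestamp": variant.get("timestamp", ""),
            # Keep the gene, disease, and pathway names so Step 3 can link this variant to them
            "gene_symbol": variant.get("gene_symbol", ""),
            "disease_name": variant.get("disease_name", ""),
            "pathway": variant.get("pathway", "")
        }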
    }
    variant_entities.append(variant_entity)
    
    # Add gene entity
    gene_symbol = variant.get("gene_symbol", "")
    if gene_symbol and gene_symbol not in unique_genes:
        gene_entity = {
            "id": gene_symbol,
            "type": "Gene",
            "properties": {
                "symbol": gene_symbol,
                "name": variant.get("gene_name", ""),
                "chromosome": variant.get("chromosome", "")
            }
        }
        gene_entities.append(gene_entity)
        unique_genes[gene_symbol] = gene_entity
    
    # Add disease entity
    disease_name = variant.get("disease_name", "")
    if disease_name and disease_name not in unique_diseases:
        disease_entity = {
            "id": disease_name.replace(" ", "_"),
            "type": "Disease",
            "properties": {
                "name": disease_name
            }
        }
        disease_entities.append(disease_entity)
        unique_diseases[disease_name] = disease_entity
    
    # Add pathway entity
    pathway_name = variant.get("pathway", "")
    if pathway_name and pathway_name not in unique_pathways:
        pathway_entity = {
            "id": pathway_name.replace(" ", "_"),
            "type": "Pathway",
            "properties": {
                "name": pathway_name
            }
        }
        pathway_entities.append(pathway_entity)
        unique_pathways[pathway_name] = pathway_entity

print(f"Extracted {len(variant_entities)} variants")
print(f"Extracted {len(gene_entities)} unique genes")
print(f"Extracted {len(disease_entities)} unique diseases")
print(f"Extracted {len(pathway_entities)} unique pathways")
print(f"Extracted {len(all_relationships)} relationships")
print(f"Extracted {len(all_triplets)} triplets")


## Step 3: Build Genomic Knowledge Graph

Build a knowledge graph from extracted genomic entities and relationships.
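
To make the node and edge bookkeeping concrete, here is a minimal in-memory sketch of the same idea. `SimpleGraph` is a hypothetical stand-in written for illustration; it is not how Semantica's `GraphBuilder` is implemented.

```python
class SimpleGraph:
    """Tiny directed graph: nodes keyed by id, edges stored as a set of triples."""

    def __init__(self):
        self.nodes = {}
        self.edges = set()

    def add_entity(self, entity_id, entity_type, properties=None):
        # Re-adding an id overwrites the node, so each entity appears once.
        self.nodes[entity_id] = {"type": entity_type, "properties": properties or {}}

    def add_relationship(self, source_id, target_id, relationship_type):
        for node_id in (source_id, target_id):
            if node_id not in self.nodes:
                raise KeyError(f"unknown node: {node_id}")
        # A set ignores repeated (source, type, target) triples.
        self.edges.add((source_id, relationship_type, target_id))

    def neighbors(self, node_id):
        return sorted(target for source, _, target in self.edges if source == node_id)


graph = SimpleGraph()
graph.add_entity("rs1799853", "Variant")
graph.add_entity("rs1057910", "Variant")
graph.add_entity("CYP2C9", "Gene")
graph.add_entity("Drug_metabolism", "Pathway")

for variant_id in ("rs1799853", "rs1057910"):
    graph.add_relationship(variant_id, "CYP2C9", "located_in")
    # Both variants share one gene, so this edge is requested twice but stored once.
    graph.add_relationship("CYP2C9", "Drug_metabolism", "participates_in")

print(len(graph.nodes), len(graph.edges))  # 4 3
print(graph.neighbors("CYP2C9"))           # ['Drug_metabolism']
```

Storing edges as a set is one simple way to avoid duplicate gene-pathway edges when several variants map to the same gene.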


In [None]:
builder = GraphBuilder()

# Add all entities
for variant in variant_entities:
    builder.add_entity(
        entity_id=variant["id"],
        entity_type=variant["type"],
        properties=variant.get("properties", {})
    )

for gene in gene_entities:
    builder.add_entity(
        entity_id=gene["id"],
        entity_type=gene["type"],
        properties=gene.get("properties", {})
    )

for disease in disease_entities:
    builder.add_entity(
        entity_id=disease["id"],
        entity_type=disease["type"],
        properties=disease.get("properties", {})
    )

for pathway in pathway_entities:
    builder.add_entity(
        entity_id=pathway["id"],
        entity_type=pathway["type"],
        properties=pathway.get("properties", {})
    )

# Add relationships
relationships = []
for variant in variant_entities:
    variant_id = variant["id"]
    gene_symbol = variant["properties"].get("gene_symbol", "")
    disease_name = variant["properties"].get("disease_name", "")
    pathway_name = variant["properties"].get("pathway", "")
    
    # Find corresponding entities (the lookup dicts are keyed by the original names)
    gene_entity = unique_genes.get(gene_symbol)
    disease_entity = unique_diseases.get(disease_name)
    pathway_entity = unique_pathways.get(pathway_name)
    
    if gene_entity:
        builder.add_relationship(
            source_id=variant_id,
            target_id=gene_entity["id"],
            relationship_type="located_in",
            properties={}
        )
        relationships.append({
            "source": variant_id,
            "target": gene_entity["id"],
            "type": "located_in"
        })
    
    if disease_entity:
        builder.add_relationship(
            source_id=variant_id,
            target_id=disease_entity["id"],
            relationship_type="associated_with",
            properties={
                "clinical_significance": variant["properties"].get("clinical_significance", "")
            }
        )
        relationships.append({
            "source": variant_id,
            "target": disease_entity["id"],
            "type": "associated_with"
        })
    
    if pathway_entity:
        pathway_edge = {
            "source": gene_entity["id"] if gene_entity else variant_id,
            "target": pathway_entity["id"],
            "type": "participates_in"
        }
        # Variants that share a gene would otherwise add the same gene-pathway edge twice
        if pathway_edge not in relationships:
            builder.add_relationship(
                source_id=pathway_edge["source"],
                target_id=pathway_edge["target"],
                relationship_type="participates_in",
                properties={}
            )
            relationships.append(pathway_edge)

knowledge_graph = builder.build()

print(f"Built knowledge graph with {len(knowledge_graph.nodes)} nodes")
print(f"Built knowledge graph with {len(knowledge_graph.edges)} edges")
print(f"Added {len(relationships)} genomic relationships")


## Step 4: Analyze Associations and Predict Impact

Analyze variant-disease associations and predict variant impact on protein function.
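
The impact prediction in this step is an additive score: points for clinical significance, points for the annotated impact, and a bonus when the variant has at least one disease association. The sketch below isolates that arithmetic as a standalone function using the same point values and thresholds as the cell that follows. The names `score_variant`, `SIGNIFICANCE_POINTS`, and `IMPACT_POINTS` are introduced here for illustration only.

```python
SIGNIFICANCE_POINTS = {"Pathogenic": 5, "Likely Pathogenic": 3}
IMPACT_POINTS = {"High": 3, "Moderate": 2}
DISEASE_BONUS = 2


def score_variant(clinical_significance, impact, disease_count):
    """Return (score, label) for one variant using the additive heuristic."""
    score = SIGNIFICANCE_POINTS.get(clinical_significance, 0)
    score += IMPACT_POINTS.get(impact, 0)
    if disease_count > 0:
        score += DISEASE_BONUS
    score = min(score, 10)  # cap at 10

    if score >= 7:
        label = "High"
    elif score >= 4:
        label = "Moderate"
    else:
        label = "Low"
    return score, label


print(score_variant("Pathogenic", "High", 1))             # (10, 'High')
print(score_variant("Likely Pathogenic", "Moderate", 1))  # (7, 'High')
print(score_variant("Likely Pathogenic", "Moderate", 0))  # (5, 'Moderate')
print(score_variant("Benign", "Low", 0))                  # (0, 'Low')
```

Note that a Likely Pathogenic, Moderate-impact variant with a disease association scores 3 + 2 + 2 = 7 and is therefore labeled High under these thresholds.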


In [None]:
graph_analyzer = GraphAnalyzer()
centrality_calculator = CentralityCalculator()
community_detector = CommunityDetector()
connectivity_analyzer = ConnectivityAnalyzer()
temporal_query = TemporalGraphQuery(knowledge_graph)
pattern_detector = TemporalPatternDetector()

# Compute graph metrics
graph_metrics = graph_analyzer.compute_metrics(knowledge_graph)

# Calculate centrality
centrality_result = centrality_calculator.calculate_betweenness_centrality(knowledge_graph)
centrality_scores = centrality_result.get('centrality', {})
top_central_genes = sorted(centrality_scores.items(), key=lambda x: x[1], reverse=True)[:10]

# Detect communities
communities_result = community_detector.detect_communities(knowledge_graph, algorithm="louvain")
communities = communities_result.get('communities', [])
community_count = len(communities) if communities else 0

# Analyze connectivity
connectivity_results = connectivity_analyzer.analyze_connectivity(knowledge_graph)

# Detect temporal patterns
start_time = (datetime.now() - timedelta(days=7)).isoformat()
end_time = datetime.now().isoformat()
temporal_results = temporal_query.query_time_range(
    start_time=start_time,
    end_time=end_time,
    relationship_types=["associated_with", "located_in"]
)

temporal_patterns = pattern_detector.detect_temporal_patterns(
    knowledge_graph,
    relationship_types=["associated_with"],
    time_window_hours=168
)

# Register impact prediction rules (the inference engine is instantiated but not executed in this cell)
inference_engine = InferenceEngine()
rule_manager = RuleManager()

# Define impact prediction rules
impact_rules = [
    {
        "name": "high_impact_pathogenic",
        "condition": "clinical_significance == 'Pathogenic' AND impact == 'High'",
        "action": "predict_high_impact"
    },
    {
        "name": "moderate_impact_likely_pathogenic",
        "condition": "clinical_significance == 'Likely Pathogenic' AND impact == 'Moderate'",
        "action": "predict_moderate_impact"
    },
    {
        "name": "disease_association",
        "condition": "associated_with_disease AND pathogenic",
        "action": "predict_disease_risk"
    }
]

for rule in impact_rules:
    rule_manager.add_rule(rule["name"], rule["condition"], rule["action"])

# Predict variant impact with an additive heuristic: clinical significance + impact level + disease association
variant_impacts = []
for variant in variant_entities:
    clinical_sig = variant["properties"].get("clinical_significance", "")
    impact = variant["properties"].get("impact", "")
    
    impact_score = 0
    if clinical_sig == "Pathogenic":
        impact_score += 5
    elif clinical_sig == "Likely Pathogenic":
        impact_score += 3
    
    if impact == "High":
        impact_score += 3
    elif impact == "Moderate":
        impact_score += 2
    
    # Check disease associations
    disease_associations = [r for r in relationships if r["source"] == variant["id"] and r["type"] == "associated_with"]
    if disease_associations:
        impact_score += 2
    
    variant_impacts.append({
        "variant": variant["id"],
        "impact_score": min(impact_score, 10),
        "predicted_impact": "High" if impact_score >= 7 else "Moderate" if impact_score >= 4 else "Low",
        "clinical_significance": clinical_sig,
        "disease_associations": len(disease_associations)
    })

print(f"Analyzed {len(variant_entities)} variants")
print(f"Found {community_count} graph communities")
print(f"Detected {len(temporal_patterns)} temporal patterns")
# Betweenness centrality was computed over every node, not only genes
print(f"\nTop 5 Central Nodes:")
for i, (node_id, centrality) in enumerate(top_central_genes[:5], 1):
    print(f"  {i}. {node_id} (centrality: {centrality:.3f})")
print(f"\nVariant Impact Predictions:")
for impact in sorted(variant_impacts, key=lambda x: x["impact_score"], reverse=True):
    print(f"  - {impact['variant']}: {impact['predicted_impact']} Impact (Score: {impact['impact_score']}/10, Diseases: {impact['disease_associations']})")


## Step 5: Generate Genomic Ontology and Pathway Analysis

Generate genomic ontology and perform pathway analysis.
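
Pathway analysis here amounts to grouping records by pathway and counting the distinct variants, genes, and diseases in each group. A minimal sketch of that grouping over flat records follows. `summarize_pathways` is a hypothetical helper, and the three records are a subset of the sample data from Step 1.

```python
from collections import defaultdict


def summarize_pathways(records):
    """Count distinct variants, genes, and diseases per pathway."""
    groups = defaultdict(lambda: {"variants": set(), "genes": set(), "diseases": set()})
    for record in records:
        bucket = groups[record["pathway"]]
        bucket["variants"].add(record["variant_id"])
        bucket["genes"].add(record["gene_symbol"])
        bucket["diseases"].add(record["disease_name"])
    # Replace each set with its size for reporting
    return {
        pathway: {key: len(values) for key, values in bucket.items()}
        for pathway, bucket in groups.items()
    }


records = [
    {"variant_id": "rs1799853", "gene_symbol": "CYP2C9",
     "disease_name": "Warfarin Sensitivity", "pathway": "Drug metabolism"},
    {"variant_id": "rs1057910", "gene_symbol": "CYP2C9",
     "disease_name": "Warfarin Sensitivity", "pathway": "Drug metabolism"},
    {"variant_id": "rs699", "gene_symbol": "AGT",
     "disease_name": "Hypertension", "pathway": "Renin-angiotensin system"},
]

for pathway, counts in summarize_pathways(records).items():
    print(pathway, counts)
```

Using sets means two variants in the same gene count as one gene, and two variants linked to the same disease count as one disease.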


In [None]:
ontology_generator = OntologyGenerator()
class_inferrer = ClassInferrer()
property_generator = PropertyGenerator()
ontology_validator = OntologyValidator()

# Generate genomic ontology
genomic_ontology = ontology_generator.generate_ontology(
    {"entities": variant_entities + gene_entities + disease_entities + pathway_entities, 
     "relationships": relationships},
    name="GenomicsOntology",
    entities=variant_entities + gene_entities + disease_entities + pathway_entities,
    relationships=relationships
)

# Infer classes/properties (for inspection)
classes = class_inferrer.infer_classes(variant_entities + gene_entities + disease_entities + pathway_entities)
properties = property_generator.infer_properties(variant_entities + gene_entities + disease_entities + pathway_entities, relationships, classes)

# Ensure ontology dict has classes/properties
if not genomic_ontology.get("classes"):
    genomic_ontology["classes"] = classes
if not genomic_ontology.get("properties"):
    genomic_ontology["properties"] = properties

# Validate ontology
validation_result = ontology_validator.validate_ontology(genomic_ontology)

# Pathway analysis
pathway_analysis = {}
for pathway in pathway_entities:
    pathway_id = pathway["id"]
    pathway_name = pathway["properties"].get("name", "")
    
    # Sources of participates_in edges: genes, or variants that had no gene
    pathway_sources = {r["source"] for r in relationships
                       if r["target"] == pathway_id and r["type"] == "participates_in"}
    gene_ids = {g["id"] for g in gene_entities}
    pathway_genes = pathway_sources & gene_ids
    
    # Variants located in those genes, plus variants linked to the pathway directly
    pathway_variant_ids = {r["source"] for r in relationships
                           if r["type"] == "located_in" and r["target"] in pathway_genes}
    pathway_variant_ids |= pathway_sources - gene_ids
    
    # Distinct diseases associated with this pathway's variants only
    pathway_diseases = {r["target"] for r in relationships
                        if r["type"] == "associated_with" and r["source"] in pathway_variant_ids}
    
    pathway_analysis[pathway_name] = {
        "variants": len(pathway_variant_ids),
        "genes": len(pathway_genes),
        "diseases": len(pathway_diseases)
    }

print(f"Generated genomic ontology with {len(genomic_ontology.get('classes', []))} classes")
print(f"Ontology validation: {'Valid' if validation_result.valid else 'Invalid'}")
print(f"  Errors: {len(validation_result.errors)}")
print(f"  Warnings: {len(validation_result.warnings)}")
print(f"\nPathway Analysis:")
for pathway_name, analysis in pathway_analysis.items():
    print(f"  - {pathway_name}: {analysis['variants']} variants, {analysis['genes']} genes, {analysis['diseases']} disease associations")


## Step 6: Assess Quality, Export, and Generate Reports

Assess graph quality, detect conflicts, export the knowledge graph and ontology, and write a Markdown analysis report.
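
For readers unfamiliar with what an RDF export contains, the sketch below writes relationship dicts as N-Triples, one `<subject> <predicate> <object> .` statement per line. This is an illustration only and makes no claim about the output of Semantica's `RDFExporter`. `to_ntriples` and the `http://example.org/genomics/` namespace are assumptions made for the example.

```python
from urllib.parse import quote

BASE_IRI = "http://example.org/genomics/"  # hypothetical namespace for illustration


def to_ntriples(relationships, base_iri=BASE_IRI):
    """Serialize relationship dicts as N-Triples, one triple per line."""
    lines = []
    for rel in relationships:
        # Percent-encode each identifier so it is safe inside an IRI
        subject, predicate, obj = (
            f"<{base_iri}{quote(rel[key], safe='')}>"
            for key in ("source", "type", "target")
        )
        lines.append(f"{subject} {predicate} {obj} .")
    return "\n".join(lines)


sample_relationships = [
    {"source": "rs699", "type": "located_in", "target": "AGT"},
    {"source": "AGT", "type": "participates_in", "target": "Renin-angiotensin_system"},
]
print(to_ntriples(sample_relationships))
```

Percent-encoding matters for identifiers such as disease names that contain apostrophes or other characters not allowed in an IRI.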


In [None]:
json_exporter = JSONExporter()
rdf_exporter = RDFExporter()
owl_exporter = OWLExporter()
report_generator = ReportGenerator()
kg_quality_assessor = KGQualityAssessor()
conflict_detector = ConflictDetector()

# Assess graph quality
quality_metrics = kg_quality_assessor.assess_quality(knowledge_graph)

# Detect conflicts
conflicts = conflict_detector.detect_conflicts(knowledge_graph)

# Export knowledge graph
kg_json = json_exporter.export(knowledge_graph, output_path=os.path.join(temp_dir, "genomic_kg.json"))
kg_rdf = rdf_exporter.export(knowledge_graph, output_path=os.path.join(temp_dir, "genomic_kg.rdf"))

# Export ontology
ontology_owl = owl_exporter.export(genomic_ontology, output_path=os.path.join(temp_dir, "genomic_ontology.owl"))

# Generate report
report_content = f"""
# Genomic Variant Analysis Report

## Executive Summary
- Total Variants Analyzed: {len(variant_entities)}
- Unique Genes: {len(gene_entities)}
- Unique Diseases: {len(disease_entities)}
- Unique Pathways: {len(pathway_entities)}
- High Impact Variants: {len([v for v in variant_impacts if v['predicted_impact'] == 'High'])}

## Graph Quality Metrics
- Nodes: {quality_metrics.get('node_count', len(knowledge_graph.nodes))}
- Edges: {quality_metrics.get('edge_count', len(knowledge_graph.edges))}
- Completeness: {quality_metrics.get('completeness', 0):.2%}
- Consistency: {quality_metrics.get('consistency', 0):.2%}

## Top Variants by Impact
"""
for i, impact in enumerate(sorted(variant_impacts, key=lambda x: x["impact_score"], reverse=True)[:10], 1):
    report_content += f"""
{i}. {impact['variant']}
   - Predicted Impact: {impact['predicted_impact']}
   - Impact Score: {impact['impact_score']}/10
   - Clinical Significance: {impact['clinical_significance']}
   - Disease Associations: {impact['disease_associations']}
"""

report_content += f"""
## Pathway Analysis
"""
for pathway_name, analysis in pathway_analysis.items():
    report_content += f"""
### {pathway_name}
- Variants: {analysis['variants']}
- Genes: {analysis['genes']}
- Disease Associations: {analysis['diseases']}
"""

report_path = os.path.join(temp_dir, "genomic_analysis_report.md")
with open(report_path, 'w') as f:
    f.write(report_content)

print(f"Exported knowledge graph to JSON and RDF")
print(f"Exported ontology to OWL")
print(f"Generated analysis report: {report_path}")
print(f"\nQuality Metrics:")
print(f"  Nodes: {quality_metrics.get('node_count', len(knowledge_graph.nodes))}")
print(f"  Edges: {quality_metrics.get('edge_count', len(knowledge_graph.edges))}")
print(f"  Completeness: {quality_metrics.get('completeness', 0):.2%}")
print(f"  Consistency: {quality_metrics.get('consistency', 0):.2%}")


## Step 7: Visualize Genomic Network

Visualize the genomic knowledge graph, ontology, and analytics.
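
If an interactive visualizer is not available, the same graph can be written out as Graphviz DOT text and rendered with any DOT-compatible tool. The sketch below is plain Python. `to_dot` and the shape-per-type mapping are hypothetical choices for illustration, not part of Semantica's visualization module.

```python
# One node shape per entity type; unknown types fall back to an ellipse.
SHAPES = {"Variant": "ellipse", "Gene": "box", "Disease": "diamond", "Pathway": "hexagon"}


def _quoted(value):
    """Wrap a value in double quotes, escaping any quotes inside it."""
    return '"' + str(value).replace('"', '\\"') + '"'


def to_dot(nodes, edges, graph_name="genomic_kg"):
    """Build DOT source from {node_id: node_type} and (source, label, target) triples."""
    lines = [f"digraph {graph_name} {{"]
    for node_id, node_type in nodes.items():
        shape = SHAPES.get(node_type, "ellipse")
        lines.append(f"  {_quoted(node_id)} [shape={shape}];")
    for source, label, target in edges:
        lines.append(f"  {_quoted(source)} -> {_quoted(target)} [label={_quoted(label)}];")
    lines.append("}")
    return "\n".join(lines)


nodes = {"rs699": "Variant", "AGT": "Gene", "Hypertension": "Disease"}
edges = [("rs699", "located_in", "AGT"), ("rs699", "associated_with", "Hypertension")]
dot_source = to_dot(nodes, edges)
print(dot_source)
```

Saving the output to a file such as `genomic_kg.dot` and running `dot -Tpng genomic_kg.dot -o genomic_kg.png` renders it as an image, provided Graphviz is installed.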


In [None]:
kg_visualizer = KGVisualizer()
ontology_visualizer = OntologyVisualizer()
analytics_visualizer = AnalyticsVisualizer()

# Visualize knowledge graph
kg_viz = kg_visualizer.visualize(
    knowledge_graph,
    layout="force_directed",
    highlight_nodes=[v["id"] for v in variant_entities],
    node_size_by="impact"
)

# Visualize ontology
ontology_viz = ontology_visualizer.visualize(
    genomic_ontology,
    layout="hierarchical"
)

# Visualize analytics
analytics_viz = analytics_visualizer.visualize(
    knowledge_graph,
    metrics={
        "centrality": dict(top_central_genes[:10]),
        "communities": communities,
        "connectivity": connectivity_results,
        "impact_scores": {v["variant"]: v["impact_score"] for v in variant_impacts}
    }
)
