# BioDisco: Hypothesis Generation Pipeline

This notebook walks through a hypothesis generation pipeline built on BioDisco's data libraries, using hand-written hypotheses and simulated literature records.

## Overview
- Initialize BioDisco components
- Generate biomedical hypotheses
- Integrate (simulated) literature evidence
- Build knowledge graph representations
- Link evidence to hypotheses

## Prerequisites
- BioDisco package installed
- OpenAI API key configured
- Neo4j database running (optional)

In [None]:
# Install BioDisco if not already installed
# !pip install biodisco

import os
import pandas as pd
from BioDisco import (
    HypothesisLibrary,
    LiteratureLibrary,
    KGLibrary,
    EvidenceLibrary,
    BackgroundLibrary,
    KeywordLibrary
)

# Set up environment (uncomment and set your API key)
# os.environ['OPENAI_API_KEY'] = 'your-openai-api-key-here'

print("BioDisco components imported successfully!")
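
The setup cell above leaves the API key line commented out, so a missing key would only surface later. A minimal sketch of an early check follows; `require_env` and `mask_secret` are hypothetical helpers (not part of BioDisco), and `DEMO_API_KEY` is a stand-in variable used so the example runs anywhere.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or raise a clear error."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"{name} is not set; export it before running the notebook")
    return value


def mask_secret(value: str, visible: int = 4) -> str:
    """Hide all but the last few characters so the key is safe to print."""
    if len(value) <= visible:
        return "*" * len(value)
    return "*" * (len(value) - visible) + value[-visible:]


# Stand-in variable so the example does not depend on a real key.
os.environ["DEMO_API_KEY"] = "sk-demo-1234"
masked_key = mask_secret(require_env("DEMO_API_KEY"))
print(masked_key)  # ********1234
```

In the notebook itself, the same call with `"OPENAI_API_KEY"` would fail fast with a readable message instead of an authentication error deep inside a later cell.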

## 1. Initialize BioDisco Libraries

We'll create instances of all the core BioDisco libraries for managing different types of scientific data.

In [None]:
# Initialize all BioDisco libraries
hypo_lib = HypothesisLibrary()
lit_lib = LiteratureLibrary()
kg_lib = KGLibrary()
evidence_lib = EvidenceLibrary()
background_lib = BackgroundLibrary()
keyword_lib = KeywordLibrary()

print("All BioDisco libraries initialized:")
print("✅ HypothesisLibrary - for storing scientific hypotheses")
print("✅ LiteratureLibrary - for managing PubMed literature")
print("✅ KGLibrary - for knowledge graph nodes and edges")
print("✅ EvidenceLibrary - for linking hypotheses to evidence")
print("✅ BackgroundLibrary - for storing background information")
print("✅ KeywordLibrary - for managing scientific keywords")
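
The rest of the notebook relies on only three calls per library: `add`, `get`, and `all`. The sketch below is a hypothetical in-memory stand-in that mimics that call pattern. It is not BioDisco's implementation, and the ID format (`prefix-N`) and the dictionary returned by `all()` are assumptions made for illustration.

```python
import itertools


class InMemoryLibrary:
    """Hypothetical stand-in that mimics the add/get/all calls used in this notebook."""

    def __init__(self, prefix: str):
        self._prefix = prefix
        self._counter = itertools.count(1)
        self._items = {}  # maps item ID -> stored payload

    def add(self, payload):
        """Store a payload and return a newly minted string ID (format is an assumption)."""
        item_id = f"{self._prefix}-{next(self._counter)}"
        self._items[item_id] = payload
        return item_id

    def get(self, item_id):
        """Return the payload stored under item_id (raises KeyError if unknown)."""
        return self._items[item_id]

    def all(self):
        """Return a copy of the ID-to-payload mapping."""
        return dict(self._items)


demo_lib = InMemoryLibrary("hyp")
first_id = demo_lib.add("Chronic neuroinflammation accelerates amyloid-beta plaque formation")
print(first_id, "->", demo_lib.get(first_id))
print(len(demo_lib.all()), "item stored")
```

A stand-in like this can help when testing notebook logic without the package installed, but the real libraries may return different ID types or container shapes.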

## 2. Define Research Background

Let's start with a research background and a hand-picked list of keywords drawn from it.

In [None]:
# Define research background
research_background = """
Alzheimer's disease (AD) is a progressive neurodegenerative disorder characterized by 
cognitive decline, memory loss, and behavioral changes. The disease is marked by the 
accumulation of amyloid-beta plaques and tau protein tangles in the brain. Recent 
research has highlighted the role of neuroinflammation in AD progression, with 
microglial activation and cytokine release contributing to neuronal damage. 
Additionally, metabolic factors such as insulin resistance and glucose metabolism 
dysregulation have been implicated in AD pathogenesis.
"""

# Store background
background_id = background_lib.add(research_background.strip())
print(f"Background stored with ID: {background_id}")

# Store keywords (hand-picked from the background above; nothing is extracted automatically here)
keywords = [
    "Alzheimer's disease",
    "neurodegeneration", 
    "amyloid-beta",
    "tau protein",
    "neuroinflammation",
    "microglia",
    "insulin resistance",
    "glucose metabolism"
]

keyword_ids = []
for keyword in keywords:
    kid = keyword_lib.add(keyword)
    keyword_ids.append(kid)

print(f"\nStored {len(keyword_ids)} keywords:")
for kw in keywords:
    print(f"  • {kw}")
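
Because the keywords are listed by hand, it can be useful to check that each one actually appears in the background text. The following is a minimal sketch using a case-insensitive substring match; `ungrounded_keywords` is a hypothetical helper, and substring matching is a deliberately simple assumption (it does not handle inflections such as "neurodegenerative" versus "neurodegeneration").

```python
def ungrounded_keywords(background, keywords):
    """Return keywords that do not occur verbatim (case-insensitively) in the background."""
    # Lowercase and collapse whitespace so phrases split across lines still match.
    text = " ".join(background.lower().split())
    return [kw for kw in keywords if kw.lower() not in text]


demo_background = """
Alzheimer's disease (AD) is a progressive neurodegenerative disorder. Recent
research has highlighted the role of neuroinflammation in AD progression, with
microglial activation and cytokine release contributing to neuronal damage.
"""
demo_keywords = ["Alzheimer's disease", "neuroinflammation", "microglia", "neurodegeneration"]

missing = ungrounded_keywords(demo_background, demo_keywords)
print(missing)  # ['neurodegeneration']
```

Here "microglia" matches inside "microglial", while "neurodegeneration" is flagged because the text only contains "neurodegenerative". Whether that counts as a real gap is a judgment call for the author.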

## 3. Generate Research Hypotheses

Based on the background, we'll create several testable hypotheses.

In [None]:
# Hand-written example hypotheses derived from the research background
hypotheses = [
    "Chronic neuroinflammation accelerates amyloid-beta plaque formation in Alzheimer's disease",
    "Insulin resistance disrupts tau protein metabolism leading to neurofibrillary tangle formation",
    "Microglial activation triggers a cascade of inflammatory cytokines that damage neurons",
    "Glucose metabolism dysfunction impairs neuronal energy production in Alzheimer's disease",
    "Anti-inflammatory interventions can slow cognitive decline in early-stage Alzheimer's disease"
]

hypothesis_ids = []
for i, hypothesis in enumerate(hypotheses, 1):
    hid = hypo_lib.add(hypothesis)
    hypothesis_ids.append(hid)
    print(f"Hypothesis {i}: {hypothesis}")
    print(f"  ID: {hid}\n")

print(f"Generated {len(hypothesis_ids)} research hypotheses")
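
To make the connection between hypotheses and the stored keywords explicit, each hypothesis can be tagged with the keywords it mentions. This is a sketch under simple assumptions: `tag_hypotheses` is a hypothetical helper, and matching is a plain case-insensitive substring test rather than anything BioDisco provides.

```python
def tag_hypotheses(hypotheses, keywords):
    """Map each hypothesis to the keywords it mentions (case-insensitive substring match)."""
    tagged = {}
    for hypothesis in hypotheses:
        lowered = hypothesis.lower()
        tagged[hypothesis] = [kw for kw in keywords if kw.lower() in lowered]
    return tagged


demo_hypotheses = [
    "Chronic neuroinflammation accelerates amyloid-beta plaque formation in Alzheimer's disease",
    "Insulin resistance disrupts tau protein metabolism leading to neurofibrillary tangle formation",
]
demo_tag_keywords = ["Alzheimer's disease", "neuroinflammation", "amyloid-beta",
                     "tau protein", "insulin resistance"]

hypothesis_tags = tag_hypotheses(demo_hypotheses, demo_tag_keywords)
for text, tags in hypothesis_tags.items():
    print(f"{text[:45]}... -> {tags}")
```

Tags come back in keyword-list order, so the first hypothesis maps to three keywords and the second to "tau protein" and "insulin resistance".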

## 4. Create Knowledge Graph Entities

We'll build a knowledge graph with relevant biomedical entities and their relationships.

In [None]:
# Define knowledge graph nodes (entities)
kg_nodes = [
    {
        "entity_id": "APOE_gene",
        "name": "Apolipoprotein E",
        "type": "Gene",
        "synonyms": ["APOE", "ApoE"],
        "description": "Major genetic risk factor for Alzheimer's disease"
    },
    {
        "entity_id": "Amyloid_Beta",
        "name": "Amyloid-beta peptide",
        "type": "Protein",
        "synonyms": ["Aβ", "A-beta", "amyloid-β"],
        "description": "Protein fragment that forms plaques in Alzheimer's disease"
    },
    {
        "entity_id": "Tau_Protein", 
        "name": "Tau protein",
        "type": "Protein",
        "synonyms": ["MAPT", "τ"],
        "description": "Microtubule-associated protein that forms tangles in AD"
    },
    {
        "entity_id": "Microglia",
        "name": "Microglial cells",
        "type": "Cell Type",
        "synonyms": ["microglia", "brain macrophages"],
        "description": "Immune cells of the central nervous system"
    },
    {
        "entity_id": "IL1B",
        "name": "Interleukin-1 beta",
        "type": "Protein",
        "synonyms": ["IL-1β", "IL1β"],
        "description": "Pro-inflammatory cytokine involved in neuroinflammation"
    },
    {
        "entity_id": "Alzheimers_Disease",
        "name": "Alzheimer's Disease",
        "type": "Disease",
        "synonyms": ["AD", "Alzheimer disease"],
        "description": "Progressive neurodegenerative disorder"
    }
]

# Add nodes to knowledge graph
node_ids = []
for node in kg_nodes:
    nid = kg_lib.add_node(node)
    node_ids.append(nid)
    print(f"Added node: {node['name']} ({node['type']})")

print(f"\nCreated {len(node_ids)} knowledge graph nodes")
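
Each node above carries a `synonyms` list, which suggests looking entities up by any of their names. A minimal sketch of such an alias index follows; `build_alias_index` is a hypothetical helper working on plain dictionaries shaped like `kg_nodes`, not a BioDisco function.

```python
def build_alias_index(nodes):
    """Map every case-folded name and synonym to its entity_id.

    Returns (index, collisions). A collision is an alias claimed by more than
    one entity; the first entity to claim an alias keeps it.
    """
    index = {}
    collisions = []
    for node in nodes:
        aliases = [node["name"], *node.get("synonyms", [])]
        for alias in aliases:
            key = alias.casefold()
            owner = index.setdefault(key, node["entity_id"])
            if owner != node["entity_id"]:
                collisions.append((alias, owner, node["entity_id"]))
    return index, collisions


alias_demo_nodes = [
    {"entity_id": "Amyloid_Beta", "name": "Amyloid-beta peptide",
     "synonyms": ["Aβ", "A-beta"]},
    {"entity_id": "Alzheimers_Disease", "name": "Alzheimer's Disease",
     "synonyms": ["AD", "Alzheimer disease"]},
]
alias_index, alias_collisions = build_alias_index(alias_demo_nodes)
print(alias_index["Aβ".casefold()], alias_index["ad"], alias_collisions)
```

Short aliases such as "AD" are likely to collide in a larger graph, which is why the sketch reports collisions instead of silently overwriting.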

In [None]:
# Define relationships between entities
kg_edges = [
    {
        "source": "APOE_gene",
        "target": "Alzheimers_Disease",
        "relation": "risk_factor_for",
        "evidence_strength": 0.95,
        "source_type": "genetics"
    },
    {
        "source": "Amyloid_Beta",
        "target": "Alzheimers_Disease",
        "relation": "pathological_hallmark_of",
        "evidence_strength": 0.90,
        "source_type": "pathology"
    },
    {
        "source": "Tau_Protein",
        "target": "Alzheimers_Disease", 
        "relation": "pathological_hallmark_of",
        "evidence_strength": 0.90,
        "source_type": "pathology"
    },
    {
        "source": "Microglia",
        "target": "Amyloid_Beta",
        "relation": "activated_by",
        "evidence_strength": 0.85,
        "source_type": "immunology"
    },
    {
        "source": "Microglia",
        "target": "IL1B",
        "relation": "produces",
        "evidence_strength": 0.80,
        "source_type": "molecular_biology"
    },
    {
        "source": "IL1B",
        "target": "Alzheimers_Disease",
        "relation": "contributes_to",
        "evidence_strength": 0.75,
        "source_type": "inflammation"
    }
]

# Add edges to knowledge graph
edge_ids = []
for edge in kg_edges:
    eid = kg_lib.add_edge(edge)
    edge_ids.append(eid)
    print(f"Added edge: {edge['source']} --[{edge['relation']}]--> {edge['target']}")

print(f"\nCreated {len(edge_ids)} knowledge graph relationships")
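
The edges refer to nodes by `entity_id` strings, so a typo in `source` or `target` would create a dangling relationship. The sketch below checks edges against the node list before they are added. `validate_edges` is a hypothetical helper, and the assumption that `evidence_strength` lies in [0, 1] is inferred from the values used above rather than from BioDisco documentation.

```python
def validate_edges(nodes, edges):
    """Return a list of human-readable problems; an empty list means no issues were found."""
    known_ids = {node["entity_id"] for node in nodes}
    problems = []
    for position, edge in enumerate(edges):
        for endpoint in ("source", "target"):
            if edge[endpoint] not in known_ids:
                problems.append(f"edge {position}: unknown {endpoint} {edge[endpoint]!r}")
        strength = edge.get("evidence_strength")
        # Assumption: strengths are confidence-like scores in [0, 1].
        if strength is not None and not 0.0 <= strength <= 1.0:
            problems.append(f"edge {position}: evidence_strength {strength} outside [0, 1]")
    return problems


edge_demo_nodes = [{"entity_id": "Microglia"}, {"entity_id": "IL1B"}]
edge_demo_edges = [
    {"source": "Microglia", "target": "IL1B", "relation": "produces",
     "evidence_strength": 0.80},
    {"source": "IL1B", "target": "Alzheimers_Disease", "relation": "contributes_to",
     "evidence_strength": 0.75},
]
edge_problems = validate_edges(edge_demo_nodes, edge_demo_edges)
for problem in edge_problems:
    print(problem)
```

In this demo the second edge is flagged because `Alzheimers_Disease` was left out of the node list on purpose. Running the same check on `kg_nodes` and `kg_edges` from the cells above should report nothing.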

## 5. Simulate Literature Evidence

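We'll add simulated literature records that would support our hypotheses. The PMIDs, DOIs, authors, and abstracts below are illustrative placeholders rather than citations of real papers.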

In [None]:
# Simulate literature evidence
literature_papers = [
    {
        "pmid": "12345678",
        "title": "Neuroinflammation and amyloid-β accumulation in Alzheimer's disease progression",
        "abstract": "This study demonstrates that chronic neuroinflammation accelerates amyloid-β plaque formation through microglial activation and cytokine release...",
        "authors": ["Smith, A.B.", "Johnson, C.D.", "Williams, E.F."],
        "journal": "Nature Neuroscience",
        "year": 2023,
        "doi": "10.1038/s41593-023-01234-5"
    },
    {
        "pmid": "23456789",
        "title": "Insulin resistance and tau phosphorylation in Alzheimer's disease",
        "abstract": "Our findings reveal that insulin resistance disrupts tau protein metabolism, leading to hyperphosphorylation and neurofibrillary tangle formation...",
        "authors": ["Brown, M.N.", "Davis, P.Q.", "Miller, R.S."],
        "journal": "Cell Metabolism",
        "year": 2023,
        "doi": "10.1016/j.cmet.2023.02.015"
    },
    {
        "pmid": "34567890",
        "title": "Microglial cytokine cascades in neurodegeneration",
        "abstract": "This research shows that activated microglia trigger inflammatory cytokine cascades including IL-1β, TNF-α, and IL-6, resulting in neuronal damage...",
        "authors": ["Garcia, L.M.", "Thompson, K.J.", "Anderson, D.R."],
        "journal": "Journal of Neuroinflammation",
        "year": 2023,
        "doi": "10.1186/s12974-023-02789-1"
    }
]

# Add literature to library
literature_ids = []
for paper in literature_papers:
    lid = lit_lib.add(paper)
    literature_ids.append(lid)
    print(f"Added paper: {paper['title'][:60]}...")
    print(f"  PMID: {paper['pmid']}, Journal: {paper['journal']}\n")

print(f"Added {len(literature_ids)} literature papers")
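
When real records replace the simulated ones, a light sanity check on each dictionary can catch missing fields and malformed identifiers before they reach the library. The following is a sketch only: `check_paper` is a hypothetical helper, the field list mirrors the records above, and the PMID and DOI patterns are common heuristics (PMIDs as integers of up to eight digits, DOIs starting with `10.`), not authoritative validators.

```python
import re

PMID_PATTERN = re.compile(r"^\d{1,8}$")         # heuristic: PubMed IDs are plain integers
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")  # heuristic: common modern DOI shape
REQUIRED_FIELDS = ("pmid", "title", "abstract", "authors", "journal", "year", "doi")


def check_paper(paper):
    """Return a list of problems found in one literature record."""
    issues = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in paper]
    if "pmid" in paper and not PMID_PATTERN.match(str(paper["pmid"])):
        issues.append(f"malformed pmid: {paper['pmid']!r}")
    if "doi" in paper and not DOI_PATTERN.match(str(paper["doi"])):
        issues.append(f"malformed doi: {paper['doi']!r}")
    if "year" in paper and not isinstance(paper["year"], int):
        issues.append(f"year is not an integer: {paper['year']!r}")
    return issues


good_paper = {
    "pmid": "12345678",
    "title": "Neuroinflammation and amyloid-β accumulation in Alzheimer's disease progression",
    "abstract": "Placeholder abstract.",
    "authors": ["Smith, A.B."],
    "journal": "Nature Neuroscience",
    "year": 2023,
    "doi": "10.1038/s41593-023-01234-5",
}
bad_paper = {"pmid": "PMC123", "title": "Untitled", "doi": "doi:10.1038/x"}

good_issues = check_paper(good_paper)
bad_issues = check_paper(bad_paper)
print(good_issues)
print(bad_issues)
```

The check is purely syntactic: a well-formed PMID or DOI can still point to the wrong paper or to none at all.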

## 6. Link Evidence to Hypotheses

Now we'll create evidence links connecting our hypotheses to supporting literature and knowledge graph entities.

In [None]:
# Create evidence links for each hypothesis
evidence_links = []

# Evidence for hypothesis 1: "Chronic neuroinflammation accelerates amyloid-beta plaque formation"
evidence1_id = evidence_lib.add(
    hypothesis_id=hypothesis_ids[0],
    literature_ids=[literature_ids[0]],  # Paper about neuroinflammation and amyloid-β
    kg_node_ids=[node_ids[1], node_ids[3], node_ids[5]],  # Amyloid-β, Microglia, AD
    kg_edge_ids=[edge_ids[1], edge_ids[3]],  # Amyloid-β -> AD, Microglia -> Amyloid-β
    prev_hypothesis_ids=[]
)
evidence_links.append(evidence1_id)

# Evidence for hypothesis 2: "Insulin resistance disrupts tau protein metabolism"
evidence2_id = evidence_lib.add(
    hypothesis_id=hypothesis_ids[1],
    literature_ids=[literature_ids[1]],  # Paper about insulin resistance and tau
    kg_node_ids=[node_ids[2], node_ids[5]],  # Tau protein, AD
    kg_edge_ids=[edge_ids[2]],  # Tau -> AD
    prev_hypothesis_ids=[]
)
evidence_links.append(evidence2_id)

# Evidence for hypothesis 3: "Microglial activation triggers inflammatory cytokine cascade"
evidence3_id = evidence_lib.add(
    hypothesis_id=hypothesis_ids[2],
    literature_ids=[literature_ids[2]],  # Paper about microglial cytokine cascades
    kg_node_ids=[node_ids[3], node_ids[4], node_ids[5]],  # Microglia, IL1B, AD
    kg_edge_ids=[edge_ids[4], edge_ids[5]],  # Microglia -> IL1B, IL1B -> AD
    prev_hypothesis_ids=[]
)
evidence_links.append(evidence3_id)

print("Evidence links created:")
for i, eid in enumerate(evidence_links, 1):
    print(f"  Evidence {i}: {eid}")

print(f"\nTotal evidence links: {len(evidence_links)}")
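
Only three of the five hypotheses receive evidence links in the cell above. A small coverage check makes that gap visible. This sketch works on plain dictionaries whose keys mirror the keyword arguments passed to `evidence_lib.add`; `uncovered_hypotheses` is a hypothetical helper, and the record shape is an assumption about what the evidence library stores.

```python
def uncovered_hypotheses(hypothesis_ids, evidence_records):
    """Return hypothesis IDs with no literature and no knowledge graph edge attached."""
    covered = set()
    for record in evidence_records:
        if record.get("literature_ids") or record.get("kg_edge_ids"):
            covered.add(record["hypothesis_id"])
    # Preserve the original hypothesis order in the result.
    return [hid for hid in hypothesis_ids if hid not in covered]


demo_hypothesis_ids = ["h1", "h2", "h3", "h4", "h5"]
demo_evidence = [
    {"hypothesis_id": "h1", "literature_ids": ["l1"], "kg_edge_ids": ["e2", "e4"]},
    {"hypothesis_id": "h2", "literature_ids": ["l2"], "kg_edge_ids": ["e3"]},
    {"hypothesis_id": "h3", "literature_ids": ["l3"], "kg_edge_ids": ["e5", "e6"]},
]

unsupported = uncovered_hypotheses(demo_hypothesis_ids, demo_evidence)
print(unsupported)  # ['h4', 'h5']
```

The two unsupported IDs correspond to the glucose metabolism and anti-inflammatory intervention hypotheses, which have no simulated paper in this notebook.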

## 7. Analysis and Summary

Let's analyze what we've created and display a summary of our knowledge base.

In [None]:
# Create summary dataframe
summary_data = {
    'Component': [
        'Background Information',
        'Research Keywords',
        'Generated Hypotheses',
        'Knowledge Graph Nodes',
        'Knowledge Graph Edges',
        'Literature Papers',
        'Evidence Links'
    ],
    'Count': [
        len(background_lib.all()),
        len(keyword_lib.all()),
        len(hypo_lib.all()),
        len(kg_lib.all_nodes()),
        len(kg_lib.all_edges()),
        len(lit_lib.all()),
        len(evidence_lib.all())
    ]
}

summary_df = pd.DataFrame(summary_data)
print("📊 BioDisco Knowledge Base Summary:")
print("=" * 40)
print(summary_df.to_string(index=False))

# Display hypotheses with their evidence
print("\n🧪 Hypotheses and Evidence:")
print("=" * 50)
for i, (hyp_id, ev_id) in enumerate(zip(hypothesis_ids[:3], evidence_links), 1):
    hypothesis_text = hypo_lib.get(hyp_id)
    evidence_data = evidence_lib.get(ev_id)
    
    print(f"\nHypothesis {i}:")
    print(f"  {hypothesis_text}")
    print(f"  📖 Literature support: {len(evidence_data['literature_ids'])} papers")
    print(f"  🔗 KG nodes: {len(evidence_data['kg_node_ids'])} entities")
    print(f"  🔗 KG edges: {len(evidence_data['kg_edge_ids'])} relationships")

## 8. Knowledge Graph Visualization Data

Prepare data that could be used for knowledge graph visualization.

In [None]:
# Extract knowledge graph data for visualization
all_nodes = kg_lib.all_nodes()
all_edges = kg_lib.all_edges()

# Create nodes DataFrame
nodes_data = []
for node_id, node_info in all_nodes.items():
    nodes_data.append({
        'id': node_info['entity_id'],
        'name': node_info['name'],
        'type': node_info['type'],
        'description': node_info.get('description', '')
    })

nodes_df = pd.DataFrame(nodes_data)
print("🔗 Knowledge Graph Nodes:")
print(nodes_df.to_string(index=False))

# Create edges DataFrame
edges_data = []
for edge_id, edge_info in all_edges.items():
    edges_data.append({
        'source': edge_info['source'],
        'target': edge_info['target'],
        'relation': edge_info['relation'],
        'strength': edge_info.get('evidence_strength', 0.5)
    })

edges_df = pd.DataFrame(edges_data)
print("\n🔗 Knowledge Graph Relationships:")
print(edges_df.to_string(index=False))
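
One way to turn these tables into a picture is to emit Graphviz DOT text. The sketch below assumes node rows with `id` and `name` keys and edge rows with `source`, `target`, and `relation` keys, matching the DataFrame columns built above; `to_dot` is a hypothetical helper and rendering requires Graphviz to be installed separately.

```python
def to_dot(nodes, edges, graph_name="biodisco_kg"):
    """Render node and edge dictionaries as a Graphviz DOT digraph string."""

    def quote(text):
        # Escape backslashes and double quotes so labels stay valid DOT strings.
        return '"' + str(text).replace("\\", "\\\\").replace('"', '\\"') + '"'

    lines = [f"digraph {graph_name} {{"]
    for node in nodes:
        lines.append(f"  {quote(node['id'])} [label={quote(node['name'])}];")
    for edge in edges:
        lines.append(
            f"  {quote(edge['source'])} -> {quote(edge['target'])} "
            f"[label={quote(edge['relation'])}];"
        )
    lines.append("}")
    return "\n".join(lines)


dot_demo_nodes = [
    {"id": "Microglia", "name": "Microglial cells"},
    {"id": "IL1B", "name": "Interleukin-1 beta"},
]
dot_demo_edges = [{"source": "Microglia", "target": "IL1B", "relation": "produces"}]

dot_text = to_dot(dot_demo_nodes, dot_demo_edges)
print(dot_text)
```

With the DataFrames from the cell above, `nodes_df.to_dict("records")` and `edges_df.to_dict("records")` produce lists in the expected shape. The resulting text can be saved to a `.dot` file and rendered with the Graphviz `dot` command.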

## 9. Export Results

Save our results for future use or analysis.

In [None]:
# Export data to files
import json

# Export hypotheses
hypotheses_export = {}
for hyp_id in hypothesis_ids:
    hypotheses_export[hyp_id] = hypo_lib.get(hyp_id)

with open('biodisco_hypotheses.json', 'w') as f:
    json.dump(hypotheses_export, f, indent=2)

# Export knowledge graph
kg_export = {
    'nodes': all_nodes,
    'edges': all_edges
}

with open('biodisco_knowledge_graph.json', 'w') as f:
    json.dump(kg_export, f, indent=2)

# Export evidence links
evidence_export = evidence_lib.all()

with open('biodisco_evidence.json', 'w') as f:
    json.dump(evidence_export, f, indent=2)

print("✅ Results exported to:")
print("  • biodisco_hypotheses.json")
print("  • biodisco_knowledge_graph.json")
print("  • biodisco_evidence.json")
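
The exported graph contains non-ASCII synonyms such as "Aβ" and "τ". With the default settings `json.dump` writes them as `\uXXXX` escapes, which is valid but hard to read. The sketch below shows a UTF-8 round trip that keeps those characters readable; `save_json` and `load_json` are hypothetical helpers, and the demo data and file name are made up for illustration.

```python
import json
import tempfile
from pathlib import Path


def save_json(data, path):
    """Write data as UTF-8 JSON, keeping characters such as β readable in the file."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(data, handle, indent=2, ensure_ascii=False)


def load_json(path):
    """Read a UTF-8 JSON file back into Python objects."""
    with open(path, "r", encoding="utf-8") as handle:
        return json.load(handle)


demo_export = {
    "nodes": {"n1": {"entity_id": "Amyloid_Beta", "synonyms": ["Aβ", "amyloid-β"]}}
}

# Write to a temporary directory so the example leaves no files behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    export_path = Path(tmp_dir) / "demo_knowledge_graph.json"
    save_json(demo_export, export_path)
    raw_text = export_path.read_text(encoding="utf-8")
    restored = load_json(export_path)

print("Aβ" in raw_text, restored == demo_export)
```

JSON object keys are always strings, so if a library uses integer IDs they will come back as strings after loading. The round trip also assumes every stored value is JSON-serializable.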

## Conclusion

This notebook demonstrated a complete BioDisco pipeline for:

1. ✅ **Data Management**: Using specialized libraries for different data types
2. ✅ **Hypothesis Generation**: Creating testable scientific hypotheses
3. ✅ **Knowledge Graph Construction**: Building interconnected biomedical entities
4. ✅ **Literature Integration**: Managing supporting research papers
5. ✅ **Evidence Linking**: Connecting hypotheses to supporting evidence
6. ✅ **Data Export**: Saving results for further analysis

### Next Steps

- **AI Agent Integration**: Use BioDisco's AI agents for automated hypothesis generation
- **PubMed Integration**: Automatically search and retrieve relevant literature
- **Neo4j Integration**: Store and query knowledge graphs in a graph database
- **Visualization**: Create interactive visualizations of the knowledge graph
- **Hypothesis Testing**: Design experiments to validate generated hypotheses

For more advanced features, check out the BioDisco documentation and additional examples!