# Demographics to Kraken Mapping

This notebook maps demographic fields from `Demographic_Fields_Extracted.xlsx` to entities in the Kraken knowledge graph using the Kestrel API.

## Key Findings

| Metric | Value |
|--------|-------|
| **Total fields** | 120 |
| **Resolution rate** | 100% (all resolved) |
| **High-quality matches** | 87.5% (104 medium-high + 1 medium confidence) |
| **Low-quality matches** | 12.5% (Publication category noise) |
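
The percentages in this table can be reproduced from the confidence counts printed in the Cell 8 summary (104 medium-high, 1 medium, 15 low). The sketch below is only a sanity check of that arithmetic. It treats "high-quality" as any match that did not land in a deprioritized category, which is an interpretation of the table rather than something the notebook computes under that name.

```python
# Confidence counts copied from the Cell 8 summary output.
confidence_counts = {"medium-high": 104, "medium": 1, "low": 15}
total = sum(confidence_counts.values())

# Assumption: "high-quality" = every match outside the deprioritized categories.
good = confidence_counts["medium-high"] + confidence_counts["medium"]
good_pct = 100 * good / total
low_pct = 100 * confidence_counts["low"] / total

print(f"total={total}, good={good_pct:.1f}%, low={low_pct:.1f}%")
# → total=120, good=87.5%, low=12.5%
```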

## Methodology

**Hybrid search with category-aware ranking:**
1. Query Kestrel's `hybrid_search` (text + vector similarity) with 10 candidates per field
2. Rank candidates by biolink category relevance (PhenotypicFeature > ClinicalAttribute > Publication)
3. Select best match from preferred categories when available

**Why not SNOMED exact match?**  
SNOMED codes from the Excel file are not directly indexed in Kraken. The knowledge graph uses UMLS as primary identifiers, with SNOMED appearing only in `equivalent_ids` fields.
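
Because SNOMED codes only show up in `equivalent_ids`, they could still serve as a post-hoc check on a hybrid search candidate. The following is a minimal sketch of that idea, not part of the notebook's pipeline. It assumes each candidate carries `equivalent_ids` as a list of CURIE strings with a `SNOMEDCT:` prefix; the helper name `snomed_equivalents` and the example candidate's `equivalent_ids` values are made up for illustration.

```python
def snomed_equivalents(candidate: dict, source_codes: list[str]) -> list[str]:
    """Return the source SNOMED CURIEs that also appear in the candidate's equivalent_ids."""
    # Assumption: equivalent_ids is a list of CURIE strings (may be missing or None).
    equivalent = {str(e).upper() for e in candidate.get("equivalent_ids") or []}
    return [code for code in source_codes if code.upper() in equivalent]


# Hypothetical candidate: the equivalent_ids below are illustrative, not real Kraken data.
candidate = {
    "id": "UMLS:C0017249",
    "name": "Gender Identity",
    "equivalent_ids": ["UMLS:C0017249", "SNOMEDCT:33821000087103"],
}
source_codes = ["SNOMEDCT:285116001", "SNOMEDCT:33821000087103"]

matches = snomed_equivalents(candidate, source_codes)
print(matches)
# → ['SNOMEDCT:33821000087103']
```

A non-empty result would be a reasonable signal to raise the confidence of a match, while an empty result says nothing either way, since the field may simply be absent.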

## Prerequisites

Set the `KESTREL_API_KEY` environment variable before running:
```bash
export KESTREL_API_KEY=your-key-here
```

## Cell 1: Setup & Imports

In [34]:
import sys
import os
import asyncio
import json
import re
from pathlib import Path
from typing import Any

import pandas as pd
from dotenv import load_dotenv

# Load environment variables from .env file
PROJECT_ROOT = Path(__file__).resolve().parents[2] if "__file__" in dir() else Path.cwd().parents[1]
load_dotenv(PROJECT_ROOT / ".env")

# Add kraken-chatbot to path for KestrelClient import
KRAKEN_CHATBOT_PATH = Path.home() / "Insync/projects/kraken-chatbot/backend/src"
if str(KRAKEN_CHATBOT_PATH) not in sys.path:
    sys.path.insert(0, str(KRAKEN_CHATBOT_PATH))

from kestrel_backend.kestrel_client import KestrelClient, call_kestrel_tool

# Verify API key is available
api_key = os.getenv("KESTREL_API_KEY")
if not api_key:
    raise EnvironmentError(
        "KESTREL_API_KEY not found in environment.\n"
        "Set it before running: export KESTREL_API_KEY=your-key-here\n"
        "Or add to .env file in project root."
    )
print(f"✓ KESTREL_API_KEY configured (length: {len(api_key)})")

✓ KESTREL_API_KEY configured (length: 43)


## Cell 2: Configuration

In [35]:
# Rate limiting to avoid overwhelming the API
RATE_LIMIT_DELAY = 0.5  # seconds between API calls

# File paths - use absolute paths for reliability
NOTEBOOK_DIR = Path.cwd()  # Current directory when running notebook
PROJECT_ROOT = NOTEBOOK_DIR.parents[1] if "notebooks" in str(NOTEBOOK_DIR) else NOTEBOOK_DIR

# Input file is in the same directory as the notebook
INPUT_FILE = NOTEBOOK_DIR / "Demographic_Fields_Extracted.xlsx"
if not INPUT_FILE.exists():
    # Fallback: check project root
    INPUT_FILE = PROJECT_ROOT / "Demographic_Fields_Extracted.xlsx"

OUTPUT_DIR = PROJECT_ROOT / "data" / "review"
OUTPUT_JSON = OUTPUT_DIR / "demographics_kraken_mapping.json"
OUTPUT_TSV = OUTPUT_DIR / "demographics_kraken_mapping.tsv"

# Ensure output directory exists
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

print(f"Notebook dir: {NOTEBOOK_DIR}")
print(f"Input file: {INPUT_FILE} (exists: {INPUT_FILE.exists()})")
print(f"Output JSON: {OUTPUT_JSON}")
print(f"Output TSV: {OUTPUT_TSV}")

Notebook dir: /home/trentleslie/Insync/projects/biovector-eval/notebooks/demographics
Input file: /home/trentleslie/Insync/projects/biovector-eval/notebooks/demographics/Demographic_Fields_Extracted.xlsx (exists: True)
Output JSON: /home/trentleslie/Insync/projects/biovector-eval/data/review/demographics_kraken_mapping.json
Output TSV: /home/trentleslie/Insync/projects/biovector-eval/data/review/demographics_kraken_mapping.tsv


## Cell 3: Load Excel Data

In [36]:
# Load the demographic fields (first sheet = "Demographic Fields")
df = pd.read_excel(INPUT_FILE, sheet_name=0)

print(f"Loaded {len(df)} demographic fields")
print(f"Columns: {list(df.columns)}")
print()

# Show category distribution
print("=== Categories ===")
print(df["Demographic Category"].value_counts().to_string())
print()

# Preview first few rows
df.head()

Loaded 120 demographic fields
Columns: ['Demographic Category', 'Data_Type', 'Historical_ID', 'Phenotype_Description', 'snomed_term_1', 'snomed_term_2', 'snomed_term_3', 'snomed_term_4']

=== Categories ===
Demographic Category
Race / Ethnicity                  23
Marital / Relationship Status     18
Blood Pressure                    10
Education                          8
Birth Weight                       8
Employment Status                  8
Smoking Status (Summary)           7
Weight (Self-reported)             7
Income / Deprivation               6
Birth Country / Place of Birth     6
Height (Self-reported)             5
Alcohol Intake (Summary)           3
Premature Birth                    3
Waist Circumference                2
Sex / Gender                       1
Height (Measured)                  1
BMI                                1
Weight (Measured)                  1
Hip Circumference                  1
Handedness                         1



| | Demographic Category | Data_Type | Historical_ID | Phenotype_Description | snomed_term_1 | snomed_term_2 | snomed_term_3 | snomed_term_4 |
|---|---|---|---|---|---|---|---|---|
| 0 | Sex / Gender | Self-reported | | What gender do you identify with at the moment? | 285116001 \| Gender identity finding \| | 33821000087103 \| Gender identity \| | | |
| 1 | Race / Ethnicity | Self-reported | | For as many as you know, what are the ancestra... | 364699009 \| Ethnic group \| | 397731000 \| Ethnic group finding \| | | |
| 2 | Race / Ethnicity | Self-reported | | For as many as you know, what are the ancestra... | 364699009 \| Ethnic group \| | 397731000 \| Ethnic group finding \| | | |
| 3 | Race / Ethnicity | Self-reported | | For as many as you know, what are the ancestra... | 364699009 \| Ethnic group \| | 397731000 \| Ethnic group finding \| | | |
| 4 | Race / Ethnicity | Self-reported | | For as many as you know, what are the ancestra... | 364699009 \| Ethnic group \| | 397731000 \| Ethnic group finding \| | | |


## Cell 4: SNOMED Code Extractor

In [37]:
def extract_snomed_code(term: str) -> str | None:
    """Extract SNOMED code from format '285116001 | Gender identity finding |'
    
    Returns CURIE format: SNOMEDCT:285116001
    """
    if pd.isna(term) or not isinstance(term, str) or term.strip() == "":
        return None
    
    parts = term.split("|")
    if len(parts) >= 1:
        code = parts[0].strip()
        if code.isdigit():
            return f"SNOMEDCT:{code}"
    return None


def extract_snomed_label(term: str) -> str | None:
    """Extract SNOMED label from format '285116001 | Gender identity finding |'"""
    if pd.isna(term) or not isinstance(term, str) or term.strip() == "":
        return None
    
    parts = term.split("|")
    if len(parts) >= 2:
        return parts[1].strip()
    return None


# Test extraction on sample data
test_terms = [
    "285116001 | Gender identity finding |",
    "364699009 | Ethnic group |",
    None,
    "",
    float('nan'),
]

print("=== SNOMED Extraction Test ===")
for term in test_terms:
    code = extract_snomed_code(term)
    label = extract_snomed_label(term)
    print(f"  {repr(term)[:50]:50} → code={code}, label={label}")

=== SNOMED Extraction Test ===
  '285116001 | Gender identity finding |'            → code=SNOMEDCT:285116001, label=Gender identity finding
  '364699009 | Ethnic group |'                       → code=SNOMEDCT:364699009, label=Ethnic group
  None                                               → code=None, label=None
  ''                                                 → code=None, label=None
  nan                                                → code=None, label=None


## Cell 5: Response Parser

In [38]:
def parse_kestrel_response(response: dict[str, Any]) -> list[dict[str, Any]]:
    """Parse Kestrel tool response to extract entity results.
    
    Kestrel returns results in the format:
    {"content": [{"type": "text", "text": '{"search_query": [results...]}'}]}
    
    The results are nested under the search query string as the key.
    """
    if response.get("isError"):
        return []
    
    content = response.get("content", [])
    if not content:
        return []
    
    # Get the text content
    text_content = content[0].get("text", "") if content else ""
    
    # Parse JSON
    try:
        data = json.loads(text_content)
    except json.JSONDecodeError:
        return []
    
    # Handle Kestrel's nested response format: {"search_query": [results]}
    if isinstance(data, dict):
        # Get the first non-empty list value
        for key, value in data.items():
            if isinstance(value, list) and len(value) > 0:
                return value
        # All lists were empty
        return []
    elif isinstance(data, list):
        return data
    
    return []


# Quick test of the parser
test_response = {
    "content": [{"type": "text", "text": '{"gender identity": [{"id": "UMLS:C0017249", "name": "Gender Identity"}]}'}],
    "isError": False
}
parsed = parse_kestrel_response(test_response)
print(f"Parser test: {parsed}")

Parser test: [{'id': 'UMLS:C0017249', 'name': 'Gender Identity'}]


## Cell 6: Category-Aware Resolution Strategy

The resolution function uses **category filtering** to improve match quality:

**Preferred categories** (selected first if available):
1. `biolink:PhenotypicFeature` - phenotypes, clinical findings
2. `biolink:ClinicalAttribute` - clinical measurements
3. `biolink:DiseaseOrPhenotypicFeature` - diseases and phenotypes
4. `biolink:PopulationOfIndividualOrganisms` - ethnic groups, populations
5. `biolink:Behavior` - behaviors like smoking, handedness
6. `biolink:GeographicLocation` - birthplaces, locations
7. `biolink:Cohort` - study cohorts

**Deprioritized categories** (only used as last resort):
- `biolink:Publication` - questionnaire instruments, papers
- `biolink:InformationContentEntity` - generic information entities

With this filtering, the share of matches that land in a good (non-deprioritized) category rose from **48% to 87.5%**.

In [39]:
# Categories to prefer (in order of relevance for demographics)
PREFERRED_CATEGORIES = [
    "biolink:PhenotypicFeature",
    "biolink:ClinicalAttribute",
    "biolink:DiseaseOrPhenotypicFeature",
    "biolink:PopulationOfIndividualOrganisms",
    "biolink:Behavior",
    "biolink:GeographicLocation",
    "biolink:Cohort",
]

# Categories to deprioritize (often noise for demographic concepts)
DEPRIORITIZED_CATEGORIES = [
    "biolink:Publication",
    "biolink:InformationContentEntity",
]


def select_best_match(entities: list[dict]) -> dict | None:
    """Select the best match from candidates, preferring relevant categories."""
    if not entities:
        return None
    
    # First pass: look for preferred categories
    for category in PREFERRED_CATEGORIES:
        for entity in entities:
            entity_cats = entity.get("categories", [])
            if category in entity_cats:
                return entity
    
    # Second pass: take any non-deprioritized category
    for entity in entities:
        entity_cats = entity.get("categories", [])
        if not any(cat in DEPRIORITIZED_CATEGORIES for cat in entity_cats):
            return entity
    
    # Last resort: return the top result even if deprioritized
    return entities[0]


async def resolve_demographic(row: pd.Series, client: KestrelClient) -> dict[str, Any]:
    """Resolve a demographic field to a Kraken entity.
    
    Uses hybrid_search with category-aware ranking to prefer
    PhenotypicFeature, ClinicalAttribute, etc. over Publication noise.
    """
    result = {
        "demographic_category": row["Demographic Category"],
        "data_type": row["Data_Type"],
        "phenotype_description": row["Phenotype_Description"],
        "historical_id": row.get("Historical_ID"),
        "source_snomed_codes": [],
        "resolution_method": None,
        "kraken_curie": None,
        "kraken_name": None,
        "kraken_category": None,
        "confidence": None,
        "all_candidates": [],
    }
    
    # Collect SNOMED codes for reference
    snomed_columns = ["snomed_term_1", "snomed_term_2", "snomed_term_3", "snomed_term_4"]
    snomed_codes = []
    for col in snomed_columns:
        code = extract_snomed_code(row.get(col, ""))
        if code:
            snomed_codes.append(code)
    result["source_snomed_codes"] = snomed_codes
    
    # Hybrid search on phenotype description
    await asyncio.sleep(RATE_LIMIT_DELAY)
    
    description = row["Phenotype_Description"]
    if pd.isna(description) or not description:
        result["resolution_method"] = "unresolved"
        result["confidence"] = "none"
        return result
    
    # Get more candidates to allow for filtering
    response = await client.call_tool("hybrid_search", {
        "search_text": description,
        "limit": 10,
    })
    
    entities = parse_kestrel_response(response)
    if entities:
        # Store all candidates for review
        result["all_candidates"] = [
            {
                "curie": e.get("id"),
                "name": e.get("name"),
                # `or [None]` also covers an empty or null categories list
                "category": (e.get("categories") or [None])[0],
                "score": e.get("score"),
            }
            for e in entities
        ]
        
        # Select best match using category preferences
        best = select_best_match(entities)
        if best:
            result["resolution_method"] = "hybrid_search"
            result["kraken_curie"] = best.get("id")
            result["kraken_name"] = best.get("name")
            categories = best.get("categories", [])
            result["kraken_category"] = categories[0] if categories else None
            
            # Confidence based on category quality
            if result["kraken_category"] in PREFERRED_CATEGORIES:
                result["confidence"] = "medium-high"
            elif result["kraken_category"] in DEPRIORITIZED_CATEGORIES:
                result["confidence"] = "low"
            else:
                result["confidence"] = "medium"
        else:
            result["resolution_method"] = "unresolved"
            result["confidence"] = "none"
    else:
        result["resolution_method"] = "unresolved"
        result["confidence"] = "none"
    
    return result


print("✓ Resolution function updated with category filtering")

✓ Resolution function updated with category filtering


## Cell 7: Run Mapping Loop

In [40]:
async def run_mapping(df: pd.DataFrame) -> list[dict[str, Any]]:
    """Run the full mapping process for all demographic fields."""
    client = KestrelClient()
    
    try:
        print("Connecting to Kestrel...")
        await client.connect()
        print(f"✓ Connected. Available tools: {list(client.get_tools().keys())}")
        print()
        
        results = []
        total = len(df)
        
        for idx, row in df.iterrows():
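            desc = row["Phenotype_Description"]
            # Guard against missing descriptions: NaN is a float and has no len()
            desc = "" if pd.isna(desc) else str(desc)
            desc_preview = desc[:50] + "..." if len(desc) > 50 else desc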
            print(f"Processing {idx+1}/{total}: {desc_preview}")
            
            result = await resolve_demographic(row, client)
            results.append(result)
            
            # Progress indicator
            if result["resolution_method"] == "snomed_exact":
                print(f"  → SNOMED match: {result['kraken_curie']}")
            elif result["resolution_method"] == "hybrid_search":
                name = result['kraken_name'] or "Unknown"
                name_preview = name[:30] + "..." if len(name) > 30 else name
                print(f"  → Hybrid match: {result['kraken_curie']} ({name_preview})")
            else:
                print(f"  → Unresolved")
        
        return results
    
    finally:
        await client.disconnect()
        print("\n✓ Disconnected from Kestrel")


# Run the mapping (use nest_asyncio for Jupyter compatibility)
try:
    import nest_asyncio
    nest_asyncio.apply()
except ImportError:
    pass  # Not needed if running in a compatible environment

results = asyncio.get_event_loop().run_until_complete(run_mapping(df))

Connecting to Kestrel...
✓ Connected. Available tools: ['one_hop_query', 'text_search', 'vector_search', 'similar_nodes', 'hybrid_search', 'get_nodes', 'get_edges', 'get_one_hop_options', 'get_valid_categories', 'get_valid_predicates', 'get_valid_prefixes', 'get_valid_primary_knowledge_sources', 'get_valid_aggregator_knowledge_sources', 'get_valid_provided_by', 'get_valid_knowledge_levels', 'get_valid_agent_types', 'get_valid_qualifiers', 'get_metagraph', 'health_check']

Processing 1/120: What gender do you identify with at the moment?
  → Hybrid match: UMLS:C4722293 (Other Gender)
Processing 2/120: For as many as you know, what are the ancestral et...
  → Hybrid match: UMLS:C5690858 (Australian Aboriginal and Torr...)
Processing 3/120: For as many as you know, what are the ancestral et...
  → Hybrid match: UMLS:C4735577 (Cholesterol Levels: What You N...)
Processing 4/120: For as many as you know, what are the ancestral et...
  → Hybrid match: UMLS:C4735577 (Cholesterol Levels: What 

## Cell 8: Mapping Quality Summary

In [41]:
# Resolution method distribution
methods = pd.Series([r["resolution_method"] for r in results]).value_counts()

print("="*50)
print("RESOLUTION SUMMARY")
print("="*50)
print(f"Total fields: {len(results)}")
print(f"SNOMED exact match: {methods.get('snomed_exact', 0)} ({methods.get('snomed_exact', 0)/len(results)*100:.1f}%)")
print(f"Hybrid search: {methods.get('hybrid_search', 0)} ({methods.get('hybrid_search', 0)/len(results)*100:.1f}%)")
print(f"Unresolved: {methods.get('unresolved', 0)} ({methods.get('unresolved', 0)/len(results)*100:.1f}%)")

# Confidence distribution
confidences = pd.Series([r["confidence"] for r in results]).value_counts()
print()
print("="*50)
print("CONFIDENCE DISTRIBUTION")
print("="*50)
for conf, count in confidences.items():
    print(f"{conf}: {count} ({count/len(results)*100:.1f}%)")

# Category breakdown
print()
print("="*50)
print("RESOLUTION BY CATEGORY")
print("="*50)
for category in df["Demographic Category"].unique():
    category_results = [r for r in results if r["demographic_category"] == category]
    resolved = sum(1 for r in category_results if r["resolution_method"] != "unresolved")
    print(f"{category}: {resolved}/{len(category_results)} resolved")

RESOLUTION SUMMARY
Total fields: 120
SNOMED exact match: 0 (0.0%)
Hybrid search: 120 (100.0%)
Unresolved: 0 (0.0%)

CONFIDENCE DISTRIBUTION
medium-high: 104 (86.7%)
low: 15 (12.5%)
medium: 1 (0.8%)

RESOLUTION BY CATEGORY
Sex / Gender: 1/1 resolved
Race / Ethnicity: 23/23 resolved
Education: 8/8 resolved
Marital / Relationship Status: 18/18 resolved
Employment Status: 8/8 resolved
Income / Deprivation: 6/6 resolved
Height (Measured): 1/1 resolved
Height (Self-reported): 5/5 resolved
Weight (Measured): 1/1 resolved
Weight (Self-reported): 7/7 resolved
BMI: 1/1 resolved
Waist Circumference: 2/2 resolved
Hip Circumference: 1/1 resolved
Blood Pressure: 10/10 resolved
Birth Country / Place of Birth: 6/6 resolved
Birth Weight: 8/8 resolved
Premature Birth: 3/3 resolved
Handedness: 1/1 resolved
Smoking Status (Summary): 7/7 resolved
Alcohol Intake (Summary): 3/3 resolved


## Cell 9: Review Unresolved & Low-Confidence Mappings

In [42]:
# List unresolved fields
unresolved = [r for r in results if r["resolution_method"] == "unresolved"]
if unresolved:
    print(f"=== UNRESOLVED FIELDS ({len(unresolved)}) ===")
    for r in unresolved:
        # str() keeps this from failing when the description is missing (NaN)
        print(f"  - [{r['demographic_category']}] {str(r['phenotype_description'])[:80]}")
else:
    print("✓ All fields resolved!")

print()

# List hybrid search matches (need manual review)
hybrid_matches = [r for r in results if r["resolution_method"] == "hybrid_search"]
if hybrid_matches:
    print(f"=== HYBRID SEARCH MATCHES - NEEDS REVIEW ({len(hybrid_matches)}) ===")
    for r in hybrid_matches[:10]:  # Show first 10
        print(f"\n  Description: {r['phenotype_description'][:60]}...")
        print(f"  → Matched: {r['kraken_curie']} ({r['kraken_name']})")
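        # Show up to two alternatives, excluding the selected match itself
        # (the selected entity is not always the top-ranked candidate)
        others = [c for c in r['all_candidates'] if c['curie'] != r['kraken_curie']][:2]
        if others:
            print("    Other candidates:")
            for c in others:
                print(f"      - {c['curie']}: {c['name']}")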
    
    if len(hybrid_matches) > 10:
        print(f"\n  ... and {len(hybrid_matches) - 10} more (see output files)")

✓ All fields resolved!

=== HYBRID SEARCH MATCHES - NEEDS REVIEW (120) ===

  Description: What gender do you identify with at the moment?...
  → Matched: UMLS:C4722293 (Other Gender)
    Other candidates:
      - UMLS:C3829605: How Often Has Feeling Depressed Interfered With What You Usually Do
      - DOID:1234: gender incongruence

  Description: For as many as you know, what are the ancestral ethnic group...
  → Matched: UMLS:C5690858 (Australian Aboriginal and Torres Strait Islander Peoples)
    Other candidates:
      - UMLS:C5203091: Aboriginal North American
      - UMLS:C0337948: Aboriginal Australians

  Description: For as many as you know, what are the ancestral ethnic group...
  → Matched: UMLS:C4735577 (Cholesterol Levels: What You Need to Know)
    Other candidates:
      - UMLS:C3828639: Paternal Biological Grandparent
      - NCIT:C100806: Biological Grandparent

  Description: For as many as you know, what are the ancestral ethnic group...
  → Matched: UMLS:C4735577 (

## Cell 10: Export Results

In [43]:
# Build summary statistics
summary = {
    "total_fields": len(results),
    "resolution_methods": methods.to_dict(),
    "confidence_distribution": confidences.to_dict(),
    "snomed_exact_rate": methods.get("snomed_exact", 0) / len(results),
    "hybrid_search_rate": methods.get("hybrid_search", 0) / len(results),
    "unresolved_rate": methods.get("unresolved", 0) / len(results),
}

# JSON export (full detail including candidates)
output_data = {
    "summary": summary,
    "mappings": results,
}

with open(OUTPUT_JSON, "w") as f:
    json.dump(output_data, f, indent=2, default=str)
print(f"✓ Saved JSON: {OUTPUT_JSON}")

# TSV export (flat format for spreadsheet review)
# Flatten results for TSV - drop complex nested fields
flat_results = []
for r in results:
    flat = {
        "demographic_category": r["demographic_category"],
        "data_type": r["data_type"],
        "phenotype_description": r["phenotype_description"],
        "historical_id": r["historical_id"],
        "source_snomed_codes": ";".join(r["source_snomed_codes"]) if r["source_snomed_codes"] else "",
        "resolution_method": r["resolution_method"],
        "kraken_curie": r["kraken_curie"],
        "kraken_name": r["kraken_name"],
        "kraken_category": r["kraken_category"],
        "confidence": r["confidence"],
        "matched_snomed": r.get("matched_snomed", ""),
    }
    flat_results.append(flat)

results_df = pd.DataFrame(flat_results)
results_df.to_csv(OUTPUT_TSV, sep="\t", index=False)
print(f"✓ Saved TSV: {OUTPUT_TSV}")

print(f"\nOutput files ready for review.")

✓ Saved JSON: /home/trentleslie/Insync/projects/biovector-eval/data/review/demographics_kraken_mapping.json
✓ Saved TSV: /home/trentleslie/Insync/projects/biovector-eval/data/review/demographics_kraken_mapping.tsv

Output files ready for review.


## Results Analysis

### Category Distribution of Matched Entities

| Biolink Category | Count | % | Quality |
|------------------|-------|---|---------|
| `biolink:PhenotypicFeature` | 85 | 71% | ✅ Good |
| `biolink:Publication` | 15 | 12.5% | ⚠️ Noise |
| `biolink:PopulationOfIndividualOrganisms` | 8 | 7% | ✅ Good |
| `biolink:DiseaseOrPhenotypicFeature` | 5 | 4% | ✅ Good |
| `biolink:Behavior` | 4 | 3% | ✅ Good |
| `biolink:GeographicLocation` | 2 | 2% | ✅ Good |
| `biolink:Disease` | 1 | 1% | ✅ Good |

### Match Quality Examples

**Good matches** (semantic meaning captured correctly):
- "What gender do you identify with?" → `UMLS:C4722293` "Other Gender" ✅
- "Are you currently married?" → "Married (finding)" ✅  
- Aboriginal ethnicity question → "Australian Aboriginal and Torres Strait Islander Peoples" ✅

**Poor matches** (superficial word similarity):
- "At what age did you finish education?" → "age at menarche" ❌ (matched "age at")
- "I am still in full-time education" → "Still in Hospital" ❌ (matched "still in")
- Ethnic group questions → "Cholesterol Levels: What You Need to Know" ❌ (Publication noise)

### Limitations

1. **Survey question format**: Hybrid search matches the *form* of questions rather than the *concepts* being asked about
2. **Publication noise**: Some questionnaire instrument names (Q-LES-Q, MMSE) appear in results despite category filtering
3. **Ethnic group coverage**: Many specific ethnicities fall back to generic matches when the specific group isn't in Kraken

### Recommendations for Improvement

1. **Extract key terms**: Pre-process questions to extract core concepts before searching (e.g., "gender", "ethnicity", "education level")
2. **Use SNOMED labels**: Search using the SNOMED term labels (e.g., "Gender identity finding") rather than the full question text
3. **Manual curation**: The TSV output is designed for human review to correct mismatches
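
The second recommendation can be sketched as follows. This is a rough illustration under stated assumptions, not the notebook's implementation: the helper names `snomed_label` and `build_search_texts` are hypothetical, and the example row is taken from the first record shown in Cell 3. The idea is to search on the de-duplicated SNOMED labels and fall back to the question text only when a row has no labels.

```python
from __future__ import annotations


def snomed_label(term) -> str | None:
    """Return the label from a term like '285116001 | Gender identity finding |'."""
    if not isinstance(term, str):  # covers None and NaN
        return None
    parts = [p.strip() for p in term.split("|")]
    return parts[1] if len(parts) >= 2 and parts[1] else None


def build_search_texts(row: dict) -> list[str]:
    """Prefer de-duplicated SNOMED labels; fall back to the question text."""
    labels: list[str] = []
    for col in ("snomed_term_1", "snomed_term_2", "snomed_term_3", "snomed_term_4"):
        label = snomed_label(row.get(col))
        if label and label not in labels:
            labels.append(label)
    if labels:
        return labels
    description = row.get("Phenotype_Description")
    return [description] if isinstance(description, str) and description.strip() else []


row = {
    "Phenotype_Description": "What gender do you identify with at the moment?",
    "snomed_term_1": "285116001 | Gender identity finding |",
    "snomed_term_2": "33821000087103 | Gender identity |",
    "snomed_term_3": None,
    "snomed_term_4": float("nan"),
}
search_texts = build_search_texts(row)
print(search_texts)
# → ['Gender identity finding', 'Gender identity']
```

Each returned string could then be passed as `search_text` to `hybrid_search`. How to merge candidates across several queries for one field is left open here.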

## Optional: Graph Exploration

Use `one_hop_query` to explore relationships from resolved entities.

In [44]:
# Example: Explore relationships for a resolved entity
async def explore_entity(curie: str) -> dict:
    """Get one-hop relationships for an entity."""
    client = KestrelClient()
    try:
        await client.connect()
        response = await client.call_tool("one_hop_query", {
            "curie": curie,
            "mode": "preview",  # Just get counts first
        })
        return parse_kestrel_response(response)
    finally:
        await client.disconnect()

# Example usage (uncomment to run):
# resolved_curies = [r["kraken_curie"] for r in results if r["kraken_curie"]]
# if resolved_curies:
#     example_curie = resolved_curies[0]
#     print(f"Exploring: {example_curie}")
#     relationships = asyncio.get_event_loop().run_until_complete(explore_entity(example_curie))
#     print(json.dumps(relationships, indent=2))