<p align="left">
  <img src="https://raw.githubusercontent.com/python35/IINTS-SDK/main/img/iints_logo.png" width="160">
</p>
# Data Registry & Real-World Import
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/python35/IINTS-SDK/blob/main/examples/notebooks/08_Data_Registry_and_Import.ipynb)

**Goal:** discover official datasets, fetch the bundled sample set, and import it into a runnable scenario.

**You will learn how to:**
- List official datasets and their metadata
- Pull citations for papers
- Fetch the bundled sample dataset (offline)
- Convert a CGM CSV into an IINTS scenario
- Run a short simulation from imported data


In [1]:
from __future__ import annotations
from pathlib import Path
from typing import Optional
import os
import sys
import subprocess


def _find_repo_root() -> Optional[Path]:
    for root in [Path.cwd(), *Path.cwd().parents]:
        if (root / "pyproject.toml").exists() and (root / "src").exists():
            return root
    return None

repo_root = _find_repo_root()
if repo_root is None:
    try:
        import google.colab  # type: ignore
        in_colab = True
    except Exception:
        in_colab = False

    if not in_colab:
        raise RuntimeError("Run this notebook inside the IINTS-SDK repo or on Colab.")

    if not Path("IINTS-SDK").exists():
        subprocess.check_call(["git", "clone", "https://github.com/python35/IINTS-SDK.git"])
    repo_root = Path("IINTS-SDK").resolve()

os.chdir(repo_root)
sys.path.insert(0, str(repo_root / "src"))
print("Repo root:", repo_root)


Repo root: /home/runner/work/IINTS-SDK/IINTS-SDK


## Step 1: List datasets


In [2]:
from iints.data import load_dataset_registry

registry = load_dataset_registry()
[{"id": d["id"], "name": d["name"], "access": d["access"]} for d in registry]


[{'id': 'sample', 'name': 'IINTS Sample CGM (Bundled)', 'access': 'bundled'},
 {'id': 'aide_t1d',
  'name': 'AIDE T1D Public Dataset',
  'access': 'public-download'},
 {'id': 'pedap', 'name': 'PEDAP Public Dataset', 'access': 'public-download'},
 {'id': 'azt1d',
  'name': 'AZT1D: A Real-World Dataset for Type 1 Diabetes',
  'access': 'manual'},
 {'id': 'hupa_ucm', 'name': 'HUPA-UCM Diabetes Dataset', 'access': 'manual'},
 {'id': 'openaps_data_commons',
  'name': 'OpenAPS Data Commons',
  'access': 'request'},
 {'id': 'tidepool_bigdata',
  'name': 'Tidepool Big Data Donation',
  'access': 'request'},
 {'id': 'niddk_central',
  'name': 'NIDDK Central Repository',
  'access': 'request'},
 {'id': 't1d_exchange',
  'name': 'T1D Exchange Clinic Registry',
  'access': 'request'}]
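
To see at a glance which datasets can be used without an access request, the entries can be grouped by their `access` field. The sketch below works on a hard-coded excerpt of the listing above instead of calling `load_dataset_registry()`, so it runs without the SDK installed. The helper name `group_by_access` is hypothetical and not part of the SDK.

```python
from collections import defaultdict

# Excerpt of the registry listing shown above, hard-coded so the sketch
# runs without the SDK. With the SDK, pass the real `registry` list instead.
registry_excerpt = [
    {"id": "sample", "name": "IINTS Sample CGM (Bundled)", "access": "bundled"},
    {"id": "aide_t1d", "name": "AIDE T1D Public Dataset", "access": "public-download"},
    {"id": "pedap", "name": "PEDAP Public Dataset", "access": "public-download"},
    {"id": "azt1d", "name": "AZT1D: A Real-World Dataset for Type 1 Diabetes", "access": "manual"},
    {"id": "t1d_exchange", "name": "T1D Exchange Clinic Registry", "access": "request"},
]


def group_by_access(entries):
    """Hypothetical helper: map each access type to the dataset ids that use it."""
    groups = defaultdict(list)
    for entry in entries:
        groups[entry["access"]].append(entry["id"])
    return dict(groups)


by_access = group_by_access(registry_excerpt)
print(by_access["public-download"])
```

Grouping by id keeps the result small; the full metadata for any id can still be looked up afterwards.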

## Step 1b: Inspect dataset metadata and citations
Use this when you need the official citation text or BibTeX for a paper.


In [3]:
from iints.data import get_dataset

sample_meta = get_dataset("sample")
sample_meta["citation"]


{'text': 'IINTS-AF Team. IINTS Sample CGM (bundled). Accessed 2026-02-16.',
 'bibtex': '@misc{iints_sample_cgm, title={IINTS Sample CGM (bundled)}, author={IINTS-AF Team}, year={2026}, note={Bundled with IINTS-AF SDK, accessed 2026-02-16}}'}
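
If several datasets are cited in one paper, the `bibtex` strings can be collected into a single `.bib` file. This is a minimal sketch that assumes every citation is a dict with a `bibtex` key, as in the output above. The helper `write_bib` and the file name `datasets.bib` are hypothetical, and the citation is hard-coded so the example runs without the SDK.

```python
import tempfile
from pathlib import Path

# Citation copied from the output above. With the SDK this would be
# get_dataset("sample")["citation"].
citation = {
    "text": "IINTS-AF Team. IINTS Sample CGM (bundled). Accessed 2026-02-16.",
    "bibtex": (
        "@misc{iints_sample_cgm, title={IINTS Sample CGM (bundled)}, "
        "author={IINTS-AF Team}, year={2026}, "
        "note={Bundled with IINTS-AF SDK, accessed 2026-02-16}}"
    ),
}


def write_bib(citations, path):
    """Hypothetical helper: write one BibTeX entry per citation, separated by blank lines."""
    body = "\n\n".join(c["bibtex"] for c in citations)
    path.write_text(body + "\n", encoding="utf-8")
    return path


# Written to a temporary directory so the sketch leaves the repo untouched.
bib_path = write_bib([citation], Path(tempfile.mkdtemp()) / "datasets.bib")
print(bib_path.read_text(encoding="utf-8"))
```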

## Step 2: Fetch the bundled sample dataset (offline)


In [4]:
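from pathlib import Path
from iints.data import fetch_dataset, export_demo_csv

output_dir = Path("data_packs/sample")
output_dir.mkdir(parents=True, exist_ok=True)  # make sure the fallback has a directory to write to
try:
    # The sample set is bundled with the SDK, so no download is needed.
    paths = fetch_dataset("sample", output_dir=output_dir, extract=False)
except Exception as exc:
    # If the bundled fetch fails, export the demo CSV instead.
    print("Bundled registry fetch failed, falling back to demo CSV.")
    print(exc)
    demo_path = output_dir / "demo_cgm.csv"
    export_demo_csv(demo_path)
    paths = [demo_path]
paths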


[PosixPath('data_packs/sample/demo_cgm.csv')]

## Step 3: Convert CSV to scenario


In [5]:
from iints.data import scenario_from_csv

sample_csv = paths[0]
result = scenario_from_csv(sample_csv, scenario_name="Sample CGM")
result.scenario


{'scenario_name': 'Sample CGM',
 'scenario_version': '1.0',
 'description': 'Imported CGM scenario',
 'stress_events': [{'start_time': 60,
   'event_type': 'meal',
   'value': 45.0,
   'absorption_delay_minutes': 10,
   'duration': 60},
  {'start_time': 360,
   'event_type': 'meal',
   'value': 60.0,
   'absorption_delay_minutes': 10,
   'duration': 60},
  {'start_time': 720,
   'event_type': 'meal',
   'value': 70.0,
   'absorption_delay_minutes': 10,
   'duration': 60}]}
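
Because the scenario is a plain dictionary, it can be inspected and serialized with the standard library. The sketch below copies the output above into a literal and assumes that `start_time` is measured in minutes from the start of the simulation and that `value` is the meal size; neither unit is stated in the output. The 240-minute window is the `duration_minutes` value used in Step 4.

```python
import json

# Scenario copied from the output above. With the SDK this is `result.scenario`.
scenario = {
    "scenario_name": "Sample CGM",
    "scenario_version": "1.0",
    "description": "Imported CGM scenario",
    "stress_events": [
        {"start_time": 60, "event_type": "meal", "value": 45.0,
         "absorption_delay_minutes": 10, "duration": 60},
        {"start_time": 360, "event_type": "meal", "value": 60.0,
         "absorption_delay_minutes": 10, "duration": 60},
        {"start_time": 720, "event_type": "meal", "value": 70.0,
         "absorption_delay_minutes": 10, "duration": 60},
    ],
}

meals = [e for e in scenario["stress_events"] if e["event_type"] == "meal"]
total_value = sum(e["value"] for e in meals)

# Assumption: start_time is in minutes, so only events that start before
# the end of a 240-minute run can take effect in that run.
run_minutes = 240
in_window = [e for e in meals if e["start_time"] < run_minutes]

# Serialize for reuse; json round-trips this structure without loss.
scenario_json = json.dumps(scenario, indent=2)

print(len(meals), total_value, len(in_window))
```

Under these assumptions, only the first of the three meals falls inside a 240-minute run.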

## Step 4: Run a short simulation from imported data


In [6]:
import iints
from iints.core.algorithms.fixed_basal_bolus import FixedBasalBolus
from iints.validation import load_patient_config_by_name

patient_config = load_patient_config_by_name("clinic_safe_baseline").model_dump()
algorithm = FixedBasalBolus(settings={"fixed_basal_rate": 0.4, "carb_ratio": 12.0})

outputs = iints.run_simulation(
    algorithm=algorithm,
    scenario=result.scenario,
    patient_config=patient_config,
    duration_minutes=240,
    time_step=5,
    output_dir="results/data_sample",
)

outputs["results"].head()


Simulation terminated early: Critical failure: glucose < 40.0 mg/dL for 30 minutes.




| Unnamed: 0 | time_minutes | glucose_actual_mgdl | glucose_to_algo_mgdl | glucose_trend_mgdl_min | predicted_glucose_30min | predicted_glucose_heuristic_30min | predicted_glucose_ai_30min | delivered_insulin_units | algo_recommended_insulin_units | sensor_status | ... | uncertainty | fallback_triggered | safety_level | safety_actions | safety_reason | safety_triggered | supervisor_latency_ms | human_intervention | human_intervention_note | algorithm_why_log |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 140.0 | 140.0 | 0.0 | 140.0 | 140.0 |  | 0.041667 | 0.041667 | ok | ... | 0.0 | False | safe |  | APPROVED | False | 0.014096 | False |  | [] |
| 1 | 5 | 118.99537 | 118.99537 | -4.200926 | -7.237269 | -7.237269 |  | 0.0 | 0.041667 | ok | ... | 0.0 | False | emergency | PREDICTED_HYPO: -7.2 mg/dL in 30 min; NEGATIVE... | PREDICTED_HYPO: -7.2 mg/dL in 30 min; NEGATIVE... | True | 0.023023 | False |  | [] |
| 2 | 10 | 101.136806 | 101.136806 | -3.571713 | -6.215972 | -6.215972 |  | 0.0 | 0.041667 | ok | ... | 0.0 | False | emergency | PREDICTED_HYPO: -6.2 mg/dL in 30 min; NEGATIVE... | PREDICTED_HYPO: -6.2 mg/dL in 30 min; NEGATIVE... | True | 0.013436 | False |  | [] |
| 3 | 15 | 85.952396 | 85.952396 | -3.036882 | -5.351979 | -5.351979 |  | 0.0 | 0.041667 | ok | ... | 0.0 | False | emergency | PREDICTED_HYPO: -5.4 mg/dL in 30 min; NEGATIVE... | PREDICTED_HYPO: -5.4 mg/dL in 30 min; NEGATIVE... | True | 0.011701 | False |  | [] |
| 4 | 20 | 73.041018 | 73.041018 | -2.582276 | -4.621694 | -4.621694 |  | 0.0 | 0.041667 | ok | ... | 0.0 | False | emergency | PREDICTED_HYPO: -4.6 mg/dL in 30 min; NEGATIVE... | PREDICTED_HYPO: -4.6 mg/dL in 30 min; NEGATIVE... | True | 0.011261 | False |  | [] |
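
As a quick sanity check on the results, the `glucose_trend_mgdl_min` column can be compared against a simple finite difference of `glucose_actual_mgdl` over the 5-minute time step. That the SDK computes the trend this way is an inference from the five rows shown above, not something taken from its documentation. The values are hard-coded here so the sketch runs without the SDK; on the real DataFrame, `outputs["results"]["glucose_actual_mgdl"].diff()` divided by the step gives the same quantity.

```python
# Values copied from the first five rows of the results above.
time_minutes = [0, 5, 10, 15, 20]
glucose_actual = [140.0, 118.99537, 101.136806, 85.952396, 73.041018]
reported_trend = [0.0, -4.200926, -3.571713, -3.036882, -2.582276]

# Backward finite difference: change in glucose divided by elapsed minutes.
recomputed_trend = [
    (glucose_actual[i] - glucose_actual[i - 1]) / (time_minutes[i] - time_minutes[i - 1])
    for i in range(1, len(glucose_actual))
]

# Compare with the reported column, skipping row 0, which has no predecessor.
max_abs_diff = max(
    abs(recomputed - reported)
    for recomputed, reported in zip(recomputed_trend, reported_trend[1:])
)
print(f"largest deviation: {max_abs_diff:.2e} mg/dL/min")
```

For these rows the recomputed and reported trends agree to within rounding, and glucose falls at every step shown.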


### Recap
You can now go from **official dataset registry → CSV import → runnable scenario** in a few steps.


## Optional: Prepare training data for the AI predictor

Use the dataset registry to fetch real-world data, then standardize to Parquet for predictor training. See `research/README.md` for the full pipeline.

Synthetic bootstrap:
```bash
python research/synthesize_dataset.py --runs 50 --output data/synthetic.parquet
python research/train_predictor.py --data data/synthetic.parquet --config research/configs/predictor.yaml --out models
```

Convert a CSV you imported:
```python
import pandas as pd
from pathlib import Path
from iints.research.dataset import save_parquet

frame = pd.read_csv("results/imported/standard_cgm.csv")
save_parquet(frame, Path("data/training.parquet"))
```