<a class="reference external" href="https://jupyter.designsafe-ci.org/hub/user-redirect/lab/tree/CommunityData/Training/Computational-Workflows-on-DesignSafe/Jupyter_Notebooks/Jupyter_Notebooks_Misc/HDF5_Quick_Explorer.ipynb" target="_blank">
<img alt="Try on DesignSafe" src="https://raw.githubusercontent.com/DesignSafe-Training/pinn/main/DesignSafe-Badge.svg" /></a>

# HDF5 Quick Explorer

***A Lightweight, Interactive Way to Inspect and Understand HDF5 Files***

**by Silvia Mazzoni**

HDF5 files are incredibly powerful for storing large, structured datasets — but they can also feel opaque. Before you analyze data, build models, or launch large-scale jobs, you often need a fast way to answer basic but essential questions:

* What’s actually inside this file?
* How is the data organized?
* What are the dataset names, shapes, and attributes?
* Which groups matter for my workflow?

This notebook is designed as a **quick, practical exploration tool** for HDF5 files. It lets you inspect file structure, navigate groups and datasets, and preview contents **without committing to a full analysis pipeline** or writing custom one-off scripts every time.

The goal here is *orientation*, not heavy computation.

---

## What This Notebook Is (and Isn’t)

**What it is:**

* A lightweight, interactive explorer for HDF5 files
* A way to quickly understand file structure and metadata
* A starting point for deciding *how* you want to use the data next
* Useful both locally and in HPC / DesignSafe-style workflows

**What it isn’t:**

* A full analysis or visualization workflow
* A domain-specific post-processing tool
* A replacement for your production scripts

Think of this notebook as the equivalent of “opening the box and looking inside” before you decide what to build.

---

## How to Use It

You can run this notebook in a **read–explore–adapt** mode:

1. Point it at an HDF5 file
2. Browse groups, datasets, shapes, and attributes
3. Preview data safely and selectively
4. Use what you learn to inform downstream scripts, jobs, or analyses

Most users will copy small pieces of this notebook into their own workflows once they understand their data layout — that’s intentional.

---

## Why This Matters

In many real workflows (HPC jobs, parametric studies, simulation outputs, ML pipelines), HDF5 becomes the *interface* between computation stages. Spending a few minutes understanding file structure upfront can save hours of confusion later.

This notebook exists to make that first step **fast, transparent, and low-friction**.


> Tip: avoid printing large arrays. Use small slices.
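
As a minimal sketch of that tip, the snippet below builds a small throwaway file and reads only its first few samples. The file name `demo.h5` and the dataset name `signal` are made up for illustration and are not part of the NGA-West2 file used later. Slicing an h5py dataset reads only the requested range from disk, so the full array is never loaded.

```python
import os
import tempfile

import h5py
import numpy as np

# Build a small throwaway file; "demo.h5" and "signal" are hypothetical names.
tmp_dir = tempfile.mkdtemp()
demo_path = os.path.join(tmp_dir, "demo.h5")
with h5py.File(demo_path, "w") as h5:
    h5.create_dataset("signal", data=np.arange(100_000, dtype="float32"))

with h5py.File(demo_path, "r") as h5:
    d = h5["signal"]
    # Shape and dtype come from metadata; no array data is read here.
    print("shape:", d.shape, "dtype:", d.dtype)
    # Only the first five values are read from disk.
    head = d[:5]
    print("head:", head)
```

The same pattern applies to multi-dimensional datasets: index the leading axes and slice the last one, rather than printing `d[...]`.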


## 0) Imports (and optional install)

In [1]:
# If needed (usually already installed on HPC/JupyterHub):
!pip -q install --upgrade h5py

import os
import h5py
import numpy as np




## 1) Point to your HDF5 file

In [2]:
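WorkPath_local = '~/Work/stampede3'  # '~' is expanded below by os.path.expanduser
WorkPath_hpc = '$WORK'  # not used below; '$WORK' would need os.path.expandvars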
DatasetPath = 'Datasets/NGAWest2'

hdf5Filename = 'NGAWest2_TimeSeriesOnly_byRSN_AT2_260115.hdf5'

In [3]:
# Update this path to your file.
# Examples:
#   h5_path = "/work2/05072/silvia/stampede3/Datasets/NGAWest2/NGAWest2.h5"
#   h5_path = os.path.expanduser("~/Work/stampede3/Datasets/NGAWest2/NGAWest2.h5")

h5_path = os.path.expanduser(f"{WorkPath_local}/{DatasetPath}/{hdf5Filename}")

assert os.path.exists(h5_path), f"File not found: {h5_path}"
print("HDF5:", h5_path)

print('File Exists:',os.path.exists(h5_path))

HDF5: /home/jupyter/Work/stampede3/Datasets/NGAWest2/NGAWest2_TimeSeriesOnly_byRSN_AT2_260115.hdf5
File Exists: True
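
The cell above only expands `~`. One possible way to support both the local and the HPC base path is sketched below; `resolve_h5_path` is a hypothetical helper, not part of the notebook, and the `WORK` value set here is a placeholder so the example runs anywhere. Note that `os.path.expandvars` leaves undefined variables unchanged, so a missing `$WORK` would show up literally in the result.

```python
import os

def resolve_h5_path(base, dataset_dir, filename):
    """Expand '~' and environment variables such as $WORK, then join the parts."""
    expanded = os.path.expanduser(os.path.expandvars(base))
    return os.path.join(expanded, dataset_dir, filename)

# Placeholder value, set only if WORK is not already defined in this session.
os.environ.setdefault("WORK", "/work/example_user")

# "example.hdf5" is a stand-in file name for illustration.
print(resolve_h5_path("$WORK", "Datasets/NGAWest2", "example.hdf5"))
print(resolve_h5_path("~/Work/stampede3", "Datasets/NGAWest2", "example.hdf5"))
```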


## 2) Top-level keys

In [4]:
Nmax = 50  # maximum number of top-level keys to print
with h5py.File(h5_path, "r") as h5:
    keys = list(h5.keys())
    print("Top-level keys:", keys[:Nmax])
    print("Number of Keys:", len(keys))
    # Only open the objects that are actually printed.
    for k in keys[:Nmax]:
        kind = "Group" if isinstance(h5[k], h5py.Group) else "Dataset"
        print(f"  - {k:30s} {kind}")


Top-level keys: ['RSN1', 'RSN10', 'RSN100', 'RSN1000', 'RSN10000', 'RSN10001', 'RSN10002', 'RSN10003', 'RSN10004', 'RSN10005', 'RSN10006', 'RSN10007', 'RSN10008', 'RSN10009', 'RSN1001', 'RSN10010', 'RSN10011', 'RSN10012', 'RSN10013', 'RSN10014', 'RSN10015', 'RSN10016', 'RSN10017', 'RSN10018', 'RSN10019', 'RSN1002', 'RSN10020', 'RSN10021', 'RSN10022', 'RSN10023', 'RSN10024', 'RSN10025', 'RSN10026', 'RSN10027', 'RSN10028', 'RSN10029', 'RSN1003', 'RSN10030', 'RSN10031', 'RSN10032', 'RSN10033', 'RSN10034', 'RSN10035', 'RSN10036', 'RSN10037', 'RSN10038', 'RSN10039', 'RSN1004', 'RSN10040', 'RSN10041']
Number of Keys: 17166
  - RSN1                           Group
  - RSN10                          Group
  - RSN100                         Group
  - RSN1000                        Group
  - RSN10000                       Group
  - RSN10001                       Group
  - RSN10002                       Group
  - RSN10003                       Group
  - RSN10004                       Group
  - RS
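
The keys above come back in alphabetical order, so `RSN10` precedes `RSN2`-style names and `RSN10000` precedes `RSN1001`. If a numeric order is easier to scan, a sketch like the following could be applied to the key list. `rsn_number` and `sort_rsn_keys` are hypothetical helpers, and the `RSN<integer>` naming pattern is assumed from the output shown above.

```python
import re

def rsn_number(key):
    """Return the integer after the 'RSN' prefix, or None if the key does not match."""
    m = re.fullmatch(r"RSN(\d+)", key)
    return int(m.group(1)) if m else None

def sort_rsn_keys(keys):
    """Sort RSN keys numerically; keys that do not match go last, alphabetically."""
    def sort_key(k):
        n = rsn_number(k)
        return (n is None, n if n is not None else 0, k)
    return sorted(keys, key=sort_key)

# A few key names copied from the listing above.
sample = ["RSN1", "RSN10", "RSN100", "RSN1000", "RSN10000", "RSN10001", "RSN1001", "RSN1002"]
print(sort_rsn_keys(sample))
```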

## 3) Tree view (recursive)

In [5]:
def h5_tree(h5: h5py.File, max_items: int = 500, max_depth: int = 6):
    """Print a compact tree of groups/datasets up to max_depth."""
    count = 0

    def _walk(g: h5py.Group, prefix: str, depth: int) -> bool:
        """Print the children of g; return False once max_items is reached."""
        nonlocal count
        if depth > max_depth:
            return True
        for name in g.keys():
            if count >= max_items:
                print("... (stopped: max_items reached)")
                return False
            obj = g[name]
            path = f"{prefix}/{name}"
            if isinstance(obj, h5py.Group):
                print(f"{path}  [Group]")
                count += 1
                # Propagate the stop upward so the message prints only once.
                if not _walk(obj, path, depth + 1):
                    return False
            elif isinstance(obj, h5py.Dataset):
                print(f"{path}  [Dataset] shape={obj.shape} dtype={obj.dtype}")
                count += 1
            else:
                # For example, a committed (named) datatype, which has no shape.
                print(f"{path}  [{type(obj).__name__}]")
                count += 1
        return True

    print(f"Tree (max_items={max_items}, max_depth={max_depth}):")
    _walk(h5, "", 0)

with h5py.File(h5_path, "r") as h5:
    h5_tree(h5, max_items=200, max_depth=5)


Tree (max_items=200, max_depth=5):
/RSN1  [Group]
/RSN1/RSN1_HELENA.A_A-HMC180.AT2  [Dataset] shape=(5093,) dtype=float32
/RSN1/RSN1_HELENA.A_A-HMC270.AT2  [Dataset] shape=(5103,) dtype=float32
/RSN1/RSN1_HELENA.A_A-HMCDWN.AT2  [Dataset] shape=(5106,) dtype=float32
/RSN10  [Group]
/RSN10/RSN10_IMPVALL.BG_C-ELC-UP.AT2  [Dataset] shape=(8000,) dtype=float32
/RSN10/RSN10_IMPVALL.BG_C-ELC000.AT2  [Dataset] shape=(8000,) dtype=float32
/RSN10/RSN10_IMPVALL.BG_C-ELC090.AT2  [Dataset] shape=(8000,) dtype=float32
/RSN100  [Group]
/RSN100/RSN100_HOLLISTR_A-SJB033.AT2  [Dataset] shape=(4151,) dtype=float32
/RSN100/RSN100_HOLLISTR_A-SJB123.AT2  [Dataset] shape=(4156,) dtype=float32
/RSN100/RSN100_HOLLISTR_A-SJBDWN.AT2  [Dataset] shape=(4151,) dtype=float32
/RSN1000  [Group]
/RSN1000/RSN1000_NORTHR_PIC-UP.AT2  [Dataset] shape=(4000,) dtype=float32
/RSN1000/RSN1000_NORTHR_PIC090.AT2  [Dataset] shape=(4000,) dtype=float32
/RSN1000/RSN1000_NORTHR_PIC180.AT2  [Dataset] shape=(4000,) dtype=float32
/RSN1

## 4) Inspect a specific dataset path

In [9]:
# Put the dataset path you want here, e.g.:
#   dset_path = "/RSN/12345/accel"
#   dset_path = "/accel/12345"
#   dset_path = "/accel"  (if 2D)
dset_path = "/RSN1/RSN1_HELENA.A_A-HMC180.AT2"

with h5py.File(h5_path, "r") as h5:
    if dset_path not in h5:
        print("Not found:", dset_path)
    else:
        d = h5[dset_path]
        print("Path:", dset_path)
        print("Shape:", d.shape)
        print("Dtype:", d.dtype)
        print("Chunks:", d.chunks)
        print("Compression:", d.compression)
        print("Attrs:", list(d.attrs.keys()))
        # Show a tiny preview safely
        if d.size == 0:
            print("Empty dataset.")
        else:
            if d.ndim == 0:
                preview = d[()]  # scalar dataset: read the single value
            elif d.ndim == 1:
                preview = d[:10]
            else:
                # First index on every leading axis, first 10 values on the last axis.
                slicer = (0,) * (d.ndim - 1) + (slice(0, min(10, d.shape[-1])),)
                preview = d[slicer]
            print("Preview:", np.asarray(preview))


Path: /RSN1/RSN1_HELENA.A_A-HMC180.AT2
Shape: (5093,)
Dtype: float32
Chunks: None
Compression: None
Attrs: ['dtHeader', 'metric', 'nptsHeader', 'record']
Preview: [-0.00018903 -0.00018869 -0.00018836 -0.00018804 -0.00018775 -0.00018747
 -0.00018721 -0.00018697 -0.00018674 -0.00018654]
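
The cell above lists attribute names only. To see the values as well, something like the sketch below may help. It runs against a throwaway file that mimics the four attribute names shown above; every value in it is a placeholder, not data from the real record, and `read_attrs` is a hypothetical helper.

```python
import os
import tempfile

import h5py
import numpy as np

# Throwaway file that mimics the attribute names listed above.
# All values here are placeholders, not taken from the real record.
tmp_dir = tempfile.mkdtemp()
demo_path = os.path.join(tmp_dir, "attrs_demo.h5")
with h5py.File(demo_path, "w") as h5:
    d = h5.create_dataset("RSN0/example.AT2", data=np.zeros(8, dtype="float32"))
    d.attrs["dtHeader"] = 0.01
    d.attrs["nptsHeader"] = 8
    d.attrs["metric"] = "placeholder"
    d.attrs["record"] = "placeholder"

def read_attrs(h5, path):
    """Return a plain dict of attribute names and values for one object."""
    return {name: value for name, value in h5[path].attrs.items()}

with h5py.File(demo_path, "r") as h5:
    attrs = read_attrs(h5, "/RSN0/example.AT2")
    for name, value in attrs.items():
        print(f"{name:12s} {value!r}")
```

Attributes are small, so reading all of them is usually safe even when the dataset itself is large.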


## 5) Quick RSN lookup helpers

In [12]:
def rsn_exists(h5: h5py.File, rsn: str, template: str) -> bool:
    return template.format(rsn=rsn) in h5

def suggest_templates(h5: h5py.File, rsn: str = "12345"):
    candidates = [
        "/RSN/{rsn}/accel",
        "/RSN/{rsn}/Accel",
        "/accel/{rsn}",
        "/Accel/{rsn}",
        "/motions/{rsn}/accel",
        "/records/{rsn}/accel",
    ]
    hits = [c for c in candidates if rsn_exists(h5, rsn, c)]
    return hits

rsn = "10041"  # change to a real RSN from your flatfile

with h5py.File(h5_path, "r") as h5:
    hits = suggest_templates(h5, rsn=rsn)
    if hits:
        print("Found candidate accel paths for RSN", rsn)
        for h in hits:
            print("  ", h.format(rsn=rsn))
    else:
        print("No matches for the canned templates. Use the Tree view to find your pattern.")


No matches for the canned templates. Use the Tree view to find your pattern.
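
The canned templates miss because the Tree view shows a different layout: one group per record named `RSN<number>`, with the component datasets directly inside it. A lookup written for that layout might look like the sketch below. `list_rsn_datasets` is a hypothetical helper, and the demo file reuses two dataset names from the tree output with zero-filled placeholder data.

```python
import os
import tempfile

import h5py
import numpy as np

def list_rsn_datasets(h5, rsn):
    """Return sorted dataset paths inside the /RSN<rsn> group, or [] if it is absent."""
    group_name = f"RSN{rsn}"
    if group_name not in h5:
        return []
    grp = h5[group_name]
    return sorted(
        f"/{group_name}/{name}"
        for name, obj in grp.items()
        if isinstance(obj, h5py.Dataset)
    )

# Throwaway file with the group-per-RSN layout; the arrays are placeholders.
tmp_dir = tempfile.mkdtemp()
demo_path = os.path.join(tmp_dir, "rsn_demo.h5")
with h5py.File(demo_path, "w") as h5:
    g = h5.create_group("RSN1")
    g.create_dataset("RSN1_HELENA.A_A-HMC180.AT2", data=np.zeros(4, dtype="float32"))
    g.create_dataset("RSN1_HELENA.A_A-HMC270.AT2", data=np.zeros(4, dtype="float32"))

with h5py.File(demo_path, "r") as h5:
    for path in list_rsn_datasets(h5, "1"):
        print(path)
    print("Missing RSN:", list_rsn_datasets(h5, "99999"))
```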


## 6) Search for a key name anywhere (best-effort)

In [8]:
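def find_paths_containing(h5: h5py.File, needle: str, max_hits: int = 50):
    """Return up to max_hits object paths whose name contains needle (case-insensitive)."""
    hits = []
    needle_lower = needle.lower()

    def _visit(name, obj):
        if len(hits) >= max_hits:
            return True  # any non-None return value stops visititems early
        if needle_lower in name.lower():
            hits.append("/" + name)
        return None

    h5.visititems(_visit)
    return hits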

needle = "accel"  # try: "RSN", "dt", "time", "station", etc.

with h5py.File(h5_path, "r") as h5:
    hits = find_paths_containing(h5, needle, max_hits=50)
    print(f"Paths containing '{needle}' (showing up to 50):")
    for p in hits:
        print(" ", p)


Paths containing 'accel' (showing up to 50):
