splitGraph is an R package for representing biomedical dataset structure as a
typed dependency graph so that leakage-relevant relationships can be made
explicit, validated, queried, and converted into deterministic split
constraints.
It does not fit models, run preprocessing pipelines, or generate resamples by itself. Its job is to encode dataset structure before evaluation so that overlap, provenance, and time-ordering assumptions are inspectable instead of implicit.
The plot above shows six samples (blue) that share three subjects (orange),
two batches (green), two timepoints (red), and two outcome classes (brown).
A plain rsample::vfold_cv() on this dataset would violate subject, batch, and
time structure at the same time, which is exactly what the graph is designed
to make visible.
In biomedical evaluation workflows, leakage often comes from dataset structure rather than obvious coding mistakes. Samples may share:
- the same subject
- the same batch
- the same study
- the same collection timepoint
- the same assay provenance
- the same derived feature set
- the same outcome definition
If those relationships are not modeled explicitly, a train/test split can look correct while still violating the intended scientific separation.
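To make the point concrete, here is a minimal base R sketch (it does not use splitGraph) of a split that is clean at the row level but still shares subjects and batches across the train/test boundary. The toy metadata and the hand-picked test rows are illustrative assumptions.

```r
# Toy metadata: six samples from three subjects and two batches.
meta <- data.frame(
  sample_id  = c("S1", "S2", "S3", "S4", "S5", "S6"),
  subject_id = c("P1", "P1", "P2", "P2", "P3", "P3"),
  batch_id   = c("B1", "B2", "B1", "B2", "B1", "B2"),
  stringsAsFactors = FALSE
)

# A row-wise split that ignores structure: S2 and S5 go to test.
is_test <- meta$sample_id %in% c("S2", "S5")
train <- meta[!is_test, ]
test  <- meta[is_test, ]

# No sample appears on both sides, so the split looks clean per row ...
shared_samples <- intersect(train$sample_id, test$sample_id)

# ... yet subjects and batches occur on both sides of the boundary.
shared_subjects <- intersect(train$subject_id, test$subject_id)
shared_batches  <- intersect(train$batch_id, test$batch_id)

print(shared_samples)
print(shared_subjects)
print(shared_batches)
```

The row-level check comes back empty while the subject and batch checks do not, which is the kind of hidden overlap the graph is meant to surface.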
splitGraph makes those dependencies first-class objects.
Does:
- metadata ingestion with canonical ID normalization
- one-shot graph construction from canonical metadata via graph_from_metadata()
- typed node and edge constructors backed by igraph
- structural, semantic, and leakage-relevant validation
- typed query and traversal helpers
- projected sample-dependency detection
- split-constraint derivation for subject, batch, study, time, and composite modes
- translation of constraints into a stable, tool-agnostic split_spec
- split-spec preflight validation and leakage summary helpers
- typed layered plot() method with per-type colors and a node-type legend
- print(), summary(), and as.data.frame() on all core S3 objects
Does not:
- fit models or run preprocessing pipelines
- generate resamples (rsample does that)
- implement leakage-aware training workflows (bioLeak and fastml do)
- provide a general-purpose graph analytics toolkit
The package is intentionally narrow: dataset dependency structure for leakage-aware evaluation design.
Development version from GitHub:

    install.packages("remotes")
    remotes::install_github("selcukorkmaz/splitGraph")

The fastest path is graph_from_metadata(), which auto-detects canonical
columns in a metadata frame and assembles a validated dependency_graph:

    library(splitGraph)

    meta <- data.frame(
      sample_id     = c("S1", "S2", "S3", "S4", "S5", "S6"),
      subject_id    = c("P1", "P1", "P2", "P2", "P3", "P3"),
      batch_id      = c("B1", "B2", "B1", "B2", "B1", "B2"),
      timepoint_id  = c("T0", "T1", "T0", "T1", "T0", "T1"),
      time_index    = c(0, 1, 0, 1, 0, 1),
      outcome_value = c(0, 1, 0, 1, 1, 0)
    )

    g <- graph_from_metadata(meta, graph_name = "demo")
    plot(g)

    validation <- validate_graph(g)
    subject_constraint <- derive_split_constraints(g, mode = "subject")
    split_spec <- as_split_spec(subject_constraint, graph = g)
    validate_split_spec(split_spec)
    summarize_leakage_risks(g, constraint = subject_constraint, split_spec = split_spec)

For full control over node labels, attribute columns, and non-canonical
relations, use create_nodes() / create_edges() / build_dependency_graph()
directly.
split_spec is the tool-agnostic handoff object produced by
as_split_spec(). Downstream packages provide their own adapters so that
splitGraph has no runtime dependency on any of them:

    # fastml / tidymodels (rsample), in fastml:
    rset <- fastml::rset_from_split_spec(split_spec, data = my_data, v = 5)
    # then use rset anywhere rsample::vfold_cv() is expected:
    # tune::tune_grid(wflow, resamples = rset, grid = grid, metrics = metrics)

    # bioLeak, in bioLeak:
    plan <- bioLeak::as_leaksplits(split_spec, data = my_data,
                                   outcome = "y", v = 5)

Signatures shown match fastml 0.7.8 and the current bioLeak adapter.
Consult ?fastml::rset_from_split_spec or ?bioLeak::as_leaksplits in your
installed versions if in doubt.
The typical end-to-end flow is therefore:
- graph_from_metadata(meta) → typed dependency_graph
- derive_split_constraints(g, mode = ...) → split_constraint
- as_split_spec(constraint, graph = g) → split_spec
- adapter call in fastml / bioLeak / custom wrapper → native resamples
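For the custom-wrapper route, the core requirement is that rows sharing a dependency group never straddle folds. A minimal base R sketch follows, assuming the group labels are already available as a plain vector aligned with the rows of the data. How that vector is obtained from a split_spec is not shown here, and `grouped_folds()` is a hypothetical helper, not a splitGraph, fastml, or bioLeak function.

```r
# Hypothetical helper: assign folds per group, so that all rows sharing a
# group label land in the same fold.
grouped_folds <- function(groups, v) {
  groups <- as.character(groups)
  ids <- sort(unique(groups))  # sorted so the assignment is deterministic
  if (v > length(ids)) {
    stop("v exceeds the number of distinct groups")
  }
  # Round-robin assignment of groups to folds 1..v.
  fold_of_group <- ((seq_along(ids) - 1L) %% v) + 1L
  names(fold_of_group) <- ids
  unname(fold_of_group[groups])
}

# Assumed input: one subject label per row of the data.
groups <- c("P1", "P1", "P2", "P2", "P3", "P3")
folds <- grouped_folds(groups, v = 3)
print(folds)

# Check: every group maps to exactly one fold.
folds_per_group <- tapply(folds, groups, function(x) length(unique(x)))
print(folds_per_group)
```

Round-robin assignment keeps the sketch deterministic. It makes no attempt at balancing fold sizes or stratifying on the outcome.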
Node types: Sample, Subject, Batch, Study, Timepoint, Assay, FeatureSet, Outcome.
Edge types:
- sample_belongs_to_subject
- sample_processed_in_batch
- sample_from_study
- sample_collected_at_timepoint
- sample_measured_by_assay
- sample_uses_featureset
- sample_has_outcome
- subject_has_outcome
- timepoint_precedes
- featureset_generated_from_study
- featureset_generated_from_batch
S3 classes: graph_node_set, graph_edge_set, dependency_graph,
depgraph_validation_report, graph_query_result, split_constraint,
split_spec, split_spec_validation, leakage_risk_summary.
| Layer | Functions |
|---|---|
| Ingestion and construction | ingest_metadata(), graph_from_metadata(), create_nodes(), create_edges(), build_dependency_graph(), dependency_graph(), as_igraph() |
| Validation | validate_graph(), validate_split_spec() |
| Queries | query_node_type(), query_edge_type(), query_neighbors(), query_paths(), query_shortest_paths(), detect_dependency_components(), detect_shared_dependencies() |
| Constraint derivation | derive_split_constraints(), grouping_vector() |
| Split-spec translation | as_split_spec(), summarize_leakage_risks() |
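
The general idea behind grouping samples into dependency components can be sketched in base R with a small union-find: any two samples that share a value in one of the chosen columns end up in the same component. This illustrates the concept only. It is not splitGraph's implementation, and `dependency_components()` is a hypothetical helper.

```r
# Hypothetical helper: label rows so that rows sharing a value in any of
# `keys` (directly or through a chain of shared values) get the same label.
dependency_components <- function(meta, keys) {
  n <- nrow(meta)
  parent <- seq_len(n)

  find_root <- function(i) {
    while (parent[i] != i) i <- parent[i]
    i
  }

  for (key in keys) {
    # Rows sharing a value of `key` are merged into one set.
    for (rows in split(seq_len(n), meta[[key]])) {
      root <- find_root(rows[1])
      for (r in rows[-1]) {
        parent[find_root(r)] <- root
      }
    }
  }

  roots <- vapply(seq_len(n), find_root, integer(1))
  match(roots, unique(roots))  # relabel components as 1, 2, ...
}

meta <- data.frame(
  sample_id  = c("S1", "S2", "S3", "S4", "S5", "S6"),
  subject_id = c("P1", "P1", "P2", "P2", "P3", "P3"),
  batch_id   = c("B1", "B2", "B1", "B2", "B1", "B2"),
  stringsAsFactors = FALSE
)

by_subject <- dependency_components(meta, "subject_id")
by_both    <- dependency_components(meta, c("subject_id", "batch_id"))
print(by_subject)
print(by_both)
```

In this toy design every subject is measured in both batches, so combining subject and batch links all six samples into a single component. Under this reading, a strict combination of dependencies can leave nothing to split on, which is worth checking before choosing a composite mode.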

    query_node_type(g, "Subject")
    query_edge_type(g, "sample_processed_in_batch")
    query_neighbors(g, node_ids = "sample:S1", edge_types = "sample_belongs_to_subject")
    detect_shared_dependencies(g, via = "Batch")
    detect_dependency_components(g, via = c("Subject", "Batch"))

    subject_constraint <- derive_split_constraints(g, mode = "subject")
    batch_constraint <- derive_split_constraints(g, mode = "batch")
    study_constraint <- derive_split_constraints(g, mode = "study")
    time_constraint <- derive_split_constraints(g, mode = "time")
    strict_composite <- derive_split_constraints(
      g, mode = "composite", strategy = "strict",
      via = c("Subject", "Batch")
    )
    rule_based_composite <- derive_split_constraints(
      g, mode = "composite", strategy = "rule_based",
      priority = c("batch", "study", "subject", "time")
    )

plot(g) renders a typed, layered layout with per-type node colors and an
auto-generated node-type legend. Layers: Sample (top), peer dependencies
(Subject / Batch / Study / Timepoint) in the middle band,
Assay / FeatureSet next, Outcome (bottom).

    plot(g)                                       # typed layered layout (default)
    plot(g, layout = "sugiyama")                  # alternative hierarchical layout
    plot(g, show_labels = FALSE)                  # hide node labels on dense graphs
    plot(g, legend = FALSE)                       # suppress the legend
    plot(g, legend_position = "bottomright")
    plot(g, node_colors = c(Sample = "#000000"))  # override type colors

citation("splitGraph") produces:
Korkmaz S (2026). splitGraph: Dataset Dependency Graphs for Leakage-Aware Evaluation. R package version 0.1.0. https://github.com/selcukorkmaz/splitGraph
MIT. See LICENSE.
The package prefers explicit failure over silent guessing. In particular:
- unknown sample IDs, ambiguous direct assignments, and conflicting duplicate nodes or edges are rejected rather than silently resolved
- contradictory time-order metadata are rejected rather than reconciled arbitrarily
- validation truth is not changed by severity filters, and generated split specs are re-validated against the package's own preflight rules
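
As an illustration of the fail-loudly stance on time metadata, the following base R sketch rejects a metadata frame in which one timepoint label carries more than one time index. `check_time_order()` is a hypothetical helper written for this example. The rules splitGraph actually applies are not reproduced here.

```r
# Hypothetical check: each timepoint label must map to exactly one time
# index. Conflicts raise an error rather than being resolved silently.
check_time_order <- function(meta) {
  pairs <- unique(meta[, c("timepoint_id", "time_index")])
  n_index <- table(pairs$timepoint_id)
  bad <- names(n_index)[n_index > 1]
  if (length(bad) > 0) {
    stop("conflicting time_index for timepoint(s): ",
         paste(bad, collapse = ", "))
  }
  invisible(TRUE)
}

ok <- data.frame(
  timepoint_id = c("T0", "T1", "T0", "T1"),
  time_index   = c(0, 1, 0, 1)
)
conflicting <- data.frame(
  timepoint_id = c("T0", "T1", "T0"),
  time_index   = c(0, 1, 2)  # T0 appears with both 0 and 2
)

check_time_order(ok)  # consistent metadata passes silently

# The conflicting frame is rejected; capture the message for inspection.
result <- tryCatch(
  check_time_order(conflicting),
  error = function(e) conditionMessage(e)
)
print(result)
```

Stopping here, instead of picking one of the two indices, keeps the ambiguity visible to whoever owns the metadata.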
