Dataset versioning

T. Andrew Manning edited this page Jun 15, 2026 · 3 revisions

Issue: https://github.com/scimma/blast/issues/320

Problem

A transient dataset comprises the database objects associated with a transient (e.g. Transient, Host, Cutout) and related file objects (e.g. cutout images, SED fit results). When a transient is reprocessed, any existing data is overwritten, which makes the following scenario possible: Researcher Alpha analyzes Blast data downloaded from version X (e.g. v1.10.0) and publishes a scientific paper. Later, Researcher Beta downloads data from Blast version Y (e.g. v1.13.1), including transients cited by Researcher Alpha that have since been reprocessed and now have different values for some subset of the transient data.

Identical analyses of nominally identical datasets can therefore yield different results, which undermines reproducible science.

Goal

We want to implement a dataset version control system to ensure that reprocessing of transient workflows does not alter published data that a researcher may have already downloaded and used to produce scientific results.

Method

Each time a transient is processed, all tabular data (i.e. database objects) in the dataset will be exported to a JSON document using the standard API endpoint. The export already has a top-level "metadata" collection capturing "app_version" and "export_time". Another key, "dataset_version", will be added to the "metadata" collection.
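As a rough illustration of the metadata change, the sketch below stamps an exported document with a "dataset_version" key. Only the three metadata key names come from this page. The helper name add_dataset_version, the integer version format, and the sample payload are assumptions made for illustration and do not reflect the actual Blast export code.

```python
import json
from datetime import datetime, timezone


def add_dataset_version(export: dict, dataset_version: int) -> dict:
    """Return a copy of an exported dataset with metadata["dataset_version"] set.

    Hypothetical helper: the real export code and the version format
    (an integer here) are assumptions, not part of the Blast API.
    """
    # Copy the metadata so the original export document is left untouched.
    metadata = dict(export.get("metadata", {}))
    metadata["dataset_version"] = dataset_version
    return {**export, "metadata": metadata}


# Placeholder export document; the transient payload is invented for the example.
export = {
    "metadata": {
        "app_version": "v1.13.1",
        "export_time": datetime.now(timezone.utc).isoformat(),
    },
    "transient": {"name": "example-transient"},
}

versioned = add_dataset_version(export, dataset_version=2)
print(json.dumps(versioned["metadata"], indent=2))
```

Returning a copy rather than mutating the input keeps the previously exported document intact, which matches the goal of never altering data that may already have been downloaded.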
