Dataset versioning
Issue: https://github.com/scimma/blast/issues/320
A transient dataset comprises the database objects associated with a transient (e.g. Transient, Host, Cutout, etc.) and related file objects (e.g. cutout images, SED fit results). When a transient is reprocessed, any existing data is overwritten, which makes the following scenario possible: Researcher Alpha analyzes Blast data downloaded from version X (e.g. v1.10.0) and publishes a scientific paper. Later, Researcher Beta downloads data from Blast version Y (e.g. v1.13.1), including transients cited by Researcher Alpha that have since been reprocessed and now have different values for some subset of the transient data.
Identical analyses of nominally identical datasets can therefore produce inconsistent results, which undermines reproducible science.
We want to implement a dataset version control system to ensure that reprocessing of transient workflows does not alter published data that a researcher may have already downloaded and used to produce scientific results.
Each time a transient is processed, all tabular data (i.e. database objects) in the dataset will be exported to a JSON document using the same schema and algorithm as the archive export. This file will be called a "dataset revision" (DR).
{
  "metadata": {
    "app_version": "v1.12.0",
    "dataset_version": 0,
    "export_time": "2026-05-27T15:46:03.934740+00:00"
  },
  "transient": {
    "model": "host.transient",
    "pk": 3492,
    "fields": {
      "ra_deg": 141.509425482,
      "dec_deg": 24.521685394,
      "name": "2020able",
      "software_version": "1.12.0",
      ...
    }
  },
  "host": {
    "model": "host.host",
    "pk": 13236,
    "fields": {
      "ra_deg": 141.51223628,
      "dec_deg": 24.52093673,
      "name": "2020able",
      ...
    }
  },
  ...
  "files": {
    "cutout_cdn/2MASS/2MASS_H.fits": <checksum>,
    "sed_output/2026dgt_global_modeldata.npz": <checksum>,
    ...
  }
}
The export already includes a top-level metadata collection that captures the Blast version and export time. A new key, dataset_version, will be added to this collection to store the revision number, a monotonically increasing integer starting at zero.
A checksum will be generated for each file output from the workflow and stored in the DR.
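The checksum step could be sketched as follows. The source does not name a checksum algorithm, so SHA-256 is an assumption here, and the helper names `file_checksum` and `dataset_checksums` are hypothetical. Keys are written as paths relative to a root directory; the exact key layout used in the "files" collection of the DR may differ.

```python
import hashlib
from pathlib import Path


def file_checksum(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file (algorithm is an assumption)."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in chunks so large FITS or NPZ files are not loaded into memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def dataset_checksums(root: Path, files: list[Path]) -> dict[str, str]:
    """Map each file's POSIX-style path relative to root to its checksum."""
    return {
        path.relative_to(root).as_posix(): file_checksum(path)
        for path in sorted(files)
    }
```

The resulting dictionary can be stored as the "files" collection of the DR, so two revisions can later be compared without reading the files again.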
A new task will be appended to the end of the workflow that generates the DR. If it is the first revision, it will be stored in /data/revisions/[name]/0/[name].json. If a revision already exists, then the latest revision N is discovered by listing the revisions directory. The data values from the workflow are compared against those of revision N: if they match, no new revision is created; otherwise, a new revision directory N+1 is created.
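A minimal sketch of the revision logic described above, under stated assumptions: the function names are hypothetical, the revision is passed in as an already-built dictionary, and the metadata collection is excluded from the comparison because the export time (and possibly the Blast version) changes on every run. Which keys actually count as "data values" is a design decision the source leaves open.

```python
import json
from pathlib import Path
from typing import Optional


def latest_revision(revisions_dir: Path) -> Optional[int]:
    """Return the highest numbered revision directory, or None if there is none."""
    if not revisions_dir.is_dir():
        return None
    numbers = [int(p.name) for p in revisions_dir.iterdir()
               if p.is_dir() and p.name.isdigit()]
    return max(numbers, default=None)


def comparable(revision: dict) -> dict:
    """Drop the metadata collection before comparing (an assumption, see lead-in)."""
    return {key: value for key, value in revision.items() if key != "metadata"}


def write_revision_if_changed(data_root: Path, name: str,
                              revision: dict) -> Optional[Path]:
    """Write revision N+1 if the data differs from revision N; return its path."""
    revisions_dir = data_root / "revisions" / name
    latest = latest_revision(revisions_dir)
    if latest is not None:
        previous_path = revisions_dir / str(latest) / f"{name}.json"
        previous = json.loads(previous_path.read_text())
        if comparable(previous) == comparable(revision):
            return None  # data values match: no new revision
    number = 0 if latest is None else latest + 1
    metadata = {**revision.get("metadata", {}), "dataset_version": number}
    target = revisions_dir / str(number) / f"{name}.json"
    target.parent.mkdir(parents=True)
    target.write_text(json.dumps({**revision, "metadata": metadata}, indent=2))
    return target
```

Listing the directory keeps the revision number out of the database, at the cost of a race if two workflows for the same transient finish at the same time; the sketch does not attempt any locking.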
The DR index should
The transient 2026dgt is added to Blast and the workflow generates data for the first time.
Dataset files are stored under cutout_cdn and sed_output:
/data/cutout_cdn/2026dgt/2MASS/2MASS_H.fits
/data/cutout_cdn/2026dgt/...
/data/sed_output/2026dgt/2026dgt_global_modeldata.npz
/data/sed_output/2026dgt/...
Tabular data and file checksums are exported to the JSON file:
/data/revisions/2026dgt/0/2026dgt.json
Later, 2026dgt is reprocessed by a newer Blast version, but all data objects are identical to DR 0. No new revision is created.
Subsequently, 2026dgt is reprocessed by the same Blast version as before, but the image file returned by the remote source catalog differs slightly from the one fetched weeks earlier, which alters the calculation results. In this case, a new revision, DR 1, is created at:
/data/revisions/2026dgt/1/2026dgt.json
The file checksum comparison determines that only two of the dataset files changed:
/data/cutout_cdn/2026dgt/DES/DES_z.fits
/data/sed_output/2026dgt/2026dgt_global_modeldata.npz
The original files are moved to:
/data/revisions/2026dgt/0/cutout_cdn/DES/DES_z.fits
/data/revisions/2026dgt/0/sed_output/2026dgt_global_modeldata.npz
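The file handling in this example could be sketched as below. The helper names are hypothetical, and the key format (for example "cutout_cdn/DES/DES_z.fits", without the transient name) is inferred from the paths shown above. The sketch assumes the original file is still at its live location when `archive_original` is called; the source does not say at which point in the workflow the move happens.

```python
import shutil
from pathlib import Path


def changed_files(previous: dict[str, str], current: dict[str, str]) -> list[str]:
    """Return keys from the previous revision whose checksum differs or is gone."""
    return sorted(key for key, checksum in previous.items()
                  if current.get(key) != checksum)


def archive_original(data_root: Path, name: str, revision: int, key: str) -> Path:
    """Move one live dataset file into the directory of the given revision.

    A key such as "cutout_cdn/DES/DES_z.fits" is assumed to live at
    <data_root>/cutout_cdn/<name>/DES/DES_z.fits.
    """
    top, _, rest = key.partition("/")
    source = data_root / top / name / rest
    target = data_root / "revisions" / name / str(revision) / key
    target.parent.mkdir(parents=True, exist_ok=True)
    shutil.move(str(source), str(target))
    return target
```

Only files whose checksums changed are moved, so unchanged files stay in place and are shared by every revision that references them.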