Reference checksum validation #61

Merged 10 commits on Jan 17, 2024

yashpatel6
Contributor

Description

Adds basic validation of reference file checksums. A failed check produces a warning but does not fail the test case.

Pipeline Run Results

Testing log: /hot/software/package/tool-NFTest/Python/development/unreleased/yashpatel-reference-checksums/log-nftest-20240113T020243Z.log

Checklist

  • This PR does NOT contain Protected Health Information (PHI). A repo may need to be deleted if such data is uploaded.
    Disclosing PHI is a major problem [1]. Even a small leak can be costly [2].

  • This PR does NOT contain germline genetic data [3], RNA-Seq, DNA methylation, microbiome or other molecular data [4].

  • This PR does NOT contain other non-plain text files, such as: compressed files, images (e.g. .png, .jpeg), .pdf, .RData, .xlsx, .doc, .ppt, or other output files.

  To automatically exclude such files using a .gitignore file, see here for example.

  • I have read the code review guidelines and the code review best practice on GitHub check-list.

  • I have set up or verified the main branch protection rule following the github standards before opening this pull request.

  • The name of the branch is meaningful and well formatted following the standards, using [AD_username (or 5 letters of AD if AD is too long)]-[brief_description_of_branch].

  • I have added the major changes included in this pull request to the CHANGELOG.md under the next release version or unreleased, and updated the date.

Footnotes

  1. UCLA Health reaches $7.5m settlement over 2015 breach of 4.5m patient records

  2. The average healthcare data breach costs $2.2 million, despite the majority of breaches releasing fewer than 500 records.

  3. Genetic information is considered PHI.
    Forensic assays can identify patients with as few as 21 SNPs

  4. RNA-Seq, DNA methylation, microbiome, or other molecular data can be used to predict genotypes (PHI) and reveal a patient's identity.

@nwiltsie
Member

I'm not 100% clear on the use case here. It seems like the nftest.yml file you used had these changes made...

diff --git a/nftest.yml b/nftest.yml
index 30d31e4..2691d1d 100644
--- a/nftest.yml
+++ b/nftest.yml
@@ -10,10 +10,19 @@ cases:
     message: Testing proper separation between even and odd
     nf_script: ./main.nf
     nf_config: ./test/test_1000_divisor13.config
+    reference_files:
+      - reference_parameter_name: a_reference_param
+        reference_parameter_path: /hot/user/yashpatel/new.txt
+        reference_checksum: 2153bae19fd93c9110fd0067d5bde2f3
+        reference_checksum_type: 'md5'
+      - reference_parameter_name: another_reference_param
+        reference_parameter_path: /hot/user/yashpatel/new.txt
+        reference_checksum: 2153bae19fd93c9110fd0067d5bde2f4
+        reference_checksum_type: 'md5'

... and I can see that a_reference_param and another_reference_param were added to the nextflow command...

2024-01-13 02:02:43,731 - NFTest - INFO - NXF_WORK='/scratch/$SLURM_JOB_ID/./pipeline-demo-pipeline-test/work' nextflow run ./main.nf -c /hot/user/yashpatel/pipeline-demo-pipeline/pipeline-demo-pipeline/test/global.config -c ./test/test_1000_divisor13.config --a_reference_param /hot/user/yashpatel/new.txt --another_reference_param /hot/user/yashpatel/new.txt --output_dir /hot/user/yashpatel/pipeline-demo-pipeline/pipeline-demo-pipeline/out/Even-and-odd-output

... and a warning was issued that one of the checksums failed to match...

2024-01-13 02:02:43,730 - NFTest - WARNING - Checksum for reference file: another_reference_param=/hot/user/yashpatel/new.txt - `2153bae19fd93c9110fd0067d5bde2f3` does not match expected checksum of `2153bae19fd93c9110fd0067d5bde2f4`

Can you elaborate a little on what you're intending to use this for? My best guess would be to ensure that referenced but untracked files (like large BAMs) used as test inputs do not change.
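As a rough illustration of the behaviour observed in the log above, the mapping from `reference_files` entries to `--name path` arguments on the nextflow command can be sketched as follows. This is a minimal sketch, not NFTest's actual code: the helper name `reference_args` is hypothetical, and only the YAML keys and paths come from the diff and log in this comment.

```python
import shlex


def reference_args(reference_files):
    """Hypothetical helper: turn reference_files entries into CLI arguments."""
    args = []
    for ref in reference_files:
        # Each entry contributes one `--<parameter name> <path>` pair.
        args.append(f"--{ref['reference_parameter_name']}")
        args.append(ref["reference_parameter_path"])
    return args


# Entries mirroring the nftest.yml diff above (checksum fields omitted).
refs = [
    {
        "reference_parameter_name": "a_reference_param",
        "reference_parameter_path": "/hot/user/yashpatel/new.txt",
    },
    {
        "reference_parameter_name": "another_reference_param",
        "reference_parameter_path": "/hot/user/yashpatel/new.txt",
    },
]

# shlex.join quotes any argument that needs it for the shell.
command = shlex.join(["nextflow", "run", "./main.nf", *reference_args(refs)])
print(command)
```

The printed command contains the same two `--a_reference_param` and `--another_reference_param` arguments that appear in the logged nextflow invocation.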

@yashpatel6
Contributor Author

The primary use case here is reference files in pipelines, for example the reference genomes or GATK bundles that pipelines consume. For perfect reproducibility, the same reference files need to be used, but currently nothing checks the actual input reference files, so I've added a check here, at least for the tests. If the check fails, the tests still continue, and the warning serves as a pointer if tests fail or as a note to keep in mind.
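The warn-but-continue check described here can be sketched roughly as below. This is a minimal sketch under stated assumptions, not the code merged in this PR: the function names `file_checksum` and `validate_reference` are hypothetical, and the warning text simply copies the format of the log line quoted earlier in the conversation.

```python
import hashlib
import logging

logger = logging.getLogger("NFTest")


def file_checksum(path, checksum_type="md5", chunk_size=1 << 20):
    """Hash a file in chunks so large reference files are not loaded at once."""
    digest = hashlib.new(checksum_type)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def validate_reference(name, path, expected, checksum_type="md5"):
    """Hypothetical check: warn on a mismatch and report it, never raise."""
    actual = file_checksum(path, checksum_type)
    if actual != expected:
        logger.warning(
            "Checksum for reference file: %s=%s - `%s` does not match "
            "expected checksum of `%s`",
            name, path, actual, expected,
        )
        return False
    return True
```

Because the function returns a boolean instead of raising, the caller can log the mismatch and still launch the test case, which matches the behaviour described above.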

@yashpatel6 merged commit 8e1180e into main on Jan 17, 2024
2 checks passed
@yashpatel6 deleted the yashpatel-reference-checksums branch on January 17, 2024 18:56