prepare_transcripts: build pyfasta index eagerly #71
Open
pinin4fjords wants to merge 1 commit into
When `prepare_transcripts` writes `transcripts_sequence.fa` but doesn't build its pyfasta index, downstream RiboCode steps (`RiboCode`, `RiboCode_onestep`) trigger pyfasta's lazy first-read index build, which writes `transcripts_sequence.fa.gdx` and `transcripts_sequence.fa.flat` *next to the FASTA*. That breaks any deployment where the annotation directory is not the consumer's own writable working directory:

- read-only `/mnt` or NFS mounts on shared HPC infrastructure
- container bind mounts published `:ro`
- workflow engines that stage the annotation as a symlink into each consumer task (writes follow the symlink back to the producer; parallel consumers then race)

Building the indexes once in the producing call makes the published annotation set complete and removes the lazy-write path. The `GenomeSeq` constructor uses the same `get_chrom` `key_fn` that downstream readers use, so the `.gdx`/`.flat` files produced are byte-identical to what pyfasta would otherwise write lazily (verified md5-equal on the nf-core/test-datasets chr20 fixture).

Closes xryanglab#70.
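As a rough illustration of what "complete annotation set" means here, the following sketch checks whether both sidecars already sit next to a FASTA. It uses only the standard library and does not call pyfasta; the helper names `missing_sidecars` and `needs_lazy_write` are hypothetical and not part of RiboCode, and the suffix list is taken from the file names mentioned above.

```python
from pathlib import Path

# Sidecar suffixes named in this PR; they are appended to the full FASTA filename.
SIDECAR_SUFFIXES = (".gdx", ".flat")


def missing_sidecars(fasta_path):
    """Return the sidecar paths that do not yet exist next to fasta_path."""
    fasta = Path(fasta_path)
    candidates = [fasta.with_name(fasta.name + suffix) for suffix in SIDECAR_SUFFIXES]
    return [path for path in candidates if not path.exists()]


def needs_lazy_write(fasta_path):
    """True if any sidecar is missing, i.e. a first read would have to create it."""
    return bool(missing_sidecars(fasta_path))
```

A consumer task on a read-only mount could call `needs_lazy_write("annot/transcripts_sequence.fa")` up front and fail with a clear message instead of hitting a write error inside the index build.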
Fixes #70.
`prepare_transcripts` writes `transcripts_sequence.fa` but doesn't build its pyfasta index. Downstream RiboCode steps (`RiboCode`, `RiboCode_onestep`) then trigger pyfasta's lazy first-read index build, which writes `transcripts_sequence.fa.gdx` and `transcripts_sequence.fa.flat` next to the FASTA. That breaks any deployment where the annotation directory isn't the consumer's own writable working directory:

- read-only `/mnt` or NFS mounts on shared HPC infrastructure
- container bind mounts published `:ro`
- workflow engines that stage the annotation as a symlink into each consumer task (writes follow the symlink back to the producer; parallel consumers then race)

This PR adds one line to the end of `processTranscripts(...)` so the indexes are built in the producing call. The `GenomeSeq` constructor uses the same `get_chrom` `key_fn` that downstream readers use, so the sidecars produced are byte-identical to what pyfasta would otherwise write lazily.

Verification

Ran the patched `prepare_transcripts` against the nf-core/test-datasets chr20 fixture (`Homo_sapiens.GRCh38.111_chr20.gtf` + `Homo_sapiens.GRCh38.dna.chromosome.20.fa`): `4981ecea...` (`.gdx`) / `e99a891b...` (`.flat`)

No CLI / API surface change; existing callers get a slightly more complete annotation directory. Happy to gate this behind a `--prebuild-indexes` flag if you'd prefer to keep the current default behaviour.
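For anyone who wants to repeat the byte-identity check on their own data, a minimal sketch of the comparison might look like the following. It assumes two copies of the same FASTA, one indexed by the patched `prepare_transcripts` and one indexed lazily by a downstream step; the helper names `md5_of` and `sidecars_match` are hypothetical, and only the standard library is used.

```python
import hashlib


def md5_of(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def sidecars_match(eager_fasta, lazy_fasta, suffixes=(".gdx", ".flat")):
    """True if every sidecar of eager_fasta has the same MD5 as that of lazy_fasta."""
    return all(
        md5_of(str(eager_fasta) + suffix) == md5_of(str(lazy_fasta) + suffix)
        for suffix in suffixes
    )
```

Calling `sidecars_match("eager/transcripts_sequence.fa", "lazy/transcripts_sequence.fa")` on two such directories (paths here are placeholders) should return `True` if the eager build reproduces the lazy one.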