prepare_transcripts: build pyfasta index eagerly by pinin4fjords · Pull Request #71 · xryanglab/RiboCode

pinin4fjords · 2026-05-19T10:20:03Z

Fixes #70.

prepare_transcripts writes transcripts_sequence.fa but doesn't build its pyfasta index. Downstream RiboCode steps (RiboCode, RiboCode_onestep) then trigger pyfasta's lazy first-read index build, which writes transcripts_sequence.fa.gdx and transcripts_sequence.fa.flat next to the FASTA. That breaks any deployment where the annotation directory isn't the consumer's own writable working directory:

- read-only /mnt or NFS mounts on shared HPC infrastructure
- container bind mounts published :ro
- workflow engines that stage the annotation as a symlink into each consumer task (writes follow the symlink back to the producer; parallel consumers then race)
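
The failure mode in these deployments can be detected before a consumer runs. A minimal sketch, assuming only the sidecar names given in this description (`.gdx` and `.flat` appended to the FASTA filename); `missing_sidecars` and `lazy_build_would_fail` are hypothetical helpers, not part of RiboCode or pyfasta:

```python
import os


def missing_sidecars(fasta_path, suffixes=(".gdx", ".flat")):
    """Return the sidecar paths that do not yet exist next to the FASTA."""
    return [fasta_path + s for s in suffixes if not os.path.exists(fasta_path + s)]


def lazy_build_would_fail(fasta_path):
    """True if sidecars are missing and the FASTA's directory is not writable.

    os.access follows symlinks, so a symlinked annotation directory is
    checked against the producer's real directory.
    """
    directory = os.path.dirname(os.path.abspath(fasta_path))
    return bool(missing_sidecars(fasta_path)) and not os.access(directory, os.W_OK)
```

This check only covers the permission problem; it does not detect the race between parallel consumers writing into the same writable directory.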

This PR adds one line to the end of processTranscripts(...) so the indexes are built in the producing call. The GenomeSeq constructor uses the same get_chrom key_fn downstream readers use, so the sidecars produced are byte-identical to what pyfasta would otherwise write lazily.
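
The idea of building the index in the producer can be sketched as follows. This is not the patch itself: it calls `pyfasta.Fasta` directly instead of RiboCode's `GenomeSeq`, `first_token` is a hypothetical stand-in for the `get_chrom` key function, and the `opener` parameter exists only so the sketch can be exercised without pyfasta installed.

```python
def first_token(header):
    """Hypothetical key function: use the first whitespace-separated token."""
    return header.split()[0]


def build_index_eagerly(fasta_path, key_fn=None, opener=None):
    """Open the FASTA once so the index sidecars are written by the producer.

    By default this constructs pyfasta.Fasta, whose constructor builds the
    .gdx/.flat sidecars when they are missing. Pass the same key_fn that
    downstream readers use so the index they would build is the same one.
    """
    if opener is None:
        from pyfasta import Fasta  # third-party; imported lazily
        opener = Fasta
    return opener(fasta_path, key_fn=key_fn)
```

A producer would call something like `build_index_eagerly(path_to_fasta, key_fn=first_token)` immediately after writing the FASTA, while the output directory is still writable.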

Verification

Ran the patched prepare_transcripts against the nf-core/test-datasets chr20 fixture (Homo_sapiens.GRCh38.111_chr20.gtf + Homo_sapiens.GRCh38.dna.chromosome.20.fa):

- Eager-built sidecars: 4981ecea... (.gdx) / e99a891b... (.flat)
- Fresh post-hoc pyfasta open of the same FASTA: same md5s
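
A comparison of this kind can be reproduced with the standard library. A minimal sketch, assuming two copies of the FASTA with their sidecars in separate locations (one indexed eagerly, one indexed by a later pyfasta open); `md5_of` and `sidecars_match` are hypothetical helper names:

```python
import hashlib


def md5_of(path, chunk_size=1 << 20):
    """Return the hex md5 of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def sidecars_match(fasta_a, fasta_b, suffixes=(".gdx", ".flat")):
    """True if every sidecar of fasta_a has the same md5 as that of fasta_b."""
    return all(md5_of(fasta_a + s) == md5_of(fasta_b + s) for s in suffixes)
```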

No CLI / API surface change; existing callers get a slightly more complete annotation directory. Happy to gate this behind a --prebuild-indexes flag if you'd prefer to keep current default behaviour.

When `prepare_transcripts` writes `transcripts_sequence.fa` but doesn't build its pyfasta index, downstream RiboCode steps (`RiboCode`, `RiboCode_onestep`) trigger pyfasta's lazy first-read index build, which writes `transcripts_sequence.fa.gdx` and `transcripts_sequence.fa.flat` *next to the FASTA*. That breaks any deployment where the annotation directory is not the consumer's own writable working directory:

- read-only `/mnt` or NFS mounts on shared HPC infrastructure
- container bind mounts published `:ro`
- workflow engines that stage the annotation as a symlink into each consumer task (writes follow the symlink back to the producer; parallel consumers then race)

Building the indexes once in the producing call makes the published annotation set complete and removes the lazy-write path. The `GenomeSeq` constructor uses the same `get_chrom` key_fn downstream readers use, so the `.gdx`/`.flat` produced are byte-identical to what pyfasta would otherwise write lazily (verified md5-equal on the nf-core/test-datasets chr20 fixture).

Closes xryanglab#70.

pinin4fjords mentioned this pull request May 19, 2026

feat(ribocode): pre-build pyfasta indexes + prefix-scoped outputs nf-core/modules#11685

Merged


pinin4fjords wants to merge 1 commit into xryanglab:master from pinin4fjords:eager-pyfasta-index-build
