prepare_transcripts: build pyfasta index eagerly #71

Open: pinin4fjords wants to merge 1 commit into xryanglab:master from pinin4fjords:eager-pyfasta-index-build

@pinin4fjords

Fixes #70.

prepare_transcripts writes transcripts_sequence.fa but doesn't build its pyfasta index. Downstream RiboCode steps (RiboCode, RiboCode_onestep) then trigger pyfasta's lazy first-read index build, which writes transcripts_sequence.fa.gdx and transcripts_sequence.fa.flat next to the FASTA. That breaks any deployment where the annotation directory isn't the consumer's own writable working directory:

  • read-only /mnt or NFS mounts on shared HPC infrastructure
  • container bind mounts published :ro
  • workflow engines that stage the annotation as a symlink into each consumer task (writes follow the symlink back to the producer; parallel consumers then race)
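The symlink case can be reproduced with the standard library alone. The sketch below is illustrative and does not call pyfasta: the directory names and the `write_sidecar` helper are hypothetical, standing in for any reader that writes `<fasta>.gdx` next to the FASTA it opens.

```python
import os
import tempfile


def write_sidecar(fasta_path):
    """Hypothetical stand-in for a lazy index build: writes <fasta>.gdx beside the FASTA."""
    sidecar = fasta_path + ".gdx"
    with open(sidecar, "w") as handle:
        handle.write("index placeholder\n")
    return sidecar


def demo_symlinked_staging():
    """Show that a write through a staged symlink lands in the producer's directory."""
    root = tempfile.mkdtemp()

    # Producer publishes an annotation directory containing only the FASTA.
    producer = os.path.join(root, "producer", "annotation")
    os.makedirs(producer)
    with open(os.path.join(producer, "transcripts_sequence.fa"), "w") as handle:
        handle.write(">tx1\nACGT\n")

    # The workflow engine stages that directory into the consumer task as a symlink.
    task_dir = os.path.join(root, "task")
    os.makedirs(task_dir)
    staged = os.path.join(task_dir, "annotation")
    os.symlink(producer, staged)

    # The consumer opens the FASTA through the staged path and triggers the write.
    sidecar = write_sidecar(os.path.join(staged, "transcripts_sequence.fa"))

    # Resolve the sidecar's real location so the caller can compare it to the producer.
    return os.path.realpath(sidecar), producer
```

The resolved sidecar path sits under the producer's directory, so every consumer staged from the same annotation writes to the same file.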

This PR adds one line to the end of processTranscripts(...) so the indexes are built in the producing call. The GenomeSeq constructor uses the same get_chrom key_fn downstream readers use, so the sidecars produced are byte-identical to what pyfasta would otherwise write lazily.
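A consumer-side check along these lines can confirm that a published annotation directory is complete before a read-only task opens it. This is only a sketch: `missing_sidecars` is a hypothetical helper, not part of RiboCode or pyfasta, and it relies solely on the `.gdx` and `.flat` suffixes named above.

```python
import os

# Suffixes of the sidecar files written next to the FASTA.
SIDECAR_SUFFIXES = (".gdx", ".flat")


def missing_sidecars(fasta_path):
    """Return the sidecar paths that do not exist next to fasta_path."""
    return [
        fasta_path + suffix
        for suffix in SIDECAR_SUFFIXES
        if not os.path.exists(fasta_path + suffix)
    ]
```

An empty list means both sidecars are already present, so a downstream reader should have nothing left to write.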

Verification

Ran the patched prepare_transcripts against the nf-core/test-datasets chr20 fixture (Homo_sapiens.GRCh38.111_chr20.gtf + Homo_sapiens.GRCh38.dna.chromosome.20.fa):

  • Eager-built sidecars: 4981ecea... (.gdx) / e99a891b... (.flat)
  • Fresh post-hoc pyfasta open of the same FASTA: same md5s
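The comparison can be scripted with `hashlib`. This is a minimal sketch, assuming the eager-built and lazily built sidecars have been kept in two separate directories. `md5_of` and `sidecars_match` are hypothetical helper names, and the directory layout is an assumption, not the exact procedure used to obtain the digests above.

```python
import hashlib
import os


def md5_of(path, chunk_size=1 << 20):
    """Return the hex md5 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def sidecars_match(eager_dir, lazy_dir, fasta_name="transcripts_sequence.fa"):
    """True if the .gdx and .flat files in both directories have equal md5 digests."""
    return all(
        md5_of(os.path.join(eager_dir, fasta_name + suffix))
        == md5_of(os.path.join(lazy_dir, fasta_name + suffix))
        for suffix in (".gdx", ".flat")
    )
```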

No CLI / API surface change; existing callers get a slightly more complete annotation directory. Happy to gate this behind a --prebuild-indexes flag if you'd prefer to keep the current default behaviour.

When `prepare_transcripts` writes `transcripts_sequence.fa` but doesn't
build its pyfasta index, downstream RiboCode steps (`RiboCode`,
`RiboCode_onestep`) trigger pyfasta's lazy first-read index build, which
writes `transcripts_sequence.fa.gdx` and `transcripts_sequence.fa.flat`
*next to the FASTA*. That breaks any deployment where the annotation
directory is not the consumer's own writable working directory:

- read-only `/mnt` or NFS mounts on shared HPC infrastructure
- container bind mounts published `:ro`
- workflow engines that stage the annotation as a symlink into each
  consumer task (writes follow the symlink back to the producer;
  parallel consumers then race)

Building the indexes once in the producing call makes the published
annotation set complete and removes the lazy-write path. The
`GenomeSeq` constructor uses the same `get_chrom` key_fn downstream
readers use, so the `.gdx`/`.flat` produced are byte-identical to what
pyfasta would otherwise write lazily (verified md5-equal on the
nf-core/test-datasets chr20 fixture).

Closes xryanglab#70.

Development

Successfully merging this pull request may close these issues.

prepare_transcripts: build pyfasta .gdx/.flat sidecars eagerly so downstream tasks don't write to staged inputs
