Write Log.out and Log.progress.out with real STAR-equivalent content #55

@pinin4fjords

Summary

rustar doesn't write Log.out (verbose run log) or Log.progress.out (per-chunk progress timestamps), although STAR writes both alongside Log.final.out. Consumers that parse these files for parameter dumps, per-phase progress, warnings, memory usage, or chunk-level mapping rates get nothing.

The goal is real content parity, not stubs that mimic STAR's section-header structure with placeholder content. Files that look like STAR's verbose log but carry only a {:#?} Debug dump of params and three timestamps mislead consumers more than absent files would: they pass file-existence checks but fail every actual parse.

STAR reference behaviour

  • Log.out is the verbose run log, written incrementally by source/InOutStreams.cpp plus parameter-dump and per-phase update calls scattered across source/Parameters.cpp, source/Aligner.cpp, and source/sjdbInsertJunctions.cpp. Content:
    • Full parameter dump with every default value (STAR's parameter format, one name<TAB>value line per parameter).
    • Per-phase progress messages (..... loading genome, ..... started mapping, ..... finished mapping, etc.) with timestamps.
    • Warnings (WARNING --X ...) and informational notes emitted during run.
    • Final timing and memory usage info.
  • Log.progress.out is updated periodically (roughly every minute) during alignment, one line per chunk reporting reads processed and mapping speed.
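
A file-existence check cannot tell a stub from a real log, so a content-level check is one way to make "parity" testable. The following is a minimal sketch, not part of rustar or STAR: `log_has_phases` is a hypothetical helper, and the marker strings are taken from the phase messages listed above, so the exact wording should be confirmed against a real STAR Log.out before relying on it.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of rustar or STAR): succeed only if the given
# log file is non-empty and contains every marker passed after the file name.
# Markers are matched as fixed strings, not regular expressions.
log_has_phases() {
    local file=$1
    shift
    local marker
    # A missing or empty file fails outright.
    [ -s "$file" ] || return 1
    for marker in "$@"; do
        grep -qF -- "$marker" "$file" || return 1
    done
    return 0
}
```

For example, `log_has_phases RUS./Log.out "..... started mapping" "..... finished mapping"` would fail for a missing file or for a stub that lacks the phase lines.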

Reproducer

#!/usr/bin/env bash
set -euo pipefail
mkdir -p /tmp/rustar-mre-logout && cd /tmp/rustar-mre-logout

BASE=https://raw.githubusercontent.com/nf-core/test-datasets/626c8fab639062eade4b10747e919341cbf9b41a
curl -fsLO $BASE/reference/genome.fasta
curl -fsL  $BASE/reference/genes_with_empty_tid.gtf.gz | gunzip -c > genes.gtf
curl -fsLO $BASE/testdata/GSE110004/SRR6357072_1.fastq.gz
curl -fsLO $BASE/testdata/GSE110004/SRR6357072_2.fastq.gz

RUSTAR=ghcr.io/scverse/rustar-aligner:dev
STAR=community.wave.seqera.io/library/htslib_samtools_star_gawk:ae438e9a604351a4

mkdir -p idx-rustar idx-star
docker run --rm -v $PWD:/w -w /w $RUSTAR rustar-aligner --runMode genomeGenerate \
    --genomeDir idx-rustar --genomeFastaFiles genome.fasta --sjdbGTFfile genes.gtf \
    --sjdbOverhang 100 --genomeSAindexNbases 7
docker run --rm -v $PWD:/w -w /w $STAR STAR --runMode genomeGenerate \
    --genomeDir idx-star --genomeFastaFiles genome.fasta --sjdbGTFfile genes.gtf \
    --sjdbOverhang 100 --genomeSAindexNbases 7

COMMON=(--readFilesIn SRR6357072_1.fastq.gz SRR6357072_2.fastq.gz --readFilesCommand zcat
        --runThreadN 4 --sjdbGTFfile genes.gtf --twopassMode Basic --runRNGseed 0
        --outSAMtype BAM Unsorted)

docker run --rm -v $PWD:/w -w /w $RUSTAR rustar-aligner \
    --genomeDir idx-rustar "${COMMON[@]}" --outFileNamePrefix RUS.
docker run --rm -v $PWD:/w -w /w $STAR STAR \
    --genomeDir idx-star "${COMMON[@]}" --outFileNamePrefix STAR.

echo "=== STAR Log* files ==="; ls STAR.Log*
echo "=== rustar Log* files ==="; ls RUS./Log*

Observed: STAR writes STAR.Log.final.out, STAR.Log.out, STAR.Log.progress.out. rustar writes only RUS./Log.final.out.
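
To turn the observation into something scriptable, the two file sets can be compared by name. This is a rough sketch with hypothetical helper names (`log_names`, `missing_logs`); it assumes the prefixes used in the reproducer (`STAR.` for STAR, `RUS./` for rustar) and only compares which Log* files exist, not what they contain.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the Log* file names found under a prefix, with the
# prefix stripped, one per line, sorted. Prints nothing if no file matches.
log_names() {
    local prefix=$1
    local f
    for f in "$prefix"Log*; do
        # An unmatched glob stays literal, so skip anything that does not exist.
        if [ -e "$f" ]; then
            printf '%s\n' "${f#"$prefix"}"
        fi
    done | LC_ALL=C sort
}

# Hypothetical helper: print the Log* names that exist under the first prefix
# but have no counterpart under the second prefix.
missing_logs() {
    local name
    log_names "$1" | while IFS= read -r name; do
        if [ ! -e "$2$name" ]; then
            printf '%s\n' "$name"
        fi
    done
}
```

Run from the reproducer directory as `missing_logs STAR. RUS./`. Given the file sets reported above, this would list Log.out and Log.progress.out.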

Suggested approach

This is structural: Log.out needs progress hooks during the long-running phases (genome load, suffix-array build, per-chunk alignment) so that events can be written as they happen rather than at the end. Log.progress.out needs a periodic writer separate from the main alignment loop. Both need STAR-format parameter dumps and warning emission paths.

This is not a one-PR drive-by; it is deferred until someone commits to the content fidelity. Stubs are not the goal: see the rejected approach in the conversation on PR #44.

Severity

Low. Today nf-core/rnaseq works around this with optional: true outputs. The gap affects provenance / QC tooling that parses STAR's verbose log.


Filed during nf-core/rnaseq integration testing (nf-core/rnaseq#1855). Split out from #28.
