Skip to content

Comments

Add a Reference Sequences documentation section.#38

Merged
whitwham merged 1 commit intosamtools:gh-pagesfrom
jkbonfield:refseq
May 14, 2025
Merged

Add a Reference Sequences documentation section.#38
whitwham merged 1 commit intosamtools:gh-pagesfrom
jkbonfield:refseq

Conversation

@jkbonfield
Copy link
Contributor

This covers a lot of material about how CRAM references are retrieved and how to efficiently use a reference cache.

If we accept this is may be good to modify the htslib error messages to point to this URL to aid debugging of reference sequences.

Copy link
Member

@whitwham whitwham left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few things to look at.

While this feature was added for efficient storage of denovo assemblies, it may still be worth considering for data aligned against widely known reference genomes when wantng a self contained CRAM that has no requirements on external data.

This may seem to remove the benefits of reference based encoding, but there will usually be many alignments covering the same location so this is a form of deduplication analogous to LZ compression techniques.
The reference is stored per CRAM slice and is the portion of reference covered by the alignments within that slice.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

portion of reference -> portion of the reference

Maybe rewrite the entire sentence?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll change it to "Each CRAM slice stores the portion of the reference covered by alignments within that slice."

The reference is stored per CRAM slice and is the portion of reference covered by the alignments within that slice.
This obviously makes internal / embedded references only compatible with coordinate sorted aligned records.
The reference stored in the slice can be directly copied from the reference used during alignment, or created on-the-fly by computing the consensus from the alignments.
This is controlled in htslib with the `embed_ref` or `embed_ref=1` option, which uses an externally supplied reference, and `embed_ref=2` which creates a new consensus reference.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

htslib -> HTSlib

Used both ways in the text. Probably should stick to one.

However this is limited to http, https and ftp URIs so it is recommended to escape literal colons within URIs by using a double-colon instead.

Note it is faster to fetch sequences from an MD5sum-structured local directory than fetching it from local FASTA file as there is no parsing or validation required on the sequences.
Indeed htslib typically accesses the MD5sum sequence using the UNIX `mmap()` system call, which also reduces memory usage when many processes are accessing the same reference file.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

htslib -> HTSlib

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done here and other instances. I had a mixture it seemed.

@jkbonfield
Copy link
Contributor Author

Thanks for the review. I've made those changes and also done some restructuring following our discussion during the meeting.

It could perhaps be restructured further. Eg:

  • Define search order (important)
  • The simplest example of using external references with -T ref.fa option
  • Explain inefficiencies of -T ref.fa (extra parsing) and the benefit of using REF_PATH for a pre-parsed md5sum directory tree. (We could perhaps use this as a point to list REF_CACHE instead, as it's amounting to much the same thing, except when doing remote lookups which are no longer automatic anyway).
  • Describe seq_cache_populate.pl script for prepopulating the REF_PATH file structure
  • Caching the EBI CRAM reference repository with HTSlib's ref-cache tool, and utilisation via REF_PATH and/or REF_CACHE.

Thoughts? It's already much better than what we had before, which was basically a terse section of samtools.1 man page only.

@whitwham
Copy link
Member

whitwham commented May 9, 2025

I like the new version and I think it works well.

This covers a lot of material about how CRAM references are retrieved
and how to efficiently use a reference cache.
@whitwham whitwham merged commit 5f1d7f8 into samtools:gh-pages May 14, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants