Add a Reference Sequences documentation section. #38
whitwham merged 1 commit into samtools:gh-pages from
reference_seqs.md
Outdated
| While this feature was added for efficient storage of de novo assemblies, it may still be worth considering for data aligned against widely known reference genomes when wanting a self-contained CRAM that has no requirements on external data. | ||
| This may seem to remove the benefits of reference based encoding, but there will usually be many alignments covering the same location so this is a form of deduplication analogous to LZ compression techniques. | ||
| The reference is stored per CRAM slice and is the portion of reference covered by the alignments within that slice. |
portion of reference -> portion of the reference
Maybe rewrite the entire sentence?
I'll change it to "Each CRAM slice stores the portion of the reference covered by alignments within that slice."
reference_seqs.md
Outdated
| The reference is stored per CRAM slice and is the portion of reference covered by the alignments within that slice. | ||
| This obviously makes internal / embedded references only compatible with coordinate sorted aligned records. | ||
| The reference stored in the slice can be directly copied from the reference used during alignment, or created on-the-fly by computing the consensus from the alignments. | ||
| This is controlled in htslib with the `embed_ref` or `embed_ref=1` option, which uses an externally supplied reference, and `embed_ref=2` which creates a new consensus reference. |
htslib -> HTSlib
Used both ways in the text. Probably should stick to one.
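To make the `embed_ref=2` idea in the quoted text concrete, here is a minimal sketch of building a consensus from alignments by per-position majority vote. It is an illustration only, not HTSlib's actual algorithm: the `consensus` helper, the `(position, sequence)` input form, and the use of `N` for uncovered positions are all assumptions, and it ignores indels and base qualities.

```python
from collections import Counter


def consensus(alignments, start, end):
    """Majority-vote consensus over [start, end).

    Hypothetical helper: `alignments` is an iterable of (pos, seq) pairs
    with 0-based start positions and gap-free sequences. Positions with
    no coverage are reported as 'N'.
    """
    counts = [Counter() for _ in range(end - start)]
    for pos, seq in alignments:
        for offset, base in enumerate(seq):
            i = pos + offset - start
            if 0 <= i < len(counts):
                counts[i][base] += 1
    # Pick the most frequent base at each covered position.
    return "".join(c.most_common(1)[0][0] if c else "N" for c in counts)


# Three overlapping reads covering positions 0..6; position 7 is uncovered.
reads = [(0, "ACGT"), (2, "GTTA"), (3, "TTAC")]
print(consensus(reads, 0, 8))  # ACGTTACN
```

Because the reads in a slice overlap heavily, one consensus copy of the region stands in for the bases shared by many alignments, which is the deduplication effect the text describes.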
reference_seqs.md
Outdated
| However this is limited to http, https and ftp URIs so it is recommended to escape literal colons within URIs by using a double-colon instead. | ||
| Note it is faster to fetch sequences from an MD5sum-structured local directory than from a local FASTA file, as there is no parsing or validation required on the sequences. | ||
| Indeed htslib typically accesses the MD5sum sequence using the UNIX `mmap()` system call, which also reduces memory usage when many processes are accessing the same reference file. |
Done here and other instances. I had a mixture it seemed.
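As a rough illustration of the MD5sum-structured directory and `mmap()` access mentioned in the quoted text, the sketch below stores a sequence under a path derived from its MD5 checksum and reads it back through a read-only memory map. The function names are hypothetical, the two-level `xx/yy/rest` layout is an assumption modelled on the commonly used `%2s/%2s/%s` cache pattern, and the cleaning step is a simplification (it only strips whitespace and uppercases).

```python
import hashlib
import mmap
import os
import tempfile


def clean(seq):
    # Simplification: drop all whitespace and uppercase the remainder.
    return "".join(seq.split()).upper().encode("ascii")


def m5_checksum(seq):
    # MD5 of the cleaned sequence, as a hex string.
    return hashlib.md5(clean(seq)).hexdigest()


def cache_path(root, digest):
    # Assumed layout: first two hex chars, next two, then the rest.
    return os.path.join(root, digest[:2], digest[2:4], digest[4:])


def store(root, seq):
    digest = m5_checksum(seq)
    path = cache_path(root, digest)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "wb") as fh:
        fh.write(clean(seq))
    return digest


def fetch(root, digest):
    # No parsing is needed: the file holds only the sequence bytes.
    with open(cache_path(root, digest), "rb") as fh:
        with mmap.mmap(fh.fileno(), 0, access=mmap.ACCESS_READ) as view:
            return view[:]


with tempfile.TemporaryDirectory() as root:
    digest = store(root, "acgt\nACGT\n")
    print(fetch(root, digest))  # b'ACGTACGT'
```

Since the lookup key is the checksum itself, finding a sequence is a single path computation, and read-only mappings of the same file can share pages between processes.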
Thanks for the review. I've made those changes and also done some restructuring following our discussion during the meeting. It could perhaps be restructured further. E.g.:
Thoughts? It's already much better than what we had before, which was basically just a terse section of the samtools.1 man page.
I like the new version and I think it works well.
This covers a lot of material about how CRAM references are retrieved and how to efficiently use a reference cache.
If we accept this, it may be good to modify the htslib error messages to point to this URL to aid debugging of reference sequence problems.