samtools sort -n moves two mates of a same read apart #520
Comments
It would help to have a concrete example so we have a test case and also better understand the issue. I don't know what various programs can cope with and what they choke on. Technically all of our name sorted files are valid (it only ever stated name sorted, nothing more), but if there is community consensus on what sub-ordering is appropriate for name sorted data then we can change it. I tried the example from #258 in samtools, biobambam and picard to see what orders we got. Samtools and biobambam both appear to use stable sorts with a minimal secondary sort order (/1 vs /2 read ends). So if a supplementary READ1 read comes first in the input data, it'll be first in the output data. Picard uses a secondary ordering that includes the flags in some way, so the data always puts primary reads ahead of their supplementary reads. The example data from #258 looks like this when sorted by Picard:
Is this what samtools needs to do too? Making them consistent would be useful. |
For reference, I found the picard query name comparison here: In short it's first by name, then by fwd/rev status, then if they still match by complemented flag, then by primary flag, next by supplementary flag, and finally by "HI" aux tag. Edit: supplementary is in there. |
Something like this is an approximation of the picard method. It tries to reorder the flag bits in the order of READ1, READ2, COMPLEMENTED, SECONDARY, SUPPLEMENTARY, everything-else. Needs more checking...
|
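The ordering described above can be sketched as a sort key. This is a minimal Python illustration, not the samtools or Picard code: the name `picard_like_key` is hypothetical, read names are compared as plain strings (not with samtools' `strnum_cmp`), the read end is taken from SAM flag bits 0x40/0x80, and "COMPLEMENTED" is assumed to mean the reverse-strand bit 0x10.

```python
# SAM flag bits, as defined in the SAM specification.
READ1 = 0x40
READ2 = 0x80
REVERSE = 0x10          # assumed meaning of "COMPLEMENTED"
SECONDARY = 0x100
SUPPLEMENTARY = 0x800
RANKED = READ1 | READ2 | REVERSE | SECONDARY | SUPPLEMENTARY


def picard_like_key(name, flag):
    """Hypothetical key: name, read end, strand, secondary, supplementary, rest."""
    read_end = (flag >> 6) & 3  # 1 for READ1, 2 for READ2, 0 if neither is set
    return (
        name,
        read_end,
        bool(flag & REVERSE),
        bool(flag & SECONDARY),
        bool(flag & SUPPLEMENTARY),
        flag & ~RANKED,         # every other bit, as a final tie-breaker
    )


# A supplementary READ1 record that precedes its primary in the input.
records = [("c1", 2113), ("c1", 129), ("c1", 65)]
ordered = sorted(records, key=lambda r: picard_like_key(*r))
print(ordered)
```

With this key the primary READ1 record sorts ahead of its supplementary record regardless of input order, which is the behaviour attributed to Picard earlier in the thread.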
@jkbonfield Sorry for my late reply. I guess what RSEM wants is a stable sort that orders reads by name only, even without the /1 vs /2 comparison. If you wish, it is OK to put primary alignments before secondary alignments. The reason is that, for RNA-Seq data, the two mates of an alignment are generally adjacent in the aligner output. Thus, if we sort only by name, we keep that adjacent structure. Because all RNA-Seq transcript quantification tools rely on the assumption that the two mates of a paired-end read alignment are adjacent, keeping this property will be **critical** for all RNA-Seq data sets. |
@jkbonfield , here is an example using the SAM lines you have shown: The input BAM file (produced by aligners) normally looks like: c1 65 xx 1 1 10M * 0 0 * * You can see that the two mates of an alignment are always adjacent to each other. In addition, one read's alignments are always grouped together. However, the alignments are not sorted by read name. This is because people normally use multiple threads to align their reads, and the order in which the threads finish aligning reads is nondeterministic. Multi-threading therefore also means the aligner output is not identical from run to run, which may lead downstream programs to produce slightly different results each time due to floating point issues. One easy way to fix this is to sort the alignment file by read name so that the outputs become identical. After sorting, the ideal output of the above example should be: a1 65 xx 9 1 10M * 0 0 * * |
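The requested property can be illustrated with any stable sort. A minimal Python sketch, assuming records are hypothetical (name, flag, position) tuples in the style of the example above; Python's `sorted` is guaranteed to be stable, so records with equal names keep their input order:

```python
# Aligner-style output: mates adjacent, each read's alignments grouped,
# but reads not in name order. All values here are made up for illustration.
records = [
    ("c1", 65, 1), ("c1", 129, 1),
    ("a1", 65, 9), ("a1", 129, 9),
    ("b1", 65, 5), ("b1", 129, 5),
]


def name_only_sort(recs):
    """Stable sort on the read name alone; ties keep their input order."""
    return sorted(recs, key=lambda r: r[0])


for rec in name_only_sort(records):
    print(*rec, sep="\t")
```

Because no flag bit takes part in the key, the mate order inside each read is whatever the aligner wrote, so this only helps when the input already has mates adjacent.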
@jkbonfield, so any secondary sort order works for me, provided the read1 vs read2 property is not part of it. Hope it is clear to you :) |
@jkbonfield , another related issue: when comparing read names, maybe we should compare only the characters after >/@ and before the first white space. In real data, the read name normally sits after >/@ and before the first white space. Whatever follows is normally annotation and should not count as part of the read name. Comparing only those characters would make samtools sort -n more robust. For example, Illumina paired-end reads might look like: // first mate of one read // second mate of one read Some aligners may fail to strip the annotation of the second mate and result in the following read names in SAM/BAM file: SRR12345.3 65 xx x x 10M * 0 0 * * If we sort by the full name, we will move the two mates of the same read apart. |
@jkbonfield , FYI, I pasted my modified bam_sort.c function. The following code changes strnum_cmp's behavior to consider only the characters before a space. // Added for RSEM by Bo Li: Test if the query name should stop // Modified for RSEM by Bo Li: Let the query name stops whenever encounters a space |
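The idea of comparing only the part of the name before the first white space can be sketched as follows. This is a Python illustration under stated assumptions, not the modified `strnum_cmp`: the helper name `qname_core` and the annotation text `length=100` are hypothetical, and names are compared as plain strings, without the numeric handling that `strnum_cmp` applies.

```python
def qname_core(name):
    """Return the part of a read name before the first whitespace.

    A leading '>' or '@' (FASTA/FASTQ header marker) is dropped as well.
    """
    stripped = name.lstrip(">@")
    parts = stripped.split(None, 1)
    return parts[0] if parts else ""


# The first record still carries an annotation that was not stripped.
names = ["SRR12345.3 length=100", "SRR12345.10", "SRR12345.3"]

by_full_name = sorted(names)                 # annotation takes part in the comparison
by_core_name = sorted(names, key=qname_core) # annotation ignored, ties stay in input order
print(by_full_name)
print(by_core_name)
```

Sorting on the truncated name treats the annotated and unannotated records as equal, so a stable sort leaves them next to each other in input order.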
@jkbonfield , FYI, the code below changes the behavior of heap_lt: // Function to compare reads in the heap and determine which one is < the other |
@jkbonfield , FYI, the code below changes the behavior of bam1_lt: // Function to compare reads and determine which one is < the other |
@jkbonfield , hope it helps. Thanks, |
This was already considered in #521. |
@jmarshall Can you remind me of your decision? Is it, for now, not to consider this issue? Thanks, |
@jmarshall , if you can add an option to ignore the characters after the white spaces, that will be great! |
@bli25wisc Please see the previous reply on #521. It would be great if you would respond to the question there re which aligners you have observed exhibiting this bug. |
The SAM specification states that the first column is the query name. Anything in there that is NOT the query name is a bug in whatever produced the data, so we shouldn't be pandering to broken software IMO. Instead I recommend you submit a request to fix the aligner or fastq processor that failed to honour the tradition of fastq ">qname comment". The ordering is a different issue. |
Maybe I'm being dense, but I don't see how your sort code above achieves what you want it to achieve. Given 1 primary alignment and 1 supplementary alignment, you are simply saying that the flags are irrelevant - so no read1/read2 matters and neither does primary/supplementary. The only thing you sort on is name. That doesn't forcibly put the mates next to each other at all, but relies on blind luck doesn't it? (Or perhaps if it's a stable sort algorithm, relies on them being in that order in the position sorted file.) What about 1 primary and 2 supplementary, so 3 sites covered. Is it then the requirement of main sort order being name and secondary sort order being position? Is it possible for your sites to be overlapping in any way? That is, the read with the greatest alignment position in the 1st supplementary pair lies to the right of (has a greater pos than) the left-most position of the 2nd supplementary pair. Forgive me my complete lack of knowledge about what RNASeq data looks like in practice. |
Hi @jkbonfield , the reality, not only for RNA-Seq but for all aligner-produced output, is that the two mates of one paired-end read are always adjacent to each other. In addition, aligner output is not sorted by coordinate unless you ask for it. Thus, provided we have a stable sort, sorting by name only keeps the two mates adjacent. Hope it helps, |
On Tue, Mar 15, 2016 at 02:08:02AM -0700, Bo Li wrote:
I see two common possibilities for input data here:
I still don't see how a stable sort gets you to your desired name sort [...] Is your code simply working because position sorted data resorted with [...] Is it better therefore to make this explicit (position 2ndary, to cope [...] James |
I think @bli25wisc has a point. It would be good to have a new sorting order that sorts by name only, not looking at any other fields. |
A name-only sort does at least mean you can perform any secondary sort term manually and then re-sort by name to bring reads together via a second field. I assume the second field necessary for RNA seq is indeed coordinate, hence why the fix works. As a general tool it's not an efficient way to go, but quick and easy to implement. Also agree we'd want a new sort option for this. We cannot simply change the current sort as some tools rely on the read flag ordering (eg those dealing with fastq type data). |
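The two-step approach described here (sort by the secondary term first, then re-sort by name with a stable name-only sort) can be sketched in Python. The records are hypothetical (name, flag, position) tuples, and the two mates of a pair share one position value purely to keep the example small.

```python
from operator import itemgetter

# Hypothetical (name, flag, pos) records in arbitrary order.
records = [
    ("r2", 129, 300), ("r1", 65, 500), ("r1", 129, 100),
    ("r2", 65, 300), ("r1", 65, 100), ("r1", 129, 500),
]


def two_pass_sort(recs):
    """Sort by position first, then stably by name only."""
    by_pos = sorted(recs, key=itemgetter(2))   # pass 1: the secondary term
    return sorted(by_pos, key=itemgetter(0))   # pass 2: stable, name only


for rec in two_pass_sort(records):
    print(*rec, sep="\t")
```

Since the second pass is stable, records with the same name stay in the position order established by the first pass, which matches a single sort on (name, position) at the cost of sorting twice.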
I second that a name-only sort option would be beneficial. |
Hi, I have the following SAM records in an input file:
with the mate pair order for F0004 being severely disrupted. Is this intended? Best, Sven |
@sven0schuierer It's the way it has always worked, see the discussion above. |
Hi Rob, |
Maybe not what you'd like, but something like this perhaps is a workaround:
It's strictly sorting by 1st field as primary (NAME) and 2nd field as secondary when primary is identical (FLAG). That's strictly numerical ascending order rather than bit-flag aware, but is that sufficient? If so then it's possible this could be implemented in future samtools releases, but having a workaround for now helps. Going in and out of SAM format isn't ideal, but it shouldn't be that tragic for performance if you have a fast disk for the intermediate sort files. On my basic 4-core desktop with a slow spinny disk, it took 1m8s to sort 10 million BAM records via unix sort, and an identical 1m8s via samtools sort. That was using all 4 cores. However on larger files there may be a difference as it starts writing out different sized temporary files and a corresponding difference in I/O pressure. (I'm unsure how SAM | zstd compares to level-1 BAM.) Edit: spotted sort --compress-program=zstd is an option. With that the intermediate files are lightly compressed, reducing I/O bottlenecks. So it may help on larger files. |
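The ordering described in this comment (NAME as primary key, FLAG as a plain numeric secondary key) can be sketched in Python on SAM text lines. It illustrates the ordering only and is not the command from the comment; the function name `sort_sam_lines` and the sample records are hypothetical, and header lines are assumed to be exactly those starting with `@`.

```python
def sort_sam_lines(lines):
    """Sort SAM records by (NAME, numeric FLAG), keeping header lines first."""
    header = [ln for ln in lines if ln.startswith("@")]
    body = [ln for ln in lines if not ln.startswith("@")]

    def key(line):
        fields = line.split("\t")
        return (fields[0], int(fields[1]))  # FLAG compared as a number, not bitwise

    return header + sorted(body, key=key)


sam = [
    "@HD\tVN:1.6\tSO:unsorted",
    "c1\t129\txx\t1\t1\t10M\t*\t0\t0\t*\t*",
    "a1\t129\txx\t9\t1\t10M\t*\t0\t0\t*\t*",
    "c1\t65\txx\t1\t1\t10M\t*\t0\t0\t*\t*",
    "a1\t65\txx\t9\t1\t10M\t*\t0\t0\t*\t*",
]

for line in sort_sam_lines(sam):
    print(line)
```

As the comment notes, a numeric comparison of FLAG is not bit-flag aware, so it puts flag 65 before 129 but says nothing meaningful about, for example, secondary versus supplementary records.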
Hi James, |
Dear samtools developers,
This is Bo Li, the developer of RSEM. When I used samtools sort -n to make sure the alignments are grouped by read name, I found an issue. If we have paired-end reads, samtools sort -n places the first mates of a read's different alignments together, followed by the second mates. Because most RNA-Seq transcript quantification programs assume that the two mates of a read are adjacent to each other, this behavior of samtools is inconvenient.
I wonder if it is possible to change the behavior of -n or provide a new option that does not move the mates of the same read apart.
Thanks,
Bo
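The reported behaviour can be reproduced in a small Python sketch. It assumes the sort key is (name, read end), following the description in the comments of a minimal secondary order on /1 vs /2; the records are hypothetical (name, flag, position) tuples, and this is not the samtools code.

```python
# One read with two paired alignments, in the order an aligner would emit them.
records = [
    ("r1", 65, 100), ("r1", 129, 100),    # alignment 1: mate 1, mate 2
    ("r1", 321, 700), ("r1", 385, 700),   # alignment 2 (secondary): mate 1, mate 2
]


def read_end(flag):
    """1 for READ1 (0x40), 2 for READ2 (0x80), taken from the SAM flag."""
    return (flag >> 6) & 3


# Assumed key: name first, then read end.
by_name_and_end = sorted(records, key=lambda r: (r[0], read_end(r[1])))
# Name-only key: a stable sort leaves the mates where the aligner put them.
by_name_only = sorted(records, key=lambda r: r[0])

print([r[1] for r in by_name_and_end])
print([r[1] for r in by_name_only])
```

With the read end in the key, both first mates come out ahead of both second mates, so the mates of each alignment are no longer adjacent; with the name-only key the input order is preserved.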