Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

samtools sort -n moves two mates of a same read apart #520

Open
bli25wisc opened this issue Feb 1, 2016 · 27 comments
Open

samtools sort -n moves two mates of a same read apart #520

bli25wisc opened this issue Feb 1, 2016 · 27 comments

Comments

@bli25wisc
Copy link

Dear samtools developers,

This is Bo Li, the developer of RSEM. When I used samtools sort -n to make sure the alignments are grouped by read names, I found an issue. If we have paired-end reads, samtools sort -n will arrange the first mates of different alignments together, and then followed by the second mates. Because most RNA-Seq transcript quantification programs assume that the two mates of a read should be adjacent to each other, this behavior of SAMtools is not convenient.

I wonder if it is possible to change the behavior of -n or provide a new option that does not move the mates of a same read apart.

Thanks,
Bo

@jkbonfield
Copy link
Contributor

It would help to have a concrete example so we have a test case and also better understand the issue. I don't know what various programs can cope with and what they choke on. Technically all of our name sorted files are valid (it only ever stated name sorted, nothing more), but if there is community consensus on what sub-ordering is appropriate for name sorted data then we can change it.

I tried the example from #258 in samtools, biobambam and picard to see what orders we got.

Samtools and biobambam both appear to use stable sorts with a minimal secondary sort order (/1 vs /2 read ends). So if a supplementary READ1 read comes first in the input data, it'll be first in the output data. Picard uses a secondary ordering that includes the flags in some way, so the data always puts primary reads ahead of their supplementary reads. The example data from #258 looks like this when sorted by Picard:

a1  65  xx  9   1   10M *   0   0   *   *
a1  2113    xx  5   1   10M *   0   0   *   *
a1  129 xx  1   1   10M *   0   0   *   *
a1  2177    xx  7   1   10M *   0   0   *   *
b1  65  xx  1   1   10M *   0   0   *   *
b1  2113    xx  5   1   10M *   0   0   *   *
b1  129 xx  9   1   10M *   0   0   *   *
b1  2177    xx  7   1   10M *   0   0   *   *
c1  65  xx  1   1   10M *   0   0   *   *
c1  2113    xx  5   1   10M *   0   0   *   *
c1  129 xx  9   1   10M *   0   0   *   *
c1  2177    xx  7   1   10M *   0   0   *   *

Is this what samtools needs to do too? Making them consistent would be useful.

@jkbonfield
Copy link
Contributor

For reference, I found the picard query name comparison here:

http://grepcode.com/file/repo1.maven.org/maven2/org.utgenome.thirdparty/picard/1.102.0/net/sf/samtools/SAMRecordQueryNameComparator.java?av=f

In short it's first by name, then by fwd/rev status, then if they still match by complemented flag, then by primary flag, next by supplementary flag, and finally by "HI" aux tag.

Edit: supplementary is in there.

@jkbonfield
Copy link
Contributor

Something like this is an approximation of the picard method. It tries to reorder the flag bits in the order of READ1, READ2, COMPLEMENTED, SECONDARY, SUPPLEMENTARY, everything-else.

Needs more checking...

static inline int bam1_lt(const bam1_p a, const bam1_p b)
{
    if (g_is_by_qname) {
        int t = strnum_cmp(bam_get_qname(a), bam_get_qname(b));
        if (t < 0) return 1;
        if (t > 0) return 0;
        uint32_t af = ((a->core.flag&0x0d0)<<12)
                     |((a->core.flag&0x100)<<9)
                     |((a->core.flag&0x800)<<5)
                     |a->core.flag;
        uint32_t bf = ((b->core.flag&0x0d0)<<12)
                     |((b->core.flag&0x100)<<9)
                     |((b->core.flag&0x800)<<5)
                     |b->core.flag;
        return af < bf;
        // Maybe bit reordering via debruijn sequence is possible?
    } else return (((uint64_t)a->core.tid<<32|(a->core.pos+1)<<1|bam_is_rev(a)) < ((uint64_t)b->core.tid<<32|(b->core.pos+1)<<1|bam_is_rev(b)));
}

@bli25wisc
Copy link
Author

@jkbonfield Sorry for my late reply.

I guess what RSEM wants is to have a stable sort for sorting reads by names only. Even without the /1 vs /2 comparison. If you wish, it is OK to put primary alignments before secondary alignments.

The reason is that generally, for RNA-Seq data, the two mates of an alignment are always together. Thus, if we sort only by name, we can keep the adjacent structure.

Because all RNA-Seq transcript quantification tools rely on the assumption that the two mates of an paired-end read alignment are adjacent, keeping this property will be ** critical ** for all RNA-Seq data sets.

@bli25wisc
Copy link
Author

@jkbonfield , here is an example using the SAM lines you have shown:

The input BAM file (produced by aligners) normally looks like:

c1 65 xx 1 1 10M * 0 0 * *
c1 129 xx 9 1 10M * 0 0 * *
c1 2113 xx 5 1 10M * 0 0 * *
c1 2177 xx 7 1 10M * 0 0 * *
a1 65 xx 9 1 10M * 0 0 * *
a1 129 xx 1 1 10M * 0 0 * *
a1 2113 xx 5 1 10M * 0 0 * *
a1 2177 xx 7 1 10M * 0 0 * *
b1 65 xx 1 1 10M * 0 0 * *
b1 129 xx 9 1 10M * 0 0 * *
b1 2113 xx 5 1 10M * 0 0 * *
b1 2177 xx 7 1 10M * 0 0 * *

You can see that the two mates of an alignment are always adjacent to each other. In addition, one read's alignments are always grouped together. However, the alignments are not sorted by read names. This is because normally people will use multiple threads to align their reads. The order each thread finishes aligning one read is nondeterministic.

Multi-threading of aligner also makes the alignment output are not identical each time. Thus, it may lead the downstream programs produce slightly different results each time due to floating point issues. One easy way to fix it is to sort the alignment file by read names and thus make them identical.

After sorting, the ideal output of the above example should be:

a1 65 xx 9 1 10M * 0 0 * *
a1 129 xx 1 1 10M * 0 0 * *
a1 2113 xx 5 1 10M * 0 0 * *
a1 2177 xx 7 1 10M * 0 0 * *
b1 65 xx 1 1 10M * 0 0 * *
b1 129 xx 9 1 10M * 0 0 * *
b1 2113 xx 5 1 10M * 0 0 * *
b1 2177 xx 7 1 10M * 0 0 * *
c1 65 xx 1 1 10M * 0 0 * *
c1 129 xx 9 1 10M * 0 0 * *
c1 2113 xx 5 1 10M * 0 0 * *
c1 2177 xx 7 1 10M * 0 0 * *

@bli25wisc
Copy link
Author

@jkbonfield, so for me, whatever secondary sort order works for me provided the read1 vs read2 property is not included in the secondary sort order. Hope it is clear to you :)

@bli25wisc
Copy link
Author

@jkbonfield , another related issue is, when comparing read names, maybe we can only compare the characters after >/@ and before the first white space.

In real data, normally the read name is presented after >/@ and before the first white space. The things after are normally some annotation and should not count as part of the read name.

Only comparing characters after >/@ and before the first white space will make SAMtools sort -n more robust. For example, Illumina paired-end reads might look like:

// first mate of one read
@SRR12345.3 1:XXX:YYY:ZZZ
ATCGGATCCC
+SRR12345.3 1:XXX:YYY:ZZZ
IIIIIIIIII

// second mate of one read
@SRR12345.3 2:XXX:YYY:ZZZ
ATATTATCCC
+SRR12345.3 2:XXX:YYY:ZZZ
IIIIIIIIII

Some aligners may fail to strip the annotation of the second mate and result in the following read names in SAM/BAM file:

SRR12345.3 65 xx x x 10M * 0 0 * *
SRR12345.3 2:XXX:YYY:ZZZ 129 xx x x 10M * 0 0 * *

If we sort by the full name, we will separate the two mates of a same read apart.

@bli25wisc
Copy link
Author

@jkbonfield , FYI, I pasted my modified bam_sort.c function. The following codes changed strnum_cmp's behavior to only consider the characters before spaces.

// Added for RSEM by Bo Li: Test if the query name should stop
static bool ok(char c) { return c > 0 && !isspace(c); }

// Modified for RSEM by Bo Li: Let the query name stops whenever encounters a space
static int strnum_cmp(const char __a, const char *b)
{
const unsigned char *a = (const unsigned char
)_a, b = (const unsigned char)_b;
const unsigned char _pa = a, *pb = b;
//while (_pa && _pb) { // Modified for RSEM by Bo Li
while (ok(_pa) && ok(_pb)) {
if (isdigit(_pa) && isdigit(_pb)) {
while (_pa == '0') ++pa;
while (_pb == '0') ++pb;
while (isdigit(_pa) && isdigit(_pb) && *pa == *pb) ++pa, ++pb;
if (isdigit(_pa) && isdigit(_pb)) {
int i = 0;
while (isdigit(pa[i]) && isdigit(pb[i])) ++i;
return isdigit(pa[i])? 1 : isdigit(pb[i])? -1 : (int)_pa - (int)_pb;
} else if (isdigit(_pa)) return 1;
else if (isdigit(_pb)) return -1;
else if (pa - a != pb - b) return pa - a < pb - b? 1 : -1;
} else {
if (_pa != _pb) return (int)_pa - (int)_pb;
++pa; ++pb;
}
}
//return *pa ? 1 : *pb ? -1 : 0; // Modified for RSEM by Bo Li
return ok(_pa)? 1 : ok(*pb)? -1 : 0;
}

@bli25wisc
Copy link
Author

@jkbonfield , FYI, the codes below change the behavior of heap_lt:

// Function to compare reads in the heap and determine which one is < the other
// Note by Bo Li: It actually returns which one is >
static inline int heap_lt(const heap1_t a, const heap1_t b)
{
if (g_is_by_qname) {
int t;
if (a.b == NULL || b.b == NULL) return a.b == NULL? 1 : 0;
t = strnum_cmp(bam_get_qname(a.b), bam_get_qname(b.b));
//return (t > 0 || (t == 0 && (a.b->core.flag&0xc0) > (b.b->core.flag&0xc0))); // Modified for RSEM by Bo Li
return t > 0; // In order to make sure the two mates of an alignment are adjacent to each other
} else return __pos_cmp(a, b);
}

@bli25wisc
Copy link
Author

@jkbonfield , FYI, the codes below change the behavior of bam1_lt

// Function to compare reads and determine which one is < the other
static inline int bam1_lt(const bam1_p a, const bam1_p b)
{
if (g_is_by_qname) {
int t = strnum_cmp(bam_get_qname(a), bam_get_qname(b));
//return (t < 0 || (t == 0 && (a->core.flag&0xc0) < (b->core.flag&0xc0))); // Modified for RSEM by Bo Li
return t < 0;
} else return (((uint64_t)a->core.tid<<32|(a->core.pos+1)<<1|bam_is_rev(a)) < ((uint64_t)b->core.tid<<32|(b->core.pos+1)<<1|bam_is_rev(b)));
}

@bli25wisc
Copy link
Author

@jkbonfield , hope it helps.

Thanks,
Bo

@jmarshall
Copy link
Member

another related issue is, when comparing read names, maybe we can only compare the characters after >/@ and before the first white space.

This was already considered in #521.

@bli25wisc
Copy link
Author

@jmarshall Can you remind me your decision? For now, try to not consider this issue?

Thanks,
Bo

@bli25wisc
Copy link
Author

@jmarshall , if you can add an option to ignore the characters after the white spaces, that will be great!

@jmarshall
Copy link
Member

@bli25wisc Please see the previous reply on #521. It would be great if you would respond to the question there re which aligners you have observed exhibiting this bug.

@jkbonfield
Copy link
Contributor

The SAM specification states that the first column is the query name. Anything in there that is NOT the query name is a bug in whatever produced the data, so we shouldn't be pandering to broken software IMO. Instead I recommend you submit a request to fix the aligner or fastq processor that failed to honour the tradition of fastq ">qname comment".

The ordering is a different issue.

@jkbonfield
Copy link
Contributor

Maybe I'm being dense, but I don't see how your sort code above achieves what you want it to achieve.

Given 1 primary alignment and 1 supplementary alignment, you are simply saying that the flags are irrelevant - so no read1/read2 matters and neither does primary/supplementary. The only thing you sort on is name. That doesn't forcibly put the mates next to each other at all, but relies on blind luck doesn't it? (Or perhaps if it's a stable sort algorithm, relies on them being in that order in the position sorted file.)

What about 1 primary and 2 supplementary, so 3 sites covered. Is it then the requirement of main sort order being name and secondary sort order being position? Is it possible for your sites to be overlapping in any way? So the read with the greatest alignment position of the 1st supplementary pair is rightmost (greater pos) than the left-most positon of the 2nd supplementary pair. Forgive me my complete lack of knowledge about what RNASeq data looks like in practice.

@bli25wisc
Copy link
Author

Hi @jkbonfield , the reality for not only RNA-Seq but all aligner produced outputs are: the two mates of one paired-end read are always adjacent to each other. In addition, the aligner outputs are not sorted by coordinates unless you ask it to do so. Thus, provided we have a stable sort, by only sorting the names, we will keep the adjacency between two mates.

Hope it helps,
Bo

@jkbonfield
Copy link
Contributor

On Tue, Mar 15, 2016 at 02:08:02AM -0700, Bo Li wrote:

Hi @jkbonfield , the reality for not only RNA-Seq but all aligner produced outputs are: the two mates of one paired-end read are always adjacent to each other. In addition, the aligner outputs are not sorted by coordinates unless you ask it to do so. Thus, provided we have a stable sort, by only sorting the names, we will keep the adjacency between two mates.

I see two common possibilities for input data here:

  1. You're using the data as produced by the aligner, already in read
    name order. If the aligner is collating mates together as you say
    then it's all fine and no sorting is needed. Is that correct?
    (I disagree that all aligners produce data in this order, but maybe
    the RNAseq aware ones do and that's the only thing that matters in
    this case.)

  2. The data has been sorted into positional order and you wish to
    essentially reverse this process; hence samtools sort -n.

I still don't see how a stable sort gets you to your desired name sort
order UNLESS your name sort order explicitly has position as its
secondary sort field.

Is your code simply working because position sorted data resorted with
a stable name(only) sort is identical to primary name + secondary
position sort? If so I understand, at last!

Is it better therefore to make this explicit (position 2ndary, to cope
with any file rather than only resorting from position sorted files),
or is it more desireable to have a name sort option which sorts on
name and ONLY name in a stable fashion, permitting completely
arbitrary sorts by anyone else. Albeit inefficiently. Eg sort by BI
index, then resort by name(only) => primary sort on name and secondary
on BI tag. Not as efficient as explicitly controlling the secondary
sort fields, but workable and flexible.

James

James Bonfield (jkb@sanger.ac.uk) | Hora aderat briligi. Nunc et Slythia Tova
| Plurima gyrabant gymbolitare vabo;
A Staden Package developer: | Et Borogovorum mimzebant undique formae,
https://sf.net/projects/staden/ | Momiferique omnes exgrabure Rathi.

The Wellcome Trust Sanger Institute is operated by Genome Research
Limited, a charity registered in England with number 1021457 and a
company registered in England with number 2742969, whose registered
office is 215 Euston Road, London, NW1 2BE.

@lh3
Copy link
Member

lh3 commented Mar 24, 2016

I think @bli25wisc has a point. It would be good to have a new sorting order that sorts by name only, not looking at any other fields.

@jkbonfield
Copy link
Contributor

A name-only sort does at least mean you can perform any secondary sort term manually and then re-sort by name to bring reads together via a second field. I assume the second field necessary for RNA seq is indeed coordinate, hence why the fix works.

As a general tool it's not an efficient way to go, but quick and easy to implement.

Also agree we'd want a new sort option for this. We cannot simply change the current sort as some tools rely on the read flag ordering (eg those dealing with fastq type data).

@luidale
Copy link

luidale commented Feb 19, 2018

I second that name-only sot option would be beneficial.

@sven0schuierer
Copy link

sven0schuierer commented Apr 27, 2020

Hi,

I have the following SAM records in an input file:

F0001	99	1	165727509	255	76M	=	165727661	227	CGGCAAAACAAAATGAAACAGATCGCTCCTTGTACATAAAACAGCTAAAAATTTGGCCTCTTTATTGATCTTATGT	AAAFFJAJJJJ7F77-FFJF-JFJ-AAFFFA7JA-FFJJJJJ-<FJJJFJJJJFJFJF-<7-FJJFFJJF<---<A	NH:i:1	HI:i:1	AS:i:143	nM:i:3
F0001	147	1	165727661	255	1S75M	=	165727509	-227	CAATCTCTAAATGCTCATAGACAGCCCACTTGCAGGGCATACACAGTGCCTTTAGAGTCGGTCCCACAGAAAATAC	-<<<--A--7-AJ<-FJA7--FFAF<7<-7JJJAJJJJF-<7AF-7-JFF7<7JF7<JAJFF<<JJJJJFFAA<-A	NH:i:1	HI:i:1	AS:i:143	nM:i:3
F0002	99	3	125497833	255	76M	=	125498079	314	CCTGTGTTAAAAACAGATAGAAGATTTTGTCTCTACAAAGGATGGGATGTGAAGTAATCCTCAAGAGAAAGTTTTC	AAFFFJJAJJJFAJJJJFJJJJJJJJFFFJJFJJJFJFJFJJJFJAFJJJJ<-F-<JAJJJJJJJFJAJJJJJ<FF	NH:i:1	HI:i:1	AS:i:130	nM:i:6
F0002	147	3	125498079	255	8S68M	=	125497833	-314	AAGACGTCGCACAACACTATGCGGATTGTAAACTAAAAGGTAGCTTCTGAAGAACTCAAATTCACTATATCGCCGC	-7-A7JJ7FA7JJFA<-F-A7-J-F<--<--<-A-F7AJA<-FAF<FF-7F-FJA<<AJJJJJAJJFJFJJFFFAA	NH:i:1	HI:i:1	AS:i:130	nM:i:6
F0003	163	5	57176305	255	24S52M	=	57214155	54557	GTTAAGTATCTTCAAAGAACTTCAATATTATTTCAGAGGATACAAAATAAAAATTTAAACTGGAAAATAAAGATTA	A-<FFJJJ<JJJFFJ-<JJ-<-F--FJFJ<-<FJ-<-JFF77-FAF<<-<FA-F-<-FJ--<FJ-FA-7FF--77-	NH:i:1	HI:i:1	AS:i:116	nM:i:6
F0003	83	5	57214155	255	39M7182N21M9449N16M	=	57176305	-54557	GCCTGGCTTAATTTCCCTACCTCTTCATCATCAACAAAGGTCCTGACCAGCAGTTTGAAATCGTCATTGAATTTTG	JJJJJA<JJJFJJJJJF<-<7-<--AAJJJFF-AFF<JJJJJJJFFJJFFAJJJJJJJFJAJJFFJJJJJJFFFAA	NH:i:1	HI:i:1	AS:i:116	nM:i:6
F0004	99	17	31034431	0	76M	=	31034477	122	CGGCGATCAGAGGGCGATGAAGTTCTAGATCCATTGAGACAAGCTCTAGACAGTACCATGCAGTGCCACAACTTGT	AAFFFJJJJJJAFJJ<J-FFFJFAFJJJJAFAJJFFJJJJFJFJAF-FFFF-<FJ-<AFAJJJJ--A-F-7F<-F<	NH:i:5	HI:i:1	AS:i:140	nM:i:5
F0004	147	17	31034477	0	76M	=	31034431	-122	TATACAGTAGCATGCAGTCCCACAACCTGTACCAGCATCCCCAGCTTCTGGCATTCCATGTTTCTGCTCCTGTGGC	<----FF7JJJJJJJ<FA---AJFA<--JJFJF-JA-JJ<AFFAF-AA-<7--JJJJFJJ-JFJJJF7JJJAF-<A	NH:i:5	HI:i:1	AS:i:140	nM:i:5
F0004	419	12	56117550	0	76M	=	56117596	122	GCCACAGGAGCAGAAACATGGAATGCCAGAAGCTGGGGATGCTGGTACAGGTTGTGGGACTGCATGCTACTGTATA	A<-FAJJJ7FJJJFJ-JJFJJJJ--7<-AA-FAFFA<JJ-AJ-FJFJJ--<AFJA---AF<JJJJJJJ7FF----<	NH:i:5	HI:i:2	AS:i:140	nM:i:5
F0004	339	12	56117596	0	76M	=	56117550	-122	ACAAGTTGTGGCACTGCATGGTACTGTCTAGAGCTTGTCTCAATGGATCTAGAACTTCATCGCCCTCTGATCGCCG	<F-<F7-F-A--JJJJAFA<-JF<-FFFF-FAJFJFJJJJFFJJAFAJJJJFAFJFFF-J<JJFAJJJJJJFFFAA	NH:i:5	HI:i:2	AS:i:140	nM:i:5
F0004	419	7	137513962	0	76M	=	137514008	122	GCCACAGGAGCAGAAACATGGAATGCCAGAAGCTGGGGATGCTGGTACAGGTTGTGGGACTGCATGCTACTGTATA	A<-FAJJJ7FJJJFJ-JJFJJJJ--7<-AA-FAFFA<JJ-AJ-FJFJJ--<AFJA---AF<JJJJJJJ7FF----<	NH:i:5	HI:i:3	AS:i:140	nM:i:5
F0004	339	7	137514008	0	76M	=	137513962	-122	ACAAGTTGTGGCACTGCATGGTACTGTCTAGAGCTTGTCTCAATGGATCTAGAACTTCATCGCCCTCTGATCGCCG	<F-<F7-F-A--JJJJAFA<-JF<-FFFF-FAJFJFJJJJFFJJAFAJJJJFAFJFFF-J<JJFAJJJJJJFFFAA	NH:i:5	HI:i:3	AS:i:140	nM:i:5
F0004	419	20	21755373	0	76M	=	21755419	122	GCCACAGGAGCAGAAACATGGAATGCCAGAAGCTGGGGATGCTGGTACAGGTTGTGGGACTGCATGCTACTGTATA	A<-FAJJJ7FJJJFJ-JJFJJJJ--7<-AA-FAFFA<JJ-AJ-FJFJJ--<AFJA---AF<JJJJJJJ7FF----<	NH:i:5	HI:i:4	AS:i:140	nM:i:5
F0004	339	20	21755419	0	76M	=	21755373	-122	ACAAGTTGTGGCACTGCATGGTACTGTCTAGAGCTTGTCTCAATGGATCTAGAACTTCATCGCCCTCTGATCGCCG	<F-<F7-F-A--JJJJAFA<-JF<-FFFF-FAJFJFJJJJFFJJAFAJJJJFAFJFFF-J<JJFAJJJJJJFFFAA	NH:i:5	HI:i:4	AS:i:140	nM:i:5
F0004	419	5	55944759	0	76M	=	55944805	122	GCCACAGGAGCAGAAACATGGAATGCCAGAAGCTGGGGATGCTGGTACAGGTTGTGGGACTGCATGCTACTGTATA	A<-FAJJJ7FJJJFJ-JJFJJJJ--7<-AA-FAFFA<JJ-AJ-FJFJJ--<AFJA---AF<JJJJJJJ7FF----<	NH:i:5	HI:i:5	AS:i:140	nM:i:5
F0004	339	5	55944805	0	76M	=	55944759	-122	ACAAGTTGTGGCACTGCATGGTACTGTCTAGAGCTTGTCTCAATGGATCTAGAACTTCATCGCCCTCTGATCGCCG	<F-<F7-F-A--JJJJAFA<-JF<-FFFF-FAJFJFJJJJFFJJAFAJJJJFAFJFFF-J<JJFAJJJJJJFFFAA	NH:i:5	HI:i:5	AS:i:140	nM:i:5
F0005	163	1	26472524	255	76M	=	26473504	1583	GCGAGAACGACCCCCGGACCGACCAAAGCCCGCGCGGCGCTGCATCCCGCGTACAGCACCTACGTCCCGCTACCGT	<-AFFJJJJJJJJJJ-JJJAJFF--AJFJ<JJJF<<-AFFJJAJ-F7-7-7A-A-A-AAA7FJFJ<A7-AJ-A<<J	NH:i:1	HI:i:1	AS:i:145	nM:i:4
F0005	83	1	26473504	255	24M175N30M352N22M	=	26472524	-1583	GATAAAGCAAAGGTGAAGGACGAACCACAGAGAAGTTCCGCGAGGTTGTCTGCTAAACCTGCTCCTCCAAAGCCAG	FJFAJFJAJFJ7JJFFF7AF7AF--<F-F-FAF-7-<<JJA--FAJJ<<-JJFJFJFJJF-JJJ<JJJFJJFFFAA	NH:i:1	HI:i:1	AS:i:145	nM:i:4

after sorting them by name (-n) with samtools 1.9 I obtain the following ordering:

F0001	99	1	165727509	255	76M	=	165727661	227	CGGCAAAACAAAATGAAACAGATCGCTCCTTGTACATAAAACAGCTAAAAATTTGGCCTCTTTATTGATCTTATGT	AAAFFJAJJJJ7F77-FFJF-JFJ-AAFFFA7JA-FFJJJJJ-<FJJJFJJJJFJFJF-<7-FJJFFJJF<---<A	NH:i:1	HI:i:1	AS:i:143	nM:i:3
F0001	147	1	165727661	255	1S75M	=	165727509	-227	CAATCTCTAAATGCTCATAGACAGCCCACTTGCAGGGCATACACAGTGCCTTTAGAGTCGGTCCCACAGAAAATAC	-<<<--A--7-AJ<-FJA7--FFAF<7<-7JJJAJJJJF-<7AF-7-JFF7<7JF7<JAJFF<<JJJJJFFAA<-A	NH:i:1	HI:i:1	AS:i:143	nM:i:3
F0002	99	3	125497833	255	76M	=	125498079	314	CCTGTGTTAAAAACAGATAGAAGATTTTGTCTCTACAAAGGATGGGATGTGAAGTAATCCTCAAGAGAAAGTTTTC	AAFFFJJAJJJFAJJJJFJJJJJJJJFFFJJFJJJFJFJFJJJFJAFJJJJ<-F-<JAJJJJJJJFJAJJJJJ<FF	NH:i:1	HI:i:1	AS:i:130	nM:i:6
F0002	147	3	125498079	255	8S68M	=	125497833	-314	AAGACGTCGCACAACACTATGCGGATTGTAAACTAAAAGGTAGCTTCTGAAGAACTCAAATTCACTATATCGCCGC	-7-A7JJ7FA7JJFA<-F-A7-J-F<--<--<-A-F7AJA<-FAF<FF-7F-FJA<<AJJJJJAJJFJFJJFFFAA	NH:i:1	HI:i:1	AS:i:130	nM:i:6
F0003	83	5	57214155	255	39M7182N21M9449N16M	=	57176305	-54557	GCCTGGCTTAATTTCCCTACCTCTTCATCATCAACAAAGGTCCTGACCAGCAGTTTGAAATCGTCATTGAATTTTG	JJJJJA<JJJFJJJJJF<-<7-<--AAJJJFF-AFF<JJJJJJJFFJJFFAJJJJJJJFJAJJFFJJJJJJFFFAA	NH:i:1	HI:i:1	AS:i:116	nM:i:6
F0003	163	5	57176305	255	24S52M	=	57214155	54557	GTTAAGTATCTTCAAAGAACTTCAATATTATTTCAGAGGATACAAAATAAAAATTTAAACTGGAAAATAAAGATTA	A-<FFJJJ<JJJFFJ-<JJ-<-F--FJFJ<-<FJ-<-JFF77-FAF<<-<FA-F-<-FJ--<FJ-FA-7FF--77-	NH:i:1	HI:i:1	AS:i:116	nM:i:6
F0004	99	17	31034431	0	76M	=	31034477	122	CGGCGATCAGAGGGCGATGAAGTTCTAGATCCATTGAGACAAGCTCTAGACAGTACCATGCAGTGCCACAACTTGT	AAFFFJJJJJJAFJJ<J-FFFJFAFJJJJAFAJJFFJJJJFJFJAF-FFFF-<FJ-<AFAJJJJ--A-F-7F<-F<	NH:i:5	HI:i:1	AS:i:140	nM:i:5
F0004	339	12	56117596	0	76M	=	56117550	-122	ACAAGTTGTGGCACTGCATGGTACTGTCTAGAGCTTGTCTCAATGGATCTAGAACTTCATCGCCCTCTGATCGCCG	<F-<F7-F-A--JJJJAFA<-JF<-FFFF-FAJFJFJJJJFFJJAFAJJJJFAFJFFF-J<JJFAJJJJJJFFFAA	NH:i:5	HI:i:2	AS:i:140	nM:i:5
F0004	339	7	137514008	0	76M	=	137513962	-122	ACAAGTTGTGGCACTGCATGGTACTGTCTAGAGCTTGTCTCAATGGATCTAGAACTTCATCGCCCTCTGATCGCCG	<F-<F7-F-A--JJJJAFA<-JF<-FFFF-FAJFJFJJJJFFJJAFAJJJJFAFJFFF-J<JJFAJJJJJJFFFAA	NH:i:5	HI:i:3	AS:i:140	nM:i:5
F0004	339	20	21755419	0	76M	=	21755373	-122	ACAAGTTGTGGCACTGCATGGTACTGTCTAGAGCTTGTCTCAATGGATCTAGAACTTCATCGCCCTCTGATCGCCG	<F-<F7-F-A--JJJJAFA<-JF<-FFFF-FAJFJFJJJJFFJJAFAJJJJFAFJFFF-J<JJFAJJJJJJFFFAA	NH:i:5	HI:i:4	AS:i:140	nM:i:5
F0004	339	5	55944805	0	76M	=	55944759	-122	ACAAGTTGTGGCACTGCATGGTACTGTCTAGAGCTTGTCTCAATGGATCTAGAACTTCATCGCCCTCTGATCGCCG	<F-<F7-F-A--JJJJAFA<-JF<-FFFF-FAJFJFJJJJFFJJAFAJJJJFAFJFFF-J<JJFAJJJJJJFFFAA	NH:i:5	HI:i:5	AS:i:140	nM:i:5
F0004	147	17	31034477	0	76M	=	31034431	-122	TATACAGTAGCATGCAGTCCCACAACCTGTACCAGCATCCCCAGCTTCTGGCATTCCATGTTTCTGCTCCTGTGGC	<----FF7JJJJJJJ<FA---AJFA<--JJFJF-JA-JJ<AFFAF-AA-<7--JJJJFJJ-JFJJJF7JJJAF-<A	NH:i:5	HI:i:1	AS:i:140	nM:i:5
F0004	419	12	56117550	0	76M	=	56117596	122	GCCACAGGAGCAGAAACATGGAATGCCAGAAGCTGGGGATGCTGGTACAGGTTGTGGGACTGCATGCTACTGTATA	A<-FAJJJ7FJJJFJ-JJFJJJJ--7<-AA-FAFFA<JJ-AJ-FJFJJ--<AFJA---AF<JJJJJJJ7FF----<	NH:i:5	HI:i:2	AS:i:140	nM:i:5
F0004	419	7	137513962	0	76M	=	137514008	122	GCCACAGGAGCAGAAACATGGAATGCCAGAAGCTGGGGATGCTGGTACAGGTTGTGGGACTGCATGCTACTGTATA	A<-FAJJJ7FJJJFJ-JJFJJJJ--7<-AA-FAFFA<JJ-AJ-FJFJJ--<AFJA---AF<JJJJJJJ7FF----<	NH:i:5	HI:i:3	AS:i:140	nM:i:5
F0004	419	20	21755373	0	76M	=	21755419	122	GCCACAGGAGCAGAAACATGGAATGCCAGAAGCTGGGGATGCTGGTACAGGTTGTGGGACTGCATGCTACTGTATA	A<-FAJJJ7FJJJFJ-JJFJJJJ--7<-AA-FAFFA<JJ-AJ-FJFJJ--<AFJA---AF<JJJJJJJ7FF----<	NH:i:5	HI:i:4	AS:i:140	nM:i:5
F0004	419	5	55944759	0	76M	=	55944805	122	GCCACAGGAGCAGAAACATGGAATGCCAGAAGCTGGGGATGCTGGTACAGGTTGTGGGACTGCATGCTACTGTATA	A<-FAJJJ7FJJJFJ-JJFJJJJ--7<-AA-FAFFA<JJ-AJ-FJFJJ--<AFJA---AF<JJJJJJJ7FF----<	NH:i:5	HI:i:5	AS:i:140	nM:i:5
F0005	83	1	26473504	255	24M175N30M352N22M	=	26472524	-1583	GATAAAGCAAAGGTGAAGGACGAACCACAGAGAAGTTCCGCGAGGTTGTCTGCTAAACCTGCTCCTCCAAAGCCAG	FJFAJFJAJFJ7JJFFF7AF7AF--<F-F-FAF-7-<<JJA--FAJJ<<-JJFJFJFJJF-JJJ<JJJFJJFFFAA	NH:i:1	HI:i:1	AS:i:145	nM:i:4
F0005	163	1	26472524	255	76M	=	26473504	1583	GCGAGAACGACCCCCGGACCGACCAAAGCCCGCGCGGCGCTGCATCCCGCGTACAGCACCTACGTCCCGCTACCGT	<-AFFJJJJJJJJJJ-JJJAJFF--AJFJ<JJJF<<-AFFJJAJ-F7-7-7A-A-A-AAA7FJFJ<A7-AJ-A<<J	NH:i:1	HI:i:1	AS:i:145	nM:i:4

with the mate pair order for F0004 being severely disrupted. Is this intended?

Best,

Sven

@daviesrob
Copy link
Member

@sven0schuierer It's the way it has always worked, see the discussion above.
It looks like you would want to sort by NAME, then HI tag, then READ1/2. Unfortunately while samtools has a sort-by-tag option, it doesn't quite do what you want because the tag comes first in the sort order. So samtools sort -n -t HI will give you all the HI:i:1 records ordered by read name then read1/2, then all the HI:i:2 records and so on.
Unfortunately no-one has yet got around to rewriting samtools sort to support more flexible orderings, which would solve this problem.

@sven0schuierer
Copy link

Hi Rob,
Thanks a lot for the quick reply. So if you use -t HI, you destroy the sorting by read name. Hmm ... also not what I need. Too bad.
Thanks again.
Best regards,
Sven

@jkbonfield
Copy link
Contributor

jkbonfield commented Apr 28, 2020

Maybe not what you'd like, but something like this perhaps is a workaround:

(samtools view -H in.bam; samtools view -@4 in.bam | sort -S 1G --parallel=4 -k 1,1 -k 2n) | samtools view -@4 -o out.bam -

It's strictly sorting by 1st field as primary (NAME) and 2nd field as secondary when primary is identical (FLAG). That's strictly numerical ascending order rather than bit-flag aware, but is that sufficient?

If so then it's possible this could be implemented in future samtools releases, but having a workaround for now helps. Going in and out of SAM format isn't ideal, but it shouldn't be that tragic for performance if you have a fast disk for the intermediate sort files.

On my basic 4-core desktop with a slow spinny disk, it took 1m8s to sort 10 million BAM records via unix sort, and an identical 1m8s via samtools sort. That was using all 4 cores. However on larger files there may be a difference as it starts writing out different sized temporay files and a corresponding difference in I/O pressure. (I'm unsure how SAM | zstd compares to level-1 BAM.)

Edit: spotted sort --compress-program=zstd is an option. With that the intermediate files are lightly compressed, reducing I/O bottlenecks. So it may help on larger files.

@sven0schuierer
Copy link

Hi James,
Thanks a lot for your suggestion. This is a good idea.
However, the sorting produced by your suggestion is still similar to the sorting in my original post. For fragment F0004 the ordering of the flags is now 99, 147, 339, 339, 339, 339, 419, 419, 419, 419. So the secondary mapping mate-pairs are still separated.
But I can use "sort -k 1,1 -k 3,3 -k 4n" as a proxy. It keeps most of the mate-pairs together (in my example even all of them).
Best regards,
Sven

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

7 participants