samtools fastq could merge pair members and output just a single FASTA/Q entry #1991

mmokrejs · 2024-02-15T18:23:47Z

Hi,
there are many tools (bbmerge.sh from BBMap, NGmerge, VSEARCH, USEARCH, ...) that look for an overlap between the members of a FASTA/Q read pair. But is there a tool that can take pair members already mapped to a common reference and simply merge them, inserting a linker of Ns of the proper length between them where needed? I still have the original .fastq.gz files, so it could perhaps work on a position-sorted SAM/BAM, walk through the synced read pairs in the R1 and R2 files, and merge them accordingly (replacing Ns with a nucleotide from the other mate, and possibly preferring the base with the higher QUAL).
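To make the request concrete, here is a minimal Python sketch of the merge described above. It is not an existing samtools feature. The function name `merge_mapped_pair` and its parameters are hypothetical, and it assumes both mates are already in reference orientation, aligned without gaps or clipping, with 0-based leftmost positions and Phred+33 quality strings.

```python
def merge_mapped_pair(pos1, seq1, qual1, pos2, seq2, qual2, gap_qual="!"):
    """Merge two mapped mates into one sequence (hypothetical helper).

    Assumptions: ungapped alignments, no soft clipping, both mates given
    in reference orientation, 0-based leftmost positions, Phred+33 quals.
    """
    start = min(pos1, pos2)
    end = max(pos1 + len(seq1), pos2 + len(seq2))

    # Start with an all-N linker; positions covered by a mate get overwritten.
    bases = ["N"] * (end - start)
    quals = [gap_qual] * (end - start)

    for pos, seq, qual in ((pos1, seq1, qual1), (pos2, seq2, qual2)):
        for i, (base, q) in enumerate(zip(seq, qual)):
            j = pos - start + i
            # Replace an N with whatever this mate has; otherwise keep the
            # base with the higher quality, never letting an N win.
            if bases[j] == "N" or (base != "N" and q > quals[j]):
                bases[j], quals[j] = base, q

    return "".join(bases), "".join(quals)


# Mates separated by a 2 bp gap: the gap becomes an N linker.
merged_seq, merged_qual = merge_mapped_pair(0, "ACGT", "IIII", 6, "TTAA", "IIII")
```

Real alignments carry indels, clipping and reverse-strand mates, so a practical tool would need to walk the CIGAR strings rather than rely on these simplifications.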

https://sourceforge.net/projects/bbmap/
https://drive5.com/usearch/manual/merge_pair.html
https://github.com/jsh58/NGmerge
https://cme.h-its.org/exelixis/web/software/pear/
https://gitlab.com/german.tischler/biobambam2/-/blob/master/src/programs/bamtofastq.1

Thank you,

jkbonfield · 2024-02-22T12:24:48Z

The intention of samtools fastq is very much to reverse the alignment process, i.e. to take an aligned file and reproduce something akin to the original instrument output so that it can be realigned with other tools and/or parameters. This would be quite a departure from that goal.

So no, it cannot do this currently. I'll have to take a look at what other tools do, but this does feel a bit out of the normal remit for samtools fastq.

Edit: it's also a complex thing to do with many weird corner cases.

E.g. what should we do about alignments that are mapped incorrectly, with both reads on the same strand when they should differ? Or where the insert size is actually negative because the reads point away from each other instead of towards one another? What about other sequencing strategies, like 454's approach, where the fragment was circularised, sequenced over an adapter, and the read then split into two in software? Would we need to add the adapter back? What about pairs mapped to different chromosomes, or to the same chromosome but megabases apart? And what do we do about singletons, where only one read of the pair has been found?
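Several of these cases can be spotted from the standard SAM FLAG bits and the RNAME/POS/RNEXT/PNEXT fields. The sketch below is only an illustration: `classify_pair`, its category names and the `max_dist` threshold are hypothetical, and the outward-facing test is a rough heuristic that ignores read length and read-through on short inserts.

```python
# Standard SAM FLAG bits used below.
PAIRED, UNMAPPED, MATE_UNMAPPED, REVERSE, MATE_REVERSE = 0x1, 0x4, 0x8, 0x10, 0x20


def classify_pair(flag, rname, pos, rnext, pnext, max_dist=1000):
    """Label one record of a pair by the corner case it falls into.

    Hypothetical helper; max_dist is an arbitrary illustrative threshold.
    """
    if flag & UNMAPPED:
        return "unmapped"
    if not flag & PAIRED or flag & MATE_UNMAPPED:
        return "singleton"
    # In SAM, RNEXT "=" means the mate is on the same reference as RNAME.
    if rnext not in ("=", rname):
        return "different_reference"
    if abs(pnext - pos) > max_dist:
        return "too_distant"

    reverse = bool(flag & REVERSE)
    mate_reverse = bool(flag & MATE_REVERSE)
    if reverse == mate_reverse:
        return "same_strand"

    # Heuristic: the forward mate is expected to start at or before the
    # reverse mate; otherwise the reads point away from each other.
    fwd_pos, rev_pos = (pnext, pos) if reverse else (pos, pnext)
    if rev_pos < fwd_pos:
        return "outward_facing"
    return "merge_candidate"
```

Only records labelled `merge_candidate` would be straightforward to merge; every other label needs a policy decision of the kind raised above.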

Even implementing it efficiently is non-trivial (unless the input is name-collated) if we want to deal with distant read pairs.
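A rough Python sketch of why input order matters: to pair mates from a stream, each read has to be held in memory until its mate arrives. With name-collated input (for example from `samtools collate`) the mate is the next record, whereas with coordinate-sorted input a distant mate keeps its partner buffered for a long time. The function `pair_mates` and the toy records are hypothetical.

```python
def pair_mates(records):
    """Pair records by read name, tracking the peak number of buffered reads.

    `records` is an iterable of (read_name, payload) tuples. Returns the
    list of pairs, the peak buffer size, and any reads left unpaired.
    """
    pending = {}
    peak = 0
    pairs = []
    for name, payload in records:
        if name in pending:
            # Mate already seen: emit the pair and free the buffer slot.
            pairs.append((pending.pop(name), payload))
        else:
            pending[name] = payload
            peak = max(peak, len(pending))
    return pairs, peak, pending


# Name-collated order: mates are adjacent, so at most one read is buffered.
collated = [("a", 1), ("a", 2), ("b", 3), ("b", 4)]
# Coordinate-like order: the mate of "a" arrives last, so reads pile up.
by_position = [("a", 1), ("b", 3), ("c", 5), ("c", 6), ("b", 4), ("a", 2)]

_, peak_collated, _ = pair_mates(collated)
_, peak_sorted, leftover = pair_mates(by_position)
```

In this toy case the buffer peaks at one read for the collated input and three for the position-ordered input; on a real BAM with pairs on different chromosomes the buffer could grow much larger. Anything left in `pending` at the end is a singleton.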

It sounds like there are so many potential pitfalls and open questions that this would be complex to implement and a substantial piece of work. I question the need for us to do this unless multiple groups want it and the existing tools don't already fulfil the requirements.

daviesrob assigned jkbonfield Feb 22, 2024
