samtools fastq could merge pair members and output just a single FASTA/Q entry #1991

Open
mmokrejs opened this issue Feb 15, 2024 · 1 comment

mmokrejs commented Feb 15, 2024

Hi,
there are many tools (bbmerge.sh from BBMap, NGmerge, VSEARCH, USEARCH, ...) that look for an overlap between the two members of a FASTA/Q read pair. Is there a tool that instead uses the positions of pair members already mapped to a common reference to merge them, inserting a linker of Ns of the proper length between them where needed? I still have the original .fastq.gz files, so such a tool could perhaps work from a position-sorted SAM/BAM, walk through the synced read pairs in the R1 and R2 files, and merge them accordingly (replacing an N with the nucleotide from the other mate, and possibly preferring the base with the higher QUAL).
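The requested behaviour can be illustrated with a minimal Python sketch. It is not an existing samtools feature, and it makes strong simplifying assumptions: both mates are given in reference-forward orientation (as SEQ is stored in SAM), the alignments are ungapped (no indels or clipping), positions are 0-based, and qualities are Phred+33 characters. The function name `merge_pair` is hypothetical.

```python
def merge_pair(pos1, seq1, qual1, pos2, seq2, qual2):
    """Merge two ungapped, reference-oriented mates into one sequence.

    Assumptions (illustrative only): 0-based leftmost positions on the
    same reference, no indels or clipping, Phred+33 quality strings.
    """
    start = min(pos1, pos2)
    end = max(pos1 + len(seq1), pos2 + len(seq2))
    # Begin with an all-N linker of the lowest quality; reads overwrite it.
    seq = ["N"] * (end - start)
    qual = ["!"] * (end - start)
    for pos, s, q in ((pos1, seq1, qual1), (pos2, seq2, qual2)):
        for i, (base, bq) in enumerate(zip(s, q)):
            j = pos - start + i
            cur_is_n = seq[j] == "N"
            new_is_n = base == "N"
            # A real base always replaces an N; otherwise the higher QUAL wins.
            if (cur_is_n and not new_is_n) or (cur_is_n == new_is_n and bq > qual[j]):
                seq[j], qual[j] = base, bq
    return "".join(seq), "".join(qual)


if __name__ == "__main__":
    # Non-overlapping mates: two Ns are inserted as a linker.
    print(merge_pair(0, "ACGT", "IIII", 6, "TTGA", "IIII"))
    # Overlapping mates: the N in the first mate is filled from the second.
    print(merge_pair(0, "ACGN", "III#", 2, "GTCA", "5III"))
```

A real implementation would have to walk the CIGAR string of each mate rather than assume ungapped alignments.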

https://sourceforge.net/projects/bbmap/
https://drive5.com/usearch/manual/merge_pair.html
https://github.com/jsh58/NGmerge
https://cme.h-its.org/exelixis/web/software/pear/
https://gitlab.com/german.tischler/biobambam2/-/blob/master/src/programs/bamtofastq.1

Thank you,

Contributor

jkbonfield commented Feb 22, 2024

The intention of samtools fastq is very much to reverse the alignment process, i.e. to take an aligned file and reproduce something akin to the original instrument output so that it can then be realigned with other tools and/or parameters. Merging pair members would be quite a departure from that goal.

So no, it cannot do this currently. I'll have to take a look at what other tools do, but this does feel a bit out of the normal remit for samtools fastq.

Edit: it's also a complex thing to do with many weird corner cases.

For example, what should happen with alignments that map incorrectly, on the same strand when they should differ? Or where the insert size is actually negative because the reads point away from each other instead of towards one another? What about other sequencing strategies, like 454's approach, where the fragment was circularised and sequenced across an adapter, and the read was then split in two in software? Would we need to add the adapter back? What about pairs mapped to different chromosomes, or to the same chromosome but megabases apart? And what do we do about singletons, where only one read of the pair has been found?
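To make a few of these cases concrete, here is a rough Python sketch that sorts a pair into one of the categories above before any merge is attempted. The `Mate` record, the `classify_pair` function, the category names and the `max_span` threshold are all hypothetical and chosen only for illustration; the 454-style circularised libraries are not covered, and the orientation test assumes a standard forward/reverse library.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Mate:
    ref: str       # reference (chromosome) name
    pos: int       # 0-based leftmost aligned position
    reverse: bool  # True if aligned to the reverse strand


def classify_pair(a: Optional[Mate], b: Optional[Mate], max_span: int = 1000) -> str:
    """Return a label describing whether a pair looks safe to merge."""
    if a is None or b is None:
        return "singleton"            # only one read of the pair was found
    if a.ref != b.ref:
        return "different_reference"  # mates on different chromosomes
    if a.reverse == b.reverse:
        return "same_strand"          # strands should differ but do not
    fwd, rev = (b, a) if a.reverse else (a, b)
    if rev.pos < fwd.pos:
        return "outward_facing"       # reads point away from each other
    if rev.pos - fwd.pos > max_span:
        return "too_distant"          # same chromosome, but far apart
    return "mergeable"


if __name__ == "__main__":
    print(classify_pair(Mate("chr1", 100, False), Mate("chr1", 300, True)))
    print(classify_pair(Mate("chr1", 100, False), Mate("chr2", 300, True)))
```

Each non-mergeable label still needs a policy decision (skip the pair, emit the reads unmerged, or fail), which is where most of the design work would lie.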

Even implementing it efficiently is non-trivial if we want to deal with distant read pairs, unless the input is name-collated.
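The efficiency concern can be sketched in a few lines of Python. This is a toy model, not how samtools is implemented: records are plain `(name, payload)` tuples and the hypothetical `pair_mates` function holds the first mate in a dictionary until its partner appears, reporting the peak number of reads it had to buffer.

```python
def pair_mates(records):
    """Pair records by read name in stream order.

    Returns (pairs, singletons, peak), where peak is the largest number
    of unpaired reads that had to be held in memory at once.
    """
    pending = {}  # read name -> first mate seen, waiting for its partner
    pairs = []
    peak = 0
    for name, rec in records:
        if name in pending:
            pairs.append((pending.pop(name), rec))
        else:
            pending[name] = rec
            peak = max(peak, len(pending))
    # Anything left has no partner in the input.
    return pairs, list(pending.values()), peak


if __name__ == "__main__":
    # Position-sorted order: mates are far apart in the stream.
    by_position = [("a", 1), ("b", 2), ("c", 3), ("a", 4), ("b", 5), ("c", 6)]
    # Name-collated order: mates are adjacent.
    by_name = [("a", 1), ("a", 4), ("b", 2), ("b", 5), ("c", 3), ("c", 6)]
    print(pair_mates(by_position)[2])
    print(pair_mates(by_name)[2])
```

With name-collated input the buffer never holds more than one read, whereas with position-sorted input it grows with the number of pairs whose mates are far apart in the file.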

It sounds like there are so many potential pitfalls and open questions that this would be complex to implement and a substantial piece of work. I question the need for us to do this unless multiple groups want it and the existing tools out there don't already fulfill the requirements.
