Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

0 SNPs reported for longer reads (undocumented maximum read length?) #49

Open
bsmith89 opened this issue Jan 14, 2023 · 8 comments
Open

Comments

@bsmith89
Copy link

In a pinch, I decided to apply GT-Pro for an "off-label" use, where I fed it whole genome sequence instead of reads.

This worked surprisingly well, but I had two issues I had to work around:

First, I needed to convert a FASTA formatted genome sequence into FASTQ. Not a problem, but I did want to confirm that GT-Pro isn't sensitive to quality scores (since I had to provide a "dummy" score for each base). Quality is ignored, right?

Updating GT-Pro to accept FASTA formatted sequence might be nice for some applications.

Second, and much more importantly, there seems to be a maximum read length, after which GT-Pro starts reporting no SNPs at all. For instance, if I write a ~3 Mbp reference genome as a single read in a FASTQ file, it reports 0 SNPs. I got around this by converting the full-length sequence into tiles (e.g. of 200 bp), at which point it started reporting the thousands of SNP sites correctly. (Note: to make sure I wasn't losing kmers at "read" boundaries, I overlapped my tiles by 32 bp.)

This isn't too surprising to me (since GT-Pro is intended for short-reads), but I didn't notice any mention of a length limit in the documentation. (Admittedly, I didn't look all that hard.) I also figure that some users may want to apply GT-Pro to long reads, and I expect this to give secretly incorrect results.

I played around with the tile lengths and I start getting 0-SNPs somewhere around 500 bases.

Am I doing something silly?

@nick-youngblut
Copy link
Contributor

nick-youngblut commented May 7, 2023

@bsmith89 do you know if @zjshi is still maintaining GT-Pro (and CallM)?

The last commit was in Aug 22, 2022, and @zjshi does not seem to be responding to reported issues, including this one from January. I'm asking you @bsmith89, since you are in the Pollard lab, which collaborated on the GT-Pro paper.

If GT-Pro is not longer maintained, then how should one go about using StrainFacts? I'm guessing that MIDAS(2) does not scale nearly as well as GT-Pro, and the StrainFacts documentation currently recommends using GT-Pro:

GT-Pro is the preferred metagenotyper for StrainFacts.

@zjshi
Copy link
Owner

zjshi commented May 8, 2023

Hi Byron, my apology again for my late response. It turns out all alerts of this repo have been unfortunately redirected to a group account in Biohub instead of my email account. I can now receive alerts after I was dropped from the group account. For your question, GT-Pro indeed does not support reads that are longer than 500bp, which was hard coded at line 460 in /src/gt_pro.cpp. It may be changed plus recompiled to enable processing of longer reads. However, you might might observe performance drops due to such a change.

@zjshi
Copy link
Owner

zjshi commented May 8, 2023

Hi Nick, I've transitioned out of the lab since last year, but I will continue to address most pressing issues related to GT-Pro. Since I have a full time job competing for my time, I hope you will understand it could take longer than usual to respond. I may most likely make my responses during the weekends. Request for new features could be challenging to meet, but will be considered case by case.

@nick-youngblut
Copy link
Contributor

nick-youngblut commented May 8, 2023

Hi Nick, I've transitioned out of the lab since last year, but I will continue to address most pressing issues related to GT-Pro.

@zjshi I know how that is. Sadly, one either has to devote their free time to maintaining code from past projects (and finishing publishing manuscript from their last job) while working a new full-time job... or just abandon the project. I've run into that most recently for ResMiCo, with many nights & weekends spent on the project.

I completely understand not having time to keep working on old projects (I'm behind with some github issues now), but I just wanted to know whether or not I would have to fork your repo and fix any bugs if I ran into them (or just questions about software usage). I'm glad to hear that you are still able to devote a bit of time to gt-pro. I'd be happy to help, if possible.

@bsmith89
Copy link
Author

bsmith89 commented May 8, 2023

GT-Pro indeed does not support reads that are longer than 500bp, which was hard coded at line 460 in /src/gt_pro.cpp. It may be changed plus recompiled to enable processing of longer reads. However, you might might observe performance drops due to such a change.

@zjshi, thanks! Good to know the exact cut-off. I can submit a PR to update the documentation. I might play with increasing that limit if it seems important. While I'm not super familiar with C++, I imagine a compile-time flag could be added to set that constant...? Might try doing that myself at some point.

Very thankful that you're still maintaining GT-Pro!

@zjshi
Copy link
Owner

zjshi commented May 14, 2023

@nick-youngblut Thanks for you understanding and sharing your experience! Again please feel free to let me know any further questions or issues you will run into in the future :)

@zjshi
Copy link
Owner

zjshi commented May 14, 2023

@bsmith89 Compile-time flag might do the job for a specific max read length, but I feel it might be annoying to compile it every time for each individual dataset. One way is to compile the program with supplying a large constant that is likely always longer than any single long read, e.g. something like 5M. The second way is to chop a input long read into multiple short reads that are shorter than 500bp. To make sure every nucleotide base will be processed, every two adjacent short reads should have a small overlap (l=30). Another way is to make the constant an variable in the program and the add a flag that can be used by the user to supply read length cap. This idea came up when we wrote the program, however, we eventually did not go with it due to a performance preference. Yes, I believe I should be able to find a good and sustainable way to maintain GT-Pro and other repos for the long run.

@bsmith89
Copy link
Author

These are all good ideas! In the meantime, I'd suggest that GT-Pro docs should give an informative warning when reads are too long and this should be clearly stated in the documentation. :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants