Regression between Scrappie rgr_r94 and rgrgr_r94 #1
Comments
Thank you for the clarification! I've added a new section to the conclusions to address this. Some of my comments are speculative, so I'd be curious for your input:
Finally, I was just curious, what does the 'pirate' mean in the rgr_r94/rgrgr_r94 networks? I've been wondering since I saw this in Scrappie's history: Pirates vs bioinformaticians. I'm not sure who to root for in this competition. On one hand, I am a bioinformatician. But on the other hand, the pirate networks do seem to be doing well. 😄
We train variously with PCR and native DNA, sometimes just the former. Including native DNA can in principle help, but what a model learns depends on the rate of methylation: e.g. one can imagine the case where methylation occurs at such a low rate in the training data that a model is optimised by simply ignoring this data. On the other hand, yes, a model can learn to recognise methylated/modified squiggle, even if it simply labels the squiggle with canonical bases. Pirate is simply a reference to the fact that in trying to pronounce rgrgr one invariably sounds like a pirate.
Thanks for the quick response - I've updated the text again. And I like the pirate explanation, appropriate seeing as how yesterday was Talk Like a Pirate Day! I wonder if the ideal solution is to have lots of different trained models included with Albacore. Off the top of my head: human_pcr, human_native, ecoli_pcr, ecoli_native, mixed. It could default to the mixed model but you could manually choose a model which best fits your data (something like
And regarding methylation, I suppose my preferred behaviour is what you described: e.g. labelling a 5mC as just C. Actually labelling bases as methylated is cool, but that feels like a separate issue, and I would probably only want those labels if I explicitly asked for them.
We've upgraded the two pirate networks (rgrgr_r94, rgrgr_r95) in Scrappie 1.1.1 to versions trained on a more balanced set of genomes.
Thanks - I'll give it a try!
I've just updated the repo with new results, including Scrappie v1.1.1. Versions 1.1.0 and 1.1.1 also provide an interesting case to show the difference a training set can make, so I added a section comparing the two. Thanks again!
Great work! Thank you very much.
As of the current releases, Albacore (2.0.2) and Scrappie (1.1.0) are at parity in their basecalling technology, Albacore using the rgrgr ("pirate") network. I suspect the reason for the apparent regression between the rgr_r94 and rgrgr_r94 networks in Scrappie is that the rgrgr_r94 model included was trained from a human-only data set, rather than a mixed set of genomes. As you said, Scrappie is a technology demonstrator and Albacore should be favoured for most uses.
While Nanonet is the only caller that includes the ability to retrain its own networks, Scrappie and Albacore can use networks trained by our open source Sloika project.