
Regression between Scrappie rgr_r94 and rgrgr_r94 #1

Closed

tmassingham-ont opened this issue Sep 14, 2017 · 7 comments

Comments

@tmassingham-ont

Great work! Thank you very much.

As of the current releases, Albacore (2.0.2) and Scrappie (1.1.0) are at parity in their basecalling technology, Albacore using the rgrgr ("pirate") network. I suspect the reason for the apparent regression between the rgr_r94 and rgrgr_r94 networks in Scrappie is that the rgrgr_r94 model included was trained from a human-only data set, rather than a mixed set of genomes. As you said, Scrappie is a technology demonstrator and Albacore should be favoured for most uses.

While Nanonet is the only caller that includes the ability to retrain its own networks, Scrappie and Albacore can use networks trained by our open source Sloika project.

@rrwick
Owner

rrwick commented Sep 20, 2017

Thank you for the clarification! I've added a new section to the conclusions to address this.

Some of my comments are speculative, so I'd be curious for your input:

  1. Are ONT training sets all PCRed DNA with no methylation?
  2. Would including native methylated DNA in the training set give an easy improvement to accuracy? Or is it more complicated than that?

Finally, I was just curious: what does 'pirate' mean in the rgr_r94/rgrgr_r94 networks? I've been wondering since I saw this in Scrappie's history: Pirates vs bioinformaticians. I'm not sure who to root for in this competition. On one hand, I am a bioinformatician. But on the other hand, the pirate networks do seem to be doing well. 😄

@cjw85

cjw85 commented Sep 20, 2017

We train variously with PCR and native DNA, sometimes just the former. Including native DNA can in principle help; what a model learns depends on the rate of methylation. For example, one can imagine a case where methylation occurs at such a low rate in the training data that the model is optimised by simply ignoring it. On the other hand, yes, a model can learn to recognise methylated/modified squiggle, even if it simply labels the squiggle with canonical bases.

Pirate is simply a reference to the fact that in trying to pronounce rgrgr one invariably sounds like a pirate.

@rrwick
Owner

rrwick commented Sep 20, 2017

Thanks for the quick response - I've updated the text again. And I like the pirate explanation; appropriate, seeing as yesterday was Talk Like a Pirate Day!

I wonder if the ideal solution is to have lots of different trained models included with Albacore. Off the top of my head: human_pcr, human_native, ecoli_pcr, ecoli_native, mixed. It could default to the mixed model, but you could manually choose a model which best fits your data (something like --model ecoli_native) to get the best accuracy. Or (and this is wildly speculative, as I have no experience working with neural networks) could Albacore try each model and somehow automatically figure out which one 'fits' best with the data?
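Purely as a sketch of what I mean by that last idea (none of this is real Albacore or Scrappie code; score_read is a made-up stand-in for whatever per-read quality or model-likelihood measure the caller could report, and the model names are just the ones listed above): basecall a small random subsample of reads with every model and keep the one that scores best on average.

```python
# Hypothetical sketch of automatic model selection -- not real Albacore code.
import random
from typing import Callable, Dict, List

# Placeholder model names from the suggestion above.
MODELS = ['human_pcr', 'human_native', 'ecoli_pcr', 'ecoli_native', 'mixed']


def choose_best_model(reads: List[str],
                      score_read: Callable[[str, str], float],
                      sample_size: int = 100) -> str:
    """Score a small random subsample of reads with each model and return
    the model with the highest mean per-read score."""
    subsample = random.sample(reads, min(sample_size, len(reads)))
    mean_scores: Dict[str, float] = {
        model: sum(score_read(read, model) for read in subsample) / len(subsample)
        for model in MODELS
    }
    return max(mean_scores, key=mean_scores.get)
```

The chosen model could then be used as if the user had passed --model themselves. The open question, of course, is whether mean basecall quality (or model likelihood) is actually a reliable signal for which training set best matches the data.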

@rrwick
Owner

rrwick commented Sep 20, 2017

And regarding methylation, I suppose my preferred behaviour is what you described: e.g. labelling a 5mC as just C. Actually labelling bases as methylated is cool, but that feels like a separate issue, and I would probably only want those labels if I explicitly asked for them.

@tmassingham-ont
Author

We've upgraded the two pirate networks (rgrgr_r94, rgrgr_r95) in Scrappie 1.1.1 to versions trained on a more balanced set of genomes.

@rrwick
Owner

rrwick commented Sep 23, 2017

Thanks - I'll give it a try!

@rrwick
Owner

rrwick commented Oct 2, 2017

I've just updated the repo with new results, including Scrappie v1.1.1.

Versions 1.1.0 and 1.1.1 also provide an interesting case showing the difference a training set can make, so I added a section comparing the two.

Thanks again!

@rrwick rrwick closed this as completed Oct 2, 2017