Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make release for racon denovo branch #310

Closed
leoisl opened this issue Nov 17, 2022 · 11 comments
Closed

Make release for racon denovo branch #310

leoisl opened this issue Nov 17, 2022 · 11 comments

Comments

@leoisl
Copy link
Collaborator

leoisl commented Nov 17, 2022

Merged into main, but still need to do some small fixes and make the release

@mbhall88
Copy link
Member

What small fixes need to be done?

@iqbal-lab
Copy link
Collaborator

Wasn't it the changelog and examples

@leoisl
Copy link
Collaborator Author

leoisl commented Nov 21, 2022

yes, + precompiled binaries, conda recipe update, and the release itself. Conda might auto update seeing we have a new release, but I am not sure the auto update will work. I would actually prefer not to have this version on conda, as it is not stable enough yet I guess

@mbhall88
Copy link
Member

Conda won't trigger on pre-releases.

@leoisl
Copy link
Collaborator Author

leoisl commented Nov 21, 2022

Cool!

@leoisl
Copy link
Collaborator Author

leoisl commented Nov 21, 2022

I've found an issue when running with the tiny example we have for this release. Not an issue with the changes in this release I believe, so I think we can proceed with the release. Description of the issue follows:

The tiny example describes new alleles that pandora denovo should find, and one sample should genotype to one allele, and the other sample to the new allele. For the example gene GC00010897, the new allele is a T in position 44, and this is what we should find.

GC00010897	44	.	C	T	.	.	SVTYPE=SNP;GRAPHTYPE=SIMPLE	GT:MEAN_FWD_COVG:MEAN_REV_COVG:MED_FWD_COVG:MED_REV_COVG:SUM_FWD_COVG:SUM_REV_COVG:GAPS:LIKELIHOOD:GT_CONF	1:0,11:0,16:0,11:0,16:0,23:0,32:1,0:-220.34,-8.03511:212.304	0:22,0:18,0:22,0:18,0:44,0:37,0:0,1:-2.87264,-270.207:267.334

New pandora version finds this:

GC00010897	1	.	ATGCAGATACGTGAACAGGGCCGCAAAATTCAGTGCATCCGCACCGTGTACGACAAGGCCATTGGCCGGGGTCGGCAGACGGTCATTGCCACACTGGCCCGCTATACGAC	ATGCAGATACGTGAACAGGGCCGCAAAATTCAGTGCATCCGCATCGTGTACGACAAGGCCATTGGCCGGGGTCGGCAGACGGTCATTGCCACACTGGCCCGCTATACGAC	.	.	VC=PH_SNPs;GRAPHTYPE=SIMPLE	GT:MEAN_FWD_COVG:MEAN_REV_COVG:MED_FWD_COVG:MED_REV_COVG:SUM_FWD_COVG:SUM_REV_COVG:GAPS:LIKELIHOOD:GT_CONF	1:19,20:21,23:26,26:28,27:331,369:372,422:0.235294,0.111111:-221.046,-198.019:23.0274	0:30,26:27,23:32,32:30,29:515,471:465,428:0,0.166667:-228.672,-276.438:47.766

Notes:

  1. Both records describe the same variant, C -> T at pos 44, but the new version outputs this as a 110-bp PH_SNP, where in truth is actually just a single SNP. So the new version is not being as compact as it should;
  2. Although both samples in both versions agree on their genotyping, in the new version we lowered the confidence by an order of magnitude (212 and 267 GT conf for the previous version, 23 and 47 GT conf for the new version). This will probably change genotyping if downstream tools filter on GT Conf.

For some reason, on the previous version, we just have a 1-bp bubble describing the SNP to be genotyped in the PRG string, flanked by long linear sequences:

ATGCAGATACGTGAACAGGGCCGCAAAATTCAGTGCATCCGCA 5 C 6 T 5 CGTGTACGACAAGGCCATTGGCCGGGGTCGGCAGACGGTCATTGCCACACTGGCCCGCTATACGAC

In the new version, we have a long 110-bp bubble:

 5 TTeGAGTAAAACAATCCCCCGCGCTTATATAAGCGCGTTGATATTTTTAATTATTAACAAGCAACATCATGCTAATACAGACATACAAGGAGATCATCTCTCTTTGCCTGTTTTTTATTATTTCAGGAGTGTAAACACATTTTCCG 6 TTGAGTAAAACAATCCCCCGCGCTTATATAAGCGCGTTGATATTTTTAGTTATTAACAAGCAACATCATGCTAATACAGACATACAAGGAGATCATCTCTCTTTGCCTGTTTTTTATTATTTCAGGAGTGTAAACACATTTTCCG 5 

make_prg versions haven't changed (AFAIK), and the discovered pandora denovo alleles for this gene are exactly the same. Although the graphs are conceptually the same, i.e. represent the same information, one allows for a very compact description of the variants, while the other not. However, coupled with an issue on calculating likelihoods of PH_SNPs in pandora, we could actually get different results. From this execution, there are two tasks we should carry out:

  1. Fix the issue of GT conf decreasing the longer alleles are. What happens in this example is that, when genotyping, the coverage that matters, which is the coverage on position 44 where we actually have the SNP, gets averaged together with the coverage of the 109 other identical bps, i.e. this SNP signal gets diluted. One fix for this would be to post-process the variant and break it into smaller, more compact ones. In this example, we should trim off the 109 identical bps on both alleles, and keep just the SNP itself. This would allow us to genotype the SNP based just on the coverage of the bases that differ, so the coverage of the 109 identical bps on both alleles would not get on the way. This also happens with indels and complex variants, although these (especially the latter) might be harder to break;
  2. Debug make_prg and understand why it created different graphs from presumably same input. From make_prg perspective, however, this is not a bug - both graphs represent the same information. However, we can miss variants due to this different representation of the PRG and the issue described in (1).

I believe this is not an issue to drprg though, as it does not use make_prg update.

Did I miss some task that we should do?

Do you agree we can release even with this known issue?

Which would be the path you'd recommend to solve or investigate this?

@mbhall88
Copy link
Member

mbhall88 commented Nov 21, 2022

This has to be a make prg issue right? Given the discovered allele was a single base? Might be a bug/regression in update? I don't really know where you v1.0.0 of make_prg is coming from though. I did see some strange allele additions back when I was using update in drprg remember - but I'm using this version of make prg https://github.com/mbhall88/drprg/blob/533f7b36084dbb269d511e9146f5cadae021a7b1/justfile#L7. It surely can't be a pandora issue as pandora doesn't build PRGs...

@leoisl
Copy link
Collaborator Author

leoisl commented Nov 22, 2022

This has to be a make prg issue right?

I think so...

I don't really know where you v1.0.0 of make_prg is coming from though.

Neither I at this point. I think from https://github.com/leoisl/make_prg/tree/update_1_0_0_pre_release , but I am not sure. Anyway both mine and your version are outdated. make_prg is partially merged, we are working with 2 different versions that are not the most updated one.

This is too messy to manage and really an issue. As I de-prioritised make_prg due to new projects with higher priorities (tbpore, roundhound, mof and now pandora), I am re-prioritising make_prg now and just resuming other tasks once we have make_prg merged and released. Otherwise this will become even messier and the cost to fix will be higher.

@leoisl
Copy link
Collaborator Author

leoisl commented Nov 22, 2022

FYI my container refer to this commit: leoisl/make_prg@b064dcd . But this does not matter much, as we will be releasing new make_prg version in the next couple of days, so we will all sync to use the new version

@iqbal-lab
Copy link
Collaborator

BTW for this
#310 (comment)
In the second enumeration, item 1, where you talk about the SNP signal being diluted , that's why Brice has his fancy model for gramtools which handles this. Porting it to Pandora is on the road map somewhere.

I'd ignore the gcp issue , the real question is why a single snp is suddenly becoming a long thing

@leoisl
Copy link
Collaborator Author

leoisl commented Dec 13, 2022

Done

@leoisl leoisl closed this as completed Dec 13, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants