Make release for racon denovo branch #310

leoisl · 2022-11-17T19:00:54Z

Merged into main, but still need to do some small fixes and make the release

mbhall88 · 2022-11-20T22:55:25Z

What small fixes need to be done?

iqbal-lab · 2022-11-20T23:47:08Z

Wasn't it the changelog and examples

leoisl · 2022-11-21T00:14:11Z

yes, + precompiled binaries, conda recipe update, and the release itself. Conda might auto update seeing we have a new release, but I am not sure the auto update will work. I would actually prefer not to have this version on conda, as it is not stable enough yet I guess

mbhall88 · 2022-11-21T00:15:37Z

Conda won't trigger on pre-releases.

leoisl · 2022-11-21T00:16:07Z

Cool!

leoisl · 2022-11-21T17:31:43Z

I've found an issue when running with the tiny example we have for this release. Not an issue with the changes in this release I believe, so I think we can proceed with the release. Description of the issue follows:

The tiny example describes new alleles that pandora denovo should find, and one sample should genotype to one allele, and the other sample to the new allele. For the example gene GC00010897, the new allele is a T in position 44, and this is what we should find.

GC00010897	44	.	C	T	.	.	SVTYPE=SNP;GRAPHTYPE=SIMPLE	GT:MEAN_FWD_COVG:MEAN_REV_COVG:MED_FWD_COVG:MED_REV_COVG:SUM_FWD_COVG:SUM_REV_COVG:GAPS:LIKELIHOOD:GT_CONF	1:0,11:0,16:0,11:0,16:0,23:0,32:1,0:-220.34,-8.03511:212.304	0:22,0:18,0:22,0:18,0:44,0:37,0:0,1:-2.87264,-270.207:267.334

New pandora version finds this:

GC00010897	1	.	ATGCAGATACGTGAACAGGGCCGCAAAATTCAGTGCATCCGCACCGTGTACGACAAGGCCATTGGCCGGGGTCGGCAGACGGTCATTGCCACACTGGCCCGCTATACGAC	ATGCAGATACGTGAACAGGGCCGCAAAATTCAGTGCATCCGCATCGTGTACGACAAGGCCATTGGCCGGGGTCGGCAGACGGTCATTGCCACACTGGCCCGCTATACGAC	.	.	VC=PH_SNPs;GRAPHTYPE=SIMPLE	GT:MEAN_FWD_COVG:MEAN_REV_COVG:MED_FWD_COVG:MED_REV_COVG:SUM_FWD_COVG:SUM_REV_COVG:GAPS:LIKELIHOOD:GT_CONF	1:19,20:21,23:26,26:28,27:331,369:372,422:0.235294,0.111111:-221.046,-198.019:23.0274	0:30,26:27,23:32,32:30,29:515,471:465,428:0,0.166667:-228.672,-276.438:47.766

Notes:

Both records describe the same variant, C -> T at pos 44, but the new version outputs this as a 110-bp PH_SNP, where in truth is actually just a single SNP. So the new version is not being as compact as it should;
Although both samples in both versions agree on their genotyping, in the new version we lowered the confidence by an order of magnitude (212 and 267 GT conf for the previous version, 23 and 47 GT conf for the new version). This will probably change genotyping if downstream tools filter on GT Conf.

For some reason, on the previous version, we just have a 1-bp bubble describing the SNP to be genotyped in the PRG string, flanked by long linear sequences:

ATGCAGATACGTGAACAGGGCCGCAAAATTCAGTGCATCCGCA 5 C 6 T 5 CGTGTACGACAAGGCCATTGGCCGGGGTCGGCAGACGGTCATTGCCACACTGGCCCGCTATACGAC

In the new version, we have a long 110-bp bubble:

 5 TTeGAGTAAAACAATCCCCCGCGCTTATATAAGCGCGTTGATATTTTTAATTATTAACAAGCAACATCATGCTAATACAGACATACAAGGAGATCATCTCTCTTTGCCTGTTTTTTATTATTTCAGGAGTGTAAACACATTTTCCG 6 TTGAGTAAAACAATCCCCCGCGCTTATATAAGCGCGTTGATATTTTTAGTTATTAACAAGCAACATCATGCTAATACAGACATACAAGGAGATCATCTCTCTTTGCCTGTTTTTTATTATTTCAGGAGTGTAAACACATTTTCCG 5

make_prg versions haven't changed (AFAIK), and the discovered pandora denovo alleles for this gene are exactly the same. Although the graphs are conceptually the same, i.e. represent the same information, one allows for a very compact description of the variants, while the other not. However, coupled with an issue on calculating likelihoods of PH_SNPs in pandora, we could actually get different results. From this execution, there are two tasks we should carry out:

Fix the issue of GT conf decreasing the longer alleles are. What happens in this example is that, when genotyping, the coverage that matters, which is the coverage on position 44 where we actually have the SNP, gets averaged together with the coverage of the 109 other identical bps, i.e. this SNP signal gets diluted. One fix for this would be to post-process the variant and break it into smaller, more compact ones. In this example, we should trim off the 109 identical bps on both alleles, and keep just the SNP itself. This would allow us to genotype the SNP based just on the coverage of the bases that differ, so the coverage of the 109 identical bps on both alleles would not get on the way. This also happens with indels and complex variants, although these (especially the latter) might be harder to break;
Debug make_prg and understand why it created different graphs from presumably same input. From make_prg perspective, however, this is not a bug - both graphs represent the same information. However, we can miss variants due to this different representation of the PRG and the issue described in (1).

I believe this is not an issue to drprg though, as it does not use make_prg update.

Did I miss some task that we should do?

Do you agree we can release even with this known issue?

Which would be the path you'd recommend to solve or investigate this?

mbhall88 · 2022-11-21T22:35:09Z

This has to be a make prg issue right? Given the discovered allele was a single base? Might be a bug/regression in update? I don't really know where you v1.0.0 of make_prg is coming from though. I did see some strange allele additions back when I was using update in drprg remember - but I'm using this version of make prg https://github.com/mbhall88/drprg/blob/533f7b36084dbb269d511e9146f5cadae021a7b1/justfile#L7. It surely can't be a pandora issue as pandora doesn't build PRGs...

leoisl · 2022-11-22T13:37:05Z

This has to be a make prg issue right?

I think so...

I don't really know where you v1.0.0 of make_prg is coming from though.

Neither I at this point. I think from https://github.com/leoisl/make_prg/tree/update_1_0_0_pre_release , but I am not sure. Anyway both mine and your version are outdated. make_prg is partially merged, we are working with 2 different versions that are not the most updated one.

This is too messy to manage and really an issue. As I de-prioritised make_prg due to new projects with higher priorities (tbpore, roundhound, mof and now pandora), I am re-prioritising make_prg now and just resuming other tasks once we have make_prg merged and released. Otherwise this will become even messier and the cost to fix will be higher.

leoisl · 2022-11-22T14:42:35Z

FYI my container refer to this commit: leoisl/make_prg@b064dcd . But this does not matter much, as we will be releasing new make_prg version in the next couple of days, so we will all sync to use the new version

iqbal-lab · 2022-11-22T19:47:37Z

BTW for this
#310 (comment)
In the second enumeration, item 1, where you talk about the SNP signal being diluted , that's why Brice has his fancy model for gramtools which handles this. Porting it to Pandora is on the road map somewhere.

I'd ignore the gcp issue , the real question is why a single snp is suddenly becoming a long thing

leoisl · 2022-12-13T09:49:22Z

Done

leoisl mentioned this issue Nov 22, 2022

Understand why latest executable of new make_prg version did not produce the expected graph iqbal-lab-org/make_prg#43

Closed

leoisl closed this as completed Dec 13, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Make release for racon denovo branch #310

Make release for racon denovo branch #310

leoisl commented Nov 17, 2022

mbhall88 commented Nov 20, 2022

iqbal-lab commented Nov 20, 2022

leoisl commented Nov 21, 2022

mbhall88 commented Nov 21, 2022

leoisl commented Nov 21, 2022

leoisl commented Nov 21, 2022

mbhall88 commented Nov 21, 2022 •

edited

Loading

leoisl commented Nov 22, 2022

leoisl commented Nov 22, 2022

iqbal-lab commented Nov 22, 2022

leoisl commented Dec 13, 2022

Make release for racon denovo branch #310

Make release for racon denovo branch #310

Comments

leoisl commented Nov 17, 2022

mbhall88 commented Nov 20, 2022

iqbal-lab commented Nov 20, 2022

leoisl commented Nov 21, 2022

mbhall88 commented Nov 21, 2022

leoisl commented Nov 21, 2022

leoisl commented Nov 21, 2022

mbhall88 commented Nov 21, 2022 • edited Loading

leoisl commented Nov 22, 2022

leoisl commented Nov 22, 2022

iqbal-lab commented Nov 22, 2022

leoisl commented Dec 13, 2022

mbhall88 commented Nov 21, 2022 •

edited

Loading