Multiple alternative variants #6

Open
joanmarticarreras opened this issue Jan 20, 2021 · 11 comments

Comments

@joanmarticarreras

Hi Robert!

First, congrats on the project! It's pretty cool!

I decided to try it for my project (virus diversity) and realized that my multi-sample VCFs tend to have many genome positions with multiple possible variants (both nucleotides and deletions, "*"). vcf-annotator stops when the ALT column contains multiple symbols, as in "A,G,*".

As an enhancement, I think it would be great if it could handle this type of data.

Cheers

Joan

@rpetit3
Owner

rpetit3 commented Jan 20, 2021

Hi Joan!

Thank you very much for checking out vcf-annotator. I could look into what you've suggested. Do you think you could attach an example?

Thanks!

@joanmarticarreras
Author

Hi Robert,

Here is an example:

##fileformat=VCFv4.1
##contig=<ID=1,length=29903>
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO
NC_045512.2	25	.	T	*,G,C	.	.	.
NC_045512.2	241	.	C	T,Y	.	.	.
NC_045512.2	512	.	C	T,*	.	.	.
NC_045512.2	514	.	T	C,*	.	.	.
NC_045512.2	520	.	G	T,*	.	.	.
NC_045512.2	521	.	G	T,*	.	.	.
NC_045512.2	710	.	C	T,*	.	.	.
NC_045512.2	734	.	T	*,C	.	.	.
NC_045512.2	739	.	A	*,G	.	.	.
NC_045512.2	745	.	C	*,A	.	.	.
NC_045512.2	784	.	C	*,T	.	.	.
NC_045512.2	832	.	C	*,T	.	.	.
NC_045512.2	835	.	C	*,T	.	.	.
NC_045512.2	875	.	C	*,T	.	.	.
NC_045512.2	878	.	C	*,T	.	.	.
NC_045512.2	894	.	A	*,G	.	.	.
NC_045512.2	913	.	C	T,*	.	.	.
NC_045512.2	960	.	G	*,A	.	.	.

It comes from the program snp-sites, which very nicely collects the mutations in an MSA into a multi-sample VCF file.

I use both of them in a couple of pipelines. If you have a citable source for your software, let us know!

Joan

@rpetit3
Owner

rpetit3 commented Aug 13, 2021

Hi Joan!

I apologize for the delay! I think I have a workaround in place for this, but first, do you know if the Y here is a typo?

NC_045512.2	241	.	C	T,Y	.	.	.

I'm only asking because the snp-sites source seems to suggest that anything other than A, T, G, or C is converted to '*': https://github.com/sanger-pathogens/snp-sites/blob/52c98cb3e0ed0d336b24b27a5c0f3da4cbe44e71/src/vcf.c#L131-L134
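
A quick way to check whether a VCF contains ALT symbols other than A, C, G, T, or '*' (such as the Y above, which is the IUPAC code for C or T) might look like the sketch below. The function name `unexpected_alt_alleles` is hypothetical, it is not part of vcf-annotator or snp-sites, and it assumes plain tab-separated VCF records.

```python
def unexpected_alt_alleles(vcf_text):
    """Yield (chrom, pos, allele) for ALT alleles outside A, C, G, T and '*'."""
    allowed = set("ACGT")
    for line in vcf_text.splitlines():
        # Skip blank lines and header lines ('##...' and '#CHROM').
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        chrom, pos, alt = fields[0], int(fields[1]), fields[4]
        for allele in alt.split(","):
            # '*' is accepted as-is; any other character is reported.
            if allele != "*" and not set(allele.upper()) <= allowed:
                yield chrom, pos, allele


# Two records taken from the example earlier in this thread.
records = [
    "\t".join(["NC_045512.2", "241", ".", "C", "T,Y", ".", ".", "."]),
    "\t".join(["NC_045512.2", "512", ".", "C", "T,*", ".", ".", "."]),
]
for hit in unexpected_alt_alleles("\n".join(records)):
    print(hit)
```

Only the record at position 241 is reported, since T and '*' are both accepted.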

@rpetit3 rpetit3 mentioned this issue Aug 13, 2021
@rpetit3
Owner

rpetit3 commented Aug 13, 2021

Looping @marimaro into this, since this also seems related to * in the ALT field.

In this case I'm assuming * means everything that was not an A, T, G, C, or N. Or could it also mean missing?

I guess what I'm asking is, how would you want '*' to be dealt with? Treated as an N, ignored, etc...

Would love to get your thoughts

@marimaro

Hey Robert,

I'm not sure, but I found this: https://www.biostars.org/p/279971/

To run vcf-annotator without this error, I simply replaced the * with nothing, though this might not be the best approach.

Hope it helps and thanks for looking into this.
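
For anyone who wants to try the same workaround, here is a minimal sketch of dropping '*' from the ALT column before running vcf-annotator. The helper name `drop_star_alleles` is hypothetical. It assumes tab-separated records without sample columns (like the example above); if genotype columns are present, their allele indices would need renumbering, which this sketch does not do.

```python
def drop_star_alleles(record_line):
    """Return the VCF record with any '*' entries removed from ALT."""
    fields = record_line.rstrip("\n").split("\t")
    alleles = [a for a in fields[4].split(",") if a != "*"]
    # If '*' was the only ALT allele, fall back to the missing value '.'.
    fields[4] = ",".join(alleles) if alleles else "."
    return "\t".join(fields)


record = "\t".join(["NC_045512.2", "512", ".", "C", "T,*", ".", ".", "."])
print(drop_star_alleles(record))
```

Note that this discards the information that a '*' allele was present at the site.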

@rpetit3
Owner

rpetit3 commented Aug 17, 2021

Was yours also from snp-sites or a VCF produced by a multiple sequence alignment?

Taken from the VCF spec docs

 The ‘*’ allele is reserved to indicate that the allele is missing due to a upstream deletion.
 If there are no alternative alleles, then the missing value should be used.

Did you have any variants where the ALT was just *? Like this:

NC_045512.2	512	.	C	*	.	.	.

@marimaro

Nope, only as something like T,* or A,*

@rpetit3
Owner

rpetit3 commented Aug 17, 2021

I'm thinking for this I will annotate the T, but add a field like has_asterisk=True in the INFO column.

Do you think that would work for you?
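
To make the proposal concrete, here is a rough sketch of the idea, not the actual vcf-annotator implementation. The function name `flag_asterisk` is hypothetical, the INFO key `has_asterisk` is the one suggested above, and records are assumed to be tab-separated with the usual eight fixed columns.

```python
def flag_asterisk(record_line):
    """Return (alleles to annotate, record with has_asterisk=True in INFO)."""
    fields = record_line.rstrip("\n").split("\t")
    alleles = fields[4].split(",")
    # Only the non-'*' alleles would be passed on for annotation.
    to_annotate = [a for a in alleles if a != "*"]
    if "*" in alleles:
        tag = "has_asterisk=True"
        # Replace an empty INFO ('.') or append to the existing entries.
        fields[7] = tag if fields[7] in (".", "") else fields[7] + ";" + tag
    return to_annotate, "\t".join(fields)


record = "\t".join(["NC_045512.2", "25", ".", "T", "*,G,C", ".", ".", "."])
alleles, flagged = flag_asterisk(record)
print(alleles)
print(flagged)
```

The ALT column itself is left unchanged, so the '*' stays visible in the output and the INFO flag just makes it easy to filter on.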

@marimaro

For me, yes. If you add an explanation somewhere, I guess it'll be fine.

@rpetit3
Owner

rpetit3 commented Sep 13, 2021

Another example with * in the column #9 (comment)

Tagging @BioWilko - How would you like the * to be treated?

@BioWilko

I'm honestly not sure; I think your suggestion above is probably the best option. I think it's a case of snp-sites seeing a lower-case "n" and getting confused, but don't quote me on that.
