Parse annotations into separate columns #63

Closed
jeromekelleher opened this issue Mar 4, 2024 · 4 comments
Labels
enhancement New feature or request

Comments

@jeromekelleher
Contributor

Variant level annotations are often included as INFO tags with substructure, e.g.

##SnpEffVersion="4.3i (build 2016-12-15 22:33), by Pablo Cingolani"
##SnpEffCmd="SnpEff  -noStats -lof GRCh38.86 /gpfs/commons/home/usevani/compbio/CCDG/Project_CCDG_14151_B01_GRM_WGS/final_annotated_vcfs/tmp_dir_annt/CCDG_14151_B01_GRM_WGS_2020-08-05_chr20.recalibrated_variants.annotated.normalize.vcf.gz "
##INFO=<ID=ANN,Number=.,Type=String,Description="Functional annotations: 'Allele | Annotation | Annotation_Impact | Gene_Name | Gene_ID | Feature_Type | Feature_ID | Transcript_BioType | Rank | HGVS.c | HGVS.p | cDNA.pos / cDNA.length | CDS.pos / CDS.length | AA.pos / AA>
##INFO=<ID=LOF,Number=.,Type=String,Description="Predicted loss of function effects for this variant. Format: 'Gene_Name | Gene_ID | Number_of_transcripts_in_gene | Percent_of_transcripts_affected'">
##INFO=<ID=NMD,Number=.,Type=String,Description="Predicted nonsense mediated decay effects for this variant. Format: 'Gene_Name | Gene_ID | Number_of_transcripts_in_gene | Percent_of_transcripts_affected'">

It would be very useful to split these into their own Zarr arrays. We could add this as an option, like --parse-snpeff or something (I'm not sure how stable these formats are across versions, though).

@jeromekelleher jeromekelleher added the enhancement label Mar 4, 2024
@jeromekelleher
Contributor Author

Type inference for these sub-columns would be an issue also. Hopefully the output of various annotation programs will be fairly dependable, and we can bake in lookup tables, defaulting to String if not known.
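As a rough illustration of the lookup-table idea, here is a minimal sketch. The table name, the function name, and the specific type assignments are all assumptions made for the example (guessed from the sub-field names in the LOF and NMD headers above), not a vetted list and not anything vcf2zarr provides.

```python
# Hypothetical lookup table: (INFO field ID, sub-field name) -> VCF type name.
# The entries are illustrative guesses based on the sub-field names only.
KNOWN_SUBFIELD_TYPES = {
    ("LOF", "Number_of_transcripts_in_gene"): "Integer",
    ("LOF", "Percent_of_transcripts_affected"): "Float",
    ("NMD", "Number_of_transcripts_in_gene"): "Integer",
    ("NMD", "Percent_of_transcripts_affected"): "Float",
}


def subfield_type(info_id, subfield):
    """Return the type for a sub-field, defaulting to String if not known."""
    return KNOWN_SUBFIELD_TYPES.get((info_id, subfield), "String")
```

Anything missing from the table, such as every ANN sub-field here, falls back to String.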

@jeromekelleher
Contributor Author

There's a basic question here about whether this should be done in vcf2zarr as part of the VCF conversion process, or whether we should post-process some VCF columns that have been stored as Zarr arrays to extract annotations. I'm inclined to go with parsing the Zarr arrays, perhaps as something like

vcf2zarr extract-annotations <ZARR DIR> 

It would look for known annotation INFO fields (like variant_ANN etc above) and split each one into the corresponding Zarr columns.

Re naming these, the simplest thing is to do something like variant_ANN_Allele, etc, i.e., follow the nested naming.
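A sketch of how the nested names could be derived from a header Description string, assuming the sub-field list is the first single-quoted, pipe-separated run in the description (as in the LOF and NMD headers above). The function name is hypothetical.

```python
import re


def subfield_array_names(info_id, description):
    """Build nested array names like variant_LOF_Gene_Name from a Description."""
    # Assumption: the sub-field list is the first single-quoted span.
    match = re.search(r"'([^']*)'", description)
    if match is None:
        return []
    names = [part.strip() for part in match.group(1).split("|")]
    return [f"variant_{info_id}_{name}" for name in names]


lof_description = (
    "Predicted loss of function effects for this variant. Format: "
    "'Gene_Name | Gene_ID | Number_of_transcripts_in_gene | "
    "Percent_of_transcripts_affected'"
)
names = subfield_array_names("LOF", lof_description)
```

Some ANN sub-field names (e.g. "cDNA.pos / cDNA.length") contain spaces and slashes, so a real implementation would probably need to sanitise them before using them as array names.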

@jeromekelleher
Contributor Author

This is not straightforward... Looking at an example from recent 1000 Genomes data, we have

<zarr.core.Array '/variant_ANN' (96475, 18) object>
Functional annotations: 'Allele | Annotation | Annotation_Impact | Gene_Name | Gene_ID | Feature_Type | Feature_ID | Transcri>
[['A|intergenic_region|MODIFIER|DEFB125|ENSG00000178591|intergenic_region|ENSG00000178591|||n.60070G>A||||||'
  '' '' ... '' '' '']
 ['C|intergenic_region|MODIFIER|DEFB125|ENSG00000178591|intergenic_region|ENSG00000178591|||n.60083T>C||||||'
  '' '' ... '' '' '']
 ['C|intergenic_region|MODIFIER|DEFB125|ENSG00000178591|intergenic_region|ENSG00000178591|||n.60114T>C||||||'
  '' '' ... '' '' '']
 ...
 ['A|downstream_gene_variant|MODIFIER|STK35|ENSG00000125834|transcript|ENST00000381482.7|protein_coding||c.*35483G>A|||||4306>
  'A|intragenic_variant|MODIFIER|STK35|ENSG00000125834|gene_variant|ENSG00000125834|||n.2152861G>A||||||'
  '' ... '' '' '']
 ['A|downstream_gene_variant|MODIFIER|STK35|ENSG00000125834|transcript|ENST00000381482.7|protein_coding||c.*35600T>A|||||4423>
  'A|intragenic_variant|MODIFIER|STK35|ENSG00000125834|gene_variant|ENSG00000125834|||n.2152978T>A||||||'
  '' ... '' '' '']
 ['A|downstream_gene_variant|MODIFIER|STK35|ENSG00000125834|transcript|ENST00000381482.7|protein_coding||c.*35631T>A|||||4454>
  'A|intragenic_variant|MODIFIER|STK35|ENSG00000125834|gene_variant|ENSG00000125834|||n.2153009T>A||||||'
  '' ... '' '' '']]

So, the ANN column is 2D, with (it looks like) a maximum of 18 annotations for a given variant in this set. Each of these annotations is a pipe-separated list of mostly string data. So, we could separate this out into ~15 arrays of dimension (variants, 18) but I don't think there's much point. It's not going to map well to the Zarr model (because of all the strings).

I think this is a place where integrating with a different technology designed for handling sparse string data is the right approach.
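For concreteness, this is roughly what the split into per-field (variants, max_annotations) arrays would look like. It is a minimal sketch using nested lists in place of the Zarr/NumPy object array, with a hypothetical function name, and it assumes empty strings are padding. It also shows why the result is mostly empty strings.

```python
def split_annotations(ann, field_names, fill=""):
    """Split a 2D grid of pipe-separated strings into one 2D grid per field.

    ann is indexed as ann[variant][annotation]; empty strings are padding.
    """
    columns = {
        name: [[fill for _ in row] for row in ann] for name in field_names
    }
    for i, row in enumerate(ann):
        for j, entry in enumerate(row):
            if entry == "":
                continue  # padding slot, leave the fill value in place
            # zip stops at the shorter sequence, so extra or missing
            # sub-fields are silently ignored in this sketch.
            for name, value in zip(field_names, entry.split("|")):
                columns[name][i][j] = value
    return columns


ann = [
    ["A|intergenic_region|MODIFIER", ""],
    ["A|downstream_gene_variant|MODIFIER", "A|intragenic_variant|MODIFIER"],
]
columns = split_annotations(ann, ["Allele", "Annotation", "Annotation_Impact"])
```

Each output grid keeps the padded shape of the input, so with 18 annotation slots per variant and typically one or two in use, nearly every cell is an empty string.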

@jeromekelleher
Contributor Author

Going to close this as a "wontfix" as it's out of scope for the moment.

@jeromekelleher jeromekelleher closed this as not planned Apr 19, 2024