Parse annotations into separate columns #63

jeromekelleher · 2024-03-04T11:25:12Z

Variant-level annotations are often included as INFO tags with substructure, e.g.

##SnpEffVersion="4.3i (build 2016-12-15 22:33), by Pablo Cingolani"
##SnpEffCmd="SnpEff  -noStats -lof GRCh38.86 /gpfs/commons/home/usevani/compbio/CCDG/Project_CCDG_14151_B01_GRM_WGS/final_annotated_vcfs/tmp_dir_annt/CCDG_14151_B01_GRM_WGS_2020-08-05_chr20.recalibrated_variants.annotated.normalize.vcf.gz "
##INFO=<ID=ANN,Number=.,Type=String,Description="Functional annotations: 'Allele | Annotation | Annotation_Impact | Gene_Name | Gene_ID | Feature_Type | Feature_ID | Transcript_BioType | Rank | HGVS.c | HGVS.p | cDNA.pos / cDNA.length | CDS.pos / CDS.length | AA.pos / AA>
##INFO=<ID=LOF,Number=.,Type=String,Description="Predicted loss of function effects for this variant. Format: 'Gene_Name | Gene_ID | Number_of_transcripts_in_gene | Percent_of_transcripts_affected'">
##INFO=<ID=NMD,Number=.,Type=String,Description="Predicted nonsense mediated decay effects for this variant. Format: 'Gene_Name | Gene_ID | Number_of_transcripts_in_gene | Percent_of_transcripts_affected'">

It would be very useful to split these into their own Zarr arrays. We could add this as an option, like --parse-snpeff or something (though I'm not sure how stable these formats are across versions, etc.).
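As a rough sketch of how the sub-field names might be recovered from a header line: the Description strings above embed a quoted, pipe-separated list, so a small helper can pull the names out. The function name `parse_subfield_names` is hypothetical, and this assumes the quoted-list convention holds, which is exactly the stability question raised above.

```python
import re


def parse_subfield_names(description):
    """Return the sub-field names from a Description that embeds a
    quoted, pipe-separated list such as 'A | B | C'.

    Returns an empty list if no such quoted list is found.
    """
    match = re.search(r"'([^']*\|[^']*)'", description)
    if match is None:
        return []
    return [name.strip() for name in match.group(1).split("|")]


# Description text copied from the LOF header line above.
lof_description = (
    "Predicted loss of function effects for this variant. "
    "Format: 'Gene_Name | Gene_ID | Number_of_transcripts_in_gene | "
    "Percent_of_transcripts_affected'"
)
lof_fields = parse_subfield_names(lof_description)
```

Here `lof_fields` holds the four LOF sub-field names in header order.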


jeromekelleher · 2024-03-04T11:58:10Z

Type inference for these sub-columns would be an issue also. Hopefully the output of various annotation programs will be fairly dependable, and we can bake in lookup tables, defaulting to String if not known.
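A minimal sketch of the lookup-table idea, defaulting to String when a sub-field is not known. The table name, the function name, and the specific type assignments for the LOF sub-fields are all assumptions for illustration, not something checked against the annotation tools' documentation.

```python
# Hypothetical lookup table keyed by (INFO ID, sub-field name).
# The type assignments below are assumed for illustration only.
KNOWN_SUBFIELD_TYPES = {
    ("LOF", "Number_of_transcripts_in_gene"): "Integer",
    ("LOF", "Percent_of_transcripts_affected"): "Float",
}


def infer_subfield_type(info_id, subfield):
    """Look up the VCF-style type of a sub-field, falling back to String."""
    return KNOWN_SUBFIELD_TYPES.get((info_id, subfield), "String")
```

For example, `infer_subfield_type("ANN", "Allele")` falls through to `"String"` because it has no entry in the table.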

jeromekelleher · 2024-04-18T12:02:19Z

There's a basic question here about whether this should be done in vcf2zarr as part of the VCF conversion process, or whether we should post-process some VCF columns that have been stored as Zarr arrays to extract annotations. I'm inclined to go with parsing the Zarr arrays, perhaps as something like

vcf2zarr extract-annotations <ZARR DIR>

It would look for known annotation INFO fields (such as variant_ANN, from the ANN tag above) and extract the corresponding Zarr columns.

Re naming these, the simplest thing is to do something like variant_ANN_Allele, etc., i.e., follow the nested naming.
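The nested naming could be sketched as below. The helper `nested_array_name` is hypothetical, and replacing characters such as "." and "/" with underscores is an added assumption to cope with sub-field names like HGVS.c; the thread itself only proposes the variant_ANN_Allele pattern.

```python
import re


def nested_array_name(info_id, subfield):
    """Build an array name like variant_ANN_Allele from an INFO ID and
    a sub-field name.

    Assumption: runs of characters other than letters and digits are
    collapsed to a single underscore.
    """
    safe = re.sub(r"[^0-9A-Za-z]+", "_", subfield).strip("_")
    return f"variant_{info_id}_{safe}"


allele_name = nested_array_name("ANN", "Allele")
hgvs_name = nested_array_name("ANN", "HGVS.c")
```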

jeromekelleher · 2024-04-18T22:10:55Z

This is not straightforward... Looking at an example from recent 1000 Genomes data, we have

<zarr.core.Array '/variant_ANN' (96475, 18) object>
Functional annotations: 'Allele | Annotation | Annotation_Impact | Gene_Name | Gene_ID | Feature_Type | Feature_ID | Transcri>
[['A|intergenic_region|MODIFIER|DEFB125|ENSG00000178591|intergenic_region|ENSG00000178591|||n.60070G>A||||||'
  '' '' ... '' '' '']
 ['C|intergenic_region|MODIFIER|DEFB125|ENSG00000178591|intergenic_region|ENSG00000178591|||n.60083T>C||||||'
  '' '' ... '' '' '']
 ['C|intergenic_region|MODIFIER|DEFB125|ENSG00000178591|intergenic_region|ENSG00000178591|||n.60114T>C||||||'
  '' '' ... '' '' '']
 ...
 ['A|downstream_gene_variant|MODIFIER|STK35|ENSG00000125834|transcript|ENST00000381482.7|protein_coding||c.*35483G>A|||||4306>
  'A|intragenic_variant|MODIFIER|STK35|ENSG00000125834|gene_variant|ENSG00000125834|||n.2152861G>A||||||'
  '' ... '' '' '']
 ['A|downstream_gene_variant|MODIFIER|STK35|ENSG00000125834|transcript|ENST00000381482.7|protein_coding||c.*35600T>A|||||4423>
  'A|intragenic_variant|MODIFIER|STK35|ENSG00000125834|gene_variant|ENSG00000125834|||n.2152978T>A||||||'
  '' ... '' '' '']
 ['A|downstream_gene_variant|MODIFIER|STK35|ENSG00000125834|transcript|ENST00000381482.7|protein_coding||c.*35631T>A|||||4454>
  'A|intragenic_variant|MODIFIER|STK35|ENSG00000125834|gene_variant|ENSG00000125834|||n.2153009T>A||||||'
  '' ... '' '' '']]

So, the ANN column is 2D, with (it looks like) a maximum of 18 annotations for a given variant in this set. Each of these annotations is a pipe-separated list of mostly string data. So, we could separate this out into ~15 arrays of dimension (variants, 18) but I don't think there's much point. It's not going to map well to the Zarr model (because of all the strings).
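For concreteness, here is a rough sketch of what that separation would involve. Plain nested lists stand in for the Zarr array, the function name `split_annotations` is hypothetical, and only the first ten ANN sub-field names are listed because the header shown earlier is truncated. The two annotation strings are copied from the output above.

```python
# First ten ANN sub-field names as they appear in the header; the
# remaining names are omitted because the header line is truncated.
ANN_FIELDS = [
    "Allele", "Annotation", "Annotation_Impact", "Gene_Name", "Gene_ID",
    "Feature_Type", "Feature_ID", "Transcript_BioType", "Rank", "HGVS.c",
]


def split_annotations(ann_rows, field_names):
    """Split a 2D grid of pipe-separated annotation strings into one
    2D grid per sub-field, keeping the (variants, annotations) shape.

    Empty or short entries are padded with "" and any sub-fields
    beyond len(field_names) are dropped.
    """
    columns = {name: [] for name in field_names}
    for row in ann_rows:
        per_field = {name: [] for name in field_names}
        for annotation in row:
            parts = annotation.split("|") if annotation else []
            parts = parts + [""] * (len(field_names) - len(parts))
            for name, part in zip(field_names, parts):
                per_field[name].append(part)
        for name in field_names:
            columns[name].append(per_field[name])
    return columns


ann_rows = [
    ["A|intergenic_region|MODIFIER|DEFB125|ENSG00000178591|intergenic_region"
     "|ENSG00000178591|||n.60070G>A||||||", ""],
    ["A|intragenic_variant|MODIFIER|STK35|ENSG00000125834|gene_variant"
     "|ENSG00000125834|||n.2152861G>A||||||", ""],
]
columns = split_annotations(ann_rows, ANN_FIELDS)
```

Each entry of `columns` has the same (variants, annotations) shape as the input, and most cells are empty strings, which illustrates why the result is a poor fit for dense arrays.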

I think this is a place where integrating with a different technology designed for handling sparse string data is the right approach.

jeromekelleher · 2024-04-19T13:29:45Z

Going to close this as a "wontfix" as it's out of scope for the moment.

jeromekelleher added the enhancement label on Mar 4, 2024

jeromekelleher closed this as not planned on Apr 19, 2024
