Parse annotations into separate columns #63
Comments
Type inference for these sub-columns would also be an issue. Hopefully the output of the various annotation programs will be fairly dependable, and we can bake in lookup tables, defaulting to String if the type is not known.
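As a rough illustration of the lookup-table idea above, here is a minimal sketch in plain Python. The table contents and the names `KNOWN_FIELD_TYPES` and `convert_field` are hypothetical, chosen only for this example; they are not part of vcf2zarr or of any annotation tool's specification.

```python
# Hypothetical lookup table mapping annotation sub-field names to types.
# The entries are illustrative assumptions, not a vetted list.
KNOWN_FIELD_TYPES = {
    "Distance": int,
    "Annotation_Impact": str,
}


def convert_field(name, raw, known_types=KNOWN_FIELD_TYPES):
    """Convert one raw sub-field string using the lookup table.

    Unknown field names default to str. Empty strings become None, and
    values that fail conversion are returned unchanged as strings.
    """
    if raw == "":
        return None
    typ = known_types.get(name, str)
    try:
        return typ(raw)
    except ValueError:
        return raw


print(convert_field("Distance", "42"))
print(convert_field("Some_Unknown_Field", "42"))
```

Falling back to the raw string on a failed conversion keeps the parse from aborting when an annotation program emits something unexpected, at the cost of mixed types within a column.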
There's a basic question here about whether this should be done in vcf2zarr as part of the VCF conversion process, or whether we should post-process some VCF columns that have been stored as Zarr arrays to extract annotations. I'm inclined to go with parsing the Zarr arrays, perhaps as something like
It would look for some known annotation INFO fields (like
Re naming these, the simplest thing is to do something like
This is not straightforward... Looking at an example from recent 1000 Genomes data, we have
So, the ANN column is 2D, with (it looks like) a maximum of 18 annotations for a given variant in this set. Each of these annotations is a pipe-separated list of mostly string data. So, we could separate this out into ~15 arrays of dimension
I think this is a place where integrating with a different technology designed for handling sparse string data is the right approach.
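To make the shape problem concrete, here is a minimal sketch of splitting a 2D column of pipe-separated annotation strings into one 2D array per sub-field. It uses plain Python lists rather than Zarr arrays, and the function name `split_annotations`, the field names, and the sample values are all assumptions made up for illustration.

```python
def split_annotations(ann, field_names, fill=""):
    """Split pipe-separated annotation strings into per-field 2D lists.

    ann is a list with one entry per variant; each entry is a list of
    annotation strings. Every output list has shape
    (num_variants, max_annotations), padded with `fill`.
    """
    max_ann = max((len(row) for row in ann), default=0)
    out = {name: [[fill] * max_ann for _ in ann] for name in field_names}
    for i, row in enumerate(ann):
        for j, entry in enumerate(row):
            # zip stops at the shorter sequence, so short or extra
            # sub-fields do not raise an error.
            for name, value in zip(field_names, entry.split("|")):
                out[name][i][j] = value
    return out


# Made-up sample data: two variants with two and one annotations.
ann = [
    ["A|missense_variant|MODERATE", "A|intron_variant|MODIFIER"],
    ["T|synonymous_variant|LOW"],
]
fields = ["Allele", "Annotation", "Annotation_Impact"]
columns = split_annotations(ann, fields)
print(columns["Annotation"])
```

Every variant is padded out to the maximum annotation count, so with a maximum of 18 and most variants having far fewer, the resulting arrays would be mostly fill values. That is the sparsity concern raised above.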
Going to close this as a "wontfix" as it's out of scope for the moment.
Variant level annotations are often included as INFO tags with substructure, e.g.
It would be very helpful and useful to split these into their own Zarr arrays. We could add this as an option, like
--parse-snpeff
or something (I'm not sure how stable these formats are across versions, etc., though).