Integrating functionality from an independent project #614
shz9 asked this question in Show and tell · Unanswered
Replies: 2 comments
-
Thanks for the detailed proposal, @shz9! We'd definitely welcome your efforts on …
-
Hi @shz9 - this sounds great! A few initial thoughts:
-
Hi all,
First of all, I would like to thank everyone for all the great work that went into this toolkit. Almost a year ago, I independently started working on something similar: statistical genetics tools/simulators built on the PyData ecosystem. I was hoping to eventually integrate the work with a mature Python package like scikit-allel, or have it stand as an independent project, but I think it fits much better under the umbrella of sgkit. The tools that I have been developing are here:
https://github.com/shz9/gwasimulator
I have been working on this pretty much alone, so documentation/testing is quite poor at this stage. But in terms of conceptual contributions, the modules in the above repo implement the following:
Linkage Disequilibrium (LD) tools: The impetus for my work was trying to efficiently train coordinate-wise Polygenic Risk Score (PRS) models in Python, and handling LD is the major bottleneck there. Therefore, I wrote functions/modules to do the following:
-- Efficiently compute and store large LD matrices (both banded and dense). I'm using Dask to compute the correlation between variants and then storing this huge matrix on disk in the form of Zarr arrays. I can comfortably compute and store the LD matrix of chromosome 1 (100k x 100k) from the 1000G project on my laptop and it usually takes ~20 minutes.
-- For many applications in statistical genetics, we don't need the full dense LD matrix, so I implemented functions to either shrink or sparsify the matrix beyond a certain window. I have functions that compute the Wen-Stephens shrinkage estimator of LD (though this part of the code probably needs independent review). Currently, the functions support windows in units of cM, but they can be easily extended to take in Megabases, or some other units.
-- For sparse LD matrices, I store them in Variable Length (VarLen) Zarr arrays (or ragged arrays) instead of scipy's sparse matrix format because I found the read access from Zarr to be much faster, especially when I need to grab the data one row (or one column) at a time.
-- I implemented a wrapper that allows for efficient read access of rows of the LD matrices, taking into account their chunk structure. I also optimized the chunking structure so that it's faster to read rows of the LD matrix (this is important for fitting the coordinate-wise PRS models).
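To make the LD pipeline above concrete, here is a minimal sketch of the idea: compute a variant-by-variant correlation matrix with Dask, then keep only a window around the diagonal as ragged per-variant rows. All names and the toy random genotypes are hypothetical; in the real pipeline the matrix would be persisted with `da.to_zarr` using row-oriented chunks rather than materialized in memory:

```python
import numpy as np
import dask.array as da

rng = np.random.default_rng(0)

# Toy genotype matrix: n_samples x n_variants allele counts (0/1/2).
n_samples, n_variants = 500, 200
genotypes = da.from_array(
    rng.integers(0, 3, size=(n_samples, n_variants)).astype(np.float64),
    chunks=(n_samples, 50),
)

# Standardize each variant; the LD matrix is then the variant-variant
# correlation matrix R = X_std.T @ X_std / n_samples.
std = (genotypes - genotypes.mean(axis=0)) / genotypes.std(axis=0)
ld = (std.T @ std) / n_samples  # (n_variants, n_variants), lazy

# In practice, persist with da.to_zarr(...) using chunks like
# (rows_per_chunk, n_variants) so reading one row at a time is cheap.
ld_np = ld.compute()

# Window-based sparsification: keep only entries within `window`
# variants of the diagonal, as one variable-length row per variant
# (mirroring the VarLen/ragged Zarr layout described above).
window = 10
ragged_rows = [
    ld_np[i, max(0, i - window): i + window + 1]
    for i in range(n_variants)
]
```

The row-oriented chunking matters because coordinate-wise PRS solvers sweep over variants one at a time, so each update touches exactly one LD row.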
Phenotype simulation tools: Given my focus on PRS methods, I also needed to develop phenotype simulation tools to test the predictive performance/robustness of those methods. My phenotype simulator supports sparse genetic architecture (in the form of spike-and-slab or mixtures of Gaussian effects) and allows the user to set the heritability for the trait. For a separate project that I'm collaborating on, I also implemented a multi-population phenotype simulator where the user can control the level of correlation in the effect sizes between different populations.
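A spike-and-slab simulator with a heritability knob can be sketched as follows (function name, parameters, and toy genotypes are all hypothetical, not the actual gwasimulator API):

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate_phenotype(genotypes, h2=0.5, prop_causal=0.1, rng=rng):
    """Simulate a continuous trait under a spike-and-slab genetic
    architecture with target heritability h2 (illustrative sketch)."""
    n, m = genotypes.shape
    x = (genotypes - genotypes.mean(axis=0)) / genotypes.std(axis=0)

    # Spike-and-slab: most effects are exactly zero (the "spike"); a
    # proportion prop_causal are drawn from a Gaussian (the "slab").
    causal = rng.random(m) < prop_causal
    beta = np.where(causal, rng.normal(0.0, 1.0, size=m), 0.0)

    g = x @ beta                     # genetic value per individual
    g = g / g.std() * np.sqrt(h2)    # rescale genetic variance to h2
    e = rng.normal(0.0, np.sqrt(1.0 - h2), size=n)  # environmental noise
    return g + e

genos = rng.integers(0, 3, size=(1000, 300)).astype(float)
y = simulate_phenotype(genos)
```

The multi-population extension would draw the slab effects for each population from a multivariate Gaussian whose off-diagonal terms set the cross-population effect-size correlation.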
GWAS tools: The module also allows the user to perform simple GWAS on the real/simulated phenotypes. Currently, the code only supports GWAS on continuous traits, but I've been planning to extend it to binary, case-control traits. I also have a method (not properly tested) to perform GWAS with 3rd-party tools, such as PLINK.
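For a continuous trait, the simple GWAS amounts to a marginal (one-variant-at-a-time) linear regression; a minimal stand-in version (hypothetical names, toy data) looks like this:

```python
import numpy as np

def simple_gwas(genotypes, phenotype):
    """Marginal linear-regression GWAS for a continuous trait;
    returns per-variant effect estimates and z-scores (sketch)."""
    n, m = genotypes.shape
    x = (genotypes - genotypes.mean(axis=0)) / genotypes.std(axis=0)
    y = (phenotype - phenotype.mean()) / phenotype.std()

    # With standardized x and y, the marginal OLS effect equals the
    # genotype-phenotype correlation, with standard error roughly
    # sqrt((1 - beta^2) / (n - 2)).
    beta = x.T @ y / n
    se = np.sqrt((1.0 - beta**2) / (n - 2))
    return beta, beta / se

rng = np.random.default_rng(2)
genos = rng.integers(0, 3, size=(800, 100)).astype(float)
pheno = genos[:, 0] * 0.5 + rng.normal(size=800)  # variant 0 is causal
beta, z = simple_gwas(genos, pheno)
```

A case-control extension would swap the closed-form OLS fit for per-variant logistic regression, which is where most of the added complexity lives.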
Plotting: I have minimal plotting functions for generating Manhattan/QQ/LD plots.
Other work in progress:
-- Parsers for summary statistics. I saw that this is something you were hoping to incorporate into sgkit (Summary statistics IO and methods #440), and I have thought a bit about doing this on my end as well.
-- One area where I have a lot of incomplete code is incorporating functional/genomic annotations into this framework. These annotations take the form of per-variant attributes and are useful for many applications, PRS among them.
-- I also tried incorporating LD blocks into the code, as they're a useful approximation/simplification used by many statistical genetics methods to speed up/streamline computation. I will need to think more about the best way to incorporate them into my data structures, but my thinking was that it's best to represent them as pseudo-chromosomes.
I think this work and other future directions fit comfortably within the goals of sgkit, and I would like to collaborate on integrating/improving these functionalities within the sgkit framework. I can see that there will be a lot of getting up to speed on my end to understand the data structures and design of sgkit in order for this to go smoothly, and I would appreciate any help/pointers/guidance in the process.