Integrating functionality from an independent project #614
shz9 asked this question in Show and tell · Unanswered
Replies: 2 comments
-
Thanks for the detailed proposal, @shz9! We'd definitely welcome your efforts on …
-
Hi @shz9 - this sounds great! A few initial thoughts:
-
Hi all,
First of all, I would like to thank everyone for all the great work that went into this toolkit. Almost a year ago, I independently started working on something similar: statistical genetics tools/simulators built on the PyData ecosystem. I was hoping to eventually integrate the work with a mature Python package like scikit-allel, or have it stand as an independent project, but I think it fits much better under the umbrella of sgkit. The tools that I have been developing are here:
https://github.com/shz9/gwasimulator
I have been working on this pretty much alone, so documentation/testing is quite poor at this stage. But in terms of conceptual contributions, the modules in the above repo implement the following:
Linkage Disequilibrium (LD) tools: The impetus for my work was trying to efficiently train coordinate-wise Polygenic Risk Score (PRS) models in Python, and handling LD is the major bottleneck there. Therefore, I wrote functions/modules to do the following:
-- Efficiently compute and store large LD matrices (both banded and dense). I'm using Dask to compute the correlation between variants and then storing this huge matrix on disk in the form of Zarr arrays. I can comfortably compute and store the LD matrix of chromosome 1 (100k x 100k) from the 1000G project on my laptop and it usually takes ~20 minutes.
-- For many applications in statistical genetics, we don't need the full dense LD matrix, so I implemented functions to either shrink or sparsify the matrix beyond a certain window. I have functions that compute the Wen-Stephens shrinkage estimator of LD (though this part of the code probably needs independent review). Currently, the functions support windows in units of cM, but they can be easily extended to take in Megabases, or some other units.
-- For sparse LD matrices, I store them in Variable Length (VarLen) Zarr arrays (or ragged arrays) instead of scipy's sparse matrix format because I found the read access from Zarr to be much faster, especially when I need to grab the data one row (or one column) at a time.
-- I implemented a wrapper that allows for efficient read access of rows of the LD matrices, taking into account their chunk structure. I also optimized the chunking structure so that it's faster to read rows of the LD matrix (this is important for fitting the coordinate-wise PRS models).
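To make the LD pipeline above concrete, here is a minimal sketch of the idea: compute a variant-by-variant correlation matrix with Dask, then keep only a window around the diagonal as ragged per-variant rows. All names and the toy random genotypes are hypothetical; in the real pipeline the matrix would be persisted with `da.to_zarr` using row-oriented chunks rather than materialized in memory:

```python
import numpy as np
import dask.array as da

rng = np.random.default_rng(0)

# Toy genotype matrix: n_samples x n_variants allele counts (0/1/2).
n_samples, n_variants = 500, 200
genotypes = da.from_array(
    rng.integers(0, 3, size=(n_samples, n_variants)).astype(np.float64),
    chunks=(n_samples, 50),
)

# Standardize each variant; the LD matrix is then the variant-variant
# correlation matrix R = X_std.T @ X_std / n_samples.
std = (genotypes - genotypes.mean(axis=0)) / genotypes.std(axis=0)
ld = (std.T @ std) / n_samples  # (n_variants, n_variants), lazy

# In practice, persist with da.to_zarr(...) using chunks like
# (rows_per_chunk, n_variants) so reading one row at a time is cheap.
ld_np = ld.compute()

# Window-based sparsification: keep only entries within `window`
# variants of the diagonal, as one variable-length row per variant
# (mirroring the VarLen/ragged Zarr layout described above).
window = 10
ragged_rows = [
    ld_np[i, max(0, i - window): i + window + 1]
    for i in range(n_variants)
]
```

The row-oriented chunking matters because coordinate-wise PRS solvers sweep over variants one at a time, so each update touches exactly one LD row.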
Phenotype simulation tools: Given my focus on PRS methods, I also needed to develop phenotype simulation tools to test the predictive performance/robustness of those methods. My phenotype simulator supports sparse genetic architecture (in the form of spike-and-slab or mixtures of Gaussian effects) and allows the user to set the heritability for the trait. For a separate project that I'm collaborating on, I also implemented a multi-population phenotype simulator where the user can control the level of correlation in the effect sizes between different populations.
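A spike-and-slab simulator with a heritability knob can be sketched as follows (function name, parameters, and toy genotypes are all hypothetical, not the actual gwasimulator API):

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate_phenotype(genotypes, h2=0.5, prop_causal=0.1, rng=rng):
    """Simulate a continuous trait under a spike-and-slab genetic
    architecture with target heritability h2 (illustrative sketch)."""
    n, m = genotypes.shape
    x = (genotypes - genotypes.mean(axis=0)) / genotypes.std(axis=0)

    # Spike-and-slab: most effects are exactly zero (the "spike"); a
    # proportion prop_causal are drawn from a Gaussian (the "slab").
    causal = rng.random(m) < prop_causal
    beta = np.where(causal, rng.normal(0.0, 1.0, size=m), 0.0)

    g = x @ beta                     # genetic value per individual
    g = g / g.std() * np.sqrt(h2)    # rescale genetic variance to h2
    e = rng.normal(0.0, np.sqrt(1.0 - h2), size=n)  # environmental noise
    return g + e

genos = rng.integers(0, 3, size=(1000, 300)).astype(float)
y = simulate_phenotype(genos)
```

The multi-population extension would draw the slab effects for each population from a multivariate Gaussian whose off-diagonal terms set the cross-population effect-size correlation.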
GWAS tools: The module also allows the user to perform simple GWAS on the real/simulated phenotypes. Currently, the code only supports GWAS on continuous traits, but I've been planning to extend it to binary, case-control traits. I also have a method (not properly tested) to perform GWAS with 3rd-party tools, such as PLINK.
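For a continuous trait, the simple GWAS amounts to a marginal (one-variant-at-a-time) linear regression; a minimal stand-in version (hypothetical names, toy data) looks like this:

```python
import numpy as np

def simple_gwas(genotypes, phenotype):
    """Marginal linear-regression GWAS for a continuous trait;
    returns per-variant effect estimates and z-scores (sketch)."""
    n, m = genotypes.shape
    x = (genotypes - genotypes.mean(axis=0)) / genotypes.std(axis=0)
    y = (phenotype - phenotype.mean()) / phenotype.std()

    # With standardized x and y, the marginal OLS effect equals the
    # genotype-phenotype correlation, with standard error roughly
    # sqrt((1 - beta^2) / (n - 2)).
    beta = x.T @ y / n
    se = np.sqrt((1.0 - beta**2) / (n - 2))
    return beta, beta / se

rng = np.random.default_rng(2)
genos = rng.integers(0, 3, size=(800, 100)).astype(float)
pheno = genos[:, 0] * 0.5 + rng.normal(size=800)  # variant 0 is causal
beta, z = simple_gwas(genos, pheno)
```

A case-control extension would swap the closed-form OLS fit for per-variant logistic regression, which is where most of the added complexity lives.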
Plotting: I have minimal plotting functions for generating Manhattan/QQ/LD plots.
Other work in progress:
-- Parsers for summary statistics. I saw that this is something you were hoping to incorporate into sgkit (Summary statistics IO and methods #440), and I have thought a bit about doing this on my end as well.
-- One area where I have a lot of incomplete code is incorporating functional/genomic annotations into this framework. These annotations take the form of per-variant attributes and are useful for many applications, PRS among them.
-- I also tried incorporating LD blocks into the code, as they're a useful approximation/simplification used by many statistical genetics methods to speed up/streamline computation. I will need to think more about the best way to incorporate them into my data structures, but my thinking was that it's best to represent them as pseudo-chromosomes.
I think this work and other future directions fit comfortably within the goals of sgkit, and I would like to collaborate on integrating/improving these functionalities within the sgkit framework. I can see that there will be a lot of getting up to speed on my end to understand the data structures and design of sgkit in order for this to go smoothly, and I would appreciate any help/pointers/guidance in the process.