GWAS linear regression implementation #16

Merged · 26 commits · Jul 17, 2020

Conversation

eric-czech (Collaborator)

This adds a method for calculating variant association statistics with a single continuous phenotype. This formulation is equivalent to standard OLS methods (lm in R, statsmodels in Python, etc.) but the lift here is in expressing this for a large number of individual regressions as matrix operations where some of the covariates are constant, and without ever requiring computation of an (n_samples x n_samples) matrix.
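To make that concrete, here is a rough NumPy sketch of the projection idea (illustrative only, not the implementation in this PR), assuming a dosage matrix G of shape (n_variants, n_samples), a covariate matrix X of shape (n_samples, n_covariates) that already includes an intercept column, and a single trait vector y:

import numpy as np

def ols_per_variant(G, X, y):
    # Orthonormal basis for the covariate column space, (n_samples, n_covariates)
    Q, _ = np.linalg.qr(X)
    # Residualize the trait and every variant against the covariates;
    # note this never forms an (n_samples x n_samples) projection matrix
    y_res = y - Q @ (Q.T @ y)
    G_res = G - (G @ Q) @ Q.T
    # Per-variant effect sizes and t statistics as vectorized operations
    denom = np.einsum('ij,ij->i', G_res, G_res)
    beta = (G_res @ y_res) / denom
    dof = X.shape[0] - X.shape[1] - 1
    resid = y_res[None, :] - beta[:, None] * G_res
    rss = np.square(resid).sum(axis=1)
    t = beta / np.sqrt(rss / dof / denom)
    return beta, t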

I think validating this on real data won't be very difficult but I'm going to hold off on writing anything for it until some of the core parts of the code stabilize a bit. This should be a good example to help frame the discussion in https://github.com/pystatgen/sgkit/issues/11 more.

Basic usage would look something like this:

ds = ds.assign(**{ 
  'call/dosage': (
    ['variants', 'samples'], 
    # `count_alleles` doesn't exist yet, but this would be an application
    sgkit.count_alleles(ds['call/genotype'])
  )
})
stats = sgkit.stats.association.linear_regression(
    ds,
    covariates=['pc0', 'pc1', 'age', 'sex'],
    dosage='call/dosage',
    trait='phenotype_0',
    add_intercept=True
)

I created a stats subpackage and put my association.py module in it but I don't know how we want to expose that -- sgkit.stats.linear_regression perhaps? Let me know if anybody has objections to that direction.

@tomwhite (Collaborator) commented Jul 6, 2020

High-level comment - I like the style you are using here of passing in the whole xarray (ds) and referring to inputs and outputs with variable names (e.g. dosage="call/dosage").

@alimanfoo (Collaborator)

Re converting genotype calls to dosage, I think we need a separate function from count_alleles(). In scikit-allel, GenotypeArray.count_alleles() computes the per-variant allele counts, i.e., summarises across samples. For computing dosage, we might want a function like sgkit.genotype_to_dosage(). Also would need some thought as to how that function should handle multiallelic variants.
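For what it's worth, a minimal sketch of what such a function could look like for the simple biallelic, additive case (genotype_to_dosage does not exist yet, and multiallelic and missing-call handling are left open):

import xarray as xr

def genotype_to_dosage(ds: xr.Dataset) -> xr.DataArray:
    # additive dosage: number of non-reference alleles per call,
    # summed over the ploidy dimension
    gt = ds['call/genotype']
    return (gt > 0).sum(dim='ploidy').rename('call/dosage')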

Re API, having to go via ds.assign() is quite verbose and maybe a little obscure to new users. Could we do something simpler like:

sgkit.genotype_to_dosage(ds, input='call/genotype', assign='call/dosage')

...? The input and assign args could even have those values as defaults? An alternative name for "assign" could be "output"?

Re sgkit.stats.association.linear_regression(), FWIW in scikit-allel I decided to create just a single namespace and pull in all functions because that seemed more user-friendly, but happy to work towards that.

@eric-czech (Collaborator, Author)

Re converting genotype calls to dosage

In using the name "dosage" there, I was trying to straddle a few possibilities and I think most of them won't align well with allele counting. Another reasonable name might have been "features" or "encoding" since we'll likely want to separate how alleles are counted from how variants are encoded in a GWAS. Additive encodings ignoring sex chromosomes could use a shared allele counting function but either way, I think they're going to diverge enough that it's not worth trying to factor out the common parts yet.

I am running into the need for allele counting in some other work though. I was thinking that functions to create variables like 'call/allele_count' and 'variant/allele_count' would both be reasonable. Have you sketched anything out for that by chance @alimanfoo? i.e. following up on https://github.com/pystatgen/sgkit/issues/3#issuecomment-654186782.

Re API, having to go via ds.assign() is quite verbose and maybe a little obscure to new users

I'm somewhat strongly in favor of an API that doesn't make in-place modifications to the dataset given how straightforward the xarray merge/assign utilities are to use. I think it removes a good bit of complexity (especially for methods that add many fields) and should be familiar for Pandas users. Maybe @tomwhite has some thoughts on that one?

Re sgkit.stats.association.linear_regression(), FWIW in scikit-allel I decided to create just a single namespace and pull in all functions because that seemed more user-friendly, but happy to work towards that.

Sounds good to me. I'm happy with sgkit.linear_regression if you guys are.

@alimanfoo (Collaborator) commented Jul 8, 2020

I'm somewhat strongly in favor of an API that doesn't make in-place modifications to the dataset given how straightforward the xarray merge/assign utilities are to use.

FWIW the assign syntax looked painful to me, there's a lot that would need explaining to a new user. Bear in mind many scikit-allel users were also new to Python, so tricks like unpacking a dict can cause people to stumble. Also making the user specify the dimension names here seemed unnecessary given that we know what the dimensions will be. Also a load of typing!

What about if you could do sgkit.foo(ds, input='call/genotype'), in which case the function returns a new DataArray for the user to assign manually or do whatever they want with, but could also do sgkit.foo(ds, input='call/genotype', assign='call/bar'), in which case the function would perform the assignment?
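Something like this minimal sketch, perhaps (sgkit.foo and both argument names are placeholders from the suggestion above, not existing API):

from typing import Optional, Union

import xarray as xr

def foo(ds: xr.Dataset, input: str = 'call/genotype',
        assign: Optional[str] = None) -> Union[xr.DataArray, xr.Dataset]:
    # placeholder computation standing in for the real method
    result = ds[input].sum(dim='ploidy')
    if assign is None:
        # return a new DataArray for the user to assign/merge manually
        return result
    # otherwise perform the assignment and return the updated dataset
    return ds.assign({assign: result})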

I'm happy with sgkit.linear_regression if you guys are.

If pulling up to the root namespace then the function name might need a little disambiguation, e.g., sgkit.linear_regression_association_test or something along those lines.

@alimanfoo (Collaborator)

I am running into the need for allele counting in some other work though. I was thinking that functions to create variables like 'call/allele_count' and 'variant/allele_count' would both be reasonable. Have you sketched anything out for that by chance @alimanfoo? i.e. following up on #3 (comment).

I may be wrongly conflating dosage and allele counting; I raised #21 to discuss, but I'm happy to take advice on naming and scope.

Re allele counting, the suggestion to have functions that can create 'call/allele_count' and 'variant/allele_count' sounds great. Would this be the same or different functions?

@eric-czech (Collaborator, Author)

FWIW the assign syntax looked painful to me, there's a lot that would need explaining to a new user. Bear in mind many scikit-allel users were also new to Python, so tricks like unpacking a dict can cause people to stumble. Also making the user specify the dimension names here seemed unnecessary given that we know what the dimensions will be. Also a load of typing!

Another option is to always use merge instead of being more explicit with assign:

import xarray as xr

def count_alleles(ds):
    return ds['call/genotype'].sum(dim='ploidy').rename('call/allele_count')

ds = ds.merge(count_alleles(ds))
# now 'call/allele_count' is in ds

This would mean that our functions could return DataArrays or Datasets and the above would still work. If it was as simple as just calling merge like that, would you agree that it's worth not needing to support assignments internally? We could pick default names for the arrays internally and let users override them before the merge if they wanted to (so we could side-step arguments for variable names too).
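For example, overriding the default name would just be a rename before the merge (reusing the hypothetical count_alleles above):

ds = ds.merge(count_alleles(ds).rename('call/alt_allele_count'))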

@eric-czech (Collaborator, Author)

An additional note on that: it also appears to work if the dataset is indexed and the array/dataset to be appended isn't, i.e.:

def count_alleles(ds):
    return ds['call/genotype'].sum(dim='ploidy').rename('call/allele_count')

cts = count_alleles(ds) # this has no index
ds.set_index(variants=('variant/contig', 'variant/position')).merge(cts)
# cts is aligned correctly when appended

That's great because I thought we might have to worry about copying indexes into results.

@alimanfoo (Collaborator)

If it was as simple as just calling merge like that, would you agree that it's worth not needing to support assignments internally? We could pick default names for the arrays internally and let users override them before the merge if they wanted to (so we could side-step arguments for variable names too).

(To disambiguate the different types of allele count, I'm going to assume for the moment the function sgkit.count_alleles() performs a per-variant allele count, i.e., summarising over samples and ploidy dimensions. And assume there is a different function which does something similar but does a per-call count, i.e., summarises over just the ploidy dimension.)

It's nice to know about merge, but I'm still wondering if there is a more obvious way to do it. As a user, I think I'd want to be able to do something like either:

ds['variant/allele_count'] = sgkit.count_alleles(ds, input='call/genotype')

...or:

sgkit.count_alleles(ds, input='call/genotype', assign='variant/allele_count')

The first option above would be very pandas-y, like assigning a new column to a dataframe; we use that kind of syntax often.

Also I'm often doing per-population allele counts, where I think I'd like to be able to do something like this:

loc_pop1_samples = ...  # obtain boolean vector locating samples from population 1 within the cohort
loc_pop2_samples = ...  # obtain boolean vector locating samples from population 2 within the cohort
# etc.
ds['variant/allele_count_pop1'] = sgkit.count_alleles(ds, samples=loc_pop1_samples, input='call/genotype')
ds['variant/allele_count_pop2'] = sgkit.count_alleles(ds, samples=loc_pop2_samples, input='call/genotype')
# etc.

Btw we should probably move this discussion to #3, it's somewhat off the main topic of this PR :-)

@eric-czech (Collaborator, Author)

ds['variant/allele_count'] = sgkit.count_alleles(ds, input='call/genotype')

FYI assignment is also fine, so if count_alleles was to return a single DataArray, then the above works. Or ds['variant/allele_count'] = sgkit.count_alleles(ds)['variant/allele_count'] works too if it returned a Dataset instead with multiple related arrays.

@eric-czech (Collaborator, Author)

Is there anything else you guys would like me to rethink for this (cc: @tomwhite / @alimanfoo / @jeromekelleher)?

I cleaned up a few things and tried to adapt to some conventions elsewhere. One wart left is that statsmodels throws a warning on import, which I'm catching in code. Maybe there's a better way to do that?
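For reference, one way to do that kind of import-time suppression (a sketch, not necessarily what this PR does):

import warnings

with warnings.catch_warnings():
    # silence the warning statsmodels emits when it is first imported
    warnings.simplefilter('ignore')
    import statsmodels.api as sm  # noqa: F401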

I'm happy to change the exported name too. My minor gripe with linear_regression_association_test is that it returns effects as well so it's not just the test. Maybe gwas_linear_regression?

@ravwojdyla (Collaborator) commented Jul 13, 2020

@eric-czech in your example it would be:

ds: SgkitDataset = sgkit_bgen.read_bgen(...)

# here the user can do whatever they like with the underlying dataset, including corrupting it
# which we can detect in the `from_dataset`, recreating dataset should be cheap in comparison
# to actual computation
ds = SgkitDataset.from_dataset(ds.dataset.sel(variant=ds.dataset['variant/contig'].isin(range(1, 23))))
# alternatively more fluent:
ds = ds.with_dataset(lambda dataset: dataset.sel(variant=dataset['variant/contig'].isin(range(1, 23))))

ds = sgkit.ld_prune(ds)

So yes - the same thing as your example.

This could be streamlined with a fluent API (which I'm not suggesting we should do at this point, but for reference):

(sgkit_bgen
  .read_bgen(...)
  .with_dataset(lambda dataset: dataset.sel(variant=dataset['variant/contig'].isin(range(1, 23))))
  .ld_prune(ds)
)

But filtering to autosomes should be relatively common, so we should have a function for that:

ds: SgkitDataset = sgkit_bgen.read_bgen(...)
ds = sgkit.filter_autosomes(ds) 
ds = sgkit.ld_prune(ds)

or a more complicated example:

def my_custom_reusable_imputation_pipeline(ds: SgkitDataset) -> SgkitDataset:
  # my custom ds manipulation - strictly expects a valid SgkitDataset and is type safe
  ...

def my_custom_ld_pruning(ds: xr.Dataset) -> xr.Dataset:
  # my custom XR manipulation
  ...

ds: SgkitDataset = sgkit_bgen.read_bgen(...)
ds = sgkit.filter_autosomes(ds) 
ds = my_custom_reusable_imputation_pipeline(ds)
ds = ds.with_dataset(my_custom_ld_pruning(ds.dataset))
ds = sgkit.pc_relate(ds, pcs)

I believe all these examples can be type safe, with code completion and data validation - which would be the benefits of encapsulation. WDYT?

@eric-czech (Collaborator, Author)

I believe all these examples can be type safe, with code completion and data validation - which would be the benefits of encapsulation. WDYT?

Makes sense to me, though I think type safe will need to mean that the constructor validation is run on every call to with_dataset (e.g. for something like ds.with_dataset(lambda ds: ds.drop('variant/contig'))). I wouldn't say that's necessarily a bad thing but I don't think it becomes very different from a check_*_dataset call in each genetic method, and they could get expensive if we add some value-based checks.

The majority of methods only need part of what's in an SgkitDataset too, so I think requiring that the full contract be met would be quite limiting -- or at least that's something I've found to be pretty irksome about the standard encapsulation approach in other scientific libs we've used (especially in R). Can you see some way around this? That would be my biggest protest against a wrapper that just does validation.

@jeromekelleher (Collaborator)

Just piping in here - this is a great discussion. I don't have anything to add, but it's great to see the details being thought through here.

@tomwhite (Collaborator)

I wonder if value-based checks should be made explicit - in either style of API - since they may not be cheap. User documentation could include them as a way of demonstrating good practice. Pipelines generally have QC upfront, so sprinkling some validation checks in there should sit quite naturally.

Type, dimension and variable name checks on the other hand are very cheap, and should be done internally in every method call as precondition checks. Constructors should check all variables in a dataset, other methods should only check the fields they rely on.
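A rough sketch of the cheap kind of precondition check described here (illustrative names only, not sgkit code):

from typing import Mapping, Tuple

import xarray as xr

def check_variables(ds: xr.Dataset, required: Mapping[str, Tuple[str, ...]]) -> None:
    # verify only the variables a method relies on, and only names/dimensions
    for name, dims in required.items():
        if name not in ds:
            raise ValueError(f'missing required variable: {name}')
        if ds[name].dims != dims:
            raise ValueError(f'{name} has dims {ds[name].dims}, expected {dims}')

# e.g. at the top of a method:
# check_variables(ds, {'call/genotype': ('variants', 'samples', 'ploidy')})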

@ravwojdyla (Collaborator)

@eric-czech yep, the validation would need to run on each ctor/boxing, but we could mitigate that in a number of ways - flags to control the depth of the validation, or @tomwhite's suggestion (which I'm +1 to). Regarding a partial SgkitDataset, I think we also have a couple of options here, depending on how type safe we want to be, but I would suggest starting simple: SgkitDataset could have lightweight validations (of whatever data is provided, even if partial), and if along the way we want to make it more type safe by having more concrete classes to handle different cases, we can add that later. But I agree, we should definitely support that use case.

@eric-czech (Collaborator, Author)

I would suggest starting simple: SgkitDataset could have lightweight validations (of whatever data is provided, even if partial), and if along the way we want to make it more type safe by having more concrete classes to handle different cases, we can add that later

Having more concrete classes to handle different cases makes sense to me but I don't see how allowing a class to do a partial validation on whatever data was provided would work well. For example, how could you know what to validate here?

type(ds) # SgkitDataset
# This is OK, the first with_dataset can ignore 'variant/contig' on validation
ds.with_dataset(lambda ds: ds.drop('variant/contig')).with_dataset(sgkit.count_alleles) 
# Not OK, this needs to fail in a validation somewhere since contig is not optional
ds.with_dataset(lambda ds: ds.drop('variant/contig')).with_dataset(sgkit.ld_prune) 

I think you could drop down to Xarray, do all the mutations, and then box the type again with fields to check at the end with something like sgkit.count_alleles(SgkitDataset(ds, vars_to_validate=['call/genotype'])), but then the optional fields become tightly coupled to the method you're about to call.

I'm assuming by more concrete classes you mean there could be something like a "MinimalSgkitDataset" type that only enforces some fields, but once you start crossing all of the different inputs with GWAS operations and scikit-allel functions, the number of unique combinations of required fields is going to be pretty large (and hard to name concrete classes for). My concern is that it's easy to spec this out given only what we've attempted so far in sgkit and that once we start growing the applications, we'll have overly general types because it's too hard to model them more granularly. To be dramatic about it (lol), I don't want to end up with an oppressive type system that we live with just because it will be easy to say we should always err on the side of too much validation.

In the end I think it would be great if you wanted to propose a new data structures API, but I would request that it answer two questions:

  • What concept do the dataset wrappers model? I think genotype calls have made for a simple example but SgkitDataset is definitely the wrong level of generality and it's not clear to me in which direction the wrappers will break out. Some possibilities are:
    • Classes more or less for each method (so there will be many of them)
    • Classes targeting a more general set of individual biological concepts (e.g. https://scikit-allel.readthedocs.io/en/stable/model.html)
    • Classes targeting an entire kind of analysis, e.g. "GWASDataset", "WESDataset", "PopGenDataset", or file type e.g. "PLINKDataset", "VCFDataset"
    • Generic but somewhat domain specific ND-array wrappers, e.g. Hail MatrixTable
  • If we choose a level other than the first one above, how can we make it possible to run methods without defining a large number of unnecessary fields/variables?

@ravwojdyla (Collaborator)

So I'm suggesting keeping it relatively simple and type safe at the beginning, and as we grow more use cases we can (re)evaluate whether we need more classes or maybe a change of tack. No solution will be a silver bullet, obviously. Here's how that would look right now in your example:

ds_no_contig = ds.with_dataset(lambda ds: ds.drop('variant/contig'))

# This is OK, the first with_dataset can ignore 'variant/contig' on validation
sgkit.count_alleles(ds_no_contig)

# Not OK, this needs to fail in a validation somewhere since contig is not optional
sgkit.ld_prune(ds_no_contig)

Building upon @tomwhite's comment, we could have:

  • with_dataset does lightweight checks on the data it has available in the dataset (which might be partial data), with heavyweight data checks on request (either a flag or separate methods)
  • count_alleles would check for the availability of 'call/genotype' and in this case pass
  • ld_prune would check for the availability of variant/contig and fail (at runtime)

A fair question would be how this is better than just doing checks on XRs passed into methods, and I would argue:

  • centralised data validation in the ctor/with_dataset of SgkitDataset (there is potential for memoization here, but that is a separate topic)
  • potential to embed domain specific methods into SgkitDataset (including fluent APIs) (with code completion etc)
  • potential for a more elaborate class structure and type safety, where some failures could be caught at the type level

I see it as a middle ground between an oppressive type system and flexibility, with potential for improvement as we build up more use cases. And again, I don't want to block this PR (or anyone else); if anything, all the work should continue with the current assumptions, and if we want to evaluate this idea we can make all the necessary changes later. If anyone feels strongly against this idea, we/I can drop it :) Otherwise I can also sketch out a PR with the API above, which would be a simple place to start. What do you think @eric-czech, @tomwhite?

@eric-czech (Collaborator, Author)

with_dataset does lightweight checks on the data it has available in the dataset (which might be partial data), with heavyweight data checks on request (either a flag or separate methods)

What kind of flags/methods are you imagining for that one? I like that decoupling the more I think about it -- that could be a nice way to deal with repetitive patterns in analysis without having to push it into class definitions.

@tomwhite (Collaborator)

FWIW I feel that a wrapper needs to be useful and compelling enough to pull its weight. I'm +1 on @ravwojdyla sketching out something in a separate PR (that we may never adopt :), while this one is merged with the Xarray API. What still needs doing on this one @eric-czech?

@eric-czech (Collaborator, Author)

What still needs doing on this one @eric-czech?

Nothing I think? Looking back through our conversations I don't think I missed any of the easier/concrete suggestions. I'm sure it will need some refactoring at some point but it shouldn't interfere with much else in the meantime.

@eric-czech (Collaborator, Author)

FYI I don't know why the status still shows that there are conflicts. I tried using the web editor to resolve them twice, but it failed both times and told me to restart, so I did it on the command line instead. I fixed the small conflict that there was and pushed the clean commit to this same branch, so I'm not sure why those commits aren't showing up here or why GitHub still thinks there are conflicts.

Maybe using the web editor screwed something up? Has anybody else run into this before?

@ravwojdyla (Collaborator) commented Jul 15, 2020

with_dataset does lightweight checks on the data it has available in the dataset (which might be partial data), with heavyweight data checks on request (either a flag or separate methods)

What kind of flags/methods are you imagining for that one? I like that decoupling the more I think about it -- that could be a nice way to deal with repetitive patterns in analysis without having to push it into class definitions.

@eric-czech I think that would likely be case/type specific validation methods - like say ds.validate_genotype(), but I also like your idea with type classes like SgkitDataset[MySchema](xr_ds), which could validate how the data conforms to types (standards) from MySchema. Would be interesting to explore this idea further! I might go ahead and sketch a PR (and as mentioned before, we might not use it in the end but it might be easier to capture the idea and continue the discussion :) )
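As a very rough illustration of the SgkitDataset[MySchema](xr_ds) idea (every name here is hypothetical; nothing like this exists in sgkit yet):

import xarray as xr

class GenotypeCallSchema:
    required_variables = ['call/genotype']

class SgkitDataset:
    def __init__(self, ds: xr.Dataset, schema=None):
        if schema is not None:
            missing = [v for v in schema.required_variables if v not in ds]
            if missing:
                raise ValueError(f'missing required variables: {missing}')
        self.dataset = ds

    def __class_getitem__(cls, schema):
        # SgkitDataset[GenotypeCallSchema] returns a constructor bound to that schema
        return lambda ds: cls(ds, schema=schema)

# usage: sg_ds = SgkitDataset[GenotypeCallSchema](xr_ds)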

Also btw - currently GH says there are no conflicts but the build has failed.

@eric-czech (Collaborator, Author)

Ah I see, I guess it was just being slow.

@alimanfoo (Collaborator)

Hi folks, just to say I raised an issue (#43), thought it worth having a dedicated place for continuing discussion of use of classes and methods and related issues started here. I haven't put much of an issue description in yet, feel free to try and frame the problem concisely :-)

@eric-czech (Collaborator, Author)

Hey @ravwojdyla / @tomwhite, could one of you take a peek when you get a chance at my mypy fixes in pystatgen/sgkit@f2c0352 and officially approve if this is looking OK now?

Review threads (resolved) on: sgkit/stats/association.py, requirements.txt, sgkit/typing.py, sgkit/tests/test_association.py, .pre-commit-config.yaml

def test_linear_regression_statistics(ds):
    def validate(dfp: DataFrame, dft: DataFrame) -> None:
        print(dfp)
@ravwojdyla (Collaborator) commented Jul 17, 2020

Are these print stmts intentional?

@eric-czech (Collaborator, Author)

I was planning to do that often with these method tests since they'll only be shown when there's an error, and I don't often find smart diffs from assertion utils (e.g. np.testing) to be as good for debugging as simply seeing the datasets when they're small. I take it as a somewhat acceptable convention since I've seen it in Xarray/Dask tests, and haven't seen any downsides other than that stdout is a mess if not captured by pytest (can't think of why that would ever be necessary though).
