Conversation
| Additionally, both covariate and trait arrays will be rechunked to have blocks | ||
| along the sample (row) dimension but not the column dimension (i.e. | ||
| they must be tall and skinny). | ||
|
|
There was a problem hiding this comment.
Was this warning removed for a reason?
There was a problem hiding this comment.
It was duplicated.
| hardy_weinberg_p_value_vec_jit = njit(hardy_weinberg_p_value_vec, fastmath=True) | ||
|
|
||
|
|
||
| def hardy_weinberg_test( |
There was a problem hiding this comment.
This function should be added to the top-level __init__.py so it's a part of the public API.
Also, +1 on the name (even though it's not a verb!). I think using the abbreviation hwe internally is fine too (as you have done).
There was a problem hiding this comment.
| n = len(obs_hets) | ||
| p = np.empty(n, dtype=np.float64) | ||
| for i in range(n): | ||
| p[i] = hardy_weinberg_p_value_jit(obs_hets[i], obs_hom1[i], obs_hom2[i]) |
There was a problem hiding this comment.
Observation: we're using Dask for parallelization here (across blocks) which is fine, but we may want to consider numba's parallel=True in the future for this loop as another dispatch option.
| # Otherwise compute genotype counts from calls | ||
| else: | ||
| # TODO: Use API genotype counting function instead, e.g. | ||
| # https://github.com/pystatgen/sgkit/issues/29#issuecomment-656691069 |
There was a problem hiding this comment.
Can you open an issue to track this please.
There was a problem hiding this comment.
sgkit/tests/test_hwe.py
Outdated
|
|
||
| def test_hwep__raise_on_negative(): | ||
| args = [[-1, 0, 0], [0, -1, 0], [0, 0, -1]] | ||
| for arg in args: |
There was a problem hiding this comment.
Nit: might be slighter neater to parameterise the test?
There was a problem hiding this comment.
sgkit/stats/hwe.py
Outdated
| # TODO: Use API genotype counting function instead, e.g. | ||
| # https://github.com/pystatgen/sgkit/issues/29#issuecomment-656691069 | ||
| mask = ds["call_genotype_mask"].any(dim="ploidy") | ||
| gtc = xr.where(mask, -1, ds["call_genotype"].sum(dim="ploidy")) # type: ignore[no-untyped-call] |
There was a problem hiding this comment.
Style nit/question: we are being a bit inconsistent on case for these variables. E.g. in count_alleles the variables are G, CTS, etc. Also, in the summary stats PR I used G for genotypes, but n_het for heterogeneous counts.
Are there some rules we can use? E.g. use lowercase except when we are replicating a paper which uses X, etc?
There was a problem hiding this comment.
pystatgen/sgkit@466859e#diff-2ec013d1f086d9558934c38ac90411d1R173
I like the capital convention so array names don't conflict with scalars when there are a lot of both. When there aren't a lot of either, switching to lower case name sounds good and is what I would prefer too. I don't know how to make a threshold for that clear, so I've been trying to err on the side of sticking with the capital letter convention for consistency (thanks for flagging this one).
I filed https://github.com/pystatgen/sgkit/issues/117 to make a record of this.
This PR contains a new implementation for https://github.com/pystatgen/sgkit/issues/28.
Summary of changes:
sgkit/stats/hwe.pymoduleThere may be better ways to organize jit-compiled function tests, see https://github.com/pystatgen/sgkit/issues/77.