ENH: add weight-awareness to many functions in scipy.stats and scipy.… #6931
Conversation
…spatial.distance

Adds weight awareness to:

- scipy.stats:
  - gmean
  - hmean
  - mode*
  - tmean
  - tvar*
  - tstd*
  - tsem*
  - moment
  - variation
  - skew
  - kurtosis
  - describe
  - itemfreq*
  - collapse_weights (new function)
  - percentileofscore
  - sem
  - zscore
  - zmap
  - sigmaclip
  - pearsonr
  - spearmanr
  - rankdata
- scipy.spatial.distance:
  - minkowski*
  - euclidean*
  - sqeuclidean*
  - correlation
  - cosine
  - hamming
  - jaccard
  - kulsinski
  - cityblock
  - chebyshev
  - braycurtis
  - canberra*
  - yule
  - matching
  - dice
  - rogerstanimoto
  - russellrao
  - sokalmichener
  - sokalsneath

Asterisked functions do not follow all weight invariant operations. Fixes #5718. Discussion to follow.
I'm also writing a blog post on these issues, which should be published shortly on https://www.twosigma.com/insights. Here's a bit of it to get the theory straight:

**Weight, what? What are weights?**

Weights are a way to signify that some observations are "more important" for some reason than others. There are quite a few reasons why this might be the case, with the most common being "frequency based weighting", "cost based weighting", and "variance based weighting".

**Frequency based weighting**

Imagine you have a series of temperature observations of your workplace over two weeks:

**Cost based weighting**

Imagine that people at your workplace consider the default temperature to be 74°F—every degree away from that makes people less happy. But also, on Monday people are twice as sensitive to the temperature, and on Friday people are half as sensitive. This could get complicated to represent—the average effect of temperature on happiness would be something like:

But, assuming the observations above start on a Monday, you could assign the following weights to the observations:
**Variance based weighting**

This one is a bit more complicated. Imagine that for the first half of your observations, your thermometer was a little broken—not completely broken, but the measurements had additional noise. If you wanted to know the average temperature, you could discount those noisier observations. Specifically, if you want to linearly combine independent observations so that the result has the minimum variance, then
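The minimum-variance rule above amounts to inverse-variance weighting. A minimal sketch with NumPy (the temperature readings and noise variances here are made up for illustration):

```python
import numpy as np

# Hypothetical readings: the first two came from the noisy thermometer
# (noise variance 4), the last two from the repaired one (variance 1).
temps = np.array([71.0, 75.0, 73.0, 74.0])
variances = np.array([4.0, 4.0, 1.0, 1.0])

# Minimum-variance linear combination of independent observations:
# weight each observation by the inverse of its noise variance.
weights = 1.0 / variances

weighted_mean = np.average(temps, weights=weights)

# The variance of the combined estimate is 1 / sum(1/sigma_i^2),
# which is never larger than the variance of the best single reading.
combined_var = 1.0 / weights.sum()
```

Note that the weights need not be normalized; `np.average` divides by their sum.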
**Operations on Weighted Data**

Above we gave possible interpretations of weights. Weights can also be defined somewhat more formally as an unnormalized probability distribution (or probability measure) on the records in your data, which in turn is a discretization of the joint distribution of your data.
**Invariant Operations**

Given any dataset, and any weight-aware function
**Invariant-Based Testing**

In this pull request, we've added testing based on many of the above invariants in the form of a function wrapper,
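The wrapper itself is cut off by the page rendering. As a rough illustration of the idea (not the PR's actual wrapper), a checker for one invariant (calling a statistic with integer frequency weights must match calling it on the data with each record repeated that many times) could look like:

```python
import numpy as np

def weight_checked(func, rtol=1e-10):
    """Illustrative sketch: verify the frequency-weight invariant for a
    weight-aware statistic before returning its result."""
    def wrapper(a, weights):
        weighted = func(a, weights=weights)
        # Integer weights must be equivalent to plain repetition.
        repeated = func(np.repeat(a, weights), weights=None)
        assert np.allclose(weighted, repeated, rtol=rtol), \
            "frequency-weight invariant violated"
        return weighted
    return wrapper

# Example with a trivially weight-aware mean:
def wmean(a, weights=None):
    return np.average(a, weights=weights)

result = weight_checked(wmean)(np.array([1.0, 2.0, 5.0]),
                               weights=np.array([2, 1, 1]))
# (1*2 + 2 + 5) / 4 == 2.25
```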
This looks to me impossible to review, and will require a large number of unit tests. And it will require a lot of discussion about the meaning of weights. The PR changes mostly the descriptive statistics, which might be a bit easier than the hypothesis tests; I guess mostly about number of observations versus sums of weights and degrees of freedom corrections.
I think I have pretty complete test coverage on this--anything checked by normal tests in scipy.stats or scipy.spatial.distance is also checked in several weighted permutations (sometimes 5 variations). It does only cover descriptive stats and distance metrics, and yeah, I raise NotImplementedError for questions about degrees of freedom with weights. I'm pretty sure it's provable that you can't correctly implement degrees of freedom and have weights that obey the invariant operations on measure; I can dig that up again if you'd like.

Sorry, I didn't see the unit tests; I don't like that GitHub doesn't display longer files anymore.
Regarding statsmodels/statsmodels#2848, I don't see why there has to be a choice between

@josef-pkt no worries :) Thanks for your time in looking at it!

I do see, @josef-pkt, how adding weights to error estimates or to robust estimates like you were doing in statsmodels would be especially difficult. I agree that's a nontrivial problem to solve, because different interpretations of weights do give you different results for those kinds of calculations. The problem would then be underspecified.
The problem is not underspecified if we ask users which kind of weights they want, i.e. I'm dropping questions on your current implementation (given that the unit tests don't have any quick examples to check). Spatial is not my area. However, var_weights are different from freq/prob_weights if the underlying function is nonlinear: w^q * f(x) is not the same as f(w x) for some q (or more generally a function q(w)).
Besides

The ones that are asterisked contain some exceptions--the vast majority of those don't hold to invariant 5, the scale-free weights. In the test files, "*_test=False" when the function is wrapped denotes a disabling of testing any individual invariant. Many of these didn't seem to be a large conceptual problem. For instance, in distance, all of the asterisked functions do not obey scale-free weights because they're measuring absolute distance, not average distance on each dimension. Since changing that would not be backwards compatible, and since there are valid reasons to do absolute distance, those functions were left without testing along that invariant.
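To make the scale-free invariant concrete (a sketch, not code from the PR): multiplying all weights by a positive constant should leave a normalized statistic unchanged, but it rescales an absolute-distance statistic:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 2.0, 5.0])
w = np.array([1.0, 2.0, 1.0])

# A weighted mean is scale-free: dividing by w.sum() cancels the constant.
assert np.isclose(np.average(x, weights=w), np.average(x, weights=10 * w))

# A weighted cityblock distance, sum(w * |x - y|), is NOT scale-free:
# scaling the weights scales the distance itself.
def wcityblock(u, v, w):
    return np.sum(w * np.abs(u - v))

d1 = wcityblock(x, y, w)       # 1*1 + 0*2 + 2*1 = 3
d2 = wcityblock(x, y, 10 * w)  # 30
```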
As reference: https://github.com/statsmodels/statsmodels/blob/master/statsmodels/stats/weightstats.py#L42
tvar, tstd, and tsem are asterisked because they use ddof=1 by default. When ddof is not 0 and these functions are called, in this implementation a NotImplementedError is raised (from line 315). In the test coverage, I duplicated all tests with ddof != 0 with a ddof = 0 counterpart. The weight wrapper treats NotImplementedError as acceptable behavior, so the tests pass, as desired. mode and itemfreq were actually excluded from some tests because they are tested on strings, which my wrapper function didn't like.

Yeah, I agree that making ddof compatible with weights is nontrivial. Any ddof != 0 in this implementation while also adding weights should raise an error.
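A sketch of the guard being described (the function name and structure are illustrative, not the PR's exact code): refuse ddof corrections when weights are supplied, since the right correction depends on the kind of weights.

```python
import numpy as np

def var_weighted(a, weights=None, ddof=1):
    """Illustrative weighted variance that refuses ambiguous ddof."""
    a = np.asarray(a, dtype=float)
    if weights is None:
        return a.var(ddof=ddof)
    if ddof != 0:
        # The correct correction depends on whether weights count
        # records (frequency) or scale variances, so refuse to guess.
        raise NotImplementedError(
            "ddof != 0 is ambiguous when weights are supplied")
    mean = np.average(a, weights=weights)
    return np.average((a - mean) ** 2, weights=weights)
```

With integer frequency weights and ddof=0 this matches the unweighted variance of the expanded data; any nonzero ddof would need to know what kind of weights it was given.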
@apbard I tried adding back in 1303814, but I'm getting some very odd errors at the moment:
scipy/stats/stats.py
Outdated
```diff
-    return sd / np.sqrt(am.count())
+    sd = np.sqrt(_var(am, axis=axis, ddof=ddof, weights=weights))
+    n = am.count(axis=axis)
+    return sd / np.sqrt(n)
```
this assumes nobs=n and not weights.sum()
Hmm that's true, I missed that one. I'm not convinced that `sd / sqrt(weights.sum())` is a valid estimate of standard error for all definitions of weights--it probably only works for freq_weights, so I'm inclined to remove weighting from `tsem`. Agreed?
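For frequency weights specifically, the formula does check out (with ddof=0, which sidesteps the contentious correction); a quick sketch with made-up numbers:

```python
import numpy as np

a = np.array([71.0, 74.0, 76.0])
w = np.array([2, 3, 1])  # integer frequency weights

# Unweighted standard error of the explicitly repeated dataset:
repeated = np.repeat(a, w)
sem_repeated = repeated.std(ddof=0) / np.sqrt(repeated.size)

# Weighted sd divided by sqrt(weights.sum()) reproduces it exactly,
# because for frequency weights w.sum() really is the number of records.
mean_w = np.average(a, weights=w)
sd_w = np.sqrt(np.average((a - mean_w) ** 2, weights=w))
sem_weighted = sd_w / np.sqrt(w.sum())
```

For variance- or cost-based weights there is no such justification, which is the concern raised above.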
```
----------
.. [1] R. B. D'Agostino, A. J. Belanger and R. B. D'Agostino Jr.,
       "A suggestion for using powerful and informative tests of
       normality", American Statistician 44, pp. 316-321, 1990.
```
why did you remove the reference?
Looks like I missed 22f34db, adding it back in now
scipy/stats/stats.py
Outdated
```python
    return np.array([items, freq]).T


def collapse_weights(a, weights=None, axis=None):
    """
    Collapse a weigthed representation of an array.
```
typo weighted
thanks, fix will be committed shortly
scipy/stats/stats.py
Outdated
```python
    return np.array([items, freq]).T


def collapse_weights(a, weights=None, axis=None):
```
default for axis in stats is usually 0
thanks, changed locally, will be committing shortly
```python
    uniq_a = np.take(sorted_a, new_a, axis=axis)
    cum_w = np.append(0, weights[sort_ix].cumsum())
    uniq_w = np.diff(cum_w[np.append(new_a, -1)])
    return uniq_a, uniq_w
```
I'm too tired to figure out how this works.
how can this work along an axis if there are an unequal number of uniques?
If axis is not None, and ndim > 1, then it considers a "record" to consist of all of the axes which are not the axis specified. It looks for uniques among "records", and sums up their weight. It assumes that `a.shape[axis] == weights.size`.
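The cumulative-sum trick in the snippet can be sketched in 1-D (an illustrative reconstruction, not the PR's exact code): sort the records, find where each run of equal values starts, and recover each run's total weight as a difference of cumulative sums.

```python
import numpy as np

def collapse_weights_1d(a, weights):
    """Collapse duplicate values by summing their weights (1-D sketch)."""
    a = np.asarray(a)
    weights = np.asarray(weights, dtype=float)
    sort_ix = np.argsort(a, kind="stable")
    sorted_a = a[sort_ix]
    # Indices where a new unique value starts in the sorted array.
    new_a = np.flatnonzero(np.r_[True, sorted_a[1:] != sorted_a[:-1]])
    uniq_a = sorted_a[new_a]
    # cum_w[i] = total weight of the first i sorted records, so each
    # run's weight is a difference of two cumulative sums.
    cum_w = np.append(0.0, weights[sort_ix].cumsum())
    uniq_w = np.diff(np.append(cum_w[new_a], cum_w[-1]))
    return uniq_a, uniq_w

vals, wts = collapse_weights_1d([3, 1, 3, 2, 1], [1.0, 0.5, 2.0, 1.0, 0.5])
# vals -> [1, 2, 3], wts -> [1.0, 1.0, 3.0]
```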
Adding that explanation to the docstring.
Hm I need more examples in general. I'll work on adding examples tomorrow.
I managed to read up to

One main issue is whether it's not better to go with frequency weights all the way instead of raising an exception whenever
Thanks so much @josef-pkt for giving it a look! That's definitely a question; my gut is to err on the side of not being opinionated and raise NotImplementedError, or just plain not implement things, when they're vague. I think degrees of freedom, error estimates, hypothesis testing, and robust estimates are things that require more information than I'd like to assume. If those do get added, I'm guessing there might be some nasty gotchas that should be pointed out in the docs. Or maybe for those kinds of things there needs to be a separate type of weight kwarg?
…e validation missed in a previous commit
My plan for statsmodels is to require specific weights like Stata and no more plain

The question is about future keywords in scipy.stats if there will be different kinds of weights. I'm a bit doubtful whether the generic

(Annoying case: weights in R glm are largely a mystery to me and might include magic.)
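For comparison, NumPy's `np.cov` already distinguishes two of the weight kinds discussed here: `fweights` (integer repetition counts) and `aweights` (relative, variance-style weights), and they behave differently under scaling. A small sketch:

```python
import numpy as np

x = np.array([1.0, 2.0, 4.0, 7.0])

# fweights=[2,1,1,1] is exactly equivalent to repeating the first record:
c_f = np.cov(x, fweights=[2, 1, 1, 1])
c_rep = np.cov(np.array([1.0, 1.0, 2.0, 4.0, 7.0]))

# aweights are scale-free: doubling them all changes nothing, whereas
# the same values passed as fweights would imply twice as many records.
c_a1 = np.cov(x, aweights=[1.0, 1.0, 1.0, 1.0])
c_a2 = np.cov(x, aweights=[2.0, 2.0, 2.0, 2.0])
```

This is essentially the freq_weights / var_weights split described for statsmodels, expressed as two separate keywords.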
Finally fixed a weird numpy bug. In CI, there are currently two issues out of my depth:
Those don't seem like things caused by this PR as far as I can tell, so absent any new information I think this is good?
Yes, the sparse failures can be ignored (they're a numpy 1.12.0rc2 issue). Exceeding the 50 min time limit could be caused by this PR if you added long-running tests; we were already close to the limit. If not, it may be just random timing variation.
@metaperture I have thought about the
@apbard yeah, I think that probably the biggest expense timewise was layering weight_checked into *_calling_conventions. Maybe I should take that out to get back under the time limit?

@rgommers should I be looking at the time limit as a hard constraint? I had imagined it was just a config var somewhere.

@apbard I'm a bit concerned, though, that refactoring the tests as you suggest will lead to a loss of coverage for edge cases (especially weird axis questions). It feels like using weight_checked everywhere besides calling_conventions makes more sense to me. I think we just have contrary opinions on this; I can definitely see yours (especially in that it's a little clearer to many and less nested). Does anyone else have an opinion so we can break the deadlock? More coverage + cleaner semantics (imo) vs apbard's general test idea (less time and less arcane code)?
Yes, the 50 min timeout on TravisCI is not configurable. So we shouldn't come close to that, or we get really annoying problems. Timing varies quite a bit depending on system load, so the total run should ideally be under 40 min.
Speeding up the tests so we don't hit 50min in Travis CI
Just curious -- I'm guessing with something this big that it won't be coming out in a patch version. If that's the case, and it looks like 0.19.0 is getting frozen soon, is there anything left for me to do at this point? And if it gets pushed to 0.20.0 or later, should I be continuously merging in changes like I did yesterday, or should I wait to batch all merges until later? #opensourcenewbie Thanks for any info :)
@metaperture: as a contributor, you don't especially need to care about releases. The 0.19.x series will be branched off master and developed in parallel, with any relevant bugfixes backported, while development continues on the master branch as usual. In particular, there's no need to delay working on this, as the next release will be in ~6 months --- which often is a surprisingly short time, so the earlier things get in the better.
As to whether you need to continuously rebase as the master branch changes --- this is usually best to do only after bigger issues in the approach etc. are settled (I don't know what the status is for this PR on that front) and things are looking close to ready to merge.
Thanks @pv for the info. To anyone else, are there issues with the approach? This seems to squash quite a few bugs (the last 4 references to this are bugs that this PR would fix), be a better way to organize the code, and add significant functionality. Of course, I'm a bit biased :P
I think this looks promising, but as @josef-pkt said it's hard to review due to the sheer size of the PR. The

Then you can leave this PR for just the
@metaperture Will you have time to split this PR into two, and squash + rebase on master?
Modifications to scipy.stats and scipy.spatial.distance
by mtartre@twosigma.com
Adds weight awareness to:
Asterisked functions do not follow all weight invariant operations (exceptions are noted as disabling certain invariant operations in tests). Fixes #5718. Tested on Python 2.7 and Python 3.6. Discussion to follow.