
ENH: add weight-awareness to many functions in scipy.stats and scipy.spatial.distance #6931

Closed
wants to merge 22 commits into from

Conversation

@quantology commented Jan 6, 2017

Modifications to scipy.stats and scipy.spatial.distance
by mtartre@twosigma.com

Adds weight awareness to:

  • scipy.stats:
    • gmean
    • hmean
    • mode*
    • tmean
    • tvar*
    • tstd*
    • tsem*
    • moment
    • variation
    • skew
    • kurtosis
    • describe
    • itemfreq*
    • collapse_weights (new function)
    • percentileofscore
    • sem
    • zscore
    • zmap
    • sigmaclip
    • pearsonr
    • spearmanr
    • rankdata
  • scipy.spatial.distance:
    • minkowski*
    • euclidean*
    • sqeuclidean*
    • correlation
    • cosine
    • hamming
    • jaccard
    • kulsinski
    • cityblock
    • chebyshev
    • braycurtis
    • canberra*
    • yule
    • matching
    • dice
    • rogerstanimoto
    • russellrao
    • sokalmichener
    • sokalsneath

Asterisked functions do not satisfy all of the weight-invariant operations (the exceptions are noted in the tests by disabling the corresponding invariant checks). Fixes #5718. Tested on Python 2.7 and Python 3.6. Discussion to follow.

@quantology
Author

I'm also writing a blog post on these issues, which should be published shortly on https://www.twosigma.com/insights. Here's a bit of it to get the theory straight:

Weight, what? What are weights?

Weights are a way to signify that some observations are "more important" than others for some reason. There are quite a few reasons why this might be the case, the most common being "frequency-based weighting", "cost-based weighting", and "variance-based weighting".

Frequency-based weighting

Imagine you have a series of temperature observations of your workplace over two weeks: [72, 75, 72, 73, 72, 73, 75, 72, 73, 72]. If there are only k distinct values among n observations (here k=3, n=10), then you can reduce your data to a weighted representation which requires two vectors of length k—your data vector and your weight vector. For example, with the above data, the data vector would be [72, 73, 75] and the weight vector would be [5, 3, 2]. When k << n, this can drastically reduce memory requirements and increase speed, since the number of observations (often the base for complexity) is now k, not n. In this sense, the weights are simply a means to compress the data.
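For concreteness, here's a minimal numpy sketch of this compression (illustrative only, not code from this PR):

```python
import numpy as np

raw = np.array([72, 75, 72, 73, 72, 73, 75, 72, 73, 72])   # n = 10
data, weights = np.unique(raw, return_counts=True)          # k = 3
# data    -> [72 73 75]
# weights -> [5 3 2]

# a weight-aware statistic on (data, weights) matches the raw statistic:
assert np.isclose(np.average(data, weights=weights), raw.mean())
```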

Cost-based weighting

Imagine that people at your workplace consider the default temperature to be $74^\circ$F—every degree away from that makes people less happy. But also, on Monday people are twice as sensitive to the temperature, and on Friday people are half as sensitive. This could get complicated to represent—the average effect of temperature on happiness would be something like:
$$\sum_{DoW \in \{M, Tu, W, Th, F\}} (\text{proportion of days} = DoW)\,(\text{importance of } DoW)\,(\text{average } |T_i - 74^\circ| \text{ for day of week } DoW)$$

But, assuming the observations above start on a Monday, you could assign the following weights to the observations: [2, 1, 1, 1, 0.5, 2, 1, 1, 1, 0.5]. Then the average effect on happiness here would be the weighted average $\frac{\sum_i w_i |T_i - 74^\circ|}{\sum_i w_i}$. Here, the weights reduce the complexity of the representation of your problem, which often happens when dealing with outcomes of different importances or "costs".
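As a quick illustration, that weighted average can be computed directly with numpy (again just a sketch, not the PR's code):

```python
import numpy as np

temps = np.array([72, 75, 72, 73, 72, 73, 75, 72, 73, 72])
w = np.array([2, 1, 1, 1, 0.5, 2, 1, 1, 1, 0.5])  # Mon..Fri, two weeks

# weighted average effect: sum_i w_i * |T_i - 74| / sum_i w_i
effect = np.average(np.abs(temps - 74), weights=w)  # ~1.45 degrees
```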

Note: Survey weighting from poststratification or sampling probabilities falls most naturally under this bucket; see Gelman (2007) for these weighting schemes in a regression context. This interpretation also dovetails with Empirical Risk Minimization and M-estimators.

In this vein, some algorithms synthetically change weights so that they pay more attention to some data points than others based on classification accuracy; boosting is the most popular example of this.

Variance-based weighting

This one is a bit more complicated. Imagine that for the first half of your observations, your thermometer was a little broken—not completely broken, but the measurements had additional noise. If you wanted to know the average temperature, you could discount those noisier observations. Specifically, if you want to linearly combine independent observations so that the result has the minimum variance, then $w_i^* = \frac{1}{\sigma_i^2}$ is the ideal weighting for each observation. Other weights taking variance into account could be justified depending on the problem at hand.
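A small numpy sketch of this, with made-up noise levels for the broken-thermometer story:

```python
import numpy as np

rng = np.random.RandomState(0)
true_temp = 73.0
sigmas = np.array([2.0] * 5 + [0.5] * 5)   # first half: noisier thermometer
obs = true_temp + sigmas * rng.randn(10)

w = 1.0 / sigmas ** 2                      # inverse-variance weights
est = np.average(obs, weights=w)
# var(est) = 1 / w.sum(); no other linear combination of the
# observations has a smaller variance
```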

Note: many of the scipy.spatial.distance functions also have a C implementation when used through pdist and cdist. No new C implementations were added, but when weights are not given, the old C implementation is used whenever it would have been used previously.

Operations on Weighted Data

Above we gave possible interpretations of weights. Weights can also be defined somewhat more formally as an unnormalized probability distribution (or probability measure) on the records in your data, which in turn is a discretization of the joint distribution of your data.

Note that below, the matrix augmentation symbol $|$ (sometimes written $[\cdot \mid \cdot]$) is used to denote concatenating records of (dataset, weights) pairs.

Invariant Operations

Given any dataset and any weight-aware function $f$, there are several operations you can perform on the weights which should not change the result of $f$; that is, $f(D, w) = f(D', w')$.

  1. 1-weight assumption — If your dataset $D$ does not have any weights assigned to its observations, that is taken to mean that all the weights are 1.

$f(D) = f(D, null) = f(D, \vec{1})$

  2. 0-weights dropped — If any weights are 0, those records should not affect any functionals computed on the dataset.

Given rows $R = \{i \mid w_i = 0\}$,

$f(D, w) = f(D_{i \notin R}, w_{i \notin R})$

  3. combining observations — If any subset of the dataset has identical records, you can combine them, replacing them with a single record whose weight is the sum of the previous weights.

For any row $i$,
$R_i = \{j \mid D_j = D_i\}$,

$f(D, w) = f( [D_{j \notin R_i}|D_{i}], [w_{j \notin R_i}|\sum_{j \in R_i} w_j])$

  4. splitting observations — The inverse of combination, splitting, is also allowed: you can split an observation by duplicating its data and assigning weights to each of the new observations such that the weights of all the new observations sum to the weight of the old observation.

Given a record $i$ and a new series of weights $w'$ of length $k$ with $\sum w' = w_i$,

$f(D, w) = f([D_{j \neq i}|D_i], [w_{j \neq i}|w'])$

  5. scalefree weights — In general, weights are considered scale-free, so multiplying all of the weights by any constant shouldn't matter. In practice, however, there are some exceptions; many of the asterisked functions above do not satisfy this invariant.

Given any constant $k$,

$f(D, w) = f(D, kw)$

Invariant-Based Testing

In this pull request, we've added testing based on many of the above invariants in the form of a function wrapper, _weight_checked. The wrapper performs a number of randomized (and seeded) weight-invariant operations on the inputs and checks that the results are all equivalent, as they should be. Adding this one layer on top of the existing test suite was far simpler than replicating the whole test suite for weights.
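To make the idea concrete, here is a heavily stripped-down sketch of such a wrapper (not the PR's actual _weight_checked, which covers far more cases); it checks invariants 4 and 5 for a weighted mean:

```python
import numpy as np

def weight_checked(f, rtol=1e-10):
    """Wrap f(data, weights) and verify two of the weight invariants."""
    def checked(data, weights):
        base = f(data, weights)
        # invariant 5 (scale-free weights): rescaling all weights is a no-op
        assert np.allclose(f(data, 2.5 * weights), base, rtol=rtol)
        # invariant 4 (splitting): split the first record into two halves
        split_d = np.append(data, data[0])
        split_w = np.append(weights, weights[0] / 2.0)
        split_w[0] /= 2.0
        assert np.allclose(f(split_d, split_w), base, rtol=rtol)
        return base
    return checked

# e.g. a weighted mean satisfies all of the invariants:
wmean = weight_checked(lambda d, w: np.average(d, weights=w))
wmean(np.array([72., 73., 75.]), np.array([5., 3., 2.]))
```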

@josef-pkt
Member

This looks impossible to review to me, and it will require a large number of unit tests. (8 to 10 years ago there was a similar proposal for weights in stats.)

It will also require a lot of discussion about the meaning of weights; e.g., I'm currently thinking of 4 types of weights, similar to Stata (statsmodels/statsmodels#2848). In some cases, some of the weights will behave in the same way, but not in general.

The PR mostly changes the descriptive statistics, which might be a bit easier than the hypothesis tests; I guess it's mostly about the number of observations versus sums of weights, and degrees-of-freedom corrections.

@quantology
Author

Just noticed I missed 1303814 by @apbard

@quantology
Author

quantology commented Jan 6, 2017

I think I have pretty complete test coverage on this--anything checked by normal tests in scipy.stats or scipy.spatial.distance is also checked in several weighted permutations (sometimes 5 variations).

It does only cover descriptive stats and distance metrics, and yeah, I raise NotImplementedError for questions about degrees of freedom with weights. I'm pretty sure it's provable that you can't correctly implement degrees of freedom while having weights that obey the invariant operations on the measure; I can dig that up again if you'd like.

@josef-pkt
Member

Sorry, I didn't see the unit tests; I don't like that GitHub doesn't display longer files anymore.

@quantology
Author

quantology commented Jan 6, 2017

Regarding statsmodels/statsmodels#2848, I don't see why there has to be a choice between freq_weights and prob_weights: if you just let the measure be unnormalized you're fine, and that preserves the default_weight=1. var_weights also seem not to be an issue; for algorithms that use var_weights, there don't seem to be implementation issues. For now, though, I've left untouched all distance metrics that had var or inverse-var weights.

@quantology
Author

@josef-pkt no worries :) Thanks for your time in looking at it!

@quantology
Author

I do see, @josef-pkt, how adding weights to error estimates or to robust estimates like you were doing in statsmodels would be especially difficult. I agree that's a nontrivial problem to solve, because different interpretations of weights do give you different results for those kinds of calculations. The problem would then be underspecified.


@josef-pkt
Member

josef-pkt commented Jan 6, 2017

The problem is not underspecified if we ask users which kind of weights they want, i.e. I'm dropping weights as a useful term except in some generic sense, but use freq_weights, var_weights, prob_weights and i???_weights for a specified meaning.

A question on your current implementation (given that the unit tests don't have any quick examples to check): does the behavior of the current statistics functions follow your "invariance operations 1-5"?

Spatial is not my area. However, var_weights are different from freq/prob_weights if the underlying function is nonlinear: w^q * f(x) is not the same as f(w x) for some q (or, more generally, a function q(w)).
var_weights are like the inverse of the squared scale in the pdf of the distributions, while freq_weights or prob_weights just multiply on the outside of the function. I think that is the discussion for Minkowski in #5718.

(prob_weights and freq_weights are the same for all the main results in a model, but might differ in some result statistics. In Stata, for example, prob_weights or survey weights trigger robust sandwich standard errors, AFAIU. freq_weights have a clean likelihood interpretation; prob_weights might just rescale the data as in your "cost-based weighting" but not have a correctly specified likelihood interpretation.)

@quantology
Author

Besides wminkowski, no existing behavior was changed. The functions without asterisks follow all the invariant ops after relatively straightforward application of the principles (rankdata, percentileofscore, and scoreatpercentile being the functions that required the most finagling, due to ordinal ranking).

The ones that are asterisked contain some exceptions--the vast majority of those don't hold to invariant 5, the scalefree weights. In the test files, "*_test=False" when the function is wrapped denotes disabling the test for an individual invariant. Many of these didn't seem to be a large conceptual problem.

For instance, in distance, all of the asterisked functions fail to obey scalefree weights because they measure absolute distance, not average distance along each dimension. Since changing that would not be backwards compatible, and since there are valid reasons to use absolute distance, those functions were left without testing for that invariant.
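As a quick illustration of that invariant-5 failure, assuming the conventional definition of weighted euclidean distance, sqrt(sum(w * (u - v)**2)):

```python
import numpy as np

def weuclidean(u, v, w):
    # weighted euclidean as an *absolute* distance
    return np.sqrt(np.sum(w * (u - v) ** 2))

u = np.array([1.0, 2.0, 3.0])
v = np.array([2.0, 2.0, 5.0])
w = np.array([1.0, 2.0, 1.0])

print(weuclidean(u, v, w))      # sqrt(5)  ~ 2.236
print(weuclidean(u, v, 2 * w))  # sqrt(10) ~ 3.162 -> invariant 5 fails
```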

@josef-pkt
Member

As a reference: https://github.com/statsmodels/statsmodels/blob/master/statsmodels/stats/weightstats.py#L42
3 or 4 years ago I added weighted statistics, including t_test, to statsmodels, but restricted myself explicitly to freq_weights because those were the only ones I could make sense of (though at the time I still used the generic weights as the name). It's possible to "misuse" them for other weights as long as ddof and the sum of weights are appropriately set.

@quantology
Author

quantology commented Jan 6, 2017

tvar, tstd, and tsem are asterisked because they use ddof=1 by default. When these functions are called with weights and ddof is not 0, this implementation raises a NotImplementedError (from line 315). In the test coverage, I duplicated every test with ddof != 0 with a ddof = 0 counterpart. The weight wrapper treats NotImplementedError as acceptable behavior, so the tests pass, as desired.

mode and itemfreq were actually excluded from some tests because they are tested on strings, which my wrapper function didn't like.

@quantology
Author

Yeah, I agree that making ddof compatible with weights is nontrivial. In this implementation, any ddof != 0 combined with weights should raise an error.

@quantology
Author

quantology commented Jan 6, 2017

@apbard I tried adding back in 1303814, but I'm getting some very odd errors at the moment:

  1. [fixed] it looks like numpy (I'm on 1.11.2) doesn't like custom_notdouble
  2. _assert_within_tol fails on lines 564 and 1369, in _check_calling_conventions; still looking into that, but I'm not using the weighted version for those, so I'm not sure what changed
  3. [fixed] I'm getting a weird error in py3.6, but not in 2.7, for test_qhull in spatial, which I haven't touched, coming from https://github.com/scipy/scipy/blob/master/scipy/spatial/tests/test_qhull.py#L934, which looks pretty hacky.

- return sd / np.sqrt(am.count())
+ sd = np.sqrt(_var(am, axis=axis, ddof=ddof, weights=weights))
+ n = am.count(axis=axis)
+ return sd / np.sqrt(n)
Member

this assumes nobs=n and not weights.sum()

Author

Hmm, that's true--I missed that one. I'm not convinced that sd / sqrt(weights.sum()) is a valid estimate of the standard error for all definitions of weights--it probably only works for freq_weights, so I'm inclined to remove weighting from tsem. Agreed?
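For frequency weights, at least, the identity does hold, as a quick numpy check illustrates (an illustrative sketch, not code from the PR):

```python
import numpy as np

data = np.array([72.0, 73.0, 75.0])
weights = np.array([5, 3, 2])  # integer frequencies

# sem computed on the fully expanded sample...
expanded = np.repeat(data, weights)
sem_expanded = expanded.std(ddof=0) / np.sqrt(expanded.size)

# ...matches sd / sqrt(weights.sum()) on the weighted representation
mean = np.average(data, weights=weights)
sd = np.sqrt(np.average((data - mean) ** 2, weights=weights))
sem_weighted = sd / np.sqrt(weights.sum())

assert np.isclose(sem_expanded, sem_weighted)  # holds only for freq weights
```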

@rgommers added the enhancement (A new feature or improvement), scipy.spatial, and scipy.stats labels on Jan 6, 2017
----------
.. [1] R. B. D'Agostino, A. J. Belanger and R. B. D'Agostino Jr.,
"A suggestion for using powerful and informative tests of
normality", American Statistician 44, pp. 316-321, 1990.
Member

why did you remove the reference?

Author

Looks like I missed 22f34db, adding it back in now

return np.array([items, freq]).T


def collapse_weights(a, weights=None, axis=None):
"""
Collapse a weigthed representation of an array.
Member

typo weighted

Author

thanks, fix will be committed shortly

return np.array([items, freq]).T


def collapse_weights(a, weights=None, axis=None):
Member

default for axis in stats is usually 0

Author

thanks, changed locally, will be committing shortly

uniq_a = np.take(sorted_a, new_a, axis=axis)
cum_w = np.append(0, weights[sort_ix].cumsum())
uniq_w = np.diff(cum_w[np.append(new_a, -1)])
return uniq_a, uniq_w
Member

I'm too tired to figure out how this works.

how can this work along an axis if there are an unequal number of uniques?

Author

If axis is not None and ndim > 1, then it considers a "record" to consist of all of the axes other than the specified axis. It looks for uniques among "records" and sums up their weights. It assumes that a.shape[axis] == weights.size.
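A runnable numpy sketch of those semantics for the axis=0 case (not the PR's implementation; np.unique with axis= needs numpy >= 1.13):

```python
import numpy as np

a = np.array([[1, 2],
              [1, 2],
              [3, 4]])
w = np.array([1.0, 2.0, 5.0])

# with axis=0, each row is a "record"; duplicate rows are merged
# and their weights summed
uniq_a, inverse = np.unique(a, axis=0, return_inverse=True)
uniq_w = np.bincount(inverse.ravel(), weights=w)
# uniq_a -> [[1, 2], [3, 4]],  uniq_w -> [3., 5.]
```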

Author

Adding that explanation to the docstring.

Author

Hm I need more examples in general. I'll work on adding examples tomorrow.

@josef-pkt
Member

I managed to read up to scoreatpercentile, but need to leave that for another day.

One main issue is whether it wouldn't be better to go with frequency weights all the way instead of raising an exception whenever n is needed. Users could still change the usage by rescaling their weights so that weights.sum() is their desired number of observations. I think that would handle most cases except for var_weights.
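A tiny numpy illustration of that rescaling suggestion (a sketch, not code from the PR):

```python
import numpy as np

w = np.array([0.2, 0.5, 0.3])       # e.g. normalized prob-style weights
n_desired = 3                       # treat as 3 effective observations
w_freq = w * (n_desired / w.sum())  # now w_freq.sum() == 3
```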

@quantology
Author

Thanks so much @josef-pkt for giving it a look! That's definitely a question; my gut is to err on the side of not being opinionated and raise NotImplementedError (or just plain not implement things) when they're vague.

I think degrees of freedom, error estimates, hypothesis testing, and robust estimates require more information than I'd like to assume. If those do get added, I'm guessing there might be some nasty gotchas that should be pointed out in the docs. Or maybe those kinds of things need a separate type of weight kwarg?

@josef-pkt
Member

My plan for statsmodels is to require specific weights like Stata does, with no more plain weights keywords (unless, maybe, they are unambiguous).

The question is about future keywords in scipy.stats if there will be different kinds of weights. I'm a bit doubtful whether a generic weights keyword is useful if it works only for some of the "easy" cases and cannot be extended to the popular hypothesis tests, or even to ddof handling in var, for example.
(From what I have seen so far, this PR is more digestible than my initial impression suggested, by avoiding the "tough" cases.)

(Annoying case: weights in R glm are largely a mystery to me and might include magic.)

@quantology
Author

Finally fixed a weird numpy bug. In CI, there are currently two issues that are out of my depth:

Those don't seem to be caused by this PR as far as I can tell, so absent any new information I think this is good?

@rgommers
Member

Yes, the sparse failures can be ignored (they're a numpy 1.12.0rc2 issue).

Exceeding the 50-minute time limit could be caused by this PR if you added long-running tests; we were already close to the limit. If not, it may just be random timing variation.

@apbard
Contributor

apbard commented Jan 12, 2017

@rgommers I think that the time limit might also be related to #6947 since this PR is adding more pdist/cdist calls to *_calling_conventions.

@apbard
Contributor

apbard commented Jan 12, 2017

@metaperture I have thought about weight_checked as a wrapper instead of a test.
IMHO, it would be better to transform it into a general test like *_calling_conventions.
I think you can probably avoid testing against all possible dtypes and just check the different properties of weights.
This way it will still be a "one-stop shop", and any new function that supports weights should automagically be tested.
This, on the other hand, won't lead to more tests since, as I pointed out before, most of the current tests you have are redundant and have been removed in #6858. Furthermore, an explicit test will make it clearer how and when weights are tested.

@quantology
Author

@apbard yeah, I think the biggest expense timewise was probably layering weight_checked into *_calling_conventions. Maybe I should take that out to get back under the time limit? @rgommers, should I be looking at the time limit as a hard constraint? I had imagined it was just a config var somewhere.

@apbard I'm a bit concerned, though, that refactoring the tests as you suggest will lead to a loss of coverage for edge cases (especially weird axis questions). It feels like using weight_checked everywhere besides calling_conventions makes more sense to me. I think we just have contrary opinions on this; I can definitely see yours (especially in that it's a little clearer to many and less nested). Does anyone else have an opinion so we can break the deadlock? More coverage + cleaner semantics (imo) vs apbard's general test idea (less time and less arcane code)?

@rgommers
Member

> @rgommers should I be looking at the time limit as a hard constraint?

Yes, the 50-minute timeout on TravisCI is not configurable, so we shouldn't come close to it, or we get really annoying problems. Timing varies quite a bit depending on system load, so the total run should ideally be under 40 minutes.

@quantology
Author

Just curious -- I'm guessing that something this big won't be coming out in a patch version. If that's the case, and it looks like 0.19.0 is getting frozen soon, is there anything left for me to do at this point? And if it gets pushed to 0.20.0 or later, should I be continuously merging in changes like I did yesterday, or should I wait and batch all merges until later? #opensourcenewbie

Thanks for any info :)

@pv
Member

pv commented Jan 18, 2017

@metaperture: as a contributor, you don't especially need to care about releases. The 0.19.x series will be branched off master and developed in parallel, with any relevant bugfixes backported, while development continues on the master branch as usual. In particular, there's no need to delay working on this, as the next release will be in ~6 months --- which is often a surprisingly short time, so the earlier things get in the better.

@pv
Member

pv commented Jan 18, 2017

As to whether you need to continuously rebase as the master branch changes --- it's usually best to do that only after the bigger issues in the approach etc. are settled (I don't know the status of this PR on that front) and things look close to ready to merge.

@quantology
Author

Thanks @pv for the info. To anyone else: are there issues with the approach? This seems to squash quite a few bugs (the last 4 references to this PR are bugs it would fix), organize the code better, and add significant functionality. Of course, I'm a bit biased :P

@rgommers
Member

I think this looks promising, but as @josef-pkt said, it's hard to review due to the sheer size of the PR. The spatial.distance changes are probably easier to merge. I'd suggest getting the maintenance-only gh-7043 merged first, then opening a new PR on top of it with only the spatial changes from this one; that would make them mergeable.

Then you can leave this PR for just the stats changes - those require some more discussion I suspect.

@perimosocordiae
Member

@metaperture Will you have time to split this PR into two, and squash + rebase on master?
