Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: Use Welford's method in stats.moments.rolling_var #6817

Merged
merged 1 commit into from Apr 22, 2014

Conversation

jaimefrio
Copy link
Contributor

This PR implements a modified version of Welford's method to compute
the rolling variance. Instead of keeping track of the sum and sum of
the squares of the items in the window, it tracks the mean and the sum
of squared differences from the mean. This turns out to be (much) more
numerically stable.

The formulas to update these two variables when adding or removing an
item from the sequence are well known, see e.g.

http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance

The formulas used when both adding one and removing one item I have
not seen explicitly worked out anywhere, but are not too hard to come
up with if you put pen to (a lot of) paper.

@jaimefrio
Copy link
Contributor Author

Some proof of the numerical stability claims:

In [1]: import numpy as np
In [2]: import pandas as pd
In [3]: np.random.seed(0)
In [4]: a = np.random.rand(10)
In [5]: a[5] = np.nan

Before this PR:

In [6]: pd.stats.moments.rolling_var(a, 3)
Out[6]:
array([        nan,         nan,  0.00720519,  0.00749899,  0.00835439,
               nan,         nan,         nan,  0.08136806,  0.10003762])

In [7]: pd.stats.moments.rolling_var(a + 1e6, 3)
Out[7]:
array([        nan,         nan,  0.00748698,  0.0078125 ,  0.0094401 ,
               nan,         nan,         nan,  0.08235677,  0.10123698])

In [8]: pd.stats.moments.rolling_var(a + 1e9, 3)
Out[8]:
array([           nan,            nan,   341.33333333,   341.33333333,
         853.33333333,            nan,            nan,            nan,
         853.33333333,  1194.66666667])

After this PR:

In [6]: pd.stats.moments.rolling_var(a, 3)
Out[6]:
array([        nan,         nan,  0.00720519,  0.00749899,  0.00835439,
               nan,         nan,         nan,  0.08136806,  0.10003762])

In [7]: pd.stats.moments.rolling_var(a + 1e6, 3)
Out[7]:
array([        nan,         nan,  0.00720519,  0.00749899,  0.00835439,
               nan,         nan,         nan,  0.08136806,  0.10003762])

In [8]: pd.stats.moments.rolling_var(a + 1e9, 3)
Out[8]:
array([        nan,         nan,  0.00720519,  0.00749899,  0.0083544 ,
               nan,         nan,         nan,  0.08136812,  0.10003767])

In [9]: pd.stats.moments.rolling_var(a + 1e12, 3)
Out[9]:
array([        nan,         nan,  0.00720175,  0.00749382,  0.00834783,
               nan,         nan,         nan,  0.08135042,  0.10003433])

In [10]: pd.stats.moments.rolling_var(a + 1e15, 3)
Out[10]:
array([       nan,        nan,  0.015625 ,  0.015625 ,  0.015625 ,
              nan,        nan,        nan,  0.0234375,  0.078125 ])

In [11]: pd.stats.moments.rolling_var(a + 1e18, 3)
Out[11]: array([ nan,  nan,   0.,   0.,   0.,  nan,  nan,  nan,   0.,   0.])

@jaimefrio
Copy link
Contributor Author

And some timings:

In [1]: import numpy as np
In [2]: import pandas as pd
In [3]: np.random.seed(0)
In [4]: b = np.random.rand(1e6)

In [5]: %timeit pd.stats.moments.rolling_var(b, 3)
100 loops, best of 3: 19.3 ms per loop # before this PR
10 loops, best of 3: 24.6 ms per loop # after this PR

So it is about 25% slower, but I guess correctness has its price...

@jreback
Copy link
Contributor

jreback commented Apr 6, 2014

since this would replace roll_var just delete the original (and call this roll_var)

pls add the examples above as a test (which should fail with the current one ) but pass with the new one

you will prob need to compare with np.allclose to avoid testing for exactness

@jreback
Copy link
Contributor

jreback commented Apr 6, 2014

otoh does it make sense to provide this as an option (maybe method=...) so can change in a new version?

it seems to me that is a direct replacement
(though slightly slower)

@jreback jreback added this to the 0.14.0 milestone Apr 6, 2014
@jaimefrio
Copy link
Contributor Author

I don't think it should be possible for the user to revert back to the old method: although the PR is labelled ENH, it is in a way closer to a bug fix, you really shouldn't be computing the variance the way it was being done. So I think the 25% slowdown is a reasonable price to pay.

Will add the tests and remove the old code.

@jreback
Copy link
Contributor

jreback commented Apr 6, 2014

that sounds ok
thanks

@jreback
Copy link
Contributor

jreback commented Apr 6, 2014

ping when ready for merge

@jreback
Copy link
Contributor

jreback commented Apr 21, 2014

@jaimefrio hows this coming?

@jreback
Copy link
Contributor

jreback commented Apr 22, 2014

@jaimefrio ping when you have updated...like to get in 0.14

@jaimefrio
Copy link
Contributor Author

@jreback Pushed a new commit with a simple, hard coded test. I am not very familiar with your testing setup, so I kind of hacked the test in. Do let me kow if you'd rather have it done some other way.

On a separate note, the numerical stability of the rolling_skew and rolling_kurt implementations is appalling, will try to get those in a separate PR:

>>> x = np.random.rand(10)

>>> pd.stats.moments.rolling_skew(x, 4)
array([        nan,         nan,         nan,  1.41536395, -0.10049455,
       -1.00255189,  0.73649012,  0.96189153, -0.53150107,  0.02123665])
>>> pd.stats.moments.rolling_skew(x+5000, 4)
array([        nan,         nan,         nan,  1.10559882, -0.14945926,
       -1.18256746,  0.6376619 ,  0.9452194 , -0.53515534,  0.0225138 ])

>>> pd.stats.moments.rolling_kurt(x, 4)
array([        nan,         nan,         nan,  1.51764024,  0.41301506,
        0.65789532, -1.7739017 , -0.70561097, -2.29377668, -5.55911936])
>>> pd.stats.moments.rolling_kurt(x+100, 4)
array([        nan,         nan,         nan,  1.49945593,  0.4065266 ,
        0.63296502, -1.79442734, -0.7062332 , -2.29368718, -5.55921368])

@jreback
Copy link
Contributor

jreback commented Apr 22, 2014

that test looks fine
only think needed is a release more mention (refenenece this pr number) ; can put in API to draw attention to it

pls open a new issue for the skew/Kurt stability issues

@jaimefrio
Copy link
Contributor Author

Updated the release notes, and opened issue #6929

@jreback
Copy link
Contributor

jreback commented Apr 22, 2014

pls rebase this

@jreback jreback added the Algos label Apr 22, 2014
@jaimefrio
Copy link
Contributor Author

rebased

@jreback
Copy link
Contributor

jreback commented Apr 22, 2014

hmm....travis is not running this....odd.....

can you make sure that the travis hook is set?

This PR implements a modified version of Welford's method to compute
the rolling variance. Instead of keeping track of the sum and sum of
the squares of the items in the window, it tracks the mean and the sum
of squared differences from the mean. This turns out to be (much) more
numerically stable.

The formulas to update these two variables when adding or removing an
item from the sequence are well known, see e.g.

http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance

The formulas used when both adding one and removing one item I have
not seen explicitly worked out anywhere, but are not too hard to come
up with if you put pen to (a lot of) paper.
@jaimefrio
Copy link
Contributor Author

Travis is running it four times now, two in pydata/pandas and two in jaimefrio/pandas...

@jreback
Copy link
Contributor

jreback commented Apr 22, 2014

that's right it runs for each commit both on your private branchand pandas branch - the green check comes when it passes on master

so ping when that's green

@jaimefrio
Copy link
Contributor Author

All green and ready to go.

jreback added a commit that referenced this pull request Apr 22, 2014
ENH: Use Welford's method in stats.moments.rolling_var
@jreback jreback merged commit 07f4966 into pandas-dev:master Apr 22, 2014
@jreback
Copy link
Contributor

jreback commented Apr 22, 2014

thanks! great contribution!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff Enhancement Numeric Operations Arithmetic, Comparison, and Logical operations
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants