Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: Inconsistencies in calculating second moments of a single value #7900

Closed
seth-p opened this issue Aug 1, 2014 · 15 comments · Fixed by #7926
Closed

BUG: Inconsistencies in calculating second moments of a single value #7900

seth-p opened this issue Aug 1, 2014 · 15 comments · Fixed by #7926
Labels
Bug Numeric Operations Arithmetic, Comparison, and Logical operations
Milestone

Comments

@seth-p
Copy link
Contributor

seth-p commented Aug 1, 2014

I noticed (in #7884) that ewmvar, ewmstd, ewmvol, ewmcov, rolling_var, and rolling_std return 0.0 for a single value (assuming min_periods=0); whereas Series.std, Series.var, ewmcorr, expanding_cov, expanding_corr, rolling_cov, and rolling_corr all return NaN for a single value. expanding_std and expanding_var produce Value Error: min_periods (2) must be <= window (1).

I think all of these should all return NaN for a single value. At any rate, I would expect greater consistency one way or the other.

Mildly related, when calculating the correlation of a constant series with itself, Series.corr(), expanding_corr, and rolling_corr return NaN, while ewmcorr sometimes returns NaN, sometimes 1 and sometimes -1, due to numerical accuracy issues. Actually, as shown in a separate comment below, rolling_corr also produces inconsistent results for a constant subseries following different prior values.

Inconsistencies in calculating second moments of a single point:

Python 3.4.1 (v3.4.1:c0e311e010fc, May 18 2014, 10:45:13) [MSC v.1600 64 bit (AMD64)]
Type "copyright", "credits" or "license" for more information.

IPython 2.1.0 -- An enhanced Interactive Python.
?         -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help      -> Python's own help system.
object?   -> Details about 'object', use 'object??' for extra details.

In [1]: from pandas import Series, ewmvar, ewmstd, ewmvol, ewmcov, rolling_var, rolling_std, ewmcorr, expanding_cov, expanding_corr,
 expanding_std, expanding_var, rolling_cov, rolling_corr
In [2]: s = Series([1.])

In [3]: ewmvar(s, halflife=2., min_periods=0)
Out[3]:
0    0
dtype: float64

In [4]: ewmstd(s, halflife=2., min_periods=0)
Out[4]:
0    0
dtype: float64

In [5]: ewmvol(s, halflife=2., min_periods=0)
Out[5]:
0    0
dtype: float64

In [6]: ewmcov(s, s, halflife=2., min_periods=0)
Out[6]:
0    0
dtype: float64

In [7]: rolling_var(s, window=2, min_periods=0)
Out[7]:
0    0
dtype: float64

In [8]: rolling_std(s, window=2, min_periods=0)
Out[8]:
0    0
dtype: float64

In [9]: s.std()
Out[9]: nan

In [10]: s.var()
Out[10]: nan

In [11]: ewmcorr(s, s, halflife=2., min_periods=0)
Out[11]:
0   NaN
dtype: float64

In [12]: expanding_cov(s, s, min_periods=0)
Out[12]:
0   NaN
dtype: float64

In [13]: expanding_corr(s, s, min_periods=0)
Out[13]:
0   NaN
dtype: float64

In [16]: rolling_cov(s, s, window=3, min_periods=0)
Out[16]:
0   NaN
dtype: float64

In [17]: rolling_corr(s, s, window=3, min_periods=0)
Out[17]:
0   NaN
dtype: float64


In [14]: expanding_std(s, min_periods=0)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-14-7320ee579e6a> in <module>()
----> 1 expanding_std(s, min_periods=0)

C:\Python34\lib\site-packages\pandas\stats\moments.py in f(arg, min_periods, freq, center, **kwargs)
    825             return func(arg, window, minp, **kwds)
    826         return _rolling_moment(arg, window, call_cython, min_periods, freq=freq,
--> 827                                center=center, **kwargs)
    828
    829     return f

C:\Python34\lib\site-packages\pandas\stats\moments.py in _rolling_moment(arg, window, func, minp, axis, freq, center, how, args, kwa
rgs, **kwds)
    345         result = np.apply_along_axis(calc, axis, values)
    346     else:
--> 347         result = calc(values)
    348
    349     rs = return_hook(result)

C:\Python34\lib\site-packages\pandas\stats\moments.py in <lambda>(x)
    339     arg = _conv_timerule(arg, freq, how)
    340     calc = lambda x: func(x, window, minp=minp, args=args, kwargs=kwargs,
--> 341                           **kwds)
    342     return_hook, values = _process_data_structure(arg)
    343     # actually calculate the moment. Faster way to do this?

C:\Python34\lib\site-packages\pandas\stats\moments.py in call_cython(arg, window, minp, args, kwargs, **kwds)
    823         def call_cython(arg, window, minp, args=(), kwargs={}, **kwds):
    824             minp = check_minp(minp, window)
--> 825             return func(arg, window, minp, **kwds)
    826         return _rolling_moment(arg, window, call_cython, min_periods, freq=freq,
    827                                center=center, **kwargs)

C:\Python34\lib\site-packages\pandas\stats\moments.py in <lambda>(*a, **kw)
    604                                how='median')
    605
--> 606 _ts_std = lambda *a, **kw: _zsqrt(algos.roll_var(*a, **kw))
    607 rolling_std = _rolling_func(_ts_std, 'Unbiased moving standard deviation.',
    608                             check_minp=_require_min_periods(1))

C:\Python34\lib\site-packages\pandas\algos.pyd in pandas.algos.roll_var (pandas\algos.c:28990)()

C:\Python34\lib\site-packages\pandas\algos.pyd in pandas.algos._check_minp (pandas\algos.c:16394)()

ValueError: min_periods (2) must be <= window (1)

In [15]: expanding_var(s, min_periods=0)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-15-28e2018aee68> in <module>()
----> 1 expanding_var(s, min_periods=0)

C:\Python34\lib\site-packages\pandas\stats\moments.py in f(arg, min_periods, freq, center, **kwargs)
    825             return func(arg, window, minp, **kwds)
    826         return _rolling_moment(arg, window, call_cython, min_periods, freq=freq,
--> 827                                center=center, **kwargs)
    828
    829     return f

C:\Python34\lib\site-packages\pandas\stats\moments.py in _rolling_moment(arg, window, func, minp, axis, freq, center, how, args, kwa
rgs, **kwds)
    345         result = np.apply_along_axis(calc, axis, values)
    346     else:
--> 347         result = calc(values)
    348
    349     rs = return_hook(result)

C:\Python34\lib\site-packages\pandas\stats\moments.py in <lambda>(x)
    339     arg = _conv_timerule(arg, freq, how)
    340     calc = lambda x: func(x, window, minp=minp, args=args, kwargs=kwargs,
--> 341                           **kwds)
    342     return_hook, values = _process_data_structure(arg)
    343     # actually calculate the moment. Faster way to do this?

C:\Python34\lib\site-packages\pandas\stats\moments.py in call_cython(arg, window, minp, args, kwargs, **kwds)
    823         def call_cython(arg, window, minp, args=(), kwargs={}, **kwds):
    824             minp = check_minp(minp, window)
--> 825             return func(arg, window, minp, **kwds)
    826         return _rolling_moment(arg, window, call_cython, min_periods, freq=freq,
    827                                center=center, **kwargs)

C:\Python34\lib\site-packages\pandas\algos.pyd in pandas.algos.roll_var (pandas\algos.c:28990)()

C:\Python34\lib\site-packages\pandas\algos.pyd in pandas.algos._check_minp (pandas\algos.c:16394)()

ValueError: min_periods (2) must be <= window (1)


In [20]: show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.4.1.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 58 Stepping 9, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None

pandas: 0.14.1
nose: 1.3.3
Cython: 0.20.2
numpy: 1.9.0b1
scipy: 0.14.0
statsmodels: 0.5.0
IPython: 2.1.0
sphinx: 1.2.2
patsy: 0.3.0
scikits.timeseries: None
dateutil: 2.2
pytz: 2014.4
bottleneck: 0.8.0
tables: 3.1.1
numexpr: 2.4
matplotlib: 1.3.1
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
rpy2: None
sqlalchemy: 0.9.7
pymysql: None
psycopg2: None

Instability in ewmcorr of a constant series with itself:

In [1]: from pandas import Series, ewmcorr, expanding_corr, rolling_corr

In [2]: s = Series([5., 5., 5.])

In [3]: s.corr(s)
Out[3]: nan

In [4]: expanding_corr(s, s)
Out[4]:
0   NaN
1   NaN
2   NaN
dtype: float64

In [5]: rolling_corr(s, s, window=3)
Out[5]:
0   NaN
1   NaN
2   NaN
dtype: float64

In [6]: ewmcorr(s, s, halflife=3.)
Out[6]:
0   -1
1   -1
2   -1
dtype: float64

In [9]: ewmcorr(Series([3., 3., 3.]), halflife=3.)
Out[9]:
0   NaN
1     1
2     1
dtype: float64
@seth-p
Copy link
Contributor Author

seth-p commented Aug 3, 2014

A couple questions:

  1. Do you agree that the second moment function should all return NaN for a single value?
  2. Is there an existing function in Pandas that rounds tiny numbers near 0 to 0? The ewmcorr instability problem is due to the fact that ewmvar of a constant series isn't always identically zero, but rather is sometimes a tiny positive or negative number.

@jreback
Copy link
Contributor

jreback commented Aug 3, 2014

see the rollimg_var stuff which 'fixes' the numerical issues part by using the Welford algorithms

  1. yes
  2. can u post an example

@seth-p
Copy link
Contributor Author

seth-p commented Aug 3, 2014

Re: 1). Should the answer depend on whether bias is True or False (for the ewm* functions), or whether ddof is 0 or 1 (for others)? Perhaps one would want the result to be either NaN or 0 depending on the value of bias/ddof. I'm really not sure, so want to make sure others are before delving into the code...

Re: 2), I added an example to the end of the initial comment. (Sorry, I guess I should have added it as a separate comment.)

@seth-p
Copy link
Contributor Author

seth-p commented Aug 3, 2014

Re: rounding near 0, algos.roll_var() deals with tiny negative errors very simply:

                if val < 0:
                    val = 0

It doesn't deal with tiny positive errors.
(The "smartness" of the algorithm has to do with the rolling aspect of the problem. I don't think it has any particular smartness for dealing with constant series (other than the test for negative errors mentioned above).)

@seth-p
Copy link
Contributor Author

seth-p commented Aug 3, 2014

Here's an example showing that rolling_var also has numerical problems.

In [30]: rolling_var(Series([1., 5., 5., 5.]), window=3)
Out[30]:
0             NaN
1             NaN
2    5.333333e+00
3    8.881784e-16
dtype: float64

The final value should be identically 0.0.

This instability plays out in inconsistent results for the correlation of a constant series with itself:

In [33]: rolling_corr(Series([0., 5., 5., 5.]), window=3)
Out[33]:
0   NaN
1   NaN
2     1
3   NaN
dtype: float64

In [34]: rolling_corr(Series([1., 5., 5., 5.]), window=3)
Out[34]:
0   NaN
1   NaN
2     1
3     0
dtype: float64

In both cases the final value should be the correlation of [5., 5., 5.] with itself, yet one produces NaN and the other 0.

@seth-p
Copy link
Contributor Author

seth-p commented Aug 3, 2014

CC'ing @snth, @jaimefrio, and @kdiether, who appear to have worked on this/related code.

@jaimefrio
Copy link
Contributor

The check for val < 0 in the rolling_var function is there because it was in the previous algorithm, and I preferred to play it safe, because a negative value would be a catastrophic failure if computing rolling_std, although it should be mostly unnecessary with the new algorithm.

I don't think that rounding down to 0 very small values of rolling_var is a good idea. I have quickly hacked the implementation to keep track of consecutive identical values, see #7916. Have only tested it with your small example above, and have made no assessment of the performance impact. But I feel that if the changes are to be made in rolling_var, something like this should be the way to go. Rounding down to zero within rolling_corr wouldn't feel that bad, but maybe that is just me.

The other thing I can commit to doing is implementing a variant of Welford's method for rolling_cov, see e.g. stable one-pass algorithm here so that when the rolling covariance of a series with itself is computed, the operations lead to the exact same value as computing the rolling variance of the series. This will not solve the problem of 0 / 0 being NaN, but at least the correlation of a series with itself will either be 1 or NaN, but not 0.

@seth-p
Copy link
Contributor Author

seth-p commented Aug 4, 2014

Sounds good. I think that if you add an explicit check for constant series, then there's no need to round small results to 0.

Yeah, I have no strong opinion about whether the correlation of a constant series with itself should be 1 or NaN -- at least it shouldn't be 0 or -1.

@seth-p
Copy link
Contributor Author

seth-p commented Aug 4, 2014

Regarding the general inconsistencies between NaN and 0 for the second moments of a single value, on further reflection I think it makes sense that the result be NaN for unbiased statistics (bias=False or ddof>=1) but 0 for biased statistics (bias=True or ddof=0).

I will go through the list of functions above and see how the behavior accords with this principle.

jaimefrio added a commit to jaimefrio/pandas that referenced this issue Aug 18, 2014
Added logic to `rolling_var` to detect windows where all non-NaN
values are identical.

Need to assess both correctness and performance impact.
@seth-p
Copy link
Contributor Author

seth-p commented Aug 20, 2014

After further review (and once #7912 is addressed), I think that all biased (bias=True or ddof=0) second moments return 0 for single value, while unbiased (bias=False or ddof=1) return NaN, except for the following, which need to be fixed:

  • rolling_var, and rolling_std return 0.0; and
  • expanding_std and expanding_var produce Value Error: min_periods (2) must be <= window (1).

@seth-p
Copy link
Contributor Author

seth-p commented Aug 20, 2014

@jaimefrio, I'm going to include in #7926 a simple fix for the rolling/expanding_var/std problems mentioned in the immediately prior comment. This is so that I can do consistency checks across the various ewm/expanding/rolling functions. (I hope to have it checked in by tomorrow.) Feel free to do something fancier w/ algos.roll_var()...

@jaimefrio
Copy link
Contributor

The rolling_var/std fix is simply to modify that if statement that has a # pathological case comment? We probably just need to to replace it by an if nobs <= ddof: val = NaN, as the 0 value for single observation biased variance will come up naturally, especially with the recent changes I have in the pipeline.

How will users actually get to specify this parameter? Right now it is hardcoded to unbiased estimation, ddof = 1.

I have the fix for rolling_var almost done, but I still need to write some test for it. I also have a branch where I am implementing a stable algorithm for skewness and kurtosis, including a similar check for "all non-NaNs in window are identical."

I have lost track of all the changes you are trying to pull together, but just so you know, I silently approve of your efforts from the distance... ;)

@seth-p
Copy link
Contributor Author

seth-p commented Aug 21, 2014

Yeah, sorry, I've been encountering lots of little inconsistencies and edge cases that I've been trying to clean up. Some of the stuff I did is already in master (#7603, #7738, #7766, #7896, #7898 (which is obsolete in view of #7977), and #7934); #8059 is ready to be merged; and #7926 should be ready soon -- it has a lot of new consistency checks for the ewm*, expanding_* and rolling_* functions, which have helped me identify many of these issues (e.g. #8084 and #8086, which I don't have any immediate plans to work on). Things should be a bit cleaner / more stable once I finish #7926.

As for ddof, some of the rolling_ functions support it, e.g. rolling_var, though it's not documented. I think it would be best to make it explicit and documented in all the relevant rolling_ and expanding_ functions. I actually make use of it in my new tests in #7926. I'll let you know when that's ready...

@seth-p
Copy link
Contributor Author

seth-p commented Aug 21, 2014

OK, I think I'm pretty much done w/ #7926, though I still need to rebase it after #8059 is merged into master.

Note the test_rolling_consistency() and test_expanding_consistency() functions. They each do two things for various arguments (window, min_periods, center):

  • call self._test_moments_consistency(), to test mathematical consistency between the various moments; and
  • compare various rolling/expanding_xyz() functions with rolling/expanding_apply() applied to Series/DataFrame.xyz().

You'll see that several tests are commented out with comments of the form # restore once .... These indicate tests that are currently failing (at least in some cases), but should pass once ... is done/fixed.

@jaimefrio, note in particular that I commented out the test for the correlation of a series with itself being identically 1. because rolling_cov(x,x) is not identically equal to rolling_var(x), resulting in correlations that are not identically 1.. As you observed, replacing algos.roll_var() with an algos.roll_cov(), and implementing rolling_var() in terms of it (the way ewmcov() and ewmvar() are now defined) should fix the problem.

@jaimefrio, note also that the tests call _test_moments_consistency() with both unbiased (the default) and biased (with ddof=0) versions of expanding/rolling_std/var(), though unfortunately currently there's no way to call a biased version of expanding/rolling_cov(). (One could use the identity cov(x,y)=0.5*(var(x+y)-var(x)-var(y)) with biased versions of var(), but that doesn't help in testing consistency...) Ideally the putative algos.roll_cov() would take a ddof parameter, so that one could calculate unbiased and biased versions of all the second moments.

These tests are rather slow, and can probably be trimmed a bit (i.e. fewer distinct window and min_periods values), but I've found them extremely useful in identifying bugs and inconsistencies.

@seth-p
Copy link
Contributor Author

seth-p commented Aug 28, 2014

OK, #8059 has been merged into master, and I've updated #7926, which I think is now ready to be merged. @jaimefrio, you may find #7926's test_rolling_consistency() function helpful in testing.

jaimefrio added a commit to jaimefrio/pandas that referenced this issue Sep 16, 2014
Add a check to rolling_var for repeated observations, in order to
produce an exactly zero value of the variance when all entries are
identical. Related to the discussion in pandas-dev#7900
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Numeric Operations Arithmetic, Comparison, and Logical operations
Projects
None yet
3 participants