A changing of dtype in scipy.stats._multivariate.multivariate_normal_gen() makes covariance matrix no longer positive semidefinite #9942

Closed
ZhaofengWu opened this issue Mar 13, 2019 · 21 comments
Labels
enhancement A new feature or improvement scipy.stats

Comments

@ZhaofengWu

The multivariate_normal_frozen class calls the _process_parameters function of the multivariate_normal_gen class.

self.dim, self.mean, self.cov = self._dist._process_parameters(
    None, mean, cov)

For my covariance matrix (cov.ndim == 2), this method does nothing but change its dtype from float32 to float64. However, this change of dtype apparently makes a positive semidefinite covariance matrix no longer one. Using PDB to pause below this function call, I get the following output (cov is my original covariance matrix, and self.cov is the output of _process_parameters):

(screenshot of the PDB session showing cov and self.cov)

@chrisb83
Member

Thanks for reporting this issue. Could you add

  • Numpy / SciPy Version
  • what is cov, so we can copy/paste to reproduce the problem

Thanks

@ZhaofengWu
Author

Numpy 1.16.0, SciPy 1.1.0. I pickled and then zipped my cov, attached below. Not sure if it will still be able to reproduce this error after these steps though.

cov.pkl.zip

@chrisb83
Member

Hm, if I run `abc = pickle.load(open('cov.pkl', 'rb'))`, I get `ImportError: No module named 'numpy.core._multiarray_umath'`.

Any ideas? Otherwise, can you try one of the methods given here: https://docs.scipy.org/doc/numpy/reference/routines.io.html

@rgommers
Member

@chrisb83 are you using numpy <1.16? pickle files aren't really portable like that (and are a security risk; I wouldn't open random ones).

@rgommers
Member

@ZhaofengWu if you want to provide large arrays as examples and can't give the code to reproduce them, then please do so as .npy files (produced by np.save).

@ZhaofengWu
Author

Thanks. I will give a .npy in a few hours.

@ZhaofengWu
Author

Here are the two .npy files. Please unzip first. They should only differ by dtype, yet one breaks _PSD while the other doesn't.
cov.npy.zip
self.cov.npy.zip

@chrisb83
Member

Yes, I can reproduce this.

Reason: _PSD uses the following code to check whether the matrix is PSD:

s, u = scipy.linalg.eigh(M, lower=lower, check_finite=check_finite)
eps = _eigvalsh_to_eps(s, cond, rcond)
if np.min(s) < -eps:
    raise ValueError('the input matrix must be positive semidefinite')

If you look at _eigvalsh_to_eps, the value of eps depends on the dtype:

t = spectrum.dtype.char.lower()
factor = {'f': 1E3, 'd': 1E6}
cond = factor[t] * np.finfo(t).eps
eps = cond * np.max(abs(spectrum))

I don't know where this condition comes from.
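For reference, evaluating those constants directly shows how much tighter the float64 tolerance is (same factors as in `_eigvalsh_to_eps` above):

```python
import numpy as np

# the same dtype-dependent factors used in scipy's _eigvalsh_to_eps
factor = {'f': 1e3, 'd': 1e6}
for t in ('f', 'd'):
    cond = factor[t] * np.finfo(t).eps
    print(t, cond)
# 'f' (float32): cond is about 1.2e-04
# 'd' (float64): cond is about 2.2e-10
```

Since eps = cond * max|eigenvalue|, the cast to float64 makes the threshold roughly five to six orders of magnitude smaller, so a matrix that passes the check as float32 can fail it as float64.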

@chrisb83
Member

Also note that the smallest eigenvalue differs depending on dtype, so even if eps were the same, one could get a different outcome when using the smallest eigenvalue to evaluate a condition.

import numpy as np
from scipy import linalg

cov1 = np.load('cov.npy')  # float32
cov2 = np.load('self.cov.npy') # float64

np.array_equal(cov1, cov2) # True

lower=True
check_finite=True

s1, u1 = linalg.eigh(cov1, lower=lower, check_finite=check_finite)
s2, u2 = linalg.eigh(cov2, lower=lower, check_finite=check_finite)

print(np.min(s1)) # -7.1243524e-17
print(np.min(s2)) # -7.875839015554746e-19

@ZhaofengWu
Author

Thanks! I think I understand now where this issue comes from, but I don't have enough statistics background to know whether this is the expected behavior. From a user's point of view, though, it feels like changing the dtype of a matrix shouldn't affect its inherent properties... should it?

@chrisb83
Member

Unfortunately, I don't have further insights on the criterion used to determine whether the matrix is PSD.

In any case, it seems to be a borderline case, since there are eigenvalues slightly below 0 (which could of course be due to minor numerical errors).

Maybe some other people know more about this.

@ZhaofengWu
Author

ZhaofengWu commented Apr 1, 2019

Can the _process_parameters method be changed so that, if the dtype is float32, it is not modified?

cov = np.asarray(cov, dtype=float)
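A dtype-preserving variant of that line could look like this (a hypothetical sketch, not the actual SciPy code; `_as_float_cov` is a made-up helper name):

```python
import numpy as np

def _as_float_cov(cov):
    """Promote to float only when the input is not already a float array."""
    cov = np.asarray(cov)
    if cov.dtype.char not in 'fd':  # not float32/float64
        cov = cov.astype(np.float64)
    return cov

print(_as_float_cov(np.eye(2, dtype=np.float32)).dtype)  # float32
print(_as_float_cov(np.eye(2, dtype=int)).dtype)         # float64
```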

@mdhaber
Contributor

mdhaber commented Dec 12, 2020

@ZhaofengWu Potentially, but could you try adding a tiny bit of the identity matrix to your covariance matrix? Sometimes that is done to improve numerical stability without affecting much else. See gh-13231 for a little more information.
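The suggestion above, sketched (the 1e-6 scale and the random rank-deficient matrix are illustrative choices; the jitter must exceed the PSD tolerance but stay small enough not to distort the distribution):

```python
import numpy as np
from scipy import stats

# a rank-deficient covariance: PSD in exact arithmetic, but eigh() may
# return eigenvalues slightly below zero for it
rng = np.random.default_rng(0)
B = rng.standard_normal((5, 2))
cov = B @ B.T  # rank 2

# adding a small multiple of the identity shifts every eigenvalue up by
# the same amount, making the matrix strictly positive definite
jitter = 1e-6 * np.eye(cov.shape[0])
dist = stats.multivariate_normal(np.zeros(5), cov + jitter)
print(np.isfinite(dist.logpdf(np.zeros(5))))  # True
```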

@mdhaber
Contributor

mdhaber commented Dec 25, 2020

This seems related to gh-10205 in that both involve the positive definiteness check on a covariance matrix. @ilayn I just thought I should ping you based on your comments in gh-12184.

@ilayn
Member

ilayn commented Dec 25, 2020

Yes, I'd like to touch that part of the code, but I'm a bit reluctant since I can't predict the consequences. However, I'm waiting for #12184 and #10205 to be done first so I can read it properly.

@mdhaber mdhaber added the enhancement A new feature or improvement label Mar 16, 2022
@mdhaber
Contributor

mdhaber commented Jun 8, 2022

@ilayn we seem to be gridlocked on these three issues. Do you see a path forward here? I wasn't quite sure what "read it properly" meant, in this case. Do you mean that gh-12184 would do some helpful refactoring that would make the code easier to predict?

@mdhaber
Contributor

mdhaber commented Oct 18, 2022

@tirthasheshpatel @tupui Do you think the Covariance class resolves this issue, now that the user has complete control over whether their matrix is considered (numerically) positive semidefinite?

For instance, this user could perform an eigendecomposition on their matrix and store the decomposition instead of the original matrix. Reducing precision would no longer change whether the matrix is semidefinite or not.
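For example (a sketch; the eigenvalue clipping is my own choice here to make the near-zero eigenvalues exactly zero, not something Covariance does internally, and the random rank-deficient matrix stands in for the OP's data):

```python
import numpy as np
from scipy import stats

# a rank-deficient PSD matrix whose computed eigenvalues may dip below zero
rng = np.random.default_rng(1)
B = rng.standard_normal((6, 3))
cov = B @ B.T  # rank 3

w, v = np.linalg.eigh(cov)
w = np.clip(w, 0, None)  # treat tiny negative eigenvalues as exactly zero

# the stored object is defined by the decomposition itself, so a later
# precision change cannot flip the PSD check
cov_obj = stats.Covariance.from_eigendecomposition((w, v))
print(np.allclose(cov_obj.covariance, cov, atol=1e-8))  # True
```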

@tirthasheshpatel
Member

I think Covariance addresses the problem. However, when I try to sample from the multivariate normal using the covariance object the OP provided, it gives me the same error:

In [55]: import numpy as np

In [56]: from scipy import stats

In [57]: with np.load('cov.npy.zip') as data:
    ...:     cov = data['cov']
    ...:

In [58]: w, v = np.linalg.eigh(cov)

In [59]: cov_obj = stats.Covariance.from_eigendecomposition((w, v))

In [60]: dist = stats.multivariate_normal(np.zeros(cov.shape[0], dtype=cov.dtype), cov_obj)

In [61]: dist.logpdf(np.zeros(cov.shape[0]))
Out[61]: 46479.578745868996

In [62]: dist.rvs()
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In [62], line 1
----> 1 dist.rvs()

File ~/oss/scipy/build-install/lib/python3.10/site-packages/scipy/stats/_multivariate.py:859, in multivariate_normal_frozen.rvs(self, size, random_state)
    858 def rvs(self, size=1, random_state=None):
--> 859     return self._dist.rvs(self.mean, self.cov_object.covariance, size,
    860                           random_state)

File ~/oss/scipy/build-install/lib/python3.10/site-packages/scipy/stats/_multivariate.py:747, in multivariate_normal_gen.rvs(self, mean, cov, size, random_state)
    726 def rvs(self, mean=None, cov=1, size=1, random_state=None):
    727     """Draw random samples from a multivariate normal distribution.
    728
    729     Parameters
   (...)
    745
    746     """
--> 747     dim, mean, cov_object = self._process_parameters(mean, cov)
    748     cov = cov_object.covariance
    750     random_state = self._get_random_state(random_state)

File ~/oss/scipy/build-install/lib/python3.10/site-packages/scipy/stats/_multivariate.py:416, in multivariate_normal_gen._process_parameters(self, mean, cov, allow_singular)
    409 dim, mean, cov = self._process_parameters_psd(None, mean, cov)
    410 # After input validation, some methods then processed the arrays
    411 # with a `_PSD` object and used that to perform computation.
    412 # To avoid branching statements in each method depending on whether
    413 # `cov` is an array or `Covariance` object, we always process the
    414 # array with `_PSD`, and then use wrapper that satisfies the
    415 # `Covariance` interface, `CovViaPSD`.
--> 416 psd = _PSD(cov, allow_singular=allow_singular)
    417 cov_object = _covariance.CovViaPSD(psd)
    418 return dim, mean, cov_object

File ~/oss/scipy/build-install/lib/python3.10/site-packages/scipy/stats/_multivariate.py:166, in _PSD.__init__(self, M, cond, rcond, lower, check_finite, allow_singular)
    164 if np.min(s) < -eps:
    165     msg = "The input matrix must be symmetric positive semidefinite."
--> 166     raise ValueError(msg)
    167 d = s[s > eps]
    168 if len(d) < len(s) and not allow_singular:

ValueError: The input matrix must be symmetric positive semidefinite.

In [63]:

Any idea @mdhaber why this could be happening or if it is related to the Covariance changes? Shouldn't the rvs method also use the covariance object instead of computing the Eigen-decomposition again?

@mdhaber
Contributor

mdhaber commented Oct 18, 2022

Yup, there's a small bug in multivariate_normal_frozen.rvs:

def rvs(self, size=1, random_state=None):
    return self._dist.rvs(self.mean, self.cov_object.covariance, size,
                          random_state)

It should pass in self.cov_object, not self.cov_object.covariance.
It probably happened because I was thinking of calling an _rvs (which wouldn't process the parameters), but that doesn't exist.
Fixed by gh-17258.

@tirthasheshpatel
Member

Closing this. Thanks for your work on the Covariance class @mdhaber!

@ZhaofengWu FYI: With the new Covariance class, it is possible to compute the eigen-decomposition and feed that into the multivariate_normal distribution instead to avoid the error:

import numpy as np
from scipy import stats

with np.load('cov.npy.zip') as data:
    cov = data['cov']

# compute the eigen-decomposition
w, v = np.linalg.eigh(cov)

# create a covariance object using the decomposition
cov_obj = stats.Covariance.from_eigendecomposition((w, v))

# create the distribution using the covariance object instead.
dist = stats.multivariate_normal(np.zeros(cov.shape[0], dtype=cov.dtype), cov_obj)

# now, the methods `pdf`, etc will utilize the decomposition to make the
# computation more efficient and also avoid recomputing the eigen-decomposition
# every time `pdf` or any other method of unfrozen distributions is called.
dist.logpdf(np.zeros(cov.shape[0]))  # 46479.578745868996

I hope this addresses your issue!

@mdhaber
Contributor

mdhaber commented Oct 19, 2022

Thanks @tirthasheshpatel!

@ZhaofengWu Note also that I'd recommend saving the eigenvalue decomposition of your data if you want full control over whether your covariance is treated as singular or not when you load it up again.
