A changing of dtype in scipy.stats._multivariate.multivariate_normal_gen() makes covariance matrix no longer positive semidefinite #9942

Closed
ZhaofengWu opened this issue Mar 13, 2019 · 21 comments
Labels
enhancement A new feature or improvement scipy.stats

Comments

@ZhaofengWu

The multivariate_normal_frozen class calls the _process_parameters function of the multivariate_normal_gen class.

self.dim, self.mean, self.cov = self._dist._process_parameters(
    None, mean, cov)

For my covariance matrix (cov.ndim == 2), this method does nothing but change its dtype from float32 to float64. However, this change of dtype apparently makes a positive semidefinite covariance matrix no longer one. Using PDB to pause below this function call, I get the following output (cov is my original covariance matrix, and self.cov is the output of _process_parameters):

(screenshot of the PDB session showing cov and self.cov)

@chrisb83
Member

Thanks for reporting this issue. Could you add

  • Numpy / SciPy Version
  • what is cov, so we can copy/paste to reproduce the problem

Thanks

@ZhaofengWu
Author

Numpy 1.16.0, SciPy 1.1.0. I pickled and then zipped my cov, attached below. Not sure if it will still be able to reproduce this error after these steps though.

cov.pkl.zip

@chrisb83
Member

Hm, if I run `abc = pickle.load(open('cov.pkl', 'rb'))`, I get `ImportError: No module named 'numpy.core._multiarray_umath'`.

Any ideas? Otherwise, can you try one of the methods given here: https://docs.scipy.org/doc/numpy/reference/routines.io.html

@rgommers
Member

@chrisb83 are you using numpy <1.16? pickle files aren't really portable like that (and are a security risk; I wouldn't open random ones).

@rgommers
Member

@ZhaofengWu if you want to provide large arrays as examples and can't give the code to reproduce them, then please do so as .npy files (produced by np.save).

@ZhaofengWu
Author

Thanks. I will give a .npy in a few hours.

@ZhaofengWu
Author

Here are the two .npy files. Please unzip first. They should only differ by dtype, yet one breaks _PSD while the other doesn't.
cov.npy.zip
self.cov.npy.zip

@chrisb83
Member

Yes, I can reproduce this.

Reason: _PSD uses the following code to check whether the matrix is PSD:

s, u = scipy.linalg.eigh(M, lower=lower, check_finite=check_finite)
eps = _eigvalsh_to_eps(s, cond, rcond)
if np.min(s) < -eps:
    raise ValueError('the input matrix must be positive semidefinite')

If you look at _eigvalsh_to_eps, the value of eps depends on the dtype:

t = spectrum.dtype.char.lower()
factor = {'f': 1E3, 'd': 1E6}
cond = factor[t] * np.finfo(t).eps
eps = cond * np.max(abs(spectrum))

I don't know where this condition comes from.
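For reference, evaluating those constants directly shows how much tighter the float64 tolerance is (same factors as in `_eigvalsh_to_eps` above):

```python
import numpy as np

# the same dtype-dependent factors used in scipy's _eigvalsh_to_eps
factor = {'f': 1e3, 'd': 1e6}
for t in ('f', 'd'):
    cond = factor[t] * np.finfo(t).eps
    print(t, cond)
# 'f' (float32): cond is about 1.2e-04
# 'd' (float64): cond is about 2.2e-10
```

Since eps = cond * max|eigenvalue|, the cast to float64 makes the threshold roughly five to six orders of magnitude smaller, so a matrix that passes the check as float32 can fail it as float64.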

@chrisb83
Member

Also note that the smallest eigenvalue differs depending on dtype, so even if eps were the same, one could get a different outcome when using the smallest eigenvalue to evaluate a condition.

import numpy as np
from scipy import linalg

cov1 = np.load('cov.npy')  # float32
cov2 = np.load('self.cov.npy') # float64

np.array_equal(cov1, cov2) # True

lower=True
check_finite=True

s1, u1 = linalg.eigh(cov1, lower=lower, check_finite=check_finite)
s2, u2 = linalg.eigh(cov2, lower=lower, check_finite=check_finite)

print(np.min(s1)) # -7.1243524e-17
print(np.min(s2)) # -7.875839015554746e-19

@ZhaofengWu
Author

Thanks! I think I understand now where this issue comes from, but I don't have enough statistics background to know whether this is the expected behavior. From a user's point of view, though, it feels like changing the dtype of a matrix shouldn't affect its inherent properties... should it?

@chrisb83
Member

Unfortunately, I don't have further insights on the criterion used to determine whether the matrix is PSD.

In any case, it seems to be a borderline case, since there are eigenvalues slightly below 0 (which could of course be due to minor numerical errors).

Maybe some other people know more about this.

@ZhaofengWu
Author

ZhaofengWu commented Apr 1, 2019

Can the _process_parameters method be changed so that, if the dtype is float32, it is not modified?

cov = np.asarray(cov, dtype=float)
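A dtype-preserving variant of that line could look like this (a hypothetical sketch, not the actual SciPy code; `_as_float_cov` is a made-up helper name):

```python
import numpy as np

def _as_float_cov(cov):
    """Promote to float only when the input is not already a float array."""
    cov = np.asarray(cov)
    if cov.dtype.char not in 'fd':  # not float32/float64
        cov = cov.astype(np.float64)
    return cov

print(_as_float_cov(np.eye(2, dtype=np.float32)).dtype)  # float32
print(_as_float_cov(np.eye(2, dtype=int)).dtype)         # float64
```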

@mdhaber
Contributor

mdhaber commented Dec 12, 2020

@ZhaofengWu Potentially, but could you try adding a tiny bit of the identity matrix to your covariance matrix? Sometimes that is done to improve numerical stability without affecting much else. See gh-13231 for a little more information.
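The suggestion above, sketched (the 1e-6 scale and the random rank-deficient matrix are illustrative choices; the jitter must exceed the PSD tolerance but stay small enough not to distort the distribution):

```python
import numpy as np
from scipy import stats

# a rank-deficient covariance: PSD in exact arithmetic, but eigh() may
# return eigenvalues slightly below zero for it
rng = np.random.default_rng(0)
B = rng.standard_normal((5, 2))
cov = B @ B.T  # rank 2

# adding a small multiple of the identity shifts every eigenvalue up by
# the same amount, making the matrix strictly positive definite
jitter = 1e-6 * np.eye(cov.shape[0])
dist = stats.multivariate_normal(np.zeros(5), cov + jitter)
print(np.isfinite(dist.logpdf(np.zeros(5))))  # True
```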

@mdhaber
Contributor

mdhaber commented Dec 25, 2020

This seems related to gh-10205 in that both involve the positive definiteness check on a covariance matrix. @ilayn I just thought I should ping you based on your comments in gh-12184.

@ilayn
Member

ilayn commented Dec 25, 2020

Yes, I'd like to touch that part of the code, but I'm a bit reluctant since I can't predict the consequences. However, I'm waiting for #12184 and #10205 to be done first so I can read it properly.

@mdhaber mdhaber added the enhancement A new feature or improvement label Mar 16, 2022
@mdhaber
Contributor

mdhaber commented Jun 8, 2022

@ilayn we seem to be gridlocked on these three issues. Do you see a path forward here? I wasn't quite sure what "read it properly" meant, in this case. Do you mean that gh-12184 would do some helpful refactoring that would make the code easier to predict?

@mdhaber
Contributor

mdhaber commented Oct 18, 2022

@tirthasheshpatel @tupui Do you think the Covariance class resolves this issue, now that the user has complete control over whether their matrix is considered (numerically) positive semidefinite?

For instance, this user could perform an eigendecomposition on their matrix and store the decomposition instead of the original matrix. Reducing precision would no longer change whether the matrix is semidefinite or not.
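For example (a sketch; the eigenvalue clipping is my own choice here to make the near-zero eigenvalues exactly zero, not something Covariance does internally, and the random rank-deficient matrix stands in for the OP's data):

```python
import numpy as np
from scipy import stats

# a rank-deficient PSD matrix whose computed eigenvalues may dip below zero
rng = np.random.default_rng(1)
B = rng.standard_normal((6, 3))
cov = B @ B.T  # rank 3

w, v = np.linalg.eigh(cov)
w = np.clip(w, 0, None)  # treat tiny negative eigenvalues as exactly zero

# the stored object is defined by the decomposition itself, so a later
# precision change cannot flip the PSD check
cov_obj = stats.Covariance.from_eigendecomposition((w, v))
print(np.allclose(cov_obj.covariance, cov, atol=1e-8))  # True
```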

@tirthasheshpatel
Member

I think Covariance addresses the problem. However, when I try to sample from the multivariate normal using the covariance object the OP provided, it gives me the same error:

In [55]: import numpy as np

In [56]: from scipy import stats

In [57]: with np.load('cov.npy.zip') as data:
    ...:     cov = data['cov']
    ...:

In [58]: w, v = np.linalg.eigh(cov)

In [59]: cov_obj = stats.Covariance.from_eigendecomposition((w, v))

In [60]: dist = stats.multivariate_normal(np.zeros(cov.shape[0], dtype=cov.dtype), cov_obj)

In [61]: dist.logpdf(np.zeros(cov.shape[0]))
Out[61]: 46479.578745868996

In [62]: dist.rvs()
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In [62], line 1
----> 1 dist.rvs()

File ~/oss/scipy/build-install/lib/python3.10/site-packages/scipy/stats/_multivariate.py:859, in multivariate_normal_frozen.rvs(self, size, random_state)
    858 def rvs(self, size=1, random_state=None):
--> 859     return self._dist.rvs(self.mean, self.cov_object.covariance, size,
    860                           random_state)

File ~/oss/scipy/build-install/lib/python3.10/site-packages/scipy/stats/_multivariate.py:747, in multivariate_normal_gen.rvs(self, mean, cov, size, random_state)
    726 def rvs(self, mean=None, cov=1, size=1, random_state=None):
    727     """Draw random samples from a multivariate normal distribution.
    728
    729     Parameters
   (...)
    745
    746     """
--> 747     dim, mean, cov_object = self._process_parameters(mean, cov)
    748     cov = cov_object.covariance
    750     random_state = self._get_random_state(random_state)

File ~/oss/scipy/build-install/lib/python3.10/site-packages/scipy/stats/_multivariate.py:416, in multivariate_normal_gen._process_parameters(self, mean, cov, allow_singular)
    409 dim, mean, cov = self._process_parameters_psd(None, mean, cov)
    410 # After input validation, some methods then processed the arrays
    411 # with a `_PSD` object and used that to perform computation.
    412 # To avoid branching statements in each method depending on whether
    413 # `cov` is an array or `Covariance` object, we always process the
    414 # array with `_PSD`, and then use wrapper that satisfies the
    415 # `Covariance` interface, `CovViaPSD`.
--> 416 psd = _PSD(cov, allow_singular=allow_singular)
    417 cov_object = _covariance.CovViaPSD(psd)
    418 return dim, mean, cov_object

File ~/oss/scipy/build-install/lib/python3.10/site-packages/scipy/stats/_multivariate.py:166, in _PSD.__init__(self, M, cond, rcond, lower, check_finite, allow_singular)
    164 if np.min(s) < -eps:
    165     msg = "The input matrix must be symmetric positive semidefinite."
--> 166     raise ValueError(msg)
    167 d = s[s > eps]
    168 if len(d) < len(s) and not allow_singular:

ValueError: The input matrix must be symmetric positive semidefinite.

In [63]:

Any idea @mdhaber why this could be happening or if it is related to the Covariance changes? Shouldn't the rvs method also use the covariance object instead of computing the Eigen-decomposition again?

@mdhaber
Contributor

mdhaber commented Oct 18, 2022

Yup, there's a small bug in multivariate_normal_frozen.rvs:

def rvs(self, size=1, random_state=None):
    return self._dist.rvs(self.mean, self.cov_object.covariance, size,
                          random_state)

It should pass in self.cov_object, not self.cov_object.covariance.
It probably happened because I was thinking of calling an _rvs (which wouldn't process the parameters), but that doesn't exist.
Fixed by gh-17258.

@tirthasheshpatel
Member

Closing this. Thanks for your work on the Covariance class @mdhaber!

@ZhaofengWu FYI: With the new Covariance class, it is possible to compute the eigen-decomposition and feed that into the multivariate_normal distribution instead to avoid the error:

import numpy as np
from scipy import stats

with np.load('cov.npy.zip') as data:
    cov = data['cov']

# compute the eigen-decomposition
w, v = np.linalg.eigh(cov)

# create a covariance object using the decomposition
cov_obj = stats.Covariance.from_eigendecomposition((w, v))

# create the distribution using the covariance object instead.
dist = stats.multivariate_normal(np.zeros(cov.shape[0], dtype=cov.dtype), cov_obj)

# now, the methods `pdf`, etc will utilize the decomposition to make the
# computation more efficient and also avoid recomputing the eigen-decomposition
# every time `pdf` or any other method of unfrozen distributions is called.
dist.logpdf(np.zeros(cov.shape[0]))  # 46479.578745868996

I hope this addresses your issue!

@mdhaber
Contributor

mdhaber commented Oct 19, 2022

Thanks @tirthasheshpatel!

@ZhaofengWu Note also that I'd recommend saving the eigenvalue decomposition of your data if you want full control over whether your covariance is treated as singular or not when you load it up again.
