
BUG: Error computing multivariate_normal.logpdf with singular cov matrix #15509

Closed
IgnacioHeredia opened this issue Feb 2, 2022 · 5 comments
Labels
defect A clear bug or issue that prevents SciPy from being installed or used as expected scipy.stats

Comments

@IgnacioHeredia

Describe your issue.

Hi everyone,

I noticed a possible error in the implementation of the logpdf of a multivariate normal with a singular covariance matrix. The SciPy implementation gives different results from a manual implementation (formula here). I attach a minimal reproducible example.
The output is:

Non singular case:
  Scipy logpdf: -2.834677770629769
  Manual logpdf: -2.8346777706297686

Singular case:
  Scipy logpdf: -2.4568046699816684
  Manual logpdf: -3.3757432031863406

At this point my usual approach would be to not trust my own code, but it happens that in R I get the same results as my manual Python implementation, so I'm a bit clueless about what multivariate_normal.logpdf is actually doing in the case of singular matrices. The R code uses dmvnorm from the mvtnorm package.

I also opened a (possibly related?) issue regarding multivariate_normal at #15508.

Thanks!

Ignacio

Reproducing Code Example

import numpy as np
from scipy.stats import multivariate_normal


sample = np.array([1, 1])
mean = np.array([0, 0])


def compare(cov, singular):
    
    # Logpdf - scipy implementation
    ##############################
    
    a = multivariate_normal.logpdf(
        x=sample,
        mean=mean,
        cov=cov,
        allow_singular=singular,
    )
    print(f'  Scipy logpdf: {a}')
    
    
    # Logpdf - manual implementation
    ################################
    
    if not singular:
        det = np.linalg.det(cov)
        log_det = np.log(det)
        inv = np.linalg.inv(cov)
        
    else:  # singular cov --> use pseudo-determinant and pseudo-inverse
        # Note: I also checked that for non-singular matrices the pseudo-det/inv
        # still give the same results as the ordinary det/inv (!)
        eig_values = np.linalg.eig(cov)[0]
        eig_values = eig_values[eig_values > 1e-12]
        # Use the log(a*b) = log(a) + log(b) property to avoid the overflow/NaN
        # problem that computing det = np.prod(eig_values) first can cause
        log_det = np.sum(np.log(eig_values))
        inv = np.linalg.pinv(cov)

    def manual_logpdf(sample):
        # ref: https://en.wikipedia.org/wiki/Multivariate_normal_distribution#Likelihood_function
        residuals = sample - mean
        k = len(mean)
        return -0.5 * (log_det + residuals.T.dot(inv).dot(residuals) + k * np.log(2 * np.pi))    

    b = manual_logpdf(sample)
    print(f'  Manual logpdf: {b}')
    

# Non singular matrix
cov = np.array([[2.0, 0.3],
                [0.3, 0.5]])
print('Non singular case:')
compare(cov, singular=False)

# Singular matrix, (symmetric) positive semidefinite
print('\nSingular case:')
cov = np.array([[2, 6],
                [6, 18]])
compare(cov, singular=True)

Error message

None

SciPy/NumPy/Python version information

1.6.2

@IgnacioHeredia IgnacioHeredia added the defect A clear bug or issue that prevents SciPy from being installed or used as expected label Feb 2, 2022
@mdhaber
Contributor

mdhaber commented Apr 11, 2022

@IgnacioHeredia Here is the relevant section of code:

dev = x - mean
maha = np.sum(np.square(np.dot(dev, prec_U)), axis=-1)
return -0.5 * (rank * _LOG_2PI + log_det_cov + maha)

  • maha is what you write as residuals.T.dot(inv).dot(residuals).
  • log_det_cov is what you call log_det.
  • rank * _LOG_2PI corresponds with your k * np.log(2 * np.pi).

In your example, only the last of these is different numerically, and it just comes down to the fact that rank != k. rank = 1 in SciPy, and in your implementation, k = 2. When you change that, the two produce the same result.
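In fact, the gap between the two numbers in your output is exactly that missing term: with k = 2 and rank = 1, the manual value is lower by 0.5 * log(2*pi). A quick check, using the values copied from your output above:

```python
import numpy as np

scipy_logpdf = -2.4568046699816684   # SciPy, using rank = 1
manual_logpdf = -3.3757432031863406  # manual, using k = 2
# The difference is exactly -(k - rank)/2 * log(2*pi) = -0.5 * log(2*pi)
print(np.isclose(manual_logpdf - scipy_logpdf, -0.5 * np.log(2 * np.pi)))  # True
```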

I don't have time to look into which is correct right now. Do you know?

Here is a simplified version of your example, modified to agree with SciPy:

import numpy as np
from scipy.stats import multivariate_normal

sample = np.array([1, 1])
mean = np.array([0, 0])
cov = np.array([[2, 6],
                [6, 18]])

def manual_logpdf(sample):
    # ref: https://en.wikipedia.org/wiki/Multivariate_normal_distribution#Likelihood_function
    eig_values = np.linalg.eig(cov)[0]
    eig_values = eig_values[eig_values > 1e-12]
    log_det = np.log(np.prod(eig_values))
    inv = np.linalg.pinv(cov)

    residuals = sample - mean
    k = np.linalg.matrix_rank(cov)  # len(mean)?
    return -0.5 * (k * np.log(2 * np.pi)
                   + log_det 
                   + residuals.T.dot(inv).dot(residuals))    
    
x = multivariate_normal.logpdf(x=sample, mean=mean, cov=cov, 
                                allow_singular=True)
x2 = manual_logpdf(sample)

print(x, x2)  # -2.4568046699816684 -2.4568046699816684

@IgnacioHeredia
Author

Hi @mdhaber, thanks for taking a look!

The Wikipedia page from where I took the logpdf formula says that k is the dimension of the mean vector (as in my original implementation). In the non-singular case, this is the same as the rank (therefore I got the same results as Scipy's implementation) but in the singular case they obviously differ.
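To make the distinction concrete, here is a small check using the two covariance matrices from my original example:

```python
import numpy as np

mean = np.array([0, 0])
cov_nonsingular = np.array([[2.0, 0.3], [0.3, 0.5]])
cov_singular = np.array([[2, 6], [6, 18]])

# Non-singular case: the rank equals the dimension k of the mean vector
print(np.linalg.matrix_rank(cov_nonsingular), len(mean))  # 2 2
# Singular case: the rank (1) differs from k (2)
print(np.linalg.matrix_rank(cov_singular), len(mean))     # 1 2
```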

@mdhaber
Contributor

mdhaber commented Apr 11, 2022

Yes, but also:
[screenshot of the relevant passage from the Wikipedia page]

which makes me wonder if they just weren't considering the singular case when they first defined k.

@IgnacioHeredia
Author

IgnacioHeredia commented Apr 11, 2022

I think you are actually right @mdhaber. I went to one of the sources listed in the Wikipedia page and I found this (p. 528):

Rao, C. R. (1973). Linear Statistical Inference and Its Applications. New York: Wiley. pp. 527–528. ISBN 0-471-70823-2.

[screenshot of the cited passage from Rao (1973), p. 528]

The Wikipedia page's notation is indeed misleading. The book clearly distinguishes the length p of the mean vector from the rank k of the covariance matrix (= R(sigma), first line of the image). And the pdf should be computed using k, not p (last line, with sigma^- being the pseudo-inverse and the product of eigenvalues being the pseudo-determinant).
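A direct transcription of that formula does reproduce SciPy's value. This is just a sketch: the rao_logpdf name and the 1e-12 eigenvalue cutoff are my own choices, and I evaluate at a point inside the distribution's support.

```python
import numpy as np
from scipy.stats import multivariate_normal

def rao_logpdf(x, mean, cov, tol=1e-12):
    # Degenerate-normal log-density following Rao (1973), p. 528:
    # k = rank of cov, with the pseudo-determinant and pseudo-inverse.
    dev = np.asarray(x) - mean
    eigvals = np.linalg.eigvalsh(cov)
    pos = eigvals[eigvals > tol]
    k = len(pos)                    # rank of cov, NOT len(mean)
    log_pdet = np.sum(np.log(pos))  # log pseudo-determinant
    maha = dev @ np.linalg.pinv(cov) @ dev
    return -0.5 * (k * np.log(2 * np.pi) + log_pdet + maha)

mean = np.array([0, 0])
cov = np.array([[2, 6], [6, 18]])
x = np.array([1, 3])  # a point in the span of cov's columns (the support)
print(np.isclose(rao_logpdf(x, mean, cov),
                 multivariate_normal.logpdf(x, mean=mean, cov=cov,
                                            allow_singular=True)))  # True
```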

I guess SciPy's implementation is correct after all!

I'll try to submit fixes to both Wikipedia and the mvtnorm R package.
Feel free to close the issue if you agree.

Thanks for everything!

@mdhaber
Contributor

mdhaber commented Apr 11, 2022

I think SciPy is right because then the multivariate_normal PDF agrees with a norm PDF. Please check me here, because I only had a vague intuition this would work.

import numpy as np
from scipy import stats

cov = np.array([[2, 6], [6, 18]])  # singular cov from the example above
evals, evecs = np.linalg.eig(cov)
stats.norm.logpdf(evecs[:, 1] @ [1, 1], scale=np.sqrt(evals[1]))  # -2.4568046699816684

Thinking about it from a different perspective, I might also have expected the pdf to be zero for a vector not aligned with evecs[:, 1].
I think a surface plot will reveal what's going on. My guess is that it will look like a univariate normal PDF swept perpendicular to the eigenvector with the nonzero eigenvalue.
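That guess is easy to check along the support line itself (a sketch; I use eigh so the eigenvalues come back in ascending order):

```python
import numpy as np
from scipy.stats import multivariate_normal, norm

mean = np.array([0, 0])
cov = np.array([[2, 6], [6, 18]])
evals, evecs = np.linalg.eigh(cov)  # ascending: evals[1] = 20 is the nonzero one
u = evecs[:, 1]                     # unit eigenvector spanning the support

# Along the support line t*u, the degenerate 2-D density should match
# a 1-D normal in t with variance evals[1]
for t in [-1.0, 0.5, 2.0]:
    mv = multivariate_normal.logpdf(t * u, mean=mean, cov=cov, allow_singular=True)
    uv = norm.logpdf(t, scale=np.sqrt(evals[1]))
    print(np.isclose(mv, uv))  # True for every t
```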
