
BUG: Error computing multivariate_normal.logpdf with singular cov matrix #15509

Closed
IgnacioHeredia opened this issue Feb 2, 2022 · 5 comments
Labels
defect A clear bug or issue that prevents SciPy from being installed or used as expected scipy.stats

Comments

@IgnacioHeredia

Describe your issue.

Hi everyone,

I noticed a possible error in the implementation of the logpdf of a multivariate normal with a singular covariance matrix. The SciPy implementation gives different results from a manual implementation (formula here). I attach a minimal reproducible example.
The output is:

Non singular case:
  Scipy logpdf: -2.834677770629769
  Manual logpdf: -2.8346777706297686

Singular case:
  Scipy logpdf: -2.4568046699816684
  Manual logpdf: -3.3757432031863406

At this point my usual approach would be to not trust my own code, but it happens that in R I get the same results as my manual Python implementation, so I'm a bit clueless about what multivariate_normal.logpdf is actually doing in the case of singular matrices. The R code uses dmvnorm from the mvtnorm package.

I also opened a (possibly related?) issue regarding multivariate_normal at #15508.

Thanks!

Ignacio

Reproducing Code Example

import numpy as np
from scipy.stats import multivariate_normal


sample = np.array([1, 1])
mean = np.array([0, 0])


def compare(cov, singular):
    
    # Logpdf - scipy implementation
    ##############################
    
    a = multivariate_normal.logpdf(
        x=sample,
        mean=mean,
        cov=cov,
        allow_singular=singular,
    )
    print(f'  Scipy logpdf: {a}')
    
    
    # Logpdf - manual implementation
    ################################
    
    if not singular:
        det = np.linalg.det(cov)
        log_det = np.log(det)
        inv = np.linalg.inv(cov)
        
    else:  # singular cov --> use pseudo-determinant and pseudo-inverse
        # Note: I also checked that for non-singular matrices the pseudo-det/inv
        # still give the same results as the ordinary det/inv (!)
        eig_values = np.linalg.eig(cov)[0]
        eig_values = eig_values[eig_values > 1e-12]
        # Use the log(a*b) = log(a) + log(b) property to avoid the overflow/NaN
        # problem that computing det = np.prod(eig_values) first can cause
        log_det = np.sum(np.log(eig_values))
        inv = np.linalg.pinv(cov)

    def manual_logpdf(sample):
        # ref: https://en.wikipedia.org/wiki/Multivariate_normal_distribution#Likelihood_function
        residuals = sample - mean
        k = len(mean)
        return -0.5 * (log_det + residuals.T.dot(inv).dot(residuals) + k * np.log(2 * np.pi))    

    b = manual_logpdf(sample)
    print(f'  Manual logpdf: {b}')
    

# Non singular matrix
cov = np.array([[2.0, 0.3],
                [0.3, 0.5]])
print('Non singular case:')
compare(cov, singular=False)

# Singular matrix, (symmetric) positive semidefinite
print('\nSingular case:')
cov = np.array([[2, 6],
                [6, 18]])
compare(cov, singular=True)

Error message

None

SciPy/NumPy/Python version information

1.6.2

@IgnacioHeredia IgnacioHeredia added the defect A clear bug or issue that prevents SciPy from being installed or used as expected label Feb 2, 2022
@mdhaber
Contributor

mdhaber commented Apr 11, 2022

@IgnacioHeredia Here is the relevant section of code:

dev = x - mean
maha = np.sum(np.square(np.dot(dev, prec_U)), axis=-1)
return -0.5 * (rank * _LOG_2PI + log_det_cov + maha)

  • maha is what you write as residuals.T.dot(inv).dot(residuals).
  • log_det_cov is what you call log_det.
  • rank * _LOG_2PI corresponds with your k * np.log(2 * np.pi).

In your example, only the last of these is different numerically, and it just comes down to the fact that rank != k. rank = 1 in SciPy, and in your implementation, k = 2. When you change that, the two produce the same result.
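In fact, the gap between the two numbers in your output is exactly that missing term: with k = 2 and rank = 1, the manual value is lower by 0.5 * log(2*pi). A quick check, using the values copied from your output above:

```python
import numpy as np

scipy_logpdf = -2.4568046699816684   # SciPy, using rank = 1
manual_logpdf = -3.3757432031863406  # manual, using k = 2
# The difference is exactly -(k - rank)/2 * log(2*pi) = -0.5 * log(2*pi)
print(np.isclose(manual_logpdf - scipy_logpdf, -0.5 * np.log(2 * np.pi)))  # True
```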

I don't have time to look into which is correct right now. Do you know?

Here is a simplified version of your example, modified to agree with SciPy:

import numpy as np
from scipy.stats import multivariate_normal

sample = np.array([1, 1])
mean = np.array([0, 0])
cov = np.array([[2, 6],
                [6, 18]])

def manual_logpdf(sample):
    # ref: https://en.wikipedia.org/wiki/Multivariate_normal_distribution#Likelihood_function
    eig_values = np.linalg.eig(cov)[0]
    eig_values = eig_values[eig_values > 1e-12]
    log_det = np.log(np.prod(eig_values))
    inv = np.linalg.pinv(cov)

    residuals = sample - mean
    k = np.linalg.matrix_rank(cov)  # len(mean)?
    return -0.5 * (k * np.log(2 * np.pi)
                   + log_det 
                   + residuals.T.dot(inv).dot(residuals))    
    
x = multivariate_normal.logpdf(x=sample, mean=mean, cov=cov, 
                                allow_singular=True)
x2 = manual_logpdf(sample)

print(x, x2)  # -2.4568046699816684 -2.4568046699816684

@IgnacioHeredia
Author

Hi @mdhaber, thanks for taking a look!

The Wikipedia page from where I took the logpdf formula says that k is the dimension of the mean vector (as in my original implementation). In the non-singular case, this is the same as the rank (therefore I got the same results as Scipy's implementation) but in the singular case they obviously differ.
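To make the distinction concrete, here is a small check using the two covariance matrices from my original example:

```python
import numpy as np

mean = np.array([0, 0])
cov_nonsingular = np.array([[2.0, 0.3], [0.3, 0.5]])
cov_singular = np.array([[2, 6], [6, 18]])

# Non-singular case: the rank equals the dimension k of the mean vector
print(np.linalg.matrix_rank(cov_nonsingular), len(mean))  # 2 2
# Singular case: the rank (1) differs from k (2)
print(np.linalg.matrix_rank(cov_singular), len(mean))     # 1 2
```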

@mdhaber
Contributor

mdhaber commented Apr 11, 2022

Yes, but also:
[screenshot of the relevant passage from the Wikipedia page]

which makes me wonder if they just weren't considering the singular case when they first defined k.

@IgnacioHeredia
Author

IgnacioHeredia commented Apr 11, 2022

I think you are actually right @mdhaber. I went to one of the sources listed in the Wikipedia page and I found this (p. 528):

Rao, C. R. (1973). Linear Statistical Inference and Its Applications. New York: Wiley. pp. 527–528. ISBN 0-471-70823-2.

[screenshot of the cited passage from Rao (1973), p. 528]

The Wikipedia page's notation is indeed misleading. The book clearly distinguishes the length p of the mean vector from the rank k of the covariance matrix (= R(sigma), first line of the image). And the pdf should be computed using k, not p (last line, with sigma^- being the pseudo-inverse and the product of eigenvalues being the pseudo-determinant).
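A direct transcription of that formula does reproduce SciPy's value. This is just a sketch: the rao_logpdf name and the 1e-12 eigenvalue cutoff are my own choices, and I evaluate at a point inside the distribution's support.

```python
import numpy as np
from scipy.stats import multivariate_normal

def rao_logpdf(x, mean, cov, tol=1e-12):
    # Degenerate-normal log-density following Rao (1973), p. 528:
    # k = rank of cov, with the pseudo-determinant and pseudo-inverse.
    dev = np.asarray(x) - mean
    eigvals = np.linalg.eigvalsh(cov)
    pos = eigvals[eigvals > tol]
    k = len(pos)                    # rank of cov, NOT len(mean)
    log_pdet = np.sum(np.log(pos))  # log pseudo-determinant
    maha = dev @ np.linalg.pinv(cov) @ dev
    return -0.5 * (k * np.log(2 * np.pi) + log_pdet + maha)

mean = np.array([0, 0])
cov = np.array([[2, 6], [6, 18]])
x = np.array([1, 3])  # a point in the span of cov's columns (the support)
print(np.isclose(rao_logpdf(x, mean, cov),
                 multivariate_normal.logpdf(x, mean=mean, cov=cov,
                                            allow_singular=True)))  # True
```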

I guess SciPy's implementation is correct after all!

I'll try to submit fixes to both Wikipedia and the mvtnorm R package.
Feel free to close the issue if you agree.

Thanks for everything!

@mdhaber
Contributor

mdhaber commented Apr 11, 2022

I think SciPy is right because then the multivariate_normal PDF agrees with a norm PDF. Please check me here, because I only had a vague intuition this would work.

import numpy as np
from scipy import stats

cov = np.array([[2, 6], [6, 18]])  # singular cov from the example above
evals, evecs = np.linalg.eig(cov)
stats.norm.logpdf(evecs[:, 1] @ [1, 1], scale=np.sqrt(evals[1]))  # -2.4568046699816684

Thinking about it from a different perspective, I might also have expected the pdf to be zero for a vector not aligned with evecs[:, 1].
I think a surface plot will reveal what's going on. My guess is that it will look like a univariate normal PDF swept perpendicular to the eigenvector with the nonzero eigenvalue.
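That guess is easy to check along the support line itself (a sketch; I use eigh so the eigenvalues come back in ascending order):

```python
import numpy as np
from scipy.stats import multivariate_normal, norm

mean = np.array([0, 0])
cov = np.array([[2, 6], [6, 18]])
evals, evecs = np.linalg.eigh(cov)  # ascending: evals[1] = 20 is the nonzero one
u = evecs[:, 1]                     # unit eigenvector spanning the support

# Along the support line t*u, the degenerate 2-D density should match
# a 1-D normal in t with variance evals[1]
for t in [-1.0, 0.5, 2.0]:
    mv = multivariate_normal.logpdf(t * u, mean=mean, cov=cov, allow_singular=True)
    uv = norm.logpdf(t, scale=np.sqrt(evals[1]))
    print(np.isclose(mv, uv))  # True for every t
```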
