A math mistake in spectral_embedding #8129

Closed · GenTang opened this issue Dec 28, 2016 · 19 comments
Labels: Bug
@GenTang commented Dec 28, 2016

spectral_embedding_bug.pdf

Description

Spectral embedding is an algorithm proposed in the following paper:

“Laplacian Eigenmaps for Dimensionality Reduction and Data Representation”, M. Belkin, P. Niyogi, Neural Computation, June 2003; 15(6):1373-1396

On page 6, this paper defines the generalized eigenvalue problem

$$
Lf = \lambda D f
$$
where $D$ is the diagonal weight matrix whose entries are the column sums of $W$, $D_{ii} = \sum_{j} W_{ji}$, and $L = D - W$.

Let $f_0, f_1, \ldots, f_{k-1}$ be the solutions of the above equation, ordered so that $0 = \lambda_0 \le \lambda_1 \le \ldots \le \lambda_{k-1}$. Note: $f_0$ is a constant vector.

Therefore, the Laplacian eigenmap is $(f_1, \ldots, f_m)$ (the first $m$ nontrivial eigenvectors).

Let $g = D^{\frac{1}{2}} f$ and $L^{sym} = D^{-\frac{1}{2}} (D - W) D^{-\frac{1}{2}}$; then we have

$$
D^{-\frac{1}{2}} (D - W) D^{-\frac{1}{2}} \, D^{\frac{1}{2}} f = \lambda D^{\frac{1}{2}} f
$$
$$
L^{sym} g = \lambda g
$$

In the code here:

laplacian, dd = graph_laplacian(adjacency,
                                normed=norm_laplacian, return_diag=True)

laplacian is $L^{sym}$ (with norm_laplacian=True), and dd is the diagonal of $D^{\frac{1}{2}}$.

And in the code here:

lambdas, diffusion_map = eigsh(laplacian, k=n_components,
                               sigma=1.0, which='LM',
                               tol=eigen_tol, v0=v0)
embedding = diffusion_map.T[n_components::-1] * dd

diffusion_map is $g$, and embedding is supposed to be $f$.

However, $f = D^{-\frac{1}{2}} g$, which means the last line should be

embedding = diffusion_map.T[n_components::-1] / dd

The same mistake appears at line 289 and line 313.

Steps/Code to Reproduce

from sklearn.manifold.spectral_embedding_ import spectral_embedding
import numpy as np

# adjacency matrix of a small weighted graph
matrix = np.zeros([4, 4])
matrix[0, 1] = matrix[1, 0] = 1
matrix[0, 2] = matrix[2, 0] = 1
matrix[1, 2] = matrix[2, 1] = 1
matrix[2, 3] = matrix[3, 2] = 4

f = spectral_embedding(matrix, n_components=1, drop_first=False)

Expected Results

A constant vector (drop_first=False keeps the trivial $\lambda_0 = 0$ eigenvector):

array([[ 0.26726124],
       [ 0.26726124],
       [ 0.26726124],
       [ 0.26726124]])

Actual Results

array([[ 0.53452248],
       [ 0.53452248],
       [ 1.60356745],
       [ 1.06904497]])

Versions

>>> import platform; print(platform.platform())
Darwin-16.1.0-x86_64-i386-64bit
>>> import sys; print("Python", sys.version)
('Python', '2.7.9 (v2.7.9:648dcafa7e5f, Dec 10 2014, 10:10:46) \n[GCC 4.2.1 (Apple Inc. build 5666) (dot 3)]')
>>> import numpy; print("NumPy", numpy.__version__)
('NumPy', '1.11.1')
>>> import scipy; print("SciPy", scipy.__version__)
('SciPy', '0.17.0')
>>> import sklearn; print("Scikit-Learn", sklearn.__version__)
('Scikit-Learn', '0.17.1')
@GenTang changed the title from “A bug in spectral_embedding” to “A math mistake in spectral_embedding” on Dec 28, 2016
@jnothman (Member) commented Dec 29, 2016

Thanks. Would you be willing to submit a patch and some more robust unit tests than what we presently have?

@jnothman added the Bug label on Dec 29, 2016
@GenTang (Author) commented Dec 29, 2016

OK. I have a question about spectral embedding for a graph that is not fully connected:

When the graph is not fully connected and norm_laplacian=False, dd can contain zeros (the matrix $D$ is singular, and the code does not reset those entries to 1 as it does when norm_laplacian=True), so the division breaks the computation. Right now I don't know how to deal with this situation; any ideas? One possible guard is sketched below.

However, I don't understand why the parameter norm_laplacian exists at all. If norm_laplacian=False, I don't think we can recover $f$ from diffusion_map, which then holds the eigenvectors of $D - W$.

@jnothman (Member) commented Dec 29, 2016

I'm not the best person to help you there; I hope @GaelVaroquaux has a moment to chime in.

@devanshdalal (Contributor) commented Jan 1, 2017

Hi, I think I can dig into this issue, if @GenTang is not working on it?

@GenTang (Author) commented Jan 2, 2017

@devanshdalal, it's great that you want to help. But how do you plan to solve the problem with the non-normalized Laplacian matrix?

@devanshdalal (Contributor) commented Jan 2, 2017

@GenTang, if you go to page 6 of the “Laplacian Eigenmaps for Dimensionality Reduction and Data Representation” paper, you can see that the same algorithm is applicable to non-connected graphs as well (step 3 has to be applied to each connected component; a rough sketch follows). I guess we might not need to change anything, or only make very minor changes. I will have to dive deeper into the code to be sure.

@agramfort (Member) commented Jan 2, 2017

@jakevdp may have an opinion also. He wrote a lot of the manifold learning code.

@GaelVaroquaux (Member) commented Jan 2, 2017

I don't think that @jakevdp wrote this part. It was adapted afterwards from the spectral clustering code that I wrote.

I had a look at the equations and the code, and I believe that the bug report is correct (and hence the code is wrong).

@devanshdalal (Contributor) commented Jan 2, 2017

@GaelVaroquaux, I can fix the issue. Let me take it, if it's not too much.

@GaelVaroquaux (Member) commented Jan 2, 2017

@devanshdalal (Contributor) commented Jan 2, 2017

@GaelVaroquaux, will do both and submit a pull request soon.

@GenTang (Author) commented Jan 3, 2017

@devanshdalal, I saw that part of the paper. But consider the following example, which triggers a divide-by-zero warning:

>>> from sklearn.utils.graph import graph_laplacian
>>> from scipy.sparse.linalg import eigsh
>>> import numpy as np
>>> matrix = np.array([[ 0.,  1.,  0.,  0.],
...                    [ 1.,  0.,  0.,  0.],
...                    [ 0.,  0.,  0.,  0.],
...                    [ 0.,  0.,  0.,  0.]])
>>> matrix, dd = graph_laplacian(matrix, normed=False, return_diag=True)
>>> lambdas, diffusion_map = eigsh(matrix, k=1)
>>> diffusion_map.T[1::-1] / dd
__main__:1: RuntimeWarning: divide by zero encountered in divide
array([[ 0.70710678, -0.70710678,         inf,        -inf]])

@GaelVaroquaux, I'm sorry to bother you, but I don't understand what this algorithm is supposed to compute when norm_laplacian=False. The corresponding code is here.
Thanks

@devanshdalal (Contributor) commented Jan 3, 2017

@GenTang, I got your point. Let me look into the problem. I will discuss possible solutions with you before starting to implement.

@GaelVaroquaux (Member) commented Jan 3, 2017

@GenTang (Author) commented Jan 3, 2017

@GaelVaroquaux
Thanks a lot for your reply.
To answer your question, lambdas is array([ 2.]). Of course, we could replace the zeros in dd by ones, but that doesn't change anything.

My question is not about dd containing zeros; it is about why spectral_embedding introduces the parameter norm_laplacian in the first place.

When we compute a spectral embedding, we try to solve the equation $(D - W) f = \lambda D f$.

To solve it, we should compute the normalized Laplacian matrix $D^{-\frac{1}{2}} (D - W) D^{-\frac{1}{2}}$, compute its eigenvectors, and finally recover $f$.

However, with norm_laplacian=False, diffusion_map contains the eigenvectors of $D - W$; that is, (D - W) * diffusion_map = \lambda * diffusion_map.

But what would (D - W) * D^(-1/2) * diffusion_map or (D - W) * D^(1/2) * diffusion_map be equal to? It seems that neither diffusion_map.T[:n_components] / dd nor diffusion_map.T[:n_components] * dd has any mathematical meaning. (Please correct me if I am wrong.)

So I don't understand why scikit-learn introduces the parameter norm_laplacian in spectral_embedding, since spectral_embedding should always work with the normalized Laplacian matrix.

@devanshdalal (Contributor) commented Jan 4, 2017

@GenTang, I believe you are correct regarding norm_laplacian. For a theoretically correct implementation it should not be an argument of spectral_embedding(), and the call should look like this:

graph_laplacian(adjacency,
                normed=True, return_diag=True)

Correct me if I am wrong. Removing norm_laplacian could be one solution, or we could provide an implementation where the norm_laplacian=False case is handled properly. (I think that can be done as well; any ideas in that direction?)

EDIT: no update in this thread lately. What are your views, @GenTang, @GaelVaroquaux?

@GenTang (Author) commented Jan 10, 2017

For me, the second solution would be better, @devanshdalal.

@devanshdalal (Contributor) commented Jan 10, 2017

OK, let me provide a quick patch soon.

@amueller (Member) commented Mar 3, 2017

I didn't look into this, but is this by any chance related to #6489?
