Error in manifold.t_sne._kl_divergence #9711

Closed
goncalo-rodrigues opened this Issue Sep 8, 2017 · 2 comments

goncalo-rodrigues commented Sep 8, 2017

There seems to be an error in the calculation of kl_divergence in t_sne.py (line 140):

    X_embedded = params.reshape(n_samples, n_components)

    # Q is a heavy-tailed distribution: Student's t-distribution
    n = pdist(X_embedded, "sqeuclidean")
    n += 1.
    n /= degrees_of_freedom
    n **= (degrees_of_freedom + 1.0) / -2.0
    Q = np.maximum(n / (2.0 * np.sum(n)), MACHINE_EPSILON)

    # Optimization trick below: np.dot(x, y) is faster than
    # np.sum(x * y) because it calls BLAS

    # Objective: C (Kullback-Leibler divergence of P and Q)
    kl_divergence = 2.0 * np.dot(P, np.log(P / Q))

The correct formula for the Student's t probability density function is, according to Wikipedia:
https://en.wikipedia.org/wiki/Student%27s_t-distribution
gamma((v+1)/2) / (sqrt(v*pi) * gamma(v/2)) * (1 + (t^2)/v)^(-(v+1)/2)
= constant * (1 + (t^2)/v)^(-(v+1)/2)
where v = degrees of freedom and t-SNE uses the pairwise embedding distances as the variable t. We can ignore the constant because it doesn't depend on t and is normalized away anyway. The code, however, is computing ((1 + t^2)/v)^(-(v+1)/2) = (1/v + (t^2)/v)^(-(v+1)/2), which only matches the Student's t kernel (up to a constant) when v = 1.

    X_embedded = params.reshape(n_samples, n_components)

    # Q is a heavy-tailed distribution: Student's t-distribution
    n = pdist(X_embedded, "sqeuclidean")
    n /= degrees_of_freedom
    n += 1.
    n **= (degrees_of_freedom + 1.0) / -2.0
    Q = np.maximum(n / (2.0 * np.sum(n)), MACHINE_EPSILON)

    # Optimization trick below: np.dot(x, y) is faster than
    # np.sum(x * y) because it calls BLAS

    # Objective: C (Kullback-Leibler divergence of P and Q)
    kl_divergence = 2.0 * np.dot(P, np.log(P / Q))

This would be the correct version (n += 1. applied after dividing by degrees_of_freedom).
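
A quick numerical way to see the difference (a standalone sketch, not part of scikit-learn; dof, dist and dist2 are just illustrative names) is to compare both kernels against scipy's Student's t density:

    # Compare the buggy kernel ((1 + t^2) / v)^(-(v+1)/2) with the corrected
    # kernel (1 + t^2 / v)^(-(v+1)/2) against scipy's Student's t pdf.
    import numpy as np
    from scipy.stats import t as student_t

    dof = 3.0                          # degrees of freedom (v), chosen > 1 on purpose
    dist = np.linspace(0.1, 5.0, 50)   # pairwise embedding distances (t)
    dist2 = dist ** 2                  # squared distances, as returned by pdist(..., "sqeuclidean")

    exponent = (dof + 1.0) / -2.0
    buggy = ((1.0 + dist2) / dof) ** exponent    # what the current code computes
    correct = (1.0 + dist2 / dof) ** exponent    # Student's t kernel

    # The pdf equals the corrected kernel times a constant that does not depend on t,
    # so pdf / correct is flat, while pdf / buggy is not (unless dof == 1).
    pdf = student_t.pdf(dist, df=dof)
    print(np.ptp(pdf / correct))   # ~0: constant ratio
    print(np.ptp(pdf / buggy))     # clearly non-zero for dof != 1

For dof = 1 the two expressions coincide, which is consistent with the commit note below that a large error is only induced for larger n_components.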

goncalo-rodrigues added a commit to goncalo-rodrigues/scikit-learn that referenced this issue Sep 8, 2017

Bug fix in tsne kl_divergence issue scikit-learn#9711
kl_divergence had a small bug and did not follow the correct formula for the Student's t probability distribution. A large error would be induced for large n_components.

jnothman commented Sep 9, 2017

tomMoral commented Sep 9, 2017

This is indeed a bug.
It should also be changed in the _barnes_hut implementation for the computation of the kl_divergence:

# In _barnes_hut_tsne.pyx, line 136
-           qij = (((1.0 + dij) / dof) ** exponent)
+           qij = ((1.0 + dij / dof) ** exponent)

We should also check the gradient computations, as I assumed this formula was correct when I checked them.
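
One way to do that check (a hypothetical, standalone sketch rather than a test against scikit-learn's private functions; every name in it is illustrative) is to compare the analytical gradient of the corrected objective, (2*(dof+1)/dof) * sum_j (p_ij - q_ij) * (1 + d_ij^2/dof)^-1 * (y_i - y_j), with a finite-difference gradient:

    # Hypothetical gradient check for the corrected Student's t kernel:
    # analytical t-SNE gradient vs. numerical gradient of the KL objective.
    import numpy as np
    from scipy.optimize import approx_fprime
    from scipy.spatial.distance import pdist, squareform

    rng = np.random.RandomState(0)
    n_samples, n_components = 6, 3
    dof = float(n_components - 1)   # any dof > 1 exercises the corrected kernel

    # Small symmetric, normalized P as a stand-in for the input affinities.
    P = squareform(rng.rand(n_samples * (n_samples - 1) // 2))
    P /= P.sum()

    def q_and_dist2(params):
        Y = params.reshape(n_samples, n_components)
        d2 = squareform(pdist(Y, "sqeuclidean"))
        num = (1.0 + d2 / dof) ** ((dof + 1.0) / -2.0)   # corrected kernel
        np.fill_diagonal(num, 0.0)
        return num / num.sum(), d2

    def kl_objective(params):
        Q, _ = q_and_dist2(params)
        mask = P > 0
        return np.sum(P[mask] * np.log(P[mask] / np.maximum(Q[mask], 1e-12)))

    def kl_grad(params):
        Y = params.reshape(n_samples, n_components)
        Q, d2 = q_and_dist2(params)
        # dC/dy_i = (2 * (dof + 1) / dof) * sum_j (p_ij - q_ij) * (1 + d_ij^2 / dof)^-1 * (y_i - y_j)
        W = (P - Q) / (1.0 + d2 / dof)
        np.fill_diagonal(W, 0.0)
        return (2.0 * (dof + 1.0) / dof) * (np.diag(W.sum(axis=1)) - W).dot(Y).ravel()

    params = rng.randn(n_samples * n_components)
    num_grad = approx_fprime(params, kl_objective, 1e-6)
    print(np.max(np.abs(num_grad - kl_grad(params))))   # should be of the order of the finite-difference error

The same kind of check, run against the exact and Barnes-Hut code paths with dof > 1, would confirm whether the existing gradients were derived from the buggy kernel or the correct one.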
