
[MRG+2] Neighborhood Components Analysis #10058

Merged
merged 89 commits into scikit-learn:master on Feb 28, 2019

Conversation

Contributor

@wdevazelhes wdevazelhes commented Nov 2, 2017

Hi, this PR is an implementation of the Neighborhood Components Analysis algorithm (NCA), a popular supervised distance metric learning algorithm. Like LMNN (cf. PR #8602), this algorithm takes a labeled dataset as input, instead of similar/dissimilar pairs as is the case for most metric learning algorithms, and learns a linear transformation of the space. However, NCA and LMNN have different objective functions: NCA tries to maximise the probability that each sample is correctly classified under a stochastic nearest neighbors rule, and therefore does not need a set of target neighbors to be fixed in advance.
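
To give an idea of the objective, here is a rough NumPy sketch of it (illustrative only, not the actual code in this PR; it is written for clarity and materialises all pairwise differences at once, so it is not memory-efficient):

import numpy as np

def nca_objective(L, X, y):
    # Embed the samples with the current linear transformation L.
    X_embedded = X.dot(L.T)
    # Pairwise squared Euclidean distances in the embedded space.
    diff = X_embedded[:, np.newaxis, :] - X_embedded[np.newaxis, :, :]
    dist = (diff ** 2).sum(axis=2)
    # Stochastic neighbor probabilities: softmax over negative distances,
    # with each sample excluded from being its own neighbor.
    np.fill_diagonal(dist, np.inf)
    p = np.exp(-dist)
    p /= p.sum(axis=1, keepdims=True)
    # p_i = probability that sample i picks a neighbor of its own class;
    # the objective to maximise is the sum of the p_i over all samples.
    same_class = y[:, np.newaxis] == y[np.newaxis, :]
    return (p * same_class).sum()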

There have been several attempts to implement NCA (2 PRs: #5276 (closed) and #4789 (not closed)). I created a fresh PR for the sake of clarity. This code is intended to be as similar to LMNN as possible, which should allow factoring out the parts of the code that are shared by both algorithms.

At the time of writing, this algorithm uses scipy's L-BFGS-B solver to solve the optimisation problem, like LMNN. This has the big advantage of avoiding the need to tune a learning rate.
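
Concretely, the optimisation boils down to a scipy.optimize.minimize call with jac=True; here is a small runnable sketch of that calling convention, using a stand-in least-squares loss instead of the actual NCA loss and gradient:

import numpy as np
from scipy.optimize import minimize

rng = np.random.RandomState(0)
X = rng.randn(100, 5)
y = rng.randn(100)

def loss_grad(w, X, y):
    # Stand-in objective: returns (loss, gradient), which is exactly the
    # convention expected by minimize(..., jac=True).
    residual = X.dot(w) - y
    return 0.5 * residual.dot(residual), X.T.dot(residual)

result = minimize(fun=loss_grad, x0=np.zeros(X.shape[1]), args=(X, y),
                  jac=True, method='L-BFGS-B', options={'maxiter': 50})
print(result.fun, result.nit)
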
I benchmarked this implementation against the one in the metric-learn package (https://github.com/all-umass/metric-learn): the one in this PR has the advantage of scaling to large datasets (metric-learn's NCA throws a MemoryError on datasets as large as faces or digits), with no significant loss in performance on small datasets.

The remaining tasks are the following:

  • More detailed benchmark of performance against reference implementations
    (for instance metric-learn's) (coming soon)
  • Add an example
  • Benchmark the algorithm accuracy on several datasets
  • Documentation

In addition, some improvements could be made later:

  • Make algorithmic improvements (e.g. ignore samples whose softmax weight is
    nearly zero)
  • Add the possibility to choose another solver such as SGD to scale to large
    datasets (probably in another PR?)
  • Add rules of thumb to choose the solver etc. by default (idem?)
  • Make it possible to pass sparse matrices as input?
  • Add support for multilabel classification, a case where NCA could be really useful (cf. discussion with @bellet and @GaelVaroquaux). Metric learning is well suited to these problems: NCA would fit a single matrix to satisfy the multilabel constraints, which is an advantage over one-vs-rest algorithms (e.g. it can do better when labels are correlated).

Feedback is welcome!

@wdevazelhes
Contributor Author

wdevazelhes commented Nov 2, 2017

I benchmarked this implementation against the one in the metric-learn package (https://github.com/all-umass/metric-learn): the one in this PR has the advantage of scaling to large datasets (metric-learn's NCA throws a MemoryError on datasets as large as faces or digits), with no significant loss in performance on small datasets.

Here is a snippet that shows it (on my machine which has 7.7GB of memory):

from sklearn.datasets import load_digits
from metric_learn import NCA
from sklearn.neighbors import NeighborhoodComponentsAnalysis
from sklearn.utils.testing import assert_raises

digits = load_digits()
X, y = digits.data, digits.target

nca_ml = NCA()
assert_raises(MemoryError, nca_ml.fit, X, y)  # metric-learn's NCA runs out of memory on digits

nca_sk = NeighborhoodComponentsAnalysis()
nca_sk.fit(X, y)  # does not raise any error

Indeed, they do not add a significant speedup but have a high memory cost.
transformation = transformation.reshape(-1, X.shape[1])
loss = 0
gradient = np.zeros(transformation.shape)
X_embedded = transformation.dot(X.T).T

With np.dot(X, transformation.T) you save a transpose.

Contributor Author

True, I will change it.

@johny-c

johny-c commented Nov 11, 2017

Hi @wdevazelhes, since the code is very similar to LMNN, I took the liberty of adding just a couple of comments, by no means a full review. @jnothman, I hope I didn't break any contribution guidelines...

diff_embedded[~ci, :]))
p_i = np.sum(p_i_j) # probability of x_i to be correctly
# classified
gradient += 2 * (p_i * (sum_ci.T + sum_not_ci.T) - sum_ci.T)

You could just add the matrices without transposing, and transpose the gradient once before returning the unraveled gradient.

@johny-c johny-c Nov 11, 2017

I remembered I had seen a more efficient computation of the gradient (see last equation in slide 12 here: http://bengio.abracadoudou.com/lce/slides/roweis.pdf ).
This would amount to:

p_i_j = exp_dist_embedded[ci]
p_i_k = exp_dist_embedded[~ci]
diffs = X[i, :] - X
diff_ci = diffs[ci, :]
diff_not_ci = diffs[~ci, :]
sum_ci = diff_ci.T.dot(p_i_j[:, np.newaxis] * diff_ci)
sum_not_ci = diff_not_ci.T.dot(p_i_k[:, np.newaxis] * diff_not_ci)
p_i = np.sum(p_i_j)
gradient += p_i * sum_not_ci + (1 - p_i) * sum_ci

And multiplying by the transformation after the for-loop:
gradient = 2*np.dot(transformation, gradient)

Contributor Author

> You could just add the matrices without transposing, and transpose the gradient once before returning the unraveled gradient.

True, I will change it.

> I remembered I had seen a more efficient computation of the gradient (see last equation in slide 12 here: http://bengio.abracadoudou.com/lce/slides/roweis.pdf).
> This would amount to:
> [...]
> And multiplying by the transformation after the for-loop:
> gradient = 2*np.dot(transformation, gradient)

Since I already computed the embedded differences for the softmax earlier, wouldn't it be more efficient to reuse them?
I agree that the expression with (1 - p_i) is clearer, though.

@codecov

codecov bot commented Nov 15, 2017

Codecov Report

Merging #10058 into master will increase coverage by 0.01%.
The diff coverage is 99.65%.


@@            Coverage Diff             @@
##           master   #10058      +/-   ##
==========================================
+ Coverage   96.19%   96.21%   +0.01%     
==========================================
  Files         336      338       +2     
  Lines       62740    63025     +285     
==========================================
+ Hits        60354    60638     +284     
- Misses       2386     2387       +1
Impacted Files Coverage Δ
sklearn/neighbors/__init__.py 100% <100%> (ø) ⬆️
sklearn/neighbors/tests/test_nca.py 100% <100%> (ø)
sklearn/neighbors/nca.py 99.29% <99.29%> (ø)
sklearn/decomposition/tests/test_pca.py 100% <0%> (ø) ⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@wdevazelhes
Contributor Author

Thanks for your comments @johny-c! I have modified the code accordingly.

@wdevazelhes
Contributor Author

I also added an example: plot_nca_dim_reduction.py (in commit 12cf3a9)

@wdevazelhes
Contributor Author

wdevazelhes commented Nov 15, 2017

I benchmarked this PR's implementation of NCA against metric-learn's: I plotted the training curves (objective function vs. time), for the same initialisation (identity), on the breast_cancer dataset, on my machine which has 7.7 GB of memory and 4 cores.
Since metric-learn's optimisation is a variant of stochastic gradient descent, it needs a learning rate to be tuned (unlike this PR's implementation), so I plotted the curve for several learning rates.

In some runs, metric-learn's NCA training is interrupted prematurely due to numerical instabilities, and this warning is thrown:

RuntimeWarning: invalid value encountered in true_divide
  softmax /= softmax.sum()
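
(As an aside, this kind of failure can be avoided with the log-sum-exp trick; a tiny illustrative sketch, not taken from either implementation:)

import numpy as np
from scipy.special import logsumexp

def stable_softmax(scores):
    # logsumexp subtracts the max internally, so the exponentials stay finite
    # and the normalisation never produces 0/0.
    return np.exp(scores - logsumexp(scores))

print(stable_softmax(np.array([-2000., -2001., -2005.])))  # finite, sums to 1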

[figures: training curves (objective function vs. time) for this PR's NCA and for metric-learn's NCA with various learning rates]

# Plot the embedding and show the evaluation score
plt.scatter(X_embedded[:, 0], X_embedded[:, 1], c=y)
plt.title("{}, KNN (k={})".format(name, n_neighbors))
plt.text(0.9, 0.1, '{:.2f}'.format(acc_knn), size=15,
Contributor

The accuracy is a bit misplaced when running the example on my laptop. It is probably easier to put "Test accuracy = x" in the title (after a line break).
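
Something like this (a minimal sketch with placeholder values):

import matplotlib.pyplot as plt

name, n_neighbors, acc_knn = "NCA", 3, 0.97  # placeholder values for illustration
plt.title("{}, KNN (k={})\nTest accuracy = {:.2f}".format(name, n_neighbors, acc_knn))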

Contributor Author

Done

import time
from scipy.misc import logsumexp
from scipy.optimize import minimize
from sklearn.preprocessing import OneHotEncoder
Member

use relative import

Contributor Author

Done


pca:
``n_features_out`` many principal components of the inputs passed
to :meth:`fit` will be used to initialize the transformation.
Member

all params are not indented the same way

Contributor Author

This is because pca, identity, random and numpy array are not parameters themselves but possible choices for the init parameter. I took the syntax from LMNN. Should I write it another way?
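
For context, the layout borrowed from LMNN looks roughly like this (a paraphrase of the option-list convention, not the exact final docstring), with each option name indented under the init parameter and its description indented one level further:

init : string or numpy array
    Initialization of the linear transformation. Possible options are
    'pca', 'identity', 'random', and a numpy array of shape
    (n_features_out, n_features).

    pca:
        ``n_features_out`` many principal components of the inputs passed
        to :meth:`fit` will be used to initialize the transformation.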

----------
transformation : array, shape (n_features_out, n_features)
The linear transformation on which to compute loss and evaluate
gradient
Member

insert empty line

Contributor Author

Done

X_embedded = np.dot(X, transformation.T)

# for every sample x_i, compute its contribution to loss and gradient
for i in range(X.shape[0]):
Member

Looping over samples seems like a problem. Is it possible to vectorize?

Contributor

@bellet bellet Nov 20, 2017

It should be possible, but one should then try to avoid an O(n_samples * n_samples * n_features) memory complexity.
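
(For illustration, one way to vectorise without that blow-up is to process the samples in chunks, so that only a (chunk_size, n_samples, n_features) block of differences is materialised at a time; a rough sketch, not the code eventually merged — note the resulting distance matrix is still O(n_samples^2):)

import numpy as np

def pairwise_sq_dists_chunked(X_embedded, chunk_size=256):
    n_samples = X_embedded.shape[0]
    dist = np.empty((n_samples, n_samples))
    for start in range(0, n_samples, chunk_size):
        chunk = X_embedded[start:start + chunk_size]
        # Differences for one chunk only, instead of the full n x n x d cube.
        diff = chunk[:, np.newaxis, :] - X_embedded[np.newaxis, :, :]
        dist[start:start + chunk_size] = (diff ** 2).sum(axis=2)
    return dist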

@agramfort
Member

agramfort commented Nov 20, 2017 via email


Parameters
----------
transformation : array, shape(n_components, n_features)
Member

is it raveled here or not?

Contributor Author

Absolutely, it should be raveled, good catch, I'll change it

@@ -0,0 +1,511 @@
import pytest
Member

I would add here a header with license and authors on top of the file like we do in other places

Contributor Author

That's right, it's missing, but did you mean in the main file nca.py? I looked at two or three test files and they didn't seem to have an author section/license.

@agramfort
Member

ok did one (last?) round.

more tomorrow!

@agramfort
Member

agramfort commented Feb 28, 2019 via email

@@ -3,7 +3,9 @@
Neighborhood Component Analysis
"""

# License: BSD 3 Clause
# Authors: William de Vazelhes <wdevazelhes@gmail.com>
# John Chiotellis <ioannis.chiotellis@in.tum.de>
Contributor Author

@johny-c I put the email I found on your webpage, but feel free to put another one here ;)

@wdevazelhes
Contributor Author

wdevazelhes commented Feb 28, 2019

All that remains is to settle #10058 (comment), check #10058 (comment), and maybe wait for @johny-c to add his email (otherwise he could do it afterwards)? Other than that, I addressed all your comments, so it's ready @agramfort @GaelVaroquaux!

"""

# Authors: William de Vazelhes <wdevazelhes@gmail.com>
# John Chiotellis <ioannis.chiotellis@in.tum.de>
Contributor Author

@johny-c same remark here too

@bthirion bthirion closed this Feb 28, 2019
@bthirion bthirion reopened this Feb 28, 2019
Sprint Paris 2019 automation moved this from Needs review to To do Feb 28, 2019
@agramfort agramfort changed the title [MRG+1] Neighborhood Components Analysis [MRG+2] Neighborhood Components Analysis Feb 28, 2019
Contributor

@bellet bellet left a comment

LGTM too :-)

@GaelVaroquaux
Member

LGTM. +1 for merge once the CI has run.

@jnothman jnothman moved this from To do to Needs review in Sprint Paris 2019 Feb 28, 2019
@jnothman jnothman moved this from Needs review to Done in Sprint Paris 2019 Feb 28, 2019
@jnothman
Member

Let's do it

@jnothman jnothman merged commit d31b67f into scikit-learn:master Feb 28, 2019
@jnothman
Member

Congratulations @wdevazelhes!!!! This has been a long time coming...

@wdevazelhes
Contributor Author

That's great! Thanks a lot for your reviews and comments @jnothman @agramfort @GaelVaroquaux @bellet, and congrats to @johny-c too! I'm excited to work on improvements later on, and also to port the changes from this PR to LMNN so that it can be merged; or maybe we'll first see how NCA goes for a while and then decide what to do for LMNN.

@bellet
Contributor

bellet commented Feb 28, 2019

Congrats @wdevazelhes !

Indeed, core devs should decide whether they also want to include LMNN (#8602). I think it would be good as LMNN is also a popular method and there is quite a lot of code that could be shared with NCA.

@johny-c

johny-c commented Feb 28, 2019

Hi all! @wdevazelhes, congrats and thanks for including me as an author. Regarding LMNN, as far as I remember, what is missing is some testing. I will have a couple of free weeks in March, so I could have another look at it.

xhluca pushed commits to xhluca/scikit-learn that referenced this pull request Apr 28, 2019
koenvandevelde pushed a commit to koenvandevelde/scikit-learn that referenced this pull request Jul 12, 2019