
[MRG+2] Neighborhood Components Analysis #10058

Merged
merged 89 commits into scikit-learn:master on Feb 28, 2019

Conversation

@wdevazelhes
Contributor

wdevazelhes commented Nov 2, 2017

Hi, this PR is an implementation of the Neighborhood Components Analysis algorithm (NCA), a popular supervised distance metric learning algorithm. Like LMNN (cf. PR #8602), this algorithm takes a labeled dataset as input, instead of the similar/dissimilar pairs used by most metric learning algorithms, and learns a linear transformation of the space. However, NCA and LMNN have different objective functions: NCA tries to maximise the probability that every sample is correctly classified under a stochastic nearest neighbors rule, and therefore does not need to fix a set of target neighbors in advance.
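For reference, the objective described above can be written as follows (standard NCA notation, added here for clarity, not quoted from the PR's documentation):

$$
p_{ij} = \frac{\exp(-\lVert L x_i - L x_j \rVert^2)}{\sum_{k \neq i} \exp(-\lVert L x_i - L x_k \rVert^2)}, \qquad p_{ii} = 0, \qquad
\max_{L} \; \sum_i p_i \quad \text{where} \quad p_i = \sum_{j:\, y_j = y_i} p_{ij}
$$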

There have been several previous attempts to implement NCA (two PRs: #5276 (closed) and #4789 (not closed)). I created a fresh PR for the sake of clarity. This code is intended to be as similar to LMNN as possible, which should make it easy to factor out the parts of the code shared by both algorithms.

At the time of writing, this algorithm uses scipy's L-BFGS-B solver to solve the optimisation problem, like LMNN. This has the big advantage of avoiding the need to tune a learning rate parameter.
I benchmarked this implementation against the one in the metric-learn package (https://github.com/all-umass/metric-learn): the one in this PR has the advantage of scaling to large datasets (metric-learn's NCA raises a MemoryError on datasets as large as faces or digits), with no significant loss in performance on small datasets.
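For illustration, solving the problem with L-BFGS-B amounts to something like the following (a minimal sketch, not the actual code of this PR; nca_loss_grad is a hypothetical function returning the loss and its gradient with respect to the raveled transformation):

import numpy as np
from scipy.optimize import minimize

# Hypothetical sketch: optimise the raveled linear transformation with L-BFGS-B.
# nca_loss_grad(flat_transformation, X, y) is assumed to return (loss, gradient),
# with the gradient raveled to match the parameter vector.
n_features = X.shape[1]
transformation_init = np.eye(n_features).ravel()  # identity initialisation

result = minimize(fun=nca_loss_grad, x0=transformation_init, args=(X, y),
                  jac=True,  # fun returns the gradient as well
                  method='L-BFGS-B', options={'maxiter': 50})
transformation = result.x.reshape(-1, n_features)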

The remaining tasks are the following:

  • More detailed benchmark of performance against reference implementations
    (for instance metric-learn's) (coming soon)
  • Add an example
  • Benchmark the algorithm accuracy on several datasets
  • Documentation

What is more, some improvements could also be made at a later stage:

  • Make algorithmic improvements (such as ignoring samples whose softmax weight is nearly zero)
  • Add the possibility to choose another solver such as SGD to scale to large
    datasets (probably in another PR?)
  • Add rules of thumb to choose the solver etc. by default (idem?)
  • Make it possible to pass sparse matrices as input?
  • Add support for multilabel classification, which is a case where NCA could be really useful (cf. discussion with @bellet and @GaelVaroquaux). Metric learning is a good fit for these problems: for instance, NCA would fit a single matrix to satisfy the multilabel constraints, which is an advantage over one-vs-rest algorithms (it can do better when labels are correlated, for instance).

Feedback is welcome!

@wdevazelhes

Contributor Author

wdevazelhes commented Nov 2, 2017

I benchmarked this implementation against the one in the metric-learn package (https://github.com/all-umass/metric-learn): the one in this PR has the advantage of scaling to large datasets (metric-learn's NCA raises a MemoryError on datasets as large as faces or digits), with no significant loss in performance on small datasets.

Here is a snippet that shows it (on my machine, which has 7.7 GB of memory):

from sklearn.datasets import load_digits
from metric_learn import NCA
from sklearn.neighbors import NeighborhoodComponentsAnalysis
from sklearn.utils.testing import assert_raises
digits = load_digits()
X, y = digits.data, digits.target
nca_ml = NCA()
assert_raises(MemoryError, nca_ml.fit, X, y)
nca_sk = NeighborhoodComponentsAnalysis()
nca_sk.fit(X, y)  # does not raise any error
Do not precompute pairwise differences
Indeed, they do not add a significant speedup but have a high memory cost.
transformation = transformation.reshape(-1, X.shape[1])
loss = 0
gradient = np.zeros(transformation.shape)
X_embedded = transformation.dot(X.T).T


@johny-c

johny-c Nov 11, 2017

With np.dot(X, transformation.T) you save a transpose.


@wdevazelhes

wdevazelhes Nov 15, 2017

Author Contributor

True, I will change it.

@johny-c


johny-c commented Nov 11, 2017

Hi @wdevazelhes, since the code is very similar to LMNN, I took the liberty of adding just a couple of comments; by no means a full review. @jnothman, I hope I didn't break any contribution guidelines.

diff_embedded[~ci, :]))
p_i = np.sum(p_i_j) # probability of x_i to be correctly
# classified
gradient += 2 * (p_i * (sum_ci.T + sum_not_ci.T) - sum_ci.T)


@johny-c

johny-c Nov 11, 2017

You could just add the matrices without transposing, and transpose the gradient once before returning the unraveled gradient.


@johny-c

johny-c Nov 11, 2017

I remembered I had seen a more efficient computation of the gradient (see last equation in slide 12 here: http://bengio.abracadoudou.com/lce/slides/roweis.pdf ).
This would amount to:

p_i_j = exp_dist_embedded[ci]
p_i_k = exp_dist_embedded[~ci]
diffs = X[i, :] - X
diff_ci = diffs[ci, :]
diff_not_ci = diffs[~ci, :]
sum_ci = diff_ci.T.dot(p_i_j[:, np.newaxis] * diff_ci)
sum_not_ci = diff_not_ci.T.dot(p_i_k[:, np.newaxis] * diff_not_ci)
p_i = np.sum(p_i_j)
gradient += p_i * sum_not_ci + (1 - p_i) * sum_ci

And multiplying by the transformation after the for-loop:
gradient = 2*np.dot(transformation, gradient)


@wdevazelhes

wdevazelhes Nov 15, 2017

Author Contributor

You could just add the matrices without transposing, and transpose the gradient once before returning the unraveled gradient.

True, I will change it.

I remembered I had seen a more efficient computation of the gradient (see last equation in slide 12 here: http://bengio.abracadoudou.com/lce/slides/roweis.pdf ).
This would amount to:
[...]
And multiplying by the transformation after the for-loop:
gradient = 2*np.dot(transformation, gradient)

Since I already computed the embedded differences for the softmax before, wouldn't it be more efficient to reuse them?
I agree that the expression with (1 - p_i) is clearer, though.

@codecov


codecov bot commented Nov 15, 2017

Codecov Report

Merging #10058 into master will increase coverage by 0.01%.
The diff coverage is 99.65%.


@@            Coverage Diff             @@
##           master   #10058      +/-   ##
==========================================
+ Coverage   96.19%   96.21%   +0.01%     
==========================================
  Files         336      338       +2     
  Lines       62740    63025     +285     
==========================================
+ Hits        60354    60638     +284     
- Misses       2386     2387       +1
Impacted Files Coverage Δ
sklearn/neighbors/__init__.py 100% <100%> (ø) ⬆️
sklearn/neighbors/tests/test_nca.py 100% <100%> (ø)
sklearn/neighbors/nca.py 99.29% <99.29%> (ø)
sklearn/decomposition/tests/test_pca.py 100% <0%> (ø) ⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@wdevazelhes

Contributor Author

wdevazelhes commented Nov 15, 2017

Thanks for your comments @johny-c ! I have modified the code according to them.

@wdevazelhes

Contributor Author

wdevazelhes commented Nov 15, 2017

I also added an example: plot_nca_dim_reduction.py (in commit 12cf3a9)

@wdevazelhes

Contributor Author

wdevazelhes commented Nov 15, 2017

I benchmarked this PR's implementation of NCA against metric-learn's: I plotted the training curves (objective function vs. time), for the same initialisation (identity), on the breast_cancer dataset, on my machine which has 7.7 GB of memory and 4 cores.
Since metric-learn's optimisation is a variant of stochastic gradient descent, it needs a learning rate to be tuned (contrary to this PR's implementation), so I plotted the curves for several learning rates.

At some points, metric-learn's NCA training is interrupted prematurely: this is due to numerical instabilities, and this warning is thrown:

RuntimeWarning: invalid value encountered in true_divide
  softmax /= softmax.sum()

[Figures: training curves (objective function vs. time) for this PR's NCA and metric-learn's NCA, for several learning rates]

# Plot the embedding and show the evaluation score
plt.scatter(X_embedded[:, 0], X_embedded[:, 1], c=y)
plt.title("{}, KNN (k={})".format(name, n_neighbors))
plt.text(0.9, 0.1, '{:.2f}'.format(acc_knn), size=15,


@bellet

bellet Nov 16, 2017

Contributor

The accuracy is a bit misplaced when running the example on my laptop. It is probably easier to put "Test accuracy = x" in the title (after a line break).
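For instance, something along these lines (a hypothetical sketch of the suggested change, reusing the variables from the snippet above):

plt.title("{}, KNN (k={})\nTest accuracy = {:.2f}".format(name, n_neighbors, acc_knn))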


@wdevazelhes

wdevazelhes Nov 22, 2017

Author Contributor

Done

import time
from scipy.misc import logsumexp
from scipy.optimize import minimize
from sklearn.preprocessing import OneHotEncoder


@agramfort

agramfort Nov 17, 2017

Member

use relative import
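For example (a sketch, assuming the module lives at sklearn/neighbors/nca.py):

# relative import within the scikit-learn package, instead of the absolute one
from ..preprocessing import OneHotEncoder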


@wdevazelhes

wdevazelhes Nov 22, 2017

Author Contributor

Done

pca:
``n_features_out`` many principal components of the inputs passed
to :meth:`fit` will be used to initialize the transformation.


@agramfort

agramfort Nov 17, 2017

Member

all params are not indented the same way


@wdevazelhes

wdevazelhes Nov 22, 2017

Author Contributor

This is because pca, identity, random and numpy array are not parameters themselves but possible choices for the init parameter. I took the syntax from LMNN. Should I write it in another way?
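For reference, the layout in question is the definition-list style below, where each possible choice is an indented term under the init parameter (a rough sketch; only the pca entry is quoted from the actual docstring, the other entries are elided):

init : string or numpy array
    Initialization of the linear transformation. Possible options are
    'pca', 'identity', 'random' and a numpy array of shape
    (n_features_out, n_features).

    pca:
        ``n_features_out`` many principal components of the inputs passed
        to :meth:`fit` will be used to initialize the transformation.

    identity:
        ...

    random:
        ...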

----------
transformation : array, shape (n_features_out, n_features)
The linear transformation on which to compute loss and evaluate
gradient


@agramfort

agramfort Nov 17, 2017

Member

insert empty line


@wdevazelhes

wdevazelhes Nov 22, 2017

Author Contributor

Done

X_embedded = np.dot(X, transformation.T)

# for every sample x_i, compute its contribution to loss and gradient
for i in range(X.shape[0]):


@agramfort

agramfort Nov 17, 2017

Member

looping over samples seems a problem. Is it possible to vectorize?


@bellet

bellet Nov 20, 2017

Contributor

It should be possible, but one should then try to avoid an O(n_samples * n_samples * n_features) memory complexity.
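For what it is worth, one possible way to vectorize the loss while keeping memory at O(n_samples * n_samples) rather than O(n_samples * n_samples * n_features) is to work with pairwise squared distances computed via the Gram trick. A minimal sketch (illustrative only, not the code of this PR; same_class_mask is a hypothetical precomputed boolean matrix with True where y[i] == y[j]):

import numpy as np
from scipy.special import logsumexp

def nca_loss_vectorized(transformation, X, same_class_mask):
    # transformation: array of shape (n_components, n_features)
    # X: array of shape (n_samples, n_features)
    # same_class_mask: bool array of shape (n_samples, n_samples)
    X_embedded = np.dot(X, transformation.T)
    # pairwise squared Euclidean distances via the Gram trick: O(n_samples**2) memory
    sq_norms = np.einsum('ij,ij->i', X_embedded, X_embedded)
    dist = (sq_norms[:, np.newaxis] + sq_norms[np.newaxis, :]
            - 2 * np.dot(X_embedded, X_embedded.T))
    np.fill_diagonal(dist, np.inf)  # enforce p_ii = 0
    # row-wise softmax of the negative distances, computed in log space for stability
    log_p_ij = -dist - logsumexp(-dist, axis=1, keepdims=True)
    p_ij = np.exp(log_p_ij)
    p = np.sum(p_ij * same_class_mask, axis=1)  # p_i for every sample
    return -np.sum(p)  # negative expected number of correctly classified samples

The gradient can be accumulated in the same spirit from the weighted differences, at the cost of either keeping the loop over samples or allocating a few extra n_samples-by-n_samples temporaries.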

@agramfort

Member

agramfort commented Nov 20, 2017

@GaelVaroquaux GaelVaroquaux requested a review from agramfort Feb 27, 2019

@GaelVaroquaux

Member

GaelVaroquaux commented Feb 27, 2019

This looks ready to merge to me. @agramfort: can you check a second time?

Parameters
----------
transformation : array, shape(n_components, n_features)


@agramfort

agramfort Feb 27, 2019

Member

is it raveled here or not?


@wdevazelhes

wdevazelhes Feb 28, 2019

Author Contributor

Absolutely, it should be raveled. Good catch, I'll change it.

@@ -0,0 +1,511 @@
import pytest


@agramfort

agramfort Feb 27, 2019

Member

I would add here a header with the license and authors at the top of the file, like we do in other places.


@wdevazelhes

wdevazelhes Feb 28, 2019

Author Contributor

That's right, it's missing, but did you mean in the main file nca.py? I looked at two or three test files and they didn't seem to have an author/licence section.

@agramfort

Member

agramfort commented Feb 27, 2019

ok did one (last?) round.

more tomorrow!

@agramfort

Member

agramfort commented Feb 28, 2019

@@ -3,7 +3,9 @@
Neighborhood Component Analysis
"""

# License: BSD 3 Clause
# Authors: William de Vazelhes <wdevazelhes@gmail.com>
# John Chiotellis <ioannis.chiotellis@in.tum.de>


@wdevazelhes

wdevazelhes Feb 28, 2019

Author Contributor

@johny-c I put the email I found on your webpage, but feel free to put another one here ;)

@wdevazelhes

Contributor Author

wdevazelhes commented Feb 28, 2019

It just remains to settle on #10058 (comment), to check #10058 (comment), and maybe to wait for @johny-c to add his email (otherwise he could do it afterwards)? Apart from that, I addressed all your comments, so it's ready @agramfort @GaelVaroquaux!

"""

# Authors: William de Vazelhes <wdevazelhes@gmail.com>
# John Chiotellis <ioannis.chiotellis@in.tum.de>


@wdevazelhes

wdevazelhes Feb 28, 2019

Author Contributor

@johny-c same remark here too

@bthirion bthirion closed this Feb 28, 2019

@bthirion bthirion reopened this Feb 28, 2019

Sprint Paris 2019 automation moved this from Needs review to To do Feb 28, 2019

@agramfort agramfort changed the title [MRG+1] Neighborhood Components Analysis [MRG+2] Neighborhood Components Analysis Feb 28, 2019

@bellet

bellet approved these changes Feb 28, 2019

Contributor

bellet left a comment

LGTM too :-)

@GaelVaroquaux

Member

GaelVaroquaux commented Feb 28, 2019

LGTM. +1 for merge once the CI has run.

@jnothman jnothman moved this from To do to Needs review in Sprint Paris 2019 Feb 28, 2019

@jnothman jnothman moved this from Needs review to Done in Sprint Paris 2019 Feb 28, 2019

@jnothman

Member

jnothman commented Feb 28, 2019

Let's do it

@jnothman jnothman merged commit d31b67f into scikit-learn:master Feb 28, 2019

2 of 7 checks passed

LGTM analysis: Python - Running analyses for revisions
ci/circleci: doc - Your tests are queued behind your running builds
ci/circleci: doc-min-dependencies - Your tests are queued behind your running builds
ci/circleci: lint - Your tests are queued behind your running builds
continuous-integration/travis-ci/pr - The Travis CI build is in progress
LGTM analysis: C/C++ - No code changes detected
LGTM analysis: JavaScript - No code changes detected
@jnothman

Member

jnothman commented Feb 28, 2019

Congratulations @wdevazelhes!!!! This has been a long time coming...

@wdevazelhes

Contributor Author

wdevazelhes commented Feb 28, 2019

That's great! Thanks a lot for your reviews and comments @jnothman @agramfort @GaelVaroquaux @bellet, and congrats to @johny-c too! I'm excited to work on improvements later on, and also to port the changes from this PR to LMNN so that it can be merged; or maybe we'll first see how NCA fares for a while and then decide what to do for LMNN.

@bellet

Contributor

bellet commented Feb 28, 2019

Congrats @wdevazelhes !

Indeed, core devs should decide whether they also want to include LMNN (#8602). I think it would be good as LMNN is also a popular method and there is quite a lot of code that could be shared with NCA.

@johny-c


johny-c commented Feb 28, 2019

Hi all! @wdevazelhes, congrats and thanks for including me as an author. Regarding LMNN, as far as I remember, what is missing is some testing. I will have a couple of free weeks in March, so I could have another look at it.

Kiku-git added a commit to Kiku-git/scikit-learn that referenced this pull request Mar 4, 2019
