[MRG+1] Add multiplicative-update solver in NMF, with all beta-divergence #5295

Merged: 35 commits merged into scikit-learn:master from TomDLT:nmf_mu, Dec 12, 2016

10 participants
@TomDLT (Member) commented Sep 22, 2015

This PR is the second part of what has been discussed in #4811 (first part, #4852, merged).
It includes:

  • a multiplicative-update solver in NMF. This solver is generally slower than the coordinate-descent solver, but it handles all beta-divergences (including Frobenius, generalized Kullback-Leibler and Itakura-Saito).
  • a plot to visualize the beta-divergence for several values of beta.
  • benchmarks (below).

Link to the rendered doc


[figure: plot_beta_divergence.py, the beta-divergence for several values of beta]
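For reference, here is a minimal usage sketch of what this PR adds (hypothetical toy data; solver='mu' and beta_loss are the parameters discussed in this thread, and init='random' is chosen just to keep the sketch simple):

import numpy as np
from sklearn.decomposition import NMF

rng = np.random.RandomState(0)
X = np.abs(rng.randn(100, 20))  # NMF requires non-negative input

# Multiplicative-update solver with a non-Frobenius beta-divergence.
# beta_loss can also be a float, e.g. beta_loss=0.5.
model = NMF(n_components=5, solver='mu', beta_loss='kullback-leibler',
            init='random', max_iter=500, random_state=0)
W = model.fit_transform(X)
H = model.components_
print(W.shape, H.shape)  # (100, 5) (5, 20)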

@mblondel (Member) commented Sep 23, 2015

I would favor using CD for beta divergences as well. We decided to remove the slow pg solver, and I would prefer not to add a new slow solver if possible. The KDD paper shows that for generalized KL we can just use the element-wise Newton update without line search. For other divergences this will need some verification (unless there are more recent works that cover this?), but in the worst case we can implement a simple line search. As a start, it would be nice to implement the CD update for generalized KL to compare.

@tttthomasssss (Contributor) commented Oct 20, 2015

Could anybody comment on when this is planned to be merged into master? I'm quite keen on using NMF with generalized KL divergence.

@agramfort (Member) commented Oct 21, 2015

@TomDLT (Member) commented Nov 10, 2015

Sorry for the long silence, I do not have as much time for this as before.
@mblondel I spent some time trying to implement coordinate descent for the generalized Kullback-Leibler divergence, as described in "Coordinate descent with greedy selection", at least in the dense case.
However, I did not manage to be faster than multiplicative update, which is not what the paper indicates, and probably underlines my own limits (or proves that my implementation of multiplicative update is very good :)). More seriously, even if we manage to be faster, the extension to all beta-divergences is not trivial and would require much more work.

I understand that multiplicative update is an old method and probably not the fastest, yet it is still a good way to improve the NMF in scikit-learn, extending it to a lot of different loss functions.

@TomDLT (Member) commented Nov 19, 2015

I compared my implementation of multiplicative update with #2540 (Frobenius norm) and #1348 (KL divergence).

Compared to #2540, the results are identical, provided some slight modifications.
To avoid division by zero, which do we prefer?

  1. A / (B + eps) (as in #2540)
  2. (A + eps) / (B + eps) (as in #1348)
  3. B[B == 0] = eps, then A / B
     I have a preference for the last one, since it does not affect the update of nonzero elements (see the sketch below). WDYT?

Compared to #1348, the results are different, since it forgets a division in the update
(see Eq. 5 here) and adds a normalization of H.
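A minimal NumPy sketch of option 3, just to make the comparison concrete (the eps value and the arrays are illustrative):

import numpy as np

eps = np.finfo(np.float32).eps
A = np.array([[1.0, 2.0], [3.0, 4.0]])
B = np.array([[0.0, 2.0], [4.0, 0.5]])

B_safe = B.copy()
B_safe[B_safe == 0] = eps   # only the zero entries are touched
ratio = A / B_safe          # nonzero entries are divided exactly as before
print(ratio)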

@TomDLT (Member) commented Jan 22, 2016

Some benchmarking (one figure per dataset):

  • 20_news: sparse (11314, 39116)
  • Faces: dense (400, 4096)
  • MNIST: dense (70000, 784)
  • RCV1: sparse (804414, 47236)

[benchmark figures for the four datasets]

The MU solver is not very fast, but it was implemented not for its speed but because it handles all beta-divergences. Note that the poor performance with 'nndsvd' initialization (top-left) is expected, since this initialization has a lot of zeros that cannot be modified by a multiplicative update.

@TomDLT changed the title from [WIP] Add multiplicative-update solver in NMF, with all beta-divergence to [MRG] Add multiplicative-update solver in NMF, with all beta-divergence Jan 22, 2016

@TomDLT added the Needs Review label Apr 26, 2016

@tguillemot (Contributor) commented Aug 18, 2016

@TomDLT @mblondel @agramfort Is this PR mergeable?
I will do a first round of review, but I don't know these methods. Can someone have a look with me?

benchmarks/bench_plot_nmf.py
-
- return timeset, err
+import matplotlib.pyplot as plt
+import pandas

@tguillemot (Contributor) commented Aug 18, 2016

Can we use pandas for benchmarks?

@TomDLT (Member) commented Aug 22, 2016

Using it in a benchmark does not make it a dependency of the package, just like matplotlib.

sklearn/decomposition/nmf.py
+ If beta == 1, this is the generalized Kullback-Leibler divergence
+ If beta == 0, this is the Itakura-Saito divergence
+ Else, this is the general beta-divergence.
+ """

@tguillemot (Contributor) commented Aug 18, 2016

Can you specify what X, W, H and beta are, with the classical docstring?

+ # np.sum(np.dot(W, H) ** beta)
+ sum_WH_beta = 0
+ for i in range(X.shape[1]):
+ sum_WH_beta += np.sum(fast_dot(W, H[:, i]) ** beta)

@tguillemot (Contributor) commented Aug 18, 2016

Is there a better option?

@TomDLT (Member) commented Aug 18, 2016

This is a temporary fix so as to be memory efficient in the sparse case
(we cannot compute np.dot(W, H) directly).

We could write Cython code for it, yet the use case is rather limited: sparse X and beta not in [0, 1, 2] (which are the most useful cases). So I propose to merge it as is and improve it in a separate PR.
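For context, a self-contained sketch of the column-wise computation shown in the quoted diff (np.dot stands in for fast_dot; this is not the PR's exact code):

import numpy as np

def sum_wh_beta(W, H, beta):
    # Accumulate np.sum(np.dot(W, H) ** beta) one column at a time, so the
    # full dense (n_samples, n_features) product WH is never materialized.
    total = 0.0
    for i in range(H.shape[1]):
        total += np.sum(np.dot(W, H[:, i]) ** beta)
    return total

rng = np.random.RandomState(0)
W = np.abs(rng.randn(50, 5))
H = np.abs(rng.randn(5, 30))
assert np.isclose(sum_wh_beta(W, H, 1.5), np.sum(np.dot(W, H) ** 1.5))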

@tguillemot (Contributor) commented Aug 18, 2016

Good for me :)

@agramfort (Member) commented Aug 20, 2016

@mblondel what's your take on this PR?

@mblondel (Member) commented Aug 20, 2016

> I spent some time trying to implement the coordinate descent for generalized Kullback-Leibler divergence, as described in Coordinate descent with greedy selection, at least in the dense case.

Thanks for the investigation.

> More seriously, even if we manage to be faster, the extension to all beta-divergences is not trivial and would require much more work.

Agreed.

> @mblondel what's your take on this PR?

+1 on my side, and sorry for the late reply. As a big fan of CD, I would be potentially interested in the CD code for GKL divergence, as it could be faster in the sparse case.

@agramfort (Member) commented Aug 20, 2016

@TomDLT (Member) commented Aug 22, 2016

> let's make it a WIP->MRG then

It is; all reviews will be greatly appreciated.

doc/modules/decomposition.rst
+However, NMF can also be used with a different function to measure the
+distance between X and the matrix product WH. Another typical distance
+function used in NMF is the (generalized) Kullback-Leibler (KL) divergence,
+also referred as I-divergence:

@tguillemot (Contributor) commented Sep 1, 2016

I would change the formulation a bit:
"Other distance functions can be used in NMF as, for example, the (generalized) Kullback-Leibler (KL) divergence, also referred to as I-divergence:

.. math::
    d_{KL}(X, Y) = \sum_{i,j} (X_{ij} * log(\frac{X_{ij}}{Y_{ij}}) - X_{ij} + Y_{ij})

or the Itakura-Saito (IS) divergence:"
See if you prefer it.
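As a small numeric companion to the formula above, here is a sketch of the generalized KL divergence for dense arrays (the eps guard against exact zeros is an assumption of this sketch, not part of the quoted doc):

import numpy as np

def generalized_kl(X, Y, eps=1e-12):
    # d_KL(X, Y) = sum_ij (X_ij * log(X_ij / Y_ij) - X_ij + Y_ij)
    X = np.maximum(X, eps)  # guard the log against exact zeros
    Y = np.maximum(Y, eps)
    return np.sum(X * np.log(X / Y) - X + Y)

rng = np.random.RandomState(0)
X = np.abs(rng.randn(10, 5))
Y = np.abs(rng.randn(10, 5))
print(generalized_kl(X, Y))   # > 0 in general
print(generalized_kl(X, X))   # 0 when Y == X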

doc/modules/decomposition.rst
+ d_{IS}(X, Y) = \sum_{i,j} (\frac{X_{ij}}{Y_{ij}} - log(\frac{X_{ij}}{Y_{ij}}) - 1)
+
+These three distances are special cases of the beta-divergence family, with
+:math:`\beta = 2, 1, 0` respectively [Fevotte, 2011]. The beta-divergence are

@tguillemot (Contributor) commented Sep 1, 2016

Maybe you can add a link to [Fevotte, 2011] (even if it's in the references section).

doc/modules/decomposition.rst
+also referred as I-divergence:
+
+.. math::
+ d_{KL}(X, Y) = \sum_{i,j} (X_{ij} * log(\frac{X_{ij}}{Y_{ij}}) - X_{ij} + Y_{ij})

@tguillemot (Contributor) commented Sep 1, 2016

Remove the multiplication symbol *.

@TomDLT (Member) commented Sep 1, 2016

Thanks for the comments @tguillemot, I addressed them.

Link to the rendered doc

doc/modules/decomposition.rst
@@ -686,7 +685,7 @@ faces dataset, in comparison with the PCA eigenfaces.
The :attr:`init` attribute determines the initialization method applied, which
has a great impact on the performance of the method. :class:`NMF` implements
-the method Nonnegative Double Singular Value Decomposition. NNDSVD is based on
+the method Nonnegative Double Singular Value Decomposition. NNDSVD [4]_ is based on

@ogrisel (Member) commented Sep 19, 2016

Maybe explain why the NNDSVD init should not be used with the multiplicative-update solver at the end of this paragraph.

@TomDLT (Member) commented Sep 20, 2016

Good point. Done.

+ raise ValueError(
+ 'Invalid beta_loss parameter: solver %r does not handle beta_loss'
+ ' = %r' % (solver, beta_loss))
+

@ogrisel (Member) commented Sep 19, 2016

Maybe issue a user warning when the mu solver is used in conjunction with the nndsvd init.

@TomDLT (Member) commented Sep 20, 2016

Done.
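A sketch of the kind of check being suggested above (the helper name and message wording are illustrative, not the PR's exact code):

import warnings

def check_init_solver(init, solver):
    # The 'mu' solver cannot move entries that are exactly zero, and
    # 'nndsvd' produces many zeros, hence the warning discussed above.
    if solver == 'mu' and init == 'nndsvd':
        warnings.warn("The multiplicative-update ('mu') solver cannot update "
                      "zeros present in the 'nndsvd' initialization; consider "
                      "init='nndsvda' or init='nndsvdar' instead.", UserWarning)

check_init_solver('nndsvd', 'mu')  # emits the UserWarning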

@ogrisel

How do you explain that MU behaves so badly with NNDSVDAR init on 20newsgroups? Does it change significantly if you change the random seed?

@TomDLT (Member) commented Sep 21, 2016

I relaunched the benchmark without changing anything and, plotting with the same scales, the poor behavior with NNDSVDAR init is gone; I don't know why:

Before: [20news benchmark figure]
After: [20news benchmark figure, rerun]

The three other datasets give the same results as before.

@ogrisel (Member) commented Oct 3, 2016

Good for 20 newsgroups.

With respect to #5295 (comment), I also prefer option 3.

@ogrisel

I have not yet finished my review, but here are some more comments. I also found the following case where mu seems to underperform:

import numpy as np
from sklearn.decomposition import NMF

rng = np.random.RandomState(42)
n_samples, n_features = 500, 50
k = 10
n_components = min(2 * k, n_features)

W_true = np.clip(rng.randn(n_samples, k), 0, np.inf)
H_true = np.clip(rng.randn(k, n_features), 0, np.inf)
X = np.dot(W_true, H_true)


def nmf_error(X, **model_params):
    nmf = NMF(**model_params)
    W = nmf.fit_transform(X)
    H = nmf.components_
    return np.linalg.norm(X - np.dot(W, H))


for solver in ['cd', 'mu', 'pg']:
    error = nmf_error(X, n_components=n_components, solver=solver,
                      random_state=0, init='nndsvdar', max_iter=500)
    print('solver=%s, error=%f' % (solver, error))

with output:

solver=cd, error=0.114302
solver=mu, error=3.146888
/volatile/ogrisel/code/scikit-learn/sklearn/decomposition/nmf.py:1217: DeprecationWarning: 'pg' solver will be removed in release 0.19. Use 'cd' solver instead.
  nls_max_iter,
/volatile/ogrisel/code/scikit-learn/sklearn/decomposition/nmf.py:496: UserWarning: Iteration limit reached in nls subproblem.
  warnings.warn("Iteration limit reached in nls subproblem.")
solver=pg, error=0.506917

Any idea why? I need to play more with the rng seed and problem size, but it might be a bug, no?

+defined by :
+
+.. math::
+ d_{\beta}(X, Y) = \sum_{i,j} \frac{1}{\beta(\beta - 1)}(X_{ij}^\beta + (\beta-1)Y_{ij}^\beta - \beta X_{ij} Y_{ij}^{\beta - 1})

@ogrisel (Member) commented Oct 3, 2016

Could you please insert the figure of the beta divergence loss function example after this formula?

@TomDLT (Member) commented Oct 3, 2016

Done.
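For readers who want to check the quoted formula numerically, here is a dense-case sketch of the general beta-divergence (valid for beta outside {0, 1}; this is not the PR's beta_divergence helper):

import numpy as np

def beta_div_dense(X, Y, beta):
    # d_beta(X, Y) = sum_ij (X_ij^beta + (beta - 1) Y_ij^beta
    #                        - beta X_ij Y_ij^(beta - 1)) / (beta (beta - 1))
    return np.sum(X ** beta + (beta - 1) * Y ** beta
                  - beta * X * Y ** (beta - 1)) / (beta * (beta - 1))

rng = np.random.RandomState(0)
X = np.abs(rng.randn(10, 5))
Y = np.abs(rng.randn(10, 5))
print(beta_div_dense(X, Y, beta=1.5))   # > 0 in general
print(beta_div_dense(X, X, beta=1.5))   # 0 when Y == X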

sklearn/decomposition/nmf.py
@@ -28,6 +29,8 @@
from ..exceptions import ConvergenceWarning
from .cdnmf_fast import _update_cdnmf_fast
+EPSILSON = 1e-9

@ogrisel (Member) commented Oct 3, 2016

I think np.finfo(np.float32).eps feels less arbitrary.

@TomDLT (Member) commented Oct 3, 2016

Done.

sklearn/decomposition/nmf.py
+
+
+def _compute_regularization(alpha, l1_ratio, regularization, n_samples,
+ n_features):

@ogrisel (Member) commented Oct 3, 2016

Why pass n_samples and n_features here?

@TomDLT (Member) commented Oct 3, 2016

This is related to #5296, but I will remove them in this PR.

+ else:
+ WH_safe_X_data **= beta_loss - 2
+ # element-wise multiplication
+ WH_safe_X_data *= X_data

@ogrisel (Member) commented Oct 3, 2016

I think this should probably be special-cased to:

elif beta_loss == 2:
    WH_safe_X_data[:] = X_data

@ogrisel (Member) commented Oct 3, 2016

Actually, in the special case where beta_loss == 2 we could even skip the call to _special_sparse_dot(W, H, X) and make this part of the update quite a bit faster, right?

@TomDLT (Member) commented Oct 3, 2016

I already separated the case beta_loss == 2 just above.

@ogrisel (Member) commented Oct 4, 2016

Indeed, sorry for the noise.
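For readers following along, a minimal dense-case sketch of the term being discussed in the diff above, i.e. the element-wise factor (WH)**(beta_loss - 2) * X with the beta_loss == 2 short-circuit (names mirror the diff, but this is not the PR's exact code):

import numpy as np

def numerator_factor(WH_data, X_data, beta_loss):
    # Element-wise (WH)**(beta_loss - 2) * X for dense 1-D data arrays.
    if beta_loss == 2:
        # (WH)**0 * X == X: no power or product needed
        return X_data.copy()
    out = WH_data ** (beta_loss - 2)
    out *= X_data  # element-wise multiplication, as in the diff above
    return out

WH = np.array([1.0, 2.0, 4.0])
X = np.array([3.0, 3.0, 3.0])
print(numerator_factor(WH, X, beta_loss=1))  # [3.0, 1.5, 0.75]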

sklearn/decomposition/nmf.py
+ # copy used in the Denominator
+ WH = WH_safe_X.copy()
+
+ # to avoid division by zero

@ogrisel (Member) commented Oct 3, 2016

or more generally taking a negative power of zero.

@TomDLT (Member) commented Oct 3, 2016

Done.

+ H *= delta_H
+
+ # These values will be recomputed since H changed
+ H_sum, HHt, XHt = None, None, None

@ogrisel (Member) commented Oct 3, 2016

Isn't there any cheap way to compute the fraction of change of the objective function to check for convergence? Iterating until max_iter seems very wasteful to me.

@TomDLT (Member) commented Oct 3, 2016

In #1348, the convergence check is prev_error - error < tol.
In #2540, the convergence check is abs(1. - max_update) < tol.

About computing the error: the cost is a bit high, but I need to test it again.

About checking the maximum update: the update indeed converges to 1, but the criterion is rather poor, if I recall correctly. After discussing with audio researchers in my group who use NMF a lot, a fixed number of iterations seems pretty classic.
Still, I can easily add this criterion if we consider it a better solution.

@TomDLT (Member) commented Oct 3, 2016

Computing the error at each iteration increases the computational time by:

  • {+50%, +50%} on the 20news dataset (sparse), with beta_loss in {2, 1}
  • {+100%, +230%} on the faces dataset (dense), with beta_loss in {2, 1}

(In particular, np.log(div) here is quite costly.)

@ogrisel (Member) commented Oct 4, 2016

Let's do it every 10 iterations then.

@TomDLT (Member) commented Oct 3, 2016

About your toy example, increasing the number of iterations solves the problem, yet I don't know why it is that slow.

@ogrisel (Member) commented Oct 4, 2016

> About your toy example, increasing the number of iterations solves the problem

Indeed. I thought I had tried that, but maybe I was in a slowly decreasing / plateau-like area of the convergence curve and had not tried to increase max_iter enough at once.

> yet I don't know why it is that slow.

That's weird, but apparently it does not seem to be so bad on more realistic data, as your benchmarks show.

@ogrisel (Member) commented Oct 4, 2016

> About computing the error: the cost is a bit high, but I need to test it again.

Then we could compute the cost every 10 or 100 iterations instead.

@ogrisel (Member) commented Oct 4, 2016

> About checking the maximum update: the update indeed converges to 1, but the criterion is rather poor, if I recall correctly.

I tried adding print statements to monitor the maximum absolute values in the delta multipliers, and it seems that at least for some components we can get into an oscillatory regime.

@ogrisel (Member) commented Oct 4, 2016

Also, ideally we would like the tol-based criterion to be approximately invariant to any re-scaling of X.

@TomDLT (Member) commented Oct 4, 2016

The convergence test is now done every 100 iterations, and is normalized:

if (previous_error - error) / error < tol:
    break
@ogrisel (Member) commented Oct 4, 2016

> if (previous_error - error) / error < tol:
>     break

I am afraid this can diverge on problems where the error can get very close to zero very suddenly. Maybe the following is safer:

# difference of errors normalized by the mean error across the last two
# iterations
if 2 * (previous_error - error) / (previous_error + error) < tol:
    break

or just:

if (previous_error - error) / previous_error < tol:
    break

if we have the guarantee that the error is monotonic.

sklearn/decomposition/nmf.py
+
+ if verbose:
+ end_time = time.time()
+ print("Epoch %02d reached after %.3f seconds; convergence reached." %

@ogrisel (Member) commented Oct 4, 2016

Please remove "; convergence reached" from this message.

sklearn/decomposition/nmf.py
+ if verbose:
+ epoch_time = time.time()
+ print("Epoch %02d reached after %.3f seconds." %
+ (epoch + 1, epoch_time - start_time))

@ogrisel (Member) commented Oct 4, 2016

Please include the value of the objective function in this message.

sklearn/decomposition/nmf.py
+ if tol > 0 and epoch % 100 == 0:
+ error = beta_divergence(X, W, H, beta_loss)
+ if (previous_error - error) / error < tol:
+ break

@ogrisel (Member) commented Oct 4, 2016

Actually this is too strict and never triggers on a toy problem I am testing on. I thought about storing the initial value of the error in a variable called error_at_init and then checking for:

if (previous_error - error) / error_at_init < tol:
    break

but then it's hard to define a good value for tol.

@TomDLT (Member) commented Oct 4, 2016

Good point. I updated it with your last suggestion, as the error is indeed monotonic:

if (previous_error - error) / previous_error < tol:
    break

I also updated the tests to avoid or ignore ConvergenceWarning.

@ogrisel (Member) commented Oct 4, 2016

Here is what I tried; I am not sure this is the best stopping-criterion normalization though:

diff --git a/sklearn/decomposition/nmf.py b/sklearn/decomposition/nmf.py
index 47b934d..8e1d1d3 100644
--- a/sklearn/decomposition/nmf.py
+++ b/sklearn/decomposition/nmf.py
@@ -974,10 +974,10 @@ def _fit_multiplicative_update(X, W, H, beta_loss='frobenius',
     else:
         gamma = 1.

+    start_time = time.time()
     # used for the convergence criterion
-    previous_error = np.inf
+    error_at_init = previous_error = beta_divergence(X, W, H, beta_loss)

-    start_time = time.time()
     H_sum, HHt, XHt = None, None, None
     for epoch in range(max_iter):
         # update W
@@ -999,14 +999,14 @@ def _fit_multiplicative_update(X, W, H, beta_loss='frobenius',
         # test convergence criterion every 100 iterations
         if tol > 0 and epoch % 100 == 0:
             error = beta_divergence(X, W, H, beta_loss)
-            if (previous_error - error) / error < tol:
+            if (previous_error - error) / error_at_init < tol:
                 break
             previous_error = error

             if verbose:
                 epoch_time = time.time()
-                print("Epoch %02d reached after %.3f seconds." %
-                      (epoch + 1, epoch_time - start_time))
+                print("Epoch %02d reached after %.3f seconds, error: %f" %
+                      (epoch + 1, epoch_time - start_time, error))

     if verbose:
         end_time = time.time()
@ogrisel (Member) commented Oct 4, 2016

In my experiments, (previous_error - error) / previous_error < tol can also be too strict.

@ogrisel (Member) commented Oct 4, 2016

> I also updated the tests to avoid or ignore ConvergenceWarning.

We should also make sure that the tol / criterion we use by default makes all examples run without issuing a ConvergenceWarning.

@TomDLT (Member) commented Oct 4, 2016

I updated the normalization with error_at_init, which indeed seems like a good idea (a sketch of the resulting check is shown below).

I did not update the examples, as they all use the default solver (solver='cd').
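A compact, self-contained sketch of the stopping rule that was settled on, using a Frobenius stand-in for the PR's beta_divergence helper; the multiplicative updates themselves are omitted, so this only illustrates the criterion:

import numpy as np

def frobenius_error(X, W, H):
    # stand-in for beta_divergence(X, W, H, beta_loss=2)
    return 0.5 * np.linalg.norm(X - np.dot(W, H)) ** 2

def converged(previous_error, error, error_at_init, tol):
    # decrease since the previous check, normalized by the error at init
    return (previous_error - error) / error_at_init < tol

rng = np.random.RandomState(0)
X = np.abs(rng.randn(30, 20))
W, H = np.abs(rng.randn(30, 5)), np.abs(rng.randn(5, 20))

error_at_init = previous_error = frobenius_error(X, W, H)
# ... the multiplicative updates of W and H would run here; every 100
# epochs the solver recomputes the error and checks:
error = frobenius_error(X, W, H)
print(converged(previous_error, error, error_at_init, tol=1e-4))
# prints True here only because W and H were not actually updated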

@TomDLT (Member) commented Oct 26, 2016

Done.

@amueller (Member) commented Oct 26, 2016

Thanks. Hm, so @mblondel, I saw a +1 from you up there, but that wasn't a full review, was it? @agramfort, did you review? I see a +1 from @ogrisel.

@agramfort (Member) commented Oct 27, 2016
@VictorBst commented Nov 29, 2016

Hi, @TomDLT and @agramfort invited me to have a look at the code.

I mostly looked at the math aspect of the code and it seems to be correct. After quick experiments on my data, everything seems to work as intended and gives results similar to other simple NMF Python codes I tested. As a side note, this one was always either equal or the fastest in my tests, especially for higher dimensions.

@amueller (Member) commented Nov 29, 2016

Hm, I'm not sure if this should be merged before #7927; there are some conflicts, I think... opinions, @TomDLT?

@TomDLT (Member) commented Nov 29, 2016

OK, let's wait for #7927.

@ogrisel (Member) commented Dec 12, 2016

@TomDLT now that #7927 has been merged, this needs a rebase.

TomDLT added some commits Dec 12, 2016

@TomDLT (Member) commented Dec 12, 2016

Done.

@ogrisel merged commit ae4f710 into scikit-learn:master Dec 12, 2016

2 checks passed:

  • continuous-integration/appveyor/pr: AppVeyor build succeeded
  • continuous-integration/travis-ci/pr: The Travis CI build passed
@ogrisel (Member) commented Dec 12, 2016

Thanks @TomDLT (and @VictorBst for the review). I squash-merged. 🍻

@TomDLT (Member) commented Dec 12, 2016

🍾 Cool! Thanks a lot for the reviews! 🍾

@tguillemot (Contributor) commented Dec 12, 2016

Yeah!!!

@amueller (Member) commented Dec 12, 2016

Awesome! Congrats!

YHYbetterlife added a commit to YHYbetterlife/scikit-learn that referenced this pull request Dec 12, 2016

adis300 added a commit to adis300/scikit-learn that referenced this pull request Dec 13, 2016

@TomDLT deleted the TomDLT:nmf_mu branch Dec 20, 2016

sergeyf added a commit to sergeyf/scikit-learn that referenced this pull request Feb 28, 2017

@Przemo10 referenced this pull request Mar 17, 2017

Closed

update fork (#1) #8606

Sundrique added a commit to Sundrique/scikit-learn that referenced this pull request Jun 14, 2017

NelleV added a commit to NelleV/scikit-learn that referenced this pull request Aug 11, 2017

paulha added a commit to paulha/scikit-learn that referenced this pull request Aug 19, 2017

maskani-moh added a commit to maskani-moh/scikit-learn that referenced this pull request Nov 15, 2017
