# [MRG+1] Add multiplicative-update solver in NMF, with all beta-divergence #5295

Merged
merged 35 commits into from Dec 12, 2016
+1,092 −328

## Conversation

Member

### TomDLT commented Sep 22, 2015 (edited Oct 26, 2016)

This PR is the second part of what was discussed in #4811 (first part, #4852, merged). It includes:

- a Multiplicative-Update solver in NMF. This solver is generally slower than the Coordinate-Descent solver, but it handles all beta-divergences (including Frobenius, generalized Kullback-Leibler and Itakura-Saito);
- a plot to visualize the beta-divergence for several values of beta;
- benchmarks below.

Link to the rendered doc: plot_beta_divergence.py
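As a quick illustration of the loss family this PR adds (not the actual code from the PR), here is a minimal NumPy sketch of the beta-divergence and its special cases; the `beta_divergence` helper name and the random test arrays are made up for this example:

```python
import numpy as np

def beta_divergence(X, Y, beta):
    """Beta-divergence between two positive dense arrays (illustrative sketch)."""
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    if beta == 2:   # Frobenius (squared Euclidean) norm
        return 0.5 * np.sum((X - Y) ** 2)
    if beta == 1:   # generalized Kullback-Leibler divergence
        return np.sum(X * np.log(X / Y) - X + Y)
    if beta == 0:   # Itakura-Saito divergence
        return np.sum(X / Y - np.log(X / Y) - 1)
    # general beta-divergence
    return np.sum((X ** beta + (beta - 1) * Y ** beta
                   - beta * X * Y ** (beta - 1)) / (beta * (beta - 1)))

rng = np.random.RandomState(0)
X = rng.rand(4, 3) + 0.1  # strictly positive toy data
Y = rng.rand(4, 3) + 0.1
print(beta_divergence(X, X, 1))  # 0.0: divergence of X from itself
```

All three special cases vanish when `X == Y`, and the general formula tends to the KL case as beta approaches 1.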


Member

### mblondel commented Sep 23, 2015

I would favor using CD for beta-divergences as well. We decided to remove the slow pg solver, and I would prefer not to add a new slow solver if possible. The KDD paper shows that for generalized KL, we can just use the element-wise Newton update without line search. For other divergences, this will need some verification (unless there are more recent works that cover this?), but in the worst case we can implement a simple line search. As a start, it would be nice to implement the CD update for generalized KL to compare.
Contributor

### tttthomasssss commented Oct 20, 2015

 Could anybody comment on when this is planned to be merged into master? I'm quite keen on using NMF with generalized KL divergence.
Member

### agramfort commented Oct 21, 2015

@mblondel maybe we can do better than multiplicative updates, but I think we should be pragmatic here. We can deprecate it later when we have something better. WDYT?


Member

### TomDLT commented Nov 10, 2015 (edited Apr 25, 2016)

Sorry for the long silence, I do not have as much time for this as before. @mblondel I spent some time trying to implement the coordinate descent for the generalized Kullback-Leibler divergence, as described in *Coordinate descent with greedy selection*, at least in the dense case. However, I did not manage to be faster than multiplicative update, which is not what the paper indicates, and probably underlines my own limits (or proves that my implementation of multiplicative update is very good :)). More seriously, even if we manage to be faster, the extension to all beta-divergences is not trivial and would require much more work. I understand that multiplicative update is an old method and probably not the fastest, yet it is still a good way to improve the NMF in scikit-learn, extending it to a lot of different loss functions.
Member

### TomDLT commented Nov 19, 2015 (edited Apr 25, 2016)

I compared my implementation of multiplicative update with #2540 (Frobenius norm) and #1348 (KL divergence). Compared to #2540, the results are identical, provided some slight modifications. To avoid division by zero, which do we prefer?

1. `A / (B + eps)` (as in #2540)
2. `(A + eps) / (B + eps)` (as in #1348)
3. `B[B == 0] = eps` then `A / B`

I have a preference for the last one, since it does not affect updates with nonzero elements. WDYT?

Compared to #1348, the results are different since it forgets a division in the update (see Eq. 5 here), and adds a normalization of H.
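To make the three options concrete, a tiny NumPy sketch (the arrays `A` and `B` are made up, and `eps` is taken as `np.finfo(np.float32).eps`, the value discussed later in the thread):

```python
import numpy as np

eps = np.finfo(np.float32).eps
A = np.array([1.0, 2.0, 0.0])
B = np.array([2.0, 0.0, 0.0])  # contains zeros that would cause division by zero

# Option 1 (as in #2540): shift only the denominator
opt1 = A / (B + eps)

# Option 2 (as in #1348): shift both numerator and denominator
opt2 = (A + eps) / (B + eps)

# Option 3 (preferred above): replace only the zero denominators,
# so entries with a nonzero denominator are left exactly untouched
B3 = B.copy()
B3[B3 == 0] = eps
opt3 = A / B3

print(opt1[0], opt3[0])  # option 3 keeps 1/2 exactly equal to 0.5
```

Only option 3 leaves the well-defined divisions bit-for-bit unchanged, which is the argument made above.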
Member

### TomDLT commented Jan 22, 2016 (edited Sep 20, 2016)

Some benchmarking:

- 20_news - sparse (11314, 39116)
- Faces - dense (400, 4096)
- MNIST - dense (70000, 784)
- RCV1 - sparse (804414, 47236)

The MU solver is not very fast, but it was implemented not for its speed but because it handles all beta-divergences. Note that the poor performance with 'nndsvd' initialization (top-left) is expected, since this initialization has a lot of zeros that cannot be modified by a multiplicative update.
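The remark about 'nndsvd' zeros can be seen in a two-line experiment (illustrative only; the actual MU numerators and denominators depend on X, W, H and beta):

```python
import numpy as np

rng = np.random.RandomState(0)
W = rng.rand(4, 2)
W[0, 0] = 0.0  # a zero as introduced by an 'nndsvd'-style initialization

# Generic multiplicative update W <- W * (numerator / denominator):
# whatever the (positive) factors are, an entry that starts at zero stays at zero.
for _ in range(100):
    factor = rng.rand(4, 2) + 0.5  # stand-in for the MU ratio
    W *= factor

print(W[0, 0])  # exactly 0.0
```

Because each step multiplies the current value, a zero entry can never be revived, which is why NNDSVD-style initializations interact badly with the MU solver.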

Contributor

### tguillemot commented Aug 18, 2016

@TomDLT @mblondel @agramfort Is this PR mergeable? I will do a first round of review, but I don't know these methods. Can someone have a look with me?

### tguillemot reviewed Aug 18, 2016

    -    return timeset, err
    +import matplotlib.pyplot as plt
    +import pandas

#### tguillemot Aug 18, 2016

Contributor

Can we use pandas for benchmarks?

#### TomDLT Aug 22, 2016

Member

Using it in a benchmark does not make it a dependency of the package, just like matplotlib.

### tguillemot reviewed Aug 18, 2016

    +    If beta == 1, this is the generalized Kullback-Leibler divergence
    +    If beta == 0, this is the Itakura-Saito divergence
    +    Else, this is the general beta-divergence.
    +    """

#### tguillemot Aug 18, 2016 (edited)

Contributor

Can you specify what X, W, H, and beta are, with the classical docstring format?

### tguillemot reviewed Aug 18, 2016

    +    # np.sum(np.dot(W, H) ** beta)
    +    sum_WH_beta = 0
    +    for i in range(X.shape[1]):
    +        sum_WH_beta += np.sum(fast_dot(W, H[:, i]) ** beta)

#### tguillemot Aug 18, 2016

Contributor

Is there a better option ?

#### TomDLT Aug 18, 2016 (edited)

Member

This is a temporary fix so as to be memory efficient in the sparse case.
(We cannot compute directly np.dot(W, H)).

We could write Cython code for it, yet the use case is rather limited: sparse X, and beta not in [0, 1, 2] (which are the most useful cases). So I propose to merge it as it is, and improve it in a separate PR.
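For the dense case, the equivalence of the column-by-column loop with the naive computation can be checked directly (a sketch using `np.dot` in place of the `fast_dot` helper from the diff; the shapes and beta value are made up):

```python
import numpy as np

rng = np.random.RandomState(0)
W = rng.rand(200, 5)
H = rng.rand(5, 300)
beta = 1.5

# Naive version: materializes the full (200, 300) dense product W @ H.
naive = np.sum(np.dot(W, H) ** beta)

# Column-by-column version: only one (200,) vector is alive at a time,
# which is what keeps the sparse-X case memory efficient.
sum_WH_beta = 0.0
for i in range(H.shape[1]):
    sum_WH_beta += np.sum(np.dot(W, H[:, i]) ** beta)

print(np.allclose(naive, sum_WH_beta))  # True
```

The loop trades a little speed for memory, which matches the comment above about it being a temporary fix.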

Contributor

Good for me :)

Member

### agramfort commented Aug 20, 2016

 @mblondel what's your take on this PR?
Member

### mblondel commented Aug 20, 2016

> I spent some time trying to implement the coordinate descent for generalized Kullback-Leibler divergence, as described in *Coordinate descent with greedy selection*, at least in the dense case.

Thanks for the investigation.

> More seriously, even if we manage to be faster, the extension to all beta-divergences is not trivial and would require much more work.

Agreed.

> @mblondel what's your take on this PR?

+1 on my side, and sorry for the late reply. As a big fan of CD, I would potentially be interested in the CD code for the GKL divergence, as it could be faster in the sparse case.
Member

### agramfort commented Aug 20, 2016

 cool. Looks like we can proceed then :) let's make it a WIP->MRG then
Member

### TomDLT commented Aug 22, 2016 (edited Sep 1, 2016)

> let's make it a WIP->MRG then

It is; all reviews will be greatly appreciated.

### tguillemot reviewed Sep 1, 2016

    +However, NMF can also be used with a different function to measure the
    +distance between X and the matrix product WH. Another typical distance
    +function used in NMF is the (generalized) Kullback-Leibler (KL) divergence,
    +also referred as I-divergence:

#### tguillemot Sep 1, 2016

Contributor

I would change the formulation a bit:

"Other distance functions can be used in NMF as, for example, the (generalized) Kullback-Leibler (KL) divergence, also referred to as I-divergence:

    .. math::
        d_{KL}(X, Y) = \sum_{i,j} (X_{ij} * log(\frac{X_{ij}}{Y_{ij}}) - X_{ij} + Y_{ij})

Or, the Itakura-Saito (IS) divergence:"

See if you prefer it.

### tguillemot reviewed Sep 1, 2016

    +    d_{IS}(X, Y) = \sum_{i,j} (\frac{X_{ij}}{Y_{ij}} - log(\frac{X_{ij}}{Y_{ij}}) - 1)
    +
    +These three distances are special cases of the beta-divergence family, with
    +:math:`\beta = 2, 1, 0` respectively [Fevotte, 2011]. The beta-divergence are

#### tguillemot Sep 1, 2016

Contributor

Maybe you can add a link to [Fevotte, 2011] (even if it's in the references section).

### tguillemot reviewed Sep 1, 2016

    +also referred as I-divergence:
    +
    +.. math::
    +    d_{KL}(X, Y) = \sum_{i,j} (X_{ij} * log(\frac{X_{ij}}{Y_{ij}}) - X_{ij} + Y_{ij})

#### tguillemot Sep 1, 2016

Contributor

Remove the multiplication symbol *

Member

### ogrisel requested changes Sep 19, 2016

    @@ -686,7 +685,7 @@ faces dataset, in comparison with the PCA eigenfaces.
     The :attr:`init` attribute determines the initialization method applied, which
     has a great impact on the performance of the method. :class:`NMF` implements
    -the method Nonnegative Double Singular Value Decomposition. NNDSVD is based on
    +the method Nonnegative Double Singular Value Decomposition. NNDSVD [4]_ is based on

#### ogrisel Sep 19, 2016

Member

Maybe explain why the NNDSVD init should not be used with the multiplicative-updates solver at the end of this paragraph.

#### TomDLT Sep 20, 2016 (edited)

Member

Good point. Done

    +        raise ValueError(
    +            'Invalid beta_loss parameter: solver %r does not handle beta_loss'
    +            ' = %r' % (solver, beta_loss))
    +

#### ogrisel Sep 19, 2016

Member

Maybe issue a user warning when the mu solver is used in conjunction with the nndsvd init.

Member

Done

### ogrisel reviewed Sep 19, 2016

How do you explain that MU behaves so badly with NNDSVDAR init on 20newsgroups? Does it change significantly if you change the random seed?

Member

### TomDLT commented Sep 21, 2016 (edited)

I relaunched the benchmark without changing anything and, plotting with the same scales, the poor behavior with NNDSVDAR init is gone; I don't know why. (Before / after figures.) The three other datasets give the same results as before.
Member

### ogrisel commented Oct 3, 2016

 Good for 20 newsgroups. With respect to #5295 (comment) I also prefer #3.

### ogrisel requested changes Oct 3, 2016

I have not yet finished my review, but here are some more comments. Also, I found the following case where mu seems to underperform:

    import numpy as np
    from sklearn.decomposition import NMF

    rng = np.random.RandomState(42)
    n_samples, n_features = 500, 50
    k = 10
    n_components = min(2 * k, n_features)

    W_true = np.clip(rng.randn(n_samples, k), 0, np.inf)
    H_true = np.clip(rng.randn(k, n_features), 0, np.inf)
    X = np.dot(W_true, H_true)

    def nmf_error(X, **model_params):
        nmf = NMF(**model_params)
        W = nmf.fit_transform(X)
        H = nmf.components_
        return np.linalg.norm(X - np.dot(W, H))

    for solver in ['cd', 'mu', 'pg']:
        error = nmf_error(X, n_components=n_components, solver=solver,
                          random_state=0, init='nndsvdar', max_iter=500)
        print('solver=%s, error=%f' % (solver, error))

with output:

    solver=cd, error=0.114302
    solver=mu, error=3.146888
    /volatile/ogrisel/code/scikit-learn/sklearn/decomposition/nmf.py:1217: DeprecationWarning: 'pg' solver will be removed in release 0.19. Use 'cd' solver instead.
      nls_max_iter,
    /volatile/ogrisel/code/scikit-learn/sklearn/decomposition/nmf.py:496: UserWarning: Iteration limit reached in nls subproblem.
      warnings.warn("Iteration limit reached in nls subproblem.")
    solver=pg, error=0.506917


Any idea why? I need to play more with the rng seed and problem size, but it might be a bug, no?

    +defined by :
    +
    +.. math::
    +    d_{\beta}(X, Y) = \sum_{i,j} \frac{1}{\beta(\beta - 1)}(X_{ij}^\beta + (\beta-1)Y_{ij}^\beta - \beta X_{ij} Y_{ij}^{\beta - 1})

#### ogrisel Oct 3, 2016 (edited)

Member

Could you please insert the figure of the beta divergence loss function example after this formula?

#### TomDLT Oct 3, 2016

Member

Done

    @@ -28,6 +29,8 @@
     from ..exceptions import ConvergenceWarning
     from .cdnmf_fast import _update_cdnmf_fast
    +
    +EPSILSON = 1e-9

#### ogrisel Oct 3, 2016

Member

I think `np.finfo(np.float32).eps` feels less arbitrary.

#### TomDLT Oct 3, 2016

Member

Done

 + + +def _compute_regularization(alpha, l1_ratio, regularization, n_samples, + n_features):

#### ogrisel Oct 3, 2016

Member

why passing n_samples and n_features here?

#### TomDLT Oct 3, 2016

Member

This is related to #5296, yet I will remove them in this PR.

    +    else:
    +        WH_safe_X_data **= beta_loss - 2
    +        # element-wise multiplication
    +        WH_safe_X_data *= X_data

#### ogrisel Oct 3, 2016 (edited)

Member

I think this should probably be special-cased to:

    elif beta_loss == 2:
        WH_safe_X_data[:] = X_data

#### ogrisel Oct 3, 2016

Member

Actually, in the special case where beta_loss == 2 we could even skip the call to _special_sparse_dot(W, H, X) and make this part of the update quite a bit faster, right?

#### TomDLT Oct 3, 2016

Member

I already separated the case beta_loss == 2 just above.

#### ogrisel Oct 4, 2016

Member

Indeed, sorry for the noise.

    +    # copy used in the Denominator
    +    WH = WH_safe_X.copy()
    +
    +    # to avoid division by zero

#### ogrisel Oct 3, 2016

Member

or more generally taking a negative power of zero.

#### TomDLT Oct 3, 2016

Member

Done

    +    H *= delta_H
    +
    +    # These values will be recomputed since H changed
    +    H_sum, HHt, XHt = None, None, None

#### ogrisel Oct 3, 2016

Member

Isn't there any cheap way to compute the fraction of change of the objective function to check for convergence? Iterating until max_iter seems very wasteful to me.

#### TomDLT Oct 3, 2016

Member

In #1348, the convergence check is `prev_error - error < tol`.
In #2540, the convergence check is `abs(1. - max_update) < tol`.

About computing the error, the cost is a bit high, yet I need to test it again.

About checking the maximum update, the update indeed converges to 1, but the criterion is rather poor if I recall correctly. After discussing with audio researchers in my group who use NMF a lot, a fixed number of iterations seems pretty classic.
Yet I can easily add this criterion if we consider it a better solution.

#### TomDLT Oct 3, 2016 (edited)

Member

Computing the error at each iteration increases the computational time by:

• {+50%, +50%} on 20news dataset (sparse), with beta loss in {2, 1}
• {+100%, +230%} on faces dataset (dense), with beta loss in {2, 1}

(in particular np.log(div) here is quite costly)

#### ogrisel Oct 4, 2016

Member

Let's do it every 10 iterations then.

Member

### TomDLT commented Oct 3, 2016

About your toy example, increasing the number of iterations solves the problem, yet I don't know why it is that slow.
Member

### ogrisel commented Oct 4, 2016

> About your toy example, increasing the number of iterations solves the problem

Indeed, I thought I had tried that, but maybe I was in a slowly-decreasing / plateau-like area of the convergence curve and had not tried to increase max_iter enough at once.

> yet I don't know why it is that slow.

That's weird, but apparently it does not seem to be so bad on more realistic data, as your benchmarks show.
Member

### ogrisel commented Oct 4, 2016 (edited)

> About computing the error, the cost is a bit high, yet I need to test it again.

Then we could compute the cost every 10 or 100 iterations instead.
Member

### ogrisel commented Oct 4, 2016

> About checking the maximum update, the update indeed converges to 1, but the criterion is rather poor if I recall correctly.

I tried to add print statements to monitor the maximum absolute values in the delta multipliers, and it seems that at least for some components we can get into an oscillatory regime.
Member

### ogrisel commented Oct 4, 2016 (edited)

Also, ideally we would like the tol-based criterion to be approximately invariant to any rescaling of X.
Member

### TomDLT commented Oct 4, 2016

The convergence test is now done every 100 iterations, and is normalized:

    if (previous_error - error) / error < tol:
        break
Member

### ogrisel commented Oct 4, 2016 (edited)

> if (previous_error - error) / error < tol: break

I am afraid this can diverge on problems where the error can get very close to zero very suddenly. Maybe the following is safer:

    # difference of errors normalized by mean error across the last two
    # iterations.
    if 2 * (previous_error - error) / (previous_error + error) < tol:
        break

or just:

    if (previous_error - error) / previous_error < tol:
        break

if we have the guarantee that the error is monotonic.
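A toy comparison of the stopping criteria under discussion (the `converged_*` helper names and the error values are made up for illustration; `converged_at_init` corresponds to the error_at_init normalization also tried in this thread):

```python
def converged_relative(prev, err, tol):
    # Normalize by the current error: the ratio stays large when err is tiny.
    return (prev - err) / err < tol

def converged_symmetric(prev, err, tol):
    # Normalize by the mean error across the last two iterations.
    return 2 * (prev - err) / (prev + err) < tol

def converged_monotonic(prev, err, tol):
    # Normalize by the previous error (assumes a monotonic decrease).
    return (prev - err) / prev < tol

def converged_at_init(prev, err, err_at_init, tol):
    # Normalize by the error at initialization.
    return (prev - err) / err_at_init < tol

tol = 1e-4
prev, err = 2e-12, 1e-12  # error already negligible in absolute terms
print(converged_relative(prev, err, tol))      # False: error still halves each check
print(converged_monotonic(prev, err, tol))     # False for the same reason
print(converged_at_init(prev, err, 1.0, tol))  # True: progress is tiny vs. the start
```

This illustrates the concern: the per-step relative criteria never fire when the error keeps shrinking by a constant factor, even once it is negligibly small, whereas normalizing by the initial error does.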

### ogrisel reviewed Oct 4, 2016

    +
    +    if verbose:
    +        end_time = time.time()
    +        print("Epoch %02d reached after %.3f seconds; convergence reached." %

#### ogrisel Oct 4, 2016

Member

Please remove "; convergence reached" from this message.

    +        if verbose:
    +            epoch_time = time.time()
    +            print("Epoch %02d reached after %.3f seconds." %
    +                  (epoch + 1, epoch_time - start_time))

#### ogrisel Oct 4, 2016

Member

Please include the value of the objective function in this message.

    +        if tol > 0 and epoch % 100 == 0:
    +            error = beta_divergence(X, W, H, beta_loss)
    +            if (previous_error - error) / error < tol:
    +                break

#### ogrisel Oct 4, 2016 (edited)

Member

Actually, this is too strict and never works on some toy problem I am testing on. I thought about storing the initial value of the error in a variable called error_at_init and then checking for:

    if (previous_error - error) / error_at_init < tol:
        break

but then it's hard to define a good value for tol.

Member

### TomDLT commented Oct 4, 2016

Good point. I updated with your last suggestion, as the error is indeed monotonic:

    if (previous_error - error) / previous_error < tol:
        break

I also updated the tests to avoid or ignore ConvergenceWarning.
Member

### ogrisel commented Oct 4, 2016 (edited)

Here is what I tried; not sure this is the best stopping-criterion normalization though:

    diff --git a/sklearn/decomposition/nmf.py b/sklearn/decomposition/nmf.py
    index 47b934d..8e1d1d3 100644
    --- a/sklearn/decomposition/nmf.py
    +++ b/sklearn/decomposition/nmf.py
    @@ -974,10 +974,10 @@ def _fit_multiplicative_update(X, W, H, beta_loss='frobenius',
         else:
             gamma = 1.

    +    start_time = time.time()
         # used for the convergence criterion
    -    previous_error = np.inf
    +    error_at_init = previous_error = beta_divergence(X, W, H, beta_loss)

    -    start_time = time.time()
         H_sum, HHt, XHt = None, None, None
         for epoch in range(max_iter):
             # update W
    @@ -999,14 +999,14 @@ def _fit_multiplicative_update(X, W, H, beta_loss='frobenius',
             # test convergence criterion every 100 iterations
             if tol > 0 and epoch % 100 == 0:
                 error = beta_divergence(X, W, H, beta_loss)
    -            if (previous_error - error) / error < tol:
    +            if (previous_error - error) / error_at_init < tol:
                     break
                 previous_error = error

             if verbose:
                 epoch_time = time.time()
    -            print("Epoch %02d reached after %.3f seconds." %
    -                  (epoch + 1, epoch_time - start_time))
    +            print("Epoch %02d reached after %.3f seconds, error: %f" %
    +                  (epoch + 1, epoch_time - start_time, error))

         if verbose:
             end_time = time.time()
Member

### ogrisel commented Oct 4, 2016

In my experiments, `(previous_error - error) / previous_error < tol` can also be too strict.
Member

### ogrisel commented Oct 4, 2016

> I also updated the tests to avoid or ignore ConvergenceWarning.

We should also make sure that the tol / criterion we use by default makes all examples run without issuing a ConvergenceWarning.
Member

### TomDLT commented Oct 4, 2016

I updated the normalization with error_at_init, which indeed seems a good idea. I did not update the examples, as they all use the default solver (solver='cd').

### TomDLT added some commits Oct 4, 2016

 fix init docstring 
 3b18a45 
 typo and improve test decreasing 
 6dff7c9 
 remove unused private function _safe_compute_error 
 6b20e30 
 make beta_divergence function private 
 0ee3cbf 
 Remove deprecated ProjectedGradientNMF 
 f428d20 
 remove warning in test 
 dd4d6b5 
 update doc 
 31f2c0c 
 FIX raise an error when beta_loss <= 0 and X contains zeros 
 44779a1 
 TYPO epsilson -> epsilon 
 4713b1c 
 remove other occurences of ProjectedGradientNMF 
 0d9fb50 
 add whats_new.rst entry 
 42de807 
 minor leftovers 
 2af4a23 
 non-ascii and nitpick 
 259d827 
 safe_min instead of min 
 057c70c 
Member

 Done
Member

### amueller commented Oct 26, 2016

 Thanks. Hm so @mblondel I saw a +1 from you up there, but that wasn't a full review, was it? @agramfort did you review? I see +1 from @ogrisel
Member

### agramfort commented Oct 27, 2016

 no I did not. No time :(

### VictorBst commented Nov 29, 2016

Hi, @TomDLT and @agramfort invited me to have a look at the code. I mostly looked at the math aspect of the code and it seems to be correct. After quick experiments on my data, everything seems to work as intended and gives similar results to other simple NMF Python codes I tested. As a side note, this one was always either equal or the fastest in my tests, especially for higher dimensions.
 solve conflict with master 
 a9ac84a 
Member

### amueller commented Nov 29, 2016

hm, I'm not sure if this should be merged before #7927; there are some conflicts, I think... opinions @TomDLT ?
Member

### TomDLT commented Nov 29, 2016

 ok let's wait for #7927
Member

### ogrisel commented Dec 12, 2016

 @TomDLT now that #7927 has been merged, this needs a rebase.

### TomDLT added some commits Dec 12, 2016

 Merge branch 'master' into nmf_mu 
 5e017c6 
 minor doc update 
 928ea89 
Member

 Done

### ogrisel merged commit ae4f710 into scikit-learn:master Dec 12, 2016 2 checks passed

#### 2 checks passed

continuous-integration/appveyor/pr AppVeyor build succeeded
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
Member

### ogrisel commented Dec 12, 2016

 Thanks @TomDLT (and @VictorBst for the review). I squash-merged. 🍻
Member

### TomDLT commented Dec 12, 2016

 🍾 Cool ! Thanks a lot for the reviews ! 🍾
Contributor

### tguillemot commented Dec 12, 2016

 Yeah !!!
Member

### amueller commented Dec 12, 2016

 Awesome! Congrats!

### YHYbetterlife added a commit to YHYbetterlife/scikit-learn that referenced this pull request Dec 12, 2016

 Revert "[MRG+1] Add multiplicative-update solver in NMF, with all bet… 
…a-divergence (#5295)"

This reverts commit ae4f710.
 e0be9a4 

 [MRG+1] Add multiplicative-update solver in NMF, with all beta-diverg… 
…ence (#5295)
 70c6cd5 

### sergeyf added a commit to sergeyf/scikit-learn that referenced this pull request Feb 28, 2017

 [MRG+1] Add multiplicative-update solver in NMF, with all beta-diverg… 
…ence (#5295)
 6ff493e 


### Sundrique added a commit to Sundrique/scikit-learn that referenced this pull request Jun 14, 2017

 [MRG+1] Add multiplicative-update solver in NMF, with all beta-diverg… 
…ence (#5295)
 d52e939 

### NelleV added a commit to NelleV/scikit-learn that referenced this pull request Aug 11, 2017

 [MRG+1] Add multiplicative-update solver in NMF, with all beta-diverg… 
…ence (#5295)
 8e6b103 

### paulha added a commit to paulha/scikit-learn that referenced this pull request Aug 19, 2017

 [MRG+1] Add multiplicative-update solver in NMF, with all beta-diverg… 
…ence (#5295)
 1e20981 

 [MRG+1] Add multiplicative-update solver in NMF, with all beta-diverg… 
…ence (#5295)
 fb39155