[WIP] FIX Projected Gradient NMF stopping condition #2557

Closed
vene wants to merge 1 commit into scikit-learn:master from vene:pgnmffix

5 participants

@vene
scikit-learn member

I caught a bug in the PG NMF stopping condition. All in all, this solver doesn't seem bad at all, but I need to understand it properly.

@vene
scikit-learn member

Benchmarks against Lars's PR.
[benchmark plots: text dataset and faces dataset]

@coveralls


Coverage remained the same when pulling dadc2fc on vene:pgnmffix into 30eb78d on scikit-learn:master.

@mblondel mblondel commented on the diff Oct 30, 2013
sklearn/decomposition/nmf.py
for n_iter in range(1, self.max_iter + 1):
- # stopping condition
- # as discussed in paper
- proj_norm = norm(np.r_[gradW[np.logical_or(gradW < 0, W > 0)],
- gradH[np.logical_or(gradH < 0, H > 0)]])
+ # stopping condition on the norm of the projected gradient
+ proj_norm = (
+ ((gradW * np.logical_or(gradW < 0, W > 0)) ** 2).sum() +
+ ((gradH * np.logical_or(gradH < 0, H > 0)) ** 2).sum())
@mblondel
scikit-learn member

I find the code a bit hard to follow. I would rather define variables proj_grad_W and proj_grad_H and compute the Frobenius norms of those.
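
For illustration, a rough sketch of that refactoring (the shapes and random data here are assumed only for the example; inside fit_transform these arrays already exist):

import numpy as np

# Dummy data just for illustration.
rng = np.random.RandomState(0)
W, H = np.abs(rng.randn(6, 3)), np.abs(rng.randn(3, 8))
gradW, gradH = rng.randn(6, 3), rng.randn(3, 8)

# Keep a gradient entry where the variable is strictly positive or the
# gradient is negative (a feasible descent direction); zero it elsewhere.
proj_grad_W = gradW * np.logical_or(gradW < 0, W > 0)
proj_grad_H = gradH * np.logical_or(gradH < 0, H > 0)

# Same quantity as in the diff: the sum of the squared Frobenius norms.
proj_norm = np.linalg.norm(proj_grad_W) ** 2 + np.linalg.norm(proj_grad_H) ** 2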

@mblondel mblondel commented on the diff Oct 30, 2013
sklearn/decomposition/nmf.py
tolH = tolW
- tol = self.tol * init_grad
+ tol = init_grad * self.tol ** 2
@mblondel
scikit-learn member

I would rather not change the tol. I think the following stopping criterion would be easier to follow:
if current_norm / init_norm < tol: break

This is intuitive because the sum of the projected gradient norms is supposed to become smaller and smaller relative to init_norm.
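
As a self-contained sketch of that criterion on a toy nonnegative least squares problem (not the scikit-learn code, just an illustration):

import numpy as np

rng = np.random.RandomState(0)
X = np.abs(rng.randn(50, 10))
y = X.dot(np.abs(rng.randn(10)))

# Projected gradient descent for min_{w >= 0} 0.5 * ||X w - y||^2,
# stopped with the relative criterion suggested above.
w = np.zeros(10)
tol, max_iter = 1e-4, 1000
step = 1.0 / np.linalg.norm(X, 2) ** 2            # 1 / Lipschitz constant

grad = X.T.dot(X.dot(w) - y)
init_norm = np.linalg.norm(grad[np.logical_or(grad < 0, w > 0)])
for n_iter in range(1, max_iter + 1):
    w = np.maximum(w - step * grad, 0)            # projected gradient step
    grad = X.T.dot(X.dot(w) - y)
    current_norm = np.linalg.norm(grad[np.logical_or(grad < 0, w > 0)])
    if current_norm / init_norm < tol:            # relative stopping rule
        break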

@larsmans larsmans commented on the diff Nov 28, 2013
sklearn/decomposition/nmf.py
@@ -499,17 +499,18 @@ def fit_transform(self, X, y=None):
- safe_sparse_dot(X, H.T, dense_output=True))
gradH = (np.dot(np.dot(W.T, W), H)
- safe_sparse_dot(W.T, X, dense_output=True))
- init_grad = norm(np.r_[gradW, gradH.T])
- tolW = max(0.001, self.tol) * init_grad # why max?
+ init_grad = (gradW ** 2).sum() + (gradH ** 2).sum()
@larsmans
scikit-learn member

(X ** 2).sum() is very memory-intensive and slow. I'd do it this way:

def sqnorm(x):
    x = x.ravel()
    return np.dot(x, x)

then redefine norm as sqrt(sqnorm(x)). This way, the memory overhead is constant instead of linear and BLAS's ddot can be used to compute the squared norm in a single pass.
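
For instance, a quick check (purely illustrative) that the helper matches the expanded expression while avoiding the temporary array:

import numpy as np

def sqnorm(x):
    x = x.ravel()
    return np.dot(x, x)       # BLAS ddot: one pass, no temporary array

def norm(x):
    return np.sqrt(sqnorm(x))

gradW = np.random.RandomState(42).randn(500, 50)
# (gradW ** 2).sum() first materialises a 500 x 50 temporary array;
# sqnorm gives the same value without it.
assert np.isclose((gradW ** 2).sum(), sqnorm(gradW))
assert np.isclose(np.linalg.norm(gradW), norm(gradW))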

@GaelVaroquaux
scikit-learn member
@larsmans
scikit-learn member

Or x.ravel(order='K'), "to view the elements in the order they occur in memory, except for reversing the data when strides are negative" (docs).

X.squeeze() has the same guarantee of never copying, but it leaves negative strides alone, which may impact BLAS performance. It's a tad faster, though.

@larsmans
scikit-learn member

Btw., we have a different definition of norm in sklearn.utils.extmath which uses the BLAS nrm2 function...

@GaelVaroquaux
scikit-learn member
@larsmans
scikit-learn member

No, there's no copy. When the strides are inappropriate for BLAS, dot (the vector dot version) backs off to a slower routine.

@vene
scikit-learn member
vene added a note Nov 28, 2013

Btw., we have a different definition of norm in sklearn.utils.extmath which uses the BLAS nrm2 function...

If I remember correctly we discussed it and it proved slower for computing the Frobenius norm.
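
For reference, a tiny timing sketch comparing the two approaches (BLAS nrm2 obtained through scipy versus ddot on the raveled array); the array size is made up and this is only an illustration of the comparison:

import timeit
import numpy as np
from scipy.linalg import get_blas_funcs

X = np.random.RandomState(0).randn(2000, 500)
nrm2, = get_blas_funcs(('nrm2',), (X.ravel(),))

frobenius_nrm2 = lambda a: nrm2(a.ravel())                        # nrm2-based
frobenius_ddot = lambda a: np.sqrt(np.dot(a.ravel(), a.ravel()))  # ddot-based

print(timeit.timeit(lambda: frobenius_nrm2(X), number=100))
print(timeit.timeit(lambda: frobenius_ddot(X), number=100))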

@larsmans
scikit-learn member

Hm. Maybe we should write more comments :)

@larsmans larsmans commented on the diff Nov 28, 2013
sklearn/decomposition/nmf.py
for n_iter in range(1, self.max_iter + 1):
- # stopping condition
- # as discussed in paper
- proj_norm = norm(np.r_[gradW[np.logical_or(gradW < 0, W > 0)],
- gradH[np.logical_or(gradH < 0, H > 0)]])
+ # stopping condition on the norm of the projected gradient
+ proj_norm = (
+ ((gradW * np.logical_or(gradW < 0, W > 0)) ** 2).sum() +
+ ((gradH * np.logical_or(gradH < 0, H > 0)) ** 2).sum())
@larsmans
scikit-learn member

Same here. In fact the code would get more readable with a sqnorm helper.
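
A rough sketch of how that could look (hypothetical helper names, not the committed code):

import numpy as np

def sqnorm(x):
    x = x.ravel()
    return np.dot(x, x)

def projected_gradient_sqnorm(grad, var):
    # Zero the gradient where the nonnegativity bound is active and the
    # gradient points out of the feasible region, then square-norm it.
    return sqnorm(grad * np.logical_or(grad < 0, var > 0))

# With such a helper the stopping quantity in fit_transform would read:
#     proj_norm = (projected_gradient_sqnorm(gradW, W)
#                  + projected_gradient_sqnorm(gradH, H))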

@vene
scikit-learn member

The version of the solver in the non-negative least squares benchmark notebook has diverged a little from this one. Most importantly, the regularization is done more properly there (but that makes the objective function and the parametrization different). Should I bring it into this PR?

@ogrisel
scikit-learn member

Could you summarize how it will impact the parametrization? Do you think there is a way to maintain backward compat with a deprecation cycle?

@vene
scikit-learn member

The default result of the estimator would change anyway in many cases because of the change in stopping conditions.

Should I just put the refactored solver in this PR to discuss it? Make a separate PR?

@ogrisel
scikit-learn member

I am fine with bringing it into this PR to make it easier to see the changes.

@amueller
scikit-learn member

Should this be closed in light of #4852?

@agramfort
scikit-learn member

Yes, closing.

@agramfort agramfort closed this Sep 22, 2015