Speedup MultiTaskLasso #17021
Conversation
fix W[ii, 0]
Makes sense since n_tasks is usually small. BLAS libraries try to adjust their threading scheme according to the sizes of the matrices involved but I think there's still a lot of work to do in that direction.
sklearn/linear_model/_cd_fast.pyx
Outdated
# if np.sum(W[:, ii] ** 2) != 0.0: # can do better
if _nrm2(n_tasks, W_ptr + ii * n_tasks, 1) != 0.0:
if (W[0, ii] != 0.): # faster than testing full col norm
It's not equivalent.
- before: equivalent to testing whether all of W[:, ii] == 0. If so, no need to compute the dot product. Makes sense.
- after: only checks whether W[0, ii] == 0.
I think you're missing some computations here.
you're right. It's something we haven't cleaned up.
it's actually on purpose: it's unlikely that w_ii[0] == 0 while some values in w_ii[1:] are non-zero, so it should do just as well and avoids computations
What really needs to be avoided is the computation of the norm of W[:, ii]. I think a safe and not too heavy way to do the check is just a loop over W[:, ii] (as in https://github.com/mathurinm/celer/blob/master/celer/multitask_fast.pyx#L360)
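For illustration, here is a rough Python analogue of that loop-based check (a sketch only, not the celer Cython code; the helper name is made up):

import numpy as np

def col_is_nonzero(W, ii):
    # Early-exit scan of W[:, ii]: stops at the first non-zero entry,
    # never computes a norm, and is exact (no rounding involved).
    for k in range(W.shape[0]):
        if W[k, ii] != 0.0:
            return True
    return False

W = np.zeros((3, 5), order="F")
W[1, 2] = 0.5
print(col_is_nonzero(W, 0))  # False: the residual update can be skipped
print(col_is_nonzero(W, 2))  # True, even though W[0, 2] == 0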
it's unlikely

Unlikely is not the same as never: if it can happen that only w_ii[0] == 0 while the rest of the column is non-zero, the computation would be wrong. I like @mathurinm's idea.
implemented
I changed the doctest because it was a pathological case with a duplicated feature. It failed because of numerical errors (1e-16 instead of an exact 0).
The test uses an alpha that is greater than alpha_max:
So you can (and should, up to numerical errors) get a duality gap of 0 even with 0 iterations. I suggest changing tol to -1 in the test? Or using a smaller alpha.
What is weird is why we did not have this issue before, as the new code should do literally the same thing up to numerical errors. Any idea?
The current test is wrong in master and passes only because of numerical errors: we fit with alpha > alpha_max so we should get a duality gap of 0, and no convergence warning. In master, we get a gap of 1e-13 so a warning is raised. My belief is that in the PR there is no numerical error and we get exactly 0.
A benchmark to confirm the speed gains: code at https://gist.github.com/agramfort/ca54cc1bc12a37d7a426a7799cc236ce. You can see that the speed gains are not minor.
ready for review @jeremiedbb et al :) maybe we can even squeeze this in before the release ...
sklearn/linear_model/_cd_fast.pyx
Outdated
floating[::1, :] X,
floating[::1, :] Y,
We can't squeeze that into the release. We need to have cython 3.0 as min version.
@@ -881,7 +881,7 @@ def test_convergence_warnings():
     # check that the model fails to converge
     with pytest.warns(ConvergenceWarning):
-        MultiTaskElasticNet(max_iter=1, tol=0).fit(X, y)
+        MultiTaskElasticNet(alpha=0.001, max_iter=1, tol=0).fit(X, y)
why is changing alpha necessary here? It converges with only 1 iteration with the default alpha?
the default alpha is greater than alpha_max for this pair (X, y). So a duality gap of 0 (up to numerical errors) is reached in 0 iterations.
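For illustration, a sketch of that claim (the alpha_max formula below follows sklearn's scaling as I understand it; the data mirrors the reproduction snippet further down):

import numpy as np

random_state = np.random.RandomState(0)
X = random_state.standard_normal((1000, 500))
Y = random_state.standard_normal((1000, 3))

# For the multi-task elastic net, W = 0 is already optimal as soon as
# alpha * l1_ratio >= max_j ||X[:, j].T @ Y||_2 / n_samples, i.e.
# alpha >= alpha_max = max_j ||X[:, j].T @ Y||_2 / (n_samples * l1_ratio).
l1_ratio = 0.5  # MultiTaskElasticNet default
alpha_max = np.sqrt(np.sum((X.T @ Y) ** 2, axis=1)).max() / (X.shape[0] * l1_ratio)
print(alpha_max)  # well below the default alpha=1.0, so 0 iterations suffice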
I must be misunderstanding something, but the test does not fail on master. How can it fail with this PR?
It is tricky.
The test should fail on master. It passes only because of numerical errors (a duality gap of 1e-13 instead of 0 is returned).
I believe it fails on the PR because a duality gap of 0 is returned (which is the correct value).
Run:
import numpy as np
from sklearn.linear_model import MultiTaskElasticNet
random_state = np.random.RandomState(0)
X = random_state.standard_normal((1000, 500))
y = random_state.standard_normal((1000, 3))
clf = MultiTaskElasticNet(max_iter=1, tol=0).fit(X, y)
print(clf.dual_gap_) # should be 0, is 1e-13
print(np.linalg.norm(clf.coef_))  # we have converged: the initialization is already the solution because alpha >= alpha_max
Thanks for the clarification. tol=0 is rarely a good idea because it's subject to numerical precision issues. We can change the tol or the alpha. But in either case could you add a comment to explain that?
IMO it is more logical to use tol=-1 (e.g., the value of alpha we have used could still be >= alpha_max; it is not).
Where should I put the comment?
just above this line, when creating the model instance.
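For illustration, a self-contained version of the updated test with such a comment (a sketch; the merged test may differ, and the data sizes just mirror the snippet above):

import numpy as np
import pytest
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import MultiTaskElasticNet

random_state = np.random.RandomState(0)
X = random_state.standard_normal((1000, 500))
y = random_state.standard_normal((1000, 3))

with pytest.warns(ConvergenceWarning):
    # The default alpha is above alpha_max for this (X, y), so 0 iterations
    # already reach a duality gap of ~0 and no warning would be raised;
    # use a small alpha so that max_iter=1 genuinely fails to converge.
    MultiTaskElasticNet(alpha=0.001, max_iter=1, tol=0).fit(X, y)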
I'm happy with the current state of this PR but as I said in a comment, we can't merge the switch to memoryviews until we bump our Cython min version to 3.0 (which is still in alpha). So we can either wait a little bit, which means it will be for the next release, or put back the ndarrays and we can merge it for this release. I'm fine with both, your call.
Putting back the ndarrays only means changing the signature of the cython function, we could still use
sklearn/linear_model/_cd_fast.pyx
Outdated
@@ -622,8 +622,8 @@ def enet_coordinate_descent_gram(floating[::1] w,
def enet_coordinate_descent_multi_task(floating[::1, :] W, floating l1_reg,
@jeremiedbb I already see the syntax floating[::1, :] W in master, did I miss something?
We can use memoryviews when we are sure that the array has been created in scikit-learn because then we know it's not read-only. When it's a user provided array (X, y), we don't have that guarantee and fused typed memoryviews don't work with read only arrays.
I'm not sure about that; if I remember correctly, it's slower to access ndarray elements. So it would mean keeping the pointers. We'd have to rerun the benchmarks to make sure.
I reverted to np.ndarray for X and Y. @agramfort can you check that the benchmarks are still favorable?
lgtm
Thanks for the PR, made a first pass
@@ -1990,13 +1989,13 @@ class MultiTaskLasso(MultiTaskElasticNet):
     --------
     >>> from sklearn import linear_model
     >>> clf = linear_model.MultiTaskLasso(alpha=0.1)
-    >>> clf.fit([[0,0], [1, 1], [2, 2]], [[0, 0], [1, 1], [2, 2]])
+    >>> clf.fit([[0, 1], [1, 2], [2, 4]], [[0, 0], [1, 1], [2, 3]])
was this change needed?
just trying to understand
Having a duplicated feature resulted in the second coefficient being 0, but with numerical errors it was 1e-16
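For context, a small script showing why the old doctest data was fragile (illustrative only; the exact output depends on round-off):

from sklearn import linear_model

# Old doctest data: the two columns of X are identical (duplicated feature).
clf = linear_model.MultiTaskLasso(alpha=0.1)
clf.fit([[0, 0], [1, 1], [2, 2]], [[0, 0], [1, 1], [2, 2]])
# The weight can sit entirely on either of the two identical columns, so one
# coefficient per task should be exactly 0, but round-off can leave ~1e-16
# instead and break an exact-match doctest.
print(clf.coef_)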
sklearn/linear_model/_cd_fast.pyx
Outdated
# _ger(RowMajor, n_samples, n_tasks, 1.0,
#      &X[0, ii], 1,
#      &w_ii[0], 1, &R[0, 0], n_tasks)
# Using Blas Level1 and for loop for avoid slower threads
-# Using Blas Level1 and for loop for avoid slower threads
+# Using Blas Level1 and for loop to avoid slower threads
sklearn/linear_model/_cd_fast.pyx
Outdated
# if np.sum(w_ii ** 2) != 0.0: # can do better
if _nrm2(n_tasks, wii_ptr, 1) != 0.0:
# if (w_ii[0] != 0.): # faster than testing full norm for non-zeros, yet unsafe
I'm discovering the code so I lack context, but I'm not sure what this is supposed to mean.
In general I feel like "We don't do X because it wouldn't work" is mostly confusing because one would just wonder why we would even want to do X in the first place.
I agree. I think we can even remove the 2 comments.
if np.sum(w_ii ** 2) != 0.0: # can do better
-> we do better so no need for that any more
# faster than testing full norm for non-zeros, yet unsafe
-> Nicolas's argument
sklearn/linear_model/_cd_fast.pyx
Outdated
X_ptr + ii * n_samples, 1,
wii_ptr, 1, &R[0, 0], n_tasks)

# Using Blas Level2:
Should this whole comment block be indented to the left now?
addressed
sklearn/linear_model/_cd_fast.pyx
Outdated
# Using BLAS Level 2:
# _gemv(RowMajor, Trans, n_samples, n_tasks, 1.0, &R[0, 0],
#       n_tasks, &X[0, ii], 1, 0.0, &tmp[0], 1)
# Using BLAS Level 1 (faster small vectors like here):
-# Using BLAS Level 1 (faster small vectors like here):
+# Using BLAS Level 1 (faster for small vectors like here):
ok
sklearn/linear_model/_cd_fast.pyx
Outdated
# if (W[0, ii] != 0.): # faster than testing full col norm, but unsafe
# Using numpy:
same comment as above
@mathurinm your latest changes LGTM but can you rerun my bench on this version to see what the speedup is with the last version of the code?
On my machine the gain for 50 tasks is not always present.
great. good to go from my end. needs 2nd approval
@NicolasHug thank you for the review, does it look good to you now?
Thanks @mathurinm @agramfort
Maybe we could add a TODO comment and/or open an issue about using views instead of numpy arrays in these places, since it does seem that the benchmarks with the views were even faster.
Also, this probably needs a what's new entry?
I'm not sure what to think about the small regressions with n_samples=100 and n_tasks=50. I guess if the absolute running time is pretty low already, that's still OK?
done @NicolasHug
@@ -319,6 +319,13 @@ Changelog
   random noise to the target. This might help with stability in some edge
   cases. :pr:`15179` by :user:`angelaambroz`.

- |Efficiency| Speed up :class:`linear_model.MultiTaskLasso`,
I'm not sure this can make it to 0.23
@adrinjalali this is ready for a merge, so your choice basically ;)
For completeness, here are the old (master) running times on my machine:
Also +1 for this being part of 0.23 if it's not too late @adrinjalali. Merging and tagging #17010, but feel free to let us know if you would rather move it to 0.24 instead.
Thank you @mathurinm!
I was hoping this one would get in; I intentionally hadn't tagged yet :) #17010
This is WIP for now.
A good PR for @jeremiedbb as it's related to findings in kmeans.
Basically, doing some BLAS Level 2 calls in enet_coordinate_descent_multi_task slows things down because they are made on small vectors.
It needs more tests / benchmarks but I have a 10x speed up on a relevant dataset for my research.
done with @mathurinm
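For context, a rough NumPy analogue of the trade-off described above (a sketch with illustrative shapes, not the actual Cython code): the rank-1 residual update done by one BLAS Level 2 call (_ger) can instead be written as one Level 1 axpy per task, avoiding the threading overhead BLAS may trigger on such skinny matrices.

import numpy as np

n_samples, n_tasks = 10000, 3      # tall and skinny, as in MultiTaskLasso
rng = np.random.RandomState(0)
x_col = rng.randn(n_samples)       # one column of X
w_ii = rng.randn(n_tasks)          # coefficients of feature ii, one per task
R = rng.randn(n_samples, n_tasks)  # residual matrix

# Level 2 style: one rank-1 update, R += outer(x_col, w_ii) (what _ger does).
R_ger = R + np.outer(x_col, w_ii)

# Level 1 style: one axpy per task, R[:, t] += w_ii[t] * x_col.
R_axpy = R.copy()
for t in range(n_tasks):
    R_axpy[:, t] += w_ii[t] * x_col

assert np.allclose(R_ger, R_axpy)  # same result, different BLAS call pattern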