Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MRG+1] Read-only data compatibility for Lasso #4775

Merged

Conversation

arthurmensch
Copy link
Contributor

Should fix issue #4772. Benchmarking the changes (calling cd_fast within a lasso regression) suggest we do not reduce performance compared to current master

I added a regression test that fails on current master (setting a read only flag on design matrix)

transform_alpha=0.001, random_state=0, n_jobs=-1)
code = dico.fit(X).transform(X)
assert_array_almost_equal(np.dot(code, dico.components_), X, decimal=2)
X.flags.writeable = True
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This last line is useless as X will be garbage collected. Please remove it.

Sorry I had not realised that you were sharing the same X as in other tests. I would be cleaner to do:

X_readonly = X.copy()
X_readonly.flags.writeable = False

and use that instead of X in the fit and transform calls.

@ogrisel ogrisel changed the title cd_fast.pyx : changed ctype to np.ndarray to avoid cython read only error [WIP] cd_fast.pyx : changed ctype to np.ndarray to avoid cython read only error May 27, 2015
@ogrisel
Copy link
Member

ogrisel commented May 27, 2015

I edited the title of this PR as a [WIP]. Please change it to [MRG] once you address the above comments.

n_components = 12
X.flags.writeable = False
dico = DictionaryLearning(n_components, transform_algorithm='lasso_cd',
transform_alpha=0.001, random_state=0, n_jobs=-1)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we use n_jobs=-1 in tests?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed this so that the test only assert that dict_learning works on memory mapped ro arrays

@arthurmensch arthurmensch changed the title [WIP] cd_fast.pyx : changed ctype to np.ndarray to avoid cython read only error [MRG] cd_fast.pyx : changed ctype to np.ndarray to avoid cython read only error May 27, 2015

if __name__ == '__main__':
import nose
nose.runmodule()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please do not add such boilerplate. Use nosetests sklearn/decomposition/tests/test_dict_learning.py to run the tests of a specific test module.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, these lines still exist in some tests file though

@ogrisel
Copy link
Member

ogrisel commented May 28, 2015

Also can you please squash the commits of this PR? Such intermediate commits have not historical values per se.

@ogrisel
Copy link
Member

ogrisel commented May 28, 2015

If we need help with squashing please feel free to ask.

@arthurmensch arthurmensch changed the title [MRG] cd_fast.pyx : changed ctype to np.ndarray to avoid cython read only error [WIP] cd_fast.pyx : changed ctype to np.ndarray to avoid cython read only error May 28, 2015
@arthurmensch arthurmensch force-pushed the cd_fast_readonly_array_brainfix branch 2 times, most recently from 73c98dd to 6094264 Compare May 29, 2015 08:09
@arthurmensch arthurmensch changed the title [WIP] cd_fast.pyx : changed ctype to np.ndarray to avoid cython read only error [MRG] cd_fast.pyx : changed ctype to np.ndarray to avoid cython read only error May 29, 2015
@ogrisel
Copy link
Member

ogrisel commented May 29, 2015

LGTM. I think we should write a new test in test_common that checks that we can fit any estimator on readonly data. But this can be done in another PR.

@ogrisel ogrisel changed the title [MRG] cd_fast.pyx : changed ctype to np.ndarray to avoid cython read only error [MRG+1] cd_fast.pyx : changed ctype to np.ndarray to avoid cython read only error May 29, 2015
@@ -59,6 +61,13 @@ def test_dict_learning_reconstruction_parallel():
code = dico.transform(X)
assert_array_almost_equal(np.dot(code, dico.components_), X, decimal=2)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PEP8: you need 2 empty lines between top-level functions

@GaelVaroquaux
Copy link
Member

Does the change to the cython code have any impact to run time?

@ogrisel
Copy link
Member

ogrisel commented Jun 24, 2015

@arthurmensch could you please rebase this on top of master? it's no longer mergeable according to github.

I think the np.asarray vs np.array in input validation that triggers memory copy on np.memmap inputs should be tackled in a different pull request.

Instead to test the fix, please add a unittest that calls into the private cython API directly with np.memmap readonly data directly.

@arthurmensch arthurmensch force-pushed the cd_fast_readonly_array_brainfix branch from 976e28e to bc33901 Compare June 24, 2015 13:53
@ogrisel
Copy link
Member

ogrisel commented Jun 24, 2015

Please remove the last commit, I think this should be addressed in a separate PR. In the mean time write a unittest that uses the private API directly as I did here: #4684

@arthurmensch arthurmensch changed the title [MRG+1] cd_fast.pyx : changed ctype to np.ndarray to avoid cython read only error [MRG+1] Read only data compatibility for Lasso Jul 2, 2015
@arthurmensch arthurmensch changed the title [MRG+1] Read only data compatibility for Lasso [MRG+1] Read-only data compatibility for Lasso Jul 2, 2015
@ogrisel
Copy link
Member

ogrisel commented Jul 21, 2015

Can you please squash those commits together?

Copy from joblib.pool (for independance)"""
try:
if os.path.exists(folder_path):
shutil.rmtree(folder_path) # This can fail under windows, but will succeed when called by atexit
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pep8

@arthurmensch arthurmensch force-pushed the cd_fast_readonly_array_brainfix branch from 6869512 to caec867 Compare July 23, 2015 13:05
@arthurmensch
Copy link
Contributor Author

Done

@ogrisel
Copy link
Member

ogrisel commented Jul 23, 2015

Please squash commits that have trivial commit messages such as "Fix".

@arthurmensch arthurmensch force-pushed the cd_fast_readonly_array_brainfix branch 4 times, most recently from 9683cef to 9a55eeb Compare July 27, 2015 13:02
@@ -74,7 +73,7 @@

print('Learning the dictionary...')
t0 = time()
dico = MiniBatchDictionaryLearning(n_components=100, alpha=1, n_iter=500)
dico = MiniBatchDictionaryLearning(n_components=100, alpha=1, n_iter=500, batch_size=100, n_jobs=4)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That will break on windows, right? Let's not do this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry I left it, I used it for tests. I don't understand why it would break on Windows though, could you explain ?

@amueller
Copy link
Member

LGTM apart from minor comments.

@arthurmensch arthurmensch force-pushed the cd_fast_readonly_array_brainfix branch 3 times, most recently from 8dc0aa7 to e3435a1 Compare July 31, 2015 10:10
@arthurmensch arthurmensch force-pushed the cd_fast_readonly_array_brainfix branch from e3435a1 to fead69a Compare July 31, 2015 10:12
@amueller
Copy link
Member

as far as I know, using multiprocessing requires you to wrap your call into a if __name__ == "__main__" as the file is repeatedly imported. Otherwise you get infinite recursion. As we don't do that for examples, we can't use multiprocessing. (@ogrisel correct me if I'm wrong)

@amueller
Copy link
Member

Travis test failure from cec3bf9. Merging.

amueller added a commit that referenced this pull request Jul 31, 2015
…infix

[MRG+1] Read-only data compatibility for Lasso
@amueller amueller merged commit 77ecf16 into scikit-learn:master Jul 31, 2015
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants