
[MRG+1] ENH Add working_memory global config for chunked operations #10280

Merged
merged 105 commits into scikit-learn:master on May 25, 2018

Conversation

6 participants
@jnothman
Member

jnothman commented Dec 10, 2017

We often get issues related to memory consumption and don't deal with them particularly well. Indeed, Scikit-learn should be at home on commodity hardware like developer/researcher laptops.

Some operations can be performed chunked, so that the result is computed in constant (or O(n)) memory relative to some current O(n) (O(n^2)) consumption. Examples include: getting the argmin and min of all pairwise distances (currently done with an ad-hoc parameter to pairwise_distances_argmin_min), calculating silhouette score (#1976), getting nearest neighbors with brute force (#7287), calculating standard deviation of every feature (#5651).
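As a rough illustration of the chunk-size arithmetic this implies (a hedged sketch: `chunk_n_rows` here is a stand-in for the `get_chunk_n_rows` helper this PR adds, and float64 distance rows are assumed):

```python
def chunk_n_rows(row_bytes, working_memory_mib=1024, max_n_rows=None):
    # Number of rows of a distance matrix that fit in the
    # working_memory budget; always at least one row, so that the
    # computation can proceed even when the budget is too small.
    n_rows = int(working_memory_mib * 2 ** 20 // row_bytes)
    if max_n_rows is not None:
        n_rows = min(n_rows, max_n_rows)
    return max(1, n_rows)

# e.g. pairwise distances to 100_000 float64 points: each row costs
# 100_000 * 8 bytes, so a 1024 MiB budget allows 1342 rows per chunk.
```

Clamping to at least one row means the budget degrades gracefully instead of failing outright when a single row already exceeds `working_memory`.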

It's not very helpful to provide this "how much constant memory" parameter in each function (because they're often called within nested code), so this PR instead makes a global config parameter of it. The optimisation is then transparent to the user, but still configurable.

At @rth's request, this PR has been cut back. The proposed changes to silhouette and neighbors can be seen here.

This PR (building upon my work with @dalmia) will therefore:

  • add set_config(working_memory=n_mib)
  • add pairwise_distances_chunked
  • make use of the latter in nearest neighbors, silhouette and pairwise_distances_argmin_min
  • deprecate batch_size in pairwise_distances_argmin_min

and thus:

  • help towards fixing #7175, #10279, #7177 (silhouette)
  • Resolve #7979
  • help towards fixing #7287 (neighbors)
  • perhaps help towards #8216
  • provide an interface for fixing #5651
  • it looks like there was some suggestion of using this kind of chunking in LMNN (#8602)
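For readers unfamiliar with the pattern, the global setting plus context-manager override can be sketched as follows. This is an illustration of the mechanism only, not scikit-learn's actual implementation; the names `set_config`, `get_config` and `config_context` mirror the proposal:

```python
from contextlib import contextmanager

_config = {"working_memory": 1024}  # process-wide default, in MiB


def set_config(working_memory=None):
    # Update the process-wide default.
    if working_memory is not None:
        _config["working_memory"] = working_memory


def get_config():
    # Return a copy so callers cannot mutate the default by accident.
    return dict(_config)


@contextmanager
def config_context(**new_config):
    # Temporarily override the configuration, restoring it on exit
    # even if the wrapped code raises.
    old_config = dict(_config)
    _config.update(new_config)
    try:
        yield
    finally:
        _config.clear()
        _config.update(old_config)
```

Because the setting is global, any nested routine that computes pairwise distances can consult it without every public function growing a `batch_size`-style parameter.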

TODO:

  • attract other core devs' attention, which I feel has been lacking despite the repeated popular interest in these issues!
  • review my comments to @dalmia at #7979
  • add tests for get_chunk_n_rows
  • add tests for _check_chunk_size and any others for pairwise_distances_chunked
  • add config documentation
  • fix, perhaps, #5651 with a chunked standard deviation implementation (can be a separate PR)
  • perhaps review uses of gen_batches to see if they can use this parameter for a more principled choice of batch size. In most cases the batch size affects the model, so we can't just change it; in other cases it may make little difference. We could change mean_shift's use from a fixed batch_size=500 to something sensitive to working_memory if we wish.
  • add what's new entry
  • add example to pairwise_distances_chunked that uses start and a tuple return
  • benchmarking
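For intuition, the reduce pattern behind pairwise_distances_chunked and pairwise_distances_argmin_min can be sketched in pure Python. This is an illustration of the idea only, not the PR's implementation; the real code computes distance chunks with NumPy and hands each one to a `reduce_func(D_chunk, start)`:

```python
def chunked_argmin_min(X, Y, chunk_n_rows):
    """Chunked nearest-neighbour reduction (illustrative only).

    Distances are computed one row-chunk at a time, and the reduce
    step keeps only the per-row (argmin, min), so peak memory is
    O(chunk_n_rows * len(Y)) instead of O(len(X) * len(Y)).
    """
    argmins, mins = [], []
    for start in range(0, len(X), chunk_n_rows):
        for x in X[start:start + chunk_n_rows]:
            # Squared Euclidean distances from x to every row of Y.
            dists = [sum((a - b) ** 2 for a, b in zip(x, y)) for y in Y]
            j = min(range(len(Y)), key=dists.__getitem__)
            argmins.append(j)
            mins.append(dists[j])
    return argmins, mins
```

The `start` offset is what lets a reduce function map chunk-local row indices back to global ones, which is why the TODO item above asks for a docstring example exercising it.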

Sentient07 and others added some commits Dec 26, 2015

Reverted the change, added regression test
reverted the comment

Resolved merge conflicts
FIX pass n_jobs from silhouette_score
Also use threading for parallelism

@rth rth changed the title from [MRG] ENH Add working_memory global config for chunked operations to [MRG+1] ENH Add working_memory global config for chunked operations Mar 8, 2018

@TomDLT

TomDLT approved these changes May 22, 2018

LGTM

- A new configuration parameter, ``working_memory`` was added to control memory
consumption limits in chunked operations, such as the new
:func:`metrics.pairwise_distances_chunked`. See :ref:`working_memory`.


@TomDLT

TomDLT May 22, 2018

Member

You seem to have forgotten the glossary entry.


@jnothman

jnothman May 23, 2018

Member

I think a reference to the User Guide is most relevant in what's new.

I can add a glossary entry, though I'm not sure how it will help beyond the user guide and the config_context docstring.


@TomDLT

TomDLT May 23, 2018

Member

Fair enough, I just wonder what the goal of the :ref: syntax is, since it does not render a link.

Did you mean :func:set_config? Or maybe you need a label in doc/modules/computational_performance.rst?


@jnothman

jnothman May 23, 2018

Member

The latter. A glossary reference would be :term:, not :ref: which references sections.

``reduce_func``.
Examples
-------


@TomDLT

TomDLT May 22, 2018

Member

You need one more dash to have proper rendering.

assert isinstance(S_chunks, GeneratorType)
S_chunks = list(S_chunks)
assert len(S_chunks) > 1
# atol is for diagonal where S is explcitly zeroed on the diagonal


@TomDLT

TomDLT May 22, 2018

Member

*explicitly

min_block_mib = np.array(X).shape[0] * 8 * 2 ** -20
for block in blockwise_distances:
memory_used = len(block) * 8


@TomDLT

TomDLT May 22, 2018

Member

You should use memory_used = block.size * 8 to have the correct memory used in the block.


@jnothman

jnothman May 23, 2018

Member

Hmm... indeed!

for block in blockwise_distances:
memory_used = len(block) * 8
assert memory_used <= min(working_memory, min_block_mib) * 2 ** 20


@TomDLT

TomDLT May 22, 2018

Member

And the min should be a max, shouldn't it?
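Putting both suggested fixes together, the check would look roughly like this (a sketch only; the row/column counts stand in for the NumPy block, and 8 bytes per float64 entry is assumed):

```python
def block_memory_ok(n_rows, n_cols, working_memory_mib, min_block_mib):
    # block.size * 8: bytes actually used by a float64 distance block,
    # not len(block) * 8, which only counts rows.
    memory_used = n_rows * n_cols * 8
    # max(), not min(): a block may legitimately exceed working_memory
    # when even a single row of distances does not fit the budget.
    return memory_used <= max(working_memory_mib, min_block_mib) * 2 ** 20
```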

metric='euclidean')
# Test small amounts of memory
for power in range(-16, 0):
check_pairwise_distances_chunked(X, None, working_memory=2 ** power,


@TomDLT

TomDLT May 22, 2018

Member

This line raises a lot of warnings:
UserWarning: Could not adhere to working_memory config. Currently 0MiB, 1MiB required.
We should silence them as they are expected.
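One way to silence only the expected warning is the standard-library `warnings.catch_warnings` context (a sketch; `noisy_check` is a hypothetical stand-in for the chunked check running under a too-small `working_memory`):

```python
import warnings


def noisy_check():
    # Stand-in for a check that legitimately warns when working_memory
    # is too small for even one row, then proceeds anyway.
    warnings.warn("Could not adhere to working_memory config. "
                  "Currently 0MiB, 1MiB required.", UserWarning)
    return True


# Silence the expected UserWarning without hiding anything else.
with warnings.catch_warnings():
    warnings.simplefilter("ignore", UserWarning)
    result = noisy_check()
```

In the test suite itself, `pytest.warns(UserWarning)` would be the usual way to both silence and assert the expected warning.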

@rth


Member

rth commented May 22, 2018


Great that this is happening!

@jnothman


Member

jnothman commented May 23, 2018

Thanks @TomDLT for the review and the approval! I've addressed all your comments except for the glossary one, which I'm not sure is warranted at this point.

jnothman added some commits May 23, 2018

@jnothman


Member

jnothman commented May 23, 2018

I suppose we could include a new Global Configuration section of the glossary. But I'm not sure how that goes beyond the config_context API reference.

jnothman added some commits May 23, 2018

@amueller


Member

amueller commented May 24, 2018

Is there a plan to also use this in pairwise_distances in the future, automatically dispatching to pairwise_distances_chunked when the parameter is set? (Or maybe I'm overlooking something.)
Also, @jnothman, feel free to point me to stuff you want me to look at ;)

@jnothman


Member

jnothman commented May 24, 2018

@amueller


Member

amueller commented May 24, 2018

py3.6 fails ;)
But you're right, I wasn't thinking it through...

jnothman added some commits May 24, 2018

@jnothman


Member

jnothman commented May 25, 2018

Merging to enable downstream PRs. Let me know if there are further quibbles! Thanks for the reviews, Roman and Tom!

@jnothman jnothman merged commit ef8d22a into scikit-learn:master May 25, 2018

8 checks passed

ci/circleci: deploy Your tests passed on CircleCI!
ci/circleci: python2 Your tests passed on CircleCI!
ci/circleci: python3 Your tests passed on CircleCI!
codecov/patch 100% of diff hit (target 95.11%)
codecov/project 95.17% (+0.06%) compared to 1557cb8
continuous-integration/appveyor/pr AppVeyor build succeeded
continuous-integration/travis-ci/pr The Travis CI build passed
lgtm analysis: Python No alert changes
@rth


Member

rth commented May 27, 2018

Thanks for this @jnothman !

@TomDLT @amueller FYI there is a follow-up PR in #11136 that applies this mechanism to brute force nearest neighbors.

@rth


Member

rth commented May 27, 2018

.. and also #11135 for chunking silhouette_score calculations
