[MRG] FIX Avoid accumulating forest predictions in non-threadsafe manner #9830

Merged

2 commits merged into scikit-learn:master on Oct 16, 2017

Conversation

@jnothman (Member) commented Sep 26, 2017

Resolved #9393. See #9734.

Ping @ogrisel.

@jnothman changed the title from "FIX Avoid accumulating forest predictions in non-threadsafe manner" to "[MRG] FIX Avoid accumulating forest predictions in non-threadsafe manner" on Sep 26, 2017

@jnothman (Member) commented Sep 26, 2017

The problem with this is that it makes RF prediction more memory intensive by a factor of n_estimators / n_jobs (assuming n_samples for prediction is much larger than tree size).

I could do this summation over an iterator instead, although with joblib the iterator would only be used in the n_jobs=1 case... WDYT?
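
For illustration, here is a hedged sketch of the two accumulation strategies being compared (master's shared-array accumulation versus this PR's collect-then-sum). The helper names, the use of joblib's threading backend, and the predict_collect_then_sum wrapper are invented for this example and are not the PR's actual code.

import numpy as np
from joblib import Parallel, delayed

def _accumulate_into_shared(predict, X, out):
    # Approach at master: each worker adds its tree's prediction into the
    # shared `out` array, so at most n_jobs prediction matrices are alive
    # at once; but the in-place addition is not guaranteed to be thread-safe.
    out += predict(X)

def _return_prediction(predict, X):
    # Initial approach in this PR: return each tree's prediction and sum
    # afterwards. Correct, but up to n_estimators matrices of shape
    # (n_samples, n_outputs) are held in memory at the same time.
    return predict(X)

def predict_collect_then_sum(estimators, X, n_jobs=2):
    all_pred = Parallel(n_jobs=n_jobs, backend="threading")(
        delayed(_return_prediction)(est.predict, X) for est in estimators
    )
    return sum(all_pred) / len(estimators)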

@jnothman added this to the 0.19.1 milestone on Sep 26, 2017

@@ -378,13 +378,16 @@ def feature_importances_(self):
# ForestClassifier or ForestRegressor, because joblib complains that it cannot
# pickle it when placed there.
def accumulate_prediction(predict, X, out):

@amueller (Member) commented Sep 27, 2017

The problem is that "out" is shared among threads, right?

@amueller (Member) commented Sep 27, 2017

If this is the case I don't understand the memory increase. Shouldn't out be n_samples x n_classes and the memory increase therefore n_samples x n_classes x n_jobs?


@jnothman (Member) commented Sep 27, 2017

At master, up to n_jobs prediction matrices are in memory at once. The accumulation is done, albeit badly, in each thread.

At this PR all prediction matrices are returned from parallel, i.e. n_estimators separate matrices. The accumulation is done after.

A mutex lock on the accumulation would be sufficient to solve this, but we've tended to avoid them. Alternatively, a map that returns an iterator rather than a list (or a list of promises) would suffice to give roughly the current memory consumption with correctness assured.
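
As a rough sketch of the lock-based option (simplified and hedged: the predict_forest wrapper, the exact signatures, and the assumption of a 1-D regressor prediction are invented here, not necessarily what would get merged), the lock serializes only the cheap in-place addition while the expensive per-tree predict calls still run in parallel:

import threading

import numpy as np
from joblib import Parallel, delayed

def accumulate_prediction(predict, X, out, lock):
    prediction = predict(X)   # expensive part, runs concurrently
    with lock:                # cheap part, serialized across threads
        out += prediction

def predict_forest(estimators, X, n_jobs=2):
    # Average the per-tree regression predictions (1-D, length n_samples).
    out = np.zeros(X.shape[0])
    lock = threading.Lock()
    Parallel(n_jobs=n_jobs, backend="threading")(
        delayed(accumulate_prediction)(est.predict, X, out, lock)
        for est in estimators
    )
    out /= len(estimators)
    return out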

@lesteve (Member) commented Sep 28, 2017

Before I forget: I am guessing this is partly reverting #8672. We should take a look at that PR to try to understand the motivations behind it.

@jnothman (Member) commented Sep 28, 2017

@lesteve (Member) commented Sep 28, 2017

Chatting with @ogrisel about this one, he thinks adding a threading.Lock is probably both the best and the simplest option. The summation of probabilities should not be a bottleneck, so the lock will not impact performance.

Just curious: can we actually reproduce the failure outside the Debian testing framework (mips is the failing architecture, I think)?

@jnothman (Member) commented Sep 30, 2017

I tried to reproduce it the other day and failed. Our best chance? Lots of random features, max_features=1, very few samples, many estimators, large n_jobs. Let's use threading.Lock then.

@jnothman (Member) commented Sep 30, 2017

Done

@lesteve (Member) commented Oct 3, 2017

The changes look fine, but I am a bit uncomfortable merging this somewhat blind, without having managed to reproduce the failure either in the scikit-learn tests or in a simpler snippet where you update a single numpy array with parallel summations.

Also, maybe it would be a good idea to run a quick and dirty benchmark to make sure that the lock is not impacting performance?

Alternatively a map that returns an iterator rather than a list (or a list of promises) would suffice to give roughly the current memory consumption with correctness assured.

FYI, I tried a little while ago to have Parallel return a generator. It was kind of working, except when closing the pool before consuming all the results, in which case it hung and I never figured out why. I may have another go at it at some point. My branch is here if you want more details.
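
For reference, here is a hedged sketch of the kind of "simpler snippet" mentioned above: several threads performing unprotected in-place additions into one shared numpy array, then checking for lost updates. Whether it actually fails depends on the platform and NumPy build, which is consistent with how hard this failure has been to reproduce.

import threading

import numpy as np

n_threads, n_iters = 4, 100
shared = np.zeros(10 ** 6)
ones = np.ones_like(shared)

def add_ones(out):
    # Repeated unprotected in-place additions into the shared array.
    for _ in range(n_iters):
        out += ones

threads = [threading.Thread(target=add_ones, args=(shared,))
           for _ in range(n_threads)]
for t in threads:
    t.start()
for t in threads:
    t.join()

expected = n_threads * n_iters
lost = int(expected - shared.min())
print('Lost up to %d updates' % lost if lost else 'No lost updates')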

@jmschrei (Member) commented Oct 3, 2017

My understanding is that the issue arises when multiple trees try to add their individual predictions to the single out array at the same time, causing some updates to be overwritten. Is this correct? It seems weird to me that the GIL is not preventing this from happening; do you know why that is the case?

@ogrisel approved these changes on Oct 3, 2017

It seems like a reasonable fix. GIL contention should not be too visible, as adding arrays is probably orders of magnitude faster than computing the predictions themselves.

@ogrisel (Member) commented Oct 3, 2017

@jmschrei I don't know how the GIL protects the += operation on numpy arrays.

@jnothman (Member) commented Oct 3, 2017

@jnothman (Member) commented Oct 16, 2017

This reproduces the contention, showing that at master, n_jobs=4 is inconsistent and n_jobs=1 is consistent. Unfortunately, it seems to show that the current PR is also inconsistent and I cannot fathom why:

import numpy as np
from sklearn.ensemble import RandomForestRegressor
import sklearn.ensemble
print(sklearn.ensemble.__path__)  # check which sklearn build is being exercised

# Few samples, many random features, many trees: conditions chosen to
# maximise the chance of contention during parallel prediction.
X = np.random.rand(10, 100)
y = np.random.rand(10) * 100
rfr = RandomForestRegressor(n_estimators=1000, max_features=1, n_jobs=4).fit(X, y)
ys = []
for i in range(100):
    if i % 10 == 0:
        print(i)
    ys.append(rfr.set_params(n_jobs=4).predict(X))

# Count how many consecutive prediction vectors differ anywhere.
n_failures = sum(np.any(np.diff(ys, axis=0), axis=1))
if n_failures:
    print('Broke up to %d times!' % n_failures)
else:
    print('Consistent!')
@jnothman (Member) commented Oct 16, 2017

The answer is that the test is finding instability due to summation order, not threading contention. The differences are minuscule, both at master and in this PR. So the issue remains without a reliable test.

What I can say is that the effect of locking on performance is negligible.
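
As a generic illustration (not taken from this PR) of how summation order alone can change the last bits of a floating point result:

a, b, c = 0.1, 0.2, 0.3
print((a + b) + c == a + (b + c))   # False: rounding depends on the order
print((a + b) + c, a + (b + c))     # 0.6000000000000001 vs 0.6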

@lesteve (Member) commented Oct 16, 2017

Thanks a lot @jnothman, let's merge this one!

@lesteve merged commit 4a4b711 into scikit-learn:master on Oct 16, 2017

6 checks passed

ci/circleci - Your tests passed on CircleCI!
codecov/patch - 100% of diff hit (target 96.17%)
codecov/project - Absolute coverage decreased by -<.01% but relative coverage increased by +3.82% compared to 4e6b403
continuous-integration/appveyor/pr - AppVeyor build succeeded
continuous-integration/travis-ci/pr - The Travis CI build passed
lgtm analysis: Python - No alert changes

jnothman added a commit to jnothman/scikit-learn that referenced this pull request Oct 17, 2017

maskani-moh added a commit to maskani-moh/scikit-learn that referenced this pull request Nov 15, 2017

jwjohnson314 pushed a commit to jwjohnson314/scikit-learn that referenced this pull request Dec 18, 2017
