Massive memory usage by parallel RandomForestClassifier #936
Comments
The problem is that two copies of X (X and X_argsorted) are made for each job. You can circumvent that by putting X into shared memory. I did that in one of my branches, using this module: https://bitbucket.org/cleemesser/numpy-sharedmem/issue/2/sharedmemory_sysvso-not-added-correctly-to It was not merged into master because of that additional dependency, though. |
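As a rough illustration of the shared-memory idea (a stdlib sketch added here for context, not the sharedmem module used in that branch), a NumPy array can be backed by a multiprocessing buffer that forked worker processes see without copying:

    import numpy as np
    from multiprocessing import RawArray

    # Allocate a shared buffer and view it as a NumPy array (no copy).
    n_samples, n_features = 1000, 20
    buf = RawArray('f', n_samples * n_features)   # 'f' = C float, i.e. float32
    X_shared = np.frombuffer(buf, dtype=np.float32).reshape(n_samples, n_features)
    X_shared[:] = np.random.rand(n_samples, n_features)   # fill the shared buffer in place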
To use shared memory you can memory-map your input set with joblib:

    import numpy as np
    from sklearn.externals import joblib

    filename = '/tmp/dataset.joblib'
    joblib.dump(np.asfortranarray(X), filename)
    X = joblib.load(filename, mmap_mode='c')

IIRC the random forest models need Fortran-layout data to work efficiently, hence the call to np.asfortranarray. |
BTW @glouppe, if the above strategy works as expected, it would be great to make the RandomForest/ExtraTrees* classes able to do it automatically using a dedicated constructor parameter. |
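A rough sketch of what such an automatic mode could do internally (fit_memmapped is a hypothetical helper, not scikit-learn API; it reuses the sklearn.externals.joblib dump/load pattern from the comment above, and later scikit-learn versions use the standalone joblib package instead):

    import os
    import tempfile

    import numpy as np
    from sklearn.externals import joblib

    def fit_memmapped(estimator, X, y, tmp_dir=None):
        # Dump X once in Fortran layout, then fit on a copy-on-write memory map of it.
        path = os.path.join(tmp_dir or tempfile.mkdtemp(), 'X.joblib')
        joblib.dump(np.asfortranarray(X), path)
        X_mm = joblib.load(path, mmap_mode='c')
        return estimator.fit(X_mm, y)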
@jni any news on this? Have you tried any of the aforementioned solutions? If they work for you, we should devise a way to make this simpler to implement, or at least better documented. |
Haven't tried it, busy weekend — I'll do it today! Thanks! |
Ok, two failures to report. First, I tried to combine @glouppe's code with @ogrisel's joblib modification. This crashed and burned, and anyway didn't seem to affect memory usage much: it was up to 20GB before it crashed. I've made a gist with the diff against scikit-learn 0.11.X and the error for n_jobs=2. I then tried @glouppe's cytomine branch directly, after installing sharedmem, but this also failed for some unknown reason. ... Any ideas? |
@jni have you tried playing with "min_density"? That can really affect memory usage (also CPU usage, though in a non-linear way). |
@amueller, I just tried |
I stand corrected: usage is >50GB with |
Hm, ok, it was just an idea. I don't know where the additional copy comes from (2 instead of 3). You did take care of the memory layout, right? |
I meant whether you use Fortran or C ordering. IIRC the forests want Fortran ordering, so if you provide C-ordered arrays, they'll make a copy. |
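For reference, a quick way to check which layout an array already has and to do the conversion once up front (a small added sketch, not from the thread; NumPy builds C-ordered arrays by default):

    import numpy as np

    X = np.random.rand(1000, 50)
    print(X.flags['C_CONTIGUOUS'])   # True: default row-major layout
    print(X.flags['F_CONTIGUOUS'])   # False

    # One explicit conversion up front avoids a hidden copy inside fit().
    X_f = np.asarray(X, dtype=np.float32, order='F')
    print(X_f.flags['F_CONTIGUOUS'])  # True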
I thought that's what you might have meant, so I ran |
For my branch to work, you need
Hope this helps! |
Thanks, @glouppe! This'll help me, but I'm disappointed that it's of limited use if it won't make it into the scikit proper... It seems to me that if I can get it to work, this could be an optional dependency. I often use the following pattern:

    import logging

    try:
        import sharedmem as shm
        shm_available = True
    except ImportError:
        logging.warning('sharedmem library is not available')
        shm_available = False

Otherwise, I would still be interested in getting the joblib persistence to work... Secondly, this may explain the close-to-3x memory usage when my data is copied, since it's not float32. It's probably a good idea to coerce the data within BaseForest before it is copied. I'll run these experiments this afternoon and report back. Thanks everyone! |
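For scale, a quick back-of-envelope on the dataset size at the two dtypes (added arithmetic, using the roughly 300,000 x 415 shape reported in this issue): every float64 copy is about twice the size of a float32 one.

    import numpy as np

    n_samples, n_features = 300000, 415
    for dtype in (np.float64, np.float32):
        size_gb = n_samples * n_features * np.dtype(dtype).itemsize / 2.0 ** 30
        print("%s: %.2f GB" % (dtype.__name__, size_gb))   # ~0.93 GB vs ~0.46 GB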
@jni could you please report the error message you get with the memory-mapped file solution, or better, tell me whether it is the same as the following? https://gist.github.com/3084146 If so, I will try to give it a deeper look by trying to reproduce it in a pure joblib context, outside of scikit-learn. Maybe @GaelVaroquaux has an idea about the cause of the problem. |
@ogrisel, yes, it's the same error as you pointed out. |
Alright I'll try and see if there is an easy solution to fix this problem tonight unless @GaelVaroquaux or someone else does it first. |
@jni are you running Unix (Linux or OS X)? If so, maybe just putting the data into the right memory layout before calling the fit method of the random forest might work, thanks to the copy-on-write semantics of the Unix fork that backs the multiprocessing module of the standard library on those platforms:

    #!/usr/bin/env python
    import numpy as np
    from sklearn.datasets.samples_generator import make_classification
    from sklearn.ensemble import RandomForestClassifier

    print "generating dataset"
    X, y = make_classification(n_samples=100000, n_features=500)

    print "put data in the right layout"
    X = np.asarray(X, dtype=np.float32, order='F')

    print "fitting random forest:"
    clf = RandomForestClassifier(n_estimators=100, n_jobs=2)
    print clf.fit(X, y).score(X, y)

Can you tell us if it solves your issue? Thanks to @larsmans for the heads-up on COW Unix forks. |
@jni In the end, does this work with my branch? I put this together a few months ago and it indeed solved my problems (as long as the input is a Fortran-ordered float32 array). |
@ogrisel, I'm on Linux (Fedora Core 16). I'm aware of the idea of copy-on-write in Unix fork(), but in my experience I have never been able to capitalise on it in Python. I believe we're running into the problem detailed here, namely: objects might not change, but the Python interpreter changes the metadata of an object (e.g. the reference count), which results in the whole object getting copied. To illustrate, some setup:

    import numpy as np
    from ray import classify  # this is my own library
    dat5 = classify.load_training_data_from_disk('training/multi-channel-graph-e05-5.trdat.h5')
    from sklearn.ensemble import RandomForestClassifier
    # using @glouppe's branch
    features = np.asarray(dat5[0], dtype=np.float32, order='F')
    labels = np.asarray(dat5[1][:, 0], dtype=np.float32)
    features.shape
    # (299351, 415)
    float(features.nbytes) / 2**30
    # 0.46279529109597206
    labels.shape
    np.unique(labels)
    # array([-1., 1.], dtype=float32)

Now we try with and without @glouppe's shared memory. If COW were working, there should be no difference in memory usage. But!

    rf = RandomForestClassifier(100, max_depth=20, n_jobs=16, shared=False, bootstrap=False)
    rf = rf.fit(features, labels)
    # about 1GB/process
    rf = RandomForestClassifier(100, max_depth=20, n_jobs=16, shared=True, bootstrap=False)
    rf = rf.fit(features, labels)
    # about 100MB/process!!!

So, in conclusion:
|
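As a side note, the per-process figures above were presumably read from top or ps; one way to spot-check peak memory from inside a process on Linux (an added suggestion, not something used in the thread) is the resource module:

    import resource

    # Peak resident set size of the current process so far; on Linux, ru_maxrss is in KiB.
    peak_kib = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    print("%.1f MiB" % (peak_kib / 1024.0))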
Thanks for the COW check. It's good to know that it's not working and that it's not fixable. For the shm module, we would rather avoid the maintenance burden of an external dependency (furthermore, it's probably quite experimental and not guaranteed to work on other platforms). I would rather have a solution based on numpy.memmap.

For the problem with numpy memmap, it seems to be a known bug (a regression in numpy 1.5+): http://projects.scipy.org/numpy/ticket/1809 It would be great to find a fix and then backport it into the sklearn.utils.fixes module as a monkey patch. |
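A minimal probe for the memmap pickling behaviour that ticket describes (an added sketch, not from the thread): multiprocessing ships arguments to workers by pickling them, so a numpy.memmap that does not round-trip cleanly either fails outright or silently defeats the memory mapping.

    import pickle

    import numpy as np

    mm = np.memmap('/tmp/probe.dat', dtype=np.float32, mode='w+', shape=(100,))
    try:
        restored = pickle.loads(pickle.dumps(mm))
        print(type(restored))   # may come back as a plain in-memory ndarray rather than a memmap
    except Exception as exc:
        print("memmap pickling failed: %r" % exc)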
As for the |
Hi @jni, FYI I have started a new branch in joblib to add support for numpy.memmap arrays. This is not yet used in scikit-learn though: the embedded joblib version in scikit-learn will need to be synchronized with upstream once this PR is merged. |
Thanks @ogrisel! Is it actually fixed? i.e. do I just need to replace the bundled version with this branch? Or still working on it? |
It should be fixed in my joblib branch. You can try to do the swap manually, but I am not sure whether other recent changes in joblib will impact its use in scikit-learn (they probably should not), as I have not tested it myself yet. Then you can try something like:

    #!/usr/bin/env python
    import numpy as np
    from sklearn.datasets.samples_generator import make_classification
    from sklearn.externals import joblib
    from sklearn.ensemble import RandomForestClassifier

    print "generating dataset"
    X, y = make_classification(n_samples=100000, n_features=500)

    filename = '/tmp/dataset.joblib'
    print "put data in the right layout and map to " + filename
    joblib.dump(np.asarray(X, dtype=np.float32, order='F'), filename)
    X = joblib.load(filename, mmap_mode='c')

    print "fitting random forest:"
    clf = RandomForestClassifier(n_estimators=100, n_jobs=2)
    print clf.fit(X, y).score(X, y)

|
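A quick sanity check to run before the fit (an added line, assuming joblib.load with a mmap_mode returns a numpy.memmap view of the dumped array rather than an in-memory copy):

    print(type(X))                  # expect <class 'numpy.memmap'>
    print(X.flags['F_CONTIGUOUS'])  # expect True, given the order='F' dump above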
I have tried the previous script. I don't get the previous error anymore, but the memory usage does not seem to stay constant when I increase n_jobs. |
@ogrisel I'm not sure if I'm doing something wrong here, but I've just imported your joblib.external.parallel.py fixes and run the above code #936 (comment).
I'd love to get the shared memory working using your method instead of the sharedmem package... |
Which branch are you using? I have started a new branch with another approach: I am still working on it though. |
FYI: I am working on a new approach to efficiently deal with shared memory with |
@jni @bdholt1 I think my pull request is in a workable state, ready for testing on your use cases: To test, you can just replace the joblib embedded within the sklearn source tree with the one that comes from this repo / branch:

With this drop-in replacement, any numpy array larger than 1MB passed as an argument to a Parallel call will automatically be memory-mapped and shared with the worker processes. Please feel free to report any issue directly as comments to joblib/joblib#44 . |
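For reference, later standalone joblib releases expose this behaviour through Parallel parameters; a small sketch using the current parameter names, which may differ from the 2012 branch discussed here:

    import numpy as np
    from joblib import Parallel, delayed

    def row_sum(a, i):
        return a[i].sum()

    X = np.random.rand(2000, 500)   # ~8 MB, well above the 1 MB auto-memmapping threshold

    # max_nbytes is the size above which input arrays are dumped to a temporary
    # file and handed to the workers as read-only memory maps.
    results = Parallel(n_jobs=2, max_nbytes='1M', mmap_mode='r')(
        delayed(row_sum)(X, i) for i in range(10))
    print(results[:3])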
Clone of #2179. |
I think this will be hard to fix without swapping out joblib (or maybe even the GIL ;), but basically the amount of memory used by RandomForestClassifier is exorbitant for n_jobs > 1. In my case, I have a dataset of about 1GB (300,000 samples by 415 features by 64-bit float), but doing fit() on a RandomForestClassifier having n_jobs=16 results in 45GB of memory being used.
Does anyone have any ideas or is this hopeless without moving everything to C?
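A rough sanity check on these numbers (added arithmetic, not from the issue): with about three resident copies of the float64 data per job, as discussed in the comments above, 16 jobs land right in the reported ballpark.

    n_jobs = 16
    copies_per_job = 3                         # roughly: the original plus the extra per-job copies
    dataset_gb = 300000 * 415 * 8 / 2.0 ** 30  # ~0.93 GB as 64-bit floats
    print("%.1f GB" % (n_jobs * copies_per_job * dataset_gb))   # ~44.5 GB, close to the observed 45 GB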