
Fixed the bug where joblib uses memory mapping when the data size is too large #495

Merged: 2 commits, Oct 7, 2021

Conversation

ginward commented Oct 3, 2021

This is to fix the bug mentioned in the following issues:

joblib/joblib#1225

scikit-learn/scikit-learn#21228

#494

MaartenGr/BERTopic#151

It turns out that joblib memory-maps the input data in multiprocessing when the data size exceeds max_nbytes. However, this causes an error in Cython, as discussed here: scikit-learn/scikit-learn#7981 (comment)

Therefore, I have set max_nbytes to None to prevent the auto-memmapping, which resolves the errors above. The downside is that memory usage can now be higher when multiprocessing is used, because the numpy array has to be copied into each worker process.
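
For reference, here is a minimal standalone joblib sketch (not hdbscan's internal call sites) of what the flag changes: with the default max_nbytes='1M', large input arrays reach the workers as read-only numpy.memmap objects, whereas max_nbytes=None disables auto-memmapping and each worker gets an ordinary in-memory copy.

import numpy as np
from joblib import Parallel, delayed

def describe(arr):
    # Report how the array arrived in the worker process.
    return type(arr).__name__, arr.flags.writeable

if __name__ == "__main__":
    big = np.zeros((2000, 1000))  # ~16 MB, well above the default 1M threshold

    # Default settings: auto-memmapping kicks in, workers see a read-only memmap.
    print(Parallel(n_jobs=2)(delayed(describe)(big) for _ in range(2)))

    # max_nbytes=None: auto-memmapping is disabled, workers get writable ndarray copies.
    print(Parallel(n_jobs=2, max_nbytes=None)(delayed(describe)(big) for _ in range(2)))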

jc-healy (Collaborator) commented Oct 7, 2021

Thanks for the PR. Hopefully this cleans up the problems.

SergioG-M commented

Can you explain how you set max_nbytes to None? Do you need to call joblib.Parallel yourself, or is there another way?

jsyrjala commented Nov 4, 2021

We also hit this issue and are currently installing from a specific git commit to get a working version of HDBSCAN.

Could you consider releasing a new version of HDBSCAN with this fix?

yotammarton commented

@ginward thanks for that; I liked the way you documented everything, which helped me find these issues.

Although I tried reinstalling HDBSCAN from git master with your fix, I still struggled with BERTopic on a large number of docs (2.9M), hitting the same error you had.
What solved it for me was passing core_dist_n_jobs=1 to HDBSCAN and using it as a custom HDBSCAN model in the BERTopic init:

import hdbscan
from bertopic import BERTopic

# -- Custom HDBSCAN
bertopic_params['hdbscan_model'] = hdbscan.HDBSCAN(min_cluster_size=bertopic_params['min_topic_size'],
                                                   metric='euclidean',
                                                   cluster_selection_method='eom',
                                                   prediction_data=True,
                                                   core_dist_n_jobs=1)  # core_dist_n_jobs=1 should prevent the error
# -- BERTopic
topic_model = BERTopic(**bertopic_params)

ginward (Author) commented Nov 4, 2021

@yotammarton Maybe you can try cloning from the master branch and checking manually whether the source code here and here has been updated, before installing directly from the clone:

  1. Clone the repository containing the merged pull request:
     git clone https://github.com/scikit-learn-contrib/hdbscan.git

  2. Check that the code has been updated.

  3. Uninstall the existing version: pip uninstall hdbscan

  4. Install the new version from the clone: pip install ./hdbscan

Also note that if you are working on a Mac, you might need to change the multiprocessing start method from fork to spawn manually to avoid deadlocks: https://stackoverflow.com/a/66113051/5705174
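
A minimal sketch of what that looks like for code driven through Python's multiprocessing module (set it once, before any worker pools are created):

import multiprocessing as mp

if __name__ == "__main__":
    # Force the "spawn" start method instead of "fork" to avoid deadlocks on macOS.
    mp.set_start_method("spawn", force=True)
    # ... build the HDBSCAN / BERTopic models and run clustering below this point ...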

Setting the number of jobs to 1 might fix the issue, but it might slow down the code, too.

yotammarton commented

My steps were:

  1. pip uninstall hdbscan
  2. pip install --upgrade git+https://github.com/scikit-learn-contrib/hdbscan.git#egg=hdbscan, as advised in the project's main README.

I was unable to locate the source code after installing the package, so I couldn't verify the change.
But I'm pretty certain your fix was included, because the logs pip printed showed the most recent commit (54da636).
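
In case it helps, a quick way to locate the installed source and confirm which build is active (assuming the installed package exposes __version__, which recent releases do):

import hdbscan

print(hdbscan.__version__)  # version string of the installed build
print(hdbscan.__file__)     # filesystem path of the installed package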

I'm on Ubuntu 18.04.5 LTS (GNU/Linux 5.3.0-1032-aws x86_64).

evanroyrees added a commit to evanroyrees/Autometa that referenced this pull request Nov 30, 2021
…ternals using joblib.

🔥🐛 This is a somewhat known error as similar messages have been discussed [here](scikit-learn/scikit-learn#21685)
and on the [hdbscan GH pull-#495](scikit-learn-contrib/hdbscan#495).

The error message is emitted from joblib.externals.loky.process_executor._RemoteTraceback as a ValueError:

'ValueError: buffer source array is read-only.'

So far this has not been encountered with scikit-learn version 0.24
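
For context, the underlying failure is that joblib's auto-memmapped inputs arrive in the worker processes as read-only arrays, and the Cython code then asks for a writable buffer. A rough pure-NumPy analogue of that failure mode (the exact Cython message differs):

import numpy as np

a = np.arange(5.0)
a.setflags(write=False)  # joblib's auto-memmapped inputs are read-only like this
try:
    a[0] = 1.0
except ValueError as exc:
    print(exc)  # "assignment destination is read-only"
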
dbl001 commented Mar 24, 2022

I got the same error. I tried uninstalling hdbscan and reinstalling it from the patch (as per the instructions above), but it didn't help.
My data size is: 615449 rows × 5 columns.

I am running Anaconda Python 3.8.5 on macOS Monterey 12.3.

% pip show hdbscan
Name: hdbscan
Version: 0.8.28
Summary: Clustering based on density with variable density clusters
Home-page: http://github.com/scikit-learn-contrib/hdbscan
Author: None
Author-email: None
License: BSD
Location: /Users/davidlaxer/anaconda3/envs/social_networking/lib/python3.8/site-packages
Requires: scipy, cython, numpy, joblib, scikit-learn
Required-by: bertopic

dbl001 commented Mar 25, 2022

After adjusting core_dist_n_jobs=1 in hdbscan.HDBSCAN, the process ran, consumed ~300GB of memory, and then died.
E.g.

self.hdbscan_model = hdbscan_model or hdbscan.HDBSCAN(min_cluster_size=self.min_topic_size,
                                                       metric='euclidean',
                                                       cluster_selection_method='eom',
                                                       core_dist_n_jobs=1)

/Users/davidlaxer/tensorflow-metal/bin/python /Applications/PyCharm.app/Contents/plugins/python/helpers/pydev/pydevd.py --multiproc --qt-support=auto --client 127.0.0.1 --port 59368 --file /Users/davidlaxer/BERTopic/test1.py
Connected to pydev debugger (build 213.7172.26)
Batches: 100%|██████████| 19233/19233 [25:25:56<00:00,  4.76s/it]
2022-03-25 12:16:13,072 - BERTopic - Transformed documents to Embeddings
OMP: Info #270: omp_set_nested routine deprecated, please use omp_set_max_active_levels instead.
2022-03-25 12:28:06,809 - BERTopic - Reduced dimensionality with UMAP
2022-03-25 12:30:15,653 - BERTopic - Clustered UMAP embeddings with HDBSCAN
/Users/davidlaxer/anaconda3/lib/python3.8/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 3 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '

Process finished with exit code 137 (interrupted by signal 9: SIGKILL)

ginward (Author) commented Apr 17, 2022

@dbl001 Is it possible that your machine does not have enough RAM?

dbl001 commented Apr 17, 2022 via email
