
Fixed the bug where joblib uses memory mapping when the data size is too large #495

Merged: 2 commits, Oct 7, 2021

Conversation

ginward commented Oct 3, 2021

This is to fix the bug mentioned in the following issues:

joblib/joblib#1225

scikit-learn/scikit-learn#21228

#494

MaartenGr/BERTopic#151

It turns out that joblib memory-maps the input data in multiprocessing when the data size exceeds max_nbytes. However, this causes an error in Cython, as discussed here: scikit-learn/scikit-learn#7981 (comment)

Therefore, I have set max_nbytes to None to prevent the auto-memmapping, which resolves the errors above. The downside is that memory usage can now be higher when multiprocessing is used, because the numpy array has to be copied into each worker process.
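
For reference, here is a minimal standalone joblib sketch (not hdbscan's internal call sites) of what the flag changes: with the default max_nbytes='1M', large input arrays reach the workers as read-only numpy.memmap objects, whereas max_nbytes=None disables auto-memmapping and each worker gets an ordinary in-memory copy.

import numpy as np
from joblib import Parallel, delayed

def describe(arr):
    # Report how the array arrived in the worker process.
    return type(arr).__name__, arr.flags.writeable

if __name__ == "__main__":
    big = np.zeros((2000, 1000))  # ~16 MB, well above the default 1M threshold

    # Default settings: auto-memmapping kicks in, workers see a read-only memmap.
    print(Parallel(n_jobs=2)(delayed(describe)(big) for _ in range(2)))

    # max_nbytes=None: auto-memmapping is disabled, workers get writable ndarray copies.
    print(Parallel(n_jobs=2, max_nbytes=None)(delayed(describe)(big) for _ in range(2)))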

jc-healy (Collaborator) commented Oct 7, 2021

Thanks for the PR. Hopefully this cleans up the problems.

SergioG-M commented

Can you explain how you set max_nbytes to None? Do you need to call joblib.Parallel yourself, or is there another way?

jsyrjala commented Nov 4, 2021

We also hit this issue and are currently installing from a specific git commit to get a working version of HDBSCAN.

Could you consider releasing a new version of HDBSCAN with this fix?

yotammarton commented

@ginward thanks for that; I liked the way you documented everything, which helped me find these issues.

Although I tried reinstalling HDBSCAN from git master with your fix, I still struggled with BERTopic on a large number of docs (2.9M), hitting the same error you had.
What solved it for me was passing core_dist_n_jobs=1 to HDBSCAN and using it as a custom HDBSCAN model in the BERTopic init:

import hdbscan
from bertopic import BERTopic

# -- Custom HDBSCAN
bertopic_params['hdbscan_model'] = hdbscan.HDBSCAN(min_cluster_size=bertopic_params['min_topic_size'],
                                                   metric='euclidean',
                                                   cluster_selection_method='eom',
                                                   prediction_data=True,
                                                   core_dist_n_jobs=1)  # core_dist_n_jobs=1 should prevent the error
# -- BERTopic
topic_model = BERTopic(**bertopic_params)

ginward (Author) commented Nov 4, 2021

@yotammarton Maybe you can try cloning from the master branch and checking manually whether the source code here and here has been updated, before installing directly from the clone:

  1. Clone the repository containing the merged pull request:
     git clone https://github.com/scikit-learn-contrib/hdbscan.git

  2. Check that the code has been updated.

  3. Uninstall the existing version: pip uninstall hdbscan

  4. Install the new version from the clone: pip install ./hdbscan

Also note that if you are working on a Mac, you might need to change the multiprocessing start method from fork to spawn manually to avoid deadlocks: https://stackoverflow.com/a/66113051/5705174
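
A minimal sketch of what that looks like for code driven through Python's multiprocessing module (set it once, before any worker pools are created):

import multiprocessing as mp

if __name__ == "__main__":
    # Force the "spawn" start method instead of "fork" to avoid deadlocks on macOS.
    mp.set_start_method("spawn", force=True)
    # ... build the HDBSCAN / BERTopic models and run clustering below this point ...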

Setting the number of jobs to 1 might fix the issue, but it might slow down the code, too.

yotammarton commented

My steps were:

  1. pip uninstall hdbscan
  2. pip install --upgrade git+https://github.com/scikit-learn-contrib/hdbscan.git#egg=hdbscan, as advised in the project's main README.

I was unable to locate the source code after installing the package, so I couldn't verify the change.
But I'm pretty certain your fix was included, because the logs pip printed showed the most recent commit (54da636).
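
In case it helps, a quick way to locate the installed source and confirm which build is active (assuming the installed package exposes __version__, which recent releases do):

import hdbscan

print(hdbscan.__version__)  # version string of the installed build
print(hdbscan.__file__)     # filesystem path of the installed package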

I'm on Ubuntu 18.04.5 LTS (GNU/Linux 5.3.0-1032-aws x86_64).

evanroyrees added a commit to evanroyrees/Autometa that referenced this pull request Nov 30, 2021
…ternals using joblib.

🔥🐛 This is a somewhat known error as similar messages have been discussed [here](scikit-learn/scikit-learn#21685)
and on the [hdbscan GH pull-#495](scikit-learn-contrib/hdbscan#495).

The error message is emitted from joblib.externals.loky.process_executor._RemoteTraceback as a ValueError:

'ValueError: buffer source array is read-only.'

So far this has not been encountered with scikit-learn version 0.24
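
For context, the underlying failure is that joblib's auto-memmapped inputs arrive in the worker processes as read-only arrays, and the Cython code then asks for a writable buffer. A rough pure-NumPy analogue of that failure mode (the exact Cython message differs):

import numpy as np

a = np.arange(5.0)
a.setflags(write=False)  # joblib's auto-memmapped inputs are read-only like this
try:
    a[0] = 1.0
except ValueError as exc:
    print(exc)  # "assignment destination is read-only"
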
dbl001 commented Mar 24, 2022

I got the same error. I tried uninstalling hdbscan and reinstalling it from the patch (as per the instructions above), but it didn't help.
My data size is: 615449 rows × 5 columns.

I am running Anaconda Python 3.8.5 on macOS Monterey 12.3.

% pip show hdbscan
Name: hdbscan
Version: 0.8.28
Summary: Clustering based on density with variable density clusters
Home-page: http://github.com/scikit-learn-contrib/hdbscan
Author: None
Author-email: None
License: BSD
Location: /Users/davidlaxer/anaconda3/envs/social_networking/lib/python3.8/site-packages
Requires: scipy, cython, numpy, joblib, scikit-learn
Required-by: bertopic

dbl001 commented Mar 25, 2022

After adjusting core_dist_n_jobs=1 in hdbscan.HDBSCAN, the process ran, consumed ~300GB of memory, and then died.
E.g.

self.hdbscan_model = hdbscan_model or hdbscan.HDBSCAN(min_cluster_size=self.min_topic_size,
                                                       metric='euclidean',
                                                       cluster_selection_method='eom',
                                                       core_dist_n_jobs=1)

/Users/davidlaxer/tensorflow-metal/bin/python /Applications/PyCharm.app/Contents/plugins/python/helpers/pydev/pydevd.py --multiproc --qt-support=auto --client 127.0.0.1 --port 59368 --file /Users/davidlaxer/BERTopic/test1.py
Connected to pydev debugger (build 213.7172.26)
Batches: 100%|██████████| 19233/19233 [25:25:56<00:00,  4.76s/it]
2022-03-25 12:16:13,072 - BERTopic - Transformed documents to Embeddings
OMP: Info #270: omp_set_nested routine deprecated, please use omp_set_max_active_levels instead.
2022-03-25 12:28:06,809 - BERTopic - Reduced dimensionality with UMAP
2022-03-25 12:30:15,653 - BERTopic - Clustered UMAP embeddings with HDBSCAN
/Users/davidlaxer/anaconda3/lib/python3.8/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 3 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '

Process finished with exit code 137 (interrupted by signal 9: SIGKILL)

ginward (Author) commented Apr 17, 2022

@dbl001 Is it possible that your machine does not have enough RAM?

dbl001 commented Apr 17, 2022 via email
