BrokenProcessPool: A task has failed to un-serialize. #494
Comments
It seems that when the embedding size (or the data size) is less than a certain threshold, the error does not occur.
Is this related to the …
I have tried this on the command line, but the following error still occurs, so it is not an IPython issue:
Is it possible that the inter-process communication has a size limit?
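There is a size-related mechanism in joblib, though it is a memmapping threshold rather than a hard limit: by default, arrays larger than `max_nbytes` (roughly 1 MB) are dumped to disk and reloaded in the workers as read-only memory maps. A minimal sketch of that default behavior; the helper name `writable` and the array sizes are illustrative:

```python
import numpy as np
from joblib import Parallel, delayed

def writable(a):
    # Report whether the array arrived in the worker as a writable buffer.
    return a.flags.writeable

big = np.zeros(2_000_000)   # ~16 MB, well above joblib's default 1 MB threshold
small = np.zeros(10)        # small enough to be pickled whole

# Large inputs are memmapped to disk and arrive read-only in the worker;
# small inputs are pickled and stay writable.
flags = Parallel(n_jobs=2)(delayed(writable)(a) for a in [big, small])
```

If this matches the failure, the size threshold the thread observes would be joblib's memmapping cutoff, not a pickle limit.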
It looks like your error is in our old hdbscan code, in the parallel
Boruvka algorithm that computes the minimum spanning tree. A quick
workaround might be to set n_jobs=1 in hdbscan to deactivate the parallelism.
There is obviously a deeper issue in play, but this might alleviate the
problem in the short term.
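The reason a single job helps: with n_jobs=1, joblib runs everything in the parent process, so nothing is pickled and shipped to a worker, and the failing un-serialization step never runs. A small sketch of the principle using joblib directly (the array and function are illustrative; in hdbscan the analogous knob is believed to be the `core_dist_n_jobs` parameter of `HDBSCAN`):

```python
import numpy as np
from joblib import Parallel, delayed

arr = np.arange(10)
arr.setflags(write=False)   # simulate a read-only (memmapped) input

# n_jobs=1 runs sequentially in-process: no worker, no pickling, so a
# read-only buffer never has to survive a serialization round trip.
result = Parallel(n_jobs=1)(delayed(int)(x) for x in arr)
```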
On Fri, Oct 1, 2021 at 11:45 AM Jinhua Wang wrote:
I have tried this on the command line, but the following error still
occurs, so it is not a ipython issue:
joblib.externals.loky.process_executor._RemoteTraceback:
"""
Traceback (most recent call last):
File "/rds/user/jw983/hpc-work/pytorch-env2/lib/python3.9/site-packages/joblib/externals/loky/process_executor.py", line 404, in _process_worker
call_item = call_queue.get(block=True, timeout=timeout)
File "/usr/local/software/spack/spack-git/opt/spack/linux-rhel7-broadwell/gcc-5.4.0/python-3.9.6-sbr552hsx3zanhgi3ekdjp4rsn6o6ejq/lib/python3.9/multiprocessing/queues.py", line 122, in get
return _ForkingPickler.loads(res)
File "sklearn/neighbors/_binary_tree.pxi", line 1057, in sklearn.neighbors._kd_tree.BinaryTree.__setstate__
File "sklearn/neighbors/_binary_tree.pxi", line 999, in sklearn.neighbors._kd_tree.BinaryTree._update_memviews
File "stringsource", line 658, in View.MemoryView.memoryview_cwrapper
File "stringsource", line 349, in View.MemoryView.memoryview.__cinit__
ValueError: buffer source array is read-only
"""
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/rds/user/jw983/hpc-work/transcripts/notebook3/BERTopic.py", line 62, in <module>
topic_model = topic_model.fit(docs, embeddings)
File "/rds/user/jw983/hpc-work/pytorch-env2/lib/python3.9/site-packages/bertopic/_bertopic.py", line 212, in fit
self.fit_transform(documents, embeddings, y)
File "/rds/user/jw983/hpc-work/pytorch-env2/lib/python3.9/site-packages/bertopic/_bertopic.py", line 291, in fit_transform
documents, probabilities = self._cluster_embeddings(umap_embeddings, documents)
File "/rds/user/jw983/hpc-work/pytorch-env2/lib/python3.9/site-packages/bertopic/_bertopic.py", line 1386, in _cluster_embeddings
self.hdbscan_model.fit(umap_embeddings)
File "/rds/user/jw983/hpc-work/pytorch-env2/lib/python3.9/site-packages/hdbscan/hdbscan_.py", line 919, in fit
self._min_spanning_tree) = hdbscan(X, **kwargs)
File "/rds/user/jw983/hpc-work/pytorch-env2/lib/python3.9/site-packages/hdbscan/hdbscan_.py", line 610, in hdbscan
(single_linkage_tree, result_min_span_tree) = memory.cache(
File "/rds/user/jw983/hpc-work/pytorch-env2/lib/python3.9/site-packages/joblib/memory.py", line 352, in __call__
return self.func(*args, **kwargs)
File "/rds/user/jw983/hpc-work/pytorch-env2/lib/python3.9/site-packages/hdbscan/hdbscan_.py", line 275, in _hdbscan_boruvka_kdtree
alg = KDTreeBoruvkaAlgorithm(tree, min_samples, metric=metric,
File "hdbscan/_hdbscan_boruvka.pyx", line 392, in hdbscan._hdbscan_boruvka.KDTreeBoruvkaAlgorithm.__init__
File "hdbscan/_hdbscan_boruvka.pyx", line 426, in hdbscan._hdbscan_boruvka.KDTreeBoruvkaAlgorithm._compute_bounds
File "/rds/user/jw983/hpc-work/pytorch-env2/lib/python3.9/site-packages/joblib/parallel.py", line 1054, in __call__
self.retrieve()
File "/rds/user/jw983/hpc-work/pytorch-env2/lib/python3.9/site-packages/joblib/parallel.py", line 933, in retrieve
self._output.extend(job.get(timeout=self.timeout))
File "/rds/user/jw983/hpc-work/pytorch-env2/lib/python3.9/site-packages/joblib/_parallel_backends.py", line 542, in wrap_future_result
return future.result(timeout=timeout)
File "/usr/local/software/spack/spack-git/opt/spack/linux-rhel7-broadwell/gcc-5.4.0/python-3.9.6-sbr552hsx3zanhgi3ekdjp4rsn6o6ejq/lib/python3.9/concurrent/futures/_base.py", line 445, in result
return self.__get_result()
File "/usr/local/software/spack/spack-git/opt/spack/linux-rhel7-broadwell/gcc-5.4.0/python-3.9.6-sbr552hsx3zanhgi3ekdjp4rsn6o6ejq/lib/python3.9/concurrent/futures/_base.py", line 390, in __get_result
raise self._exception
joblib.externals.loky.process_executor.BrokenProcessPool: A task has failed to un-serialize. Please ensure that the arguments of the function are all picklable.
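For reference, the `ValueError: buffer source array is read-only` at the root of the quoted traceback can be reproduced without hdbscan at all: joblib hands large arrays to its workers as read-only memory maps, and any code that then needs a writable buffer fails. A minimal sketch (the file handling and values are illustrative):

```python
import tempfile
import numpy as np

# Write an array to disk and reload it the way joblib's workers do:
# as a read-only memory map.
data = np.arange(10.0)
with tempfile.NamedTemporaryFile(delete=False) as f:
    path = f.name
data.tofile(path)

mm = np.memmap(path, dtype=np.float64, mode="r")

try:
    mm[0] = 42.0            # any attempt to write the buffer
    message = "no error"
except ValueError as err:
    message = str(err)      # numpy refuses to write a read-only buffer
```

The Cython memoryview error in the traceback is the same condition surfacing one layer lower, when `__setstate__` tries to build a writable view over such a buffer.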
Thanks! Why is the hdbscan code old? Is there a newer version?
Sorry, that was a poor turn of phrase on my part. I haven't worked on the
hdbscan code base in a while.
In general, I suspect this is related to joblib and a version conflict. It
is likely the same issue that is currently being worked on in this thread,
if you'd like to have a look:
lmcinnes/umap#567 (comment)
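When chasing a version conflict like this, it helps to record exactly which versions are in play. A small diagnostic sketch; the package list is illustrative:

```python
from importlib.metadata import version, PackageNotFoundError

# Collect the versions of the packages implicated in the suspected
# conflict; missing packages are reported rather than crashing the check.
report = {}
for pkg in ("joblib", "numpy", "scikit-learn", "hdbscan"):
    try:
        report[pkg] = version(pkg)
    except PackageNotFoundError:
        report[pkg] = "not installed"
print(report)
```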
@jc-healy Would downgrading to numpy 1.19 work?
That is what I would try. It seemed to address a similar issue other
folks were having, and it would at least help identify whether these are
indeed the same issue.
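One way to pin the downgrade is a requirements constraint. The upper bound below is an assumption, based on the linked umap thread pointing at a numpy 1.20 incompatibility:

```
numpy>=1.19,<1.20
```

Reinstalling with this pin in the affected environment, then re-running the failing script, would confirm or rule out the numpy angle.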
I think this is related to this issue, given that the error only arises when the data size exceeds 140,000 rows. I suspect it is related to the maximum size of a memoryview that joblib can pickle.
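If joblib's memmapping threshold is the culprit rather than a pickle limit, it can be probed directly: passing `max_nbytes=None` to `joblib.Parallel` disables the automatic disk-memmapping, so even large arrays are pickled whole and arrive writable in the workers. A sketch of that knob (note it lives on joblib's own API; hdbscan's internal `Parallel` call may not expose it):

```python
import numpy as np
from joblib import Parallel, delayed

def writable(a):
    return a.flags.writeable

big = np.zeros(2_000_000)   # ~16 MB, far above the default 1 MB threshold

# max_nbytes=None turns off joblib's automatic disk-memmapping, so the
# array is pickled whole and reaches the worker as a writable copy.
flags = Parallel(n_jobs=2, max_nbytes=None)(
    delayed(writable)(a) for a in [big])
```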
This is related to this issue in BERTopic, and this issue in UMAP. Probably related to this issue and this issue, too.
I am currently using UMAP to reduce the dimensionality and then HDBSCAN to cluster the embeddings. However, I am running into the following error. Any idea why?
My data size is 10 million rows and 5 dimensions (reduced with UMAP from 384 dimensions). I have 1 TB of RAM and 32 cores.
I am using Jupyter Notebook.