
BrokenProcessPool: A task has failed to un-serialize. #494

Closed
ginward opened this issue Oct 1, 2021 · 11 comments
ginward commented Oct 1, 2021

This is related to this issue in BERTopic, and this issue in UMAP. Probably related to this issue and this issue, too.

I am currently using UMAP to reduce the dimensionality and then using HDBSCAN to cluster the embeddings. However, I am running into the following error. Any idea why?

My data size is 10 million rows and 5 dimensions (reduced with UMAP from 384 dimensions). I have 1TB of RAM and 32 cores.

I am using Jupyter Notebook.

---------------------------------------------------------------------------
_RemoteTraceback                          Traceback (most recent call last)
_RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/rds/user/jw983/hpc-work/pytorch-env2/lib/python3.9/site-packages/joblib/externals/loky/process_executor.py", line 404, in _process_worker
    call_item = call_queue.get(block=True, timeout=timeout)
  File "/usr/local/software/spack/spack-git/opt/spack/linux-rhel7-broadwell/gcc-5.4.0/python-3.9.6-sbr552hsx3zanhgi3ekdjp4rsn6o6ejq/lib/python3.9/multiprocessing/queues.py", line 122, in get
    return _ForkingPickler.loads(res)
  File "sklearn/neighbors/_binary_tree.pxi", line 1057, in sklearn.neighbors._kd_tree.BinaryTree.__setstate__
  File "sklearn/neighbors/_binary_tree.pxi", line 999, in sklearn.neighbors._kd_tree.BinaryTree._update_memviews
  File "stringsource", line 658, in View.MemoryView.memoryview_cwrapper
  File "stringsource", line 349, in View.MemoryView.memoryview.__cinit__
ValueError: buffer source array is read-only
"""

The above exception was the direct cause of the following exception:

BrokenProcessPool                         Traceback (most recent call last)
/tmp/ipykernel_778601/1248467627.py in <module>
----> 1 test1=hdbscan_model.fit(embedding_test)

/rds/user/jw983/hpc-work/pytorch-env2/lib/python3.9/site-packages/hdbscan/hdbscan_.py in fit(self, X, y)
    917          self._condensed_tree,
    918          self._single_linkage_tree,
--> 919          self._min_spanning_tree) = hdbscan(X, **kwargs)
    920 
    921         if self.prediction_data:

/rds/user/jw983/hpc-work/pytorch-env2/lib/python3.9/site-packages/hdbscan/hdbscan_.py in hdbscan(X, min_cluster_size, min_samples, alpha, cluster_selection_epsilon, metric, p, leaf_size, algorithm, memory, approx_min_span_tree, gen_min_span_tree, core_dist_n_jobs, cluster_selection_method, allow_single_cluster, match_reference_implementation, **kwargs)
    608                                            gen_min_span_tree, **kwargs)
    609             else:
--> 610                 (single_linkage_tree, result_min_span_tree) = memory.cache(
    611                     _hdbscan_boruvka_kdtree)(X, min_samples, alpha,
    612                                              metric, p, leaf_size,

/rds/user/jw983/hpc-work/pytorch-env2/lib/python3.9/site-packages/joblib/memory.py in __call__(self, *args, **kwargs)
    350 
    351     def __call__(self, *args, **kwargs):
--> 352         return self.func(*args, **kwargs)
    353 
    354     def call_and_shelve(self, *args, **kwargs):

/rds/user/jw983/hpc-work/pytorch-env2/lib/python3.9/site-packages/hdbscan/hdbscan_.py in _hdbscan_boruvka_kdtree(X, min_samples, alpha, metric, p, leaf_size, approx_min_span_tree, gen_min_span_tree, core_dist_n_jobs, **kwargs)
    273 
    274     tree = KDTree(X, metric=metric, leaf_size=leaf_size, **kwargs)
--> 275     alg = KDTreeBoruvkaAlgorithm(tree, min_samples, metric=metric,
    276                                  leaf_size=leaf_size // 3,
    277                                  approx_min_span_tree=approx_min_span_tree,

hdbscan/_hdbscan_boruvka.pyx in hdbscan._hdbscan_boruvka.KDTreeBoruvkaAlgorithm.__init__()

hdbscan/_hdbscan_boruvka.pyx in hdbscan._hdbscan_boruvka.KDTreeBoruvkaAlgorithm._compute_bounds()

/rds/user/jw983/hpc-work/pytorch-env2/lib/python3.9/site-packages/joblib/parallel.py in __call__(self, iterable)
   1052 
   1053             with self._backend.retrieval_context():
-> 1054                 self.retrieve()
   1055             # Make sure that we get a last message telling us we are done
   1056             elapsed_time = time.time() - self._start_time

/rds/user/jw983/hpc-work/pytorch-env2/lib/python3.9/site-packages/joblib/parallel.py in retrieve(self)
    931             try:
    932                 if getattr(self._backend, 'supports_timeout', False):
--> 933                     self._output.extend(job.get(timeout=self.timeout))
    934                 else:
    935                     self._output.extend(job.get())

/rds/user/jw983/hpc-work/pytorch-env2/lib/python3.9/site-packages/joblib/_parallel_backends.py in wrap_future_result(future, timeout)
    540         AsyncResults.get from multiprocessing."""
    541         try:
--> 542             return future.result(timeout=timeout)
    543         except CfTimeoutError as e:
    544             raise TimeoutError from e

/usr/local/software/spack/spack-git/opt/spack/linux-rhel7-broadwell/gcc-5.4.0/python-3.9.6-sbr552hsx3zanhgi3ekdjp4rsn6o6ejq/lib/python3.9/concurrent/futures/_base.py in result(self, timeout)
    443                     raise CancelledError()
    444                 elif self._state == FINISHED:
--> 445                     return self.__get_result()
    446                 else:
    447                     raise TimeoutError()

/usr/local/software/spack/spack-git/opt/spack/linux-rhel7-broadwell/gcc-5.4.0/python-3.9.6-sbr552hsx3zanhgi3ekdjp4rsn6o6ejq/lib/python3.9/concurrent/futures/_base.py in __get_result(self)
    388         if self._exception:
    389             try:
--> 390                 raise self._exception
    391             finally:
    392                 # Break a reference cycle with the exception in self._exception

BrokenProcessPool: A task has failed to un-serialize. Please ensure that the arguments of the function are all picklable.


ginward commented Oct 1, 2021

It seems that when the embedding size (or the data size) is less than 100,000 rows, it runs fine. But if it is larger than that, the above error arises.


ginward commented Oct 1, 2021

Is this related to the core_dist_n_jobs parameter?


ginward commented Oct 1, 2021

I have tried this on the command line, but the following error still occurs, so it is not an IPython issue:

joblib.externals.loky.process_executor._RemoteTraceback:
"""
Traceback (most recent call last):
  File "/rds/user/jw983/hpc-work/pytorch-env2/lib/python3.9/site-packages/joblib/externals/loky/process_executor.py", line 404, in _process_worker
    call_item = call_queue.get(block=True, timeout=timeout)
  File "/usr/local/software/spack/spack-git/opt/spack/linux-rhel7-broadwell/gcc-5.4.0/python-3.9.6-sbr552hsx3zanhgi3ekdjp4rsn6o6ejq/lib/python3.9/multiprocessing/queues.py", line 122, in get
    return _ForkingPickler.loads(res)
  File "sklearn/neighbors/_binary_tree.pxi", line 1057, in sklearn.neighbors._kd_tree.BinaryTree.__setstate__
  File "sklearn/neighbors/_binary_tree.pxi", line 999, in sklearn.neighbors._kd_tree.BinaryTree._update_memviews
  File "stringsource", line 658, in View.MemoryView.memoryview_cwrapper
  File "stringsource", line 349, in View.MemoryView.memoryview.__cinit__
ValueError: buffer source array is read-only
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/rds/user/jw983/hpc-work/transcripts/notebook3/BERTopic.py", line 62, in <module>
    topic_model = topic_model.fit(docs, embeddings)
  File "/rds/user/jw983/hpc-work/pytorch-env2/lib/python3.9/site-packages/bertopic/_bertopic.py", line 212, in fit
    self.fit_transform(documents, embeddings, y)
  File "/rds/user/jw983/hpc-work/pytorch-env2/lib/python3.9/site-packages/bertopic/_bertopic.py", line 291, in fit_transform
    documents, probabilities = self._cluster_embeddings(umap_embeddings, documents)
  File "/rds/user/jw983/hpc-work/pytorch-env2/lib/python3.9/site-packages/bertopic/_bertopic.py", line 1386, in _cluster_embeddings
    self.hdbscan_model.fit(umap_embeddings)
  File "/rds/user/jw983/hpc-work/pytorch-env2/lib/python3.9/site-packages/hdbscan/hdbscan_.py", line 919, in fit
    self._min_spanning_tree) = hdbscan(X, **kwargs)
  File "/rds/user/jw983/hpc-work/pytorch-env2/lib/python3.9/site-packages/hdbscan/hdbscan_.py", line 610, in hdbscan
    (single_linkage_tree, result_min_span_tree) = memory.cache(
  File "/rds/user/jw983/hpc-work/pytorch-env2/lib/python3.9/site-packages/joblib/memory.py", line 352, in __call__
    return self.func(*args, **kwargs)
  File "/rds/user/jw983/hpc-work/pytorch-env2/lib/python3.9/site-packages/hdbscan/hdbscan_.py", line 275, in _hdbscan_boruvka_kdtree
    alg = KDTreeBoruvkaAlgorithm(tree, min_samples, metric=metric,
  File "hdbscan/_hdbscan_boruvka.pyx", line 392, in hdbscan._hdbscan_boruvka.KDTreeBoruvkaAlgorithm.__init__
  File "hdbscan/_hdbscan_boruvka.pyx", line 426, in hdbscan._hdbscan_boruvka.KDTreeBoruvkaAlgorithm._compute_bounds
  File "/rds/user/jw983/hpc-work/pytorch-env2/lib/python3.9/site-packages/joblib/parallel.py", line 1054, in __call__
    self.retrieve()
  File "/rds/user/jw983/hpc-work/pytorch-env2/lib/python3.9/site-packages/joblib/parallel.py", line 933, in retrieve
    self._output.extend(job.get(timeout=self.timeout))
  File "/rds/user/jw983/hpc-work/pytorch-env2/lib/python3.9/site-packages/joblib/_parallel_backends.py", line 542, in wrap_future_result
    return future.result(timeout=timeout)
  File "/usr/local/software/spack/spack-git/opt/spack/linux-rhel7-broadwell/gcc-5.4.0/python-3.9.6-sbr552hsx3zanhgi3ekdjp4rsn6o6ejq/lib/python3.9/concurrent/futures/_base.py", line 445, in result
    return self.__get_result()
  File "/usr/local/software/spack/spack-git/opt/spack/linux-rhel7-broadwell/gcc-5.4.0/python-3.9.6-sbr552hsx3zanhgi3ekdjp4rsn6o6ejq/lib/python3.9/concurrent/futures/_base.py", line 390, in __get_result
    raise self._exception
joblib.externals.loky.process_executor.BrokenProcessPool: A task has failed to un-serialize. Please ensure that the arguments of the function are all picklable.


ginward commented Oct 1, 2021

Is it possible that the inter-process communication has a size limit?


jc-healy commented Oct 1, 2021 via email


ginward commented Oct 1, 2021

> It looks like your error is in our old hdbscan code within the parallel Boruvka algorithm for computing the minimal spanning tree. A quick workaround might be to set n_jobs=1 in hdbscan to deactivate the parallelism. There is obviously a deeper issue in play, but this might help alleviate the problem in the short term.

Thanks! Why is the hdbscan code old? Is there a newer version?


jc-healy commented Oct 1, 2021 via email


ginward commented Oct 1, 2021

@jc-healy Would downgrading to numpy 1.19 work?


jc-healy commented Oct 1, 2021 via email


ginward commented Oct 3, 2021

I think this is related to this issue here, given that the error only arises when the data is larger than 140,000 rows. I think it is related to the maximum size of memoryview that joblib can pickle.
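The size threshold described above is consistent with joblib's auto-memmapping: by default, the loky backend memory-maps any input array larger than `max_nbytes` (1M) and hands workers a read-only view, which would explain the "buffer source array is read-only" ValueError in the traceback. A minimal sketch of that behaviour with joblib alone (not the hdbscan internals):

```python
import numpy as np
from joblib import Parallel, delayed

def writable(arr):
    # Workers receive a read-only np.memmap when joblib auto-memmaps the input
    return arr.flags.writeable

big = np.zeros(2_000_000)  # ~16 MB of float64, well above the default 1M threshold

# Default: the large array is memory-mapped read-only in the worker processes
default_flags = Parallel(n_jobs=2)(delayed(writable)(big) for _ in range(2))

# max_nbytes=None disables auto-memmapping, so workers get an ordinary
# (writable) pickled copy of the array instead
copied_flags = Parallel(n_jobs=2, max_nbytes=None)(
    delayed(writable)(big) for _ in range(2)
)
```

Code that requires a writable buffer in the worker (as the Cython memoryview in sklearn's KDTree does here) will fail under the default memmapping but succeed with `max_nbytes=None`.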


ginward commented Oct 3, 2021

@jc-healy Please see #495 for a proposed solution. Multiprocessing works for large data now.
