
PicklingError encountered when using multiple GPUs #67681

Closed
TotalVariation opened this issue Nov 2, 2021 · 5 comments
Labels
module: dataloader — Related to torch.utils.data.DataLoader and Sampler; triaged — This issue has been looked at by a team member, and triaged and prioritized into an appropriate module

Comments


TotalVariation commented Nov 2, 2021

🐛 Bug

Hello,

I was using a Slurm-based HPC cluster with multiple GPUs to train deep learning models.

[Screenshots of the error traceback]

I believe this pickling error is caused by multiprocessing, but I have little knowledge of its internals. Thank you so much.

Environment

[Screenshots of the environment details]

cc @ssnl @VitalyFedyunin @ejguan @NivekT

ejguan (Contributor) commented Nov 2, 2021

Based on your trace, the problem potentially comes from your implementation of Dataset.

Could you please post more details about it so that we can reproduce the error?

ejguan added the module: dataloader and triaged labels on Nov 2, 2021
@TotalVariation (Author)

Thank you @ejguan for your quick response.

[Screenshot of the error traceback]

I believe this error is related to the multiprocessing used in PyTorch, which forces everything to be pickled. However, my code depends on some packages that use sh, which contains a class called GlobResults that is not meant to be pickled. I posted the same issue at amoffat/sh#590, which you can refer to.
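As a minimal illustration of why this fails (a sketch using a plain generator as a stand-in for sh.GlobResults; the class names here are hypothetical, not the actual code from this issue):

```python
import pickle

class BadDataset:
    """Hypothetical stand-in for a torch Dataset holding a lazy result."""
    def __init__(self, paths):
        # A generator (like sh's lazy glob result) cannot be pickled,
        # so DataLoader worker processes fail to receive this object.
        self.paths = (p for p in paths)

class GoodDataset:
    """Same dataset, but with the paths materialized into a plain list."""
    def __init__(self, paths):
        self.paths = list(paths)  # plain lists pickle fine

def is_picklable(obj):
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, TypeError):
        return False

print(is_picklable(BadDataset(["a.jpg", "b.jpg"])))   # False
print(is_picklable(GoodDataset(["a.jpg", "b.jpg"])))  # True
```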


ecederstrand commented Nov 3, 2021

Something is calling sh.glob(). Assuming this happens in your own code, you need to decide whether you want the globbing to happen before or after spawning the process. Either:

  1. Pass around the arguments for sh.glob() and run the command in the spawned process
  2. Convert the result of sh.glob() to something that can be pickled, and pass the result around:
>>> import pickle, sh
>>> pickle.dumps(sh.glob('*'))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
_pickle.PicklingError: Can't pickle <class 'sh.GlobResults'>: attribute lookup GlobResults on sh failed
>>> pickle.dumps(list(sh.glob('*')))
b'...'
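Applied to a Dataset, the two options might look like this sketch, using the standard-library glob module as a stand-in for sh.glob (the class names are hypothetical):

```python
import glob
import pickle

class LazyGlobDataset:
    """Option 1: carry only the pattern; run the glob in the worker."""
    def __init__(self, pattern):
        self.pattern = pattern  # a plain string pickles fine

    def resolve(self):
        # Call this after the worker process has been spawned.
        return sorted(glob.glob(self.pattern))

class EagerGlobDataset:
    """Option 2: glob up front, but store only a picklable list."""
    def __init__(self, pattern):
        self.paths = sorted(glob.glob(pattern))

# Both objects survive pickling, so DataLoader workers can receive them.
for ds in (LazyGlobDataset("*.jpg"), EagerGlobDataset("*.jpg")):
    pickle.loads(pickle.dumps(ds))
```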

@TotalVariation (Author)

@ecederstrand Thank you again for your solution. That was exactly the cause of the pickling error; I have solved the issue by following your code snippet.

NivekT (Contributor) commented Nov 5, 2021

Seems like this issue is resolved, so I will close this.

@TotalVariation, let us know if you have any further issues or suggestions for changes.

NivekT closed this as completed on Nov 5, 2021
4 participants