
BrokenPipeError: [Errno 32] Broken pipe #2341

Closed

mjchen611 opened this issue Aug 8, 2017 · 38 comments

Comments

@mjchen611

mjchen611 commented Aug 8, 2017

Hi, I use PyTorch to run a triplet network (GPU), but when I load data there is always a BrokenPipeError: [Errno 32] Broken pipe.

I thought something was wrong in the following code:

for batch_idx, (data1, data2, data3) in enumerate(test_loader):
    if args.cuda:
        data1, data2, data3 = data1.cuda(), data2.cuda(), data3.cuda()
    data1, data2, data3 = Variable(data1), Variable(data2), Variable(data3)

Can you give me some suggestions? Thank you so much.

@alykhantejani
Contributor

Would you be able to post a snippet of code that can reproduce this?

@mjchen611
Author

mjchen611 commented Aug 8, 2017

@alykhantejani

  1. The code link is: https://github.com/andreasveit/triplet-network-pytorch/blob/master/train.py

  2. The error occurred in train.py at line 136.

  3. The error was:

runfile('G:/researchWork2/pytorch/triplet-network-pytorch-master/train.py', wdir='G:/researchWork2/pytorch/triplet-network-pytorch-master')
Reloaded modules: triplet_mnist_loader, triplet_image_loader, tripletnet

Number of params: 21840
Traceback (most recent call last):

  File "", line 1, in <module>
    runfile('G:/researchWork2/pytorch/triplet-network-pytorch-master/train.py', wdir='G:/researchWork2/pytorch/triplet-network-pytorch-master')

  File "D:\Anaconda3\lib\site-packages\spyder\utils\site\sitecustomize.py", line 880, in runfile
    execfile(filename, namespace)

  File "D:\Anaconda3\lib\site-packages\spyder\utils\site\sitecustomize.py", line 102, in execfile
    exec(compile(f.read(), filename, 'exec'), namespace)

  File "G:/researchWork2/pytorch/triplet-network-pytorch-master/train.py", line 258, in <module>
    main()

  File "G:/researchWork2/pytorch/triplet-network-pytorch-master/train.py", line 116, in main
    train(train_loader, tnet, criterion, optimizer, epoch)

  File "G:/researchWork2/pytorch/triplet-network-pytorch-master/train.py", line 137, in train
    for batch_idx, (data1, data2) in enumerate(train_loader):

  File "D:\Anaconda3\lib\site-packages\torch\utils\data\dataloader.py", line 303, in __iter__
    return DataLoaderIter(self)

  File "D:\Anaconda3\lib\site-packages\torch\utils\data\dataloader.py", line 162, in __init__
    w.start()

  File "D:\Anaconda3\lib\multiprocessing\process.py", line 105, in start
    self._popen = self._Popen(self)

  File "D:\Anaconda3\lib\multiprocessing\context.py", line 223, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)

  File "D:\Anaconda3\lib\multiprocessing\context.py", line 322, in _Popen
    return Popen(process_obj)

  File "D:\Anaconda3\lib\multiprocessing\popen_spawn_win32.py", line 65, in __init__
    reduction.dump(process_obj, to_child)

  File "D:\Anaconda3\lib\multiprocessing\reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)

BrokenPipeError: [Errno 32] Broken pipe

  4. Part of the related training code is as follows:
def train(train_loader, tnet, criterion, optimizer, epoch):
    losses = AverageMeter()
    accs = AverageMeter()
    emb_norms = AverageMeter()

    # switch to train mode
    tnet.train()
    for batch_idx, (data1, data2, data3) in enumerate(train_loader):
        if args.cuda:
            data1, data2, data3 = data1.cuda(), data2.cuda(), data3.cuda()
        data1, data2, data3 = Variable(data1), Variable(data2), Variable(data3)

        # compute output
        dista, distb, embedded_x, embedded_y, embedded_z = tnet(data1, data2, data3)
        # 1 means, dista should be larger than distb
        target = torch.FloatTensor(dista.size()).fill_(1)
        if args.cuda:
            target = target.cuda()
        target = Variable(target)

        loss_triplet = criterion(dista, distb, target)
        loss_embedd = embedded_x.norm(2) + embedded_y.norm(2) + embedded_z.norm(2)
        loss = loss_triplet + 0.001 * loss_embedd

        # measure accuracy and record loss
        acc = accuracy(dista, distb)
        losses.update(loss_triplet.data[0], data1.size(0))
        accs.update(acc, data1.size(0))
        emb_norms.update(loss_embedd.data[0] / 3, data1.size(0))

        # compute gradient and do optimizer step
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        if batch_idx % args.log_interval == 0:
            print('Train Epoch: {} [{}/{}]\t'
                  'Loss: {:.4f} ({:.4f}) \t'
                  'Acc: {:.2f}% ({:.2f}%) \t'
                  'Emb_Norm: {:.2f} ({:.2f})'.format(
                epoch, batch_idx * len(data1), len(train_loader.dataset),
                losses.val, losses.avg,
                100. * accs.val, 100. * accs.avg, emb_norms.val, emb_norms.avg))

    # log avg values to somewhere
    plotter.plot('acc', 'train', epoch, accs.avg)
    plotter.plot('loss', 'train', epoch, losses.avg)
    plotter.plot('emb_norms', 'train', epoch, emb_norms.avg)

Thank you so much.

@mjchen611
Author

@alykhantejani
And I am running it on Windows 8.1 with CUDA.

@soumith
Member

soumith commented Aug 30, 2017

We do not support Windows officially yet. Maybe @peterjc123 knows what's wrong.

@soumith soumith closed this as completed Aug 30, 2017
@peterjc123
Collaborator

@mjchen611 You can set num_workers to 0 to see the actual error. Did you have your plotter correctly configured?
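A minimal sketch of that debugging step, assuming a typical DataLoader setup (train_dataset and the batch size here are placeholders):

from torch.utils.data import DataLoader

# With num_workers=0 the dataset runs in the main process, so any exception
# raised inside __getitem__ or a transform surfaces directly instead of being
# masked by a BrokenPipeError from a spawned worker.
debug_loader = DataLoader(train_dataset, batch_size=32, shuffle=True, num_workers=0)

for batch_idx, batch in enumerate(debug_loader):
    pass  # iterating once is enough to trigger any hidden dataset error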

@ratteripenta

ratteripenta commented Nov 23, 2017

I can verify that setting num_workers to 0 or 1 helped. With any higher value, DataLoader always failed for me regardless of the dataset. The error has to do with DataLoader's multiprocessing:


  File "D:/Opiskelu/PyTorch Tutorials/cnn_transfer_learning_cuda.py", line 76, in <module>
    inputs, classes = next(iter(dataloaders['train']))

  File "C:\Anaconda3\envs\ml\lib\site-packages\torch\utils\data\dataloader.py", line 301, in __iter__
    return DataLoaderIter(self)

  File "C:\Anaconda3\envs\ml\lib\site-packages\torch\utils\data\dataloader.py", line 158, in __init__
    w.start()

  File "C:\Anaconda3\envs\ml\lib\multiprocessing\process.py", line 105, in start
    self._popen = self._Popen(self)

  File "C:\Anaconda3\envs\ml\lib\multiprocessing\context.py", line 212, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)

  File "C:\Anaconda3\envs\ml\lib\multiprocessing\context.py", line 313, in _Popen
    return Popen(process_obj)

  File "C:\Anaconda3\envs\ml\lib\multiprocessing\popen_spawn_win32.py", line 66, in __init__
    reduction.dump(process_obj, to_child)

  File "C:\Anaconda3\envs\ml\lib\multiprocessing\reduction.py", line 59, in dump
    ForkingPickler(file, protocol).dump(obj)

BrokenPipeError: [Errno 32] Broken pipe

@peterjc123
Collaborator

peterjc123 commented Nov 23, 2017

@karmus89 Actually, this error only occurs when you try to do multiprocessing on code that has errors in it. It's unexpected that you face this issue when your code is right. I don't know which version you are using. Can you send a small piece of code that can reproduce your issue?

@ratteripenta

ratteripenta commented Nov 23, 2017

Will do! And remember, I'm using a Windows machine. The code is directly copied from the PyTorch Transfer Learning tutorial. This means that the dataset has to be downloaded and extracted as instructed.

The code to reproduce the error:

import torch
import torchvision
from torchvision import datasets, models, transforms
import os

data_transforms = {
    'train': transforms.Compose([
        transforms.RandomSizedCrop(224),
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
        transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
    ]),
    'val': transforms.Compose([
        transforms.Scale(256),
        transforms.CenterCrop(224),
        transforms.ToTensor(),
        transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
    ]),
}

data_dir = 'hymenoptera_data'
image_datasets = {x: datasets.ImageFolder(os.path.join(data_dir, x),
                                          data_transforms[x])
                  for x in ['train', 'val']}
dataloaders = {x: torch.utils.data.DataLoader(image_datasets[x], batch_size=4,
                                             shuffle=True, num_workers=4)
              for x in ['train', 'val']}

# The code will fail here when trying to iterate over the DataLoader with multiple workers (Windows only)
inputs, classes = next(iter(dataloaders['train']))

And I just made some PyTorch forum posts regarding this. The problem lies with Python's multiprocessing and Windows. Please see this PyTorch discussion reply, as I don't want to copy-paste too much here.

Edit:

Here's the code that doesn't crash, which at the same time complies with Python's multiprocessing programming guidelines for Windows machines:

import torch
import torchvision
from torchvision import datasets, models, transforms
import os

if __name__ == "__main__":
    
    data_transforms = {
        'train': transforms.Compose([
            transforms.RandomSizedCrop(224),
            transforms.RandomHorizontalFlip(),
            transforms.ToTensor(),
            transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
        ]),
        'val': transforms.Compose([
            transforms.Scale(256),
            transforms.CenterCrop(224),
            transforms.ToTensor(),
            transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
        ]),
    }
    
    data_dir = 'hymenoptera_data'
    image_datasets = {x: datasets.ImageFolder(os.path.join(data_dir, x),
                                              data_transforms[x])
                      for x in ['train', 'val']}
    dataloaders = {x: torch.utils.data.DataLoader(image_datasets[x], batch_size=4,
                                                 shuffle=True, num_workers=4)
                  for x in ['train', 'val']}

    inputs, classes = next(iter(dataloaders['train']))

@peterjc123
Collaborator

peterjc123 commented Nov 23, 2017

@karmus89 Well, I think I have stated it where the package was published. I'm so sad that you installed the package without reading the notice.

@ratteripenta

ratteripenta commented Nov 23, 2017

@peterjc123 Please see my edited response, where I did exactly that. The requirement to wrap the code inside an if __name__ == '__main__' block isn't immediately obvious, as it only applies to Windows machines.

Edit:
Regarding the statement of the requirement, I have indeed missed it. I used conda to install the package directly, so I never came across any introductory notes. But thanks anyway! And sorry for making you sad!

Edit 2:
Wow, I couldn't even have known where to look for that 😄 👍

@Dehde

Dehde commented Sep 21, 2018

A question regarding the above: I am running into this problem within a Jupyter notebook. How do you solve it there? Wrapping the code in if __name__ == '__main__': does not change a thing. Does someone know how to translate this fix to Jupyter notebooks?

@peterjc123
Collaborator

@Dehde What about setting the num_workers of the DataLoader to zero?

@Dehde

Dehde commented Sep 21, 2018

@peterjc123
Thanks for the quick reply! I did not fully make myself clear, sorry: is there a way to run PyTorch on Windows in a Jupyter notebook and still use the worker functionality, i.e. not set num_workers to zero? I definitely need parallelized preprocessing. Thanks for your time!

@peterjc123
Collaborator

Could you show me the minimal code so that I could reproduce?

@Dehde

Dehde commented Sep 22, 2018

@peterjc123
I will edit it into this post on Monday; I don't have access to the code right now. Thank you!

As promised, the code I use:

if __name__ == '__main__':

    batch_size = 256

    size = (128, 128)
    image_datasets = {}
    image_datasets["train"] = WaterbodyDataset(masks=train_masks, images=train_imgs,
                                               transform_img=transforms.Compose([
                                                   RandomCrop(size),
                                                   transforms.ToTensor(),
                                               ]),
                                               transform_mask=transforms.Compose([
                                                   RandomCrop(size),
                                                   transforms.ToTensor(),
                                               ]))

    image_datasets["val"] = WaterbodyDataset(masks=val_masks, images=val_imgs,
                                             transform_img=transforms.Compose([
                                                 transforms.ToTensor(),
                                             ]),
                                             transform_mask=transforms.Compose([
                                                 transforms.ToTensor()
                                             ]))

    dataloaders = {'train': torch.utils.data.DataLoader(image_datasets['train'], batch_size=batch_size,
                                                        shuffle=True, num_workers=1),
                   'val': torch.utils.data.DataLoader(image_datasets['val'], batch_size=batch_size,
                                                      shuffle=False, num_workers=1)}

    dataset_sizes = {x: len(image_datasets[x]) for x in ['train', 'val']}

    hps = HyperParams()
    hps.update("name=resnet34_128_deconv_pret00rained_bs32_adam_lr0.0001_wd0_pat5,"
               "arch=resnet34,input_channel=4,freeze=0,deconv=1,opt=adam,debug=0,"
               "weight_decay=0.0,patience=100,pretrained=1,lr=0.0001,print_freq=10,every_x_epoch_eval=1")
    pprint(attr.asdict(hps))

    model = Model(hps)
    model.train(dataloaders)

The WaterbodyDataset inherits from the PyTorch Dataset class.

@Jerry-Jie-Xie

I also got the same error. When I set num_workers to 0, the error does not appear again. However, when I set num_workers to 1, the error is still there.

@saurabh502

When I set num_workers to 0, there is no error.

@ghost

ghost commented Jan 13, 2019

Please, I need assistance with this error: "BrokenPipeError: [Errno 32] Broken pipe".
The code is from: https://github.com/higgsfield/np-hard-deep-reinforcement-learning/blob/master/Neural%20Combinatorial%20Optimization.ipynb
I am using Windows 10.

@MarcinMisiurewicz

  1. Wrap the code in if __name__ == '__main__':, but for me the error nonetheless sometimes appears again. I know it sounds silly, but what helps me then is just
  2. rebooting the computer.

Windows 10 here.

@BramVanroy
Contributor

I found that the issue is still present, but only when I use a custom collate_fn.

@angeloyeo

For me, just changing num_workers from 2 to 0 made the code work properly...

@cp9612

cp9612 commented Aug 2, 2019

I had the same issue when I ran the PyTorch Data Loading and Processing Tutorial. Changing num_workers from 2 to 0 solved the problem, but num_workers = 2 worked fine with other datasets. I use Windows.

@divyanshj16

num_workers > 0 doesn't work for me on Windows, even with the new IterableDataset.

@ShoufaChen

I met this same error. And while I was trying to find a method to solve the problem, the program continued running on its own (after waiting about 10 minutes). Amazing 😕

@CorentinJ

I've run the exact same code multiple times with different results. Also, I've copied code that causes a broken pipe to a new file (the contents being exactly the same) and it would run fine. I think there's an external factor in play here. I can't reproduce the bug anymore, but maybe try deleting your __pycache__ directory if there's any.

@germanjke

I have the same problem on Windows 10. I don't know why, but I think the problem is the DataLoader (setting num_workers to 0 doesn't help) and multiprocessing.

@morawi

morawi commented Mar 3, 2020

I have the same problem on Windows 10. I don't know why, but I think the problem is the DataLoader (setting num_workers to 0 doesn't help) and multiprocessing.

After using Ubuntu for quite some time, I have been trying Windows 10 lately (just for prototyping before moving to the cluster machine) and bumped into the same error; setting num_workers to 0 helped. Make sure you are setting it for all dataloaders: train, test, and validation.

@PiPiNam

PiPiNam commented Mar 5, 2020

I also have the same problem on Windows 10. I get the error message '[Errno 32] Broken pipe' when I set num_workers greater than 0, and my code is downloaded from the official PyTorch tutorial.

I guess this is a bug on Windows 10, and I am looking forward to seeing a fix in the next release.

@paleomoon

Same error; num_workers=0 worked, but I want multiprocessing to speed up data loading.

@morawi

morawi commented Mar 24, 2020

Same error; num_workers=0 worked, but I want multiprocessing to speed up data loading.

It seems that the only way for this to work is to use Linux. I am using Windows 10 for prototyping and then pushing everything to the cluster, which runs Linux. For example:

if platform.system() == 'Windows': n_cpu = 0
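Expanded into a minimal sketch (the dataset variable and the non-Windows worker count of 4 are placeholder assumptions):

import platform

from torch.utils.data import DataLoader

# Fall back to single-process loading on Windows, keep worker processes elsewhere.
n_cpu = 0 if platform.system() == 'Windows' else 4

loader = DataLoader(dataset, batch_size=32, shuffle=True, num_workers=n_cpu)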

@msminhas93

msminhas93 commented Apr 28, 2020

I also encountered a similar problem on Windows 10 when defining a custom torchvision dataset and trying to run it in JupyterLab. Apparently the custom dataset does not get registered as an attribute of the __main__ module, which the DataLoader's worker processes look up in multiprocessing\spawn.py. I fixed it by writing the dataset into a module and then importing it, as mentioned here:

https://stackoverflow.com/questions/41385708/multiprocessing-example-giving-attributeerror

  File "C:\Users\johndoe\Anaconda3\envs\PyTorch15\lib\multiprocessing\spawn.py", line 105, in spawn_main
    exitcode = _main(fd)
  File "C:\Users\johndoe\Anaconda3\envs\PyTorch15\lib\multiprocessing\spawn.py", line 115, in _main
    self = reduction.pickle.load(from_parent)
AttributeError: Can't get attribute 'RandomPatchExtractor' on <module '__main__' (built-in)>
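As an illustration of that workaround, a sketch follows: the class name is taken from the traceback above, but the module name, constructor, and usage are hypothetical stand-ins. The dataset class lives in its own .py file and the notebook only imports it, so the spawned workers can re-import it by module path when unpickling.

# my_datasets.py -- hypothetical module saved next to the notebook
from torch.utils.data import Dataset

class RandomPatchExtractor(Dataset):
    def __init__(self, samples):
        self.samples = samples

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        return self.samples[idx]

# notebook cell: import the class instead of defining it inline
from my_datasets import RandomPatchExtractor
from torch.utils.data import DataLoader

dataset = RandomPatchExtractor(list(range(100)))
loader = DataLoader(dataset, batch_size=8, num_workers=2)
batch = next(iter(loader))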

@arnabsinha99

@mjchen611 You can set num_workers to 0 to see the actual error. Did you have your plotter correctly configured?

Setting num_workers to 0 worked for me. Could you explain why this causes an error?

@ltjkoomen

I have noticed this issue is closed, but I do not think it is fixed. Is there any effort to fix the multiprocessing DataLoader on Windows? Currently there are two options as far as I know:

  1. wrap the code in if __name__ == '__main__':, which does not always work.
  2. do not use multiprocessing on Windows: if platform.system() == 'Windows': n_cpu = 0

So the first one is an imperfect fix, while the second one amounts to just giving up. Is there any effort to fix multiprocessed data loading on Windows going on somewhere else, or should we re-open this one?

@BlackTeaAttenuation

BlackTeaAttenuation commented Oct 4, 2020

Use if __name__ == '__main__' and '__file__' in globals(): instead of if __name__ == '__main__':.
That works for me. I use Jupyter Notebook and Windows 10.

This is the reference.

@doanhung95wkm

I got this problem when trying to train on my custom COCO dataset (which is a little different from the default CocoDetection PyTorch class). Adding the parameter collate_fn=utils.collate_fn worked for me (see the sketch below):
trainloader = torch.utils.data.DataLoader(coco_train, batch_size=2, shuffle=False, num_workers=1, collate_fn=utils.collate_fn)
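For context, the utils.collate_fn referred to here most likely comes from the torchvision detection reference scripts; a minimal stand-in with the same effect (an assumption, not the exact reference file) just batches samples as tuples of lists instead of stacking them:

def collate_fn(batch):
    # Detection samples have variable-sized targets, so keep each batch as
    # (tuple_of_images, tuple_of_targets) rather than stacking into one tensor.
    return tuple(zip(*batch))

With that, the DataLoader call above works unchanged apart from passing collate_fn=collate_fn.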

@bigbizze

If anyone runs into this issue and none of the above works, my problem ended up being that my file name had "-" in it, as opposed to, say, "_", and multiprocessing was unable to resolve the references as a result.

@willdone1337

You must put all the training code inside if __name__ == '__main__':.

@smolboii

Another thing is that, at least in my experience with detectron2, the number of workers has to be <= your CPU core count, unlike on Linux. So if you have 12 CPU cores like I do, you can't use more than 12 workers (not that that would be very beneficial to begin with, I suppose).

And with detectron2 in particular, if you use an evaluator this then doubles the number of workers, as it creates N additional workers (N being num_workers) for evaluation while the other workers are not terminated. So with a 12-core CPU you can actually only use 6 workers; a sketch of such a cap follows.
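A small sketch of that cap (the halving rule is only an illustration of the comment above, not something documented by detectron2):

import os

# Leave room for a framework that spawns a second set of N evaluation workers
# by keeping the training loader's worker count at half the available CPU cores.
cpu_cores = os.cpu_count() or 1
num_workers = max(1, cpu_cores // 2)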
