OSError: [Errno 9] Bad file descriptor #42

Open
aws-stdun opened this issue Nov 13, 2022 · 1 comment
Labels
bug Something isn't working

Comments

aws-stdun commented Nov 13, 2022

How to reproduce

Using a p4d.24xlarge:

from parallelformers import parallelize
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "facebook/opt-66b"
batch_size = [1]
batch = [["out story begins on"] * bs for bs in batch_size]
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=False)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)
inputs = [tokenizer(seq, return_tensors="pt").input_ids for seq in batch]

# Shard the model across all 8 GPUs with parallelformers
parallelize(model, num_gpus=8, fp16=True)

# Run repeated generation; the error appears partway through this loop
for _ in range(100):
    model.generate(
        torch.cat(inputs, dim=0),
        do_sample=True,
        max_length=2048,
        num_return_sequences=1,
    )

The model loads fine and begins performing inference.
nvidia-smi shows all 8 GPUs at 90+% utilization for a while.
Then one GPU eventually drops to 0% while the others jump to 100%.
The terminal shows:

Traceback (most recent call last):                                                                         
  File "/home/ubuntu/miniconda3/envs/deepspeed/lib/python3.8/multiprocessing/queues.py", line 239, in _feed
    obj = _ForkingPickler.dumps(obj)                                                                       
  File "/home/ubuntu/miniconda3/envs/deepspeed/lib/python3.8/multiprocessing/reduction.py", line 51, in dumps                                                                                                         
    cls(buf, protocol).dump(obj)                                                                           
  File "/home/ubuntu/miniconda3/envs/deepspeed/lib/python3.8/site-packages/torch/multiprocessing/reductions.py", line 367, in reduce_storage                                                                          
    df = multiprocessing.reduction.DupFd(fd)                                                               
  File "/home/ubuntu/miniconda3/envs/deepspeed/lib/python3.8/multiprocessing/reduction.py", line 198, in DupFd                                                                                                        
    return resource_sharer.DupFd(fd)                                                                       
  File "/home/ubuntu/miniconda3/envs/deepspeed/lib/python3.8/multiprocessing/resource_sharer.py", line 48, in __init__                                                                                                
    new_fd = os.dup(fd)                                                                                    
OSError: [Errno 9] Bad file descriptor 

From there it seems to hang forever.

I realize this stack trace doesn't give enough context to trace the error back to parallelformers, which is frustrating. Maybe it's actually a bug in PyTorch or multiprocessing?
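In case it helps narrow things down, the failure happens inside torch's fd-based storage sharing (reduce_storage → DupFd → os.dup), so it might be worth dumping the active sharing strategy and the per-process open-file limit while the repro runs. A minimal diagnostic sketch, purely informational and not a workaround:

import resource
import torch.multiprocessing as mp

# Active tensor-sharing strategy; "file_descriptor" is the default on Linux
print("sharing strategy:", mp.get_sharing_strategy())
print("available strategies:", mp.get_all_sharing_strategies())

# Per-process open-file limits; fd-based sharing holds a descriptor per shared storage
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"RLIMIT_NOFILE: soft={soft} hard={hard}")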

Environment

  • OS : Ubuntu 20.04.4 LTS
  • Python version : 3.8.13
  • Transformers version : 4.24.0
  • Docker : No
  • Misc. : N/A
aws-stdun added the bug label on Nov 13, 2022
mkardas commented Jan 11, 2023

Changing the PyTorch sharing strategy seems to help:

import torch
torch.multiprocessing.set_sharing_strategy("file_system")
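For the repro above, this would presumably need to run at the top of the script, before parallelize starts pickling tensors across processes. A sketch, not verified on a p4d.24xlarge:

import torch
torch.multiprocessing.set_sharing_strategy("file_system")  # switch away from fd-based sharing

from parallelformers import parallelize
# ... build the tokenizer, model, and inputs exactly as in the original snippet ...
# parallelize(model, num_gpus=8, fp16=True)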
