RuntimeError: received 0 items of ancdata #138

Closed
GCQi opened this issue Feb 21, 2023 · 10 comments

GCQi commented Feb 21, 2023

When I train LSTR on TuSimple with the command python main_landet.py --train --config ./configs/lane_detection/lstr/resnet18s_tusimple.py --mixed-precision, it runs for several epochs and then, at a random point, raises the error RuntimeError: received 0 items of ancdata

The error message is:

Traceback (most recent call last):
  File "main_landet.py", line 80, in <module>
    runner.run()
  File "/data/123/gcq/LaneDetection/pytorch-auto-drive/utils/runners/lane_det_trainer.py", line 35, in run
    for i, data in enumerate(self.dataloader, 0):
  File "/home/123/anaconda3/envs/pad/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 628, in __next__
    data = self._next_data()
  File "/home/123/anaconda3/envs/pad/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1316, in _next_data
    idx, data = self._get_data()
  File "/home/123/anaconda3/envs/pad/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1282, in _get_data
    success, data = self._try_get_data()
  File "/home/123/anaconda3/envs/pad/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1120, in _try_get_data
    data = self._data_queue.get(timeout=timeout)
  File "/home/123/anaconda3/envs/pad/lib/python3.8/multiprocessing/queues.py", line 116, in get
    return _ForkingPickler.loads(res)
  File "/home/123/anaconda3/envs/pad/lib/python3.8/site-packages/torch/multiprocessing/reductions.py", line 305, in rebuild_storage_fd
    fd = df.detach()
  File "/home/123/anaconda3/envs/pad/lib/python3.8/multiprocessing/resource_sharer.py", line 58, in detach
    return reduction.recv_handle(conn)
  File "/home/123/anaconda3/envs/pad/lib/python3.8/multiprocessing/reduction.py", line 189, in recv_handle
    return recvfds(s, 1)[0]
  File "/home/123/anaconda3/envs/pad/lib/python3.8/multiprocessing/reduction.py", line 164, in recvfds
    raise RuntimeError('received %d items of ancdata' %
RuntimeError: received 0 items of ancdata

GCQi commented Feb 21, 2023

It also shows me this warning:

/data/123/gcq/LaneDetection/pytorch-auto-drive/utils/datasets/utils.py:30: UserWarning: An output with one or more elements was resized since it had shape [88473600], which does not match the required output shape [128, 3, 360, 640]. This behavior is deprecated, and in a future PyTorch release outputs will not be resized unless they have zero elements. You can explicitly reuse an out tensor t by resizing it, inplace, to zero elements with t.resize_(0). (Triggered internally at /opt/conda/conda-bld/pytorch_1670525552411/work/aten/src/ATen/native/Resize.cpp:17.)
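A quick arithmetic check on the two shapes in that warning may help in reading it. The sketch below only uses the numbers printed in the warning; the 4 bytes per element (float32 images) is an assumption, not something the thread states. It suggests the flat buffer already holds exactly one batch worth of elements and was merely reshaped, so the warning looks like a shape mismatch rather than a size mismatch.

```python
import math

# Shapes copied from the warning text above.
flat_shape = (88473600,)
required_shape = (128, 3, 360, 640)  # batch, channels, height, width

flat_elements = math.prod(flat_shape)
required_elements = math.prod(required_shape)

# True means the preallocated buffer has the right number of elements
# and only its shape differs from the stacked batch.
same_count = flat_elements == required_elements

# Assumption: float32 images, i.e. 4 bytes per element.
BYTES_PER_ELEMENT = 4
batch_mib = required_elements * BYTES_PER_ELEMENT / 2**20

print(same_count)  # True
print(batch_mib)   # 337.5
```

Under that float32 assumption, every batch of 128 images moves roughly 337.5 MiB between a worker and the main process, which gives a sense of how heavy the loading is at this batch size.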

GCQi commented Feb 21, 2023

And I changed the batch size to 128, maybe it caused the error?

voldemortX (Owner) commented

> And I changed the batch size to 128, maybe it caused the error?

Yes, that is probably the reason. Scale it down and see if the issue persists. Usually, this loading error occurs when parallel data loading is too heavy for your system.

GCQi commented Feb 21, 2023

> > And I changed the batch size to 128, maybe it caused the error?
>
> Yes, that is probably the reason. Scale it down and see if the issue persists. Usually, this loading error occurs when parallel data loading is too heavy for your system.

I have now changed it to 64, and the error has not occurred so far.

GCQi commented Feb 26, 2023

Bad news: with the batch size still at 64 and the number of workers set to 32, the error RuntimeError: received 0 items of ancdata appeared again.

GCQi commented Feb 26, 2023

For reference, the train_augmentation is:

train_augmentation = dict(
    name='Compose',
    transforms=[
        dict(
            name='Resize',
            size_image=(360, 640),
            size_label=(360, 640)
        ),
        dict(
            name='RandomHorizontalFlip',
            flip_prob=0.5
        ),
        dict(
            name='RandomRotation',
            degrees=10
        ),
        dict(
            name='ColorJitter',
            brightness=0.4,
            contrast=0.4,
            saturation=0.4,
            hue=0.2
        ),
        dict(
            name='ToTensor'
        ),
        dict(
            name='RandomLighting',
            mean=0.0,
            std=0.1,
            eigen_value=[0.00341571, 0.01817699, 0.2141788],
            eigen_vector=[
                [0.41340352, -0.69563484, -0.58752847],
                [-0.81221408, 0.00994535, -0.5832747],
                [0.41158938, 0.71832671, -0.56089297]
            ]
        ),
        dict(
            name='Normalize',
            mean=[0.485, 0.456, 0.406],
            std=[0.229, 0.224, 0.225],
            normalize_target=True
        )
    ]
)

GCQi commented Feb 26, 2023

Have you encountered this problem before? I cannot get any useful information from the error message.

voldemortX commented Feb 26, 2023

@GCQi In my experience, this problem comes with data loading that is heavy relative to your hardware. A large batch size, more workers, and a long training schedule all increase the probability of encountering this error, which can happen halfway through training. You may find that my default batch size is kept at 20 for this very reason.
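One frequently reported trigger for this error is the per-process limit on open file descriptors, because PyTorch's default file_descriptor sharing strategy passes a descriptor for each tensor sent from a worker. Whether that is the cause here is an assumption; the thread does not confirm it. The sketch below uses only Python's standard resource module (Unix only) to inspect the limit and raise the soft limit toward a target. The function name and the target of 4096 are illustrative choices, not part of this repository.

```python
import resource


def raise_fd_soft_limit(target=4096):
    """Raise the soft open-file limit toward `target`, never past the hard limit.

    Returns the (soft, hard) limits in effect afterwards.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    unlimited = resource.RLIM_INFINITY

    # Nothing to do if the soft limit is already unlimited or high enough.
    if soft == unlimited or soft >= target:
        return soft, hard

    # An unprivileged process may raise its soft limit up to the hard limit.
    new_soft = target if hard == unlimited else min(target, hard)
    resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))
    return resource.getrlimit(resource.RLIMIT_NOFILE)


if __name__ == "__main__":
    print(raise_fd_soft_limit())
```

If this helps at all, it would need to run in the training script before the DataLoader is created. Running ulimit -n in the shell before launching training changes the same limit.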

voldemortX commented Feb 26, 2023

Sometimes the file_system strategy could help, but it has a memory leak issue of its own.

# torch.multiprocessing.set_sharing_strategy('file_system')
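As a minimal sketch of how that line could be enabled, assuming PyTorch is installed and that the call is made once at startup before any DataLoader is built (the exact place in main_landet.py is an assumption): the helper name use_file_system_sharing is hypothetical, while set_sharing_strategy, get_sharing_strategy, and get_all_sharing_strategies are existing torch.multiprocessing functions.

```python
import torch.multiprocessing as mp


def use_file_system_sharing():
    """Switch inter-process tensor sharing to 'file_system' when the platform supports it.

    Returns the name of the strategy that is active afterwards.
    """
    if "file_system" in mp.get_all_sharing_strategies():
        mp.set_sharing_strategy("file_system")
    return mp.get_sharing_strategy()


if __name__ == "__main__":
    # Call this before constructing the DataLoader and its workers.
    print(use_file_system_sharing())
```

The caveat above still applies: this strategy trades the descriptor limit for shared-memory files that can be left behind, so it is worth watching memory usage over a long run.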

GCQi commented Feb 26, 2023

OK, thanks for your help! This open-source framework is pretty good. Thanks for your contribution and great work.

GCQi closed this as completed Feb 26, 2023