Issue to load vit_h_14 model with pretrained weights (DEFAULT or IMAGENET1K_SWAG_E2E_V1) #6489

giangdip2410 · 2022-08-25T02:29:10Z

🐛 Describe the bug

I hit the error below when loading the vit_h_14 model with pretrained weights in my training script. If I do not load pretrained weights, everything works.

How to reproduce this bug

import torchvision
model = torchvision.models.get_model('vit_h_14', weights='DEFAULT')
# or
# model = torchvision.models.get_model('vit_h_14', weights='IMAGENET1K_SWAG_E2E_V1')

Traceback (most recent call last):
  File "train.py", line 545, in <module>
    main(args)
  File "train.py", line 225, in main
    model = torchvision.models.get_model(args.model, weights=args.weights)
  File "/usr/local/lib/python3.7/dist-packages/torchvision/models/_api.py", line 225, in get_model
    return fn(**config)
  File "/usr/local/lib/python3.7/dist-packages/torchvision/models/vision_transformer.py", line 764, in vit_h_14
    **kwargs,
  File "/usr/local/lib/python3.7/dist-packages/torchvision/models/vision_transformer.py", line 335, in _vision_transformer
    model.load_state_dict(weights.get_state_dict(progress=progress))
  File "/usr/local/lib/python3.7/dist-packages/torchvision/models/_api.py", line 66, in get_state_dict
    return load_state_dict_from_url(self.url, progress=progress)
  File "/usr/local/lib/python3.7/dist-packages/torch/hub.py", line 731, in load_state_dict_from_url
    return torch.load(cached_file, map_location=map_location)
  File "/usr/local/lib/python3.7/dist-packages/torch/serialization.py", line 726, in load
    with _open_zipfile_reader(opened_file) as opened_zipfile:
  File "/usr/local/lib/python3.7/dist-packages/torch/serialization.py", line 262, in __init__
    super(_open_zipfile_reader, self).__init__(torch._C.PyTorchFileReader(name_or_buffer))
RuntimeError: PytorchStreamReader failed reading zip archive: failed finding central directory
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 33723) of binary: /usr/bin/python3
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/run.py", line 761, in main
    run(args)
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/run.py", line 755, in run
    )(*cmd_args)
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/launcher/api.py", line 248, in launch_agent
    failures=result.failures,
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
train.py FAILED
------------------------------------------------------------

Versions

Collecting environment information...
PyTorch version: 1.13.0.dev20220810+cu113
Is debug build: False
CUDA used to build PyTorch: 11.3
ROCM used to build PyTorch: N/A

OS: Ubuntu 18.04.5 LTS (x86_64)
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.26

Python version: 3.7.5 (default, Dec 9 2021, 17:04:37) [GCC 8.4.0] (64-bit runtime)
Python platform: Linux-5.4.0-122-generic-x86_64-with-Ubuntu-18.04-bionic
Is CUDA available: True
CUDA runtime version: 11.2.152
GPU models and configuration:
GPU 0: NVIDIA GeForce RTX 3090
GPU 1: NVIDIA GeForce RTX 3090
GPU 2: NVIDIA GeForce RTX 3090
GPU 3: NVIDIA GeForce RTX 3090

Nvidia driver version: 470.141.03
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.1.0
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

Versions of relevant libraries:
[pip3] numpy==1.21.6
[pip3] pytorch-lightning==1.7.1
[pip3] pytorch-lightning-bolts==0.3.2.post1
[pip3] torch==1.13.0.dev20220810+cu113
[pip3] torchaudio==0.13.0.dev20220810+cu113
[pip3] torchmetrics==0.9.3
[pip3] torchvision==0.14.0.dev20220810+cu113
[conda] Could not collect


datumbox · 2022-08-25T08:12:38Z

@giangdip2410 I can't reproduce the problem. The following works fine for me:

import torchvision
torchvision.models.get_model('vit_h_14', weights='DEFAULT')

Judging from your error message:

RuntimeError: PytorchStreamReader failed reading zip archive: failed finding central directory

It seems that the cached weights file was either not downloaded completely or is not in the location the loader expects. Weights are typically stored in ~/.cache/torch/hub/checkpoints. Try deleting the existing vit_h_14_* files from there to force a redownload, and make sure that path is accessible to your script when it runs.
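As a rough way to check the cache before deleting anything, the sketch below lists cached files that are not readable zip archives. It relies on two assumptions that are not confirmed in this thread: that the checkpoint was saved in PyTorch's zip-based format (the traceback goes through the zip reader, which suggests so), and that the cache lives in the default directory mentioned above rather than a custom TORCH_HOME. The helper names `find_broken_checkpoints` and `remove_broken_checkpoints` are made up for illustration and are not part of torch or torchvision.

```python
import zipfile
from pathlib import Path

# Default cache location mentioned above (assumption: TORCH_HOME is not overridden).
CHECKPOINT_DIR = Path.home() / ".cache" / "torch" / "hub" / "checkpoints"


def find_broken_checkpoints(directory=CHECKPOINT_DIR, pattern="vit_h_14*"):
    """Return files matching `pattern` that are not complete zip archives.

    A truncated download has no zip end-of-central-directory record,
    so zipfile.is_zipfile() returns False for it.
    """
    directory = Path(directory)
    if not directory.is_dir():
        return []
    return sorted(
        path
        for path in directory.glob(pattern)
        if path.is_file() and not zipfile.is_zipfile(path)
    )


def remove_broken_checkpoints(directory=CHECKPOINT_DIR, pattern="vit_h_14*", dry_run=True):
    """Report broken checkpoints and delete them unless dry_run is True."""
    broken = find_broken_checkpoints(directory, pattern)
    for path in broken:
        action = "would delete" if dry_run else "deleting"
        print(f"{action}: {path} ({path.stat().st_size} bytes)")
        if not dry_run:
            path.unlink()
    return broken


if __name__ == "__main__":
    # Dry run by default: only prints what would be removed.
    remove_broken_checkpoints(dry_run=True)
```

The dry run is the default because a checkpoint saved in the older, non-zip format would also be flagged here. After removing a broken file, the next `get_model(..., weights=...)` call should download it again.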

jxguo14 · 2023-01-09T07:31:25Z

I had the same problem when using pretrained weights to train the SSD model. The command I used is:

torchrun --nproc_per_node=8 train.py \
--dataset coco --model ssd300_vgg16 --epochs 120 \
--lr-steps 80 110 --aspect-ratio-group-factor 3 --lr 0.002 --batch-size 4 \
--weight-decay 0.0005 --data-augmentation ssd --weights-backbone VGG16_Weights.IMAGENET1K_FEATURES

I have already checked the weights in ~/.cache/torch/hub/checkpoints. Can you give me some advice?
