Issue to load vit_h_14 model with pretrained weights (DEFAULT or IMAGENET1K_SWAG_E2E_V1) #6489

Open
giangdip2410 opened this issue Aug 25, 2022 · 2 comments

Comments

@giangdip2410

giangdip2410 commented Aug 25, 2022

🐛 Describe the bug

I hit the error below when training the vit_h_14 model with pretrained weights. The failure happens while the weights are being loaded. If I do not load pretrained weights, everything works fine.

How to reproduce this bug

import torchvision
model = torchvision.models.get_model('vit_h_14', weights='DEFAULT')
# or
# model = torchvision.models.get_model('vit_h_14', weights='IMAGENET1K_SWAG_E2E_V1')

Traceback (most recent call last):
  File "train.py", line 545, in <module>
    main(args)
  File "train.py", line 225, in main
    model = torchvision.models.get_model(args.model, weights=args.weights)
  File "/usr/local/lib/python3.7/dist-packages/torchvision/models/_api.py", line 225, in get_model
    return fn(**config)
  File "/usr/local/lib/python3.7/dist-packages/torchvision/models/vision_transformer.py", line 764, in vit_h_14
    **kwargs,
  File "/usr/local/lib/python3.7/dist-packages/torchvision/models/vision_transformer.py", line 335, in _vision_transformer
    model.load_state_dict(weights.get_state_dict(progress=progress))
  File "/usr/local/lib/python3.7/dist-packages/torchvision/models/_api.py", line 66, in get_state_dict
    return load_state_dict_from_url(self.url, progress=progress)
  File "/usr/local/lib/python3.7/dist-packages/torch/hub.py", line 731, in load_state_dict_from_url
    return torch.load(cached_file, map_location=map_location)
  File "/usr/local/lib/python3.7/dist-packages/torch/serialization.py", line 726, in load
    with _open_zipfile_reader(opened_file) as opened_zipfile:
  File "/usr/local/lib/python3.7/dist-packages/torch/serialization.py", line 262, in __init__
    super(_open_zipfile_reader, self).__init__(torch._C.PyTorchFileReader(name_or_buffer))
RuntimeError: PytorchStreamReader failed reading zip archive: failed finding central directory
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 33723) of binary: /usr/bin/python3
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/run.py", line 761, in main
    run(args)
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/run.py", line 755, in run
    )(*cmd_args)
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/launcher/api.py", line 248, in launch_agent
    failures=result.failures,
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
train.py FAILED
------------------------------------------------------------

Versions

Collecting environment information...
PyTorch version: 1.13.0.dev20220810+cu113
Is debug build: False
CUDA used to build PyTorch: 11.3
ROCM used to build PyTorch: N/A

OS: Ubuntu 18.04.5 LTS (x86_64)
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.26

Python version: 3.7.5 (default, Dec 9 2021, 17:04:37) [GCC 8.4.0] (64-bit runtime)
Python platform: Linux-5.4.0-122-generic-x86_64-with-Ubuntu-18.04-bionic
Is CUDA available: True
CUDA runtime version: 11.2.152
GPU models and configuration:
GPU 0: NVIDIA GeForce RTX 3090
GPU 1: NVIDIA GeForce RTX 3090
GPU 2: NVIDIA GeForce RTX 3090
GPU 3: NVIDIA GeForce RTX 3090

Nvidia driver version: 470.141.03
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.1.0
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

Versions of relevant libraries:
[pip3] numpy==1.21.6
[pip3] pytorch-lightning==1.7.1
[pip3] pytorch-lightning-bolts==0.3.2.post1
[pip3] torch==1.13.0.dev20220810+cu113
[pip3] torchaudio==0.13.0.dev20220810+cu113
[pip3] torchmetrics==0.9.3
[pip3] torchvision==0.14.0.dev20220810+cu113
[conda] Could not collect

@datumbox
Contributor

@giangdip2410 I can't reproduce the problem. The following works fine for me:

import torchvision
torchvision.models.get_model('vit_h_14', weights='DEFAULT')

Judging from your error message:

RuntimeError: PytorchStreamReader failed reading zip archive: failed finding central directory

It seems that the weights were either not downloaded properly or not saved to the right location. They are typically stored in ~/.cache/torch/hub/checkpoints. Try deleting the existing vit_h_14_* files from there to force a re-download, and make sure that path is accessible to your script when you run it.
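The cleanup suggested above could be sketched roughly as follows. This is only an illustration using the standard library: the helper name `remove_cached_checkpoints` is made up for this example, and the directory is the typical default mentioned above, which may differ on your machine (for example if `TORCH_HOME` is set; `torch.hub.get_dir()` reports the hub directory actually in use).

```python
from pathlib import Path


def remove_cached_checkpoints(checkpoint_dir, pattern="vit_h_14_*"):
    """Delete cached weight files matching `pattern` and return their names.

    Hypothetical helper for illustration; it is not part of torch or torchvision.
    """
    removed = []
    for path in sorted(Path(checkpoint_dir).expanduser().glob(pattern)):
        if path.is_file():
            path.unlink()  # the next get_model(..., weights=...) call re-downloads it
            removed.append(path.name)
    return removed
```

Calling `remove_cached_checkpoints("~/.cache/torch/hub/checkpoints")` returns the names of the files it deleted, or an empty list if nothing matched.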

@jxguo14

jxguo14 commented Jan 9, 2023

I had the same problem when using pretrained weights to train the SSD model. The command I used is:

torchrun --nproc_per_node=8 train.py \
    --dataset coco --model ssd300_vgg16 --epochs 120 \
    --lr-steps 80 110 --aspect-ratio-group-factor 3 --lr 0.002 --batch-size 4 \
    --weight-decay 0.0005 --data-augmentation ssd --weights-backbone VGG16_Weights.IMAGENET1K_FEATURES

I have already checked the weights in ~/.cache/torch/hub/checkpoints. Can you give me some advice?
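One way to check the cached files more concretely: "failed finding central directory" means the zip reader could not find the index at the end of the archive, which is what a truncated download would look like. The sketch below flags cached files that Python's `zipfile` module does not recognise as complete zip archives. The helper name `find_unreadable_checkpoints` is made up for this example, and it assumes the cached weights use the zip-based checkpoint format; a file saved in the older non-zip format would also be flagged even though it is intact.

```python
import zipfile
from pathlib import Path


def find_unreadable_checkpoints(checkpoint_dir, pattern="*.pth"):
    """Return names of cached files that do not look like complete zip archives.

    Hypothetical helper for illustration; it is not part of torch or torchvision.
    """
    bad = []
    for path in sorted(Path(checkpoint_dir).expanduser().glob(pattern)):
        # is_zipfile searches for the end-of-central-directory record,
        # which a partially downloaded file will be missing.
        if path.is_file() and not zipfile.is_zipfile(path):
            bad.append(path.name)
    return bad
```

Running `find_unreadable_checkpoints("~/.cache/torch/hub/checkpoints")` lists the candidates for deletion and re-download. An empty list means every matching file at least has a readable archive index.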
