Traceback (most recent call last):
File "train.py", line 545, in <module>
main(args)
File "train.py", line 225, in main
model = torchvision.models.get_model(args.model, weights=args.weights)
File "/usr/local/lib/python3.7/dist-packages/torchvision/models/_api.py", line 225, in get_model
return fn(**config)
File "/usr/local/lib/python3.7/dist-packages/torchvision/models/vision_transformer.py", line 764, in vit_h_14
**kwargs,
File "/usr/local/lib/python3.7/dist-packages/torchvision/models/vision_transformer.py", line 335, in _vision_transformer
model.load_state_dict(weights.get_state_dict(progress=progress))
File "/usr/local/lib/python3.7/dist-packages/torchvision/models/_api.py", line 66, in get_state_dict
return load_state_dict_from_url(self.url, progress=progress)
File "/usr/local/lib/python3.7/dist-packages/torch/hub.py", line 731, in load_state_dict_from_url
return torch.load(cached_file, map_location=map_location)
File "/usr/local/lib/python3.7/dist-packages/torch/serialization.py", line 726, in load
with _open_zipfile_reader(opened_file) as opened_zipfile:
File "/usr/local/lib/python3.7/dist-packages/torch/serialization.py", line 262, in __init__
super(_open_zipfile_reader, self).__init__(torch._C.PyTorchFileReader(name_or_buffer))
RuntimeError: PytorchStreamReader failed reading zip archive: failed finding central directory
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 33723) of binary: /usr/bin/python3
Traceback (most recent call last):
File "/usr/local/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/usr/local/lib/python3.7/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
return f(*args, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/torch/distributed/run.py", line 761, in main
run(args)
File "/usr/local/lib/python3.7/dist-packages/torch/distributed/run.py", line 755, in run
)(*cmd_args)
File "/usr/local/lib/python3.7/dist-packages/torch/distributed/launcher/api.py", line 132, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/usr/local/lib/python3.7/dist-packages/torch/distributed/launcher/api.py", line 248, in launch_agent
failures=result.failures,
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
train.py FAILED
------------------------------------------------------------
RuntimeError: PytorchStreamReader failed reading zip archive: failed finding central directory
It seems the pretrained weights were either not downloaded completely or not saved to the expected location. They are normally cached in ~/.cache/torch/hub/checkpoints. Try deleting the existing vit_h_14_* file from that directory to force a fresh download, and make sure the path is accessible to your script when you run it.
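A minimal sketch of that cleanup, in case it helps. It assumes the checkpoint was saved in PyTorch's zip-based format (which the `_open_zipfile_reader` frame in the traceback suggests), so a truncated download can be detected with the standard-library `zipfile.is_zipfile` check. The helper names `find_corrupt_checkpoints` and `remove_corrupt_checkpoints` are made up for illustration and are not part of torch or torchvision.

```python
import zipfile
from pathlib import Path


def find_corrupt_checkpoints(cache_dir, pattern="vit_h_14*"):
    """Return cached files matching `pattern` that are not readable zip archives.

    Assumption: the checkpoints use the zip-based torch.save format, so a
    file that fails zipfile.is_zipfile is treated as a broken download.
    """
    cache_dir = Path(cache_dir).expanduser()
    corrupt = []
    for path in sorted(cache_dir.glob(pattern)):
        if path.is_file() and not zipfile.is_zipfile(path):
            corrupt.append(path)
    return corrupt


def remove_corrupt_checkpoints(cache_dir, pattern="vit_h_14*"):
    """Delete the broken files so the next get_model call downloads them again."""
    removed = find_corrupt_checkpoints(cache_dir, pattern)
    for path in removed:
        path.unlink()
    return removed
```

For the default cache location this would be called as `remove_corrupt_checkpoints("~/.cache/torch/hub/checkpoints")`. If `TORCH_HOME` is set, the cache lives elsewhere; `torch.hub.get_dir()` returns the hub directory, and the checkpoints sit in its `checkpoints` subfolder.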
I had the same problem when using pretrained weights to train the SSD model. The command I used is:
torchrun --nproc_per_node=8 train.py \
--dataset coco --model ssd300_vgg16 --epochs 120 \
--lr-steps 80 110 --aspect-ratio-group-factor 3 --lr 0.002 --batch-size 4 \
--weight-decay 0.0005 --data-augmentation ssd --weights-backbone VGG16_Weights.IMAGENET1K_FEATURES
I have already checked the weights in ~/.cache/torch/hub/checkpoints. Can you give me some advice?
🐛 Describe the bug
I hit the error below when training the vit_h_14 model with pretrained weights. If I do not load pretrained weights, everything works fine.
How to reproduce this bug
import torchvision
model = torchvision.models.get_model('vit_h_14', weights='DEFAULT')
# or
# model = torchvision.models.get_model('vit_h_14', weights='IMAGENET1K_SWAG_E2E_V1')
Versions
Collecting environment information...
PyTorch version: 1.13.0.dev20220810+cu113
Is debug build: False
CUDA used to build PyTorch: 11.3
ROCM used to build PyTorch: N/A
OS: Ubuntu 18.04.5 LTS (x86_64)
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.26
Python version: 3.7.5 (default, Dec 9 2021, 17:04:37) [GCC 8.4.0] (64-bit runtime)
Python platform: Linux-5.4.0-122-generic-x86_64-with-Ubuntu-18.04-bionic
Is CUDA available: True
CUDA runtime version: 11.2.152
GPU models and configuration:
GPU 0: NVIDIA GeForce RTX 3090
GPU 1: NVIDIA GeForce RTX 3090
GPU 2: NVIDIA GeForce RTX 3090
GPU 3: NVIDIA GeForce RTX 3090
Nvidia driver version: 470.141.03
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.1.0
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
Versions of relevant libraries:
[pip3] numpy==1.21.6
[pip3] pytorch-lightning==1.7.1
[pip3] pytorch-lightning-bolts==0.3.2.post1
[pip3] torch==1.13.0.dev20220810+cu113
[pip3] torchaudio==0.13.0.dev20220810+cu113
[pip3] torchmetrics==0.9.3
[pip3] torchvision==0.14.0.dev20220810+cu113
[conda] Could not collect