This repository has been archived by the owner on Jan 6, 2023. It is now read-only.
When running the imagenet example from examples/imagenet, I get the following error:
[INFO] 2021-05-30 13:09:18,531 api: [default] Starting worker group
=> set cuda device = 0
=> creating model: resnet18
=> no workers have checkpoints, starting from epoch 0
=> start_epoch: 0, best_acc1: 0
Traceback (most recent call last):
  File "main.py", line 594, in <module>
    main()
  File "main.py", line 183, in main
    train(train_loader, model, criterion, optimizer, epoch, device_id, print_freq)
  File "main.py", line 455, in train
    acc1, acc5 = accuracy(output, target, topk=(1, 5))
  File "main.py", line 588, in accuracy
    correct_k = correct[:k].view(-1).float().sum(0, keepdim=True)
RuntimeError: view size is not compatible with input tensor's size and stride (at least one dimension spans across two contiguous subspaces). Use .reshape(...) instead.
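A likely fix, following the hint in the error message itself, is to replace .view(-1) with .reshape(-1) in accuracy(). The sketch below is not the exact code from main.py; it is a reconstruction that assumes the function follows the usual top-k accuracy pattern (topk, transpose, compare), with the failing line taken from the traceback above. After the transpose, the sliced tensor may not be contiguous, which view() cannot handle and reshape() can.

```python
import torch


def accuracy(output, target, topk=(1,)):
    """Compute top-k accuracy (in percent) for each k in topk.

    Assumed reconstruction of the example's helper, not the verbatim source.
    """
    with torch.no_grad():
        maxk = max(topk)
        batch_size = target.size(0)

        # pred has shape (batch, maxk); after .t() it is (maxk, batch)
        # and may no longer be contiguous in memory.
        _, pred = output.topk(maxk, 1, True, True)
        pred = pred.t()
        correct = pred.eq(target.view(1, -1).expand_as(pred))

        res = []
        for k in topk:
            # reshape() falls back to a copy when a view is not possible,
            # so it also works on a non-contiguous slice.
            correct_k = correct[:k].reshape(-1).float().sum(0, keepdim=True)
            res.append(correct_k.mul_(100.0 / batch_size))
        return res
```

Calling .contiguous() before .view(-1) should work as well; reshape() is simply the shorter change.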
Component (check all that apply):
[ ] state api
[ ] train_step api
[ ] train_loop
[ ] rendezvous
[ ] checkpoint
[ ] rollback
[ ] metrics
[ ] petctl
[x] examples
[ ] docker
[ ] other
To Reproduce
See environment
Expected behavior
Training should work and accuracy should be reported correctly
Environment
Dockerfile:
FROM pytorch/pytorch:1.8.1-cuda11.1-cudnn8-runtime
RUN apt-get -q update && apt-get -q install -y wget unzip
RUN pip install torchelastic==0.2.2
RUN mkdir ./train
COPY elastic/examples/imagenet/main.py ./train
WORKDIR ./train
RUN chmod -R a+w .
USER root
ENTRYPOINT ["python", "-m", "torchelastic.distributed.launch"]
CMD ["--help"]