Imagenet example fails during accuracy calculation (v0.2.2 on 1.8.1) #150

assapin · 2021-05-30T13:36:57Z

🐛 Bug

When running the imagenet example from examples/imagenet,
I get the following error:

[INFO] 2021-05-30 13:09:18,531 api: [default] Starting worker group
=> set cuda device = 0
=> creating model: resnet18
=> no workers have checkpoints, starting from epoch 0
=> start_epoch: 0, best_acc1: 0
Traceback (most recent call last):
  File "main.py", line 594, in <module>
    main()
  File "main.py", line 183, in main
    train(train_loader, model, criterion, optimizer, epoch, device_id, print_freq)
  File "main.py", line 455, in train
    acc1, acc5 = accuracy(output, target, topk=(1, 5))
  File "main.py", line 588, in accuracy
    correct_k = correct[:k].view(-1).float().sum(0, keepdim=True)
RuntimeError: view size is not compatible with input tensor's size and stride (at least one dimension spans across two contiguous subspaces). Use .reshape(...) instead.
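
For anyone hitting the same error, here is a minimal sketch of a workaround that follows the hint in the error message: swap `.view(-1)` for `.reshape(-1)` in the `accuracy` helper. The function body is my reconstruction of the usual ImageNet example helper, not a copy of the actual `main.py`, and the sample tensors are made up for illustration. My understanding (an assumption, not verified against the PyTorch source) is that on recent PyTorch versions `correct` inherits the non-contiguous strides of the transposed `pred`, so `correct[:k]` cannot be flattened by `view` when k > 1.

```python
import torch


def accuracy(output, target, topk=(1,)):
    """Top-k accuracy in percent; reconstructed sketch, not the original main.py."""
    with torch.no_grad():
        maxk = max(topk)
        batch_size = target.size(0)

        # pred: (batch, maxk) class indices, transposed to (maxk, batch).
        _, pred = output.topk(maxk, 1, True, True)
        pred = pred.t()
        correct = pred.eq(target.view(1, -1).expand_as(pred))

        res = []
        for k in topk:
            # reshape() copies when the slice is not contiguous; view() would raise.
            correct_k = correct[:k].reshape(-1).float().sum(0, keepdim=True)
            res.append(correct_k.mul_(100.0 / batch_size))
        return res


# Made-up logits for 4 samples and 3 classes, only to exercise the helper.
output = torch.tensor([
    [0.1, 0.9, 0.0],
    [0.8, 0.1, 0.1],
    [0.2, 0.3, 0.5],
    [0.3, 0.6, 0.1],
])
target = torch.tensor([1, 0, 0, 0])

acc1, acc2 = accuracy(output, target, topk=(1, 2))
print(acc1.item(), acc2.item())
```

`reshape` returns a view when the layout allows it and falls back to a copy otherwise, so the change should be safe on older PyTorch versions as well.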

Component (check all that applies):

To Reproduce

See environment

Expected behavior

Training should run and accuracy should be reported correctly.

Environment

Dockerfile:

FROM pytorch/pytorch:1.8.1-cuda11.1-cudnn8-runtime

RUN apt-get -q update && apt-get -q install -y wget unzip
RUN pip install torchelastic==0.2.2

RUN mkdir ./train
COPY elastic/examples/imagenet/main.py ./train
WORKDIR ./train
RUN chmod -R a+w .
USER root
ENTRYPOINT ["python", "-m", "torchelastic.distributed.launch"]
CMD ["--help"]


assapin · 2021-05-30T13:39:31Z

I see you already fixed it in master.
I was going to open a pull request... next time :-)

assapin closed this as completed May 30, 2021
