
Model checkpointed using torch.save() unable to be loaded using torch.load() #12042

Closed
deepakn94 opened this issue Sep 25, 2018 · 42 comments

@deepakn94

I have created a PyTorch model checkpoint using torch.save; however, I'm unable to load this model using torch.load. I run into the following error:

>>> torch.load('model_best.pth.tar')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/ubuntu/anaconda3/envs/pytorch_source/lib/python3.7/site-packages/torch/serialization.py", line 358, in load
    return _load(f, map_location, pickle_module)
  File "/home/ubuntu/anaconda3/envs/pytorch_source/lib/python3.7/site-packages/torch/serialization.py", line 549, in _load
    deserialized_objects[key]._set_from_file(f, offset, f_should_read_directly)
RuntimeError: storage has wrong size: expected -7659745797817883467 got 512

The model was saved using code like this:

def save_checkpoint(epoch, model, best_top5, optimizer, is_best=False, filename='checkpoint.pth.tar'):
    state = {
        'epoch': epoch+1, 'state_dict': model.state_dict(),
        'best_top5': best_top5, 'optimizer': optimizer.state_dict(),
    }
    torch.save(state, filename)

if args.local_rank == 0:
    if is_best: save_checkpoint(epoch, model, best_top5, optimizer, is_best=True, filename='model_best.pth.tar')

The model was trained across multiple p3.16xlarge instances.

@deepakn94
Author

PyTorch version:

>>> print(torch.__version__)
0.5.0a0+6993e4a

Python version:

>>> python --version
Python 3.7.0

@ssnl
Collaborator

ssnl commented Sep 25, 2018

cc @ezyang

@ezyang
Contributor

ezyang commented Sep 25, 2018

Would it be possible to upload the checkpoint file somewhere, so we can look at it? (Or, if you can provide a script which generates the checkpoint file, that would work too.)

@ddkang

ddkang commented Sep 25, 2018

@ezyang
Contributor

ezyang commented Sep 25, 2018

Thanks! Looking into it.

@ezyang
Contributor

ezyang commented Sep 25, 2018

(/home/ezyang/Dev/pytorch-tmp-env) [ezyang@devgpu005.ash6 ~/Dev/pytorch-tmp] tar tf model_best.pth.tar 
tar: This does not look like a tar archive
tar: Skipping to next header
tar: Exiting with failure status due to previous errors
(/home/ezyang/Dev/pytorch-tmp-env) [ezyang@devgpu005.ash6 ~/Dev/pytorch-tmp] sha1sum model_best.pth.tar
ca1d315ffddd014ceb3895a919394369dbb8e076  model_best.pth.tar

Also, it looks like you're training ImageNet; can you make the code available to reproduce, if possible?

@ddkang

ddkang commented Sep 25, 2018

@deepakn94
Author

An easier-to-use version of the code is here: https://github.com/stanford-futuredata/pytorch-distributed/blob/master/train.py

You can reproduce using python train.py --machines 16. There's some additional setup needed to get this working on an EC2 account, and I can walk you through that if needed.

Also, to answer your previous question, you're right that this isn't a .tar file -- it was just named with a .tar extension, for whatever reason.

@ezyang
Contributor

ezyang commented Sep 26, 2018

I can reproduce the failure on load. Still investigating.

@ezyang
Contributor

ezyang commented Sep 26, 2018

One data point: the file descriptor at the time of error is misaligned:

(gdb) p/x (int)lseek(file, 0, 1)                                                                     
$7 = 0xa0099f

@ezyang
Contributor

ezyang commented Sep 26, 2018

@deepakn94 Does this repro if you run it on only one node? Basically, I want the model to be as small as possible while still reproducing the error.

@ddkang

ddkang commented Sep 26, 2018

The model serializes and deserializes fine when run on one node.

@deepakn94
Author

Here's another data point: serialization and deserialization seem to work fine for 4 nodes when using PyTorch 0.4.0.

@ezyang
Contributor

ezyang commented Sep 26, 2018

OK, I'm reading the serialization code, and I think I see an incorrect use of the write() function. Posting patch soon...

@ezyang
Contributor

ezyang commented Sep 26, 2018

Please recompile PyTorch with the following patch, which will fix write-time corruption.

diff --git a/torch/csrc/generic/serialization.cpp b/torch/csrc/generic/serialization.cpp
index 2299cce24..1e5889b15 100644
--- a/torch/csrc/generic/serialization.cpp
+++ b/torch/csrc/generic/serialization.cpp
@@ -35,7 +35,7 @@ void THPStorage_(writeFileRaw)(THWStorage *self, io fd)
       throw std::system_error(result, std::system_category());
   } else {
     int64_t buffer_size = std::min(size, (int64_t)5000);
-    std::unique_ptr<uint8_t[]> le_buffer(new uint8_t[buffer_size * sizeof(scalar_t)]);
+    std::unique_ptr<char[]> le_buffer(new char[buffer_size * sizeof(scalar_t)]);
     for (int64_t i = 0; i < size; i += buffer_size) {
       size_t to_convert = std::min(size - i, buffer_size);
       if (sizeof(scalar_t) == 2) {
@@ -54,7 +54,19 @@ void THPStorage_(writeFileRaw)(THWStorage *self, io fd)
             THPByteOrder::THP_LITTLE_ENDIAN,
             to_convert);
       }
-      SYSCHECK(doWrite(fd, le_buffer.get(), to_convert * sizeof(scalar_t)));
+      int64_t remaining = buffer_size * sizeof(scalar_t);
+      char *bytes = le_buffer.get();
+      while (remaining > 0) {
+        ssize_t result = doWrite(fd, bytes, to_convert * sizeof(scalar_t));
+        if (result < 0) {
+          throw std::system_error(result, std::system_category());
+        }
+        bytes += result;
+        remaining -= result;
+      }
+      if (remaining != 0) {
+        throw std::system_error(result, std::system_category());
+      }
     }
   }
 }

I am not 100% sure this will fix the problem; I need to audit the rest of the sites now.
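For reference, the failure mode this loop guards against can also be sketched in Python (an illustrative analogue of the idea, not the actual serialization code): a raw write() is allowed to write fewer bytes than requested, so the caller must keep writing the remaining tail.

import os

def write_all(fd, data):
    # write(2) may return after writing only part of the buffer (for example
    # under signal interruption or pipe/socket back-pressure), so loop until
    # every byte has been handed to the kernel.
    view = memoryview(data)
    while view:
        written = os.write(fd, view)
        view = view[written:]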

@deepakn94
Author

Okay, thanks.

There's no way to salvage the existing checkpoints, right?

@ezyang
Contributor

ezyang commented Sep 26, 2018

If the patch above fixes the problem, no, they're irretrievably corrupted.

@deepakn94
Author

Okay. I'll be a little busy for the next day or two, but will check this patch over the weekend.

@deepakn94
Author

Should I apply this patch to current master? Or to the old commit we were using?

Also, it seems like PyTorch 0.4.0 on 16 machines doesn't work.

@ezyang
Contributor

ezyang commented Sep 27, 2018

I authored this patch on master, but it should backport to older versions too. Perhaps it would be better to backport to the old commit to get a cleaner test.

@deepakn94
Author

That unfortunately didn't work. I applied the patch to the old commit (6993e4a).

>>> torch.load('model_best.pth.tar')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/torch/serialization.py", line 303, in load
    return _load(f, map_location, pickle_module)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/torch/serialization.py", line 476, in _load
    deserialized_objects[key]._set_from_file(f, offset, f_is_real_file)
RuntimeError: storage has wrong size: expected 2930331299881099915 got 128
ubuntu@ip-172-31-93-108:~/pytorch$ git diff
diff --git a/torch/csrc/generic/serialization.cpp b/torch/csrc/generic/serialization.cpp
index 42dff61..0496311 100644
--- a/torch/csrc/generic/serialization.cpp
+++ b/torch/csrc/generic/serialization.cpp
@@ -35,7 +35,7 @@ void THPStorage_(writeFileRaw)(THWStorage *self, io fd)
       throw std::system_error(result, std::system_category());
   } else {
     int64_t buffer_size = std::min(size, (int64_t)5000);
-    std::unique_ptr<uint8_t[]> le_buffer(new uint8_t[buffer_size * sizeof(real)]);
+    std::unique_ptr<char[]> le_buffer(new char[buffer_size * sizeof(real)]);
     for (int64_t i = 0; i < size; i += buffer_size) {
       size_t to_convert = std::min(size - i, buffer_size);
       if (sizeof(real) == 2) {
@@ -54,7 +54,19 @@ void THPStorage_(writeFileRaw)(THWStorage *self, io fd)
             THPByteOrder::THP_LITTLE_ENDIAN,
             to_convert);
       }
-      SYSCHECK(doWrite(fd, le_buffer.get(), to_convert * sizeof(real)));
+      int64_t remaining = buffer_size * sizeof(real);
+      char *bytes = le_buffer.get();
+      while (remaining > 0) {
+        ssize_t result = doWrite(fd, bytes, to_convert * sizeof(real));
+        if (result < 0) {
+          throw std::system_error(result, std::system_category());
+        }
+        bytes += result;
+        remaining -= result;
+      }
+      if (remaining != 0) {
+        throw std::system_error(result, std::system_category());
+      }
     }
   }
 }

Uploaded checkpoint here: https://s3.amazonaws.com/distributed-pytorch-imagenet-runs/imagenet-16-new/run1/model_best.pth.tar

@ezyang
Contributor

ezyang commented Sep 27, 2018

I can't read the updated checkpoint (no permissions). I have a more complete patch which also fixes an underrun on reads, but it doesn't catch any more write side errors, so it must be a different bug.
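The read-side counterpart, again only as a Python sketch of the idea rather than the patched C++: read() can likewise return fewer bytes than requested, so a loader has to accumulate chunks until it has the full storage or hits EOF.

import os

def read_exact(fd, n):
    # read(2) may return a short read, so keep reading until n bytes arrive.
    chunks = []
    remaining = n
    while remaining > 0:
        chunk = os.read(fd, remaining)
        if not chunk:
            raise EOFError("expected %d bytes, got only %d" % (n, n - remaining))
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)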

ezyang added a commit to ezyang/pytorch that referenced this issue Sep 27, 2018
… cases.

Previously, doRead/doWrite were functions that could return partial reads/writes,
and we checked for this case inconsistently in the call sites of serialization.cpp.
Now, these functions do NOT return the amount of bytes read/written, and instead
handle the necessary checking loop themselves.

Fixes pytorch#12042.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
@deepakn94
Author

I updated the permissions on the checkpoint.

@ezyang
Contributor

ezyang commented Sep 27, 2018

Thanks. Confirmed that it still seems to be a write side bug. I guess I'll have to figure something else out...

Any luck minimizing the repro?

@deepakn94
Author

This shouldn't be closed, right?

I'm working on a smaller repro -- I suspect that running this on any distributed setup causes this, but will confirm sometime over the weekend.

@soumith soumith reopened this Sep 28, 2018
@soumith
Member

soumith commented Sep 28, 2018

The closing was an accident, sorry about that. Reopened the issue.

ezyang added a commit to ezyang/pytorch that referenced this issue Sep 28, 2018
@ezyang
Contributor

ezyang commented Sep 28, 2018

So... I tried saving and loading on the small distributed setup we have in our test suite, and I got a very similar-looking error: "EOFError: Ran out of input". I'll work on debugging this case. The branch I'm testing off of is https://github.com/ezyang/pytorch/tree/test/semicharmed-kind-of-life, using python test/run_test.py -i distributed -b nccl.

EDIT: Never mind! I forgot to seek the file back to the beginning before reading it out again.

@apaszke
Contributor

apaszke commented Sep 28, 2018

If the problem only appears in the distributed setting... are you sure that all processes aren't writing to the same file at the same time? That would corrupt it for sure.

@ezyang
Contributor

ezyang commented Sep 28, 2018

The script seems to only write from local_rank == 0: https://github.com/diux-dev/imagenet18/blob/59a8f25171fb8cede51db9187a32fc8f802384a0/training/train_imagenet_nv.py#L150, so unless I misunderstand how rank works, it should be OK.

@deepakn94
Author

Yup, I don't think that's the problem -- only the "master" worker should write the checkpoint. The bug seems to be non-deterministic, because I do have a single 4-machine run that succeeded (along with perhaps 20 failures).

@ezyang
Contributor

ezyang commented Sep 28, 2018

@deepakn94 I haven't tried to get the script to run for me, but another thing to try: when you get to the save point, save the model multiple times; like, 8 times. We can then compare them and see if they're all corrupted identically, or some of them are ok, etc.
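For concreteness, the suggested experiment could look roughly like this (a sketch; state is the checkpoint dict from save_checkpoint above, and the filenames are just illustrative):

# Write the same state dict several times in a row so the resulting files
# can be compared byte-for-byte afterwards.
for i in range(8):
    torch.save(state, 'model_best.{}.pth.tar'.format(i))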

@deepakn94
Author

Links are of the form https://s3.amazonaws.com/distributed-pytorch-imagenet-runs/multi-checkpoint/model_best.0.pth.tar (replace 0 with numbers from 0 to 7)

@deepakn94
Author

This is actually interesting; it looks like some of the checkpoints are corrupted identically, but most are different (and one of the eight checkpoints is not corrupted).

>>> torch.load('model_best.3.pth.tar')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/torch/serialization.py", line 303, in load
    return _load(f, map_location, pickle_module)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/torch/serialization.py", line 476, in _load
    deserialized_objects[key]._set_from_file(f, offset, f_is_real_file)
RuntimeError: storage has wrong size: expected -4669315570785868528 got 512
>>> torch.load('model_best.4.pth.tar')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/torch/serialization.py", line 303, in load
    return _load(f, map_location, pickle_module)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/torch/serialization.py", line 476, in _load
    deserialized_objects[key]._set_from_file(f, offset, f_is_real_file)
RuntimeError: storage has wrong size: expected 3219564007566745640 got 256
>>> torch.load('model_best.5.pth.tar')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/torch/serialization.py", line 303, in load
    return _load(f, map_location, pickle_module)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/torch/serialization.py", line 476, in _load
    deserialized_objects[key]._set_from_file(f, offset, f_is_real_file)
RuntimeError: storage has wrong size: expected 1618383146375255311 got 256
>>> torch.load('model_best.6.pth.tar')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/torch/serialization.py", line 303, in load
    return _load(f, map_location, pickle_module)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/torch/serialization.py", line 476, in _load
    deserialized_objects[key]._set_from_file(f, offset, f_is_real_file)
RuntimeError: storage has wrong size: expected 3219564007566745640 got 256
>>> torch.load('model_best.7.pth.tar')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/torch/serialization.py", line 303, in load
    return _load(f, map_location, pickle_module)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/torch/serialization.py", line 476, in _load
    deserialized_objects[key]._set_from_file(f, offset, f_is_real_file)
RuntimeError: storage has wrong size: expected 2579182944752902985 got 128

@ezyang
Contributor

ezyang commented Sep 30, 2018

I'm not going to get around to looking at this until the work week, but my plan is to do a binary comparison on the checkpoints and see where they diverge, and what kind of corruption is happening, and then check which particular part of the serialization code was writing out that part of the file.

@deepakn94
Author

Sounds good. Let me know if there's anything else you need from my side! (is the easier-to-produce test case still useful?)

@deepakn94
Author

Sigh, I think I found the problem. It seems like local_rank is actually the ID within a worker; multiple workers have a local_rank of 0, so they're probably trampling each other's checkpoints.
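Concretely (an illustrative sketch with a made-up 2-node, 8-GPU layout, not the actual launch configuration): every node launches processes with local_rank 0..7, so a check on local_rank alone fires once per node.

# Hypothetical 2-node x 8-GPU job:
#   node A: local_rank 0..7 -> global ranks 0..7
#   node B: local_rank 0..7 -> global ranks 8..15
# args.local_rank == 0 is therefore true on two processes (global ranks 0 and 8),
# and both write model_best.pth.tar to the shared storage, clobbering each other.
if args.local_rank == 0:   # fires once per node, not once per job
    save_checkpoint(epoch, model, best_top5, optimizer, is_best=True,
                    filename='model_best.pth.tar')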

@ezyang
Contributor

ezyang commented Oct 3, 2018

Aw man, that sounds like a good one for the docs. Very happy you figured it out :)

@deepakn94
Author

Verified that this is indeed the case -- closing this. Thanks for all the help!

@ezyang
Contributor

ezyang commented Oct 5, 2018

@deepakn94 If you don't mind me asking, what change did you make to solve the problem? IIUC, you were writing to a network filesystem for the checkpoints; did you just make them stop writing to NFS?

@deepakn94
Author

I added a --global_rank command line argument as well. Full commit here: stanford-futuredata/pytorch-distributed@23990ca

@bermanmaxim
Contributor

bermanmaxim commented Oct 26, 2018

Note that the launch utility torch.distributed.launch sets a RANK environment variable which can be used to detect whether you are on the master process (with os.environ['RANK'] == '0' from Python).
EDIT: actually, it is even simpler than that: you can use torch.distributed.get_rank() to get the global rank.
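Putting the two suggestions together, gating the save on the global rank might look like this (a minimal sketch; it assumes the process group has already been initialized, e.g. by torch.distributed.launch plus init_process_group, and that state is the checkpoint dict from earlier):

import os
import torch
import torch.distributed as dist

# Option 1: environment variable set by torch.distributed.launch.
is_master = os.environ.get('RANK', '0') == '0'

# Option 2, simpler once the process group exists: the global rank
# (unlike local_rank, this is unique across all nodes).
is_master = dist.get_rank() == 0

if is_master:
    torch.save(state, 'model_best.pth.tar')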

@djstrong

djstrong commented Apr 8, 2019

I have a similar error when I only load a pretrained model. The problem does not occur if only one process is loading the model.
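One common cause of that symptom is every process downloading and writing the pretrained weights to the same cache file at once. A frequently used workaround (a sketch, assuming the default process group is initialized and the weights come from torchvision) is to let rank 0 fetch the weights first and hold the other ranks at a barrier:

import torch.distributed as dist
import torchvision

if dist.get_rank() == 0:
    # Rank 0 downloads (and caches) the pretrained weights.
    model = torchvision.models.resnet50(pretrained=True)
dist.barrier()  # everyone waits until the cached file is fully written
if dist.get_rank() != 0:
    # The remaining ranks read the now-complete cached file.
    model = torchvision.models.resnet50(pretrained=True)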
