[jit] Support torch.save for saving values during execution #18154
Conversation
torch/csrc/jit/pickler.cpp
Outdated
```cpp
// TODO: making IValues does a useless copy of the storage
IValue storage_bytes =
    std::string((char*)tensor.storage().data(), record_size);
IValue list = c10::ivalue::GenericList::create({storage_bytes, num_elements});
```
I don't understand: why are we pickling the literal tensor as a generic list?
When un-pickling, only the last object on the stack gets popped off and passed to `__setstate__`, and to recreate the tensor we need both values, hence the list wrapper. Creating `IValue`s from them lets us re-use the existing list serialization code instead of copying it here.
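The single-object constraint the reply describes can be sketched with Python's own pickle protocol. The `FakeTensor` class below is a hypothetical stand-in (not the real IValue machinery) showing that `__setstate__` receives exactly one object, which is why the two values are bundled into a list:

```python
import pickle

class FakeTensor:
    """Hypothetical stand-in: pickle pops exactly one object off the
    stack and passes it to __setstate__, so both values must be bundled."""
    def __getstate__(self):
        # Bundle both values into a single list, mirroring the
        # GenericList wrapper in pickler.cpp.
        return [self.storage_bytes, self.num_elements]

    def __setstate__(self, state):
        # `state` is the one object popped off the stack: the list.
        self.storage_bytes, self.num_elements = state

t = FakeTensor()
t.storage_bytes, t.num_elements = b"\x00\x01\x02\x03", 4
restored = pickle.loads(pickle.dumps(t))
print(restored.num_elements)  # 4
```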
torch/csrc/jit/register_prim_ops.cpp
Outdated
```cpp
// Write file
std::fstream output(filename, std::ios::out | std::ios::binary);
output.write(p.stack().data(), p.stack().size());
output.close();
```
you don't need to manually close, it will be RAII'd out
```cpp
// Pickle the tensor
Pickler p;
p.start();
```
Aside: is there value in the start() and finish() calls? Can we make them part of the constructor/destructor to avoid mistakes?
`start` could be in the constructor, but `finish` puts some necessary opcodes at the end of the binary blob, so it needs to run before the stack is stored somewhere (which means it can't be in the destructor). Because of that, I think it's clearer to have both `start` and `finish`.
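The ordering argument can be shown with a toy model (names and opcodes chosen for illustration, not the real Pickler): `finish()` appends a trailing opcode to the buffer, so it must run before any read of the buffer, which a destructor cannot guarantee.

```python
class SketchPickler:
    """Toy model of the start()/finish() question: finish() appends a
    trailing opcode (like pickle's STOP), so it must run before the
    buffer is read out, ruling out a destructor-based finish."""
    PROTO = b"\x80\x02"  # pickle protocol-2 header
    STOP = b"."          # pickle STOP opcode

    def __init__(self):
        self.buf = bytearray()
        self.buf += self.PROTO   # start(): header opcodes could live here

    def finish(self):
        self.buf += self.STOP    # trailing opcode, must precede the read

p = SketchPickler()
p.finish()
print(bytes(p.buf))  # b'\x80\x02.'
```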
Looks pretty good, but I am concerned about corner cases for tensor serialization, and have a few API suggestions.
torch/csrc/jit/pickler.cpp
Outdated
```cpp
void Pickler::pushClass(PicklerClass cls) {
  const auto& name = getClassName(cls);
  // Write it to the tensor table
void Pickler::pushGlobal(const std::string& name) {
  auto memo_entry = memo_map_.find(&name);
```
I think `&name` is a bug here. Previously it was returning one-time-allocated strings. Now it is returning things in a hash map, whose addresses are not guaranteed to stay the same.
Since the maps are `const`, doesn't that imply that there will be no re-allocating / re-hashing, and so the pointers will always be valid?
Based on how pushGlobal is called, no, it is not always valid. I see pushGlobal being called:
- with a `const char*` string
- with a string generated from a stringstream
- with a string built via string concatenation

Versions 2 and 3 fail. This kind of bug happens because the API looks like it does one thing (take a string), but it is really supposed to take only statically allocated strings. In that case, it is best not to have an API like this in the first place.
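The failure mode (keying the memo table on a string's address rather than its value) has a direct Python analog. In CPython, strings built at runtime are distinct objects even when their contents are equal, so an identity-keyed memo misses the second occurrence; the name `TensorID` below is made up for illustration:

```python
# Identity-keyed vs. value-keyed memoization, a Python analog of
# keying memo_map_ on &name.
prefix = "__main__."
a = prefix + "TensorID"   # e.g. built via a stringstream / concat
b = prefix + "TensorID"   # same contents, but a different object

memo_by_identity = {id(a): 0}   # analogous to keying on &name
memo_by_value = {a: 0}          # hash map from string -> memo id

print(id(b) in memo_by_identity)  # False: address-based lookup misses
print(b in memo_by_value)         # True: value-based lookup memoizes
```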
Regarding this and #20090, an API like this is nice to have so everything doesn't have to be statically spelled out. Adding a reference to pointer IValues and keeping strings around in a table on the pickler should fix this
torch/csrc/jit/pickler.cpp
Outdated
```cpp
// All attributes get pushed into a list and their indices saved in the
// module def
push<OpCode>(OpCode::EMPTY_LIST);
push<OpCode>(OpCode::MARK);
wrap_in_list_ = wrap_in_list;
```
Weird interface `bool` creep here. Why not have the thing that needs to wrap the result in a list explicitly call:

```cpp
pickler.start();
pickler.beginPushList();
pickler.endPushList();
pickler.end();
```
torch/csrc/jit/pickler.cpp
Outdated
```cpp
auto numel_ptr = reinterpret_cast<const char*>(&numel);
stack_.insert(stack_.end(), numel_ptr, numel_ptr + sizeof(numel));

uint64_t record_size = tensor.element_size() * tensor.numel();
```
Is the tensor contiguous? Is the tensor on the CPU? Otherwise this code is bogus. Look at how the other tensor serializer writes out tensors.
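The contiguity concern can be illustrated without torch. In the pure-Python sketch below (helper name made up), a transposed view shares the same underlying storage, but its logical element order no longer matches storage order, so dumping `element_size * numel` raw bytes from the storage pointer would scramble the data:

```python
def read_logical(storage, sizes, strides, offset=0):
    """Walk a 2-D strided view in row-major logical order."""
    out = []
    for i in range(sizes[0]):
        for j in range(sizes[1]):
            out.append(storage[offset + i * strides[0] + j * strides[1]])
    return out

storage = [0, 1, 2, 3, 4, 5]                        # underlying storage
contiguous = read_logical(storage, (2, 3), (3, 1))  # 2x3 view, contiguous
transposed = read_logical(storage, (3, 2), (1, 3))  # its transpose

print(contiguous == storage)  # True: raw storage dump matches logical order
print(transposed == storage)  # False: raw dump would be bogus here
```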
torch/csrc/jit/pickler.cpp
Outdated
```cpp
}
AT_ERROR("Unknown class name for unpickler: ", str);
}
const static std::unordered_map<std::string, PicklerClass> name_to_class{
```
What was wrong with how it was before? The other one is almost certainly faster.
This is also missing an entry for LITERAL_TENSOR, right?
It's a little more readable, I think, and this isn't accessed often enough to be a performance bottleneck or anything. `LITERAL_TENSOR` wasn't necessary for this PR (it's only needed to unpickle these tensors in C++, which isn't crucial since `torch.load` can read them in).
This is looking good. Two things before it is ready:
- Bug in pushGlobal that can cause memoization to fail.
- Needs more tests around how tensors are serialized because I can't tell from the implementation if it is correct or not.
…e for the duration of the pickler
Pretty much ready to go. I think there is a bug with string memoization, if you can resolve that I will just look at that change and approve.
torch/csrc/jit/pickler.cpp
Outdated
```cpp
void Pickler::pushGlobal(const std::string& name_temp) {
  memoized_strings_.push_back(name_temp);
  auto name = memoized_strings_.back();
  auto memo_entry = memo_map_.find(&(memoized_strings_.back()));
```
This doesn't seem right. What is the intention here? `&memoized_strings_.back()` is a pointer to a position in memoized_strings_. Since it was just inserted, it will never be in the memo_map_, and if memoized_strings_ gets reallocated, then it will be a pointer to bogus data. memo_map_ only really works for reference IValue types. Memoizing non-IValue strings will probably require a hash map from string -> memo id.
Looks good. Minor API comment.
torch/csrc/jit/pickler.cpp
Outdated
```cpp
pushString(name_temp);

// Push BINPUT without adding anything to the memo_map_
pushMemoization(nullptr);
```
This is a weird API. It decides not to update the memo_map_ in pushMemoization and then carefully uses memo_id here (something secretly updated in pushMemoization).

Suggestion:

```cpp
memoized_string_map_.insert({name_temp, pushBinPutNext()});
```

`pushBinPutNext()` pushes the BINPUT opcode with an incremented memo_id and then returns it. Then pushMemoization becomes:

```cpp
void pushMemoization(void* item) {
  memo_map_[item] = pushBinPutNext();
}
```
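The suggested shape, a single `pushBinPutNext` primitive that both memo tables build on, can be sketched in Python. All names here mirror the review suggestion but are hypothetical, not the real Pickler API:

```python
class MiniPickler:
    """Toy sketch of the suggested API: push_binput_next emits BINPUT
    with the next memo id and returns it, so both the identity-keyed
    IValue memo table and the value-keyed string table share it."""
    BINPUT = 0x71  # pickle's one-byte-argument BINPUT opcode ('q')

    def __init__(self):
        self.stack = bytearray()
        self.next_memo_id = 0
        self.memo_map = {}           # object identity -> memo id
        self.memoized_strings = {}   # string value -> memo id

    def push_binput_next(self):
        memo_id = self.next_memo_id
        self.next_memo_id += 1
        self.stack += bytes([self.BINPUT, memo_id])
        return memo_id

    def push_memoization(self, item):
        self.memo_map[id(item)] = self.push_binput_next()

    def push_global(self, name):
        if name not in self.memoized_strings:
            # pushString(name) would go here
            self.memoized_strings[name] = self.push_binput_next()
        # else: emit BINGET with self.memoized_strings[name]

p = MiniPickler()
p.push_global("my.Class")
p.push_global("my.Class")   # second push reuses memo id 0, no new BINPUT
print(p.memoized_strings)   # {'my.Class': 0}
```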
Summary: This PR makes `torch.save` call out to the pickler, which saves a tensor in the same format that `torch.save()` does. The file looks like

```
| pickle archive 1 (includes sizes, strides, requires_grad, etc...) | pickle archive 2 (list of tensor keys) | tensor binary data |
```

and can be read back in with `torch.load(my_file, pickle_module=torch.jit._pickle)`.

Fixes #18003

Unpickling in the JIT for things such as model parallelism will be a follow-up PR.

Pull Request resolved: pytorch/pytorch#18154
Pulled By: driazati
Differential Revision: D15015160
fbshipit-source-id: ef76a44b8c243f4794cd7e245ec8305e965bc59f