
RFC: SavedModel Save/Load in 2.x #34

Open · wants to merge 6 commits into master
Conversation

9 participants
@allenlavoie (Member) commented Nov 16, 2018

Review period closes 2018-12-04

SavedModel Save/Load in 2.x

Status: Accepted
Author(s): Allen Lavoie (allenl@google.com), André Susano Pinto (andresp@google.com), Arno Eigenwillig (arnoegw@google.com), Rohan Jain (rohanj@google.com)
Sponsor: Karmel Allison (karmel@google.com)
Updated: 2019-02-28

Objective: provide an API for serialization/deserialization in TF-2.0 that supports both serving and reuse use-cases.

@ewilderj ewilderj added this to Needs attention in RFC management via automation Nov 16, 2018

@ewilderj ewilderj changed the title SavedModel Save/Load in 2.x RFC: SavedModel Save/Load in 2.x Nov 16, 2018

@ewilderj ewilderj moved this from Needs attention to Open reviews in RFC management Nov 16, 2018

@ppwwyyxx commented Nov 17, 2018

There is another use-case which I'm not sure is covered by this proposal: making large code modifications/refactors and restoring only the weights/state (but not the computation).
For example, originally I have a network like this:

class Net():
    def __init__(self):
        self.W1 = tf.Variable(...)
        self.b1 = tf.Variable(...)
        self.W2 = tf.Variable(...)
        self.b2 = tf.Variable(...)

    @tf.function
    def __call__(self, x):
        return self.W2 * (self.W1 * x + self.b1) + self.b2

And after a while I want to refactor it into:

class Net():
    def __init__(self):
        self.L1 = Layer()
        self.L2 = Layer()

    @tf.function
    def __call__(self, x):
        return self.L2(self.L1(x))


class Layer():
    def __init__(self):
        self.W = tf.Variable(...)
        self.b = tf.Variable(...)

    @tf.function
    def __call__(self, x):
        return self.W * x + self.b

This is easily supported by tf.train.Saver in TF 1, as long as the variable names are unchanged between the two versions above. I'm not sure how this case would be handled with the proposed API, because it appears to me that saving a Checkpointable always saves both the variables and the computation, and where a variable is loaded is strongly tied to how the code was originally written (i.e. the attribute names, etc.).

@allenlavoie (Member, Author) commented Nov 19, 2018

@ppwwyyxx I do mention the training checkpoint use-case in the introduction. After that it's not mentioned much, mostly because it's already implemented.

tf.train.Checkpoint does not save computation, and like Saver in v1 is aimed at training checkpoints. It does use attribute names in objects rather than scope names, but should generally be a bit more flexible because of it.
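For background, a minimal sketch of the tf.train.Checkpoint workflow being described here; the module, names, and paths are illustrative:

import tensorflow as tf

# Only the variables reachable from the checkpoint's attributes are saved,
# keyed by attribute path (e.g. "net/W1"), not by graph scope names.
net = tf.Module()
net.W1 = tf.Variable(tf.ones([2, 2]))
net.b1 = tf.Variable(tf.zeros([2]))

ckpt = tf.train.Checkpoint(net=net)
save_path = ckpt.save('/tmp/demo/ckpt')  # saves variables only, no computation

net.W1.assign(tf.zeros([2, 2]))
ckpt.restore(save_path)  # matches variables by object/attribute path
assert net.W1.numpy().sum() == 4.0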

@ppwwyyxx commented Nov 19, 2018

@allenlavoie I'm aware of the paragraph, but "using attribute names in objects" is exactly what I'm concerned about. In the code example I gave above, due to refactoring, the object hierarchy has changed and therefore the attributes are different. How should users restore an old model after the code has been changed in the way shown above?

@allenlavoie (Member, Author) commented Nov 19, 2018

@ppwwyyxx one option with object-based checkpointing is to construct a checkpoint-compatible structure and a new structure which both reference the same variable objects, then load the checkpoint into the checkpoint-compatible structure. That is roughly equivalent to passing a dictionary of names to Saver after a large-scale refactor. (There is an equivalent refactoring issue when name-based code which opens scopes has to be restructured.)

I wrote a design document for tf.train.Checkpoint quite a while back. If it's helpful for background I can dig it up and post it.

@ppwwyyxx commented Nov 19, 2018

Thanks! From this solution, it sounds to me that, in order to load an old checkpoint, the legacy code has to be kept after refactoring, right? This would defeat the goal of refactoring.

And users will still need the extra logic to "call the old Net() (and then share variables with the new Net) if loading old models, and call the new Net() if loading new models" -- as a result, this solution is no better than nothing: users are still just going through the old codepath to load old models.

I agree to some extent that this is roughly like passing a dict of names to Saver, like you said. But keeping a legacy mapping of strings is much easier than keeping an entire legacy implementation of the model. (Think about how to write a script to convert the checkpoint: with Saver it only needs to translate the names; with Checkpointable it needs access to the model implementation.)

Another complexity compared to Saver is that with Saver, code refactoring does not usually result in incompatibilities at all. For example, in the two versions of Net() I gave above, refactoring from the 2nd Net to the 1st Net can be done with no changes to variable names.

Because of this, I would disagree that "using attribute names is more flexible than using scope names", since attribute names are more closely tied to code structure. In fact, using attribute names to save models is a major headache I have had with PyTorch.

@allenlavoie (Member, Author) commented Nov 19, 2018

> Thanks! From this solution, it sounds to me that, in order to load an old checkpoint, the legacy code has to be kept after refactoring, right? This would defeat the goal of refactoring.
>
> And users will still need the extra logic to "call the old Net() (and then share variables with the new Net) if loading old models, and call the new Net() if loading new models" -- as a result, this solution is no better than nothing: users are still just going through the old codepath to load old models.

It doesn't require a legacy implementation of the model, just the same structure. This can be constructed out of dummy tf.train.Checkpoint objects which just have the same attributes as before. I don't see why that would be any more difficult than names. And once you get down to Layers or submodules which are compatible, those can be referenced as a whole, which is much nicer than having to go variable-by-variable in a Saver dictionary.
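For concreteness, a sketch of this using the tf.train.Checkpoint API, assuming the old checkpoint stored W1/b1/W2/b2 at its root (NewNet is the refactored class from the example above):

import tensorflow as tf

new = NewNet()

# A throwaway object graph mirroring the *old* attribute layout, pointing
# at the new model's variable objects.
compat = tf.train.Checkpoint(
    W1=new.L1.W, b1=new.L1.b,
    W2=new.L2.W, b2=new.L2.b)
status = compat.restore('/path/to/old-checkpoint')
status.assert_consumed()  # verify every value in the checkpoint was matched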

> Another complexity compared to Saver is that with Saver, code refactoring does not usually result in incompatibilities at all. For example, in the two versions of Net() I gave above, refactoring from the 2nd Net to the 1st Net can be done with no changes to variable names.

I've broken many models by refactoring, so I have to disagree with you here. Scopes are pretty notorious for being fragile.

If naming each variable explicitly is an option, it's easy with the object-based API to save a dictionary of variables and nothing else. Mostly we needed to get rid of collections and variable_scope; I don't think we took away much freedom in terms of using names to save/restore.

> Because of this, I would disagree that "using attribute names is more flexible than using scope names", since attribute names are more closely tied to code structure. In fact, using attribute names to save models is a major headache many others and I have had with PyTorch.

I'm curious, a headache compared to what? There's almost a one-to-one mapping between object relationships and the scope-based naming that was common in v1. I agree that refactoring can still break checkpoints with the object-based API, but I think our main disagreement is over whether name-based APIs have any advantages there.

@ppwwyyxx commented Nov 19, 2018

> It doesn't require a legacy implementation of the model, just the same structure. This can be constructed out of dummy tf.train.Checkpoint objects which just have the same attributes as before.

Thanks for the clarification! Now it makes more sense. Using a dummy Checkpoint object with only the attributes would be quite easy.

If I'm not mistaken, to load an old model in the examples I gave above, I'll need to:

new = NewNet()
old = DummyEmptyCheckpoint()
old.W1 = new.L1.W
old.W2 = new.L2.W
old.b1 = new.L1.b
old.b2 = new.L2.b
# or perhaps 
# old = DummyEmptyCheckpoint(W1=new.L1.W, W2=new.L2.W, b1=new.L1.b, b2=new.L2.b)
old.restore('checkpoint')

which is awesome and has the same complexity as saver = tf.train.Saver({'W1': new.L1.W, 'W2': new.L2.W, ...}); saver.restore(sess, 'checkpoint').

> I'm curious, a headache compared to what? There's almost a one-to-one mapping between object relationships and the scope-based naming that was common in v1. I agree that refactoring can still break checkpoints with the object-based API, but I think our main disagreement is over whether name-based APIs have any advantages there.

Compared to scope-based APIs, and mainly because code refactoring does not usually result in incompatibilities with a scope-based API. I've given one example above. Another, more common example:

class OldNet():
    def __init__(self):
        self.W1 = tf.Variable(...)
        self.b1 = tf.Variable(...)
        self.W2 = tf.Variable(...)
        self.b2 = tf.Variable(...)

    def __call__(self, x):
        return tf.nn.conv2d(self.W1 * x + self.b1, self.W2, strides=1, padding='SAME') + self.b2

class NewNet():
    def __init__(self):
        self.L1 = MyFC()
        self.L2 = MyConv()

    def __call__(self, x):
        return self.L2(self.L1(x))

In this refactoring, variable names are unchanged as long as we don't enter a new scope in MyFC().
This sort of refactoring is common, because we often split things into submodules when the big module (the OldNet) accumulates more and more functionality and becomes crowded. However, with attribute-based checkpointing, such refactoring will have to introduce incompatibilities, right?

@martinwicke (Member) commented Nov 19, 2018

@ppwwyyxx I've seen a lot of code for which this is not true, and the failure is extremely vexing. Consider this (old-style) code:

class Net: ...
class OtherNet: ...

def model(x):
  net = Net()
  other = OtherNet()

  x1 = net(x)
  x2 = net(x)
  x3 = other(x1, x2)
  x4 = other(x, x)

  return x3, x4

If I change the last bit to

  x1 = net(x)
  x4 = other(x, x)
  x2 = net(x)
  x3 = other(x1, x2)

  return x3, x4

does this break a checkpoint? The answer is: maybe. The problem is that it is impossible to tell without looking at the code for Net and OtherNet. The transformation I made to the code is completely harmless, just some reordering that has no effect on the outputs at all. We have seen a lot of problems like this. Scope-based naming breaks a lot of abstractions and makes it hard or even impossible to reason about code locally.
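To make the failure mode concrete, a compat.v1 sketch, assuming each layer opens an auto-uniquified scope the first time it runs:

import tensorflow as tf

with tf.Graph().as_default():
  def layer(x):
    # default_name scopes are uniquified per *call order*:
    # "layer", "layer_1", "layer_2", ...
    with tf.compat.v1.variable_scope(None, default_name='layer'):
      w = tf.compat.v1.get_variable('w', shape=[])
      return w * x

  x = tf.constant(1.0)
  a = layer(x)  # creates layer/w
  b = layer(x)  # creates layer_1/w
  print([v.name for v in tf.compat.v1.global_variables()])
  # ['layer/w:0', 'layer_1/w:0'] -- swap the two calls above and the
  # checkpoint keys silently trade places between a's and b's weights.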

@martinwicke martinwicke reopened this Nov 19, 2018

RFC management automation moved this from Open reviews to Needs attention Nov 19, 2018

@ppwwyyxx commented Nov 20, 2018

@martinwicke I can't see the rest of your post :) Of course both methods will have annoying incompatibilities. What I meant to say is that scope-based naming may have fewer incompatibilities (in the examples I gave), but you can probably find counter-examples as well.
I'm not against the object-based approach, given that it makes handling incompatibilities similarly easy (and it has many other advantages).

@martinwicke (Member) commented Nov 20, 2018

@ppwwyyxx Sorry, my bad, I updated the comment. I just wanted to give an example where scope-naming goes wrong. We've seen it a lot, and I strongly believe the object-based naming will be significantly more robust to refactoring.

@ppwwyyxx commented Nov 20, 2018

Thanks! Very interesting example!

@ewilderj ewilderj moved this from Needs attention to Open reviews in RFC management Nov 20, 2018

@ewilderj ewilderj moved this from Open reviews to Awaiting Committee Notes in RFC management Nov 28, 2018

@alanhdu commented Dec 7, 2018

One question I have is whether it'd be possible to somehow hook into and save/load things into assets.extra. Right now, we have two different use cases:

(1) Save some Python attribute variables, unrelated to the TensorFlow graph, which we'd like to deserialize appropriately on loading. This might be possible via the "custom revived types", but it's not entirely clear to me.
(2) Save extra metadata about our model for some of our other infrastructure to read. This use-case might be covered by the "resources and assets" section, although that seems more for TensorFlow-specific things than arbitrary metadata.

Right now, we have a thin wrapper to save/load TensorFlow graphs into SavedModels (which is totally fine and not something we have a problem with), but it'd be kind of nice to make this pluggable at the base!
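For what it's worth, tf.saved_model.save does not write assets.extra itself; today it can be populated by hand after export. A sketch, with hypothetical metadata (assets.extra is a directory convention that tools like TF Serving pass through untouched):

import json
import os

import tensorflow as tf

module = tf.Module()
export_dir = '/tmp/my_model/1'
tf.saved_model.save(module, export_dir)

# Write arbitrary metadata files next to the SavedModel after export.
extra_dir = os.path.join(export_dir, 'assets.extra')
os.makedirs(extra_dir, exist_ok=True)
with open(os.path.join(extra_dir, 'metadata.json'), 'w') as f:
    json.dump({'model_version': 1, 'threshold': 0.5}, f)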

@zakizhou commented Dec 8, 2018

@allenlavoie After an instance of a subclass of tf.keras.Model is serialized, will the loaded object be of type tf.keras.Model or just CheckpointableBase?

@allenlavoie (Member, Author) commented Dec 10, 2018

@zakizhou generic Checkpointable to start, Model eventually. The hope is that we can then swap in a SavedModel implementation of tf.keras.Model.save and load_model.

tensorflow-copybara pushed a commit to tensorflow/tensorflow that referenced this pull request Feb 21, 2019

Add the SavedObjectGraph used by tf.saved_model.save/load to SavedModel
We reference a bunch of things in the MetaGraph/GraphDef, so it makes sense to add it there rather than to the SavedModel directly.

This is in preparation for non-experimental tf.saved_model.save/load symbols. We don't yet have an exposed symbol for loading object-based SavedModels, so this CL won't break anyone (despite moving around the proto and not checking the old location).

RFC: tensorflow/community#34
PiperOrigin-RevId: 234887195

tensorflow-copybara pushed a commit to tensorflow/tensorflow that referenced this pull request Feb 21, 2019

Expose tf.saved_model.load and make tf.saved_model.save non-experimental in v1

We can't expose tf.saved_model.load in v1 because there's already a symbol with that name. I've used tf.saved_model.load_v2 instead.

There's plenty still to do, but after the proto move the format is stable and this is the API we want.

tf.saved_model.save only works when executing eagerly, but makes sense to add to the v1 API to make upgrades simpler. This doesn't work for tf.saved_model.load due to the name conflict; in practice most v1 codebases will need to refer to it as tf.compat.v2.load if they want to run in v1 and v2. But it should have some stable symbol in v1 if v1/v2 codebases are going to rely on it.

RFC: tensorflow/community#34
PiperOrigin-RevId: 234894316
@alanhdu commented Feb 26, 2019

Hi! I don't know if this is the right thread to post this in, but I've been playing around with the TF 2.0 APIs and have run into a problem with saving and loading tf.Modules.

Our actual use-case is to deploy RNNs working on streams of data. At inference time, we need to (1) be able to change the batch size between calls and (2) save the output state from one call to feed into the input state to another.

In TF 1.0, we handled this by explicitly passing in our input states and explicitly capturing the output states (using a simple Python dictionary mapping input state placeholders to output state tensors, and stuffing some extra data into assets.extra at export time). I've tried two different ideas in TF 2.0, but ran into problems with both:

  1. Let each tf.Module manage its own state (something like Keras's stateful flag). The problem here is that I can't figure out how to declare a dynamically sized variable (i.e. the batch size of the state needs to be dynamic, because it might change between calls to reset_state).
  2. Make the model stateless and explicitly pass in the input state and retrieve the output state. The problem here is that the most natural Python API would use dictionaries and/or lists to pass in input states, which the proposal makes explicitly unavailable to serving APIs:

> The `Tensor` arguments are the only ones serving APIs will know about, and these may not be nested inside lists, tuples, or dictionaries (since there is no way to specify nested `Tensor` arguments when serving).

Is it possible to either loosen this restriction, or to allow tf.functions to declare dependencies on dynamically sized variables?

@allenlavoie (Member, Author) commented Feb 27, 2019

@alanhdu

Dynamic variables should still work as long as the variable is created inside a tf.function:

import tensorflow as tf

class HasDynamicVariable(tf.Module):

  def __init__(self):
    super(HasDynamicVariable, self).__init__()
    self.v = None

  @tf.function(input_signature=[tf.TensorSpec([None, 2], tf.float32)])
  def make_dynamic_variable(self, initial_value):
    if self.v is None:
      self.v = tf.Variable(initial_value)
    return self.v

has_dynamic = HasDynamicVariable()
has_dynamic.make_dynamic_variable(tf.ones([10, 2]))
assert 20. == tf.reduce_sum(has_dynamic.v).numpy()
print(has_dynamic.v.shape)  # None, 2
has_dynamic.v.assign(tf.ones([15, 2]))
assert 30. == tf.reduce_sum(has_dynamic.v).numpy()

(Works for me running 2.0.0-dev20190226 on Python 3.6)

But getting TensorFlow Serving (and TensorFlow's C++ loader API) to allow nested structures is a good idea too. One thought (I forget whose) is that signatures could go away completely at some point (TF 4.x? We maintain SavedModel compatibility for at least one extra major version, and we're exporting/consuming in 2.x.). Then Serving would support the same nested structures as tf.saved_model.load. There are plenty of details to work out (how do you specify those in a request?), but I think it's workable. And we certainly don't need to wait for signatures to go away before allowing Serving to access arbitrary tf.saved_model.save functions.

@googlebot googlebot added the cla: yes label Feb 27, 2019

@alanhdu commented Feb 27, 2019

@allenlavoie Ah -- by "dynamically sized" I meant something like:

class DynamicSized(tf.Module):
    @tf.function
    def reset_state(self, batch_size):
        self.state = tf.Variable(tf.zeros([batch_size, 32]))

(that is, a tf.Variable with a dynamic None shape, like tf.placeholder(tf.float32, [None, 32]) in TF 1.x), which I don't think is possible. That would mean that for the LSTM case, we could do something like:

lstm.reset_states(batch_size=64)
lstm(batch1)    # training
lstm.reset_states(batch_size=8)
lstm(batch2)    # inference

> But getting TensorFlow Serving (and TensorFlow's C++ loader API) to allow nested structures is a good idea too.

👍 to that, although I think all our particular use case needs is a way to map the Python nested structure to TF names, so we know what to feed into the C++ API.

@allenlavoie (Member, Author) commented Feb 27, 2019

That's what the snippet I posted does. The function just needs some indication of what the shapes should be (input_signature); otherwise it defaults to the most specific input shape.
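Putting the two together: a sketch of a reset_state that keeps the batch dimension dynamic by taking the initial state tensor (rather than an integer batch size) under an input_signature. Names here are illustrative:

import tensorflow as tf

class StatefulNet(tf.Module):

  def __init__(self):
    super(StatefulNet, self).__init__()
    self.state = None

  # The [None, 32] signature keeps the leading dimension unknown; without
  # it the variable's shape would be pinned to the first call's shape.
  @tf.function(input_signature=[tf.TensorSpec([None, 32], tf.float32)])
  def reset_state(self, initial_value):
    if self.state is None:
      self.state = tf.Variable(initial_value)
    else:
      self.state.assign(initial_value)

net = StatefulNet()
net.reset_state(tf.zeros([64, 32]))  # batch size 64 for training
net.reset_state(tf.zeros([8, 32]))   # batch size 8 for inference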

@alanhdu commented Feb 27, 2019

@allenlavoie Ah... I see. So you can get dynamic shapes by setting them via tf.function(input_signature=...) appropriately. I was thinking there would be a way to do tf.Variable(..., shape=[None, ...]) or something (in the example I had, you'll notice that we pass in a batch_size parameter which is an integer scalar, but want the tf.Variable to have shape [None, 32]), since that intuitively seems the easiest to use. Should I just file a feature request for that somewhere?

I should also say that for me, on Python 3.7 and tf-nightly-2.0-preview==2.0.0.dev20190227, I get the following warning:

WARNING: Logging before flag parsing goes to stderr.
W0227 12:38:06.226005 140243216058176 tf_logging.py:161] Entity <method-wrapper '__call__' of weakref object at 0x7f8ce614c4f8> could not be transformed and will be staged without change. Error details can be found in the logs when running with the env variable AUTOGRAPH_VERBOSITY >= 1. Please report this to the AutoGraph team. Cause: Object conversion is not yet supported. If you are trying to convert code that uses an existing object, try including the creation of that object in the conversion. For example, instead of converting the method of a class, try converting the entire class instead. See https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/autograph/README.md#using-the-functional-api for more information.

That said, if this works then I think it should cover our use case (although we might still run into issues specifying tf.TensorShapes for tensors in nested inputs, which would be the most "natural" programming model).

@yueqiw commented Feb 28, 2019

@allenlavoie After saving a model in TensorFlow 2.0 using tf.saved_model.save(), how do I load it back?

Below is how I created the model:

...
model = tf.keras.Model(inputs, outputs)
... # train the model
print(model)

<tensorflow.python.keras.engine.training.Model object at 0x7fa4f7e9df28>

Now I want to save the model and load it back. However, the loaded object is no longer a Model() object, and I cannot figure out a way to use it for inference.

tf.saved_model.save(model, save_path)
imported = tf.saved_model.load(save_path)
print(imported)

<tensorflow.python.saved_model.load._Loader._recreate_user_object.<locals>._UserObject object at 0x7fa49a073cc0>

exported = tf.train.Checkpoint(model = model)
tf.saved_model.save(exported, save_path)
imported = tf.saved_model.load(save_path)
print(imported.model)

<tensorflow.python.saved_model.load._Loader._recreate_user_object.<locals>._UserObject object at 0x7fa243812080>

What's the correct way to use tf.saved_model? Thanks!

@allenlavoie (Member, Author) commented Feb 28, 2019

@yueqiw

Yep, we won't have re-wrapping into Keras types implemented for the alpha. But if you're exporting a Model, the forward pass is available as a signature: imported.signatures['serving_default'] will be a function.

I have a guide to 2.x SavedModel pending review.
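For anyone landing here later, a sketch of calling that signature; the input and output key names depend on the exported model, so inspect the function to find them:

import tensorflow as tf

imported = tf.saved_model.load(save_path)
infer = imported.signatures['serving_default']

# Signature functions take and return flat dicts of named tensors; see
# infer.structured_input_signature and infer.structured_outputs for keys.
outputs = infer(tf.constant([[1.0, 2.0, 3.0]]))  # illustrative input
print(outputs)  # e.g. {'dense_1': <tf.Tensor ...>}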

@yueqiw commented Feb 28, 2019

@allenlavoie Thanks, it works!

@allenlavoie (Member, Author) commented Feb 28, 2019

@goldiegadde added review notes. Does it need anything else?

@zakizhou commented Mar 1, 2019

Hi @allenlavoie, will tf.Module be the base class for tf.keras.layers.Layer in the release of TF 2.0?

@allenlavoie (Member, Author) commented Mar 1, 2019

@zakizhou I believe that's still the plan.

@goldiegadde (Contributor) commented Mar 4, 2019

> @goldiegadde added review notes. Does it need anything else?

Sorry, missed this. Added a comment.

@goldiegadde goldiegadde closed this Mar 4, 2019

@goldiegadde goldiegadde reopened this Mar 4, 2019

RFC management automation moved this from Awaiting Committee Notes to Needs attention Mar 4, 2019

@karmel karmel referenced this pull request Mar 6, 2019: Docs Needed #23798 (open)

@zh794390558 commented Mar 16, 2019

So I can retrain and serve with a saved model in the future? And how can I use the saved model on a different device or with a distribution strategy? A more detailed use case would be appreciated.

@allenlavoie (Member, Author) commented Mar 18, 2019

@zh794390558 yes on the first question; see the 2.x guide to SavedModel.

There isn't yet any integration between distribution strategies and SavedModels. Happy to give pointers if you or someone else is interested in starting on that; it's on my list, but that's a very long list right now.
