Add SavedModel export to Resnet #3759

Conversation
    'the resulting SavedModel will require the same GPUs be available.'
    'If you wish to serve the SavedModel from a different device, '
    'try exporting the SavedModel with multi-GPU mode turned off.')
FYI @isaprykin - For now, I'm just warning the user. Eventually, it would be nice if Estimator knew not to run the saved model through the replication part of the graph.
Checkpoint loading seems flexible enough that you can load between single- and multi-GPU models. So in the case of multi-GPU models, it might make sense to construct a new estimator, load in the trained weights, and then serialize that. It's perhaps not the most elegant solution, but training with multi_gpu and then serving on a per-GPU basis seems like the most common use case, so supporting it is somewhat important.
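A minimal sketch of that idea, assuming the multi-GPU training checkpoints live in model_dir and that the plain (non-replicated) model_fn is available; the function and argument names here are illustrative, not code from this PR:

    import tensorflow as tf

    def export_single_gpu_copy(model_fn, model_dir, export_dir, input_receiver_fn):
      """Re-wrap multi-GPU checkpoints in a single-GPU Estimator and export it."""
      # The new Estimator shares model_dir, so export_savedmodel() picks up the
      # latest checkpoint written during multi-GPU training (checkpoint loading
      # between the two variants is assumed compatible, per the comment above).
      single_gpu_classifier = tf.estimator.Estimator(
          model_fn=model_fn, model_dir=model_dir)
      return single_gpu_classifier.export_savedmodel(export_dir, input_receiver_fn)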
For now, I'm going to leave as is. DistributionStrategies is a moving target that aims to hide replicate_model_fn, so we can reevaluate in a ~month when that is firmed up.
That sounds perfectly reasonable.
official/resnet/resnet_run_loop.py (Outdated)

  return classifier


def export_savedmodel(classifier, export_dir, shape, batch_size):
savedmodel seems to be two words to me, so probably export_saved_model.
Unfortunately, there is a lot of inconsistency in the spelling of this across TF+friends. In this case, I went with what Estimator has, which is export_savedmodel.
    'You are exporting a SavedModel while in multi-GPU mode. Note that '
    'the resulting SavedModel will require the same GPUs be available.'
    'If you wish to serve the SavedModel from a different device, '
    'try exporting the SavedModel with multi-GPU mode turned off.')
I guess we should stay with double quotes for consistency.
This file uses single quotes. Perhaps at some point we should just pick one quote style for the repo.
Yeah, I would opt for double quotes just because docstrings are """, but we can fix that in a separate PR for all files if desired. For now, I'll stick with what's there to avoid confusion in this PR.
  return classifier


def warn_on_multi_gpu_export(multi_gpu=False):
If a user exports the model in single-GPU mode and tries to run the exported model in a multi-GPU environment, will it work?
Probably not. Serving in multi-GPU mode isn't really something that's typically done, though, I think; parallelization is handled more efficiently on the request-partitioning side rather than inside the model itself, since there is no need to update shared weights or other state.
official/resnet/resnet_run_loop.py (Outdated)

  input_receiver_fn = export.build_tensor_serving_input_receiver_fn(
      shape, batch_size=batch_size)
  classifier.export_savedmodel(export_dir, input_receiver_fn)
  return classifier
Is there any reason to return the classifier here? Since export_savedmodel is not called on the classifier instance, I don't think it helps with chaining calls.
Consistency with the other function is what I was thinking, but I'm not attached to it. Will remove.
official/resnet/resnet_run_loop.py (Outdated)

  return classifier


def export_savedmodel(classifier, export_dir, shape, batch_size):
I would suggest moving this to some util file, since it has nothing to do with the model or any ResNet internals. I would expect a lot of other models to have a function like this too if they want to export their model.
+1
I went with @robieta's suggestion above and included the steps in resnet_main, which removes the need for this function.
official/resnet/cifar10_main.py (Outdated)

  classifier = resnet_run_loop.resnet_main(
      flags, cifar10_model_fn, input_function)

  # Export the model if desired
I think it would be cleaner to handle this inside of resnet_main() and just pass in shape. That would dedupe some code and is more intuitive IMHO.
Sigh. Yeah, I went back and forth on that, but in the end I felt like weighing resnet_main down with another, totally separate task set that required an extra param was asking too much of it. Quick poll: what do the people reviewing this PR think: put more of the logic in resnet_main, or keep it separate?
I'm OK with both ideas. The main function should contain all the functionality IMO, and it would be nice to avoid duplicate code. If we keep them separate, perhaps we should rename resnet_main to train, train_and_evaluate, run_training, or similar.
Two and a half people counts as quorum. Done.
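For reference, a rough sketch of what the export step folded into resnet_main might look like, assuming export_dir and batch_size arrive via flags and the image shape is passed in by each model's main; the helper import path and function name are assumptions, not the exact diff:

    import tensorflow as tf
    from official.utils.export import export  # helper module path assumed

    def _export_if_requested(classifier, flags, shape, multi_gpu=False):
      """Export a SavedModel at the end of resnet_main when --export_dir is set."""
      if not flags.export_dir:
        return
      if multi_gpu:
        tf.logging.warning(
            'Exporting a SavedModel trained in multi-GPU mode; the same GPUs '
            'must be available wherever it is served.')
      input_receiver_fn = export.build_tensor_serving_input_receiver_fn(
          shape, batch_size=flags.batch_size)
      classifier.export_savedmodel(flags.export_dir, input_receiver_fn)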
    if export_dir:
      self.add_argument(
          "--export_dir",
          type=str,
For strings, type can be omitted.
    super(ExportParser, self).__init__(add_help=add_help)
    if export_dir:
      self.add_argument(
          "--export_dir",
Abbreviation and metavar please.
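Something along these lines would address both notes (the redundant type=str dropped, a short flag and metavar added); "-ed" and "<ED>" are placeholder choices, not values from the diff:

    self.add_argument(
        "--export_dir", "-ed",
        help="[default: %(default)s] If set, a SavedModel serialization of "
             "the model will be exported to this directory after training.",
        metavar="<ED>",
    )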
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""Tests for exporting utils."""
Would it be much trouble to add tests for the individual models? (Probably in their respective files rather than this one.) Basically, generate a trivial model with synthetic data and then confirm that it can be loaded and served.
Hahahaha. I laugh because, as I learned during this process, actually using a saved model is incredibly difficult without a dedicated TF Serving instance. I am hoping we can change that, but for now it would require quite a few tf.contrib calls. Punting on this until we formalize the story on model exports, hopefully next quarter.
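For what it's worth, a bare-bones in-process load is possible with the TF 1.x loader API; what makes a full round-trip test awkward is digging out the signature tensor names, which below are placeholders rather than names defined in this PR:

    import tensorflow as tf

    def run_exported_model(export_dir, batch):
      """Load a SavedModel exported with the SERVING tag and run one batch."""
      with tf.Session(graph=tf.Graph()) as sess:
        tf.saved_model.loader.load(
            sess, [tf.saved_model.tag_constants.SERVING], export_dir)
        # 'input_tensor:0' and 'softmax_tensor:0' are assumed names; the real
        # ones depend on the serving_input_receiver_fn and the model head.
        return sess.run('softmax_tensor:0',
                        feed_dict={'input_tensor:0': batch})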
All comments addressed; please take a look.
This adds the ability to export a SavedModel to ResNet. Note that this will be followed up with more functionality for TensorRT shortly.
Ideally, we want to update WideDeep and all the other new models to do the same; then we can push the ExportParser into BaseParser.
@isaprykin - a question for you: right now, I pass batch_size through to the input receiver fn, even though in theory one might want the SavedModel to request a batch_size of 1 by default. The reason is that replicate_model_fn's ensure_divisible_by_shards check complains during model saving if the batch size is one, even just for exporting the model (presumably because the graph gets built again). I'm not sure what the implications of my approach are, or whether there is a better way.
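To make the trade-off concrete, the receiver built by build_tensor_serving_input_receiver_fn is roughly along these lines; this is a sketch of the idea rather than the utility's actual implementation:

    import tensorflow as tf

    def build_tensor_serving_input_receiver_fn(shape, dtype=tf.float32,
                                               batch_size=1):
      """Sketch of a receiver fn for a dense input of shape [batch_size] + shape.

      Exporting with batch_size=1 is what trips replicate_model_fn's
      ensure_divisible_by_shards check in multi-GPU mode, which is why the
      training batch_size is passed through instead.
      """
      def serving_input_receiver_fn():
        features = tf.placeholder(
            dtype=dtype, shape=[batch_size] + list(shape), name='input_tensor')
        # TensorServingInputReceiver hands the raw tensor (not a dict) to model_fn.
        return tf.estimator.export.TensorServingInputReceiver(
            features=features, receiver_tensors=features)
      return serving_input_receiver_fn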
@qlzh727 - FYI, I envision that ultimately we will want every benchmark run to also export a SavedModel, which can then be used to benchmark TF inference.