
prediction different between TF Serving 1.4 and TF 1.4 #656

Closed

paragon00 opened this issue Nov 15, 2017 · 9 comments

Comments

paragon00 commented Nov 15, 2017

After updating our TF, Keras, and TF Serving, I'm seeing a difference in prediction values on the same model and images between TF/Keras on the one hand and TF Serving on the other. I updated to TF 1.4 and Keras 2.0.9, and built TF Serving from the 1.4 branch (I tried master too). Prediction on some random images then gives:

Keras, TensorFlow, TensorFlowServing, TrueLabel
0.294510304928, 0.294510304928, 0.306598514318, 1
0.973454713821, 0.973454713821, 0.974921882153, 1
0.0169313177466, 0.0169313177466, 0.109000883996, 0
0.969210922718, 0.969210922718, 0.964440405369, 1
0.996860027313, 0.996860027313, 0.998536705971, 1
0.996983230114, 0.996983230114, 0.994152128696, 1
0.259784668684, 0.259784668684, 0.300680160522, 0
0.989252388477, 0.989252388477, 0.97792416811, 1

i.e. Keras and TF predict the same values, but TF Serving gives different numbers. It's possible we didn't upgrade TF Serving correctly (although we didn't see any errors).

Is anyone else getting this? We didn't see this on TF 1.3.
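
As a point of reference, here is a minimal sketch of one way to run the same batch through both paths and compare; the model file, servable name, port, and tensor keys are hypothetical, and it assumes the gRPC client API that shipped with TF Serving 1.4:

import numpy as np
import tensorflow as tf
from keras.models import load_model
from grpc.beta import implementations
from tensorflow_serving.apis import predict_pb2, prediction_service_pb2

batch = np.random.rand(1, 224, 224, 3).astype(np.float32)  # stand-in image batch

# In-process prediction with the Keras model (hypothetical file name)
keras_model = load_model('model.h5')
keras_score = keras_model.predict(batch)

# The same batch through TF Serving's gRPC PredictionService
channel = implementations.insecure_channel('localhost', 9000)
stub = prediction_service_pb2.beta_create_PredictionService_stub(channel)
request = predict_pb2.PredictRequest()
request.model_spec.name = 'my_model'                       # hypothetical servable name
request.inputs['images'].CopyFrom(tf.contrib.util.make_tensor_proto(batch))
response = stub.Predict(request, 10.0)                     # 10-second timeout
print(keras_score, response.outputs['scores'].float_val)   # 'scores' is a hypothetical output key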

zmjjmz commented Nov 20, 2017

I'm seeing something very similar at the moment, although the numerical difference is a bit more dramatic.

Notably, I have an embedding layer that contains some all-zero vectors (e.g. for padding / OOV tokens). Breaking it down by the steps it takes to export a Keras model to TF Serving:

  • The Keras model itself produces the right output (i.e., 0)
  • The exported TF graph has the correct values when inspected with the inspect_checkpoint tool (see the sketch after this list)
  • The prediction proto response does not have the correct values (I had the model output the embeddings directly)
    -- Specifically, instead of 0 I see 0.00273621
    -- All the dtypes check out
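
A minimal sketch of that second check, reading the SavedModel's variables directly (roughly what inspect_checkpoint does); the export path and variable name are hypothetical:

import tensorflow as tf

# Read the variables shipped with the SavedModel; path and variable name are hypothetical.
reader = tf.train.NewCheckpointReader('/tmp/servable/1/variables/variables')
print(reader.get_variable_to_shape_map())
print(reader.get_tensor('embedding_1/embeddings'))  # the padding/OOV rows should be all zeros here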

I'm using 1.4 here but can't confirm that I wasn't seeing this in 1.3. I guess I could try downgrading if that helps?

Forgot to mention: this is all on CPU, using the default builds (i.e. none of the available CPU optimizations are being used).

If needed, I could probably put together a reproducible test case, but there are a lot of moving parts :)

Also: the servable I'm testing has ops with boolean or int32 outputs -- all of those come out fine! However, the float outputs are all funky.

zmjjmz commented Nov 20, 2017

Further note: I tried to determine whether the corruption happens inside the graph by abusing a ThresholdedReLU Keras layer to zero out the embeddings and then add those zeros back to the original embeddings, and then comparing the original embedding output to the one with zeros added. If the zeros were broken within the graph, I'd see different numbers between them -- however, they come out the same.
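
A minimal sketch of that diagnostic, with hypothetical vocabulary and embedding sizes:

from keras.layers import Input, Embedding, ThresholdedReLU, add
from keras.models import Model

tokens = Input(shape=(10,), dtype='int32')
emb = Embedding(input_dim=1000, output_dim=5)(tokens)
# With a huge theta every activation falls below the threshold and becomes 0,
# so this layer should emit exact zeros inside the graph.
zeros = ThresholdedReLU(theta=1e6)(emb)
# Adding those zeros back should be a no-op; if the zeros were corrupted
# in-graph, emb and emb_plus_zeros would disagree.
emb_plus_zeros = add([emb, zeros])
model = Model(inputs=tokens, outputs=[emb, emb_plus_zeros])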

What I did notice on a second run is that I have two embedding vectors that are all zeros (one to distinguish OOV tokens from pad tokens -- don't worry about it), and they come out as two different garbage vectors. So, for example, the following sequence:

[1 1 1 0 0 0 0 0 0 0] should resolve to

[[ 0.          0.          0.          0.          0.        ]
 [ 0.          0.          0.          0.          0.        ]
 [ 0.          0.          0.          0.          0.        ]
 [ 0.          0.          0.          0.          0.        ]
 [ 0.          0.          0.          0.          0.        ]
 [ 0.          0.          0.          0.          0.        ]
 [ 0.          0.          0.          0.          0.        ]
 [ 0.          0.          0.          0.          0.        ]
 [ 0.          0.          0.          0.          0.        ]
 [ 0.          0.          0.          0.          0.        ]]

after the embedding layer, but from TF serving I get

[[ 0.00374124  0.02665842 -0.04161887  0.01480421 -0.02126383]
 [ 0.00374124  0.02665842 -0.04161887  0.01480421 -0.02126383]
 [ 0.00374124  0.02665842 -0.04161887  0.01480421 -0.02126383]
 [ 0.01171316 -0.03365946  0.0402073  -0.02044135  0.00470774]
 [ 0.01171316 -0.03365946  0.0402073  -0.02044135  0.00470774]
 [ 0.01171316 -0.03365946  0.0402073  -0.02044135  0.00470774]
 [ 0.01171316 -0.03365946  0.0402073  -0.02044135  0.00470774]
 [ 0.01171316 -0.03365946  0.0402073  -0.02044135  0.00470774]
 [ 0.01171316 -0.03365946  0.0402073  -0.02044135  0.00470774]
 [ 0.01171316 -0.03365946  0.0402073  -0.02044135  0.00470774]]

I'm working on putting together a minimal repro -- currently what I have relies on a bunch of weird custom code / keras layers that's not worth including.

I've also noticed that the vectors change between versions (not between requests), so my previous comment about what the 0 gets changed to is inaccurate. Also, if I inspect the output of the ThresholdedReLU, I do see all zeros (though sometimes -0.0, which I'm not sure what to make of).

zmjjmz commented Nov 21, 2017

Here's a gist that should (at least, on my system) reproduce this issue:

https://gist.github.com/zmjjmz/64cf9771922aa6cf58da6233e022f056

zmjjmz commented Nov 21, 2017

I was initially encountering this issue in a servable that used a lookup table, so when I call add_meta_graph_and_variables I pass tensorflow.saved_model.main_op.main_op() to initialize the table when the servable is loaded. In the test case that's unnecessary, and if I remove it the outputs match up!

So I think I can narrow it down to something going wrong in that main_op.
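
For reference, a hedged sketch of the export path being described here; the export directory, tensor keys, and the toy model are stand-ins:

import tensorflow as tf
from keras import backend as K
from keras.layers import Input, Embedding
from keras.models import Model

# toy stand-in for the real servable
tokens = Input(shape=(10,), dtype='int32')
model = Model(inputs=tokens, outputs=Embedding(1000, 5)(tokens))

builder = tf.saved_model.builder.SavedModelBuilder('/tmp/servable/1')
signature = tf.saved_model.signature_def_utils.predict_signature_def(
    inputs={'tokens': model.input}, outputs={'embeddings': model.output})
builder.add_meta_graph_and_variables(
    K.get_session(),
    tags=[tf.saved_model.tag_constants.SERVING],
    signature_def_map={
        tf.saved_model.signature_constants.DEFAULT_SERVING_SIGNATURE_DEF_KEY: signature},
    # main_op() is re-run by TF Serving when the servable loads; dropping this
    # argument is what makes the outputs match in the test case above.
    main_op=tf.saved_model.main_op.main_op())
builder.save()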

zmjjmz commented Nov 21, 2017

OK, so playing with this a bit more, I think the issue is specifically with tensorflow.python.ops.variables.global_variables_initializer.

Currently main_op() produces a grouped op that is essentially:

from tensorflow.python.ops import control_flow_ops, lookup_ops, variables

# main_op() groups the table, local-variable, and global-variable initializers
main_op_new = control_flow_ops.group(
    lookup_ops.tables_initializer(),
    variables.local_variables_initializer(),
    variables.global_variables_initializer())

If I remove just that last initializer, this issue goes away, and I'm able to use the model as normal!
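
In other words, passing a grouped op like the sketch below as main_op= (the name custom_main_op is mine) keeps the table initialization but skips the global variables initializer:

from tensorflow.python.ops import control_flow_ops, lookup_ops, variables

# Same grouping as main_op(), minus global_variables_initializer(); lookup
# tables still get initialized when the servable loads.
custom_main_op = control_flow_ops.group(
    lookup_ops.tables_initializer(),
    variables.local_variables_initializer())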

There's definitely something strange going on in global_variables_initializer, which I realize may have to do with the way I'm exporting the model (using the Keras backend session, which may be the wrong way).

sukritiramesh (Contributor) commented

Thanks for reporting back, @zmjjmz. Resolving, since this seems export-specific for now.

zmjjmz commented Dec 19, 2017

Should I open this as a separate issue on the main tensorflow repo then?

sukritiramesh (Contributor) commented

@zmjjmz Is this to follow up on the global variable initializer? If so, sure.

paragon00 (Author) commented Dec 19, 2017

For me, the difference disappeared in a recent TF / Keras update. As far as I could tell, it was a discrepancy between TF prediction and TF Serving prediction that existed briefly and was fixed in a recent release.
