
TF.assign surprising behaviour #92

Closed
jcberentsen opened this issue Apr 6, 2017 · 9 comments · Fixed by #98
Comments


I wrote some tests to guide my understanding of TF.assign in relation to run and runSession, similar to:
https://github.com/tensorflow/haskell/blob/master/tensorflow-ops/tests/BuildTest.hs#L83

case_use_scalar_after_assigns_surprisingly_fails = do
  (f0, f1, f2) <- TF.runSession $ do
    (formula, reset0, reset1) <- do
      w <- TF.initializedVariable 0
      let formula = w                                                                                                                                                   
      r0 <- TF.assign w (TF.scalar (0.1::Float))
      r1 <- TF.assign w (TF.scalar (0.2::Float))
      return (formula, r0, r1)

    (TF.Scalar f0) <- TF.run formula
    TF.run_ reset0
    (TF.Scalar f1) <- TF.run formula
    TF.run_ reset1
    (TF.Scalar f2) <- TF.run formula
    return (f0, f1, f2)

  (0.0, 0.1, 0.2) @=? (f0, f1, f2)

This fails, with f0, f1, and f2 all equal to the final value of the assigned variable. If I change the test to:

case_use_scalar_after_assigns_works_but_why = do
  (f0, f1, f2) <- TF.runSession $ do
    (formula, reset0, reset1) <- do
      w <- TF.initializedVariable 0
      let formula = TF.mul 1 w -- Matters
      r0 <- TF.assign w (TF.scalar (0.1::Float))
      r1 <- TF.assign w (TF.scalar (0.2::Float))
      return (formula, r0, r1)

    (TF.Scalar f0) <- TF.run formula
    TF.run_ reset0
    (TF.Scalar f1) <- TF.run formula
    TF.run_ reset1
    (TF.Scalar f2) <- TF.run formula
    return (f0, f1, f2)

  (0.0, 0.1, 0.2) @=? (f0, f1, f2)

Changing to let formula = TF.mul 1 w makes it work, but why doesn't the previous version?

I wrote these tests because in other code I tried to do something like this:

...
M.replicateM epochs $ do
    TF.run_ fitStep
    TF.run $ TF.value refW

to collect the evolution of the w values, but they all ended up equal (though verifiably different from the initial value), even with the TF.value conversion.
(fitStep does assign to refW.)

Is this expected behaviour? Am I missing something?

silky commented Apr 6, 2017

I agree that this seems to be a bug (the equivalent Python works as you expect).

I don't have the answer for you, but for whatever reason I got distracted looking into this (and didn't get very far!).

I did find some things like this:

addNewOp :: OpDef -> Build NodeDef

which might be mildly helpful in tracking down the true cause.

I'm kind of surprised that, say, let formula = identity w doesn't work either (and the value function doesn't help at all).

I'd love to know the answer to this!

judah commented Apr 6, 2017

Thanks for finding this. I think it's a problem with laziness and how we're interacting with the C API. In particular, forcing the values immediately after fetching them causes the test to pass:

case_use_scalar_after_assigns_surprisingly_fails = testCase "assigns" $ do
  runSession $ do
    w <- initializedVariable 0
    r <- assign w (scalar (0.1::Float))
    Scalar f0 <- run w
    liftIO $ evaluate f0
    run_ r
    Scalar f1 <- run w
    liftIO $ (0.0, 0.1) @=? (f0, f1)

And removing the call to evaluate reproduces the error you were seeing.
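Applied to the training loop from the original report, the same forcing trick would look roughly like this (a sketch only: `fitStep`, `refW`, and the type signature are assumptions based on the snippet above, not checked against the current API):

```haskell
import Control.Exception (evaluate)
import Control.Monad (replicateM)
import Control.Monad.IO.Class (liftIO)
import qualified TensorFlow.Core as TF
import qualified TensorFlow.Tensor as TF

collectWeights :: Int -> TF.ControlNode -> TF.Tensor TF.Ref Float -> TF.Session [Float]
collectWeights epochs fitStep refW =
  replicateM epochs $ do
    TF.run_ fitStep
    TF.Scalar w <- TF.run (TF.value refW)
    -- Force the Float now, while the fetched buffer still holds this
    -- step's value; otherwise the thunk may read later memory.
    liftIO (evaluate w)
```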

It seems that in both calls to run w, the C API is returning the same pointer to tensor data - namely, the memory that backs the variable itself, which might change between run calls. And we're currently lazy enough that sometimes we don't read the contents of that memory immediately.

I'll look into making our code more strict to prevent this problem, and try to see whether this behavior also warrants a fix from the TensorFlow side.
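For illustration, the hazard can be reproduced without TensorFlow at all. This standalone sketch mimics a fetch that lazily reads memory which a later "assign" then overwrites:

```haskell
import Control.Exception (evaluate)
import Foreign.Marshal.Alloc (alloca)
import Foreign.Storable (peek, poke)
import System.IO.Unsafe (unsafeInterleaveIO)

main :: IO ()
main = alloca $ \ptr -> do
  poke ptr (0.1 :: Float)
  -- "Fetch" lazily: the peek is deferred until the value is forced,
  -- like a thunk that still points at the variable's backing memory.
  lazyVal <- unsafeInterleaveIO (peek ptr)
  poke ptr 0.2              -- the next "assign" runs first
  v <- evaluate lazyVal     -- forcing now reads the updated memory: 0.2
  print v
```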

fkm3 commented Apr 6, 2017

Haven't looked into this yet, but I would expect that making things more strict is only enough for scalar and Data.Vector results. TensorData and Data.Vector.Storable results point directly at the array returned by tensorflow instead of copying.

The C API has a comment that I thought implied this was safe, but now I'm guessing it is only true in a trivial sense (the caller owns the pointer to the pointer to the data, but not the data...): https://github.com/tensorflow/tensorflow/blob/0c68156ffcf918df905f9f39a632888724c66c3b/tensorflow/c/c_api.h#L895
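To illustrate the distinction standalone: a Data.Vector.Storable vector built over a ForeignPtr is a view of that memory, so only an eagerly forced copy survives a later mutation. A minimal sketch (plain foreign memory standing in for a TensorFlow buffer):

```haskell
import Control.Exception (evaluate)
import qualified Data.Vector.Storable as S
import Foreign.ForeignPtr (mallocForeignPtr, withForeignPtr)
import Foreign.Storable (poke)

main :: IO ()
main = do
  fp <- mallocForeignPtr
  withForeignPtr fp (`poke` (0.1 :: Float))
  -- A Storable vector over foreign memory is a view, not a copy.
  let view = S.unsafeFromForeignPtr0 fp 1
  -- Materialise a genuine, Haskell-owned copy, and force it *now*.
  copy <- evaluate (S.fromList (S.toList view))
  withForeignPtr fp (`poke` 0.2)   -- mutate the underlying buffer
  print (S.head view, S.head copy) -- view sees 0.2; copy still holds 0.1
```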

judah commented Apr 6, 2017

Ouch, you're right; I forgot about that behavior of Data.Vector.Storable.

It looks like switching to the new, DT_RESOURCE-based variable ops fixes this problem. AFAIK the Python APIs have all made the switch, so we probably should as well. I'll try putting together a patch and see how invasive the change ends up being.

judah commented Apr 6, 2017

For reference, see also tensorflow/tensorflow/issues/4663 which details some of the issues with the old DT_REF_*-based approach, and in particular the difference between tf.identity(x) and 1 * x (the former is sometimes optimized away).

judah added a commit to judah/tensorflow-haskell that referenced this issue Apr 15, 2017
The main difference between these and the `Ref`-based ops is the explicit
`readValue` op.  I'm not sure how this should interact with gradients
and save/restore, so I'm keeping it as a separate module for now.  Once we
figure out the details, we can merge it into `TensorFlow.Ops` and replace
all uses of the old `Ref`-based ops.  (That would also fix tensorflow#92.)

Also replaces our special-case newtype `ResourceHandle` with
`Tensor Value ResourceHandle`, where `ResourceHandle` is the TF proto
corresponding to `DT_RESOURCE`.
blackgnezdo pushed a commit that referenced this issue Apr 16, 2017
judah commented Apr 16, 2017

@blackgnezdo, can you reopen this issue? (I don't have permission to do it.) The stated problem won't be resolved until TensorFlow.Variable completely replaces the functions in TensorFlow.Ops.

Looks like committing my PR (which referenced this issue) caused GitHub to auto-close it.

blackgnezdo reopened this Apr 16, 2017
judah mentioned this issue May 5, 2017
judah commented May 8, 2017

Unfortunately, it turns out that in TensorFlow 1.1.0 (released last month), switching to TensorFlow.Variable won't fix this issue anymore:
tensorflow/tensorflow@9f12227
In short, fetching a resource variable now returns a pointer to the original data (rather than a copy), just like for Refs.

I think this means we need to treat TensorData as inherently stateful...which probably also means allowing Fetch actions to do IO under the hood. @fkm3, what do you think?
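As a rough sketch of that direction (the names `Fetch` and `fetchCopied` here are hypothetical, not the library's actual internals): a fetch action that runs in IO can copy tensor contents out of TF-owned memory at fetch time, before the next run mutates it.

```haskell
{-# LANGUAGE DeriveFunctor #-}
import qualified Data.Vector.Storable as S
import Foreign.ForeignPtr (ForeignPtr, withForeignPtr)
import Foreign.Marshal.Array (peekArray)

-- Hypothetical: a fetch action that may perform IO, so it can copy
-- tensor contents eagerly instead of aliasing TF-owned memory.
newtype Fetch a = Fetch { runFetch :: IO a }
  deriving Functor

-- Copy `n` floats out of foreign memory into a fresh Haskell-owned vector.
fetchCopied :: ForeignPtr Float -> Int -> Fetch (S.Vector Float)
fetchCopied fp n = Fetch $
  withForeignPtr fp $ \p -> S.fromList <$> peekArray n p
```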

judah added a commit to judah/tensorflow-haskell that referenced this issue May 9, 2017
It would be better to avoid the copy when it's not necessary, but
that will require more involved changes to the internal API.  (For example,
Fetchable might need to allow IO or ST actions.)
fkm3 commented May 9, 2017

That commit would make you think they did it for performance reasons, but after digging around some, I'm pretty sure it was done for semantic reasons...

Always copying TensorData contents SGTM for now. Maybe we can ask for an attribute to disable that optimization pass, or find some way to keep it from triggering.

fkm3 pushed a commit that referenced this issue May 9, 2017
judah commented May 9, 2017

Closing this ticket now that the behavior is fixed; filed #109 as a long-term task to avoid the extra copy for some types.

judah closed this as completed May 9, 2017