
TF.assign surprising behaviour #92

Closed
jcberentsen opened this issue Apr 6, 2017 · 9 comments · Fixed by #98
Comments


I wrote some tests to guide my understanding of TF.assign in relation to run and runSession, similar to:
https://github.com/tensorflow/haskell/blob/master/tensorflow-ops/tests/BuildTest.hs#L83

case_use_scalar_after_assigns_surprisingly_fails = do
  (f0, f1, f2) <- TF.runSession $ do
    (formula, reset0, reset1) <- do
      w <- TF.initializedVariable 0
      let formula = w                                                                                                                                                   
      r0 <- TF.assign w (TF.scalar (0.1::Float))
      r1 <- TF.assign w (TF.scalar (0.2::Float))
      return (formula, r0, r1)

    (TF.Scalar f0) <- TF.run formula
    TF.run_ reset0
    (TF.Scalar f1) <- TF.run formula
    TF.run_ reset1
    (TF.Scalar f2) <- TF.run formula
    return (f0, f1, f2)

  (0.0, 0.1, 0.2) @=? (f0, f1, f2)

This fails, with f0, f1, and f2 all equal to the final value of the assigned variable. If I change the test to:

case_use_scalar_after_assigns_works_but_why = do
  (f0, f1, f2) <- TF.runSession $ do
    (formula, reset0, reset1) <- do
      w <- TF.initializedVariable 0
      let formula = TF.mul 1 w -- Matters
      r0 <- TF.assign w (TF.scalar (0.1::Float))
      r1 <- TF.assign w (TF.scalar (0.2::Float))
      return (formula, r0, r1)

    (TF.Scalar f0) <- TF.run formula
    TF.run_ reset0
    (TF.Scalar f1) <- TF.run formula
    TF.run_ reset1
    (TF.Scalar f2) <- TF.run formula
    return (f0, f1, f2)

  (0.0, 0.1, 0.2) @=? (f0, f1, f2)

Changing to let formula = TF.mul 1 w makes it work, but why doesn't the previous version?

I wrote these tests because in other code I tried to do something like this:

...
M.replicateM epochs $ do
    TF.run_ fitStep
    TF.run $ TF.value refW

to collect the evolution of the w values, but they all ended up equal (though verifiably different from the initial value), even with the TF.value conversion.
(fitStep does assign to refW.)

Is this expected behaviour? Am I missing something?

silky commented Apr 6, 2017

I agree that this seems to be a bug (the equivalent Python works as you expect).

I don't have the answer for you, but for whatever reason I got distracted looking into this (and didn't get very far!).

I did find some things like this:

addNewOp :: OpDef -> Build NodeDef

which might be mildly helpful in tracking down the true cause.

I'm kind of surprised that, say, let formula = identity w doesn't work either (and the value function doesn't help at all).

I'd love to know the answer to this!

judah commented Apr 6, 2017

Thanks for finding this. I think it's a problem with laziness and how we're interacting with the C API. In particular, forcing the values immediately after fetching them causes the test to pass:

case_use_scalar_after_assigns_surprisingly_fails = testCase "assigns" $ do
  runSession $ do
    w <- initializedVariable 0
    r <- assign w (scalar (0.1::Float))
    Scalar f0 <- run w
    liftIO $ evaluate f0
    run_ r
    Scalar f1 <- run w
    liftIO $ (0.0, 0.1) @=? (f0, f1)

And removing the call to evaluate reproduces the error you were seeing.
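Applied to the training loop from the original report, the same forcing trick would look roughly like this (a sketch only: `fitStep`, `refW`, and the type signature are assumptions based on the snippet above, not checked against the current API):

```haskell
import Control.Exception (evaluate)
import Control.Monad (replicateM)
import Control.Monad.IO.Class (liftIO)
import qualified TensorFlow.Core as TF
import qualified TensorFlow.Tensor as TF

collectWeights :: Int -> TF.ControlNode -> TF.Tensor TF.Ref Float -> TF.Session [Float]
collectWeights epochs fitStep refW =
  replicateM epochs $ do
    TF.run_ fitStep
    TF.Scalar w <- TF.run (TF.value refW)
    -- Force the Float now, while the fetched buffer still holds this
    -- step's value; otherwise the thunk may read later memory.
    liftIO (evaluate w)
```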

It seems that in both calls to run w, the C API is returning the same pointer to tensor data - namely, the memory that backs the variable itself, which might change between run calls. And we're currently lazy enough that sometimes we don't read the contents of that memory immediately.

I'll look into making our code more strict to prevent this problem, and try to see whether this behavior also warrants a fix from the TensorFlow side.
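For illustration, the hazard can be reproduced without TensorFlow at all. This standalone sketch mimics a fetch that lazily reads memory which a later "assign" then overwrites:

```haskell
import Control.Exception (evaluate)
import Foreign.Marshal.Alloc (alloca)
import Foreign.Storable (peek, poke)
import System.IO.Unsafe (unsafeInterleaveIO)

main :: IO ()
main = alloca $ \ptr -> do
  poke ptr (0.1 :: Float)
  -- "Fetch" lazily: the peek is deferred until the value is forced,
  -- like a thunk that still points at the variable's backing memory.
  lazyVal <- unsafeInterleaveIO (peek ptr)
  poke ptr 0.2              -- the next "assign" runs first
  v <- evaluate lazyVal     -- forcing now reads the updated memory: 0.2
  print v
```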

fkm3 commented Apr 6, 2017

Haven't looked into this yet, but I would expect that making things more strict is only enough for scalar and Data.Vector results. TensorData and Data.Vector.Storable results point directly at the array returned by tensorflow instead of copying.

The C API has a comment that I thought implied this was safe, but now I'm guessing it is only true in a trivial sense (the caller owns the pointer to the pointer to the data, but not the data...): https://github.com/tensorflow/tensorflow/blob/0c68156ffcf918df905f9f39a632888724c66c3b/tensorflow/c/c_api.h#L895
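To illustrate the distinction standalone: a Data.Vector.Storable vector built over a ForeignPtr is a view of that memory, so only an eagerly forced copy survives a later mutation. A minimal sketch (plain foreign memory standing in for a TensorFlow buffer):

```haskell
import Control.Exception (evaluate)
import qualified Data.Vector.Storable as S
import Foreign.ForeignPtr (mallocForeignPtr, withForeignPtr)
import Foreign.Storable (poke)

main :: IO ()
main = do
  fp <- mallocForeignPtr
  withForeignPtr fp (`poke` (0.1 :: Float))
  -- A Storable vector over foreign memory is a view, not a copy.
  let view = S.unsafeFromForeignPtr0 fp 1
  -- Materialise a genuine, Haskell-owned copy, and force it *now*.
  copy <- evaluate (S.fromList (S.toList view))
  withForeignPtr fp (`poke` 0.2)   -- mutate the underlying buffer
  print (S.head view, S.head copy) -- view sees 0.2; copy still holds 0.1
```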

judah commented Apr 6, 2017

Ouch, you're right; I forgot about that behavior of Data.Vector.Storable.

It looks like switching to the new, DT_RESOURCE-based variable ops fixes this problem. AFAIK the Python APIs have all made the switch, so we probably should as well. I'll try putting together a patch and see how invasive the change ends up being.

judah commented Apr 6, 2017

For reference, see also tensorflow/tensorflow/issues/4663 which details some of the issues with the old DT_REF_*-based approach, and in particular the difference between tf.identity(x) and 1 * x (the former is sometimes optimized away).

judah added a commit to judah/tensorflow-haskell that referenced this issue Apr 15, 2017
The main difference between these and the `Ref`-based ops is the explicit
`readValue` op.  I'm not sure how this should interact with gradients
and save/restore, so I'm keeping it as a separate module for now.  Once we
figure out the details, we can merge it into `TensorFlow.Ops` and replace
all uses of the old `Ref`-based ops.  (That would also fix tensorflow#92.)

Also replaces our special-case newtype `ResourceHandle` with
`Tensor Value ResourceHandle`, where `ResourceHandle` is the TF proto
corresponding to `DT_RESOURCE`.
blackgnezdo pushed a commit that referenced this issue Apr 16, 2017
judah commented Apr 16, 2017

@blackgnezdo, can you reopen this issue? (I don't have permission to do it.) The stated problem won't be resolved until TensorFlow.Variable completely replaces the functions in TensorFlow.Ops.

Looks like committing my PR (which referenced this issue) caused GitHub to auto-close it.

blackgnezdo reopened this Apr 16, 2017
judah mentioned this issue May 5, 2017
judah commented May 8, 2017

Unfortunately, it turns out that in TensorFlow 1.1.0 (released last month), switching to TensorFlow.Variable won't fix this issue anymore:
tensorflow/tensorflow@9f12227
In short, fetching a resource variable now returns a pointer to the original data (rather than a copy), just like for Refs.

I think this means we need to treat TensorData as inherently stateful...which probably also means allowing Fetch actions to do IO under the hood. @fkm3, what do you think?
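As a rough sketch of that direction (the names `Fetch` and `fetchCopied` here are hypothetical, not the library's actual internals): a fetch action that runs in IO can copy tensor contents out of TF-owned memory at fetch time, before the next run mutates it.

```haskell
{-# LANGUAGE DeriveFunctor #-}
import qualified Data.Vector.Storable as S
import Foreign.ForeignPtr (ForeignPtr, withForeignPtr)
import Foreign.Marshal.Array (peekArray)

-- Hypothetical: a fetch action that may perform IO, so it can copy
-- tensor contents eagerly instead of aliasing TF-owned memory.
newtype Fetch a = Fetch { runFetch :: IO a }
  deriving Functor

-- Copy `n` floats out of foreign memory into a fresh Haskell-owned vector.
fetchCopied :: ForeignPtr Float -> Int -> Fetch (S.Vector Float)
fetchCopied fp n = Fetch $
  withForeignPtr fp $ \p -> S.fromList <$> peekArray n p
```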

judah added a commit to judah/tensorflow-haskell that referenced this issue May 9, 2017
It would be better to avoid the copy when it's not necessary, but
that will require more involved changes to the internal API.  (For example,
Fetchable might need to allow IO or ST actions.)
fkm3 commented May 9, 2017

That commit would make you think they did it for performance reasons, but after digging around some, I'm pretty sure it was done for semantic reasons...

Always copying TensorData contents SGTM for now. Maybe we can ask for an attribute to disable that optimization pass, or find some way to keep it from triggering.

fkm3 pushed a commit that referenced this issue May 9, 2017
judah commented May 9, 2017

Closing this ticket now that the behavior is fixed; filed #109 as a long-term task to avoid the extra copy for some types.

judah closed this as completed May 9, 2017