
Implicit Copy Warning Improvements #55

Closed
wants to merge 6 commits

Conversation

marcrasi
Contributor

Here's a proposal on how to clean up implicit copy warnings, based on some discussions with @mhong.

Contributor

@mhong mhong left a comment

Thanks for the great write-up! Left some comments/questions.

training data being copied to the GPU) or ending (e.g. final weights being
copied to the CPU).
2. Within the program, only warn when data makes a round trip between devices:
when a piece of data moves from device A to device B, then a computation
Contributor

Does the notion of a "device" include the Swift host? Usually "device" refers only to a TF device, like the TF GPU.

Also, is the description here overly general -- we are not talking about warnings about data transfers between the TF CPU and GPU, correct?

Contributor Author

Oh yes, I meant to just talk about host<->accelerator transfers. I will reword this to make it more clear.

Since the round-trip-rule might be hard to implement, I also propose that we
initially implement a simple heuristic that approximates the round-trip-rule:
Warn for all transfers from the host to the accelerator, but do not warn for any
transfers from the accelerator to the host.
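(For illustration only, not part of the proposal text: a minimal Swift sketch of how this heuristic would classify the two directions, assuming the early Swift for TensorFlow Tensor API and a hypothetical placement of values; toAccelerator() is the spelling the existing diagnostic already suggests.)

import TensorFlow

// Suppose partitioning leaves `hostBatch` on the host (e.g. it came from
// host-side preprocessing) while `weights` lives on the accelerator.
let hostBatch = Tensor<Float>([0, 1, 2, 3])
var weights = Tensor<Float>([1, 1, 1, 1])

// Host -> accelerator: the heuristic warns here, suggesting
// `hostBatch.toAccelerator()` to make the copy explicit.
weights += hostBatch

// Accelerator -> host: never warns under the heuristic, since reading
// results back on the host is usually intentional.
print(weights)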
Contributor

Arguably, warning only on swift->tf transfers can stand on its own, rather than serving as an approximation chosen for implementation convenience. For example, say Swift is running alongside TF and sends various tensors to TF as it produces them -- even without round trips, such sends can slow TF down if Swift is slow in producing those tensors.

Contributor Author

I see this as similar to the tf->swift transfer case -- there are a lot of situations where the user does expect the transfers to happen and where the transfers don't slow anything down. For example, scalar transfers for control flow. Or some kind of CPU loading & preprocessing step that queues up a bunch of training data for the accelerator.

Thinking through scalars is what made me come up with the "round trip" idea. We can suppress all scalar warnings, like we do now, but this creates a possibly-bad situation in combination with suppressing tf->swift transfers: what if your program does a really slow tf->swift transfer and then sends it back as a scalar? Something like this:

for step in 0...1000 {
  let gradients = ... // on accelerator
  weights += gradients // on accelerator
  if cpuOnlyFunc(weights) == 0 { // long round-trip transfer that blocks the computation
    weights += 1
  }
}

If we hide all warnings for scalar transfers and we hide all warnings for tf->swift transfers, then that compiles without warnings. A "round trip" rule would catch it though.

Contributor

@mhong mhong Aug 14, 2018

If a tf->swift transfer is really slow, it might already stall TF compute (due to limited TF RAM buffering), without involving a round trip. :)

Overall, what the compiler can warn about based on static analysis has limited accuracy, so we might as well go with a fairly simple (explainable / predictable to end users) model. The round-trip rule involves somewhat more mental calculation than saying "we warn on all host->tf transfers, other than scalar ones, but that's still just an approximation". :)

But my position is based on some speculation and is just my opinion. I'll let you decide here.

@@ -0,0 +1,84 @@
reduced.swift:14:23: warning: 'inputs' implicitly copied to the accelerator, use .toAccelerator() to make transfer explicit
public func predict(inputs: Tensor<Float>) -> Tensor<Float> {
Contributor

Hmm, I don't understand why such warnings appear in the first place -- there should be no sends/recvs here. Maybe this is due to the use of mutating?

Contributor

This is because predict(inputs:) is being partitioned as a deabstraction scope. It's a public function.
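(For context, a minimal sketch of the situation being described -- hypothetical model code with the same signature as the quoted warning, not the actual test file.)

import TensorFlow

// Because this function is public and not inlined into a caller, the
// compiler partitions it as its own tensor program. Its argument then
// arrives from host Swift code, which is what produces the "'inputs'
// implicitly copied to the accelerator" warning quoted above.
public func predict(inputs: Tensor<Float>) -> Tensor<Float> {
  let weights = Tensor<Float>([0.5, 0.5])
  return inputs * weights
}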

Contributor

Before we have a better inlining model, should we recommend that all public APIs be marked as "inline always"? Otherwise if user code calls a sequence of public APIs, that could generate a lot of sends/recvs.

Contributor

public APIs be marked as "inline always"?

I think you meant @inlinable. @inline(__always) is just local inlining and won't make the SIL function serialized.

I would not say requiring @inlinable is a good idea, because

  1. Requiring excessive annotation from the user breaks the mental model of "tensor code just works".
  2. It would delay shape checking for public APIs if they are not called in the current module. If any shape error occurs in the caller module, the error location would be unknown.
  3. Partitioning in the first place guarantees the function will at least run under all circumstances, instead of printing "__tfop_xxx cannot be lowered to LLVM IR!!" at run-time. For example, if you use an opaque library function like expectEqual that makes a call to a tensor function through protocol conformance, the program fails because the function was never partitioned. This is unavoidable for generic functions because generic functions cannot be partitioned and must be @inlinable, but non-generic functions can at least be partitioned and have a binary representation to execute reliably no matter how it's being called.
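(For readers unfamiliar with the two attributes, a brief illustrative sketch of the distinction being drawn; the function bodies are made up.)

import TensorFlow

// @inlinable serializes the body into the module, so a caller in another
// module can inline it and avoid an extra partitioned-function boundary
// (and the sends/recvs that come with it).
@inlinable
public func doubled(_ w: Tensor<Float>) -> Tensor<Float> {
  return w + w
}

// @inline(__always) only requests inlining within this module; the body is
// not serialized, so callers in other modules still go through the
// separately partitioned binary version of the function.
@inline(__always)
public func doubledLocally(_ w: Tensor<Float>) -> Tensor<Float> {
  return w + w
}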

Contributor Author

I think this discussion is mostly orthogonal to the warnings proposal. Regardless of what causes such a function to get partitioned on its own, we need to make the warnings good when it does get partitioned.

Contributor

@rxwei rxwei Aug 13, 2018

Yep, it is orthogonal! But at the same time I hope this is providing some context on how argument send/receive warnings are emitted in the first place.

One middle ground I can think of is to have the compiler still partition inlinable functions outside the TF module. This would resolve both Mingsheng’s concerns and mine.

Contributor

Richard's proposal SGTM.

I agree with Mark that what's partitionable is itself orthogonal to the warnings proposal. One related aspect: when we partition a public API that does not get called in user code, could we avoid generating sends/recvs warnings for it?

warnings).

I propose that we clean up the warnings as follows:
1. Emit no warnings for data transferred while the program is starting (e.g.
Contributor

Agree #1 is a good idea, but how do we implement it? (How do we tell whether a swift->tf data transfer happens before the TF graph runs?)

Currently, the "input args" and "result tensors" of the TF program are already distinguished from sends/recvs, and thus generate no warnings.

If we convert the input args and result tensors into tensor sends/recvs, I'm not sure how to reliably tell them apart from the other sends/recvs that happen while the TF program is running.

Also, if Swift is too slow sending the first tensor to TF, that would still increase the end-to-end running time, and it might be useful to warn in that case (if possible).
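(To make the distinction concrete, a hypothetical sketch -- names made up -- of what counts as an input arg or result tensor versus a transfer that happens while the graph is running.)

import TensorFlow

func train(initialWeights: Tensor<Float>, hostLearningRates: [Float]) -> Tensor<Float> {
  // `initialWeights` enters the TF program as an input arg: no warning today.
  var weights = initialWeights
  for step in 0..<hostLearningRates.count {
    // The Swift array lives on the host, so reading a rate each iteration
    // and using it in tensor code is a transfer while the graph is running --
    // the kind the warnings are meant to surface.
    let rate = hostLearningRates[step]
    weights = weights - Tensor<Float>(rate) * weights
  }
  // `weights` leaves the program as a result tensor: no warning.
  return weights
}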

Contributor Author

I wanted to capture the current use of "input args" and "result tensors" in this rule, because it seems like part of the big picture.

Also, it seems like they might currently be buggy in some way, because a lot of the warnings in the example look like they should be input args or result tensors. So a big part of cleaning up the warnings might be debugging those.

Are we planning to eventually remove input args and result tensors and replace them with send/recv? If we do that, could we keep the logic that currently creates input args and result tensors, and use it to decide whether to warn?

Contributor

Also, it seems like they might currently be buggy in some way, because a lot of the warnings in the example look like they should be input args or result tensors. So a big part of cleaning up the warnings might be debugging those.

SGTM.

Are we planning to eventually remove input args and result tensors and replace them with send/recv? If we do that, could we keep the logic that currently creates input args and result tensors, and use it to decide whether to warn?

I think that direction is worth exploration, but it's too early to commit to that. If we do go there, I agree that should probably not affect our warnings policy.

the accelerator to consume.

Without implementing the true round-trip-rule, we can denoise the warnings in
that situation by suppressing warnings for scalar transfers. But the
Contributor

And we already implemented the disabling of warnings on scalar values -- should we still include that in our policy design here (instead of forcing a choice between it and the new round-trip rule)?

Contributor Author

I think that this proposal interacts pretty interestingly with disabled scalar warnings, because of the example in my comment above with the slow scalar transfer.

Without implementing the true round-trip-rule, we can denoise the warnings in
that situation by suppressing warnings for scalar transfers. But the
round-trip-rule suppresses those warnings in a cleaner and more reliable way, so
I propose that we eventually do implement the round-trip-rule.
Contributor

Another possible future direction that you mentioned in our earlier discussion is:
Move sends/recvs diagnostics to a separate tool, possibly using runtime info (e.g. profile data).

Do you want to also briefly cover it in this doc, to establish the "space" of our design exploration?

Contributor Author

Will add!

@mhong
Contributor

mhong commented Aug 14, 2018

Sorry about the delayed reply. The doc is good to land from my perspective!

Contributor

@rxwei rxwei left a comment

Great doc!

@rxwei
Contributor

rxwei commented Aug 17, 2018

@marcrasi Feel free to land this as is and await further feedback!

@eaplatanios
Contributor

Thanks for putting together this document! I understand how such warnings can be annoying, but I lean more towards letting the user know when data is moved without them explicitly specifying it. I also see that Swift for TF is currently a bit loose about devices: there is a notion of a single accelerator and host, but many applications use multiple GPUs, for example. I imagine customized device placement is in the plan, so assuming we will eventually be able to place ops on devices as we choose, I believe these warnings will be very useful. With the Python API, I often feel I lose control of when TensorFlow will move things to the CPU when soft placement is allowed and when "fake" int32 kernels are used. So I believe the warnings are very useful even if they are a bit verbose, and there is also a way to prevent them, which is good.

One option may be to add such heuristics to suppress some of the warnings, but allow opting out of them through, say, a compiler flag that controls the verbosity of device placement warnings.

@marcrasi
Contributor Author

Things have changed a lot since I wrote this. I think it's mostly obsolete, so I'll close it. Thanks for the comments everyone!

@marcrasi marcrasi closed this Jan 18, 2019
@marcrasi marcrasi deleted the implicit-copy-warnings branch January 18, 2019 01:04
brettkoonce pushed a commit to brettkoonce/swift that referenced this pull request Mar 14, 2019
* Add information about where docs go

* add samples too