Thanks for the great write-up! Left some comments/questions.
> training data being copied to the GPU) or ending (e.g. final weights being
> copied to the CPU).
> 2. Within the program, only warn when data makes a round trip between devices:
> when a piece of data moves from device A to device B, then a computation
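To make the round-trip rule concrete, here is a hedged sketch using the early Swift for TensorFlow API (`.toAccelerator()` appears in this proposal's diagnostics; `.toHost()` is assumed for the reverse direction):

```swift
import TensorFlow

// Hedged sketch of a "round trip"; API names are assumptions, not a spec.
let x = Tensor<Float>([1, 2, 3])   // produced on the host
let y = x.toAccelerator() + 1      // host -> accelerator; compute runs on device
let z = y.toHost()                 // accelerator -> host: no warning by itself
let w = z.toAccelerator() * 2      // host -> accelerator again: the data has now
                                   // made a round trip, so this transfer warns
```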
Does the notion of a "device" include the Swift host? Usually "device" refers only to a TF device, like the TF GPU.
Also, is the description here overly general -- we are not talking about warnings over data transfers between TF CPU and GPU, correct?
Oh yes, I meant to just talk about host<->accelerator transfers. I will reword this to make it more clear.
> Since the round-trip-rule might be hard to implement, I also propose that we
> initially implement a simple heuristic that approximates the round-trip-rule:
> Warn for all transfers from the host to the accelerator, but do not warn for any
> transfers from the accelerator to the host.
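A hedged illustration of what the heuristic would and would not flag (the `Tensor` API is taken from this proposal's diagnostics; the specific warning behavior is an assumption):

```swift
import TensorFlow

// Sketch of the proposed heuristic; not a definitive implementation.
let data = Tensor<Float>([1, 2, 3])  // produced on the host
let result = data + 1                // implicit host -> accelerator transfer: warn
print(result)                        // accelerator -> host transfer: no warning
```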
Arguably, warning only on swift->tf transfers can stand on its own, rather than serving as an approximation adopted for implementation convenience. E.g., say Swift is running alongside TF and sends various tensors from Swift to TF as they are produced -- even without round trips, such sends can slow down TF if Swift is slow in producing those tensors.
I see this as similar to the tf->swift transfer case -- there are a lot of situations where the user does expect the transfers to happen and where the transfers don't slow anything down. For example, scalar transfers for control flow. Or some kind of CPU loading & preprocessing step that queues up a bunch of training data for the accelerator.
Thinking through scalars is what made me come up with the "round trip" idea. We can suppress all scalar warnings, like we do now, but this creates a possibly-bad situation in combination with suppressing tf->swift transfers: what if your program does a really slow tf->swift transfer and then sends it back as a scalar? Something like this:
```swift
for step in 0...1000 {
    let gradients = ... // on accelerator
    weights += gradients // on accelerator
    if cpuOnlyFunc(weights) == 0 { // long round-trip transfer that blocks the computation
        weights += 1
    }
}
```
If we hide all warnings for scalar transfers and we hide all warnings for tf->swift transfers, then that compiles without warnings. A "round trip" rule would catch it though.
If a tf->swift transfer is really slow, it might already stall TF compute (due to limited TF RAM buffering), without involving a round trip. :)
Overall, what the compiler can warn about based on static analysis has limited accuracy, so we might as well go with a fairly simple (explainable / predictable to end users) model. The round-trip rule involves somewhat more mental calculation than saying "we warn on all host->tf transfers, other than scalar ones, but that's still just an approximation". :)
But my position is based on some speculation and is just my opinion. I'll let you decide here.
> reduced.swift:14:23: warning: 'inputs' implicitly copied to the accelerator, use .toAccelerator() to make transfer explicit
> public func predict(inputs: Tensor<Float>) -> Tensor<Float> {
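For context, the diagnostic above suggests making the transfer explicit. A hedged sketch of what that might look like (the function body and the exact silencing behavior are assumptions; only `.toAccelerator()` comes from the diagnostic itself):

```swift
import TensorFlow

public func predict(inputs: Tensor<Float>) -> Tensor<Float> {
  // Making the copy explicit acknowledges the host -> accelerator
  // transfer, which is presumed to suppress the implicit-copy warning.
  let deviceInputs = inputs.toAccelerator()
  return deviceInputs * 2  // hypothetical model computation
}
```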
Hmm, I don't understand why such warnings appear in the first place -- there should be no sends/recvs here. Maybe this is due to the use of `mutating`?
This is because `predict(inputs:)` is being partitioned as a deabstraction scope. It's a public function.
Before we have a better inlining model, should we recommend that all public APIs be marked as "inline always"? Otherwise if user code calls a sequence of public APIs, that could generate a lot of sends/recvs.
> public APIs be marked as "inline always"?

I think you meant `@inlinable`. `@inline(__always)` is just local inlining and won't make the SIL function serialized.
I would not say requiring `@inlinable` is a good idea, because
- Requiring excessive annotation from the user breaks the mental model of "tensor code just works".
- It would delay shape checking for public APIs if they are not called in the current module. If any shape error occurs in the caller module, the error location would be unknown.
- Partitioning in the first place guarantees the function will at least run under all circumstances, instead of printing "__tfop_xxx cannot be lowered to LLVM IR!!" at run time. For example, if you use an opaque library function like `expectEqual` that makes a call to a tensor function through protocol conformance, the program fails because the function was never partitioned. This is unavoidable for generic functions, because generic functions cannot be partitioned and must be `@inlinable`, but non-generic functions can at least be partitioned and have a binary representation to execute reliably no matter how they're called.
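A minimal sketch of the distinction being drawn here (illustrative only; the function names and bodies are hypothetical):

```swift
import TensorFlow

// `@inline(__always)` is only a local inlining hint; the SIL body is not
// serialized, so callers in other modules still see an opaque call and the
// partitioner cannot deabstract through it there.
@inline(__always)
public func doubledLocal(_ x: Tensor<Float>) -> Tensor<Float> {
  return x * 2
}

// `@inlinable` serializes the SIL body into the module, so caller modules
// can inline (and the partitioner can deabstract) it -- at the cost of the
// concerns listed above, such as delayed shape checking.
@inlinable
public func doubledCrossModule(_ x: Tensor<Float>) -> Tensor<Float> {
  return x * 2
}
```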
I think this discussion is mostly orthogonal to the warnings proposal. Regardless of what causes such a function to get partitioned on its own, we need to make the warnings good when it does get partitioned.
Yep, it is orthogonal! But at the same time I hope this is providing some context on how argument send/receive warnings are emitted in the first place.
One middle ground I can think of is to have the compiler still partition inlinable functions outside the TF module. This would resolve both Mingsheng’s concerns and mine.
Richard's proposal SGTM.
I agree with Mark that what's partitionable is itself orthogonal to the warnings proposal. One related aspect: when we partition a public API that does not get called in user code, it would be nice not to generate send/recv warnings for it.
> warnings).
>
> I propose that we clean up the warnings as follows:
> 1. Emit no warnings for data transferred while the program is starting (e.g.
agree #1 is a good idea, but how do we implement it (how do we tell whether a swift->tf data transfer happens before the TF graph runs)?
Currently our use of "input args" and "result tensors" of the TF program serve the purpose of differentiating themselves from the sends/recvs, and thus generate no warning.
If we convert the input args and result tensors into tensor sends/recvs, I'm not sure how to reliably tell them apart from the other sends/recvs that happen while the TF program is running.
Also, if swift is being too slow sending the first tensor to tf, it'd still increase the end-to-end running time, and it might be useful to warn in that case (if possible).
I wanted to capture the current use of "input args" and "result tensors" in this rule, because it seems like part of the big picture.
Also it seems like they currently might be bugged in some way because a lot of the warnings in the example look like they should be input args or result tensors. So a big part of cleaning up the warnings might be debugging those.
Are we planning to eventually remove input args and result tensors and replace them with send/recv? If we do that, could we keep the logic that currently creates input args and result tensors, and use it to decide whether to warn?
> Also it seems like they currently might be bugged in some way because a lot of the warnings in the example look like they should be input args or result tensors. So a big part of cleaning up the warnings might be debugging those.
SGTM.
> Are we planning to eventually remove input args and result tensors and replace them with send/recv? If we do that, could we keep the logic that currently creates input args and result tensors, and use it to decide whether to warn?
I think that direction is worth exploration, but it's too early to commit to that. If we do go there, I agree that should probably not affect our warnings policy.
> the accelerator to consume.
>
> Without implementing the true round-trip-rule, we can denoise the warnings in
> that situation by suppressing warnings for scalar transfers. But the
And we have already implemented disabling warnings on scalar values -- should we still include that in our policy design here (instead of forcing us to choose between it and the new round-trip rule)?
I think that this proposal interacts pretty interestingly with disabled scalar warnings, because of the example in my comment above with the slow scalar transfer.
> Without implementing the true round-trip-rule, we can denoise the warnings in
> that situation by suppressing warnings for scalar transfers. But the
> round-trip-rule suppresses those warnings in a cleaner and more reliable way, so
> I propose that we eventually do implement the round-trip-rule.
another possible future direction that you mentioned in our earlier discussion is:
> Move sends/recvs diagnostics to a separate tool, possibly using runtime info (e.g. profile data).
Do you want to also briefly cover it in this doc, to establish the "space" of our design exploration?
Will add!
Sorry about the delayed reply. The doc is good to land from my perspective!
Great doc!
@marcrasi Feel free to land this as is and await further feedback!
Thanks for putting together this document! I understand how such warnings can be annoying, but I am leaning more towards letting the user know when data is moved without them explicitly specifying it.

I also see how currently Swift for TF is a bit loose about devices; there is a notion of a single accelerator and host, but in many applications we use multiple GPUs, for example. Allowing for customized device placement is in the plan I imagine, and so assuming that eventually we will be able to place ops on devices as we choose, I believe that those warnings will be very useful. Currently, I feel I often lose control of when TensorFlow (using the Python API) will move stuff to the CPU when allowing soft placement and also when using "fake" int32 kernels. Therefore, I believe that the warnings are very useful even though they may be a bit verbose. And there is also a way to prevent them, which is good.

One option may be to add such heuristics to prevent some of the warnings, but be able to not use them through, say, a compiler flag that controls verbosity for device placement warnings.
Things have changed a lot since I wrote this. I think it's mostly obsolete, so I'll close it. Thanks for the comments everyone!
Here's a proposal on how to clean up implicit copy warnings, based on some discussions with @mhong.