Implicit Copy Warning Improvements #55
@@ -0,0 +1,144 @@
# Implicit Copy Warning Improvements

* Author: [Marc Rasi](https://github.com/marcrasi)

## Introduction

Currently, implicit copy warnings are very noisy. For example, [this simple
model] produces [these warnings]. (See the "Performance Predictability" section
of [Graph Program Extraction] for more information about implicit copy
warnings).

I propose that we clean up the warnings as follows:
1. Emit no warnings for data transferred while the program is starting (e.g.
   training data being copied to the GPU) or ending (e.g. final weights being
   copied to the CPU).
2. Within the program, only warn when data makes a round trip between devices:
   when a piece of data moves from device A to device B, then a computation
   using that data happens on device B, and then the result of that computation
   moves back to device A.

> **Reviewer:** does the notion of a "device" include swift host? usually device refers only to a TF device, like TF GPU. Also, is the description here overly general -- we are not talking about warnings over data transfers between TF CPU and GPU, correct?
>
> **Author:** Oh yes, I meant to just talk about host<->accelerator transfers. I will reword this to make it more clear.

Concretely, this proposal eliminates all warnings in [the example].

Since the round-trip-rule might be hard to implement, I also propose that we
initially implement a simple heuristic that approximates the round-trip-rule:
warn for all transfers from the host to the accelerator, but do not warn for any
transfers from the accelerator to the host.

> **Reviewer:** arguably warning only on swift->tf transfers can stand on its own, rather than serving as an approximation due to impl-level convenience. e.g. say swift is running alongside TF, and will send various tensors from swift to TF when these tensors are produced -- even without round-trips, such sends can slow down TF, if swift is being slow in producing these tensors.
>
> **Author:** I see this as similar to the tf->swift transfer case -- there are a lot of situations where the user does expect the transfers to happen and where the transfers don't slow anything down. For example, scalar transfers for control flow. Or some kind of CPU loading & preprocessing step that queues up a bunch of training data for the accelerator. Thinking through scalars is what made me come up with the "round trip" idea. We can suppress all scalar warnings, like we do now, but this creates a possibly-bad situation in combination with suppressing tf->swift transfers: what if your program does a really slow tf->swift transfer and then sends it back as a scalar? Something like this: … If we hide all warnings for scalar transfers and we hide all warnings for tf->swift transfers, then that compiles without warnings. A "round trip" rule would catch it though.
>
> **Reviewer:** If a tf->swift transfer is really slow, it might stall TF compute (due to limited TF RAM buffering) already, without involving a round-trip. :) Overall, what the compiler can warn based on static analysis has limited accuracy, so we might as well go with a fairly simple (explainable / predictable to end users) model. Round-trip involves somewhat more mental calculation than saying "we warn on all host->tf transfers, other than scalar ones, but that's still just an approximation". :) But my position is based on some speculation and is just my opinion. I'll let you decide here.
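
A rough sketch of the scenario described in that exchange, with a hypothetical
`slowHostSummary` helper standing in for the slow host-side step (illustrative
only, not code from the original discussion):

```swift
import TensorFlow

// Hypothetical slow host-side step: pulls every scalar onto the host and
// reduces it in plain Swift.
func slowHostSummary(_ x: Tensor<Float>) -> Float {
    return x.scalars.reduce(0, +) / Float(x.scalars.count)
}

var weights = Tensor<Float>(randomUniform: [10, 1])
let inputs = Tensor<Float>(randomUniform: [100, 10])

for _ in 0..<100 {
    let activations = inputs • weights        // runs on the accelerator
    // Slow accelerator -> host transfer: silent if tf->swift warnings are suppressed.
    let scale = slowHostSummary(activations)
    // The scalar result goes back to the accelerator: silent if scalar warnings
    // are suppressed, yet it closes a round trip that stalls the accelerator.
    weights -= Tensor<Float>(scale) * weights
}
```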

Since all round trips involve a transfer from the host to the accelerator, the
heuristic catches all transfers that the round-trip-rule catches.

[Graph Program Extraction]: https://github.com/tensorflow/swift/blob/master/docs/GraphProgramExtraction.md
[this simple model]: ./ImplicitCopyWarnings/LinearRegression.swift
[the example]: ./ImplicitCopyWarnings/LinearRegression.swift
[these warnings]: ./ImplicitCopyWarnings/LinearRegression-warnings.txt

## Justification

The main purpose of implicit copy warnings is to alert the user when the Swift
for TensorFlow programming model causes their program to have unexpectedly bad
performance.

Users expect their programs to start by transferring data to an accelerator, to
occasionally send debugging or status information (e.g. model loss) back to the
CPU for display or logging, and to send results back to the CPU when they
finish. So we should not produce warnings for any of these things. For example,
this code, which does all of those things, should not produce any warnings:

```swift
public func train(inputs: Tensor<Float>, outputs: Tensor<Float>, initialWeights: Tensor<Float>) -> Tensor<Float> {
    // `inputs`, `outputs`, and `initialWeights` are copied to the accelerator as the program starts.
    var weights = initialWeights
    for step in 0...1000 {
        let predictions = inputs • weights
        let errors = predictions - outputs
        let dweights = (errors * inputs).sum(alongAxes: 0).transposed()
        weights -= 0.01 * dweights

        if step % 100 == 0 {
            print("Current weights: \(weights)") // Notice that `weights` gets copied to the CPU.
        }
    }
    return weights // The final weights are copied back to the CPU as the program ends.
}
```

What we want to avoid is unexpectedly blocking users' computations. This can
happen when the user writes some code that (unbeknownst to them) forces some
computation to happen on the CPU before computation on the accelerator can
proceed. For example, suppose that `cpuOnlyComputation` runs a computation that
can only happen on the CPU. Then this training loop blocks on data transfer and
CPU computation every iteration:

```swift
public func train(inputs: Tensor<Float>, outputs: Tensor<Float>, initialWeights: Tensor<Float>) -> Tensor<Float> {
    var weights = initialWeights
    for step in 0...1000 {
        // `inputs • weights` is computed on the accelerator, sent to the CPU for
        // `cpuOnlyComputation`, and the result comes back to the accelerator: a
        // round trip that blocks the accelerator every iteration.
        let predictions = cpuOnlyComputation(inputs • weights)
        let errors = predictions - outputs
        let dweights = (errors * inputs).sum(alongAxes: 0).transposed()
        weights -= 0.01 * dweights

        if step % 100 == 0 {
            print("Current weights: \(weights)")
        }
    }
    return weights
}
```
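
For concreteness, here is one hypothetical shape such a host-only function could
take (`cpuOnlyComputation` is left abstract above; this sketch is only an
illustration): it materializes the scalars on the host, transforms them in plain
Swift, and rebuilds a tensor, so the accelerator has to wait for the host on
every iteration.

```swift
// Hypothetical host-only step: work on the raw scalars cannot be expressed as
// TensorFlow graph ops, so the tensor is copied to the host, processed there,
// and the result is copied back to the accelerator.
func cpuOnlyComputation(_ x: Tensor<Float>) -> Tensor<Float> {
    let hostValues = x.scalars.map { $0.rounded() }   // runs on the host CPU
    return Tensor<Float>(shape: x.shape, scalars: hostValues)
}
```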

We should emit a warning in the above code so that the user is aware that their
training loop is blocking on data transfer and CPU computation.

The round-trip-rule achieves exactly what we want in these examples! So does the
heuristic of warning for all transfers from the host to the accelerator.

## False-positive / False-negative tradeoff

The round-trip-rule does not catch all slowness related to implicit copies. For
example, a program that frequently dumps large pieces of data from the GPU to
the CPU might soak up GPU memory with data waiting to be copied, and the
round-trip-rule will not catch that.
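
As an illustration of that false negative (hypothetical code, not from the
proposal): every step below ships a large intermediate tensor to the host for
logging, and nothing flows back, so there is no round trip and neither the
round-trip-rule nor the heuristic would warn, yet the queued copies can exhaust
accelerator memory.

```swift
import TensorFlow

var weights = Tensor<Float>(randomUniform: [1000, 1000])
let inputs = Tensor<Float>(randomUniform: [1000, 1000])
var activationLog: [Tensor<Float>] = []

for _ in 0..<1000 {
    let activations = inputs • weights        // stays on the accelerator
    weights -= 0.01 * activations             // stays on the accelerator
    // Large accelerator -> host copy on every step; no data returns to the
    // accelerator, so neither policy discussed here flags it.
    activationLog.append(activations.toHost())
}
```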

However, the round-trip-rule does capture what I currently believe will be the
main source of performance unpredictability (accidentally writing code that
moves a piece of a computation onto a different device), so I propose it as a
good initial balance between false positives and false negatives.

After we gain more experience with Swift models, we can revisit implicit copy
warnings and see whether the balance still appears to be good.

## Issues with the heuristic

The heuristic (warn for all transfers from the host to the accelerator) can
produce false positive warnings for harmless transfers from the host to the
accelerator. One common situation where this happens is when the host calculates
some simple control flow conditions and sends them to the accelerator. For
example:

```swift
public func example(steps: Int) -> Tensor<Float> {
    var result: Tensor<Float> = Tensor(zeros: [10, 10])
    var step: Int = 0
    while step < steps {
        if step % 2 == 0 {
            result += 1
        } else {
            result += 2
        }
        step += 1
    }
    return result
}
```

If `step < steps` and `step % 2 == 0` get evaluated on the host and the boolean
results get copied over to the accelerator, then the heuristic will warn about
implicit copies. But these implicit copies are harmless because the host can
quickly run through a bunch of iterations and queue up a bunch of booleans for
the accelerator to consume.

Without implementing the true round-trip-rule, we can denoise the warnings in
that situation by suppressing warnings for scalar transfers. But the
round-trip-rule suppresses those warnings in a cleaner and more reliable way, so
I propose that we eventually do implement the round-trip-rule.
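
To summarize how the two policies relate, here is a rough sketch of the decision
each one makes for a single implicit transfer (hypothetical types and names, not
the actual compiler implementation):

```swift
// Hypothetical model of one implicit transfer observed by the partitioner.
enum Direction { case hostToAccelerator, acceleratorToHost }

struct ImplicitTransfer {
    var direction: Direction
    var isScalar: Bool
    // True if the value being transferred was derived from data that moved in
    // the opposite direction earlier, while the program was running.
    var closesRoundTrip: Bool
}

// Initial heuristic: warn on every host -> accelerator transfer, optionally
// skipping scalars (the existing scalar suppression).
func heuristicWarns(_ t: ImplicitTransfer, suppressScalars: Bool) -> Bool {
    return t.direction == .hostToAccelerator && !(suppressScalars && t.isScalar)
}

// Eventual round-trip rule: warn only when a host -> accelerator transfer
// closes a round trip; transfers at program start and end never do.
func roundTripRuleWarns(_ t: ImplicitTransfer) -> Bool {
    return t.direction == .hostToAccelerator && t.closesRoundTrip
}
```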

> **Reviewer:** and we already implemented the disabling of warnings on scalar values -- should we still include that in our policy design here (instead of forcing us to choose between it and the new round-trip-rule)?
>
> **Author:** I think that this proposal interacts pretty interestingly with disabled scalar warnings, because of the example in my comment above with the slow scalar transfer.

> **Reviewer:** another possible future direction that you mentioned in our earlier discussion is: … Do you want to also briefly cover it in this doc, to establish the "space" of our design exploration?
>
> **Author:** Will add!

@@ -0,0 +1,84 @@
reduced.swift:14:23: warning: 'inputs' implicitly copied to the accelerator, use .toAccelerator() to make transfer explicit
public func predict(inputs: Tensor<Float>) -> Tensor<Float> {
^~~~~~~~~~~~~~~~~~~~~
reduced.swift:15:19: note: value used here
return inputs • w + b

> **Reviewer:** hmm i don't understand why such warnings appear in the first place -- there should be no sends/recvs here.
>
> **Author:** This is because …
>
> **Reviewer:** Before we have a better inlining model, should we recommend that all public APIs be marked as "inline always"? Otherwise if user code calls a sequence of public APIs, that could generate a lot of sends/recvs.
>
> **Reviewer:** I think you meant … I would not say requiring …
>
> **Author:** I think this discussion is mostly orthogonal to the warnings proposal. Regardless of what causes such a function to get partitioned on its own, we need to make the warnings good when it does get partitioned.
>
> **Reviewer:** Yep, it is orthogonal! But at the same time I hope this is providing some context on how argument send/receive warnings are emitted in the first place. One middle ground I can think of is to have the compiler still partition inlinable functions outside the TF module. This would resolve both Mingsheng's concerns and mine.
>
> **Reviewer:** Richard's proposal SGTM. I agree with Mark that what's partitionable is itself orthogonal to the warnings proposal. One related aspect is: when we are partitioning a public API, but if it does not get called in user code, it'd be nice not to generate sends/recvs warnings for it?

reduced.swift:15:23: warning: value implicitly copied to the accelerator, use .toAccelerator() to make transfer explicit
return inputs • w + b
reduced.swift:15:27: warning: value implicitly copied to the accelerator, use .toAccelerator() to make transfer explicit
return inputs • w + b
reduced.swift:18:20: warning: 'inputs' implicitly copied to the accelerator, use .toAccelerator() to make transfer explicit
public func loss(inputs: Tensor<Float>, outputs: Tensor<Float>) -> Float {
^~~~~~~~~~~~~~~~~~~~~
reduced.swift:15:19: note: value used here
return inputs • w + b
reduced.swift:15:23: warning: value implicitly copied to the accelerator, use .toAccelerator() to make transfer explicit
return inputs • w + b
reduced.swift:15:27: warning: value implicitly copied to the accelerator, use .toAccelerator() to make transfer explicit
return inputs • w + b
reduced.swift:18:43: warning: 'outputs' implicitly copied to the accelerator, use .toAccelerator() to make transfer explicit
public func loss(inputs: Tensor<Float>, outputs: Tensor<Float>) -> Float {
^~~~~~~~~~~~~~~~~~~~~~
reduced.swift:20:25: note: value used here
return (predictions - outputs).squared().mean()
~~~~~~~~~~~~^~~~~~~~~
reduced.swift:23:34: warning: 'inputs' implicitly copied to the accelerator, use .toAccelerator() to make transfer explicit
public mutating func trainStep(inputs: Tensor<Float>, outputs: Tensor<Float>,
^~~~~~~~~~~~~~~~~~~~~
reduced.swift:15:19: note: value used here
return inputs • w + b
reduced.swift:23:24: warning: 'self' implicitly copied to the accelerator, use .toAccelerator() to make transfer explicit
public mutating func trainStep(inputs: Tensor<Float>, outputs: Tensor<Float>,
^~~~~~~~~
reduced.swift:15:19: note: value used here
return inputs • w + b
reduced.swift:23:57: warning: 'outputs' implicitly copied to the accelerator, use .toAccelerator() to make transfer explicit
public mutating func trainStep(inputs: Tensor<Float>, outputs: Tensor<Float>,
^~~~~~~~~~~~~~~~~~~~~~
reduced.swift:27:30: note: value used here
let errors = predictions - outputs
~~~~~~~~~~~~^~~~~~~~~
reduced.swift:35:30: warning: 'inputs' implicitly copied to the accelerator, use .toAccelerator() to make transfer explicit
public mutating func train(inputs: Tensor<Float>, outputs: Tensor<Float>, learningRate: Float,
^~~~~~~~~~~~~~~~~~~~~
reduced.swift:15:19: note: value used here
return inputs • w + b
reduced.swift:35:24: warning: 'self' implicitly copied to the accelerator, use .toAccelerator() to make transfer explicit
public mutating func train(inputs: Tensor<Float>, outputs: Tensor<Float>, learningRate: Float,
^~~~~
reduced.swift:15:19: note: value used here
return inputs • w + b
reduced.swift:35:53: warning: 'outputs' implicitly copied to the accelerator, use .toAccelerator() to make transfer explicit
public mutating func train(inputs: Tensor<Float>, outputs: Tensor<Float>, learningRate: Float,
^~~~~~~~~~~~~~~~~~~~~~
reduced.swift:27:30: note: value used here
let errors = predictions - outputs
~~~~~~~~~~~~^~~~~~~~~
reduced.swift:31:7: warning: value implicitly copied to the host, use .toHost() to make transfer explicit
w -= learningRate * dw
~~^~~~~~~~~~~~~~~~~~~~
reduced.swift:35:24: note: value used here
public mutating func train(inputs: Tensor<Float>, outputs: Tensor<Float>, learningRate: Float,
~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
reduced.swift:32:7: warning: value implicitly copied to the host, use .toHost() to make transfer explicit
b -= learningRate * db
~~^~~~~~~~~~~~~~~~~~~~
reduced.swift:35:24: note: value used here
public mutating func train(inputs: Tensor<Float>, outputs: Tensor<Float>, learningRate: Float,
~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
reduced.swift:31:7: warning: value implicitly copied to the host, use .toHost() to make transfer explicit
w -= learningRate * dw
~~^~~~~~~~~~~~~~~~~~~~
reduced.swift:65:10: note: value used here
return model
^~~~~
reduced.swift:32:7: warning: value implicitly copied to the host, use .toHost() to make transfer explicit
b -= learningRate * db
~~^~~~~~~~~~~~~~~~~~~~
reduced.swift:65:10: note: value used here
return model
^~~~~
reduced.swift:15:23: warning: value implicitly copied to the accelerator, use .toAccelerator() to make transfer explicit
return inputs • w + b
reduced.swift:15:27: warning: value implicitly copied to the accelerator, use .toAccelerator() to make transfer explicit
return inputs • w + b
@@ -0,0 +1,92 @@
import TensorFlow

public struct LinearModel {
    var w: Tensor<Float>
    var b: Tensor<Float>

    init(inputSize: Int32) {
        w = Tensor<Float>(randomUniform: [inputSize, 1])
        b = Tensor<Float>(randomUniform: [1])
    }
}

extension LinearModel {
    public func predict(inputs: Tensor<Float>) -> Tensor<Float> {
        return inputs • w + b
    }

    public func loss(inputs: Tensor<Float>, outputs: Tensor<Float>) -> Float {
        let predictions = predict(inputs: inputs)
        return (predictions - outputs).squared().mean()
    }

    public mutating func trainStep(inputs: Tensor<Float>, outputs: Tensor<Float>,
                                   learningRate: Float) {
        let predictions = predict(inputs: inputs)

        let errors = predictions - outputs
        let dw = (errors * inputs).sum(alongAxes: 0).transposed()
        let db = errors.sum(squeezingAxes: 0)

        w -= learningRate * dw
        b -= learningRate * db
    }

    public mutating func train(inputs: Tensor<Float>, outputs: Tensor<Float>, learningRate: Float,
                               steps: Int) {
        print("Training for \(steps) steps")
        for i in 0..<steps {
            trainStep(inputs: inputs, outputs: outputs, learningRate: learningRate)
            if i % (steps / 10) == 0 || i == steps - 1 {
                print("Current model: \(self), training loss: \(loss(inputs: inputs, outputs: outputs))")
            }
        }
    }
}

public func trainSampleModel() -> LinearModel {
    // The output is the sum of the two inputs. But the inputs are noisy.
    let inputSize: Int32 = 2
    let trainInputs = Tensor<Float>([
        [1, 1],
        [-1, 1],
        [5, 5],
        [2, 3]
    ]) + 0.1 * Tensor(randomNormal: [4, 2])
    let trainOutputs = Tensor<Float>([
        [2],
        [0],
        [10],
        [5]
    ])

    var model = LinearModel(inputSize: inputSize)
    model.train(inputs: trainInputs, outputs: trainOutputs, learningRate: 0.01, steps: 1000)
    return model
}

public func testSampleModel(model: LinearModel) {
    // The output is the sum of the two inputs. But the inputs are noisy.
    let testInputs = Tensor<Float>([
        [3, 10],
        [-5, 6],
        [4, 2],
        [0, 6]
    ]) + 0.1 * Tensor(randomNormal: [4, 2])
    let testOutputs = Tensor<Float>([
        [13],
        [1],
        [6],
        [6]
    ])

    let loss = model.loss(inputs: testInputs, outputs: testOutputs)
    let predictions = model.predict(inputs: testInputs)
    print("Testing loss: \(loss)")
    print("Testing predictions: \(predictions)")
    print("Correct outputs: \(testOutputs)")
}

let model = trainSampleModel()
print("")
testSampleModel(model: model)

> **Reviewer:** agree #1 is a good idea, but how do we implement it (how do we tell whether a swift->tf data transfer happens before the TF graph runs)? Currently our use of "input args" and "result tensors" of the TF program serve the purpose of differentiating themselves from the sends/recvs, and thus generate no warning. If we convert the input args and result tensors into tensor sends/recvs, i'm not sure how to reliably tell them apart from those other sends/recvs that are happening while the tf program is running. Also, if swift is being too slow sending the first tensor to tf, it'd still increase the end-to-end running time, and it might be useful to warn in that case (if possible).
>
> **Author:** I wanted to capture the current use of "input args" and "result tensors" in this rule, because it seems like part of the big picture. Also, it seems like they currently might be bugged in some way, because a lot of the warnings in the example look like they should be input args or result tensors. So a big part of cleaning up the warnings might be debugging those. Are we planning to eventually remove input args and result tensors and replace them with send/recv? If we do that, could we keep the logic that currently creates input args and result tensors, and use it to decide whether to warn?
>
> **Reviewer:** SGTM. I think that direction is worth exploration, but it's too early to commit to that. If we do go there, I agree that should probably not affect our warnings policy.