This repository has been archived by the owner on Mar 30, 2022. It is now read-only.

Implicit Copy Warning Improvements #55

Closed · wants to merge 6 commits

proposals/ImplicitCopyWarnings.md (new file, 144 additions)

# Implicit Copy Warning Improvements

* Author: [Marc Rasi](https://github.com/marcrasi)

## Introduction

Currently, implicit copy warnings are very noisy. For example, [this simple
model] produces [these warnings]. (See the "Performance Predictability" section
of [Graph Program Extraction] for more information about implicit copy
warnings).

I propose that we clean up the warnings as follows:
1. Emit no warnings for data transferred while the program is starting (e.g.
Contributor:

I agree #1 is a good idea, but how do we implement it (how do we tell whether a Swift->TF data transfer happens before the TF graph runs)?

Currently, our use of "input args" and "result tensors" for the TF program serves to differentiate them from sends/recvs, and thus generates no warning.

If we convert the input args and result tensors into tensor sends/recvs, I'm not sure how to reliably tell them apart from the other sends/recvs that happen while the TF program is running.

Also, if Swift is too slow sending the first tensor to TF, it'd still increase the end-to-end running time, and it might be useful to warn in that case (if possible).

Contributor Author:

I wanted to capture the current use of "input args" and "result tensors" in this rule, because it seems like part of the big picture.

Also it seems like they currently might be bugged in some way because a lot of the warnings in the example look like they should be input args or result tensors. So a big part of cleaning up the warnings might be debugging those.

Are we planning to eventually remove input args and result tensors and replace them with send/recv? If we do that, could we keep the logic that currently creates input args and result tensors, and use it to decide whether to warn?

Contributor:

> Also it seems like they currently might be bugged in some way because a lot of the warnings in the example look like they should be input args or result tensors. So a big part of cleaning up the warnings might be debugging those.

SGTM.

> Are we planning to eventually remove input args and result tensors and replace them with send/recv? If we do that, could we keep the logic that currently creates input args and result tensors, and use it to decide whether to warn?

I think that direction is worth exploring, but it's too early to commit to that. If we do go there, I agree that it should probably not affect our warnings policy.

training data being copied to the GPU) or ending (e.g. final weights being
copied to the CPU).
2. Within the program, only warn when data makes a round trip between devices:
when a piece of data moves from device A to device B, then a computation
Contributor:

Does the notion of a "device" include the Swift host? Usually "device" refers only to a TF device, like the TF GPU.

Also, is the description here overly general -- we are not talking about warnings for data transfers between the TF CPU and GPU, correct?

Contributor Author:

Oh yes, I meant to just talk about host<->accelerator transfers. I will reword this to make it more clear.

using that data happens on device B, and then the result of that computation
moves back to device A.

Concretely, this proposal eliminates all warnings in [the example].

Since the round-trip-rule might be hard to implement, I also propose that we
initially implement a simple heuristic that approximates the round-trip-rule:
Warn for all transfers from the host to the accelerator, but do not warn for any
transfers from the accelerator to the host.
Contributor:

Arguably, warning only on Swift->TF transfers can stand on its own, rather than serving as an approximation adopted for implementation-level convenience. E.g. say Swift is running alongside TF and sends various tensors to TF as they are produced -- even without round trips, such sends can slow down TF if Swift is slow in producing those tensors.

Contributor Author:

I see this as similar to the tf->swift transfer case -- there are a lot of situations where the user does expect the transfers to happen and where the transfers don't slow anything down. For example, scalar transfers for control flow. Or some kind of CPU loading & preprocessing step that queues up a bunch of training data for the accelerator.

Thinking through scalars is what made me come up with the "round trip" idea. We can suppress all scalar warnings, like we do now, but this creates a possibly-bad situation in combination with suppressing tf->swift transfers: what if your program does a really slow tf->swift transfer and then sends it back as a scalar? Something like this:

for step in 0...1000 {
  let gradients = ... // on accelerator
  weights += gradients // on accelerator
  if cpuOnlyFunc(weights) == 0 { // long round-trip transfer that blocks the computation
    weights += 1
  }
}

If we hide all warnings for scalar transfers and we hide all warnings for tf->swift transfers, then that compiles without warnings. A "round trip" rule would catch it though.

Contributor (@mhong, Aug 14, 2018):

If a TF->Swift transfer is really slow, it might already stall TF compute (due to limited TF RAM buffering), without involving a round trip. :)

Overall, what the compiler can warn about based on static analysis has limited accuracy, so we might as well go with a fairly simple (explainable / predictable to end users) model. The round-trip rule involves somewhat more mental calculation than saying "we warn on all host->TF transfers, other than scalar ones, but that's still just an approximation". :)

But my position is based on some speculation and is just my opinion. I'll let you decide here.


Since all round trips involve a transfer from the host to the accelerator, the
heuristic catches all transfers that the round-trip-rule catches.
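
To make the comparison concrete, here is a minimal sketch of the two policies expressed as predicates over a hypothetical `Transfer` record. The type, its fields, and the function names are illustrative assumptions, not the compiler's actual representation:

```swift
// Hypothetical model of one implicit copy found during partitioning
// (illustrative only; not how the compiler represents transfers).
enum Device { case host, accelerator }

struct Transfer {
  var from: Device
  var to: Device
  // True if the copy happens while the program is starting or ending
  // (e.g. initial training data in, final weights out).
  var atProgramBoundary: Bool
  // True if this copy is part of a round trip: data crossed over, was used,
  // and a derived result crossed back while the graph was running.
  // Detecting this dependency is the hard part.
  var partOfRoundTrip: Bool
}

// Round-trip-rule: warn only on copies that participate in a round trip
// during execution.
func roundTripRuleWarns(_ t: Transfer) -> Bool {
  return !t.atProgramBoundary && t.partOfRoundTrip
}

// Interim heuristic: warn on every host -> accelerator copy during execution.
// Every round trip contains a host -> accelerator leg, so anything the
// round-trip-rule flags is also flagged (on at least one of its legs) here.
func heuristicWarns(_ t: Transfer) -> Bool {
  return !t.atProgramBoundary && t.from == .host && t.to == .accelerator
}
```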

[Graph Program Extraction]: https://github.com/tensorflow/swift/blob/master/docs/GraphProgramExtraction.md
[this simple model]: ./ImplicitCopyWarnings/LinearRegression.swift
[the example]: ./ImplicitCopyWarnings/LinearRegression.swift
[these warnings]: ./ImplicitCopyWarnings/LinearRegression-warnings.txt

## Justification

The main purpose of implicit copy warnings is to alert the user when the Swift
for TensorFlow programming model causes their program to have unexpectedly bad
performance.

Users expect their programs to start off by transferring data to an accelerator,
they expect their programs to occasionally send debugging or status information
(e.g. model loss) back to the CPU for display or logging purposes, and they
expect their programs to send results back to the CPU when they finish. So we
should not produce warnings for any of these things. For example, this code,
which does all of those things, should not produce any warnings:

```swift
public func train(inputs: Tensor<Float>, outputs: Tensor<Float>, initialWeights: Tensor<Float>) -> Tensor<Float> {
  var weights = initialWeights
  for step in 0...1000 {
    let predictions = inputs • weights
    let errors = predictions - outputs
    let dweights = (errors * inputs).sum(alongAxes: 0).transposed()
    weights -= 0.01 * dweights

    if step % 100 == 0 {
      print("Current weights: \(weights)") // Notice that `weights` gets copied to the CPU
    }
  }
  return weights
}
```

What we want to avoid is unexpectedly blocking users' computations. This can
happen when the user writes some code that (unbeknownst to them) forces some
computation to happen on the CPU before computation on the accelerator can
proceed. For example, suppose that `cpuOnlyComputation` runs a computation that
can only happen on the CPU. Then this training loop blocks on data transfer and
CPU computation every iteration:

```swift
public func train(inputs: Tensor<Float>, outputs: Tensor<Float>, initialWeights: Tensor<Float>) -> Tensor<Float> {
  var weights = initialWeights
  for step in 0...1000 {
    let predictions = cpuOnlyComputation(inputs • weights)
    let errors = predictions - outputs
    let dweights = (errors * inputs).sum(alongAxes: 0).transposed()
    weights -= 0.01 * dweights

    if step % 100 == 0 {
      print("Current weights: \(weights)")
    }
  }
  return weights
}
```

We should emit a warning in the above code so that the user is aware that their
training loop is blocking on data transfer and CPU computation.

The round-trip-rule achieves exactly what we want in these examples! So does the
heuristic of warning for all transfers from the host to the accelerator.

## False-positive / False-negative tradeoff

The round-trip-rule does not catch all slowness related to implicit copies. For
example, a program that frequently dumps large pieces of data from the GPU to
the CPU might soak up GPU memory with data waiting to be copied, and the
round-trip-rule will not catch that.
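
As a hedged illustration of this kind of false negative (reusing the tensor operations from the examples above, and assuming the same `import TensorFlow` context; the function name is made up), the loop below copies a large intermediate tensor to the host on every iteration purely for logging. Nothing derived from it returns to the accelerator, so the round-trip-rule stays silent even though the copies may be costly:

```swift
public func trainWithVerboseLogging(inputs: Tensor<Float>, outputs: Tensor<Float>,
                                    initialWeights: Tensor<Float>) -> Tensor<Float> {
  var weights = initialWeights
  for step in 0...1000 {
    let predictions = inputs • weights
    let errors = predictions - outputs
    let dweights = (errors * inputs).sum(alongAxes: 0).transposed()
    weights -= 0.01 * dweights
    // The full `errors` tensor is copied to the host on every iteration just to
    // be printed. No host-computed result feeds back into the graph, so there is
    // no round trip, but the accelerator-to-host copies can still pile up.
    print("step \(step) errors: \(errors)")
  }
  return weights
}
```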

However, the round-trip-rule does capture what I currently believe will be the
main source of performance unpredictability (accidentally writing code that
moves a piece of a computation onto a different device), so I propose that it
strikes a good initial balance between false positives and false negatives.

After we gain more experience with Swift models, we can revisit implicit copy
warnings and see whether the balance still appears to be good.

## Issues with the heuristic

The heuristic (warn for all transfers from the host to the accelerator) can
produce false positive warnings for harmless transfers from the host to the
accelerator. One common situation where this happens is when the host calculates
some simple control flow conditions and sends them to the accelerator. For
example:

```swift
public func example(steps: Int) -> Tensor<Float> {
  var result: Tensor<Float> = Tensor(zeros: [10, 10])
  var step: Int = 0
  while step < steps {
    if step % 2 == 0 {
      result += 1
    } else {
      result += 2
    }
    step += 1
  }
  return result
}
```

If `step < steps` and `step % 2 == 0` get evaluated on the host and the boolean
results get copied over to the accelerator, then the heuristic will warn about
implicit copies. But these implicit copies are harmless because the host can
quickly run through a bunch of iterations and queue up a bunch of booleans for
the accelerator to consume.

Without implementing the true round-trip-rule, we can denoise the warnings in
that situation by suppressing warnings for scalar transfers. But the
Contributor:

And we already implemented the disabling of warnings on scalar values -- should we still include that in our policy design here (instead of forcing us to choose between it and the new round-trip-rule)?

Contributor Author:

I think that this proposal interacts pretty interestingly with disabled scalar warnings, because of the example in my comment above with the slow scalar transfer.

round-trip-rule suppresses those warnings in a cleaner and more reliable way, so
I propose that we eventually do implement the round-trip-rule.
Contributor:

Another possible future direction that you mentioned in our earlier discussion is to move sends/recvs diagnostics to a separate tool, possibly using runtime info (e.g. profile data).

Do you want to also briefly cover it in this doc, to establish the "space" of our design exploration?

Contributor Author:

Will add!

proposals/ImplicitCopyWarnings/LinearRegression-warnings.txt (new file, 84 additions)

reduced.swift:14:23: warning: 'inputs' implicitly copied to the accelerator, use .toAccelerator() to make transfer explicit
public func predict(inputs: Tensor<Float>) -> Tensor<Float> {
Contributor:

Hmm, I don't understand why such warnings appear in the first place -- there should be no sends/recvs here. Maybe this is due to the use of mutating?

Contributor:

This is because predict(inputs:) is being partitioned as a deabstraction scope. It's a public function.

Contributor:

Before we have a better inlining model, should we recommend that all public APIs be marked as "inline always"? Otherwise if user code calls a sequence of public APIs, that could generate a lot of sends/recvs.

Contributor:

> public APIs be marked as "inline always"?

I think you meant @inlinable. @inline(__always) is just local inlining and won't make the SIL function serialized.

I would not say requiring @inlinable is a good idea, because

  1. Requiring excessive annotation from the user breaks the mental model of "tensor code just works".
  2. It would delay shape checking for public APIs if they are not called in the current module. If any shape error occurs in the caller module, the error location would be unknown.
  3. Partitioning in the first place guarantees the function will at least run under all circumstances, instead of printing "__tfop_xxx cannot be lowered to LLVM IR!!" at run-time. For example, if you use an opaque library function like expectEqual that makes a call to a tensor function through protocol conformance, the program fails because the function was never partitioned. This is unavoidable for generic functions because generic functions cannot be partitioned and must be @inlinable, but non-generic functions can at least be partitioned and have a binary representation to execute reliably no matter how it's being called.
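
For reference, here is a minimal sketch of the two annotations being discussed, applied to a `predict`-shaped method. The `Model` type, its `@usableFromInline` properties, and the method names are illustrative assumptions, not code from this PR:

```swift
import TensorFlow

public struct Model {
  // Stored properties must be visible to inlinable code, hence @usableFromInline.
  @usableFromInline var w: Tensor<Float>
  @usableFromInline var b: Tensor<Float>

  public init(w: Tensor<Float>, b: Tensor<Float>) {
    self.w = w
    self.b = b
  }

  // @inlinable serializes the body so callers in other modules can inline
  // (and partition) it; this is roughly what the suggestion above amounts to.
  @inlinable
  public func predict(inputs: Tensor<Float>) -> Tensor<Float> {
    return inputs • w + b
  }

  // @inline(__always) only requests local inlining and does not serialize the
  // SIL, so by itself it would not help cross-module partitioning.
  @inline(__always)
  public func predictLocalInline(inputs: Tensor<Float>) -> Tensor<Float> {
    return inputs • w + b
  }
}
```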

Contributor Author:

I think this discussion is mostly orthogonal to the warnings proposal. Regardless of what causes such a function to get partitioned on its own, we need to make the warnings good when it does get partitioned.

Contributor (@rxwei, Aug 13, 2018):

Yep, it is orthogonal! But at the same time I hope this is providing some context on how argument send/receive warnings are emitted in the first place.

One middle ground I can think of is to have the compiler still partition inlinable functions outside the TF module. This would resolve both Mingsheng’s concerns and mine.

Contributor:

Richard's proposal SGTM.

I agree with Marc that what's partitionable is itself orthogonal to the warnings proposal. One related aspect: when we are partitioning a public API that does not get called in user code, could we avoid generating sends/recvs warnings for it?

^~~~~~~~~~~~~~~~~~~~~
reduced.swift:15:19: note: value used here
return inputs • w + b
reduced.swift:15:23: warning: value implicitly copied to the accelerator, use .toAccelerator() to make transfer explicit
return inputs • w + b
reduced.swift:15:27: warning: value implicitly copied to the accelerator, use .toAccelerator() to make transfer explicit
return inputs • w + b
reduced.swift:18:20: warning: 'inputs' implicitly copied to the accelerator, use .toAccelerator() to make transfer explicit
public func loss(inputs: Tensor<Float>, outputs: Tensor<Float>) -> Float {
^~~~~~~~~~~~~~~~~~~~~
reduced.swift:15:19: note: value used here
return inputs • w + b
reduced.swift:15:23: warning: value implicitly copied to the accelerator, use .toAccelerator() to make transfer explicit
return inputs • w + b
reduced.swift:15:27: warning: value implicitly copied to the accelerator, use .toAccelerator() to make transfer explicit
return inputs • w + b
reduced.swift:18:43: warning: 'outputs' implicitly copied to the accelerator, use .toAccelerator() to make transfer explicit
public func loss(inputs: Tensor<Float>, outputs: Tensor<Float>) -> Float {
^~~~~~~~~~~~~~~~~~~~~~
reduced.swift:20:25: note: value used here
return (predictions - outputs).squared().mean()
~~~~~~~~~~~~^~~~~~~~~
reduced.swift:23:34: warning: 'inputs' implicitly copied to the accelerator, use .toAccelerator() to make transfer explicit
public mutating func trainStep(inputs: Tensor<Float>, outputs: Tensor<Float>,
^~~~~~~~~~~~~~~~~~~~~
reduced.swift:15:19: note: value used here
return inputs • w + b
reduced.swift:23:24: warning: 'self' implicitly copied to the accelerator, use .toAccelerator() to make transfer explicit
public mutating func trainStep(inputs: Tensor<Float>, outputs: Tensor<Float>,
^~~~~~~~~
reduced.swift:15:19: note: value used here
return inputs • w + b
reduced.swift:23:57: warning: 'outputs' implicitly copied to the accelerator, use .toAccelerator() to make transfer explicit
public mutating func trainStep(inputs: Tensor<Float>, outputs: Tensor<Float>,
^~~~~~~~~~~~~~~~~~~~~~
reduced.swift:27:30: note: value used here
let errors = predictions - outputs
~~~~~~~~~~~~^~~~~~~~~
reduced.swift:35:30: warning: 'inputs' implicitly copied to the accelerator, use .toAccelerator() to make transfer explicit
public mutating func train(inputs: Tensor<Float>, outputs: Tensor<Float>, learningRate: Float,
^~~~~~~~~~~~~~~~~~~~~
reduced.swift:15:19: note: value used here
return inputs • w + b
reduced.swift:35:24: warning: 'self' implicitly copied to the accelerator, use .toAccelerator() to make transfer explicit
public mutating func train(inputs: Tensor<Float>, outputs: Tensor<Float>, learningRate: Float,
^~~~~
reduced.swift:15:19: note: value used here
return inputs • w + b
reduced.swift:35:53: warning: 'outputs' implicitly copied to the accelerator, use .toAccelerator() to make transfer explicit
public mutating func train(inputs: Tensor<Float>, outputs: Tensor<Float>, learningRate: Float,
^~~~~~~~~~~~~~~~~~~~~~
reduced.swift:27:30: note: value used here
let errors = predictions - outputs
~~~~~~~~~~~~^~~~~~~~~
reduced.swift:31:7: warning: value implicitly copied to the host, use .toHost() to make transfer explicit
w -= learningRate * dw
~~^~~~~~~~~~~~~~~~~~~~
reduced.swift:35:24: note: value used here
public mutating func train(inputs: Tensor<Float>, outputs: Tensor<Float>, learningRate: Float,
~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
reduced.swift:32:7: warning: value implicitly copied to the host, use .toHost() to make transfer explicit
b -= learningRate * db
~~^~~~~~~~~~~~~~~~~~~~
reduced.swift:35:24: note: value used here
public mutating func train(inputs: Tensor<Float>, outputs: Tensor<Float>, learningRate: Float,
~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
reduced.swift:31:7: warning: value implicitly copied to the host, use .toHost() to make transfer explicit
w -= learningRate * dw
~~^~~~~~~~~~~~~~~~~~~~
reduced.swift:65:10: note: value used here
return model
^~~~~
reduced.swift:32:7: warning: value implicitly copied to the host, use .toHost() to make transfer explicit
b -= learningRate * db
~~^~~~~~~~~~~~~~~~~~~~
reduced.swift:65:10: note: value used here
return model
^~~~~
reduced.swift:15:23: warning: value implicitly copied to the accelerator, use .toAccelerator() to make transfer explicit
return inputs • w + b
reduced.swift:15:27: warning: value implicitly copied to the accelerator, use .toAccelerator() to make transfer explicit
return inputs • w + b

proposals/ImplicitCopyWarnings/LinearRegression.swift (new file, 92 additions)

import TensorFlow

public struct LinearModel {
  var w: Tensor<Float>
  var b: Tensor<Float>

  init(inputSize: Int32) {
    w = Tensor<Float>(randomUniform: [inputSize, 1])
    b = Tensor<Float>(randomUniform: [1])
  }
}

extension LinearModel {
  public func predict(inputs: Tensor<Float>) -> Tensor<Float> {
    return inputs • w + b
  }

  public func loss(inputs: Tensor<Float>, outputs: Tensor<Float>) -> Float {
    let predictions = predict(inputs: inputs)
    return (predictions - outputs).squared().mean()
  }

  public mutating func trainStep(inputs: Tensor<Float>, outputs: Tensor<Float>,
                                 learningRate: Float) {
    let predictions = predict(inputs: inputs)

    let errors = predictions - outputs
    let dw = (errors * inputs).sum(alongAxes: 0).transposed()
    let db = errors.sum(squeezingAxes: 0)

    w -= learningRate * dw
    b -= learningRate * db
  }

  public mutating func train(inputs: Tensor<Float>, outputs: Tensor<Float>, learningRate: Float,
                             steps: Int) {
    print("Training for \(steps) steps")
    for i in 0..<steps {
      trainStep(inputs: inputs, outputs: outputs, learningRate: learningRate)
      if i % (steps / 10) == 0 || i == steps - 1 {
        print("Current model: \(self), training loss: \(loss(inputs: inputs, outputs: outputs))")
      }
    }
  }
}

public func trainSampleModel() -> LinearModel {
  // The output is the sum of the two inputs. But the inputs are noisy.
  let inputSize: Int32 = 2
  let trainInputs = Tensor<Float>([
    [1, 1],
    [-1, 1],
    [5, 5],
    [2, 3]
  ]) + 0.1 * Tensor(randomNormal: [4, 2])
  let trainOutputs = Tensor<Float>([
    [2],
    [0],
    [10],
    [5]
  ])

  var model = LinearModel(inputSize: inputSize)
  model.train(inputs: trainInputs, outputs: trainOutputs, learningRate: 0.01, steps: 1000)
  return model
}

public func testSampleModel(model: LinearModel) {
  // The output is the sum of the two inputs. But the inputs are noisy.
  let testInputs = Tensor<Float>([
    [3, 10],
    [-5, 6],
    [4, 2],
    [0, 6]
  ]) + 0.1 * Tensor(randomNormal: [4, 2])
  let testOutputs = Tensor<Float>([
    [13],
    [1],
    [6],
    [6]
  ])

  let loss = model.loss(inputs: testInputs, outputs: testOutputs)
  let predictions = model.predict(inputs: testInputs)
  print("Testing loss: \(loss)")
  print("Testing predictions: \(predictions)")
  print("Correct outputs: \(testOutputs)")
}

let model = trainSampleModel()
print("")
testSampleModel(model: model)