Make checkpointing evaluate only the related tasks #18568

Closed
wants to merge 6 commits into from

Conversation

@sublee
Contributor

commented Mar 28, 2019

This patch fixes #18566, which I reported.

When the autograd graph has parallel lanes and a checkpoint sits on one of them, the checkpoint's backward evaluates unrelated backward tasks on another lane.

For example, in the autograd graph below, Checkpoint(Eval2) evaluates Eval1 during its recursive backward call:

          +--> Eval1 --> AccumulateGrad1
Sum --> Cat
          +--> Checkpoint(Eval2) --> AccumulateGrad2

The root cause is the relative priority of Eval1 and the Eval2 that Checkpoint(Eval2) recomputes. In the autograd engine, priority is determined by a function's sequence number, and the sequence number generator is thread-local. Eval1 is created on the main thread, but Eval2 is recomputed on the worker thread, so there is no logical ordering between their sequence numbers.

How Eval1 gets evaluated during Checkpoint(Eval2)

Let's suppose 10 autograd functions were created before this case. The sequence numbers of the functions would then be:

Eval1: 10
Checkpoint(Eval2): 11
Cat: 12
Sum: 13

Recomputed Eval2: 0

The autograd engine evaluates the task whose function has the highest sequence number first. Thus, the tasks/functions will be evaluated in the following order (GraphRoot omitted for simplicity):

  1. Evaluate Sum: 13, then schedule Cat.
  2. Evaluate Cat: 12, then schedule Eval1 and Checkpoint(Eval2).
  3. Evaluate Checkpoint(Eval2): 11, which has a higher priority than Eval1: 10, then schedule the newly recomputed Eval2: 0.
    1. (recursion) Evaluate Eval1: 10, which has a higher priority than the recomputed Eval2: 0
    2. (recursion) Evaluate Eval2: 0

Eval1: 10 should be evaluated AFTER Checkpoint(Eval2): 11 is done, but instead it is evaluated WITHIN Checkpoint(Eval2): 11.
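
To make the ordering concrete, here is a toy Python model of the priority rule (a sketch, not the engine's actual C++ code); the heap is keyed on the negated sequence number so that higher numbers pop first:

import heapq

# Toy ready queue: the task whose function has the highest sequence
# number is popped first (heapq is a min-heap, so the key is negated).
ready = []

def push(name, seq):
    heapq.heappush(ready, (-seq, name))

# State right after Cat has scheduled its two inputs and Checkpoint(Eval2)
# has scheduled the recomputed Eval2, whose thread-local counter restarted
# at 0 on the worker thread.
push("Eval1", 10)
push("recomputed Eval2", 0)

while ready:
    _, name = heapq.heappop(ready)
    print("evaluate", name)  # prints Eval1 first, then recomputed Eval2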

How this patch fixes it

I propose a stack depth counter for nested backward. With it, the autograd engine evaluates a more deeply nested task first, even if it has a lower sequence number. A nested backward then evaluates only the tasks created within that backward.

The priority key becomes the pair (depth, sequence number):

Eval1: (0, 10)
Checkpoint(Eval2): (0, 11)
Cat: (0, 12)
Sum: (0, 13)

Recomputed Eval2: (1, 0)
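
A toy model of the modified comparison (again a sketch, not the patch itself): with the pair (depth, sequence number), a deeper task wins even with a lower sequence number.

import heapq

ready = []

def push(name, depth, seq):
    # Deeper tasks first; among equal depths, higher sequence number first.
    heapq.heappush(ready, ((-depth, -seq), name))

push("Eval1", 0, 10)
push("recomputed Eval2", 1, 0)

while ready:
    _, name = heapq.heappop(ready)
    print("evaluate", name)  # recomputed Eval2 now comes out before Eval1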

Please review my pull request. I would be happy to discuss this approach.

@soumith soumith requested review from albanD and apaszke Mar 28, 2019

@soumith

Member

commented Mar 28, 2019

man, this is a great investigation, and you also issued a patch. Thanks a lot for the PR.
I've tagged @apaszke and @albanD to review it.

@SsnL

Collaborator

commented Mar 28, 2019

Thanks for the investigation and patch! The fix looks reasonable to me. An alternative fix is to sync the time counter when we start backward threads. I don't know which is better, but @apaszke would for sure know.

@albanD

Collaborator

commented Mar 28, 2019

Hey, thanks for the patch!
I understand that the behavior you see is not what you expect, but why is it wrong? Can it lead to the gradients being wrong?
I'm not sure I understand why this change is actually needed for the backward to work properly.

@SsnL

Collaborator

commented Mar 28, 2019

@albanD My understanding is that this is not a correctness problem. But the order is unexpected.

@sublee

Contributor Author

commented Mar 28, 2019

@albanD Thank you for the comment.

There's no problem with the gradients. But the current behavior can lead to a deadlock if two independent backwards race for the same lock.

# Sorry for this extremely simplified example.
import threading
import torch

global_lock = threading.Lock()

def checkpoint(module, input):
    with global_lock:
        out = recompute(module, input)  # recompute the forward pass (pseudocode)
        out.backward()
    ...

a = checkpoint(model, a)
b = checkpoint(model, b)
c = torch.cat((a, b))

# It will never finish.
c.sum().backward()

There can also be a recursion limit issue; that issue is described in #6959.

@sublee sublee changed the title Make checkpointing evaluates only the correct tasks Make checkpointing evaluates only the related tasks Mar 29, 2019

@sublee sublee changed the title Make checkpointing evaluates only the related tasks Make checkpointing evaluate only the related tasks Mar 29, 2019

@sublee

Contributor Author

commented Mar 29, 2019

If we use CUDA workers, it's a bit more complicated.

Suppose now that everything in the graph is on a CUDA device. The autograd engine will then use two workers: one for the CPU, and one for the CUDA device (GPU).

          +--> Eval1 --> AccumulateGrad1
Sum --> Cat
          +--> Checkpoint(Eval2) --> AccumulateGrad2

I drew a sequence diagram to explain what happens here:

[ CUDA worker ]
  /
 /  [ CPU worker ]
|     /
|    |
|    +- Evaluate GraphRoot, which is automatically created by the autograd engine
|    |  +- Schedule Sum at the CUDA worker
|
+- Evaluate Sum
|  +- Schedule Cat
|
+- Evaluate Cat
|  +- Schedule Eval1 and Checkpoint(Eval2)
|
+- Evaluate Checkpoint(Eval2) because of a higher priority than Eval1
|  +- Schedule GraphRoot for recomputed Eval2 at the CPU worker
|  +- Reentrant backward
|     |                  [ CPU worker ]
|     +- Evaluate Eval1    /
|     |                   |
|     |                   +- Evaluate GraphRoot for recomputed Eval2
|     |                   |  +- Schedule Eval2 at the CUDA worker
|     +- Evaluate Eval2
|

My first approach is not enough because GraphRoot is always scheduled at the CPU worker, even when it is created on a CUDA worker.

I added some commits that schedule GraphRoot at the current worker instead, to prevent evaluating already queued tasks (3736c55).

diff --git a/torch/csrc/autograd/engine.cpp b/torch/csrc/autograd/engine.cpp
index 1e4c5cc4c..a6ecfba69 100644
--- a/torch/csrc/autograd/engine.cpp
+++ b/torch/csrc/autograd/engine.cpp
@@ -612,15 +612,20 @@ auto Engine::execute(const edge_list& roots,
   if (!outputs.empty()) {
     graph_task.init_to_execute(*graph_root, outputs);
   }
-  ready_queue(at::kCPU).push(FunctionTask(&graph_task, std::move(graph_root), InputBuffer(0)));
+
+  auto task = FunctionTask(&graph_task, std::move(graph_root), InputBuffer(0));

   // Not a worker
   if (worker_device == NO_DEVICE) {
+    ready_queue(at::kCPU).push(std::move(task));
+
     // Wait for all tasks to complete
     graph_task.not_done.wait(lock, [&graph_task]{
       return graph_task.outstanding_tasks.load() == 0;
     });
   } else {
+    ready_queue_by_index(worker_device).push(std::move(task));
+
     // Get back to work while we wait for our new graph_task to
     // complete!
     // See Note [Reentrant backwards]

With that change, the sequence diagram for the reentrant backward becomes:

[ CUDA worker ]
  /
 /  [ CPU worker ]
|     /
|    |
|    +- Evaluate GraphRoot, which is automatically created by the autograd engine
|    |  +- Schedule Sum at the CUDA worker
|
+- Evaluate Sum
|  +- Schedule Cat
|
+- Evaluate Cat
|  +- Schedule Eval1 and Checkpoint(Eval2)
|
+- Evaluate Checkpoint(Eval2) because of a higher priority than Eval1
|  +- Schedule GraphRoot for recomputed Eval2 at the CUDA worker
|  +- Reentrant backward                      ^^^^^^^^^^^^^^^^^^
|     |
|     +- Evaluate GraphRoot for recomputed Eval2
|     |  +- Schedule Eval2
|     |
|     +- Evaluate Eval2
|
+- Evaluate Eval1
|
@albanD

Collaborator

commented Mar 29, 2019

But then what happens if your two checkpoints are on different devices (and thus different threads)? They can still run interleaved depending on which one gets the Python GIL and when.
What you're trying to enforce in this PR is "When you enter reentrant autograd, no other autograd task will run until all the ones associated with this entry are done"? Or is it something else?

@sublee

Contributor Author

commented Mar 30, 2019

@albanD Thanks for your concern. I guess you want to examine this new graph; it's probably a very good case for figuring out what can go wrong:

          +--> Checkpoint(Eval1) --> AccumulateGrad1
Sum --> Cat
          +--> Copy --> Checkpoint(Eval2) --> AccumulateGrad2

No problem with two lanes on different devices

The graph has two checkpoints, and one of them is on a different device. Copy copies the output of Checkpoint(Eval2) from cuda:2 to cuda:1. In this case, there's no problem even without my patch because, if I understand engine.cpp correctly, each device has its own worker thread and its own queue of ready tasks.

// One for CPU, plus one for every GPU device (but colocate GPUs of different
// types)
int num_threads = num_devices + 1;
ready_queues = std::vector<std::shared_ptr<ReadyQueue>>(num_threads);

The problem appears only when the task queue is not empty at the moment a nested backward is called. These checkpoints, being on different devices, are queued on different task queues, so they can be evaluated independently and in parallel.
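
As a rough mental model of that layout (a Python sketch, not the real C++ engine), each device owns its own queue and worker thread:

import queue
import threading

# One ready queue and one worker per device, mirroring the
# num_devices + 1 layout quoted above from engine.cpp.
ready_queues = {dev: queue.Queue() for dev in ("cpu", "cuda:1", "cuda:2")}

def worker(dev):
    while True:
        task = ready_queues[dev].get()
        if task is None:  # shutdown sentinel
            break
        task()  # tasks queued on different devices run independently

workers = [threading.Thread(target=worker, args=(dev,), daemon=True)
           for dev in ready_queues]
for w in workers:
    w.start()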

This sequence diagram shows what happens on the CPU worker, CUDA worker #1, and CUDA worker #2 when the two checkpoints are on different devices, without my patch. Checkpoint(Eval1) and Checkpoint(Eval2) may be evaluated simultaneously:

[ CPU worker ]
 /
|
+- Evaluate GraphRoot
|  +- Schedule Sum at the CUDA worker #1

[ CUDA worker #1 ]
 /
|
+- Evaluate Sum
|  +- Schedule Cat
|
+- Evaluate Cat
|  +- Schedule Eval1 and Copy
|
+- Evaluate Copy because of a higher priority than Eval1
|  +- Schedule Checkpoint(Eval2) at the CUDA worker #2
|
|                                               [ CUDA worker #2 ]
|                                                  \
+- Evaluate Checkpoint(Eval1)                       \
|  +- Schedule GraphRoot                             |
|  |  for recomputed Eval1 at the CPU worker         +- Evaluate Checkpoint(Eval2)
|  +- Reentrant backward                             |  +- Schedule GraphRoot
|     |                                              |  |  for recomputed Eval2 at the CPU worker
|     |  [ CPU worker ]                              |  +- Reentrant backward
|     |    /                                         |     |
|     |   /                                          |     |
|     |  | (The order of the below two GraphRoots    |     |
|     |  |  can be changed because they have same    |     |
|     |  |  sequence number)                         |     |
|     |  |                                           |     |
|     |  +- Evaluate GraphRoot for Eval 1            |     |
|     |  |  +- Schedule Eval1 at the CUDA worker #1  |     |
|     |  |                                           |     |
|     |  +- Evaluate GraphRoot for Eval 2            |     |
|     |  |  +- Schedule Eval2 at the CUDA worker #2  |     |
|     |                                              |     |
|     |                                              |     |
|     +- Evaluate Eval1                              |     +- Evaluate Eval 2

This patch doesn't hurt the backward pass parallelism

What you're trying to enforce in this PR is "When you enter reentrant autograd, no other autograd task will run until all the ones associated with this entry are done"? Or is it something else?

Your summary is almost right, but one more constraint should be included: "No other autograd task on the same device worker will run until all the ones associated with a nested backward are done."

To apply that constraint, the GraphRoot belonging to a nested backward should be queued on the worker thread where the backward is called, rather than always on the CPU worker thread.

- ready_queue(at::kCPU).push(FunctionTask(&graph_task, std::move(graph_root), InputBuffer(0)));
+ auto task = FunctionTask(&graph_task, std::move(graph_root), InputBuffer(0));
  if (worker_device == NO_DEVICE) {
+   ready_queue(at::kCPU).push(std::move(task));
    ...
  } else {
+   ready_queue_by_index(worker_device).push(std::move(task));

With my patch, this GraphRoot task has a higher priority than the already queued tasks because of its deeper stack depth, so it is always evaluated before them within the nested backward.

I think you're worried about losing parallelism in the backward pass due to a nested backward. I also want to keep that parallelism, because it is very efficient and it is part of what makes PyTorch's autograd awesome. I don't think my patch hurts it, because it only affects the same worker thread; tasks on other worker threads are still evaluated in parallel, as in the current behavior.

@sublee

Contributor Author

commented Apr 1, 2019

I wrote a simple gist to clarify the problems (a deadlock and a recursion limit error) that I want to fix. You can reproduce them with this gist (tested on PyTorch 1.0.1.post2):

https://gist.github.com/sublee/e370f3ecadb570eac18a9a0f0cc3b095

Please try it and give me feedback. I hope to hear your opinion.

@albanD

Collaborator

commented Apr 1, 2019

Thanks for the code samples, it makes things clearer.

I am not sure "No other autograd task on the same device worker will run until all the ones associated with a nested backward are done." is enough to prevent deadlocks.

In particular, I think the following modification of your example still deadlocks with this PR:

import threading
import torch

lock = threading.Lock()

class RaceCondition(torch.autograd.Function):
    @staticmethod
    def forward(ctx, input):
        return input

    @staticmethod
    def backward(ctx, grad_output):
        # A deadlock occurs.
        with lock:
            with torch.enable_grad():
                grad_output.requires_grad_()
                tmp = grad_output.cuda()
                tmp = tmp.cpu()
            tmp.backward()
        return grad_output.grad


a = torch.ones(1, requires_grad=True)
b = torch.ones(1, requires_grad=True)

a = RaceCondition.apply(a)
b = RaceCondition.apply(b)

c = a + b
print('forward done!')

c.backward()
print('backward done!')  # Here is never reached.

During the backward, the .cuda() op will actually be dispatched to the GPU worker thread.
This means that the CPU thread will be free to run the next available op (the backward of the other RaceCondition, i.e. the "Eval2" of this example).
And this will deadlock as well.

So you would actually want "No other autograd task will run until all the ones associated with a nested backward are done." to prevent any possible deadlock.

But then if you do that, you might have a very large performance loss in terms of parallelism.

So I don't think we can solve this problem completely; a user will always be able to deadlock the autograd by playing with locks and reentrant autograd.
That being said, I agree that it would be nice if the engine ran the tasks newly created within a backward first when it continues working on that thread.
The reason for the current behavior seems to be that the sequence_nr is thread-local, so in reentrant autograd the newly created functions always start at 0 and are therefore processed last.
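
As a toy illustration of that thread-local counter (a Python sketch, not the actual sequence_nr implementation):

import itertools
import threading

_local = threading.local()

def next_sequence_nr():
    # Each thread owns its own counter, so functions created during a
    # reentrant backward on a worker thread restart from 0 and end up
    # at the back of a highest-sequence-first ready queue.
    if not hasattr(_local, "counter"):
        _local.counter = itertools.count()
    return next(_local.counter)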

Now I am not sure whether introducing the notion of depth and moving the GraphRoot function is necessary to fix that particular case, or whether we just want better handling of sequence_nr across threads.
@apaszke might have an opinion on that?

@sublee

Contributor Author

commented Apr 1, 2019

@albanD Thank you for the great counterexample!

I agree with you. My approach doesn't prevent every unexpected recursion; rather, it just reduces the likelihood. In other words, the code complexity this patch adds outweighs what it buys. I no longer intend to push this approach. Thanks for changing my mind.

As you said, the unexpected recursion doesn't seem fixable completely. But I want to note how difficult this problem is to diagnose. In my case, it appeared as a deadlock like the one I shared several times. The recursion was very hard to identify because it wasn't intuitive (there's no explicit recursion) and there was no understandable Python traceback. I had to spend many hours understanding autograd's worker threads and priority mechanism before I could figure it out. And it's not only about deadlocks: an unexpected recursion may lead to other unexpected bugs, including the RecursionError I mentioned. If we can't fix it completely, I think it's better to expose it to PyTorch users transparently. Maybe we can describe the worker thread and priority mechanism in the "Autograd mechanics" documentation and warn about the possible recursions.

I'll close this PR after getting @apaszke's opinion, or in a few days.

@albanD

Collaborator

commented Apr 2, 2019

I agree that adding a note to the autograd mechanics section is necessary, to warn users that there is no guarantee on the order of backward ops when they are independent of each other!

Whether or not we want to fix this particular use case still remains an open question :)

@sublee

Contributor Author

commented Apr 3, 2019

I just have a little idea. I'm not sure whether it can be implemented or whether it would be acceptable.

If we can split a custom backward function at a nested backward (imagine a generator which yields at a nested backward), it would be possible to inject the nested graph into the existing graph on the fly.

class Checkpoint(torch.autograd.Function):
    ...
    @torch.autograd.reentrant_backwards
    @staticmethod
    def backward(ctx, grad_output):
        ...
        yield output.backward(grad_output)
        ...

For example, this graph:

          +--> Eval1 --> AccumulateGrad1
Sum --> Cat
          +--> Checkpoint(Eval2) --> AccumulateGrad2

dynamically becomes in the backward pass:

          +--> Eval1 --> AccumulateGrad1
Sum --> Cat
          +--> Checkpoint[before yield](Eval2) --+            +--> Checkpoint[after yield] --> AccumulateGrad2
                                                 +--> Eval2 --+
@ezyang

Contributor

commented Apr 5, 2019

So, I am not convinced generators will solve your problem (preventing the recursive call from evaluating unrelated tasks, and thus the deadlock when a lock is used around reentrant backwards), but it would solve #6959 (stack overflow), so that may make it worth pursuing in any case. I don't know how complicated the final solution would be, though, and that will influence whether or not we can accept it.

@soumith

Member

commented Apr 5, 2019

I'd like @apaszke to weigh in here.

@ezyang

Contributor

commented Apr 5, 2019

Actually, yield has a different problem, which is that you are exiting the scope that has taken out the lock. Is it OK to release the lock (and reacquire it when control returns)? I don't know about your underlying use case, so it is hard to say.

@ezyang

Contributor

commented Apr 5, 2019

To @albanD

So you would actually want "No other autograd task will run until all the ones associated with a nested backward are done." to prevent any possible deadlock.

But then if you do that, you might have a very large performance loss in terms of parallelism.

I'm not convinced this is entirely true.

There are two interplaying concerns here:

  1. In the "obvious" implementations of reentrant autograd, you spin up an entirely new set of autograd threads and a new function queue and have them operate independently. However...
  2. We maintain the invariant that only one backwards function per device ever runs in parallel. So, in the implementation strategy of (1), you'd need to add a lock per device, which each function takes out before executing.

If we think about it this way, the worker thread that makes the reentrant backwards call is obligated to release its lock upon that call and reacquire it on reentry. If you do that, all is well, there are no deadlocks, and you even get to make progress on both the original and the reentrant autograd graph in the meantime. But you have to be able to release the lock, and then reacquire it when everything finishes.
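
A rough Python sketch of that locking discipline (hypothetical names; the real engine is C++ and structured differently):

import threading

device_lock = threading.Lock()  # hypothetical: one such lock per device

def run_task(backward_fn, nested_backward=None):
    # Invariant: only one backward function per device runs at a time.
    with device_lock:
        if nested_backward is not None:
            # Give up the lock around the reentrant call so other tasks on
            # this device (including the nested graph) can make progress,
            # then take it back before finishing this function.
            device_lock.release()
            try:
                nested_backward()
            finally:
                device_lock.acquire()
        backward_fn()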

@sublee

Contributor Author

commented Apr 5, 2019

@ezyang Thank you for your opinion. Yes, as you pointed out, the yield approach can't solve the deadlock. But this new interface may at least hint to users at how reentrant backward works. The user then has the responsibility to manage race conditions around the explicit yield.

@torch.autograd.reentrant_backwards
@staticmethod
def backward(ctx, grad_output):
    with lock:
        ...

    yield output.backward()

    with lock:
        ...

Anyway, I no longer advocate this approach, because I found a good workaround that looks straightforward. I'll describe it in a separate comment.

@sublee

Contributor Author

commented Apr 5, 2019

In my case, I found a simple trick to isolate parallel checkpoints: linearize the parallel lanes with an explicit dependency.

This is roughly what my real use case looks like. I have to handle two or more parallel checkpoints (circles are tensors, black squares are checkpoints):

An autograd graph with two parallel lanes joined by a cat. Each lane has a checkpoint.

I implemented a simple autograd function named First. It doesn't compute anything, but it creates a dependency between two tensors:

class First(torch.autograd.Function):
    @staticmethod
    def forward(ctx, first, second):
        # Only `first` flows through; `second` exists solely to create a
        # graph dependency between the two lanes.
        return first

    @staticmethod
    def backward(ctx, grad_first):
        # Pass the gradient straight through; no gradient for `second`.
        return grad_first, None

Using First, I made the input of the second checkpoint depend on the output of the first checkpoint:

The same graph, but with the two parallel lanes linked by a First.

Now the first lane cannot be scheduled before the second lane finishes, because its dependencies are not yet resolved. The cat can no longer schedule the lanes in parallel; every lane is linearized, so a nested backward can run without touching any unrelated tasks.
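
For concreteness, here is a usage sketch of the trick (layer1, layer2, x1, x2 are placeholders for my real modules and inputs; checkpoint is torch.utils.checkpoint.checkpoint):

import torch
from torch.utils.checkpoint import checkpoint

# layer1, layer2, x1, x2: placeholders for real modules and inputs.
out1 = checkpoint(layer1, x1)   # first checkpointed lane
x2 = First.apply(x2, out1)      # phony dependency: x2 now depends on out1
out2 = checkpoint(layer2, x2)   # second checkpointed lane

torch.cat((out1, out2), dim=0).sum().backward()
# In the backward pass, the first checkpoint's backward must wait for
# First's backward, which runs only after the second checkpoint's backward
# has finished, so the lanes are processed one after the other.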

In my case, this trick seems like the best solution, because it's straightforward and it sidesteps the race condition entirely. I guess that if we document reentrant backward for users, they will probably find the best solution for themselves.

@apaszke

Member

commented Apr 5, 2019

Wow, this is a long thread. Thanks a lot for the great investigation @sublee!

I tried to follow most of it, but from a cursory look it seems like we still don't have a good answer for this, and so I would rather avoid code complication if it doesn't resolve this particular issue.

Additionally, I really do think that we should maintain the "functions execute in an arbitrary (topologically valid) order" invariant, just for the sake of having some leeway in how we implement the autodiff engine. Note that separate connected components (which appear e.g. due to re-evaluation of checkpoints) have no data dependencies, so any interleaving is topologically valid. The order is not necessarily deterministic (because we want to use threading, possibly with work-stealing), and leaving it unspecified lets us avoid worrying about keeping a consistent order between the JIT autodiff and autograd, which would be a huge pain.

I think it would be ok to add a note to the documentation to mention that we do not guarantee anything. We can mention that if you really want to hack the system then you can insert phony data dependencies like you did, because that would have to be respected by any valid AD implementation.

@sublee

Contributor Author

commented Apr 5, 2019

@apaszke Thanks for coming here and following this long thread. Your agreement with my final approach gives me great confidence. As I said when I agreed with @albanD, I'm going to close this pull request. Thank you all for the rich discussion.
