[CUDA] Remove footgun related to non-blocking copies

Edit: I filed this issue because I believed the following code had a use-after-free: the pinned tensor `A` goes out of scope while the non-blocking copy may still be in flight.
```
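import torch

SIZE = 1 << 20  # example size; any positive element count works

def foo():
    # A lives in pinned (page-locked) host memory and is local to foo().
    A = torch.rand(SIZE, device="cpu", pin_memory=True)
    # Asynchronous host-to-device copy; it may still be in flight on return.
    B = A.cuda(non_blocking=True)
    return B  # A goes out of scope here

C = foo()
# do stuff with C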
```
It turns out that PyTorch keeps the pinned memory alive until the memcpy event is complete, so the copy never reads freed memory. There is no bug. Pretty cool!
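
For anyone who wants to sanity-check this locally, here is a minimal sketch. The helper names (`copy_via_temporary_pinned_buffer`, `check_copy`) and the size are my own, not part of PyTorch. It keeps a separate host copy of the data, lets the pinned tensor go out of scope, and compares the device result against the reference once the copy has finished. A passing check is only consistent with the behavior described above; it does not prove the absence of a race.

```python
import torch

SIZE = 1 << 20  # hypothetical element count, chosen only for illustration


def copy_via_temporary_pinned_buffer(size):
    """Copy a short-lived pinned tensor to the GPU without blocking.

    Returns the device tensor and an independent host copy for comparison.
    """
    src = torch.rand(size, device="cpu", pin_memory=True)
    reference = src.clone()  # independent host copy kept for comparison
    dst = src.cuda(non_blocking=True)  # asynchronous host-to-device copy
    return dst, reference  # src goes out of scope when this returns


def check_copy(size=SIZE):
    """Return True if the device data matches, or None if CUDA is unavailable."""
    if not torch.cuda.is_available():
        return None  # pinned memory and .cuda() need a CUDA device
    dst, reference = copy_via_temporary_pinned_buffer(size)
    torch.cuda.synchronize()  # wait for the outstanding copy to finish
    return torch.equal(dst.cpu(), reference)


if __name__ == "__main__":
    print(check_copy())
```

On a machine without a CUDA device, `check_copy()` returns `None` instead of raising.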