
Fast gradcheck background and plan #53876

Description

@soulitzer

This is the background and current plan for integrating the new, faster gradcheck into CI. According to @albanD, doing this has been on the roadmap for quite a while.

Why will the new gradcheck be faster? (and how much faster?)
TL;DR: for each (input, output) pair, instead of computing the entire m x n Jacobian (where m is the dimension of the output and n is the dimension of the input), which requires calling .grad at least m times and f 2*n times, we compute a random linear combination of its entries, which requires only a single call to .grad and two calls to f. After some initial testing, the time to complete test/test_autograd.py on CPU has been reduced roughly 20x, from ~3 minutes to ~7 seconds. Read on for a longer explanation.

Recall that the goal of gradcheck is to verify that the backward implementation of a forward op f is correct. To do this, we compute the Jacobian of f in two ways: once using the backward implementation (analytically), and once without relying on it (numerically). If the analytical Jacobian J matches the numerical Jacobian J', we conclude that the backward was written correctly.

  1. For the "analytical" way, we know that when we do reverse-mode AD, the grad_input we compute is really v^T J, where v is the grad_out passed to .grad. If we let v be a one-hot vector whose ith entry is 1, v^T J gives the ith row of J. Repeating this process row by row reconstructs the entire J.
  2. Numerically, we use finite differencing. Because lim h->0 (f(x + h u) - f(x - h u)) / (2h) is by definition J u, by the same logic as (1), if we let u be a one-hot vector we can again obtain J', but column by column. (A minimal sketch of both reconstructions follows this list.)
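
To make the costs above concrete, here is a minimal sketch of both reconstructions for a single tensor input and a single tensor output. The helper name full_jacobians and the eps value are illustrative assumptions; this is not the actual torch.autograd.gradcheck implementation.

```python
import torch

def full_jacobians(f, x, eps=1e-6):
    # Sketch only: f takes a single tensor and returns a single tensor.
    x = x.detach().clone().requires_grad_(True)
    out = f(x)
    m, n = out.numel(), x.numel()

    # Analytical: one .grad call per output element; a one-hot grad_out v picks out row i of J.
    J = torch.zeros(m, n, dtype=x.dtype)
    for i in range(m):
        v = torch.zeros(m, dtype=out.dtype)
        v[i] = 1.0
        (row,) = torch.autograd.grad(out, x, grad_outputs=v.view_as(out), retain_graph=True)
        J[i] = row.reshape(-1)

    # Numerical: two calls to f per input element; central differences give column j of J'.
    J_prime = torch.zeros(m, n, dtype=x.dtype)
    flat = x.detach().reshape(-1)
    for j in range(n):
        plus, minus = flat.clone(), flat.clone()
        plus[j] += eps
        minus[j] -= eps
        col = (f(plus.view_as(x)) - f(minus.view_as(x))) / (2 * eps)
        J_prime[:, j] = col.detach().reshape(-1)

    return J, J_prime
```

For example, full_jacobians(torch.sin, torch.randn(3, dtype=torch.double)) returns two 3 x 3 matrices that should match; the point is the loop structure, i.e. m backward calls and 2*n forward calls.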

The above is how the old gradcheck works. The new gradcheck is based on the observation that if a random linear combination of the entries of J equals the same linear combination of the entries of some J', there is a good chance that J = J', and that computing v^T J u is one way to obtain such a linear combination of the entries of J (or J'). The next step is to find a way to compute v^T J u both analytically and numerically. For the analytical method, recall from the old gradcheck that if the grad_out v is a one-hot vector, computing v^T J yields a single row of J. If we instead let v be a random vector, v^T J is a random linear combination of the rows of J, and all that is left to obtain v^T J u is to multiply by another random vector u. For the numerical method, we do the same but in the other order: J u (obtained by finite differencing along u) is a linear combination of J's columns, and multiplying by v^T gives v^T J u as desired.
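
Concretely, a single iteration of this fast check for one tensor input and one tensor output could look like the sketch below. The helper name fast_gradcheck_sketch, eps, and atol are illustrative assumptions, not the real implementation (which also handles multiple inputs and outputs and many edge cases).

```python
import torch

def fast_gradcheck_sketch(f, x, eps=1e-6, atol=1e-5):
    # Sketch only: f takes a single tensor and returns a single tensor.
    x = x.detach().clone().requires_grad_(True)
    out = f(x)

    v = torch.randn_like(out)  # random vector on the output side, used as grad_out
    u = torch.randn_like(x)    # random direction on the input side

    # Analytical side: one backward call with grad_out = v gives v^T J; dot with u gives v^T J u.
    (vjp,) = torch.autograd.grad(out, x, grad_outputs=v)
    analytical = vjp.reshape(-1).dot(u.reshape(-1))

    # Numerical side: two forward calls approximate J u by central differences; dot with v gives v^T J' u.
    with torch.no_grad():
        ju = (f(x + eps * u) - f(x - eps * u)) / (2 * eps)
    numerical = v.reshape(-1).dot(ju.reshape(-1))

    return bool(torch.isclose(analytical, numerical, atol=atol))
```

For a well-behaved op in double precision, e.g. fast_gradcheck_sketch(torch.sin, torch.randn(5, dtype=torch.double)), this should return True; the key point is that it costs one .grad call and two calls to f, independent of m and n.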

edit: thanks to @ezyang's keen observation, we know that we can't actually assume that the backward implementation is linear here, so we don't actually know that calling .grad with a random grad_out v really returns v^T J, i.e., a linear combination of J's rows. Practically, we can't prove that either, because it would require m calls to .grad. What we are really showing is that b(v) dot u equals (v^T J') dot u, where b is the backward implementation applied to the grad_out v and J' is the numerical Jacobian; but if you read the section below, the argument should still basically hold.

What are the chances that this fails? (roughly, at least)
We already know that if the numerical Jacobian is equal to the analytical Jacobian, then clearly any linear combination of their entries is also equal, but the converse is not necessarily true. There is indeed a chance that J != J' but v^T J u == v^T J' u. However, we can argue that this probability is small. If f maps R^n -> R^m, its Jacobian J also maps R^n -> R^m (as a linear map). If we fix u, v, and J, and let c = v^T J u (a scalar), notice that the set of x's in R^m that satisfy v^T x = c defines a hyperplane in R^m. So geometrically, if some J' != J also satisfies v^T J' u = v^T J u, it must map u onto that same hyperplane in R^m determined by u, v, and J. Finally, hyperplanes have measure zero, so theoretically the probability of this happening for a random v and u is zero. (Buuut practically, we tolerate some difference between the linear combinations, so our hyperplane actually does have some "thickness", and the true probability of a false negative is roughly proportional to this tolerance.)
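
As a rough sanity check on this argument (not part of gradcheck itself), the Monte Carlo sketch below plants an error in one entry of a random Jacobian and counts how often a random projection v^T J u fails to distinguish it from the correct one. The dimensions, perturbation, and tolerance are made up for illustration; the observed miss rate is tiny and roughly tracks the tolerance.

```python
import torch

torch.manual_seed(0)
m, n, trials = 4, 6, 100_000

J = torch.randn(m, n, dtype=torch.double)
J_wrong = J.clone()
J_wrong[2, 3] += 0.5  # plant an error in a single entry

V = torch.randn(trials, m, dtype=torch.double)  # random output-side vectors v
U = torch.randn(trials, n, dtype=torch.double)  # random input-side vectors u

proj = torch.einsum('bi,ij,bj->b', V, J, U)              # v^T J u for each trial
proj_wrong = torch.einsum('bi,ij,bj->b', V, J_wrong, U)  # v^T J' u for each trial

# Count trials where the projection fails to detect the planted error within tolerance.
misses = torch.isclose(proj, proj_wrong, atol=1e-5).sum().item()
print(f"projection missed the error in {misses} of {trials} trials")
```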

Plan on how to integrate (this will roughly correspond to the stack of PRs)

  1. Refactor old gradcheck, simply moving things around without changing behavior (Refactor gradcheck #53857)
  2. Refactor old gradcheck in ways that change behavior, to fix the bugs discovered in (1)
  3. Add the new gradcheck and route old gradcheck calls to it. When the new gradcheck fails, compute the entire Jacobian using the old gradcheck to provide better diagnostics. Individual tests may fail for random reasons; if it's not obvious how to deal with this, temporarily have those test cases still run the old gradcheck.
    Keep the old gradcheck running in nightly builds so as to discover any issues that the new gradcheck missed.
  4. Gradually re-enable the failing tests from (3) as we discover why they fail. If we discover failures of the old gradcheck in nightly builds that pass the fast gradcheck, fix these missed cases.

cc @ezyang @albanD @zou3519 @gqchen @pearu @nikitaved @soulitzer

Labels

module: autograd (Related to torch.autograd, and the autograd engine in general), module: testing (Issues related to the torch.testing module, not tests), triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)
