
Add eager mode gradients for ops. #108

Closed · 48 of 78 tasks
nsthorat opened this issue Apr 5, 2018 · 16 comments

@nsthorat
Contributor

nsthorat commented Apr 5, 2018

From @nsthorat on January 18, 2018 15:56

The infrastructure for eager mode is now ready for gradient methods to be filled in!

Eager mode provides a new set of methods on NDArrayMath which allow the user to eagerly compute gradients. Most users will use an optimizer like this:

const weights = dl.variable(Array2D.randNormal([784, 10]));
const cost = optimizer.minimize(() => {
  const batch = data.nextTrainBatch(BATCH_SIZE);
  const ys = math.matMul(batch.xs, weights);
  const loss = math.mean(math.softmaxCrossEntropyWithLogits(batch.labels, ys));
  return loss;
});

You'll notice that there is no use of the Graph; we simply call ops on NDArrayMath directly inside an optimizer.minimize() call.

You can find a full example of training MNIST in eager mode here: https://github.com/PAIR-code/deeplearnjs/blob/master/demos/mnist_eager/model.ts

As part of NDArrayMath we expose several new methods. The important ones are these:

  • math.gradients(f: () => cost, xs), which executes f() (which must produce a scalar value) and returns the gradient of the output of f with respect to xs (which can be an NDArray or a string => NDArray map).
  • math.valueAndGradients(f: () => cost, xs), which is the same as math.gradients() but also returns the output of f().
  • math.vjp(f: () => y, x, dy), which computes a vector-Jacobian product. It is similar to gradients, but allows f() to produce a non-scalar value and lets the user provide a dy. This is useful for computing a subset of backpropagation, or for testing the gradient of a single op with a provided dy (this is how we unit test).
  • math.customGradient(f: () => {value, gradients}, xs), which allows the user to provide a custom gradient for an arbitrary function closure instead of using the default gradients of the ops inside it. We use this for numerical stability in ops like softmaxCrossEntropy, and for mean / sum so we can compute a faster gradient (instead of combining the gradients of the kernels they use). Most of the time, you shouldn't need to use this.
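
For orientation, here is a minimal usage sketch of the first three methods. It assumes math is the same NDArrayMath instance used above and that Scalar is in scope, as in the deeplearn.js API of the time; the destructured return shape of valueAndGradients is an assumption, and exact signatures may differ between releases.

const x = Scalar.new(3);

// gradients: f() must return a scalar; the result is df/dx (here 2 * 3 = 6).
const dfdx = math.gradients(() => math.square(x), x);

// valueAndGradients: same, but also returns the value of f() (here 9).
// (The {value, gradients} shape shown here is an assumption.)
const {value, gradients} = math.valueAndGradients(() => math.square(x), x);

// vjp: f() may return a non-scalar and the caller supplies dy explicitly;
// this is how single-op gradients are unit tested.
const dx = math.vjp(() => math.square(x), x, Scalar.new(1));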

Now that these methods exist and are relatively stable, we can flesh out gradients for kernels and ops!

To add gradients for kernels, we simply need to add a derivative function to the executeKernel calls inside NDArrayMath. An example (from matMul):

// Inside NDArrayMath.matMul(a, b, aOrientation, bOrientation):
const der = (dy: Array2D<'float32'>, y: Array2D) => {
  return {
    // For the regular/regular case: dA = dy x B^T and dB = A^T x dy.
    a: () => this.matMul(dy, b, MatrixOrientation.REGULAR, MatrixOrientation.TRANSPOSED),
    b: () => this.matMul(a, dy, MatrixOrientation.TRANSPOSED, MatrixOrientation.REGULAR)
  };
};
return this.backendEngine.executeKernel(
    'MatMul', {inputs: {a, b}, args: {aOrientation, bOrientation}}, der);

The derivative is a function that takes dy and y and returns an object whose keys are the inputs (as defined by the inputs argument to executeKernel) and whose values are functions that return the derivative with respect to that input. These derivative functions should not call executeKernel; instead, they should call math ops directly (this is so we can compute second-order gradients).
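
As a simpler, hypothetical illustration of the same pattern (not copied from the library source; the kernel name and surrounding details are illustrative), a unary op such as exp could register its derivative like this:

// Inside a hypothetical NDArrayMath.exp(x):
const der = (dy: NDArray, y: NDArray) => {
  return {
    // y = exp(x), so dy/dx = exp(x) = y, and therefore dx = dy * y.
    // Note: this uses the math op this.multiply rather than executeKernel,
    // so that second-order gradients remain possible.
    x: () => this.multiply(dy, y)
  };
};
return this.backendEngine.executeKernel('Exp', {inputs: {x}}, der);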

Two example PRs adding gradients:
tensorflow/tfjs-core#521
tensorflow/tfjs-core#544

Note that we already have many gradients in the Graph layer; we just need to port them over to the gradient functions defined in eager mode.

Here is the list of ops and whether the gradient has been implemented:

  • abs
  • acos
  • add
  • argmax
  • argmin (not important)
  • asin
  • atan
  • avgPool
  • batchNormalization
  • cast
  • ceil
  • clip
  • clone
  • concat
  • conv1D
  • conv2D
  • conv2DDerBias (would be a second order der)
  • conv2DDerFilter (would be a second order der)
  • conv2DDerInput (would be a second order der)
  • conv2DTranspose
  • cos
  • cosh
  • depthwiseConv2D
  • divide
  • elu
  • eluDer
  • equal
  • exp
  • floor
  • greater
  • greaterEqual
  • leakyRelu
  • less
  • lessEqual
  • localResponseNormalization
  • log
  • logicalOr
  • logicalAnd
  • matMul (needs derivatives when using transposed bit)
  • max
  • maximum
  • maxPool
  • maxPoolBackprop (would be second order der)
  • min
  • minimum
  • minPool
  • multinomial
  • multiply
  • neg
  • notEqual
  • oneHot
  • pad
  • pow (half implemented, needs broadcast + derB)
  • prelu
  • preluDer (would be second order der)
  • relu
  • reshape
  • resizeBilinear3D
  • reverse
  • selu
  • sigmoid
  • sin
  • sinh
  • slice
  • softmax
  • softmaxCrossEntropyWithLogits
  • sqrt
  • square
  • step
  • sub
  • sum
  • tan
  • where
  • tanh
  • tile
  • topK
  • transpose

Copied from original issue: tensorflow/tfjs-core#561

@nsthorat
Contributor Author

nsthorat commented Apr 5, 2018

From @dsmilkov on January 18, 2018 16:15

For those who are interested:

  • Leave a comment here to claim one or several ops.
  • Make a PR that adds gradients for those ops.
  • After we merge your PR, we'll check the checkbox.

I'd suggest starting with the unary ops.

@nsthorat
Contributor Author

nsthorat commented Apr 5, 2018

From @israelvicars on January 19, 2018 17:20

I'll start abs. Thanks for tweeting the invitation to contribute.

@nsthorat
Contributor Author

nsthorat commented Apr 5, 2018

From @manrajgrover on January 28, 2018 10:27

@nsthorat @dsmilkov A little guidance required here. What should the gradient of comparison and logical operations look like? Any good sources for the same?

@nsthorat
Contributor Author

nsthorat commented Apr 5, 2018

From @gena on January 28, 2018 19:36

Added implementation for sigmoid, PR #603.

@nsthorat
Contributor Author

nsthorat commented Apr 5, 2018

From @manrajgrover on February 9, 2018 13:57

@nsthorat We can tick off ceil, clip, cosh, floor, maximum, minimum, selu, sigmoid, sinh, softmax and tanh.

prelu, elu are partially implemented. Only gradients w.r.t. alpha need to be added.

leakyRelu and step gradients are in progress.

Since logical and comparison operations don't have an actual gradient, should we return NaN, return zeros, or not pass a gradient function to executeKernel?

Also, it looks like unary and binary operations are now almost covered. I'd need pointers on the next set to focus on.

@nsthorat
Contributor Author

nsthorat commented Apr 5, 2018

From @dsmilkov on February 9, 2018 22:51

@manrajgrover Great q. For now, let's leave out logical and comparison ops and not pass a gradient function to executeKernel. Next would be reverse, slice, pad, concat (in that order). Thanks!

@nsthorat
Contributor Author

nsthorat commented Apr 5, 2018

From @dsmilkov on February 19, 2018 3:30

I'll be taking reverse, slice, pad and concat (already started work). Thanks!

@nsthorat
Contributor Author

nsthorat commented Apr 5, 2018

From @easadler on April 5, 2018 1:14

Seems like this may be a little out of date, but I'd like to help if you'd still like it. I think oneHot is still available.

@nsthorat
Contributor Author

nsthorat commented Apr 5, 2018

A high-priority one would be batchNorm (for which you'd return gradients for all of the parameters; this should be pretty straightforward) or resizeBilinear, if you're willing to take it on! resizeBilinear has a bug filed here: #38

@nsthorat
Contributor Author

nsthorat commented Apr 5, 2018

From @easadler on April 5, 2018 14:53

I will give batchNorm a shot!

@tafsiri
Contributor

tafsiri commented Apr 13, 2018

I will work on resizeBilinear

@jgartman
Contributor

I can finish up pow.

@jgartman
Contributor

I'll finish matMul as well.

@jgartman
Contributor

I'll try localResponseNormalization

dsmilkov pushed a commit to tensorflow/tfjs-core that referenced this issue Jul 14, 2018
This PR adds the gradient for LRN.  tensorflow/tfjs#108

FEATURE
@nsthorat
Contributor Author

Closing this out in favor of individual issues.

@generic-github-user

Is there an updated list of which functions have gradients implemented and which do not?

nsthorat pushed a commit that referenced this issue Aug 19, 2019
* quick test.

* bump travis

* Revert change.