
Consider adding y/x -> y * 1/x optimization for _foreach_div_.ScalarList and other div Scalar overloads #114165

@janeyx99

Description

_foreach_div_.ScalarList is not bitwise equal to a for loop of div_.Scalar due to the following optimization added in #29428: in a div op, we'd prefer Tensor * (1/Scalar) over Tensor / Scalar, because multiplication is cheaper to compute than division. The speedup is observable when a lot of math is done on the same memory (as in normalization); otherwise, when most of the time goes to reading and writing memory, performance is bandwidth bound and the difference between div and mul may not be measurable.
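
A minimal repro sketch of the discrepancy, assuming a CUDA device is available: the for loop of div_.Scalar takes the reciprocal-multiply fast path shown under "Code example" below, while _foreach_div_.ScalarList performs a true division, so the outputs need not match bitwise.

import torch

torch.manual_seed(0)
tensors = [torch.rand(1024, device="cuda") for _ in range(4)]
scalars = [3.0, 7.0, 11.0, 13.0]

# foreach path: true division per element
foreach_out = [t.clone() for t in tensors]
torch._foreach_div_(foreach_out, scalars)

# for-loop path: each div_.Scalar hits the CUDA a * (1/b) fast path
loop_out = [t.clone() for t in tensors]
for t, s in zip(loop_out, scalars):
    t.div_(s)

# bitwise equality is not guaranteed; some of these may print False
for f, l in zip(foreach_out, loop_out):
    print(torch.equal(f, l))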

To be clear, one should not rely on bitwise equality between foreach ops and their single-tensor equivalents, since foreach is an optimization that may take its own shortcuts. That said, the following may be worth considering, so I made a tracker:

  • Add this optimization for _foreach_div_.Scalar, as it should be trivial to reroute to _foreach_mul_.Scalar (a Python-level sketch follows this list).
  • Reroute _foreach_div_.Tensor to _foreach_mul_.Scalar directly when the scalar Tensor is on CPU (currently it routes to _foreach_mul_.Tensor).
  • Would this still be considered an optimization for _foreach_div_.ScalarList? Is there a fast way to invert scalars in CUDA so we could avoid a for loop of reciprocals on the CPU side?
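
A minimal Python-level sketch of the first and third bullets, assuming the reroute were expressed against the public foreach API (the real change would live in the C++ foreach kernels; foreach_div_scalar and foreach_div_scalarlist are hypothetical helpers, not existing APIs):

import torch

def foreach_div_scalar(tensors, scalar):
    # y / x -> y * (1/x): one host-side reciprocal, then a single foreach multiply
    torch._foreach_mul_(tensors, 1.0 / scalar)

def foreach_div_scalarlist(tensors, scalars):
    # the ScalarList case needs one reciprocal per tensor; this CPU-side for
    # loop is exactly the overhead the third bullet asks about avoiding
    torch._foreach_mul_(tensors, [1.0 / s for s in scalars])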

Code example

The code that causes the discrepancy between _foreach_div_ and a normal div (the CUDA true-division kernel):

if (iter.is_cpu_scalar(2)) {
  // optimization for floating-point types: if the second operand is a CPU
  // scalar, compute a * reciprocal(b). Note that this may lose one bit of
  // precision compared to computing the division.
  AT_DISPATCH_FLOATING_AND_COMPLEX_TYPES_AND2(
      kHalf, kBFloat16, common_dtype, "div_true_cuda", [&]() {
        using opmath_t = at::opmath_type<scalar_t>;
        auto inv_b = opmath_t(1.0) / iter.scalar_value<opmath_t>(2);
        iter.remove_operand(2);
        gpu_kernel(
            iter,
            BUnaryFunctor<scalar_t, scalar_t, scalar_t, MulFunctor<opmath_t>>(
                MulFunctor<opmath_t>(), inv_b));
      });
} else {
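
A quick sanity check of what this kernel implies, assuming a CUDA device: for a CPU scalar divisor, t / s should match t * (1/s) bitwise, since the kernel computes the reciprocal on the host and dispatches a multiply.

import torch

t = torch.rand(1 << 20, device="cuda")
s = 3.0
print(torch.equal(t / s, t * (1.0 / s)))  # expected True on this fast path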

Interesting context

This discrepancy is the only reason default Adam (which picks the foreach implementation on CUDA) and Adam(foreach=False) differ; otherwise the two optimizers would be bitwise equivalent. cc @vincentqb @jbschlosser @albanD @crcrpar @mcarilli @awgu
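
A hedged sketch of that claim, assuming a CUDA device and identical gradients fed to both optimizers; any bitwise divergence between the two runs should come only through the division fast path.

import torch

torch.manual_seed(0)
params_fe = [torch.nn.Parameter(torch.rand(8, device="cuda")) for _ in range(2)]
params_sl = [torch.nn.Parameter(p.detach().clone()) for p in params_fe]

opt_fe = torch.optim.Adam(params_fe, foreach=True)
opt_sl = torch.optim.Adam(params_sl, foreach=False)

for _ in range(10):
    for p_fe, p_sl in zip(params_fe, params_sl):
        g = torch.rand_like(p_fe)
        p_fe.grad = g
        p_sl.grad = g.clone()
    opt_fe.step()
    opt_sl.step()

# may print False: the foreach path divides where the single-tensor path
# multiplies by a reciprocal
for p_fe, p_sl in zip(params_fe, params_sl):
    print(torch.equal(p_fe.detach(), p_sl.detach()))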

cc @crcrpar @ngimel

Labels

  • actionable
  • module: mta — issues related to multi-tensor apply kernels and foreach functions
  • module: optimizer — related to torch.optim
  • triaged — this issue has been looked at by a team member and prioritized into an appropriate module
