Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Documentation of bucketize() is wrong, and is contradicting itself. #91580

Closed
raaaaaymond opened this issue Jan 2, 2023 · 5 comments
Closed
Labels
module: docs Related to our documentation, both in docs/ and docblocks module: sorting and selection triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module

Comments

@raaaaaymond
Copy link

raaaaaymond commented Jan 2, 2023

📚 The doc issue

The documentation of the bucketize function, specifically the right argument, is wrong. I mean, is it wrong or am I going mad? It says if "right is False (default), then the left boundary is closed". While this makes sense, it's in direct contradiction with the reference table directly below that sentence, and the actual behaviour of the function (which agrees with the reference table but contradicts the aforementioned sentence).

As a further complaint, while that sentence "if right is False (default), then the left boundary is closed" agrees with Numpy's equivalent digitize function, the actual behaviour is the opposite of Numpy's digitize() (with respect to the right argument). Why??? Why not just make them behave the same way??????

Suggest a potential alternative/fix

Ideally I'd like the behaviour bucketize() to change (slightly), so that it agrees with the (currently false) sentence "if right is False (default), then the left boundary is closed" in the documentation, and as such agrees with the behaviour of Numpy's digitize(), and not with the reference table also in the documentation, but I know that's probably not going to happen because, you know, it's behaviour-changing. But let's at least change that sentence in the documentation so that it doesn't contradict itself. That sentence should read "if right is False (default), then the right boundary is closed".

By the way I apologise if it turns out I'm the one in the wrong. I've stared at that sentence for 2 hours now and it still seems wrong to me.

cc @svekars @carljparker

@samdow samdow added triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module module: sorting and selection topic: docs topic category labels Jan 3, 2023
@samdow
Copy link
Contributor

samdow commented Jan 3, 2023

It says if "right is False (default), then the left boundary is closed". While this makes sense, it's in direct contradiction with the reference table directly below that sentence, and the actual behaviour of the function (which agrees with the reference table but contradicts the aforementioned sentence).

Sorry I'm seeing that the reference table does have a closed left boundary (i.e. boundaries[i-1] < input[m][n]...[l][x]) and that this matches the examples where this gives us the lower bound of i when there's a value lies on a bin value. If it helps, I think the kwarg description for right explains it pretty well: In other words, if False, gets the lower bound index for each value in input from boundaries. If True, gets the upper bound index instead.

@samdow samdow added module: docs Related to our documentation, both in docs/ and docblocks and removed topic: docs topic category labels Jan 3, 2023
@raaaaaymond
Copy link
Author

To quote, boundaries[i-1] < input[m][n]...[l][x] is by definition not a closed left boundary. Mathematically, a closed left boundary means non-strict inequality <=, not strict inequality <.

I'm going to use simpler notation below, and just use x to mean input[m][n]...[l][x], I hope that simplifies things without making things confusing. Also, I'll be using mathematical notations such as [x, y] for a fully closed interval (between real numbers x and y where let's assume x < y), (x, y) for a fully open interval, [x, y) for a half-open interval that is closed on the left, and (x, y] for a half-open interval that is closed on the right. These are correct mathematical definitions and these are standard mathematical notation.

So as I've said before, the reference table does agree with the actual examples and behaviour. With right = False (default), the reference table (and the actual behaviour) is such that each input x is allocated to a bucket based on the condition boundary[i-1] < x <= boundary[i]. In mathematical notation, this is the half-open interval ( boundary[i-1], boundary[i] ], which is closed on the right, not left. In other words, with right = False, the buckets are actually the half-open intervals ( boundary[0], boundary[1] ], ( boundary[1], boundary[2] ], ( ... ]. These are half-open intervals that are closed on the right boundary, not the left boundary, as claimed in the documentation in the sentence "if right is False (default), then the left boundary is closed". This is the sentence that I object to.

Finally, regarding the documentation in the "Keyword Arguments" section, some of which you have quoted, I personally find it very confusing anyway, and I don't disagree with it (at least for now). I kind of wish it was omitted because I find it even more confusing that if it weren't there but that's my subjective view and let's leave that for another conversation. ( Don't bother reading the following but I mean, it just says "gets the lower bound index [...]", without a pronoun. And I'm like, what "gets" the lower bound index? Do I get it? Does the function get it? What? Anyway after reading it, it looks correct, if confusing. So let's not get into that here.)

@raaaaaymond
Copy link
Author

Also, to my other point, that Pytorch's right = False (default) has the opposite behaviour to Numpy's right = False (default) in their digitize() function, try the following example (which is very similar to your example in the documentation):

boundaries = torch.tensor([1, 3, 5, 7, 9])
v = torch.tensor([-100, 3, 6, 9, 999])

Then torch.bucketize(v, boundaries) will output tensor([0, 1, 3, 4, 5]), while np.digitize(v.numpy(), boundaries.numpy()) will output array([0, 2, 3, 5, 5], dtype=int64)

@raaaaaymond
Copy link
Author

My friend just pointed out that my tone isn't good, after he read it LOL. I'm sorry if I come across rude, I don't mean to. I really appreciate your time and attention in addressing this @samdow . Thank you. And thank you everyone else who is looking into this.

@clane9
Copy link

clane9 commented May 23, 2023

+1 I also thought this was confusing

pytorchmergebot pushed a commit that referenced this issue Jul 3, 2023
**TL;DR**: This PR is a first step in adding lowerings for torch.bucketize. It adds an initial lowering for this op - but because this  implementation is not currently efficient, it registers the lowering for prims._inductor_bucketize. After we make the implementation more efficient, we'll remove prims._inductor_bucketize and add the lowering directly to torch.bucketize.

**Background - torch.bucketize**: torch.bucketize(values, boundaries, right=False): for an arbitrary tensor of values and a non-decreasing 1D tensor of boundaries that define buckets, it returns the index of the bucket that each of the values will fall in. e.g. for values [0, 1, 2, 3, 4] and boundaries [1, 3], it will return [0, 0, 1, 1, 2].

**Implementation**: This PR adds a new inductor op called "bucketize". In this PR it only has a triton implementation - for CPU it is a fallback. The triton implementation uses a binary search in `triton_helpers.py`. This PR also adds a new prim `_inductor_bucketize()` for testing purposes and adds lowering for this op.

~~**"right"**: The current behavior of the "right" kwarg in the inductor op is the opposite of the behavior of the torch op. "right" controls how the op treats a value that is equal to one of the boundary values. In the torch op, "right=True" means "if a value is equal to a boundary value, then put it in the bucket to the right". In the inductor op, "right=True" means "the right boundary of a bucket is closed". These are opposite. **I'm open to switching the behavior of the inductor op** - but I chose to implement this way because I think it makes more sense, and I think the torch.bucketize behavior may have been a mistake (it's the opposite of numpy.digitize).~~ Switched the behavior of the inductor bucketize op to match the torch op

* places where "right" means "if a value is equal to a boundary value, then put it in the bucket to the right" (i.e. current torch.bucketize behavior)
  + current torch.bucketize behavior
  + table in [torch.bucketize docs](https://pytorch.org/docs/stable/generated/torch.bucketize.html)
* places where "right" means "the right boundary of a bucket is closed":
  + the text description of [torch.bucketize docs](https://pytorch.org/docs/stable/generated/torch.bucketize.html) (observed in #91580)
  + [numpy.digitize](https://numpy.org/doc/stable/reference/generated/numpy.digitize.html) (which is basically the same op)

**Performance**: Benchmark script: "values" as a [16, 1024, 1024] float32 tensor and "boundaries" as a [1025] tensor (i.e. defining 1024 buckets).

As is:
```
Eager 0.30117499828338623 ms
PT2   0.9298200011253357 ms
```

But performance improves significantly if we add an additional pointwise autotuning config (WIP in #104456):
```
Eager 0.3015420138835907 ms
PT2   0.23028500378131866 ms
```

Pull Request resolved: #104007
Approved by: https://github.com/jansel
pytorchmergebot pushed a commit that referenced this issue Aug 16, 2023
The docs correctly (i.e matching actual op behavior) state that

`right = False` means `boundaries[i-1] < input[m][n]...[l][x] <= boundaries[i]`.

However they previously stated that
`If 'right' is False (default), then the left boundary is closed.`

which contradicts the `boundaries[i-1] < input[m][n]...[l][x] <= boundaries[i]` statement.

This modifies the docs to say `... then the left boundary is OPEN.` and also clarifies that this is the opposite behavior of numpy.digitize.

Fixes #91580

[ghstack-poisoned]
pytorchmergebot pushed a commit that referenced this issue Aug 16, 2023
The docs correctly (i.e matching actual op behavior) state that

`right = False` means `boundaries[i-1] < input[m][n]...[l][x] <= boundaries[i]`.

However they previously stated that
`If 'right' is False (default), then the left boundary is closed.`

which contradicts the `boundaries[i-1] < input[m][n]...[l][x] <= boundaries[i]` statement.

This modifies the docs to say `... then the left boundary is OPEN.` and also clarifies that this is the opposite behavior of numpy.digitize.

Fixes #91580

[ghstack-poisoned]
davidberard98 added a commit that referenced this issue Aug 16, 2023
…rch.bucketize docs for "right""


The docs correctly (i.e matching actual op behavior) state that

`right = False` means `boundaries[i-1] < input[m][n]...[l][x] <= boundaries[i]`.

However they previously stated that
`If 'right' is False (default), then the left boundary is closed.`

which contradicts the `boundaries[i-1] < input[m][n]...[l][x] <= boundaries[i]` statement.

This modifies the docs to say `... then the left boundary is OPEN.` and also clarifies that this is the opposite behavior of numpy.digitize.

Fixes #91580

[ghstack-poisoned]
davidberard98 added a commit that referenced this issue Aug 16, 2023
…cs for "right""


The docs correctly (i.e matching actual op behavior) state that

`right = False` means `boundaries[i-1] < input[m][n]...[l][x] <= boundaries[i]`.

However they previously stated that
`If 'right' is False (default), then the left boundary is closed.`

which contradicts the `boundaries[i-1] < input[m][n]...[l][x] <= boundaries[i]` statement.

This modifies the docs to say `... then the left boundary is OPEN.` and also clarifies that this is the opposite behavior of numpy.digitize.

Fixes #91580

[ghstack-poisoned]
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
module: docs Related to our documentation, both in docs/ and docblocks module: sorting and selection triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module
Projects
None yet
Development

No branches or pull requests

3 participants