Documentation of `bucketize()` is wrong, and is contradicting itself. #91580

raaaaaymond · 2023-01-02T12:42:25Z

📚 The doc issue

The documentation of the bucketize function, specifically the right argument, is wrong. I mean, is it wrong or am I going mad? It says if "right is False (default), then the left boundary is closed". While this makes sense, it's in direct contradiction with the reference table directly below that sentence, and the actual behaviour of the function (which agrees with the reference table but contradicts the aforementioned sentence).

As a further complaint, while that sentence "if right is False (default), then the left boundary is closed" agrees with Numpy's equivalent digitize function, the actual behaviour is the opposite of Numpy's digitize() (with respect to the right argument). Why??? Why not just make them behave the same way??????

Suggest a potential alternative/fix

Ideally I'd like the behaviour bucketize() to change (slightly), so that it agrees with the (currently false) sentence "if right is False (default), then the left boundary is closed" in the documentation, and as such agrees with the behaviour of Numpy's digitize(), and not with the reference table also in the documentation, but I know that's probably not going to happen because, you know, it's behaviour-changing. But let's at least change that sentence in the documentation so that it doesn't contradict itself. That sentence should read "if right is False (default), then the right boundary is closed".

By the way I apologise if it turns out I'm the one in the wrong. I've stared at that sentence for 2 hours now and it still seems wrong to me.

cc @svekars @carljparker

The text was updated successfully, but these errors were encountered:

samdow · 2023-01-03T02:33:05Z

It says if "right is False (default), then the left boundary is closed". While this makes sense, it's in direct contradiction with the reference table directly below that sentence, and the actual behaviour of the function (which agrees with the reference table but contradicts the aforementioned sentence).

Sorry I'm seeing that the reference table does have a closed left boundary (i.e. boundaries[i-1] < input[m][n]...[l][x]) and that this matches the examples where this gives us the lower bound of i when there's a value lies on a bin value. If it helps, I think the kwarg description for right explains it pretty well: In other words, if False, gets the lower bound index for each value in input from boundaries. If True, gets the upper bound index instead.

raaaaaymond · 2023-01-03T12:45:06Z

To quote, boundaries[i-1] < input[m][n]...[l][x] is by definition not a closed left boundary. Mathematically, a closed left boundary means non-strict inequality <=, not strict inequality <.

I'm going to use simpler notation below, and just use x to mean input[m][n]...[l][x], I hope that simplifies things without making things confusing. Also, I'll be using mathematical notations such as [x, y] for a fully closed interval (between real numbers x and y where let's assume x < y), (x, y) for a fully open interval, [x, y) for a half-open interval that is closed on the left, and (x, y] for a half-open interval that is closed on the right. These are correct mathematical definitions and these are standard mathematical notation.

So as I've said before, the reference table does agree with the actual examples and behaviour. With right = False (default), the reference table (and the actual behaviour) is such that each input x is allocated to a bucket based on the condition boundary[i-1] < x <= boundary[i]. In mathematical notation, this is the half-open interval ( boundary[i-1], boundary[i] ], which is closed on the right, not left. In other words, with right = False, the buckets are actually the half-open intervals ( boundary[0], boundary[1] ], ( boundary[1], boundary[2] ], ( ... ]. These are half-open intervals that are closed on the right boundary, not the left boundary, as claimed in the documentation in the sentence "if right is False (default), then the left boundary is closed". This is the sentence that I object to.

Finally, regarding the documentation in the "Keyword Arguments" section, some of which you have quoted, I personally find it very confusing anyway, and I don't disagree with it (at least for now). I kind of wish it was omitted because I find it even more confusing that if it weren't there but that's my subjective view and let's leave that for another conversation. ( Don't bother reading the following but I mean, it just says "gets the lower bound index [...]", without a pronoun. And I'm like, what "gets" the lower bound index? Do I get it? Does the function get it? What? Anyway after reading it, it looks correct, if confusing. So let's not get into that here.)

raaaaaymond · 2023-01-03T12:52:15Z

Also, to my other point, that Pytorch's right = False (default) has the opposite behaviour to Numpy's right = False (default) in their digitize() function, try the following example (which is very similar to your example in the documentation):

boundaries = torch.tensor([1, 3, 5, 7, 9])
v = torch.tensor([-100, 3, 6, 9, 999])

Then torch.bucketize(v, boundaries) will output tensor([0, 1, 3, 4, 5]), while np.digitize(v.numpy(), boundaries.numpy()) will output array([0, 2, 3, 5, 5], dtype=int64)

raaaaaymond · 2023-01-03T12:56:30Z

My friend just pointed out that my tone isn't good, after he read it LOL. I'm sorry if I come across rude, I don't mean to. I really appreciate your time and attention in addressing this @samdow . Thank you. And thank you everyone else who is looking into this.

clane9 · 2023-05-23T19:06:56Z

+1 I also thought this was confusing

**TL;DR**: This PR is a first step in adding lowerings for torch.bucketize. It adds an initial lowering for this op - but because this implementation is not currently efficient, it registers the lowering for prims._inductor_bucketize. After we make the implementation more efficient, we'll remove prims._inductor_bucketize and add the lowering directly to torch.bucketize. **Background - torch.bucketize**: torch.bucketize(values, boundaries, right=False): for an arbitrary tensor of values and a non-decreasing 1D tensor of boundaries that define buckets, it returns the index of the bucket that each of the values will fall in. e.g. for values [0, 1, 2, 3, 4] and boundaries [1, 3], it will return [0, 0, 1, 1, 2]. **Implementation**: This PR adds a new inductor op called "bucketize". In this PR it only has a triton implementation - for CPU it is a fallback. The triton implementation uses a binary search in `triton_helpers.py`. This PR also adds a new prim `_inductor_bucketize()` for testing purposes and adds lowering for this op. ~~**"right"**: The current behavior of the "right" kwarg in the inductor op is the opposite of the behavior of the torch op. "right" controls how the op treats a value that is equal to one of the boundary values. In the torch op, "right=True" means "if a value is equal to a boundary value, then put it in the bucket to the right". In the inductor op, "right=True" means "the right boundary of a bucket is closed". These are opposite. **I'm open to switching the behavior of the inductor op** - but I chose to implement this way because I think it makes more sense, and I think the torch.bucketize behavior may have been a mistake (it's the opposite of numpy.digitize).~~ Switched the behavior of the inductor bucketize op to match the torch op * places where "right" means "if a value is equal to a boundary value, then put it in the bucket to the right" (i.e. current torch.bucketize behavior) + current torch.bucketize behavior + table in [torch.bucketize docs](https://pytorch.org/docs/stable/generated/torch.bucketize.html) * places where "right" means "the right boundary of a bucket is closed": + the text description of [torch.bucketize docs](https://pytorch.org/docs/stable/generated/torch.bucketize.html) (observed in #91580) + [numpy.digitize](https://numpy.org/doc/stable/reference/generated/numpy.digitize.html) (which is basically the same op) **Performance**: Benchmark script: "values" as a [16, 1024, 1024] float32 tensor and "boundaries" as a [1025] tensor (i.e. defining 1024 buckets). As is: ``` Eager 0.30117499828338623 ms PT2 0.9298200011253357 ms ``` But performance improves significantly if we add an additional pointwise autotuning config (WIP in #104456): ``` Eager 0.3015420138835907 ms PT2 0.23028500378131866 ms ``` Pull Request resolved: #104007 Approved by: https://github.com/jansel

The docs correctly (i.e matching actual op behavior) state that `right = False` means `boundaries[i-1] < input[m][n]...[l][x] <= boundaries[i]`. However they previously stated that `If 'right' is False (default), then the left boundary is closed.` which contradicts the `boundaries[i-1] < input[m][n]...[l][x] <= boundaries[i]` statement. This modifies the docs to say `... then the left boundary is OPEN.` and also clarifies that this is the opposite behavior of numpy.digitize. Fixes #91580 [ghstack-poisoned]

…rch.bucketize docs for "right"" The docs correctly (i.e matching actual op behavior) state that `right = False` means `boundaries[i-1] < input[m][n]...[l][x] <= boundaries[i]`. However they previously stated that `If 'right' is False (default), then the left boundary is closed.` which contradicts the `boundaries[i-1] < input[m][n]...[l][x] <= boundaries[i]` statement. This modifies the docs to say `... then the left boundary is OPEN.` and also clarifies that this is the opposite behavior of numpy.digitize. Fixes #91580 [ghstack-poisoned]

…cs for "right"" The docs correctly (i.e matching actual op behavior) state that `right = False` means `boundaries[i-1] < input[m][n]...[l][x] <= boundaries[i]`. However they previously stated that `If 'right' is False (default), then the left boundary is closed.` which contradicts the `boundaries[i-1] < input[m][n]...[l][x] <= boundaries[i]` statement. This modifies the docs to say `... then the left boundary is OPEN.` and also clarifies that this is the opposite behavior of numpy.digitize. Fixes #91580 [ghstack-poisoned]

samdow added triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module module: sorting and selection topic: docs topic category labels Jan 3, 2023

samdow added module: docs Related to our documentation, both in docs/ and docblocks and removed topic: docs topic category labels Jan 3, 2023

davidberard98 mentioned this issue Jun 30, 2023

[inductor] Add prims._inductor_bucketize and add lowerings #104007

Closed

davidberard98 mentioned this issue Aug 16, 2023

Fix torch.bucketize docs for "right" #104474

Closed

pytorchmergebot closed this as completed in e9af315 Aug 17, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Documentation of `bucketize()` is wrong, and is contradicting itself. #91580

Documentation of `bucketize()` is wrong, and is contradicting itself. #91580

raaaaaymond commented Jan 2, 2023 •

edited by pytorch-bot bot

samdow commented Jan 3, 2023

raaaaaymond commented Jan 3, 2023

raaaaaymond commented Jan 3, 2023

raaaaaymond commented Jan 3, 2023

clane9 commented May 23, 2023

Documentation of bucketize() is wrong, and is contradicting itself. #91580

Documentation of bucketize() is wrong, and is contradicting itself. #91580

Comments

raaaaaymond commented Jan 2, 2023 • edited by pytorch-bot bot

📚 The doc issue

Suggest a potential alternative/fix

samdow commented Jan 3, 2023

raaaaaymond commented Jan 3, 2023

raaaaaymond commented Jan 3, 2023

raaaaaymond commented Jan 3, 2023

clane9 commented May 23, 2023

Documentation of `bucketize()` is wrong, and is contradicting itself. #91580

Documentation of `bucketize()` is wrong, and is contradicting itself. #91580

raaaaaymond commented Jan 2, 2023 •

edited by pytorch-bot bot