
fix: support merging of RegularArray and NumpyArray #2063

Merged
merged 14 commits into main from agoose77/fix-merge-numpy on Jan 6, 2023

Conversation

agoose77
Collaborator

@agoose77 agoose77 commented Jan 4, 2023

This is a PoC to fix #2058. Jim and I also discussed doing this properly at the form level, which I might even explore in another draft PR.

  • Restore main casting semantics

Comment on lines +364 to +370
# Default merging (can we cast one to the other)
else:
return self.backend.nplike.can_cast(
self.dtype, other.dtype, casting="same_kind"
) or self.backend.nplike.can_cast(
other.dtype, self.dtype, casting="same_kind"
)
Collaborator Author

The idea here is that np.concatenate uses something like can_cast to test mergeability. I think this should predict np.concatenate's behavior, at least (I'm not certain).
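As a rough sketch (plain NumPy, not the actual Awkward code path; the helper name is made up for illustration), the symmetric can_cast check above can be exercised directly:

```python
import numpy as np

def naively_mergeable(a: np.dtype, b: np.dtype) -> bool:
    # Symmetric "same_kind" test, mirroring the snippet above: two dtypes
    # are considered mergeable if either can be cast to the other.
    return np.can_cast(a, b, casting="same_kind") or np.can_cast(
        b, a, casting="same_kind"
    )

print(naively_mergeable(np.dtype(np.int64), np.dtype(np.float64)))         # True
print(naively_mergeable(np.dtype("datetime64[s]"), np.dtype(np.float64)))  # False
```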

Member

It's specifically this part of the code (line 350 to here) that was tuned to give the right merging behavior for datetime and timedelta types. It was originally wrong; I fixed it here:

0734ab2#diff-cdc2b042962b124ea98cc8bacd6aea030a50dc73f6502f9686475554c8913320

("Wrong" was mixing datetimes or timedeltas of different units without conversion.)

As I said in today's meeting, I'd like to continue being stricter than NumPy, not allowing datetimes or timedeltas to be mixed with non-datetimes or non-timedeltas. When the Array API Consortium eventually gets around to defining time-unit behavior, I suspect that they'll come to the same conclusion: promoting non-time data to act as time-data means guessing a unit, which usually isn't safe. If I'm right and they do pick this behavior, we won't have to change it. (Since it's a behavior without an API marker, like a function argument, it would be hard to set up a deprecation cycle.)

So let's revert the use of nplike.can_cast, at least for time units or any conversions that are already in violation of Array API decisions. We can diverge from current NumPy if we think doing so means making smaller or no changes later, when NumPy is compliant with the Array API.
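For illustration with plain NumPy (not Awkward's internals): a "same_kind" check happily mixes datetime units, which is exactly the permissiveness being objected to here, while cross-kind time/non-time casts are already rejected at that level.

```python
import numpy as np

ms = np.dtype("datetime64[ms]")
s = np.dtype("datetime64[s]")

# Mixing time units passes a "same_kind" check (even the lossy ms -> s
# direction), although unit conversion may be undesirable when merging.
print(np.can_cast(ms, s, casting="same_kind"))  # True
print(np.can_cast(s, ms, casting="safe"))       # True: s -> ms loses nothing

# Promoting non-time data to time data is already forbidden at "same_kind",
# because a unit would have to be guessed.
print(np.can_cast(np.dtype(np.int64), s, casting="same_kind"))  # False
```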

Member

I'm surprised, though, that the above didn't break any tests. I remember there being some tests that were sensitive to the time-unit fix, but I can't find them now.

Comment on lines 1335 to 1340
def _to_regular_primitive(self):
index = tuple([slice(None, 1)] * len(self.shape))
new_data = self.backend.nplike.broadcast_to(self._data[index], self.shape)
return NumpyArray(
new_data, backend=self.backend, parameters=self.parameters
).to_RegularArray()
Collaborator Author

Yes, this is a terrible name; this PR is a PoC and I'd appreciate a better suggestion. Maybe _to_mergeable_hack ;)

Comment on lines +68 to +70
@property
def issubdtype(self):
return numpy.issubdtype
Collaborator Author

issubdtype behaves like a method descriptor.

Member

Is this just formal, for code symmetry? Whether it's an attribute or a property on the NumpyMetadata.instance() singleton, issubdtype acts like a function (it doesn't consume a self argument).

@agoose77
Collaborator Author

agoose77 commented Jan 4, 2023

N.B. that the tests will fail, because this changes merging behavior for date-times (in line with NumPy, AFAICT).

@@ -989,7 +989,7 @@ def unique(self, *args, **kwargs):
try_touch_data(x)
raise ak._errors.wrap_error(NotImplementedError)

def concatenate(self, arrays):
def concatenate(self, arrays, casting="same_kind"):
Collaborator Author
@agoose77 Jan 4, 2023

This is not right: we should actually honor this argument in concatenate. However, this falls under the typetracer / nplike refactoring that I'm also working on separately, so I'll handle it there.
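A minimal sketch of what honoring the argument could look like for a NumPy-backed nplike (the class here is a hypothetical stand-in, not the actual refactored code): np.concatenate itself accepts a casting parameter, so the wrapper can simply forward it.

```python
import numpy as np

class NumpyLike:
    """Hypothetical stand-in for the NumPy-backed nplike wrapper."""

    def concatenate(self, arrays, casting="same_kind"):
        # Forward the casting policy instead of silently dropping it.
        return np.concatenate(arrays, casting=casting)

nplike = NumpyLike()
out = nplike.concatenate(
    [np.array([1, 2], dtype=np.int64), np.array([3.0], dtype=np.float64)]
)
print(out.dtype)  # float64: int64 -> float64 is a "same_kind"-compatible cast
```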

Member

Okay.

@agoose77
Collaborator Author

agoose77 commented Jan 5, 2023

TODO: restore existing datetime mergeability test (stricter)

Member
@jpivarski left a comment

Just a few comments, because this is in the draft stage.


if isinstance(
elif isinstance(
Member

The original idea of having the option-type/index check be a separate if statement from the subsequent chain is that it's conceptually distinct. Most node types, across the Content class hierarchy, start by asking this same question before getting into questions that are specific to each node type.

But since the option-type/index if block ends by returning, if and elif are functionally equivalent here, and looking at any one node's _mergeable_next implementation in isolation, a single if chain would be simpler.

Comment on lines +344 to +345
if len(self.shape) > 1:
return self._to_regular_primitive()._mergeable(other, mergebool)
Member

This is the main part of the PR.

In our conversation, I was advocating for

        assert len(self.shape) == 1

here, with an upstream conversion of the whole layout to replace NumpyArray with RegularArray before it enters into mergeable and mergemany.

I'm assuming that the NumpyArray would have to be converted (often a copy) into RegularArray again if mergeable returns True and the mergemany step happens. I was just hoping that could be made a single conversion. (It's not a big deal if it isn't.)

Part of that reasoning is that mergeable/mergemany is an internal API that we have to be careful with, anyway. (Calling mergemany on data that would have mergeable == False is undefined; I suspect a segfault would result.)

Collaborator Author

This conversion is low cost, as only for JAX do we actually need to copy any data. Does your point about wanting an assert here still stand in spite of that?

Member

We concluded from some experiments (and hindsight) that when reshape reduces the number of dimensions, it does have to copy, for any backend.

>>> a = np.arange(3*5*7).reshape(3, 5, 7)
>>> b = a[1:, 1:, 1:]
>>> c = a.T

>>> a[-1, -1, -1] = 123

>>> b
array([[[ 43,  44,  45,  46,  47,  48],
        [ 50,  51,  52,  53,  54,  55],
        [ 57,  58,  59,  60,  61,  62],
        [ 64,  65,  66,  67,  68,  69]],

       [[ 78,  79,  80,  81,  82,  83],
        [ 85,  86,  87,  88,  89,  90],
        [ 92,  93,  94,  95,  96,  97],
        [ 99, 100, 101, 102, 103, 123]]])
>>> c  # also has a "123" in the last element

>>> b_flat = b.reshape((-1,) + b.shape[2:])
>>> c_flat = c.reshape((-1,) + c.shape[2:])

>>> a[-1, -1, -1] = 321

>>> b_flat
array([[ 43,  44,  45,  46,  47,  48],
       [ 50,  51,  52,  53,  54,  55],
       [ 57,  58,  59,  60,  61,  62],
       [ 64,  65,  66,  67,  68,  69],
       [ 78,  79,  80,  81,  82,  83],
       [ 85,  86,  87,  88,  89,  90],
       [ 92,  93,  94,  95,  96,  97],
       [ 99, 100, 101, 102, 103, 123]])
>>> c_flat  # also does not have a "321" in the last element

It's because when a dimension is removed, there's no stride left in which NumPy could record the different byte-skip between the affected numbers.

So this NumpyArray → RegularArray conversion is an $\mathcal{O}(n)$ performance cost for all backends, including NumPy.

JAX had some different performance cost, which NumPy and CuPy didn't have, which I can't remember right now.
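The copy can be observed directly with np.shares_memory (plain NumPy, for illustration): flattening the leading dimensions of a sliced or transposed view cannot be expressed with strides, so reshape silently returns a copy.

```python
import numpy as np

a = np.arange(3 * 5 * 7).reshape(3, 5, 7)

# A contiguous array can be flattened as a view: no copy.
flat_view = a.reshape(-1, 7)
print(np.shares_memory(a, flat_view))  # True

# A sliced view has gaps between rows; merging its first two dimensions
# has no valid strides, so reshape must copy.
b = a[1:, 1:, 1:]
b_flat = b.reshape(-1, b.shape[-1])
print(np.shares_memory(a, b_flat))  # False

# Same story for a transpose.
c_flat = a.T.reshape(-1, a.T.shape[-1])
print(np.shares_memory(a, c_flat))  # False
```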

Collaborator Author

Right, if we use to_RegularArray() we would see this penalty (for non-contiguous arrays).

Hence, I wrote this function, which provides a hacky solution given that we don't care about the buffer contents: slice the array down to a single element, and then broadcast it to the appropriate dimensions. This creates an array with strides = (0, 0, ..., 0), which consumes no additional memory. Broadcasting like this doesn't work for JAX, because JAX doesn't have views over buffers in the same way that NumPy and CuPy do.
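In plain NumPy terms, the trick looks like this (an illustration mirroring _to_regular_primitive, not the Awkward code itself): the broadcast view has all-zero strides, so no new buffer is allocated.

```python
import numpy as np

data = np.arange(3 * 5 * 7, dtype=np.float64).reshape(3, 5, 7)

# Take a single element (keeping the dimensionality), then broadcast it
# back out: the dtype and shape survive, and the contents are irrelevant.
index = tuple([slice(None, 1)] * data.ndim)
stub = np.broadcast_to(data[index], data.shape)

print(stub.shape)    # (3, 5, 7)
print(stub.strides)  # (0, 0, 0): a view, no extra memory
print(np.shares_memory(data, stub))  # True
```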

Member

Okay, yeah, that's cheap-as-free! Now I remember that this, specifically, is the case that JAX doesn't handle as well as NumPy and CuPy. And it's fine to ignore this performance issue in JAX (it's a corner case of a corner case).


@agoose77
Collaborator Author

agoose77 commented Jan 5, 2023

I'm surprised, though, that the above didn't break any tests. I remember there being some tests that were sensitive to the time-unit fix, but I can't find them now.

It does, it breaks many tests! That was why I raised it today, because we clearly care about this.

Is this just formal, for code symmetry? Whether it's an attribute or a property on the NumpyMetadata.instance() singleton, issubdtype acts like a function (it doesn't consume a self argument).

The Python attribute-resolution machinery will always invoke __get__ on a function that's resolved from the class dictionary. So, without this property, Python runs func.__get__ to return a function bound to our nplike object, which is not desirable.
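A small illustration of that binding behavior (the class names here are hypothetical): storing a plain function in the class dict makes attribute access return a bound method, which wrongly injects the instance as the function's first argument; returning it from a property sidesteps the descriptor protocol.

```python
import numpy

class BoundMetadata:
    # A plain function stored in the class dict is a descriptor:
    # attribute access triggers numpy.issubdtype.__get__ and yields
    # a *bound* method, with the instance as the first argument.
    issubdtype = numpy.issubdtype

class PropertyMetadata:
    @property
    def issubdtype(self):
        # Returning the function from a property sidesteps the binding.
        return numpy.issubdtype

bound_failed = False
try:
    BoundMetadata().issubdtype(numpy.float32, numpy.floating)
except TypeError:
    # issubdtype received three arguments (instance + two dtypes).
    bound_failed = True

print(bound_failed)  # True
print(PropertyMetadata().issubdtype(numpy.float32, numpy.floating))  # True
```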

@agoose77 agoose77 marked this pull request as ready for review January 6, 2023 09:02
@codecov

codecov bot commented Jan 6, 2023

Codecov Report

Merging #2063 (241b7ad) into main (7823a2b) will decrease coverage by 0.00%.
The diff coverage is 90.00%.

Additional details and impacted files
Impacted Files Coverage Δ
src/awkward/contents/listarray.py 90.08% <33.33%> (-0.38%) ⬇️
src/awkward/contents/listoffsetarray.py 80.11% <66.66%> (-0.09%) ⬇️
src/awkward/_nplikes.py 64.52% <100.00%> (+0.27%) ⬆️
src/awkward/_typetracer.py 61.02% <100.00%> (+0.09%) ⬆️
src/awkward/contents/numpyarray.py 91.19% <100.00%> (+0.14%) ⬆️
src/awkward/contents/regulararray.py 88.47% <100.00%> (ø)
src/awkward/contents/unmaskedarray.py 72.76% <0.00%> (-0.43%) ⬇️

@agoose77
Collaborator Author

agoose77 commented Jan 6, 2023

@jpivarski I've restored the date-time handling that corresponds to the previous behaviour. Due to the changes in this PR, _mergeable_next should match the old behaviour (besides the change to shape dependence), but the logic is inverted to be clearer, and is now potentially less permissive than the previous implementation. Given the exhaustiveness of the various test cases, I don't think that is likely: for the types we expect, at this point in the function we should have non-equal numeric types AFAICT.

Previously, we just returned True if we didn't early-exit. This routine is more explicit about why we're returning True, by ending on a can_cast call.

@jpivarski
Member

Okay, I'm on board with this now. My one performance question was addressed, the dates are as strict as they used to be, and calling can_cast at the else at the end of the elif chain is safer than saying things are mergeable if they pass our gauntlet.

potentially less permissive than the previous implementation

If it's actually less permissive, then in principle that would be backward incompatible (something that used to work now won't). Do you know of cases that we used to merge but would now be considered not mergeable?

@agoose77
Collaborator Author

agoose77 commented Jan 6, 2023

If it's actually less permissive, then in principle that would be backward incompatible (something that used to work now won't). Do you know of cases that we used to merge but would now be considered not mergeable?

Yes, that's why I expounded upon this: to assert that although it cannot be more permissive, I don't think it's less permissive either. It doesn't fail any tests, and I can't think of any types that would no longer be supported.

Ooh, actually, let me try one combination.

@jpivarski
Member

Okay, so if the one new test you're trying passes, then I think this PR is ready to be merged.

@agoose77
Collaborator Author

agoose77 commented Jan 6, 2023

Now that I think about it, np.concatenate defaults to "same_kind", so the merge would fail if we had previously permitted an invalid operation.
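For illustration with plain NumPy: under the default casting="same_kind", concatenating time data with non-time data fails outright, so a too-permissive mergeable answer would be caught at the merge step anyway.

```python
import numpy as np

ints = np.array([1, 2], dtype=np.int64)
dates = np.array(["2023-01-06"], dtype="datetime64[D]")

try:
    np.concatenate([ints, dates])  # default casting="same_kind"
    merged = True
except TypeError:  # NumPy raises a promotion error for int64 + datetime64
    merged = False

print(merged)  # False
```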

@agoose77 agoose77 merged commit a80fc48 into main Jan 6, 2023
@agoose77 agoose77 deleted the agoose77/fix-merge-numpy branch January 6, 2023 14:04
Successfully merging this pull request may close these issues.

TypeError in ak.to_packed with UnionType