Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

wip: add units implementation #2545

Closed
wants to merge 4 commits into from
Closed

wip: add units implementation #2545

wants to merge 4 commits into from

Conversation

agoose77
Copy link
Collaborator

@agoose77 agoose77 commented Jun 26, 2023

pint already partially supports Awkward Arrays:

import awkward as ak
import pint

reg = pint.UnitRegistry()

x = reg.Quantity(ak.Array([[1, 2, 3], [5]]), "m")

print(x + (2 * reg.m))

But this doesn't really scale to the nested, complex layout structures that Awkward supports.
This PR adds support for __units__ parameters in arrays, which allows structures to contain mixed units. Closes #2468.

>>> import awkward as ak
>>> array = ak.zip(
...     {
...         "x": ak.with_parameter([60, 70, 80, 90], "__units__", "mm"),
...         "y": ak.with_parameter([1, 2, 3, 4], "__units__", "cm"),
...     }
... )
>>> result = array.x + array.y
>>> result.show(type=True)
type: 4 * float64[parameters={"__units__": "millimeter"}]
[70,
 90,
 110,
 130]
  • Add ak.to_pint and ak.from_pint
  • Support ufuncs
  • Support reducers
  • Support concatenation

@agoose77 agoose77 temporarily deployed to docs-preview June 26, 2023 14:58 — with GitHub Actions Inactive
@agoose77
Copy link
Collaborator Author

I'm just playing around here.

There are two separate possibilities for adding unit support:

  1. We implement units as an array behavior
  2. We implement units as a fundamental (orthogonal) feature

This PR implements (2).

I think (1) is a worse prospect, because it prevents users/library authors from adding array behaviors in addition to units, unless they re-implement the entire units mechanism themselves. Our current typestring override logic is not advanced enough to let users fully mimic a custom units typestring, so this would be noticeable.

My current thinking is that we want to reserve behaviors for users/library authors, not core features; otherwise we preclude their use. That is to say, just because library authors could implement unit support using behaviors, does not mean to say that we should do so ourselves.

@jpivarski how does that logic sound?

@jpivarski
Copy link
Member

I agree that (2) is a good way to go, and we can be opinionated about Pint registries being the preferred way to get units.

I think we already know that Vector would want to "play well" (i.e. "compose") with the application of units, so we would either need to immediately make behavior-composition possible or develop built-in units with an eye toward compatibility with Vector. The latter is more pragmatic.

@agoose77
Copy link
Collaborator Author

@jpivarski I agree that behavior composition is less pragmatic — it is a can of worms given our existing 1:1 approach for names-classes.

Does this mean that you're in favour of e.g. units being added to records, rather than their fields? I can see an argument in both directions and I'm on the fence; there's an elegance to nor limiting ourselves to NumpyArray.

You're right about pint being only one example of a unit registry. My current thinking is that whilst our interface can try to avoid pulling pint into too many modules internally, we probably don't want to make this pluggable, otherwise we would need to define a specification for units. I think pint is well enough established that we can just say "we're using pint's interpretation of units".

@jpivarski
Copy link
Member

I think units should be added to NumpyArray/NumpyForm (non-temporal dtypes only). It has nothing to do with records. But it seems that's your idea, too:

if not self.is_numpy and parameters.get("__units__") is not None:

Sorry if whatever I said led you to think about records.

Vector defines behaviors on records, that's true. But if you define the right unit-behaviors on NumpyArray, then they should pass through Vector's formulas correctly. For example, Vector converts Cartesian to polar with

lib.sqrt(x**2 + y**2)

where lib is usually numpy. If x is in $\mbox{cm}$ and y is in $\mbox{mm}$, x**2 becomes $\mbox{cm}^2$ and y**2 becomes $\mbox{mm}^2$, the numpy.add ufunc puts them in a common unit, and the numpy.sqrt reduces that unit by a power. Pint may need to be a backend to make the ufuncs work, or maybe it can be queried in a metadata-only way so that it's equally applicable to CuPy arrays. (So that units compose with backends, as well as behaviors. Or PintBackend could be a non-singleton that gets instantiated with an other backend.)

This is all that Vector is doing when it sets the fields of record arrays to x, y, and rho, so if we get the ufuncs right, it ought to just pass through Vector. I wouldn't believe it without seeing it demonstrated, though—there are a lot of little things that could go wrong. Unfortunately, we're obliged to do the reducers ourselves, but Vector has only one reducer.

You're right about pint being only one example of a unit registry. My current thinking is that whilst our interface can try to avoid pulling pint into too many modules internally, we probably don't want to make this pluggable, otherwise we would need to define a specification for units. I think pint is well enough established that we can just say "we're using pint's interpretation of units".

That's what I meant, too: that we can freely use Pint as though it's the only system for handling units because it's so well established. The "registry" I was referring to is pint.UnitRegistry. I think that the implementation would use that, rather than carrying unit information in pint.Quantity arrays, since the latter would definitely require a new Awkward backend and would require extra work in implementing reducers.

I was not suggesting that we make the system of units pluggable the way that backends are pluggable. For something like this, we can ask people to express their units in Pint's terms.

@agoose77
Copy link
Collaborator Author

Sorry if whatever I said led you to think about records.

I was thinking about vector!

pint plays nicely with __array_function__ and __array_ufunc__, so we don't need to restrict units to 1D NumpyArray nodes. My sense is to start with that restriction, as we can't tighten it up later very easily.

Pint may need to be a backend to make the ufuncs work, or maybe it can be queried in a metadata-only way so that it's equally applicable to CuPy arrays

Thankfully, due to the above, it does work nicely with cupy and jax (at least, trivially). However, this PR uses Quantity(ak.Array), given that we also implement __array_function__ et al. That means we can ensure user functions don't have to handle units (although, maybe that's a bad design decision - maybe library authors would want that.

@jpivarski
Copy link
Member

I don't see how we can use pint.Quantity(ak.Array) (Pint on the outside), since an ak.Array can have many NumpyArrays, all with different units.

If we use pint.Quantity anywhere, it would have to be as the array class associated with a new Backend (Pint on the inside), and then it wouldn't compose with choice-of-backend unless that new PintBackend took another Backend as its constructor argument:

backend = PintBackend(CupyBackend)

But maybe we don't use pint.Quantity anywhere, or we only briefly make the pint.Quantity from the __units__ immediately before passing arrays to a ufunc in ak._connect.numpy.array_ufunc and break it back down into the Backend's array class before passing it on. We could benefit from Pint's __array_ufunc__ that way without introducing a new PintBackend, but its __array_function__ wouldn't provide much benefit, and of course we have to handle reducers ourselves (as usual).

In either of the two possible implementations, Vectors with units should pass through (i.e. compose), but that remains to be seen. An array of vectors could be constructed with different units in its x and y fields, and computations of rho would regularize them to whatever Pint prefers in the np.add ufunc step. That should work automatically, if Pint is implemented on the inside, not the outside.

@agoose77
Copy link
Collaborator Author

I don't see how we can use pint.Quantity(ak.Array) (Pint on the outside), since an ak.Array can have many NumpyArrays, all with different units.

That's what this PR does in three phases:

  1. Strip the __units__ parameter, and create a reg.Quantity(array_without_units, array.parameter("__units__"))
  2. Invoke the reducer upon these decorated arrays
  3. Pull out the result array, and set the units as a parameter.

But maybe we don't use pint.Quantity anywhere, or we only briefly make the pint.Quantity from the units immediately before passing arrays to a ufunc in ak._connect.numpy.array_ufunc and break it back down into the Backend's array class before passing it on.

Yes, this!

@jpivarski
Copy link
Member

In #2545 (comment), "array" means a rectilinear buffer, not an ak.Array, right?

That would be (e.g. for the NumpyBackend) using pint.Quantity(np.ndarray, the_units_string), possibly in a short-lived way, but it would not be using pint.Quantity(ak.Array) at any point.

@agoose77
Copy link
Collaborator Author

agoose77 commented Jun 27, 2023

No, it can wrap ak.Array; we also implement __array_function__ / __array_ufunc__, so we can handle the basic operations that pint performs. We could also implement this via the underlying NumpyArray.data, but that precludes us from putting units on higher-level objects (like records, or lists), should we see fit to do so.

@jpivarski
Copy link
Member

But, conceptually, an ak.Array shouldn't have units. Units don't make sense on any of the data structures except numbers (NumpyArray), which are strictly at the leaves of the tree. A record or a list can't have units of, say, centimeters. Only the fields of a record (if they're numerical) can have units—different units for different fields—and only the items of a list (if they're numerical) can have units.

I'm not talking about technical difficulties in making a pint.Quantity(ak.Array), but conceptual ones. It makes sense to put units on the leaves of the tree; it only makes sense to put units on the whole tree when the whole tree consists of just one leaf (a 1-D numerical ak.Array).

@agoose77
Copy link
Collaborator Author

But, conceptually, an ak.Array shouldn't have units. Units don't make sense on any of the data structures except numbers (NumpyArray), which are strictly at the leaves of the tree. A record or a list can't have units of, say, centimeters. Only the fields of a record (if they're numerical) can have units—different units for different fields—and only the items of a list (if they're numerical) can have units.

It might seem like I'm being pedantic on this, but I'm rather trying to elucidate the desired behavior. I think it does make sense to have units on records; it's a way of saying "all fields have the same unit". I also don't think we want to open that can of worms — thinking about it this morning, this would make operations of record.x + other_value_with_units fail, as record.x doesn't itself have units. Rather than complicate things with unit propagation, we should just constrain this to the leaves.

@jpivarski
Copy link
Member

For several of the coordinate systems in Vector, the units on each field would be different.

  • planar $\rho$, $\phi$: the $\rho$ has length, the $\phi$ is unitless
  • spatial $\theta$ and $\eta$ are unitless, but the planar part would have units (just $\rho$ or both $x$ and $y$)
  • Lorentz $E$ and $m$ are technically different units from the spatial part, though we usually work in systems in which $c = 1$ so it doesn't matter. But an energy might be expressed in Joules or a mass might be in grams, and then converting them to sensible units for combination with the momenta would be important.

In Vector, associations between the fields of the records are generally loose. For instance, they don't even need to be the same numeric type: $x$ can be int32 and $y$ can be float64. They'd all get promoted to the widest floating-point type pretty quickly, after a few steps in the calculation, but if we're not even forcing the numeric types to be the same among fields, it would be weird to force the units, which even have to be different for some coordinate systems.

(Can you tell I'm enjoying GitHub's $\LaTeX$ features?)

@agoose77
Copy link
Collaborator Author

(Can you tell I'm enjoying GitHub's \LaTeX features?)

I can now, haha!

@codecov
Copy link

codecov bot commented Jul 3, 2023

Codecov Report

Merging #2545 (cf78d7b) into main (002a3af) will decrease coverage by 0.12%.
The diff coverage is 56.55%.

Additional details and impacted files
Impacted Files Coverage Δ
src/awkward/_broadcasting.py 92.60% <0.00%> (-0.43%) ⬇️
src/awkward/operations/ak_to_pint.py 43.75% <43.75%> (ø)
src/awkward/_connect/numpy.py 84.45% <47.82%> (-7.42%) ⬇️
src/awkward/operations/ak_from_pint.py 50.00% <50.00%> (ø)
src/awkward/contents/content.py 76.07% <63.63%> (-0.10%) ⬇️
src/awkward/units.py 71.42% <71.42%> (ø)
src/awkward/contents/recordarray.py 85.04% <100.00%> (-0.11%) ⬇️
src/awkward/operations/__init__.py 100.00% <100.00%> (ø)

... and 1 file with indirect coverage changes

@pre-commit-ci pre-commit-ci bot temporarily deployed to docs-preview July 3, 2023 11:19 Inactive
@HealthyPear HealthyPear mentioned this pull request Jul 28, 2023
@agoose77 agoose77 closed this Jan 3, 2024
@jpivarski jpivarski deleted the agoose77/feat-add-units branch February 1, 2024 19:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support for units
2 participants