cartesian product of coordinates and using it to index / fill empty dataset #1914

Open
RafalSkolasinski opened this issue Feb 15, 2018 · 17 comments

Comments

@RafalSkolasinski commented Feb 15, 2018

For a given empty dataset with only coordinates

import xarray as xr
import numpy as np
data = xr.Dataset(coords={'x': np.linspace(-1, 1), 'y': np.linspace(0, 10), 'a': 1, 'b': 5})

I'd like to iterate over the product of its coordinates, in a similar way to how it can be done for numpy arrays,

data = np.zeros((10, 5, 10))
for (i, j, k), _ in np.ndenumerate(data):
    data[i, j, k] = some_function(i, j, k)

to fill the data with values of some function.

I'd also like to extend this to the case of multi-valued functions, i.e. functions that return a numpy.array.

Is there an easy way to do so? I was unable to find anything similar in the docs.
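For reference, a direct (if naive) translation of the ndenumerate loop to xarray could look like the sketch below; some_function is a placeholder and the .loc-based fill is just one possible approach, not an established xarray idiom.

import itertools as it
import numpy as np
import xarray as xr

data = xr.Dataset(coords={'x': np.linspace(-1, 1), 'y': np.linspace(0, 10), 'a': 1, 'b': 5})

def some_function(x, y):
    return x * y  # placeholder

# allocate a NaN-filled array over the dimension coordinates,
# then fill it point by point via .loc
out = xr.DataArray(np.full((len(data['x']), len(data['y'])), np.nan),
                   dims=('x', 'y'),
                   coords={'x': data['x'], 'y': data['y']})
for x, y in it.product(data['x'].values, data['y'].values):
    out.loc[{'x': x, 'y': y}] = some_function(x, y)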

@RafalSkolasinski (Author) commented Feb 19, 2018

Let me give a bit of background on what I would like to do:

  1. Create an empty Dataset with the coordinates I want to explore, i.e. two np.arrays x and y, and two scalars a and b.
  2. Generate a list of the Cartesian product of all the coordinates, i.e. [{'x': -1, 'y': 0, 'a': 1, 'b': 5}, ...] (the data format doesn't really matter).
  3. For each item of the iterator, compute some function f = f(x, y, a, b). In principle this function can be expensive to compute, so I'd compute it for each item of the list from step 2 separately on a cluster.
  4. "Merge" it all back together into a single xarray object.

In principle f should be allowed to return e.g. an np.array.
A related issue in holoviews and the notebook with my initial attempt.
In the linked notebook I managed to achieve the goal, but without starting from an xarray object containing the coordinates. Combining the data also seems a bit inefficient: for larger datasets it takes more time than generating it.

@max-sixty (Collaborator)

I think this shouldn't be too hard to 'get done', but xarray may not give you much help natively. (I'm not sure though, so take this as a hopefully helpful contribution rather than a definitive answer.)

Specifically, can you do (2) by generating a product of the coords? Either using numpy, stacking, or some simple Python:

In [2]: from itertools import product

In [3]: list(product(*((data[x].values) for x in data.dims)))
Out[3]:
[(0.287706062977495, 0.065327131503921),
 (0.287706062977495, 0.17398282388217068),
 (0.287706062977495, 0.1455022501442349),
 (0.42398126102299216, 0.065327131503921),
 (0.42398126102299216, 0.17398282388217068),
 (0.42398126102299216, 0.1455022501442349),
 (0.13357153947234057, 0.065327131503921),
 (0.13357153947234057, 0.17398282388217068),
 (0.13357153947234057, 0.1455022501442349),
 (0.42347765161572537, 0.065327131503921),
 (0.42347765161572537, 0.17398282388217068),
 (0.42347765161572537, 0.1455022501442349)]

then distribute those out to a cluster if you need, and then unstack them back into a dataset?

@RafalSkolasinski (Author) commented Feb 19, 2018

For "get done" I had for example the following (similar to what I linked as my initial attempt)

import itertools as it
import numpy as np
import xarray as xr

coordinates = {
    'x': np.linspace(-1, 1), 
    'y': np.linspace(0, 10), 
}

constants = {
    'a': 1, 
    'b': 5
}

inps = [{**constants, **{k: v for k, v in zip(coordinates.keys(), x)}} 
        for x in list(it.product(*coordinates.values()))]


def f(x, y, a, b):
    """Some dummy function."""
    v = a * x**2 + b * y**2
    return xr.DataArray(v, {'x': x, 'y': y, 'a': a, 'b': b})


# simulate computation on cluster
values = list(map(lambda s: f(**s), inps))

# gather and unstack the inputs
ds = xr.concat(values, dim='new', coords='all')
ds = ds.set_index(new=list(set(ds.coords) - set(ds.dims)))
ds = ds.unstack('new')

It is very close to what you suggest. My main question is whether this can be done better. Mainly I am wondering about:

  1. Is there any built-in iterator over the Cartesian product of coordinates? If not, are there others who would also find one useful?
  2. Gathering / unstacking the data. My 3-line combo of concat, set_index and unstack seems to do the trick, but it feels a bit overcomplicated. Ideally I'd expect some mechanism that works similar to:
inputs = cartesian_product(coordinates) # list similar to ``inps`` above
values = [function(inp) for inp in inputs]  # or using ipyparallel map

xarray_data = ... # some empty xarray object
for inp, val in zip(inputs, values):
   xarray_data[inp] = val

I asked how to generate the product of coordinates from an xarray object because I was expecting that I could create xarray_data as an empty object with all coordinates set and then fill it.


Added comment

Having an empty (i.e. NaN-filled) object to start with would have the benefit that one could save partial results and have clear information about what has already been computed.
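A helper like the cartesian_product used in the pseudo-code above does not exist in xarray; a minimal hand-rolled sketch (the name and signature are hypothetical) could be:

import itertools as it

def cartesian_product(coordinates):
    """Hypothetical helper: yield one {name: value} dict per grid point."""
    names = list(coordinates)
    for values in it.product(*(coordinates[name] for name in names)):
        yield dict(zip(names, values))

# inputs = list(cartesian_product(coordinates))  # would produce a list like ``inps`` above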

@RafalSkolasinski changed the title from "iterate over dataset coordinates" to "cartesian product of coordinates and using it to index / fill empty dataset" Feb 19, 2018
@fujiisoup (Member) commented Feb 19, 2018

I am not sure if it is efficient for interacting with a cluster, but I often use a MultiIndex to make a cartesian product,

In [1]: import xarray as xr
   ...: import numpy as np
   ...: data = xr.DataArray(np.full((3, 4), np.nan), dims=('x', 'y'), 
   ...:                     coords={'x': [0, 1, 2], 'y': ['a', 'b', 'c', 'd']})
   ...: 
   ...: data
   ...: 
Out[1]: 
<xarray.DataArray (x: 3, y: 4)>
array([[ nan,  nan,  nan,  nan],
       [ nan,  nan,  nan,  nan],
       [ nan,  nan,  nan,  nan]])
Coordinates:
  * x        (x) int64 0 1 2
  * y        (y) <U1 'a' 'b' 'c' 'd'

In [2]: data1 = data.stack(xy=['x', 'y'])
   ...: data1
   ...: 
Out[2]: 
<xarray.DataArray (xy: 12)>
array([ nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan])
Coordinates:
  * xy       (xy) MultiIndex
  - x        (xy) int64 0 0 0 0 1 1 1 1 2 2 2 2
  - y        (xy) object 'a' 'b' 'c' 'd' 'a' 'b' 'c' 'd' 'a' 'b' 'c' 'd'

For the above example, data becomes 1-dimensional with coordinate xy, where xy is the product of x and y.
Each entry of xy is a tuple of 'x' and 'y' values,

In [3]: data1[0]
Out[3]: 
<xarray.DataArray ()>
array(np.nan)
Coordinates:
    xy       object (0, 'a')

and we can assign a value for given coordinate values via the loc method,

In [5]: # Assuming we found the result with (1, 'a') is 2.0
   ...: data1.loc[(1, 'a'), ] = 2.0

In [6]: data1
Out[6]: 
<xarray.DataArray (xy: 12)>
array([ nan,  nan,  nan,  nan,   2.,  nan,  nan,  nan,  nan,  nan,  nan,  nan])
Coordinates:
  * xy       (xy) MultiIndex
  - x        (xy) int64 0 0 0 0 1 1 1 1 2 2 2 2
  - y        (xy) object 'a' 'b' 'c' 'd' 'a' 'b' 'c' 'd' 'a' 'b' 'c' 'd'

Note that we need to access it via data1.loc[(1, 'a'), ] rather than data1.loc[(1, 'a')] (the trailing comma in the brackets is needed).
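Once every point has been filled, the stacked array can presumably be turned back into its original two-dimensional layout (a one-line sketch, not part of the session above):

data2 = data1.unstack('xy')  # back to dims ('x', 'y')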

EDIT: I modified my previous comment to take the partial assignment into account.

@RafalSkolasinski (Author)

After preparing a list similar to [{'x': 0, 'y': 'a'}, {'x': 1, 'y': 'a'}, ...], interaction with the cluster is quite efficient. One can easily pass such a thing to ipyparallel's async_map.

Thanks for your suggestion, I need to try a few things. I also want to try to extend it to a function that computes a few different things that could be multi-valued, e.g.

def dummy(x, y):
    ds = xr.Dataset(
        {'out1': ('n', [1*x, 2*x, 3*x]), 'out2': ('m', [x, y])},
        coords = {'x': x, 'y': y, 'n': range(3), 'm': range(2)}
    )
    
    return ds

and then group such outputs together... OK, I know, I'm going from a simple problem to a much more complicated one, but isn't that usually the case?
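For what it's worth, the earlier concat / set_index / unstack combo appears to generalize to such Dataset outputs as well; a rough, untested sketch (the example x and y values are placeholders, dummy is the function above):

import itertools as it
import xarray as xr

inps = [dict(zip(('x', 'y'), v)) for v in it.product([0.0, 0.5, 1.0], [1.0, 2.0])]
values = [dummy(**inp) for inp in inps]

# concat along a helper dim, turn the scalar coords into a MultiIndex,
# then unstack into dims (x, y, n, m)
ds = xr.concat(values, dim='new', coords='all')
ds = ds.set_index(new=['x', 'y'])
ds = ds.unstack('new')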

@shoyer (Member) commented Feb 20, 2018

xarray.broadcast() could also be helpful for generating a cartesian product. Something like xarray.broadcast(*data.coords.values()) would get you broadcast DataArray objects that together span the full grid of coordinate values.
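For example (a minimal sketch reusing the data Dataset from the original post; only the two dimension coordinates are broadcast here):

xx, yy = xr.broadcast(data['x'], data['y'])
# xx and yy now both have dims ('x', 'y'), so pairing their flattened values
# walks the full cartesian product of the two coordinates
pairs = list(zip(xx.values.ravel(), yy.values.ravel()))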

apply_ufunc with vectorize=True could also achieve what you're looking for here:

import xarray as xr
import numpy as np

data = xr.Dataset(coords={'x': np.linspace(-1, 1), 'y': np.linspace(0, 10), 'a': 1, 'b': 5})

def some_function(x, y):
    return float(x) * float(y)

xr.apply_ufunc(some_function, data['x'], data['y'], vectorize=True)

Results in:

<xarray.DataArray (x: 50, y: 50)>
array([[ -0.      ,  -0.204082,  -0.408163, ...,  -9.591837,  -9.795918, -10.      ],
       [ -0.      ,  -0.195752,  -0.391504, ...,  -9.200333,  -9.396085,
         -9.591837],
       [ -0.      ,  -0.187422,  -0.374844, ...,  -8.80883 ,  -8.996252,
         -9.183673],
       ...,
       [  0.      ,   0.187422,   0.374844, ...,   8.80883 ,   8.996252,
          9.183673],
       [  0.      ,   0.195752,   0.391504, ...,   9.200333,   9.396085,
          9.591837],
       [  0.      ,   0.204082,   0.408163, ...,   9.591837,   9.795918,  10.      ]])
Coordinates:
  * x        (x) float64 -1.0 -0.9592 -0.9184 -0.8776 -0.8367 -0.7959 ...
    a        int64 1
    b        int64 5
  * y        (y) float64 0.0 0.2041 0.4082 0.6122 0.8163 1.02 1.224 1.429 ...

You can even do this with dask arrays if you set dask='parallelized'.
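A sketch of what that might look like (assuming the coordinate arrays are first chunked into dask-backed DataArrays; the chunk sizes are arbitrary, and output_dtypes tells dask the expected result dtype):

result = xr.apply_ufunc(
    some_function,
    data['x'].chunk({'x': 10}),
    data['y'].chunk({'y': 10}),
    vectorize=True,
    dask='parallelized',
    output_dtypes=[float],
)
result.compute()  # evaluate the chunks, e.g. on a dask distributed cluster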

That said, it does feel like there's some missing functionality here for the xarray equivalent of ndenumerate. I'm not entirely sure what the right API is, yet.

@shoyer (Member) commented Feb 22, 2018

This issue brings up a lot of the same points as #1773.

Clearly, we need better documentation here at the very least.

@RafalSkolasinski (Author)

@shoyer Thanks for your suggestions and for linking the other issue. I think this one can also be labelled as a "usage question".

@basnijholt

This StackOverflow question is related to this "issue".

@shoyer (Member) commented Jun 12, 2018

xyzpy (by @jcmgray) looks like it might be a nice way to solve this problem, e.g., see http://xyzpy.readthedocs.io/en/latest/examples/complex%20output%20example.html

@jcmgray (Contributor) commented Jun 12, 2018

Indeed, this is exactly the kind of situation I wrote xyzpy for. As a quick demo:

import numpy as np
import xyzpy as xyz

def some_function(x, y, z):
    return x * np.random.randn(3, 4) + y / z

# Define how to label the function's output
runner_opts = {
    'fn': some_function,
    'var_names': ['output'],
    'var_dims': {'output': ['a', 'b']},
    'var_coords': {'a': [10, 20, 30]},
}
runner = xyz.Runner(**runner_opts)

# set the parameters we want to explore (combos <-> cartesian product)
combos = {
    'x': np.linspace(1, 2, 11),
    'y': np.linspace(2, 3, 21),
    'z': np.linspace(4, 5, 31),
}

# run them
runner.run_combos(combos)

Should produce:

100%|###################| 7161/7161 [00:00<00:00, 132654.11it/s]

<xarray.Dataset>
Dimensions:  (a: 3, b: 4, x: 11, y: 21, z: 31)
Coordinates:
  * x        (x) float64 1.0 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9 2.0
  * y        (y) float64 2.0 2.05 2.1 2.15 2.2 2.25 2.3 2.35 2.4 2.45 2.5 ...
  * z        (z) float64 4.0 4.033 4.067 4.1 4.133 4.167 4.2 4.233 4.267 4.3 ...
  * a        (a) int32 10 20 30
Dimensions without coordinates: b
Data variables:
    output   (x, y, z, a, b) float64 0.6942 -0.3348 -0.9156 -0.517 -0.834 ...

And there are options for merging successive, disjoint sets of data (combos2, combos3, ...) and parallelizing/distributing the work.

There are also multiple ways to define a function's inputs/outputs (the easiest of which is just to actually return an xr.Dataset), but do let me know if your use case is beyond them or unclear.

@RafalSkolasinski (Author)

@jcmgray I must have missed your reply to this issue; I only saw it just now. I love your code! I will definitely include xyzpy among my tools from now on ;-).

@max-sixty (Collaborator)

I suggest we close this, given that xyzpy seems great at it.

@max-sixty added the "plan to close" (May be closeable, needs more eyeballs) label Dec 2, 2023
@shoyer (Member) commented Dec 6, 2023

xyzpy seems to be unmaintained at present - the last release was 3 years ago.

Given how useful/common this functionality is, I would support pulling it into Xarray proper. I still find myself writing ad hoc versions of this pretty regularly (usually involving expand_dims and concat).
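For illustration, one ad hoc version along those lines might look like this (a sketch only; some_function, xs and ys are placeholders):

import numpy as np
import xarray as xr

def some_function(x, y):
    return x * y  # placeholder

xs = np.linspace(-1, 1, 5)
ys = np.linspace(0, 10, 4)

# compute each scalar result, attach its coordinates with expand_dims,
# then concat twice to rebuild the (x, y) grid
rows = []
for x in xs:
    row = [xr.DataArray(some_function(x, y)).expand_dims(x=[x], y=[y]) for y in ys]
    rows.append(xr.concat(row, dim='y'))
result = xr.concat(rows, dim='x')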

@max-sixty (Collaborator)

xyzpy seems to be unmaintained at present - the last release was 3 years ago.

Yeah, it was late 2021. https://pypi.org/project/xyzpy/#history. I had looked at the commit history, which seems reasonably active this year, maybe a release is forthcoming...

That said, reopening!

@max-sixty reopened this Dec 6, 2023
@max-sixty removed the "plan to close" label Dec 15, 2023
@dcherian (Contributor) commented Jan 3, 2024

It does actually have commits pretty recently: https://github.com/jcmgray/xyzpy/commits/main/

@jcmgray (Contributor) commented Jan 5, 2024

xyzpy is indeed still in use and development, thanks for the reminder to mint a release, I'll try and do that soon...
