Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Example code in the documentation for Dataset is not clear #8970

Closed
noahbenson opened this issue Apr 24, 2024 · 13 comments · Fixed by #8973
Closed

Example code in the documentation for Dataset is not clear #8970

noahbenson opened this issue Apr 24, 2024 · 13 comments · Fixed by #8973

Comments

@noahbenson
Copy link
Contributor

What is your issue?

The example code in the documentation for the Dataset class (e.g., here) is probably clear to those who study Earth and Atmospheric Sciences, but it makes no sense to me. Here is the code:

np.random.seed(0)
temperature = 15 + 8 * np.random.randn(2, 2, 3)
precipitation = 10 * np.random.rand(2, 2, 3)
lon = [[-99.83, -99.32], [-99.79, -99.23]]
lat = [[42.25, 42.21], [42.63, 42.59]]
time = pd.date_range("2014-09-06", periods=3)
reference_time = pd.Timestamp("2014-09-05")

ds = xr.Dataset(
    data_vars=dict(
        temperature=(["x", "y", "time"], temperature),
        precipitation=(["x", "y", "time"], precipitation),
    ),
    coords=dict(
        lon=(["x", "y"], lon),
        lat=(["x", "y"], lat),
        time=time,
        reference_time=reference_time,
    ),
    attrs=dict(description="Weather related data."),
)

To be clear, I understand each individual line of code, but I don't understand why there is both a latitude/longitude and an x/y in this example or how they are supposed to be related to each other (and there do not appear to be any additional details about this dataset's intended structure). Probably due to this lack of clarity I'm having a hard time wrapping my head around what the x/y coordinates and the lat/lon coordinates are supposed to demonstrate about xarray here, or how the x/y and lat/lon values are represented in the data structure. Are the x and y coordinates in a map projection of some kind? I have worked successfully with Datasets in the past, but as someone who doesn't work with geospatial data, I find myself more confused about Datasets after reading this example than before.

I suspect that all that is needed is a clear description of what these data are supposed to represent, how they are intended to be used, and how x/y and lat/lon are related. If someone can explain this to me, I'd be happy to submit a PR for the docs.

@noahbenson noahbenson added the needs triage Issue that has not been reviewed by xarray team member label Apr 24, 2024
Copy link

welcome bot commented Apr 24, 2024

Thanks for opening your first issue here at xarray! Be sure to follow the issue template!
If you have an idea for a solution, we would really welcome a Pull Request with proposed changes.
See the Contributing Guide for more.
It may take us a while to respond here, but we really value your contribution. Contributors like you help make xarray better.
Thank you!

@max-sixty
Copy link
Collaborator

I agree it's a bit confusing. I read it as the points not being aligned to lat & lon — so the points along the same x dimension don't have the same latitude, for example.

I think we'd be open to a clearer example! Check out the xr.tutorial.load_dataset? for some inspiration...

@max-sixty max-sixty added topic-documentation and removed needs triage Issue that has not been reviewed by xarray team member labels Apr 24, 2024
@noahbenson
Copy link
Contributor Author

so the points along the same x dimension don't have the same latitude, for example

My confusion comes from the fact that I don't know why there is a dimension called x in this dataset at all, let alone why it would be the first dimension in the data array. Surely if each (2x2) slice of the 2 x 2 x 3 array were supposed to represent a 2D image of some kind, then the rows would be named y and the columns would be named x?

That said, I spoke in person to @scottyhq about the example, and he explained that x and y are in fact just arbitrary names for the first two array axes and not related to any meaning one would typically assume for dimensions labeled x and y. To clarify why this is confusing to me, consider that the example could be rewritten like this without changing the data or its representation at all (note that x and y are changed):

# Bad rewrite of the original example that nonetheless works just the same up to the
# names of things:
ds = xr.Dataset(
    data_vars=dict(
        temperature=(["elevation", "pressure", "time"], temperature),
        precipitation=(["elevation", "pressure", "time"], precipitation),
    ),
    coords=dict(
        lon=(["elevation", "pressure"], lon),
        lat=(["elevation", "pressure"], lat),
        time=time,
        reference_time=reference_time,
    ),
    attrs=dict(description="Weather related data."),
)

I think it's clear that the above example would engender a lot of confusion about how the coordinates and data vars are related to each other. I have the same confusion for x and y as dimension names because there's no sense in which the dimensions represent anything like an x or y coordinate (but simultaneously, latitude and longitude are closely-enough related to x and y coordinates that I assumed some kind of connection).

I actually think the example could be salvaged if x were to be renamed row and y were to be renamed column and some explanation were given as to why the data exist in a 2 x 2 x 3 array instead of a 4 x 3 matrix (which seems more natural to me in this particular example).

np.random.seed(0)
# This dataset contains measurements of temperature and precipitation for four points on Earth
# at three different times. Due to <reasons?>, the points are stored in a 2 x 2 x 3 array instead of
# in a 4 x 3 matrix.
temperature = 15 + 8 * np.random.randn(2, 2, 3)
precipitation = 10 * np.random.rand(2, 2, 3)
# The longitude values for each of the four collection points:
lon = [[-99.83, -99.32], [-99.79, -99.23]]
# The latitude values for each of the four collection points:
lat = [[42.25, 42.21], [42.63, 42.59]]
# The three times at which the data were collected (and a reference time):
time = pd.date_range("2014-09-06", periods=3)
reference_time = pd.Timestamp("2014-09-05")

ds = xr.Dataset(
    data_vars=dict(
        # The temperature and precipitation arrays are stored as 2 x 2 x 3 arrays
        # in which the final dimension represents time. The first two dimensions
        # are named "row" and "column". (These names can be arbitrarily changed
        # without changing the dataset so long as they match the dimension names
        # in the coords arguments for lon and lat as well.)
        temperature=(["row", "column", "time"], temperature),
        precipitation=(["row", "column", "time"], precipitation),
    ),
    coords=dict(
        lon=(["row", "column"], lon),
        lat=(["row", "column"], lat),
        time=time,
        reference_time=reference_time,
    ),
    attrs=dict(description="Weather related data."),
)

Better yet, give the rows and columns separate meanings, like the rows representing different measurement sites and the columns representing measurements taken by two different researchers—something like that?

@noahbenson
Copy link
Contributor Author

noahbenson commented Apr 24, 2024

(TL;DR) Here's a proposed edit to the documentation based on my understanding.

Happy to make a PR if this looks correct and there's general consensus that this is an improvement.

The following code-block contains the proposed change to the relevant part of the doc-string; it is repeated for readability in markdown below.


    Examples
    --------

    In this example dataset, we will represent measurements of the temperature
    and pressure that were made under various conditions:
     * the measurements were made on four different days;
     * they were made at two separate locations, which we will represent using
       their latitude and longitude; and
     * they were made using three different sets of instruments, which we will
       refer to as `'inst1'`, `'inst2'`, and `'inst3'`.

    >>> np.random.seed(0)
    >>> temperature = 15 + 8 * np.random.randn(2, 3, 4)
    >>> precipitation = 10 * np.random.rand(2, 3, 4)
    >>> lon = [-99.83, -99.32]
    >>> lat = [42.25, 42.21]
    >>> instruments = ['inst1', 'inst2', 'inst3']
    >>> time = pd.date_range("2014-09-06", periods=4)
    >>> reference_time = pd.Timestamp("2014-09-05")

    Here, we initialize the dataset with multiple dimensions. We use the string `"loc"`
    to represent the location dimension of the data, the string `"instrument"` to
    represent the instrument dimension, and the string `"time"` for the time.

    >>> ds = xr.Dataset(
    ...     data_vars=dict(
    ...         temperature=(["loc", "instrument", "time"], temperature),
    ...         precipitation=(["loc", "instrument", "time"], precipitation),
    ...     ),
    ...     coords=dict(
    ...         lon=("loc", lon),
    ...         lat=("loc", lat),
    ...         instrument=instruments,
    ...         time=time,
    ...         reference_time=reference_time,
    ...     ),
    ...     attrs=dict(description="Weather related data."),
    ... )
    >>> ds
    <xarray.Dataset>
    Dimensions:         (loc: 2, instrument: 3, time: 4)
    Coordinates:
        lon             (loc) float64 -99.83 -99.32
        lat             (loc) float64 42.25 42.21
      * instrument      (instrument) <U5 'inst1' 'inst2' 'inst3'
      * time            (time) datetime64[ns] 2014-09-06 2014-09-07 ... 2014-09-09
        reference_time  datetime64[ns] 2014-09-05
    Dimensions without coordinates: loc
    Data variables:
        temperature     (loc, instrument, time) float64 29.11 18.2 ... 21.92 9.063
        precipitation   (loc, instrument, time) float64 4.562 5.684 ... 2.089 1.613
    Attributes:
        description:  Weather related data.

    Find out where the coldest temperature was and what values the
    other variables had:

    >>> ds.isel(ds.temperature.argmin(...))
    <xarray.Dataset>
    Dimensions:         ()
    Coordinates:
        lon             float64 -99.32
        lat             float64 42.21
        instrument      <U5 'inst3'
        time            datetime64[ns] 2014-09-06
        reference_time  datetime64[ns] 2014-09-05
    Data variables:
        temperature     float64 -5.424
        precipitation   float64 9.884
    Attributes:
        description:  Weather related data.

Examples

In this example dataset, we will represent measurements of the temperature
and pressure that were made under various conditions:

  • the measurements were made on four different days;
  • they were made at two separate locations, which we will represent using
    their latitude and longitude; and
  • they were made using three different sets of instruments, which we will
    refer to as 'inst1', 'inst2', and 'inst3'.
>>> np.random.seed(0)
>>> temperature = 15 + 8 * np.random.randn(2, 3, 4)
>>> precipitation = 10 * np.random.rand(2, 3, 4)
>>> lon = [-99.83, -99.32]
>>> lat = [42.25, 42.21]
>>> instruments = ['inst1', 'inst2', 'inst3']
>>> time = pd.date_range("2014-09-06", periods=4)
>>> reference_time = pd.Timestamp("2014-09-05")

Here, we initialize the dataset with multiple dimensions. We use the string "loc" to represent the location dimension of the data, the string "instrument" to represent the instrument dimension, and the string "time" for the time.

>>> ds = xr.Dataset(
...     data_vars=dict(
...         temperature=(["loc", "instrument", "time"], temperature),
...         precipitation=(["loc", "instrument", "time"], precipitation),
...     ),
...     coords=dict(
...         lon=("loc", lon),
...         lat=("loc", lat),
...         instrument=instruments,
...         time=time,
...         reference_time=reference_time,
...     ),
...     attrs=dict(description="Weather related data."),
... )
>>> ds
<xarray.Dataset>
Dimensions:         (loc: 2, instrument: 3, time: 4)
Coordinates:
    lon             (loc) float64 -99.83 -99.32
    lat             (loc) float64 42.25 42.21
  * instrument      (instrument) <U5 'inst1' 'inst2' 'inst3'
  * time            (time) datetime64[ns] 2014-09-06 2014-09-07 ... 2014-09-09
    reference_time  datetime64[ns] 2014-09-05
Dimensions without coordinates: loc
Data variables:
    temperature     (loc, instrument, time) float64 29.11 18.2 ... 21.92 9.063
    precipitation   (loc, instrument, time) float64 4.562 5.684 ... 2.089 1.613
Attributes:
    description:  Weather related data.

Find out where the coldest temperature was and what values the other variables had:

>>> ds.isel(ds.temperature.argmin(...))
<xarray.Dataset>
Dimensions:         ()
Coordinates:
    lon             float64 -99.32
    lat             float64 42.21
    instrument      <U5 'inst3'
    time            datetime64[ns] 2014-09-06
    reference_time  datetime64[ns] 2014-09-05
Data variables:
    temperature     float64 -5.424
    precipitation   float64 9.884
Attributes:
    description:  Weather related data.

@max-sixty
Copy link
Collaborator

I think that's really good @noahbenson ! Thank you! We can confirm with the others in the PR, but I'm fairly confident that will be accepted as a good improvement if you can submit it.

noahbenson added a commit to noahbenson/xarray that referenced this issue Apr 25, 2024
…arer.

The example in the doc-string of the `Dataset` class prior to this commit
uses an example array whose size is `2 x 2 x 3` with the first two dimensions
labeled `"x"` and `"y"` and the final dimension labeled `"time"`. This was
confusing due to the fact that `"x"` and `"y"` are just arbitrary names for
these axes and that no reason is given for the data to be organized in a `2x2x3`
array instead of a `2x2` matrix. This commit clarifies the example.

See issue pydata#8970 for more information.
noahbenson added a commit to noahbenson/xarray that referenced this issue Apr 25, 2024
These changes to the documentation bring it into alignment with the
changes to the `Dataset` doc-string committed previously.

See issue pydata#8970 for more information.
@keewis
Copy link
Collaborator

keewis commented Apr 25, 2024

The example contains a field of values on a local coordinate system with axes x and y without coordinate values (so we'll assume integer indices), which is rotated / shifted compared to the global geographic coordinate system. This means that for each position in the array we have a unique pair of lat / lon values that are not shared along the rows and columns of the array, requiring the use of 2D coordinates.

While in the case of satellite imagery rows and columns would make sense, I suspect this is meant to show a weather model. I believe in modelling it is uncommon to speak of pixels, instead the term "cell" is used, and thus calling the dimensions "rows" and "columns" would not fit.

As this is a docstring, though, it really depends on what we want to show. If the intention is to show that Dataset can contain coordinates (at all) I agree we don't need the additional complexity of 2D coordinates. If instead we intend to show that 2D coordinates are possible, then we might indeed need to figure out a way to make the example less domain specific.

@noahbenson
Copy link
Contributor Author

noahbenson commented Apr 25, 2024

@keewis I don't see any evidence or description of a "local coordinate system" in the example except for the fact that the axes are named x and y. There is in fact nowhere in the example where x or y coordinates are represented—as far as I can tell, 'x' and 'y' are just arbitrary names for the rows/columns that don't carry meaningful information. I see no reason whatsoever in the example for the data to be arranged in a 2x2x3 array instead of a 4x3 array. Am I still missing something? (Or were these choices all made with the purpose of demonstrating that coordinates can be matrices then not explained well?)

Is the ability to represent coordinates in a matrix (like lon and lat in the example) very central to the purpose of the Dataset class? Because if not, I'm skeptical that showing a convoluted edge-case in the only example in the doc-string is a good idea. It's better for the documentation to be clear than to show every feature (at least, per my own philosophy of writing documentation; YMMV). I think the example I proposed in the PR is more intuitive to a the kind of user for whom the doc-string is intended than an example showing that your coordinates can theoretically be stored in matrices. But again, maybe I'm missing something?

@TomNicholas
Copy link
Contributor

@noahbenson thank you for raising this! This is the sort of stuff that we might miss. Although xarray tries very hard to be domain-agnostic, a disproportionate fraction of our users / maintainers are still in the geosciences, so feedback like this is helpful.

Is the ability to represent coordinates in a matrix (like lon and lat in the example) very central to the purpose of the Dataset class? Because if not, I'm skeptical that showing a convoluted edge-case in the only example in the doc-string is a good idea.

I agree that showing that coordinates can be 2D is not the essence of the Dataset class. Coordinates can be 2D on a DataArray also. I also think that 2D coordinates are important, especially for model data (geoscience or otherwise).

Can we just add another example to the docstring first, which contains multiple data variables but not 2D coordinates? Non-geoscience ideas for such an example would be welcome.

@max-sixty
Copy link
Collaborator

@TomNicholas & @keewis WDYT of the example in the PR?

(FWIW I find it much clearer than the current one; I did struggle to explain that one to myself...)

@keewis
Copy link
Collaborator

keewis commented Apr 25, 2024

I don't see any evidence or description of a "local coordinate system"

yeah, we didn't provide any explanation here, and you'd have to have seen this kind of dataset before to understand the physical meaning. I wouldn't put an explanation in the description, though, we're really just trying to showcase the capabilities of Dataset here. So really, I wonder whether we should put anything domain-specific in there at all, or whether a synthetic example would be better? That way, nobody will get distracted by the example in the docstring.

A different thing would be the narrative documentation, where I think we just need to do better in explaining what we're looking at.

Edit: I'll give feedback to the proposed example in the PR

@noahbenson
Copy link
Contributor Author

noahbenson commented Apr 25, 2024

I will of course defer to all of you who have much more invested in this project than I do, but I find this statement troubling as far as the documentation goes:

I wouldn't put an explanation in the description, though, we're really just trying to showcase the capabilities of Dataset here.

The Dataset doc-string is the first place that many people who are trying to understand the class will go. I am a modestly experienced xarray user and found that the current doc-string increased my confusion in this exact scenario. Are you saying that the purpose of the doc-string isn't to help people like me understand the class but to show off its capabilities, and thus there shouldn't be an explanation of the represented data in the docstring?

@keewis
Copy link
Collaborator

keewis commented Apr 25, 2024

what I meant was that the example in the docstring should be chosen in a way that is least distracting: the focus should be on the thing that is documented, not the data that we're using for that purpose. If we need to explain the data, to me that means that we should choose a different example where that is not necessary. The narrative documentation should of course use real-world examples where possible, and explain the data it is using.

So my suggestion was to use synthetic data in the docstring, with generic variable names like a and b, where it is very clear that the point of the example is not that we need to understand what the data represents, but how the class works.

@max-sixty
Copy link
Collaborator

max-sixty commented Apr 25, 2024

I don't mean to have a philosophical debate — I'm fine including both if that's a reasonable compromise. But I do think that we want to relate the concepts in the code to the concepts that people are familiar with.

That means not having things that are contrary to familiar concepts. So having lat and x as two different values seems needlessly confusing here — maybe that's familiar to earth scientists, but it does seem to require explanation, to @keewis 's point

It also implies having labels like temperature & precipitation, rather than var1 & var2, assuming most understand that temperature & precipitation are things we have measurements for

max-sixty added a commit that referenced this issue Apr 30, 2024
* Updates the example in the doc-string for the Dataset class to be clearer.

The example in the doc-string of the `Dataset` class prior to this commit
uses an example array whose size is `2 x 2 x 3` with the first two dimensions
labeled `"x"` and `"y"` and the final dimension labeled `"time"`. This was
confusing due to the fact that `"x"` and `"y"` are just arbitrary names for
these axes and that no reason is given for the data to be organized in a `2x2x3`
array instead of a `2x2` matrix. This commit clarifies the example.

See issue #8970 for more information.

* Updates the documentation of the Dataset class to have clearer examples.

These changes to the documentation bring it into alignment with the
changes to the `Dataset` doc-string committed previously.

See issue #8970 for more information.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Adds dataset size reports to the output of the example in the Dataset docstring.

* Fixes the documentation errors in the previous commits.

* Fixes indentation errors in the docs for previous commits.

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Maximilian Roos <5635139+max-sixty@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants