Example code in the documentation for `Dataset` is not clear #8970
Comments
Thanks for opening your first issue here at xarray! Be sure to follow the issue template! |
I agree it's a bit confusing. I read it as the points not being aligned to lat & lon — so the points along the same x dimension don't have the same latitude, for example. I think we'd be open to a clearer example! Check out the |
My confusion comes from the fact that I don't know why there is a dimension called
That said, I spoke in person to @scottyhq about the example, and he explained that
# Bad rewrite of the original example that nonetheless works just the same up to the
# names of things:
ds = xr.Dataset(
    data_vars=dict(
        temperature=(["elevation", "pressure", "time"], temperature),
        precipitation=(["elevation", "pressure", "time"], precipitation),
    ),
    coords=dict(
        lon=(["elevation", "pressure"], lon),
        lat=(["elevation", "pressure"], lat),
        time=time,
        reference_time=reference_time,
    ),
    attrs=dict(description="Weather related data."),
)
I think it's clear that the above example would engender a lot of confusion about how the coordinates and data vars are related to each other. I have the same confusion for
I actually think the example could be salvaged if
# Imports assumed by the example:
import numpy as np
import pandas as pd
import xarray as xr

np.random.seed(0)
# This dataset contains measurements of temperature and precipitation for four points on Earth
# at three different times. Due to <reasons?>, the points are stored in a 2 x 2 x 3 array instead of
# in a 4 x 3 matrix.
temperature = 15 + 8 * np.random.randn(2, 2, 3)
precipitation = 10 * np.random.rand(2, 2, 3)
# The longitude values for each of the four collection points:
lon = [[-99.83, -99.32], [-99.79, -99.23]]
# The latitude values for each of the four collection points:
lat = [[42.25, 42.21], [42.63, 42.59]]
# The three times at which the data were collected (and a reference time):
time = pd.date_range("2014-09-06", periods=3)
reference_time = pd.Timestamp("2014-09-05")
ds = xr.Dataset(
    data_vars=dict(
        # The temperature and precipitation arrays are stored as 2 x 2 x 3 arrays
        # in which the final dimension represents time. The first two dimensions
        # are named "row" and "column". (These names can be arbitrarily changed
        # without changing the dataset so long as they match the dimension names
        # in the coords arguments for lon and lat as well.)
        temperature=(["row", "column", "time"], temperature),
        precipitation=(["row", "column", "time"], precipitation),
    ),
    coords=dict(
        lon=(["row", "column"], lon),
        lat=(["row", "column"], lat),
        time=time,
        reference_time=reference_time,
    ),
    attrs=dict(description="Weather related data."),
)
Better yet, give the rows and columns separate meanings, like the rows representing different measurement sites and the columns representing measurements taken by two different researchers, or something like that? |
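The comment in the example above notes that the four points could equally be stored in a 4 x 3 matrix. A minimal sketch of that equivalence follows, assuming an xarray version in which `Dataset.stack` accepts dimensions that have no index coordinates. The dimension name `"point"` is hypothetical and chosen only for this illustration.

```python
import numpy as np
import pandas as pd
import xarray as xr

np.random.seed(0)
temperature = 15 + 8 * np.random.randn(2, 2, 3)
lon = [[-99.83, -99.32], [-99.79, -99.23]]
lat = [[42.25, 42.21], [42.63, 42.59]]
time = pd.date_range("2014-09-06", periods=3)

ds = xr.Dataset(
    data_vars=dict(temperature=(["row", "column", "time"], temperature)),
    coords=dict(
        lon=(["row", "column"], lon),
        lat=(["row", "column"], lat),
        time=time,
    ),
)

# Collapse the two arbitrary grid dimensions into a single "point" dimension
# ("point" is a hypothetical name used only in this sketch).
flat = ds.stack(point=("row", "column"))

# stack() places the new dimension last, so transpose to get points x times.
flat_temperature = flat.temperature.transpose("point", "time")

print(flat_temperature.shape)  # four points, three times
print(flat.lon.values)         # lon is now 1-D, one value per point
```

The stacked dataset holds the same numbers as the 2 x 2 x 3 version; only the bookkeeping of the first two axes changes, which suggests the `"row"` and `"column"` names carry no meaning of their own.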
(TL;DR) Here's a proposed edit to the documentation based on my understanding. Happy to make a PR if this looks correct and there's general consensus that this is an improvement. The following code-block contains the proposed change to the relevant part of the doc-string; it is repeated for readability in markdown below.
Examples
In this example dataset, we will represent measurements of the temperature
Here, we initialize the dataset with multiple dimensions. We use the string
Find out where the coldest temperature was and what values the other variables had:
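One way to perform such a lookup, sketched here on the row/column dataset from earlier in the thread rather than on the proposed doc-string's own data, is to pass the result of `DataArray.argmin(...)` to `Dataset.isel`:

```python
import numpy as np
import pandas as pd
import xarray as xr

np.random.seed(0)
temperature = 15 + 8 * np.random.randn(2, 2, 3)
precipitation = 10 * np.random.rand(2, 2, 3)
lon = [[-99.83, -99.32], [-99.79, -99.23]]
lat = [[42.25, 42.21], [42.63, 42.59]]
time = pd.date_range("2014-09-06", periods=3)

ds = xr.Dataset(
    data_vars=dict(
        temperature=(["row", "column", "time"], temperature),
        precipitation=(["row", "column", "time"], precipitation),
    ),
    coords=dict(
        lon=(["row", "column"], lon),
        lat=(["row", "column"], lat),
        time=time,
    ),
)

# argmin(...) reduces over all dimensions and returns a dict that maps each
# dimension name to the integer position of the minimum along it.
coldest_index = ds.temperature.argmin(...)

# isel accepts that dict directly and selects the same position in every
# variable and coordinate of the dataset.
coldest = ds.isel(coldest_index)
print(coldest)
```

The result is a dataset of scalars: the minimum temperature together with the precipitation, longitude, latitude, and time recorded at that same position.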
|
I think that's really good @noahbenson ! Thank you! We can confirm with the others in the PR, but I'm fairly confident that will be accepted as a good improvement if you can submit it. |
The example contains a field of values on a local coordinate system with axes While in the case of satellite imagery As this is a docstring, though, it really depends on what we want to show. If the intention is to show that |
@keewis I don't see any evidence or description of a "local coordinate system" in the example except for the fact that the axes are named Is the ability to represent coordinates in a matrix (like |
@noahbenson thank you for raising this! This is the sort of stuff that we might miss. Although xarray tries very hard to be domain-agnostic, a disproportionate fraction of our users / maintainers are still in the geosciences, so feedback like this is helpful.
I agree that showing that coordinates can be 2D is not the essence of the Can we just add another example to the docstring first, which contains multiple data variables but not 2D coordinates? Non-geoscience ideas for such an example would be welcome. |
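As one possible answer to the request above, here is a hypothetical non-geoscience sketch with two data variables and only 1-D coordinates. The experiment, the variable names, and the values are all invented for illustration and do not come from the xarray documentation.

```python
import numpy as np
import xarray as xr

rng = np.random.default_rng(0)

# Hypothetical behavioral experiment: 3 subjects, 4 trials each.
reaction_time = 0.3 + 0.05 * rng.standard_normal((3, 4))  # seconds
correct = rng.integers(0, 2, size=(3, 4)).astype(bool)    # was the response correct?

ds = xr.Dataset(
    data_vars=dict(
        reaction_time=(["subject", "trial"], reaction_time),
        correct=(["subject", "trial"], correct),
    ),
    coords=dict(
        # Each coordinate is 1-D and labels exactly one dimension.
        subject=["s01", "s02", "s03"],
        trial=np.arange(1, 5),
    ),
    attrs=dict(description="Hypothetical behavioral experiment."),
)

# Both variables share dimensions, so one label-based selection applies to both.
one_subject = ds.sel(subject="s02")

# Reductions are expressed by dimension name rather than axis number.
mean_rt = ds.reaction_time.mean("trial")
print(one_subject)
print(mean_rt)
```

Because every coordinate here labels a single dimension, the example needs no explanation of how the axes relate to the coordinates, which is the property asked for in the comment.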
@TomNicholas & @keewis WDYT of the example in the PR? (FWIW I find it much clearer than the current one; I did struggle to explain that one to myself...) |
yeah, we didn't provide any explanation here, and you'd have to have seen this kind of dataset before to understand the physical meaning. I wouldn't put an explanation in the description, though, we're really just trying to showcase the capabilities of A different thing would be the narrative documentation, where I think we just need to do better in explaining what we're looking at. Edit: I'll give feedback to the proposed example in the PR |
I will of course defer to all of you who have much more invested in this project than I do, but I find this statement troubling as far as the documentation goes:
The |
what I meant was that the example in the docstring should be chosen in a way that is least distracting: the focus should be on the thing that is documented, not the data that we're using for that purpose. If we need to explain the data, to me that means that we should choose a different example where that is not necessary. The narrative documentation should of course use real-world examples where possible, and explain the data it is using. So my suggestion was to use synthetic data in the docstring, with generic variable names like |
I don't mean to have a philosophical debate — I'm fine including both if that's a reasonable compromise. But I do think that we want to relate the concepts in the code to the concepts that people are familiar with. That means not having things that are contrary to familiar concepts. So having It also implies having labels like |
* Updates the example in the doc-string for the Dataset class to be clearer. The example in the doc-string of the `Dataset` class prior to this commit uses an example array whose size is `2 x 2 x 3` with the first two dimensions labeled `"x"` and `"y"` and the final dimension labeled `"time"`. This was confusing due to the fact that `"x"` and `"y"` are just arbitrary names for these axes and that no reason is given for the data to be organized in a `2x2x3` array instead of a `2x2` matrix. This commit clarifies the example. See issue #8970 for more information.
* Updates the documentation of the Dataset class to have clearer examples. These changes to the documentation bring it into alignment with the changes to the `Dataset` doc-string committed previously. See issue #8970 for more information.
* [pre-commit.ci] auto fixes from pre-commit.com hooks; for more information, see https://pre-commit.ci
* Adds dataset size reports to the output of the example in the Dataset docstring.
* Fixes the documentation errors in the previous commits.
* Fixes indentation errors in the docs for previous commits.
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Maximilian Roos <5635139+max-sixty@users.noreply.github.com>
What is your issue?
The example code in the documentation for the `Dataset` class (e.g., here) is probably clear to those who study Earth and Atmospheric Sciences, but it makes no sense to me. Here is the code:
To be clear, I understand each individual line of code, but I don't understand why there is both a latitude/longitude and an x/y in this example or how they are supposed to be related to each other (and there do not appear to be any additional details about this dataset's intended structure). Probably due to this lack of clarity I'm having a hard time wrapping my head around what the x/y coordinates and the lat/lon coordinates are supposed to demonstrate about xarray here, or how the x/y and lat/lon values are represented in the data structure. Are the x and y coordinates in a map projection of some kind? I have worked successfully with `Dataset`s in the past, but as someone who doesn't work with geospatial data, I find myself more confused about `Dataset`s after reading this example than before.
I suspect that all that is needed is a clear description of what these data are supposed to represent, how they are intended to be used, and how x/y and lat/lon are related. If someone can explain this to me, I'd be happy to submit a PR for the docs.
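How x/y and lat/lon relate in the data structure can be sketched as follows. This assumes the doc-string example used a `2 x 2 x 3` array with dimensions `"x"`, `"y"`, and `"time"` (as the commit message in this thread describes) and the same lon/lat values as the rewrite in the comments. It is an illustration, not the doc-string itself.

```python
import numpy as np
import pandas as pd
import xarray as xr

np.random.seed(0)
temperature = 15 + 8 * np.random.randn(2, 2, 3)
lon = [[-99.83, -99.32], [-99.79, -99.23]]
lat = [[42.25, 42.21], [42.63, 42.59]]
time = pd.date_range("2014-09-06", periods=3)

ds = xr.Dataset(
    data_vars=dict(temperature=(["x", "y", "time"], temperature)),
    coords=dict(lon=(["x", "y"], lon), lat=(["x", "y"], lat), time=time),
)

# "x" and "y" are dimensions without labels: they only name the two array axes.
print("x" in ds.coords, "y" in ds.coords)  # False False

# lon and lat are 2-D coordinates defined over those axes, one value per grid cell.
print(ds.lon.dims, ds.lat.dims)  # ('x', 'y') ('x', 'y')

# Selecting one grid cell by position yields its lon/lat and its time series.
cell = ds.isel(x=0, y=1)
print(float(cell.lon), float(cell.lat), cell.temperature.shape)  # -99.32 42.21 (3,)
```

In this reading, x and y are not positions in a map projection; they are plain array indices, and lat/lon are lookup tables attached to those indices.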