Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add nbytes to repr? #8690

Open
max-sixty opened this issue Jan 31, 2024 · 13 comments · Fixed by #8702 · May be fixed by #9078
Open

Add nbytes to repr? #8690

max-sixty opened this issue Jan 31, 2024 · 13 comments · Fixed by #8702 · May be fixed by #9078

Comments

@max-sixty
Copy link
Collaborator

Is your feature request related to a problem?

Would having the nbytes value in the Dataset repr be reasonable?

I frequently find myself logging this separately. For example:

<xarray.Dataset>
Dimensions:  (lat: 25, time: 2920, lon: 53)
Coordinates:
  * lat      (lat) float32 75.0 72.5 70.0 67.5 65.0 ... 25.0 22.5 20.0 17.5 15.0
  * lon      (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0
  * time     (time) datetime64[ns] 2013-01-01 ... 2014-12-31T18:00:00
Data variables:
-    air      (time, lat, lon) float32 dask.array<chunksize=(2920, 25, 53), meta=np.ndarray>
+    air      (time, lat, lon) float32 15MB dask.array<chunksize=(2920, 25, 53), meta=np.ndarray> 
Attributes:
    Conventions:  COARDS
    title:        4x daily NMC reanalysis (1948)
    description:  Data is from NMC initialized reanalysis\n(4x/day).  These a...
    platform:     Model
    references:   http://www.esrl.noaa.gov/psd/data/gridded/data.ncep.reanaly...

Describe the solution you'd like

No response

Describe alternatives you've considered

Status quo :)

Additional context

No response

@TomNicholas
Copy link
Contributor

I agree - I'm constantly checking this attribute. It would be nice to also quickly see the total nbytes of the whole dataset, but I'm not sure where that would go in the repr.

@max-sixty
Copy link
Collaborator Author

It would be nice to also quickly see the total nbytes of the whole dataset,

Yes very much agree. Maaaybe after Dimensions: (lat: 25, time: 2920, lon: 53); 16MB??

Or if there's some consensus about adding to data vars, we could start with that. Though arguably it's more useful to have for the whole object...

@norlandrhagen
Copy link

This would be a really nice addition!

@etienneschalk
Copy link
Contributor

Hello,

I can suggest the following:

  • Use "natural human units" (multiples of 1000 like "MB"), not binary units (1024) ;
  • Max unit is "YB" (Yotta, 10**24) (this is arbitrary; is there real-life use cases of required larger units ❓ ) ;
  • Put the nbytes representation in the "Data variables" section like suggested in the initial post ;
  • Do not print decimal part, as suggested in all posts above. Helps conciseness ;
  • For DataArray only representation and total size of a Dataset, put the rendered size into the header of the repr. There is room for short string content like a size. DataArrays representation already uses this place to put the name and dimensions. Datasets don't make use of this space yet and there is plenty of room.

If more customization capabilities are needed, eg choosing between "human" and "binary" prefixes, there exists a library under MIT license, humanize, that specializes into rendering various numbers, including file sizes. Some of its code could potentially be extracted and integrated into xarray.

Examples

<xarray.Dataset 10kB>
Dimensions:  (foo: 1200, bar: 111)
Coordinates:
  * foo      (foo) int64  10kB 0 1 2 3 4 5 6 ... 1194 1195 1196 1197 1198 1199
  * bar      (bar) int64 888B  0 1 2 3 4 5 6 7 ... 104 105 106 107 108 109 110
Data variables:
    *empty*


<xarray.DataArray 'foo' (foo: 1200) 10kB>
array([   0,    1,    2, ..., 1197, 1198, 1199])
Coordinates:
  * foo      (foo) int64  10kB 0 1 2 3 4 5 6 ... 1194 1195 1196 1197 1198 1199

@etienneschalk
Copy link
Contributor

etienneschalk commented Feb 4, 2024

Updated examples after update on the PR. The update is: the size in the header is outside of the <>-delimited header string, and prefixed with Size: :

<xarray.Dataset> Size: 3kB
Dimensions:  (foo: 1200, bar: 111)
Coordinates:
  * foo      (foo) int16   2kB 0 1 2 3 4 5 6 ... 1194 1195 1196 1197 1198 1199
  * bar      (bar) int16 222B  0 1 2 3 4 5 6 7 ... 104 105 106 107 108 109 110
Data variables:
    *empty*

<xarray.DataArray 'foo' (foo: 1200)> Size: 2kB
array([   0,    1,    2, ..., 1197, 1198, 1199], dtype=int16)
Coordinates:
  * foo      (foo) int16   2kB 0 1 2 3 4 5 6 ... 1194 1195 1196 1197 1198 1199

As there are many representations scattered accross the code, both in tests and doctests (800+ occurences when looking for the string <xarray. in the codebase`, I would like to get your feedback on this representation before updating all of them.

Summary of the current repr:

  • For both Dataset and DataArray, the header is appended with a Size: {size} string
  • There should be at most 3 digits for size as long as we do not exceed "YB" (Yotta, 10**24)
  • For the inline representation of a Variable of a Dataset, the size occupies a fixed 5-char width (as long as we do not exceed "YB" (Yotta, 10**24) ): Examples:
[999kB]
[  8B ]

This padding may be irrelevant, as it only preserves layout in some specific case (same dimension tuple width and same dtype width). It could make sense to (1) move the size before the dimension tuple and dtype and keeping its fixed width, or (2) keep its current position but remove the padding. (1) allows quick size comparison of variables by eye thanks to fixed width but puts focus on the size, while (2) is more minimalistic but maybe less readable for size comparison


(1)

<xarray.Dataset> Size: 3kB
Dimensions:  (foo: 1200, bar: 111)
Coordinates:
  * foo         2kB (foo) int16 0 1 2 3 4 5 6 ... 1194 1195 1196 1197 1198 1199
  * bar       222B  (bar) int16 0 1 2 3 4 5 6 7 ... 104 105 106 107 108 109 110
Data variables:
    *empty*

(2)

<xarray.Dataset> Size: 3kB
Dimensions:  (foo: 1200, bar: 111)
Coordinates:
  * foo      (foo) int16 2kB 0 1 2 3 4 5 6 ... 1194 1195 1196 1197 1198 1199
  * bar      (bar) int16 222B 0 1 2 3 4 5 6 7 ... 104 105 106 107 108 109 110
Data variables:
    *empty*

@dcherian
Copy link
Contributor

dcherian commented Feb 5, 2024

Just a quick comment: @max-sixty wrote pytest-accept for this kind of thing. It's pretty great :)

I prefer (2) because dimension names are usually what i look for first in the variable repr.

@djhoese
Copy link
Contributor

djhoese commented Feb 19, 2024

I really wish I would have read your comment @dcherian after getting hit with this change in my own CI and spending an hour tracking down each new difference (and only in 4 small example documents). Sphinx apparently really doesn't want to render docs in a consistent order...or maybe doctest isn't running the tests in the same order.

As someone who has never checked the raw bytes size of an array, I was surprised when I tracked down this change and saw so many people eager to have it. I guess that just shows how many use cases there are for something so "simple" (the repr).

@etienneschalk
Copy link
Contributor

Maybe we should add an option to opt-out? Or is it better to have a canonical repr?

@djhoese
Copy link
Contributor

djhoese commented Feb 19, 2024

Probably just one repr since it seems like enough people want this. If more people come here to complain then maybe revisit the idea?

@shoyer
Copy link
Member

shoyer commented Jun 3, 2024

I'd like to reopen this issue to discuss the repr. My main complaint is that the repr says something like "Size: 24B" but .size on xarray.DataArray refers to the number of elements, not the size in memory!

I also think this information is generally less relevant on individual coordinates/variables inside the repr, and would prefer not to include it there in favor of showing more data values (unless that data values are hidden, e.g., for a lazy array, in which case we could show memory usage instead).

So my suggestion is something like:

<xarray.Dataset nbytes=3k>
Dimensions:  (foo: 1200, bar: 111)
Coordinates:
  * foo      (foo) int16 0 1 2 3 4 5 6 ... 1194 1195 1196 1197 1198 1199
  * bar      (bar) int16 0 1 2 3 4 5 6 7 ... 104 105 106 107 108 109 110
Data variables:
    *empty*

@shoyer shoyer reopened this Jun 3, 2024
@max-sixty
Copy link
Collaborator Author

FWIW I find the size per variable quite helpful. I'm often working with datasets with very differently sized variables.

We should do the utilitarian thing, very possibly I'm less representative

My main complaint is that the repr says something like "Size: 24B" but .size on xarray.DataArray refers to the number of elements, not the size in memory!

We don't technically need the term Size there; it's self-explanatory if we have the "15GB", no objection to removing that. Though I also don't think we need nbytes for the same reason...

@etienneschalk
Copy link
Contributor

Hello,

About display of individual variables' nbytes

Two different feedbacks here:

I also think this information is generally less relevant on individual coordinates/variables inside the repr, and would prefer not to include it there in favor of showing more data values (unless that data values are hidden, e.g., for a lazy array, in which case we could show memory usage instead).

and

FWIW I find the size per variable quite helpful. I'm often working with datasets with very differently sized variables.

I can add mine: I also find the size per variable helpful

Could a global option control this preference maybe? I see on the Configuration doc page that other options already exist to control the repr. A display_size_per_variable option or display_variable_nbytes could be added to suit everyone's needs.

❔ What should be the default for display_variable_nbytes if such a global option is introduced?

➡️ I would suggest True to keep the existing behaviour.

❔ Regarding a different display behaviour according to the laziness of a variable, do you know a reliable attribute in xarray that can answer the question: is the variable lazy?

I had in mind that the current way was to check for the value of the chunks attribute, for Dask arrays at least

https://docs.xarray.dev/en/stable/generated/xarray.Dataset.chunks.html

Mapping from dimension names to block lengths for this dataset’s data, or None if the underlying data is not a dask array.

About the removal of the term Size

We don't technically need the term Size there; it's self-explanatory if we have the "15GB", no objection to removing that. Though I also don't think we need nbytes for the same reason...

I agree with that. For instance, dtypes are not prefixed, and the "Size: " prefix is absent from the variables' representations to gain space. We can assume that the suffix solely (B, kB, MB...) is a well-known hint that the number represents a memory footprint.

Proposal of new reprs

What do you think of this? ⬇️

For non-lazy arrays:

with xr.set_options(display_variable_nbytes=True):
    print(xds) # Existing behaviour, without the Size: prefix

<xarray.Dataset> 27B 
Dimensions:  (x: 3)
Coordinates:
  * x        (x) float64 24B 10.0 20.0 30.0
Data variables:
    myvar    (x) uint8 3B 11 22 33
    
with xr.set_options(display_variable_nbytes=False):
    print(xds) # Only top-level nbytes is represented

<xarray.Dataset> 27B
Dimensions:  (x: 3)
Coordinates:
  * x        (x) float64 10.0 20.0 30.0
Data variables:
    myvar    (x) uint8 11 22 33

For lazy arrays:

with xr.set_options(display_variable_nbytes=True):
    print(xds) # Existing behaviour, without the Size: prefix

<xarray.Dataset> 27B
Dimensions:  (x: 3)
Coordinates:
  * x        (x) float64 24B 10.0 20.0 30.0
Data variables:
    myvar    (x) uint8 3B dask.array<chunksize=(1,), meta=np.ndarray>
    
with xr.set_options(display_variable_nbytes=False):
    print(xds) # The `myvar` nbytes is still shown because it is a dask array.

<xarray.Dataset> 27B
Dimensions:  (x: 3)
Coordinates:
  * x        (x) float64 10.0 20.0 30.0
Data variables:
    myvar    (x) uint8 3B dask.array<chunksize=(1,), meta=np.ndarray>

❔ Should the nbytes be integrated into the "class name tag" ? <xarray.Dataset 27B> for Dataset and <xarray.DataArray (x: 3) 3B> (size last) or <xarray.DataArray 3B (x: 3)> (size first) for DataArray

In the examples I kept the nbytes out of the class name enclosed in chevrons (<xarray.Dataset> 27B and not <xarray.Dataset 27B>), as in the existing behaviour.

Also, I note that the fact that the nbytes is shown for a variable that is a dask array even when display_variable_nbytes=False could be a bit surprising. It would be less surprising if there was no global option in the first place. However, some might find the printing of nbytes per variable helpful even for non-lazy arrays and this would penalize them 🤔

@HCookie
Copy link

HCookie commented Jul 26, 2024

This nbytes representation can be quite useful,
I'm curious is there a reason as to why it is not included in the html repr?
While dask arrays show their size, pure numpy arrays do not, and both have no top level complete size representation.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment