Add nbytes to repr? #8690

Closed
max-sixty opened this issue Jan 31, 2024 · 9 comments · Fixed by #8702

Comments

@max-sixty
Collaborator

Is your feature request related to a problem?

Would having the nbytes value in the Dataset repr be reasonable?

I frequently find myself logging this separately. For example:

<xarray.Dataset>
Dimensions:  (lat: 25, time: 2920, lon: 53)
Coordinates:
  * lat      (lat) float32 75.0 72.5 70.0 67.5 65.0 ... 25.0 22.5 20.0 17.5 15.0
  * lon      (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0
  * time     (time) datetime64[ns] 2013-01-01 ... 2014-12-31T18:00:00
Data variables:
-    air      (time, lat, lon) float32 dask.array<chunksize=(2920, 25, 53), meta=np.ndarray>
+    air      (time, lat, lon) float32 15MB dask.array<chunksize=(2920, 25, 53), meta=np.ndarray> 
Attributes:
    Conventions:  COARDS
    title:        4x daily NMC reanalysis (1948)
    description:  Data is from NMC initialized reanalysis\n(4x/day).  These a...
    platform:     Model
    references:   http://www.esrl.noaa.gov/psd/data/gridded/data.ncep.reanaly...
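
As a rough sanity check on the 15MB figure in the proposed line, the size can be worked out from the dimensions and dtype shown in the repr. This is only a sketch using the standard library, not xarray itself; it assumes `float32` takes 4 bytes per element and that "MB" means 10**6 bytes.

```python
import math

# Dimension sizes copied from the repr above.
shape = {"time": 2920, "lat": 25, "lon": 53}
itemsize = 4  # bytes per float32 element

# Total bytes = number of elements * bytes per element.
nbytes = math.prod(shape.values()) * itemsize
print(nbytes)          # 15476000
print(nbytes / 10**6)  # 15.476, i.e. roughly 15MB in decimal units
```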

Describe the solution you'd like

No response

Describe alternatives you've considered

Status quo :)

Additional context

No response

@TomNicholas
Contributor

I agree - I'm constantly checking this attribute. It would be nice to also quickly see the total nbytes of the whole dataset, but I'm not sure where that would go in the repr.

@max-sixty
Collaborator Author

It would be nice to also quickly see the total nbytes of the whole dataset,

Yes very much agree. Maaaybe after Dimensions: (lat: 25, time: 2920, lon: 53); 16MB??

Or if there's some consensus about adding to data vars, we could start with that. Though arguably it's more useful to have for the whole object...

@norlandrhagen

This would be a really nice addition!

@etienneschalk
Contributor

Hello,

I can suggest the following:

  • Use "natural human units" (multiples of 1000 like "MB"), not binary units (multiples of 1024);
  • Max unit is "YB" (Yotta, 10**24) (this is arbitrary; are there real-life use cases that require larger units ❓);
  • Put the nbytes representation in the "Data variables" section, as suggested in the initial post;
  • Do not print the decimal part, as suggested in all posts above. This helps conciseness;
  • For the DataArray-only representation and the total size of a Dataset, put the rendered size into the header of the repr. There is room there for short string content like a size. The DataArray representation already uses this place for the name and dimensions. Datasets don't make use of this space yet, and there is plenty of room.

If more customization capabilities are needed, e.g. choosing between "human" and "binary" prefixes, there is an MIT-licensed library, humanize, that specializes in rendering various numbers, including file sizes. Some of its code could potentially be extracted and integrated into xarray.
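
A minimal sketch of the rendering rules listed above (decimal multiples of 1000, no decimal part, "YB" as the largest unit). The name `render_size` is hypothetical; this is not xarray's or humanize's code. Rounding half up to the nearest integer is an assumption, chosen so that 9600 bytes renders as 10kB.

```python
UNITS = ["B", "kB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]


def render_size(nbytes: int) -> str:
    """Render a non-negative byte count with a decimal (1000-based) unit.

    Hypothetical helper for illustration only.
    """
    last = len(UNITS) - 1
    for index, unit in enumerate(UNITS):
        divisor = 1000**index
        # Integer division rounded to the nearest whole unit (half up).
        rounded = (nbytes + divisor // 2) // divisor
        # Stop once the value fits in three digits, or at the largest unit.
        if rounded < 1000 or index == last:
            return f"{rounded}{unit}"
    raise AssertionError("unreachable: the last unit always returns")


print(render_size(888))         # 888B
print(render_size(9600))        # 10kB
print(render_size(15_476_000))  # 15MB
```

Dividing the original byte count by each power of 1000 directly, rather than repeatedly dividing an already rounded value, avoids compounding rounding errors across units.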

Examples

<xarray.Dataset 10kB>
Dimensions:  (foo: 1200, bar: 111)
Coordinates:
  * foo      (foo) int64  10kB 0 1 2 3 4 5 6 ... 1194 1195 1196 1197 1198 1199
  * bar      (bar) int64 888B  0 1 2 3 4 5 6 7 ... 104 105 106 107 108 109 110
Data variables:
    *empty*


<xarray.DataArray 'foo' (foo: 1200) 10kB>
array([   0,    1,    2, ..., 1197, 1198, 1199])
Coordinates:
  * foo      (foo) int64  10kB 0 1 2 3 4 5 6 ... 1194 1195 1196 1197 1198 1199

@etienneschalk
Contributor

etienneschalk commented Feb 4, 2024

Updated examples following a change to the PR: the size in the header now sits outside the <>-delimited header string and is prefixed with Size:

<xarray.Dataset> Size: 3kB
Dimensions:  (foo: 1200, bar: 111)
Coordinates:
  * foo      (foo) int16   2kB 0 1 2 3 4 5 6 ... 1194 1195 1196 1197 1198 1199
  * bar      (bar) int16 222B  0 1 2 3 4 5 6 7 ... 104 105 106 107 108 109 110
Data variables:
    *empty*

<xarray.DataArray 'foo' (foo: 1200)> Size: 2kB
array([   0,    1,    2, ..., 1197, 1198, 1199], dtype=int16)
Coordinates:
  * foo      (foo) int16   2kB 0 1 2 3 4 5 6 ... 1194 1195 1196 1197 1198 1199

As there are many representations scattered across the code, both in tests and doctests (800+ occurrences when searching for the string <xarray. in the codebase), I would like to get your feedback on this representation before updating all of them.

Summary of the current repr:

  • For both Dataset and DataArray, a Size: {size} string is appended to the header
  • There should be at most 3 digits for size as long as we do not exceed "YB" (Yotta, 10**24)
  • For the inline representation of a Variable of a Dataset, the size occupies a fixed 5-char width (as long as we do not exceed "YB" (Yotta, 10**24) ): Examples:
[999kB]
[  8B ]
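
The fixed 5-character layout in these two examples can be reproduced with ordinary format specifiers. This is only a sketch: `pad_size` is a hypothetical name, and the split into a right-aligned 3-character number and a left-aligned 2-character unit is inferred from the examples rather than taken from the PR.

```python
def pad_size(value: int, unit: str) -> str:
    """Pad a rendered size to 5 characters (hypothetical helper).

    The number is right-aligned in 3 characters and the unit is
    left-aligned in 2 characters, so a bare "B" gets a trailing space.
    """
    return f"{value:>3}{unit:<2}"


print(f"[{pad_size(999, 'kB')}]")  # [999kB]
print(f"[{pad_size(8, 'B')}]")     # [  8B ]
```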

This padding may be irrelevant, as it only preserves the layout in specific cases (same dimension tuple width and same dtype width). It could make sense to (1) move the size before the dimension tuple and dtype and keep its fixed width, or (2) keep its current position but remove the padding. (1) allows quick size comparison of variables by eye thanks to the fixed width, but puts the focus on the size, while (2) is more minimalistic but maybe less readable for size comparison.


(1)

<xarray.Dataset> Size: 3kB
Dimensions:  (foo: 1200, bar: 111)
Coordinates:
  * foo         2kB (foo) int16 0 1 2 3 4 5 6 ... 1194 1195 1196 1197 1198 1199
  * bar       222B  (bar) int16 0 1 2 3 4 5 6 7 ... 104 105 106 107 108 109 110
Data variables:
    *empty*

(2)

<xarray.Dataset> Size: 3kB
Dimensions:  (foo: 1200, bar: 111)
Coordinates:
  * foo      (foo) int16 2kB 0 1 2 3 4 5 6 ... 1194 1195 1196 1197 1198 1199
  * bar      (bar) int16 222B 0 1 2 3 4 5 6 7 ... 104 105 106 107 108 109 110
Data variables:
    *empty*

@dcherian
Contributor

dcherian commented Feb 5, 2024

Just a quick comment: @max-sixty wrote pytest-accept for this kind of thing. It's pretty great :)

I prefer (2) because dimension names are usually what I look for first in the variable repr.

@djhoese
Contributor

djhoese commented Feb 19, 2024

I really wish I would have read your comment @dcherian after getting hit with this change in my own CI and spending an hour tracking down each new difference (and only in 4 small example documents). Sphinx apparently really doesn't want to render docs in a consistent order...or maybe doctest isn't running the tests in the same order.

As someone who has never checked the raw bytes size of an array, I was surprised when I tracked down this change and saw so many people eager to have it. I guess that just shows how many use cases there are for something so "simple" (the repr).

@etienneschalk
Contributor

Maybe we should add an option to opt-out? Or is it better to have a canonical repr?

@djhoese
Contributor

djhoese commented Feb 19, 2024

Probably just one repr since it seems like enough people want this. If more people come here to complain then maybe revisit the idea?
