Improve fetching for large datasets #616

loichuder · 2021-04-20T15:09:43Z

Having played around with large datasets (~10M points in total, with a slice of ~100k), I find that the long data fetch breaks the flow...

Let's make use of this issue to gather possible improvements:

Allow the user to cancel and retry requests (can be done with axios). Related: Allow cancelling fetching of dataset values #635; Allow user to retry after cancelling a fetch #640; Add suspense/error boundaries to containers and mock slow NXdata groups #643; Fix retrying to fetch 2D dataset from Heatmap vis #647; Fix cancelling/retrying in NeXus visualizations #652; Move base ProviderApi in its own file and rename providers' API files #657; Abstract values fetching out of mapped vis components #658; Fix fetching waterfall after retrying a NeXus vis #659
Make use of binary instead of JSON (benefits: 1. smaller payload, 2. no need to stringify/parse JSON on the server/client, 3. no need to flatten the array on the client). Related: Fetch numeric arrays with binary in h5grove #817
Implement specific strategies for fetching of large datasets (e.g. use subsampling; warn the user that the fetching can take a long time and ask them if they want to proceed anyway...)
Request domain separately (not exactly related to the data fetching but avoiding computing it in the front-end could be a nice improvement for large datasets)
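
To make the subsampling idea from the list above more concrete, here is a minimal sketch in TypeScript. The `subsample` helper and its `maxPoints` parameter are hypothetical names used for illustration, not part of h5web; the sketch assumes a simple strided subsampling (keep every n-th value), which is only one of several possible strategies.

```typescript
// Hypothetical helper (not part of h5web): keep every `step`-th value so that
// at most `maxPoints` values remain. Returns the step so the UI can report it.
function subsample(
  values: Float64Array,
  maxPoints: number
): { data: Float64Array; step: number } {
  if (maxPoints <= 0) {
    throw new RangeError('maxPoints must be positive');
  }

  // A step of 1 means the dataset is small enough to be kept as is.
  const step = Math.max(1, Math.ceil(values.length / maxPoints));
  const data = new Float64Array(Math.ceil(values.length / step));

  for (let i = 0, j = 0; i < values.length; i += step, j++) {
    data[j] = values[i];
  }

  return { data, step };
}
```

Returning the step alongside the data would let the UI tell the user which subsampling rate was applied.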


axelboc · 2021-04-26T12:34:10Z

I'd add that, when performing long downloads and computations, the UI should:

be more informative (i.e. progress status, subsampling rate, etc.)
remain responsive and allow cancelling slow computations (i.e. not just requests)

loichuder · 2021-04-28T07:57:12Z

Our discussion on #632 also gave me an idea: we could make the flattening operation more consistent by encapsulating it in the get/useValue method.

Edit: This was done in #661

It then becomes relevant to this issue, as it would be a stepping stone towards requesting the flattening on the back-end, thus avoiding another expensive computation in h5web.

axelboc · 2021-04-28T08:30:07Z

In the providers' getValue() methods? Yeah, totally 👍

axelboc · 2021-04-28T12:55:08Z

#635 implements cancellation on the front-end, but it doesn't resolve crashes on Bosquet when attempting to fetch (and cancel the fetch of) extremely large datasets.

axelboc · 2021-04-29T10:36:58Z

#640 implements retrying after cancelling (including evicting cancellation errors from the value store's cache).
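
As an illustration of why cancellation errors have to be evicted before a retry can work, here is a minimal sketch of a cache that stores settled outcomes per key. Everything in it (`ValueStore`, `CancellationError`, `evictCancelled`) is a hypothetical name chosen for this example and does not reflect the actual implementation in #640.

```typescript
// Hypothetical error type marking a fetch that the user cancelled.
class CancellationError extends Error {
  constructor() {
    super('Request cancelled');
    this.name = 'CancellationError';
  }
}

type Outcome<T> =
  | { status: 'ok'; value: T }
  | { status: 'error'; error: Error };

// Illustrative cache keyed by dataset path (not h5web's actual value store).
class ValueStore<T> {
  private readonly cache = new Map<string, Outcome<T>>();

  resolve(key: string, value: T): void {
    this.cache.set(key, { status: 'ok', value });
  }

  reject(key: string, error: Error): void {
    this.cache.set(key, { status: 'error', error });
  }

  has(key: string): boolean {
    return this.cache.has(key);
  }

  // Drop cancelled entries so that the next read triggers a fresh fetch
  // instead of replaying the cached cancellation error.
  evictCancelled(): number {
    let evicted = 0;
    this.cache.forEach((outcome, key) => {
      if (outcome.status === 'error' && outcome.error.name === 'CancellationError') {
        this.cache.delete(key);
        evicted++;
      }
    });
    return evicted;
  }
}
```

Other errors (e.g. a server failure) stay cached in this sketch; only cancellations are treated as "not attempted yet".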

jreadey · 2021-05-30T20:54:03Z

Just curious - for HSDS, have you tried using HTTP Compression? That should reduce the payload size considerably.

loichuder · 2021-05-31T09:47:40Z

Unfortunately, the impact will be limited as most of our heavy datasets are not compatible with HSDS due to HDFGroup/hsds#76 😕

But this is something that we still need to try!

axelboc · 2021-10-26T06:56:45Z

Binary is now used with H5Grove when getting dataset values: #817
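
To illustrate the benefits listed in the issue description (no JSON parsing, no flattening), here is a small sketch contrasting the two decoding paths on the client. The helper names are hypothetical, and the binary path assumes a payload of float64 values in the platform's byte order; the actual format returned by h5grove depends on the dataset's dtype.

```typescript
// JSON path: parse the text, then flatten the nested 2D array.
function valuesFromJson(text: string): number[] {
  const nested = JSON.parse(text) as number[][];
  const flat: number[] = [];
  for (let i = 0; i < nested.length; i++) {
    for (let j = 0; j < nested[i].length; j++) {
      flat.push(nested[i][j]);
    }
  }
  return flat;
}

// Binary path: view the received bytes directly as a typed array.
// Assumption: float64 values, so the byte length must be a multiple of 8.
function valuesFromBinary(buffer: ArrayBuffer): Float64Array {
  return new Float64Array(buffer);
}
```

The binary path creates a view over the buffer without copying it, so it involves no per-element work at all.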

loichuder · 2021-11-09T15:25:41Z

The auto-scale-off feature in the LineVis that forces us to fetch the whole dataset can be a real limiter for huge datasets (silx-kit/jupyterlab-h5web#71).

Maybe it is time to review it? We could:

Disable it somehow for huge datasets, and give an indication to the user?
Request the domain separately (as proposed originally in Improve fetching for large datasets #616 (comment)) to avoid the need to compute it in the front-end (and therefore the need to have the full dataset values)
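
For context on why the domain requires the full dataset values, here is a minimal sketch of a front-end domain computation. The `computeDomain` helper is a hypothetical name for this example and assumes that non-finite values (NaN, ±Infinity) are ignored; it has to visit every value, which is the cost that requesting the domain from the back-end would avoid.

```typescript
type Domain = [number, number];

// Hypothetical helper: single pass over all values to find [min, max].
function computeDomain(values: ArrayLike<number>): Domain | undefined {
  let min = Infinity;
  let max = -Infinity;

  for (let i = 0; i < values.length; i++) {
    const v = values[i];
    if (!isFinite(v)) {
      continue; // skip NaN and ±Infinity
    }
    if (v < min) min = v;
    if (v > max) max = v;
  }

  // No finite value found (empty or all-NaN input): no domain.
  return min <= max ? [min, max] : undefined;
}
```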

loichuder · 2021-11-30T15:17:42Z

The auto-scale-off feature in the LineVis that forces us to fetch the whole dataset can be a real limiter for huge datasets (silx-kit/jupyterlab-h5web#71).

#877 implemented an intermediate solution:

When the auto-scale is on, only the relevant slice is fetched
Auto-scale is no longer persisted and is activated by default. That means that, by default, only slices are fetched.
Turning the auto-scale off fetches the full dataset. For now, it is up to the user to not trigger this for huge datasets.
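
The behaviour described above can be sketched as a small decision function. This is not the code from #877: `getSelection` and the `DimMapping` type are hypothetical, and the comma-separated selection string (with `:` for the plotted axis) is an assumed format used only for illustration.

```typescript
// Hypothetical mapping: a number is the index of the current slice along that
// dimension, 'x' marks the dimension plotted on the x axis.
type DimMapping = (number | 'x')[];

// Returns a selection for the current slice when auto-scale is on,
// or undefined to signal that the whole dataset must be fetched.
function getSelection(autoScale: boolean, dimMapping: DimMapping): string | undefined {
  if (!autoScale) {
    // The domain of the full dataset is needed, so everything is fetched.
    return undefined;
  }

  return dimMapping.map((dim) => (dim === 'x' ? ':' : String(dim))).join(',');
}
```

For instance, with auto-scale on and the mapping `[5, 'x']`, the sketch returns the selection `5,:`, i.e. only row 5 is requested.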

headtr1ck · 2023-01-16T10:43:40Z

It seems that h5wasm now (as of v0.4.8) supports lazy loading of arrays.
Is that beneficial for this issue as well (or in general for loading files >2GB)?
I'm not really familiar with the internal workings though, so excuse me if this has nothing to do with this :)

For reference, see this discussion: usnistgov/h5wasm#40

loichuder · 2023-01-16T15:29:19Z

Sure, that's also relevant for large datasets.

For h5wasm, we have a more specific issue tracking this at #1264

domna · 2023-03-07T12:35:46Z

Is it also planned to have streaming binary support for hsds?
I could also try to implement it myself in the hsds api but I'm not a typescript expert, so I could use some guidance.

I experienced problems with this while I was experimenting with storing and loading large datasets via hsds and I use h5web in a simple hsds directory browser to view the stored data. However, the hsds server gets stuck on large datasets because h5web requests the data in json format.

loichuder · 2023-03-08T10:37:09Z

Is it also planned to have streaming binary support for hsds? I could also try to implement it myself in the hsds api but I'm not a typescript expert, so I could use some guidance.

To be honest, we don't really plan to improve the HSDS part since we mostly use h5grove and h5wasm. But you are welcome to contribute and we will be happy to help you do so.

If you have some working code, feel free to open a draft PR to discuss. If something blocks you, you can drop us a line at h5web@esrf.fr.

loichuder added the "epic" label (Issue that will need to be split up later on) Apr 20, 2021

axelboc mentioned this issue Apr 28, 2021

Allow cancelling fetching of dataset values #635

Merged

loichuder mentioned this issue May 17, 2021

Values are now flattened in providers #661

Merged

loichuder mentioned this issue Nov 9, 2021

Crashes reading a large file silx-kit/jupyterlab-h5web#71

Closed

This was referenced Nov 26, 2021

Fetch only slice for LineVis of huge datasets #874

Closed

Fetch whole dataset only when autoScale is off (and slices otherwise) #877

Merged

axelboc mentioned this issue Sep 12, 2023

Upgrade dependencies #1490

Merged
