
Support async key retrieval #7

Open
jhamman opened this issue Feb 25, 2020 · 9 comments · Fixed by #12

Comments

@jhamman
Contributor

jhamman commented Feb 25, 2020

We are currently using FastAPI as xpublish's web framework / api engine. FastAPI supports async out of the box (details here: https://fastapi.tiangolo.com/async/).

Some initial applications indicate that we're not getting the expected async behavior. I suspect this is somehow related to how we're using dask to fetch chunks of data. Here's what we have now:

GET chunk endpoint:

https://github.com/jhamman/xpublish/blob/044e9b7b07202f57302a9368b16c2db4cddaa79a/xpublish/rest.py#L140-L143

Which calls the get_key method:

https://github.com/jhamman/xpublish/blob/044e9b7b07202f57302a9368b16c2db4cddaa79a/xpublish/rest.py#L93-L105

Within get_data_chunk, we call compute() on individual chunks:

https://github.com/jhamman/xpublish/blob/044e9b7b07202f57302a9368b16c2db4cddaa79a/xpublish/rest.py#L254-L255

Is this the best way to do this? Do we need to modify how dask's scheduler sees these tasks, or configure it to use async?
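One option, sketched below with standard-library pieces only: keep the endpoint async but push the blocking compute() call onto a thread pool with run_in_executor, so the event loop stays free while a chunk materializes. blocking_compute here is a stand-in for the real chunk.compute() call, not xpublish's actual code.

```python
import asyncio
import time

def blocking_compute(key):
    # Stand-in for chunk.compute(): a synchronous call that would
    # otherwise block the event loop for its full duration.
    time.sleep(0.2)
    return f"data-for-{key}"

async def get_chunk(key):
    # Run the blocking call in the default thread-pool executor so the
    # event loop can keep serving other requests concurrently.
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(None, blocking_compute, key)

async def main():
    start = time.perf_counter()
    results = await asyncio.gather(*(get_chunk(f"/air/0.0.{i}") for i in range(5)))
    elapsed = time.perf_counter() - start
    return results, elapsed

results, elapsed = asyncio.run(main())
# Five 0.2 s computes overlap in threads, so wall time is well under the 1 s serial total.
print(results[0], round(elapsed, 2))
```

In a FastAPI endpoint the same pattern would be `await loop.run_in_executor(None, get_data_chunk, ...)`; whether that is enough, or the dask scheduler itself still serializes the work, is the open question above.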

@jhamman jhamman mentioned this issue Feb 25, 2020
@andersy005
Contributor

@TomAugspurger and I are looking into this issue. We changed get_data_chunk to an awaited call without any success. With Tom's help, I will keep looking into this for the next few days.

@jhamman
Contributor Author

jhamman commented Feb 27, 2020

Thanks @andersy005, much appreciated! I've been thinking it may be worth constructing a new test case for this. The test case we have now has very small chunks and I'm not sure we're effectively exercising the server-side cluster to an extent that would fully illuminate the issue here.

@TomAugspurger

One question I had: what level should the parallelism be happening at? Will it just be at the request level, or will each request be getting results in parallel?

IIUC, with a typical async webserver, you're serving many requests simultaneously on a single thread, since most of the time you're just waiting around on IO.

In this case, a user filters to some subset which is eventually mapped to some set of keys like ['/air/0.0.0', '/air/0.0.1', ...] and they make many requests. So we would hope that those n requests are served in parallel. As a small test, we tried something roughly like the following.

import requests
import dask

# requests needs an explicit scheme on the URL
responses = [dask.delayed(requests.get, pure=False)("http://localhost:9000/air/0.0.0") for i in range(100)]
dask.compute(responses)

Anderson can correct me if I'm wrong, but I think we didn't observe any parallelism there.
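A thread-pool harness is another way to run this check, since dask.delayed itself adds a scheduler layer to the measurement. The sketch below uses a sleep stub in place of requests.get so it is self-contained; against a real server you would swap the stub for an actual HTTP call to the hypothetical localhost:9000 endpoints and compare wall time to single-request latency.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fetch(url):
    # Stand-in for requests.get(url).status_code; replace with a real
    # HTTP call when pointing at a running xpublish server.
    time.sleep(0.1)
    return 200

urls = [f"http://localhost:9000/air/0.0.{i}" for i in range(20)]

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=20) as pool:
    statuses = list(pool.map(fetch, urls))
elapsed = time.perf_counter() - start

# If the server handles requests concurrently, wall time stays near one
# request's latency rather than the 2 s serial total.
print(len(statuses), round(elapsed, 2))
```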

@jhamman
Contributor Author

jhamman commented Feb 27, 2020

@TomAugspurger - I think we're on the same page here.

Dask question: Can two threads call chunk_data.compute() in parallel? And will the cluster block on the first before moving on to the second? I suspect the answers are yes and yes. In which case we aren't going to get much parallelization since the cluster will only be computing one chunk at a time.

I was trying to avoid adding any dask-distributed functionality in xpublish but it seems we may need to support something like client.get(..., asynchronous=True). This is a part of the distributed api that I am entirely unfamiliar with so I'm shooting in the dark...
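Whether dask's scheduler really serializes concurrent compute() calls is exactly the open question, but the timing signature of "yes and yes" is easy to model. The toy below stands a threading.Lock in for a scheduler that admits one graph at a time; it is an illustration of the suspected behavior, not dask's actual implementation.

```python
import threading
import time

scheduler_lock = threading.Lock()

def compute_chunk(key):
    # Model a scheduler that runs one graph at a time: every compute()
    # call holds the lock for the full duration of the work.
    with scheduler_lock:
        time.sleep(0.1)
        return f"chunk-{key}"

results = []
start = time.perf_counter()
threads = [
    threading.Thread(target=lambda k=k: results.append(compute_chunk(k)))
    for k in range(4)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
elapsed = time.perf_counter() - start

# Four threads submit concurrently, but the lock serializes them:
# wall time is ~0.4 s, not ~0.1 s.
print(len(results), round(elapsed, 2))
```

If real measurements against the server show this serial signature, that would support moving to an asynchronous distributed client.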

@TomAugspurger

TomAugspurger commented Feb 29, 2020

I suspect the answers are yes and yes.

I'm not sure either, but I'd also guess yes and yes.

async clients seem like they might help.

I did have one thing that's confusing me. Why do we ever get a Dask array back in get_data_chunk? Shouldn't that always be a chunk for a single key, so should it just be an ndarray?

@andersy005
Contributor

While looking at the dashboard for the server, I am noticing that only one worker is active. It is likely that #12 didn't fully address the asynchronous fetching:

(Dask dashboard screenshots from Mar 3, 2020, showing a single active worker.)

@andersy005
Contributor

Not sure if this is a sign that we still need an async client on the server side.

@andersy005
Contributor

I did have one thing that's confusing me. Why do we ever get a Dask array back in get_data_chunk? Shouldn't that always be a chunk for a single key, so should it just be an ndarray?

I believe we are passing an entire variable array to get_data_chunk(), and this could be a numpy or dask array:

https://github.com/jhamman/xpublish/blob/d59221e45d2a46bb922e4d2dac0180bab7c53294/xpublish/rest.py#L101-L104
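So the helper has to handle both cases. A sketch of how that branch could look is below; da.blocks is real dask.array API for addressing a single block, but the function shape here is illustrative, not xpublish's actual code.

```python
def get_data_chunk(da, chunk_id):
    """Return the in-memory block of ``da`` addressed by ``chunk_id``
    (e.g. "0.0.1"), whether ``da`` is a dask or a numpy array.

    Hypothetical helper mirroring xpublish's get_data_chunk; the real
    signature may differ.
    """
    ikeys = tuple(int(i) for i in chunk_id.split("."))
    if hasattr(da, "blocks"):
        # dask array: select one block, then compute only that block
        return da.blocks[ikeys].compute()
    # numpy array: already in memory, so it forms a single chunk and
    # only an all-zeros chunk id is valid
    if any(i != 0 for i in ikeys):
        raise ValueError(f"invalid chunk id {chunk_id} for an in-memory array")
    return da
```

Computing a single block this way avoids materializing the whole variable per request, which matters once the concurrency questions above are sorted out.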

@jhamman jhamman reopened this Mar 11, 2020
@jhamman
Contributor Author

jhamman commented Mar 11, 2020
