Experimental asyncio support #2015

Merged: 103 commits into ray-project:master, Dec 7, 2018

Conversation

@suquark (Member) commented May 8, 2018

What do these changes do?

This is a prototype implementation for #493, providing an awaitable interface for ray.wait and Ray's ObjectID.

As a prototype, this code is meant to be revised later.

How do these changes work?

  1. AsyncPlasmaClient is implemented to override the original pyarrow.plasma.PlasmaClient. pyarrow.plasma.PlasmaClient is created by pyarrow.plasma.connect and is attached to ray.worker.global_worker to handle basic Ray functions. It also creates an interface for wrapping Ray's ObjectID.
  2. AsyncPlasmaSocket is created for async socket messaging with the PlasmaStore and PlasmaManager. It is the core of the async support. pyarrow.plasma.PlasmaClient does not use an event loop and only creates a single socket connection, which is why the original Ray does not support many async functions. AsyncPlasmaSocket uses the asyncio event loop and can create multiple socket connections to the PlasmaManager (see the sketch after this list).
  3. plasma.fbs under the format directory needs to be compiled with FlatBuffers ahead of time.
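
For illustration, here is a minimal sketch of the idea behind AsyncPlasmaSocket: one request/response exchange over an asyncio Unix-socket connection. The framing shown (an 8-byte type plus an 8-byte length header) is hypothetical; the real implementation speaks the FlatBuffers protocol defined in plasma.fbs.

import asyncio
import struct

async def plasma_request(socket_path, message_type, payload):
    # Open a dedicated connection; AsyncPlasmaSocket keeps several of
    # these so that multiple requests can be in flight at once.
    reader, writer = await asyncio.open_unix_connection(socket_path)
    try:
        # Hypothetical framing: 8-byte message type + 8-byte body length.
        writer.write(struct.pack("<qq", message_type, len(payload)))
        writer.write(payload)
        await writer.drain()
        reply_type, length = struct.unpack("<qq", await reader.readexactly(16))
        body = await reader.readexactly(length)
        return reply_type, body
    finally:
        writer.close()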

Related issue number

#493

cc @mitar

@suquark (Member, Author) commented May 8, 2018

Here's a piece of test code. It shows how this PR works. I will integrate it into the codebase later.

import asyncio
import ray

from ray.plasma.plasma_client import AsyncPlasmaClient

def cvt(s):
    # Convert Ray ObjectIDs into pyarrow.plasma ObjectIDs.
    return [ray.pyarrow.plasma.ObjectID(t.id()) for t in s]


address_dict = ray.init()

async def test_wait(client: AsyncPlasmaClient):
    a = ray.put(2342342)

    @ray.remote
    def delay():
        import time
        time.sleep(10)
        return 'ready'

    b = delay.remote()
    print(ray.wait([a, b]))
    print(await client.wait(cvt([a, b])))
    await asyncio.sleep(5)
    print(await client.wait(cvt([a, b])))        
    print(ray.wait([a, b]))
    await asyncio.sleep(10)
    print(await client.wait(cvt([a, b])))        
    print(ray.wait([a, b]))
    print(await client.wait(cvt([b])))
    print(ray.wait([b]))

async def test_await_get(client):
    @ray.remote
    def delay():
        import time
        time.sleep(10)
        return 'ready'

    k = delay.remote()
    k = client.wrap_objectid_with_future(k)
    result = await k
    print(result)
    
async def test_client(client: AsyncPlasmaClient):
    await client.connect()
    print("store_capacity = %d" % client.store_capacity)
    await test_wait(client)
    await test_await_get(client)

object_store_address = address_dict['object_store_addresses'][0]

client = AsyncPlasmaClient(
    store_socket_name=object_store_address.name,
    manager_socket_name=object_store_address.manager_name)

loop = asyncio.get_event_loop()

try:
    loop.run_until_complete(test_client(client))
except KeyboardInterrupt:
    client.disconnect()

@suquark (Member, Author) commented May 9, 2018

There are still some problems remaining to be solved.

  1. pyarrow.plasma.PlasmaClient is part of Arrow. Currently I use AsyncPlasmaClient as a wrapper, so it is not very efficient (for example, it falls back to wait in places). A better way would be to change Arrow's code.
  2. Ray's ObjectID cannot be used as a base class, so currently I can only create a new class that contains an ObjectID.
  3. Currently the awaitable ray.put is implemented by making use of wait. pyarrow.plasma.PlasmaClient's get is complex and there is no easy way to create an async version. The easiest way may still be to change Arrow's code.

@mitar (Member) commented May 9, 2018

Looking at this code, it seems like a lot of it is just copied over to the new classes unmodified?

Is that because inheritance does not work well?

@@ -0,0 +1,331 @@
// Licensed to the Apache Software Foundation (ASF) under one

Collaborator: This file is not needed, right? (same with python/ray/plasma/format/__init__.py)

@suquark (Member, Author) commented May 10, 2018

This takes a few words to clarify.

Currently, Ray has its own plasma manager (src/plasma/plasma_manager.cc) and shares some code and formats with Arrow; src/plasma/format/plasma.fbs is one of them.

However, Ray does not implement its own plasma client. Instead, it uses Arrow's client, which uses Arrow's plasma.fbs.

Now, Ray's plasma manager uses an event loop (so it's async) but Arrow's plasma client doesn't, so I have to override parts of Arrow's plasma client. To do that, Arrow's plasma.fbs is copied over from Arrow's repo.

Arrow's plasma.fbs is almost identical to Ray's, but there are still small differences between them. If we want to keep only one copy of plasma.fbs, a better way may be to abandon Arrow's plasma client (by copying the Arrow plasma client's code back into Ray and adding event loops). However, this would permanently change Ray's behavior, making Ray totally async. A workaround may be to keep two different plasma client APIs, one async and one blocking.

@robertnishihara @mitar

@robertnishihara (Collaborator)

Thanks a lot for the PR. I'm hoping to try it out this weekend.

This seems like really useful functionality to have. It also seems like it belongs more in Arrow than in Ray. E.g., as the plasma client API changes we'll want to make sure this code gets updated/tested.

@mitar (Member) commented May 12, 2018

How hard is it to turn async code into blocking code? Would it be reasonable to have the underlying implementation be fully async, and then just expose a blocking API for those who prefer that?

@robertnishihara (Collaborator)

The challenge with making async the default implementation is that it requires some sort of event loop, and users who just want to use the blocking version may not have an event loop on hand.

We can expose non-blocking calls along with a socket for receiving notifications from the store. We have something like this that is used internally by the local scheduler and object manager (both of which have an event loop).

@mitar (Member) commented May 12, 2018

The challenge with making async the default implementation is that it requires some sort of event loop, and users who just want to use the blocking version may not have an event loop on hand.

So, yeah. A blocking implementation would get a loop, make the call, wait in the loop for the call to finish (blocking), then end the loop, ending the blocking call, and return the value.

For example:

import asyncio
import aiohttp

async def fetch(url):
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as response:
            return await response.text()

def block_fetch(url):
    loop = asyncio.get_event_loop()
    return loop.run_until_complete(fetch(url))

print(block_fetch('https://common.tnode.com'))

@suquark (Member, Author) commented May 14, 2018

I think we have to make some choices:

Options group A:

  1. Change Arrow (change its plasma client).
  2. Change Ray (borrow the plasma client from Arrow and change it).

Options group B:

  1. Change the C++ code: implement event loops in C++ and export async callbacks to Python via Cython (harder to integrate with asyncio; higher performance).
  2. Change the C++ code (keeping the parts without I/O) and write the I/O-related parts in Python with asyncio's event loop (normal performance).
  3. Just change the Python code (may hurt performance, but easier to implement).

Options group C:

  1. Keep both async and blocking implementations (higher performance, more code).
  2. Make everything async; create an event loop and wait on it for every blocking API call.
  3. Make everything async; keep one event loop per worker that is used by every blocking API call.

Currently I prefer A2, B2, and C3.

@robertnishihara (Collaborator)

For group A, modifying Arrow makes a lot more sense to me: since Plasma is part of Arrow, it makes sense to do all Plasma development there. Otherwise, updates to Plasma will break the async client, so it's important to have the client tested in Arrow's CI.

For group B, I think it makes sense to do as much as possible in C++. This will make it easier to wrap from other languages like Java. It'd be ok to do something quick in Python using the existing C++ API, but that doesn't feel like a long-term solution.

For group C, creating an event loop for every blocking call sounds very heavyweight to me, but I could be completely wrong about that, so it's a question of performance. However, if we want to use an underlying async implementation, then that would require the async implementation to be in C++.

@mitar (Member) commented May 14, 2018

creating an event loop for every blocking call sounds very heavyweight to me, but I could be completely wrong about that, so it's a question of performance

I do not think the code above creates an event loop every time; it just reuses the "main" one. (There is also new_event_loop.) From the documentation:

The default policy defines context as the current thread, and manages an event loop per thread that interacts with asyncio. If the current thread doesn’t already have an event loop associated with it, the default policy’s get_event_loop() method creates one when called from the main thread, but raises RuntimeError otherwise.

Not sure how we would benchmark this code to compare a regular blocking call with a non-blocking one.

@mitar (Member) commented May 14, 2018

I tried:

import asyncio
import time

def block_sleep():
    time.sleep(1)

async def sleep():
    block_sleep()

def loop_sleep():
    # Wrap the coroutine in a blocking call on the thread's event loop.
    loop = asyncio.get_event_loop()
    return loop.run_until_complete(sleep())

print("start blocking")
results = []
for i in range(100):
    before = time.perf_counter()
    block_sleep()
    after = time.perf_counter()
    results.append(after - before)
print("end blocking", sum(results) / len(results))

print("start async")
results = []
for i in range(100):
    before = time.perf_counter()
    loop_sleep()
    after = time.perf_counter()
    results.append(after - before)
print("end async", sum(results) / len(results))

Results:

start blocking
end blocking 1.001045424739932
start async
end async 1.0013151963998825

This does not look like a big difference?

@suquark (Member, Author) commented Jun 2, 2018

Currently, I am implementing a new async plasma client. Here are some ideas:

C++ Part of Arrow

  1. Every client has two socket pools: one for the plasma_manager and another for the plasma_store. Each pool has a maximum size.
  2. When a client is asked to perform a Ray task (put, get, transfer, etc.), a C++ coroutine-like object (we call it a PlasmaCoroutine) will be created.
  3. When a PlasmaCoroutine needs a socket to communicate with the plasma_manager, a new socket connection will be created and added to the pool. If the pool has reached its maximum size, the PlasmaCoroutine will be paused. The same applies to the plasma_store.
  4. Every client maintains a Map<int socket_fd, PlasmaCoroutine task>, where socket_fd is the socket the PlasmaCoroutine is currently pending on. All socket_fds are watched by epoll.
  5. Every client also maintains a queue for those PlasmaCoroutines that have not yet been allocated sockets.
  6. When a PlasmaCoroutine is finished, a finished flag will be set. There will also be flags for exceptions.
  7. There will be a ray_poll function. When called, it will use epoll to get all socket_fds that are ready and resume the related PlasmaCoroutines. Then ray_poll will return the finished PlasmaCoroutines' results.

Python Part of Arrow

  1. An asyncio-compatible selector based on ray_poll will be implemented (a rough sketch follows below).
  2. Each client will have an asyncio-based event loop equipped with that selector.
  3. All Ray tasks will be added to the event loop so we can poll them.

So then we can have asyncio-friendly async Ray tasks.
The main problem is that C++ does not support coroutines well (until C++20), so it takes a lot of hard work to turn the original functions into coroutine-like objects.
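
To make the Python part concrete, here is a minimal sketch of the glue between asyncio futures and the proposed ray_poll function. Both ray_poll and its return format are hypothetical here, and a real implementation would plug the epoll-backed selector into the event loop rather than re-scheduling a poll callback as below.

import asyncio

class RayPollBridge:
    def __init__(self, client, loop):
        self.client = client
        self.loop = loop
        self.futures = {}  # PlasmaCoroutine id -> asyncio.Future

    def submit(self, coroutine_id):
        # One asyncio future per pending PlasmaCoroutine; awaiting it
        # suspends the caller until ray_poll reports the result.
        future = self.loop.create_future()
        self.futures[coroutine_id] = future
        return future

    def poll_once(self):
        # Assumed: ray_poll resumes every PlasmaCoroutine whose socket_fd
        # is ready and returns (coroutine_id, result) pairs for the
        # finished ones.
        for coroutine_id, result in self.client.ray_poll(timeout=0):
            self.futures.pop(coroutine_id).set_result(result)
        # Keep polling while the event loop runs (busy-polling only to
        # keep the sketch short).
        self.loop.call_soon(self.poll_once)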

@robertnishihara (Collaborator)

@suquark the ray.wait implementation is moving into the Ray codebase at the moment (see #2162), since as you mentioned it relies on Ray-specific components. This should make it easier to prototype an async wait in Ray without copying the .fbs files.

The easiest way to implement an async ray.wait may be to ignore the backend wait implementation and to put most of the wait implementation in the client. Basically, the client can call subscribe (in Python) to get notifications whenever a new object is available, and then implement the wait logic itself. What do you think about something like that? Would that work?
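
For concreteness, here is a rough sketch of that client-side wait, assuming the plasma client's subscribe() / get_next_notification() interface; treat the exact signatures as approximate.

import time

def wait_via_notifications(client, object_ids, num_returns, timeout_s):
    # Subscribe first so no object-sealed notification is missed.
    client.subscribe()
    ready = {oid for oid in object_ids if client.contains(oid)}
    pending = set(object_ids) - ready
    deadline = time.time() + timeout_s
    while len(ready) < num_returns and pending and time.time() < deadline:
        # Assumed to block until the store seals another object.
        object_id, _, _ = client.get_next_notification()
        if object_id in pending:
            pending.remove(object_id)
            ready.add(object_id)
    return list(ready), list(pending)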

@suquark (Member, Author) commented Jun 4, 2018

@robertnishihara That's cool. I think it could work, but a better idea may be to use ray.wait as a selector like Linux's poll, so we can implement a general and asyncio-friendly async programming model. I think today's asynchronous socket model gives us a good example: in that model, ray.wait is very similar to Linux's poll, and the subscription mechanism is very similar to epoll. A viable and asyncio-compliant solution is to implement an event loop based on poll or epoll, and then implement generic asynchronous operations on top of that event loop.
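
As a comparison point (a much simpler technique than the selector-based event loop described above, not a substitute for it), the blocking ray.wait can already be made awaitable by running it in a thread-pool executor:

import asyncio
import ray

async def async_wait(object_ids, num_returns=1, timeout=None):
    # Run the blocking ray.wait in the default thread-pool executor so
    # the event loop itself is never blocked.
    loop = asyncio.get_event_loop()
    return await loop.run_in_executor(
        None, lambda: ray.wait(object_ids, num_returns, timeout))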

And about subscribe, do you mean ray.worker.global_worker.plasma_client.subscribe?

@robertnishihara (Collaborator) commented Jun 4, 2018 via email

if sys.version_info >= (3, 5):
    from ray.experimental import async_api
    # Initialize
    async_api.init()

Contributor: We shouldn't init unless the user imports async_api, right?

Member (Author): But it depends on ray.init(). Because users typically call ray.init() after importing modules, I have to put it here.

Contributor: We already init on as_future(), which already depends on ray.init(). So we can remove these lines?

Member (Author): But when initializing it, we should ensure the event loop is not running. I don't think it's a good idea to make users judge whether it is safe to use as_future() for initialization. However, in most cases ray.init() is called before the event loop starts, so initializing it in connect is safe, and it also works for remote functions/actors.

@ericl (Contributor) left a comment: Looks good, but we don't need to init() on worker start, right?

@ericl (Contributor) commented Dec 6, 2018

@suquark I made some edits to clarify that async_api.init() or to_future() must be called before the event loop starts. Let me know if this works.

I also deleted the init block in worker.py. Basically the issue there is that you would be running this code even if the user is not using the async api, which is too risky for an experimental API.
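
For reference, a usage sketch of the experimental API based on the names used in this thread (async_api.init() and as_future()); exact details may differ from the merged version.

import asyncio
import ray
from ray.experimental import async_api

ray.init()
async_api.init()  # must be called before the event loop starts

@ray.remote
def f():
    return "ready"

async def main():
    # as_future wraps an ObjectID so that it can be awaited.
    result = await async_api.as_future(f.remote())
    print(result)

asyncio.get_event_loop().run_until_complete(main())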

@suquark (Member, Author) commented Dec 6, 2018

@ericl That looks good to me, thanks.

@pcmoritz merged commit c2c501b into ray-project:master on Dec 7, 2018
@mitar (Member) commented Dec 7, 2018

Awesome! Thanks @suquark for this.
