Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Asynchronous Driver For Python <Tornado and Tulip> #2622

Closed
v3ss0n opened this issue Jun 28, 2014 · 82 comments
Closed

Asynchronous Driver For Python <Tornado and Tulip> #2622

v3ss0n opened this issue Jun 28, 2014 · 82 comments
Labels
Milestone

Comments

@v3ss0n
Copy link

v3ss0n commented Jun 28, 2014

Change feed on 1.13 is very intersting feature.Multi-media-realtime HTML5 projects are getitng very popular these days , and i am building one right now too , a multi-media-webchat .It needs a lot of serverside notifications.
However , RethinkDB still needs async driver for Tornadoweb, like Motor driver by MongoDB does , that is only deal-blocking feature that blocking me changing from Mongo.
I Love ReQL and I really believe that RethinkDB is NoSQL-Done-Right.

Any Pointers on modifying current driver work async would be very good,I would like to contribute. (I am still quite new on Async and Tornado.

@AtnNn
Copy link
Member

AtnNn commented Jun 28, 2014

@v3ss0n Thanks for opening an issue. With the current driver you can use threads to perform asynchronous queries. Something like this should work:

from threading import Thread
from queue import Queue
stream = Queue()
def get_changes():
  conn = r.connect()
  for change in r.table(...).changes()...run(conn):
    stream.put(change)
Thread(target=get_changes).start()
print stream.get()

A Python driver with an async API would be better.

@v3ss0n
Copy link
Author

v3ss0n commented Jun 29, 2014

Thanks a lot for code example AtnNn.
That would work , but wouldn't that waste everything Tornado tried to avoid?
It won't block Tornado's IOLoop but create new thread for every new requests that needs rethinkdb features.
This will nullify c10k problem that Tornado addressed, solved well.

On asynchronous Mortor ,MongoDB driver equivilant can be done without using threads :

@gen.coroutine
def tail_example():
    results = []
    collection = db.my_capped_collection
    cursor = collection.find(tailable=True, await_data=True)
    while True:
        if not cursor.alive:
            now = datetime.datetime.utcnow()
            # While collection is empty, tailable cursor dies immediately
            yield gen.Task(loop.add_timeout, datetime.timedelta(seconds=1))  ##Giving Control back to tornado IOLoop 
            cursor = collection.find(tailable=True, await_data=True)  ##<< This will block on normal Pymongo Driver

        if (yield cursor.fetch_next):
            results.append(cursor.next_object())
            print results  

Motor driver uses Greenlets to address this problem, without modifying any line of code of mongodb's PyMongo Driver, very interesting approach.
https://github.com/mongodb/motor/blob/master/motor/__init__.py (just over 2600 lines of code).

His approach is detailed on this video and slide , and should be able to apply to any Blocking NetworkIO drivers without much modification - Thats what i am talking about :) .

http://emptysqua.re/blog/video-slides-and-code-about-async-python-and-mongodb/
https://speakerdeck.com/mongodb/asynchronous-mongodb-with-python-and-tornado-a-jesse-jiryu-davis-python-evangelist

It will be very good feature for rethinkdb to have it build in. I will also look into rethinkdb's python driver when i get time.

@coffeemug coffeemug added this to the backlog milestone Jul 2, 2014
@coffeemug
Copy link
Contributor

@v3ss0n -- thanks for chiming in. I'm moving this to backlog for now because there are much more pressing issues to take care of first, but I agree this would be valuable. As a data point, at least 5 people (may be more) have asked me for this.

I can't promise when we'll get to it, but I can say with almost complete certainty that there will be support for this eventually.

@tonich-sh
Copy link

I wrote a twisted connector for python driver. It may help... https://github.com/tonich-sh/rethinkdb-twisted.git

@v3ss0n
Copy link
Author

v3ss0n commented Sep 23, 2014

Thanks , looking into it!

@v3ss0n
Copy link
Author

v3ss0n commented Oct 26, 2014

I am now seriously thinking about writing Async Tornado driver for RethinkDB after liking the changefeeds compare to tailable_cursors of MongoDB.
With changefeed , we can listen for Updates (Like Vote Up/Down) which is quite nice. Its not possible for MongoDB unless you hack into oplog.
@tonich-sh , i saw re-writing RethinkDB client in twisted , but does it really necessary?

@tonich-sh
Copy link

Current client is too mixed with socket code but twisted's protocols is an abstraction. So i went on easiest way... I made it simple and left only the asynchronous logic.

@csytan
Copy link

csytan commented Oct 26, 2014

I tried two approaches about a month ago, threading & asyncio (Python3). Unfortunately, I don't have the time to make anything more than a hack, but I think the asyncio approach is promising seeing as it's in the standard library.

Maybe this will be helpful to someone:
https://github.com/csytan/rethinkdb/commits/next

@v3ss0n
Copy link
Author

v3ss0n commented Oct 27, 2014

@tonich-sh , so i think going gevent approach (the way motor did with mongodb ) will be easy for current client .

@csytan thanks a lot chris , i am looking into it.
EDIT: Very interesting approach , can be switched with tornado ioloop.

@kzahel
Copy link

kzahel commented Jan 27, 2015

Hi I'm just evaluating rethinkDB at this point for use in a new project. Having support for an asynchronous tornado driver would be very compelling for me. Out of the box support for a python api that mirrors the node/javascript API (using generators/coroutines) would be really great.

@danielmewes
Copy link
Member

We're going to consider different options (including Tornado and Python 3 asyncio) for making changefeeds work better in Python as part of #3298. This will happen very soon.
I'm not sure yet to what degree we are going to convert the rest of the driver (apart of changefeeds) to an asynchronous interface.

@techdragon
Copy link

Ideally the outcome would be to refactor the python code to be event loop agnostic. The same "protocol" code reused in event loop drivers so their feature compatibility is ensured. I feel I have to point this out because unfortunately few libraries are written this way. I dont want to see a future where I'll be installing with pip install rethinkdb-twisted in one codebase and pip install rethinkdb-asyncio in another.

The Autobahn project has a great structure for this kind of thing, https://github.com/tavendo/AutobahnPython/

The Autobahn code is structured so if/when people need to use the protocol with some other unsupported event loop, they just have to rewrite the much smaller driver code to use their event loop.
In the case of rethinkdb there would be an additional 'synchronous' driver branch.

@csytan
Copy link

csytan commented Feb 15, 2015

I think the cleanest way to get the driver to run on both Python 2 & 3 would be to use Tornado. Tornado coroutines are very similar to Asyncio's and are also compatible with Asyncio's event loop.

@danielmewes danielmewes modified the milestones: 2.1, backlog Feb 16, 2015
@danielmewes
Copy link
Member

@techdragon We'd like to avoid duplication as well. Will see what we can do.
@csytan That's interesting, thanks for pointing it out.

Moving this to 2.1. We'll try to get Tornado integration done soon.

@deontologician
Copy link
Contributor

Just a note from my personal experience, Tornado is the node equivalent right now in the python world. Twisted has been around longer, but isn't as widely used in the web-app community. In terms of which async loop to support first, Tornado would be a solid choice.

@danielmewes
Copy link
Member

@deontologician Ok that's good to know. I have no personal preference with respect to whether to support Twisted or Tornado first. My understanding from what @larkost said is that Twisted tends to be slightly easier to install due to more widely available packages.

If Tornado is easier to support than Twisted that sounds like we should support Tornado first. Otherwise we should just pick one and then follow up the other one soon (maybe we can even do both for 2.0).
@gchpaco any opinions?

@danielmewes danielmewes modified the milestones: 2.0, 2.1 Feb 16, 2015
@deontologician
Copy link
Contributor

I can't comment on which one is easier to implement, just in terms of who is most likely to make use of it

@gchpaco
Copy link
Contributor

gchpaco commented Feb 16, 2015

It has again been a while since I did bleeding edge Python, but if @csytan is right then it would be much less work to support Tornado-and-asyncio-etc than it would be to support only Twisted.

@techdragon That's interesting! I'm going to have to look at it in more detail. Certainly a possibility.

@danielmewes
Copy link
Member

Ok let's do Tornado first then.

@techdragon
Copy link

@gchpaco & @danielmewes - If you refactor the logic to separate the drivers from the core logic the drivers rely on, then at least in theory, each driver should be simpler to build. Tornado is the current favourite among a lot of developers, and would be great, but when you refactor the logic you need to take into account that you need to maintain the existing synchronous driver.

Twisted has a LOT of documentation regarding its use.

I would try to refactor the current code to be broken apart like Autobahn has theirs, supporting twisted and regular synchronous python first, then adding tornado & asyncio in whichever order works best.
I dont think the drivers will be the hard part of getting this done, I would expect that after the logic refactoring, the drivers, "Synchronous", "Twisted", "Asyncio", and "Tornado" will all be fairly strait forward.

@v3ss0n
Copy link
Author

v3ss0n commented Feb 17, 2015

Thanks a lot for this going forward.I haven't check for long due to being busy with Async chat Implementation on Tornado and Mongo.

@danielmewes , what @csytan suggested true that tornado's coroutines are compatible with asyncio of Python3. It gives you clean code using yield, without needing to worry about callbacks , and thats the reason for web devs choosing tornado over twisted.
Tornado have a twisted bridge , I think its bidrectional so if you support tornado , it will support twisted too.

One Stone , a flock of birds. Nice? :)

@coffeemug
Copy link
Contributor

@gchpaco -- a couple of questions:

  • Could you post some minimal sample code here of how a user would use the driver with the async API?
  • Would the old synchronous code still work without modifications?
  • Since we have time before the release, could you start working on adding other Python async backends (starting with most common) when the code review goes through?

@coffeemug
Copy link
Contributor

Specifically, with sample code it would be great to mirror @mlucy's mini-tutorial here #3678 (comment) in Python, so we understand how to use the API in different circumstances, how error handling works, etc.

@gchpaco
Copy link
Contributor

gchpaco commented Mar 17, 2015

Right now I'm trying to get it through review, but I'll try to do that. As a temporary stopgap, see test/rql_test/connections/tornado_connection.py which is an analog of connection.py.

@gchpaco
Copy link
Contributor

gchpaco commented Mar 18, 2015

Yup. Note that @Tryneus is working on refactoring some of it, so there may be some large changes when that goes through.

@Tryneus
Copy link
Member

Tryneus commented Mar 19, 2015

The refactor @gchpaco mentioned is up in review 2732, which is based off his branch and eclipses the previous review. It is available in the branch grey_issue_2622.

@Tryneus
Copy link
Member

Tryneus commented Mar 21, 2015

Ok, the python driver refactor has been approved and merged to next in commits 73d83e8 (@gchpaco's changes), 154efa8 (driver refactor), and 43cd9d7 (test updates following the refactor), and v2.0.x in commits a126ed8, 7e51398, and a3d41d0. Will be in release 2.0.

@Tryneus Tryneus closed this as completed Mar 21, 2015
@v3ss0n
Copy link
Author

v3ss0n commented Mar 22, 2015

Thank you so much , i like the way RethinkDB team handle projects , all official workflow integrated with github, that's so cool.
I learn a lot from you and will practice in my company.
I am pulling now and going to test. If everything ok i will make a rehtinkdb + tornado example chat demo.

@danielmewes
Copy link
Member

Out of curiosity @gchpaco:
How does this work exactly (example from rethinkdb/docs#683)?

    # Print every row in the table.
    for future in (yield r.table('test').order_by(index='id').run(connection)):
        item = yield future
        print(item)

More specifically: how do we know in advance whether the cursor will have another result in the for loop? Does yield future always wait for the next batch to be loaded unless the cursor is a changefeed?

@Tryneus
Copy link
Member

Tryneus commented Mar 24, 2015

@danielmewes, I just tested this and it isn't the expected behavior (this is my fault due to some of the cursor changes in the refactor). yield future will raise a StopIteration exception once it is past the end of the cursor. Unfortunately, if the user doesn't catch this, it interacts poorly with Tornado, which assumes the StopIteration means that the coroutine has finished.

Opened #3974 for this issue.

@Tryneus
Copy link
Member

Tryneus commented Mar 24, 2015

So based on my suggestion in #3974, that loop would throw RqlCursorEmpty after the end of the cursor.

@danielmewes
Copy link
Member

Having to handle the RqlCursorEmpty exception for every loop over a cursor sounds a little annoying.

Should we add a fetch_next like in Motor to guarantee that the next call to next doesn't throw?

cursor = yield r.table('test').order_by(index='id').run(connection)
while (yield cursor.fetch_next):
    item = yield cursor.next()
    print(item)

@Tryneus
Copy link
Member

Tryneus commented Mar 24, 2015

@danielmewes, I think that would be useful, I'll do it alongside #3974.

@danielmewes
Copy link
Member

👍

@ajdavis
Copy link

ajdavis commented Mar 24, 2015

Note in Motor, you can create the cursor without a yield since it hasn't begun I/O yet, just stored the query in a MotorCursor object. Only on the first yield cursor.fetch_next do we send the query to the server and get the first batch:

http://motor.readthedocs.org/en/stable/api/motor_cursor.html#motor.MotorCursor.fetch_next

Also, I chose the method name next_object instead of next to avoid interacting with Python's standard iterator protocol in any unexpected way, since a MotorCursor cannot act like a standard iterator:

https://docs.python.org/2/library/stdtypes.html#iterator-types

@danielmewes
Copy link
Member

Thanks @ajdavis .
For RethinkDB we need to yield on creating the cursor because we don't know whether a given query actually returns one before getting the first response from the server (some queries return a single value, which we handle differently).

Good point about the incompatibility with the iterator interface.
We could certainly use a special next function for Tornado cursors (not sure how we would call it). I can't tell whether it's important enough to warrant the separate function though.
I wonder how exactly a script fails if it tries to use a Tornado cursor as an iterator. I'll try that later.

@Tryneus
Copy link
Member

Tryneus commented Mar 25, 2015

@danielmewes, it's would be fine to iterate over a TornadoCursor, as long as you didn't do it eagerly (you would end up with an infinite loop that would probably be OOM-killed), and as far as I can tell, we can't safely do a comprehension like arr = [(yield x) for x in cursor if (yield cursor.fetch_next())]. Taking this into account, we should probably remove iteration from the TornadoCursor implementation.

@v3ss0n
Copy link
Author

v3ss0n commented Apr 2, 2015

No problem now , this code works well.

con = yield conn
curs = yield evt.run(con)
messages = []
while (yield curs.fetch_next()):
    item = yield curs.next()
    messages.append(item)

But i have a few other questions.

Right now Insert performance is not good with async , it takes 200-400 ms , locally , small document of < 2KB each (chat messages).

take a look at this code , i may be doing something wrong:

import logging
import tornado.escape
import tornado.ioloop
import tornado.web
import os.path
import uuid
import rethinkdb as r
from tornado.concurrent import Future
from tornado import gen
from tornado.options import define, options, parse_command_line
r.set_loop_type("tornado")
define("port", default=8080, help="run on the given port", type=int)
define("debug", default=False, help="run in debug mode")
conn = r.connect("localhost")
evt = r.db("rechat").table("events")
# Making this a non-singleton is left as an exercise for the reader.


class MainHandler(tornado.web.RequestHandler):

    @gen.coroutine
    def get(self):
        con = yield conn
        curs = yield evt.run(con)
        messages = []
        while (yield curs.fetch_next()):
            item = yield curs.next()
            messages.append(item)

        self.render("index.html", messages=messages)


class MessageNewHandler(tornado.web.RequestHandler):

    @gen.coroutine
    def post(self):
        con = yield conn
        message = {
            "body": self.get_argument("body")
        }
        # to_basestring is necessary for Python 3's json encoder,
        # which doesn't accept byte strings.
        messages = (yield evt.insert(message).run(con))
        message['id'] = messages['generated_keys'][0]
        message["html"] = tornado.escape.to_basestring(
            self.render_string("message.html", message=message))
        if self.get_argument("next", None):
            self.redirect(self.get_argument("next"))
        else:
            self.write(message)

@danielmewes
Copy link
Member

@v3ss0n I can't see anything obviously wrong about your insert code.

Which revision of the RethinkDB repository are you on ($ git status)? I wonder if you might be affected by #3998. It's closed in the latest revision which you should get through git pull.

@v3ss0n
Copy link
Author

v3ss0n commented Apr 2, 2015

i am at 01fa526

@v3ss0n
Copy link
Author

v3ss0n commented Apr 2, 2015

what branch i should pool? next or v2.0x ?

@danielmewes
Copy link
Member

@v3ss0n Both branches should be ok. The commit you mentioned actually includes the fix for that issue. So something else must be going on. I opened #4007 for further investigation into this.

@shivekkhurana
Copy link

After reviewing a lot of amazing work by David Beazley (https://github.com/dabeaz), and with the improvements in async python (mainly introduction of async/ await keyword and inclusion of asyncio into the stdlib), we can do this very easily.

I have created a gist here with a working example
https://gist.github.com/shivekkhurana/1de00e1e54c36d250a7f19905fe133b9

It gives a simple structure to listen to multiple change feeds asynchronously (tested on python 3.6)
Hope it helps.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests