Added the to_kafka stream #216
Conversation
cc @martindurant you might find this interesting

I do! I'll try to look at it in the next couple of days.

Is someone maintaining this project? The errors in the tests are not obviously caused by the new code, and there doesn't seem to have been much activity in several months.

I think if pytest is pinned to 3.10 this might be ok.

No.

Or at least, I'm not. Apparently no one else is either?

Resolved the pytest issues here: #217

I've resolved the pytest failures and restarted the tests here. It looks like there are Kafka failures on master. @martindurant I suspect that you are the most familiar with that testing infrastructure (although @jsmaupin, if you have energy to spend, it would be good to get your eyes on the Kafka testing code as well, since we'll need it to write tests for your work here too).

Sure, I'll take a look.

The kafka-threads tests are the ones that appear to error (they first fail, time out, and then don't end gracefully), so probably something is deadlocking the event loop. Do they all pass locally? Note that I don't see any new tests for the functionality in this PR.

Happy to see this fixed! You seem only to be failing on flake8 lint (lots of it, presumably not added by you).

If it's unrelated to this work then I'll take care of it later.

@martindurant thanks for reviewing this. Do you have any other concerns about the content of the implementation?

I'll try to take a closer look at this later today.
martindurant
left a comment
A couple of quick questions, now that things are working.
streamz/core.py (outdated)

```python
Stream.__init__(self, upstream, ensure_io_loop=True, **kwargs)

def update(self, x, who=None):
    self.producer.poll(0)
```
Is this necessary? If the data is not sent strictly synchronously, I think that's OK.
I'm also curious about this, and more generally about how well this approach works in an asynchronous environment. It has been a while since I looked at confluent_kafka_python, so I've forgotten some things.

If the producer is busy, then what happens with producer.poll(0)? Does it error, or does it block? It may be that neither of these outcomes is desirable.
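For reference, in confluent-kafka-python `poll(0)` is non-blocking: it serves any pending delivery callbacks and returns immediately. It is `produce()` that signals a busy producer, by raising `BufferError` when the local queue is full. A minimal sketch of the resulting retry pattern, using a hypothetical `StubProducer` in place of `confluent_kafka.Producer` so the snippet runs without a broker:

```python
import time

class StubProducer:
    """Hypothetical stand-in for confluent_kafka.Producer with a tiny local queue."""
    def __init__(self, max_queued=2):
        self.max_queued = max_queued
        self.queue = []
        self.delivered = []

    def produce(self, topic, value):
        # The real client raises BufferError when the local queue is full.
        if len(self.queue) >= self.max_queued:
            raise BufferError("local producer queue is full")
        self.queue.append((topic, value))

    def poll(self, timeout):
        # Non-blocking for timeout=0: serve "delivery callbacks" and return.
        self.delivered.extend(self.queue)
        self.queue.clear()
        return len(self.delivered)

def produce_with_backpressure(producer, topic, value, retries=10):
    """Retry produce(), draining the queue via poll() when it is full."""
    for _ in range(retries):
        try:
            producer.produce(topic, value)
            return
        except BufferError:
            producer.poll(0)  # serve callbacks, freeing queue space
            time.sleep(0.01)
    raise RuntimeError("producer queue stayed full")

p = StubProducer()
for i in range(5):
    produce_with_backpressure(p, "test-topic", str(i))
p.poll(0)  # drain whatever is still queued
```

So `poll(0)` itself neither errors nor blocks; the question is what the surrounding code does when `produce()` refuses the message.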
streamz/core.py (outdated)

```python
def flush(self, timeout=-1):
    self.producer.flush(timeout)

def cb(self, err, msg):
```
I wonder if this should be a sink, and not emit anything? Of course, I see the usefulness in testing.
I have managed to fix most of the tests. I added another test for the new .from_kafka() class. The main issue is described here by the author of the Confluent Python library: confluentinc/confluent-kafka-python#156 (comment). Basically, the .subscribe() method does not immediately subscribe; there is an asynchronous operation it must complete first. Subscribing and then immediately producing messages from the same client is not a normal use-case. Most of the time, the subscriber and producer are at different locations on a network. I have added a _blocking_subscribe method that resolves this issue. We did not see this in the non-threading tests because they run first and the tests use Kafka's auto-create-topic functionality. There is a slight delay when producing the first message while Kafka creates the topic, which gave the .subscribe() call enough time to completely subscribe. Once the topic was created, the rest of the tests failed. I have not been able to get the test_kafka_dask_batch test to pass.
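A blocking subscribe along these lines can be sketched by waiting for the `on_assign` callback that `Consumer.subscribe` accepts; in the real client that callback is served from inside `poll()`, once the group rebalance completes, so the helper must keep polling while it waits. The names `blocking_subscribe` and `StubConsumer` below are illustrative (the stub stands in for `confluent_kafka.Consumer` so the snippet runs without a broker), not the PR's actual code:

```python
import time

class StubConsumer:
    """Hypothetical stand-in: assignment completes only after a few polls."""
    def __init__(self):
        self._pending = None
        self._polls = 0

    def subscribe(self, topics, on_assign=None):
        # The real subscribe() returns immediately; assignment is asynchronous.
        self._pending = on_assign

    def poll(self, timeout):
        # The real client serves rebalance callbacks from inside poll().
        self._polls += 1
        if self._polls >= 3 and self._pending is not None:
            cb, self._pending = self._pending, None
            cb(self, ["partition-0"])
        return None

def blocking_subscribe(consumer, topics, timeout=5.0):
    """Subscribe, then poll until partitions are actually assigned."""
    parts_box = []
    consumer.subscribe(topics, on_assign=lambda c, parts: parts_box.extend(parts))
    deadline = time.monotonic() + timeout
    while not parts_box:
        if time.monotonic() > deadline:
            raise TimeoutError("no partition assignment within %.1fs" % timeout)
        consumer.poll(0.1)
    return parts_box

parts = blocking_subscribe(StubConsumer(), ["my-topic"])
```

Once this returns, messages produced to the topic will be seen by the consumer, which is the property the threaded tests were missing.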
On second thought, if this is not a normal use-case, perhaps I should move the blocking-subscribe functionality into the tests?

It does sound like a test helper only, but I don't mind where it appears.
Here are my test results for various implementations. For each I used 10,000 records. Implementations compared: a plain producer outside the streamz library, to_kafka, and to_kafka_batched. (The results table itself did not survive in this copy.)
@jsmaupin, any objection if I push flake8 fixes here? Everything else seems to be in order, and other PRs are failing while this is unmerged.

No objections here.
Ah, sorry, I can't - I'm not a

```diff
--- a/streamz/tests/test_core.py
+++ b/streamz/tests/test_core.py
@@ -942,7 +942,7 @@ def dont_test_stream_kwargs(clean):
 @pytest.fixture
-def thread(loop):
+def thread(loop):  # noqa: E811
     from threading import Thread, Event
     thread = Thread(target=loop.start)
     thread.daemon = True
diff --git a/streamz/tests/test_dask.py b/streamz/tests/test_dask.py
index 5c007a3..9a415fd 100644
--- a/streamz/tests/test_dask.py
+++ b/streamz/tests/test_dask.py
@@ -74,7 +74,7 @@ def test_zip(c, s, a, b):
 @pytest.mark.slow
-def test_sync(loop):
+def test_sync(loop):  # noqa: E811
     with cluster() as (s, [a, b]):
         with Client(s['address'], loop=loop) as client:  # flake8: noqa
             source = Stream()
@@ -91,7 +91,7 @@ def test_sync(loop):
 @pytest.mark.slow
-def test_sync_2(loop):
+def test_sync_2(loop):  # noqa: E811
     with cluster() as (s, [a, b]):
         with Client(s['address'], loop=loop):  # flake8: noqa
             source = Stream()
@@ -132,7 +132,7 @@ def test_buffer(c, s, a, b):
 @pytest.mark.slow
-def test_buffer_sync(loop):
+def test_buffer_sync(loop):  # noqa: E811
     with cluster() as (s, [a, b]):
         with Client(s['address'], loop=loop) as c:  # flake8: noqa
             source = Stream()
@@ -158,7 +158,7 @@ def test_buffer_sync(loop):
 @pytest.mark.xfail(reason='')
 @pytest.mark.slow
-def test_stream_shares_client_loop(loop):
+def test_stream_shares_client_loop(loop):  # noqa: E811
     with cluster() as (s, [a, b]):
         with Client(s['address'], loop=loop) as client:  # flake8: noqa
             source = Stream()
```
(If you don't want to deal with this, @jsmaupin, I'm happy to push to my fork instead.)

Sorry, apparently it takes a little more work!

Yeah, np. It looks like one version had the noqa comment on the decorators and yours was on the function itself.

It seems the other one was right! Oops... (plus some whitespace has crept in)

Some E811 codes should have been F811. I tested locally this time.
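For reference, F811 is pyflakes' "redefinition of unused name" check (there is no E811 in pycodestyle, which is why the typo silenced nothing), and the `# noqa` comment only works on the exact physical line flake8 attributes the warning to, which is why decorator-versus-def placement mattered above. A minimal illustration, with a deliberately shadowed import:

```python
from time import sleep  # imported, then shadowed below

def sleep(seconds):  # noqa: F811 - deliberate redefinition for this example
    """Shadowing definition; without the noqa comment flake8 reports F811 here."""
    return seconds

result = sleep(2)  # calls the local definition, not time.sleep
```

In the tests above the redefinition is intentional (pytest fixtures shadowing imported names), so suppressing the warning is the right call.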
Strangely, I'm seeing a syntax error now at https://github.com/mrocklin/streamz/pull/216/files#diff-a56707adbd2d2392ea8625521ed6edd0R156 - should that be the tornado gen/yield pattern? Otherwise it should be in py3_test_core. But this was passing all tests before, so what happened?

There was some Python-3-specific code in there. This should be a working commit.
skmatti
left a comment
A possible reason why you were not able to make the test_kafka_dask_batch test pass:
```python
stream = Stream.from_kafka_batched(TOPIC, ARGS)
out = stream.sink_to_list()
stream.start()
sleep(2)
```
This may cause the test to fail: the default poll_interval is 1 second, so the test would have polled two batches. The second batch would be an empty list, and out[-1][-1] would always be null.
I'm not sure about that; I think the time accounts for the initial setup of the consumer, and no data has been sent yet at this point.
Yes, if I had another way to do this without a sleep, I think that would be preferable, but the delay for the topic creation is why it is there.
Yeah, you are right! Data hasn't been sent yet.
Codecov Report

```
@@        Coverage Diff         @@
##        master    #216  +/-  ##
=================================
  Coverage     ?   93.12%
=================================
  Files        ?       13
  Lines        ?     1483
  Branches     ?        0
=================================
  Hits         ?     1381
  Misses       ?      102
  Partials     ?        0
```

Continue to review the full report at Codecov.
Green! Let's merge this!

(^ @mrocklin)

@mrocklin, let me know if there's anything I can do to help get this closed.
mrocklin
left a comment
A few questions about async behavior
```python
        except Exception as e:
            future.set_exception(e)

self.loop.add_callback(_)
```
Is the function above likely to take a non-trivial amount of time? If so, then we probably can't place it onto the event loop.
```python
stream = Stream.from_kafka([TOPIC], ARGS, asynchronous=True)
out = stream.sink_to_list()
stream.start()
sleep(5)
```
Why this sleep?
This can probably be removed. I'll take a closer look.
```python
stream = Stream.from_kafka([TOPIC], ARGS)
out = stream.sink_to_list()
stream.start()
sleep(2)
```
Why this sleep? In general sleeps can be troublesome because:
- They make tests take a long time if they are overly conservative
- If the system running the tests is abnormally slow then they can fail (Travis is often abnormally slow)
I'm also concerned about using time.sleep here in an async test. This is usually an anti-pattern.
Yes, I totally agree. I'll take some time to come up with a better solution.
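A common replacement for a fixed sleep in tests like these is to poll for the expected condition with a generous timeout, so the test passes as soon as data arrives and only fails if it truly never does (streamz's test utilities include a helper in this spirit; the version below is an illustrative sketch, not the project's actual function):

```python
import time

def wait_for(predicate, timeout, period=0.05):
    """Poll predicate() until it returns true or timeout seconds elapse."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if predicate():
            return True
        time.sleep(period)
    raise TimeoutError("condition not met within %.1fs" % timeout)

out = []
# In a real test, `out` would be filled by the stream; simulate that here.
out.append(b"value")
ok = wait_for(lambda: len(out) == 1, timeout=10)  # returns as soon as data is seen
```

This addresses both complaints: a conservative timeout no longer slows the test down when things are fast, and a slow CI machine just uses more of the budget instead of failing.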
```python
def _():
    while True:
        self.producer.poll(0)
```
What happens if we poll and the producer is still busy? How do we handle back pressure?
The idea is that this while loop will not exit if the producer queue is full. Am I breaking something here with regard to how async systems work? Will this block other parts of the program?
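Yes: a callback scheduled onto the event loop runs on the loop's own thread, so while this `while True` spins, no other callback can run and the whole pipeline stalls. The usual workaround is to run the blocking work on a worker thread and let the loop await its result. A sketch of that pattern (using asyncio and a hypothetical `blocking_flush` standing in for `producer.flush()`; the heartbeat task exists only to demonstrate that the loop keeps running):

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor

def blocking_flush():
    """Stand-in for producer.flush(): blocks its own thread, not the event loop."""
    time.sleep(0.1)
    return "flushed"

async def main():
    loop = asyncio.get_running_loop()
    ticks = []

    async def heartbeat():
        # Keeps ticking while the flush happens, proving the loop isn't blocked.
        for _ in range(5):
            ticks.append(time.monotonic())
            await asyncio.sleep(0.02)

    with ThreadPoolExecutor(max_workers=1) as pool:
        flushed, _ = await asyncio.gather(
            loop.run_in_executor(pool, blocking_flush),  # off-loop blocking call
            heartbeat(),                                 # concurrent loop work
        )
    return flushed, len(ticks)

result, n_ticks = asyncio.run(main())
```

Tornado's IOLoop offers the equivalent `run_in_executor`, so the same shape would apply on the streamz event loop.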
@mrocklin thank you very much for the feedback.

So, I can close this PR?

This PR should now be closed.

Thanks @jsmaupin for your work here.
I have added the to_kafka class.
The options were to either:
.flush()method after each.produce(...)call. This will seriously hurt performance as messages will be sent serially.atexitto call flush at the end. This requires the use ofatexit, of which I'm unsure of any impact. It also means that the result of the.emit(...)call by the client does not yield an immediate call to the downstream. Note that the Confluent library will flush when the batch size reaches a threshold. This is the method that is implemented here.