Conversation

@jsmaupin (Contributor) commented Jan 9, 2019

I have added the to_kafka class.

The options were to either:

  1. Call the .flush() method after each .produce(...) call. This seriously hurts performance, because messages are sent serially.
  2. Use atexit to call flush at exit. This requires atexit, whose impact I'm unsure of, and it means that the client's .emit(...) call does not immediately trigger the downstream. Note that the Confluent library flushes when the batch size reaches a threshold; this is the method implemented here.
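The batching approach in option 2 can be sketched as follows. This is a hypothetical, stand-alone sketch: `FakeProducer` and `ToKafka` are stand-ins I've made up to show the shape of the logic; a real implementation would use `confluent_kafka.Producer`, whose `produce()` hands messages to an internal queue that librdkafka batches and sends in the background.

```python
# Sketch of option 2: rely on the Kafka client's internal batching and
# only poll for delivery callbacks on each emit. FakeProducer is a
# stand-in for confluent_kafka.Producer so the logic runs anywhere.

class FakeProducer:
    """Stand-in producer: queues messages locally, delivers on poll()."""
    def __init__(self):
        self.queue = []
        self.delivered = []

    def produce(self, topic, value, callback=None):
        self.queue.append((topic, value, callback))

    def poll(self, timeout):
        # Serve delivery callbacks for anything already queued.
        for topic, value, callback in self.queue:
            self.delivered.append(value)
            if callback:
                callback(None, value)  # err=None means success
        self.queue.clear()

    def flush(self, timeout=-1):
        self.poll(0)


class ToKafka:
    """Sketch of a to_kafka node: produce on update, flush on demand."""
    def __init__(self, topic, producer):
        self.topic = topic
        self.producer = producer

    def update(self, x):
        # Non-blocking: serve any pending delivery callbacks, then hand
        # the message to the client's queue; the client batches it.
        self.producer.poll(0)
        self.producer.produce(self.topic, x)

    def flush(self, timeout=-1):
        self.producer.flush(timeout)


producer = FakeProducer()
node = ToKafka("my-topic", producer)
for i in range(5):
    node.update(str(i))
node.flush()
print(producer.delivered)  # ['0', '1', '2', '3', '4']
```

The trade-off the comment describes is visible here: nothing forces delivery per message, so throughput is good, but a message is only guaranteed out after an explicit flush.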

@mrocklin (Collaborator) commented Jan 9, 2019

cc @martindurant you might find this interesting

@martindurant (Member):

I do! I'll try to look at it in the next couple of days.

@martindurant (Member):

Is someone maintaining this project? The errors in the tests are not obviously caused by the new code, and there doesn't seem to have been much activity in several months.

@CJ-Wright (Member):

I think if pytest is pinned to 3.10 this might be ok.

@mrocklin (Collaborator) commented Jan 9, 2019

> Is someone maintaining this project? The errors in the tests are not obviously caused by the new code, and there doesn't seem to have been much activity in several months.

No

@mrocklin (Collaborator) commented Jan 9, 2019

Or at least, I'm not. Apparently no one else is either?

@mrocklin (Collaborator) commented Jan 9, 2019

Resolved the pytest things here: #217

@mrocklin (Collaborator) commented Jan 9, 2019

I've resolved the pytest failures and restarted the tests. It looks like there are Kafka failures in master. @martindurant, I suspect you are the most familiar with that testing infrastructure (although @jsmaupin, if you have energy to spend, it would be good to get your eyes on the Kafka testing code as well, since we'll need it to write tests for your work here).

@jsmaupin (Contributor, author):

Sure, I'll take a look.

@martindurant (Member):

The kafka-threads tests are the ones that appear to error (they first fail, then time out, then don't end gracefully), so probably something is deadlocking the event loop. Do they all pass locally?

Note that I don't see any new tests for the functionality in this PR.

@martindurant (Member):

Happy to see this fixed! You seem only to be failing on flake8 lint (lots of it, presumably not added by you).

@mrocklin (Collaborator):

> You seem only to be failing on flake lint (lots of it, presumably not added by you).

If it's unrelated to this work then I'll take care of it later.

@mrocklin (Collaborator):

@martindurant thanks for reviewing this. Do you have any other concerns about the content of the implementation?

@mrocklin (Collaborator):

I'll try to take a closer look at this later today.

@martindurant (Member) left a review comment:

A couple of quick questions, now that things are working.

streamz/core.py (outdated):

        Stream.__init__(self, upstream, ensure_io_loop=True, **kwargs)

    def update(self, x, who=None):
        self.producer.poll(0)
@martindurant (Member):

Is this necessary? If the data is not sent strictly synchronously, I think that's OK.

@mrocklin (Collaborator):

I'm also curious about this, and more generally about how well this approach works in an asynchronous environment. It has been a while since I looked at confluent_kafka_python, so I've forgotten some things.

If the producer is busy then what happens with producer.poll(0)? Does it err, does it block? It may be that neither of these outcomes is desirable.

streamz/core.py (outdated):

    def flush(self, timeout=-1):
        self.producer.flush(timeout)

    def cb(self, err, msg):
@martindurant (Member):

I wonder if this should be a sink, and not emit anything? Of course, I see the usefulness in testing.

@jsmaupin (Contributor, author):

I have managed to fix most of the tests, and I added another test for the new .from_kafka() class. The main issue is described by the author of the Confluent Python library here: confluentinc/confluent-kafka-python#156 (comment).

Basically, the .subscribe() method does not subscribe immediately; there is an asynchronous operation it must complete first. Subscribing and then immediately producing messages from the same client is not a normal use case, since most of the time the consumer and producer are in different locations on the network. I have added a _blocking_subscribe method that resolves this issue. We did not see this in the non-threading tests because they run first and rely on Kafka's auto-create-topic functionality: there is a slight delay while Kafka creates the topic for the first produced message, which gave .subscribe() enough time to complete. Once the topic existed, the rest of the tests failed.
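The blocking-subscribe idea can be sketched like this. `FakeConsumer` and `blocking_subscribe` are hypothetical stand-ins, not the PR's code; the real pattern uses confluent-kafka's `on_assign` callback to `Consumer.subscribe()`, which fires once partitions have actually been assigned.

```python
# Sketch: turn the asynchronous subscribe into a blocking one by
# waiting for the on_assign callback. FakeConsumer stands in for
# confluent_kafka.Consumer so the pattern is runnable anywhere.
import threading
import time

class FakeConsumer:
    def subscribe(self, topics, on_assign=None):
        # A real consumer assigns partitions asynchronously; simulate
        # that with a short-lived background thread.
        def assign_later():
            if on_assign:
                on_assign(self, [f"{t}:0" for t in topics])
        threading.Thread(target=assign_later).start()

    def poll(self, timeout):
        return None  # the normal consume loop would go here


def blocking_subscribe(consumer, topics, timeout=10):
    """Subscribe and block until partitions are actually assigned."""
    assigned = threading.Event()

    def on_assign(consumer, partitions):
        assigned.set()

    consumer.subscribe(topics, on_assign=on_assign)
    deadline = time.monotonic() + timeout
    while not assigned.wait(timeout=0.01):
        if time.monotonic() > deadline:
            raise TimeoutError("partitions were not assigned in time")
        # Drive poll() so a real client can make progress; the fake
        # consumer needs no driving, but a real one does.
        consumer.poll(0)
    return True


consumer = FakeConsumer()
assert blocking_subscribe(consumer, ["my-topic"])
```

In a test helper, the deadline keeps a broken broker from hanging the suite forever, while a healthy one returns as soon as assignment completes.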

I have not been able to get the test_kafka_dask_batch test to pass.

@jsmaupin (Contributor, author):

On second thought, if this is not a normal use-case, perhaps I should move the blocking subscribe functionality to the tests?

@martindurant (Member):

It does sound like a test helper only, but I don't mind where it appears.

@jsmaupin (Contributor, author) commented Feb 14, 2019

Here are my test results for various implementations, each with 10,000 records.

Outside of the streamz library:
- plain for loop: 0.039 s

to_kafka:
- awaiting the delivery callback from Kafka: 13.2 s
- using the BufferError exception to handle back-pressure: 7.47 s

to_kafka_batched:
- with just `pass` in the update method: 2.15 s
- flushing to Kafka (batch size 500): 2.2 s
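The BufferError back-pressure variant measured above can be sketched as follows. `FakeProducer` is a made-up stand-in with a tiny queue so the error path actually fires; with confluent-kafka, `produce()` raises `BufferError` when the client's local queue is full.

```python
# Sketch of BufferError-based back-pressure: when the client's local
# queue is full, produce() raises BufferError; we then poll to drain
# delivery callbacks and retry the same message.

class FakeProducer:
    """Stand-in producer with a tiny queue so BufferError fires."""
    def __init__(self, capacity=3):
        self.capacity = capacity
        self.queue = []
        self.delivered = []

    def produce(self, topic, value):
        if len(self.queue) >= self.capacity:
            raise BufferError("local producer queue is full")
        self.queue.append(value)

    def poll(self, timeout):
        # Pretend everything queued has now been sent and acknowledged.
        self.delivered.extend(self.queue)
        self.queue.clear()

    def flush(self, timeout=-1):
        self.poll(0)


def produce_with_backpressure(producer, topic, value):
    while True:
        try:
            producer.produce(topic, value)
            return
        except BufferError:
            # Queue full: let the client drain, then retry.
            producer.poll(0.1)


producer = FakeProducer(capacity=3)
for i in range(10):
    produce_with_backpressure(producer, "my-topic", i)
producer.flush()
print(len(producer.delivered))  # 10
```

This is slower than pure batching (as the numbers above suggest) because the caller stalls whenever the queue fills, but no message is ever dropped.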

@martindurant (Member):

@jsmaupin, any objection if I push flake8 fixes here? Everything else seems to be in order, and other PRs are failing while this is unmerged.

@jsmaupin (Contributor, author):

No objections here.

@martindurant (Member):

Ah, sorry, I can't: I'm not a streamz maintainer. The following diff fixes things:

--- a/streamz/tests/test_core.py
+++ b/streamz/tests/test_core.py
@@ -942,7 +942,7 @@ def dont_test_stream_kwargs(clean):


 @pytest.fixture
-def thread(loop):
+def thread(loop):   # noqa: E811
     from threading import Thread, Event
     thread = Thread(target=loop.start)
     thread.daemon = True
diff --git a/streamz/tests/test_dask.py b/streamz/tests/test_dask.py
index 5c007a3..9a415fd 100644
--- a/streamz/tests/test_dask.py
+++ b/streamz/tests/test_dask.py
@@ -74,7 +74,7 @@ def test_zip(c, s, a, b):


 @pytest.mark.slow
-def test_sync(loop):
+def test_sync(loop):  # noqa: E811
     with cluster() as (s, [a, b]):
         with Client(s['address'], loop=loop) as client:  # flake8: noqa
             source = Stream()
@@ -91,7 +91,7 @@ def test_sync(loop):


 @pytest.mark.slow
-def test_sync_2(loop):
+def test_sync_2(loop):  # noqa: E811
     with cluster() as (s, [a, b]):
         with Client(s['address'], loop=loop):  # flake8: noqa
             source = Stream()
@@ -132,7 +132,7 @@ def test_buffer(c, s, a, b):


 @pytest.mark.slow
-def test_buffer_sync(loop):
+def test_buffer_sync(loop):  # noqa: E811
     with cluster() as (s, [a, b]):
         with Client(s['address'], loop=loop) as c:  # flake8: noqa
             source = Stream()
@@ -158,7 +158,7 @@ def test_buffer_sync(loop):

 @pytest.mark.xfail(reason='')
 @pytest.mark.slow
-def test_stream_shares_client_loop(loop):
+def test_stream_shares_client_loop(loop):  # noqa: E811
     with cluster() as (s, [a, b]):
         with Client(s['address'], loop=loop) as client:  # flake8: noqa
             source = Stream()

@martindurant (Member):

(If you don't want to deal with this, @jsmaupin, I'm happy to push to my fork instead.)

@martindurant (Member):

Sorry, apparently it takes a little more work!

@jsmaupin (Contributor, author):

Yeah, no problem. It looks like one version had the noqa on the decorators and yours was on the function itself.

@martindurant (Member):

It seems the other one was right! Oops...

    streamz/tests/test_kafka.py:148:5: E999 SyntaxError: invalid syntax
    streamz/tests/test_core.py:946:12: F811 redefinition of unused 'loop' from line 23
    streamz/tests/test_dask.py:76:18: W291 trailing whitespace
    streamz/tests/test_dask.py:94:17: F811 redefinition of unused 'loop' from line 14
    streamz/tests/test_dask.py:135:22: F811 redefinition of unused 'loop' from line 14
    streamz/tests/test_dask.py:160:36: F811 redefinition of unused 'loop' from line 14

(Plus some whitespace has crept in.)

@jsmaupin (Contributor, author):

Some E811 codes should have been F811. I tested locally this time.

@martindurant (Member):

Strangely, I'm now seeing a syntax error at https://github.com/mrocklin/streamz/pull/216/files#diff-a56707adbd2d2392ea8625521ed6edd0R156. Should that be the tornado gen/yield pattern? Otherwise, it should be in py3_test_core. But this was passing all tests before, so what happened?

@jsmaupin (Contributor, author):

There was some Python 3-specific code in there. This should be a working commit.

@skmatti (Contributor) left a review comment:

A possible reason why you were not able to get the test_kafka_dask_batch test to pass:

stream = Stream.from_kafka_batched(TOPIC, ARGS)
out = stream.sink_to_list()
stream.start()
sleep(2)
@skmatti (Contributor):

This may cause the test to fail: the default poll_interval is 1 s, so the test would have polled two batches. The second batch would be an empty list, and out[-1][-1] would always be null.

@martindurant (Member):

I'm not sure about that; I think the time accounts for the initial setup of the consumer, and no data has been sent yet at this point.

@jsmaupin (Contributor, author):

Yes, if I had another way to do this without a sleep, I think that would be preferable, but the delay for topic creation is why it is there.

@skmatti (Contributor):

Yeah, you are right! Data hasn't been sent yet.

@codecov-io commented Mar 13, 2019

Codecov Report

❗ No coverage uploaded for pull request base (master@1486318).
The diff coverage is 100%.

@@            Coverage Diff            @@
##             master     #216   +/-   ##
=========================================
  Coverage          ?   93.12%           
=========================================
  Files             ?       13           
  Lines             ?     1483           
  Branches          ?        0           
=========================================
  Hits              ?     1381           
  Misses            ?      102           
  Partials          ?        0
Impacted Files Coverage Δ
streamz/sources.py 92.74% <100%> (ø)

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@martindurant (Member):

Green! Let's merge this!

@martindurant (Member):

(^ @mrocklin )

@jsmaupin (Contributor, author):

@mrocklin, let me know if there's anything I can do to help get this closed.

@mrocklin (Collaborator) left a review comment:

A few questions about async behavior.

        except Exception as e:
            future.set_exception(e)

        self.loop.add_callback(_)
@mrocklin (Collaborator):

Is the function above likely to take a non-trivial amount of time? If so, we probably can't place it onto the event loop.

stream = Stream.from_kafka([TOPIC], ARGS, asynchronous=True)
out = stream.sink_to_list()
stream.start()
sleep(5)
@mrocklin (Collaborator):

Why this sleep?

@jsmaupin (Contributor, author):

This can probably be removed. I'll take a closer look.

stream = Stream.from_kafka([TOPIC], ARGS)
out = stream.sink_to_list()
stream.start()
sleep(2)
@mrocklin (Collaborator):

Why this sleep? In general sleeps can be troublesome because:

  1. They make tests take a long time if they are overly conservative
  2. If the system running the tests is abnormally slow then they can fail (Travis is often abnormally slow)

I'm also concerned about using time.sleep here in an async test. This is usually an anti-pattern.
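A common alternative to a fixed sleep in tests is to poll a condition under a generous deadline, so the test finishes as soon as the condition holds and only fails if the system is genuinely stuck. This is a generic sketch (`wait_for` is a made-up helper), not the streamz test code:

```python
# Sketch: replace time.sleep(2) with a condition poll. The deadline is
# generous, but the wait ends as soon as the condition becomes true.
import threading
import time

def wait_for(predicate, timeout=10, interval=0.05):
    """Poll predicate() until it is true or the deadline passes."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if predicate():
            return True
        time.sleep(interval)
    raise TimeoutError("condition was not met before the deadline")


out = []
# Simulate an asynchronous source appending a result a little later.
threading.Timer(0.1, lambda: out.append("msg")).start()

wait_for(lambda: len(out) == 1)
print(out)  # ['msg']
```

On a fast machine this waits roughly 0.1 s instead of a fixed 2 s, and on an abnormally slow CI machine it simply keeps waiting up to the deadline rather than failing.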

@jsmaupin (Contributor, author):

Yes, I totally agree. I'll take some time to come up with a better solution.


    def _():
        while True:
            self.producer.poll(0)
@mrocklin (Collaborator):

What happens if we poll and the producer is still busy? How do we handle back pressure?

@jsmaupin (Contributor, author):

The idea is that this while loop will not exit while the producer queue is full. Am I breaking something here with regard to how async systems work? Will this block other parts of the program?
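One answer to the blocking question is to run the poll loop off the event loop, for example in a thread-pool executor. This is a hypothetical asyncio sketch with a stand-in producer; a Tornado-based streamz would use the equivalent IOLoop machinery rather than asyncio directly.

```python
# Sketch: offload a blocking poll loop to a thread so the event loop
# stays responsive. FakeProducer stands in for confluent_kafka.Producer;
# its poll() sleeps briefly to simulate a blocking librdkafka call.
import asyncio
import time

class FakeProducer:
    def __init__(self):
        self.polls = 0

    def poll(self, timeout):
        time.sleep(0.01)  # simulate the blocking client call
        self.polls += 1


async def main():
    producer = FakeProducer()
    loop = asyncio.get_running_loop()

    def blocking_poll_loop(n):
        # This loop would starve the event loop if run as a callback.
        for _ in range(n):
            producer.poll(0.1)

    # Run the blocking loop in the default thread-pool executor.
    poller = loop.run_in_executor(None, blocking_poll_loop, 5)
    ticks = 0
    while not poller.done():
        ticks += 1  # proof the event loop keeps running meanwhile
        await asyncio.sleep(0.005)
    await poller
    return producer.polls, ticks


polls, ticks = asyncio.run(main())
print(polls)  # 5
```

The key point for the review question above: a `while True: poll(0)` body scheduled directly on the loop never yields, so every other coroutine stalls; moving it to an executor (or yielding inside the loop) avoids that.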

@martindurant mentioned this pull request Mar 16, 2019
@jsmaupin (Contributor, author):

@mrocklin, thank you very much for the feedback.

@martindurant (Member):

@jsmaupin, I have extended your work and implemented Matt's suggestions in #230. I believe that PR is ready, and there should be no need for you to add anything here. The only thing that doesn't make the cut is to_kafka_batched, which can wait for a future PR.

@jsmaupin (Contributor, author):

So, I can close this PR?

@martindurant (Member):

I believe so. @mrocklin , are we close to merging #230 ?

@martindurant (Member):

This PR should now be closed.

@jsmaupin closed this Mar 20, 2019
@martindurant (Member):

Thanks @jsmaupin for your work here.
