Streaming.py Crash on Incomplete Read Error when tweets are very high... #448

Closed
jamesfebin opened this issue Jun 13, 2014 · 12 comments
Labels
Bug This is regarding a bug with the library Duplicate This is a duplicate

@jamesfebin

This error occurs when the volume of streamed tweets is very high at a particular time. For example, try streaming the World Cup hashtag during a game. Can anyone help fix this? The problem looks similar to https://dev.twitter.com/discussions/9554

Exception in thread Thread-1:
Traceback (most recent call last):
  File "/usr/lib/python2.7/threading.py", line 551, in __bootstrap_inner
    self.run()
  File "/usr/lib/python2.7/threading.py", line 504, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/usr/local/lib/python2.7/dist-packages/tweepy/streaming.py", line 173, in _run
    self._read_loop(resp)
  File "/usr/local/lib/python2.7/dist-packages/tweepy/streaming.py", line 220, in _read_loop
    d = resp.read(1)
  File "/usr/lib/python2.7/httplib.py", line 541, in read
    return self._read_chunked(amt)
  File "/usr/lib/python2.7/httplib.py", line 586, in _read_chunked
    raise IncompleteRead(''.join(value))

@waltersf

Yes, I'm facing this error only during high-volume streams of tweets. I suspect that the connection is being closed by Twitter because my program is not consuming data as fast as it is produced. I can monitor how many tweets my program processes per minute and the date (created_at) of the last processed tweet. Sometimes, the last processed tweet is 5 to 8 minutes behind the current time.

My program is running in a VM hosted in the US, and the database server (MongoDB) is running in Brazil. The next things I'll check are network latency and database throughput. I never had this problem before, when both the program and the database were running on the same network, even with high-volume streams.

@jamesfebin
Author

@waltersf The connection is not closed by Twitter. It's a bug in how httplib handles things. I patched streaming.py to suppress the error and continue streaming, because my application depends completely on real-time Twitter data and can't afford to stop for any reason. But I will only know whether my patch works tonight, during the match.

You can look at the code:

import httplib  # Python 2 module; needed for httplib.IncompleteRead

def _read_loop(self, resp):

    while self.running and not resp.isclosed():

        # Note: keep-alive newlines might be inserted before each length value.
        # Read until we get a digit...
        c = '\n'
        while c == '\n' and self.running and not resp.isclosed():
            try:
                c = resp.read(1)
            except httplib.IncompleteRead:
                # Patch: swallow the error instead of letting it kill the thread.
                break
        delimited_string = c

        # Read the rest of the delimiter length...
        d = ''
        while d != '\n' and self.running and not resp.isclosed():
            try:
                d = resp.read(1)
                delimited_string += d
            except httplib.IncompleteRead:
                # Patch: same suppression as above.
                break

@cbelden

cbelden commented Jul 7, 2014

Did this end up working for you? I just started receiving this error. Using v2.3.0.

@jamesfebin
Author

I suppressed the error. It's all working fine now.

Regards,

Febin John James
Co-Founder


@Aaron1011
Contributor

@jamesfebin: Would you mind opening a PR making that change? It looks great!

@rthijssen

Sorry all, but I don't think suppressing an exception is a good plan. If this is a bug in httplib, it should be fixed there. Does anyone have more info on what exactly is happening here? Is there a reason why resp.read(1) raises this exception?

@tewalds

tewalds commented Oct 20, 2014

Reading through the code, it seems httplib.read() raises IncompleteRead when socket.read() returns 0 bytes. Unfortunately, it can't distinguish between being disconnected and an interrupt causing the read call to return with no data. In practice, I only see this when I'm falling behind, meaning that Twitter cuts me off. It's easy to replicate this by listening for a popular keyword and then, in the on_status callback, sleeping for a bit to guarantee you're falling behind.

I guess you could try to distinguish between those two cases by keeping track of the delay between the tweet being posted and received. If it's greater than a couple seconds, you're probably falling behind and it's a disconnect. If it's real time, you're probably ok and should just continue. Alternatively, try reading again, and disconnect/reconnect if it happens again or you get some other error (socket closed?). Any better solutions?
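The first idea above (tracking the delay between a tweet being posted and received) could be sketched roughly as follows. This is only an illustration: the function names and the 5-second threshold are assumptions, not part of Tweepy, and both timestamps are assumed to be timezone-aware.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical threshold; the comment above only says "a couple seconds".
MAX_LAG = timedelta(seconds=5)


def stream_lag(created_at, received_at):
    """Return how far behind real time a tweet was when it was received."""
    return received_at - created_at


def is_falling_behind(created_at, received_at, max_lag=MAX_LAG):
    """True when the delay between posting and receipt exceeds max_lag."""
    return stream_lag(created_at, received_at) > max_lag


# Illustrative timestamps only.
posted = datetime(2014, 11, 3, 7, 10, 2, tzinfo=timezone.utc)
on_time = posted + timedelta(seconds=1)
late = posted + timedelta(minutes=3, seconds=19)

print(is_falling_behind(posted, on_time))  # False
print(is_falling_behind(posted, late))     # True
```

A check like this could run when an IncompleteRead is caught, to guess whether the stream was real time (probably safe to continue) or lagging (probably a disconnect).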

@rthijssen

I'm quite confident that you (@tewalds) are correct. I've just done multiple tests and I've noticed the following.

I ran Tweepy, streaming the sample API, 4 times. Every time I received a tweet, I displayed the time it was received and the created_at value provided by the Twitter API, which indicates when the tweet was posted. I came back with the following results:

  • I hope it goes without saying that the hour difference is because of a timezone difference. The minutes are what matter in these times.

TEST 1
    Stored at '2014-11-03 18:08:36.449828' - Created at '2014-11-03 07:08:35'
    Stored at '2014-11-03 18:13:21.657997' - Created at '2014-11-03 07:10:02'

TEST 2
    Stored at '2014-11-03 18:13:52.534743' - Created at '2014-11-03 07:13:51'
    Stored at '2014-11-03 18:18:16.642068' - Created at '2014-11-03 07:15:12'

TEST 3
    Stored at '2014-11-03 18:19:55.822804' - Created at '2014-11-03 07:19:55'
    Stored at '2014-11-03 18:23:06.322588' - Created at '2014-11-03 07:21:15'

As you can see, with each test I'm falling quite a bit behind the live stream. Obviously, either my script or my connection is having trouble keeping up with the data stream provided by Twitter.
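To put numbers on the lag in the results above, here is a small sketch that subtracts the timezone difference and prints the delay for each logged pair. The 11-hour offset is an assumption inferred from the hour difference in the logs, and `lag_seconds` is a name made up for this example.

```python
from datetime import datetime, timedelta

# Assumption: "Stored at" is local time, 11 hours ahead of "Created at".
UTC_OFFSET = timedelta(hours=11)

# (stored_at, created_at) pairs copied from the three tests above.
SAMPLES = [
    ("2014-11-03 18:08:36.449828", "2014-11-03 07:08:35"),
    ("2014-11-03 18:13:21.657997", "2014-11-03 07:10:02"),
    ("2014-11-03 18:13:52.534743", "2014-11-03 07:13:51"),
    ("2014-11-03 18:18:16.642068", "2014-11-03 07:15:12"),
    ("2014-11-03 18:19:55.822804", "2014-11-03 07:19:55"),
    ("2014-11-03 18:23:06.322588", "2014-11-03 07:21:15"),
]


def lag_seconds(stored_at, created_at):
    """Seconds between a tweet being created and being stored."""
    stored = datetime.strptime(stored_at, "%Y-%m-%d %H:%M:%S.%f")
    created = datetime.strptime(created_at, "%Y-%m-%d %H:%M:%S")
    return (stored - UTC_OFFSET - created).total_seconds()


lags = [lag_seconds(stored, created) for stored, created in SAMPLES]
for (stored, _), lag in zip(SAMPLES, lags):
    print(f"stored {stored}: {lag:.1f} s behind")
```

Under that offset assumption, the lag grows from about one second at the start of each run to between roughly 110 and 200 seconds at the end.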

Additionally, what is interesting is that the crash almost always happens at the 5-minute mark. This seems quite consistent with the rate at which 'stall_warnings' are sent out.

It appears the sample API supports the stall_warnings but Tweepy doesn't have stall_warnings implemented in the sample call. Is this for a particular reason?

That said, even when implementing the stall_warnings parameter into the sample function it appears I'm not receiving these warnings when appropriate.

Unfortunately I don't have a solution to the disconnecting problem. Mostly because if @tewalds is right there probably is no decent solution for this.

@tewalds

tewalds commented Nov 3, 2014

Well, the solution is easy: stop falling behind. This can either be achieved by listening to fewer or less popular words, or by processing them faster. You can probably achieve that by pushing the tweets to a different thread for the real processing.
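One way to push the real processing to a different thread is a producer/consumer queue. The sketch below uses only the standard library; `on_status` here is a plain function standing in for the stream callback, and the `upper()` call is a placeholder for slow work such as a database write.

```python
import queue
import threading

tweets = queue.Queue()
processed = []


def worker():
    """Consume tweets from the queue until a None sentinel arrives."""
    while True:
        item = tweets.get()
        if item is None:
            break
        processed.append(item.upper())  # placeholder for the slow processing


def on_status(status):
    """Stand-in for the stream callback: enqueue and return immediately."""
    tweets.put(status)


consumer = threading.Thread(target=worker)
consumer.start()

for text in ["goal", "offside", "penalty"]:
    on_status(text)

tweets.put(None)  # tell the worker to stop
consumer.join()
print(processed)  # ['GOAL', 'OFFSIDE', 'PENALTY']
```

Note that an unbounded queue only moves the backlog into memory: if processing stays slower than the arrival rate, the queue keeps growing, so the processing still has to keep up on average.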

In my case, the problem was that the streaming API did many socket.read() calls per tweet, and on App Engine every socket.read() is an API call over the network, so it was very slow. I fixed that in #496 by doing only 2 per read at most, and allowing buffering to read many at a time.
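The general idea of buffering reads instead of calling read(1) per byte can be sketched like this. It is not the code from #496: the `ReadBuffer` class, its method names, and the simplified framing (a length line followed by that many payload bytes) are assumptions made for illustration.

```python
import io


class ReadBuffer:
    """Read from a stream in chunks and serve lines or byte counts from memory."""

    def __init__(self, stream, chunk_size=512):
        self._stream = stream
        self._chunk_size = chunk_size
        self._buffer = b""

    def read_line(self):
        """Return the next line without its newline, or None at end of stream."""
        while b"\n" not in self._buffer:
            chunk = self._stream.read(self._chunk_size)
            if not chunk:
                # End of stream: hand back whatever is left, if anything.
                line, self._buffer = self._buffer, b""
                return line or None
            self._buffer += chunk
        line, _, self._buffer = self._buffer.partition(b"\n")
        return line

    def read_len(self, length):
        """Return up to `length` bytes, reading more chunks only when needed."""
        while len(self._buffer) < length:
            chunk = self._stream.read(self._chunk_size)
            if not chunk:
                break
            self._buffer += chunk
        data, self._buffer = self._buffer[:length], self._buffer[length:]
        return data


def read_messages(buf):
    """Collect length-delimited payloads, skipping keep-alive blank lines."""
    messages = []
    while True:
        line = buf.read_line()
        if line is None:
            break
        if not line.strip():
            continue
        messages.append(buf.read_len(int(line)))
    return messages


# A tiny chunk size is used here only to exercise the buffering logic.
stream = io.BytesIO(b"\n5\nhello3\nabc")
messages = read_messages(ReadBuffer(stream, chunk_size=4))
print(messages)  # [b'hello', b'abc']
```

With a realistic chunk size, a whole tweet usually arrives in one or two underlying reads rather than one read per byte.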

benfei pushed a commit to benfei/tweepy that referenced this issue Oct 16, 2016
This should handle both cases of incomplete read catched by requests, or
catched by tweepy.

This resolves tweepy#237, resolves tweepy#448, resolves tweepy#536, resolves tweepy#650,
resolves tweepy#691, resolves tweepy#798.

Similar to tweepy#498.
@Harmon758 Harmon758 added the Bug This is regarding a bug with the library label Apr 26, 2019
@Harmon758
Member

Marking as a duplicate of #237, but keeping this open until it's resolved, due to the relevant conversation in this thread.

@Harmon758 Harmon758 added the Duplicate This is a duplicate label Apr 26, 2019

@Harmon758
Member

Harmon758 commented Jan 19, 2021

This should now be resolved with 68e19cc by simply handling it as a connection error and attempting to reconnect.

It might be worth mentioning this behavior (Tweepy automatically attempting to reconnect when this connection error occurs because Twitter's API disconnected the stream for falling too far behind) in the documentation for streaming. I plan on improving the documentation for streaming at some point in the future, so if it hasn't been added to the documentation by then, I'll probably mention that Tweepy automatically attempts to reconnect when connection errors occur and note the StreamListener method that's called when that happens, as well as include this along with other reasons for connection errors.
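For readers who want to see the general shape of "treat IncompleteRead as a connection error and reconnect", here is a minimal standard-library sketch. It is not the code from 68e19cc: `run_with_reconnect`, its parameters, and the exponential backoff are assumptions for illustration, and it uses `http.client`, the Python 3 name of `httplib`.

```python
import http.client
import time


def run_with_reconnect(connect_and_read, max_retries=3, base_delay=1.0):
    """Call connect_and_read, retrying with backoff on connection errors."""
    attempts = 0
    while True:
        try:
            return connect_and_read()
        except (http.client.IncompleteRead, ConnectionError):
            attempts += 1
            if attempts > max_retries:
                raise  # give up and surface the last error
            time.sleep(base_delay * 2 ** (attempts - 1))


# Demo with a fake stream that fails twice before succeeding.
calls = {"n": 0}


def flaky_stream():
    calls["n"] += 1
    if calls["n"] < 3:
        raise http.client.IncompleteRead(b"")
    return "connected"


result = run_with_reconnect(flaky_stream, max_retries=5, base_delay=0)
print(result, calls["n"])  # connected 3
```

Reconnecting only restores the stream. If the consumer keeps falling behind, the disconnects will presumably recur, so it complements rather than replaces faster processing.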

@jamesfebin For code block usage, see https://docs.github.com/en/github/writing-on-github/creating-and-highlighting-code-blocks.

@rthijssen stall_warnings has been added as a parameter for Stream.sample with #701 (bc3bbc1) now as part of v3.6.0 and newer.
