A Simpler Fix to the Streaming Code due to Changes from Twitter on Jan. 13, 2014. #196

adonoho · 2014-01-15T15:14:00Z

Gentlefolk,

First allow me to applaud and thank RouxRC on his fine work pushing this little library forward. This pull request would not have been possible without his efforts. That said, RouxRC has dropped a quite large pull request that had problems in my Python3 system. When I rolled back to Python2, the system started working but would not stay connected for very long.

As a result of the above, I felt I needed to look into RouxRC's pull request in more detail and use just the parts I needed to get my streams running again. Any errors in this pull request are mine and do not reflect negatively on RouxRC or his code.

RouxRC's streaming patch falls into two main areas. First, it modifies the api.py to respect a new property, gzip. I have largely copied his work here. I did simplify how he set the headers from 3 lines to one. The second fix occurs in stream.py. Once the gzip compression was turned off, I was able to see what Twitter is now sending us. Even though length delimiters are turned off, Twitter nonetheless is inserting a hex encoded number and a \r\n between JSON items. RouxRC and I differ in how we chose to address these changes. My fix is focussed upon handling those extra items if they appear in the stream, removing them and then returning to the existing control flow. The important code block is below and my changes occur with the if statement:

    while True:
        try:
            utf8_buf = self.buf.decode('utf8').lstrip()
            if utf8_buf and utf8_buf[0] != '{':  # Remove the hex delimiter length and extra whitespace.
                utf8_buf = utf8_buf.lstrip('0123456789abcdefABCDEF')
                utf8_buf = utf8_buf.lstrip()
            res, ptr = self.decoder.raw_decode(utf8_buf)
            self.buf = utf8_buf[ptr:].encode('utf8')
            yield wrap_response(res, self.handle.headers)

RouxRC's code does this in a different way. I believe my solution is both shorter and more robust. But you, gentle reader, are the ultimate arbiter. RouxRC did identify in his code some opportunities to raise TwitterHTTPErrors. I've tried to duplicate those.

I've run this set of fixes under Python 3.3.3 and Python 2.7.6 both on OS X 10.8.5 (Mountain Lion).

In addition to the changes to the library, I modernized a test script, stream-example to take the OAuth parameters needed by modern Twitter. As I also use PyCharm, I've added .idea to the .gitignore file.

Again, this simpler fix would not have been possible without the fine work of RouxRC. Thank you RouxRC.

Anon,
Andrew

…d29eae] & [642c4a4]

…ports to function from within the module.

RouxRC · 2014-01-15T15:54:02Z

twitter/stream.py

@@ -34,6 +36,9 @@ def __iter__(self):
        while True:
            try:
                utf8_buf = self.buf.decode('utf8').lstrip()
+                if utf8_buf and utf8_buf[0] != '{':  # Remove the hex delimiter length and extra whitespace.
+                    utf8_buf = utf8_buf.lstrip('0123456789abcdefABCDEF')
+                    utf8_buf = utf8_buf.lstrip()


Need to try it out but I'm worried this should break the delimited=true argument from twitter API by doing this this way

RouxRC · 2014-01-15T16:59:58Z

All right, so I did a few tryouts and I confirm there are still a few issues there:

delimited=true arg from twitter is logically not working with this version, I guess this option is more of a tech thing that could typically be useful to precise the size to receive from sock and users should generally not want to use it, but I feel like it would not be very consistent/transparent to drop the option support considering the whole concept of this lib is to provide an abstraction to the twitter API
userstream is still broken: in your fix you only remove head hexa from the buffer and not from the data received from the sock which also can contain hexa when receiving further pieces of incomplete data.
You can test userstream with:
```
t = twitter.TwitterStream(domain="userstream.twitter.com", auth=twitter.OAuth(OAUTH_TOKEN, OAUTH_SECRET, KEY, SECRET))
for a in t.user(track="bieber,gaga"):
  if a:
     print a  
```

But the gzip simplification is indeed better, I'll add it to my branch, and the lstrip way is certainly a better path than my re way. I'll try and port it to my branch later on.

Btw, I removed completely in my version the except TwitterHTTPErrror which you commented here as pointless, I agree with you.
Thanks also for the update of the example script, I completely forgot about that one, will take your commit as well.

…eet.

adonoho · 2014-01-17T15:23:23Z

RouxRC,

People embrace small changes.

I reverted to the master branch and just focussed upon the problem facing me: the broken public streams of January 13th. Everyone can judge whether I solved the problem of the moment. For my purposes, I did. You believe that it doesn't handle user streams. While I don't use this library for user streams, as a public service, I'll take a look at those later today.

People also don't want to lose capabilities, such as Python3 support. This pull request supports both Python dialects. I hope you migrate your pull request to support both dialects.

Anon,
Andrew

RouxRC · 2014-01-17T16:17:44Z

I just did yes (cf 82c7c43). My branch now works with both version of python.

All I was saying is that the bug we try to fix here corresponds to the same source than the bug that broke userstream when it migrated 2 months earlier than the public one. Fixing only one of the two here seems to me like as bad as not supporting python 3 imho, it's just as much incomplete and "losing capabilities" :)
And like I said, this fix also breaks support of an option existing in Twitter's API specs for all streaming endpoints, even though this option may not be most useful for the users of this library.

adonoho · 2014-01-20T13:31:40Z

The latest fix addresses RouxRC's concerns about the user stream. I have reworked my fix to improve the logic in two ways. First, the chunk header processing is removed from buffer management. That was the wrong place for it. Second, the sock.recv() call has been replaced with a method that handles reading in HTTP chunks and then sending the full result to the parser/buffer management part of the code. This highly localizes the changes and preserves the structure and logic of Mr. Verdone's code. As a happy side effect, because the read is typically larger than the prior code, the new code appears to be more efficient. It spends less time parsing incomplete JSON. I have also extended the stream_example.py to support the user and site streams.

This code has run for 3 days and processed over 12M tweets. While it could still have problems, I am quite happy with it and hope Mr. Verdone chooses to base the canonical fix for his important library on it.

RouxRC · 2014-01-20T13:51:35Z

Thanks Andrew, this looks a lot better indeed! And it looks like the delimited argument should work with this version now as well. It makes total sense this would be more efficient this way, I was hesitating doing something like this with the chunks size as discussed earlier. Glad to see it's working well when implemented!
I'll try and merge your branch into mine as well.

ptwobrussell · 2014-01-23T17:26:24Z

Hi folks - Just wanted to express my support for this fix, and ask if there's anything I can specifically do to help? I'm tied up until the evening, but would be glad to jump in and help with whatever may be worthwhile at this stage.

adonoho · 2014-01-23T18:25:01Z

Matthew,

The key thing needed right now is that the patch needs to be used and abused. I primarily use it sitting on the sample and filter endpoints. At this point, it has probably downloaded 25M+ tweets to my systems. It has had a modest number of timeouts, but then any network app will have timeouts. I don't particularly use it to track the site stream or user accounts. (I should probably set up a daemon to track my personal account to test the code.)

BTW, I did not get a chance to attend your seminar at Data Day Texas. Folks tell me it a good presentation. I saw your talk though. Keep up the good work.

Anon,
Andrew

…This reduces memory consumption and is faster than the += operator for buffer concatenation and trimming.

adonoho · 2014-01-24T15:53:57Z

Gentlefolk,

I am a programmer who appreciates precise code. As Python is new to me, though I am an old and grumpy programmer, I am always looking for how to improve the code.

The prior patch, while simplistically Pythonic, was not efficient and wasted memory with many large extra buffers being created and thrown away. I've chosen to handle it by just parsing what is needed into an int() for the chunk size determination and then moving to a byte array to handle the perhaps multi-stage read from the underlying socket. Also, the byte array is used to trim the trailing CRLF off of the buffer, saving us a needless trip through the JSON parser.

In other words, this patch has no effect on your data. It has an affect on your memory usage/garbage collection. It reduces both. If you are happy with the prior patch, then you need not upgrade to this patch. I am using it in my production stack for the last 12+ hours.

Also, as before, this patch has been tested on Python v3.3.3 and Python v2.7.6 on OS X 10.8.5 (Mountain Lion). Because I use bytearray(), I do not know whether this patch will run on Python v2.6.x. As I understand it, it should. bytearray() is a supported Python 2.6.x datatype.

Anon,
Andrew

…y use by using a chunk sized bytearray.

…ion. This slightly improves locality of code and data.

…fy buffer management to only operate on strings and not re-encode the string as bytes. This improves readability at the expense of breakage if Twitter starts spanning JSON across HTTP chunks. This is an unlikely change to their infrastructure. That said, this is a totally optional patch.

ptwobrussell · 2014-01-27T20:35:19Z

@adonoho - Thanks, I'll be testing this starting here in a bit and let you know if I find anything worth thinking about. Out of curiosity, does this path also address #194? If not, is it compatible with it?

All HTTP chunks are read in their entirety. Cosmetic code improvements. (The socket's blocking state is set in a more compact form after a DeMorgan's boolean transformation.) Hangups by Twitter, as with timeouts, are signaled via a message to allow gracious recovery.

adonoho · 2014-01-28T14:32:19Z

I stage my fixes to this patch, pr-fix-stream, on the branch adonoho:fix-stream. Per your request above, it now addresses, I believe, the issue addressed by #194. I expect to move adonoho:fix-stream to pr-fix-stream after the CoB today.

Simplify protocol parsing checks. Move HTTP exception catching to where it might actually be thrown.

A few cosmetic edits of the iterator function.

…n the Twitter stream.

adonoho · 2014-01-30T13:43:39Z

Gentlefolk,

This is a candidate release patch. I propose it become the formal branch of this library and have dubbed it version v1.10.3. I once again formally thank RouxRC for his efforts moving this library forward. Any errors in this patch remain mine and do not reflect upon RouxRC or his code.

This library is a high performance streaming library. Compared to other Twitter libraries, it is easily an order of magnitude faster at delivering tweets to your application. Why is that? When streaming, this library pierces Python's urllib abstraction and takes control of the socket. It interprets the HTTP stream directly. That makes it fast. It also makes it vulnerable to changes. It needed to be upgraded when Twitter upgraded the protocol version.

Twitter's switch to HTTP v1.1 was long overdue.

Summary of changes:

Based upon RouxRC's code, I turned off gzip compression. My version is slightly different than RouxRC's version.
Instead of incrementally reading arbitrary lengths of bytes from the socket and seeing if they parse in the JSON parser, a good technique, the switch to HTTP chunking forced us to process in chunk sized blocks. Based upon inspection, Twitter never sends partial JSON in a chunk. They also send keep-alive delimiters in single 7 byte long chunks. This code depends upon both of these observations. It does not do general purpose HTTP chunk processing. It is a Twitter specific HTTP chunk parser.
Chunk oriented processing allowed me to isolate stream interpretation to the chunk code and migrate the wrapper code to operate exclusively using strings. This makes the wrapper code more readable.
Once I had opened up the wrapper code, I cleaned it up. This involved modest edits in how certain socket parameters were determined and moving data exclusive to the generator into the generator and out of the containing object.
As this is exclusively socket oriented code, the HTTP exception catching was removed from the method. The exception was moved to wrap the opening of the socket by url lib.
Due to reading the data in larger chunks and, hence, running it through the JSON parser less often, this code is about 10% faster than the prior generation.
When Twitter hangs up on us, this code emits a hangup message in the stream.
This code has been tested using Python v2.7.6 and v3.3.3 on OS X 10.8.5 (Mountain Lion). I have tested it on the high volume sample stream and on a user stream under both versions of Python. It is believed, but not tested, that it will function under Python v2.6.x. It uses the bytearray type. I believe that has been back ported all the way to Python v2.6.x. As the code is not particularly tricky, I do not foresee that it has introduced any new issues that were not already apparent in this library.
I use this patch in production and have captured 50M+ tweets with it. It is solid and reliable. If you find it to not be so, please contact me. I use it in production and have a vested interest in ensuring that it catches all corner cases.

Thank you for your patience while I refine this patch and I ask Mr. Verdone to select this patch as the basis for moving this library forward.

Enjoy and Anon,
Andrew

np1 · 2014-01-31T00:28:40Z

Great job, thanks for your work on this. So far it is working well. Will let you know if I run into any issues with it.

sixohsix · 2014-02-03T21:42:58Z

Yowza!

Whoa, so I travel for most of December and January, and in the interim Twitter utterly breaks my library, and then a task force self-assembles to totally fix the mess. Great work!

As I understand it, this is the "best" branch in that it is 2.7 and 3.3+ compatible, and fixes the specific blocking issues. Is that correct? If so I'll merge this and release it as 1.11.

@RouxRC if I merge this it will probably break your branch. Are you willing to resubmit your improvements afterward? They can go out as 1.12.

RouxRC · 2014-02-03T21:48:43Z

Yep no worries, like I said, I'll keep using my branch until I get the time to port the other changes sometime in the coming weeks. Let's move forward :)

A Simpler Fix to the Streaming Code due to Changes from Twitter on Jan. 13, 2014. Gentlefolk, This is a candidate release patch. I propose it become the formal branch of this library and have dubbed it version v1.10.3. I once again formally thank RouxRC for his efforts moving this library forward. Any errors in this patch remain mine and do not reflect upon RouxRC or his code. This library is a high performance streaming library. Compared to other Twitter libraries, it is easily an order of magnitude faster at delivering tweets to your application. Why is that? When streaming, this library pierces Python's urllib abstraction and takes control of the socket. It interprets the HTTP stream directly. That makes it fast. It also makes it vulnerable to changes. It needed to be upgraded when Twitter upgraded the protocol version. Twitter's switch to HTTP v1.1 was long overdue. Summary of changes: - Based upon RouxRC's code, I turned off gzip compression. My version is slightly different than RouxRC's version. - Instead of incrementally reading arbitrary lengths of bytes from the socket and seeing if they parse in the JSON parser, a good technique, the switch to HTTP chunking forced us to process in chunk sized blocks. Based upon inspection, Twitter never sends partial JSON in a chunk. They also send keep-alive delimiters in single 7 byte long chunks. This code depends upon both of these observations. It does not do general purpose HTTP chunk processing. It is a Twitter specific HTTP chunk parser. - Chunk oriented processing allowed me to isolate stream interpretation to the chunk code and migrate the wrapper code to operate exclusively using strings. This makes the wrapper code more readable. - Once I had opened up the wrapper code, I cleaned it up. This involved modest edits in how certain socket parameters were determined and moving data exclusive to the generator into the generator and out of the containing object. - As this is exclusively socket oriented code, the HTTP exception catching was removed from the method. The exception was moved to wrap the opening of the socket by url lib. - Due to reading the data in larger chunks and, hence, running it through the JSON parser less often, this code is about 10% faster than the prior generation. - When Twitter hangs up on us, this code emits a `hangup` message in the stream. - This code has been tested using Python v2.7.6 and v3.3.3 on OS X 10.8.5 (Mountain Lion). I have tested it on the high volume sample stream and on a user stream under both versions of Python. It is believed, but not tested, that it will function under Python v2.6.x. It uses the bytearray type. I believe that has been back ported all the way to Python v2.6.x. As the code is not particularly tricky, I do not foresee that it has introduced any new issues that were not already apparent in this library. - I use this patch in production and have captured 50M+ tweets with it. It is solid and reliable. If you find it to not be so, please contact me. I use it in production and have a vested interest in ensuring that it catches all corner cases. Thank you for your patience while I refine this patch and I ask Mr. Verdone to select this patch as the basis for moving this library forward. Enjoy and Anon, Andrew

adonoho added 6 commits January 14, 2014 17:58

Ignore .idea files from PyCharm.

128ec04

Do not send gzip headers for streaming calls. (ref: RouxRC [73efaca])

8631806

Fix the uri extension by attribute name.

ff3ca19

Change some default attributes and add TwitterHTTPError (ref: RouxRC [3…

25ea832

…d29eae] & [642c4a4]

Test for delimiters in the stream and removes them. Add comments.

d488eec

Update to use OAuth, take in command line arguments and modify the im…

95d4980

…ports to function from within the module.

RouxRC reviewed Jan 15, 2014
View reviewed changes

adonoho added 2 commits January 16, 2014 15:10

Move the potentially uninitialized values out of the if test.

d908997

Increase the size of the read buffer to be larger than the average tw…

ef99d73

…eet.

adonoho added 4 commits January 17, 2014 14:53

Add support for both user and site streams.

7333aa5

Bring HTTP chunk downloading into its own separate method.

57aa6d8

Cosmetic edit

54555a7

Merge branch 'fix-stream' into pr-fix-stream

b8cdd54

ptwobrussell mentioned this pull request Jan 23, 2014

Hanging on Chapter 9: Example 8 ptwobrussell/Mining-the-Social-Web-2nd-Edition#133

Closed

adonoho added 3 commits January 23, 2014 17:44

Minimize string decoding and move to use a bytearray for the buffer. …

c0fc741

…This reduces memory consumption and is faster than the += operator for buffer concatenation and trimming.

Cosmetic edits.

02bce53

Merge branch 'fix-stream' into pr-fix-stream

db75126

adonoho added 2 commits January 26, 2014 15:02

Move recv_chunk() into a stand alone function. Further minimize memor…

2693800

…y use by using a chunk sized bytearray.

Move variables out of the iterator class and into the generator funct…

e28a1da

…ion. This slightly improves locality of code and data.

RouxRC mentioned this pull request Jan 27, 2014

Userstream not working anymore. #190

Closed

adonoho added 2 commits January 28, 2014 08:13

Bump the version number.

cd2fbdf

adonoho added 7 commits January 28, 2014 10:18

Remove all keep-alive delimiters to allow the hangup patch to function.

c20d1a8

Remove socket timeout mutation code.

0d92536

Simplify protocol parsing checks. Move HTTP exception catching to where it might actually be thrown.

Set a timeout on the main sample stream to test that code path.

b01fa3f

Handle HTTP chunks that only contain keep-alive delimiters.

443e409

A few cosmetic edits of the iterator function.

Add comments detailing why we can avoid handling certain edge cases i…

3e782f6

…n the Twitter stream.

Clarify the comment about edge cases.

a8880f9

Merge branch 'fix-stream' into pr-fix-stream

12bb62d

RouxRC mentioned this pull request Feb 3, 2014

Fix broken userstream and all other streams supposed to break after Jan 13 + add image support + fix diverses issues #195

Closed

sixohsix merged commit 40834b5 into python-twitter-tools:master Feb 3, 2014

sixohsix mentioned this pull request Feb 4, 2014

Don't busywait forever if Twitter hangs up on us. #194

Closed

DarkDefender mentioned this pull request Feb 4, 2014

Home timeline no longer updates automatically DarkDefender/weetweet#3

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

A Simpler Fix to the Streaming Code due to Changes from Twitter on Jan. 13, 2014. #196

A Simpler Fix to the Streaming Code due to Changes from Twitter on Jan. 13, 2014. #196

adonoho commented Jan 15, 2014

RouxRC Jan 15, 2014

RouxRC commented Jan 15, 2014

adonoho commented Jan 17, 2014

RouxRC commented Jan 17, 2014

adonoho commented Jan 20, 2014

RouxRC commented Jan 20, 2014

ptwobrussell commented Jan 23, 2014

adonoho commented Jan 23, 2014

adonoho commented Jan 24, 2014

ptwobrussell commented Jan 27, 2014

adonoho commented Jan 28, 2014

adonoho commented Jan 30, 2014

np1 commented Jan 31, 2014

sixohsix commented Feb 3, 2014

RouxRC commented Feb 3, 2014

A Simpler Fix to the Streaming Code due to Changes from Twitter on Jan. 13, 2014. #196

A Simpler Fix to the Streaming Code due to Changes from Twitter on Jan. 13, 2014. #196

Conversation

adonoho commented Jan 15, 2014

RouxRC Jan 15, 2014

Choose a reason for hiding this comment

RouxRC commented Jan 15, 2014

adonoho commented Jan 17, 2014

RouxRC commented Jan 17, 2014

adonoho commented Jan 20, 2014

RouxRC commented Jan 20, 2014

ptwobrussell commented Jan 23, 2014

adonoho commented Jan 23, 2014

adonoho commented Jan 24, 2014

ptwobrussell commented Jan 27, 2014

adonoho commented Jan 28, 2014

adonoho commented Jan 30, 2014

np1 commented Jan 31, 2014

sixohsix commented Feb 3, 2014

RouxRC commented Feb 3, 2014