Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

A Simpler Fix to the Streaming Code due to Changes from Twitter on Jan. 13, 2014. #196

Merged
merged 27 commits into from Feb 3, 2014

Conversation

adonoho
Copy link
Contributor

@adonoho adonoho commented Jan 15, 2014

Gentlefolk,

First allow me to applaud and thank RouxRC on his fine work pushing this little library forward. This pull request would not have been possible without his efforts. That said, RouxRC has dropped a quite large pull request that had problems in my Python3 system. When I rolled back to Python2, the system started working but would not stay connected for very long.

As a result of the above, I felt I needed to look into RouxRC's pull request in more detail and use just the parts I needed to get my streams running again. Any errors in this pull request are mine and do not reflect negatively on RouxRC or his code.

RouxRC's streaming patch falls into two main areas. First, it modifies the api.py to respect a new property, gzip. I have largely copied his work here. I did simplify how he set the headers from 3 lines to one. The second fix occurs in stream.py. Once the gzip compression was turned off, I was able to see what Twitter is now sending us. Even though length delimiters are turned off, Twitter nonetheless is inserting a hex encoded number and a \r\n between JSON items. RouxRC and I differ in how we chose to address these changes. My fix is focussed upon handling those extra items if they appear in the stream, removing them and then returning to the existing control flow. The important code block is below and my changes occur with the if statement:

    while True:
        try:
            utf8_buf = self.buf.decode('utf8').lstrip()
            if utf8_buf and utf8_buf[0] != '{':  # Remove the hex delimiter length and extra whitespace.
                utf8_buf = utf8_buf.lstrip('0123456789abcdefABCDEF')
                utf8_buf = utf8_buf.lstrip()
            res, ptr = self.decoder.raw_decode(utf8_buf)
            self.buf = utf8_buf[ptr:].encode('utf8')
            yield wrap_response(res, self.handle.headers)

RouxRC's code does this in a different way. I believe my solution is both shorter and more robust. But you, gentle reader, are the ultimate arbiter. RouxRC did identify in his code some opportunities to raise TwitterHTTPErrors. I've tried to duplicate those.

I've run this set of fixes under Python 3.3.3 and Python 2.7.6 both on OS X 10.8.5 (Mountain Lion).

In addition to the changes to the library, I modernized a test script, stream-example to take the OAuth parameters needed by modern Twitter. As I also use PyCharm, I've added .idea to the .gitignore file.

Again, this simpler fix would not have been possible without the fine work of RouxRC. Thank you RouxRC.

Anon,
Andrew

@@ -34,6 +36,9 @@ def __iter__(self):
while True:
try:
utf8_buf = self.buf.decode('utf8').lstrip()
if utf8_buf and utf8_buf[0] != '{': # Remove the hex delimiter length and extra whitespace.
utf8_buf = utf8_buf.lstrip('0123456789abcdefABCDEF')
utf8_buf = utf8_buf.lstrip()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need to try it out but I'm worried this should break the delimited=true argument from twitter API by doing this this way

@RouxRC
Copy link
Member

RouxRC commented Jan 15, 2014

All right, so I did a few tryouts and I confirm there are still a few issues there:

  • delimited=true arg from twitter is logically not working with this version, I guess this option is more of a tech thing that could typically be useful to precise the size to receive from sock and users should generally not want to use it, but I feel like it would not be very consistent/transparent to drop the option support considering the whole concept of this lib is to provide an abstraction to the twitter API
  • userstream is still broken: in your fix you only remove head hexa from the buffer and not from the data received from the sock which also can contain hexa when receiving further pieces of incomplete data.
    You can test userstream with:
    t = twitter.TwitterStream(domain="userstream.twitter.com", auth=twitter.OAuth(OAUTH_TOKEN, OAUTH_SECRET, KEY, SECRET))
    for a in t.user(track="bieber,gaga"):
      if a:
         print a  
    

But the gzip simplification is indeed better, I'll add it to my branch, and the lstrip way is certainly a better path than my re way. I'll try and port it to my branch later on.

Btw, I removed completely in my version the except TwitterHTTPErrror which you commented here as pointless, I agree with you.
Thanks also for the update of the example script, I completely forgot about that one, will take your commit as well.

@adonoho
Copy link
Contributor Author

adonoho commented Jan 17, 2014

RouxRC,

People embrace small changes.

I reverted to the master branch and just focussed upon the problem facing me: the broken public streams of January 13th. Everyone can judge whether I solved the problem of the moment. For my purposes, I did. You believe that it doesn't handle user streams. While I don't use this library for user streams, as a public service, I'll take a look at those later today.

People also don't want to lose capabilities, such as Python3 support. This pull request supports both Python dialects. I hope you migrate your pull request to support both dialects.

Anon,
Andrew

@RouxRC
Copy link
Member

RouxRC commented Jan 17, 2014

I just did yes (cf 82c7c43). My branch now works with both version of python.

All I was saying is that the bug we try to fix here corresponds to the same source than the bug that broke userstream when it migrated 2 months earlier than the public one. Fixing only one of the two here seems to me like as bad as not supporting python 3 imho, it's just as much incomplete and "losing capabilities" :)
And like I said, this fix also breaks support of an option existing in Twitter's API specs for all streaming endpoints, even though this option may not be most useful for the users of this library.

@adonoho
Copy link
Contributor Author

adonoho commented Jan 20, 2014

The latest fix addresses RouxRC's concerns about the user stream. I have reworked my fix to improve the logic in two ways. First, the chunk header processing is removed from buffer management. That was the wrong place for it. Second, the sock.recv() call has been replaced with a method that handles reading in HTTP chunks and then sending the full result to the parser/buffer management part of the code. This highly localizes the changes and preserves the structure and logic of Mr. Verdone's code. As a happy side effect, because the read is typically larger than the prior code, the new code appears to be more efficient. It spends less time parsing incomplete JSON. I have also extended the stream_example.py to support the user and site streams.

This code has run for 3 days and processed over 12M tweets. While it could still have problems, I am quite happy with it and hope Mr. Verdone chooses to base the canonical fix for his important library on it.

@RouxRC
Copy link
Member

RouxRC commented Jan 20, 2014

Thanks Andrew, this looks a lot better indeed! And it looks like the delimited argument should work with this version now as well. It makes total sense this would be more efficient this way, I was hesitating doing something like this with the chunks size as discussed earlier. Glad to see it's working well when implemented!
I'll try and merge your branch into mine as well.

@ptwobrussell
Copy link
Contributor

Hi folks - Just wanted to express my support for this fix, and ask if there's anything I can specifically do to help? I'm tied up until the evening, but would be glad to jump in and help with whatever may be worthwhile at this stage.

@adonoho
Copy link
Contributor Author

adonoho commented Jan 23, 2014

Matthew,

The key thing needed right now is that the patch needs to be used and abused. I primarily use it sitting on the sample and filter endpoints. At this point, it has probably downloaded 25M+ tweets to my systems. It has had a modest number of timeouts, but then any network app will have timeouts. I don't particularly use it to track the site stream or user accounts. (I should probably set up a daemon to track my personal account to test the code.)

BTW, I did not get a chance to attend your seminar at Data Day Texas. Folks tell me it a good presentation. I saw your talk though. Keep up the good work.

Anon,
Andrew

…This reduces memory consumption and is faster than the += operator for buffer concatenation and trimming.
@adonoho
Copy link
Contributor Author

adonoho commented Jan 24, 2014

Gentlefolk,

I am a programmer who appreciates precise code. As Python is new to me, though I am an old and grumpy programmer, I am always looking for how to improve the code.

The prior patch, while simplistically Pythonic, was not efficient and wasted memory with many large extra buffers being created and thrown away. I've chosen to handle it by just parsing what is needed into an int() for the chunk size determination and then moving to a byte array to handle the perhaps multi-stage read from the underlying socket. Also, the byte array is used to trim the trailing CRLF off of the buffer, saving us a needless trip through the JSON parser.

In other words, this patch has no effect on your data. It has an affect on your memory usage/garbage collection. It reduces both. If you are happy with the prior patch, then you need not upgrade to this patch. I am using it in my production stack for the last 12+ hours.

Also, as before, this patch has been tested on Python v3.3.3 and Python v2.7.6 on OS X 10.8.5 (Mountain Lion). Because I use bytearray(), I do not know whether this patch will run on Python v2.6.x. As I understand it, it should. bytearray() is a supported Python 2.6.x datatype.

Anon,
Andrew

…ion. This slightly improves locality of code and data.
…fy buffer management to only operate on strings and not re-encode the string as bytes. This improves readability at the expense of breakage if Twitter starts spanning JSON across HTTP chunks. This is an unlikely change to their infrastructure. That said, this is a totally optional patch.
@ptwobrussell
Copy link
Contributor

@adonoho - Thanks, I'll be testing this starting here in a bit and let you know if I find anything worth thinking about. Out of curiosity, does this path also address #194? If not, is it compatible with it?

All HTTP chunks are read in their entirety.
Cosmetic code improvements. (The socket's blocking state is set in a more compact form after a DeMorgan's boolean transformation.)
Hangups by Twitter, as with timeouts, are signaled via a message to allow gracious recovery.
@adonoho
Copy link
Contributor Author

adonoho commented Jan 28, 2014

I stage my fixes to this patch, pr-fix-stream, on the branch adonoho:fix-stream. Per your request above, it now addresses, I believe, the issue addressed by #194. I expect to move adonoho:fix-stream to pr-fix-stream after the CoB today.

@adonoho
Copy link
Contributor Author

adonoho commented Jan 30, 2014

Gentlefolk,

This is a candidate release patch. I propose it become the formal branch of this library and have dubbed it version v1.10.3. I once again formally thank RouxRC for his efforts moving this library forward. Any errors in this patch remain mine and do not reflect upon RouxRC or his code.

This library is a high performance streaming library. Compared to other Twitter libraries, it is easily an order of magnitude faster at delivering tweets to your application. Why is that? When streaming, this library pierces Python's urllib abstraction and takes control of the socket. It interprets the HTTP stream directly. That makes it fast. It also makes it vulnerable to changes. It needed to be upgraded when Twitter upgraded the protocol version.

Twitter's switch to HTTP v1.1 was long overdue.

Summary of changes:

  • Based upon RouxRC's code, I turned off gzip compression. My version is slightly different than RouxRC's version.
  • Instead of incrementally reading arbitrary lengths of bytes from the socket and seeing if they parse in the JSON parser, a good technique, the switch to HTTP chunking forced us to process in chunk sized blocks. Based upon inspection, Twitter never sends partial JSON in a chunk. They also send keep-alive delimiters in single 7 byte long chunks. This code depends upon both of these observations. It does not do general purpose HTTP chunk processing. It is a Twitter specific HTTP chunk parser.
  • Chunk oriented processing allowed me to isolate stream interpretation to the chunk code and migrate the wrapper code to operate exclusively using strings. This makes the wrapper code more readable.
  • Once I had opened up the wrapper code, I cleaned it up. This involved modest edits in how certain socket parameters were determined and moving data exclusive to the generator into the generator and out of the containing object.
  • As this is exclusively socket oriented code, the HTTP exception catching was removed from the method. The exception was moved to wrap the opening of the socket by url lib.
  • Due to reading the data in larger chunks and, hence, running it through the JSON parser less often, this code is about 10% faster than the prior generation.
  • When Twitter hangs up on us, this code emits a hangup message in the stream.
  • This code has been tested using Python v2.7.6 and v3.3.3 on OS X 10.8.5 (Mountain Lion). I have tested it on the high volume sample stream and on a user stream under both versions of Python. It is believed, but not tested, that it will function under Python v2.6.x. It uses the bytearray type. I believe that has been back ported all the way to Python v2.6.x. As the code is not particularly tricky, I do not foresee that it has introduced any new issues that were not already apparent in this library.
  • I use this patch in production and have captured 50M+ tweets with it. It is solid and reliable. If you find it to not be so, please contact me. I use it in production and have a vested interest in ensuring that it catches all corner cases.

Thank you for your patience while I refine this patch and I ask Mr. Verdone to select this patch as the basis for moving this library forward.

Enjoy and Anon,
Andrew

@np1
Copy link
Contributor

np1 commented Jan 31, 2014

Great job, thanks for your work on this. So far it is working well. Will let you know if I run into any issues with it.

@sixohsix
Copy link
Collaborator

sixohsix commented Feb 3, 2014

Yowza!

Whoa, so I travel for most of December and January, and in the interim Twitter utterly breaks my library, and then a task force self-assembles to totally fix the mess. Great work!

As I understand it, this is the "best" branch in that it is 2.7 and 3.3+ compatible, and fixes the specific blocking issues. Is that correct? If so I'll merge this and release it as 1.11.

@RouxRC if I merge this it will probably break your branch. Are you willing to resubmit your improvements afterward? They can go out as 1.12.

@RouxRC
Copy link
Member

RouxRC commented Feb 3, 2014

Yep no worries, like I said, I'll keep using my branch until I get the time to port the other changes sometime in the coming weeks. Let's move forward :)

sixohsix added a commit that referenced this pull request Feb 3, 2014
A Simpler Fix to the Streaming Code due to Changes from Twitter on Jan. 13, 2014.
Gentlefolk,

This is a candidate release patch. I propose it become the formal branch of this library and have dubbed it version v1.10.3. I once again formally thank RouxRC for his efforts moving this library forward. Any errors in this patch remain mine and do not reflect upon RouxRC or his code.

This library is a high performance streaming library. Compared to other Twitter libraries, it is easily an order of magnitude faster at delivering tweets to your application. Why is that? When streaming, this library pierces Python's urllib abstraction and takes control of the socket. It interprets the HTTP stream directly. That makes it fast. It also makes it vulnerable to changes. It needed to be upgraded when Twitter upgraded the protocol version.

Twitter's switch to HTTP v1.1 was long overdue. 

Summary of changes:

- Based upon RouxRC's code, I turned off gzip compression. My version is slightly different than RouxRC's version.
- Instead of incrementally reading arbitrary lengths of bytes from the socket and seeing if they parse in the JSON parser, a good technique, the switch to HTTP chunking forced us to process in chunk sized blocks. Based upon inspection, Twitter never sends partial JSON in a chunk. They also send keep-alive delimiters in single 7 byte long chunks. This code depends upon both of these observations. It does not do general purpose HTTP chunk processing. It is a Twitter specific HTTP chunk parser.
- Chunk oriented processing allowed me to isolate stream interpretation to the chunk code and migrate the wrapper code to operate exclusively using strings. This makes the wrapper code more readable.
- Once I had opened up the wrapper code, I cleaned it up. This involved modest edits in how certain socket parameters were determined and moving data exclusive to the generator into the generator and out of the containing object.
- As this is exclusively socket oriented code, the HTTP exception catching was removed from the method. The exception was moved to wrap the opening of the socket by url lib.
- Due to reading the data in larger chunks and, hence, running it through the JSON parser less often, this code is about 10% faster than the prior generation.
- When Twitter hangs up on us, this code emits a `hangup` message in the stream. 
- This code has been tested using Python v2.7.6 and v3.3.3 on OS X 10.8.5 (Mountain Lion). I have tested it on the high volume sample stream and on a user stream under both versions of Python. It is believed, but not tested, that it will function under Python v2.6.x. It uses the bytearray type. I believe that has been back ported all the way to Python v2.6.x. As the code is not particularly tricky, I do not foresee that it has introduced any new issues that were not already apparent in this library.
- I use this patch in production and have captured 50M+ tweets with it. It is solid and reliable. If you find it to not be so, please contact me. I use it in production and have a vested interest in ensuring that it catches all corner cases.

Thank you for your patience while I refine this patch and I ask Mr. Verdone to select this patch as the basis for moving this library forward.

Enjoy and Anon,
Andrew
@sixohsix sixohsix merged commit 40834b5 into python-twitter-tools:master Feb 3, 2014
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants