[WIP] Remove httplib. #1067

Lukasa · 2016-12-08T15:15:25Z

This branch contains the most substantial chunk of work I have ever done on urllib3: the beginnings of an effort to remove our dependency on httplib.

Removing our dependency on httplib has been a long-held desire of the urllib3 project. In fact, it's so desired that #776 (our wishlist of features for v2.0) includes in a section entitled "unlikely longshots" the phrase "Get rid of httplib dependence".

This branch does exactly that, by bringing the http.client implementation from Python 3.5 that we were already used to interacting with into our codebase and then ripping its guts out and replacing them with @njsmith's h11 library.

As you can see from the commit log, my work on this began back in October, but I've sat on it for the past two months in order to get it to a place where I had something worth showing. Now I finally do. On my machine, a run of tox -e py27 passes all tests, albeit with incomplete test coverage:

py27 runtests: commands[1] | nosetests
SSS....................S....................................................................................................................................S.....................................................S........................................................................................................................................................................................................................SSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSS...............................S..S..............................S................................................................................
Name                      Stmts   Miss  Cover   Missing
-------------------------------------------------------
urllib3                      20      0   100%   
urllib3._collections        148      0   100%   
urllib3.connection          515    103    80%   80-82, 158, 162, 175, 229, 261, 285, 295-300, 306-338, 341-364, 368-372, 375, 385-392, 396-399, 408, 449, 453, 471, 567, 571-572, 602-611, 615, 651-652, 657, 729-732, 736-740, 755, 781, 830, 855-856, 868-869, 905-906, 972-973, 985, 1005
urllib3.connectionpool      240      1    99%   256
urllib3.contrib               0      0   100%   
urllib3.contrib.socks        35      0   100%   
urllib3.exceptions           73      2    97%   216, 219
urllib3.fields               62      0   100%   
urllib3.filepost             32      0   100%   
urllib3.poolmanager         124      0   100%   
urllib3.request              32      0   100%   
urllib3.response            232      1    99%   453
urllib3.util                  1      0   100%   
urllib3.util.connection      47      0   100%   
urllib3.util.request         37      0   100%   
urllib3.util.response        19      5    74%   54-65
urllib3.util.retry          121      0   100%   
urllib3.util.selectors      295     45    85%   101, 360-423
urllib3.util.ssl_            64      0   100%   
urllib3.util.timeout         48      0   100%   
urllib3.util.url             97      0   100%   
urllib3.util.wait            14      1    93%   21
-------------------------------------------------------
TOTAL                      2256    158    93%   
----------------------------------------------------------------------
Ran 604 tests in 13.646s

OK (SKIP=42)
___________________________________ summary ____________________________________
  py27: commands succeeded
  congratulations :)

Initially, my goal was to do this in the least-disruptive way possible: that's why I began by trying to replicate the logic of httplib inside urllib3 but on top of h11. I would have succeeded in that goal if I'd been able to avoid rewriting any tests in any substantive way.

Unfortunately, as you can see from this change, I did not succeed at that. In particular, it was simply not possible to continue with some behaviours that urllib3 had previously promised, such as maintaining header case. I also had to bend over backwards to preserve some other behaviours urllib3 users have expected, such as reporting chunk boundaries, which is probably not something we should ever have promised to do.

From my perspective, then, I think that if we want to continue this work it needs to be considered a breaking change for urllib3. Any change this substantial simply cannot be done in a non-breaking way. A huge number of things will change: our tolerance of errors (we're much stricter now because h11 is much more insistent about being spec-compliant), our wire format (h11 does some case normalization on header field names), and maybe even some of our output formats.

This is also still very much a work in progress. There are TODO statements everywhere. The code is poorly factored: it's a wacky in-between state between trying to maintain the httplib abstraction and a total rejection of it in favour of something much clearer. A whole lot of our code that worked around httplib flaws is no longer necessary and can be removed, but hasn't yet been. A better underlying implementation that doesn't perform socket writes would also be possible, as would some refactoring of code I wrote to pull out common features. And the Python 3 tests don't pass, mostly because of the fact that right now this new branch emits headers as bytestrings rather than unicode strings.

However, at this point it's no longer feasible for me to work on this branch alone. Further progress requires actual governance decisions, and I am not qualified to make those on my own, nor should I. urllib3 has a great collection of core maintainers and regular contributors whose opinions and work I value, and would like to solicit to help me. As I see it, we need to answer the following questions:

Is this worth doing? Do we want to be rid of httplib so badly that we will perform a massive codebase rewrite to get rid of it?
If so, are we happy with using h11?
1. If we are, do we depend on it using pip, or vendor it? How does our decision affect Requests?
Should this be the beginning of a v2.0 branch that will contain other bits of work from v2.0 Wishlist #776?
1. If it should, how many breakages do we want to roll into one change? Do we want to go all in and leave a massive codebase change?
What pieces of functionality do we want to retain from the old codebase? In particular, do we want:
1. To still have stream() expose chunk boundaries?
2. To return str header field names and values on Python 3?
3. To return different types for header field names and values on Python 3?
4. To implement buffered I/O in the low-level primitives?
5. To maintain the weird two-level Response structure where we have a urllib3 Response object that contains a httplib Response object within it?
How would we want to manage a rollout of this functionality?
How do we want to review it?
What kind of timeline do we want for releasing it?
If we decide to go ahead with a v2.0, should we freeze the current master branch for everything except bug/security fixes until we complete v2.0?

When @shazow originally made me interim lead maintainer, he told me that I had complete authority because he could always undo any change I made that he didn't like. I think this particular change is an exception: I don't think I have any degree of "lead maintainer" authority on this. We're going to need @shazow to express some opinions too.

While I'm here, I'd also like to tag a few community members that are particularly likely to be interested in this proposal:

/cc @njsmith @durin42 @dstufft

@nateprewitt

Sorry @nateprewitt, I know that you worked hard on this, but h11 handles this for us more effectively than we can, so all of this code is redundant with its own logic. It's good that you did it, though, becuase it means our transition to being strict about content-length enforcement is finally present!

codecov-io · 2016-12-09T10:54:31Z

Current coverage is 95.74% (diff: 78.58%)

Merging #1067 into master will decrease coverage by 4.25%

@@           master      #1067   diff @@
========================================
  Files          21         21          
  Lines        1900       2233   +333   
  Methods         0          0          
  Messages        0          0          
  Branches        0          0          
========================================
+ Hits         1900       2138   +238   
- Misses          0         95    +95   
  Partials        0          0

Powered by Codecov. Last update e6cae79...48780ee

Lukasa · 2016-12-09T10:57:27Z

I'm beginning to get a bit more aggressive about removing things we no longer need. That'll help with a future refactoring effort: we won't spend our time trying to refactor in a content-length check (for example) that is no longer required because h11 does it for us.

This is a temporary change that we'll almost certainly revert, but I want to get the tests passing so that others can help out.

Lukasa · 2016-12-12T11:12:15Z

So, as I've been cleaning up the builds, I've noticed an issue: AppEngine. One major side-effect of this refactor is that we lose the tight integration with GAE that has been used in the past, because we are no longer using httplib. On top of that, we've generally be in favour of drastically changing the API we work with, meaning that it wouldn't even necessarily be that easy to get GAE working.

I think, for this reason, I'd like to tag @jonparrott to weigh in here, as well as the wider community. How do we forsee our GAE integration working in the future? Will we want to try to factor our code in such a way that it becomes possible to drop back in a httplib-style backend like GAE?

Lukasa · 2016-12-12T12:48:29Z

So for the short term I'm going to mark the GAE builds as "allowed to fail" on this branch, because I'd like us to come back to them but I don't want them standing in the way right now. =D

Lukasa · 2016-12-12T13:30:57Z

Woo, ok, tests are green!

So, I have opened up a project and shoved a whole lot of cards in there for things that people can do if they want to start getting involved. I'm going to continue pushing forward on some of the major refactoring work, but at this point now that the tests are passing (albeit with missing coverage) we have a decent basis to start work with.

For that reason, I'm going to move this to a v2 branch on this repository and encourage people who want to get involved to start making pull requests against that branch. Please create issues and move project cards around to make it clear when something is being worked on by someone to avoid duplicating too much effort. I'm going to continue focusing on excising code that we don't need and trying to complete some of the refactoring so that I can reduce the feature surface, and then I'll look at the internal refactor to try to separate the common protocol code and the bits that are to do with "synchronous I/O", with the goal of enabling future work to support async code.

Lukasa · 2016-12-12T13:33:20Z

Closing in favour of #1068.

shazow · 2016-12-12T16:39:05Z

Re: GAE, Jon can definitely add more context, but my understanding is that GAE doesn't actually need httplib. Even today, it just straight-up uses urlfetch when available (and would have to fall back to plain sockets if not). If we can have a place where we bypass anything that touches a socket and pass it straight into AppEngine's urlfetch when enabled, then that's all it will take. This would mean we'd only support a subset of features, but that's already true today.

Assuming we keep a manager hierarchy like we have today, I don't think GAE support would need to change much.

theacodes · 2016-12-12T17:13:41Z

Echoing what @shazow said: Today we're basically just adapting App Engine's urlfetch API to match urllib3's. As long as that remains a viable option, we can continue to support GAE. Also note that I'm more than willing to help with that.

To add some more context to this - I've been continuously pushing for App Engine standard to be "less weird". Some of that is coming to bear fruit, but App Engine's support for sockets is still "weird". I would expect to see App Engine standard's Python runtime become less weird over time. Also keep in mind that App Engine standard is currently limited to Python 2.7 only, and any Python 3 support on standard would likely look closer to "vanilla" Python than "App Engine" Python.

All this is to say - don't sweat App Engine that much. As long as the API allows the right level of abstraction we can use the same approach we use today.

Lukasa added 30 commits October 24, 2016 13:45

Initial rewrite of HTTPConnection with h11

bd9b61e

Begin the stub of replacing httplib.

678f46c

More stuff.

309713d

Some basic amount of connection reuse.

054e0b3

Get rid of auto_open.

638dbbc

No idea how I missed this.

1c71395

Start trying to run CI with h11

738a04f

We should fail parsing always now.

3c72931

This is clearly wrong.

ee6ed3d

Try to make things more conservative.

f4c3dd5

Try to pass protocol errors on.

1ca1830

Handle socket closure better.

2f82164

Remove some catching of httplib exceptions.

b44e2df

We emit lowercase Host headers now.

862ada1

Fix chunked encoding sending.

abd2f2c

Allow setting no socket options.

993a1e9

Simpler content-length handling.

8343db2

bytearray, bytearray everywhere.

73a28f4

Support for chunked streaming.

291426c

This isn't Python 3-only code.

8edcedc

Create a general ConnectionError.

03bd051

Support HTTP tunnelling.

72a5779

Fixup sock/fp.

22a7c94

Fixup new detection of short response bodies.

ecf8bef

We now fail to parse bad headers entirely.

f220a99

We no longer preserve header case.

d35504c

Header blocks end with two CRLFs.

885f937

More header case fixes.

76bed86

More case fixes.

2da91ff

We no longer maintain compatibility.

db89e95

Lukasa added 2 commits December 9, 2016 10:52

Changelog for removing length enforcement.

36c9c7f

Lukasa added 13 commits December 9, 2016 11:53

Remove an unneeded TODO

01f4bd8

Convert headers to native strings.

038ab7f

This is a temporary change that we'll almost certainly revert, but I want to get the tests passing so that others can help out.

We never have the buffering kwarg.

666106c

ConnectionError is always SocketError for us.

1b8221e

Fix up bytes/str issues with tunnelling.

bdf37e2

Rewrite a Py3 case sensitivity test.

defa021

Increase LONG_TIMEOUT for more reliable tests

26ffc3f

Remove Python 2.6 support.

ba09018

Minor flake8 cleanups.

b1dc97e

Cleanup get_hostport.

729b2b9

Cleanup some problems with exceptions.

5cd21e0

Remove Python 2.6 from Travis matrix.

bc6de40

Remove garish docstring.

41baa1d

Temporarily allow GAE builds to fail.

48780ee

Remove unneeded TODO.

db2e68d

Remove another TODO.

afadc5f

Lukasa mentioned this pull request Dec 12, 2016

Work for v2 of urllib3. #1068

Closed

Lukasa closed this Dec 12, 2016

njsmith mentioned this pull request Mar 16, 2017

Best approach to building trio ecosystem python-trio/trio#105

Closed

merrellb mentioned this pull request Jul 15, 2017

Urllib3 with h11 and trio #1228

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[WIP] Remove httplib. #1067

[WIP] Remove httplib. #1067

Lukasa commented Dec 8, 2016

codecov-io commented Dec 9, 2016 •

edited

Lukasa commented Dec 9, 2016

Lukasa commented Dec 12, 2016

Lukasa commented Dec 12, 2016

Lukasa commented Dec 12, 2016

Lukasa commented Dec 12, 2016

shazow commented Dec 12, 2016 •

edited

theacodes commented Dec 12, 2016

[WIP] Remove httplib. #1067

[WIP] Remove httplib. #1067

Conversation

Lukasa commented Dec 8, 2016

codecov-io commented Dec 9, 2016 • edited

Current coverage is 95.74% (diff: 78.58%)

Lukasa commented Dec 9, 2016

Lukasa commented Dec 12, 2016

Lukasa commented Dec 12, 2016

Lukasa commented Dec 12, 2016

Lukasa commented Dec 12, 2016

shazow commented Dec 12, 2016 • edited

theacodes commented Dec 12, 2016

codecov-io commented Dec 9, 2016 •

edited

shazow commented Dec 12, 2016 •

edited