Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bpo-39503: CVE-2020-8492: Fix AbstractBasicAuthHandler #18284

Open
wants to merge 1 commit into
base: master
from

Conversation

@vstinner
Copy link
Member

vstinner commented Jan 30, 2020

The AbstractBasicAuthHandler class of the urllib.request module uses
an inefficient regular expression which can be exploited by an
attacker to cause a denial of service. Fix the regex to prevent the
catastrophic backtracking.

Vulnerability reported by Matt Schwager.

https://bugs.python.org/issue39503

@vstinner

This comment has been minimized.

Copy link
Member Author

vstinner commented Jan 30, 2020

@mschwager

This comment has been minimized.

Copy link

mschwager commented Jan 30, 2020

This fix looks good to me!

@@ -937,7 +937,7 @@ class AbstractBasicAuthHandler:

# allow for double- and single-quoted realm values
# (single quotes are a violation of the RFC, but appear in the wild)
rx = re.compile('(?:.*,)*[ \t]*([^ \t]+)[ \t]+'
rx = re.compile('(?:[^,]*,)*[ \t]*([^ \t]+)[ \t]+'

This comment has been minimized.

Copy link
@serhiy-storchaka

serhiy-storchaka Jan 30, 2020

Member

(?:.*,)* is equivalent to (?:.*,)?.

But since this regular expresion is only used with search(). (?:.*,)*[ \t]* can be removed at all.

I'll analyze whether it is correct or there is an error in the regular expression.

This comment has been minimized.

Copy link
@serhiy-storchaka

serhiy-storchaka Jan 30, 2020

Member

Well, I am cannot say that I completely understand the code, but to give it some sense we can either

  1. Replace rx.search() with rx.match() and replace (?:.*,)* with (?:.*,)?.

or

  1. Keep rx.search() and replace (?:.*,)* with (?:^|,).

Do not keep (?:[^,]*,)*. It is a waster of resources.

This comment has been minimized.

Copy link
@serhiy-storchaka

serhiy-storchaka Jan 30, 2020

Member

Humm, options 1 and 2 are not equivalent if the field value contains more than one challenge. Option 2 is closer to the current behavior. But correct support of more than one challenge need rewriting the code.

https://tools.ietf.org/html/rfc7235#section-4.1

This comment has been minimized.

Copy link
@davidfraser

davidfraser Feb 17, 2020

The current patch seems to give an O(n^3) time to evaluate - much better than O(2^n), but still very slow - with 2000 commas it takes about a minute to evalute. With 65000 it takes much, much longer. Testing using the code from here gave the following (commas, seconds) values:
[(100, 0.124), (250, 0.261), (500, 0.923), (750, 2.85), (1000, 6.433), (1250, 12.608), (1500, 21.576), (2000, 50.751)]

This comment has been minimized.

Copy link
@serhiy-storchaka

serhiy-storchaka Mar 24, 2020

Member

I was wrong, the first option is equivalent to the current behavior (returns the last realm).

@serhiy-storchaka

This comment has been minimized.

Copy link
Member

serhiy-storchaka commented Jan 30, 2020

Would be nice to add a test.

@mschwager

This comment has been minimized.

Copy link

mschwager commented Jan 30, 2020

Just a heads up, CVE-2020-8492 has been created. I'm not sure how Python CVEs are generally tracked, but it may be useful to include the information on the bug tracker issue 👍

@vstinner

This comment has been minimized.

Copy link
Member Author

vstinner commented Feb 3, 2020

Not only (?:.*,)* is inefficient, but it's also useless. It can be removed. Simplified example:

# reference
>>> all(re.search("(?:a,)*b", text) for text in ("a,b", "a,a,b", "b"))
True
# only match last ","
>>> all(re.search("(?:,)?b", text) for text in ("a,b", "a,a,b", "b"))
True
# don't match the prefix
>>> all(re.search("b", text) for text in ("a,b", "a,a,b", "b"))
True

We can either simplify the regex to prevent the "catastrophic backtracking" or even remove the prefix.

UPDATE: Oops, my example was wrong, I fixed it :-)

@bcaller

This comment has been minimized.

Copy link
Contributor

bcaller commented Feb 4, 2020

Does this also fix https://bugs.python.org/issue38826 ?

@encukou

This comment has been minimized.

Copy link
Member

encukou commented Mar 24, 2020

Not only (?:.*,)* is inefficient, but it's also useless. It can be removed.

It seems to me that the (?:.*,)* is there so that the last realm is selected, as mentioned in the comment above the regex. See:

>>> header = 'basic realm="1", x, other realm="2"'
>>> re.search("(?:.*,)*[ \t]*([^ \t]+)[ \t]+", header).group(1)
'other'
>>> re.search("[ \t]*([^ \t]+)[ \t]+", header).group(1)
'basic'
>>> 

I don't see a way to fix this by just changing the regex while preserving the previous behavior. Then again, corner cases of the previous behavior might be wrong.

@vstinner vstinner changed the title bpo-39503: Fix urllib basic auth regex bpo-39503: CVE-2020-8492: Fix urllib basic auth regex Mar 25, 2020
@vstinner vstinner force-pushed the vstinner:urllib_basic_auth_regex branch from 7c4bae4 to cc59fb1 Mar 25, 2020
@vstinner

This comment has been minimized.

Copy link
Member Author

vstinner commented Mar 25, 2020

I rebased my PR and added more tests.

@vstinner

This comment has been minimized.

Copy link
Member Author

vstinner commented Mar 25, 2020

@serhiy-storchaka: I don't understand if you consider that the fix is wrong or that the fix is not enough (it remains possible to create a denial of service)?

@serhiy-storchaka

This comment has been minimized.

Copy link
Member

serhiy-storchaka commented Mar 25, 2020

@vstinner your fix helps, but we can do better. It has cubic complexity, my suggestion has quadratic complexity. It is possible to implement an algorithm with linear complexity, but not with such small changes.

@davidfraser

This comment has been minimized.

Copy link

davidfraser commented Mar 25, 2020

I also added some tests and implemented a simpler complexity regex - see master...davidfraser:urllib_basic_auth_regex

@davidfraser

This comment has been minimized.

Copy link

davidfraser commented Mar 25, 2020

It's worth seeing how the results of this regex are actually used

Note my comment in f79379c:

Note that the original regex was roughly O(2**n)
The search for commas and spaces is unnecessary
(and insufficient to ensure that this starts a new scheme).
Replace with a simpler search for an initial scheme, since
we already check that the text starts with 'basic'.

@vstinner

This comment has been minimized.

Copy link
Member Author

vstinner commented Mar 25, 2020

WWW-Authenticate is badly specified. The RFC doesn't specify if a single HTTP header can contain multiple challenges.

I found these resources:

A variant is to have multiple WWW-Authenticate: one challenge per WWW-Authenticate header.

By the way, AbstractBasicAuthHandler code contains this interesting comment:

        # XXX could be multiple headers
        authreq = headers.get(authreq, None)

Current behavior:

  • Even if there are multiple WWW-Authenticate headers, only parse the first header. That's a bug: the Basic challenge may be in a following WWW-Authenticate header. Moreover, there may be two Basic challenges with two different realm.

  • scheme = str.split()[0] parses the scheme, if scheme.lower() != "basic": raise a ValueError.

  • Use the regex to parse the realm.

  • If the header contains multiple realm=xxx: use the last realm, even if it belongs to another challenge using a different scheme. IMO it's a bug: we should not check the scheme at the beginning of the header and use the last realm at the end of the string.

For example, WWW-Authenticate: Basic realm="ACME Widget Store", Digest realm="other realm" header is accepted since it starts with Basic, but the extracted realm is other realm: the wrong realm is used.

@vstinner vstinner force-pushed the vstinner:urllib_basic_auth_regex branch from cc59fb1 to 477be6e Mar 25, 2020
@vstinner

This comment has been minimized.

Copy link
Member Author

vstinner commented Mar 25, 2020

@serhiy-storchaka:

  1. Keep rx.search() and replace (?:.*,)* with (?:^|,).

Sorry, I misunderstood this proposition. In fact, I proposed something similar except that I missed the "start of the string" (regex ^) case. I modified my PR to use this PR. I also added comments to the regex to explain it.

I decided to write a way more complex change to not only fix the vulnerability, but also fix the parser since it didn't look possible to fix the regex without changing the behavior. Currently, the code uses the last realm if there are multiple challenges per header. I fixed this behavior to use the realm of the first Basic challenge.

I also modified the code to support multiple headers, except of only parsing the first one.

@vstinner vstinner force-pushed the vstinner:urllib_basic_auth_regex branch 2 times, most recently from 137cf0b to ee8ff4f Mar 25, 2020
@vstinner vstinner changed the title bpo-39503: CVE-2020-8492: Fix urllib basic auth regex bpo-39503: CVE-2020-8492: Fix urllib AbstractBasicAuthHandler Mar 25, 2020
@vstinner vstinner force-pushed the vstinner:urllib_basic_auth_regex branch from ee8ff4f to 3c7dae4 Mar 25, 2020
@vstinner vstinner changed the title bpo-39503: CVE-2020-8492: Fix urllib AbstractBasicAuthHandler bpo-39503: CVE-2020-8492: Fix AbstractBasicAuthHandler Mar 25, 2020
@vstinner vstinner force-pushed the vstinner:urllib_basic_auth_regex branch from 3c7dae4 to c461645 Mar 25, 2020
@vstinner

This comment has been minimized.

Copy link
Member Author

vstinner commented Mar 25, 2020

Ok, the PR is now ready for a new round of reviews. I fixed the vulnerability but I also changed the code to parse all WWW-Authenticate HTTP Headers and accept multiple challenges per header.

The AbstractBasicAuthHandler class of the urllib.request module uses
an inefficient regular expression which can be exploited by an
attacker to cause a denial of service. Fix the regex to prevent the
catastrophic backtracking. Vulnerability reported by Ben Caller
and Matt Schwager.

AbstractBasicAuthHandler of urllib.request now parses all
WWW-Authenticate HTTP headers and accepts multiple challenges per
header: use the realm of the first Basic challenge.

Co-Authored-By: Serhiy Storchaka <storchaka@gmail.com>
@vstinner vstinner force-pushed the vstinner:urllib_basic_auth_regex branch from c461645 to 3652a84 Mar 25, 2020
@vstinner

This comment has been minimized.

Copy link
Member Author

vstinner commented Mar 25, 2020

Oh, Ben Caller reported the issue at 2019-11-17: https://bugs.python.org/issue38826

@vstinner

This comment has been minimized.

Copy link
Member Author

vstinner commented Mar 25, 2020

@vstinner vstinner requested a review from serhiy-storchaka Mar 26, 2020
@orsenthil

This comment has been minimized.

Copy link
Member

orsenthil commented Mar 26, 2020

I also modified the code to support multiple headers, except of only parsing the first one.

Thanks, @vstinner - This was a good detection and change.

I am reviewing the patch further.

basic = f'Basic realm="{realm}"'
basic2 = f'Basic realm="{realm2}"'
other_no_realm = 'Otherscheme xxx'
digest = (f'Digest realm="{realm2}", '

This comment has been minimized.

Copy link
@orsenthil

orsenthil Mar 26, 2020

Member

What was the motivation of adding a digest test case here?

This comment has been minimized.

Copy link
@vstinner

vstinner Mar 26, 2020

Author Member

I tried to write a more realistic test than "Otherscheme xxx". The Digest challenge uses "realm" and the test ensures that it's skipped. It uses use multiple fields separated by commas, some use quotes: test that the parser handles that properly.

I picked the example from Wikipedia :-) https://fr.wikipedia.org/wiki/Authentification_HTTP#Demande_d'identification

This comment has been minimized.

Copy link
@orsenthil

orsenthil Mar 26, 2020

Member

perfect. sounds good to me. :)

Copy link
Member

orsenthil left a comment

Hi @vstinner

Thanks for the patch and the change. The entire addition is meaningful and looks good to me.

  • The change of regex only the security issue keeps the patch simple.

I only had a question in the tests, but it is not a critical question. I thought, I am missing something by adding of "digest" test case in line 1469, when the multiple headers challenge is covered again in the same test case at line 1499.

LGTM.

@serhiy-storchaka

This comment has been minimized.

Copy link
Member

serhiy-storchaka commented Mar 26, 2020

Sorry, I may be not unable to make a review until the weekend.

@vstinner

This comment has been minimized.

Copy link
Member Author

vstinner commented Mar 26, 2020

I thought, I am missing something by adding of "digest" test case in line 1469, when the multiple headers challenge is covered again in the same test case at line 1499.

For tests on a single header, I tested 3 cases:

  • two Basic challenges
  • Basic challenge and one challenge without realm
  • Basic challenge (with realm) and Digest with realm

And the variant in the opposite order, to make sure that the order doesn't matter (except for two Basic challenges: it must pick the first one).

I tried to cover all cases.

@vstinner

This comment has been minimized.

Copy link
Member Author

vstinner commented Mar 26, 2020

@serhiy-storchaka: The root of the security fix is to replace (?:.*,)* inefficient regex with (?:^|,) regex which should be efficient. It's what you proposed, second choice of: #18284 (comment)

@vstinner

This comment has been minimized.

Copy link
Member Author

vstinner commented Mar 27, 2020

@serhiy-storchaka: "Sorry, I may be not unable to make a review until the weekend."

Ok, I plan to merge this PR at the beginning of next week, to give you the weekend for a review ;-)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked issues

Successfully merging this pull request may close these issues.

None yet

9 participants
You can’t perform that action at this time.