Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Hostnames should be case-insensitive, but most extractors ignore that. #10882

Open
johnhawkinson opened this issue Oct 9, 2016 · 3 comments
Open
Labels

Comments

@johnhawkinson
Copy link
Contributor

@johnhawkinson johnhawkinson commented Oct 9, 2016

Forking off this discussion from #10854, where @dstftw suggested that
making all _VALID_URL checks case-insensitive was the wrong way to go:

Use (?i) in regex itself if you want case insensitivity.

I, @johnhawkinson replied:

Good point! Though very few current extractors do that:

pb3:extractor jhawk$ fgrep -Rn '(?i)'  *.py|grep VALID
commonprotocols.py:11:    _VALID_URL = r'(?i)rtmp[est]?://.+'
commonprotocols.py:37:    _VALID_URL = r'(?i)mms://.+'
dailymotion.py:41:    _VALID_URL = r'(?i)(?:https?://)?(?:(www|touch)\.)?dailymotion\.[a-z]{2,3}/(?:(?:embed|swf|#)/)?video/(?P<id>[^/?_]+)'

would it not make sense to consider changing suitable() globally to make most regexps case insensitive, rather than trying to touch every regexp?
OTOH I guess it might break some things?

@dstftw replied:

No, this should not be global at least due to presence of potentially case sensitive parts and every regexp should not be touched either. Only those seen to be case insensitive in the wild should do.

and I, @johnhawkinson said:

Probably this is another issue to open, but it's very easy to make most extractors fail by uppercasing something.
Compare:

pb3:Downloads jhawk$ youtube-dl -s  'http://abcnews.go.COM/ThisWeek/video/week-exclusive-irans-foreign-minister-zarif-20411932'
[generic] week-exclusive-irans-foreign-minister-zarif-20411932: Requesting header
WARNING: Falling back on generic information extractor.
[generic] week-exclusive-irans-foreign-minister-zarif-20411932: Downloading webpage
[generic] week-exclusive-irans-foreign-minister-zarif-20411932: Extracting information
pb3:Downloads jhawk$ youtube-dl -s  'http://abcnews.go.com/ThisWeek/video/week-exclusive-irans-foreign-minister-zarif-20411932'
[abcnews:video] Downloading Akamai AMP feed
[abcnews:video] 20411932: Downloading f4m manifest
[abcnews:video] 20411932: Downloading m3u8 information
pb3:Downloads jhawk$ 

Sticking to a "seen in the wild" standard masks a lot of bugs.
I'll open an issue and submit a pull to at least make the YourExtractor sample code case-insentitive.
But I do think the project probably can and should do better...

Finally @dstftw said:

That's simulated. No such URLs is seen in the wild so far and no one will ever intentionally upper case some part of it.


I disagree. The hostname part of a URL is by definition case-insensitive. Any extractor in youtube-dl that assumes the hostname has fixed case is buggy. And a few of them go to ugly contortions using character classes to try to be case-insensitive, like YoutubeIE:

                         (?:(?:(?:(?:\w+\.)?[yY][oO][uU][tT][uU][bB][eE](?:-nocookie)?\.com/|
                            youtu\.be|                                        # just youtu.be/xxxx

And yet ironically it doesn't allow http://YOUTU.BE (although those work anyhow, I think because of some very broad matching of the path component for Youtube).

Anyhow, the authority on this is RFC 1034: Domain Names - Concepts And Facilities, stating:

By convention, domain names can be stored with arbitrary case, but
domain name comparisons for all present domain functions are done in a
case-insensitive manner, assuming an ASCII character set, and a high
order zero bit

And also RFC3986: Uniform Resource Identifier (URI): Generic Syntax:

3.2.2.  Host

   The host subcomponent of authority is identified by an IP literal
   encapsulated within square brackets, an IPv4 address in dotted-
   decimal form, or a registered name.  The host subcomponent is case-
   insensitive. 

See also RFC 1035, RFC 4343.


But that only goes so far: while the domain names are case-insensitive, the rest of the URLs are not.

But what is the risk of processing them case-insensitively? From youtube-dl's perspective, it means a URL might match _VALID_URL on the correct site but with a different case, like the extractor for https://www.youtube.com/watch?v=d9TpRfDdyU0 might be triggered by https://www.youtube.com/WATCH?v=d9TpRfDdyU0.

But so what? At worst it means an extractor might be unnecessarily invoked in a few rare cases, which is a fair thing to trade to have it work in more places.

Any any website that has different video content at /ABC and /abc where both need to work is going to need careful attention to this in the extractor anyhow. Although I'm skeptical such sites exist.

Anyhow, the compromise proposal is to just change the README.md and CONTRIBUTING.md examples such that they recommend using (?i) in regexps, so that new extractors are case insensitive. I'll submit a pull request.

I guess we could also go in en masse and prefix most VALID_URI entries with (?i) and see what breaks, if anything?

Thanks.

@yan12125
Copy link
Collaborator

@yan12125 yan12125 commented Oct 9, 2016

Any any website that has different video content at /ABC and /abc where both need to work is going to need careful attention to this in the extractor anyhow. Although I'm skeptical such sites exist.

I know that RFC and I agree that URL matching should follow it, but it's bad to leave bugs in codes to accomplish a goal.

Possible correct solutions can be:

  • Create two expressions, like _VALID_URL_HOSTNAME and _VALID_URL_PATH. The former is matched with re.I and the latter not.
  • Python 3.6 supports local modifiers [1]. If one day youtube-dl supports Python 3.6+ only, we can use it. Using the regex module [2] is another possibility before waiting for Python 3.6+.

[1] https://docs.python.org/3.6/whatsnew/3.6.html#re
[2] https://pypi.python.org/pypi/regex

@yan12125 yan12125 added the request label Oct 9, 2016
@johnhawkinson
Copy link
Contributor Author

@johnhawkinson johnhawkinson commented Oct 9, 2016

I know that RFC and I agree that URL matching should follow it, but it's bad to leave bugs in codes to accomplish a goal.

This argument cuts both ways. Leaving the code as it is now, there are approximately 932 bugs we would be leaving:

pb3:extractor jhawk$ grep '_VALID_URL =' *.py | fgrep -v '(?i)' | wc -l
     932

It's not clear how many bugs we might create if we fixed those 932 bugs by making the regexp match case-insensitive, but the net result would be a lot more fixes.

Another solution might be:

  • Downcase the hostname portion of the URL into lowercase before doing the regular expression comparison. (This is hard in practice because hundreds of extractors use the pattern
    def _real_extract(self, url):
        mobj = re.match(self._VALID_URL, url)
        video_id = mobj.group('id')

instead of calling self._match_id(), so this would be a lot of churn, too. On the other hand, maybe that means that stuff should be converted to _match_id(). Although sometimes there's good reason not to, if you need to match other parameters? But perhaps that means the _match_id() abstraction isn't general enough.)


Anyhow, it is clear this is not a pressing problem.

@yan12125
Copy link
Collaborator

@yan12125 yan12125 commented Oct 9, 2016

But perhaps that means the _match_id() abstraction isn't general enough.

Yes this function should be improved.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
2 participants
You can’t perform that action at this time.