Hostnames should be case-insensitive, but most extractors ignore that. #10882
Comments
I know that RFC, and I agree that URL matching should follow it, but it's bad to leave bugs in the code to accomplish a goal. Possible correct solutions are:
[1] https://docs.python.org/3.6/whatsnew/3.6.html#re
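For illustration only: the Python 3.6 notes linked above describe modifier spans such as (?i:...), which apply a flag to just part of a pattern. A minimal sketch of how that could limit case-insensitivity to the scheme and hostname follows. The pattern, the example.com site, and the match_id_scoped helper are all hypothetical and are not taken from any youtube-dl extractor.

```python
import re

# Hypothetical pattern for illustration; example.com is not a real extractor.
# Only the scheme and hostname sit inside the (?i:...) span (Python 3.6+),
# so the path and query are still matched case-sensitively.
VALID_URL = (
    r'(?i:https?://(?:www\.)?example\.com)'
    r'/watch\?v=(?P<id>[0-9A-Za-z_-]+)'
)


def match_id_scoped(url):
    """Return the captured id, or None when the URL does not match."""
    m = re.match(VALID_URL, url)
    return m.group('id') if m else None


# Hostname case is ignored:
print(match_id_scoped('https://WWW.Example.COM/watch?v=d9TpRfDdyU0'))
# Path case still matters, so this one is rejected:
print(match_id_scoped('https://www.example.com/WATCH?v=d9TpRfDdyU0'))
```

The trade-off is that this needs Python 3.6 or later, which may or may not be acceptable for a project that supports older interpreters.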
This argument cuts both ways. Leaving the code as it is now, there are approximately 932 bugs we would be leaving:
It's not clear how many bugs we might create if we fixed those 932 bugs by making the regexp match case-insensitive, but the net result would be a lot more fixes. Another solution might be:
instead of calling self._match_id(), so this would be a lot of churn, too. On the other hand, maybe that means that stuff should be converted to _match_id(). Although sometimes there's good reason not to, if you need to match other parameters? But perhaps that means the _match_id() abstraction isn't general enough.) Anyhow, it is clear this is not a pressing problem.
Yes, this function should be improved.
Forking off this discussion from #10854, where @dstftw suggested that
making all _VALID_URL checks case-insensitive was the wrong way to go:
I, @johnhawkinson replied:
@dstftw replied:
and I, @johnhawkinson said:
Finally @dstftw said:
I disagree. The hostname part of a URL is by definition case-insensitive. Any extractor in youtube-dl that assumes the hostname has fixed case is buggy. And a few of them go to ugly contortions using character classes to try to be case-insensitive, like YoutubeIE:
And yet ironically it doesn't allow http://YOUTU.BE (although those work anyhow, I think because of some very broad matching of the path component for Youtube).
Anyhow, the authority on this is RFC 1034: Domain Names - Concepts And Facilities, stating:
And also RFC 3986: Uniform Resource Identifier (URI): Generic Syntax:
See also RFC 1035, RFC 4343.
But that only goes so far: while the domain names are case-insensitive, the rest of the URLs are not.
But what is the risk of processing them case-insensitively? From youtube-dl's perspective, it means a URL might match _VALID_URL on the correct site but with a different case. For example, the extractor for https://www.youtube.com/watch?v=d9TpRfDdyU0 might be triggered by https://www.youtube.com/WATCH?v=d9TpRfDdyU0.
But so what? At worst it means an extractor might be unnecessarily invoked in a few rare cases, which is a fair thing to trade to have it work in more places.
And any website that has different video content at /ABC and /abc where both need to work is going to need careful attention to this in the extractor anyhow. Although I'm skeptical such sites exist.
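One way to keep the path untouched while still treating the hostname as case-insensitive would be to lower-case only the host before any matching happens. This is a rough sketch using the standard library's urllib.parse. The lowercase_host helper is a made-up name, and nothing here is claimed to be how youtube-dl does or should do it.

```python
from urllib.parse import urlsplit, urlunsplit


def lowercase_host(url):
    """Lower-case the scheme and network location; leave the rest untouched.

    Caveat: netloc also carries any user:password@ prefix, which would be
    lower-cased too. This sketch assumes URLs without credentials.
    """
    parts = urlsplit(url)
    return urlunsplit((
        parts.scheme.lower(),
        parts.netloc.lower(),
        parts.path,       # case preserved, so /ABC and /abc stay distinct
        parts.query,      # case preserved, video ids are unaffected
        parts.fragment,
    ))


print(lowercase_host('http://YOUTU.BE/AbC'))
print(lowercase_host('https://WWW.Example.COM/watch?v=d9TpRfDdyU0'))
```

With something like this, a site that really does serve different content at /ABC and /abc would be unaffected, since only the part of the URL that the RFCs define as case-insensitive is changed.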
Anyhow, the compromise proposal is to just change the README.md and CONTRIBUTING.md examples such that they recommend using (?i) in regexps, so that new extractors are case-insensitive. I'll submit a pull request.
I guess we could also go in en masse and prefix most _VALID_URL entries with (?i) and see what breaks, if anything?
Thanks.
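To make the (?i) proposal concrete, here is a small sketch of what a prefixed pattern does. The pattern, the example.com site, and the match_id_global helper are hypothetical and only stand in for a real extractor's _VALID_URL.

```python
import re

# Hypothetical pattern for illustration; example.com is not a real extractor.
# The leading (?i) makes the whole regexp case-insensitive, path included.
VALID_URL = r'(?i)https?://(?:www\.)?example\.com/watch\?v=(?P<id>[0-9a-z_-]+)'


def match_id_global(url):
    """Return the captured id, or None when the URL does not match."""
    m = re.match(VALID_URL, url)
    return m.group('id') if m else None


# Both the hostname and the path match regardless of case. The captured id
# keeps the case it had in the input; the flag only affects matching.
print(match_id_global('https://WWW.EXAMPLE.COM/WATCH?v=d9TpRfDdyU0'))
```

Note that the character class [0-9a-z_-] also accepts upper-case letters under (?i), so existing classes like [0-9A-Za-z] would keep working and just become redundant.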