[utils] base_url: URL paths can contain &. Backport from yt-dlp Resolves #31485 #31490

Open
wants to merge 1 commit into
base: master
from


Fuzion24

@Fuzion24 Fuzion24 commented Jan 23, 2023

Please follow the guide below

  • You will be asked some questions, please read them carefully and answer honestly
  • Put an x into all the boxes [ ] relevant to your pull request (like that [x])
  • Use Preview tab to see how your pull request will actually look like

Before submitting a pull request make sure you have:

In order to be accepted and merged into youtube-dl each piece of code must be in public domain or released under Unlicense. Check one of the following options:

  • I am the original author of this code and I am willing to release it under Unlicense
  • I am not the original author of this code but it is in public domain or released under Unlicense (provide reliable evidence)

What is the purpose of your pull request?

  • Bug fix
  • Improvement
  • New extractor
  • New feature

Description of your pull request and other information

Explanation of your pull request in arbitrary form goes here. Please make sure the description explains the purpose and effect of your pull request and is worded well enough to be understood. Provide as much context and examples as possible.

Resolves #31485

@dirkf
Contributor

dirkf commented Jan 23, 2023

Please fill in the description and include the magic line Resolves #31485.

Anyway, why is base_url() using regex to parse URLs when there's an actual Python library module for that?

@dirkf
Contributor

dirkf commented Jan 23, 2023

Also, the change does have a test, so you can tick that checkbox too.

@Fuzion24 Fuzion24 changed the title from "[utils] base_url: URL paths can contain &. Backport from yt-dlp fixes #31485" to "[utils] base_url: URL paths can contain &. Backport from yt-dlp Resolves #31485" on Jan 24, 2023
@Fuzion24
Author

"Anyway, why is base_url() using regex to parse URLs when there's an actual Python library module for that?"

I don't know. I didn't write any of this code. I'm just trying to get it to work.

@dirkf
Contributor

dirkf commented Jan 24, 2023

Likewise. It was really a rhetorical question: the regex must have been an easy win.

Something like this makes it clearer what's being removed and what's kept. It's not obvious what the right answer should be for, e.g., https://example.com/path1/path2 vs. https://example.com/path1/path2/. The url_basename() function above gives path2 for both, while the regex version of base_url() gives https://example.com/path1/ and https://example.com/path1/path2/ respectively. A little specification comment would have been reassuring.

def base_url(url):
    """
        Return the URL, with its path trimmed after the rightmost /
        and without any query, params or fragment
    """
    parsed_url = compat_urlparse.urlparse(url)
    # maxsplit=1: drop only the last path segment
    path = parsed_url.path.rsplit('/', 1)[0]
    # don't restore trailing //: correct?
    if not path.endswith('/'):
        path += '/'
    return compat_urlparse.urlunparse(parsed_url._replace(
        path=path, query='', params='', fragment=''))
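
To make the trailing-slash difference concrete, here is a small stdlib-only sketch. Both helpers are hypothetical stand-ins written for illustration and are only assumed to mirror the behaviour described above; they are not copied from the youtube-dl sources.

```python
import re
from urllib.parse import urlparse


def basename_sketch(url):
    # Hypothetical stand-in for url_basename(): last non-empty path segment.
    return urlparse(url).path.strip('/').split('/')[-1]


def regex_base_sketch(url):
    # Hypothetical stand-in for a regex-based base_url(): greedy match that
    # backtracks to the last '/' before any '?' or '#'.
    return re.match(r'https?://[^?#]+/', url).group()


for u in ('https://example.com/path1/path2',
          'https://example.com/path1/path2/'):
    print(u, '->', basename_sketch(u), '|', regex_base_sketch(u))
```

The basename helper ignores the trailing slash, while the regex helper treats it as significant, which is why the two URLs share a basename but not a base.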

@dirkf
Contributor

dirkf commented Feb 10, 2023

Of course the "right answer" is what test/test_utils.py says:

    def test_base_url(self):
        self.assertEqual(base_url('http://foo.de/'), 'http://foo.de/')
        self.assertEqual(base_url('http://foo.de/bar'), 'http://foo.de/')
        self.assertEqual(base_url('http://foo.de/bar/'), 'http://foo.de/bar/')
        self.assertEqual(base_url('http://foo.de/bar/baz'), 'http://foo.de/bar/')
        self.assertEqual(base_url('http://foo.de/bar/baz?x=z/x/c'), 'http://foo.de/bar/')
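
As a rough check, the expectations above can be run against a parser-based sketch. This version uses Python 3's urllib.parse directly rather than compat_urlparse, and the name base_url_sketch and the extra & case are assumptions added for illustration, not part of the test suite.

```python
from urllib.parse import urlparse, urlunparse


def base_url_sketch(url):
    """Trim the path after its rightmost '/' and drop params, query, fragment."""
    parsed = urlparse(url)
    # maxsplit=1 removes only the last path segment
    path = parsed.path.rsplit('/', 1)[0]
    if not path.endswith('/'):
        path += '/'
    return urlunparse(parsed._replace(
        path=path, params='', query='', fragment=''))


CASES = [
    ('http://foo.de/', 'http://foo.de/'),
    ('http://foo.de/bar', 'http://foo.de/'),
    ('http://foo.de/bar/', 'http://foo.de/bar/'),
    ('http://foo.de/bar/baz', 'http://foo.de/bar/'),
    ('http://foo.de/bar/baz?x=z/x/c', 'http://foo.de/bar/'),
    # Illustrative extra case (not in test_utils.py): '&' inside the path
    ('http://foo.de/bar/b&z/baz?x=1', 'http://foo.de/bar/b&z/'),
]

for url, expected in CASES:
    assert base_url_sketch(url) == expected, (url, base_url_sketch(url))
print('all cases pass')
```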

@dirkf
Contributor

dirkf commented Feb 10, 2023

And also, RFC 3986 2.2 categorises & as a reserved character of type sub-delims:

      sub-delims  = "!" / "$" / "&" / "'" / "(" / ")"
                  / "*" / "+" / "," / ";" / "="

The principle is that a *sub*-delims character like & need only be percent-encoded where it appears in a context in which it has a delimiting role. Thus in http://example.com/contains_an_&_here?param=contains_an_&_there, the & in the path ("here") is fine, but the & in the query value ("there") must be rendered as %26, giving http://example.com/contains_an_&_here?param=contains_an_%26_there:

   URI producing applications should percent-encode data octets that
   correspond to characters in the reserved set unless these characters
   are specifically allowed by the URI scheme to represent data in that
   component.  If a reserved character is found in a URI component and
   no delimiting role is known for that character, then it must be
   interpreted as representing the data octet corresponding to that
   character's encoding in US-ASCII.

In other words, base_url() should support & in the path part of a URL.
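
A short stdlib sketch of that distinction follows. The last two lines compare two regex patterns; the one that excludes & is an assumption about the shape of the pre-fix pattern (inferred from the PR title), and the manifest URL is made up for illustration.

```python
import re
from urllib.parse import parse_qsl, quote, urlsplit

url = 'http://example.com/contains_an_&_here?param=contains_an_&_there'
parts = urlsplit(url)

# In the path, '&' has no delimiting role, so it is plain data.
path = parts.path

# In the query, '&' separates parameters, so the raw value is cut in two.
raw_pairs = parse_qsl(parts.query, keep_blank_values=True)

# Percent-encoding the value keeps it in one piece.
encoded_value = quote('contains_an_&_there', safe='')
fixed_pairs = parse_qsl('param=' + encoded_value)

# Hypothetical patterns: excluding '&' makes the match stop at the first '&'
# and backtrack to the previous '/', losing part of the path.
sample = 'http://example.com/a&b/c/manifest.mpd'
strict = re.match(r'https?://[^?#&]+/', sample).group()
relaxed = re.match(r'https?://[^?#]+/', sample).group()

print(path)
print(raw_pairs)
print(encoded_value, fixed_pairs)
print(strict, relaxed)
```

With the & excluded, the base collapses to the host root, so relative resources resolved against it would point at the wrong location.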

@dirkf dirkf mentioned this pull request Jul 8, 2023
@joaotolovikeepers

Please approve this PR, the error is affecting many systems @dirkf

Successfully merging this pull request may close these issues.

Resources parsed from MPD files dont contain signature for request resulting in a HTTP 403 (DiscoveryPlus)
3 participants