[utils] base_url: URL paths can contain &. Backport from yt-dlp Resolves #31485 #31490

Fuzion24 · 2023-01-23T18:47:09Z

Please follow the guide below

You will be asked some questions, please read them carefully and answer honestly
Put an x into all the boxes [ ] relevant to your pull request (like that [x])
Use Preview tab to see how your pull request will actually look like

Before submitting a pull request make sure you have:

Searched the bugtracker for similar pull requests
Read adding new extractor tutorial
Read youtube-dl coding conventions and adjusted the code to meet them
Covered the code with tests (note that PRs without tests will be REJECTED)
Checked the code with flake8

In order to be accepted and merged into youtube-dl each piece of code must be in public domain or released under Unlicense. Check one of the following options:

I am the original author of this code and I am willing to release it under Unlicense
I am not the original author of this code but it is in public domain or released under Unlicense (provide reliable evidence)

What is the purpose of your pull request?

Bug fix
Improvement
New extractor
New feature

Description of your pull request and other information

Explanation of your pull request in arbitrary form goes here. Please make sure the description explains the purpose and effect of your pull request and is worded well enough to be understood. Provide as much context and examples as possible.

Resolves #31485

dirkf · 2023-01-23T21:14:37Z

Please fill in the description and include the magic line Resolves #31485.

Anyway, why is base_url() using regex to parse URLs when there's an actual Python library module for that?

dirkf · 2023-01-23T21:15:56Z

Also, the change does have a test, so you can tick that checkbox too.

Fuzion24 · 2023-01-24T15:02:33Z

"Anyway, why is base_url() using regex to parse URLs when there's an actual Python library module for that?"

I don't know. I didn't write any of this code. I'm just trying to get it to work.

dirkf · 2023-01-24T17:08:40Z

Likewise. It was really a rhetorical question: the regex must have been an easy win.

Something like the following makes it clearer what's being removed and what's kept. It's not obvious what the right answer should be for (e.g.) https://example.com/path1/path2 versus https://example.com/path1/path2/. The url_basename() function above gives path2 for both, while the regex version of base_url() gives https://example.com/path1/ and https://example.com/path1/path2/ respectively. A short specification comment would have been reassuring.

def base_url(url):
    """
    Return the URL, with its path trimmed after the rightmost /
    and without any query, params or fragment
    """
    parsed_url = compat_urlparse.urlparse(url)
    # split once from the right so that only the last path segment is dropped
    path = parsed_url.path.rsplit('/', 1)[0]
    # don't restore trailing //: correct?
    if not path.endswith('/'):
        path += '/'
    return compat_urlparse.urlunparse(parsed_url._replace(
        path=path, query='', params='', fragment=''))

dirkf · 2023-02-10T00:24:38Z

Of course the "right answer" is what test/test_utils.py says:

    def test_base_url(self):
        self.assertEqual(base_url('http://foo.de/'), 'http://foo.de/')
        self.assertEqual(base_url('http://foo.de/bar'), 'http://foo.de/')
        self.assertEqual(base_url('http://foo.de/bar/'), 'http://foo.de/bar/')
        self.assertEqual(base_url('http://foo.de/bar/baz'), 'http://foo.de/bar/')
        self.assertEqual(base_url('http://foo.de/bar/baz?x=z/x/c'), 'http://foo.de/bar/')
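The expectations above can be tried against a standalone sketch. This is only an illustration, assuming Python 3's urllib.parse in place of compat_urlparse; the name base_url_sketch and the extra case with & in the path are hypothetical and not taken from the PR.

```python
from urllib.parse import urlparse, urlunparse


def base_url_sketch(url):
    """Trim the path after its rightmost '/' and drop params, query and fragment."""
    parsed = urlparse(url)
    # Keep everything before the last '/', then restore that single '/'.
    # An empty path therefore becomes '/'.
    path = parsed.path.rsplit('/', 1)[0] + '/'
    return urlunparse(parsed._replace(path=path, params='', query='', fragment=''))


# The five cases from test_base_url, plus one hypothetical case with '&' in the path.
CASES = {
    'http://foo.de/': 'http://foo.de/',
    'http://foo.de/bar': 'http://foo.de/',
    'http://foo.de/bar/': 'http://foo.de/bar/',
    'http://foo.de/bar/baz': 'http://foo.de/bar/',
    'http://foo.de/bar/baz?x=z/x/c': 'http://foo.de/bar/',
    'http://foo.de/bar/baz&x=z&w=y/x/c': 'http://foo.de/bar/baz&x=z&w=y/x/',
}

for url, expected in CASES.items():
    assert base_url_sketch(url) == expected, (url, base_url_sketch(url))
```

Because the query is split off before the path is trimmed, a / inside the query string (as in the fifth case) cannot affect the result.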

dirkf · 2023-02-10T00:54:20Z

And also, RFC 3986 2.2 categorises & as a reserved character of type sub-delims:

      sub-delims  = "!" / "$" / "&" / "'" / "(" / ")"
                  / "*" / "+" / "," / ";" / "="

The principle is that a *sub*-delims character like & need only be percent-encoded if it appears in a context where it has a delimiting role. Thus in http://example.com/contains_an_&_here?param=contains_an_&_there, the & in the path ("here") is fine, but the & in the query value ("there") must be rendered as %26, giving http://example.com/contains_an_&_here?param=contains_an_%26_there. As the same section puts it:

   URI producing applications should percent-encode data octets that
   correspond to characters in the reserved set unless these characters
   are specifically allowed by the URI scheme to represent data in that
   component.  If a reserved character is found in a URI component and
   no delimiting role is known for that character, then it must be
   interpreted as representing the data octet corresponding to that
   character's encoding in US-ASCII.

In other words, base_url() should support & in the path part of a URL.
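A small sketch of the practical effect, under stated assumptions: the two patterns below are illustrative stand-ins for a regex-based base_url() that does and does not exclude & from the matched prefix. They are not quoted from this PR's diff, and match_prefix is a hypothetical helper.

```python
import re
from urllib.parse import quote

# Assumed patterns for illustration only.
EXCLUDES_AMP = re.compile(r'https?://[^?#&]+/')  # treats '&' as ending the path
ALLOWS_AMP = re.compile(r'https?://[^?#]+/')     # lets '&' appear in the path


def match_prefix(pattern, url):
    """Return the longest prefix ending in '/' that the pattern matches, or None."""
    m = pattern.match(url)
    return m.group() if m else None


url = 'http://example.com/contains_an_&_here/manifest.mpd?param=contains_an_%26_there'

# Excluding '&' cuts the base back to the host, losing the path segment.
assert match_prefix(EXCLUDES_AMP, url) == 'http://example.com/'
# Allowing '&' keeps the whole directory part of the path.
assert match_prefix(ALLOWS_AMP, url) == 'http://example.com/contains_an_&_here/'

# In a query value, where '&' separates parameters, it needs percent-encoding.
assert quote('contains_an_&_there', safe='') == 'contains_an_%26_there'
```

With the first pattern, any relative resource resolved against the base would lose the path segment containing &.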

joaotolovikeepers · 2023-09-12T16:24:51Z

Please approve this PR, the error is affecting many systems @dirkf

[utils] base_url: URL paths can contain & (yt-dlp#4841)

6e34e8f

Fuzion24 mentioned this pull request Jan 23, 2023

Resources parsed from MPD files dont contain signature for request resulting in a HTTP 403 (DiscoveryPlus) #31485

Open

9 tasks

Fuzion24 changed the title ~~[utils] base_url: URL paths can contain &. Backport from yt-dlp fixes #31485~~ [utils] base_url: URL paths can contain &. Backport from yt-dlp Resolves #31485 Jan 24, 2023

dirkf mentioned this pull request Feb 9, 2023

[KommunetvIE] Add extractor for kommunetv.no #31516

Merged

11 tasks

dirkf mentioned this pull request Jul 8, 2023

Vimeo issue just came up #32409

Closed
