URL builder utils #1675

Merged
merged 8 commits into from Aug 1, 2018

Conversation

beardypig
Member

As part of the solution for the issues raised in #1519 by @amurzeau, I have started adding some URL utils that can be used in plugins to better manipulate URLs.

I also added a useful method for iterating through HTML tags, using regex in a brute-force kind of way, like most of the plugins do.

update_qsd and url_concat are the two URL manipulation methods. url_concat will join together URL parts to make a URL, and update_qsd can be used to add, remove or update query string parameters in a URL.
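To make the two behaviours concrete, here is a minimal sketch built only on the standard library. The names `concat_url` and `update_query` and their signatures are hypothetical, chosen for illustration; they are not the helpers added in this PR and may differ from them in edge cases.

```python
from collections import OrderedDict
from urllib.parse import parse_qsl, urlencode, urljoin, urlparse, urlunparse


def concat_url(base, *parts):
    """Join URL parts onto a base, one path segment at a time (hypothetical helper)."""
    url = base
    for part in parts:
        # Make sure the current URL ends with "/" so urljoin appends the part
        # instead of replacing the last path segment.
        url = urljoin(url.rstrip("/") + "/", part.strip("/"))
    return url


def update_query(url, updates=None, remove=None):
    """Add, update, or remove query string parameters (hypothetical helper)."""
    parsed = urlparse(url)
    # OrderedDict keeps the existing parameters in their original order.
    # Assumption: repeated keys are collapsed to their last value.
    params = OrderedDict(parse_qsl(parsed.query, keep_blank_values=True))
    for key in remove or []:
        params.pop(key, None)
    params.update(updates or {})
    return urlunparse(parsed._replace(query=urlencode(params)))


print(concat_url("https://test.se", "foo", "bar"))
print(update_query("http://test.se/?one=1&two=3", {"two": 2}))
print(update_query("http://test.se/?one=1&two=3", remove=["one"]))
```

Under these assumptions, updating an existing key keeps its position in the query string, and new keys are appended at the end.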

I updated a couple of plugins as examples.

@codecov

codecov bot commented May 22, 2018

Codecov Report

Merging #1675 into master will increase coverage by 0.09%.
The diff coverage is 85.36%.

@@            Coverage Diff             @@
##           master    #1675      +/-   ##
==========================================
+ Coverage   51.26%   51.36%   +0.09%     
==========================================
  Files         242      243       +1     
  Lines       14358    14383      +25     
==========================================
+ Hits         7361     7388      +27     
+ Misses       6997     6995       -2

@gravyboat
Member

This fell through and I didn't ever merge it, sorry @beardypig. Do you want to rebase and we can get this in?

@back-to
Collaborator

back-to commented Jul 25, 2018

Well, I tried to rebase it,
but the tests won't pass on Python 3.7 / 3.8-dev.

=================================== FAILURES ===================================
___________________ TestPluginUtil.test_itertags_multi_attrs ___________________
self = <tests.test_plugin_utils.TestPluginUtil testMethod=test_itertags_multi_attrs>
    def test_itertags_multi_attrs(self):
        metas = list(itertags(self.test_html, "meta"))
        self.assertTrue(len(metas), 3)
        self.assertTrue(all(meta.tag == "meta" for meta in metas))
    
>       self.assertEqual(metas[0].text, None)
E       AssertionError: 'Title' != None
tests/test_plugin_utils.py:49: AssertionError
________________________ TestPluginUtil.test_no_end_tag ________________________
self = <tests.test_plugin_utils.TestPluginUtil testMethod=test_no_end_tag>
    def test_no_end_tag(self):
        links = list(itertags(self.test_html, "link"))
        self.assertTrue(len(links), 1)
        self.assertEqual(links[0].tag, "link")
>       self.assertEqual(links[0].text, None)
E       AssertionError: '' != None
tests/test_plugin_utils.py:68: AssertionError

@beardypig
Member Author

I cannot work out why it's failing on 3.7 right now ... When I have some time I will look at it, unless @back-to is able to work it out :)

@beardypig force-pushed the url-builder-review branch 2 times, most recently from 57d86dd to 1e5cecb (July 25, 2018 23:54)
@back-to
Collaborator

back-to commented Jul 26, 2018

I know why it fails

re.finditer
Changed in version 3.7: Non-empty matches can now start just after a previous empty match.

https://docs.python.org/3/library/re.html#re.finditer
python/cpython#4471
https://bugs.python.org/issue25054

https://docs.python.org/3/whatsnew/3.7.html#changes-in-the-python-api


Here is something similar: https://bugs.python.org/issue33585
It might be the new expected behavior.

If you don't want to find an empty string, change your pattern so that it will not match an empty string: ".+".

That may have something to do with it,
but I'm not sure what the best way to fix it is.

import re
test_html = """
<title>Title</title>
<meta property="og:type" content= "website" />
<meta property="og:url" content="http://test.se/"/>
<meta property="og:site_name" content="Test" />
<script src="https://test.se/test.js"></script>
<link rel="stylesheet" type="text/css" href="https://test.se/test.css">
<script>Tester.ready(function () {
alert("Hello, world!"); });</script>
<a
href="http://test.se/foo">bar</a>
"""
tag_re = re.compile(r'''(?=<(?P<tag>[a-zA-Z]+)(?P<attr>.*?)(?P<end>/)?>(?:(?P<inner>.*?)</\s*(?P=tag)\s*>)?)''', re.MULTILINE | re.DOTALL)
print([m.groups() for m in tag_re.finditer(test_html)])

Python 3.7

[('title', '', None, 'Title'),
('meta', ' property="og:type" content= "website" ', '/', 'Title'),
('meta', ' property="og:url" content="http://test.se/"', '/', None),
('meta', ' property="og:site_name" content="Test" ', '/', None),
('script', ' src="https://test.se/test.js"', '/', ''),
('link', ' rel="stylesheet" type="text/css" href="https://test.se/test.css"', None, ''),
('script', '', None, 'Tester.ready(function () {\nalert("Hello, world!"); });'),
('a', '\nhref="http://test.se/foo"', None, 'bar')]

Python 3.6

[('title', '', None, 'Title'),
('meta', ' property="og:type" content= "website" ', '/', None),
('meta', ' property="og:url" content="http://test.se/"', '/', None),
('meta', ' property="og:site_name" content="Test" ', '/', None),
('script', ' src="https://test.se/test.js"', None, ''),
('link', ' rel="stylesheet" type="text/css" href="https://test.se/test.css"', None, None),
('script', '', None, 'Tester.ready(function () {\nalert("Hello, world!"); });'),
('a', '\nhref="http://test.se/foo"', None, 'bar')]
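A short sketch of why this pattern ends up on the empty-match code path at all: because the whole expression sits inside `(?= ... )`, every match that `re.finditer` reports is zero-width, even though the named groups inside the lookahead still capture text. The `sample` string is a shortened, made-up variant of the test HTML.

```python
import re

# Lookahead-wrapped pattern as quoted in the reproduction above.
tag_re = re.compile(
    r'''(?=<(?P<tag>[a-zA-Z]+)(?P<attr>.*?)(?P<end>/)?>(?:(?P<inner>.*?)</\s*(?P=tag)\s*>)?)''',
    re.MULTILINE | re.DOTALL,
)

# Assumption: a two-tag sample, shortened from the test HTML for illustration.
sample = '<title>Title</title><meta property="og:type" content="website" />'

matches = list(tag_re.finditer(sample))
tags = [m.group("tag") for m in matches]
# Every match starts and ends at the same offset: nothing is consumed.
zero_width = all(m.start() == m.end() for m in matches)

print(tags, zero_width)
```

Since each match is empty, the scanner has to decide how to advance after every hit, which is the behaviour that changed in 3.7.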

@beardypig
Member Author

I must be missing something - I cannot see why the inner group would be "Title" for the second match. It seems like the inner group is not being reset, and gets the previous value for some reason (same for when it is '').

@back-to
Collaborator

back-to commented Jul 30, 2018

@beardypig

After removing the (?= ... ) part, the tests pass on 3.7:

diff --git a/src/streamlink/plugin/api/utils.py b/src/streamlink/plugin/api/utils.py
index f6f3f12..00445e2 100644
--- a/src/streamlink/plugin/api/utils.py
+++ b/src/streamlink/plugin/api/utils.py
@@ -7,9 +7,9 @@ from ...utils import parse_qsd as parse_query, parse_json, parse_xml
 __all__ = ["parse_json", "parse_xml", "parse_query"]
 
 
-tag_re = re.compile('''(?=<(?P<tag>[a-zA-Z]+)(?P<attr>.*?)(?P<end>/)?>(?:(?P<inner>.*?)</\s*(?P=tag)\s*>)?)''',
+tag_re = re.compile(r'''<(?P<tag>[a-zA-Z]+)(?P<attr>.*?)(?P<end>/)?>(?:(?P<inner>.*?)</\s*(?P=tag)\s*>)?''',
                     re.MULTILINE | re.DOTALL)
-attr_re = re.compile('''\s*(?P<key>[\w-]+)\s*(?:=\s*(?P<quote>["']?)(?P<value>.*?)(?P=quote)\s*)?''')
+attr_re = re.compile(r'''\s*(?P<key>[\w-]+)\s*(?:=\s*(?P<quote>["']?)(?P<value>.*?)(?P=quote)\s*)?''')
 Tag = namedtuple("Tag", "tag attributes text")
 
 
@@ -26,4 +26,3 @@ def itertags(html, tag):
         if match.group("tag") == tag:
             attrs = dict((a.group("key").lower(), a.group("value")) for a in attr_re.finditer(match.group("attr")))
             yield Tag(match.group("tag"), attrs, match.group("inner"))
-

platform linux -- Python 3.7.0, pytest-3.6.3, py-1.5.4, pluggy-0.6.0
rootdir: /run/media/ka/N/github/back-to/streamlink, inifile:
plugins: requests-mock-1.5.2, cov-2.5.1
collected 5 items                                                                                                                                            

tests/test_plugin_utils.py .....

@beardypig
Member Author

Not sure that the positive look ahead is even required, might have been left over from a previous iter_tags implementation (I had a few goes) :)
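For reference, the pieces visible in the diff can be assembled into a small standalone sketch. The two regexes, the `Tag` namedtuple and the loop body are taken from the diff; the `for` line and the sample HTML are assumptions, since the diff does not show them.

```python
import re
from collections import namedtuple

# Patterns as they appear on the "+" side of the diff (no lookahead).
tag_re = re.compile(
    r'''<(?P<tag>[a-zA-Z]+)(?P<attr>.*?)(?P<end>/)?>(?:(?P<inner>.*?)</\s*(?P=tag)\s*>)?''',
    re.MULTILINE | re.DOTALL,
)
attr_re = re.compile(r'''\s*(?P<key>[\w-]+)\s*(?:=\s*(?P<quote>["']?)(?P<value>.*?)(?P=quote)\s*)?''')
Tag = namedtuple("Tag", "tag attributes text")


def itertags(html, tag):
    # Assumption: the loop header is not shown in the diff; finditer is a guess
    # based on the discussion in this thread.
    for match in tag_re.finditer(html):
        if match.group("tag") == tag:
            attrs = dict((a.group("key").lower(), a.group("value"))
                         for a in attr_re.finditer(match.group("attr")))
            yield Tag(match.group("tag"), attrs, match.group("inner"))


# Assumption: flat sample HTML without nesting.
html = '<meta property="og:url" content="http://test.se/"/><a href="http://test.se/foo">bar</a>'

links = list(itertags(html, "a"))
metas = list(itertags(html, "meta"))
print(links)
print(metas)
```

A self-closing tag yields `None` as its text because the optional inner group never participates in the match.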

@@ -4,26 +4,27 @@

from streamlink import NoPluginError
from streamlink.plugin import Plugin
from streamlink.plugin.api import http
Collaborator


http can be removed, it got added with a rebase.

Member Author


True that, I forgot to clean up after I rebased. I saw it (using PyCharm to do the rebase), and I was going to tidy it up, but I forgot :)

@back-to
Collaborator

back-to commented Jul 31, 2018

@beardypig

itertags is actually broken now: the test HTML is missing the <html> tag,
so it works for the tests but not for real websites.

This can be tested with the gardenersworld plugin, which is broken with the current changes.

<> must be ignored for inner tags, or there will be only one tag: html.

Here is a possible fix:

back-to@ab7d6bf
https://travis-ci.org/back-to/streamlink/builds/410256162
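One way to see the problem is to run the pattern with and without the lookahead over a page wrapped in `<html>`. This is only a sketch on a made-up page; it illustrates the symptom and does not reproduce the linked fix.

```python
import re

PATTERN = r'''<(?P<tag>[a-zA-Z]+)(?P<attr>.*?)(?P<end>/)?>(?:(?P<inner>.*?)</\s*(?P=tag)\s*>)?'''
FLAGS = re.MULTILINE | re.DOTALL

consuming_re = re.compile(PATTERN, FLAGS)               # pattern without the lookahead
lookahead_re = re.compile("(?=" + PATTERN + ")", FLAGS)  # pattern wrapped in (?= ... )

# Assumption: a minimal nested page, made up for illustration.
page = '<html><body><a href="http://test.se/foo">bar</a></body></html>'

# The consuming pattern swallows everything up to </html> in a single match,
# so the tags nested inside it are never visited.
consumed_tags = [m.group("tag") for m in consuming_re.finditer(page)]

# The lookahead consumes nothing, so the scan continues inside the outer tag.
lookahead_tags = [m.group("tag") for m in lookahead_re.finditer(page)]

print(consumed_tags)   # ['html']
print(lookahead_tags)  # ['html', 'body', 'a']
```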

@beardypig
Member Author

@back-to I updated the test html with your changes, and added a test for inner/outer tags - to maintain the existing behaviour.

I think this might be a bug in Python. I have created a bug report, so we can see what they say.

@back-to
Collaborator

back-to commented Aug 1, 2018

Well, this might take a while.

Maybe the tests should be ignored on py37 and py38 for now,
so url_concat and update_qsd can be used.

The issue was merged in #1693 and is not really related to this PR;
it was only discovered because of the additional tests.


also from streamlink.plugin.api import http got added again with the rebase 🐘


https://github.com/beardypig/streamlink/blob/e449aa326a1cece9a8dc4755c413caa179d915f5/tests/test_utils.py#L100-L111

test_url_equal should be removed from test_utils.py;
it was left behind when you moved it to test_utils_url.py.

@beardypig
Member Author

🤦‍♂️ The rebase for this one got complex :)

I'll add a skipTest for Python 3.7/3.8.

due to a possible issue with re.finditer and positive lookaheads
@beardypig
Member Author

I updated the tests so that those that were failing are allowed to fail on 3.7+, with a note to monitor bpo-34294.
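As a rough illustration of gating a test on the interpreter version, the sketch below uses `unittest.skipIf`. The class name, test name, sample HTML and the `run_sketch` helper are hypothetical, and the PR itself may use a different mechanism to tolerate the failures.

```python
import io
import re
import sys
import unittest

# Assumption: treat every 3.7+ interpreter as potentially affected.
AFFECTED = sys.version_info >= (3, 7)

lookahead_re = re.compile(
    r'''(?=<(?P<tag>[a-zA-Z]+)(?P<attr>.*?)(?P<end>/)?>(?:(?P<inner>.*?)</\s*(?P=tag)\s*>)?)''',
    re.MULTILINE | re.DOTALL,
)


class TagTextSketch(unittest.TestCase):
    @unittest.skipIf(AFFECTED, "possible re.finditer/lookahead issue, see bpo-34294")
    def test_no_end_tag_has_no_text(self):
        html = '<title>Title</title><link rel="stylesheet" href="test.css">'
        inner = [m.group("inner") for m in lookahead_re.finditer(html)]
        # A tag without an end tag should not inherit text from an earlier match.
        self.assertEqual(inner, ["Title", None])


def run_sketch():
    """Run the sketch test case quietly and return the unittest result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(TagTextSketch)
    return unittest.TextTestRunner(stream=io.StringIO()).run(suite)
```

Skipping keeps the run green on the affected versions, at the cost of not noticing when the underlying behaviour is fixed, so the note to monitor the bug report still matters.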

@beardypig
Member Author

@gravyboat, I think this is OK to merge with the exception that it might be wonky on Python 3.7.

@gravyboat
Member

@beardypig Sounds good, changes look good as well. We can always open another issue depending on what occurs.

@gravyboat gravyboat merged commit a7b6b0d into streamlink:master Aug 1, 2018
@beardypig beardypig mentioned this pull request Aug 1, 2018
1 task
@beardypig
Member Author

@gravyboat I opened an issue to remind us :)

@back-to back-to mentioned this pull request Feb 7, 2019
4 tasks
netbsd-srcmastr pushed a commit to NetBSD/pkgsrc that referenced this pull request Apr 1, 2020
streamlink 1.3.1 (2020-01-27)

A small patch release that addresses the removal of MPV's legacy option syntax, also with fixes of several plugins, the addition of the --twitch-disable-reruns parameter and dropped support for Python 3.4.

streamlink 1.3.0 (2019-11-22)

A new release with plugin updates and fixes, including Twitch.tv (see #2680), which had to be delayed due to back and forth API changes.

The Twitch.tv workarounds mentioned in #2680 don't have to be applied anymore, but authenticating via --twitch-oauth-token has been disabled, regardless of the origin of the OAuth token (via --twitch-oauth-authenticate or the Twitch website). In order to not introduce breaking changes, both parameters have been kept in this release and the user name will still be logged when using an OAuth token, but receiving item drops or accessing restricted streams is not possible anymore.

Plugins for the following sites have also been added:

    albavision
    news.now.com
    twitcasting.tv
    viu.tv
    vlive.tv
    willax.tv

streamlink 1.2.0 (2019-08-18)

Here are the changes for this month's release

    Multiple plugin fixes
    Fixed single hyphen params at the beginning of --player-args (#2333)
    --http-proxy will set the default value of --https-proxy to same as --http-proxy. (#2536)
    DASH Streams will handle headers correctly (#2545)
    the timestamp for FFMPEGMuxer streams will start with zero (#2559)

streamlink 1.1.1 (2019-04-02)

This is just a small patch release which fixes a build/deploy issue with the new special wheels for Windows on PyPI. (#2392)

streamlink 1.0.0 (2019-01-30)

The celebratory release of Streamlink 1.0.0!

A lot of hard work has gone into getting Streamlink to where it is. Not only is Streamlink used across multiple applications and platforms, but companies as well.

Streamlink started from the inaugural fork of Livestreamer on September 17th, 2016.

Since then, we've hit multiple milestones:

    Over 886 PRs
    Hit 3,000 commits in Streamlink
    Obtaining our first sponsors as well as backers of the project
    The creation of our own logo (streamlink/streamlink#1123)

Thanks to everyone who has contributed to Streamlink (and our backers)! Without you, we wouldn't be where we are today.

Without further ado, here are the changes in release 1.0.0:

    We have a new icon / logo for Streamlink! (streamlink/streamlink#2165)
    Updated dependencies (streamlink/streamlink#2230)
    A ton of plugin updates. Have a look at this search query for all the recent updates.
    You can now provide a custom key URI to override HLS streams (streamlink/streamlink#2139). For example: --hls-segment-key-uri <URI>
    User agents for API communication have been updated (streamlink/streamlink#2194)
    Special synonyms have been added to sort "best" and "worst" streams (streamlink/streamlink#2127). For example: streamlink --stream-sorting-excludes '>=480p' URL best,best-unfiltered
    Process output will no longer show if tty is unavailable (streamlink/streamlink#2090)
    We've removed BountySource in favour of our OpenCollective page. If you have any features you'd like to request, please open up an issue with the request and possibly consider backing us!
    Improved terminal progress display for wide characters (streamlink/streamlink#2032)
    Fixed a bug with dynamic playlists on playback (streamlink/streamlink#2096)
    Fixed makeinstaller.sh (streamlink/streamlink#2098)
    Old Livestreamer deprecations and API references were removed (streamlink/streamlink#1987)
    Dependencies have been updated for Python (streamlink/streamlink#1975)
    Newer and more common User-Agents are now used (streamlink/streamlink#1974)
    DASH stream bitrates now round-up to the nearest 10, 100, 1000, etc. (streamlink/streamlink#1995)
    Updated documentation on issue templates (streamlink/streamlink#1996)
    URL utilities have been added for better processing of HTML tags (streamlink/streamlink#1675)
    Fixed sort and prog issue (streamlink/streamlink#1964)
    Reformatted issue templates (streamlink/streamlink#1966)
    Fixed crashing bug with player-continuous-http option (streamlink/streamlink#2234)
    Make sure all dev dependencies (streamlink/streamlink#2235)
    -r parameter has been replaced for --rtmp-rtmpdump (streamlink/streamlink#2152)

Breaking changes:

    A large number of unmaintained or NSFW plugins have been removed. You can find the PR that implemented that change here: streamlink/streamlink#2003 . See our CONTRIBUTING.md documentation for plugin policy.
netbsd-srcmastr pushed a commit to NetBSD/pkgsrc that referenced this pull request Apr 6, 2020
mkbloke pushed a commit to mkbloke/streamlink that referenced this pull request Aug 18, 2020
* utils: add some URL manipulation methods

* utils: method to find html tags using regex

* plugins.gardenersworld: use new itertags method to find iframes

* plugins.tf1: use update_qsd method to update hls url

* add an extra test for inner tags that should generate separate matches

* fix rebase issues

* allow tests to fail with Python 3.7+

due to a possible issue with re.finditer and positive lookaheads

* use OrderedDict in update_qsd to maintain stable query argument ordering