Paywall redirects can defeat --force-generic-extractor #12700

Open
johnhawkinson opened this issue Apr 10, 2017 · 0 comments

Sometimes a webpage redirects not because the content has really moved, but because the site wants the user to log in for paywall-type reasons. The New York Times does this (see the redirect chain below), e.g. for https://www.nytimes.com/2017/04/05/arts/television/louis-ck-young-stephen-colbert.html?_r=0


That page has four YouTube embeds, but because NYTimesArticleIE matches the URL, it claims the page and tries to find an NYT video that's not there:

pb3:extractor jhawk$ youtube-dl -v 'https://www.nytimes.com/2017/04/05/arts/television/louis-ck-young-stephen-colbert.html'
[debug] System config: []
[debug] User config: []
[debug] Custom config: []
[debug] Command-line args: [u'-v', u'https://www.nytimes.com/2017/04/05/arts/television/louis-ck-young-stephen-colbert.html']
[debug] Encodings: locale UTF-8, fs utf-8, out UTF-8, pref UTF-8
[debug] youtube-dl version 2017.04.09
[debug] Python version 2.7.10 - Darwin-14.5.0-x86_64-i386-64bit
[debug] exe versions: ffmpeg git-2017-02-28-7f62368, ffprobe git-2017-02-28-7f62368, rtmpdump 2.4
[debug] Proxy map: {}
[NYTimesArticle] louis-ck-young-stephen-colbert: Downloading webpage
ERROR: Unable to extract podcast data; please report this issue on https://yt-dl.org/bug . Make sure you are using the latest version; type  youtube-dl -U  to update. Be sure to call youtube-dl with the --verbose flag and include its complete output.
Traceback (most recent call last):
  File "/usr/local/bin/youtube-dl/youtube_dl/YoutubeDL.py", line 761, in extract_info
    ie_result = ie.extract(url)
  File "/usr/local/bin/youtube-dl/youtube_dl/extractor/common.py", line 429, in extract
    ie_result = self._real_extract(url)
  File "/usr/local/bin/youtube-dl/youtube_dl/extractor/nytimes.py", line 222, in _real_extract
    webpage, 'podcast data')
  File "/usr/local/bin/youtube-dl/youtube_dl/extractor/common.py", line 778, in _search_regex
    raise RegexNotFoundError('Unable to extract %s' % _name)
RegexNotFoundError: Unable to extract podcast data; please report this issue on https://yt-dl.org/bug . Make sure you are using the latest version; type  youtube-dl -U  to update. Be sure to call youtube-dl with the --verbose flag and include its complete output.

The obvious workaround is --force-generic-extractor, but that fails:

pb3:extractor jhawk$ youtube-dl --force-generic-extractor  -v 'https://www.nytimes.com/2017/04/05/arts/television/louis-ck-young-stephen-colbert.html' -s
[debug] System config: []
[debug] User config: []
[debug] Custom config: []
[debug] Command-line args: [u'--force-generic-extractor', u'-v', u'https://www.nytimes.com/2017/04/05/arts/television/louis-ck-young-stephen-colbert.html', u'-s']
[debug] Encodings: locale UTF-8, fs utf-8, out UTF-8, pref UTF-8
[debug] youtube-dl version 2017.04.09
[debug] Python version 2.7.10 - Darwin-14.5.0-x86_64-i386-64bit
[debug] exe versions: ffmpeg git-2017-02-28-7f62368, ffprobe git-2017-02-28-7f62368, rtmpdump 2.4
[debug] Proxy map: {}
[generic] louis-ck-young-stephen-colbert: Requesting header
[redirect] Following redirect to https://www.nytimes.com/2017/04/05/arts/television/louis-ck-young-stephen-colbert.html?_r=0
[NYTimesArticle] louis-ck-young-stephen-colbert: Downloading webpage
ERROR: Unable to extract podcast data; please report this issue on https://yt-dl.org/bug . Make sure you are using the latest version; type  youtube-dl -U  to update. Be sure to call youtube-dl with the --verbose flag and include its complete output.
Traceback (most recent call last):
  File "/usr/local/bin/youtube-dl/youtube_dl/YoutubeDL.py", line 761, in extract_info
    ie_result = ie.extract(url)
  File "/usr/local/bin/youtube-dl/youtube_dl/extractor/common.py", line 429, in extract
    ie_result = self._real_extract(url)
  File "/usr/local/bin/youtube-dl/youtube_dl/extractor/nytimes.py", line 222, in _real_extract
    webpage, 'podcast data')
  File "/usr/local/bin/youtube-dl/youtube_dl/extractor/common.py", line 778, in _search_regex
    raise RegexNotFoundError('Unable to extract %s' % _name)
RegexNotFoundError: Unable to extract podcast data; please report this issue on https://yt-dl.org/bug . Make sure you are using the latest version; type  youtube-dl -U  to update. Be sure to call youtube-dl with the --verbose flag and include its complete output.

And of course, the reason it fails is this redirect chain:

pb3:extractor jhawk$ curl -s  -IL https://www.nytimes.com/2017/04/05/arts/television/louis-ck-young-stephen-colbert.html?_r=0 |grep Location
Location: https://www.nytimes.com/glogin?URI=https%3A%2F%2Fwww.nytimes.com%2F2017%2F04%2F05%2Farts%2Ftelevision%2Flouis-ck-young-stephen-colbert.html%3F_r%3D1
Location: https://www.nytimes.com/2017/04/05/arts/television/louis-ck-young-stephen-colbert.html?_r=1
Location: https://www.nytimes.com/glogin?URI=https%3A%2F%2Fwww.nytimes.com%2F2017%2F04%2F05%2Farts%2Ftelevision%2Flouis-ck-young-stephen-colbert.html%3F_r%3D2
Location: https://www.nytimes.com/2017/04/05/arts/television/louis-ck-young-stephen-colbert.html?_r=2
Location: https://www.nytimes.com/glogin?URI=https%3A%2F%2Fwww.nytimes.com%2F2017%2F04%2F05%2Farts%2Ftelevision%2Flouis-ck-young-stephen-colbert.html%3F_r%3D3
Location: https://www.nytimes.com/2017/04/05/arts/television/louis-ck-young-stephen-colbert.html?_r=3
Location: https://www.nytimes.com/glogin?URI=https%3A%2F%2Fwww.nytimes.com%2F2017%2F04%2F05%2Farts%2Ftelevision%2Flouis-ck-young-stephen-colbert.html%3F_r%3D4
Location: https://www.nytimes.com/2017/04/05/arts/television/louis-ck-young-stephen-colbert.html?_r=4
Location: https://www.nytimes.com/glogin?URI=https%3A%2F%2Fwww.nytimes.com%2F2017%2F04%2F05%2Farts%2Ftelevision%2Flouis-ck-young-stephen-colbert.html%3F_r%3D5
Location: https://www.nytimes.com/2017/04/05/arts/television/louis-ck-young-stephen-colbert.html?_r=5
pb3:extractor jhawk$ 
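
The chain above alternates between a glogin hop and the same article URL with an incremented _r value, so the "redirect" never actually leaves the page. A minimal sketch of how such a self-redirect could be recognized follows. The helper names (`strip_counters`, `unwrap_login`, `is_self_redirect`) are hypothetical and not part of youtube-dl, and treating `_r` as a counter and `glogin?URI=` as a login wrapper is an assumption drawn only from this one chain.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumption: `_r` is a redirect counter, inferred only from the chain above.
COUNTER_PARAMS = frozenset(['_r'])


def strip_counters(url):
    """Return url without the assumed redirect-counter query parameters."""
    parts = urlsplit(url)
    query = [(key, value)
             for key, value in parse_qsl(parts.query, keep_blank_values=True)
             if key not in COUNTER_PARAMS]
    return urlunsplit(parts._replace(query=urlencode(query)))


def unwrap_login(url):
    """If url is a glogin hop, return the target carried in its URI parameter."""
    parts = urlsplit(url)
    if parts.path.rstrip('/').endswith('/glogin'):
        # parse_qsl percent-decodes the embedded target URL.
        target = dict(parse_qsl(parts.query)).get('URI')
        if target:
            return target
    return url


def is_self_redirect(url, new_url):
    """True if new_url leads back to the same page as url."""
    return strip_counters(url) == strip_counters(unwrap_login(new_url))
```

With a check like this, a caller could treat the redirect as a no-op and keep processing the original URL instead of handing the new one to another extractor. Whether that is safe for sites other than this one is untested.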

Disabling the redirect check in GenericIE like so

diff --git a/youtube_dl/extractor/generic.py b/youtube_dl/extractor/generic.py
index 658533cf6..0651024e7 100644
--- a/youtube_dl/extractor/generic.py
+++ b/youtube_dl/extractor/generic.py
@@ -1730,7 +1730,7 @@ class GenericIE(InfoExtractor):
             note=False, errnote='Could not send HEAD request to %s' % url,
             fatal=False)
 
-        if head_response is not False:
+        if False: #head_response is not False:
             # Check for redirect
             new_url = head_response.geturl()
             if url != new_url:

Resolves this just fine:

pb3:extractor jhawk$ PYTHONPATH=~/src/youtube-dl python -m youtube_dl -vs https://www.nytimes.com/2017/04/05/arts/television/louis-ck-young-stephen-colbert.html?_r=0 --force-generic
[debug] System config: []
[debug] User config: []
[debug] Custom config: []
[debug] Command-line args: [u'-vs', u'https://www.nytimes.com/2017/04/05/arts/television/louis-ck-young-stephen-colbert.html?_r=0', u'--force-generic']
[debug] Encodings: locale UTF-8, fs utf-8, out UTF-8, pref UTF-8
[debug] youtube-dl version 2017.04.09
[debug] Git HEAD: 3f2ce6896
[debug] Python version 2.7.10 - Darwin-14.5.0-x86_64-i386-64bit
[debug] exe versions: ffmpeg git-2017-02-28-7f62368, ffprobe git-2017-02-28-7f62368, rtmpdump 2.4
[debug] Proxy map: {}
[generic] louis-ck-young-stephen-colbert: Requesting header
WARNING: Forcing on generic information extractor.
[generic] louis-ck-young-stephen-colbert: Downloading webpage
[generic] louis-ck-young-stephen-colbert: Extracting information
[download] Downloading playlist: Stephen Colbert Was ‘Like an Alien’ Before He Was Famous, Louis C.K. Says
[generic] playlist Stephen Colbert Was ‘Like an Alien’ Before He Was Famous, Louis C.K. Says: Collected 4 video ids (downloading 4 of them)
[download] Downloading video 1 of 4
[youtube] EGKCjw7O_ZM: Downloading webpage
[youtube] EGKCjw7O_ZM: Downloading video info webpage
[youtube] EGKCjw7O_ZM: Extracting video information
[youtube] EGKCjw7O_ZM: Downloading MPD manifest
[download] Downloading video 2 of 4
[youtube] ARCJMFXHalo: Downloading webpage
[youtube] ARCJMFXHalo: Downloading video info webpage
[youtube] ARCJMFXHalo: Extracting video information
[youtube] ARCJMFXHalo: Downloading MPD manifest
[download] Downloading video 3 of 4
[youtube] 6O86IC6ddm8: Downloading webpage
[youtube] 6O86IC6ddm8: Downloading video info webpage
[youtube] 6O86IC6ddm8: Extracting video information
[youtube] 6O86IC6ddm8: Downloading MPD manifest
[download] Downloading video 4 of 4
[youtube] aB6bpFvzjns: Downloading webpage
[youtube] aB6bpFvzjns: Downloading video info webpage
[youtube] aB6bpFvzjns: Extracting video information
[youtube] aB6bpFvzjns: Downloading MPD manifest
[download] Finished downloading playlist: Stephen Colbert Was ‘Like an Alien’ Before He Was Famous, Louis C.K. Says

I have no idea what the right solution is :)
It's certainly not a very pressing problem. And I guess maaaybe it has some connection to #12501 (but maybe not).
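
One conceivable middle ground, instead of disabling the redirect check outright, would be to skip the hand-off to another extractor only when the generic extractor was explicitly forced. The sketch below isolates that decision as a standalone function. The name `should_redispatch` is hypothetical, and how the `force_generic` value would be obtained inside GenericIE (presumably from the downloader's parameters) is an assumption, not something verified here.

```python
def should_redispatch(url, new_url, force_generic):
    """Decide whether a redirect seen on the HEAD request should be handed
    to whichever extractor matches new_url.

    force_generic is assumed to reflect --force-generic-extractor.
    """
    if url == new_url:
        # No redirect happened, so there is nothing to re-dispatch.
        return False
    if force_generic:
        # The user asked for the generic extractor; stay in it.
        return False
    return True
```

For the case in this report, `should_redispatch(url, url + '?_r=0', True)` returns False, so the generic extractor would go on to download the page itself, as it does with the patched build above. The trade-off is that a forced generic run would no longer benefit from redirects that lead to a page a dedicated extractor handles better.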
