'utf8' codec can't decode byte 0x8b in position 1: invalid start byte #366
Comments
|
The issue is related to age verification. I think the whole age verification and login process needs to be rewritten. |
|
I have tested this on many videos that require age verification, and yes, that seems to be exactly the problem. |
|
This seems to work as a temporary fix. There may be better solutions though that actually fix the problem. I didn't really dig into it.
Updated to reflect error mentioned by GaelicGrime |
|
There is a typo in the above patch
should read
|
|
For me the issue was that gzip is not handled:
diff --git a/youtube_dl/InfoExtractors.py b/youtube_dl/InfoExtractors.py
index ddb4aa1..cf1b95b 100644
--- a/youtube_dl/InfoExtractors.py
+++ b/youtube_dl/InfoExtractors.py
@@ -278,7 +278,13 @@ class YoutubeIE(InfoExtractor):
self.report_video_webpage_download(video_id)
request = urllib2.Request('http://www.youtube.com/watch?v=%s&gl=US&hl=en&has_verified=1' % video_id)
try:
- video_webpage = urllib2.urlopen(request).read()
+ response = urllib2.urlopen(request)
+ if response.info().get('Content-Encoding') == 'gzip':
+ buf = StringIO.StringIO(response.read())
+ f = gzip.GzipFile(fileobj=buf)
+ video_webpage = f.read()
+ else:
+ video_webpage = response.read()
except (urllib2.URLError, httplib.HTTPException, socket.error), err:
self._downloader.trouble(u'ERROR: unable to download video webpage: %s' % str(err))
return |
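The header check in the patch above can be tried without touching the network. The following is a minimal sketch in Python 3 (the patch itself targets Python 2); `decode_body` is a hypothetical helper name, not part of youtube-dl, and it takes the raw bytes and the `Content-Encoding` value as plain arguments.

```python
import gzip
import io


def decode_body(raw_body, content_encoding):
    """Return the body, decompressing it only when the header says gzip."""
    # Hypothetical helper: mirrors the header check from the patch.
    if (content_encoding or "").strip().lower() == "gzip":
        with gzip.GzipFile(fileobj=io.BytesIO(raw_body)) as f:
            return f.read()
    return raw_body


if __name__ == "__main__":
    page = b"<html>watch page</html>"
    print(decode_body(gzip.compress(page), "gzip") == page)
    print(decode_body(page, None) == page)
```

Note that this approach trusts the header completely, so it fails if the server announces gzip but sends an uncompressed body.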
|
Hmm, YouTube seems to send out that header regardless of whether the body is actually gzip.
diff --git a/youtube_dl/InfoExtractors.py b/youtube_dl/InfoExtractors.py
index ddb4aa1..2ee8bb2 100644
--- a/youtube_dl/InfoExtractors.py
+++ b/youtube_dl/InfoExtractors.py
@@ -279,6 +279,12 @@ class YoutubeIE(InfoExtractor):
request = urllib2.Request('http://www.youtube.com/watch?v=%s&gl=US&hl=en&has_verified=1' % video_id)
try:
video_webpage = urllib2.urlopen(request).read()
+ try:
+ buf = StringIO.StringIO(video_webpage)
+ f = gzip.GzipFile(fileobj=buf)
+ video_webpage = f.read()
+ except IOError:
+ pass
except (urllib2.URLError, httplib.HTTPException, socket.error), err:
self._downloader.trouble(u'ERROR: unable to download video webpage: %s' % str(err))
return |
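The "try to decompress, fall back to the raw bytes" idea from this patch can be sketched as follows, again in Python 3 rather than the Python 2 of the patch. `gunzip_if_possible` is a hypothetical name. In Python 3 a body that is not gzip raises `OSError` (the old `IOError` is an alias for it), so that is what the sketch catches.

```python
import gzip
import io


def gunzip_if_possible(raw_body):
    """Try to gunzip the body; return it unchanged if it is not gzip."""
    try:
        with gzip.GzipFile(fileobj=io.BytesIO(raw_body)) as f:
            return f.read()
    except OSError:
        # Not a gzip stream (bad magic number), so assume plain content.
        # A truncated gzip stream raises EOFError instead and is
        # deliberately not swallowed here.
        return raw_body


if __name__ == "__main__":
    page = b"<html>watch page</html>"
    print(gunzip_if_possible(gzip.compress(page)) == page)
    print(gunzip_if_possible(page) == page)
```

The trade-off is that every response goes through a decompression attempt, and an exception is used for the ordinary uncompressed case.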
|
I worked around this by replacing |
|
Any chance this fix will be included in the main program? |
|
I do get this too, but it's intermittent: downloading the same video eventually works if I just keep trying. |
|
I'll have a look at it, but the intermittent nature and no clear diagnosis (and the fact that I could never reproduce this issue) make it hard to decide. And instead of blindly decoding gzip, we should really detect it. |
|
The problem is not occurring for me anymore. But I looked into detecting gzip, and the second byte is indeed 0x8b.
diff --git a/youtube_dl/InfoExtractors.py b/youtube_dl/InfoExtractors.py
index 9df521d..29886c3 100644
--- a/youtube_dl/InfoExtractors.py
+++ b/youtube_dl/InfoExtractors.py
@@ -304,6 +304,10 @@ class YoutubeIE(InfoExtractor):
request = urllib2.Request('http://www.youtube.com/watch?v=%s&gl=US&hl=en&has_verified=1' % video_id)
try:
video_webpage = urllib2.urlopen(request).read()
+ if len(video_webpage) > 2 and video_webpage[0] == '\x1f' and video_webpage[1] == '\x8b':
+ buf = StringIO.StringIO(video_webpage)
+ f = gzip.GzipFile(fileobj=buf)
+ video_webpage = f.read()
except (urllib2.URLError, httplib.HTTPException, socket.error), err:
self._downloader.trouble(u'ERROR: unable to download video webpage: %s' % str(err))
return |
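The detection in this patch relies on the two-byte gzip magic number, `0x1f 0x8b`. A minimal Python 3 sketch of the same check follows; `looks_like_gzip` and `maybe_gunzip` are hypothetical names used only for illustration. Slicing with `[:2]` avoids a separate length check.

```python
import gzip
import io

# Every gzip stream starts with these two bytes.
GZIP_MAGIC = b"\x1f\x8b"


def looks_like_gzip(raw_body):
    """Return True when the body starts with the gzip magic number."""
    return raw_body[:2] == GZIP_MAGIC


def maybe_gunzip(raw_body):
    """Decompress only bodies that actually look like gzip."""
    if looks_like_gzip(raw_body):
        with gzip.GzipFile(fileobj=io.BytesIO(raw_body)) as f:
            return f.read()
    return raw_body


if __name__ == "__main__":
    page = b"<html>watch page</html>"
    print(looks_like_gzip(gzip.compress(page)))
    print(looks_like_gzip(page))
    print(maybe_gunzip(gzip.compress(page)) == page)
```

Unlike the header check, this does not depend on what the server claims in `Content-Encoding`, and unlike the blind attempt it does not use an exception for the uncompressed case.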
|
Because the response is gzip-compressed, you need to decompress it first: if res.info().get('Content-Encoding') == 'gzip': |
Output of youtube-dl 5gVYfDCgYxk:
The title is normal english text, not exotic characters. I imagine at some point it's assuming the text is unicode but it isn't.
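The error in the issue title is consistent with gzip-compressed bytes being decoded as UTF-8. This is an illustration under that assumption, not a confirmed diagnosis. The first gzip byte, `0x1f`, is a valid ASCII control character, while the second, `0x8b`, cannot start a UTF-8 sequence, which matches "byte 0x8b in position 1". A small Python 3 sketch, with `utf8_failure` as a hypothetical helper name:

```python
import gzip


def utf8_failure(data):
    """Return the UnicodeDecodeError raised by decoding data, or None."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        return err
    return None


if __name__ == "__main__":
    payload = gzip.compress(b"<html>watch page</html>")
    err = utf8_failure(payload)
    # The failing offset and byte line up with the gzip magic number.
    print(err.start, hex(payload[err.start]), err.reason)
```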