'utf8' codec can't decode byte 0x8b in position 1: invalid start byte #366

Open
jherazob opened this issue Jun 28, 2012 · 12 comments

@jherazob jherazob commented Jun 28, 2012

Output of youtube-dl 5gVYfDCgYxk:

[youtube] Setting language
[youtube] 5gVYfDCgYxk: Downloading video webpage
[youtube] 5gVYfDCgYxk: Downloading video info webpage
[youtube] 5gVYfDCgYxk: Extracting video information
Traceback (most recent call last):
  File "/usr/lib/python2.7/runpy.py", line 162, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/home/jherazob/bin/youtube-dl/__main__.py", line 7, in <module>
  File "/home/jherazob/bin/youtube-dl/__init__.py", line 535, in main

  File "/home/jherazob/bin/youtube-dl/__init__.py", line 519, in _real_main

  File "/home/jherazob/bin/youtube-dl/FileDownloader.py", line 475, in download
  File "/home/jherazob/bin/youtube-dl/InfoExtractors.py", line 80, in extract
  File "/home/jherazob/bin/youtube-dl/InfoExtractors.py", line 350, in _real_extract
  File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x8b in position 1: invalid start byte

The title is normal English text, with no exotic characters. I imagine at some point it's assuming the text is Unicode (UTF-8), but it isn't.
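
For reference, the same message can be reproduced outside youtube-dl by decoding gzip-compressed bytes as UTF-8. The sketch below is written for Python 3 (the traceback above is from Python 2.7) and uses a made-up page body. It only illustrates that a gzip stream begins with the bytes 0x1f 0x8b, so a UTF-8 decoder fails at position 1; it does not prove that this is what happened here.

```python
import gzip

# Hypothetical page body; any data compressed with gzip starts the same way.
page = "<html><title>Plain English title</title></html>".encode("utf-8")
compressed = gzip.compress(page)

try:
    compressed.decode("utf-8")
    error = None
except UnicodeDecodeError as err:
    # Byte 0 (0x1f) is valid ASCII; byte 1 (0x8b) cannot start a UTF-8 sequence.
    error = err

print(error)
```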

Collaborator

@FiloSottile FiloSottile commented Jun 30, 2012

The issue is related to the age verification. I think the whole age verification and login processes are to be rewritten.

Author

@jherazob jherazob commented Jul 2, 2012

Have tested on many videos that require age verification, and yes, that seems to be exactly the problem

@zoredache zoredache commented Jul 6, 2012

This seems to work as a temporary fix. There may be better solutions though that actually fix the problem. I didn't really dig into it.

diff --git a/youtube_dl/InfoExtractors.py b/youtube_dl/InfoExtractors.py
index baf859e..4a43b46 100644
--- a/youtube_dl/InfoExtractors.py
+++ b/youtube_dl/InfoExtractors.py
@@ -352,9 +352,12 @@ class YoutubeIE(InfoExtractor):
                                        pass

                # description
-               video_description = get_element_by_id("eow-description", video_webpage.decode('utf8'))
-               if video_description: video_description = clean_html(video_description)
-               else: video_description = ''
+               try:
+                       video_description = get_element_by_id("eow-description", video_webpage.decode('utf8'))
+                       if video_description: video_description = clean_html(video_description)
+                       else: video_description = ''
+               except UnicodeDecodeError, err:
+                       video_description = ''

                # closed captions
                video_subtitles = None

Updated to reflect error mentioned by GaelicGrime

@GaelicGrime GaelicGrime commented Jul 10, 2012

There is a typo in the above patch

    video_description = get_element_by_id("eow-description", video_webpage.decode('utf8')

should read

    video_description = get_element_by_id("eow-description", video_webpage.decode('utf8'))
    

@Cybjit Cybjit commented Aug 5, 2012

For me the issue was that gzip is not handled

diff --git a/youtube_dl/InfoExtractors.py b/youtube_dl/InfoExtractors.py
index ddb4aa1..cf1b95b 100644
--- a/youtube_dl/InfoExtractors.py
+++ b/youtube_dl/InfoExtractors.py
@@ -278,7 +278,13 @@ class YoutubeIE(InfoExtractor):
        self.report_video_webpage_download(video_id)
        request = urllib2.Request('http://www.youtube.com/watch?v=%s&gl=US&hl=en&has_verified=1' % video_id)
        try:
-           video_webpage = urllib2.urlopen(request).read()
+           response = urllib2.urlopen(request)
+           if response.info().get('Content-Encoding') == 'gzip':
+               buf = StringIO.StringIO(response.read())
+               f = gzip.GzipFile(fileobj=buf)
+               video_webpage = f.read()
+           else:
+               video_webpage = response.read()
        except (urllib2.URLError, httplib.HTTPException, socket.error), err:
            self._downloader.trouble(u'ERROR: unable to download video webpage: %s' % str(err))
            return

@Cybjit Cybjit commented Aug 5, 2012

Hmm, YouTube seems to send that header regardless of whether the body is actually gzip-compressed.

diff --git a/youtube_dl/InfoExtractors.py b/youtube_dl/InfoExtractors.py
index ddb4aa1..2ee8bb2 100644
--- a/youtube_dl/InfoExtractors.py
+++ b/youtube_dl/InfoExtractors.py
@@ -279,6 +279,12 @@ class YoutubeIE(InfoExtractor):
        request = urllib2.Request('http://www.youtube.com/watch?v=%s&gl=US&hl=en&has_verified=1' % video_id)
        try:
            video_webpage = urllib2.urlopen(request).read()
+           try:
+               buf = StringIO.StringIO(video_webpage)
+               f = gzip.GzipFile(fileobj=buf)
+               video_webpage = f.read()
+           except IOError:
+               pass
        except (urllib2.URLError, httplib.HTTPException, socket.error), err:
            self._downloader.trouble(u'ERROR: unable to download video webpage: %s' % str(err))
            return
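
The "try to decompress, fall back to the raw bytes" idea in the patch above can be sketched as a standalone helper. This is a minimal Python 3 illustration, not the youtube-dl code: the name `maybe_gunzip` is hypothetical, and the patch itself targets Python 2 with `StringIO` and `gzip.GzipFile`.

```python
import gzip
import zlib


def maybe_gunzip(body):
    """Return the decompressed body if it is gzip data, else the body unchanged."""
    try:
        return gzip.decompress(body)
    except (OSError, EOFError, zlib.error):
        # Not gzip, or a damaged stream: fall back to the raw bytes.
        return body


plain = b"<html>not compressed</html>"
print(maybe_gunzip(plain) == plain)
print(maybe_gunzip(gzip.compress(plain)) == plain)
```

The drawback of this approach is that a truncated or corrupted gzip body is silently passed through as if it were plain text, so the decode error would simply reappear later.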

@DopefishJustin DopefishJustin commented Aug 10, 2012

I worked around this by replacing video_webpage.decode('utf8') with video_webpage.decode('utf8','replace') which replaces invalid characters rather than just bombing out.
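
As a small illustration of what the `'replace'` error handler does, the Python 3 sketch below decodes the first three bytes of a typical gzip stream. The byte string is an assumption chosen for the example, not data captured from YouTube.

```python
# First three bytes of a typical gzip stream: magic number plus the deflate method byte.
raw = b"\x1f\x8b\x08"

decoded = raw.decode("utf8", "replace")

# The invalid byte 0x8b becomes U+FFFD (the replacement character);
# the other two bytes are valid ASCII control characters and pass through.
print(repr(decoded))
```

If the page body really is gzip-compressed, this avoids the exception but leaves a string of mostly unreadable characters, so anything extracted from it (such as the description) will likely come back empty.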

@rifter rifter commented Oct 27, 2012

Any chance this fix will be included in the main program?

@rifter rifter commented Oct 27, 2012

Anyway, I do get this error, but it's intermittent: downloading the same video eventually works if I keep trying.

Contributor

@phihag phihag commented Oct 27, 2012

I'll have a look at it, but the intermittent nature and no clear diagnosis (and the fact that I could never reproduce this issue) make it hard to decide. And instead of blindly decoding gzip, we should really detect it.

@Cybjit Cybjit commented Oct 30, 2012

The problem is not occurring for me anymore. But I looked into detecting gzip, and the second byte is indeed 0x8b.
Second attempt, untested:

diff --git a/youtube_dl/InfoExtractors.py b/youtube_dl/InfoExtractors.py
index 9df521d..29886c3 100644
--- a/youtube_dl/InfoExtractors.py
+++ b/youtube_dl/InfoExtractors.py
@@ -304,6 +304,10 @@ class YoutubeIE(InfoExtractor):
        request = urllib2.Request('http://www.youtube.com/watch?v=%s&gl=US&hl=en&has_verified=1' % video_id)
        try:
            video_webpage = urllib2.urlopen(request).read()
+           if len(video_webpage) > 2 and video_webpage[0] == '\x1f' and video_webpage[1] == '\x8b':
+               buf = StringIO.StringIO(video_webpage)
+               f = gzip.GzipFile(fileobj=buf)
+               video_webpage = f.read()
        except (urllib2.URLError, httplib.HTTPException, socket.error), err:
            self._downloader.trouble(u'ERROR: unable to download video webpage: %s' % str(err))
            return
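
The magic-number check in the patch above could look roughly like this in Python 3. It is a sketch under assumptions: `gunzip_if_needed` and `GZIP_MAGIC` are hypothetical names, and the comparison uses a slice because indexing a `bytes` object in Python 3 yields integers rather than one-character strings.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip member (RFC 1952)


def gunzip_if_needed(body):
    """Decompress body only when it starts with the gzip magic number."""
    if body[:2] == GZIP_MAGIC:
        return gzip.decompress(body)
    return body


page = b"<html>example page</html>"
print(gunzip_if_needed(page) == page)
print(gunzip_if_needed(gzip.compress(page)) == page)
```

Unlike a blanket try/except around the decompression, this only touches bodies that look like gzip, so a damaged gzip stream still raises an error instead of being passed along silently.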

@ly695908698 ly695908698 commented Jun 26, 2017

The page is gzip-compressed, so you need to decompress it first:

import gzip
import io

# res is assumed to be the response object returned by urlopen()
if res.info().get('Content-Encoding') == 'gzip':
    buf = io.BytesIO(res.read())  # on Python 2, use StringIO.StringIO
    gzip_f = gzip.GzipFile(fileobj=buf)
    content = gzip_f.read()
else:
    content = res.read()
