Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

scrapy logging show UnicodeDecodeError #1270

Closed
binbibi opened this issue Jun 1, 2015 · 9 comments
Closed

scrapy logging show UnicodeDecodeError #1270

binbibi opened this issue Jun 1, 2015 · 9 comments
Milestone

Comments

@binbibi
Copy link

binbibi commented Jun 1, 2015

hello,all, I use scrapy 1.00rc1 from pip ,my pc is win7x64,python is 2.7.9(x64),
when I start my project

scrapy crawl imzx
Traceback (most recent call last):
  File "C:\Python27\lib\logging\__init__.py", line 859, in emit
    msg = self.format(record)
  File "C:\Python27\lib\logging\__init__.py", line 732, in format
    return fmt.format(record)
  File "C:\Python27\lib\logging\__init__.py", line 474, in format
    s = self._fmt % record.__dict__
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd6 in position 19: ordinal not in range(128)
Logged from file log.py, line 108

Then I find utils/log.py ,line 108 ,I commented-out the code as you can see

#logger.info("Scrapy %(version)s started (bot: %(bot)s)",
                #{'version': scrapy.__version__, 'bot': settings['BOT_NAME']})

And I restart my project,it's OK ,but something about logging error still:

Traceback (most recent call last):
  File "C:\Python27\lib\logging\__init__.py", line 859, in emit
    msg = self.format(record)
  File "C:\Python27\lib\logging\__init__.py", line 732, in format
    return fmt.format(record)
  File "C:\Python27\lib\logging\__init__.py", line 474, in format
    s = self._fmt % record.__dict__
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd6 in position 19: ordinal not in range(128)
Logged from file engine.py, line 212

It's to much in my screen,though too many logging Traceback , my code work fine ,I get all data I need, But I confused,my code work fine in scrapy 0.24 without any above logging Traceback,so I think it's maybe scrapy bug

@curita
Copy link
Member

curita commented Jun 1, 2015

Hi, thanks for reporting this issue!

Seems like the new logging system isn't handling well some encoding, though I'm not sure how could this issue be reproduced without breaking the old log system too.

Could you show us the code from your spider (and maybe your project settings if you have redefined some of them) so we can try to reproduce it?

@binbibi
Copy link
Author

binbibi commented Jun 1, 2015

yes, I upload my project to Dropbox,you can download it ,
https://www.dropbox.com/s/s14qh12qjptu7pr/imzx.zip?dl=0

@kmike kmike added this to the Scrapy 1.0 milestone Jun 1, 2015
@curita
Copy link
Member

curita commented Jun 2, 2015

I was able to emulate a similar issue touching some settings, I couldn't reproduce it otherwise in the attached project.

Problem here seems unrelated to the system locale. Both log calls, the one in utils/log.py and the one in core/engine.py, try to log an unicode message. By doing that all attributes in the log record are decoded to unicode using the ascii encoding, and that breaks with the reported traceback if there are any non-ascii characters.

The values in the broken line are roughly:

>>> _fmt = '%(asctime)s [%(name)s] %(levelname)s: %(message)s'
>>> record = {
    'asctime': '2015-06-02 12:17:01+0000',
    'created': 1433258457.911135,
    'exc_info': None,
    'exc_text': None,
    'filename': 'log.py',
    'funcName': 'log_scrapy_info',
    'levelname': 'INFO',
    'levelno': 20,
    'lineno': 108,
    'module': log,
    'msecs': 911.1349582672119,
    'message': u'Scrapy 1.1.0rc1 started (bot: imzx)',
    'name': 'scrapy',
    'pathname': 'scrapy/utils/log.py',
    'process': 19931,
    'processName': 'MainProcess',
    'relativeCreated': 66988981.72998428,
    'thread': 139932245554944,
    'threadName': 'MainThread'}
}
>>> s = _fmt % record

Since 'message' is unicode, all 'asctime', 'name' and 'levelname' (and only those) in the record dictionary are going to be casted to unicode using the ascii encoding while formatting them into _fmt, which is going to be casted to unicode too. If any of those has non-ascii chars, the conversion is going to break with the UnicodeDecodeError.

Problem is, I don't know how any of those could hold non-ascii characters. The default 'asctime' format is '%Y-%m-%d %H:%M:%S%z', those placeholders format into numbers regardless locale. All these values are bytestrings handled by us, which are encoded by the default file encoding (ascii in python2) unless stated otherwise at the beginning of the file.

Something that I tested to trigger these errors was to install scrapy in a path with non-ascii characters, and then change the default format with the LOG_FORMAT setting to '[%(pathname)s] %(message)s'. Doing this the logging breaks every time a unicode message is logged, but this is something expected of the python logging since it's using bytestrings internally to store the pathname.

@binbibi, can I ask you to run the following commands (notice the leading u in both messages, both calls are logging unicode) and see if any of these calls throws errors? I'm really sorry for the troubles.

>>> import logging
>>> logging.basicConfig(format='%(asctime)s [%(name)s] %(levelname)s: %(message)s')
>>> logging.warning(u'First log message')
>>> import logging
>>> from scrapy.utils.log import configure_logging
>>> configure_logging({})
>>> logger = logging.getLogger('scrapy')
>>> logger.debug(u'Second message')

@binbibi
Copy link
Author

binbibi commented Jun 3, 2015

second command error again,I run the command in ipython,you can see the result:

In [1]: import logging

In [2]: logging.basicConfig(format='%(asctime)s [%(name)s] %(levelname)s: %(message)s')

In [3]: logging.warning(u'First log message')
2015-06-03 12:40:35,532 [root] WARNING: First log message

In [4]: import logging

In [5]: from scrapy.utils.log import configure_logging

In [6]: configure_logging({})

In [7]: logger = logging.getLogger('scrapy')

In [8]: logger.debug(u'Second message')
2015-06-03 12:41:34,206 [scrapy] DEBUG: Second message
Traceback (most recent call last):
  File "C:\Python27\lib\logging\__init__.py", line 859, in emit
    msg = self.format(record)
  File "C:\Python27\lib\logging\__init__.py", line 732, in format
    return fmt.format(record)
  File "C:\Python27\lib\logging\__init__.py", line 474, in format
    s = self._fmt % record.__dict__
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd6 in position 19: ordinal not in range(128)
Logged from file <ipython-input-8-86121427498f>, line 1

@curita
Copy link
Member

curita commented Jun 3, 2015

I really appreciate your help! Guess it truly is an issue concerning Scrapy alone, configure_logging seems the culprit of setting something wrongly. I probably missed something while debugging different system locales, I'll try to reproduce the issue again.

I'd really like to inspect that record.__dict__ to see what is the offending string. Given there aren't that many changes in configure_logging I have a hunch that it could be the actual LOG_FORMAT (or LOG_DATEFORMAT) from the default settings that is stored in a non-ascii encoding for some reason (If this is the reason I guess we could enforce the encoding at the top of scrapy/settings/default_settings.py with a # -*- coding: utf-8 -*- line, but I don't know why it's using something different than the default ascii). @binbibi can you run these commands and see if they throw any UnicodeDecodeErrors?

>>> from scrapy.settings.default_settings import LOG_FORMAT
>>> LOG_FORMAT.decode('ascii')

@binbibi
Copy link
Author

binbibi commented Jun 3, 2015

I debug the code ,

__author__ = 'Binbibi'
import logging
from scrapy.utils.log import configure_logging
configure_logging({})
logger = logging.getLogger('scrapy')
logger.debug(u'Second message')

you are write ,the LOG_DATEFORMAT cause error,the timeformat

"%Y-%m-%d %H:%M:%S%z"

The bug is happen when you add "%z",python logging use formatTime formatTime to handle time,but
if datefmt (a string) is specified, it is used with time.strftime() to format the creation time of the record,and from the doc time.strftime strftime() returns a locale depedent byte string; the result may be converted to unicode by doing strftime().decode(locale.getlocale()[1]).here is the python formatTime source code:

def formatTime(self, record, datefmt=None):
    ct = self.converter(record.created)
    if datefmt:
        s = time.strftime(datefmt, ct)
    else:
        t = time.strftime("%Y-%m-%d %H:%M:%S", ct)
        s = "%s,%03d" % (t, record.msecs)
    return s

when scrapy use "%Y-%m-%d %H:%M:%S%z" timeformat,the %Z is Time zone name,I come from
china , so I think s is a gbk string,I hack the formatTime source code,by

s = time.strftime(datefmt, ct).decode('gbk')

Then I find everything is Ok,you can see the output:

2015-06-04 01:26:02中国标准时间 [scrapy] DEBUG: Second message

and my project imzx also work fine without logging trackback,so I think ,we can simply delete the "%z" in LOG_DATEFORMAT so scrapy use the default logging.py timeformat "%Y-%m-%d %H:%M:%S",otherwise ,hack the logging source code,I'm not sure I'm right,so you can try test again,just chinese development environment,I think you can reproduce it ;

@curita
Copy link
Member

curita commented Jun 3, 2015

Ohh, the timezone was the catch! Thanks a lot for debugging it, I couldn't reproduce it in the linux I'm using since the timezone name is stored in english regardless system language (I got 'CST' there even with chinese locale), seems it's something platform specific.

I still have a doubt, why is '%z' being formatted as '%Z', or was that an example? The first one should return the offset from UTC (something like '+0800') and the second one the timezone name (something like '中国标准时间' or 'CST'). I would have hoped that '%z' contained only ascii characters. Running time.strftime('%z').decode('ascii') (lower z) raises UnicodeError? Guess windows handles it differently in that case.

In any case, I think we should remove '%z' if it brings this kind of issues, seems like if was deprecated too: https://docs.python.org/2/library/time.html#id2. I'll create a pull request so we can review that change.

@binbibi
Copy link
Author

binbibi commented Jun 4, 2015

thanks for the details about "%z" and "%Z",In my VM Ubuntu,It work the way as you point,But in windows,"%z" Is equivalent to "%Z",you can see the result:

In [28]: import time

In [29]: print time.strftime('%z').decode('gbk')
中国标准时间

In [30]: print time.strftime('%Z').decode('gbk')
中国标准时间

I don't know why

@curita
Copy link
Member

curita commented Jun 4, 2015

Windows seems to implement it differently, probably it just fallbacks to '%Z'. Looks like '%z' is not officially supported and this case shows us it's not reliable, so I think our best bet is to remove it from LOG_DATEFORMAT.

Thanks again @binbibi for such dedicated report, you'll see the fix for it in the next release candidate :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants