Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Logging - Inconsistent behaviour when handling unicode #81292

Closed
jonathan-lp mannequin opened this issue May 31, 2019 · 20 comments
Closed

Logging - Inconsistent behaviour when handling unicode #81292

jonathan-lp mannequin opened this issue May 31, 2019 · 20 comments
Labels
3.9 only security fixes OS-windows type-feature A feature request or enhancement

Comments

@jonathan-lp
Copy link
Mannequin

jonathan-lp mannequin commented May 31, 2019

BPO 37111
Nosy @pfmoore, @vsajip, @tjguk, @methane, @zware, @zooba
PRs
  • bpo-37111: Added encoding keyword parameter to basicConfig. #13932
  • bpo-37111: Add 'encoding' and 'errors' parameters to logging.basicCon… #14008
  • Files
  • my_log.log: File that won't take the unicode
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = None
    closed_at = <Date 2019-06-17.16:43:59.796>
    created_at = <Date 2019-05-31.09:57:34.129>
    labels = ['type-feature', '3.9', 'OS-windows']
    title = 'Logging - Inconsistent behaviour when handling unicode'
    updated_at = <Date 2019-06-17.16:43:59.608>
    user = 'https://bugs.python.org/jonathan-lp'

    bugs.python.org fields:

    activity = <Date 2019-06-17.16:43:59.608>
    actor = 'vinay.sajip'
    assignee = 'none'
    closed = True
    closed_date = <Date 2019-06-17.16:43:59.796>
    closer = 'vinay.sajip'
    components = ['Windows']
    creation = <Date 2019-05-31.09:57:34.129>
    creator = 'jonathan-lp'
    dependencies = []
    files = ['48380']
    hgrepos = []
    issue_num = 37111
    keywords = ['patch']
    message_count = 20.0
    messages = ['344053', '344059', '344060', '344081', '344106', '344110', '344113', '344164', '344198', '344203', '344204', '344208', '344212', '344456', '345319', '345413', '345430', '345432', '345523', '345877']
    nosy_count = 8.0
    nosy_names = ['paul.moore', 'vinay.sajip', 'tim.golden', 'methane', 'SilentGhost', 'zach.ware', 'steve.dower', 'jonathan-lp']
    pr_nums = ['13932', '14008']
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'enhancement'
    url = 'https://bugs.python.org/issue37111'
    versions = ['Python 3.9']

    @jonathan-lp
    Copy link
    Mannequin Author

    jonathan-lp mannequin commented May 31, 2019

    Python is inconsistent in how it handles errors that have some unicode characters. It works to screen but fails to log

    This works:

    >>> import logging
    >>> logging.error('จุด1')
    ERROR:root:จุด1
    

    The following breaks:

    >>> import logging
    >>> logging.basicConfig(filename='c:\\my_log.log')
    >>> logging.error('จุด1')
    

    This raises a unicode error:
    UnicodeEncodeError: 'charmap' codec can't encode characters in position 11-13: character maps to <undefined>

    Python 3.6.3

    Given that the file created by the logger is utf-8, it's unclear why it doesn't work.

    I found a workaround by using a Handler, but surely the loggers should all work the same way so that people don't get unpleasant surprises that even more painful to debug when things only break in certain logging modes?

    @SilentGhost
    Copy link
    Mannequin

    SilentGhost mannequin commented May 31, 2019

    Given that the file created by the logger is utf-8

    I don't think this is true. Default file handler created using filename argument is opened with encoding None (meaning that it relies on system-default encoding, which is on Windows is not utf-8). And it also explains why your work-around works, provided you've opened file handler with explicit utf-8 encoding.

    @SilentGhost SilentGhost mannequin added 3.7 (EOL) end of life 3.8 (EOL) end of life labels May 31, 2019
    @jonathan-lp
    Copy link
    Mannequin Author

    jonathan-lp mannequin commented May 31, 2019

    It definitely claims to be "utf-8" in NotePad++. I've attached it if you want to double-check.

    (Windows 7)

    @SilentGhost SilentGhost mannequin added OS-windows type-bug An unexpected behavior, bug, or error labels May 31, 2019
    @vsajip
    Copy link
    Member

    vsajip commented May 31, 2019

    Given that the file created by the logger is utf-8, it's unclear why it doesn't work ... I found a workaround by using a Handler

    Loggers don't create files - handlers do. The file that you attached seems to be just a text file containing ASCII text. Any ASCII is also utf-8, so Notepad++'s assertion doesn't mean anything in this situation.

    so that people don't get unpleasant surprises that even more painful to debug when things only break in certain logging modes?

    People have been using logging, on Windows, without problems, for years, often using utf-8 to encode their log files. Perhaps you need to read the documentation and look at examples more carefully - I'm not intending to be rude, but say that because using a handler isn't some kind of workaround, and you thinking that it was indicates that you're not sufficiently familiar with the documentation.

    I'd suggest posting questions on Stack Overflow or the python-list mailing list, to establish whether there really seems to be a Python bug, before actually logging a Python issue. I appreciate that you're trying to improve Python.

    @vsajip vsajip closed this as completed May 31, 2019
    @vsajip vsajip added the invalid label May 31, 2019
    @jonathan-lp
    Copy link
    Mannequin Author

    jonathan-lp mannequin commented May 31, 2019

    Thank you for your comments but this wasn't a question and I maintain this is a bug, or at least undesirable behaviour. I'd consider it a bug in my own software.

    Reasoning:

    • It's an inconsistent default with the logging to screen. This causes more complexity for users when their bug is intermittent.
    • Despite your assertion, it's not documented anywhere on the logging docs (I did check before creating this bug when trying to figure out what's going on) - The word "utf" or "unicode" doesn't appear on the logging page, or any of the two tutorials, or the logging.handlers page. There's something in the cookbook but that's about BOM's.
    • Much of the world's native characters won't log to ASCII

    Per this page: https://docs.python.org/3/howto/unicode.html
    "UTF-8 is one of the most commonly used encodings, and Python often defaults to using it."

    People have been using logging, on Windows, without problems, for years, often using utf-8 to encode their log files.
    I'm afraid this line of reasoning is suffering from selection bias, cherry picking, confirmation bias, and probably some others too. Clearly people have had problems before because it was from one of those folks I took the solution.

    Doing something as basic as logging unicode shouldn't require knowledge of "handlers" - that's failing "simple is better than complex".

    @vsajip
    Copy link
    Member

    vsajip commented May 31, 2019

    Doing something as basic as logging unicode shouldn't require knowledge of "handlers" - that's failing "simple is better than complex".

    I reiterate my statement that it appears that you aren't sufficiently familiar with how logging is designed to work. Logging is more complex than print() or open() because it has to work flexibly in lots of different scenarios. Handlers are part of this, as are Formatters and Filters.

    See my 2009 blog post

    https://plumberjack.blogspot.com/2009/09/python-logging-101.html

    for more information for why logging is designed the way it is.

    If you want to configure a stream with utf-8 encoding, without explicitly using handlers, this is easily doable, as follows:

    C:\Users\Vinay> \python37\python
    Python 3.7.3 (v3.7.3:ef4ec6ed12, Mar 25 2019, 22:22:05) [MSC v.1916 64 bit (AMD64)] on win32
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import logging
    >>> logging.basicConfig(stream=open(r'c:\temp\my_log.log', 'w', encoding='utf-8'))
    >>> logging.error('จุด1')
    >>> ^Z

    and, lo and behold, the file c:\temp\my_log.log contains

    ERROR:root:จุด1

    The use of the stream keyword parameter is clearly documented at

    https://docs.python.org/3/library/logging.html#logging.basicConfig

    Did you look at the basicConfig documentation before raising this issue?

    @jonathan-lp
    Copy link
    Mannequin Author

    jonathan-lp mannequin commented May 31, 2019

    Did you look at the basicConfig documentation before raising this issue?

    This seems like an attempt at victim blaming. But yes, I did. In fact, this is now the third time I've looked at that page - once before raising this issue, once before my previous reply, and now. I note that your example and nothing like your example is anywhere on that page. The word "encoding" doesn't appear anywhere on the page either. Sure "stream" is on there, but then you need to know about streams and make the association with logging which I apparently don't. You have to remember not everyone has your level of proficiency in the language. In fact, most Python users don't.

    Lets put this another way - is there a reason NOT to have Unicode logging as the default? Clearly Unicode was important enough for Guido-et-al to decide to throw Python 2 under the bus.
    I've listed the advantages of changing it, what are the disadvantages?

    @vsajip
    Copy link
    Member

    vsajip commented Jun 1, 2019

    This seems like an attempt at victim blaming.

    Er, no, it was a straight question about whether you'd read the documentation.

    I'm afraid this line of reasoning is suffering from selection bias, cherry picking, confirmation bias, and probably some others too.

    Er, no, it was stating that lots of people have successfully used logging on Windows with utf-8, as a way of justifying my opinion that there's no deficiency here that needs addressing.

    Sure "stream" is on there, but then you need to know about streams and make the association with logging which I apparently don't. You have to remember not everyone has your level of proficiency in the language. In fact, most Python users don't.

    Sure, and that's fine. Experienced developers learn stuff every day too. But responding to issues takes time (notice how many people will have been notified about this issue from the "nosy list" above; each would have to spend at least some time looking at it), so raising issues should be done only after exhausting other resources to make sure that the problem isn't in one's own level of knowledge or understanding, but an actual problem with the software. This is why I earlier suggested Stack Overflow and python-list as useful resources for improving one's knowledge and understanding.

    Lets put this another way - is there a reason NOT to have Unicode logging as the default?

    I have no idea what you mean by this. As my example illustrated, it's quite easy to log Unicode in log files.

    A simple Internet search for "basicConfig encoding" yields for me as the second result this Stack Overflow question

    https://stackoverflow.com/questions/21402530

    for which the accepted answer is a slight variation on what I suggested earlier.

    @jonathan-lp
    Copy link
    Mannequin Author

    jonathan-lp mannequin commented Jun 1, 2019

    I have no idea what you mean by this.

    I don't see how I can be clearer. What are the reasons for NOT implementing logging to file be unicode as a default? Logging to screen is unicode as a default. What are the reasons for not wanting consistency?

    A simple Internet search for "basicConfig encoding" yields for me as the second result this Stack Overflow question

    Indeed, and it was from that question I got my solution in fact. The problem was the 30-60 minutes I wasted before that trying to figure out why my program was crashing and why it was only crashing *sometimes*. I'd written the logging part of the program a year ago and not really touched it since, so the logging module being a possible culprit was not even in my mind when the program crashed.

    As my example illustrated, it's quite easy to log Unicode in log files.

    Yes, things are easy when you know it's necessary. It's the process of discovery that's an unnecessary waste of people's time. That's why I raised this and that's why I would consider this a bug in my own software. It's inconsistent, it invites problems, and it wastes peoples time.

    @vsajip
    Copy link
    Member

    vsajip commented Jun 1, 2019

    Learning is not a waste of time. You're entitled to your opinion, but this is not a bug in logging. We'll have to agree to disagree.

    @jonathan-lp
    Copy link
    Mannequin Author

    jonathan-lp mannequin commented Jun 1, 2019

    Learning is not a waste of time. You're entitled to your opinion, but this is not a bug in logging. We'll have to agree to disagree.

    I agree and value learning a great deal. However learning should happen on your own time, NOT when a program crashes randomly and tries taking you down the rabbit hole. I like learning but not about unrelated things when I'm trying to do useful work.

    Fine, if you don't consider this a bug, consider it a feature request. "User would like Python logging of Unicode characters to be consistent" is not an unreasonable request.

    @jonathan-lp jonathan-lp mannequin reopened this Jun 1, 2019
    @jonathan-lp jonathan-lp mannequin added type-feature A feature request or enhancement and removed type-bug An unexpected behavior, bug, or error labels Jun 1, 2019
    @vsajip
    Copy link
    Member

    vsajip commented Jun 1, 2019

    Different people have different ideas of what a default should be. You can pretty much guarantee that whatever a default is, someone will think it should be something else. The basicConfig functionality has been around in its present form since 2003, and people in general haven't had a problem with the current defaults. There is no case for changing a default unless there is a consensus among lots of people that a default needs changing.

    @pfmoore
    Copy link
    Member

    pfmoore commented Jun 1, 2019

    User would like Python logging of Unicode characters to be consistent

    It is consistent. The encoding of

        logging.basicConfig(filename='c:\\my_log.log')

    is consistent with the encoding of

        open('c:\\my_log.log')

    Both use the system default encoding, which is not UTF-8. There is some discussion going on right now, as to whether it *should* be, but it isn't, and I wouldn't consider changing the behaviour of loging *without* changing the behaviour of open() to be consistent.

    Logging to the console is consistent with the standard IO streams, and it was PEP-528 (https://www.python.org/dev/peps/pep-0528/) that made that change - before that, the standard IO streams, and logging to the console, used the console codepage.

    So, while I agree that the behaviour takes a bit of work to understand, it's not (IMO) inconsistent, nor is it (IMO) a bug.

    @zooba
    Copy link
    Member

    zooba commented Jun 3, 2019

    I agree with Paul. Consistency with open() is the most important aspect here, and when we investigated changing that to UTF-8 it was decided to be too disruptive.

    That said, I'm not opposed to adding encoding/errors to basicConfig if someone wants to do that (and the logging specialists are okay with it).

    @zooba zooba removed the invalid label Jun 3, 2019
    @methane
    Copy link
    Member

    methane commented Jun 12, 2019

    Hmm,, about encoding, I agree that default encoding of open() should be used.
    If we change it, encoding of log files are changed unexpectedly after upgrading Python.

    On the other hand, couldn't we use different default error handler?
    "replace" or "backslashescape" seems better default error handler for the logging.
    Is this change affects too many users?

    @vsajip
    Copy link
    Member

    vsajip commented Jun 12, 2019

    On the other hand, couldn't we use different default error handler?
    "replace" or "backslashescape" seems better default error handler for the logging.

    I've left it as is using the same rationale as I guess open() has at the moment - "Errors should never pass silently. Unless explicitly silenced".

    @methane
    Copy link
    Member

    methane commented Jun 12, 2019

    I've left it as is using the same rationale as I guess open() has at the moment - "Errors should never pass silently. Unless explicitly silenced".

    But open() is general purpose, and logging is for logging.

    Note that stderr uses 'backslashreplace', even when PYTHONIOENCODING=ascii:strict is used.

    And if we don't log something due to encoding error, doesn't it mean "errors may ignored completely silently by default"?
    What logging does when UnicodeEncodeError is happened?

    @methane
    Copy link
    Member

    methane commented Jun 13, 2019

    What logging does when UnicodeEncodeError is happened?

    Hm, it seems logging does best job, show enough information to stderr.

    But sometime, Python is embedded in web servers and there is no stderr. pythonw.exe may not have stderr too....

    I still feel backslashreplace is better default error handler. But it's not strong opinion and it shouldn't block this issue. Go ahead.

    $ cat x.py
    import logging
    logging.basicConfig(filename="mylog.txt")
    msg = "hello, はろー"
    logging.error(msg)
    $ LC_ALL=en_US.US-ASCII python3 x.py
    --- Logging error 

    Traceback (most recent call last):
      File "/usr/local/Cellar/python/3.7.3/Frameworks/Python.framework/Versions/3.7/lib/python3.7/logging/__init__.py", line 1037, in emit
        stream.write(msg + self.terminator)
    UnicodeEncodeError: 'ascii' codec can't encode characters in position 18-20: ordinal not in range(128)
    Call stack:
      File "x.py", line 5, in <module>
        logging.error(msg)
    Message: 'hello, \u306f\u308d\u30fc'
    Arguments: ()

    @zooba
    Copy link
    Member

    zooba commented Jun 13, 2019

    I think Vinay gets to make the call, but I also agree that some sort of error handling by default makes sense for logging. Especially now that it's (about to be) easier to override the default.

    FWIW, I'd leave the default arguments as None and just coerce it as late as possible, before TextIOWrapper does it.

    @vsajip
    Copy link
    Member

    vsajip commented Jun 17, 2019

    New changeset ca7b504 by Vinay Sajip in branch 'master':
    bpo-37111: Add 'encoding' and 'errors' parameters to logging.basicCon… (GH-14008)
    ca7b504

    @vsajip vsajip added 3.9 only security fixes and removed 3.7 (EOL) end of life 3.8 (EOL) end of life labels Jun 17, 2019
    @vsajip vsajip closed this as completed Jun 17, 2019
    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    3.9 only security fixes OS-windows type-feature A feature request or enhancement
    Projects
    None yet
    Development

    No branches or pull requests

    4 participants