Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

locale.format bug if thousand separator is space (french separator as example) #45563

Closed
edlm10 mannequin opened this issue Sep 30, 2007 · 23 comments
Closed

locale.format bug if thousand separator is space (french separator as example) #45563

edlm10 mannequin opened this issue Sep 30, 2007 · 23 comments
Assignees
Labels
stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error

Comments

@edlm10
Copy link
Mannequin

edlm10 mannequin commented Sep 30, 2007

BPO 1222
Nosy @loewis, @birkenfeld, @mdickinson, @pitrou, @ericvsmith
Files
  • thousands_sep.patch
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = 'https://github.com/loewis'
    closed_at = <Date 2009-03-14.00:16:26.256>
    created_at = <Date 2007-09-30.14:15:46.278>
    labels = ['type-bug', 'library']
    title = 'locale.format bug if thousand separator is space (french\tseparator as example)'
    updated_at = <Date 2009-03-18.21:30:00.939>
    user = 'https://bugs.python.org/edlm10'

    bugs.python.org fields:

    activity = <Date 2009-03-18.21:30:00.939>
    actor = 'eric.smith'
    assignee = 'loewis'
    closed = True
    closed_date = <Date 2009-03-14.00:16:26.256>
    closer = 'pitrou'
    components = ['Library (Lib)']
    creation = <Date 2007-09-30.14:15:46.278>
    creator = 'edlm10'
    dependencies = []
    files = ['10984']
    hgrepos = []
    issue_num = 1222
    keywords = ['patch']
    message_count = 23.0
    messages = ['56200', '59829', '60084', '60093', '60136', '61929', '61956', '65558', '65582', '65912', '65914', '70173', '70272', '70274', '70275', '75854', '83549', '83753', '83757', '83758', '83761', '83769', '83779']
    nosy_count = 8.0
    nosy_names = ['loewis', 'georg.brandl', 'mark.dickinson', 'pitrou', 'eric.smith', 'edlm10', 'waveform', 'wdoekes']
    pr_nums = []
    priority = 'normal'
    resolution = 'fixed'
    stage = None
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue1222'
    versions = ['Python 2.6']

    @edlm10 edlm10 mannequin added stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error labels Sep 30, 2007
    @edlm10
    Copy link
    Mannequin Author

    edlm10 mannequin commented Sep 30, 2007

    locale.format function delete spaces in result which is a problem when
    thousand separator is space (in French for example).

    The problem seems in the code below (extract from locale.py):

    145 while seps:
    146 # If the number was formatted for a specific width, then it
    147 # might have been filled with spaces to the left or right. If
    148 # so, kill as much spaces as there where separators.
    149 # Leading zeroes as fillers are not yet dealt with, as it is
    150 # not clear how they should interact with grouping.
    151 sp = result.find(" ")
    152 if sp==-1:break
    153 result = result[:sp]+result[sp+1:]
    154 seps -= 1

    Example :

    >>> import locale
    >>> locale.setlocale(locale.LC_NUMERIC)
    'C'
    >>> locale.setlocale(locale.LC_NUMERIC, "fr_FR.UTF-8")
    'fr_FR.UTF-8'
    >>> locale.format("%.2f", 12345.67, True)
    '12345,67'

    The correct result is '12 345,67' not '12345,67'

    and if I call

    >>> locale.format("%9.2f", 12345.67, True)
    '12 345,67'

    the result is correct

    Is this behavior correct or a bug?

    @pitrou
    Copy link
    Member

    pitrou commented Jan 12, 2008

    Just tried on 2.5.1 and on trunk, I can't reproduce it. On both I get:

    >>> import locale
    >>> locale.setlocale(locale.LC_NUMERIC, "fr_FR.UTF-8")
    'fr_FR.UTF-8'
    >>> locale.format("%9.2f", 12345.67, True)
    ' 12345,67'
    >>> locale.format("%.2f", 12345.67, True)
    '12345,67'
    >>> locale.localeconv()['thousands_sep']
    ''

    Which means that there is no thousands separator for the French locale
    on my system (a Mandriva Linux box).

    @pitrou
    Copy link
    Member

    pitrou commented Jan 18, 2008

    Ok, I manage to reproduce it on an Ubuntu box with both 2.5.2a0 and SVN
    trunk.
    IMO it's a bug since the expected behaviour would be to enforce the
    locale settings for number formatting. That's the whole purpose of
    locale.format() after all. (no to mention the discrepancy between %.2f
    and %9.2f)

    @pitrou
    Copy link
    Member

    pitrou commented Jan 18, 2008

    Eric, your diagnostic looks right, format() gets confused when it tries
    to remove padding characters to account for the added thousands
    separators. It does not check that there were padding characters in the
    first place, and it assumes that the thousands separator is not a space
    character itself.

    Since Georg Brandl and Martin von Loewis are listed as co-authors, I add
    them to the cc list ;)

    @birkenfeld
    Copy link
    Member

    The space removal was there before my change; assigning to Martin.

    @pitrou
    Copy link
    Member

    pitrou commented Jan 31, 2008

    Here is a patch. It is fairly big and I'm sorry about it, here is why:

    1. it reworks the test suite using unittest, and makes it more comprehensive
    2. it adds a feature in locale.py which allows overriding the localeconv
      dictionary (this is necessary for proper testing)
    3. it fixes the present bug, and also fixes integer formatting with
      grouping and padding (which was broken, as well as untested)

    Point 2 above may be controversial, it is just a dictionary (by default
    empty) named _override_localeconv, if you prefer we can make it a real
    API, e.g. _set_localeconv().

    @birkenfeld
    Copy link
    Member

    Looks good to me.

    @pitrou
    Copy link
    Member

    pitrou commented Apr 16, 2008

    Here is an updated patch against trunk.

    @pitrou
    Copy link
    Member

    pitrou commented Apr 17, 2008

    I see that http://wiki.python.org/moin/PythonBugDay says "need to test
    on non-MacOS platform". Actually, I've tested the patch under Mandriva
    and Debian Linux.

    @pitrou
    Copy link
    Member

    pitrou commented Apr 28, 2008

    I've had the opportunity to test on a Windows box and there are various
    failures in the TestStringMethods test case. If someone with more
    knowledge of the Windows world could take a lookm it would be nice.

    @pitrou
    Copy link
    Member

    pitrou commented Apr 28, 2008

    Ok, here is an updated patch for Windows compatibility of the test suite.

    @pitrou
    Copy link
    Member

    pitrou commented Jul 23, 2008

    Would someone object to committing this before beta3?
    For clarity I would first commit the rewrite of test_locale to use
    unittest, and then the fix for the thousands separator bug.

    @ericvsmith
    Copy link
    Member

    I'd recommend breaking the patch into the 3 parts mentioned in msg61929,
    then committing the first 2 parts. That should make the 3rd part much
    easier to evaluate.

    @pitrou
    Copy link
    Member

    pitrou commented Jul 25, 2008

    The first 2 parts have been committed in r65237. I'll soon provide a
    patch for the third part.

    @pitrou
    Copy link
    Member

    pitrou commented Jul 25, 2008

    Here is the patch for the third part.

    @wdoekes
    Copy link
    Mannequin

    wdoekes mannequin commented Nov 14, 2008

    @antoine Pitrou

    Regarding "# XXX is the trailing space a bug?": I'm inclined to believe
    that it is. Man 7 locale does not mention that p_sep_by_space should be
    used for non-int currency symbols, nor that it shouldn't. However, it
    does say:

    """ char *int_curr_symbol; /* First three chars are a currency symbol
    from ISO 4217. Fourth char is the separator. Fifth char is ’ ’. */ """

    I haven't seen that fifth character, and in the libc sources I can't
    find one either:

    glibc-2.7/localedata/locales/nl_NL:66:int_curr_symbol
    "<U0045><U0055><U0052><U0020>"

    I do however see a separator.

    In libc documentation I found here (
    http://www.chemie.fu-berlin.de/chemnet/use/info/libc/libc_19.html#SEC325
    ), it says the following:

    """ In other words, treat all nonzero values alike in these members.
    These members apply only to currency_symbol. When you use
    int_curr_symbol, you never print an additional space, because
    int_curr_symbol itself contains the appropriate separator. The POSIX
    standard says that these two members apply to the int_curr_symbol as
    well as the currency_symbol. But an example in the ANSI C standard
    clearly implies that they should apply only to the
    currency_symbol---that the int_curr_symbol contains any appropriate
    separator, so you should never print an additional space. Based on what
    we know now, we recommend you ignore these members when printing
    international currency symbols, and print no extra space. """

    This is probably not right either, because this forces you to use an
    n_sign_posn and p_sign_posn that have the symbol on the same side of the
    value. (Although that might not be such an awful assumption.) And, more
    importantly, a grep through the sources reveal that no language has a
    preceding space. (But they do, I assume, have *_sign_posn's that want a
    trailing symbol.)

    """ glibc-2.7/localedata/locales$ grep ^int_curr_symbol * | grep -vE
    '(<U0020>| )"' | wc -l
    0 """

    That leaves us with two more options. Option three: the fourth character
    is optional and defines what the separator is but not _where_ it should
    be. I.e. you might have to move it according to what *_sign_posn says.

    And finally, option four: international formatting should ignore all of
    *_cs_precedes, *_sep_by_space and *_sign_posn. Locale(7) explicitly
    mentions currency_symbol, not int_cur_symbol. Perhaps one should assume
    that international notation is universal. (I would guess that most
    common is:
    <int_curr_symbol><space><optional_minus><num><mon_thousands_sep><num><mon_decimal_point><num>)

    Personally I would go with option three. It has the least impact on
    formatting because it only cleans up spaces. I'm guessing however that
    option four is the Right One.

    @pitrou
    Copy link
    Member

    pitrou commented Mar 14, 2009

    Committed in r70356, r70357, r70358, r70359. I didn't try to change
    anything about the trailing space since it's minor anyway.

    @pitrou pitrou closed this as completed Mar 14, 2009
    @mdickinson
    Copy link
    Member

    r70357 seems to have caused the sparc solaris py3k buildbot to start failing on
    test_locale:

    ======================================================================
    FAIL: test_integer_grouping_and_padding (test.test_locale.TestNumberFormatting)
    ----------------------------------------------------------------------

    Traceback (most recent call last):
      File "/home2/buildbot/slave/3.x.loewis-sun/build/Lib/test/test_locale.py", line 181, 
    in test_integer_grouping_and_padding
        out=('4%s200' % self.sep).rjust(10))
      File "/home2/buildbot/slave/3.x.loewis-sun/build/Lib/test/test_locale.py", line 143, 
    in _test_format
        func=locale.format, **format_opts)
      File "/home2/buildbot/slave/3.x.loewis-sun/build/Lib/test/test_locale.py", line 139, 
    in _test_formatfunc
        func(format, value, **format_opts), out)
    AssertionError: '     4200' != '      4200'

    Any idea what's going on here?

    @pitrou
    Copy link
    Member

    pitrou commented Mar 18, 2009

    Hi Mark,

    AssertionError: ' 4200' != ' 4200'

    Any idea what's going on here?

    The problem seems to be that the thousands separator on the Solaris
    variant of en_US is an empty string (rather than a comma) (*), and
    apparently it hits a bug in the padding mechanism (which perhaps assumes
    that the thousands separator is always a 1-character string).

    (*) (another reason not to use any C locale-based mechanism for
    localization, by the way...)

    @pitrou
    Copy link
    Member

    pitrou commented Mar 18, 2009

    By the way, Martin, do you have any idea why the Solaris buildbot
    doesn't seem to be enabled on trunk?

    @pitrou
    Copy link
    Member

    pitrou commented Mar 18, 2009

    Trying a fix in r70458.

    @mdickinson
    Copy link
    Member

    It looks like that did the trick. Thank you!

    @ericvsmith
    Copy link
    Member

    Antoine Pitrou wrote:

    The problem seems to be that the thousands separator on the Solaris
    variant of en_US is an empty string (rather than a comma) (*), and
    apparently it hits a bug in the padding mechanism (which perhaps assumes
    that the thousands separator is always a 1-character string).

    (*) (another reason not to use any C locale-based mechanism for
    localization, by the way...)

    I've come to believe this, too. I'm working on cleaning up the C
    implementations so I can do all of the locale-based formating without
    using the locale functions. I'll use the localeconv values, but that's it.

    @ericvsmith ericvsmith changed the title locale.format bug if thousand separator is space (french separator as example) locale.format bug if thousand separator is space (french separator as example) Mar 18, 2009
    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error
    Projects
    None yet
    Development

    No branches or pull requests

    4 participants