
Force BOM option in UTF output. #45669

Closed
jgsack mannequin opened this issue Oct 25, 2007 · 19 comments
Assignees
Labels
topic-unicode type-feature A feature request or enhancement

Comments

jgsack mannequin commented Oct 25, 2007

BPO 1328
Nosy @doerwalter

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


GitHub fields:

assignee = 'https://github.com/doerwalter'
closed_at = <Date 2008-03-22.14:42:01.782>
created_at = <Date 2007-10-25.22:59:35.971>
labels = ['type-feature', 'expert-unicode']
title = 'Force BOM option in UTF output.'
updated_at = <Date 2008-03-22.14:44:52.747>
user = 'https://bugs.python.org/jgsack'

bugs.python.org fields:

activity = <Date 2008-03-22.14:44:52.747>
actor = 'doerwalter'
assignee = 'doerwalter'
closed = True
closed_date = <Date 2008-03-22.14:42:01.782>
closer = 'doerwalter'
components = ['Unicode']
creation = <Date 2007-10-25.22:59:35.971>
creator = 'jgsack'
dependencies = []
files = []
hgrepos = []
issue_num = 1328
keywords = []
message_count = 19.0
messages = ['56759', '56780', '56782', '56801', '56813', '56814', '56817', '57028', '57033', '57041', '57522', '57527', '57529', '57691', '63705', '64189', '64217', '64324', '64325']
nosy_count = 5.0
nosy_names = ['doerwalter', 'jafo', 'jgsack', 'ggenellina', 'Rhamphoryncus']
pr_nums = []
priority = 'normal'
resolution = 'wont fix'
stage = None
status = 'closed'
superseder = None
type = 'enhancement'
url = 'https://bugs.python.org/issue1328'
versions = ['Python 2.6', 'Python 2.5']

jgsack mannequin commented Oct 25, 2007

The behavior of codecs utf_16_[bl]e is to omit the BOM.

In a testing environment (and perhaps elsewhere), a forced BOM is useful.
I'm requesting an optional argument, something like
    force_BOM=False

I guess it would require such an option in multiple function calls; sorry, I
don't know enough to itemize them.

If this is implemented, it might be desirable to think about the aliases
like unicode*unmarked.

Regards,
..jim
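For concreteness, the behavior being requested against can be checked directly (a Python 3 sketch; the thread itself predates Python 3, but the codec names are the same):

```python
# Quick check: the byte-order-specific UTF-16 codecs omit the BOM,
# while plain utf-16 prepends one in native byte order.
import codecs

with_bom = "abc".encode("utf-16")      # BOM prepended, native endianness
no_bom_le = "abc".encode("utf-16-le")  # explicit order, no BOM
no_bom_be = "abc".encode("utf-16-be")  # explicit order, no BOM

assert with_bom[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)
assert no_bom_le == b"a\x00b\x00c\x00"
assert no_bom_be == b"\x00a\x00b\x00c"
```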

@jgsack jgsack mannequin added type-bug An unexpected behavior, bug, or error topic-unicode labels Oct 25, 2007
jgsack mannequin commented Oct 26, 2007

Feature Request REVISION
========================
Upon reflection and more playing around with some test cases, I wish to
revise my feature request.

I think the utf8 codecs should accept input with or without the "sig".
On output, only the utf_8_sig should write the 3-byte "sig". This behavior
change would not seem disruptive to current applications.

For utf16, (arguably) a missing BOM should merely assume machine endianness.
For utf_16_le, utf_16_be input, both should accept & discard a BOM.
On output, I'm not sure; maybe all should write a BOM unless passed a flag
signifying no bom?
Or to preserve backward compat, could have a param write_bom defaulting to
True for utf16 and False for utf_16_le and utf_16_be. This is a
modification of the original request (for a force_bom flag).

Unless I have confused myself with my test cases, the current codecs are
slightly inconsistent for the utf8 codecs:

utf8 treats "sig" as real data, if present, but..
utf_8_sig works right even without the "sig" (so this one I like as is!)

The 16'ers seem to match the (inferred) specs, but for completeness here:
utf_16 refuses to proceed w/o BOM (even with correct endian input data)
utf_16_le treats BOM as data
utf_16_be treats BOM as data
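The utf8 observations can be reproduced directly (Python 3 sketch; utf_16 behavior without a BOM has varied across versions, so only the utf8 cases are shown):

```python
import codecs

data = codecs.BOM_UTF8 + "hi".encode("utf-8")  # b'\xef\xbb\xbfhi'

# utf8 treats the signature as real data (it decodes to U+FEFF)...
assert data.decode("utf-8") == "\ufeffhi"
# ...while utf_8_sig strips it, and also accepts input without it.
assert data.decode("utf-8-sig") == "hi"
assert b"hi".decode("utf-8-sig") == "hi"
```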

Regards,
..jim

jgsack mannequin commented Oct 26, 2007

Later note: kind of weird!

On my LE machine, utf16 reads my BE-formatted test data (no BOM),
apparently assuming some kind of surrogate format, until it finds
an "illegal UTF-16 surrogate".

That I fail to understand, especially since it quits upon seeing
a BOM with valid LE data.

Test data and test code available on request.

Regards,
..jim

gvanrossum (Member) commented:

Can't you force a BOM by simply writing \ufeff at the start of the file?
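A minimal Python 3 sketch of this workaround: write U+FEFF yourself and the byte-order-specific codec encodes it as the appropriate BOM bytes.

```python
import codecs
import io

buf = io.BytesIO()
writer = codecs.getwriter("utf-16-le")(buf)
writer.write("\ufeff")  # encoded as b"\xff\xfe", the LE BOM
writer.write("abc")

assert buf.getvalue() == codecs.BOM_UTF16_LE + b"a\x00b\x00c\x00"
```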

jgsack mannequin commented Oct 26, 2007

re: msg56782

Yes, of course I can explicitly write the BOM. I did realize that after
my first post ( my-'duh' :-[ ).

But after playing some more, I do think this issue has become a
worthwhile one. My second post msg56780 asks that utf_8 be tolerant
of the 3-byte sig BOM, and utf_16_[bl]e be tolerant of their BOMs,
which I argue is consistent with "be liberal on what you accept".

A second half of that message suggests that it might be worth
considering something like a write_bom parameter with utf_16
defaulting to True, and utf_16_[bl]e defaulting to False.

My third post (m56782) may actually represent a bug. I have a
unittest for this and would be glad to provide it (although I need
to reduce a larger test to a simple case). I will look at this
again, and re-pester you as required.

Regards (and thanks for the reply),
..jim

gvanrossum (Member) commented:

If you can, please submit a patch that fixes all those issues, with
unit tests and doc changes if at all possible. That will make it much
easier to evaluate the ramifications of your proposal(s).

jgsack mannequin commented Oct 26, 2007

OK, I will work on it. I have just downloaded trunk and will see what
I can do. Might be a week or two.

..jim

Rhamphoryncus mannequin commented Nov 1, 2007

The problem with "being tolerant" as you suggest is you lose the ability
to round-trip. Read in a file using the UTF-8 signature, write it back
out, and suddenly nothing else can open it.

Conceptually, these signatures shouldn't even be part of the encoding;
they're a prefix in the file indicating which encoding to use.

Note that the BOM signature (ZWNBSP) is a valid code point. Although it
seems unlikely for a file to start with ZWNBSP, if you were to chop a file
up into smaller chunks and decode them individually you'd be more likely
to run into it. (However, it seems general use of ZWNBSP is being
discouraged precisely due to this potential for confusion[1]).

In summary, guessing the encoding should never be the default. Although
it may be appropriate in some contexts, we must ensure we emit the right
encoding for those contexts as well. [2]

[1] http://unicode.org/faq/utf_bom.html#38
[2] http://unicode.org/faq/utf_bom.html#28

jgsack mannequin commented Nov 1, 2007

Adam Olsen wrote:

Adam Olsen added the comment:

The problem with "being tolerant" as you suggest is you lose the ability
to round-trip. Read in a file using the UTF-8 signature, write it back
out, and suddenly nothing else can open it.

I'm sorry, I don't see the round-trip problem you describe.

If codec utf_8 or utf_8_sig were to accept input with or without the
3-byte BOM, and write it as currently specified without/with the BOM
respectively, then _I_ can reread again with either utf_8 or utf_8_sig.

No round trip problem _for me_.

Now if I need to exchange with someone else, that's a different matter. One
way or another I need to know what format they need and create the
output they require for their input.

Am I missing something in your statement of a problem?

Conceptually, these signatures shouldn't even be part of the encoding;
they're a prefix in the file indicating which encoding to use.

Yes, I'm aware of that, but you can't predict what you may find in dusty
archives, or what someone may give to you. IMO, that's the basis of
being tolerant in what you accept, is it not?

Note that the BOM signature (ZWNBSP) is a valid code point. Although it
seems unlikely for a file to start with ZWNBSP, if you were to chop a file
up into smaller chunks and decode them individually you'd be more likely
to run into it. (However, it seems general use of ZWNBSP is being
discouraged precisely due to this potential for confusion[1]).

I understand that throwing away a ZWNBSP at the beginning of a file does
risk discarding data rather than metadata. I also believe the standards
people recognized that and deliberately picked a BOM character that is a
calculated low risk. I'm willing to accept that risk.

In summary, guessing the encoding should never be the default. Although
it may be appropriate in some contexts, we must ensure we emit the right
encoding for those contexts as well. [2]

[1] http://unicode.org/faq/utf_bom.html#38
[2] http://unicode.org/faq/utf_bom.html#28

From my point of view, I don't see that being tolerant in what _I_ (or
my applications) accept violates any guidelines.

Please explain where I am wrong.

Regards,
..jim

Rhamphoryncus mannequin commented Nov 1, 2007

On 11/1/07, James G. sack (jim) <report@bugs.python.org> wrote:

James G. sack (jim) added the comment:

Adam Olsen wrote:
> Adam Olsen added the comment:
>
> The problem with "being tolerant" as you suggest is you lose the ability
> to round-trip. Read in a file using the UTF-8 signature, write it back
> out, and suddenly nothing else can open it.

I'm sorry, I don't see the round-trip problem you describe.

If codec utf_8 or utf_8_sig were to accept input with or without the
3-byte BOM, and write it as currently specified without/with the BOM
respectively, then _I_ can reread again with either utf_8 or utf_8_sig.

No round trip problem _for me_.

Now if I need to exchange with someone else, that's a different matter. One
way or another I need to know what format they need and create the
output they require for their input.

Am I missing something in your statement of a problem?

You don't seem to think it's important to interact with other
programs. If you're importing with no intent to write out to a common
format, then yes, autodetecting the BOM is just fine. Python needs a
more general default though, and not guessing is part of that.

> Conceptually, these signatures shouldn't even be part of the encoding;
> they're a prefix in the file indicating which encoding to use.

Yes, I'm aware of that, but you can't predict what you may find in dusty
archives, or what someone may give to you. IMO, that's the basis of
being tolerant in what you accept, is it not?

Garbage in, garbage out. There's a lot of protocols with whitespace,
capitalization, etc that you can fudge around while retaining the same
contents; character set encodings aren't one of them.

jgsack mannequin commented Nov 15, 2007

re: msg57041, I'm sorry if I gave the wrong impression about interacting
with other programs. I started this feature request with some half-baked
thinking, which I tried to revise in my second post.

Anyway I'm most interested right now in lobbying for a change to utf_8 to
accept input with an _optional_ BOM-signature so that the input part would
behave just like utf_8_sig, where the BOM-sig is already optional (on
input).

In the process of trying to come up with a test and patch for this, I
discovered a bug in utf_8_sig (bpo-1444, http://bugs.python.org/issue1444).

After there is some action on that I will return here to continue with
utf_8, which I have convinced myself (anyways) is a reasonable and safe
revision.

..jim

doerwalter (Contributor) commented:

jgsack wrote:

If codec utf_8 or utf_8_sig were to accept input with or without the
3-byte BOM, and write it as currently specified without/with the BOM
respectively, then _I_ can reread again with either utf_8 or utf_8_sig.

That's exactly what the utf_8_sig codec does. The decoder accepts input
with or without the BOM (the (first) BOM doesn't get returned). The
encoder always prepends a BOM.

Or do you want a codec that behaves like utf_8 on reading and like
utf_8_sig on writing? Such a codec indeed wouldn't roundtrip.
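The utf_8_sig behavior described here can be seen directly (Python 3 sketch):

```python
import codecs

encoded = "data".encode("utf-8-sig")
assert encoded.startswith(codecs.BOM_UTF8)    # encoder always prepends a BOM
assert encoded.decode("utf-8-sig") == "data"  # decoder strips it, not returned

# The decoder also accepts input without the BOM:
assert "data".encode("utf-8").decode("utf-8-sig") == "data"
```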

doerwalter (Contributor) commented:

For utf16, (arguably) a missing BOM should merely assume machine
endianness.
For utf_16_le, utf_16_be input, both should accept & discard a BOM.
On output, I'm not sure; maybe all should write a BOM unless passed a flag
signifying no bom?
Or to preserve backward compat, could have a param write_bom defaulting to
True for utf16 and False for utf_16_le and utf_16_be. This is a
modification of the original request (for a force_bom flag).

The Unicode FAQ (http://unicode.org/faq/utf_bom.html#28) clearly states:

"""
Q: How I should deal with BOMs?
[...]
Where the precise type of the data stream is known (e.g. Unicode
big-endian or Unicode little-endian), the BOM should not be used. In
particular, whenever a data stream is declared to be UTF-16BE, UTF-16LE,
UTF-32BE or UTF-32LE a BOM *must* not be used. [...]

jgsack mannequin commented Nov 20, 2007

More discussion of utf_8.py decoding behavior (and possible change):

For my needs, I would like the decoding parts of the utf_8 module to treat
an initial BOM as an optional signature and skip it if there is one (just
like the utf_8_sig decoder). In fact I have a working patch that replaces
the utf_8 decode, IncrementalDecoder and StreamReader components by
direct transplants from utf_8_sig (as recently repaired -- there was a
StreamReader error).

However the reason for discussion is to ask how it might impact existing
code.

I can imagine there might be utf_8 client code out there which expects to
see a leading U+feff as (perhaps) a clue that the output should be returned
with a BOM-signature (say) to accommodate the guessed input requirements of
the remote correspondent.

Making my work easier might actually make someone else's work (probably,
annoyingly) harder.

So what to do?

I can just live with code like
    if input[0] == u"\ufeff":
        input = input[1:]
spread around, and of course slightly different for incremental and stream
inputs.

But I probably wouldn't. I would probably substitute a
"my_utf_8" encoding to make my code a little cleaner.

Another thought I had would require "the other guy" to update his code, but
at least it wouldn't make his work annoyingly difficult like my original
change might have.

Here's the basic outline:

  • Add another decoder function that returns a 3-tuple:
        decode3(input, errors='strict') => (data, consumed, had_bom)
    where had_bom is true if a leading BOM was seen and skipped

  • then the usual decode is just something like
        def decode(input, errors='strict'):
            return decode3(input, errors)[:2]

  • add a member variable and an accessor to both the IncrementalDecoder
    and StreamReader classes, something like
        def had_bom(self):
            return self._had_bom
    and initialize/set the self._had_bom variable as required.
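As a sketch only (decode3 and decode are the proposal's names, not an existing API), the module-level half of this outline might look like:

```python
# Hypothetical sketch of the proposed decode3(), built on the stdlib
# UTF-8 decoder; the names follow the proposal and are not a real API.
import codecs

def decode3(input, errors="strict"):
    """Decode UTF-8, skipping one leading BOM-signature if present.

    Returns (data, consumed, had_bom).
    """
    had_bom = input[:3] == codecs.BOM_UTF8
    if had_bom:
        input = input[3:]
    data, consumed = codecs.utf_8_decode(input, errors, True)
    return data, consumed + (3 if had_bom else 0), had_bom

def decode(input, errors="strict"):
    return decode3(input, errors)[:2]
```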

This complicates the interface somewhat and requires some additional
documentation.

To document my original simple[-minded] idea would require possibly only a
few more words in the existing paragraph on utf_8_sig, mentioning that both
modules have the same decoding behavior but different encoding.

I thought of a secondary consideration: if utf_8 and utf_8_sig are "almost
the same", it's possible that future refactoring might unify them, with the
differences contained in behavior flags (e.g., skip_leading_bom). The leading
BOM processing might even be pushed into codecs.utf_8_decode for possible
minor advantages.

Is there anybody monitoring this who has an opinion on this?

..jim

jafo mannequin commented Mar 17, 2008

It sounds like the Unicode FAQ has an authoritative statement on this,
is this a "wontfix", or does this need more discussion? Perhaps on
python-dev or at the sprints this week?

@jafo jafo mannequin changed the title feature request: force BOM option Force BOM option in UTF output. Mar 17, 2008
@jafo jafo mannequin assigned doerwalter Mar 17, 2008
@jafo jafo mannequin added type-feature A feature request or enhancement and removed type-bug An unexpected behavior, bug, or error labels Mar 17, 2008
doerwalter (Contributor) commented:

I don't see exactly what James is proposing.

For my needs, I would like the decoding parts of the utf_8 module
to treat an initial BOM as an optional signature and skip it if
there is one (just like the utf_8_sig decoder). In fact I have
a working patch that replaces the utf_8 decode,
IncrementalDecoder and StreamReader components by direct
transplants from utf_8_sig (as recently repaired -- there was a
StreamReader error).

If you want a decoder that behaves like the utf-8-sig decoder, use the
utf-8-sig decoder. I don't see how changing the utf-8 decoder helps here.

I can imagine there might be utf_8 client code out there which
expects to see a leading U+feff as (perhaps) a clue that the
output should be returned with a BOM-signature (say) to
accommodate the guessed input requirements of the remote
correspondent.

In this case use UTF-8: The leading BOM will be passed to the application.

I can just live with code like
    if input[0] == u"\ufeff":
        input = input[1:]
spread around, and of course slightly different for incremental
and stream inputs.

Can you post an example that requires this code?

jgsack mannequin commented Mar 20, 2008

Can you post an example that requires this code?

This is not a big issue, and it wouldn't hurt if it got declared "go away
and come back later if you have patch, test, docs, and a convincing use
case".

..But, for the record..

Suppose I want to both read and write some utf8. It is unknown whether the
input has a BOM, but it is known to be utf8. I want to write utf8 without
any BOM. I see two options, which I find slightly ugly/annoying/error-prone:

a) Use 2 separate encodings: read via utf_8_sig so as to transparently
accept input with/without BOM; use utf_8 on output to not emit any BOM.

b) Use utf_8 for read and write and explicitly check for and discard
leading BOM on input if any.
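Option (a) is easy to express; a minimal Python 3 sketch:

```python
# Option (a) sketched: decode with utf-8-sig (BOM optional on input),
# encode with plain utf-8 (never emits a BOM).
import codecs

def read_text(raw: bytes) -> str:
    return raw.decode("utf-8-sig")  # accepts input with or without a BOM

def write_text(text: str) -> bytes:
    return text.encode("utf-8")     # never writes a BOM

for raw in (b"hello", codecs.BOM_UTF8 + b"hello"):
    assert write_text(read_text(raw)) == b"hello"
```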

What _I_ would prefer is that utf_8 would ignore a BOM, if present (just
like utf_8_sig).

(What I was talking about in my last post was a complication in
consideration of someone else who would prefer otherwise, or of code that
might break upon my change.)

Regards,
..jim

doerwalter (Contributor) commented:

If you want to use UTF-8-sig for decoding and UTF-8 for encoding and
have this available as one codec, you can define your own codec for this:

import codecs

def search_function(name):
    if name == "myutf8":
        utf8 = codecs.lookup("utf-8")
        utf8_sig = codecs.lookup("utf-8-sig")
        return codecs.CodecInfo(
            name='myutf8',
            encode=utf8.encode,
            decode=utf8_sig.decode,
            incrementalencoder=utf8.IncrementalEncoder,
            incrementaldecoder=utf8_sig.IncrementalDecoder,
            streamreader=utf8_sig.StreamReader,
            streamwriter=utf8.StreamWriter,
        )


codecs.register(search_function)

Closing the issue as "won't fix".

doerwalter (Contributor) commented:

Oops, that code was supposed to read:

import codecs

def search_function(name):
    if name == "myutf8":
        utf8 = codecs.lookup("utf-8")
        utf8_sig = codecs.lookup("utf-8-sig")
        return codecs.CodecInfo(
            name='myutf8',
            encode=utf8.encode,
            decode=utf8_sig.decode,
            incrementalencoder=utf8.incrementalencoder,
            incrementaldecoder=utf8_sig.incrementaldecoder,
            streamreader=utf8_sig.streamreader,
            streamwriter=utf8.streamwriter,
        )


codecs.register(search_function)
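For the record, the corrected registration behaves as intended; a self-contained Python 3 check (the codec name "myutf8" is this example's own, not a stdlib name):

```python
import codecs

def search_function(name):
    if name == "myutf8":
        utf8 = codecs.lookup("utf-8")
        utf8_sig = codecs.lookup("utf-8-sig")
        return codecs.CodecInfo(
            name="myutf8",
            encode=utf8.encode,
            decode=utf8_sig.decode,
            incrementalencoder=utf8.incrementalencoder,
            incrementaldecoder=utf8_sig.incrementaldecoder,
            streamreader=utf8_sig.streamreader,
            streamwriter=utf8.streamwriter,
        )

codecs.register(search_function)

# Reads strip an optional BOM; writes never emit one.
assert (codecs.BOM_UTF8 + b"hi").decode("myutf8") == "hi"
assert b"hi".decode("myutf8") == "hi"
assert "hi".encode("myutf8") == b"hi"
```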

@ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022