Force BOM option in UTF output. #45669
Comments
The behavior of codecs utf_16_[bl]e is to omit the BOM. In a testing environment (and perhaps elsewhere), a forced BOM is useful. I guess it would require such an option in multiple function calls, sorry I If this is implemented, it might be desirable to think about the aliases Regards, |
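A minimal sketch of the behavior described above, using only the standard `codecs` module. The sample string and variable names are illustrative assumptions, not taken from the original report.

```python
import codecs

text = "hi"

# The endian-specific codecs write no BOM.
le = text.encode("utf_16_le")   # bytes in little-endian order, no BOM
be = text.encode("utf_16_be")   # bytes in big-endian order, no BOM

# The generic utf_16 codec writes a BOM in the machine's native byte order.
generic = text.encode("utf_16")
has_bom = generic.startswith(codecs.BOM_UTF16)
```

Here `codecs.BOM_UTF16` is the BOM for the native byte order, so the check holds on both little-endian and big-endian machines.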
Feature Request REVISION I think the utf8 codecs should accept input with or without the "sig". For utf16, (arguably) a missing BOM should merely assume machine endianness. Unless I have confused myself with my test cases, the current codecs are utf8 treats "sig" as real data, if present, but.. The 16'ers seem to match the (inferred) specs, but for completeness here: Regards, |
Later note: kind of weird! On my LE machine, utf16 reads my BE-formatted test data (no BOM) That I fail to understand, especially since it quits upon seeing Test data and test code available on request. Regards, |
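One possible reading of this observation, sketched under the assumption that the test data was plain text without a BOM: when no BOM is present, the generic `utf_16` decoder does not detect the byte order and falls back to the machine's native order, so big-endian data on a little-endian machine decodes to different characters rather than raising an error immediately. The sample string below is an illustrative assumption.

```python
import sys

be_bytes = "abc".encode("utf_16_be")   # big-endian, no BOM

# No BOM, so the generic decoder assumes the machine's native byte order.
decoded = be_bytes.decode("utf_16")

# Decoding explicitly with the native-order codec gives the same result.
native = "utf_16_le" if sys.byteorder == "little" else "utf_16_be"
same_as_native = decoded == be_bytes.decode(native)
```

On a little-endian machine `decoded` is not `"abc"`, because every 16-bit code unit is read with its bytes swapped.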
Can't you force a BOM by simply writing \ufeff at the start of the file? |
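A rough sketch of that workaround: the BOM is the code point U+FEFF, so prepending it to the text before encoding with an endian-specific codec yields the matching BOM bytes. The helper name `write_utf16le_with_bom` is hypothetical, and an in-memory stream stands in for a file.

```python
import codecs
import io

def write_utf16le_with_bom(stream, text):
    """Hypothetical helper: write text as UTF-16-LE preceded by a BOM."""
    # Encoding U+FEFF with utf_16_le produces the bytes FF FE.
    stream.write(("\ufeff" + text).encode("utf_16_le"))

buf = io.BytesIO()
write_utf16le_with_bom(buf, "hi")
data = buf.getvalue()
```

The generic `utf_16` decoder then recognizes the BOM and strips it, regardless of the machine's byte order.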
re: msg56782 Yes, of course I can explicitly write the BOM. I did realize that after But after playing some more, I do think this issue has become a A second half of that message suggests that it might be worth My third post (m56782) may actually represent a bug. I have a Regards (and thanks for the reply), |
If you can, please submit a patch that fixes all those issues, with |
OK, I will work on it. I have just downloaded trunk and will see what ..jim |
The problem with "being tolerant" as you suggest is that you lose the ability Conceptually, these signatures shouldn't even be part of the encoding; Note that the BOM signature (ZWNBSP) is a valid code point. Although it In summary, guessing the encoding should never be the default. Although [1] http://unicode.org/faq/utf_bom.html#38 |
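The round-trip concern can be illustrated with a small sketch. It assumes text that genuinely begins with U+FEFF (ZWNBSP) as data; the sample string is an illustrative assumption.

```python
original = "\ufeffabc"              # text that really starts with U+FEFF
encoded = original.encode("utf_8")  # plain utf_8 adds nothing of its own

strict = encoded.decode("utf_8")        # keeps the leading U+FEFF
tolerant = encoded.decode("utf_8_sig")  # treats it as a signature and drops it
```

With the strict decoder the round trip is exact. With the tolerant decoder the first character is silently lost.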
Adam Olsen wrote:
I'm sorry, I don't see the round-trip problem you describe. If codec utf_8 or utf_8_sig were to accept input with or without the No round trip problem _for me_. Now if I need to exchange with someone else, that's a different matter. One Am I missing something in your statement of a problem?
Yes, I'm aware of that, but you can't predict what you may find in dusty
I understand that throwing away a ZWNBSP at the beginning of a file does
From my point of view, I don't see that being tolerant in what _I_ (or Please explain where I am wrong. Regards, |
On 11/1/07, James G. sack (jim) <report@bugs.python.org> wrote:
You don't seem to think it's important to interact with other
Garbage in, garbage out. There are a lot of protocols with whitespace, |
re: msg57041, I'm sorry if I gave the wrong impression about interacting Anyway I'm most interested right now in lobbying for a change to utf_8 to In the process of trying to come up with a test and patch for this, I After there is some action on that I will return here to continue with ..jim |
jgsack wrote:
That's exactly what the utf_8_sig codec does. The decoder accepts input Or do you want a codec that behaves like utf_8 on reading and like |
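A short sketch of the `utf_8_sig` behavior referred to here, using only standard codecs. The sample bytes are illustrative assumptions.

```python
import codecs

with_sig = codecs.BOM_UTF8 + b"abc"
without_sig = b"abc"

# The utf_8_sig decoder accepts input with or without the signature.
a = with_sig.decode("utf_8_sig")
b = without_sig.decode("utf_8_sig")

# The utf_8_sig encoder always writes the signature.
written = "abc".encode("utf_8_sig")
```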
The Unicode FAQ (http://unicode.org/faq/utf_bom.html#28) clearly states: """ |
More discussion of utf_8.py decoding behavior (and possible change): For my needs, I would like the decoding parts of the utf_8 module to treat However the reason for discussion is to ask how it might impact existing I can imagine there might be utf_8 client code out there which expects to Making my work easier might actually make someone else's work (probably, So what to do? I can just live with code like But I probably wouldn't. I would probably substitute a Another thought I had would require "the other guy" to update his code, but Here's the basic outline:
This complicates the interface somewhat and requires some additional To document my original simple [-minded] idea required I thought of a secondary consideration: If utf_8 and utf_8_sig are "almost Is there anybody monitoring this who has an opinion on this? ..jim |
It sounds like the Unicode FAQ has an authoritative statement on this, |
I don't see exactly what James is proposing.
If you want a decoder that behaves like the utf-8-sig decoder, use the
In this case use UTF-8: The leading BOM will be passed to the application.
Can you post an example that requires this code? |
This is not a big issue, and it wouldn't hurt if it got declared "go away ..But, for the record.. Suppose I want to both read and write some utf8. It is unknown whether the a) Use 2 separate encodings: read via utf_8_sig so as to transparently b) Use utf_8 for read and write and explicitly check for and discard What _I_ would prefer is that utf_8 would ignore a BOM, if present (just (What I was talking about in my last post was a complication in Regards, |
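Option (b) above might look roughly like the following. The helper name `decode_utf8_drop_bom` is hypothetical, and the sketch assumes the whole input is available as one bytes object.

```python
import codecs

def decode_utf8_drop_bom(data):
    """Hypothetical helper: decode UTF-8 bytes, discarding a leading BOM."""
    if data.startswith(codecs.BOM_UTF8):
        data = data[len(codecs.BOM_UTF8):]
    return data.decode("utf_8")

from_bom_input = decode_utf8_drop_bom(codecs.BOM_UTF8 + b"abc")
from_plain_input = decode_utf8_drop_bom(b"abc")

# Writing with plain utf_8 never adds a BOM.
written = "abc".encode("utf_8")
```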
If you want to use UTF-8-sig for decoding and UTF-8 for encoding and

import codecs

def search_function(name):
    if name == "myutf8":
        utf8 = codecs.lookup("utf-8")
        utf8_sig = codecs.lookup("utf-8-sig")
        return codecs.CodecInfo(
            name='myutf8',
            encode=utf8.encode,
            decode=utf8_sig.decode,
            incrementalencoder=utf8.IncrementalEncoder,
            incrementaldecoder=utf8_sig.IncrementalDecoder,
            streamreader=utf8_sig.StreamReader,
            streamwriter=utf8.StreamWriter,
        )

codecs.register(search_function)

Closing the issue as "wont fix" |
Oops, that code was supposed to read:

import codecs

def search_function(name):
    if name == "myutf8":
        utf8 = codecs.lookup("utf-8")
        utf8_sig = codecs.lookup("utf-8-sig")
        return codecs.CodecInfo(
            name='myutf8',
            encode=utf8.encode,
            decode=utf8_sig.decode,
            incrementalencoder=utf8.incrementalencoder,
            incrementaldecoder=utf8_sig.incrementaldecoder,
            streamreader=utf8_sig.streamreader,
            streamwriter=utf8.streamwriter,
        )

codecs.register(search_function) |
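As a usage sketch, the corrected search function can be registered and then used through the ordinary `encode` and `decode` methods. The codec name `myutf8` comes from the comment above; the sample data and result variable names are illustrative assumptions.

```python
import codecs

def search_function(name):
    # Only answer for the codec name "myutf8"; defer everything else.
    if name != "myutf8":
        return None
    utf8 = codecs.lookup("utf-8")
    utf8_sig = codecs.lookup("utf-8-sig")
    return codecs.CodecInfo(
        name="myutf8",
        encode=utf8.encode,            # write without a BOM
        decode=utf8_sig.decode,        # read with or without a BOM
        incrementalencoder=utf8.incrementalencoder,
        incrementaldecoder=utf8_sig.incrementaldecoder,
        streamreader=utf8_sig.streamreader,
        streamwriter=utf8.streamwriter,
    )

codecs.register(search_function)

decoded_with_bom = (codecs.BOM_UTF8 + b"abc").decode("myutf8")
decoded_plain = b"abc".decode("myutf8")
encoded = "abc".encode("myutf8")
```

This gives option (a) from the earlier comment under a single codec name: tolerant on input, BOM-free on output.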
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.