Handle encoding #3

Merged
merged 4 commits into from May 18, 2011

Conversation

3 participants
Contributor

josh commented May 15, 2011

Not sure if this is the right approach, but I wanted to discuss how to handle encodings.

As I understand it, the string payload should be encoding-unaware. The length should be the ASCII-8BIT (binary) byte size of the string.
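
To illustrate the point about byte size, here is a minimal sketch of framing a string payload. `dump_string` is a hypothetical helper, not the gem's actual method; it only assumes the `<length>:<data>,` layout that TNetstrings use for strings.

```ruby
# Hypothetical helper (not the gem's API): frames a string as a
# tnetstring string value, "<length>:<data>," where the length is
# counted in bytes, not characters.
def dump_string(str)
  "#{str.bytesize}:#{str},"
end

text = "h\u00e9llo"   # five characters, but "é" takes two bytes in UTF-8

text.length           # character count
text.bytesize         # byte count, which is what the length prefix needs
dump_string(text)
```

With a multibyte character in the payload, `length` and `bytesize` disagree, so a prefix built from `length` would make a parser stop one byte short.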

TNetstrings Specification

Owner

rkh commented May 15, 2011

Using bytesize is definitely correct according to the TNetstring spec, and being able to set the encoding while parsing is probably handy for nested objects. However, if the data string has a different encoding than binary, should we respect that?

Contributor

josh commented May 15, 2011

Ah, so for encode we should be calling obj.encode('binary')?
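
For context on the question above, a small sketch of how `encode` and `force_encoding` differ in Ruby when the target is binary. The variable names are illustrative only and nothing here is taken from the pull request's code.

```ruby
utf8 = "h\u00e9llo"

# force_encoding (on a copy) relabels the same bytes as binary;
# no bytes are changed.
binary = utf8.dup.force_encoding(Encoding::BINARY)

# encode attempts a real character conversion, which fails for
# characters that have no mapping in the binary encoding.
transcode_failed =
  begin
    utf8.encode(Encoding::BINARY)
    false
  rescue Encoding::UndefinedConversionError
    true
  end
```

So for payloads containing non-ASCII characters, relabeling with `force_encoding` is probably the operation that matches "encoding-unaware", while `encode('binary')` would raise.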

Contributor

benatkin commented May 18, 2011

Oops. I misunderstood this when I read it the first time; I should have looked at the code. #4 is a duplicate of this. I think UTF-8 should be assumed for effortless JSON API compatibility, and another tag character (perhaps ' or `) should be used for raw bytestrings. And yes, bytesize is a must.

Contributor

benatkin commented May 18, 2011

Also, if you know a file is UTF-8 or expect it to be, it could be streamed into a string by first getting the size from the filesystem, since the filesystem reports sizes in bytes. I don't think UTF-8 strings are unwieldy at all, so long as that's what we want and we get the bytesize thing out of the way.

Owner

rkh commented May 18, 2011

@benatkin what do you mean by "supporting UTF-8"? str.force_encoding("UTF-8") before returning it?

Contributor

benatkin commented May 18, 2011

@rkh Yes, that's what I mean. Basically require that all implementations do that if indications are that it's UTF-8 (having a standard tag rather than a binary tag). The file would need to be read in as a bytestring, though.

Contributor

josh commented May 18, 2011

I'm 👎 to that.

Netstrings are a lower level transport. It should be the responsibility of the application to decide which encoding is being used.

Calling force_encoding from UTF-8 to something else is a smell.

Owner

rkh commented May 18, 2011

I don't like it either. People should not have to set the encoding more than once or twice.

rkh merged commit 05c680c into rkh:master May 18, 2011

Contributor

josh commented May 18, 2011

I may have fucked up the assumptions about Encoding.default_internal. Think they are only intended to be used for FS related encodings.

https://gist.github.com/83c011e40e1970df0ef4

Owner

rkh commented May 18, 2011

Oh man, how I hate to deal with that stuff.

Contributor

josh commented May 18, 2011

I think we just want to change encoding = 'internal' to encoding = nil and not do any sort of encoding unless it's explicit.
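
A rough sketch of what that default could look like. `read_string_payload` and its signature are hypothetical and are not the code merged in this pull request; the sketch only assumes that payloads stay binary unless the caller names an encoding.

```ruby
# Hypothetical reader (assumed name and signature): returns the payload
# tagged as binary unless the caller explicitly asks for an encoding.
def read_string_payload(bytes, encoding = nil)
  payload = bytes.dup.force_encoding(Encoding::BINARY)
  encoding ? payload.force_encoding(encoding) : payload
end

raw = "h\u00e9llo".b                         # bytes as they arrive off the wire

untagged = read_string_payload(raw)          # stays binary
tagged   = read_string_payload(raw, "UTF-8") # caller opts in explicitly
```

This keeps the decision with the application: the transport never guesses, and `Encoding.default_internal` is not consulted.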
