not all String values round-trip through #540

Open
joeyh opened this issue Feb 14, 2016 · 1 comment
@joeyh
Contributor

joeyh commented Feb 14, 2016

I used persistent-sqlite with a table that contains a FilePath. After storing "test_öüä" in the database, I retrieved it back out, and got back "test_������".

This only occurred when I was not using a UTF-8 capable locale (i.e., LANG=C).

analysis:

String can represent a filepath that may be encoded using any encoding, not just the current system encoding. This is handled by mapping bytes that cannot be decoded to Unicode surrogate characters. These surrogates are what don't round-trip through persistent.

I suspect that it may come down to PersistField being implemented in terms of Text. See the "Acceptable data" section in Data.Text's haddock. Since Text cannot represent Unicode surrogates, packing the String to Text loses them.
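
A minimal sketch of the suspected mechanism, using only the `text` package and no persistent code. The names `roundTrips` and `escapedPath` are made up for illustration, and the sample value assumes GHC's usual behaviour of turning an undecodable byte `b` into the lone surrogate U+DC00 + `b` when the locale is not UTF-8 capable.

```haskell
import qualified Data.Text as T

-- | True when a String survives a trip through Text unchanged.
-- (Hypothetical helper, not part of persistent.)
roundTrips :: String -> Bool
roundTrips s = T.unpack (T.pack s) == s

-- | Roughly what a FilePath looks like when the UTF-8 bytes of "test_ö"
-- (0xC3 0xB6) are decoded under LANG=C: each non-ASCII byte becomes a
-- lone surrogate. This value is an assumption made for the example.
escapedPath :: String
escapedPath = "test_\xDCC3\xDCB6"
```

In GHCi, `roundTrips "test_öüä"` should be `True`, since those are ordinary scalar values, while `roundTrips escapedPath` should be `False`, because `T.pack` substitutes U+FFFD for each surrogate.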

impact:

This is likely to mostly impact programs that store FilePaths in a database. And it's easy to miss that such a program has a bug, because it will mostly only happen when using a non-unicode locale, or perhaps when dealing with strange filenames that are not encoded with utf-8.

fixing:

I don't know if this can be fixed as long as PersistField is using Text internally. If it were using only ByteString, it could probably be made to roundtrip all Strings through it. But, I have not checked what happens when PersistField operates on a PersistByteString.

I worked around this in git-annex with a newtype with its own PersistField implementation. The simplest approach is to show the String, which encodes the surrogate characters as \nnnn. But that is not backwards compatible with existing data in the database. So, I made it only use the "show" approach when there's a surrogate char, and otherwise pass the string through as before.
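
The workaround described above could be sketched roughly as follows. This is not the git-annex code; `encodeForDb`, `decodeFromDb` and `isSurrogate` are hypothetical names, and the PersistField instance for the newtype is left out so the example needs only `base`. Treating a leading double quote as "needs escaping" is an extra assumption made here so that decoding is unambiguous for newly written values.

```haskell
import Data.Char (GeneralCategory (Surrogate), generalCategory)
import Data.Maybe (fromMaybe)
import Text.Read (readMaybe)

-- | True for code points in the surrogate range, which Text cannot hold.
isSurrogate :: Char -> Bool
isSurrogate c = generalCategory c == Surrogate

-- | Value to store in the database. Strings without surrogates pass
-- through unchanged, so previously stored rows keep their meaning.
-- Strings with surrogates (or a leading double quote) are stored in
-- `show` form, which escapes every non-ASCII character as \nnnn.
encodeForDb :: String -> String
encodeForDb s
  | needsEscape = show s
  | otherwise   = s
  where
    needsEscape = any isSurrogate s || take 1 s == "\""

-- | Inverse of 'encodeForDb'. A stored value that starts with a double
-- quote is parsed as a Haskell string literal; if parsing fails it is
-- returned as-is.
decodeFromDb :: String -> String
decodeFromDb s@('"' : _) = fromMaybe s (readMaybe s)
decodeFromDb s           = s
```

A newtype's `toPersistValue` and `fromPersistValue` would call these two functions around the ordinary String conversion. One caveat of this sketch: an old row that happens to look like a valid quoted Haskell string literal would now be decoded rather than returned verbatim.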

versions:

persistent-2.2.5
persistent-sqlite-2.2
@snoyberg
Member

I doubt there's anything reasonable we can do here. There's certainly a
desire to have Strings from Haskell be treated as normal text by other
SQLite applications. Using a newtype wrapper when you want a special
representation, or using a ByteString and taking responsibility for the
character encoding, seem like reasonable solutions.
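
One way to read the ByteString suggestion is to store the raw filename bytes and convert at the boundary using GHC's filesystem encoding. The sketch below assumes that approach; `filePathToBytes` and `bytesToFilePath` are hypothetical names, and the resulting ByteString would be stored in a ByteString column instead of a String one.

```haskell
import qualified Data.ByteString as B
import qualified GHC.Foreign as GHC
import GHC.IO.Encoding (getFileSystemEncoding)

-- | Turn a FilePath back into the bytes it had on disk. The filesystem
-- encoding maps surrogate escapes back to the original raw bytes.
filePathToBytes :: FilePath -> IO B.ByteString
filePathToBytes p = do
  enc <- getFileSystemEncoding
  GHC.withCStringLen enc p B.packCStringLen

-- | Decode stored bytes into a FilePath the same way GHC decodes
-- filenames, so undecodable bytes become surrogate escapes again.
bytesToFilePath :: B.ByteString -> IO FilePath
bytesToFilePath b = do
  enc <- getFileSystemEncoding
  B.useAsCStringLen b (GHC.peekCStringLen enc)
```

With this design the database holds opaque bytes, so other SQLite applications will not see the column as text, which is the trade-off mentioned above.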
