I used persistent-sqlite with a table that contains a FilePath. After storing "test_öüä" in the database, I retrieved it back out, and got back "test_������".
This only occurred when I was not using a UTF-8 capable locale (i.e., LANG=C).
analysis:
String can represent a filepath that may be encoded using any encoding, not just the current system encoding. This is handled by using Unicode surrogate characters: bytes that cannot be decoded in the current locale are escaped as lone surrogates. These surrogates are what don't round-trip through persistent.
I suspect that it may come down to PersistField being implemented in terms of Text. See the "Acceptable data" section in Data.Text's haddock. Since Text cannot represent Unicode surrogates, packing the String to Text loses them.
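To make the suspected cause concrete, here is a minimal sketch that needs only the text package and does not touch persistent at all. The names `escaped`, `roundTrip` and `survives` are made up for illustration, and the surrogate values assume that, under LANG=C, the UTF-8 bytes of "ö" (0xC3 0xB6) get escaped as U+DCC3 and U+DCB6.

```haskell
import qualified Data.Text as T

-- Assumption: how "test_ö" would look as a String under LANG=C, with
-- each undecodable byte escaped as a lone surrogate (0xDC00 + byte).
escaped :: String
escaped = "test_\xDCC3\xDCB6"

-- Pack to Text and unpack again, which is roughly what a Text-based
-- field conversion has to do.
roundTrip :: String -> String
roundTrip = T.unpack . T.pack

-- True when a String comes back unchanged.
survives :: String -> Bool
survives s = roundTrip s == s
```

With this, `survives "test_\246"` should hold, while `roundTrip escaped` comes back with each surrogate replaced by U+FFFD, which matches the garbage seen above.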
impact:
This is likely to mostly impact programs that store FilePaths in a database. And it's easy to miss that such a program has a bug, because it will mostly only happen when using a non-Unicode locale, or perhaps when dealing with strange filenames that are not encoded as UTF-8.
fixing:
I don't know if this can be fixed as long as PersistField is using Text internally. If it were using only ByteString, it could probably be made to roundtrip all Strings through it. But, I have not checked what happens when PersistField operates on a PersistByteString.
I worked around this in git-annex with a newtype with its own PersistField implementation. The simplest approach is to show the String, which encodes the surrogate characters as \nnnn. But that is not backwards compatible with existing data in the database. So, I made it only use the "show" approach when there's a surrogate char, and otherwise pass the string through as before.
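A rough sketch of that idea, using only base. This is not the actual git-annex code; `hasSurrogate`, `encodeField` and `decodeField` are hypothetical names, and a real PersistField instance for the newtype would call something like them in its conversion functions.

```haskell
import Data.Char (GeneralCategory (Surrogate), generalCategory)

-- True if the String contains any surrogate code point.
hasSurrogate :: String -> Bool
hasSurrogate = any ((== Surrogate) . generalCategory)

-- Only use the "show" form when a surrogate is present, so that
-- ordinary strings are stored exactly as before.
encodeField :: String -> String
encodeField s
  | hasSurrogate s = show s
  | otherwise      = s

-- Treat the stored value as a "show" form only if it parses as a
-- String literal that contains a surrogate; otherwise pass it through.
-- Caveat: a plain string that itself looks like such a literal would
-- be misread, so this is not fully unambiguous.
decodeField :: String -> String
decodeField stored =
  case reads stored of
    [(s, "")] | hasSurrogate s -> s
    _                          -> stored
```

Since show writes non-ASCII characters as numeric escapes, the encoded form is plain ASCII and so has nothing for Text to replace. Existing rows without surrogates decode unchanged.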
versions:
persistent-2.2.5
persistent-sqlite-2.2
I doubt there's anything reasonable we can do here. There's certainly a
desire to have Strings from Haskell be treated as normal text by other
SQLite applications. Using a newtype wrapper when you want a special
representation, or using a ByteString and taking responsibility for the
character encoding, seem like reasonable solutions.