
Error when querying JSON columns with wide unicode characters. #4798

Closed
abetlen opened this issue Aug 8, 2019 · 11 comments
Labels
bug (Something isn't working), datatypes (things to do with database types, like VARCHAR and others), external driver issues (the issue involves a misbehavior on the part of the DBAPI itself, probably not SQLAlchemy), sqlite
Comments

@abetlen

abetlen commented Aug 8, 2019

SQLAlchemy Version: 1.3.6

SQLAlchemy serializes JSON columns using json.dumps, which by default has the parameter ensure_ascii=True; this converts unicode characters to ASCII-compatible unicode escape sequences. Unfortunately, this behaviour seems to produce encoding errors when working with unicode characters that require more than a single escape sequence to encode, such as the characters 𝓓 and 𝓞 in the example below.
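The effect of that default is easy to see in isolation with nothing but the stdlib (this snippet is illustrative, not part of the original report):

```python
import json

s = "𝓓𝓞Γ"

# Default: every non-ASCII character becomes a \uXXXX escape; characters
# outside the Basic Multilingual Plane (like 𝓓, U+1D4D3) need two escapes
# each, a UTF-16 surrogate pair.
print(json.dumps(s))                      # "\ud835\udcd3\ud835\udcde\u0393"

# With ensure_ascii=False the characters are written out as UTF-8 text.
print(json.dumps(s, ensure_ascii=False))  # "𝓓𝓞Γ"
```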

import sqlalchemy as sa
import sqlalchemy.dialects.sqlite as sqlite_types

metadata = sa.MetaData()

User = sa.Table(
    "users",
    metadata,
    sa.Column("username", sa.String, primary_key=True),
    sa.Column("groups", sqlite_types.JSON)
)

engine = sa.create_engine("sqlite:///:memory:")
conn = engine.connect()
metadata.create_all(engine)

conn.execute("""
INSERT INTO users (username, groups) VALUES ('alice', json_array("𝓓𝓞Γ"));
""")

stmt = sa.select([sa.literal_column("json_each.value")])\
    .select_from(User)\
    .select_from(sa.func.json_each(User.c.groups))

# Works
print(conn.execute(stmt).fetchall())

conn.execute(User.insert(), {
    "username": "bob",
    "groups": ["𝓓𝓞Γ"]
})

# Fails
print(conn.execute(stmt).fetchall())

Running this example yields:

[('𝓓𝓞Γ',)]
Traceback (most recent call last):
  File "/home/andrei/miniconda3/lib/python3.6/site-packages/sqlalchemy/engine/result.py", line 1211, in fetchall
    l = self.process_rows(self._fetchall_impl())
  File "/home/andrei/miniconda3/lib/python3.6/site-packages/sqlalchemy/engine/result.py", line 1161, in _fetchall_impl
    return self.cursor.fetchall()
sqlite3.OperationalError: Could not decode to UTF-8 column 'value' with text '��������������'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "test_json_dumps_bug.py", line 34, in <module>
    print(conn.execute(stmt).fetchall())
  File "/home/andrei/miniconda3/lib/python3.6/site-packages/sqlalchemy/engine/result.py", line 1216, in fetchall
    e, None, None, self.cursor, self.context
  File "/home/andrei/miniconda3/lib/python3.6/site-packages/sqlalchemy/engine/base.py", line 1466, in _handle_dbapi_exception
    util.raise_from_cause(sqlalchemy_exception, exc_info)
  File "/home/andrei/miniconda3/lib/python3.6/site-packages/sqlalchemy/util/compat.py", line 398, in raise_from_cause
    reraise(type(exception), exception, tb=exc_tb, cause=cause)
  File "/home/andrei/miniconda3/lib/python3.6/site-packages/sqlalchemy/util/compat.py", line 152, in reraise
    raise value.with_traceback(tb)
  File "/home/andrei/miniconda3/lib/python3.6/site-packages/sqlalchemy/engine/result.py", line 1211, in fetchall
    l = self.process_rows(self._fetchall_impl())
  File "/home/andrei/miniconda3/lib/python3.6/site-packages/sqlalchemy/engine/result.py", line 1161, in _fetchall_impl
    return self.cursor.fetchall()
sqlalchemy.exc.OperationalError: (sqlite3.OperationalError) Could not decode to UTF-8 column 'value' with text '��������������'
(Background on this error at: http://sqlalche.me/e/e3q8)

I know it wouldn't fix the underlying issue, but I believe allowing the user to set the serializer's ensure_ascii=False option would resolve this error.

@zzzeek
Member

zzzeek commented Aug 9, 2019

I've never heard of this parameter, but it is ensure_ascii

@abetlen
Author

abetlen commented Aug 9, 2019

Oops, that was a typo on my part. I've edited the issue to reflect the correct parameter name.

@zzzeek
Member

zzzeek commented Aug 9, 2019

So I'd like to change the parameter straightaway, but I need to understand the issue first. The raw value comes back without issue, so this has something to do with what SQLite is doing with that json_array function, and it's unclear whether this is a bug we should be reporting to the Python tracker for sqlite3 or not.

I tried changing the URL to postgresql to see what PG does, but the json_array() function doesn't seem to exist there. If you know how to make PG do the same thing here, we can see what it expects, or I could at least experiment with it. There are JSON encoders used with the mysql / postgresql dialects as well, so we need to see whether this issue applies to them too.

Basically I don't know these SQL functions very well, so I won't really know what's going on until I have time to look more deeply.

@zzzeek zzzeek added bug Something isn't working datatypes things to do with database types, like VARCHAR and others sqlite labels Aug 9, 2019
@zzzeek zzzeek added this to the 1.3.xx milestone Aug 9, 2019
@abetlen
Author

abetlen commented Aug 9, 2019

Thanks for taking a look.

I've tested this on postgresql and it works correctly, so the issue does seem to be sqlite-specific.

I adapted the example above to postgresql, using json_build_array instead of json_array to construct the first input array, and json_array_elements instead of json_each to traverse the JSON arrays.

import sqlalchemy as sa
import sqlalchemy.dialects.postgresql as pg_types

metadata = sa.MetaData()

User = sa.Table(
    "users",
    metadata,
    sa.Column("username", sa.String, primary_key=True),
    sa.Column("groups", pg_types.JSON)
)

engine = sa.create_engine("postgresql+psycopg2://postgres:postgres@localhost:5432/postgres")

conn = engine.connect()
metadata.drop_all(engine)
metadata.create_all(engine)

conn.execute("""
INSERT INTO users (username, groups) VALUES ('alice', json_build_array('𝓓𝓞Γ'));
""")

stmt = sa.select([sa.literal_column("json_array_elements.value")])\
    .select_from(User)\
    .select_from(sa.func.json_array_elements(User.c.groups))

# Works
print(conn.execute(stmt).fetchall())

conn.execute(User.insert(), {
    "username": "bob",
    "groups": ["𝓓𝓞Γ"]
})

# Also Works
print(conn.execute(stmt).fetchall())

The only hint I have so far as to what's going wrong is that when I check the sqlite database directly, it's clear the strings inside the groups column are encoded completely differently:

sqlite> select * from users;
alice|["𝓓𝓞Γ"]
bob|["\ud835\udcd3\ud835\udcde\u0393"]
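For reference, those escapes are UTF-16 surrogate pairs: 𝓓 (U+1D4D3) lies outside the Basic Multilingual Plane, so ensure_ascii=True has to emit two \uXXXX escapes for it. A small illustrative sketch of the arithmetic (plain stdlib, nothing SQLAlchemy-specific):

```python
# Derive the surrogate pair for 𝓓 (U+1D4D3), which is why two \u escapes
# appear per character in bob's row.
cp = ord("𝓓")                # 0x1D4D3, beyond U+FFFF
v = cp - 0x10000
high = 0xD800 + (v >> 10)    # high surrogate
low = 0xDC00 + (v & 0x3FF)   # low surrogate
print(hex(high), hex(low))   # 0xd835 0xdcd3, matching \ud835\udcd3 above
```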

@zzzeek
Member

zzzeek commented Aug 9, 2019

Well, one issue right off is that there are json_serializer and json_deserializer arguments to the Engine you can set, but for SQLite only they are named _json_serializer and _json_deserializer, which is wrong. You can work around it using those names for now; I will fix them, but the old names will keep working throughout 1.3.x:


import json

engine = sa.create_engine(
    "sqlite:///:memory:",
    echo=True,
    _json_serializer=lambda x: json.dumps(x, ensure_ascii=False),
)

@sqla-tester
Collaborator

Mike Bayer has proposed a fix for this issue in the master branch:

Correct name for json_serializer / json_deserializer, document and test https://gerrit.sqlalchemy.org/1399

@sqla-tester
Collaborator

Mike Bayer has proposed a fix for this issue in the rel_1_3 branch:

Correct name for json_serializer / json_deserializer, document and test https://gerrit.sqlalchemy.org/1400

@zzzeek
Member

zzzeek commented Aug 9, 2019

OK now for PG. If you run your program then look at the PG database, they are encoded exactly the same way as they are with the SQLite dialect, because SQLAlchemy is doing the same thing:

test=> select * from users;
 username |               groups               
----------+------------------------------------
 alice    | ["𝓓𝓞Γ"]
 bob      | ["\ud835\udcd3\ud835\udcde\u0393"]
(2 rows)


psycopg2 isn't tripping over this; it gets back the escaped string and json.loads just fixes it.

So maybe SQLAlchemy shouldn't be doing this (or should it?), but the sqlite3 dialect is still looking suspect.
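The json.loads part is easy to verify in isolation; a minimal sketch, with the stored text copied from the rows above:

```python
import json

# The driver hands back the escaped JSON text unchanged; json.loads
# combines the surrogate-pair escapes back into the original characters.
stored = '["\\ud835\\udcd3\\ud835\\udcde\\u0393"]'
assert json.loads(stored) == ["𝓓𝓞Γ"]

# This is exactly the text the default serializer produced in the first place.
assert stored == json.dumps(["𝓓𝓞Γ"])
```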

@zzzeek
Member

zzzeek commented Aug 9, 2019

mmmm this is totally valid JSON, here's the spec:

Any character may be escaped. If the character is in the Basic
Multilingual Plane (U+0000 through U+FFFF), then it may be
represented as a six-character sequence: a reverse solidus, followed
by the lowercase letter u, followed by four hexadecimal digits that
encode the character's code point. The hexadecimal letters A through
F can be upper or lowercase. So, for example, a string containing
only a single reverse solidus character may be represented as
"\u005C".

So Python json.dumps is doing the right thing: the JSON in the DB with the \u escapes is valid. SQLAlchemy will fix the serializers here so they are programmable, and the thing you are seeing is, I think, a bug in sqlite.

@abetlen
Author

abetlen commented Aug 9, 2019

Thank you for the help, that solves my problem.

I believe you're correct about this being a bug in sqlite, I'll report it there.

@zzzeek
Member

zzzeek commented Aug 9, 2019

yeah is this valid?

sqlite> select * from users;
alice|["𝓓𝓞Γ"]
bob|["\ud835\udcd3\ud835\udcde\u0393"]
sqlite> select json_each.value from users, json_each(users.groups);
𝓓𝓞Γ
������������Γ

postgresql doesn't have this problem.

$ psql -U scott test
Password for user scott: 
psql (11.2)
Type "help" for help.

test=> SELECT json_array_elements.value FROM users, json_array_elements(users.groups) ;
              value               
----------------------------------
 "𝓓𝓞Γ"
 "\ud835\udcd3\ud835\udcde\u0393"
(2 rows)

@zzzeek zzzeek added the external driver issues the issue involves a misbehavior on the part of the DBAPI itself, probably not SQLAlchemy label Aug 9, 2019
sqlalchemy-bot pushed a commit that referenced this issue Aug 10, 2019
The dialects that support json are supposed to take arguments
``json_serializer`` and ``json_deserializer`` at the create_engine() level,
however the SQLite dialect calls them ``_json_serializer`` and
``_json_deserializer``.  The names have been corrected, the old names are
accepted with a deprecation warning, and these parameters are now documented
as :paramref:`.create_engine.json_serializer` and
:paramref:`.create_engine.json_deserializer`.

Fixes: #4798
Change-Id: I1dbfe439b421fe9bb7ff3594ef455af8156f8851
(cherry picked from commit 104e690)
@zzzeek zzzeek modified the milestones: 1.3.xx, 1.3.x Dec 18, 2019

3 participants