Manually Controlling JSONB Serialization/Deserialization #10944
-
I am developing an artifact system that uses a PostgreSQL (psycopg2) database. My model looks roughly like this:

class Artifact(Base):
__tablename__ = "artifact"
id = Column(Integer, primary_key=True, autoincrement=True)
content_hash = Column(String, nullable=False)
content_json = Column(StringJSONB, nullable=True)

Because I keep track of the content's hash, I end up doing something like this when creating an artifact:

def create_artifact(value: Any) -> None:
return Artifact(
content_json=value,
content_hash=hashlib.sha256(json.dumps(value).encode("utf-8")).hexdigest()
) The consequence being that I end up serializing my data twice - once to compute the hash and again when SQLAlchemy automatically serializes the data to store it in the database. It would be much better if I could instead ask SQLAlchemy to not automatically serialize the data to JSON and instead do it myself: def create_artifact(value: Any) -> None:
data = json.dumps(value)
return Artifact(
content_json=data,
content_hash=hashlib.sha256(data.encode("utf-8")).hexdigest()
)

Turning off this automatic serialization behavior is possible, but as per the docs, it requires that I disable automatic serialization of JSON values at the engine level. As explained here this is because, "when using psycopg2, the DBAPI only allows serializers at the per-cursor or per-connection level." This is problematic because the rest of my application would benefit from, and in fact assumes, the automatic serialization of JSON values. I attempted to create a custom type decorator that I could substitute for JSONB:

from typing import Any, Callable
import json
from sqlalchemy import Dialect
from sqlalchemy.dialects.postgresql import JSONB
from sqlalchemy.sql.elements import Null
from sqlalchemy.types import TypeDecorator
class StringJSONB(TypeDecorator):
"""JSON that requires the user to manually serialize and deserialize to and from strings"""
impl = JSONB
cache_ok = True
def __init__(self, *args: Any, **kwargs: Any):
super().__init__(*args, none_as_null=False, **kwargs)
def bind_processor(self, dialect: Dialect) -> Callable[[Any], Any]:
return lambda value: None if value is None or value is self.NULL or isinstance(value, Null) else value
def result_processor(self, dialect: Dialect, coltype: Any) -> Callable[[Any], Any]:
return lambda value: (
value
if value is None
# Unfortunately we have to re-serialize the value because the postgres dialect automatically
# deserializes the value into a dict. As far as I can tell, there's no way to disable this behavior.
else json.dumps(value)
)

It seems odd that I'm able to successfully skip the automatic JSON serialization by defining a bind_processor, yet cannot skip the automatic deserialization in the same way. Here is a full working example showing my current workaround:

import hashlib
import json
from typing import Any, Callable
from sqlalchemy import Column, Dialect, Integer, String, create_engine
from sqlalchemy.dialects.postgresql import JSONB
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import sessionmaker
from sqlalchemy.sql.elements import Null
from sqlalchemy.types import TypeDecorator
# Declarative base class shared by the ORM models in this example.
Base = declarative_base()
class StringJSONB(TypeDecorator):
    """JSONB variant whose values are exchanged as pre-serialized JSON strings.

    The caller performs ``json.dumps``/``json.loads`` manually; this type only
    passes the string through on the way in and re-serializes on the way out.
    """

    impl = JSONB
    cache_ok = True

    def __init__(self, *args: Any, **kwargs: Any):
        # Pin none_as_null=False so NULL handling is decided in bind_processor.
        super().__init__(*args, none_as_null=False, **kwargs)

    def bind_processor(self, dialect: Dialect) -> Callable[[Any], Any]:
        def process(bound: Any) -> Any:
            # Collapse Python None and SQLAlchemy's NULL markers to None;
            # anything else is the caller's pre-serialized string, untouched.
            if bound is None or bound is self.NULL or isinstance(bound, Null):
                return None
            return bound

        return process

    def result_processor(self, dialect: Dialect, coltype: Any) -> Callable[[Any], Any]:
        def process(row_value: Any) -> Any:
            # Unfortunately we have to re-serialize the value because the
            # postgres dialect automatically deserializes the value into a
            # dict. As far as I can tell, there's no way to disable this.
            if row_value is None:
                return None
            return json.dumps(row_value)

        return process
class Artifact(Base):
    """ORM model for the ``artifact`` table: hashed, manually serialized JSON content."""
    __tablename__ = "artifact"
    # Surrogate primary key.
    id = Column(Integer, primary_key=True, autoincrement=True)
    # SHA-256 hex digest of the serialized JSON content (computed by the caller).
    content_hash = Column(String, nullable=False)
    # JSON payload stored via StringJSONB, so the caller supplies/receives strings.
    content_json = Column(StringJSONB, nullable=True)
def create_artifact(value: Any) -> Artifact:
    """Build an Artifact, serializing *value* exactly once for both storage and hashing."""
    serialized = json.dumps(value)
    digest = hashlib.sha256(serialized.encode("utf-8")).hexdigest()
    return Artifact(
        content_json=serialized,
        content_hash=digest,
    )
def main() -> None:
    """Round-trip a dict through StringJSONB and verify it comes back as a string."""
    engine = create_engine("postgresql://username:password@localhost:5432/postgres")
    Base.metadata.create_all(engine)
    session_factory = sessionmaker(bind=engine)
    db = session_factory()

    payload = {"foo": "bar"}
    db.add(create_artifact(payload))
    db.commit()

    stored = db.query(Artifact).first()
    # The result processor re-serializes, so the column comes back as JSON text.
    assert isinstance(stored.content_json, str)
    assert json.loads(stored.content_json) == payload
    print("It works!")


if __name__ == "__main__":
    main()
Beta Was this translation helpful? Give feedback.
Replies: 1 comment 4 replies
-
Hi,
well you can control what gets passed to the dialect, but once the result_processor runs the dialect has already de-serialized the data from the db. I think your best option here is probably to either de-register the deserializer in the dialect or update your StringJSONB type to cast to text in bind_expression. Personally I would make the db compute that hash, using a computed column and one of the hashing functions supported by pg https://www.postgresql.org/docs/current/pgcrypto.html |
Beta Was this translation helpful? Give feedback.
Hi,
well you can control what gets passed to the dialect, but once the result_processor runs the dialect has already de-serialized the data from the db.
I think your best option here is probably to either de-register the deserializer in the dialect or update your
StringJSONB
type to cast to text in bind_expression
.Personally I would make the db compute that hash, using a computed column and one of the hashing function supported by pg https://www.postgresql.org/docs/current/pgcrypto.html