
Does pydantic.dataclasses.dataclass support classic mapping in SQLAlchemy? #1089

Closed
HymanZHAN opened this issue Dec 9, 2019 · 17 comments

@HymanZHAN

Question

  • OS: Fedora 31
  • Python version: 3.7.5 (default, Oct 25 2019, 15:51:11)
  • Pydantic version: 1.2

First of all, thank you so much for your awesome work! Pydantic is a very good library, and I really like using it in combination with FastAPI. 🚀

So my question is: does pydantic.dataclasses.dataclass support classical mapping in SQLAlchemy? I am working on a project that I hope to build with a clean architecture, so I would like to use pydantic's dataclass for my domain model and have SQLAlchemy depend on it.

I have a proof-of-concept snippet below, which works if I use the dataclass from Python's standard library instead.

from dataclasses import field
from pydantic.dataclasses import dataclass
from pydantic import constr
from typing import List
from datetime import datetime
from sqlalchemy import create_engine, MetaData, Table, Column, Integer, String, ARRAY, TIMESTAMP
from sqlalchemy.orm import sessionmaker, mapper

metadata = MetaData()
person_table = \
    Table('people', metadata,
          Column('id', Integer, primary_key=True, autoincrement=True),
          Column('name', String),
          Column('age', Integer),
          Column('hobbies', ARRAY(String)),
          Column('birthday', TIMESTAMP)
          )


@dataclass
class Person:
    name: str = constr(max_length=50)
    age: int = field(default_factory=int)
    hobbies: List[str] = field(default_factory=list)
    birthday: datetime = field(default_factory=datetime.now)


mapper(Person, person_table)

engine = create_engine(
    'postgresql://postgres:defaultpassword@127.0.0.1:5432/test_db', echo=True)
metadata.create_all(engine)

session = sessionmaker(bind=engine)()
person = Person(hobbies=['golf', 'hiking'],
                birthday=datetime(1985, 7, 25))
session.add(person)
session.commit()

Error Output:

> python main.py         
2019-12-08 22:11:07,663 INFO sqlalchemy.engine.base.Engine select version()
2019-12-08 22:11:07,663 INFO sqlalchemy.engine.base.Engine {}
2019-12-08 22:11:07,664 INFO sqlalchemy.engine.base.Engine select current_schema()
2019-12-08 22:11:07,664 INFO sqlalchemy.engine.base.Engine {}
2019-12-08 22:11:07,665 INFO sqlalchemy.engine.base.Engine SELECT CAST('test plain returns' AS VARCHAR(60)) AS anon_1
2019-12-08 22:11:07,665 INFO sqlalchemy.engine.base.Engine {}
2019-12-08 22:11:07,665 INFO sqlalchemy.engine.base.Engine SELECT CAST('test unicode returns' AS VARCHAR(60)) AS anon_1
2019-12-08 22:11:07,665 INFO sqlalchemy.engine.base.Engine {}
2019-12-08 22:11:07,665 INFO sqlalchemy.engine.base.Engine show standard_conforming_strings
2019-12-08 22:11:07,665 INFO sqlalchemy.engine.base.Engine {}
2019-12-08 22:11:07,666 INFO sqlalchemy.engine.base.Engine select relname from pg_class c join pg_namespace n on n.oid=c.relnamespace where pg_catalog.pg_table_is_visible(c.oid) and relname=%(name)s
2019-12-08 22:11:07,666 INFO sqlalchemy.engine.base.Engine {'name': 'people'}
2019-12-08 22:11:07,667 INFO sqlalchemy.engine.base.Engine 
CREATE TABLE people (
        id SERIAL NOT NULL, 
        name VARCHAR, 
        age INTEGER, 
        hobbies VARCHAR[], 
        birthday TIMESTAMP WITHOUT TIME ZONE, 
        PRIMARY KEY (id)
)


2019-12-08 22:11:07,667 INFO sqlalchemy.engine.base.Engine {}
2019-12-08 22:11:07,683 INFO sqlalchemy.engine.base.Engine COMMIT
Traceback (most recent call last):
  File "/home/xzhan/Development/Projects/sqlalchemy_mapping/.venv/lib/python3.7/site-packages/sqlalchemy/orm/session.py", line 1955, in add
    state = attributes.instance_state(instance)
AttributeError: 'Person' object has no attribute '_sa_instance_state'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "main.py", line 37, in <module>
    session.add(person)
  File "/home/xzhan/Development/Projects/sqlalchemy_mapping/.venv/lib/python3.7/site-packages/sqlalchemy/orm/session.py", line 1957, in add
    raise exc.UnmappedInstanceError(instance)
sqlalchemy.orm.exc.UnmappedInstanceError: Class '__main__.Person' is mapped, but this instance lacks instrumentation.  This occurs when the instance is created before sqlalchemy.orm.mapper(__main__.Person) was called.

Thanks in advance!

@samuelcolvin
Member

Best to use ORM mode.

I have no personal interest in ORMs or supporting them - long term they're a mistake, you'd do much better to just write SQL.

However, if there's some easy way to support mapper that requires a relatively small change to pydantic, I'd be happy to review a PR.

@HymanZHAN
Author

Thank you for the quick response!
Wish I had the SQL skill level to say that lol. 🤦‍♂️ I have worked on a Django project where most database operations were written in raw SQL, and it was not a pleasant experience. I found it to be a great maintenance burden, not only for me but also for newer team members.
I will take a look and see what I can do when I am more settled with my current project. I shall experiment with BaseModel and ORM Mode as well. Thanks again!

@samuelcolvin
Member

samuelcolvin commented Dec 9, 2019

Django's ORM is kind of a special case, since:

  1. it's integrated with the entire framework
  2. it doesn't try to be a shallow layer on top of SQL, but rather starts from the bottom up with its own data approach; see Support for (de-)serializing SQLAlchemy models #491 (comment)

Good luck with from_orm; that's the approach most people take.
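
For anyone landing here later, a minimal sketch of that approach (class names here are illustrative, not from this thread): a declarative SQLAlchemy model owns persistence, and a separate pydantic model with orm_mode owns validation.

from pydantic import BaseModel
from sqlalchemy import Column, Integer, String
from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()


class PersonDB(Base):
    """SQLAlchemy model: owns persistence."""
    __tablename__ = 'people'

    id = Column(Integer, primary_key=True)
    name = Column(String)


class PersonSchema(BaseModel):
    """pydantic model: owns validation."""
    id: int
    name: str

    class Config:
        orm_mode = True  # enables PersonSchema.from_orm(...)


# row = session.query(PersonDB).first()
# person = PersonSchema.from_orm(row)  # a validated copy of the row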

@HymanZHAN
Author

HymanZHAN commented Dec 9, 2019

From my personal experience, the major difference between SQLAlchemy and the Django ORM is really the difference between Active Record and Data Mapper. (Or maybe I am just not experienced enough to spot/appreciate the others lol. 🤦‍♂️)

I really disliked SQLAlchemy when I first started Python web dev with Flask & Flask-SQLAlchemy, and came to like the Django ORM quite a lot when I started using Django at my new job. But using the Django ORM means tight data coupling. With some background in .NET Core and EF Core, and having recently learned more about software architecture, I have slowly come to appreciate many of the design choices SQLAlchemy makes. The syntax is not the best for sure, but many things just make more sense now.

Edit: SQLAlchemy Core also has some very useful tools for building queries, including helpers for using textual SQL.

@sandys

sandys commented Apr 17, 2020

I have closed my other issue (#1403) and would like to add my request here.
It would be great if there were some way to support SQLAlchemy's mapper mode. That would let us avoid defining models twice, once for SQLAlchemy and once for pydantic, and would be a huge help when using FastAPI.

@sandys

sandys commented Apr 17, 2020

A good compromise would be if pydantic could take a standard Python dataclass and convert it into a new pydantic dataclass.
Right now, I'm doing this:

from sqlalchemy import Table, Text, MetaData, Column, Integer, String, ForeignKey, create_engine
from sqlalchemy.orm import mapper, relationship, sessionmaker

from typing import Optional
from pydantic import  EmailStr
from pydantic.dataclasses import dataclass as py_dataclass
from dataclasses import dataclass

metadata = MetaData()

audit_mixin = Table('audit_mixin', metadata,
                    Column('id', Integer, primary_key=True),
                    Column('performed_by', Integer, nullable=True))

user = Table('user', metadata,
            Column('user_id', Integer, primary_key=True),
            Column('name', String(50)),
            Column('email', String(100)),
            Column('fullname', String(50)),
            Column('nickname', String(12))
        )


@dataclass
class AuditMixin:
    id: int
    performed_by: int 

@py_dataclass
class AuditMixinPy(AuditMixin):
    pass

@dataclass
class User(AuditMixin):
    user_id: int
    identity: str
    identity_type: str
    row_status: str
    comments: str

@py_dataclass
class UserPy(User):
    pass


engine = create_engine("sqlite:///", echo=False)
metadata.create_all(engine)

session = sessionmaker(engine)()

mapper(AuditMixin, audit_mixin)
mapper(User, user)

u = User(100, 123, 1, "a", "dfd", "dfdf", "dfdd")
u.id            # inherited from AuditMixin
u.performed_by
session.add(u)
session.commit()

u = User("sss",123,1, "a","dfd","dfdf",1)
u = UserPy("sss",123,1, "a","dfd","dfdf",1)


@denisSurkov

denisSurkov commented Apr 21, 2020

Hi!

This is a really interesting feature, and I would like to give it some thought. I was playing around with the sqlalchemy mapper and pydantic, and the main problem was:

@no_type_check
def __setattr__(self, name, value):
    ...
    if not hasattr(self, '__fields_set__'):
        object.__setattr__(self, '__fields_set__', set())

This will cause some problems with __eq__, because __eq__ checks all fields, but sqlalchemy loads some private attributes. @samuelcolvin, what do you think about this approach? As far as I can see, no tests failed.

Yes, sqlalchemy users would have to rewrite __eq__ of their pydantic BaseModels, but I think it's much easier for developers.


@denisSurkov

Some dirty hacks, but I hope you understand the idea.

In pydantic/main.py:

class BaseModel(...):
    ...

    @no_type_check
    def __setattr__(self, name, value):
        if self.__config__.extra is not Extra.allow and name not in self.__fields__:
            raise ValueError(f'"{self.__class__.__name__}" object has no field "{name}"')
        elif not self.__config__.allow_mutation:
            raise TypeError(f'"{self.__class__.__name__}" is immutable and does not support item assignment')
        elif self.__config__.validate_assignment:
            known_field = self.__fields__.get(name, None)
            if known_field:
                value, error_ = known_field.validate(value, self.dict(exclude={name}), loc=name, cls=self.__class__)
                if error_:
                    raise ValidationError([error_], self.__class__)
        self.__dict__[name] = value
        # these lines were added
        if not hasattr(self, '__fields_set__'):
            object.__setattr__(self, '__fields_set__', set())
        # end
        self.__fields_set__.add(name)

    ...

And in my application code:

import json
import typing as t

from pydantic import BaseModel, Extra, Field
from sqlalchemy import JSON, Column, Integer, MetaData, Table
from sqlalchemy.orm import mapper

metadata = MetaData()

agents = Table(
    'agents', metadata,
    Column('someid', Integer, primary_key=True, index=True),
    Column('raw_json', JSON)
)


def start_mappers():
    mapper(agent_models.Agent, agents)


class Agent(BaseModel):
    someid: int = Field(...)
    raw_json: t.Optional[t.Dict[str, t.Any]] = Field(None)

    def __eq__(self, other):
        if isinstance(other, Agent):
            return self.someid == other.someid
        else:
            return False

    class Config:
        extra = Extra.allow
        allow_mutation = True
        orm_mode = True


def test_agent_mapper_can_load_lines(session):
    some_dict = {
        "key": "value"
    }

    raw_json = json.dumps(some_dict)

    session.execute(
        'INSERT INTO agents (someid, raw_json) VALUES '
        '(1, NULL ),'
        f'(2, :raw_json)', dict(raw_json=raw_json, )
    )
    session.commit()

    expected = [
        models.Agent(someid=1, raw_json={}),
        models.Agent(someid=2, raw_json=some_dict)
    ]

    assert session.query(models.Agent).all() == expected

@saisasidhar

saisasidhar commented May 12, 2020

It is indeed quite an interesting use case to use @dataclass with classical mapping.
I'm using pydantic for request validation (this dataclass instance is sent around in the application), but I also wanted to persist the values in this pydantic instance. So here's my solution/hack to implement this transparently.

Model definitions:

from dataclasses import dataclass

from pydantic import confloat
from pydantic.dataclasses import dataclass as pydantic_dataclass

@pydantic_dataclass
class PydanticRequest:
    test: confloat(ge=10, le=100)


@dataclass  # This then takes the annotations and builds a new dataclass ;)
@duplicates_pydantic_dataclass(PydanticRequest)  # This prepares __annotations__ dynamically
class Request:
    pass

And the decorator that does the job:

def duplicates_pydantic_dataclass(pydantic_cls):
    def wrapper(cls):
        cls.__annotations__ = {}
        for field_key, field_value in pydantic_cls.__dataclass_fields__.items():
            cls.__annotations__[field_key] = field_value.type
        return cls
    return wrapper

This way, I transparently use pydantic for validation:

pr = PydanticRequest(test=11.0)

while the SQLAlchemy mappers are defined for Request!

For CRUD operations, I have the following:

from dataclasses import asdict
def instantiate_with_data(cls, instance):
    return cls(**asdict(instance))

>>> pr = PydanticRequest(test=11.0)
>>> r = instantiate_with_data(Request, pr)
>>> session.add(r)

It works well enough for simple objects and mappings, though.

@samuelcolvin
Member

Just to add since it isn't mentioned here:

Since v1.7, pydantic has supported automatically creating pydantic dataclasses from stdlib dataclasses; see #1817
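
A minimal sketch of what that looks like (User and its fields are illustrative, not from this thread):

import dataclasses

import pydantic


@dataclasses.dataclass
class User:
    id: int
    name: str = 'John Doe'


# Wrapping an existing stdlib dataclass yields a validating pydantic
# dataclass without redefining any fields:
ValidatedUser = pydantic.dataclasses.dataclass(User)

ValidatedUser(id='1')       # '1' is coerced to the int 1
# ValidatedUser(id='nope')  # raises a ValidationError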

@DomWeldon

I think I've followed most of the discussion so far, but I don't think that supporting pydantic dataclass creation from a standard dataclass would solve my problem. I like pydantic dataclasses and am using them as part of my validation elsewhere, so I'd rather keep pydantic dataclasses with their extra validation, and hopefully stick as close as possible to PEP 557: keeping dataclasses as close as possible to "normal" Python classes and not interfering with the usage of the class otherwise.

A minimal(ish) test case for my issue is here: https://github.com/DomWeldon/pydantic-dataclasses-sqlalchemy

As far as I can tell, there is something going on in the implementation of a pydantic dataclass that alters the way SQLAlchemy sets up either the __new__ or __init__ methods on new instances of a model, which means that a new instance of a model has no _sa_instance_state, whereas one based on a stdlib dataclass does.

This feels fixable, even if just by some kind of monkeypatch via a class decorator for now, in advance of a proper PR to fix the issue in the future. I thought it had to do with the __post_init__ hook being messed up, but a cursory look at the SQLA repo suggests this isn't the case, though I may be wrong on that.

@samuelcolvin I understand your opinions/frustrations on ORMs, and in the past have shared them (and indeed still share some). It's an interesting academic argument about the relative merits of mapper vs. active record, and about whether or not ORMs are worth the complexity they add. My hot take is that they are: I like to reason about my database structures as Python objects to automate most things, and have my team focus on writing the more complex edge cases (i.e., writing SQL that matters). This is actually something that, given our geographies, I'd love to chat about over a beer in London sometime.

That said, SQLAlchemy is one of the two leading ORMs used by Pythonistas around the world; it follows a sensible architecture pattern, is well supported, and is basically essential to the Python ecosystem. Pydantic is also a popular library, and on the up, so easier integration between the two seems like a good idea. It's also blocking a couple of my projects at the moment, so it's something I want to fix, at least for my code, ASAP!

I would be more than happy (read: keen) to put together a draft PR for either SQLAlchemy or pydantic, or at least post a gist on how to fix and work around this issue, but I could do with a pointer to exactly where the problem currently lies.

If anyone can add some detail to this issue before tomorrow afternoon, I would be enormously grateful; otherwise, I'll update once I've had a chance to find out what the exact issue is.

@zzzeek

zzzeek commented Jul 26, 2021

As far as I can tell, there is something going on in the implementation of a pydantic dataclass that alters the way SQLAlchemy sets up either the __new__ or __init__ methods on new instances of a model, which means that a new instance of a model has no _sa_instance_state, whereas one based on a stdlib dataclass does.

I don't have many resources to look into this, but when I did, the main thing I observed was that pydantic seems to want to write into obj.__dict__ directly. That's normally fine; however, SQLAlchemy's whole approach is based on Python descriptors, so that it can intercept setattr() and similar calls before they populate __dict__. That's the main problem: the pydantic base seems to have several areas where it circumvents the descriptor protocol (likely because Python attribute access is slow, indeed). An alternate pydantic base could likely work around most or all of these issues, but it got pretty far into the weeds when I tried to do it a while back.

@DomWeldon

Thank you @zzzeek, this is very helpful and interesting, and I really appreciate your time! If it's easy, do you have a source code or docs link for SQLA's use of the descriptor protocol? I've long introspected SQLA to see how it defines my models and tables, but I'm a bit hazy on exactly how it works in relation to instances, and it would give me a head start to have some docs on the ideas behind it.

The problem you described makes it sound like it could be trickier than planned, but I think the idea behind pydantic dataclasses is that they're very similar to standard dataclasses, just adding a few dunder attributes such as __pydantic_model__. I noticed a few caveat-comments in your implementation of the code to parse dataclasses for 1.4, and I wonder if it's related to these or to something entirely different. Any thoughts or braindumps you can provide are super useful.

Honestly, I'm looking forward to getting stuck into this tomorrow. Any notes / headstarts very much welcome.

@zzzeek

zzzeek commented Jul 27, 2021

SQLAlchemy's descriptor logic is based on the descriptor class InstrumentedAttribute:

https://github.com/sqlalchemy/sqlalchemy/blob/master/lib/sqlalchemy/orm/attributes.py#L447

that's where all userspace attribute operations happen.

SQLAlchemy's use of descriptors can be customized heavily via some little-used extension points, but there still needs to be some level at which the "set" can be intercepted.
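
To make that concrete, here is a minimal illustration (a throwaway Thing class, not anything from this thread) of what classical mapping installs on a class:

from sqlalchemy import Column, Integer, MetaData, String, Table
from sqlalchemy.orm import mapper
from sqlalchemy.orm.attributes import InstrumentedAttribute

metadata = MetaData()
things = Table('things', metadata,
               Column('id', Integer, primary_key=True),
               Column('name', String))


class Thing:  # a plain class, no dataclass machinery involved
    pass


mapper(Thing, things)

# After mapping, each column attribute on the class is a descriptor:
assert isinstance(Thing.name, InstrumentedAttribute)

t = Thing()
t.name = 'x'               # intercepted by InstrumentedAttribute.__set__
t.__dict__['name'] = 'y'   # bypasses the descriptor; SQLAlchemy never sees it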

The attempt I made to make some of this work can be seen at:

https://gist.github.com/zzzeek/bcccc70393f14a190e849878ed63e419

@DomWeldon

Thanks Mike, this has been really helpful. Unfortunately I've only been able to look at this late today, so it will be a day or so until I find a solution to my issue.

Based on your code and notes, I think the issue is the line below from pydantic/dataclasses.py:101, but I haven't had a chance to properly investigate what the precise problem is yet and how it manifests.

    object.__setattr__(self, '__dict__', d)

source
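
As a standalone illustration of why that line is destructive (Demo and the state value below are stand-ins, not SQLAlchemy internals): anything previously stored in the instance __dict__, including _sa_instance_state, is discarded when __dict__ is reassigned wholesale.

class Demo:
    pass


d = Demo()
d._sa_instance_state = object()  # stand-in for SQLAlchemy's per-instance state
d.__dict__ = {'name': 'x'}       # wholesale replacement, as in the line above
assert not hasattr(d, '_sa_instance_state')  # the state is gone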

My implementation of building a dataclass from a pydantic schema is slightly different from your gist, in that I've defined my models as (pydantic) dataclasses, thinking that, in the spirit of PEP 557, there would be minimal interference between the two libraries on what should be essentially just a plain old Python class.

However, it seems that since (of course) both libraries need to alter behaviour around __init__(), they are effectively competing/clashing with each other, and the line above, which is executed after __init__() is called, effectively removes all SQLAlchemy behaviour from the instance. (As yet I haven't proven that it removes behaviour which has already been attached to the object; this is more of a hunch at this stage, but it's getting late, so I'll prove or disprove that hunch tomorrow.)

Thinking aloud, and without the chance to properly plan or sketch out a solution, I think a potential longer-term candidate for a fix could be to make pydantic's implementation of dataclasses customizable, such that operations which directly alter (and/or access?) __dict__ on a dataclass can be configured, even if it's just supplying a callable to access or modify __dict__ that could protect a certain namespace, or check that pydantic is operating only in the expected namespace and modify behaviour if it isn't. You might even be able to do this already using the descriptor protocol on __dict__ and/or altering the MRO, but I think that's a sure way to head for a mental stack overflow (if you're reading this in 6 months and wondering if it might be the solution, please don't do this!).

I'll take a closer look again tomorrow and sketch out some thoughts as I try to find a solution to my issue.

As always, any further input/comments much appreciated. Thanks again for your help @zzzeek, and to both Mike and @samuelcolvin for writing and maintaining such amazing open source libraries.

@DomWeldon

The fix (in my example) turns out to be remarkably simple using a monkey patch.

https://github.com/DomWeldon/pydantic-dataclasses-sqlalchemy/blob/8ddab913816702dd5ec6a5a6de1ad9ee9b89164d/pdsqla/dcs.py#L36

I'm not sure whether reassigning __dict__ in the post-init function should be counted as a bug or not, so I will outline some thinking in a new issue with proposals for a fix/change.
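
For anyone who can't follow the link, the general shape of such a patch (a sketch of the idea under the assumptions discussed above, not the linked repo's exact code; preserve_sa_state is a made-up name) is to capture _sa_instance_state before pydantic's __init__ runs and re-attach it afterwards:

import functools


def preserve_sa_state(cls):
    """Sketch: wrap __init__ so _sa_instance_state survives pydantic's
    __dict__ reassignment. Apply to the class before calling mapper()."""
    original_init = cls.__init__

    @functools.wraps(original_init)
    def __init__(self, *args, **kwargs):
        # SQLAlchemy's instrumentation sets the state before delegating here.
        state = self.__dict__.get('_sa_instance_state')
        original_init(self, *args, **kwargs)
        # pydantic's post-init may have replaced __dict__; put the state back.
        if state is not None:
            self.__dict__.setdefault('_sa_instance_state', state)

    cls.__init__ = __init__
    return cls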
