support efficient recursion for loader options #8126
Sounds cool!
Mike Bayer has proposed a fix for this issue in the main branch: WIP: add auto_recurse option to selectinload https://gerrit.sqlalchemy.org/c/sqlalchemy/sqlalchemy/+/3920
I think it would make sense to bail out at a certain depth, like 10 by default, making it configurable. Shall I reopen this?
Not sure. You can make a chain of ten selectinload options that would do that, so if someone wants a fixed depth we already have that, though a number is easier to enter. There is also join_depth on relationship(), which does this at the class level for self-referential relationships. There are a bunch of APIs here and I think we would need to be very careful about just throwing down more keywords and such without considering the whole thing in totality, like whether the option should use the relationship-configured join_depth automatically, stuff like that.
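To illustrate the "chain of ten selectinload options" idea mentioned above, here is a minimal runnable sketch that builds a fixed-depth chain programmatically. The Node model and chained_selectinload helper are illustrative names, not anything in SQLAlchemy itself; this assumes SQLAlchemy 1.4+ and a self-referential one-to-many relationship.

```python
import sqlalchemy as sa
from sqlalchemy import orm

Base = orm.declarative_base()

class Node(Base):
    __tablename__ = "node"
    id = sa.Column(sa.Integer, primary_key=True)
    parent_id = sa.Column(sa.ForeignKey("node.id"))
    # self-referential one-to-many; default direction works without remote_side
    children = orm.relationship("Node")

def chained_selectinload(attr, depth):
    # builds selectinload(attr).selectinload(attr)... "depth" levels deep,
    # equivalent to writing the chain out by hand
    opt = orm.selectinload(attr)
    for _ in range(depth - 1):
        opt = opt.selectinload(attr)
    return opt

e = sa.create_engine("sqlite://")
Base.metadata.create_all(e)
with orm.Session(e) as s:
    s.add(Node(id=1, children=[Node(id=2, children=[Node(id=3)])]))
    s.commit()
    node = s.execute(
        sa.select(Node)
        .filter(Node.id == 1)
        .options(chained_selectinload(Node.children, 3))
    ).scalar_one()

# all three levels were eagerly loaded, so this works on the detached object
print(node.children[0].children[0].id)
```

The chain gives a hard depth limit for free: anything below the last selectinload in the chain falls back to the relationship's default (lazy) loading.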
Forgot about join_depth. Also, reading the docs for join_depth, they seem to contradict each other:
I think, given that the user who wanted this ended up having a non-working use case anyway, we should keep this at one parameter and have it be something like
It doesn't say there's no limit; it says something that's hard to understand, since it refers to the implementation: "When left at its default of None, eager loaders will stop chaining when they encounter the same target mapper which is already higher up in the chain." So say I am Node, I'm now loading Node.children, now I'm on Node again, I come and see Node.children; that's "encountered the same target mapper which is already higher up in the chain". So we stop. So in that case, "join_depth" was one. However, if we look at "join_depth" as "the number of times we can load this same relationship nested", basically we can say that join_depth=None is the same as join_depth=1. It still contradicts what's at https://docs.sqlalchemy.org/en/14/orm/self_referential.html#configuring-self-referential-eager-loading and I'm not sure if that's maybe how it was first implemented, though I can't find a note where it might have been changed, not sure.
OK, re-reading it, it does make sense. Not sure why I read "the same instance" instead of "the same mapper".
yes, maybe using
OK, so I think we should also add a warning to all the loaders that will emit when any kind of path is just too long. The very long paths are just not architecturally planned, and someone should not be going more than 25 levels deep anywhere. Then we need to enhance the two postloaders in question here to somehow pass through the recursion depth without using the cache. What's hard there is that I have to review all the ways that "current_path" is used and make sure every case is covered.
Mike Bayer has proposed a fix for this issue in the main branch: WIP: try to see if recursive loading can use a counter https://gerrit.sqlalchemy.org/c/sqlalchemy/sqlalchemy/+/3931
I've reverted b3a1162, as the cache key issue makes that feature more or less a non-starter for now.
the worst case recursive test from #8142:

import sqlalchemy as sa
from sqlalchemy import orm
R = orm.registry()
@R.mapped
class N:
__tablename__ = "n"
id = sa.Column(sa.Integer, primary_key=True)
parent_id = sa.Column(sa.ForeignKey("n.id"))
children = orm.relationship("N", back_populates="parent", remote_side=parent_id)
parent = orm.relationship("N", back_populates="children", remote_side=id)
e = sa.create_engine("sqlite:///:memory:", echo=True)
R.metadata.create_all(e)
with e.begin() as c:
c.execute(
N.__table__.insert(),
[{"id": i, "parent_id": i - 1 if i > 0 else None} for i in range(200)],
)
def stack_selectinload(depth: int):
if depth == 1:
return orm.selectinload(N.children)
return stack_selectinload(depth - 1).selectinload(N.children)
def stack_joinedload(depth: int):
if depth == 1:
return orm.joinedload(N.children)
return stack_joinedload(depth - 1).joinedload(N.children)
def stack_subqueryload(depth: int):
if depth == 1:
return orm.subqueryload(N.children)
return stack_subqueryload(depth - 1).subqueryload(N.children)
with orm.Session(e) as s:
# fine
# stmt = sa.select(N).filter(N.id == 1).options(stack_joinedload(50))
# sqlite bails but it generates quickly
# stmt = sa.select(N).filter(N.id == 1).options(stack_joinedload(200))
# does not cache at all. always generated. fast
#stmt = sa.select(N).filter(N.id == 1).options(stack_selectinload(5))
# does not cache at all. always generated. fast
# stmt = sa.select(N).filter(N.id == 1).options(stack_selectinload(25))
# does not cache at all. always generated. starts slowing
# stmt = sa.select(N).filter(N.id == 1).options(stack_selectinload(50))
# does not cache at all. always generated. slowing
# stmt = sa.select(N).filter(N.id == 1).options(stack_selectinload(75))
# does not cache at all. always generated. slow
# stmt = sa.select(N).filter(N.id == 1).options(stack_selectinload(100))
# does not cache at all. always generated. very slow
stmt = sa.select(N).filter(N.id == 1).options(stack_selectinload(150))
# does not cache at all. always generated. very very slow.
# Also does ~10 fast, then pauses for some time (guessing gc). large memory use, ~2gb
# stmt = sa.select(N).filter(N.id == 1).options(stack_selectinload(150))
# also slow (also sqlite bails at 64 tables)
# stmt = sa.select(N).filter(N.id == 1).options(stack_subqueryload(63))
# very slow (also sqlite bails at 64 tables)
# stmt = sa.select(N).filter(N.id == 1).options(stack_subqueryload(200))
# Not tried the other loaders
result = s.execute(stmt)
print(result.unique().all())
result = s.execute(stmt)
print(result.unique().all())
So I still held off on making this work for more than immediately self-referential relationships. I think overall this option might not be that popular anyway, because this has never really been asked for. However, the idea is that asyncio might make it more popular than before.
Have you also managed to fix the memory usage?
If recursion_depth is used, then it certainly should. The memory issue was from generating a long chain of hundreds of loader options that were then multiplied into that many Load objects; then cache keys had to be generated from each of those, and the whole thing went into the cache, so the 1G figure is very plausible. The implementation now uses just one Load object with two elements inside that are fixed, paths don't grow beyond four tokens, the first statement is cached, and that's it. If you don't use recursion_depth and actually string together dozens of loader objects, something I don't think anyone was doing other than me, then the memory use is still very wasteful, but at least it turns off caching so it's not nearly as bad.
small regression caused by this issue:

from sqlalchemy import *
from sqlalchemy.ext.hybrid import hybrid_property
from sqlalchemy.orm import *
Base = declarative_base()
class A(Base):
__tablename__ = 'a'
id = Column(Integer, primary_key=True)
name = Column(String)
bs = relationship("B", lazy="selectin")
class B(Base):
__tablename__ = 'b'
id = Column(Integer, primary_key=True)
a_id = Column(ForeignKey("a.id"))
data = Column(String)
e = create_engine("sqlite://", echo=True)
Base.metadata.create_all(e)
s = Session(e)
a1 = A(bs=[B()])
s.add(a1)
s.commit()
s.expire(a1, ["name"])
s.refresh(a1, ["name", 'bs'])
Mike Bayer has proposed a fix for this issue in the main branch: include pk cols in refresh() operations unconditionally https://gerrit.sqlalchemy.org/c/sqlalchemy/sqlalchemy/+/4311 |
A series of changes and improvements regarding :meth:`_orm.Session.refresh`. The overall change is that primary key attributes for an object are now included in a refresh operation unconditionally when relationship-bound attributes are to be refreshed, even if not expired and even if not specified in the refresh.

* Improved :meth:`_orm.Session.refresh` so that if autoflush is enabled (as is the default for :class:`_orm.Session`), the autoflush takes place at an earlier part of the refresh process so that pending primary key changes are applied without errors being raised. Previously, this autoflush took place too late in the process and the SELECT statement would not use the correct key to locate the row, and an :class:`.InvalidRequestError` would be raised.

* When the above condition is present, that is, unflushed primary key changes are present on the object but autoflush is not enabled, the refresh() method now explicitly disallows the operation to proceed, and an informative :class:`.InvalidRequestError` is raised asking that the pending primary key changes be flushed first. Previously, this use case was simply broken and :class:`.InvalidRequestError` would be raised anyway. This restriction is so that it's safe for the primary key attributes to be refreshed, as is necessary for the case of being able to refresh the object with relationship-bound secondary eagerloaders also being emitted. This rule applies in all cases to keep API behavior consistent regardless of whether or not the PK cols are actually needed in the refresh, as it is unusual to be refreshing some attributes on an object while keeping other attributes "pending" in any case.

* The :meth:`_orm.Session.refresh` method has been enhanced such that attributes which are :func:`_orm.relationship`-bound and linked to an eager loader, either at mapping time or via last-used loader options, will be refreshed in all cases even when a list of attributes is passed that does not include any columns on the parent row. This builds upon the feature first implemented for non-column attributes as part of :ticket:`1763` fixed in 1.4, allowing eagerly-loaded relationship-bound attributes to participate in the :meth:`_orm.Session.refresh` operation. If the refresh operation does not indicate any columns on the parent row to be refreshed, the primary key columns will nonetheless be included in the refresh operation, which allows the load to proceed into the secondary relationship loaders indicated, as it does normally. Previously an :class:`.InvalidRequestError` would be raised for this condition (:ticket:`8703`)

* Fixed issue where an unnecessary additional SELECT would be emitted in the case where :meth:`_orm.Session.refresh` were called with a combination of expired attributes, as well as an eager loader such as :func:`_orm.selectinload` that emits a "secondary" query, if the primary key attributes were also in an expired state. As the primary key attributes are now included in the refresh automatically, there is no additional load for these attributes when a relationship loader goes to select for them (:ticket:`8997`)

* Fixed regression caused by :ticket:`8126` released in 2.0.0b1 where the :meth:`_orm.Session.refresh` method would fail with an ``AttributeError``, if passed both an expired column name as well as the name of a relationship-bound attribute that was linked to a "secondary" eagerloader such as the :func:`_orm.selectinload` eager loader (:ticket:`8996`)

Fixes: #8703
Fixes: #8996
Fixes: #8997
Fixes: #8126
Change-Id: I88dcbc0a9a8337f6af0bc4bcc5b0261819acd1c4
given the test case in #8125, we can make selectinload recurse arbitrarily deep automatically like this:
that is, when it makes the new query, locate the loader option that's used for this load, then copy it with a shifted path.
The recursion is stopped for two reasons. One is that once selectinload has no ids to load, it stops, so there's no extra recursion. For loops within the object graph, in theory, when it comes across an object again, that object should already be present in the identity map and it will not try to re-load it.
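The two stopping conditions just described can be sketched in plain Python, independent of SQLAlchemy internals (the names and the toy id graph here are purely illustrative):

```python
# A toy child-id mapping with a deliberate cycle: 4 points back to 1.
CHILDREN = {1: [2, 3], 2: [4], 3: [], 4: [1]}

def load_recursively(parent_ids, identity_map):
    """Mimics the two stopping conditions: empty id sets and identity-map hits."""
    if not parent_ids:                  # condition 1: no ids left to load
        return
    next_ids = []
    for pid in parent_ids:
        for cid in CHILDREN.get(pid, []):
            if cid in identity_map:     # condition 2: object already loaded
                continue
            identity_map.add(cid)
            next_ids.append(cid)
    load_recursively(next_ids, identity_map)

seen = {1}
load_recursively([1], seen)
# the cycle back to 1 does not cause infinite recursion
print(sorted(seen))
```

Every object is visited exactly once, so even a cyclic graph terminates after one pass over the reachable rows.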
If we put this under a very experimental / alpha / no-guarantees kind of parameter on selectinload, we can introduce this gradually without any immediate pressure.