support efficient recursion for loader options #8126

zzzeek · 2022-06-11T14:39:08Z

given the test case in #8125, we can make selectinload recurse arbitrarily deep automatically like this:

diff --git a/lib/sqlalchemy/orm/strategies.py b/lib/sqlalchemy/orm/strategies.py
index c4c0fb180f..68e916e7cb 100644
--- a/lib/sqlalchemy/orm/strategies.py
+++ b/lib/sqlalchemy/orm/strategies.py
@@ -3003,6 +3003,11 @@ class SelectInLoader(PostLoader, util.MemoizedSlots):
                 ),
             )
 
+        if loadopt in new_options[0].context:
+            recursive_opt = loadopt._prepend_path_from(new_options[0])
+
+            new_options[0].context += (recursive_opt, )
+
         q = q.options(*new_options)._update_compile_options(
             {"_current_path": effective_path}
         )

that is, when it makes the new query, locate the loader option that's used for this load, then copy it with a shifted path.

the recursion is stopped for two reasons: one is once selectinload has no ids to load, it stops. so there's no extra recursion. for loops within objects, in theory, when it comes across an object again, it should be already present in the identity map and it will not try to re-load it.

if we put this under a very experimental /alpha / no guarantees kind of parameter on selectinload, we can introduce this gradually without any immediate pressure.

0dminnimda · 2022-06-11T14:46:40Z

Sounds cool!

sqla-tester · 2022-06-11T15:34:26Z

Mike Bayer has proposed a fix for this issue in the main branch:

WIP: add auto_recurse option to selectinload https://gerrit.sqlalchemy.org/c/sqlalchemy/sqlalchemy/+/3920

CaselIT · 2022-06-15T21:32:49Z

I think it would make sense to bail out at certain depth. Like 10 by default making it configurable.

shall I reopen this?

zzzeek · 2022-06-15T21:40:57Z

not sure. you can make a chain of ten selectinload options that would do that, so if someone wants a fixed depth we already have that, though a number is easeir to enter. There is also join_depth on relationship() which does this at the class level for self-referential relationships. There's a bunch of APIs here and I think we would need to be very careful about just throwing down more keywords and such without considering the whole thing in totality. like should the option use the relatoinship-configured join_depth automatically, stuff like that.

CaselIT · 2022-06-15T21:57:28Z

Forgot about join_depth.
Yes I think it could use that by default. Not sure if we want to allow override of that value in the seledtin. Maybe just to allow no limit with something like auto_recurse='nolimit'?

Also reading the docs for join depth they seem to contradict each other:
Here says that when None it does not eager load, while here says that the default None is no limit

zzzeek · 2022-06-16T02:03:29Z

I think given that the user who wanted this ended up having a non-working use case anyway, we should keep this at one parameter and have it be something like recursion_depth, then it can be set to integer for specific depth, -1 for unlimited, None for no recursion. i dont think it's that useful that it try to read relationship(join_depth) though if it were to do that, we'd have another constant I guess. I dont really like the Union[Literal['x', 'y'], int, None] thing too much.

zzzeek · 2022-06-16T02:15:08Z

it doesn't say there's no limit, it says something that's hard to understand what it means, since it refers to the implementation: "When left at its default of None, eager loaders will stop chaining when they encounter a the same target mapper which is already higher up in the chain. "

so say I am Node, I'm now loading Node.children, now im on Node again, I come and see Node.children, that's "encountered the same target mapper which is already higher up in the chain". So we stop. So in that case, "join_depth" was one.

however, say I have Node, which has edges that refers to Edge, which has right_node that refers to Node. Here I load Node.edges, I can then load Edge.right_node, I come back to Node.edges, that is again "encountered the same mapper that's already higher up". So the "join_depth" is technically two here as far as how deep it's joining since it's joining Node -> Edge -> Node.

but if we look at "join_depth" as, "number of times we can load this same relationship nested", basically we can say that join_depth=None is the same as join_depth=1.

it still contradicts what's at https://docs.sqlalchemy.org/en/14/orm/self_referential.html#configuring-self-referential-eager-loading and im not sure if that's maybe how it was first implemented, though i cant find a note where it might have been changed, not sure.

CaselIT · 2022-06-16T06:47:22Z

Ok re-reading it it does make sense. Not sure why I read the same Instance instead of the same Mapper.

CaselIT · 2022-06-16T07:47:41Z

I think given that the user who wanted this ended up having a non-working use case anyway, we should keep this at one parameter and have it be something like recursion_depth, then it can be set to integer for specific depth, -1 for unlimited, None for no recursion. i dont think it's that useful that it try to read relationship(join_depth) though if it were to do that, we'd have another constant I guess. I dont really like the Union[Literal['x', 'y'], int, None] thing too much.

yes maybe using recustion_depth (technically max_recursion_depth) is the best option.
Regarding union with literal I don't mind them if there are a small number of options (1 or 2) but I agree they are not the best

zzzeek · 2022-06-16T14:48:17Z

OK so I think also we should add a warning to all the loaders that will emit when any kind of path is just too long. the very long paths are just not architecturally planned and someone should not be going more than 25 levels deep anywhere.

then we need to enhance the two postloaders in question here to somehow pass through the recursion depth without using the cache. what's hard there is I have to review all the ways that "current_path" is used and make sure every case is covered.

sqla-tester · 2022-06-16T17:36:21Z

Mike Bayer has proposed a fix for this issue in the main branch:

WIP: try to see if recursive loading can use a counter https://gerrit.sqlalchemy.org/c/sqlalchemy/sqlalchemy/+/3931

this option works very badly with caching and the API is likely not what we want either. Work continues for #8126 including that the additional work in I9f162e0a09c1ed327dd19498aac193f649333a01 tries to add new recursive features. This reverts commit b3a1162.

zzzeek · 2022-06-16T18:48:06Z

ive reverted b3a1162 as the cache key issue makes that feature more or less a non-starter for now.

zzzeek · 2022-06-16T18:49:22Z

the worst case recursive test from #8142:

import sqlalchemy as sa
from sqlalchemy import orm

R = orm.registry()


@R.mapped
class N:
    __tablename__ = "n"
    id = sa.Column(sa.Integer, primary_key=True)
    parent_id = sa.Column(sa.ForeignKey("n.id"))

    children = orm.relationship("N", back_populates="parent", remote_side=parent_id)
    parent = orm.relationship("N", back_populates="children", remote_side=id)


e = sa.create_engine("sqlite:///:memory:", echo=True)

R.metadata.create_all(e)
with e.begin() as c:
    c.execute(
        N.__table__.insert(),
        [{"id": i, "parent_id": i - 1 if i > 0 else None} for i in range(200)],
    )


def stack_selectinload(depth: int):
    if depth == 1:
        return orm.selectinload(N.children)
    return stack_selectinload(depth - 1).selectinload(N.children)


def stack_joinedload(depth: int):
    if depth == 1:
        return orm.joinedload(N.children)
    return stack_joinedload(depth - 1).joinedload(N.children)


def stack_subqueryload(depth: int):
    if depth == 1:
        return orm.subqueryload(N.children)
    return stack_subqueryload(depth - 1).subqueryload(N.children)


with orm.Session(e) as s:
    # fine
    # stmt = sa.select(N).filter(N.id == 1).options(stack_joinedload(50))
    # sqilte bails but it generates quickly
    # stmt = sa.select(N).filter(N.id == 1).options(stack_joinedload(200))

    # does not cache at all. always generated. fast
    #stmt = sa.select(N).filter(N.id == 1).options(stack_selectinload(5))
    # does not cache at all. always generated. fast
    # stmt = sa.select(N).filter(N.id == 1).options(stack_selectinload(25))

    # does not cache at all. always generated. starts slowing
    # stmt = sa.select(N).filter(N.id == 1).options(stack_selectinload(50))
    # does not cache at all. always generated. slowing
    # stmt = sa.select(N).filter(N.id == 1).options(stack_selectinload(75))
    # does not cache at all. always generated. slow
    # stmt = sa.select(N).filter(N.id == 1).options(stack_selectinload(100))
    # does not cache at all. always generated. very slow

    stmt = sa.select(N).filter(N.id == 1).options(stack_selectinload(150))
    # does not cache at all. always generated. very very slow.
    # Also does ~10 fast than pause for some time (guessing gc). large memory use 2gb
    # stmt = sa.select(N).filter(N.id == 1).options(stack_selectinload(150))

    # also slow (also sqlite bails at 64 tables)
    # stmt = sa.select(N).filter(N.id == 1).options(stack_subqueryload(63))
    # very slow (also sqlite bails at 64 tables)
    # stmt = sa.select(N).filter(N.id == 1).options(stack_subqueryload(200))

    # Not tried the other loaders

    result = s.execute(stmt)
    print(result.unique().all())

    result = s.execute(stmt)
    print(result.unique().all())

zzzeek · 2022-06-18T21:59:31Z

so i still held off on making this work for more than immediately self-referential relationships, i think overall this option might not be that popular anyway because this has never really been asked for. however the idea is that asyncio might make it more popular than before.

CaselIT · 2022-06-18T22:06:42Z

Have you managed to fix also the memory usage?

zzzeek · 2022-06-19T00:54:33Z

if recursion_depth is used, then it certainly should. the mem issue was from generating a long line of hundreds of loader options that were then multiplied into that many Load objects. then cache keys had to be generated from each of those, the whole thing went into the cache, the 1G thing is very plausible.

the implementation now uses just one Load object with two elements inside that are fixed, paths dont grow beyond four tokens, the first statement is cached, and that's it.

if you dont do recursion_depth and actually string together dozens of loader objects, something I dont think anyone was doing other than me, then the memory use is still very wasteful but at least it turns off caching so it's not nearly as bad.

zzzeek · 2022-12-16T20:47:12Z

small regression caused by this issue:

from sqlalchemy import *
from sqlalchemy.ext.hybrid import hybrid_property
from sqlalchemy.orm import *
Base = declarative_base()


class A(Base):
    __tablename__ = 'a'

    id = Column(Integer, primary_key=True)
    name = Column(String)
    bs = relationship("B", lazy="selectin")


class B(Base):
    __tablename__ = 'b'
    id = Column(Integer, primary_key=True)
    a_id = Column(ForeignKey("a.id"))
    data = Column(String)

e = create_engine("sqlite://", echo=True)
Base.metadata.create_all(e)



s = Session(e)

a1 = A(bs=[B()])
s.add(a1)
s.commit()

s.expire(a1, ["name"])
s.refresh(a1, ["name", 'bs'])

Traceback (most recent call last):
  File "/home/classic/dev/sqlalchemy/test4.py", line 34, in <module>
    s.refresh(a1, ["name", 'bs'])
  File "/home/classic/dev/sqlalchemy/lib/sqlalchemy/orm/session.py", line 2696, in refresh
    loading.load_on_ident(
  File "/home/classic/dev/sqlalchemy/lib/sqlalchemy/orm/loading.py", line 478, in load_on_ident
    return load_on_pk_identity(
  File "/home/classic/dev/sqlalchemy/lib/sqlalchemy/orm/loading.py", line 613, in load_on_pk_identity
    return result.one()
  File "/home/classic/dev/sqlalchemy/lib/sqlalchemy/engine/result.py", line 1673, in one
    return self._only_one_row(
  File "/home/classic/dev/sqlalchemy/lib/sqlalchemy/engine/result.py", line 715, in _only_one_row
    row: Optional[_InterimRowType[Any]] = onerow(hard_close=True)
  File "/home/classic/dev/sqlalchemy/lib/sqlalchemy/engine/result.py", line 1531, in _fetchone_impl
    return self._real_result._fetchone_impl(hard_close=hard_close)
  File "/home/classic/dev/sqlalchemy/lib/sqlalchemy/engine/result.py", line 2125, in _fetchone_impl
    row = next(self.iterator, _NO_ROW)
  File "/home/classic/dev/sqlalchemy/lib/sqlalchemy/orm/loading.py", line 216, in chunks
    post_load.invoke(context, path)
  File "/home/classic/dev/sqlalchemy/lib/sqlalchemy/orm/loading.py", line 1410, in invoke
    loader(
  File "/home/classic/dev/sqlalchemy/lib/sqlalchemy/orm/strategies.py", line 1414, in _load_for_path
    state.get_impl(key).set_committed_value(state, dict_, value)
  File "/home/classic/dev/sqlalchemy/lib/sqlalchemy/orm/attributes.py", line 1975, in set_committed_value
    collection.append_multiple_without_event(value)
  File "/home/classic/dev/sqlalchemy/lib/sqlalchemy/orm/collections.py", line 567, in append_multiple_without_event
    for item in items:
TypeError: 'LoaderCallableStatus' object is not iterable

sqla-tester · 2022-12-16T22:53:33Z

Mike Bayer has proposed a fix for this issue in the main branch:

include pk cols in refresh() operations unconditionally https://gerrit.sqlalchemy.org/c/sqlalchemy/sqlalchemy/+/4311

A series of changes and improvements regarding :meth:`_orm.Session.refresh`. The overall change is that primary key attributes for an object are now included in a refresh operation unconditionally when relationship-bound attributes are to be refreshed, even if not expired and even if not specified in the refresh. * Improved :meth:`_orm.Session.refresh` so that if autoflush is enabled (as is the default for :class:`_orm.Session`), the autoflush takes place at an earlier part of the refresh process so that pending primary key changes are applied without errors being raised. Previously, this autoflush took place too late in the process and the SELECT statement would not use the correct key to locate the row and an :class:`.InvalidRequestError` would be raised. * When the above condition is present, that is, unflushed primary key changes are present on the object, but autoflush is not enabled, the refresh() method now explicitly disallows the operation to proceed, and an informative :class:`.InvalidRequestError` is raised asking that the pending primary key changes be flushed first. Previously, this use case was simply broken and :class:`.InvalidRequestError` would be raised anyway. This restriction is so that it's safe for the primary key attributes to be refreshed, as is necessary for the case of being able to refresh the object with relationship-bound secondary eagerloaders also being emitted. This rule applies in all cases to keep API behavior consistent regardless of whether or not the PK cols are actually needed in the refresh, as it is unusual to be refreshing some attributes on an object while keeping other attributes "pending" in any case. * The :meth:`_orm.Session.refresh` method has been enhanced such that attributes which are :func:`_orm.relationship`-bound and linked to an eager loader, either at mapping time or via last-used loader options, will be refreshed in all cases even when a list of attributes is passed that does not include any columns on the parent row. This builds upon the feature first implemented for non-column attributes as part of :ticket:`1763` fixed in 1.4 allowing eagerly-loaded relationship-bound attributes to participate in the :meth:`_orm.Session.refresh` operation. If the refresh operation does not indicate any columns on the parent row to be refreshed, the primary key columns will nonetheless be included in the refresh operation, which allows the load to proceed into the secondary relationship loaders indicated as it does normally. Previously an :class:`.InvalidRequestError` error would be raised for this condition (:ticket:`8703`) * Fixed issue where an unnecessary additional SELECT would be emitted in the case where :meth:`_orm.Session.refresh` were called with a combination of expired attributes, as well as an eager loader such as :func:`_orm.selectinload` that emits a "secondary" query, if the primary key attributes were also in an expired state. As the primary key attributes are now included in the refresh automatically, there is no additional load for these attributes when a relationship loader goes to select for them (:ticket:`8997`) * Fixed regression caused by :ticket:`8126` released in 2.0.0b1 where the :meth:`_orm.Session.refresh` method would fail with an ``AttributeError``, if passed both an expired column name as well as the name of a relationship-bound attribute that was linked to a "secondary" eagerloader such as the :func:`_orm.selectinload` eager loader (:ticket:`8996`) Fixes: #8703 Fixes: #8996 Fixes: #8997 Fixes: #8126 Change-Id: I88dcbc0a9a8337f6af0bc4bcc5b0261819acd1c4

zzzeek added feature orm loader options ORM options like joinedload(), load_only(), these are complicated and have a lot of issues labels Jun 11, 2022

zzzeek added this to the 2.0 milestone Jun 11, 2022

sqlalchemy-bot closed this as completed in b3a1162 Jun 12, 2022

zzzeek reopened this Jun 16, 2022

zzzeek mentioned this issue Jun 16, 2022

deeply recursive options are inefficient #8142

Closed

zzzeek changed the title ~~experiemental "auto recurse" feature for selectinload (maybe others too, but start small)~~ support efficient recursion for loader options Jun 16, 2022

zzzeek added bug Something isn't working hard orm hard ORM bugs mike needs to look at labels Jun 16, 2022

sqlalchemy-bot closed this as completed in 64b9d98 Jun 18, 2022

zzzeek mentioned this issue Dec 16, 2022

regression: immedaiteload fallback fails #8996

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

support efficient recursion for loader options #8126

support efficient recursion for loader options #8126

zzzeek commented Jun 11, 2022

0dminnimda commented Jun 11, 2022

sqla-tester commented Jun 11, 2022

CaselIT commented Jun 15, 2022

zzzeek commented Jun 15, 2022

CaselIT commented Jun 15, 2022 •

edited

zzzeek commented Jun 16, 2022

zzzeek commented Jun 16, 2022

CaselIT commented Jun 16, 2022

CaselIT commented Jun 16, 2022

zzzeek commented Jun 16, 2022

sqla-tester commented Jun 16, 2022

zzzeek commented Jun 16, 2022

zzzeek commented Jun 16, 2022

zzzeek commented Jun 18, 2022 •

edited

CaselIT commented Jun 18, 2022

zzzeek commented Jun 19, 2022

zzzeek commented Dec 16, 2022

sqla-tester commented Dec 16, 2022

support efficient recursion for loader options #8126

support efficient recursion for loader options #8126

Comments

zzzeek commented Jun 11, 2022

0dminnimda commented Jun 11, 2022

sqla-tester commented Jun 11, 2022

CaselIT commented Jun 15, 2022

zzzeek commented Jun 15, 2022

CaselIT commented Jun 15, 2022 • edited

zzzeek commented Jun 16, 2022

zzzeek commented Jun 16, 2022

CaselIT commented Jun 16, 2022

CaselIT commented Jun 16, 2022

zzzeek commented Jun 16, 2022

sqla-tester commented Jun 16, 2022

zzzeek commented Jun 16, 2022

zzzeek commented Jun 16, 2022

zzzeek commented Jun 18, 2022 • edited

CaselIT commented Jun 18, 2022

zzzeek commented Jun 19, 2022

zzzeek commented Dec 16, 2022

sqla-tester commented Dec 16, 2022

CaselIT commented Jun 15, 2022 •

edited

zzzeek commented Jun 18, 2022 •

edited