Serve: cannot pass a sub object that includes a dag node #38809

Closed
tchordia opened this issue Aug 23, 2023 · 4 comments · Fixed by #39015
Labels
bug: Something that is supposed to be working; but isn't
P0: Issue that must be fixed in short order
release-blocker: P0 Issue that blocks the release
serve: Ray Serve Related Issue

Comments

tchordia (Contributor)

What happened + What you expected to happen

I am passing a Serve DAG node into another object, then passing that wrapper object into another deployment's `.bind()`. I expect the Serve DAG node to be magically replaced with a Ray Serve deployment handle; instead, I get an infinite recursion.

Versions / Dependencies

Commit: 903899d933ee19159381d823a439d0e8f05a59b0

Reproduction script

from ray import serve
@serve.deployment
class Parent:
    def __init__(self, obj):
        pass

class Obj:
    def __init__(self, obj):
        self.obj = obj
        
@serve.deployment
class Child:
    pass

serve.run(Parent.bind(Obj(Child.bind())))

This produces:
[screenshot: traceback showing the infinite recursion]

Issue Severity

Medium: It is a significant difficulty but I can work around it.
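
A workaround consistent with this (a hedged sketch, not taken from the report) is to pass the bound child deployment as a top-level argument, which Serve does replace with a deployment handle, and construct the wrapper object inside the deployment:

# Hedged workaround sketch: pass the bound deployment directly so Serve
# converts it to a handle, then build the wrapper inside Parent.
from ray import serve


class Obj:
    def __init__(self, obj):
        self.obj = obj


@serve.deployment
class Child:
    pass


@serve.deployment
class Parent:
    def __init__(self, child_handle):
        # child_handle arrives as a Serve deployment handle, not a DAG node.
        self.obj = Obj(child_handle)


serve.run(Parent.bind(Child.bind()))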

tchordia added the bug and triage labels on Aug 23, 2023
edoakes (Contributor) commented Aug 23, 2023

Looks like this is an issue in the `_PyObjScanner`:

def test_replace_nested_in_obj():
    class Outer:
        def __init__(self, inner):
            self._inner = inner

    scanner = _PyObjScanner(source_type=Source)
    my_objs = [Outer(Source())]

    found = scanner.find_nodes(my_objs)
    assert len(found) == 1

    replaced = scanner.replace_nodes({obj: 1 for obj in found})
    assert replaced == [Outer(1)]

This fails on `assert len(found) == 1` because the object isn't found.

@ericl any ideas here?

edoakes added the release-blocker, P0, and serve labels and removed the triage label on Aug 28, 2023
ericl (Contributor) commented Aug 28, 2023

This seems to be because the `reducer_override` hook is not written to recursively scan non-native objects. If I make the following patch, it resolves the issue:

diff --git a/python/ray/dag/py_obj_scanner.py b/python/ray/dag/py_obj_scanner.py
index 20798b8441..195fc87207 100644
--- a/python/ray/dag/py_obj_scanner.py
+++ b/python/ray/dag/py_obj_scanner.py
@@ -82,6 +82,7 @@ class _PyObjScanner(ray.cloudpickle.CloudPickler, Generic[SourceType, Transforme
             self._found.append(obj)
             return _get_node, (id(self), index)
         else:
+            return super().reducer_override(obj)
             index = len(self._objects)
             self._objects.append(obj)
             return _get_object, (id(self), index)

Though, the proper fix is probably a bit more subtle than this.
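
For context on what the hook does: the scanner drives a pickling pass over the arguments and intercepts the type of interest in `reducer_override`; the `else` branch that this patch short-circuits stashed every non-node object whole and pickled a lookup sentinel in its place, so the pickler never descended into user objects like `Outer`. Below is a minimal, self-contained sketch of the interception idea using the standard pickle module rather than Ray's cloudpickle subclass; `DAGNode`, `Wrapper`, `SimpleScanner`, and `_placeholder` are illustrative stand-ins, not Ray names.

# Minimal sketch of the scanning idea using the standard pickle module.
# All names here are illustrative stand-ins, not Ray APIs.
import io
import pickle


class DAGNode:
    """Stand-in for the node type being scanned for."""


class Wrapper:
    """An arbitrary user object that nests a DAGNode."""

    def __init__(self, inner):
        self.inner = inner


def _placeholder(index):
    # Module-level so the pickler can reference it; marks where a node was.
    return index


class SimpleScanner(pickle.Pickler):
    """Finds DAGNode instances by pickling the object graph and intercepting
    them in reducer_override."""

    def __init__(self, buf):
        super().__init__(buf)
        self.found = []

    def reducer_override(self, obj):
        if isinstance(obj, DAGNode):
            # Record the node and pickle a placeholder in its place.
            index = len(self.found)
            self.found.append(obj)
            return _placeholder, (index,)
        # Fall back to normal pickling so nested custom objects like Wrapper
        # are traversed instead of being stashed whole.
        return NotImplemented


buf = io.BytesIO()
scanner = SimpleScanner(buf)
scanner.dump([Wrapper(DAGNode())])
assert len(scanner.found) == 1  # the node nested inside Wrapper is found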

edoakes (Contributor) commented Aug 28, 2023

I believe the above "early termination" was added to avoid attempting to serialize non-serializable objects. The following test fails:

    def test_not_serializing_objects():
        scanner = _PyObjScanner(source_type=Source)
        not_serializable = NotSerializable()
        my_objs = [not_serializable, {"key": Source()}]

>       found = scanner.find_nodes(my_objs)

tests/test_py_obj_scanner.py:46:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
py_obj_scanner.py:97: in find_nodes
    self.dump(obj)
../cloudpickle/cloudpickle_fast.py:733: in dump
    return Pickler.dump(self, obj)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <test_py_obj_scanner.NotSerializable object at 0x12392dbe0>

    def __reduce__(self):
>       raise Exception("don't even try to serialize me.")
E       Exception: don't even try to serialize me.

I'm not sure if this is actually a requirement -- for Serve at least, we only use the `_PyObjScanner` on objects that'll be serialized anyway.
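
For reference, the `NotSerializable` fixture presumably looks something like the following, reconstructed from the traceback above (the real test file may differ):

class NotSerializable:
    def __reduce__(self):
        # Reconstructed from the traceback above; the real fixture may differ.
        raise Exception("don't even try to serialize me.")

With a full-serialization approach, any such object reachable from the scanned arguments raises during `find_nodes`, which is the trade-off discussed in the fix below.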

edoakes added a commit that referenced this issue on Sep 6, 2023 (the merge of #39015):
As per #38809, you currently cannot pass bound deployments nested within custom objects. This PR lifts that restriction.

The approach I took is to remove the "arbitrary object replacement" path in `_PyObjScanner.reducer_override`, which was effectively causing cloudpickle to return early. Instead, we now fully serialize objects aside from the `SourceType` using the standard cloudpickle path.

This has one major downside: all objects that `_PyObjScanner` is called on must now be serializable. This is not an issue for its current usage in the code base, but it required me to also add support for finding and replacing multiple types at once (because we currently do multiple passes on each Serve `Query` object).
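
In scanner terms, the nested case from the earlier failing test should now behave like the flat one. Here is a hedged sketch using only the API quoted in this thread (`_PyObjScanner(source_type=...)`, `find_nodes`, `replace_nodes`), with `Source` standing in for the DAG node type; the assertions reflect the intended post-fix behavior rather than the merged test suite.

# Hedged sketch of post-fix scanner behavior; module path taken from the
# diff above, everything else mirrors the tests quoted in this thread.
from ray.dag.py_obj_scanner import _PyObjScanner


class Source:
    # Stand-in for the DAG node type, as in the quoted tests.
    pass


class Outer:
    def __init__(self, inner):
        self._inner = inner


scanner = _PyObjScanner(source_type=Source)
my_objs = [Outer(Source())]

found = scanner.find_nodes(my_objs)
assert len(found) == 1  # the Source nested inside Outer is now discovered

replaced = scanner.replace_nodes({obj: 1 for obj in found})
assert replaced[0]._inner == 1  # and replaced inside the custom object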
edoakes (Contributor) commented Sep 6, 2023

Re-opening until this is cherry-picked into 2.7.

edoakes reopened this on Sep 6, 2023
edoakes added a commit to edoakes/ray that referenced this issue on Sep 6, 2023 (ray-project#39015).
GeneDer pushed a commit that referenced this issue on Sep 6, 2023 (#39330).
edoakes closed this as completed on Sep 6, 2023.
harborn pushed a commit to harborn/ray that referenced this issue on Sep 8, 2023 (ray-project#39015).
jimthompson5802 pushed a commit to jimthompson5802/ray that referenced this issue on Sep 12, 2023 (ray-project#39015).
vymao pushed a commit to vymao/ray that referenced this issue on Oct 11, 2023 (ray-project#39015).