Serve: cannot pass a sub object that includes a dag node #38809

Closed
tchordia opened this issue Aug 23, 2023 · 4 comments · Fixed by #39015
Labels
bug: Something that is supposed to be working; but isn't
P0: Issue that must be fixed in short order
release-blocker: P0 Issue that blocks the release
serve: Ray Serve Related Issue

Comments

tchordia (Contributor)

What happened + What you expected to happen

I am passing a Serve DAG node into another object, then passing that wrapper object into another deployment's `.bind()`. I expect the Serve DAG node to be magically replaced with a Ray Serve deployment handle; instead, I get an infinite recursion.

Versions / Dependencies

Commit: 903899d933ee19159381d823a439d0e8f05a59b0

Reproduction script

from ray import serve
@serve.deployment
class Parent:
    def __init__(self, obj):
        pass

class Obj:
    def __init__(self, obj):
        self.obj = obj
        
@serve.deployment
class Child:
    pass

serve.run(Parent.bind(Obj(Child.bind())))

This produces:
[screenshot: traceback showing the infinite recursion]

Issue Severity

Medium: It is a significant difficulty but I can work around it.
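
A workaround consistent with this (a hedged sketch, not taken from the report) is to pass the bound child deployment as a top-level argument, which Serve does replace with a deployment handle, and construct the wrapper object inside the deployment:

# Hedged workaround sketch: pass the bound deployment directly so Serve
# converts it to a handle, then build the wrapper inside Parent.
from ray import serve


class Obj:
    def __init__(self, obj):
        self.obj = obj


@serve.deployment
class Child:
    pass


@serve.deployment
class Parent:
    def __init__(self, child_handle):
        # child_handle arrives as a Serve deployment handle, not a DAG node.
        self.obj = Obj(child_handle)


serve.run(Parent.bind(Child.bind()))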

tchordia added the bug and triage labels on Aug 23, 2023
edoakes (Contributor) commented Aug 23, 2023

Looks like this is an issue in the `_PyObjScanner`:

def test_replace_nested_in_obj():
    class Outer:
        def __init__(self, inner):
            self._inner = inner

    scanner = _PyObjScanner(source_type=Source)
    my_objs = [Outer(Source())]

    found = scanner.find_nodes(my_objs)
    assert len(found) == 1

    replaced = scanner.replace_nodes({obj: 1 for obj in found})
    assert replaced == [Outer(1)]

This fails on `assert len(found) == 1` because the object isn't found.

@ericl any ideas here?

edoakes added the release-blocker, P0, and serve labels and removed the triage label on Aug 28, 2023
ericl (Contributor) commented Aug 28, 2023

This seems to be because the `reducer_override` hook is not written to recursively scan non-native objects. If I make the following patch, it resolves the issue:

diff --git a/python/ray/dag/py_obj_scanner.py b/python/ray/dag/py_obj_scanner.py
index 20798b8441..195fc87207 100644
--- a/python/ray/dag/py_obj_scanner.py
+++ b/python/ray/dag/py_obj_scanner.py
@@ -82,6 +82,7 @@ class _PyObjScanner(ray.cloudpickle.CloudPickler, Generic[SourceType, Transforme
             self._found.append(obj)
             return _get_node, (id(self), index)
         else:
+            return super().reducer_override(obj)
             index = len(self._objects)
             self._objects.append(obj)
             return _get_object, (id(self), index)

Though, the proper fix is probably a bit more subtle than this.
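
For context on what the hook does: the scanner drives a pickling pass over the arguments and intercepts the type of interest in `reducer_override`; the `else` branch that this patch short-circuits stashed every non-node object whole and pickled a lookup sentinel in its place, so the pickler never descended into user objects like `Outer`. Below is a minimal, self-contained sketch of the interception idea using the standard pickle module rather than Ray's cloudpickle subclass; `DAGNode`, `Wrapper`, `SimpleScanner`, and `_placeholder` are illustrative stand-ins, not Ray names.

# Minimal sketch of the scanning idea using the standard pickle module.
# All names here are illustrative stand-ins, not Ray APIs.
import io
import pickle


class DAGNode:
    """Stand-in for the node type being scanned for."""


class Wrapper:
    """An arbitrary user object that nests a DAGNode."""

    def __init__(self, inner):
        self.inner = inner


def _placeholder(index):
    # Module-level so the pickler can reference it; marks where a node was.
    return index


class SimpleScanner(pickle.Pickler):
    """Finds DAGNode instances by pickling the object graph and intercepting
    them in reducer_override."""

    def __init__(self, buf):
        super().__init__(buf)
        self.found = []

    def reducer_override(self, obj):
        if isinstance(obj, DAGNode):
            # Record the node and pickle a placeholder in its place.
            index = len(self.found)
            self.found.append(obj)
            return _placeholder, (index,)
        # Fall back to normal pickling so nested custom objects like Wrapper
        # are traversed instead of being stashed whole.
        return NotImplemented


buf = io.BytesIO()
scanner = SimpleScanner(buf)
scanner.dump([Wrapper(DAGNode())])
assert len(scanner.found) == 1  # the node nested inside Wrapper is found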

edoakes (Contributor) commented Aug 28, 2023

I believe the above "early termination" was added to avoid attempting to serialize non-serializable objects. The following test fails:

    def test_not_serializing_objects():
        scanner = _PyObjScanner(source_type=Source)
        not_serializable = NotSerializable()
        my_objs = [not_serializable, {"key": Source()}]

>       found = scanner.find_nodes(my_objs)

tests/test_py_obj_scanner.py:46:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
py_obj_scanner.py:97: in find_nodes
    self.dump(obj)
../cloudpickle/cloudpickle_fast.py:733: in dump
    return Pickler.dump(self, obj)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <test_py_obj_scanner.NotSerializable object at 0x12392dbe0>

    def __reduce__(self):
>       raise Exception("don't even try to serialize me.")
E       Exception: don't even try to serialize me.

I'm not sure if this is actually a requirement -- for Serve at least, we only use the `_PyObjScanner` on objects that'll be serialized anyway.
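
For reference, the `NotSerializable` fixture presumably looks something like the following, reconstructed from the traceback above (the real test file may differ):

class NotSerializable:
    def __reduce__(self):
        # Reconstructed from the traceback above; the real fixture may differ.
        raise Exception("don't even try to serialize me.")

With a full-serialization approach, any such object reachable from the scanned arguments raises during `find_nodes`, which is the trade-off discussed in the fix below.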

edoakes added a commit that referenced this issue on Sep 6, 2023 (the merge of #39015):
As per #38809, you currently cannot pass bound deployments nested within custom objects. This PR lifts that restriction.

The approach I took is to remove the "arbitrary object replacement" path in `_PyObjScanner.reducer_override`, which was effectively causing cloudpickle to return early. Instead, we now fully serialize objects aside from the `SourceType` using the standard cloudpickle path.

This has one major downside: all objects that `_PyObjScanner` is called on must now be serializable. This is not an issue for its current usage in the code base, but it required me to also add support for finding and replacing multiple types at once (because we currently do multiple passes on each Serve `Query` object).
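
In scanner terms, the nested case from the earlier failing test should now behave like the flat one. Here is a hedged sketch using only the API quoted in this thread (`_PyObjScanner(source_type=...)`, `find_nodes`, `replace_nodes`), with `Source` standing in for the DAG node type; the assertions reflect the intended post-fix behavior rather than the merged test suite.

# Hedged sketch of post-fix scanner behavior; module path taken from the
# diff above, everything else mirrors the tests quoted in this thread.
from ray.dag.py_obj_scanner import _PyObjScanner


class Source:
    # Stand-in for the DAG node type, as in the quoted tests.
    pass


class Outer:
    def __init__(self, inner):
        self._inner = inner


scanner = _PyObjScanner(source_type=Source)
my_objs = [Outer(Source())]

found = scanner.find_nodes(my_objs)
assert len(found) == 1  # the Source nested inside Outer is now discovered

replaced = scanner.replace_nodes({obj: 1 for obj in found})
assert replaced[0]._inner == 1  # and replaced inside the custom object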
edoakes (Contributor) commented Sep 6, 2023

Re-opening until this is cherry-picked into 2.7.

edoakes reopened this on Sep 6, 2023
edoakes added a commit to edoakes/ray that referenced this issue on Sep 6, 2023 (ray-project#39015).
GeneDer pushed a commit that referenced this issue on Sep 6, 2023 (#39330).
edoakes closed this as completed on Sep 6, 2023.
harborn pushed a commit to harborn/ray that referenced this issue on Sep 8, 2023 (ray-project#39015).
jimthompson5802 pushed a commit to jimthompson5802/ray that referenced this issue on Sep 12, 2023 (ray-project#39015).
vymao pushed a commit to vymao/ray that referenced this issue on Oct 11, 2023 (ray-project#39015).