Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Core] Support inherit base resources with any_of and ordered in resources field #2833

Merged
merged 15 commits into from
Dec 21, 2023

Conversation

Michaelvll
Copy link
Collaborator

@Michaelvll Michaelvll commented Dec 1, 2023

Fixes #2831

Tested (run the relevant ones):

  • Code formatting: bash format.sh
  • Any manual or new tests for this PR (please specify below)
    • sky launch -c test-resources echo hi
    • sky launch -c test-resources --cpus 2 echo hi
    • sky spot launch -n test-resources --cpus 2 echo hi with ~/.sky/config.yaml specifying controller's cloud.
    • sky spot launch -n test-resources --cpus 2 "echo hi; sleep 10000" manually terminate and recover.
    • sky serve up examples/serve/http_server/task.yaml
  • All smoke tests: pytest tests/test_smoke.py
  • Relevant individual smoke tests: pytest tests/test_smoke.py::test_fill_in_the_name
    • pytest tests/test_smoke.py::test_multiple_a ccelerators_ordered_with_default
    • pytest tests/test_smoke.py::test_multiple_accelerators_unordered_with_default
  • Backward compatibility tests: bash tests/backward_comaptibility_tests.sh

@Michaelvll Michaelvll marked this pull request as ready for review December 1, 2023 23:12
@Michaelvll Michaelvll requested review from concretevitamin and MaoZiming and removed request for concretevitamin December 7, 2023 06:40
@@ -278,8 +278,15 @@ def get_controller_resources(
controller_type=controller_type,
err=common_utils.format_exception(e,
use_bracket=True))) from e

return controller_resources
if len(controller_resources) != 1:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we limit controlller_resources to 1?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems that SkyServe requires a single candidate of controller resources, e.g.:

skypilot/sky/serve/core.py

Lines 105 to 115 in 99f04b1

controller_cloud = (
requested_resources.cloud if not controller_exist and
controller_resources.cloud is None else controller_resources.cloud)
# TODO(tian): Probably run another sky.launch after we get the load
# balancer port from the controller? So we don't need to open so many
# ports here. Or, we should have a nginx traffic control to refuse
# any connection to the unregistered ports.
controller_resources = controller_resources.copy(
cloud=controller_cloud,
ports=[serve_constants.LOAD_BALANCER_PORT_RANGE])
controller_task.set_resources(controller_resources)

I think it might be fine to have a more strict constraint for the controller resources at the moment.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sounds good.

@MaoZiming
Copy link
Collaborator

I am wondering if the fields under any_of or ordered conflict with what is above any_of or ordered.
Will the field under any_of overwrite the fields above any_of?

@Michaelvll
Copy link
Collaborator Author

I am wondering if the fields under any_of or ordered conflict with what is above any_of or ordered. Will the field under any_of overwrite the fields above any_of?

Yes, my current thinking is to have the field under any_of or ordered to overwrite the configs outside any_of/ordered.

@MaoZiming
Copy link
Collaborator

@Michaelvll Okay make sense.

@MaoZiming MaoZiming self-requested a review December 15, 2023 04:05
@MaoZiming
Copy link
Collaborator

MaoZiming commented Dec 15, 2023

lgtm after fixing pylint

Copy link
Collaborator

@cblmemo cblmemo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for adding this feature!! Sorry for missed it. Left several things to discuss 🫡

docs/source/reference/yaml-spec.rst Show resolved Hide resolved
sky/resources.py Outdated Show resolved Hide resolved
sky/resources.py Outdated Show resolved Hide resolved
sky/resources.py Show resolved Hide resolved
Copy link
Collaborator

@cblmemo cblmemo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just did another pass and left several nits & one discussion. If we decided to keep the accelerators: {L4:1, T4:1} as a shortcut/backward compatibility/any other reason, then this version of the code looks great to me!

docs/source/examples/auto-failover.rst Outdated Show resolved Hide resolved
docs/source/reference/yaml-spec.rst Show resolved Hide resolved
docs/source/reference/yaml-spec.rst Show resolved Hide resolved
sky/resources.py Show resolved Hide resolved
sky/resources.py Outdated Show resolved Hide resolved
sky/resources.py Show resolved Hide resolved
sky/resources.py Outdated Show resolved Hide resolved
sky/resources.py Outdated Show resolved Hide resolved
Copy link
Collaborator

@cblmemo cblmemo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the fix! I tried the following in latest commit:

>>> import sky
>>> sky.Resources.from_yaml_config({'use_spot': True, 'accelerators': {'A100-40GB:1': None, 'T4:1': None, 'V100:1': None}, 'any_of': [{'cloud': 'aws'}, {'cloud': 'gcp'}]})
ValueError: Invalid resources YAML: {'A100-40GB:1': None, 'T4:1': None, 'V100:1': None} is not valid under any of the given schemas. Check problematic field(s): $.accelerators
>>> sky.Resources.from_yaml_config({'use_spot': True, 'accelerators': 'A100-40GB:1', 'any_of': [{'cloud': 'aws'}, {'cloud': 'gcp'}]})
AssertionError: Invalid resource args: dict_keys(['any_of'])

Is this expected? Correct me if I'm wrong, but I suppose we should:

  1. allow set of accs;
  2. allow any_of.
    in Python API too?

sky/resources.py Outdated Show resolved Hide resolved
sky/resources.py Outdated

def from_yaml_config(
cls, config: Optional[Dict[str, Any]]
) -> Union[Set['Resources'], List['Resources']]:
common_utils.validate_schema(config, schemas.get_resources_schema(),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm slightly confused about why we could still use the old schema check. Should we add the any_of and ordered fields in the resources schema?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We already have the any_of and ordered field included in the schema in #2498

Michaelvll and others added 2 commits December 21, 2023 10:02
Co-authored-by: Tian Xia <cblmemo@gmail.com>
@Michaelvll
Copy link
Collaborator Author

sky.Resources.from_yaml_config({'use_spot': True, 'accelerators': {'A100-40GB:1': None, 'T4:1': None, 'V100:1': None}, 'any_of': [{'cloud': 'aws'}, {'cloud': 'gcp'}]})

Could you double check if you are using the correct commit? It seems working correctly for me with the current branch.

Copy link
Collaborator

@cblmemo cblmemo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for adding this! Pls ignore the error above; seems like I used the wrong branch. LGTM!

sky/resources.py Outdated
return type(accelerators)(tmp_resources_list)
else:
with ux_utils.print_exception_no_traceback():
raise RuntimeError('Accelerators must be a list or a set.')
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
raise RuntimeError('Accelerators must be a list or a set.')
raise RuntimeError('Accelerators must be a list or a set when multiple accelerators are specified.')

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call! Actually, I find this should be assertion instead of raise, as the type of the resources should have ben guaranteed by the schema check.

@Michaelvll Michaelvll merged commit e93d400 into master Dec 21, 2023
19 checks passed
@Michaelvll Michaelvll deleted the inherit-base-resource-config branch December 21, 2023 23:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

YAML: any_of/ordered should inherit base resources config
3 participants