Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[autoscaler][AWS] Make sure subnets belong to same VPC as user-specified security groups #13558

Merged
merged 10 commits into from
Jan 28, 2021

Conversation

DmitriGekhtman
Copy link
Contributor

@DmitriGekhtman DmitriGekhtman commented Jan 19, 2021

Why are these changes needed?

This PR addresses a problem in autoscaler/_private/aws/config.py.
Currently _configure_subnets does not take into account security groups specified in the head_node and worker_nodes fields.
Thus, it's possible that a subnet is chosen that belongs to a different VPC than user-specified security groups. In this situation, a cluster can't be launched.

This PR fixes that problem by detecting the VPC of security groups specified in head_node and worker_nodes configs and restricting to subnets in that VPC.

Related issue number

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

I also did some manual tests of _configure_subnets and of ray up to make sure things were working as expected.

@DmitriGekhtman
Copy link
Contributor Author

looks like test_autoscaler_aws isn't happy... better go fix that.

@DmitriGekhtman DmitriGekhtman added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Jan 19, 2021
@DmitriGekhtman
Copy link
Contributor Author

Currently,
test_autoscaler_aws.test_create_sg_different_vpc_same_rules and test_autoscaler_aws.test_create_sg_with_custom_inbound_rules_and_name are failing.

I'm not sure I fully understand the context behind the test failures (and vpc / sg / subnet setup in general).
@allenyin55 could you take a look if you have a second?

@allenyin55
Copy link
Contributor

test_autoscaler_aws.test_create_sg_different_vpc_same_rules and test_autoscaler_aws.test_create_sg_with_custom_inbound_rules_and_name are testing logics for creating security groups and they stub out the ec2 and IAM client. With your code change we need to update how the ec2 client is stubbed. You can find the stubs in tests/aws/utils/stubs.py

@DmitriGekhtman
Copy link
Contributor Author

DmitriGekhtman commented Jan 20, 2021 via email

@DmitriGekhtman DmitriGekhtman added tests-ok The tagger certifies test failures are unrelated and assumes personal liability. and removed @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. labels Jan 20, 2021
@DmitriGekhtman
Copy link
Contributor Author

Stubs are now fixed and tests pass.

Should I add a unit test to test_autoscaler_aws or is this PR fine as is?

@ericl ericl removed their assignment Jan 20, 2021
@AmeerHajAli
Copy link
Contributor

@DmitriGekhtman , lets add a unit test to make sure everything is OK.

@DmitriGekhtman
Copy link
Contributor Author

DmitriGekhtman commented Jan 20, 2021

@DmitriGekhtman , lets add a unit test to make sure everything is OK.

Will do.

@AmeerHajAli AmeerHajAli added @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. and removed tests-ok The tagger certifies test failures are unrelated and assumes personal liability. labels Jan 20, 2021
@DmitriGekhtman DmitriGekhtman removed the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Jan 22, 2021
@DmitriGekhtman
Copy link
Contributor Author

Added a unit test.

"should belong to the same VPC."
cli_logger.doassert(len(vpc_ids) <= 1, multiple_vpc_msg)
assert len(vpc_ids) <= 1, multiple_vpc_msg

Copy link
Contributor Author

@DmitriGekhtman DmitriGekhtman Jan 22, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the user specifies security groups for both head and worker, and the security groups belong to different VPCs, this will throw an error.

We could have it support user-specified security groups in different VPCs for head and workers.
This would match the behavior for subnets: #8374 .
I think I won't bother with that, unless reviewers think I should.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we wanted to go further down that path, we would want at some point to support different VPCs for each node type in a multi node-type config.

Copy link
Contributor

@AmeerHajAli AmeerHajAli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me!
Nice job @DmitriGekhtman !
Let's see what @thomasdesr thinks about it :-).

Copy link
Contributor

@thomasdesr thomasdesr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm comfortable with this as is, but it'd be great to have a unit test that demonstrates the bug.

E.g.

  1. stub describe_subnets to return many subnets (1000?) each in different VPCs
  2. bootstrap a config with explicitly security groups configuration targeting a single VPC
  3. confirm the subnets returned after configuring are both in the security group's VPC and not any others.

@DmitriGekhtman
Copy link
Contributor Author

Added a unit test that demonstrates that the bug is fixed. Changed the filtering logic a bit to enable the unit test.

Co-authored-by: Thomas Desrosiers <681004+thomasdesr@users.noreply.github.com>
@AmeerHajAli AmeerHajAli added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Jan 26, 2021
@DmitriGekhtman DmitriGekhtman removed the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Jan 26, 2021
@DmitriGekhtman
Copy link
Contributor Author

@ericl I think this is good to merge.

@DmitriGekhtman DmitriGekhtman added the tests-ok The tagger certifies test failures are unrelated and assumes personal liability. label Jan 26, 2021
@AmeerHajAli
Copy link
Contributor

@ericl, can you please merge this?

@ericl ericl merged commit 40234ad into ray-project:master Jan 28, 2021
fishbone pushed a commit to fishbone/ray that referenced this pull request Feb 16, 2021
…ied security groups (ray-project#13558)

* initial commit

* Filter subnets by security groups' VPCs

* fix stubs

* wip

* Fix inbound rule logic. Tests WIP.

* wip

* unit test

* example yaml

* Unit test tests for bug being fixed

* Update python/ray/tests/aws/utils/constants.py

Co-authored-by: Thomas Desrosiers <681004+thomasdesr@users.noreply.github.com>

Co-authored-by: Thomas Desrosiers <681004+thomasdesr@users.noreply.github.com>
fishbone added a commit to fishbone/ray that referenced this pull request Feb 16, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
tests-ok The tagger certifies test failures are unrelated and assumes personal liability.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants