[diagnose] test service connectivity when no VPC endpoint exists #12

Merged
merged 7 commits into from
Feb 3, 2025

@kylos101 kylos101 commented Jan 31, 2025

Description

Provide a way to test connectivity to AWS services when the current VPC lacks VPC endpoints, for example when the VPC endpoints are centralized in another VPC.

Background:
The network check CLI runs two sets of checks.

  1. Checks for the VPC endpoints the CLI needs, plus the execute-api endpoint (so we can inspect its private DNS setting)
    • it tries to inspect the VPC endpoints in the same account (existing behavior, slightly changed)
    • if the VPC endpoints don't exist, it degrades to asserting service name resolution and connectivity (new)
  2. The remaining tests assert connectivity to the other AWS services that Gitpod needs
    • it creates EC2 instances and uses SSM to send curl requests (from the EC2 instances in particular subnets) to each AWS service (unchanged)
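The degrade path in the first set of checks can be sketched roughly as follows. This is an illustration only, not the code in this PR: `planFor`, `serviceHost`, and the `checkPlan` constants are hypothetical names, and deriving the host by reversing the endpoint's labels is an assumption based on the names that appear in the logs (the actual change reads `endpoint.PrivateDnsName`).

```go
package main

import (
	"fmt"
	"strings"
)

// checkPlan names what the prerequisite step does for one endpoint.
// The type and its values are hypothetical, for illustration only.
type checkPlan string

const (
	planInspectEndpoint  checkPlan = "inspect VPC endpoint"
	planTestConnectivity checkPlan = "test service connectivity"
	planDeferToSubnet    checkPlan = "defer to main subnet test"
)

// serviceHost reverses the dot-separated labels of a VPC endpoint service
// name, e.g. "com.amazonaws.eu-central-1.ssm" becomes
// "ssm.eu-central-1.amazonaws.com".
func serviceHost(endpoint string) string {
	labels := strings.Split(endpoint, ".")
	for i, j := 0, len(labels)-1; i < j; i, j = i+1, j-1 {
		labels[i], labels[j] = labels[j], labels[i]
	}
	return strings.Join(labels, ".")
}

// planFor picks the check for an endpoint, given whether a VPC endpoint
// was found in the current account.
func planFor(endpoint string, found bool) checkPlan {
	switch {
	case found:
		return planInspectEndpoint
	case strings.HasSuffix(endpoint, ".execute-api"):
		// execute-api is only tested later, from the main subnet.
		return planDeferToSubnet
	default:
		return planTestConnectivity
	}
}

func main() {
	endpoints := []string{
		"com.amazonaws.eu-central-1.ssm",
		"com.amazonaws.eu-central-1.execute-api",
	}
	for _, e := range endpoints {
		fmt.Printf("%s -> %s (%s)\n", e, planFor(e, false), serviceHost(e))
	}
}
```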

Related Issue(s)

Fixes CLC-1118

How to test

  1. Set up a network using our public networking template
  2. Delete the execute-api endpoint
  3. Run diagnose. It should fail for the execute-api endpoint, and the other checks should pass. (results)

Documentation

/hold

@kylos101 kylos101 marked this pull request as ready for review January 31, 2025 22:32
@kylos101 kylos101 requested review from a team as code owners January 31, 2025 22:32
@kylos101 kylos101 marked this pull request as draft January 31, 2025 22:49

kylos101 commented Feb 3, 2025

A positive test when the Execute API service is accessible (because the VPC endpoint exists in the same account and VPC as Gitpod):

go run . diagnose
INFO[0000] ℹ️  Running with region `eu-central-1`, main subnet `[subnet-03ed4c7f3f10ee64a  subnet-03ae0d9e3ad063d83]`, pod subnet `[subnet-09704642a44a1ae9b  subnet-0fc43a731956656cd]`, and hosts `[accounts.google.com  https://github.com]` 
INFO[0000] ✅ Main Subnets are valid                     
INFO[0000] ✅ Pod Subnets are valid                      
INFO[0000] ℹ️  Checking prerequisites                   
INFO[0000] ℹ️  VPC endpoint com.amazonaws.eu-central-1.ec2messages is not configured, testing service connectivity... 
INFO[0000] ✅ Service ec2messages.eu-central-1.amazonaws.com has connectivity 
INFO[0000] ℹ️  VPC endpoint com.amazonaws.eu-central-1.ssm is not configured, testing service connectivity... 
INFO[0000] ✅ Service ssm.eu-central-1.amazonaws.com has connectivity 
INFO[0000] ℹ️  VPC endpoint com.amazonaws.eu-central-1.ssmmessages is not configured, testing service connectivity... 
INFO[0000] ✅ Service ssmmessages.eu-central-1.amazonaws.com has connectivity 
INFO[0000] ✅ VPC endpoint com.amazonaws.eu-central-1.execute-api is configured 
INFO[0000] ✅ IAM role created and policy attached       
INFO[0001] ℹ️  Launching EC2 instances in Main subnets  
INFO[0001] ℹ️  Created security group with ID: sg-04c8ea7ddf76179b2 
INFO[0002] ℹ️  Instance type t2.micro shall be used     
INFO[0009] ℹ️  Created security group with ID: sg-0f7acc3c417480982 
INFO[0009] ℹ️  Instance type t2.micro shall be used     
INFO[0011] ℹ️  Main EC2 instances: [i-03922c702b3d852be i-00c35f0eba6f8358a] 
INFO[0011] ℹ️  Launching EC2 instances in a Pod subnets 
INFO[0011] ℹ️  Created security group with ID: sg-05f66bb10e09f4590 
INFO[0011] ℹ️  Instance type t2.micro shall be used     
INFO[0014] ℹ️  Created security group with ID: sg-06ba706d7bddfe2db 
INFO[0014] ℹ️  Instance type t2.micro shall be used     
INFO[0016] ℹ️  Pod EC2 instances: [i-00f047a1e92d80959 i-08e96784b4245e6f5] 
INFO[0016] ℹ️  Waiting for EC2 instances to become ready (can take up to 2 minutes) 
INFO[0029] ✅ EC2 Instances are now running successfully 
INFO[0029] ℹ️  Connecting to SSM...                     
INFO[0115] ℹ️  Checking if the required AWS Services can be reached from the ec2 instances in the pod subnet 
INFO[0116] ✅ Autoscaling is available                   
INFO[0117] ✅ CloudFormation is available                
INFO[0118] ✅ CloudWatch is available                    
INFO[0119] ✅ EC2 is available                           
INFO[0121] ✅ EC2messages is available                   
INFO[0122] ✅ ECR is available                           
INFO[0123] ✅ ECR Api is available                       
INFO[0124] ✅ EKS is available                           
INFO[0125] ✅ Elastic LoadBalancing is available         
INFO[0126] ✅ KMS is available                           
INFO[0127] ✅ Kinesis Firehose is available              
INFO[0128] ✅ SSM is available                           
INFO[0129] ✅ SSMmessages is available                   
INFO[0130] ✅ SecretsManager is available                
INFO[0131] ✅ Sts is available                           
INFO[0131] ℹ️  Checking if certain AWS Services can be reached from ec2 instances in the main subnet 
INFO[0132] ✅ DynamoDB is available
INFO[0133] ✅ ExecuteAPI is available              <---- this is new
INFO[0134] ✅ S3 is available
INFO[0134] ℹ️  Checking if hosts can be reached with HTTPS from ec2 instances in the main subnets 
INFO[0135] ✅ accounts.google.com is available           
INFO[0136] ✅ https://github.com is available            
INFO[0136] ✅ Instances terminated                       
INFO[0136] Cleaning up: Waiting for 2 minutes so network interfaces are deleted 
INFO[0257] ✅ Role 'GitpodNetworkCheck' deleted          
INFO[0257] ✅ Instance profile deleted                   
INFO[0257] ✅ Security group 'sg-04c8ea7ddf76179b2' deleted 
INFO[0258] ✅ Security group 'sg-0f7acc3c417480982' deleted 
INFO[0258] ✅ Security group 'sg-05f66bb10e09f4590' deleted 
INFO[0258] ✅ Security group 'sg-06ba706d7bddfe2db' deleted 


kylos101 commented Feb 3, 2025

A negative test for when the VPC endpoint has been deleted:

go run . diagnose
INFO[0000] ℹ️  Running with region `eu-central-1`, main subnet `[subnet-03ed4c7f3f10ee64a  subnet-03ae0d9e3ad063d83]`, pod subnet `[subnet-09704642a44a1ae9b  subnet-0fc43a731956656cd]`, hosts `[accounts.google.com  https://github.com]`, and api endpoint `6v268t83fd` 
INFO[0000] ✅ Main Subnets are valid                     
INFO[0000] ✅ Pod Subnets are valid                      
INFO[0000] ℹ️  Checking prerequisites                   
INFO[0000] ℹ️  VPC endpoint com.amazonaws.eu-central-1.ec2messages is not configured, testing service connectivity... 
INFO[0000] ✅ Service ec2messages.eu-central-1.amazonaws.com has connectivity 
INFO[0000] ℹ️  VPC endpoint com.amazonaws.eu-central-1.ssm is not configured, testing service connectivity... 
INFO[0000] ✅ Service ssm.eu-central-1.amazonaws.com has connectivity 
INFO[0000] ℹ️  VPC endpoint com.amazonaws.eu-central-1.ssmmessages is not configured, testing service connectivity... 
INFO[0000] ✅ Service ssmmessages.eu-central-1.amazonaws.com has connectivity 
INFO[0000] ℹ️  Deferring connectivity test for execute-api.eu-central-1.amazonaws.com service until testing main subnet 
INFO[0000] ✅ IAM role created and policy attached       
INFO[0001] ℹ️  Launching EC2 instances in Main subnets  
INFO[0001] ℹ️  Created security group with ID: sg-0f69fadcb4e4a8729 
INFO[0001] ℹ️  Instance type t2.micro shall be used     
INFO[0009] ℹ️  Created security group with ID: sg-00f43c631812ed870 
INFO[0009] ℹ️  Instance type t2.micro shall be used     
INFO[0011] ℹ️  Main EC2 instances: [i-09f9133b37d2070b4 i-0200eafa51b6b7888] 
INFO[0011] ℹ️  Launching EC2 instances in a Pod subnets 
INFO[0012] ℹ️  Created security group with ID: sg-024cc2c0aca38bb3f 
INFO[0012] ℹ️  Instance type t2.micro shall be used     
INFO[0014] ℹ️  Created security group with ID: sg-04efb8f408dda30a7 
INFO[0014] ℹ️  Instance type t2.micro shall be used     
INFO[0016] ℹ️  Pod EC2 instances: [i-093238e158655c18e i-00b0959897b596f52] 
INFO[0016] ℹ️  Waiting for EC2 instances to become ready (can take up to 2 minutes) 
INFO[0038] ✅ EC2 Instances are now running successfully 
INFO[0038] ℹ️  Connecting to SSM...                     
INFO[0122] ℹ️  Checking if the required AWS Services can be reached from the ec2 instances in the pod subnet 
INFO[0123] ✅ Autoscaling is available                   
INFO[0124] ✅ CloudFormation is available                
INFO[0125] ✅ CloudWatch is available                    
INFO[0126] ✅ EC2 is available                           
INFO[0127] ✅ EC2messages is available                   
INFO[0128] ✅ ECR is available                           
INFO[0129] ✅ ECR Api is available                       
INFO[0130] ✅ EKS is available                           
INFO[0131] ✅ Elastic LoadBalancing is available         
INFO[0132] ✅ KMS is available                           
INFO[0133] ✅ Kinesis Firehose is available              
INFO[0134] ✅ SSM is available                           
INFO[0135] ✅ SSMmessages is available                   
INFO[0136] ✅ SecretsManager is available                
INFO[0137] ✅ Sts is available                           
INFO[0137] ℹ️  Checking if certain AWS Services can be reached from ec2 instances in the main subnet 
INFO[0138] ✅ DynamoDB is available
WARN[0139] ❌ ExecuteAPI is not available (https://6v268t83fd.execute-api.eu-central-1.amazonaws.com)
INFO[0139] ❌ Error fetching command results: instance i-09f9133b37d2070b4 command failed:   % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0curl: (6) Could not resolve host: 6v268t83fd.execute-api.eu-central-1.amazonaws.com
failed to run commands: exit status 6                      <-- this is expected

INFO[0139] ✅ S3 is available                            
INFO[0139] ℹ️  Checking if hosts can be reached with HTTPS from ec2 instances in the main subnets 
INFO[0140] ✅ accounts.google.com is available           
INFO[0141] ✅ https://github.com is available            
INFO[0142] ✅ Instances terminated                       
INFO[0142] Cleaning up: Waiting for 2 minutes so network interfaces are deleted 
INFO[0262] ✅ Role 'GitpodNetworkCheck' deleted          
INFO[0262] ✅ Instance profile deleted                   
INFO[0263] ✅ Security group 'sg-0f69fadcb4e4a8729' deleted 
INFO[0263] ✅ Security group 'sg-00f43c631812ed870' deleted 
INFO[0263] ✅ Security group 'sg-024cc2c0aca38bb3f' deleted 
INFO[0263] ✅ Security group 'sg-04efb8f408dda30a7' deleted 

Contributor Author

Added because...Flex 💪

Contributor Author

For the Flex!

@@ -65,6 +66,10 @@ var checkCommand = &cobra.Command{ // nolint:gochecknoglobals
log.Infof("ℹ️ Found duplicate subnets. We'll test each subnet '%v' only once.", distinctSubnets)
}

if networkConfig.ApiEndpoint == "" {
Contributor Author

Previously this parameter was unnecessary, because we only tested whether the VPC endpoint for execute-api existed in the account of the current AWS credentials.

As we've learned, the endpoint does not always live there, for example when a central account hosts the VPC endpoints.

So it became necessary to add this parameter.
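As a rough sketch of how the subdomain and region could combine into the URL that the negative test log reports: `executeAPIURL` is a hypothetical helper, and returning an error for an empty value is an assumption (the hunk above does not show what the empty check does). Only the URL shape is taken from the log output.

```go
package main

import (
	"errors"
	"fmt"
)

// executeAPIURL builds the regional execute-api URL from the value of
// --api-endpoint and the AWS region. Hypothetical helper; the error
// handling for empty input is an assumption made for this sketch.
func executeAPIURL(subdomain, region string) (string, error) {
	if subdomain == "" {
		return "", errors.New("api endpoint subdomain is empty")
	}
	if region == "" {
		return "", errors.New("region is empty")
	}
	return fmt.Sprintf("https://%s.execute-api.%s.amazonaws.com", subdomain, region), nil
}

func main() {
	url, err := executeAPIURL("6v268t83fd", "eu-central-1")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(url)
}
```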

networkCheckCmd.PersistentFlags().StringVar(&networkConfig.InstanceAMI, "instance-ami", "", "Custom ec2 instance AMI id, if not set will use latest ubuntu")
log.Infof("ℹ️ Running with region `%s`, main subnet `%v`, pod subnet `%v`, and hosts `%v`", networkConfig.AwsRegion, networkConfig.MainSubnets, networkConfig.PodSubnets, networkConfig.HttpsHosts)
networkCheckCmd.PersistentFlags().StringVar(&networkConfig.ApiEndpoint, "api-endpoint", "", "The Gitpod Enterprise control plane's regional API endpoint subdomain")
bindFlags(networkCheckCmd, v)
Contributor Author

We were binding flags too soon, before instance-ami and api-endpoint were registered, so their values were never captured.
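The ordering problem can be reproduced with the standard library alone. The sketch below uses Go's `flag` package rather than cobra and viper, and its `bindFlags` is a simplified stand-in, not the function from this repository; the config values are made up. The point it illustrates is that a bind step which walks the registered flags cannot see flags that are registered afterwards.

```go
package main

import (
	"flag"
	"fmt"
)

// bindFlags applies configured values to every flag that is already
// registered on fs when it is called. Flags registered later are not seen.
func bindFlags(fs *flag.FlagSet, cfg map[string]string) {
	fs.VisitAll(func(f *flag.Flag) {
		if v, ok := cfg[f.Name]; ok {
			_ = fs.Set(f.Name, v) // string flags accept any value
		}
	})
}

// run registers two flags and binds either before or after registration,
// returning the values the flags end up with.
func run(bindEarly bool) (ami, api string) {
	cfg := map[string]string{
		"instance-ami": "ami-0example", // made-up value
		"api-endpoint": "6v268t83fd",
	}
	fs := flag.NewFlagSet("diagnose", flag.ContinueOnError)
	if bindEarly {
		bindFlags(fs, cfg) // too soon: nothing is registered yet
	}
	fs.StringVar(&ami, "instance-ami", "", "custom EC2 instance AMI id")
	fs.StringVar(&api, "api-endpoint", "", "regional API endpoint subdomain")
	if !bindEarly {
		bindFlags(fs, cfg) // after registration: both flags are visited
	}
	return ami, api
}

func main() {
	ami, api := run(true)
	fmt.Printf("bound early: ami=%q api=%q\n", ami, api)
	ami, api = run(false)
	fmt.Printf("bound late:  ami=%q api=%q\n", ami, api)
}
```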

continue
}
log.Infof("ℹ️ VPC endpoint %s is not configured, testing service connectivity...", endpoint.Endpoint)
_, err := TestServiceConnectivity(ctx, endpoint.PrivateDnsName, 5*time.Second)
Contributor Author

This new method helps us assert whether we have connectivity to the VPC endpoint services needed to send SSM commands to EC2 instances (those commands later assert connectivity, by subnet, to AWS services).
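For readers who want a feel for what such a check involves, here is a minimal sketch that resolves a host name and then opens a TCP connection to port 443 within a timeout. It is not the `TestServiceConnectivity` added in this PR: the lowercase name, the returned address string, the choice of port 443, and the `reachable` wrapper are all assumptions made for illustration.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"net"
	"time"
)

// testServiceConnectivity resolves host, then dials the first resolved
// address on TCP port 443. Both steps share the same timeout. It returns
// the remote address that accepted the connection.
func testServiceConnectivity(ctx context.Context, host string, timeout time.Duration) (string, error) {
	ctx, cancel := context.WithTimeout(ctx, timeout)
	defer cancel()

	addrs, err := net.DefaultResolver.LookupHost(ctx, host)
	if err != nil {
		return "", fmt.Errorf("resolving %s: %w", host, err)
	}
	if len(addrs) == 0 {
		return "", errors.New("no addresses found for " + host)
	}

	var dialer net.Dialer
	conn, err := dialer.DialContext(ctx, "tcp", net.JoinHostPort(addrs[0], "443"))
	if err != nil {
		return "", fmt.Errorf("connecting to %s: %w", host, err)
	}
	defer conn.Close()
	return conn.RemoteAddr().String(), nil
}

// reachable reports whether host passed both the DNS and the TCP step.
func reachable(host string, timeoutMillis int) bool {
	timeout := time.Duration(timeoutMillis) * time.Millisecond
	_, err := testServiceConnectivity(context.Background(), host, timeout)
	return err == nil
}

func main() {
	host := "ssm.eu-central-1.amazonaws.com"
	fmt.Printf("%s reachable: %v\n", host, reachable(host, 5000))
}
```

A check like this runs from wherever the CLI runs, which is presumably why the execute-api test is deferred to an EC2 instance in the main subnet instead.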

@kylos101 kylos101 marked this pull request as ready for review February 3, 2025 06:20
@@ -198,13 +209,22 @@ func checkSMPrerequisites(ctx context.Context, ec2Client *ec2.Client) error {
}

if len(response.VpcEndpoints) == 0 {
if endpoint.Required {
return fmt.Errorf("❌ VPC endpoint %s not configured: %w", endpoint.Endpoint, err)
if strings.Contains(endpoint.Endpoint, "execute-api") {
Member

🫧 Instead of doing a substring match on the name, would it make sense to mark the endpoint with a flag named e.g. "testInMainSubnet" or similar...? 🤔

Contributor Author

would it make sense to mark the endpoint with a flag named e.g. "testInMainSubnet" or similar...? 🤔

At the moment, I think no, for a couple of reasons:

  1. I don't expect the service name to change, and we're not checking for the service name in other places. If we were, then I'd consider adding a flag.
  2. This section of tests asserts that we can interact with the EC2 and SSM services. The execute-api service has no impact on CLI functionality. It does help us assert that private DNS is enabled, but only when the VPC endpoint exists in the same account as Gitpod.
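To make the two options in this thread concrete, the sketch below puts the substring match next to the suggested flag. The `vpcEndpoint` struct is a hypothetical shape: `Endpoint` and `Required` appear in the diff, while `TestInMainSubnet` is only the name floated in the review and does not exist in the merged code.

```go
package main

import (
	"fmt"
	"strings"
)

// vpcEndpoint is a hypothetical shape for one prerequisite endpoint.
type vpcEndpoint struct {
	Endpoint         string
	Required         bool
	TestInMainSubnet bool // suggested in review, not part of the merged code
}

// deferBySubstring mirrors the merged approach: match on the service name.
func deferBySubstring(e vpcEndpoint) bool {
	return strings.Contains(e.Endpoint, "execute-api")
}

// deferByFlag mirrors the reviewer's suggestion: read an explicit field.
func deferByFlag(e vpcEndpoint) bool {
	return e.TestInMainSubnet
}

func main() {
	endpoints := []vpcEndpoint{
		{Endpoint: "com.amazonaws.eu-central-1.ssm", Required: true},
		{Endpoint: "com.amazonaws.eu-central-1.execute-api", TestInMainSubnet: true},
	}
	for _, e := range endpoints {
		fmt.Printf("%s: substring=%v flag=%v\n", e.Endpoint, deferBySubstring(e), deferByFlag(e))
	}
}
```

Both give the same answer while execute-api is the only special case. The flag would start to pay off only if more endpoints needed deferring or the name were checked in several places, which matches the reasoning in the reply above.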

Member

@geropl geropl left a comment

✔️ to unblock, but I'm not in a position to test today

Left a comment with a potential code improvement, leaving it up to you to decide how to proceed. 👍

@kylos101 kylos101 merged commit 087017c into main Feb 3, 2025