# Introduction: Deploy Anyscale Ray on An Existing AWS EKS Cluster

© 2025, Anyscale. All Rights Reserved

This notebook serves as a guide for deploying an **Anyscale Cloud** on an existing AWS EKS cluster using the custom **`anyscale cloud register`** method. It walks through the necessary steps from prerequisites to Ray installation with Anyscale Operator.

Use it as a starting point and replace all placeholders (e.g.&nbsp;`{ANYSCALE_CLOUD_NAME}`) with values from your environment.

It is based on this [example](https://github.com/anyscale/terraform-kubernetes-anyscale-foundation-modules/tree/main/examples/aws/eks-existing). Refer to it for more information.

## Prerequisites

Before we begin, ensure you have the following tools installed:

```bash
# Install AWS CLI (version 2.15.0+)
# https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html

# Configure AWS credentials
# https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-configure.html

# Install kubectl (version 1.25+)
# https://kubernetes.io/docs/tasks/tools/

# Install helm (version 3.10+)
# https://helm.sh/docs/intro/install/

# Install Anyscale CLI (version 0.5.86+)
# https://docs.anyscale.com/reference/quickstart-cli/

# Install Terraform (version 1.9+)
# https://developer.hashicorp.com/terraform/install
```
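
To confirm that an installed tool meets one of the minimum versions listed above, a small comparison helper can be useful. The following is a minimal sketch, assuming you paste in the version string each tool reports; the helper names `parse_version` and `meets_minimum` are hypothetical, and the sketch handles only plain dotted numeric versions.

```python
def parse_version(text: str) -> tuple[int, ...]:
    """Turn a dotted version such as 'v3.10.2' into a tuple of integers."""
    cleaned = text.strip().lstrip("v")
    return tuple(int(part) for part in cleaned.split("."))


def meets_minimum(installed: str, minimum: str) -> bool:
    """Return True when `installed` is at least `minimum`."""
    have, need = parse_version(installed), parse_version(minimum)
    # Pad the shorter tuple with zeros so (1, 9) compares equal to (1, 9, 0).
    width = max(len(have), len(need))
    have += (0,) * (width - len(have))
    need += (0,) * (width - len(need))
    return have >= need


# Minimum versions taken from the list above; installed versions are examples.
print("terraform ok:", meets_minimum("1.9.0", "1.9"))
print("helm ok:", meets_minimum("v3.9.4", "3.10"))
```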

<div class="alert alert-block alert-info">
<b>Alternative Terraform Installation:</b> If you cannot install <b>Terraform 1.9+</b> with Homebrew, try installing it with <code>tfenv</code>.

<details>
<summary>Click to expand installation steps</summary>

```bash
brew install tfenv
tfenv install 1.9.0
tfenv use 1.9.0
terraform version
```

</details>
</div>

You also need:
- An existing AWS Account
- An existing AWS VPC
- An existing AWS EKS Cluster running in the VPC (version 1.25+)
- Proper IAM permissions

## 1. Create Anyscale Resources with Terraform

Set up the necessary Terraform variables, then apply the configuration. Before running Terraform:

* Review and modify [variables.tf](variables.tf) with your configurations for the EKS cluster where you want to deploy Anyscale
* (Optional) Create a `terraform.tfvars` file to override any defaults
* View [main.tf](main.tf) to see how the resources are created

In [None]:
"""
Set the global variables for the deployment, which are the same as in the variables.tf file.
"""

EKS_CLUSTER_NAME = "anyscale-eks-private-xxx"  # Replace with your actual EKS cluster name
AWS_REGION = "us-west-2"  # Replace with your actual AWS region
ANYSCALE_CLOUD_NAME = "anyscale-cloud-eks-private-xxx" # Replace with your actual Anyscale cloud name
ANYSCALE_S3_BUCKET_NAME = EKS_CLUSTER_NAME + "-" + AWS_REGION
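
Because `ANYSCALE_S3_BUCKET_NAME` is derived from the cluster name and region, it may be worth checking that the result is a valid S3 bucket name before running Terraform. The following is a minimal sketch that checks only the basic naming rules (3 to 63 characters; lowercase letters, digits, dots, and hyphens; starting and ending with a letter or digit). The function name is hypothetical and the check is not exhaustive.

```python
import re

# Basic S3 naming rules: 3-63 characters, lowercase letters, digits, dots and
# hyphens, beginning and ending with a letter or digit.
_BUCKET_NAME = re.compile(r"[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]")


def is_valid_bucket_name(name: str) -> bool:
    """Return True when `name` passes the basic S3 naming rules above."""
    return _BUCKET_NAME.fullmatch(name) is not None


# Same placeholder values as the cell above.
eks_cluster_name = "anyscale-eks-private-xxx"
aws_region = "us-west-2"
bucket_name = eks_cluster_name + "-" + aws_region
print(bucket_name, is_valid_bucket_name(bucket_name))
```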

In [None]:
# Run Terraform commands

# Initialize Terraform
!terraform init

# Preview the changes
!terraform plan

# Apply the changes. (this may take 10-15 minutes)
!terraform apply

<details>
<summary>If your AWS account already has many resources, you may hit service limits during installation. In that case, release unused resources such as unassociated Elastic IPs (EIPs) or idle NAT gateways.
</summary>
During installation, you may need to release unused EIPs and delete unused NAT gateways.

To find the unattached EIPs:
```bash
aws ec2 describe-addresses | jq '.Addresses[] | select(.InstanceId == null and .NetworkInterfaceId == null)'
```
You will see a list of unattached EIPs.
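
If `jq` is not available, the same filter can be expressed in Python. This sketch assumes you have the JSON printed by `aws ec2 describe-addresses` as a string and mirrors the `jq` expression above; the function name and the sample records are made up for illustration.

```python
import json


def unattached_addresses(describe_addresses_json: str) -> list[dict]:
    """Return Elastic IPs with neither an instance nor a network interface."""
    addresses = json.loads(describe_addresses_json).get("Addresses", [])
    return [
        address
        for address in addresses
        if address.get("InstanceId") is None
        and address.get("NetworkInterfaceId") is None
    ]


# Made-up sample in the shape returned by `aws ec2 describe-addresses`.
sample = json.dumps({
    "Addresses": [
        {"AllocationId": "eipalloc-aaa", "PublicIp": "203.0.113.10"},
        {"AllocationId": "eipalloc-bbb", "PublicIp": "203.0.113.11",
         "NetworkInterfaceId": "eni-123"},
    ]
})
for address in unattached_addresses(sample):
    print(address["AllocationId"])
```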

To release one:

```bash
aws ec2 release-address --allocation-id eipalloc-xxxxxx
```

To count the NAT gateways in the `available` state in your region (for example, us-west-2):
```bash
aws ec2 describe-nat-gateways --region us-west-2 --filter "Name=state,Values=available" | jq '.NatGateways | length'
```

To delete one:

```bash
aws ec2 delete-nat-gateway --nat-gateway-id nat-xxxxxxxx
```

To identify VPCs that look like they might be safe to delete (test/development ones):
```bash
aws ec2 describe-vpcs --query 'Vpcs[*].[VpcId,Tags[?Key==`Name`].Value|[0]]' --output table | grep -E "(test|temp|dev|scratch|derp|floral|scrumptious)"
```

To delete one:
```bash
aws ec2 delete-vpc --vpc-id vpc-xxxxxxxx
```

Depending on your scenario, you may need to delete more than one.

</details>

<div class="alert alert-block alert-info">
<b>Take note of the output of terraform apply!</b> You will need it when you register the Anyscale cloud.
</div>

<details>
<summary>Sample output</summary>

```
Outputs:

anyscale_registration_command = <<EOT
anyscale cloud register \
        --name <anyscale_cloud_name> \
        --region xxxxxxx \
        --provider aws \
        --compute-stack k8s \
        --kubernetes-zones us-west-2a,us-west-2b \
        --s3-bucket-id xxxxxxx \
        --anyscale-operator-iam-identity arn:aws:iam::xxxxxx:role/default-eks-node-group-xxxxxxxxxxxx
EOT
aws_region = "xxxxxxxx"
eks_cluster_name = "xxxxxxx"
helm_upgrade_command = <<EOT
helm upgrade anyscale-operator anyscale/anyscale-operator \
        --set-string cloudDeploymentId=<cloud-deployment-id> \
        --set-string cloudProvider=aws \
        --set-string region=us-west-2 \
        --set-string workloadServiceAccountName=anyscale-operator \
        --namespace anyscale-operator \
        --create-namespace \
        -i
EOT
```
</details>

## 2. Attach Required IAM Policies to Your Existing EKS Node Role

After running Terraform to create the resources, you need to attach the IAM policies to your EKS node role.

First, run:

In [None]:
!aws eks list-nodegroups --cluster-name {EKS_CLUSTER_NAME}

This returns the node groups in your cluster. Pick the **Anyscale-related node group name** you are looking for.

Second, find your node role by running the following command (replace **{NODE_GROUP_NAME}** with the node group name you picked):

In [None]:
!aws eks describe-nodegroup --cluster-name {EKS_CLUSTER_NAME} --nodegroup-name {NODE_GROUP_NAME} --query 'nodegroup.nodeRole' --output text

This will give you the ARN of the role, from which you can extract the **Node Role Name** (the part after the last "/").
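
The extraction described above can be sketched in Python as follows. The function name `role_name_from_arn` is hypothetical, and the account ID and role name in the example are made up.

```python
def role_name_from_arn(role_arn: str) -> str:
    """Return the part of an IAM role ARN after the last '/'."""
    if ":role/" not in role_arn:
        raise ValueError(f"not an IAM role ARN: {role_arn!r}")
    return role_arn.rsplit("/", 1)[-1]


# Example ARN with a made-up account ID and role name.
print(role_name_from_arn("arn:aws:iam::123456789012:role/default-eks-node-group-example"))
```

Splitting on the last "/" also handles roles that were created with an IAM path, where the ARN contains more than one "/".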

Third, get the S3 policy ARN from the Terraform output:


In [None]:
!terraform output -raw module.anyscale_iam_roles.anyscale_iam_s3_policy_arn

Then attach both policies to your node role (replace **{NODE_ROLE_NAME}** with your existing EKS node role name):

In [None]:
# Attach the S3 policy
!aws iam attach-role-policy \
  --role-name {NODE_ROLE_NAME} \
  --policy-arn $(terraform output -raw module.anyscale_iam_roles.anyscale_iam_s3_policy_arn)

In [None]:
# Attach the EFS policy
!aws iam attach-role-policy \
  --role-name {NODE_ROLE_NAME} \
  --policy-arn arn:aws:iam::aws:policy/AmazonElasticFileSystemClientReadWriteAccess

## 3. Install Kubernetes Components

The Anyscale Operator requires the following components:
- Cluster autoscaler
- AWS Load Balancer Controller (LBC)
- Nginx Ingress Controller
- (Optional) Nvidia device plugin (for GPU nodes)

Let's set up each of these components:


### 3.1 Install the Cluster Autoscaler


In [None]:
# Uses EKS_CLUSTER_NAME and AWS_REGION defined earlier

# Update kubeconfig so kubectl connects to your existing EKS cluster
!aws eks update-kubeconfig --region {AWS_REGION} --name {EKS_CLUSTER_NAME}

!helm repo add autoscaler https://kubernetes.github.io/autoscaler
!helm upgrade cluster-autoscaler autoscaler/cluster-autoscaler \
  --version 9.46.0 \
  --namespace kube-system \
  --set awsRegion={AWS_REGION} \
  --set 'autoDiscovery.clusterName'={EKS_CLUSTER_NAME} \
  --install

### 3.2 Install the AWS Load Balancer Controller


In [None]:
!helm repo add eks https://aws.github.io/eks-charts
!helm upgrade aws-load-balancer-controller eks/aws-load-balancer-controller \
  --version 1.13.2 \
  --namespace kube-system \
  --set clusterName={EKS_CLUSTER_NAME} \
  --install

### 3.3 Install the Nginx Ingress Controller

In [None]:
# We already have a sample-values_nginx.yaml file in the current directory
!helm repo add nginx https://kubernetes.github.io/ingress-nginx
!helm upgrade ingress-nginx nginx/ingress-nginx \
  --version 4.12.1 \
  --namespace ingress-nginx \
  --values sample-values_nginx.yaml \
  --create-namespace \
  --install

### 3.4 (Optional) Install the Nvidia Device Plugin


In [None]:
# We already have a sample-values_nvdp.yaml file in the current directory

# Run these commands only if your cluster has GPU nodes
!helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
!helm upgrade nvdp nvdp/nvidia-device-plugin \
  --namespace nvidia-device-plugin \
  --version 0.17.1 \
  --values sample-values_nvdp.yaml \
  --create-namespace \
  --install

## 4. Register the Anyscale Cloud

First, ensure you're logged in to Anyscale. If the command doesn't work in the notebook, run it in your local terminal:

In [None]:
!anyscale login

Then use the `anyscale_registration_command` from the output of `terraform apply` to register the Anyscale cloud. You only need to replace the `--name` parameter with your preferred `ANYSCALE_CLOUD_NAME`. The command looks like:

```bash
anyscale cloud register ...
```
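
If you script this step, the `--name` value can be swapped in programmatically. The following is a minimal sketch, assuming the registration command is available as a string; the helper name `with_cloud_name` is hypothetical and the command shown is shortened from the sample Terraform output.

```python
import re


def with_cloud_name(register_command: str, cloud_name: str) -> str:
    """Replace the value of the --name flag in an `anyscale cloud register` command."""
    updated, count = re.subn(
        r"(--name\s+)\S+",
        lambda match: match.group(1) + cloud_name,
        register_command,
        count=1,
    )
    if count == 0:
        raise ValueError("no --name flag found in the command")
    return updated


# Shortened version of the command printed by `terraform apply`.
command = "anyscale cloud register --name <anyscale_cloud_name> --provider aws --compute-stack k8s"
print(with_cloud_name(command, "anyscale-cloud-eks-private-xxx"))
```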

You will get output like:

```text
Output
(anyscale +17.9s) For registering this cloud's Kubernetes Manager, use cloud deployment ID 'cldrsrc_12345abcdefgh67890ijklmnop'.
(anyscale +18.0s) Successfully created cloud anyscale-cloud-eks-private-xxxxx, and it's ready to use.
```

After running the command, note the Cloud Deployment ID from the output. It will look something like: 
```
cldrsrc_12345abcdefgh67890ijklmnop
```
You'll need this for the next step.
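
To avoid copying the ID by hand, it can be pulled out of the command output with a regular expression. This is a minimal sketch that assumes the ID always consists of the `cldrsrc_` prefix followed by letters and digits, which is inferred only from the sample output above; the function name is hypothetical.

```python
import re


def find_cloud_deployment_id(register_output: str) -> str:
    """Return the first cloud deployment ID found in the command output."""
    match = re.search(r"cldrsrc_[A-Za-z0-9]+", register_output)
    if match is None:
        raise ValueError("no cloud deployment ID found in the output")
    return match.group(0)


# Sample line copied from the output shown above.
line = (
    "(anyscale +17.9s) For registering this cloud's Kubernetes Manager, "
    "use cloud deployment ID 'cldrsrc_12345abcdefgh67890ijklmnop'."
)
print(find_cloud_deployment_id(line))
```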


## 5. Install the Anyscale Operator

In [None]:
# Set the cloud deployment ID from the previous step
CLOUD_DEPLOYMENT_ID = "cldrsrc_12345abcdefgh67890ijklmnop"  # Replace with your actual cloud deployment ID

!helm repo add anyscale https://anyscale.github.io/helm-charts
!helm upgrade anyscale-operator anyscale/anyscale-operator \
  --set-string cloudDeploymentId={CLOUD_DEPLOYMENT_ID} \
  --set-string cloudProvider=aws \
  --set-string region={AWS_REGION} \
  --set-string workloadServiceAccountName=anyscale-operator \
  --namespace anyscale-operator \
  --create-namespace \
  --install


## 6. Verify the Installation

In [None]:

# Check if Anyscale operator pods are running
!kubectl get pods --all-namespaces | grep -E "(anyscale|ray)" | grep -v "cluster-autoscaler"


# Check if the Anyscale cloud is registered
!anyscale cloud list

# Check if ray cluster is running
!kubectl get pods --all-namespaces | grep ray


## 7. Test

Once the cluster is created, you can test it by submitting a job from your terminal:

In [None]:
A_RANDOM_JOB_NAME = "A random job name"  # Replace with a random job name you like

!cd ../test && python test_job.py  --cloud-name {ANYSCALE_CLOUD_NAME}  --stack-type k8s

# You can check the job status by running:
!anyscale job list --cloud {ANYSCALE_CLOUD_NAME}



The job has now started. You can follow its logs and view its results in the Anyscale Console under "Jobs".

You can also run:

In [None]:
!kubectl get pods --all-namespaces | grep -E "(anyscale|ray)" | grep -v "cluster-autoscaler"

This shows the new Anyscale pods that are scaled up after the job starts. After the job completes, they are terminated.

If you examine the [test job](../test/test_job.py#L30-L50), you'll see that we define a Ray cluster by configuring head nodes and worker nodes with appropriate instance types. When this job is submitted, the Ray cluster is created and the job executes on it.

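In [None]:
# Excerpt from ../test/test_job.py: the compute configuration defines the cluster resources
from anyscale.compute_config.models import (
    ComputeConfig,
    HeadNodeConfig,
    WorkerNodeGroupConfig,
)

cloud_name = ANYSCALE_CLOUD_NAME  # In test_job.py this comes from --cloud-name

compute_config = ComputeConfig(
    cloud=cloud_name,
    head_node=HeadNodeConfig(
        instance_type="2CPU-8GB",        # Ray head node
    ),
    worker_nodes=[
        WorkerNodeGroupConfig(
            instance_type="2CPU-8GB",    # Ray worker nodes
            min_nodes=1,                 # Minimum workers
            max_nodes=1,                 # Maximum workers
        )
    ],
)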

## 8. Troubleshooting

Here are some common issues and how to resolve them:

<div class="alert alert-block alert-info">
<b>Troubleshooting</b> 

<details>
<summary>Click to expand</summary>


1. **IAM Permissions**: Ensure that the Node IAM role has the necessary policies attached:
   - AmazonElasticFileSystemClientReadWriteAccess
   - The Anyscale IAM S3 policy

2. **Networking Issues**: Verify that the security groups allow the necessary traffic between nodes and for external access.

3. **Cluster Autoscaler**: If nodes aren't scaling, check the cluster autoscaler logs:

   `kubectl logs -n kube-system -l app=cluster-autoscaler`

4. **Anyscale Operator**: If the Anyscale operator isn't functioning correctly, check its logs:

   `kubectl logs -n anyscale-operator -l app=anyscale-operator`

</details>
</div>
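
For the IAM permissions check in item 1, the policies attached to the node role can be compared against the required ones. The sketch below assumes you have the JSON printed by `aws iam list-attached-role-policies --role-name <role>` as a string. The function name is hypothetical and the S3 policy ARN in the sample is a placeholder, not a real value.

```python
import json

EFS_POLICY_ARN = "arn:aws:iam::aws:policy/AmazonElasticFileSystemClientReadWriteAccess"


def missing_policy_arns(list_attached_json: str, required_arns: list[str]) -> list[str]:
    """Return the required policy ARNs that are not attached to the role."""
    attached = {
        policy["PolicyArn"]
        for policy in json.loads(list_attached_json).get("AttachedPolicies", [])
    }
    return [arn for arn in required_arns if arn not in attached]


# Made-up sample; replace the S3 policy ARN with the one from your Terraform output.
s3_policy_arn = "arn:aws:iam::123456789012:policy/example-anyscale-s3-policy"
sample = json.dumps({
    "AttachedPolicies": [
        {"PolicyName": "AmazonElasticFileSystemClientReadWriteAccess",
         "PolicyArn": EFS_POLICY_ARN},
    ]
})
print(missing_policy_arns(sample, [EFS_POLICY_ARN, s3_policy_arn]))
```

An empty list means both required policies are attached.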



**View job logs from Anyscale Console**:

<details>
<summary> If the full logs of a job don't appear, you need to add a cross-origin resource sharing (CORS) rule to the S3 bucket you use for the deployment.
</summary>
   You can:

   1. Log in to your AWS console, choose S3 service, and find the bucket you are using for this deployment
   2. Click "Permissions", and scroll down to "Cross-origin resource sharing (CORS)"
   3. Add the following:
   ```
   [
      {
        "AllowedHeaders": [
            "*"
        ],
        "AllowedMethods": [
            "GET"
        ],
        "AllowedOrigins": [
            "https://console.anyscale.com"
        ],
        "ExposeHeaders": []
      }
   ]
   ```
   Save this change and you should be able to view full job logs now.
   </details>
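
As an alternative to the console steps above, the same rule could be applied with boto3. This is a sketch only: it assumes boto3 is installed and AWS credentials are configured, the function names are hypothetical, and the bucket name in the commented-out call is a placeholder. Note that the API wraps the rule list in a `CORSRules` key, unlike the JSON pasted into the console.

```python
def anyscale_console_cors_rules(origin: str = "https://console.anyscale.com") -> dict:
    """Build the CORS configuration shown above in the shape boto3 expects."""
    return {
        "CORSRules": [
            {
                "AllowedHeaders": ["*"],
                "AllowedMethods": ["GET"],
                "AllowedOrigins": [origin],
                "ExposeHeaders": [],
            }
        ]
    }


def apply_cors(bucket_name: str) -> None:
    """Apply the rules to a bucket; requires boto3 and AWS credentials."""
    import boto3  # imported lazily so the rest of the sketch runs without it

    boto3.client("s3").put_bucket_cors(
        Bucket=bucket_name,
        CORSConfiguration=anyscale_console_cors_rules(),
    )


print(anyscale_console_cors_rules())
# apply_cors("your-bucket-name")  # uncomment to apply; the bucket name is a placeholder
```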

## 9. Clean up

Run these 3 commands in your terminal, updating the placeholder values to match your setup:

In [None]:
# Unregister the cloud
!anyscale cloud delete {ANYSCALE_CLOUD_NAME}

# Verify it is deleted
!anyscale cloud list

# Empty the Anyscale S3 bucket if it contains data. If this command fails, run it in your local terminal.
!aws s3 rm s3://{ANYSCALE_S3_BUCKET_NAME} --recursive

Now you can run the following block to clean up the rest of the resources:

In [None]:
# Uninstall the Anyscale operator
!helm uninstall anyscale-operator --namespace anyscale-operator

# Delete the namespace
!kubectl delete namespace anyscale-operator

# (Optional) Remove the Nginx Ingress Controller if you installed it only for Anyscale
!helm uninstall ingress-nginx --namespace ingress-nginx
!kubectl delete namespace ingress-nginx

# (Optional) Remove the AWS Load Balancer Controller if you installed it only for Anyscale
!helm uninstall aws-load-balancer-controller --namespace kube-system

# (Optional) Remove the Cluster Autoscaler if you installed it only for Anyscale
!helm uninstall cluster-autoscaler --namespace kube-system

# (Optional) Remove the Nvidia Device Plugin if you installed it only for Anyscale
!helm uninstall nvdp --namespace nvidia-device-plugin
!kubectl delete namespace nvidia-device-plugin

# Empty the Anyscale S3 bucket if it contains data. If this command fails, run it in your local terminal.
!aws s3 rm s3://{ANYSCALE_S3_BUCKET_NAME} --recursive

# Destroy Terraform-managed resources
!terraform plan -destroy
!terraform destroy --auto-approve

# Verify no Anyscale-related pods are running
!kubectl get pods --all-namespaces | grep -E "(anyscale|ray)"

# Verify Helm releases are removed
!helm list --all-namespaces

# Verify Terraform resources are destroyed
!terraform show

## 10. Conclusion

You have now successfully set up the Anyscale environment on an existing AWS EKS cluster. This includes:

1. Creating the necessary AWS resources using Terraform
2. Installing the required Kubernetes components:
   - Cluster Autoscaler
   - AWS Load Balancer Controller
   - Nginx Ingress Controller
   - (Optional) Nvidia Device Plugin
3. Registering the Anyscale Cloud
4. Installing the Anyscale Operator

You can now use this environment to run Ray workloads on Anyscale.
