
[Cluster] Autoscaler frequently fails to scale down workers #51585

@FredrikNoren

Description

What happened + What you expected to happen

Often when I open the dashboard in the morning, I see that the cluster is still running GPU workers (which are quite expensive), even though there are no running or pending jobs.

How can I debug why this is happening?
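
I assume the place to start is the autoscaler's own status output and its log on the head node; a minimal sketch, assuming the default log locations (in this setup they would be inside the ray_container):

# What the autoscaler currently believes about nodes, resource demands and usage
ray status -v

# Autoscaler decision log, including why nodes are or are not treated as idle
tail -n 200 /tmp/ray/session_latest/logs/monitor.log
grep -i idle /tmp/ray/session_latest/logs/monitor.*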

Versions / Dependencies

[project]
name = "clip2actions"
version = "0.1.0"
description = "Add your description here"
readme = "README.md"
requires-python = ">=3.12"
dependencies = [
    "av>=14.1.0",
    "boto3>=1.36.7",
    "datasets>=3.3.2",
    "docker>=7.1.0",
    "google-cloud-storage>=2.19.0",
    "grpcio>=1.70.0",
    "grpcio-tools>=1.70.0",
    "moviepy>=2.1.2",
    "open-clip-torch>=2.31.0",
    "opencv-python>=4.11.0.86; sys_platform == 'darwin'",
    "opencv-python-headless>=4.11.0.86; sys_platform == 'linux'",
    "pandas>=2.2.3",
    "pillow>=10.4.0",
    "plotly>=6.0.0",
    "py-spy>=0.4.0",
    "pydantic>=2.10.6",
    "pydantic-settings>=2.7.1",
    "pymysql>=1.1.1",
    "ray[data,default,serve,train,tune]>=2.43.0",
    "torch>=2.6.0",
    "torchmetrics>=1.6.1",
    "torchvision>=0.21.0",
    "transformers[torch]@git+https://github.com/huggingface/transformers@v4.49.0-Gemma-3",
    "wandb>=0.19.4",
    # https://github.com/Dao-AILab/flash-attention/issues/833
    "flash-attn @ https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.3/flash_attn-2.7.3+cu12torch2.6cxx11abiFALSE-cp312-cp312-linux_x86_64.whl; sys_platform == 'linux'",
    "trl@https://github.com/huggingface/trl.git",
    "peft>=0.14.0",
]

[tool.uv.sources]
torch = { index = "pytorch_cu124", marker = "sys_platform == 'linux'" }
torchvision = { index = "pytorch_cu124", marker = "sys_platform == 'linux'" }

[[tool.uv.index]]
name = "pytorch_cu124"
url = "https://download.pytorch.org/whl/cu124"
explicit = true
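
For completeness: the pyproject only pins a lower bound on Ray (>=2.43.0), so the exact version running on the cluster would need to be confirmed separately, e.g.:

ray --version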

Reproduction script

This is my config.yaml

# A unique identifier for the head node and workers of this cluster.
cluster_name: inverse-dynamics

# Cloud-provider specific configuration.
provider:
    type: aws
    region: us-east-1 # if this value is changed, the AMI and EFS IDs below need to be switched to region-specific ones
    security_group:
        GroupName: inverse_dynamics_security_group
        IpPermissions:
            - FromPort: 9090
              ToPort: 9090
              IpProtocol: TCP
              IpRanges:
                - CidrIp: 0.0.0.0/0
            - FromPort: 3000
              ToPort: 3000
              IpProtocol: TCP
              IpRanges:
                - CidrIp: 0.0.0.0/0

auth:
    ssh_user: ubuntu

docker:
    container_name: "ray_container"
    pull_before_run: True
    run_options:   # Extra options to pass into "docker run"
        - --ulimit nofile=65536:65536
        - --volume /mnt/efs:/home/ray/data
        - --publish 8080:8080
        - --publish 3000:3000
        - --publish 9090:9090
        - --publish 8000:8000
        - --cap-add SYS_PTRACE
        - --env HF_HOME=/home/ray/data/huggingface
        - -v /var/run/docker.sock:/var/run/docker.sock # Enable docker outside of docker so that the minecraft docker can be started
    worker_run_options:
        - --gpus all

    head_image: "fredrikmedal/idm-ray-cpu"
    worker_image: "fredrikmedal/idm-ray-gpu"

max_workers: 1 # remember to change each worker type's max_workers as well
upscaling_speed: 1.0
idle_timeout_minutes: 60

available_node_types:
    ray.head.default:
        node_config:
            InstanceType: c7i.4xlarge
            BlockDeviceMappings:
                - DeviceName: /dev/sda1
                  Ebs:
                      VolumeSize: 500
                      VolumeType: gp3
    ray.worker.gpu:
        min_workers: 0
        max_workers: 1 # remember to change the global max_workers
        resources: { "gold": 100000 }
        node_config:
            # InstanceType: g6e.8xlarge  # 1 L40S GPU  / 48 GB,  32 CPUs / 256 GB RAM,  4.52856 USD/h
            InstanceType: p4d.24xlarge   # 8 A100 GPUs / 320 GB, 96 CPUs / 1152 GB RAM, 32.7726 USD/h
            # InstanceType: p5.48xlarge  # 8 H100 GPUs / 640 GB, 192 CPUs / 2 TB RAM,   98.32 USD/h
            # InstanceType: p5e.48xlarge  # 8 H200 GPUs / 1128 GB, 192 CPUs / 2 TB RAM,   98.32 USD/h
            ImageId: ami-08ea187523fb45736 # us-east-1 Supports G4dn, G5, G6, Gr6, G6e, P4d, P4de, P5, P5e, P5en
            # ImageId: ami-0729f1db13c5d63f9 # us-east-2 Supports G4dn, G5, G6, Gr6, G6e, P4d, P4de, P5, P5e, P5en
            BlockDeviceMappings:
                - DeviceName: /dev/sda1
                  Ebs:
                      VolumeSize: 500
                      VolumeType: gp3

head_node_type: ray.head.default



initialization_commands:
    - sudo mkdir /mnt/efs -p;
        sudo mount -t nfs4 -o nfsvers=4.1,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2,noresvport fs-03acef52e77b987be.efs.us-east-1.amazonaws.com:/ /mnt/efs;
        sudo chmod 777 /mnt/efs;
        # sudo mount -t nfs4 -o nfsvers=4.1,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2,noresvport fs-0895a72305fa1f701.efs.us-east-2.amazonaws.com:/ efs;
    - sudo rm /usr/local/cuda || true;
        sudo ln -s /usr/local/cuda-12.4 /usr/local/cuda || true
    - sudo nvidia-ctk runtime configure --runtime=docker || true;
        sudo systemctl restart docker || true

setup_commands:
    # ray rllib installs an old version of moviepy, so the newer version has to be reinstalled here
    - pip install -U "moviepy==2.1.2"
    - sudo chmod 666 /var/run/docker.sock # Docker outside of Docker, so the Minecraft container can be started

head_setup_commands:
    - python3 -m http.server 8000 --directory data > http-file-server.log 2>&1 &
    - sudo ./run-prometheus.sh > prometheus-launcher.log &
    - sudo ./run-grafana.sh > grafana-launcher.log &

head_start_ray_commands:
    - ray stop
    - RAY_GRAFANA_IFRAME_HOST=http://$(wget -qO- http://checkip.amazonaws.com):3000 ray start --head --dashboard-host=0.0.0.0 --autoscaling-config=~/ray_bootstrap_config.yaml --metrics-export-port=8080 --system-config='{"object_spilling_config":"{\"type\":\"filesystem\",\"params\":{\"directory_path\":\"/home/ray/data/spill\"}}"}'
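
One thing I am not sure about: the GPU worker type advertises a large custom resource ("gold": 100000), and as far as I understand a worker only counts as idle once nothing reserves any of its resources, so a leftover detached actor, placement group or job could keep it alive well past idle_timeout_minutes. A sketch of what I would run on the head node to look for such leftovers:

# Anything still alive here can pin resources on a worker and block scale-down
ray list actors
ray list placement-groups
ray list jobs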

Issue Severity

Medium: It is a significant difficulty but I can work around it.

Labels

external-author-action-required: Alternate tag for PRs where the author doesn't have labeling permission.
P1: Issue that should be fixed within a few weeks.
bug: Something that is supposed to be working; but isn't.
community-backlog
core: Issues that should be addressed in Ray Core.
core-autoscaler: autoscaler related issues.
