What happened + What you expected to happen
Often when I open the dashboard in the morning, I see that the cluster is still running its GPU workers (which are quite expensive), even though there are no running or pending jobs. With idle_timeout_minutes set to 60, I would expect those workers to have been scaled down overnight.
How can I debug why this is happening?
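To make the question concrete, this is roughly the kind of check I have in mind, run from the head node with only standard Ray APIs (nothing here is specific to my setup), to see whether anything still claims the GPUs:

```python
# Rough sketch, using only standard Ray APIs: compare total vs. available
# resources and list the nodes that still advertise GPUs, to see whether
# anything is keeping the expensive workers busy.
import ray

ray.init(address="auto")  # run on the head node

total = ray.cluster_resources()
free = ray.available_resources()
for name in sorted(total):
    used = total[name] - free.get(name, 0.0)
    if used > 0:
        print(f"{name}: {used} of {total[name]} in use")

# Nodes that are alive and still provide GPUs (i.e. the GPU workers).
for node in ray.nodes():
    if node["Alive"] and node["Resources"].get("GPU", 0) > 0:
        print("GPU node still up:", node["NodeManagerAddress"])
```

As far as I understand, `ray status -v` and the autoscaler log on the head node (monitor.log under /tmp/ray/session_latest/logs/) should show the autoscaler's own view as well, but I don't know how to tell from them why a node is not being treated as idle.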
Versions / Dependencies
This is my pyproject.toml:
[project]
name = "clip2actions"
version = "0.1.0"
description = "Add your description here"
readme = "README.md"
requires-python = ">=3.12"
dependencies = [
"av>=14.1.0",
"boto3>=1.36.7",
"datasets>=3.3.2",
"docker>=7.1.0",
"google-cloud-storage>=2.19.0",
"grpcio>=1.70.0",
"grpcio-tools>=1.70.0",
"moviepy>=2.1.2",
"open-clip-torch>=2.31.0",
"opencv-python>=4.11.0.86; sys_platform == 'darwin'",
"opencv-python-headless>=4.11.0.86; sys_platform == 'linux'",
"pandas>=2.2.3",
"pillow>=10.4.0",
"plotly>=6.0.0",
"py-spy>=0.4.0",
"pydantic>=2.10.6",
"pydantic-settings>=2.7.1",
"pymysql>=1.1.1",
"ray[data,default,serve,train,tune]>=2.43.0",
"torch>=2.6.0",
"torchmetrics>=1.6.1",
"torchvision>=0.21.0",
"transformers[torch]@git+https://github.com/huggingface/transformers@v4.49.0-Gemma-3",
"wandb>=0.19.4",
# https://github.com/Dao-AILab/flash-attention/issues/833
"flash-attn @ https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.3/flash_attn-2.7.3+cu12torch2.6cxx11abiFALSE-cp312-cp312-linux_x86_64.whl; sys_platform == 'linux'",
"trl@https://github.com/huggingface/trl.git",
"peft>=0.14.0",
]
[tool.uv.sources]
torch = { index = "pytorch_cu124", marker = "sys_platform == 'linux'" }
torchvision = { index = "pytorch_cu124", marker = "sys_platform == 'linux'" }
[[tool.uv.index]]
name = "pytorch_cu124"
url = "https://download.pytorch.org/whl/cu124"
explicit = true
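For reference, torch/torchvision only come from the cu124 index on Linux, and flash-attn from a cp312 / torch 2.6 / cu12 wheel. A quick sanity check along these lines (standard torch attributes only, nothing project-specific) shows whether that CUDA build is what actually ends up in the worker containers:

```python
# Sanity check using only standard torch attributes: verify that the cu124
# build from the pinned index is what got installed on the Linux workers.
import torch

print(torch.__version__)          # expected to end in +cu124 when the pinned index is used
print(torch.version.cuda)         # expected "12.4"
print(torch.cuda.is_available())  # expected True on the GPU workers
```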
Reproduction script
This is my config.yaml:
# A unique identifier for the head node and workers of this cluster.
cluster_name: inverse-dynamics

# Cloud-provider specific configuration.
provider:
  type: aws
  region: us-east-1 # the AMI and EFS need to be switched to region-specific values if this is changed
  security_group:
    GroupName: inverse_dynamics_security_group
    IpPermissions:
      - FromPort: 9090
        ToPort: 9090
        IpProtocol: TCP
        IpRanges:
          - CidrIp: 0.0.0.0/0
      - FromPort: 3000
        ToPort: 3000
        IpProtocol: TCP
        IpRanges:
          - CidrIp: 0.0.0.0/0

auth:
  ssh_user: ubuntu

docker:
  container_name: "ray_container"
  pull_before_run: True
  run_options: # Extra options to pass into "docker run"
    - --ulimit nofile=65536:65536
    - --volume /mnt/efs:/home/ray/data
    - --publish 8080:8080
    - --publish 3000:3000
    - --publish 9090:9090
    - --publish 8000:8000
    - --cap-add SYS_PTRACE
    - --env HF_HOME=/home/ray/data/huggingface
    - -v /var/run/docker.sock:/var/run/docker.sock # Enable Docker-outside-of-Docker so that the Minecraft container can be started
  worker_run_options:
    - --gpus all
  head_image: "fredrikmedal/idm-ray-cpu"
  worker_image: "fredrikmedal/idm-ray-gpu"

max_workers: 1 # remember to change each worker type's max_workers as well
upscaling_speed: 1.0
idle_timeout_minutes: 60

available_node_types:
  ray.head.default:
    node_config:
      InstanceType: c7i.4xlarge
      BlockDeviceMappings:
        - DeviceName: /dev/sda1
          Ebs:
            VolumeSize: 500
            VolumeType: gp3
  ray.worker.gpu:
    min_workers: 0
    max_workers: 1 # remember to change the global max_workers
    resources: { "gold": 100000 }
    node_config:
      # InstanceType: g6e.8xlarge # 1 L40S GPU / 48 GB, 32 CPUs / 256 GB RAM, 4.52856 USD/h
      InstanceType: p4d.24xlarge # 8 A100 GPUs / 320 GB, 96 CPUs / 1152 GB RAM, 32.7726 USD/h
      # InstanceType: p5.48xlarge # 8 H100 GPUs / 640 GB, 192 CPUs / 2 TB RAM, 98.32 USD/h
      # InstanceType: p5e.48xlarge # 8 H200 GPUs / 1128 GB, 192 CPUs / 2 TB RAM, 98.32 USD/h
      ImageId: ami-08ea187523fb45736 # us-east-1, supports G4dn, G5, G6, Gr6, G6e, P4d, P4de, P5, P5e, P5en
      # ImageId: ami-0729f1db13c5d63f9 # us-east-2, supports G4dn, G5, G6, Gr6, G6e, P4d, P4de, P5, P5e, P5en
      BlockDeviceMappings:
        - DeviceName: /dev/sda1
          Ebs:
            VolumeSize: 500
            VolumeType: gp3

head_node_type: ray.head.default

initialization_commands:
  - sudo mkdir /mnt/efs -p;
    sudo mount -t nfs4 -o nfsvers=4.1,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2,noresvport fs-03acef52e77b987be.efs.us-east-1.amazonaws.com:/ /mnt/efs;
    sudo chmod 777 /mnt/efs;
    # sudo mount -t nfs4 -o nfsvers=4.1,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2,noresvport fs-0895a72305fa1f701.efs.us-east-2.amazonaws.com:/ efs;
  - sudo rm /usr/local/cuda || true;
    sudo ln -s /usr/local/cuda-12.4 /usr/local/cuda || true
  - sudo nvidia-ctk runtime configure --runtime=docker || true;
    sudo systemctl restart docker || true

setup_commands:
  # Ray RLlib installs an old version of moviepy, so I need to re-install the newer version here
  - pip install -U "moviepy==2.1.2"
  - sudo chmod 666 /var/run/docker.sock # This is for Docker-outside-of-Docker, to run the Minecraft container

head_setup_commands:
  - python3 -m http.server 8000 --directory data > http-file-server.log 2>&1 &
  - sudo ./run-prometheus.sh > prometheus-launcher.log &
  - sudo ./run-grafana.sh > grafana-launcher.log &

head_start_ray_commands:
  - ray stop
  - RAY_GRAFANA_IFRAME_HOST=http://$(wget -qO- http://checkip.amazonaws.com):3000 ray start --head --dashboard-host=0.0.0.0 --autoscaling-config=~/ray_bootstrap_config.yaml --metrics-export-port=8080 --system-config='{"object_spilling_config":"{\"type\":\"filesystem\",\"params\":{\"directory_path\":\"/home/ray/data/spill\"}}"}'
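One note on the config: the GPU worker type advertises a custom "gold" resource on top of its GPUs, has min_workers: 0, and the idle timeout is 60 minutes, so my working assumption is that something (a detached actor, a placement group, or possibly objects whose primary copies live on that node) is still holding the worker so it never counts as idle. A rough sketch of the check I have in mind, assuming the ray.util.state API that ships with this Ray version:

```python
# Sketch (assumes the ray.util.state API in Ray 2.x): look for leftover actors
# and placement groups that could keep the GPU worker from being considered idle.
import ray
from ray.util.state import list_actors, list_placement_groups

ray.init(address="auto")  # run on the head node

for actor in list_actors(filters=[("state", "=", "ALIVE")]):
    print("alive actor:", actor.class_name, actor.name, actor.node_id)

for pg in list_placement_groups():
    if pg.state != "REMOVED":
        print("placement group:", pg.placement_group_id, pg.state)
```

If both come back empty, I don't know what else could be keeping the node from being reclaimed.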
Issue Severity
Medium: It is a significant difficulty but I can work around it.