Ray - 2.11.0
Python - 3.10.14
Official docker image: rayproject/ray:latest-py310-cpu
Reproduction script
This reproduction script is specific to AWS. Two resources are required for this - ECR and EKS.
Create two repositories on ECR - translatorapp and customrayimage
Use the following Dockerfile to build and push to ECR
translator.Dockerfile : Use an example Ray application shown here.
FROM rayproject/ray:latest-py310-cpu
RUN pip install "transformers[torch]"
WORKDIR /home/ray
ENV PYTHONPATH "${PYTHONPATH}:/home/ray"
COPY translator.py .
custom_ray.Dockerfile: Since podman is required for this experimental feature, we add it as a dependency and create a custom ray image.
FROM ubuntu:22.04
RUN apt-get update -y && apt-get install -y curl wget python3.10 python3.10-venv build-essential podman
RUN curl -sS https://bootstrap.pypa.io/get-pip.py | python3.10
RUN pip install "ray[serve]==2.11.0"
RUN podman version
RUN podman login --username AWS --password <aws-ecr-password> <accid>.dkr.ecr.<aws-region>.amazonaws.com
Two things need to be configured here:
a. Replace <aws-ecr-password> with the output of aws ecr get-login-password --region <your-aws-region>
b. Replace <accid>.dkr.ecr.<aws-region>.amazonaws.com with your private ECR registry URL.
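For the build-and-push machine (not the Dockerfile itself), the two substitutions above can be sketched in Python. This is a minimal sketch, not part of the original report: the helper names are hypothetical, and it pipes the token through podman's `--password-stdin` flag instead of the `--password` argument used in the Dockerfile, so the token does not appear in the process arguments.

```python
import subprocess


def ecr_registry(account_id: str, region: str) -> str:
    """Build the private ECR registry host from an account ID and region."""
    return f"{account_id}.dkr.ecr.{region}.amazonaws.com"


def password_command(region: str) -> list[str]:
    """Command that prints a temporary ECR password (step a above)."""
    return ["aws", "ecr", "get-login-password", "--region", region]


def login_command(account_id: str, region: str) -> list[str]:
    """podman login command that reads the password from stdin (step b above)."""
    return [
        "podman", "login",
        "--username", "AWS",
        "--password-stdin",
        ecr_registry(account_id, region),
    ]


def login(account_id: str, region: str) -> None:
    """Fetch the token with the AWS CLI and pass it to podman login."""
    token = subprocess.run(
        password_command(region), check=True, capture_output=True, text=True
    ).stdout
    subprocess.run(
        login_command(account_id, region), input=token, check=True, text=True
    )
```

Calling `login("<accid>", "<aws-region>")` requires both the AWS CLI and podman to be installed and AWS credentials to be configured on the machine.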
Build and push both images to ECR. I used podman for this.
Create an EKS cluster (I used an m7i.xlarge instance for testing this).
Install the KubeRay operator.
Apply the following serve_config.yaml to the EKS cluster (kubectl apply -f serve_config.yaml).
serve_config.yaml : For now this configuration file deploys only one container, but serveConfigV2 can easily be extended to add multiple containers.
apiVersion: ray.io/v1
kind: RayService
metadata:
  name: rayservice-sample
spec:
  serveConfigV2: |
    applications:
      - name: whisper
        import_path: translator:translator_app
        route_prefix: /whisper
        runtime_env:
          container:
            image: <acc>.dkr.ecr.eu-west-2.amazonaws.com/translatorapp:latest
            worker_path: /home/ray/anaconda3/lib/python3.10/site-packages/ray/_private/workers/default_worker.py
            run_options: ["--tty", "--privileged", "--log-level=debug", "--security-opt=label=disable", "--restart unless-stopped"]
  rayClusterConfig:
    rayVersion: "2.11.0" # should match the Ray version in the image of the containers
    headGroupSpec:
      rayStartParams:
        dashboard-host: "0.0.0.0"
      template:
        spec:
          containers:
            - name: ray-head
              image: <acc>.dkr.ecr.eu-west-2.amazonaws.com/customrayimage:latest
              resources:
                limits:
                  cpu: 2
                  memory: 2Gi
                requests:
                  cpu: 2
                  memory: 2Gi
              ports:
                - containerPort: 6379
                  name: gcs-server
                - containerPort: 8265 # Ray dashboard
                  name: dashboard
                - containerPort: 10001
                  name: client
                - containerPort: 8000
                  name: serve
    workerGroupSpecs:
      - replicas: 1
        minReplicas: 1
        maxReplicas: 2
        groupName: small-group
        rayStartParams: {}
        template:
          spec:
            containers:
              - name: ray-worker
                image: <acc>.dkr.ecr.eu-west-2.amazonaws.com/customrayimage:latest
                resources:
                  limits:
                    cpu: "1"
                    memory: "2Gi"
                  requests:
                    cpu: "500m"
                    memory: "2Gi"
Replace the container image in all three places with the appropriate ECR repository.
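To avoid missing one of the three image references, the substitution can be scripted. The following is a minimal sketch under stated assumptions: `fill_registry` is a hypothetical helper, and it assumes the config still contains the literal `<acc>` placeholder exactly as shown above.

```python
def fill_registry(config_text: str, account_id: str,
                  placeholder: str = "<acc>", expected: int = 3) -> str:
    """Replace every placeholder with the account ID, checking the count first."""
    found = config_text.count(placeholder)
    if found != expected:
        # Fail loudly instead of deploying a partially edited config.
        raise ValueError(
            f"expected {expected} occurrences of {placeholder!r}, found {found}"
        )
    return config_text.replace(placeholder, account_id)
```

For example, reading serve_config.yaml into a string, passing it through `fill_registry(text, "<your-account-id>")`, and writing the result back would update the application image and both Ray images in one step.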
I also tried adding the following to the Ray head and worker group specs, but with these settings podman was not able to pull images from ECR.
securityContext:
  privileged: true
Issue Severity
Low: It annoys or frustrates me.
dudeperf3ct added the bug (Something that is supposed to be working; but isn't) and triage (Needs triage, e.g. priority, bug/not-bug, and owning component) labels on Apr 30, 2024
Hi @dudeperf3ct, if you are running out of space when the image is pulled, you likely need to increase the size of the EBS root volume attached to the instance. You can refer to this guide on how to do that.
I'd recommend using the block device mapping to provision a larger root volume for any future EKS cluster deployments.
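Before resizing the volume, it may help to confirm how much space is actually free on the node while the image is being pulled. This is a minimal sketch using only the Python standard library; the helper names, the `/` path, and the 10 GiB threshold are arbitrary assumptions, not values from this thread.

```python
import shutil

GIB = 1024 ** 3


def free_gib(path: str = "/") -> float:
    """Return the free space, in GiB, on the filesystem that contains path."""
    return shutil.disk_usage(path).free / GIB


def low_on_space(free: float, threshold_gib: float = 10.0) -> bool:
    """True when the free space (in GiB) has dropped below the threshold."""
    return free < threshold_gib


if __name__ == "__main__":
    free = free_gib("/")
    print(f"free: {free:.1f} GiB, low: {low_on_space(free)}")
```

Running this repeatedly during the pull shows whether free space keeps shrinking after the image should already be complete.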
@askulkarni2 The EKS instance started with 150 GB of disk space. Podman pulls the image from ECR in what looks like an infinite loop, and that is what makes it run out of space.
Attaching a screenshot of the logs in raylet.err. Some of the layers are being pulled multiple times. Only one container is specified in serveConfigV2 above. I expected the application to start once podman had pulled all layers, but instead it keeps pulling the same container image from ECR.
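One way to quantify the repeated pulls is to count how often each layer digest shows up in the log. This is a rough sketch, not a confirmed diagnostic: it assumes the raylet.err text mentions layers by their `sha256:` digest, and a digest can legitimately appear more than once in debug output even for a single pull, so high counts are only a hint.

```python
import re
from collections import Counter

# Matches a full SHA-256 digest such as those used for image layers.
DIGEST = re.compile(r"sha256:[0-9a-f]{64}")


def repeated_digests(log_text: str) -> dict[str, int]:
    """Return the digests that occur more than once, with their counts."""
    counts = Counter(DIGEST.findall(log_text))
    return {digest: n for digest, n in counts.items() if n > 1}
```

Feeding it the contents of raylet.err (for example via `open(path).read()`) gives a per-digest count that can be compared across two points in time to see whether the same layers keep reappearing.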
GeneDer added the P2 label (Important issue, but not time-critical) and removed the triage label (Needs triage, e.g. priority, bug/not-bug, and owning component) on May 3, 2024
What happened + What you expected to happen
I am trying to run the experimental feature of running multiple applications in different containers on EKS.
I will include the exact steps in the Reproduction script section. After deploying the application on EKS,
True
actually messes with the authorization
Guide: https://docs.ray.io/en/latest/serve/advanced-guides/multi-app-container.html