[Ray Serve] Running experimental multiple application in different containers on EKS #45056

Open
dudeperf3ct opened this issue Apr 30, 2024 · 3 comments
Labels: bug (Something that is supposed to be working; but isn't), P2 (Important issue, but not time-critical), serve (Ray Serve Related Issue)

Comments

@dudeperf3ct
Contributor

dudeperf3ct commented Apr 30, 2024

What happened + What you expected to happen

I am trying to run the experimental feature of running multiple applications in different containers on EKS.

The exact steps are in the Reproduction script section. After deploying the application on EKS, I see two problems:

  1. Podman tries to pull the image in a loop (and eventually crashes because the container runs out of disk space).
  2. Setting privileged to true breaks authorization: podman is no longer able to pull images from ECR.

Guide: https://docs.ray.io/en/latest/serve/advanced-guides/multi-app-container.html

Versions / Dependencies

Ray - 2.11.0
Python - 3.10.14
Official docker image: rayproject/ray:latest-py310-cpu

Reproduction script

This reproduction script is specific to AWS. Two resources are required for this - ECR and EKS.

  1. Create two repositories on ECR - translatorapp and customrayimage

  2. Use the following Dockerfiles to build the two images and push them to ECR

    translator.Dockerfile : Use an example Ray application shown here.

    FROM rayproject/ray:latest-py310-cpu
    RUN pip install "transformers[torch]"
    WORKDIR /home/ray
    ENV PYTHONPATH "${PYTHONPATH}:/home/ray"
    COPY translator.py . 

    custom_ray.Dockerfile: Since podman is required for this experimental feature, we add it as a dependency and create a custom ray image.

    FROM ubuntu:22.04
    RUN apt-get update -y && apt-get install -y curl wget python3.10 python3.10-venv build-essential podman
    RUN curl -sS https://bootstrap.pypa.io/get-pip.py | python3.10
    RUN pip install "ray[serve]==2.11.0"
    RUN podman version
    RUN podman login --username AWS --password <aws-ecr-password> <accid>.dkr.ecr.<aws-region>.amazonaws.com  

    Two things need to be configured here:
    a. Replace <aws-ecr-password> with the output of aws ecr get-login-password --region <your-aws-region>
    b. Replace <accid>.dkr.ecr.<aws-region>.amazonaws.com with your private ECR registry URL.

  3. Build and push both images to ECR. I used podman for this.

  4. Create an EKS cluster (I used an m7i.xlarge instance for testing).

  5. Install the KubeRay operator.

  6. Apply the following serve_config.yaml to the EKS cluster (kubectl apply -f serve_config.yaml).

    serve_config.yaml: For now, this configuration file deploys only one container, but serveConfigV2 can easily be extended to add more.

    apiVersion: ray.io/v1
    kind: RayService
    metadata:
      name: rayservice-sample
    spec:
      serveConfigV2: |
        applications:
          - name: whisper
            import_path: translator:translator_app
            route_prefix: /whisper
            runtime_env:
              container:
                image: <acc>.dkr.ecr.eu-west-2.amazonaws.com/translatorapp:latest 
                worker_path: /home/ray/anaconda3/lib/python3.10/site-packages/ray/_private/workers/default_worker.py
                run_options: ["--tty", "--privileged", "--log-level=debug", "--security-opt=label=disable",  "--restart unless-stopped"]
    
      rayClusterConfig:
        rayVersion: "2.11.0" # should match the Ray version in the image of the containers
        headGroupSpec:
          rayStartParams:
            dashboard-host: "0.0.0.0"
          template:
            spec:
              containers:
                - name: ray-head
                  image: <acc>.dkr.ecr.eu-west-2.amazonaws.com/customrayimage:latest
                  resources:
                    limits:
                      cpu: 2
                      memory: 2Gi
                    requests:
                      cpu: 2
                      memory: 2Gi
                  ports:
                    - containerPort: 6379
                      name: gcs-server
                    - containerPort: 8265 # Ray dashboard
                      name: dashboard
                    - containerPort: 10001
                      name: client
                    - containerPort: 8000
                      name: serve
        workerGroupSpecs:
          - replicas: 1
            minReplicas: 1
            maxReplicas: 2
            groupName: small-group
            rayStartParams: {}
            template:
              spec:
                containers:
                  - name: ray-worker
                    image: <acc>.dkr.ecr.eu-west-2.amazonaws.com/customrayimage:latest
                    resources:
                      limits:
                        cpu: "1"
                        memory: "2Gi"
                      requests:
                        cpu: "500m"
                        memory: "2Gi"     

    Replace the container image in all three places with the appropriate ECR repository.

  7. I also tried adding the following to the Ray head and worker group specs, but with it in place podman was not able to pull images from ECR.

    securityContext:
       privileged: true
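
Steps 2, 3, and 6 above can be sketched as a few shell helpers. This is a minimal sketch, not the exact commands used in this report: the function names, the DRY_RUN switch, and the example account ID are hypothetical, and it assumes the aws, podman, and kubectl CLIs are installed and configured. It pipes the ECR token to podman login via --password-stdin instead of pasting it on the command line, and fills in the <acc> placeholder of serve_config.yaml with sed.

```shell
# Sketch only: helper names, DRY_RUN, and example values are assumptions.

# Print the command instead of running it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Private ECR registry host. Args: account ID, region.
ecr_registry() {
  printf '%s.dkr.ecr.%s.amazonaws.com\n' "$1" "$2"
}

# Log podman in to ECR; the token is read from stdin.
# Not DRY_RUN-aware: this always calls aws and podman.
# Args: account ID, region.
ecr_login() {
  aws ecr get-login-password --region "$2" |
    podman login --username AWS --password-stdin "$(ecr_registry "$1" "$2")"
}

# Build one Dockerfile and push the result to an ECR repository.
# Args: Dockerfile, repository name, account ID, region.
build_and_push() {
  run podman build -f "$1" -t "$(ecr_registry "$3" "$4")/$2:latest" .
  run podman push "$(ecr_registry "$3" "$4")/$2:latest"
}

# Write the config to stdout with <acc> replaced by the account ID.
# Args: config file, account ID.
render_config() {
  sed "s/<acc>/$2/g" "$1"
}
```

A possible invocation, with 123456789012 standing in for a real account ID: ecr_login 123456789012 eu-west-2, then build_and_push translator.Dockerfile translatorapp 123456789012 eu-west-2 (and the same for custom_ray.Dockerfile and customrayimage), then render_config serve_config.yaml 123456789012 | kubectl apply -f -. Note that ECR authorization tokens expire after 12 hours, so a login performed at image build time may no longer be valid when the cluster pulls images later.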

Issue Severity

Low: It annoys or frustrates me.

@dudeperf3ct dudeperf3ct added bug Something that is supposed to be working; but isn't triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Apr 30, 2024
@askulkarni2

Hi @dudeperf3ct, if you are running out of space when the image is pulled, you likely need to increase the size of the EBS root volume attached to the instance. You can refer to this guide on how to do that.

I'd recommend using the block device mapping to provision a larger root volume for any future EKS cluster deployments.

@dudeperf3ct
Contributor Author

dudeperf3ct commented May 1, 2024

@askulkarni2 The EKS instance started with 150GB of disk space. Podman keeps pulling the image from ECR in what looks like an infinite loop, and that is what makes it run out of space.

Attaching a screenshot of the logs in raylet.err. Some of the layers are being pulled multiple times. Only one container is specified in serveConfigV2 above. I expected the application to start once podman had pulled all layers, but instead it keeps pulling the same image from ECR.

markup_1000036566.png
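
One way to quantify the repeated pulls described above is to count how often each layer digest shows up in the log. This is a rough sketch under assumptions: it assumes the podman output captured in raylet.err contains layer digests in the usual sha256:<64 hex characters> form, and the function name is hypothetical.

```shell
# Count how often each sha256 digest appears in a log file, most frequent first.
# Assumption: digests are written as "sha256:" followed by 64 hex characters.
# Args: path to the log file.
count_layer_digests() {
  grep -oE 'sha256:[0-9a-f]{64}' "$1" | sort | uniq -c | sort -rn
}
```

For example, count_layer_digests /tmp/ray/session_latest/logs/raylet.err | head, assuming Ray's default log directory. A digest with a count far above the others would be consistent with the same layer being fetched repeatedly, although debug-level logging may mention a digest more than once per pull, so the counts are only a relative signal.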

@anyscalesam anyscalesam added the serve Ray Serve Related Issue label May 3, 2024
@GeneDer GeneDer added P2 Important issue, but not time-critical and removed triage Needs triage (eg: priority, bug/not-bug, and owning component) labels May 3, 2024
@GeneDer
Contributor

GeneDer commented May 3, 2024

@zcin when you have a sec, can you help look at this?

5 participants