<a href="https://colab.research.google.com/github/sangalo20/latitude/blob/main/cloud_run_ollama_gemma3_inference.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Serving Gemma 3 on Cloud Run

<table align="left">
  <td style="text-align: center">
    <a href="https://colab.research.google.com/github/jmwai/gemma3-cloud-run-demo/blob/main/cloud_run_ollama_gemma3_inference.ipynb">
      <img width="32px" src="https://www.gstatic.com/pantheon/images/bigquery/welcome_page/colab-logo.svg" alt="Google Colaboratory logo"><br> Open in Colab
    </a>

</table>

<div style="clear: both;"></div>

<b>Share to:</b>

<a href="https://www.linkedin.com/sharing/share-offsite/?url=https%3A//github.com/GoogleCloudPlatform/generative-ai/blob/main/open-models/serving/cloud_run_ollama_gemma3_inference.ipynb" target="_blank">
  <img width="20px" src="https://upload.wikimedia.org/wikipedia/commons/8/81/LinkedIn_icon.svg" alt="LinkedIn logo">
</a>

<a href="https://bsky.app/intent/compose?text=https%3A//github.com/GoogleCloudPlatform/generative-ai/blob/main/open-models/serving/cloud_run_ollama_gemma3_inference.ipynb" target="_blank">
  <img width="20px" src="https://upload.wikimedia.org/wikipedia/commons/7/7a/Bluesky_Logo.svg" alt="Bluesky logo">
</a>

<a href="https://twitter.com/intent/tweet?url=https%3A//github.com/GoogleCloudPlatform/generative-ai/blob/main/open-models/serving/cloud_run_ollama_gemma3_inference.ipynb" target="_blank">
  <img width="20px" src="https://upload.wikimedia.org/wikipedia/commons/5/5a/X_icon_2.svg" alt="X logo">
</a>

<a href="https://reddit.com/submit?url=https%3A//github.com/GoogleCloudPlatform/generative-ai/blob/main/open-models/serving/cloud_run_ollama_gemma3_inference.ipynb" target="_blank">
  <img width="20px" src="https://redditinc.com/hubfs/Reddit%20Inc/Brand/Reddit_Logo.png" alt="Reddit logo">
</a>

<a href="https://www.facebook.com/sharer/sharer.php?u=https%3A//github.com/GoogleCloudPlatform/generative-ai/blob/main/open-models/serving/cloud_run_ollama_gemma3_inference.ipynb" target="_blank">
  <img width="20px" src="https://upload.wikimedia.org/wikipedia/commons/5/51/Facebook_f_logo_%282019%29.svg" alt="Facebook logo">
</a>            

<img src="https://ollama.com/public/ollama.png" height="200px" alignment="center"/>
<img src="https://cloud.google.com/static/architecture/images/ac-page-icons/card_google_cloud_partner.svg" height="200px">


| | |
|-|-|
| Author(s) | [Vlad Kolesnikov](https://github.com/vladkol) |

## Overview

  [**Gemma 3**](https://ai.google.dev/gemma) is a new generation of open models developed by Google. It is a collection of lightweight, state-of-the-art open models built from the same research and technology that powers our Gemini 2.0 models. Gemma 3 comes in a range of sizes (270M, 1B, 4B, 12B and 27B), allowing you to choose the best model for your specific hardware and performance needs. Gemma 3 models are available through platforms like Google AI Studio, Vertex AI, Kaggle, and Hugging Face.

> **[Cloud Run](https://cloud.google.com/run)**:
It's a serverless platform by Google Cloud for running containerized applications. It automatically scales and manages infrastructure, supporting various programming languages. Cloud Run now offers GPU acceleration for AI/ML workloads. With 30 seconds to the first token, Cloud Run is a perfect platform for serving lightweight models like Gemma.

> **Note:** GPU support in Cloud Run is in preview. To use the GPU feature, you must request `Total Nvidia L4 GPU allocation, per project per region` quota under Cloud Run in the [Quotas and system limits page](https://cloud.google.com/run/quotas#increase).


> **[Ollama](ollama.com)**: is an open-source tool for easily running and deploying large language models locally. It offers simple management and usage of LLMs on personal computers or servers.

This notebook showcases how to deploy [Google Gemma 3](https://developers.googleblog.com/en/introducing-gemma3) in Cloud Run, with the objective to build a simple API for chat or RAG applications.

By the end of this notebook, you will learn how to:

1. Deploy Google Gemma 3 as an OpenAI-compatible API on Cloud Run using Ollama.
2. Build a custom container with Ollama to deploy any Large Language Model (LLM) of your choice.
3. Make requests to an API hosted on Cloud Run.

## Get started

### Install Google Cloud SDK

Make sure you Google Cloud SDK is installed (try running `gcloud version`) or [install it](https://cloud.google.com/sdk/docs/install) before executing this notebook.

> If you are running in Colab or Vertex AI workbench, you have Google Cloud SDK installed.

### Choose a model, a project, and a region to host the model

Choose a Gemma 3 model to use, a Google Cloud project to host your Cloud Run service, and a region to host it in.
For this demo we will chose the gemma3:270m model. If you cannot attach a GPU to your Cloud Run instance, chose the gemma3:270m and remove the GPU requirements in the cloud run command

If you don't have a project yet:

1. [Create a project](https://console.cloud.google.com/projectcreate) in the Google Cloud Console.
2. Copy your `Project ID` from the project's [Settings page](https://console.cloud.google.com/iam-admin/settings).


In [4]:
# { display-mode: "form", run: "auto" }


PROJECT_ID = "cloudrun-gemma-476809"  # @param {type:"string", isTemplate: true}
REGION = "us-central1"  # @param {type:"string", isTemplate: true}
MODEL = "gemma3:270m" # @param {type:"string", isTemplate: true}

if PROJECT_ID == "[your-project-id]" or not PROJECT_ID:
    print("Please specify your project id in PROJECT_ID variable.")
    raise KeyboardInterrupt

MODEL_NAME_ESCAPED = MODEL.translate(str.maketrans(".:/", "---"))
SERVICE_NAME = f"ollama--{MODEL_NAME_ESCAPED}"

### Authenticate your Google Cloud account

Depending on your Jupyter environment, you may have to manually authenticate. Run the cell below.

In [7]:
!gcloud auth print-identity-token -q &> /dev/null || gcloud auth login --project="{PROJECT_ID}" --update-adc --quiet

Go to the following link in your browser, and complete the sign-in prompts:

    https://accounts.google.com/o/oauth2/auth?response_type=code&client_id=32555940559.apps.googleusercontent.com&redirect_uri=https%3A%2F%2Fsdk.cloud.google.com%2Fauthcode.html&scope=openid+https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fuserinfo.email+https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fcloud-platform+https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fappengine.admin+https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fsqlservice.login+https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fcompute+https%3A%2F%2Fwww.googleapis.com%2Fauth%2Faccounts.reauth&state=M9r7azCwX5qEnsjYKS2JsNi1wQMhRK&prompt=consent&token_usage=remote&access_type=offline&code_challenge=uIhh11l_jkjKjwldx5BkEDk1rGs3sYahcc2c2oCqmys&code_challenge_method=S256

Once finished, enter the verification code provided in your browser: 4/0Ab32j92JCH-DlfFSd-2F10Tc3kAL5vKi9HG6hx70QwXbOYz1qmj44M8T9t8FYV3s1qFdWA

Application Default Credentials (ADC) were updated.

You are now logged

## Prepare container image

First, let's create a Docker file for a container with the model embedded into it.

In [13]:
%%writefile Dockerfile

FROM ollama/ollama:latest

ARG MODEL

# Set the model name
ENV MODEL=$MODEL

# Set the host and port to listen on
ENV OLLAMA_HOST 0.0.0.0:8080

# Set the directory to store model weight files
ENV OLLAMA_MODELS /models

# Reduce the verbosity of the logs
ENV OLLAMA_DEBUG false

# Do not unload model weights from the GPU
ENV OLLAMA_KEEP_ALIVE -1

# Start the ollama server and download the model weights
RUN ollama serve & sleep 5 && ollama pull $MODEL

# At startup time we start the server and run a dummy request
# to request the model to be loaded in the GPU memory
ENTRYPOINT ["/bin/sh"]
CMD ["-c", "ollama serve  & (ollama run $MODEL 'Say one word' &) && wait"]

Overwriting Dockerfile


Second, we create a Cloud Build file to use for building and pushing our container image.

In [14]:
%%writefile cloudbuild.yaml

steps:
- name: 'gcr.io/cloud-builders/docker'
  id: build
  entrypoint: 'bash'
  args:
    - -c
    - |
        docker buildx build --tag=${_IMAGE} --build-arg MODEL=${_MODEL} .

images: ["${_IMAGE}"]

substitutions:
  _IMAGE: '${_REGION}-docker.pkg.dev/${PROJECT_ID}/${_AR_REPO_NAME}/${_SERVICE_NAME}'

options:
  dynamicSubstitutions: true
  machineType: "E2_HIGHCPU_32"

Overwriting cloudbuild.yaml


## Build Container Image and Deploy Cloud Run Service

We are ready to build our container image and deploy Cloud Run service.

The script below performs the following actions:

* Enables necessary APIs.
* Creates an Artifact Repository for the image.
* Creates a Service Account for the service.
* Submits a Cloud Build job to create and push the container image.
* Deploys the Cloud Run service.

> The script may take 10-45 minutes to finish.

Note the following important flags in Cloud Build deployment command:

* `--concurrency 4` is set to match the value of the environment variable `OLLAMA_NUM_PARALLEL`.
* `--gpu 1` with `--gpu-type nvidia-l4` assigns 1 NVIDIA L4 GPU to every Cloud Run instance in the service.
`--no-allow-authenticated` restricts unauthenticated access to the service.
By keeping the service private, you can rely on Cloud Run's built-in [Identity and Access Management (IAM)](https://cloud.google.com/iam) authentication for service-to-service communication.
* `--no-cpu-throttling` is required for enabling GPU.
* `--service-account` the service identity of the service.
* `--max-instances` sets maximum number of instances of the service.
It has to be equal to or lower than your project's NVIDIA L4 GPU (`Total Nvidia L4 GPU allocation, per project per region`) quota.

For optimal GPU utilization, increase `--concurrency`, keeping it within twice the value of `OLLAMA_NUM_PARALLEL`.
While this leads to request queuing in Ollama, it can help improve utilization:
Ollama instances can immediately process requests from their queue, and the queues help absorb traffic spikes.

#### If your cloud credits don't allow you to attache a GPU, change to Gemma 270m variant and deploy without GPU requirement.

In [15]:
%%writefile deploy.sh

PROJECT_ID=$1
REGION=$2
MODEL_ID="${3}"
SERVICE_NAME="${4}"
AR_REPO_NAME="ollama-repo"
SERVICE_ACCOUNT="ollama-cloud-run-sa"
SERVICE_ACCOUNT_ADDRESS="${SERVICE_ACCOUNT}@$PROJECT_ID.iam.gserviceaccount.com"
MAX_INSTANCES=1 # Adjust this value to match your Cloud Run L4 GPU quota ("Total Nvidia L4 GPU allocation, per project per region", NvidiaL4GpuAllocPerProjectRegion, run.googleapis.com/nvidia_l4_gpu_allocation)

echo "Enabling APIs in project ${PROJECT_ID}."
gcloud services enable run.googleapis.com \
    cloudbuild.googleapis.com \
    artifactregistry.googleapis.com \
    --project ${PROJECT_ID} \
    --quiet

set -e

# Creating the service account if doesn't exist.
sa_list=$(gcloud iam service-accounts list --quiet --format 'value(email)' --project $PROJECT_ID --filter=email:$SERVICE_ACCOUNT@$PROJECT_ID.iam.gserviceaccount.com 2>/dev/null)
if [ -z "${sa_list}" ]; then
    echo "Creating Service Account ${SERVICE_ACCOUNT}."
    gcloud iam service-accounts create $SERVICE_ACCOUNT \
        --project ${PROJECT_ID} \
        --display-name="${SERVICE_ACCOUNT} - Cloud Run Service Account"
fi

# Creating the Artifacts Repository if doesn't exist
repo_list=$(gcloud artifacts repositories list --format 'value(name)' --filter=name="projects/${PROJECT_ID}/locations/${REGION}/repositories/${AR_REPO_NAME}" --project ${PROJECT_ID} --quiet --location ${REGION} 2>/dev/null)
if [ -z "${repo_list}" ]; then
    echo "Creating Artifact Registry ${AR_REPO_NAME}."
    gcloud artifacts repositories create $AR_REPO_NAME \
    --repository-format docker \
    --location ${REGION} \
    --project=${PROJECT_ID}
fi

echo "Building container image."
gcloud builds submit --config=cloudbuild.yaml --project=${PROJECT_ID} . \
    --suppress-logs \
    --substitutions \
  _AR_REPO_NAME=$AR_REPO_NAME,_REGION=$REGION,_SERVICE_NAME=$SERVICE_NAME,_MODEL=$MODEL_ID
rm -f cloudbuild.yaml
rm -f Dockerfile

echo "Deploying Service ${SERVICE_NAME}."
gcloud beta run deploy $SERVICE_NAME \
    --project=${PROJECT_ID} \
    --image=${REGION}-docker.pkg.dev/$PROJECT_ID/$AR_REPO_NAME/$SERVICE_NAME \
    --service-account $SERVICE_ACCOUNT_ADDRESS \
    --cpu=8 \
    --memory=32Gi \
    --set-env-vars OLLAMA_NUM_PARALLEL=4 \
    --region ${REGION} \
    --no-allow-unauthenticated \
    --max-instances ${MAX_INSTANCES} \
    --no-cpu-throttling \
    --timeout 1h

SERVICE_URL=$(gcloud run services describe ${SERVICE_NAME} --project=${PROJECT_ID} --region $REGION --format 'value(status.url)' --quiet)
echo "âœ… Success!"
echo "ðŸš€ Service URL: ${SERVICE_URL}"

Overwriting deploy.sh


In [16]:
!/bin/bash ./deploy.sh "{PROJECT_ID}" "{REGION}" "{MODEL}" "{SERVICE_NAME}" && rm -f ./deploy.sh

Enabling APIs in project cloudrun-gemma-476809.
Operation "operations/acat.p2-781254510635-b415c057-aa95-443b-a84b-0b25352a274a" finished successfully.
Building container image.
Creating temporary archive of 49 file(s) totalling 54.4 MiB before compression.
Uploading tarball of [.] to [gs://cloudrun-gemma-476809_cloudbuild/source/1761904723.495307-bb4d4a5c020f4cf88fc7a09758ba59ce.tgz]
Created [https://cloudbuild.googleapis.com/v1/projects/cloudrun-gemma-476809/locations/global/builds/9e459553-69a1-4155-9a55-f52d9f3f8859].
Logs are available at [ https://console.cloud.google.com/cloud-build/builds/9e459553-69a1-4155-9a55-f52d9f3f8859?project=781254510635 ].
Waiting for build to complete. Polling interval: 1 second(s).
ID                                    CREATE_TIME                DURATION  SOURCE                                                                                               IMAGES                                                                                      STATU


## Test the deployed service

Now, let's test the service you deployed.

First, simply by using `cURL`.

In [17]:
%%bash -s "$MODEL" "$SERVICE_NAME" "$PROJECT_ID" "$REGION"

PROMPT="why is the sky blue?"
SERVICE_URL=$(gcloud run services describe "$2" --project "$3" --region "$4" --format 'value(status.url)' --quiet)
AUTH_TOKEN=$(gcloud auth print-identity-token -q)

curl -s -X POST "${SERVICE_URL}/api/generate" \
  -H "Authorization: Bearer ${AUTH_TOKEN}" \
  -H "Content-Type: application/json" \
  -d @<(cat <<EOF
{
  "model": "$1",
  "prompt": "$PROMPT",
  "max_tokens": 1000,
  "stream": false
}
EOF
)


{"model":"gemma3:270m","created_at":"2025-10-31T10:05:28.917449347Z","response":"The sky is blue for a few key reasons:\n\n*   **Rayleigh Scattering:** Sunlight is composed of all colors of light. When these colors are scattered by air molecules, we see the scattered light as blue. This is why the sky appears blue.\n*   **Blue Light:** The sun emits a continuous range of light colors. The blue light is scattered more than other colors, so the sky is blue.\n*   **Absorption and Reflection:** The air molecules absorb some of the blue light, but they also reflect some of it back to the Earth. This is why the sky appears blue.\n*   **Atmospheric Conditions:** The atmosphere also absorbs and scatters some of the blue light, making the sky appear blue.\n\nIn summary, the sky is blue because of the scattering of light by the air molecules, which is what gives it its blue color.","done":true,"done_reason":"stop","context":[105,2364,107,36425,563,506,7217,3730,236881,106,107,105,4368,107,818,72

### Ollama Python Library

You can also use Ollama Python Library to make requests to the service you deployed.

In [18]:
# Install Ollama Python Library
%pip install ollama -q

In [19]:
import subprocess

from ollama import Client

identity_token = (
    subprocess.check_output("gcloud auth print-identity-token -q", shell=True)
    .decode()
    .strip()
)
service_url = (
    subprocess.check_output(
        (
            "gcloud run services describe "
            f"{SERVICE_NAME} --project={PROJECT_ID} "
            f"--region={REGION} "
            "--format='value(status.url)' -q"
        ),
        shell=True,
    )
    .decode()
    .strip()
)
client = Client(host=service_url, headers={"Authorization": f"Bearer {identity_token}"})
stream = client.chat(
    model=MODEL,
    messages=[{"role": "user", "content": "Why is the sky blue?"}],
    stream=True,
)
for chunk in stream:
    print(chunk["message"]["content"], end="", flush=True)

The sky is blue due to a phenomenon called Rayleigh scattering. 

Here's the breakdown:

*   **Sunlight:** Sunlight is actually made up of all the colors of the rainbow.
*   **Blue Light:** Blue light has a shorter wavelength than other colors.
*   **Rayleigh Scattering:** When sunlight enters the atmosphere, it collides with tiny particles like dust and air molecules. These particles scatter the light in all directions.
*   **Blue Light Scattered:** The scattered blue light is then slowed down by the molecules in the atmosphere, which are much smaller than the wavelengths of blue light.
*   **Why Blue?** Because blue light has a much shorter wavelength than other colors. This means that most of the blue light is scattered away from the Earth's surface by the air molecules.

So, the sky appears blue because the blue light is scattered away by the air molecules.

### Using the python requests library

In [20]:
import requests

headers = {"Authorization": f"Bearer {identity_token}", "Content-Type": "application/json"}  # type: ignore

data = {
    "model": MODEL,
    "prompt": "Hi, what is the meaning of life?",
    "max_tokens": 100,
    "stream": False,
}
service_url = (
    subprocess.check_output(
        (
            "gcloud run services describe "
            f"{SERVICE_NAME} --project={PROJECT_ID} "
            f"--region={REGION} "
            "--format='value(status.url)' -q"
        ),
        shell=True,
    )
    .decode()
    .strip()
)

response = requests.post(f"{service_url}/api/generate", headers=headers, json=data)

print(response.text)

{"model":"gemma3:270m","created_at":"2025-10-31T10:06:56.689470112Z","response":"Ah, the million-dollar question! The meaning of life is a deeply personal and philosophical question that has plagued humanity for centuries. There's no single, universally accepted answer. \n\nHere are some common perspectives and approaches to understanding the meaning of life:\n\n*   **Purpose and Meaning:** This is often the most common answer. It suggests that life has a purpose, whether it's fulfilling a specific goal, contributing to something larger than oneself, or finding meaning in a particular activity.\n*   **Relationships and Connection:** Meaning often arises from strong relationships with others, strong communities, and a sense of belonging.\n*   **Growth and Learning:** Life is a journey of continuous learning and growth. We need to strive to become better, develop new skills, and expand our horizons.\n*   **Experiences and Adventure:** Embracing new experiences, pursuing passions, and tra

### RAG Q&A Chain with Gemma 3 and Cloud Run

We can leverage the LangChain integration to create a simple RAG application with Gemma, Cloud Run, Vertex AI Embedding for generating embeddings and SKLearnVectorStore which is a simple in-memory vector store based on scikit-learn's NearestNeighbors, using embeddings..

Through RAG, we will ask Gemma 3 to answer questions about the Cloud Run documentation page

### Setup embedding model and retriever

We are ready to setup our embedding model and retriever.

### Installed the required libraries

In [2]:
%pip install --upgrade --user --quiet langchain-core langchain-community langchain_google_vertexai


In [22]:
import IPython

app = IPython.Application.instance()

app.kernel.do_shutdown(True)

{'status': 'ok', 'restart': True}

**Note**: If the above installation fails, do this;
- Restart the session
- Run the first cell again
- Run the above cell again and it should work


### Import the required libraries

In [3]:

import requests
import subprocess
from langchain_core.prompts import PromptTemplate
from langchain_community.document_loaders import WebBaseLoader
from langchain.text_splitter import CharacterTextSplitter
from langchain_community.vectorstores import SKLearnVectorStore
from langchain_google_vertexai import VertexAIEmbeddings
import google.auth




In [4]:
# { display-mode: "form", run: "auto" }


PROJECT_ID = "cloudrun-gemma-476809"  # @param {type:"string", isTemplate: true}
REGION = "us-central1"  # @param {type:"string", isTemplate: true}
MODEL = "gemma3:270m" # @param {type:"string", isTemplate: true}

if PROJECT_ID == "[your-project-id]" or not PROJECT_ID:
    print("Please specify your project id in PROJECT_ID variable.")
    raise KeyboardInterrupt

MODEL_NAME_ESCAPED = MODEL.translate(str.maketrans(".:/", "---"))
SERVICE_NAME = f"ollama--{MODEL_NAME_ESCAPED}"

### Setup Vertex AI Embeddings

In [5]:
credentials, _ = google.auth.default(quota_project_id=PROJECT_ID)
embeddings = VertexAIEmbeddings(
    project=PROJECT_ID, model_name="text-embedding-005", credentials=credentials
)

We will ground Gemini with content from Cloud Run Overview page. Load the content and store it in an in memory vector database.

In [7]:
loader = WebBaseLoader("https://cloud.google.com/run/docs/overview/what-is-cloud-run")
docs = loader.load()
documents = CharacterTextSplitter(chunk_size=800, chunk_overlap=100).split_documents(
    docs
)

vector = SKLearnVectorStore.from_documents(documents, embeddings)
retriever = vector.as_retriever()




### RAG Chain Definition

We will define now our RAG Chain.

The RAG chain works as follows:

- The users query is used by the retriever to fetch relevant documents.
- The retrieved documents are formatted into a single string.
- The formatted documents, along with the original user messages, are passed to the Gemma3 with instructions to generate an answer based on the provided context.
- The LLM's response is parsed and returned as the final answer.

In [8]:
prompt = PromptTemplate.from_template(
    "You are a helpful assistant.\n"
    "Answer using ONLY the context below.\n\n"
    "CONTEXT:\n{context}\n\n"
    "QUESTION: {question}\n"
    "ANSWER:"
)

In [9]:
question = "What is Cloud Run and what problem does it solve?"

relevant_docs = retriever.invoke(question)
joined_context = "\n\n".join(d.page_content for d in relevant_docs)

final_prompt_str = prompt.format(context=joined_context, question=question)

### Testing the RAG Chain


In [None]:
service_url = (
    subprocess.check_output(
        (
            "gcloud run services describe "
            f"{SERVICE_NAME} --project={PROJECT_ID} "
            f"--region={REGION} "
            "--format='value(status.url)' -q"
        ),
        shell=True,
    )
    .decode()
    .strip()
)
payload = {
    "model": MODEL,
    "prompt": final_prompt_str,
    "stream": False,
}

resp = requests.post(f"{service_url}/api/generate", json=payload)
resp_json = resp.json()

print(resp_json)


## Conclusion
Congratulations! ðŸ’Ž Now you know how to deploy Gemma 3 to Cloud Run!

## Cleaning up

To delete the Cloud Run service you created, you can uncomment and run the following cell.

In [None]:
# !gcloud run services delete $SERVICE_NAME --project $PROJECT_ID --region $LOCATION --quiet