# Serving Gradio Apps on Union.ai

This document shows how to serve Gradio apps on Union.ai for several use cases.
- [Simple Gradio App with text input](#simple-gradio-app-with-text-input-sliders-and-output)
- [Gradio Chat interface for LLMs](#serving-a-chat-interface-and-llm-on-unionai)
- [Upload or take a photo for computer vision](#visual-language-model-vlm-with-gradio)

Gradio is a Python library that allows you to quickly create user interfaces for machine learning models. It provides a simple way to create web-based applications that can be used for model inference, data visualization, and more.

Union.ai is a platform that allows you to deploy and manage machine learning models and applications. It provides a simple way to serve Gradio apps, making it easy to share your work with others.



## Setup

#### Install packages & clone repo

In [None]:
# Clone the repo and install dependencies when running in Google Colab
try:
    import google.colab
    IN_COLAB = True
except ImportError:
    IN_COLAB = False

if IN_COLAB:
    !git clone https://github.com/unionai-oss/gradio-serve-union
    %cd gradio-serve-union
    !pip install -r requirements.txt

#### Authenticate to Union.ai

In [None]:
# 👇 Authenticate to union serverless
!union create login --serverless --auth device-flow

Once you get your onboarding email, you should be ready to run the following code cells to serve the Gradio applications.

## Simple Gradio App with text input, sliders, and output

This simple demo shows how to create a Gradio app that takes a name as text input and an intensity value from a slider, and returns a greeting whose number of exclamation marks matches the slider value. It can serve as a starting point for more complex applications.

See the next example for a more complex Gradio app that uses an LLM.
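
Before deploying, it can help to check the greeting logic on its own, without Gradio or Union. The sketch below copies the `greet` function from `main.py` and calls it directly. The type hints and sample inputs are added for illustration, and the note about slider values arriving as floats is an assumption about typical Gradio behavior.

```python
def greet(name: str, intensity: float) -> str:
    """Return a greeting followed by `intensity` exclamation marks."""
    # The slider value may arrive as a float, so cast it before repeating "!".
    return "Hello, " + name + "!" * int(intensity)


if __name__ == "__main__":
    print(greet("Ada", 3))  # Hello, Ada!!!
    print(greet("Ada", 0))  # Hello, Ada
```

Because `*` binds more tightly than `+`, only the exclamation mark is repeated, not the whole greeting.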

In [None]:
# 👇 Run this command to deploy the application
!union deploy apps 0_simple_app/app.py gradio-app

In [None]:
%%writefile 0_simple_app/app.py

"""
# Simple Gradio App Deployment Example
"""
from datetime import timedelta
from union import Resources
from union.app import App, ScalingMetric
from containers import container_image

gradio_app = App(
    name="gradio-app",
    container_image=container_image, # image that contains the environment and dependencies needed to run the app
    port=8080, # The port on which the app will be served
    include=["./main.py"],  # Include your gradio code
    args=["python", "main.py"], # Command to run your app inside the container
    limits=Resources(cpu="2", mem="8Gi"), # Maximum resources allocated (CPU, memory, GPU) — hard limit
    requests=Resources(cpu="2", mem="8Gi"), # Minimum resources requested from the scheduler — soft requirement
    min_replicas=0, # Minimum number of instances (pods) running — allows scale-to-zero when idle
    max_replicas=1, # Maximum number of instances — restricts auto-scaling to 1 replica
    scaledown_after=timedelta(minutes=5), # Time to wait before scaling down when traffic is low
    scaling_metric=ScalingMetric.Concurrency(2), # Auto-scaling based on concurrent user requests; 2 concurrent users per replica
    # requires_auth=False # Uncomment to make app public.
)

# union deploy apps 0_simple_app/app.py gradio-app
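
To make the scaling settings in `app.py` more concrete, here is a rough model of how a concurrency target could translate into a replica count. This is not Union's autoscaling code. It is a sketch that assumes the common rule of dividing in-flight requests by the per-replica target and clamping the result to the configured bounds. The function name `desired_replicas` and its parameters are hypothetical.

```python
import math


def desired_replicas(
    in_flight: int,
    target_per_replica: int,
    min_replicas: int,
    max_replicas: int,
) -> int:
    """Hypothetical sizing rule: ceil(load / target), clamped to [min, max]."""
    if target_per_replica <= 0:
        raise ValueError("target_per_replica must be positive")
    wanted = math.ceil(in_flight / target_per_replica)
    return max(min_replicas, min(max_replicas, wanted))


if __name__ == "__main__":
    # Values taken from the App definition: Concurrency(2), min 0, max 1.
    for load in (0, 1, 5):
        print(load, desired_replicas(load, 2, 0, 1))
```

Under this model, an idle app needs zero replicas (scale-to-zero), and any amount of traffic is capped at one replica because `max_replicas=1`.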


In [None]:
%%writefile 0_simple_app/main.py

"""
This is a simple Gradio app that takes a name and an intensity level as input,
and returns a greeting message. The app is designed to be deployed using Union from app.py.
"""

import gradio as gr

def greet(name, intensity):
    return "Hello, " + name + "!" * int(intensity)

demo = gr.Interface(
    fn=greet,
    inputs=["text", "slider"],
    outputs=["text"],
)

if __name__ == "__main__":
    demo.launch(server_name="0.0.0.0", server_port=8080)


# union deploy apps 0_simple_app/app.py gradio-app


## Serving a Chat interface and LLM on Union.ai


Find the code for this example in the `1_llm_chat` folder.
- `model.py`: Contains a task that downloads the LLM from Hugging Face and stores it as a Union artifact.
- `app.py`: Contains the environment, compute, and scaling configuration for the app.
- `main.py`: Contains the Gradio app code for the chat interface. (This uses Qwen's `</think>` token filtering; you may need to change how the chat is handled if you choose a different model.)

Explore the code in the files or see the code below in the notebook. 
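
The token filtering in `main.py` can be tried without a model or GPU. The sketch below mirrors the streaming loop from `chat_fn`, but reads from a plain list of text chunks instead of a `TextIteratorStreamer`. The function name `split_thinking` and the sample chunks are made up for illustration.

```python
from typing import Iterable, Iterator

THINKING_PREFIX = "🤔 **Thinking:**\n"
ANSWER_PREFIX = "\n\n🧠 **Answer:**\n"


def split_thinking(chunks: Iterable[str]) -> Iterator[str]:
    """Yield the text to display after each chunk, switching at </think>."""
    thinking, answer = "", ""
    in_thinking = True
    for chunk in chunks:
        if in_thinking:
            thinking += chunk
            if "</think>" in thinking:
                # Everything before the marker is reasoning; the rest is answer.
                thinking_text, remainder = thinking.split("</think>", 1)
                yield THINKING_PREFIX + thinking_text.strip()
                in_thinking = False
                answer += remainder
                yield ANSWER_PREFIX + answer.strip()
            else:
                yield THINKING_PREFIX + thinking.strip()
        else:
            answer += chunk
            yield ANSWER_PREFIX + answer.strip()


if __name__ == "__main__":
    for shown in split_thinking(["Let me ", "think.", "</think>", "Hello", " there"]):
        print(repr(shown))
```

Each yielded string is the full text to display at that moment, not an increment. A model that never emits `</think>` would stay in the thinking branch for the whole reply, which is why a different model may need different handling.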



In [None]:
# 👇 Run this task to download the model files for the LLM chat example.
!union run --remote 1_llm_chat/model.py download_model

In [None]:
# 👇 Run this command to deploy the chatbot app with the model downloaded above
!union deploy apps 1_llm_chat/app.py gradio-chat

In [None]:
%%writefile 1_llm_chat/model.py

"""
This file downloads the Qwen-3 model and saves it to a specified directory.
You can adjust the model to another from Hugging Face by changing the model_name parameter.
"""

from union import Resources, task, Artifact, FlyteDirectory, current_context, ImageSpec
from transformers import AutoModelForCausalLM, AutoTokenizer
from pathlib import Path
from typing import Annotated
from containers import container_image

# Create Union Artifact 
Qwen3Model8b = Artifact(name="qwen3-model")

# ----------------------------------------------------------------------
# Download the model
# ----------------------------------------------------------------------

@task(
    container_image=container_image,
    cache=True,
    cache_version="1.0",
    requests=Resources(cpu="2", mem="9Gi"),
)
def download_model(
    model_name: str = "Qwen/Qwen3-0.6B",
) -> Annotated[FlyteDirectory, Qwen3Model8b]:

    working_dir = Path(current_context().working_directory)
    saved_model_dir = working_dir / "saved_model"
    saved_model_dir.mkdir(parents=True, exist_ok=True)

    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        device_map="cpu",
        torch_dtype="auto",
        trust_remote_code=True,
    )
    tokenizer = AutoTokenizer.from_pretrained(model_name)

    model.save_pretrained(saved_model_dir)
    tokenizer.save_pretrained(saved_model_dir)

    # return FlyteDirectory(saved_model_dir)
    return Qwen3Model8b.create_from(saved_model_dir)

# union run --remote 1_llm_chat/model.py download_model


In [None]:
%%writefile 1_llm_chat/app.py
"""
This deployment script is for a Gradio app that serves as a chat interface for the Qwen-3 model.
"""

from datetime import timedelta
from union import Artifact, Resources
from union.app import App, Input, ScalingMetric
from flytekit.extras.accelerators import L4
from containers import container_image

# Point to the LLM artifact created by the download_model task
Qwen3Model8b = Artifact(name="qwen3-model")

# Define the Gradio app deployment
gradio_app = App(
    name="gradio-chat",
    inputs=[
        Input(
            name="downloaded-model",
            value=Qwen3Model8b.query(),
            download=True,
        )
    ],
    container_image=container_image, # image that contains the environment and dependencies needed to run the app
    port=8080, # The port on which the app will be served
    include=["./main.py"],  # Include your gradio app code
    args=["python", "main.py"], # Command to run your app inside the container
    limits=Resources(cpu="2", mem="16Gi", gpu="1"),  # Maximum resources allocated (CPU, memory, GPU) — hard limit
    requests=Resources(cpu="2", mem="16Gi", gpu="1"), # Minimum resources requested from the scheduler — soft requirement
    accelerator=L4,  # Specifies the GPU type to use (e.g., NVIDIA L4 accelerator)
    min_replicas=0, # Minimum number of instances (pods) running — allows scale-to-zero when idle
    max_replicas=1, # Maximum number of instances — restricts auto-scaling to 1 replica
    scaledown_after=timedelta(minutes=5), # Time to wait before scaling down when traffic is low
    scaling_metric=ScalingMetric.Concurrency(2), # Auto-scaling based on concurrent user requests; 2 concurrent users per replica
    # requires_auth=False # Uncomment to make app public.
)

# union deploy apps 1_llm_chat/app.py gradio-chat


In [None]:
%%writefile 1_llm_chat/main.py

"""
This creates a Gradio app that serves as a chat interface for the Qwen-3 model.
The thinking and answer tokens are streamed separately, allowing for a more interactive experience.
"""
from transformers import AutoTokenizer, AutoModelForCausalLM
import gradio as gr
from union_runtime import get_input
import threading
from transformers import TextIteratorStreamer

# --------------------------
# Load model path from Union artifact input
# --------------------------
model_path = get_input("downloaded-model")
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True).cuda()
model.eval()


def chat_fn(message, history):
    """
    Function to handle chat messages and generate responses.
    """

    # Only the latest user message is sent to the model; `history` is not used.
    messages = [{"role": "user", "content": message}]

    prompt = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True, enable_thinking=True
    )

    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    streamer = TextIteratorStreamer(
        tokenizer, skip_prompt=True, skip_special_tokens=True
    )

    thread = threading.Thread(
        target=model.generate,
        kwargs={
            "input_ids": inputs["input_ids"],
            "attention_mask": inputs["attention_mask"],
            "streamer": streamer,
            "max_new_tokens": 2048,
            "do_sample": True,
            "top_p": 0.9,
            "temperature": 0.3,
            "eos_token_id": tokenizer.eos_token_id,
        },
    )
    thread.start()

    thinking_prefix = "🤔 **Thinking:**\n"
    answer_prefix = "\n\n🧠 **Answer:**\n"

    current_section = "thinking"
    yielded_thinking = ""
    yielded_answer = ""

    for token in streamer:
        if current_section == "thinking":
            yielded_thinking += token
            if "</think>" in yielded_thinking:
                # Split at </think> and switch to answer phase
                thinking_text, remainder = yielded_thinking.split("</think>", 1)
                yield thinking_prefix + thinking_text.strip()
                current_section = "answer"
                yielded_answer += remainder
                yield answer_prefix + yielded_answer.strip()
            else:
                yield thinking_prefix + yielded_thinking.strip()
        else:
            yielded_answer += token
            yield answer_prefix + yielded_answer.strip()

# --------------------------
# Define Gradio interface
# --------------------------
chat_interface = gr.ChatInterface(
    fn=chat_fn,
    title="Qwen3 Chatbot",
    textbox=gr.Textbox(placeholder="Ask me anything...", container=True, scale=7),
    multimodal=False,
    theme="default",
    type="messages",
)

# --------------------------
# Launch Gradio app
# --------------------------
if __name__ == "__main__":
    chat_interface.launch(server_name="0.0.0.0", server_port=8080)

# union deploy apps 1_llm_chat/app.py gradio-chat
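
The `chat_fn` above sends only the latest message to the model. If multi-turn context is wanted, one possible approach is to build the message list from the `history` argument. The sketch assumes that, with `type="messages"`, `history` is a list of dicts with `role` and `content` keys. The helper name `build_messages` is hypothetical and is not part of the repository.

```python
def build_messages(history: list[dict], message: str) -> list[dict]:
    """Combine earlier turns with the new user message."""
    messages = [
        # Keep only the keys a chat template needs; drop any extra metadata.
        {"role": turn["role"], "content": turn["content"]}
        for turn in history
        # Skip non-text entries such as file or component content.
        if isinstance(turn.get("content"), str)
    ]
    messages.append({"role": "user", "content": message})
    return messages


if __name__ == "__main__":
    past = [
        {"role": "user", "content": "Hi", "metadata": None},
        {"role": "assistant", "content": "Hello!"},
    ]
    print(build_messages(past, "How are you?"))
```

One caveat: earlier assistant turns would contain whatever `chat_fn` displayed, including the "Answer" prefix, so that text may need to be stripped before it is fed back to the model.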

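## Visual Language Model (VLM) with Gradio

Find the code for this example in the `2_cv_images` folder. It follows the same layout as the chat example: `model.py` downloads the SmolVLM-Instruct model from Hugging Face and stores it as a Union artifact, `app.py` holds the deployment configuration, and `main.py` contains the Gradio app for uploading or taking a photo.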


In [None]:
# 👇 Run this task to download the model files for the VLM chat example.
!union run --remote 2_cv_images/model.py download_model

In [None]:
# 👇 Run this command to deploy the Gradio app with the model downloaded above
!union deploy apps 2_cv_images/app.py vlm-gradio

In [None]:
%%writefile 2_cv_images/model.py
"""
This task downloads the SmolVLM-Instruct model from Hugging Face and saves it to a specified directory.
You can switch to another Hugging Face VLM by changing the model_name parameter.
"""

from union import Resources, task, Artifact, FlyteDirectory, current_context, ImageSpec
from transformers import AutoProcessor, AutoModelForVision2Seq
from pathlib import Path
from typing import Annotated
from containers import container_image

# Create Union Artifact 
SmolVLM = Artifact(name="SmolVLM-Instruct")


# ----------------------------------------------------------------------
# Download the model
# ----------------------------------------------------------------------
@task(
    container_image=container_image,
    cache=True,
    cache_version="1.0",
    requests=Resources(cpu="2", mem="9Gi"),
)
def download_model(
    model_name: str = "HuggingFaceTB/SmolVLM-Instruct",
) -> Annotated[FlyteDirectory, SmolVLM]:

    working_dir = Path(current_context().working_directory)
    saved_model_dir = working_dir / "saved_model"
    saved_model_dir.mkdir(parents=True, exist_ok=True)

    model = AutoModelForVision2Seq.from_pretrained(
        model_name,
        torch_dtype="auto",
        trust_remote_code=True,
    )
    processor = AutoProcessor.from_pretrained(model_name)

    model.save_pretrained(saved_model_dir)
    processor.save_pretrained(saved_model_dir)

    return SmolVLM.create_from(saved_model_dir)


# union run --remote 2_cv_images/model.py download_model


In [None]:
%%writefile 2_cv_images/app.py
"""
This serves as a deployment script for the SmolVLM-Instruct model with a Gradio app.
"""
import os
from datetime import timedelta
from union import Artifact, Resources
from union.app import App, Input, ScalingMetric
from flytekit.extras.accelerators import L4
from containers import container_image

# Point to VLM artifact

SmolVLM = Artifact(name="SmolVLM-Instruct")

gradio_app = App(
    name="vlm-gradio",
    inputs=[
        Input(
            name="downloaded-model",
            value=SmolVLM.query(),
            download=True,
        )
    ],
    container_image=container_image,
    port=8080,
    include=["./main.py"],  # Include your gradio app
    args=["python", "main.py"],
    limits=Resources(cpu="2", mem="24Gi", gpu="1", ephemeral_storage="20Gi"),
    requests=Resources(cpu="2", mem="24Gi", gpu="1", ephemeral_storage="20Gi"),
    accelerator=L4,
    min_replicas=0,
    max_replicas=1,
    scaledown_after=timedelta(minutes=10),
    scaling_metric=ScalingMetric.Concurrency(2),
)

# union deploy apps 2_cv_images/app.py vlm-gradio


In [None]:
%%writefile 2_cv_images/main.py
"""
This is a Gradio app for the SmolVLM-Instruct model.
You can upload an image or take a photo with a webcam and get a vision-language response.
"""

import time
from pathlib import Path

import gradio as gr
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq

# Load model from Union artifact or fallback local path
try:
    from union_runtime import get_input
    model_path = Path(get_input("downloaded-model"))
except Exception:  # e.g. union_runtime is unavailable when running locally
    model_path = Path("saved_model")

# Load processor and model
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForVision2Seq.from_pretrained(model_path, trust_remote_code=True).eval()
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

# Inference function
def vlm_infer(image: Image.Image) -> str:
    start = time.time()

    # Prompt with <image> placeholder required by IDEFICS3
    prompt = "<|user|>\n<image>\nWhat’s going on in this photo?\n<|end|>\n<|assistant|>"

    # Pass all inputs as keyword arguments to avoid conflict
    inputs = processor(text=prompt, images=image, return_tensors="pt").to(device)

    with torch.no_grad():
        generated_ids = model.generate(**inputs, max_new_tokens=100)

    # Decode output
    generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
    latency = (time.time() - start) * 1000

    return f"{generated_text.strip()}\n\n⚡ {device.type.upper()} | {latency:.1f} ms"

# Gradio UI
demo = gr.Interface(
    fn=vlm_infer,
    inputs=gr.Image(type="pil", label="Upload or Take a Photo"),
    outputs=gr.Text(label="SmolVLM Output"),
    title="SmolVLM-Instruct: Vision-Language Model",
    description="Upload an image to generate a vision-language response using SmolVLM (IDEFICS3).",
)

if __name__ == "__main__":
    demo.launch(server_name="0.0.0.0", server_port=8080)


## Resources

