# Project 5: **Build a Multi-Modal Generation Agent**

Welcome to the final project! In this project, you'll use open-source text-to-image and text-to-video models to generate content. Next, you'll build a **unified multi-modal agent** similar to modern chatbots, where a single agent can support general questions, image generation, and video generation requests.

By the end of this project, you'll understand how to integrate multiple model types under one routing system that decides which modality to use based on the user's intent.



## Learning Objectives

* Use **Text-to-Image** models to generate images from text prompts.
* Generate short clips with a **Text-to-Video** model.
* Build a **Multi-Modal Agent** that answers questions and routes media requests.
* Build a simple **Gradio** UI and interact with the multi-modal agent.

## Roadmap
1. Environment setup
2. Text-to-Image
3. Text-to-Video
4. Multimodal Agent
5. Gradio UI
6. Celebrate

## 1 - Environment Setup

In this project, we'll use open-source Text-to-Image and Text-to-Video models to generate visuals from natural-language prompts. These models are computationally heavy and perform best on GPUs, so we recommend running this notebook in Google Colab or another GPU-enabled environment. We'll load all models from Hugging Face, which requires authentication.

Before continuing:
1. Open this project in Google Colab. [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/bytebyteai/ai-eng-projects/blob/main/project_5/multimodal_agent.ipynb)
2. Create a Hugging Face account and generate an access token at huggingface.co/settings/tokens
3. Paste your token in the field below to log in.
4. In the Colab environment, enable GPU acceleration by selecting Runtime → Change runtime type → GPU.
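Pasting a token directly into a shared notebook makes it easy to leak. As a minimal sketch, assuming you have stored the token in the `HF_TOKEN` environment variable (the helper name `read_hf_token` is hypothetical), you could read it at runtime instead:

```python
import os

def read_hf_token(env=None, var_name="HF_TOKEN"):
    """Return the token stored in `var_name`, or raise if it is missing or blank."""
    env = os.environ if env is None else env
    token = env.get(var_name, "").strip()
    if not token:
        raise RuntimeError(f"Set {var_name} before logging in to Hugging Face.")
    return token

# Usage in the login cell (requires huggingface_hub):
#   login(token=read_hf_token())
```

Passing a plain dict as `env` makes the helper easy to check without touching your real environment.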

In [None]:
from huggingface_hub import login

login(token="YOUR TOKEN HERE")

Let's import the required libraries and confirm that PyTorch can detect the available GPU.

In [None]:
import torch, diffusers, transformers, os, random, gc
print('torch', torch.__version__, '| CUDA:', torch.cuda.is_available())

## 2 - Text-to-Image (T2I)
T2I models translate natural-language descriptions into images. They are typically based on diffusion models, which gradually refine random noise into a coherent picture guided by the text prompt. In this section, you'll load and test one such model to generate images directly from text inputs.

### 2.1: Load a T2I Model
We'll use `Stable Diffusion XL` (SDXL) by `Stability AI`, a widely used open-source diffusion model. It produces higher-quality, more detailed images than earlier Stable Diffusion versions, at the cost of a larger model that needs more GPU memory.

You'll load the model from Hugging Face using the diffusers library, which simplifies running diffusion-based pipelines. To learn more about diffusers, read: https://huggingface.co/docs/diffusers/main/index


In [None]:
from diffusers import DiffusionPipeline
# Define the Stable Diffusion XL model ID from Hugging Face and load the pre-trained model
"""
YOUR CODE HERE (~2-5 lines of code)
"""

### 2.2: Generate an image

In [None]:
# Generate and display an image from a text prompt using the loaded pipeline
"""
YOUR CODE HERE (~2 lines of code)
"""

### 2.3: Experimenting with `num_inference_steps`

The number of inference steps determines how many refinement passes the diffusion model makes. Fewer steps give quicker but less detailed images, while more steps improve clarity and structure at the cost of speed.

Try generating images with different step counts and compare the results.
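One way to structure the comparison is to keep the prompt and seed fixed so that only the step count changes. The sketch below assumes a hypothetical `generate(prompt, steps, seed)` callable that wraps your pipeline call; a stub stands in for it here so the snippet runs without a GPU. It returns `(steps, image)` pairs, which is the shape the plotting code expects.

```python
from typing import Any, Callable, Iterable, List, Tuple

def sweep_steps(
    generate: Callable[[str, int, int], Any],
    prompt: str,
    step_counts: Iterable[int] = (10, 25, 50),
    seed: int = 0,
) -> List[Tuple[int, Any]]:
    """Call `generate` once per step count, reusing the same prompt and seed."""
    return [(steps, generate(prompt, steps, seed)) for steps in step_counts]

# Stub generator (hypothetical): replace with a function that calls your pipeline.
def fake_generate(prompt: str, steps: int, seed: int) -> str:
    return f"<image of '{prompt}' after {steps} steps, seed={seed}>"

images = sweep_steps(fake_generate, "a red bicycle")
for steps, img in images:
    print(steps, img)
```

With a real pipeline, `generate` would typically pass `num_inference_steps=steps` and a seeded `torch.Generator` so that differences between images come from the step count alone.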

In [None]:
import matplotlib.pyplot as plt

# Generate an image for different values of num_inference_steps (e.g., 10, 25, 50) and compare sharpness and detail
images = []

"""
YOUR CODE HERE (~6-8 lines)
"""

# Plot results side-by-side
plt.figure(figsize=(12, 4))
for i, (steps, img) in enumerate(images, 1):
    plt.subplot(1, len(images), i)
    plt.imshow(img)
    plt.axis("off")
    plt.title(f"{steps} steps")
plt.tight_layout()
plt.show()


### 2.4 (Optional): Visualizing the Diffusion Process
Diffusion models start from random noise and iteratively refine it into an image that matches the prompt. If you are curious, visualize all intermediate steps to see how the noise gradually turns into a coherent picture.
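The refinement loop can be illustrated with a toy example. This is not how SDXL denoises (a real model predicts noise with a neural network and a scheduler); it only mimics the overall shape of the process: start from random values and move part of the way toward a target at every step, recording the intermediate states for inspection. The target vector and the update rule are illustrative assumptions.

```python
import random

def toy_denoise(target, num_steps=10, seed=0):
    """Return the list of intermediate states from pure noise toward `target`."""
    rng = random.Random(seed)
    state = [rng.gauss(0.0, 1.0) for _ in target]   # start from Gaussian noise
    history = [list(state)]
    for step in range(num_steps):
        # Close a growing share of the remaining gap; the final step closes the rest.
        frac = 1.0 / (num_steps - step)
        state = [s + frac * (t - s) for s, t in zip(state, target)]
        history.append(list(state))
    return history

def mean_abs_error(state, target):
    """Average absolute difference between a state and the target."""
    return sum(abs(s - t) for s, t in zip(state, target)) / len(target)

target = [0.2, 0.8, 0.5, 0.1]
history = toy_denoise(target, num_steps=5)
for i, state in enumerate(history):
    print(f"step {i}: error = {mean_abs_error(state, target):.3f}")
```

The printed error shrinks at every step, which is the same qualitative behavior you should see when plotting intermediate images from the real pipeline.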

In [None]:
import torch
import matplotlib.pyplot as plt

# Step 1: Run the pipeline with 50 inference steps
# Step 2: Capture intermediate latents or images during generation
# Step 3: Plot them sequentially to show noise evolving into structure
"""
YOUR CODE HERE (~10-12 lines)
"""

### 2.5 (Optional): Experiment with other models
Different text-to-image models vary in speed, style, and visual quality. Try swapping in other open-source diffusion models and compare how their outputs differ in detail, realism, or artistic tone.

You can browse available models on Hugging Face here: https://huggingface.co/models?library=diffusers

In [None]:
# Step 1: Replace model_id with another text-to-image model from Hugging Face
# Step 2: Reload the pipeline and generate a few test images
# Step 3: Compare image quality, color balance, and prompt fidelity
"""
YOUR CODE HERE
"""

## 3 - Text-to-Video (T2V)
T2V models extend the idea of diffusion from still images to moving sequences. Instead of generating one frame, they create a series of coherent frames that depict motion consistent with the text prompt. These models are computationally heavier and often generate short clips (typically 2-10 seconds).

In this section, you'll load an open-source video diffusion model and prepare it for generation.

### 3.1: Load a T2V model

We'll use the model `damo-vilab/text-to-video-ms-1.7b`, which can produce short video clips from text prompts. We'll pair it with the `DPMSolverMultistepScheduler`, a multistep solver that typically reaches good sample quality in fewer inference steps, which speeds up sampling.

In [None]:
from diffusers import DiffusionPipeline, DPMSolverMultistepScheduler

video_model_id = 'damo-vilab/text-to-video-ms-1.7b'

# Load the model with FP16 precision for efficiency
"""
YOUR CODE HERE (~2 lines of code)
"""

### 3.2: Generate a clip
Create a short video clip from a text prompt using the text-to-video pipeline you loaded above.

In [None]:
# Step 1: Write a text prompt describing the video you want to generate
# Step 2: Run the text-to-video pipeline with your chosen prompt
"""
YOUR CODE HERE (~2-3 lines)
"""

### 3.3: Frame inspection
Inspect a single frame to sanity-check colors, resolution, and subject positioning before writing a full video.
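The float-to-uint8 conversion is a common source of washed-out or wrapped-around colors. As a minimal sketch, assuming a frame is a NumPy array of shape `(height, width, 3)` with float values nominally in `[0, 1]` (the helper name `frame_to_uint8` is hypothetical), clipping before scaling keeps out-of-range values from overflowing:

```python
import numpy as np

def frame_to_uint8(frame: np.ndarray) -> np.ndarray:
    """Map a float frame in [0, 1] to uint8 in [0, 255], clipping out-of-range values."""
    scaled = np.clip(frame, 0.0, 1.0) * 255.0
    return np.round(scaled).astype(np.uint8)

# Synthetic 2x2 RGB frame standing in for one generated video frame.
frame = np.array(
    [[[0.0, 0.5, 1.0], [1.2, -0.1, 0.25]],
     [[0.1, 0.2, 0.3], [0.9, 0.8, 0.7]]],
    dtype=np.float32,
)
pixels = frame_to_uint8(frame)
print(pixels.dtype, pixels.shape)
```

The resulting array can be passed to `Image.fromarray(pixels)` to display the frame as a PIL image.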

In [None]:
import numpy as np
from PIL import Image

# Step 1: Select one frame from vid_frames (e.g., index 0)
# Step 2: Convert float [0,1] frame to uint8 [0,255]
# Step 3: Display as a PIL image
"""
YOUR CODE HERE (~1-3 lines)
"""

### 3.4: Convert frames to MP4
Write the generated frames to an MP4 file so you can preview and share the result.

In [None]:
# Step 1: Use diffusers.utils.export_to_video to write vid_frames to an MP4
# Step 2: Capture and print the saved video path
"""
YOUR CODE HERE (~3-4 lines)
"""

### 3.5: Video inspection
Play the saved video inside the notebook to check motion and temporal consistency.

In [None]:
# Display the saved MP4 inline
from IPython.display import Video

"""
YOUR CODE HERE (1 line of code)
"""

### 3.6 (Optional): Experiment with different configs
Increase `num_frames` or decrease `num_inference_steps` to experiment with clip length versus quality.
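Clip length follows from simple arithmetic: duration equals the number of frames divided by the playback frame rate. The sketch below assumes you choose the frame rate when exporting the video; the numbers are illustrative, not model defaults.

```python
import math

def clip_duration_seconds(num_frames: int, fps: int) -> float:
    """Playback length of `num_frames` frames shown at `fps` frames per second."""
    if num_frames <= 0 or fps <= 0:
        raise ValueError("num_frames and fps must be positive")
    return num_frames / fps

def frames_for_duration(seconds: float, fps: int) -> int:
    """Smallest frame count that plays for at least `seconds` at `fps`."""
    return math.ceil(seconds * fps)

print(clip_duration_seconds(16, 8))   # 2.0
print(frames_for_duration(3.0, 8))    # 24
```

Generation time and memory grow with the number of frames, so longer clips are usually where you trade off against `num_inference_steps`.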

## 4 - Multimodal Generation Agent
Now that you have text-to-image and text-to-video generation working, you will add basic LLM question answering and build a single agent that routes user requests to the right capability. The agent will read a prompt, infer intent (chat vs image vs video), and return the appropriate output.

### 4.1: Load an LLM for generic queries
Use a small LLM as the default chat brain. We will start with `gemma-3-1b-it` and keep the loading logic simple. You can swap to another compact chat model later.

In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
import torch, textwrap, json, re

# Load google/gemma-3-1b-it using Hugging Face

"""
YOUR CODE HERE (~2-15 lines)
"""

### 4.2: Build a routing mechanism to route requests

In [None]:
def generate_media(prompt: str, mode: str):
    # Produce either an image or a short video clip from a text prompt.
    """
    YOUR CODE HERE (~3-6 lines)
    """
    pass

def llm_generate(prompt, max_new_tokens=64, temperature=0.7):
    # Return a response to the prompt using the loaded Gemma model
    """
    YOUR CODE HERE (~2 lines of code)
    """
    pass

In [None]:
def classify_prompt(prompt: str):
    """Classify the user prompt into QA, image, or video."""

    # Step 1: Define a system prompt explaining how to classify requests (qa, image, video)
    # Step 2: Format the user message and system message as input to the LLM
    # Step 3: Generate a response with llm_generate() and parse it using regex
    # Step 4: Extract fields "type" and "expanded_prompt" from the LLM response
    # Step 5: Return a dict with classification results or default to {"type": "qa"} on failure

    """
    YOUR CODE HERE (~5-25 lines of code)
    """
    pass
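Small chat models often wrap their answer in extra text, so the parsing step (Steps 3 to 5 above) benefits from being defensive. The sketch below assumes the system prompt asks the LLM to reply with a JSON object containing `type` and `expanded_prompt`; the helper name `parse_classification` is hypothetical, and the fallback also carries the original prompt, which goes slightly beyond the bare `{"type": "qa"}` default.

```python
import json
import re

VALID_TYPES = {"qa", "image", "video"}

def parse_classification(raw: str, user_prompt: str) -> dict:
    """Extract {"type", "expanded_prompt"} from an LLM reply; fall back to QA on failure."""
    fallback = {"type": "qa", "expanded_prompt": user_prompt}
    match = re.search(r"\{.*\}", raw, flags=re.DOTALL)  # from first "{" to last "}"
    if match is None:
        return fallback
    try:
        data = json.loads(match.group(0))
    except json.JSONDecodeError:
        return fallback
    if not isinstance(data, dict):
        return fallback
    kind = str(data.get("type", "")).strip().lower()
    if kind not in VALID_TYPES:
        return fallback
    expanded = data.get("expanded_prompt") or user_prompt
    return {"type": kind, "expanded_prompt": str(expanded)}

reply = 'Sure! {"type": "Image", "expanded_prompt": "a watercolor fox in snow"} Hope this helps.'
print(parse_classification(reply, "draw a fox"))
print(parse_classification("I think this is a video", "make a clip"))
```

Unknown types and malformed JSON both resolve to the QA path, so a bad classification never blocks the user from getting some answer.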


### 4.3: Build the multimodal agent
This agent takes a single user prompt, passes it to `classify_prompt` to determine what kind of task it is, and then calls the appropriate module:
- QA: use the chat LLM to generate an answer
- Image: use the text-to-image generator
- Video: use the text-to-video generator

Start with a simple version first. You can improve it later by adding better prompts, guardrails, and citation handling.
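The routing logic itself can be prototyped without loading any model. The sketch below is one possible structure, not the required solution: `make_agent`, the keyword-based `stub_classify`, and the stub handlers are all hypothetical stand-ins for `classify_prompt`, `llm_generate`, and `generate_media`.

```python
from typing import Any, Callable, Dict

def make_agent(
    classify: Callable[[str], Dict[str, str]],
    handlers: Dict[str, Callable[[str], Any]],
) -> Callable[[str], Any]:
    """Build an agent that classifies a prompt and dispatches it to a handler."""
    def agent(user_prompt: str) -> Any:
        result = classify(user_prompt)
        kind = result.get("type", "qa")
        if kind not in handlers:          # unknown types fall back to QA
            kind = "qa"
        # Media handlers receive the expanded prompt when the classifier supplies one.
        text = user_prompt if kind == "qa" else result.get("expanded_prompt", user_prompt)
        return handlers[kind](text)
    return agent

def stub_classify(prompt: str) -> Dict[str, str]:
    """Keyword rule used only for this demo; the real classifier would call the LLM."""
    lowered = prompt.lower()
    if "video" in lowered:
        return {"type": "video", "expanded_prompt": prompt + ", cinematic"}
    if "image" in lowered or "picture" in lowered:
        return {"type": "image", "expanded_prompt": prompt + ", detailed"}
    return {"type": "qa"}

handlers = {
    "qa": lambda p: f"answer to: {p}",
    "image": lambda p: ("IMAGE", p),
    "video": lambda p: ("VIDEO", p),
}

agent = make_agent(stub_classify, handlers)
print(agent("What is diffusion?"))
print(agent("Draw a picture of a lighthouse"))
```

Keeping the classifier and handlers injectable makes it straightforward to test routing first and swap in the real models afterwards.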

In [None]:
def multimodal_agent(user_prompt: str):
    # Step 1: Classify the request
    # Step 2: Route the prompt and generate output

    """
    YOUR CODE HERE (~12-16 lines)
    """
    pass

### 4.4: Test the agent
Now let's test your multimodal agent end to end. Each prompt should be routed automatically to the correct capability (text Q&A, image generation, or video generation), and the corresponding output should be displayed.

In [None]:
from diffusers.utils import export_to_video
from IPython.display import display, Video

# Step 1: Define a few diverse prompts (QA, image, video)
# Step 2: For each prompt, call multimodal_agent and inspect the returned result
"""
YOUR CODE HERE (~15-18 lines)
"""

Replace the sample queries with your own and verify that the agent chooses the correct generation path.

## 5 - Interactive Web UI

Launch a simple Gradio web interface so you (or your users) can play with the multimodal agent from the browser.


In [None]:
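import gradio as gr
from diffusers.utils import export_to_video  # used by handle(); imported here so this cell runs on its own
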
with gr.Blocks() as demo:
    gr.Markdown('# Multimodal Agent')
    inp = gr.Textbox(placeholder='Ask or create...')
    btn = gr.Button('Submit')
    out_text = gr.Markdown()
    out_img = gr.Image()
    out_vid = gr.Video()

    def handle(prompt):
        res = multimodal_agent(prompt)
        if isinstance(res, str):
            return res, None, None
        elif hasattr(res, 'save'):
            return '', res, None
        else:
            vid = export_to_video(res)
            return '', None, vid

    btn.click(handle, inp, [out_text, out_img, out_vid])

demo.launch()

After the UI launches, open the link and generate your own images and videos directly from the browser.

## 🎉 Congratulations!

* You have built a **multi-modal agent** capable of understanding various requests and routing them to the proper model.
* Try experimenting with other T2I and T2V models.
* Try making your system more efficient. For example, load a separate lightweight LLM for routing and a more capable LLM for QA.


👏 **Great job!** Take a moment to celebrate. The techniques you implemented here power many production agents and chatbots.