# Multimodal Capabilities with Gemini 3 Pro

> **Created by [Build Fast with AI](https://www.buildfastwithai.com)**

This notebook explores Gemini 3 Pro's multimodal capabilities: processing text, images, video, and audio, both separately and in combination.

## What you'll learn:
- Understanding multimodal AI
- Working with text and images together
- Video understanding and analysis
- Audio processing capabilities
- Combining multiple modalities
- Building multimodal applications

## 1. Installation and Setup

In [None]:
!pip install -q google-generativeai pillow matplotlib

In [None]:
import os
import google.generativeai as genai
from PIL import Image, ImageDraw, ImageFont
import io
from IPython.display import display, Markdown, HTML
import base64

In [None]:
# Configure API key
try:
    from google.colab import userdata
    GOOGLE_API_KEY = userdata.get('GOOGLE_API_KEY')
except Exception:  # not running in Colab, or the secret is unavailable
    GOOGLE_API_KEY = os.environ.get('GOOGLE_API_KEY', 'your-api-key-here')

genai.configure(api_key=GOOGLE_API_KEY)

## 2. Text + Image Understanding

Gemini can process text and images together, understanding relationships between them.

In [None]:
# Create a sample image with text and objects
def create_sample_scene():
    """Create an image of a simple scene."""
    img = Image.new('RGB', (600, 400), color='lightblue')
    draw = ImageDraw.Draw(img)
    
    # Draw sun
    draw.ellipse([500, 50, 550, 100], fill='yellow', outline='orange')
    
    # Draw ground
    draw.rectangle([0, 300, 600, 400], fill='green')
    
    # Draw house
    draw.rectangle([150, 200, 300, 300], fill='brown', outline='black')
    draw.polygon([150, 200, 225, 150, 300, 200], fill='red')  # Roof
    draw.rectangle([200, 240, 250, 300], fill='lightblue')  # Door
    
    # Draw tree
    draw.rectangle([400, 250, 420, 300], fill='brown')  # Trunk
    draw.ellipse([380, 200, 440, 260], fill='darkgreen')  # Leaves
    
    # Add text
    try:
        font = ImageFont.truetype("/usr/share/fonts/truetype/dejavu/DejaVuSans-Bold.ttf", 24)
    except OSError:  # font file not found on this system
        font = ImageFont.load_default()
    
    draw.text((200, 20), "My House", fill='black', font=font)
    
    return img

scene_image = create_sample_scene()
scene_image.save('scene.png')
display(scene_image)

In [None]:
# Analyze the scene with multimodal prompting
# If the API rejects this model ID, genai.list_models() shows the IDs your key can use
model = genai.GenerativeModel('gemini-3-pro')

questions = [
    "Describe everything you see in this image.",
    "What text is written in the image?",
    "What time of day does this scene represent and why?",
    "Count the objects in the scene.",
    "If this were a real place, what would you hear?"
]

for question in questions:
    print(f"\n{'='*80}")
    print(f"Q: {question}")
    print(f"{'='*80}")
    
    response = model.generate_content([question, scene_image])
    print(f"A: {response.text}")

## 3. Multiple Images Analysis

In [None]:
# Create multiple related images
def create_sequence_images():
    """Create a sequence of images showing progression."""
    images = []
    
    for i, (color, label) in enumerate([('red', 'Morning'), ('yellow', 'Noon'), ('orange', 'Evening')]):
        img = Image.new('RGB', (300, 200), color=color)
        draw = ImageDraw.Draw(img)
        
        try:
            font = ImageFont.truetype("/usr/share/fonts/truetype/dejavu/DejaVuSans-Bold.ttf", 30)
        except OSError:  # font file not found on this system
            font = ImageFont.load_default()
        
        draw.text((80, 85), label, fill='white', font=font)
        images.append(img)
        img.save(f'time_{i}.png')
    
    return images

sequence_images = create_sequence_images()

print("Created sequence:")
for img in sequence_images:
    display(img)

In [None]:
# Analyze multiple images together
print("Analyzing image sequence...\n")

response = model.generate_content([
    "Analyze these three images as a sequence. What story do they tell? What changes between them?",
    sequence_images[0],
    sequence_images[1],
    sequence_images[2]
])

display(Markdown(response.text))

## 4. Visual Question Answering

In [None]:
class VisualQA:
    """Visual Question Answering system."""
    
    def __init__(self):
        self.model = genai.GenerativeModel('gemini-3-pro')
        self.context_image = None
        self.history = []
    
    def set_image(self, image: Image.Image):
        """Set the context image for Q&A."""
        self.context_image = image
        self.history = []
    
    def ask(self, question: str) -> str:
        """Ask a question about the image."""
        if self.context_image is None:
            return "Please set an image first using set_image()"
        
        # Include the last 3 Q&A pairs as context, if there are any
        context = "\n".join(
            f"Q: {item['question']}\nA: {item['answer']}"
            for item in self.history[-3:]
        )
        
        # Skip the "Previous conversation" header on the first question
        if context:
            full_prompt = f"Previous conversation:\n{context}\n\nNew question: {question}"
        else:
            full_prompt = question
        
        response = self.model.generate_content([full_prompt, self.context_image])
        answer = response.text
        
        self.history.append({
            "question": question,
            "answer": answer
        })
        
        return answer

# Test Visual QA
vqa = VisualQA()
vqa.set_image(scene_image)

questions = [
    "What color is the house?",
    "Is there a tree in the image?",
    "Where is the tree located relative to the house?",
    "What's written at the top of the image?"
]

for q in questions:
    print(f"\nQ: {q}")
    answer = vqa.ask(q)
    print(f"A: {answer}")
    print("-" * 80)

## 5. Document Understanding

Gemini can understand documents with mixed content (text, images, tables, diagrams).

In [None]:
def create_mock_document():
    """Create a mock document with text and simple table."""
    img = Image.new('RGB', (800, 600), color='white')
    draw = ImageDraw.Draw(img)
    
    try:
        title_font = ImageFont.truetype("/usr/share/fonts/truetype/dejavu/DejaVuSans-Bold.ttf", 32)
        text_font = ImageFont.truetype("/usr/share/fonts/truetype/dejavu/DejaVuSans.ttf", 16)
    except OSError:  # font files not found on this system
        title_font = ImageFont.load_default()
        text_font = ImageFont.load_default()
    
    # Title
    draw.text((50, 30), "Quarterly Sales Report", fill='black', font=title_font)
    
    # Text content
    text_content = [
        "Q3 2024 Performance Summary",
        "",
        "Total Revenue: $1.2M",
        "Growth: +15% YoY",
        "",
        "Regional Breakdown:"
    ]
    
    y = 100
    for line in text_content:
        draw.text((50, y), line, fill='black', font=text_font)
        y += 30
    
    # Simple table
    table_y = y + 20
    draw.rectangle([50, table_y, 500, table_y + 150], outline='black', width=2)
    
    # Table headers
    draw.text((60, table_y + 10), "Region", fill='black', font=text_font)
    draw.text((200, table_y + 10), "Revenue", fill='black', font=text_font)
    draw.text((350, table_y + 10), "Growth", fill='black', font=text_font)
    
    # Table rows
    rows = [
        ("North", "$400K", "+12%"),
        ("South", "$350K", "+18%"),
        ("East", "$300K", "+10%"),
        ("West", "$150K", "+25%")
    ]
    
    row_y = table_y + 40
    for region, revenue, growth in rows:
        draw.text((60, row_y), region, fill='black', font=text_font)
        draw.text((200, row_y), revenue, fill='black', font=text_font)
        draw.text((350, row_y), growth, fill='black', font=text_font)
        row_y += 30
    
    return img

doc_image = create_mock_document()
doc_image.save('mock_document.png')
display(doc_image)

In [None]:
# Document Q&A
doc_questions = [
    "What is the title of this document?",
    "What was the total revenue?",
    "Which region had the highest growth?",
    "Which region had the lowest revenue?",
    "What is the year-over-year growth rate?"
]

print("Document Q&A:\n")
for q in doc_questions:
    print(f"Q: {q}")
    response = model.generate_content([q, doc_image])
    print(f"A: {response.text}\n")
    print("-" * 80 + "\n")

## 6. Chart and Graph Understanding

In [None]:
import matplotlib.pyplot as plt
import numpy as np

# Create a sample chart
def create_sample_chart():
    """Create a bar chart."""
    fig, ax = plt.subplots(figsize=(10, 6))
    
    categories = ['Product A', 'Product B', 'Product C', 'Product D']
    values = [450, 380, 520, 290]
    colors = ['#FF6B6B', '#4ECDC4', '#45B7D1', '#FFA07A']
    
    bars = ax.bar(categories, values, color=colors)
    
    ax.set_ylabel('Sales (units)', fontsize=12)
    ax.set_title('Product Sales Comparison - Q3 2024', fontsize=14, fontweight='bold')
    ax.set_ylim(0, 600)
    
    # Add value labels on bars
    for bar in bars:
        height = bar.get_height()
        ax.text(bar.get_x() + bar.get_width()/2., height,
                f'{int(height)}',
                ha='center', va='bottom')
    
    plt.tight_layout()
    plt.savefig('sales_chart.png', dpi=100, bbox_inches='tight')
    plt.close()
    
    return Image.open('sales_chart.png')

chart_image = create_sample_chart()
display(chart_image)

In [None]:
# Analyze the chart
chart_questions = [
    "What type of chart is this?",
    "What does this chart show?",
    "Which product has the highest sales?",
    "What's the difference between the best and worst performing products?",
    "What insights can you derive from this data?",
    "Suggest improvements for the underperforming products."
]

print("Chart Analysis:\n")
for q in chart_questions:
    print(f"Q: {q}")
    response = model.generate_content([q, chart_image])
    print(f"A: {response.text}\n")
    print("-" * 80 + "\n")

## 7. Video Understanding

Gemini can analyze video content and answer questions about it.

In [None]:
# Note: This is a conceptual example. In production, you would upload actual video files.
import time

class VideoAnalyzer:
    """Analyze video content."""
    
    def __init__(self):
        self.model = genai.GenerativeModel('gemini-3-pro')
    
    def _upload(self, video_path: str):
        """Upload a video and wait until the Files API has finished processing it."""
        video_file = genai.upload_file(path=video_path)
        
        # Uploaded videos cannot be used in a prompt while they are still processing
        while video_file.state.name == "PROCESSING":
            time.sleep(5)
            video_file = genai.get_file(video_file.name)
        
        if video_file.state.name == "FAILED":
            raise ValueError(f"Video processing failed: {video_path}")
        
        return video_file
    
    def analyze_video(self, video_path: str) -> dict:
        """Analyze a video file."""
        # Upload once and reuse the same file for every prompt below
        video_file = self._upload(video_path)
        
        analyses = {}
        
        # Overall description
        response = self.model.generate_content([
            "Describe what happens in this video.",
            video_file
        ])
        analyses['description'] = response.text
        
        # Key events
        response = self.model.generate_content([
            "List the key events or scenes in this video in chronological order.",
            video_file
        ])
        analyses['key_events'] = response.text
        
        # Objects/people
        response = self.model.generate_content([
            "What objects, people, or elements appear in this video?",
            video_file
        ])
        analyses['elements'] = response.text
        
        return analyses
    
    def answer_video_question(self, video_path: str, question: str) -> str:
        """Answer a specific question about the video."""
        video_file = self._upload(video_path)
        response = self.model.generate_content([question, video_file])
        return response.text

print("Video analyzer created.")
print("\nTo use:")
print("1. Save a video file somewhere this notebook can read it")
print("2. Pass its path to analyze_video() or answer_video_question(); both upload the file for you")
print("\nSupported formats: MP4, MOV, AVI, etc.")

## 8. Audio Understanding

Gemini can process and understand audio content.

In [None]:
class AudioAnalyzer:
    """Analyze audio content."""
    
    def __init__(self):
        self.model = genai.GenerativeModel('gemini-3-pro')
    
    def transcribe(self, audio_path: str) -> str:
        """Transcribe audio to text."""
        audio_file = genai.upload_file(path=audio_path)
        response = self.model.generate_content([
            "Transcribe this audio file.",
            audio_file
        ])
        return response.text
    
    def summarize_audio(self, audio_path: str) -> str:
        """Summarize audio content."""
        audio_file = genai.upload_file(path=audio_path)
        response = self.model.generate_content([
            "Summarize the main points discussed in this audio.",
            audio_file
        ])
        return response.text
    
    def extract_insights(self, audio_path: str) -> dict:
        """Extract key insights from audio."""
        audio_file = genai.upload_file(path=audio_path)
        
        insights = {}
        
        # Topics
        response = self.model.generate_content([
            "What topics are discussed in this audio?",
            audio_file
        ])
        insights['topics'] = response.text
        
        # Sentiment
        response = self.model.generate_content([
            "What is the overall tone and sentiment of this audio?",
            audio_file
        ])
        insights['sentiment'] = response.text
        
        # Action items
        response = self.model.generate_content([
            "Extract any action items or next steps mentioned in this audio.",
            audio_file
        ])
        insights['action_items'] = response.text
        
        return insights

print("Audio analyzer created.")
print("\nCapabilities:")
print("- Transcription")
print("- Summarization")
print("- Topic extraction")
print("- Sentiment analysis")
print("- Action item extraction")
print("\nSupported formats: MP3, WAV, M4A, etc.")

## 9. Multimodal RAG (Retrieval-Augmented Generation)

In [None]:
class MultimodalRAG:
    """RAG system that works with multiple modalities."""
    
    def __init__(self):
        self.model = genai.GenerativeModel('gemini-3-pro')
        self.knowledge_base = []
    
    def add_text(self, text: str, metadata: dict = None):
        """Add text to knowledge base."""
        self.knowledge_base.append({
            'type': 'text',
            'content': text,
            'metadata': metadata or {}
        })
    
    def add_image(self, image: Image.Image, description: str = None, metadata: dict = None):
        """Add image to knowledge base."""
        # Generate description if not provided
        if description is None:
            response = self.model.generate_content(["Describe this image in detail.", image])
            description = response.text
        
        self.knowledge_base.append({
            'type': 'image',
            'content': image,
            'description': description,
            'metadata': metadata or {}
        })
    
    def query(self, question: str, include_images: bool = True) -> dict:
        """Query the multimodal knowledge base."""
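        # Gather context. There is no retrieval step: every stored item is sent to the model.
        text_context = [item['content'] for item in self.knowledge_base if item['type'] == 'text']
        # Leave images out entirely when include_images is False, so the source counts stay accurate
        image_context = (
            [item for item in self.knowledge_base if item['type'] == 'image']
            if include_images else []
        )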
        
        # Build prompt
        prompt_parts = [
            f"Answer this question using the provided context:\n\nQuestion: {question}\n\n"
        ]
        
        # Add text context
        if text_context:
            prompt_parts.append("Text Context:\n" + "\n\n".join(text_context))
        
        # Add images
        if include_images and image_context:
            prompt_parts.append("\nRelevant images are also provided.")
            for item in image_context:
                prompt_parts.append(item['content'])
        
        # Generate response
        response = self.model.generate_content(prompt_parts)
        
        return {
            'answer': response.text,
            'sources': {
                'text_items': len(text_context),
                'image_items': len(image_context)
            }
        }

# Test Multimodal RAG
rag = MultimodalRAG()

# Add content
rag.add_text("""
Product Description:
The SmartHome Hub 3000 is our latest home automation controller.
Features include voice control, mobile app, and compatibility with 100+ smart devices.
Available in white and black. Price: $149.99
""")

rag.add_text("""
Customer Reviews:
- "Easy to set up and use!" - 5 stars
- "Great value for money" - 4 stars
- "Voice control is very responsive" - 5 stars
""")

# Add images
rag.add_image(scene_image, description="Product showcase image")

# Query the system
questions = [
    "What is the SmartHome Hub 3000?",
    "How much does it cost?",
    "What do customers say about it?",
    "What colors is it available in?"
]

print("Multimodal RAG System Test:\n")
for q in questions:
    print(f"Q: {q}")
    result = rag.query(q)
    print(f"A: {result['answer']}")
    print(f"Sources: {result['sources']}\n")
    print("-" * 80 + "\n")
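
The `query()` method above passes the whole knowledge base to the model on every call. A retrieval step would narrow the context down first. The sketch below shows one naive way to do that, scoring each text item by word overlap with the question. It is a stand-in for embedding-based retrieval, and `tokenize`, `retrieve_text`, and the `top_k` parameter are hypothetical names that are not part of the `MultimodalRAG` class.

```python
import re

def tokenize(text: str) -> set:
    """Lowercase a string and split it into a set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve_text(question: str, knowledge_base: list, top_k: int = 2) -> list:
    """Return up to top_k text items ranked by word overlap with the question.

    Hypothetical helper: a keyword-overlap scorer, not embedding search.
    """
    question_words = tokenize(question)
    scored = []
    for item in knowledge_base:
        if item['type'] != 'text':
            continue
        overlap = len(question_words & tokenize(item['content']))
        if overlap > 0:
            scored.append((overlap, item['content']))
    # Highest overlap first; the sort is stable, so ties keep insertion order
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [content for _, content in scored[:top_k]]

# Items use the same dictionary layout as MultimodalRAG.knowledge_base
kb = [
    {'type': 'text', 'content': "The SmartHome Hub 3000 costs $149.99."},
    {'type': 'text', 'content': "Customers say voice control is responsive."},
]
print(retrieve_text("What do customers say about voice control?", kb, top_k=1))
```

Word overlap ignores synonyms and word order, so it is only a starting point; the same function signature would still work if the scoring were replaced with embedding similarity.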

## 10. Advanced Multimodal Applications

In [None]:
class MultimodalAssistant:
    """A comprehensive multimodal AI assistant."""
    
    def __init__(self):
        self.model = genai.GenerativeModel('gemini-3-pro')
        self.conversation_history = []
    
    def process_input(self, text: str = None, image: Image.Image = None, 
                     video_path: str = None, audio_path: str = None) -> str:
        """Process any combination of inputs."""
        inputs = []
        
        if text:
            inputs.append(text)
        
        if image:
            inputs.append(image)
        
        if video_path:
            video_file = genai.upload_file(path=video_path)
            inputs.append(video_file)
        
        if audio_path:
            audio_file = genai.upload_file(path=audio_path)
            inputs.append(audio_file)
        
        if not inputs:
            return "No input provided"
        
        # Add conversation history
        if self.conversation_history:
            context = "Previous conversation:\n" + "\n".join([
                f"User: {item['user']}\nAssistant: {item['assistant']}"
                for item in self.conversation_history[-3:]
            ])
            inputs.insert(0, context)
        
        # Generate response
        response = self.model.generate_content(inputs)
        
        # Save to history
        self.conversation_history.append({
            'user': text or "[multimodal input]",
            'assistant': response.text
        })
        
        return response.text
    
    def clear_history(self):
        """Clear conversation history."""
        self.conversation_history = []

# Test the assistant
assistant = MultimodalAssistant()

print("Test 1: Text + Image")
response = assistant.process_input(
    text="What's in this image and what time of day does it represent?",
    image=scene_image
)
print(f"Response: {response}\n")
print("="*80 + "\n")

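# The history stores text only, so this follow-up is answered from the
# previous reply rather than from the image itself
print("Test 2: Follow-up question (uses conversation history)")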
response = assistant.process_input(
    text="What colors are prominent in that image?"
)
print(f"Response: {response}")

## 11. Best Practices for Multimodal AI

### Input Preparation:

1. **Image Quality**: Use clear, well-lit images
2. **Resolution**: Balance quality and file size
3. **File Formats**: Use supported formats (JPEG, PNG, WebP)
4. **Video Length**: Keep videos concise for better analysis
5. **Audio Quality**: Ensure clear audio without background noise
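
The file-format item above can be checked locally before anything is uploaded. The following is a minimal sketch, assuming the three image formats named in the list are the only ones you want to accept. `SUPPORTED_IMAGE_TYPES` and `is_supported_image` are illustrative names, and the check looks only at the file extension, not at the file contents.

```python
import mimetypes

# Older Python versions may not know the .webp extension, so register it explicitly
mimetypes.add_type("image/webp", ".webp")

# Assumption: the allowlist mirrors the formats named in the list above
SUPPORTED_IMAGE_TYPES = {"image/jpeg", "image/png", "image/webp"}

def is_supported_image(path: str) -> bool:
    """Guess the MIME type from the file extension and check it against the allowlist."""
    mime_type, _ = mimetypes.guess_type(path)
    return mime_type in SUPPORTED_IMAGE_TYPES

for name in ["scene.png", "photo.jpg", "notes.txt"]:
    print(name, is_supported_image(name))
```

Adjust the allowlist to whatever the API documentation lists for the model you are using.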

### Prompt Engineering:

1. **Be Specific**: Clearly describe what you want to know
2. **Context**: Provide relevant context for better understanding
3. **Sequential**: For multiple items, indicate order if important
4. **Comparative**: Use comparison prompts for multiple inputs
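
One way to apply the "Sequential" and "Comparative" points is to label each input explicitly, so the prompt itself states the order. The sketch below builds such a list of parts. `build_sequence_prompt` is a hypothetical helper, and plain strings stand in for images so that it runs without any files.

```python
def build_sequence_prompt(instruction: str, items: list) -> list:
    """Interleave numbered labels with media so the model sees an explicit order."""
    parts = [instruction]
    for index, item in enumerate(items, start=1):
        parts.append(f"Image {index}:")
        parts.append(item)
    return parts

# Strings stand in for PIL images here so the sketch runs without any files
parts = build_sequence_prompt(
    "Compare these images in order and describe what changes.",
    ["<morning image>", "<noon image>", "<evening image>"],
)
print(parts)
```

In this notebook the same list could be built from `sequence_images` and passed to `model.generate_content(...)`, which already accepts a list mixing strings and images.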

### Performance Optimization:

1. **Batch Processing**: Group similar analyses
2. **Caching**: Cache results for repeated queries
3. **Preprocessing**: Optimize media before sending
4. **Selective Upload**: Only upload necessary content
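
The caching point can be sketched as a small wrapper that hashes the prompt together with the media bytes and reuses the stored answer when the same pair comes back. `ResponseCache` and the `generate` callable are assumptions made for illustration. A stand-in function replaces the real model call so the example runs offline.

```python
import hashlib

class ResponseCache:
    """Cache model answers keyed by a hash of the prompt text and the media bytes."""

    def __init__(self, generate):
        # `generate` is any callable taking (prompt, media_bytes) and returning text
        self.generate = generate
        self.store = {}
        self.hits = 0

    def _key(self, prompt: str, media_bytes: bytes) -> str:
        digest = hashlib.sha256()
        digest.update(prompt.encode("utf-8"))
        digest.update(b"\x00")  # separator so prompt and media cannot run together
        digest.update(media_bytes)
        return digest.hexdigest()

    def ask(self, prompt: str, media_bytes: bytes) -> str:
        key = self._key(prompt, media_bytes)
        if key in self.store:
            self.hits += 1
            return self.store[key]
        answer = self.generate(prompt, media_bytes)
        self.store[key] = answer
        return answer

# Stand-in for a real model call, so the sketch runs offline
calls = []
def fake_generate(prompt, media_bytes):
    calls.append(prompt)
    return f"answer to: {prompt}"

cache = ResponseCache(fake_generate)
cache.ask("What is in this image?", b"fake-image-bytes")
cache.ask("What is in this image?", b"fake-image-bytes")  # served from the cache
print(len(calls), cache.hits)
```

An in-memory dictionary is lost when the kernel restarts. Persisting the store to disk would be a natural next step if the same media is analyzed across sessions.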

### Application Design:

1. **Progressive Enhancement**: Start simple, add complexity
2. **Error Handling**: Handle unsupported formats gracefully
3. **User Feedback**: Show processing status
4. **Result Presentation**: Display results clearly
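
The error-handling and user-feedback points can be combined in one loop that reports progress for each file and records failures instead of stopping. This is a sketch under assumptions: `analyze_batch` is a hypothetical helper, and `analyze` is a placeholder for any function that takes a path and returns text.

```python
def analyze_batch(paths, analyze, report=print):
    """Run `analyze` on each path, reporting status and collecting failures.

    `analyze` is a placeholder for any function that takes a path and returns text.
    """
    results, failures = {}, {}
    for position, path in enumerate(paths, start=1):
        report(f"[{position}/{len(paths)}] Processing {path}...")
        try:
            results[path] = analyze(path)
        except Exception as error:  # keep going; one bad file should not stop the batch
            failures[path] = str(error)
            report(f"  Skipped {path}: {error}")
    return results, failures

# Stand-in analyzer that rejects one extension, so the sketch runs offline
def fake_analyze(path):
    if path.endswith(".bmp"):
        raise ValueError("unsupported format")
    return f"description of {path}"

results, failures = analyze_batch(["a.png", "b.bmp", "c.jpg"], fake_analyze)
print(results)
print(failures)
```

Returning failures separately lets the calling code decide whether to retry, convert the file, or show the message to the user.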

### Security & Privacy:

1. **Data Handling**: Handle sensitive content appropriately
2. **User Consent**: Get consent before processing personal media
3. **Temporary Storage**: Delete uploaded content after processing
4. **Access Control**: Implement proper authentication
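
The "Temporary Storage" point can be enforced with a context manager that deletes the upload even when the analysis fails. The sketch below takes the upload and delete functions as arguments. `temporary_upload` is a hypothetical helper, and the fake functions stand in for real API calls so the example runs offline.

```python
from contextlib import contextmanager

@contextmanager
def temporary_upload(path, upload, delete):
    """Upload a file, hand it to the caller, and always delete it afterwards."""
    handle = upload(path)
    try:
        yield handle
    finally:
        # Runs even if the analysis raises, so nothing is left behind
        delete(handle)

# Stand-ins for the real upload/delete calls, so the sketch runs offline
remote_files = set()

def fake_upload(path):
    remote_files.add(path)
    return path

def fake_delete(handle):
    remote_files.discard(handle)

with temporary_upload("meeting.mp3", fake_upload, fake_delete) as uploaded:
    print("uploaded:", uploaded in remote_files)
print("left behind:", len(remote_files))
```

With the SDK used in this notebook, `upload` could wrap `genai.upload_file(path=...)` and `delete` could wrap `genai.delete_file(...)`. Treat that wiring as an assumption to verify against the SDK version you have installed.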

## 12. Real-World Applications

### Multimodal AI Use Cases:

1. **Education**:
   - Analyze diagrams and explain concepts
   - Grade assignments with mixed content
   - Create interactive learning experiences

2. **Healthcare**:
   - Medical image analysis with reports
   - Patient history with imaging
   - Research document analysis

3. **Business**:
   - Document processing (invoices, reports)
   - Product catalog management
   - Meeting transcription and analysis

4. **Content Creation**:
   - Video content analysis
   - Social media management
   - Automated captioning

5. **E-commerce**:
   - Visual search
   - Product recommendations
   - Quality control

6. **Accessibility**:
   - Image descriptions for visually impaired users
   - Audio transcription for deaf and hard-of-hearing users
   - Document readers for users with learning disabilities

## Next Steps

Explore advanced multimodal applications:
- Build a multimodal search engine
- Create content moderation systems
- Develop interactive educational tools
- Build accessibility applications
- Create automated content generation pipelines

---

## Learn More

Master multimodal AI and build cutting-edge applications with the **[Gen AI Crash Course](https://www.buildfastwithai.com/genai-course)** by Build Fast with AI!
