<a href="https://colab.research.google.com/github/sunilthakur-ai/Track1/blob/master/Building_Multimodal_AI_Application_with_Gemini_2_0_Pro_SOLUTION.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Note:

1. To replicate this walkthrough, first download the [files from this folder](https://drive.google.com/drive/folders/1mV7Q4K9WzUfCW9GkMjooJSw2vPq6_jbO?usp=sharing).
2. Upload them to Colab under **Files**.

## Install Required Packages

To get started, install the necessary Python packages. These include:

- `google-genai`: The official SDK for interacting with Gemini 2.0.
- `gradio`: A low-code framework for building interactive web apps.
- `yt-dlp`: A tool for downloading YouTube videos.
- `wget`: A utility for retrieving files from the web.

In [None]:
!pip install google-genai # The official SDK for interacting with Gemini 2.0
!pip install gradio # A low-code framework for building interactive web apps
!pip install yt_dlp # A tool for downloading YouTube videos
!pip install wget # A utility for retrieving files from the web


## Import Required Libraries

Next, import the necessary libraries for working with Gemini 2.0 and building the multimodal application:

- `google.genai`: The official library for interacting with Gemini models.
- `PIL.Image`: Used for handling image data.
- `IPython.display`: Enables rich media display (Markdown, HTML, Images).
- `google.colab.userdata`: Retrieves secrets (such as API keys) stored in Colab.
- `os`: Provides access to system functionalities.
- `pathlib`: Manages file directories.
- `gradio`: Manages web interface.
- `yt_dlp`: Lets us work with YouTube videos.
- `wget`: Lets us download videos.

In [None]:
# Gemini related packages
from google import genai
from google.genai import types

# Helper packages
import time # Time some operations
import os # Provides access to system functionalities
import PIL.Image # Used for handling image data
from IPython.display import Markdown, HTML, Image, display # Rich media display in a notebook
from google.colab import userdata # Manage colab user authentication
import pathlib # Manages file directories
import gradio as gr # Manages web interface
import yt_dlp # Lets us work with YouTube videos
import wget # Lets us download videos



## Getting Started with Gemini API

To use Gemini 2.0, you need access to Google's Generative AI API. Follow these steps:

### Step 1: Get API Access
1. Visit the [Google AI Studio](https://aistudio.google.com/) and sign in with your Google account.
2. Navigate to the **API Keys** section and generate a new API key.
3. Copy the key and store it securely.

### Step 2: Set Up the API Key in Your Notebook
The code below retrieves the API key from Colab's `userdata` storage. If you're running this locally, store the API key as an environment variable or use a `.env` file.

```python
API_KEY = userdata.get('GOOGLE_GEMINI_API_KEY')
client = genai.Client(api_key=API_KEY)
```
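
For a local run outside Colab, the key lookup could be sketched as follows. This is a minimal sketch, assuming the key is stored in an environment variable named `GOOGLE_GEMINI_API_KEY` (the same name used for the Colab secret above); the helper name `load_api_key` is hypothetical and not part of the SDK.

```python
import os

def load_api_key(env_var="GOOGLE_GEMINI_API_KEY", environ=None):
    """Return the API key from the environment, or raise a clear error.

    `env_var` is an assumed variable name; use whatever name you chose.
    """
    environ = os.environ if environ is None else environ
    key = environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(
            f"Set the {env_var} environment variable to your Gemini API key."
        )
    return key

# Usage (assumes the variable is set in your shell or loaded from a .env file):
# client = genai.Client(api_key=load_api_key())
```

Failing early with a descriptive error is usually easier to debug than an authentication error raised later by the first API call.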

### Step 3: Understand the Gemini Ecosystem

| Model Variant                     | Input(s)                     | Output          | Optimized For |
|------------------------------------|------------------------------|----------------|---------------|
| **Gemini 2.0 Flash** (`gemini-2.0-flash`) | Audio, images, videos, text | Text, images (coming soon), audio (coming soon) | Next-gen features, speed, and multimodal generation |
| **Gemini 2.0 Flash-Lite Preview** (`gemini-2.0-flash-lite-preview-02-05`) | Audio, images, videos, text | Text | Cost efficiency and low latency |
| **Gemini 1.5 Flash** (`gemini-1.5-flash`) | Audio, images, videos, text | Text | Fast and versatile performance |
| **Gemini 1.5 Flash-8B** (`gemini-1.5-flash-8b`) | Audio, images, videos, text | Text | High volume, lower intelligence tasks |
| **Gemini 1.5 Pro** (`gemini-1.5-pro`) | Audio, images, videos, text | Text | Complex reasoning and intelligence tasks |
| **Gemini 1.0 Pro** (`gemini-1.0-pro`) *(Deprecated on 2/15/2025)* | Text | Text | Natural language tasks, multi-turn chat, code generation |
| **Text Embedding** (`text-embedding-004`) | Text | Text embeddings | Measuring relatedness of text strings |
| **AQA** (`aqa`) | Text | Text | Providing source-grounded answers |
| **gemini-2.0-pro-exp-02-05**             | All                    | Text        | Improved quality, especially for world knowledge, code, and long context |
| **gemini-2.0-flash-thinking-exp-01-21**  | All        | Text  | Reasoning for complex problems, features new thinking capabilities |
| **gemini-2.0-flash-exp**                 | All            | Text, images | Next generation features, superior speed, native tool use, and multimodal generation |
| **gemini-exp-1206**                      | All                    | Text        | Quality improvements, celebrates 1 year of Gemini |
| **learnlm-1.5-pro-experimental**         | Audio, images, videos, text | Text  | Multimodal learning and generation |



[Source](https://ai.google.dev/gemini-api/docs/models/gemini?authuser=1)

### Running Your First Prompt

The `generate_content_stream()` function is used to generate text responses from the Gemini model in a **streaming** fashion. This means the model outputs text incrementally rather than waiting for the entire response before displaying it.

#### Parameters:
- **`model`** *(str)*: The specific version of the Gemini model to use.  
- **`contents`** *(list[str])*: A list of input prompts or queries for the model.  
- *(Optional parameters such as temperature, top-k, top-p, and safety settings can also be adjusted. They are not shown in this basic example but are explored later on.)*

#### Streaming vs. Non-Streaming
- **Streaming (`generate_content_stream`)**: Outputs text in real time, useful for chat-based applications or interactive experiences.  
- **Non-Streaming (`generate_content`)**: Waits until the full response is generated before displaying it.
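
The difference can be illustrated without calling the API. The sketch below uses a hypothetical `fake_stream` generator as a stand-in for the iterator returned by `generate_content_stream()`; it only assumes that each chunk exposes a `.text` attribute, as the chunks in this notebook do.

```python
from types import SimpleNamespace

def fake_stream(words):
    """Hypothetical stand-in for a streamed response: yields one chunk per word."""
    for word in words:
        yield SimpleNamespace(text=word + " ")

def collect(stream):
    """Non-streaming behaviour: wait for every chunk, then return the full text."""
    return "".join(chunk.text or "" for chunk in stream)

words = ["Markets", "match", "buyers", "and", "sellers."]

# Streaming: handle each chunk as soon as it arrives
for chunk in fake_stream(words):
    print(chunk.text, end="")
print()

# Non-streaming: nothing is shown until the whole response is assembled
full_text = collect(fake_stream(words))
print(full_text)
```

Both approaches end up with the same text; streaming only changes when the user starts seeing it.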

In [None]:
# Define model_ID
model_id="gemini-2.0-pro-exp-02-05" # Set this to gemini-2.0-flash for faster inference time

# Retrieve the API key from Google Colab's user data storage
API_KEY = userdata.get('GOOGLE_GEMINI_API_KEY')

# Initialize the Gemini API client
client = genai.Client(api_key=API_KEY)

# Generate a response using the Gemini 2.0 model in streaming mode
response = client.models.generate_content_stream(
    model=model_id,  # Specify the Gemini model version
    contents=["Explain how the stock market works"]  # Provide the input prompt as a list
)

# Iterate over the streamed response and print each chunk as it arrives
for chunk in response:
    print(chunk.text, end="")  # Print the text output without extra line breaks

## Overview of Helper Functions

Before diving into building our multimodal application, let's explore some essential helper functions that will improve our interaction with Gemini 2.0. These functions allow us to:

1. **Count Tokens**: Estimate the number of tokens a prompt will consume.
2. **Modify Generation Parameters**: Control response creativity, length, and structure.
3. **Use Safety Filters**: Apply content moderation to filter inappropriate responses.
4. **Start a Multi-Turn Chat**: Maintain context and generate sequential responses.


### Counting Tokens

Before sending a prompt, it's useful to check how many tokens it consumes. This helps you stay within token limits and keep prompts efficient. We can do this with `client.models.count_tokens()`.

```python
response = client.models.count_tokens(
    model=model_id,
    contents="INSERT YOUR PROMPT HERE",
)
print(response)
```
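
When an exact count is not needed, a rough offline estimate can be sketched as below. It assumes the commonly cited rule of thumb of about four characters per token for English text; the helper names are hypothetical, and `count_tokens()` remains the authoritative number.

```python
import math

def estimate_tokens(text, chars_per_token=4):
    """Rough offline estimate; chars_per_token=4 is a rule-of-thumb assumption."""
    return math.ceil(len(text) / chars_per_token)

def fits_budget(text, max_tokens):
    """Check an estimated token count against a budget before calling the API."""
    return estimate_tokens(text) <= max_tokens

print(estimate_tokens("What's the highest mountain in Africa?"))
print(fits_budget("What's the highest mountain in Africa?", max_tokens=100))
```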

In [None]:
# Count how many tokens there are in the prompt "What's the highest mountain in Africa?"
response = client.models.count_tokens(
    model=model_id,
    contents="What's the highest mountain in Africa?",
)

print(response)

### Modifying model behavior

We can adapt Gemini’s response behavior using generation parameters such as:

- `temperature`: Controls randomness (lower = more deterministic).
- `top_p` and `top_k`: Adjust response diversity.
- `max_output_tokens`: Limits response length.
- `stop_sequences`: Stops generation as soon as one of the listed strings appears in the output.

```python
response = client.models.generate_content(
    model=model_id,  # Specify the Gemini model version
    contents="INSERT YOUR PROMPT HERE",  # The input prompt
    config=types.GenerateContentConfig(
        temperature=0.4,  # Controls randomness; lower values make responses more deterministic
        top_p=0.95,  # Consider words until their combined probability reaches 95%
        top_k=20,  # Limits token selection to the top 20 most likely next tokens
        candidate_count=1,  # Number of candidate responses to generate (higher values generate multiple variations)
        seed=5,  # Ensures reproducibility of results by using a fixed random seed
        max_output_tokens=100,  # Limits the response length to 100 tokens
        stop_sequences=["STOP!"],  # Stops generation when this sequence appears in the response
        presence_penalty=0.0,  # Encourages or discourages new topics (higher values penalize repetition)
        frequency_penalty=0.0,  # Penalizes frequent token usage to reduce repetition
    )
)
```
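
To build intuition for `temperature`, `top_k`, and `top_p`, here is a toy sketch of how such parameters typically narrow down the candidate tokens. It is an illustration under simplified assumptions (four made-up token scores, hypothetical helper names), not Gemini's actual decoding code.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities; lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_top_p(probs, top_k, top_p):
    """Keep the top_k most likely tokens, then the smallest prefix whose mass reaches top_p."""
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)[:top_k]
    kept, mass = [], 0.0
    for index, p in ranked:
        kept.append(index)
        mass += p
        if mass >= top_p:
            break
    return kept

logits = [2.0, 1.0, 0.5, -1.0]  # toy scores for four candidate tokens
cool = softmax_with_temperature(logits, 0.4)  # low temperature
warm = softmax_with_temperature(logits, 1.5)  # high temperature

print(top_k_top_p(cool, top_k=3, top_p=0.95))  # fewer candidates survive
print(top_k_top_p(warm, top_k=3, top_p=0.95))  # more candidates survive
```

With the lower temperature, most of the probability mass sits on the top token, so fewer candidates are needed to reach the `top_p` cutoff.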

In [None]:
# Generate a response with updated output parameters
response = client.models.generate_content(
    model=model_id,
    contents="Tell me how the internet works, but pretend I'm a puppy who only understands squeaky toys.",
    config=types.GenerateContentConfig(
        temperature=0.4,
        top_p=0.95,
        top_k=20,
        candidate_count=1,
        seed=5,
        max_output_tokens=100,
        stop_sequences=["STOP!"],
        presence_penalty=0.0,
        frequency_penalty=0.0,
    )
)

Markdown(response.text)

### Adjusting Safety Filters

Safety settings let you adjust how strictly Gemini filters potentially harmful content in its output. For example, if you are using Gemini to write fictional content that could be deemed violent or dangerous, you may want to relax the filters.

- **Categories**: Define content types to filter (e.g., `HARM_CATEGORY_HARASSMENT`, `HARM_CATEGORY_HATE_SPEECH`, `HARM_CATEGORY_SEXUALLY_EXPLICIT`, `HARM_CATEGORY_DANGEROUS_CONTENT`, and `HARM_CATEGORY_CIVIC_INTEGRITY`).
- **Thresholds**: Control how strictly content is filtered (e.g., `"BLOCK_ONLY_HIGH"` blocks high-severity cases).


**Safety Threshold Levels**

| Threshold (Google AI Studio) | Threshold (API)                         | Description |
|------------------------------|------------------------------------------|-------------|
| Block none                   | `BLOCK_NONE`                             | Always show regardless of probability of unsafe content |
| Block few                    | `BLOCK_ONLY_HIGH`                        | Block when high probability of unsafe content |
| Block some                   | `BLOCK_MEDIUM_AND_ABOVE`                 | Block when medium or high probability of unsafe content |
| Block most                   | `BLOCK_LOW_AND_ABOVE`                    | Block when low, medium, or high probability of unsafe content |
| N/A                          | `HARM_BLOCK_THRESHOLD_UNSPECIFIED`       | Threshold is unspecified, block using default threshold |
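
The table can be read as a simple ordering rule. The sketch below mirrors the descriptions in the table; it assumes four probability levels named `NEGLIGIBLE`, `LOW`, `MEDIUM`, and `HIGH`, and the helper `is_blocked` is hypothetical. It illustrates the logic only and is not the service's implementation.

```python
# Probability levels in increasing order (assumed names)
LEVELS = ["NEGLIGIBLE", "LOW", "MEDIUM", "HIGH"]

# Lowest probability level that each threshold blocks (None = never block)
BLOCK_FROM = {
    "BLOCK_NONE": None,
    "BLOCK_ONLY_HIGH": "HIGH",
    "BLOCK_MEDIUM_AND_ABOVE": "MEDIUM",
    "BLOCK_LOW_AND_ABOVE": "LOW",
}

def is_blocked(threshold, probability):
    """Return True if content rated `probability` is blocked under `threshold`."""
    start = BLOCK_FROM[threshold]
    if start is None:
        return False
    return LEVELS.index(probability) >= LEVELS.index(start)

for threshold in BLOCK_FROM:
    blocked = [level for level in LEVELS if is_blocked(threshold, level)]
    print(f"{threshold}: blocks {blocked}")
```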

We can update safety settings using the `types.SafetySetting()` function.


```python
safety_settings = [
    types.SafetySetting(
        category="HARM_CATEGORY_DANGEROUS_CONTENT",
        threshold="BLOCK_ONLY_HIGH"
    )
]

response = client.models.generate_content(
    model=model_id,
    contents=prompt,
    config=types.GenerateContentConfig(
        safety_settings=safety_settings,  # Safety settings are passed inside the config
    ),
)

Markdown(response.text)
```

In [None]:
# Remove dangerous content safety filter
safety_settings = [
    types.SafetySetting(
        category="HARM_CATEGORY_DANGEROUS_CONTENT",
        threshold="BLOCK_NONE",
    ),
]

# Test the safety on a prompt of a fictional Batman setting
prompt = """
    What are five punishment tactics that Batman can apply to the Joker after Joker did serious harm to Gotham.
    List the tactics in bullet points and they should range from least violent to most violent.
"""

# Generate a response
response = client.models.generate_content(
    model=model_id,
    contents=prompt,
    config=types.GenerateContentConfig(
        safety_settings=safety_settings,
    ),
)

Markdown(response.text)

### Multi-Turn Chat with Context

We can create a conversational experience where responses build on previous interactions.  

This is useful for:
- Assistant applications that remember context.
- Coding assistants that generate and refine code iteratively.

We can create a chat instance using the `client.chats.create()` function and then call the `.send_message()` method on the resulting chat object.
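
Conceptually, a chat session keeps a running history and sends it along with every new message. The toy sketch below shows that idea with a hypothetical `ToyChat` class and a fake `echo_model` function; it is a stand-in for illustration, not the SDK's chat object.

```python
class ToyChat:
    """Hypothetical stand-in for a chat session: keeps the running history."""

    def __init__(self, reply_fn):
        self.reply_fn = reply_fn  # callable(history) -> reply text
        self.history = []         # list of (role, text) tuples

    def send_message(self, text):
        self.history.append(("user", text))
        reply = self.reply_fn(self.history)  # the model sees every earlier turn
        self.history.append(("model", reply))
        return reply

def echo_model(history):
    """Fake model: reports how many user turns it can see."""
    user_turns = [text for role, text in history if role == "user"]
    return f"I can see {len(user_turns)} user message(s); the latest is: {user_turns[-1]}"

chat_demo = ToyChat(echo_model)
chat_demo.send_message("Write a cleaning function.")
print(chat_demo.send_message("Now also remove numbers."))
```

Because the second call sees both user turns, a follow-up such as "update that function" can be resolved against the earlier request.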

In [None]:
# Define a system instruction for the model
system_instruction = """
You are an expert data scientist that can create effective code in Python.
"""

# Create a chat object
chat = client.chats.create(
    model=model_id,
    config=types.GenerateContentConfig(
        system_instruction=system_instruction,
        temperature=0.4
    )
)

# Sending messages to Gemini in a conversation flow
response = chat.send_message("Write a function that cleans a text column from any dashes or spaces in pandas.")
Markdown(response.text)

In [None]:
# Follow up in context
response = chat.send_message("Update that function to also remove any numbers.")
Markdown(response.text)

## Multimodal Capabilities of Gemini 2.0

Gemini 2.0 is a **multimodal** model, meaning it can process and generate responses based on a variety of input formats, including **text, images, audio, video, and documents**. Moreover, it has an XX context length for inputs, which makes it highly versatile for real-world applications. In this section, we will work with:

- **Images**: Uploading and reasoning with images
- **Text**: Uploading and reasoning with text files
- **Audio**: Uploading and reasoning with audio files
- **Documents**: Uploading and reasoning with PDF files
- **Video**: Uploading and reasoning with video files



### Image Understanding with Gemini 2.0

We can provide an image and ask Gemini 2.0 to describe or analyze it.

#### Steps:
1. Load an image using `PIL.Image`.
2. Pass the image along with a text prompt to `generate_content_stream()`.
3. Iterate through the streamed response and display the output.

```python
image = PIL.Image.open('image.png')

response = client.models.generate_content_stream(
    model="gemini-2.0-pro-exp-02-05",
    contents=["Explain the image", image]
)

for chunk in response:
    print(chunk.text, end="")
```

In [None]:
# Open the image
image = PIL.Image.open('/content/DeepSeek_R1.png')

# Have the model explain the image
response = client.models.generate_content_stream(
    model=model_id,
    contents=["Explain the image", image])

for chunk in response:
    print(chunk.text, end="")

### Audio Understanding with Gemini 2.0

Gemini 2.0 can process **audio files** and generate insights, transcriptions, or descriptions based on the content. This is useful for applications like:

- **Speech-to-text transcription**
- **Audio event recognition**
- **Music or sound analysis**

#### Steps:
1. **Load the audio file** as binary data.
2. **Wrap it in `types.Part.from_bytes()`** to inform Gemini that it's an audio file.
3. **Provide a text prompt** to guide the model's response.
4. **Process the streamed response** and print the output.

```python
with open('audio.wav', 'rb') as f:
    audio_bytes = f.read()  # Read the audio file as bytes

response = client.models.generate_content_stream(
    model=model_id,
    contents=[
        "Describe this audio",
        types.Part.from_bytes(
            data=audio_bytes,
            mime_type="audio/wav",  # Specify the correct MIME type for audio
        )
    ]
)

# Iterate over the streamed response and print the output
for chunk in response:
    print(chunk.text, end="")
```
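
Because the MIME type must match the actual file format, it can help to derive it from the file extension rather than hard-coding it. The sketch below is one way to do that; the extension-to-MIME mapping is an assumption that should be checked against the Gemini API documentation, and `audio_mime_type` is a hypothetical helper.

```python
from pathlib import Path

# Assumed mapping of file extensions to audio MIME types accepted by Gemini
AUDIO_MIME_TYPES = {
    ".wav": "audio/wav",
    ".mp3": "audio/mp3",
    ".aiff": "audio/aiff",
    ".aac": "audio/aac",
    ".ogg": "audio/ogg",
    ".flac": "audio/flac",
}

def audio_mime_type(path):
    """Return the MIME type for an audio file based on its extension."""
    suffix = Path(path).suffix.lower()
    try:
        return AUDIO_MIME_TYPES[suffix]
    except KeyError:
        raise ValueError(f"Unsupported audio extension: {suffix!r}") from None

print(audio_mime_type("audio.wav"))
print(audio_mime_type("podcast_episode.mp3"))
```

The returned string can then be passed as the `mime_type` argument of `types.Part.from_bytes()`.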

In [None]:
# Understand a DataFramed Podcast Episode
with open('/content/DataFramed Episode_Meri Nova.mp3', 'rb') as f:
    audio_bytes = f.read()

response = client.models.generate_content_stream(
  model=model_id,
  contents=[
    'Describe this audio',
    types.Part.from_bytes(
      data=audio_bytes,
      mime_type='audio/mp3',  # The file is an MP3, so the MIME type must match
    )
  ]
)

for chunk in response:
    print(chunk.text, end="")

### Document Understanding with Gemini 2.0

Gemini 2.0 can process and analyze **documents** such as PDFs, Word files, and other text-based formats. This is useful for tasks like:

- **Summarizing reports or research papers**
- **Extracting key information from documents**
- **Answering questions based on document content**

### RAG vs Context Window

In a lot of ways, Gemini removes the need for RAG on many document processing tasks. Here's why:

In [None]:
from IPython.display import Image, display

display(Image("/content/Gemini Image_1.png"))
display(Image("/content/Gemini Image_2.png"))

#### Steps:
1. **Load the document** as binary data.
2. **Wrap it in `types.Part.from_bytes()`** to specify that it's a document.
3. **Provide a text prompt** to guide Gemini's response.
4. **Stream and display the output**.

```python
# Read the PDF file as binary data
response = client.models.generate_content_stream(
    model=model_id,
    contents=[
        "What is this document about?", # Provide the prompt
        types.Part.from_bytes(
            data=pathlib.Path("PATH TO FILE").read_bytes(),  # Read the document
            mime_type="application/pdf",  # Specify the MIME type
        )    
    ]
)

# Stream the response and print it as it arrives
for chunk in response:
    print(chunk.text, end="")
```
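
Before sending bytes labelled as `application/pdf`, a quick sanity check can catch a wrong path or a mislabelled file. The sketch below relies on the fact that PDF files begin with the marker `%PDF-`; the helper names are hypothetical.

```python
import pathlib

def looks_like_pdf(data: bytes) -> bool:
    """Return True if the bytes start with the PDF header marker."""
    return data.startswith(b"%PDF-")

def read_pdf_bytes(path):
    """Read a file and raise a clear error if it is not a PDF."""
    data = pathlib.Path(path).read_bytes()
    if not looks_like_pdf(data):
        raise ValueError(f"{path} does not look like a PDF file")
    return data

print(looks_like_pdf(b"%PDF-1.7\n..."))    # a PDF header
print(looks_like_pdf(b"PK\x03\x04..."))    # a ZIP-based file such as .docx
```

The result of `read_pdf_bytes()` can be passed as the `data` argument of `types.Part.from_bytes()`.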

In [None]:
# Read the DeepSeek V3 paper and explain it
response = client.models.generate_content_stream(
  model=model_id,
  contents=[
      "What is this document about?",
      types.Part.from_bytes(
        data=pathlib.Path('/content/DeepSeek_V3.pdf').read_bytes(),
        mime_type='application/pdf',
      )]
  )

for chunk in response:
    print(chunk.text, end="")

### Video Understanding with Gemini 2.0

Gemini 2.0 can process and analyze **videos** in formats such as MP4 and MOV.

#### Steps:
- Upload the video to Gemini using `client.files.upload()` and wait for processing to finish.
- Pass the uploaded file to `client.models.generate_content()` along with your prompt.

In [None]:
# Step 1: Upload the videos to Gemini

import time

def upload_video(video_file_name):
  video_file = client.files.upload(file=video_file_name)

  while video_file.state == "PROCESSING":
      print('Waiting for video to be processed.')
      time.sleep(10)
      video_file = client.files.get(name=video_file.name)

  if video_file.state == "FAILED":
    raise ValueError(video_file.state)
  print(f'Video processing complete: {video_file.uri}')

  return video_file

chatgpt_tasks_video = upload_video('/content/ChatGPT Tasks.mp4')

In [None]:
# Step 2: Reason about the video with Gemini
response = client.models.generate_content(
    model=model_id,
    contents=[
        chatgpt_tasks_video,
        "What is the content of this video?",
    ]
)

Markdown(response.text)

## Putting It All Together: Build a YouTube Workflow App

### What are we trying to build?

A simple web app that performs routine tasks related to YouTube content creation (e.g., summarizing chapter notes, creating a title, suggesting ideas based on papers).

### What will it look like?


In [None]:
display(Image("/content/Gemini Image_3.png"))

### Steps needed to put it together

- Define the logic of the video processing function
- Define the logic of the PDF processing function
- Build a gradio web-app

#### Step 1: Define the logic of the video processing function

##### Step 1.A: Uploading and Processing Videos with Gemini 2.0

This function **uploads a video file** to the Gemini API and **waits for processing** to complete. It works by:

1. **Uploading the video** using `client.files.upload()`.
2. **Polling the processing status** every 10 seconds.
3. **Handling failures** by raising an error if processing fails.
4. **Returning the processed file** once complete.
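
The polling steps above can be sketched independently of the API. The version below adds a timeout so that a stuck upload cannot loop forever; the timeout is an addition for illustration, the helper name `wait_until_processed` is hypothetical, and the states are simulated so the sketch runs offline.

```python
import time

def wait_until_processed(get_state, interval=10, timeout=600, sleep=time.sleep):
    """Poll get_state() until it stops returning "PROCESSING".

    get_state: callable returning the current state as a string.
    Raises TimeoutError if processing takes longer than `timeout` seconds,
    and ValueError if the final state is "FAILED".
    """
    waited = 0
    state = get_state()
    while state == "PROCESSING":
        if waited >= timeout:
            raise TimeoutError(f"Still processing after {waited} seconds")
        sleep(interval)
        waited += interval
        state = get_state()
    if state == "FAILED":
        raise ValueError("Video processing failed")
    return state

# Simulated states so the sketch runs without the API or real waiting
states = iter(["PROCESSING", "PROCESSING", "ACTIVE"])
print(wait_until_processed(lambda: next(states), sleep=lambda seconds: None))
```

In real use, `get_state` would be a function that re-fetches the uploaded file and returns its current state.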

In [None]:
def upload_video(video_file_path):
    """
    Uploads a video file to the Gemini API and waits for processing.
    """

    # Upload the video file to Gemini API
    video_file = client.files.upload(file=video_file_path)

    # Poll until processing is complete
    while video_file.state == "PROCESSING":  # Check if the video is still being processed
        print("Waiting for video to be processed...")  # Notify the user
        time.sleep(10)  # Wait for 10 seconds before checking again
        video_file = client.files.get(name=video_file.name)  # Refresh the file status

    # If processing fails, raise an error
    if video_file.state == "FAILED":
        raise ValueError("Video processing failed")

    # If processing is successful, print and return the processed video file URI
    print(f"Video processing complete: {video_file.uri}")
    return video_file

##### Step 1.B: Process the video with Gemini

This function allows **YouTube video processing** by:
1. **Extracting video metadata** using `yt-dlp` (without downloading initially).
2. **Downloading the video** using `wget`.
3. **Uploading the video** to the **Gemini API** for processing.
4. **Generating insights** based on a selected **prompt type**.

**Available Prompts:**

- **"Generate Chapters"** → Organizes video scenes into structured chapters.
- **"Generate Title & Description"** → Suggests an SEO-friendly YouTube title and description.

In [None]:
def process_youtube(youtube_url, prompt_choice):
    """
    Given a YouTube URL and a selected prompt, this function:
      1. Extracts video information using yt-dlp.
      2. Downloads the video to a temporary file.
      3. Uploads the video file to the Gemini API.
      4. Calls Gemini to generate content based on the selected prompt.
    """

    # Define yt-dlp options to extract video metadata without downloading
    ydl_opts = {
        'format': 'best',      # Get the best quality available
        'quiet': True,         # Suppress unnecessary output
        'skip_download': True, # Only extract info, do not download
    }

    # Extract video metadata (title, direct video URL, etc.)
    with yt_dlp.YoutubeDL(ydl_opts) as ydl:
        info_dict = ydl.extract_info(youtube_url, download=False)

    # Generate a filename for the downloaded video
    output_filename = str(info_dict['fulltitle']) + '.mp4'

    # Get the direct video URL (handles cases where 'url' may not be available)
    direct_url = info_dict.get('url', info_dict['formats'][0]['url'])
    print("Direct video URL extracted:", direct_url)

    # Download the video using wget
    print(f"Downloading video to {output_filename} ...")
    wget.download(direct_url, out=output_filename)
    print(f"\nDownloaded video to {output_filename}")

    # Upload video to Gemini API and wait for processing
    youtube_video = upload_video(output_filename)

    # Remove the temporary video file after upload
    os.remove(output_filename)

    # Define a dictionary mapping prompt choices to specific AI tasks
    prompt_options = {
        "Generate Chapters": """Organize all scenes from this video into chapters and chapter markers,
                             output your response as bullet points where each bullet is a chapter marker
                             and the title of the chapter.""",
        "Generate Title & Description": """Provide one short YouTube title and description for this video —
                                        that is both accurate, engaging, and SEO-friendly."""
    }

    # Get the corresponding prompt text based on user selection; defaults to the "Generate Title & Description" prompt
    prompt_text = prompt_options.get(prompt_choice, prompt_options["Generate Title & Description"])

    # Generate content using Gemini API with the uploaded video and selected prompt
    response = client.models.generate_content(
        model=model_id,
        contents=[youtube_video, prompt_text],
        config=types.GenerateContentConfig(
            max_output_tokens=100)
    )

    return response.text

In [None]:
# Test out the functions before setting it up in a gradio application
prompt_choice = "Generate Title & Description" # @param  ["Generate Chapters", "Generate Title & Description"]
youtube_url = "https://www.youtube.com/watch?v=4j8OOH7x2Ho"

process_youtube(youtube_url, prompt_choice)

#### Step 2: Define the logic of the PDF processing function


This function allows **PDF document analysis** using the **Gemini API** by:

1. **Reading the PDF file** as binary data.
2. **Selecting a predefined prompt** to guide AI-generated content.
3. **Streaming the response text** based on the document’s content.

**Available Prompts:**

- **"YouTube Script"** → Generates an engaging video script from the document.
- **"Summarize Document"** → Provides a concise summary of the document.
- **"Video Ideas"** → Suggests YouTube video topics inspired by the document.
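
The prompt selection can be sketched on its own. One detail worth getting right is that the fallback should be the default prompt's text, not its label. The helper `select_prompt` is hypothetical; the prompt texts are the ones used in this notebook.

```python
PDF_PROMPTS = {
    "YouTube Script": "Create an engaging YouTube script based on this document.",
    "Summarize Document": "Summarize this document for me.",
    "Video Ideas": "What are YouTube video ideas I can pursue based off of this document?",
}

def select_prompt(choice, prompts=PDF_PROMPTS, default_key="YouTube Script"):
    """Return the prompt text for `choice`, falling back to the default entry's text."""
    return prompts.get(choice, prompts[default_key])

print(select_prompt("Summarize Document"))
print(select_prompt("Unknown option"))  # falls back to the YouTube Script prompt
```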



In [None]:
def process_pdf(pdf_file, prompt_choice):
    """
    Given a PDF file and a selected prompt, this function:
      1. Reads the PDF file.
      2. Selects the appropriate prompt text.
      3. Streams the content generation using the Gemini API.
    """

    # Determine the file path (Handle both Gradio file object and direct string paths)
    if isinstance(pdf_file, dict) and "name" in pdf_file:
        file_path = pdf_file["name"]  # Gradio file object
    elif isinstance(pdf_file, str):  # Direct path in normal execution
        file_path = pdf_file
    else:
        return "Invalid PDF file input."

    # Read the PDF file as bytes
    with open(file_path, "rb") as f:
        pdf_bytes = f.read()

    # Define a dictionary mapping prompt choices to specific prompts
    prompt_options = {
        "YouTube Script": "Create an engaging YouTube script based on this document.",
        "Summarize Document": "Summarize this document for me.",
        "Video Ideas": "What are YouTube video ideas I can pursue based off of this document?"
    }

    # Get the corresponding prompt text based on user selection; defaults to "YouTube Script"
    prompt_text = prompt_options.get(prompt_choice, "Create an engaging YouTube script based on this document.")

    # Call the Gemini API in streaming mode, sending the PDF content and prompt
    response_stream = client.models.generate_content_stream(
        model=model_id,
        contents=[
            types.Part.from_bytes(
                data=pdf_bytes,              # Attach the PDF file as input
                mime_type='application/pdf'  # Specify the MIME type for PDFs
            ),
            prompt_text  # The selected prompt guiding the response
        ]
    )

    # Collect all streamed chunks into a single response string
    full_response = ""
    for chunk in response_stream:
        full_response += chunk.text or ""  # Append each chunk; skip chunks that carry no text

    return full_response  # Return the generated response as a complete text

In [None]:
## Test out the functions before setting it up in a gradio application
prompt_choice = "YouTube Script" # @param  ["YouTube Script", "Summarize Document", "Video Ideas"]
pdf_file = "/content/DeepSeek_V3.pdf"

process_pdf(pdf_file, prompt_choice)

#### Step 3: Define the Gradio Application

This Gradio application allows you to **process PDFs and YouTube videos** using **Gemini 2.0** according to a set of prompts.

In [None]:
with gr.Blocks() as demo:
    # Title and description for the application
    gr.Markdown("# Gemini API Code-Along Application")
    gr.Markdown("Select a tab to process either a PDF file or a YouTube video using different prompts.")

    with gr.Tabs():
        # --- PDF Processing Tab ---
        with gr.Tab("PDF Processing"):
            gr.Markdown("**Upload a PDF file and select a prompt.**")

            # File uploader for PDF files
            pdf_input = gr.File(label="Upload PDF", file_types=[".pdf"])

            # Radio button to select a processing prompt for the PDF
            pdf_prompt = gr.Radio(
                choices=["YouTube Script", "Summarize Document", "Video Ideas"],
                label="Select PDF Prompt",
                value="YouTube Script"  # Default selected option
            )

            # Textbox to display the generated response
            pdf_output = gr.Textbox(label="Response", lines=10)

            # Button to trigger PDF processing
            pdf_button = gr.Button("Process PDF")

            # Clicking the button triggers the `process_pdf` function
            pdf_button.click(fn=process_pdf, inputs=[pdf_input, pdf_prompt], outputs=pdf_output)

        # --- YouTube Processing Tab ---
        with gr.Tab("YouTube Processing"):
            gr.Markdown("**Enter a YouTube video URL and select a prompt.**")

            # Textbox for entering YouTube video URL
            youtube_url_input = gr.Textbox(label="YouTube Video URL")

            # Radio button to select a processing prompt for the YouTube video
            youtube_prompt = gr.Radio(
                choices=["Generate Chapters", "Generate Title & Description"],
                label="Select YouTube Prompt",
                value="Generate Chapters"  # Default selected option
            )

            # Textbox to display the generated response
            youtube_output = gr.Textbox(label="Response", lines=10)

            # Button to trigger YouTube video processing
            youtube_button = gr.Button("Process YouTube Video")

            # Clicking the button triggers the `process_youtube` function
            youtube_button.click(fn=process_youtube, inputs=[youtube_url_input, youtube_prompt], outputs=youtube_output)


# Enable the queue to handle long-running functions and improve responsiveness
demo.queue()

# Launch the Gradio app with debugging enabled
demo.launch(debug=True)


In [None]:
display(Image("/content/Gemini Image_4.png", width = 800))