In [6]:
# Install the Google Gen AI SDK (the package that provides `from google import genai`)
!pip install -q google-genai

In [7]:
# Set the Google API key as an environment variable
import os
os.environ['GOOGLE_API_KEY'] = 'ENTER API KEY'

In [8]:
# Import the main Gemini module and its types for configuring model requests and responses
from google import genai
from google.genai import types

In [9]:
# Initialize a Gemini client; it picks up the API key from the GOOGLE_API_KEY environment variable set above
client = genai.Client()

# Video Understanding Capability of Gemini

Gemini models can work with videos directly, making it possible to do things that used to need special, separate models. With Gemini's vision abilities, you can:

1. Describe what's happening in a video
2. Summarize a video and turn it into notes
3. Split videos into parts and pull out key information
4. Answer questions about what's in the video
5. Point to exact moments in the video

This is all possible because Gemini was built to be multimodal from the ground up.

A) Older Systems (Stitched Together): Previous AI systems used separate models: one for understanding images, one for transcribing audio, and one for processing text. They would analyze a video, pass the separate results to a language model, and try to "stitch" the understanding together. This is inefficient and loses a lot of crucial context (like the exact timing of a sound relative to a visual action).


B) Gemini's Approach (Built-in): Gemini was designed and trained from the ground up on different types of data (text, images, audio, video) all mixed together. It doesn't have a separate "brain" for each sense. It is one unified model that learned the relationships between words, pixels, and sounds together. This allows for a deeper and more nuanced understanding.

## What Makes Gemini Special for Video?

1. Native Multimodality: As mentioned, this is the biggest advantage. Gemini can line up sound and visuals in time, leading to insights like, "The strange engine noise started exactly when the warning light flashed on the dashboard."

2. Long Context Window: This is a game changer. Recent Gemini models have a very large context window (up to 1 million tokens, with 10 million demonstrated in research). This means the model can hold the contents of a long video (e.g., an hour-long movie or lecture) in context at once. It can answer questions about an event at the beginning of the movie using context from the end.

3. High Quality Video Input: Gemini doesn't just check a few low-quality frames. Given a suitable resolution and sampling rate, it can read small text, notice small hand movements, or catch fast-moving objects.
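
To make the context-window point concrete, here is a rough back-of-the-envelope sketch. It assumes about 300 tokens per second of video at default resolution; that figure is an assumption that does not come from this notebook, so check the current Gemini documentation before relying on it. The function names are hypothetical.

```python
# Assumed figures (not from this notebook): verify against the current Gemini docs.
TOKENS_PER_SECOND = 300            # approx. tokens per second of video (frames + audio)
CONTEXT_WINDOW_TOKENS = 1_000_000  # context window size quoted in the text above


def estimate_video_tokens(duration_seconds, tokens_per_second=TOKENS_PER_SECOND):
    """Estimate how many tokens a video of the given length consumes."""
    return int(duration_seconds * tokens_per_second)


def max_video_minutes(context_tokens=CONTEXT_WINDOW_TOKENS,
                      tokens_per_second=TOKENS_PER_SECOND,
                      reserved_tokens=0):
    """Estimate how many minutes of video fit, leaving room for prompt and answer."""
    usable = max(context_tokens - reserved_tokens, 0)
    return usable / tokens_per_second / 60


print(estimate_video_tokens(60))      # tokens for a one-minute clip
print(round(max_video_minutes(), 1))  # minutes of video that fit in the window
```

Under these assumptions a 1 million token window holds a little under an hour of video, which matches the "hour-long movie or lecture" example above.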

In [10]:
# Read the video file in binary mode
video_path_1 = "/content/water.mp4"
with open(video_path_1, 'rb') as f:
    video_1 = f.read()

In [11]:
# Display the video inside a Jupyter or Colab notebook
from IPython.display import Video
Video(video_1, embed=True, mimetype="video/mp4", width=600, height=400)

In [12]:
# Generate a summary of the video in 7 key points using the Gemini model
prompt_1 = "Summarize the video in 7 points"

response_1 = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=types.Content(
        parts=[
            types.Part( # Attach the video file as binary data
                inline_data=types.Blob(data=video_1, mime_type="video/mp4")
            ),
            types.Part( # Add the summarization prompt
                text=prompt_1
            )
        ]
    )
)

# The inline method above works when the total request size (video, text prompt, system instruction) is under 20 MB
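
The 20 MB rule above can be turned into a small pre-flight check. This is a minimal sketch: `choose_upload_method` is a hypothetical helper, and it only counts the raw file and prompt bytes, so it ignores any encoding overhead or system instruction.

```python
import os

INLINE_LIMIT_BYTES = 20 * 1024 * 1024  # 20 MB request-size limit mentioned above


def choose_upload_method(video_path, prompt="", limit_bytes=INLINE_LIMIT_BYTES):
    """Return 'inline' if the raw payload is under the limit, else 'file_api'."""
    total_bytes = os.path.getsize(video_path) + len(prompt.encode("utf-8"))
    return "inline" if total_bytes < limit_bytes else "file_api"


# Example (hypothetical usage in this notebook):
# choose_upload_method("/content/water.mp4", "Summarize the video in 7 points")
```

Because the check is only approximate, it is safer to switch to the File API for files that come close to the limit.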

In [13]:
response_1

GenerateContentResponse(
  automatic_function_calling_history=[],
  candidates=[
    Candidate(
      content=Content(
        parts=[
          Part(
            text="""Here's a summary of the video in 7 points:

1.  The video starts with a first-person perspective of someone, dressed in dark swim attire, descending steps into a body of water.
2.  An on-screen text overlay appears, stating, "It's not just a pool...", accompanied by a surprised emoji.
3.  As the person steps further into the water, the camera submerges, revealing a much deeper and larger space than initially anticipated.
4.  The view pans downwards to show an immense, multi-level underwater structure that resembles a flooded city or elaborate ruins.
5.  Another text overlay appears, reading, "It's an underground city," along with a shocked emoji and a flashlight emoji.
6.  The underwater "city" features architectural details like building facades with windows and large, gnarled structures resembling tree roots or anci

In [15]:
# If the request size exceeds 20 MB, upload the video with the File API instead
water_2 = client.files.upload(file="/content/water.mp4")

In [16]:
# Retrieve and display the processing state of the uploaded video file
client.files.get(name=water_2.name).state.name

'ACTIVE'
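
A freshly uploaded video may not be `ACTIVE` right away, so it can help to poll the state before calling the model. The sketch below is one way to do that. `wait_until_active` is a hypothetical helper, the state names `PROCESSING` and `FAILED` are assumed from the File API's state enum, and the state lookup is passed in as a function so the loop can be tried without network access.

```python
import time


def wait_until_active(get_state, timeout_s=300.0, poll_s=5.0, sleep=time.sleep):
    """Call get_state() repeatedly until it returns 'ACTIVE'.

    Raises RuntimeError if the state becomes 'FAILED' and TimeoutError if
    the file is still not active after roughly timeout_s seconds.
    """
    waited = 0.0
    while True:
        state = get_state()
        if state == "ACTIVE":
            return state
        if state == "FAILED":
            raise RuntimeError("File processing failed")
        if waited >= timeout_s:
            raise TimeoutError(f"File still {state} after {waited:.0f}s")
        sleep(poll_s)
        waited += poll_s


# Example (hypothetical usage with the upload from this notebook):
# wait_until_active(lambda: client.files.get(name=water_2.name).state.name)
```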

In [17]:
# Generate a full summary of the uploaded video using the Gemini model
prompt_2 = "Summarize the full video"

response_2 = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=[water_2, prompt_2]  # Provide the video file and summarization prompt
)

In [19]:
# Print the generated summary of the video
print(response_2.text)

The video begins with a first-person perspective showing a person, wearing dark leggings, stepping into shallow, murky water. On-screen text reads, "It's not just a pool..." The person walks down a series of submerged steps, going deeper into the water. Suddenly, the camera plunges completely underwater, revealing a vast and elaborate submerged architectural complex. It's an expansive underwater "city" with intricate structures resembling buildings, archways, and large, gnarled tree-like formations, all lit with an ethereal blue glow. The final on-screen text explicitly states, "It's an underground city."


In [23]:

# Generate a summary from the YouTube video using the Gemini Pro model
prompt_4 = "summarize"

response_4 = client.models.generate_content(
    model='gemini-2.5-pro',
    contents=types.Content(
        parts=[
            types.Part( # Provide the video via its YouTube URL
                file_data=types.FileData(file_uri="https://youtu.be/RBSUwFGa6Fk?si=LU-E9XW7sB6E2t3C")
            ),
            types.Part( # Add the prompt to summarize
                text=prompt_4
            )
        ]
    )
)

In [24]:
# Print the summary of the YouTube video
print(response_4.text)

This video provides a comprehensive overview of data science, breaking it down into its core concepts, types, and processes. Here is a summary of the key points discussed:

**What is Data Science?**
Data science is the field of study that involves extracting knowledge and insights from noisy data and then turning those insights into actionable outcomes for a business or organization.

**The Three Pillars of Data Science**
Data science is presented as the intersection of three key disciplines, illustrated with a Venn diagram:
1.  **Computer Science:** The technical foundation for processing and managing data.
2.  **Mathematics & Statistics:** The analytical framework for modeling and understanding data patterns.
3.  **Business Expertise:** The domain knowledge required to ask the right questions and apply insights effectively.

**Types of Data Analytics**
The video outlines four types of analytics, which increase in complexity and business value:
*   **Descriptive Analytics:** Answers t

Now let's see what tasks we can do with Gemini's video understanding capability:

1. Summarize the video
2. Answer questions about the video (knowledge questions, scene segmentation, object and person detection)
3. Speech-to-text (transcription)
4. Emotion and sentiment detection
5. Data extraction from visuals: Pull numbers, graphs, or chart information shown in frames
6. Multi-video comparison: Compare two videos for similarities, differences, or changes over time



In [30]:

# Summarize a second YouTube video (a website tutorial) using the Gemini Pro model
prompt_5 = "summarize"

response_5 = client.models.generate_content(
    model='gemini-2.5-pro',
    contents=types.Content(
        parts=[
            types.Part( # Provide the video via its YouTube URL
                file_data=types.FileData(file_uri="https://youtu.be/YWA-xbsJrVg?si=oFtsb0WSl7RawDm3")
            ),
            types.Part( # Add the summarization prompt
                text=prompt_5
            )
        ]
    )
)

In [31]:
print(response_5.text)

This video provides a step-by-step tutorial on how to create a website in under 10 minutes using WordPress and GoDaddy. Here is a summary of the process shown:

**Introduction (00:00 - 00:22)**
Shubhang from websitelearners.com introduces the tutorial, promising to show viewers how to create any kind of website using a simple drag-and-drop method.

**Step 1: Pick a Name for Your Site (00:23 - 00:52)**
The process begins by clicking a link in the video description, which leads to a page for checking the availability of a desired website name (domain). The presenter searches for "quicktechy.com" and confirms it is available.

**Step 2: Get Hosting and Domain (00:52 - 02:48)**
This step explains the need for hosting (to store website files) and a domain (the website's name). The tutorial guides the user to a GoDaddy page to purchase WordPress hosting.
*   By selecting a 12-month plan, the domain name is included for free for the first year.
*   The user is walked through the checkout proc

In [32]:

# Analyze the DEMON SLAYER video to describe emotions or moods in each scene based on tone of voice and facial expressions
prompt_6 = "Analyze the DEMON SLAYER video to describe emotions or moods in each scene based on tone of voice and facial expressions"

response_6 = client.models.generate_content(
    model='gemini-2.5-pro',
    contents=types.Content(
        parts=[
            types.Part( # Provide the video via its YouTube URL
                file_data=types.FileData(file_uri="https://youtu.be/rsWe1Li3iqw?si=7sDM_ckBiAM1PXp2")
            ),
            types.Part( # Add the prompt
                text=prompt_6
            )
        ]
    )
)

In [33]:
# Print the analysis of emotions and moods for each scene in the video
print(response_6.text)

Of course! This video is a wonderful compilation of Inosuke's funniest moments, particularly his inability to remember names. Here is a scene-by-scene breakdown of the emotions and moods conveyed:

**Scene 1: The Name Reveal (00:00 - 00:25)**
*   **Emotion:** Inosuke begins with manic glee and arrogance. His wide, crazed eyes and massive grin show his excitement and pride in revealing his name. His voice is loud and boastful.
*   **Mood Shift:** The mood instantly becomes comedic and flustered when Tanjiro asks him to spell his name. Inosuke's confidence shatters, replaced by panicked confusion and then frustrated anger as he yells about his illiteracy. He faints from a combination of his injuries and sheer embarrassment, ending the scene on a note of slapstick humor.

**Scene 2: "Kamaboko Gonpachiro" (00:26 - 00:43)**
*   **Emotion:** Inosuke is aggressively challenging and confident, completely oblivious that he's saying Tanjiro's name wrong. In contrast, Tanjiro is comically outrage

In [39]:
# Generate a complete transcript with timestamps from the video using Gemini Pro
prompt_7 = "Transcript the complete spoken dialogue from this video also give timestamp"

response_7 = client.models.generate_content(
    model='gemini-2.5-pro',
    contents=types.Content(
        parts=[
            types.Part( # Provide the video via its YouTube URL
                file_data=types.FileData(file_uri="https://youtube.com/shorts/Mfp4J5E_IjM?si=_COHuU6nmFyw7_I_")
            ),
            types.Part( # Add the transcription prompt with timestamp request
                text=prompt_7
            )
        ]
    )
)

In [40]:
# Print the complete transcript with timestamps
print(response_7.text)

Of course! Here is the complete dialogue from the video with timestamps.

[00:00-00:02] Brutally realistic day during Harvard finals week.
[00:02-00:06] I woke up at 3:00 a.m. literally from being stressed about my organic chemistry exam.
[00:06-00:09] If you think this video gets any better, it doesn't. It gets worse.
[00:09-00:13] I spent the next two hours trying to sleep until I gave up and decided to go get ready at 5:00 a.m.
[00:13-00:17] Since I knew I would be sitting all day, I went to the gym and hit a full body workout.
[00:17-00:20] After, I grabbed my coffee and breakfast and headed straight to the library.
[00:20-00:26] I warmed up by doing some chemistry practice questions until I ran out of pencil and paper, so I had to run to the store to get some more.
[00:26-00:32] After, I went to Widener Library and finished my practice questions and here I am pretty sleep deprived and contemplating my life decisions.
[00:32-00:35] Anyways, I then moved on into redoing practice exa
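
The timestamped transcript above has a regular enough shape to parse into structured data. The sketch below assumes every line follows the `[MM:SS-MM:SS] text` pattern seen in this output. The model is not guaranteed to keep that format, so non-matching lines are skipped. `Segment` and `parse_transcript` are hypothetical names.

```python
import re
from typing import NamedTuple


class Segment(NamedTuple):
    start_s: int  # segment start, in seconds
    end_s: int    # segment end, in seconds
    text: str     # spoken text for the segment


# Matches lines such as "[00:02-00:06] some text"
_LINE = re.compile(r"^\[(\d{1,2}):(\d{2})-(\d{1,2}):(\d{2})\]\s*(.+)$")


def parse_transcript(text):
    """Turn '[MM:SS-MM:SS] text' lines into Segment tuples, skipping other lines."""
    segments = []
    for line in text.splitlines():
        match = _LINE.match(line.strip())
        if not match:
            continue
        start_m, start_sec, end_m, end_sec, body = match.groups()
        segments.append(Segment(int(start_m) * 60 + int(start_sec),
                                int(end_m) * 60 + int(end_sec),
                                body))
    return segments


# Sample lines copied from the output above
sample = """Of course! Here is the complete dialogue from the video with timestamps.

[00:00-00:02] Brutally realistic day during Harvard finals week.
[00:02-00:06] I woke up at 3:00 a.m. literally from being stressed about my organic chemistry exam.
"""
for segment in parse_transcript(sample):
    print(segment.start_s, segment.end_s, segment.text)
```

In the notebook, `parse_transcript(response_7.text)` would give a list that can be searched or exported, as long as the response keeps this line format.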