# Summarizing and Querying Videos using Gemini and RAG

This notebook was created as a capstone project to the [5-Day Gen AI Intensive Course with Google](https://www.kaggle.com/learn-guide/5-day-genai). Some of the code in the RAG section is taken from the corresponding notebooks.

We will analyze recent YouTube videos (uploaded within the last 24 hours) about AI, and perform the following tasks:

* Understanding of video content and transcripts.
* Provision of both free-form and structured responses (JSON and other).
* Retrieval-Augmented Generation (RAG): Vectorisation of the video transcripts with ChromaDB to be used for Gemini queries.

We first remove packages that cause conflicts and install the relevant ones.

In [1]:
# Remove unused conflicting packages
!pip uninstall -qqy jupyterlab jupyterlab-lsp google-cloud-translate google-spark-connect pandas-gbq bigframes google-cloud-bigtable protobuf google-cloud-automl gcsfs
# Install relevant packages
!pip install -qU google-api-core
!pip install -qU "google-genai==1.7.0" "chromadb==0.6.3"
!pip install -qU youtube_transcript_api

[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m160.1/160.1 kB[0m [31m4.8 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m319.7/319.7 kB[0m [31m6.7 MB/s[0m eta [36m0:00:00[0m:00:01[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m67.3/67.3 kB[0m [31m2.6 MB/s[0m eta [36m0:00:00[0m
[?25h  Installing build dependencies ... [?25l[?25hdone
  Getting requirements to build wheel ... [?25l[?25hdone
  Preparing metadata (pyproject.toml) ... [?25l[?25hdone
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m144.7/144.7 kB[0m [31m5.3 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m611.1/611.1 kB[0m [31m16.5 MB/s[0m eta [36m0:00:00[0m00:01[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m2.4/2.4 MB[0m [31m49.9 MB/s[0m eta [36m0:00:00[0m:00:01[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m100.9/100.9 kB

We now import the main packages that will be used subsequently.

In [2]:
from googleapiclient.discovery import build
from youtube_transcript_api import YouTubeTranscriptApi
from datetime import datetime, timedelta, timezone
import os
from kaggle_secrets import UserSecretsClient
from google import genai
from google.genai import types
from IPython.display import HTML, Markdown, display
from google.api_core import retry

We also need to create API keys for Gemini and YouTube, and then retrieve them as secrets so that they are not publicly visible.

In [3]:
GOOGLE_API_KEY = UserSecretsClient().get_secret("GOOGLE_API_KEY")
YOUTUBE_API_KEY = UserSecretsClient().get_secret("YOUTUBE_API_KEY")

We set up a helper function to decide whether to retry an API call specifically when a per-minute quota (429 error) or a service unavailability issue (503 error) is encountered within the context of the genai library. 

In [4]:
is_retriable = lambda e: (isinstance(e, genai.errors.APIError) and e.code in {429, 503})
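
To see how such a predicate behaves without calling the API, here is a minimal sketch. It assumes a hypothetical stand-in exception class, `FakeAPIError`, which is not part of `genai` and only mimics the `code` attribute that the predicate inspects. The name `is_retriable_sketch` is likewise illustrative.

```python
# Hypothetical stand-in for genai.errors.APIError; only the `code` attribute matters here.
class FakeAPIError(Exception):
    def __init__(self, code):
        super().__init__(f"API error {code}")
        self.code = code


# Status codes treated as transient: quota exceeded (429) and service unavailable (503).
RETRIABLE_CODES = {429, 503}


def is_retriable_sketch(e):
    """Return True only for API errors whose status code is worth retrying."""
    return isinstance(e, FakeAPIError) and e.code in RETRIABLE_CODES


print(is_retriable_sketch(FakeAPIError(429)))  # quota error
print(is_retriable_sketch(FakeAPIError(404)))  # not a transient error
print(is_retriable_sketch(ValueError("bad")))  # unrelated exception type
```

Only the first call is classified as retriable; any other exception type or status code is left to propagate.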

## 1. Search and Summarise using the YouTube API

### 1.1 Defining the required functions

The following function searches YouTube for the newest videos about a specific topic. It connects to the YouTube Data API and searches for videos uploaded within the last day, ordered by upload date. It then checks that each video has English captions available. If a video meets these criteria, the function collects its title, the name of the channel that posted it, and its YouTube ID. Finally, it returns a list containing this information.

In [5]:
@retry.Retry(predicate=is_retriable)
def get_latest_videos(query, max_results=10, num_return=5):
    """
    Searches YouTube for the latest videos on a given subject and returns a list of their title, channel, ID.

    Args:
        query (str): The search term.
        max_results (int): The maximum number of videos to retrieve.
        num_return (int): The intended number of video info to return.

    Returns:
        info list[dict]: Each dictionary contains a YouTube video title, channel, and ID.
    """
    youtube = build('youtube', 'v3', developerKey=YOUTUBE_API_KEY)
    now_time = datetime.now(timezone.utc)
    published_after = now_time - timedelta(days=1)

    try:
        search_response = youtube.search().list(
            part='id,snippet',
            q=query,
            type='video',
            order='date',
            relevanceLanguage='en',
            safeSearch='strict',
            videoDuration='medium',
            videoCaption='closedCaption',
            maxResults=max_results,
            publishedAfter=published_after.isoformat()
        ).execute()

        info = []
        counter = 0
        for item in search_response['items']:
            if counter == num_return:
                break

            # skip live streams and videos scheduled to premiere later
            if item['snippet']['liveBroadcastContent'] == 'none' and item['id']['kind'] == 'youtube#video':
                caption_response = youtube.captions().list(part='snippet', videoId=item['id']['videoId']).execute()
                
                # check the existence of english captions
                en_flag = False
                if caption_response and 'items' in caption_response:
                    for track in caption_response['items']:
                        language = track['snippet'].get('language')
                        if language == 'en':
                            en_flag = True
                            break
                if en_flag:
                    info.append({'title': item['snippet']['title'], 'channel': item['snippet']['channelTitle'], 'id': item['id']['videoId']})
                    counter += 1
                       
        return info

    except Exception as e:
        print(f"An error occurred: {e}")
        return []
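
The caption check inside the function can be tried in isolation. The sketch below is an illustration only: `has_english_captions` is a hypothetical helper name, and the sample dictionary is hand-written to mirror just the fields that the function above reads (`items`, `snippet`, `language`), not a full API response.

```python
def has_english_captions(caption_response):
    """Return True if any caption track in the response is marked as English."""
    tracks = (caption_response or {}).get('items', [])
    return any(track.get('snippet', {}).get('language') == 'en' for track in tracks)


# Hand-written sample shaped like the fields read above (an assumption, not real API output).
sample = {'items': [{'snippet': {'language': 'id'}}, {'snippet': {'language': 'en'}}]}

print(has_english_captions(sample))
print(has_english_captions({'items': []}))
```

Like the original loop, this matches only the exact language code `'en'`, so a track tagged with a regional variant would not count.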

To retrieve the transcripts after the video IDs are retrieved, we define another function that also processes them into continuous text.

In [6]:
def fetch_pro(info):
    """
    Searches for YouTube transcripts of given video IDs

    Args:
        info (list[dict]): list of dictionaries that contain title, channel, ID of youtube videos

    Returns:
        all_transcripts (list[str]): list of transcripts
    """
    ytt_api = YouTubeTranscriptApi()
    all_transcripts = []
    for item in info:
        try:
            fetched_transcript = ytt_api.fetch(item['id'], languages=['en'])
        
            transcript = 'TRANSCRIPT: '
            for snippet in fetched_transcript:
                transcript = transcript + snippet.text + ' '
        
            transcript = transcript.replace('$', '&#36;') # replaces $ with html symbol, so that it's not seen as LaTeX by Markdown.
            all_transcripts.append(transcript)
        except Exception as e:
            print(f"Error fetching transcript for video ID {item['id']}: {e}")
    return all_transcripts
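
The text-processing step can be sketched without any network access. This is an illustrative variant, not the function above: `join_snippets` is a hypothetical name, the snippets are faked with `SimpleNamespace` objects that expose only a `text` attribute, and `' '.join` is used in place of repeated string concatenation (so there is no trailing space).

```python
from types import SimpleNamespace


def join_snippets(snippets):
    """Concatenate snippet texts into one string and escape dollar signs for Markdown."""
    text = 'TRANSCRIPT: ' + ' '.join(s.text for s in snippets)
    # Replace $ with its HTML entity so Markdown does not treat it as a LaTeX delimiter.
    return text.replace('$', '&#36;')


# Fake snippets standing in for the fetched transcript entries (an assumption).
fake_snippets = [SimpleNamespace(text='AI costs'), SimpleNamespace(text='$5 a month')]
print(join_snippets(fake_snippets))
```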

After obtaining the video transcripts, we will ask Gemini to summarise the content of each video in one paragraph. We will ask it to return the result as **JSON** so that we can easily find and display its parts (title, channel, summary) at will.

In [7]:
import typing_extensions as typing
import json

class VideoSum(typing.TypedDict):
    TITLE: str
    CHANNEL: str
    SUMMARY: str

@retry.Retry(predicate=is_retriable)
def summarise_text(question, header, text):
    """
    Receives a question on a text and returns the reply of Gemini in json form.

    Args:
        question (str): The question being asked.
        header (str): The header of the text.
        text (str): The text on which the question is asked.
      
    """
    client = genai.Client(api_key=GOOGLE_API_KEY)
    config = types.GenerateContentConfig(temperature=0.0,response_mime_type="application/json",
        response_schema=VideoSum)
    response = client.models.generate_content(
        model='gemini-2.0-flash',
        config=config,
        contents=[question, header, text],
        )

    return response.text
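
Since the function returns a JSON string, it can be useful to check the reply before displaying it. The following is a minimal sketch under stated assumptions: `parse_summary` and `REQUIRED_KEYS` are hypothetical names, and `sample_reply` is a hand-written string standing in for a real model response.

```python
import json

# Field names taken from the VideoSum schema above.
REQUIRED_KEYS = {'TITLE', 'CHANNEL', 'SUMMARY'}


def parse_summary(raw):
    """Parse a JSON string and check that it carries the VideoSum fields."""
    data = json.loads(raw)
    missing = REQUIRED_KEYS - set(data)
    if missing:
        raise ValueError(f"Missing fields: {sorted(missing)}")
    return data


# Hand-written stand-in for a model reply (an assumption, not real output).
sample_reply = '{"TITLE": "Demo", "CHANNEL": "Example", "SUMMARY": "A short summary."}'
summary = parse_summary(sample_reply)
print(summary['TITLE'], '|', summary['SUMMARY'])
```

A check like this turns a malformed reply into an explicit error instead of a `KeyError` later in the display loop.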


### 1.2 Calling the above functions and asking questions on the retrieved videos

We will search for the latest videos about artificial intelligence and print their titles, along with their channel and ID. Note that we are searching for videos that have English captions, but a video's title might be in a language other than English. This is not a problem, since we can ask Gemini to translate it.

In [8]:
info = get_latest_videos('artificial intelligence', max_results=10, num_return=5)

# Take a look at the result
for item in info:
    display(Markdown(f"TITLE: {item['title']}, CHANNEL: {item['channel']}, ID: {item['id']}"))

TITLE: Use this prompt to learn from podcast quick with AI (NotebookLM tutorial), CHANNEL: Ahead Of The Curve Artificial Intelligence, ID: xcbjTXOqsiY

TITLE: AI Debates CAPITALISM vs COMMUNISM, CHANNEL: Elite Questions, ID: VRali-H9g9M

TITLE: AI Civil War: DeepSeek-V3-0324 vs Claude 3.7 Sonnet - Which is Best?, CHANNEL: AI Perspectives, ID: mlCDhMyKedI

TITLE: Create AMAZING Branded Materials in Minutes with This Easy Guide, CHANNEL: AI Communication AI Clarity Strategies, ID: 3fjCCrBH7JI

TITLE: [ ENG SUB] AI for Beginners 2025: Learn Artificial Intelligence from Scratch (Step-by-Step Guide), CHANNEL: Toni Kusworo Tutorial, ID: GUolwsTnV7s

We now fetch the transcripts of the retrieved videos.

In [9]:
all_transcripts = fetch_pro(info)

We can finally ask Gemini to summarise the content of the video transcripts.

In [10]:
request = """
Can you summarise the transcript in one paragraph? If the title or the name of the channel is not in English, then translate it into English.
"""

header = """
TITLE: {}
CHANNEL: {}
"""

for index, item in enumerate(info):
    summary = json.loads(summarise_text(request, header.format(item['title'], item['channel']), all_transcripts[index]))
    display(Markdown(f"TITLE: {summary['TITLE']}<br>CHANNEL: {summary['CHANNEL']}<br>SUMMARY: {summary['SUMMARY']}"))
    print('-'*20)

TITLE: Use this prompt to learn from podcast quick with AI (NotebookLM tutorial)<br>CHANNEL: Ahead Of The Curve Artificial Intelligence<br>SUMMARY: This video tutorial by Ahead Of The Curve Artificial Intelligence demonstrates how to use Google's NotebookLM to quickly learn from podcasts. The tutorial explains how to upload a podcast as a source, use the AI to summarize the content and create a 'deep dive' podcast, and then use a specific prompt to identify and list the major sections of the podcast. This allows users to jump directly to the sections of interest, saving time compared to listening to the entire podcast. The presenter emphasizes the efficiency of this method for targeted learning and encourages viewers to subscribe, share, like, and comment.

--------------------


TITLE: AI Debates CAPITALISM vs COMMUNISM<br>CHANNEL: Elite Questions<br>SUMMARY: In a debate hosted by Elite Questions, two AI models argue the merits of capitalism versus communism. The capitalist AI champions freedom, innovation, and individual opportunity, acknowledging the need for regulation to prevent abuse, while the communist AI critiques capitalism for entrenching inequality, prioritizing profit over people, and commodifying basic needs. The debate touches on historical examples, ethical considerations, and potential solutions, ultimately concluding without a simple answer, emphasizing the need for continued conversation and inviting audience input for future discussions.

--------------------


TITLE: AI Civil War: DeepSeek-V3-0324 vs Claude 3.7 Sonnet - Which is Best?<br>CHANNEL: AI Perspectives<br>SUMMARY: In a rapidly evolving AI landscape, AI Perspectives compares Deepseek V3 0324, a cost-effective and open-source friendly model strong in coding, against Claude 3.7 Sonnet from Anthropic, a top-tier generalist known for superior reasoning, a large context window, and multimodal capabilities. Deepseek V3 0324 excels in coding tasks and offers cost savings, while Claude 3.7 Sonnet shines in complex reasoning, handling large contexts, and image analysis. The choice depends on specific needs: Deepseek for efficient coding and open-source flexibility, and Claude for deep reasoning, extensive information processing, and image analysis, ultimately driving AI innovation.

--------------------


TITLE: Create AMAZING Branded Materials in Minutes with This Easy Guide<br>CHANNEL: AI Communication AI Clarity Strategies<br>SUMMARY: Stefan Arbeck, the founder of Just That Simple, introduces a channel focused on leveraging AI tools to solve business problems through training videos. The channel aims to provide quick and simple strategies for various business challenges by demonstrating how to use AI tools effectively. A teaser video showcases how AI can streamline event planning, from brainstorming themes with ChatGPT to automating tasks with Notion AI and creating promotional materials with Canva and Runway, promising a faster, smarter, and more scalable approach to event management.

--------------------


TITLE: AI for Beginners 2025: Learn Artificial Intelligence from Scratch (Step-by-Step Guide)<br>CHANNEL: Toni Kusworo Tutorial<br>SUMMARY: Toni Kusworo's tutorial, "AI for Beginners 2025," guides viewers on learning AI from scratch, starting with understanding the concept of 'prompt' as a command given to AI. The tutorial emphasizes creating detailed prompts with specific subjects, actions, styles, and additional details to achieve desired results, providing examples of visual styles like realistic, anime, 3D, painting, and fantasy. It demonstrates using prompts in both GPT chat and Meta AI on WhatsApp to generate images, further illustrating how to animate these images using Pixverse AI, including instructions on uploading, prompting for animation, and downloading the final video.

--------------------


We can also ask more general questions.

In [11]:
prompt = 'According to the transcripts, what is currently trending in AI? Give a short answer.'

client = genai.Client(api_key=GOOGLE_API_KEY)
response = client.models.generate_content(
    model='gemini-2.0-flash',
    contents=[prompt, all_transcripts])

Markdown(response.text)

Based on the transcripts, current trends in AI include:

*   **AI-powered tools for learning and note-taking:** Using AI (like Notebook LM) to summarize and analyze podcasts and other learning materials.
*   **AI ideological debates:** Using AI to simulate debates on complex topics like capitalism versus communism.
*   **Competition between advanced language models:** Deepseek V3 and Claude 3 are competing.
*   **AI in solving business problems:** Utilizing AI tools for specific business tasks like event planning.
*   **AI Image Generation and Animation:** Techniques and tools to generate and animate images using AI.

We'll now force Gemini to give a one-word answer: YES or NO, to a rather hard question, given the retrieved transcripts.

In [12]:
import enum

prompt = 'Is AI going to rule the world?'

class Sentiment(enum.Enum):
    YES = "YES"
    NO = "NO"

response = client.models.generate_content(
    model='gemini-2.0-flash',
    config=types.GenerateContentConfig(
        response_mime_type="text/x.enum",
        response_schema=Sentiment
    ),
    contents=[prompt, all_transcripts])

print(response.text)

NO
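
Because the reply is constrained to the enum's values, the returned text can be mapped back to an enum member for use in later logic. A minimal sketch, assuming the text matches one of the values exactly apart from surrounding whitespace; `to_sentiment` is a hypothetical helper and the input string is hand-written.

```python
import enum


class Sentiment(enum.Enum):
    YES = "YES"
    NO = "NO"


def to_sentiment(text):
    """Convert raw reply text into a Sentiment member, tolerating stray whitespace."""
    # Enum lookup by value raises ValueError if the text is not a known value.
    return Sentiment(text.strip())


verdict = to_sentiment("NO\n")
print(verdict, verdict is Sentiment.NO)
```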


### 1.3 Retrieval-Augmented Generation with Chroma

The retrieved transcripts are recent and, hopefully, of high quality, so we'd like Gemini to have easy access to them. The code below defines a custom embedding function for use with ChromaDB; it uses Google's Gemini API to generate the embeddings.

In [13]:
import chromadb
from chromadb import Documents, EmbeddingFunction, Embeddings
from google.api_core import retry

from google.genai import types


# Define a helper to retry when per-minute quota is reached.
is_retriable = lambda e: (isinstance(e, genai.errors.APIError) and e.code in {429, 503})


class GeminiEmbeddingFunction(EmbeddingFunction):
    # Specify whether to generate embeddings for documents, or queries
    document_mode = True

    @retry.Retry(predicate=is_retriable)
    def __call__(self, input: Documents) -> Embeddings:
        if self.document_mode:
            embedding_task = "retrieval_document"
        else:
            embedding_task = "retrieval_query"

        response = client.models.embed_content(
            model="models/text-embedding-004",
            contents=input,
            config=types.EmbedContentConfig(
                task_type=embedding_task,
            ),
        )
        return [e.values for e in response.embeddings]

We now set up a ChromaDB vector database and generate embeddings for the retrieved transcripts, which are added to the database.

In [14]:
DB_NAME = "youtube_transcripts"

embed_fn = GeminiEmbeddingFunction()
embed_fn.document_mode = True

chroma_client = chromadb.Client()
db = chroma_client.get_or_create_collection(name=DB_NAME, embedding_function=embed_fn)

db.add(documents=all_transcripts, ids=[str(i) for i in range(len(all_transcripts))])

The following code performs a semantic search on the ChromaDB database, searching for information related to the query "What is new in AI?". 

In [27]:
embed_fn.document_mode = False # Switch to query mode when generating embeddings.
query = "What is new in AI?"

result = db.query(query_texts=[query], n_results=1)
[all_passages] = result["documents"]
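
Conceptually, the query is embedded as a vector and compared against the stored document vectors, and the closest ones are returned. The sketch below illustrates that idea with toy two-dimensional vectors and cosine similarity. It is an illustration only: the function names are hypothetical, real embeddings have many more dimensions, and the distance metric that a Chroma collection actually uses depends on its configuration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest(query_vec, doc_vecs, n_results=1):
    """Return the indices of the n_results documents most similar to the query."""
    ranked = sorted(
        range(len(doc_vecs)),
        key=lambda i: cosine_similarity(query_vec, doc_vecs[i]),
        reverse=True,
    )
    return ranked[:n_results]


# Toy vectors standing in for document and query embeddings (an assumption).
toy_docs = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
toy_query = [0.9, 0.1]
print(nearest(toy_query, toy_docs, n_results=1))
```

With `n_results=1`, as in the query above, only the single best-matching document is passed on to the prompt.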

We now construct the prompt, based on the passages retrieved. 

In [30]:
prompt = f"""You are a helpful and informative bot that answers questions in detail using text from the reference passage included below. 
If the passage is irrelevant to the answer, you may ignore it.

QUESTION: {query}
"""

# Add the retrieved documents to the prompt.
for passage in all_passages:
    passage_oneline = passage.replace("\n", " ")
    prompt += f"PASSAGE: {passage_oneline}\n"
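
The same prompt assembly can be wrapped in a small function so that it is easy to reuse for other queries. This is a sketch rather than the notebook's own code: `build_prompt` is a hypothetical name, and the passages in the usage example are placeholder strings.

```python
def build_prompt(query, passages):
    """Assemble a grounded prompt from a question and a list of retrieved passages."""
    lines = [
        "You are a helpful and informative bot that answers questions in detail "
        "using text from the reference passage included below.",
        "If the passage is irrelevant to the answer, you may ignore it.",
        "",
        f"QUESTION: {query}",
    ]
    for passage in passages:
        # Flatten each passage onto one line so the PASSAGE markers stay unambiguous.
        lines.append("PASSAGE: " + passage.replace("\n", " "))
    return "\n".join(lines) + "\n"


# Placeholder passages (an assumption) to show the resulting layout.
demo_prompt = build_prompt("What is new in AI?", ["line one\nline two", "another passage"])
print(demo_prompt)
```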

Finally, we ask our question, "What is new in AI?", which Gemini answers using the retrieved YouTube video transcripts.

In [31]:
answer = client.models.generate_content(
    model="gemini-2.0-flash",
    contents=prompt)

Markdown(answer.text)

According to the passage, the AI use case no one is talking about is AI's role in how we interact socially and what that might mean for all of us. It mentions a fascinating interview with Reid Hoffman, co-founder of LinkedIn, who comes at it from this angle. The true killer app of AI will be multiplayer social, not just single-payer chatbots like ChatGPT. Within a few years, we are likely to be in a surrounding field of agents. These agents won't just be for individual use, but will mediate interactions between individuals, groups, and societies. They will make the currently more invisible networks we live in more mediated. For example, having agents listen during conversation and potentially offering corrections or additional information like mentioning a philosopher relevant to the discussion. These agents would be in the field around us and could be socially aware.
