This notebook extracts transcripts from YouTube videos, generates key notes from each transcript with Google's Gemini model, and stores the notes in a ChromaDB vector database so they can be retrieved by semantic search. The workflow is: install the libraries, configure the API key, fetch a video transcript, generate notes, save them, and query the stored notes. The result is a searchable knowledge base built from video content.

In [None]:
!pip install youtube-transcript-api google-generativeai chromadb

Collecting youtube-transcript-api
  Downloading youtube_transcript_api-1.1.0-py3-none-any.whl.metadata (23 kB)
Collecting chromadb
  Downloading chromadb-1.0.13-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (7.0 kB)
Collecting pybase64>=1.4.1 (from chromadb)
  Downloading pybase64-1.4.1-cp311-cp311-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (8.4 kB)
Collecting posthog>=2.4.0 (from chromadb)
  Downloading posthog-6.0.0-py3-none-any.whl.metadata (6.0 kB)
Collecting onnxruntime>=1.14.1 (from chromadb)
  Downloading onnxruntime-1.22.0-cp311-cp311-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (4.5 kB)
Collecting opentelemetry-api>=1.2.0 (from chromadb)
  Downloading opentelemetry_api-1.34.1-py3-none-any.whl.metadata (1.5 kB)
Collecting opentelemetry-exporter-otlp-proto-grpc>=1.2.0 (from chromadb)
  Downloading opentelemetry_exporter_otlp_proto_grpc-1.34.1-py3-none-any.whl.metadata (2.4 kB)
Collecting opentele

In [None]:
from youtube_transcript_api import YouTubeTranscriptApi
from youtube_transcript_api.formatters import TextFormatter

# Gemini API
# Free tier; double-check current pricing at https://ai.google.dev/pricing
import google.generativeai as genai

# ChromaDB
import chromadb
from chromadb.utils import embedding_functions

import os

Set up resources

In [None]:
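# Prefer the GEMINI_API_KEY environment variable so the key is not saved in the notebook;
# fall back to the placeholder, which must be replaced with a real key.
GEMINI_API_KEY = os.environ.get('GEMINI_API_KEY', 'YOUR_KEY')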
genai.configure(api_key=GEMINI_API_KEY)

# Instantiate Gemini model
genai_model = genai.GenerativeModel('models/gemini-2.5-flash')

# Load the vector database, if it exists, otherwise create new on first run
chroma_client = chromadb.PersistentClient(path="my_vectordb")

# Select an embedding function.
# Embedding function choices: https://docs.trychroma.com/guides/embeddings#custom-embedding-functions
gemini_ef = embedding_functions.GoogleGenerativeAiEmbeddingFunction(api_key=GEMINI_API_KEY)

# Load collection, if it exists, otherwise create new on first run. Specify the model that we want to use to do the embedding.
chroma_collection = chroma_client.get_or_create_collection(name='yt_notes', embedding_function=gemini_ef)

ERROR:chromadb.telemetry.product.posthog:Failed to send telemetry event ClientStartEvent: capture() takes 1 positional argument but 3 were given
ERROR:chromadb.telemetry.product.posthog:Failed to send telemetry event ClientCreateCollectionEvent: capture() takes 1 positional argument but 3 were given


Input

In [None]:
# Some sample YouTube videos:
# https://youtu.be/IdLSZEYlWVo
# https://youtu.be/tL-wnMVyTQI
# https://youtu.be/etSdP9CFmko
# https://youtu.be/rgRIZDsEwCk
# https://youtu.be/_EA-74yr5D4

yt_video_id = 'IdLSZEYlWVo'

# Adjust prompt as needed
prompt = "Extract key notes from video transcript: "
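The sample videos above are listed as URLs, while `yt_video_id` expects the bare ID. A minimal sketch of a converter follows; `extract_video_id` is a hypothetical helper (not part of any library used here), and it only assumes the common `youtu.be`, `watch?v=`, `/embed/` and `/shorts/` URL shapes.

```python
from urllib.parse import urlparse, parse_qs


def extract_video_id(url_or_id: str) -> str:
    """Return the video ID from a YouTube URL, or the input itself if it is a bare ID.

    Hypothetical helper for illustration; only a few common URL shapes are handled.
    """
    text = url_or_id.strip()
    if "://" not in text:
        return text  # no scheme, so treat it as an ID already

    parsed = urlparse(text)
    host = (parsed.hostname or "").lower()

    if host == "youtu.be":
        # Short links carry the ID as the first path segment
        return parsed.path.lstrip("/").split("/")[0]

    if host.endswith("youtube.com"):
        if parsed.path == "/watch":
            values = parse_qs(parsed.query).get("v", [])
            if values:
                return values[0]
        for prefix in ("/embed/", "/shorts/"):
            if parsed.path.startswith(prefix):
                return parsed.path[len(prefix):].split("/")[0]

    raise ValueError(f"Could not find a video ID in: {url_or_id!r}")


print(extract_video_id("https://youtu.be/IdLSZEYlWVo"))  # IdLSZEYlWVo
```

With this in place, the input cell could accept either form, for example `yt_video_id = extract_video_id("https://youtu.be/IdLSZEYlWVo")`.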

Extract Transcript

In [None]:
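# Reference: https://github.com/jdepoix/youtube-transcript-api
# The static get_transcript() is deprecated as of youtube-transcript-api 1.0.
# Use an instance with fetch(), then convert to the same list-of-dicts format.
ytt_api = YouTubeTranscriptApi()
transcript_list = ytt_api.fetch(yt_video_id, languages=['en', 'en-US', 'en-GB']).to_raw_data()
transcript = "\n".join(item['text'] for item in transcript_list)

# Save the raw transcript so it can be inspected before generating notes
with open("temp_transcript.txt", "w") as file:
    file.write(transcript)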
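Each transcript item also carries a `start` time in seconds, which the join above discards. If the notes should point back to moments in the video, a sketch like the following keeps that information. `format_with_timestamps` is a hypothetical helper, and the sample items are made up for illustration; it assumes each item is a dict with `text` and `start` keys, as in `transcript_list`.

```python
def format_with_timestamps(items):
    """Render transcript items as '[mm:ss] text' lines.

    Assumes each item is a dict with 'text' and 'start' (seconds) keys.
    """
    lines = []
    for item in items:
        minutes, seconds = divmod(int(item["start"]), 60)
        lines.append(f"[{minutes:02d}:{seconds:02d}] {item['text']}")
    return "\n".join(lines)


# Made-up sample items in the same shape as transcript_list
sample = [
    {"text": "hello", "start": 0.0, "duration": 1.5},
    {"text": "world", "start": 75.4, "duration": 2.0},
]
print(format_with_timestamps(sample))
# [00:00] hello
# [01:15] world
```

Passing `format_with_timestamps(transcript_list)` to the model instead of the plain text is one option; it lengthens the prompt somewhat, so whether it is worth it depends on how the notes will be used.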

Generate Notes

In [None]:
response = genai_model.generate_content(prompt + transcript, stream=False)

with open("temp_notes.txt", "w") as file:
    file.write(response.text)

# Review temp_notes.txt, edit if necessary
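The generation call goes over the network and can fail transiently (rate limits, timeouts). A minimal retry sketch with exponential backoff is shown below. `call_with_retries` is a hypothetical helper, and catching every `Exception` is a deliberate simplification for this sketch; a real notebook would likely narrow it to the errors worth retrying.

```python
import time


def call_with_retries(fn, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fn() and retry on failure, doubling the wait after each attempt.

    Re-raises the last exception once max_attempts is reached.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:  # broad on purpose in this sketch
            if attempt == max_attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))


# Usage with the objects defined in this notebook (not run here):
# response = call_with_retries(
#     lambda: genai_model.generate_content(prompt + transcript, stream=False)
# )
```

The `sleep` parameter exists only so the waiting can be replaced in tests; the default is the real `time.sleep`.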

Save Notes

In [None]:
with open("temp_notes.txt", "r") as file:
    notes = file.read()

# Insert, if record doesn't exist, otherwise update existing record
# https://docs.trychroma.com/reference/py-collection#upsert
chroma_collection.upsert(
    documents=[notes],
    ids=[yt_video_id]
)

# Validation: read the stored record back by ID
result = chroma_collection.get(ids=[yt_video_id], include=['documents'])
result

{'ids': ['IdLSZEYlWVo'],
 'embeddings': None,
 'documents': ['Here are the key notes from the video transcript:\n\n**Speaker\'s Stance:**\n*   A used car dealer advises *against* buying used cars from dealers, including himself.\n\n**Why Avoid Used Car Dealers (Primary Reasons):**\n\n1.  **Source of Inventory (Auctions):**\n    *   98% of used cars come from auctions.\n    *   These are often "trade-ins" from new car dealerships.\n    *   New car dealerships send cars to auction because they don\'t want them on their lot, usually due to:\n        *   Being "problem cars" with existing issues.\n        *   Not wanting older models.\n    *   Essentially, used car dealers buy "someone else\'s problems."\n    *   *Speaker\'s exception:* His dealership can fix many issues cheaply (due to in-house mechanic, selling cars under $5k), but most other dealers don\'t have this luxury and pass problems to customers.\n    *   **Repossessed Cars ("Repos"):** Also common at auctions. Owners often negl
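The upsert above stores only the note text, so the `metadatas` requested later in the query come back empty. Chroma's `upsert` also accepts a `metadatas` list, and a sketch of building a richer record follows. `build_upsert_payload` and the metadata field names (`url`, `saved_at`, `title`) are hypothetical choices for illustration, not something the notebook defines.

```python
from datetime import datetime, timezone


def build_upsert_payload(video_id, notes, title=None):
    """Build keyword arguments for collection.upsert() with simple metadata.

    The metadata keys are illustrative assumptions; values are kept to plain
    strings so they fit Chroma's scalar metadata types.
    """
    metadata = {
        "url": f"https://youtu.be/{video_id}",
        "saved_at": datetime.now(timezone.utc).isoformat(),
    }
    if title:
        metadata["title"] = title
    return {"ids": [video_id], "documents": [notes], "metadatas": [metadata]}


payload = build_upsert_payload("IdLSZEYlWVo", "example notes")
print(payload["metadatas"][0]["url"])  # https://youtu.be/IdLSZEYlWVo

# Usage with the notebook's collection (not run here):
# chroma_collection.upsert(**build_upsert_payload(yt_video_id, notes))
```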

Search Notes

In [None]:
query_text = "How much beef do I need for the beef ribs recipe?"
n_results = 2

# https://docs.trychroma.com/reference/py-collection#query
results = chroma_collection.query(
    query_texts=[query_text],
    n_results=n_results,
    include=['documents', 'distances', 'metadatas'],
)

for i in range(len(results['ids'][0])):
    video_id = results['ids'][0][i]  # named video_id to avoid shadowing the built-in id()
    document = results['documents'][0][i]

    print("************************************************************************")
    print(f"{i+1}.  https://youtu.be/{video_id}")
    print("************************************************************************")
    print(document)


************************************************************************
1.  https://youtu.be/etSdP9CFmko
************************************************************************
Here are the key notes from the video transcript on making Braised Beef Ribs:

**Dish:** Braised Beef Ribs

**Main Ingredient:**
*   1 kilogram Beef Ribs

**Preparation (Beef Ribs):**
*   Soak ribs in warm water for 30 minutes to remove bone dust and blood.
*   Rinse clean.
*   **Blanching:** Place ribs in cold water, bring to a boil. Remove ribs from the areas where bubbles are actively coming up. (No need to rinse after blanching).

**Ingredients (Spices & Seasoning):**
*   **Spices:** Star anise, cinnamon, cloves, bay leaves, Sichuan peppercorns (more), chilies (more).
*   Green onions, ginger
*   Sugar (for caramelization)
*   Tomato paste
*   Boiling water
*   Soy Sauce: 2 tablespoons regular soy sauce, 1 teaspoon dark soy sauce
*   Salt (to taste, after adding soy sauce)

**Cooking Steps (Pressure Cooker
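The query always returns `n_results` entries, even when none of the stored notes are relevant. Since `distances` is already requested, one option is to drop weak matches before using them. The sketch below assumes the nested-list result shape used above; `filter_by_distance`, the sample data, and the threshold value are all hypothetical. Smaller distances mean closer matches, but a sensible cutoff depends on the collection's distance metric and the embedding model, so it has to be tuned by hand.

```python
def filter_by_distance(results, max_distance):
    """Return (id, distance, document) tuples for the first query text,
    keeping only hits whose distance is at or below max_distance."""
    hits = zip(results["ids"][0], results["distances"][0], results["documents"][0])
    return [(vid, dist, doc) for vid, dist, doc in hits if dist <= max_distance]


# Made-up data in the same shape as a Chroma query result
fake_results = {
    "ids": [["etSdP9CFmko", "IdLSZEYlWVo"]],
    "distances": [[0.2, 0.9]],
    "documents": [["beef ribs notes", "used car notes"]],
}
print(filter_by_distance(fake_results, max_distance=0.5))
# [('etSdP9CFmko', 0.2, 'beef ribs notes')]
```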

Answer the query from the retrieved notes

In [None]:
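# Build the question-answering prompt. The newlines keep the instruction,
# question, and document from running together, and a separate variable
# avoids overwriting the note-extraction prompt defined earlier.
qa_prompt = "Answer the following QUESTION using DOCUMENT as context.\n\n"
qa_prompt += f"QUESTION: {query_text}\n\n"
qa_prompt += f"DOCUMENT: {results['documents'][0][0]}"

response = genai_model.generate_content(qa_prompt, stream=False)
print(response.text)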

You need 1 kilogram of Beef Ribs for the recipe.
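The cell above passes only the top-ranked document to the model. When an answer may be spread across several notes, the same idea extends to all retrieved documents. A minimal sketch follows; `build_rag_prompt` and its DOCUMENT numbering are hypothetical, not a format required by Gemini.

```python
def build_rag_prompt(question, documents):
    """Assemble a question-answering prompt from a question and one or more
    context documents, separated by blank lines."""
    parts = [
        "Answer the following QUESTION using the DOCUMENTS as context.",
        "",
        f"QUESTION: {question}",
        "",
    ]
    for number, doc in enumerate(documents, start=1):
        parts.append(f"DOCUMENT {number}:\n{doc}")
        parts.append("")
    return "\n".join(parts).rstrip()


print(build_rag_prompt("How much beef?", ["1 kilogram Beef Ribs", "Unrelated notes"]))

# Usage with the notebook's query results (not run here):
# response = genai_model.generate_content(
#     build_rag_prompt(query_text, results['documents'][0]), stream=False
# )
```

Including more documents gives the model more context but also more room for irrelevant text, so it pairs naturally with a modest `n_results`.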
