In [2]:
pip install yt-dlp openai-whisper transformers torch torchaudio


Collecting yt-dlp
  Downloading yt_dlp-2025.2.19-py3-none-any.whl.metadata (171 kB)
Collecting openai-whisper
  Downloading openai-whisper-20240930.tar.gz (800 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting tiktoken (from openai-whisper)
  Downloading tiktoken-0.9.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.7 kB)
Collecting nvidia-cuda-nvrtc-cu12==12.4.127 (from torch)
  Downloading nvidia_cuda_nvrtc_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-runtime-cu12==12.4.127 (from torch)
  Downloading nvidia_cuda_runtime_cu12-12.4.12
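
The pip command above installs only Python packages. The audio-extraction step in the next cell shells out to the `ffmpeg` executable, which pip does not provide. As a minimal sketch, a pre-flight check along the following lines could fail early with a clear message if the binary is missing. The helper name `require_tool` is hypothetical and not part of the original notebook.

```python
import shutil


def require_tool(name):
    """Return the full path of an executable on PATH, or raise if it is missing.

    Hypothetical helper for illustration only.
    """
    path = shutil.which(name)  # None when the executable cannot be found on PATH
    if path is None:
        raise RuntimeError(
            f"'{name}' was not found on PATH; install it before running the pipeline."
        )
    return path
```

Calling `require_tool("ffmpeg")` before the pipeline starts would surface a missing installation immediately rather than after the video download has finished.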

In [3]:
import yt_dlp
import whisper
import torch
from transformers import pipeline
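import subprocess

# Define file paths
VIDEO_FILE = "video.mp4"
AUDIO_FILE = "audio.mp3"

# Step 1: Download YouTube Video
def download_video(youtube_url, output_path=VIDEO_FILE):
    ydl_opts = {
        'outtmpl': output_path,
        # Prefer separate MP4 video + M4A audio streams; fall back to a single MP4 file
        'format': 'bestvideo[ext=mp4]+bestaudio[ext=m4a]/best[ext=mp4]',
        'merge_output_format': 'mp4',
    }
    with yt_dlp.YoutubeDL(ydl_opts) as ydl:
        ydl.download([youtube_url])

# Step 2: Extract Audio from Video
def extract_audio(video_path=VIDEO_FILE, audio_path=AUDIO_FILE):
    # Pass the arguments as a list (no shell) so paths containing spaces or
    # special characters are handled safely; check=True raises
    # subprocess.CalledProcessError if ffmpeg exits with an error.
    subprocess.run(
        ["ffmpeg", "-y", "-i", video_path, "-q:a", "0", "-map", "a", audio_path],
        check=True,
    )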

# Step 3: Transcribe Audio using Whisper
def transcribe_audio(audio_path=AUDIO_FILE):
    model = whisper.load_model("small")  # Load Whisper model
    result = model.transcribe(audio_path)
    return result["text"]

# Step 4: Summarize Text using BART
def summarize_text(text):
    summarizer = pipeline("summarization", model="facebook/bart-large-cnn", device=0 if torch.cuda.is_available() else -1)
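    # truncation=True clips inputs longer than the model's maximum input length
    # instead of failing on long transcripts
    summary = summarizer(text, max_length=150, min_length=50, do_sample=False, truncation=True)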
    return summary[0]['summary_text']

# Main function to run the entire pipeline
def video_summarization_pipeline(youtube_url):
    print("[1] Downloading Video...")
    download_video(youtube_url)

    print("[2] Extracting Audio...")
    extract_audio()

    print("[3] Transcribing Audio...")
    extracted_text = transcribe_audio()
    print("\nExtracted Text Preview:", extracted_text[:500], "...")  # Print first 500 chars

    print("[4] Summarizing Text...")
    summary = summarize_text(extracted_text)

    print("\n** Video Summary **")
    print(summary)
    return summary

# Run the pipeline with a YouTube Video
video_url = "https://www.youtube.com/watch?v=K27diMbCsuw"  # Replace with your video link
summary = video_summarization_pipeline(video_url)


[1] Downloading Video...
[youtube] Extracting URL: https://www.youtube.com/watch?v=K27diMbCsuw
[youtube] K27diMbCsuw: Downloading webpage
[youtube] K27diMbCsuw: Downloading tv client config
[youtube] K27diMbCsuw: Downloading player f6e09c70
[youtube] K27diMbCsuw: Downloading tv player API JSON
[youtube] K27diMbCsuw: Downloading ios player API JSON
[youtube] K27diMbCsuw: Downloading m3u8 information
[info] K27diMbCsuw: Downloading 1 format(s): 401+140
[download] Destination: video.f401.mp4
[download] 100% of  114.45MiB in 00:00:05 at 21.16MiB/s  
[download] Destination: video.f140.m4a
[download] 100% of    3.98MiB in 00:00:00 at 29.84MiB/s  
[Merger] Merging formats into "video.mp4"
Deleting original file video.f140.m4a (pass -k to keep)
Deleting original file video.f401.mp4 (pass -k to keep)
[2] Extracting Audio...
[3] Transcribing Audio...


100%|████████████████████████████████████████| 461M/461M [00:03<00:00, 135MiB/s]
  checkpoint = torch.load(fp, map_location=device)



Extracted Text Preview:  Hi, I'm Pete from Manus AI. For the past year, we've been quietly building what we believe is the next evolution in AI. And today, we're launching an early preview of Manus, the first general AI agent. This isn't just another chat-border workflow. It's a truly autonomous agent that bridges the gap between conception and execution. While other AI stops at generating ideas, Manus delivers results. We see it as the next paradigm of human-machine collaboration, and potentially, it glimps into AGI.  ...
[4] Summarizing Text...


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Device set to use cpu



** Video Summary **
Manus is the first general AI agent that bridges the gap between conception and execution. Manus operates as a multi-agent system powered by several distinct models. The name Manus comes from the famous model, MENSE, at Manus, mind and hand.
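
The transcript summarized above is short. `facebook/bart-large-cnn` accepts at most about 1024 input tokens, so a longer transcript would have to be split and summarized piece by piece. The following is a rough sketch of that idea, not part of the original notebook. It assumes that word count is an acceptable stand-in for token count, and the names `chunk_text` and `summarize_long_text` as well as the default budget of 400 words are hypothetical choices for illustration.

```python
import re


def chunk_text(text, max_words=400):
    """Group sentences into chunks of roughly max_words words each.

    A single sentence longer than max_words is kept whole, so a chunk
    can exceed the budget in that case.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current, count = [], [], 0
    for sentence in sentences:
        if not sentence:
            continue  # skip empty fragments (e.g. empty input)
        n = len(sentence.split())
        # Start a new chunk when adding this sentence would exceed the budget
        if current and count + n > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks


def summarize_long_text(text, summarize_fn, max_words=400):
    """Summarize each chunk with summarize_fn and join the partial summaries."""
    parts = [summarize_fn(chunk) for chunk in chunk_text(text, max_words)]
    return " ".join(parts)
```

In this notebook, `summarize_long_text(extracted_text, summarize_text)` would apply the existing `summarize_text` function to each chunk. Note that `summarize_text` as written builds a new pipeline on every call, so creating the pipeline once and reusing it would be preferable for many chunks.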
