# 🎥 Extracting Knowledge from Videos

## What This Notebook Teaches

In this notebook, you'll learn how to **automatically extract knowledge from educational and training videos** by converting them to text and cleaning the transcripts using AI. This is a powerful workflow for processing video content at scale.

### 🎯 The Scenario

Imagine you're a lecturer designing an educational course, and you have a collection of old educational videos from previous seminars, workshops, or lectures. You want to extract the knowledge from these videos to:
- Create course materials
- Generate lecture notes
- Build a knowledge base
- Repurpose content for different formats

### ⚠️ The Problem

**Manually transcribing video content is extremely time-consuming!**

Consider this: If you have 6 hours of video content, manual transcription could take:
- **4-5 person days** of work (for typing)
- Additional time for proofreading and formatting
- Risk of transcription errors and inconsistencies

This is simply not scalable or cost-effective.

### ✨ The Solution

We'll **automate this entire process** using OpenAI's powerful AI models:
1. **Whisper** - For accurate speech-to-text transcription
2. **GPT-5-nano** - For cleaning and formatting the transcripts

What would take days manually can be completed in minutes!

### 📹 About the Videos

💡 **Note:** Before you start, watch a minute or so of each video you plan to upload. Delivery quality can vary: some speakers are more articulate than others, and recordings may differ in background noise and recording conditions.

**In this notebook, we'll work with 2 videos to demonstrate the complete workflow.** This allows you to see the entire process from start to finish, understand each step, and then apply it to larger video collections.

---

Let's get started! 🚀

## 📁 File Paths Setup

Before we begin, let's define where our files will be stored:
- **Input:** Videos will be uploaded to `/content/videos/`
- **Intermediate:** Audio files will be saved to `/content/audio_files/`
- **Intermediate:** Raw transcripts will be saved to `/content/transcripts/`
- **Output:** Cleaned transcripts will be saved to `/content/cleaned_transcripts/`

In [None]:
# Define file paths
input_folder_path = '/content/videos'
audio_files_dir = '/content/audio_files'
transcripts_dir = '/content/transcripts'
cleaned_dir = '/content/cleaned_transcripts'

print("✅ File paths configured:")
print(f"  📥 Input videos: {input_folder_path}")
print(f"  🔊 Audio files: {audio_files_dir}")
print(f"  📝 Raw transcripts: {transcripts_dir}")
print(f"  ✨ Cleaned transcripts: {cleaned_dir}")

---

## 🔧 Setup

### Installing Required Dependencies

We'll need two main Python packages:
- **`openai`** - To interact with OpenAI's Whisper and GPT models
- **`moviepy`** - To extract audio from video files

In [None]:
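# Install required packages
# moviepy is pinned below 2.0 because this notebook uses the 1.x API
# (`moviepy.editor` and the `verbose` argument were removed in moviepy 2.0)
!pip install -q openai "moviepy<2"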

print("✅ All dependencies installed successfully!")

### 🔑 API Key Configuration

To use OpenAI's APIs, you need an API key. We'll set this up with two methods:

**Method 1 (Recommended):** Store your API key in Colab Secrets
- Click the 🔑 icon in the left sidebar
- Click "Add new secret"
- Name: `OPENAI_API_KEY`
- Value: Your API key
- Toggle "Notebook access" ON

**Method 2 (Fallback):** Enter your API key manually when prompted

In [None]:
import os

# Configure OpenAI API key
# Method 1: Try to get API key from Colab secrets (recommended)
try:
    from google.colab import userdata
    OPENAI_API_KEY = userdata.get('OPENAI_API_KEY')
    print("✅ API key loaded from Colab secrets")
except Exception:
    # Method 2: Manual input (fallback), used when not running in Colab
    # or when the secret is missing or notebook access is disabled
    from getpass import getpass
    print("💡 To use Colab secrets: Go to 🔑 (left sidebar) → Add new secret → Name: OPENAI_API_KEY")
    OPENAI_API_KEY = getpass("Enter your OpenAI API Key: ")

# Validate that the API key is set before using it
if not OPENAI_API_KEY or OPENAI_API_KEY.strip() == "":
    raise ValueError("❌ ERROR: No API key provided!")

# Set the API key as an environment variable
os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY

print("✅ Authentication configured!")

# Configure which OpenAI model to use for cleaning
OPENAI_MODEL = "gpt-5-nano"  # Using gpt-5-nano for cost efficiency
print(f"🤖 Selected Model for text cleaning: {OPENAI_MODEL}")

### 📦 Import Required Libraries

In [None]:
# Suppress warnings from moviepy library (these are harmless compatibility warnings)
import warnings
warnings.filterwarnings('ignore', category=SyntaxWarning)

# Import necessary libraries
import os
from pathlib import Path
from openai import OpenAI
from moviepy.editor import VideoFileClip

# Initialize OpenAI client
client = OpenAI(api_key=OPENAI_API_KEY)

print("✅ All libraries imported and OpenAI client initialized!")

---

## 📋 Planning Our Work

Before diving into the code, let's break down this task into **three clear steps**:

### Step 1: Convert Videos to Audio Format Only
**Why?** Video files contain both visual and audio information, but we only need the audio for transcription. By extracting just the audio:
- We reduce file sizes significantly (audio files are much smaller)
- Processing becomes faster and cheaper
- Whisper only analyzes audio anyway, so we're not losing any relevant information

### Step 2: Send Audio Files to Whisper Model to Obtain Transcriptions
**Why?** Whisper is OpenAI's state-of-the-art speech-to-text model. It will convert our audio files into text transcripts automatically.

### Step 3: Use GPT to Clean These Transcriptions
**Why?** Raw transcripts from speech-to-text models often contain:
- Filler words ("um", "uh", "you know")
- Run-on sentences without proper punctuation
- Repetitions and false starts
- Lack of proper formatting

GPT models excel at text refinement and can transform these raw transcripts into clean, professional lecture notes.

---

### 📁 Creating Working Directories

Because the workflow has three stages, let's **create directories for the extracted audio files and the intermediate transcripts**. This keeps our workflow organized and makes it easy to review outputs at each stage.

In [None]:
# Create necessary directories if they don't exist
os.makedirs(input_folder_path, exist_ok=True)
os.makedirs(audio_files_dir, exist_ok=True)
os.makedirs(transcripts_dir, exist_ok=True)
os.makedirs(cleaned_dir, exist_ok=True)

print("✅ All working directories created:")
print(f"  📁 {input_folder_path}")
print(f"  📁 {audio_files_dir}")
print(f"  📁 {transcripts_dir}")
print(f"  📁 {cleaned_dir}")

### 📤 Upload Your Videos

Now, let's upload the 2 video files you want to process. These should be `.mp4` files.

In [None]:
from google.colab import files

print("📤 Please upload your video files (.mp4)...")
uploaded = files.upload()

# Move uploaded files to the videos directory
for filename in uploaded.keys():
    src = filename
    dst = os.path.join(input_folder_path, filename)
    os.rename(src, dst)
    print(f"  ✅ Moved {filename} to {input_folder_path}")

print(f"\n✅ All videos uploaded to {input_folder_path}")

---

## 🎵 Step 1: Video to Audio Conversion

### Why Audio-Only?

**Video files are large and contain visual information we don't need.** When transcribing speech, only the audio track matters. By extracting just the audio:
- Audio files are **significantly smaller** than videos (typically 10-20x smaller)
- Processing is **cheaper** (less data to transfer and store)
- Whisper **only analyzes audio** anyway, so we're not losing any information

For example, a 100MB video might contain only 5-10MB of audio data. Why send 90MB of unnecessary visual data to the API?
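
The size figures above are only rough estimates, so it can be worth measuring the actual reduction once Step 1 has run. The sketch below is one way to do that; `size_reduction` is a hypothetical helper name, and the file paths in the usage comment are placeholders rather than files this notebook creates.

```python
import os


def size_reduction(video_path: str, audio_path: str) -> float:
    """Return how many times smaller the audio file is than the video file."""
    video_bytes = os.path.getsize(video_path)
    audio_bytes = os.path.getsize(audio_path)
    if audio_bytes == 0:
        raise ValueError(f"{audio_path} is empty")
    return video_bytes / audio_bytes


# Example usage (placeholder paths):
# ratio = size_reduction("/content/videos/lecture.mp4",
#                        "/content/audio_files/lecture.mp3")
# print(f"Audio is {ratio:.1f}x smaller than the video")
```

The ratio you see depends heavily on video resolution and audio bitrate, so treat the 10-20x figure as a ballpark.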

### 💡 Key Points:
- We use `moviepy` library to extract audio from video files
- Videos are in `.mp4` format
- Audio files are saved as `.mp3` (compressed, efficient format)
- Each video gets a corresponding audio file with the same base name
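
To give each video an audio file with the same base name, it is enough to swap only the final extension. The sketch below shows one way to do this with `pathlib`; `is_mp4` and `audio_path_for` are hypothetical helper names, not part of the notebook's cells.

```python
from pathlib import Path


def is_mp4(filename: str) -> bool:
    """Match .mp4 files regardless of letter case (e.g. 'TALK.MP4')."""
    return Path(filename).suffix.lower() == ".mp4"


def audio_path_for(video_path: str, audio_dir: str) -> Path:
    """Build the .mp3 path for a video, keeping the same base name."""
    # Path.stem drops only the last extension, so a name such as
    # 'intro.mp4.backup.mp4' becomes 'intro.mp4.backup.mp3'.
    return Path(audio_dir) / (Path(video_path).stem + ".mp3")
```

Compared with a plain string replacement of `.mp4`, this changes only the trailing extension and leaves any earlier `.mp4` in the name alone.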

In [None]:
print("🎵 Step 1: Converting videos to audio...\n")

# Loop through all files in the input folder
for filename in os.listdir(input_folder_path):
    # Check if the file is a video file (.mp4)
    if filename.endswith('.mp4'):
        video_path = os.path.join(input_folder_path, filename)
        
        # Create audio filename (swap only the final extension for .mp3)
        audio_filename = os.path.splitext(filename)[0] + '.mp3'
        audio_path = os.path.join(audio_files_dir, audio_filename)
        
        try:
            print(f"  🔄 Processing: {filename}")
            
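            # Load the video file
            video_clip = VideoFileClip(video_path)
            
            # Extract audio from video (None if the video has no audio track)
            audio_clip = video_clip.audio
            if audio_clip is None:
                video_clip.close()
                print(f"  ⚠️ Skipped {filename}: no audio track found\n")
                continue
            
            # Save audio as MP3 file
            audio_clip.write_audiofile(audio_path, verbose=False, logger=None)
            
            # Close clips to free up resources
            audio_clip.close()
            video_clip.close()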
            
            print(f"  ✅ Audio extracted: {audio_filename}\n")
            
        except Exception as e:
            print(f"  ❌ Error processing {filename}: {str(e)}\n")
            continue

print("\n✅ Step 1 Complete: All videos converted to audio files!")
print(f"   Audio files saved to: {audio_files_dir}")

---

## 🎤 Step 2: Speech-to-Text via Whisper

### What is Whisper?

**Whisper is OpenAI's state-of-the-art automatic speech recognition (ASR) system.** It was trained on 680,000 hours of multilingual and multitask supervised data collected from the web. This massive training dataset leads to:
- **High accuracy** across different accents and speaking styles
- **Multilingual support** for 98 languages
- **Robust performance** even with background noise or audio quality issues

### 📚 Whisper API Documentation

The Whisper API accepts audio files and returns transcriptions. Here are the key parameters:

- **`model`**: `"whisper-1"` (the current Whisper model version)
- **`file`**: Audio file object (supports mp3, mp4, wav, and more)
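- **`response_format`**: `'text'` returns plain text, `'json'` returns a JSON object containing the text, and `'verbose_json'` adds segment-level timestamps (`'srt'` and `'vtt'` subtitle formats are also available)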
- **File size limit**: 25 MB (for larger files, split them into chunks)
- **Supported formats**: mp3, mp4, mpeg, mpga, m4a, wav, webm
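
Since the API rejects oversized or unsupported files, a quick local check before uploading can save a failed request. The following is a minimal sketch based on the limit and formats listed above; `preflight_problems` is a hypothetical helper, and it assumes the 25 MB limit means 25 × 1024 × 1024 bytes.

```python
from pathlib import Path

# Assumption: 1 MB is counted as 1024 * 1024 bytes
MAX_UPLOAD_BYTES = 25 * 1024 * 1024
SUPPORTED_EXTENSIONS = {".mp3", ".mp4", ".mpeg", ".mpga", ".m4a", ".wav", ".webm"}


def preflight_problems(audio_path: str) -> list[str]:
    """Return a list of reasons the file may be rejected (empty if none)."""
    path = Path(audio_path)
    if not path.is_file():
        return [f"File not found: {path}"]

    problems = []
    if path.suffix.lower() not in SUPPORTED_EXTENSIONS:
        problems.append(f"Unsupported extension: {path.suffix or '(none)'}")

    size = path.stat().st_size
    if size > MAX_UPLOAD_BYTES:
        problems.append(
            f"File is {size / (1024 * 1024):.1f} MB, above the 25 MB limit"
        )
    return problems
```

An empty list means the file passes both checks; it does not guarantee the audio itself is decodable.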

### 💡 Key Points:
- Whisper automatically detects the language (no need to specify)
- It adds punctuation and formatting automatically
- Processing time is usually a fraction of the audio duration, but it varies with file length and API load
- We save raw transcripts to review before cleaning
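
If you want to link passages back to positions in the video, the `verbose_json` response format includes timed segments. The sketch below formats such segments as timestamped lines. It assumes each segment exposes `start` (seconds) and `text` attributes, as the OpenAI Python SDK's verbose transcription response is expected to; `format_timestamp` and `format_segments` are hypothetical helper names.

```python
def format_timestamp(seconds: float) -> str:
    """Convert a number of seconds into HH:MM:SS."""
    hours, remainder = divmod(int(seconds), 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def format_segments(segments) -> str:
    """Render segments as one '[HH:MM:SS] text' line each."""
    # Assumption: every segment has .start (seconds) and .text attributes
    lines = [
        f"[{format_timestamp(segment.start)}] {segment.text.strip()}"
        for segment in segments
    ]
    return "\n".join(lines)


# Example usage with the client created earlier in this notebook:
# with open(audio_path, "rb") as audio_file:
#     result = client.audio.transcriptions.create(
#         model="whisper-1",
#         file=audio_file,
#         response_format="verbose_json",
#     )
# print(format_segments(result.segments))
```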

In [None]:
print("🎤 Step 2: Transcribing audio files with Whisper...\n")

# Loop through all audio files
for filename in os.listdir(audio_files_dir):
    # Check if the file is an audio file (.mp3)
    if filename.endswith('.mp3'):
        audio_path = os.path.join(audio_files_dir, filename)
        
        # Create transcript filename (swap the .mp3 extension for _transcript.txt)
        transcript_filename = os.path.splitext(filename)[0] + '_transcript.txt'
        transcript_path = os.path.join(transcripts_dir, transcript_filename)
        
        try:
            print(f"  🔄 Transcribing: {filename}")
            
            # Open audio file and send to Whisper API
            with open(audio_path, 'rb') as audio_file:
                transcript_text = client.audio.transcriptions.create(
                    model="whisper-1",
                    file=audio_file,
                    response_format='text'
                )
            
            # Save the transcript to a text file
            with open(transcript_path, 'w', encoding='utf-8') as transcript_file:
                transcript_file.write(transcript_text)
            
            print(f"  ✅ Transcript saved: {transcript_filename}")
            print(f"     Preview: {transcript_text[:100]}...\n")
            
        except Exception as e:
            print(f"  ❌ Error transcribing {filename}: {str(e)}\n")
            continue

print("\n✅ Step 2 Complete: All audio files transcribed!")
print(f"   Transcripts saved to: {transcripts_dir}")

---

## ✨ Step 3: Cleaning Transcriptions via GPT

### Why Clean Transcripts?

**Whisper captures speech accurately but includes imperfections that make transcripts hard to read and use.** Raw speech-to-text output often contains:

- **Filler words**: "um", "uh", "you know", "like", "so"
- **Run-on sentences**: Lack of proper sentence breaks and punctuation
- **Repetitions**: People often repeat words or rephrase thoughts mid-sentence
- **False starts**: Beginning a sentence one way, then starting over
- **Poor formatting**: Missing paragraph breaks, inconsistent capitalization

**GPT models excel at text refinement.** They can transform raw transcripts into clean, professional lecture notes that are:
- Easy to read and understand
- Properly formatted with clear structure
- Free of distracting filler words
- Suitable for use as course materials

### 🤖 About GPT-5-nano

We're using **GPT-5-nano** for this task because:
- It's **cost-efficient** ($0.05 per 1M input tokens, $0.40 per 1M output tokens)
- It's **fast** - processes text quickly
- It's **sufficient** for text cleaning tasks (we don't need the largest model for this)
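
To get a feel for what these prices mean in practice, here is a rough back-of-the-envelope estimate. It uses the prices quoted above and two labeled assumptions: about 4 characters per token for English text, and a cleaned output roughly as long as the input. `estimate_cleaning_cost` is a hypothetical helper.

```python
INPUT_PRICE_PER_MILLION = 0.05   # USD per 1M input tokens, as quoted above
OUTPUT_PRICE_PER_MILLION = 0.40  # USD per 1M output tokens, as quoted above
CHARS_PER_TOKEN = 4              # Assumption: rough average for English text


def estimate_cleaning_cost(transcript_chars: int, output_ratio: float = 1.0) -> float:
    """Estimate the USD cost of cleaning one transcript.

    output_ratio is the assumed output length relative to the input
    (1.0 means the cleaned text is about as long as the raw text).
    """
    input_tokens = transcript_chars / CHARS_PER_TOKEN
    output_tokens = input_tokens * output_ratio
    cost = (
        input_tokens * INPUT_PRICE_PER_MILLION
        + output_tokens * OUTPUT_PRICE_PER_MILLION
    ) / 1_000_000
    return cost


# Example: a transcript of about 54,000 characters
print(f"Estimated cost: ${estimate_cleaning_cost(54_000):.4f}")
```

Treat the result as a lower bound: the prompt itself adds input tokens, and any reasoning tokens the model produces may be billed as output.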

### 💡 Key Points:
- We use a carefully crafted prompt to guide the cleaning process
- The prompt instructs the model to stay faithful to the original content, which reduces (but does not eliminate) the risk of hallucinated additions
- Results are saved as separate "cleaned" files for easy comparison
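
Because the raw and cleaned files sit side by side, a simple automated check can point you to the transcripts that most need a manual look. The sketch below measures what share of the words in the cleaned text never appeared in the raw transcript. It is a crude heuristic, not a faithfulness guarantee; the function names and the 0.3 threshold are hypothetical choices.

```python
import re


def _words(text: str) -> list[str]:
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def novel_word_ratio(raw_text: str, cleaned_text: str) -> float:
    """Share of cleaned-text words that do not occur in the raw transcript."""
    raw_vocabulary = set(_words(raw_text))
    cleaned_words = _words(cleaned_text)
    if not cleaned_words:
        return 0.0
    novel = [word for word in cleaned_words if word not in raw_vocabulary]
    return len(novel) / len(cleaned_words)


def needs_review(raw_text: str, cleaned_text: str, threshold: float = 0.3) -> bool:
    """Flag a cleaned transcript whose wording departs a lot from the raw one."""
    # Assumption: 0.3 is an arbitrary starting point; tune it on your own data
    return novel_word_ratio(raw_text, cleaned_text) > threshold
```

A high ratio does not prove the model invented content (rephrasing also introduces new words), but it is a cheap signal for prioritizing review.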

In [None]:
# Define the cleaning task description
task_description = (
    "You are a helpful assistant tasked with cleaning up lecture notes. "
    "Make the text coherent, correct any typos, and format it for a lecturer "
    "to use as speaking notes. Keep the text friendly, to the point, and ensure "
    "it does not deviate from the original content."
)

print("✨ Step 3: Cleaning transcripts with GPT-5-nano...\n")
print(f"🎯 Task: {task_description}\n")

# Loop through all transcript files
for filename in os.listdir(transcripts_dir):
    # Check if the file is a transcript (.txt)
    if filename.endswith('_transcript.txt'):
        transcript_path = os.path.join(transcripts_dir, filename)
        
        # Create cleaned filename
        cleaned_filename = filename.replace('_transcript.txt', '_cleaned.txt')
        cleaned_path = os.path.join(cleaned_dir, cleaned_filename)
        
        try:
            print(f"  🔄 Cleaning: {filename}")
            
            # Read the raw transcript
            with open(transcript_path, 'r', encoding='utf-8') as f:
                transcript_content = f.read()
            
            # Create the input for GPT with task description and transcript
            input_text = f"{task_description}\n\nTranscript to clean:\n{transcript_content}"
            
            # Call GPT-5-nano API to clean the transcript
            response = client.responses.create(
                model=OPENAI_MODEL,
                input=input_text
            )
            
            # Extract the cleaned text from response
            cleaned_text = response.output_text
            
            # Save the cleaned transcript
            with open(cleaned_path, 'w', encoding='utf-8') as f:
                f.write(cleaned_text)
            
            print(f"  ✅ Cleaned transcript saved: {cleaned_filename}")
            print(f"     Preview: {cleaned_text[:100]}...\n")
            
        except Exception as e:
            print(f"  ❌ Error cleaning {filename}: {str(e)}\n")
            continue

print("\n✅ Step 3 Complete: All transcripts cleaned!")
print(f"   Cleaned transcripts saved to: {cleaned_dir}")

---

## 📊 Results Comparison

Let's compare the raw transcript from Whisper with the cleaned version from GPT to see the improvement!

### Example: Before vs. After Cleaning

**Raw Transcript (from Whisper):**
```
Um, so today we're going to, uh, talk about machine learning and, you know, 
how it's used in like different applications and stuff. So basically, um, 
machine learning is when you have algorithms that can learn from data without 
being explicitly programmed and, uh, yeah, it's really powerful. So, um, 
there are different types like supervised learning where you have labeled data 
and unsupervised learning where you don't and, uh, reinforcement learning too.
```

**Cleaned Version (from GPT):**
```
Today we'll discuss machine learning and its applications in various domains. 
Machine learning refers to algorithms that can learn from data without being 
explicitly programmed. This technology is remarkably powerful.

There are different types of machine learning:
1. Supervised learning - uses labeled data
2. Unsupervised learning - works with unlabeled data  
3. Reinforcement learning
```

### 🎯 Key Improvements:
- **Removed filler words**: "um", "uh", "like", "you know", "stuff"
- **Better structure**: Clear paragraphs and formatting
- **Professional tone**: More suitable for lecture notes
- **Enhanced readability**: Easier to follow and understand
- **Same content**: No deviation from the original meaning
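
To put a number on the first improvement, you can count filler phrases before and after cleaning. This is a minimal sketch: `count_fillers` is a hypothetical helper, and the pattern list deliberately leaves out "like" and "so", which are often legitimate words.

```python
import re

# Assumption: a small, conservative list of unambiguous fillers
FILLER_PATTERNS = [r"\bum\b", r"\buh\b", r"\byou know\b"]


def count_fillers(text: str) -> int:
    """Count occurrences of common filler phrases (case-insensitive)."""
    lowered = text.lower()
    return sum(len(re.findall(pattern, lowered)) for pattern in FILLER_PATTERNS)


raw = "Um, so today we're going to, uh, talk about machine learning and, you know, stuff."
cleaned = "Today we'll discuss machine learning and its applications."
print(f"Fillers before: {count_fillers(raw)}, after: {count_fillers(cleaned)}")
```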

### 📥 View Your Results

Let's display a before-and-after comparison of one of your actual transcripts:

In [None]:
# Get the first transcript file for comparison
# (sorted so that "first" is deterministic; os.listdir returns files in arbitrary order)
transcript_files = sorted(f for f in os.listdir(transcripts_dir) if f.endswith('_transcript.txt'))

if transcript_files:
    # Read the first raw transcript
    sample_transcript = transcript_files[0]
    transcript_path = os.path.join(transcripts_dir, sample_transcript)
    
    with open(transcript_path, 'r', encoding='utf-8') as f:
        raw_text = f.read()
    
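    # Read the corresponding cleaned transcript (if Step 3 produced one)
    cleaned_sample = sample_transcript.replace('_transcript.txt', '_cleaned.txt')
    cleaned_path = os.path.join(cleaned_dir, cleaned_sample)
    
    if os.path.exists(cleaned_path):
        with open(cleaned_path, 'r', encoding='utf-8') as f:
            cleaned_text = f.read()
    else:
        cleaned_text = "(No cleaned transcript found. Did Step 3 finish without errors?)"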
    
    print("="*80)
    print(f"📄 File: {sample_transcript}")
    print("="*80)
    print("\n🔍 RAW TRANSCRIPT (First 500 characters):")
    print("-"*80)
    print(raw_text[:500])
    print("\n" + "="*80)
    print("\n✨ CLEANED TRANSCRIPT (First 500 characters):")
    print("-"*80)
    print(cleaned_text[:500])
    print("\n" + "="*80)
else:
    print("⚠️ No transcripts found to display.")

### 💾 Download Your Cleaned Transcripts

You can download all the cleaned transcripts to your local machine:

In [None]:
from google.colab import files
import zipfile

# Create a zip file with all cleaned transcripts
zip_filename = '/content/cleaned_transcripts.zip'

with zipfile.ZipFile(zip_filename, 'w') as zipf:
    for filename in os.listdir(cleaned_dir):
        if filename.endswith('.txt'):
            file_path = os.path.join(cleaned_dir, filename)
            zipf.write(file_path, filename)

print("📦 Created zip file with all cleaned transcripts")
print("⬇️ Downloading...")
files.download(zip_filename)
print("✅ Download complete!")

---

## ⚠️ Limitations and Considerations

While this workflow is powerful and automated, it's important to understand its limitations:

### 🎤 Whisper Limitations:

1. **Audio Quality Matters**
   - Best results with clear speech and minimal background noise
   - Struggles with heavy accents, mumbling, or poor recording quality
   - Multiple overlapping speakers can cause confusion

2. **File Size Constraint**
   - **25MB file size limit** per API request
   - How much audio fits depends on the bitrate: roughly 27 minutes of MP3 at 128 kbps, or about 55 minutes at 64 kbps; longer files need to be split into chunks
   - For production use, implement audio chunking logic

3. **Language Detection**
   - While Whisper supports 98 languages, accuracy varies by language
   - English has the highest accuracy due to more training data
   - Code-switching (mixing languages) can be challenging
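
For the file size constraint above, the first step of any chunking logic is deciding where to cut. The sketch below plans chunk boundaries from the audio duration and bitrate so that each chunk stays under the limit. `max_chunk_seconds` and `plan_chunks` are hypothetical helpers, the 10% safety margin is an arbitrary choice, and the plan assumes a constant-bitrate file.

```python
import math

# Assumption: 1 MB is counted as 1024 * 1024 bytes
MAX_UPLOAD_BYTES = 25 * 1024 * 1024


def max_chunk_seconds(bitrate_kbps: int, safety_margin: float = 0.9) -> int:
    """Longest chunk (in whole seconds) that stays under the upload limit."""
    bytes_per_second = bitrate_kbps * 1000 / 8
    return int(MAX_UPLOAD_BYTES * safety_margin / bytes_per_second)


def plan_chunks(duration_seconds: float, bitrate_kbps: int = 128) -> list[tuple[float, float]]:
    """Return (start, end) times in seconds covering the whole recording."""
    chunk_length = max_chunk_seconds(bitrate_kbps)
    chunk_count = math.ceil(duration_seconds / chunk_length)
    return [
        (index * chunk_length, min((index + 1) * chunk_length, duration_seconds))
        for index in range(chunk_count)
    ]


# Example: a one-hour recording at 128 kbps
for start, end in plan_chunks(3600, bitrate_kbps=128):
    print(f"chunk from {start}s to {end}s")
```

Each `(start, end)` pair can then be cut out with a tool such as moviepy or ffmpeg before upload. Fixed-length cuts can land mid-sentence, so cutting at pauses, or stitching the chunk transcripts with some care, may give cleaner results.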

### 🤖 GPT Cleaning Limitations:

1. **Context Length**
   - Very long transcripts may exceed model context limits
   - May need to process in sections for multi-hour videos

2. **Potential Over-Editing**
   - Model might occasionally rephrase content too much
   - Important to review critical transcripts manually

3. **Domain-Specific Terms**
   - May occasionally misinterpret technical jargon or specialized terminology
   - Consider adding domain-specific instructions to the prompt
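
For the context length limitation, one option is to split a long transcript into sections and clean each one separately. The sketch below packs whole sentences into sections up to a character budget. `split_into_sections` is a hypothetical helper, and the 8,000-character default is an arbitrary value, not a documented model limit.

```python
import re


def split_into_sections(text: str, max_chars: int = 8000) -> list[str]:
    """Greedily pack whole sentences into sections of at most max_chars.

    A single sentence longer than max_chars is kept intact as its own
    section rather than being cut in the middle.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    sections = []
    current = ""
    for sentence in sentences:
        if not sentence:
            continue
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= max_chars or not current:
            current = candidate
        else:
            sections.append(current)
            current = sentence
    if current:
        sections.append(current)
    return sections


# Each section can then be sent through the same cleaning call used in
# Step 3, and the cleaned pieces joined with blank lines.
```

Cleaning sections independently means the model cannot see earlier context, so headings and terminology may be slightly inconsistent across sections.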

### 💡 Best Practices:

- **Always review critical content** - Don't rely solely on automated processing for important materials
- **Test with sample videos** first to ensure quality meets your needs
- **Use good source material** - Better input quality = better output
- **Keep original files** - Don't delete raw transcripts; they're useful for comparison
- **Iterate on prompts** - Customize the cleaning prompt for your specific use case
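
As one example of iterating on the prompt, you could append a short glossary so the model keeps specialized terms intact. The sketch below builds such a prompt from a base task description; `build_task_description` and the glossary entries are hypothetical illustrations.

```python
def build_task_description(base_task: str, glossary: dict[str, str]) -> str:
    """Append domain-specific term guidance to a base cleaning prompt."""
    if not glossary:
        return base_task
    term_lines = [f'- "{term}": {note}' for term, note in glossary.items()]
    return (
        base_task
        + "\n\nDomain-specific terms (keep these spelled exactly as shown):\n"
        + "\n".join(term_lines)
    )


# Example usage with made-up glossary entries
prompt = build_task_description(
    "Clean up this lecture transcript without changing its meaning.",
    {
        "PyTorch": "deep learning library, often misheard as 'pie torch'",
        "ReLU": "activation function, pronounced 'ray-loo'",
    },
)
print(prompt)
```

The resulting string can be used in place of `task_description` in the Step 3 cell.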

### 🎯 When This Workflow Works Best:

✅ Educational lectures with clear speakers  
✅ Podcast transcription  
✅ Interview recordings  
✅ Training video documentation  
✅ Webinar content extraction  

❌ Not ideal for: Multi-speaker debates, heavily accented speech, very low-quality audio, or content with critical legal/medical importance requiring 100% accuracy.

---

## 🎉 Congratulations!

You've successfully learned how to:
1. ✅ Extract audio from video files
2. ✅ Transcribe speech to text using Whisper
3. ✅ Clean and format transcripts using GPT-5-nano
4. ✅ Automate a process that would take days manually

### 🚀 Next Steps:

- Apply this workflow to your entire video library
- Customize the cleaning prompt for your specific needs
- Explore additional OpenAI features like:
  - Translation (Whisper can translate to English)
  - Summarization (use GPT to create summaries)
  - Q&A generation (create quiz questions from transcripts)
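
As a starting point for the translation idea, the sketch below sends an audio file to the translations endpoint, which returns English text. It assumes a `client` created as in this notebook and that the default response exposes a `text` attribute; `translate_to_english` is a hypothetical helper.

```python
def translate_to_english(client, audio_path: str) -> str:
    """Translate speech in an audio file into English text."""
    with open(audio_path, "rb") as audio_file:
        result = client.audio.translations.create(
            model="whisper-1",
            file=audio_file,
        )
    # Assumption: the default (JSON) response object exposes .text
    return result.text


# Example usage (placeholder path):
# english_text = translate_to_english(client, "/content/audio_files/lecture.mp3")
```

The same 25 MB file size limit applies, and the English output can be passed through the Step 3 cleaning prompt like any other transcript.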

### 📚 Additional Resources:

- [OpenAI Whisper Documentation](https://platform.openai.com/docs/guides/speech-to-text)
- [OpenAI GPT Documentation](https://platform.openai.com/docs/guides/text-generation)
- [MoviePy Documentation](https://zulko.github.io/moviepy/)

---

**Happy learning!** 🎓