# 🎥 Video Scraper Pipeline - Google Colab

This notebook runs the complete video ingestion pipeline:
1. Harvests video URLs from websites
2. Downloads videos using yt-dlp + browser extraction
3. Uploads to Bunny Stream CDN
4. Tracks status in a SQLite database (persisted to Google Drive)

---

## ⚙️ Prerequisites

**Required Colab Secrets:**
- `BUNNY_API_KEY` - Get from https://panel.bunny.net/account
- `BUNNY_LIBRARY_ID` - Get from https://panel.bunny.net/stream

**How to add secrets:**
1. Click the 🔑 key icon in the left sidebar
2. Add `BUNNY_API_KEY` and `BUNNY_LIBRARY_ID`
3. Enable the "Notebook access" toggle for each secret

---

## 📁 Step 1: Mount Google Drive

Database and logs will be saved to `/content/drive/MyDrive/video_engine_data/`

In [None]:
from google.colab import drive
import os

# Mount Google Drive
drive.mount('/content/drive')

# Create data directory
DRIVE_DATA_DIR = '/content/drive/MyDrive/video_engine_data'
os.makedirs(DRIVE_DATA_DIR, exist_ok=True)

print("✅ Drive mounted successfully")
print(f"📂 Data directory: {DRIVE_DATA_DIR}")

## 🔐 Step 2: Configure Environment Variables

Load Bunny Stream credentials from Colab secrets

In [None]:
from google.colab import userdata
import os

# Load secrets
try:
    os.environ['BUNNY_API_KEY'] = userdata.get('BUNNY_API_KEY')
    os.environ['BUNNY_LIBRARY_ID'] = userdata.get('BUNNY_LIBRARY_ID')
    print("✅ Bunny Stream credentials loaded")
except Exception as e:
    print(f"❌ ERROR: Failed to load secrets - {e}")
    print("\nMake sure you've added BUNNY_API_KEY and BUNNY_LIBRARY_ID as Colab secrets!")
    raise

# Configure for Colab environment
os.environ['USE_BROWSER'] = 'true'  # Enable browser extraction
os.environ['MAX_WORKERS'] = '2'     # Limit workers for Colab RAM

print("\n⚙️ Configuration:")
print("  - Browser extraction: ENABLED")
print(f"  - Max workers: {os.environ['MAX_WORKERS']} (optimized for Colab)")
print(f"  - Database: {DRIVE_DATA_DIR}/video_tracker.db")

## 📦 Step 3: Install System Dependencies

Install Playwright and Chromium browser for Cloudflare-protected sites

In [None]:
%%bash
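# Stop on the first failing command so the success message is only printed
# when every step below actually worked
set -e
# Install Playwright system dependencies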
apt-get update -qq
apt-get install -y -qq \
    libnss3 \
    libnspr4 \
    libatk1.0-0 \
    libatk-bridge2.0-0 \
    libcups2 \
    libdrm2 \
    libxkbcommon0 \
    libxcomposite1 \
    libxdamage1 \
    libxfixes3 \
    libxrandr2 \
    libgbm1 \
    libasound2 \
    > /dev/null 2>&1

echo "✅ System dependencies installed"

## 🐍 Step 4: Install Python Dependencies

Install all required Python packages
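
Once the install cell has run, it can be useful to confirm which versions were actually resolved. The sketch below uses only the standard library; the helper name `installed_versions` is hypothetical and is not part of the `video_engine` project.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if it is missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            # The distribution is not installed in this environment
            found[name] = None
    return found
```

Calling `installed_versions(["yt-dlp", "requests", "playwright"])` returns a dictionary; any package mapped to `None` did not install and Step 4 should be rerun.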

In [None]:
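%%bash
# Quote each requirement: an unquoted ">=" is parsed by the shell as output
# redirection, which drops the version constraint and creates stray files
pip install -q \
    "yt-dlp>=2023.10.0" \
    "requests>=2.31.0" \
    "tenacity>=8.2.0" \
    "beautifulsoup4>=4.12.0" \
    "lxml>=4.9.0" \
    "playwright>=1.40.0" \
    "playwright-stealth>=0.1.0"

echo "✅ Python packages installed"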

In [None]:
# Install Playwright browsers
!playwright install chromium
print("\n✅ Chromium browser installed")

## 📤 Step 5: Upload Project Files

Upload the `video_engine` folder to Colab
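
The extraction cell expects the archive to contain a `video_engine` folder at its root. As a minimal sketch, the layout can be checked before extracting; the helper name `has_top_level_dir` is hypothetical.

```python
import zipfile


def has_top_level_dir(zip_path, dirname="video_engine"):
    """Return True if the archive has at least one entry under `dirname/`."""
    prefix = dirname + "/"
    with zipfile.ZipFile(zip_path) as zf:
        # Entry names inside a zip archive always use forward slashes
        return any(name.startswith(prefix) for name in zf.namelist())
```

If this returns `False` for the uploaded file, the zip was probably created from inside the folder rather than from its parent directory.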

In [None]:
import os
import zipfile
from google.colab import files

print("📤 Upload your video_engine.zip file")
print("   (Create zip: compress the video_engine folder)\n")

uploaded = files.upload()

# Extract the first uploaded zip file into /content
for filename in uploaded.keys():
    if filename.endswith('.zip'):
        with zipfile.ZipFile(filename, 'r') as zip_ref:
            zip_ref.extractall('/content')
        print(f"\n✅ Extracted {filename}")
        break
else:
    # The loop finished without finding a .zip file
    print("\n⚠️ No .zip file was uploaded")

# Verify extraction
if os.path.exists('/content/video_engine'):
    print("✅ Project files ready")
    !ls -la /content/video_engine
else:
    print("❌ ERROR: video_engine folder not found!")
    print("   Make sure your zip contains a 'video_engine' folder at the root")

## 📝 Step 6: Create/Upload Links File

Create a `links.txt` file with video URLs to process
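
Before handing the file to the pipeline, it may help to drop blank lines, duplicates, and anything that is not an HTTP(S) URL. This is a sketch under stated assumptions: the helper name `clean_links` is hypothetical, and treating lines that start with `#` as comments is a convention of this example, not something the pipeline is known to support.

```python
from urllib.parse import urlparse


def clean_links(text):
    """Return unique http(s) URLs from `text`, one per line, in first-seen order."""
    seen = set()
    links = []
    for line in text.splitlines():
        url = line.strip()
        # Skip blank lines and (by this sketch's convention) comment lines
        if not url or url.startswith("#"):
            continue
        parsed = urlparse(url)
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            continue
        if url not in seen:
            seen.add(url)
            links.append(url)
    return links
```

The result can be written back with `"\n".join(clean_links(raw_text))` before running Step 8.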

In [None]:
# Option A: Upload existing links.txt
from google.colab import files
print("📤 Upload your links.txt file (or skip and create manually below)\n")
uploaded = files.upload()

if 'links.txt' in uploaded:
    !mv links.txt /content/video_engine/links.txt
    print("✅ links.txt uploaded")

In [None]:
# Option B: Create links.txt manually (overwrites any links.txt uploaded in Option A)
links_content = """https://example.com/video1
https://example.com/video2
https://example.com/video3
"""

with open('/content/video_engine/links.txt', 'w') as f:
    f.write(links_content)

print("✅ links.txt created")
print("\n📝 Current links:")
!cat /content/video_engine/links.txt

## 🔍 Step 7 (Optional): Run Harvester

Auto-discover video URLs from a website using sitemap or crawling
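
To illustrate what sitemap-based discovery involves, here is a minimal sketch that pulls every `<loc>` entry out of a sitemap document using the standard library. It is not the `harvester` module's actual implementation, which is not shown in this notebook; the function name `sitemap_locs` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def sitemap_locs(xml_text):
    """Return the URL text of every <loc> element in a sitemap document."""
    root = ET.fromstring(xml_text)
    return [
        el.text.strip()
        for el in root.iter(f"{SITEMAP_NS}loc")
        if el.text and el.text.strip()
    ]
```

A real harvester would additionally need to fetch the sitemap, follow sitemap index files, and filter the results down to pages that contain videos.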

In [None]:
import sys
sys.path.append('/content/video_engine')

from harvester import harvest_and_save

# Configure target website
WEBSITE_URL = "https://example.com"  # Change this to your target site
METHOD = "auto"  # Options: 'auto', 'sitemap', 'generic'
MAX_PAGES = 10   # For generic crawling

print(f"🔍 Harvesting URLs from: {WEBSITE_URL}")
print(f"   Method: {METHOD}")
print(f"   Max pages: {MAX_PAGES}\n")

try:
    new_count = harvest_and_save(
        WEBSITE_URL,
        method=METHOD,
        max_pages=MAX_PAGES
    )
    print(f"\n✅ Added {new_count} new video URLs to database")
except Exception as e:
    print(f"❌ Harvesting failed: {e}")

## 🚀 Step 8: Run Main Pipeline

Download videos and upload to Bunny Stream

In [None]:
import sys
import traceback
sys.path.append('/content/video_engine')

from main import run_pipeline

print("🚀 Starting video pipeline...\n")
print("="*60)

try:
    run_pipeline()
    print("\n" + "="*60)
    print("✅ Pipeline completed successfully!")
    print("="*60)
except KeyboardInterrupt:
    print("\n⚠️ Pipeline stopped by user")
except Exception as e:
    print(f"\n❌ Pipeline failed: {e}")
    traceback.print_exc()

## 📊 Step 9: Monitor Progress

Check database status and view logs
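
If the `database` module is unavailable, the same kind of summary can be read straight from the SQLite file. The sketch below assumes a table named `videos` with a `status` column; both names are hypothetical, since the real schema is defined in `video_engine` and is not shown here.

```python
import sqlite3


def status_counts(conn):
    """Return {status: row_count} for a hypothetical `videos` table."""
    rows = conn.execute(
        "SELECT status, COUNT(*) FROM videos GROUP BY status ORDER BY status"
    ).fetchall()
    # Each row is a (status, count) pair, so dict() builds the mapping directly
    return dict(rows)
```

To use it against the persisted database, open a connection with `sqlite3.connect('/content/drive/MyDrive/video_engine_data/video_tracker.db')` and adjust the table and column names to match the actual schema.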

In [None]:
import sys
sys.path.append('/content/video_engine')

from database import VideoDatabase

db = VideoDatabase()

# Get status counts
stats = db.get_status_counts()

print("📊 DATABASE STATUS")
print("="*60)
for status, count in stats.items():
    print(f"  {status:12s}: {count:3d}")
print("="*60)

# Show pending URLs
pending = db.get_pending_urls(limit=5)
if pending:
    print("\n📝 Next 5 pending URLs:")
    for url in pending:
        print(f"  - {url}")
else:
    print("\n✅ No pending URLs")

In [None]:
# View recent logs
import os

log_file = '/content/drive/MyDrive/video_engine_data/pipeline.log'

if os.path.exists(log_file):
    print("📋 RECENT LOGS (last 50 lines)")
    print("="*60)
    !tail -n 50 {log_file}
else:
    print("⚠️ No log file found yet")

## 🔄 Step 10: Download Database (Optional)

Download the SQLite database for local inspection

In [None]:
import os  # needed for os.path.exists when this cell runs on its own
from google.colab import files

db_path = '/content/drive/MyDrive/video_engine_data/video_tracker.db'

if os.path.exists(db_path):
    files.download(db_path)
    print("✅ Database downloaded")
else:
    print("❌ Database file not found")

---

## 🆘 Troubleshooting

### Memory Errors
- Reduce `MAX_WORKERS` to 1: `os.environ['MAX_WORKERS'] = '1'`
- Restart the runtime: Runtime → Restart session (labelled "Restart runtime" in older Colab versions)
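
How the pipeline parses `MAX_WORKERS` is not shown in this notebook. As a hedged sketch, a defensive reader might fall back to a default when the value is missing or malformed and never return fewer than one worker; the helper name `get_max_workers` is hypothetical.

```python
import os


def get_max_workers(env=None, default=2):
    """Read MAX_WORKERS from `env`, falling back to `default` on bad input."""
    env = os.environ if env is None else env
    raw = env.get("MAX_WORKERS", "")
    try:
        value = int(raw)
    except ValueError:
        # Missing or non-numeric value: use the default
        return default
    # Never return fewer than one worker
    return max(1, value)
```

Passing a plain dictionary as `env` makes the function easy to check without touching the real environment.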

### Browser Crashes
- Install (or reinstall) Chromium if it is missing: `!playwright install chromium`
- Verify system dependencies are installed (Step 3)

### Timeout Issues
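- Free Colab runtimes disconnect after a period of inactivity and can run for at most about 12 hours, depending on availability and usage
- For long-running jobs, consider upgrading to Colab Pro
- The database persists in Google Drive, so you can resume after reconnecting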

### Missing Secrets
- Click 🔑 in the left sidebar
- Add `BUNNY_API_KEY` and `BUNNY_LIBRARY_ID`
- Enable "Notebook access" toggle
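
A quick way to see which credentials are still missing after Step 2 is to inspect the environment directly. This is a minimal sketch; the helper name `missing_secrets` is hypothetical.

```python
import os

REQUIRED_SECRETS = ("BUNNY_API_KEY", "BUNNY_LIBRARY_ID")


def missing_secrets(env=None, required=REQUIRED_SECRETS):
    """Return the names in `required` that are unset or blank in `env`."""
    env = os.environ if env is None else env
    return [name for name in required if not (env.get(name) or "").strip()]
```

An empty list means both values are present; otherwise the returned names are the secrets that still need to be added or granted notebook access.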

---

## 📚 Resources

- [Bunny Stream Dashboard](https://panel.bunny.net/stream)
- [Project Documentation](https://github.com/yourusername/video-scraper)
- [yt-dlp Documentation](https://github.com/yt-dlp/yt-dlp)

---