# 🎬 Interactive Video Scraper - Google Colab

This notebook provides an **interactive interface** to scrape and upload videos to Bunny Stream.

---

## ⚙️ Prerequisites

**Required Colab Secrets:**
- `BUNNY_API_KEY` - Get from https://panel.bunny.net/account → API
- `BUNNY_LIBRARY_ID` - Get from https://panel.bunny.net/stream

**How to add secrets:**
1. Click the 🔑 key icon in the left sidebar
2. Add both secrets
3. Enable "Notebook access" toggle
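
Once both secrets are saved, a quick check can confirm that the notebook can actually read them before a later step depends on them. The sketch below is illustrative only: `find_missing_secrets` is a hypothetical helper, not part of the project, and the lookup function is passed in as an argument so that `userdata.get` can be supplied when running in Colab.

```python
from typing import Callable, Iterable, List, Optional

REQUIRED_SECRETS = ("BUNNY_API_KEY", "BUNNY_LIBRARY_ID")


def find_missing_secrets(
    lookup: Callable[[str], Optional[str]],
    names: Iterable[str] = REQUIRED_SECRETS,
) -> List[str]:
    """Return the secret names whose lookup fails or yields an empty value."""
    missing = []
    for name in names:
        try:
            value = lookup(name)
        except Exception:
            # Assumption: the lookup raises when a secret is absent or
            # notebook access is disabled; treat that the same as "missing".
            value = None
        if not value:
            missing.append(name)
    return missing


# Stand-in lookup for illustration; in Colab, pass userdata.get instead.
fake_store = {"BUNNY_API_KEY": "abc123"}
print(find_missing_secrets(fake_store.get))
```

An empty list means both secrets were readable.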

---

## 📁 Step 1: Mount Google Drive

Database and logs will be saved to Google Drive for persistence
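
As a rough illustration of what gets persisted, the sketch below derives the database and log locations from a single data directory. The file names `video_tracker.db` and `pipeline.log` are the ones this notebook refers to in later steps; `build_data_paths` itself is a hypothetical helper, and a temporary directory stands in for the Drive mount.

```python
import os
import tempfile


def build_data_paths(data_dir: str) -> dict:
    """Create the data directory if needed and return the files kept in it."""
    os.makedirs(data_dir, exist_ok=True)
    return {
        "database": os.path.join(data_dir, "video_tracker.db"),
        "log": os.path.join(data_dir, "pipeline.log"),
    }


# A temporary directory stands in for /content/drive/MyDrive here.
demo_dir = os.path.join(tempfile.mkdtemp(), "video_engine_data")
paths = build_data_paths(demo_dir)
print(paths["database"])
print(paths["log"])
```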

In [None]:
from google.colab import drive
import os

# Mount Google Drive
drive.mount('/content/drive')

# Create data directory
DRIVE_DATA_DIR = '/content/drive/MyDrive/video_engine_data'
os.makedirs(DRIVE_DATA_DIR, exist_ok=True)

print("✅ Drive mounted successfully")
print(f"📂 Data directory: {DRIVE_DATA_DIR}")

## 🔐 Step 2: Configure Environment Variables

Load Bunny Stream credentials from Colab secrets

In [None]:
from google.colab import userdata
import os

# Load secrets
try:
    os.environ['BUNNY_API_KEY'] = userdata.get('BUNNY_API_KEY')
    os.environ['BUNNY_LIBRARY_ID'] = userdata.get('BUNNY_LIBRARY_ID')
    print("✅ Bunny Stream credentials loaded")
except Exception as e:
    print(f"❌ ERROR: Failed to load secrets - {e}")
    print("\nMake sure you've added BUNNY_API_KEY and BUNNY_LIBRARY_ID as Colab secrets!")
    raise

# Configure for Colab environment
os.environ['USE_BROWSER'] = 'true'  # Enable browser extraction
os.environ['MAX_WORKERS'] = '2'     # Limit workers for Colab RAM

print("\n⚙️ Configuration:")
print("  - Browser extraction: ENABLED")
print(f"  - Max workers: {os.environ['MAX_WORKERS']} (optimized for Colab)")
print(f"  - Database: {DRIVE_DATA_DIR}/video_tracker.db")

## 📦 Step 3: Install System Dependencies

Install Playwright and Chromium browser

In [None]:
%%bash
# Install Playwright system dependencies (silent)
apt-get update -qq
apt-get install -y -qq \
    libnss3 libnspr4 libatk1.0-0 libatk-bridge2.0-0 \
    libcups2 libdrm2 libxkbcommon0 libxcomposite1 \
    libxdamage1 libxfixes3 libxrandr2 libgbm1 libasound2 \
    > /dev/null 2>&1

echo "✅ System dependencies installed"

## 🐍 Step 4: Install Python Dependencies

Install all required Python packages
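
After installation, it can help to confirm that the packages are importable before moving on. The following is a minimal sketch, assuming the usual import names for these distributions (`yt_dlp`, `bs4`, `dotenv`, and so on); `find_missing_modules` is a hypothetical helper and not part of the project.

```python
import importlib.util


def find_missing_modules(module_names):
    """Return the top-level module names that cannot be located for import."""
    return [name for name in module_names if importlib.util.find_spec(name) is None]


# Import names assumed to correspond to the packages installed in this step.
EXPECTED_MODULES = [
    "yt_dlp", "requests", "tenacity", "bs4",
    "lxml", "playwright", "playwright_stealth", "dotenv",
]

print("Missing modules:", find_missing_modules(EXPECTED_MODULES))
```

An empty list suggests every package installed correctly; anything listed is worth reinstalling before continuing.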

In [None]:
%%bash
# Quote each requirement so the shell does not treat ">" as output redirection
pip install -q \
    "yt-dlp>=2023.10.0" \
    "requests>=2.31.0" \
    "tenacity>=8.2.0" \
    "beautifulsoup4>=4.12.0" \
    "lxml>=4.9.0" \
    "playwright>=1.40.0" \
    "playwright-stealth>=0.1.0" \
    "python-dotenv>=1.0.0"

echo "✅ Python packages installed"

In [None]:
# Install Playwright browsers
!playwright install chromium
print("\n✅ Chromium browser installed")

## 📦 Step 5: Clone Project from GitHub

Clone the latest version from your repository

In [None]:
import os

# GitHub repository URL
REPO_URL = "https://github.com/rkpcode/auto_web_scraper.git"

print("📦 Cloning repository from GitHub...")
print(f"   Repository: {REPO_URL}\n")

# Remove existing directory if present
if os.path.exists('/content/auto_web_scraper'):
    !rm -rf /content/auto_web_scraper
    print("🗑️  Removed existing directory\n")

# Clone the repository
!git clone {REPO_URL} /content/auto_web_scraper

# Verify clone
if os.path.exists('/content/auto_web_scraper/video_engine'):
    print("\n✅ Repository cloned successfully!")
    print("\n📂 Project structure:")
    !ls -la /content/auto_web_scraper/video_engine
else:
    print("❌ ERROR: video_engine folder not found in repository!")
    print("   Please check the repository structure")

## 🔧 Step 6: Database Cleanup (Optional)

Run this if you encounter 'database is locked' errors
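
For background: SQLite raises "database is locked" when a connection cannot obtain a lock within its busy timeout, typically because another connection is still writing. The sketch below shows one general mitigation, opening the connection with a longer timeout. It is not a description of what `cleanup_db.py` does; the helper name, the table layout, and the temporary file are all assumptions made for illustration.

```python
import os
import sqlite3
import tempfile


def open_with_busy_timeout(db_path, timeout_seconds=30.0):
    """Open SQLite so it waits up to timeout_seconds for a competing lock.

    Python's default for this timeout is 5 seconds.
    """
    return sqlite3.connect(db_path, timeout=timeout_seconds)


# Demo against a throwaway file; the schema here is hypothetical.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.db")
conn = open_with_busy_timeout(demo_path)
conn.execute("CREATE TABLE IF NOT EXISTS videos (url TEXT PRIMARY KEY, status TEXT)")
conn.commit()
print("Rows:", conn.execute("SELECT COUNT(*) FROM videos").fetchone()[0])
conn.close()
```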

In [None]:
import sys
sys.path.append('/content/auto_web_scraper/video_engine')

# Run cleanup
!python /content/auto_web_scraper/video_engine/cleanup_db.py

---

# 🎬 INTERACTIVE PIPELINE

Choose one of the following modes:

---

## 🌐 Mode 1: Auto-Discovery (Recommended)

Enter a website URL and automatically discover all video URLs
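
To give a feel for what sitemap-based discovery involves, here is a minimal sketch that pulls page URLs out of a standard sitemap document. It is not the project's `harvester` implementation; `extract_sitemap_urls` and the sample XML are hypothetical, and the example relies only on the public sitemap namespace.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def extract_sitemap_urls(sitemap_xml: str) -> list:
    """Return the text of every <loc> element in a sitemap document."""
    root = ET.fromstring(sitemap_xml)
    return [
        loc.text.strip()
        for loc in root.findall(".//sm:loc", SITEMAP_NS)
        if loc.text
    ]


sample = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/video-a</loc></url>
  <url><loc>https://example.com/video-b</loc></url>
</urlset>"""

print(extract_sitemap_urls(sample))
```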

In [None]:
import sys
sys.path.append('/content/auto_web_scraper/video_engine')

from harvester import harvest_and_save
from database import db

# ============================================
# CONFIGURE HERE
# ============================================
WEBSITE_URL = "https://viralkand.com"  # Change this
METHOD = "auto"  # Options: 'auto', 'sitemap', 'generic'
MAX_PAGES = 20   # For generic crawling
# ============================================

print(f"🔍 Discovering videos from: {WEBSITE_URL}")
print(f"   Method: {METHOD}")
print(f"   Max pages: {MAX_PAGES}\n")

try:
    new_count = harvest_and_save(WEBSITE_URL, method=METHOD, max_pages=MAX_PAGES)
    print(f"\n✅ Discovered {new_count} new video URLs")

    # Show stats
    stats = db.get_stats()
    print("\n📊 Database status:")
    for status, count in stats.items():
        print(f"   {status:12s}: {count:3d}")
except Exception as e:
    print(f"\n❌ Discovery failed: {e}")

## ✍️ Mode 2: Manual URL Entry

Add specific video URLs manually
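
The cell below treats a falsy return from `db.insert_video` as "already exists". One common way to get that behavior with SQLite is a unique key plus `INSERT OR IGNORE`, sketched here under assumptions: the table layout, the `PENDING` status value, and `insert_url_once` are hypothetical and may differ from the project's `database` module.

```python
import sqlite3


def insert_url_once(conn, url):
    """Insert url if it is new; return True only when a row was added."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO videos (url, status) VALUES (?, 'PENDING')",
        (url,),
    )
    conn.commit()
    # rowcount is 0 when the unique key caused the insert to be ignored.
    return cur.rowcount == 1


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE videos (url TEXT PRIMARY KEY, status TEXT NOT NULL)")

print(insert_url_once(conn, "https://example.com/v1"))  # first insert
print(insert_url_once(conn, "https://example.com/v1"))  # duplicate
```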

In [None]:
import sys
sys.path.append('/content/auto_web_scraper/video_engine')

from database import db

# ============================================
# ADD YOUR VIDEO URLs HERE
# ============================================
video_urls = [
    "https://viralkand.com/video1",
    "https://viralkand.com/video2",
    "https://viralkand.com/video3",
]
# ============================================

print(f"✍️  Adding {len(video_urls)} URLs to database...\n")

added = 0
for url in video_urls:
    if db.insert_video(url):
        print(f"  ✅ {url}")
        added += 1
    else:
        print(f"  ⚠️  {url} (already exists)")

print(f"\n✅ Added {added} new URLs")

# Show stats
stats = db.get_stats()
print("\n📊 Database status:")
for status, count in stats.items():
    print(f"   {status:12s}: {count:3d}")

## 📂 Mode 3: Load from File

Upload and load URLs from a text file
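
The expected file format is one URL per line, with blank lines and `#` comment lines skipped. A minimal sketch of that parsing rule, using a hypothetical `parse_url_lines` helper that works on text rather than an uploaded file:

```python
def parse_url_lines(text):
    """Return non-empty, non-comment lines with surrounding whitespace removed."""
    urls = []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if line and not line.startswith("#"):
            urls.append(line)
    return urls


sample_file = """# my video list
https://example.com/video1

   https://example.com/video2
  # indented comment
"""

print(parse_url_lines(sample_file))
```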

In [None]:
from google.colab import files
import sys
sys.path.append('/content/auto_web_scraper/video_engine')

from database import db

print("📤 Upload your links.txt file\n")
uploaded = files.upload()

if 'links.txt' in uploaded:
    # Read URLs, skipping blank lines and '#' comments (even when indented)
    with open('links.txt', 'r') as f:
        urls = [line.strip() for line in f if line.strip() and not line.strip().startswith('#')]

    print(f"\n📂 Found {len(urls)} URLs in file\n")

    # Add to database
    added = 0
    for url in urls:
        if db.insert_video(url):
            added += 1

    print(f"\n✅ Added {added} new URLs")

    # Show stats
    stats = db.get_stats()
    print("\n📊 Database status:")
    for status, count in stats.items():
        print(f"   {status:12s}: {count:3d}")
else:
    print("❌ No file named links.txt was uploaded")

---

## 🚀 Step 7: Run Pipeline

Process all pending videos (download + upload to Bunny Stream)

**This will process ALL pending URLs in the database**
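
The `MAX_WORKERS` value set in Step 2 caps how many videos are handled at once. How the project's `main` applies it is not shown in this notebook; the sketch below is only one plausible pattern, a bounded thread pool, with `process_all` and `fake_handler` as hypothetical stand-ins for the real download and upload work.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def process_all(urls, handler, default_workers=2):
    """Run handler over urls with at most MAX_WORKERS concurrent threads."""
    workers = int(os.environ.get("MAX_WORKERS", default_workers))
    with ThreadPoolExecutor(max_workers=max(1, workers)) as pool:
        # pool.map returns results in the same order as the input URLs.
        return list(pool.map(handler, urls))


def fake_handler(url):
    # Placeholder for the real download-and-upload step.
    return f"done:{url}"


print(process_all(["a", "b", "c"], fake_handler))
```

A lower worker count trades throughput for a smaller memory footprint, which is the reason for the limit of 2 on Colab.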

In [None]:
import sys
sys.path.append('/content/auto_web_scraper/video_engine')

from main import main as run_pipeline

print("🚀 Starting video pipeline...\n")
print("=" * 60)

try:
    run_pipeline()
    print("\n" + "=" * 60)
    print("✅ Pipeline completed successfully!")
    print("=" * 60)
except KeyboardInterrupt:
    print("\n⚠️ Pipeline stopped by user")
except Exception as e:
    print(f"\n❌ Pipeline failed: {e}")
    import traceback
    traceback.print_exc()

# Printed after every run, including failed or interrupted ones
print("\n" + "=" * 60)
print("Pipeline run finished.")
print("=" * 60)
print("Check your Bunny Stream dashboard for uploaded videos.")

---

## 📊 Step 8: Monitor Progress

Check database status and view logs
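
The log cell in this step shells out to `tail`. For cases where a pure-Python equivalent is handier, for example to filter the lines afterwards, here is a small sketch. `tail_lines` is a hypothetical helper, and the demo writes its own throwaway log file rather than reading the real `pipeline.log`.

```python
import os
import tempfile
from collections import deque


def tail_lines(path, n=50):
    """Return the last n lines of a text file without their newlines."""
    with open(path, "r", encoding="utf-8", errors="replace") as f:
        # A bounded deque keeps only the most recent n lines while reading.
        return [line.rstrip("\n") for line in deque(f, maxlen=n)]


# Demo with a generated 100-line file.
demo_log = os.path.join(tempfile.mkdtemp(), "pipeline.log")
with open(demo_log, "w", encoding="utf-8") as f:
    f.write("\n".join(f"line {i}" for i in range(1, 101)) + "\n")

print(tail_lines(demo_log, n=3))
```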

In [None]:
import sys
sys.path.append('/content/auto_web_scraper/video_engine')

from database import db

# Get status counts
stats = db.get_stats()

print("📊 DATABASE STATUS")
print("=" * 60)
for status, count in stats.items():
    print(f"  {status:12s}: {count:3d}")
print("=" * 60)

# Show pending URLs
pending = db.get_pending_urls()
if pending:
    print(f"\n📝 Pending URLs ({len(pending)}):")
    for i, url in enumerate(pending[:5], 1):
        print(f"  {i}. {url}")
    if len(pending) > 5:
        print(f"  ... and {len(pending) - 5} more")
else:
    print("\n✅ No pending URLs")

In [None]:
# View recent logs
import os

log_file = '/content/drive/MyDrive/video_engine_data/pipeline.log'

if os.path.exists(log_file):
    print("📋 RECENT LOGS (last 50 lines)")
    print("=" * 60)
    !tail -n 50 {log_file}
else:
    print("⚠️ No log file found yet")

---

## 🔄 Step 9: Download Database (Optional)

Download the SQLite database for local inspection

In [None]:
from google.colab import files
import os

db_path = '/content/drive/MyDrive/video_engine_data/video_tracker.db'

if os.path.exists(db_path):
    files.download(db_path)
    print("✅ Database downloaded")
else:
    print("❌ Database file not found")

---

## 🆘 Troubleshooting

### Memory Errors
- Reduce `MAX_WORKERS` to 1 in Step 2
- Restart runtime: Runtime → Restart runtime

### Database Locked
- Re-run Step 6 (Database Cleanup)
- Restart runtime

### Upload Failures
- Check Colab Secrets (🔑 icon)
- Verify API key is correct
- Check Bunny Stream dashboard for quota limits

---

## 📚 Resources

- [Bunny Stream Dashboard](https://panel.bunny.net/stream)
- [GitHub Repository](https://github.com/rkpcode/auto_web_scraper)
- [yt-dlp Documentation](https://github.com/yt-dlp/yt-dlp)

---