# 🎓 University Faculty Web Scraper - Colab Edition

This notebook runs the faculty scraper using **Ollama + Qwen3-VL** entirely on Colab.

**Requirements:**
- GPU runtime (T4 or better recommended)
- ~10GB disk space for models
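
A quick pre-flight check can confirm that the runtime has room for the model before anything is downloaded. The sketch below uses only the standard library; the helper name `has_enough_disk` is hypothetical, and the 10 GB default simply mirrors the requirement listed above.

```python
import shutil


def has_enough_disk(path: str = "/", required_gb: float = 10.0) -> bool:
    """Return True if the filesystem holding `path` has at least `required_gb` GiB free."""
    free_gb = shutil.disk_usage(path).free / 1024**3
    return free_gb >= required_gb


# Report whether the default threshold is met on this runtime.
print("Enough disk space:", has_enough_disk())
```

If this prints `False`, freeing space or restarting with a fresh runtime before pulling the model is likely to save time.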

## 1️⃣ Install Ollama & Dependencies

In [None]:
# Install Ollama
!curl -fsSL https://ollama.com/install.sh | sh

# Install Python dependencies
!pip install -q playwright langgraph pydantic pydantic-settings markdownify beautifulsoup4 ollama python-dotenv

# Install Playwright browsers
!playwright install chromium
!playwright install-deps

print("✅ Dependencies installed!")


## 2️⃣ Start Ollama Server & Pull Model

In [None]:
import subprocess
import time

# Start Ollama server in background
print("🚀 Starting Ollama server...")
subprocess.Popen(
    ['ollama', 'serve'],
    stdout=subprocess.DEVNULL,
    stderr=subprocess.DEVNULL
)
time.sleep(5)

# Check if running
!curl -s http://localhost:11434/api/tags && printf "\n✅ Ollama server running!\n"
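
A fixed `time.sleep(5)` can be too short on a cold runtime and needlessly long on a warm one. As an alternative, the sketch below polls the same `/api/tags` endpoint until it answers. The helper names `server_is_up` and `wait_for_server`, and the timeout values, are assumptions for illustration and are not part of the scraper.

```python
import time
import urllib.error
import urllib.request


def server_is_up(url: str = "http://localhost:11434/api/tags") -> bool:
    """Return True if a GET request to `url` succeeds with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=2) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or similar: server not ready yet.
        return False


def wait_for_server(probe=server_is_up, timeout: float = 30.0, interval: float = 1.0) -> bool:
    """Call `probe` repeatedly until it returns True or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while True:
        if probe():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

Calling `wait_for_server()` right after `subprocess.Popen(...)` returns `True` as soon as the server responds and `False` if it never comes up within the timeout.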

In [None]:
# Pull Qwen3-VL model (this takes a few minutes)
print("📦 Pulling Qwen3-VL model (this may take 5-10 minutes)...")
!ollama pull qwen3-vl
print("✅ Model ready!")
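
The "Model ready!" message prints even if the pull failed, so it can help to confirm that the model is actually listed by the server. The sketch below parses the body returned by `/api/tags`; it assumes the response has the shape `{"models": [{"name": "qwen3-vl:latest", ...}]}`, and the helper name `model_is_available` is hypothetical.

```python
import json


def model_is_available(tags_json: str, model: str = "qwen3-vl") -> bool:
    """Check whether `model` appears in an /api/tags response body."""
    models = json.loads(tags_json).get("models", [])
    # Names usually carry a tag suffix such as ":latest", so compare the base name only.
    return any(m.get("name", "").split(":")[0] == model for m in models)
```

In a notebook, the output of `!curl -s http://localhost:11434/api/tags` can be captured into a variable, joined into one string, and passed to this function.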

## 3️⃣ Clone Repository

In [None]:
# Clone the repository (replace with your repo URL)
!git clone https://github.com/YOUR_USERNAME/instiGPT.git 2>/dev/null || echo "Repo already exists"
%cd instiGPT

# Or upload files manually and skip this cell
import os
print("📁 Working directory:", os.getcwd())

## 4️⃣ Run the Scraper

In [None]:
# Configuration
START_URL = "https://engineering.wustl.edu/faculty/index.html"
OBJECTIVE = "Scrape all engineering faculty profiles"
OUTPUT_FILE = "faculty_data.json"

In [None]:
# Run scraper using Python API
import sys
sys.path.insert(0, '.')

from scraper_app.manager import CrawlerManager

# Initialize with ollama_only backend and headless mode
crawler = CrawlerManager(
    backend_mode="ollama_only",
    headless=True,
    debug=True
)

# Run the crawl
profiles = crawler.run(
    start_url=START_URL,
    objective=OBJECTIVE,
    max_steps=30
)

# Save results
crawler.save_results(profiles, OUTPUT_FILE)
print(f"\n✅ Scraped {len(profiles)} profiles!")

In [None]:
# Alternative: Run via CLI
!python -m scraper_app.main \
    --url "https://engineering.wustl.edu/faculty/index.html" \
    --objective "Scrape all engineering faculty" \
    --backend ollama_only \
    --headless \
    --max-steps 30

## 5️⃣ View & Download Results

In [None]:
import json

# Load and display results
with open(OUTPUT_FILE) as f:
    data = json.load(f)

print(f"📊 Total profiles: {len(data)}\n")

# Show first 3 profiles
for i, profile in enumerate(data[:3]):
    print(f"--- Profile {i+1} ---")
    print(f"Name: {profile.get('name')}")
    print(f"Title: {profile.get('title')}")
    print(f"Email: {profile.get('email')}")
    print()
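
Beyond eyeballing the first three profiles, a quick completeness count shows how often the scraper failed to extract a field. This is a minimal sketch that assumes each profile is a dict with the `name`, `title`, and `email` keys used above; the helper name `count_missing_fields` is hypothetical.

```python
def count_missing_fields(profiles, fields=("name", "title", "email")):
    """Count how many profiles have a missing or empty value for each field."""
    missing = {field: 0 for field in fields}
    for profile in profiles:
        for field in fields:
            # Treat absent keys, None, and empty strings as missing.
            if not profile.get(field):
                missing[field] += 1
    return missing
```

Running `count_missing_fields(data)` on the loaded results gives one count per field; a high count for a single field suggests the extraction for that field needs attention.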

In [None]:
# Download the results file
from google.colab import files
files.download(OUTPUT_FILE)

## 🔧 Troubleshooting

**Model not loading?**
```python
!ollama list  # Check available models
!ollama pull qwen3-vl  # Re-pull model
```

**Ollama not responding?**
```python
!pkill ollama  # Kill existing process
# Then re-run the server-start cell in section 2
```

**Out of memory?**
- Reduce `max_steps` to 10-15
- Switch to a GPU runtime (T4 or better)