# Image-to-Scene AI Pipeline - Google Colab Test

This notebook tests the full AI pipeline (Depth + Segmentation + VLM) on a free Google Colab GPU.

**Hardware:** T4 GPU (16GB VRAM) - Free tier

**Expected time:**
- First run: ~25-30 min (downloads ~10GB of models + checkpoints)
  - Depth Pro checkpoint: ~1.8GB
  - SAM2 weights: ~150MB
  - Qwen3-VL (4-bit): ~5GB
  - OneFormer: ~500MB
- Subsequent runs: ~30s startup + 15-25s per image
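
Because the first run pulls roughly 10GB of weights, it can help to confirm the runtime has enough free disk before starting. The sketch below uses only the standard library; `has_room_for_models` is a hypothetical helper name, and the 10GB figure is taken from the estimate above rather than measured.

```python
import shutil

def has_room_for_models(required_gb, path="."):
    """Return True if `path` has at least `required_gb` gigabytes free."""
    free_gb = shutil.disk_usage(path).free / 1024**3
    return free_gb >= required_gb

# On Colab you would typically pass path="/content"; "." keeps the sketch portable.
if has_room_for_models(10.0):
    print("Enough free disk for the first-run downloads")
else:
    print("Low disk space: free some room before downloading models")
```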

## Step 1: Setup Environment

In [None]:
# Check GPU
!nvidia-smi

In [None]:
# Clone repository
!git clone https://github.com/sslinkyy/Lofi-Station.git
%cd Lofi-Station/image-to-scene

In [None]:
# Install PyTorch with CUDA
!pip install torch torchvision --index-url https://download.pytorch.org/whl/cu118

In [None]:
# Install dependencies
%cd ai_worker
!pip install -r requirements.txt
!pip install jupyter ipywidgets  # For interactive widgets

## Step 2: Install Model Dependencies

In [None]:
# Install Depth Pro
%cd /content/Lofi-Station/image-to-scene
!git clone https://github.com/apple/ml-depth-pro external/depth-pro
%cd external/depth-pro
!pip install -e .

In [None]:
# Install SAM2
%cd /content/Lofi-Station/image-to-scene
!git clone https://github.com/facebookresearch/segment-anything-2 external/sam2
%cd external/sam2
!pip install -e .

## Step 3: Test Pipeline Components

In [None]:
import sys
sys.path.insert(0, '/content/Lofi-Station/image-to-scene/ai_worker')

import torch
from PIL import Image
import matplotlib.pyplot as plt
import numpy as np
import json

print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"VRAM: {torch.cuda.get_device_properties(0).total_memory / 1024**3:.1f} GB")

### Test 1: Depth Estimation

In [None]:
from models import DepthModelFactory
from services import DepthService

device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")

# Load Depth Pro
# Note: First run will download ~1.8GB checkpoint from HuggingFace (takes 3-5 min)
print("Loading Depth Pro...")
print("(First run: downloading checkpoint ~1.8GB, please wait...)")
depth_model = DepthModelFactory.create("depth-pro", device=device)
depth_service = DepthService(depth_model, cache_dir="/content/cache/depth", enable_cache=True)
print("✓ Depth Pro loaded")

In [None]:
# Upload test image
from google.colab import files

print("Upload a test image:")
uploaded = files.upload()
test_image_path = list(uploaded.keys())[0]

# Load image
test_image = Image.open(test_image_path).convert('RGB')
print(f"Image loaded: {test_image.size}")

# Display
plt.figure(figsize=(12, 4))
plt.subplot(1, 1, 1)
plt.imshow(test_image)
plt.title("Original Image")
plt.axis('off')
plt.tight_layout()
plt.show()

In [None]:
# Estimate depth
# Notebooks (IPython) support top-level await, so no asyncio.run() wrapper is needed
print("Estimating depth...")
depth_result = await depth_service.estimate_depth(test_image, return_confidence=True)

depth_map = depth_result["depth"]
confidence = depth_result.get("confidence")

print("✓ Depth estimated:")
print(f"  Range: {depth_result['min_depth']:.2f} - {depth_result['max_depth']:.2f} m")
print(f"  Mean: {depth_result['mean_depth']:.2f} m")

# Visualize
plt.figure(figsize=(15, 5))

plt.subplot(1, 3, 1)
plt.imshow(test_image)
plt.title("Original")
plt.axis('off')

plt.subplot(1, 3, 2)
plt.imshow(depth_map, cmap='plasma')
plt.colorbar(label='Depth (m)')
plt.title("Depth Map")
plt.axis('off')

if confidence is not None:
    plt.subplot(1, 3, 3)
    plt.imshow(confidence, cmap='viridis')
    plt.colorbar(label='Confidence')
    plt.title("Confidence")
    plt.axis('off')

plt.tight_layout()
plt.show()

### Test 2: Segmentation

In [None]:
from models import SegmentationFactory
from models.segmentation_model import SegmentationService

# Load SAM2 (may take a few minutes first time)
print("Loading SAM2...")
sam2_model = SegmentationFactory.create("sam2-base", device=device)
print("✓ SAM2 loaded")

# Load OneFormer
print("Loading OneFormer...")
oneformer_model = SegmentationFactory.create("oneformer-semantic", device=device)
print("✓ OneFormer loaded")

# Create service
segmentation_service = SegmentationService(sam2_model, oneformer_model)

In [None]:
# Run segmentation
print("Running SAM2 automatic segmentation...")
masks_data = segmentation_service.segment_scene(test_image, mode="auto")

sam2_masks = masks_data.get("masks", [])
print(f"✓ Generated {len(sam2_masks)} masks")

# Sort by area (largest first) and keep the top 9
sorted_masks = sorted(sam2_masks, key=lambda m: m.get("area", 0), reverse=True)[:9]

# Visualize top 9 masks
fig, axes = plt.subplots(3, 3, figsize=(15, 15))
# Hide every axis up front so panels stay blank when fewer than 9 masks exist
for ax in axes.flat:
    ax.axis('off')
for i, (ax, mask) in enumerate(zip(axes.flat, sorted_masks)):
    ax.imshow(test_image)
    ax.imshow(mask["segmentation"], alpha=0.5, cmap='jet')
    ax.set_title(f"Mask {i+1} (IoU: {mask.get('predicted_iou', 0):.2f})")

plt.tight_layout()
plt.show()
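
The cell above keeps the nine largest masks regardless of quality. A small variation, sketched below, also drops low-confidence masks before sorting. It assumes each mask is a dict with the `area` and `predicted_iou` keys used above; `top_masks` and the `min_iou` threshold are hypothetical names chosen for illustration, not part of the repository.

```python
def top_masks(masks, k=9, min_iou=0.0):
    """Return the k largest masks whose predicted IoU is at least min_iou."""
    kept = [m for m in masks if m.get("predicted_iou", 0.0) >= min_iou]
    return sorted(kept, key=lambda m: m.get("area", 0), reverse=True)[:k]

# Toy records standing in for SAM2 output (real masks also carry a "segmentation" array)
demo = [
    {"area": 500, "predicted_iou": 0.95},
    {"area": 9000, "predicted_iou": 0.40},
    {"area": 1200, "predicted_iou": 0.88},
]
best = top_masks(demo, k=2, min_iou=0.8)
print([m["area"] for m in best])
```

In the notebook, `top_masks(sam2_masks, k=9, min_iou=0.8)` could stand in for the plain `sorted(...)[:9]` call if large but unreliable masks clutter the grid.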

In [None]:
# Extract room structure
print("Extracting room structure with OneFormer...")
room_data = segmentation_service.segment_scene(test_image, mode="room_structure")

room_structure = room_data.get("room_structure", {})
print(f"✓ Found {len(room_structure)} room elements:")
for element in room_structure.keys():
    print(f"  - {element}")

# Visualize room structure
fig, axes = plt.subplots(2, 3, figsize=(15, 10))
axes = axes.flatten()

# Hide every axis up front so unused panels stay blank
for ax in axes:
    ax.axis('off')

axes[0].imshow(test_image)
axes[0].set_title("Original")

# Show at most 5 elements (panel 0 holds the original image)
for i, (element, mask) in enumerate(list(room_structure.items())[:5], start=1):
    axes[i].imshow(test_image)
    axes[i].imshow(mask, alpha=0.6, cmap='Reds')
    axes[i].set_title(element.capitalize())

plt.tight_layout()
plt.show()

### Test 3: VLM Scene Graph

In [None]:
from models import VLMFactory

# Load Qwen3-VL (will download ~8GB first time)
print("Loading Qwen3-VL... (may take 10-15 min first time)")
vlm_model = VLMFactory.create("qwen3-vl-8b", device=device, quantization="4bit")
print("✓ Qwen3-VL loaded")

In [None]:
# Generate scene graph
print("Generating scene graph...")

# Prepare depth map for VLM
depth_normalized = depth_service.normalize_depth(depth_map, 0, 255)
depth_pil = Image.fromarray(depth_normalized.astype('uint8'))

# Generate
scene_graph = vlm_model.generate_scene_graph(
    image=test_image,
    depth_map=depth_pil,
    masks={},  # Simplified for this test
    style_preset="lofi"
)

print(f"✓ Scene graph generated")
print(f"  Objects: {len(scene_graph.get('objects', []))}")
print(f"  Materials: {len(scene_graph.get('materials', []))}")
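
The `normalize_depth(depth_map, 0, 255)` call above rescales metric depth into an 8-bit range before it is handed to the VLM. Its implementation is not shown in this notebook, so the following is only a dependency-free sketch of min-max rescaling, assuming that is what the service does; `minmax_rescale` is a hypothetical name and it works on nested lists rather than arrays.

```python
def minmax_rescale(depth_rows, out_min=0.0, out_max=255.0):
    """Linearly rescale a 2D list of depth values into [out_min, out_max]."""
    values = [v for row in depth_rows for v in row]
    if not values:
        return []
    lo, hi = min(values), max(values)
    span = hi - lo
    if span == 0:
        # Flat depth map: there is no contrast to preserve, so use out_min everywhere
        return [[out_min for _ in row] for row in depth_rows]
    scale = (out_max - out_min) / span
    return [[out_min + (v - lo) * scale for v in row] for row in depth_rows]

# Depths in metres; the nearest point maps to 0 and the farthest to 255
scaled = minmax_rescale([[1.0, 2.0], [3.0, 5.0]])
print(scaled)
```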

In [None]:
# Pretty print scene graph
print("\n" + "="*60)
print("SCENE GRAPH")
print("="*60)
print(json.dumps(scene_graph, indent=2))

In [None]:
# Save scene graph
output_path = "/content/scene_graph.json"
with open(output_path, 'w') as f:
    json.dump(scene_graph, f, indent=2)

print(f"✓ Scene graph saved to: {output_path}")

# Download
from google.colab import files
files.download(output_path)
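
Before taking the file into Blender, a quick structural check can catch a malformed graph early. The sketch below assumes only what the cells above rely on, namely that `objects` and `materials` are lists at the top level; `summarize_scene_graph` is a hypothetical helper and the `name` field in the demo data is made up for illustration.

```python
import json

def summarize_scene_graph(scene_graph):
    """Count entries under 'objects' and 'materials', rejecting malformed graphs."""
    if not isinstance(scene_graph, dict):
        raise TypeError("scene graph must be a JSON object")
    summary = {}
    for key in ("objects", "materials"):
        entries = scene_graph.get(key, [])
        if not isinstance(entries, list):
            raise ValueError(f"'{key}' must be a list, got {type(entries).__name__}")
        summary[key] = len(entries)
    return summary

# Round-trip through JSON text, the way the saved file would be read back
demo_text = json.dumps({"objects": [{"name": "desk"}, {"name": "lamp"}], "materials": []})
summary = summarize_scene_graph(json.loads(demo_text))
print(summary)
```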

## Step 4: Full Pipeline Test

In [None]:
# Test full pipeline end-to-end
import time

print("Running full pipeline...")
print("=" * 60)

start_time = time.time()

# Step 1: Depth
# If the depth cache (enable_cache=True) already holds this image,
# this stage may return almost instantly
print("[1/3] Depth estimation...")
t1 = time.time()
depth_result = await depth_service.estimate_depth(test_image, return_confidence=True)
# Rebuild the VLM depth input from this run's result instead of reusing Test 3's
depth_normalized = depth_service.normalize_depth(depth_result["depth"], 0, 255)
depth_pil = Image.fromarray(depth_normalized.astype('uint8'))
depth_time = time.time() - t1
print(f"  ✓ Complete ({depth_time:.2f}s)")

# Step 2: Segmentation
print("[2/3] Segmentation...")
t2 = time.time()
masks_data = segmentation_service.segment_scene(test_image, mode="auto")
room_data = segmentation_service.segment_scene(test_image, mode="room_structure")
seg_time = time.time() - t2
print(f"  ✓ Complete ({seg_time:.2f}s)")

# Step 3: VLM
print("[3/3] VLM scene graph...")
t3 = time.time()
scene_graph = vlm_model.generate_scene_graph(
    image=test_image,
    depth_map=depth_pil,
    masks={},
    style_preset="lofi"
)
vlm_time = time.time() - t3
print(f"  ✓ Complete ({vlm_time:.2f}s)")

# Report the durations captured per stage, not differences against "now"
total_time = time.time() - start_time
print("=" * 60)
print(f"✓ Full pipeline complete in {total_time:.2f}s")
print(f"  Depth: {depth_time:.2f}s")
print(f"  Segmentation: {seg_time:.2f}s")
print(f"  VLM: {vlm_time:.2f}s")
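
One way to keep per-stage timings separate without juggling several timestamp variables is a small context manager that records each stage's duration as it finishes. This is a minimal sketch using only the standard library; `timed` is a hypothetical helper, and the `time.sleep` calls stand in for the real pipeline stages.

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label, timings):
    """Store the wall-clock duration of the enclosed block under timings[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Recorded even if the stage raises, so partial runs still report timings
        timings[label] = time.perf_counter() - start

timings = {}
with timed("depth", timings):
    time.sleep(0.01)  # stand-in for depth estimation
with timed("segmentation", timings):
    time.sleep(0.02)  # stand-in for segmentation

for label, seconds in timings.items():
    print(f"  {label}: {seconds:.2f}s")
```

Each duration is fixed when its block exits, so printing the summary later cannot accidentally fold a later stage's time into an earlier one.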

## Success!

If you got here without errors, the full AI pipeline is working!

**Next steps:**
1. Download the `scene_graph.json` file
2. Open Blender
3. Install the Image-to-Scene add-on
4. Use "Build from JSON (Test)" to build the scene

**Performance on Colab T4:**
- Expected total time: 15-25 seconds per image
- Depth: < 1s
- Segmentation: 5-10s
- VLM: 10-15s