# Real‑Time Camera + Qwen2‑VL / LLaVA 1.6 (Template)

This notebook is a **starter template** for building a kid‑friendly, real‑time camera assistant.

## What this template does
1. Captures frames from your webcam (OpenCV).
2. (Optional) Runs **YOLOv8** for fast real‑time object detection and cropping.
3. Sends a selected frame/crop to a **Vision‑Language Model** (Qwen2‑VL or LLaVA) **on demand** (e.g., when the user presses a key).
4. Produces a short, child‑friendly answer.

## Why “on demand” inference?
Running a VLM on every frame is slow and expensive, so this template splits the work:
- YOLO runs continuously on every frame (roughly 30–60 FPS on a capable GPU).
- The VLM runs only when the child asks a question or the detected object changes.
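
The two triggers above (an explicit question, or a change in the detected object) can be sketched as a small gate that the camera loop consults each frame. This is only an illustration: `AskGate`, `should_ask`, and the 2-second minimum interval are hypothetical names and values, not part of the template.

```python
import time
from typing import Callable, Optional


class AskGate:
    """Decides when a VLM call is worthwhile (hypothetical helper)."""

    def __init__(self, min_interval_s: float = 2.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.min_interval_s = min_interval_s  # assumed rate limit for automatic triggers
        self.clock = clock
        self.last_label: Optional[str] = None
        self.last_ask_time: Optional[float] = None

    def should_ask(self, label: Optional[str], user_asked: bool) -> bool:
        """Return True if the user asked, or the detected label changed
        and the minimum interval since the last call has elapsed."""
        now = self.clock()
        changed = label is not None and label != self.last_label
        cooled_down = (self.last_ask_time is None
                       or now - self.last_ask_time >= self.min_interval_s)
        # An explicit request always wins; automatic triggers are rate-limited.
        if user_asked or (changed and cooled_down):
            if label is not None:
                self.last_label = label
            self.last_ask_time = now
            return True
        return False
```

In a loop, `label` would be the class name of the top detection (or `None` when nothing is detected) and `user_asked` would be true when the ask key was pressed.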


## 0) Environment check
If you have an NVIDIA GPU, install the appropriate PyTorch CUDA build.


In [None]:
import torch, sys, platform
print('Python:', sys.version)
print('Platform:', platform.platform())
print('Torch:', torch.__version__)
print('CUDA available:', torch.cuda.is_available())
if torch.cuda.is_available():
    print('GPU:', torch.cuda.get_device_name(0))


## 1) Install dependencies
Choose one path:
- **Path A (Local webcam loop):** OpenCV + (optional) YOLO + Transformers VLM
- **Path B (Web app):** Gradio browser UI with webcam capture


In [None]:
# Core
%pip -q install opencv-python pillow numpy

# VLM
%pip -q install transformers accelerate sentencepiece

# Optional: YOLOv8 for real-time detection/tracking
%pip -q install ultralytics

# Optional: If you want a browser-based webcam UI
%pip -q install gradio
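
After installing, it can help to confirm which versions actually ended up in the environment, since Qwen2-VL needs a fairly recent `transformers` release (check the model card for the exact minimum). The sketch below uses only the standard library; `installed_versions` and `PACKAGES` are hypothetical names introduced for illustration.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Dict, Iterable, Optional

# Distribution names as passed to pip in the install cell above.
PACKAGES = ['opencv-python', 'pillow', 'numpy', 'transformers',
            'accelerate', 'sentencepiece', 'ultralytics', 'gradio']


def installed_versions(names: Iterable[str]) -> Dict[str, Optional[str]]:
    """Map each distribution name to its installed version, or None if missing."""
    found: Dict[str, Optional[str]] = {}
    for name in names:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


for name, ver in installed_versions(PACKAGES).items():
    print(f"{name:15s} {ver or 'NOT INSTALLED'}")
```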


## 2) Pick your model
### A) Qwen2‑VL (recommended)
- Pros: strong open model family; good multimodal instruction following.
- Suggested sizes:
  - **2B** if you care about speed
  - **7B** for better quality (needs a stronger GPU)
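
To make "needs a stronger GPU" concrete, here is a rough estimate of the memory taken by the weights alone. It assumes the nominal parameter counts (2B and 7B) and 2 bytes per parameter for float16; real checkpoints differ somewhat, and activations, the KV cache, and image tokens add more on top. `weight_memory_gib` is a hypothetical helper.

```python
def weight_memory_gib(params_billions: float, bytes_per_param: int = 2) -> float:
    """Approximate memory for model weights only, in GiB."""
    return params_billions * 1e9 * bytes_per_param / 2**30


for size in (2, 7):
    fp16 = weight_memory_gib(size, bytes_per_param=2)
    fp32 = weight_memory_gib(size, bytes_per_param=4)
    print(f"{size}B: ~{fp16:.1f} GiB in float16, ~{fp32:.1f} GiB in float32")
```

By this estimate the 2B weights need roughly 3.7 GiB in float16 and the 7B weights roughly 13 GiB, which is why the larger model is a poor fit for small consumer GPUs without quantization.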

### B) LLaVA 1.6
- Also strong; similar usage patterns.

⚠️ Note: model IDs can change; always check Hugging Face for the exact repo name you want.


In [None]:
# ========= CONFIG =========
# Set ONE of these.
MODEL_FAMILY = 'qwen2-vl'  # 'qwen2-vl' or 'llava'

# Examples (edit as needed):
QWEN_MODEL_ID = 'Qwen/Qwen2-VL-2B-Instruct'   # faster
# QWEN_MODEL_ID = 'Qwen/Qwen2-VL-7B-Instruct' # higher quality

# LLaVA examples (edit as needed):
LLAVA_MODEL_ID = 'llava-hf/llava-v1.6-mistral-7b-hf'

# Child-friendly style
SYSTEM_STYLE = (
    'You are talking to a 1.5-year-old child. '
    'Use very simple words. Use 1-2 short sentences. '
    'Be warm and encouraging. No scary content.'
)


## 3) Load the VLM (Transformers)
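This cell provides **template loaders**. Both families are loaded through `AutoProcessor` and `AutoModelForVision2Seq`; depending on the exact model repo and your `transformers` version, you may need to switch to the model-specific classes named on the model card.

If you hit a loading error, check the model card for the required classes and the minimum `transformers` version for the exact model ID you chose.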


In [None]:
import torch
from PIL import Image

device = 'cuda' if torch.cuda.is_available() else 'cpu'

vlm = None
processor = None

if MODEL_FAMILY == 'qwen2-vl':
    from transformers import AutoProcessor, AutoModelForVision2Seq
    model_id = QWEN_MODEL_ID
    processor = AutoProcessor.from_pretrained(model_id)
    vlm = AutoModelForVision2Seq.from_pretrained(
        model_id,
        torch_dtype=torch.float16 if device == 'cuda' else torch.float32,
        device_map='auto' if device == 'cuda' else None,
    )
elif MODEL_FAMILY == 'llava':
    # Many LLaVA HF repos are compatible with AutoProcessor + AutoModelForVision2Seq
    from transformers import AutoProcessor, AutoModelForVision2Seq
    model_id = LLAVA_MODEL_ID
    processor = AutoProcessor.from_pretrained(model_id)
    vlm = AutoModelForVision2Seq.from_pretrained(
        model_id,
        torch_dtype=torch.float16 if device == 'cuda' else torch.float32,
        device_map='auto' if device == 'cuda' else None,
    )
else:
    raise ValueError('MODEL_FAMILY must be qwen2-vl or llava')

print('Loaded:', model_id)


## 4) VLM helper: ask about an image
You can ask:
- “What is this?”
- “What color is it?”
- “What is it used for?”

This helper keeps answers short for kids.


In [None]:
def ask_vlm(pil_image: Image.Image, user_question: str, max_new_tokens: int = 60) -> str:
    """Ask the loaded VLM a question about the given PIL image."""
    # Both Qwen2-VL and LLaVA need model-specific image placeholder tokens in the prompt.
    # The processor's chat template inserts them for us. If your repo ships no chat
    # template, build the prompt by hand following its model card.
    messages = [{
        'role': 'user',
        'content': [
            {'type': 'image'},
            {'type': 'text', 'text': f"{SYSTEM_STYLE}\nQuestion: {user_question}"},
        ],
    }]
    prompt = processor.apply_chat_template(messages, add_generation_prompt=True)

    # Move the inputs to wherever the model lives (GPU or CPU).
    inputs = processor(images=pil_image, text=prompt, return_tensors='pt').to(vlm.device)

    with torch.no_grad():
        out = vlm.generate(**inputs, max_new_tokens=max_new_tokens)

    # generate() returns prompt tokens followed by the answer; keep only the new tokens
    # so the prompt is not echoed back.
    new_tokens = out[:, inputs['input_ids'].shape[1]:]
    text = processor.batch_decode(new_tokens, skip_special_tokens=True)[0]
    return text.strip()
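
The style prompt asks for 1-2 short sentences, but models do not always comply, and decoded text can sometimes start with the prompt itself. A minimal string-level cleanup could look like the sketch below; `shorten_answer` is a hypothetical helper, and the sentence splitting is deliberately naive (it splits after `.`, `!`, or `?` followed by whitespace).

```python
import re


def shorten_answer(text: str, prompt: str = '', max_sentences: int = 2) -> str:
    """Drop an echoed prompt prefix and keep at most `max_sentences` sentences."""
    text = text.strip()
    prompt = prompt.strip()
    # Remove the prompt only if the answer literally starts with it.
    if prompt and text.startswith(prompt):
        text = text[len(prompt):].strip()
    # Naive sentence split: break on whitespace that follows ., ! or ?
    sentences = re.split(r'(?<=[.!?])\s+', text)
    return ' '.join(sentences[:max_sentences]).strip()
```

It could be applied to the return value of `ask_vlm` before printing or speaking the answer.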


## 5) Real-time camera loop (OpenCV)
Controls:
- Press **Space** to ask the model about the current frame
- Press **q** to quit

Tip: `question` is hard-coded in the loop for now; later you can set it from speech-to-text output.


In [None]:
import cv2
import numpy as np
from PIL import Image

cap = cv2.VideoCapture(0)
if not cap.isOpened():
    raise RuntimeError('Could not open webcam. Try changing the camera index: VideoCapture(1)')

print('Webcam started. Press SPACE to ask, q to quit.')

while True:
    ret, frame = cap.read()
    if not ret:
        print('Failed to grab frame')
        break

    cv2.imshow('Camera', frame)
    key = cv2.waitKey(1) & 0xFF

    if key == ord('q'):
        break

    # Ask on demand
    if key == 32:  # SPACE
        # Convert BGR (OpenCV) -> RGB (PIL)
        rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
        pil = Image.fromarray(rgb)

        # Example questions (swap this with speech input later)
        question = 'What is this?'
        # question = 'What color is it?'
        # question = 'What is it used for?'

        print('\nAsking:', question)
        answer = ask_vlm(pil, question, max_new_tokens=60)
        print('Answer:', answer)

cap.release()
cv2.destroyAllWindows()
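
The loop above always asks the same question unless you edit the code. Until speech input is available, one option is to map keys to the three example questions. The key bindings below (`1`, `2`, `3`, plus Space) are an assumption for illustration, and `question_for_key` is a hypothetical helper.

```python
from typing import Optional

# Hypothetical key bindings; Space keeps the behaviour of the original loop.
QUESTIONS = {
    ord('1'): 'What is this?',
    ord('2'): 'What color is it?',
    ord('3'): 'What is it used for?',
    32: 'What is this?',  # SPACE
}


def question_for_key(key: int) -> Optional[str]:
    """Return the question bound to a cv2.waitKey() code, or None if unbound."""
    # Mask to the low byte, as the loop does; "no key" (-1) becomes 255, which is unbound.
    return QUESTIONS.get(key & 0xFF)
```

Inside the loop, `question = question_for_key(key)` followed by `if question is not None:` would replace the `if key == 32:` check.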


## 6) Optional: YOLOv8 real-time detection + crop → VLM
This combines fast detection with on-demand VLM calls:
- YOLO detects objects on every frame.
- When the user asks, crop the most confident detection and ask the VLM about that crop.

Controls:
- Press **Space** to ask about the most confident detected object crop
- Press **q** to quit


In [None]:
from ultralytics import YOLO

# Use a lightweight model for speed
yolo = YOLO('yolov8n.pt')

cap = cv2.VideoCapture(0)
if not cap.isOpened():
    raise RuntimeError('Could not open webcam. Try changing the camera index: VideoCapture(1)')

print('YOLO webcam started. Press SPACE to ask about top detection, q to quit.')

last_crop = None

while True:
    ret, frame = cap.read()
    if not ret:
        break

    # Keep an unannotated copy so the crop sent to the VLM has no boxes or labels drawn on it
    clean = frame.copy()

    # YOLO inference
    results = yolo.predict(frame, verbose=False)
    r = results[0]

    # Draw boxes and keep the best crop
    best = None
    best_conf = -1
    if r.boxes is not None and len(r.boxes) > 0:
        for b in r.boxes:
            x1, y1, x2, y2 = map(int, b.xyxy[0].tolist())
            conf = float(b.conf[0].item())
            cls = int(b.cls[0].item())
            label = f"{r.names.get(cls, cls)} {conf:.2f}"
            cv2.rectangle(frame, (x1, y1), (x2, y2), (0, 255, 0), 2)
            cv2.putText(frame, label, (x1, max(0, y1-10)), cv2.FONT_HERSHEY_SIMPLEX, 0.6, (0,255,0), 2)

            if conf > best_conf:
                best_conf = conf
                best = (x1, y1, x2, y2, label)

    if best is not None:
        x1, y1, x2, y2, label = best
        # Crop from the clean copy, clamping negative coordinates to 0
        crop = clean[max(0,y1):max(0,y2), max(0,x1):max(0,x2)]
        if crop.size > 0:
            last_crop = (crop.copy(), label)

    cv2.imshow('YOLO Camera', frame)
    key = cv2.waitKey(1) & 0xFF

    if key == ord('q'):
        break

    if key == 32:  # SPACE
        if last_crop is None:
            print('No detection to ask about yet.')
            continue

        crop_bgr, yolo_label = last_crop
        crop_rgb = cv2.cvtColor(crop_bgr, cv2.COLOR_BGR2RGB)
        pil = Image.fromarray(crop_rgb)

        question = 'What is this?'
        # question = 'What color is it?'
        # question = 'What is it used for?'

        print('\nYOLO detected:', yolo_label)
        print('Asking:', question)
        answer = ask_vlm(pil, question, max_new_tokens=60)
        print('Answer:', answer)

cap.release()
cv2.destroyAllWindows()
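
A tight detection box can cut off context that helps the VLM (a handle, the table under a cup). One possible refinement is to pad the box by a fraction of its size and clamp it to the frame before cropping. The 10% padding and the `padded_box` helper are assumptions for illustration, not something the template prescribes.

```python
from typing import Optional, Tuple


def padded_box(x1: int, y1: int, x2: int, y2: int,
               frame_w: int, frame_h: int,
               pad_frac: float = 0.1) -> Optional[Tuple[int, int, int, int]]:
    """Grow a box by pad_frac of its width/height on each side, clamped to the frame.

    Returns None if the clamped box is empty.
    """
    pad_x = round((x2 - x1) * pad_frac)
    pad_y = round((y2 - y1) * pad_frac)
    nx1 = max(0, x1 - pad_x)
    ny1 = max(0, y1 - pad_y)
    nx2 = min(frame_w, x2 + pad_x)
    ny2 = min(frame_h, y2 + pad_y)
    if nx2 <= nx1 or ny2 <= ny1:
        return None
    return nx1, ny1, nx2, ny2
```

With a NumPy image of shape `(h, w, 3)`, the crop would then be `image[ny1:ny2, nx1:nx2]`, where `frame_w, frame_h = w, h`.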


## 7) Optional: Browser webcam UI (Gradio)
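This is useful when:
- you want to run it on a server and open it from a phone (browsers generally allow webcam access only over HTTPS or on localhost, so `demo.launch(share=True)` or an HTTPS proxy may be needed)
- you want a quick demo without dealing with OpenCV windows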


In [None]:
import gradio as gr
from PIL import Image

def respond(image: Image.Image, question: str):
    if image is None:
        return 'No image received.'
    if not question:
        question = 'What is this?'
    return ask_vlm(image, question, max_new_tokens=60)

demo = gr.Interface(
    fn=respond,
    inputs=[gr.Image(type='pil', sources=['webcam']), gr.Textbox(label='Question', value='What is this?')],
    outputs=gr.Textbox(label='Answer'),
    title='Kid-Friendly Vision Assistant (Template)',
    description='Use the webcam. Ask: What is this? What color is it? What is it used for?'
)

demo.launch()


## Next upgrades
- Add speech-to-text (Whisper) so the child can ask questions by voice.
- Add text-to-speech (Coqui TTS) to speak answers.
- Add safety filters + a kid-safe response policy.
- Add caching: if the object hasn’t changed, reuse the last answer.
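
The caching idea in the last bullet can be sketched as follows. It treats "the object hasn't changed" as "same detector class name and same question", which is a simplifying assumption (two different cups would share an answer). `AnswerCache` is a hypothetical wrapper; `ask_fn` stands in for a function with the same signature as `ask_vlm`.

```python
from typing import Callable, Dict, Tuple


class AnswerCache:
    """Reuse the last answer for a (class name, question) pair (hypothetical helper)."""

    def __init__(self, ask_fn: Callable[[object, str], str]) -> None:
        self.ask_fn = ask_fn
        self._answers: Dict[Tuple[str, str], str] = {}

    def ask(self, image: object, class_name: str, question: str) -> str:
        key = (class_name, question)
        # Only call the (slow) model when this pair has not been answered yet.
        if key not in self._answers:
            self._answers[key] = self.ask_fn(image, question)
        return self._answers[key]

    def clear(self) -> None:
        """Forget all cached answers, e.g. when the scene changes."""
        self._answers.clear()
```

Pass the bare class name as `class_name`, not a label string that includes the confidence score, since the score changes from frame to frame and would defeat the cache.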
