# Agentic Object Detection Demo

This notebook demonstrates a prototype of an agentic object detection workflow. It uses a multimodal Large Language Model (LLM) to intelligently guide the object detection process by analyzing image patches and deciding whether to use a traditional vision model, request more context, or skip the patch.

The workflow relies on Python modules located in the `src` directory of this project.

In [None]:
import sys
import os

# Add the src directory and the project root to sys.path so project modules can be imported.
# Compare absolute paths so that re-running this cell does not append duplicates.
for path in (os.path.abspath('src'), os.path.abspath('.')):
    if path not in sys.path:
        sys.path.append(path)

# Verify by trying to import one of the modules
try:
    import config
    print("Successfully imported config module.")
except ImportError as e:
    print(f"Error importing config module: {e}. Ensure the notebook is run from the project root so that 'src' resolves correctly.")

# Import project modules
import image_utils
import vision_tool_interface
import openrouter_agent
# config already imported

# Import other necessary libraries
from PIL import Image, ImageDraw, ImageFont
from IPython.display import display, Markdown

## Configuration Parameters

Before running the cells below, ensure that your OpenRouter API key is correctly set in the `src/config.py` file. 

You can adjust the following parameters to change the input image, target object classes for detection, image partitioning strategy, and contextual expansion factor.
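
To make `EXPANSION_FACTOR` concrete, the sketch below shows one plausible way a patch box could be grown around its center and clamped to the image bounds. `expand_box` is a hypothetical helper used only for illustration; the actual behavior of `image_utils.get_contextual_patch` may differ.

```python
def expand_box(box, image_size, expansion_factor):
    """Grow a (left, upper, right, lower) box around its center.

    Hypothetical helper: scales width and height by expansion_factor,
    then clamps the result to the image bounds.
    """
    left, upper, right, lower = box
    img_w, img_h = image_size
    center_x = (left + right) / 2
    center_y = (upper + lower) / 2
    half_w = (right - left) * expansion_factor / 2
    half_h = (lower - upper) * expansion_factor / 2
    return (
        max(0, int(round(center_x - half_w))),
        max(0, int(round(center_y - half_h))),
        min(img_w, int(round(center_x + half_w))),
        min(img_h, int(round(center_y + half_h))),
    )


# A patch in the middle of a 300x300 image grows evenly on all sides.
center = expand_box((100, 100, 200, 200), (300, 300), 1.5)
# A patch in the top-left corner is clamped at the image border.
corner = expand_box((0, 0, 100, 100), (300, 300), 1.5)
print(center, corner)
```

Under this assumption, a factor of 1.5 widens a 100-pixel patch to 150 pixels where the image allows it, and patches near the border receive less added context.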

In [None]:
# --- User Configurable Parameters ---
INPUT_IMAGE_PATH = os.path.join("data", "input.jpg") # Or "data/shark.png"
TARGET_CLASSES = ["person", "car", "dog", "cat", "bird", "bicycle", "traffic light", "stop sign"] # User can customize
NUM_ROWS = 3  # Number of rows for partitioning
NUM_COLS = 3  # Number of columns for partitioning
EXPANSION_FACTOR = 1.5 # For contextual analysis
# --- End User Configurable Parameters ---

print(f"Input image path: {INPUT_IMAGE_PATH}")
print(f"Target classes: {TARGET_CLASSES}")
print(f"Partitioning grid: {NUM_ROWS}x{NUM_COLS}")
print(f"Expansion factor: {EXPANSION_FACTOR}")

# Check if API key is placeholder
if config.OPENROUTER_API_KEY == "YOUR_OPENROUTER_API_KEY_HERE" or not config.OPENROUTER_API_KEY:
    display(Markdown("<font color='red'>**ERROR:** Please set your OpenRouter API key in `src/config.py` before running!</font>"))
    # To halt execution here instead of continuing, uncomment the line below:
    # raise ValueError("OpenRouter API key not set in src/config.py")
else:
    print(f"Using OpenRouter Model: {config.OPENROUTER_MULTIMODAL_MODEL}")
    # Check if the specified image file exists
    if not os.path.exists(INPUT_IMAGE_PATH):
        display(Markdown(f"<font color='red'>**ERROR:** Input image not found at {INPUT_IMAGE_PATH}. Please ensure it exists.</font>"))
        if os.path.exists("data"):
            print(f"Files in data/ directory: {os.listdir('data')}")
        else:
            print("Error: data/ directory does not exist.")

## Load Image

Load the specified image using the `image_utils` module and display it.

In [None]:
original_image = None
api_key_set = bool(config.OPENROUTER_API_KEY) and config.OPENROUTER_API_KEY != "YOUR_OPENROUTER_API_KEY_HERE"

if not api_key_set:
    print("Skipping image load because the OpenRouter API key is not set.")
elif not os.path.exists(INPUT_IMAGE_PATH):
    print(f"Skipping image load because the file does not exist: {INPUT_IMAGE_PATH}")
else:
    print(f"Loading image: {INPUT_IMAGE_PATH}")
    original_image = image_utils.load_image(INPUT_IMAGE_PATH)
    if original_image:
        print("Image loaded successfully.")
        display(original_image)
    else:
        print(f"Failed to load image: {INPUT_IMAGE_PATH}")

## Partition Image

Divide the loaded image into a grid of patches. These patches will be individually analyzed by the LLM agent.
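
As a rough illustration of the grid arithmetic, the sketch below computes the `(left, upper, right, lower)` box of every patch for a given image size. `grid_boxes` is a hypothetical helper, not the project's `image_utils.partition_image`, whose implementation may differ. Integer division is used here so that the last row and column absorb any remainder pixels.

```python
def grid_boxes(width, height, num_rows, num_cols):
    """Return (left, upper, right, lower) boxes that tile a width x height image."""
    if num_rows < 1 or num_cols < 1:
        raise ValueError("num_rows and num_cols must be at least 1")
    boxes = []
    for row in range(num_rows):
        upper = row * height // num_rows
        lower = (row + 1) * height // num_rows
        for col in range(num_cols):
            left = col * width // num_cols
            right = (col + 1) * width // num_cols
            boxes.append((left, upper, right, lower))
    return boxes


boxes = grid_boxes(width=100, height=90, num_rows=3, num_cols=3)
print(len(boxes))  # one box per grid cell, in row-major order
print(boxes[0])    # top-left patch
print(boxes[-1])   # bottom-right patch
```

Each box can be passed to Pillow's `Image.crop`, which accepts the same `(left, upper, right, lower)` tuple.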

In [None]:
patches_info = []
if original_image:
    print(f"Partitioning image into {NUM_ROWS}x{NUM_COLS} patches...")
    patches_info = image_utils.partition_image(original_image, NUM_ROWS, NUM_COLS)
    if patches_info:
        print(f"Image partitioned into {len(patches_info)} patches.")
        # Display the first patch as an example
        print("Displaying the first patch as an example:")
        display(patches_info[0]['patch_image'])
        print(f"Coordinates of the first patch: {patches_info[0]['coords']}")
    else:
        print("Failed to partition image.")
else:
    print("Skipping image partitioning because the original image was not loaded.")

## Agentic Analysis Loop

This is the core of the workflow. Each patch is sent to the multimodal LLM agent. The agent decides whether to:
1.  **ANALYZE**: Run a standard object detection model on the patch.
2.  **EXPAND_CONTEXT**: Request a larger, contextual patch around the current one for re-analysis.
3.  **SKIP**: Ignore the patch if it seems uninteresting or irrelevant.
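
Because the model replies in free text, its reply has to be mapped onto one of these three actions. The sketch below shows one possible mapping, assuming whole-word matching and a `SKIP` default. `parse_decision` is a hypothetical helper, not part of the project modules, and the loop in this notebook uses a simpler substring check.

```python
import re

# Whole-word match so that e.g. "REANALYZED" does not count as "ANALYZE".
_KEYWORD_PATTERN = re.compile(r"\b(ANALYZE|EXPAND_CONTEXT|SKIP)\b")


def parse_decision(reply, default="SKIP"):
    """Map a raw model reply to ANALYZE, EXPAND_CONTEXT, or SKIP.

    The first keyword found in the reply wins; anything that is not a
    string, or contains no keyword, falls back to the default.
    """
    if not isinstance(reply, str):
        return default
    match = _KEYWORD_PATTERN.search(reply.upper())
    return match.group(1) if match else default


for raw in [" analyze\n", "EXPAND_CONTEXT.", "not sure", None]:
    print(repr(raw), "->", parse_decision(raw))
```

Defaulting to `SKIP` keeps the loop conservative: an unparseable reply never triggers a detection run.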

In [None]:
all_detections = []

if not patches_info:
    print("No patches to process. Ensure image was loaded and partitioned correctly.")
elif config.OPENROUTER_API_KEY == "YOUR_OPENROUTER_API_KEY_HERE" or not config.OPENROUTER_API_KEY:
    print("OpenRouter API key not set. Skipping agentic analysis loop.")
else:
    for i, patch_info in enumerate(patches_info):
        patch_image = patch_info['patch_image']
        patch_coords = patch_info['coords'] # (left, upper, right, lower) relative to original
        
        print(f"\n--- Processing Patch {i+1}/{len(patches_info)} - Coords: {patch_coords} ---")

        # Construct initial prompt for the LLM agent
        prompt = (
            f"You are an object detection assistant. This image patch is from coordinates {patch_coords} of a larger image. "
            f"Your task is to identify if any of the following target objects might be present in THIS SPECIFIC PATCH: {', '.join(TARGET_CLASSES)}. "
            f"Based on the visual information in this patch, do you think it's worth running a detailed object detection model on it? "
            f"Respond with only one of these keywords: 'ANALYZE' if yes, 'SKIP' if no. "
            f"If the patch is ambiguous (e.g., shows only a small part of a potential object, like a wheel of a car) such that the full object might be outside this patch but nearby, respond with 'EXPAND_CONTEXT'. "
            f"Your response must be ONLY one of these three keywords."
        )

        # Call the OpenRouter agent
        print(f"  Sending patch to OpenRouter agent (model: {config.OPENROUTER_MULTIMODAL_MODEL})...")
        agent_decision_text = openrouter_agent.get_agent_response(prompt, image=patch_image)
        print(f"  Agent raw decision: '{agent_decision_text}'")
        
        # Sanitize and simplify the agent's response
        decision = "SKIP" # Default
        if isinstance(agent_decision_text, str):
            cleaned_response = agent_decision_text.strip().upper()
            if "ANALYZE" in cleaned_response:
                decision = "ANALYZE"
            elif "EXPAND_CONTEXT" in cleaned_response:
                decision = "EXPAND_CONTEXT"
            elif "SKIP" in cleaned_response: # Ensure SKIP is also checked if it's part of a longer response
                decision = "SKIP"
            else:
                print("  Could not parse a clear keyword from agent response, defaulting to SKIP.")
        else:
            print(f"  Agent response was not a string: '{agent_decision_text}'. Defaulting to SKIP.")

        print(f"  Parsed decision: {decision}")

        if decision == "ANALYZE":
            print(f"  Action: ANALYZE - Running object detection on current patch {patch_coords}...")
            # The vision model and processor should be loaded at module import; reload them here as a fallback
            if not vision_tool_interface.model or not vision_tool_interface.processor:
                print("  Warning: Vision model or processor not loaded. Attempting to reload...")
                try:
                    vision_tool_interface.processor = vision_tool_interface.DetrImageProcessor.from_pretrained("facebook/detr-resnet-50")
                    vision_tool_interface.model = vision_tool_interface.DetrForObjectDetection.from_pretrained("facebook/detr-resnet-50")
                    print("  Successfully reloaded vision model and processor.")
                except Exception as e:
                    print(f"  Failed to reload vision model/processor: {e}. Skipping detection for this patch.")
                    continue # Skip to next patch if model can't be loaded

            detections = vision_tool_interface.detect_objects(patch_image, TARGET_CLASSES)
            if detections:
                print(f"    Found {len(detections)} objects in patch {patch_coords}.")
                for det in detections:
                    det['box'][0] += patch_coords[0]  # x_orig = x_patch + patch_left
                    det['box'][1] += patch_coords[1]  # y_orig = y_patch + patch_top
                    all_detections.append(det)
            else:
                print(f"    No objects found in patch {patch_coords} by vision tool.")

        elif decision == "EXPAND_CONTEXT":
            print("  Action: EXPAND_CONTEXT - Getting contextual patch and running object detection...")
            contextual_patch_img, contextual_coords = image_utils.get_contextual_patch(original_image, patch_coords, EXPANSION_FACTOR)
            print(f"    Contextual patch coords (original image): {contextual_coords}")
            display(Markdown("**Contextual Patch for Review:**"))
            display(contextual_patch_img)
            
            if not vision_tool_interface.model or not vision_tool_interface.processor:
                print("  Error: Vision model or processor not loaded. Skipping detection.")
                # (Add reload attempt here if necessary, similar to ANALYZE block)
                continue

            detections = vision_tool_interface.detect_objects(contextual_patch_img, TARGET_CLASSES)
            if detections:
                print(f"    Found {len(detections)} objects in contextual patch {contextual_coords}.")
                for det in detections:
                    det['box'][0] += contextual_coords[0]  # x_orig = x_context + contextual_patch_left
                    det['box'][1] += contextual_coords[1]  # y_orig = y_context + contextual_patch_top
                    all_detections.append(det)
            else:
                print(f"    No objects found in contextual patch {contextual_coords} by vision tool.")
                
        elif decision == "SKIP":
            print(f"  Action: SKIP - Skipping detailed analysis for patch {patch_coords}.")
    
    print("\nFinished processing all patches.")

## Display Results

If any objects were detected across the analyzed patches, they are drawn onto a copy of the original image. The resulting image is displayed and saved.
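
The drawing step depends on two coordinate conventions used in this notebook: detection boxes are `[x, y, w, h]`, and patch coordinates are `(left, upper, right, lower)` in the original image. The sketch below combines the patch offset and the conversion to corner coordinates in one place. `to_image_rect` is a hypothetical helper for illustration; the notebook applies the same arithmetic in two separate steps.

```python
def to_image_rect(box, patch_coords):
    """Convert a patch-relative [x, y, w, h] box to (x0, y0, x1, y1) in the full image."""
    x, y, w, h = box
    patch_left, patch_upper = patch_coords[0], patch_coords[1]
    x0 = x + patch_left   # shift by the patch's left edge
    y0 = y + patch_upper  # shift by the patch's top edge
    return (x0, y0, x0 + w, y0 + h)


# A 30x40 box found at (10, 20) inside a patch whose top-left corner is (100, 200).
rect = to_image_rect([10, 20, 30, 40], (100, 200, 300, 400))
print(rect)
```

The resulting tuple has the corner layout that `ImageDraw.rectangle` expects.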

In [None]:
if not patches_info: # Check if processing was skipped early
    print("Result display skipped as no patches were processed.")
elif all_detections:
    print(f"\nTotal objects detected in the image: {len(all_detections)}")
    # Remove duplicate detections (simple version based on box and label)
    unique_detections_set = set()
    unique_detections = []
    for det in all_detections:
        # Create a unique key for the detection (box coordinates and label)
        # Rounding coordinates to handle minor float differences if any, and converting to int for hashability
        # Box is [x,y,w,h]
        detection_key = (tuple(int(round(c)) for c in det['box']), det['label'])
        if detection_key not in unique_detections_set:
            unique_detections_set.add(detection_key)
            unique_detections.append(det)
    
    print(f"Total unique objects after basic filtering: {len(unique_detections)}")
    all_detections = unique_detections # Use the filtered list

    draw_image = original_image.copy()
    draw = ImageDraw.Draw(draw_image)
    try:
        font = ImageFont.load_default(size=15) # Older Pillow releases do not accept size and raise TypeError
    except (TypeError, OSError):
        print("Sized default font not available. Using basic fallback.")
        try:
            font = ImageFont.load_default() # Basic fallback
        except Exception:
            font = None # Absolute fallback

    for det in all_detections:
        box = det['box']  # Original image coordinates [x, y, w, h]
        label = f"{det['label']}: {det['score']:.2f}"
        
        rect = [box[0], box[1], box[0] + box[2], box[1] + box[3]]
        draw.rectangle(rect, outline="red", width=3)
        
        text_x = box[0]
        text_y = box[1] - 20 if box[1] - 20 > 0 else box[1] + 5
        
        if font:
            draw.text((text_x, text_y), label, fill="red", font=font)
        else:
            draw.text((text_x, text_y), label, fill="red") # No font object if it failed to load

    display(Markdown("**Final Image with Detections:**"))
    display(draw_image)

    base_name, ext = os.path.splitext(os.path.basename(INPUT_IMAGE_PATH))
    output_image_path = os.path.join("data", f"notebook_output_{base_name}{ext}")
    
    try:
        draw_image.save(output_image_path)
        print(f"Processed image with detections saved to: {output_image_path}")
    except Exception as e:
        print(f"Error saving image: {e}")
        
else:
    print("\nNo objects were detected in the image after processing all patches.")