ScreenSeeker

Vision-Based Desktop Automation with Dynamic Icon Grounding

A Python application that uses computer vision and multimodal LLMs to dynamically locate desktop icons on Windows and automate desktop tasks. Implements the ScreenSeekeR cascaded visual search algorithm for robust icon detection regardless of position, name, or appearance.

Architecture

┌─────────────────────────────────────────────────────────────────┐
│                       Workflow Orchestrator                      │
│  (fetch posts → find icon → open app → type → save → close)    │
└──────────┬──────────────────────────────┬───────────────────────┘
           │                              │
    ┌──────▼──────┐                ┌──────▼──────┐
    │   Grounding  │◄──────────────│  Automation  │
    │   Engine     │               │  (pyautogui) │
    │ (Algorithm 1)│               └──────────────┘
    └──────┬───────┘
           │
    ┌──────▼──────┐
    │   Planner    │
    │  (Gemini 2.5 │
    │   Flash)     │
    └──────────────┘

ScreenSeekeR Algorithm (Algorithm 1)

The visual search follows a recursive cascaded approach:

  1. Position Inference — Gemini proposes candidate regions where the target is likely located
  2. Candidate Grounding — Each candidate description is grounded to bounding-box coordinates
  3. Patch Dilation — Small bounding boxes are expanded to avoid missing the target
  4. Centrality Scoring — Candidates are scored using a Gaussian function (σ = 0.3)
  5. NMS — Non-maximum suppression removes overlapping candidates
  6. Recursive Search — Top candidates are cropped and searched recursively
  7. Result Verification — The planner verifies if the detected element is the target

Prerequisites

  • Windows 10/11 at 1920×1080 resolution
  • Python 3.11+
  • uv package manager
  • Target application shortcut on the desktop (defaults to Notepad)
  • Google Gemini API key

Setup

# Clone the repository
git clone https://github.com/youssef47048/ScreenSeeker.git
cd ScreenSeeker

# Install dependencies with uv
uv sync

# Copy and edit environment variables
cp .env.example .env
# Edit .env and add your GEMINI_API_KEY

Configuration

All settings are controlled via .env (see .env.example for the full template).

Gemini Backend

GEMINI_API_KEY
    AI Studio API key. Required; no default.
GEMINI_MODEL
    Planning model. Default: gemini-2.5-flash
GEMINI_GROUNDING_MODEL
    Grounding model. Default: gemini-robotics-er-1.5-preview

Target Application

ICON_LABEL
    Name Gemini searches for on the desktop. Default: Notepad
ICON_DESCRIPTION
    Visual description so Gemini can recognise the icon even if the shortcut is renamed. Default: the Windows Notepad text editor — a blue notebook/notepad icon
WINDOW_TITLE
    Substring matched against window titles for focus/close. Default: Notepad

This allows the tool to work even when the desktop shortcut has a different name. For example, if the shortcut is renamed to MY EDITOR, set:

ICON_LABEL=MY EDITOR
ICON_DESCRIPTION=the Windows Notepad text editor — a blue notebook/notepad icon
WINDOW_TITLE=Notepad

Gemini will find the icon by visual appearance (blue notebook icon) regardless of the label.
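Loading these variables could look roughly like the following. This is a sketch, not the repo's actual config.py; it assumes .env has already been loaded into the environment (e.g. via python-dotenv), and the dataclass name is illustrative.

```python
import os
from dataclasses import dataclass

@dataclass(frozen=True)
class Settings:
    """Settings read from the environment after .env is loaded."""
    gemini_api_key: str
    gemini_model: str
    icon_label: str
    icon_description: str
    window_title: str
    posts_api_url: str

def load_settings() -> Settings:
    api_key = os.environ.get("GEMINI_API_KEY", "")
    if not api_key:
        raise RuntimeError("GEMINI_API_KEY is required; see .env.example")
    return Settings(
        gemini_api_key=api_key,
        gemini_model=os.environ.get("GEMINI_MODEL", "gemini-2.5-flash"),
        icon_label=os.environ.get("ICON_LABEL", "Notepad"),
        icon_description=os.environ.get(
            "ICON_DESCRIPTION",
            "the Windows Notepad text editor — a blue notebook/notepad icon",
        ),
        window_title=os.environ.get("WINDOW_TITLE", "Notepad"),
        posts_api_url=os.environ.get("POSTS_API_URL", "https://dummyjson.com/posts"),
    )
```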

Posts API

POSTS_API_URL
    URL to fetch blog posts from. Default: https://dummyjson.com/posts

Alternative: https://jsonplaceholder.typicode.com/posts (may be blocked in some regions).
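A fetch along these lines handles both endpoints, since dummyjson wraps posts in a {"posts": [...]} object while jsonplaceholder returns a bare list. This is a sketch, not the repo's api_client.py; the injectable `opener` parameter is an assumption made here for testability.

```python
import json
import time
import urllib.request

def fetch_posts(url, limit=10, attempts=3, delay=1.0, opener=urllib.request.urlopen):
    """Fetch up to `limit` posts, retrying on failure.

    Accepts either a {"posts": [...]} wrapper (dummyjson) or a bare
    list (jsonplaceholder). `opener` is injectable for testing.
    """
    last_error = None
    for attempt in range(attempts):
        try:
            with opener(url) as resp:
                data = json.loads(resp.read().decode("utf-8"))
            posts = data["posts"] if isinstance(data, dict) else data
            return posts[:limit]
        except Exception as exc:
            last_error = exc
            if attempt < attempts - 1:
                time.sleep(delay)
    raise RuntimeError(f"failed to fetch posts from {url}") from last_error
```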

Usage

Full Automation (10 posts)

uv run screenseeker

This will:

  1. Fetch 10 blog posts from the configured posts API
  2. Detect the target icon on the desktop once (cached for all posts)
  3. For each post: open the app → type content → save as post_{id}.txt → close
  4. Save files to ~/Desktop/tjm-project/
  5. Save annotated screenshots to screenshots/ with timestamps

Annotate Only (generate screenshots)

uv run screenseeker --annotate-only

Captures a screenshot, detects the target icon, and saves an annotated image to screenshots/.

Dry Run

uv run screenseeker --dry-run

Fetches posts and detects the icon but doesn't automate the target application.

Verbose Logging

uv run screenseeker -v

Running Tests

uv run pytest tests/ -v

Project Structure

ScreenSeeker/
├── pyproject.toml              # uv project config & dependencies
├── .env.example                # Environment variable template
├── screenshots/                # Annotated screenshots (timestamped)
├── src/screenseeker/
│   ├── __init__.py
│   ├── __main__.py             # python -m screenseeker support
│   ├── main.py                 # CLI entry point
│   ├── config.py               # Central configuration (all .env vars)
│   ├── screen_capture.py       # Screenshot capture & annotation
│   ├── planner.py              # Gemini visual planner
│   ├── grounding.py            # ScreenSeekeR recursive search (Algorithm 1)
│   ├── automation.py           # Mouse/keyboard/window automation (pyautogui + Win32)
│   ├── api_client.py           # Posts API client (dummyjson / jsonplaceholder)
│   └── workflow.py             # Main orchestrator
└── tests/
    ├── test_api_client.py
    ├── test_config.py
    └── test_grounding.py

How It Works

  1. Fetch posts — Downloads blog posts from the configured API endpoint
  2. Detect icon once — Uses the ScreenSeekeR algorithm to locate the target application icon on the desktop. The pixel coordinates are cached and reused for all posts, avoiding redundant API calls
  3. Per-post loop — For each post:
    • Kills any leftover instance of the target app
    • Double-clicks the cached icon position to launch a fresh instance
    • Waits for the window, then uses Win32 API to find its rectangle and click inside it (ensuring keyboard focus lands in the editor, not on the desktop)
    • Types the post content via clipboard paste
    • Saves via Ctrl+Shift+S (Save As dialog) to ~/Desktop/tjm-project/post_{id}.txt
    • Kills the process to guarantee a clean close before the next post

Error Handling

  • Icon not found: Retries up to 3 times with 1-second delays
  • Window launch validation: Verifies the window title appeared within timeout
  • Window focus: Uses Win32 GetWindowRect + click-at-center to guarantee focus lands inside the app, not on the desktop
  • Process cleanup: Uses taskkill to reliably close the app between posts
  • API failure: Clear error messages with automatic retry (3 attempts)
  • File conflicts: Handles existing files in the target directory
  • Screenshot history: All annotated screenshots include timestamps to preserve history across runs
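The "retries up to 3 times with 1-second delays" behaviour could be factored into a small helper like this one. A sketch only: the repo may implement retries inline, and the injectable `sleep` parameter is an assumption made here for testability.

```python
import time

def with_retries(fn, attempts=3, delay=1.0, sleep=time.sleep):
    """Call fn(); on exception, retry up to `attempts` total tries with
    `delay` seconds between them. Re-raises the last error."""
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == attempts:
                raise
            sleep(delay)
```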

Discussion Points

Why this approach?

The ScreenSeekeR algorithm generalizes well because it relies on a multimodal LLM (Gemini) for semantic understanding rather than template matching: it can find any icon from a description, and it handles different themes, backgrounds, and icon sizes.

Failure cases

  • Very cluttered desktops with many similar icons
  • Icons fully obscured by overlapping windows
  • LLM hallucinations in bounding-box coordinates (mitigated by verification step)

Performance

  • Icon detection happens once at startup (~3-8 seconds), not per post
  • Each post takes ~5-10 seconds (open, type, save, close)
  • Total for 10 posts: ~1-2 minutes

Extending

  • Works for any desktop application — configure ICON_LABEL, ICON_DESCRIPTION, and WINDOW_TITLE in .env
  • Resolution-independent — coordinates are computed from screenshots
  • Visual description allows detection even when shortcuts are renamed
  • Could add OCR-based fallback for text-heavy icons
