ScreenSeeker

Vision-Based Desktop Automation with Dynamic Icon Grounding

A Python application that uses computer vision and multimodal LLMs to dynamically locate desktop icons on Windows and automate desktop tasks. Implements the ScreenSeekeR cascaded visual search algorithm for robust icon detection regardless of position, name, or appearance.

Architecture

┌─────────────────────────────────────────────────────────────────┐
│                       Workflow Orchestrator                      │
│  (fetch posts → find icon → open app → type → save → close)    │
└──────────┬──────────────────────────────┬───────────────────────┘
           │                              │
    ┌──────▼──────┐                ┌──────▼──────┐
    │   Grounding  │◄──────────────│  Automation  │
    │   Engine     │               │  (pyautogui) │
    │ (Algorithm 1)│               └──────────────┘
    └──────┬───────┘
           │
    ┌──────▼──────┐
    │   Planner    │
    │  (Gemini 2.5 │
    │   Flash)     │
    └──────────────┘

ScreenSeekeR Algorithm (Algorithm 1)

The visual search follows a recursive cascaded approach:

  1. Position Inference — Gemini proposes candidate regions where the target is likely located
  2. Candidate Grounding — Each candidate description is grounded to bounding-box coordinates
  3. Patch Dilation — Small bounding boxes are expanded to avoid missing the target
  4. Centrality Scoring — Candidates are scored using a Gaussian function (σ = 0.3)
  5. NMS — Non-maximum suppression removes overlapping candidates
  6. Recursive Search — Top candidates are cropped and searched recursively
  7. Result Verification — The planner verifies if the detected element is the target

Prerequisites

  • Windows 10/11 at 1920×1080 resolution
  • Python 3.11+
  • uv package manager
  • Target application shortcut on the desktop (defaults to Notepad)
  • Google Gemini API key

Setup

# Clone the repository
git clone https://github.com/youssef47048/ScreenSeeker.git
cd ScreenSeeker

# Install dependencies with uv
uv sync

# Copy and edit environment variables
cp .env.example .env
# Edit .env and add your GEMINI_API_KEY

Configuration

All settings are controlled via .env (see .env.example for the full template).

Gemini Backend

GEMINI_API_KEY
    AI Studio API key. Required; no default.
GEMINI_MODEL
    Planning model. Default: gemini-2.5-flash
GEMINI_GROUNDING_MODEL
    Grounding model. Default: gemini-robotics-er-1.5-preview

Target Application

ICON_LABEL
    Name Gemini searches for on the desktop. Default: Notepad
ICON_DESCRIPTION
    Visual description so Gemini can recognise the icon even if the shortcut is renamed. Default: the Windows Notepad text editor — a blue notebook/notepad icon
WINDOW_TITLE
    Substring matched against window titles for focus/close. Default: Notepad

This allows the tool to work even when the desktop shortcut has a different name. For example, if the shortcut is renamed to MY EDITOR, set:

ICON_LABEL=MY EDITOR
ICON_DESCRIPTION=the Windows Notepad text editor — a blue notebook/notepad icon
WINDOW_TITLE=Notepad

Gemini will find the icon by visual appearance (blue notebook icon) regardless of the label.
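Loading these variables could look roughly like the following. This is a sketch, not the repo's actual config.py; it assumes .env has already been loaded into the environment (e.g. via python-dotenv), and the dataclass name is illustrative.

```python
import os
from dataclasses import dataclass

@dataclass(frozen=True)
class Settings:
    """Settings read from the environment after .env is loaded."""
    gemini_api_key: str
    gemini_model: str
    icon_label: str
    icon_description: str
    window_title: str
    posts_api_url: str

def load_settings() -> Settings:
    api_key = os.environ.get("GEMINI_API_KEY", "")
    if not api_key:
        raise RuntimeError("GEMINI_API_KEY is required; see .env.example")
    return Settings(
        gemini_api_key=api_key,
        gemini_model=os.environ.get("GEMINI_MODEL", "gemini-2.5-flash"),
        icon_label=os.environ.get("ICON_LABEL", "Notepad"),
        icon_description=os.environ.get(
            "ICON_DESCRIPTION",
            "the Windows Notepad text editor — a blue notebook/notepad icon",
        ),
        window_title=os.environ.get("WINDOW_TITLE", "Notepad"),
        posts_api_url=os.environ.get("POSTS_API_URL", "https://dummyjson.com/posts"),
    )
```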

Posts API

POSTS_API_URL
    URL to fetch blog posts from. Default: https://dummyjson.com/posts

Alternative: https://jsonplaceholder.typicode.com/posts (may be blocked in some regions).
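A fetch along these lines handles both endpoints, since dummyjson wraps posts in a {"posts": [...]} object while jsonplaceholder returns a bare list. This is a sketch, not the repo's api_client.py; the injectable `opener` parameter is an assumption made here for testability.

```python
import json
import time
import urllib.request

def fetch_posts(url, limit=10, attempts=3, delay=1.0, opener=urllib.request.urlopen):
    """Fetch up to `limit` posts, retrying on failure.

    Accepts either a {"posts": [...]} wrapper (dummyjson) or a bare
    list (jsonplaceholder). `opener` is injectable for testing.
    """
    last_error = None
    for attempt in range(attempts):
        try:
            with opener(url) as resp:
                data = json.loads(resp.read().decode("utf-8"))
            posts = data["posts"] if isinstance(data, dict) else data
            return posts[:limit]
        except Exception as exc:
            last_error = exc
            if attempt < attempts - 1:
                time.sleep(delay)
    raise RuntimeError(f"failed to fetch posts from {url}") from last_error
```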

Usage

Full Automation (10 posts)

uv run screenseeker

This will:

  1. Fetch 10 blog posts from the configured posts API
  2. Detect the target icon on the desktop once (cached for all posts)
  3. For each post: open the app → type content → save as post_{id}.txt → close
  4. Save files to ~/Desktop/tjm-project/
  5. Save annotated screenshots to screenshots/ with timestamps

Annotate Only (generate screenshots)

uv run screenseeker --annotate-only

Captures a screenshot, detects the target icon, and saves an annotated image to screenshots/.

Dry Run

uv run screenseeker --dry-run

Fetches posts and detects the icon but doesn't automate the target application.

Verbose Logging

uv run screenseeker -v

Running Tests

uv run pytest tests/ -v

Project Structure

ScreenSeeker/
├── pyproject.toml              # uv project config & dependencies
├── .env.example                # Environment variable template
├── screenshots/                # Annotated screenshots (timestamped)
├── src/screenseeker/
│   ├── __init__.py
│   ├── __main__.py             # python -m screenseeker support
│   ├── main.py                 # CLI entry point
│   ├── config.py               # Central configuration (all .env vars)
│   ├── screen_capture.py       # Screenshot capture & annotation
│   ├── planner.py              # Gemini visual planner
│   ├── grounding.py            # ScreenSeekeR recursive search (Algorithm 1)
│   ├── automation.py           # Mouse/keyboard/window automation (pyautogui + Win32)
│   ├── api_client.py           # Posts API client (dummyjson / jsonplaceholder)
│   └── workflow.py             # Main orchestrator
└── tests/
    ├── test_api_client.py
    ├── test_config.py
    └── test_grounding.py

How It Works

  1. Fetch posts — Downloads blog posts from the configured API endpoint
  2. Detect icon once — Uses the ScreenSeekeR algorithm to locate the target application icon on the desktop. The pixel coordinates are cached and reused for all posts, avoiding redundant API calls
  3. Per-post loop — For each post:
    • Kills any leftover instance of the target app
    • Double-clicks the cached icon position to launch a fresh instance
    • Waits for the window, then uses Win32 API to find its rectangle and click inside it (ensuring keyboard focus lands in the editor, not on the desktop)
    • Types the post content via clipboard paste
    • Saves via Ctrl+Shift+S (Save As dialog) to ~/Desktop/tjm-project/post_{id}.txt
    • Kills the process to guarantee a clean close before the next post

Error Handling

  • Icon not found: Retries up to 3 times with 1-second delays
  • Window launch validation: Verifies the window title appeared within timeout
  • Window focus: Uses Win32 GetWindowRect + click-at-center to guarantee focus lands inside the app, not on the desktop
  • Process cleanup: Uses taskkill to reliably close the app between posts
  • API failure: Clear error messages with automatic retry (3 attempts)
  • File conflicts: Handles existing files in the target directory
  • Screenshot history: All annotated screenshots include timestamps to preserve history across runs
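The "retries up to 3 times with 1-second delays" behaviour could be factored into a small helper like this one. A sketch only: the repo may implement retries inline, and the injectable `sleep` parameter is an assumption made here for testability.

```python
import time

def with_retries(fn, attempts=3, delay=1.0, sleep=time.sleep):
    """Call fn(); on exception, retry up to `attempts` total tries with
    `delay` seconds between them. Re-raises the last error."""
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == attempts:
                raise
            sleep(delay)
```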

Discussion Points

Why this approach?

The ScreenSeekeR algorithm generalizes well because it relies on a multimodal LLM (Gemini) for semantic understanding rather than template matching: it can find any icon from a description, and it handles different themes, backgrounds, and icon sizes.

Failure cases

  • Very cluttered desktops with many similar icons
  • Icons fully obscured by overlapping windows
  • LLM hallucinations in bounding-box coordinates (mitigated by verification step)

Performance

  • Icon detection happens once at startup (~3-8 seconds), not per post
  • Each post takes ~5-10 seconds (open, type, save, close)
  • Total for 10 posts: ~1-2 minutes

Extending

  • Works for any desktop application — configure ICON_LABEL, ICON_DESCRIPTION, and WINDOW_TITLE in .env
  • Resolution-independent — coordinates are computed from screenshots
  • Visual description allows detection even when shortcuts are renamed
  • Could add OCR-based fallback for text-heavy icons
