Skip to content

vmanmarked/MMSkills

Β 
Β 

Repository files navigation

MMSkills
Towards Multimodal Skills for General Visual Agents

Python 3.10+ License OSWorld arXiv Website Skill Library Demos Agent Adapter Submit MMSkill GitHub stars

News | Paper | Website | Skill Library | Demos | Agent Adapter | Submit MMSkills | Overview | Installation | Quick Start | Citation

If you find this project helpful, please give us a star ⭐ for the latest updates.
Typing Animation purple MMSkills

πŸ“£ Latest News

  • πŸ† [May 2026] MMSkills ranked #1 on Hugging Face Daily Papers on 2026.5.18.
  • πŸ€— [May 2026] The MMSkills dataset is now available on Hugging Face Datasets; the paper page is also available on Hugging Face Papers.
  • 🌐 [May 2026] The project website is live with demo comparisons and a searchable MMSkills Library indexing 515 skills across Ubuntu, macOS, VAB-Minecraft, and Mario.
  • πŸš€ [May 2026] The public release includes a compact multimodal desktop-skill subset, OSWorld-ready runtime adapters, task mappings, and model-agnostic skill modes.
  • πŸ”Œ [May 2026] We added the MMSkills Agent Adapter for Codex, OpenClaw, and Claude Code, with one-line Codex installation and on-demand Hugging Face skill retrieval.
  • 🌱 [May 2026] Community MMSkill submissions are open for new domains such as autonomous driving, robotics, mobile agents, and beyond.

🎬 Demos

Four OSWorld demos compare the same task under no skills, text-only skill guidance, and multimodal MMSkills. These videos show selected trajectory excerpts to highlight behavioral differences between the three settings; they are not complete end-to-end trajectories. To keep GUI text readable in the GitHub README, each case uses three separate 1080p MP4 players instead of a compressed side-by-side composite. The full video layout is also available at deepexperience.github.io/MMSkills/cases.html.

1. Calc merged headers

No skills Text-only MMSkills
01_no_skills.mp4
01_text_only.mp4
01_mmskills.mp4

Creates Sheet2, merges the requested header ranges, and writes the target labels. MMSkills follows the intended spreadsheet workflow while the other modes make slower or less reliable progress.

2. VS Code local VSIX install

No skills Text-only MMSkills
02_no_skills.mp4
02_text_only.mp4
02_mmskills.mp4

Installs a local VSIX extension through the GUI workflow. The comparison highlights how multimodal skill references reduce detours around extension discovery and confirmation steps.

3. GIMP text-layer move

No skills Text-only MMSkills
03_no_skills.mp4
03_text_only.mp4
03_mmskills.mp4

Moves a specific text layer in GIMP. The multimodal skill package provides visual grounding for the relevant layer and toolbar state, making the edit path clearer.

4. Calc chart creation

No skills Text-only MMSkills
04_no_skills.mp4
04_text_only.mp4
04_mmskills.mp4

Builds the requested clustered chart in LibreOffice Calc. The side-by-side run shows the effect of reusable spreadsheet procedure knowledge on multi-step GUI manipulation.

πŸ’‘ Overview

MMSkills is a framework for representing, loading, and using reusable multimodal procedural knowledge for visual agents. Each skill combines textual procedure guidance, compact state-card metadata, and optional visual references. At inference time, the agent keeps only lightweight skill hints in the main context, then opens a temporary skill branch when task state suggests that a skill may help.

MMSkills overview

This repository is a focused open-source release. It is not a full OSWorld fork; instead, it provides the MMSkill runtime layer, an install script, OSWorld runner patches, task-to-skill mappings, and a representative public skill library.

Project pages:

Website frontend files are published from the gh-pages branch. The main branch is kept focused on the open-source code, runtime integration, skills, and documentation.

✨ Highlights

🧩 Self-contained skill packages
Each skill directory contains SKILL.md, runtime state cards, audit state cards, and visual keyframes.
πŸ‘οΈ Multimodal evidence gating
The runtime first decides whether visual references are needed, then loads only the requested state views.
🧠 Branch-loaded planning
A temporary planner branch consults selected skills and returns concise guidance, fallback advice, and verification cues.
πŸ”Œ OSWorld ready
Helper scripts install the agent files, runner integration, skills, and task mappings into a local OSWorld checkout.
⚑ Agent-product adapter
The mmskills-agent-adapter can be installed as a Codex skill and reused by OpenClaw or Claude Code through the same package contract.
πŸ“¦ On-demand skill retrieval
Agents search the 515-skill Hugging Face library, download only task-relevant packages, then read SKILL.md, runtime states, and visual references as needed.
🌱 Community-extensible library
Researchers can submit MMSkill packages for new domains such as autonomous driving, robotics, mobile apps, web agents, and games.
βœ… Review-first publishing
Submissions open GitHub issues, notify maintainers, and are reviewed before being normalized into the public Hugging Face library and website.

πŸ”Œ Agent Adapter

The mmskills-agent-adapter module turns MMSkills into an installable, product-neutral skill adapter for agent systems. It keeps one shared MMSkills package format across Codex, OpenClaw, Claude Code, and future agent products instead of maintaining separate copies for each ecosystem.

The adapter is intentionally lightweight. It does not bundle the full 515-skill asset set inside the repository branch. Instead, it points agents to the public Hugging Face MMSkills dataset, searches the metadata index, and downloads only the skill package needed for the current task.

One-line Codex install:

curl -fsSL https://raw.githubusercontent.com/DeepExperience/MMSkills/main/scripts/install_codex_mmskills.sh | bash

Direct Codex skill-installer form:

python ~/.codex/skills/.system/skill-installer/scripts/install-skill-from-github.py \
  --repo DeepExperience/MMSkills \
  --path agent_integrations/mmskills-agent-adapter

After restarting Codex, invoke $mmskills for GUI-agent or computer-use tasks. The adapter scripts provide the standard flow:

python scripts/search_skills.py "chrome bookmark" --package ubuntu
python scripts/download_skill.py ubuntu/chrome/CHROME_Manage_Bookmarks_Reading_List_And_Shortcuts
python scripts/inspect_skill.py ~/.cache/mmskills/skills/ubuntu/chrome/CHROME_Manage_Bookmarks_Reading_List_And_Shortcuts

For OpenClaw and Claude Code, use the same adapter contract: call the search/download scripts, parse SKILL.md and runtime_state_cards.json, and route Images/ into the product's visual grounding or verification layer only when visual evidence is needed.

🌱 Community Submissions

We welcome MMSkill packages from new domains. A submission can be a single reusable skill or a new domain collection, such as autonomous driving, robotics, mobile agents, browser workflows, scientific software, games, or other visual-agent environments.

Submit through the website entrypoint or directly through the GitHub issue form:

Each submission creates a GitHub issue assigned to the maintainer account, so maintainers can receive email notifications through GitHub's repository notification settings. After review, accepted packages are normalized into the MMSkills library, uploaded to the public Hugging Face dataset, and surfaced on the website Skill Library.

πŸ—‚οΈ Repository Layout

MMSkills/
β”œβ”€β”€ agent_integrations/        # Codex/OpenClaw/Claude Code agent adapters and download helpers
β”œβ”€β”€ mm_agents/                 # MMSkill runtime architecture and model adapters
β”œβ”€β”€ osworld_integration/       # MMSkills-aware OSWorld runner files
β”œβ”€β”€ skills_library/            # Public multimodal skills subset for direct runtime use
β”œβ”€β”€ task_skill_mappings/       # OSWorld task-to-skill mapping for released skills
└── scripts/
    β”œβ”€β”€ install_into_osworld.py # Install this release into an OSWorld checkout
    └── sync_from_sources.py    # Maintainer sync helper for source checkouts

🧠 Architecture

The public runtime entrypoint is mm_agents/mm_skill_agent.py, exposed in OSWorld as:

--agent_type mm_skill

The architecture is model-agnostic. A main visual agent receives compact skill hints; when a skill may apply, the runtime opens a branch that decides whether visual evidence is needed, requests relevant state views, compares them with the live screenshot, and returns structured guidance for the next grounded action.

The reference integration supports:

  • mm_skill: multimodal branch-loaded skill consultation.
  • general_text_skill: text-only skill consultation for ablation and lightweight runs.
  • general: baseline model-agnostic screenshot-to-pyautogui visual-agent routing.

Legacy gemini, gemini_skill, and gemini_text_skill CLI names are still accepted by the runner as aliases for compatibility, but the public files and recommended commands use the model-agnostic general* names.

Any screenshot-capable VLM served through an OpenAI-compatible chat-completions API can use the same general* and mm_skill interfaces by setting --model, --api_model when needed, --base_url, and --api_key.

πŸ”§ Installation

1. Clone MMSkills

git clone https://github.com/DeepExperience/MMSkills.git
cd MMSkills

2. Install Python dependencies

python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

3. Install into OSWorld

Clone and install OSWorld following its upstream instructions, then run:

python3 scripts/install_into_osworld.py /path/to/OSWorld --with-runner --with-skills

This copies the MMSkill agent files into OSWorld/mm_agents/, installs the MMSkills-aware runner files, and copies the released skills_library/ plus task_skill_mappings/.

4. Configure model endpoints

For an OpenAI-compatible endpoint:

export OPENAI_BASE_URL="https://your-openai-compatible-endpoint/v1"
export OPENAI_API_KEY="your_api_key"

For native Gemini-compatible routing, pass --api_backend gemini and set:

export GEMINI_BASE_URL="https://your-gemini-compatible-endpoint/v1"
export GEMINI_API_KEY="your_api_key"

5. Install the Codex Agent Adapter

MMSkills also ships a lightweight agent-product adapter under agent_integrations/mmskills-agent-adapter/. The adapter is installable as a Codex skill and points agents to the full Hugging Face skill dataset for on-demand retrieval. See Agent Adapter for the full cross-agent contract.

One-line Codex install:

curl -fsSL https://raw.githubusercontent.com/DeepExperience/MMSkills/main/scripts/install_codex_mmskills.sh | bash

Direct Codex skill-installer form:

python ~/.codex/skills/.system/skill-installer/scripts/install-skill-from-github.py \
  --repo DeepExperience/MMSkills \
  --path agent_integrations/mmskills-agent-adapter

After restarting Codex, use $mmskills to search and load task-relevant packages. The same adapter contract is intended for OpenClaw and Claude Code: share the MMSkills package format, keep product-specific behavior in thin adapters, and download only the skills needed for the current task from Hugging Face Datasets.

πŸƒ Quick Start

Run commands from the OSWorld checkout after installation.

Baseline Without Skills

python run.py \
  --agent_type general \
  --model gpt-4o \
  --api_backend openai \
  --observation_type screenshot \
  --action_space pyautogui \
  --max_steps 20 \
  --test_all_meta_path evaluation_examples/test_nogdrive.json \
  --domain chrome \
  --result_dir results/no_skills

Text-Only Skills

python run.py \
  --agent_type general_text_skill \
  --model gpt-4o \
  --api_backend openai \
  --observation_type screenshot \
  --action_space pyautogui \
  --max_steps 20 \
  --skills_library_dir skills_library \
  --task_skill_mapping_root task_skill_mappings/task_skill_mapping.json \
  --skill_mode text_only \
  --text_skill_mode branch_planner \
  --test_all_meta_path evaluation_examples/test_nogdrive.json \
  --domain chrome \
  --result_dir results/text_only

Multimodal MMSkill Agent

python run.py \
  --agent_type mm_skill \
  --model gpt-4o \
  --api_backend openai \
  --observation_type screenshot \
  --action_space pyautogui \
  --max_steps 20 \
  --skills_library_dir skills_library \
  --task_skill_mapping_root task_skill_mappings/task_skill_mapping.json \
  --skill_mode multimodal \
  --task_skill_top_k 6 \
  --save_conversation_json \
  --test_all_meta_path evaluation_examples/test_nogdrive.json \
  --domain chrome \
  --result_dir results/mm_skill_multimodal

Use --domain all for the full no-Google-Drive OSWorld split. The runner writes trajectories, screenshots, skill_invocations.json, skill_usage_summary.json, and aggregate metrics under the selected --result_dir.

πŸ“š Skill Library

The website indexes 515 skills from the open-source Ubuntu, macOS, VAB-Minecraft, and Mario skill assets. Each skill card links to a structured view of its SKILL.md, runtime state cards, and ordered visual references.

Browse the live library at deepexperience.github.io/MMSkills/skills.html.

The repository also includes a compact runtime-ready subset under skills_library/ for immediate OSWorld integration.

πŸ“¦ Skill Package Format

skills_library/<domain>/<skill_name>/
β”œβ”€β”€ SKILL.md                  # Procedure, applicability, transfer limits, checks
β”œβ”€β”€ runtime_state_cards.json  # Compact state/view metadata used at inference time
β”œβ”€β”€ state_cards.json          # Audit-grade state metadata for inspection
β”œβ”€β”€ plan.json                 # Generated plan metadata, when available
└── Images/                   # Full frames, focus crops, before/after references

The main agent sees only concise skill names and state hints. Detailed visual evidence is loaded lazily by the branch planner, which keeps the main context compact while preserving access to state-specific multimodal references.

runtime_state_cards.json is the inference-facing version: it contains compact state descriptions, when-to-use rules, visible cues, verification cues, and selected image views for branch-time loading. state_cards.json is the richer authoring/audit version: it keeps transfer-limit notes, highlight targets, grounding queries, bounding boxes, crop decisions, and evidence-source metadata for inspection and regeneration.

πŸ§ͺ Outputs

MMSkills adds skill-aware artifacts to OSWorld result directories:

File Purpose
skill_invocations.json Per-branch consultation records, selected states, requested views, and planner outputs
skill_usage_summary.json Aggregate skill counts, branch success counts, exhausted skills, and final actions
conversation.json Optional main and branch conversation trace when --save_conversation_json is enabled

🀝 Contributing

Contributions are welcome for new skills, runtime integrations, documentation, and reproducibility fixes. Please read CONTRIBUTING.md before opening an issue or pull request.

πŸ“„ License

This project is released under the Apache License 2.0. Portions of the OSWorld integration are derived from OSWorld; see NOTICE for attribution details.

πŸ“ Citation

If you use MMSkills in your research or applications, please cite our arXiv paper:

@misc{zhang2026mmskills,
  title = {MMSkills: Towards Multimodal Skills for General Visual Agents},
  author = {Kangning Zhang and Shuai Shao and Qingyao Li and Jianghao Lin and Lingyue Fu and Shijian Wang and Wenxiang Jiao and Yuan Lu and Weiwen Liu and Weinan Zhang and Yong Yu},
  year = {2026},
  eprint = {2605.13527},
  archivePrefix = {arXiv},
  primaryClass = {cs.AI},
  url = {https://arxiv.org/abs/2605.13527}
}

You can also use the machine-readable citation metadata in CITATION.cff.

About

MMSkills: Towards Multimodal Skills for General Visual Agents

Resources

License

Contributing

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages

  • Python 99.8%
  • Shell 0.2%