News | Paper | Website | Skill Library | Demos | Agent Adapter | Submit MMSkills | Overview | Installation | Quick Start | Citation
- π [May 2026] MMSkills ranked #1 on Hugging Face Daily Papers on 2026.5.18.
- π€ [May 2026] The MMSkills dataset is now available on Hugging Face Datasets; the paper page is also available on Hugging Face Papers.
- π [May 2026] The project website is live with demo comparisons and a searchable MMSkills Library indexing 515 skills across Ubuntu, macOS, VAB-Minecraft, and Mario.
- π [May 2026] The public release includes a compact multimodal desktop-skill subset, OSWorld-ready runtime adapters, task mappings, and model-agnostic skill modes.
- π [May 2026] We added the MMSkills Agent Adapter for Codex, OpenClaw, and Claude Code, with one-line Codex installation and on-demand Hugging Face skill retrieval.
- π± [May 2026] Community MMSkill submissions are open for new domains such as autonomous driving, robotics, mobile agents, and beyond.
Four OSWorld demos compare the same task under no skills, text-only skill guidance, and multimodal MMSkills. These videos show selected trajectory excerpts to highlight behavioral differences between the three settings; they are not complete end-to-end trajectories. To keep GUI text readable in the GitHub README, each case uses three separate 1080p MP4 players instead of a compressed side-by-side composite. The full video layout is also available at deepexperience.github.io/MMSkills/cases.html.
| No skills | Text-only | MMSkills |
|---|---|---|
01_no_skills.mp4 |
01_text_only.mp4 |
01_mmskills.mp4 |
Creates Sheet2, merges the requested header ranges, and writes the target labels. MMSkills follows the intended spreadsheet workflow while the other modes make slower or less reliable progress.
| No skills | Text-only | MMSkills |
|---|---|---|
02_no_skills.mp4 |
02_text_only.mp4 |
02_mmskills.mp4 |
Installs a local VSIX extension through the GUI workflow. The comparison highlights how multimodal skill references reduce detours around extension discovery and confirmation steps.
| No skills | Text-only | MMSkills |
|---|---|---|
03_no_skills.mp4 |
03_text_only.mp4 |
03_mmskills.mp4 |
Moves a specific text layer in GIMP. The multimodal skill package provides visual grounding for the relevant layer and toolbar state, making the edit path clearer.
| No skills | Text-only | MMSkills |
|---|---|---|
04_no_skills.mp4 |
04_text_only.mp4 |
04_mmskills.mp4 |
Builds the requested clustered chart in LibreOffice Calc. The side-by-side run shows the effect of reusable spreadsheet procedure knowledge on multi-step GUI manipulation.
MMSkills is a framework for representing, loading, and using reusable multimodal procedural knowledge for visual agents. Each skill combines textual procedure guidance, compact state-card metadata, and optional visual references. At inference time, the agent keeps only lightweight skill hints in the main context, then opens a temporary skill branch when task state suggests that a skill may help.
This repository is a focused open-source release. It is not a full OSWorld fork; instead, it provides the MMSkill runtime layer, an install script, OSWorld runner patches, task-to-skill mappings, and a representative public skill library.
Project pages:
Website frontend files are published from the gh-pages branch. The main branch is kept focused on the open-source code, runtime integration, skills, and documentation.
| π§© Self-contained skill packages Each skill directory contains SKILL.md, runtime state cards, audit state cards, and visual keyframes. |
ποΈ Multimodal evidence gating The runtime first decides whether visual references are needed, then loads only the requested state views. |
| π§ Branch-loaded planning A temporary planner branch consults selected skills and returns concise guidance, fallback advice, and verification cues. |
π OSWorld ready Helper scripts install the agent files, runner integration, skills, and task mappings into a local OSWorld checkout. |
| β‘ Agent-product adapter The mmskills-agent-adapter can be installed as a Codex skill and reused by OpenClaw or Claude Code through the same package contract. |
π¦ On-demand skill retrieval Agents search the 515-skill Hugging Face library, download only task-relevant packages, then read SKILL.md, runtime states, and visual references as needed. |
| π± Community-extensible library Researchers can submit MMSkill packages for new domains such as autonomous driving, robotics, mobile apps, web agents, and games. |
β
Review-first publishing Submissions open GitHub issues, notify maintainers, and are reviewed before being normalized into the public Hugging Face library and website. |
The mmskills-agent-adapter module turns MMSkills into an installable, product-neutral skill adapter for agent systems. It keeps one shared MMSkills package format across Codex, OpenClaw, Claude Code, and future agent products instead of maintaining separate copies for each ecosystem.
The adapter is intentionally lightweight. It does not bundle the full 515-skill asset set inside the repository branch. Instead, it points agents to the public Hugging Face MMSkills dataset, searches the metadata index, and downloads only the skill package needed for the current task.
One-line Codex install:
curl -fsSL https://raw.githubusercontent.com/DeepExperience/MMSkills/main/scripts/install_codex_mmskills.sh | bashDirect Codex skill-installer form:
python ~/.codex/skills/.system/skill-installer/scripts/install-skill-from-github.py \
--repo DeepExperience/MMSkills \
--path agent_integrations/mmskills-agent-adapterAfter restarting Codex, invoke $mmskills for GUI-agent or computer-use tasks. The adapter scripts provide the standard flow:
python scripts/search_skills.py "chrome bookmark" --package ubuntu
python scripts/download_skill.py ubuntu/chrome/CHROME_Manage_Bookmarks_Reading_List_And_Shortcuts
python scripts/inspect_skill.py ~/.cache/mmskills/skills/ubuntu/chrome/CHROME_Manage_Bookmarks_Reading_List_And_ShortcutsFor OpenClaw and Claude Code, use the same adapter contract: call the search/download scripts, parse SKILL.md and runtime_state_cards.json, and route Images/ into the product's visual grounding or verification layer only when visual evidence is needed.
We welcome MMSkill packages from new domains. A submission can be a single reusable skill or a new domain collection, such as autonomous driving, robotics, mobile agents, browser workflows, scientific software, games, or other visual-agent environments.
Submit through the website entrypoint or directly through the GitHub issue form:
- Website entry: deepexperience.github.io/MMSkills/submit.html
- GitHub issue form: Submit an MMSkill package
- Format guide: docs/submit_mmskills.md
Each submission creates a GitHub issue assigned to the maintainer account, so maintainers can receive email notifications through GitHub's repository notification settings. After review, accepted packages are normalized into the MMSkills library, uploaded to the public Hugging Face dataset, and surfaced on the website Skill Library.
MMSkills/
βββ agent_integrations/ # Codex/OpenClaw/Claude Code agent adapters and download helpers
βββ mm_agents/ # MMSkill runtime architecture and model adapters
βββ osworld_integration/ # MMSkills-aware OSWorld runner files
βββ skills_library/ # Public multimodal skills subset for direct runtime use
βββ task_skill_mappings/ # OSWorld task-to-skill mapping for released skills
βββ scripts/
βββ install_into_osworld.py # Install this release into an OSWorld checkout
βββ sync_from_sources.py # Maintainer sync helper for source checkouts
The public runtime entrypoint is mm_agents/mm_skill_agent.py, exposed in OSWorld as:
--agent_type mm_skillThe architecture is model-agnostic. A main visual agent receives compact skill hints; when a skill may apply, the runtime opens a branch that decides whether visual evidence is needed, requests relevant state views, compares them with the live screenshot, and returns structured guidance for the next grounded action.
The reference integration supports:
mm_skill: multimodal branch-loaded skill consultation.general_text_skill: text-only skill consultation for ablation and lightweight runs.general: baseline model-agnostic screenshot-to-pyautogui visual-agent routing.
Legacy gemini, gemini_skill, and gemini_text_skill CLI names are still accepted by the runner as aliases for compatibility, but the public files and recommended commands use the model-agnostic general* names.
Any screenshot-capable VLM served through an OpenAI-compatible chat-completions API can use the same general* and mm_skill interfaces by setting --model, --api_model when needed, --base_url, and --api_key.
git clone https://github.com/DeepExperience/MMSkills.git
cd MMSkillspython3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txtClone and install OSWorld following its upstream instructions, then run:
python3 scripts/install_into_osworld.py /path/to/OSWorld --with-runner --with-skillsThis copies the MMSkill agent files into OSWorld/mm_agents/, installs the MMSkills-aware runner files, and copies the released skills_library/ plus task_skill_mappings/.
For an OpenAI-compatible endpoint:
export OPENAI_BASE_URL="https://your-openai-compatible-endpoint/v1"
export OPENAI_API_KEY="your_api_key"For native Gemini-compatible routing, pass --api_backend gemini and set:
export GEMINI_BASE_URL="https://your-gemini-compatible-endpoint/v1"
export GEMINI_API_KEY="your_api_key"MMSkills also ships a lightweight agent-product adapter under agent_integrations/mmskills-agent-adapter/. The adapter is installable as a Codex skill and points agents to the full Hugging Face skill dataset for on-demand retrieval. See Agent Adapter for the full cross-agent contract.
One-line Codex install:
curl -fsSL https://raw.githubusercontent.com/DeepExperience/MMSkills/main/scripts/install_codex_mmskills.sh | bashDirect Codex skill-installer form:
python ~/.codex/skills/.system/skill-installer/scripts/install-skill-from-github.py \
--repo DeepExperience/MMSkills \
--path agent_integrations/mmskills-agent-adapterAfter restarting Codex, use $mmskills to search and load task-relevant packages. The same adapter contract is intended for OpenClaw and Claude Code: share the MMSkills package format, keep product-specific behavior in thin adapters, and download only the skills needed for the current task from Hugging Face Datasets.
Run commands from the OSWorld checkout after installation.
python run.py \
--agent_type general \
--model gpt-4o \
--api_backend openai \
--observation_type screenshot \
--action_space pyautogui \
--max_steps 20 \
--test_all_meta_path evaluation_examples/test_nogdrive.json \
--domain chrome \
--result_dir results/no_skillspython run.py \
--agent_type general_text_skill \
--model gpt-4o \
--api_backend openai \
--observation_type screenshot \
--action_space pyautogui \
--max_steps 20 \
--skills_library_dir skills_library \
--task_skill_mapping_root task_skill_mappings/task_skill_mapping.json \
--skill_mode text_only \
--text_skill_mode branch_planner \
--test_all_meta_path evaluation_examples/test_nogdrive.json \
--domain chrome \
--result_dir results/text_onlypython run.py \
--agent_type mm_skill \
--model gpt-4o \
--api_backend openai \
--observation_type screenshot \
--action_space pyautogui \
--max_steps 20 \
--skills_library_dir skills_library \
--task_skill_mapping_root task_skill_mappings/task_skill_mapping.json \
--skill_mode multimodal \
--task_skill_top_k 6 \
--save_conversation_json \
--test_all_meta_path evaluation_examples/test_nogdrive.json \
--domain chrome \
--result_dir results/mm_skill_multimodalUse --domain all for the full no-Google-Drive OSWorld split. The runner writes trajectories, screenshots, skill_invocations.json, skill_usage_summary.json, and aggregate metrics under the selected --result_dir.
The website indexes 515 skills from the open-source Ubuntu, macOS, VAB-Minecraft, and Mario skill assets. Each skill card links to a structured view of its SKILL.md, runtime state cards, and ordered visual references.
Browse the live library at deepexperience.github.io/MMSkills/skills.html.
The repository also includes a compact runtime-ready subset under skills_library/ for immediate OSWorld integration.
skills_library/<domain>/<skill_name>/
βββ SKILL.md # Procedure, applicability, transfer limits, checks
βββ runtime_state_cards.json # Compact state/view metadata used at inference time
βββ state_cards.json # Audit-grade state metadata for inspection
βββ plan.json # Generated plan metadata, when available
βββ Images/ # Full frames, focus crops, before/after references
The main agent sees only concise skill names and state hints. Detailed visual evidence is loaded lazily by the branch planner, which keeps the main context compact while preserving access to state-specific multimodal references.
runtime_state_cards.json is the inference-facing version: it contains compact state descriptions, when-to-use rules, visible cues, verification cues, and selected image views for branch-time loading. state_cards.json is the richer authoring/audit version: it keeps transfer-limit notes, highlight targets, grounding queries, bounding boxes, crop decisions, and evidence-source metadata for inspection and regeneration.
MMSkills adds skill-aware artifacts to OSWorld result directories:
| File | Purpose |
|---|---|
skill_invocations.json |
Per-branch consultation records, selected states, requested views, and planner outputs |
skill_usage_summary.json |
Aggregate skill counts, branch success counts, exhausted skills, and final actions |
conversation.json |
Optional main and branch conversation trace when --save_conversation_json is enabled |
Contributions are welcome for new skills, runtime integrations, documentation, and reproducibility fixes. Please read CONTRIBUTING.md before opening an issue or pull request.
This project is released under the Apache License 2.0. Portions of the OSWorld integration are derived from OSWorld; see NOTICE for attribution details.
If you use MMSkills in your research or applications, please cite our arXiv paper:
@misc{zhang2026mmskills,
title = {MMSkills: Towards Multimodal Skills for General Visual Agents},
author = {Kangning Zhang and Shuai Shao and Qingyao Li and Jianghao Lin and Lingyue Fu and Shijian Wang and Wenxiang Jiao and Yuan Lu and Weiwen Liu and Weinan Zhang and Yong Yu},
year = {2026},
eprint = {2605.13527},
archivePrefix = {arXiv},
primaryClass = {cs.AI},
url = {https://arxiv.org/abs/2605.13527}
}You can also use the machine-readable citation metadata in CITATION.cff.
