This repository contains the code and benchmark resources for SkillTab-Bench: Benchmarking Skill-Driven Agents for Multi-Turn Industrial Table Analysis.
SkillTab-Bench requires a Python environment with the Nanobot runtime available. The examples below use Bash.
git clone <repo-url>
cd SkillTab-Bench
python -m venv .venv
source .venv/bin/activate
python -m pip install --upgrade pip
Install Nanobot in the same environment. If Nanobot is available as a local source repository, install it in editable mode:
python -m pip install -e path/to/nanobot
Configure the model provider before running the benchmark. By default, the runner reads provider and model settings from:
$HOME/.nanobot/config.json
You may also pass another configuration file with --base-config.
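Before launching a run, it can help to confirm that the configuration file exists and parses as JSON. The following is a minimal pre-flight sketch, not part of the repository: the helper name `load_nanobot_config` is hypothetical, and the check makes no assumption about which keys the file contains.

```python
import json
from pathlib import Path


def load_nanobot_config(path=None):
    """Load a JSON config file, defaulting to $HOME/.nanobot/config.json."""
    # Hypothetical helper for illustration; not provided by SkillTab-Bench or Nanobot.
    config_path = Path(path) if path else Path.home() / ".nanobot" / "config.json"
    if not config_path.is_file():
        raise FileNotFoundError(f"Config file not found: {config_path}")
    with config_path.open(encoding="utf-8") as handle:
        # json.load raises json.JSONDecodeError if the file is malformed.
        return json.load(handle)


if __name__ == "__main__":
    try:
        load_nanobot_config()
    except (FileNotFoundError, json.JSONDecodeError) as error:
        print(f"Config check failed: {error}")
    else:
        print("Config loaded.")
```

Passing a path to `load_nanobot_config` checks the same file you would hand to `--base-config`.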
All commands below should be executed from the repository root.
List available task ids:
python isolated_benchmark_runner/run_isolated_task.py --list-tasks
Prepare benchmark workspaces without calling the model:
python isolated_benchmark_runner/run_isolated_task.py --skill-mode both --run-id exp_001
Run one task with skills enabled:
python isolated_benchmark_runner/run_isolated_task.py \
--task-id task_gaokao_shandong_2025_recommendation \
--skill-mode on \
--run-id exp_001 \
--execute
Run the same task with skills disabled:
python isolated_benchmark_runner/run_isolated_task.py \
--task-id task_gaokao_shandong_2025_recommendation \
--skill-mode off \
--run-id exp_001 \
--execute
Run selected tasks with both skill settings:
python isolated_benchmark_runner/run_isolated_task.py \
--task-id task_company_weekly_order_risk,task_company_weekly_sales_performance \
--skill-mode both \
--run-id exp_001 \
--execute
Run all tasks:
python isolated_benchmark_runner/run_isolated_task.py \
--all-tasks \
--skill-mode both \
--run-id exp_001 \
--execute
Use a custom provider configuration:
python isolated_benchmark_runner/run_isolated_task.py \
--task-id task_gaokao_shandong_2025_recommendation \
--skill-mode on \
--run-id exp_001 \
--base-config path/to/config.json \
--execute
Evaluate one completed run:
python isolated_benchmark_runner/evaluate_task_outputs.py \
--run-root runs/task_gaokao_shandong_2025_recommendation/skill_on/exp_001
Evaluate both skill settings for all tasks in a run:
python isolated_benchmark_runner/evaluate_task_outputs.py \
--run-id exp_001 \
--skill-mode both \
--all-tasks
Use a specific judge provider and model:
python isolated_benchmark_runner/evaluate_task_outputs.py \
--run-root runs/task_gaokao_shandong_2025_recommendation/skill_on/exp_001 \
--judge-config path/to/config.json \
--judge-provider <provider-name> \
--judge-model <model-name>
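When scripting several runs, the runner invocations shown above can be assembled programmatically instead of typed by hand. The sketch below is illustrative only: `build_run_command` and `expected_run_root` are hypothetical helpers, not part of the repository. They use only the flags that appear in the commands above, and the run directory layout `runs/<task-id>/skill_<on|off>/<run-id>` is an assumption inferred from the example `--run-root` path.

```python
import shlex
from pathlib import Path

# Script path taken from the commands in this README.
RUNNER = "isolated_benchmark_runner/run_isolated_task.py"


def build_run_command(task_ids, skill_mode, run_id, base_config=None, execute=True):
    """Return the runner command as an argument list (hypothetical helper)."""
    if skill_mode not in {"on", "off", "both"}:
        raise ValueError(f"Unsupported skill mode: {skill_mode!r}")
    command = [
        "python", RUNNER,
        "--task-id", ",".join(task_ids),  # comma-separated, as in the examples
        "--skill-mode", skill_mode,
        "--run-id", run_id,
    ]
    if base_config is not None:
        command += ["--base-config", str(base_config)]
    if execute:
        command.append("--execute")
    return command


def expected_run_root(task_id, skill_mode, run_id):
    """Guess the run directory for one task and one skill setting."""
    # Assumption: layout inferred from the example path
    # runs/<task-id>/skill_<on|off>/<run-id>; the runner may differ.
    if skill_mode not in {"on", "off"}:
        raise ValueError("A run root belongs to a single skill setting: 'on' or 'off'")
    return Path("runs") / task_id / f"skill_{skill_mode}" / run_id


if __name__ == "__main__":
    command = build_run_command(
        ["task_company_weekly_order_risk", "task_company_weekly_sales_performance"],
        skill_mode="both",
        run_id="exp_001",
    )
    print(shlex.join(command))  # copy-pasteable shell form of the command
    print(expected_run_root("task_company_weekly_order_risk", "on", "exp_001"))
```

The returned list can be passed to `subprocess.run(command, check=True)` from the repository root. Keeping the arguments as a list avoids shell quoting problems with task ids or config paths.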