Artificial Collective Intelligence: Carpenter Ant-Inspired Stigmergic Swarm Robotics for Decentralized Systems Using Deep Reinforcement Learning
Individual ants possess limited cognitive capacity. Yet when thousands interact, colonies exhibit coordinated collective intelligence. This intelligence is not centralized in a single agent but emerges from decentralized interactions shaped by evolutionary processes. A key coordination mechanism is stigmergy—indirect communication through environmental modification (e.g., pheromone trails) that agents can sense and exploit.
Inspired by carpenter ants, this research investigates artificial collective intelligence arising from decentralized interactions with the environment, developing a computational framework for swarm robotics that contrasts with prevailing models of intelligence relying on large-scale centralized computing. To enable scalable experimentation, a custom multi-agent reinforcement learning simulation environment was created for decentralized policy learning. A physical swarm robotics platform was then constructed to evaluate the learned behaviours in the real world.
By coupling learned behaviours with shared “digital pheromone” fields, the system demonstrates how collective intelligence emerges from distributed agents, contributing to the emerging field of Physical AI, where intelligent algorithms interact directly with and control physical systems rather than operating only in digital environments. The robotic platform used to evaluate the system provides an accessible testbed for future swarm robotics research and is fully open-source, including algorithms, models, mechanical designs, firmware, component specifications, and assembly documentation.
Decentralized swarm systems have applications in environments where communication infrastructure is unreliable or centralized control is fragile, such as planetary exploration and disaster response. By leveraging local decision-making and redundancy, stigmergic swarm systems provide resilience, adaptability, and robustness under uncertainty.
There is a lot under docs/. This table points to the files people actually use.
- docs/ARCHITECTURE.md
- High-level system structure. Start here for the big picture.
- docs/API_REFERENCE.md
- API-level reference for the environment and surrounding tooling.
- docs/CONFIG_REFERENCE.md
- `SwarmConfig` and the shared CLI-to-config mapping.
- docs/TRAINING_MAPPO.md
- Main training guide for the recurrent MAPPO path.
- docs/GRUMMAPO.md
- GRU MAPPO overview plus implementation walkthrough.
- docs/TRAINING_CUSTOM.md
- Older custom DQN training path.
- docs/EVALUATE.md
- How to run single-checkpoint and comparison evaluations.
- docs/EVALUATION_CUSTOM.md
- Older evaluation notes for the custom DQN workflow.
- docs/EXPERIMENT_API.md
- Experiment framework and benchmark entry points.
- docs/ONBOARDING.md
- Main getting-started guide with current commands and key files.
- docs/QandA.md
- Referenced by older notes, but not present in this checkout.
- docs/PROJECT_LOG.md
- Referenced by older notes, but not present in this checkout.
- docs/curriculum_training.md
- Training walkthrough with emphasis on the curriculum path.
- docs/trail.md
- Design note on the intended trail behavior: discover, return, deposit, exploit.
- docs/best_rl.md
- RL strategy discussion and alternatives.
- docs/CTDE_STATE.md
- CTDE state design for the MAPPO critic.
- docs/OBSERVATION_SPEC.md
- Observation layout reference.
- docs/ACTION_SPEC.md
- Action-space reference.
- docs/UML.md
- Mermaid diagrams for the environment and system structure.
- docs/DQN_EXPLAINED.md
- Plain-language DQN walkthrough tied to this codebase.
- docs/RESOURCES.md
- External learning and reference links.
- docs/audit.md
- Referenced by older notes, but not present in this checkout.
- docs/ToDo.md
- Open tasks and future work ideas.
- docs/SimToReal.md
- Sim-to-real deployment notes.
- docs/PI_MIGRATION.md
- Raspberry Pi migration notes.
- docs/BLUETOOTH_FIRMWARE_NOTES.md
- BLE and firmware-side communication notes.
- docs/MISSION_CONTROL.md
- Mission Control system overview.
- docs/CAMERA_SETUP.md
- Raspberry Pi camera bring-up and recovery notes.
- docs/MISSION_CONTROL_FIRMWARE_PSEUDOCODE.md
- Pseudocode view of the Mission Control firmware path.
- docs/CAMERA_PROPOSAL.md
- Referenced by the setup notes, but not present in this checkout.
- docs/manual/QUICK_START.md
- Shortest path to a working run.
- docs/manual/PROJECT_STRUCTURE.md
- Project layout guide.
- docs/manual/EXPERIMENT_GUIDE.md
- Manual for running experiments and reading the outputs.
- docs/manual/RESULTS_INTERPRETATION.md
- How to read metrics and results artifacts.
- docs/swarm_robotics_presentation.md
- Referenced by older notes, but not present in this checkout.
- docs/tasks/
- Historical task records from the build-out of the system.
A multi-agent PyGame environment for swarm RL with pheromone stigmergy, using the PettingZoo Parallel API. The repo currently contains:
- a custom DQN baseline
- a recurrent GRU MAPPO training path with CTDE
- demo and evaluation utilities
- a robot-facing runtime that preserves decentralized inference
Default scene: tank dynamics with pheromone heatmap enabled.
No pheromone rendering: same environment without the heatmap overlay.
Hovercraft dynamics: agents drift slightly due to inertia/noise.
Dense swarm: more agents, targets, and obstacles.
Generate the screenshots locally:
```
python train/capture_screenshots.py
```
Set up a virtual environment and install dependencies:
```
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```
For robot deployment with firmware/run.py, pyserial is now included in requirements.txt because firmware/ant.py depends on it.
```
python train/random_rollout.py
```
The current recommended research path is recurrent MAPPO with the staged curriculum. The intended behavior is:
- explore to discover a target
- pick it up
- visibly switch into a carrying-food state
- return to the nest
- deposit pheromone on the successful return route
- complete one delivery by reaching the nest while carrying
- let later agents exploit that trail
The current default task settings now make that loop more explicit:
- there are 3 food sources in play
- each source has 4 uses
- a source loses one use on pickup
- when a source is exhausted, it respawns somewhere else if target respawn is enabled
- agents can carry only one food item at a time
- carrying agents render in a distinct green-highlighted color in demo mode
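The pickup/respawn bookkeeping described above can be sketched in a few lines. This is an illustrative stand-in, not the repo's implementation: the `FoodSource` class and `on_pickup` helper are hypothetical names, and the real logic lives in env/swarm_env.py.

```python
import random
from dataclasses import dataclass

@dataclass
class FoodSource:
    """Hypothetical stand-in for a food source with limited uses."""
    x: float
    y: float
    uses_left: int = 4  # matches the default food-source capacity of 4

def on_pickup(source, arena_w, arena_h, respawn=True, rng=random):
    """One pickup consumes one use; an exhausted source respawns elsewhere."""
    source.uses_left -= 1
    if source.uses_left <= 0 and respawn:
        # relocate the source and refill its uses, mirroring target respawn
        source.x, source.y = rng.uniform(0, arena_w), rng.uniform(0, arena_h)
        source.uses_left = 4
    return source

src = FoodSource(x=100.0, y=50.0, uses_left=1)
on_pickup(src, arena_w=800, arena_h=600)
print(src.uses_left)  # refilled to 4 after respawn
```

With `--target-respawn` disabled, the exhausted source would simply stay empty instead of relocating.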
Recommended full training run:
```
python train/train.py --backend mappo --headless --curriculum full --n-agents 6 --total-steps 600000 --rollout-steps 128 --update-epochs 4 --minibatch-size 256 --eval-every 10000 --eval-episodes 5 --stage-repeat-limit 1 --reward-pickup 6 --reward-nest-delivery 30 --reward-undelivered-food -10 --folder-name mappo_full_run
```
Recommended demo checkpoint after training:
```
python train/demo.py --backend mappo --checkpoint-dir checkpoints/mappo_full_run/best_greedy_eval --max-steps 300 --render-scale 0.75
```
Current single-agent return stack:
- `stage1a_single_agent_miniscule`
- `stage1b_single_agent_tiny`
- `stage1c_single_agent_small`
- `stage1d_single_agent_carry_bootstrap` - the agent starts already carrying food and learns pure homing first
- `stage1e_single_agent_guaranteed_homing` - the first normal pickup-plus-delivery homing stage
- `stage1f_single_agent_delivery_bridge` - the first mild clutter return stage
- `stage1g_single_agent_delivery_obstacles` - the first true single-agent obstacle-return stage
Prompt 38 makes the bridge stage more continuity-preserving instead of letting it become an abrupt collapse point:
- `stage1f_single_agent_delivery_bridge` now uses a smaller arena jump from `stage1e`
- its single obstacle is intentionally smaller than the later obstacle-return stage
- target placement now tries to preserve a clear return corridor from nest to target in the bridge stage
- the bridge still contains clutter, but it is meant to keep greedy homing alive rather than replace it with a new task
Prompt 39 then stabilizes bridge-stage greedy delivery at the trainer level:
- stage-end evaluation now restores and evaluates the best within-stage policy instead of the last drifted one
- that keeps `stage1f_single_agent_delivery_bridge` from ending on a worse policy than the one it already discovered earlier in the stage
- this is the first configuration that looks credible for a real full training run rather than only more stage-1 debugging
Prompt 40 then adds an explicit small-swarm bootstrap stack so the first scale-up does not wipe out the learned delivery loop:
- `stage2a_small_swarm_carry_bootstrap` - small swarm, already carrying, pure homing
- `stage2b_small_swarm_delivery_easy` - small swarm, easy pickup-plus-delivery, no clutter, pheromone off
- `stage2c_small_swarm_medium` - first real small-swarm medium stage, still delivery-first and pheromone off
- `stage2d_small_swarm_large` - larger small-swarm stage where trail behavior can return
Prompt 41 then reduces repeat and budget pressure in the already-solved single-agent carry/bootstrap stages so stage1_to_2 and full runs reach the new swarm bootstrap stages instead of spending too much budget re-proving stage-1 lessons.
Prompt 44 and prompt 45 then strengthen the late swarm behavior around the nest:
- post-delivery outward pressure stays active until agents actually leave the nest zone
- non-carrying agents are pushed to fan out and explore instead of orbiting the nest
- late swarm stages now use strong non-carrying loiter, crowding, idle, and no-outward-progress penalties near the nest
Prompt 46 then goes beyond reward shaping and adds an explicit env-side “leave the nest zone” mode for empty agents in the late swarm stages:
- if a non-carrying agent remains inside the configured nest-adjacent force-explore radius, the env can override its chosen action with an outward-moving action
- this is meant to break the specific orbiting / turn-in-place / local-circling failure mode that larger scalar penalties alone did not eliminate
- carrying-food return behavior is unchanged; this mode is only for empty agents near the nest
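The env-side override for empty agents near the nest can be sketched as follows. The function name, the action tuple shape, and the turn threshold are hypothetical; the real override lives in the environment's step logic.

```python
import math

def force_explore_action(agent_pos, agent_heading, carrying, nest_pos,
                         chosen_action, force_radius):
    """Sketch of the env-side override described above (names hypothetical).

    If a non-carrying agent sits inside the nest-adjacent force-explore
    radius, steer it outward instead of letting it orbit; carrying agents
    always keep their chosen action."""
    if carrying:
        return chosen_action
    dx, dy = agent_pos[0] - nest_pos[0], agent_pos[1] - nest_pos[1]
    if math.hypot(dx, dy) >= force_radius:
        return chosen_action
    # heading pointing directly away from the nest
    outward = math.atan2(dy, dx)
    # signed heading error, wrapped into [-pi, pi]
    err = (outward - agent_heading + math.pi) % (2 * math.pi) - math.pi
    turn = 0 if abs(err) < 0.3 else (1 if err > 0 else -1)
    return (1, turn, 0)  # full throttle outward, never deposit while forced
```

The key property is the last bullet above: the override only ever fires for empty agents inside the radius, so learned carrying-food return behavior is untouched.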
Prompt 37 also fixed a real environment bug in env/swarm_env.py: the tank and hover movement drivers were dropping carrying_food during normal movement updates, which could silently break return-to-nest lessons immediately after the first move.
What the main arguments mean:
- `--backend mappo` - use the recurrent MAPPO trainer instead of the DQN/SB3/RLlib paths
- `--headless` - run without opening a PyGame window so long training is faster and more stable
- `--curriculum full` - train through the full staged curriculum rather than only the early stages
- `--n-agents 6` - target six agents for the later full-swarm curriculum stages
- `--total-steps 600000` - total training budget across the whole curriculum
  - this budget is distributed across the curriculum slice you selected, not across omitted stages
  - this is much more serious than a short 30k–200k run because the later stages need real time
- `--rollout-steps 128` - collect on-policy rollouts in chunks of 128 steps before PPO-style updates
- `--update-epochs 4` - run four optimization passes over each collected rollout batch
- `--minibatch-size 256` - minibatch size used during PPO optimization
- `--eval-every 10000` - run evaluation every 10,000 training steps
- `--eval-episodes 5` - use five episodes for each scheduled evaluation so eval is less noisy
- `--stage-repeat-limit 1` - allow one retry when a stage still fails its minimum greedy pickup/delivery target
  - this helps prevent weak early stages from being silently promoted
- `--reward-pickup 6` - keep pickup meaningful, but not as important as completed delivery
- `--reward-nest-delivery 30` - make successful return-to-nest delivery the strongest core task reward
- `--reward-undelivered-food -10` - penalize ending an episode while still carrying food
  - this helps discourage "pick up but never bring it home"
- `--folder-name mappo_full_run` - base name for checkpoints and run outputs
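The rollout/update arguments interact in a predictable way. The bookkeeping below assumes per-agent transitions are pooled into one on-policy batch, which is an assumption about this trainer rather than a documented fact:

```python
def gradient_steps_per_rollout(rollout_steps, n_agents, update_epochs, minibatch_size):
    """Rough PPO bookkeeping sketch: each rollout yields
    rollout_steps * n_agents pooled samples, which are replayed
    update_epochs times in minibatches of minibatch_size."""
    batch_size = rollout_steps * n_agents          # 128 * 6 = 768 samples
    minibatches = batch_size // minibatch_size     # 768 // 256 = 3 per epoch
    return update_epochs * minibatches             # 4 * 3 = 12 gradient steps

print(gradient_steps_per_rollout(128, 6, 4, 256))  # -> 12
```

So the recommended settings do twelve gradient steps per 128-step rollout; shrinking `--minibatch-size` increases that number without collecting more data.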
Why this is the recommended starting point:
- the current curriculum spreads learning across many stages
- the early stages now use a more responsive `action_repeat_steps = 1`, while later stages keep a smoother `action_repeat_steps = 2`
- the early stages now deliberately simplify the task: pheromone is disabled in stage 1, movement penalties are softened, and pickup/delivery cues are stronger so greedy `pickup -> return -> deliver` behavior can form first
- prompts 30, 31, and 32 now suppress exploration reward while carrying in the return-focused stages, add a dedicated guaranteed-homing stage with controlled target/agent placement, and then reintroduce clutter through a bridge stage before the true obstacle-return stage
- prompt 33 now tightens trainer pressure on those early return stages: entropy decays faster there, greedy checkpoint scoring weights completed delivery and conversion more heavily, and stage summaries explicitly print sampled-vs-greedy pickup/delivery gaps
- prompt 34 makes `--total-steps` a real hard global cap in the trainer and tightens later-stage scoring/promotion so pickup-without-delivery is treated as failure rather than progress
- prompt 35 now adds carrying-phase stall penalties and metrics, so once an agent is carrying food the trainer can measure and penalize no-progress / low-displacement return behavior instead of only noticing pickup and delivery endpoints
- prompt 36 now pushes the homing stages further toward deterministic return behavior by adding sustained carrying-progress shaping, lowering return-stage entropy more aggressively, and tightening early return-stage delivery/conversion promotion targets
- the trainer now decays entropy within each stage instead of keeping one fixed exploration pressure forever, so early rollouts can explore while later updates in the same stage become more deterministic
- the early stages keep a slightly stronger exploration bonus, and later stages reduce `reward_new_cell` so delivery and trail reuse compete less with wandering
- the trainer now keeps a fixed padded centralized critic state dimension across the selected curriculum so the critic can carry across stages instead of resetting whenever the stage shape changes
- stage progression is now greedy-eval-aware, with optional repeats when pickup/delivery remain below minimum promotion targets
- the trainer now prints sampled-vs-greedy pickup/delivery gaps and saves `best_greedy_eval/` so demo can use the strongest greedy checkpoint instead of assuming `latest/` is best
- stage summaries now also print pickup-to-delivery conversion, which is the main signal for whether return-to-nest behavior is actually forming
- episode/eval CSVs now also include carrying-phase signals such as stall events, low-progress fraction, low-displacement fraction, and carrying-penalty totals
- checkpoint metadata now also records the sustained carrying-progress reward coefficient used for the stage
- the final stage is still large and hard: 1400x950, 18 obstacles, 6 agents
- a shorter run can finish, but often leaves the later full-swarm stages undertrained
- 600k is not guaranteed to be optimal, but it is a practical strong starting point for the current repo
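The within-stage entropy decay mentioned above can be sketched as a linear schedule. The start/end coefficients here are placeholders, not the trainer's actual values:

```python
def entropy_coef(step_in_stage, stage_steps, start=0.01, end=0.001):
    """Linear within-stage entropy decay: exploratory early rollouts,
    near-deterministic updates by stage end (coefficients hypothetical)."""
    frac = min(step_in_stage / max(stage_steps, 1), 1.0)
    return start + frac * (end - start)

print(entropy_coef(0, 10000))      # start of a stage: full exploration pressure
print(entropy_coef(10000, 10000))  # end of a stage: near-deterministic
```

Because the schedule resets at each stage boundary, every new stage gets a fresh burst of exploration before the policy is pushed back toward determinism.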
Short smoke test:
```
python train/train.py --backend mappo --headless --curriculum stage1 --n-agents 6 --total-steps 2400 --rollout-steps 64 --update-epochs 2 --minibatch-size 128 --eval-every 0 --no-plots --n-targets 3 --active-targets 3 --food-source-capacity 4 --target-respawn --folder-name mappo_trail_smoke
```
Render the trained MAPPO policy:
```
python train/demo.py --backend mappo --checkpoint-dir checkpoints/mappo_full_run/best_greedy_eval --max-steps 300 --render-scale 0.75
```
In demo mode, agents now switch to a distinct carrying-food color after pickup
and return to the normal agent color after a completed nest delivery.
The PyGame window title also shows the current reset seed, and the HUD shows
step, pickups, deliveries, pheromone drops, and carrying-agent count.
You can press Space to pause/resume and click an agent to open an inspector
panel showing that agent's current inputs and model outputs.
Cycle demo resets through only specific seeds:
```
python train/demo.py --backend mappo --checkpoint-dir checkpoints/mappo_full_run/best_greedy_eval --max-steps 0 --render-scale 0.75 --seed-list 45,40,58
```
Headless MAPPO evaluation:
```
python analysis/evaluate.py --policy-kind mappo_gru --checkpoint-dir checkpoints/mappo_full_run/best_greedy_eval --n-agents 6 --episodes 10 --headless --output-dir runs/eval --filename mappo_full_run_eval --active-targets 3 --food-source-capacity 4
```
Pheromone comparison example:
```
python analysis/evaluate_comparison.py --policy-kind mappo_gru --checkpoint-with-pheromone checkpoints/mappo_full_run/best_greedy_eval --checkpoint-without-pheromone checkpoints/mappo_no_pher_run/best_greedy_eval --agent-min 1 --agent-max 6 --episodes-per-agent 3 --headless --output-dir experiments/experiment_data/trail_compare
```
Train headless and save checkpoints:
```
python train/independent_dqn_pytorch.py --headless --total-steps 10000 --save-dir checkpoints --folder-name demo_run --save-every 2000
```
Render the trained policy:
```
python train/demo.py --checkpoint-dir checkpoints/demo_run/full_policy
```
Basic headless training:
```
python train/independent_dqn_pytorch.py --headless --total-steps 10000
```
Use a shared policy:
```
python train/independent_dqn_pytorch.py --shared-policy --headless
```
Save checkpoints during training:
```
python train/independent_dqn_pytorch.py --headless --save-every 2000 --save-dir checkpoints --folder-name demo_run
```
Train with an explicit epsilon schedule:
```
python train/independent_dqn_pytorch.py --headless --total-steps 30000 --save-dir checkpoints --folder-name demo_run --epsilon-start 1.0 --epsilon-final 0.05 --epsilon-decay-steps 20000 --warmup-steps 2000
```
If you run that command manually, keep it on one shell line or use `\` line continuations exactly. Entering `>`-prefixed continuation lines or isolated flag lines can make the shell create empty files such as `--epsilon-final`, `--epsilon-decay-steps`, or `--warmup-steps` in the repo root.
Default (custom DQN):
```
python train/train.py --backend custom --headless --total-steps 10000
```
Stable-Baselines3 DQN (shared policy):
```
python train/train.py --backend sb3 --headless --total-steps 10000 --save-path checkpoints/sb3_dqn.zip
```
RLlib DQN (shared policy):
```
python train/train.py --backend rllib --headless --total-steps 10000 --save-dir checkpoints/rllib_dqn
```
If Ray warns about /tmp being full or socket path length, point it to a different temp dir:
```
python train/train.py --backend rllib --headless --total-steps 10000 --save-dir checkpoints/rllib_dqn --ray-tmpdir /Users/christopherlin/.ray_tmp
```
Default demo (custom DQN):
```
python train/demo.py --checkpoint-dir checkpoints
```
Shared-policy demo:
```
python train/demo.py --checkpoint-dir checkpoints --shared-policy
```
SB3 demo:
```
python train/demo.py --backend sb3 --sb3-model checkpoints/sb3_dqn.zip
```
RLlib demo:
```
python train/demo.py --backend rllib --rllib-checkpoint checkpoints/rllib_dqn
```
MAPPO demo:
```
python train/demo.py --backend mappo --checkpoint-dir checkpoints/mappo_full_run/best_greedy_eval --max-steps 300 --render-scale 0.75
```
With a custom Ray temp dir:
```
python train/demo.py --backend rllib --rllib-checkpoint checkpoints/rllib_dqn --ray-tmpdir /Users/christopherlin/.ray_tmp
```
To auto-exit after N steps (useful for smoke tests):
```
python train/demo.py --backend custom --max-steps 200
```
The current robot runtime lives in firmware/.
Main files:
- firmware/ant.py
- serial client and low-level ESP32 command helpers
- firmware/run.py
- direct checkpoint inference loop for the physical robot
- firmware/ant.service
- example systemd unit
Typical deployment flow:
```
python firmware/run.py --checkpoint-dir checkpoints --shared-policy
```
This runtime loads models/q_network.py, reads scan lines from ant.py, normalizes the observation locally, predicts a discrete action, and sends turn, move, or stop commands directly to the robot.
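The shape of that loop can be sketched without the hardware. The helper names, the normalization range, and the command table below are hypothetical stand-ins, not the firmware's actual API:

```python
def normalize_scan(raw_scan, max_range=200.0):
    """Clamp and scale raw lidar distances into [0, 1] (range hypothetical)."""
    return [min(max(d, 0.0), max_range) / max_range for d in raw_scan]

def pick_command(q_values, action_names):
    """Greedy action selection followed by a command lookup, mirroring the
    read-scan -> normalize -> predict -> send loop described above."""
    best = max(range(len(q_values)), key=lambda i: q_values[i])
    return action_names[best]

# hypothetical stand-ins for the Q-network output and command table
q = [0.1, 0.7, 0.2]
commands = ["turn_left", "move_forward", "stop"]
print(pick_command(q, commands))  # -> move_forward
```

Because selection is a plain argmax over local observations, inference stays fully decentralized: nothing in the loop needs another robot or a central server.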
The repo now includes a separate live operator subsystem in mission_control/. It is a PyGame command center for:
- receiving live robot `POS`, `PHER`, and `SENSE` messages
- visualizing robot positions and trails
- maintaining the authoritative digital pheromone field
- returning simulator-compatible `PHER_RESP` samples back to robots
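The actual wire format of those messages is not documented here, so the sketch below assumes a simple comma-separated line protocol purely for illustration; `parse_robot_line` and `pher_resp` are hypothetical helpers, not Mission Control's real API:

```python
def parse_robot_line(line):
    """Split one incoming message line into (kind, fields).

    Assumes a 'KIND,field1,field2,...' line protocol; the real format
    lives in mission_control/ and firmware/."""
    kind, *fields = line.strip().split(",")
    return kind, fields

def pher_resp(samples):
    """Format pheromone samples into a PHER_RESP reply line (format assumed)."""
    return "PHER_RESP," + ",".join(f"{s:.3f}" for s in samples)

kind, fields = parse_robot_line("POS,robot_0,1.25,3.50,90")
print(kind, fields)                # POS ['robot_0', '1.25', '3.50', '90']
print(pher_resp([0.0, 0.5, 1.0]))  # PHER_RESP,0.000,0.500,1.000
```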
Run it locally:
```
python -m mission_control.main --tcp-host 127.0.0.1 --tcp-port 8765
```
Run a local fake robot against it:
```
python -m mission_control.fake_robot --robot-id robot_0 --port 8765
```
Documentation:
SwarmEnv implements the PettingZoo Parallel API.
Methods:
- `reset(seed=None, options=None) -> (obs_dict, info_dict)`
- `step(action_dict) -> (obs_dict, rewards_dict, terminations, truncations, infos)`
- `render(mode="human", fps=60)`
- `close()`
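The reset/step contract can be exercised with a minimal random-policy driver. `run_episode` is an illustrative sketch, not part of the repo; it works against anything that follows the Parallel API signatures above:

```python
import random

def run_episode(env, max_steps=100, n_actions=18, seed=0):
    """Drive a PettingZoo-parallel-style env with random actions.

    Mirrors the contract above: dict observations in, dict actions
    keyed by agent id out, dict rewards accumulated per agent."""
    rng = random.Random(seed)
    obs, info = env.reset(seed=seed)
    total = {agent: 0.0 for agent in obs}
    for _ in range(max_steps):
        actions = {agent: rng.randrange(n_actions) for agent in obs}
        obs, rewards, terms, truncs, infos = env.step(actions)
        for agent, r in rewards.items():
            total[agent] = total.get(agent, 0.0) + r
        if all(terms.values()) or all(truncs.values()):
            break
    return total
```

This is essentially what train/random_rollout.py does, minus rendering.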
Discrete action space with 18 actions:
{throttle ∈ [-1,0,1]} × {turn ∈ [-1,0,1]} × {deposit ∈ [0,1]}.
Provide actions as a dict keyed by agent id (e.g., agent_0) with values in [0, 17].
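The 3 × 3 × 2 factorization can be decoded from a flat index with integer arithmetic. The ordering below (deposit fastest, then turn, then throttle) is an assumption for illustration; the authoritative mapping lives in env/swarm_env.py:

```python
THROTTLES = [-1, 0, 1]
TURNS = [-1, 0, 1]
DEPOSITS = [0, 1]

def decode_action(a):
    """Map a flat action index in [0, 17] to (throttle, turn, deposit).

    Assumes deposit varies fastest, then turn, then throttle;
    the env's real ordering may differ."""
    assert 0 <= a < 18
    deposit = DEPOSITS[a % 2]
    turn = TURNS[(a // 2) % 3]
    throttle = THROTTLES[a // 6]
    return throttle, turn, deposit

print(decode_action(0))   # (-1, -1, 0)
print(decode_action(17))  # (1, 1, 1)
```

Every index maps to a unique triple, so the 18 discrete actions cover the full product space exactly once.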
Each agent gets a local observation history vector; reset/step return a
dict of agent_id -> obs.
Current default per-frame features:
- 9 lidar rays
- 2 nearest detectable target features: distance and relative angle
- 2 nest-direction features
- 2 nearest-neighbor features
- 2 heading features: `sin(theta)`, `cos(theta)`
- 1 food-presence flag
- 1 carrying-food flag
- 3 pheromone samples
Current default observation size:
- 23 features per frame
- `observation_history_steps = 3`
- flattened `obs_dim = 69`
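The per-frame budget above adds up exactly, which is worth checking whenever the feature set changes:

```python
# Per-frame feature budget from the observation spec above
FRAME_FEATURES = {
    "lidar_rays": 9,
    "nearest_target": 2,   # distance, relative angle
    "nest_direction": 2,
    "nearest_neighbor": 2,
    "heading": 2,          # sin(theta), cos(theta)
    "speed": 1,
    "food_presence": 1,
    "carrying_food": 1,
    "pheromone_samples": 3,
}
features_per_frame = sum(FRAME_FEATURES.values())
observation_history_steps = 3
obs_dim = features_per_frame * observation_history_steps
print(features_per_frame, obs_dim)  # 23 69
```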
The environment maintains a pheromone grid with:
- explicit deposit vs no-deposit action choice
- decay
- bounded diffusion
- carrying-food scaling
- optional gating so deposition only happens while carrying food and making return-to-nest progress
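The decay-plus-bounded-diffusion update can be sketched on a plain 2D grid. The coefficients and the 4-neighbor diffusion stencil here are hypothetical; the real update (and the deposit gating) lives in env/swarm_env.py:

```python
def pheromone_step(grid, decay=0.98, diffuse=0.05, max_value=1.0):
    """One update of a 2D pheromone grid: multiplicative decay, a small
    4-neighbor diffusion, and a hard upper bound (coefficients hypothetical)."""
    h, w = len(grid), len(grid[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            v = grid[y][x] * decay
            # pull in a fraction of each 4-neighbor's value
            for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                ny, nx = y + dy, x + dx
                if 0 <= ny < h and 0 <= nx < w:
                    v += diffuse * grid[ny][nx]
            out[y][x] = min(v, max_value)  # bounded: never exceeds max_value
    return out

grid = [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]
grid = pheromone_step(grid)
print(grid[1][1], grid[0][1])  # center decays to 0.98, neighbors gain 0.05
```

Decay makes stale trails fade so the swarm forgets exhausted sources; the bound keeps heavily reinforced trails from saturating the field.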
The intended training story is trail formation:
- discovery is expensive early
- successful returns write route hints into the environment
- later agents can exploit those hints
- `env/swarm_env.py`: environment implementation
- `env/config.py`: configuration dataclass
- `train/random_rollout.py`: random policy sanity check
- `train/independent_dqn_pytorch.py`: independent or shared DQN training
- `train/demo.py`: load and render trained checkpoints
- `firmware/run.py`: physical robot policy runtime
- `firmware/ant.py`: direct serial robot interface
- `docs/ARCHITECTURE.md`: detailed functionality and architecture
- `docs/PROJECT_LOG.md`: decisions, notes, and next steps to resume later
- Episode ends when all targets are collected or `max_steps` is reached.
- Dynamics are pluggable via `dynamics_mode`: `"tank"`, `"hover"`, or `"mixed"`.
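The difference between the two dynamics modes can be sketched as follows. The kinematics and coefficients are illustrative only, not the env's actual equations:

```python
import math

def tank_step(x, y, theta, throttle, turn, speed=3.0, turn_rate=0.2):
    """Tank mode sketch: velocity follows the current heading directly,
    so the agent stops the moment throttle goes to zero."""
    theta += turn * turn_rate
    x += throttle * speed * math.cos(theta)
    y += throttle * speed * math.sin(theta)
    return x, y, theta

def hover_step(x, y, theta, vx, vy, throttle, turn,
               thrust=0.6, turn_rate=0.2, drag=0.95):
    """Hover mode sketch: thrust accelerates along the heading, but
    inertia and drag keep the agent drifting (coefficients hypothetical)."""
    theta += turn * turn_rate
    vx = vx * drag + throttle * thrust * math.cos(theta)
    vy = vy * drag + throttle * thrust * math.sin(theta)
    return x + vx, y + vy, theta, vx, vy
```

The drag term is why hovercraft agents "drift slightly due to inertia/noise" in the screenshots above, while tank agents track their commands exactly.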
If you’re new to PyGame and RL, this section gives a quick mental model and a practical path to running the project.
- Simulates a swarm of agents in a 2D PyGame world.
- Exposes an RL-style API (`reset`, `step`) with multi-agent observations and rewards.
- Adds stigmergy via a pheromone grid that agents can sense and write to.
- Supports a trail-formation objective where agents learn `discover -> return -> deposit -> exploit`.
- Random sanity check (renders a window):
```
python train/random_rollout.py
```
- Train a basic model (headless, no window):
```
python train/independent_dqn_pytorch.py --headless --total-steps 10000 --save-dir checkpoints --save-every 2000
```
- Run the demo with the trained model (renders a window):
```
python train/demo.py --checkpoint-dir checkpoints
```
Each step:
- You give actions for each agent.
- The environment moves agents, handles collisions, and collects targets.
- You receive rewards + new observations.
- +8 when an agent reaches a target
- -0.01 each step (encourages speed)
- -0.2 for collisions with walls/obstacles
So the learned behavior should be: “find targets quickly without crashing.”
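The reward values above combine into a simple per-step function (values taken from this list; the env's full shaping adds more terms):

```python
def step_reward(reached_target, collided,
                r_target=8.0, r_time=-0.01, r_collision=-0.2):
    """Per-agent reward sketch matching the values above: a target bonus,
    a small per-step time penalty, and a collision penalty."""
    r = r_time  # paid every step, so faster policies score higher
    if reached_target:
        r += r_target  # +8 target bonus, minus the step's time penalty
    if collided:
        r += r_collision  # wall/obstacle collision penalty
    return r

print(step_reward(True, False))  # target step: about +7.99
print(step_reward(False, True))  # collision step: about -0.21
```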
- PyGame is only used for rendering and window events.
- If you run with `--headless`, no window is opened and PyGame doesn't render.
- `env/config.py` controls most parameters:
- number of agents/targets/obstacles
- rewards
- lidar rays
- pheromone settings
- non-carrying exploration controls, including env-side random exploration overrides
- render scaling via `render_scale`
- dynamics mode (`tank`, `hover`, `mixed`)
See docs/ARCHITECTURE.md for a full breakdown of modules and data flow.