This repo contains 5 tasks that I gave to both Claude Code and OpenAI Codex - in each case the agent was asked to create a detailed plan for how to achieve the task, without actually implementing it.
The tasks are in the `tasks` directory, and the responses from Claude Code and OpenAI Codex are in the `agent1` and `agent2` directories, respectively. The anonymous names are my attempt at being scientific: they avoid giving either agent any hints about whose plans were whose when it later performed the comparison.
agent1 = "Claude Code"
agent2 = "OpenAI Codex"
Given that Claude Code and OpenAI Codex were responsible for creating the plans, who better to ask for a verdict on which one did a better job? I wrote a detailed `PROCESS.md` file containing a prompt that asks the AI to compare the two sets of generated plans in various ways and give its verdict.
The verdicts are in the `verdicts` directory. To create `from-claude.md`, I ran `claude` on the CLI and gave it the prompt:

please read PROCESS.md and answer what it asks you

To create `from-codex.md`, I ran `codex` on the CLI and gave it the same prompt.
Neither agent knew whether `agent1` was Claude Code or OpenAI Codex. Both gave a slight edge in their verdicts to `agent2`, which is OpenAI Codex. You can read the full verdict text yourself, but the general vibe was that Codex produced more focused plans without what each judge perceived as over-engineering. That over-engineering may well actually help the LLM during implementation, so YMMV - this eval method is far from gospel.
Obviously, this is a sample of 5 tasks, run once, in a fairly non-scientific way. OpenAI Codex was released yesterday, so expect all of this to change.
Codex was considerably slower, both when generating the plans and when comparing them. It seemed to want to spend a lot more time reading than Claude Code did.