sephirxth/LLM_code_test

LLM Code Test

Benchmark comparing LLM code generation quality across models and prompting strategies. Tests Claude, Gemini, GPT, Grok, and DeepSeek on a standardized physics simulation task.

What It Tests

All models receive the same prompt: implement a Python physics simulation (bouncing balls in a rotating heptagon with gravity, collision detection, and real-time rendering). This tests:

  • Physics accuracy — Collision detection, coordinate transforms, energy conservation
  • Code completeness — Full runnable program vs. partial snippets
  • Architecture quality — Code structure, modularity, readability
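
To make the physics requirement concrete, here is a rough illustrative sketch (not taken from any model's output) of the core geometry the task demands: computing the rotating heptagon's walls and bouncing a ball off them. Function names and the restitution parameter are illustrative choices, not part of the benchmark prompt.

```python
import math

def heptagon_vertices(radius, angle, n=7):
    # Regular n-gon centered at the origin, rotated by `angle` (CCW order).
    return [(radius * math.cos(angle + 2 * math.pi * k / n),
             radius * math.sin(angle + 2 * math.pi * k / n))
            for k in range(n)]

def collide_ball_with_walls(pos, vel, ball_r, verts, restitution=0.9):
    # For each edge of the CCW polygon: if the ball penetrates the wall,
    # push it back inside and reflect velocity about the inward normal.
    x, y = pos
    vx, vy = vel
    n = len(verts)
    for k in range(n):
        (x1, y1), (x2, y2) = verts[k], verts[(k + 1) % n]
        ex, ey = x2 - x1, y2 - y1
        length = math.hypot(ex, ey)
        nx, ny = -ey / length, ex / length      # inward unit normal
        dist = (x - x1) * nx + (y - y1) * ny    # signed distance to edge line
        if dist < ball_r:                       # ball overlaps this wall
            x += (ball_r - dist) * nx           # positional correction
            y += (ball_r - dist) * ny
            dot = vx * nx + vy * ny
            if dot < 0:                         # moving into the wall
                vx -= (1 + restitution) * dot * nx
                vy -= (1 + restitution) * dot * ny
    return (x, y), (vx, vy)
```

The submitted scripts each embed logic of this shape in a pygame render loop, with gravity applied to the velocity every frame; a restitution below 1 models energy loss on impact.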

Key Findings

Model             META Prompt   Canvas/Artifact   Result
Claude Opus 4.5   Yes           N/A               Best overall
Claude Opus 4     Yes           N/A               Strong
Gemini 3.0 Pro    Yes           No                Good
Gemini 2.5 Pro    Yes           Yes (hurts)       Canvas degrades quality
DeepSeek R1       N/A           N/A               Competitive

Conclusions:

  • Opus is slightly better than Gemini overall
  • The META prompt improves output across all models tested
  • Canvas/artifact mode degrades code quality (no canvas > canvas)
  • From Opus 4.5 onward, Claude clearly outperforms Gemini 3.0 Pro

When to Use This

  • Evaluating which LLM to use for code generation tasks
  • Comparing prompting strategies (META prompt vs. bare)
  • Studying the effect of artifact/canvas mode on code quality
  • Benchmarking new models against established baselines

Project Structure

├── input_prompt.txt                          # Standardized test prompt (Chinese)
├── claude_opus4.py                           # Claude Opus 4 output
├── claude_opus4_no_prompt.py                 # Opus 4, no META prompt
├── claude_opus4_no_prompt_no_artifact.py     # Opus 4, no prompt, no artifact
├── claude_opus_4.5_METAPrompt.py             # Opus 4.5 with META prompt
├── claude_sonnet_4.py                        # Sonnet 4 output
├── gemini2.5pro_canvas.py                    # Gemini 2.5 Pro with canvas
├── gemini2.5pro_no_canvas.py                 # Gemini 2.5 Pro without canvas
├── gemini_3.0_pro_METAprompt_no_canvas.py    # Gemini 3.0 Pro, META prompt, no canvas
├── deepseekR1.py                             # DeepSeek R1 output
├── grok4.py                                  # Grok 4 output
└── main.py                                   # Entry point

Run the Benchmark

# Install dependencies
pip install pygame

# Run any model's output
python claude_opus_4.5_METAPrompt.py
# → 800x800 window with bouncing balls in rotating heptagon

License

MIT
