A concise guide to generating `code/`, `data/`, and LLM solutions with IO-level testing.
```bash
# Install dependencies
pip install -r requirements.txt
```

Download the complete benchmark dataset (626 problems) from Hugging Face:

```bash
python hf_dataset.py download
```

This creates `data/dataset/` with 626 JSON files containing:
- Problem descriptions
- Canonical solutions (7 languages)
- Test case generators
- Evaluators
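As a sketch of how the downloaded files might be consumed, the snippet below builds a toy stand-in for `data/dataset/` and loads it; every field name (`description`, `solutions`, `test_generator`, `evaluator`) is an illustrative assumption, not the dataset's documented schema.

```python
import json
from pathlib import Path
from tempfile import TemporaryDirectory

# Toy stand-in for data/dataset/ -- the field names below are illustrative
# assumptions, not the dataset's actual schema.
with TemporaryDirectory() as tmp:
    dataset_dir = Path(tmp) / "dataset"
    dataset_dir.mkdir()
    (dataset_dir / "two_sum.json").write_text(json.dumps({
        "description": "Return indices of two numbers that sum to the target.",
        "solutions": {"python": "def two_sum(...): ...", "cpp": "/* ... */"},
        "test_generator": "def gen(): ...",
        "evaluator": "def check(actual, expected): ...",
    }))

    # Load every problem file, as a consumer of data/dataset/ would.
    problems = {p.stem: json.loads(p.read_text())
                for p in dataset_dir.glob("*.json")}

print(len(problems), sorted(problems["two_sum"]["solutions"]))
```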
Generate executable code files from the dataset:

```bash
python evaluate_solution.py evaluate -o data/evaluation
```

This creates `code/{python,cpp,java,javascript,golang,ruby}/` with complete runnable code files including:
- Solution implementation
- Test framework
- Auto-generated test runners
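To make the three pieces concrete, here is a minimal self-contained sketch of the *shape* such a generated Python file could take (a toy problem with an inlined runner); the real generator's output will differ in detail.

```python
# Illustrative shape of one generated code/python/<problem>.py file.

def solve(a, b):
    """Solution implementation (toy problem: add two integers)."""
    return a + b

# Test framework: (args, expected) pairs baked into the file.
TEST_CASES = [((1, 2), 3), ((10, -4), 6)]

def run_tests():
    """Auto-generated test runner: execute every case, report pass/fail."""
    failures = [(args, expected, solve(*args))
                for args, expected in TEST_CASES if solve(*args) != expected]
    print("PASS" if not failures else f"FAIL: {failures}")
    return not failures

if __name__ == "__main__":
    run_tests()
```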
Use an LLM to generate and test solutions for IO-type problems.

Configure the API key (edit `io_generate_and_test.py`):

```python
OPENAI_API_KEY = "sk-your-api-key-here"
OPENAI_API_URL = "https://api.openai.com/v1"
MODEL_NAME = "gpt-4o"
```

Run generation + testing:
```bash
# Generate solutions and test them (lite mode: stop after first failure)
python io_generate_and_test.py both

# Full mode: test all cases
python io_generate_and_test.py both --full

# Generate only (no testing)
python io_generate_and_test.py generate

# Test only (existing solutions)
python io_generate_and_test.py test
```

Output structure:
```
out-gpt-4o/
├── {problem_name}/
│   ├── solution.py         # Generated solution
│   ├── prompt.txt          # LLM prompt
│   └── raw_output.txt      # Raw LLM response
├── generation_results.json
├── testing_results.json
└── all_results.json        # Combined results
```
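Once a run has produced `all_results.json`, it can be summarized in a few lines. The per-problem record shape below (`generated`/`passed` flags) is a plausible assumption for illustration; the script's real schema may differ.

```python
import json
from pathlib import Path

# Hypothetical shape of all_results.json -- the real schema may differ.
all_results = {
    "problem_a": {"generated": True, "passed": True},
    "problem_b": {"generated": True, "passed": False},
    "problem_c": {"generated": False, "passed": False},
}
# Against a real run you would load the file instead:
# all_results = json.loads(Path("out-gpt-4o/all_results.json").read_text())

generated = sum(r["generated"] for r in all_results.values())
passed = sum(r["passed"] for r in all_results.values())
print(f"generated {generated}/{len(all_results)}, "
      f"passed {passed}/{len(all_results)}")
```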
| Folder | Content | Command |
|---|---|---|
| `data/dataset/` | 626 problem files from HF Hub | `python hf_dataset.py download` |
| `code/` | Executable code (7 languages) | `python evaluate_solution.py evaluate` |
| `out-gpt-4o/` | LLM-generated IO solutions | `python io_generate_and_test.py both` |
```bash
# Check dataset
ls data/dataset/*.json | wc -l   # Should show 626

# Check code folder
ls code/python/*.py | wc -l      # Should show multiple files

# Check results (after Step 3)
cat out-gpt-4o/all_results.json | python -m json.tool | head -30
```

See README_SETUP.md for detailed documentation.
- IO vs Functional: The dataset contains both IO-type (stdin/stdout) and functional-type (function calls) problems
- Lite Mode: Stops testing after first failure per problem (faster)
- Full Mode: Tests all test cases for each problem (comprehensive)
- Model Configuration: Change `MODEL_NAME` in `io_generate_and_test.py` to use different LLMs (output goes to `out-{MODEL_NAME}/`)
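The lite/full distinction boils down to an early exit in the per-problem test loop. A minimal sketch, not the repo's actual implementation, using a deliberately buggy toy solution:

```python
def run_io_tests(solution, cases, full=False):
    """Run `solution` on (stdin_text, expected_stdout) pairs.

    Lite mode (full=False) stops at the first failure for the problem;
    full mode always runs every case."""
    results = []
    for stdin_text, expected in cases:
        ok = solution(stdin_text) == expected
        results.append(ok)
        if not ok and not full:
            break  # lite mode: remaining cases are skipped
    return results

# A deliberately buggy "sum two ints" solution that fails on "2 2".
buggy = lambda s: "5" if s == "2 2" else str(sum(map(int, s.split())))
cases = [("1 2", "3"), ("2 2", "4"), ("5 5", "10")]

print(run_io_tests(buggy, cases))             # lite stops after the failure
print(run_io_tests(buggy, cases, full=True))  # full runs all three cases
```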