Code and artifacts for finetuning LLMs to generate MiniZinc constraint programming models from natural language problem descriptions. We train on problems from the text2zinc dataset and evaluate on a held-out test set of 100 problems originally sourced from CardinalOperations/IndustryOR. Additional training problems are drawn from the OR-Instruct-3K dataset.
```
learn2zinc/
├── finetuning.ipynb                # Unified finetuning notebook (all models × all strategies)
├── verify_results.py               # Standalone result verification script
├── data/                           # Local data files
├── evaluations/
│   └── generated_code/             # Pre-generated MiniZinc files for every model/strategy
│       ├── gemma-2-9b_original/                          # Base model (no finetuning)
│       ├── gemma-2-9b_learn2zinc_base_finetuned/
│       ├── gemma-2-9b_learn2zinc_cot_finetuned/
│       ├── gemma-2-9b_learn2zinc_augmented_finetuned/
│       ├── gemma-2-9b_combined_finetuned/
│       ├── retry_gemma-2-9b_learn2zinc_augmented_finetuned/
│       ├── ...                                           # (same structure for all 5 base models)
│       └── ensemble_learn2zinc_augmented/                # Ensemble cascade across models
└── LICENSE
```
Each strategy folder contains up to 100 `.mzn` files, one per problem in the IndustryOR test set.
| Model | Size | HuggingFace (finetuned) |
|---|---|---|
| Qwen3 | 0.6B | skadio/learn2zinc-Qwen3-0.6B |
| Llama 3.2 | 1B | skadio/learn2zinc-Llama-3.2-1B |
| Llama 3.2 | 3B | skadio/learn2zinc-Llama-3.2-3B |
| Gemma 2 | 9B | skadio/learn2zinc-Gemma-2-9B |
| GPT-oss | 20B | skadio/learn2zinc-GPT-oss-20B |
The published checkpoints above are the best-performing variant for each base model, all finetuned on the Learn2Zinc-Augmented dataset.
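The checkpoints can be pulled straight from the Hub; below is a minimal sketch, assuming they are published as merged causal LMs (if they are LoRA adapters instead, load them with `peft` on top of the base model) and using an illustrative prompt format:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "skadio/learn2zinc-Qwen3-0.6B"
tok = AutoTokenizer.from_pretrained(repo)
model = AutoModelForCausalLM.from_pretrained(repo)

# Illustrative prompt; see the dataset cards for the exact training format.
prompt = "Write a MiniZinc model for the following problem:\n..."
inputs = tok(prompt, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=512)
print(tok.decode(out[0], skip_special_tokens=True))
```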
Each dataset provides (natural language description → MiniZinc code) training pairs; the three datasets differ in their prompting strategy:
| Strategy | Dataset | Description |
|---|---|---|
| Learn2Zinc-Base (Direct) | skadio/learn2zinc-base | Direct problem-to-code pairs |
| Learn2Zinc-CoT (Reasoning) | skadio/learn2zinc-cot | Chain-of-thought reasoning before code |
| Learn2Zinc-Augmented | skadio/learn2zinc-augmented | Augmented training data with syntax error correction examples |
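All three datasets live on the HuggingFace Hub and load with the `datasets` library; a minimal sketch (the split and column names are assumptions, check the dataset cards):

```python
from datasets import load_dataset

# Direct problem→code pairs; the split and column names here are assumptions.
ds = load_dataset("skadio/learn2zinc-base", split="train")
print(ds[0])  # inspect one (natural language description, MiniZinc code) pair
```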
The `evaluations/generated_code/` directory contains outputs from the following configurations:

| Configuration | Description |
|---|---|
| `{model}_original` | Base model without any finetuning |
| `{model}_learn2zinc_base_finetuned` | Finetuned on the Learn2Zinc-Base (direct) dataset |
| `{model}_learn2zinc_cot_finetuned` | Finetuned on the Learn2Zinc-CoT dataset |
| `{model}_learn2zinc_augmented_finetuned` | Finetuned on the Learn2Zinc-Augmented dataset |
| `{model}_combined_finetuned` | Finetuned on a mix of the Learn2Zinc-Base and Learn2Zinc-CoT strategies |
| `retry_{model}_learn2zinc_augmented_finetuned` | Multi-attempt retry (up to 5 attempts) with error feedback; see the sketch below |
| `ensemble_learn2zinc_augmented` | Cascade ensemble across all models, from smallest to largest |
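The retry strategy is a generate → solve → repair loop; here is a minimal sketch, assuming a hypothetical `generate(prompt)` wrapper around the finetuned model (the exact prompt template and error handling in the notebook may differ):

```python
import subprocess

MAX_ATTEMPTS = 5

def retry_generate(problem: str, generate) -> str | None:
    """Generate MiniZinc code, feeding solver errors back into the prompt."""
    prompt = problem
    for _ in range(MAX_ATTEMPTS):
        code = generate(prompt)  # hypothetical wrapper around the finetuned model
        with open("candidate.mzn", "w") as f:
            f.write(code)
        run = subprocess.run(
            ["minizinc", "--solver", "highs", "--time-limit", "120000", "candidate.mzn"],
            capture_output=True, text=True,
        )
        if run.returncode == 0:
            return code  # the solver accepted and solved the model
        # Feed the solver's error message back and ask for a corrected model.
        prompt = f"{problem}\n\nYour previous model failed with:\n{run.stderr}\nFix it."
    return None
```

The cascade ensemble applies the same fallback idea across models rather than attempts: each problem starts with the smallest model and escalates to the next larger one when the current output fails.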
You can verify all results from the paper tables without running any model inference: the pre-generated `.mzn` files are evaluated directly against the test set with the MiniZinc solver.

First install the Python dependencies:

```bash
pip install datasets pandas tqdm
```

Then install the MiniZinc CLI with the HiGHS solver.
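To sanity-check the toolchain before a full run, you can list the solvers the MiniZinc CLI knows about (`--solvers` is a standard MiniZinc flag; the assertion below assumes HiGHS registers under a tag containing "highs"):

```python
import subprocess

# List the solvers registered with the MiniZinc CLI; HiGHS should appear.
out = subprocess.run(["minizinc", "--solvers"], capture_output=True, text=True, check=True)
print(out.stdout)
assert "highs" in out.stdout.lower(), "HiGHS not found; install or register it first"
```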
```bash
# Evaluate all strategies
python verify_results.py
```

The script will:
- Load the 100 IndustryOR problems from the `skadio/text2zinc` HuggingFace dataset
- Read the pre-generated `.mzn` files for each strategy
- Run each generated model through the MiniZinc solver (HiGHS, 120 s timeout)
- Compare objective values against the ground truth
- Print execution accuracy and solution accuracy per strategy, and save detailed results to CSV
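Per problem, the check reduces to something like the following sketch (the CLI flags are standard MiniZinc; the file path is illustrative, and `verify_results.py` is the authoritative version):

```python
import subprocess

def run_model(mzn_path: str, time_limit_ms: int = 120_000) -> str:
    """Solve one generated MiniZinc model with HiGHS and return its output."""
    result = subprocess.run(
        ["minizinc", "--solver", "highs", "--time-limit", str(time_limit_ms), mzn_path],
        capture_output=True, text=True,
    )
    return result.stdout

# Illustrative path: a run counts toward execution accuracy if the solver
# returns a solution, and toward solution accuracy if the reported objective
# matches the ground-truth value from the test set.
print(run_model("evaluations/generated_code/gemma-2-9b_original/problem_1.mzn"))
```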
The `finetuning.ipynb` notebook contains the full finetuning pipeline: it iterates over all base models and dataset strategies, applying LoRA-based finetuning with Unsloth.
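At its core, each run is a standard Unsloth LoRA setup; here is a minimal sketch (the hyperparameters are illustrative rather than the paper's settings, the text column name is an assumption, and `SFTTrainer` arguments vary across `trl` versions):

```python
from unsloth import FastLanguageModel
from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import load_dataset

# Load a 4-bit quantized base model; any of the five base models works here.
model, tokenizer = FastLanguageModel.from_pretrained(
    "unsloth/gemma-2-9b",
    max_seq_length=4096,
    load_in_4bit=True,
)

# Attach LoRA adapters to the usual attention and MLP projections.
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=load_dataset("skadio/learn2zinc-augmented", split="train"),
    dataset_text_field="text",  # assumed column name; check the dataset card
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        num_train_epochs=3,
        learning_rate=2e-4,
        output_dir="outputs",
    ),
)
trainer.train()
```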