Reproducible QLoRA instruction-tuned checkpoints for the multilingual GEC tasks introduced in the OmniGEC paper.
This repository ships only lightweight notebooks, prompt code, and configs – the actual model weights live on the Hugging Face Hub:
| Track / Size | 8B Aya-Expanse | 12B Gemma-3 |
|---|---|---|
| Minimal edits | `lang-uk/OmniGEC-Minimal-8B` | `lang-uk/OmniGEC-Minimal-12B` |
| Fluency edits | `lang-uk/OmniGEC-Fluency-8B` | `lang-uk/OmniGEC-Fluency-12B` |
Full methodology and scores appear in the ACL 2025 paper
“Introducing OmniGEC: A Silver Multilingual Dataset for Grammatical Error Correction” — PDF TBD.
```bash
git clone https://github.com/r-kovalch/omnigec-models.git
cd omnigec-models
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
```
```text
omnigec-models/
├── notebooks/                  # Training / inference notebooks
│   ├── aya_expanse_8b/         # Aya-Expanse-8B (Minimal & Fluency, QLoRA)
│   ├── gemma_3_12b/            # Gemma-3-12B (Minimal & Fluency, QLoRA)
│   └── gemma_3_4b/             # Gemma-3-4B (small-LLM pilot, Minimal track)
├── params/                     # OmegaConf configs per size & track
├── src/
│   ├── instruction_templates/  # Pydantic prompt classes
│   └── utils/                  # Formatting helpers, language codes, etc.
└── README.md                   # ← you are here
```
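The prompt classes are consumed through the `multigec_prompts` mapping used in the inference snippets below. A minimal sketch of the assumed shape (field names inferred from that usage; the actual classes live in `src/instruction_templates`):

```python
from pydantic import BaseModel

class GECPrompt(BaseModel):
    """Hypothetical prompt class; only `prompt_template` is
    referenced by the inference snippets in this README."""
    prompt_template: str  # must contain an {original_text} placeholder

# multigec_prompts maps a language name to its prompt object
multigec_prompts = {
    "english": GECPrompt(
        prompt_template="Correct the grammatical errors in the text below. "
                        "Return only the corrected text.\n{original_text}"
    ),
}
```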
| Path | Purpose |
|---|---|
| `notebooks/aya_expanse_8b/training_*` | QLoRA fine-tuning recipes for Aya-Expanse-8B (4-bit NF4 base + LoRA adapters) |
| `notebooks/gemma_3_12b/training_*` | QLoRA fine-tuning recipes for Gemma-3-12B-IT |
| `notebooks/gemma_3_4b/training_*` | QLoRA pilot on Gemma-3-4B (Minimal track) |
| `notebooks/*/inference_*` | Scoring on the MultiGEC-25 test set using the checkpoints above |
| `_push_to_huggingface.ipynb` | End-to-end upload of safetensors, tokenizer, and autogenerated model card |
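The upload notebook boils down to the standard `push_to_hub` calls from `transformers`; a minimal sketch (the local path and repo id below are placeholders, and the notebook additionally generates a model card):

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

local_dir = "path/to/merged_checkpoint"    # placeholder: merged LoRA checkpoint
repo = "your-org/your-omnigec-checkpoint"  # placeholder repo id

model = AutoModelForCausalLM.from_pretrained(local_dir)
tok = AutoTokenizer.from_pretrained(local_dir)

model.push_to_hub(repo, safe_serialization=True)  # writes .safetensors shards
tok.push_to_hub(repo)
```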
**Why QLoRA everywhere?** Each notebook sets `QUANTIZE_4BIT = True`, loading the frozen base model with bitsandbytes NF4 4-bit weights and training LoRA adapters only. Flip the flag to `False` if you prefer full-precision LoRA (requires more VRAM).
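For reference, the 4-bit path corresponds to the usual bitsandbytes + PEFT recipe. A sketch under assumed hyperparameters (the real values live in `params/`; the base-model id is the public Aya-Expanse checkpoint):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# Frozen base weights quantized to 4-bit NF4, compute in bf16
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

base = AutoModelForCausalLM.from_pretrained(
    "CohereForAI/aya-expanse-8b",   # assumed base id for the 8B track
    quantization_config=bnb_config,
    device_map="auto",
)

# Only the LoRA adapters are trained; illustrative hyperparameters
lora = LoraConfig(r=16, lora_alpha=32,
                  target_modules=["q_proj", "v_proj"],
                  task_type="CAUSAL_LM")
model = get_peft_model(base, lora)
model.print_trainable_parameters()
```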
**Python snippet – Aya-based checkpoints (8B)**

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

from src.instruction_templates import multigec_prompts
from src.utils.multigec import LANG_TO_CODE, LANG_CODE_TO_TOKEN

repo = "lang-uk/OmniGEC-Fluency-8B"  # or OmniGEC-Minimal-8B
tok = AutoTokenizer.from_pretrained(repo)
model = AutoModelForCausalLM.from_pretrained(repo, torch_dtype="auto")

example = {"language": "english", "feature": "She go to school every day ."}

def formatting_prompts_func(ex):
    # Map language name -> MultiGEC code -> special language token
    code = LANG_TO_CODE[ex["language"]]
    token = LANG_CODE_TO_TOKEN[code]
    # Fill the language-specific instruction template with the source text
    instr = multigec_prompts[ex["language"]].prompt_template.format(
        original_text=ex["feature"])
    # Cohere/Aya chat format: user turn followed by an open chatbot turn
    return (f"<|START_OF_TURN_TOKEN|><|USER_TOKEN|>{token}{instr}"
            f"<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>")

prompt = formatting_prompts_func(example)
outputs = model.generate(**tok(prompt, return_tensors="pt"), max_new_tokens=200)
print(tok.decode(outputs[0], skip_special_tokens=True))
```
**Python snippet – Gemma-based checkpoints (12B / 4B)**

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

from src.instruction_templates import multigec_prompts
from src.utils.multigec import LANG_TO_CODE, LANG_CODE_TO_TOKEN

repo = "lang-uk/OmniGEC-Fluency-12B"  # or OmniGEC-Minimal-12B / *-4B
tok = AutoTokenizer.from_pretrained(repo)
model = AutoModelForCausalLM.from_pretrained(repo, torch_dtype="auto")

example = {"language": "english", "feature": "She go to school every day ."}

def formatting_prompts_func(ex):
    code = LANG_TO_CODE[ex["language"]]
    token = LANG_CODE_TO_TOKEN[code].replace("|", "")  # Gemma tokens lack pipes
    instr = multigec_prompts[ex["language"]].prompt_template.format(
        original_text=ex["feature"])
    # Gemma chat format: user turn, then an open model turn
    return (f"<start_of_turn>user\n{token}{instr}<end_of_turn>\n"
            f"<start_of_turn>model\n")

prompt = formatting_prompts_func(example)
outputs = model.generate(**tok(prompt, return_tensors="pt"), max_new_tokens=200)
print(tok.decode(outputs[0], skip_special_tokens=True))
```
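In both snippets, `tok.decode(outputs[0], ...)` returns the prompt together with the completion, since causal LMs echo their input. To keep only the corrected text, slice off the prompt tokens first (a small convenience, not part of the original notebooks):

```python
inputs = tok(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=200)
# Drop the prompt tokens; decode only what the model generated
completion = tok.decode(outputs[0][inputs["input_ids"].shape[1]:],
                        skip_special_tokens=True)
print(completion)
```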
TBD
```bibtex
@inproceedings{omnigec2025,
  author    = {Kovalchuk, Roman and Ivaniuk, Petro and Romanyshyn, Mariana},
  title     = {Introducing OmniGEC: A Silver Multilingual Dataset for Grammatical Error Correction},
  booktitle = {Proceedings of ACL 2025},
  year      = {2025}
}
```
© 2025 Roman Kovalchuk, Petro Ivaniuk, Mariana Romanyshyn. Licensed under the MIT License.
- For details on the data, see the omnigec-data repository: https://github.com/r-kovalch/omnigec-data