OmniGEC‑Models

Reproducible QLoRA instruction‑tuned checkpoints for the multilingual GEC tasks introduced in the OmniGEC paper.
This repository ships only lightweight notebooks, prompt code, and configs; the actual model weights live on the Hugging Face Hub:

| Track / Size  | 8 B Aya‑Expanse            | 12 B Gemma‑3                |
| ------------- | -------------------------- | --------------------------- |
| Minimal edits | lang‑uk/OmniGEC‑Minimal‑8B | lang‑uk/OmniGEC‑Minimal‑12B |
| Fluency edits | lang‑uk/OmniGEC‑Fluency‑8B | lang‑uk/OmniGEC‑Fluency‑12B |

Full methodology and scores appear in the ACL 2025 paper
“Introducing OmniGEC: A Silver Multilingual Dataset for Grammatical Error Correction” — PDF TBD.


Installation

git clone https://github.com/r-kovalch/omnigec-models.git
cd omnigec-models
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt

Repository layout

omnigec-models/
├── notebooks/                # Training / inference notebooks
│   ├── aya_expanse_8b/       # Aya‑Expanse‑8B (Minimal & Fluency, QLoRA)
│   ├── gemma_3_12b/          # Gemma‑3‑12B (Minimal & Fluency, QLoRA)
│   └── gemma_3_4b/           # Gemma‑3‑4B (small‑LLM pilot, Minimal track)
├── params/                   # OmegaConf configs per size & track
├── src/
│   ├── instruction_templates # Pydantic prompt classes
│   └── utils/                # Formatting helpers, language codes, etc.
└── README.md                 # ← you are here
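
For orientation, below is a minimal sketch of the Pydantic prompt‑class pattern behind src/instruction_templates; the class name and the English template text are illustrative assumptions, not the repository's exact code.

from pydantic import BaseModel

# Hypothetical prompt class mirroring how multigec_prompts is used in the
# inference snippets further down: one entry per language, each exposing a
# prompt_template with an {original_text} placeholder.
class GECPrompt(BaseModel):
    language: str
    prompt_template: str

multigec_prompts = {
    "english": GECPrompt(
        language="english",
        prompt_template=(
            "Correct the grammatical errors in the following text, "
            "changing as little as possible:\n{original_text}"
        ),
    ),
}

print(multigec_prompts["english"].prompt_template.format(
    original_text="She go to school every day ."))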

Key notebooks

| Path                                | Purpose                                                                       |
| ----------------------------------- | ----------------------------------------------------------------------------- |
| notebooks/aya_expanse_8b/training_* | QLoRA fine‑tuning recipes for Aya‑Expanse‑8B (4‑bit NF4 base + LoRA adapters) |
| notebooks/gemma_3_12b/training_*    | QLoRA fine‑tuning recipes for Gemma‑3‑12B‑IT                                  |
| notebooks/gemma_3_4b/training_*     | QLoRA pilot on Gemma‑3‑4B (Minimal track)                                     |
| notebooks/*/inference_*             | Scoring on the MultiGEC‑25 test set using the checkpoints above               |
| _push_to_huggingface.ipynb          | End‑to‑end upload of safetensors, tokenizer and autogenerated model card      |

Why QLoRA everywhere? Each notebook sets QUANTIZE_4BIT = True, loading the frozen base model with Bits‑and‑Bytes NF4 4‑bit weights and training LoRA adapters only. Flip the flag to False if you prefer full‑precision LoRA (requires more VRAM).
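
For reference, a minimal sketch of that setup with Transformers, bitsandbytes, and PEFT might look like the following; the base‑model id and LoRA hyper‑parameters are illustrative placeholders, not the notebooks' exact values.

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

QUANTIZE_4BIT = True  # set to False for full-precision LoRA (needs more VRAM)

# NF4 4-bit quantization config for the frozen base model
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
) if QUANTIZE_4BIT else None

base = AutoModelForCausalLM.from_pretrained(
    "CohereForAI/aya-expanse-8b",      # or a Gemma-3 checkpoint
    quantization_config=bnb_config,
    device_map="auto",
)

if QUANTIZE_4BIT:
    base = prepare_model_for_kbit_training(base)

# Train LoRA adapters only; r / alpha / target modules here are illustrative
lora = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"],
                  task_type="CAUSAL_LM")
model = get_peft_model(base, lora)
model.print_trainable_parameters()     # only the adapter weights are trainable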


Quick inference

Python snippet  –  Aya‑based checkpoints (8 B)
from transformers import AutoTokenizer, AutoModelForCausalLM
from src.instruction_templates import multigec_prompts
from src.utils.multigec import LANG_TO_CODE, LANG_CODE_TO_TOKEN

repo = "lang-uk/OmniGEC-Fluency-8B"   # or OmniGEC-Minimal-8B
tok  = AutoTokenizer.from_pretrained(repo)
model = AutoModelForCausalLM.from_pretrained(repo, torch_dtype="auto")

example = {"language": "english", "feature": "She go to school every day ."}

def formatting_prompts_func(ex):
    code  = LANG_TO_CODE[ex["language"]]
    token = LANG_CODE_TO_TOKEN[code]
    instr = multigec_prompts[ex["language"]].prompt_template.format(
        original_text=ex["feature"])
    return (f"<|START_OF_TURN_TOKEN|><|USER_TOKEN|>{token}{instr}"
            f"<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>")

prompt  = formatting_prompts_func(example)
outputs = model.generate(**tok(prompt, return_tensors="pt"), max_new_tokens=200)
print(tok.decode(outputs[0], skip_special_tokens=True))
Python snippet  –  Gemma‑based checkpoints (12 B / 4 B)
from transformers import AutoTokenizer, AutoModelForCausalLM
from src.instruction_templates import multigec_prompts
from src.utils.multigec import LANG_TO_CODE, LANG_CODE_TO_TOKEN

repo = "lang-uk/OmniGEC-Fluency-12B"   # or OmniGEC-Minimal-12B / *‑4B
tok  = AutoTokenizer.from_pretrained(repo)
model = AutoModelForCausalLM.from_pretrained(repo, torch_dtype="auto")

example = {"language": "english", "feature": "She go to school every day ."}

def formatting_prompts_func(ex):
    code  = LANG_TO_CODE[ex["language"]]
    token = LANG_CODE_TO_TOKEN[code].replace("|", "")  # Gemma tokens lack pipes
    instr = multigec_prompts[ex["language"]].prompt_template.format(
        original_text=ex["feature"])
    return (f"<start_of_turn>user\n{token}{instr}<end_of_turn>\n"
            f"<start_of_turn>model\n")

prompt  = formatting_prompts_func(example)
outputs = model.generate(**tok(prompt, return_tensors="pt"), max_new_tokens=200)
print(tok.decode(outputs[0], skip_special_tokens=True))
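
In both snippets, model.generate() returns the prompt tokens followed by the generated correction, so the decoded string echoes the instruction. If you only want the corrected text, one option is to slice off the prompt before decoding:

inputs     = tok(prompt, return_tensors="pt")
outputs    = model.generate(**inputs, max_new_tokens=200)
new_tokens = outputs[0][inputs["input_ids"].shape[1]:]   # drop the echoed prompt
print(tok.decode(new_tokens, skip_special_tokens=True))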

Citing

TBD

@inproceedings{omnigec2025,
  author    = {Roman Kovalchuk and Petro Ivaniuk and Mariana Romanyshyn},
  title     = {Introducing OmniGEC: A Silver Multilingual Dataset for Grammatical Error Correction},
  booktitle = {Proceedings of ACL 2025},
  year      = {2025}
}

© 2025 Roman Kovalchuk, Petro Ivaniuk, Mariana Romanyshyn. Licensed under the MIT License.
