# MultiGEC Baseline

_This notebook was put together by Ricardo ([@rimusa](https://github.com/rimusa) on GitHub) and is based on the code found [here](https://github.com/spraakbanken/multigec-2025/blob/main/scripts/baseline.py)._

## Intro

This notebook is meant to showcase the baseline for the [MultiGEC shared task](https://github.com/spraakbanken/multigec-2025) on grammatical error correction.
If you want to learn a bit more about this task, be sure to check out the shared task homepage.

Note that it is a minimal one-shot prompt-engineering baseline.
That is a lot of buzzwords, so let's see what they mean:

* A _baseline_ is a system that we compare ours against. There are two (often opposed) philosophies here:
    * It should be simple to avoid introducing noise/artifacts to the task
    * It should be well-performing to allow for better comparison of the results
    
* A _minimal baseline_ follows the "being simple" approach, trying to avoid any kind of over-engineering

* A _few-shot system_ is meant to do very little adaptation (if any at all) to the task at hand. It can either be finetuned on a few examples for a small number of epochs (with transformer-based models I'd say one at most) or be shown the examples as part of a prompt.
    * _Zero-shot systems_ do not get any examples at all; they are meant to test the model as-is.
    * _One-shot systems_ see only one example. The idea here is that we don't want the model to change its internal representations but we do want it to adapt its output to the format shown in the example.

* _Prompt-engineering_ is a method of getting generative language models to adapt to our task without actually changing the internal representations of the model. This way we can bootstrap the knowledge contained within without risking it forgetting things during finetuning (also known as "catastrophic forgetting").
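
To make the zero-shot versus one-shot distinction concrete, here is a minimal sketch of how the number of examples changes a prompt. The function `build_prompt` and its plain-text layout are hypothetical and only meant as an illustration; the templates that the baseline actually uses are defined in the section on prompts.

```python
def build_prompt(instruction, examples, query):
    """Assemble a plain-text prompt with zero or more worked examples.

    Hypothetical helper for illustration only; it is not the baseline's template.
    """
    parts = [instruction]
    for source, target in examples:
        parts.append(f"Input: {source}\nOutput: {target}")
    # The final input is left open for the model to complete
    parts.append(f"Input: {query}\nOutput:")
    return "\n\n".join(parts)


instruction = "Correct the spelling and grammar of the input."
example = ("I has a cat.", "I have a cat.")

# Zero-shot: no examples. One-shot: exactly one example.
zero_shot_prompt = build_prompt(instruction, [], "She go home.")
one_shot_prompt = build_prompt(instruction, [example], "She go home.")

print(one_shot_prompt)
```

The only difference between the two prompts is the single worked example, which shows the model the expected output format without touching its weights.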

## Preamble

This section has all of the imports needed to run this code.

Note that to be able to run this notebook you need to use a Python 3 environment that has the following packages installed:

```python
# General Imports
jupyter             # For reasons that hopefully are obvious

# HuggingFace Imports
transformers        # To be able to load the models
huggingface_hub     # Llama is a gated model, so you need to log in
```

In [1]:
from transformers import AutoModelForCausalLM, AutoTokenizer, AutoConfig
from huggingface_hub import login

Good etiquette says to use only one GPU from the MLT server.
Thus we will use just the one and a smaller model.
For the 8B parameter model that we used for the shared task, we might need to use more GPUs.

In [2]:
device = "auto"
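
The value of `device` is later passed as `device_map`, and `"auto"` may spread the model over every GPU that the process can see. One common way to keep to a single GPU is to restrict visibility with the `CUDA_VISIBLE_DEVICES` environment variable before CUDA is initialised (ideally before the imports above). This is a sketch under the assumption that GPU index `"0"` is the one you are allowed to use; pick whichever one is free on your server.

```python
import os

# Assumption: GPU index "0" is the one you want to use.
# This should run before torch/transformers initialise CUDA.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

visible_gpus = os.environ["CUDA_VISIBLE_DEVICES"].split(",")
print(f"GPUs visible to this process: {visible_gpus}")
```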

## The Prompts

### Task Prompts

We want two different kinds of prompts: one for making minimal edits to the text and another that allows fluency edits.
The two versions share most of their text, so we define the common part only once.
This reduces duplication and, with it, the risk that the two versions diverge from each other.

In [3]:
# This is the shared part of the prompt
task_prompt = "You are a grammatical error correction tool. Your task is to correct the grammaticality and spelling of the input essay written by a learner of {}. {} Return only the corrected text and nothing more."

Note that there are two `{}` placeholders.
We will put the language of the text to be corrected in the first placeholder and the task description in the second one.

We can then define the specific variations of the task:

In [4]:
task_descriptions = {

    "minimal" : "Make the smallest possible change in order to make the essay grammatically correct. Change as few words as possible. Do not rephrase parts of the essay that are already grammatical. Do not change the meaning of the essay by adding or removing information. If the essay is already grammatically correct, you should output the original essay without changing anything.",

    "fluency" : "You may rephrase parts of the essay to improve fluency. Do not change the meaning of the essay by adding or removing information. If the essay is already grammatically correct and fluent, you should output the original essay without changing anything.",

}

Let's see the two different versions of the prompts:

In [5]:
for task, description in task_descriptions.items():
    print("")
    print("Prompt for when we use", task, "edits:")
    print("")
    print(task_prompt.format("LANG",description))
    print("")


Prompt for when we use minimal edits:

You are a grammatical error correction tool. Your task is to correct the grammaticality and spelling of the input essay written by a learner of LANG. Make the smallest possible change in order to make the essay grammatically correct. Change as few words as possible. Do not rephrase parts of the essay that are already grammatical. Do not change the meaning of the essay by adding or removing information. If the essay is already grammatically correct, you should output the original essay without changing anything. Return only the corrected text and nothing more.


Prompt for when we use fluency edits:

You are a grammatical error correction tool. Your task is to correct the grammaticality and spelling of the input essay written by a learner of LANG. You may rephrase parts of the essay to improve fluency. Do not change the meaning of the essay by adding or removing information. If the essay is already grammatically correct and fluent, you should output the original essay without changing anything. Return only the corrected text and nothing more.

### Readying the Prompts for Llama

Llama requires a specific format for the prompts.
I will not go into any details here, but I can point you [in the right direction](https://www.llama.com/docs/model-cards-and-prompt-formats/meta-llama-3/) in case you'd like to know more.

In [6]:
# This is the one-shot prompt for Llama 3
one_shot =  """<|begin_of_text|><|start_header_id|>system<|end_header_id|>

{}

Input essay:
<|eot_id|><|start_header_id|>user<|end_header_id|>
{}<|eot_id|>

<|start_header_id|>assistant<|end_header_id|>
Output essay:
{}<|eot_id|>

Input essay:
<|eot_id|><|start_header_id|>user<|end_header_id|>
{}<|eot_id|>
<|start_header_id|>assistant<|end_header_id|>
Output essay:
"""

In [7]:
# This is what the zero-shot version looks like
# Note that it has one less turn in the conversation
zero_shot =  """<|begin_of_text|><|start_header_id|>system<|end_header_id|>

{}

Input essay:
<|eot_id|><|start_header_id|>user<|end_header_id|>
{}<|eot_id|>

<|start_header_id|>assistant<|end_header_id|>
Output essay:
"""

In [8]:
prompts = {
    "one_shot"  : one_shot,
    "zero_shot" : zero_shot,
}
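
For comparison, the turn structure described in the linked Llama 3 documentation can be sketched in plain Python. The helper `render_llama3` below is hypothetical and is not the code the baseline uses; the templates above additionally place the "Input essay:" and "Output essay:" markers around the turns.

```python
def render_llama3(messages, add_generation_prompt=True):
    """Render chat turns in the Llama 3 instruct layout.

    Hypothetical helper for illustration; each turn is a header naming the
    role, a blank line, the content, and an end-of-turn token.
    """
    text = "<|begin_of_text|>"
    for message in messages:
        text += f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
        text += f"{message['content']}<|eot_id|>"
    if add_generation_prompt:
        # Open an assistant turn so that the model writes the reply
        text += "<|start_header_id|>assistant<|end_header_id|>\n\n"
    return text


messages = [
    {"role": "system", "content": "You are a grammatical error correction tool."},
    {"role": "user", "content": "This is a test esay."},
]
rendered = render_llama3(messages)
print(rendered)
```

In practice, `tokenizer.apply_chat_template` from `transformers` can build a similar string from the same list of messages, although the model's own template may add extra text to the system turn.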

### Examples

Remember that our prompt is one-shot, so we need to give it an example.
The more nuanced the example is and the more similar it is to the actual input, the better the prompt will be.
However, our baseline is supposed to be simple, so we only care that the model gets the in/out format right.

In [9]:
example_in  = "My name is Susanna. I come from Berlin, in the middl of Germany, bot I live in Bungaborg. I am studying data science in the University of Bungaborg and work extra as a teacher."

example_out = "My name is Susanna. I come from Berlin, in the middle of Germany, but I live in Bungaborg. I am studying data science in the University of Bungaborg and work extra as a teacher."

## Code

### Loading the Model

For the shared task baseline we used Llama 3.1 as it is one of the better-performing models as of the end of 2024.
Note that you need to have a HuggingFace account and sell your soul to Facebook in order to use it.
You can do so [here](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct).

In [10]:
# First we log into HuggingFace, for this you need to use an authorization key, which is private.
# Do not show or share your key with anyone, under any circumstances.

# We save our key in a file called hf_key.txt
with open("./hf_key.txt") as key_file:
    # Strip the trailing newline so that only the key itself is passed on
    login(key_file.readline().strip())

The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: write).
Your token has been saved to /home/ricardomunozsanchez/.cache/huggingface/token
Login successful


In [11]:
# We can then load the model

checkpoint = "meta-llama/Meta-Llama-3.1-8B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(checkpoint)

config = AutoConfig.from_pretrained(checkpoint, max_new_tokens=3000)

model = AutoModelForCausalLM.from_pretrained(
                                            checkpoint,
                                            config=config,
                                            torch_dtype="auto",
                                            device_map=device
                                            )


### Fetching the Prompts

We wrap the prompt construction in a function so that we can reuse it. This makes it quicker to try out different essays and settings.

In [12]:
def fetch_prompt(system, task, essay, language):

    current_task = task_descriptions[task]

    # str.format returns a new string, so we need to keep the result
    system_prompt = task_prompt.format(language, current_task)

    items = [system_prompt]
    # Only the one-shot template has slots for the example
    if "one" in system:
        items += [example_in, example_out]
    items += [essay]

    current_prompt = prompts[system].format(*items)

    return current_prompt
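
The number of `{}` slots differs between the two templates, which is why the list of items is built conditionally. A minimal sketch with shortened stand-in templates shows the idea without needing the model. The names `toy_templates` and `fill` are hypothetical and not part of the baseline.

```python
# Stand-in templates: 4 slots for one-shot, 2 slots for zero-shot
toy_templates = {
    "one_shot": "SYSTEM: {}\nUSER: {}\nASSISTANT: {}\nUSER: {}\nASSISTANT:",
    "zero_shot": "SYSTEM: {}\nUSER: {}\nASSISTANT:",
}


def fill(system, instruction, essay, example=None):
    """Fill the template named by `system` with the right number of items."""
    items = [instruction]
    if system == "one_shot":
        # The worked example (input, output) comes before the real essay
        items += list(example)
    items.append(essay)
    # str.format returns a new string; the template itself is left untouched
    return toy_templates[system].format(*items)


filled = fill("one_shot", "Fix the text.", "This is a test esay.",
              example=("I has a cat.", "I have a cat."))
print(filled)
```

Passing too few items to `format` raises an `IndexError`, so a mismatch between the template and the list of items shows up immediately.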

### Obtaining Predictions

We can also reuse the same code when obtaining predictions from the models.

In [13]:
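def predict(system, task, essay, language, max_length=300, print_output=True):

    prompt = fetch_prompt(system, task, essay, language)

    # Put the inputs on the same device as the model
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    # NOTE - max_length counts the prompt tokens as well as the generated ones
    outputs = model.generate(**inputs, max_length=max_length, pad_token_id=tokenizer.eos_token_id)

    out_text = tokenizer.decode(outputs[0])

    # Keep the text after the last "Output essay:" and cut it at the end-of-turn token
    # (str.strip would remove any of the characters in "<|eot_id|>", not the token itself)
    correction = out_text.split("Output essay:\n")[-1].split("<|eot_id|>")[0].strip()

    if print_output:
        print(correction)
    else:
        return correction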
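
The post-processing step can be tried out without loading the model. The sketch below uses a hypothetical helper, `extract_correction`, on a hand-written string that imitates decoded model output. It cuts at the first end-of-turn token rather than relying on `str.strip`, since `strip` with a multi-character argument removes any of those characters from both ends instead of the exact token.

```python
def extract_correction(decoded, marker="Output essay:\n", end_token="<|eot_id|>"):
    """Return the text after the last marker, up to the end-of-turn token."""
    tail = decoded.split(marker)[-1]
    return tail.split(end_token)[0].strip()


# Hand-written stand-in for the decoded output of the model
decoded = (
    "<|start_header_id|>assistant<|end_header_id|>\n"
    "Output essay:\n"
    "This is what I did<|eot_id|>"
)

print(repr(extract_correction(decoded)))

# For contrast: strip() treats its argument as a set of characters,
# so it also eats the final letters of the essay
print(repr("This is what I did<|eot_id|>".strip("<|eot_id|>")))
```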

## Testing the Model

We can test our model with a simple essay here:

In [14]:
# Language of the essay
language = "English"

# Text of the essay
# NOTE - If the essay is too long, you'll get an error from the model!
essay = """This is a test esay."""

### The One-Shot Example

Let's start with the minimal edit prompt:

In [15]:
predict("one_shot","minimal",essay,language)

This is a test essay.


And then the fluency edits:

In [16]:
predict("one_shot","fluency",essay,language)

This is a test essay.


### The Zero-Shot Example

This tends to spit out more gibberish than the one-shot version, but it should still work most of the time.

Minimal edit prompt:

In [17]:
predict("zero_shot","minimal",essay,language)

This is a test essay.


Fluency edit prompt:

In [18]:
predict("zero_shot","fluency",essay,language)

This is a test essay.
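
To check how much a correction actually changed, which is the point of the minimal-edit setting, a word-level diff from the standard library is enough. The helper `word_edits` is a hypothetical addition; it is not part of the baseline or of the shared task's evaluation.

```python
from difflib import SequenceMatcher


def word_edits(original, corrected):
    """List the word-level differences between two texts."""
    source, target = original.split(), corrected.split()
    matcher = SequenceMatcher(a=source, b=target, autojunk=False)
    edits = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # Skip the spans that are identical in both texts
        if tag != "equal":
            edits.append((tag, source[i1:i2], target[j1:j2]))
    return edits


print(word_edits("This is a test esay.", "This is a test essay."))
```

An empty list means that the model returned the essay unchanged.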
