# Automated IELTS Writing Task 2 Evaluation System - Fine-tuning Mistral-7b

**Author:** Phạm Lê Ngọc Sơn  
**Email:** phamlengocsononline@gmail.com  
**Description:** Fine-tuning Large Language Models to automatically evaluate IELTS Writing Task 2 essays  
**Period:** December 2023 - March 2024



## Import libraries

In [None]:
%%capture
!mamba install cudatoolkit xformers bitsandbytes pytorch pytorch-cuda=11.8 \
    -c pytorch -c nvidia -c xformers -c conda-forge -y
!pip install "unsloth[colab] @ git+https://github.com/unslothai/unsloth.git"
!pip uninstall datasets -y
!pip install datasets

import os
os.environ["WANDB_DISABLED"] = "true"

In [None]:
from IPython.display import display, Markdown

def display_markdown(markdown_string):
    # Strip the end-of-sequence token ("</s>") before rendering.
    markdown_string = markdown_string.replace("</s>", "")
    markdown_string = markdown_string.replace("<s>", "")
    display(Markdown(markdown_string))

# Load data

Load the dataset from Hugging Face.

Dataset link: https://huggingface.co/datasets/chillies/IELTS-writing-task-2-evaluation

In [None]:
from datasets import load_dataset

dataset = load_dataset("chillies/IELTS-writing-task-2-evaluation", split = "train")

Downloading data:   0%|          | 0.00/44.5M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/2.21M [00:00<?, ?B/s]

Generating train split: 0 examples [00:00, ? examples/s]

Generating test split: 0 examples [00:00, ? examples/s]

**Prompt template**

The prompt template used to instruct the model during fine-tuning.

In [None]:
PROMPT = """
In this task, you are required to evaluate an IELTS Writing Task 2 essay. Consider the following four criteria and provide a detailed assessment for each, along with a suggested band score:

## Task Achievement:
- Evaluate how well the candidate has addressed the given task.
- Assess the clarity and coherence of the response in presenting ideas.
- Identify if the candidate has fully covered all parts of the task and supported arguments appropriately.
- Suggested Band Score (Task Achievement): [Insert Score]

## Coherence and Cohesion:
- Assess the overall organization and structure of the essay.
- Evaluate the use of linking devices to connect ideas and paragraphs.
- Identify if there is a logical flow of information.
- Suggested Band Score (Coherence and Cohesion): [Insert Score]

## Lexical Resource (Vocabulary):
- Examine the range and accuracy of vocabulary used in the essay.
- Point out specific mistakes in vocabulary, such as inaccuracies or overuse of certain words and Suggest modified versions or alternatives for the identified mistakes. [list of mistakes and rectify]
- Assess the appropriateness of vocabulary for the given context.
- Suggested Band Score (Lexical Resource): [Insert Score]

## Grammatical Range and Accuracy:
- Evaluate the variety and complexity of sentence structures.
- Point out specific grammatical errors, such as incorrect verb forms or sentence construction and Suggest modified versions or corrections for the identified mistakes. [list of mistakes and rectify]
- Examine the use of punctuation and sentence formation.
- Suggested Band Score (Grammatical Range and Accuracy): [Insert Score]

## Overall Band Score:

- Provide an overall band score for the essay, considering the holistic performance across all criteria.
- Consider the synergy of the essay in meeting the task requirements cohesively.
- Suggested Overall Band Score: [Insert Score]

## Feedback and Additional Comments:
- Provide constructive feedback highlighting specific strengths and areas for improvement.
- Suggest strategies for enhancement in weaker areas.

## Prompt:
{}

## Essay:
{}

## Evaluation:
{}"""

In [None]:
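# End-of-sequence token appended to every training example so the model learns when to stop.
EOS_TOKEN = '</s>'

def formatting_prompts_func(examples):
    """Fill PROMPT with each (prompt, essay, evaluation) triple of a batch."""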
    essay_prompts = examples["prompt"]
    essays = examples["essay"]
    evaluations = examples["evaluation"]
    texts = []
    for p, es, ev in zip(essay_prompts, essays, evaluations):
        text = PROMPT.format(p, es, ev) + EOS_TOKEN
        if isinstance(text, str):
            texts.append(text)
        else:
            print("error")
            texts.append("")
    return {"text": texts}

In [None]:
dataset = dataset.map(formatting_prompts_func, batched = True,)

Map:   0%|          | 0/9833 [00:00<?, ? examples/s]
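
To make the batched contract of `dataset.map(..., batched = True)` concrete, here is a minimal sketch on a toy batch. `MINI_TEMPLATE` and `format_batch` are hypothetical stand-ins for `PROMPT` and `formatting_prompts_func`, and the toy rows are invented purely for illustration.

```python
# Hypothetical, shortened stand-in for the full PROMPT template above.
MINI_TEMPLATE = "## Prompt:\n{}\n\n## Essay:\n{}\n\n## Evaluation:\n{}"
EOS = "</s>"  # assumed end-of-sequence token, as in the cell above


def format_batch(batch):
    """With batched=True, `batch` maps each column name to a list of values."""
    texts = [
        MINI_TEMPLATE.format(p, es, ev) + EOS
        for p, es, ev in zip(batch["prompt"], batch["essay"], batch["evaluation"])
    ]
    # The returned dict adds a new "text" column with one entry per input row.
    return {"text": texts}


# Invented toy rows, shaped like a batch of the real dataset.
toy_batch = {
    "prompt": ["Topic A", "Topic B"],
    "essay": ["Essay A", "Essay B"],
    "evaluation": ["Band 6", "Band 7"],
}
formatted = format_batch(toy_batch)
print(formatted["text"][0])
```

The function receives columns as lists and must return lists of the same length, which is why the real formatting function builds `texts` row by row.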

In [None]:
sample = dataset[7]['text']
display_markdown(sample)


In this task, you are required to evaluate an IELTS Writing Task 2 essay. Consider the following four criteria and provide a detailed assessment for each, along with a suggested band score:

## Task Achievement:
- Evaluate how well the candidate has addressed the given task.
- Assess the clarity and coherence of the response in presenting ideas.
- Identify if the candidate has fully covered all parts of the task and supported arguments appropriately.
- Suggested Band Score (Task Achievement): [Insert Score]

## Coherence and Cohesion:
- Assess the overall organization and structure of the essay.
- Evaluate the use of linking devices to connect ideas and paragraphs.
- Identify if there is a logical flow of information.
- Suggested Band Score (Coherence and Cohesion): [Insert Score]

## Lexical Resource (Vocabulary):
- Examine the range and accuracy of vocabulary used in the essay.
- Point out specific mistakes in vocabulary, such as inaccuracies or overuse of certain words and Suggest modified versions or alternatives for the identified mistakes. [list of mistakes and rectify]
- Assess the appropriateness of vocabulary for the given context.
- Suggested Band Score (Lexical Resource): [Insert Score]

## Grammatical Range and Accuracy:
- Evaluate the variety and complexity of sentence structures.
- Point out specific grammatical errors, such as incorrect verb forms or sentence construction and Suggest modified versions or corrections for the identified mistakes. [list of mistakes and rectify]
- Examine the use of punctuation and sentence formation.
- Suggested Band Score (Grammatical Range and Accuracy): [Insert Score]

## Overall Band Score:

- Provide an overall band score for the essay, considering the holistic performance across all criteria.
- Consider the synergy of the essay in meeting the task requirements cohesively.
- Suggested Overall Band Score: [Insert Score]

## Feedback and Additional Comments:
- Provide constructive feedback highlighting specific strengths and areas for improvement.
- Suggest strategies for enhancement in weaker areas.

## Prompt:
Interviews form the basic selecting criteria for most large companies. However, some people think that interview is not a reliable method of choosing whom to employ and there are other better methods. To what extent do you agree or disagree?

## Essay:
It is undeniable that most well-known organizations prefer interviews as methods of recruiting new employees. However, others propound that they are not sophisticated procedure for doing so. In my opinion, I agree with the statement that there are other ways to select the right persons for their companies. 

On the one hand, it is considered by many that interviewing is reliable. To explain this further, it allows employers to get to know their potential candidates; moreover, sets of questions and criterias are well prepared before the process of interviews, which is convenient and saves a amount of time. For instance, those candidates may be asked about their backgrounds or occupational experiences, so they can be assessed from their attitudes as well. However, this cannot gaurantee that who can thrive in company success or reach the teams' projects as long as they start working. 

On the other hand, there are some other useful methods to hire someone. First and foremost, internship is though to be one of the most effective ways when it comes to offer someone's an employment opportunitiy. To put it simply, if interns are able to achieve some tasks given by a manager or a director, they are likely to fit in the firm. Another way to employ someone is to let them demonstrate a particular mission related to their positions, which will result in not only a better quality of their qulifications but also proficiency.

In conclusion, even though a myriad of companies rely heavily on interviews before hiring employees, there are more reliable and practical processes to recruit new staff. From my perspective, I argue the notion that interviews are the optimal way to hire someone.

## Evaluation:
## Task Achievement:
- The candidate has adequately addressed the given task by exploring both perspectives on whether interviews are a reliable method of choosing whom to employ.
- The ideas presented are relevant to the task and provide some arguments to support both sides of the debate.
- However, the response does not fully cover all aspects of the task, as it does not provide a clear stance or discuss alternative methods in detail.
- Suggested Band Score (Task Achievement): 3.5

## Coherence and Cohesion:
- The essay lacks clear transitions between sentences and paragraphs, making it difficult to follow the progression of ideas.
- Connecting words and phrases are not used effectively to maintain a smooth flow of information.
- The logical sequence of information could be improved by rearranging certain paragraphs or sentences.
- Suggested Band Score (Coherence and Cohesion): 3.0

## Lexical Resource (Vocabulary):
- The vocabulary used in the essay is limited and repetitive, with frequent use of basic and common words.
- There are several instances of inaccuracies in vocabulary usage, such as "sets of questions and criterias" instead of "sets of questions and criteria" and "gauarantee" instead of "guarantee."
- The vocabulary is not always appropriate for the given context, such as the use of the informal term "amount" in a formal essay.
- Suggested Band Score (Lexical Resource): 3.0

## Grammatical Range and Accuracy:
- The essay exhibits a limited range of sentence structures, with frequent use of simple sentences.
- There are several grammatical errors throughout the essay, such as "others propound" instead of "others argue" and "if interns are able to achieve some tasks given by a manager or a director, they are likely to fit in the firm" which has incorrect verb tense and subject-verb agreement.
- The use of punctuation and sentence formation could be improved for clarity and accuracy.
- Suggested Band Score (Grammatical Range and Accuracy): 3.0

## Overall Band Score:
- Considering the holistic performance across all criteria, the essay falls within the range of Band 3.0.
- While the essay addresses the task and provides some relevant ideas, it lacks coherence, accuracy, and a sophisticated use of language.
- Suggested Overall Band Score: 3.0

## Feedback and Additional Comments:
- **Strengths:** The essay demonstrates an understanding of the task and provides some valid arguments.
- **Areas for Improvement:**
    - Enhance coherence and cohesion by using clear transitions and logical organization.
    - Expand vocabulary and improve accuracy by using more diverse and appropriate words.
    - Improve grammatical accuracy by carefully checking for errors in verb tense, subject-verb agreement, and punctuation.
    - Provide more detailed support for the arguments presented by providing specific examples or evidence.
    - Consider using more complex sentence structures to demonstrate grammatical range.</s>

# Load Model - Mistral - 7b

Install the libraries required for model *quantization*.

In [None]:
!pip install bitsandbytes
!pip install accelerate
!pip install xformers
!pip install peft

Import the libraries.

In [None]:
from unsloth import FastLanguageModel
import torch

In [None]:
max_seq_length = 2048
dtype = None
load_in_4bit = True

**Download the 4-bit Mistral-7b model from Hugging Face**

In [None]:
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/mistral-7b-bnb-4bit",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)

config.json:   0%|          | 0.00/1.05k [00:00<?, ?B/s]

==((====))==  Unsloth: Fast Mistral patching release 2024.3
   \\   /|    GPU: Tesla T4. Max memory: 14.748 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.2.1+cu121. CUDA = 7.5. CUDA Toolkit = 12.1.
\        /    Bfloat16 = FALSE. Xformers = 0.0.25. FA = False.
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth


model.safetensors:   0%|          | 0.00/4.13G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/116 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/971 [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/493k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.80M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/438 [00:00<?, ?B/s]


Add LoRA adapters so that only a small fraction of the parameters is trained (41,943,040 in this run, well under 1% of the 7B base model).

In [None]:
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 32,
    lora_dropout = 0,
    bias = "none",
    use_gradient_checkpointing = True,
    random_state = 3407,
    use_rslora = False,
    loftq_config = None,
)

Unsloth 2024.3 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.
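
As a sanity check on the number of trainable parameters, the LoRA parameter count can be derived by hand. The sketch below assumes the layer dimensions of the publicly released Mistral-7B and Llama-2-7B configurations (hidden size 4096; key/value projection width 1024 for Mistral and 4096 for Llama-2; MLP width 14336 and 11008 respectively). These dimensions are assumptions brought in from the model configs, not values printed in this notebook, and `lora_param_count` is a hypothetical helper.

```python
def lora_param_count(r, layers, hidden, kv_dim, intermediate):
    """Count LoRA parameters when all seven projection matrices are adapted.

    Each adapted weight of shape (out, in) gets two low-rank factors,
    A (r x in) and B (out x r), i.e. r * (in + out) extra parameters.
    """
    shapes = {
        "q_proj": (hidden, hidden),
        "k_proj": (hidden, kv_dim),
        "v_proj": (hidden, kv_dim),
        "o_proj": (hidden, hidden),
        "gate_proj": (hidden, intermediate),
        "up_proj": (hidden, intermediate),
        "down_proj": (intermediate, hidden),
    }
    per_layer = sum(r * (d_in + d_out) for d_in, d_out in shapes.values())
    return layers * per_layer


# Assumed Mistral-7B dimensions (grouped-query attention: 8 KV heads x 128).
mistral_params = lora_param_count(r=16, layers=32, hidden=4096,
                                  kv_dim=1024, intermediate=14336)
# Assumed Llama-2-7B dimensions (full multi-head attention).
llama_params = lora_param_count(r=16, layers=32, hidden=4096,
                                kv_dim=4096, intermediate=11008)
print(f"Mistral-7B: {mistral_params:,}")
print(f"Llama-2-7B: {llama_params:,}")
```

Under these assumptions the totals agree with the "Number of trainable parameters" lines that Unsloth prints when training starts.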


### Train Model

In [None]:
!pip install trl

In [None]:
from trl import SFTTrainer
from transformers import TrainingArguments

Set the hyperparameters for fine-tuning.

In [None]:
args = TrainingArguments(
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    warmup_steps=5,
    max_steps=60,
    learning_rate=1e-5,
    fp16=not torch.cuda.is_bf16_supported(),
    bf16=torch.cuda.is_bf16_supported(),
    logging_steps=1,
    optim="adamw_8bit",
    weight_decay=0.01,
    lr_scheduler_type="linear",
    seed=2024,
    output_dir="outputs"
)

Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).
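
The batch-size settings above determine how much of the dataset a 60-step run actually covers. The following sketch works through that arithmetic with the values from this notebook; the variable names are illustrative only.

```python
per_device_train_batch_size = 2
gradient_accumulation_steps = 4
max_steps = 60
num_examples = 9833  # size of the training split shown by the Map progress bar
num_gpus = 1         # a single Colab GPU

# One optimizer step consumes this many examples.
effective_batch = per_device_train_batch_size * gradient_accumulation_steps * num_gpus
# Total number of examples seen over the whole run.
examples_seen = effective_batch * max_steps
# Fraction of one full pass over the dataset.
epoch_fraction = examples_seen / num_examples

print(f"effective batch size: {effective_batch}")
print(f"examples seen: {examples_seen}")
print(f"fraction of an epoch: {epoch_fraction:.2f}")
```

In other words, this configuration trains on fewer than 500 of the 9,833 examples, so `max_steps` would need to grow substantially for a full epoch.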


In [None]:
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    dataset_num_proc=2,
    packing=False,
    args=args
)

Map (num_proc=2):   0%|          | 0/9833 [00:00<?, ? examples/s]

*Calculate Memory stats before training*

In [None]:
#@title Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

GPU = Tesla T4. Max memory = 14.748 GB.
4.5 GB of memory reserved.


**Start fine-tuning the model**

In [None]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 9,833 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 4
\        /    Total batch size = 8 | Total steps = 60
 "-____-"     Number of trainable parameters = 41,943,040


Step,Training Loss
1,1.3612
2,1.3161
3,1.3387
4,1.2631
5,1.2323
6,1.241
7,1.2351
8,1.2119
9,1.2095
10,1.2596


In [None]:
#@title Show final memory and time stats
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory         /max_memory*100, 3)
lora_percentage = round(used_memory_for_lora/max_memory*100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training.")
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

1985.1775 seconds used for training.
33.09 minutes used for training.
Peak reserved memory = 7.6 GB.
Peak reserved memory for training = 3.1 GB.
Peak reserved memory % of max memory = 51.532 %.
Peak reserved memory for training % of max memory = 21.02 %.


In [None]:
print(trainer_stats)

TrainOutput(global_step=60, training_loss=1.0148264328638712, metrics={'train_runtime': 1985.1775, 'train_samples_per_second': 0.242, 'train_steps_per_second': 0.03, 'total_flos': 3.2129893832687616e+16, 'train_loss': 1.0148264328638712, 'epoch': 0.05})


<a name="Inference"></a>
### Inference
Let's run the model! You can change the instruction and input - leave the output blank!

In [None]:
prompt = """
In most countries, animal and plant species are declining rapidly. What are the causes of this? What measures could be done to prevent this decline?
"""
essay = """
The species of animals and plants are rapidly decreasing in most countries. In this essay, I will examine the factors that contribute towards declining animal and plant species and propose some solutions to that.

There are a few factors that contribute to the falling species of animals and plants. Firstly, the main cause for this issue probably is human activities. There are some people who like to kill animals and plants for their own purpose such as for collection. Secondly, the other reason why is this happening is because of nature degradation. Nowadays, climate change is getting worse in most countries  leading to natural selection for animals and plants. Therefore, their species witnessed a rapid fallen.

A few ways can be taken to prevent the decline of animal and plant species. Firstly, government plays a crucial role. They have to be aware of this issue and then  formulate some regulations to prevent animal and plant hunting. Although the law is already generated, the government should be more active to make sure the regulation is well-running. Furthermore, as citizens, we can drive a movement to raise awareness about this issue. Maybe, some people will underestimate what can citizens do with this little, but I believe if we hold hands together, we can make a change.

In conclusion, the species of animals and plants are declining rapidly  mainly caused by human activities that do illegal hunting. This essay suggested that the ways to prevent this problem are twofold: to generate strict regulation and to create a movement in order to raise society's awareness.
"""

In [None]:
FastLanguageModel.for_inference(model)
inputs = tokenizer(
[
    PROMPT.format(
        prompt,
        essay,
        "",
    )
], return_tensors = "pt").to("cuda")

outputs = model.generate(**inputs, max_new_tokens = 2048, use_cache = True, temperature=0.8, do_sample=True)
answer = tokenizer.batch_decode(outputs)
evaluation = answer[0].split("Evaluation:")[1]

print(f"prompt:\n{prompt}")
print(f"essay: \n{essay}")
display_markdown(evaluation)

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


prompt:

In most countries, animal and plant species are declining rapidly. What are the causes of this? What measures could be done to prevent this decline?

essay: 

The species of animals and plants are rapidly decreasing in most countries. In this essay, I will examine the factors that contribute towards declining animal and plant species and propose some solutions to that.

There are a few factors that contribute to the falling species of animals and plants. Firstly, the main cause for this issue probably is human activities. There are some people who like to kill animals and plants for their own purpose such as for collection. Secondly, the other reason why is this happening is because of nature degradation. Nowadays, climate change is getting worse in most countries  leading to natural selection for animals and plants. Therefore, their species witnessed a rapid fallen.

A few ways can be taken to prevent the decline of animal and plant species. Firstly, government plays a crucia



## Task Achievement:

- The candidate has adequately addressed the given task by providing relevant information and discussing both causes and potential solutions. (Score: 7)

## Coherence and Cohesion:

- The essay is generally well-organized and coherent, although there could be improved transitions between sentences and paragraphs. (Score: 6.5)

## Lexical Resource (Vocabulary):

- The candidate demonstrates a decent range of vocabulary, although there are a few inaccuracies and overuse of certain words. (Score: 7)
- Inaccuracies:
  - "probably" should be "probably because"
  - "degradation" should be "degradation of nature"
  - "natural selection" should be "natural selection of animals and plants"
- Overuse of "decline" and "decreasing"
- Suggested modifications:
  - "probably because" -> "due to"
  - "degradation of nature" -> "environmental degradation"
  - "natural selection of animals and plants" -> "natural process in which weaker species are eliminated"
  - "decline" -> "decrease"
  - "decreasing" -> "dropping/shrinking"

## Grammatical Range and Accuracy:

- The essay exhibits a variety of sentence structures, but there are some grammatical errors. (Score: 7)
- Errors:
  - "For collection" should be "For collecting"
  - "Nowadays, climate change is getting worse in most countries" should be "Climate change is currently having a severe impact on many countries"
- Suggested corrections:
  - "For collection" -> "For the purpose of collecting"
  - "Nowadays, climate change is getting worse in most countries" -> "Climate change is currently having a severe impact on many countries"

## Overall Band Score:

- Considering the holistic performance across all criteria, the essay demonstrates a good level of achievement and could potentially receive an overall band score of 7. (Score: 7)

## Feedback and Additional Comments:

- Strengths:
  - The essay provides a clear and well-organized discussion of the causes and solutions for declining animal and plant species.
  - The essay demonstrates a good range of vocabulary and sentence structures.
- Areas for Improvement:
  - Transitions between sentences and paragraphs could be improved for better coherence and cohesion.
  - Exercise caution with vocabulary choices and ensure accuracy in sentence construction.
  - Expand on suggested strategies for enhancement, offering specific and actionable advice for improvement.</s>
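
Because the generated evaluation is free text, any downstream use of the band scores needs a parsing step. The sketch below is one possible approach, not part of the original pipeline: `parse_band_scores` is a hypothetical helper that handles both score formats seen in this notebook, `Suggested Band Score (...): 3.5` and `(Score: 7)`.

```python
import re

# A criterion heading such as "## Task Achievement:".
HEADING = re.compile(r"^##\s*(.+?):\s*$")
# "Score", optional qualifier up to the colon, then an integer or decimal.
SCORE = re.compile(r"Score[^:\n]*:\s*(\d+(?:\.\d+)?)")


def parse_band_scores(evaluation):
    """Map each '## <criterion>:' heading to the first score found beneath it."""
    scores = {}
    current = None
    for line in evaluation.splitlines():
        heading = HEADING.match(line.strip())
        if heading:
            current = heading.group(1)
            continue
        found = SCORE.search(line)
        if found and current is not None and current not in scores:
            scores[current] = float(found.group(1))
    return scores


# Invented sample mixing the two formats that appear in the outputs above.
sample_evaluation = """## Task Achievement:
- The candidate addressed the task. (Score: 7)

## Coherence and Cohesion:
- Transitions could be improved.
- Suggested Band Score (Coherence and Cohesion): 6.5
"""
print(parse_band_scores(sample_evaluation))
```

Sections without a recognizable score are simply absent from the returned dictionary, which makes malformed generations easy to detect.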


Use `TextStreamer` to display each token as soon as the model generates it.

In [None]:
p = """
Some people believe that governments should pay full course fees for students who want to study in universities. Do you agree or disagree with this statement?
"""
e = """Some individuals argue that the government should offer free university courses for students because pupils in university feel burdensome to pay these fees. Whereas, others including myself disagree with this idea. This essay below will discuss both sides of the argument and my perspective.

On the one hand, it is admittable that course fees in universities are too high for young adults to afford, so financial aid from the authorities will be helpful. For instance, to pay eight-semester fees in South Korea, I had to have more than two part-time jobs for four years of university life. What is more, in the case of others, they owe frustrating amounts of debt, although they work and study at the same time, and it remains demanding even if they graduate and begin to work for a company. Therefore, it seems reasonable if the government offers a helping hand to those who desire to study in universities.

However, if the government supports with entire course fees, a country may suffer from financial issues. To illustrate this, if the authorities in a nation declared that tuition would be free of charge, all citizens who are qualified would enter universities, and the government would spend most of their money solely on  education. Then, it would not be easy for the authority to help with technological development or the maintenance of social infrastructures, which, in turn, would lead to financial difficulties for the nation and a deterioration of residents' life quality. Thus, full governmental support for universities is not effective.

In a nutshell, despite some needs, I disagree with the statement that pupils should be fully supported with their course fees from the central or local government since it does not help one's life in this nation."""

In [None]:
# sample = dataset[7777]
FastLanguageModel.for_inference(model)
inputs = tokenizer(
[
    PROMPT.format(
        p,
        e,
        "",
    )
], return_tensors = "pt").to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer)
_ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 2048, temperature=0.777, do_sample=True)

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


<s> 
In this task, you are required to evaluate an IELTS Writing Task 2 essay. Consider the following four criteria and provide a detailed assessment for each, along with a suggested band score:

## Task Achievement:
- Evaluate how well the candidate has addressed the given task.
- Assess the clarity and coherence of the response in presenting ideas.
- Identify if the candidate has fully covered all parts of the task and supported arguments appropriately.
- Suggested Band Score (Task Achievement): [Insert Score]

## Coherence and Cohesion:
- Assess the overall organization and structure of the essay.
- Evaluate the use of linking devices to connect ideas and paragraphs.
- Identify if there is a logical flow of information.
- Suggested Band Score (Coherence and Cohesion): [Insert Score]

## Lexical Resource (Vocabulary):
- Examine the range and accuracy of vocabulary used in the essay.
- Point out specific mistakes in vocabulary, such as inaccuracies or overuse of certain words and Sugg

<a name="Save"></a>
### Save the model to Hugging Face


In [None]:
model.push_to_hub("chillies/ielts-tutor-7b", token="") # Online saving

adapter_model.safetensors:   0%|          | 0.00/168M [00:00<?, ?B/s]

Saved model to https://huggingface.co/chillies/ielts-tutor-7b


# Experimenting with Llama-2

After pushing the Mistral adapter to Hugging Face, we reset the runtime and set up the environment from scratch so that the full ~14 GB of free Colab GPU memory is available.

*We then repeat the same fine-tuning steps used for Mistral-7b.*

In [None]:
llama_model, llama_tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/llama-2-7b-bnb-4bit", # Choose ANY! eg mistralai/Mistral-7B-Instruct-v0.2
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)

config.json:   0%|          | 0.00/1.10k [00:00<?, ?B/s]

==((====))==  Unsloth: Fast Llama patching release 2024.3
   \\   /|    GPU: Tesla T4. Max memory: 14.748 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.2.0. CUDA = 7.5. CUDA Toolkit = 11.8.
\        /    Bfloat16 = FALSE. Xformers = 0.0.24. FA = False.
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth




model.safetensors:   0%|          | 0.00/3.87G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/188 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/894 [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/500k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.84M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/438 [00:00<?, ?B/s]

In [None]:
llama_model = FastLanguageModel.get_peft_model(
    llama_model,
    r = 16,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0,
    bias = "none",
    use_gradient_checkpointing = True,
    random_state = 3407,
    use_rslora = False,
    loftq_config = None,
)

Unsloth 2024.3 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.


In [None]:
from trl import SFTTrainer
from transformers import TrainingArguments

trainer = SFTTrainer(
    model = llama_model,
    tokenizer = llama_tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 2,
    packing = False, # Can make training 5x faster for short sequences.
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        max_steps = 60,
        learning_rate = 2e-4,
        fp16 = not torch.cuda.is_bf16_supported(),
        bf16 = torch.cuda.is_bf16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
    ),
)

Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).


Map (num_proc=2):   0%|          | 0/9912 [00:00<?, ? examples/s]

In [None]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 9,912 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 4
\        /    Total batch size = 8 | Total steps = 60
 "-____-"     Number of trainable parameters = 39,976,960


Step,Training Loss
1,1.2483
2,1.2381
3,1.2639
4,1.2509
5,1.2608
6,1.1528
7,1.1898
8,1.0534
9,0.9879
10,1.0209


In [None]:
print(trainer_stats)

TrainOutput(global_step=60, training_loss=0.790923851976792, metrics={'train_runtime': 2133.9401, 'train_samples_per_second': 0.225, 'train_steps_per_second': 0.028, 'total_flos': 3.400654872767693e+16, 'train_loss': 0.790923851976792, 'epoch': 0.05})


In [None]:
llama_model.push_to_hub("chillies/llama_ielts_evaluator_base", token="") # Online saving

adapter_model.safetensors:   0%|          | 0.00/160M [00:00<?, ?B/s]

Saved model to https://huggingface.co/chillies/llama_ielts_evaluator_base


In [None]:
# sample = dataset[7777]
FastLanguageModel.for_inference(llama_model)
inputs = llama_tokenizer(
[
    PROMPT.format(
        p,
        e,
        "",
    )
], return_tensors = "pt").to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(llama_tokenizer)
_ = llama_model.generate(**inputs, streamer = text_streamer, max_new_tokens = 2048)

<s> 
In this task, you are required to evaluate an IELTS Writing Task 2 essay. Consider the following four criteria and provide a detailed assessment for each, along with a suggested band score:

## Task Achievement:
- Evaluate how well the candidate has addressed the given task.
- Assess the clarity and coherence of the response in presenting ideas.
- Identify if the candidate has fully covered all parts of the task and supported arguments appropriately.
- Suggested Band Score (Task Achievement): [Insert Score]

## Coherence and Cohesion:
- Assess the overall organization and structure of the essay.
- Evaluate the use of linking devices to connect ideas and paragraphs.
- Identify if there is a logical flow of information.
- Suggested Band Score (Coherence and Cohesion): [Insert Score]

## Lexical Resource (Vocabulary):
- Examine the range and accuracy of vocabulary used in the essay.
- Point out specific mistakes in vocabulary, such as inaccuracies or overuse of certain words and Sugg

Although Llama-2 reaches a lower training loss (0.791 versus 1.015 for Mistral-7b), it overfits: its outputs contain repeated words.
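
Word repetition of this kind can be quantified rather than judged by eye. The sketch below is an illustrative metric, not something the original experiment used: `repeated_ngram_ratio` is a hypothetical helper, and both sample strings are invented because the repetitive Llama-2 output is not reproduced in this notebook.

```python
from collections import Counter


def repeated_ngram_ratio(text, n=3):
    """Fraction of word n-grams that duplicate an earlier n-gram in `text`."""
    words = text.split()
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not ngrams:
        return 0.0
    counts = Counter(ngrams)
    # Every occurrence beyond the first counts as a repetition.
    repeated = sum(count - 1 for count in counts.values())
    return repeated / len(ngrams)


# Invented examples: one fluent sentence and one degenerate loop.
clean_text = "the essay presents a clear argument with relevant supporting examples"
loopy_text = "the essay is good the essay is good the essay is good"

clean_ratio = repeated_ngram_ratio(clean_text)
loopy_ratio = repeated_ngram_ratio(loopy_text)
print(f"clean: {clean_ratio:.2f}, loopy: {loopy_ratio:.2f}")
```

Comparing this ratio across the evaluations generated by both models would give a rough, reproducible measure of the repetition described above.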

# DPO

With the fine-tuned model available, the next step is DPO (Direct Preference Optimization).

**NOTE:** The DPO step is independent of the SFT step above. After SFT, we reset the runtime and set up the Colab environment from scratch so that the full 14 GB of free GPU memory is available for DPO.

DPO and SFT are kept in the same notebook to make the fine-tuning workflow easier to follow.
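
For orientation, DPO trains the policy to prefer the `chosen` response over the `rejected` one relative to a frozen reference model. The sketch below computes the standard DPO loss for a single preference pair from summed sequence log-probabilities. It is a plain-Python illustration, not the training code of this notebook; `dpo_loss` is a hypothetical helper and `beta = 0.1` is an assumed value.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one preference pair, given summed sequence log-probabilities."""
    # Implicit rewards: how much more the policy likes a response than the reference does.
    chosen_reward = policy_chosen_logp - ref_chosen_logp
    rejected_reward = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_reward - rejected_reward)
    # -log(sigmoid(margin)), written in a numerically stable softplus form.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Policy identical to the reference: no preference learned yet.
neutral = dpo_loss(-10.0, -20.0, -10.0, -20.0)
# Policy raises the chosen response and lowers the rejected one.
improved = dpo_loss(-10.0, -20.0, -12.0, -15.0)
print(f"neutral: {neutral:.4f}, improved: {improved:.4f}")
```

The loss starts at log 2 when the policy equals the reference and falls as the policy widens the gap between chosen and rejected responses.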

**Load dataset**

Load the dataset that I collected and pushed to Hugging Face.

Dataset link: https://huggingface.co/datasets/chillies/IELTS_essay_human_feedback

In [None]:
from datasets import load_dataset

dpo_data = load_dataset("chillies/IELTS_essay_human_feedback")

Downloading data:   0%|          | 0.00/5.06M [00:00<?, ?B/s]

Generating train split: 0 examples [00:00, ? examples/s]

In [None]:
dpo_data

DatasetDict({
    train: Dataset({
        features: ['prompt', 'essay', 'chosen', 'rejected\r\r'],
        num_rows: 768
    })
})

### **Process data**

In [None]:
PROMPT = """
In this task, you are required to evaluate an IELTS Writing Task 2 essay. Consider the following four criteria and provide a detailed assessment for each, along with a suggested band score:

## Task Achievement:
- Evaluate how well the candidate has addressed the given task.
- Assess the clarity and coherence of the response in presenting ideas.
- Identify if the candidate has fully covered all parts of the task and supported arguments appropriately.
- Suggested Band Score (Task Achievement): [Insert Score]

## Coherence and Cohesion:
- Assess the overall organization and structure of the essay.
- Evaluate the use of linking devices to connect ideas and paragraphs.
- Identify if there is a logical flow of information.
- Suggested Band Score (Coherence and Cohesion): [Insert Score]

## Lexical Resource (Vocabulary):
- Examine the range and accuracy of vocabulary used in the essay.
- Point out specific mistakes in vocabulary, such as inaccuracies or overuse of certain words and Suggest modified versions or alternatives for the identified mistakes. [list of mistakes and rectify]
- Assess the appropriateness of vocabulary for the given context.
- Suggested Band Score (Lexical Resource): [Insert Score]

## Grammatical Range and Accuracy:
- Evaluate the variety and complexity of sentence structures.
- Point out specific grammatical errors, such as incorrect verb forms or sentence construction and Suggest modified versions or corrections for the identified mistakes. [list of mistakes and rectify]
- Examine the use of punctuation and sentence formation.
- Suggested Band Score (Grammatical Range and Accuracy): [Insert Score]

## Overall Band Score:

- Provide an overall band score for the essay, considering the holistic performance across all criteria.
- Consider the synergy of the essay in meeting the task requirements cohesively.
- Suggested Overall Band Score: [Insert Score]

## Feedback and Additional Comments:
- Provide constructive feedback highlighting specific strengths and areas for improvement.
- Suggest strategies for enhancement in weaker areas.

## Prompt:
{}

## Essay:
{}

## Evaluation:
{}"""

In [None]:
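EOS_TOKEN = '</s>'  # end-of-sequence token used by the tokenizer

def formatting_prompts_func(examples):
    """Build the prompt/chosen/rejected columns that DPOTrainer expects."""
    essay_prompts = examples["prompt"]
    essays = examples["essay"]
    chosen = list(examples["chosen"])
    rejected = list(examples["rejected"])

    # Fill the template with the task prompt and the essay; the evaluation slot stays empty.
    texts = []
    for p, e in zip(essay_prompts, essays):
        text = PROMPT.format(p, e, "") + EOS_TOKEN
        texts.append(text)

    # Terminate both completions with EOS so the model learns where to stop.
    chosen_evals = []
    rejected_evals = []
    for c, r in zip(chosen, rejected):
        chosen_evals.append(c + EOS_TOKEN)
        rejected_evals.append(r + EOS_TOKEN)

    return {"prompt": texts, "chosen": chosen_evals, "rejected": rejected_evals}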

In [None]:
dpo_data = dpo_data.rename_column("rejected\r\r", "rejected")
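The stray `\r\r` in the column name most likely comes from line endings in the uploaded file. Rather than hard-coding one rename, the mapping can be derived from the column names themselves. This is a minimal sketch; the helper `column_rename_map` is hypothetical and only operates on a plain list of names.

```python
def column_rename_map(column_names):
    """Map every column whose name carries stray whitespace to its stripped form."""
    return {name: name.strip() for name in column_names if name != name.strip()}

mapping = column_rename_map(['prompt', 'essay', 'chosen', 'rejected\r\r'])
print(mapping)

# With the DatasetDict loaded above, the mapping could then be applied in one call:
# dpo_data = dpo_data.rename_columns(mapping)
```

Only the names that actually change end up in the mapping, so clean columns are left untouched.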

In [None]:
dpo_dataset = dpo_data.map(formatting_prompts_func, batched=True).remove_columns(['essay'])

Map:   0%|          | 0/768 [00:00<?, ? examples/s]

In [None]:
dpo_dataset

DatasetDict({
    train: Dataset({
        features: ['prompt', 'chosen', 'rejected'],
        num_rows: 768
    })
})
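Before training, it can be worth checking that every row is a usable preference pair: a non-empty prompt, non-empty chosen and rejected texts, and two completions that actually differ. The sketch below shows one way to do this on plain dictionaries; `find_bad_pairs` and the toy rows are illustrative assumptions, not part of the original pipeline.

```python
def find_bad_pairs(rows):
    """Return the indices of rows that cannot serve as a preference pair."""
    bad = []
    for i, row in enumerate(rows):
        prompt = row["prompt"].strip()
        chosen = row["chosen"].strip()
        rejected = row["rejected"].strip()
        # A pair is unusable if any field is empty or both completions are identical.
        if not prompt or not chosen or not rejected or chosen == rejected:
            bad.append(i)
    return bad

# Toy rows in the same prompt/chosen/rejected layout as dpo_dataset.
toy_rows = [
    {"prompt": "## Essay: ...", "chosen": "Band 5.0</s>", "rejected": "Band 4.0</s>"},
    {"prompt": "## Essay: ...", "chosen": "Band 5.0</s>", "rejected": "Band 5.0</s>"},
    {"prompt": "## Essay: ...", "chosen": "Band 6.0</s>", "rejected": "  "},
]
print(find_bad_pairs(toy_rows))  # [1, 2]
```

Since iterating over a `Dataset` split yields one dictionary per row, the same helper could presumably be called as `find_bad_pairs(dpo_dataset["train"])`.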

In [None]:
display_markdown(dpo_dataset['train'][555]['prompt'])


In this task, you are required to evaluate an IELTS Writing Task 2 essay. Consider the following four criteria and provide a detailed assessment for each, along with a suggested band score:

## Task Achievement:
- Evaluate how well the candidate has addressed the given task.
- Assess the clarity and coherence of the response in presenting ideas.
- Identify if the candidate has fully covered all parts of the task and supported arguments appropriately.
- Suggested Band Score (Task Achievement): [Insert Score]

## Coherence and Cohesion:
- Assess the overall organization and structure of the essay.
- Evaluate the use of linking devices to connect ideas and paragraphs.
- Identify if there is a logical flow of information.
- Suggested Band Score (Coherence and Cohesion): [Insert Score]

## Lexical Resource (Vocabulary):
- Examine the range and accuracy of vocabulary used in the essay.
- Point out specific mistakes in vocabulary, such as inaccuracies or overuse of certain words and Suggest modified versions or alternatives for the identified mistakes. [list of mistakes and rectify]
- Assess the appropriateness of vocabulary for the given context.
- Suggested Band Score (Lexical Resource): [Insert Score]

## Grammatical Range and Accuracy:
- Evaluate the variety and complexity of sentence structures.
- Point out specific grammatical errors, such as incorrect verb forms or sentence construction and Suggest modified versions or corrections for the identified mistakes. [list of mistakes and rectify]
- Examine the use of punctuation and sentence formation.
- Suggested Band Score (Grammatical Range and Accuracy): [Insert Score]

## Overall Band Score:

- Provide an overall band score for the essay, considering the holistic performance across all criteria.
- Consider the synergy of the essay in meeting the task requirements cohesively.
- Suggested Overall Band Score: [Insert Score]

## Feedback and Additional Comments:
- Provide constructive feedback highlighting specific strengths and areas for improvement.
- Suggest strategies for enhancement in weaker areas.

## Prompt:
In some countries children have very strict rules of behavior, in other countries they are allowed to do almost anything they like. To what extent should children have to follow rules?

## Essay:
Freedom plays a pivotal role in everybody’s life. We can see in today’s modernized era that nobody likes to be restricted, whether it is a child or an adult. Some people think that there should some strict rules of behavior for children, but I disagree with this statement.that sometimes more restrictions can cause more frustration in children, which leads to many other mental problems as well. they can . Sometimes they feel under pressure, which can be the main reason for their poor performance in their field. In some cases children may insist on doing .In other words we have to look other aspects as well, if we usually ignore our children’s bad habits, then they good human beings in their future life. Moreover if we never draw attention the children’s main activities then they might end up in a bad company. They the value of respect for their elders importance of relationships. They their cultural values as well.In a nutshell, I would like to say that children should be the value of their customs, rituals and respect towards their elders for their future life, but most of the additional restriction should . It would good human beings in their future.

## Evaluation:
</s>

In [None]:
sample = dpo_dataset["train"][555]
display_markdown(f"# Prompt: \n{sample['prompt']}")
display_markdown(f"\n# Chosen: \n{sample['chosen']}")
display_markdown(f"\n# Rejected: \n{sample['rejected']}")

# Prompt: 

In this task, you are required to evaluate an IELTS Writing Task 2 essay. Consider the following four criteria and provide a detailed assessment for each, along with a suggested band score:

## Task Achievement:
- Evaluate how well the candidate has addressed the given task.
- Assess the clarity and coherence of the response in presenting ideas.
- Identify if the candidate has fully covered all parts of the task and supported arguments appropriately.
- Suggested Band Score (Task Achievement): [Insert Score]

## Coherence and Cohesion:
- Assess the overall organization and structure of the essay.
- Evaluate the use of linking devices to connect ideas and paragraphs.
- Identify if there is a logical flow of information.
- Suggested Band Score (Coherence and Cohesion): [Insert Score]

## Lexical Resource (Vocabulary):
- Examine the range and accuracy of vocabulary used in the essay.
- Point out specific mistakes in vocabulary, such as inaccuracies or overuse of certain words and Suggest modified versions or alternatives for the identified mistakes. [list of mistakes and rectify]
- Assess the appropriateness of vocabulary for the given context.
- Suggested Band Score (Lexical Resource): [Insert Score]

## Grammatical Range and Accuracy:
- Evaluate the variety and complexity of sentence structures.
- Point out specific grammatical errors, such as incorrect verb forms or sentence construction and Suggest modified versions or corrections for the identified mistakes. [list of mistakes and rectify]
- Examine the use of punctuation and sentence formation.
- Suggested Band Score (Grammatical Range and Accuracy): [Insert Score]

## Overall Band Score:

- Provide an overall band score for the essay, considering the holistic performance across all criteria.
- Consider the synergy of the essay in meeting the task requirements cohesively.
- Suggested Overall Band Score: [Insert Score]

## Feedback and Additional Comments:
- Provide constructive feedback highlighting specific strengths and areas for improvement.
- Suggest strategies for enhancement in weaker areas.

## Prompt:
In some countries children have very strict rules of behavior, in other countries they are allowed to do almost anything they like. To what extent should children have to follow rules?

## Essay:
Freedom plays a pivotal role in everybody’s life. We can see in today’s modernized era that nobody likes to be restricted, whether it is a child or an adult. Some people think that there should some strict rules of behavior for children, but I disagree with this statement.that sometimes more restrictions can cause more frustration in children, which leads to many other mental problems as well. they can . Sometimes they feel under pressure, which can be the main reason for their poor performance in their field. In some cases children may insist on doing .In other words we have to look other aspects as well, if we usually ignore our children’s bad habits, then they good human beings in their future life. Moreover if we never draw attention the children’s main activities then they might end up in a bad company. They the value of respect for their elders importance of relationships. They their cultural values as well.In a nutshell, I would like to say that children should be the value of their customs, rituals and respect towards their elders for their future life, but most of the additional restriction should . It would good human beings in their future.

## Evaluation:
</s>


# Chosen: 
### Task Achievement:

The candidate has partially addressed the prompt. While they acknowledge the opposing viewpoint of strict rules for children, their response primarily focuses on the negative consequences of restrictions and fails to provide a balanced discussion. The response also lacks a clear thesis statement and evidence to support their arguments.

Suggested Band Score (Task Achievement): 5.0

### Coherence and Cohesion:

The essay lacks organization and coherence. The paragraphs are not logically connected, and ideas are presented in a disjointed manner. The use of linking devices is minimal, making it difficult to follow the flow of the argument.

Suggested Band Score (Coherence and Cohesion): 4.0

### Lexical Resource (Vocabulary):

The vocabulary used in the essay is limited and repetitive. The candidate often relies on basic words and phrases, such as "restrictions," "pressure," and "bad habits." There are several inaccuracies and errors in vocabulary, such as "In other words we have to look other aspects as well" and "They the value of respect for their elders importance of relationships."

Suggested Band Score (Lexical Resource): 5.0

### Grammatical Range and Accuracy:

The essay contains numerous grammatical errors, including incorrect verb forms, sentence construction, and punctuation. There is a lack of variety in sentence structures, with the majority of sentences following a simple subject-verb-object pattern.

Suggested Band Score (Grammatical Range and Accuracy): 4.0

### Overall Band Score:

5.0

### Feedback and Additional Comments:

**Strengths:**

* The candidate acknowledges the opposing viewpoint of strict rules for children.

**Areas for Improvement:**

* **Task Achievement:** Develop a clear thesis statement and provide balanced arguments supported by evidence.
* **Coherence and Cohesion:** Improve the organization and flow of the essay. Use linking devices to connect ideas and paragraphs logically.
* **Lexical Resource:** Expand the vocabulary used and correct inaccuracies.
* **Grammatical Range and Accuracy:** Improve sentence structures and eliminate grammatical errors.

**Strategies for Enhancement:**

* Practice writing essays on similar topics.
* Study model essays to improve organization and argumentation.
* Use online resources to check grammar and vocabulary.
* Seek feedback from a writing tutor or language partner.</s>


# Rejected: 
 "Overall, this essay has several issues that need to be addressed in order to improve its score in the IELTS writing exam. The four criteria for evaluation will be Task Achievement, Coherence and Cohesion, Lexical Resource, and Grammatical Range and Accuracy.

1. Task Achievement:
The essay does not fully address the prompt. While it mentions the topic of strict rules of behavior for children, the arguments presented are not well-developed and lack coherence. The writer briefly mentions the importance of restrictions and the potential negative consequences of too many rules, but these points are not supported with evidence or examples. The essay also does not provide a clear conclusion or a balanced view on the extent to which children should follow rules. 

2. Coherence and Cohesion:
The essay lacks coherence and cohesion due to numerous grammatical errors, incorrect sentence structure, and unclear ideas. There are also many incomplete sentences and run-on sentences that make it difficult to understand the writer's intended meaning. Transition words and phrases are used inconsistently, making the essay disjointed and difficult to follow.

3. Lexical Resource:
The writer demonstrates a limited range of vocabulary and uses some inappropriate word choices. There are also instances of incorrect word forms and collocation errors. The essay lacks clarity and precision in expressing ideas, as some sentences are vague and do not convey a clear message.

4. Grammatical Range and Accuracy:
There are numerous grammatical errors throughout the essay, including incorrect verb tenses, subject-verb agreement issues, and missing articles. Sentence structures are often incorrect, leading to confusion and lack of clarity. The essay also lacks proper punctuation and capitalization.

Based on the above evaluation, it is difficult to assign a specific band score for this essay. However, considering the significant issues in task achievement, coherence and cohesion, lexical resource, and grammatical range and accuracy, the band score for this essay is likely to be low, possibly around Band 4 or 4.5.</s>
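In this sample the chosen response follows the `Suggested Band Score (...)` lines of the template, while the rejected one only gives a vague range at the end. When the template is followed, the per-criterion scores can be pulled out with a regular expression, which is handy for comparing model outputs numerically. The sketch below assumes that exact line format; `extract_band_scores` is a hypothetical helper.

```python
import re

# Matches lines such as "Suggested Band Score (Task Achievement): 5.0".
SCORE_PATTERN = re.compile(r"Suggested Band Score \(([^)]+)\):\s*(\d+(?:\.\d+)?)")

def extract_band_scores(evaluation: str) -> dict:
    """Return {criterion: score} for every numeric band score found in the text."""
    return {criterion: float(score)
            for criterion, score in SCORE_PATTERN.findall(evaluation)}

# Score lines copied from the chosen response shown above.
chosen_excerpt = """
Suggested Band Score (Task Achievement): 5.0
Suggested Band Score (Coherence and Cohesion): 4.0
Suggested Band Score (Lexical Resource): 5.0
Suggested Band Score (Grammatical Range and Accuracy): 4.0
"""

scores = extract_band_scores(chosen_excerpt)
print(scores)
print(sum(scores.values()) / len(scores))
```

Because the pattern requires a number after the colon, the `[Insert Score]` placeholders in the instruction template are skipped, so the helper can be run on the full generated text without first cutting off the prompt.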

### **Load model for DPO**

Load the libraries needed for DPO.

In [None]:
from unsloth import FastLanguageModel
import torch

In [None]:
max_seq_length = 2048
dtype = None
load_in_4bit = True

In this step, I reload from HuggingFace the model that was fine-tuned in the SFT step.

In [None]:
MODEL_NAME = "chillies/IELTS-fighter"
model, tokenizer = FastLanguageModel.from_pretrained(
        model_name=MODEL_NAME,# "IELTS_evaluation_lora_model",
        max_seq_length=max_seq_length,
        dtype=dtype,
        load_in_4bit=load_in_4bit
)

adapter_config.json:   0%|          | 0.00/703 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/1.05k [00:00<?, ?B/s]

==((====))==  Unsloth: Fast Mistral patching release 2024.3
   \\   /|    GPU: Tesla T4. Max memory: 14.748 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.1.0+cu121. CUDA = 7.5. CUDA Toolkit = 12.1.
\        /    Bfloat16 = FALSE. Xformers = 0.0.22.post7. FA = False.
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth




model.safetensors:   0%|          | 0.00/4.13G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/116 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/971 [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/493k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.80M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/438 [00:00<?, ?B/s]

adapter_model.safetensors:   0%|          | 0.00/336M [00:00<?, ?B/s]

Unsloth 2024.3 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.


In [None]:
#@title Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

GPU = Tesla T4. Max memory = 14.748 GB.
5.0 GB of memory reserved.


### **DPO**

Run DPO.

In [None]:
# One must patch the DPO Trainer first!
from unsloth import PatchDPOTrainer
from transformers import TrainingArguments
from trl import DPOTrainer

PatchDPOTrainer()

Set the training arguments.

In [None]:
args = TrainingArguments(
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    warmup_ratio=0.1,
    num_train_epochs=1,
    learning_rate=1e-5,
    fp16=not torch.cuda.is_bf16_supported(),
    bf16=torch.cuda.is_bf16_supported(),
    logging_steps=1,
    optim="adamw_8bit",
    weight_decay=0.0,
    lr_scheduler_type="linear",
    seed=42,
    output_dir="dpo_outputs"
)

Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).
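The arguments above determine how many optimizer steps the run takes. A quick back-of-the-envelope check, assuming a single GPU, the 768 training pairs shown earlier, and that warmup steps are rounded up from `warmup_ratio`:

```python
import math

# Values copied from the TrainingArguments cell and the dataset summary.
num_examples = 768
per_device_train_batch_size = 2
gradient_accumulation_steps = 4
num_train_epochs = 1
warmup_ratio = 0.1

# One optimizer step consumes batch_size * accumulation_steps examples (single GPU assumed).
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps
steps_per_epoch = math.ceil(num_examples / effective_batch_size)
total_steps = steps_per_epoch * num_train_epochs

# Assumption: the warmup length is the ceiling of total_steps * warmup_ratio.
warmup_steps = math.ceil(total_steps * warmup_ratio)

print(effective_batch_size, total_steps, warmup_steps)  # 8 96 10
```

So the learning rate ramps up to 1e-5 over roughly the first 10 of 96 steps and then decays linearly for the rest of the epoch.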


Set up the DPOTrainer.

In [None]:
dpo_trainer = DPOTrainer(
    model=model,
    ref_model=None,
    args=args,
    beta=0.1,
    tokenizer=tokenizer,
    train_dataset=dpo_dataset['train'],
    max_length=1024,
    max_prompt_length=1024
)



Map:   0%|          | 0/768 [00:00<?, ? examples/s]

dataloader_config = DataLoaderConfiguration(dispatch_batches=None, split_batches=False, even_batches=True, use_seedable_sampler=True)
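One thing worth checking with `max_length=1024` and `max_prompt_length=1024` is how much room is left for the chosen and rejected completions, since the instruction template plus the essay is already long. The sketch below models a prompt-first truncation scheme. This scheme is an assumption about how the trainer behaves, not a statement about the installed TRL version, and the token counts are hypothetical.

```python
def surviving_completion_tokens(prompt_len, completion_len, max_length, max_prompt_length):
    """Completion tokens left after truncation (assumed scheme, see the comments)."""
    # Nothing is truncated if the whole sequence already fits.
    if prompt_len + completion_len <= max_length:
        return completion_len
    # Assumed step 1: cut the prompt down to max_prompt_length.
    prompt_len = min(prompt_len, max_prompt_length)
    if prompt_len + completion_len <= max_length:
        return completion_len
    # Assumed step 2: cut the completion to whatever budget remains.
    return min(completion_len, max_length - max_prompt_length)

# Hypothetical sizes: a 900-token prompt (template + essay) and a 400-token evaluation.
for max_length, max_prompt_length in [(1024, 1024), (1024, 512), (2048, 1024)]:
    kept = surviving_completion_tokens(900, 400, max_length, max_prompt_length)
    print(max_length, max_prompt_length, kept)
```

Under this assumed scheme, setting both limits to the same value leaves no budget for the completion once a long example overflows, whereas a smaller `max_prompt_length` or a larger `max_length` keeps the evaluation intact. Measuring real token lengths with the tokenizer would show whether this applies to the dataset.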


Fine-tune with DPO.

In [None]:
# Keep the returned stats so they can be printed later.
dpo_train_stats = dpo_trainer.train()
dpo_train_stats

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 768 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 4
\        /    Total batch size = 8 | Total steps = 96
 "-____-"     Number of trainable parameters = 41,943,040
Could not estimate the number of tokens of the input, floating-point operations will not be computed


Step,Training Loss,rewards / chosen,rewards / rejected,rewards / accuracies,rewards / margins,logps / rejected,logps / chosen,logits / rejected,logits / chosen
1,0.6931,0.0,0.0,0.0,0.0,0.0,0.0,-3.071243,-3.071243
2,0.4357,2.840192,-0.130424,0.375,2.970616,-135.759964,-155.472931,-3.1384,-3.112468
3,0.6065,1.057781,0.058687,0.125,0.999094,-35.488091,-45.132965,-3.066482,-3.070861
4,0.5199,2.336175,0.019314,0.25,2.316861,-77.916107,-89.607437,-3.054017,-3.047468
5,0.6931,0.0,0.0,0.0,0.0,0.0,0.0,-3.036806,-3.036806
6,0.5314,1.599558,0.113139,0.25,1.486419,-89.323051,-89.173744,-3.022493,-2.996707
7,0.6931,0.0,0.0,0.0,0.0,0.0,0.0,-3.071886,-3.071886
8,0.7518,-0.134888,-0.036534,0.0,-0.098354,-47.102192,-71.808792,-3.009304,-3.023611
9,0.6065,1.341014,0.016977,0.125,1.324037,-38.353172,-43.290359,-2.904484,-2.902493
10,0.6931,0.0,0.0,0.0,0.0,0.0,0.0,-3.021194,-3.021194


TrainOutput(global_step=96, training_loss=0.606065477244556, metrics={'train_runtime': 3110.6109, 'train_samples_per_second': 0.247, 'train_steps_per_second': 0.031, 'total_flos': 0.0, 'train_loss': 0.606065477244556, 'epoch': 1.0})
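The value 0.6931 that appears several times in the log is ln 2, which is what the standard sigmoid DPO loss gives when the policy and the reference assign the same log-probabilities to both completions, so the reward margin is zero. The sketch below computes that loss for a single pair; the function name and the log-probability values are illustrative, and `beta=0.1` is taken from the trainer cell.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Sigmoid DPO loss for one preference pair."""
    # Implicit rewards are beta times the policy/reference log-ratio.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(margin))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Policy identical to the reference: zero margin, loss = ln 2.
print(round(dpo_loss(-50.0, -60.0, -50.0, -60.0), 4))  # 0.6931

# Policy prefers the chosen answer more than the reference does: loss drops below ln 2.
print(dpo_loss(-45.0, -65.0, -50.0, -60.0) < math.log(2))  # True
```

A loss of 0.6931 at step 1 is therefore expected, because the policy starts out equal to the reference. Later steps that report exactly 0.6931 with all rewards and logps at 0.0 mean no learning signal came from that batch, so those rows are worth inspecting.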

### Push model to HuggingFace and inference

Push the model to HuggingFace.

In [None]:
model.push_to_hub("chillies/DPO_ielts_fighter", token="") # Online saving

adapter_model.safetensors:   0%|          | 0.00/336M [00:00<?, ?B/s]

Saved model to https://huggingface.co/chillies/DPO_ielts_fighter


In [None]:
print(dpo_train_stats)

TrainOutput(global_step=96, training_loss=0.6931471824645996, metrics={'train_runtime': 4747.0812, 'train_samples_per_second': 0.162, 'train_steps_per_second': 0.02, 'total_flos': 0.0, 'train_loss': 0.6931471824645996, 'epoch': 1.0})


**Inference model**

In [None]:
p = """
The images below explain the process of how pollutants from modern life become acid rain that in turn pollutes the environment and the food chain.
"""
e = """
The given diagram illustrates the acid rain process that has  caused contamination in the environment and the food chain.

Overall what stands out from the diagram is that the Process commences with the entry of pollutants into the clouds and ends with water containing contaminants that damage nature.

Firstly,the majority of pollutants created come from housing, transport,factories and power stations .These sources of pollution release fumes such as carbon dioxide,sulphur dioxide,nitrogen oxides and other compounds.These contaminants then rise up into the air and are absorbed by clouds ,which then become polluted and travel across countries and continents.Afterwards,water containing pollutants in the cloud condenses and falls into the Earth ,namely acid rain.After that ,acid rain falls into hills,trees, and plants and into rivers and lakes,contaminating them, which leads to the pollution of the environment and the food chain.This in turn will lead to the dying out of the animals, which will then lead to an imbalance in the food chain, greatly impacting the ecosystem.
"""

In [None]:
FastLanguageModel.for_inference(model)
inputs = tokenizer(
[
    PROMPT.format(
        p,
        e,
        "",
    )
], return_tensors = "pt").to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer)
_ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 2048)

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


<s> 
In this task, you are required to evaluate an IELTS Writing Task 2 essay. Consider the following four criteria and provide a detailed assessment for each, along with a suggested band score:

## Task Achievement:
- Evaluate how well the candidate has addressed the given task.
- Assess the clarity and coherence of the response in presenting ideas.
- Identify if the candidate has fully covered all parts of the task and supported arguments appropriately.
- Suggested Band Score (Task Achievement): [Insert Score]

## Coherence and Cohesion:
- Assess the overall organization and structure of the essay.
- Evaluate the use of linking devices to connect ideas and paragraphs.
- Identify if there is a logical flow of information.
- Suggested Band Score (Coherence and Cohesion): [Insert Score]

## Lexical Resource (Vocabulary):
- Examine the range and accuracy of vocabulary used in the essay.
- Point out specific mistakes in vocabulary, such as inaccuracies or overuse of certain words and Sugg