# Llama-3-8B-Instruct Evaluation

We use the Meta Llama-3-8B-Instruct model to try out a variety of tasks, then measure its MRC (Machine Reading Comprehension) performance on the SQuAD evaluation dataset.

## 0. Setup

In [1]:
# Required in the MLP Suwon environment
import os

os.environ['REQUESTS_CA_BUNDLE'] = '/etc/ssl/certs/ca-certificates.crt'
os.environ['HTTP_PROXY'] = 'http://75.17.107.42:8080'
os.environ['HTTPS_PROXY'] = 'http://75.17.107.42:8080'

In [2]:
# Required in the MLP Suwon environment
import ssl

# Use an unverified SSL context for default HTTPS connections.
if hasattr(ssl, '_create_unverified_context'):
    ssl._create_default_https_context = ssl._create_unverified_context

## 1. Llama-3-8B-Instruct Model

We load the Meta Llama-3-8B-Instruct model.  
Loading the model requires a **Hugging Face Access Token**, and access to the Llama models must be requested in advance.  
(https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct)

In [None]:
# from huggingface_hub import notebook_login
# notebook_login()

In [None]:
"""
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16, # bfloat16
    device_map="auto",
)
"""

If you cannot obtain access to the model, use the copy stored in the local directory (Group Volume) below.

In [3]:
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# model_ckpt = "meta-llama/Meta-Llama-3-8B-Instruct"
model_ckpt = "/group-volume/sr_edu/AI-Application-Specialist/LLM/model/Meta-Llama-3-8B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
tokenizer.pad_token_id = tokenizer.eos_token_id

model = AutoModelForCausalLM.from_pretrained(
    model_ckpt,
    torch_dtype=torch.float16, # bfloat16
    device_map="auto",
)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.



In [4]:
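def generate_response(system_message, user_message):
    """Generate one chat reply from the model for the given system and user messages."""
    messages = [
        {"role": "system", "content": system_message},
        {"role": "user", "content": user_message},
    ]

    # Render the messages with the model's chat template and append the
    # assistant header so that generation starts at the reply.
    input_ids = tokenizer.apply_chat_template(
        messages,
        add_generation_prompt=True,
        return_tensors="pt"
    ).to(model.device)

    # Stop at either the regular EOS token or Llama 3's end-of-turn token.
    terminators = [
        tokenizer.eos_token_id,
        tokenizer.convert_tokens_to_ids("<|eot_id|>")
    ]

    outputs = model.generate(
        input_ids,
        max_new_tokens=256,  # longer replies are cut off at this limit
        eos_token_id=terminators,
        pad_token_id=tokenizer.eos_token_id,
        do_sample=True,  # sampling makes the output vary between runs
        temperature=0.6,
        top_p=0.9,
    )

    # Keep only the newly generated tokens, dropping the prompt.
    response = outputs[0][input_ids.shape[-1]:]

    return tokenizer.decode(response, skip_special_tokens=True)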
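
To make the role of `add_generation_prompt=True` and the `<|eot_id|>` terminator more concrete, here is a minimal sketch of the text layout that a Llama 3 chat prompt is generally understood to follow. The function `render_llama3_prompt` is a hypothetical helper written for illustration; the tokenizer's own chat template remains the authoritative source and may differ in details.

```python
def render_llama3_prompt(messages, add_generation_prompt=True):
    """Approximate the Llama 3 chat layout as plain text (illustration only)."""
    prompt = "<|begin_of_text|>"
    for message in messages:
        # Each turn starts with a role header and ends with the end-of-turn token.
        prompt += f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
        prompt += f"{message['content'].strip()}<|eot_id|>"
    if add_generation_prompt:
        # An open assistant header asks the model to write the next turn.
        prompt += "<|start_header_id|>assistant<|end_header_id|>\n\n"
    return prompt


messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"},
]
prompt = render_llama3_prompt(messages)
print(prompt)
```

Because every turn closes with `<|eot_id|>`, the model is expected to emit the same token when its reply is finished, which is why it is listed as a terminator alongside the EOS token.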

## 2. Llama-3-8B-Instruct Model Test (ChatBot, Translation, Coding, etc.)

We use the **generate_response()** function to try out a variety of NLP tasks.

In [5]:
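# [Exercise] Complete the following code!
# Try out a variety of NLP tasks.
# The Korean prompt below means: "Write a poem on the theme of summer."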
response = generate_response(system_message="",
                             user_message="여름을 주제로 시를 작성해줘.")

print(response)

Here's a poem about summer:

Summer's warmth descends upon the land
A gentle breeze that whispers sweet commands
The sun beats down, a fiery glow
As petals bloom, and all around, it grows

The scent of ripened fruits fills the air
As children laugh, without a single care
Their shouts and giggles echo through the day
As they chase fireflies, in a joyful sway

The world is full of vibrant hues and sounds
As nature awakens from its winter bounds
The trees regain their verdant, emerald sheen
And wildflowers bloom, a colorful dream

The nights are long, the stars shine bright and clear
As crickets serenade, without a fear
The world is full of magic, wild and free
In summer's warmth, we find ecstasy

So let us bask in summer's radiant glow
And let our spirits soar, as the days grow slow
For in this season of sunshine and delight
We find our hearts filled with joy, and our souls alight.


In [6]:
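# [Exercise] Complete the following code!
# Try out a variety of NLP tasks.
# The Korean prompt below means: "Translate the following sentence into Korean, Time flies like an arrow."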
response = generate_response(system_message="",
                             user_message="다음 문장을 한국어로 번역해줘, Time flies like an arrow.")

print(response)

시간은 화살과 같이 날아간다.

(Note: This is a common idiomatic expression in English, and the translation is also an idiomatic expression in Korean. The phrase "like an arrow" is used to convey the idea that time passes quickly and swiftly, just like an arrow flying through the air.)


In [7]:
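# [Exercise] Complete the following code!
# Try out a variety of NLP tasks.
# The Korean prompt below means: "Write the quicksort algorithm in Python."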
response = generate_response(system_message="",
                             user_message="퀵소트 파이썬 알고리즘 작성해줘.")

print(response)

Here is a Python implementation of the QuickSort algorithm:
```python
def quicksort(arr):
    if len(arr) <= 1:
        return arr
    pivot = arr[0]
    less = [x for x in arr[1:] if x <= pivot]
    greater = [x for x in arr[1:] if x > pivot]
    return quicksort(less) + [pivot] + quicksort(greater)
```
Here's an explanation of how the algorithm works:

1. If the length of the input array is 0 or 1, return the original array (since it's already sorted).
2. Choose the first element of the array as the pivot.
3. Create two lists: `less` and `greater`. `less` contains all elements in the array that are less than or equal to the pivot, and `greater` contains all elements that are greater than the pivot.
4. Recursively call the `quicksort` function on `less` and `greater`.
5. Concatenate the results of the recursive calls, with the pivot element in its final position.

Here's an example usage:
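```python
arr = [5, 2, 8, 3, 1, 6
```

(The response stops here, most likely because generation reached the `max_new_tokens=256` limit.)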


## 3. SQuAD Dataset

We evaluate MRC (Machine Reading Comprehension) performance using the **SQuAD** dataset.

In [8]:
from datasets import load_dataset

dataset = load_dataset('/group-volume/sr_edu/AI-Application-Specialist/LLM/dataset/squad')
print(dataset)

DatasetDict({
    train: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 87599
    })
    validation: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 10570
    })
})


In [9]:
dataset["train"][0]

{'id': '5733be284776f41900661182',
 'title': 'University_of_Notre_Dame',
 'context': 'Architecturally, the school has a Catholic character. Atop the Main Building\'s gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.',
 'question': 'To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?',
 'answers': {'text': ['Saint Bernadette Soubirous'], 'answer_start': [515]}}
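
In SQuAD, `answer_start` is the character offset of the answer inside `context`. The sketch below illustrates that relationship on a shortened, hand-made record (not an actual dataset row); `answer_span` is a hypothetical helper name.

```python
# A shortened, hand-made record that mimics the SQuAD layout (not real data).
context = "It is a replica of the grotto at Lourdes, France."
answer = "Lourdes, France"
sample = {
    "context": context,
    "answers": {"text": [answer], "answer_start": [context.index(answer)]},
}


def answer_span(example, i=0):
    """Slice the context using the i-th answer's character offset."""
    start = example["answers"]["answer_start"][i]
    text = example["answers"]["text"][i]
    return example["context"][start:start + len(text)]


print(answer_span(sample))
```

The same slicing applied to a real record should reproduce the string stored in `answers["text"]`, which is a quick sanity check when preparing extractive QA data.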

In [10]:
import pandas as pd

dataset.set_format(type="pandas")
df = dataset["train"][:]
df.head()

| | id | title | context | question | answers |
|---|---|---|---|---|---|
| 0 | 5733be284776f41900661182 | University_of_Notre_Dame | Architecturally, the school has a Catholic cha... | To whom did the Virgin Mary allegedly appear i... | {'text': ['Saint Bernadette Soubirous'], 'answ... |
| 1 | 5733be284776f4190066117f | University_of_Notre_Dame | Architecturally, the school has a Catholic cha... | What is in front of the Notre Dame Main Building? | {'text': ['a copper statue of Christ'], 'answe... |
| 2 | 5733be284776f41900661180 | University_of_Notre_Dame | Architecturally, the school has a Catholic cha... | The Basilica of the Sacred heart at Notre Dame... | {'text': ['the Main Building'], 'answer_start'... |
| 3 | 5733be284776f41900661181 | University_of_Notre_Dame | Architecturally, the school has a Catholic cha... | What is the Grotto at Notre Dame? | {'text': ['a Marian place of prayer and reflec... |
| 4 | 5733be284776f4190066117e | University_of_Notre_Dame | Architecturally, the school has a Catholic cha... | What sits on top of the Main Building at Notre... | {'text': ['a golden statue of the Virgin Mary'... |


In [11]:
dataset.reset_format()

We combine the context and the question into a prompt format for the LLM.

In [12]:
def format_context_question(data):
    return f"Context: {data['context']}\nQuestion: {data['question']}"

In [13]:
print(format_context_question(dataset["validation"][0]))

Context: Super Bowl 50 was an American football game to determine the champion of the National Football League (NFL) for the 2015 season. The American Football Conference (AFC) champion Denver Broncos defeated the National Football Conference (NFC) champion Carolina Panthers 24–10 to earn their third Super Bowl title. The game was played on February 7, 2016, at Levi's Stadium in the San Francisco Bay Area at Santa Clara, California. As this was the 50th Super Bowl, the league emphasized the "golden anniversary" with various gold-themed initiatives, as well as temporarily suspending the tradition of naming each Super Bowl game with Roman numerals (under which the game would have been known as "Super Bowl L"), so that the logo could prominently feature the Arabic numerals 50.
Question: Which NFL team represented the AFC at Super Bowl 50?


In [14]:
print(model.device)

cuda:0


In [15]:
question_prompt = format_context_question(dataset["validation"][0])
question_prompt

'Context: Super Bowl 50 was an American football game to determine the champion of the National Football League (NFL) for the 2015 season. The American Football Conference (AFC) champion Denver Broncos defeated the National Football Conference (NFC) champion Carolina Panthers 24–10 to earn their third Super Bowl title. The game was played on February 7, 2016, at Levi\'s Stadium in the San Francisco Bay Area at Santa Clara, California. As this was the 50th Super Bowl, the league emphasized the "golden anniversary" with various gold-themed initiatives, as well as temporarily suspending the tradition of naming each Super Bowl game with Roman numerals (under which the game would have been known as "Super Bowl L"), so that the logo could prominently feature the Arabic numerals 50.\nQuestion: Which NFL team represented the AFC at Super Bowl 50?'

In [16]:
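# The Korean system message below means: "You are a chatbot that answers questions based on the context. Write only the answer, concisely, in English."
response = generate_response(system_message="너는 컨텍스트에 기반해서 질문에 답변을 하는 챗봇이야. 답변만 간결하게 영어로 작성해줘",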
                             user_message=question_prompt)

print(response)

The Denver Broncos represented the AFC at Super Bowl 50.


In [17]:
ground_truth_text = dataset["validation"][0]["answers"]["text"]
ground_truth_text

['Denver Broncos', 'Denver Broncos', 'Denver Broncos']

## 4. SQuAD Evaluation

We compute the EM (Exact Match) score and the F1 score. The F1 score in particular measures how many tokens the prediction shares with the ground-truth answers.
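
For reference, the token-level scores can be written as follows. This assumes the usual SQuAD-style definition, where $P$ is the token list of the normalized prediction, $G$ is the token list of a normalized gold answer, and $|P \cap G|$ is the number of tokens they share (the symbols are introduced here for illustration):

$$
\text{precision} = \frac{|P \cap G|}{|P|}, \qquad
\text{recall} = \frac{|P \cap G|}{|G|}, \qquad
F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}
$$

When a question has several gold answers, the score reported for that question is the maximum over the gold answers.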

In [18]:
import collections
import re
import string


def normalize_text(s):
    """Lowercase the text and strip punctuation, articles, and extra whitespace before scoring."""

    def remove_articles(text):
        regex = re.compile(r"\b(a|an|the)\b", re.UNICODE)
        return re.sub(regex, " ", text)

    def white_space_fix(text):
        return " ".join(text.split())

    def remove_punc(text):
        exclude = set(string.punctuation)
        return "".join(ch for ch in text if ch not in exclude)

    def lower(text):
        return text.lower()

    return white_space_fix(remove_articles(remove_punc(lower(s))))

def compute_exact_match(prediction, truth):
    return int(normalize_text(prediction) == normalize_text(truth))

def compute_f1(prediction, truth):
    pred_tokens = normalize_text(prediction).split()
    truth_tokens = normalize_text(truth).split()

    # If either side has no tokens, F1 is 1 when both are empty and 0 otherwise.
    if len(pred_tokens) == 0 or len(truth_tokens) == 0:
        return int(pred_tokens == truth_tokens)

    # Count shared tokens as a multiset so that repeated tokens are not collapsed.
    common = collections.Counter(pred_tokens) & collections.Counter(truth_tokens)
    num_same = sum(common.values())

    # No shared tokens means F1 = 0.
    if num_same == 0:
        return 0

    prec = num_same / len(pred_tokens)
    rec = num_same / len(truth_tokens)

    return 2 * (prec * rec) / (prec + rec)

In [19]:
question = dataset["validation"][0]["question"]
prediction = response
gold_answers = list(ground_truth_text)

em_score = max((compute_exact_match(prediction, answer)) for answer in gold_answers)
f1_score = max((compute_f1(prediction, answer)) for answer in gold_answers)

print(f"Question: {question}")
print(f"Prediction: {prediction}")
print(f"True Answers: {gold_answers}")
print(f"EM: {em_score} \t F1: {f1_score}")

Question: Which NFL team represented the AFC at Super Bowl 50?
Prediction: The Denver Broncos represented the AFC at Super Bowl 50.
True Answers: ['Denver Broncos', 'Denver Broncos', 'Denver Broncos']
EM: 0 	 F1: 0.4
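
The F1 of 0.4 follows directly from the token counts: after normalization the prediction has 8 tokens, 2 of which match the 2-token gold answer, so precision is 2/8 and recall is 2/2. The sketch below reproduces that arithmetic with a compact, self-contained scorer (`normalize` and `token_f1` are illustrative names, not the notebook's functions) and compares it with a concise answer.

```python
import re
import string
from collections import Counter


def normalize(text):
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = "".join(ch for ch in text.lower() if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def token_f1(prediction, truth):
    """Token-level F1 between a prediction and one gold answer."""
    pred, gold = normalize(prediction).split(), normalize(truth).split()
    if not pred or not gold:
        return float(pred == gold)
    shared = sum((Counter(pred) & Counter(gold)).values())
    if shared == 0:
        return 0.0
    precision, recall = shared / len(pred), shared / len(gold)
    return 2 * precision * recall / (precision + recall)


verbose = "The Denver Broncos represented the AFC at Super Bowl 50."
concise = "Denver Broncos"
gold = "Denver Broncos"

print(normalize(verbose).split())   # the 8 tokens that count toward precision
print(token_f1(verbose, gold))      # full-sentence answer
print(token_f1(concise, gold))      # answer containing only the span
```

A full-sentence answer is penalized on precision even when it is correct, so prompting the model to return only the answer span is likely to raise both EM and F1.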


## 5. LLM Based Evaluation

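In this exercise we use an LLM to evaluate translation quality automatically (LLM as a Judge).  
The system message defines the system's role and provides a quantitative translation-rating guide together with the expected output format.  
The user message supplies the original sentence, the machine translation, and the reference translation, and asks for an evaluation that follows the guide.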
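
When many sentence triples need to be judged, the user message layout described above can be produced by a small template helper. This is a minimal sketch; `build_judge_message` and its parameter names are hypothetical and simply mirror the three fields named in this section.

```python
def build_judge_message(source, machine, reference,
                        source_lang="English", target_lang="French"):
    """Fill the judge prompt layout for one (source, MT, reference) triple."""
    return (
        f'\nOriginal Sentence ({source_lang}): "{source}"\n'
        f'Machine Translation ({target_lang}): "{machine}"\n'
        f'Reference Translation ({target_lang}): "{reference}"\n'
        "\nEvaluate the translation quality based on the given input.\n"
    )


message = build_judge_message(
    source="The quick brown fox jumps over the lazy dog.",
    machine="Le rapide renard brun saute par-dessus le chien paresseux.",
    reference="Le renard brun rapide saute par-dessus le chien endormi.",
)
print(message)
```

Keeping the layout in one function makes it easier to keep the wording identical across examples, so that score differences reflect the translations rather than prompt variations.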

In [20]:
# [Exercise] Complete the following code!
# Write the system message and user message for LLM-based evaluation.
system_message = """
You are an expert in machine translation evaluation. 
Your task is to assess the translation quality of a given sentence compared to a reference translation. 
You will compare two translations, provide detailed feedback on the differences, and rate the translation quality on a scale from 1 to 5, where:
1 = Poor (significant errors in meaning and grammar)
2 = Fair (some errors in meaning or grammar)
3 = Good (minor issues, but overall understandable)
4 = Very good (accurate translation with minor stylistic differences)
5 = Excellent (perfect translation)

Provide your evaluation, including:
1. A brief description of the errors (if any) in terms of meaning or grammar.
2. A score from 1 to 5.
3. A suggestion to improve the translation if necessary.
"""

user_message = """
Original Sentence (English): "The quick brown fox jumps over the lazy dog."
Machine Translation (French): "Le rapide renard brun saute par-dessus le chien paresseux."
Reference Translation (French): "Le renard brun rapide saute par-dessus le chien endormi."

Evaluate the translation quality based on the given input.
"""

In [21]:
response = generate_response(system_message=system_message, user_message=user_message)

print(response)

Evaluation:

The machine translation "Le rapide renard brun saute par-dessus le chien paresseux" has some minor issues compared to the reference translation "Le renard brun rapide saute par-dessus le chien endormi". The main difference is the word order, with the machine translation placing the adjective "rapide" after the noun "renard", whereas the reference translation places it before. This is a minor stylistic difference, but it does affect the sentence's overall flow.

Additionally, the machine translation uses the word "paresseux" (lazy) instead of the more accurate "endormi" (asleep), which changes the meaning of the sentence slightly. The original sentence describes the dog as being asleep, whereas the machine translation implies it is simply lazy.

Description of errors: Minor issues with word order and inaccurate adjective placement, and a small change in meaning due to the use of a different adjective.

Score: 4 (Very good)

Suggestion to improve the translation: The machine
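
To use the judge's rating numerically, the score has to be pulled out of the free-text response. The sketch below assumes the response contains a line of the form `Score: N`, as in the output above; the model is not guaranteed to follow that format, so the hypothetical helper `extract_score` returns `None` when no rating is found.

```python
import re


def extract_score(judge_response):
    """Return the first 1-5 rating that follows 'Score:', or None if absent."""
    match = re.search(r"Score:\s*([1-5])\b", judge_response)
    return int(match.group(1)) if match else None


sample_response = (
    "Description of errors: Minor issues with word order.\n\n"
    "Score: 4 (Very good)\n\n"
    "Suggestion to improve the translation: ..."
)
print(extract_score(sample_response))
```

Returning `None` rather than raising lets a batch evaluation skip or retry malformed responses instead of stopping.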