### LLM as a Judge
* An LLM can be configured as an evaluator to assess a model's performance and guide improvements to it.

### Off-the-Shelf Evaluators
* LangSmith's built-in LLM evaluators can grade a model's outputs automatically.

**Key features**
* Predefined evaluation criteria
* A consistent grading method applied to every output
* Automated evaluation of outputs at scale

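**Required information**
* input: the question, usually the dataset's Question.
* prediction: the answer generated by the LLM.
* reference: the ground-truth answer; it can also be used flexibly, for example to carry the retrieved context.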

In [1]:
from rag import PDFRAG
from langchain_openai import ChatOpenAI

rag = PDFRAG(
    file_path="data/snow-white.pdf",
    llm=ChatOpenAI(model_name="gpt-4o-mini", temperature=0),
)

retriever = rag.create_retriever()

chain = rag.create_chain(retriever)

# Question: "Which fruit did Snow White eat before she collapsed?"
chain.invoke("백설공주는 어떤 과일을 먹고 쓰러졌나요?")

'백설공주는 사과를 먹고 쓰러졌습니다.'

In [2]:
# Function that answers a question
def ask_question(inputs: dict):
    return {"answer": chain.invoke(inputs["question"])}

In [3]:
llm_answer = ask_question({"question": "백설공주는 어떤 과일을 먹고 쓰러졌나요?"})

llm_answer

{'answer': '백설공주는 사과를 먹고 쓰러졌습니다.'}
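
As the call above shows, the function handed to `evaluate` takes one example's inputs as a dict and returns its outputs as a dict. A minimal sketch of that contract, assuming a hypothetical `FakeChain` stand-in (not part of the notebook) so that it runs without a PDF or an API key:

```python
class FakeChain:
    """Hypothetical stand-in for the RAG chain; returns a canned answer."""

    def invoke(self, question: str) -> str:
        return f"(stub answer to: {question})"


def make_target(chain):
    """Wrap a chain so it has the dict-in, dict-out shape used above."""

    def target(inputs: dict) -> dict:
        # Read the question from the inputs and return the answer under "answer".
        return {"answer": chain.invoke(inputs["question"])}

    return target


ask_stub = make_target(FakeChain())
result = ask_stub({"question": "What did Snow White eat?"})
print(result)
```

Swapping `FakeChain()` for the real chain should give a function equivalent to `ask_question`.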

In [4]:
# Helper that prints an evaluator's prompt
def print_evaluator_prompt(evaluator):
    evaluator.evaluator.prompt.pretty_print()

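### Question-Answer Evaluator
* Grades the answer to a question against a reference answer.

* input: the user's input
* prediction: the answer generated by the LLM
* reference: the ground-truth answer

**Note**: In the evaluator prompt, these fields appear as the variables query (input), result (prediction), and answer (reference).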
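
The renaming described in the note can be sketched in plain Python. This is only an illustration of the mapping, not how LangChain fills the prompt internally; `QA_TEMPLATE` is an abbreviated copy of the tail of the `qa` prompt, and `FIELD_TO_PROMPT_VAR` and `render_qa_prompt` are hypothetical names.

```python
# Abbreviated tail of the qa prompt (the grading instructions are omitted).
QA_TEMPLATE = (
    "QUESTION: {query}\n"
    "STUDENT ANSWER: {result}\n"
    "TRUE ANSWER: {answer}\n"
    "GRADE:"
)

# Evaluator field -> prompt variable, following the note above.
FIELD_TO_PROMPT_VAR = {
    "input": "query",
    "prediction": "result",
    "reference": "answer",
}


def render_qa_prompt(fields: dict) -> str:
    """Rename evaluator fields to prompt variables and fill the template."""
    prompt_vars = {FIELD_TO_PROMPT_VAR[name]: value for name, value in fields.items()}
    return QA_TEMPLATE.format(**prompt_vars)


rendered = render_qa_prompt(
    {
        "input": "Which fruit did Snow White eat?",
        "prediction": "She ate an apple.",
        "reference": "An apple.",
    }
)
print(rendered)
```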

In [5]:
from langsmith.evaluation import evaluate, LangChainStringEvaluator

# Create the qa evaluator
qa_evaluator = LangChainStringEvaluator("qa")

print_evaluator_prompt(qa_evaluator)

You are a teacher grading a quiz.
You are given a question, the student's answer, and the true answer, and are asked to score the student answer as either CORRECT or INCORRECT.

Example Format:
QUESTION: question here
STUDENT ANSWER: student's answer here
TRUE ANSWER: true answer here
GRADE: CORRECT or INCORRECT here

Grade the student answers based ONLY on their factual accuracy. Ignore differences in punctuation and phrasing between the student answer and true answer. It is OK if the student answer contains more information than the true answer, as long as it does not contain any conflicting statements. Begin! 

QUESTION: {query}
STUDENT ANSWER: {result}
TRUE ANSWER: {answer}
GRADE:


In [6]:
dataset_name = "RAG_EVALUATION_DATASET"

experiment_results = evaluate(
    ask_question,  # function under evaluation
    data=dataset_name,  # dataset to evaluate on
    evaluators=[qa_evaluator],  # evaluators to apply
    experiment_prefix="RAG_EVALUATION",  # experiment name prefix
    metadata={
        "variant": "Evaluation with the QA evaluator"
    }
)

View the evaluation results for experiment: 'RAG_EVALUATION-701e8d0c' at:
https://smith.langchain.com/o/76515ba2-47a2-4225-a546-4c43a1772406/datasets/0e8fd746-beab-4ad6-bb7f-d582af62e3fd/compare?selectedSessions=7783bd6b-c115-4f21-9ad1-6e3479cc0e41

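### Context-Based Answer Evaluators
**"context_qa"**
* Instructs the grader LLM to use the context when judging the correctness of the chain's answer.

**"cot_qa"**
* Instructs the grader LLM to reason step by step before giving its final verdict.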
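
Both graders end their reply with a `GRADE:` line; `cot_qa` simply writes its reasoning first. A minimal sketch of pulling the verdict out of either kind of reply, assuming a hypothetical helper `parse_grade` (this is not LangChain's own output parser, and the sample replies are made up for illustration):

```python
import re


def parse_grade(llm_output: str):
    """Return 1 for CORRECT, 0 for INCORRECT, None if no grade is found."""
    # Use the last match so a grade mentioned inside the reasoning is ignored.
    matches = re.findall(
        r"GRADE:\s*(CORRECT|INCORRECT)", llm_output, flags=re.IGNORECASE
    )
    if not matches:
        return None
    return 1 if matches[-1].upper() == "CORRECT" else 0


# Illustrative replies in the two formats (not real model output).
cot_reply = (
    "EXPLANATION: The context says she bit a poisoned apple, "
    "which matches the student's answer.\n"
    "GRADE: CORRECT"
)
direct_reply = "GRADE: INCORRECT"

print(parse_grade(cot_reply), parse_grade(direct_reply))
```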

In [7]:
# RAG function that returns the retrieved context along with the answer
def rag_context_answer(inputs: dict):
    context = retriever.invoke(inputs["question"])
    return {
        "context": "\n".join([doc.page_content for doc in context]),
        "answer": chain.invoke(inputs["question"]),
        "query": inputs["question"],
    }

In [8]:
rag_context_answer(
    {"question": "백설공주는 어떤 과일을 먹고 쓰러졌나요?"}
)

{'context': '왕비는먹음직스럽게생긴사과를골라독을발랐어요.\n그리고과일장수로변장했지요.\n왕비는산을넘고또넘어일곱난쟁이의오두막에도착했어요.\n“새콤달콤맛있는사과가있어요. 아가씨의붉은입술처럼새빨\n간사과랍니다. 잠깐문을열어보세요.”\n백설공주는고개를저었어요.\n“난쟁이들이문을열어주지말라고했어요.”\n백설공주가거절하자, 왕비는창문틈새로사과를쑥내밀었어\n요.\n“그럼, 맛이라도봐요. 정말맛있으니까. 둘이먹다하나가죽어\n도모를걸요.”\n“탐스러운사과네. 맛있어보여. 한입만아삭깨물어볼까?”\n사과를베어문순간, 백설공주는온몸에독이퍼져정신을잃고\n쓰러졌어요.\n사과를베어문순간, 백설공주는온몸에독이퍼져정신을잃고\n쓰러졌어요.\n“호호호. 이제내가세상에서가장아름답겠지?”\n왕비는백설공주를버려둔채자리를떠났어요.\n백설공주\n옛날어느왕국에공주님이태어났어요.\n“어쩜이렇게어여쁠까? 살결이눈처럼하얗구나. 백\n설공주라고불러야겠다.”\n왕과왕비는갓태어난딸을보며기뻐했어요.\n하지만기쁨도잠시, 왕비는곧세상을떠나고말았어\n요.\n숲속을헤매던백설공주는외딴오두막에이르렀어요.\n들여다보니오두막은비어있었어요.\n“아무도없네. 좀쉬어가도될까? 어? 신기하다! 모든게작아. \n어어? 이상하다! 모든게일곱. 의자도일곱, 접시도일곱. 어머, \n침대도일곱개네.”\n도망치느라치진백설공주는식탁위에있던빵을먹고나서\n일곱번째침대에쓰러져잠들었어요.\n밤이되자오두막주인인일곱난쟁이가돌아왔어요.\n난쟁이들은집안이어질러진것을보고깜짝놀랐지요.\n일곱째난쟁이가큰소리로외쳤어요.\n“누가내침대에서자고있어!”\n북적이는소리에잠이깬백설공주는왕비를피해도망쳤다고\n이야기했어요.',
 ...

In [9]:
# cot_qa evaluator
cot_qa_evaluator = LangChainStringEvaluator(
    "cot_qa",
    prepare_data=lambda run, example : {
        "prediction": run.outputs["answer"], # answer generated by the LLM
        "reference": run.outputs["context"], # retrieved context
        "input": example.inputs["question"] # question from the dataset
    }
)

# context_qa evaluator
context_qa_evaluator = LangChainStringEvaluator(
    "context_qa",
    prepare_data=lambda run, example : {
        "prediction": run.outputs["answer"], # answer generated by the LLM
        "reference": run.outputs["context"], # retrieved context
        "input": example.inputs["question"] # question from the dataset
    }
)
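
To see what the `prepare_data` mapping produces without calling LangSmith, the same logic can be run on stand-in objects. This is a sketch only: `fake_run` and `fake_example` are hypothetical objects that expose just the attributes the mapping reads, and their values are invented.

```python
from types import SimpleNamespace


def prepare_context_data(run, example) -> dict:
    """Same mapping as the lambdas above, written as a named function."""
    return {
        "prediction": run.outputs["answer"],     # answer generated by the LLM
        "reference": run.outputs["context"],     # retrieved context, not a gold answer
        "input": example.inputs["question"],     # question from the dataset
    }


# Hypothetical stand-ins for a run and a dataset example.
fake_run = SimpleNamespace(
    outputs={"answer": "An apple.", "context": "She bit the poisoned apple."}
)
fake_example = SimpleNamespace(inputs={"question": "Which fruit did Snow White eat?"})

prepared = prepare_context_data(fake_run, fake_example)
print(prepared)
```

Because `reference` is filled from the run's own `context` output, the grader checks the answer against the retrieved text rather than against a labeled answer.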

In [10]:
print_evaluator_prompt(cot_qa_evaluator)

You are a teacher grading a quiz.
You are given a question, the context the question is about, and the student's answer. You are asked to score the student's answer as either CORRECT or INCORRECT, based on the context.
Write out in a step by step manner your reasoning to be sure that your conclusion is correct. Avoid simply stating the correct answer at the outset.

Example Format:
QUESTION: question here
CONTEXT: context the question is about here
STUDENT ANSWER: student's answer here
EXPLANATION: step by step reasoning here
GRADE: CORRECT or INCORRECT here

Grade the student answers based ONLY on their factual accuracy. Ignore differences in punctuation and phrasing between the student answer and true answer. It is OK if the student answer contains more information than the true answer, as long as it does not contain any conflicting statements. Begin! 

QUESTION: {query}
CONTEXT: {context}
STUDENT ANSWER: {result}
EXPLANATION:


In [11]:
print_evaluator_prompt(context_qa_evaluator)

You are a teacher grading a quiz.
You are given a question, the context the question is about, and the student's answer. You are asked to score the student's answer as either CORRECT or INCORRECT, based on the context.

Example Format:
QUESTION: question here
CONTEXT: context the question is about here
STUDENT ANSWER: student's answer here
GRADE: CORRECT or INCORRECT here

Grade the student answers based ONLY on their factual accuracy. Ignore differences in punctuation and phrasing between the student answer and true answer. It is OK if the student answer contains more information than the true answer, as long as it does not contain any conflicting statements. Begin! 

QUESTION: {query}
CONTEXT: {context}
STUDENT ANSWER: {result}
GRADE:


In [12]:
# Dataset name
dataset_name = "RAG_EVALUATION_DATASET"

# Run the evaluation
evaluate(
    rag_context_answer,
    data=dataset_name,
    evaluators=[cot_qa_evaluator, context_qa_evaluator],
    experiment_prefix="RAG_EVALUATION",
    metadata={
        "variant": "Evaluation with the COT_QA & CONTEXT_QA evaluators"
    }
)

View the evaluation results for experiment: 'RAG_EVALUATION-3232c219' at:
https://smith.langchain.com/o/76515ba2-47a2-4225-a546-4c43a1772406/datasets/0e8fd746-beab-4ad6-bb7f-d582af62e3fd/compare?selectedSessions=fcb3962c-3637-4267-a91f-07bfeb219b5c

<ExperimentResults RAG_EVALUATION-3232c219>

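### Criteria
* When no reference label exists, or one is hard to obtain, the "criteria" evaluator can score a run against a custom set of criteria.
* It is useful for assessing high-level semantic aspects of an answer.
* Usage: `LangChainStringEvaluator("criteria", config={"criteria": "<one of the criteria below>"})`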

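| Criterion | Description |
|------|------|
| `conciseness` | Whether the answer is concise and to the point |
| `relevance` | Whether the answer is relevant to the question |
| `correctness` | Whether the answer is correct |
| `coherence` | Whether the answer is coherent |
| `harmfulness` | Whether the answer is harmful or offensive |
| `maliciousness` | Whether the answer is malicious |
| `helpfulness` | Whether the answer is helpful |
| `controversiality` | Whether the answer is controversial |
| `misogyny` | Whether the answer is demeaning to women |
| `criminality` | Whether the answer promotes crime |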
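
When several criteria are evaluated together, it can help to build the `config` dicts in one place and catch misspelled names before any evaluator is created. A minimal sketch, assuming a hypothetical helper `build_criteria_configs`; `TABLE_CRITERIA` holds only the names listed in the table above, and the helper builds plain dicts without calling LangSmith:

```python
# Criterion names copied from the table above.
TABLE_CRITERIA = {
    "conciseness", "relevance", "correctness", "coherence", "harmfulness",
    "maliciousness", "helpfulness", "controversiality", "misogyny", "criminality",
}


def build_criteria_configs(names):
    """Return one config dict per criterion, rejecting names not in the table."""
    unknown = [name for name in names if name not in TABLE_CRITERIA]
    if unknown:
        raise ValueError(f"Unknown criteria: {unknown}")
    return [{"criteria": name} for name in names]


configs = build_criteria_configs(["conciseness", "relevance", "coherence"])
print(configs)
```

Each resulting dict is meant to be passed as `config=` to `LangChainStringEvaluator("criteria", ...)`.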

In [14]:
from langsmith.evaluation import evaluate, LangChainStringEvaluator

# Set up the evaluators
criteria_evaluator = [
    LangChainStringEvaluator("criteria", config={"criteria": "conciseness"}),
    LangChainStringEvaluator("criteria", config={"criteria": "relevance"}),
    LangChainStringEvaluator("criteria", config={"criteria": "coherence"})
]

# Set the dataset name
dataset_name = "RAG_EVALUATION_DATASET"

# Run the evaluation
experiment_results = evaluate(
    ask_question,
    data=dataset_name,
    evaluators=criteria_evaluator,
    experiment_prefix="CRITERIA_EVALUATION",
    metadata={
        "variant": "Evaluation with criteria"
    }
)

View the evaluation results for experiment: 'CRITERIA_EVALUATION-e58b002d' at:
https://smith.langchain.com/o/76515ba2-47a2-4225-a546-4c43a1772406/datasets/0e8fd746-beab-4ad6-bb7f-d582af62e3fd/compare?selectedSessions=b0e559a9-1ad5-4e5f-8955-24a5de5ede77