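Suppose we have built an application, say a question-answering system. How can we measure its performance automatically? This kind of measurement matters most when we need to compare LLMs, prompts, and other components.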

In [1]:
from langchain.chains import RetrievalQA
from langchain.embeddings import OpenAIEmbeddings
from langchain.document_loaders import NotionDirectoryLoader
from langchain.text_splitter import MarkdownTextSplitter
from langchain.vectorstores import Qdrant
from langchain.evaluation.qa import QAGenerateChain
from langchain.evaluation.qa import QAEvalChain
from langchain_setup import ChatOpenAI, tracing_v2_enabled_if_api_key_set

loader = NotionDirectoryLoader("../data/notion")
splitter = MarkdownTextSplitter(chunk_size=500, chunk_overlap=200)
documents = splitter.split_documents(loader.load())

qa_chain = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(),
    retriever=Qdrant.from_documents(
        documents=documents,
        embedding=OpenAIEmbeddings(),
        location=":memory:",
    ).as_retriever(search_kwargs=dict(k=1)),  # retrieve only one document to keep the prompt short
    chain_type_kwargs={"document_separator": "<<<<>>>>>"},
)

# The query means: "What is a meteorite in software development?"
print(qa_chain.run('什麼是軟體開發中的隕石?'))

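軟體開發中的隕石是比喻指代在隕石落下型開發中，每個步驟都被視為一顆隕石，代表著極高的壓力和快速的進度。這種開發方法強調快速且連貫的開發流程，並在每個步驟中迅速完成所需的任務。因此，隕石在這裡象徵著壓力和迅速完成任務的象徵。

(English translation of the output: In software development, "meteorite" is a metaphor. In meteorite-fall-style development, each step is regarded as a meteorite, representing extremely high pressure and rapid progress. This development method emphasizes a fast, continuous development flow in which the required tasks are completed quickly at each step. The meteorite therefore symbolizes pressure and the rapid completion of tasks.)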


# 1. Generating evaluation data automatically

In [2]:
example_gen_chain = QAGenerateChain.from_llm(llm=ChatOpenAI())
with tracing_v2_enabled_if_api_key_set(project_name='tutorial'):
    examples = example_gen_chain.apply_and_parse([{"doc": d} for d in documents])
    print(len(examples))
    display(examples[1])



4


{'qa_pairs': {'query': 'According to the document, what is the cycle of the agile development method?',
  'answer': 'The cycle of the agile development method, as described in the document, consists of the following steps: "要件定義" (requirement definition), "基本設計" (basic design), "詳細設計" (detailed design), "實裝" (implementation), and "Programmer" (programming).'}}

[LangSmith URL]: https://smith.langchain.com/o/34ec837d-8405-462d-b949-fdfaebda792b/projects/p/fdcbda35-4d3a-418b-ab49-7e3205e630a6/r/d58b251c-66f6-4bc7-9af9-d5d0bb58166e?poll=true


In [3]:
print(example_gen_chain.prompt.template)

You are a teacher coming up with questions to ask on a quiz. 
Given the following document, please generate a question and answer based on that document.

Example Format:
<Begin Document>
...
<End Document>
QUESTION: question here
ANSWER: answer here

These questions should be detailed and be based explicitly on information in the document. Begin!

<Begin Document>
{doc}
<End Document>


**Think about it: how can we ensure the quality of the generated questions, and the correctness of the generated answers?**
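
One partial answer is a structural sanity filter applied before any grading. The sketch below assumes the `{'qa_pairs': {'query': ..., 'answer': ...}}` shape printed above; the helper name `filter_qa_pairs`, the `min_chars` threshold, and the sample data are hypothetical. It only removes empty, very short, or duplicated pairs. It cannot tell whether an answer is factually right, so a manual spot check of the generated pairs is still needed.

```python
def filter_qa_pairs(examples, min_chars=10):
    """Keep only structurally usable QA pairs (hypothetical helper)."""
    seen_queries = set()
    kept = []
    for example in examples:
        pair = example.get("qa_pairs", {})
        query = (pair.get("query") or "").strip()
        answer = (pair.get("answer") or "").strip()
        # Drop pairs with a missing or very short question or answer.
        if len(query) < min_chars or len(answer) < min_chars:
            continue
        # Drop repeated questions (case-insensitive comparison).
        key = query.lower()
        if key in seen_queries:
            continue
        seen_queries.add(key)
        kept.append({"query": query, "answer": answer})
    return kept


# Made-up sample data in the same shape as the generated examples.
sample = [
    {"qa_pairs": {"query": "What is the name of the methodology?",
                  "answer": "Meteorite-style development."}},
    {"qa_pairs": {"query": "what is the name of the methodology?",
                  "answer": "Meteorite-style development."}},
    {"qa_pairs": {"query": "What is the cycle?", "answer": ""}},
]
usable = filter_qa_pairs(sample)
print(len(usable))
```

The filtered list has the flat `query`/`answer` shape, so it could be passed to the QA chain in place of `qa_pairs`.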

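# 2. Automatic grading
With the data in hand, we first run the QA chain on the generated questions to produce predicted answers.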

In [4]:
qa_pairs = [example['qa_pairs'] for example in examples]
predictions = qa_chain.apply(qa_pairs)

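Both the predictions and the reference answers are free text, so it is often hard to decide programmatically whether they mean the same thing. We therefore use an LLM to make that judgment.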
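
To illustrate why plain string comparison falls short, here is a minimal sketch of two common programmatic metrics, exact match and token-level F1. The helper names and the three sample sentences are made up for illustration and are not part of LangChain. A correct paraphrase fails exact match and gets a modest F1, while a contradicting answer that reuses most of the reference wording gets a higher F1.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def exact_match(prediction, reference):
    """True only if both texts contain the same tokens in the same order."""
    return tokenize(prediction) == tokenize(reference)


def token_f1(prediction, reference):
    """Harmonic mean of token-level precision and recall."""
    pred_tokens, ref_tokens = tokenize(prediction), tokenize(reference)
    # Multiset intersection counts each shared token at most as often
    # as it appears in both texts.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# Made-up sentences: one reference, one correct paraphrase, one contradiction.
reference = "The methodology is called meteorite-style development."
paraphrase = "It is known as meteorite-style development."
contradiction = "The methodology is not called meteorite-style development."

print(exact_match(paraphrase, reference))
print(round(token_f1(paraphrase, reference), 2))
print(round(token_f1(contradiction, reference), 2))
```

Surface overlap rewards shared wording rather than shared meaning, which is the gap an LLM grader is meant to close.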

In [5]:
eval_chain = QAEvalChain.from_llm(ChatOpenAI())
with tracing_v2_enabled_if_api_key_set(project_name='tutorial'):
    graded_outputs = eval_chain.evaluate(qa_pairs, predictions)
# Walk through each prediction together with its grade.
for i, (prediction, grade) in enumerate(zip(predictions, graded_outputs)):
    print(f"Example {i}:")
    print("Question: " + prediction['query'])
    print("Real Answer: " + prediction['answer'])
    print("Predicted Answer: " + prediction['result'])
    print("Predicted Grade: " + grade['results'])
    print("==================================================")


[LangSmith URL]: https://smith.langchain.com/o/34ec837d-8405-462d-b949-fdfaebda792b/projects/p/fdcbda35-4d3a-418b-ab49-7e3205e630a6/r/dd7c343b-c541-4074-b357-95269a11f743?poll=true
Example 0:
Question: What is the name of the software development methodology introduced in the document?
Real Answer: The name of the software development methodology introduced in the document is "隕石落下型開發" (Meteorite-style development).
Predicted Answer: The name of the software development methodology introduced in the document is "隕石落下型開發" (Meteorite-style development).
Predicted Grade: CORRECT
Example 1:
Question: According to the document, what is the cycle of the agile development method?
Real Answer: The cycle of the agile development method, as described in the document, consists of the following steps: "要件定義" (requirement definition), "基本設計" (basic design), "詳細設計" (detailed design), "實裝" (implementation), and "Programmer" (programming).
Predicted Answer: The cycle of the agile development method is

In [6]:
print(eval_chain.prompt.template)

You are a teacher grading a quiz.
You are given a question, the student's answer, and the true answer, and are asked to score the student answer as either CORRECT or INCORRECT.

Example Format:
QUESTION: question here
STUDENT ANSWER: student's answer here
TRUE ANSWER: true answer here
GRADE: CORRECT or INCORRECT here

Grade the student answers based ONLY on their factual accuracy. Ignore differences in punctuation and phrasing between the student answer and true answer. It is OK if the student answer contains more information than the true answer, as long as it does not contain any conflicting statements. Begin! 

QUESTION: {query}
STUDENT ANSWER: {result}
TRUE ANSWER: {answer}
GRADE:
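
Because the grader is asked to reply with CORRECT or INCORRECT, the per-example grades can be folded into a single accuracy figure, which is convenient when comparing LLMs or prompts. The sketch below assumes each graded output is a dict with a `results` key, as in the loop above. `parse_grade`, `grade_accuracy`, and the sample grades are hypothetical, and the tolerance for replies such as `GRADE: CORRECT` is an assumption about how a model might deviate from the requested format.

```python
def parse_grade(text):
    """Return 'CORRECT', 'INCORRECT', or None if the grade is unreadable."""
    words = text.strip().upper().split()
    if not words:
        return None
    # Compare whole words: "INCORRECT" contains "CORRECT" as a substring,
    # so a substring test would misclassify it.
    last = words[-1].strip(".:")
    return last if last in ("CORRECT", "INCORRECT") else None


def grade_accuracy(graded_outputs, key="results"):
    """Fraction of readable grades that are CORRECT."""
    grades = [parse_grade(g[key]) for g in graded_outputs]
    readable = [g for g in grades if g is not None]
    if not readable:
        return 0.0
    return sum(g == "CORRECT" for g in readable) / len(readable)


# Made-up grades, including two that deviate from the requested format.
sample_grades = [
    {"results": "CORRECT"},
    {"results": "INCORRECT"},
    {"results": "GRADE: CORRECT"},
    {"results": "Incorrect."},
]
accuracy = grade_accuracy(sample_grades)
print(accuracy)
```

With only a handful of examples, as in this notebook, the resulting number is very coarse and should be read as a rough signal.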


**Think about it: under what circumstances would the grading LLM be biased toward the LLM being graded?**

<details>
<summary>參考</summary>
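For example, the GPT-series models are all developed by OpenAI, so they share similar architectures and training data and naturally produce similar outputs. A grading model tends to favor answers that resemble its own, but resemblance does not make an answer more correct.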
</details>
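
One way to check for this kind of bias is to compare the LLM grader's labels against a small set of human labels, or against a grader from a different model family. The sketch below is a minimal illustration under those assumptions: `agreement_stats` is a hypothetical helper and both label lists are made up. It reports raw agreement and Cohen's kappa, which discounts the agreement expected by chance.

```python
from collections import Counter


def agreement_stats(labels_a, labels_b):
    """Return (raw agreement, Cohen's kappa) for two equal-length label lists."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Agreement expected by chance, given each grader's label frequencies.
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        # Both graders always give the same single label.
        return observed, 1.0
    return observed, (observed - expected) / (1 - expected)


# Made-up labels for four examples.
llm_grades = ["CORRECT", "CORRECT", "INCORRECT", "CORRECT"]
human_grades = ["CORRECT", "INCORRECT", "INCORRECT", "CORRECT"]
observed, kappa = agreement_stats(llm_grades, human_grades)
print(observed, kappa)
```

Low agreement with human labels, especially when the disagreements cluster on answers produced by a model related to the grader, would be a reason to distrust the automatic grades.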