# MCQ (Multiple Choice Question) Evaluation Tutorial

## MCQDataset

In this tutorial, we load a multiple-choice dataset from the Hugging Face Hub, evaluate a model on it, and save the results so they can be re-uploaded.

### 1. Loading the dataset
First, let's see how to load a dataset from the Hugging Face Hub:

In [2]:
from langmetrics.llmdataset import LLMDataset
from langmetrics.llmtestcase import LLMTestCase
from datasets import load_dataset
import pandas as pd
from dotenv import load_dotenv

In [3]:
load_dotenv(override=True)

True

In [4]:
test_dataset = LLMDataset.from_huggingface_hub('sickgpt/015_KorMedMCQA_test', split='train')

In [5]:
len(test_dataset)

3009

Now let's run the evaluation.

In [6]:
from langmetrics.llmfactory import LLMFactory
from langmetrics.config import ModelConfig, LocalModelConfig

In [7]:
LLMFactory.get_model_list()

['gpt-4o',
 'gpt-4o-mini',
 'deepseek-v3',
 'deepseek-reasoner',
 'claude-3.7-sonnet',
 'claude-3.5-sonnet',
 'claude-3.5-haiku',
 'naver',
 'gemini-2.0-flash']

In [8]:
# Create a custom model config (base model plus a LoRA adapter, served locally)
custom_config = LocalModelConfig(
    model_name="Qwen/Qwen2.5-32B-Instruct",
    lora_model_path="/workspace/project/2025/model/llm/sickllm/qwen_32b_lora_ver1/epoch_0",
    lora_name="qwen",
)

In [9]:
# # Create a custom model config (remote OpenAI-compatible endpoint)
# custom_config = ModelConfig(
#     model_name="Qwen/Qwen2.5-3B-Instruct",
#     api_base="http://qwen3b:8000/v1",
#     api_key='EMPTY',
#     max_tokens=32000,
#     seed=66,
#     provider="openai"
# )

In [10]:
# A local LLM starts its server on this machine, so expect some boot time.
custom_llm = LLMFactory.create_llm(custom_config, temperature=0)

waiting llm server boot


Note: the server normally runs in a separate terminal.
This run is executing in a parallel CI environment, so performance may differ from real-world results.


Because the dataset is in Korean, we use the Korean template.

In [11]:
from langmetrics.metrics import MCQMetric
metric = MCQMetric(
    output_model=custom_llm,
    template_language='ko',  # 'ko' or 'en'
    output_template_type='reasoning'  # 'reasoning' or 'only_answer'
)

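We run inference asynchronously to speed it up. `nest_asyncio` lets us `await` inside the notebook, which already has a running event loop.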

In [12]:
import nest_asyncio
nest_asyncio.apply()
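
The asynchronous inference mentioned above can be pictured with a small, self-contained sketch. It is not how `langmetrics` implements `ameasure`; `fake_model_call`, `run_all`, and `max_concurrency` are hypothetical names used only to show the general pattern of sending many requests at once while capping how many are in flight.

```python
import asyncio


async def fake_model_call(question: str) -> str:
    """Hypothetical stand-in for one LLM request; sleeps instead of calling a server."""
    await asyncio.sleep(0.01)
    return f"answer to {question}"


async def run_all(questions: list[str], max_concurrency: int = 8) -> list[str]:
    """Run every request concurrently, with at most max_concurrency in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(question: str) -> str:
        async with semaphore:
            return await fake_model_call(question)

    # gather returns results in the same order as the inputs
    return await asyncio.gather(*(guarded(q) for q in questions))


answers = asyncio.run(run_all([f"Q{i}" for i in range(20)]))
print(len(answers), answers[0])
```

In a plain script `asyncio.run(...)` works as written. Inside a notebook, where an event loop is already running, you would typically write `answers = await run_all(...)` instead.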

In [13]:
print(test_dataset[0])

LLMTestCase(
  input='2개월 남아가 BCG예방접종 1개월 뒤 주사 부위에 이상반응이 생겨서 예방접종을 실시한 소아청소년과의원을 찾아왔다. 이때 「감염병의 예방 및 관리에 관한 법률」에 따라 예방접종 후 이상반응으로 진단한 원장이 이상반응 발생신고서를 제출해야 할 대상은?',
  expected_output='C',
  choices=['대한의사협회장', '보건복지부장관', '남아 소재지 관할 보건소장', '남아 소재지 관할 시장 ∙ 군수 ∙ 구청장', '남아 소재지 관할 시 ∙ 도지사']
)
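
The printed test case shows three fields: the question (`input`), the answer letter (`expected_output`), and the list of `choices`. The sketch below illustrates how a letter such as `'C'` maps to a position in `choices`. `ToyTestCase`, `answer_text`, and `is_correct` are hypothetical helpers written for this illustration, not part of `langmetrics`, and the option texts are placeholders.

```python
from dataclasses import dataclass


@dataclass
class ToyTestCase:
    """Hypothetical stand-in that mirrors the three fields printed above."""
    input: str
    expected_output: str
    choices: list[str]


def answer_text(case: ToyTestCase) -> str:
    """Return the choice text that the answer letter points to (A -> index 0)."""
    letter = case.expected_output.strip().upper()
    if len(letter) != 1 or not "A" <= letter <= "Z":
        raise ValueError(f"not a single answer letter: {case.expected_output!r}")
    index = ord(letter) - ord("A")
    if index >= len(case.choices):
        raise ValueError(f"answer {letter!r} is outside the choices")
    return case.choices[index]


def is_correct(predicted_letter: str, case: ToyTestCase) -> bool:
    """Compare letters case-insensitively, ignoring surrounding whitespace."""
    return predicted_letter.strip().upper() == case.expected_output.strip().upper()


case = ToyTestCase(
    input="(question text)",
    expected_output="C",
    choices=["option 1", "option 2", "option 3", "option 4", "option 5"],
)
print(answer_text(case))        # option 3
print(is_correct(" c ", case))  # True
```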


In [16]:
results = await metric.ameasure(test_dataset)

 11%|█▏        | 341/3009 [05:44<17:00,  2.61it/s]   

Unexpected error in trimAndLoadJson: Evaluation LLM outputted an invalid JSON. Please use a better evaluation model., retry 1/60


 27%|██▋       | 803/3009 [08:14<10:54,  3.37it/s]

Unexpected error in trimAndLoadJson: Evaluation LLM outputted an invalid JSON. Please use a better evaluation model., retry 1/60


 29%|██▉       | 883/3009 [08:48<15:56,  2.22it/s]

Unexpected error in trimAndLoadJson: Evaluation LLM outputted an invalid JSON. Please use a better evaluation model., retry 2/60


 30%|██▉       | 902/3009 [08:55<12:40,  2.77it/s]

Unexpected error in trimAndLoadJson: Evaluation LLM outputted an invalid JSON. Please use a better evaluation model., retry 1/60


 36%|███▌      | 1072/3009 [09:57<14:28,  2.23it/s]

Unexpected error in trimAndLoadJson: Evaluation LLM outputted an invalid JSON. Please use a better evaluation model., retry 1/60


 37%|███▋      | 1108/3009 [10:08<10:38,  2.98it/s]

CancelledError: 

Unexpected error in trimAndLoadJson: Evaluation LLM outputted an invalid JSON. Please use a better evaluation model., retry 1/60
Unexpected error in trimAndLoadJson: Evaluation LLM outputted an invalid JSON. Please use a better evaluation model., retry 3/60
Unexpected error in trimAndLoadJson: Evaluation LLM outputted an invalid JSON. Please use a better evaluation model., retry 1/60
Unexpected error in trimAndLoadJson: Evaluation LLM outputted an invalid JSON. Please use a better evaluation model., retry 2/60
Unexpected error in trimAndLoadJson: Evaluation LLM outputted an invalid JSON. Please use a bette

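Asynchronous requests keep many questions in flight at once. In the run shown above, throughput was roughly 2 to 3 items per second, and the run was interrupted (`CancelledError`) after about 10 minutes with 1,108 of the 3,009 test cases processed. The `invalid JSON` messages appear to be automatic retries: when the model's reply cannot be parsed as JSON, the request is attempted again, up to 60 times.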
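
The `invalid JSON ... retry 1/60` messages in the log suggest a parse-and-retry loop. The sketch below shows that general idea under stated assumptions: `extract_json` and `ask_until_valid` are hypothetical helpers, the `reasoning`/`answer` keys are made up for the example, and none of this is the library's actual `trimAndLoadJson` code.

```python
import json


def extract_json(text: str) -> dict:
    """Parse the outermost {...} span of a model reply; raise ValueError if none parses."""
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found")
    return json.loads(text[start:end + 1])  # JSONDecodeError subclasses ValueError


def ask_until_valid(generate, prompt: str, max_retries: int = 3) -> dict:
    """Call generate(prompt) until its reply contains valid JSON or retries run out."""
    for attempt in range(1, max_retries + 1):
        reply = generate(prompt)
        try:
            return extract_json(reply)
        except ValueError:
            print(f"invalid JSON, retry {attempt}/{max_retries}")
    raise RuntimeError("model never produced valid JSON")


# Hypothetical model that fails once, then wraps valid JSON in extra text.
replies = iter(["The answer is C", 'Sure! {"reasoning": "...", "answer": "C"} Done.'])
result = ask_until_valid(lambda prompt: next(replies), "question")
print(result["answer"])  # C
```

With `temperature=0` a retry may return the same unparseable reply, so a bounded retry count matters more than the exact limit.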

In [16]:
results.df.to_csv('qwen_32b_epoch0.csv', index=False)

In [17]:
scores = sum([i.score for i in results]) / len(results)

In [18]:
print(scores)

0.7121967431040213
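
A mean score on its own hides how much it could move from sampling noise. As a rough sketch, assuming every `score` is 0 or 1, the binomial standard error gives an approximate 95% interval. The 2,143 / 866 split below is inferred from the printed mean and the dataset size of 3,009; it is not taken from the notebook output, and `accuracy_with_error` is a hypothetical helper.

```python
import math


def accuracy_with_error(scores: list[float]) -> tuple[float, float]:
    """Return (mean score, binomial standard error), assuming 0/1 scores."""
    n = len(scores)
    if n == 0:
        raise ValueError("no scores to average")
    mean = sum(scores) / n
    stderr = math.sqrt(mean * (1 - mean) / n)
    return mean, stderr


# Inferred split: 2,143 correct and 866 incorrect answers out of 3,009.
toy_scores = [1.0] * 2143 + [0.0] * 866
mean, stderr = accuracy_with_error(toy_scores)
print(f"accuracy = {mean:.4f} +/- {1.96 * stderr:.4f} (95% interval)")
```

Under these assumptions the interval is about ±1.6 percentage points, which is worth keeping in mind when comparing checkpoints whose scores differ by a similar amount.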


In [19]:
custom_llm.shutdown()

In [None]:
test_dataset = LLMDataset.from_huggingface_hub('sickgpt/015_KorMedMCQA_test', split='train')

# Create a custom model config (same base model, LoRA adapter from epoch 1)
custom_config = LocalModelConfig(
    model_name="Qwen/Qwen2.5-32B-Instruct",
    lora_model_path="/workspace/project/2025/model/llm/sickllm/qwen_32b_lora_ver1/epoch_1",
    lora_name="qwen",
)

# A local LLM starts its server on this machine, so expect some boot time.
custom_llm = LLMFactory.create_llm(custom_config, temperature=0)
# custom_llm.shutdown()

waiting llm server boot


Note: the server normally runs in a separate terminal.
This run is executing in a parallel CI environment, so performance may differ from real-world results.


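NameError: name 'metric' is not defined

This error indicates that `metric` does not exist in the current session, most likely because the `MCQMetric` cell was not re-run after a kernel restart. `metric` must be created from the newly started `custom_llm` before `ameasure` is called.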

In [9]:
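# Rebuild the metric: one created earlier still points at the previous, shut-down custom_llm.
metric = MCQMetric(output_model=custom_llm, template_language='ko', output_template_type='reasoning')
results = await metric.ameasure(test_dataset)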

custom_llm.shutdown()

results.df.to_csv('qwen_32b_epoch1.csv', index=False)

scores = sum([i.score for i in results]) / len(results)

print(scores)


 10%|▉         | 292/3009 [03:52<20:16,  2.23it/s]   

Unexpected error in trimAndLoadJson: Evaluation LLM outputted an invalid JSON. Please use a better evaluation model., retry 1/60


 12%|█▏        | 360/3009 [04:19<15:36,  2.83it/s]

CancelledError: 

Unexpected error in trimAndLoadJson: Evaluation LLM outputted an invalid JSON. Please use a better evaluation model., retry 1/60
API Error: Connection error., retry 1/60
API Error: Connection

In [None]:
scores = sum([i.score for i in results]) / len(results)

print(scores)

In [None]:
test_dataset = LLMDataset.from_huggingface_hub('sickgpt/015_KorMedMCQA_test', split='train')

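# Create a custom model config (base model only, no LoRA adapter)
custom_config = LocalModelConfig(
    model_name="Qwen/Qwen2.5-32B-Instruct",
)

# A local LLM starts its server on this machine, so expect some boot time.
custom_llm = LLMFactory.create_llm(custom_config, temperature=0)

# Rebuild the metric: one created earlier still points at the previous, shut-down custom_llm.
metric = MCQMetric(output_model=custom_llm, template_language='ko', output_template_type='reasoning')
results = await metric.ameasure(test_dataset)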

custom_llm.shutdown()

results.df.to_csv('qwen_32b.csv', index=False)

# scores = sum([i.score for i in results]) / len(results)

# print(scores)

# custom_llm.shutdown_server()

waiting llm server boot


In [None]:
scores = sum([i.score for i in results]) / len(results)

print(scores)
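
The three runs above (LoRA epoch 0, LoRA epoch 1, and the base model) repeat the same steps: start a model, evaluate, shut down. A loop with `try`/`finally` keeps each server from lingering when a run fails or is interrupted. The sketch below uses stand-ins only: `StubLLM`, `evaluate_checkpoints`, the `start` helper, the placeholder paths, and the dummy score are all hypothetical, and in a real notebook the two callables would wrap `LLMFactory.create_llm` and `MCQMetric`.

```python
from typing import Callable, Optional


class StubLLM:
    """Hypothetical stand-in for a locally served model with a shutdown() method."""

    def __init__(self, lora_path: Optional[str]):
        self.lora_path = lora_path
        self.running = True

    def shutdown(self) -> None:
        self.running = False


def evaluate_checkpoints(
    lora_paths: dict[str, Optional[str]],
    start_llm: Callable[[Optional[str]], StubLLM],
    evaluate: Callable[[StubLLM], float],
) -> dict[str, float]:
    """Start one model per checkpoint, score it, and always shut it down."""
    accuracies = {}
    for name, path in lora_paths.items():
        llm = start_llm(path)
        try:
            # A real evaluate() would build a fresh metric from this llm.
            accuracies[name] = evaluate(llm)
        finally:
            llm.shutdown()  # runs even if evaluation fails or is interrupted
    return accuracies


started = []


def start(path: Optional[str]) -> StubLLM:
    llm = StubLLM(path)
    started.append(llm)
    return llm


# Placeholder paths; None stands for the base model without an adapter.
runs = {"epoch_0": "path/to/epoch_0", "epoch_1": "path/to/epoch_1", "base": None}
result = evaluate_checkpoints(runs, start, lambda llm: 0.0)  # dummy score
print(result, all(not llm.running for llm in started))
```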