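# Why RAGAS Customization Is Needed
## 1. Closing the gap between real user queries and ragas evaluation queries
* The default ragas data generation is built on a generic document-query-answer pattern that applies to most documents.
  * Instead of deriving personas from the documents, supply real user personas as input so that the generated queries stay close to real ones.
* Because ragas selects documents and generates queries mainly by summary-embedding similarity, the following problems arise.
  * Multi-hop evaluation data generation relies mainly on summary embeddings in the persona generation and document selection steps.
  * As a result, it cannot build evaluation data from documents that share a keyword but cover different topics or themes.
  * Because chunks with similar summary embeddings are grouped together, the evaluation data is dominated by document combinations drawn from a single section.
  * Consequently, the data does not adequately reflect the compound intent and contextual requirements of real user queries.
  * Example query
    * ex) "I'm seeing excessive load concentrated on the **lower back during the finishing movement of the sumo deadlift**. Write up the **scientific reason** this happens and the **practice methods** to fix it."
      * Core keywords: sumo deadlift, finishing movement, lower back
      * Topics: scientific reason, practice methods
    * As shown here, real user queries often contain a multi-topic (multi-section) flow built around a single keyword.
    * The default RAGAS generation does not reflect this well enough, so customization is required.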
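
A minimal sketch of why similarity-driven selection tends to pair chunks from one section. The vectors, section labels, and the 0.9 threshold are toy values assumed purely for illustration; they are not ragas internals or measured embeddings.

```python
import math
from itertools import combinations


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


# Hypothetical summary embeddings: chunk id -> (section label, vector).
chunks = {
    "technique_1": ("technique", [0.9, 0.1, 0.0]),
    "technique_2": ("technique", [0.8, 0.2, 0.1]),
    "fitness_1": ("fitness", [0.1, 0.9, 0.2]),
    "fitness_2": ("fitness", [0.0, 0.8, 0.3]),
}

THRESHOLD = 0.9  # assumed similarity cut-off for linking two chunks

# Link every pair of chunks whose summary embeddings are similar enough.
linked_pairs = [
    (a, b)
    for a, b in combinations(chunks, 2)
    if cosine(chunks[a][1], chunks[b][1]) >= THRESHOLD
]

# Pairs that would combine two different sections into one multi-hop query.
cross_section = [(a, b) for a, b in linked_pairs if chunks[a][0] != chunks[b][0]]

print(linked_pairs)
print(cross_section)
```

With these toy vectors only same-section pairs clear the threshold, which is the pattern described above: a query that needs both a "scientific reason" section and a "practice method" section has no path to be generated.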

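## 2. Designing synthetic data for stronger discriminative power
* Good evaluation data should clearly expose performance differences between RAG systems or LLMs.
* A query that does so is defined here as having "discriminative power."
* Generate synthetic data with discriminative power by introducing a new approach to multi-hop scenario generation.
* Strengthen discriminative power through compound query types (Simple + Abstraction).
  * ex) "What are the general training principles for building a weightlifting training program, and what is the rationale behind each principle?"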

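## RAGAS Customization verification procedure
1. Generate a baseline synthetic dataset with ragas.
2. Tune the retrieval-stage hyperparameters on the baseline synthetic dataset.
3. Select a subset of hyperparameter combinations based on the distribution of the retrieval scores.
4. Tune the generation-stage hyperparameters for the selected combinations.
5. Generate the custom synthetic dataset.
6. Evaluate retrieval and generation performance on the custom synthetic dataset.
7. Check discriminative power by comparing evaluation metrics.
   * Compare the spread (distribution) of the scores
   * Analyze differences in the resulting rankings (rank sensitivity)
   * Run statistical significance tests

If discriminative power is hard to confirm, check whether the scores on the synthetic data are low or high.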
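
The first two checks in step 7 can be sketched with the standard library alone. This is a rough illustration, not the procedure actually used here: the helper names are hypothetical, and the score lists are placeholder numbers chosen for the example, not measured results.

```python
from itertools import combinations
from statistics import pstdev


def score_spread(scores):
    """Population standard deviation of one metric across configurations."""
    return pstdev(scores)


def kendall_tau(xs, ys):
    """Kendall's tau-a between two score lists (tied pairs count as neither)."""
    concordant = discordant = 0
    for i, j in combinations(range(len(xs)), 2):
        sign = (xs[i] - xs[j]) * (ys[i] - ys[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(xs) * (len(xs) - 1) / 2
    return (concordant - discordant) / n_pairs


# Placeholder scores for the same four configurations on each dataset.
base_scores = [0.80, 0.81, 0.79, 0.80]
custom_scores = [0.55, 0.70, 0.48, 0.62]

# A wider spread suggests the dataset separates configurations more clearly.
print(score_spread(base_scores), score_spread(custom_scores))

# A tau well below 1 means the two datasets rank the configurations differently.
print(kendall_tau(base_scores, custom_scores))
```

For the significance test, a paired test over per-query scores (for example `scipy.stats.wilcoxon`) is one option, assuming SciPy is available and the same queries are scored under both configurations.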

## Key hyperparameter settings (auto_rag)
1. Retrieval metrics: [retrieval_f1, retrieval_ndcg, retrieval_map]
2. bm25 tokenizer: ko_kiwi
3. rrf_k(num_chunk): [3, 5, 10]
4. Generation metrics: bert_score and g_eval
   * The main ragas metrics could also be used here.
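
To make the retrieval metrics concrete, here is a minimal sketch of F1 and nDCG for a single query with binary relevance. The function names mirror the metric names above, but the definitions are textbook ones and are an assumption; the auto_rag implementations may differ in detail.

```python
import math


def retrieval_f1(retrieved, relevant):
    """Harmonic mean of precision and recall over retrieved chunk ids."""
    retrieved_set, relevant_set = set(retrieved), set(relevant)
    hits = len(retrieved_set & relevant_set)
    if hits == 0:
        return 0.0
    precision = hits / len(retrieved_set)
    recall = hits / len(relevant_set)
    return 2 * precision * recall / (precision + recall)


def retrieval_ndcg(retrieved, relevant):
    """nDCG with binary gains; `retrieved` is ordered best-first."""
    relevant_set = set(relevant)
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, chunk_id in enumerate(retrieved)
        if chunk_id in relevant_set
    )
    ideal_hits = min(len(relevant_set), len(retrieved))
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg else 0.0


# Hypothetical chunk ids for one query.
print(retrieval_f1(["a", "b", "c"], ["a", "d"]))
print(retrieval_ndcg(["x", "a"], ["a"]))
```

F1 ignores ordering while nDCG rewards placing relevant chunks early, so comparing the two across the `[3, 5, 10]` settings shows whether extra chunks add recall or only noise.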

In [2]:
import os
import json
from tqdm import tqdm

from dotenv import load_dotenv
load_dotenv()

True

In [3]:
with open('../data/document/역도/chunk_with_overlap.json', 'r', encoding='utf-8') as f:
    origin_data = json.load(f)

# To check the discriminative power of the custom dataset, use sections 'Ⅲ. 역도경기 기술의 구조와 훈련법' (weightlifting technique) and 'Ⅳ. 역도체력의 구조와 훈련법' (weightlifting fitness)
sample_data = origin_data[2] + origin_data[3]
# sample_data = origin_data[2]
print(len(sample_data))

81


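1. Print the scenario for each synthetic sample
2. Optimize MultiHopAbstractQuery
   * Parallel processing
   * Build a per-node neighbor map
3. Translate the synthetic dataset
4. Track the index_id of each reference_contexts entry in the synthetic dataset

# 1. Baseline synthetic data generation with ragas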

In [4]:
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from langchain_openai import ChatOpenAI
from langchain_openai import OpenAIEmbeddings

from ragas.testset.graph import KnowledgeGraph
from ragas.testset.graph import Node, NodeType

generator_llm = LangchainLLMWrapper(ChatOpenAI(model='gpt-4o-mini'))
generator_embeddings = LangchainEmbeddingsWrapper(OpenAIEmbeddings())

In [19]:
from langchain_core.documents import Document

kg = KnowledgeGraph()
document_list = []

for doc in sample_data:
    page_content = doc['page_content']
    metadata = doc['metadata']

    new_document = Document(page_content)
    new_document.metadata = metadata

    document_list.append(new_document)
    kg.nodes.append(
        Node(
            type=NodeType.DOCUMENT,
            properties={'page_content': doc['page_content'],
                        'document_metadata': doc['metadata']}
        )
    )

In [20]:
from ragas.testset.transforms import default_transforms, apply_transforms

trans = default_transforms(documents=document_list, llm=generator_llm, embedding_model=generator_embeddings)
apply_transforms(kg, trans)

                                                                                                               

In [21]:
# kg.save('../data/document/역도/kg_sector3_4.json')

In [5]:
kg = KnowledgeGraph.load('../data/document/역도/kg_sector3_4.json')

### Improving the existing MultiHopAbstractQuerySynthesizer

In [13]:
import typing as t
import logging
from concurrent.futures import ThreadPoolExecutor
from collections import defaultdict
from ragas.testset.graph import KnowledgeGraph, Node
from ragas.testset.synthesizers.multi_hop.abstract import MultiHopAbstractQuerySynthesizer
from ragas.testset.synthesizers.single_hop.specific import SingleHopSpecificQuerySynthesizer
from ragas.testset.synthesizers.multi_hop.specific import MultiHopSpecificQuerySynthesizer

logger = logging.getLogger(__name__)

class FastMultiHopAbstractQuerySynthesizer(MultiHopAbstractQuerySynthesizer):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self._neighbor_cache = {}
        self._cluster_cache = {}
        
    def _build_neighbor_map(self, knowledge_graph: KnowledgeGraph) -> dict:
        """선처리: 노드별 이웃 노드 맵 생성"""
        if not self._neighbor_cache:
            neighbor_map = defaultdict(set)
            # 한 번의 순회로 모든 관계 처리
            for rel in knowledge_graph.relationships:
                if rel.get_property("summary_similarity"):
                    neighbor_map[rel.source].add(rel.target)
            self._neighbor_cache = dict(neighbor_map)
        return self._neighbor_cache

    def _find_cluster_from_node(self, start_node: Node, neighbor_map: dict, max_depth: int = 2) -> set:
        """단일 노드에서 시작하는 클러스터 찾기"""
        # 캐시 확인
        cache_key = (start_node.id, max_depth)
        if cache_key in self._cluster_cache:
            return self._cluster_cache[cache_key]

        visited = {start_node}
        current_level = {start_node}
        
        # Use BFS (more memory-efficient)
        for depth in range(max_depth):
            next_level = set()
            for node in current_level:
                neighbors = neighbor_map.get(node, set())
                next_level.update(n for n in neighbors if n not in visited)
            visited.update(next_level)
            current_level = next_level
            if not current_level:  # stop when there are no more nodes to expand
                break

        # Cache the result
        self._cluster_cache[cache_key] = visited
        return visited

    def get_node_clusters(self, knowledge_graph: KnowledgeGraph) -> t.List[t.Set[Node]]:
        """최적화된 클러스터 찾기"""
        # 1. 이웃 노드 맵 구축 (캐시 활용)
        neighbor_map = self._build_neighbor_map(knowledge_graph)
        
        # 2. 병렬 처리를 위한 함수
        def process_node_chunk(nodes):
            return [self._find_cluster_from_node(node, neighbor_map) for node in nodes]

        # 3. 노드를 청크로 분할하여 병렬 처리
        chunk_size = max(1, len(knowledge_graph.nodes) // (4 * 2))  # CPU 코어 수의 2배 정도의 청크
        node_chunks = [
            list(knowledge_graph.nodes)[i:i + chunk_size]
            for i in range(0, len(knowledge_graph.nodes), chunk_size)
        ]

        # 4. Run in parallel
        all_clusters = []
        with ThreadPoolExecutor(max_workers=4) as executor:
            chunk_results = list(executor.map(process_node_chunk, node_chunks))
            for chunk_result in chunk_results:
                all_clusters.extend(chunk_result)

        # 5. Deduplicate and filter by minimum size (using set operations)
        unique_clusters = set()
        min_cluster_size = 2  # minimum cluster size
        
        for cluster in all_clusters:
            if len(cluster) >= min_cluster_size:
                frozen_cluster = frozenset(cluster)
                unique_clusters.add(frozen_cluster)

        logger.info(f"Found {len(unique_clusters)} unique clusters")
        return [set(cluster) for cluster in unique_clusters]
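
The neighbor-map and depth-limited BFS idea can be tried on a toy graph without ragas. In this sketch plain string ids stand in for `Node` objects, `(source, target)` tuples stand in for `summary_similarity` relationships, and the function names are hypothetical simplifications of the methods above.

```python
from collections import defaultdict


def build_neighbor_map(edges):
    """Map each source id to the set of its direct targets (one pass over edges)."""
    neighbor_map = defaultdict(set)
    for source, target in edges:
        neighbor_map[source].add(target)
    return dict(neighbor_map)


def find_cluster(start, neighbor_map, max_depth=2):
    """Collect every node reachable from `start` within `max_depth` hops."""
    visited = {start}
    frontier = {start}
    for _ in range(max_depth):
        next_frontier = set()
        for node in frontier:
            next_frontier.update(neighbor_map.get(node, set()) - visited)
        visited.update(next_frontier)
        frontier = next_frontier
        if not frontier:  # nothing left to expand
            break
    return visited


# Toy directed edges standing in for similarity relationships.
edges = [("a", "b"), ("b", "c"), ("c", "d"), ("e", "f")]
neighbor_map = build_neighbor_map(edges)

# One cluster per start node, deduplicated, keeping clusters of size >= 2.
clusters = {frozenset(find_cluster(node, neighbor_map)) for node in "abcdef"}
clusters = {cluster for cluster in clusters if len(cluster) >= 2}
print(sorted(sorted(cluster) for cluster in clusters))
```

Two properties of this approach are worth keeping in mind: edges are followed only from source to target, and clusters from different start nodes can overlap, since deduplication removes only exact duplicates.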

In [8]:
from typing import Dict, List, Tuple, Union
from dataclasses import dataclass, field

import typing as t
import logging
from concurrent.futures import ThreadPoolExecutor
from collections import defaultdict
from ragas.testset.graph import KnowledgeGraph, Node

from ragas.testset.synthesizers.multi_hop import MultiHopScenario
from ragas.testset.synthesizers.multi_hop.abstract import MultiHopAbstractQuerySynthesizer
from ragas.testset.synthesizers.single_hop.specific import SingleHopSpecificQuerySynthesizer
from ragas.testset.synthesizers.multi_hop.specific import MultiHopSpecificQuerySynthesizer


logger = logging.getLogger(__name__)

@dataclass
class FastMultiHopAbstractQuerySynthesizer(MultiHopAbstractQuerySynthesizer):
    name: str = "fast_multi_hop_abstract_synthesizer"
    _scenario_cache: Dict = field(default_factory=dict)
    generated_scenarios: List[MultiHopScenario] = field(default_factory=list)

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self._neighbor_cache = {}
        self._cluster_cache = {}
        self.generated_scenarios = []
        
    def _build_neighbor_map(self, knowledge_graph: KnowledgeGraph) -> dict:
        """선처리: 노드별 이웃 노드 맵 생성"""
        if not self._neighbor_cache:
            neighbor_map = defaultdict(set)
            # 한 번의 순회로 모든 관계 처리
            for rel in knowledge_graph.relationships:
                if rel.get_property("summary_similarity"):
                    neighbor_map[rel.source].add(rel.target)
            self._neighbor_cache = dict(neighbor_map)
        return self._neighbor_cache

    def _find_cluster_from_node(self, start_node: Node, neighbor_map: dict, max_depth: int = 2) -> set:
        """단일 노드에서 시작하는 클러스터 찾기"""
        # 캐시 확인
        cache_key = (start_node.id, max_depth)
        if cache_key in self._cluster_cache:
            return self._cluster_cache[cache_key]

        visited = {start_node}
        current_level = {start_node}
        
        # Use BFS (more memory-efficient)
        for depth in range(max_depth):
            next_level = set()
            for node in current_level:
                neighbors = neighbor_map.get(node, set())
                next_level.update(n for n in neighbors if n not in visited)
            visited.update(next_level)
            current_level = next_level
            if not current_level:  # stop when there are no more nodes to expand
                break

        # Cache the result
        self._cluster_cache[cache_key] = visited
        return visited

    def get_node_clusters(self, knowledge_graph: KnowledgeGraph) -> t.List[t.Set[Node]]:
        """최적화된 클러스터 찾기"""
        # 1. 이웃 노드 맵 구축 (캐시 활용)
        neighbor_map = self._build_neighbor_map(knowledge_graph)
        
        # 2. 병렬 처리를 위한 함수
        def process_node_chunk(nodes):
            return [self._find_cluster_from_node(node, neighbor_map) for node in nodes]

        # 3. 노드를 청크로 분할하여 병렬 처리
        chunk_size = max(1, len(knowledge_graph.nodes) // (4 * 2))  # CPU 코어 수의 2배 정도의 청크
        node_chunks = [
            list(knowledge_graph.nodes)[i:i + chunk_size]
            for i in range(0, len(knowledge_graph.nodes), chunk_size)
        ]

        # 4. Run in parallel
        all_clusters = []
        with ThreadPoolExecutor(max_workers=4) as executor:
            chunk_results = list(executor.map(process_node_chunk, node_chunks))
            for chunk_result in chunk_results:
                all_clusters.extend(chunk_result)

        # 5. Deduplicate and filter by minimum size (using set operations)
        unique_clusters = set()
        min_cluster_size = 2  # minimum cluster size
        
        for cluster in all_clusters:
            if len(cluster) >= min_cluster_size:
                frozen_cluster = frozenset(cluster)
                unique_clusters.add(frozen_cluster)

        logger.info(f"Found {len(unique_clusters)} unique clusters")
        return [set(cluster) for cluster in unique_clusters]

    async def _generate_scenarios(
        self,
        n: int,
        knowledge_graph: KnowledgeGraph,
        persona_list: List,
        callbacks,
    ) -> List[MultiHopScenario]:

        scenarios = await super()._generate_scenarios(n, knowledge_graph, persona_list, callbacks)

        self.generated_scenarios.extend(scenarios)

        return scenarios

    def get_all_scenario_details(self):
        details = []
        for scenario in self.generated_scenarios:
            detail = {
                "combinations": scenario.combinations,
                "persona": {
                    "name": scenario.persona.name,
                    "description": scenario.persona.role_description
                },
                "query_style": scenario.style.name,
                "query_length": scenario.length.name
            }
            
            details.append(detail)
        return details

In [None]:
from ragas.testset import TestsetGenerator

generator = TestsetGenerator(llm=generator_llm, embedding_model=generator_embeddings, knowledge_graph=kg)

query_distribution = [
    (SingleHopSpecificQuerySynthesizer(llm=generator_llm), 0.5),
    (MultiHopSpecificQuerySynthesizer(llm=generator_llm), 0.25),
    (FastMultiHopAbstractQuerySynthesizer(llm=generator_llm), 0.25)
]
 
from langchain_core.callbacks.stdout import StdOutCallbackHandler
handler = StdOutCallbackHandler()

testset = generator.generate(testset_size=30,
                             query_distribution=query_distribution,
                             callbacks=[handler])

Generating personas: 100%|██████████| 3/3 [00:01<00:00,  2.63it/s]
Generating Scenarios: 100%|██████████| 3/3 [00:29<00:00,  9.75s/it]
Generating Samples: 100%|██████████| 31/31 [00:10<00:00,  2.97it/s]

In [31]:
testset_df = testset.to_pandas()
# testset_df.to_csv('../data/document/역도/df_sector3_4.csv', index=False)

## Translation

In [29]:
import re
from langchain_core.prompts import PromptTemplate
from langchain_core.output_parsers import StrOutputParser

template = """
    You are an expert translator specializing in English to Korean translation.

    Translate the following English text into natural Korean.  
    Only output the translated Korean text.  
    If a term is a proper noun or a commonly used English term (e.g., "clean and jerk"), transliterate it into Korean and include the original English in parentheses.

    Text:  
    {input_text}
""" 


llm = ChatOpenAI(model='gpt-4o-mini', temperature=0)
prompt = PromptTemplate.from_template(template)
chain = prompt | llm | StrOutputParser()

In [30]:
def classify_language(text):
    english_count = len(re.findall(r'[a-zA-Z]', text))
    korean_count = len(re.findall(r'[가-힣]', text))

    if english_count >= korean_count:
        return 'english'

    return 'korean'
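
A quick, self-contained check of how this character-count heuristic behaves. The function is repeated so the block runs on its own, and the sample strings are illustrative.

```python
import re


def classify_language(text: str) -> str:
    """Label text 'english' unless Hangul syllables outnumber Latin letters."""
    english_count = len(re.findall(r'[a-zA-Z]', text))
    korean_count = len(re.findall(r'[가-힣]', text))
    return 'english' if english_count >= korean_count else 'korean'


samples = [
    "What is the basic principle of the snatch?",
    "오버그립의 정의는 무엇인가요?",
    "역도에서 스내치(snatch) 기술의 기본 원리는 무엇인가요?",
    "1234",
]

for text in samples:
    print(classify_language(text), "|", text)
```

Because the comparison is `>=`, a string with no Latin or Hangul characters at all (such as `"1234"`) is labeled `'english'` and would be sent to the translation chain. If that matters for the dataset, a strict `>` comparison or an explicit empty-count check is one possible adjustment.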

In [32]:
testset_df['language_user_input'] = testset_df['user_input'].apply(lambda x : classify_language(x))
testset_df['language_reference'] = testset_df['reference'].apply(lambda x : classify_language(x))

user_data = testset_df.loc[(testset_df['language_user_input'] == 'english'), 'user_input'].tolist()
reference_data = testset_df.loc[(testset_df['language_reference'] == 'english'), 'reference'].tolist()

translate_user = chain.batch(user_data, config={'max_concurrency': 5})
translate_reference = chain.batch(reference_data, config={'max_concurrency': 5})

In [33]:
testset_df.loc[(testset_df['language_user_input'] == 'english'), 'user_input'] = translate_user
testset_df.loc[(testset_df['language_reference'] == 'english'), 'reference'] = translate_reference

In [35]:
testset_df.iloc[:, :4].to_csv('../data/document/역도/df_sector3_4.csv', index=False)

In [36]:
testset_df.head(2)

Unnamed: 0,user_input,reference_contexts,reference,synthesizer_name,language_user_input,language_reference
0,역도에서 스내치(snatch) 기술의 기본 원리는 무엇인가요?,"[역도경기의 기술이라 함은, 경기자가 극한의 중량을 가진 바벨을 들어올리기 위\n하...",역도에서 스내치(snatch) 기술의 기본 원리는 최소한의 노력으로 최대 중량을 들...,single_hop_specifc_query_synthesizer,english,english
1,오버그립의 정의는 무엇인가요?,"[바벨을 잡는 방법에는 크게 오버그립(over grip), 언더그립(under gr...",오버그립(over grip)은 손바닥을 몸 쪽으로 향하여 위에서 바벨을 잡는 방법입니다.,single_hop_specifc_query_synthesizer,korean,korean



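# Printing scenarios for RAGAS synthetic dataset generation
* Improvements
  * Need code that saves the knowledge graph once it has been built
  * Code that prints each step
  * Personas should also print role_description
  * The scenarios themselves should be printed as well
  * Later: given a knowledge graph and transforms as input, generate only the questions (?)
  * A way to skip running the summary step (?)
  * Fix the questions that come out in English
* To do now
  * Analyze the code written so far
  * Improve it based on the analysis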

In [9]:
from dataclasses import dataclass, field
from ragas.testset.synthesizers.single_hop import SingleHopScenario, SingleHopQuerySynthesizer
from ragas.testset.synthesizers.multi_hop import MultiHopScenario, MultiHopQuerySynthesizer

@dataclass
class CustomSingleHopSpecificSynthesizer(SingleHopSpecificQuerySynthesizer):
    name: str = "custom_single_hop_specific_synthesizer"
    _scenario_cache: Dict = field(default_factory=dict)
    generated_scenarios: List[SingleHopScenario] = field(default_factory=list, init=False)

    async def _generate_scenarios(
        self,
        n: int,
        knowledge_graph: KnowledgeGraph,
        persona_list: List,
        callbacks,
    ) -> List[SingleHopScenario]:
        # Call the parent class's _generate_scenarios method
        scenarios = await super()._generate_scenarios(n, knowledge_graph, persona_list, callbacks)

        # Store the generated scenarios
        self.generated_scenarios.extend(scenarios)
        
        return scenarios
    
    def get_all_scenario_details(self):
        details = []
        for scenario in self.generated_scenarios:
            detail = {
                "term": scenario.term,
                "persona": {
                    "name": scenario.persona.name,
                    "description": scenario.persona.role_description
                },
                "query_style": scenario.style.name,
                "query_length": scenario.length.name
            }
            
            details.append(detail)
        return details

@dataclass
class CustomMultiHopSpecificSynthesizer(MultiHopSpecificQuerySynthesizer):
    name: str = "custom_multi_hop_specific_synthesizer"
    _scenario_cache: Dict = field(default_factory=dict)
    generated_scenarios: List[MultiHopScenario] = field(default_factory=list, init=False)

    async def _generate_scenarios(
        self,
        n: int,
        knowledge_graph: KnowledgeGraph,
        persona_list: List,
        callbacks,
    ) -> List[MultiHopScenario]:
        # Call the parent class's _generate_scenarios method
        scenarios = await super()._generate_scenarios(n, knowledge_graph, persona_list, callbacks)

        # Store the generated scenarios
        self.generated_scenarios.extend(scenarios)
        
        return scenarios

    def get_all_scenario_details(self):
        details = []
        for scenario in self.generated_scenarios:
            detail = {
                "combinations": scenario.combinations,
                "persona": {
                    "name": scenario.persona.name,
                    "description": scenario.persona.role_description
                },
                "query_style": scenario.style.name,
                "query_length": scenario.length.name
            }
            
            details.append(detail)
        return details
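
The pattern shared by both classes (call the parent's async method, record what it returns, pass the result through) can be seen in isolation below. `BaseSynthesizer`, `RecordingSynthesizer`, and `drain` are hypothetical stand-ins invented for this sketch, not ragas classes or methods.

```python
import asyncio
from dataclasses import dataclass, field
from typing import List


@dataclass
class BaseSynthesizer:
    """Hypothetical stand-in for a ragas synthesizer."""
    name: str = "base"

    async def _generate_scenarios(self, n: int) -> List[dict]:
        return [{"id": i, "source": self.name} for i in range(n)]


@dataclass
class RecordingSynthesizer(BaseSynthesizer):
    name: str = "recording"
    # init=False keeps the recorder out of the constructor signature.
    generated_scenarios: List[dict] = field(default_factory=list, init=False)

    async def _generate_scenarios(self, n: int) -> List[dict]:
        scenarios = await super()._generate_scenarios(n)
        self.generated_scenarios.extend(scenarios)  # record, then pass through
        return scenarios

    def drain(self) -> List[dict]:
        """Return everything recorded so far and reset for the next batch."""
        captured, self.generated_scenarios = self.generated_scenarios, []
        return captured


synth = RecordingSynthesizer()
asyncio.run(synth._generate_scenarios(2))
asyncio.run(synth._generate_scenarios(1))
first_batch = synth.drain()
print(len(first_batch), synth.generated_scenarios)
```

Inside a notebook, where an event loop is already running, `await synth._generate_scenarios(2)` would replace `asyncio.run(...)`. Resetting the recorder between batches keeps scenarios from one `generate` call from leaking into the next.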

In [10]:
import typing as t
from typing import Optional, List, Dict, Any, Tuple, Union
from ragas.testset.synthesizers import default_query_distribution  # used as the fallback in generate()
from ragas.testset.synthesizers.generate import TestsetGenerator
from ragas.testset.synthesizers.testset_schema import Testset, TestsetSample
from ragas.testset.synthesizers.base import BaseScenario
import random
import pandas as pd
from tqdm import tqdm

from langchain.callbacks import StdOutCallbackHandler
from ragas.testset.persona import generate_personas_from_kg
from ragas.testset.synthesizers.utils import calculate_split_values
from ragas.executor import Executor


class CustomTestGenerator(TestsetGenerator):
    """
    TestGenerator를 상속받아 각 데이터 행별 시나리오 정보를 포함하는 커스텀 생성기
    """
    
    def generate(
        self,
        testset_size: int,
        query_distribution: Optional[List[tuple]] = None,
        num_personas: int = 3,
        run_config: Optional[Dict[str, Any]] = None,
        batch_size: Optional[int] = None,
        callbacks: Optional[List] = None,
        token_usage_parser: Optional[Any] = None,
        with_debugging_logs: bool = False,
        raise_exceptions: bool = True,
    ) -> Testset:
        """
        기존 generate 메소드를 오버라이드하여 시나리오 정보를 포함하도록 수정
        """
        if run_config is not None:
            self.llm.set_run_config(run_config)

        query_distribution = query_distribution or default_query_distribution(
            self.llm, self.knowledge_graph
        )
        callbacks = callbacks or []

        # Generate personas
        if self.persona_list is None:
            self.persona_list = generate_personas_from_kg(
                llm=self.llm,
                kg=self.knowledge_graph,
                num_personas=num_personas,
                callbacks=callbacks,
            )
        else:
            random.shuffle(self.persona_list)

        # Generate scenarios
        splits, _ = calculate_split_values(
            [prob for _, prob in query_distribution], testset_size
        )
        exec = Executor(
            desc="Generating Scenarios",
            raise_exceptions=raise_exceptions,
            run_config=run_config,
            keep_progress_bar=False,
            batch_size=batch_size,
        )
        
        for i, (scenario, _) in enumerate(query_distribution):
            exec.submit(
                scenario.generate_scenarios,
                n=splits[i],
                knowledge_graph=self.knowledge_graph,
                persona_list=self.persona_list[:num_personas],
                callbacks=callbacks,
            )

        scenario_sample_list: t.List[t.List[BaseScenario]] = exec.results()

        # Generate samples
        exec = Executor(
            "Generating Samples",
            raise_exceptions=raise_exceptions,
            run_config=run_config,
            keep_progress_bar=True,
            batch_size=batch_size,
        )
        
        additional_testset_info: t.List[t.Dict] = []
        for i, (synthesizer, _) in enumerate(query_distribution):
            for scenario in scenario_sample_list[i]:
                exec.submit(
                    synthesizer.generate_sample,
                    scenario=scenario,
                    callbacks=callbacks,
                )
                # Add the scenario information to additional_testset_info
                additional_testset_info.append({
                    "synthesizer_name": synthesizer.name,
                    "scenario_info": {
                        "type": scenario.__class__.__name__,
                        "description": str(scenario),
                        "style": str(scenario.style),
                        "length": str(scenario.length),
                        "nodes": [str(node) for node in scenario.nodes]
                    }
                })

        eval_samples = exec.results()

        # Build the test set
        testsets = []
        for sample, additional_info in zip(eval_samples, additional_testset_info):
            testsets.append(TestsetSample(eval_sample=sample, **additional_info))
            
        testset = Testset(samples=testsets)
        return testset 

    def _generate_batch(
        self,
        batch_size: int,
        query_distribution: List[Tuple[Union[SingleHopQuerySynthesizer, MultiHopQuerySynthesizer], float]],
        callbacks: List[StdOutCallbackHandler] = None
    ) -> Tuple[Any, Dict[str, List[Dict[str, Any]]]]:
        
        # Generate the dataset for this batch
        dataset = self.generate(
            testset_size=batch_size,
            query_distribution=query_distribution,
            callbacks=callbacks
        )
        
        # Collect scenario details, keyed by scenario type
        scenario_details = {
            "single_hop": [],
            "multi_hop": [],
            "fast_multi_hop": []
        }
        
        # Collect the scenario details recorded by each synthesizer
        for synthesizer, _ in query_distribution:
            if isinstance(synthesizer, CustomSingleHopSpecificSynthesizer):
                details = synthesizer.get_all_scenario_details()
                scenario_details["single_hop"].extend(details)
                synthesizer.generated_scenarios = []  # reset for the next batch
                
            elif isinstance(synthesizer, CustomMultiHopSpecificSynthesizer):
                details = synthesizer.get_all_scenario_details()
                scenario_details["multi_hop"].extend(details)
                synthesizer.generated_scenarios = []  # reset for the next batch
            
            elif isinstance(synthesizer, FastMultiHopAbstractQuerySynthesizer):
                details = synthesizer.get_all_scenario_details()
                scenario_details["fast_multi_hop"].extend(details)
                synthesizer.generated_scenarios = []  # reset for the next batch
        
        return dataset, scenario_details

    def merge_scenario(
        self,
        final_dataset: pd.DataFrame,
        all_scenario_details: Dict[str, List[Dict[str, Any]]],
    ) -> pd.DataFrame:
        scenario_info_list = []
        for scenario_type, details in all_scenario_details.items():
            for detail in details:
                scenario_info = {
                    'scenario_type': scenario_type,
                    'combinations': detail.get('combinations', ''),
                    'term': detail.get('term', ''),
                    'persona_name': detail.get('persona', {}).get('name', ''),
                    'persona_description': detail.get('persona', {}).get('description', ''),
                    'query_style': detail.get('query_style', ''),
                    'query_length': detail.get('query_length', '')
                }
                scenario_info_list.append(scenario_info)
        
        scenario_df = pd.DataFrame(scenario_info_list)
        
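        # Columns are assigned positionally, so scenario_df must have the same
        # number of rows, in the same order, as final_dataset.
        for col in scenario_df.columns:
            final_dataset[col] = scenario_df[col].values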
        return final_dataset
        
    def generate_with_details(
        self,
        testset_size: int,
        query_distribution: List[Tuple[Union[SingleHopQuerySynthesizer, MultiHopQuerySynthesizer], float]],
        callbacks: List[StdOutCallbackHandler] = None,
        batch_size: int = 5
    ) -> pd.DataFrame:
        """Generate the dataset using batched and parallel processing, then merge in the scenario details."""
        all_datasets = []
        all_scenario_details = {
            "single_hop": [],
            "multi_hop": [],
            "fast_multi_hop": []
        }
        
        with tqdm(total=testset_size, desc="데이터셋 생성 중") as pbar:
            for i in range(0, testset_size, batch_size):
                current_batch_size = min(batch_size, testset_size - i)
                
                dataset, scenario_details = self._generate_batch(
                    batch_size=current_batch_size,
                    query_distribution=query_distribution,
                    callbacks=callbacks,  # forward the caller's callbacks (None by default)
                )

                all_datasets.append(dataset)
                for key in all_scenario_details:
                    all_scenario_details[key].extend(scenario_details[key])
                
                pbar.update(current_batch_size)
        
        # Concatenate the per-batch datasets
        final_dataset = pd.concat([d.to_pandas() for d in all_datasets], ignore_index=True)

        merged_dataset = self.merge_scenario(final_dataset, all_scenario_details)
        
        return merged_dataset
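
The batching arithmetic in `generate_with_details` can be looked at in isolation. The sketch below is a minimal standalone version of the `range`/`min` loop only; `batch_sizes` is a hypothetical helper introduced for illustration and is not part of the class above.

```python
from typing import List

def batch_sizes(testset_size: int, batch_size: int) -> List[int]:
    """Hypothetical helper: the batch sizes the loop would request, in order."""
    return [
        min(batch_size, testset_size - start)  # the last batch may be smaller
        for start in range(0, testset_size, batch_size)
    ]

print(batch_sizes(10, 5))  # [5, 5]
print(batch_sizes(12, 5))  # [5, 5, 2]
```

When `testset_size` is not a multiple of `batch_size`, the final batch simply requests the remainder.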

In [11]:
import asyncio
import json
import pandas as pd
from typing import Dict, List, Tuple, Union
from dataclasses import dataclass, field
from tqdm import tqdm
from langchain.callbacks import StdOutCallbackHandler

def run_generation(generator_llm, generator_embeddings, kg):
    # Create a CustomTestGenerator instance
    generator = CustomTestGenerator(
        llm=generator_llm,
        embedding_model=generator_embeddings,
        knowledge_graph=kg,
        # output_dir="my_dataset"  # set an output directory if needed
    )

    # Synthesizers and their sampling weights
    query_distribution = [
        (CustomSingleHopSpecificSynthesizer(llm=generator_llm), 0.5),
        (CustomMultiHopSpecificSynthesizer(llm=generator_llm), 0.25),
        (FastMultiHopAbstractQuerySynthesizer(llm=generator_llm), 0.25),
    ]
    
    # Callbacks (optional); currently not passed to the generator below
    callbacks = [StdOutCallbackHandler()]
    
    # Generate the dataset
    merged_dataset = generator.generate_with_details(
        testset_size=10,
        query_distribution=query_distribution,
        batch_size=5,
        # callbacks=callbacks,  # uncomment to pass the callbacks
    )

    return merged_dataset
            

In [12]:
merged_dataset = run_generation(generator_llm, generator_embeddings, kg)

데이터셋 생성 중:   0%|          | 0/10 [00:00<?, ?it/s]

Generating personas: 100%|██████████| 3/3 [00:01<00:00,  1.94it/s]
Generating Scenarios: 100%|██████████| 3/3 [00:06<00:00,  2.10s/it]
Generating Samples: 100%|██████████| 7/7 [00:07<00:00,  1.10s/it]
Generating Scenarios: 100%|██████████| 3/3 [00:08<00:00,  2.73s/it]
Generating Samples: 100%|██████████| 7/7 [00:05<00:00,  1.17it/s]
데이터셋 생성 중: 100%|██████████| 10/10 [00:29<00:00,  2.98s/it]
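
Each batch above requests 5 samples, yet the log reports 7 generated samples per batch. One explanation consistent with these numbers is that the per-synthesizer counts in `splits` are rounded up before the scenarios are generated. The sketch below assumes that rounding rule. `splits` is computed outside this section, so `split_counts` is a hypothetical stand-in, not the actual implementation.

```python
import math
from typing import List

def split_counts(probabilities: List[float], n: int) -> List[int]:
    """Hypothetical stand-in: round each synthesizer's share of n up to an integer."""
    return [math.ceil(p * n) for p in probabilities]

# Weights from query_distribution (0.5, 0.25, 0.25) and a batch of 5
counts = split_counts([0.5, 0.25, 0.25], 5)
print(counts, sum(counts))  # [3, 2, 2] 7
```

If this assumption holds, the resulting dataset can contain more rows than `testset_size`, while the outer progress bar still counts only the requested batch sizes.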


In [14]:
merged_dataset.head(2)

| | user_input | reference_contexts | reference | synthesizer_name | scenario_type | combinations | term | persona_name | persona_description | query_style | query_length |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 리프팅 기술이란 뭐고, 어떻게 선수의 성능에 영향을 미치나? | [역도경기의 기술이라 함은, 경기자가 극한의 중량을 가진 바벨을 들어올리기 위\n하... | 리프팅 기술은 경기자가 극한의 중량을 가진 바벨을 들어올리기 위해 육체적 성능을 합... | custom_single_hop_specific_synthesizer | single_hop | | 리프팅 | Sports Performance Analyst | Analyzes athletic performance and biomechanica... | POOR_GRAMMAR | MEDIUM |
| 1 | Can you explain the significance of the 훅그립 in... | [바벨을 잡는 방법에는 크게 오버그립(over grip), 언더그립(under gr... | 훅그립(hook grip) is a grip technique used by all... | custom_single_hop_specific_synthesizer | single_hop | | 훅그립 | Sports Performance Analyst | Analyzes athletic performance and biomechanica... | PERFECT_GRAMMAR | LONG |


In [75]:
import re

def make_chunk_dict(knowledge_graph):
    """Map each node's page content to its chunk_id."""
    chunk_dict = {}
    for node in knowledge_graph.nodes:
        chunk_id = node.properties['document_metadata']['chunk_id']
        page_content = node.properties['page_content']

        chunk_dict[page_content] = chunk_id
    return chunk_dict

def regular_expression(reference_contexts, chunk_dict):
    """Return the chunk_id of each reference context."""
    if len(reference_contexts) == 1:
        return [chunk_dict[reference_contexts[0]]]
    else:
        # Strip the "<N-hop>\n\n" prefix from each context before the lookup
        return [chunk_dict[re.sub(r"<\d+-hop>\n\n", "", text)] for text in reference_contexts]

In [76]:
def make_reference_contexts(knowledge_graph, merged_dataset):
    chunk_dict = make_chunk_dict(knowledge_graph)
    merged_dataset['reference_contexts_id'] = merged_dataset['reference_contexts'].apply(lambda x : regular_expression(x, chunk_dict))

    return merged_dataset

In [77]:
merged_dataset = make_reference_contexts(kg, merged_dataset)
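
The lookup performed by `regular_expression` can be checked on toy data. The following is a minimal sketch: `contexts_to_chunk_ids`, `toy_chunk_dict`, and the context strings are made up for illustration, and the only thing carried over from the cells above is the `<N-hop>\n\n` prefix pattern.

```python
import re
from typing import Dict, List

HOP_PREFIX = re.compile(r"<\d+-hop>\n\n")

def contexts_to_chunk_ids(reference_contexts: List[str], chunk_dict: Dict[str, int]) -> List[int]:
    """Strip any hop prefix, then look each context up by its exact text."""
    return [chunk_dict[HOP_PREFIX.sub("", text)] for text in reference_contexts]

# Toy mapping from page content to chunk_id (illustrative values only)
toy_chunk_dict = {"grip text": 3, "pull text": 7}

print(contexts_to_chunk_ids(["grip text"], toy_chunk_dict))  # [3]
print(contexts_to_chunk_ids(["<1-hop>\n\ngrip text", "<2-hop>\n\npull text"], toy_chunk_dict))  # [3, 7]
```

Unlike the function above, this sketch applies the substitution to every context, which leaves unprefixed single-hop text unchanged. Because the lookup is an exact string match, a context that differs from the stored page content in any way raises a `KeyError`.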

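# RAGAS Experiments
### Evaluation Items
1. Generation 
   1. Faithfulness
      * Measures the factual consistency of the generated answer against the given context
      * Computed from the answer and the retrieved context
      * Scaled to the range (0, 1); higher is better
      * For a generated answer to be considered faithful, every claim made in the answer must be inferable from the given context
      * Identify the set of claims in the generated answer -> check whether each claim is grounded in the given context
      * Score: number of claims in the answer supported by the context / total number of claims in the answer
      * Example
        * What are Einstein's date and place of birth?
          * Answer 1: Einstein was born in Germany on 1879/3/14.
            * 
          * Answer 2: Einstein was born in Germany on 1879/4/14.
            * 
          * context: 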
* 
   1. 
1. Retriever
   1. 
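
The Faithfulness score defined above (supported claims divided by total claims) can be worked through on the Einstein example. This is a minimal sketch, not the ragas implementation. The verdicts are labeled by hand, whereas ragas relies on an LLM to extract and judge the claims. Since the context is left blank above, the sketch assumes a context stating that Einstein was born in Germany on 14 March 1879. `faithfulness_score` is a hypothetical helper.

```python
from typing import List

def faithfulness_score(claim_supported: List[bool]) -> float:
    """Hypothetical helper: fraction of the answer's claims supported by the context."""
    if not claim_supported:
        raise ValueError("the answer must contain at least one claim")
    return sum(claim_supported) / len(claim_supported)

# Assumed context: Einstein was born in Germany on 14 March 1879.
answer_1 = [True, True]   # born in Germany; born on 1879/3/14
answer_2 = [True, False]  # born in Germany; born on 1879/4/14 (not supported)

print(faithfulness_score(answer_1))  # 1.0
print(faithfulness_score(answer_2))  # 0.5
```

Under the assumed context, Answer 2 keeps the correct birthplace but loses half its score because the date claim cannot be inferred from the context.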
## 1. 