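# SFT: Supervised Fine-Tuning

## What is SFT (Supervised Fine-Tuning)?

Supervised Fine-Tuning (SFT) is the most direct fine-tuning method: the model is trained on labeled input-output pairs so that, given an input, it learns to produce the corresponding output.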

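## Strengths and limitations of SFT

**Strengths:**
- Conceptually simple; easy to understand and implement
- Data requirements are relatively clear
- Training is stable and converges quickly
- Adapts quickly to a specific task

**Limitations:**
- Requires a large amount of high-quality labeled data
- May overfit the training data
- Extremely sensitive to data quality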

## Data format design

Alpaca format (single-turn):

In [None]:
{
  "instruction": "將以下英文翻譯成中文",
  "input": "The weather is beautiful today.",
  "output": "今天天氣很好。"
}
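
Before tokenization, a record in this format is usually flattened into a single prompt string. The sketch below shows one way to do that. The section labels (`### Instruction:`, `### Input:`, `### Response:`) and the helper name `format_alpaca` are illustrative assumptions, not a fixed standard; they should match whatever template the target model expects.

```python
from typing import Dict


def format_alpaca(sample: Dict[str, str]) -> Dict[str, str]:
    """Flatten an Alpaca-style record into a prompt/completion pair.

    The section labels are an assumed template for illustration.
    """
    instruction = sample["instruction"].strip()
    context = sample.get("input", "").strip()

    parts = [f"### Instruction:\n{instruction}"]
    # The "input" field is optional; omit its section when it is empty
    if context:
        parts.append(f"### Input:\n{context}")
    parts.append("### Response:\n")

    return {
        "prompt": "\n\n".join(parts),
        "completion": sample["output"].strip(),
    }


record = {
    "instruction": "將以下英文翻譯成中文",
    "input": "The weather is beautiful today.",
    "output": "今天天氣很好。",
}
pair = format_alpaca(record)
print(pair["prompt"] + pair["completion"])
```

Keeping the prompt and the completion separate makes it easy to compute the training loss on the completion only.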

Chat format (multi-turn):

In [None]:
{
  "conversations": [
    {"role": "user", "content": "什麼是機器學習？"},
    {"role": "assistant", "content": "機器學習是人工智慧的一個分支..."},
    {"role": "user", "content": "能舉個例子嗎？"},
    {"role": "assistant", "content": "當然！例如垃圾郵件過濾..."}
  ]
}

Prompt template as used in practice:

In [None]:
<|system|>你是一個有幫助的 AI 助手</s>
<|user|>什麼是機器學習？</s>
<|assistant|>機器學習是人工智慧的一個分支...</s>
<|user|>能舉個例子嗎？</s>
<|assistant|>

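During training, the model learns to generate the response that follows each `<|assistant|>` marker. At inference time the prompt ends with an open `<|assistant|>` marker, as in the last line above. The exact special tokens vary by model.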
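
One way to make this concrete is to render a conversation into the template and record which parts are assistant replies, since many SFT setups compute the loss only on those parts. The sketch below works on character spans for readability; real pipelines apply the mask at token level (in PyTorch-based trainers, commonly by setting the labels of masked tokens to -100). The function name `render_chat` and the newline between turns are assumptions for illustration.

```python
from typing import Dict, List, Tuple

EOS = "</s>"  # end-of-turn marker taken from the template above


def render_chat(
    conversations: List[Dict[str, str]], system: str = ""
) -> Tuple[str, List[Tuple[int, int]]]:
    """Render turns into one training string.

    Returns the text plus (start, end) character spans of the assistant
    replies, i.e. the parts the loss would be computed on.
    """
    text = ""
    spans: List[Tuple[int, int]] = []

    if system:
        text += f"<|system|>{system}{EOS}\n"

    for turn in conversations:
        text += f"<|{turn['role']}|>"
        start = len(text)
        text += turn["content"] + EOS
        if turn["role"] == "assistant":
            # Include the end marker so the model learns where to stop
            spans.append((start, len(text)))
        text += "\n"

    return text, spans


conversations = [
    {"role": "user", "content": "什麼是機器學習？"},
    {"role": "assistant", "content": "機器學習是人工智慧的一個分支..."},
]
text, spans = render_chat(conversations, system="你是一個有幫助的 AI 助手")
for start, end in spans:
    print(repr(text[start:end]))
```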

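## Data cleaning and deduplication

High-quality data is the key to successful SFT. Common cleaning steps:

### Removing low-quality samples
- Too short (< 10 characters) or too long (> 2048 tokens; the code below approximates this limit by character count)
- Contains garbled text or repeated characters (!!!!!!)
- Harmful or discriminatory content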

In [None]:
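import json
import re
from typing import Dict

def clean_text(text: str) -> str:
    """Clean a single text field."""
    # Collapse runs of whitespace into a single space.
    # Note: this also flattens newlines, so skip it for code or multi-paragraph text.
    text = re.sub(r'\s+', ' ', text)
    # Collapse repeated punctuation (!!!!! -> !), including full-width ！ and ？
    text = re.sub(r'([!?。！？])\1+', r'\1', text)
    # Remove any remaining control characters
    text = re.sub(r'[\x00-\x1f\x7f-\x9f]', '', text)
    return text.strip()

def is_valid_sample(sample: Dict, min_len: int = 10, max_len: int = 2048) -> bool:
    """Return True if the sample passes the basic quality checks."""
    instruction = sample.get('instruction', '')
    output = sample.get('output', '')

    # 1. Length limits, measured in characters rather than tokens.
    #    The default min_len of 10 is strict for Chinese text; lower it if
    #    short but valid answers are being rejected.
    if not (min_len <= len(instruction) <= max_len):
        return False
    if not (min_len <= len(output) <= max_len):
        return False

    # 2. Garbled-text check: at least half of the characters should be
    #    Latin letters, digits, or CJK ideographs
    valid_chars = re.findall(r'[a-zA-Z0-9\u4e00-\u9fff]', output)
    if len(output) >= 20 and len(valid_chars) < len(output) * 0.5:
        return False

    # 3. Long runs of one repeated character
    if re.search(r'(.)\1{8,}', output):
        return False

    # 4. Blocked terms (simple keyword match; extend the list as needed)
    blacklist = ['炸彈', '黑鬼', '毒品']
    text_to_check = instruction + output
    if any(word in text_to_check for word in blacklist):
        return False

    return True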


def clean_dataset(input_file: str, output_file: str):
    """Clean an entire dataset stored as a JSON list of samples."""
    with open(input_file, 'r', encoding='utf-8') as f:
        data = json.load(f)

    cleaned_data = []
    stats = {'total': len(data), 'removed': 0}

    for sample in data:
        # Clean each text field; a missing field becomes '' and fails validation
        sample['instruction'] = clean_text(sample.get('instruction', ''))
        sample['output'] = clean_text(sample.get('output', ''))
        if 'input' in sample:
            sample['input'] = clean_text(sample['input'])

        # Keep only samples that pass the quality checks
        if is_valid_sample(sample):
            cleaned_data.append(sample)
        else:
            stats['removed'] += 1

    # Save the cleaned data
    with open(output_file, 'w', encoding='utf-8') as f:
        json.dump(cleaned_data, f, ensure_ascii=False, indent=2)

    print(f"Cleaning done: {stats['total']} original, "
          f"{stats['removed']} removed, "
          f"{len(cleaned_data)} kept")

# Usage example
clean_dataset('raw_data.json', 'cleaned_data.json')

### Deduplication
**Exact deduplication: remove fully identical samples**

In [None]:
import json
import hashlib
from typing import List, Dict

def exact_dedup(data: List[Dict]) -> List[Dict]:
    """Exact deduplication: drop samples whose fields are all identical."""
    seen_hashes = set()
    deduped_data = []
    duplicate_count = 0

    for sample in data:
        # Join the fields with a separator so that different splits of the
        # same characters do not collide. 'input' is included so that samples
        # differing only in their input are not merged.
        content = "|||".join([
            sample['instruction'],
            sample.get('input', ''),
            sample['output'],
        ])
        content_hash = hashlib.md5(content.encode('utf-8')).hexdigest()

        if content_hash not in seen_hashes:
            seen_hashes.add(content_hash)
            deduped_data.append(sample)
        else:
            duplicate_count += 1

    print(f"Exact dedup: removed {duplicate_count} duplicate samples")
    return deduped_data


# Usage example
with open('cleaned_data.json', 'r', encoding='utf-8') as f:
    data = json.load(f)

deduped_data = exact_dedup(data)

with open('deduped_data.json', 'w', encoding='utf-8') as f:
    json.dump(deduped_data, f, ensure_ascii=False, indent=2)
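
Exact matching misses samples that differ only in trivial ways, such as full-width versus half-width punctuation, letter case, or extra spaces. One possible refinement, sketched below, is to normalize the fields before hashing. The helper name `normalized_key` and the specific normalization steps are assumptions for illustration; whether case folding is appropriate depends on the data (it is usually not for code).

```python
import hashlib
import unicodedata


def normalized_key(*fields: str) -> str:
    """Hash the given fields after light normalization (assumed steps)."""
    parts = []
    for field in fields:
        # NFKC folds full-width letters and punctuation to their ASCII forms
        text = unicodedata.normalize("NFKC", field)
        # Collapse whitespace and ignore letter case
        text = " ".join(text.split()).casefold()
        parts.append(text)
    # The unit separator cannot appear in the cleaned text, so field
    # boundaries stay unambiguous
    return hashlib.sha256("\x1f".join(parts).encode("utf-8")).hexdigest()


a = normalized_key("什麼是機器學習？", "Machine  Learning")
b = normalized_key("什麼是機器學習?", "machine learning")
print(a == b)
```

The returned key can stand in for `content_hash` in an exact-deduplication loop.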

**Near-duplicate removal (similar samples)**

In [None]:
import json
import numpy as np
from typing import List, Dict
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def fuzzy_dedup(data: List[Dict], threshold: float = 0.85) -> List[Dict]:
    """
    Near-duplicate removal: drop samples whose outputs are highly similar.
    threshold: cosine-similarity cutoff above which the later sample is removed
    """
    if not data:
        return []

    texts = [sample['output'] for sample in data]

    # Character n-grams are used because the default word-level tokenizer
    # treats an unsegmented run of Chinese characters as a single token.
    # max_features is a tunable cap on the vocabulary size.
    vectorizer = TfidfVectorizer(analyzer='char', ngram_range=(2, 3), max_features=5000)
    tfidf_matrix = vectorizer.fit_transform(texts)

    keep_indices = set(range(len(data)))
    batch_size = 1000

    # Compare in batches to bound the size of the dense similarity matrix
    for i in range(0, len(data), batch_size):
        end_i = min(i + batch_size, len(data))
        similarities = cosine_similarity(
            tfidf_matrix[i:end_i],
            tfidf_matrix
        )

        for j in range(similarities.shape[0]):
            actual_j = i + j
            if actual_j not in keep_indices:
                continue

            # Only consider samples that come later and exceed the threshold,
            # so the earliest occurrence is the one that is kept
            similar_indices = np.where(
                (similarities[j] > threshold) &
                (np.arange(len(data)) > actual_j)
            )[0]

            for k in similar_indices:
                if k in keep_indices:
                    keep_indices.remove(k)

    deduped_data = [data[i] for i in sorted(keep_indices)]
    removed = len(data) - len(deduped_data)
    print(f"Fuzzy dedup: removed {removed} similar samples (threshold={threshold})")

    return deduped_data


# Usage example
with open('deduped_data.json', 'r', encoding='utf-8') as f:
    data = json.load(f)

fuzzy_deduped_data = fuzzy_dedup(data, threshold=0.85)

with open('final_data.json', 'w', encoding='utf-8') as f:
    json.dump(fuzzy_deduped_data, f, ensure_ascii=False, indent=2)

**Fast deduplication with MinHash (suited to large datasets)**

In [None]:
from typing import List, Dict
from datasketch import MinHash, MinHashLSH

def minhash_dedup(data: List[Dict], threshold: float = 0.8) -> List[Dict]:
    """
    Fast near-duplicate removal with MinHash LSH.
    Suited to large datasets (100k+ samples).
    threshold: approximate Jaccard similarity above which samples match
    """
    lsh = MinHashLSH(threshold=threshold, num_perm=128)
    minhashes = {}

    # Pass 1: build a MinHash per sample and insert it into the LSH index
    for idx, sample in enumerate(data):
        m = MinHash(num_perm=128)
        text = sample['output']

        # Character-level 3-grams; max(1, ...) keeps texts shorter than
        # 3 characters from producing an empty (and thus identical) signature
        for i in range(max(1, len(text) - 2)):
            m.update(text[i:i+3].encode('utf-8'))

        minhashes[idx] = m
        lsh.insert(idx, m)

    # Pass 2: deduplicate, keeping the earliest occurrence
    seen = set()
    deduped_data = []

    for idx, sample in enumerate(data):
        if idx in seen:
            continue

        similar_indices = lsh.query(minhashes[idx])

        # Only mark similar samples with a larger index than the current one
        for other_idx in similar_indices:
            if other_idx > idx:
                seen.add(other_idx)

        deduped_data.append(sample)

    removed = len(data) - len(deduped_data)
    print(f"MinHash dedup: removed {removed} similar samples (threshold={threshold})")

    return deduped_data


# Usage example
# minhash_deduped_data = minhash_dedup(data, threshold=0.8)
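
MinHash estimates the Jaccard similarity between the shingle sets of two texts, so the `threshold` above is a Jaccard cutoff rather than a cosine one. To get a feel for what a value such as 0.8 means, it can help to compute the exact Jaccard similarity on a few known pairs first. The sketch below does this with the same character 3-gram shingles; the helper names `char_shingles` and `jaccard` are hypothetical.

```python
from typing import Set


def char_shingles(text: str, n: int = 3) -> Set[str]:
    """Return the set of character n-grams (the whole text if shorter than n)."""
    if len(text) < n:
        return {text} if text else set()
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def jaccard(a: str, b: str, n: int = 3) -> float:
    """Exact Jaccard similarity between the n-gram sets of two strings."""
    sa, sb = char_shingles(a, n), char_shingles(b, n)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


s1 = "機器學習是人工智慧的一個分支"
s2 = "機器學習是人工智慧的一個重要分支"
print(f"{jaccard(s1, s2):.3f}")
```

Inserting two characters into a 14-character sentence already lowers the similarity well below 0.8, which suggests that short texts need a lower threshold than long ones.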

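### Balancing the data distribution
- Avoid letting one task type (such as translation) dominate the dataset
- Make sure samples of different difficulty levels and styles are covered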

In [None]:
import random
from typing import List, Dict, Any, Optional
from collections import defaultdict


def balance_dataset(
    data: List[Dict[str, Any]],
    category_key: str = 'category',
    max_per_category: int = 1000,
    seed: Optional[int] = None
) -> List[Dict[str, Any]]:
    """
    Balance the number of samples per category so that no single task type
    dominates and biases the model.
    """

    if max_per_category <= 0:
        raise ValueError("max_per_category must be greater than 0")

    # Use a local generator so the global random state is left untouched
    rng = random.Random(seed)

    # Group samples by category
    categorized = defaultdict(list)
    for sample in data:
        category = sample.get(category_key, 'unknown')
        categorized[category].append(sample)

    # Show the original distribution
    print("Original distribution:")
    for cat, samples in categorized.items():
        print(f"  {cat}: {len(samples)} samples")

    # Downsample every category that exceeds the cap
    balanced_data = []
    for category, samples in categorized.items():
        if len(samples) > max_per_category:
            sampled = rng.sample(samples, max_per_category)
            print(f"  {category}: downsampled from {len(samples)} to {max_per_category}")
        else:
            sampled = samples

            if len(samples) < max_per_category * 0.1:
                print(f"  ⚠️ {category}: few samples ({len(samples)})")

        balanced_data.extend(sampled)

    # Shuffle the combined result
    rng.shuffle(balanced_data)

    print(f"\nTotal after balancing: {len(balanced_data)} samples")
    return balanced_data


# Usage example
# Assumes each sample has a 'category' field naming its task type
balanced_data = balance_dataset(
    fuzzy_deduped_data,
    max_per_category=1000,
    seed=42
)

### Full cleaning pipeline

In [None]:
def full_cleaning_pipeline(input_file: str, output_file: str):
    """Full cleaning pipeline, built from the functions defined in the cells above."""
    print("=" * 50)
    print("Starting data cleaning pipeline")
    print("=" * 50)

    # Step 1: load the data
    print("\n[1/5] Loading data...")
    with open(input_file, 'r', encoding='utf-8') as f:
        data = json.load(f)
    print(f"Loaded {len(data)} raw samples")

    # Step 2: basic cleaning
    print("\n[2/5] Basic cleaning...")
    cleaned = []
    for sample in data:
        sample['instruction'] = clean_text(sample['instruction'])
        sample['output'] = clean_text(sample['output'])
        if 'input' in sample:
            sample['input'] = clean_text(sample['input'])
        if is_valid_sample(sample):
            cleaned.append(sample)
    print(f"Kept {len(cleaned)} valid samples")

    # Step 3: exact deduplication
    print("\n[3/5] Exact deduplication...")
    deduped = exact_dedup(cleaned)

    # Step 4: near-duplicate removal (skipped when nothing is left)
    print("\n[4/5] Near-duplicate removal...")
    final = fuzzy_dedup(deduped, threshold=0.85) if deduped else []

    # Step 5: save the result
    print("\n[5/5] Saving cleaned data...")
    with open(output_file, 'w', encoding='utf-8') as f:
        json.dump(final, f, ensure_ascii=False, indent=2)

    # Guard against division by zero on an empty input file
    removed_pct = (1 - len(final) / len(data)) * 100 if data else 0.0

    print("=" * 50)
    print("Cleaning complete!")
    print(f"Raw samples: {len(data)}")
    print(f"Final samples: {len(final)}")
    print(f"Removed: {removed_pct:.1f}%")
    print("=" * 50)

# Run the full pipeline
full_cleaning_pipeline('raw_data.json', 'cleaned_final.json')

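## Data augmentation
Data augmentation is a key technique for coping with a shortage of training data: automated methods generate additional high-quality training samples from a limited dataset.

### Paraphrasing
Re-express the original question or answer in different words, keeping the meaning unchanged while increasing the diversity of expression.

**Paraphrasing with an LLM:**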

In [None]:
import re
import anthropic

class ParaphraseAugmenter:
    def __init__(self, api_key):
        self.client = anthropic.Anthropic(api_key=api_key)

    def paraphrase_question(self, question, num_variants=3):
        """Generate several paraphrased versions of a question."""

        prompt = f"""請將以下問題改寫成 {num_variants} 種不同的表達方式，保持原意不變。

原始問題：
{question}

要求：
1. 每個改寫版本在語氣、用詞或句式上有所不同
2. 保持問題的核心意圖和資訊需求
3. 語言自然流暢
4. 每行輸出一個改寫版本

改寫版本：
"""

        response = self.client.messages.create(
            model="claude-sonnet-4-20250514",
            max_tokens=1000,
            messages=[{"role": "user", "content": prompt}]
        )

        lines = response.content[0].text.strip().split('\n')

        variants = []
        for line in lines:
            line = line.strip()
            if not line:
                continue
            # Strip only a leading list marker ("1.", "2、", "3)", "-", "*").
            # A plain lstrip over digits would also eat the start of a
            # variant that legitimately begins with a number.
            line = re.sub(r'^(?:\d+\s*[.、)）:：]|[-*•])\s*', '', line)
            if line:
                variants.append(line)

        # The model may return extra lines; keep at most num_variants
        return variants[:num_variants]

    def paraphrase_answer(self, answer, style="formal"):
        """Rewrite an answer in the given style (unknown styles fall back to "formal")."""

        style_prompts = {
            "formal": "正式、專業的風格",
            "casual": "輕鬆、口語化的風格",
            "detailed": "更詳細、有更多解釋的風格",
            "concise": "更簡潔、直接的風格"
        }

        style_desc = style_prompts.get(style, style_prompts["formal"])

        prompt = f"""請將以下回答改寫成 {style_desc}。

原始回答：
{answer}

改寫回答：
"""

        response = self.client.messages.create(
            model="claude-sonnet-4-20250514",
            max_tokens=2000,
            messages=[{"role": "user", "content": prompt}]
        )

        return response.content[0].text.strip()


# ===== Usage example =====

augmenter = ParaphraseAugmenter(api_key="your_api_key")

question = "如何在 Python 中讀取 CSV 檔案？"
variants = augmenter.paraphrase_question(question, num_variants=5)

print("Original question:", question)
print("Paraphrased versions:")
for i, v in enumerate(variants, 1):
    print(f"{i}. {v}")

**Synonym replacement:**

In [None]:
import random
import jieba

class SynonymAugmenter:
    def __init__(self):
        # Chinese synonym dictionary
        self.synonym_dict = {
            "如何": ["怎麼", "怎樣", "要如何", "該如何"],
            "方法": ["方式", "做法", "途徑", "手段"],
            "使用": ["用", "運用", "利用", "採用"],
            "建立": ["創建", "建構", "設立", "製作"],
            "問題": ["疑問", "困難", "難題", "課題"],
            "解決": ["處理", "克服", "搞定", "應對"],
        }

    def augment_with_synonyms(self, text, replace_ratio=0.3):
        """Replace some of the words that have known synonyms."""

        words = list(jieba.cut(text))

        # Sample only from words that have a dictionary entry. Sampling from
        # all positions would often pick words that cannot be replaced and
        # return the text unchanged.
        candidates = [i for i, w in enumerate(words) if w in self.synonym_dict]
        if not candidates:
            return text

        # replace_ratio is relative to the total word count, capped by the
        # number of replaceable words
        num_replacements = min(
            len(candidates),
            max(1, int(len(words) * replace_ratio))
        )
        replace_indices = random.sample(candidates, num_replacements)

        augmented = words.copy()
        for idx in replace_indices:
            augmented[idx] = random.choice(self.synonym_dict[words[idx]])

        return ''.join(augmented)

    def add_custom_synonyms(self, word, synonyms):
        """Add a custom synonym entry."""
        self.synonym_dict[word] = synonyms


# ===== Usage example =====

syn_aug = SynonymAugmenter()

original = "如何使用 Python 解決這個問題？"
for i in range(5):
    print(f"{i+1}. {syn_aug.augment_with_synonyms(original)}")

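### Back translation
Translate the text into another language and then back again. The round trip largely preserves the meaning while changing the wording, which yields new samples.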

In [None]:
import random
from typing import Optional
from anthropic import Anthropic


class BackTranslationAugmenter:
    def __init__(self, api_key: str):
        self.client = Anthropic(api_key=api_key)
        # Pivot languages, named in Chinese because they are inserted into the prompts
        self.bridge_languages = [
            "英文", "日文", "韓文", "法文", "德文"
        ]

    def _call_claude(self, prompt: str) -> str:
        """
        Call Claude and safely extract the text of the reply
        (guards against empty content or non-text blocks).
        """
        response = self.client.messages.create(
            model="claude-sonnet-4-20250514",
            max_tokens=1000,
            messages=[
                {"role": "user", "content": prompt}
            ]
        )

        if not response.content:
            raise RuntimeError("Claude returned empty content")

        # Return the first text block
        for block in response.content:
            if hasattr(block, "text"):
                return block.text.strip()

        raise RuntimeError("No text block found in Claude's response")

    def back_translate(self, text: str, bridge_lang: Optional[str] = None):
        """Run one back-translation round trip (Chinese -> pivot -> Chinese)."""

        if bridge_lang is None:
            bridge_lang = random.choice(self.bridge_languages)

        # Step 1: translate into the pivot language
        forward_prompt = (
            f"請將以下中文翻譯成{bridge_lang}，"
            f"只輸出翻譯結果：\n\n{text}"
        )

        translated = self._call_claude(forward_prompt)

        # Step 2: translate back into Chinese
        backward_prompt = (
            f"請將以下{bridge_lang}翻譯成中文，"
            f"只輸出翻譯結果：\n\n{translated}"
        )

        back_translated = self._call_claude(backward_prompt)

        return {
            "original": text,
            "bridge_language": bridge_lang,
            "forward_translation": translated,
            "back_translation": back_translated
        }

    def batch_back_translate(self, text: str, num_variants: int = 3):
        """
        Generate several back-translated versions of one sentence.
        Each variant uses a different pivot language, so num_variants is
        capped at len(self.bridge_languages).
        """
        variants = []

        selected_langs = random.sample(
            self.bridge_languages,
            min(num_variants, len(self.bridge_languages))
        )

        for lang in selected_langs:
            result = self.back_translate(text, bridge_lang=lang)
            variants.append(result)

        return variants


# ===== Usage example =====
bt_aug = BackTranslationAugmenter(api_key="your_api_key")

original = "機器學習是人工智慧的一個分支，專注於讓電腦從資料中學習。"
variants = bt_aug.batch_back_translate(original, num_variants=3)

for i, v in enumerate(variants, 1):
    print(f"\nVariant {i} (via {v['bridge_language']}):")
    print(f"Back translation: {v['back_translation']}")

Caveats:
- May produce unnatural phrasing
- Quality needs manual review
- Some technical terms may be altered
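
Part of this review can be automated with a cheap pre-filter before a human looks at the results. The sketch below rejects a back-translated candidate if it drops a protected term, is almost identical to the original (no added diversity), or differs so much that the meaning has probably drifted. The function name `keep_back_translation`, the protected-term list, and both similarity thresholds are assumptions for illustration and would need tuning on real data.

```python
from difflib import SequenceMatcher
from typing import Iterable


def keep_back_translation(
    original: str,
    candidate: str,
    protected_terms: Iterable[str] = (),
    min_sim: float = 0.3,
    max_sim: float = 0.95,
) -> bool:
    """Heuristic pre-filter for back-translated text (thresholds are assumed)."""
    # Every protected term found in the original must survive the round trip
    for term in protected_terms:
        if term in original and term not in candidate:
            return False

    # Character-level similarity ratio in [0, 1]
    similarity = SequenceMatcher(None, original, candidate).ratio()
    return min_sim <= similarity <= max_sim


original = "機器學習是人工智慧的一個分支，專注於讓電腦從資料中學習。"
candidates = [
    original,  # unchanged, so it adds no diversity
    "機器學習屬於人工智慧的分支，重點是讓電腦從資料中學習。",
    "機械學習是人工智能的一部分。",  # terminology changed
]
terms = ["機器學習", "人工智慧"]
kept = [c for c in candidates if keep_back_translation(original, c, terms)]
print(kept)
```

A filter like this only narrows the set for manual review; it cannot judge whether a sentence reads naturally.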

### LLM data synthesis
**Instruction variation generation**

For a single task goal, generate several instructions that express it in different ways.

In [None]:
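import json
from anthropic import Anthropic

class InstructionVariationGenerator:
    def __init__(self, api_key):
        self.client = Anthropic(api_key=api_key)

    def _parse_json(self, text):
        """Parse JSON from an LLM reply, tolerating surrounding text or code fences."""
        try:
            return json.loads(text)
        except json.JSONDecodeError:
            # Fall back to the outermost {...} span
            start = text.find("{")
            end = text.rfind("}") + 1
            if start == -1 or end <= start:
                raise ValueError(f"No JSON object found in model output:\n{text}")
            return json.loads(text[start:end])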

    def generate_variations(self, task_description, num_variations=5):
        prompt = f"""你是一個資料擴增專家。
請根據以下任務描述，生成 {num_variations} 個表達方式不同但目標相同的使用者指令。

任務描述：
{task_description}

請「只輸出 JSON」，不要加入任何說明文字。

輸出格式：
{{
  "variations": [
    {{
      "instruction": "指令內容",
      "style": "語氣風格",
      "context": "使用情境"
    }}
  ]
}}
"""

        response = self.client.messages.create(
            model="claude-sonnet-4-20250514",
            max_tokens=2000,
            messages=[{"role": "user", "content": prompt}]
        )

        data = self._parse_json(response.content[0].text)
        return data["variations"]


# ===== Usage example =====
gen = InstructionVariationGenerator(api_key="your_api_key")

task = "教使用者如何製作一個簡單的網頁"
variations = gen.generate_variations(task, num_variations=5)

for i, v in enumerate(variations, 1):
    print(f"\nVariation {i}")
    print("Style:", v["style"])
    print("Instruction:", v["instruction"])
    print("Context:", v["context"])


**Self-Instruct**

Let the LLM invent new tasks and their solutions itself, producing complete training data from scratch.

In [None]:
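import json
from anthropic import Anthropic

class SelfInstructGenerator:
    def __init__(self, api_key):
        self.client = Anthropic(api_key=api_key)

    def _parse_json(self, text):
        """Parse JSON from an LLM reply, tolerating surrounding text or code fences."""
        try:
            return json.loads(text)
        except json.JSONDecodeError:
            # Fall back to the outermost {...} span
            start = text.find("{")
            end = text.rfind("}") + 1
            if start == -1 or end <= start:
                raise ValueError(f"No JSON object found in model output:\n{text}")
            return json.loads(text[start:end])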

    def generate_task_and_solution(self, domain, num_examples=10):
        prompt = f"""你是一個 {domain} 領域的專家教師。
請生成 {num_examples} 個教學範例，並「只輸出 JSON」。

輸出格式：
{{
  "examples": [
    {{
      "task": "任務描述",
      "difficulty": "簡單/中等/困難",
      "solution_steps": ["步驟1", "步驟2"],
      "final_answer": "最終答案或程式碼",
      "explanation": "補充說明"
    }}
  ]
}}
"""

        response = self.client.messages.create(
            model="claude-sonnet-4-20250514",
            max_tokens=4000,
            messages=[{"role": "user", "content": prompt}]
        )

        data = self._parse_json(response.content[0].text)
        return data["examples"]

    def convert_to_training_format(self, examples):
        training_data = []

        for ex in examples:
            steps = ex.get("solution_steps", [])
            if isinstance(steps, str):
                steps = [steps]

            solution_text = "\n\n".join(
                f"步驟 {i+1}: {step}" for i, step in enumerate(steps)
            )

            solution_text += f"\n\n最終答案：\n{ex.get('final_answer', '')}"

            if ex.get("explanation"):
                solution_text += f"\n\n說明：{ex['explanation']}"

            training_data.append({
                "messages": [
                    {"role": "user", "content": ex["task"]},
                    {"role": "assistant", "content": solution_text}
                ],
                "metadata": {
                    # The model may omit optional fields, so avoid a KeyError
                    "difficulty": ex.get("difficulty", "unknown"),
                    "generated": True
                }
            })

        return training_data


# ===== Usage example =====
gen = SelfInstructGenerator(api_key="your_api_key")

examples = gen.generate_task_and_solution(
    domain="Python 程式設計",
    num_examples=5
)

training_data = gen.convert_to_training_format(examples)

with open("training_data.jsonl", "w", encoding="utf-8") as f:
    for item in training_data:
        f.write(json.dumps(item, ensure_ascii=False) + "\n")

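**Evol-Instruct (evolutionary data generation)**

Start from a simple instruction and increase its complexity and depth step by step using a set of evolution strategies.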

In [None]:
import random
from anthropic import Anthropic

class EvolInstructGenerator:
    def __init__(self, api_key):
        self.client = Anthropic(api_key=api_key)
        self.evolution_strategies = {
            "增加約束": "加入額外限制條件",
            "深化推理": "要求多步驟推理",
            "增加複雜度": "涉及更多概念",
            "具體化": "轉為真實情境",
            "組合概念": "結合多個主題"
        }

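    def evolve_instruction(self, original_instruction, strategy=None):
        """Evolve one instruction; returns (evolved_instruction, strategy_used)."""
        if strategy is None:
            strategy = random.choice(list(self.evolution_strategies.keys()))
        elif strategy not in self.evolution_strategies:
            raise ValueError(f"Unknown strategy: {strategy}")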

        prompt = f"""請根據以下策略，演化指令。

策略：{strategy}
說明：{self.evolution_strategies[strategy]}

原始指令：
{original_instruction}

請只輸出演化後的指令。
"""

        response = self.client.messages.create(
            model="claude-sonnet-4-20250514",
            max_tokens=1000,
            messages=[{"role": "user", "content": prompt}]
        )

        evolved = response.content[0].text.strip()
        if not evolved:
            evolved = original_instruction

        return evolved, strategy

    def multi_level_evolution(self, seed_instruction, levels=3):
        chain = [{"level": 0, "instruction": seed_instruction}]
        current = seed_instruction

        for level in range(1, levels + 1):
            evolved, strategy = self.evolve_instruction(current)
            chain.append({
                "level": level,
                "instruction": evolved,
                "strategy": strategy
            })
            current = evolved

        return chain


# ===== Usage example =====
gen = EvolInstructGenerator(api_key="your_api_key")

seed = "寫一個 Python 函式來計算兩個數字的和"
chain = gen.multi_level_evolution(seed, levels=3)

for item in chain:
    print(f"\nLevel {item['level']}")
    if item["level"] > 0:
        print("Strategy:", item["strategy"])
    print("Instruction:", item["instruction"])

**Multi-turn dialogue generation**

In [None]:
import json
from anthropic import Anthropic

class DialogueAugmenter:
    """Generate multi-turn dialogue data."""

    def __init__(self, api_key):
        self.client = Anthropic(api_key=api_key)

    def generate_multi_turn_dialogue(self, topic, num_turns=5):
        """Generate a multi-turn dialogue as a list of turn dicts."""

        prompt = f"""
請生成一段關於「{topic}」的多輪對話，包含 {num_turns} 輪問答。

要求：
1. 對話要自然流暢，有邏輯連貫性
2. 問題要從基礎逐步深入
3. 助手的回答要準確且有教育意義
4. 包含追問和澄清
5. 僅輸出 JSON，不要加入任何說明文字

輸出格式：
{{
  "dialogue": [
    {{
      "turn": 1,
      "user": "使用者訊息",
      "assistant": "助手回覆",
      "intent": "意圖類型"
    }}
  ]
}}
"""

        response = self.client.messages.create(
            model="claude-sonnet-4-20250514",
            max_tokens=3000,
            messages=[{"role": "user", "content": prompt}]
        )

        # Concatenate all text blocks
        text_blocks = [
            block.text for block in response.content
            if block.type == "text"
        ]
        raw_text = "".join(text_blocks).strip()

        try:
            data = json.loads(raw_text)
        except json.JSONDecodeError:
            # The model may wrap the JSON in a code fence or add text around
            # it, so retry on the outermost {...} span
            start = raw_text.find("{")
            end = raw_text.rfind("}") + 1
            try:
                data = json.loads(raw_text[start:end])
            except json.JSONDecodeError as e:
                raise ValueError(f"Model output is not valid JSON:\n{raw_text}") from e

        dialogue = data.get("dialogue", [])

        # Optional: cap the number of turns
        return dialogue[:num_turns]

    def convert_to_training_format(self, dialogue):
        """Convert a generated dialogue into the training format."""

        messages = [
            {"role": "system", "content": "你是一個樂於助人的AI助手。"}
        ]

        for turn in dialogue:
            messages.append({
                "role": "user",
                "content": turn.get("user", "")
            })
            messages.append({
                "role": "assistant",
                "content": turn.get("assistant", "")
            })

        return {"messages": messages}


# Usage example
dialogue_aug = DialogueAugmenter(api_key="your_api_key")

dialogue = dialogue_aug.generate_multi_turn_dialogue(
    topic="機器學習模型訓練",
    num_turns=5
)

training_sample = dialogue_aug.convert_to_training_format(dialogue)
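
Because `convert_to_training_format` fills missing fields with empty strings, a malformed generation can slip into the training set unnoticed. A small structural check before writing the sample to disk can catch this. The sketch below is one possible validator; the function name `validate_messages` and the specific rules (optional leading system message, strictly alternating user/assistant turns, non-empty content, final assistant reply) are assumptions about what a well-formed sample looks like.

```python
from typing import Any, Dict, List


def validate_messages(sample: Dict[str, Any]) -> List[str]:
    """Return a list of problems found in a {"messages": [...]} sample."""
    problems: List[str] = []
    messages = sample.get("messages", [])

    # An optional system message is only allowed in first position
    has_system = bool(messages) and messages[0].get("role") == "system"
    turns = messages[1:] if has_system else messages

    if not turns:
        problems.append("no user/assistant turns")

    for i, msg in enumerate(turns):
        expected = "user" if i % 2 == 0 else "assistant"
        if msg.get("role") != expected:
            problems.append(
                f"turn {i}: expected role '{expected}', got '{msg.get('role')}'"
            )
        if not str(msg.get("content", "")).strip():
            problems.append(f"turn {i}: empty content")

    if turns and turns[-1].get("role") != "assistant":
        problems.append("conversation does not end with an assistant reply")

    return problems


good = {"messages": [
    {"role": "system", "content": "你是一個樂於助人的AI助手。"},
    {"role": "user", "content": "什麼是過擬合？"},
    {"role": "assistant", "content": "過擬合是指模型過度記憶訓練資料..."},
]}
bad = {"messages": [{"role": "user", "content": ""}]}

print(validate_messages(good))
print(validate_messages(bad))
```

Samples that return a non-empty list can be dropped or regenerated rather than passed on to training.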