## Preparation 1: Dataset with "OSH situation" inputs and "reasoning" output by LLM

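**Note**: The team has already generated a first version of paired questions and reasoning answers. This section explains [A] how the input file used to generate those questions and answers was produced, [B] how the ground-truth reasoning is constructed, and [C] how to refine the first version through (1) ID removal, (2) improved legal reasoning (from accident cause, to applicable regulation, to liability conclusion), and (3) implicit augmentation from the knowledge graph.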

In [1]:
# [A]
# manually upload knowledge_graph_final.json

In [24]:
import json
import torch
import numpy as np
import pandas as pd
import networkx as nx
from collections import Counter, defaultdict
from typing import Dict, List, Tuple
from tqdm import tqdm
import logging
import random
import sys
import re
import os
import argparse

In [3]:
try:
    from sentence_transformers import SentenceTransformer, util
    HAS_BERT = True
except ImportError:
    HAS_BERT = False
    print("CRITICAL WARNING: sentence-transformers not installed. Semantic Pruning will fail.")



In [4]:
# Configure the logging format so the output reads like a lab tool
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - [OSKG-Lab] - %(levelname)s - %(message)s',
    datefmt='%H:%M:%S'
)
logger = logging.getLogger(__name__)

In [5]:
class OSKGPreprocessor:
    def __init__(self, json_path: str, model_name: str = 'sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2'):
        self.json_path = json_path
        self.model_name = model_name
        self.G = nx.MultiDiGraph()
        self.edge_weights = {}
        self.node_embeddings = {}  # node id -> embedding tensor
        self.hard_negative_candidates = {}  # hard-negative dictionary (PDF 3)

        # Initialize the sentence-embedding model
        if HAS_BERT:
            print(f"Loading BERT model: {model_name}...")
            self.model = SentenceTransformer(model_name)

        # Load the raw knowledge-graph JSON
        with open(json_path, 'r', encoding='utf-8') as f:
            self.raw_data = json.load(f)
        print(f"Loaded KG with {len(self.raw_data.get('nodes', []))} nodes.")

    def build_graph(self):
        """Build the base graph."""
        print("Building initial graph...")
        for node in self.raw_data['nodes']:
            self.G.add_node(node['id'], **node)
        for link in self.raw_data['links']:
            self.G.add_edge(link['source'], link['target'], relation=link['relation'])

    def action_b_generate_embeddings(self):
        """
        Action (B): initialize node features.
        Embeddings must be generated first because the later semantic-pruning step depends on them.
        """
        print("Executing Action B: Generating Node Embeddings...")
        if not HAS_BERT:
            return

        node_ids = list(self.G.nodes())
        texts = []

        # Pick the text to embed: prefer full_text (regulations), then original_full_text,
        # then label (violation descriptions), and finally the node id itself.
        for nid in node_ids:
            node = self.G.nodes[nid]
            text = node.get('full_text', node.get('original_full_text', node.get('label', str(nid))))
            texts.append(str(text))

        # Encode in batches
        embeddings_tensor = self.model.encode(texts, convert_to_tensor=True, show_progress_bar=True, batch_size=32)

        # Map node id -> embedding for fast lookup
        self.node_embeddings = {nid: emb for nid, emb in zip(node_ids, embeddings_tensor)}

        # Store the embeddings (moved to CPU) back on the graph nodes as attribute 'x'
        feature_dict = {nid: emb.cpu() for nid, emb in self.node_embeddings.items()}
        nx.set_node_attributes(self.G, feature_dict, 'x')

    def action_c_semantic_pruning(self, threshold: float = 0.6, top_k: int = 3):
        """
        Action (C): semantic pruning, as described in PDF 2.
        Prunes VIOLATES_LAW edges by embedding similarity to address the "noise explosion" problem.
        """
        print(f"Executing Action C: Semantic Pruning (Threshold={threshold}, Top-K={top_k})...")
        if not HAS_BERT or not self.node_embeddings:
            print("Error: No embeddings found. Run action_b first.")
            return

        # 1. Collect all VIOLATES_LAW edges and group the target regulations by violation node
        vio_edges = [(u, v, k) for u, v, k, attr in self.G.edges(keys=True, data=True) if attr['relation'] == 'VIOLATES_LAW']
        vio_to_reg_map = defaultdict(list)
        for u, v, k in vio_edges:
            vio_to_reg_map[u].append(v)

        print(f" -> Found {len(vio_edges)} VIOLATES_LAW edges to analyze.")
        edges_to_add = []

        # 2. Process each violation node
        for vio_id, reg_ids in tqdm(vio_to_reg_map.items(), desc="Pruning Edges"):
            if vio_id not in self.node_embeddings:
                continue

            vio_emb = self.node_embeddings[vio_id]

            # Gather the embeddings of the candidate regulations
            valid_regs = [rid for rid in reg_ids if rid in self.node_embeddings]
            if not valid_regs:
                continue

            reg_embs = torch.stack([self.node_embeddings[rid] for rid in valid_regs])

            # 3. Cosine similarity; util.cos_sim returns shape (1, n_regs)
            scores = util.cos_sim(vio_emb, reg_embs)[0]

            # 4. Keep the top-k regulations, sorted from highest to lowest score
            reg_scores = [(rid, scores[i].item()) for i, rid in enumerate(valid_regs)]
            reg_scores.sort(key=lambda x: x[1], reverse=True)
            top_results = reg_scores[:top_k]

            # 5. Rebuild the edges and assign weights (logic from PDF 2, page 3):
            #    score >= threshold               -> VIOLATES_SPECIFICALLY, weight = score
            #    in the top-k but below threshold -> IS_RELEVANT_TO, weight = score * 0.5
            for reg_id, score in top_results:
                if score >= threshold:
                    new_relation = 'VIOLATES_SPECIFICALLY'
                    weight = score
                else:
                    new_relation = 'IS_RELEVANT_TO'
                    weight = score * 0.5  # down-weight the weaker matches

                edges_to_add.append({
                    'source': vio_id,
                    'target': reg_id,
                    'relation': new_relation,
                    'weight': weight
                })

        # Apply the updates to the graph
        print(" -> Applying pruning updates to graph...")
        # Remove every original VIOLATES_LAW edge; the filtered edges are added back below
        edges_to_remove = [(u, v, k) for u, v, k, attr in self.G.edges(keys=True, data=True) if attr['relation'] == 'VIOLATES_LAW']
        self.G.remove_edges_from(edges_to_remove)
        print(f" -> Removed {len(edges_to_remove)} noisy edges.")

        # Add the refined edges
        for edge in edges_to_add:
            self.G.add_edge(edge['source'], edge['target'], relation=edge['relation'], weight=edge['weight'])
        print(f" -> Added {len(edges_to_add)} semantic edges (VIOLATES_SPECIFICALLY / IS_RELEVANT_TO).")

    def action_d_inject_hierarchy(self):
        """Action (D): reinforce the hierarchy by linking each regulation to its parent law."""
        import hashlib  # local import: only used to derive stable Law node ids

        print("Executing Action D: Injecting Hierarchy...")
        new_edges = []
        reg_nodes = [n for n, attr in self.G.nodes(data=True) if attr.get('node_type') in ['Regulation', 'Reg']]
        for reg_id in reg_nodes:
            law_name = self.G.nodes[reg_id].get('law_name')
            if law_name:
                # The built-in hash() of a str is salted per process, so a content hash is used
                # to keep Law node ids reproducible across runs.
                law_node_id = f"LAW_{hashlib.md5(law_name.encode('utf-8')).hexdigest()[:12]}"
                if law_node_id not in self.G:
                    self.G.add_node(law_node_id, label=law_name, node_type='Law', law_name=law_name)
                    # The Law node also needs an embedding. For now it borrows the feature of the
                    # first child regulation; averaging the children or re-encoding are alternatives.
                    if HAS_BERT and reg_id in self.node_embeddings:
                        self.node_embeddings[law_node_id] = self.node_embeddings[reg_id]
                        self.G.nodes[law_node_id]['x'] = self.node_embeddings[reg_id].cpu()

                new_edges.append((reg_id, law_node_id, 'PART_OF'))

        for u, v, rel in new_edges:
            self.G.add_edge(u, v, relation=rel, weight=1.0)  # PART_OF edges get weight 1

    def action_prep_hard_negatives(self, top_k=5):
        """
        Supports PDF 3: prepare hard-negative candidates.
        For each regulation, find the regulations that look similar on the surface but are the
        wrong match logically, for use in negative sampling during KGE training.
        """
        print("Executing Action: Preparing Hard Negative Candidates...")
        if not HAS_BERT: return

        # Collect all regulation nodes that have an embedding
        reg_ids = [n for n, attr in self.G.nodes(data=True)
                   if attr.get('node_type') in ['Regulation', 'Reg'] and n in self.node_embeddings]
        if not reg_ids: return

        reg_embs = torch.stack([self.node_embeddings[rid] for rid in reg_ids])

        # Pairwise similarity matrix between regulations
        sim_matrix = util.cos_sim(reg_embs, reg_embs)

        candidates = {}
        for i, rid in enumerate(reg_ids):
            scores = sim_matrix[i]
            # Sort from most to least similar. Index 0 is normally the regulation itself
            # (self-similarity 1.0), so skip it and keep the next top_k.
            top_indices = torch.argsort(scores, descending=True)[1:top_k+1]

            hard_negatives = [reg_ids[idx] for idx in top_indices]
            candidates[rid] = hard_negatives

        self.hard_negative_candidates = candidates
        print(f" -> Generated hard negative candidates for {len(candidates)} regulations.")

    def action_a_compute_edge_weights(self):
        """Recompute the per-relation class weights (run after pruning)."""
        print("Executing Action A: Re-calculating Global Edge Weights...")
        relations = [attr['relation'] for u, v, k, attr in self.G.edges(keys=True, data=True)]
        count = Counter(relations)
        total = len(relations)

        weights = {}
        for rel, c in count.items():
            weights[rel] = total / c

        # After pruning, the noisy VIOLATES_LAW edges are gone, so rare relations naturally
        # receive higher weights. This method only stores the global class weights; the
        # per-edge semantic weight stays on each edge and is not combined here.
        self.class_weights = weights
        print(" -> Updated Class Weights:", weights)

    def export_data(self):
        """Export the processed data objects."""
        return {
            'graph': self.G,
            'node_features': torch.stack([self.G.nodes[n]['x'] for n in self.G.nodes() if 'x' in self.G.nodes[n]]),
            'hard_negatives': self.hard_negative_candidates
        }

def inspect_pruning_results(graph, num_samples=5):
    """
    [Validation] Manual spot check.
    Following the review recommendation: randomly sample pruned edges and confirm that the
    linked regulation is the core applicable article.
    """
    logger.info("--- Running manual sampling validation (QA check) ---")

    # Collect all edges produced by semantic pruning
    semantic_edges = [
        (u, v, attr) for u, v, k, attr in graph.edges(keys=True, data=True)
        if attr.get('relation') in ['VIOLATES_SPECIFICALLY', 'IS_RELEVANT_TO']
    ]

    if not semantic_edges:
        logger.warning("Warning: no semantically pruned edges found. Check the threshold setting.")
        return

    # Random sample
    samples = random.sample(semantic_edges, min(num_samples, len(semantic_edges)))

    for i, (u, v, attr) in enumerate(samples):
        # Node text for display: the label (or the node id), truncated to 30 characters
        u_text = graph.nodes[u].get('label', str(u))[:30] + "..."
        v_text = graph.nodes[v].get('label', str(v))[:30] + "..."

        rel_type = attr['relation']
        score = attr.get('weight', 0.0)

        print(f"Sample {i+1}:")
        print(f"  [事故/違規]: {u_text}")
        print(f"  --[{rel_type} (score: {score:.4f})]-->")
        print(f"  [對應法規]: {v_text}")
        print("-" * 50)

def main():
    # 1. Arguments
    parser = argparse.ArgumentParser(description="OSKG Preprocessor Pipeline")
    parser.add_argument('--input', type=str, default='knowledge_graph_final.json', help='Path to the raw knowledge-graph JSON')
    parser.add_argument('--output', type=str, default='processed_oskg_data.pt', help='Output path for the processed data (.pt)')
    parser.add_argument('--threshold', type=float, default=0.7, help='Cosine-similarity threshold for semantic pruning [suggested: 0.7]')
    parser.add_argument('--top_k', type=int, default=3, help='Number of most relevant regulations to keep [suggested: 3]')
    parser.add_argument('--bert_model', type=str, default='sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2', help='BERT model to use')

    # Pass an empty list so argparse does not try to parse the notebook kernel's argv (e.g. in Colab)
    args = parser.parse_args(args=[])

    # Check the input file
    if not os.path.exists(args.input):
        logger.error(f"Input file not found: {args.input}")
        # In a notebook, return instead of calling sys.exit(1)
        return

    logger.info("Starting the occupational-safety knowledge-graph preprocessing pipeline (OSKG Pipeline)...")
    logger.info(f"Parameters: Threshold={args.threshold}, Top-K={args.top_k}")

    # 2. Initialize the processor
    processor = OSKGPreprocessor(json_path=args.input, model_name=args.bert_model)

    # 3. Build the base graph
    logger.info("Step 1: Building the base graph topology...")
    processor.build_graph()

    # 4. Generate embeddings (the key data-engineering step; pruning depends on them)
    logger.info("Step 2: Generating BERT embeddings (this may take a while)...")
    processor.action_b_generate_embeddings()

    # 5. Semantic pruning: embedding-based pruning to address the "noise explosion" problem
    logger.info("Step 3: Running semantic pruning (removing redundant VIOLATES_LAW edges)...")
    processor.action_c_semantic_pruning(threshold=args.threshold, top_k=args.top_k)

    # 6. Hierarchy injection
    logger.info("Step 4: Injecting the legal hierarchy (Regulation -> Law)...")
    processor.action_d_inject_hierarchy()

    # 7. Weighting
    logger.info("Step 5: Recomputing global edge weights (inverse frequency)...")
    processor.action_a_compute_edge_weights()

    # 8. Hard negatives: prepared to compensate for limited training data
    logger.info("Step 6: Generating hard negatives for adversarial training...")
    processor.action_prep_hard_negatives(top_k=5)

    # 9. Export and save
    final_data = processor.export_data()

    logger.info(f"Step 7: Saving the processed result to {args.output}...")
    torch.save(final_data, args.output)

    # 10. Manual validation: random spot check to confirm data-engineering quality
    inspect_pruning_results(final_data['graph'], num_samples=5)

    logger.info("✅ Pipeline finished. The graph has been sparsified and enriched with semantic edges.")
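The pruning rule applied in Step 3 can be illustrated without the embedding model. The sketch below is a minimal pure-Python version that assumes toy 2-D vectors in place of BERT embeddings; `cosine` and `prune_edges` are hypothetical helper names and are not part of `OSKGPreprocessor`.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def prune_edges(vio_vec, reg_vecs, threshold=0.7, top_k=3):
    """Keep the top_k most similar regulations and relabel them by score."""
    scored = sorted(
        ((rid, cosine(vio_vec, vec)) for rid, vec in reg_vecs.items()),
        key=lambda item: item[1],
        reverse=True,
    )
    edges = []
    for rid, score in scored[:top_k]:
        if score >= threshold:
            edges.append((rid, 'VIOLATES_SPECIFICALLY', score))
        else:
            # Below the threshold: kept as a weaker link with half the weight
            edges.append((rid, 'IS_RELEVANT_TO', score * 0.5))
    return edges

# Toy data, made up for illustration
violation = [1.0, 0.0]
regulations = {
    'REG_A': [1.0, 0.0],   # same direction as the violation
    'REG_B': [1.0, 1.0],   # 45 degrees away
    'REG_C': [0.0, 1.0],   # orthogonal
    'REG_D': [-1.0, 0.0],  # opposite direction, dropped by top_k
}
pruned = prune_edges(violation, regulations)
for edge in pruned:
    print(edge)
```

Because the weaker edges store `score * 0.5`, the value that `inspect_pruning_results` prints as `score` for an `IS_RELEVANT_TO` edge is half of the raw cosine similarity.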

In [6]:
if __name__ == "__main__":
    main()

Loading BERT model: sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2...
Loaded KG with 2073 nodes.
Building initial graph...
Executing Action B: Generating Node Embeddings...


Batches:   0%|          | 0/65 [00:00<?, ?it/s]

Executing Action C: Semantic Pruning (Threshold=0.7, Top-K=3)...
 -> Found 45409 VIOLATES_LAW edges to analyze.


Pruning Edges: 100%|██████████| 559/559 [00:00<00:00, 2141.75it/s]


 -> Applying pruning updates to graph...
 -> Removed 45409 noisy edges.
 -> Added 1666 semantic edges (VIOLATES_SPECIFICALLY / IS_RELEVANT_TO).
Executing Action D: Injecting Hierarchy...
Executing Action A: Re-calculating Global Edge Weights...
 -> Updated Class Weights: {'INVOLVES_OBJECT': 15.787037037037036, 'OCCURS_IN': 15.787037037037036, 'HAS_INCIDENT_TYPE': 15.787037037037036, 'HAS_CAUSE': 14.666666666666666, 'ENABLED_BY': 3410.0, 'LEADS_TO': 3.198874296435272, 'IS_RELEVANT_TO': 4.267834793491865, 'IS_SUBCLASS_OF': 7.957992998833139, 'PART_OF': 18.633879781420767, 'VIOLATES_SPECIFICALLY': 100.29411764705883, 'IS_SIMILAR_TO': 189.44444444444446}
Executing Action: Preparing Hard Negative Candidates...
 -> Generated hard negative candidates for 438 regulations.
Sample 1:
  [事故/違規]: 勞工未採上鎖或設置標示防止他人操作粉碎機...
  --[IS_RELEVANT_TO (score: 0.3150)]-->
  [對應法規]: 職業安全衛生設施規則 第57條第1項...
--------------------------------------------------
Sample 2:
  [事故/違規]: 雇主未指定管理人員執行衝床之傳動系統及耐壓管連接處之檢查...
  --
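The class weights printed above follow the inverse-frequency rule in `action_a_compute_edge_weights`: each relation's weight is the total number of edges divided by the number of edges with that relation. A minimal sketch on a made-up edge list (the counts are illustrative, not those of the real graph; `inverse_frequency_weights` is a hypothetical helper name):

```python
from collections import Counter

def inverse_frequency_weights(relations):
    """Return {relation: total_edges / edges_with_that_relation}."""
    counts = Counter(relations)
    total = len(relations)
    return {rel: total / c for rel, c in counts.items()}

# Toy edge list: 6 LEADS_TO, 3 IS_RELEVANT_TO, 1 VIOLATES_SPECIFICALLY
toy_relations = ['LEADS_TO'] * 6 + ['IS_RELEVANT_TO'] * 3 + ['VIOLATES_SPECIFICALLY']
weights = inverse_frequency_weights(toy_relations)
print(weights)
```

Rare relations receive large weights, which is why `ENABLED_BY` carries the largest value in the output above.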

In [7]:
# get "processed_oskg_data.pt"
    # This file is the input used to build the JSON of paired "occupational accident scenarios" and "related causes and legal reasoning".



---



In [37]:
# [B] ground truth construction
# manually upload
    # osh_doc_merged.json

In [38]:
def load_data(file_path):
    """Load the raw JSON file."""
    if not os.path.exists(file_path):
        print(f"錯誤：找不到檔案 {file_path}")
        return []

    with open(file_path, 'r', encoding='utf-8') as f:
        try:
            data = json.load(f)
            return data
        except json.JSONDecodeError:
            print("錯誤：JSON 格式解析失敗，請檢查原始檔案格式。")
            return []

In [39]:
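def normalize_law_text(text):
    """
    Normalize citation text:
    1. Remove whitespace between the number and the surrounding characters (e.g. "第 19 條" -> "第19條").
    2. Unifying full-width and half-width brackets is not implemented yet; only whitespace is handled.
    """
    if not text:
        return ""
    # Remove whitespace between "第" and the number, and between the number and "條/項/款"
    # Pattern: '第', optional whitespace, digits, optional whitespace, '條'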
    normalized = re.sub(r'第\s*(\d+)\s*條', r'第\1條', text)
    normalized = re.sub(r'第\s*(\d+)\s*項', r'第\1項', normalized)
    normalized = re.sub(r'第\s*(\d+)\s*款', r'第\1款', normalized)
    return normalized
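A short usage example may help show what the normalization does. The sketch below restates the function in a compact loop form (same three substitutions) so that it runs on its own; the sample string is made up for illustration.

```python
import re

def normalize_law_text(text):
    """Remove whitespace inside the 第 N 條 / 項 / 款 markers."""
    if not text:
        return ""
    for unit in ('條', '項', '款'):
        text = re.sub(rf'第\s*(\d+)\s*{unit}', rf'第\1{unit}', text)
    return text

sample = "職業安全衛生法第 6 條第 1 項第 5 款"
print(normalize_law_text(sample))  # 職業安全衛生法第6條第1項第5款
```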

In [40]:
def extract_legal_citations(text):
    """
    Extract concrete legal citations from a string.
    Logic:
    1. Split on commas ',' to separate the blocks of preventive measures.
    2. Within each block, use a regex to capture 'law name' + '第X條' (article X).

    Returns: a sorted list of the unique article-level citations.
    """
    if not text:
        return []

    text = normalize_law_text(text)

    # Collected citations (a set avoids duplicates)
    citations = set()

    # In this dataset, groups of regulations are usually separated by commas, e.g.
    # "營造...第19條第1項暨職安法第6條第1項, 職安教育規則...第16條..."
    segments = text.split(',')

    # Regex: (law name)(article number).
    # [\u4e00-\u9fa5]+? lazily matches the Chinese law name into the named group `law`.
    # An optional 第N項 / 第N款 suffix is consumed without being captured, so that it does
    # not leak into the next law name (e.g. "項暨職業安全衛生法").
    law_pattern = re.compile(r'(?P<law>[\u4e00-\u9fa5]+?)\s*第(?P<article>\d+)條(?:第\d+項)?(?:第\d+款)?')

    for segment in segments:
        # A segment may hold several citations ("law A ... 暨 law B ...")
        for match in law_pattern.finditer(segment):
            law_name = match.group('law')
            article_num = match.group('article')

            # Strip a leading connector "暨" (e.g. "暨職業安全衛生法")
            if law_name.startswith('暨'):
                law_name = law_name[1:]

            # Discard spurious matches: a law name is expected to have at least 2 characters
            if len(law_name) >= 2:
                citations.add(f"{law_name}第{article_num}條")

    return sorted(citations)

In [41]:
def process_osh_data(input_file, output_file):
    """Main processing flow."""
    data = load_data(input_file)

    if not data:
        print("警告：讀取到的資料為空，無法進行處理。請確認 'osh_doc_merged.json' 是否已上傳至正確路徑。")
        return

    processed_data = []

    print(f"開始處理 {len(data)} 筆案件資料...")

    for idx, item in enumerate(data):
        # Extract the required fields
        description = item.get('description', '')
        raw_regulations = item.get('preventive_regulations', '')

        # Extract the gold standard (ground truth)
        gold_laws = extract_legal_citations(raw_regulations)

        # Keep the cause summary as an optional auxiliary input for the LLM experiments
        cause_summary = item.get('cause_summary', '')

        # Build the trimmed record
        processed_item = {
            "id": idx,  # running index, for traceability
            "original_incident_type": item.get('incident_type', ''),
            "description": description,  # input
            "cause_summary": cause_summary,  # auxiliary input or validation
            "ground_truth_laws": gold_laws,  # target labels
            "raw_regulation_text": raw_regulations  # original text, kept for reference
        }
        processed_data.append(processed_item)

    # Write the output file
    with open(output_file, 'w', encoding='utf-8') as f:
        json.dump(processed_data, f, ensure_ascii=False, indent=4)

    print(f"處理完成！已輸出至 {output_file}")

    if processed_data:
        print(f"範例資料 (第一筆):")
        print(json.dumps(processed_data[0], ensure_ascii=False, indent=2))
    else:
        print("注意：處理後的資料列表為空。")

In [42]:
# ==========================================
# Run
# ==========================================

# The file names below are assumed; make sure the input file is in the same directory
input_filename = 'osh_doc_merged.json'
output_filename = 'osh_legal_ground_truth.json'

process_osh_data(input_filename, output_filename)

開始處理 482 筆案件資料...
處理完成！已輸出至 osh_legal_ground_truth.json
範例資料 (第一筆):
{
  "id": 0,
  "original_incident_type": "墜落, 滾落",
  "description": "於104 年8 月17 日徐○○、羅○○、陳○○及何○○4 人在○○股份有限公司廠\n房東北側屋頂從事塑膠採光浪板更換作業，當時第1 片採光浪板已完成更換及鋪設，\n於當日17 時15 分許，罹災者於該單位廠房東北側屋頂拆除第2 片原有的塑膠採光浪\n板之固定用螺絲釘時，不慎踏穿塑膠採光浪自高度約7.35 公尺之開口墜落至地面，因\n頭部外傷致顱腦損傷，送至衛生福利部豐原醫院急救無效，於104 年8 月17 日18 時\n30 分宣告不治死亡。",
  "cause_summary": "勞工未架設安全通道與踏板、勞工未使用安全帽、雇主未規劃安全通道與防墜設施、雇主未辦理勞工安全衛生教育訓練、雇主未訂定安全衛生工作守則、雇主未置丙種職業安全衛生業務主管、雇主未指派屋頂作業主管指揮或監督",
  "ground_truth_laws": [],
  "raw_regulation_text": "抱歉，我無法從您提供的內容中提取任何法規條文名稱。請提供包含法規條文名稱的文本。"
}


In [None]:
# The version below also captures "項" (paragraph)

In [43]:
import json
import re
import os

def load_data(file_path):
    if not os.path.exists(file_path):
        print(f"錯誤：找不到檔案 {file_path}")
        return []
    with open(file_path, 'r', encoding='utf-8') as f:
        try:
            return json.load(f)
        except json.JSONDecodeError:
            print("錯誤：JSON 格式解析失敗。")
            return []

In [44]:
def normalize_law_text(text):
    if not text:
        return ""
    # Remove whitespace between "第" and the number, and between the number and "條/項/款"
    normalized = re.sub(r'第\s*(\d+)\s*條', r'第\1條', text)
    normalized = re.sub(r'第\s*(\d+)\s*項', r'第\1項', normalized)
    normalized = re.sub(r'第\s*(\d+)\s*款', r'第\1款', normalized)
    return normalized

In [45]:
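def extract_citations_hierarchical(text):
    """
    Extract citations at both the article level and the article + paragraph (項) level.
    """
    if not text:
        return [], []

    text = normalize_law_text(text)

    citations_article_only = set()   # article level only
    citations_detailed = set()       # down to the paragraph (項)

    segments = text.split(',')

    # Regex: law name + 第X條 + optional 第Y項. A trailing 第Z款 is consumed without being
    # captured so that it cannot leak into the next law name (e.g. "款暨...").
    law_pattern = re.compile(r'(?P<law>[\u4e00-\u9fa5]+?)\s*第(?P<article>\d+)條(?:第(?P<paragraph>\d+)項)?(?:第\d+款)?')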

    for segment in segments:
        matches = law_pattern.finditer(segment)
        for match in matches:
            law_name = match.group('law')
            article_num = match.group('article')
            paragraph_num = match.group('paragraph')

            if len(law_name) >= 2:
                if law_name.startswith('暨'):
                    law_name = law_name[1:]

                # 1. Coarse Level
                article_citation = f"{law_name}第{article_num}條"
                citations_article_only.add(article_citation)

                # 2. Detailed Level
                if paragraph_num:
                    detailed_citation = f"{law_name}第{article_num}條第{paragraph_num}項"
                else:
                    detailed_citation = f"{law_name}第{article_num}條"

                citations_detailed.add(detailed_citation)

    return sorted(list(citations_article_only)), sorted(list(citations_detailed))
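To make the two label granularities concrete, here is a minimal self-contained sketch of the same idea applied to one string. `extract_coarse_and_fine` is a hypothetical condensed helper, and the sample citation string is illustrative rather than copied from the dataset.

```python
import re

LAW_PATTERN = re.compile(
    r'(?P<law>[\u4e00-\u9fa5]+?)\s*第(?P<article>\d+)條(?:第(?P<paragraph>\d+)項)?'
)

def extract_coarse_and_fine(text):
    """Return (article-level citations, article+paragraph-level citations)."""
    coarse, fine = set(), set()
    for m in LAW_PATTERN.finditer(text):
        law = m.group('law')
        if law.startswith('暨'):  # drop the connector between two citations
            law = law[1:]
        if len(law) < 2:
            continue
        article = f"{law}第{m.group('article')}條"
        coarse.add(article)
        paragraph = m.group('paragraph')
        fine.add(f"{article}第{paragraph}項" if paragraph else article)
    return sorted(coarse), sorted(fine)

sample = "營造安全衛生設施標準第19條第1項暨職業安全衛生法第6條第1項, 職業安全衛生教育訓練規則第16條"
coarse, fine = extract_coarse_and_fine(sample)
print(coarse)
print(fine)
```

A citation without a paragraph number appears unchanged in both lists, so the fine list always has at least as many entries as the coarse one.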

In [46]:
def process_osh_data_filtered(input_file, output_file):
    data = load_data(input_file)
    processed_data = []

    removed_count = 0  # number of records dropped

    print(f"原始資料共 {len(data)} 筆，開始處理與過濾...")

    for idx, item in enumerate(data):
        raw_regulations = item.get('preventive_regulations', '')

        # Extract citations
        c_article, c_detailed = extract_citations_hierarchical(raw_regulations)

        # --- Key fix: filtering ---
        # An empty coarse list means no valid citation was extracted, so skip the record
        if not c_article:
            removed_count += 1
            continue

        # Prepare the fields
        description_text = item.get('description', '')
        cause_summary_text = item.get('cause_summary', '')

        processed_item = {
            "id": item.get('id', idx),  # keep the original id if present, otherwise use the index
            "original_incident_type": item.get('incident_type', ''),
            "description": description_text,
            "cause_summary": cause_summary_text,
            "ground_truth_coarse": c_article,
            "ground_truth_fine": c_detailed,
            "raw_regulation_text": raw_regulations
        }
        processed_data.append(processed_item)

    # Write the output
    with open(output_file, 'w', encoding='utf-8') as f:
        json.dump(processed_data, f, ensure_ascii=False, indent=4)

    print("-" * 30)
    print(f"處理完成！")
    print(f"原始筆數: {len(data)}")
    print(f"移除無效筆數: {removed_count}")
    print(f"最終有效筆數: {len(processed_data)}")
    print(f"已輸出至 {output_file}")

In [47]:
# ==========================================
# Run
# ==========================================
input_filename = 'osh_doc_merged.json'
output_filename = 'osh_legal_ground_truth_cleaned.json'

# Run this function
process_osh_data_filtered(input_filename, output_filename)

原始資料共 482 筆，開始處理與過濾...
------------------------------
處理完成！
原始筆數: 482
移除無效筆數: 101
最終有效筆數: 381
已輸出至 osh_legal_ground_truth_cleaned.json
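After filtering, 381 of the 482 records (about 79%) remain. One simple sanity check on the cleaned file is that every fine-grained citation extends one of the coarse citations of the same record. The sketch below assumes the output schema shown in `process_osh_data_filtered`; `fine_matches_coarse` is a hypothetical helper and the two records are hand-written examples.

```python
def fine_matches_coarse(record):
    """Return True if every fine citation starts with one of the coarse citations."""
    coarse = record['ground_truth_coarse']
    return all(
        any(fine.startswith(article) for article in coarse)
        for fine in record['ground_truth_fine']
    )

# Hand-written records in the output schema (illustrative values, not copied from the dataset)
consistent = {
    'ground_truth_coarse': ['職業安全衛生法第6條'],
    'ground_truth_fine': ['職業安全衛生法第6條第1項'],
}
inconsistent = {
    'ground_truth_coarse': ['職業安全衛生法第6條'],
    'ground_truth_fine': ['職業安全衛生法第32條第1項'],
}

print(fine_matches_coarse(consistent))
print(fine_matches_coarse(inconsistent))
```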




---



In [49]:
# [C]
# related files
    # already load knowledge_graph_final.json in [A]
    # get "osh_legal_ground_truth_cleaned.json" in [B]
    # manually upload boxe_validation_set_clean.json

In [50]:
## CONFIGURATION ##
# Load KG
with open('knowledge_graph_final.json', 'r', encoding='utf-8') as f:
    kg = json.load(f)

# Load GT
with open('osh_legal_ground_truth_cleaned.json', 'r', encoding='utf-8') as f:
    gt = json.load(f)

In [51]:
# Inspect KG nodes
print("KG Nodes sample:")
print(kg['nodes'][:2])

# Inspect GT sample
print("GT sample:")
print(gt[:1])

# Check if we can find a link between GT description and KG nodes
kg_incident_nodes = [n for n in kg['nodes'] if n.get('node_type') == 'Incident' or 'INC_' in n['id']]
print(f"Number of Incident nodes in KG: {len(kg_incident_nodes)}")
if kg_incident_nodes:
    print("Sample Incident Node:", kg_incident_nodes[0])

# Check Law nodes
kg_law_nodes = [n for n in kg['nodes'] if n.get('node_type') == 'Regulation' or 'REG_' in n['id']]
print(f"Number of Law nodes in KG: {len(kg_law_nodes)}")
if kg_law_nodes:
    print("Sample Law Node:", kg_law_nodes[0])

KG Nodes sample:
[{'id': 'CAUSE_BASIC_7678f5703ca4_ATOMIC_0', 'label': '基本原因 基本原因 基本原因：', 'node_type': 'Cause_Basic_Atomic', 'parent_id': 'CAUSE_BASIC_7678f5703ca4', 'atomic_index': 0, 'embedding_text': '基本原因 基本原因 基本原因：', 'is_atomized': True, 'original_full_text': '基本原因 基本原因 基本原因： (1)未對勞工施以從事各該工作必要之一般安全衛生教育訓練。 (2)未訂定安全衛生工作守則。 (3)未置丙種職業安全衛生業務主管。 (4)勞工於屋頂從事作業未指派屋頂作業主管指揮或監督。', 'split_method': 'regex'}, {'id': 'CAUSE_BASIC_7678f5703ca4_ATOMIC_1', 'label': '未對勞工施以從事各該工作必要之一般安全衛生教育訓練。', 'node_type': 'Cause_Basic_Atomic', 'parent_id': 'CAUSE_BASIC_7678f5703ca4', 'atomic_index': 1, 'embedding_text': '未對勞工施以從事各該工作必要之一般安全衛生教育訓練。', 'is_atomized': True, 'original_full_text': '基本原因 基本原因 基本原因： (1)未對勞工施以從事各該工作必要之一般安全衛生教育訓練。 (2)未訂定安全衛生工作守則。 (3)未置丙種職業安全衛生業務主管。 (4)勞工於屋頂從事作業未指派屋頂作業主管指揮或監督。', 'split_method': 'regex'}]
GT sample:
[{'id': 1, 'original_incident_type': '墜落, 滾落', 'description': '104 年9 月3 日約10 時許，罹災者賴○昌與彭○德、許○福、鄭○龍等4\n人於大肚區遊園路○段○巷○弄○號對面之屋頂進行頂棚違建拆除作業，\n約自10 時20 分許，罹災者賴○昌為收拾乙炔焊接工具，故由靠遊園路○\n段○巷○弄

In [52]:
import difflib
# Prepare KG Incident texts
kg_incidents = []
for n in kg['nodes']:
    if n['node_type'] == 'Incident':
        # Use full_text or label
        text = n.get('full_text', n.get('label', ''))
        kg_incidents.append({'id': n['id'], 'text': text})

# Prepare GT texts
gt_incidents = []
for item in gt:
    gt_incidents.append({'id': item['id'], 'text': item['description']})

print(f"Total KG Incidents: {len(kg_incidents)}")
print(f"Total GT Incidents: {len(gt_incidents)}")

# Try to match a few
matched = 0
mapping = {} # gt_id -> kg_id

for g in gt_incidents[:10]: # Test first 10
    best_ratio = 0
    best_id = None
    g_text = g['text']

    # Strategy: accept a KG incident only if its (summary) text appears verbatim
    # inside the GT description.
    for k in kg_incidents:
        if k['text'] and k['text'] in g_text:
            best_ratio = 1.0
            best_id = k['id']
            break  # exact substring match found

        # SequenceMatcher ratio, computed for reference only: it is not used to pick a match,
        # and it tends to be low when the two texts differ a lot in length.
        s = difflib.SequenceMatcher(None, k['text'], g_text)
        ratio = s.ratio()

    if best_id:
        mapping[g['id']] = best_id
        matched += 1
        print(f"GT {g['id']} matched to {best_id}")
    else:
        # Try finding one with high word overlap?
        pass

print(f"Matched {matched}/10 in sample.")

Total KG Incidents: 431
Total GT Incidents: 381
Matched 0/10 in sample.
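The substring test finds nothing because the KG incident texts are short paraphrased summaries, while the GT descriptions are raw report text that still carries extraction whitespace (for example "104 年9 月3 日"). The sketch below uses two snippets taken from the outputs in this notebook to show that removing whitespace cleans the GT text but cannot turn a paraphrase into a substring; `strip_whitespace` is a hypothetical helper.

```python
import difflib
import re

def strip_whitespace(text):
    """Drop all whitespace, including line breaks left by text extraction."""
    return re.sub(r'\s+', '', text)

gt_snippet = "104 年9 月3 日約10 時許，罹災者賴○昌與彭○德"
kg_summary = "賴○昌於屋頂拆除作業時因安全措施不足"

clean_gt = strip_whitespace(gt_snippet)
print(clean_gt)
print(kg_summary in clean_gt)  # the summary is a paraphrase, so containment still fails

# A similarity score is more informative than containment for paraphrased text
ratio = difflib.SequenceMatcher(None, kg_summary, clean_gt).ratio()
print(f"{ratio:.3f}")
```

This is why the next cell switches to a similarity-based match instead of exact containment.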


In [53]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

# Prepare corpus
kg_texts = [i['text'] for i in kg_incidents]
kg_ids = [i['id'] for i in kg_incidents]
gt_texts = [i['text'] for i in gt_incidents]
gt_ids = [i['id'] for i in gt_incidents]

# Vectorize (Character level might be better for Chinese if no tokenizer)
vectorizer = TfidfVectorizer(analyzer='char', ngram_range=(2, 4)) # Use char n-grams
kg_vecs = vectorizer.fit_transform(kg_texts)
gt_vecs = vectorizer.transform(gt_texts)

# Similarity
similarity_matrix = cosine_similarity(gt_vecs, kg_vecs)

# Find best matches
matches = []
match_scores = []
gt_to_kg_map = {}

for i in range(len(gt_texts)):
    best_idx = np.argmax(similarity_matrix[i])
    score = similarity_matrix[i, best_idx]
    gt_id = gt_ids[i]
    kg_id = kg_ids[best_idx]

    matches.append((gt_id, kg_id, score))
    gt_to_kg_map[gt_id] = kg_id
    if i < 5:
        print(f"GT {gt_id} matches KG {kg_id} with score {score:.3f}")
        print(f"GT: {gt_texts[i][:50]}...")
        print(f"KG: {kg_texts[best_idx]}")
        print("-" * 20)

GT 1 matches KG INC_86e34d008e50 with score 0.214
GT: 104 年9 月3 日約10 時許，罹災者賴○昌與彭○德、許○福、鄭○龍等4
人於大肚區遊園路○段○...
KG: 賴○昌於屋頂拆除作業時因安全措施不足，踏穿採光罩從22公尺高墜落，送醫不治身亡。
--------------------
GT 3 matches KG INC_292ec42abe68 with score 0.496
GT: 於104 年6 月8 日1 時許，盧○○及林○○從事機械基本維護及異常
排除等作業時，發現抽水泵及馬...
KG: 盧○○在維護作業中被抽水泵及馬達皮帶輪碎片擊中右大腿，送醫不治身亡。
--------------------
GT 6 matches KG INC_08b698fad063 with score 0.613
GT: 勞工李○○104年4月30日下午2時駕駛公務用貨車從事送貨作業，
行經臺中市大甲區日南里中山路二段9...
KG: 勞工李○○於104年4月30日下午2時駕駛公務用貨車發生事故。
--------------------
GT 7 matches KG INC_4a432d550377 with score 0.498
GT: 於104 年10 月20 日14 時30 分許，罹災者傅○欽開預拌混凝土車，載送
4 立方公尺之混凝...
KG: 傅○欽駕駛預拌混凝土車於裡冷林道因土石鬆軟翻覆墜落150公尺深谷，車重超過額定載重。
--------------------
GT 8 matches KG INC_bed95b76352d with score 0.579
GT: 依據目擊者林泳龍稱述：「罹災者邵○魯於104年10月30日15時50分許
運送混凝土至臺中市太○區山...
KG: 邵○魯於104年10月30日運送混凝土時，車輛倒退滑行15公尺後墜入山谷。
--------------------
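The loop above maps every GT record to its best-scoring KG incident, however low the score. A possible follow-up, sketched here under the assumption of a hand-picked cut-off (`MIN_SCORE` is hypothetical and would need tuning), is to send low-scoring pairs to manual review and to flag KG incidents claimed by more than one GT record. The triples are the five matches printed above.

```python
from collections import defaultdict

# (gt_id, kg_id, score) triples copied from the first five matches printed above
matches = [
    (1, 'INC_86e34d008e50', 0.214),
    (3, 'INC_292ec42abe68', 0.496),
    (6, 'INC_08b698fad063', 0.613),
    (7, 'INC_4a432d550377', 0.498),
    (8, 'INC_bed95b76352d', 0.579),
]

MIN_SCORE = 0.3  # assumed cut-off, not a value from the original pipeline

accepted = {gt_id: kg_id for gt_id, kg_id, score in matches if score >= MIN_SCORE}
needs_review = [gt_id for gt_id, _, score in matches if score < MIN_SCORE]

# Flag KG incidents that more than one GT record maps to
claimed = defaultdict(list)
for gt_id, kg_id in accepted.items():
    claimed[kg_id].append(gt_id)
collisions = {kg_id: ids for kg_id, ids in claimed.items() if len(ids) > 1}

print(accepted)
print(needs_review)
print(collisions)
```

A low score does not necessarily mean a wrong pair: GT 1 scores only 0.214, yet both texts name the same victim, so such pairs are better reviewed than discarded.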


In [54]:
# Check graph structure
import networkx as nx

G = nx.DiGraph()
for node in kg['nodes']:
    G.add_node(node['id'], label=node.get('label', ''), type=node.get('node_type'))

for link in kg['links']:
    G.add_edge(link['source'], link['target'], relation=link['relation'])

# Pick a sample Incident Node that we matched
sample_incident_id = 'INC_86e34d008e50' # Matches GT 1
if sample_incident_id in G:
    print(f"Neighbors of {sample_incident_id}:")
    for n in G.successors(sample_incident_id):
        print(f" -> {G.nodes[n]['type']} : {G.nodes[n]['label']} (Relation: {G[sample_incident_id][n]['relation']})")

    # Check if we can reach Regulation
    # BFS to find Regulations
    print("\nReachable Regulations (within 3 hops):")
    paths = nx.single_source_shortest_path(G, sample_incident_id, cutoff=3)
    found_regs = []
    for target, path in paths.items():
        if G.nodes[target]['type'] == 'Regulation' or 'REG_' in target:
             found_regs.append((target, path))

    print(f"Found {len(found_regs)} regulations.")
    if found_regs:
        print(f"Sample Path: {found_regs[0]}")
else:
    print("Sample Incident ID not in graph?")

Neighbors of INC_86e34d008e50:
 -> Medium_Specific : 屋頂, 屋架, 樑 (Relation: INVOLVES_OBJECT)
 -> Industry : 建築工程業（4100） (Relation: OCCURS_IN)
 -> IncidentType : 墜落, 滾落 (Relation: HAS_INCIDENT_TYPE)

Reachable Regulations (within 3 hops):
Found 0 regulations.
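The forward search finds no regulation because the incident's outgoing edges only lead to object, industry, and incident-type nodes, while the cause node points to the incident (`CAUSE_DIRECT_46b946692708 -> INC_86e34d008e50`). Reaching a regulation therefore requires walking at least one edge backwards. The sketch below shows a breadth-first search that ignores edge direction on a toy edge list; `VIO_1` and `REG_1` are hypothetical ids, and `build_undirected` and `paths_within` are hypothetical helpers.

```python
from collections import defaultdict, deque

# Toy edges mirroring the directions seen in the KG
edges = [
    ('CAUSE_DIRECT_46b946692708', 'INC_86e34d008e50', 'HAS_CAUSE'),
    ('CAUSE_DIRECT_46b946692708', 'VIO_1', 'LEADS_TO'),
    ('VIO_1', 'REG_1', 'VIOLATES_LAW'),
]

def build_undirected(edge_list):
    """Adjacency map that treats every edge as traversable in both directions."""
    adj = defaultdict(set)
    for src, dst, _rel in edge_list:
        adj[src].add(dst)
        adj[dst].add(src)
    return adj

def paths_within(adj, start, max_hops):
    """BFS returning {node: shortest path from start} for nodes within max_hops."""
    paths = {start: [start]}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if len(paths[node]) - 1 >= max_hops:
            continue  # do not expand beyond the hop limit
        for nxt in sorted(adj[node]):
            if nxt not in paths:
                paths[nxt] = paths[node] + [nxt]
                queue.append(nxt)
    return paths

adj = build_undirected(edges)
reachable = paths_within(adj, 'INC_86e34d008e50', max_hops=3)
reg_paths = {n: p for n, p in reachable.items() if n.startswith('REG_')}
print(reg_paths)
```

With NetworkX, a similar effect can be obtained by calling `G.to_undirected()` before `nx.single_source_shortest_path`.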


In [55]:
# Sample Incident: INC_86e34d008e50
# Incoming Edge: CAUSE_DIRECT_46b946692708 -> INC_86e34d008e50
cause_id = 'CAUSE_DIRECT_46b946692708'

print(f"Neighbors of Cause {cause_id}:")
# Outgoing from Cause
for n in G.successors(cause_id):
    print(f" -> {G.nodes[n]['type']} : {G.nodes[n]['label']} (Relation: {G[cause_id][n]['relation']})")

# Incoming to Cause
print(f"Predecessors of Cause {cause_id}:")
for n in G.predecessors(cause_id):
    print(f" <- {G.nodes[n]['type']} : {G.nodes[n]['label']} (Relation: {G[n][cause_id]['relation']})")

Neighbors of Cause CAUSE_DIRECT_46b946692708:
 -> Incident : 賴○昌於屋頂拆除作業時因安全措施不足，踏穿採光罩從22公尺高墜落，送醫不治身亡。 (Relation: HAS_CAUSE)
 -> Violation : 勞工未架設適當強度且寬度30公分以上之踏板 (Relation: LEADS_TO)
 -> Violation : 勞工未使用堅固格柵或安全網 (Relation: LEADS_TO)
 -> Violation : 雇主未訂定安全衛生工作守則 (Relation: LEADS_TO)
 -> Violation : 雇主未辦理勞工安全衛生教育訓練 (Relation: LEADS_TO)
 -> Violation : 雇主未指派屋頂作業主管指揮或監督 (Relation: LEADS_TO)
 -> Incident : 勞工張○○於屋頂補漏水時未使用安全設備，踩穿採光罩墜落，送醫不治。 (Relation: HAS_CAUSE)
 -> Violation : 雇主未規劃防墜設施如堅固格柵或安全網 (Relation: LEADS_TO)
 -> Violation : 雇主未指定專人指揮或監督作業 (Relation: LEADS_TO)
 -> Violation : 雇主未訂定職業安全衛生管理計畫 (Relation: LEADS_TO)
 -> Violation : 勞工未架設安全通道與踏板 (Relation: LEADS_TO)
 -> Violation : 勞工未使用安全帽 (Relation: LEADS_TO)
 -> Violation : 雇主未定訂自動檢查計畫 (Relation: LEADS_TO)
 -> Incident : 104年6月29日，范罹災者更換屋頂浪板時踏穿採光板墜落12公尺，送醫不治。 (Relation: HAS_CAUSE)
 -> Violation : 雇主未於屋架上設置適當強度且寬度在30公分以上之踏板 (Relation: LEADS_TO)
 -> Violation : 雇主未於屋架下方適當範圍裝設堅固格柵或安全網等防墜設施 (Relation: LEADS_TO)
 -> Violation : 雇主未置職業安全衛生

In [56]:
# Check Cause Label
cause_node = [n for n in kg['nodes'] if n['id'] == 'CAUSE_DIRECT_46b946692708'][0]
print("Cause Label:", cause_node.get('label'))
print("Cause Full Text:", cause_node.get('original_full_text', cause_node.get('embedding_text')))

# Check Violation Labels for this cause
violations = [n for n in G.successors(cause_node['id']) if G.nodes[n]['type'] == 'Violation']
print("\nViolations:")
for v in violations:
    print(f"- {G.nodes[v]['label']}")

Cause Label: (一)直接原因：罹災者不慎踏穿採光罩自高度約22 公尺之開口墜落至地面， 造成周身挫傷併多發性骨折、器官損傷致休克不治死亡。 (二)間接原因： 不安全狀況： 於塑膠材料構築之屋頂作業時，未事先規劃安全通道，於屋架頂 棚上設置適當強度且寬度30 公分以上之踏板，於屋架下方亦未裝設 堅固格柵或安全網。 (三)基本原因： 1、未指派屋頂作業主管於現場辦理指揮、監督等工作。 2、未使勞工接受一般安全衛生教育訓練。 3、未訂定安全衛生工作守則。
Cause Full Text: (一)直接原因：罹災者不慎踏穿採光罩自高度約22 公尺之開口墜落至地面， 造成周身挫傷併多發性骨折、器官損傷致休克不治死亡。 (二)間接原因： 不安全狀況： 於塑膠材料構築之屋頂作業時，未事先規劃安全通道，於屋架頂 棚上設置適當強度且寬度30 公分以上之踏板，於屋架下方亦未裝設 堅固格柵或安全網。 (三)基本原因： 1、未指派屋頂作業主管於現場辦理指揮、監督等工作。 2、未使勞工接受一般安全衛生教育訓練。 3、未訂定安全衛生工作守則。

Violations:
- 勞工未架設適當強度且寬度30公分以上之踏板
- 勞工未使用堅固格柵或安全網
- 雇主未訂定安全衛生工作守則
- 雇主未辦理勞工安全衛生教育訓練
- 雇主未指派屋頂作業主管指揮或監督
- 雇主未規劃防墜設施如堅固格柵或安全網
- 雇主未指定專人指揮或監督作業
- 雇主未訂定職業安全衛生管理計畫
- 勞工未架設安全通道與踏板
- 勞工未使用安全帽
- 雇主未定訂自動檢查計畫
- 雇主未於屋架上設置適當強度且寬度在30公分以上之踏板
- 雇主未於屋架下方適當範圍裝設堅固格柵或安全網等防墜設施
- 雇主未置職業安全衛生人員
- 雇主未規劃安全通道與防墜設施
- 雇主未實施吊掛作業危害之辨識
- 評估及控制措施
- 雇主未落實承攬管理事項
- 雇主未於設計或施工規劃階段實施風險評估




---



In [57]:
import json
import re
import difflib
import networkx as nx
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

In [59]:
# 1. Load Data
with open('knowledge_graph_final.json', 'r', encoding='utf-8') as f:
    kg_data = json.load(f)

with open('osh_legal_ground_truth_cleaned.json', 'r', encoding='utf-8') as f:
    gt_data = json.load(f)

# 2. Build Graph
G = nx.DiGraph()
for node in kg_data['nodes']:
    G.add_node(node['id'], **node)

for link in kg_data['links']:
    G.add_edge(link['source'], link['target'], relation=link['relation'])

# 3. Mappings

# 3a. Incident Mapping (GT ID -> KG Node ID)
print("Mapping Incidents...")
kg_incidents = []
for n in kg_data['nodes']:
    if n['node_type'] == 'Incident':
        text = n.get('full_text', n.get('label', ''))
        kg_incidents.append({'id': n['id'], 'text': text})

kg_texts = [i['text'] for i in kg_incidents]
kg_ids = [i['id'] for i in kg_incidents]

gt_texts = [item['description'] for item in gt_data]
gt_ids = [item['id'] for item in gt_data]

vectorizer = TfidfVectorizer(analyzer='char', ngram_range=(2, 4))
kg_vecs = vectorizer.fit_transform(kg_texts)
gt_vecs = vectorizer.transform(gt_texts)

similarity_matrix = cosine_similarity(gt_vecs, kg_vecs)

gt_to_kg_incident = {}
for i in range(len(gt_texts)):
    best_idx = np.argmax(similarity_matrix[i])
    score = similarity_matrix[i, best_idx]
    if score > 0.1: # Low threshold as texts might vary significantly
        gt_to_kg_incident[gt_ids[i]] = kg_ids[best_idx]

print(f"Mapped {len(gt_to_kg_incident)}/{len(gt_data)} incidents.")

# 3b. Law Mapping (Normalized String -> List of KG Node IDs)
print("Mapping Laws...")
def normalize_law(text):
    # Strip every non-word character (whitespace, punctuation);
    # letters, digits, CJK characters, and underscores are kept
    return re.sub(r'[^\w]', '', text)

kg_regulations = []
for n in kg_data['nodes']:
    if n['node_type'] == 'Regulation':
        label = n.get('label', '')
        norm = normalize_law(label)
        kg_regulations.append({'id': n['id'], 'norm': norm, 'label': label, 'full_text': n.get('full_text', '')})

# No lookup index is built here: matching is fuzzy (substring containment),
# so each lookup scans kg_regulations linearly.
# Grouping by law name would be faster, but brute force over ~400 nodes is fine.

# 4. Generate SFT Data
sft_data = []

for case in gt_data:
    gt_id = case['id']
    kg_incident_id = gt_to_kg_incident.get(gt_id)

    if not kg_incident_id:
        continue # Skip if no incident match

    incident_text = case['description'] # Use GT description as input
    gt_laws = case['ground_truth_coarse'] # List of strings

    # Reasoning Generation
    reasoning_parts = []
    reasoning_parts.append(f"事故原因分析：\n{incident_text}")
    reasoning_parts.append("\n法規適用性分析：")

    # Find reachable violations/regulations from Incident
    # Path: Incident <- Cause -> Violation -> Regulation
    # In Graph G:
    # Predecessors of Incident (Cause) -> Successors (Violation) -> Successors (Regulation)

    reachable_regs = {} # Reg_ID -> Path Info

    # Causes (Predecessors of Incident)
    causes = list(G.predecessors(kg_incident_id))

    for cause_id in causes:
        # Violations (Successors of Cause)
        violations = [n for n in G.successors(cause_id) if G.nodes[n].get('node_type') == 'Violation']

        for vio_id in violations:
            # Regulations (Successors of Violation)
            regs = [n for n in G.successors(vio_id) if G.nodes[n].get('node_type') == 'Regulation']

            for reg_id in regs:
                if reg_id not in reachable_regs:
                    reachable_regs[reg_id] = []
                reachable_regs[reg_id].append({
                    'cause': G.nodes[cause_id],
                    'violation': G.nodes[vio_id],
                    'regulation': G.nodes[reg_id]
                })

    # Match GT Laws to Reachable Regs
    used_laws = set()
    law_counter = 1

    for law_str in gt_laws:
        norm_gt = normalize_law(law_str)

        # Find matches in reachable_regs
        matches = []
        for reg_id, paths in reachable_regs.items():
            norm_kg = normalize_law(G.nodes[reg_id].get('label', ''))
            # Check containment; skip empty strings, since '' is a substring of every string
            if norm_gt and norm_kg and (norm_gt in norm_kg or norm_kg in norm_gt):
                matches.append((reg_id, paths))

        if matches:
            # Construct Graph-Based Reasoning
            # Use the first match/path for simplicity
            match_reg_id, paths = matches[0]
            path = paths[0]
            vio_label = path['violation'].get('label', '')
            reg_label = path['regulation'].get('label', '')

            reasoning_parts.append(f"\n{law_counter}. 違反法規：{reg_label}")
            reasoning_parts.append(f"   推論路徑：本案顯示「{vio_label}」。")
            reasoning_parts.append(f"   依據圖譜分析，此行為直接違反了 {reg_label}。")
            used_laws.add(law_str)
        else:
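            # Fallback: no graph path from this incident.
            # Look up the regulation by label among all KG regulation nodes,
            # even if it is not reachable; empty labels are skipped.
            best_node = None
            for r in kg_regulations:
                if norm_gt and r['norm'] and (norm_gt in r['norm'] or r['norm'] in norm_gt):
                    best_node = r
                    break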

            if best_node:
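                reasoning_parts.append(f"\n{law_counter}. 違反法規：{best_node['label']}")
                # Truncate long texts to 100 characters; add the ellipsis only when text was cut
                full_text = best_node['full_text'].replace('\n', '')
                content = full_text[:100] + ("..." if len(full_text) > 100 else "")
                reasoning_parts.append(f"   法規內容：{content}")
                reasoning_parts.append("   適用理由：本案事故情節顯示未符合此規定（圖譜中未建立直接路徑，基於法規內容推論）。")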
            else:
                reasoning_parts.append(f"\n{law_counter}. 違反法規：{law_str}")
                reasoning_parts.append("   (資料庫中未找到對應法規文本)")

        law_counter += 1

    final_output = "\n".join(reasoning_parts)

    sft_data.append({
        "instruction": "請分析此職安事故之法律責任，並解釋法規適用理由。",
        "input": incident_text,
        "output": final_output,
        "incident_id": f"GT_{gt_id}"
    })

print(f"Generated {len(sft_data)} SFT entries.")

# Write to file
with open('sft_training_data_final.jsonl', 'w', encoding='utf-8') as f:
    for entry in sft_data:
        json.dump(entry, f, ensure_ascii=False)
        f.write('\n')

print("File sft_training_data_final.jsonl saved.")

Mapping Incidents...
Mapped 369/381 incidents.
Mapping Laws...
Generated 369 SFT entries.
File sft_training_data_final.jsonl saved.
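
With a threshold of 0.1, nearly every ground-truth case gets mapped (369/381), so it can help to keep the similarity score next to each mapping and to check whether several ground-truth cases landed on the same KG incident. Here is a minimal sketch of that audit on a hypothetical similarity matrix. The helpers `map_with_scores` and `find_collisions` and the toy IDs are assumptions, not part of the notebook.

```python
from collections import defaultdict


def map_with_scores(similarity_rows, gt_ids, kg_ids, threshold=0.1):
    """Return {gt_id: (kg_id, score)} for best matches above `threshold`."""
    mapping = {}
    for gt_id, row in zip(gt_ids, similarity_rows):
        best_idx = max(range(len(row)), key=row.__getitem__)
        if row[best_idx] > threshold:
            mapping[gt_id] = (kg_ids[best_idx], row[best_idx])
    return mapping


def find_collisions(mapping):
    """Return {kg_id: [gt_id, ...]} for KG incidents claimed more than once."""
    claimed = defaultdict(list)
    for gt_id, (kg_id, _score) in mapping.items():
        claimed[kg_id].append(gt_id)
    return {kg_id: ids for kg_id, ids in claimed.items() if len(ids) > 1}


# Hypothetical similarity matrix: 3 ground-truth cases x 2 KG incidents.
toy_rows = [
    [0.82, 0.10],
    [0.15, 0.12],
    [0.05, 0.08],
]
toy_mapping = map_with_scores(toy_rows, ["gt_a", "gt_b", "gt_c"], ["INC_x", "INC_y"])
toy_collisions = find_collisions(toy_mapping)
print(toy_mapping)
print(toy_collisions)
```

In the toy data, `gt_b` is mapped to the same incident as `gt_a` with a score of only 0.15. A low-scoring collision like this is the kind of pair worth reviewing by hand before it reaches the SFT file.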


In [60]:
import random

boxe_data = []

# All Law IDs for negative sampling
all_law_ids = [r['id'] for r in kg_regulations]

for case in gt_data:
    gt_id = case['id']
    kg_incident_id = gt_to_kg_incident.get(gt_id)

    if not kg_incident_id:
        continue

    gt_laws = case['ground_truth_coarse']
    pos_law_ids = []

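    # Same containment matching as in the SFT step, but over all KG regulations
    for law_str in gt_laws:
        norm_gt = normalize_law(law_str)
        # Collect every matching KG node; skip empty strings, which would match everything
        for r in kg_regulations:
            if norm_gt and r['norm'] and (norm_gt in r['norm'] or r['norm'] in norm_gt):
                pos_law_ids.append(r['id'])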

    pos_law_ids = list(set(pos_law_ids))

    if not pos_law_ids:
        continue

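    # Negative Sampling: draw as many negatives as positives, without replacement.
    # Capping at the pool size avoids an endless loop when few non-positive laws remain.
    pos_set = set(pos_law_ids)
    neg_pool = [lid for lid in all_law_ids if lid not in pos_set]
    neg_law_ids = random.sample(neg_pool, min(len(pos_law_ids), len(neg_pool)))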

    boxe_data.append({
        "incident_id": kg_incident_id,
        "incident_text": case['description'],
        "positive_law_ids": pos_law_ids,
        "negative_law_ids": neg_law_ids,
        "ground_truth_text": gt_laws
    })

print(f"Generated {len(boxe_data)} BoxE validation entries.")

with open('boxe_validation_set_clean.json', 'w', encoding='utf-8') as f:
    json.dump(boxe_data, f, ensure_ascii=False, indent=2)

Generated 368 BoxE validation entries.
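
Substring containment treats `第12條` as a match for `第12條之1`, although these are separate articles, so the positive set can pick up extra regulation nodes. Below is a minimal sketch of a stricter comparison that parses the law name and article number and compares at article level. It assumes labels follow the `<law name> 第N條[之M]` pattern with Arabic numerals, as in the sample output. The names `ARTICLE_RE`, `parse_article`, and `same_article` are hypothetical.

```python
import re

# Assumed label shape: "<law name> 第N條[之M][第K項...]"; anything after the article is ignored.
ARTICLE_RE = re.compile(r"^(?P<law>.+?)第(?P<article>\d+)條(?:之(?P<sub>\d+))?")


def parse_article(text):
    """Return (law_name, article, sub_article) or None if no article is found."""
    compact = re.sub(r"\s+", "", text)
    match = ARTICLE_RE.match(compact)
    if not match:
        return None
    sub_article = int(match.group("sub") or 0)
    return match.group("law"), int(match.group("article")), sub_article


def same_article(a, b):
    """True when both strings name the same law and the same article."""
    key_a, key_b = parse_article(a), parse_article(b)
    return key_a is not None and key_a == key_b


print(parse_article("職業安全衛生法 第20條第1項"))
print(same_article("職業安全衛生管理辦法 第12條", "職業安全衛生管理辦法 第12條之1"))
print(same_article("職業安全衛生法 第20條", "職業安全衛生法 第20條第1項"))
```

Paragraph-level detail (`第1項`) is deliberately dropped, which matches the coarse granularity suggested by the `ground_truth_coarse` field. Strings that do not fit the pattern return `None` and would need the existing containment check as a fallback.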


In [61]:
# Peek at the SFT file
with open('sft_training_data_final.jsonl', 'r', encoding='utf-8') as f:
    line = f.readline()
    print(json.loads(line)['output'])

事故原因分析：
104 年9 月3 日約10 時許，罹災者賴○昌與彭○德、許○福、鄭○龍等4
人於大肚區遊園路○段○巷○弄○號對面之屋頂進行頂棚違建拆除作業，
約自10 時20 分許，罹災者賴○昌為收拾乙炔焊接工具，故由靠遊園路○
段○巷○弄北向往左數第2 塊頂棚南側第8 個採光罩旁往北方向第12 個
採光罩移動，因安全母索過短無法有效使用，便將安全帶脫鉤，且屋頂頂
棚上未設置適當強度且寬度30 公分以上之踏板，下方亦未裝設堅固格柵
或安全網，罹災者賴○昌不慎踏穿第12 個採光罩自高度約22 公尺墜落至
地面，因周身挫傷併多發性骨折、器官損傷致休克，經送醫院急救無效，
於104 年9 月4 日15 時3 分宣告不治死亡。

法規適用性分析：

1. 違反法規：勞工健康保護規則 第10條
   推論路徑：本案顯示「勞工未架設適當強度且寬度30公分以上之踏板」。
   依據圖譜分析，此行為直接違反了 勞工健康保護規則 第10條。

2. 違反法規：職業安全衛生教育訓練規則 第16條第1項
   推論路徑：本案顯示「勞工未架設適當強度且寬度30公分以上之踏板」。
   依據圖譜分析，此行為直接違反了 職業安全衛生教育訓練規則 第16條第1項。

3. 違反法規：職業安全衛生法 第20條第1項
   推論路徑：本案顯示「勞工未架設適當強度且寬度30公分以上之踏板」。
   依據圖譜分析，此行為直接違反了 職業安全衛生法 第20條第1項。

4. 違反法規：職業安全衛生法 第23條第1項
   推論路徑：本案顯示「勞工未架設適當強度且寬度30公分以上之踏板」。
   依據圖譜分析，此行為直接違反了 職業安全衛生法 第23條第1項。

5. 違反法規：職業安全衛生法 第32條第1項
   推論路徑：本案顯示「勞工未架設適當強度且寬度30公分以上之踏板」。
   依據圖譜分析，此行為直接違反了 職業安全衛生法 第32條第1項。

6. 違反法規：職業安全衛生法 第34條第1項
   推論路徑：本案顯示「勞工未架設適當強度且寬度30公分以上之踏板」。
   依據圖譜分析，此行為直接違反了 職業安全衛生法 第34條第1項。

7. 違反法規：職業安全衛生管理辦法 第12條之1
   推論路徑：本案顯示「勞工未架設適當強度且寬度30公分以上之踏板」。
   依據圖譜分析，此行為直接違反
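
In the output above, every regulation is explained by the same violation, because the generator always takes `paths[0]` for a matched regulation. One possible refinement, sketched here under stated assumptions, is to pick the path whose violation label overlaps most with the regulation text, using character-bigram Jaccard similarity. The helper names and the toy strings are hypothetical, and the regulation text is a placeholder rather than a statute quote.

```python
def char_bigrams(text):
    """Set of character bigrams, ignoring whitespace."""
    compact = "".join(text.split())
    return {compact[i:i + 2] for i in range(len(compact) - 1)}


def jaccard(a, b):
    """Jaccard similarity of the two strings' character-bigram sets."""
    set_a, set_b = char_bigrams(a), char_bigrams(b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def pick_best_path(paths, reg_text):
    """Choose the path whose violation label overlaps most with `reg_text`."""
    return max(paths, key=lambda p: jaccard(p["violation"].get("label", ""), reg_text))


# Hypothetical toy data; reg_text is a placeholder, not the actual regulation wording.
toy_paths = [
    {"violation": {"label": "勞工未架設適當強度且寬度30公分以上之踏板"}},
    {"violation": {"label": "雇主未辦理勞工安全衛生教育訓練"}},
]
toy_reg_text = "雇主對勞工應施以必要之安全衛生教育訓練"
best = pick_best_path(toy_paths, toy_reg_text)
print(best["violation"]["label"])
```

In the generator, `reg_text` could come from the regulation node's `full_text` field. Bigram overlap is only a cheap stand-in here, and an embedding-based score would be a natural replacement.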

In [62]:
# get sft_training_data_final.jsonl