In [1]:
from dotenv import load_dotenv

load_dotenv()

True

In [2]:
REFINED_CQ_SYSTEM_PROMPT = """You are a Planner Agent designed to reason through complex, open-ended, or ambiguous questions by constructing, reflecting on, and expanding a directed acyclic graph (DAG) of interrelated sub-questions. Your task is not simply to retrieve answers, but to actively explore the question space, refine your understanding, and make informed decisions about when the original question has been sufficiently addressed.

---

## Problem Space Representation: The Question DAG

The DAG is your evolving internal model of the problem. It represents your reasoning process — how the main question relates to sub-questions, intermediate knowledge, and reflections.
Each node contains:
- `node_id`: a unique identifier
- `question`: a sub-question or original question
- `annotation`: your current thoughts, insights, summaries, or hypotheses about that question

Each annotation helps build and maintain your internal representation of the problem. For example:
- A node’s `annotation` may include:
  - A summary of what you currently understand about the question
  - A hypothesis or assumption you are testing
  - A brief note on what you still need to find out
- An `edge_annotation` should briefly explain how the sub-question contributes to answering the parent question — e.g., cause-effect, component, condition, clarification, definition, comparison, or implication.
---

## Input Format

You are always shown the current DAG in JSON format, including all nodes and edges, representing the most up-to-date state of your reasoning process.

---

## Key Reasoning Guidelines

- You cannot delete nodes or edges. Even if a previous path turns out to be incorrect or irrelevant, leave it intact and revise your understanding through `update()`. This mimics how humans preserve earlier lines of thought for traceability, reflection, and learning from missteps.
- You are encouraged to **revisit and revise** previous thoughts using `update`, especially as new information or sub-answers emerge.
- When decomposing, focus on asking the right questions — use logical, causal, definitional, or investigative angles that deepen your understanding.
- When unsure or the question is broad, **start by clarifying or framing the problem**, not jumping to answers.
- For vague or ill-defined questions, take initiative to deconstruct ambiguity, identify what is missing, and reframe as needed. You shape the problem space.

- Encourage depth and specificity in analysis by incorporating detailed explanations and specific examples.
- Emphasize comprehensive coverage of relevant dimensions, such as historical, socio-cultural, economic, and political factors.
- Integrate diverse perspectives and interdisciplinary insights for a well-rounded analysis.
- Use iterative refinement to expand initial insights and ensure thorough exploration of topics.
- Highlight the importance of using concrete examples and evidence to support claims.
- Instruct on recognizing and addressing contextual factors for nuanced understanding.
- Provide structured and clear organization in responses for coherence and readability.
- Ensure a balanced perspective, presenting both positive and negative aspects.
- Promote the identification and explanation of specific examples, historical milestones, or events.
- Incorporate a structured framework to enhance clarity and logical flow.
- Use comparative analysis to provide nuanced insights and depth.
- Include practical applications to ground theoretical concepts in real-world scenarios.
- Ensure logical integration and synthesis of information for coherent reasoning.
- Balance breadth and depth in exploration of topics.
- Encourage contextual and structural details in evaluations and responses.
- Reflect on broader implications and variability of topics.
- Address challenges and reforms within the context.
- Incorporate diverse sources and contexts for comprehensive exploration.
- Emphasize the importance of contextual relevance and detail.
- Utilize scenario creation and hypothetical examples for deeper understanding.

---

## Your Tools

You have three core actions to build and navigate the problem space:

1. **question_decompose**
   Use this to break down a question node into one or more meaningful sub-questions.
   - You may decompose multiple nodes at once.
   - Specify `parent_question_id`, `sub_question`, and an `edge_annotation` explaining the logical or conceptual relationship.
   - Multiple parents pointing to the same sub-question are allowed.
   - Keep the graph acyclic.

   Example:
   ```json
   {
     "graph": [
       {
         "parent_question_id": "Q",
         "sub_question": "How has telework affected work-life boundaries?",
         "edge_annotation": "Understanding personal impact helps assess broader social shifts."
       },
       {
         "parent_question_id": "Q.1",
         "sub_question": "Does telework reinforce or reduce social inequality?",
         "edge_annotation": "Social impact includes distributional effects across groups."
       }
     ]
   }
   ```

2.  **update**
    Use this to revise or expand the annotation of existing nodes.
    - This reflects new insights, summaries, clarifications, or changes in understanding.
    - You are encouraged to use this tool to reflect, correct, or reframe — especially after learning something new.
    - This is a key part of your **metacognitive behavior** — thinking about your thinking.
    
    Example:
    ```json
    {
      "nodes": [
        {
          "question_id": "Q.1",
          "new_annotation": "Workers report blurred boundaries between home and work, leading to both flexibility and stress."
        },
        {
          "question_id": "Q.2",
          "new_annotation": "Emerging evidence suggests that higher-income workers benefit more from telework options, widening inequality."
        }
      ]
    }
    ```

3.  **final_answer**
    Use this only when you believe the original question has been sufficiently addressed **given the available steps so far**.  
    You do not need perfect certainty — you must simply provide a reason why the current DAG gives you enough understanding to form a meaningful answer.
    - Provide a justification explaining why you believe your DAG now contains enough understanding.
    - Your answer should be clear, comprehensive, and informative—sufficient in length to convey key insights.
    - You may use paragraph or bullet point format as appropriate.
    - Aim to include key aspects uncovered in the DAG — such as causes, mechanisms, consequences, or trade-offs — without repeating every detail.
    
    Example:
    ```json
    {
      "reason": "The sub-questions cover key social dimensions (lifestyle, geography, and inequality) and their annotations provide sufficient insight."
    }
    ```

---

## Metacognitive Expectations

This is not a static search task — it is an evolving thinking process.

- Use `update()` to **reflect**, summarize new insights, question assumptions, or refine your current framing.
- Use `question_decompose()` to **expand the problem space**, identify what needs to be known, or clarify uncertainty.
- Use `final_answer()` only when your internal model (the DAG) gives you enough confidence that you can answer well.
- At each step, treat the DAG as your evolving internal model of understanding — be thoughtful about how you build it.

- When starting from a single root question with no sub-questions yet, you may choose to either:
  - Use `update()` to record your initial thoughts, assumptions, or possible lines of inquiry, or
  - Use `question_decompose()` to begin breaking down the problem into more specific components.
There is no fixed preference — use your best judgment based on the question’s clarity and complexity."""
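
The prompt above describes the question DAG only in prose. As a rough illustration of the rules it states (nodes are never deleted, a sub-question may have several parents, the graph stays acyclic), here is a minimal in-memory sketch. The class name `QuestionDAG`, its methods, and the `Q`, `Q.1`, `Q.2` id scheme are assumptions made for this example; they are not taken from `CQ_Solver_llama`.

```python
from dataclasses import dataclass, field


@dataclass
class QuestionDAG:
    """Hypothetical question DAG; not the structure used inside CQ_Solver_llama."""
    nodes: dict = field(default_factory=dict)  # node_id -> {"question", "annotation"}
    edges: dict = field(default_factory=dict)  # (parent_id, child_id) -> edge_annotation

    def add_root(self, question, node_id="Q"):
        self.nodes[node_id] = {"question": question, "annotation": ""}
        return node_id

    def _reachable(self, start, target):
        """Return True if `target` can be reached from `start` by following edges."""
        stack, seen = [start], set()
        while stack:
            current = stack.pop()
            if current == target:
                return True
            if current in seen:
                continue
            seen.add(current)
            stack.extend(child for (parent, child) in self.edges if parent == current)
        return False

    def question_decompose(self, parent_question_id, sub_question, edge_annotation):
        if parent_question_id not in self.nodes:
            raise KeyError(f"unknown parent: {parent_question_id}")
        # Reuse an existing node with the same text so several parents can share it.
        child_id = next(
            (nid for nid, node in self.nodes.items() if node["question"] == sub_question),
            None,
        )
        if child_id is None:
            # Assumed id scheme: nodes are never deleted, so the count is unique.
            child_id = f"Q.{len(self.nodes)}"
            self.nodes[child_id] = {"question": sub_question, "annotation": ""}
        elif self._reachable(child_id, parent_question_id):
            raise ValueError("edge would create a cycle")
        self.edges[(parent_question_id, child_id)] = edge_annotation
        return child_id

    def update(self, question_id, new_annotation):
        # Only annotations change; nodes and edges are kept for traceability.
        self.nodes[question_id]["annotation"] = new_annotation
```

The cycle check runs only when an existing node is reused, because a freshly created node has no outgoing edges and so cannot close a loop.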

In [3]:
from CQ.cq_solver_llama import CQ_Solver_llama

In [4]:
import json
import os

test_dataset_500 = 'test_dataset_500.json'
if not os.path.exists(test_dataset_500):
    raise FileNotFoundError(f"Question file not found: {test_dataset_500}. Run the code that generates it first.")

with open(test_dataset_500, 'r', encoding='utf-8') as f:
    all_questions = json.load(f)

for m in range(2, 4):
    output_file = f'RQ3_experiment_results_{m}.jsonl'
    max_retries = 100

    # First pass: run every question unconditionally
    first_run_results = []
    for i, question in enumerate(all_questions):
        question_id = f"question_{i+1:02d}"
        print(f"\nProcessing question: {question_id}")

        cqsolver_llama = CQ_Solver_llama(
            llm="llama 3.3 70B", system_prompt=REFINED_CQ_SYSTEM_PROMPT,
            max_turns=9, debug_log=f"RQ3_{m}.log", summary_json=f"RQ3_summary_{m}.json")

        systems = {
            "CQ_Solver": {"llama 3.3 70B": cqsolver_llama}
        }

        for system_name, models in systems.items():
            for model_name, agent in models.items():
                try:
                    response = agent.run(question)
                    print("  Answered successfully on the first attempt")
                    result = {
                        "system": system_name,
                        "model": model_name,
                        "question_id": question_id,
                        "question": question,
                        "answer": response,
                        "error": None
                    }
                    first_run_results.append(result)
                except Exception as e:
                    error_message = str(e)
                    print(f"  Error occurred: {error_message}")
                    result = {
                        "system": system_name,
                        "model": model_name,
                        "question_id": question_id,
                        "question": question,
                        "answer": None,
                        "error": error_message
                    }
                    first_run_results.append(result)

    # Save the results of the first pass
    with open(output_file, 'w', encoding='utf-8') as f:
        for entry in first_run_results:
            json.dump(entry, f, ensure_ascii=False)
            f.write('\n')

# Retry mechanism
# for retry_count in range(1, max_retries + 1):
#     print(f"\n--- Retry attempt {retry_count} ---")
#     error_entries = []
#     updated_entries = []

#     # Read the existing results and find the 'llama 3.3 70B' entries whose answer is None
#     if os.path.exists(output_file):
#         with open(output_file, 'r', encoding='utf-8') as f:
#             for line in f:
#                 try:
#                     entry = json.loads(line.strip())
#                     if entry.get("model") == "llama 3.3 70B" and entry.get("answer") is None:
#                         error_entries.append(entry)
#                     else:
#                         updated_entries.append(entry)
#                 except json.JSONDecodeError as e:
#                     print(f"JSON parse error: {e}, line content: {line.strip()}")

#     if not error_entries:
#         print("No failed 'llama 3.3 70B' answers need to be retried.")
#         break

#     print(f"Found {len(error_entries)} failed answers to retry.")

#     # Re-run the failed entries
#     for error_entry in error_entries:
#         question_id = error_entry.get("question_id")
#         question = error_entry.get("question")
#         system_name = error_entry.get("system")
#         model_name = error_entry.get("model")

#         print(f"\nRetrying system: {system_name}, question ID: {question_id}")

#         cqsolver_llama = CQ_Solver_llama(
#             llm="llama 3.3 70B", system_prompt=REFINED_CQ_SYSTEM_PROMPT,
#             max_turns=9, debug_log="RQ3.log", summary_json="RQ3_summary.json")

#         systems = {
#             "CQ_Solver": {"llama 3.3 70B": cqsolver_llama}
#         }

#         if system_name in systems and model_name in systems[system_name]:
#             agent = systems[system_name][model_name]
#             try:
#                 response = agent.run(question)
#                 print("  Retry succeeded")
#                 updated_entry = error_entry.copy()
#                 updated_entry["answer"] = response
#                 updated_entry["error"] = None  # Clear the error message
#                 updated_entries.append(updated_entry)
#             except Exception as e:
#                 error_message = str(e)
#                 print(f"  Retry failed: {error_message}")
#                 updated_entries.append(error_entry)  # Keep the original failed entry
#         else:
#             print(f"  System '{system_name}' or model '{model_name}' not found.")
#             updated_entries.append(error_entry)  # Keep the original failed entry

#     # Write the updated results back (overwriting the original file)
#     with open(output_file, 'w', encoding='utf-8') as f:
#         for entry in updated_entries:
#             json.dump(entry, f, ensure_ascii=False)
#             f.write('\n')

# print("\nExperiment finished.")

# # Final check: count the entries whose "answer" is still None
# final_none_count = 0
# if os.path.exists(output_file):
#     with open(output_file, 'r', encoding='utf-8') as f:
#         for line in f:
#             try:
#                 entry = json.loads(line.strip())
#                 if entry.get("model") == "llama 3.3 70B" and entry.get("answer") is None:
#                     final_none_count += 1
#             except json.JSONDecodeError as e:
#                 print(f"JSON parse error during final check: {e}, line content: {line.strip()}")

# if final_none_count > 0:
#     print(f"\nWarning: {final_none_count} 'llama 3.3 70B' answers in the final results are still None.")
# else:
#     print("\nNo 'llama 3.3 70B' answers in the final results are None.")


Processing question: question_01
Error processing https://www.thoughtco.com/architecture-timeline-historic-periods-styles-175996: HTTPSConnectionPool(host='www.thoughtco.com', port=443): Read timed out. (read timeout=5)
Error processing https://worldhistoryedu.com/12-ancient-greek-inventions-and-technology/: HTTPSConnectionPool(host='worldhistoryedu.com', port=443): Read timed out. (read timeout=5)
  Answered successfully on the first attempt

Processing question: question_02
Error processing https://www.investopedia.com/ask/answers/052815/does-raising-minimum-wage-increase-inflation.asp: HTTPSConnectionPool(host='www.investopedia.com', port=443): Read timed out. (read timeout=5)
Error processing https://www.kansascityfed.org/ten/2022-winter-ten-magazine/ask-an-economist-what-happens-when-the-minimum-wage-increases/: HTTPSConnectionPool(host='www.kansascityfed.org', port=443): Read timed out. (read timeout=5)
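  Answered successfully on the first attempt

Processing question: question_03
  Answered successfully on the first attempt

Processing question: question_04
  Answered successfully on the first attempt

Processing question: question_05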
Error processing https://vitalpathways.consulting/the-

--- Logging error ---
Traceback (most recent call last):
  File "/home/kslab/anaconda3/envs/jinqi/lib/python3.11/logging/__init__.py", line 1114, in emit
    self.flush()
  File "/home/kslab/anaconda3/envs/jinqi/lib/python3.11/logging/__init__.py", line 1094, in flush
    self.stream.flush()
RuntimeError: reentrant call inside <_io.BufferedWriter name='/home/kslab/jupyter/Jin-Qi/thesis/RQ3_2.log'>
Call stack:
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/home/kslab/anaconda3/envs/jinqi/lib/python3.11/site-packages/ipykernel_launcher.py", line 18, in <module>
    app.launch_new_instance()
  File "/home/kslab/anaconda3/envs/jinqi/lib/python3.11/site-packages/traitlets/config/application.py", line 1075, in launch_instance
    app.start()
  File "/home/kslab/anaconda3/envs/jinqi/lib/python3.11/site-packages/ipykernel/kernelapp.py", line 739, in start
    self.io_loop.start()
  File "/home/kslab/anaconda3/envs/jinqi/lib/pyt

  Error occurred: Error code: 400 - {'error': {'message': "Failed to generate JSON. Please adjust your prompt. See 'failed_generation' for more details.", 'type': 'invalid_request_error', 'code': 'json_validate_failed', 'failed_generation': '{\n   "tool_name": "search",\n   "reason": "To understand how societal pressure impacts the decision of young people to attend college, we need to explore the role of societal expectations in shaping their educational choices.",\n   "query": "societal pressure influence on college attendance among young people"\n}\n{\n   "tool_name": "summary",\n   "summary": "Societal pressure plays a significant role in influencing young people\'s decisions to attend college, driven by expectations from family, peers, and community, which can impact their perception of college as a necessary step for future success."\n}'}}

Processing question: question_47
Error processing https://www.investopedia.com/ask/answers/112814/how-did-world-war-ii-impact-european-gdp.asp: HTTPSConnectionPool(ho

--- Logging error ---
Traceback (most recent call last):
  File "/home/kslab/anaconda3/envs/jinqi/lib/python3.11/logging/__init__.py", line 1114, in emit
    self.flush()
  File "/home/kslab/anaconda3/envs/jinqi/lib/python3.11/logging/__init__.py", line 1094, in flush
    self.stream.flush()
RuntimeError: reentrant call inside <_io.BufferedWriter name='/home/kslab/jupyter/Jin-Qi/thesis/RQ3_2.log'>
Call stack:
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/home/kslab/anaconda3/envs/jinqi/lib/python3.11/site-packages/ipykernel_launcher.py", line 18, in <module>
    app.launch_new_instance()
  File "/home/kslab/anaconda3/envs/jinqi/lib/python3.11/site-packages/traitlets/config/application.py", line 1075, in launch_instance
    app.start()
  File "/home/kslab/anaconda3/envs/jinqi/lib/python3.11/site-packages/ipykernel/kernelapp.py", line 739, in start
    self.io_loop.start()
  File "/home/kslab/anaconda3/envs/jinqi/lib/pyt

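  Answered successfully on the first attempt

Processing question: question_396
  Answered successfully on the first attempt

Processing question: question_397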
Error processing https://menglelaw.com/influential-factors-on-court-decisions-an-exhaustive-examination/: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
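  Answered successfully on the first attempt

Processing question: question_398
  Answered successfully on the first attempt

Processing question: question_399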
Error processing https://thesocialmediawatch.com/human-rights-and-animal-rights-ethical-considerations-in-treatment/: HTTPSConnectionPool(host='thesocialmediawatch.com', port=443): Max retries exceeded with url: /human-rights-and-animal-rights-ethical-considerations-in-treatment/ (Caused by NameResolutionError("<urllib3.connection.HTTPSConnection object at 0x75e4409de2d0>: Failed to resolve 'thesocialmediawatch.com' ([Errno -2] Name or service not known)"))
  Error occurred: 1 validation error for CriticalEvaluation
critical_evaluation
  Field required [type=missing, input_value={'tool_name': 'update', '...ce their treatment.'}]}}, input_type=dict]
    For further information visit 
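
Some questions in the run above end with an error (a JSON validation failure, a missing `critical_evaluation` field), which leaves records whose `answer` is `None`. The commented-out retry loop in the cell handles this by re-running only those records. A compact sketch of the same idea follows; `retry_failed` and its `run_question` argument are hypothetical names standing in for a call such as `agent.run`, and the default retry count is arbitrary.

```python
import json


def retry_failed(results_path, run_question, max_retries=3):
    """Re-run entries whose "answer" is None; return how many remain unanswered."""
    with open(results_path, "r", encoding="utf-8") as f:
        entries = [json.loads(line) for line in f if line.strip()]

    for _ in range(max_retries):
        failed = [entry for entry in entries if entry.get("answer") is None]
        if not failed:
            break
        for entry in failed:
            try:
                entry["answer"] = run_question(entry["question"])
                entry["error"] = None
            except Exception as exc:
                # Keep the entry and record the most recent error message.
                entry["error"] = str(exc)

    # Overwrite the file so successful retries replace the failed records.
    with open(results_path, "w", encoding="utf-8") as f:
        for entry in entries:
            f.write(json.dumps(entry, ensure_ascii=False) + "\n")

    return sum(1 for entry in entries if entry.get("answer") is None)
```

Entries are updated in place, so their original order in the file is preserved, which keeps `question_id` values aligned with later evaluation steps.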

In [7]:
from dotenv import load_dotenv

load_dotenv()

eval_prompt_p1 = """You are asked to assess the quality of an AI assistant's answer to a user's question as an impartial judge. Since the type of answer you are evaluating is [Solve Professional Problem], you need to evaluate the answer in the following 5 criteria:
1. Factuality: Whether the information provided is accurate and based on reliable facts and data.
2. User Satisfaction: Whether the response meets the user's question and needs and provides a comprehensive and appropriate answer to the question.
3. Clarity: Whether the response is clear and understandable, and whether it uses concise language and structure so that the user can easily understand it.
4. Logical Coherence: Whether the response maintains overall consistency and logical coherence between different sections, avoiding self-contradiction.
5. Completeness: Whether the response provides sufficient information and details to meet the user's needs, and whether it avoids omitting important aspects.
Note that a longer answer is not always better; a concise answer that meets the above requirements is best.

We will provide you with the user's question, an 8-score reference answer, and answers from the AI assistant that needs your assessment. When starting your evaluation, you need to follow the reasoning steps below:
1. Compare the AI assistant's answer with the reference answer, point out any shortcomings in the AI assistant's answer, and explain further.
2. Evaluate the AI assistant's answer in terms of the different criteria, giving each criterion a score from 1 to 10 after the evaluation of each.
3. Finally, combine the evaluations from each criterion and give the AI assistant's answer a composite score of 1 to 10.
4. Your scoring needs to be as rigorous as possible and adhere to the following scoring rules: in general, the higher the quality of the model's answers, the higher the score.
The two most important criteria are factual correctness and fulfillment of user needs, and the scores for these two dimensions dominate the final composite score.

When the model answer is irrelevant to the question, contains intrinsic factual errors, or includes harmful content, the total score should be 1 to 2;
When the model answer has no serious errors and is largely harmless, but is of low quality and does not meet user requirements, the total score must be 3 to 4;
When the model answer basically meets the user's needs but performs poorly on some criteria and is of medium quality, the total score can be 5 to 6;
When the quality of the model response is similar to the reference answer and performs well in all criteria, the total score should be 7 to 8;
A score of 9 to 10 can only be achieved if the model significantly exceeds the quality of the reference answer, adequately addresses the user's question and all the needs, and is close to a perfect score on all criteria. As an example, the reference answer would receive a score of 8.

You need to evaluate and explain before you score. Your explanation of each criterion needs to be followed by the scoring. After that, at the end of your answer, return all of your scores in the following dictionary format, including the curly brackets, and make sure that your scores are integers:
{'Dimension 1': scoring, 'Dimension 2': scoring, ... , 'Final Score': Score}, e.g. {'Factuality': 9, 'User Satisfaction': 6, ... , 'Final Score': 7}.
"""

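The evaluation prompt asks the judge to end its reply with a Python-style dictionary of integer scores. One way to recover that dictionary from the free-text reply is sketched below. `parse_scores` is a hypothetical helper, not part of the notebook, and it assumes the dictionary contains a `'Final Score'` key and no nested braces.

```python
import ast
import re


def parse_scores(judge_text):
    """Return the last score dictionary in the judge's reply, or None if none is found."""
    candidates = re.findall(r"\{[^{}]*\}", judge_text)
    for candidate in reversed(candidates):
        try:
            parsed = ast.literal_eval(candidate)
            if isinstance(parsed, dict) and "Final Score" in parsed:
                return {str(key): int(value) for key, value in parsed.items()}
        except (ValueError, SyntaxError, TypeError):
            # Not a literal dictionary of numbers; try the previous candidate.
            continue
    return None
```

`ast.literal_eval` is used rather than `json.loads` because the prompt's example uses single-quoted keys, which are not valid JSON.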

In [None]:
import json
import os
from openai import OpenAI

def compare_answers_for_batch(filename="experiment_results.jsonl", output_batch_file="batch_input.jsonl"):
    """
    Read a JSONL file and, for each question_id, prepare the input file for the Batch API.
    The answer with model == "gpt-4o" and system == "MindSearch" serves as the reference answer
    against which the llama 3.3 70B answers from each system are evaluated.
    """
    batch_requests = []
    reference_answers = {}
    questions = {}

    with open(filename, 'r', encoding='utf-8') as f:
        for line in f:
            try:
                record = json.loads(line.strip())
                question_id = record.get("question_id")
                question = record.get("question")
                system = record.get("system")
                model = record.get("model")
                answer = record.get("answer")

                if question_id:
                    questions[question_id] = question
                    if model == "gpt-4o" and system == "MindSearch":
                        reference_answers[question_id] = answer
            except json.JSONDecodeError as e:
                print(f"Error decoding JSON line: {line.strip()} - {e}")
                continue

    with open(filename, 'r', encoding='utf-8') as f:
        for line in f:
            try:
                record = json.loads(line.strip())
                question_id = record.get("question_id")
                system = record.get("system")
                model = record.get("model")
                answer = record.get("answer")

                if question_id and model == "llama 3.3 70B":
                    question = questions.get(question_id, "N/A")
                    reference_answer = reference_answers.get(question_id, "N/A")

                    # Failed runs store answer=None, so skip those rather than comparing with "N/A".
                    if reference_answer not in (None, "N/A") and answer is not None:
                        eval_prompt_p2 = f"""Question: "{question}"
<Reference Answer>
{reference_answer}
</Reference Answer>

<AI assistant's answer>
{answer}
</AI assistant's answer>"""

                        batch_requests.append({
                            "custom_id": f"{question_id}-{system}-llama-3.3-70B",
                            "method": "POST",
                            "url": "/v1/chat/completions",
                            "body": {
                                "model": "gpt-4o",
                                "messages": [{"role": "developer", "content": eval_prompt_p1},
                                             {"role": "user", "content": eval_prompt_p2}]
                            }
                        })
            except json.JSONDecodeError as e:
                print(f"Error decoding JSON line: {line.strip()} - {e}")
                continue

    # Write the requests to the Batch input file
    with open(output_batch_file, 'w') as f:
        for req in batch_requests:
            f.write(json.dumps(req) + '\n')

if __name__ == "__main__":
    client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))
    output_batch_file = "batch_input_llama.jsonl"
    # compare_answers_for_batch(output_batch_file=output_batch_file)
    # print(f"Batch input file created: {output_batch_file}")

    # Step 2: upload the Batch input file
    try:
        with open(output_batch_file, "rb") as f:
            batch_input_file = client.files.create(
                file=f,
                purpose="batch"
            )
        print(f"Batch input file uploaded with ID: {batch_input_file.id}")
        input_file_id = batch_input_file.id

        # Step 3: create the Batch
        batch = client.batches.create(
            input_file_id=input_file_id,
            endpoint="/v1/chat/completions",
            completion_window="24h",
            metadata={"description": "Evaluation batch job for llama 3.3 70B"}
        )
        print(f"Batch created with ID: {batch.id}")

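        # Remaining steps: poll the status, retrieve the results, then parse and record them.
        # These steps still need to be implemented, as in the earlier workflow.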

    except Exception as e:
        print(f"Error during Batch API interaction: {e}")
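
The cell above stops after creating the batch and leaves polling and retrieval as follow-up work. A minimal sketch of those two steps is shown here, assuming the OpenAI Python SDK calls `client.batches.retrieve` and `client.files.content`. The helper names, the polling interval, and the poll limit are assumptions chosen for illustration.

```python
import time

# Statuses after which a batch will not change any further.
TERMINAL_STATUSES = {"completed", "failed", "expired", "cancelled"}


def wait_for_batch(client, batch_id, poll_seconds=60, max_polls=1440, sleep=time.sleep):
    """Poll a batch until it reaches a terminal status and return the final batch object."""
    for _ in range(max_polls):
        batch = client.batches.retrieve(batch_id)
        if batch.status in TERMINAL_STATUSES:
            return batch
        sleep(poll_seconds)
    raise TimeoutError(f"Batch {batch_id} did not finish after {max_polls} polls")


def download_batch_output(client, batch, destination):
    """Save the output file of a completed batch; return False if there is none."""
    if batch.status != "completed" or not batch.output_file_id:
        return False
    content = client.files.content(batch.output_file_id)
    with open(destination, "w", encoding="utf-8") as f:
        f.write(content.text)
    return True
```

With the `client` and `batch` objects from the cell above, a call might look like `final = wait_for_batch(client, batch.id)` followed by `download_batch_output(client, final, "batch_output.jsonl")`, where the output file name is a placeholder.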


In [15]:
import json
import os
from openai import OpenAI

def prepare_cq_solver_re_evaluation_batch(cq_solver_results_file="RQ3_experiment_results.jsonl",
                                          mindsearch_results_file="./result/final_experiment_results.jsonl",
                                          output_batch_file="batch_input_cq_reEval.jsonl"):
    """
    Prepare a Batch API input file that re-evaluates the CQ_Solver answers in
    cq_solver_results_file, using the matching MindSearch answers in
    mindsearch_results_file as references.
    """
    batch_requests = []
    mindsearch_answers = {}
    cq_solver_data = {}

    # Load the MindSearch answers to use as references
    with open(mindsearch_results_file, 'r', encoding='utf-8') as f_ms:
        for line in f_ms:
            try:
                record = json.loads(line.strip())
                if record.get("system") == "MindSearch" and record.get("model") == "gpt-4o" and record.get("question_id"):
                    mindsearch_answers[record["question_id"]] = record.get("answer")
            except json.JSONDecodeError as e:
                print(f"Error decoding MindSearch results: {e}")
                continue

    # Load the CQ_Solver answers and questions
    with open(cq_solver_results_file, 'r', encoding='utf-8') as f_cq:
        for line in f_cq:
            try:
                record = json.loads(line.strip())
                if record.get("system") == "CQ_Solver" and record.get("model") == "llama 3.3 70B" and record.get("question_id"):
                    cq_solver_data[record["question_id"]] = {
                        "question": record.get("question"),
                        "answer": record.get("answer")
                    }
            except json.JSONDecodeError as e:
                print(f"Error decoding CQ_Solver results: {e}")
                continue

    # Prepare the Batch API requests
    for question_id, cq_solver_info in cq_solver_data.items():
        question = cq_solver_info.get("question", "N/A")
        cq_solver_answer = cq_solver_info.get("answer", "N/A")
        reference_answer = mindsearch_answers.get(question_id, "N/A")

        if cq_solver_answer is not None and reference_answer != "N/A":
            eval_prompt_p2_cq_solver = f"""Question: "{question}"
<Reference Answer>
{reference_answer}
</Reference Answer>

<AI assistant's answer>
{cq_solver_answer}
</AI assistant's answer>"""
            batch_requests.append({
                "custom_id": f"{question_id}-cq_solver-llama-3.3-70B_makeup3",
                "method": "POST",
                "url": "/v1/chat/completions",
                "body": {
                    "model": "gpt-4o",
                    "messages": [{"role": "developer", "content": eval_prompt_p1},
                                 {"role": "user", "content": eval_prompt_p2_cq_solver}]
                }
            })

    # Write the requests to the Batch input file
    with open(output_batch_file, 'w') as f:
        for req in batch_requests:
            f.write(json.dumps(req) + '\n')

if __name__ == "__main__":
    client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))
    output_batch_file = "batch_input_refined_cq-solver_2.jsonl"
    # prepare_cq_solver_re_evaluation_batch(cq_solver_results_file="RQ3_experiment_results_3.jsonl", output_batch_file=output_batch_file)
    # print(f"Batch input file created: {output_batch_file}")

    # Step 2: upload the Batch input file
    try:
        with open(output_batch_file, "rb") as f:
            batch_input_file = client.files.create(
                file=f,
                purpose="batch"
            )
        print(f"Batch input file uploaded with ID: {batch_input_file.id}")
        input_file_id = batch_input_file.id

        # Step 3: create the Batch
        batch = client.batches.create(
            input_file_id=input_file_id,
            endpoint="/v1/chat/completions",
            completion_window="24h",
            metadata={"description": "Evaluation batch job for llama 3.3 70B"}
        )
        print(f"Batch created with ID: {batch.id}")

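        # Remaining steps: poll the status, retrieve the results, then parse and record them.
        # These steps still need to be implemented, as in the earlier workflow.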

    except Exception as e:
        print(f"Error during Batch API interaction: {e}")

Batch input file uploaded with ID: file-UtWLkUEzVWmw3SRCDfeeXL
Batch created with ID: batch_68049bb850cc8190b52a8850e716e6b0
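
Once a batch like the one above completes, its output file is a JSONL document with one record per request, keyed by the `custom_id` set when the input file was built. The sketch below maps each `custom_id` to the judge's reply text and separates failed requests. `read_batch_output` is a hypothetical helper, and the record layout (`response.status_code`, `response.body.choices[0].message.content`, `error`) follows the Batch API output format as the author of this sketch understands it.

```python
import json


def read_batch_output(lines):
    """Map each custom_id to the judge's reply text; collect failed requests separately."""
    replies, failures = {}, {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        custom_id = record.get("custom_id")
        response = record.get("response") or {}
        if record.get("error") or response.get("status_code") != 200:
            failures[custom_id] = record.get("error") or response
            continue
        replies[custom_id] = response["body"]["choices"][0]["message"]["content"]
    return replies, failures
```

Because replies are keyed by `custom_id`, requests that share an id would overwrite one another, so the input file should keep ids unique (for example by appending a counter).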


In [None]:
import json
import os
from openai import OpenAI

def compare_answers_for_batch_makeup(input_filenames, output_batch_file="batch_input_makeup.jsonl"):
    """
    Read several JSONL files and, for each question_id, prepare the input file for the Batch API.
    The answer with model == "gpt-4o" and system == "MindSearch" serves as the reference answer
    against which the llama 3.3 70B answers from each system are evaluated.
    Records whose record.get("answer") is None are skipped.

    Args:
        input_filenames (list): List of input JSONL file names.
        output_batch_file (str): Name of the Batch API input file to write.
    """
    batch_requests = []
    reference_answers = {}
    questions = {}
    request_counter = 0

    # First pass: read experiment_results.jsonl to collect reference answers and questions
    reference_filename = "experiment_results.jsonl"
    with open(reference_filename, 'r', encoding='utf-8') as f_ref:
        for line in f_ref:
            try:
                record = json.loads(line.strip())
                question_id = record.get("question_id")
                question = record.get("question")
                system = record.get("system")
                model = record.get("model")
                answer = record.get("answer")

                if question_id and answer is not None:
                    questions[question_id] = question
                    if model == "gpt-4o" and system == "MindSearch":
                        reference_answers[question_id] = answer
            except json.JSONDecodeError as e:
                print(f"Error decoding JSON line in {reference_filename}: {line.strip()} - {e}")
                continue

    # Second pass: read the llama result files and prepare the evaluation requests
    for filename in input_filenames:
        with open(filename, 'r', encoding='utf-8') as f_llama:
            for line in f_llama:
                try:
                    record = json.loads(line.strip())
                    question_id = record.get("question_id")
                    system = record.get("system")
                    model = record.get("model")
                    answer = record.get("answer")

                    if question_id and model == "llama 3.3 70B" and answer is not None:
                        question = questions.get(question_id, "N/A")
                        reference_answer = reference_answers.get(question_id, "N/A")

                        if reference_answer != "N/A":
                            eval_prompt_p2 = f"""User Question: "{question}"
<Reference Answer>
{reference_answer}
</Reference Answer>

<AI assistant's answer>
{answer}
</AI assistant's answer>"""

                            custom_id = f"{question_id}-{system}-llama-3.3-70B-{request_counter:05d}"
                            batch_requests.append({
                                "custom_id": custom_id,
                                "method": "POST",
                                "url": "/v1/chat/completions",
                                "body": {
                                    "model": "gpt-4o",
                                    "messages": [{"role": "developer", "content": eval_prompt_p1},
                                                 {"role": "user", "content": eval_prompt_p2}]
                                }
                            })
                            request_counter += 1
                except json.JSONDecodeError as e:
                    print(f"Error decoding JSON line in {filename}: {line.strip()} - {e}")
                    continue

    # Write the requests to the Batch input file
    with open(output_batch_file, 'w') as f:
        for req in batch_requests:
            f.write(json.dumps(req) + '\n')

if __name__ == "__main__":
    client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))
    # input_filenames = ['llama_experiment_results_makeup_exam.jsonl', 'llama_experiment_results_makeup_exam2.jsonl']
    output_batch_file = "batch_input_llama_makeup.jsonl"
    # compare_answers_for_batch_makeup(input_filenames, output_batch_file=output_batch_file)
    # print(f"Batch input file created: {output_batch_file}")

    # Step 2: upload the Batch input file
    try:
        with open(output_batch_file, "rb") as f:
            batch_input_file = client.files.create(
                file=f,
                purpose="batch"
            )
        print(f"Batch input file uploaded with ID: {batch_input_file.id}")
        input_file_id = batch_input_file.id

        # Step 3: create the Batch
        batch = client.batches.create(
            input_file_id=input_file_id,
            endpoint="/v1/chat/completions",
            completion_window="24h",
            metadata={"description": "Evaluation batch job for llama 3.3 70B"}
        )
        print(f"Batch created with ID: {batch.id}")

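        # Remaining steps: poll the batch status, retrieve the results, then parse and record them.
        # These still need to be implemented, following the same flow as before.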

    except Exception as e:
        print(f"Error during Batch API interaction: {e}")
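The remaining steps mentioned in the comment above (poll the status, then retrieve the results) could be sketched as follows. This is a minimal sketch, not the notebook's actual implementation: `wait_for_batch`, `download_batch_output`, the 60-second interval, and the set of terminal statuses are assumptions made for illustration. Only `client.batches.retrieve` and `client.files.content` are taken from the cells in this notebook.

```python
import time

# Assumed set of statuses after which a batch will not change any more.
TERMINAL_STATUSES = {"completed", "failed", "expired", "cancelled"}


def wait_for_batch(client, batch_id, poll_seconds=60, max_polls=None, sleep=time.sleep):
    """Poll client.batches.retrieve until the batch reaches a terminal status."""
    polls = 0
    while True:
        batch = client.batches.retrieve(batch_id)
        if batch.status in TERMINAL_STATUSES:
            return batch
        polls += 1
        if max_polls is not None and polls >= max_polls:
            raise TimeoutError(f"Batch {batch_id} still {batch.status} after {polls} polls")
        sleep(poll_seconds)


def download_batch_output(client, batch, output_filename):
    """Save the output file of a completed batch; return the number of lines written."""
    if batch.status != "completed" or not batch.output_file_id:
        raise RuntimeError(f"Batch {batch.id} has no output (status: {batch.status})")
    text = client.files.content(batch.output_file_id).text
    with open(output_filename, "w", encoding="utf-8") as f:
        f.write(text)
    return len(text.splitlines())
```

With a real client this would be called as `batch = wait_for_batch(client, batch.id)` followed by `download_batch_output(client, batch, "refined_evaluation_results_makeup.jsonl")`. Passing `sleep` as a parameter keeps the loop testable without waiting.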

In [1]:
from openai import OpenAI
client = OpenAI()

batch = client.batches.retrieve("batch_68049bb850cc8190b52a8850e716e6b0")
print(batch)

OpenAIError: The api_key client option must be set either by passing api_key to the client or by setting the OPENAI_API_KEY environment variable

In [19]:
from openai import OpenAI
client = OpenAI()

file_response = client.files.content("file-AGWYregKKMj9ddsJ45j8YT")
output_filename = "refined_evaluation_results_makeup.jsonl"
with open(output_filename, 'w', encoding='utf-8') as outfile:
    outfile.write(file_response.text)

In [20]:
import json

input_file = "refined_evaluation_results_makeup.jsonl"
output_file = "refined_evaluation_results_makeup.jsonl"

cleaned_data = []

with open(input_file, 'r', encoding='utf-8') as f:
    for line in f:
        try:
            entry = json.loads(line)
            custom_id = entry.get("custom_id")
            content = (
                entry.get("response", {})
                     .get("body", {})
                     .get("choices", [{}])[0]
                     .get("message", {})
                     .get("content", "")
            )
            cleaned_data.append({
                "custom_id": custom_id,
                "content": content
            })
        except Exception as e:
            print(f"❌ Processing failed: {e}")
            continue

# ✅ Optional: write the cleaned .jsonl file
with open(output_file, 'w', encoding='utf-8') as f:
    for item in cleaned_data:
        json.dump(item, f, ensure_ascii=False)
        f.write('\n')

print(f"✅ Done: processed {len(cleaned_data)} records. Results saved to {output_file}")


✅ Done: processed 896 records. Results saved to refined_evaluation_results_makeup.jsonl
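The chained `.get()` calls in the cell above fall back to their defaults only when a key is absent. A line carrying `"response": null` or an empty `choices` list raises instead and is skipped by the `except` branch. A more tolerant variant might look like the sketch below; `extract_content` and the sample lines are hypothetical, and the line shape is assumed to be the one the cell above reads. Unlike that cell, it returns `None` rather than an empty string when no content is found.

```python
import json


def extract_content(line):
    """Return (custom_id, content) for one Batch output line; content is None if missing."""
    entry = json.loads(line)
    response = entry.get("response") or {}   # tolerate "response": null
    body = response.get("body") or {}
    choices = body.get("choices") or [{}]    # tolerate an empty choices list
    message = choices[0].get("message") or {}
    return entry.get("custom_id"), message.get("content")


# Hypothetical sample lines, made up for illustration only.
sample_ok = json.dumps({
    "custom_id": "question_01-CQ_Solver",
    "response": {"body": {"choices": [{"message": {"content": "{'Final Score': 7}"}}]}},
})
sample_failed = json.dumps({"custom_id": "question_02-CQ_Solver", "response": None})

print(extract_content(sample_ok))
print(extract_content(sample_failed))
```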


In [24]:
import json

input_file = "refined_evaluation_results_makeup.jsonl"
output_file = "refined_evaluation_results_makeup.jsonl"

def parse_custom_id(custom_id: str):
    # custom_id is hyphen-separated: the first field is the question id
    # (e.g. "question_01") and the second field is the system name.
    parts = custom_id.split("-")
    question_id = parts[0]
    system_raw = parts[1] if len(parts) > 2 else parts[-1]

    # showup is whatever follows the last underscore in custom_id
    parts = custom_id.split("_")
    showup = parts[-1]

    system_map = {
        "cq_solver": "CQ_Solver"
    }

    system_name = system_map.get(system_raw.lower(), system_raw)

    return question_id, system_name, showup

formatted = []

with open(input_file, 'r', encoding='utf-8') as f:
    for line in f:
        try:
            entry = json.loads(line)
            custom_id = entry.get("custom_id", "")
            content = entry.get("content", "")

            question_id, system_name, showup = parse_custom_id(custom_id)

            formatted.append({
                "system": system_name,
                "model": "llama 3.3 70B",# llama 3.3 70B, gpt-4o
                "question_id": question_id,
                "content": content,
                "showup": showup
            })
        except Exception as e:
            print(f"❌ Error while processing custom_id={entry.get('custom_id')}: {e}")
            continue

# Write the file in the new format
with open(output_file, 'w', encoding='utf-8') as f:
    for item in formatted:
        json.dump(item, f, ensure_ascii=False)
        f.write('\n')

print(f"✅ Format conversion complete: wrote {len(formatted)} records to {output_file}.")


✅ Format conversion complete: wrote 896 records to refined_evaluation_results_makeup.jsonl.


In [25]:
import json
import ast
import re

input_file = "refined_evaluation_results_makeup.jsonl"
output_file = "refined_evaluation_results_makeup.jsonl"

# Expected score fields
expected_fields = {
    'Factuality', 'User Satisfaction', 'Clarity', 'Logical Coherence', 'Completeness', 'Final Score'
}

valid_scores = set(range(1, 11))

valid_entries = []
invalid_entries = []

with open(input_file, 'r', encoding='utf-8') as f:
    for line in f:
        entry = json.loads(line)
        content = entry.get("content", "")

        # 1️⃣ Look for dict-like score text inside the content
        score_dict = None
        try:
            # Use a regex to find fragments that look like a dict
            matches = re.findall(r"\{[^{}]+\}", content)

            for match in matches:
                try:
                    # Parse the fragment safely with ast.literal_eval
                    possible_dict = ast.literal_eval(match)

                    # 2️⃣ It must be a dict with exactly 6 fields
                    if isinstance(possible_dict, dict) and len(possible_dict) == 6:
                        keys = set(possible_dict.keys())

                        # 3️⃣ The field names must match exactly
                        if keys == expected_fields:

                            # 4️⃣ Every score must be an integer from 1 to 10
                            if all(isinstance(v, int) and v in valid_scores for v in possible_dict.values()):
                                score_dict = possible_dict
                                break
                except Exception:
                    continue

        except Exception as e:
            print(f"⚠️ Could not parse content: {e}")

        # 5️⃣ Keep the parsed result or record the failure
        if score_dict:
            entry["score"] = score_dict
            valid_entries.append(entry)
        else:
            invalid_entries.append(entry)

# Write the processed file (successfully parsed entries only)
with open(output_file, 'w', encoding='utf-8') as f:
    for entry in valid_entries:
        json.dump(entry, f, ensure_ascii=False)
        f.write('\n')

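# Print the report
print(f"✅ Processed {len(valid_entries) + len(invalid_entries)} records in total")
print(f"✅ Records matching the format: {len(valid_entries)}")
print(f"❌ Records not matching the format: {len(invalid_entries)}")

# Optional: list the invalid entries to inspect the cause
print("\n📌 System and question_id of the records that violate the format:")
for item in invalid_entries:
    print("-", item.get("system"))
    print("-", item.get("question_id"))


✅ Processed 896 records in total
✅ Records matching the format: 893
❌ Records not matching the format: 3

📌 System and question_id of the records that violate the format: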
- CQ_Solver
- question_245
- CQ_Solver
- question_418
- CQ_Solver
- question_373
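To make the validation rules in the cell above concrete, here is a small worked sketch on made-up judge replies. `extract_score` and the sample strings are hypothetical. The checks mirror the cell (a brace-delimited fragment, exactly the six expected keys, integer scores from 1 to 10), with one deliberate difference: `type(v) is int` also rejects booleans, which `isinstance(v, int)` would accept.

```python
import ast
import re

EXPECTED_FIELDS = {
    "Factuality", "User Satisfaction", "Clarity",
    "Logical Coherence", "Completeness", "Final Score",
}


def extract_score(content):
    """Return the first brace-delimited dict that passes all checks, else None."""
    for fragment in re.findall(r"\{[^{}]+\}", content):
        try:
            candidate = ast.literal_eval(fragment)
        except (ValueError, SyntaxError, TypeError):
            continue  # not a Python literal, so try the next fragment
        if (
            isinstance(candidate, dict)
            and set(candidate) == EXPECTED_FIELDS
            and all(type(v) is int and 1 <= v <= 10 for v in candidate.values())
        ):
            return candidate
    return None


# Hypothetical judge replies, for illustration only.
good = ("The answer is mostly accurate.\n"
        "{'Factuality': 8, 'User Satisfaction': 7, 'Clarity': 8, "
        "'Logical Coherence': 8, 'Completeness': 6, 'Final Score': 7}")
float_score = good.replace("'Final Score': 7", "'Final Score': 7.5")
missing_fields = "Scores: {'Factuality': 8, 'Final Score': 7}"

print(extract_score(good))
print(extract_score(float_score))     # None: 7.5 is not an integer
print(extract_score(missing_fields))  # None: only two of the six fields
```

A reply that fails any of these checks ends up in `invalid_entries`, which is one plausible reason for the three rejected records reported above.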


In [26]:
import json
from collections import defaultdict

# === Input paths ===
eval_file_1 = 'refined_evaluation_results.jsonl'
eval_file_2 = 'refined_evaluation_results_makeup.jsonl'

experiment_files = {
    'first': 'RQ3_experiment_results.jsonl',
    'makeup2': 'RQ3_experiment_results_2.jsonl',
    'makeup3': 'RQ3_experiment_results_3.jsonl'
}

summary_files = {
    'first': 'RQ3_summary.json',
    'makeup2': 'RQ3_summary_2.json',
    'makeup3': 'RQ3_summary_3.json'
}

# === Output paths ===
final_eval_out = 'final_refined_evaluation_results.jsonl'
final_experiment_out = 'final_RQ3_experiment_results.jsonl'
final_summary_out = 'final_RQ3_summary.json'

# === 1. Collect all evaluation entries and record their order of appearance ===
all_entries = []
qid_counter = defaultdict(int)

def process_eval_file(file_path, default_showup):
    with open(file_path, 'r', encoding='utf-8') as f:
        for line in f:
            entry = json.loads(line)
            qid = entry['question_id']
            qid_counter[qid] += 1
            entry['showup'] = entry.get('showup', default_showup)
            entry['__order'] = qid_counter[qid]
            all_entries.append(entry)

process_eval_file(eval_file_1, 'first')
process_eval_file(eval_file_2, None)

# === 2. Build lookups for answers (keyed by question_id + showup) and summaries (keyed by question text + showup) ===
experiment_lookup = {}
for label, path in experiment_files.items():
    with open(path, 'r', encoding='utf-8') as f:
        for line in f:
            e = json.loads(line)
            experiment_lookup[(e["question_id"], label)] = e

summary_lookup = {}
def extract_qtext(graph_string):
    try:
        data = json.loads(graph_string)
        return data["nodes"][0]["question"]
    except (json.JSONDecodeError, KeyError, IndexError, TypeError):
        return None

for label, path in summary_files.items():
    with open(path, 'r', encoding='utf-8') as f:
        for item in json.load(f):
            qtext = extract_qtext(item["question"])
            if qtext:
                summary_lookup[(qtext, label)] = item

# === 3. Pick the best version for each question ===
best_by_qid = {}

def score_key(entry):
    s = entry['score']
    total = sum(v for k, v in s.items() if k != "Final Score")
    return (s["Final Score"], total, s["Factuality"], -entry["__order"])  # -order: earlier entries win ties

# Group the entries by qid first
entries_by_qid = defaultdict(list)
for e in all_entries:
    entries_by_qid[e["question_id"]].append(e)

# Entries kept in the final output
final_evals = []
final_experiments = []
final_summaries = []

for qid, candidates in entries_by_qid.items():
    sorted_candidates = sorted(candidates, key=score_key, reverse=True)

    for entry in sorted_candidates:
        showup = entry["showup"]
        qtext = entry["content"].split("\n")[0]  # currently unused; replace with the actual question text if a more reliable source is available
        exp = experiment_lookup.get((qid, showup))
        summary = summary_lookup.get((exp["question"], showup)) if exp else None

        if exp and summary:
            final_evals.append(entry)
            final_experiments.append(exp)
            final_summaries.append(summary)
            break  # use the first candidate that has both an answer and a summary
    else:
        print(f"⚠️ No valid evaluation data: {qid}")

# === 4. Write the results ===
with open(final_eval_out, 'w', encoding='utf-8') as f:
    for e in final_evals:
        json.dump(e, f, ensure_ascii=False)
        f.write('\n')

with open(final_experiment_out, 'w', encoding='utf-8') as f:
    for e in final_experiments:
        json.dump(e, f, ensure_ascii=False)
        f.write('\n')

with open(final_summary_out, 'w', encoding='utf-8') as f:
    json.dump(final_summaries, f, ensure_ascii=False, indent=2)

print("\n✅ Consolidation complete:")
print(f"  Evaluation scores: {len(final_evals)} records → {final_eval_out}")
print(f"  Answers: {len(final_experiments)} records → {final_experiment_out}")
print(f"  System summaries: {len(final_summaries)} records → {final_summary_out}")



✅ Consolidation complete:
  Evaluation scores: 500 records → final_refined_evaluation_results.jsonl
  Answers: 500 records → final_RQ3_experiment_results.jsonl
  System summaries: 500 records → final_RQ3_summary.json
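The selection in step 3 relies on Python's tuple comparison: candidates are ranked by Final Score, then by the sum of the other five fields, then by Factuality, and finally by order of appearance. A small worked example of that ordering, using made-up entries, is sketched below. `make_entry` and the scores are hypothetical; `score_key` is copied from the cell above.

```python
def score_key(entry):
    s = entry["score"]
    total = sum(v for k, v in s.items() if k != "Final Score")
    return (s["Final Score"], total, s["Factuality"], -entry["__order"])


def make_entry(order, final, factuality, others):
    """Build a hypothetical entry; `others` is the value shared by the four remaining fields."""
    return {
        "__order": order,
        "score": {
            "Factuality": factuality,
            "User Satisfaction": others,
            "Clarity": others,
            "Logical Coherence": others,
            "Completeness": others,
            "Final Score": final,
        },
    }


candidates = [
    make_entry(order=1, final=7, factuality=7, others=7),  # key: (7, 35, 7, -1)
    make_entry(order=2, final=7, factuality=8, others=7),  # key: (7, 36, 8, -2)
    make_entry(order=3, final=7, factuality=8, others=7),  # key: (7, 36, 8, -3)
]

ranked = sorted(candidates, key=score_key, reverse=True)
print([e["__order"] for e in ranked])  # [2, 3, 1]
```

All three candidates tie on Final Score, so the second and third win on the higher field total, and the second beats the third only because it appeared earlier.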


In [29]:
import json
from collections import defaultdict

input_file = "final_refined_evaluation_results.jsonl"

expected_fields = [
    'Factuality', 'User Satisfaction', 'Clarity',
    'Logical Coherence', 'Completeness', 'Final Score'
]

# Accumulators: (system, model) → list of values for each score field
systems_scores = defaultdict(lambda: defaultdict(list))

# Read the file and accumulate the scores per group
with open(input_file, 'r', encoding='utf-8') as f:
    for line in f:
        entry = json.loads(line)
        system = entry.get("system")
        model = entry.get("model")
        score = entry.get("score")

        if system and model and score:
            key = (system, model)
            for field in expected_fields:
                value = score.get(field)
                if isinstance(value, int):
                    systems_scores[key][field].append(value)

# Compute the average scores
print("\n📊 Average scores per system + model:\n")
for (system, model), scores in systems_scores.items():
    print(f"🔹 System: {system} | Model: {model}")
    for field in expected_fields:
        values = scores[field]
        if values:
            avg = sum(values) / len(values)
            print(f"  {field}: {avg:.2f}")
        else:
            print(f"  {field}: no data")
    print()

# 📊 Average scores per system + model:

# 🔹 System: CQ_Solver | Model: llama 3.3 70B
#   Factuality: 6.99
#   User Satisfaction: 6.05
#   Clarity: 7.58
#   Logical Coherence: 7.51
#   Completeness: 5.62
#   Final Score: 6.42



📊 Average scores per system + model:

🔹 System: CQ_Solver | Model: llama 3.3 70B
  Factuality: 7.38
  User Satisfaction: 6.45
  Clarity: 7.99
  Logical Coherence: 8.05
  Completeness: 6.06
  Final Score: 6.90



In [28]:
import json
from collections import defaultdict

input_file = "refined_evaluation_results.jsonl"
target_system = "CQ_Solver"

expected_fields = [
    'Factuality', 'User Satisfaction', 'Clarity',
    'Logical Coherence', 'Completeness', 'Final Score'
]

# Data structure: for each field and each score value, the list of matching question_ids
score_distribution = {
    field: defaultdict(list)
    for field in expected_fields
}

# Analyze the CQ_Solver records
with open(input_file, 'r', encoding='utf-8') as f:
    for line in f:
        entry = json.loads(line)
        if entry.get("system") != target_system or entry.get("model") != "llama 3.3 70B":
            continue

        question_id = entry.get("question_id", "unknown")
        score = entry.get("score", {})

        for field in expected_fields:
            value = score.get(field)
            if isinstance(value, int) and 1 <= value <= 10:
                score_distribution[field][value].append(question_id)

# Print the results
print(f"\n📊 Score breakdown for system: {target_system}\n")

for field in expected_fields:
    print(f"🔹 Score field: {field}")
    for score_value in range(10, 0, -1):
        questions = score_distribution[field].get(score_value, [])
        print(f"  Score {score_value}: {len(questions)} records")
        if questions:
            print(f"    ➤ Question IDs: {', '.join(questions)}")
    print()


📊 Score breakdown for system: CQ_Solver

🔹 Score field: Factuality
  Score 10: 0 records
  Score 9: 8 records
    ➤ Question IDs: question_63, question_75, question_81, question_358, question_497, question_46, question_186, question_485
  Score 8: 91 records
    ➤ Question IDs: question_02, question_04, question_14, question_28, question_44, question_51, question_60, question_68, question_69, question_70, question_72, question_78, question_85, question_105, question_111, question_115, question_117, question_141, question_166, question_173, question_174, question_181, question_185, question_190, question_195, question_201, question_214, question_216, question_219, question_223, question_229, question_232, question_234, question_235, question_238, question_239, question_240, question_250, question_257, question_261, question_263, question_266, question_277, question_281, question_283, question_288, question_290, question_295, question_297, question_308, question_316, question_318, question_328, question_329, question_340, question_350, question_355,