# ML2025 Homework 1 - Retrieval Augmented Generation with Agents

## Environment Setup Phase

### Step 1: Install Required Packages and Download Model

This stage completes the following tasks:
1. **Install LLaMA Model Support Package**: `llama-cpp-python` for running the quantized version of LLaMA 3.1 8B model
2. **Install Web Search Related Packages**:
   - `googlesearch-python`: Google search client (it scrapes result pages; it is not an official Google API)
   - `bs4`: BeautifulSoup web parsing
   - `charset-normalizer`, `requests-html`, `lxml_html_clean`: Web content processing
3. **Download Model Weights**: Approximately 8GB quantized model file `Meta-Llama-3.1-8B-Instruct-Q8_0.gguf`
4. **Download Question Datasets**: `public.txt` and `private.txt` containing questions to be answered

**Note**: Model download requires significant time and sufficient storage space.

In [None]:
# Install the LLaMA runtime package (prebuilt wheel for CUDA 12.2)
!python3 -m pip install --no-cache-dir llama-cpp-python==0.3.4 --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu122

# Install the packages for web search and HTML parsing
!python3 -m pip install googlesearch-python bs4 charset-normalizer requests-html lxml_html_clean

from pathlib import Path

# Download the quantized LLaMA 3.1 8B model file (about 8 GB)
if not Path('./Meta-Llama-3.1-8B-Instruct-Q8_0.gguf').exists():
    !wget https://huggingface.co/bartowski/Meta-Llama-3.1-8B-Instruct-GGUF/resolve/main/Meta-Llama-3.1-8B-Instruct-Q8_0.gguf

# Download the public question set
if not Path('./public.txt').exists():
    !wget https://www.csie.ntu.edu.tw/~ulin/public.txt

# Download the private question set
if not Path('./private.txt').exists():
    !wget https://www.csie.ntu.edu.tw/~ulin/private.txt

### Step 2: GPU Environment Check

Make sure the runtime uses a GPU. Even the quantized LLaMA 3.1 8B model is very slow on CPU.

In [None]:
import torch

# Check that a GPU runtime is in use; raise an error otherwise
if not torch.cuda.is_available():
    raise Exception('You are not using the GPU runtime. Change it first or you will suffer from the super slow inference speed!')
else:
    print('You are good to go!')

## Model Loading and Inference Phase

### Step 3: Load LLaMA Model and Create Inference Function

This stage establishes the core inference capability of the entire system:

1. **Model Loading Configuration**:
   - `n_gpu_layers=-1`: Load all model layers onto GPU
   - `n_ctx=16384`: Set context window to 16K tokens (suitable for 16GB VRAM GPU)
   - `verbose=False`: Disable verbose logging to reduce output

2. **Inference Function Parameter Explanation**:
   - `max_tokens=512`: Caps the generation length to avoid overly long responses
   - `temperature=0`: Selects greedy decoding so that results are reproducible
   - `repeat_penalty=2.0`: Applies a strong penalty that discourages the model from repeating identical content

**Important**: Context window size directly affects memory usage and needs adjustment based on hardware.
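
To make the link between `n_ctx` and memory concrete, here is a rough back-of-the-envelope sketch of the KV-cache size. It assumes the published LLaMA 3.1 8B architecture (32 layers, 8 key/value heads, head dimension 128) and a 16-bit KV cache; the helper name `kv_cache_bytes` is hypothetical, and the estimate excludes the model weights and compute buffers.

```python
def kv_cache_bytes(n_ctx: int, n_layers: int = 32, n_kv_heads: int = 8,
                   head_dim: int = 128, bytes_per_value: int = 2) -> int:
    """Estimate KV-cache size: one key and one value vector per layer, head, and position."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * n_ctx

# Assumed architecture values for LLaMA 3.1 8B; adjust them for other models.
for n_ctx in (4096, 8192, 16384):
    gib = kv_cache_bytes(n_ctx) / 2**30
    print(f"n_ctx={n_ctx:>6}: ~{gib:.2f} GiB of KV cache")
```

Under these assumptions the cache grows linearly with `n_ctx`, reaching about 2 GiB at 16K tokens on top of the roughly 8 GB of weights, which is why a smaller GPU may need a smaller context window.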

In [None]:
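from typing import Dict, List

from llama_cpp import Llama

# Load the LLaMA 3.1 8B model onto the GPU
llama3 = Llama(
    "./Meta-Llama-3.1-8B-Instruct-Q8_0.gguf",  # path to the model file
    verbose=False,             # disable verbose logging
    n_gpu_layers=-1,           # offload all layers to the GPU (-1 means all)
    n_ctx=16384,               # context window: 16K tokens, sized for a 16 GB VRAM GPU
)

def generate_response(_model: Llama, _messages: List[Dict[str, str]]) -> str:
    '''
    Generate a response with the LLaMA model.

    Args:
        _model: the Llama model instance
        _messages: chat messages, each a dict with "role" and "content" keys

    Returns:
        str: the text generated by the model
    '''
    _output = _model.create_chat_completion(
        _messages,
        stop=["<|eot_id|>", "<|end_of_text|>"],  # stop tokens
        max_tokens=512,          # maximum number of generated tokens
        temperature=0,           # 0 means greedy decoding, so results are reproducible
        repeat_penalty=2.0,      # penalize repeated content
    )["choices"][0]["message"]["content"]
    return _output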

## Web Search Tool Phase

### Step 4: Implement Google Search and Web Content Extraction

This is the **information retrieval core** of the RAG system, responsible for obtaining relevant information from the web:

In [None]:
from typing import List
from googlesearch import search as _search
from bs4 import BeautifulSoup
from charset_normalizer import detect
import asyncio
from requests_html import AsyncHTMLSession
import urllib3
urllib3.disable_warnings()

async def worker(s: AsyncHTMLSession, url: str):
    '''
    Fetch the content of a single web page asynchronously.

    Args:
        s: the AsyncHTMLSession instance
        url: the URL to fetch

    Returns:
        str or None: the page HTML, or None on failure
    '''
    try:
        # Check the response headers first to confirm the page is HTML
        header_response = await asyncio.wait_for(s.head(url, verify=False), timeout=10)
        if 'text/html' not in header_response.headers.get('Content-Type', ''):
            return None

        # Fetch the full page content
        r = await asyncio.wait_for(s.get(url, verify=False), timeout=10)
        return r.text
    except Exception:
        # Treat timeouts and network errors as "no content" for this URL
        return None

async def get_htmls(urls):
    '''
    Fetch the HTML of several web pages concurrently.

    Args:
        urls: list of URLs

    Returns:
        list: HTML content for each URL (None where the fetch failed)
    '''
    session = AsyncHTMLSession()
    tasks = (worker(session, url) for url in urls)
    return await asyncio.gather(*tasks)

async def search(keyword: str, n_results: int=3) -> List[str]:
    '''
    Search for a keyword and return the text of the first n web pages.

    Args:
        keyword: the search keyword
        n_results: number of results to return

    Returns:
        List[str]: text content of the web pages

    Note: this may hit HTTP 429 errors (Google search rate limiting)
    '''
    keyword = keyword[:100]  # limit the keyword length

    # Get the search results (request twice as many in case some are unusable)
    results = list(_search(keyword, n_results * 2, lang="zh", unique=True))

    # Fetch the page HTML concurrently
    results = await get_htmls(results)

    # Drop failed fetches
    results = [x for x in results if x is not None]

    # Parse the HTML with BeautifulSoup
    results = [BeautifulSoup(x, 'html.parser') for x in results]

    # Extract plain text, strip all whitespace, and drop pages not detected as UTF-8
    results = [''.join(x.get_text().split()) for x in results if detect(x.encode()).get('encoding') == 'utf-8']

    # Return the first n results
    return results[:n_results]
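
The docstring of `search` notes that Google may answer with HTTP 429 when requests arrive too quickly. One common way to cope is to retry with exponential backoff. The sketch below is an illustration, not part of the original notebook: `retry_async`, `max_attempts`, and `base_delay` are hypothetical names, and it retries on any exception rather than inspecting the status code.

```python
import asyncio
import random

async def retry_async(make_call, max_attempts: int = 4, base_delay: float = 1.0):
    """Await make_call(); on failure, wait base_delay * 2**attempt plus jitter, then retry."""
    for attempt in range(max_attempts):
        try:
            return await make_call()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            # Exponential backoff with random jitter to avoid synchronized retries
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            await asyncio.sleep(delay)
```

In the notebook this could wrap the search call, for example `results = await retry_async(lambda: search(keywords, n_results=5))`. Backoff only spaces requests out; it does not remove the rate limit itself.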

### Step 5: Test Basic Inference Pipeline

Before building a complex RAG system, test whether the basic LLM inference functionality works properly. This test ensures:
- Model loads correctly and can perform inference normally
- Chinese output format meets Traditional Chinese requirements
- Inference speed is within acceptable range

In [None]:
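# Test basic LLM inference
test_question="請問誰是 Taylor Swift？"  # "Who is Taylor Swift?"

# Build the chat messages
# System prompt (kept in Chinese): "You are LLaMA-3.1-8B, an AI for answering questions.
# When using Chinese, you reply only in Traditional Chinese."
messages = [
    {"role": "system", "content": "你是 LLaMA-3.1-8B，是用來回答問題的 AI。使用中文時只會使用繁體中文來回問題。"},    # system prompt
    {"role": "user", "content": test_question}, # user question
]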

print(generate_response(llama3, messages))
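
To check the third point above, inference speed, the call can be timed. A minimal sketch, assuming a plain wall-clock measurement is enough; `timed` is a hypothetical helper, not part of the original notebook.

```python
import time
from typing import Any, Callable, Tuple

def timed(fn: Callable[..., Any], *args: Any, **kwargs: Any) -> Tuple[Any, float]:
    """Call fn(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start
```

For example, `reply, seconds = timed(generate_response, llama3, messages)` returns the model's reply together with the elapsed time for that single call.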

## AI Agent Architecture Phase

### Step 6: LLMAgent Class Design Explanation

This stage establishes the foundational architecture of the **multi-agent collaborative system**. The LLMAgent class is the core component of the entire RAG system:

**Agent Design Philosophy**:
- **Role Separation**: Each agent handles specific tasks (question understanding, keyword extraction, Q&A, etc.)
- **Modularity**: Individual agents can be easily replaced or adjusted
- **Scalability**: Additional specialized agents can be added in the future

**Class Attribute Explanation**:
- `role_description`: Defines the agent's identity and expertise domain
- `task_description`: Clearly specifies the specific task the agent needs to complete
- `llm`: Specifies the language model backend to use

**Inference Method Features**:
- **Prompt Engineering**: Places role description and task description in system and user prompts respectively
- **Format Processing**: Ensures input format matches LLaMA's conversation template
- **Extensibility**: Reserves interface to support other LLM models

This design allows us to create specialized agents to handle different stages in the RAG process.

In [None]:
class LLMAgent():
    def __init__(self, role_description: str, task_description: str, llm:str="bartowski/Meta-Llama-3.1-8B-Instruct-GGUF"):
        '''
        Initialize the LLM agent.

        Args:
            role_description: the agent's role (who it is)
            task_description: the agent's task (what it does)
            llm: identifier of the language model to use
        '''
        self.role_description = role_description    # role: the agent's identity (e.g. history expert, manager)
        self.task_description = task_description    # task: the specific job the agent should carry out
        self.llm = llm                              # LLM identifier: which model backend the agent uses

    def inference(self, message: str) -> str:
        '''
        Run inference and return the result.

        Args:
            message: the input message

        Returns:
            str: the agent's response
        '''
        if self.llm == 'bartowski/Meta-Llama-3.1-8B-Instruct-GGUF':  # default model
            # Format the messages: the role goes in the system prompt, the task in the user prompt
            messages = [
                {"role": "system", "content": f"{self.role_description}"},  # system role description
                {"role": "user", "content": f"{self.task_description}\n{message}"}, # task description + user message
            ]
            return generate_response(llama3, messages)
        else:
            # To use another LLM, implement its inference logic here
            return ""

### Step 7: Design Three Specialized Agents

Based on RAG process requirements, create three agents with distinct responsibilities:

**1. Question Extraction Agent (question_extraction_agent)**
- **Function**: Extract core questions from complex descriptions
- **Importance**: Remove interfering information for more precise search
- **Example**: Simplify "School songs are representative songs of schools, which school's song is 'Tiger Mountain Heroic Wind Flying'?" to "Which school's song is 'Tiger Mountain Heroic Wind Flying'?"

**2. Keyword Extraction Agent (keyword_extraction_agent)**
- **Function**: Extract 2-5 most suitable search keywords from questions
- **Strategy**: Focus on entity nouns, proper nouns, and other concrete searchable terms
- **Output Format**: Comma-separated keyword list

**3. Q&A Agent (qa_agent)**
- **Function**: Answer questions based on retrieved data
- **Role**: Serves as the final knowledge integrator
- **Output Requirements**: Use Traditional Chinese, answer based on provided context

Splitting the work into three stages lets each prompt focus on a single task, which is intended to make every step more accurate.

In [None]:
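# Define three specialized agents for the RAG flow (prompts are kept in Chinese)

# Agent 1: question extraction - pulls the core question out of a complex description
# Role: professional question analyst who answers only in Traditional Chinese.
# Task: extract the core question, ignore unrelated background, output one concise question.
question_extraction_agent = LLMAgent(
    role_description="你是一位專業的問題分析師，擅長從複雜的敘述中找出真正需要解決的問題。你只會用繁體中文回答。",
    task_description="請從下列敘述中，萃取出最核心、需要解答的問題，並忽略與問題無關的背景或多餘資訊。只需輸出精簡明確的問題句。",
)

# Agent 2: keyword extraction - picks keywords suitable for searching
# Role: keyword extraction expert who answers only in Traditional Chinese.
# Task: extract the 2 to 5 best search keywords or phrases, separated by commas.
keyword_extraction_agent = LLMAgent(
    role_description="你是一位專業的關鍵字萃取專家，擅長從問題中找出最適合用來搜尋的關鍵字。你只會用繁體中文回答。",
    task_description="請從下列問題中，萃取出最適合用來搜尋的 2~5 個關鍵字或短語。只需輸出關鍵字，並以逗號分隔。",
)

# Agent 3: Q&A - answers the question based on the retrieved data
# Role: LLaMA-3.1-8B, an AI that replies in Traditional Chinese. Task: "Please answer the following question:"
qa_agent = LLMAgent(
    role_description="你是 LLaMA-3.1-8B，是用來回答問題的 AI。使用中文時只會使用繁體中文來回問題。",
    task_description="請回答以下問題：",
)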
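
The keyword agent is asked for a comma-separated list, but a model replying in Chinese may use the full-width comma `，` or the enumeration comma `、` instead. The sketch below shows one way to normalize such a reply before searching; `parse_keywords` is a hypothetical helper and is not part of the original pipeline, which passes the raw reply straight to `search`.

```python
import re
from typing import List

def parse_keywords(raw: str, max_keywords: int = 5) -> List[str]:
    """Split a reply on ASCII or full-width commas, enumeration commas, or newlines."""
    parts = re.split(r"[,，、\n]+", raw)
    seen, keywords = set(), []
    for part in parts:
        kw = part.strip()
        if kw and kw not in seen:  # drop empty pieces and duplicates, keep order
            seen.add(kw)
            keywords.append(kw)
    return keywords[:max_keywords]

print(parse_keywords("2024年，巴黎奧運, 舉辦日期、巴黎奧運"))
# → ['2024年', '巴黎奧運', '舉辦日期']
```

The cleaned list could then be joined with spaces, for example `" ".join(parse_keywords(keywords))`, before it is handed to `search`.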

## RAG Core Implementation Phase

### Step 8: Install RAG-Related Packages

To implement Retrieval-Augmented Generation, the following key packages need to be installed:

**Core Package Explanation**:
- `sentence-transformers`: Pre-trained models for text vectorization
- `chromadb`: Lightweight vector database supporting similarity search
- `langchain`: Provides RAG toolchain and embedding wrappers
- `langchain-community`: Extends LangChain functionality

These packages will help us:
1. Convert text into high-dimensional vector representations
2. Store and quickly retrieve similar documents
3. Calculate semantic similarity scores

In [None]:
# Install the additional packages needed for RAG
!pip install sentence-transformers chromadb langchain
!pip install -U langchain-community

### Step 9: Load Multilingual Embedding Model

**Model Selection Rationale**:
- `paraphrase-multilingual-MiniLM-L12-v2` is a multilingual sentence-transformers model
- It supports Chinese, which makes it suitable for Traditional Chinese questions
- Its size is moderate (~471MB), balancing performance against resource usage

**Embedding Function**:
- Converts text into 384-dimensional vectors
- Semantically similar texts have closer distances in vector space
- Supports cross-lingual semantic search capabilities

This step lays the foundation for subsequent similarity calculations.
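
To illustrate what "closer in vector space" means, the sketch below computes cosine similarity on toy 3-dimensional vectors that stand in for the real 384-dimensional embeddings. The vectors are made up for illustration, and cosine similarity is only one common measure. The score returned later by `similarity_search_with_score` is a distance, where a lower value means a closer match.

```python
import math
from typing import Sequence

def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 1.0 means the same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors (hypothetical values), not real embeddings
query = [1.0, 0.0, 1.0]
similar = [0.9, 0.1, 1.1]
unrelated = [0.0, 1.0, 0.0]
print(cosine_similarity(query, similar) > cosine_similarity(query, unrelated))  # → True
```

With the real model, a vector for a piece of text can be obtained through `embedding_model.embed_query(text)` once the model has been loaded.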

In [None]:
from sentence_transformers import SentenceTransformer
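# These classes live in the langchain-community package installed in Step 8
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma

# Load the multilingual embedding model, which supports Chinese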
embedding_model = HuggingFaceEmbeddings(model_name="paraphrase-multilingual-MiniLM-L12-v2")

### Step 10: Implement Complete RAG Pipeline

This is the **core function** of the entire system, integrating all components to complete the end-to-end Q&A process:

In [None]:
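async def pipeline(question: str) -> str:
    """
    Full RAG flow: question extraction, search, chunking, retrieval, summary, answer.
    """
    # Stage 1: question understanding and preprocessing
    core_question = question_extraction_agent.inference(question)
    keywords = keyword_extraction_agent.inference(core_question)

    # Stage 2: information retrieval
    search_results = await search(keywords, n_results=5)

    # Stage 3: chunking and vectorization
    chunk_size = 500
    docs = []
    for doc in search_results:
        for i in range(0, len(doc), chunk_size):
            chunk = doc[i:i+chunk_size]
            if len(chunk) > 50:  # skip fragments too short to be useful
                docs.append(chunk)

    # If the search produced no usable text, answer without retrieved context
    if not docs:
        return qa_agent.inference(core_question)

    # Build the Chroma vector database
    vector_db = Chroma.from_texts(texts=docs, embedding=embedding_model)

    # Stage 4: similarity search
    top_k = 5
    relevant_docs_and_scores = vector_db.similarity_search_with_score(core_question, k=top_k)
    relevant_docs = [doc[0].page_content for doc in relevant_docs_and_scores]

    # Drop the collection; the default in-memory collection may otherwise be
    # reused, letting chunks from this question leak into the next one
    vector_db.delete_collection()

    # Summarize each chunk (prompt: "Summarize the following material into 100-character key points")
    summaries = []
    for chunk in relevant_docs:
        summary = qa_agent.inference(f"請將以下資料摘要成100字重點：\n{chunk}")
        summaries.append(summary)
    context = "\n".join(summaries)

    # Final answer (prompt: "Answer the question based on the following material ... Question: ...")
    final_input = f"根據以下資料回答問題：\n{context}\n問題：{core_question}"
    answer = qa_agent.inference(final_input)
    return answer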

### Step 11: Test RAG Pipeline

Use the 2024 Paris Olympics date as a test case to verify basic RAG system functionality:
- Test whether search function works properly
- Check answer generation quality
- Ensure the entire process runs smoothly

In [None]:
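# Test that the RAG pipeline runs end to end
# Question: "What are the dates of the 2024 Paris Olympics? Please explain in detail."
result = await pipeline("請問2024年巴黎奧運的舉辦日期是什麼？請詳細說明。")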
print(result)

## Batch Processing and Result Output Phase

### Step 12: Batch Process All Questions

**Processing Strategy Explanation**:
- **Resume Mechanism**: Check existing answer files to avoid duplicate processing
- **Per-Question Saving**: Save each answer immediately to prevent progress loss due to interruption
- **Memory Management**: Release related resources after processing each question

**File Naming Convention**:
- Individual answers: `{STUDENT_ID}_{question_number}.txt`
- Convenient for tracking progress and debugging

**Important Notes**:
- Colab environment may disconnect due to usage limits
- Mounting Google Drive ensures persistent file storage
- Re-execution will automatically skip completed questions

In [None]:
from pathlib import Path

# Set the student ID (replace with your actual student ID)
STUDENT_ID = "20250707"

STUDENT_ID = STUDENT_ID.lower()

# Process the questions in public.txt (the first 30 questions)
with open('./public.txt', 'r', encoding='utf-8') as input_f:
    questions = input_f.readlines()
    questions = [l.strip().split(',')[0] for l in questions]  # keep the question part, drop the answer

    for qid, question in enumerate(questions, 1):
        # Skip questions that already have an answer file (resume mechanism)
        if Path(f"./{STUDENT_ID}_{qid}.txt").exists():
            continue

        # Answer the question with the RAG pipeline
        answer = await pipeline(question)
        answer = answer.replace('\n',' ')  # remove newlines so each answer stays on one line
        print(qid, answer)

        # Save the answer immediately so an interruption does not lose progress
        with open(f'./{STUDENT_ID}_{qid}.txt', 'w', encoding='utf-8') as output_f:
            print(answer, file=output_f)

# Process the questions in private.txt (the last 60 questions)
with open('./private.txt', 'r', encoding='utf-8') as input_f:
    questions = input_f.readlines()
    for qid, question in enumerate(questions, 31):  # numbering starts at question 31
        # Skip questions that already have an answer file
        if Path(f"./{STUDENT_ID}_{qid}.txt").exists():
            continue

        # Answer the question with the RAG pipeline
        answer = await pipeline(question)
        answer = answer.replace('\n',' ')
        print(qid, answer)

        # Save the answer
        with open(f'./{STUDENT_ID}_{qid}.txt', 'w', encoding='utf-8') as output_f:
            print(answer, file=output_f)

### Step 13: Integrate Results and Generate CSV File

**Output Format Explanation**:
- **CSV Format**: Contains Question (Q) and Answer (A) columns
- **Encoding Handling**: Use UTF-8 to ensure proper Chinese display
- **Question Source**: Merge public.txt (first 30 questions) and private.txt (last 60 questions)

**File Purpose**:
- Convenient result viewing and analysis
- Meets assignment submission format requirements
- Can be imported into Excel and other tools for further processing

In [None]:
import csv

STUDENT_ID = "20250707"
output_csv = f'./{STUDENT_ID}.csv'

# Read all questions (public.txt + private.txt)
questions = []
with open('./public.txt', 'r', encoding='utf-8') as f:
    questions += [l.strip().split(',')[0] for l in f.readlines()]  # keep only the question part
with open('./private.txt', 'r', encoding='utf-8') as f:
    questions += [l.strip().split(',')[0] for l in f.readlines()]

# Write the results to the CSV file
with open(output_csv, 'w', newline='', encoding='utf-8') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(['Q', 'A'])  # header row

    for idx, question in enumerate(questions, 1):
        ans_path = f'./{STUDENT_ID}_{idx}.txt'
        try:
            # Read the matching answer file
            with open(ans_path, 'r', encoding='utf-8') as ans_f:
                answer = ans_f.readline().strip()
        except FileNotFoundError:
            answer = ''  # leave the answer empty if the file does not exist
        writer.writerow([question, answer])

### Step 14: Merge All Answers into Single Text File

Combine all 90 question answers into one text file in order, one answer per line. This format is convenient for:
- Quick browsing of all answers
- Batch processing or analysis
- Use as backup file

In [None]:
# Merge all answers into a single text file
with open(f"./{STUDENT_ID}.txt", "w", encoding="utf-8") as output_f:
    for id in range(1, 91):
        with open(f"./{STUDENT_ID}_{id}.txt", "r", encoding="utf-8") as input_f:
            answer = input_f.readline().strip()
            print(answer, file=output_f)

### Step 15: Package All Result Files

**Package Contents**:
- Main CSV result file
- 90 individual answer files

**Download Functionality**:
- Automatically generate download links
- Clean temporary folders to save space

In [None]:
import shutil
import os
from IPython.display import FileLink, display

STUDENT_ID = "20250707"

# 1. List the files to package
files_to_zip = [f"{STUDENT_ID}.csv"]  # main CSV result file
files_to_zip += [f"{STUDENT_ID}_{i}.txt" for i in range(1, 91)]  # 90 individual answer files

# 2. Create a temporary folder and copy the files into it
tmp_dir = "tmp_zip"
os.makedirs(tmp_dir, exist_ok=True)
for file in files_to_zip:
    if os.path.exists(file):
        shutil.copy(file, tmp_dir)

# 3. Compress the folder into a zip file
zip_name = f"{STUDENT_ID}_all_answers"
shutil.make_archive(zip_name, 'zip', tmp_dir)

# 4. Display a download link (intended for the Colab environment)
display(FileLink(f"{zip_name}.zip"))

# 5. Remove the temporary folder to save space
shutil.rmtree(tmp_dir)

## System Performance Analysis and Problem Summary

### Main Reasons for Slow Runtime

**1. Multiple LLM Inference Calls**
- Each question requires up to 8 model inferences (1 + 1 + 5 + 1):
  - Question Extraction Agent: 1 time
  - Keyword Extraction Agent: 1 time  
  - Summary generation: 5 times (for each relevant document)
  - Final Q&A: 1 time
- Each inference requires GPU computation, accumulating significant time
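
The counts listed above can be tallied with a few lines of arithmetic. This is only a sketch of the bookkeeping: `llm_calls_per_question` is a hypothetical helper, and the real count is lower when retrieval returns fewer than `top_k` chunks.

```python
def llm_calls_per_question(top_k: int = 5) -> int:
    """Question extraction + keyword extraction + one summary per chunk + final answer."""
    return 1 + 1 + top_k + 1

print(llm_calls_per_question())        # → 8
print(90 * llm_calls_per_question())   # → 720 calls for all 90 questions
```

Lowering `top_k`, or skipping the per-chunk summaries, is therefore the most direct way to cut the number of model calls.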

**2. Sequential Execution Bottleneck**
- The pipeline stages depend on each other's output and run serially; the five summary calls are also issued one at a time
- Must wait for web search results before proceeding with subsequent processing
- Vectorization and similarity calculations need to be completed step by step

**3. Network I/O Overhead**
- Google Search API call latency
- Network latency from parallel web page scraping
- Each page costs a HEAD request and a GET request, each with a 10-second timeout, so slow sites add waiting time

**4. Vector Operation Cost**
- Document embedding computation (each chunk needs vectorization)
- Distance calculations for similarity search
- ChromaDB creation and query operations

### Answer Accuracy Problem Analysis

**Core Issue**: Fundamental reasons why the RAG system cannot accurately answer questions

**1. Keyword Extraction Failure**
- Example: "Which school's song is 'Tiger Mountain Heroic Wind Flying'?"
- System-extracted keywords may be too broad
- Causes search results to deviate from question core

**2. Low Search Result Relevance**
- Google search returns web content that doesn't match questions
- Particularly for specific, detailed questions
- Lacks verification mechanism for search result quality

**3. Semantic Similarity Misjudgment**
- Embedding model may not correctly understand Chinese semantic differences
- Vector similarity search finds document fragments that aren't truly relevant
- Fixed 500-character splitting may break semantic integrity

**4. Answer Generation Drift**
- QA Agent generates answers based on incorrect or irrelevant context
- Lacks assessment of retrieved content credibility
- Model tends to answer "based on data" even when data is irrelevant

### Implementation Improvement Suggestions

**Short-term Improvements**:
1. Adjust keyword extraction strategy, add entity recognition
2. Increase search result relevance filtering
3. Implement multi-round search mechanism (re-search when first attempt fails)
4. Improve document segmentation method (semantic boundary cutting)
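
As an illustration of item 4, the sketch below packs whole sentences into chunks instead of cutting every 500 characters. It is a minimal example under stated assumptions: `split_on_sentences` is a hypothetical helper, sentence ends are detected only by the punctuation marks `。！？!?`, and a single sentence longer than the limit is still hard-cut.

```python
import re
from typing import List

def split_on_sentences(text: str, max_chars: int = 500) -> List[str]:
    """Pack whole sentences into chunks of at most max_chars characters."""
    # Split right after sentence-ending punctuation, keeping the punctuation
    sentences = [s for s in re.split(r"(?<=[。！？!?])", text) if s]
    chunks, current = [], ""
    for sentence in sentences:
        # Close the current chunk if this sentence would overflow it
        if current and len(current) + len(sentence) > max_chars:
            chunks.append(current)
            current = ""
        # Fallback: hard-cut a single sentence that exceeds the limit
        while len(sentence) > max_chars:
            chunks.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        current += sentence
    if current:
        chunks.append(current)
    return chunks

print(split_on_sentences("今天天氣很好。我們去公園！要帶水嗎？", max_chars=12))
# → ['今天天氣很好。', '我們去公園！要帶水嗎？']
```

In the pipeline, this could replace the fixed-size `range(0, len(doc), chunk_size)` loop, at the cost of chunks with uneven lengths.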

**Long-term Optimization**:
1. Use specialized Chinese embedding models
2. Build question type classification system
3. Implement answer confidence assessment
4. Add knowledge graph-assisted retrieval
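
A simple starting point for relevance filtering and for a crude confidence signal is to discard retrieved chunks whose distance to the question is too large. The sketch below assumes the score from `similarity_search_with_score` is a distance where lower means closer; `filter_by_distance`, the sample pairs, and the threshold value are hypothetical and would need tuning on real data.

```python
from typing import List, Tuple

def filter_by_distance(scored_chunks: List[Tuple[str, float]],
                       max_distance: float) -> List[str]:
    """Keep chunks whose distance score is at most max_distance (lower = closer)."""
    return [text for text, distance in scored_chunks if distance <= max_distance]

# Hypothetical (text, distance) pairs; real scores depend on the embedding model
scored = [("chunk A", 4.2), ("chunk B", 9.8), ("chunk C", 17.5)]
kept = filter_by_distance(scored, max_distance=10.0)
print(kept)  # → ['chunk A', 'chunk B']
if not kept:
    print("No sufficiently close chunk; consider re-searching or abstaining.")
```

In the pipeline, the input pairs could be built as `[(doc.page_content, score) for doc, score in relevant_docs_and_scores]`. An empty result is a natural trigger for the multi-round search mentioned in the short-term list.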

### System Applicability Assessment

**Suitable Question Types**:
- General knowledge questions
- Current event-related queries
- Latest information requiring web search

**Unsuitable Question Types**:
- Questions requiring precise answers
- Local, detailed professional knowledge
- Mathematical questions requiring reasoning or calculation

This analysis suggests that the current RAG implementation is better suited to general Q&A than to precise question answering in specific domains.