# ML2025 Homework 1 - Retrieval Augmented Generation with Agents

## Environment Setup

In this section, we install the necessary Python packages and download the weights of the quantized LLaMA 3.1 8B model, along with the dataset. Note that the model weights are around 8 GB.

In [1]:
!python3 -m pip install --no-cache-dir llama-cpp-python==0.3.4 --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu122
!python3 -m pip install googlesearch-python bs4 charset-normalizer requests-html lxml_html_clean

from pathlib import Path
if not Path('./Meta-Llama-3.1-8B-Instruct-Q8_0.gguf').exists():
    !wget https://huggingface.co/bartowski/Meta-Llama-3.1-8B-Instruct-GGUF/resolve/main/Meta-Llama-3.1-8B-Instruct-Q8_0.gguf
if not Path('./public.txt').exists():
    !wget https://www.csie.ntu.edu.tw/~ulin/public.txt
if not Path('./private.txt').exists():
    !wget https://www.csie.ntu.edu.tw/~ulin/private.txt

Looking in indexes: https://pypi.org/simple, https://abetlen.github.io/llama-cpp-python/whl/cu122
Collecting llama-cpp-python==0.3.4
  Downloading https://github.com/abetlen/llama-cpp-python/releases/download/v0.3.4-cu122/llama_cpp_python-0.3.4-cp310-cp310-linux_x86_64.whl (445.2 MB)
Collecting diskcache>=5.6.1 (from llama-cpp-python==0.3.4)
  Downloading diskcache-5.6.3-py3-none-any.whl.metadata (20 kB)
Downloading diskcache-5.6.3-py3-none-any.whl (45 kB)
Installing collected packages: diskcache, llama-cpp-python
Successfully installed diskcache-5.6.3 llama-cpp-python-0.3.4
Collecting googlesearch-python
  Downloading googlesearch_python-1.3.0-py3-none-any.whl.metadata (3.4 kB)
Collecting bs4
  Downloading bs4-0.0.2-py2.p

In [2]:
import torch
if not torch.cuda.is_available():
    raise Exception('You are not using the GPU runtime. Switch to a GPU runtime first, or inference will be extremely slow!')
else:
    print('You are good to go!')

You are good to go!


## Prepare the LLM and LLM utility function

By default, we use the quantized version of LLaMA 3.1 8B. You can get full marks on this homework with the provided LLM and utility function, but you are also free to try other LLMs.

In the following code block, we first load the downloaded model weights onto the GPU.
Then, we implement the generate_response() function so that you can get a response from the model more easily.

You can ignore the "llama_new_context_with_model: n_ctx_per_seq (16384) < n_ctx_train (131072) -- the full capacity of the model will not be utilized" warning.

In [3]:
from llama_cpp import Llama

# Load the model onto GPU
llama3 = Llama(
    "./Meta-Llama-3.1-8B-Instruct-Q8_0.gguf",
    verbose=False,
    n_gpu_layers=-1,
    n_ctx=16384,    # Context window size in tokens. Larger is better but uses more memory; 16384 is a reasonable value for a GPU with 16 GB of VRAM.
)

def generate_response(_model: Llama, _messages: list) -> str:
    '''
    Run inference on the model with the given chat messages (a list of {"role": ..., "content": ...} dicts).
    '''
    _output = _model.create_chat_completion(
        _messages,
        stop=["<|eot_id|>", "<|end_of_text|>"],
        max_tokens=512,    # This argument is how many tokens the model can generate.
        temperature=0,      # This argument is the randomness of the model. 0 means no randomness. You will get the same result with the same input every time. You can try to set it to different values.
        repeat_penalty=2.0,
    )["choices"][0]["message"]["content"]
    return _output

llama_new_context_with_model: n_ctx_per_seq (16384) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
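
Because `n_ctx=16384` caps the prompt and the generated tokens together, long inputs such as fetched web pages have to be shortened before they reach the model. The following is a minimal sketch of one way to do that, assuming a plain character budget is an acceptable stand-in for a token count. `truncate_text` and `max_chars` are hypothetical names that are not part of the homework code, and the default of 4000 is an arbitrary value to tune.

```python
def truncate_text(text: str, max_chars: int = 4000) -> str:
    """Keep at most max_chars characters of text (a crude proxy for a token budget)."""
    if max_chars < 0:
        raise ValueError("max_chars must be non-negative")
    # Slicing past the end is safe, so short inputs are returned unchanged.
    return text[:max_chars]


# A long page is cut down to the budget; a short one is left as it is.
long_page = "資料" * 10000
print(len(truncate_text(long_page)))
print(truncate_text("short text"))
```

A character budget is only an approximation. If you need an exact limit, count tokens with the tokenizer of the model you load instead.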


## Search Tool

The TA has implemented a search tool that looks up keywords with Google Search. You can use it to find **web pages** relevant to a given question and integrate it into the following sections.

In [4]:
from typing import List
from googlesearch import search as _search
from bs4 import BeautifulSoup
from charset_normalizer import detect
import asyncio
from requests_html import AsyncHTMLSession
import urllib3
urllib3.disable_warnings()

async def worker(s:AsyncHTMLSession, url:str):
    try:
        header_response = await asyncio.wait_for(s.head(url, verify=False), timeout=10)
        if 'text/html' not in header_response.headers.get('Content-Type', ''):
            return None
        r = await asyncio.wait_for(s.get(url, verify=False), timeout=10)
        return r.text
    except Exception:  # Timeouts and network errors are skipped; cancellation still propagates.
        return None

async def get_htmls(urls):
    session = AsyncHTMLSession()
    tasks = (worker(session, url) for url in urls)
    return await asyncio.gather(*tasks)

async def search(keyword: str, n_results: int=3) -> List[str]:
    '''
    Search for the keyword and return the text content of the first n_results web pages.
    Warning: You may get HTTP 429 errors if you search too often within a short period. This is unavoidable, so request more results at once at your own risk.
    Google does not publish the rate limit, so there is not much we can do except change the IP or wait until Google lifts the ban (we do not know how long the penalty lasts either).
    '''
    keyword = keyword[:100]
    # First, search the keyword and get the results. Also, get 2 times more results in case some of them are invalid.
    results = list(_search(keyword, n_results * 2, lang="zh", unique=True))
    # Then, get the HTML from the results. Also, the helper function will filter out the non-HTML urls.
    results = await get_htmls(results)
    # Filter out the None values.
    results = [x for x in results if x is not None]
    # Parse the HTML.
    results = [BeautifulSoup(x, 'html.parser') for x in results]
    # Get the text from the HTML and remove the spaces. Also, filter out the non-utf-8 encoding.
    results = [''.join(x.get_text().split()) for x in results if detect(x.encode()).get('encoding') == 'utf-8']
    # Return the first n results.
    return results[:n_results]
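
Since the search tool can hit HTTP 429 rate limits, one option is to retry with increasing delays instead of failing on the first error. The sketch below assumes the rate-limit error surfaces as an exception raised by the search coroutine. `search_with_retry`, `max_attempts`, and `base_delay` are hypothetical names introduced for illustration, not part of the TA's code.

```python
import asyncio
from typing import Awaitable, Callable, List


async def search_with_retry(
    search_fn: Callable[[str], Awaitable[List[str]]],
    keyword: str,
    max_attempts: int = 3,
    base_delay: float = 1.0,
) -> List[str]:
    """Call search_fn(keyword), retrying with exponential backoff when it raises."""
    for attempt in range(max_attempts):
        try:
            return await search_fn(keyword)
        except Exception as exc:
            # Give up and re-raise once the last attempt has failed.
            if attempt == max_attempts - 1:
                raise
            delay = base_delay * (2 ** attempt)  # base_delay, then 2x, 4x, ...
            print(f"search failed ({exc!r}); retrying in {delay:.1f}s")
            await asyncio.sleep(delay)
    # Only reached when max_attempts is zero or negative.
    return []
```

In the notebook this could be called as `results = await search_with_retry(search, keyword)`. Backoff does not guarantee that a ban is lifted, so keeping the number of searches low is still the safer choice.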

## Test the LLM inference pipeline

In [5]:
# You can try out different questions here.
test_question='請問誰是 Taylor Swift？'

messages = [
    {"role": "system", "content": "你是 LLaMA-3.1-8B，是用來回答問題的 AI。使用中文時只會使用繁體中文來回問題。"},    # System prompt
    {"role": "user", "content": test_question}, # User prompt
]

print(generate_response(llama3, messages))

泰勒絲（Taylor Swift）是一位美國歌手、詞曲作家和音樂製作人。她出生於1989年，來自田納西州。她的音乐风格从乡村乐逐渐转变为流行摇滚，并且她被誉為當代最成功的女艺人的之一。

泰勒絲早期在鄉郊小鎮演唱會時開始發展音樂事業，並於2006年發布首張專輯《Taylor Swift》。隨後，她推出了多张专辑，包括 《Fearless》（勇敢）、_Speak Now（說出來）和 _1989 等。她以她的歌曲如 "Shake It Off"、"_Blank Space_" 和 "_Bad Blood 》等获得了广泛的认可。

泰勒絲也是一位頗具爭議性的藝人，她曾經與多個音樂家發生過創作權和商業糾紛。然而，無論如何她都成為了一代人的偶像，並且她的音乐影響力在全球各地都是非常大的！


## Agents

The TA has implemented the Agent class for you. You can use this class to create agents that can interact with the LLM model. The Agent class has the following attributes and methods:
- Attributes:
    - role_description: The role of the agent. For example, if you want this agent to be a history expert, you can set the role_description to "You are a history expert. You will only answer questions based on what really happened in the past. Do not generate any answer if you don't have reliable sources.".
    - task_description: The task of the agent. For example, if you want this agent to answer questions only in yes/no, you can set the task_description to "Please answer the following question in yes/no. Explanations are not needed."
    - llm: Just an indicator of the LLM model used by the agent.
- Method:
    - inference: This method takes a message as input and returns the generated response from the LLM model. The message will first be formatted into proper input for the LLM model. (This is where you can set some global instructions like "Please speak in a polite manner" or "Please provide a detailed explanation".) The generated response will be returned as the output.
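
The formatting step described for `inference` can be sketched as a small standalone function. This is only an illustration of one possible prompt layout, assuming labeled sections separate the task description from the user message. `build_messages` and the section labels are hypothetical and not part of the provided class.

```python
from typing import Dict, List


def build_messages(role_description: str, task_description: str, message: str) -> List[Dict[str, str]]:
    """Format one system prompt and one user prompt for a chat completion call."""
    # Global instructions go into the system prompt, after the role.
    system_prompt = f"{role_description}\nAnswer in Traditional Chinese only."
    # Labeled sections keep the task description apart from the user message.
    user_prompt = f"### Task\n{task_description}\n\n### Message\n{message}"
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt},
    ]


messages = build_messages("你是歷史專家。", "請只回答是或否。", "秦始皇統一了六國嗎？")
for m in messages:
    print(m["role"], "->", m["content"])
```

The returned list has the same shape as the `messages` argument passed to `generate_response`, so it could replace the inline list inside `inference`.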

In [6]:
class LLMAgent():
    def __init__(self, role_description: str, task_description: str, llm:str="bartowski/Meta-Llama-3.1-8B-Instruct-GGUF"):
        self.role_description = role_description   # Role means who this agent should act like. e.g. the history expert, the manager......
        self.task_description = task_description    # Task description instructs what task should this agent solve.
        self.llm = llm  # LLM indicates which LLM backend this agent is using.
    def inference(self, message:str) -> str:
        if self.llm == 'bartowski/Meta-Llama-3.1-8B-Instruct-GGUF': # If using the default one.
            # TODO: Design the system prompt and user prompt here.
            # Format the messages first.
            messages = [
                {"role": "system", "content": f"{self.role_description}"},  # Hint: you may want the agents to speak Traditional Chinese only.
                {"role": "user", "content": f"{self.task_description}\n{message}"}, # Hint: you may want the agents to clearly distinguish the task descriptions and the user messages. A proper separator text rather than a simple line break is recommended.
            ]
            return generate_response(llama3, messages)
        else:
            # TODO: If you want to use LLMs other than the given one, please implement the inference part on your own.
            return ""

TODO 1: Design the role description and task description for each agent.

In [7]:
# TODO: Design the role and task description for each agent.

STUDENT_ID = "200505081050"

# This agent may help you filter out the irrelevant parts in question descriptions.
question_extraction_agent = LLMAgent(
    role_description="你是一個專業的問題提煉專家，能夠過濾問題中的背景描述，但必須完整保留所有專有名詞、地名和問題的具體查詢目標",
    task_description="請提煉以下問題的核心，過濾掉多餘的背景描述，但必須嚴格遵守：1. 保留所有專有名詞和地名的完整形式；2. 絕對不可更改問題的查詢目標（如「哪間科技公司」不可簡化為「誰發明了」）；3. 保留問題詢問的所有具體要素（如詢問「產品」時不可改為「發明」）；4. 確保提煉後的問題完整表達原問題的精確查詢意圖；5. 直接輸出提煉後的問題，不要包含「提煉後的問題是：」等贅字。例如「Windows 作業系統是哪間科技公司的產品？」不可簡化為「誰發明了 Windows 作業系統？」",
)

# This agent analyzes the question and decides the handling method
plan_agent = LLMAgent(
    role_description="你是一個專業的問題分析專家，能夠判斷問題的類型並決定最佳處理方式，不要直接回答问题",
    task_description="請分析以下問題並決定最佳的處理方式，嚴格按照以下規則：1. 如果是純數學計算問題，回傳「math」；2. 如果是基礎科學知識、歷史重大事件或地理常識等你絕對確定答案的問題，回傳「direct_response」；3. 如果是流行文化、名人言論、電影台詞、特定產品、時事新聞、體育賽事或任何你不能100%確定答案的問題，必須回傳「web_search」；4. 任何有爭議、需要最新資訊或專業領域知識的問題，回傳「web_search」。只回傳單一關鍵字，不要有任何解釋。",
)

# This agent may help you extract the keywords in a question so that the search tool can find more accurate results.
keyword_extraction_agent = LLMAgent(
    role_description="你是一個專業的搜尋關鍵字提取器，精確提取問題中的所有關鍵資訊，形成有效的web搜索詞組",
    task_description="請從問題中提取用於web搜索的關鍵字，嚴格遵守以下規則：1. 提取所有重要名詞、事件、人物、時間詞和關鍵概念；2. 必須包含問題的主要事件和情境；3. 必須保留問題的查詢意圖和目標；4. 絕對不可添加問題中不存在的詞語；5. 絕對不可提供問題的預設答案作為關鍵詞；6. 將相關概念組合為有意義的詞組；7. 完全去除虛詞和不影響搜索的介詞；8. 確保關鍵詞能直接用於web搜索且能獲得相關結果；9. 關鍵詞之間用空格分隔；10. 對「是誰」「哪個」等疑問句，必須保留查詢目標而非預測答案。例子：「藝人大S是在去哪個國家旅遊時因病去世？」應提取「大S 國家 旅遊 去世」；「是誰發現了萬有引力？」應提取「萬有引力 發現者」；「最新的輝達顯卡是出到「GeForce RTX 多少」系列？」應提取「輝達 GeForce RTX 最新系列 型號」。請只返回空格分隔的關鍵詞列表",
)

# This agent judges whether a retrieved document is relevant to the question and keeps only the useful parts.
result_judgment_agent = LLMAgent(
    role_description="你是一個專業的文件相關性評估專家，善於判斷文件對問題的價值",
    task_description="請判斷以下文件是否包含回答問題所需的關鍵資訊。評估標準：1. 文件必須直接相關於問題的核心要素（如專有名詞、地名）；2. 如果文件有幫助，請過濾掉不相關的內容，僅保留對回答問題有用的關鍵資訊；3. 如果內容確實相關，不必特別說明「True」，直接返回精簡後的相關內容；4. 如果判斷完全沒有幫助，只要回傳「False」。請使用繁體中文回覆。",
)

# This agent is the core component that answers the question.
qa_agent = LLMAgent(
    role_description="你是一個專業的問答專家，擅長根據提供的資料準確回答問題",
    task_description="請根據提供的文件內容，準確回答問題。回答要求：1. 直接給出明確、具體的答案；2. 如有多個可能答案，請選擇最符合問題要求的；3. 優先使用文件中提供的專業術語和專有名詞；4. 不必說明是根據文件內容回答，直接給出答案即可；5. 回答必須使用繁體中文，語言要精準簡潔。",
)

# This agent handles direct responses without requiring web search
direct_response_agent = LLMAgent(
    role_description="你是一個知識豐富的問答專家，善於根據已知常識回答問題",
    task_description="請直接回答以下問題，無需網路搜尋。回答要求：1. 只回答你絕對確定無誤的事實；2. 僅限於基礎科學、歷史重大事件或地理常識等領域的問題；3. 對於流行文化、電影台詞、名人言論等內容，應該明確表示需要查證；4. 如果有任何不確定，請回覆「這個問題需要查證」而非給出可能錯誤的答案；5. 答案必須簡潔精確；6. 回答必須使用繁體中文；7. 如果問題涉及數學計算，請給出計算過程和最終結果。",
)

# This agent handles mathematical questions
math_agent = LLMAgent(
    role_description="你是一個數學專家，擅長解決各種數學問題",
    task_description="請解答以下數學問題。回答要求：1. 清晰列出解題思路和計算步驟；2. 給出最終精確答案；3. 對於複雜計算，確保結果的準確性；4. 回答必須使用繁體中文；5. 如適用，可使用數學符號和公式表示。",
)

## RAG pipeline

TODO 2: Implement the RAG pipeline.

Please refer to the homework description slides for hints.

There may also be more heuristics (e.g. classifying the questions by length, deciding whether a question needs a search, reconfirming the answer before returning it to the user) that are not shown in the flow charts. You can use your creativity to come up with a better solution!

- Naive approach (simple baseline)

    ![](https://www.csie.ntu.edu.tw/~ulin/naive.png)

- Naive RAG approach (medium baseline)

    ![](https://www.csie.ntu.edu.tw/~ulin/naive_rag.png)

- RAG with agents (strong baseline)

    ![](https://www.csie.ntu.edu.tw/~ulin/rag_agent.png)
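
One heuristic of this kind is to normalize the planner's reply before routing, because an LLM may add whitespace, quotes, or extra words around the label. The sketch below assumes the three labels used in the `plan_agent` task description and treats web search as the safe default. `normalize_plan` and `VALID_PLANS` are hypothetical helpers, not part of the provided code.

```python
VALID_PLANS = ("math", "direct_response", "web_search")


def normalize_plan(raw: str, default: str = "web_search") -> str:
    """Map a raw planner reply to one of the known labels, or to default."""
    # Drop surrounding whitespace, then quotes and trailing punctuation.
    cleaned = raw.strip().strip("「」\"'。.").lower()
    if cleaned in VALID_PLANS:
        return cleaned
    # Accept replies that merely contain a label somewhere in the text.
    for label in VALID_PLANS:
        if label in cleaned:
            return label
    return default


print(normalize_plan(" web_search\n"))
print(normalize_plan("「math」"))
print(normalize_plan("unclear reply"))
```

Falling back to web search costs one extra query for unexpected replies, but it avoids a pipeline branch that returns nothing.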

In [8]:
async def pipeline(question: str) -> str:
    # TODO: Implement your pipeline.
    # Currently, it only feeds the question directly to the LLM.
    # You may want to get the final results through multiple inferences.
    # Just a quick reminder, make sure your input length is within the limit of the model context window (16384 tokens), you may want to truncate some excessive texts.
    #return qa_agent.inference(question)
    print("原始問題: ", question)
    extracted_question = question_extraction_agent.inference(question)
    print("簡化問題: ", extracted_question)
    
    # Use plan_agent to decide how to handle the question.
    plan = plan_agent.inference(extracted_question).strip()
    print("問題類型: ", plan)
    
    # Run a different strategy depending on the question type.
    if plan == "direct_response":
        print("使用直接回應")
        response = direct_response_agent.inference(extracted_question)
        # If direct_response_agent says the answer needs verification, fall back to web search.
        if "這個問題需要查證" in response:
            print("需要查證，改用網路搜索")
            plan = "web_search"
        else:
            return response
    
    if plan == "math":
        print("處理數學問題")
        return math_agent.inference(extracted_question)
    
    elif plan == "web_search":
        print("使用網路搜索")
        # Extract the search keywords.
        keyword = keyword_extraction_agent.inference(extracted_question)
        print("keyword: ", keyword)
        
        # Search and process the results.
        docs = []
        search_result = await search(keyword)
        for result in search_result:
            try:
                #import ipdb; ipdb.set_trace()
                #print("judgement_result: ", result)
                result_judgment = result_judgment_agent.inference("問題：" + extracted_question + "\n文件內容：" + result)
                print("result_judgment: ", result_judgment[:200] + "..." if len(result_judgment) > 200 else result_judgment)
                if result_judgment.strip() != "False":
                    docs.append(result_judgment)
            except ValueError as e:
                if "exceed context window" in str(e):
                    print(f"文件太長，超出上下文窗口限制: {e}")
                    continue
                else:
                    print(f"其它错误: {e}")
                    continue
                    #raise e
        
        if not docs:
            return "沒有找到相關資料"
        
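        # Answer the question.
        return qa_agent.inference("docs: " + str(docs) + "\nquestion: " + extracted_question)

    # Fallback: the planner returned an unexpected label, so answer directly instead of returning None.
    return qa_agent.inference(extracted_question)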

## Answer the questions using your pipeline!

Since Colab has usage limits, you might get disconnected. The following code saves your answer to each question as it goes. If you have mounted your Google Drive as instructed, you can simply rerun the whole notebook to resume where you left off.

In [None]:
from pathlib import Path

# Fill in your student ID first.
STUDENT_ID = "20250508"

STUDENT_ID = STUDENT_ID.lower()
with open('./public.txt', 'r') as input_f:
    questions = input_f.readlines()
    questions = [l.strip().split(',')[0] for l in questions]
    for id, question in enumerate(questions, 1):
        if Path(f"./{STUDENT_ID}_{id}.txt").exists():
            continue
        answer = await pipeline(question)
        answer = answer.replace('\n',' ')
        print(id, answer)
        with open(f'./{STUDENT_ID}_{id}.txt', 'w') as output_f:
            print(answer, file=output_f)

with open('./private.txt', 'r') as input_f:
    questions = input_f.readlines()
    questions = [l.strip() for l in questions]  # Drop trailing newlines so the prompts stay clean.
    for id, question in enumerate(questions, 31):
        if Path(f"./{STUDENT_ID}_{id}.txt").exists():
            continue
        answer = await pipeline(question)
        answer = answer.replace('\n',' ')
        print(id, answer)
        with open(f'./{STUDENT_ID}_{id}.txt', 'w') as output_f:  # 'w' matches the public loop; the file does not exist at this point.
            print(answer, file=output_f)

原始問題:  校歌為學校（包括小學、中學、大學等）宣告或者規定的代表該校的歌曲。用於體現該校的治學理念、辦學理想等學校文化。「虎山雄風飛揚」是哪間學校的校歌歌詞？
簡化問題:  虎山雄風飛揚是哪間學校的校歌？
問題類型:  web_search
使用網路搜索
keyword:  虎山雄風飛揚 校歌
文件太長，超出上下文窗口限制: Requested tokens (368024) exceed context window of 16384
result_judgment:  False
result_judgment:  根據文件內容，中興新村第一國民學校（今光華 國小）的校歌是「貓羅溪旁虎山雄風飛揚」。
1 中興新村第一國民學校（今光華 國小）的校歌是「貓羅溪旁虎山雄風飛揚」。
原始問題:  2025年初，NCC透過行政命令，規定民眾如果透過境外郵購無線鍵盤、滑鼠、藍芽耳機..等自用產品回台，每案一律加收審查費多少錢？
簡化問題:  2025年初，NCC規定民眾透過境外郵購無線鍵盤、滑鼠和藍芽耳機等自用產品回台，每案加收審查費多少錢？
問題類型:  web_search
使用網路搜索
keyword:  NCC 農委會 境外郵購 審查費
result_judgment:  False
result_judgment:  根據文件內容，NCC規定民眾透過境外郵購無線鍑盤、滑鼠和藍芽耳機等自用產品回台，每案加收審查費750元。...
result_judgment:  文件內容提到，NCC今年2月起新增審查費，每案新台幣750元，這項政策引發質疑。國民黨立法院党团書記长王鴻薇表示，这是变相关税务，对人民造成负担；国 民 党 立 法 院 议员 廖 先 翔 已 发 文 要求 NCC 暂停收费新制。

文件內容提到，N CC 历史上预告电信管理业务规则费用标准第14条修正草案，这项政策将对民众购买2部以下自用第二级通信管控射频器材（如手機、蓝牙耳机等）收取750元审查...
2 根據文件內容，NCC規定民眾透過境外郵購無線鍑盤、滑鼠和藍芽耳機等自用產品回台，每案加收審查費750元。
原始問題:  第一代 iPhone 是由哪位蘋果 CEO 發表？
簡化問題:  第一代 iPhone 是由史蒂夫·乔布斯（Steve Jobs）發表。
問題類型:  direct_respon

In [None]:
# Combine the results into one file.
with open(f'./{STUDENT_ID}.txt', 'w') as output_f:
    for id in range(1,91):
        with open(f'./{STUDENT_ID}_{id}.txt', 'r') as input_f:
            answer = input_f.readline().strip()
            print(answer, file=output_f)
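
The merge step above assumes that all 90 per-question files exist, and it stops with a `FileNotFoundError` if a run was interrupted. A minimal sketch of a check to run first follows, assuming the same `{STUDENT_ID}_{id}.txt` naming. `missing_answer_ids` is a hypothetical helper, not part of the provided notebook.

```python
from pathlib import Path
from typing import List


def missing_answer_ids(student_id: str, total: int = 90, directory: str = ".") -> List[int]:
    """Return the question ids whose per-question answer file does not exist yet."""
    return [
        i for i in range(1, total + 1)
        if not (Path(directory) / f"{student_id}_{i}.txt").exists()
    ]


# List the gaps for a placeholder id in the current directory.
print(missing_answer_ids("example_id", total=3))
```

If the returned list is not empty, rerunning the answering cell fills in only those questions, because existing files are skipped.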