# ML2025 Homework 1 - Retrieval Augmented Generation with Agents

## Environment Setup

First, mount your Google Drive and change the working directory.

In [None]:
from google.colab import drive
drive.mount('/content/drive/')

Mounted at /content/drive/


In [None]:
# Change the working directory to somewhere in your Google Drive.
# You could check the path by right clicking on the folder.
%cd /content/drive/MyDrive/2025_NTU_ML/HW1/

/content/drive/MyDrive/2025_NTU_ML


In this section, we install the necessary Python packages and download the weights of the quantized version of LLaMA 3.1 8B, along with the dataset. Note that the model weights are around 8 GB. If you are using your Google Drive as the working directory, make sure you have enough space for the model.

In [None]:
!python3 -m pip install --no-cache-dir llama-cpp-python==0.3.4 --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu122
!python3 -m pip install googlesearch-python bs4 charset-normalizer requests-html lxml_html_clean

from pathlib import Path
if not Path('./Meta-Llama-3.1-8B-Instruct-Q8_0.gguf').exists():
    !wget https://huggingface.co/bartowski/Meta-Llama-3.1-8B-Instruct-GGUF/resolve/main/Meta-Llama-3.1-8B-Instruct-Q8_0.gguf
if not Path('./public.txt').exists():
    !wget https://www.csie.ntu.edu.tw/~ulin/public.txt
if not Path('./private.txt').exists():
    !wget https://www.csie.ntu.edu.tw/~ulin/private.txt

Looking in indexes: https://pypi.org/simple, https://abetlen.github.io/llama-cpp-python/whl/cu122


In [1]:
import torch
if not torch.cuda.is_available():
    raise Exception('You are not using the GPU runtime. Change it first or you will suffer from the super slow inference speed!')
else:
    print('You are good to go!')

You are good to go!


## Prepare the LLM and LLM utility function

By default, we will use the quantized version of LLaMA 3.1 8B. You can get full marks on this homework by using the provided LLM and LLM utility function. You can also try out other LLMs.

In the following code block, we first load the downloaded model weights onto the GPU.
Then we implement the generate_response() function so that you can get the generated response from the LLM more easily.

You can ignore the "llama_new_context_with_model: n_ctx_per_seq (16384) < n_ctx_train (131072) -- the full capacity of the model will not be utilized" warning.

In [3]:
from llama_cpp import Llama

# Load the model onto GPU
llama3 = Llama(
    "./Meta-Llama-3.1-8B-Instruct-Q8_0.gguf",
    verbose=False,
    n_gpu_layers=-1,
    n_ctx=16384,    # Context length: how many tokens the model can take. Longer is better but consumes more memory. 16384 is a suitable value for a GPU with 16 GB of VRAM.
)

def generate_response(_model: Llama, _messages: list) -> str:
    '''
    Run inference on the model with the given chat messages (a list of {"role", "content"} dicts).
    '''
    _output = _model.create_chat_completion(
        _messages,
        stop=["<|eot_id|>", "<|end_of_text|>"],
        max_tokens=512,    # Maximum number of tokens the model may generate. You can change it and observe the differences.
        temperature=0,      # Randomness of the model. 0 means no randomness: the same input gives the same result every time. You can try other values.
        repeat_penalty=2.0,
    )["choices"][0]["message"]["content"]
    return _output

llama_new_context_with_model: n_ctx_per_seq (16384) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
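
As a minimal sketch of the input that `generate_response()` expects: the chat completion call takes a list of role/content dictionaries rather than a single string. The helper name `build_messages` is hypothetical and is not part of the provided code.

```python
from typing import Dict, List

def build_messages(system_prompt: str, user_prompt: str) -> List[Dict[str, str]]:
    """Build a chat message list (hypothetical helper, not part of the TA's code)."""
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt},
    ]

messages = build_messages("You are a helpful assistant.", "Who is Taylor Swift?")
# In the notebook, this list would then be passed on, e.g. generate_response(llama3, messages).
```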


## Search Tool

The TA has implemented a search tool that looks up keywords using Google Search. You can use this tool to find **web pages** relevant to a given question. The search tool can be integrated in the following sections.

In [37]:
from typing import List
from googlesearch import search as _search
from bs4 import BeautifulSoup
from charset_normalizer import from_bytes
import asyncio
from requests_html import AsyncHTMLSession
import urllib3
urllib3.disable_warnings()
MAX_TOKENS = 15000

async def worker(s:AsyncHTMLSession, url:str):
    try:
        header_response = await asyncio.wait_for(s.head(url, verify=False), timeout=10)
        if 'text/html' not in header_response.headers.get('Content-Type', ''):
            return None
        r = await asyncio.wait_for(s.get(url, verify=False), timeout=10)
        return r.text
    except Exception:  # Timeouts and network errors: skip this URL.
        return None

async def get_htmls(urls):
    session = AsyncHTMLSession()
    tasks = (worker(session, url) for url in urls)
    return await asyncio.gather(*tasks)

async def search(keyword: str, n_results: int=3, _model: Llama = llama3) -> List[str]:
    '''
    This function searches for the keyword and returns the text content of the web pages found (up to 2 * n_results pages, subject to the MAX_TOKENS budget).

    Warning: You may get HTTP 429 errors if you search too many times in a short period. This is unavoidable, so request more results at once at your own risk.
    Google does not publish the rate limit, so there is not much we can do except change the IP or wait until Google lifts the ban (we do not know how long the penalty lasts either).
    '''
    filtered_results = []
    keyword = keyword[:100]
    # Search for the keyword, requesting twice as many results in case some of them are invalid.
    results = list(_search(keyword, n_results * 2, lang="zh", unique=True))
    # Drop URLs that point to PDF files.
    results = [url for url in results if not url.lower().endswith('.pdf')]
    # Fetch the HTML for each URL. The helper function returns None for non-HTML responses.
    htmls = await get_htmls(results)
    for html in htmls:
        # Filter out the None values.
        if html is None:
            continue

        try:
            # Step 1: convert str back to bytes
            if isinstance(html, bytes):
                html_bytes = html
            elif isinstance(html, str):
                try:
                    html_bytes = html.encode('latin1')  # latin1 preserves the original bytes
                except UnicodeEncodeError:
                    html_bytes = html.encode('utf-8')  # otherwise encode as UTF-8

            # Step 2: detect the encoding and decode
            result = from_bytes(html_bytes).best()
            if result is None:
                continue
            html_decoded = str(result)

            # Step 3: extract clean text
            soup = BeautifulSoup(html_decoded, 'html.parser')
            filtered_results.append(''.join(soup.get_text().split()))
        except Exception:
            continue

    limit_results = []
    total_tokens = 0

    for result in filtered_results:
        encoded = _model.tokenize(result.encode("utf-8"), add_bos=False)
        if total_tokens + len(encoded) > MAX_TOKENS or total_tokens > MAX_TOKENS:
            continue
        total_tokens += len(encoded)
        limit_results.append(result)

    # if not limit_results:
    #     # For debugging purpose, you can uncomment the following line to raise an exception if no valid results are found.
    #     raise Exception('No valid results found. Please try again later.')
    # Return the results that fit within the MAX_TOKENS budget.
    return limit_results
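
The token-budget step at the end of `search()` can be sketched in isolation as follows. This is a simplified illustration, not the TA's code: `pack_within_budget` is a hypothetical name, and the demo uses a stand-in tokenizer that counts one token per character instead of the model's tokenizer.

```python
from typing import Callable, List

def pack_within_budget(
    texts: List[str],
    count_tokens: Callable[[str], int],
    budget: int,
) -> List[str]:
    """Keep texts in order as long as the running token total stays within budget."""
    kept: List[str] = []
    used = 0
    for text in texts:
        n = count_tokens(text)
        if used + n > budget:
            continue  # Too large for the remaining budget; a later, shorter text may still fit.
        used += n
        kept.append(text)
    return kept

# Stand-in tokenizer for the demo only: one token per character.
docs = ["aaaa", "bbbbbbbb", "cc"]
kept = pack_within_budget(docs, len, budget=6)
print(kept)
```

Because oversized pages are skipped rather than truncated, a long first page can be dropped entirely while shorter pages further down the list are kept.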


## Test the LLM inference pipeline

In [4]:
# You can try out different questions here.
test_question='請問誰是 Taylor Swift？'

messages = [
    {"role": "system", "content": "你是 LLaMA-3.1-8B，是用來回答問題的 AI。使用中文時只會使用繁體中文來回問題。"},    # System prompt
    {"role": "user", "content": test_question}, # User prompt
]

print(generate_response(llama3, messages))

泰勒絲（Taylor Swift）是一位美國歌手、詞曲作家和製作人。她出生於1989年，來自田納西州。她的音樂風格從鄉村搖滾開始逐漸轉變為流行電音。

她早期的作品如《泰勒絲第一輯》、《愛情故事第二章：睡美人的秘密》，獲得了廣泛認可和獎項，包括多個告示牌音樂大奖。後來，她推出了更具商業成功性的專辑，如 《1989》（2014）、_reputation（《名聲_(泰勒絲专輯)》） （ 20 ） 和 _Lover(2020)，並且在全球取得了巨大的影響力。

她以她的歌曲如 "Shake It Off"、"_Blank Space_"和 "_Bad Blood_",以及與其他藝人合作的作品，如 《Look What You Made Me Do》（2017）而聞名。泰勒絲還是知識產權運動的一部分，對於音樂創作者在數字時代獲得公平報酬有所關注。

她被譽為當代最成功和影響力最大的人物之一，並且她的歌曲經常成為流行文化的話題。


## Agents

The TA has implemented the Agent class for you. You can use this class to create agents that can interact with the LLM model. The Agent class has the following attributes and methods:
- Attributes:
    - role_description: The role of the agent. For example, if you want this agent to be a history expert, you can set the role_description to "You are a history expert. You will only answer questions based on what really happened in the past. Do not generate any answer if you don't have reliable sources.".
    - task_description: The task of the agent. For example, if you want this agent to answer questions only in yes/no, you can set the task_description to "Please answer the following question in yes/no. Explanations are not needed."
    - llm: An identifier for the LLM backend used by the agent.
- Method:
    - inference: This method takes a message as input and returns the generated response from the LLM model. The message will first be formatted into proper input for the LLM model. (This is where you can set some global instructions like "Please speak in a polite manner" or "Please provide a detailed explanation".) The generated response will be returned as the output.
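
The formatting step inside `inference` could use explicit section markers to keep the task description apart from the user message. The following is a minimal sketch under assumptions: the function name `format_user_prompt` and the `### Task` / `### Message` markers are illustrative choices, not something the homework prescribes.

```python
def format_user_prompt(task_description: str, message: str) -> str:
    """Join the task description and the user message with explicit section markers."""
    return (
        f"### Task\n{task_description}\n\n"
        f"### Message\n{message}"
    )

prompt = format_user_prompt("Answer in yes/no.", "Is Taipei in Taiwan?")
print(prompt)
```

Whether markers like these help depends on the model and the prompts, so it is worth comparing the outputs with and without them.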

In [5]:
class LLMAgent():
    def __init__(self, role_description: str, task_description: str, llm:str="bartowski/Meta-Llama-3.1-8B-Instruct-GGUF"):
        self.role_description = role_description   # Who this agent should act as, e.g. a history expert or a manager.
        self.task_description = task_description    # The task this agent should solve.
        self.llm = llm  # Which LLM backend this agent is using.
    def inference(self, message:str) -> str:
        if self.llm == 'bartowski/Meta-Llama-3.1-8B-Instruct-GGUF': # If using the default one.
            # TODO: Design the system prompt and user prompt here.
            # Format the messages first.
            messages = [
                {"role": "system", "content": f"{self.role_description}"},  # Hint: you may want the agents to speak Traditional Chinese only.
                {"role": "user", "content": f"{self.task_description}\n{message}"}, # Hint: you may want the agents to clearly distinguish the task descriptions from the user messages. A proper separator text rather than a simple line break is recommended.
            ]
            return generate_response(llama3, messages)
        else:
            # TODO: If you want to use LLMs other than the given one, please implement the inference part on your own.
            return ""

TODO: Design the role description and task description for each agent.

In [7]:
# TODO: Design the role and task description for each agent.

# This agent may help you filter out the irrelevant parts in question descriptions.
question_extraction_agent = LLMAgent(
    role_description="你是 LLaMA-3.1-8B，專門從文本中提取核心問題的 AI。你不提供解答，只專注於問題提取，使用中文時只會使用繁體中文來回問題。",
    task_description="請從以下訊息中提取一個完整且精確的問題句。請保留專有名詞（如歌名、人名、地名、活動名稱等），確保語句結構清楚明確，不可將陳述句誤當成問題，也不要曲解原始的問題，且簡單扼要。"
)

# This agent may help you extract the keywords in a question so that the search tool can find more accurate results.
keyword_extraction_agent = LLMAgent(
    role_description="你是 LLaMA-3.1-8B，專門從問題中提取關鍵字的 AI。你僅專注於關鍵字提取，不進行問題解答或額外解釋。使用中文時只會使用繁體中文來回問題。",
    task_description="請從以下問題中選出你會輸入搜尋引擎的關鍵字。保留所有專有名詞（如歌名、人名、地名、活動名稱、引號內的內容等）以及相關的詞語。僅輸出這些關鍵字，不重述問題，也不得自行生成未在輸入中出現的詞彙。"
)

# This agent is the core component that answers the question.
qa_agent = LLMAgent(
    role_description="你是 LLaMA-3.1-8B，是用來回答問題的 AI。使用中文時只會使用繁體中文來回問題。",
    task_description="請回答以下問題：",
)

## RAG pipeline

TODO: Implement the RAG pipeline.

Please refer to the homework description slides for hints.

Also, there may be more heuristics (e.g. classifying questions by length, determining whether a question needs a search, reconfirming the answer before returning it to the user, etc.) that are not shown in the flow charts. You can use your creativity to come up with a better solution!

- Naive approach (simple baseline)

    ![](https://www.csie.ntu.edu.tw/~ulin/naive.png)

- Naive RAG approach (medium baseline)

    ![](https://www.csie.ntu.edu.tw/~ulin/naive_rag.png)

- RAG with agents (strong baseline)

    ![](https://www.csie.ntu.edu.tw/~ulin/rag_agent.png)
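
One of the heuristics mentioned above, classifying questions by length, might look like the sketch below. Everything here is an assumption for illustration: the function name `route_question`, the route labels, and the threshold of 50 characters are not taken from the homework slides.

```python
def route_question(question: str, long_threshold: int = 50) -> str:
    """Pick a (hypothetical) processing route based on the question length in characters."""
    if len(question.strip()) > long_threshold:
        # Long questions often carry background text, so extract the core question first.
        return "extract_then_search"
    # Short questions can go to keyword extraction and search directly.
    return "search_directly"

print(route_question("誰發表了第一代 iPhone？"))
print(route_question("背景說明" * 20 + "請問答案是什麼？"))
```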

In [36]:
async def pipeline(question: str) -> str:
    # TODO: Implement your pipeline.
    # This version extracts keywords, searches for them, extracts the core question, and then asks the QA agent.
    # You may want to get the final result through multiple inferences.
    # Reminder: keep your input length within the model's context window (16384 tokens). You may need to truncate excessive text.
    keywords = keyword_extraction_agent.inference(question)
    # print(f"Keywords is: {keywords}")
    results = await search(keywords)
    # print(f"Search results are: {results}")
    core_question = question_extraction_agent.inference(question)
    # print("Core problem is: ", core_question)
    return qa_agent.inference(f"googlesearch: {results}，core question: {core_question}")
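
As a rough sketch of how the final prompt could be assembled with a hard size cap, the helper below labels each search result and truncates the combined context. It is not part of the provided pipeline: `build_rag_prompt`, the `[Source i]` labels, and the character budget are assumptions, and a character count is only a crude proxy for the model's token count.

```python
from typing import List

def build_rag_prompt(results: List[str], core_question: str, max_chars: int = 8000) -> str:
    """Label each search result, cap the total context length, and append the question."""
    context_parts: List[str] = []
    remaining = max_chars
    for i, text in enumerate(results, 1):
        if remaining <= 0:
            break
        snippet = text[:remaining]  # Truncate the last snippet to fit the budget.
        context_parts.append(f"[Source {i}]\n{snippet}")
        remaining -= len(snippet)
    context = "\n\n".join(context_parts) if context_parts else "(no search results)"
    return f"Search results:\n{context}\n\nQuestion: {core_question}"

prompt = build_rag_prompt(["abcdef", "ghij"], "What is X?", max_chars=8)
print(prompt)
```

In the pipeline above, the returned string would take the place of the f-string passed to `qa_agent.inference`.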

## Answer the questions using your pipeline!

Since Colab has a usage limit, you might get disconnected. The following code saves your answer to each question. If you have mounted your Google Drive as instructed, you can simply rerun the whole notebook to resume where you left off.

In [None]:
from pathlib import Path

# Fill in your student ID first.
STUDENT_ID = "simonchu"

STUDENT_ID = STUDENT_ID.lower()
with open('./public.txt', 'r', encoding="utf-8") as input_f:
    questions = input_f.readlines()
    questions = [l.strip().split(',')[0] for l in questions]
    # questions = [questions[21]]
    for id, question in enumerate(questions, 1):
        if Path(f"./{STUDENT_ID}_{id}.txt").exists():
            continue
        answer = await pipeline(question)
        answer = answer.replace('\n',' ')
        print(id, answer)
        with open(f'./{STUDENT_ID}_{id}.txt', 'w', encoding="utf-8") as output_f:
            print(answer, file=output_f)

with open('./private.txt', 'r', encoding="utf-8") as input_f:
    questions = input_f.readlines()
    # questions = [questions[20]]
    for id, question in enumerate(questions, 31):
        if Path(f"./{STUDENT_ID}_{id}.txt").exists():
            continue
        answer = await pipeline(question)
        answer = answer.replace('\n',' ')
        print(id, answer)
        with open(f'./{STUDENT_ID}_{id}.txt', 'w', encoding="utf-8") as output_f:
            print(answer, file=output_f)

1 「虎山雄風飛揚」是光華國小的校歌。
2 根據NCC的說明，自2025年初起，如果民眾通過境外郵購無線鍑盤、滑鼠或藍芽耳機等第二級電信管制射頻器材回台，每案都會加收新臺幣750元審查費。
3 第一代 iPhone 是由史蒂夫·喬布斯發表的。它於 2007 年六月十九日在美國威利士堡（Walt Disney Concert Hall）舉行的一場盛大新聞會上正式亮相，標誌著手機產業進入智慧型多媒體時代。  第一代 iPhone 的設計由喬布斯親自操刀，他希望創造一個既美觀又易用的手持裝置。這款電話採取了全觸控式操作系統，並且配備了一個 3.5 英寸的彩色液晶顯示屏幕，內建有 Wi-Fi 和 EDGE 網絡連接功能。  第一代 iPhone 的發表引起全球媒體關注，被視為手機產業的一大革命。它不僅改變了人們對於電話和互聯網使用方式，也推動了一系列的創新產品研製，包括後來出現的手寫識別、GPS 和其他多種功能。  喬布斯在發表第一代 iPhone 時曾說道：「我們要做的是將電腦帶到手中，使它們更易於用，並且讓人感到愉快。」這句話成為了他對產品設計和創新的核心理念。
4 根據提供的資訊，托福網路測驗 TOEFL iBT 達到 92 分以上才能申請台灣大學進階英文免修。
5 在橄欖球運動中，達陣（Try）是一種得分方式。當一名選手將足球觸地於對方的得到區內時，就會獲得5個點數。  根據規則，如果一個隊伍成功完成了一次Touchdown，他們可以選擇進行Kickoff或Dropkick來進攻，而不是直接射門。如果他們決定通過Kicking方式取得分，球員將站在觸地處的正中央，並且必須踢出足球，使其越過對方防守線。
6 根據卑南族的神話傳說，人類始祖是從大地中出生的女 thần奴努勞（Nunur），她把一 根竹子插在起源之處巴拿 巴那樣 （Panapanayan） ，而一個男孩和一个 女生分别从 竹子的不同部分 出来。
7 熊仔的碩班指導教授為李琳山。
8 法拉第是發現電磁感應定律並奠基於其上的人。
9 根據提供的資訊，距離國立臺灣史前文化博物館最近的是康樂站。
10 根據提供的資訊，三十幾（30几）不包括 3O，而是指大於 Thirty 的數字卻小于 Forty。因此，如果我們將 Twenty 加上 Three 十五，我们可以得到以下結果：  31 +21 =52 32+22=
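
The skip-if-already-saved pattern used in the cell above can be sketched in isolation as follows. The names `answer_once` and `fake_pipeline` are hypothetical, and the demo writes to a temporary directory rather than Google Drive.

```python
import tempfile
from pathlib import Path
from typing import Callable, List

def answer_once(path: Path, compute: Callable[[], str]) -> str:
    """Return the saved answer if the file exists; otherwise compute it and save it."""
    if path.exists():
        return path.read_text(encoding="utf-8").strip()
    answer = compute().replace("\n", " ")  # Keep each answer on a single line.
    path.write_text(answer + "\n", encoding="utf-8")
    return answer

calls: List[int] = []

def fake_pipeline() -> str:
    """Stand-in for the real pipeline; records how often it is invoked."""
    calls.append(1)
    return "line one\nline two"

with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "demo_1.txt"
    first = answer_once(target, fake_pipeline)   # Computes and saves.
    second = answer_once(target, fake_pipeline)  # Reads the saved file instead.

print(first, len(calls))
```

Rerunning after a disconnection then only spends inference time on questions that have no saved file yet.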

In [15]:
# Combine the results into one file.
with open(f'./{STUDENT_ID}.txt', 'w', encoding="utf-8") as output_f:
    for id in range(1,91):
        with open(f'./{STUDENT_ID}_{id}.txt', 'r', encoding="utf-8") as input_f:
            answer = input_f.readline().strip()
            print(answer, file=output_f)