# ML2025 Homework 1 - Retrieval Augmented Generation with Agents

## Environment Setup

First, we will mount your own Google Drive and change the working directory.

In this section, we install the necessary Python packages, then download the quantized model weights (LLaMA 3.1 8B by default) and the dataset. Note that the model weights are around 8 GB. If you are using your Google Drive as the working directory, make sure you have enough space for the model.

In [None]:
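# Author: Swear01
# I didn't use Colab for this project, so the code below may not work on Colab.
# Build llama-cpp-python with CUDA support. Recent releases use -DGGML_CUDA=on; older ones used -DLLAMA_CUDA=on.
!CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python
!python3 -m pip install -r requirements.txt

from pathlib import Path
# Check for the model file that is loaded later in this notebook.
# The /resolve/ URL serves the raw file; /blob/ would return the HTML viewer page.
if not Path('./Qwen2.5-14B-Instruct-IQ4_XS.gguf').exists():
    !wget "https://huggingface.co/bartowski/Qwen2.5-14B-Instruct-GGUF/resolve/main/Qwen2.5-14B-Instruct-IQ4_XS.gguf"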
if not Path('./public.txt').exists():
    !wget "https://www.csie.ntu.edu.tw/~ulin/public.txt"
if not Path('./private.txt').exists():
    !wget "https://www.csie.ntu.edu.tw/~ulin/private.txt"

## Prepare the LLM and LLM utility function

By default, we will use the quantized version of LLaMA 3.1 8B. You can get full marks on this homework by using the provided LLM and LLM utility function. You can also try out different LLM models.

In the following code block, we will load the downloaded LLM model weights onto the GPU first.
Then, we implement the generate_response() function so that you can get the generated response from the LLM more easily.

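You can ignore the "llama_new_context_with_model: n_ctx_per_seq (16384) < n_ctx_train (131072) -- the full capacity of the model will not be utilized" warning. It only means that the configured context window is smaller than the one the model was trained with.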

In [None]:
# Model From https://huggingface.co/bartowski/Qwen2.5-14B-Instruct-GGUF
from llama_cpp import Llama

# Load the model onto GPU
llama3 = Llama(
    "./Qwen2.5-14B-Instruct-IQ4_XS.gguf",
    verbose=False,
    n_gpu_layers=-1,
    n_ctx=14000,    # How many tokens the model can take as context. Longer is better but consumes more memory; 16384 is a proper value for a GPU with 16 GB of VRAM.
)
# 14000 is the largest n_ctx that fits on my 12 GB RTX 3060 without using shared memory.

def generate_response(_model: Llama, _messages: list[dict]) -> str:
    '''
    Run chat-completion inference on the model with the given messages
    (a list of {"role": ..., "content": ...} dicts) and return the reply text.
    '''
    _output = _model.create_chat_completion(
        _messages,
        stop=["<|eot_id|>", "<|end_of_text|>", "<|im_end|>"],  # Llama 3 end tokens plus the ChatML end-of-turn token used by Qwen2.5.
        max_tokens=512,    # How many tokens the model can generate. You can change it and observe the differences.
        temperature=0,      # The randomness of the model. 0 means no randomness, so the same input gives the same result every time. You can try different values.
        repeat_penalty=2.0,    # Values above 1.0 penalize tokens that have already appeared.
    )["choices"][0]["message"]["content"]
    return _output

llama_init_from_model: n_ctx_per_seq (14016) < n_ctx_train (32768) -- the full capacity of the model will not be utilized


## Search Tool

The TA has implemented a search tool for you to search certain keywords using Google Search. You can use this tool to search for the relevant **web pages** for the given question. The search tool can be integrated in the following sections.

In [None]:
import re
from typing import List
from googlesearch import search as _search
from bs4 import BeautifulSoup
from charset_normalizer import detect
import asyncio
from requests_html import AsyncHTMLSession
import urllib3
urllib3.disable_warnings()

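async def worker(s: AsyncHTMLSession, url: str):
    try:
        # Send a HEAD request first so that non-HTML resources are skipped without downloading them.
        header_response = await asyncio.wait_for(s.head(url, verify=False), timeout=10)
        if 'text/html' not in header_response.headers.get('Content-Type', ''):
            return None
        r = await asyncio.wait_for(s.get(url, verify=False), timeout=10)
        return r.text
    except Exception:
        # Treat timeouts and network errors as "no page", but let task cancellation propagate.
        return None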

async def get_htmls(urls):
    session = AsyncHTMLSession()
    tasks = (worker(session, url) for url in urls)
    return await asyncio.gather(*tasks)

async def search(keyword: str, n_results: int=3) -> List[str]:
    '''
    Search for the keyword and return the text content of the first n_results web pages.

    Warning: You may get HTTP 429 errors if you search too many times in a short period. This is unavoidable, and you search for more results at once at your own risk.
    Google does not publish the rate limit, so there is not much we can do except change the IP address or wait until Google lifts the ban (we do not know how long the penalty lasts either).
    '''
    keyword = keyword[:100]
    # First, search the keyword and get the results. Also, get 2 times more results in case some of them are invalid.
    results = list(_search(keyword, n_results * 2 + 3, lang="zh",region="tw",safe=None, unique=True))
    # Then, get the HTML from the results. Also, the helper function will filter out the non-HTML urls.
    results = await get_htmls(results)
    # Filter out the None values.
    results = [x for x in results if x is not None]
    # Parse the HTML.
    results = [BeautifulSoup(x, 'html.parser') for x in results]
    # Keep only UTF-8 pages that contain more than 30 CJK characters and are not raw PDFs,
    # then remove all whitespace from the extracted text.
    pages = []
    for soup in results:
        text = soup.get_text()
        if (detect(soup.encode()).get('encoding') == 'utf-8'
                and len(re.findall(r'[\u4e00-\u9fff]', text)) > 30
                and not text.startswith('%PDF-1.')):
            pages.append(''.join(text.split()))
    # Return the first n results.
    return pages[:n_results]
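
The docstring above warns that repeated searches can trigger HTTP 429 errors. One common way to soften this is to retry with exponential backoff. The following is a minimal sketch, not part of the TA's tool: search_with_retry, max_attempts, and base_delay are hypothetical names, and it assumes the rate limit surfaces as an exception raised by the search coroutine.

```python
import asyncio
from typing import Awaitable, Callable, List


async def search_with_retry(
    search_fn: Callable[..., Awaitable[List[str]]],
    keyword: str,
    n_results: int = 3,
    max_attempts: int = 3,    # Hypothetical default.
    base_delay: float = 1.0,  # Hypothetical default, in seconds.
) -> List[str]:
    """Call search_fn and back off exponentially whenever it raises.

    Returns an empty list if every attempt fails, so the caller can
    decide to answer without retrieved context.
    """
    for attempt in range(max_attempts):
        try:
            return await search_fn(keyword, n_results=n_results)
        except Exception:
            # Assumption: any exception here is worth retrying.
            # Wait base_delay, 2 * base_delay, 4 * base_delay, ... between attempts.
            if attempt < max_attempts - 1:
                await asyncio.sleep(base_delay * 2 ** attempt)
    return []
```

In a notebook cell this could be called as `results = await search_with_retry(search, key_words)`. Backoff only spaces out the requests; it cannot lift a ban that Google has already applied.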

In [4]:
from opencc import OpenCC
cc = OpenCC("s2t")
def translate(zh_cn:str) -> str:
    return cc.convert(zh_cn)

## Test the LLM inference pipeline

In [5]:
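# You can try out different questions here.
test_question='請問誰是 Taylor Swift？'  # "Who is Taylor Swift?"

# The system prompt says: "You are LLaMA-3.1-8B, an AI for answering questions.
# When using Chinese, you only answer in Traditional Chinese."
messages = [
    {"role": "system", "content": "你是 LLaMA-3.1-8B，是用來回答問題的 AI。使用中文時只會使用繁體中文來回答問題。"},    # System prompt
    {"role": "user", "content": test_question}, # User prompt
]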

# print(generate_response(llama3, messages))

## Agents

The TA has implemented the LLMAgent class for you. You can use this class to create agents that interact with the LLM. The class has the following attributes and method:
- Attributes:
    - role_description: The role of the agent. For example, if you want this agent to be a history expert, you can set the role_description to "You are a history expert. You will only answer questions based on what really happened in the past. Do not generate any answer if you don't have reliable sources.".
    - task_description: The task of the agent. In the implementation below it is a list of three strings that inference interleaves with the two parts of the message. For example, if you want this agent to answer questions only in yes/no, you can set the last string to "Please answer the following question in yes/no. Explanations are not needed."
    - llm: Just an indicator of the LLM model used by the agent.
- Method:
    - inference: This method takes a message (a list of two strings) as input and returns the generated response from the LLM. The message is first formatted into a proper input for the LLM. (This is where you can set some global instructions like "Please speak in a polite manner" or "Please provide a detailed explanation".) The generated response is returned as the output.
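
To make the formatting step concrete, here is a minimal sketch of how three task fragments and two message parts could be assembled into chat messages with an explicit separator. build_messages and the "-----" delimiter are hypothetical and only illustrate the idea; the LLMAgent class in this notebook formats its prompt in its own way.

```python
from typing import Dict, List


def build_messages(
    role_description: str,
    task_description: List[str],
    message: List[str],
) -> List[Dict[str, str]]:
    """Interleave three task fragments with two message parts.

    Each message part is fenced by a separator line so the model can tell
    instructions apart from user-supplied text.
    """
    sep = "-----"  # Hypothetical delimiter; any string unlikely to occur in the input works.
    user_content = "\n".join([
        task_description[0],
        sep, message[0], sep,
        task_description[1],
        sep, message[1], sep,
        task_description[2],
    ])
    return [
        {"role": "system", "content": role_description},
        {"role": "user", "content": user_content},
    ]
```

The returned list has the same shape as the messages passed to generate_response(), so it could be used in its place when experimenting with prompt layouts.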

In [6]:
class LLMAgent():
    def __init__(self, role_description: str, task_description: list[str], llm: str = "bartowski/Meta-Llama-3.1-8B-Instruct-GGUF"):
        self.role_description = role_description   # Role means who this agent should act like, e.g. the history expert, the manager...
        self.task_description = task_description    # Task description instructs what task this agent should solve (three prompt fragments).
        self.llm = llm  # LLM indicates which LLM backend this agent is using.
    def inference(self, message: list[str]) -> str:
        # TODO: Design the system prompt and user prompt here.
        # Every value of self.llm uses the same prompt format and the same loaded model.
        # Format the messages first: the three task fragments are interleaved with the two message parts.
        messages = [
            {"role": "system", "content": self.role_description},  # Hint: you may want the agents to speak Traditional Chinese only.
            {"role": "user", "content": (
                f"{self.task_description[0]}\n"
                f"{message[0]}\n"
                f"{self.task_description[1]}\n"
                f"{message[1]}\n"
                f"{self.task_description[2]}"
            )},  # Hint: you may want the agents to clearly distinguish the task descriptions and the user messages. A proper separation text rather than a simple line break is recommended.
        ]
        return generate_response(llama3, messages)

TODO: Design the role description and task description for each agent.

In [7]:
# TODO: Design the role and task description for each agent.

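# This agent may help you filter out the irrelevant parts in question descriptions.
# Role: "You are an AI for condensing questions and output only Traditional Chinese. Pick out the actual core
# question from a long question; if the whole question matters, keep all of it. Output only the core question."
# Task: "Here is the question:" / "" / "Now extract the core question from the question above:"
question_extraction_agent = LLMAgent(
    role_description="你是大型語言模型，是用來精簡問句的 AI。使用中文時只會使用繁體中文輸出結果。你的工作是從一個很長的問題中篩選出實際的核心問句。如果有些問題整句話都很重要的話，請保留整個問題。你只需要輸出核心問題本身，不需要輸出其他文字。",
    task_description=["以下是提供的問題：","","接下來請提取出以上問題的核心問句："],
)

# This agent may help you extract the keywords in a question so that the search tool can find more accurate results.
# Role: "You are an AI for proposing search keywords. Given the original and the refined question, extract the
# keywords from the original question. Output only space-separated keywords."
# Task: "Here is the original question:" / "Here is the refined question:" / "Now output the keywords:"
keyword_extraction_agent = LLMAgent(
    role_description="你是大型語言模型，是用來提出關鍵字的 AI。使用中文時只會使用繁體中文輸出結果。你的工作是用來提取出問題中的關鍵字用以搜尋。我會傳給你原始問題與精煉過的問題，請你提取出原始問題中的關鍵字。你只需要輸出以空格分隔的關鍵字，不需要輸出其他文字。",
    task_description=["以下是原始問題：","以下是精煉過的問題：","接下來請輸出問題中的關鍵字："],
)

# This agent cleans a search result. It is meant to be called with [key_words, result].
# Role: "You are an AI for processing web pages. Keep every part of the search result that is relevant to the
# keywords, preserving the original text. Output only the relevant page content."
# Task: "Here are the keywords:" / "Here is the search result:" / "Now output the cleaned search result:"
search_result_extraction_agent = LLMAgent(
    role_description="你是大型語言模型，是用來處理網站的 AI。使用中文時只會使用繁體中文輸出結果。你的工作是從我提供的搜尋結果中篩選出所有與關鍵詞相關的部分，請完整保留相關部份的原始文字。你只需要輸出相關的頁面內容，不需要輸出其他文字。",
    task_description=["以下是關鍵詞：","以下是搜尋結果：","接下來請輸出乾淨的搜尋結果："],
)

# This agent is the core component that answers the question.
# Role: "You are an AI for answering questions in Traditional Chinese. Given the question and several search
# results, answer concisely, with a single precise term if possible. Output only the answer."
# Task: "Here are the search results:" / "Here are the original question and the refined question:" / "Output the answer:"
qa_agent = LLMAgent(
    role_description="你是大型語言模型，是用來回答問題的 AI。使用中文時只會使用繁體中文回答問題。我會給你問題與多個搜尋結果，你的工作是簡明扼要的回答問題，如果可以的話使用一個精確詞語描述。你只需要輸出單純的答案，不需要輸出其他文字。",
    task_description=["以下是搜尋結果：","以下是原始問題與精煉過的問題：","請輸出以上問題的答案："],
)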

## RAG pipeline

TODO: Implement the RAG pipeline.

Please refer to the homework description slides for hints.

Also, there might be more heuristics (e.g. classifying the questions based on their lengths, determining whether a question needs a search, reconfirming the answer before returning it to the user...) that are not shown in the flow charts. You can use your creativity to come up with a better solution!

- Naive approach (simple baseline)

    ![](https://www.csie.ntu.edu.tw/~ulin/naive.png)

- Naive RAG approach (medium baseline)

    ![](https://www.csie.ntu.edu.tw/~ulin/naive_rag.png)

- RAG with agents (strong baseline)

    ![](https://www.csie.ntu.edu.tw/~ulin/rag_agent.png)
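
As an illustration of the length-based heuristic mentioned above, the sketch below decides which pipeline steps to run for a question. plan_steps, the step names, and the 60-character threshold are all hypothetical; it rests on the assumption that long questions carry background text worth stripping, while short ones can go straight to keyword extraction.

```python
def plan_steps(question: str, long_threshold: int = 60) -> list[str]:
    """Return the pipeline steps to run for a question, based on its length."""
    steps = []
    # Assumption: only questions longer than the threshold need core-question extraction.
    if len(question.strip()) > long_threshold:
        steps.append("extract_question")
    steps += ["extract_keywords", "search", "answer"]
    return steps
```

A pipeline could consult this plan and skip the question extraction agent for short questions, saving one LLM call per question. The threshold would need tuning on the public questions.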

In [None]:
async def pipeline(question: str) -> str:
    
    #Step 1: extract core question
    core_question = question_extraction_agent.inference([question,""])
    #print("核心問題:",core_question.replace("\n"," "))

    #Step 2: extract search key words
    key_words = keyword_extraction_agent.inference([question,core_question])
    #print("關鍵字  :",key_words.replace("\n"," "))

    #Step 3: google search
    search_results = await search(key_words, n_results=3)

    #Step 4: translate result
    search_results = [translate(result) for result in search_results]
    
    #[print("搜尋結果:",result.replace("\n"," ")[:200]) for result in search_results]
    search_results = [result[:3000] for result in search_results]

    #Step 5: clean results
    # clean_results = [
    #     search_result_extraction_agent.inference([key_words,result]) 
    #     for result in search_results]
    # [print("精煉結果:",result.replace("\n"," ")[:200]) for result in clean_results]
    
    # results = "\n此行用來分隔兩筆搜尋結果\n".join(clean_results)
    results = "\n此行用來分隔兩筆搜尋結果\n".join(search_results)

    #Step 6: Generate Response
    answer = qa_agent.inference([results,f"{question},以下是提取出的核心問題：{core_question}"])
    #print(answer)

    return answer

## Answer the questions using your pipeline!

Since Colab has usage limits, you might get disconnected. The following code saves your answer for each question. If you have mounted your Google Drive as instructed, you can just rerun the whole notebook to continue where you left off.

In [None]:
from pathlib import Path

# Fill in your student ID first.
STUDENT_ID = "B11901015"

STUDENT_ID = STUDENT_ID.lower()
# Make sure the output directory exists before writing per-question answers.
Path('./result').mkdir(exist_ok=True)

with open('./public.txt', 'r', encoding="utf-8") as input_f:
    questions = input_f.readlines()
    # Keep only the text before the first comma on each line.
    questions = [l.strip().split(',')[0] for l in questions]
    for qid, question in enumerate(questions, 1):
        # Skip questions that were already answered in a previous run.
        if Path(f"./result/{STUDENT_ID}_{qid}.txt").exists():
            continue
        answer = await pipeline(question)
        answer = answer.replace('\n',' ')
        print(qid, answer)
        with open(f'./result/{STUDENT_ID}_{qid}.txt', 'w', encoding="utf-8") as output_f:
            output_f.write(f"{answer}\n")

with open('./private.txt', 'r', encoding="utf-8") as input_f:
    # Strip the trailing newline from each private question.
    questions = [l.strip() for l in input_f.readlines()]
    # Private question IDs start at 31.
    for qid, question in enumerate(questions, 31):
        if Path(f"./result/{STUDENT_ID}_{qid}.txt").exists():
            continue
        answer = await pipeline(question)
        answer = answer.replace('\n',' ')
        print(qid, answer)
        with open(f'./result/{STUDENT_ID}_{qid}.txt', 'w', encoding="utf-8") as output_f:
            output_f.write(f"{answer}\n")

核心問題: 2025年初，NCC規定民眾郵購自用無線鍵盤等產品回台每案加收審查費多少錢？
關鍵字  : 2025 NCC 民眾 郵購 自用 無線鍼盤 加收 安審查費
搜尋結果: 網購國外手機藍牙耳機增750元審查費NCC：國內販售品維修換新免收|生活|中央社CNA立刻加入本網站使用相關技術提供更好的閱讀體驗，同時尊重使用者隱私，點這裡瞭解中央社隱私聲明。當您關閉此視窗，代表您同意上述規範。YourbrowserdoesnotappeartosupportTraditionalChinese.WouldyouliketogotoCNA’sEnglishwebsite,“Fo
搜尋結果: 網購國外3C產品需繳納750元審查費NCC為何調整規範？｜典藏新聞｜TAAA｜臺北市廣告代理商業同業公會TAIPEIASSOCIATIONOFADVERTISINGAGENCIES臺北市廣告代理商業同業公會MENU加入會員／登入登入會員登入加入公會會員關於公會簡介與沿革理監事名單會員名冊臺灣廣告名人堂／臺灣廣告之友臺灣廣告名人堂臺灣廣告之友傑出廣告人暨卓越貢獻獎臺灣廣告節活動專區公會活動總覽廣告體
搜尋結果: NCC：網購國外藍牙產品審查費，自用研議免收或減收|TechNews科技新報搜尋：登入註冊登出VIP會員MenuSkiptocontent+訂閱獨家半導體晶圓晶片IC設計封裝測試處理器GPU記憶體零組件光電科技面板電池3C周邊財經財報證券房地產Fintech加密貨幣金融政策國際貿易國際金融支付方案網路AmazonFacebookGoogle資訊安全開放資料物聯網電子商務電子娛樂雲端尖端科技AI人工
2 750元
核心問題: 第一代 iPhone 是由哪位蘋果 CEO 發表？
關鍵字  : 第一代 iPhone CEO 發表
搜尋結果: iPhone(第一代)-維基百科，自由的百科全書跳轉到內容主菜單主菜單移至側欄隱藏導航首頁分類索引特色內容新聞動態最近更改隨機條目特殊頁面幫助幫助維基社羣方針與指引互助客棧知識問答字詞轉換IRC即時聊天聯絡我們關於維基百科搜索搜索外觀資助維基百科創建賬號登錄個人工具資助維基百科創建賬號登錄未登錄編輯者的頁面瞭解詳情貢獻討論目錄移至側欄隱藏序言1歷史開關歷史子章節1.1研發1.2發售1.3發售後2設
搜尋結果: iPhone-維基百科，自由的百科全書

In [10]:
# Combine the results into one file.
with open(f'./{STUDENT_ID}.txt', 'w', encoding="utf-8") as output_f:
    for id in range(1,91):
        with open(f'./result/{STUDENT_ID}_{id}.txt', 'r', encoding="utf-8") as input_f:
            answer = input_f.readline().strip()
            print(answer, file=output_f)
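
The combining step above opens all 90 result files and fails on the first one that is missing. As a minimal sketch, a helper like the following could list the unanswered questions beforehand; missing_results is a hypothetical name, and the file layout is taken from the code above.

```python
from pathlib import Path


def missing_results(result_dir: str, student_id: str, total: int = 90) -> list[int]:
    """Return the question numbers (1..total) that have no saved answer file."""
    return [
        qid for qid in range(1, total + 1)
        if not Path(result_dir, f"{student_id}_{qid}.txt").exists()
    ]
```

For example, `missing_results('./result', STUDENT_ID)` returns an empty list once every question has been answered, which is a reasonable condition to check before combining.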