## **BUILDING A SYSTEM WITH THE BASIC RAG TECHNIQUE**

Install the required packages and libraries

In [None]:
!pip install pyarrow
!pip install llama-index
!pip install ragas
!pip install datasets

!pip install llama-index-llms-langchain
!pip install langchainhub
!pip install llama-index-llms-fireworks

!pip install llama-index-llms-anyscale
!pip install llama-index-embeddings-anyscale

!pip install anyscale
!pip install openai
!pip install llama-index-embeddings-huggingface

In [None]:
import nest_asyncio
nest_asyncio.apply()

from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, ServiceContext
from llama_index.core.node_parser import SimpleNodeParser
from llama_index.core.evaluation import RetrieverEvaluator

from llama_index.llms.openai import OpenAI
from llama_index.llms.anyscale import Anyscale

from llama_index.core.indices.query.query_transform import HyDEQueryTransform
from llama_index.core.query_engine import TransformQueryEngine

from llama_index.core import PromptTemplate
from IPython.display import Markdown, display

from langchain import hub
from llama_index.core.prompts import LangchainPromptTemplate

import os
import pandas as pd
from datasets import Dataset

Set the API keys

In [None]:
# Replace the placeholder strings with your own keys before running.
os.environ['ANYSCALE_API_KEY'] = "<ANYSCALE_API_KEY>"
os.environ['OPENAI_API_KEY'] = "<OPENAI_API_KEY>"
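Because a missing or empty key only surfaces later as an authentication error, it can help to check the environment up front. The following is a minimal sketch; `require_env` is a hypothetical helper introduced here for illustration, not part of any library used in this notebook.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or fail with a clear message."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value


# Example (commented out so the sketch runs without real keys):
# for key_name in ("ANYSCALE_API_KEY", "OPENAI_API_KEY"):
#     require_env(key_name)
```

Calling the helper right after the keys are set makes a forgotten placeholder fail immediately instead of in the middle of an experiment.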

Load the data and embed it

In [None]:
documents = SimpleDirectoryReader("/content/data").load_data()

service_context = ServiceContext.from_defaults(llm=OpenAI(model="gpt-3.5-turbo"),
                                               embed_model="local:BAAI/bge-small-en-v1.5")

index = VectorStoreIndex.from_documents(documents, service_context=service_context)

Create a query engine with `top_k = 5`

In [None]:
query_engine = index.as_query_engine(similarity_top_k=5)
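To make the role of `similarity_top_k` concrete, here is a minimal sketch of top-k retrieval over embedding vectors, assuming cosine similarity as the scoring function. The function names and the toy vectors are illustrative only; the real index handles embedding and scoring internally.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=5):
    """Return the indices of the k document vectors most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), idx) for idx, vec in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [idx for _, idx in scored[:k]]


# Toy 2-dimensional "embeddings" for three chunks.
toy_docs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
best = top_k([1.0, 0.0], toy_docs, k=2)
```

A larger `k` passes more chunks to the LLM as context, which can improve recall at the cost of a longer and noisier prompt.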

## **IMPLEMENTING PROMPT ENGINEERING**

Define a function to display the prompts

In [None]:
def display_prompt_dict(prompts_dict):
    for k, p in prompts_dict.items():
        text_md = f"**Prompt Key**: {k}<br>" f"**Text:** <br>"
        display(Markdown(text_md))
        print(p.get_template())
        display(Markdown("<br><br>"))

Implement few-shot prompting

In [None]:
from llama_index.core.schema import TextNode

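few_shot_nodes = []
# Each non-empty line of the file is expected to hold one JSON record.
with open("/content/huongdan.json", "r", encoding="utf-8") as f:
    for line in f:
        if line.strip():
            few_shot_nodes.append(TextNode(text=line.strip()))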

few_shot_index = VectorStoreIndex(few_shot_nodes)
few_shot_retriever = few_shot_index.as_retriever(similarity_top_k=5)

In [None]:
import json

def few_shot_examples_fn(**kwargs):
    query_str = kwargs["query_str"]
    retrieved_nodes = few_shot_retriever.retrieve(query_str)

    result_strs = []
    for n in retrieved_nodes:
        raw_dict = json.loads(n.get_content())
        query = raw_dict["query"]
        response_dict = json.loads(raw_dict["response"])
        result_str = f"""\
Query: {query}
Response: {response_dict}"""
        result_strs.append(result_str)
    return "\n\n".join(result_strs)
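The function above implies a particular file layout: one JSON object per line, with a `query` field and a `response` field that is itself a JSON-encoded string. The sketch below reproduces only the formatting step on an invented sample record, so the expected shape of the few-shot block is visible without an index or a retriever. The sample content and the name `format_few_shot` are assumptions made for illustration.

```python
import json


def format_few_shot(lines):
    """Turn JSON-lines records into the 'Query/Response' text block used as few-shot examples."""
    blocks = []
    for line in lines:
        record = json.loads(line)
        # The response field is stored as a JSON string, so it is decoded a second time.
        response = json.loads(record["response"])
        blocks.append(f"Query: {record['query']}\nResponse: {response}")
    return "\n\n".join(blocks)


# Invented sample record in the assumed file format.
sample_line = json.dumps(
    {"query": "example question", "response": json.dumps({"answer": "example answer"})}
)
few_shot_text = format_few_shot([sample_line])
```

If `response` in the real file is plain text rather than encoded JSON, the second `json.loads` call would fail and should be dropped.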

Write the prompt template for the question

In [None]:
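# Prompt template for question answering. The template text stays in Vietnamese
# because the source documents and questions are Vietnamese. In English it reads:
# "Answer the following question about the member universities of Vietnam National
# University, Ho Chi Minh City, based on the provided context information.
# Context information is below. ... Query: {query_str}. Answer:"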
qa_prompt_tmpl_str = """\
Trả lời câu hỏi sau về thông tin các trường đại học thành viên thuộc khối Đại học Quốc Gia TP. Hồ Chí Minh dựa trên thông tin ngữ cảnh được cung cấp.
Thông tin ngữ cảnh dưới đây.
---------------------
{context_str}
---------------------

Query: {query_str}.
Trả lời: \
"""

def get_context():
    """Concatenate the text of every loaded document into one string."""
    return "\n".join(doc.text for doc in documents)

qa_prompt_tmpl = PromptTemplate(qa_prompt_tmpl_str)
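At query time the two placeholders are filled with the retrieved chunks and the user's question. A minimal sketch of that substitution using plain string formatting is shown below; the shortened English template and the name `build_prompt` are assumptions for illustration, and the actual filling is done by the query engine's response synthesizer.

```python
# Shortened stand-in for the template above, with the same two placeholders.
demo_template = "Context:\n{context_str}\n\nQuery: {query_str}\nAnswer: "


def build_prompt(template, retrieved_chunks, question):
    """Join the retrieved chunks into one context string and fill the template."""
    context_str = "\n\n".join(retrieved_chunks)
    return template.format(context_str=context_str, query_str=question)


demo_prompt = build_prompt(demo_template, ["chunk A", "chunk B"], "What is X?")
```

Keeping the placeholder names `context_str` and `query_str` matters, since those are the variables the QA template is expected to expose.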

In [None]:
query_engine.update_prompts({"response_synthesizer:text_qa_template": qa_prompt_tmpl})

In [None]:
display_prompt_dict(query_engine.get_prompts())

## **IMPLEMENTING THE HYDE METHOD**

In [None]:
hyde = HyDEQueryTransform(include_original=True)
query_engine = TransformQueryEngine(query_engine, hyde)
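The idea behind HyDE is to retrieve with the embedding of a hypothetical answer rather than the embedding of the question alone. The sketch below illustrates that idea with a toy bag-of-words embedding and a stubbed answer generator. Everything in it is an assumption made for illustration: the real transform calls an LLM to write the hypothetical document and uses the configured embedding model, and averaging the two embeddings is shown here as one plausible way to combine them when the original query is included.

```python
from collections import Counter


def embed(text, vocabulary):
    """Toy embedding: word counts over a fixed vocabulary."""
    counts = Counter(text.lower().split())
    return [float(counts[word]) for word in vocabulary]


def mean_vector(vectors):
    """Element-wise mean of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def hyde_query_embedding(query, generate_answer, vocabulary, include_original=True):
    """Embed a generated hypothetical answer, optionally combined with the query itself."""
    texts = [generate_answer(query)]
    if include_original:
        texts.append(query)
    return mean_vector([embed(text, vocabulary) for text in texts])


# Stub generator standing in for the LLM call.
demo_vocab = ["tuition", "fee", "campus"]
demo_vec = hyde_query_embedding("campus tuition", lambda q: "tuition fee fee", demo_vocab)
```

The resulting vector is what would be compared against the chunk embeddings, so retrieval is driven by answer-like text instead of question-like text.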

## **EXPERIMENTS AND EVALUATION**

Load the evaluation set

In [None]:
import pandas as pd

file_path = r'/content/testset.xlsx'
test = pd.read_excel(file_path)

print(test)

In [None]:
questions = test["question"].to_list()

**Experiment 1: Evaluating the RAG system with GPT-3.5 Turbo**

In [None]:
service_context = ServiceContext.from_defaults(llm=OpenAI(model="gpt-3.5-turbo"),
                                               embed_model="local:BAAI/bge-small-en-v1.5")

index = VectorStoreIndex.from_documents(documents, service_context=service_context)

query_engine = index.as_query_engine(similarity_top_k=5)

In [None]:
answers = []
contexts = []

for i in questions:
  response_vector = query_engine.query(i)
  answers.append(response_vector)
  contexts.append([a.get_text() for a in response_vector.source_nodes])

In [None]:
# The "ground_truth" column should hold one reference string per question.
ground_truths = [str(a) for a in test["ground_truth"]]

In [None]:
# Convert the Response objects to plain strings.
answers = [str(i) for i in answers]

In [None]:
datasample = {
    "question": questions,
    "contexts": contexts,
    "answer": answers,
    "ground_truth": ground_truths
}

dataset = Dataset.from_dict(datasample)
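A mismatch in column lengths or value types tends to surface only as an opaque error during evaluation, so a quick structural check before building the dataset can save a run. This is a minimal sketch; `check_sample` is a hypothetical helper, and the expected types (strings for three columns, a list of strings for `contexts`) are an assumption based on the column names used above.

```python
def check_sample(sample):
    """Validate the evaluation dict and return the number of rows."""
    required = ("question", "contexts", "answer", "ground_truth")
    missing = [column for column in required if column not in sample]
    if missing:
        raise ValueError(f"Missing columns: {missing}")
    lengths = {column: len(sample[column]) for column in required}
    if len(set(lengths.values())) != 1:
        raise ValueError(f"Column lengths differ: {lengths}")
    for column in ("question", "answer", "ground_truth"):
        if not all(isinstance(value, str) for value in sample[column]):
            raise TypeError(f"Column {column!r} must contain only strings")
    if not all(isinstance(value, list) for value in sample["contexts"]):
        raise TypeError("Column 'contexts' must contain lists of strings")
    return lengths["question"]


# Invented one-row sample in the assumed layout.
demo_sample = {
    "question": ["q"],
    "contexts": [["c1", "c2"]],
    "answer": ["a"],
    "ground_truth": ["g"],
}
demo_rows = check_sample(demo_sample)
```

Running `check_sample(datasample)` just before `Dataset.from_dict` would catch, for example, answers that are still response objects rather than strings.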

Set up the evaluation metrics

In [None]:
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
)

metrics = [
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall
]

In [None]:
from ragas import evaluate

result1 = evaluate(dataset=dataset, metrics=metrics, is_async=True, raise_exceptions=False)

print(result1)

In [None]:
rs1 = result1.to_pandas()
rs1.head()

**Experiment 2: Evaluating the RAG system with GPT-3.5 Turbo and prompt engineering**

In [None]:
query_engine.update_prompts({"response_synthesizer:text_qa_template": qa_prompt_tmpl})

In [None]:
answers = []
contexts = []

for i in questions:
  response_vector = query_engine.query(i)
  answers.append(response_vector)
  contexts.append([a.get_text() for a in response_vector.source_nodes])

In [None]:
# The "ground_truth" column should hold one reference string per question.
ground_truths = [str(a) for a in test["ground_truth"]]

In [None]:
# Convert the Response objects to plain strings.
answers = [str(i) for i in answers]

In [None]:
datasample = {
    "question": questions,
    "contexts": contexts,
    "answer": answers,
    "ground_truth": ground_truths
}

dataset = Dataset.from_dict(datasample)

In [None]:
result2 = evaluate(dataset=dataset, metrics=metrics, is_async=True, raise_exceptions=False)

rs2 = result2.to_pandas()
rs2.head()

**Experiment 3: Evaluating the RAG system with GPT-3.5 Turbo and HyDE**

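In [None]:
# Build a fresh base engine first so the custom prompt from Experiment 2 is not carried over.
query_engine = index.as_query_engine(similarity_top_k=5)

In [None]:
# Wrap the base engine with HyDE. The wrap must come last, otherwise it is overwritten.
hyde = HyDEQueryTransform(include_original=True)
query_engine = TransformQueryEngine(query_engine, hyde)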

In [None]:
answers = []
contexts = []

for i in questions:
  response_vector = query_engine.query(i)
  answers.append(response_vector)
  contexts.append([a.get_text() for a in response_vector.source_nodes])

In [None]:
# The "ground_truth" column should hold one reference string per question.
ground_truths = [str(a) for a in test["ground_truth"]]

In [None]:
# Convert the Response objects to plain strings.
answers = [str(i) for i in answers]

In [None]:
datasample = {
    "question": questions,
    "contexts": contexts,
    "answer": answers,
    "ground_truth": ground_truths
}

dataset = Dataset.from_dict(datasample)

In [None]:
result3 = evaluate(dataset=dataset, metrics=metrics, is_async=True, raise_exceptions=False)

rs3 = result3.to_pandas()
rs3.head()

**Experiment 4: Evaluating the RAG system with Mixtral 8x7B and prompt engineering**

In [None]:
service_context = ServiceContext.from_defaults(llm=Anyscale(model="mistralai/Mixtral-8x7B-Instruct-v0.1"),
                                               embed_model="local:BAAI/bge-small-en-v1.5")

index = VectorStoreIndex.from_documents(documents, service_context=service_context)

query_engine = index.as_query_engine(similarity_top_k=5)

In [None]:
query_engine.update_prompts({"response_synthesizer:text_qa_template": qa_prompt_tmpl})

In [None]:
answers = []
contexts = []

for i in questions:
  response_vector = query_engine.query(i)
  answers.append(response_vector)
  contexts.append([a.get_text() for a in response_vector.source_nodes])

In [None]:
# The "ground_truth" column should hold one reference string per question.
ground_truths = [str(a) for a in test["ground_truth"]]

In [None]:
# Convert the Response objects to plain strings.
answers = [str(i) for i in answers]

In [None]:
datasample = {
    "question": questions,
    "contexts": contexts,
    "answer": answers,
    "ground_truth": ground_truths
}

dataset = Dataset.from_dict(datasample)

In [None]:
result4 = evaluate(dataset=dataset, metrics=metrics, is_async=True, raise_exceptions=False)

rs4 = result4.to_pandas()
rs4.head()

**Experiment 5: Evaluating the RAG system with Mixtral 8x7B and HyDE**

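In [None]:
# Build a fresh base engine first so the custom prompt from Experiment 4 is not carried over.
query_engine = index.as_query_engine(similarity_top_k=5)

In [None]:
# Wrap the base engine with HyDE. The wrap must come last, otherwise it is overwritten.
hyde = HyDEQueryTransform(include_original=True)
query_engine = TransformQueryEngine(query_engine, hyde)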

In [None]:
answers = []
contexts = []

for i in questions:
  response_vector = query_engine.query(i)
  answers.append(response_vector)
  contexts.append([a.get_text() for a in response_vector.source_nodes])

In [None]:
# The "ground_truth" column should hold one reference string per question.
ground_truths = [str(a) for a in test["ground_truth"]]

In [None]:
# Convert the Response objects to plain strings.
answers = [str(i) for i in answers]

In [None]:
datasample = {
    "question": questions,
    "contexts": contexts,
    "answer": answers,
    "ground_truth": ground_truths
}

dataset = Dataset.from_dict(datasample)

In [None]:
result5 = evaluate(dataset=dataset, metrics=metrics, is_async=True, raise_exceptions=False)

rs5 = result5.to_pandas()
rs5.head()
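To compare the five experiments, the per-question scores can be reduced to one mean per metric. The sketch below works on a list of row dicts and skips missing or NaN scores, which may appear when evaluation runs with `raise_exceptions=False`. The function name `summarize` is an assumption for illustration, and the numbers in the demo rows are made-up placeholders, not results from these experiments.

```python
from statistics import mean

METRIC_NAMES = ("faithfulness", "answer_relevancy", "context_precision", "context_recall")


def summarize(rows, metric_names=METRIC_NAMES):
    """Return the mean of each metric over the rows, ignoring None and NaN values."""
    summary = {}
    for name in metric_names:
        values = [
            row[name]
            for row in rows
            # value == value is False only for NaN
            if row.get(name) is not None and row[name] == row[name]
        ]
        summary[name] = mean(values) if values else None
    return summary


# Placeholder rows with invented scores, only to show the expected input shape.
demo_scores = [
    {"faithfulness": 1.0, "answer_relevancy": 0.5, "context_precision": 1.0, "context_recall": 0.0},
    {"faithfulness": 0.5, "answer_relevancy": float("nan"), "context_precision": 1.0, "context_recall": 1.0},
]
demo_summary = summarize(demo_scores)
```

With the data frames produced above, the rows for one experiment can be obtained with `rs1.to_dict("records")`, and the five summaries can then be placed side by side.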