
# Understanding the RAG Pipeline from Scratch

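**Project goal**: Implement an end-to-end RAG (Retrieval-Augmented Generation) pipeline on a local GPU: pose a query about a document's contents and have an LLM generate the answer.

**Related frameworks**: [LlamaIndex](https://www.llamaindex.ai/) and [LangChain](https://www.langchain.com/)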

## What is RAG?

Paper: [*Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks*](https://arxiv.org/abs/2005.11401).  
[GitHub](https://github.com/mrdbourke/simple-local-rag/?tab=readme-ov-file#setup)

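The RAG flow:

* **Retrieval** - Find the passages in the given documents that are relevant to the query.  
`load documents -> chunking -> embedding -> query -> embed the query -> compare the query embedding with the chunk embeddings`

* **Augmented** - Feed the retrieved information to the LLM as part of its input.
* **Generation** - The LLM generates its output, shaped and improved by that augmented input.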
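
To make the **Augmented** step concrete, here is a minimal sketch of how retrieved chunks might be merged into a prompt. The function name `build_prompt`, the prompt wording, and the example chunks are illustrative assumptions; they do not come from any library or from the notebook cells in this walkthrough.

```python
def build_prompt(query: str, retrieved_chunks: list[str]) -> str:
    """Combine retrieved text chunks and the user query into one prompt string."""
    # Put each chunk on its own bullet so the LLM can tell the passages apart
    context = "\n".join(f"- {chunk}" for chunk in retrieved_chunks)
    return (
        "Answer the query using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Query: {query}\n"
        "Answer:"
    )


# Hypothetical chunks standing in for real retrieval results
example_chunks = [
    "RAG combines a retriever with a generator.",
    "The retriever returns passages relevant to the query.",
]
prompt = build_prompt("What does the retriever do?", example_chunks)
print(prompt)
```

The LLM then completes the text after `Answer:`, which is the **Generation** step.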

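## Why RAG?

Two main improvements:

<font color=#faf>**Preventing hallucinations**</font> - LLMs are prone to hallucinations: output that looks correct but is actually wrong. RAG helps by giving the LLM references grounded in facts.  
<font color=#faf>**Custom data**</font> - LLMs are strong at language but often lack **specific knowledge**. RAG can supply domain-specific data, so you can quickly build a knowledge base for a <font color=#ffa>**specialized domain**</font> and gain more <font color=#ffa>interpretability</font>.

RAG can also be a faster solution than fine-tuning an LLM on <font color=#ffa>**specific data**</font>.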

## Why local? --> Privacy (your own hardware), cost, trend, (efficiency?)  

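## Key terms

- **Token**  
- **Embedding** 
- **Embedding model**: Takes text of A tokens and turns it into a vector of size B. *The embedding model can be different from the large language model.*  
- **Similarity search/vector search**: Uses cosine similarity. The more similar two texts are, the higher their similarity score should be; the more different they are, the lower it should be.  
- **LLM** 
- **Prompt**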
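
As a small worked example of the similarity measure named above, the sketch below computes a dot product and a cosine similarity in plain Python. The helper names `dot` and `cosine_similarity` and the three toy vectors are assumptions made for illustration; real embeddings have hundreds of dimensions.

```python
import math


def dot(a: list[float], b: list[float]) -> float:
    """Sum of element-wise products of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Dot product divided by the product of the vectors' L2 norms."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


v1 = [1.0, 2.0, 3.0]
v2 = [8.0, 16.0, 24.0]   # same direction as v1, eight times longer
v3 = [-1.0, -2.0, -3.0]  # opposite direction to v1

print(dot(v1, v2))                # grows with vector length
print(cosine_similarity(v1, v2))  # close to 1.0: same direction
print(cosine_similarity(v1, v3))  # close to -1.0: opposite direction
```

Cosine similarity ignores vector length and compares direction only. For vectors already scaled to unit length the two measures coincide, so a plain dot product can serve as the similarity score in that case.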



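## What we're building

Build a RAG pipeline that lets you chat with a PDF, specifically an open-source XX knowledge base.

1. Open the PDF.
2. Format the PDF text so it is ready for the embedding model (a process called text splitting/chunking).
3. Embed all chunks of the PDF into numerical representations that can be stored for later use.
4. Build a retrieval system that uses vector search to find the chunks relevant to a query.
5. Create a prompt that includes the retrieved text chunks.
6. Generate an answer to the query based on the passages from the PDF.

The steps fall into two main parts:

>Document preprocessing/embedding creation (steps 1-3).  
Search and answer (steps 4-6).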

<img src="https://github.com/mrdbourke/simple-local-rag/blob/main/images/simple-local-rag-workflow-flowchart.png?raw=true" alt="flowchart of a local RAG workflow" />

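## 1. Creating embeddings for the document

What you need:
* A PDF document
* An embedding model

Steps:
1. Load the document.
2. Text splitting/chunking.
3. Embed the chunks with the embedding model.
4. Save the embeddings for later use.

### Loading the PDF  
<font color=#ffa>PyMuPDF</font> (`import fitz`): a Python library for opening PDF files.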


In [6]:
import os
import requests

pdf_path = "RAG.pdf"

Read the document with the fitz package; define the `open_and_read_pdf` function to collect each page's text and statistics.

In [7]:
# Requires !pip install PyMuPDF, see: https://github.com/pymupdf/pymupdf
import fitz
from tqdm.auto import tqdm

def text_formatter(text: str) -> str:
    """Performs minor formatting on text."""
    cleaned_text = text.replace("\n", " ").strip()
    return cleaned_text

# Only the text is extracted; images are not read
def open_and_read_pdf(pdf_path):
    doc = fitz.open(pdf_path)  # open a document
    pages_and_texts = []  # one dict of text + stats per page
    for page_number, page in tqdm(enumerate(doc)):  # iterate the document pages
        text = page.get_text()  # get plain text encoded as UTF-8
        text = text_formatter(text)
        pages_and_texts.append({"page_number": page_number,  # zero-based page index (this PDF needs no page offset)
                                "page_char_count": len(text),
                                "page_word_count": len(text.split(" ")),
                                "page_sentence_count_raw": len(text.split(". ")),
                                "page_token_count": len(text) / 4,  # 1 token = ~4 chars, see: https://help.openai.com/en/articles/4936856-what-are-tokens-and-how-to-count-them
                                "text": text})
    return pages_and_texts

pages_and_texts = open_and_read_pdf(pdf_path=pdf_path)
pages_and_texts[:2]

0it [00:00, ?it/s]

[{'page_number': 0,
  'page_char_count': 2975,
  'page_word_count': 410,
  'page_sentence_count_raw': 14,
  'page_token_count': 743.75,
  'text': 'Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks Patrick Lewis†‡, Ethan Perez?, Aleksandra Piktus†, Fabio Petroni†, Vladimir Karpukhin†, Naman Goyal†, Heinrich Küttler†, Mike Lewis†, Wen-tau Yih†, Tim Rocktäschel†‡, Sebastian Riedel†‡, Douwe Kiela† †Facebook AI Research; ‡University College London; ?New York University; plewis@fb.com Abstract Large pre-trained language models have been shown to store factual knowledge in their parameters, and achieve state-of-the-art results when ﬁne-tuned on down- stream NLP tasks. However, their ability to access and precisely manipulate knowledge is still limited, and hence on knowledge-intensive tasks, their perfor- mance lags behind task-speciﬁc architectures. Additionally, providing provenance for their decisions and updating their world knowledge remain open research prob- lems. Pre-tr

Randomly sample a few pages to check

In [8]:
import random

random.sample(pages_and_texts, k=3)

[{'page_number': 2,
  'page_char_count': 3678,
  'page_word_count': 570,
  'page_sentence_count_raw': 25,
  'page_token_count': 919.5,
  'text': 'by ✓ that generates a current token based on a context of the previous i − 1 tokens y1:i−1, the original input x and a retrieved passage z. To train the retriever and generator end-to-end, we treat the retrieved document as a latent variable. We propose two models that marginalize over the latent documents in different ways to produce a distribution over generated text. In one approach, RAG-Sequence, the model uses the same document to predict each target token. The second approach, RAG-Token, can predict each target token based on a different document. In the following, we formally introduce both models and then describe the p⌘ and p✓ components, as well as the training and decoding procedure. 2.1 Models RAG-Sequence Model The RAG-Sequence model uses the same retrieved document to generate the complete sequence. Technically, it treats the re

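### Getting text statistics

Exploratory data analysis (EDA): get a feel for the size of the text we're working with (character count, word count, and so on).  
Convert the list of dictionaries into a DataFrame to inspect it.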

In [9]:
import pandas as pd

df = pd.DataFrame(pages_and_texts)
df.head(16)

| | page_number | page_char_count | page_word_count | page_sentence_count_raw | page_token_count | text |
|---|---|---|---|---|---|---|
| 0 | 0 | 2975 | 410 | 14 | 743.75 | Retrieval-Augmented Generation for Knowledge-I... |
| 1 | 1 | 4554 | 623 | 26 | 1138.5 | `TheDiYine Comed\(x) T QXeU\ EQcRdeU T([) MIP...` |
| 2 | 2 | 3678 | 570 | 25 | 919.5 | by ✓ that generates a current token based on a... |
| 3 | 3 | 4227 | 681 | 35 | 1056.75 | minimize the negative marginal log-likelihood ... |
| 4 | 4 | 4551 | 693 | 35 | 1137.75 | MSMARCO as an open-domain abstractive QA task.... |
| 5 | 5 | 4115 | 653 | 36 | 1028.75 | Table 1: Open-Domain QA Test Scores. For TQA, ... |
| 6 | 6 | 4437 | 725 | 40 | 1109.25 | Document 1: his works are considered classics ... |
| 7 | 7 | 3060 | 499 | 20 | 765.0 | Table 4: Human assessments for the Jeopardy Qu... |
| 8 | 8 | 3976 | 600 | 23 | 994.0 | General-Purpose Architectures for NLP Prior wo... |
| 9 | 9 | 3574 | 505 | 39 | 893.5 | Broader Impact This work offers several positi... |


In [10]:
# Summary statistics for every numeric column
df.describe().round(2)

| | page_number | page_char_count | page_word_count | page_sentence_count_raw | page_token_count |
|---|---|---|---|---|---|
| count | 16.0 | 16.0 | 16.0 | 16.0 | 16.0 |
| mean | 7.5 | 3767.75 | 531.62 | 40.69 | 941.94 |
| std | 4.76 | 787.18 | 137.4 | 20.63 | 196.79 |
| min | 0.0 | 1361.0 | 167.0 | 14.0 | 340.25 |
| 25% | 3.75 | 3652.0 | 468.75 | 24.5 | 913.0 |
| 50% | 7.5 | 3976.5 | 513.5 | 35.5 | 994.12 |
| 75% | 11.25 | 4155.75 | 630.5 | 60.25 | 1038.94 |
| max | 15.0 | 4554.0 | 725.0 | 78.0 | 1138.5 |


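### Splitting pages into sentences

(Adjustable) Split each page into groups of N sentences.

Flow:

`get text -> chunking -> embed the chunks -> embeddings`

Two ways to split text into sentences:

1. `text.split(". ")`: split wherever ". " occurs.
2. Use an NLP library such as spaCy or nltk to split the sentences.

Working at the sentence level makes it possible to find out which groups of sentences actually help in RAG.

We use spaCy to split the text into sentences because it is more robust than plain `text.split(". ")`.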
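
To see where the naive approach breaks, the short sketch below applies `str.split(". ")` to a made-up three-sentence text. The example text is an assumption chosen to contain an abbreviation; the snippet only shows the failure mode and makes no claim about what spaCy returns for the same input.

```python
text = "RAG has two parts, e.g. a retriever and a generator. It was proposed in 2020. Does it help?"

# Naive approach: split wherever a period is followed by a space
naive_sentences = text.split(". ")

# The abbreviation "e.g. " is treated as a sentence boundary,
# and the separator itself (the period) is dropped from each piece
for sentence in naive_sentences:
    print(repr(sentence))

print(len(naive_sentences))
```

The text has three sentences, but the split yields four pieces because the period inside "e.g." is followed by a space.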

In [11]:
from spacy.lang.en import English

nlp = English()

# Add the sentencizer component to the pipeline
nlp.add_pipe("sentencizer")

# Create a document instance as an example
doc = nlp("This is a sentence. This another sentence.")
assert len(list(doc.sents)) == 2

# The text was split into two sentences as expected
list(doc.sents)

[This is a sentence., This another sentence.]

In [12]:
for item in tqdm(pages_and_texts):
    item["sentences"] = list(nlp(item["text"]).sents)
    
    # Make sure all sentences are strings
    item["sentences"] = [str(sentence) for sentence in item["sentences"]]
    
    # Count the sentences 
    item["page_sentence_count_spacy"] = len(item["sentences"])

  0%|          | 0/16 [00:00<?, ?it/s]

In [13]:
# Check the sentence splitting on a random page
random.sample(pages_and_texts, k=1)

[{'page_number': 5,
  'page_char_count': 4115,
  'page_word_count': 653,
  'page_sentence_count_raw': 36,
  'page_token_count': 1028.75,
  'text': 'Table 1: Open-Domain QA Test Scores. For TQA, left column uses the standard test set for Open- Domain QA, right column uses the TQA-Wiki test set. See Appendix D for further details. Model NQ TQA WQ CT Closed Book T5-11B [52] 34.5 - /50.1 37.4 - T5-11B+SSM[52] 36.6 - /60.5 44.7 - Open Book REALM [20] 40.4 - / - 40.7 46.8 DPR [26] 41.5 57.9/ - 41.1 50.6 RAG-Token 44.1 55.2/66.1 45.5 50.0 RAG-Seq. 44.5 56.8/68.0 45.2 52.2 Table 2: Generation and classiﬁcation Test Scores. MS-MARCO SotA is [4], FEVER-3 is [68] and FEVER-2 is [57] *Uses gold context/evidence. Best model without gold access underlined. Model Jeopardy MSMARCO FVR3 FVR2 B-1 QB-1 R-L B-1 Label Acc. SotA - - 49.8* 49.9* 76.8 92.2* BART 15.1 19.7 38.2 41.6 64.0 81.1 RAG-Tok. 17.3 22.2 40.1 41.5 72.5 89.5 RAG-Seq. 14.7 21.4 40.8 44.2 to more effective marginalization over documents. F

Convert to a DataFrame to inspect

In [14]:
df = pd.DataFrame(pages_and_texts)
df.describe().round(2)

| | page_number | page_char_count | page_word_count | page_sentence_count_raw | page_token_count | page_sentence_count_spacy |
|---|---|---|---|---|---|---|
| count | 16.0 | 16.0 | 16.0 | 16.0 | 16.0 | 16.0 |
| mean | 7.5 | 3767.75 | 531.62 | 40.69 | 941.94 | 37.62 |
| std | 4.76 | 787.18 | 137.4 | 20.63 | 196.79 | 16.44 |
| min | 0.0 | 1361.0 | 167.0 | 14.0 | 340.25 | 16.0 |
| 25% | 3.75 | 3652.0 | 468.75 | 24.5 | 913.0 | 22.75 |
| 50% | 7.5 | 3976.5 | 513.5 | 35.5 | 994.12 | 37.0 |
| 75% | 11.25 | 4155.75 | 630.5 | 60.25 | 1038.94 | 54.75 |
| max | 15.0 | 4554.0 | 725.0 | 78.0 | 1138.5 | 62.0 |


With 10 sentences per chunk, a page averaging about 38 sentences yields roughly 4 chunks.

### Chunking the sentences

embedding_model = `all-mpnet-base-v2` (capacity = 384 tokens)

In [15]:
# Define split size to turn groups of sentences into chunks
num_sentence_chunk_size = 10

# Create a function that splits a list into consecutive sublists of the desired size
def split_list(input_list: list, 
               slice_size: int) -> list[list[str]]:
    """
    Splits the input_list into sublists of size slice_size (or as close as possible).

    For example, a list of 17 sentences would be split into two lists of [[10], [7]]
    """
    return [input_list[i:i + slice_size] for i in range(0, len(input_list), slice_size)]

# Loop through pages and texts and split sentences into chunks
for item in tqdm(pages_and_texts):
    item["sentence_chunks"] = split_list(input_list=item["sentences"],
                                         slice_size=num_sentence_chunk_size)
    item["num_chunks"] = len(item["sentence_chunks"])

  0%|          | 0/16 [00:00<?, ?it/s]
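
The slicing behaviour of `split_list` can be checked on a toy input. The sketch below repeats the one-line function body from the cell above so it runs on its own; the 17 placeholder sentences are made up for illustration.

```python
def split_list(input_list: list[str], slice_size: int) -> list[list[str]]:
    """Split input_list into consecutive sublists of at most slice_size items."""
    return [input_list[i:i + slice_size] for i in range(0, len(input_list), slice_size)]


# 17 placeholder sentences (made up for illustration)
sentences = [f"Sentence {n}." for n in range(1, 18)]
chunks = split_list(sentences, slice_size=10)

# Sizes of the resulting chunks: one full chunk and one remainder
print([len(chunk) for chunk in chunks])
```

Because the slices do not overlap, a passage that straddles a chunk boundary is split between two chunks, and the last chunk of a page can be much shorter than the others.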

### Inspecting the result after chunking

In [16]:
# Sample a random page to inspect its chunks (with 10 sentences per chunk, every page here has at least 2 chunks)
random.sample(pages_and_texts, k=1)

[{'page_number': 3,
  'page_char_count': 4227,
  'page_word_count': 681,
  'page_sentence_count_raw': 35,
  'page_token_count': 1056.75,
  'text': 'minimize the negative marginal log-likelihood of each target, P j − log p(yj|xj) using stochastic gradient descent with Adam [28]. Updating the document encoder BERTd during training is costly as it requires the document index to be periodically updated as REALM does during pre-training [20]. We do not ﬁnd this step necessary for strong performance, and keep the document encoder (and index) ﬁxed, only ﬁne-tuning the query encoder BERTq and the BART generator. 2.5 Decoding At test time, RAG-Sequence and RAG-Token require different ways to approximate arg maxy p(y|x). RAG-Token The RAG-Token model can be seen as a standard, autoregressive seq2seq genera- tor with transition probability: p0 ✓(yi|x, y1:i−1) = P z2top-k(p(·|x)) p⌘(zi|x)p✓(yi|x, zi, y1:i−1) To decode, we can plug p0 ✓(yi|x, y1:i−1) into a standard beam decoder. RAG-Sequence For R

### On average, each page has 4.12 chunks

In [17]:
# Create a DataFrame to get stats
df = pd.DataFrame(pages_and_texts)
df.describe().round(2)

| | page_number | page_char_count | page_word_count | page_sentence_count_raw | page_token_count | page_sentence_count_spacy | num_chunks |
|---|---|---|---|---|---|---|---|
| count | 16.0 | 16.0 | 16.0 | 16.0 | 16.0 | 16.0 | 16.0 |
| mean | 7.5 | 3767.75 | 531.62 | 40.69 | 941.94 | 37.62 | 4.12 |
| std | 4.76 | 787.18 | 137.4 | 20.63 | 196.79 | 16.44 | 1.63 |
| min | 0.0 | 1361.0 | 167.0 | 14.0 | 340.25 | 16.0 | 2.0 |
| 25% | 3.75 | 3652.0 | 468.75 | 24.5 | 913.0 | 22.75 | 3.0 |
| 50% | 7.5 | 3976.5 | 513.5 | 35.5 | 994.12 | 37.0 | 4.0 |
| 75% | 11.25 | 4155.75 | 630.5 | 60.25 | 1038.94 | 54.75 | 6.0 |
| max | 15.0 | 4554.0 | 725.0 | 78.0 | 1138.5 | 62.0 | 7.0 |


### From `pages_and_texts` to `pages_and_chunks`

In [18]:
import re

# Split each chunk into its own item
pages_and_chunks = []
for item in tqdm(pages_and_texts):
    for sentence_chunk in item["sentence_chunks"]:
        chunk_dict = {}
        chunk_dict["page_number"] = item["page_number"]
        
        # Join the sentences together into a paragraph-like structure, aka a chunk (so they are a single string)
        joined_sentence_chunk = "".join(sentence_chunk).replace("  ", " ").strip()
        joined_sentence_chunk = re.sub(r'\.([A-Z])', r'. \1', joined_sentence_chunk) # ".A" -> ". A" for any full-stop/capital letter combo 
        chunk_dict["sentence_chunk"] = joined_sentence_chunk

        # Get stats about the chunk
        chunk_dict["chunk_char_count"] = len(joined_sentence_chunk)
        chunk_dict["chunk_word_count"] = len([word for word in joined_sentence_chunk.split(" ")])
        chunk_dict["chunk_token_count"] = len(joined_sentence_chunk) / 4 # 1 token = ~4 characters
        
        pages_and_chunks.append(chunk_dict)

# How many chunks are there?
len(pages_and_chunks)

  0%|          | 0/16 [00:00<?, ?it/s]

66

### Randomly inspecting one chunk

In [19]:

random.sample(pages_and_chunks, k=1)

[{'page_number': 15,
  'sentence_chunk': 'Association for Computational Linguistics.doi: 10.18653/v1/D19-1253. URL https://www.aclweb.org/anthology/D19-1253. [68] Wanjun Zhong, Jingjing Xu, Duyu Tang, Zenan Xu, Nan Duan, Ming Zhou, Jiahai Wang, and Jian Yin. Reasoning over semantic-level graph for fact checking. ArXiv, abs/1909.03745, 2019. URL https://arxiv.org/abs/1909.03745.16',
  'chunk_char_count': 340,
  'chunk_word_count': 37,
  'chunk_token_count': 85.0}]

### DataFrame of the chunks

In [20]:
# Get stats about our chunks
df = pd.DataFrame(pages_and_chunks)
df.describe().round(2)

| | page_number | chunk_char_count | chunk_word_count | chunk_token_count |
|---|---|---|---|---|
| count | 66.0 | 66.0 | 66.0 | 66.0 |
| mean | 8.41 | 911.61 | 127.85 | 227.9 |
| std | 4.36 | 447.55 | 73.9 | 111.89 |
| min | 0.0 | 109.0 | 14.0 | 27.25 |
| 25% | 5.0 | 604.75 | 73.0 | 151.19 |
| 50% | 9.5 | 769.0 | 98.0 | 192.25 |
| 75% | 12.0 | 1175.75 | 180.5 | 293.94 |
| max | 15.0 | 2017.0 | 313.0 | 504.25 |
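
The largest chunk is estimated at about 504 tokens, which is above the 384-token capacity noted for `all-mpnet-base-v2`, so the tail of the longest chunks may be cut off before embedding. A small sketch for flagging such chunks, assuming the same rough 4-characters-per-token rule used in the cells above; `find_oversized_chunks` and the sample records are hypothetical.

```python
MAX_MODEL_TOKENS = 384  # input capacity stated for all-mpnet-base-v2 in this walkthrough
CHARS_PER_TOKEN = 4     # rough rule of thumb, not a real tokenizer


def find_oversized_chunks(chunks: list[dict], max_tokens: int = MAX_MODEL_TOKENS) -> list[dict]:
    """Return the chunks whose estimated token count exceeds max_tokens."""
    oversized = []
    for chunk in chunks:
        estimated_tokens = len(chunk["sentence_chunk"]) / CHARS_PER_TOKEN
        if estimated_tokens > max_tokens:
            oversized.append({**chunk, "estimated_tokens": estimated_tokens})
    return oversized


# Made-up records shaped like the entries of pages_and_chunks
sample_chunks = [
    {"page_number": 0, "sentence_chunk": "x" * 2017},  # as long as the largest chunk in the stats
    {"page_number": 1, "sentence_chunk": "x" * 769},   # as long as the median chunk
]
flagged = find_oversized_chunks(sample_chunks)
print([(c["page_number"], c["estimated_tokens"]) for c in flagged])
```

If many chunks are flagged, lowering `num_sentence_chunk_size` is one way to bring them under the limit; the character-based estimate is only approximate, so the model's own tokenizer is the final authority.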


### Checking chunks with <= 30 tokens

In [21]:
# Show random chunks with 30 tokens or fewer
# (only one chunk qualifies here, so sampling with replacement prints it five times)
min_token_length = 30
for row in df[df["chunk_token_count"] <= min_token_length].sample(5,replace=True).iterrows():
    print(f'Chunk token count: {row[1]["chunk_token_count"]} | Text: {row[1]["sentence_chunk"]}')

Chunk token count: 27.25 | Text: 32nd AAAI Conference on Artiﬁcial Intelligence, AAAI 2018 ; Conference date: 02-02-2018 Through 07-02-2018.11
Chunk token count: 27.25 | Text: 32nd AAAI Conference on Artiﬁcial Intelligence, AAAI 2018 ; Conference date: 02-02-2018 Through 07-02-2018.11
Chunk token count: 27.25 | Text: 32nd AAAI Conference on Artiﬁcial Intelligence, AAAI 2018 ; Conference date: 02-02-2018 Through 07-02-2018.11
Chunk token count: 27.25 | Text: 32nd AAAI Conference on Artiﬁcial Intelligence, AAAI 2018 ; Conference date: 02-02-2018 Through 07-02-2018.11
Chunk token count: 27.25 | Text: 32nd AAAI Conference on Artiﬁcial Intelligence, AAAI 2018 ; Conference date: 02-02-2018 Through 07-02-2018.11


Chunks with 30 or fewer tokens are usually just headers or footers, so we filter them out.

In [22]:
pages_and_chunks_over_min_token_len = df[df["chunk_token_count"] > min_token_length].to_dict(orient="records")
pages_and_chunks_over_min_token_len[:2]

[{'page_number': 0,
  'sentence_chunk': 'Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks Patrick Lewis†‡, Ethan Perez?,Aleksandra Piktus†, Fabio Petroni†, Vladimir Karpukhin†, Naman Goyal†, Heinrich Küttler†, Mike Lewis†, Wen-tau Yih†, Tim Rocktäschel†‡, Sebastian Riedel†‡, Douwe Kiela† †Facebook AI Research; ‡University College London; ?New York University; plewis@fb.com Abstract Large pre-trained language models have been shown to store factual knowledge in their parameters, and achieve state-of-the-art results when ﬁne-tuned on down- stream NLP tasks. However, their ability to access and precisely manipulate knowledge is still limited, and hence on knowledge-intensive tasks, their perfor- mance lags behind task-speciﬁc architectures. Additionally, providing provenance for their decisions and updating their world knowledge remain open research prob- lems. Pre-trained models with a differentiable access mechanism to explicit non- parametric memory can overcome this is

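### Embedding the chunks

Turn each of our chunks into a numerical representation (an embedding vector, i.e. an ordered sequence of numbers).

We use the `sentence-transformers` library, which includes many embedding models.

We use the `all-mpnet-base-v2` model.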

In [23]:
# Requires !pip install sentence-transformers
from sentence_transformers import SentenceTransformer
embedding_model = SentenceTransformer(model_name_or_path="all-mpnet-base-v2", 
                                      device="cuda")
# Create a list of sentences to turn into numbers
sentences = [
    "The Sentences Transformers library provides an easy and open-source way to create embeddings.",
    "Sentences can be embedded one by one or as a list of strings.",
    "Embeddings are one of the most powerful concepts in machine learning!",
    "Learn to use embeddings well and you'll be well on your way to being an AI engineer."
]

# Sentences are encoded/embedded by calling model.encode()
embeddings = embedding_model.encode(sentences)
embeddings_dict = dict(zip(sentences, embeddings))

# See the embeddings
for sentence, embedding in embeddings_dict.items():
    print("Sentence:", sentence)
    print("Embedding:", embedding)
    print("")



Sentence: The Sentences Transformers library provides an easy and open-source way to create embeddings.
Embedding: [-2.07981262e-02  3.03164721e-02 -2.01218110e-02  6.86483458e-02
 -2.55255401e-02 -8.47689249e-03 -2.07080549e-04 -6.32377267e-02
  2.81606130e-02 -3.33353281e-02  3.02635022e-02  5.30720614e-02
 -5.03526330e-02  2.62288079e-02  3.33313793e-02 -4.51578870e-02
  3.63044031e-02 -1.37112231e-03 -1.20171243e-02  1.14946552e-02
  5.04510924e-02  4.70856987e-02  2.11912915e-02  5.14607430e-02
 -2.03746371e-02 -3.58889252e-02 -6.67838030e-04 -2.94393003e-02
  4.95858938e-02 -1.05639631e-02 -1.52013665e-02 -1.31754903e-03
  4.48197164e-02  1.56023102e-02  8.60379885e-07 -1.21394161e-03
 -2.37978324e-02 -9.09417809e-04  7.34483683e-03 -2.53928010e-03
  5.23370244e-02 -4.68043461e-02  1.66214723e-02  4.71579023e-02
 -4.15599048e-02  9.01954074e-04  3.60278673e-02  3.42214704e-02
  9.68227386e-02  5.94828576e-02 -1.64984502e-02 -3.51249799e-02
  5.92516130e-03 -7.07989209e-04 -2.4103

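The `all-mpnet-base-v2` model has an input window of 384 tokens and produces embedding vectors of shape (768,).  
Embed the chunks one at a time first, then collect the texts into a list and embed them in batches.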

In [24]:
%%time

# Send the model to the GPU
embedding_model.to("cuda")

# Create embeddings one by one on the GPU
for item in tqdm(pages_and_chunks_over_min_token_len):
    item["embedding"] = embedding_model.encode(item["sentence_chunk"])

  0%|          | 0/65 [00:00<?, ?it/s]

CPU times: total: 11.6 s
Wall time: 1.19 s


In [25]:
# Turn text chunks into a single list
text_chunks = [item["sentence_chunk"] for item in pages_and_chunks_over_min_token_len]

In [26]:
%%time


# Embed all texts in batches
text_chunk_embeddings = embedding_model.encode(text_chunks,
                                               batch_size=32, # you can use different batch sizes here for speed/performance, I found 32 works well for this use case
                                               convert_to_tensor=True) # optional to return embeddings as tensor instead of array

text_chunk_embeddings

CPU times: total: 3.25 s
Wall time: 957 ms


tensor([[ 0.0397,  0.0469, -0.0115,  ...,  0.0245, -0.0699, -0.0454],
        [ 0.0150,  0.0806, -0.0144,  ...,  0.0126, -0.0488, -0.0525],
        [ 0.0131,  0.0966, -0.0178,  ...,  0.0178, -0.0884, -0.0685],
        ...,
        [ 0.0296,  0.0491, -0.0122,  ...,  0.0119, -0.0609, -0.0661],
        [ 0.0516,  0.0396, -0.0147,  ...,  0.0588, -0.0588, -0.0222],
        [ 0.0214,  0.0002,  0.0174,  ...,  0.0413, -0.0303, -0.0084]],
       device='cuda:0')

Save the embeddings for later use.

### Save embeddings to file

Convert the `pages_and_chunks_over_min_token_len` list of dictionaries into a DataFrame and save it.

In [27]:
text_chunks_and_embeddings_df = pd.DataFrame(pages_and_chunks_over_min_token_len)
embeddings_df_save_path = "text_chunks_and_embeddings_df.csv"
text_chunks_and_embeddings_df.to_csv(embeddings_df_save_path, index=False)


### Checking that the file was saved to the specified path

In [28]:
text_chunks_and_embedding_df_load = pd.read_csv(embeddings_df_save_path)
text_chunks_and_embedding_df_load

| | page_number | sentence_chunk | chunk_char_count | chunk_word_count | chunk_token_count | embedding |
|---|---|---|---|---|---|---|
| 0 | 0 | Retrieval-Augmented Generation for Knowledge-I... | 1833 | 246 | 458.25 | [ 3.97273190e-02 4.69269864e-02 -1.15035316e-... |
| 1 | 0 | For language generation tasks, we ﬁnd that RAG... | 1139 | 162 | 284.75 | [ 1.50031894e-02 8.06403011e-02 -1.43866343e-... |
| 2 | 1 | `TheDiYine Comed\(x) T QXeU\ EQcRdeU T([) MIP...` | 1686 | 210 | 421.50 | [ 1.30570084e-02 9.66036320e-02 -1.78332869e-... |
| 3 | 1 | The retriever (Dense Passage Retriever [26], h... | 1915 | 270 | 478.75 | [ 5.29118367e-02 5.53543158e-02 -1.69958323e-... |
| 4 | 1 | For FEVER [56] fact veriﬁcation, we achieve re... | 949 | 141 | 237.25 | [ 4.94104400e-02 1.70666482e-02 3.50169698e-... |
| ... | ... | ... | ... | ... | ... | ... |
| 60 | 14 | URL https://www.aclweb.org/ anthology/W18-5446... | 1096 | 143 | 274.00 | [ 4.58982587e-02 -1.25643425e-02 -3.06721888e-... |
| 61 | 14 | URL https://www.aaai.org/ocs/index.php/AAAI/AA... | 601 | 75 | 150.25 | [-1.98382903e-02 -2.96843238e-03 -1.32312218e-... |
| 62 | 14 | URL http://arxiv.org/abs/1410.3916. [65] Jason... | 218 | 30 | 54.50 | [ 2.96152793e-02 4.90686484e-02 -1.22116404e-... |
| 63 | 15 | International Workshop on Search-Oriented Conv... | 1017 | 127 | 254.25 | [ 5.16357645e-02 3.96283381e-02 -1.46658486e-... |
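
In the loaded table the `embedding` column holds text rather than arrays, because CSV stores every cell as a string. A minimal standard-library sketch of that round trip, using a made-up three-number embedding; `parse_embedding` is a hypothetical helper, not a pandas or NumPy function.

```python
import csv
import io


def parse_embedding(cell: str) -> list[float]:
    """Turn a string such as '[ 0.1 -0.2 0.3]' back into a list of floats."""
    return [float(value) for value in cell.strip("[]").split()]


# Write one row to an in-memory CSV; the embedding is stored as its string form
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["sentence_chunk", "embedding"])
writer.writerow(["example chunk", "[ 0.1 -0.2 0.3]"])

# Read it back: every cell comes back as a string
buffer.seek(0)
rows = list(csv.DictReader(buffer))
raw_cell = rows[0]["embedding"]
print(type(raw_cell).__name__, raw_cell)

# Parse the string to recover the numbers
embedding = parse_embedding(raw_cell)
print(embedding)
```

This is why the embeddings have to be converted back into numeric arrays after the CSV is reloaded.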



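### Embedding the chunks and the query

***MTEB*** - the Massive Text Embedding Benchmark leaderboard on Hugging Face

Things to consider:

- Input size - if you need to embed longer sequences, choose a model with a larger input capacity.
- Embedding vector size - in general, larger vectors give better representations but require more compute/storage.
- Model size - larger models usually produce better embeddings but need more compute/time to run.
- Open or closed - open models let you run on your own hardware, whereas closed models can be simpler to set up but require an API call to get embeddings.

Where should I store my embeddings?

- Fewer than 100,000 embeddings: an `np.array` or `torch.tensor` can serve as the store.
- 100,000+ embeddings: use a vector database.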


## 2. RAG - Search and answer

### Similarity search  
Load the embeddings created earlier and convert them to a tensor.

In [29]:
import random

import torch
import numpy as np 
import pandas as pd

device = "cuda" if torch.cuda.is_available() else "cpu"

text_chunks_and_embedding_df = pd.read_csv("text_chunks_and_embeddings_df.csv")

# Convert embedding column back to np.array (it got converted to string when it got saved to CSV)
text_chunks_and_embedding_df["embedding"] = text_chunks_and_embedding_df["embedding"].apply(lambda x: np.fromstring(x.strip("[]"), sep=" "))

# Convert texts and embedding df to list of dicts
pages_and_chunks = text_chunks_and_embedding_df.to_dict(orient="records")

# Convert embeddings to torch tensor and send to device (note: NumPy arrays are float64, torch tensors are float32 by default)
embeddings = torch.tensor(np.array(text_chunks_and_embedding_df["embedding"].tolist()), dtype=torch.float32).to(device)
embeddings.shape

torch.Size([65, 768])

### The embedded chunks are ready

In [30]:
text_chunks_and_embedding_df

| | page_number | sentence_chunk | chunk_char_count | chunk_word_count | chunk_token_count | embedding |
|---|---|---|---|---|---|---|
| 0 | 0 | Retrieval-Augmented Generation for Knowledge-I... | 1833 | 246 | 458.25 | [0.039727319, 0.0469269864, -0.0115035316, -0.... |
| 1 | 0 | For language generation tasks, we ﬁnd that RAG... | 1139 | 162 | 284.75 | [0.0150031894, 0.0806403011, -0.0143866343, -0... |
| 2 | 1 | `TheDiYine Comed\(x) T QXeU\ EQcRdeU T([) MIP...` | 1686 | 210 | 421.50 | [0.0130570084, 0.096603632, -0.0178332869, 0.0... |
| 3 | 1 | The retriever (Dense Passage Retriever [26], h... | 1915 | 270 | 478.75 | [0.0529118367, 0.0553543158, -0.0169958323, 0.... |
| 4 | 1 | For FEVER [56] fact veriﬁcation, we achieve re... | 949 | 141 | 237.25 | [0.04941044, 0.0170666482, 0.00350169698, 0.01... |
| ... | ... | ... | ... | ... | ... | ... |
| 60 | 14 | URL https://www.aclweb.org/ anthology/W18-5446... | 1096 | 143 | 274.00 | [0.0458982587, -0.0125643425, -0.0306721888, 0... |
| 61 | 14 | URL https://www.aaai.org/ocs/index.php/AAAI/AA... | 601 | 75 | 150.25 | [-0.0198382903, -0.00296843238, -0.00132312218... |
| 62 | 14 | URL http://arxiv.org/abs/1410.3916. [65] Jason... | 218 | 30 | 54.50 | [0.0296152793, 0.0490686484, -0.0122116404, 0.... |
| 63 | 15 | International Workshop on Search-Oriented Conv... | 1017 | 127 | 254.25 | [0.0516357645, 0.0396283381, -0.0146658486, 0.... |


In [31]:
embeddings[0]

tensor([ 3.9727e-02,  4.6927e-02, -1.1504e-02, -1.5628e-03, -2.1121e-02,
        -4.9947e-03, -2.7959e-02, -1.0972e-02, -4.3458e-03, -7.7804e-02,
        -2.0022e-02, -2.7189e-02, -2.8272e-02,  4.2074e-02,  3.8686e-02,
        -3.2529e-02,  3.7117e-02,  2.2119e-03,  1.2676e-02,  9.3350e-03,
        -3.8750e-03,  5.3754e-03, -1.8202e-02,  1.9444e-02, -3.9476e-02,
         7.7228e-03, -2.5470e-02, -6.4595e-03, -1.1813e-02, -2.3488e-02,
         4.6452e-02,  7.3516e-02,  2.1861e-02,  1.5780e-02,  2.3155e-06,
        -2.9542e-02, -1.6238e-02, -2.7006e-02, -4.3258e-02,  1.7498e-03,
        -1.7556e-02,  2.2487e-02,  1.1062e-02, -1.1125e-02, -4.5319e-02,
        -1.3302e-02,  3.0515e-02,  2.7577e-02,  5.7865e-02,  8.7594e-02,
         6.5090e-04,  9.0065e-03,  3.5177e-02, -2.6052e-02,  5.7383e-02,
        -8.5708e-03,  9.3581e-03, -3.3440e-02, -6.8433e-03,  3.0560e-02,
         6.4515e-03,  3.2190e-02,  4.9336e-02, -7.7907e-03,  4.2671e-02,
         4.0554e-02, -3.1175e-02, -3.5589e-02, -2.8

### Preparing the embedding model

In [32]:
from sentence_transformers import util, SentenceTransformer

embedding_model = SentenceTransformer(model_name_or_path="all-mpnet-base-v2", 
                                      device=device) # choose the device to load the model to




In [33]:
# 1. Define the query
# Note: This could be anything. But since we're working with the RAG paper, we'll stick with RAG-related queries.
query = "Retrieval Augmentation generation"
print(f"Query: {query}")

# 2. Embed the query to the same numerical space as the text examples 
# Note: It's important to embed your query with the same model you embedded your examples with.
query_embedding = embedding_model.encode(query, convert_to_tensor=True)

# 3. Get similarity scores with the dot product (we'll time this for fun)
from time import perf_counter as timer

start_time = timer()
dot_scores = util.dot_score(a=query_embedding, b=embeddings)[0]
end_time = timer()

print(f"Time taken to get scores on {len(embeddings)} embeddings: {end_time-start_time:.5f} seconds.")

# 4. Get the top-k results (we'll keep this to 5)
top_results_dot_product = torch.topk(dot_scores, k=5)
top_results_dot_product

Query: Retrieval Augmentation generation
Time taken to get scores on 65 embeddings: 0.00017 seconds.


torch.return_types.topk(
values=tensor([0.5734, 0.5266, 0.5069, 0.5056, 0.4899], device='cuda:0'),
indices=tensor([ 7, 39,  0,  6, 26], device='cuda:0'))

In [34]:
# Define helper function to print wrapped text 
import textwrap

def print_wrapped(text, wrap_length=80):
    wrapped_text = textwrap.fill(text, wrap_length)
    print(wrapped_text)

In [35]:
print(f"Query: '{query}'\n")
print("Results:")
# Loop through zipped together scores and indices from torch.topk
for score, idx in zip(top_results_dot_product[0], top_results_dot_product[1]):
    print(f"Score: {score:.4f}")
    # Print relevant sentence chunk (since the scores are in descending order, the most relevant chunk will be first)
    print("Text:")
    print_wrapped(pages_and_chunks[idx]["sentence_chunk"])
    # Print the page number too so we can reference the paper further (and check the results)
    print(f"Page number: {pages_and_chunks[idx]['page_number']}")
    print("\n")

Query: 'Retrieval Augmentation generation'

Results:
Score: 0.5734
Text:
It has obtained state-of-the-art results on a diverse set of generation tasks
and outperforms comparably-sized T5 models [32]. We refer to the BART generator
parameters ✓ as the parametric memory henceforth.2.4 Training We jointly train
the retriever and generator components without any direct supervision on what
document should be retrieved. Given a ﬁne-tuning training corpus of input/output
pairs (xj, yj), we 3
Page number: 2


Score: 0.5266
Text:
[19] Kelvin Guu, Tatsunori B. Hashimoto, Yonatan Oren, and Percy Liang.
Generating sentences by editing prototypes. Transactions of the Association for
Computational Linguistics, 6:437–450, 2018.doi: 10.1162/tacl_a_00030. URL
https://www.aclweb.org/anthology/Q18-1031. [20] Kelvin Guu, Kenton Lee, Zora
Tung, Panupong Pasupat, and Ming-Wei Chang. REALM: Retrieval-augmented language
model pre-training. ArXiv, abs/2002.08909, 2020. URL https:
//arxiv.org/abs/2002.08909. [2

Similarity measures: dot product & cosine similarity


In [36]:
# import torch

# def dot_product(vector1, vector2):
#     return torch.dot(vector1, vector2)

# def cosine_similarity(vector1, vector2):
#     dot_product = torch.dot(vector1, vector2)

#     # Get Euclidean/L2 norm of each vector (removes the magnitude, keeps direction)
#     norm_vector1 = torch.sqrt(torch.sum(vector1**2))
#     norm_vector2 = torch.sqrt(torch.sum(vector2**2))

#     return dot_product / (norm_vector1 * norm_vector2)

# # Example tensors
# vector1 = torch.tensor([1, 2, 3], dtype=torch.float32)
# vector2 = torch.tensor([1, 2, 3], dtype=torch.float32)
# vector3 = torch.tensor([8, 16, 24], dtype=torch.float32)
# vector4 = torch.tensor([1, 2, 3], dtype=torch.float32)

# # Calculate dot product
# print("Dot product between vector1 and vector2:", dot_product(vector1, vector2))
# print("Dot product between vector1 and vector3:", dot_product(vector1, vector3))
# print("Dot product between vector1 and vector4:", dot_product(vector1, vector4))

# # Calculate cosine similarity
# print("Cosine similarity between vector1 and vector2:", cosine_similarity(vector1, vector2))
# print("Cosine similarity between vector1 and vector3:", cosine_similarity(vector1, vector3))
# print("Cosine similarity between vector1 and vector4:", cosine_similarity(vector1, vector4))

### Defining the "semantic similarity search" functions

In [37]:
def retrieve_relevant_resources(query: str,
                                embeddings: torch.Tensor,
                                model: SentenceTransformer=embedding_model,
                                n_resources_to_return: int=5,
                                print_time: bool=True):
    """
    Embeds a query with model and returns top k scores and indices from embeddings.
    """

    # Embed the query
    query_embedding = model.encode(query, 
                                   convert_to_tensor=True) 

    # Get dot product scores on embeddings
    start_time = timer()
    dot_scores = util.dot_score(query_embedding, embeddings)[0]
    end_time = timer()

    if print_time:
        print(f"[INFO] Time taken to get scores on {len(embeddings)} embeddings: {end_time-start_time:.5f} seconds.")

    scores, indices = torch.topk(input=dot_scores, 
                                 k=n_resources_to_return)

    return scores, indices

def print_top_results_and_scores(query: str,
                                 embeddings: torch.Tensor,
                                 pages_and_chunks: list[dict]=pages_and_chunks,
                                 n_resources_to_return: int=5):
    """
    Takes a query, retrieves most relevant resources and prints them out in descending order.

    Note: Requires pages_and_chunks to be formatted in a specific way (see above for reference).
    """
    
    scores, indices = retrieve_relevant_resources(query=query,
                                                  embeddings=embeddings,
                                                  n_resources_to_return=n_resources_to_return)
    
    print(f"Query: {query}\n")
    print("Results:")
    # Loop through zipped together scores and indices
    for score, index in zip(scores, indices):
        print(f"Score: {score:.4f}")
        # Print relevant sentence chunk (since the scores are in descending order, the most relevant chunk will be first)
        print_wrapped(pages_and_chunks[index]["sentence_chunk"])
        # Print the page number too so we can reference the paper further and check the results
        print(f"Page number: {pages_and_chunks[index]['page_number']}")
        print("\n")
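
The scoring-and-ranking step inside `retrieve_relevant_resources` can be illustrated without PyTorch. The sketch below is a plain-Python stand-in, not part of the notebook: the helper names `dot` and `top_k_by_dot_score` and the toy 3-dimensional vectors are made up for illustration. It scores every embedding against the query with a dot product and keeps the `k` highest, which mirrors what `util.dot_score` followed by `torch.topk` does on tensors. When the embeddings are normalized to unit length, the dot product equals cosine similarity.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def top_k_by_dot_score(query_vec, embeddings, k=5):
    """Return (scores, indices) of the k embeddings with the highest dot-product score."""
    scored = [(dot(query_vec, emb), i) for i, emb in enumerate(embeddings)]
    # Sort by score, highest first
    scored.sort(key=lambda pair: pair[0], reverse=True)
    top = scored[:k]
    return [s for s, _ in top], [i for _, i in top]

# Toy data (illustrative only): four unit-length "embeddings" and one query
toy_embeddings = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.6, 0.8, 0.0],
    [0.0, 0.0, 1.0],
]
toy_query = [0.0, 1.0, 0.0]

toy_scores, toy_indices = top_k_by_dot_score(toy_query, toy_embeddings, k=2)
print(toy_scores, toy_indices)
```

The returned indices play the same role as the `indices` tensor in the notebook: they are positions in the chunk list, so they can be used to look up the matching text.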

Excellent! Now let's test our functions out.

In [38]:
query = "data retrieval"

# Get just the scores and indices of top related results
scores, indices = retrieve_relevant_resources(query=query,
                                              embeddings=embeddings)
scores, indices

[INFO] Time taken to get scores on 65 embeddings: 0.00005 seconds.


(tensor([0.4710, 0.3714, 0.3547, 0.3463, 0.3426], device='cuda:0'),
 tensor([ 6,  4, 25, 48,  5], device='cuda:0'))

In [39]:
# Print out the texts of the top scores
print_top_results_and_scores(query=query,
                             embeddings=embeddings)

[INFO] Time taken to get scores on 65 embeddings: 0.00005 seconds.
Query: data retrieval

Results:
Score: 0.4710
2.2 Retriever: DPR The retrieval component p⌘(z|x) is based on DPR [26]. DPR
follows a bi-encoder architecture: p⌘(z|x) / exp # d(z)>q(x) $ d(z) = BERTd(z),
q(x) = BERTq(x) where d(z) is a dense representation of a document produced by a
BERTBASE document encoder [8], and q(x) a query representation produced by a
query encoder, also based on BERTBASE. Calculating top-k(p⌘(·|x)), the list of k
documents z with highest prior probability p⌘(z|x), is a Maximum Inner Product
Search (MIPS) problem, which can be approximately solved in sub-linear time
[23]. We use a pre-trained bi-encoder from DPR to initialize our retriever and
to build the document index. This retriever was trained to retrieve documents
which contain answers to TriviaQA [24] questions and Natural Questions [29]. We
refer to the document index as the non-parametric memory.2.3 Generator: BART The
generator componen

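(Extension) Semantic search / vector search

Small dataset --> the query can be compared against every possible result.  
Large dataset --> use an index.

Index --> an organized ordering of the embeddings,

so it can narrow down the search space.

For example, searching every word in a dictionary to find the word "duck" would be inefficient. Instead, you would go straight to the letter D, or even to the back half of the D section, find words close to "duck", and then find it.

This is how an index helps search across many examples without sacrificing too much speed or quality (for more on this, see nearest neighbor search).

One of the most popular indexing libraries is `Faiss`.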
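
The dictionary analogy can be made concrete with a toy index. The following sketch is plain Python and is not Faiss's API; the names `build_bucket_index` and `search_bucket_index`, the centroids, and the 2-dimensional vectors are all hypothetical. It groups vectors into buckets around a few reference vectors (centroids), then at query time scans only the bucket whose centroid best matches the query. Real index libraries use more sophisticated versions of this idea, and as with any approximate search, the true best match can be missed if it landed in a different bucket.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def nearest_centroid(vec, centroids):
    """Index of the centroid with the highest dot-product score for vec."""
    return max(range(len(centroids)), key=lambda c: dot(vec, centroids[c]))

def build_bucket_index(embeddings, centroids):
    """Assign every embedding index to the bucket of its best-matching centroid."""
    buckets = {c: [] for c in range(len(centroids))}
    for i, emb in enumerate(embeddings):
        buckets[nearest_centroid(emb, centroids)].append(i)
    return buckets

def search_bucket_index(query_vec, embeddings, centroids, buckets, k=1):
    """Score only the embeddings in the query's bucket instead of all of them."""
    candidates = buckets[nearest_centroid(query_vec, centroids)]
    ranked = sorted(candidates, key=lambda i: dot(query_vec, embeddings[i]), reverse=True)
    return ranked[:k], len(candidates)

# Toy data (illustrative only)
toy_centroids = [[1.0, 0.0], [0.0, 1.0]]
toy_vectors = [[0.9, 0.1], [0.8, 0.3], [0.2, 0.9], [0.1, 1.0], [0.3, 0.7]]

toy_buckets = build_bucket_index(toy_vectors, toy_centroids)
top, n_scanned = search_bucket_index([0.0, 1.0], toy_vectors, toy_centroids, toy_buckets, k=2)
print(top, n_scanned)
```

Here only the vectors in one bucket are scored rather than all five, which is the "jump to the letter D" step of the analogy.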

### Choosing an LLM --> Llama-3-8B-instruct

In [40]:
# Get the GPU's total memory in GiB (total_memory is the device capacity, not the amount currently free)
import torch
gpu_memory_bytes = torch.cuda.get_device_properties(0).total_memory
gpu_memory_gb = round(gpu_memory_bytes / (2**30))
print(f"Available GPU memory: {gpu_memory_gb} GB")

Available GPU memory: 12 GB


In [45]:
# Note: these tiers were written with Gemma in mind (the printed recommendations still mention Gemma),
# but the middle tier below selects Llama 3 8B Instruct. More and more small LLMs are appearing for local use.
if gpu_memory_gb < 5.1:
    print(f"Your available GPU memory is {gpu_memory_gb}GB, you may not have enough memory to run a Gemma LLM locally without quantization.")
elif gpu_memory_gb < 8.1:
    print(f"GPU memory: {gpu_memory_gb} | Recommended model: Gemma 2B in 4-bit precision.")
    use_quantization_config = True 
    model_id = "google/gemma-2b-it"
elif gpu_memory_gb < 19.0:
    print(f"GPU memory: {gpu_memory_gb} | Recommended model: Gemma 2B in float16 or Gemma 7B in 4-bit precision.")
    use_quantization_config = False 
    model_id = "meta-llama/Meta-Llama-3-8B-instruct"
else:  # 19 GB or more (a plain else also covers exactly 19, which the original comparison skipped)
    print(f"GPU memory: {gpu_memory_gb} | Recommend model: Gemma 7B in 4-bit or float16 precision.")
    use_quantization_config = False 
    model_id = "google/gemma-7b-it"

print(f"use_quantization_config set to: {use_quantization_config}")
print(f"model_id set to: {model_id}")

GPU memory: 12 | Recommended model: Gemma 2B in float16 or Gemma 7B in 4-bit precision.
use_quantization_config set to: False
model_id set to: meta-llama/Meta-Llama-3-8B-instruct


In [50]:
from huggingface_hub import notebook_login
notebook_login()



In [51]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from transformers.utils import is_flash_attn_2_available 
from transformers import BitsAndBytesConfig
# Quantization config: load weights in 4-bit and run computation in float16
quantization_config = BitsAndBytesConfig(load_in_4bit=True,
                                         bnb_4bit_compute_dtype=torch.float16)
# GPU optimization: use Flash Attention 2 if it is installed and the GPU has compute capability >= 8.0
if (is_flash_attn_2_available()) and (torch.cuda.get_device_capability(0)[0] >= 8):
    attn_implementation = "flash_attention_2"
else:
    attn_implementation = "sdpa"
print(f"[INFO] Using attention implementation: {attn_implementation}")

# Pin the model used in this notebook (overrides the selection made above)
model_id = 'meta-llama/Meta-Llama-3-8B-instruct'
print(f"[INFO] Using model_id: {model_id}")

tokenizer = AutoTokenizer.from_pretrained(model_id)
llm_model = AutoModelForCausalLM.from_pretrained(pretrained_model_name_or_path=model_id, 
                                                 torch_dtype=torch.float16, # datatype to use, we want float16
                                                 quantization_config=quantization_config if use_quantization_config else None,
                                                 low_cpu_mem_usage=False, # use full memory 
                                                 attn_implementation=attn_implementation) # which attention version to use

if not use_quantization_config:
    llm_model.to("cuda")

[INFO] Using attention implementation: sdpa
[INFO] Using model_id: meta-llama/Meta-Llama-3-8B-instruct



Check that the LLM loaded successfully

In [52]:
llm_model

LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(128256, 4096)
    (layers): ModuleList(
      (0-31): 32 x LlamaDecoderLayer(
        (self_attn): LlamaSdpaAttention(
          (q_proj): Linear(in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear(in_features=4096, out_features=1024, bias=False)
          (v_proj): Linear(in_features=4096, out_features=1024, bias=False)
          (o_proj): Linear(in_features=4096, out_features=4096, bias=False)
          (rotary_emb): LlamaRotaryEmbedding()
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear(in_features=4096, out_features=14336, bias=False)
          (up_proj): Linear(in_features=4096, out_features=14336, bias=False)
          (down_proj): Linear(in_features=14336, out_features=4096, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm()
        (post_attention_layernorm): LlamaRMSNorm()
      )
    )
    (norm): LlamaRMSNorm()
  )
  (lm_head)

The Llama-3-8B-instruct model takes up 15.145 GB in memory

In [53]:
param_size = 0
buffer_size = 0
for param in llm_model.parameters():
    param_size += param.nelement() * param.element_size()

for buffer in llm_model.buffers():
    buffer_size += buffer.nelement() * buffer.element_size()

size_all_mb = (param_size + buffer_size) / 1024**2
print('Size: {:.3f} GB'.format(size_all_mb/1024))

Size: 15.145 GB


Check the number of parameters

In [54]:
def get_model_num_params(model: torch.nn.Module):
    return sum([param.numel() for param in model.parameters()])

get_model_num_params(llm_model)

8030261248
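
As a rough cross-check of the size measured above, the weight footprint can be estimated from the parameter count alone. This sketch uses the 8,030,261,248 parameters printed above; the helper name `estimate_weight_gib` and the bytes-per-parameter table are illustrative assumptions, and the estimate covers weights only (no buffers, activations, or KV cache), so actual usage will be higher.

```python
NUM_PARAMS = 8_030_261_248  # parameter count printed by get_model_num_params above

# Assumed bytes per parameter for common precisions (weights only)
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1, "4-bit": 0.5}

def estimate_weight_gib(num_params, bytes_per_param):
    """Estimate weight memory in GiB (1 GiB = 2**30 bytes)."""
    return num_params * bytes_per_param / 2**30

for precision, nbytes in BYTES_PER_PARAM.items():
    print(f"{precision}: {estimate_weight_gib(NUM_PARAMS, nbytes):.2f} GiB")
```

The float16 estimate lands slightly below the 15.145 GB measured above, which also counts buffers. It is also larger than the 12 GB of GPU memory reported earlier, which is the kind of situation the 4-bit `quantization_config` is meant for.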

## `Generation`: generating a response with the LLM

In [55]:
input_text = "Waht is 'Open-domain Question Answering' from the experience of RAG?"
print(f"Input text:\n{input_text}")

# Create prompt template for instruction-tuned model
dialogue_template = [
    {"role": "user",
     "content": input_text}
]

# Apply the chat template
prompt = tokenizer.apply_chat_template(conversation=dialogue_template,
                                       tokenize=False, # keep as raw text (not tokenized)
                                       add_generation_prompt=True)
print(f"\nPrompt (formatted):\n{prompt}")

Input text:
Waht is 'Open-domain Question Answering' from the experience of RAG?

Prompt (formatted):
<|begin_of_text|><|start_header_id|>user<|end_header_id|>

Waht is 'Open-domain Question Answering' from the experience of RAG?<|eot_id|><|start_header_id|>assistant<|end_header_id|>
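
To make the formatted prompt above less opaque, here is a plain-Python sketch that rebuilds the same layout by hand for a single user message. The function name `format_llama3_user_prompt` is hypothetical, and the layout is inferred only from the output printed above. In practice `tokenizer.apply_chat_template` should remain the source of truth, since templates differ between models.

```python
def format_llama3_user_prompt(user_message: str) -> str:
    """Rebuild the single-turn prompt layout shown in the printed output above."""
    return (
        "<|begin_of_text|>"
        "<|start_header_id|>user<|end_header_id|>\n\n"
        f"{user_message}<|eot_id|>"
        "<|start_header_id|>assistant<|end_header_id|>\n\n"
    )

example_prompt = format_llama3_user_prompt("What is RAG?")
print(example_prompt)
```

Because the template already starts with `<|begin_of_text|>`, tokenizing the string with the tokenizer's default settings appears to add a second one, which would explain why token `128000` shows up twice in the tokenized input further down. Passing `add_special_tokens=False` to the tokenizer call is one common way to avoid the duplicate.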




tokenize --> LLM generate --> decode

In [56]:
%%time

input_ids = tokenizer(prompt, return_tensors="pt").to("cuda")
print(f"Model input (tokenized):\n{input_ids}\n")

outputs = llm_model.generate(**input_ids,
                             max_new_tokens=256) 
print(f"Model output (tokens):\n{outputs[0]}\n")

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


Model input (tokenized):
{'input_ids': tensor([[128000, 128000, 128006,    882, 128007,    271,  99327,    427,    374,
            364,   5109,  73894,  16225,  22559,    287,      6,    505,    279,
           3217,    315,    432,   1929,     30, 128009, 128006,  78191, 128007,
            271]], device='cuda:0'), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1]], device='cuda:0')}





Model output (tokens):
tensor([128000, 128000, 128006,    882, 128007,    271,  99327,    427,    374,
           364,   5109,  73894,  16225,  22559,    287,      6,    505,    279,
          3217,    315,    432,   1929,     30, 128009, 128006,  78191, 128007,
           271,   3915,    279,  13356,    315,    432,   1929,    320,    697,
          8532,  22559,  24367,    705,   5377,  73894,  16225,  22559,    287,
           320,   5109,  48622,      8,    374,    264,    955,    315,   5933,
          4221,   8863,    320,     45,  12852,      8,   3465,    430,  18065,
         24038,  11503,    311,   4860,  15107,    505,    264,  13057,     11,
          1825,  84175,   6677,   2385,    477,  43194,     13,   1115,   3465,
          7612,    279,   1646,    311,  17622,   9959,   2038,    505,    279,
         13057,   3392,    315,   1495,    828,   2561,    389,    279,   7757,
            11,   6603,     11,    477,   1023,   8336,     11,    323,   1243,
          7068,  

In [57]:
# Decode the output tokens to text
outputs_decoded = tokenizer.decode(outputs[0])
print(f"Model output (decoded):\n{outputs_decoded}\n")

Model output (decoded):
<|begin_of_text|><|begin_of_text|><|start_header_id|>user<|end_header_id|>

Waht is 'Open-domain Question Answering' from the experience of RAG?<|eot_id|><|start_header_id|>assistant<|end_header_id|>

From the perspective of RAG (Relevant Answer Generation), Open-domain Question Answering (OpenQA) is a type of natural language processing (NLP) task that involves generating answers to questions drawn from a vast, open-ended knowledge base or corpus. This task requires the model to retrieve relevant information from the vast amount of text data available on the internet, books, or other sources, and then generate accurate and relevant answers to the questions.

In OpenQA, the model is not limited to a specific domain or knowledge base, unlike other types of question answering tasks, such as cloze tests or multiple-choice questions. Instead, it must be able to generalize across various topics, domains, and formats to provide answers that are accurate, informative, 

In [58]:
print(f"Input text: {input_text}\n")
print(f"Output text:\n{outputs_decoded.replace(prompt, '').replace('<|begin_of_text|>', '').replace('<|eot_id|>', '')}")

Input text: Waht is 'Open-domain Question Answering' from the experience of RAG?

Output text:
From the perspective of RAG (Relevant Answer Generation), Open-domain Question Answering (OpenQA) is a type of natural language processing (NLP) task that involves generating answers to questions drawn from a vast, open-ended knowledge base or corpus. This task requires the model to retrieve relevant information from the vast amount of text data available on the internet, books, or other sources, and then generate accurate and relevant answers to the questions.

In OpenQA, the model is not limited to a specific domain or knowledge base, unlike other types of question answering tasks, such as cloze tests or multiple-choice questions. Instead, it must be able to generalize across various topics, domains, and formats to provide answers that are accurate, informative, and relevant to the question.

RAG, as a Relevant Answer Generation model, is designed to tackle this challenging task by leveragi

`Retrieval` and `Generation` are now done.

Next comes `Augmentation`.


In [59]:
gpt4_questions = [
    "How does RAG differ from traditional language models in terms of architecture and processing flow?",
    "What are the main challenges in aligning the retrieved documents with the generation process in RAG?",
    "Can you provide examples of real-world applications where RAG is particularly effective?",
    "What datasets and benchmarks are most commonly used to train and evaluate RAG models?",
    "How do advancements in vector search technologies influence the development and performance of RAG systems?"
]

# Manually created question list
manual_questions = [
    "Please introduce the main technique of RAG.",
    "What is the pipline of RAG?",
    "Is there any shortcoming when implementing RAG?",
    "What is the most-used method of evatulting RAG?",
    "What is the most special point of RAG?"
]

query_list = gpt4_questions + manual_questions

And now let's check if our `retrieve_relevant_resources()` function works with our list of queries.

In [60]:
import random
query = random.choice(query_list)

print(f"Query: {query}")

# Get just the scores and indices of top related results
scores, indices = retrieve_relevant_resources(query=query,
                                              embeddings=embeddings)
scores, indices

Query: How does RAG differ from traditional language models in terms of architecture and processing flow?
[INFO] Time taken to get scores on 65 embeddings: 0.00059 seconds.


(tensor([0.4293, 0.4046, 0.4020, 0.3988, 0.3939], device='cuda:0'),
 tensor([17,  8, 18, 56,  1], device='cuda:0'))

## `Augmentation`

`prompt_formatter`: builds the new, augmented prompt (the query plus the retrieved context)



In [68]:
def prompt_formatter(query: str, 
                     context_items: list[dict]) -> str:
    """
    Augments query with text-based context from context_items.
    """
    # Join context items into one dotted paragraph
    context = "- " + "\n- ".join([item["sentence_chunk"] for item in context_items])

    # Create a base prompt with examples to help the model
    # Note: this is very customizable, I've chosen to use 2 examples of the answer style we'd like.
    # We could also write this in a txt file and import it in if we wanted.
    base_prompt = """Based on the following context items, please answer the query.
Give yourself room to think by extracting relevant passages from the context before answering the query.
Don't return the thinking, only return the answer.
Make sure your answers are as explanatory as possible.
Use the following examples as reference for the ideal answer style.
\nExample 1:
Query: What is DPR?
Answer: DPR is a method used in RAG to retrieve relevant documents or passages from a large database. It uses dense vector embeddings of text, which are generated by a neural network, to find and retrieve content that is semantically related to a given query.
\nExample 2:
Query: What is Retrieval Ablations?
Answer: Retrieval ablations in RAG refer to experiments or modifications done to analyze the impact of different components of the retrieval system on the overall performance of the model. For instance, changing the way documents are retrieved or altering the embeddings can help understand what aspects are most crucial for accurate information retrieval.

\nNow use the following context items to answer the user query:
{context}
\nRelevant passages: <extract relevant passages from the context here>
User query: {query}
Answer:"""

    # Update base prompt with context items and query   
    base_prompt = base_prompt.format(context=context, query=query)

    # Create prompt template for instruction-tuned model
    dialogue_template = [
        {"role": "user",
        "content": base_prompt}
    ]

    # Apply the chat template
    prompt = tokenizer.apply_chat_template(conversation=dialogue_template,
                                          tokenize=False,
                                          add_generation_prompt=True)
    return prompt

In [78]:
query = random.choice(query_list)
print(f"Query: {query}")

# Get relevant resources
scores, indices = retrieve_relevant_resources(query=query,
                                              embeddings=embeddings)
    
# Create a list of context items
context_items = [pages_and_chunks[i] for i in indices]

# Format prompt with context items
prompt = prompt_formatter(query=query,
                          context_items=context_items)
print(prompt)

Query: What is the most-used method of evatulting RAG?
[INFO] Time taken to get scores on 65 embeddings: 0.00005 seconds.
<|begin_of_text|><|start_header_id|>user<|end_header_id|>

Based on the following context items, please answer the query.
Give yourself room to think by extracting relevant passages from the context before answering the query.
Don't return the thinking, only return the answer.
Make sure your answers are as explanatory as possible.
Use the following examples as reference for the ideal answer style.

Example 1:
Query: What is DPR?
Answer: DPR is a method used in RAG to retrieve relevant documents or passages from a large database. It uses dense vector embeddings of text, which are generated by a neural network, to find and retrieve content that is semantically related to a given query.

Example 2:
Query: What is Retrieval Ablations?
Answer: Retrieval ablations in RAG refer to experiments or modifications done to analyze the impact of different components of the retrie

### Architecture:

In [79]:
%%time

input_ids = tokenizer(prompt, return_tensors="pt").to("cuda")

outputs = llm_model.generate(**input_ids,
                             temperature=0.7, # lower temperature = more deterministic outputs, higher temperature = more creative outputs
                             do_sample=True, # whether or not to use sampling, see https://huyenchip.com/2024/01/16/sampling.html for more
                             max_new_tokens=256) # how many new tokens to generate from prompt 

# Turn the output tokens into text
output_text = tokenizer.decode(outputs[0])

print(f"Query: {query}")
print(f"RAG answer:\n{output_text.replace(prompt, ' ')}")

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


Query: What is the most-used method of evatulting RAG?
RAG answer:
<|begin_of_text|><|begin_of_text|><|start_header_id|>user<|end_header_id|>

Based on the following context items, please answer the query.
Give yourself room to think by extracting relevant passages from the context before answering the query.
Don't return the thinking, only return the answer.
Make sure your answers are as explanatory as possible.
Use the following examples as reference for the ideal answer style.

Example 1:
Query: What is DPR?
Answer: DPR is a method used in RAG to retrieve relevant documents or passages from a large database. It uses dense vector embeddings of text, which are generated by a neural network, to find and retrieve content that is semantically related to a given query.

Example 2:
Query: What is Retrieval Ablations?
Answer: Retrieval ablations in RAG refer to experiments or modifications done to analyze the impact of different components of the retrieval system on the overall performance 
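
The RAG answer above still begins with the full prompt, which suggests the exact-match `output_text.replace(prompt, ' ')` did not find the prompt in the decoded text (decoding is not guaranteed to reproduce the original string character for character). A more robust pattern is to drop the prompt by token count before decoding. The sketch below shows the idea on plain lists, using the first token IDs from the earlier `Model input` and `Model output` printouts; `strip_prompt_tokens` is a hypothetical helper.

```python
def strip_prompt_tokens(output_ids, prompt_len):
    """Keep only the tokens generated after the prompt."""
    return output_ids[prompt_len:]

# The 28 prompt token IDs printed under "Model input (tokenized)" earlier
prompt_ids = [128000, 128000, 128006, 882, 128007, 271, 99327, 427, 374,
              364, 5109, 73894, 16225, 22559, 287, 6, 505, 279,
              3217, 315, 432, 1929, 30, 128009, 128006, 78191, 128007,
              271]

# generate() returns the prompt followed by the new tokens (only the first few are shown here)
output_ids = prompt_ids + [3915, 279, 13356, 315, 432, 1929]

new_ids = strip_prompt_tokens(output_ids, len(prompt_ids))
print(new_ids)
```

With the real tensors this would look like `new_tokens = outputs[0][input_ids["input_ids"].shape[1]:]`, and `tokenizer.decode(new_tokens, skip_special_tokens=True)` would then also drop markers such as `<|eot_id|>`.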

Simplifying the `Generation` stage (optional)

In [80]:
def ask(query, 
        temperature=0.7,
        max_new_tokens=512,
        format_answer_text=True, 
        return_answer_only=True):
    """
    Takes a query, finds relevant resources/context and generates an answer to the query based on the relevant resources.
    """
    
    # Get just the scores and indices of top related results
    scores, indices = retrieve_relevant_resources(query=query,
                                                  embeddings=embeddings)
    
    # Create a list of context items
    context_items = [pages_and_chunks[i] for i in indices]

    # Add score to context item
    for i, item in enumerate(context_items):
        item["score"] = scores[i].cpu() # return score back to CPU 
        
    # Format the prompt with context items
    prompt = prompt_formatter(query=query,
                              context_items=context_items)
    
    # Tokenize the prompt
    input_ids = tokenizer(prompt, return_tensors="pt").to("cuda")

    # Generate an output of tokens
    outputs = llm_model.generate(**input_ids,
                                 temperature=temperature,
                                 do_sample=True,
                                 max_new_tokens=max_new_tokens)
    
    # Turn the output tokens into text
    output_text = tokenizer.decode(outputs[0])

    if format_answer_text:
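        # Replace special tokens and unnecessary help message
        # (note: "<bos>"/"<eos>" are Gemma's special tokens; Llama 3 uses "<|begin_of_text|>" and "<|eot_id|>")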
        output_text = output_text.replace(prompt, "").replace("<bos>", "").replace("<eos>", "").replace("Sure, here is the answer to the user query:\n\n", "")

    # Only return the answer without the context items
    if return_answer_only:
        return output_text
    
    return output_text, context_items
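
The `temperature` value that `ask` forwards to `generate` controls how sharply peaked the next-token distribution is: the logits are divided by the temperature before the softmax. The sketch below illustrates this in plain Python on made-up logits; the function name `softmax_with_temperature` and the numbers are illustrative only.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Convert logits to probabilities after dividing them by the temperature."""
    scaled = [x / temperature for x in logits]
    max_scaled = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - max_scaled) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

toy_logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens

for t in (0.2, 0.7, 1.5):
    probs = softmax_with_temperature(toy_logits, temperature=t)
    print(t, [round(p, 3) for p in probs])
```

Lower temperatures concentrate probability on the highest-scoring token (more deterministic output), while higher temperatures flatten the distribution (more varied output).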

Test with a random question from query_list

In [81]:
query = random.choice(query_list)
print(f"Query: {query}")

# Answer query with context and return context 
answer, context_items = ask(query=query, 
                            temperature=0.7,
                            max_new_tokens=512,
                            return_answer_only=False)

print(f"Answer:\n")
print_wrapped(answer)
print(f"Context items:")
context_items

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


Query: How does RAG differ from traditional language models in terms of architecture and processing flow?
[INFO] Time taken to get scores on 65 embeddings: 0.00007 seconds.
Answer:

<|begin_of_text|><|begin_of_text|><|start_header_id|>user<|end_header_id|>
Based on the following context items, please answer the query. Give yourself
room to think by extracting relevant passages from the context before answering
the query. Don't return the thinking, only return the answer. Make sure your
answers are as explanatory as possible. Use the following examples as reference
for the ideal answer style.  Example 1: Query: What is DPR? Answer: DPR is a
method used in RAG to retrieve relevant documents or passages from a large
database. It uses dense vector embeddings of text, which are generated by a
neural network, to find and retrieve content that is semantically related to a
given query.  Example 2: Query: What is Retrieval Ablations? Answer: Retrieval
ablations in RAG refer to experiments or mo

[{'page_number': 5,
  'sentence_chunk': '14.7 21.4 40.8 44.2 to more effective marginalization over documents. Furthermore, RAG can generate correct answers even when the correct answer is not in any retrieved document, achieving 11.8% accuracy in such cases for NQ, where an extractive model would score 0%.4.2 Abstractive Question Answering As shown in Table 2, RAG-Sequence outperforms BART on Open MS-MARCO NLG by 2.6 Bleu points and 2.6 Rouge-L points. RAG approaches state-of-the-art model performance, which is impressive given that (i) those models access gold passages with speciﬁc information required to generate the reference answer, (ii) many questions are unanswerable without the gold passages, and (iii) not all questions are answerable from Wikipedia alone. Table 3 shows some generated answers from our models. Qualitatively, we ﬁnd that RAG models hallucinate less and generate factually correct text more often than BART. Later, we also show that RAG generations are more diverse 

## Extensions

* May want to improve text extraction with something like Marker - https://github.com/VikParuchuri/marker
* Guide to more advanced PDF extraction - https://towardsdatascience.com/extracting-text-from-pdf-files-with-python-a-comprehensive-guide-9fc4003d517 
* See the following prompt engineering resources for more prompting techniques - promptingguide.ai, Brex's Prompt Engineering Guide 
* What happens when a query comes through that there isn't any context in the textbook on?
* Try another embedding model (e.g. Mixed Bread AI large, `mixedbread-ai/mxbai-embed-large-v1`, see: https://huggingface.co/mixedbread-ai/mxbai-embed-large-v1)
* Try another LLM... (e.g. Mistral-Instruct)
* Try different prompts (e.g. see prompting techniques online)
* Our example only focuses on text from a PDF, however, we could extend it to include figures and images 
* Evaluate the answers -> could use another LLM to rate our answers (e.g. use GPT-4 to make)
* Vector database/index for larger setup (e.g. 100,000+ chunks)
* Libraries/frameworks such as LangChain / LlamaIndex can handle many of these steps for you, so they are worth looking into next. This notebook recreates the workflow with lower-level tools to show the principles.
* Optimizations for speed
    * See Hugging Face docs for recommended speed ups on GPU - https://huggingface.co/docs/transformers/perf_infer_gpu_one 
    * Optimum NVIDIA - https://huggingface.co/blog/optimum-nvidia, GitHub: https://github.com/huggingface/optimum-nvidia 
    * See NVIDIA TensorRT-LLM - https://github.com/NVIDIA/TensorRT-LLM 
    * See GPT-Fast for PyTorch-based optimizations - https://github.com/pytorch-labs/gpt-fast 
    * Flash attention 2 (requires Ampere GPUs or newer) - https://github.com/Dao-AILab/flash-attention
* Stream text output so it looks prettier (e.g. each token appears as it gets output from the model)
* Turn the workflow into an app, see Gradio type chatbots for this - https://www.gradio.app/guides/creating-a-chatbot-fast, see local example: https://www.gradio.app/guides/creating-a-chatbot-fast#example-using-a-local-open-source-llm-with-hugging-face 
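
For the question in the list above about queries the document has no context for, one simple option is to discard retrieved chunks whose score falls below a cutoff and decline to answer when nothing survives. The sketch below is only an illustration: `filter_by_min_score` and the `0.35` cutoff are assumptions, and a sensible threshold depends on the embedding model and the data.

```python
def filter_by_min_score(scores, indices, min_score=0.35):
    """Keep only (score, index) pairs whose score reaches min_score."""
    return [(s, i) for s, i in zip(scores, indices) if s >= min_score]

# Scores and indices taken from the "data retrieval" example earlier in the notebook
example_scores = [0.4710, 0.3714, 0.3547, 0.3463, 0.3426]
example_indices = [6, 4, 25, 48, 5]

kept = filter_by_min_score(example_scores, example_indices)
if kept:
    print(f"Using {len(kept)} chunks as context: {[i for _, i in kept]}")
else:
    print("No sufficiently relevant context found; consider declining to answer.")
```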