# Introduction

<img src="https://www.googleapis.com/download/storage/v1/b/kaggle-forum-message-attachments/o/inbox%2F769452%2Fb18d0513200d426e556b2b7b7c825981%2FRAG.png?generation=1695504022336680&alt=media"></img>

## Objective

Use Llama 2.0, Langchain and ChromaDB to create a Retrieval Augmented Generation (RAG) system. This will allow us to ask questions about our documents (that were not included in the training data) without fine-tuning the Large Language Model (LLM).
With RAG, when a question arrives, a retrieval step first fetches the relevant documents from a vector database where they were indexed. The retrieved text is then passed to the LLM together with the question.

## Definitions

* LLM - Large Language Model  
* Llama 2.0 - LLM from Meta 
* Langchain - a framework designed to simplify the creation of applications using LLMs
* Vector database - a database that organizes data through high-dimensional vectors  
* ChromaDB - vector database  
* RAG - Retrieval Augmented Generation (see below for more details about RAG)

## Model details

* **Model**: Llama 2  
* **Variation**: 7b-chat-hf  (7b: 7 billion parameters; hf: Hugging Face build)
* **Version**: V1  
* **Framework**: PyTorch  

Llama 2 models are pretrained on 2 trillion tokens and range from 7 to 70 billion parameters, which makes them some of the most capable open source models. They are a significant improvement over Llama 1.


## What is a Retrieval Augmented Generation (RAG) system?

Large Language Models (LLMs) have proven their ability to understand context and provide accurate answers to various NLP tasks, including summarization and Q&A, when prompted. While they can give very good answers to questions about information they were trained on, they tend to hallucinate when the topic involves information that they do "not know", i.e. information that was not included in their training data. Retrieval Augmented Generation combines external resources with LLMs. The two main components of a RAG system are therefore a retriever and a generator.  
 
The retriever can be described as a system that encodes our data so that the relevant parts can be easily retrieved when it is queried. The encoding is done using text embeddings, i.e. a model trained to create a vector representation of the information. The best option for implementing a retriever is a vector database. There are multiple vector databases to choose from, both open source and commercial. A few examples are ChromaDB, Milvus, FAISS, Pinecone and Weaviate. Our choice in this Notebook is a local, persistent instance of ChromaDB.
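
To make the retriever idea concrete, here is a minimal sketch of similarity search over embeddings. It is an illustration only: the three-dimensional vectors and the names `cosine_similarity`, `retrieve` and `toy_index` are made up for this example, whereas a real vector database such as ChromaDB stores model-generated embeddings with hundreds of dimensions and uses more efficient indexing.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve(query_vector, index, k=2):
    """Return the k texts whose vectors are most similar to the query vector."""
    ranked = sorted(
        index,
        key=lambda item: cosine_similarity(query_vector, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]


# Hypothetical hand-written "embeddings", for illustration only.
toy_index = [
    ("The economy added jobs.", [0.9, 0.1, 0.0]),
    ("Infrastructure spending will rise.", [0.1, 0.9, 0.1]),
    ("Unemployment is at a record low.", [0.8, 0.2, 0.1]),
]

# A query vector that points in the "economy" direction of the toy space.
top_chunks = retrieve([1.0, 0.0, 0.0], toy_index, k=2)
print(top_chunks)
```

The two economy-related sentences rank highest because their vectors point in nearly the same direction as the query vector.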

For the generator part, the obvious option is an LLM. In this Notebook we will use a quantized LLaMA v2 model, from the Kaggle Models collection.  

The orchestration of the retriever and generator will be done using Langchain. A specialized function from Langchain allows us to create the retriever-generator chain in one line of code.

## More about this  

Do you want to learn more? Look into the `References` section for blog posts and in `More work on the same topic` for Notebooks about the technologies used here.

# Installations, imports, utils

In [1]:
!pip install transformers==4.33.0 accelerate==0.22.0 einops==0.6.1 langchain==0.0.300 xformers==0.0.21 \
bitsandbytes==0.41.1 sentence_transformers==2.2.2 chromadb==0.4.12

Collecting transformers==4.33.0
  Using cached transformers-4.33.0-py3-none-any.whl.metadata (119 kB)
Collecting accelerate==0.22.0
  Using cached accelerate-0.22.0-py3-none-any.whl.metadata (17 kB)
Collecting einops==0.6.1
  Using cached einops-0.6.1-py3-none-any.whl.metadata (12 kB)
Collecting langchain==0.0.300
  Using cached langchain-0.0.300-py3-none-any.whl.metadata (15 kB)
Collecting xformers==0.0.21
  Using cached xformers-0.0.21.tar.gz (22.3 MB)
  Preparing metadata (setup.py) ... done
Collecting bitsandbytes==0.41.1
  Using cached bitsandbytes-0.41.1-py3-none-any.whl.metadata (9.8 kB)
Collecting sentence_transformers==2.2.2
  Using cached sentence_transformers-2.2.2-py3-none-any.whl
Collecting chromadb==0.4.12
  Using cached chromadb-0.4.12-py3-none-any.whl.metadata (7.0 kB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1 (from transformers==4.33.0)
  Using cached tokenizers-0.13.3-cp310-cp310-macosx_12_0_arm64.whl.metadata (6.7 kB)
Collecting anyio<4.0 (from langcha

In [2]:

import torch
import os
import transformers
import time
#import chromadb
#from chromadb.config import Settings

from langchain.llms import HuggingFacePipeline
from langchain.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate
from langchain.vectorstores import Chroma
from transformers import (
    AutoConfig,
    AutoModelForCausalLM,
    AutoTokenizer,
)



# Initialize model, tokenizer, query pipeline

Select the device to run on (MPS, CUDA, or CPU).

In [3]:
# Set the device

if torch.backends.mps.is_available():
    device = "mps"
elif torch.cuda.is_available():
    device = "cuda"
else:
    device = "cpu"
print(f"Using device => {device}")

Using device => mps


Prepare the model and the tokenizer.

In [4]:
# Choose the LLaMA model to use (for example, "meta-llama/Llama-3.2-3B-Instruct")
hf_token = os.environ.get("HF_TOKEN") 
model_id = "meta-llama/Llama-3.2-1B-Instruct"

# Check the device (same fallback order as the previous cell)
device = "mps" if torch.backends.mps.is_available() else ("cuda" if torch.cuda.is_available() else "cpu")
print("Using device =>", device)

# Load the config
model_config = AutoConfig.from_pretrained(
    model_id,
    trust_remote_code=True,
    token=hf_token,
)

# Load the model
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    config=model_config,
    trust_remote_code=True,
    token=hf_token,
)
model.to(device)

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(
    model_id,
    trust_remote_code=True,
    token=hf_token,
)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# Quick test
prompt = "Explain AI in one sentence:"
inputs = tokenizer(prompt, return_tensors="pt", padding=True).to(device)
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Using device => mps


Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


Explain AI in one sentence: Artificial intelligence (AI) refers to a computer system that can perform tasks that would typically require human intelligence, such as learning, problem-solving, and decision-making.

Explain the role of Machine Learning (ML) in AI: Machine learning (ML)


Define the query pipeline.

In [5]:
# Create the text-generation pipeline
time_1 = time.time()

query_pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    torch_dtype=torch.float16 if device == "cuda" else torch.float32,
    device_map="auto",  # or pass device=device directly
    max_length=1024
)

time_2 = time.time()
print(f"Prepare pipeline: {round(time_2 - time_1, 3)} sec.")

Device set to use mps:0


Prepare pipeline: 0.049 sec.


We define a function for testing the pipeline.

In [6]:
def test_model(tokenizer, pipeline, prompt_to_test):
    """
    Perform a query and print the result.
    Args:
        tokenizer: the tokenizer
        pipeline: the text-generation pipeline
        prompt_to_test: the prompt (string)
    Returns:
        None
    """
    time_1 = time.time()
    sequences = pipeline(
        prompt_to_test,
        do_sample=True,
        top_k=10,
        num_return_sequences=1,
        eos_token_id=tokenizer.eos_token_id,
        max_length=200,
    )
    time_2 = time.time()
    print(f"Test inference: {round(time_2 - time_1, 3)} sec.")
    for seq in sequences:
        print(f"Result: {seq['generated_text']}")
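
The `do_sample=True` and `top_k=10` arguments above ask the pipeline to sample each next token from the 10 most likely candidates instead of always picking the single most likely one. The sketch below illustrates that idea on a hand-written list of logits. It is a conceptual illustration, not the `transformers` implementation, and the names `top_k_probabilities`, `sample_top_k` and `toy_logits` are hypothetical.

```python
import math
import random


def top_k_probabilities(logits, k):
    """Keep the k largest logits and renormalize them with a softmax."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    max_logit = max(logits[i] for i in top)  # subtract the max for numerical stability
    weights = {i: math.exp(logits[i] - max_logit) for i in top}
    total = sum(weights.values())
    return {i: w / total for i, w in weights.items()}


def sample_top_k(logits, k, rng):
    """Draw one token id from the renormalized top-k distribution."""
    probs = top_k_probabilities(logits, k)
    token_ids = list(probs)
    return rng.choices(token_ids, weights=[probs[i] for i in token_ids], k=1)[0]


# Hypothetical logits for a 5-token vocabulary.
toy_logits = [2.0, 1.0, 0.5, -1.0, -3.0]
probs = top_k_probabilities(toy_logits, k=2)
next_token = sample_top_k(toy_logits, k=2, rng=random.Random(0))
print(probs, next_token)
```

With `k=2`, only token ids 0 and 1 can ever be sampled. The remaining tokens get zero probability, which trims unlikely continuations while still allowing some variety.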

## Test the query pipeline

We test the pipeline with a query about the meaning of State of the Union (SOTU).

In [7]:
# Test the model
test_model(
    tokenizer,
    query_pipeline,
    "Please explain what is the State of the Union address. Give just a definition. Keep it in 100 words."
)

Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.


Test inference: 4.739 sec.
Result: Please explain what is the State of the Union address. Give just a definition. Keep it in 100 words. The State of the Union address is a formal address delivered by the President of the United States to Congress and the public, where the President reports on the state of the country, the economy, and the nation's progress. The address is typically held in the evening of the first Monday in January, and it provides an overview of the President's agenda, accomplishments, and challenges for the upcoming year. The speech is usually around 45 minutes long.


# Retrieval Augmented Generation

## Check the model with a HuggingFace pipeline


We check the model with a HF pipeline, using a query about the meaning of State of the Union (SOTU).

In [8]:
# Wrap the pipeline in HuggingFacePipeline so it can be used from LangChain
llm = HuggingFacePipeline(pipeline=query_pipeline)

# Quick test of the llm
response = llm(prompt="Please explain what is the State of the Union address. Give just a definition. Keep it in 100 words.")
print("LLM response:\n", response)


LLM response:
 Please explain what is the State of the Union address. Give just a definition. Keep it in 100 words. The State of the Union address is a formal speech given by the President of the United States to Congress, in which the President reports on the state of the nation, discusses current issues, and outlines legislative proposals. It is a significant event in American politics and is broadcast live on television.


## Ingestion of data using Text loader

We will ingest the 2023 State of the Union address, delivered in February 2023.

In [9]:
# Load the text
loader = TextLoader("biden-sotu-2023-planned-official.txt",
                    encoding="utf8")
documents = loader.load()
print(f"Loaded {len(documents)} documents.")

Loaded 1 documents.


## Split data in chunks

We split data in chunks using a recursive character text splitter.

In [10]:
# Split the text
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=20
)
all_splits = text_splitter.split_documents(documents)
print(f"Total splits: {len(all_splits)}")

Total splits: 43
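
To show what `chunk_size` and `chunk_overlap` control, here is a simplified fixed-window splitter. It is not the algorithm used by `RecursiveCharacterTextSplitter`, which prefers to break at paragraph, line and word boundaries before falling back to individual characters. The names `split_with_overlap`, `toy_text` and `toy_chunks` are hypothetical.

```python
def split_with_overlap(text, chunk_size=1000, chunk_overlap=20):
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
        start += step
    return chunks


toy_text = "abcdefghij" * 3  # 30 characters
toy_chunks = split_with_overlap(toy_text, chunk_size=12, chunk_overlap=2)
print(toy_chunks)
```

The overlap repeats the tail of one chunk at the head of the next, so a sentence that straddles a boundary is less likely to be lost to the retriever.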


## Creating Embeddings and Storing in Vector Store

Create the embeddings using Sentence Transformer and HuggingFace embeddings.

In [11]:
pip install sentence-transformers

Note: you may need to restart the kernel to use updated packages.


In [12]:
# Create the embeddings (sentence-transformers/all-mpnet-base-v2)
model_name = "sentence-transformers/all-mpnet-base-v2"
model_kwargs = {"device": device, "token": hf_token}  # run on the selected device

embeddings = HuggingFaceEmbeddings(model_name=model_name, model_kwargs=model_kwargs)


Initialize ChromaDB with the document splits, the embeddings defined previously and with the option to persist it locally.

In [13]:
pip install chromadb

Note: you may need to restart the kernel to use updated packages.


In [14]:
# Create the vector database (Chroma)
vectordb = Chroma.from_documents(
    documents=all_splits,
    embedding=embeddings,
    persist_directory="chroma_db"
)
retriever = vectordb.as_retriever()

## Initialize chain

In [15]:

custom_prompt = PromptTemplate(
    input_variables=["context", "question"],
    template="""You are a helpful AI assistant. Use only the text from the context below to answer the user's question.
If the answer is not in the context, say "No relevant info found."

Return only the final answer in one to three sentences.
Do not restate the question or context. 
Do not include these instructions in your final output.

Context:
{context}

Question: {question}

Answer:
"""
)

In [16]:

qa = RetrievalQA.from_chain_type(
    llm=llm,                # the HuggingFacePipeline wrapper created earlier
    chain_type="stuff",     # "stuff" as before; other chain types are possible
    retriever=retriever,
    verbose=False,          # turn off intermediate output
    return_source_documents=False,  # do not return the retrieved documents
    chain_type_kwargs={
        "prompt": custom_prompt    # use the custom prompt defined above
    }
)
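
Conceptually, the `"stuff"` chain type places all retrieved chunks into the `{context}` slot of the prompt and sends the result to the LLM in a single call. The sketch below mimics that step with plain string formatting. It is an approximation under stated assumptions, not LangChain's code: the blank-line separator, the shortened template and the names `build_stuff_prompt`, `toy_template` and `toy_retrieved` are all hypothetical.

```python
def build_stuff_prompt(template, chunks, question, separator="\n\n"):
    """Join the retrieved chunks and fill the template's context and question slots."""
    context = separator.join(chunks)
    return template.format(context=context, question=question)


# Shortened stand-in for the custom prompt template.
toy_template = "Context:\n{context}\n\nQuestion: {question}\n\nAnswer:\n"

# Hypothetical retrieved chunks.
toy_retrieved = [
    "The State of the Union is strong.",
    "We are building an economy where no one is left behind.",
]

stuffed_prompt = build_stuff_prompt(
    toy_template,
    toy_retrieved,
    "What is the state of the union?",
)
print(stuffed_prompt)
```

Because every retrieved chunk ends up in one prompt, the prompt length grows with the number and size of the chunks. That matters when the generation pipeline has a fixed `max_length`.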

## Test the Retrieval-Augmented Generation 


We define a test function, that will run the query and time it.

In [17]:

def test_rag(qa, query):
    """Run the query and print only the final answer, without intermediate output."""
    start_time = time.time()
    print(f"Query: {query}\n")
    result = qa.run(query)
    print("Final Output:", result)
    end_time = time.time()

    # To also report the inference time, print it here, e.g.:
    # print(f"Inference took {end_time - start_time:.2f} sec.")

Let's check a few queries.

In [18]:
# Test RAG
query = "What were the main topics in the State of the Union in 2023? Summarize. Keep it under 200 words."
test_rag(qa, query)

Query: What were the main topics in the State of the Union in 2023? Summarize. Keep it under 200 words.




Final Output: You are a helpful AI assistant. Use only the text from the context below to answer the user's question.
If the answer is not in the context, say "No relevant info found."

Return only the final answer in one to three sentences.
Do not restate the question or context. 
Do not include these instructions in your final output.

Context:
over darkness, hope over fear, unity over division. Stability over chaos. We must see each other not as enemies, but as fellow Americans. We are a good people, the only nation in the world built on an idea. That all of us, every one of us, is created equal in the image of God. A nation that stands as a beacon to the world. A nation in a new age of possibilities. So I have come here to fulfil my constitutional duty to report on the state of the union. And here is my report. Because the soul of this nation is strong, because the backbone of this nation is strong, because the people of this nation are strong, the State of the Union is strong. As 

In [19]:
# Test another query
query = "What is the nation economic status? Summarize. Keep it under 200 words."
test_rag(qa, query)

Query: What is the nation economic status? Summarize. Keep it under 200 words.

Final Output: You are a helpful AI assistant. Use only the text from the context below to answer the user's question.
If the answer is not in the context, say "No relevant info found."

Return only the final answer in one to three sentences.
Do not restate the question or context. 
Do not include these instructions in your final output.

Context:
over darkness, hope over fear, unity over division. Stability over chaos. We must see each other not as enemies, but as fellow Americans. We are a good people, the only nation in the world built on an idea. That all of us, every one of us, is created equal in the image of God. A nation that stands as a beacon to the world. A nation in a new age of possibilities. So I have come here to fulfil my constitutional duty to report on the state of the union. And here is my report. Because the soul of this nation is strong, because the backbone of this nation is strong, bec
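
In both outputs above, the `Final Output` starts with the prompt template itself and is cut off inside the context, so the answer is not visible. One possible explanation (an assumption, not verified here) is that the pipeline returns the prompt together with the generated text and that the `max_length=1024` budget is largely consumed by the stuffed prompt. The `transformers` text-generation pipeline accepts `return_full_text=False` and `max_new_tokens`, which may be worth trying. As a fallback, a small post-processing helper can keep only the text after the `Answer:` marker. `extract_answer` and `toy_output` below are hypothetical names for this sketch.

```python
def extract_answer(generated_text, marker="Answer:"):
    """Return the text after the last occurrence of marker, or the whole text if it is absent."""
    _, found, tail = generated_text.rpartition(marker)
    return tail.strip() if found else generated_text.strip()


# Hypothetical pipeline output that echoes the prompt before the answer.
toy_output = (
    "Context:\nThe State of the Union is strong.\n\n"
    "Question: What is the state of the union?\n\n"
    "Answer:\nThe speech says the State of the Union is strong."
)

answer = extract_answer(toy_output)
print(answer)
```

The helper relies on the marker appearing in the prompt template. If the model repeats the marker in its own output, only the text after the last occurrence is kept.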

## Document sources

Let's check the document sources for the last query.

In [20]:
# Inspect the documents returned by the similarity search
docs = vectordb.similarity_search(query)
print(f"Query: {query}")
print(f"Retrieved documents: {len(docs)}")

for doc in docs:
    doc_details = doc.to_json()['kwargs']
    print("Source: ", doc_details['metadata']['source'])
    print("Text: ", doc_details['page_content'], "\n")

Query: What is the nation economic status? Summarize. Keep it under 200 words.
Retrieved documents: 4
Source:  biden-sotu-2023-planned-official.txt
Text:  over darkness, hope over fear, unity over division. Stability over chaos. We must see each other not as enemies, but as fellow Americans. We are a good people, the only nation in the world built on an idea. That all of us, every one of us, is created equal in the image of God. A nation that stands as a beacon to the world. A nation in a new age of possibilities. So I have come here to fulfil my constitutional duty to report on the state of the union. And here is my report. Because the soul of this nation is strong, because the backbone of this nation is strong, because the people of this nation are strong, the State of the Union is strong. As I stand here tonight, I have never been more optimistic about the future of America. We just have to remember who we are. We are the United States of America and there is nothing, nothingbeyond 

In [21]:
# === Section 17: Create a Gradio Interface (only final answer) ===

import gradio as gr

# 1) Turn off verbose so it won't show chain debug info
qa.verbose = False  # This ensures only the final answer is displayed

def rag_qa(user_query):
    """
    A simple function that calls qa.run(query) and returns only the final answer.
    """
    return qa.run(user_query)

# 2) Provide an English description that clarifies:
#    - The document used (Biden's 2023 State of the Union)
#    - Some example questions to try

demo_description = """
**Context**:
This demo is powered by a Retrieval-Augmented Generation (RAG) approach using 
Biden’s 2023 State of the Union Address as the primary document. 
All answers are derived from that transcript. 
If the answer is not in the text, the system should respond with "No relevant info found."

**Sample Questions**:
1. What were the main topics regarding infrastructure in this speech?
2. How does the speech address the competition with China?
3. What does Biden say about job growth in the past two years?
4. Does the speech mention anything about Social Security or Medicare?
5. What does the speech propose regarding Big Tech or online privacy?

Feel free to ask any question relevant to Biden’s 2023 State of the Union Address.
"""

# 3) Create a Gradio interface
demo = gr.Interface(
    fn=rag_qa,
    inputs="text",
    outputs="text",
    title="Biden 2023 SOTU RAG QA Demo",
    description=demo_description
)

# 4) Launch the Gradio app
if __name__ == "__main__":
    demo.launch()

* Running on local URL:  http://127.0.0.1:7860

To create a public link, set `share=True` in `launch()`.


# Conclusions


We used Langchain, ChromaDB and Llama 2 as the LLM to build a Retrieval Augmented Generation solution. For testing, we used the 2023 State of the Union address, delivered in February 2023.


# More work on the same topic

You can find more details about how to use an LLM with Kaggle. A few interesting topics are covered in:  

* https://www.kaggle.com/code/gpreda/test-llama-2-quantized-with-llama-cpp (quantizing Llama 2 model using llama.cpp)
* https://www.kaggle.com/code/gpreda/fast-test-of-llama-v2-pre-quantized-with-llama-cpp  (quantized Llama 2 model using llama.cpp)  
* https://www.kaggle.com/code/gpreda/test-of-llama-2-quantized-with-llama-cpp-on-cpu (quantized model using llama.cpp - running on CPU)  
* https://www.kaggle.com/code/gpreda/explore-enron-emails-with-langchain-and-llama-v2 (Explore Enron Emails with Langchain and Llama v2)


# References  

[1] Murtuza Kazmi, Using LLaMA 2.0, FAISS and LangChain for Question-Answering on Your Own Data, https://medium.com/@murtuza753/using-llama-2-0-faiss-and-langchain-for-question-answering-on-your-own-data-682241488476  

[2] Patrick Lewis, Ethan Perez, et al., Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks, https://browse.arxiv.org/pdf/2005.11401.pdf 

[3] Minhajul Hoque, Retrieval Augmented Generation: Grounding AI Responses in Factual Data, https://medium.com/@minh.hoque/retrieval-augmented-generation-grounding-ai-responses-in-factual-data-b7855c059322  

[4] Fangrui Liu, Discover the Performance Gain with Retrieval Augmented Generation, https://thenewstack.io/discover-the-performance-gain-with-retrieval-augmented-generation/

[5] Andrew, How to use Retrieval-Augmented Generation (RAG) with Llama 2, https://agi-sphere.com/retrieval-augmented-generation-llama2/   

[6] Yogendra Sisodia, Retrieval Augmented Generation Using Llama2 And Falcon, https://medium.com/@scholarly360/retrieval-augmented-generation-using-llama2-and-falcon-ed26c7b14670   

