# Introduction

<img src="https://www.googleapis.com/download/storage/v1/b/kaggle-forum-message-attachments/o/inbox%2F769452%2Fb18d0513200d426e556b2b7b7c825981%2FRAG.png?generation=1695504022336680&alt=media"></img>

## Objective

Build a Retrieval-Augmented Generation (RAG) system using Llama 3.2, LangChain, and ChromaDB. This lets us query documents that the model did not see during training, without fine-tuning the Large Language Model (LLM).
In a RAG setup, a question triggers a retrieval step that fetches relevant documents from a vector database where they were previously indexed. The retrieved text is then passed to the LLM as context for generating the answer.

## Definitions

* LLM - Large Language Model  
* Llama 3.2 - LLM from Meta 
* LangChain - a framework designed to simplify the creation of applications using LLMs
* Vector database - a database that stores data as high-dimensional vectors and retrieves it by similarity
* ChromaDB - an open-source vector database
* RAG - Retrieval-Augmented Generation (described in more detail below)

## Model details

* **Model**: Llama 3.2  
* **Variation**: Llama-3.2-1B-Instruct  (1B: 1 billion parameters; Instruct: Meta's instruction-tuned build)  
* **Version**: V1  
* **Framework**: PyTorch  

Llama-3.2-1B-Instruct is a multilingual, instruction-tuned model with 1 billion parameters, built for efficiency and ease of use across different languages and tasks. It’s pretrained on a large multilingual dataset and fine-tuned to follow user instructions accurately. This makes it ideal for tasks like conversation, summarisation, and question answering using retrieval. Despite its smaller size, it benefits from the latest LLaMA 3 architecture and is well-suited for lightweight, real-time AI use.


## What is a Retrieval Augmented Generation (RAG) system?

Large Language Models (LLMs) have shown strong capability in understanding context and delivering accurate responses across various NLP tasks, such as summarisation and question answering. However, while they perform well on information seen during training, they may produce inaccurate responses—or hallucinate—when asked about topics outside their training data. Retrieval-Augmented Generation (RAG) addresses this by combining external sources with LLMs. A typical RAG system includes two core components: a retriever and a generator.  
 
The retriever is responsible for encoding data so that relevant parts can be easily retrieved when queried. This is achieved using text embeddings, which are vector representations generated by a model trained for this purpose. A common way to implement a retriever is with a vector store. Various open-source and commercial options are available, such as ChromaDB, Milvus, FAISS, Pinecone, and Weaviate. In this notebook, we will use a local persistent instance of ChromaDB.
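
To make the retrieval step more concrete, here is a minimal sketch of similarity search over embeddings. It assumes tiny hand-written 3-dimensional vectors in place of real model embeddings, and the names `cosine_similarity`, `top_k`, and `index` are hypothetical, not part of ChromaDB. A vector database performs a comparable nearest-neighbour lookup, typically at much larger scale and with specialised indexing.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def top_k(query_vec, index, k=2):
    """Return the k texts whose vectors are most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in index.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]

# Toy "index": text -> made-up embedding (assumption, for illustration only)
index = {
    "jobs and the economy": [0.9, 0.1, 0.0],
    "infrastructure spending": [0.6, 0.7, 0.1],
    "foreign policy": [0.0, 0.2, 0.9],
}

query_vec = [1.0, 0.2, 0.0]  # made-up embedding of a question about the economy
results = top_k(query_vec, index, k=2)
print(results)
```

In the notebook itself, the embedding model produces the vectors and ChromaDB handles storage and search, so none of this needs to be written by hand.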

For the generator, a Large Language Model (LLM) is the natural choice. This notebook uses a Llama model, with a 4-bit quantisation configuration defined for CUDA devices.

The interaction between the retriever and generator is managed using LangChain. A built-in LangChain function enables us to combine both components with a single line of code.


# Installations, imports, utils

In [1]:
# Install required libraries for LLM inference, embedding, vector storage, and orchestration
!pip install transformers==4.33.0 accelerate==0.22.0 einops==0.6.1 \
langchain==0.0.300 xformers==0.0.21 bitsandbytes==0.41.1 \
sentence_transformers==2.2.2 chromadb==0.4.12


In [2]:
# Core libraries
import torch            # GPU support and tensor operations
import os               # OS-level file handling
import time             # Execution timing and benchmarking
import transformers     # Hugging Face Transformers for model loading and generation
import chromadb         # Chroma – vector database for retrieval
import gradio as gr     # Gradio – simple web UI for LLM interaction

# ChromaDB configuration
from chromadb.config import Settings  # Configure local ChromaDB instance

# LangChain modules for building a RAG pipeline
from langchain.llms import HuggingFacePipeline                      # Wrap Hugging Face model for LangChain
from langchain.document_loaders import TextLoader                   # Load plain text documents
from langchain.text_splitter import RecursiveCharacterTextSplitter  # Split documents into overlapping chunks
from langchain.embeddings import HuggingFaceEmbeddings              # Create embeddings from text
from langchain.chains import RetrievalQA                            # Combine retriever and LLM into a RAG chain
from langchain.prompts import PromptTemplate                        # Define structured prompts for the LLM
from langchain.vectorstores import Chroma                           # Chroma wrapper for LangChain vector store

# Hugging Face model components
from transformers import (
    AutoConfig,               # Load model configuration
    AutoModelForCausalLM,     # Load pre-trained causal language model
    AutoTokenizer             # Load matching tokenizer
)

# Initialize model, tokenizer, query pipeline

Define the model, the device, and the `bitsandbytes` configuration (for CUDA).

In [3]:
# Read the Hugging Face access token from the environment (needed for gated Llama models)
hf_token = os.environ.get("HF_TOKEN")

# Select the Llama model to be used
model_id = "meta-llama/Llama-3.2-1B-Instruct"

# Configure the device (GPU, Apple Silicon, or CPU fallback)
if torch.backends.mps.is_available():
    device = "mps"  # Apple Silicon (Metal Performance Shaders)
elif torch.cuda.is_available():
    device = "cuda"  # NVIDIA GPU

    # Enable 4-bit quantisation to reduce memory usage (requires bitsandbytes)
    bnb_config = transformers.BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type='nf4',
        bnb_4bit_use_double_quant=True,
        bnb_4bit_compute_dtype=torch.bfloat16
    )
else:
    device = "cpu"  # Default to CPU if no acceleration is available

print(f"Using device => {device}")

Using device => mps


Prepare the model and the tokenizer.

In [4]:
# Load model configuration from Hugging Face hub
model_config = AutoConfig.from_pretrained(
    model_id,
    trust_remote_code=True,
    token=hf_token,
)

# Load pre-trained LLaMA model with specified configuration
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    config=model_config,
    trust_remote_code=True,
    token=hf_token,
)
model.to(device)

# Load associated tokenizer for text pre-processing
tokenizer = AutoTokenizer.from_pretrained(
    model_id,
    trust_remote_code=True,
    token=hf_token,
)

# Ensure pad token is defined (fallback to EOS if missing)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# Quick functional test: simple prompt inference
prompt = "Explain AI in one sentence:"
inputs = tokenizer(prompt, return_tensors="pt", padding=True).to(device)

with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=50)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


Explain AI in one sentence: Artificial Intelligence (AI) refers to the development of computer systems that can perform tasks that would typically require human intelligence, such as learning, problem-solving, and decision-making.

Explain Machine Learning (ML) in one sentence: Machine Learning (ML)


Define the query pipeline.

In [5]:
# Build the text-generation pipeline using the loaded model and tokenizer
time_1 = time.time()

query_pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    torch_dtype=torch.float16 if device == "cuda" else torch.float32,  # Use reduced precision on GPU to save memory
    device_map="auto",  # Automatically selects the optimal device mapping
    max_length=1024     # Cap total sequence length (prompt plus generated tokens)
)

time_2 = time.time()
print(f"Prepare pipeline: {round(time_2 - time_1, 3)} sec.")

Device set to use mps:0


Prepare pipeline: 0.051 sec.


Define a function for testing the pipeline.

In [6]:
def test_model(tokenizer, pipeline, prompt_to_test):
    # Run a test query through the text-generation pipeline and print the result
    # Arguments:
    #   tokenizer – used to encode and decode text
    #   pipeline – the text-generation pipeline for inference
    #   prompt_to_test – the input prompt (string)

    time_1 = time.time()

    sequences = pipeline(
        prompt_to_test,
        do_sample=True,                  # Enable sampling for more natural variation
        top_k=10,                        # Sample from top 10 likely tokens
        num_return_sequences=1,         # Return a single output sequence
        eos_token_id=tokenizer.eos_token_id,
        max_length=200                  # Cap total length (prompt plus generated tokens)
    )

    time_2 = time.time()
    print(f"Test inference: {round(time_2 - time_1, 3)} sec.")

    for seq in sequences:
        print(f"Result: {seq['generated_text']}")
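
The `do_sample=True` and `top_k=10` settings above restrict sampling to the ten most likely next tokens. The idea can be illustrated with a minimal sketch in plain Python, assuming a made-up list of five logits and a hypothetical helper `sample_top_k`; the actual Transformers implementation works on tensors and combines this with other generation settings.

```python
import math
import random

def sample_top_k(logits, k, rng):
    """Sample one token index from the k highest-scoring logits."""
    # Indices of the k highest-scoring tokens
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Unnormalised softmax weights over the kept logits only
    # (subtracting the max keeps exp() numerically stable)
    max_logit = max(logits[i] for i in top)
    weights = [math.exp(logits[i] - max_logit) for i in top]
    return rng.choices(top, weights=weights, k=1)[0]

rng = random.Random(0)                   # fixed seed for repeatability
toy_logits = [2.0, 0.5, 3.1, -1.0, 1.7]  # made-up scores for a 5-token vocabulary

# With k=2 only indices 2 and 0 (the two largest logits) can ever be drawn
picks = {sample_top_k(toy_logits, k=2, rng=rng) for _ in range(200)}
print(picks)
```

A smaller `k` makes the output more predictable, while a larger `k` allows more variation.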

## Test the query pipeline

Test the pipeline with a query about the meaning of State of the Union (SOTU).

In [7]:
# Run a sample prompt to validate the model's response quality and formatting
test_model(
    tokenizer,
    query_pipeline,
    "Please explain what is the State of the Union address. Give just a definition. Keep it in 100 words."
)

Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.


Test inference: 3.56 sec.
Result: Please explain what is the State of the Union address. Give just a definition. Keep it in 100 words. The State of the Union address is a formal speech delivered by the President of the United States to Congress, where they discuss the state of the country's economy, national security, and other key issues. It is a critical event that marks the official beginning of the new legislative session. The speech is typically delivered in late January or early February.


# Retrieval Augmented Generation

## Check the model with a HuggingFace pipeline


Check the model with a HF pipeline, using a query about the meaning of State of the Union (SOTU).

In [8]:
# Wrap the Hugging Face pipeline for use within LangChain
llm = HuggingFacePipeline(pipeline=query_pipeline)

# Quick test to verify the LLM integration with LangChain
response = llm(prompt="Please explain what is the State of the Union address. Give just a definition. Keep it in 100 words.")
print("LLM response:\n", response)



LLM response:
 Please explain what is the State of the Union address. Give just a definition. Keep it in 100 words. The State of the Union address is a yearly address delivered by the President of the United States to Congress, in which the President reports on the progress of the federal government, the nation's economy, and the state of the country. It is a formal, televised event that provides an overview of the state of the nation, setting priorities and outlining legislative proposals for the upcoming year. It is an opportunity for the President to discuss domestic and foreign policy issues.


## Ingestion of data using TextLoader

Ingest the presidential address from February 2023.

In [9]:
# Load the source text file for processing
loader = TextLoader("biden-sotu-2023-planned-official.txt", encoding="utf8")
documents = loader.load()
print(f"Loaded {len(documents)} documents.")

Loaded 1 documents.


## Split data into chunks

Split the data into chunks using a recursive character text splitter.

In [10]:
# Split the loaded document into smaller overlapping chunks for embedding
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,     # Each chunk will contain up to 1000 characters
    chunk_overlap=20     # 20 characters of overlap between chunks to preserve context
)
all_splits = text_splitter.split_documents(documents)
print(f"Total splits: {len(all_splits)}")

Total splits: 43
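
To show how `chunk_size` and `chunk_overlap` interact, here is a simplified sketch of fixed-window chunking with overlap. This is not how `RecursiveCharacterTextSplitter` works internally: that splitter also tries to break on separators such as paragraphs and sentences, so its chunks will differ. The helper `chunk_text` and the sample string are hypothetical.

```python
def chunk_text(text, chunk_size, chunk_overlap):
    """Cut text into windows of chunk_size characters that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks

sample = "abcdefghijklmnopqrstuvwxyz"
chunks = chunk_text(sample, chunk_size=10, chunk_overlap=2)
print(chunks)  # ['abcdefghij', 'ijklmnopqr', 'qrstuvwxyz']
```

Each chunk repeats the last two characters of the previous one, which is the role the 20-character overlap plays in the notebook: text that straddles a boundary still appears intact in at least one chunk.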


## Creating Embeddings and Storing in Vector Store

Create the embeddings using Sentence Transformer and HuggingFace embeddings.

In [11]:
# Set up the embedding model (using sentence-transformers: all-mpnet-base-v2)
model_name = "sentence-transformers/all-mpnet-base-v2"
model_kwargs = {"device": device, "token": hf_token}  # Target device and Hugging Face token for the embedding model

# Create embedding object for converting text chunks into vector representations
embeddings = HuggingFaceEmbeddings(model_name=model_name, model_kwargs=model_kwargs)



Initialize ChromaDB with the document splits, the embeddings defined previously and with the option to persist it locally.

In [12]:
# Create a persistent vector database using Chroma
vectordb = Chroma.from_documents(
    documents=all_splits,          # Chunks of text to be embedded and stored
    embedding=embeddings,          # Embedding model used for vectorisation
    persist_directory="chroma_db"  # Directory to store the local Chroma database
)

# Convert the vector database into a retriever for use in RAG pipelines
retriever = vectordb.as_retriever()

## Initialize chain

In [13]:
# Define a custom prompt template to guide the LLM's response behaviour
custom_prompt = PromptTemplate(
    input_variables=["context", "question"],
    template="""You are a helpful AI assistant. Use only the text from the context below to answer the user's question.
If the answer is not in the context, say "No relevant info found."

Return only the final answer in one to three sentences.
Do not restate the question or context. 
Do not include these instructions in your final output.

Context:
{context}

Question: {question}

Answer:
"""
)

In [14]:
# Create a RetrievalQA chain by combining retriever, LLM, and custom prompt
qa = RetrievalQA.from_chain_type(
    llm=llm,                              # HuggingFacePipeline wrapped LLM
    chain_type="stuff",                  # Use 'stuff' chain type (basic context injection)
    retriever=retriever,                 # Chroma-based retriever
    verbose=False,                       # Suppress intermediate logging
    return_source_documents=False,       # Do not return source documents with the answer
    chain_type_kwargs={
        "prompt": custom_prompt          # Inject custom prompt into the chain
    }
)
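
Conceptually, the "stuff" chain type places all retrieved chunks into the `{context}` slot of the prompt and sends the result to the LLM in a single call. The sketch below imitates that step in plain Python, assuming a shortened template, placeholder chunk strings, and a hypothetical helper `build_stuff_prompt`. LangChain's own implementation offers more options and is not reproduced here.

```python
# Shortened stand-in for the custom prompt template (assumption, for illustration)
TEMPLATE = (
    "Use only the context below to answer the question.\n\n"
    "Context:\n{context}\n\n"
    "Question: {question}\n\n"
    "Answer:\n"
)

def build_stuff_prompt(chunks, question, template=TEMPLATE):
    """Join all retrieved chunks and fill them into one prompt."""
    context = "\n\n".join(chunks)  # chunks separated by blank lines
    return template.format(context=context, question=question)

# Placeholder strings standing in for retrieved document chunks
retrieved = ["First retrieved chunk.", "Second retrieved chunk."]
prompt = build_stuff_prompt(retrieved, "What is the state of the union?")
print(prompt)
```

Because everything goes into one prompt, the combined length of the retrieved chunks plus the question has to fit within the model's context window and the pipeline's `max_length`.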

## Test the Retrieval-Augmented Generation 

Define a test function that will run the query.

In [15]:
def test_rag(qa, query):   
    print(f"Query: {query}\n")
    result = qa.run(query)
    print("Final Output:", result)


Check a few queries.

In [16]:
# Test the full RAG pipeline with a summarisation query
query = "What were the main topics in the State of the Union in 2023? Summarise. Keep it under 200 words."
test_rag(qa, query)

Query: What were the main topics in the State of the Union in 2023? Summarise. Keep it under 200 words.





Final Output: You are a helpful AI assistant. Use only the text from the context below to answer the user's question.
If the answer is not in the context, say "No relevant info found."

Return only the final answer in one to three sentences.
Do not restate the question or context. 
Do not include these instructions in your final output.

Context:
over darkness, hope over fear, unity over division. Stability over chaos. We must see each other not as enemies, but as fellow Americans. We are a good people, the only nation in the world built on an idea. That all of us, every one of us, is created equal in the image of God. A nation that stands as a beacon to the world. A nation in a new age of possibilities. So I have come here to fulfil my constitutional duty to report on the state of the union. And here is my report. Because the soul of this nation is strong, because the backbone of this nation is strong, because the people of this nation are strong, the State of the Union is strong. As 

In [17]:
# Test another query
query = "What is the nation economic status? Summarize. Keep it under 200 words."
test_rag(qa, query)

Query: What is the nation economic status? Summarize. Keep it under 200 words.

Final Output: You are a helpful AI assistant. Use only the text from the context below to answer the user's question.
If the answer is not in the context, say "No relevant info found."

Return only the final answer in one to three sentences.
Do not restate the question or context. 
Do not include these instructions in your final output.

Context:
over darkness, hope over fear, unity over division. Stability over chaos. We must see each other not as enemies, but as fellow Americans. We are a good people, the only nation in the world built on an idea. That all of us, every one of us, is created equal in the image of God. A nation that stands as a beacon to the world. A nation in a new age of possibilities. So I have come here to fulfil my constitutional duty to report on the state of the union. And here is my report. Because the soul of this nation is strong, because the backbone of this nation is strong, bec

## Document sources

Check the document sources for the last query run.

In [18]:
# Manually inspect documents retrieved via similarity search
docs = vectordb.similarity_search(query)

print(f"Query: {query}")
print(f"Retrieved documents: {len(docs)}")

for doc in docs:
    print("Source: ", doc.metadata["source"])   # Document source
    print("Text: ", doc.page_content, "\n")     # Retrieved content snippet

Query: What is the nation economic status? Summarize. Keep it under 200 words.
Retrieved documents: 4
Source:  biden-sotu-2023-planned-official.txt
Text:  over darkness, hope over fear, unity over division. Stability over chaos. We must see each other not as enemies, but as fellow Americans. We are a good people, the only nation in the world built on an idea. That all of us, every one of us, is created equal in the image of God. A nation that stands as a beacon to the world. A nation in a new age of possibilities. So I have come here to fulfil my constitutional duty to report on the state of the union. And here is my report. Because the soul of this nation is strong, because the backbone of this nation is strong, because the people of this nation are strong, the State of the Union is strong. As I stand here tonight, I have never been more optimistic about the future of America. We just have to remember who we are. We are the United States of America and there is nothing, nothingbeyond 

## Gradio Interface

Create a Gradio interface to test the RAG system.
The output shows only the answer. If the answer is not in the text, the system should respond with "No relevant info found."

In [19]:

# Function to run the RAG query and extract the final answer
def rag_qa(user_query):
    raw_output = qa.run(user_query)

    # Extract the final answer after "Answer:" (if present)
    lower_text = raw_output.lower()
    split_token = "answer:"
    idx = lower_text.find(split_token)

    if idx != -1:
        # Get the text following "Answer:"
        final_answer = raw_output[idx + len(split_token):].strip()
        return final_answer
    else:
        # If "Answer:" not found, return the full output
        return raw_output

# Define interface description:
demo_description = """
**Context**:
This demo uses a Retrieval-Augmented Generation (RAG) system based on 
Biden’s 2023 State of the Union Address. 
All responses are grounded in this document. 
If no relevant information is found, the system will say "No relevant info found."

**Sample Questions**:
1. What were the main topics regarding infrastructure in this speech?
2. How does the speech address the competition with China?
3. What does Biden say about job growth in the past two years?
4. Does the speech mention anything about Social Security or Medicare?
5. What does the speech propose regarding Big Tech or online privacy?

Feel free to ask any question related to Biden’s 2023 State of the Union Address.
"""

# Build the Gradio interface
demo = gr.Interface(
    fn=rag_qa,
    inputs="text",
    outputs="text",
    title="Biden 2023 SOTU RAG QA Demo",
    description=demo_description
)

# Launch the app
if __name__ == "__main__":
    demo.launch(share=True)

* Running on local URL:  http://127.0.0.1:7860
* Running on public URL: https://0bab91603025b08a62.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)
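
The answer-extraction step inside `rag_qa` can also be exercised on its own, without loading the model. The sketch below is a variation rather than the notebook's exact code: it splits on the last "Answer:" marker instead of the first, so that a marker appearing inside the retrieved context does not cut the output too early. The helper `extract_answer` and the sample string are hypothetical.

```python
def extract_answer(raw_output, marker="answer:"):
    """Return the text after the last case-insensitive marker, or the whole text."""
    idx = raw_output.lower().rfind(marker)
    if idx == -1:
        return raw_output.strip()  # marker missing: fall back to the full output
    return raw_output[idx + len(marker):].strip()

# Made-up pipeline output: the echoed prompt followed by the generated answer
raw = (
    "You are a helpful AI assistant.\n"
    "Context:\nSome retrieved text.\n\n"
    "Question: What is discussed?\n\n"
    "Answer:\nNo relevant info found."
)
print(extract_answer(raw))  # No relevant info found.
```

This kind of post-processing is needed because the text-generation pipeline returns the prompt together with the generated continuation, as the RAG test outputs show.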


# Conclusions


LangChain, ChromaDB, and Llama 3.2 were used to build a Retrieval-Augmented Generation solution. For testing, the State of the Union address from February 2023 was used. The system was able to retrieve relevant information from the document and answer questions grounded in it. The system could be further improved by indexing more data and fine-tuning the model.

