# Building a Basic RAG System: A Step-by-Step Guide

![RAG System Architecture](https://miro.medium.com/v2/resize:fit:1400/1*7i-6SZo3C65T3mrqUxpgZQ.png)

## Introduction to RAG (Retrieval Augmented Generation)

Retrieval Augmented Generation (RAG) combines a retrieval system with a generative language model to produce more accurate and contextually relevant responses. Instead of relying solely on the knowledge encoded in a language model's parameters, a RAG system first retrieves relevant information from a knowledge base and then uses that information to generate its response.

Key components of a RAG system include:

1. **Document Collection**: A corpus of documents containing domain-specific information
2. **Document Processing**: Converting raw documents into a format suitable for embedding
3. **Chunking**: Breaking documents into smaller, manageable pieces
4. **Embedding**: Converting text chunks into vector representations
5. **Vector Storage**: Storing embeddings in a vector database for efficient retrieval
6. **Retrieval**: Finding the most relevant chunks for a given query
7. **Augmentation**: Combining the query with retrieved context
8. **Generation**: Using an LLM to generate a response based on the augmented prompt

In this tutorial, we'll build a complete RAG system from scratch and see it in action!
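
Before bringing in real libraries, it may help to see the retrieve-then-augment flow in miniature. The sketch below is a toy under stated assumptions: `toy_embed` is a hypothetical bag-of-words stand-in for an embedding model, the three chunks are made-up examples, and the generation step is left out entirely.

```python
import math
import re
from collections import Counter

def toy_embed(text):
    """Hypothetical stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def toy_cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def toy_retrieve(query, chunks, top_k=2):
    """Rank chunks by similarity to the query and keep the top_k."""
    q = toy_embed(query)
    ranked = sorted(chunks, key=lambda c: toy_cosine(q, toy_embed(c)), reverse=True)
    return ranked[:top_k]

def toy_augment(query, retrieved):
    """Combine the query with the retrieved context into a single prompt."""
    context = "\n".join(f"[{i}] {c}" for i, c in enumerate(retrieved, 1))
    return f"Context:\n{context}\n\nQuestion: {query}"

# Made-up example chunks, for illustration only
chunks = [
    "Chroma stores embeddings for similarity search.",
    "Bananas contain potassium.",
    "Embeddings map text to vectors.",
]
question = "How are embeddings stored?"
prompt = toy_augment(question, toy_retrieve(question, chunks))
print(prompt)
```

The rest of the tutorial replaces each toy piece with a real component: a sentence-transformer for embedding, ChromaDB for storage and retrieval, and an LLM call that consumes the augmented prompt.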

## Setting Up the Environment

First, let's install the necessary libraries:

In [1]:
# Install required libraries
!pip install requests sentence-transformers chromadb openai
!pip install git+https://github.com/brandonstarxel/chunking_evaluation.git

Collecting git+https://github.com/brandonstarxel/chunking_evaluation.git
  Cloning https://github.com/brandonstarxel/chunking_evaluation.git to /private/var/folders/_z/ms9rqjt90dq1s5797l_x0ry40000gn/T/pip-req-build-9jlj0qvb
  Running command git clone --filter=blob:none --quiet https://github.com/brandonstarxel/chunking_evaluation.git /private/var/folders/_z/ms9rqjt90dq1s5797l_x0ry40000gn/T/pip-req-build-9jlj0qvb
  Resolved https://github.com/brandonstarxel/chunking_evaluation.git to commit d451fc4cf56e417b755994b4ca5212fd5057c0d2
  Preparing metadata (setup.py) ... done


Next, let's import the libraries we'll need and define our configuration settings:

In [2]:
import os
import requests
import numpy as np
from sentence_transformers import SentenceTransformer
from chunking_evaluation.chunking import RecursiveTokenChunker
from chunking_evaluation.utils import openai_token_count
import chromadb
from chromadb.utils import embedding_functions
import openai

# URL handling
JINA_PREFIX = "https://r.jina.ai/"

# Chunking settings
CHUNK_SIZE = 400
CHUNK_OVERLAP = 50

# Embedding settings
EMBEDDING_MODEL = "all-MiniLM-L6-v2"

# ChromaDB settings
COLLECTION_NAME = "rag_documents"
PERSIST_DIRECTORY = "./chroma_db"

# OpenAI settings
OPENAI_MODEL = "gpt-3.5-turbo"
# Read the API key from the environment; otherwise replace the placeholder with your actual key
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY", "your-api-key-here")
openai.api_key = OPENAI_API_KEY

# Retrieval settings
TOP_K = 5

## 1. Loading Documents from URLs

Let's start by implementing a function to fetch content from web URLs using the jina.ai reader service:

In [3]:
def fetch_url_content(url):
    """
    Fetch content from a URL using jina.ai reader service.
    
    Args:
        url (str): The original URL to fetch
        
    Returns:
        str: The content of the URL in markdown format
    """
    # Prepend the jina.ai prefix to the URL
    jina_url = JINA_PREFIX + url
    
    try:
        # Make the request to jina.ai reader service (time out instead of hanging indefinitely)
        response = requests.get(jina_url, timeout=30)
        
        # Check if the request was successful
        if response.status_code == 200:
            return response.text
        else:
            print(f"Error fetching URL: Status code {response.status_code}")
            return None
    except requests.RequestException as e:
        print(f"Error fetching URL: {e}")
        return None

Now let's test our function by fetching content from a sample URL:

In [4]:
# Example URL (a Wikipedia article)
url = "https://en.wikipedia.org/wiki/Retrieval-augmented_generation"

# Fetch the content
content = fetch_url_content(url)

# Print a preview of the content
if content:
    print(f"Content length: {len(content)} characters")
    print("\nPreview of the first 500 characters:")
    print(content[:500] + "...")
else:
    print("Failed to fetch content")

Content length: 27961 characters

Preview of the first 500 characters:
Title: Retrieval-augmented generation

URL Source: https://en.wikipedia.org/wiki/Retrieval-augmented_generation

Published Time: 2023-11-05T13:19:20Z

Markdown Content:
From Wikipedia, the free encyclopedia

**Retrieval-augmented generation** (**RAG**) is a technique that enables [generative artificial intelligence](https://en.wikipedia.org/wiki/Generative_artificial_intelligence "Generative artificial intelligence") (Gen AI) models to retrieve and incorporate new information.[\[1\]](https://en....


## 2. Chunking the Content

Now that we can fetch content, let's implement the chunking functionality using RecursiveTokenChunker:

In [5]:
def chunk_document(text):
    """
    Split document text into chunks using RecursiveTokenChunker.
    
    Args:
        text (str): The document text to chunk
        
    Returns:
        list: List of text chunks
    """
    # Initialize the chunker with our configured settings
    chunker = RecursiveTokenChunker(
        chunk_size=CHUNK_SIZE,
        chunk_overlap=CHUNK_OVERLAP,
        length_function=openai_token_count,
        separators=["\n\n", "\n", ".", "?", "!", " ", ""]
    )
    
    # Split the text into chunks
    chunks = chunker.split_text(text)
    
    # Return the chunks
    return chunks
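
To build intuition for how `CHUNK_SIZE` and `CHUNK_OVERLAP` interact, here is a simplified sketch of a fixed-size sliding window over a token list. It is not how `RecursiveTokenChunker` works internally (the real chunker prefers to split on the separators listed above, so its boundaries will differ); the function name `sliding_window_chunks` and the sample tokens are assumptions made for illustration.

```python
def sliding_window_chunks(tokens, chunk_size, chunk_overlap):
    """Split a token list into windows of chunk_size that share chunk_overlap tokens."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once a window has reached the end of the input
        if start + chunk_size >= len(tokens):
            break
    return chunks

# Illustrative tokens: w0 ... w9
words = [f"w{i}" for i in range(10)]
for chunk in sliding_window_chunks(words, chunk_size=4, chunk_overlap=1):
    print(chunk)
```

Each window starts `chunk_size - chunk_overlap` tokens after the previous one, so the last token of one chunk reappears as the first token of the next. That shared context is what keeps a sentence that straddles a boundary retrievable from either side.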

Let's test our chunking function on the content we fetched:

In [6]:
if content:
    # Chunk the content
    chunks = chunk_document(content)
    
    # Print chunking statistics
    print(f"Original content length: {len(content)} characters")
    print(f"Number of chunks: {len(chunks)}")
    print(f"Average chunk length: {sum(len(c) for c in chunks) / len(chunks):.1f} characters")
    
    # Print a sample chunk
    if chunks:
        print("\nSample chunk:")
        print("-" * 50)
        print(chunks[min(5, len(chunks)-1)])  # Print the 6th chunk or the last one if fewer
        print("-" * 50)
else:
    print("No content to chunk")

Original content length: 27961 characters
Number of chunks: 22
Average chunk length: 1283.0 characters

Sample chunk:
--------------------------------------------------
Typically, the data to be referenced is converted into LLM [embeddings](https://en.wikipedia.org/wiki/Word_embeddings "Word embeddings"), numerical representations in the form of a large vector space. RAG can be used on unstructured (usually text), semi-structured, or structured data (for example [knowledge graphs](https://en.wikipedia.org/wiki/Knowledge_graphs "Knowledge graphs")).[\[3\]](https://en.wikipedia.org/wiki/Retrieval-augmented_generation#cite_note-Survey-3) These embeddings are then stored in a [vector database](https://en.wikipedia.org/wiki/Vector_database "Vector database") to allow for [document retrieval](https://en.wikipedia.org/wiki/Document_retrieval "Document retrieval").

[![Image 1](https://upload.wikimedia.org/wikipedia/commons/thumb/1/14/RAG_diagram.svg/220px-RAG_diagram.svg.png)](https://en.wiki

## 3. Embedding Chunks

Now let's implement the functionality to create embeddings for our text chunks:

In [7]:
# Initialize the embedding model
embedding_model = SentenceTransformer(EMBEDDING_MODEL)

def embed_texts(texts):
    """
    Create embeddings for a list of text chunks.
    
    Args:
        texts (list): List of text strings to embed
        
    Returns:
        numpy.ndarray: Array of embeddings
    """
    embeddings = embedding_model.encode(texts)
    return embeddings

Let's test our embedding function on a few chunks:

In [8]:
if chunks:
    # Create embeddings for the first 3 chunks
    sample_chunks = chunks[:min(3, len(chunks))]
    embeddings = embed_texts(sample_chunks)
    
    # Print embedding information
    print(f"Embedding dimensions: {embeddings.shape}")
    print(f"Sample embedding (first 10 values): {embeddings[0][:10]}")
    
    # Calculate similarity between the first two chunks if possible
    if len(embeddings) >= 2:
        # Using cosine similarity: dot product of normalized vectors
        def cosine_similarity(v1, v2):
            return np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
        
        sim = cosine_similarity(embeddings[0], embeddings[1])
        print(f"\nSimilarity between first two chunks: {sim:.4f}")
else:
    print("No chunks to embed")

Embedding dimensions: (3, 384)
Sample embedding (first 10 values): [-0.15525298 -0.00383533 -0.04336745  0.09886687 -0.01397708  0.06526825
  0.03533417 -0.01896184 -0.00340518 -0.04793883]

Similarity between first two chunks: 0.7273
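
The comment in the cell above describes cosine similarity as the dot product of normalized vectors. The small sketch below checks that equivalence on two hand-picked 2-D vectors, using plain Python so it runs without any model; the vectors and helper names are illustrative assumptions, not part of the pipeline.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(v1, v2):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(v1, v2))

def cosine_similarity(v1, v2):
    """Dot product divided by the product of the vector lengths."""
    return dot(v1, v2) / (math.sqrt(dot(v1, v1)) * math.sqrt(dot(v2, v2)))

a = [3.0, 4.0]
b = [4.0, 3.0]

direct = cosine_similarity(a, b)               # 24 / (5 * 5)
via_unit = dot(normalize(a), normalize(b))     # same value from unit vectors
print(f"direct={direct:.4f}, via unit vectors={via_unit:.4f}")
```

In practice this means embeddings can be normalized once and then compared with a plain dot product. `SentenceTransformer.encode` accepts `normalize_embeddings=True` for this purpose, though whether it helps depends on how the downstream store computes distance.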


## 4. Creating a ChromaDB Vector Store

Now let's implement the functionality to store our embeddings in ChromaDB:

In [9]:
def get_or_create_collection():
    """
    Get or create a ChromaDB collection.
    
    Returns:
        chromadb.Collection: The ChromaDB collection
    """
    # Create the persistent client
    client = chromadb.PersistentClient(path=PERSIST_DIRECTORY)
    
    # Set up the embedding function
    embedding_func = embedding_functions.SentenceTransformerEmbeddingFunction(
        model_name=EMBEDDING_MODEL
    )
    
    # Try to get the collection if it exists
    try:
        collection = client.get_collection(
            name=COLLECTION_NAME,
            embedding_function=embedding_func
        )
        print(f"Using existing collection: {COLLECTION_NAME}")
    except Exception:
        # Create a new collection if it doesn't exist
        collection = client.create_collection(
            name=COLLECTION_NAME,
            embedding_function=embedding_func,
            metadata={"hnsw:space": "cosine"}  # Use cosine similarity
        )
        print(f"Created new collection: {COLLECTION_NAME}")
    
    return collection

def add_chunks_to_collection(chunks, url):
    """
    Add document chunks to the ChromaDB collection.
    
    Args:
        chunks (list): List of text chunks
        url (str): Source URL of the document
        
    Returns:
        int: Number of chunks added
    """
    # Get the collection
    collection = get_or_create_collection()
    
    # Skip if no chunks
    if not chunks:
        return 0
    
    # Create IDs for each chunk
    ids = [f"chunk_{url.replace('/', '_')}_{i}" for i in range(len(chunks))]
    
    # Create metadata for each chunk
    metadatas = [{"source": url, "chunk_index": i} for i in range(len(chunks))]
    
    # Add chunks to the collection
    collection.add(
        documents=chunks,
        ids=ids,
        metadatas=metadatas
    )
    
    return len(chunks)
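
One thing to watch in `add_chunks_to_collection` is the ID scheme: replacing `/` with `_` means two different URLs such as `x/y` and `x_y` would produce the same IDs. A possible alternative, sketched here under the assumption that opaque IDs are acceptable, is to hash the URL. The helper name `make_chunk_ids` and the 16-character prefix length are hypothetical choices, not part of the original code.

```python
import hashlib

def make_chunk_ids(url, num_chunks):
    """Build stable IDs from a short SHA-256 prefix of the URL plus the chunk index."""
    url_hash = hashlib.sha256(url.encode("utf-8")).hexdigest()[:16]
    return [f"chunk_{url_hash}_{i}" for i in range(num_chunks)]

# The same URL always yields the same IDs; different URLs yield different ones
print(make_chunk_ids("https://example.com/x/y", 2))
print(make_chunk_ids("https://example.com/x_y", 2))
```

Because the IDs are deterministic, re-ingesting the same URL targets the same records. Chroma collections also expose an `upsert` method, which may be a better fit than `add` when a document is fetched more than once.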

Let's test adding chunks to our vector store:

In [10]:
if chunks:
    # Add chunks to the collection
    num_added = add_chunks_to_collection(chunks, url)
    print(f"Added {num_added} chunks to the collection")
    
    # Get the collection and check count
    collection = get_or_create_collection()
    count = collection.count()
    print(f"Total documents in collection: {count}")
else:
    print("No chunks to add")

Created new collection: rag_documents
Added 22 chunks to the collection
Using existing collection: rag_documents
Total documents in collection: 22


## 5. Implementing the Retriever

Now let's implement a function to retrieve the most relevant chunks for a given query:

In [22]:
def retrieve_relevant_chunks(query):
    """
    Retrieve the most relevant document chunks for a query.
    
    Args:
        query (str): The query text
        
    Returns:
        tuple: (chunks, sources), where chunks is a list of relevant document
            chunks and sources is a list with the source URL of each chunk
    """
    # Get the collection
    collection = get_or_create_collection()
    
    # Query the collection for similar chunks
    results = collection.query(
        query_texts=[query],
        n_results=TOP_K,
        include=["documents", "metadatas", "distances"]
    )
    
    # Extract the retrieved chunks and their sources
    chunks = results["documents"][0]  # First list is for the first query
    metadatas = results["metadatas"][0]
    distances = results["distances"][0]
    
    # Extract the sources
    sources = [meta["source"] for meta in metadatas]
    
    # Print retrieval information for debugging
    print(f"Retrieved {len(chunks)} chunks for query: '{query}'")
    for i, (chunk, source, distance) in enumerate(zip(chunks, sources, distances)):
        print("-"*40)
        print(f"\nChunk {i+1} (Distance: {distance:.4f}, Source: {source}):")
        print(chunk)
    
    return chunks, sources
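
The retriever always returns `TOP_K` chunks, even when some of them are only loosely related to the query. Since the collection was created with `"hnsw:space": "cosine"`, the reported distance should be 1 minus the cosine similarity, so a distance cutoff can be used to drop weak matches. The sketch below assumes that relationship; the helper name `filter_by_distance` and the `0.5` threshold are hypothetical and would need tuning on real data.

```python
def filter_by_distance(chunks, sources, distances, max_distance=0.5):
    """Keep only results whose cosine distance is at most max_distance.

    Returns a list of (chunk, source, similarity) tuples, where
    similarity = 1 - distance (assumes a cosine-space collection).
    """
    return [
        (chunk, source, 1.0 - distance)
        for chunk, source, distance in zip(chunks, sources, distances)
        if distance <= max_distance
    ]

# Made-up retrieval results for illustration
demo_chunks = ["closely related text", "barely related text"]
demo_sources = ["https://example.com/a", "https://example.com/b"]
demo_distances = [0.25, 0.75]

for chunk, source, similarity in filter_by_distance(demo_chunks, demo_sources, demo_distances):
    print(f"{similarity:.2f}  {source}  {chunk}")
```

A filter like this could be applied to the `chunks`, `sources`, and `distances` lists inside `retrieve_relevant_chunks` before returning, at the cost of sometimes passing fewer than `TOP_K` chunks to the generator.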

Let's test our retriever with a sample query:

In [23]:
# Define a test query
query = "What are the advantages of retrieval augmented generation?"

# Retrieve relevant chunks
chunks, sources = retrieve_relevant_chunks(query)

# Print summary
print(f"\nRetrieved {len(chunks)} chunks from {len(set(sources))} unique sources")

Using existing collection: rag_documents
Retrieved 5 chunks for query: 'What are the advantages of retrieval augmented generation?'
----------------------------------------

Chunk 1 (Distance: 0.3417, Source: https://en.wikipedia.org/wiki/Retrieval-augmented_generation):
Retrieval-Augmented Generation (RAG) enhances large language models (LLMs) by incorporating an [information-retrieval](https://en.wikipedia.org/wiki/Information_retrieval "Information retrieval") mechanism that allows models to access and utilize additional data beyond their original training set. [AWS](https://en.wikipedia.org/wiki/AWS "AWS") states, "RAG allows LLMs to retrieve relevant information from external data sources to generate more accurate and contextually relevant responses" (**indexing**).[\[4\]](https://en.wikipedia.org/wiki/Retrieval-augmented_generation#cite_note-AWS-4) This approach reduces reliance on static datasets, which can quickly become outdated. When a user submits a query, RAG uses a documen

## 6. Generating Responses with OpenAI

Finally, let's implement a function to generate responses using OpenAI's API:

In [20]:
from openai import OpenAI

def generate_response(query, chunks, sources):
    """
    Generate a response using OpenAI's API based on the query and retrieved chunks.
    
    Args:
        query (str): The user's query
        chunks (list): List of relevant document chunks
        sources (list): List of source URLs for each chunk
        
    Returns:
        str: The generated response
    """
    # Combine chunks with their sources for better attribution
    context_with_sources = []
    for i, (chunk, source) in enumerate(zip(chunks, sources)):
        context_with_sources.append(f"Source [{i+1}] ({source}): {chunk}")
    
    context = "\n\n".join(context_with_sources)
    
    # Define system message with instructions
    system_message = """You are a helpful assistant that provides accurate information based on the given context. 
    If the context doesn't contain relevant information to answer the question, acknowledge that and provide general information if possible.
    Always cite your sources by referring to the source numbers provided in brackets. Do not make up information."""
    
    # Define the user message with query and context
    user_message = f"""Question: {query}
    
    Context information:
    {context}
    
    Please answer the question based on the context information provided."""
    
    try:
        # Call the OpenAI API using the key defined in the configuration cell
        client = OpenAI(api_key=OPENAI_API_KEY)
        
        response = client.chat.completions.create(
            model=OPENAI_MODEL,
            messages=[
                {"role": "system", "content": system_message},
                {"role": "user", "content": user_message}
            ],
            temperature=0.3,  # Lower temperature for more focused responses
            max_tokens=1000   # Limit response length
        )
        
        # Extract and return the response text
        return response.choices[0].message.content
    except Exception as e:
        print(f"Error generating response: {e}")
        return f"Error generating response: {str(e)}"
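
A small detail in `generate_response`: the triple-quoted f-string sits inside an indented function body, so every line after the first carries four leading spaces into the prompt. That is usually harmless, but if a cleaner prompt is wanted, the augmentation step could be factored into a pure function like the sketch below. The name `build_user_message` is hypothetical; the wording of the template is copied from the function above.

```python
import textwrap

def build_user_message(query, chunks, sources):
    """Combine the query and numbered, attributed chunks into one prompt string."""
    context = "\n\n".join(
        f"Source [{i}] ({source}): {chunk}"
        for i, (chunk, source) in enumerate(zip(chunks, sources), 1)
    )
    # Dedent the template first, then fill it in, so multi-line context
    # cannot interfere with the indentation removal
    template = textwrap.dedent("""\
        Question: {query}

        Context information:
        {context}

        Please answer the question based on the context information provided.""")
    return template.format(query=query, context=context)

message = build_user_message(
    "What is RAG?",
    ["first chunk", "second chunk"],
    ["https://example.com/a", "https://example.com/b"],
)
print(message)
```

Keeping prompt construction separate from the API call also makes it possible to inspect or test the exact prompt without spending tokens.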

Let's test our response generation with the retrieved chunks:

In [21]:
if chunks:
    # Generate a response
    response = generate_response(query, chunks, sources)
    print("\nGenerated Response:")
    print("-" * 80)
    print(response)
    print("-" * 80)
else:
    print("No chunks retrieved, cannot generate response")


Generated Response:
--------------------------------------------------------------------------------
The advantages of retrieval-augmented generation (RAG) include:

1. **Access to Additional Data**: RAG allows large language models (LLMs) to retrieve relevant information from external data sources beyond their original training set, enabling them to generate more accurate and contextually relevant responses (indexing) [1].
   
2. **Reduced Reliance on Static Datasets**: By incorporating an information-retrieval mechanism, RAG reduces the reliance on static datasets that can quickly become outdated. When new information becomes available, the model can be augmented with updated information without the need for retraining (augmentation) [1].

3. **Enhanced Responses**: RAG enables LLMs to dynamically integrate relevant data, leading to more informed and contextually grounded responses (generation) [1].

4. **Domain-specific and Updated Information**: RAG allows LLMs to use domain-speci

## 7. Putting It All Together: The Complete RAG Pipeline

Let's combine everything into a complete RAG pipeline:

In [24]:
def rag_pipeline(url, query):
    """
    Complete RAG pipeline: fetch a document, chunk it, add to vector store, and answer a query.
    
    Args:
        url (str): URL to fetch and add to knowledge base
        query (str): Query to answer using the RAG system
        
    Returns:
        str: Generated response
    """
    print(f"Step 1: Fetching document from {url}")
    content = fetch_url_content(url)
    if not content:
        return "Failed to fetch the document."
    
    print("\nStep 2: Chunking the document")
    chunks = chunk_document(content)
    print(f"Created {len(chunks)} chunks")
    
    print("\nStep 3: Adding chunks to vector store")
    num_added = add_chunks_to_collection(chunks, url)
    print(f"Added {num_added} chunks to vector store")
    
    print(f"\nStep 4: Retrieving relevant chunks for query: '{query}'")
    relevant_chunks, sources = retrieve_relevant_chunks(query)
    
    print("\nStep 5: Generating response")
    if relevant_chunks:
        response = generate_response(query, relevant_chunks, sources)
        return response
    else:
        return "No relevant information found to answer the query."

Let's test the entire pipeline with a new document and query:

In [25]:
# Define a new URL and query
new_url = "https://en.wikipedia.org/wiki/Large_language_model"
new_query = "How do large language models work and what are their limitations?"

# Run the RAG pipeline
answer = rag_pipeline(new_url, new_query)

# Print the response
print("\nFinal Answer:")
print("=" * 80)
print(answer)
print("=" * 80)

Step 1: Fetching document from https://en.wikipedia.org/wiki/Large_language_model

Step 2: Chunking the document
Created 68 chunks

Step 3: Adding chunks to vector store
Using existing collection: rag_documents
Added 68 chunks to vector store

Step 4: Retrieving relevant chunks for query: 'How do large language models work and what are their limitations?'
Using existing collection: rag_documents
Retrieved 5 chunks for query: 'How do large language models work and what are their limitations?'
----------------------------------------

Chunk 1 (Distance: 0.3294, Source: https://en.wikipedia.org/wiki/Large_language_model):
^ A Closer Look at Large Language Models Emergent Abilities Archived 2023-06-24 at the Wayback Machine (Yao Fu, Nov 20, 2022)
^ Ornes, Stephen (March 16, 2023). "The Unpredictable Abilities Emerging From Large AI Models". Quanta Magazine. Archived from the original on March 16, 2023. Retrieved March 16, 2023.
^ Schaeffer, Rylan; Miranda, Brando; Koyejo, Sanmi (2023-04-01

## 8. Testing Multiple Queries

Let's try a few more queries to test our RAG system:

In [26]:
# Define a list of questions to ask
questions = [
    "What are the key challenges in implementing RAG systems?",
    "How does RAG improve factual accuracy compared to standard LLMs?",
    "What are some real-world applications of RAG technology?"
]

# Ask each question and print the answers
for i, question in enumerate(questions, 1):
    print(f"\nQuestion {i}: {question}")
    print("-" * 80)
    
    # Retrieve relevant chunks
    chunks, sources = retrieve_relevant_chunks(question)
    
    if chunks:
        # Generate a response
        response = generate_response(question, chunks, sources)
        print("\nAnswer:")
        print(response)
    else:
        print("No relevant information found to answer this question.")
    
    print("-" * 80)


Question 1: What are the key challenges in implementing RAG systems?
--------------------------------------------------------------------------------
Using existing collection: rag_documents
Retrieved 5 chunks for query: 'What are the key challenges in implementing RAG systems?'
----------------------------------------

Chunk 1 (Distance: 0.4109, Source: https://en.wikipedia.org/wiki/Retrieval-augmented_generation):
RAG is not a complete solution to the problem of hallucinations in LLMs. According to [_Ars Technica_](https://en.wikipedia.org/wiki/Ars_Technica "Ars Technica"), "It is not a direct solution because the LLM can still hallucinate around the source material in its response."[\[5\]](https://en.wikipedia.org/wiki/Retrieval-augmented_generation#cite_note-:0-5)

While RAG improves the accuracy of large language models (LLMs), it does not eliminate all challenges. One limitation is that while RAG reduces the need for frequent model retraining, it does not remove it entirely. Add

## Conclusion

You've built a complete RAG system from scratch that can:
1. Fetch and process documents from web URLs
2. Chunk them intelligently using RecursiveTokenChunker
3. Create embeddings using sentence-transformers
4. Store them in a ChromaDB vector database
5. Retrieve relevant information for queries
6. Generate informed responses using OpenAI's API

This basic implementation provides a solid foundation, but there are many ways to enhance your RAG system:

1. **Add more documents**: Expand your knowledge base with more sources
2. **Improve chunking**: Experiment with different chunking strategies and parameters
3. **Try different embedding models**: Test models like OpenAI's embeddings or other sentence transformers
4. **Enhance the retriever**: Implement re-ranking or hybrid search approaches
5. **Refine prompt engineering**: Optimize your prompts for better responses
6. **Add evaluation**: Measure the quality of your retrieval and generation
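
As a starting point for the evaluation item above, one common retrieval metric is recall@k: the fraction of known-relevant chunks that show up in the top k results. The sketch below assumes you have hand-labeled which chunk IDs are relevant for a query; the IDs shown are invented for illustration, and `recall_at_k` is a hypothetical helper rather than part of any library used in this tutorial.

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of relevant items that appear among the first k retrieved items."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    top_k = set(retrieved_ids[:k])
    return len(top_k & relevant) / len(relevant)

# Invented example: ranked retriever output and hand-labeled relevant chunks
retrieved = ["chunk_3", "chunk_7", "chunk_1", "chunk_9"]
relevant = {"chunk_3", "chunk_1", "chunk_5"}

for k in (1, 2, 3):
    print(f"recall@{k} = {recall_at_k(retrieved, relevant, k):.2f}")
```

Averaging this score over a set of labeled queries gives a simple way to compare chunk sizes, embedding models, or `TOP_K` values against each other.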

## Resources

- [Retrieval Augmented Generation (RAG): A comprehensive introduction](https://www.pinecone.io/learn/retrieval-augmented-generation/)
- [ChromaDB Documentation](https://docs.trychroma.com/)
- [Sentence-Transformers Documentation](https://www.sbert.net/)
- [OpenAI API Documentation](https://platform.openai.com/docs/api-reference)
- [Chunking Evaluation GitHub Repository](https://github.com/brandonstarxel/chunking_evaluation)
- [Jina AI Reader](https://github.com/jina-ai/reader)