# Building a RAG System Using LangChain, ChromaDB, and OpenAI

In this notebook, we'll build a simple retrieval-augmented generation (RAG) system using the LangChain library and the Chroma vector store. We will:
- Load text documents.
- Split the text into smaller chunks.
- Generate embeddings for the chunks using OpenAI.
- Store the embeddings in a Chroma vector database.
- Query the database for context-aware answers.

Let's dive in!


## Step 1: Install Necessary Libraries

Before we proceed, we need to install the required libraries: `langchain` with its `langchain-community` and `langchain-openai` integrations, `chromadb`, `openai`, `tiktoken`, `unstructured`, and `python-dotenv`.


In [None]:
# Install required libraries
%pip install python-dotenv langchain langchain-community langchain-openai unstructured chromadb openai tiktoken
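
Before moving on, it can help to confirm that the packages actually landed in the kernel's environment. The following is a minimal sketch that uses only the standard library; the helper name `installed_versions` is our own and is not part of any of these libraries.

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names exactly as passed to pip above.
REQUIRED = [
    "python-dotenv", "langchain", "langchain-community", "langchain-openai",
    "unstructured", "chromadb", "openai", "tiktoken",
]

def installed_versions(packages):
    """Map each distribution name to its installed version, or None if missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found

for name, ver in installed_versions(REQUIRED).items():
    print(f"{name}: {ver or 'NOT INSTALLED'}")
```

If any line reports `NOT INSTALLED`, rerun the install cell and restart the kernel before continuing.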

## Step 2: Initialize the Environment

We will import the necessary packages, read the OpenAI API key from a `.env` file, and define the paths for the source documents and the Chroma database.


In [29]:
import os
from langchain_community.document_loaders import DirectoryLoader, TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.schema import Document
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
from dotenv import load_dotenv
import openai
import shutil

# Load environment variables
load_dotenv()
openai.api_key = os.getenv('OPENAI_API_KEY')

# Define paths
CHROMA_PATH = "./data/chroma" # Path to store the Chroma vector store
DATA_PATH = "./data/docs" # Path to the directory containing the documents
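
A missing API key only surfaces later, when the first embedding request fails. As a hedged sketch, a small check like the one below fails early with a clearer message. The helper name `require_env` is hypothetical and not part of `python-dotenv` or `openai`.

```python
import os

def require_env(name):
    """Return the value of an environment variable, or raise a clear error."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(
            f"{name} is not set. Add it to a .env file or export it in your shell."
        )
    return value

# Example (uncomment once your .env file is in place):
# require_env("OPENAI_API_KEY")
```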

## Step 3: Loading Documents

We will load text documents from the `data/docs/` directory. For this, we'll use LangChain's `DirectoryLoader`, which applies a loader class (here `TextLoader`) to every file in the directory that matches a glob pattern.


In [30]:
def load_documents():
    # Load every .txt file in DATA_PATH; TextLoader detects each file's encoding
    loader = DirectoryLoader(DATA_PATH, glob="./*.txt", loader_cls=TextLoader, loader_kwargs={'autodetect_encoding': True})
    try:
        print("Starting to load documents...")
        documents = loader.load()
        print("Documents loaded:", len(documents))
        return documents
    except Exception as e:
        print(f"Error loading documents: {e}")
        # Return an empty list so later steps never receive None
        return []

Now we can run the document loading process.


In [31]:
documents = load_documents()

Starting to load documents...
Documents loaded: 1
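
If the reported count is zero or lower than expected, it is worth checking which files the pattern actually matches. The sketch below uses only `pathlib` and assumes, as the glob above suggests, that only `.txt` files directly inside the folder are wanted. `list_text_files` is a hypothetical helper, not a LangChain function.

```python
from pathlib import Path

def list_text_files(data_path):
    """Return the sorted .txt files directly inside data_path (no subfolders)."""
    folder = Path(data_path)
    if not folder.is_dir():
        return []
    return sorted(p for p in folder.glob("*.txt") if p.is_file())

# Print each candidate file with its size so empty files stand out.
for path in list_text_files("./data/docs"):
    print(path.name, path.stat().st_size, "bytes")
```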


## Step 4: Splitting Documents into Chunks

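Long documents need to be split into smaller chunks so that each embedding covers a focused passage and the retrieved context fits in the prompt. We'll use `RecursiveCharacterTextSplitter`, which tries to break on paragraph, line, and word boundaries before falling back to individual characters. Here each chunk is at most 300 characters long, and consecutive chunks overlap by up to 100 characters so that text near a boundary keeps some of its surrounding context.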


In [34]:
def split_text(documents: list[Document]):
    print("Splitting documents into chunks...")
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=300,
        chunk_overlap=100,
        length_function=len,
        add_start_index=True,
    )
    chunks = text_splitter.split_documents(documents)
    print(f"Split {len(documents)} documents into {len(chunks)} chunks.")
    
    # Show an example chunk (the third one, or the last one if there are fewer)
    if chunks:
        document = chunks[min(2, len(chunks) - 1)]
        print(document.page_content)
        print(document.metadata)

    return chunks

Now let's split the documents into smaller chunks.

In [12]:
chunks = split_text(documents)

Splitting documents into chunks...
Split 1 documents into 25 chunks.
The company prides itself on using cutting-edge technologies to ensure that applications are not only functional but also secure and scalable.
Mobile App Development
Armada Logics provides comprehensive mobile app development services for both iOS and Android platforms. Their approach includes:
{'source': 'data\\docs\\armada_logics.txt', 'start_index': 1891}
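
To see how `chunk_size` and `chunk_overlap` interact, here is a deliberately simplified sketch that cuts fixed-size character windows. It is not how `RecursiveCharacterTextSplitter` works internally: the real splitter prefers natural separators, so its chunks vary in length. The function name `sliding_window_chunks` is our own.

```python
def sliding_window_chunks(text, chunk_size=300, chunk_overlap=100):
    """Cut text into fixed-size character windows that overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append({"start_index": start, "text": text[start:start + chunk_size]})
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks

# A 700-character sample yields windows starting at 0, 200, and 400.
sample = "".join(str(i % 10) for i in range(700))
for chunk in sliding_window_chunks(sample):
    print(chunk["start_index"], len(chunk["text"]))
```

Each window starts 200 characters after the previous one, so the last 100 characters of one chunk are repeated at the start of the next.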


## Step 5: Saving the Chunks to Chroma

Now that we've split the documents, we need to generate embeddings for the chunks and store them in the Chroma vector store for later retrieval.

In [13]:
def save_to_chroma(chunks: list[Document]):
    print("Saving chunks to Chroma...")

    # Clear out any existing database
    if os.path.exists(CHROMA_PATH):
        shutil.rmtree(CHROMA_PATH)

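    # Embed the chunks with OpenAI and store them in Chroma
    db = Chroma.from_documents(
        chunks, OpenAIEmbeddings(), persist_directory=CHROMA_PATH
    )
    # With chromadb 0.4 and later, data is persisted automatically and this
    # call only emits a deprecation warning; older versions still need it.
    db.persist()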
    print(f"Saved {len(chunks)} chunks to {CHROMA_PATH}.")

Now, let's store the processed chunks in Chroma.


In [14]:
save_to_chroma(chunks)

Saving chunks to Chroma...
Saved 25 chunks to ./data/chroma.
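
Conceptually, the vector store keeps one embedding vector per chunk and returns the chunks whose vectors are closest to the query's vector. The sketch below illustrates that idea with toy 3-dimensional vectors and cosine similarity. The vectors and chunk labels are made up for illustration, and Chroma itself uses an approximate nearest-neighbor index with a configurable distance metric rather than this brute-force loop.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def top_k(query_vector, store, k=3):
    """Return the k (text, score) pairs most similar to the query, best first."""
    scored = [(text, cosine_similarity(query_vector, vec)) for text, vec in store]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# Hypothetical store: (chunk text, embedding) pairs with made-up vectors.
toy_store = [
    ("chunk about mobile apps", [0.9, 0.1, 0.0]),
    ("chunk about company history", [0.1, 0.9, 0.0]),
    ("chunk about pricing", [0.0, 0.2, 0.9]),
]
for text, score in top_k([0.8, 0.2, 0.0], toy_store, k=2):
    print(f"{score:.3f}  {text}")
```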




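## Step 6: Creating the Prompt With or Without Context

The prompt includes a `<context>` block only when a context string is supplied; otherwise that section is left empty and the model answers from its own knowledge.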

In [20]:
from langchain.prompts import ChatPromptTemplate
PROMPT_TEMPLATE = """
Start with a greeting and provide a thoughtful and friendly response to the question. 
{context_section}

If the answer is unclear or not available, kindly indicate that you don't know. 
<question>
{question}
</question>
"""

def create_prompt(context, question):
    context_section = f"<context>\n{context}\n</context>" if context else ""
    prompt_template = ChatPromptTemplate.from_template(PROMPT_TEMPLATE)
    prompt_text = prompt_template.format(context_section=context_section, question=question)
    return prompt_text
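
The conditional context section is easiest to understand by comparing both branches side by side. The sketch below is a plain `str.format` stand-in with an abbreviated template, not the `ChatPromptTemplate` used above, so it runs without LangChain. The names `TEMPLATE` and `build_prompt_text` are hypothetical.

```python
# Abbreviated stand-in for PROMPT_TEMPLATE, for illustration only.
TEMPLATE = (
    "Start with a greeting and provide a thoughtful and friendly response to the question.\n"
    "{context_section}\n"
    "<question>\n{question}\n</question>\n"
)

def build_prompt_text(context, question):
    """Fill the template, adding a <context> block only when context is given."""
    context_section = f"<context>\n{context}\n</context>" if context else ""
    return TEMPLATE.format(context_section=context_section, question=question)

with_context = build_prompt_text("Armada Logics is a software company.", "Who is Armada Logics?")
without_context = build_prompt_text(None, "What is the meaning of the universe?")
print(with_context)
print(without_context)
```

The second prompt contains no `<context>` tags at all, which is what lets the same template serve both retrieval-backed and general questions.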

## Step 7: Querying the Database

Once the chunks are stored, we can query the database. The following function retrieves the 3 most relevant chunks for a query. If the best match has a relevance score of at least 0.7, all three chunks are passed to the OpenAI chat model as context; otherwise the model is asked to answer without context.


In [23]:
from langchain_openai import ChatOpenAI

def query_database(query_text):
    # Set up the database connection
    embedding_function = OpenAIEmbeddings()
    db = Chroma(persist_directory=CHROMA_PATH, embedding_function=embedding_function)

    # Perform similarity search; results is a list of (document, score) pairs
    results = db.similarity_search_with_relevance_scores(query_text, k=3)

    # Use the retrieved chunks only if the best match is relevant enough
    if results and results[0][1] >= 0.7:
        context_text = "\n\n---\n\n".join([doc.page_content for doc, _score in results])
    else:
        context_text = None

    # Create the prompt
    prompt = create_prompt(context_text, query_text)
    print(prompt)

    # Query the model; invoke() returns a message object, so printing it
    # shows the answer text together with its response metadata
    model = ChatOpenAI()
    response_text = model.invoke(prompt)

    # Display response and sources
    sources = [doc.metadata.get("source", None) for doc, _score in results]
    print(f"Response: {response_text}")
    print(f"Sources: {sources}")
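
The gating rule in the middle of `query_database` can be isolated and tried without any API calls. The sketch below mirrors that rule on plain `(text, score)` pairs; the scores are invented for illustration and `select_context` is a hypothetical helper name.

```python
def select_context(results, min_score=0.7, separator="\n\n---\n\n"):
    """Join all result texts if the best score clears the threshold, else None.

    results is a list of (text, score) pairs ordered best first.
    """
    if not results or results[0][1] < min_score:
        return None
    return separator.join(text for text, _score in results)

# Made-up scores: one on-topic query and one off-topic query.
relevant = [("A", 0.82), ("B", 0.75), ("C", 0.61)]
off_topic = [("A", 0.41), ("B", 0.40)]
print(select_context(relevant))
print(select_context(off_topic))
```

Note that only the top score is checked, so lower-scoring chunks such as `C` are still included once the best match passes. A stricter variant could filter each chunk against the threshold individually.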

Now, let's test it by asking a general question that the model can probably answer on its own, without any context from the knowledge base.


In [27]:
query_text = "What is the meaning of the universe?"
query_database(query_text)

Human: 
Start with a greeting and provide a thoughtful and friendly response to the question. 



If the answer is unclear or not available, kindly indicate that you don't know. 
<question>
What is the meaning of the universe?
</question>

Response: content="Hello! The meaning of the universe is a complex and philosophical question that has puzzled humans for centuries. It can be interpreted in various ways depending on one's beliefs, values, and perspective. Some see the universe as a result of random chance, while others believe it has a higher purpose or design. Ultimately, the meaning of the universe may be something we may never fully understand." response_metadata={'token_usage': {'completion_tokens': 76, 'prompt_tokens': 58, 'total_tokens': 134}, 'model_name': 'gpt-3.5-turbo', 'system_fingerprint': None, 'finish_reason': 'stop', 'logprobs': None} id='run-b03ce106-5463-494c-8974-2aa78f9f002b-0' usage_metadata={'input_tokens': 58, 'output_tokens': 76, 'total_tokens': 134}
Sources:

Next, let's ask a question that the model probably cannot answer on its own. With our knowledge base in place, the retrieved context should let it provide an answer.

In [28]:
query_text = "Who is Armada Logics?"
query_database(query_text)

Human: 
Start with a greeting and provide a thoughtful and friendly response to the question. 

<context>
applications for U.S.-based companies, Armada Logics is positioned to deliver high-quality software services that meet the rigorous standards of the tech industry.

---

Armada Logics is a premier offshore software development company based in the Philippines, founded in 2024 by experienced engineers from Village 88. The company aims to provide Silicon Valley-caliber software solutions, specializing in custom application development, mobile app development, and team

---

Armada Logics was established by engineers who previously worked at Village 88, a company known for incubating startups and providing software development services. Village 88 has a rich history dating back to 2011, where it began training entry-level developers to meet the demands of Silicon
</context>

If the answer is unclear or not available, kindly indicate that you don't know. 
<question>
Who is Armada Logic

## Conclusion

We've successfully:
1. Loaded text documents.
2. Split them into smaller chunks.
3. Generated embeddings using OpenAI.
4. Stored the chunks in Chroma for querying.
5. Queried the database for context-based answers.

This system can be extended to work with large text datasets, making it useful for various applications like chatbots and question-answering systems.
