### Building a RAG System with LangChain and ChromaDB

- LangChain: A framework for developing applications powered by language models
- ChromaDB: An open-source vector database for storing and retrieving embeddings
- OpenAI: For embeddings and the language model (you can substitute other providers)

In [1]:
import os
from dotenv import load_dotenv
load_dotenv()


True

In [None]:
### LangChain imports

from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import TextLoader
from langchain_openai import OpenAIEmbeddings
from langchain_core.documents import Document

### VECTOR STORE imports
from langchain_community.vectorstores import Chroma

### utils imports
import numpy as np
from typing import List



In [10]:
## 1. Create Sample Data

sample_data = ["""
Machine learning Fundamentals
Machine Learning is a subset of artificial intelligence (AI) that focuses on the development of algorithms and statistical models that enable computers to perform specific tasks without explicit instructions.
These algorithms learn from and make predictions or decisions based on data. 
Machine learning is widely used in various applications, including email filtering, fraud detection, recommendation systems, and image recognition.
""",
""" LangChain Overview
LangChain is a framework for developing applications powered by language models. 
It can be used for chatbots, Generative Question-Answering (GQA), summarization, and much more.
LangChain provides a standard interface for all LLMs, as well as a host of integrations with other tools and libraries.
LangChain is designed to help developers build applications that are robust, scalable, and maintainable.
LangChain is used by developers to create applications that can understand and generate natural language.
LangChain is an open-source project that is actively maintained and developed by a community of contributors.
LangChain is a framework for developing applications powered by language models.
It can be used for chatbots, Generative Question-Answering (GQA), summarization, and much more.
LangChain provides a standard interface for all LLMs, as well as a host of integrations with other tools and libraries.
""",
""" NLP Basics
Natural Language Processing (NLP) is a field of artificial intelligence that focuses on the interaction between computers and humans through natural language.
The ultimate objective of NLP is to enable computers to understand, interpret, and generate human language in a way that is valuable.
"""]

sample_data

['\nMachine learning Fundamentals\nMachine Learning is a subset of artificial intelligence (AI) that focuses on the development of algorithms and statistical models that enable computers to perform specific tasks without explicit instructions.\nThese algorithms learn from and make predictions or decisions based on data. \nMachine learning is widely used in various applications, including email filtering, fraud detection, recommendation systems, and image recognition.\n',
 ' LangChain Overview\nLangChain is a framework for developing applications powered by language models. \nIt can be used for chatbots, Generative Question-Answering (GQA), summarization, and much more.\nLangChain provides a standard interface for all LLMs, as well as a host of integrations with other tools and libraries.\nLangChain is designed to help developers build applications that are robust, scalable, and maintainable.\nLangChain is used by developers to create applications that can understand and generate natura

In [11]:
### Save sample data to text files in the "data/" directory that the loader reads from
import os

data_dir = "data"
os.makedirs(data_dir, exist_ok=True)

for i, data in enumerate(sample_data):
    # One file per sample document, written as UTF-8 to match the loader's encoding
    with open(os.path.join(data_dir, f"doc_{i}.txt"), "w", encoding="utf8") as f:
        f.write(data)

print(f"Sample data files created in directory: {data_dir}")

Sample data files created in directory: data


### 2. Document Loading

In [15]:
from langchain_community.document_loaders import DirectoryLoader

loader = DirectoryLoader(
    "data/", 
    glob="*.txt", 
    loader_cls=TextLoader,
    loader_kwargs={"encoding": "utf8"},
    show_progress=True)

documents = loader.load()


print(f"Number of documents loaded: {len(documents)}")  # Reuse the loaded list instead of loading twice
print("documents[0]: ", documents[0].page_content[:500])  # Print first 500 characters of the first document


100%|██████████| 3/3 [00:00<00:00, 3429.52it/s]

Number of documents loaded: 3
documents[0]:  
Machine learning Fundamentals
Machine Learning is a subset of artificial intelligence (AI) that focuses on the development of algorithms and statistical models that enable computers to perform specific tasks without explicit instructions.
These algorithms learn from and make predictions or decisions based on data. 
Machine learning is widely used in various applications, including email filtering, fraud detection, recommendation systems, and image recognition.






### 3. Document Splitting

In [17]:
# Initialize the text splitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50,
    separators=["\n\n", "\n", " ", ""],
    length_function=len
)

chunks = text_splitter.split_documents(documents)


print(f"Number of chunks created: {len(chunks)} from {len(documents)} documents")
print("chunks[0]: ", chunks[0].page_content[:150])  # Print first 150 characters of the first chunk
print("chunks[0] metadata: ", chunks[0].metadata)  # Print metadata of the first chunk

Number of chunks created: 5 from 3 documents
chunks[0]:  Machine learning Fundamentals
Machine Learning is a subset of artificial intelligence (AI) that focuses on the development of algorithms and statistic
chunks[0] metadata:  {'source': 'data\\doc_0.txt'}
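
To make the effect of `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of fixed-size chunking with overlap in plain Python. It is a simplification and not what `RecursiveCharacterTextSplitter` does internally: the real splitter first tries to break on the listed separators, whereas the hypothetical `sliding_chunks` helper below cuts purely by character count.

```python
from typing import List


def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Cut text into windows of chunk_size characters; neighbours share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
    return chunks


demo_text = "abcdefghijklmnopqrstuvwxyz"
demo_chunks = sliding_chunks(demo_text, chunk_size=10, chunk_overlap=3)
print(demo_chunks)
# ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

Each chunk starts with the last three characters of the previous one, which is the role `chunk_overlap=50` plays in the cell above: context at a chunk boundary is not lost.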


### 4. Embedding Models

In [26]:

from langchain_huggingface import HuggingFaceEmbeddings

## Initialize HuggingFace embeddings (no API key required)
## The pretrained model is downloaded on first use; no training step is needed
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
embeddings

sample_text = "The quick brown fox jumps over the lazy dog."
vector = embeddings.embed_query(sample_text)
vector

[0.04393354803323746,
 0.05893440172076225,
 0.04817838966846466,
 0.077548086643219,
 0.026744406670331955,
 -0.03762960433959961,
 -0.0026051681488752365,
 -0.05994309112429619,
 -0.002496002707630396,
 0.022072812542319298,
 0.048025935888290405,
 0.055755287408828735,
 -0.03894539922475815,
 -0.026616835966706276,
 0.007693450897932053,
 -0.02623767964541912,
 -0.036416083574295044,
 -0.03781614452600479,
 0.07407816499471664,
 -0.04950502887368202,
 -0.05852172151207924,
 -0.0636197030544281,
 0.032434966415166855,
 0.02200855128467083,
 -0.07106376439332962,
 -0.03315775468945503,
 -0.06941038370132446,
 -0.05003742873668671,
 0.07462680339813232,
 -0.11113376915454865,
 -0.012306339107453823,
 0.03774561733007431,
 -0.0280313640832901,
 0.014535374008119106,
 -0.031558506190776825,
 -0.08058365434408188,
 0.058352645486593246,
 0.0025900902692228556,
 0.03928019478917122,
 0.025769619271159172,
 0.049850597977638245,
 -0.001756205689162016,
 -0.04552977532148361,
 0.029260775074

### Initialize the Chroma Vector Store and Store the Chunks as Vector Representations

In [30]:
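### Create a ChromaDB vector store
### Note: re-running this cell with the same persist_directory and collection_name
### appends the chunks again, which likely explains a count of 10 for 5 chunks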
persist_directory = "./chroma_db"

## Initialize ChromaDB vector store
vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    persist_directory=persist_directory,
    collection_name="rag_collection"
)
print("ChromaDB vector store created and persisted at:", persist_directory)
print("Number of documents in vector store:", vectorstore._collection.count())

ChromaDB vector store created and persisted at: ./chroma_db
Number of documents in vector store: 10
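
One way to keep the collection from growing on every rerun is to give each chunk a reproducible ID. The sketch below is an assumption-laden illustration: `stable_chunk_ids` is a hypothetical helper, and the choice of hashing source path plus content is only one possible scheme.

```python
import hashlib
from typing import List


def stable_chunk_ids(texts: List[str], sources: List[str]) -> List[str]:
    """Derive one reproducible ID per chunk from its source path and its content."""
    ids = []
    for text, source in zip(texts, sources):
        digest = hashlib.sha256(f"{source}\n{text}".encode("utf-8")).hexdigest()
        ids.append(digest[:32])  # shortened for readability; still stable across runs
    return ids


texts = ["first chunk", "second chunk"]
sources = ["data/doc_0.txt", "data/doc_0.txt"]
ids_run_1 = stable_chunk_ids(texts, sources)
ids_run_2 = stable_chunk_ids(texts, sources)
print(ids_run_1 == ids_run_2)  # the same input always yields the same IDs
```

These IDs could then be passed through the `ids` argument of `Chroma.from_documents`, built from `chunk.page_content` and `chunk.metadata["source"]`, so that a rerun targets the same entries instead of adding new ones. How an existing ID is handled may depend on the installed versions, so verify this with `vectorstore._collection.count()` after a second run. Note that two identical chunks from the same source would collide under this scheme.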


### Test Similarity Search

In [49]:
sample_query = "What is Machine learning?"
docs = vectorstore.similarity_search(sample_query, k=3)
#print(f"Top 3 documents similar to the query: '{sample_query}'")
docs

[Document(metadata={'source': 'data\\doc_0.txt'}, page_content='Machine learning Fundamentals\nMachine Learning is a subset of artificial intelligence (AI) that focuses on the development of algorithms and statistical models that enable computers to perform specific tasks without explicit instructions.\nThese algorithms learn from and make predictions or decisions based on data. \nMachine learning is widely used in various applications, including email filtering, fraud detection, recommendation systems, and image recognition.'),
 Document(metadata={'source': 'data\\doc_0.txt'}, page_content='Machine learning Fundamentals\nMachine Learning is a subset of artificial intelligence (AI) that focuses on the development of algorithms and statistical models that enable computers to perform specific tasks without explicit instructions.\nThese algorithms learn from and make predictions or decisions based on data. \nMachine learning is widely used in various applications, including email filterin

In [50]:
print("Top 3 documents similar to the query:", sample_query)
print(f"\n Top {len(docs)} similar chunks:")
for i, doc in enumerate(docs):
    print(f"\n-----Chunk {i+1}-------")
    print(doc.page_content[:500] + "----")
    print(f"Source Metadata: {doc.metadata.get('source','Unknown')}")

Top 3 documents similar to the query: What is Machine learning?

 Top 3 similar chunks:

-----Chunk 1-------
Machine learning Fundamentals
Machine Learning is a subset of artificial intelligence (AI) that focuses on the development of algorithms and statistical models that enable computers to perform specific tasks without explicit instructions.
These algorithms learn from and make predictions or decisions based on data. 
Machine learning is widely used in various applications, including email filtering, fraud detection, recommendation systems, and image recognition.----
Source Metadata: data\doc_0.txt

-----Chunk 2-------
Machine learning Fundamentals
Machine Learning is a subset of artificial intelligence (AI) that focuses on the development of algorithms and statistical models that enable computers to perform specific tasks without explicit instructions.
These algorithms learn from and make predictions or decisions based on data. 
Machine learning is widely used in various applicati

### Advanced Similarity Search with Scores

In [51]:
result_scores=vectorstore.similarity_search_with_score(sample_query, k=3)
result_scores

[(Document(metadata={'source': 'data\\doc_0.txt'}, page_content='Machine learning Fundamentals\nMachine Learning is a subset of artificial intelligence (AI) that focuses on the development of algorithms and statistical models that enable computers to perform specific tasks without explicit instructions.\nThese algorithms learn from and make predictions or decisions based on data. \nMachine learning is widely used in various applications, including email filtering, fraud detection, recommendation systems, and image recognition.'),
  0.4467886686325073),
 (Document(metadata={'source': 'data\\doc_0.txt'}, page_content='Machine learning Fundamentals\nMachine Learning is a subset of artificial intelligence (AI) that focuses on the development of algorithms and statistical models that enable computers to perform specific tasks without explicit instructions.\nThese algorithms learn from and make predictions or decisions based on data. \nMachine learning is widely used in various applications,

### Understanding Similarity Scores
ChromaDB default: L2 distance (Euclidean distance)

- Lower score = more similar (closer in vector space)
- Score of 0 = identical vectors
- Typical range: 0 to 2 (but can be higher)

Cosine similarity (if configured)

- Higher score = more similar
- Range: -1 to 1 (1 being identical)
- Note: when a Chroma collection is configured for cosine, the score it returns is a cosine distance (1 - cosine similarity), so lower still means more similar
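
The two measures can be compared on toy vectors. The following is a minimal sketch in plain Python; the helper names and the 2-dimensional example vectors are illustrative assumptions, and it computes plain Euclidean distance rather than reproducing Chroma's internal scoring.

```python
import math
from typing import List


def l2_distance(a: List[float], b: List[float]) -> float:
    """Euclidean distance: 0 for identical vectors, larger means less similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors: 1 for same direction, -1 for opposite."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


same = [1.0, 0.0]
orthogonal = [0.0, 1.0]
opposite = [-1.0, 0.0]

print(l2_distance(same, same))            # 0.0  (identical vectors)
print(l2_distance(same, opposite))        # 2.0  (farthest apart for unit vectors)
print(cosine_similarity(same, same))      # 1.0
print(cosine_similarity(same, orthogonal))  # 0.0
print(cosine_similarity(same, opposite))  # -1.0
```

For unit-length vectors the two measures rank neighbours identically; they differ only in scale and in which direction counts as "better".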

### Initialize the LLM, Prompt Template, and RAG Chain, Then Query the RAG System

In [None]:
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    model_name="gpt-3.5-turbo",
    # temperature=0.2,  # Low temperature for more focused responses
    # max_tokens=500,   # Limit the response length
)



In [None]:
test_response = llm.invoke("What are large language models?")
test_response

In [None]:
from langchain.chat_models import init_chat_model

llm = init_chat_model("openai:gpt-3.5-turbo")

# llm = init_chat_model("groq:")

In [None]:
llm.invoke("What are large language models?")

### Modern RAG Chain: traditional RAG

In [None]:
### Convert the vector store into a retriever so it can be plugged into the retrieval chain
retriever = vectorstore.as_retriever(
    #search_type="similarity", 
    search_kwargs={"k":3} # Number of similar documents to retrieve
    )
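
Conceptually, a similarity retriever with `k=3` embeds the query, scores every stored chunk against it, and returns the `k` closest. The sketch below mimics that with a brute-force scan over hand-written 2-dimensional vectors; `toy_store`, `top_k`, and the vectors are hypothetical, and a real vector store uses an approximate index rather than a full scan.

```python
import math
from typing import List, Tuple


def l2(a: List[float], b: List[float]) -> float:
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def top_k(query_vec: List[float],
          store: List[Tuple[str, List[float]]],
          k: int) -> List[Tuple[float, str]]:
    """Score every stored chunk against the query and keep the k closest."""
    scored = [(l2(query_vec, vec), text) for text, vec in store]
    scored.sort(key=lambda pair: pair[0])  # smallest distance first
    return scored[:k]


# Toy "embeddings": hand-picked values, not produced by a real model
toy_store = [
    ("ml chunk", [0.9, 0.1]),
    ("nlp chunk", [0.5, 0.5]),
    ("langchain chunk", [0.1, 0.9]),
]
query_vec = [1.0, 0.0]

results = top_k(query_vec, toy_store, k=2)
for distance, text in results:
    print(f"{distance:.3f}  {text}")
```

Raising `k` gives the LLM more context at the cost of a longer prompt; with a collection as small as this one, `k=3` already returns a large share of the chunks.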

In [None]:
### Create a Prompt Template
from langchain_core.prompts import ChatPromptTemplate

system_prompt = """You are a knowledgeable assistant. Use the following pieces of context to answer the user's question.
If you don't know the answer, say that you don't know.
Context: {context}"""


prompt_template = ChatPromptTemplate.from_messages([
    ("system", system_prompt),
    ("human", "{input}")
    ])


In [None]:
## Create a documents chain: it combines the retrieved documents into a single context for the LLM.
from langchain.chains.combine_documents import create_stuff_documents_chain
document_chain = create_stuff_documents_chain(  # "stuff" method: all retrieved documents go into one prompt
    llm=llm,
    prompt=prompt_template
)
document_chain
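
The "stuff" strategy can be pictured as plain string formatting: the retrieved chunks are joined and placed into the `{context}` slot of the prompt. The sketch below illustrates the idea only and is not the library's implementation; `stuff_documents`, `PROMPT`, and the blank-line separator are assumptions made for the example.

```python
from typing import List

# Hypothetical flat prompt; the notebook uses a system/human chat template instead
PROMPT = (
    "Use the following context to answer the question.\n"
    "Context: {context}\n"
    "Question: {input}"
)


def stuff_documents(page_contents: List[str], question: str, separator: str = "\n\n") -> str:
    """Join all retrieved chunks into one context string and fill the prompt with it."""
    context = separator.join(page_contents)
    return PROMPT.format(context=context, input=question)


prompt_text = stuff_documents(
    ["Chunk A text.", "Chunk B text."],
    "What is machine learning?",
)
print(prompt_text)
```

Because every chunk ends up in a single prompt, this approach works best when `k` and `chunk_size` are small enough for the combined context to fit the model's context window.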


This chain combines the retriever and the document chain to form a complete retrieval-augmented generation pipeline.

In [None]:
### Create the RAG Chain -- Final RAG Chain
from langchain.chains import create_retrieval_chain
rag_chain_response = create_retrieval_chain(retriever, document_chain)
rag_chain_response



In [None]:
response = rag_chain_response.invoke({"input":"What is machine learning?"})
response

In [None]:
response["answer"]