### Building a RAG System with LangChain and ChromaDB


In [7]:
## create sample documents
sample_docs = [
    """
    Machine Learning Fundamentals
    
    Machine learning is a subset of artificial intelligence that enables systems to learn 
    and improve from experience without being explicitly programmed. There are three main 
    types of machine learning: supervised learning, unsupervised learning, and reinforcement 
    learning. Supervised learning uses labeled data to train models, while unsupervised 
    learning finds patterns in unlabeled data. Reinforcement learning learns through 
    interaction with an environment using rewards and penalties.
    """,
    
    """
    Deep Learning and Neural Networks
    
    Deep learning is a subset of machine learning based on artificial neural networks. 
    These networks are inspired by the human brain and consist of layers of interconnected 
    nodes. Deep learning has revolutionized fields like computer vision, natural language 
    processing, and speech recognition. Convolutional Neural Networks (CNNs) are particularly 
    effective for image processing, while Recurrent Neural Networks (RNNs) and Transformers 
    excel at sequential data processing.
    """,
    
    """
    Natural Language Processing (NLP)
    
    NLP is a field of AI that focuses on the interaction between computers and human language. 
    Key tasks in NLP include text classification, named entity recognition, sentiment analysis, 
    machine translation, and question answering. Modern NLP heavily relies on transformer 
    architectures like BERT, GPT, and T5. These models use attention mechanisms to understand 
    context and relationships between words in text.
    """
]

sample_docs


['\n    Machine Learning Fundamentals\n\n    Machine learning is a subset of artificial intelligence that enables systems to learn \n    and improve from experience without being explicitly programmed. There are three main \n    types of machine learning: supervised learning, unsupervised learning, and reinforcement \n    learning. Supervised learning uses labeled data to train models, while unsupervised \n    learning finds patterns in unlabeled data. Reinforcement learning learns through \n    interaction with an environment using rewards and penalties.\n    ',
 '\n    Deep Learning and Neural Networks\n\n    Deep learning is a subset of machine learning based on artificial neural networks. \n    These networks are inspired by the human brain and consist of layers of interconnected \n    nodes. Deep learning has revolutionized fields like computer vision, natural language \n    processing, and speech recognition. Convolutional Neural Networks (CNNs) are particularly \n    effective f

In [8]:
import os
# The OpenAI clients read OPENAI_API_KEY from the environment; fail early if it is missing.
assert os.getenv("OPENAI_API_KEY"), "OPENAI_API_KEY is not set"

In [9]:
import tempfile
temp_dir=tempfile.mkdtemp()

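for i,doc in enumerate(sample_docs):
    # Write with the same encoding the loader reads with (utf-8)
    with open(f"{temp_dir}/doc_{i}.txt","w",encoding="utf-8") as f:
        f.write(doc)

print(f"Sample documents created in {temp_dir}")

Sample documents created in C:\Users\ASUS\AppData\Local\Temp\tmp3anh4hli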


## Document Loading

In [10]:
# Directory loader
from langchain_community.document_loaders import DirectoryLoader,TextLoader
loader=DirectoryLoader(
    temp_dir,
    glob="*.txt",
    loader_cls=TextLoader,
    show_progress=True,
    loader_kwargs={"encoding":"utf-8"}
)
docs=loader.load()

print(f"Number of documents: {len(docs)}")
print(f"First document content is: {docs[0].page_content[:100]}")
print(f"metadata: {docs[0].metadata}")

100%|██████████| 3/3 [00:00<00:00, 952.46it/s]

Number of documents: 3
First document content is: 
    Machine Learning Fundamentals

    Machine learning is a subset of artificial intelligence that
metadata: {'source': 'C:\\Users\\ASUS\\AppData\\Local\\Temp\\tmp3anh4hli\\doc_0.txt'}
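
To make the loading step concrete, here is a minimal standard-library sketch of what a directory loader conceptually does: find files matching a glob pattern, read each one, and attach the file path as `source` metadata. The function name `load_text_files` and the plain-dict record shape are assumptions for illustration; this is not how `DirectoryLoader` is implemented internally.

```python
import tempfile
from pathlib import Path


def load_text_files(directory, pattern="*.txt", encoding="utf-8"):
    """Read every file matching `pattern` into a dict with content and source metadata."""
    records = []
    # sorted() makes the order deterministic; a real loader may not guarantee this
    for path in sorted(Path(directory).glob(pattern)):
        records.append(
            {
                "page_content": path.read_text(encoding=encoding),
                "metadata": {"source": str(path)},
            }
        )
    return records


# Hypothetical demo data, written to a throwaway directory
demo_dir = tempfile.mkdtemp()
for i, text in enumerate(["first document", "second document"]):
    Path(demo_dir, f"doc_{i}.txt").write_text(text, encoding="utf-8")

loaded = load_text_files(demo_dir)
print(len(loaded))
print(loaded[0]["metadata"]["source"])
```

The record shape mirrors the `page_content` and `metadata` fields printed in the cell output above.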





## Document Splitting

In [11]:
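# Text splitter
from langchain.text_splitter import RecursiveCharacterTextSplitter
splitter=RecursiveCharacterTextSplitter(
    separators=[" "],     # split on spaces only; omit to use the default separator hierarchy
    chunk_size=500,       # maximum chunk length, measured by length_function
    chunk_overlap=50,     # up to 50 characters shared between consecutive chunks
    length_function=len   # measure length in characters
)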

chunks=splitter.split_documents(docs)

print(f"Number of chunks: {len(chunks)}")
print(f"First chunk is {chunks[0].page_content[:200]}")


Number of chunks: 5
First chunk is Machine Learning Fundamentals

    Machine learning is a subset of artificial intelligence that enables systems to learn 
    and improve from experience without being explicitly programmed. There are
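
The roles of `chunk_size` and `chunk_overlap` can be illustrated with a deliberately simplified fixed-width sliding window. This is only a sketch: `RecursiveCharacterTextSplitter` splits on separators and merges the pieces, so its chunk boundaries fall on separators rather than at fixed offsets. The function name `sliding_window_chunks` is hypothetical.

```python
def sliding_window_chunks(text, chunk_size, chunk_overlap):
    """Cut `text` into windows of `chunk_size` characters that overlap by `chunk_overlap`."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


pieces = sliding_window_chunks("abcdefghijklmnopqrstuvwxyz", chunk_size=10, chunk_overlap=3)
for piece in pieces:
    print(piece)
```

Each window starts `chunk_size - chunk_overlap` characters after the previous one, so the last three characters of one piece reappear at the start of the next.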


### Embedding Model

In [12]:
from langchain_openai import OpenAIEmbeddings

embeddings=OpenAIEmbeddings(
    model="text-embedding-3-small"
)
query="what is ai"
# embed_query returns one vector (a list of floats) for the query text
query_embedding=embeddings.embed_query(query)
print(f"Embedding dimension: {len(query_embedding)}")

Embedding dimension: 1536


In [13]:
query_embedding

[0.004593748832230253,
 0.0057289653349343085,
 0.028909240316080432,
 0.0005676082513520279,
 0.002152328270414076,
 -0.06391057793376702,
 -0.0015961779637017547,
 0.001019755792365406,
 -0.06898732133893369,
 0.012769423709614038,
 0.017683994991756075,
 -0.0647567030763915,
 -0.02111079623145006,
 -0.06069530239179399,
 -0.017091707391918936,
 0.0112534504270677,
 -0.013058515535354875,
 -0.04162930858835309,
 0.019474956151687785,
 -0.0412908585313033,
 -0.012113676724080064,
 0.05649288830067262,
 0.009504794592944687,
 0.00989260101865838,
 0.005404617829256188,
 -0.04425229280519888,
 0.046847072850623846,
 -0.00335452968024257,
 0.0038569157851839606,
 -0.001510684069082406,
 0.03993705830310413,
 -0.03206810565256678,
 -0.008898405838719668,
 -0.0328296145556387,
 0.016569930220633838,
 0.008017026413141694,
 0.004942774987901588,
 0.010026570832907257,
 0.011429726498447802,
 -0.037455094943362424,
 -0.010936154429906048,
 0.007177952779033679,
 0.025129886933626243,
 0.0448

### Initialize the ChromaDB vector store and store the chunk embeddings

In [14]:
from langchain_community.vectorstores import Chroma

persist_dir="./chroma_db"
vectorstore=Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    persist_directory=persist_dir,
    collection_name="rag_collection"
)
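
The `similarity_search` output in this notebook returns the same chunk twice, with `source` paths from two different temp directories, which suggests the persisted collection in `./chroma_db` accumulated chunks across runs. One possible mitigation is to derive a deterministic ID from each chunk's text. The sketch below uses only `hashlib`; the helper names `stable_chunk_id` and `dedupe_by_id` are hypothetical, and the ID deliberately ignores the `source` path because the temp directory changes on every run.

```python
import hashlib


def stable_chunk_id(text):
    """Derive a deterministic ID from the chunk text alone."""
    # Whitespace is normalized so that spacing differences do not change the ID
    normalized = " ".join(text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def dedupe_by_id(texts):
    """Keep the first occurrence of each distinct chunk, preserving order."""
    unique = {}
    for text in texts:
        unique.setdefault(stable_chunk_id(text), text)
    return unique


# Hypothetical chunk texts; the second repeats the first with different spacing
chunk_texts = [
    "Machine learning is a subset of AI.",
    "Machine  learning is a subset of AI.",
    "Deep learning is based on neural networks.",
]
unique_chunks = dedupe_by_id(chunk_texts)
print(len(unique_chunks))
```

If the installed version of the vector store accepts an `ids` argument in `from_documents` and overwrites entries with matching IDs (an assumption worth checking against its documentation), passing these IDs should keep re-runs from adding duplicates. Deleting the `./chroma_db` directory before re-running is a simpler alternative.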



In [15]:
query="What are the types of machine learning?"
similar_docs=vectorstore.similarity_search(query,k=2)
similar_docs

[Document(metadata={'source': 'C:\\Users\\ASUS\\AppData\\Local\\Temp\\tmpo5t6c_6v\\doc_0.txt'}, page_content='Machine Learning Fundamentals\n\n    Machine learning is a subset of artificial intelligence that enables systems to learn \n    and improve from experience without being explicitly programmed. There are three main \n    types of machine learning: supervised learning, unsupervised learning, and reinforcement \n    learning. Supervised learning uses labeled data to train models, while unsupervised \n    learning finds patterns in unlabeled data. Reinforcement learning learns through'),
 Document(metadata={'source': 'C:\\Users\\ASUS\\AppData\\Local\\Temp\\tmp3anh4hli\\doc_0.txt'}, page_content='Machine Learning Fundamentals\n\n    Machine learning is a subset of artificial intelligence that enables systems to learn \n    and improve from experience without being explicitly programmed. There are three main \n    types of machine learning: supervised learning, unsupervised learning
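
Conceptually, similarity search embeds the query and ranks stored vectors by closeness to it. The sketch below uses cosine similarity on hand-made three-dimensional vectors; the vectors, names, and the `top_k` helper are all hypothetical, and the distance metric the vector store actually uses is configurable and may not be cosine.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=2):
    """Return the names of the k vectors most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), name) for name, vec in doc_vecs.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [name for _, name in scored[:k]]


# Toy stand-ins for real 1536-dimensional embeddings
toy_vectors = {
    "ml_basics": [0.9, 0.1, 0.0],
    "deep_learning": [0.6, 0.8, 0.0],
    "nlp": [0.0, 0.2, 0.9],
}
toy_query = [1.0, 0.0, 0.0]
ranked = top_k(toy_query, toy_vectors, k=2)
print(ranked)
```

A real vector store typically uses an approximate nearest-neighbor index instead of scoring every vector, but the ranking idea is the same.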

#### Initialize the LLM, prompt template, and RAG chain, then query the RAG system

In [16]:
from langchain_openai import ChatOpenAI

llm=ChatOpenAI(
    model="gpt-3.5-turbo"
)
llm

ChatOpenAI(client=<openai.resources.chat.completions.completions.Completions object at 0x0000016D0C3B1310>, async_client=<openai.resources.chat.completions.completions.AsyncCompletions object at 0x0000016D0C3B0A50>, root_client=<openai.OpenAI object at 0x0000016D0C3A6B10>, root_async_client=<openai.AsyncOpenAI object at 0x0000016D0C3A6520>, model_kwargs={}, openai_api_key=SecretStr('**********'), stream_usage=True)

In [17]:
query="What is an ML model?"
result=llm.invoke(query)
result.content

'A machine learning model is a mathematical representation of a real-world process or system that is created by training a machine learning algorithm on a dataset. The model is then used to make predictions or decisions based on new data. The goal of a machine learning model is to generalize well to new, unseen data and accurately predict outcomes or classify data points. Models can be used for a wide variety of tasks such as regression, classification, clustering, and more.'

In [18]:
# Alternative way to initialize the LLM
from langchain.chat_models import init_chat_model

llm=init_chat_model(
    "openai:gpt-3.5-turbo"
)


llm.invoke(query)

AIMessage(content='A machine learning model is a mathematical representation of data that is created using algorithms to identify patterns and make predictions or decisions without being explicitly programmed. It is used to analyze complex data and make predictions based on that data. Models can be used in various machine learning tasks such as classification, regression, clustering, and more.', additional_kwargs={'refusal': None}, response_metadata={'token_usage': {'completion_tokens': 63, 'prompt_tokens': 11, 'total_tokens': 74, 'completion_tokens_details': {'accepted_prediction_tokens': 0, 'audio_tokens': 0, 'reasoning_tokens': 0, 'rejected_prediction_tokens': 0}, 'prompt_tokens_details': {'audio_tokens': 0, 'cached_tokens': 0}}, 'model_name': 'gpt-3.5-turbo-0125', 'system_fingerprint': None, 'id': 'chatcmpl-CUcMZ56CuSLm8n1DVMNjGw04HhYwv', 'service_tier': 'default', 'finish_reason': 'stop', 'logprobs': None}, id='run--cfea39b2-180c-496e-87ff-e301e689417d-0', usage_metadata={'input_t

## Modern RAG Chain

In [19]:
from langchain.chains import create_retrieval_chain
from langchain.prompts import ChatPromptTemplate
from langchain.chains.combine_documents import create_stuff_documents_chain

In [20]:
# convert vectorstore to retriever

retriever=vectorstore.as_retriever(
    search_kwargs={"k":3}
)

In [21]:
# Create prompt
system_prompt="""You are an assistant for question-answering tasks. 
Use the following pieces of retrieved context to answer the question. 
If you don't know the answer, just say that you don't know. 
Use three sentences maximum and keep the answer concise.

Context: {context}"""

prompt=ChatPromptTemplate.from_messages(
    [
        ("system",system_prompt),
        ("human","{input}")
    ]
)

prompt

ChatPromptTemplate(input_variables=['context', 'input'], input_types={}, partial_variables={}, messages=[SystemMessagePromptTemplate(prompt=PromptTemplate(input_variables=['context'], input_types={}, partial_variables={}, template="You are an assistant for question-answering tasks. \nUse the following pieces of retrieved context to answer the question. \nIf you don't know the answer, just say that you don't know. \nUse three sentences maximum and keep the answer concise.\n\nContext: {context}"), additional_kwargs={}), HumanMessagePromptTemplate(prompt=PromptTemplate(input_variables=['input'], input_types={}, partial_variables={}, template='{input}'), additional_kwargs={})])
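
To show how the two placeholders get filled, the following sketch mimics the template with plain `str.format`. It assumes the retrieved chunks are joined with a blank line before being substituted into `{context}`; the shortened system text and the `render_messages` helper are illustrative, not the library's code.

```python
SYSTEM_TEMPLATE = (
    "You are an assistant for question-answering tasks.\n"
    "Use the following pieces of retrieved context to answer the question.\n\n"
    "Context: {context}"
)
HUMAN_TEMPLATE = "{input}"


def render_messages(context_chunks, question):
    """Fill {context} with the joined chunks and {input} with the user question."""
    context = "\n\n".join(context_chunks)  # assumed separator between chunks
    return [
        ("system", SYSTEM_TEMPLATE.format(context=context)),
        ("human", HUMAN_TEMPLATE.format(input=question)),
    ]


# Hypothetical retrieved chunks and question
messages = render_messages(
    ["Supervised learning uses labeled data.", "Unsupervised learning finds patterns."],
    "What are the types of machine learning?",
)
for role, content in messages:
    print(f"[{role}] {content}")
```

The model therefore sees the retrieved text inside the system message and the user's question as a separate human message.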

In [22]:
# Create document chain
document_chain=create_stuff_documents_chain(
    llm,prompt
)
document_chain

RunnableBinding(bound=RunnableBinding(bound=RunnableAssign(mapper={
  context: RunnableLambda(format_docs)
}), kwargs={}, config={'run_name': 'format_inputs'}, config_factories=[])
| ChatPromptTemplate(input_variables=['context', 'input'], input_types={}, partial_variables={}, messages=[SystemMessagePromptTemplate(prompt=PromptTemplate(input_variables=['context'], input_types={}, partial_variables={}, template="You are an assistant for question-answering tasks. \nUse the following pieces of retrieved context to answer the question. \nIf you don't know the answer, just say that you don't know. \nUse three sentences maximum and keep the answer concise.\n\nContext: {context}"), additional_kwargs={}), HumanMessagePromptTemplate(prompt=PromptTemplate(input_variables=['input'], input_types={}, partial_variables={}, template='{input}'), additional_kwargs={})])
| ChatOpenAI(client=<openai.resources.chat.completions.completions.Completions object at 0x0000016D0D570050>, async_client=<openai.res

In [23]:
# Final RAG Chain
rag_chain=create_retrieval_chain(retriever, document_chain)

result=rag_chain.invoke({"input":"What is an RL model?"})

In [26]:
result['answer']

'A reinforcement learning (RL) model is a type of machine learning model that learns through interaction with an environment by receiving rewards or penalties. RL models aim to maximize the total reward received over time by taking actions in the environment based on learned policies. These models are commonly used in situations where the model can learn from trial and error.'
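
The data flow of the retrieval chain can be sketched without any external service: retrieve context for the question, hand question and context to a generator, and return all three under the keys `input`, `context`, and `answer` (the same keys read from `result` in this notebook). The keyword retriever and canned generator below are hypothetical stand-ins for the vector store retriever and the LLM.

```python
def make_rag_chain(retrieve, generate):
    """Compose a retriever and a generator into a single callable."""
    def invoke(inputs):
        question = inputs["input"]
        context = retrieve(question)            # step 1: fetch relevant chunks
        answer = generate(question, context)    # step 2: answer using those chunks
        return {"input": question, "context": context, "answer": answer}
    return invoke


# Hypothetical two-chunk corpus
corpus = [
    "Reinforcement learning learns through rewards and penalties.",
    "CNNs are effective for image processing.",
]


def keyword_retriever(question):
    """Return chunks that share at least one lowercase word with the question."""
    words = set(question.lower().split())
    return [doc for doc in corpus if words & set(doc.lower().split())]


def stub_llm(question, context):
    """Stand-in for an LLM call: echo the first chunk, or admit ignorance."""
    if not context:
        return "I don't know."
    return f"Based on {len(context)} chunk(s): {context[0]}"


toy_chain = make_rag_chain(keyword_retriever, stub_llm)
out = toy_chain({"input": "what is reinforcement learning"})
print(out["answer"])
```

Keeping the retrieved chunks in the returned dictionary is what makes it possible to print the sources next to the answer.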

In [33]:
# Function to query the modern RAG system
def query_rag_modern(question):
    print(f"Question: {question}")
    print("-"*50)
    result=rag_chain.invoke({"input":question})

    print(f"Answer: {result['answer']}")
    print("\nRetrieved Context:")
    for i, doc in enumerate(result['context']):
        print(f"\n--- Source {i+1} ---")
        print(doc.page_content[:200] + "...")

    # Return the full result so callers can inspect the answer and sources
    return result

In [34]:
# Test queries
test_questions = [
    "What are the three types of machine learning?",
    "What is deep learning and how does it relate to neural networks?",
    "What are CNNs best used for?"
]

for question in test_questions:
    result = query_rag_modern(question)
    print("\n"+"="*80 + "\n")

Question: What are the three types of machine learning?
--------------------------------------------------
Answer: The three main types of machine learning are supervised learning, unsupervised learning, and reinforcement learning. Supervised learning uses labeled data, unsupervised learning finds patterns in unlabeled data, and reinforcement learning learns through a system of rewards and punishments.

Retrieved Context:

--- Source 1 ---
Machine Learning Fundamentals

    Machine learning is a subset of artificial intelligence that enables systems to learn 
    and improve from experience without being explicitly programmed. There are...

--- Source 2 ---
Machine Learning Fundamentals

    Machine learning is a subset of artificial intelligence that enables systems to learn 
    and improve from experience without being explicitly programmed. There are...

--- Source 3 ---
Deep Learning and Neural Networks

    Deep learning is a subset of machine learning based on artificial neural 