# Naive RAG with Milvus and LangChain

This notebook implements a naive RAG pipeline with Milvus, LangChain, and Hugging Face. It is meant as a starting point for your own code.


### Load (quantized) Phi-4 for Apple Silicon hardware

The default `transformers` implementation is too slow on my MacBook (even when it is set to use the `mps` device), so I use the `mlx-lm` library instead. On `cuda` platforms, I recommend `unsloth`.


In [1]:
# %%capture # Uncomment on CUDA platforms like Google Colab
# !pip install unsloth
# # Also get the latest nightly Unsloth!
# !pip install --force-reinstall --no-cache-dir --no-deps git+https://github.com/unslothai/unsloth.git

In [7]:
%%capture
!pip install mlx_lm==0.20.6 # Comment out on CUDA platforms like Google Colab

In [None]:
import torch

if torch.backends.mps.is_available():
    from mlx_lm import load

    model, tokenizer = load(
        "mlx-community/phi-4-4bit"
    )  # <= replace with smaller model depending on WiFi bandwidth

elif torch.cuda.is_available():
    from unsloth import FastLanguageModel

    model_name = "unsloth/Phi-4-unsloth-bnb-4bit"
    max_seq_length = 2048
    load_in_4bit = True

    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name=model_name,  # use the pre-quantized checkpoint defined above
        max_seq_length=max_seq_length,
        load_in_4bit=load_in_4bit,
        # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
    )

else:
    raise RuntimeError(
        "You most likely don't have sufficient hardware to run this notebook... :("
    )


### Integration with LangChain


In [8]:
%%capture
!pip install langchain_community langchain_huggingface

In [None]:
from langchain_core.messages import HumanMessage

if torch.backends.mps.is_available():
    from langchain_community.llms.mlx_pipeline import MLXPipeline as Pipeline
    from langchain_community.chat_models.mlx import ChatMLX as Chat

    llm = Pipeline(
        model=model,
        tokenizer=tokenizer,
        pipeline_kwargs={"max_tokens": 1024, "temp": 0.1},
    )

elif torch.cuda.is_available():
    import transformers
    from langchain_huggingface import HuggingFacePipeline as Pipeline
    from langchain_huggingface import ChatHuggingFace as Chat

    FastLanguageModel.for_inference(model)

    hf_pipeline = transformers.pipeline(
        model=model,
        tokenizer=tokenizer,
        task="text-generation",
        # device="cuda",
        # repetition_penalty=1.15,
        return_full_text=False,
        max_new_tokens=1024,
        # output_scores=True,
        # use_cache=False,
        # truncation=True
    )

    llm = Pipeline(pipeline=hf_pipeline)

chat = Chat(llm=llm)

### Test language model

On Apple Silicon, you can ignore the warning. It is caused by a recent breaking change in one of the underlying libraries, which is also why I pin `mlx-lm==0.20.6`.


In [10]:
messages = [
    HumanMessage(
        content="What happens when an unstoppable force meets an immovable object?"
    ),
]

res = chat.invoke(messages)
print(res.content)

As an AI, I can provide some perspectives on this classic paradox, but it's important to note that it is a thought experiment that doesn't have a definitive answer within the framework of our current understanding of physics.

The paradox of an "unstoppable force" meeting an "immovable object" presents a logical contradiction. In classical physics, the concepts of an unstoppable force and an immovable object are mutually exclusive. If a force is truly unstoppable, it implies that there is nothing in the universe that can resist it, meaning there cannot be an immovable object. Conversely, if an object is truly immovable, then no force can move it, meaning there cannot be an unstoppable force.

This paradox is often used to illustrate the limitations of language and logic when applied to the physical world. It can also serve as a metaphor for situations where two seemingly irreconcilable forces or ideas come into conflict.

In theoretical physics, concepts like these are sometimes explor

### Prepare the Data


In [11]:
import bs4
from langchain_community.document_loaders import WebBaseLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Create a WebBaseLoader instance to load documents from web sources
loader = WebBaseLoader(
    web_paths=(
        "https://lilianweng.github.io/posts/2023-06-23-agent/",
        "https://lilianweng.github.io/posts/2023-03-15-prompt-engineering/",
    ),
    bs_kwargs=dict(
        parse_only=bs4.SoupStrainer(
            class_=("post-content", "post-title", "post-header")
        )
    ),
)
# Load documents from web sources using the loader
documents = loader.load()
# Initialize a RecursiveCharacterTextSplitter for splitting text into chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=2000, chunk_overlap=200)

# Split the documents into chunks using the text_splitter
docs = text_splitter.split_documents(documents)

# Let's take a look at the second chunk (index 1)
print(docs[1])

USER_AGENT environment variable not set, consider setting it to identify your requests.


page_content='Fig. 1. Overview of a LLM-powered autonomous agent system.
Component One: Planning#
A complicated task usually involves many steps. An agent needs to know what they are and plan ahead.
Task Decomposition#
Chain of thought (CoT; Wei et al. 2022) has become a standard prompting technique for enhancing model performance on complex tasks. The model is instructed to “think step by step” to utilize more test-time computation to decompose hard tasks into smaller and simpler steps. CoT transforms big tasks into multiple manageable tasks and shed lights into an interpretation of the model’s thinking process.
Tree of Thoughts (Yao et al. 2023) extends CoT by exploring multiple reasoning possibilities at each step. It first decomposes the problem into multiple thought steps and generates multiple thoughts per step, creating a tree structure. The search process can be BFS (breadth-first search) or DFS (depth-first search) with each state evaluated by a classifier (via a prompt) or ma
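
The splitter settings used above (`chunk_size=2000`, `chunk_overlap=200`) can be illustrated with a minimal sliding-window sketch. This is not how `RecursiveCharacterTextSplitter` works internally (it prefers to break on separators such as paragraphs and sentences). The helper `split_with_overlap` and the sample string are hypothetical and only show how chunk size and overlap interact.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Hypothetical helper: fixed-size character windows that overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = split_with_overlap(sample, chunk_size=10, chunk_overlap=3)
for chunk in chunks:
    print(repr(chunk))
```

Each chunk repeats the last `chunk_overlap` characters of the previous one, so a sentence cut at a chunk boundary still appears whole in at least one chunk.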

### Build naive RAG with Milvus and LangChain


In [12]:
# SentenceTransformerEmbeddings from langchain_community is deprecated;
# langchain_huggingface (installed above) provides the replacement class.
from langchain_huggingface import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-mpnet-base-v2"
)


In [13]:
%%capture
!pip install langchain_milvus # TODO: Get rid of warning message

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
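
One way to address the TODO above, following the suggestion in the warning itself, is to set `TOKENIZERS_PARALLELISM` explicitly. A minimal sketch, assuming the cell runs before any tokenizer is used in the process (for example, at the top of the notebook):

```python
import os

# Set this before the tokenizers library is first used in this process.
# "false" disables tokenizer parallelism, so forking for `!pip` is safe.
os.environ["TOKENIZERS_PARALLELISM"] = "false"
```

Setting the value to `"false"` trades some tokenization throughput for a quieter notebook, which is usually an acceptable cost at this scale.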


In [14]:
from langchain_milvus import Milvus, Zilliz

vectorstore = Milvus.from_documents(  # or Zilliz.from_documents
    documents=docs,
    embedding=embeddings,
    connection_args={
        "uri": "./milvus_demo.db",
    },
    drop_old=True,  # Drop the old Milvus collection if it exists
    index_params={
        "metric_type": "COSINE",
        "index_type": "FLAT",  # <= NOTE: Currently a bug where langchain_milvus defaults to "HNSW" index, which doesn't work with Milvus Lite
        "params": {},
    },
)



### Test vector database


In [None]:
query = "What is self-reflection of an AI Agent?"
res = vectorstore.similarity_search(query, k=1)
print(res[0].page_content[0:1024] + "...")

Recency: recent events have higher scores
Importance: distinguish mundane from core memories. Ask LM directly.
Relevance: based on how related it is to the current situation / query.


Reflection mechanism: synthesizes memories into higher level inferences over time and guides the agent’s future behavior. They are higher-level summaries of past events (<- note that this is a bit different from self-reflection above)

Prompt LM with 100 most recent observations and to generate 3 most salient high-level questions given a set of observations/statements. Then ask LM to answer those questions.


Planning & Reacting: translate the reflections and the environment information into actions

Planning is essentially in order to optimize believability at the moment vs in time.
Prompt template: {Intro of an agent X}. Here is X's plan today in broad strokes: 1)
Relationships between agents and observations of one agent by another are all taken into consideration for planning and reacting.
Environmen
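
Conceptually, a search against the `FLAT` index with the `COSINE` metric configured above is an exhaustive comparison of the query embedding against every stored embedding. The sketch below illustrates that idea in plain Python. It is not Milvus's implementation, and the three-dimensional vectors, chunk labels, and helper names are made up for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


def flat_search(query_vec, stored, k=1):
    """Hypothetical brute-force search: score every entry, keep the top k."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in stored]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Toy "collection": (chunk text, embedding) pairs with invented values.
toy_store = [
    ("chunk about planning", [1.0, 0.0, 0.0]),
    ("chunk about memory", [0.0, 1.0, 0.0]),
    ("chunk about reflection", [0.6, 0.0, 0.8]),
]

top = flat_search([0.5, 0.0, 1.0], toy_store, k=1)
print(top[0][1])
```

Exhaustive search is exact but scales linearly with the number of chunks, which is fine for the two blog posts indexed here.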

### Extra LangChain stuff


In [16]:
from langchain_core.runnables import RunnablePassthrough
from langchain_core.prompts import PromptTemplate
from langchain_core.output_parsers import StrOutputParser

# Define the prompt template for generating AI responses
PROMPT_TEMPLATE = """
Human: You are an AI assistant that answers questions using fact-based and statistical information when possible.
Use the following pieces of information to provide a concise answer to the question enclosed in <question> tags.
If you don't know the answer, just say that you don't know, don't try to make up an answer.
<context>
{context}
</context>

<question>
{question}
</question>

The response should be specific and use statistics or numbers when possible.

Assistant:"""

# Create a PromptTemplate instance with the defined template and input variables
prompt = PromptTemplate(
    template=PROMPT_TEMPLATE, input_variables=["context", "question"]
)
# Convert the vector store to a retriever
retriever = vectorstore.as_retriever()


# Define a function to format the retrieved documents
def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)
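
To see what the model eventually receives, the sketch below fills a shortened template by hand. It assumes the default f-string style of `PromptTemplate`, which behaves like `str.format` for simple placeholders. `FakeDoc`, `TOY_TEMPLATE`, and the sample strings are hypothetical stand-ins, not part of LangChain.

```python
from dataclasses import dataclass


@dataclass
class FakeDoc:
    """Hypothetical stand-in for a retrieved document."""
    page_content: str


def format_docs(docs):
    # Same logic as in the notebook: join chunk texts with blank lines.
    return "\n\n".join(doc.page_content for doc in docs)


# Shortened version of the notebook's template, for illustration only.
TOY_TEMPLATE = (
    "<context>\n{context}\n</context>\n\n<question>\n{question}\n</question>"
)

toy_docs = [FakeDoc("First chunk."), FakeDoc("Second chunk.")]
filled = TOY_TEMPLATE.format(
    context=format_docs(toy_docs),
    question="What is in the chunks?",
)
print(filled)
```

The retrieved chunks end up inside the `<context>` tags and the user's query inside the `<question>` tags, which is all the "augmentation" in naive RAG amounts to.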

### LangChain Expression Language


In [17]:
# Define the RAG (Retrieval-Augmented Generation) chain for AI response generation
rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

# rag_chain.get_graph().print_ascii()

# Invoke the RAG chain with a specific question and retrieve the response
res = rag_chain.invoke(query)
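
The `|` syntax in the chain above can look like magic at first. The sketch below mimics the idea with a tiny hypothetical `Step` class: each step wraps a function, `|` composes two steps, and a dict-like fan-out feeds the same input to several branches. It is a simplified illustration, not LangChain's actual `Runnable` implementation, and all toy components are invented.

```python
class Step:
    """Hypothetical stand-in for a runnable: wraps a function, supports `|`."""

    def __init__(self, fn):
        self.fn = fn

    def __or__(self, other):
        # a | b  ->  a new step that runs a, then feeds its output to b
        return Step(lambda x: other.invoke(self.invoke(x)))

    def invoke(self, x):
        return self.fn(x)


def fan_out(**branches):
    """Run every branch on the same input and collect results in a dict."""
    return Step(lambda x: {name: step.invoke(x) for name, step in branches.items()})


# Toy components standing in for retriever, formatter, prompt, and model.
toy_retriever = Step(lambda question: ["doc A", "doc B"])
toy_format = Step(lambda docs: "\n\n".join(docs))
passthrough = Step(lambda x: x)
toy_prompt = Step(lambda d: f"Context: {d['context']}\nQuestion: {d['question']}")
toy_llm = Step(lambda prompt_text: prompt_text.upper())

toy_chain = (
    fan_out(context=toy_retriever | toy_format, question=passthrough)
    | toy_prompt
    | toy_llm
)
toy_result = toy_chain.invoke("what is rag?")
print(toy_result)
```

The query enters once, is routed both to retrieval and straight through as the question, and the merged dictionary fills the prompt before reaching the model.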



In [18]:
import textwrap

# TODO: Better text wrapping in Colab
print(textwrap.fill(res, width=80, replace_whitespace=False, drop_whitespace=False))

Self-reflection in AI agents refers to the capability of an agent to evaluate 
and refine its past actions and decisions to improve future performance. This 
process involves dynamic memory and self-reflection mechanisms that allow the 
agent to learn from its mistakes and optimize its reasoning skills.

Key aspects
 of self-reflection in AI agents include:

1. **Dynamic Memory**: Reflexion 
(Shinn & Labash, 2023) introduces dynamic memory, enabling agents to store and 
utilize past experiences to guide future actions.

2. **Heuristic Evaluation**: 
After each action, the agent computes a heuristic to determine if the trajectory
 is inefficient or contains hallucinations. Inefficient planning refers to 
prolonged unsuccessful attempts, while hallucination involves repeated identical
 actions leading to the same observation.

3. **Self-Reflection Mechanism**: 
Reflexion uses two-shot examples to prompt language models (LLMs) with pairs of 
failed trajectories and ideal reflections. Thes

### You have successfully built and run a RAG pipeline using Milvus, Hugging Face, and LangChain libraries!
