# Naive RAG with Milvus and LangChain
This notebook implements retrieval-augmented generation (RAG) with Milvus, LangChain, and Hugging Face. Use it as a starting point for your own code.

### Load (quantized) Phi-4 for Apple Silicon hardware
The default `transformers` implementation is too slow on my MacBook, even when it is set to use the `mps` device, so I use the `mlx-lm` library instead.

On hardware other than Apple Silicon, load your LLM as usual with `transformers`.
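As a rough sketch of the choice described above, the loader can be picked at runtime. The helper names `is_apple_silicon` and `load_llm` and the `microsoft/phi-4` checkpoint ID for the `transformers` path are assumptions for illustration and are not part of this notebook. The `device_map="auto"` argument also assumes the `accelerate` package is installed.

```python
import platform


def is_apple_silicon(system=None, machine=None):
    """Return True when running on macOS with an ARM64 (Apple Silicon) CPU."""
    system = system or platform.system()
    machine = machine or platform.machine()
    return system == "Darwin" and machine == "arm64"


def load_llm(mlx_model_id="mlx-community/phi-4-4bit", hf_model_id="microsoft/phi-4"):
    """Hypothetical helper: load with mlx-lm on Apple Silicon, else transformers.

    Imports are done lazily so that only the library for the current
    platform needs to be installed.
    """
    if is_apple_silicon():
        from mlx_lm import load

        return load(mlx_model_id)  # returns (model, tokenizer)

    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(hf_model_id)
    model = AutoModelForCausalLM.from_pretrained(
        hf_model_id,
        device_map="auto",   # assumption: `accelerate` is available
        torch_dtype="auto",  # use the dtype stored in the checkpoint
    )
    return model, tokenizer
```

Note that the rest of this notebook wraps the model in `MLXPipeline`, so the `transformers` path would need a different LangChain wrapper.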

In [1]:
from mlx_lm import load
model, tokenizer = load("mlx-community/phi-4-4bit")  # roughly 8 GB download; swap in a smaller model if your bandwidth is limited

### Integration with LangChain

In [2]:
from langchain_community.llms.mlx_pipeline import MLXPipeline
from langchain_community.chat_models.mlx import ChatMLX
from langchain_core.messages import HumanMessage

In [3]:
llm = MLXPipeline(
    model=model, tokenizer=tokenizer, pipeline_kwargs={"max_tokens": 1024, "temp": 0.1}
)

chat = ChatMLX(llm=llm)

### Test language model
Ignore the warning. It comes from a recent breaking change in one of the dependencies, which is why I pin `mlx-lm==0.20.6`.
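One way to confirm that the pinned version is the one actually installed is a small check with the standard library. This is a minimal sketch; `check_pin` is a hypothetical helper name, not something provided by `mlx-lm`.

```python
from importlib.metadata import PackageNotFoundError, version


def check_pin(package, expected):
    """Compare the installed version of `package` against the pinned one."""
    try:
        installed = version(package)
    except PackageNotFoundError:
        return f"{package} is not installed"
    if installed != expected:
        return f"{package}=={installed} is installed, expected {expected}"
    return f"{package}=={installed} matches the pin"


print(check_pin("mlx-lm", "0.20.6"))
```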

In [4]:
messages = [
    HumanMessage(
        content="What happens when an unstoppable force meets an immovable object?"
    ),
]

res = chat.invoke(messages)
print(res.content)

As an AI, I can provide some perspectives on the question of what happens when an unstoppable force meets an immovable object, but it's important to note that this is a classic paradox that doesn't have a definitive answer within the framework of classical physics.

1. **Logical Paradox**: The scenario is a logical paradox because the definitions of an "unstoppable force" and an "immovable object" are mutually exclusive. If a force is truly unstoppable, it cannot be stopped by any object, no matter how immovable. Conversely, if an object is truly immovable, no force can move it.

2. **Philosophical Interpretation**: Philosophically, this paradox can be used to explore the limits of language and logic. It challenges our understanding of concepts like infinity, absolute power, and the nature of reality.

3. **Theoretical Physics**: In theoretical physics, the laws of nature as we understand them do not allow for the existence of such absolutes. Forces and objects are subject to the laws 

### Prepare the Data

In [6]:
import bs4
from langchain_community.document_loaders import WebBaseLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Create a WebBaseLoader instance to load documents from web sources
loader = WebBaseLoader(
    web_paths=(
        "https://lilianweng.github.io/posts/2023-06-23-agent/",
        "https://lilianweng.github.io/posts/2023-03-15-prompt-engineering/",
    ),
    bs_kwargs=dict(
        parse_only=bs4.SoupStrainer(
            class_=("post-content", "post-title", "post-header")
        )
    ),
)
# Load documents from web sources using the loader
documents = loader.load()
# Initialize a RecursiveCharacterTextSplitter for splitting text into chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=2000, chunk_overlap=200)

# Split the documents into chunks using the text_splitter
docs = text_splitter.split_documents(documents)

# Let's take a look at the second chunk (index 1)
print(docs[1])

page_content='Fig. 1. Overview of a LLM-powered autonomous agent system.
Component One: Planning#
A complicated task usually involves many steps. An agent needs to know what they are and plan ahead.
Task Decomposition#
Chain of thought (CoT; Wei et al. 2022) has become a standard prompting technique for enhancing model performance on complex tasks. The model is instructed to “think step by step” to utilize more test-time computation to decompose hard tasks into smaller and simpler steps. CoT transforms big tasks into multiple manageable tasks and shed lights into an interpretation of the model’s thinking process.
Tree of Thoughts (Yao et al. 2023) extends CoT by exploring multiple reasoning possibilities at each step. It first decomposes the problem into multiple thought steps and generates multiple thoughts per step, creating a tree structure. The search process can be BFS (breadth-first search) or DFS (depth-first search) with each state evaluated by a classifier (via a prompt) or ma
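To make the `chunk_size=2000` and `chunk_overlap=200` settings concrete, here is a deliberately simplified sketch of overlapping chunking. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter tries a list of separators so chunks end at natural boundaries); the function `split_with_overlap` is a hypothetical stand-in that cuts at fixed character offsets.

```python
def split_with_overlap(text, chunk_size=2000, chunk_overlap=200):
    """Cut `text` into fixed-size chunks where neighbours share `chunk_overlap` characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reached the end of the text
    return chunks


sample = "".join(str(i % 10) for i in range(4500))
chunks = split_with_overlap(sample)
print([len(c) for c in chunks])
```

The overlap means the last 200 characters of one chunk reappear at the start of the next, which reduces the chance that a sentence relevant to a query is split across two chunks with no shared context.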

### Build naive RAG with Milvus and LangChain

In [16]:
from langchain_community.embeddings import SentenceTransformerEmbeddings
embeddings = SentenceTransformerEmbeddings(model_name='sentence-transformers/all-mpnet-base-v2')

In [17]:
from langchain_milvus import Milvus, Zilliz

vectorstore = Milvus.from_documents(  # or Zilliz.from_documents
    documents=docs,
    embedding=embeddings,
    connection_args={
        "uri": "./milvus_demo.db",
    },
    drop_old=True,  # Drop the old Milvus collection if it exists
)

### Test vector database

In [21]:
query = "What is self-reflection of an AI Agent?"
res = vectorstore.similarity_search(query, k=1)
print(res[0].page_content[0:1024] + '...')

Recency: recent events have higher scores
Importance: distinguish mundane from core memories. Ask LM directly.
Relevance: based on how related it is to the current situation / query.


Reflection mechanism: synthesizes memories into higher level inferences over time and guides the agent’s future behavior. They are higher-level summaries of past events (<- note that this is a bit different from self-reflection above)

Prompt LM with 100 most recent observations and to generate 3 most salient high-level questions given a set of observations/statements. Then ask LM to answer those questions.


Planning & Reacting: translate the reflections and the environment information into actions

Planning is essentially in order to optimize believability at the moment vs in time.
Prompt template: {Intro of an agent X}. Here is X's plan today in broad strokes: 1)
Relationships between agents and observations of one agent by another are all taken into consideration for planning and reacting.
Environmen
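Conceptually, `similarity_search` embeds the query and returns the stored chunks whose vectors are closest to it. The sketch below shows a brute-force version of that idea in plain Python, assuming cosine similarity as the metric. Milvus itself uses index structures, and the metric it applies depends on how the collection's index is configured, so treat `cosine_similarity` and `top_k` as illustrative helpers only.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=1):
    """Return the indices of the k document vectors most similar to the query."""
    scored = sorted(
        ((cosine_similarity(query_vec, vec), idx) for idx, vec in enumerate(doc_vecs)),
        reverse=True,
    )
    return [idx for _, idx in scored[:k]]


# Toy 2-dimensional "embeddings"; real sentence embeddings have hundreds of dimensions.
doc_vecs = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
print(top_k([1.0, 0.0], doc_vecs, k=2))
```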

### Prompt template, retriever, and document formatting

In [22]:
from langchain_core.runnables import RunnablePassthrough
from langchain_core.prompts import PromptTemplate
from langchain_core.output_parsers import StrOutputParser

# Define the prompt template for generating AI responses
PROMPT_TEMPLATE = """
Human: You are an AI assistant that answers questions using fact-based and statistical information when possible.
Use the following pieces of information to provide a concise answer to the question enclosed in <question> tags.
If you don't know the answer, just say that you don't know, don't try to make up an answer.
<context>
{context}
</context>

<question>
{question}
</question>

The response should be specific and use statistics or numbers when possible.

Assistant:"""

# Create a PromptTemplate instance with the defined template and input variables
prompt = PromptTemplate(
    template=PROMPT_TEMPLATE, input_variables=["context", "question"]
)
# Convert the vector store to a retriever
retriever = vectorstore.as_retriever()


# Define a function to format the retrieved documents
def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)
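
To see what the model ultimately receives, it can help to fill a template by hand. The sketch below uses plain `str.format` on a shortened template and a stand-in `FakeDocument` class (both assumptions for illustration); with simple `{context}` and `{question}` placeholders this should produce the same kind of string that `PromptTemplate` builds.

```python
from dataclasses import dataclass


@dataclass
class FakeDocument:
    """Stand-in for a LangChain Document; only `page_content` is needed here."""
    page_content: str


# Shortened version of the notebook's template, kept to the two placeholders.
TEMPLATE = "<context>\n{context}\n</context>\n\n<question>\n{question}\n</question>"


def format_docs(docs):
    """Join retrieved chunks with a blank line between them."""
    return "\n\n".join(doc.page_content for doc in docs)


retrieved = [FakeDocument("first retrieved chunk"), FakeDocument("second retrieved chunk")]
filled = TEMPLATE.format(
    context=format_docs(retrieved),
    question="What is self-reflection of an AI Agent?",
)
print(filled)
```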

### LangChain Expression Language

In [23]:
# Define the RAG (Retrieval-Augmented Generation) chain for AI response generation
rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

# rag_chain.get_graph().print_ascii()

# Invoke the RAG chain with a specific question and retrieve the response
res = rag_chain.invoke(query)
res



"Self-reflection in AI agents refers to the capability of an agent to evaluate and refine its past actions and decisions to improve future performance. This process involves dynamic memory and self-reflection mechanisms that allow the agent to identify inefficiencies or errors in its previous trajectories and make necessary adjustments.\n\nIn the Reflexion framework (Shinn & Labash, 2023), self-reflection is achieved by incorporating two-shot examples into the agent's working memory. These examples consist of pairs of failed trajectories and ideal reflections that guide future changes in the plan. The agent can store up to three reflections to use as context when querying a language model (LLM).\n\nThe heuristic function in Reflexion determines when a trajectory is inefficient or contains hallucinations, prompting the agent to reset the environment and start a new trial if necessary. Inefficient planning refers to trajectories that take too long without success, while hallucination is 

In [24]:
print(res)

Self-reflection in AI agents refers to the capability of an agent to evaluate and refine its past actions and decisions to improve future performance. This process involves dynamic memory and self-reflection mechanisms that allow the agent to identify inefficiencies or errors in its previous trajectories and make necessary adjustments.

In the Reflexion framework (Shinn & Labash, 2023), self-reflection is achieved by incorporating two-shot examples into the agent's working memory. These examples consist of pairs of failed trajectories and ideal reflections that guide future changes in the plan. The agent can store up to three reflections to use as context when querying a language model (LLM).

The heuristic function in Reflexion determines when a trajectory is inefficient or contains hallucinations, prompting the agent to reset the environment and start a new trial if necessary. Inefficient planning refers to trajectories that take too long without success, while hallucination is defin
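
The `|` syntax in the chain above can look like magic, so here is a toy re-creation of the idea in plain Python. It is a sketch under stated assumptions: `Step` and `parallel` are hypothetical names, not LangChain classes, and the fake retriever and fake LLM are simple lambdas. It only illustrates how each stage's output becomes the next stage's input and how the leading dict fans the same question out to several branches.

```python
class Step:
    """Toy stand-in for a runnable: wraps a function and supports `|` chaining."""

    def __init__(self, fn):
        self.fn = fn

    def invoke(self, value):
        return self.fn(value)

    def __or__(self, other):
        # Feed this step's output into the next step.
        return Step(lambda value: other.invoke(self.invoke(value)))


def parallel(steps):
    """Run every step on the same input and collect the results in a dict."""
    return Step(lambda value: {name: step.invoke(value) for name, step in steps.items()})


fake_retriever = Step(lambda question: ["chunk about " + question])
join_chunks = Step(lambda chunks: "\n\n".join(chunks))
passthrough = Step(lambda value: value)
fill_prompt = Step(lambda d: f"Context: {d['context']}\nQuestion: {d['question']}")
fake_llm = Step(lambda prompt: prompt.upper())  # pretend "generation"

toy_chain = (
    parallel({"context": fake_retriever | join_chunks, "question": passthrough})
    | fill_prompt
    | fake_llm
)

answer = toy_chain.invoke("reflection")
print(answer)
```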

### You have successfully built and run a RAG pipeline using Milvus, Hugging Face, and LangChain libraries!