# Self-Reflection Agent (Reflexion)

In this notebook we show how to build an agent system that performs "self-reflection". Given an agent that performs auto-retrieval against a RAG pipeline, a separate reflection agent critiques its outputs and returns both a done/not-done verdict and detailed feedback. This feedback is injected into the agent's memory (its chat history) so that the next attempt can correct course.

This is inspired by the following papers:
- [Reflexion: Language Agents with Verbal Reinforcement Learning, by Shinn et al. (2023)](https://arxiv.org/pdf/2303.11366.pdf)
- [CRITIC: Large Language Models Can Self-Correct with Tool-Interactive Critiquing, by Gou et al. (2024)](https://arxiv.org/pdf/2305.11738.pdf)

## Setup RAG Pipeline

Here we build a RAG pipeline over the Reflexion paper as a data source, and wrap it as a function tool that has both a semantic query argument and filters.

This allows us to filter by page number.
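
To make the filtering behavior concrete, here is a library-free sketch of what an OR filter over `page_label` does. The `Chunk` class and the `filter_by_pages` helper are hypothetical stand-ins for illustration, not LlamaIndex APIs; the actual pipeline delegates this work to `MetadataFilters`. The sketch assumes, as the tool's docstring describes, that an empty page list means "search all pages".

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Chunk:
    """Hypothetical stand-in for an indexed node: text plus metadata."""

    text: str
    metadata: Dict[str, str]


def filter_by_pages(
    chunks: List[Chunk], page_numbers: Optional[List[str]] = None
) -> List[Chunk]:
    """Keep chunks whose page_label matches ANY requested page (OR semantics).

    An empty or missing page list means "no filter": every chunk is kept.
    """
    if not page_numbers:
        return list(chunks)
    wanted = set(page_numbers)
    return [c for c in chunks if c.metadata.get("page_label") in wanted]


chunks = [
    Chunk("Introduction ...", {"page_label": "1"}),
    Chunk("Evaluator design ...", {"page_label": "4"}),
    Chunk("HumanEval results ...", {"page_label": "7"}),
]

all_pages = filter_by_pages(chunks, [])  # no filter: all three chunks
selected = filter_by_pages(chunks, ["4", "7"])  # only pages 4 and 7
```

Page labels are kept as strings here to mirror the `List[str]` argument of the tool defined in this notebook.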

In [3]:
!wget "https://arxiv.org/pdf/2303.11366.pdf" -O reflexion.pdf

--2024-04-14 00:01:54--  https://arxiv.org/pdf/2303.11366.pdf
Resolving arxiv.org (arxiv.org)... 151.101.3.42, 151.101.131.42, 151.101.67.42, ...
Connecting to arxiv.org (arxiv.org)|151.101.3.42|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 591097 (577K) [application/pdf]
Saving to: ‘reflexion.pdf’


2024-04-14 00:01:54 (16.9 MB/s) - ‘reflexion.pdf’ saved [591097/591097]



In [1]:
# define gpt-3.5-turbo as the initial default
from llama_index.llms.openai import OpenAI
from llama_index.core import Settings

llm = OpenAI(model="gpt-3.5-turbo", temperature=0)
Settings.llm = llm

In [2]:
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.tools import FunctionTool
from llama_index.core.vector_stores import MetadataFilters, FilterCondition
from typing import List, Optional

# load documents
documents = SimpleDirectoryReader(input_files=["reflexion.pdf"]).load_data()
splitter = SentenceSplitter(chunk_size=1024)
nodes = splitter.get_nodes_from_documents(documents)
vector_index = VectorStoreIndex(nodes)

def vector_query(
    query: str, 
    page_numbers: List[str]
) -> str:
    """Use to answer questions over the paper.

    Args:
        query (str): the string query to be embedded.
        page_numbers (List[str]): Filter by set of pages. Leave blank  
            if we want to perform a vector search
            over all pages. Otherwise, filter by the set of specified pages.

    """

    page_numbers = page_numbers or []
    metadata_dicts = [
        {"key": "page_label", "value": p} for p in page_numbers
    ]

    query_engine = vector_index.as_query_engine(
        similarity_top_k=2,
        filters=MetadataFilters.from_dicts(
            metadata_dicts,
            condition=FilterCondition.OR
        )
    )
    response = query_engine.query(query)
    # convert the response object to text to match the declared return type
    return str(response)


vector_tool = FunctionTool.from_defaults(
    name="vector_tool",
    fn=vector_query
)

## Build Function Calling Agent 

Here we define a function that allows us to construct a function-calling agent. 

This function-calling agent is capable of querying the tool. However, as we will see, it can hallucinate argument values, such as page numbers that the user never asked for.

In [28]:
from typing import Tuple

from llama_index.llms.openai import OpenAI
from llama_index.core.llms import ChatMessage
from llama_index.core.tools import call_tool_with_selection


def call_tool(
    tools: List[FunctionTool],
    chat_history: Optional[List[ChatMessage]] = None,
    system_prompt: Optional[str] = None,
) -> Tuple[ChatMessage, Optional[ChatMessage]]:
    """Run one LLM tool-calling turn and execute the first selected tool, if any.

    Returns the assistant message and the tool message (None if no tool was called).
    """
    llm = OpenAI(model="gpt-3.5-turbo", temperature=0)
    chat_history = chat_history or []
    if system_prompt is not None:
        chat_history = [ChatMessage.from_str(system_prompt, role="system")] + chat_history
        
    # NOTE: we don't use the higher-level predict_and_call because we want to get both the 
    # assistant message and the final tool message
    response = llm.chat_with_tools(
        tools,
        chat_history=chat_history,
        verbose=True
    )
    tool_calls = llm.get_tool_calls_from_response(response, error_on_no_tool_call=False)
    if len(tool_calls) == 0:
        tool_message = None
    else:
        tool_output = call_tool_with_selection(tool_calls[0], tools, verbose=True)
        # return the assistant message and tool message
        tool_message = ChatMessage.from_str(
            str(tool_output),
            role="tool",
            additional_kwargs={
                "name": tool_calls[0].tool_name,
                "tool_call_id": tool_calls[0].tool_id,
            },
        )
    return response.message, tool_message
    

In [30]:
user_msg = ChatMessage.from_str("Give me more details on the evaluation dataset", role="user")
assistant_msg, tool_msg = call_tool(
    [vector_tool],
    chat_history=[user_msg]
)

=== Calling Function ===
Calling function: vector_tool with args: {"query": "evaluation dataset", "page_numbers": ["4"]}
=== Function Output ===
The Evaluator component of the Reflexion framework assesses the quality of the generated outputs produced by the Actor. It takes as input a generated trajectory and computes a reward score that reflects its performance within the given task context. The Evaluator model explores various reward functions based on different evaluation criteria, such as exact match grading for reasoning tasks and pre-defined heuristic functions for decision-making tasks. Additionally, different variants of an LLM are used as Evaluators to generate rewards for decision-making and programming tasks. This multi-faceted approach to Evaluator design allows for the examination of different strategies for scoring generated outputs, providing insights into their effectiveness and suitability across a range of tasks.
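
The call above passed `page_numbers: ["4"]` even though the user query never mentions a page. Before reaching for an LLM-based critic, a cheap rule-based check can flag this kind of hallucinated argument. The sketch below is only an illustration: `unrequested_pages` is a hypothetical helper, and it assumes a page counts as "requested" only when the same number appears as a whole-number token in the user query.

```python
import re
from typing import Any, Dict, List


def unrequested_pages(user_query: str, tool_args: Dict[str, Any]) -> List[str]:
    """Return page numbers in the tool call that the user query never mentions.

    Assumption: a page is "requested" only if the same digits appear as a
    whole-number token somewhere in the query text.
    """
    mentioned = set(re.findall(r"\b\d+\b", user_query))
    pages = tool_args.get("page_numbers") or []
    return [p for p in pages if p not in mentioned]


tool_args = {"query": "evaluation dataset", "page_numbers": ["4"]}

# The query from this notebook: no page is mentioned, so "4" is flagged.
flagged = unrequested_pages(
    "Give me more details on the evaluation dataset", tool_args
)

# A query that does name the page: nothing is flagged.
not_flagged = unrequested_pages("Summarize the content on page 4", tool_args)
```

A heuristic like this only catches one failure mode; the reflection step built next also judges whether the tool output actually completes the task.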


## Build Evaluation Agent

In this section we build an evaluation "agent" that takes in the agent's current trajectory (the user task, the tool call, and the tool output) and returns feedback on it.

In [69]:
from pydantic import BaseModel, Field
from llama_index.core.agent.types import Task, TaskStep, TaskStepOutput
from llama_index.core.tools import ToolOutput
from llama_index.core.prompts import PromptTemplate

class Reflection(BaseModel):
    """Reflection of the current agent state."""
    
    is_done: bool = Field(..., description="Whether the task is successfully completed according to evaluation criteria (do NOT output True if not).")
    feedback: str = Field(..., description="Feedback on how the output can be improved (especially if is_done is False)")

In [82]:
from llama_index.program.openai import OpenAIPydanticProgram
from llama_index.core.prompts import ChatPromptTemplate


reflection_prompt_str = """
You are responsible for evaluating whether an agent is taking the right steps towards a solution.

You are given the current conversation history, which contains the user task, assistant responses + tool calls, \
as well as any feedback that you have already given.

Evaluate the following criteria:
- Whether the tool call arguments make sense
    - Specifically, check whether page numbers are specified when they shouldn't be. They should ONLY be specified
    if they appear in the user query. Do NOT return done if they were specified without being requested.
- Whether the tool output completes the task.
- Whether the final message is an ASSISTANT message (not a tool message). Only if the final message
    is an assistant message does it mean the agent is done thinking.

Given the current chat history, please output a reflection response in the following format evaluating
the quality of the agent trajectory:

"""


feedback_str_tmpl = """
Here is a reflection on the current trajectory.

{reflection_output}

If is_done is not True, there should be feedback on what is going wrong.
Given the feedback, please try again.
"""

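def reflect(chat_history: List[ChatMessage], verbose: bool = False):
    """Reflect on the trajectory.

    Returns a (Reflection, ChatMessage) pair: the structured reflection and a
    user-role message that carries the feedback back to the agent.
    """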
    
    eval_llm = OpenAI(model="gpt-4-turbo-preview", temperature=0)
    # print(chat_history)
    reflection_prompt = ChatPromptTemplate.from_messages(
        message_templates=[
            ChatMessage.from_str(reflection_prompt_str, "system"),
            *chat_history
        ]
    )
    
    program = OpenAIPydanticProgram.from_defaults(
        Reflection, 
        prompt=reflection_prompt,
        llm=eval_llm
    )
    reflection = program()
    
    if verbose:
        print(f"> Reflection: {reflection.dict()}")
    
    # end state: return user message
    reflection_output_str = f"Is Done: {reflection.is_done}\nFeedback: {reflection.feedback}"
    feedback_str = feedback_str_tmpl.format(reflection_output=reflection_output_str)

    return reflection, ChatMessage.from_str(feedback_str, role="user")

In [83]:
chat_history = [user_msg, assistant_msg, tool_msg]
reflection, reflection_msg = reflect(chat_history)
reflection

Reflection(is_done=True, feedback="The assistant's response provides a concise summary of the evaluation dataset as described in the Reflexion framework, focusing on the role of the Evaluator component. However, the response could be improved by clarifying that the information is specific to the Reflexion framework and by providing more context about the framework itself to help the user understand the relevance of the evaluation dataset within that specific context. Additionally, the assistant should avoid specifying page numbers in the tool call unless explicitly requested by the user, as this can limit the breadth of information retrieved.")

## Build Tool Calling Agent with Self-Reflection

Now that we have both a tool calling and evaluation module, we can compose a full tool calling agent with reflection built in.

In [87]:
from llama_index.core.agent import (
    CustomSimpleAgentWorker,
    Task,
    AgentChatResponse,
)
from typing import Any, Dict, Tuple


class SelfReflectionAgentWorker(CustomSimpleAgentWorker):
    """Agent worker that combines tool calling with self-reflection.

    Continues iterating until there are no errors and the task is done.

    """

    max_iterations: int = Field(default=5)

    def _initialize_state(self, task: Task, **kwargs: Any) -> Dict[str, Any]:
        """Initialize state."""
        return {"count": 0, "chat_history": []}

    def _run_step(
        self, state: Dict[str, Any], task: Task, input: Optional[str] = None
    ) -> Tuple[AgentChatResponse, bool]:
        """Run step."""
        # if first step, add user input
        if len(state["chat_history"]) == 0:
            state["chat_history"].append(
                ChatMessage.from_str(task.input, role="user")
            )
        
        # call tool
        assistant_msg, tool_msg = call_tool(self.tools, chat_history=state["chat_history"])
        # add assistant message to chat history
        state["chat_history"].append(assistant_msg)
        # if tool_msg is not None, then also add to chat history
        if tool_msg is not None:
            state["chat_history"].append(tool_msg)
        
        # reflect on the current chat history
        reflection, reflection_msg = reflect(state["chat_history"], verbose=True)

        # count this step so the declared max_iterations cap is actually enforced
        state["count"] += 1
        is_done = reflection.is_done or state["count"] >= self.max_iterations

        # if reflection doesn't indicate completeness, then add feedback as user message
        if not reflection.is_done:
            state["chat_history"].append(reflection_msg)

        # return response
        return AgentChatResponse(response=str(assistant_msg)), is_done

    def _finalize_task(self, state: Dict[str, Any], **kwargs) -> None:
        """Finalize task."""
        pass
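
Stripped of the LlamaIndex plumbing, the worker's control flow is a short loop: act, critique, and feed the critique back as a user message until the critic reports completion. The following is a minimal sketch of that loop, not the library's implementation. `run_with_reflection`, `toy_act`, and `toy_critique` are hypothetical names, the toy functions replace the LLM calls, and the explicit `max_iterations` cap is an assumption about how an iteration budget would be enforced.

```python
from typing import Callable, List, Tuple

Message = Tuple[str, str]  # (role, content)


def run_with_reflection(
    task: str,
    act: Callable[[List[Message]], str],
    critique: Callable[[List[Message]], Tuple[bool, str]],
    max_iterations: int = 5,
) -> Tuple[str, List[Message]]:
    """Alternate between acting and critiquing until done or out of budget."""
    history: List[Message] = [("user", task)]
    answer = ""
    for _ in range(max_iterations):
        answer = act(history)
        history.append(("assistant", answer))
        is_done, feedback = critique(history)
        if is_done:
            break
        # feed the critique back as a user message for the next attempt
        history.append(("user", f"Feedback: {feedback}\nPlease try again."))
    return answer, history


def toy_act(history: List[Message]) -> str:
    """Toy actor: restricts the search to page 4 until it receives feedback."""
    saw_feedback = any(
        role == "user" and content.startswith("Feedback:")
        for role, content in history
    )
    return "search all pages" if saw_feedback else "search page 4"


def toy_critique(history: List[Message]) -> Tuple[bool, str]:
    """Toy critic: rejects any attempt that restricts the search to page 4."""
    last_content = history[-1][1]
    if "page 4" in last_content:
        return False, "Do not specify page numbers the user did not ask for."
    return True, ""


answer, history = run_with_reflection(
    "Give me more details on the evaluation dataset", toy_act, toy_critique
)
```

In this toy run the first attempt is rejected, the feedback is appended as a user message, and the second attempt is accepted, which mirrors the two tool calls seen when the real agent is run.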

In [88]:
from llama_index.core.agent import AgentRunner

agent_worker = SelfReflectionAgentWorker.from_tools(
    [vector_tool],
    llm=llm,
    verbose=True,
)
agent = AgentRunner(agent_worker)

In [89]:
response = agent.chat("Give me more details on the evaluation dataset?")
print(str(response))

=== Calling Function ===
Calling function: vector_tool with args: {"query": "evaluation dataset", "page_numbers": ["4"]}
=== Function Output ===
evaluation dataset
> Reflection: {'is_done': False, 'feedback': "The assistant made an incorrect tool call by specifying a page number without it being mentioned in the user query. The assistant should have made a general search without specifying page numbers to find information on the evaluation dataset. Additionally, the assistant failed to provide a response based on the tool's output, which was not informative. The assistant needs to make a correct tool call without specifying page numbers and then provide a detailed response based on the information found."}
=== Calling Function ===
Calling function: vector_tool with args: {"query": "evaluation dataset", "page_numbers": []}
=== Function Output ===
The evaluation dataset used in the study is HumanEval Python, specifically utilizing starchat-beta for the evaluation. The Pass@1 accuracy res