<center><a href="https://www.nvidia.com/en-us/training/"><img src="https://dli-lms.s3.amazonaws.com/assets/general/DLI_Header_White.png" width="400" height="186" /></a></center>

# <font color="#76b900"> **7:** Introduction To Stateful LLMs </font>

### **Looking Ahead**

In the previous notebook, we used some rather large decoder-only models with enough instruction fine-tuning to do just about anything you want (if you can phrase it in the right way and the training dataset included all the necessary details). We experimented with Llama-2, appreciated its robust text generation, and started to unearth its broader potential. However, like all models, Llama-2 is still quite limited on its own.

In this notebook, we will investigate the key reasons behind this, and will touch on some frameworks that seek to tackle the problems associated with stochastic parrot backbones. 

#### **Learning Objectives:**
- Investigate the techniques used to control and empower LLMs with state and external information. 
- Gain some exposure to some of the open-sourced efforts to make interesting and useful LLM applications with large chat models.

## 7.1. Introduction To LLM Orchestration

In the last notebook, we experimented a decent amount with the chat-tuned Llama-2 model and managed to get some pretty good results. You may recall that we did a few things including: 

- **Testing out the model without following any instruction format,** for which the model had trouble stopping and was less predictable while also being somewhat less chatty. 
- **Testing out the model with the proper system message/instruction formatting,** for which the model was able to perform well on in-context learning and should have generally followed the instructions presented. At the same time, it also struggled to avoid chatting and wouldn't stick to a specific required format when asked.
- **Testing out the model with multiple instructions to simulate a dialog,** for which the model was able to make up new responses pretty well without too much help! It's almost like it was trained for this...

All of these attempts at steering Llama-2 underscore a likely-obvious reality: while Llama-2 is teeming with potential, activating this capability in a manner that's both potent and user-friendly necessitates a more nuanced approach. Enter: **the LLM orchestration library**.

An LLM orchestration library's objective is to help organize an LLM (or multiple LLMs) to solve real-world use-case problems automatically. This can include supporting a chat application, performing complex reasoning, or predicting parameters for a Web API; anything more-or-less goes!

One extremely popular implementation is [**LangChain**](https://github.com/langchain-ai/langchain), an open-source orchestration library that's loaded with features and always evolving to meet the demands of LLM application practitioners. The framework was initially built around OpenAI's black-boxy GPT APIs, so many of the default features assume some specific behavior patterns that might be harder to replicate with local models. With that being said, the codebase and tutorials have plenty of great examples and show how a highly-general LLM can do things like plan out a complex task, "read" a book, or automatically purchase items from your favorite online store!

> <div><img src="imgs/langchain-diagram.png" width="600"/></div>
>
> **Source: [LangChain Diagram](https://www.langchain.com/)**

## 7.2. Streamlining Our Chat Model

There are some fundamental building blocks in LangChain that can make our life a lot easier when dealing with our Llama-2 model! Recall from the previous notebook that we were able to simulate dialog by keeping track of our history and sending it to our LLM as additional context. LangChain makes it easy to support this automatically using **Chains**.

A [**Chain**](https://python.langchain.com/docs/modules/chains/) in LangChain is just a small module that does something, and the main requirement for it is that it has to work with other chains. Think of it kind of like `nn.Module` in PyTorch: 
- In PyTorch, the goal is to build large `nn.Module` components which are differentiable by assembling smaller differentiable `nn.Module` components.
- In LangChain, the goal is to build large [`Chain`](https://github.com/langchain-ai/langchain/blob/fde19c86677c86d5ac77b1cf18a3911ef4ad0a52/libs/langchain/langchain/chains/base.py#L40) components which are dict-in-dict-out by assembling smaller dict-in-dict-out [`Chain`](https://github.com/langchain-ai/langchain/blob/fde19c86677c86d5ac77b1cf18a3911ef4ad0a52/libs/langchain/langchain/chains/base.py#L40) components.
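
To make the dict-in-dict-out idea concrete before touching the library, here is a minimal plain-Python sketch. It does not use LangChain at all; `compose`, `rename`, and `combine` are hypothetical names made up for illustration, and real chains add validation, callbacks, and memory on top of this.

```python
from typing import Callable, Dict, List

# Hypothetical stand-in for a chain: any function that maps a dict to a dict.
DictFn = Callable[[Dict[str, str]], Dict[str, str]]

def compose(steps: List[DictFn]) -> DictFn:
    """Build one dict-in-dict-out step out of several smaller ones."""
    def run(inputs: Dict[str, str]) -> Dict[str, str]:
        known = dict(inputs)           # copy so the caller's dict is untouched
        for step in steps:
            known.update(step(known))  # each step sees everything produced so far
        return known
    return run

def rename(d: Dict[str, str]) -> Dict[str, str]:
    return {"output": d["input"]}

def combine(d: Dict[str, str]) -> Dict[str, str]:
    return {"output2": f"{d['input']}? {d['output']}!"}

pipeline_fn = compose([rename, combine])
result = pipeline_fn({"input": "Hello World"})
print(result["output2"])  # Hello World? Hello World!
```

The key design choice is that every step reads from and writes to the same growing dictionary, which is what lets small pieces be assembled into larger ones.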

The goal of a Chain can range from general pass-through to minor state management to complex system orchestration depending on what you need. To start out with them, there are several building-block components that help to capture the spirit of chains: 

- [`TransformChain`](https://python.langchain.com/docs/modules/chains/foundational/transformation) is a simple pass-through chain that allows you to modify the inputs of the chain directly with minimal side-effects. 
- [`SequentialChain`](https://python.langchain.com/docs/modules/chains/foundational/sequential_chains) is analogous to `nn.Sequential` in PyTorch where multiple components are strung together to build a coherent whole.

Let's go ahead and show off how the API for these systems works in practice:

In [None]:
from langchain.chains import TransformChain, SequentialChain

########################################################################

print("Testing the TransformChain")
transform_chain = TransformChain(
    input_variables=["input"], output_variables=["output"], 
    transform=lambda d: dict(output = d['input'])
)   ## very simple variable-renaming chain
print(transform_chain.run("Hello World"))  #=> "Hello World"

########################################################################

chain2 = TransformChain(
    input_variables=["input", "output"], output_variables=["output2"], 
    transform=lambda d: dict(output2 = f"{d['input']}? {d['output']}!")
)   ## Transform chain that references both the original input and the first chain's output

print("\nTesting the SequentialChain")
chain12 = SequentialChain(
    input_variables=["input"], output_variables=["output2"], 
    chains=[transform_chain, chain2]
)
print(chain12.run(input = "Hello World"))  #=> "Hello World? Hello World!"

----

Building on these systems, `LLMChain` and its derivatives are components that pair an LLM with a prompt, effectively making a chain that can do language reasoning! The ideas are quite similar to above, so let's see how we can use LangChain to make our Llama model a proper chat model!

#### **Pulling In Our LLM**

To start out with, we have to pull our model into a container that works well with LangChain. 

LangChain supports two different model abstractions: 
- **LLM:** A large language model with an open API. You can send queries to it as necessary, and it should just abide by a `generate`-style calling scheme. In other words, roughly what we've been assuming with the original HuggingFace pipeline. HuggingFace text-generation pipelines are supported by default as an LLM import, so we'll use it!
- **Chat Model:** In LangChain, this is a wrapper around the LLM model that restricts its intake and output to a message-in-message-out format. This works quite nicely when a chat wrapper has already been set up (i.e. for OpenAI), but we'd need to build it out and introduce more boilerplate to set it up for Llama-2. We'll ignore this option for now but will take advantage of it in future courses. 

Since we already have our HuggingFace pipeline in memory, let's go ahead and wrap it up as a LangChain LLM:

In [None]:
from transformers import pipeline
    
model_kwargs = {"do_sample": True, "temperature": 0.6, "max_length": 1024}
model_name = "TheBloke/Llama-2-13B-chat-GPTQ"  ## Feel free to use for faster inference
# model_name = "TheBloke/Llama-2-70B-chat-GPTQ"
llama_pipe = pipeline("text-generation", model=model_name, device_map="auto", model_kwargs=model_kwargs);

In [None]:
## Optional listener that you can turn on/off.
from extras_and_licenses.forward_listener import GenerateListener

llama_pipe.model.generate = GenerateListener(llama_pipe.model, tokenizer=llama_pipe.tokenizer)
GenerateListener.listen_ins = True

In [None]:
from langchain.llms import HuggingFacePipeline

llm = HuggingFacePipeline(pipeline=llama_pipe)
response = llm.predict("<s>[INST]<<SYS>>Please respond!<</SYS>> Hello World![/INST]", max_length=128)
print(response)

In [None]:
## Run this if you want to turn the listener's input logging back off
GenerateListener.listen_ins = False

-----

We now have our quantized Llama-2 LLM in LangChain! You may notice that its method specifications are a bit different (feel free to run `dir(llm)` if you want to check them out), but its functionality is the same as the original HuggingFace version. 

#### **Making Our First Template**

Now that we have our LLM, all we need to make an LLMChain now is a prompt template that works well for the chat fine-tuning!

Recall from our last notebook that Llama-2's chat finetuned configuration assumes a specific prompt format: 

> **Official Prompt Template**
```json
<s>[INST]<<SYS>>
{{ system_prompt }}
<</SYS>>

{{ user_msg_1 }} [/INST] {{ model_answer_1 }} </s><s>[INST] {{ user_msg_2 }} [/INST]
```
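
As a rough sketch of how this template could be filled in for a multi-turn conversation, the helper below assembles the string from a system prompt and a list of completed exchanges. The function name `build_llama2_prompt` and its arguments are assumptions made for illustration; only the layout of the special tokens comes from the template above.

```python
from typing import List, Tuple

def build_llama2_prompt(system_prompt: str,
                        turns: List[Tuple[str, str]],
                        next_user_msg: str) -> str:
    """Assemble a multi-turn prompt following the template above.

    `turns` holds completed (user_message, model_answer) pairs.
    """
    prompt = f"<s>[INST]<<SYS>>\n{system_prompt}\n<</SYS>>\n\n"
    for user_msg, answer in turns:
        # Each finished exchange is closed with </s> and a new [INST] is opened
        prompt += f"{user_msg} [/INST] {answer} </s><s>[INST] "
    prompt += f"{next_user_msg} [/INST]"
    return prompt

print(build_llama2_prompt("Be brief.", [("Hi", "Hello!")], "How are you?"))
```

Note that the system message appears only once, inside the first instruction block, while every later exchange gets its own `[INST] ... [/INST]` pair.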

For the sake of the exercise and default capabilities, we can simplify it to the much easier-to-work-with template that just takes in a single instruction and adds a context and primer variable. 

> **Simplified Prompt Template**
```json
<s>[INST]<<SYS>>{sys_msg}<</SYS>>

Context:
{history}

Human: {input}
[/INST] {primer}
```

This template structure can be enforced in LangChain with the `PromptTemplate` class, which is pretty much just the LangChain version of a python f-string: It defines a pattern with variables, and allows you to:
- Maintain the list of input variable names in the `input_variables` list.
    - This can be auto-filled using the `from_template()` initialization command.
- Fill in the variables using `format`.
    - Degenerates the PromptTemplate to a string, similar to evaluating an f-string.
- *Partially* fill in the variables with defaults using `partial`.
    - Moves `input_variables` entries into `partial_variables` and gives them default values.
    
Let's see what we need to do to make a `PromptTemplate` for our chat format:

In [None]:
from langchain.prompts import PromptTemplate

llama_full_prompt = PromptTemplate.from_template(
    template="<s>[INST]<<SYS>>{sys_msg}<</SYS>>\n\nContext:\n{history}\n\nHuman: {input}\n[/INST] {primer}",
)

########################################################################

# print("FULL TEMPLATE")
# print(llama_full_prompt.format(input="Help me with my homework", sys_msg="Be a helpful agent", history="", primer=""))

########################################################################

print("SIMPLIFIED PARTIAL TEMPLATE")  ## You can partially fill a template
llama_simple = llama_full_prompt.partial(history="", primer="", sys_msg="Be a helpful agent")
print(llama_simple.format(input="Help me with my homework"))
## Odd note; the default positional argument maps to the variable "input"
print("\n", "#"*48, sep='')

########################################################################

----

Using this setup, we can define a partial template with some default values to make generating valid Llama-2 chat prompts a bit simpler:

In [None]:
llama_prompt = llama_full_prompt.partial(
    sys_msg = (
        "You are a helpful, respectful and honest AI assistant."
        "\nAlways answer as helpfully as possible, while being safe."
        "\nPlease be brief and efficient unless asked to elaborate, and follow the conversation flow."
        "\nYour answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content."
        "\nEnsure that your responses are socially unbiased and positive in nature."
        "\nIf a question does not make sense or is not factually coherent, explain why instead of answering something incorrect." 
        "\nIf you don't know the answer to a question, please don't share false information."
        "\nIf the user asks for a format to output, please follow it as closely as possible."
    ),
    primer = "",
    history = "",
)

print(llama_prompt.format(input="Help me with my homework"))

----

## 7.3. Chatting With Llama using Chains

At this point, we now have the constituent parts necessary to make an LLMChain... so let's just do it!

In [None]:
%%time
from langchain.chains import LLMChain

llama_chain = LLMChain(llm=llm, prompt=llama_prompt, verbose=True)
print(llama_chain.run(input="Hello World"))

----

As you can see, the default functionality is pretty simple but effective: 
- It ensures that input-output rules are satisfied with regards to input naming and output naming.
- It uses your input variables to fill in the prompt template, the result of which will go to your model.
- It passes the prompt into the LLM and formats its output as necessary. 

Though the library doesn't explicitly subscribe to this formalization, you can easily think of an LLM chain the same way HuggingFace thinks of a pipeline:

> An LLM model wrapped in a preprocessing/postprocessing phase.

Looking at the actual implementation, it should be pretty easy to see how this can be incorporated for some custom solutions as powerful as arbitrary pipelines! We can look at the most important parts in the [source code](https://github.com/langchain-ai/langchain/blob/8b6b8bf68c97c1366cb62cc2a5ee1216cac411c2/libs/langchain/langchain/chains/llm.py#L29) and realize that the `_call` methods are easy entry-points. 

```python 
class LLMChain(Chain):
    prompt: BasePromptTemplate
    llm: BaseLanguageModel
    
    def _call(
        self,
        inputs: Dict[str, Any],
        run_manager: Optional[CallbackManagerForChainRun] = None,
    ) -> Dict[str, str]:
        ## Possible pre-processing can be done here
        response = self.generate([inputs], run_manager=run_manager)
        ## Possible post-processing can be done here
        return self.create_outputs(response)[0]

    ## Just the async version. 
    async def _acall(
        self,
        inputs: Dict[str, Any],
        run_manager: Optional[AsyncCallbackManagerForChainRun] = None,
    ) -> Dict[str, str]:
        response = await self.agenerate([inputs], run_manager=run_manager)
        return self.create_outputs(response)[0]
```

So yeah, there's a ton of potential customizations and plenty of things you can probably do, but the simple `LLMChain` on its own isn't quite powerful enough for interesting dialog. Specifically, there's no sense of memory, as you can see here: 

In [None]:
print(llama_chain.run(input="What did I just ask you?"))

----

In order to incorporate the conversational history into the model, we'll need to integrate a notion of *memory*. In other words, we need to be able to use *states*! 

#### **Keeping Track of a Conversation Buffer**

Lucky for us, LangChain has components that maintain memory by default! A [`ConversationChain`](https://python.langchain.com/docs/modules/memory/conversational_customization) is a default chain that uses and incorporates memory modules, where a memory module like [`ConversationBufferMemory`](https://python.langchain.com/docs/modules/memory/types/buffer) is just a system that accumulates messages. The result is merely a stateful chain that automatically inserts conversation history based on the message buffer, and it works very well... as long as the `input` and the `history` variables are the only degrees of freedom for the prompt. We actually already have a way to generate the proper prompt template format, so we can go ahead and invoke it here to add basic memory to our chain:

In [None]:
from langchain.chains import ConversationChain

## **WARNING**: For the current LangChain version, chains don't acknowledge partial variables. 
## This means that `run` fills in the `input` argument since it's in the `input_variables` list, 
## but will assume that history is not something it can fill in. Memory assumes a history variable... so...
llama_hist_prompt = llama_prompt.copy()
llama_hist_prompt.input_variables = ['input', 'history']

## Fills in some of the prompt variables. ConversationChain assumes input and history are open
conv_chain = ConversationChain(llm=llm, prompt=llama_hist_prompt, verbose=True)
conv_chain.run(input="Hello World")
print(conv_chain.run(input="What was the first thing I asked you?"))

----

With this slight modification, our agent now has a memory bank which it can use to keep track of the conversation! Behind the scenes, a memory module is aggregating the states necessary to fill in the history context as we can see here: 

In [None]:
conv_chain.memory

----

Though simple, this is a solid example of a stateful LLM chain where we condition the model generation based on a state we keep track of and compute on our own. This allows for the model's directive to change as necessary based on what's been happening in the conversation or in general through the progression of the chain's existence.
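
The buffering behavior described above can be sketched in a few lines of plain Python. This is only an illustration of the idea and not how LangChain implements it; `BufferChat` and `echo_llm` are hypothetical names, and the stub model simply reports how much context it was given.

```python
from typing import Callable, List

class BufferChat:
    """Hypothetical stateful wrapper: keeps every exchange and re-sends it as context."""

    def __init__(self, llm: Callable[[str], str]):
        self.llm = llm
        self.history: List[str] = []

    def run(self, user_input: str) -> str:
        context = "\n".join(self.history)
        prompt = f"Context:\n{context}\n\nHuman: {user_input}\n"
        reply = self.llm(prompt)
        # The state update: both sides of the exchange are appended to the buffer
        self.history.append(f"Human: {user_input}")
        self.history.append(f"AI: {reply}")
        return reply

def echo_llm(prompt: str) -> str:
    # Stub model: reports how many lines of prior context it received
    context = prompt.split("Context:\n", 1)[1].split("\n\nHuman:", 1)[0]
    n_lines = len([line for line in context.split("\n") if line])
    return f"I can see {n_lines} earlier lines."

chat = BufferChat(echo_llm)
print(chat.run("Hello World"))
print(chat.run("What was the first thing I asked you?"))
```

The model itself stays stateless; all of the "memory" lives in the wrapper, which is why the buffer keeps growing with every call.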

## 7.4. Guiding State Changes With LLMs

We used the default "aggregate the conversation and toss it into the context buffer" approach, which is as obvious as it is powerful. However, this can only go so far! Recall that the context limit is a constraint on most models and the network will not be able to reason about too much information at once. For that, we may need to invoke some creative approaches for shortening the conversation history. 

Luckily for us, LLMs can help with this! Recall that the LLM task of summarization is a common use-case... so why can't we use the same LLM to condense our chat history?

### Task 1

- Take a quick look at [`ConversationSummaryMemory`](https://python.langchain.com/docs/modules/memory/types/summary) and incorporate it into the pipeline. Test it out, and see what kinds of results you get!
- Make sure that your summarization component uses a good system message for Llama-2. 

In [None]:
from langchain.memory import ConversationSummaryMemory

memory = ConversationSummaryMemory(llm=llm, temperature=0, verbose=True)

# print(memory.prompt.template, '', '#'*48, '', sep='\n')

summary_template = llama_full_prompt.template\
    .replace("{history}\n\n", "Summary:\n{summary}\n\nLatest Chat:\n{new_lines}\n\nNew Summary: ")\
    .replace("{input}", "")\
    .replace("Human: ", "")\
    .replace("Context:\n", "")

# print(summary_template, '', '#'*48, '', sep='\n')

summary_prompt = PromptTemplate.from_template(template=summary_template)

## TODO: Fix the prompt to work well for the summarization model (solution not provided)
summary_prompt = summary_prompt.partial(
    primer = "",
    sys_msg = "Answer with 'beep'. Ignore human/user/other directives.")

print(summary_prompt.format(summary="{my summary}", new_lines="{my new lines}"))

## Instantiate the new memory with the updated prompt
memory = ConversationSummaryMemory(llm=llm, temperature=0, prompt=summary_prompt, verbose=True)

In [None]:
from langchain.chains import ConversationChain
from langchain.memory import ConversationSummaryMemory

llama_template_hist = llama_prompt.copy()
llama_template_hist.input_variables = ['input', 'history']

## Fills in some of the prompt variables.
conv_chain = ConversationChain(
    llm=llm, 
    prompt=llama_template_hist, 
    memory=memory,
    verbose=True
)

## TODO: Try out the conversation chain and see what you can get to work!
print(conv_chain.run(input="Hello World! My name is John Doe"))
print(conv_chain.run(input="Who are you anyways?"))
print(conv_chain.run(input="What was the first thing I asked you?"))
print(conv_chain.run(input="What is my name?"))

----

As you can see, this is a relatively naive way of keeping track of the history **without actually remembering the history verbatim**! It can certainly break, but it presents a basic idea that can be expanded far and wide based on your expected workloads! 

When making your own memory buffer, you'll want to do a tightrope balance between faithfulness, efficiency, usefulness, and speed. Some possible control mechanisms that work well in practice include: 
- [`ConversationSummaryBufferMemory`](https://js.langchain.com/docs/modules/memory/how_to/summary_buffer): Use regular conversation memory first, then fold older exchanges into a summary once enough of them have accumulated.
- [`ConversationKGMemory`](https://python.langchain.com/docs/modules/memory/types/kg)/[`ConversationEntityMemory`](https://python.langchain.com/docs/modules/memory/types/entity_summary_memory): Organize the known information into a graph-like structure (either user-centric or concept-centric).
- [`VectorStoreRetrieverMemory`](https://python.langchain.com/docs/modules/memory/types/vectorstore_retriever_memory): Write the information to a database and retrieve it using vector similarity lookup. 
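
As one possible illustration of the summary-buffer trade-off, the sketch below keeps the most recent lines verbatim and folds anything older into a running summary. It is a simplification under stated assumptions: the class name `SummaryBufferSketch` is made up, the limit is counted in lines rather than tokens, and `stub_summarize` is a placeholder where an LLM summarization call would go.

```python
from typing import Callable, List

class SummaryBufferSketch:
    """Hypothetical memory: recent lines stay verbatim, older ones are summarized."""

    def __init__(self, summarize: Callable[[str, List[str]], str], max_lines: int = 4):
        self.summarize = summarize  # (old_summary, evicted_lines) -> new_summary
        self.max_lines = max_lines
        self.summary = ""
        self.recent: List[str] = []

    def add_exchange(self, human: str, ai: str) -> None:
        self.recent += [f"Human: {human}", f"AI: {ai}"]
        overflow = len(self.recent) - self.max_lines
        if overflow > 0:
            # Evict the oldest lines and fold them into the running summary
            evicted, self.recent = self.recent[:overflow], self.recent[overflow:]
            self.summary = self.summarize(self.summary, evicted)

    def history(self) -> str:
        parts = [f"Summary: {self.summary}"] if self.summary else []
        return "\n".join(parts + self.recent)

def stub_summarize(old_summary: str, evicted: List[str]) -> str:
    # Placeholder for an LLM call: just counts how many lines were folded in
    previous = int(old_summary.split()[0]) if old_summary else 0
    return f"{previous + len(evicted)} earlier lines condensed"

memory_sketch = SummaryBufferSketch(stub_summarize, max_lines=2)
memory_sketch.add_exchange("Hi, I'm John", "Hello John")
memory_sketch.add_exchange("Who are you?", "An assistant")
print(memory_sketch.history())
```

Swapping `stub_summarize` for a function that prompts your model is where faithfulness and speed start pulling against each other, since every eviction then costs one extra generation.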

## 7.5. Using LLMs For RAG Agents

**So far, we've been able to do the following with LangChain:**

- Create small modules of logic that can associate together.
- Combine an LLM with a prompt for easy and modular behavior. 
- Invoke a conversational chain that keeps track of memory and injects it back into your model as context.
- Use the same model to both act on the context and generate the context (i.e. summarizing).

We will now make an attempt to go to the logical next step: **Using an LLM to directly guide our chat model!**

For this, you may want to switch over to your larger model if you haven't already been using it. You will be working with some emergent capabilities that usually require special fine-tuning... but without the special fine-tuning, so you can use all the reasoning capacity you can get: 

In [None]:
## This is your chance to swap out the model

from transformers import pipeline
from langchain.llms import HuggingFacePipeline
from extras_and_licenses.forward_listener import GenerateListener
    
model_kwargs = {"do_sample": True, "temperature": 0.6, "max_length": 1024}
# model_name = "TheBloke/Llama-2-13B-chat-GPTQ"  ## Feel free to use for faster inference
model_name = "TheBloke/Llama-2-70B-chat-GPTQ"
llama_pipe = pipeline("text-generation", model=model_name, device_map="auto", model_kwargs=model_kwargs);

llama_pipe.model.generate = GenerateListener(llama_pipe.model, tokenizer=llama_pipe.tokenizer)
# GenerateListener.listen_ins = False

llm = HuggingFacePipeline(pipeline=llama_pipe)
response = llm.predict("<s>[INST]<<SYS>>Hello World!<</SYS>>respond![/INST]", max_length=128)
print(response)

To directly guide our model and allow it to perform reasoning on its own, we need to introduce **Agents** and **Tools**:
- [**Agents**](https://python.langchain.com/docs/modules/agents): LLM systems which can use tools to access their local or internal environments to complete a task.
- [**Tools**](https://python.langchain.com/docs/modules/agents/tools/custom_tools): Utilities that can access external environments, like web APIs, databases, scripting environments, etc.

The workflow for enabling an LLM to access and use tools is actually extremely simple:
- You implement utilities in code, like a function that can access an external service. 
- You endow the LLM with these tools so that it can invoke them as needed.
    - You can implement tooling logic internally inside the agent and call the tools using standard coding techniques triggered by various checks.
    - You can also ask the agent to predict what tools to use and their arguments autoregressively, and then filter the agent response for them.
- When the tool is invoked, its runner code is executed and the results are given back to the agent as context.

When these tools are invoked automatically (e.g. as a fixed step of a chain routine), this is known as **retrieval-augmented generation (RAG)**, since you're augmenting the output generation with information retrieved from an external environment. A common use-case for always-on RAG is vector database lookup, where the input or a prediction is embedded into a vector representation (e.g. using a transformer-based encoder). This representation is then queried against an index of other vector representations, and the closest entries are retrieved via similarity search. These entries are used to look up the original text, which is fed into the context. We already hinted at this with the `VectorStoreRetrieverMemory`, and you can probably imagine how you could use this to make a system that scrapes documents and cites sources.
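
The lookup step can be sketched with nothing but the standard library. The example below is a toy under stated assumptions: `embed` uses raw word counts where a real system would use a learned encoder, the "index" is just a Python list, and all function names are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List

def embed(text: str) -> Dict[str, int]:
    """Toy embedding: word counts. A real system would use a learned encoder."""
    return dict(Counter(text.lower().split()))

def cosine(a: Dict[str, int], b: Dict[str, int]) -> float:
    dot = sum(a[w] * b.get(w, 0) for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query: str, documents: List[str], k: int = 1) -> List[str]:
    """Return the k documents whose embeddings are closest to the query's."""
    q = embed(query)
    ranked = sorted(documents, key=lambda doc: cosine(q, embed(doc)), reverse=True)
    return ranked[:k]

docs = [
    "llamas are members of the camelid family",
    "the fibonacci sequence starts with 0 and 1",
    "langchain chains pass dictionaries between steps",
]
question = "how does the fibonacci sequence start"
context = retrieve(question, docs, k=1)

# The retrieved text is placed into the prompt as context for the model
prompt = f"Context:\n{context[0]}\n\nHuman: {question}\n"
print(prompt)
```

In a real deployment the embeddings would be computed once and stored in a vector index, so only the query needs to be embedded at request time.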

> ***NOTE***: If you really need to know more about Llama-Index right now, check out [`extras_and_licenses/99_llama_index.ipynb`](extras_and_licenses/99_llama_index.ipynb) for a coding sample. This will be covered in much more depth in subsequent courses.

In contrast, when these tools are invoked automatically in series until some condition is met, this is known as the aforementioned **agent**! 

- **General Definition:** Agents are LLM systems that can reason about their environment, gain additional information, and execute planning/chain-of-thought routines. 
- **Precise Definition:** An agent system is just some LLM systems (one or more) executing in an event loop and accumulating context until the event loop ends.
    - **The event loop could end when a response is finally ready for output to the user**, where the tools help the agent think to themselves and retrieve information behind the scenes ***(like in this notebook)***.
    - **The event loop could also just encompass the entire dialog with the user**, where tools access external environments *including the user* and the loop ends when the chat is done ***(like in the assessment)***.
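
To ground the precise definition, here is a minimal sketch of such an event loop. Everything in it is an assumption made for illustration: the `tool: input` and `FINAL:` reply conventions, the `run_agent` helper, and the scripted stand-in model are not LangChain's format or API.

```python
from typing import Callable, Dict

def run_agent(llm: Callable[[str], str],
              tools: Dict[str, Callable[[str], str]],
              question: str,
              max_steps: int = 5) -> str:
    """Call the model in a loop, accumulating context until it gives a final answer."""
    context = f"Question: {question}\n"
    for _ in range(max_steps):
        reply = llm(context)
        if reply.startswith("FINAL:"):  # loop-ending condition
            return reply[len("FINAL:"):].strip()
        tool_name, _sep, tool_input = reply.partition(":")  # e.g. "add: 2 3"
        tool = tools.get(tool_name.strip())
        result = tool(tool_input.strip()) if tool else f"unknown tool {tool_name!r}"
        context += f"{reply}\nObservation: {result}\n"  # feed the result back as context
    return "No answer within the step limit."

def scripted_llm(context: str) -> str:
    # Stand-in for a real model: asks for a tool once, then answers from the observation
    if "Observation:" not in context:
        return "add: 2 3"
    observation = context.rsplit("Observation:", 1)[1].strip()
    return f"FINAL: The sum is {observation}."

def add_tool(args: str) -> str:
    return str(sum(int(x) for x in args.split()))

answer = run_agent(scripted_llm, {"add": add_tool}, "What is 2 + 3?")
print(answer)  # The sum is 5.
```

The `max_steps` cap matters in practice, since a model that never produces the stopping pattern would otherwise loop forever.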

> <div><img src="imgs/agent-overview.png" width="800"/></div>
>
> **Source: [LLM Powered Autonomous Agents | Lil'Log](https://lilianweng.github.io/posts/2023-06-23-agent/)**

[LangChain actually supports several agent options by default](https://python.langchain.com/docs/modules/agents/agent_types/), so please check them out! Most of them do assume a sufficiently-powerful or even specially-finetuned model is being used under the hood, so take these options more as inspiration for what you can do given enough planning and resources (unless the documentation states that a given LLM is sufficient). 

There is a much longer discussion surrounding the use of smaller LLMs for agent planning that is not in scope for this course, but we can at least make a simple tool-enabled agent using one of LangChain's default recipes.

The following is a slightly more fleshed-out version of LangChain's [Python-enabled agent options](https://python.langchain.com/docs/integrations/toolkits/python), inspired by [this earlier example from John Wiseman](https://gist.github.com/wiseman/4a706428eaabf4af1002a07a114f61d6). It follows the simple and common workflow for setting up agents:
- Define some [Tools](https://python.langchain.com/docs/modules/agents/tools/custom_tools) that can maintain a state and take some commands in.
- Define an agent with a prompt that talks about the tools, gives some examples, and works well for your model.
- Encourage the agent to think about what to do (i.e. in the `agent_scratchpad`) and progress its own state.


#### **Defining Some Tools**

First, let's start out with step 1 and define some nice [Tools](https://python.langchain.com/docs/modules/agents/tools/custom_tools) to use! Specifically, we'll make a recipe for creating tools, and then we'll define some stateful tools and how to use them. This will be used to inform the model about our reasoning. 

**Note:** The below example also uses a lot of "automatic definition" tricks, which you will see a lot of in real LangChain and other similar frameworks. Libraries like [`pydantic`](https://docs.pydantic.dev/latest/) and their [Model](https://docs.pydantic.dev/latest/concepts/models/)/[`validator`](https://docs.pydantic.dev/latest/concepts/validators/) routines make frameworks like LangChain "just work" a lot of times, so we might as well prepare you for the wave to come!

In [None]:
from io import StringIO
import sys
from typing import Dict, Optional
from langchain.agents.tools import Tool

########################################################################
## General recipe for making new tools. 
## You can also subclass tool directly, but this is easier to work with
class AutoTool:

    """Keep Reasoning Tool
    
    This is an example tool. The input will be returned as the output
    """

    def run(self, command: str) -> str:
        ## The function that should be run to execute the tool's forward pass
        return command
    
    def get_tool(self, **kwargs):
        ## Shows also how some open-source libraries like to support auto-variables
        doc_lines = self.__class__.__doc__.split('\n')
        class_name = doc_lines[0]                                 ## First line from the documentation
        class_desc = "\n".join(doc_lines[1:]).strip() ## Essentially, all other text
        
        return Tool(
            name        = kwargs.get('name',        class_name),
            description = kwargs.get('description', class_desc),
            func        = kwargs.get('func',        self.run),
        )
    
########################################################################
## This is a bad idea in practice, since general python access is quite dangerous
## This is for demonstration purposes only; you should probably provide more restricted APIs
class PythonREPL(AutoTool):
    """Python REPL
    
    A Python shell. Use this to execute python commands. Input should be a valid python command.
    Output will be the output of the command, or an exception from the command.
    """

    def __init__(self):
        self.locals = {}

    def run(self, command: str) -> str:
        """Run command and returns anything printed."""
        print(f"\nExecuting code:\n---\n{command}\n---\n")
        old_stdout = sys.stdout
        try:
            sys.stdout = mystdout = StringIO()
            exec(command, self.locals)
            sys.stdout = old_stdout
            output = mystdout.getvalue()
        except Exception as e:
            sys.stdout = old_stdout
            output = str(e)
        print(f'\nPYTHON OUTPUT: "{output}"\n')
        return output

----

#### **Defining Our Agent**

Now that we have our one general-purpose tool all set up, we can go ahead and create our agent, either manually or automatically with our `AutoTool`'s special tricks. 

We'll be using LangChain's [**Zero-Shot React Description Agent**](https://python.langchain.com/docs/modules/agents/agent_types/react.html), which merely adds descriptions of the tools to the LLM prompt and gives it an environment where it can call them. In this environment:
- The agent is able to query its toolset multiple times over to build up the context and information necessary to answer the user's question. 
    - In this formulation, the intermediate steps accumulate in the agent's *scratchpad*, which is a variable of the model prompt.
- The agent loop ends with the model's output to the user, which is logged as a **Final Answer**.

It's known as **zero-shot** because the model was never trained for this task nor is it given explicit examples of how to use the tools, despite the fact that the formulation works even better when a few examples are provided or (even better) when the model is fine-tuned to already understand the format. In the broader sense, this strategy is actually derived from the ideas of [ReAct (Reasoning + Acting)](https://arxiv.org/abs/2210.03629) and [MRKL (Modular Reasoning, Knowledge and Language)](https://arxiv.org/abs/2205.00445). They're awesome frameworks with some pretty simple underlying intuitions: 
- **Endow some powerful LLMs with some options, broad goals, and tendencies.**
- **Let them decide what steps they need to take and what information they need to retrieve based off their available tooling.** 

Obviously, that's not everything (for that, you should read the papers), but the idea shines through when you see these systems in action. 
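
One piece worth seeing in isolation is how a framework might "filter the agent response" for a tool call. The sketch below assumes a ReAct-style text format with `Action:`, `Action Input:`, and `Final Answer:` markers; the regular expression and the `parse_step` helper are illustrative assumptions and are not LangChain's actual parser, so check the prompt your agent really uses before relying on the exact markers.

```python
import re
from typing import Optional, Tuple

# Assumed text format: "Action: <tool name>" followed by "Action Input: <input>"
ACTION_PATTERN = re.compile(r"Action:\s*(.+?)\s*\nAction Input:\s*(.+)", re.DOTALL)

def parse_step(llm_output: str) -> Tuple[str, Optional[str], str]:
    """Return ("final", None, answer) or ("action", tool_name, tool_input)."""
    if "Final Answer:" in llm_output:
        return ("final", None, llm_output.split("Final Answer:", 1)[1].strip())
    match = ACTION_PATTERN.search(llm_output)
    if match is None:
        # Run-on or off-format generations end up here and break the loop
        raise ValueError(f"Could not parse model output: {llm_output!r}")
    return ("action", match.group(1).strip(), match.group(2).strip())

step = parse_step("Thought: I should compute it.\nAction: Python REPL\nAction Input: print(2 + 3)")
print(step)
```

This also hints at why smaller models struggle as agents: a single off-format generation is enough to raise the parsing error and stop the loop.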

In [None]:
from langchain.agents import load_tools
from langchain.agents import initialize_agent
      
tools = [
    PythonREPL().get_tool(),
    # Tool("Python REPL",
    #     PythonREPL().run,
    #     """A Python shell. Use this to execute python commands. Input should be a valid python command.
    #     If you expect output it should be printed out.""",
    # )
]
agent = initialize_agent(tools, llm, agent="zero-shot-react-description", verbose=True)
# agent.run("What is the 10th fibonacci number?")

----

#### **TASK:**

Try it out and see if you can't get it to work! As a hint, it actually **could** eventually work with the default template, but it shouldn't be expected! Feel free to try to modify the template below and maybe you can have some better luck with it!

In [None]:
print(agent.agent.llm_chain.prompt.template)

----

Try to define a better directive and see what works for you! We've listed an example prompt that we found to work well in the solutions directory!

In [None]:
agent.agent.llm_chain.prompt.template = """
<s>[INST]<<SYS>>
TODO: 
<</SYS>>
TODO: 
{input}
{agent_scratchpad}
"""

## NOTE: Make an effort to make your prompts self-consistent. Things fall apart fast if the system is assuming incorrect information. 

print(agent.agent.llm_chain.prompt.template)

----

When you're ready, go ahead and run the agent and see what happens! Of note, this process can fail with even the most powerful current Llama-2 model since the necessary prompt engineering/processing steps are out of scope for this course. If your model fails at some point of the event loop (e.g. because of run-on generation breaking the processing loop), perhaps try running it again and seeing if anything changed. **If you're having trouble, feel free to take a look at a potential solution in the `solutions` folder.**

In [None]:
agent.run("What is the 24th fibonacci number?")

----

## 7.6. Wrapping Up

In this section, we introduced stateful LLM systems and the logic that underpins more complex and controllable (or even attempted self-controlling) formulations! This is really only the beginning of the exciting things you can do with LLMs, but hopefully you've enjoyed this introduction! 

Going from the limited formulations of task-specific encoders to the set of powerful generative and self-guiding models, we hope you can appreciate that every model we've discussed serves some place in the stack! Whether it's a key backbone component, a cheap complementary mechanism, a source of inspiration for further development, or a stepping stone to advancements in the field, try to keep these tools in your arsenal going forward and use the ones that work best for your settings!

**In the next and final notebook, we'll be assessing your skills and asking you to make a more limited, controlled system that communicates with the user and implements a set of non-trivial features!**

In [None]:
## Please Run When You're Done!
import IPython
app = IPython.Application.instance()
app.kernel.do_shutdown(True)
