In [13]:
%%capture
!pip install unsloth
!pip install transformers==4.44.2

### Initialize the Model and Tokenizer Using Unsloth

Here we use a custom model that I fine-tuned with Unsloth on the Salesforce/xlam-function-calling-60k dataset. The fine-tuning was done with a LoRA adapter, so the same base weights serve two purposes: with the adapter enabled the model performs function calling, and with the adapter disabled it handles thinking and reasoning. Only one model has to be held in memory. A similar setup works with larger models such as Llama 3.1 8B, which already has function calling built in.

In [9]:
from unsloth import FastLanguageModel
max_seq_length = 2048 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.


model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "akshayballal/phi-3.5-mini-xlam-function-calling",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)

FastLanguageModel.for_inference(model);

==((====))==  Unsloth 2024.10.7: Fast Llama patching. Transformers = 4.44.2.
   \\   /|    GPU: Tesla T4. Max memory: 14.748 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.5.0+cu121. CUDA = 7.5. CUDA Toolkit = 12.1.
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.28.post2. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


### Stopping Criteria

Add a stopping criterion so that generation halts when the agent emits the word "PAUSE", which the system prompt instructs it to output after each action. Stopping there lets the loop run the tool and feed the result back to the agent as an observation.

In [3]:
from transformers import StoppingCriteria, StoppingCriteriaList
import torch

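class KeywordsStoppingCriteria(StoppingCriteria):
    """Stop generation as soon as the last generated token is one of `keywords_ids`."""

    def __init__(self, keywords_ids: list):
        super().__init__()
        self.keywords = keywords_ids

    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor, **kwargs) -> bool:
        # input_ids has shape (batch, sequence); only the first sequence is checked
        return input_ids[0][-1].item() in self.keywords


# Token ID on which to stop (the "PAUSE" keyword from the system prompt)
stop_ids = [17171]
stop_criteria = KeywordsStoppingCriteria(stop_ids)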
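
The criterion above compares only the last generated token against a hard-coded ID. If a stop word were split into several tokens, the tail of the sequence would have to be compared instead. Below is a minimal sketch of that check on plain Python lists. The helper `ends_with_any` and the example IDs are hypothetical and not part of the notebook.

```python
from typing import Sequence


def ends_with_any(token_ids: Sequence[int], stop_sequences: Sequence[Sequence[int]]) -> bool:
    """Return True if token_ids ends with any of the given stop sequences."""
    for stop in stop_sequences:
        n = len(stop)
        if n and len(token_ids) >= n and list(token_ids[-n:]) == list(stop):
            return True
    return False


# Made-up IDs for illustration; real IDs come from the tokenizer.
generated = [5, 9, 17171]
print(ends_with_any(generated, [[17171]]))     # single-token stop word
print(ends_with_any(generated, [[9, 17171]]))  # two-token stop sequence
print(ends_with_any(generated, [[9, 5]]))      # present, but not at the end
```

Inside `__call__`, the same helper could be applied to `input_ids[0].tolist()`.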

### Create the Tools

Here we write the functions that the agent can call. To be usable for function calling, each function needs type-annotated parameters for its inputs, a return annotation for its output, and a docstring that describes what it does, because the tool descriptions are generated from these.

In [15]:
!pip install embed_anything


Collecting embed_anything
  Downloading embed_anything-0.4.10-cp310-cp310-manylinux_2_34_x86_64.whl.metadata (12 kB)
Downloading embed_anything-0.4.10-cp310-cp310-manylinux_2_34_x86_64.whl (16.9 MB)
Installing collected packages: embed_anything
Successfully installed embed_anything-0.4.10


In [27]:
import embed_anything  # needed for embed_anything.embed_query in RagAnything
from embed_anything import EmbedData, EmbeddingModel, TextEmbedConfig, WhichModel
from embed_anything.vectordb import Adapter


In [43]:
def add_numbers(a: int, b: int) -> int:
    """
    This function takes two integers and returns their sum.

    Parameters:
    a (int): The first integer to add.
    b (int): The second integer to add.
    """
    return a + b

def square_number(a: int) -> int:
    """
    This function takes an integer and returns its square.

    Parameters:
    a (int): The integer to be squared.
    """
    return a * a

def square_root_number(a: int) -> float:
    """
    This function takes an integer and returns its square root.

    Parameters:
    a (int): The integer to calculate the square root of.
    """
    return a ** 0.5

def RagAnything(prompt: str) -> list:
    """
    This function takes a query and searches for relevant information in the vector database and returns the answer.

    Parameters:
    prompt (str): The query to be answered.
    """
    # Embed the query
    model = EmbeddingModel.from_pretrained_hf(
        WhichModel.Bert, model_id="sentence-transformers/all-MiniLM-L6-v2"
    )
    embed_query = embed_anything.embed_query([prompt], embeder=model)

    # `index` is the Pinecone index created in a later cell; it must exist before this tool is called
    result = index.query(
        vector=embed_query[0].embedding,
        top_k=2,
        include_metadata=True,
    )
    print(result)
    # Return the text of the best-matching chunks
    metadata_texts = [item['metadata']['text'] for item in result['matches']]
    return metadata_texts




In [44]:
tools = [add_numbers, square_number, square_root_number, RagAnything] # add the tools to a list
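
`RagAnything` delegates the search to a Pinecone index. Conceptually, a vector query ranks the stored chunks by similarity to the query embedding and returns the top matches. The following is a brute-force sketch of that idea in plain Python. It assumes cosine similarity (the notebook does not state which metric the index uses) and uses made-up three-dimensional vectors. `cosine`, `top_k`, and `records` are illustrative names, and a real vector database typically uses approximate search rather than a full scan.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_k(query_vec, records, k=2):
    """records is a list of (text, vector) pairs; return the k texts most similar to query_vec."""
    ranked = sorted(records, key=lambda r: cosine(query_vec, r[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


# Toy "index": each chunk of text is stored with its (made-up) embedding.
records = [
    ("chunk about attention heads", [0.9, 0.1, 0.0]),
    ("chunk about optimizers", [0.0, 0.2, 0.9]),
    ("chunk about self-attention", [0.8, 0.3, 0.1]),
]
print(top_k([1.0, 0.2, 0.0], records, k=2))
```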

### Generate Tool Descriptions

We generate a description of each tool in a form the function-calling model can read: a list of dictionaries, each holding the tool's name, its docstring, and the names and types of its parameters.


In [45]:
tool_descriptions = []
for tool in tools:
    spec = {
        "name": tool.__name__,
        "description": tool.__doc__.strip(),
        "parameters": [
            {
                "name": param,
                "type": arg.__name__ if hasattr(arg, '__name__') else str(arg),
            } for param, arg in tool.__annotations__.items() if param != 'return'
        ]
    }
    tool_descriptions.append(spec)
tool_descriptions


[{'name': 'add_numbers',
  'description': 'This function takes two integers and returns their sum.\n\n    Parameters:\n    a (int): The first integer to add.\n    b (int): The second integer to add.',
  'parameters': [{'name': 'a', 'type': 'int'}, {'name': 'b', 'type': 'int'}]},
 {'name': 'square_number',
  'description': 'This function takes an integer and returns its square.\n\n    Parameters:\n    a (int): The integer to be squared.',
  'parameters': [{'name': 'a', 'type': 'int'}]},
 {'name': 'square_root_number',
  'description': 'This function takes an integer and returns its square root.\n\n    Parameters:\n    a (int): The integer to calculate the square root of.',
  'parameters': [{'name': 'a', 'type': 'int'}]},
 {'name': 'RagAnything',
  'description': 'This function takes a query and searches for relevant information in the vector database and returns the answer.\n\n    Parameters:\n    prompt (str): The query to be answered.',
  'parameters': [{'name': 'prompt', 'type': 'str'}]}]
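
The descriptions above are built from `__annotations__`, so parameters without annotations are skipped and default values are not reported. As a possible alternative, the sketch below uses `inspect.signature` and `inspect.getdoc` from the standard library. `describe_tool` and the example function `scale` are hypothetical and not part of the notebook.

```python
import inspect


def describe_tool(func):
    """Build a description dict for func from its signature and docstring."""
    params = []
    for name, p in inspect.signature(func).parameters.items():
        if p.annotation is inspect.Parameter.empty:
            type_name = "any"  # no annotation given
        else:
            type_name = getattr(p.annotation, "__name__", str(p.annotation))
        params.append({
            "name": name,
            "type": type_name,
            # A parameter is required when it has no default value
            "required": p.default is inspect.Parameter.empty,
        })
    return {
        "name": func.__name__,
        # getdoc also strips the indentation that __doc__ keeps
        "description": inspect.getdoc(func) or "",
        "parameters": params,
    }


def scale(value: float, factor: float = 2.0) -> float:
    """Multiply value by factor."""
    return value * factor


print(describe_tool(scale))
```

Marking parameters as required or optional gives the function-calling model one more hint about which arguments it may omit.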

### Create the Agent Class

We then create the agent class. It is initialized with the system prompt, the function-calling prompt, and the tools, and it keeps the conversation in a list of messages.

- `__call__` is invoked when the agent is called with a message. It appends the message to the message list, generates a reply, stores the reply, and returns it.
- `execute` generates the agent's reply from the message history. It runs the model with the adapter disabled, so this is the thinking and reasoning step.
- `function_call` turns an action into a tool call. It runs the model with the adapter enabled to produce the function name and arguments, then executes the matching tool and returns its output.
- `run_tool` looks up a tool by name and calls it with the given arguments.



In [46]:
import ast


class Agent:
    def __init__(
        self, system: str = "", function_calling_prompt: str = "", tools=None
    ) -> None:
        self.system = system
        self.tools = tools if tools is not None else []  # avoid a shared mutable default
        self.function_calling_prompt = function_calling_prompt
        self.messages: list = []
        if self.system:
            self.messages.append({"role": "system", "content": system})

    def __call__(self, message=""):
        if message:
            self.messages.append({"role": "user", "content": message})
        result = self.execute()
        self.messages.append({"role": "assistant", "content": result})
        return result

    def execute(self):
        with model.disable_adapter():  # disable the adapter for thinking and reasoning
            inputs = tokenizer.apply_chat_template(
                self.messages,
                tokenize=True,
                add_generation_prompt=True,
                return_tensors="pt",
            )
            output = model.generate(
                input_ids=inputs,
                max_new_tokens=128,
                stopping_criteria=StoppingCriteriaList([stop_criteria]),
            )
            return tokenizer.decode(
                output[0][inputs.shape[-1] :], skip_special_tokens=True
            )

    def function_call(self, message):
        inputs = tokenizer.apply_chat_template(
            [
                {
                    "role": "user",
                    "content": self.function_calling_prompt.format(
                        tool_descriptions=tool_descriptions, query=message
                    ),
                }
            ],
            tokenize=True,
            add_generation_prompt=True,
            return_tensors="pt",
        )
        # Greedy decoding: a temperature of 0.0 is not a valid sampling temperature
        output = model.generate(input_ids=inputs, max_new_tokens=128, do_sample=False)
        prompt_length = inputs.shape[-1]

        answer = ast.literal_eval(
            tokenizer.decode(output[0][prompt_length:], skip_special_tokens=True)
        )[
            0
        ]  # get the output of the function call model as a dictionary
        print(answer)
        tool_output = self.run_tool(answer["name"], **answer["arguments"])
        return tool_output

    def run_tool(self, name, *args, **kwargs):
        for tool in self.tools:
            if tool.__name__ == name:
                return tool(*args, **kwargs)
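
`function_call` assumes that the model's output is a Python literal: a list of dictionaries with `name` and `arguments` keys. `ast.literal_eval` raises an exception on anything else, for example JSON output containing `true` or `null`. Below is a hedged sketch of a more defensive parser that falls back to JSON and validates the structure. `parse_tool_call` is a hypothetical helper and not part of the notebook.

```python
import ast
import json


def parse_tool_call(text: str) -> dict:
    """Parse model output into {'name': ..., 'arguments': {...}} or raise ValueError."""
    text = text.strip()
    try:
        parsed = ast.literal_eval(text)  # Python-literal form, as in the notebook
    except (ValueError, SyntaxError):
        try:
            parsed = json.loads(text)  # fall back to JSON
        except json.JSONDecodeError as err:
            raise ValueError(f"could not parse tool call: {text!r}") from err
    if isinstance(parsed, list):
        if not parsed:
            raise ValueError("empty tool call list")
        parsed = parsed[0]  # use the first call, as the notebook does
    if not isinstance(parsed, dict) or "name" not in parsed:
        raise ValueError(f"unexpected tool call structure: {parsed!r}")
    arguments = parsed.get("arguments", {})
    if not isinstance(arguments, dict):
        raise ValueError("'arguments' must be a dictionary")
    return {"name": parsed["name"], "arguments": arguments}


print(parse_tool_call("[{'name': 'add_numbers', 'arguments': {'a': 2, 'b': 3}}]"))
print(parse_tool_call('[{"name": "square_number", "arguments": {"a": 4}}]'))
```

Raising a `ValueError` with the offending text lets the calling loop report the failure back to the agent as an observation instead of crashing.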

### Define the System Prompt and Function Calling Prompt

The system prompt follows the ReAct pattern. The agent is asked to think about the question, decide on an action, pause to receive the observation, and finally give the answer. Because `system_prompt` is an f-string, `{tools}` is filled in with the printed form of the tool list, so the model sees the function names. We also define the function-calling prompt, which is filled in with the tool descriptions and the query when a tool has to be called.

In [41]:
system_prompt = f"""
You run in a loop of Thought, Action, PAUSE, Observation.
At the end of the loop you output an Answer
Use Thought to describe your thoughts about the question you have been asked.
Use Action to run one of the actions available to you - then return PAUSE.
Observation will be the result of running those actions. Stop when you have the Answer.
Your available actions are:

{tools}


""".strip()

function_calling_prompt = """
You are a helpful assistant. Below are the tools that you have access to.  \n\n### Tools: \n{tool_descriptions} \n\n### Query: \n{query} \n
"""

In [35]:
!pip install pinecone-client

Collecting pinecone-client
  Downloading pinecone_client-5.0.1-py3-none-any.whl.metadata (19 kB)
Collecting pinecone-plugin-inference<2.0.0,>=1.0.3 (from pinecone-client)
  Downloading pinecone_plugin_inference-1.1.0-py3-none-any.whl.metadata (2.2 kB)
Collecting pinecone-plugin-interface<0.0.8,>=0.0.7 (from pinecone-client)
  Downloading pinecone_plugin_interface-0.0.7-py3-none-any.whl.metadata (1.2 kB)
Downloading pinecone_client-5.0.1-py3-none-any.whl (244 kB)
Downloading pinecone_plugin_inference-1.1.0-py3-none-any.whl (85 kB)
Downloading pinecone_plugin_interface-0.0.7-py3-none-any.whl (6.2 kB)
Installing collected packages: pinecone-plugin-interface, pinecone-plugin-inference, pinecone-client
Successfully installed pinecone-client-5.0.

In [36]:
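import os

from pinecone import Pinecone

# Read the API key from the environment instead of hard-coding it in the notebook
pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])

# Connect to the existing index that holds the embedded documents
index = pc.Index("anything")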


In [48]:
import re

def loop_agent(agent: Agent, question, max_iterations=5):
    """Run the Thought/Action/PAUSE/Observation loop until an Answer appears."""
    next_prompt = question
    result = ""
    for _ in range(max_iterations):
        result = agent(next_prompt)
        print(result)
        if "Answer:" in result:
            return result

        actions = re.findall(r"Action: (.*)", result)
        if actions:
            # Pass the first action as a string, not the whole list of matches
            tool_output = agent.function_call(actions[0])
            next_prompt = f"Observation: {tool_output}"
            print(next_prompt)
        else:
            next_prompt = "Observation: tool not found"
    return result


agent = Agent( system=system_prompt, function_calling_prompt=function_calling_prompt, tools=tools)

loop_agent(agent, "what is attention?")

Thought: To answer this question, I need to understand the concept of attention in a cognitive or psychological context.

Action: Use the function RagAnything to search for a general definition of attention.

PAUSE
{'name': 'RagAnything', 'arguments': {'prompt': 'What is attention?'}}
{'matches': [{'id': '2596216e-f4a5-471b-9487-d0378655de6b',
              'metadata': {'file': 'attention.pdf',
                           'text': 'Where the projections are parameter '
                                   'matrices W Q\n'
                                   'i ∈ Rdmodel×dk , W K\n'
                                   'i ∈ Rdmodel×dk , W V\n'
                                   'i ∈ Rdmodel×dv\n'
                                   'and W O ∈ Rhdv×dmodel.\n'
                                   '\n'
                                   'In this work we employ h = 8 parallel '
                                   'attention layers, or heads. For each of '
                                   'these we u

'Thought: The provided text gives a detailed explanation of attention mechanisms, particularly in the context of the Transformer model. Attention, in this context, seems to be a method used in sequence-to-sequence models to allow each position in one sequence to attend to all positions in another sequence.\n\nAnswer: Attention, in the context of sequence-to-sequence models like the Transformer, is a mechanism that allows every position in the decoder to attend over all positions in the input sequence. This is used in "encoder-decoder attention" layers, where the queries come from the previous decoder layer'
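
To see the control flow of the loop without loading a model, the reasoning step can be replaced by canned replies. The sketch below does that. `ScriptedAgent` and `run_loop` are hypothetical stand-ins for `Agent` and `loop_agent`, and the fixed tool result of 5 is made up for the example.

```python
import re


class ScriptedAgent:
    """Stand-in for Agent: replays canned replies instead of calling a model."""

    def __init__(self, replies):
        self.replies = list(replies)
        self.prompts = []  # every prompt the loop sends to the agent

    def __call__(self, message=""):
        self.prompts.append(message)
        return self.replies.pop(0)

    def function_call(self, action):
        # A real agent would ask the function-calling model; here the result is fixed.
        return 5


def run_loop(agent, question, max_iterations=5):
    """Alternate between agent replies and observations until an Answer appears."""
    next_prompt = question
    result = ""
    for _ in range(max_iterations):
        result = agent(next_prompt)
        if "Answer:" in result:
            break
        actions = re.findall(r"Action: (.*)", result)
        if actions:
            next_prompt = f"Observation: {agent.function_call(actions[0])}"
        else:
            next_prompt = "Observation: tool not found"
    return result


demo = ScriptedAgent([
    "Thought: I should add.\nAction: add_numbers with a=2, b=3\nPAUSE",
    "Answer: 2 + 3 = 5",
])
print(run_loop(demo, "What is 2 + 3?"))
print(demo.prompts)
```

The recorded prompts show that the second turn receives the observation rather than the original question, which is how the tool result reaches the model.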

