# Dummy Agent Library

🔗 https://huggingface.co/agents-course/notebooks/blob/main/dummy_agent_library.ipynb

## Serverless API

In the Hugging Face ecosystem, there is a convenient feature called Serverless API that allows you to easily run inference on many models. There's no installation or deployment required.

To run this notebook, you need a **Hugging Face token**, which you can get from https://hf.co/settings/tokens. You also need to create a `.env` file in the root folder that defines an environment variable called `HF_TOKEN`.

You also need to request access to the [Meta Llama models](https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct), if you haven't already. Approval usually takes up to an hour.

In [7]:
import os
from huggingface_hub import InferenceClient

# Load the Hugging Face token from the .env file
from dotenv import load_dotenv
load_dotenv()
os.environ['HF_TOKEN']  # raises KeyError early if the token is not set

client = InferenceClient("meta-llama/Llama-3.2-3B-Instruct")

Let us ask about the capital of France 🗼

In [9]:
output = client.text_generation(
    "The capital of France is",
    max_new_tokens=100,
)

print(output)

 Paris. The capital of Italy is Rome. The capital of Spain is Madrid. The capital of Germany is Berlin. The capital of the United Kingdom is London. The capital of Australia is Canberra. The capital of China is Beijing. The capital of Japan is Tokyo. The capital of India is New Delhi. The capital of Brazil is Brasília. The capital of Russia is Moscow. The capital of South Africa is Pretoria. The capital of Egypt is Cairo. The capital of Turkey is Ankara. The


As seen in the LLM section, if we just do decoding, **the model will only stop when it predicts an EOS token** (or when it reaches `max_new_tokens`, which is why the output above is cut off mid-sentence). The EOS token is not produced here because this is a conversational (chat) model and we didn't apply the chat template it expects.

If we add the special tokens, the behavior changes and the model produces the expected EOS.

In [10]:
# With the Llama 3.2 special tokens added, the model behaves as expected.
prompt = """<|begin_of_text|><|start_header_id|>user<|end_header_id|>

The capital of france is<|eot_id|><|start_header_id|>assistant<|end_header_id|>

"""
output = client.text_generation(
    prompt,
    max_new_tokens=100,
)

print(output)

...Paris!
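
The special-token layout used in the prompt above can be sketched as a small helper. This is only an illustration: `format_llama3_prompt` is a hypothetical name, not part of `huggingface_hub`, and it reproduces just the tokens shown in this notebook. The chat template shipped with the model may add more content than this sketch does.

```python
from typing import Dict, List


def format_llama3_prompt(messages: List[Dict[str, str]]) -> str:
    """Hypothetical helper: lay out chat messages with the special tokens shown above."""
    parts = ["<|begin_of_text|>"]
    for message in messages:
        # Each turn is a role header, a blank line, the content, then an end-of-turn token.
        parts.append(
            f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
            f"{message['content']}<|eot_id|>"
        )
    # Open an assistant turn so the model writes its answer next.
    parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)


prompt = format_llama3_prompt(
    [{"role": "user", "content": "The capital of france is"}]
)
print(prompt)
```

The resulting string is identical to the hand-written prompt above, so it could be passed to `client.text_generation` in the same way.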


Using the "chat" method is a much more convenient and reliable way to apply chat templates:

In [11]:
output = client.chat.completions.create(
    messages=[
        {"role": "user", "content": "The capital of france is"},
    ],
    stream=False,
    max_tokens=1024,
)

print(output.choices[0].message.content)

Paris.


The chat method is the **recommended** way to ensure a smooth transition between models, but since this notebook is only educational, we will keep using the `text_generation` method to understand the details.

## Dummy Agent

In the previous sections, we saw that the **core of an agent library is to append information to the system prompt**.

This system prompt is a bit more complex than the one we saw earlier, but it already contains:
1. **Information about the tools**
2. **Cycle instructions** (Thought → Action → Observation)