https://huggingface.co/docs/smolagents/main/en/examples/rag

In [1]:
# Installation
!pip install smolagents
# To install from source instead of the latest release, comment out the command above and uncomment the following one.
# ! pip install git+https://github.com/huggingface/smolagents.git



# Agentic RAG

## Introduction to Retrieval-Augmented Generation (RAG)

Retrieval-Augmented Generation (RAG) combines the power of large language models with external knowledge retrieval to produce more accurate, factual, and contextually relevant responses. At its core, RAG is about "using an LLM to answer a user query, but basing the answer on information retrieved from a knowledge base."

### Why Use RAG?

RAG offers several significant advantages over using vanilla or fine-tuned LLMs:

1. **Factual Grounding**: Reduces hallucinations by anchoring responses in retrieved facts
2. **Domain Specialization**: Provides domain-specific knowledge without model retraining
3. **Knowledge Recency**: Allows access to information beyond the model's training cutoff
4. **Transparency**: Enables citation of sources for generated content
5. **Control**: Offers fine-grained control over what information the model can access

### Limitations of Traditional RAG

Despite its benefits, traditional RAG approaches face several challenges:

- **Single Retrieval Step**: If the initial retrieval results are poor, the final generation will suffer
- **Query-Document Mismatch**: User queries (often questions) may not match well with documents containing answers (often statements)
- **Limited Reasoning**: Simple RAG pipelines don't allow for multi-step reasoning or query refinement
- **Context Window Constraints**: Retrieved documents must fit within the model's context window
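
To make the single-retrieval limitation concrete, here is a minimal sketch of a traditional one-shot RAG pipeline. It is not part of smolagents: `tokenize`, `retrieve_once`, `build_prompt`, the word-overlap scoring, and the toy corpus are all illustrative assumptions. The point is that whatever the single `retrieve_once` call returns is all the model ever sees.

```python
import re

def tokenize(text: str) -> set[str]:
    """Lowercase a string and return its set of alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve_once(query: str, corpus: list[str], k: int = 2) -> list[str]:
    """Rank passages by how many distinct words they share with the query."""
    query_tokens = tokenize(query)
    ranked = sorted(corpus, key=lambda doc: len(query_tokens & tokenize(doc)), reverse=True)
    return ranked[:k]

def build_prompt(query: str, passages: list[str]) -> str:
    """Place the retrieved passages ahead of the question, as a one-shot pipeline would."""
    context = "\n".join(f"- {p}" for p in passages)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

# Toy corpus (hypothetical passages, for illustration only)
corpus = [
    "The backward pass computes gradients for every layer.",
    "Tokenizers convert raw text into input ids.",
    "Gradient checkpointing trades extra compute for lower memory use.",
]

question = "Which pass computes gradients?"
prompt = build_prompt(question, retrieve_once(question, corpus, k=1))
print(prompt)
```

If the one retrieval call misses the relevant passage, nothing downstream can recover it, which is the gap the agentic approach addresses.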

## Agentic RAG: A More Powerful Approach

We can overcome these limitations by implementing an **Agentic RAG** system, which is essentially an agent equipped with retrieval capabilities. This approach transforms RAG from a rigid pipeline into an interactive, reasoning-driven process.

### Key Benefits of Agentic RAG

An agent with retrieval tools can:

1. ✅ **Formulate optimized queries**: The agent can transform user questions into retrieval-friendly queries
2. ✅ **Perform multiple retrievals**: The agent can retrieve information iteratively as needed
3. ✅ **Reason over retrieved content**: The agent can analyze, synthesize, and draw conclusions from multiple sources
4. ✅ **Self-critique and refine**: The agent can evaluate retrieval results and adjust its approach

This approach naturally implements advanced RAG techniques:
- **Hypothetical Document Embedding (HyDE)**: Instead of using the user query directly, the agent formulates retrieval-optimized queries, much as HyDE retrieves with a generated hypothetical answer rather than the raw question ([paper reference](https://huggingface.co/papers/2212.10496))
- **Self-Query Refinement**: The agent can analyze initial results and perform follow-up retrievals with refined queries ([technique reference](https://docs.llamaindex.ai/en/stable/examples/evaluation/RetryQuery/))
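
The iterative behaviour described above can be sketched as a small loop. This is only an illustration under stated assumptions: in a real agent the LLM writes each reformulated query and judges the results, whereas here the reformulations are a fixed list and "good enough" simply means a non-empty result. The names `search` and `agentic_retrieve` are hypothetical.

```python
def search(query: str, corpus: dict[str, str]) -> list[str]:
    """Return ids of documents that contain every query term as a substring (toy lexical match)."""
    words = query.lower().split()
    return [doc_id for doc_id, text in corpus.items() if all(w in text.lower() for w in words)]

def agentic_retrieve(candidate_queries: list[str], corpus: dict[str, str], max_steps: int = 4):
    """Try successive query reformulations until one returns hits or the step budget runs out."""
    trace = []
    for step, query in enumerate(candidate_queries[:max_steps], start=1):
        hits = search(query, corpus)
        trace.append((step, query, hits))
        if hits:  # Stand-in for self-critique: stop as soon as something was found
            return hits, trace
    return [], trace

# Toy corpus and reformulations (hypothetical, for illustration only)
corpus = {
    "perf": "The backward pass is typically slower than the forward pass.",
    "tok": "Tokenizers split text into tokens.",
}
queries = [
    "which is slower forward or backward?",  # Question form: fails the lexical match
    "backward pass slower forward pass",     # Statement-like form: matches the document
]

hits, trace = agentic_retrieve(queries, corpus)
for step, query, found in trace:
    print(step, repr(query), found)
```

The first query fails because the document is phrased as a statement, while the second, statement-like query succeeds. This mirrors the query-document mismatch discussed earlier.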

## Building an Agentic RAG System

Let's build a complete Agentic RAG system step by step. We'll create an agent that can answer questions about the Hugging Face Transformers library by retrieving information from its documentation.

You can follow along with the code snippets below, or check out the full example in the smolagents GitHub repository: [examples/rag.py](https://github.com/huggingface/smolagents/blob/main/examples/rag.py).

### Step 1: Install Required Dependencies

First, we need to install the necessary packages:

```bash
pip install smolagents pandas langchain langchain-community sentence-transformers datasets python-dotenv rank_bm25 --upgrade
```

If you plan to use Hugging Face's Inference API, you'll need to set up your API token:


In [3]:
# Load environment variables (including HF_TOKEN)
from dotenv import load_dotenv
load_dotenv()

False
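
The `False` shown above is the return value of `load_dotenv()`, which indicates that no environment variables were loaded from a `.env` file. Before calling the Inference API, it can help to check whether a token is available at all. The helper below is a hypothetical convenience, not part of smolagents or python-dotenv. It assumes the token is stored under the name `HF_TOKEN`, as in the comment above.

```python
import os

def resolve_hf_token(env=None):
    """Return the HF_TOKEN value from the given mapping, or None if it is missing or blank."""
    env = os.environ if env is None else env
    token = env.get("HF_TOKEN", "").strip()
    return token or None

if resolve_hf_token() is None:
    print("HF_TOKEN is not set; authenticated requests will not be possible.")
```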

### Step 2: Prepare the Knowledge Base

We'll use a dataset containing Hugging Face documentation and prepare it for retrieval:

### api

In [4]:
import datasets
from langchain.docstore.document import Document
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.retrievers import BM25Retriever

# Load the Hugging Face documentation dataset
knowledge_base = datasets.load_dataset("m-ric/huggingface_doc", split="train")

# Filter to include only Transformers documentation
knowledge_base = knowledge_base.filter(lambda row: row["source"].startswith("huggingface/transformers"))

# Convert dataset entries to Document objects with metadata
source_docs = [
    Document(page_content=doc["text"], metadata={"source": doc["source"].split("/")[1]})
    for doc in knowledge_base
]

# Split documents into smaller chunks for better retrieval
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,  # Characters per chunk
    chunk_overlap=50,  # Overlap between chunks to maintain context
    add_start_index=True,
    strip_whitespace=True,
    separators=["\n\n", "\n", ".", " ", ""],  # Priority order for splitting
)
docs_processed = text_splitter.split_documents(source_docs)

print(f"Knowledge base prepared with {len(docs_processed)} document chunks")


Knowledge base prepared with 14695 document chunks
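
To illustrate what `chunk_size` and `chunk_overlap` control, here is a simplified sliding-window chunker. It is not LangChain's algorithm: `RecursiveCharacterTextSplitter` also tries the listed separators in priority order so that chunks end on natural boundaries, which this sketch ignores. The function name `chunk_text` is hypothetical.

```python
def chunk_text(text: str, chunk_size: int = 500, chunk_overlap: int = 50) -> list[str]:
    """Split text into windows of at most chunk_size characters that overlap by chunk_overlap."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # The current window already reaches the end of the text
        start += step
    return chunks

# Small values make the overlap easy to see
print(chunk_text("abcdefghij", chunk_size=4, chunk_overlap=1))
```

Each chunk repeats the last character of the previous one, so a sentence that straddles a boundary still appears partly in both chunks.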


### Step 3: Create a Retriever Tool

Now we'll create a custom tool that our agent can use to retrieve information from the knowledge base:

In [5]:
from smolagents import Tool

class RetrieverTool(Tool):
    name = "retriever"
    description = "Uses semantic search to retrieve the parts of transformers documentation that could be most relevant to answer your query."
    inputs = {
        "query": {
            "type": "string",
            "description": "The query to perform. This should be semantically close to your target documents. Use the affirmative form rather than a question.",
        }
    }
    output_type = "string"

    def __init__(self, docs, **kwargs):
        super().__init__(**kwargs)
        # Initialize the retriever with our processed documents
        self.retriever = BM25Retriever.from_documents(
            docs, k=10  # Return top 10 most relevant documents
        )

    def forward(self, query: str) -> str:
        """Execute the retrieval based on the provided query."""
        assert isinstance(query, str), "Your search query must be a string"

        # Retrieve relevant documents
        docs = self.retriever.invoke(query)

        # Format the retrieved documents for readability
        return "\nRetrieved documents:\n" + "".join(
            [
                f"\n\n===== Document {str(i)} =====\n" + doc.page_content
                for i, doc in enumerate(docs)
            ]
        )

# Initialize our retriever tool with the processed documents
retriever_tool = RetrieverTool(docs_processed)

> [!TIP]
> We're using BM25, a lexical retrieval method, for simplicity and speed. For production systems, you might want to use semantic search with embeddings for better retrieval quality. Check the [MTEB Leaderboard](https://huggingface.co/spaces/mteb/leaderboard) for high-quality embedding models.
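
For intuition about what a lexical method such as BM25 computes, here is a minimal sketch of BM25 scoring in plain Python. It uses one common smoothed-IDF variant and whitespace tokenization, and the defaults `k1=1.5` and `b=0.75` are assumptions chosen for illustration. It is not necessarily the exact formula or tokenization that `BM25Retriever` applies internally.

```python
import math
from collections import Counter

def bm25_scores(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score each document against the query with a BM25-style formula."""
    if not docs:
        return []
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs
    # Number of documents in which each term appears at least once
    doc_freq = Counter(term for toks in tokenized for term in set(toks))

    scores = []
    for toks in tokenized:
        term_freq = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            if term not in term_freq:
                continue
            # Smoothed IDF: rarer terms contribute more, and the value stays positive
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            tf = term_freq[term]
            # Term-frequency saturation with document-length normalization
            score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(toks) / avg_len))
        scores.append(score)
    return scores

# Toy documents (hypothetical, for illustration only)
docs = [
    "the backward pass computes gradients",
    "the tokenizer splits text",
    "gradients flow in the backward pass during training",
]
scores = bm25_scores("backward gradients", docs)
best = max(range(len(docs)), key=scores.__getitem__)
print(best, [round(s, 3) for s in scores])
```

Both matching documents contain each query term once, but the shorter one scores higher because of length normalization. A document with no query terms scores zero, which is why purely lexical retrieval misses paraphrases.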

### Step 4: Create an Advanced Retrieval Agent

Now we'll create an agent that can use our retriever tool to answer questions:


In [13]:
from smolagents import InferenceClientModel, CodeAgent

# Initialize the agent with our retriever tool
agent = CodeAgent(
    tools=[retriever_tool],  # List of tools available to the agent
    # A specific model is selected by passing its model_id:
    model=InferenceClientModel(model_id="meta-llama/Llama-3.2-3B-Instruct"),
    # model=InferenceClientModel(),  # Default model "Qwen/Qwen2.5-Coder-32B-Instruct"
    max_steps=4,  # Limit the number of reasoning steps
    verbosity_level=2,  # Show detailed agent reasoning
)


> [!TIP]
> Inference Providers give access to hundreds of models, powered by serverless inference partners. A list of supported providers can be found [here](https://huggingface.co/docs/inference-providers/index).

### Step 5: Run the Agent to Answer Questions

Let's use our agent to answer a question about Transformers:

In [14]:
# Ask a question that requires retrieving information
question = "For a transformers model training, which is slower, the forward or the backward pass?"

# Run the agent to get an answer
agent_output = agent.run(question)

# Display the final answer
print("\nFinal answer:")
print(agent_output)


Final answer:
forward pass


## Practical Applications of Agentic RAG

Agentic RAG systems can be applied to various use cases:

1. **Technical Documentation Assistance**: Help users navigate complex technical documentation
2. **Research Paper Analysis**: Extract and synthesize information from scientific papers
3. **Legal Document Review**: Find relevant precedents and clauses in legal documents
4. **Customer Support**: Answer questions based on product documentation and knowledge bases
5. **Educational Tutoring**: Provide explanations based on textbooks and learning materials

## Conclusion

Agentic RAG represents a significant advancement over traditional RAG pipelines. By combining the reasoning capabilities of LLM agents with the factual grounding of retrieval systems, we can build more powerful, flexible, and accurate information systems.

The approach we've demonstrated:
- Overcomes the limitations of single-step retrieval
- Enables more natural interactions with knowledge bases
- Provides a framework for continuous improvement through self-critique and query refinement

As you build your own Agentic RAG systems, consider experimenting with different retrieval methods, agent architectures, and knowledge sources to find the optimal configuration for your specific use case.

In [8]:
# To use Hugging Face's Inference API, you'll need to set up your API token.
# Go to https://huggingface.co/settings/tokens and create a new token.
# In Colab, click on the "🔑" icon in the left sidebar,
# add a new secret with the name `HF_TOKEN` and paste your token there.
# Then, uncomment and run the following line:
# from google.colab import userdata
# import os
# os.environ["HF_TOKEN"] = userdata.get("HF_TOKEN")

In [9]:
from google.colab import userdata
import os
os.environ["HF_TOKEN"] = userdata.get("HF_TOKEN")

In [None]:
pip install "smolagents[transformers]" torch accelerate bitsandbytes

In [None]:
import datasets
from langchain.docstore.document import Document
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.retrievers import BM25Retriever
from smolagents import Tool, CodeAgent
from smolagents import TransformersModel  # <-- 1. Import TransformersModel

# Load the Hugging Face documentation dataset
knowledge_base = datasets.load_dataset("m-ric/huggingface_doc", split="train")

# Filter to include only Transformers documentation
knowledge_base = knowledge_base.filter(lambda row: row["source"].startswith("huggingface/transformers"))

# Convert dataset entries to Document objects with metadata
source_docs = [
    Document(page_content=doc["text"], metadata={"source": doc["source"].split("/")[1]})
    for doc in knowledge_base
]

# Split documents into smaller chunks for better retrieval
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,  # Characters per chunk
    chunk_overlap=50,  # Overlap between chunks to maintain context
    add_start_index=True,
    strip_whitespace=True,
    separators=["\n\n", "\n", ".", " ", ""],  # Priority order for splitting
)
docs_processed = text_splitter.split_documents(source_docs)

print(f"Knowledge base prepared with {len(docs_processed)} document chunks")

class RetrieverTool(Tool):
    name = "retriever"
    description = "Uses semantic search to retrieve the parts of transformers documentation that could be most relevant to answer your query."
    inputs = {
        "query": {
            "type": "string",
            "description": "The query to perform. This should be semantically close to your target documents. Use the affirmative form rather than a question.",
        }
    }
    output_type = "string"

    def __init__(self, docs, **kwargs):
        super().__init__(**kwargs)
        # Initialize the retriever with our processed documents
        self.retriever = BM25Retriever.from_documents(
            docs, k=10  # Return top 10 most relevant documents
        )

    def forward(self, query: str) -> str:
        """Execute the retrieval based on the provided query."""
        assert isinstance(query, str), "Your search query must be a string"

        # Retrieve relevant documents
        docs = self.retriever.invoke(query)

        # Format the retrieved documents for readability
        return "\nRetrieved documents:\n" + "".join(
            [
                f"\n\n===== Document {str(i)} =====\n" + doc.page_content
                for i, doc in enumerate(docs)
            ]
        )

# Initialize our retriever tool with the processed documents
retriever_tool = RetrieverTool(docs_processed)

# <-- 2. Replace InferenceClientModel with TransformersModel
# Note: the model is downloaded and cached the first time this code runs.
# You can replace "meta-llama/Llama-3.2-3B-Instruct" with any other model on the Hugging Face Hub that supports code generation.
local_model = TransformersModel(model_id="meta-llama/Llama-3.2-3B-Instruct")

# Initialize the agent with our retriever tool
agent = CodeAgent(
    tools=[retriever_tool],
    model=local_model,  # <-- 3. Use the local model
    max_steps=4,  # Limit the number of reasoning steps
    verbosity_level=2,  # Show detailed agent reasoning
)

# Ask a question that requires retrieving information
question = "For a transformers model training, which is slower, the forward or the backward pass?"

# Run the agent to get an answer
agent_output = agent.run(question)

# Display the final answer
print("\nFinal answer:")
print(agent_output)


Knowledge base prepared with 14695 document chunks





### without api

In [2]:
import datasets
from langchain.docstore.document import Document
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.retrievers import BM25Retriever
from smolagents import Tool, CodeAgent, TransformersModel

# 1. Load the Hugging Face documentation dataset
knowledge_base = datasets.load_dataset("m-ric/huggingface_doc", split="train")

# 2. Filter to include only Transformers documentation
knowledge_base = knowledge_base.filter(lambda row: row["source"].startswith("huggingface/transformers"))

# 3. Convert dataset entries to Document objects with metadata
source_docs = [
    Document(page_content=doc["text"], metadata={"source": doc["source"].split("/")[1]})
    for doc in knowledge_base
]

# 4. Split documents into smaller chunks for better retrieval
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,        # Characters per chunk
    chunk_overlap=50,      # Overlap between chunks to maintain context
    add_start_index=True,
    strip_whitespace=True,
    separators=["\n\n", "\n", ".", " ", ""],  # Priority order for splitting
)
docs_processed = text_splitter.split_documents(source_docs)

print(f"Knowledge base prepared with {len(docs_processed)} document chunks")

# 5. Define a tool for the agent to use for retrieving information
class RetrieverTool(Tool):
    name = "retriever"
    description = "Uses semantic search to retrieve the parts of transformers documentation that could be most relevant to answer your query."
    inputs = {
        "query": {
            "type": "string",
            "description": "The query to perform. This should be semantically close to your target documents. Use the affirmative form rather than a question.",
        }
    }
    output_type = "string"

    def __init__(self, docs, **kwargs):
        super().__init__(**kwargs)
        # Initialize the retriever with our processed documents
        self.retriever = BM25Retriever.from_documents(
            docs, k=10  # Return top 10 most relevant documents
        )

    def forward(self, query: str) -> str:
        """Execute the retrieval based on the provided query."""
        assert isinstance(query, str), "Your search query must be a string"

        # Retrieve relevant documents
        docs = self.retriever.invoke(query)

        # Format the retrieved documents for readability
        return "\nRetrieved documents:\n" + "".join(
            [
                f"\n\n===== Document {str(i)} =====\n" + doc.page_content
                for i, doc in enumerate(docs)
            ]
        )

# 6. Initialize the retriever tool with the processed documents
retriever_tool = RetrieverTool(docs_processed)

# 7. Initialize the local model using TransformersModel
# `device_map` is passed directly to TransformersModel here.
# `device_map="auto"` automatically distributes the model across available devices (GPU/CPU).
local_model = TransformersModel(
    model_id="meta-llama/Llama-3.2-3B-Instruct",
    device_map="auto",  # Pass device_map directly
)

# 8. Initialize the agent with our retriever tool and the local model
agent = CodeAgent(
    tools=[retriever_tool],     # List of tools available to the agent
    model=local_model,          # Use the locally loaded model
    max_steps=4,                # Limit the number of reasoning steps
    verbosity_level=2,          # Show detailed agent reasoning
)

# 9. Ask a question that requires retrieving information
question = "For a transformers model training, which is slower, the forward or the backward pass?"

# 10. Run the agent to get an answer
agent_output = agent.run(question)

# 11. Display the final answer
print("\nFinal answer:")
print(agent_output)

Knowledge base prepared with 14695 document chunks







Final answer:
The forward pass is slower.
