# Raw RAG 06: Extracting Metadata for Enhanced Retrieval

In the journey to build more effective Retrieval-Augmented Generation (RAG) systems, we've explored various techniques to improve the quality and relevance of retrieved information. While embedding and chunking form the foundation of many RAG implementations, they alone may not always provide the most accurate context for complex queries.

This notebook introduces an additional layer of sophistication to our RAG pipeline: metadata extraction. By leveraging metadata, we can further refine our document retrieval process, leading to more relevant results and improved overall performance.

### Key Points

1. **Beyond Embedding and Chunking**: We'll explore how metadata can complement traditional retrieval methods.
2. **Improved Relevance**: Learn how metadata can help filter and prioritize documents more effectively.
3. **Cost Efficiency**: Discover how better retrieval can reduce API costs by minimizing unnecessary queries.
4. **Accuracy Enhancement**: Understand how metadata-driven retrieval can improve the accuracy of provided context.

### What to Expect

In this notebook, we'll dive into one specific approach to metadata extraction and utilization in RAG systems. However, it's crucial to understand that this is just one of many possible strategies to enhance the retrieval process. The field of RAG is rapidly evolving, and new techniques are constantly emerging.

We'll build on the previous notebooks, particularly the JSON output techniques covered in notebook 05 (Pydantic is All You Need). Familiarity with Pydantic and JSON handling will be beneficial as we explore metadata extraction.

### Looking Ahead

While we'll focus on a particular metadata extraction technique today, keep in mind that this is just the tip of the iceberg. Future explorations may delve into:

- Advanced semantic analysis for metadata generation
- Multi-modal metadata extraction (text, images, audio)
- Dynamic metadata updating and relevance scoring
- Integration of external knowledge bases for metadata enrichment

By mastering these techniques, you'll be well-equipped to adapt and extend your RAG systems to meet increasingly complex information retrieval challenges.

Let's begin our journey into the world of metadata-enhanced RAG!

In [1]:
%pip install openai pydantic python-dotenv


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.0[0m[39;49m -> [0m[32;49m24.1.2[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m
Note: you may need to restart the kernel to use updated packages.


In [2]:
# Load the environment variables from the .env file

from dotenv import load_dotenv
import os

# Specify the path to your .env file if it's not in the same directory
dotenv_path = ".env"
load_dotenv(dotenv_path=dotenv_path)

True
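
The `True` returned above does not by itself confirm that the specific key the OpenAI client needs is present. A minimal sketch of an early check, assuming the key is stored under `OPENAI_API_KEY` (the variable the `openai` client reads by default); `require_env` is a hypothetical helper, not part of `python-dotenv`:

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or fail with a clear message."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(
            f"Environment variable {name!r} is not set; check your .env file."
        )
    return value


# Example usage (assumes the key is named OPENAI_API_KEY):
# api_key = require_env("OPENAI_API_KEY")
```

Failing at this point gives a clearer error than a rejected API request later in the notebook.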

In [3]:
# Lifted from the previous notebook

import json

from pydantic import BaseModel
from pydantic.json_schema import model_json_schema

def pydantic_to_function_schema(model: type[BaseModel]) -> dict:
    """
    Converts a Pydantic model to a function schema.

    Args:
        model (type[BaseModel]): The Pydantic model to convert.

    Returns:
        dict: The function schema representing the Pydantic model.
    """
    schema = model_json_schema(model)

    function_schema = {
        "type": "function",
        "function": {
            "name": schema["title"].lower().replace(" ", "_"),
            "description": schema.get("description", ""),
            "parameters": {
                "type": "object",
                "properties": schema["properties"],
                "required": schema.get("required", []),
            },
        },
    }

    return function_schema

In [4]:
from pydantic import BaseModel, Field, ConfigDict
from typing import List, Optional, Dict, Any

# Create a sample Pydantic model
class QueryMetadata(BaseModel):
    """Based on the user query, extract metadata"""

    # ConfigDict has no `description` key; Pydantic takes the schema
    # description from the class docstring instead.
    model_config = ConfigDict(title="extract query metadata")

    time_range: List[str] = Field(..., description="Identify any time references mentioned in the query. List them in ascending order (from earliest to latest). If no time references are found, leave this field empty.")
    company: str = Field(..., description="extract the company name mentioned in the query")
    keywords: List[str] = Field(..., description="extract the keywords mentioned in the query")

In [5]:
query_metadata_schema = pydantic_to_function_schema(QueryMetadata)
print(json.dumps(query_metadata_schema, indent=2))

{
  "type": "function",
  "function": {
    "name": "extract_query_metadata",
    "description": "Based on the user query, extract metadata",
    "parameters": {
      "type": "object",
      "properties": {
        "time_range": {
          "description": "Identify any time references mentioned in the query. List them in ascending order (from earliest to latest). If no time references are found, leave this field empty.",
          "items": {
            "type": "string"
          },
          "title": "Time Range",
          "type": "array"
        },
        "company": {
          "description": "extract the company name mentioned in the query",
          "title": "Company",
          "type": "string"
        },
        "keywords": {
          "description": "extract the keywords mentioned in the query",
          "items": {
            "type": "string"
          },
          "title": "Keywords",
          "type": "array"
        }
      },
      "required": [
        "time_range",
        "company",
        "keywords"
      ]
    }
  }
}

In [6]:
from openai import OpenAI

client = OpenAI()

In [7]:
from utils import MessageParser

message_parser = MessageParser()
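
The `utils` module is not shown in this notebook. As a rough sketch of what a helper like `MessageParser.extract_and_parse_arguments` might do, assuming it reads the first tool call on the assistant message and decodes its JSON `arguments` string (the class below is a hypothetical stand-in, not the actual `utils` implementation):

```python
import json
from types import SimpleNamespace
from typing import Any, Dict


class SketchMessageParser:
    """Hypothetical stand-in for utils.MessageParser (the real one is not shown)."""

    def extract_and_parse_arguments(self, message: Any) -> Dict[str, Any]:
        # Tool calls live on message.tool_calls; each one carries a JSON
        # string in tool_call.function.arguments.
        tool_calls = getattr(message, "tool_calls", None)
        if not tool_calls:
            raise ValueError("The message contains no tool calls.")
        return json.loads(tool_calls[0].function.arguments)


# Fake message shaped like a chat completion message, for illustration only.
fake_message = SimpleNamespace(
    tool_calls=[
        SimpleNamespace(
            function=SimpleNamespace(
                name="extract_query_metadata",
                arguments='{"time_range": ["2020", "2022"], "company": "Nvidia", "keywords": ["revenue"]}',
            )
        )
    ]
)

parsed = SketchMessageParser().extract_and_parse_arguments(fake_message)
print(parsed["company"])  # Nvidia
```

Raising on a missing tool call keeps a plain-text reply from being silently treated as empty metadata.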

In [8]:
query = "What was Nvidia's revenue between 2020 and 2022?"

response = client.chat.completions.create(
    messages=[
        {
            "role": "system",
            "content": "You help the user with their request.",
        },
        {"role": "user", "content": query},
    ],
    model="gpt-4-turbo",
    tools=[query_metadata_schema],
    tool_choice={
        "type": "function",
        "function": {"name": "extract_query_metadata"},
    },
)

query_metadata = message_parser.extract_and_parse_arguments(response.choices[0].message)

print(query_metadata)

{'time_range': ['2020', '2022'], 'company': 'Nvidia', 'keywords': ['revenue']}
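
One way the extracted metadata could be used downstream is as a pre-filter before similarity search. The sketch below assumes each indexed document carries `company` and `year` fields; the `Document` class, the placeholder records, and `filter_by_metadata` are hypothetical and only illustrate the idea:

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass
class Document:
    text: str
    company: str
    year: int


def filter_by_metadata(docs: List[Document], metadata: Dict[str, Any]) -> List[Document]:
    """Keep documents matching the query's company and, if given, its year range."""
    company = metadata.get("company", "").lower()
    # Only purely numeric time references are treated as years in this sketch.
    years = [int(t) for t in metadata.get("time_range", []) if t.isdigit()]
    start, end = (min(years), max(years)) if years else (None, None)

    matches = []
    for doc in docs:
        if company and doc.company.lower() != company:
            continue
        if start is not None and not (start <= doc.year <= end):
            continue
        matches.append(doc)
    return matches


# Placeholder corpus; the texts are illustrative, not real filings.
example_docs = [
    Document("placeholder: Nvidia annual report", "Nvidia", 2019),
    Document("placeholder: Nvidia annual report", "Nvidia", 2021),
    Document("placeholder: AMD annual report", "AMD", 2021),
]

example_metadata = {"time_range": ["2020", "2022"], "company": "Nvidia", "keywords": ["revenue"]}
candidates = filter_by_metadata(example_docs, example_metadata)
print([(d.company, d.year) for d in candidates])  # [('Nvidia', 2021)]
```

Only the surviving candidates would then need to be ranked by embedding similarity.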


In [9]:
from pydantic import BaseModel, Field, ConfigDict
from typing import Optional, Dict, Any, List
import json


class GuestRequest(BaseModel):
    model_config = ConfigDict(
        title="GuestRequest Model",
        description="Based on the user request, return the guest request details.",
    )
    room_number: str = Field(..., description="guest room number")
    guest_name: Optional[str] = Field(None, description="guest name")
    request: str = Field(..., description="guest request")


class RoomService(BaseModel):
    model_config = ConfigDict(
        title="RoomService Model",
        description="Order room service for a guest.",
    )
    room_number: str = Field(..., description="guest room number")
    menu_item: str = Field(..., description="menu item to order")
    quantity: int = Field(..., description="quantity of the item")


# Function to convert Pydantic model to function schema
def pydantic_to_json_schema(model: type[BaseModel]) -> Dict[str, Any]:
    schema = model.model_json_schema()

    function_schema = {
        "type": "function",
        "function": {
            "name": schema.get("title", model.__name__).lower().replace(" ", "_"),
            "description": schema.get("description", ""),
            "parameters": {"type": "object", "properties": {}, "required": []},
        },
    }

    # `description` is not a standard ConfigDict key, so it does not reach the
    # generated JSON schema; read it straight from model_config instead.
    model_config = getattr(model, "model_config", None)
    if model_config and isinstance(model_config, dict):
        function_schema["function"]["description"] = model_config.get("description", "")

    for field_name, field in model.model_fields.items():
        field_info = field.json_schema_extra or {}
        field_type = field.annotation

        if field.is_required():
            function_schema["function"]["parameters"]["required"].append(field_name)

        # Correctly extract the description from the field
        description = field_info.get("description") or field.description or ""

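        # Simplification: only str / Optional[str] map to "string"; every other
        # annotation is reported as "integer", which is enough for the models above.
        function_schema["function"]["parameters"]["properties"][field_name] = {
            "type": "string" if field_type in (str, Optional[str]) else "integer",
            "description": description,
        }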

    return function_schema


def generate_function_descriptions_and_tool_choice(
    models: List[type[BaseModel]], chosen_function: str
) -> Dict[str, Any]:
    function_descriptions = [pydantic_to_json_schema(model) for model in models]

    tool_choice = {"type": "function", "function": {"name": chosen_function}}

    return {"function_descriptions": function_descriptions, "tool_choice": tool_choice}
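
A forced `tool_choice` only makes sense if it names one of the functions actually passed in `tools`. A small guard along these lines could catch a mismatch before the request is sent; `check_tool_choice` is a hypothetical helper that operates on plain dictionaries shaped like the ones built above:

```python
from typing import Any, Dict, List


def check_tool_choice(
    function_descriptions: List[Dict[str, Any]], tool_choice: Dict[str, Any]
) -> None:
    """Raise ValueError if tool_choice names a function that is not in the tool list."""
    available = {fd["function"]["name"] for fd in function_descriptions}
    chosen = tool_choice["function"]["name"]
    if chosen not in available:
        raise ValueError(
            f"tool_choice refers to {chosen!r}, "
            f"but the available functions are {sorted(available)}"
        )


# Dictionaries in the same shape the helper above produces.
example_tools = [
    {"type": "function", "function": {"name": "guestrequest_model"}},
    {"type": "function", "function": {"name": "roomservice_model"}},
]
check_tool_choice(
    example_tools, {"type": "function", "function": {"name": "guestrequest_model"}}
)  # passes silently
```

Because the function names are derived from each model's title, a check like this is a cheap way to keep the chosen name and the generated names in sync.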


In [10]:
models = [GuestRequest, RoomService]
chosen_function = "guestrequest_model"  # must match one of the generated function names

result = generate_function_descriptions_and_tool_choice(models, chosen_function)

print("Function Descriptions:")
print(json.dumps(result["function_descriptions"], indent=2))
print("\nTool Choice:")
print(json.dumps(result["tool_choice"], indent=2))

Function Descriptions:
[
  {
    "type": "function",
    "function": {
      "name": "guestrequest_model",
      "description": "Based on the user request, return the guest request details.",
      "parameters": {
        "type": "object",
        "properties": {
          "room_number": {
            "type": "string",
            "description": "guest room number"
          },
          "guest_name": {
            "type": "string",
            "description": "guest name"
          },
          "request": {
            "type": "string",
            "description": "guest request"
          }
        },
        "required": [
          "room_number",
          "request"
        ]
      }
    }
  },
  {
    "type": "function",
    "function": {
      "name": "roomservice_model",
      "description": "Order room service for a guest.",
      "parameters": {
        "type": "object",
        "properties": {
          "room_number": {
            "type": "string",
            "description": "g

## Conclusion: Elevating RAG and LLM Processes with Metadata

This notebook showed how to extract structured metadata from a user query using function calling and Pydantic-generated schemas, a building block for metadata-aware retrieval in Retrieval-Augmented Generation (RAG) systems.

### Key Takeaways

1. **Building on Foundations**: We've leveraged the JSON parsing techniques from our previous notebook, demonstrating how foundational skills can be applied to more advanced concepts.

2. **Metadata as a Powerful Filter**: Extracting structured metadata (time range, company, keywords) from a query gives the retriever explicit fields to filter on, in addition to embedding similarity.

3. **Efficiency Gains**: Filtering on metadata before retrieval narrows the candidate set, which can reduce the amount of context sent to the model and, with it, API costs.

4. **Improved Context Accuracy**: Narrower, better-targeted context gives the LLM less irrelevant material to work with, which can lead to higher-quality outputs.

5. **Flexibility in Implementation**: The techniques we've explored are adaptable, allowing for customization based on specific use cases and data types.

### Practical Applications

The skills and concepts covered in this notebook open up a range of possibilities for improving RAG and LLM processes:

- **Enhanced Query Understanding**: Use metadata to better interpret user queries and match them with relevant documents.
- **Dynamic Content Filtering**: Implement metadata-based filters that can adapt to user preferences or specific task requirements.
- **Improved Data Governance**: Leverage metadata for better tracking and management of information sources within your RAG system.
- **Multi-Modal RAG**: Extend these concepts to handle metadata from various data types, including images, audio, and video.

### Looking Ahead

While we've made significant progress, this is just one step in the ongoing evolution of RAG systems. As you continue to develop and refine your implementations, consider:

- Integrating more advanced NLP techniques for metadata extraction
- Exploring machine learning approaches to dynamically weight metadata importance
- Investigating ways to combine metadata with other retrieval enhancement techniques
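
As a starting point for combining metadata with similarity-based retrieval, one simple option is a weighted sum of the two signals. The sketch below assumes a similarity score in [0, 1] is already available from the vector search; `metadata_match_score`, `combined_score`, and the default weight are illustrative choices, not tuned values:

```python
from typing import Any, Dict


def metadata_match_score(doc_meta: Dict[str, Any], query_meta: Dict[str, Any]) -> float:
    """Fraction of query keywords that also appear in the document's keywords."""
    query_keywords = {k.lower() for k in query_meta.get("keywords", [])}
    if not query_keywords:
        return 0.0
    doc_keywords = {k.lower() for k in doc_meta.get("keywords", [])}
    return len(query_keywords & doc_keywords) / len(query_keywords)


def combined_score(
    similarity: float, metadata_score: float, metadata_weight: float = 0.3
) -> float:
    """Blend an embedding similarity with a metadata score; the weight must be in [0, 1]."""
    if not 0.0 <= metadata_weight <= 1.0:
        raise ValueError("metadata_weight must be between 0 and 1")
    return (1.0 - metadata_weight) * similarity + metadata_weight * metadata_score


score = combined_score(
    similarity=0.8,
    metadata_score=metadata_match_score(
        {"keywords": ["revenue", "gpu"]}, {"keywords": ["revenue"]}
    ),
)
print(round(score, 2))  # 0.86
```

A fixed weight is the simplest choice; learning it from relevance feedback would be one way to weight metadata dynamically.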

Remember, the goal is not just to retrieve information, but to provide LLMs with the most relevant and context-rich data possible. By mastering metadata extraction and utilization, you're equipping yourself with a powerful tool to achieve this goal.

As we move forward, keep experimenting, iterating, and pushing the boundaries of what's possible with RAG and LLMs. The techniques we've explored today are your stepping stones to building more intelligent, efficient, and capable RAG systems.