# LangChain `SelfQueryRetriever` Quick Reference

## **Introduction**

The **SelfQueryRetriever** is a LangChain retriever that combines **semantic search** with **structured filtering**. Instead of relying on semantic similarity alone, it uses a query-constructing LLM chain to turn a natural-language question into a structured query: a semantic search string plus filters on metadata fields such as genre, year, rating, or any custom attribute. The structured query is then executed against a vector store such as Milvus, Pinecone, or Chroma.

This makes it a good fit for retrieval-augmented generation (RAG) systems and for applications such as movie recommendations or product search, where metadata matters as much as the text itself.

### Key Features

1. **Natural Language to Structured Query Conversion**:
   - The retriever interprets user queries and generates structured queries.
   - These queries include semantic search criteria and metadata filters, enabling precise document retrieval.
2. **Metadata Filtering**:
   - Users can specify conditions (e.g., "Find documents from 2023") in their queries.
   - The retriever applies these conditions to filter results based on metadata fields, such as date, source, or tags.
3. **Integration with Vector Databases**:
   - It supports vector stores like Milvus, Pinecone, and Chroma.
   - The retriever uses the database's capabilities for similarity search and filtering.
4. **Customizable Query Translators**:
   - A `structured_query_translator` parameter allows for adapting the retriever to different vector stores by translating internal query formats into database-specific search parameters.
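
To make the translation step concrete, here is a minimal, library-free sketch. The `Cmp`, `BoolOp`, and `to_chroma_filter` names are simplified stand-ins written for this illustration, not LangChain's own classes, and the filter an LLM actually produces for a given question may differ. The output follows the `$and` / `$gt` style of Chroma `where` filters.

```python
from dataclasses import dataclass
from typing import Any, Union


@dataclass
class Cmp:
    """One metadata condition, e.g. rating > 8 (illustrative stand-in)."""
    attribute: str
    comparator: str  # one of: eq, ne, gt, gte, lt, lte
    value: Any


@dataclass
class BoolOp:
    """Logical combination of conditions (illustrative stand-in)."""
    operator: str  # "and" or "or"
    arguments: list


def to_chroma_filter(node: Union[Cmp, BoolOp]) -> dict:
    """Translate the small query tree into a Chroma-style `where` dict."""
    if isinstance(node, Cmp):
        return {node.attribute: {f"${node.comparator}": node.value}}
    return {f"${node.operator}": [to_chroma_filter(arg) for arg in node.arguments]}


# "science fiction movie with rating greater than 8" might be parsed into:
structured_filter = BoolOp(
    operator="and",
    arguments=[
        Cmp("genre", "eq", "science fiction"),
        Cmp("rating", "gt", 8),
    ],
)
chroma_filter = to_chroma_filter(structured_filter)
print(chroma_filter)
```

A translator for a different vector store would walk the same tree and emit that store's filter syntax instead.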

---

## Preparation

### Installing Required Libraries
This section installs the Python libraries needed to work with LangChain, OpenAI models, and the Chroma vector store:
- `lark`: Parser library the query constructor uses to parse the filter expressions generated by the LLM.
- `langchain-openai`: Provides integration with OpenAI's embedding and chat models.
- `langchain_community`: Contains community-contributed modules and tools for LangChain.
- `langchain_experimental`: Includes experimental features and utilities for LangChain.
- `langchain-chroma`: Enables integration with the Chroma vector database.
- `chromadb`: The core library for the Chroma vector database.

In [None]:
!pip install -qU lark
!pip install -qU langchain-openai
!pip install -qU langchain_community
!pip install -qU langchain_experimental
!pip install -qU "langchain-chroma>=0.1.2"
!pip install -qU chromadb

### Initializing OpenAI Embeddings
This section fetches an OpenAI API key with Kaggle's `UserSecretsClient` and initializes the two models used throughout this notebook: an `OpenAIEmbeddings` instance that converts text into numerical embeddings, and a `ChatOpenAI` model that constructs the structured queries.

Key steps:
1. **Fetch API Key**: The OpenAI API key is retrieved from Kaggle secrets, so it never appears in the notebook.
2. **Initialize Embeddings**: `OpenAIEmbeddings` is initialized with the `text-embedding-3-small` model and the fetched API key.
3. **Initialize the LLM**: `ChatOpenAI` is initialized with the `gpt-4o-mini` model and the same key.

Later cells reuse the resulting `embed` and `model` objects when creating vector stores and retrievers.

In [None]:
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from kaggle_secrets import UserSecretsClient

# Fetch API key securely
user_secrets = UserSecretsClient()
my_api_key = user_secrets.get_secret("api-key-openai")

# Initialize OpenAI embeddings and LLM
embed = OpenAIEmbeddings(model="text-embedding-3-small", api_key=my_api_key)
model = ChatOpenAI(model="gpt-4o-mini", temperature=1.0, api_key=my_api_key)
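
Outside Kaggle, `kaggle_secrets` is not importable. The sketch below shows one way to fall back to an environment variable. The helper name `load_openai_api_key` is hypothetical, and reading the key from `OPENAI_API_KEY` is an assumption about how your environment is set up.

```python
import os


def load_openai_api_key(secret_name: str = "api-key-openai") -> str:
    """Return the OpenAI key from Kaggle secrets, else from the environment."""
    try:
        # Only importable inside a Kaggle notebook session.
        from kaggle_secrets import UserSecretsClient

        return UserSecretsClient().get_secret(secret_name)
    except ImportError:
        pass  # Not on Kaggle: fall through to the environment variable.

    key = os.environ.get("OPENAI_API_KEY")
    if not key:
        raise RuntimeError("Set OPENAI_API_KEY or add a Kaggle secret.")
    return key
```

With this helper, the setup cell could use `my_api_key = load_openai_api_key()` in place of the two Kaggle-specific lines.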

---

## **1. Retrieval Functions**

### **Basic Retrieval with Structured Filtering**
This example demonstrates how to use `SelfQueryRetriever` to retrieve documents based on a query with structured filtering (e.g., filtering by metadata like genre and rating).

In [None]:
from langchain_chroma import Chroma
from langchain.retrievers.self_query.base import SelfQueryRetriever
from langchain.chains.query_constructor.base import AttributeInfo
from langchain_openai import ChatOpenAI
from langchain_core.documents import Document

# Define metadata fields for filtering
metadata_field_info = [
    AttributeInfo(
        name="genre",
        description="The genre of the movie",
        type="string",
    ),
    AttributeInfo(
        name="rating",
        description="A 1-10 rating for the movie",
        type="float",
    ),
]
document_content_description = "Brief summary of a movie"

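# Initialize the vector store with the `embed` model created above.
# A named collection keeps these documents separate from later examples.
vectorstore = Chroma(collection_name="movies_basic", embedding_function=embed)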

# Add documents with metadata to the vector store
documents = [
    Document(
        page_content="A space adventure with aliens",
        metadata={"genre": "science fiction", "rating": 8.5},
    ),
    Document(
        page_content="A comedy about life in a small town",
        metadata={"genre": "comedy", "rating": 7.0},
    ),
]
vectorstore.add_documents(documents)

# Create SelfQueryRetriever
retriever = SelfQueryRetriever.from_llm(
    model,
    vectorstore,
    document_content_description,
    metadata_field_info,
)

# Retrieve documents with structured filtering
result = retriever.invoke("science fiction movie with rating greater than 8")
print(result)
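
To predict which documents should survive the metadata filter, it can help to evaluate a filter by hand. The following is a minimal sketch that assumes the LLM produced a Chroma-style `where` dict like the one shown. The `matches` helper is hypothetical and covers only the operators listed in `COMPARATORS`.

```python
import operator

# Comparators used by the filter dict below (Chroma-style operator names).
COMPARATORS = {
    "$eq": operator.eq,
    "$ne": operator.ne,
    "$gt": operator.gt,
    "$gte": operator.ge,
    "$lt": operator.lt,
    "$lte": operator.le,
}


def matches(metadata: dict, where: dict) -> bool:
    """Return True if `metadata` satisfies the Chroma-style `where` filter."""
    for key, condition in where.items():
        if key == "$and":
            if not all(matches(metadata, sub) for sub in condition):
                return False
        elif key == "$or":
            if not any(matches(metadata, sub) for sub in condition):
                return False
        else:
            if key not in metadata:
                return False
            for op_name, expected in condition.items():
                if not COMPARATORS[op_name](metadata[key], expected):
                    return False
    return True


movies = [
    {"genre": "science fiction", "rating": 8.5},
    {"genre": "comedy", "rating": 7.0},
]
where = {"$and": [{"genre": {"$eq": "science fiction"}}, {"rating": {"$gt": 8}}]}
kept = [m for m in movies if matches(m, where)]
print(kept)
```

Only the science fiction entry passes both conditions, so it is the only document the semantic search step would then rank.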

### **Retrieval with Custom Metadata**
This example extends the basic retrieval by adding custom metadata to the documents and filtering based on it.

In [None]:
from langchain_chroma import Chroma
from langchain.retrievers.self_query.base import SelfQueryRetriever
from langchain.chains.query_constructor.base import AttributeInfo
from langchain_openai import ChatOpenAI
from langchain_core.documents import Document

# Define metadata fields for filtering
metadata_field_info = [
    AttributeInfo(
        name="genre",
        description="The genre of the movie",
        type="string",
    ),
    AttributeInfo(
        name="rating",
        description="A 1-10 rating for the movie",
        type="float",
    ),
    AttributeInfo(
        name="year",
        description="The year the movie was released",
        type="integer",
    ),
    AttributeInfo(
        name="director",
        description="The name of the movie director",
        type="string",
    ),
    AttributeInfo(
        name="language",
        description="The language of the movie",
        type="string",
    ),
]
document_content_description = "Brief summary of a movie"

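# Initialize the vector store with the `embed` model created above.
# A dedicated collection keeps documents added in earlier cells out of these results.
vectorstore = Chroma(collection_name="movies_custom_metadata", embedding_function=embed)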

# Add documents with custom metadata to the vector store
documents = [
    Document(
        page_content="A space adventure with aliens",
        metadata={
            "genre": "science fiction",
            "rating": 8.5,
            "year": 2015,
            "director": "James Cameron",
            "language": "English",
        },
    ),
    Document(
        page_content="A comedy about life in a small town",
        metadata={
            "genre": "comedy",
            "rating": 7.0,
            "year": 2010,
            "director": "Wes Anderson",
            "language": "English",
        },
    ),
    Document(
        page_content="A thriller about a hacker who uncovers a conspiracy",
        metadata={
            "genre": "thriller",
            "rating": 9.0,
            "year": 2020,
            "director": "David Fincher",
            "language": "English",
        },
    ),
    Document(
        page_content="A romantic drama set in Paris",
        metadata={
            "genre": "romance",
            "rating": 8.0,
            "year": 2018,
            "director": "Sofia Coppola",
            "language": "French",
        },
    ),
    Document(
        page_content="An animated movie about a robot",
        metadata={
            "genre": "animated",
            "rating": 9.5,
            "year": 2008,
            "director": "Andrew Stanton",
            "language": "English",
        },
    ),
]

# Add documents to the vector store
vectorstore.add_documents(documents)

# Create SelfQueryRetriever
retriever = SelfQueryRetriever.from_llm(
    model,
    vectorstore,
    document_content_description,
    metadata_field_info,
)

In [None]:
# Retrieve documents with custom metadata filtering
result = retriever.invoke("comedy movie with rating less than 8")
print("Comedy movie with rating less than 8:", result)

In [None]:
# Retrieve documents with additional metadata filters
result = retriever.invoke("science fiction movie directed by James Cameron")
print("Science fiction movie directed by James Cameron:", result)

In [None]:
result = retriever.invoke("movie released after 2015")
print("Movie released after 2015:", result)

In [None]:
result = retriever.invoke("animated movie with rating greater than 9")
print("Animated movie with rating greater than 9:", result)

In [None]:
result = retriever.invoke("French movie")
print("French movie:", result)

---

## **2. Batch Processing**

### **Batch Retrieval of Multiple Queries**
This example demonstrates how to use the `batch` method to process multiple queries at once.

In [None]:
# Import required libraries
from langchain_chroma import Chroma
from langchain.retrievers.self_query.base import SelfQueryRetriever
from langchain.chains.query_constructor.base import AttributeInfo
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_core.documents import Document

# Define metadata fields for filtering
metadata_field_info = [
    AttributeInfo(
        name="genre",
        description="The genre of the movie",
        type="string",
    ),
    AttributeInfo(
        name="rating",
        description="A 1-10 rating for the movie",
        type="float",
    ),
]
document_content_description = "Brief summary of a movie"

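# Initialize the vector store with the `embed` model created above.
# A dedicated collection keeps documents added in earlier cells out of these results.
vectorstore = Chroma(collection_name="movies_batch", embedding_function=embed)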

# Add documents with metadata to the vector store
documents = [
    Document(
        page_content="A space adventure with aliens",
        metadata={"genre": "science fiction", "rating": 8.5},
    ),
    Document(
        page_content="A comedy about life in a small town",
        metadata={"genre": "comedy", "rating": 7.0},
    ),
]
vectorstore.add_documents(documents)

# Create SelfQueryRetriever
retriever = SelfQueryRetriever.from_llm(
    model,
    vectorstore,
    document_content_description,
    metadata_field_info,
)

# Define multiple queries
queries = [
    "science fiction movie with rating greater than 8",
    "comedy movie with rating less than 7.5",
]

# Perform batch retrieval
results = retriever.batch(queries)
for result in results:
    print(result)
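
Conceptually, the default `batch` behaviour for runnables is to call `invoke` once per input in a thread pool and return the results in input order. The sketch below imitates that with the standard library. `fake_invoke` and `batch_sketch` are hypothetical stand-ins, not LangChain functions.

```python
from concurrent.futures import ThreadPoolExecutor


def fake_invoke(query: str) -> list:
    """Hypothetical stand-in for `retriever.invoke`; returns labelled strings."""
    return [f"result for: {query}"]


def batch_sketch(queries: list, max_workers: int = 4) -> list:
    """Run one call per query in a thread pool; results keep input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fake_invoke, queries))


demo_queries = [
    "science fiction movie with rating greater than 8",
    "comedy movie with rating less than 7.5",
]
demo_results = batch_sketch(demo_queries)
for demo_result in demo_results:
    print(demo_result)
```

Because each query still triggers its own LLM call, batching saves wall-clock time rather than tokens. The `max_concurrency` key in the config dictionary can cap how many calls run in parallel.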

### **Batch Retrieval with Custom Configuration**
This example extends the previous one by adding custom configuration (e.g., tags and metadata) to the `batch` method.

In [None]:
# Define custom configuration
config = {"tags": ["batch_retrieval"], "metadata": {"user_id": 123}}

# Perform batch retrieval with custom configuration
results = retriever.batch(queries, config=config)
for result in results:
    print(result)

---

## **3. Streaming and Event Handling**

### **Streaming Retrieval Results**
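This example uses the `stream` method. For a retriever, `stream` falls back to the default runnable behaviour and yields the complete list of documents as a single chunk rather than one document at a time. The cell includes all necessary imports and defines the `retriever` object.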

In [None]:
# Import required libraries
from langchain_chroma import Chroma
from langchain.retrievers.self_query.base import SelfQueryRetriever
from langchain.chains.query_constructor.base import AttributeInfo
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_core.documents import Document

# Define metadata fields for filtering
metadata_field_info = [
    AttributeInfo(
        name="genre",
        description="The genre of the movie",
        type="string",
    ),
    AttributeInfo(
        name="rating",
        description="A 1-10 rating for the movie",
        type="float",
    ),
]
document_content_description = "Brief summary of a movie"

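# Initialize the vector store with the `embed` model created above.
# A dedicated collection keeps documents added in earlier cells out of these results.
vectorstore = Chroma(collection_name="movies_streaming", embedding_function=embed)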

# Add documents with metadata to the vector store
documents = [
    Document(
        page_content="A space adventure with aliens",
        metadata={"genre": "science fiction", "rating": 8.5},
    ),
    Document(
        page_content="A comedy about life in a small town",
        metadata={"genre": "comedy", "rating": 7.0},
    ),
]
vectorstore.add_documents(documents)

# Create SelfQueryRetriever
retriever = SelfQueryRetriever.from_llm(
    model,
    vectorstore,
    document_content_description,
    metadata_field_info,
)

# Stream retrieval results (each chunk is a list of documents)
for chunk in retriever.stream("science fiction movie with rating greater than 8"):
    for document in chunk:
        print(document)

### **Handling Retrieval Events**
This example extends the previous one by using `astream_events` to handle real-time events during retrieval.

In [None]:
# Stream all events during retrieval
async for event in retriever.astream_events("comedy movie with rating less than 7.5", version="v2"):
    print(event)

In [None]:
# Stream only retriever-related events
async for event in retriever.astream_events(
    "comedy movie with rating less than 7.5",
    version="v2",
    include_types=["retriever"],
):
    print(event)

In [None]:
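# Stream events with specific tags.
# Tags must be attached to the runnable first; an untagged retriever emits no matching events.
tagged_retriever = retriever.with_config(tags=["my_retriever"])
async for event in tagged_retriever.astream_events(
    "comedy movie with rating less than 7.5",
    version="v2",
    include_tags=["my_retriever"],
):
    print(event)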

In [None]:
# Stream events excluding specific runnable types.
# `exclude_types` expects runnable types such as "chat_model" or "retriever", not event names.
async for event in retriever.astream_events(
    "comedy movie with rating less than 7.5",
    version="v2",
    exclude_types=["chat_model"],
):
    print(event)

In [None]:
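# Stream events with combined filters.
# An event is kept if it matches any of the include filters, and the tag must be attached first.
async for event in retriever.with_config(tags=["my_retriever"]).astream_events(
    "comedy movie with rating less than 7.5",
    version="v2",
    include_types=["retriever"],
    include_tags=["my_retriever"],
):
    print(event)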

In [None]:
# Define a custom event schema
async def custom_event_schema(query: str):
    async for event in retriever.astream_events(query, version="v2"):
        custom_event = {
            "event_name": event["event"],
            "run_id": event["run_id"],
            "data": event["data"],
        }
        print(custom_event)

# Run the custom event schema
await custom_event_schema("comedy movie with rating less than 7.5")
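
The built-in filters select by runnable name, type, or tag. To drop events by event name instead (for example `on_retriever_start`), one option is to filter inside the loop. The sketch below uses hand-written sample events and a hypothetical `fake_event_stream` in place of `retriever.astream_events(...)`; real events carry more fields.

```python
import asyncio

# Hand-written sample events for illustration only.
SAMPLE_EVENTS = [
    {"event": "on_retriever_start", "name": "SelfQueryRetriever", "run_id": "r1", "data": {}},
    {"event": "on_chat_model_start", "name": "ChatOpenAI", "run_id": "r2", "data": {}},
    {"event": "on_chat_model_end", "name": "ChatOpenAI", "run_id": "r2", "data": {}},
    {"event": "on_retriever_end", "name": "SelfQueryRetriever", "run_id": "r1", "data": {}},
]


async def fake_event_stream():
    """Hypothetical stand-in for `retriever.astream_events(...)`."""
    for event in SAMPLE_EVENTS:
        await asyncio.sleep(0)  # yield control, as a real stream would
        yield event


async def collect_event_names(stream, skip_events=frozenset()):
    """Collect event names from `stream`, dropping names in `skip_events`."""
    names = []
    async for event in stream:
        if event["event"] in skip_events:
            continue
        names.append(event["event"])
    return names


# As a script; inside a notebook, use `await collect_event_names(...)` instead.
kept_names = asyncio.run(
    collect_event_names(fake_event_stream(), {"on_retriever_start"})
)
print(kept_names)
```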

---

## **4. Utility Functions and Lifecycle Listeners**

### **Adding Lifecycle Listeners**
This example demonstrates how to add lifecycle listeners to the retriever to track its execution. It includes all necessary imports and defines the `retriever` object.

In [None]:
# Import required libraries
from langchain_chroma import Chroma
from langchain.retrievers.self_query.base import SelfQueryRetriever
from langchain.chains.query_constructor.base import AttributeInfo
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_core.documents import Document

# Define metadata fields for filtering
metadata_field_info = [
    AttributeInfo(
        name="genre",
        description="The genre of the movie",
        type="string",
    ),
    AttributeInfo(
        name="rating",
        description="A 1-10 rating for the movie",
        type="float",
    ),
]
document_content_description = "Brief summary of a movie"

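# Initialize the vector store with the `embed` model created above.
# A dedicated collection keeps documents added in earlier cells out of these results.
vectorstore = Chroma(collection_name="movies_listeners", embedding_function=embed)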

# Add documents with metadata to the vector store
documents = [
    Document(
        page_content="A space adventure with aliens",
        metadata={"genre": "science fiction", "rating": 8.5},
    ),
    Document(
        page_content="A comedy about life in a small town",
        metadata={"genre": "comedy", "rating": 7.0},
    ),
]
vectorstore.add_documents(documents)

# Create SelfQueryRetriever
retriever = SelfQueryRetriever.from_llm(
    model,
    vectorstore,
    document_content_description,
    metadata_field_info,
)

# Define lifecycle listeners
def on_start(run):
    print(f"Retrieval started: {run}")

def on_end(run):
    print(f"Retrieval ended: {run}")

# Bind listeners to the retriever
listener_retriever = retriever.with_listeners(on_start=on_start, on_end=on_end)

# Invoke the retriever with listeners
listener_retriever.invoke("science fiction movie with rating greater than 8")

### **Using Fallback Retrievers**
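This example extends the previous one by adding a fallback that runs when the primary retriever raises an exception (for example, when the LLM call or the vector store fails). A query that simply matches no documents returns an empty list and does not trigger the fallback.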

In [None]:
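from langchain_core.documents import Document
from langchain_core.runnables import RunnableLambda

# Define a fallback that returns Document objects, like the primary retriever does
fallback_retriever = RunnableLambda(lambda x: [Document(page_content="Fallback document")])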

# Add fallback to the retriever
fallback_enabled_retriever = listener_retriever.with_fallbacks([fallback_retriever])

# Invoke the retriever with fallback
result = fallback_enabled_retriever.invoke("invalid query")
print(result)
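
The difference between a failure and an empty result is easy to miss, so here is a minimal sketch of the fallback idea in plain Python. All function names are hypothetical stand-ins; `with_fallback_sketch` only imitates the try-then-fall-back behaviour and is not LangChain's implementation.

```python
def with_fallback_sketch(primary, fallback):
    """Return a callable that tries `primary` and uses `fallback` on error."""
    def run(query):
        try:
            return primary(query)
        except Exception:
            return fallback(query)
    return run


def failing_retriever(query):
    """Simulates a retriever whose backend is down."""
    raise RuntimeError("vector store unavailable")


def empty_retriever(query):
    """Simulates a query that matches nothing: no error, just no documents."""
    return []


def fallback(query):
    return ["Fallback document"]


on_error = with_fallback_sketch(failing_retriever, fallback)("any query")
on_empty = with_fallback_sketch(empty_retriever, fallback)("any query")
print(on_error)  # the fallback result, because the primary raised
print(on_empty)  # an empty list, because nothing raised
```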

---

## **5. Configuration and Customization**

### **Customizing Search Parameters**
This example demonstrates how to customize search parameters (e.g., search type and keyword arguments). It includes all necessary imports and defines the `retriever` object.

In [None]:
# Import required libraries
from langchain_chroma import Chroma
from langchain.retrievers.self_query.base import SelfQueryRetriever
from langchain.chains.query_constructor.base import AttributeInfo
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_core.documents import Document

# Define metadata fields for filtering
metadata_field_info = [
    AttributeInfo(
        name="genre",
        description="The genre of the movie",
        type="string",
    ),
    AttributeInfo(
        name="rating",
        description="A 1-10 rating for the movie",
        type="float",
    ),
]
document_content_description = "Brief summary of a movie"

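# Initialize the vector store with the `embed` model created above.
# A dedicated collection keeps documents added in earlier cells out of these results.
vectorstore = Chroma(collection_name="movies_search_params", embedding_function=embed)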

# Add documents with metadata to the vector store
documents = [
    Document(
        page_content="A space adventure with aliens",
        metadata={"genre": "science fiction", "rating": 8.5},
    ),
    Document(
        page_content="A comedy about life in a small town",
        metadata={"genre": "comedy", "rating": 7.0},
    ),
]
vectorstore.add_documents(documents)

# Create SelfQueryRetriever
retriever = SelfQueryRetriever.from_llm(
    model,
    vectorstore,
    document_content_description,
    metadata_field_info,
)

# Customize search parameters
custom_retriever = SelfQueryRetriever.from_llm(
    model,
    vectorstore,
    document_content_description,
    metadata_field_info,
    search_type="mmr",  # Maximal Marginal Relevance
    search_kwargs={"k": 5},  # Retrieve top 5 documents
)

# Retrieve documents with custom search parameters
result = custom_retriever.invoke("science fiction movie with rating greater than 8")
print(result)
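
Maximal Marginal Relevance trades relevance against redundancy: each step picks the candidate that is most similar to the query while being least similar to what has already been selected. The sketch below is a small pure-Python version on hand-made 2-D vectors. The `cosine` and `mmr` helpers and the vectors are illustrative assumptions, not the vector store's actual implementation.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mmr(query_vec, doc_vecs, k=2, lambda_mult=0.5):
    """Return indices of up to `k` documents chosen by MMR."""
    selected = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:
        def score(i):
            relevance = cosine(query_vec, doc_vecs[i])
            # Highest similarity to anything already picked (0.0 if none yet).
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected),
                default=0.0,
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


query_vec = [1.0, 0.0]
doc_vecs = [
    [0.95, 0.31],  # doc 0: most relevant
    [0.94, 0.34],  # doc 1: near-duplicate of doc 0
    [0.80, -0.60],  # doc 2: less relevant, but different from doc 0
]
by_similarity = sorted(range(3), key=lambda i: -cosine(query_vec, doc_vecs[i]))[:2]
picked = mmr(query_vec, doc_vecs, k=2)
print(by_similarity, picked)
```

Plain similarity ranking returns the two near-duplicates, while MMR swaps the second one for the more distinct document. With `search_type="mmr"`, the `search_kwargs` dictionary is where options such as `k` are passed through to the vector store.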

### **Configurable Alternatives for Retrieval**
This example extends the previous one by using `configurable_alternatives` to switch between different retrievers at runtime.

In [None]:
from langchain_core.runnables import ConfigurableField

# Define alternative retrievers
alternative_retriever = SelfQueryRetriever.from_llm(
    model,
    vectorstore,
    document_content_description,
    metadata_field_info,
    search_type="similarity",  # Alternative search type
)

# Configure alternatives
configurable_retriever = custom_retriever.configurable_alternatives(
    ConfigurableField(id="retriever"),
    default_key="default",
    alternative=alternative_retriever,
)

# Use the default retriever
result = configurable_retriever.invoke("science fiction movie with rating greater than 8")
print(result)

# Switch to the alternative retriever
result = configurable_retriever.with_config(configurable={"retriever": "alternative"}).invoke("comedy movie with rating less than 7.5")
print(result)

---

## **Best Practices**

### **Customizing SelfQueryRetriever to Include Similarity Scores**
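This example demonstrates how to subclass `SelfQueryRetriever` to include similarity scores in the document metadata. This is useful for understanding how closely each document matches the query. With Chroma, `similarity_search_with_score` returns a distance, so lower values indicate closer matches.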

In [None]:
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain.retrievers.self_query.base import SelfQueryRetriever
from langchain.chains.query_constructor.base import AttributeInfo
from langchain_core.documents import Document
from typing import Any, Dict, List

# Define metadata fields for filtering
metadata_field_info = [
    AttributeInfo(
        name="genre",
        description="The genre of the movie. One of ['science fiction', 'comedy', 'drama', 'thriller', 'romance', 'action', 'animated']",
        type="string",
    ),
    AttributeInfo(
        name="year",
        description="The year the movie was released",
        type="integer",
    ),
    AttributeInfo(
        name="director",
        description="The name of the movie director",
        type="string",
    ),
    AttributeInfo(
        name="rating", 
        description="A 1-10 rating for the movie", 
        type="float"
    ),
]
document_content_description = "Brief summary of a movie"

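# Re-create the models (can be skipped if the setup cell has already run)
embed = OpenAIEmbeddings(model="text-embedding-3-small", api_key=my_api_key)
model = ChatOpenAI(model="gpt-4o-mini", temperature=1.0, api_key=my_api_key)
# A dedicated collection keeps documents added in earlier cells out of these results
vectorstore = Chroma(collection_name="movies_best_practices", embedding_function=embed)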

# Add documents with metadata to the vector store
documents = [
    Document(
        page_content="A bunch of scientists bring back dinosaurs and mayhem breaks loose",
        metadata={"genre": "science fiction", "rating": 7.7, "year": 1993, "director": "Steven Spielberg"},
    ),
    Document(
        page_content="A heartwarming story about a boy and his dog",
        metadata={"genre": "drama", "rating": 8.5, "year": 2009, "director": "Lasse Hallström"},
    ),
]
vectorstore.add_documents(documents)

# Subclass SelfQueryRetriever to include similarity scores
class CustomSelfQueryRetriever(SelfQueryRetriever):
    def _get_docs_with_query(
        self, query: str, search_kwargs: Dict[str, Any]
    ) -> List[Document]:
        """Get docs, adding score information."""
        try:
            # Perform similarity search with scores
            results = self.vectorstore.similarity_search_with_score(query, **search_kwargs)
            if not results:
                print("No documents matched the query.")
                return []
            
            docs, scores = zip(*results)
            for doc, score in zip(docs, scores):
                doc.metadata["score"] = score
            return list(docs)
        except Exception as e:
            print(f"An error occurred during retrieval: {e}")
            return []

# Create the custom retriever
retriever = CustomSelfQueryRetriever.from_llm(
    model,
    vectorstore,
    document_content_description,
    metadata_field_info,
)

# Retrieve documents with similarity scores
result = retriever.invoke("dinosaur movie with rating less than 8")
print(result)

### **Filtering Documents by Multiple Metadata Fields**
This example demonstrates how to use `SelfQueryRetriever` to filter documents based on multiple metadata fields (e.g., genre, year, and rating).

In [None]:
# Retrieve documents with multiple metadata filters
result = retriever.invoke("science fiction movie released after 1990 with rating greater than 7")
print(result)

### **Using SelfQueryRetriever with Custom Search Parameters**
This example demonstrates how to customize the search parameters (e.g., search type and keyword arguments) for `SelfQueryRetriever`.

In [None]:
# Customize search parameters
custom_retriever = SelfQueryRetriever.from_llm(
    model,
    vectorstore,
    document_content_description,
    metadata_field_info,
    search_type="mmr",       # Maximal Marginal Relevance
    search_kwargs={"k": 5},  # Retrieve top 5 documents
)

# Retrieve documents with custom search parameters
result = custom_retriever.invoke("drama movie with rating greater than 8")
print(result)

### **Handling Edge Cases with Fallback Retrievers**
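This example adds a fallback for failures and a default for empty results. `with_fallbacks` runs only when the retriever raises an exception; a query that matches no documents returns an empty list without triggering it, so that case is handled by a separate post-processing step.

In [None]:
from langchain_core.documents import Document
from langchain_core.runnables import RunnableLambda

# Fallback used when the primary retriever raises an exception
fallback_retriever = RunnableLambda(
    lambda x: [Document(page_content="Retrieval failed.")]
)
fallback_enabled_retriever = retriever.with_fallbacks([fallback_retriever])

# Substitute a placeholder document when the result list is empty
no_match = [Document(page_content="No matching documents found.")]
safe_retriever = fallback_enabled_retriever | RunnableLambda(lambda docs: docs or no_match)

# Invoke the retriever with both safeguards
result = safe_retriever.invoke("horror movie with rating greater than 9")
print(result)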

## **Conclusion**

The **SelfQueryRetriever** suits applications that need both semantic understanding and structured filtering. By pairing an LLM-based query constructor with metadata-aware retrieval, it covers use cases such as movie recommendation, e-commerce search, or any system where metadata determines which results are relevant.

Because it **generates structured queries automatically** from natural-language input, end users do not need to know a filter syntax, and its support for **custom metadata fields** lets it adapt to different schemas. The patterns shown in this notebook, such as **attaching similarity scores** through a subclass and wrapping the retriever with **fallbacks**, make it easier to inspect and harden in practice.