Imagine you are working on a project that involves processing a large collection of text documents, such as research papers, legal documents, or customer service logs. Your task is to develop a system that can quickly retrieve the most relevant segments of text based on a user's query. Traditional keyword-based search methods might not be sufficient, as they often fail to capture the nuanced meanings and contexts within the documents. To address this challenge, you can use different types of retrievers based on LangChain.

Using retrievers is crucial for several reasons:

- Efficiency: Retrievers enable fast and efficient retrieval of relevant information from large datasets, saving time and computational resources.
- Accuracy: By leveraging advanced retrieval techniques, these tools can provide more accurate and contextually relevant results compared to traditional search methods.
- Versatility: Different retrievers can be tailored to specific use cases, making them adaptable to various types of text data and query requirements.
- Context awareness: Some retrievers, like the Parent Document Retriever, can consider the broader context of the document, enhancing the relevance of the retrieved segments.


We will learn about four types of retrievers: `Vector Store-backed Retriever`, `Multi-Query Retriever`, `Self-Querying Retriever`, and `Parent Document Retriever`. We will also learn the differences between these retrievers and understand the appropriate situations in which to use each one. By the end of this lab, you will be equipped with the skills to implement and utilize these retrievers in your projects.

In [2]:
!pip install --user "chromadb==0.4.24" | tail -n 1


Successfully installed asgiref-3.8.1 backoff-2.2.1 bcrypt-4.3.0 chroma-hnswlib-0.7.3 chromadb-0.4.24 coloredlogs-15.0.1 durationpy-0.10 httptools-0.6.4 humanfriendly-10.0 kubernetes-33.1.0 mmh3-5.1.0 onnxruntime-1.22.0 opentelemetry-api-1.34.1 opentelemetry-exporter-otlp-proto-common-1.34.1 opentelemetry-exporter-otlp-proto-grpc-1.34.1 opentelemetry-instrumentation-0.55b1 opentelemetry-instrumentation-asgi-0.55b1 opentelemetry-instrumentation-fastapi-0.55b1 opentelemetry-proto-1.34.1 opentelemetry-sdk-1.34.1 opentelemetry-semantic-conventions-0.55b1 opentelemetry-util-http-0.55b1 overrides-7.7.0 posthog-5.1.0 pulsar-client-3.7.0 pypika-0.48.9 python-dotenv-1.1.0 uvloop-0.21.0 watchfiles-1.1.0


In [3]:
pip install numpy==1.26.4


Collecting numpy==1.26.4
  Downloading numpy-1.26.4-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (61 kB)
Downloading numpy-1.26.4-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (18.3 MB)
Installing collected packages: numpy
  Attempting uninstall: numpy
    Found existing installation: numpy 2.0.2
    Uninstalling numpy-2.0.2:
      Successfully uninstalled numpy-2.0.2
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
thinc 8.3.6 requires numpy<3.0.0,>=2.0.0, but yo

In [4]:
pip install -U langchain-community

Collecting langchain-community
  Downloading langchain_community-0.3.25-py3-none-any.whl.metadata (2.9 kB)
Collecting dataclasses-json<0.7,>=0.5.7 (from langchain-community)
  Downloading dataclasses_json-0.6.7-py3-none-any.whl.metadata (25 kB)
Collecting pydantic-settings<3.0.0,>=2.4.0 (from langchain-community)
  Downloading pydantic_settings-2.9.1-py3-none-any.whl.metadata (3.8 kB)
Collecting httpx-sse<1.0.0,>=0.4.0 (from langchain-community)
  Downloading httpx_sse-0.4.0-py3-none-any.whl.metadata (9.0 kB)
Collecting marshmallow<4.0.0,>=3.18.0 (from dataclasses-json<0.7,>=0.5.7->langchain-community)
  Downloading marshmallow-3.26.1-py3-none-any.whl.metadata (7.3 kB)
Collecting typing-inspect<1,>=0.4.0 (from dataclasses-json<0.7,>=0.5.7->langchain-community)
  Downloading typing_inspect-0.9.0-py3-none-any.whl.metadata (1.5 kB)
Collecting mypy-extensions>=0.3.0 (from typing-inspect<1,>=0.4.0->dataclasses-json<0.7,>=0.5.7->langchain-community)
  Downloading mypy_extensions-1.1.0-py3-no

In [27]:
#!pip install --user "lark==1.1.9" | tail -n 1
pip install -U lark


Successfully installed lark-1.1.9


In [1]:
# You can use this section to suppress warnings generated by your code:
def warn(*args, **kwargs):
    pass
import warnings
warnings.warn = warn
warnings.filterwarnings('ignore')

The helper functions below cover three prerequisites for working with retrievers:

- Building an LLM
- Splitting documents into chunks
- Building an embedding model

These topics have been covered in detail in previous lessons.


In [2]:
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline

def llm():
    model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"  # Small, openly available chat model (1.1B parameters)

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id)

    # Use pipeline for easier generation
    text_generator = pipeline("text-generation", model=model, tokenizer=tokenizer)

    def generate(prompt):
        output = text_generator(
            prompt,
            max_new_tokens=256,
            temperature=0.5,
            do_sample=True,
            top_p=0.95,
            top_k=50
        )
        return output[0]["generated_text"]

    return generate


### Text Splitter

In [3]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

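def text_splitter(data, chunk_size, chunk_overlap):
    # Split loaded documents into chunks of at most `chunk_size` characters,
    # with up to `chunk_overlap` characters shared between neighbouring chunks
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap,
        length_function=len,
    )
    return splitter.split_documents(data)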

### Embedding Model



In [4]:
from langchain_community.embeddings import HuggingFaceEmbeddings

def huggingface_embedding():
    model_name = "sentence-transformers/all-MiniLM-L6-v2"  # Fast, light, and effective

    embedding = HuggingFaceEmbeddings(
        model_name=model_name,
        model_kwargs={"device": "cpu"}  # or "cuda" if GPU is available
    )
    return embedding


## Retrievers

A retriever is an interface that returns documents given an unstructured query. It is more general than a vector store: a retriever does not need to store documents, only to find and return relevant ones. Vector stores can serve as the backbone of a retriever, but other types of retrievers exist as well.

Retrievers take a string `query` as input and return a list of `Document` objects.

### Vector Store-Backed Retriever

A vector store retriever is a type of retriever that utilizes a vector store to fetch documents. It acts as a lightweight wrapper around the vector store class, enabling it to conform to the retriever interface. This retriever leverages the search methods implemented by the vector store, such as similarity search and Maximum Marginal Relevance (MMR), to query texts stored within it.
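To make the "lightweight wrapper" idea concrete, here is a minimal sketch in plain Python. It is not LangChain's implementation: `ToyVectorStore`, `ToyRetriever`, and the word-count `embed` function are hypothetical stand-ins chosen for illustration, and a real vector store would use a trained embedding model. The point is only that the retriever forwards the query to the store's search method.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (a stand-in for a real embedding model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToyVectorStore:
    """Hypothetical in-memory store: keeps each text next to its embedding."""

    def __init__(self, texts):
        self.entries = [(text, embed(text)) for text in texts]

    def similarity_search(self, query, k=4):
        query_vec = embed(query)
        ranked = sorted(
            self.entries, key=lambda entry: cosine(query_vec, entry[1]), reverse=True
        )
        return [text for text, _ in ranked[:k]]

    def as_retriever(self, k=4):
        return ToyRetriever(self, k)


class ToyRetriever:
    """Thin wrapper: it only forwards the query to the store's search method."""

    def __init__(self, store, k):
        self.store = store
        self.k = k

    def invoke(self, query):
        return self.store.similarity_search(query, k=self.k)


store = ToyVectorStore([
    "internet and email policy",
    "code of conduct",
    "health and safety policy",
])
top_hits = store.as_retriever(k=2).invoke("email policy")
print(top_hits)
```

The retriever holds no documents of its own; swapping the store's search method (for example, for a diversity-aware one) changes the results without changing the retriever interface.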



In [5]:
!wget "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/MZ9z1lm-Ui3YBp3SYWLTAQ/companypolicies.txt"

--2025-06-19 11:54:38--  https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/MZ9z1lm-Ui3YBp3SYWLTAQ/companypolicies.txt
Resolving cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud (cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud)... 169.63.118.104
Connecting to cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud (cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud)|169.63.118.104|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 15660 (15K) [text/plain]
Saving to: ‘companypolicies.txt.2’


2025-06-19 11:54:38 (149 MB/s) - ‘companypolicies.txt.2’ saved [15660/15660]



In [6]:
from langchain_community.document_loaders import TextLoader
loader = TextLoader("companypolicies.txt")
txt_data = loader.load()

In [7]:
txt_data



Split `txt_data` into chunks, using `chunk_size = 200` and `chunk_overlap = 20`.


In [8]:
chunks_txt = text_splitter(txt_data, 200, 20)
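As a rough illustration of what `chunk_size` and `chunk_overlap` mean, the sketch below cuts a string into fixed-size windows that share a few characters. This is a simplification and an assumption on my part: `RecursiveCharacterTextSplitter` prefers to break on separators such as paragraphs and spaces, so its chunks will not line up with these exact windows. The helper name `split_with_overlap` is hypothetical.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut `text` into windows of at most `chunk_size` characters.

    Consecutive windows share `chunk_overlap` characters.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
pieces = split_with_overlap(sample, chunk_size=10, chunk_overlap=3)
for piece in pieces:
    print(piece)
```

Each window starts `chunk_size - chunk_overlap` characters after the previous one, so the last three characters of one piece reappear at the start of the next. The overlap helps preserve context that would otherwise be cut at a chunk boundary.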

Embed the chunks and store them in a Chroma vector database.


In [9]:
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma

# Step 1: Define the embedding model
embedding_model = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2",
    model_kwargs={"device": "cpu"}  # change to "cuda" if GPU available
)

# Step 2: Create Chroma vectorstore
vectordb = Chroma.from_documents(chunks_txt, embedding_model)


In [10]:
chunks_txt

[Document(metadata={'source': 'companypolicies.txt'}, page_content='1.\tCode of Conduct'),
 Document(metadata={'source': 'companypolicies.txt'}, page_content='Our Code of Conduct outlines the fundamental principles and ethical standards that guide every member of our organization. We are committed to maintaining a workplace that is built on integrity,'),
 Document(metadata={'source': 'companypolicies.txt'}, page_content='built on integrity, respect, and accountability.'),
 Document(metadata={'source': 'companypolicies.txt'}, page_content='Integrity: We hold ourselves to the highest ethical standards. This means acting honestly and transparently in all our interactions, whether with colleagues, clients, or the broader community. We'),
 Document(metadata={'source': 'companypolicies.txt'}, page_content='community. We respect and protect sensitive information, and we avoid conflicts of interest.'),
 Document(metadata={'source': 'companypolicies.txt'}, page_content="Respect: We embrace dive

#### Simple similarity search

Here is an example of a simple similarity search based on the vector database.

For this demonstration, the query has been set to "email policy".


In [11]:
query = "email policy"
retriever = vectordb.as_retriever()
docs = retriever.invoke(query)
docs

[Document(metadata={'source': 'companypolicies.txt'}, page_content='3.\tInternet and Email Policy'),
 Document(metadata={'source': 'companypolicies.txt'}, page_content='Our Internet and Email Policy aims to promote safe, responsible usage of digital communication tools that align with our values and legal obligations. Each employee is expected to understand and'),
 Document(metadata={'source': 'companypolicies.txt'}, page_content='Our Internet and Email Policy is established to guide the responsible and secure use of these essential tools within our organization. We recognize their significance in daily business operations and'),
 Document(metadata={'source': 'companypolicies.txt'}, page_content='Confidentiality: Reserve email for the transmission of confidential information, trade secrets, and sensitive customer data only when encryption is applied. Exercise discretion when discussing')]

You can also pass `search_kwargs`, such as `k`, to limit the number of results returned.


In [12]:
retriever = vectordb.as_retriever(search_kwargs={"k": 1})
docs = retriever.invoke(query)
docs

[Document(metadata={'source': 'companypolicies.txt'}, page_content='3.\tInternet and Email Policy')]

#### MMR retrieval

MMR in vector stores is a technique used to balance the relevance and diversity of retrieved results. It selects documents that are both highly relevant to the query and minimally similar to previously selected documents. This approach helps to avoid redundancy and ensures a more comprehensive coverage of different aspects of the query.

The following code shows how to run an MMR search against the vector database. You only need to specify `search_type="mmr"`.


In [13]:
retriever = vectordb.as_retriever(search_type="mmr")
docs = retriever.invoke(query)
docs

[Document(metadata={'source': 'companypolicies.txt'}, page_content='3.\tInternet and Email Policy'),
 Document(metadata={'source': 'companypolicies.txt'}, page_content='Confidentiality: Reserve email for the transmission of confidential information, trade secrets, and sensitive customer data only when encryption is applied. Exercise discretion when discussing'),
 Document(metadata={'source': 'companypolicies.txt'}, page_content='Review of Policy: This policy will be reviewed periodically to ensure its alignment with evolving legal requirements and best practices for maintaining a healthy and safe workplace.'),
 Document(metadata={'source': 'companypolicies.txt'}, page_content='individual found to be in violation of this policy.')]
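To make the relevance/diversity trade-off concrete, the following sketch implements a greedy MMR selection in plain Python on hand-made 2-D vectors. It is an illustration under stated assumptions, not the vector store's own code: the vectors are invented, and `mmr_select` and its `lambda_mult` weight are names chosen here (with `lambda_mult=1.0` meaning pure relevance).

```python
import math


def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mmr_select(query_vec, doc_vecs, k, lambda_mult=0.5):
    """Return the indices of k documents chosen by greedy MMR."""
    selected = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:

        def score(i):
            relevance = cosine(query_vec, doc_vecs[i])
            # Similarity to the closest document that was already picked
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


query_vec = [1.0, 0.0]
doc_vecs = [
    [1.0, 0.1],    # 0: very relevant
    [1.0, 0.12],   # 1: near-duplicate of document 0
    [0.6, -0.8],   # 2: less relevant, but covers a different direction
]

by_similarity = sorted(
    range(len(doc_vecs)), key=lambda i: cosine(query_vec, doc_vecs[i]), reverse=True
)[:2]
by_mmr = mmr_select(query_vec, doc_vecs, k=2, lambda_mult=0.5)
print("similarity:", by_similarity)
print("mmr:", by_mmr)
```

Plain similarity ranking returns the two near-duplicates, while MMR keeps the best match and then prefers the document that adds something new. This mirrors the retrieval output above, where the MMR results are less repetitive than the similarity results.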

#### Similarity score threshold retrieval

You can also set a retrieval method that defines a similarity score threshold, returning only documents with a score above that threshold.


In [14]:
retriever = vectordb.as_retriever(
    search_type="similarity_score_threshold", search_kwargs={"score_threshold": 0.4}
)
docs = retriever.invoke(query)
docs

[Document(metadata={'source': 'companypolicies.txt'}, page_content='3.\tInternet and Email Policy'),
 Document(metadata={'source': 'companypolicies.txt'}, page_content='Our Internet and Email Policy aims to promote safe, responsible usage of digital communication tools that align with our values and legal obligations. Each employee is expected to understand and'),
 Document(metadata={'source': 'companypolicies.txt'}, page_content='Our Internet and Email Policy is established to guide the responsible and secure use of these essential tools within our organization. We recognize their significance in daily business operations and'),
 Document(metadata={'source': 'companypolicies.txt'}, page_content='Confidentiality: Reserve email for the transmission of confidential information, trade secrets, and sensitive customer data only when encryption is applied. Exercise discretion when discussing')]
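Conceptually, threshold retrieval is a filter applied to scored results. The sketch below shows that idea in isolation. The scores are hypothetical values invented for illustration (they are not the scores Chroma computed for this query), and `filter_by_threshold` is a helper name chosen here. It assumes relevance scores normalized to the range 0 to 1, where higher means more relevant.

```python
def filter_by_threshold(scored_docs, score_threshold):
    """Keep (text, score) pairs that reach `score_threshold`, best match first."""
    kept = [(text, score) for text, score in scored_docs if score >= score_threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Hypothetical relevance scores in [0, 1]; higher means more relevant.
scored_docs = [
    ("Internet and Email Policy", 0.71),
    ("Review of Policy", 0.38),
    ("Confidentiality: Reserve email for ...", 0.46),
    ("Code of Conduct", 0.12),
]

kept = filter_by_threshold(scored_docs, score_threshold=0.4)
print(kept)
```

Unlike a fixed `k`, a threshold can return fewer results (or none) when nothing in the store is close enough to the query.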

### Multi-Query Retriever

Distance-based vector database retrieval represents queries in high-dimensional space and finds similar embedded documents based on "distance". However, retrieval results may vary with subtle changes in query wording or if the embeddings do not accurately capture the data's semantics.

The `MultiQueryRetriever` addresses this by using an LLM to generate multiple queries from different perspectives for a given user input query. For each query, it retrieves a set of relevant documents and then takes the unique union of these results to form a larger set of potentially relevant documents. By generating multiple perspectives on the same question, the `MultiQueryRetriever` can potentially overcome some limitations of distance-based retrieval, resulting in a richer and more diverse set of results.

A PDF document has been prepared to demonstrate this Multi-Query Retriever.


In [15]:
pip install pypdf



In [16]:
from langchain_community.document_loaders import PyPDFLoader
loader = PyPDFLoader("https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/ioch1wsxkfqgfLLgmd-6Rw/langchain-paper.pdf")
pdf_data = loader.load()
pdf_data[1]

Document(metadata={'producer': 'PyPDF', 'creator': 'Microsoft Word', 'creationdate': '2023-12-31T03:50:13+00:00', 'author': 'IEEE', 'moddate': '2023-12-31T03:52:06+00:00', 'title': 's8329 final', 'source': 'https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/ioch1wsxkfqgfLLgmd-6Rw/langchain-paper.pdf', 'total_pages': 6, 'page': 1, 'page_label': '2'}, page_content='LangChain helps us to unlock the ability to harness the \nLLM’s immense potential in tasks such as document analysis, \nchatbot development, code analysis, and countless other \napplications. Whether your desire is to unlock deeper natural \nlanguage understanding , enhance data, or circumvent \nlanguage barriers through translation, LangChain is ready to \nprovide the tools and programming support you need to do \nwithout it that it is not only difficult but also fresh for you. Its \ncore functionalities encompass: \n1. Context-Aware Capabilities: LangChain facilitates the \ndevelopment of applications that ar

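Split the document and store the embeddings in a vector database. The store still holds the company-policy chunks, so first list their IDs so that they can be deleted.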


In [17]:
vectordb.get()["ids"]

['010fdbbb-e9f6-4423-8d53-62a99574bf05',
 '036ca1ce-a33a-4220-8fc5-6dfca38b1930',
 '03cced6c-ac00-417e-b3f0-8b32bc4a88eb',
 '04d43042-1b89-4a14-8c34-b3dc4c85f961',
 '0517d0f5-39f0-4c89-a4a6-153d010e30a6',
 '0b21911f-476c-4be4-b63f-f8a43f01eba9',
 '0d109e24-66c7-4689-9d9d-45bf64f57599',
 '0e36748e-4789-4ade-af7c-6b95b5b45d59',
 '0f512b04-c3f6-4937-ab78-fccbb9a84ef5',
 '1337c7c4-941a-4fc6-9dc3-7f1826c8e0fe',
 '18a49912-e167-4df2-8a81-1a9a19aa6e49',
 '1b10c21e-1130-4f88-a43b-b37d55c9175e',
 '1b1daf36-d7bd-42a0-a8dc-7e7d50dc0d7b',
 '1b333c30-497a-4012-815e-b82e350209d1',
 '20c622bd-ccd1-47d5-9521-91cc195d6162',
 '230bec70-7303-4f58-866c-f96e0365dc4d',
 '24fd6edb-a6fe-4068-a3c9-1a1f4b888b68',
 '29767d5b-fdfc-46ce-993b-d32c621232ef',
 '29b4b0a8-95db-4eb0-ab8d-e90f977b004e',
 '2ecde1e9-7743-432f-aec8-83af20d2caa3',
 '34fb519a-6d00-41ea-862b-e53452c9c7c8',
 '353ed068-bd6c-4af6-9496-52ba9e19daad',
 '3628d61f-d624-4da0-b181-f3ac20658f03',
 '3c1fc272-0e8a-4586-9331-c4157d3a44dd',
 '3dbe4d1a-3b1d-

In [18]:
# Split
chunks_pdf = text_splitter(pdf_data, 500, 20)

# VectorDB
ids = vectordb.get()["ids"]
vectordb.delete(ids=ids)  # remove the company-policy chunks before indexing the PDF
vectordb = Chroma.from_documents(documents=chunks_pdf, embedding=embedding_model)

In [19]:
from langchain.retrievers.multi_query import MultiQueryRetriever

query = "What does the paper say about langchain?"

retriever = MultiQueryRetriever.from_llm(
    retriever=vectordb.as_retriever(), llm=llm()
)

Device set to use cuda:0


In [20]:
import logging

logging.basicConfig()
logging.getLogger("langchain.retrievers.multi_query").setLevel(logging.INFO)

In [21]:
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
from langchain_community.llms import HuggingFacePipeline

# Step 1: load model
model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Step 2: build pipeline
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=256,
    temperature=0.5,
    do_sample=True,
    top_p=0.95,
    top_k=50
)

# Step 3: wrap into LangChain LLM object
lc_llm = HuggingFacePipeline(pipeline=pipe)

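# MultiQueryRetriever expects a LangChain LLM object, so pass the wrapped
# pipeline rather than the plain function returned by `llm()`
retriever = MultiQueryRetriever.from_llm(
    retriever=vectordb.as_retriever(),
    llm=lc_llm
)

# Run the retrieval; the generated queries are logged at INFO level
docs = retriever.invoke(query)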


Device set to use cuda:0
INFO:langchain.retrievers.multi_query:Generated queries: ['You are an AI language model assistant. Your task is ', '    to generate 3 different versions of the given user ', '    question to retrieve relevant documents from a vector  database. ', '    By generating multiple perspectives on the user question, ', '    your goal is to help the user overcome some of the limitations ', '    of distance-based similarity search. Provide these alternative ', '    questions separated by newlines. Original question: What does the paper say about langchain? Answer: Langchain is a set of tools and techniques for building and deploying language models. The paper provides a comprehensive overview of the state-of-the-art in this area, including the challenges that researchers face in building language models, the benefits of using language models in natural language processing, and the opportunities for future research.']


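The log shows what the retriever treated as generated queries. In this run, the small TinyLlama model did not produce three clean alternative questions: its output echoed the prompt, and the retriever split that text on newlines and used every line as a query. With a stronger instruction-following model, you would expect the log to list three rephrasings of the original question from different perspectives.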

The returned results are the union of the results from each query.
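The "unique union" step can be sketched without any model at all. In the example below, `fake_generate_queries` and `fake_retrieve` are hypothetical stand-ins for the LLM and the base retriever, and the chunk names are invented. Only the merging logic is the point, and it is a simplified illustration rather than the `MultiQueryRetriever` source.

```python
def unique_union(result_lists):
    """Merge several result lists, keeping the first occurrence of each document."""
    seen = set()
    merged = []
    for results in result_lists:
        for doc in results:
            if doc not in seen:
                seen.add(doc)
                merged.append(doc)
    return merged


def multi_query_retrieve(question, generate_queries, retrieve):
    """Retrieve once per generated query and merge the results."""
    queries = generate_queries(question)
    return unique_union(retrieve(q) for q in queries)


# Hypothetical stand-in for the LLM: returns three rephrasings of the question.
def fake_generate_queries(question):
    return [
        "What is LangChain used for?",
        "How does the paper describe LangChain?",
        "Which LangChain features does the paper cover?",
    ]


# Hypothetical stand-in for the base retriever: a fixed lookup table.
FAKE_INDEX = {
    "What is LangChain used for?": ["chunk A", "chunk B"],
    "How does the paper describe LangChain?": ["chunk B", "chunk C"],
    "Which LangChain features does the paper cover?": ["chunk A", "chunk D"],
}


def fake_retrieve(query):
    return FAKE_INDEX[query]


merged = multi_query_retrieve(
    "What does the paper say about langchain?", fake_generate_queries, fake_retrieve
)
print(merged)
```

Each query contributes two chunks, but duplicates are dropped, so the merged list is larger than any single result set without repeating documents.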


### Self-Querying Retriever

A Self-Querying Retriever, as the name suggests, has the ability to query itself. Specifically, given a natural language query, the retriever uses a query-constructing LLM chain to generate a structured query. It then applies this structured query to its underlying vector store. This enables the retriever to not only use the user-input query for semantic similarity comparison with the contents of stored documents but also to extract and apply filters based on the metadata of those documents.
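The sketch below illustrates the second half of that process: applying a structured filter to document metadata. It makes several assumptions. The structured query is written by hand instead of being produced by an LLM, the filter is represented as nested tuples instead of a parsed expression, and the semantic-similarity step is omitted. The `matches` helper and the small `movies` list are hypothetical.

```python
import operator

# Comparator names follow the common eq / ne / gt / gte / lt / lte convention.
COMPARATORS = {
    "eq": operator.eq,
    "ne": operator.ne,
    "gt": operator.gt,
    "gte": operator.ge,
    "lt": operator.lt,
    "lte": operator.le,
}


def matches(metadata, condition):
    """Evaluate a nested (op, ...) condition against one document's metadata."""
    op, *args = condition
    if op == "and":
        return all(matches(metadata, sub) for sub in args)
    if op == "or":
        return any(matches(metadata, sub) for sub in args)
    attr, value = args
    if attr not in metadata:
        return False  # a document missing the attribute cannot satisfy the comparison
    return COMPARATORS[op](metadata[attr], value)


movies = [
    {"text": "Dinosaurs run loose", "metadata": {"year": 1993, "rating": 7.7}},
    {"text": "Dreams within dreams", "metadata": {"year": 2010, "rating": 8.2}},
    {"text": "Three men walk into the Zone", "metadata": {"year": 1979, "rating": 9.9}},
    {"text": "Toys come alive", "metadata": {"year": 1995}},
]

# Hand-written example of what a query-constructing LLM might produce for
# "I want to watch a movie rated higher than 8.5".
structured = {"query": "", "filter": ("gt", "rating", 8.5)}

hits = [m["text"] for m in movies if matches(m["metadata"], structured["filter"])]
print(hits)
```

Separating the free-text part (`query`) from the metadata condition (`filter`) is what lets the retriever answer questions such as "rated higher than 8.5", which embedding similarity alone cannot express reliably.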


In [22]:
from langchain_core.documents import Document
from langchain.chains.query_constructor.base import AttributeInfo
from langchain.retrievers.self_query.base import SelfQueryRetriever
import lark  # parser library required by SelfQueryRetriever's query constructor

A few documents have been prepared in which `page_content` contains a short description of a movie and `metadata` holds attributes of that movie, such as `year`, `rating`, `genre`, and `director`. These attributes are crucial for the Self-Querying Retriever, because the LLM uses the metadata to build filters during retrieval.


In [23]:
docs = [
    Document(
        page_content="A bunch of scientists bring back dinosaurs and mayhem breaks loose",
        metadata={"year": 1993, "rating": 7.7, "genre": "science fiction"},
    ),
    Document(
        page_content="Leo DiCaprio gets lost in a dream within a dream within a dream within a ...",
        metadata={"year": 2010, "director": "Christopher Nolan", "rating": 8.2},
    ),
    Document(
        page_content="A psychologist / detective gets lost in a series of dreams within dreams within dreams and Inception reused the idea",
        metadata={"year": 2006, "director": "Satoshi Kon", "rating": 8.6},
    ),
    Document(
        page_content="A bunch of normal-sized women are supremely wholesome and some men pine after them",
        metadata={"year": 2019, "director": "Greta Gerwig", "rating": 8.3},
    ),
    Document(
        page_content="Toys come alive and have a blast doing so",
        metadata={"year": 1995, "genre": "animated"},
    ),
    Document(
        page_content="Three men walk into the Zone, three men walk out of the Zone",
        metadata={
            "year": 1979,
            "director": "Andrei Tarkovsky",
            "genre": "thriller",
            "rating": 9.9,
        },
    ),
]

Now you can instantiate your retriever. To do this, you'll need to provide some upfront information about the metadata fields that your documents support, as well as a brief description of the document contents.


In [24]:
metadata_field_info = [
    AttributeInfo(
        name="genre",
        description="The genre of the movie. One of ['science fiction', 'comedy', 'drama', 'thriller', 'romance', 'action', 'animated']",
        type="string",
    ),
    AttributeInfo(
        name="year",
        description="The year the movie was released",
        type="integer",
    ),
    AttributeInfo(
        name="director",
        description="The name of the movie director",
        type="string",
    ),
    AttributeInfo(
        name="rating", description="A 1-10 rating for the movie", type="float"
    ),
]

In [25]:
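# Store the documents' embeddings in a vector database.
# A separate collection name (chosen here for illustration) keeps the movie
# documents apart from the PDF chunks indexed earlier.
vectordb = Chroma.from_documents(docs, embedding=embedding_model, collection_name="movies")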

Use the `SelfQueryRetriever`.


In [26]:
import lark

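Now you can try using your retriever. The query constructor relies on the LLM returning a JSON object with `query` and `filter` fields, so a small model may fail at this step with an `OutputParserException`, as the run below shows.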


In [29]:
# Step 1: load a small instruction-tuned chat model
model_id = "Qwen/Qwen1.5-1.8B-Chat"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Step 2: build the text-generation pipeline
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=256,
    temperature=0.5,
    do_sample=True,
    top_p=0.95,
    top_k=50
)

# Step 3: wrap into LangChain LLM object
lc_llm = HuggingFacePipeline(pipeline=pipe)

document_content_description = "Brief summary of a movie."

retriever = SelfQueryRetriever.from_llm(
    lc_llm,
    vectordb,
    document_content_description,
    metadata_field_info,
)



Device set to use cuda:0


OutputParserException: Parsing text
Your goal is to structure the user's query to match the request schema provided below.

<< Structured Request Schema >>
When responding use a markdown code snippet with a JSON object formatted in the following schema:

```json
{
    "query": string \ text string to compare to document contents
    "filter": string \ logical condition statement for filtering documents
}
```

The query string should contain only text that is expected to match the contents of documents. Any conditions in the filter should not be mentioned in the query as well.

A logical condition statement is composed of one or more comparison and logical operation statements.

A comparison statement takes the form: `comp(attr, val)`:
- `comp` (eq | ne | gt | gte | lt | lte): comparator
- `attr` (string):  name of attribute to apply the comparison to
- `val` (string): is the comparison value

A logical operation statement takes the form `op(statement1, statement2, ...)`:
- `op` (and | or): logical operator
- `statement1`, `statement2`, ... (comparison statements or logical operation statements): one or more statements to apply the operation to

Make sure that you only use the comparators and logical operators listed above and no others.
Make sure that filters only refer to attributes that exist in the data source.
Make sure that filters only use the attributed names with its function names if there are functions applied on them.
Make sure that filters only use format `YYYY-MM-DD` when handling date data typed values.
Make sure that filters take into account the descriptions of attributes and only make comparisons that are feasible given the type of data being stored.
Make sure that filters are only used as needed. If there are no filters that should be applied return "NO_FILTER" for the filter value.

<< Example 1. >>
Data Source:
```json
{
    "content": "Lyrics of a song",
    "attributes": {
        "artist": {
            "type": "string",
            "description": "Name of the song artist"
        },
        "length": {
            "type": "integer",
            "description": "Length of the song in seconds"
        },
        "genre": {
            "type": "string",
            "description": "The song genre, one of "pop", "rock" or "rap""
        }
    }
}
```

User Query:
What are songs by Taylor Swift or Katy Perry about teenage romance under 3 minutes long in the dance pop genre

Structured Request:
```json
{
    "query": "teenager love",
    "filter": "and(or(eq(\"artist\", \"Taylor Swift\"), eq(\"artist\", \"Katy Perry\")), lt(\"length\", 180), eq(\"genre\", \"pop\"))"
}
```


<< Example 2. >>
Data Source:
```json
{
    "content": "Lyrics of a song",
    "attributes": {
        "artist": {
            "type": "string",
            "description": "Name of the song artist"
        },
        "length": {
            "type": "integer",
            "description": "Length of the song in seconds"
        },
        "genre": {
            "type": "string",
            "description": "The song genre, one of "pop", "rock" or "rap""
        }
    }
}
```

User Query:
What are songs that were not published on Spotify

Structured Request:
```json
{
    "query": "",
    "filter": "NO_FILTER"
}
```


<< Example 3. >>
Data Source:
```json
{
    "content": "Brief summary of a movie.",
    "attributes": {
    "genre": {
        "description": "The genre of the movie. One of ['science fiction', 'comedy', 'drama', 'thriller', 'romance', 'action', 'animated']",
        "type": "string"
    },
    "year": {
        "description": "The year the movie was released",
        "type": "integer"
    },
    "director": {
        "description": "The name of the movie director",
        "type": "string"
    },
    "rating": {
        "description": "A 1-10 rating for the movie",
        "type": "float"
    }
}
}
```

User Query:
I want to watch a movie rated higher than 8.5

Structured Request:
```json
{
    "query": "movie(rating > 8.5)",
    "filter": "AND(neq(\",genre\",","science fiction"), eq(\",rating\",","9"))
}
```

Note: In the above example, we have added an additional condition using `neq` to exclude movies with a genre of science fiction. The logical operator `AND` combines all the conditions specified in the query. The first condition checks if the genre is not'science fiction'. The second condition checks if the rating is greater than 8.5. Finally, the `NEQ` operator ensures that only movies with a genre other than'science fiction' and a rating greater than 8.5 are returned. The result is a filtered list of movies that meet both conditions. The resulting JSON object will look like this:

```json
{
    "query": "movie(rating > 8.5) AND (genre!='science fiction' AND rating > 8.5)",
    "filter": "NO_FILTER"
}
```

In this structured request, the user has specified three conditions:

1. The query matches documents where the `genre` attribute is not equal to'science fiction'.
2. The query also
 raised following error:
Got invalid JSON object. Error: Expecting value: line 2 column 14 (char 15)
For troubleshooting, visit: https://python.langchain.com/docs/troubleshooting/errors/OUTPUT_PARSING_FAILURE 

In [32]:
retriever.invoke("I want to watch a movie rated higher than 8.5")

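ImportError: cannot import name 'SelfQueryRetriever' from 'langchain.chains.query_constructor.base' (/usr/local/lib/python3.11/dist-packages/langchain/chains/query_constructor/base.py)

This `ImportError` is raised when `SelfQueryRetriever` is imported from `langchain.chains.query_constructor.base`. The class is defined in `langchain.retrievers.self_query.base`, which is the import path used earlier in this section.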