# Text to Metadata Filter

For this example, let's use *Chroma* as the vector store and create a demo set of documents that contain summaries of movies.

First, run this command to create an isolated environment for the dependencies needed to run this example.



```CMD
# Creating a virtual environment to work with all features and avoid issues with dependencies
! virtualenv venv
```

Install the necessary dependencies:
```CMD
! pip install --upgrade --quiet lark langchain-chroma langchain-ollama langchain ipykernel langchain-core langchain-community
```

Create a requirements file:
```CMD
! pip freeze > ../requirements.txt
```

> Note: Running this setup multiple times produces several copies of the same retrieved documents. For more persistent ways to store and access the data, check [here](https://python.langchain.com/docs/integrations/vectorstores/chroma/).

In [None]:
from langchain_chroma import Chroma
from langchain_core.documents import Document
from langchain_ollama import OllamaEmbeddings
from langchain.chains.query_constructor.schema import AttributeInfo
from langchain.retrievers.self_query.base import SelfQueryRetriever
from langchain_ollama import ChatOllama

docs = [
    Document(
        page_content="A bunch of scientists bring back dinosaurs and mayhem breaks loose",
        metadata={"year": 1993, "rating": 7.7, "genre": "science fiction"},
    ),
    Document(
        page_content="Leo DiCaprio gets lost in a dream within a dream within a dream within a ...",
        metadata={"year": 2010, "director": "Christopher Nolan", "rating": 8.2},
    ),
    Document(
        page_content="A psychologist / detective gets lost in a series of dreams within dreams within dreams and Inception reused the idea",
        metadata={"year": 2006, "director": "Satoshi Kon", "rating": 8.6},
    ),
    Document(
        page_content="A bunch of normal-sized women are supremely wholesome and some men pine after them",
        metadata={"year": 2019, "director": "Greta Gerwig", "rating": 8.3},
    ),
    Document(
        page_content="Toys come alive and have a blast doing so",
        metadata={"year": 1995, "genre": "animated"},
    ),
    Document(
        page_content="Three men walk into the Zone, three men walk out of the Zone",
        metadata={
            "year": 1979,
            "director": "Andrei Tarkovsky",
            "genre": "thriller",
            "rating": 9.9,
        },
    ),
]

model = OllamaEmbeddings(model="mxbai-embed-large")
vectorstore = Chroma.from_documents(docs, model)
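
The note above mentions that re-running the setup produces duplicate documents. One possible way to reduce this, sketched below, is to derive a deterministic ID from each document's text and pass the list through the `ids` argument of `Chroma.from_documents`. The helper name `stable_ids` is hypothetical, and the sketch assumes the store overwrites an entry whose ID already exists instead of adding a second copy.

```python
import hashlib


def stable_ids(texts):
    """Return one deterministic ID per text: the same text always maps to the same ID."""
    return [hashlib.sha256(text.encode("utf-8")).hexdigest() for text in texts]


# Two of the summaries used in this example.
texts = [
    "Toys come alive and have a blast doing so",
    "Three men walk into the Zone, three men walk out of the Zone",
]
ids = stable_ids(texts)

# Assumed usage with the objects defined above (not executed in this sketch):
# vectorstore = Chroma.from_documents(
#     docs, model, ids=stable_ids([d.page_content for d in docs])
# )
```

Because the IDs depend only on the text, running the cell again yields the same IDs rather than fresh random ones.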

In [None]:
fields = [
    AttributeInfo(
        name="genre",
        description="The genre of the movie",
        type="string or list[string]",
    ),
    AttributeInfo(
        name="year",
        description="The year the movie was released",
        type="integer",
    ),
    AttributeInfo(
        name="director",
        description="The name of the movie director",
        type="string",
    ),
    AttributeInfo(
        name="rating", description="A 1-10 rating for the movie", type="float"
    ),
]
description = "Summary of a movie"

llm = ChatOllama(model="llama3.1", num_predict=100)

retriever = SelfQueryRetriever.from_llm(
    llm, vectorstore, description, fields, enable_limit=True,
)

retriever.invoke("I want to watch a movie rated higher than 8.5")

[Document(id='48b15bb8-570b-4164-919d-4acb1a01b5d1', metadata={'director': 'Andrei Tarkovsky', 'genre': 'thriller', 'rating': 9.9, 'year': 1979}, page_content='Three men walk into the Zone, three men walk out of the Zone'),
 Document(id='8c56d1c4-a900-4dd1-a78c-7c26bf6c7994', metadata={'director': 'Satoshi Kon', 'rating': 8.6, 'year': 2006}, page_content='A psychologist / detective gets lost in a series of dreams within dreams within dreams and Inception reused the idea')]

This results in a retriever that takes a user query and splits it into:
- A filter to apply to the metadata of each document first.
- A query to use for semantic search on the documents.

To do this, we have to describe which fields the metadata of our documents contains; that description will be included in the prompt. The retriever will then do the following:

1. Send the query generation prompt to the LLM.
2. Parse the metadata filter and the rewritten search query from the LLM output.
3. Convert the metadata filter generated by the LLM to the format appropriate for our vector store.
4. Issue a similarity search against the vector store, filtered to only match documents whose metadata passes the generated filter.
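
The four steps above can be illustrated with a small, dependency-free sketch. It is not LangChain's implementation: the LLM is replaced by a canned response (`fake_llm`), the intermediate JSON shape and all function names are invented for illustration, and the similarity search is reduced to the metadata filter alone. The `$gt`-style operators follow the form of Chroma's metadata filters.

```python
import json
import operator

# Hypothetical in-memory stand-ins for a few of the tutorial's documents.
DOCS = [
    {"text": "Three men walk into the Zone, three men walk out of the Zone",
     "metadata": {"year": 1979, "director": "Andrei Tarkovsky",
                  "genre": "thriller", "rating": 9.9}},
    {"text": "A psychologist / detective gets lost in a series of dreams "
             "within dreams within dreams and Inception reused the idea",
     "metadata": {"year": 2006, "director": "Satoshi Kon", "rating": 8.6}},
    {"text": "Leo DiCaprio gets lost in a dream within a dream within a dream within a ...",
     "metadata": {"year": 2010, "director": "Christopher Nolan", "rating": 8.2}},
    {"text": "Toys come alive and have a blast doing so",
     "metadata": {"year": 1995, "genre": "animated"}},
]

FIELDS = {
    "rating": {"description": "A 1-10 rating for the movie", "type": "float"},
    "year": {"description": "The year the movie was released", "type": "integer"},
}

# Comparator names (invented for this sketch) mapped to Chroma-style operators.
TO_STORE_OPERATOR = {"gt": "$gt", "gte": "$gte", "lt": "$lt", "lte": "$lte", "eq": "$eq"}
APPLY_OPERATOR = {"$gt": operator.gt, "$gte": operator.ge, "$lt": operator.lt,
                  "$lte": operator.le, "$eq": operator.eq}


def build_prompt(user_query, fields):
    """Step 1: put the field descriptions and the user query into one prompt."""
    return (
        "Metadata fields:\n" + json.dumps(fields, indent=2)
        + "\nUser query: " + user_query
        + "\nAnswer with JSON containing 'query' and 'filter'."
    )


def fake_llm(prompt):
    """Stand-in for the model: returns a canned answer instead of calling an LLM."""
    return json.dumps({
        "query": "movie",
        "filter": {"comparator": "gt", "attribute": "rating", "value": 8.5},
    })


def parse_output(raw):
    """Step 2: extract the rewritten query and the metadata filter."""
    data = json.loads(raw)
    return data["query"], data["filter"]


def to_store_filter(llm_filter):
    """Step 3: convert the parsed filter to a Chroma-style `where` dictionary."""
    store_operator = TO_STORE_OPERATOR[llm_filter["comparator"]]
    return {llm_filter["attribute"]: {store_operator: llm_filter["value"]}}


def matches(metadata, store_filter):
    """Return True if the metadata satisfies every condition in the filter."""
    for attribute, condition in store_filter.items():
        if attribute not in metadata:
            return False
        for store_operator, value in condition.items():
            if not APPLY_OPERATOR[store_operator](metadata[attribute], value):
                return False
    return True


def retrieve(user_query):
    """Run all four steps; similarity ranking is skipped in this sketch."""
    raw = fake_llm(build_prompt(user_query, FIELDS))
    search_query, llm_filter = parse_output(raw)
    store_filter = to_store_filter(llm_filter)
    # Step 4: a real vector store would also rank the survivors by similarity
    # to `search_query`; here only the metadata filter is applied.
    return [doc for doc in DOCS if matches(doc["metadata"], store_filter)]


results = retrieve("I want to watch a movie rated higher than 8.5")
for doc in results:
    print(doc["metadata"]["director"], doc["metadata"]["rating"])
```

With the canned response, the filter becomes `{"rating": {"$gt": 8.5}}`, and only the Tarkovsky and Satoshi Kon entries pass it. The document without a `rating` field is excluded because it cannot satisfy the condition.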

> Note: After several changes to the prompt and model configuration, you may notice errors caused by LangChain failing to parse the model output. This can be fixed by making the prompt more specific about the output format. Llama 3.1 has some limitations, so it could be a good idea to test this script with other models such as Llama 2 or GPT-4.

