# Self-Query Retriever
    
A *self-query* retriever is, as the name suggests, a retriever that can query itself.
Given any natural-language query, the retriever uses an LLM chain to write a structured query and then applies that structured query to its underlying VectorStore. This lets the retriever extract filters on the metadata of the stored documents from the user's query and execute them, in addition to using the user's input for semantic similarity comparison against the contents of the stored documents.

This type of retriever is useful when users ask questions that are better answered by retrieving documents based on their metadata rather than on similarity with the text. In practice, it uses an LLM to turn the user's input into:
1) a string to search for semantically;
2) a metadata filter to accompany the search string.

It is a widely used retriever because questions are often about the documents' metadata rather than their content. However, it introduces an extra LLM step that often slows down the generation of a response and increases its cost (compute if the LLM runs locally, usage fees if it is a cloud service).
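The two-part decomposition described above (a search string plus a metadata filter) can be illustrated with a minimal, dependency-free sketch. Everything in it is hypothetical and written only for illustration: `ToyStructuredQuery`, `matches`, `overlap`, and `toy_retrieve` are not part of LangChain, the filter is limited to AND-ed comparisons, and word overlap stands in for real embedding similarity.

```python
import operator
from dataclasses import dataclass, field

# Comparator names mirror the ones used in the retriever's prompt (eq, ne, gt, ...).
COMPARATORS = {
    "eq": operator.eq, "ne": operator.ne,
    "gt": operator.gt, "gte": operator.ge,
    "lt": operator.lt, "lte": operator.le,
}

@dataclass
class ToyStructuredQuery:
    """Hypothetical stand-in for the structured query an LLM would produce."""
    query: str                                       # text to match semantically
    conditions: list = field(default_factory=list)   # (comparator, attribute, value), AND-ed

def matches(metadata, conditions):
    # Toy choice: a document missing the attribute never matches.
    for comp, attr, value in conditions:
        if attr not in metadata or not COMPARATORS[comp](metadata[attr], value):
            return False
    return True

def overlap(query, text):
    # Crude stand-in for embedding similarity: number of shared lowercase words.
    return len(set(query.lower().split()) & set(text.lower().split()))

def toy_retrieve(sq, docs):
    # 1) filter on metadata, 2) rank the survivors by similarity to the query string
    kept = [d for d in docs if matches(d["metadata"], sq.conditions)]
    return sorted(kept, key=lambda d: overlap(sq.query, d["text"]), reverse=True)

toy_docs = [
    {"text": "Toys come alive and have a blast doing so",
     "metadata": {"year": 1995, "genre": "animated"}},
    {"text": "A bunch of scientists bring back dinosaurs and mayhem breaks loose",
     "metadata": {"year": 1993, "rating": 7.7, "genre": "science fiction"}},
    {"text": "Three men walk into the Zone, three men walk out of the Zone",
     "metadata": {"year": 1979, "rating": 9.9, "genre": "thriller"}},
]

sq = ToyStructuredQuery(query="toys", conditions=[("gt", "year", 1990), ("eq", "genre", "animated")])
results = toy_retrieve(sq, toy_docs)
print([d["text"] for d in results])
```

In the real retriever the LLM produces the structured query and the vector store performs both the filtering and the similarity search; the sketch only shows how the two parts cooperate.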

In [1]:
from langchain_chroma import Chroma
from langchain_core.documents import Document
from langchain_openai import OpenAIEmbeddings
import os

# create a set of example documents, each with its own metadata

docs = [
    Document(
        page_content="A bunch of scientists bring back dinosaurs and mayhem breaks loose",
        metadata={"year": 1993, "rating": 7.7, "genre": "science fiction"},
    ),
    Document(
        page_content="Leo DiCaprio gets lost in a dream within a dream within a dream within a ...",
        metadata={"year": 2010, "director": "Christopher Nolan", "rating": 8.2},
    ),
    Document(
        page_content="A psychologist / detective gets lost in a series of dreams within dreams within dreams and Inception reused the idea",
        metadata={"year": 2006, "director": "Satoshi Kon", "rating": 8.6},
    ),
    Document(
        page_content="A bunch of normal-sized women are supremely wholesome and some men pine after them",
        metadata={"year": 2019, "director": "Greta Gerwig", "rating": 8.3},
    ),
    Document(
        page_content="Toys come alive and have a blast doing so",
        metadata={"year": 1995, "genre": "animated"},
    ),
    Document(
        page_content="Three men walk into the Zone, three men walk out of the Zone",
        metadata={
            "id": 1254315,
            "year": 1979,
            "director": "Andrei Tarkovsky",
            "genre": "thriller",
            "rating": 9.9,
        },
    ),
]
vectorstore = Chroma.from_documents(docs, OpenAIEmbeddings())

In [2]:
from langchain_classic.chains.query_constructor.base import AttributeInfo
from langchain_classic.retrievers.self_query.base import SelfQueryRetriever
from langchain_openai import ChatOpenAI

# describe the metadata fields that the LLM is allowed to filter on
metadata_field_info = [
    AttributeInfo(
        name="genre",
        description="The genre of the movie. One of ['science fiction', 'comedy', 'drama', 'thriller', 'romance', 'action', 'animated']",
        type="string",
    ),
    AttributeInfo(
        name="year",
        description="The year the movie was released",
        type="integer",
    ),
    AttributeInfo(
        name="director",
        description="The name of the movie director",
        type="string",
    ),
    AttributeInfo(
        name="rating", description="A 1-10 rating for the movie", type="float"
    ),
]
document_content_description = "Brief summary of a movie"

In [3]:
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

In [4]:
# build the SelfQueryRetriever from the LLM, the vector store, and the field descriptions

retriever = SelfQueryRetriever.from_llm(
    llm,
    vectorstore,
    document_content_description,
    metadata_field_info,
)

In [5]:
retriever.query_constructor

RunnableBinding(bound=FewShotPromptTemplate(input_variables=['query'], input_types={}, partial_variables={}, examples=[{'i': 1, 'data_source': '```json\n{{\n    "content": "Lyrics of a song",\n    "attributes": {{\n        "artist": {{\n            "type": "string",\n            "description": "Name of the song artist"\n        }},\n        "length": {{\n            "type": "integer",\n            "description": "Length of the song in seconds"\n        }},\n        "genre": {{\n            "type": "string",\n            "description": "The song genre, one of "pop", "rock" or "rap""\n        }}\n    }}\n}}\n```', 'user_query': 'What are songs by Taylor Swift or Katy Perry about teenage romance under 3 minutes long in the dance pop genre', 'structured_request': '```json\n{{\n    "query": "teenager love",\n    "filter": "and(or(eq(\\"artist\\", \\"Taylor Swift\\"), eq(\\"artist\\", \\"Katy Perry\\")), lt(\\"length\\", 180), eq(\\"genre\\", \\"pop\\"))"\n}}\n```'}, {'i': 2, 'data_source': 

In [6]:
retriever.query_constructor.bound.steps[0].prefix

'Your goal is to structure the user\'s query to match the request schema provided below.\n\n<< Structured Request Schema >>\nWhen responding use a markdown code snippet with a JSON object formatted in the following schema:\n\n```json\n{{\n    "query": string \\ text string to compare to document contents\n    "filter": string \\ logical condition statement for filtering documents\n}}\n```\n\nThe query string should contain only text that is expected to match the contents of documents. Any conditions in the filter should not be mentioned in the query as well.\n\nA logical condition statement is composed of one or more comparison and logical operation statements.\n\nA comparison statement takes the form: `comp(attr, val)`:\n- `comp` (eq | ne | gt | gte | lt | lte): comparator\n- `attr` (string):  name of attribute to apply the comparison to\n- `val` (string): is the comparison value\n\nA logical operation statement takes the form `op(statement1, statement2, ...)`:\n- `op` (and | or): log
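The prompt above describes a small filter grammar: comparison statements of the form `comp(attr, val)` nested inside logical operations `op(statement1, statement2, ...)`. As a rough illustration of how such a string could be turned into a tree, here is a minimal recursive-descent sketch. It is not the parser LangChain uses; `tokenize`, `parse`, and `parse_filter` are hypothetical helpers, the nested-tuple output format is an assumption made for this example, and error handling for malformed input is deliberately minimal.

```python
import re

# One token per match: an identifier, a number, a double-quoted string, or punctuation.
TOKEN = re.compile(
    r'\s*(?:(?P<name>[A-Za-z_]+)|(?P<num>-?\d+(?:\.\d+)?)|"(?P<str>[^"]*)"|(?P<punct>[(),]))'
)

def tokenize(text):
    tokens, pos = [], 0
    text = text.strip()
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected character at position {pos}")
        kind = match.lastgroup
        tokens.append((kind, match.group(kind)))
        pos = match.end()
    return tokens

def parse(tokens, i=0):
    """Return (node, next_index). Calls become tuples: (operator, arg1, arg2, ...)."""
    kind, value = tokens[i]
    if kind == "num":
        return (float(value) if "." in value else int(value)), i + 1
    if kind == "str":
        return value, i + 1
    if kind == "name" and i + 1 < len(tokens) and tokens[i + 1] == ("punct", "("):
        i += 2  # skip the operator name and the opening parenthesis
        args = []
        while tokens[i] != ("punct", ")"):
            arg, i = parse(tokens, i)
            args.append(arg)
            if tokens[i] == ("punct", ","):
                i += 1
        return (value, *args), i + 1  # i + 1 skips the closing parenthesis
    raise ValueError(f"unexpected token {value!r}")

def parse_filter(text):
    node, _ = parse(tokenize(text))
    return node

# Filter string taken from the first few-shot example shown in the notebook output.
tree = parse_filter(
    'and(or(eq("artist", "Taylor Swift"), eq("artist", "Katy Perry")), '
    'lt("length", 180), eq("genre", "pop"))'
)
print(tree)
```

The resulting tree nests exactly as the filter string does: an `and` node whose first child is an `or` over the two artist comparisons.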

In [7]:
retriever.query_constructor.bound.steps[0].examples

[{'i': 1,
  'data_source': '```json\n{{\n    "content": "Lyrics of a song",\n    "attributes": {{\n        "artist": {{\n            "type": "string",\n            "description": "Name of the song artist"\n        }},\n        "length": {{\n            "type": "integer",\n            "description": "Length of the song in seconds"\n        }},\n        "genre": {{\n            "type": "string",\n            "description": "The song genre, one of "pop", "rock" or "rap""\n        }}\n    }}\n}}\n```',
  'user_query': 'What are songs by Taylor Swift or Katy Perry about teenage romance under 3 minutes long in the dance pop genre',
  'structured_request': '```json\n{{\n    "query": "teenager love",\n    "filter": "and(or(eq(\\"artist\\", \\"Taylor Swift\\"), eq(\\"artist\\", \\"Katy Perry\\")), lt(\\"length\\", 180), eq(\\"genre\\", \\"pop\\"))"\n}}\n```'},
 {'i': 2,
  'data_source': '```json\n{{\n    "content": "Lyrics of a song",\n    "attributes": {{\n        "artist": {{\n            "ty

In [8]:
# simple filter on a single metadata attribute

retriever.invoke("I want to watch a movie rated higher than 8.5")

[Document(id='e48bddcd-bf11-4d5e-a207-1a8cdde1f5cc', metadata={'director': 'Andrei Tarkovsky', 'id': 1254315, 'genre': 'thriller', 'rating': 9.9, 'year': 1979}, page_content='Three men walk into the Zone, three men walk out of the Zone'),
 Document(id='9c7e7dc3-3861-4639-a77f-519c315ff054', metadata={'director': 'Satoshi Kon', 'rating': 8.6, 'year': 2006}, page_content='A psychologist / detective gets lost in a series of dreams within dreams within dreams and Inception reused the idea')]

In [9]:
# semantic query combined with a metadata filter

retriever.invoke("Has Greta Gerwig directed any movies about women")

[Document(id='3be25207-71fa-41a0-a20b-f7ef10fd7c2f', metadata={'year': 2019, 'rating': 8.3, 'director': 'Greta Gerwig'}, page_content='A bunch of normal-sized women are supremely wholesome and some men pine after them')]

In [10]:
# query that calls for a composite filter (rating and genre)
retriever.invoke("What's a highly rated (above 7.) science fiction film?")

[Document(id='2d0813d7-4731-4a37-bbea-065346b8cfd4', metadata={'rating': 7.7, 'genre': 'science fiction', 'year': 1993}, page_content='A bunch of scientists bring back dinosaurs and mayhem breaks loose'),
 Document(id='9c7e7dc3-3861-4639-a77f-519c315ff054', metadata={'director': 'Satoshi Kon', 'year': 2006, 'rating': 8.6}, page_content='A psychologist / detective gets lost in a series of dreams within dreams within dreams and Inception reused the idea'),
 Document(id='e48bddcd-bf11-4d5e-a207-1a8cdde1f5cc', metadata={'director': 'Andrei Tarkovsky', 'genre': 'thriller', 'id': 1254315, 'rating': 9.9, 'year': 1979}, page_content='Three men walk into the Zone, three men walk out of the Zone'),
 Document(id='29162b6b-d661-464b-b5b7-e4aee65e5c93', metadata={'year': 2010, 'rating': 8.2, 'director': 'Christopher Nolan'}, page_content='Leo DiCaprio gets lost in a dream within a dream within a dream within a ...')]

In [11]:
# semantic query combined with a composite filter
retriever.invoke(
    "What's a movie after 1990 but before 2005 that's all about toys, and preferably is animated"
)

[Document(id='1dfdb9f4-2d03-4c61-a6ca-ce9ac7bcd712', metadata={'year': 1995, 'genre': 'animated'}, page_content='Toys come alive and have a blast doing so')]

In [12]:
# with enable_limit=True, the number of results to return can be stated in the query itself

retriever = SelfQueryRetriever.from_llm(
    llm,
    vectorstore,
    document_content_description,
    metadata_field_info,
    enable_limit=True,
    verbose=True
)

retriever.invoke("What are two movies about dinosaurs")

[Document(id='2d0813d7-4731-4a37-bbea-065346b8cfd4', metadata={'genre': 'science fiction', 'rating': 7.7, 'year': 1993}, page_content='A bunch of scientists bring back dinosaurs and mayhem breaks loose'),
 Document(id='1dfdb9f4-2d03-4c61-a6ca-ce9ac7bcd712', metadata={'genre': 'animated', 'year': 1995}, page_content='Toys come alive and have a blast doing so')]

In [13]:
retriever.query_constructor.invoke("What's a movie after 1990 but before 2005 that's all about toys, and preferably is animated")

StructuredQuery(query='toys', filter=Operation(operator=<Operator.AND: 'and'>, arguments=[Comparison(comparator=<Comparator.GT: 'gt'>, attribute='year', value=1990), Comparison(comparator=<Comparator.LT: 'lt'>, attribute='year', value=2005), Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='genre', value='animated')]), limit=None)
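Before a structured query like the one above can be executed, its filter has to be expressed in the syntax of the underlying vector store. The sketch below shows what that translation could look like for a Chroma-style `where` dictionary, assuming Chroma's `$and`/`$or` logical operators and `$eq`/`$ne`/`$gt`/`$gte`/`$lt`/`$lte` comparison operators. It is not the library's own translator: `to_chroma_where` and the nested-tuple input format are hypothetical and used only for illustration.

```python
LOGICAL = {"and", "or"}
COMPARATORS = {"eq", "ne", "gt", "gte", "lt", "lte"}

def to_chroma_where(node):
    """Convert a nested tuple such as ("gt", "year", 1990) into a Chroma-style where dict."""
    op, *args = node
    if op in LOGICAL:
        # Logical nodes wrap the translation of each child statement.
        return {f"${op}": [to_chroma_where(child) for child in args]}
    if op in COMPARATORS:
        attribute, value = args
        return {attribute: {f"${op}": value}}
    raise ValueError(f"unsupported operator: {op}")

# Same conditions as the StructuredQuery printed above, written as nested tuples.
filter_tree = (
    "and",
    ("gt", "year", 1990),
    ("lt", "year", 2005),
    ("eq", "genre", "animated"),
)
where = to_chroma_where(filter_tree)
print(where)
```

The search string (`'toys'` in the output above) is not part of this dictionary; it is embedded and used for the similarity search, while the dictionary only restricts which documents are eligible.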