# Filtering Documents with Metadata
Metadata stores additional information about documents, such as a version number or a publication date. A search can be tailored to only those documents whose metadata matches some criteria. We use filters to limit the scope of the search in this way.
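
To make the idea concrete, here is a minimal sketch in plain Python, independent of any search library. The `docs` list, the `matches` helper, and the toy word-overlap score are hypothetical names invented for illustration; a real retriever ranks far more carefully. It shows the two steps involved: the filter first narrows the candidate set by metadata, and only the surviving documents are ranked against the query.

```python
# Hypothetical toy data: each document has text plus a metadata dict.
docs = [
    {"content": "install haystack with pip", "meta": {"version": 1.15}},
    {"content": "install haystack extras with pip", "meta": {"version": 1.22}},
    {"content": "install haystack-ai with pip", "meta": {"version": 2.0}},
]


def matches(doc, min_version):
    """Metadata filter: keep only documents at or above min_version."""
    return doc["meta"]["version"] >= min_version


def overlap_score(doc, query):
    """Toy relevance score: number of query words present in the content."""
    words = set(doc["content"].split())
    return sum(1 for term in query.split() if term in words)


def search(query, min_version):
    # Step 1: the filter limits the scope of the search.
    candidates = [d for d in docs if matches(d, min_version)]
    # Step 2: only the remaining documents are ranked.
    return sorted(candidates, key=lambda d: overlap_score(d, query), reverse=True)


results = search("install haystack", min_version=1.2)
print([d["meta"]["version"] for d in results])
```

The document with version 1.15 never reaches the ranking step, no matter how well its text matches the query.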

### Preparing the Documents
Documents are kept in a document store. The available backends fall into the following categories.

Pure vector databases
1. Pinecone
2. Marqo
3. Milvus
4. Chroma
5. Qdrant
6. Weaviate

Full-text search databases
1. Elasticsearch
2. OpenSearch

Vector-capable SQL databases

These are not as performant as the previous categories. Use them if you want to maintain a single database instance for your application.
1. pgvector for PostgreSQL

Vector-capable NoSQL databases

These are not as performant as the previous categories. Use them if you want to maintain a single database instance for your application.

1. MongoDB
2. Astra
3. Neo4j

In-memory document stores are fast and well suited to minimal prototypes on small datasets, so this example uses one.

In [2]:
from datetime import datetime

from haystack import Document
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.components.retrievers.in_memory import InMemoryBM25Retriever


documents = [
    Document(
        content="Use pip to install a basic version of Haystack's latest release: pip install farm-haystack. All the core Haystack components live in the haystack repo. But there's also the haystack-extras repo which contains components that are not as widely used, and you need to install them separately.",
        meta={"version": 1.15, "date": datetime(2023, 3, 30)},
    ),
    Document(
        content="Use pip to install a basic version of Haystack's latest release: pip install farm-haystack[inference]. All the core Haystack components live in the haystack repo. But there's also the haystack-extras repo which contains components that are not as widely used, and you need to install them separately.",
        meta={"version": 1.22, "date": datetime(2023, 11, 7)},
    ),
    Document(
        content="Use pip to install only the Haystack 2.0 code: pip install haystack-ai. The haystack-ai package is built on the main branch which is an unstable beta version, but it's useful if you want to try the new features as soon as they are merged.",
        meta={"version": 2.0, "date": datetime(2023, 12, 4)},
    ),
]

document_store = InMemoryDocumentStore(bm25_algorithm="BM25Plus")
document_store.write_documents(documents=documents)

3

### Building a Document Search Pipeline
Build a simple document search pipeline that contains only a retriever. The pipeline can later be extended with more components, for example to generate answers to questions.

In [3]:
from haystack import Pipeline

pipeline = Pipeline()
pipeline.add_component(instance=InMemoryBM25Retriever(document_store=document_store), name="retriever")

### Applying Metadata Filters
Keep only the documents whose `version` is greater than 1.21:

In [7]:
query = "Haystack installation"
pipeline.run(data={
    "retriever": {
        "query": query,
        "filters": {
            "field": "meta.version",
            "operator": ">",
            "value": 1.21
        }
    }
})

{'retriever': {'documents': [Document(id=b53625c67fee5ba5ac6dc86e7ca0adff567bf8376e86ae4b3fc6f6f858ccf1e5, content: 'Use pip to install a basic version of Haystack's latest release: pip install farm-haystack[inference...', meta: {'version': 1.22, 'date': datetime.datetime(2023, 11, 7, 0, 0)}, score: 0.37481165807926137),
   Document(id=8ac1f8119bdec5c898d5a5c69f49ff47f64056bce1a0f95073e34493bbaf9354, content: 'Use pip to install only the Haystack 2.0 code: pip install haystack-ai. The haystack-ai package is b...', meta: {'version': 2.0, 'date': datetime.datetime(2023, 12, 4, 0, 0)}, score: 0.34124689226266874)]}}
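
One way to picture what happens to this filter is the sketch below. It is not Haystack's implementation: `get_field` and `passes` are hypothetical helpers written for illustration, plain dictionaries stand in for `Document` objects, and only comparison operators are covered. The dotted `field` is resolved against the document (`meta.version` means the `version` key inside `meta`), and the named operator is then applied to that value and the filter's `value`.

```python
import operator

# Map the operator names used in filter dicts to Python comparison functions.
COMPARATORS = {
    "==": operator.eq,
    "!=": operator.ne,
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
}


def get_field(doc, dotted_path):
    """Resolve a path such as 'meta.version' against a nested dict."""
    value = doc
    for key in dotted_path.split("."):
        value = value[key]
    return value


def passes(doc, condition):
    """Return True if the document satisfies one comparison condition."""
    compare = COMPARATORS[condition["operator"]]
    return compare(get_field(doc, condition["field"]), condition["value"])


# Plain dicts stand in for Document objects in this sketch.
docs = [
    {"meta": {"version": 1.15}},
    {"meta": {"version": 1.22}},
    {"meta": {"version": 2.0}},
]
condition = {"field": "meta.version", "operator": ">", "value": 1.21}
kept = [d["meta"]["version"] for d in docs if passes(d, condition)]
print(kept)  # [1.22, 2.0]
```

The two versions that survive are the same two that appear in the retriever output above.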

##### Adding Logical Conditions

In [11]:
# AND logical operator: every condition must hold. `OR` and `NOT` are also available.
query = "Haystack installation"
pipeline.run(data={
    "retriever": {
        "query": query,
        "filters": {
            "operator": "AND",
            "conditions": [
                {
                    "field": "meta.version",
                    "operator": ">",
                    "value": 1.21
                },
                {
                    "field": "meta.date",
                    "operator": ">",
                    "value": datetime(2023, 11, 7)
                }
            ]
        }
    }
})

{'retriever': {'documents': [Document(id=8ac1f8119bdec5c898d5a5c69f49ff47f64056bce1a0f95073e34493bbaf9354, content: 'Use pip to install only the Haystack 2.0 code: pip install haystack-ai. The haystack-ai package is b...', meta: {'version': 2.0, 'date': datetime.datetime(2023, 12, 4, 0, 0)}, score: 0.34124689226266874)]}}
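
Nested filters like this one are ordinary Python dictionaries, so they can be built by a small helper. The sketch below uses a hypothetical function, `date_range_filter`, which is not part of Haystack. It combines two comparisons on `meta.date` with `AND`, and it assumes that the `>=` and `<` operators are accepted in the same way as `>`.

```python
from datetime import datetime


def date_range_filter(start, end):
    """Build a filter matching documents with start <= meta.date < end."""
    return {
        "operator": "AND",
        "conditions": [
            {"field": "meta.date", "operator": ">=", "value": start},
            {"field": "meta.date", "operator": "<", "value": end},
        ],
    }


# A range covering November and December 2023.
filters = date_range_filter(datetime(2023, 11, 1), datetime(2024, 1, 1))
print(filters["operator"], len(filters["conditions"]))
```

The resulting dictionary can be passed as the `filters` value in `pipeline.run`. With the three documents in this example, this range would be expected to keep the documents dated 2023-11-07 and 2023-12-04.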

In [9]:
# `in` comparison: keep documents whose version is one of the listed values
query = "Haystack installation"
pipeline.run(data={
    "retriever": {
        "query": query,
        "filters": {
            "field": "meta.version",
            "operator": "in",
            "value": [1.15, 1.22]
        }
    }
})

{'retriever': {'documents': [Document(id=3d3b2afa171bee3bbff4a94baaec239f9d28bba333114a08ad6d0b684710a3be, content: 'Use pip to install a basic version of Haystack's latest release: pip install farm-haystack. All the ...', meta: {'version': 1.15, 'date': datetime.datetime(2023, 3, 30, 0, 0)}, score: 0.37593796637235916),
   Document(id=b53625c67fee5ba5ac6dc86e7ca0adff567bf8376e86ae4b3fc6f6f858ccf1e5, content: 'Use pip to install a basic version of Haystack's latest release: pip install farm-haystack[inference...', meta: {'version': 1.22, 'date': datetime.datetime(2023, 11, 7, 0, 0)}, score: 0.37481165807926137)]}}

In [10]:
# `not in` comparison: keep documents whose version is none of the listed values
query = "Haystack installation"
pipeline.run(data={
    "retriever": {
        "query": query,
        "filters": {
            "field": "meta.version",
            "operator": "not in",
            "value": [1.15, 1.22]
        }
    }
})

{'retriever': {'documents': [Document(id=8ac1f8119bdec5c898d5a5c69f49ff47f64056bce1a0f95073e34493bbaf9354, content: 'Use pip to install only the Haystack 2.0 code: pip install haystack-ai. The haystack-ai package is b...', meta: {'version': 2.0, 'date': datetime.datetime(2023, 12, 4, 0, 0)}, score: 0.34124689226266874)]}}
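
The `in` and `not in` operators can be read as shorthand for logical combinations: `in` behaves like an `OR` of equality checks, and `not in` like its negation. The sketch below checks that reading on plain dictionaries. The `evaluate` function is a hypothetical, simplified evaluator written for this example, not Haystack's code, and treating `NOT` as "not all conditions hold" is an assumption about the semantics.

```python
def evaluate(meta, flt):
    """Evaluate a filter dict against a flat metadata dict (simplified sketch)."""
    op = flt["operator"]
    if op == "AND":
        return all(evaluate(meta, c) for c in flt["conditions"])
    if op == "OR":
        return any(evaluate(meta, c) for c in flt["conditions"])
    if op == "NOT":
        # Assumption: NOT negates the conjunction of its conditions.
        return not all(evaluate(meta, c) for c in flt["conditions"])
    # Comparison leaf: drop the 'meta.' prefix to get the metadata key.
    value = meta[flt["field"].split(".", 1)[1]]
    if op == "==":
        return value == flt["value"]
    if op == "in":
        return value in flt["value"]
    if op == "not in":
        return value not in flt["value"]
    raise ValueError(f"unsupported operator: {op}")


versions = [1.15, 1.22]
in_filter = {"field": "meta.version", "operator": "in", "value": versions}
or_filter = {
    "operator": "OR",
    "conditions": [
        {"field": "meta.version", "operator": "==", "value": v} for v in versions
    ],
}
not_in_filter = {"field": "meta.version", "operator": "not in", "value": versions}
not_filter = {"operator": "NOT", "conditions": [in_filter]}

metas = [{"version": 1.15}, {"version": 1.22}, {"version": 2.0}]
for meta in metas:
    # Each shorthand agrees with its expanded logical form.
    assert evaluate(meta, in_filter) == evaluate(meta, or_filter)
    assert evaluate(meta, not_in_filter) == evaluate(meta, not_filter)

print([m["version"] for m in metas if evaluate(m, not_in_filter)])  # [2.0]
```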