# Filtering out elements based on hard criteria

Hard filtering is often needed: some items must never appear in the result set, no matter how deep we scroll into the results. This can be achieved with the `.filter` clause of the `Query`.
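
To make the difference between a hard filter and a ranking signal concrete, here is a minimal plain-Python sketch. It does not use Superlinked: the `rows` list, the `score` values, and the `search` helper are made up for illustration. The point is only that a row rejected by the predicate never appears, however large the result limit.

```python
# Hypothetical scored candidates; in a real system the score would come from vector search.
rows = [
    {"id": "a", "author": "Adam", "score": 0.9},
    {"id": "b", "author": "Bob", "score": 0.8},
    {"id": "c", "author": "Adam", "score": 0.1},
]


def search(rows, predicate, limit):
    """Drop rows failing the predicate, then rank the survivors by score."""
    kept = [row for row in rows if predicate(row)]
    ranked = sorted(kept, key=lambda row: row["score"], reverse=True)
    return [row["id"] for row in ranked[:limit]]


def not_adam(row):
    return row["author"] != "Adam"


top_1 = search(rows, not_adam, limit=1)
top_100 = search(rows, not_adam, limit=100)
print(top_1, top_100)
```

Raising the limit from 1 to 100 does not bring Adam's rows back, even though one of them has the highest score.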

In [1]:
%pip install superlinked==35.1.1

In [2]:
import pandas as pd
from superlinked import framework as sl

pd.set_option("display.max_colwidth", 100)

In [3]:
class Paragraph(sl.Schema):
    id: sl.IdField
    body: sl.String
    author: sl.String
    length: sl.Integer
    tags: sl.StringList
    is_active: sl.Boolean


paragraph = Paragraph()

body_space = sl.TextSimilaritySpace(text=paragraph.body, model="sentence-transformers/all-mpnet-base-v2")

In [4]:
paragraph_index = sl.Index(
    [body_space],
    fields=[paragraph.author, paragraph.body, paragraph.length, paragraph.tags, paragraph.is_active],
)

17:27:18 superlinked.framework.dsl.index.index INFO   initialized index


> **_NOTE:_** Every field we plan to filter on must be listed in the `fields` argument of the index definition.

Now let's add some data and try it out!

In [5]:
source: sl.InMemorySource = sl.InMemorySource(paragraph)
executor = sl.InMemoryExecutor(sources=[source], indices=[paragraph_index])
app = executor.run()

17:27:18 superlinked.framework.query.query_dag_evaluator INFO   initialized query dag
17:27:18 superlinked.framework.online.online_dag_evaluator INFO   initialized entity dag
17:27:18 superlinked.framework.dsl.executor.interactive.interactive_executor INFO   started in-memory app


In [6]:
source.put(
    [
        {
            "id": "paragraph-1",
            "body": "The first thing Adam wrote.",
            "author": "Adam",
            "length": 300,
            "tags": ["old", "interesting"],
            "is_active": False,
        },
        {
            "id": "paragraph-2",
            "body": "The first thing Bob wrote.",
            "author": "Bob",
            "length": 500,
            "tags": ["fresh", "dull", "useful"],
            "is_active": True,
        },
        {
            "id": "paragraph-3",
            "body": "The second thing Adam wrote.",
            "author": "Adam",
            "length": 400,
            "tags": ["interesting", "funny", "fresh"],
            "is_active": True,
        },
    ]
)

17:27:18 sentence_transformers.SentenceTransformer INFO   Load pretrained SentenceTransformer: sentence-transformers/all-mpnet-base-v2



17:27:19 superlinked.framework.online.online_dag_evaluator INFO   evaluated entities
17:27:19 superlinked.framework.online.source.online_data_processor INFO   stored input data


## Using the .filter clause

### Comparisons
The `.filter` clause restricts the result set to items that satisfy a condition. For example, we can ask for paragraphs written by Adam...

In [7]:
adam_query = sl.Query(paragraph_index).find(paragraph).filter(paragraph.author == "Adam").select_all()
adam_result = app.query(adam_query)

sl.PandasConverter.to_pandas(adam_result)

17:27:19 superlinked.framework.query.query_dag_evaluator INFO   evaluated query
17:27:19 superlinked.framework.dsl.executor.query.query_executor INFO   executed query


Unnamed: 0,body,author,length,tags,is_active,id,similarity_score
0,The first thing Adam wrote.,Adam,300,"[old, interesting]",False,paragraph-1,0.0
1,The second thing Adam wrote.,Adam,400,"[interesting, funny, fresh]",True,paragraph-3,0.0


> **_NOTE:_**  As we are only using filters, `similarity_score` is always 0, as there is effectively no vector search in this case.

...or not Adam.

In [8]:
bob_query = sl.Query(paragraph_index).find(paragraph).filter(paragraph.author != "Adam").select_all()
bob_result = app.query(bob_query)

sl.PandasConverter.to_pandas(bob_result)

17:27:19 superlinked.framework.query.query_dag_evaluator INFO   evaluated query
17:27:19 superlinked.framework.dsl.executor.query.query_executor INFO   executed query


Unnamed: 0,body,author,length,tags,is_active,id,similarity_score
0,The first thing Bob wrote.,Bob,500,"[fresh, dull, useful]",True,paragraph-2,0.0


We can also filter for active paragraphs using the boolean schema field.

In [9]:
active_query = sl.Query(paragraph_index).find(paragraph).filter(paragraph.is_active.is_(True)).select_all()
active_result = app.query(active_query)

sl.PandasConverter.to_pandas(active_result)

17:27:19 superlinked.framework.query.query_dag_evaluator INFO   evaluated query
17:27:19 superlinked.framework.dsl.executor.query.query_executor INFO   executed query


Unnamed: 0,body,author,length,tags,is_active,id,similarity_score
0,The first thing Bob wrote.,Bob,500,"[fresh, dull, useful]",True,paragraph-2,0.0
1,The second thing Adam wrote.,Adam,400,"[interesting, funny, fresh]",True,paragraph-3,0.0


Which can also be stated inversely.

In [10]:
not_inactive_query = sl.Query(paragraph_index).find(paragraph).filter(paragraph.is_active.is_not_(False)).select_all()
not_inactive_result = app.query(not_inactive_query)

sl.PandasConverter.to_pandas(not_inactive_result)

17:27:19 superlinked.framework.query.query_dag_evaluator INFO   evaluated query
17:27:19 superlinked.framework.dsl.executor.query.query_executor INFO   executed query


Unnamed: 0,body,author,length,tags,is_active,id,similarity_score
0,The first thing Bob wrote.,Bob,500,"[fresh, dull, useful]",True,paragraph-2,0.0
1,The second thing Adam wrote.,Adam,400,"[interesting, funny, fresh]",True,paragraph-3,0.0


An alternative syntax with `==` is also possible, but most linters flag it, so an ignore comment is recommended in this case.

In [11]:
active_query_alt_syntax = (
    sl.Query(paragraph_index)
    .find(paragraph)
    .filter(paragraph.is_active == True)  # noqa: E712 # pylint: disable=singleton-comparison
    .select_all()
)
active_result_alt_syntax = app.query(active_query_alt_syntax)

sl.PandasConverter.to_pandas(active_result_alt_syntax)

17:27:19 superlinked.framework.query.query_dag_evaluator INFO   evaluated query
17:27:19 superlinked.framework.dsl.executor.query.query_executor INFO   executed query


Unnamed: 0,body,author,length,tags,is_active,id,similarity_score
0,The first thing Bob wrote.,Bob,500,"[fresh, dull, useful]",True,paragraph-2,0.0
1,The second thing Adam wrote.,Adam,400,"[interesting, funny, fresh]",True,paragraph-3,0.0
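
Why `== True` rather than `is True`? A plausible explanation is that filter expressions are built through operator overloading: `==` on a schema field can return a comparison object for the query engine to evaluate later, whereas `is` cannot be overloaded in Python. The sketch below illustrates the idea with hypothetical `Field` and `Comparison` classes. It is not Superlinked's actual implementation.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class Comparison:
    """A deferred comparison, evaluated later against a row."""

    field_name: str
    op: str
    value: Any

    def evaluate(self, row: dict) -> bool:
        if self.op == "eq":
            return row[self.field_name] == self.value
        return row[self.field_name] != self.value


class Field:
    """Hypothetical schema field that records comparisons instead of computing them."""

    def __init__(self, name: str) -> None:
        self.name = name

    def __eq__(self, other: Any) -> Comparison:  # type: ignore[override]
        return Comparison(self.name, "eq", other)

    def __ne__(self, other: Any) -> Comparison:  # type: ignore[override]
        return Comparison(self.name, "ne", other)

    def is_(self, other: Any) -> Comparison:
        # Method form of equality, which linters accept for boolean literals.
        return Comparison(self.name, "eq", other)


is_active = Field("is_active")
expr = is_active == True  # noqa: E712  builds a Comparison, not a bool
same_expr = is_active.is_(True)
identity = is_active is True  # plain identity check, always False here
```

In this sketch `expr` and `same_expr` are equal `Comparison` objects, while `identity` is just `False` and carries no information about the field.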


It is also possible to filter on a field being greater or smaller than (or equal to) a given value.

In [12]:
less_equal_query = sl.Query(paragraph_index).find(paragraph).filter(paragraph.length <= 400).select_all()
less_equal_result = app.query(less_equal_query)

sl.PandasConverter.to_pandas(less_equal_result)

17:27:19 superlinked.framework.query.query_dag_evaluator INFO   evaluated query
17:27:19 superlinked.framework.dsl.executor.query.query_executor INFO   executed query


Unnamed: 0,body,author,length,tags,is_active,id,similarity_score
0,The first thing Adam wrote.,Adam,300,"[old, interesting]",False,paragraph-1,0.0
1,The second thing Adam wrote.,Adam,400,"[interesting, funny, fresh]",True,paragraph-3,0.0


### Logical operations

We can also stack multiple filters to form an `AND` relationship.

In [13]:
stacked_query = (
    sl.Query(paragraph_index)
    .find(paragraph)
    .filter(paragraph.author == "Adam")
    .filter(paragraph.body == "The first thing Adam wrote.")
    .select_all()
)
stacked_result = app.query(stacked_query)

sl.PandasConverter.to_pandas(stacked_result)

17:27:19 superlinked.framework.query.query_dag_evaluator INFO   evaluated query
17:27:19 superlinked.framework.dsl.executor.query.query_executor INFO   executed query


Unnamed: 0,body,author,length,tags,is_active,id,similarity_score
0,The first thing Adam wrote.,Adam,300,"[old, interesting]",False,paragraph-1,0.0
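
Semantically, stacking filters behaves like a conjunction: a row is kept only if every filter accepts it. The plain-Python sketch below reproduces the result above on the same three records. The `apply_filters` helper is hypothetical and only illustrates the semantics.

```python
rows = [
    {"id": "paragraph-1", "author": "Adam", "body": "The first thing Adam wrote."},
    {"id": "paragraph-2", "author": "Bob", "body": "The first thing Bob wrote."},
    {"id": "paragraph-3", "author": "Adam", "body": "The second thing Adam wrote."},
]


def apply_filters(rows, *predicates):
    """Keep a row only if all predicates hold (logical AND)."""
    return [row["id"] for row in rows if all(pred(row) for pred in predicates)]


stacked_ids = apply_filters(
    rows,
    lambda row: row["author"] == "Adam",
    lambda row: row["body"] == "The first thing Adam wrote.",
)
print(stacked_ids)
```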


To form an `OR` relationship instead, combine two conditions with the `.or_` method...

In [14]:
or_query = (
    sl.Query(paragraph_index)
    .find(paragraph)
    .filter((paragraph.author == "Bob").or_(paragraph.length == 300))
    .select_all()
)
or_result = app.query(or_query)

sl.PandasConverter.to_pandas(or_result)

17:27:19 superlinked.framework.query.query_dag_evaluator INFO   evaluated query
17:27:19 superlinked.framework.dsl.executor.query.query_executor INFO   executed query


Unnamed: 0,body,author,length,tags,is_active,id,similarity_score
0,The first thing Adam wrote.,Adam,300,"[old, interesting]",False,paragraph-1,0.0
1,The first thing Bob wrote.,Bob,500,"[fresh, dull, useful]",True,paragraph-2,0.0


... or simply with the familiar `|` operator (`__or__`).

In [15]:
or_query_ = (
    sl.Query(paragraph_index)
    .find(paragraph)
    .filter((paragraph.author == "Bob") | (paragraph.length == 400))
    .select_all()
)
or_result_ = app.query(or_query_)

sl.PandasConverter.to_pandas(or_result_)

17:27:19 superlinked.framework.query.query_dag_evaluator INFO   evaluated query
17:27:19 superlinked.framework.dsl.executor.query.query_executor INFO   executed query


Unnamed: 0,body,author,length,tags,is_active,id,similarity_score
0,The first thing Bob wrote.,Bob,500,"[fresh, dull, useful]",True,paragraph-2,0.0
1,The second thing Adam wrote.,Adam,400,"[interesting, funny, fresh]",True,paragraph-3,0.0
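
The parentheses around each comparison are required when using `|`. In Python, `|` binds more tightly than `==`, so `a == b | c == d` is parsed as the chained comparison `a == (b | c) == d`. The snippet below shows this with plain integers. It is a general Python fact, independent of Superlinked.

```python
# Without parentheses: parsed as 1 == (1 | 2) == 2, i.e. 1 == 3 and 3 == 2.
without_parens = 1 == 1 | 2 == 2

# With parentheses: each comparison is evaluated first, then combined with OR.
with_parens = (1 == 1) | (2 == 2)

print(without_parens, with_parens)
```

The `.or_` method form avoids the precedence question, since the method call has to follow a parenthesized comparison anyway.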


### Set operations

We can filter for a paragraph having an author from a group of possible authors using the `in_` operator.

In [16]:
in_query = sl.Query(paragraph_index).find(paragraph).filter(paragraph.author.in_(sl.Param("filter_list"))).select_all()
in_result = app.query(in_query, filter_list=["Alice", "Adam", "Amon"])

sl.PandasConverter.to_pandas(in_result)

17:27:19 superlinked.framework.query.query_dag_evaluator INFO   evaluated query
17:27:19 superlinked.framework.dsl.executor.query.query_executor INFO   executed query


Unnamed: 0,body,author,length,tags,is_active,id,similarity_score
0,The first thing Adam wrote.,Adam,300,"[old, interesting]",False,paragraph-1,0.0
1,The second thing Adam wrote.,Adam,400,"[interesting, funny, fresh]",True,paragraph-3,0.0


Or for paragraphs whose author is not part of a list.

In [17]:
not_in_query = (
    sl.Query(paragraph_index).find(paragraph).filter(paragraph.author.not_in_(sl.Param("filter_list"))).select_all()
)
not_in_result = app.query(not_in_query, filter_list=["Alice", "Adam", "Amon"])

sl.PandasConverter.to_pandas(not_in_result)

17:27:19 superlinked.framework.query.query_dag_evaluator INFO   evaluated query
17:27:19 superlinked.framework.dsl.executor.query.query_executor INFO   executed query


Unnamed: 0,body,author,length,tags,is_active,id,similarity_score
0,The first thing Bob wrote.,Bob,500,"[fresh, dull, useful]",True,paragraph-2,0.0
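
In plain Python terms, `in_` and `not_in_` correspond to membership tests of a single field value against a collection. The sketch below mirrors the two results above using ordinary sets, not Superlinked.

```python
authors = {"paragraph-1": "Adam", "paragraph-2": "Bob", "paragraph-3": "Adam"}
filter_list = {"Alice", "Adam", "Amon"}

# Keep ids whose author is (or is not) one of the listed values.
in_ids = [pid for pid, author in authors.items() if author in filter_list]
not_in_ids = [pid for pid, author in authors.items() if author not in filter_list]
print(in_ids, not_in_ids)
```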


List-type attributes can be filtered on whether they contain a value.

In [18]:
contains_query = (
    sl.Query(paragraph_index).find(paragraph).filter(paragraph.tags.contains(sl.Param("filter_tag"))).select_all()
)
contains_any_single_result = app.query(contains_query, filter_tag=["fresh"])

sl.PandasConverter.to_pandas(contains_any_single_result)

17:27:19 superlinked.framework.query.query_dag_evaluator INFO   evaluated query
17:27:19 superlinked.framework.dsl.executor.query.query_executor INFO   executed query


Unnamed: 0,body,author,length,tags,is_active,id,similarity_score
0,The first thing Bob wrote.,Bob,500,"[fresh, dull, useful]",True,paragraph-2,0.0
1,The second thing Adam wrote.,Adam,400,"[interesting, funny, fresh]",True,paragraph-3,0.0


`contains` with a multi-element `list` input follows common practice: it returns entities containing any of the listed elements.

In [19]:
contains_any_multiple_query = (
    sl.Query(paragraph_index).find(paragraph).filter(paragraph.tags.contains(sl.Param("filter_tag"))).select_all()
)
contains_any_multiple_result = app.query(contains_any_multiple_query, filter_tag=["fresh", "interesting"])

sl.PandasConverter.to_pandas(contains_any_multiple_result)

17:27:19 superlinked.framework.query.query_dag_evaluator INFO   evaluated query
17:27:19 superlinked.framework.dsl.executor.query.query_executor INFO   executed query


Unnamed: 0,body,author,length,tags,is_active,id,similarity_score
0,The first thing Adam wrote.,Adam,300,"[old, interesting]",False,paragraph-1,0.0
1,The first thing Bob wrote.,Bob,500,"[fresh, dull, useful]",True,paragraph-2,0.0
2,The second thing Adam wrote.,Adam,400,"[interesting, funny, fresh]",True,paragraph-3,0.0


To filter for entities containing all of the listed elements, stack filters...

In [20]:
contains_all_stacked_query = (
    sl.Query(paragraph_index)
    .find(paragraph)
    .filter(paragraph.tags.contains(sl.Param("filter_tag_1")))
    .filter(paragraph.tags.contains(sl.Param("filter_tag_2")))
    .select_all()
)
contains_all_stacked_result = app.query(contains_all_stacked_query, filter_tag_1=["fresh"], filter_tag_2=["useful"])

sl.PandasConverter.to_pandas(contains_all_stacked_result)

17:27:19 superlinked.framework.query.query_dag_evaluator INFO   evaluated query
17:27:19 superlinked.framework.dsl.executor.query.query_executor INFO   executed query


Unnamed: 0,body,author,length,tags,is_active,id,similarity_score
0,The first thing Bob wrote.,Bob,500,"[fresh, dull, useful]",True,paragraph-2,0.0


... or use `contains_all`.

In [21]:
contains_all_query = (
    sl.Query(paragraph_index).find(paragraph).filter(paragraph.tags.contains_all(sl.Param("filter_tag"))).select_all()
)
contains_all_result = app.query(contains_all_query, filter_tag=["fresh", "useful"])

sl.PandasConverter.to_pandas(contains_all_result)

17:27:19 superlinked.framework.query.query_dag_evaluator INFO   evaluated query
17:27:19 superlinked.framework.dsl.executor.query.query_executor INFO   executed query


Unnamed: 0,body,author,length,tags,is_active,id,similarity_score
0,The first thing Bob wrote.,Bob,500,"[fresh, dull, useful]",True,paragraph-2,0.0


The inverse can be expressed using `not_contains`, too. With a single value...

In [22]:
not_contains_query = (
    sl.Query(paragraph_index).find(paragraph).filter(paragraph.tags.not_contains(sl.Param("filter_tag"))).select_all()
)
not_contains_result = app.query(not_contains_query, filter_tag=["fresh"])

sl.PandasConverter.to_pandas(not_contains_result)

17:27:19 superlinked.framework.query.query_dag_evaluator INFO   evaluated query
17:27:19 superlinked.framework.dsl.executor.query.query_executor INFO   executed query


Unnamed: 0,body,author,length,tags,is_active,id,similarity_score
0,The first thing Adam wrote.,Adam,300,"[old, interesting]",False,paragraph-1,0.0


... or with multiple values.

In [23]:
not_contains_multiple_query = (
    sl.Query(paragraph_index).find(paragraph).filter(paragraph.tags.not_contains(sl.Param("filter_tag"))).select_all()
)
not_contains_multiple_result = app.query(not_contains_multiple_query, filter_tag=["old", "dull"])

sl.PandasConverter.to_pandas(not_contains_multiple_result)

17:27:19 superlinked.framework.query.query_dag_evaluator INFO   evaluated query
17:27:19 superlinked.framework.dsl.executor.query.query_executor INFO   executed query


Unnamed: 0,body,author,length,tags,is_active,id,similarity_score
0,The second thing Adam wrote.,Adam,400,"[interesting, funny, fresh]",True,paragraph-3,0.0
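
The three list operators map onto familiar set relations: `contains` keeps a row when its tags share at least one value with the input, `contains_all` when every input value is among the tags, and `not_contains` when the two have nothing in common. Here is a plain-Python sketch on the same tags. The helper functions are hypothetical stand-ins that only mirror the results shown above.

```python
tags = {
    "paragraph-1": {"old", "interesting"},
    "paragraph-2": {"fresh", "dull", "useful"},
    "paragraph-3": {"interesting", "funny", "fresh"},
}


def contains(values):
    """Ids whose tags overlap with at least one of the values."""
    return [pid for pid, t in tags.items() if not t.isdisjoint(values)]


def contains_all(values):
    """Ids whose tags include every one of the values."""
    return [pid for pid, t in tags.items() if set(values) <= t]


def not_contains(values):
    """Ids whose tags include none of the values."""
    return [pid for pid, t in tags.items() if t.isdisjoint(values)]


print(contains(["fresh", "interesting"]))
print(contains_all(["fresh", "useful"]))
print(not_contains(["old", "dull"]))
```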


## Summary

The following filter operations are supported:

#### Comparison operations
* the `==` and `!=` operators,
* the `<`, `>`, `<=` and `>=` operators,
* the `.is_` and `.is_not_` methods, shown above on a Boolean field.
#### Logical operations
* creating `AND` relationships by stacking filters,
* and using the `.or_` or simply `|` to create `OR` relationships.
#### Set operations
* the `.in_` and `.not_in_` operators test whether a String field has (or does not have) a value from a collection,
* the `.contains` operator tests whether a StringList contains any of the values from a collection, and `.not_contains` tests that it contains none of them,
* the `.contains_all` operator tests whether a StringList contains all of the values from a collection.