# Querying options

Query vectors can be built from two sources:
- user input, for example via a [natural language interface](https://github.com/superlinked/superlinked/blob/main/notebook/feature/natural_language_querying.ipynb).
- a vector already stored in the VDB, typically for item-to-item recommendation or, more generally, as a context vector.

This notebook contains examples of both, as well as of their combination.

These inputs (even several of each) can be combined and weighted against each other.
The number of returned results can be controlled either via an exact count or via the minimum required similarity.

Let's create a simple example to see the possibilities!

In [1]:
%pip install superlinked==36.1.0

In [2]:
import pandas as pd
from superlinked import framework as sl

pd.set_option("display.max_colwidth", 100)

In [3]:
class Paragraph(sl.Schema):
    id: sl.IdField
    body: sl.String
    category: sl.StringList


paragraph = Paragraph()

body_space = sl.TextSimilaritySpace(text=paragraph.body, model="sentence-transformers/all-mpnet-base-v2")
category_space = sl.CategoricalSimilaritySpace(
    category_input=paragraph.category, categories=["IT", "environment"], uncategorized_as_category=True
)
paragraph_index = sl.Index([body_space, category_space])

Now let's fire up a running executor and add some data to our index.

In [4]:
source: sl.InMemorySource = sl.InMemorySource(paragraph)
executor = sl.InMemoryExecutor(sources=[source], indices=[paragraph_index])
app = executor.run()

In [5]:
source.put(
    [
        {"id": "paragraph-1", "body": "Glorious animals live in the wilderness.", "category": "environment"},
        {
            "id": "paragraph-2",
            "body": "Growing computation power enables advancements in AI.",
            "category": "IT",
        },
        {
            "id": "paragraph-3",
            "body": "The flora and fauna of a specific habitat highly depend on the weather.",
            "category": "environment",
        },
    ]
)

## Using the .similar clause

This clause lets us supply query input that is not tied to any stored vector, such as free text.

In [6]:
# we are creating a Param to reuse the query.
# For more info check the `dynamic_parameters.ipynb` feature notebook in this same folder.
similar_query = sl.Query(paragraph_index).find(paragraph).similar(body_space, sl.Param("similar_input")).select_all()

In [7]:
similar_result_weather = app.query(similar_query, similar_input="rainfall")
sl.PandasConverter.to_pandas(similar_result_weather)

Unnamed: 0,body,category,id,similarity_score
0,The flora and fauna of a specific habitat highly depend on the weather.,[environment],paragraph-3,0.337601
1,Glorious animals live in the wilderness.,[environment],paragraph-1,0.094036
2,Growing computation power enables advancements in AI.,[IT],paragraph-2,0.044686


In [8]:
similar_result_it = app.query(similar_query, similar_input="progress in AI")
sl.PandasConverter.to_pandas(similar_result_it)

Unnamed: 0,body,category,id,similarity_score
0,Growing computation power enables advancements in AI.,[IT],paragraph-2,0.598644
1,Glorious animals live in the wilderness.,[environment],paragraph-1,0.007107
2,The flora and fauna of a specific habitat highly depend on the weather.,[environment],paragraph-3,-0.121885


## Using the .with_vector clause

This clause lets us search with the vector of an object that is already in our database. It is useful, for example, for recommending items to a user based on the user's vector.

In [9]:
with_vector_query = sl.Query(paragraph_index).find(paragraph).with_vector(paragraph, "paragraph-3", 1.0).select_all()

In this case the weight in the clause does not really matter, as there are no other competing clauses. Stay tuned, because this is not always the case!
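To see why a lone clause weight has no effect, here is a minimal sketch that assumes results are ranked by cosine similarity. The vectors and the helper names `cosine` and `scale` are made up for illustration and are not part of Superlinked.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equally sized vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def scale(vector, weight):
    """Multiply every component of a vector by the same weight."""
    return [weight * x for x in vector]


query = [0.2, 0.9, 0.4]   # hypothetical query vector
stored = [0.1, 0.8, 0.6]  # hypothetical stored vector

base = cosine(query, stored)
for weight in (0.1, 1.0, 5.0):
    # a positive weight only changes the length of the query vector, not its direction
    assert math.isclose(cosine(scale(query, weight), stored), base)
```

Under this assumption, a weight only starts to matter once a second input competes for the direction of the query vector.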

In [10]:
with_vector_result = app.query(with_vector_query)
sl.PandasConverter.to_pandas(with_vector_result)

Unnamed: 0,body,category,id,similarity_score
0,The flora and fauna of a specific habitat highly depend on the weather.,[environment],paragraph-3,1.0
1,Glorious animals live in the wilderness.,[environment],paragraph-1,0.655296
2,Growing computation power enables advancements in AI.,[IT],paragraph-2,-0.00953


The first result is the paragraph we are searching with, the second is the more closely related one, and the least related paragraph body comes last.

Note, however, that `with_vector` queries can be weighted on a per-space basis as well!

In [11]:
weight_dict: dict[sl.Space, float] = {body_space: 0.0, category_space: 1.0}
with_vector_query_space_weights = (
    sl.Query(paragraph_index).find(paragraph).with_vector(paragraph, "paragraph-3", weight_dict).select_all()
)
with_vector_result_space_weights = app.query(with_vector_query_space_weights)
sl.PandasConverter.to_pandas(with_vector_result_space_weights)

Unnamed: 0,body,category,id,similarity_score
0,Glorious animals live in the wilderness.,[environment],paragraph-1,1.0
1,The flora and fauna of a specific habitat highly depend on the weather.,[environment],paragraph-3,1.0
2,Growing computation power enables advancements in AI.,[IT],paragraph-2,0.0


As we can see, the results above are based only on the `category` information.

Below, in contrast, only the body of the text influences the similarities.

In [12]:
weight_dict_alt: dict[sl.Space, float] = {body_space: 1.0, category_space: 0.0}
with_vector_query_space_weights_alt = (
    sl.Query(paragraph_index).find(paragraph).with_vector(paragraph, "paragraph-3", weight_dict_alt).select_all()
)
with_vector_result_space_weights_alt = app.query(with_vector_query_space_weights_alt)
sl.PandasConverter.to_pandas(with_vector_result_space_weights_alt)

Unnamed: 0,body,category,id,similarity_score
0,The flora and fauna of a specific habitat highly depend on the weather.,[environment],paragraph-3,1.0
1,Glorious animals live in the wilderness.,[environment],paragraph-1,0.310591
2,Growing computation power enables advancements in AI.,[IT],paragraph-2,-0.019059


## Combine them

With weights, any combination of inputs is possible. Imagine searching for a term, `similar_body`, among the paragraphs that are relevant to a specific paragraph, denoted by `paragraph_id`. The `similar_body_weight` `Param` weights the input relative to the context in which the search takes place, which is in turn weighted by the `paragraph_weight` `Param`. Note that the `Param` names are entirely arbitrary; what matters is the clause they are used in.
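As a rough mental model (an assumption for illustration, not the library's actual implementation), the combined query vector can be pictured as a weighted sum of the two inputs that is then re-normalized. The toy 2-d vectors and the names `combine`, `best_match`, `similar_weight` and `with_vector_weight` below are hypothetical.

```python
import math


def normalize(vector):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def combine(vectors, weights):
    """Weighted sum of equally sized vectors, re-normalized to unit length."""
    dims = len(vectors[0])
    summed = [sum(w * v[i] for v, w in zip(vectors, weights)) for i in range(dims)]
    return normalize(summed)


text_input = [1.0, 0.0]  # stands in for the embedded search term
context = [0.0, 1.0]     # stands in for the vector of the stored paragraph
items = {
    "near_input": normalize([0.9, 0.1]),
    "near_context": normalize([0.1, 0.9]),
}


def best_match(similar_weight, with_vector_weight):
    """Return the name of the item closest to the weighted query vector."""
    query = combine([text_input, context], [similar_weight, with_vector_weight])
    return max(items, key=lambda name: dot(query, items[name]))


print(best_match(1.0, 0.1))  # near_input
print(best_match(0.1, 1.0))  # near_context
```

Shifting weight from one input to the other moves the query vector toward that input, which is what changes the top result.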

In [13]:
# we are using dynamic parameters again
combined_query = (
    sl.Query(paragraph_index)
    .find(paragraph)
    .similar(body_space, sl.Param("similar_body"), weight=sl.Param("similar_body_weight"))
    .with_vector(paragraph, sl.Param("paragraph_id"), weight=sl.Param("paragraph_weight"))
    .select_all()
)

In [14]:
# equal weight
combined_result = app.query(
    combined_query,
    similar_body="progress in AI",
    paragraph_id="paragraph-3",
    similar_body_weight=1,
    paragraph_weight=1,
)
sl.PandasConverter.to_pandas(combined_result)

Unnamed: 0,body,category,id,similarity_score
0,The flora and fauna of a specific habitat highly depend on the weather.,[environment],paragraph-3,0.831307
1,Glorious animals live in the wilderness.,[environment],paragraph-1,0.619865
2,Growing computation power enables advancements in AI.,[IT],paragraph-2,0.218673


In [15]:
# upweight context - notice the score differences
combined_result_context = app.query(
    combined_query,
    similar_body="progress in AI",
    paragraph_id="paragraph-3",
    similar_body_weight=0.25,
    paragraph_weight=1,
)
sl.PandasConverter.to_pandas(combined_result_context)

Unnamed: 0,body,category,id,similarity_score
0,The flora and fauna of a specific habitat highly depend on the weather.,[environment],paragraph-3,0.984387
1,Glorious animals live in the wilderness.,[environment],paragraph-1,0.656062
2,Growing computation power enables advancements in AI.,[IT],paragraph-2,0.06525


In [16]:
# give more weight to query time input - the most relevant document changes
combined_result_input = app.query(
    combined_query,
    similar_body="progress in AI",
    paragraph_id="paragraph-3",
    similar_body_weight=1,
    paragraph_weight=0.1,
)
sl.PandasConverter.to_pandas(combined_result_input)

Unnamed: 0,body,category,id,similarity_score
0,Glorious animals live in the wilderness.,[environment],paragraph-1,0.519222
1,The flora and fauna of a specific habitat highly depend on the weather.,[environment],paragraph-3,0.488978
2,Growing computation power enables advancements in AI.,[IT],paragraph-2,0.300537


To use per-space weights, the weight has to be given as a dict keyed by space; the values themselves can be `Param`s.

In [17]:
# we are using dynamic parameters again
combined_query_dict_context_weights = (
    sl.Query(paragraph_index)
    .find(paragraph)
    .similar(body_space, sl.Param("similar_body"), weight=sl.Param("similar_body_weight"))
    .with_vector(
        paragraph,
        sl.Param("paragraph_id"),
        weight={body_space: sl.Param("body_paragraph_weight"), category_space: sl.Param("category_paragraph_weight")},
    )
    .select_all()
)
# as seen before, we can use per-space weights for the context, too
combined_result_input = app.query(
    combined_query_dict_context_weights,
    similar_body="progress in AI",
    paragraph_id="paragraph-3",
    similar_body_weight=1,
    body_paragraph_weight=0.15,
    category_paragraph_weight=0.05,
)
sl.PandasConverter.to_pandas(combined_result_input)

Unnamed: 0,body,category,id,similarity_score
0,Glorious animals live in the wilderness.,[environment],paragraph-1,0.527039
1,The flora and fauna of a specific habitat highly depend on the weather.,[environment],paragraph-3,0.514157
2,Growing computation power enables advancements in AI.,[IT],paragraph-2,0.30001


## Advanced query mechanics

Query behavior can be influenced using **weights**, which control how similarity is computed across and within vector spaces. There are two types of weights:

### 1. Space Weights 

**Space weights** determine the relative contribution of each vector space to the overall similarity score between the query and knowledge base items. These weights govern **inter-space importance**, reweighting the normalized query vector components coming from each space. 

### 2. Clause Weights 

**Clause weights** affect how individual clauses contribute to the formation of a query vector within a specific space, controlling **intra-space influence**. Both `.similar` and `.with_vector` clauses support weights: 
* `.similar` clauses apply to a single space.
* `.with_vector` clauses apply across all vector spaces.

Clause weights influence how the query vector is constructed per space. After this, space weights are applied to normalized per-space vectors to adjust the overall balance across spaces.
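The two stages described above can be sketched in plain Python. This is a simplified mental model under stated assumptions (weighted sums within a space, unit-length normalization per space, concatenation across spaces), not Superlinked's actual code; `build_query_vector`, `body_share` and the toy vectors are hypothetical.

```python
import math


def normalize(vector):
    """Scale a vector to unit length (all-zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm else list(vector)


def build_query_vector(clauses_per_space, space_weights):
    """Toy construction: `clauses_per_space` maps a space name to (vector, clause_weight) pairs."""
    query = []
    for space, clauses in clauses_per_space.items():
        dims = len(clauses[0][0])
        # stage 1: clause weights mix the inputs within one space
        mixed = [sum(weight * vector[i] for vector, weight in clauses) for i in range(dims)]
        # stage 2: the space weight rescales the normalized per-space part
        query.extend(space_weights[space] * x for x in normalize(mixed))
    return normalize(query)


def body_share(query):
    """Fraction of the squared length of the query vector that sits in the 2-d body part."""
    return sum(x * x for x in query[:2])


clauses = {
    "body": [([1.0, 0.0], 1.0), ([0.0, 1.0], 5.0)],  # e.g. one .similar and one .with_vector input
    "category": [([1.0, 0.0], 1.0)],
}

balanced = build_query_vector(clauses, {"body": 1.0, "category": 1.0})
category_heavy = build_query_vector(clauses, {"body": 1.0, "category": 5.0})

print(round(body_share(balanced), 3), round(body_share(category_heavy), 3))  # 0.5 0.038
```

In this model the clause weights only decide the direction of each per-space part, while the space weights decide how much of the final vector each space occupies.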

In [18]:
# let's add 2 more examples for demonstration purposes
source.put(
    [
        {
            "id": "paragraph-4",
            "body": "The AI boom contributes to global warming through heating caused by extensive GPU usage.",
            "category": ["IT", "environment"],
        },
        {
            "id": "paragraph-5",
            "body": "An astonishing number of users still use ancient Windows operating systems.",
            "category": ["IT"],
        },
    ]
)


# helper function for partial scores
def get_partial_score_df(result: sl.QueryResult) -> pd.DataFrame:
    partial_score_df: pd.DataFrame = pd.DataFrame(
        [[entry.id] + list(entry.metadata.partial_scores) for entry in result.entries],
        columns=["id", "body_space", "category_space"],
    )
    return sl.PandasConverter.to_pandas(result).merge(partial_score_df, on="id")

We will use partial scores to explain our results a bit more; the function above is a helper for that. For a more focused look at partial scores, see the relevant [feature notebook](https://github.com/superlinked/superlinked/blob/main/notebook/feature/query_result.ipynb).
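In the result tables of this section, the two partial score columns add up to `similarity_score`. A minimal sketch of why that can happen, assuming the score is a dot product over concatenated per-space segments (the segment values and names below are made up):

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


# hypothetical per-space segments of a query vector and of a stored vector
query_segments = {"body_space": [0.3, 0.4], "category_space": [0.5, 0.0]}
stored_segments = {"body_space": [0.2, 0.1], "category_space": [0.6, 0.3]}

# one partial score per space
partial_scores = {
    space: dot(query_segments[space], stored_segments[space]) for space in query_segments
}

# the score over the concatenated vectors
full_query = [x for segment in query_segments.values() for x in segment]
full_stored = [x for segment in stored_segments.values() for x in segment]
total = dot(full_query, full_stored)

# the dot product splits into one sum per space, so the partial scores add up to the total
assert abs(total - sum(partial_scores.values())) < 1e-12
```

Reading the partial scores side by side therefore shows how much each space contributed to the rank of an item.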

In [19]:
# create a complicated query to showcase the levers we can pull to affect query results in their entirety
advanced_query = (
    sl.Query(
        paragraph_index,
        weights={body_space: sl.Param("body_space_weight"), category_space: sl.Param("category_space_weight")},
    )
    .find(paragraph)
    .similar(body_space, sl.Param("body_input"), sl.Param("body_similar_weight"))
    .similar(category_space, sl.Param("category_input"), sl.Param("category_similar_weight"))
    .with_vector(
        paragraph,
        sl.Param("paragraph_id"),
        weight={
            body_space: sl.Param("body_with_vector_weight"),
            category_space: sl.Param("category_with_vector_weight"),
        },
    )
    .select_all()
    .include_metadata()
)

In [20]:
# first, run a query that up-weights the with_vector contribution to the body_space part of the query vector
with_vector_favored_result = app.query(
    advanced_query,
    body_space_weight=1,
    category_space_weight=1,
    body_input="computation power",
    category_input="environment",
    body_similar_weight=1,
    category_similar_weight=1,
    paragraph_id="paragraph-5",
    body_with_vector_weight=5,
    category_with_vector_weight=1,
)

get_partial_score_df(with_vector_favored_result)

Unnamed: 0,body,category,id,similarity_score,body_space,category_space
0,The AI boom contributes to global warming through heating caused by extensive GPU usage.,"[IT, environment]",paragraph-4,0.663999,0.163999,0.5
1,An astonishing number of users still use ancient Windows operating systems.,[IT],paragraph-5,0.657851,0.491184,0.166667
2,Glorious animals live in the wilderness.,[environment],paragraph-1,0.383819,0.050486,0.333333
3,The flora and fauna of a specific habitat highly depend on the weather.,[environment],paragraph-3,0.333975,0.000642,0.333333
4,Growing computation power enables advancements in AI.,[IT],paragraph-2,0.277853,0.111186,0.166667


Notice how the input from the context vector (Windows operating systems, paragraph-5) makes paragraph-5 rank high, because the `body_with_vector_weight` Param is significantly higher than `body_similar_weight` (the top result is there mainly due to the categorical match).

In the next scenario, where the relationship between these two Params is reversed, paragraph-2 (computation power) ranks high because it is semantically closer to `body_input`. Also notice that only the ratio of these Params matters: `5` vs `1` is the **same** as `1` vs `0.2`.

In [21]:
similar_favored_result = app.query(
    advanced_query,
    body_space_weight=1,
    category_space_weight=1,
    body_input="computation power",
    category_input="environment",
    body_similar_weight=1,
    category_similar_weight=1,
    paragraph_id="paragraph-5",
    body_with_vector_weight=0.2,
    category_with_vector_weight=1,
)

get_partial_score_df(similar_favored_result)

Unnamed: 0,body,category,id,similarity_score,body_space,category_space
0,The AI boom contributes to global warming through heating caused by extensive GPU usage.,"[IT, environment]",paragraph-4,0.681913,0.181913,0.5
1,Growing computation power enables advancements in AI.,[IT],paragraph-2,0.410214,0.243547,0.166667
2,Glorious animals live in the wilderness.,[environment],paragraph-1,0.351866,0.018533,0.333333
3,An astonishing number of users still use ancient Windows operating systems.,[IT],paragraph-5,0.344295,0.177629,0.166667
4,The flora and fauna of a specific habitat highly depend on the weather.,[environment],paragraph-3,0.316096,-0.017237,0.333333
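The remark that `5` vs `1` behaves the same as `1` vs `0.2` can be checked with a small sketch. It assumes the clause inputs are mixed by a weighted sum and then normalized; the function `normalized_mix` and the vectors are hypothetical.

```python
import math


def normalized_mix(a, b, weight_a, weight_b):
    """Weighted sum of two vectors, scaled to unit length."""
    mixed = [weight_a * x + weight_b * y for x, y in zip(a, b)]
    norm = math.sqrt(sum(x * x for x in mixed))
    return [x / norm for x in mixed]


with_vector_part = [0.6, 0.8]  # hypothetical stored-item vector
similar_part = [1.0, 0.0]      # hypothetical embedded text input

mix_5_vs_1 = normalized_mix(with_vector_part, similar_part, 5.0, 1.0)
mix_1_vs_02 = normalized_mix(with_vector_part, similar_part, 1.0, 0.2)

# both weight pairs have the same 5:1 ratio, so normalization yields the same direction
assert all(math.isclose(p, q) for p, q in zip(mix_5_vs_1, mix_1_vs_02))
```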


But after all, regardless of the individual clause weights, the final similarity is driven predominantly by the space weights. Notice how increasing `category_space_weight` changes the ranking, making the paragraphs with the environment category rank highest. This shift can be observed through the partial scores as well: the category space score drives most of the overall score.

In [22]:
space_weight_overpowers = app.query(
    advanced_query,
    body_space_weight=1,
    category_space_weight=5,
    body_input="computation power",
    category_input="environment",
    body_similar_weight=5,
    category_similar_weight=1,
    paragraph_id="paragraph-5",
    body_with_vector_weight=5,
    category_with_vector_weight=1,
)

get_partial_score_df(space_weight_overpowers)

Unnamed: 0,body,category,id,similarity_score,body_space,category_space
0,The AI boom contributes to global warming through heating caused by extensive GPU usage.,"[IT, environment]",paragraph-4,0.748332,0.054957,0.693375
1,Glorious animals live in the wilderness.,[environment],paragraph-1,0.473216,0.010965,0.46225
2,The flora and fauna of a specific habitat highly depend on the weather.,[environment],paragraph-3,0.459614,-0.002637,0.46225
3,An astonishing number of users still use ancient Windows operating systems.,[IT],paragraph-5,0.337383,0.106258,0.231125
4,Growing computation power enables advancements in AI.,[IT],paragraph-2,0.287483,0.056358,0.231125


## Filter results based on score or position

In [23]:
# let's reuse the combined query from above with some preset params
params = {
    "similar_body": "progress in AI",
    "paragraph_id": "paragraph-3",
    "similar_body_weight": 1,
    "paragraph_weight": 0.25,
}

In [24]:
# return top 2 items
combined_query_limit_result = app.query(combined_query.limit(2), **params)
sl.PandasConverter.to_pandas(combined_query_limit_result)

Unnamed: 0,body,category,id,similarity_score
0,The AI boom contributes to global warming through heating caused by extensive GPU usage.,"[IT, environment]",paragraph-4,0.693285
1,The flora and fauna of a specific habitat highly depend on the weather.,[environment],paragraph-3,0.564008


In [25]:
# return items with scores larger than 0.5
combined_query_radius_result = app.query(combined_query.radius(0.5), **params)
sl.PandasConverter.to_pandas(combined_query_radius_result)

Unnamed: 0,body,category,id,similarity_score
0,The AI boom contributes to global warming through heating caused by extensive GPU usage.,"[IT, environment]",paragraph-4,0.693285
1,The flora and fauna of a specific habitat highly depend on the weather.,[environment],paragraph-3,0.564008
2,Glorious animals live in the wilderness.,[environment],paragraph-1,0.542344
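The two filters can be pictured as simple post-processing of a scored result list: keep a fixed number of top items, or keep every item above a score threshold. The sketch below uses made-up ids and scores and hypothetical helper names; how a radius value maps onto a score threshold in general is not covered here.

```python
def top_k(scored, k):
    """Keep the k highest-scoring (id, score) pairs."""
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


def above_threshold(scored, threshold):
    """Keep every (id, score) pair whose score is larger than the threshold."""
    ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
    return [(item_id, score) for item_id, score in ranked if score > threshold]


# hypothetical scored results
scored = [("a", 0.9), ("b", 0.3), ("c", 0.7), ("d", 0.55)]

print(top_k(scored, 2))              # [('a', 0.9), ('c', 0.7)]
print(above_threshold(scored, 0.5))  # [('a', 0.9), ('c', 0.7), ('d', 0.55)]
```

A fixed count always returns the same number of items regardless of quality, while a threshold returns a variable number of items that all meet the required similarity.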
