# Categorical Similarity Space

CategoricalSimilaritySpace models categorical data for vector search applications. It creates n-hot encodings: binary vectors in which each dimension represents a category and several dimensions can be set to 1 at once. This suits items that belong to multiple categories (such as a product classified as both "Clothing" and "Sports").
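To make the n-hot idea concrete, here is a minimal plain-Python sketch. The `n_hot` helper is hypothetical and only illustrates the encoding; it is not Superlinked's implementation, which may normalize or extend the vector.

```python
def n_hot(item_categories: list[str], categories: list[str]) -> list[int]:
    """Return a binary vector with a 1 for every known category the item has."""
    present = set(item_categories)
    return [1 if category in present else 0 for category in categories]


categories = ["Clothing", "Sports", "Home"]

print(n_hot(["Clothing"], categories))            # [1, 0, 0]
print(n_hot(["Clothing", "Sports"], categories))  # [1, 1, 0]
```

An item with two categories simply has two dimensions set, which is what distinguishes n-hot from one-hot encoding.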

## Overview

This notebook demonstrates how to effectively use CategoricalSimilaritySpace to model relationships between categorized items. You'll learn:

1. **Basic implementation**: Creating and querying a categorical similarity space
2. **Handling unknown categories**: Managing categories outside your predefined taxonomy
3. **Category weighting**: Implementing preference and exclusion patterns in searches
4. **Negative filtering**: Creating clear distinctions between matches and non-matches

These techniques are valuable for applications like product categorization, content tagging systems, and any scenario where you need precise control over how categorical relationships influence search results.

In [1]:
%pip install superlinked==37.1.0

In [2]:
from superlinked import framework as sl

Here is our sample data.
In this tutorial, we will create a Superlinked space only for the "category" field,
while keeping the "name" field as a convenient label.

In [3]:
data_initial = [
    {"id": "product-1", "name": "T-shirt", "category": ["Clothing"]},
    {"id": "product-2", "name": "Running Shoes", "category": ["Clothing", "Sports"]},
    {"id": "product-3", "name": "Yoga Mat", "category": ["Sports"]},
    {"id": "product-4", "name": "Pijama", "category": ["Clothing", "Home"]},
    {"id": "product-5", "name": "Pillows", "category": ["Home"]},
]

categories = ["Clothing", "Sports", "Home"]

In [4]:
class Product(sl.Schema):
    id: sl.IdField
    name: sl.String
    category: sl.StringList


product = Product()

### Basic usage

In [5]:
category_space = sl.CategoricalSimilaritySpace(category_input=product.category, categories=categories)
index = sl.Index(category_space)
source: sl.InMemorySource = sl.InMemorySource(product)
executor = sl.InMemoryExecutor(sources=[source], indices=[index])
app = executor.run()

Now we'll embed the data and ingest the vectors into the VectorDatabase.
For this example, we'll use InMemoryVectorDatabase.

In [6]:
source.put(data_initial)

In [7]:
query = (
    sl.Query(index)
    .find(product)
    .similar(category_space.category, sl.Param("query_categories"))
    # or shorter
    # .similar(category_space, sl.Param("query_categories"))
    .select_all()
    # select_all returns all schema fields
)

In [8]:
result = app.query(query, query_categories=["Clothing"])
# gives the same result as
# result = app.query(query, query_categories="Clothing")

sl.PandasConverter.to_pandas(result)

| | name | category | id | similarity_score |
|---|---|---|---|---|
| 0 | T-shirt | [Clothing] | product-1 | 1.0 |
| 1 | Running Shoes | [Clothing, Sports] | product-2 | 1.0 |
| 2 | Pijama | [Clothing, Home] | product-4 | 1.0 |
| 3 | Yoga Mat | [Sports] | product-3 | 0.0 |
| 4 | Pillows | [Home] | product-5 | 0.0 |


A similarity score of 1.0 indicates a perfect match:
every query category (here just "Clothing") is present in the product.
Products without the "Clothing" category receive a score of 0.0.
For a single-category query, CategoricalSimilaritySpace therefore draws a clear binary distinction
between items that match the query category and those that don't.

In [9]:
result = app.query(query, query_categories=["Clothing", "Home"])
sl.PandasConverter.to_pandas(result)

| | name | category | id | similarity_score |
|---|---|---|---|---|
| 0 | Pijama | [Clothing, Home] | product-4 | 1.0 |
| 1 | T-shirt | [Clothing] | product-1 | 0.5 |
| 2 | Running Shoes | [Clothing, Sports] | product-2 | 0.5 |
| 3 | Pillows | [Home] | product-5 | 0.5 |
| 4 | Yoga Mat | [Sports] | product-3 | 0.0 |


When querying with multiple categories ["Clothing", "Home"], products containing both categories (like "Pijama") receive a perfect score of 1.0. Products with only one of the categories receive a score of 0.5, representing a 50% match. Products with neither category receive a score of 0.0. 

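The similarity calculation essentially measures the intersection of categories between the query and the product, normalized by the number of categories in the query. Extra categories on the product that are not in the query (such as "Sports" on "Running Shoes") do not lower the score.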
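The description above can be checked with a small toy model. This is a plain-Python sketch of "intersection divided by query size"; `overlap_score` is a hypothetical helper that reproduces the scores in the table, not the vector computation Superlinked performs internally.

```python
def overlap_score(query_categories: list[str], item_categories: list[str]) -> float:
    """Share of the query's categories that the item also has."""
    query = set(query_categories)
    if not query:
        return 0.0
    return len(query & set(item_categories)) / len(query)


products = {
    "T-shirt": ["Clothing"],
    "Running Shoes": ["Clothing", "Sports"],
    "Yoga Mat": ["Sports"],
    "Pijama": ["Clothing", "Home"],
    "Pillows": ["Home"],
}

for name, item_categories in products.items():
    print(name, overlap_score(["Clothing", "Home"], item_categories))
# T-shirt 0.5
# Running Shoes 0.5
# Yoga Mat 0.0
# Pijama 1.0
# Pillows 0.5
```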

### Adding new categories outside the predefined list

Let's add more products with categories outside our predefined list
and observe how our application handles them.

In [10]:
data_kitchen = [
    {"id": "product-6", "name": "Blender", "category": ["Kitchen"]},
    {"id": "product-7", "name": "Cooking Apron", "category": ["Clothing", "Kitchen"]},
]

source.put(data_kitchen)

In [11]:
result = app.query(query, query_categories=["Clothing", "Home"])
sl.PandasConverter.to_pandas(result)

| | name | category | id | similarity_score |
|---|---|---|---|---|
| 0 | Pijama | [Clothing, Home] | product-4 | 1.0 |
| 1 | T-shirt | [Clothing] | product-1 | 0.5 |
| 2 | Running Shoes | [Clothing, Sports] | product-2 | 0.5 |
| 3 | Pillows | [Home] | product-5 | 0.5 |
| 4 | Cooking Apron | [Clothing, Kitchen] | product-7 | 0.5 |
| 5 | Yoga Mat | [Sports] | product-3 | 0.0 |
| 6 | Blender | [Kitchen] | product-6 | 0.0 |


You can see that "Cooking Apron" received a 0.5 score
because it contains half of the categories specified in the query,
specifically "Clothing".

Now let's see how it works with "Kitchen".

In [12]:
result = app.query(query, query_categories=["Kitchen"])
sl.PandasConverter.to_pandas(result)

| | name | category | id | similarity_score |
|---|---|---|---|---|
| 0 | Blender | [Kitchen] | product-6 | 1.0 |
| 1 | Cooking Apron | [Clothing, Kitchen] | product-7 | 1.0 |
| 2 | T-shirt | [Clothing] | product-1 | 0.0 |
| 3 | Running Shoes | [Clothing, Sports] | product-2 | 0.0 |
| 4 | Yoga Mat | [Sports] | product-3 | 0.0 |
| 5 | Pijama | [Clothing, Home] | product-4 | 0.0 |
| 6 | Pillows | [Home] | product-5 | 0.0 |


Interestingly, although "Kitchen" is not in our predefined categories list, the application returned all Kitchen products with a perfect score.
Why did we define a categories list if it's not being used as expected?

The explanation will follow shortly, but first let's try another search with "Electronics".

In [13]:
result = app.query(query, query_categories=["Electronics"])
sl.PandasConverter.to_pandas(result)

| | name | category | id | similarity_score |
|---|---|---|---|---|
| 0 | Blender | [Kitchen] | product-6 | 1.0 |
| 1 | Cooking Apron | [Clothing, Kitchen] | product-7 | 1.0 |
| 2 | T-shirt | [Clothing] | product-1 | 0.0 |
| 3 | Running Shoes | [Clothing, Sports] | product-2 | 0.0 |
| 4 | Yoga Mat | [Sports] | product-3 | 0.0 |
| 5 | Pijama | [Clothing, Home] | product-4 | 0.0 |
| 6 | Pillows | [Home] | product-5 | 0.0 |


The results are identical to those of the "Kitchen" query: although no product has the "Electronics" category and "Electronics" is not in our predefined categories list, the application returned every product containing "Kitchen" with a similarity score of 1.0.

By default, CategoricalSimilaritySpace treats all categories not in the predefined list as a special "other" category. This means both "Kitchen" and "Electronics" are mapped to this same "other" category in the vector space, explaining why they produce identical search results. The parameter `uncategorized_as_category=True` (default) controls this behavior, enabling the system to handle unknown categories without requiring redefinition of the embedding space.
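The effect of the shared "other" category can be illustrated with a toy mapping. This is a plain-Python sketch: `bucket` and the `OTHER` label are hypothetical names used for illustration, not Superlinked internals.

```python
OTHER = "other"  # hypothetical label for the catch-all bucket


def bucket(item_categories: list[str], known: set[str]) -> list[str]:
    """Map every category outside the known set to the shared catch-all label."""
    return sorted({c if c in known else OTHER for c in item_categories})


known = {"Clothing", "Sports", "Home"}

print(bucket(["Kitchen"], known))              # ['other']
print(bucket(["Electronics"], known))          # ['other']
print(bucket(["Clothing", "Kitchen"], known))  # ['Clothing', 'other']
```

Because "Kitchen" and "Electronics" collapse to the same label, a query for either one matches the same set of products.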

### Handling uncategorized values

In [14]:
category_space = sl.CategoricalSimilaritySpace(
    category_input=product.category,
    categories=categories,
    uncategorized_as_category=False,
    # uncategorized_as_category was True by default
    # now we change it to False
)
index = sl.Index(category_space)

source_without_uncategorized: sl.InMemorySource = sl.InMemorySource(product)
executor = sl.InMemoryExecutor(sources=[source_without_uncategorized], indices=[index])
app = executor.run()

source_without_uncategorized.put(data_initial + data_kitchen)

In [15]:
query = sl.Query(index).find(product).similar(category_space.category, sl.Param("query_categories")).select_all()

In [16]:
result = app.query(query, query_categories=["Kitchen"])
sl.PandasConverter.to_pandas(result)

| | name | category | id | similarity_score |
|---|---|---|---|---|
| 0 | T-shirt | [Clothing] | product-1 | 0.0 |
| 1 | Running Shoes | [Clothing, Sports] | product-2 | 0.0 |
| 2 | Yoga Mat | [Sports] | product-3 | 0.0 |
| 3 | Pijama | [Clothing, Home] | product-4 | 0.0 |
| 4 | Pillows | [Home] | product-5 | 0.0 |
| 5 | Blender | [Kitchen] | product-6 | 0.0 |
| 6 | Cooking Apron | [Clothing, Kitchen] | product-7 | 0.0 |


Now with `uncategorized_as_category=False`, the system ignores any categories not in our predefined list. This explains why searching for "Kitchen" returns no relevant products (all similarity scores are 0). This configuration is useful when you want strict adherence to your predefined taxonomy and don't want to allow new categories to emerge without explicit updates to your system.
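Since categories outside the taxonomy are ignored in this mode, it can help to check incoming records against the predefined list before ingesting them. The helper below is a hypothetical plain-Python sketch (`find_unknown_categories` is not part of Superlinked) that reports which records carry categories the space would ignore.

```python
def find_unknown_categories(records: list[dict], categories: list[str]) -> dict[str, list[str]]:
    """Return {record id: [categories not in the predefined list]} for affected records."""
    known = set(categories)
    unknown = {}
    for record in records:
        extra = [c for c in record["category"] if c not in known]
        if extra:
            unknown[record["id"]] = extra
    return unknown


categories = ["Clothing", "Sports", "Home"]
data_kitchen = [
    {"id": "product-6", "name": "Blender", "category": ["Kitchen"]},
    {"id": "product-7", "name": "Cooking Apron", "category": ["Clothing", "Kitchen"]},
]

print(find_unknown_categories(data_kitchen, categories))
# {'product-6': ['Kitchen'], 'product-7': ['Kitchen']}
```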

In [17]:
# same for "Electronics"

result = app.query(query, query_categories=["Electronics"])
sl.PandasConverter.to_pandas(result)

| | name | category | id | similarity_score |
|---|---|---|---|---|
| 0 | T-shirt | [Clothing] | product-1 | 0.0 |
| 1 | Running Shoes | [Clothing, Sports] | product-2 | 0.0 |
| 2 | Yoga Mat | [Sports] | product-3 | 0.0 |
| 3 | Pijama | [Clothing, Home] | product-4 | 0.0 |
| 4 | Pillows | [Home] | product-5 | 0.0 |
| 5 | Blender | [Kitchen] | product-6 | 0.0 |
| 6 | Cooking Apron | [Clothing, Kitchen] | product-7 | 0.0 |


### Weighting categories

Sometimes you want to weight categories differently.
For example, you may want to prefer some categories and penalize others.
You can do that by chaining `.similar` calls with different weights.

In [18]:
query = (
    sl.Query(index)
    .find(product)
    .similar(category_space.category, sl.Param("preferred_categories"), weight=1.0)
    .similar(category_space.category, sl.Param("excluded_categories"), weight=-1.0)
    .select_all()
)

In [19]:
result = app.query(query, preferred_categories=["Clothing", "Sports"], excluded_categories=["Home"])

sl.PandasConverter.to_pandas(result)

| | name | category | id | similarity_score |
|---|---|---|---|---|
| 0 | Running Shoes | [Clothing, Sports] | product-2 | 1.0 |
| 1 | T-shirt | [Clothing] | product-1 | 0.5 |
| 2 | Yoga Mat | [Sports] | product-3 | 0.5 |
| 3 | Cooking Apron | [Clothing, Kitchen] | product-7 | 0.5 |
| 4 | Pijama | [Clothing, Home] | product-4 | 0.0 |
| 5 | Blender | [Kitchen] | product-6 | 0.0 |
| 6 | Pillows | [Home] | product-5 | -0.5 |


We get the following interesting results:

- "Running Shoes" received a 1.0 score because it's a perfect match with both preferred categories.

- "T-shirt", "Yoga Mat", and "Cooking Apron" each received 0.5
  because they contain only one of the two preferred categories
  and none of the excluded categories.

- "Blender" received 0.0 because its only category, "Kitchen", is outside our predefined categories list and is ignored (the index was built with `uncategorized_as_category=False`), so it matches neither a preferred nor an excluded category.

- "Pijama" received 0.0 because it has one preferred category ("Clothing")
  but also has one excluded category ("Home"),
  resulting in a score of 0.5 - 0.5 = 0.0.

- "Pillows" received -0.5 because it only contains the "Home" category, which we're excluding.

Here is a more advanced scenario, in which the weights themselves are passed as query-time parameters:

In [20]:
query = (
    sl.Query(index)
    .find(product)
    .similar(category_space.category, sl.Param("preferred_categories"), weight=sl.Param("preferred_weight"))
    .similar(category_space.category, sl.Param("excluded_categories"), weight=sl.Param("excluded_weight"))
    .select_all()
)

In [21]:
result = app.query(
    query,
    preferred_categories=["Clothing", "Sports"],
    preferred_weight=1.0,
    excluded_categories=["Home"],
    excluded_weight=-4.0,
)

sl.PandasConverter.to_pandas(result)

| | name | category | id | similarity_score |
|---|---|---|---|---|
| 0 | Running Shoes | [Clothing, Sports] | product-2 | 1.0 |
| 1 | T-shirt | [Clothing] | product-1 | 0.5 |
| 2 | Yoga Mat | [Sports] | product-3 | 0.5 |
| 3 | Cooking Apron | [Clothing, Kitchen] | product-7 | 0.5 |
| 4 | Blender | [Kitchen] | product-6 | 0.0 |
| 5 | Pijama | [Clothing, Home] | product-4 | -1.5 |
| 6 | Pillows | [Home] | product-5 | -2.0 |


Raising the magnitude of the negative weight to -4.0 penalizes products in the excluded categories much more heavily. "Pillows" now scores -2.0 and "Pijama" scores -1.5 (0.5 - 2.0), reflecting the stronger penalty for having the "Home" category.
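The scores in both weighted result tables are consistent with a simple additive reading: each matched category contributes 0.5 times the weight of the clause it belongs to. The sketch below checks that arithmetic in plain Python. The 0.5 unit is read off the tables above (as in the "0.5 - 0.5 = 0.0" explanation) and is an assumption for this example, not a formula taken from Superlinked's documentation; `weighted_score` is a hypothetical helper.

```python
UNIT = 0.5  # assumed contribution of one matched category, as observed in the tables


def weighted_score(item_categories, preferred, excluded, preferred_weight, excluded_weight):
    """Toy additive model: weight each clause by how many of its categories the item has."""
    item = set(item_categories)
    preferred_hits = len(item & set(preferred))
    excluded_hits = len(item & set(excluded))
    return UNIT * (preferred_weight * preferred_hits + excluded_weight * excluded_hits)


preferred, excluded = ["Clothing", "Sports"], ["Home"]

for weight in (-1.0, -4.0):
    pijama = weighted_score(["Clothing", "Home"], preferred, excluded, 1.0, weight)
    pillows = weighted_score(["Home"], preferred, excluded, 1.0, weight)
    print(weight, pijama, pillows)
# -1.0 0.0 -0.5
# -4.0 -1.5 -2.0
```

Items without the excluded category, such as "Running Shoes" (1.0) or "T-shirt" (0.5), are unaffected by the excluded weight in this model, which matches both tables.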

### Negative filtering

If you want to penalize items that do not match the categories in the query, you can set `negative_filter` to a negative number.
This pushes non-matching items significantly lower in the results order.

In [22]:
category_space = sl.CategoricalSimilaritySpace(
    category_input=product.category,
    categories=categories,
    uncategorized_as_category=False,
    negative_filter=-10,  # Strong penalty for non-matches
)
index = sl.Index(category_space)

source_with_negative_filter: sl.InMemorySource = sl.InMemorySource(product)
executor = sl.InMemoryExecutor(sources=[source_with_negative_filter], indices=[index])
app = executor.run()

source_with_negative_filter.put(data_initial + data_kitchen)

In [23]:
query = sl.Query(index).find(product).similar(category_space.category, sl.Param("query_categories")).select_all()

In [24]:
result = app.query(query, query_categories=["Clothing", "Home"])
sl.PandasConverter.to_pandas(result)

| | name | category | id | similarity_score |
|---|---|---|---|---|
| 0 | Pijama | [Clothing, Home] | product-4 | 1.0 |
| 1 | T-shirt | [Clothing] | product-1 | -8.160254 |
| 2 | Running Shoes | [Clothing, Sports] | product-2 | -8.160254 |
| 3 | Pillows | [Home] | product-5 | -8.160254 |
| 4 | Cooking Apron | [Clothing, Kitchen] | product-7 | -8.160254 |
| 5 | Yoga Mat | [Sports] | product-3 | -17.320508 |
| 6 | Blender | [Kitchen] | product-6 | -17.320508 |


The `negative_filter=-10` parameter dramatically changes how similarity scores are calculated. With this setting:

1. Perfect matches (containing all query categories) still receive a score of 1.0
2. Partial matches receive a substantial penalty (about -8.16 here, with one of the two query categories present), and items matching no query category receive a larger one (about -17.32)
3. The specific values (-8.16, -17.32) are derived from vector distance calculations with the negative filter applied

This approach effectively creates a "cliff" in the similarity scores, making it easy to separate exact matches from everything else. It's particularly useful for applications where you need strict filtering but still want to maintain a ranking among the non-matching items.
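One way to use this "cliff" downstream is to split results on a score threshold. The sketch below uses plain Python with the scores copied from the table above; the `split_on_cliff` helper and the threshold of 0.0 are assumptions for illustration and should be tuned to your own `negative_filter` value.

```python
# (name, similarity_score) pairs copied from the result table, in ranked order.
results = [
    ("Pijama", 1.0),
    ("T-shirt", -8.160254),
    ("Running Shoes", -8.160254),
    ("Pillows", -8.160254),
    ("Cooking Apron", -8.160254),
    ("Yoga Mat", -17.320508),
    ("Blender", -17.320508),
]


def split_on_cliff(results, threshold=0.0):
    """Separate items above the threshold from the rest, keeping the ranking order."""
    exact = [name for name, score in results if score > threshold]
    rest = [name for name, score in results if score <= threshold]
    return exact, rest


exact, rest = split_on_cliff(results)
print(exact)  # ['Pijama']
print(rest)   # ['T-shirt', 'Running Shoes', 'Pillows', 'Cooking Apron', 'Yoga Mat', 'Blender']
```

The remaining items keep their relative order, so partial matches still rank above items that match no query category.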

### Summary: Key Features of CategoricalSimilaritySpace

In this notebook, we've explored the CategoricalSimilaritySpace, a powerful tool for handling categorical data in vector search applications. Here's a summary of its key features:

1. **N-hot encoding**: Transforms categorical data into binary vectors where each dimension represents a category, allowing items to belong to multiple categories simultaneously.

2. **Flexible category handling**: 
   - With `uncategorized_as_category=True` (default), unknown categories are grouped into a special "other" category
   - With `uncategorized_as_category=False`, unknown categories are ignored, enforcing strict adherence to predefined taxonomies

3. **Powerful weighting options**:
   - Apply positive weights to preferred categories
   - Apply negative weights to excluded categories
   - Use granular control with parameterized weights for fine-tuned relevance

4. **Negative filtering**: Set `negative_filter` to a negative value to create a sharp distinction between exact matches and partial matches, effectively implementing "must-have" category requirements while maintaining ranking among non-matches.

CategoricalSimilaritySpace is particularly useful for e-commerce product categorization, content tagging systems, and any application where items can belong to multiple discrete categories and where you need precise control over how these categories influence search results.