# Understanding Categorical Similarity Space

`CategoricalSimilaritySpace` is best suited to categorical data with only a few categories whose names carry no semantic meaning, so embedding them as text would add nothing. The space creates an n-hot encoding of the categories.
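
As a rough illustration of what an n-hot encoding looks like, the sketch below encodes single-label and multi-label values in plain Python. The helper `n_hot` and the constant `CATEGORIES` are hypothetical names used only here, and the sketch leaves out the extra `other` column that the space adds; it is not the library's implementation.

```python
CATEGORIES = ("category-1", "category-2", "category-3")


def n_hot(labels, categories=CATEGORIES):
    """Return a vector with a 1 in the column of every label that is present."""
    present = set(labels)
    return [1 if category in present else 0 for category in categories]


print(n_hot(["category-2"]))                # [0, 1, 0]
print(n_hot(["category-2", "category-3"]))  # [0, 1, 1]
```

Unlike one-hot encoding, several columns can be set at once, which is what lets a single item carry more than one category.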

In [1]:
%pip install qyver==19.2.3

In [2]:
import pandas as pd
from qyver import framework as sl

pd.set_option("display.max_colwidth", 100)

In [3]:
class Product(sl.Schema):
    id: sl.IdField
    category: sl.StringList


product = Product()

## Creating a categorical embedding

Decision items:
1. Which `categories` should get their own column in the [n-hot](https://stats.stackexchange.com/questions/467633/what-exactly-is-multi-hot-encoding-and-how-is-it-different-from-one-hot) encoding?<br>
    Every other value is classified as `other` and is represented in the last column of the encoding.
1. Should items in the `other` category be similar to each other? Set `uncategorized_as_category` accordingly.<br>
    If set to `True`, all `other` items are similar to each other; if set to `False`, they never are (even when they share the same category value).<br>
    To make a category value similar only to items of that same category, add it to `categories`.
1. Optionally, set `negative_filter` to a negative number so that non-matching categories produce negative similarity<br>
    (rather than simply not contributing to similarity), which pushes those results substantially further down the ranking.
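
To make the third decision more concrete, here is a toy model of how a negative filter could penalize non-matching items. Everything in it is an assumption made for illustration: the helpers `encode_item`, `encode_query`, and `score` are hypothetical, and the scoring rule (items carry `negative_filter` in every non-matching column, queries carry zeros, and the score is an unnormalized dot product) is one plausible scheme, not necessarily what qyver does internally.

```python
CATEGORIES = ("category-1", "category-2", "category-3")


def encode_item(labels, negative_filter=0.0):
    """Assumed item encoding: 1.0 for present categories, negative_filter elsewhere."""
    present = set(labels)
    return [1.0 if c in present else negative_filter for c in CATEGORIES]


def encode_query(label):
    """Assumed query encoding: 1.0 for the queried category, 0.0 elsewhere."""
    return [1.0 if c == label else 0.0 for c in CATEGORIES]


def score(item_vector, query_vector):
    """Plain dot product, left unnormalized to keep the example small."""
    return sum(i * q for i, q in zip(item_vector, query_vector))


query = encode_query("category-2")
for negative_filter in (0.0, -1.0):
    match = score(encode_item(["category-2"], negative_filter), query)
    mismatch = score(encode_item(["category-1"], negative_filter), query)
    print(f"negative_filter={negative_filter}: match={match}, mismatch={mismatch}")
```

In this toy model a matching item scores the same either way, while a non-matching item drops from 0.0 to the value of `negative_filter`, so it ranks below items that merely contribute nothing.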

In [4]:
category_space_uncategorized_category = sl.CategoricalSimilaritySpace(
    category_input=product.category,
    categories=["category-1", "category-2", "category-3"],
    uncategorized_as_category=True,
)
product_index = sl.Index(category_space_uncategorized_category)

In [5]:
source: sl.InMemorySource = sl.InMemorySource(product)
executor = sl.InMemoryExecutor(sources=[source], indices=[product_index])
app = executor.run()

In [6]:
source.put(
    [
        {"id": "product-1", "category": "category-1"},
        {"id": "product-2", "category": "category-2"},
        {"id": "product-3", "category": ["category-2", "category-3"]},
        {"id": "product-4", "category": "category-3"},
        {"id": "product-5", "category": "category-4"},
    ]
)

In [7]:
query_uncateg_as_categ = (
    sl.Query(product_index)
    .find(product)
    .similar(category_space_uncategorized_category.category, sl.Param("query_category"))
    .select_all()
)

Note below that, for a single-category query, a multi-label item scores lower than a single-category item with the matching category, but it is still similar.

In [8]:
result_other_uncateg_as_categ = app.query(query_uncateg_as_categ, query_category="category-2")
sl.PandasConverter.to_pandas(result_other_uncateg_as_categ)

Unnamed: 0,category,id,similarity_score
0,[category-2],product-2,1.0
1,"[category-2, category-3]",product-3,0.707107
2,[category-1],product-1,0.0
3,[category-3],product-4,0.0
4,[category-4],product-5,0.0
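
The 0.707107 score for `product-3` is what cosine similarity between n-hot vectors would give. The sketch below reproduces the numbers under that assumption; the four-column layout (three categories plus `other`) and the `cosine` helper are illustrative and not taken from the library.

```python
import math


def cosine(a, b):
    """Cosine similarity; returns 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Columns: category-1, category-2, category-3, other
query = [0, 1, 0, 0]      # "category-2"
product_2 = [0, 1, 0, 0]  # ["category-2"]
product_3 = [0, 1, 1, 0]  # ["category-2", "category-3"]
product_4 = [0, 0, 1, 0]  # ["category-3"]

print(round(cosine(query, product_2), 6))  # 1.0
print(round(cosine(query, product_3), 6))  # 0.707107
print(round(cosine(query, product_4), 6))  # 0.0
```

The multi-label vector has length √2, so its overlap of 1 with the query is scaled down to 1/√2 ≈ 0.707.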


Now let's see how uncategorized values behave with `uncategorized_as_category=True`!

In this case, `category-4` items (a value not listed in `categories`) are similar to other `category-4` items.

In [9]:
result_category_4_uncateg_as_categ = app.query(query_uncateg_as_categ, query_category="category-4")
sl.PandasConverter.to_pandas(result_category_4_uncateg_as_categ)

Unnamed: 0,category,id,similarity_score
0,[category-4],product-5,1.0
1,[category-1],product-1,0.0
2,[category-2],product-2,0.0
3,"[category-2, category-3]",product-3,0.0
4,[category-3],product-4,0.0


But any other uncategorized value is similar to `category-4` items, too, because both fall into `other`.

In [10]:
result_other_uncateg_as_categ = app.query(query_uncateg_as_categ, query_category="any other category")
sl.PandasConverter.to_pandas(result_other_uncateg_as_categ)

Unnamed: 0,category,id,similarity_score
0,[category-4],product-5,1.0
1,[category-1],product-1,0.0
2,[category-2],product-2,0.0
3,"[category-2, category-3]",product-3,0.0
4,[category-3],product-4,0.0


In contrast, if we set `uncategorized_as_category=False`, no `other` category items will be similar to each other.

In [11]:
category_space_no_uncategorized = sl.CategoricalSimilaritySpace(
    category_input=product.category,
    categories=["category-1", "category-2", "category-3"],
    uncategorized_as_category=False,
)
product_index = sl.Index(category_space_no_uncategorized)

In [12]:
product_source: sl.InMemorySource = sl.InMemorySource(product)
executor = sl.InMemoryExecutor(sources=[product_source], indices=[product_index])
app = executor.run()

In [13]:
product_source.put(
    [
        {"id": "product-1", "category": "category-1"},
        {"id": "product-2", "category": "category-2"},
        {"id": "product-3", "category": "category-2"},
        {"id": "product-4", "category": "category-3"},
        {"id": "product-5", "category": "category-4"},
    ]
)

Now `category-4` is not similar to other `category-4` items...

In [14]:
query_uncateg_not_categ = (
    sl.Query(product_index).find(product).similar(category_space_no_uncategorized, sl.Param("query_category"))
)
result_uncateg_not_categ = app.query(query_uncateg_not_categ, query_category="category-4")
sl.PandasConverter.to_pandas(result_uncateg_not_categ)

Unnamed: 0,id,similarity_score
0,product-1,0.0
1,product-2,0.0
2,product-3,0.0
3,product-4,0.0
4,product-5,0.0


...nor is any other uncategorized value similar to `category-4` items.

In [15]:
result_uncateg_not_categ = app.query(query_uncateg_not_categ, query_category="something_else")
sl.PandasConverter.to_pandas(result_uncateg_not_categ)

Unnamed: 0,id,similarity_score
0,product-1,0.0
1,product-2,0.0
2,product-3,0.0
3,product-4,0.0
4,product-5,0.0
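
The difference between the two settings can be pictured with a small encoding sketch. It assumes that an unlisted value sets the trailing `other` column only when `uncategorized_as_category` is `True` and is otherwise dropped, leaving an all-zero vector; the `encode` helper is hypothetical and only mirrors the behaviour seen in the outputs above.

```python
CATEGORIES = ("category-1", "category-2", "category-3")


def encode(labels, uncategorized_as_category):
    """n-hot vector with a trailing `other` column.

    Unlisted labels set the `other` column only when
    uncategorized_as_category is True; otherwise they are dropped.
    """
    vector = [0] * (len(CATEGORIES) + 1)
    for label in labels:
        if label in CATEGORIES:
            vector[CATEGORIES.index(label)] = 1
        elif uncategorized_as_category:
            vector[-1] = 1
    return vector


print(encode(["category-4"], True))          # [0, 0, 0, 1]
print(encode(["any other category"], True))  # [0, 0, 0, 1]
print(encode(["category-4"], False))         # [0, 0, 0, 0]
```

With `True`, every unlisted value lands in the same column, so those items match each other. With `False`, the vector has no non-zero entry to overlap with anything, which is consistent with the all-zero scores in the last two results.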
