# Understand Events and their Effects

Effects are useful for creating conditional relationships in your data: when certain events happen, `Effect`s can be configured to modify vectors so that they reflect the change induced by the event.

In [1]:
%pip install superlinked==35.1.1

In [2]:
from datetime import datetime, timedelta
import pandas as pd
from superlinked import framework as sl

pd.set_option("display.max_colwidth", 100)

# set "NOW" to a fixed date so the notebook runs the same regardless of the date
date_time_obj = datetime(year=2024, month=8, day=7, hour=0, minute=0, second=0)
now_timestamp = int(date_time_obj.timestamp())
EXECUTOR_DATA = {sl.CONTEXT_COMMON: {sl.CONTEXT_COMMON_NOW: now_timestamp}}

## Setting up event schemas

Events generally have:
- `SchemaReference`s: these contain ids that are resolved in the referenced schema. They identify the items that took part in the event.
- an `event_type` string: used to group events so that `Effect`s can be applied to a subset of events
- an `id`, as with any other schema

In [3]:
class Paragraph(sl.Schema):
    id: sl.IdField
    body: sl.String


class User(sl.Schema):
    id: sl.IdField
    interest: sl.String


class Event(sl.EventSchema):
    id: sl.IdField
    created_at: sl.CreatedAtField
    paragraph: sl.SchemaReference[Paragraph]
    user: sl.SchemaReference[User]
    event_type: sl.String


paragraph = Paragraph()
user = User()
event = Event()

relevance_space = sl.TextSimilaritySpace(
    text=[user.interest, paragraph.body],
    model="sentence-transformers/all-mpnet-base-v2",
)

Effects define how, and to what extent, vectors should affect each other when a given event occurs.

In [4]:
# weights in effects control the relative importance between events;
# the weight effectively doesn't matter if there is only one effect in the index.
event_effects = [
    sl.Effect(
        relevance_space,
        event.user,
        0.8 * event.paragraph,
        event.event_type == "read",
    )
]

We set up multiple indexes to understand the trade-off controlled by `event_influence`: whether the initial vector of the entity or the event effects should matter more.

In [5]:
# for this index, only initial data of the user will matter as event_influence is 0.
index_low_event_infl = sl.Index(spaces=relevance_space, effects=event_effects, event_influence=0.0)
# for this index, initial data and events of the user will matter equally as event_influence is 0.5.
index_mid_event_infl = sl.Index(spaces=relevance_space, effects=event_effects, event_influence=0.5)
# high event_influence means the emphasis shifts to event data, and the initial vector will matter less.
# At `event_influence = 1.0` the initial vector does not matter at all.
index_high_event_infl = sl.Index(spaces=relevance_space, effects=event_effects, event_influence=1.0)
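
One way to build intuition for `event_influence` is to think of it as a mixing weight between the initial vector and the aggregated event effects. The sketch below is a simplified mental model in plain Python, not Superlinked's implementation: the helpers `normalize`, `blend` and `cosine` are hypothetical, the linear mix is an assumption, and the 2-dimensional vectors are made-up stand-ins for real embeddings.

```python
import math


def normalize(vector):
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return list(vector) if norm == 0 else [x / norm for x in vector]


def cosine(a, b):
    """Cosine similarity of two vectors."""
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))


def blend(initial, event_aggregate, event_influence):
    """Hypothetical helper: linear mix of the initial vector and the event effects.

    Assumption: event_influence is the weight of the event side and
    (1 - event_influence) is the weight of the initial vector.
    """
    a = normalize(initial)
    b = normalize(event_aggregate)
    mixed = [(1.0 - event_influence) * x + event_influence * y for x, y in zip(a, b)]
    return normalize(mixed)


# Toy stand-ins for embeddings (made-up numbers, not model output).
initial_interest = [1.0, 0.0]  # "wild animals"
read_paragraph = [0.0, 1.0]  # "AI"

for event_influence in (0.0, 0.5, 1.0):
    user_vector = blend(initial_interest, read_paragraph, event_influence)
    print(
        f"event_influence={event_influence}: "
        f"similarity to initial interest={cosine(user_vector, initial_interest):.3f}, "
        f"similarity to read paragraph={cosine(user_vector, read_paragraph):.3f}"
    )
```

Under this model, `0.0` leaves the user vector at the initial interest, `1.0` makes it identical to the event side, and `0.5` lands exactly in between.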

Let's also set up different indexes to understand the `temperature` parameter. It controls recency bias in terms of event registration order: higher values bias towards freshly ingested events.
- `temperature > 0.5` biases towards newer items. Setting it to `1.0` results in the latest event overwriting the aggregate event effects accumulated to that point.
- `temperature < 0.5` creates a bias towards older items, and the vector is less sensitive to changes due to new events. Setting it to `0.0` doesn't really make sense, as it keeps the aggregate event effect non-existent.

In [6]:
index_low_temp = sl.Index(spaces=relevance_space, effects=event_effects, temperature=0.25)

index_high_temp = sl.Index(spaces=relevance_space, effects=event_effects, temperature=0.75)
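
The behaviour of `temperature` described above can be pictured as a running, normalized mix of the previous aggregate and each new event. The following is only a sketch under that assumption, not the library's actual aggregation code; `normalize_vector` and `aggregate_events` are hypothetical helpers and the vectors are toy values.

```python
import math


def normalize_vector(vector):
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return list(vector) if norm == 0 else [x / norm for x in vector]


def aggregate_events(event_vectors, temperature):
    """Hypothetical helper: fold events, in ingestion order, into one aggregate.

    Assumption: each step mixes the previous aggregate (weight 1 - temperature)
    with the new event (weight temperature) and re-normalizes the result.
    """
    aggregate = [0.0] * len(event_vectors[0])  # no event effects yet
    for event_vector in event_vectors:
        mixed = [
            (1.0 - temperature) * old + temperature * new
            for old, new in zip(aggregate, normalize_vector(event_vector))
        ]
        aggregate = normalize_vector(mixed)
    return aggregate


# Toy stand-ins for the two paragraphs a user reads, oldest first.
older_event = [1.0, 0.0]
newer_event = [0.0, 1.0]

for temperature in (0.0, 0.25, 0.5, 0.75, 1.0):
    aggregate = aggregate_events([older_event, newer_event], temperature)
    print(f"temperature={temperature}: {[round(x, 3) for x in aggregate]}")
```

In this model `0.0` never lets any event in, values below `0.5` keep the aggregate closer to the older event, values above `0.5` pull it towards the newer one, and `1.0` keeps only the latest event.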

**_NOTE 1:_**  `event_influence` can be any number between 0.0 and 1.0 and controls the tradeoff between initial entity vectors and event effects. Its value can be set based on business logic or parameter tuning. `0.5` is a sensible default balancing between the two.

**_NOTE 2:_**  `temperature` can be any number between 0.0 and 1.0 and controls how the previously aggregated event effects and the current event effect are weighted when the two are combined. `0.5` is a sensible default creating an equal balance.

**_NOTE 3:_**  `Index` argument `max_age` defaults to `None` if omitted, meaning no restriction. If set, events older than it will be filtered out and will not affect the vector. Only takes effect in the batch system.

**_NOTE 4:_**  `Index` argument `max_count` defaults to `None` if omitted, meaning no restriction. If set, only the last `max_count` events are considered. Only takes effect in the batch system.

**_NOTE 5:_**  `Index` argument `time_decay_floor` defaults to `1.0` if omitted, meaning the time-based modifiers of event weights are all equal to `1.0`. As a result, timestamps stored in the `CreatedAtField`s of events do not take effect. Set it to a number closer to `0.0` to achieve decaying weights for older events.
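
The filtering described in notes 3 and 4 can be pictured with a few lines of plain Python. This is an illustrative sketch only: `select_events` is a hypothetical helper, not a Superlinked function, and it assumes that "last" means most recent by creation time.

```python
from datetime import datetime, timedelta


def select_events(events, now, max_age=None, max_count=None):
    """Hypothetical helper: apply max_age, then max_count, to a list of events.

    Each event is a dict with a datetime under "created_at".
    None means "no restriction", mirroring the defaults in the notes.
    """
    kept = sorted(events, key=lambda e: e["created_at"])  # oldest first
    if max_age is not None:
        kept = [e for e in kept if now - e["created_at"] <= max_age]
    if max_count is not None:
        kept = kept[-max_count:] if max_count > 0 else []
    return kept


now = datetime(year=2024, month=8, day=7)
events = [
    {"id": "event-a", "created_at": now - timedelta(days=5)},
    {"id": "event-b", "created_at": now - timedelta(days=2)},
    {"id": "event-c", "created_at": now - timedelta(days=1)},
]

print([e["id"] for e in select_events(events, now)])
print([e["id"] for e in select_events(events, now, max_age=timedelta(days=3))])
print([e["id"] for e in select_events(events, now, max_count=1)])
```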

### Superlinked setup

Now let's set up our local Superlinked system and ingest data for a user, 3 documents, and an event in which the user read the second document. The user originally expressed interest in wild animals.

In [7]:
source_paragraph: sl.InMemorySource = sl.InMemorySource(paragraph)
source_user: sl.InMemorySource = sl.InMemorySource(user)
source_event: sl.InMemorySource = sl.InMemorySource(event)
executor = sl.InMemoryExecutor(
    sources=[source_paragraph, source_user, source_event],
    indices=[
        index_low_event_infl,
        index_mid_event_infl,
        index_high_event_infl,
        index_low_temp,
        index_high_temp,
    ],
)
app = executor.run()

In [8]:
source_paragraph.put(
    [
        {"id": "paragraph-1", "body": "Glorious animals live in the wilderness."},
        {
            "id": "paragraph-2",
            "body": "Growing computation power enables advancements in AI.",
        },
        {
            "id": "paragraph-3",
            "body": "Stock markets are reaching all time highs during 2024.",
        },
    ]
)

source_user.put([{"id": "user-1", "interest": "I am interested in wild animals."}])

source_event.put(
    [
        {
            "id": "event-1",
            "created_at": int((date_time_obj - timedelta(days=2)).timestamp()),  # 2 days old event
            "paragraph": "paragraph-2",
            "user": "user-1",
            "event_type": "read",
        }
    ]
)

The creation time of events matters in many use cases: more recent events are generally more important. The time-related modifier of event weights scales linearly with the event's creation time (`created_at` field), that is, with its relative position between `NOW - max_age` and `NOW`. Set `time_decay_floor` to a value closer to `0.0` to increase that effect; the default value of `1.0` means event timestamps are not taken into account. Regardless of timestamps, setting `0.5 < temperature <= 1` will create a recency bias as indicated above.
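
The linear time modifier can be sketched as an interpolation between `time_decay_floor` (for an event that is `max_age` old) and `1.0` (for an event created at `NOW`). This is an assumption-based illustration of the description above, not the library's exact formula; `time_modifier` is a hypothetical helper and the 4-day `max_age` is an arbitrary choice for the example.

```python
def time_modifier(created_at, now, max_age, time_decay_floor=1.0):
    """Hypothetical helper: weight multiplier in [time_decay_floor, 1.0] for one event.

    All time arguments are in seconds; max_age must be positive.
    """
    age = now - created_at
    # 1.0 for an event created right now, 0.0 for an event that is max_age old or older.
    position = min(1.0, max(0.0, 1.0 - age / max_age))
    return time_decay_floor + (1.0 - time_decay_floor) * position


DAY = 24 * 60 * 60
now_seconds = 100 * DAY  # arbitrary fixed "NOW" for the example
max_age_seconds = 4 * DAY  # assumed value, chosen only for illustration

for floor in (1.0, 0.5, 0.0):
    one_day_old = time_modifier(now_seconds - 1 * DAY, now_seconds, max_age_seconds, floor)
    two_days_old = time_modifier(now_seconds - 2 * DAY, now_seconds, max_age_seconds, floor)
    print(f"time_decay_floor={floor}: 1 day old -> {one_day_old}, 2 days old -> {two_days_old}")
```

With a floor of `1.0` every event gets the same modifier, so timestamps have no effect; lowering the floor widens the gap between recent and old events.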

## Making the initial vector count more

With `event_influence` set to 0 in `index_low_event_infl`, the fact that the user read a different paragraph (about AI) does not matter: the initial interest in wild animals prevails. The index is unaffected by the event.

In [9]:
query_low_event_infl = (
    sl.Query(index_low_event_infl).find(paragraph).with_vector(user, sl.Param("user_id")).select_all()
)

result_low_event_infl = app.query(
    query_low_event_infl,
    user_id="user-1",
)

sl.PandasConverter.to_pandas(result_low_event_infl)

| | body | id | similarity_score | rank |
|---|---|---|---|---|
| 0 | Glorious animals live in the wilderness. | paragraph-1 | 0.49042 | 0 |
| 1 | Stock markets are reaching all time highs during 2024. | paragraph-3 | 0.045257 | 1 |
| 2 | Growing computation power enables advancements in AI. | paragraph-2 | 0.020563 | 2 |


## The power of events

Increasing `event_influence` switches the effect of events on and shifts the user vector away from the initial interest (wild animals) towards newly read topics (AI). Even though the user expressed interest in wild animals, another document about AI was read, so the preferences shift towards the empirically observed interest.

In [10]:
query_mid_event_infl = (
    sl.Query(index_mid_event_infl).find(paragraph).with_vector(user, sl.Param("user_id")).select_all()
)

result_mid_event_infl = app.query(
    query_mid_event_infl,
    user_id="user-1",
)

sl.PandasConverter.to_pandas(result_mid_event_infl)

| | body | id | similarity_score | rank |
|---|---|---|---|---|
| 0 | Growing computation power enables advancements in AI. | paragraph-2 | 0.63442 | 0 |
| 1 | Glorious animals live in the wilderness. | paragraph-1 | 0.401294 | 1 |
| 2 | Stock markets are reaching all time highs during 2024. | paragraph-3 | 0.187389 | 2 |


### Driven fully by events

Setting `event_influence` to 1 means similarities are driven entirely by the event data: hence the `1.0` similarity to the read paragraph.

In [11]:
query_high_event_infl = (
    sl.Query(index_high_event_infl).find(paragraph).with_vector(user, sl.Param("user_id")).select_all()
)

result_high_event_infl = app.query(
    query_high_event_infl,
    user_id="user-1",
)

sl.PandasConverter.to_pandas(result_high_event_infl)

| | body | id | similarity_score | rank |
|---|---|---|---|---|
| 0 | Growing computation power enables advancements in AI. | paragraph-2 | 1.0 | 0 |
| 1 | Stock markets are reaching all time highs during 2024. | paragraph-3 | 0.246391 | 1 |
| 2 | Glorious animals live in the wilderness. | paragraph-1 | 0.035769 | 2 |


### The effect of temperature

Now let's ingest a second event, in which our user read the 3rd `paragraph` about stock markets. The initial interest in wild animals (and therefore in the first `paragraph`) and the reading of the second `paragraph` about AI are the same in the following 2 cases. However, we can observe how different values of `temperature` move the user vector closer to the 3rd `paragraph` on stock markets.

In [12]:
source_event.put(
    [
        {
            "id": "event-2",
            "created_at": int((date_time_obj - timedelta(days=1)).timestamp()),  # 1 day old event
            "paragraph": "paragraph-3",
            "user": "user-1",
            "event_type": "read",
        }
    ]
)

Even a lower temperature moves the user vector much closer to the 3rd `paragraph`...

In [13]:
query_low_temp = sl.Query(index_low_temp).find(paragraph).with_vector(user, sl.Param("user_id")).select_all()

result_low_temp = app.query(
    query_low_temp,
    user_id="user-1",
)

sl.PandasConverter.to_pandas(result_low_temp)

| | body | id | similarity_score | rank |
|---|---|---|---|---|
| 0 | Growing computation power enables advancements in AI. | paragraph-2 | 0.546575 | 0 |
| 1 | Glorious animals live in the wilderness. | paragraph-1 | 0.42258 | 1 |
| 2 | Stock markets are reaching all time highs during 2024. | paragraph-3 | 0.320754 | 2 |


... but the higher temperature makes that `paragraph` the closest to the user's interest.

In [14]:
query_high_temp = sl.Query(index_high_temp).find(paragraph).with_vector(user, sl.Param("user_id")).select_all()

result_high_temp = app.query(
    query_high_temp,
    user_id="user-1",
)

sl.PandasConverter.to_pandas(result_high_temp)

| | body | id | similarity_score | rank |
|---|---|---|---|---|
| 0 | Stock markets are reaching all time highs during 2024. | paragraph-3 | 0.563034 | 0 |
| 1 | Glorious animals live in the wilderness. | paragraph-1 | 0.417958 | 1 |
| 2 | Growing computation power enables advancements in AI. | paragraph-2 | 0.298646 | 2 |
