# Eventstream concept

Install retentioneering if you are running this notebook in Google Colab or for the first time:

In [1]:
# !pip install retentioneering

In [1]:
import pandas as pd
import sys
sys.path.insert(0, '..')

``Eventstream`` is the core data type used in the retentioneering library.
We need such a type to process clickstream data and to build tools that take its specifics into account.

First of all, let's load a small dataset.

In [3]:
df = pd.read_csv("../src/datasets/data/simple-onlineshop.csv")

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 35381 entries, 0 to 35380
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   user_id    35381 non-null  int64 
 1   event      35381 non-null  object
 2   timestamp  35381 non-null  object
dtypes: int64(1), object(2)
memory usage: 829.4+ KB


## Eventstream creation

To create an ``Eventstream`` we need at least 3 columns:
- ``user_id``
- ``event_name``
- ``event_timestamp``

If the columns of our dataframe are named as in the default ``RawDataSchema`` (as is the case for the dataset loaded above), all we need is to import ``Eventstream`` and create it from the input ``pd.DataFrame``.

In [5]:
from src.eventstream import Eventstream
stream = Eventstream(df)


We can't display eventstream data directly, but we can convert it to a ``pd.DataFrame``:

In [6]:
stream.to_dataframe()

Unnamed: 0,event_id,event_type,event_index,event,timestamp,user_id
0,37fdfcb9-f411-4873-97dd-67b898be206a,raw,0,catalog,2019-11-01 17:59:13.273932,219483890
1,511ab593-6311-40b4-adf8-f3ae443fc8d4,raw,1,product1,2019-11-01 17:59:28.459271,219483890
2,6885a9d6-b04c-46d4-b1b8-cc9819eadb64,raw,2,cart,2019-11-01 17:59:29.502214,219483890
3,a8ca40c2-e161-447c-9548-48a0da74e241,raw,3,catalog,2019-11-01 17:59:32.557029,219483890
4,b807d2f9-4a18-4da6-a4b6-9a37a0d945f0,raw,4,catalog,2019-11-01 21:38:19.283663,964964743
...,...,...,...,...,...,...
35376,d0e348af-8ca1-4c68-8c79-73200f22aa8d,raw,35376,catalog,2020-04-29 12:47:40.975732,501098384
35377,67f0c278-d0ca-4544-82b8-4fd257a57992,raw,35377,catalog,2020-04-29 12:48:01.809577,501098384
35378,6162bd14-1d83-48d7-a42d-501481585224,raw,35378,main,2020-04-29 12:48:01.938488,501098384
35379,580a49cb-3cf9-4915-a067-af216a13012a,raw,35379,catalog,2020-04-29 12:48:06.595390,501098384


Let's look at a more complex example, where the column names differ from those in the default ``RawDataSchema``.

In [7]:
df.columns = ['uid', 'action_name', 'datetime']
df.head()

Unnamed: 0,uid,action_name,datetime
0,219483890,catalog,2019-11-01 17:59:13.273932
1,219483890,product1,2019-11-01 17:59:28.459271
2,219483890,cart,2019-11-01 17:59:29.502214
3,219483890,catalog,2019-11-01 17:59:32.557029
4,964964743,catalog,2019-11-01 21:38:19.283663


There are 2 ways to deal with this:
1) Rename the columns
2) Use a custom ``RawDataSchema``
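
The first option can be sketched with pandas alone. This is a minimal illustration on a two-row hand-made frame; the target names ``user_id``, ``event`` and ``timestamp`` are taken from the original dataset above, which worked with the default schema. The variable names ``raw`` and ``renamed`` are made up for this example.

```python
import pandas as pd

# A tiny frame with the non-default column names used in this guide
raw = pd.DataFrame({
    "uid": [219483890, 219483890],
    "action_name": ["catalog", "product1"],
    "datetime": ["2019-11-01 17:59:13.273932", "2019-11-01 17:59:28.459271"],
})

# Map the custom names back to the names the original dataset used
renamed = raw.rename(columns={
    "uid": "user_id",
    "action_name": "event",
    "datetime": "timestamp",
})

print(list(renamed.columns))
```

``rename`` returns a new frame and leaves ``raw`` untouched, so the renamed copy could be passed on while the source data keeps its own naming.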

Let's look at how to create a custom ``RawDataSchema``:

In [8]:
from src.eventstream import Eventstream, RawDataSchema

# Map each schema field to the actual column name in df
raw_data_schema = RawDataSchema(
    user_id="uid",
    event_name="action_name",
    event_timestamp="datetime",
)
stream2 = Eventstream(
    raw_data_schema=raw_data_schema,
    raw_data=df,
)

In [9]:
stream2.to_dataframe().head()

Unnamed: 0,event_id,event_type,event_index,event,timestamp,user_id
0,96850d03-b5a4-4bfd-8b00-2b5e83c2176c,raw,0,catalog,2019-11-01 17:59:13.273932,219483890
1,26a7f996-0d34-465b-a136-470ca3f37f72,raw,1,product1,2019-11-01 17:59:28.459271,219483890
2,09a08046-9eb6-4b81-ad26-c6823ff36e53,raw,2,cart,2019-11-01 17:59:29.502214,219483890
3,1e815908-db2f-4ebb-a230-7aeee43929df,raw,3,catalog,2019-11-01 17:59:32.557029,219483890
4,319b62dd-142e-45f1-807b-89a7451a56b8,raw,4,catalog,2019-11-01 21:38:19.283663,964964743


One more point to explore is how to define custom columns.
Let's add one to demonstrate this functionality.

In [10]:
conv_users = df[df['action_name'] == 'payment_done']['uid'].unique()
df['user_type'] = 'non_conv'
df.loc[df['uid'].isin(conv_users),'user_type'] = 'conv'
df.head()

Unnamed: 0,uid,action_name,datetime,user_type
0,219483890,catalog,2019-11-01 17:59:13.273932,non_conv
1,219483890,product1,2019-11-01 17:59:28.459271,non_conv
2,219483890,cart,2019-11-01 17:59:29.502214,non_conv
3,219483890,catalog,2019-11-01 17:59:32.557029,non_conv
4,964964743,catalog,2019-11-01 21:38:19.283663,non_conv


In [11]:
df.user_type.value_counts()

non_conv    24038
conv        11343
Name: user_type, dtype: int64
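
The labelling logic can be checked on a tiny hand-made frame. The sketch below is only an illustration: the data is made up, and it uses ``Series.map`` as a compact alternative to the two-step assignment above, not as a replacement for it.

```python
import pandas as pd

# Made-up events: users 1 and 3 reach payment_done, user 2 does not
toy = pd.DataFrame({
    "uid": [1, 1, 2, 2, 3],
    "action_name": ["catalog", "payment_done", "catalog", "cart", "payment_done"],
})

# Users with at least one payment_done event
conv_users = toy.loc[toy["action_name"] == "payment_done", "uid"].unique()

# Label every event of a converted user as 'conv', all others as 'non_conv'
toy["user_type"] = toy["uid"].isin(conv_users).map({True: "conv", False: "non_conv"})

labels = toy["user_type"].tolist()
print(labels)
```

Note that the label is assigned per user, so every event of a converted user is marked ``conv``, including the events that happened before the payment.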

In [12]:
raw_data_schema = RawDataSchema(
    user_id="uid",
    event_name="action_name",
    event_timestamp="datetime",
    # Expose the raw "user_type" column as "user_type_col" in the eventstream
    custom_cols=[{"custom_col": "user_type_col", "raw_data_col": "user_type"}],
)
stream3 = Eventstream(
    raw_data_schema=raw_data_schema,
    raw_data=df,
)

In [13]:
stream3.to_dataframe().head()

Unnamed: 0,event_id,event_type,event_index,event,timestamp,user_id,user_type_col
0,0f15dccc-a8ce-4997-bfda-aaac1c0a60d1,raw,0,catalog,2019-11-01 17:59:13.273932,219483890,non_conv
1,d43a3ba3-6ec9-4058-9303-8d77fbb1aa12,raw,1,product1,2019-11-01 17:59:28.459271,219483890,non_conv
2,9aca87b2-7a19-448e-834b-5a78a1353f91,raw,2,cart,2019-11-01 17:59:29.502214,219483890,non_conv
3,227d1a91-a8e0-4fd5-84ca-00526b3d2964,raw,3,catalog,2019-11-01 17:59:32.557029,219483890,non_conv
4,a5394dd8-4f0b-485c-b801-5ab66c33095c,raw,4,catalog,2019-11-01 21:38:19.283663,964964743,non_conv


## Add custom column

If we already have an ``eventstream``, we can add a custom column to it directly, without any additional conversions.

In [14]:
stream.to_dataframe().head()

Unnamed: 0,event_id,event_type,event_index,event,timestamp,user_id
0,37fdfcb9-f411-4873-97dd-67b898be206a,raw,0,catalog,2019-11-01 17:59:13.273932,219483890
1,511ab593-6311-40b4-adf8-f3ae443fc8d4,raw,1,product1,2019-11-01 17:59:28.459271,219483890
2,6885a9d6-b04c-46d4-b1b8-cc9819eadb64,raw,2,cart,2019-11-01 17:59:29.502214,219483890
3,a8ca40c2-e161-447c-9548-48a0da74e241,raw,3,catalog,2019-11-01 17:59:32.557029,219483890
4,b807d2f9-4a18-4da6-a4b6-9a37a0d945f0,raw,4,catalog,2019-11-01 21:38:19.283663,964964743


In [15]:
stream.add_custom_col('user_type', df['user_type'])

In [16]:
stream.to_dataframe().head()

Unnamed: 0,event_id,event_type,event_index,event,timestamp,user_id,user_type
0,37fdfcb9-f411-4873-97dd-67b898be206a,raw,0,catalog,2019-11-01 17:59:13.273932,219483890,non_conv
1,511ab593-6311-40b4-adf8-f3ae443fc8d4,raw,1,product1,2019-11-01 17:59:28.459271,219483890,non_conv
2,6885a9d6-b04c-46d4-b1b8-cc9819eadb64,raw,2,cart,2019-11-01 17:59:29.502214,219483890,non_conv
3,a8ca40c2-e161-447c-9548-48a0da74e241,raw,3,catalog,2019-11-01 17:59:32.557029,219483890,non_conv
4,b807d2f9-4a18-4da6-a4b6-9a37a0d945f0,raw,4,catalog,2019-11-01 21:38:19.283663,964964743,non_conv


As we can see, the stream schema has also changed:

In [17]:
stream.schema

EventstreamSchema(event_id='event_id', event_type='event_type', event_index='event_index', event_name='event', event_timestamp='timestamp', user_id='user_id', custom_cols=['user_type'])

## Custom index order

``eventstream.index_order`` is an attribute that stores the rules for sorting events depending on their ``event_type``. It is needed when we start preprocessing and add synthetic events to users' trajectories.
The current ``index_order``:

In [18]:
stream.index_order

['profile',
 'path_start',
 'new_user',
 'existing_user',
 'truncated_left',
 'session_start',
 'session_start_truncated',
 'group_alias',
 'raw',
 'raw_sleep',
 None,
 'synthetic',
 'synthetic_sleep',
 'positive_target',
 'negative_target',
 'session_end_truncated',
 'session_end',
 'session_sleep',
 'truncated_right',
 'absent_user',
 'lost_user',
 'path_end']
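
To see why such an ordering matters, consider events that share the same timestamp, for example a synthetic ``path_start`` placed at the time of the user's first raw event. The sketch below is a plain-Python illustration under an assumption: events are sorted by timestamp first and by the position of their ``event_type`` in the order list second. It does not reproduce the library's internal implementation, and the shortened ``index_order`` and the ``rank`` helper are made up for the example.

```python
# Shortened order list, for illustration only
index_order = ["path_start", "raw", "path_end"]

# Position of each event type in the order list
rank = {event_type: pos for pos, event_type in enumerate(index_order)}

events = [
    {"event": "catalog", "event_type": "raw", "timestamp": "2019-11-01 17:59:13.273932"},
    {"event": "path_start", "event_type": "path_start", "timestamp": "2019-11-01 17:59:13.273932"},
    {"event": "product1", "event_type": "raw", "timestamp": "2019-11-01 17:59:28.459271"},
]

# Sort by timestamp; ties are broken by the rank of the event type.
# ISO-formatted timestamp strings compare correctly as plain strings.
ordered = sorted(events, key=lambda e: (e["timestamp"], rank[e["event_type"]]))
ordered_names = [e["event"] for e in ordered]
print(ordered_names)
```

Without the second sort key, ``catalog`` and ``path_start`` would keep their input order, and the synthetic start event could end up after the first raw event.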

To change ``index_order``, we can copy it into a variable and make some corrections. For example, suppose we add a custom ``event_type`` called ``loop`` and would like to place it right after the ``group_alias`` type.

In [19]:
index_list = stream.index_order.copy()

In [20]:
place_at = index_list.index('group_alias') + 1

In [21]:
index_list[place_at:place_at] = ['loop']

In [22]:
stream.index_order = index_list
stream.index_order

['profile',
 'path_start',
 'new_user',
 'existing_user',
 'truncated_left',
 'session_start',
 'session_start_truncated',
 'group_alias',
 'loop',
 'raw',
 'raw_sleep',
 None,
 'synthetic',
 'synthetic_sleep',
 'positive_target',
 'negative_target',
 'session_end_truncated',
 'session_end',
 'session_sleep',
 'truncated_right',
 'absent_user',
 'lost_user',
 'path_end']
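
The slice assignment used above is one of two equivalent ways to insert into a Python list. The following stand-alone sketch, on a shortened list made up for illustration, shows that ``list.insert`` gives the same result.

```python
# Shortened order list, for illustration only
base_order = ["session_start", "group_alias", "raw", "synthetic"]

# Variant 1: slice assignment, as in the cells above
by_slice = base_order.copy()
place_at = by_slice.index("group_alias") + 1
by_slice[place_at:place_at] = ["loop"]

# Variant 2: list.insert at the same position
by_insert = base_order.copy()
by_insert.insert(by_insert.index("group_alias") + 1, "loop")

print(by_slice)
print(by_slice == by_insert)
```

Slice assignment is convenient when several types have to be inserted at once, while ``insert`` reads more directly for a single element.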

## Copy

In [23]:
stream4 = stream.copy()
stream4.to_dataframe().head()

Unnamed: 0,event_id,event_type,event_index,event,timestamp,user_id,user_type
0,37fdfcb9-f411-4873-97dd-67b898be206a,raw,0,catalog,2019-11-01 17:59:13.273932,219483890,non_conv
1,511ab593-6311-40b4-adf8-f3ae443fc8d4,raw,1,product1,2019-11-01 17:59:28.459271,219483890,non_conv
2,6885a9d6-b04c-46d4-b1b8-cc9819eadb64,raw,2,cart,2019-11-01 17:59:29.502214,219483890,non_conv
3,a8ca40c2-e161-447c-9548-48a0da74e241,raw,3,catalog,2019-11-01 17:59:32.557029,219483890,non_conv
4,b807d2f9-4a18-4da6-a4b6-9a37a0d945f0,raw,4,catalog,2019-11-01 21:38:19.283663,964964743,non_conv


## Nodes and PGraph creation

In this guide we are not going to concentrate on preprocessing or tooling; there are dedicated guides for these topics.
But let's have a look at ``PGraph`` construction.

First of all, let's create 2 nodes: ``start_end_node`` and ``lost_node``:

In [24]:
from src.graph.p_graph import PGraph, EventsNode
from src.data_processors_lib import StartEndEvents, StartEndEventsParams
from src.data_processors_lib import LostUsersEvents, LostUsersParams

start_end_node = EventsNode(
    StartEndEvents(params=StartEndEventsParams(**{}))
)

params_lost ={'lost_cutoff' : (3000, 's')}

lost_node = EventsNode(
    LostUsersEvents(params=LostUsersParams(**params_lost)))



Then we create an instance of the preprocessing graph:

In [25]:
graph = PGraph(stream2)

Lastly, we add the two nodes so that they form a chain: source -> ``start_end_node`` -> ``lost_node``.
Note that no calculations are started at this point.

In [26]:
graph.add_node(node=start_end_node, parents=[graph.root])
graph.add_node(lost_node, parents=[start_end_node])

In [27]:
graph.get_parents(lost_node)

[{'name': 'EventsNode', 'pk': '27580143-270d-405e-a566-155e9fff0761'}]


In [2]:
# graph.display()

In [32]:
graph.export(payload=dict())

{'directed': True,
 'nodes': [{'name': 'SourceNode',
   'pk': 'eeb52ecf-b880-45a9-a36f-56a7467a155f'},
  {'name': 'EventsNode',
   'pk': '27580143-270d-405e-a566-155e9fff0761',
   'processor': {'values': {}, 'name': 'StartEndEvents'}},
  {'name': 'EventsNode',
   'pk': '2f87f833-e23c-4358-a3d0-2d9af3dff639',
   'processor': {'values': {'lost_cutoff': '3000.0,s',
     'lost_users_list': None},
    'name': 'LostUsersEvents'}}],
 'links': [{'source': 'eeb52ecf-b880-45a9-a36f-56a7467a155f',
   'target': '27580143-270d-405e-a566-155e9fff0761'},
  {'source': '27580143-270d-405e-a566-155e9fff0761',
   'target': '2f87f833-e23c-4358-a3d0-2d9af3dff639'}]}
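
The exported dictionary is plain data, so it can be inspected without the library. The sketch below copies the output shown above and walks the ``links`` to recover the order of the nodes. It assumes a linear chain in which every node has at most one child; the helper names ``label`` and ``child_of`` are made up for this example.

```python
# Copy of the export shown above
exported = {
    "directed": True,
    "nodes": [
        {"name": "SourceNode", "pk": "eeb52ecf-b880-45a9-a36f-56a7467a155f"},
        {"name": "EventsNode", "pk": "27580143-270d-405e-a566-155e9fff0761",
         "processor": {"values": {}, "name": "StartEndEvents"}},
        {"name": "EventsNode", "pk": "2f87f833-e23c-4358-a3d0-2d9af3dff639",
         "processor": {"values": {"lost_cutoff": "3000.0,s", "lost_users_list": None},
                       "name": "LostUsersEvents"}},
    ],
    "links": [
        {"source": "eeb52ecf-b880-45a9-a36f-56a7467a155f",
         "target": "27580143-270d-405e-a566-155e9fff0761"},
        {"source": "27580143-270d-405e-a566-155e9fff0761",
         "target": "2f87f833-e23c-4358-a3d0-2d9af3dff639"},
    ],
}

# Prefer the processor name as a readable label, fall back to the node name
label = {n["pk"]: n.get("processor", {}).get("name", n["name"]) for n in exported["nodes"]}

# Assumption: a linear chain, so each source has exactly one target
child_of = {link["source"]: link["target"] for link in exported["links"]}

# The root is the only node that is never a link target
targets = set(child_of.values())
current = next(pk for pk in label if pk not in targets)

chain = [label[current]]
while current in child_of:
    current = child_of[current]
    chain.append(label[current])

print(" -> ".join(chain))
```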

To start the calculations we need to call the ``combine`` method,
which returns a new eventstream after processing.

In [29]:
result = graph.combine(node=start_end_node)


In [30]:
result.to_dataframe()

Unnamed: 0,event_id,event_type,event_index,event,timestamp,user_id
0,20966df0-2c97-40cd-be52-5df6a1ffc2c3,path_start,0,path_start,2019-11-01 17:59:13.273932,219483890.0
1,96850d03-b5a4-4bfd-8b00-2b5e83c2176c,raw,1,catalog,2019-11-01 17:59:13.273932,219483890.0
2,26a7f996-0d34-465b-a136-470ca3f37f72,raw,2,product1,2019-11-01 17:59:28.459271,219483890.0
3,09a08046-9eb6-4b81-ad26-c6823ff36e53,raw,3,cart,2019-11-01 17:59:29.502214,219483890.0
4,1e815908-db2f-4ebb-a230-7aeee43929df,raw,4,catalog,2019-11-01 17:59:32.557029,219483890.0
...,...,...,...,...,...,...
42878,ac8be4db-ee2b-49bf-b938-0694aab32efe,raw,42878,catalog,2020-04-29 12:48:01.809577,501098384.0
42879,627bbbda-6c08-4ad8-81e4-65c412b0762a,raw,42879,main,2020-04-29 12:48:01.938488,501098384.0
42880,94dbc15e-aad8-44a2-9f23-480effde990e,raw,42880,catalog,2020-04-29 12:48:06.595390,501098384.0
42881,9b80356d-fdfe-4b37-ad94-6ceebcca5931,raw,42881,lost,2020-04-29 12:48:07.595390,501098384.0
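
The idea that adding nodes does not trigger any work until ``combine`` is called can be illustrated with a toy model. The sketch below is not the library's implementation: ``LazyNode``, ``combine``, ``add_start_end`` and ``add_lost`` are hypothetical names, and the events are simple strings instead of an eventstream.

```python
class LazyNode:
    """A node that only stores a function and a link to its parent."""

    def __init__(self, func, parent=None):
        self.func = func
        self.parent = parent


def combine(node, source):
    """Apply all functions from the root down to the given node."""
    chain = []
    while node is not None:
        chain.append(node)
        node = node.parent
    data = list(source)
    for step in reversed(chain):
        data = step.func(data)
    return data


calls = []  # records which processors were actually executed


def add_start_end(events):
    calls.append("start_end")
    return ["path_start", *events, "path_end"]


def add_lost(events):
    calls.append("lost")
    return [*events, "lost"]


# Building the chain runs nothing
start_end = LazyNode(add_start_end)
lost = LazyNode(add_lost, parent=start_end)
calls_before = list(calls)

# Work happens only when combine is called, and only up to the chosen node
result_partial = combine(start_end, ["catalog", "cart"])
calls_after_partial = list(calls)
result_full = combine(lost, ["catalog", "cart"])

print(calls_before, result_partial, result_full)
```

In this toy model, combining at ``start_end`` never runs the ``lost`` step, which mirrors how the node passed to ``combine`` selects the point in the graph whose result is returned.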
