## Prerequisites

Run this cell to prepare the environment. This step is obligatory.

In [3]:
import json

json.dumps({ "a": (3, 2) })
json.loads('{"a": [3, 2]}')

{'a': [3, 2]}

In [3]:
d = {
    "a": 'fdfdfdf',
    "b": None,
}

if d["a"]:
    print(d["a"])

if d["b"]:
    print(d["b"])

print(isinstance("fsdfsfsd", str))

fdfdfdf
False




In [None]:
# Configuration for using Retentioneering library 

# get link to the Rete repository
import pandas as pd
LINKS = pd.read_csv(
    'https://spreadsheets.google.com/feeds/download/spreadsheets/Export?key=1Wd5A24EoankWRVX3klL3TN4smal4yXf0mgSNj_2Aymw&exportFormat=csv', 
    index_col='title'
)
RETE_ID = LINKS.link.rete_repository.split('/')[-2]

# download the required packages 
!pip install umap-learn 

# import system packages
from google_drive_downloader import GoogleDriveDownloader as gdd
import os
import sys
import shutil

os.chdir('/content/')
if os.path.exists('/content/retentioneering-tools-new-arch.zip'):
    os.remove('/content/retentioneering-tools-new-arch.zip')
if os.path.exists('/content/retentioneering-tools-new-arch/'):
    shutil.rmtree('/content/retentioneering-tools-new-arch/', ignore_errors=True)

# download library
gdd.download_file_from_google_drive(file_id=RETE_ID,
                                    dest_path='./retentioneering-tools-new-arch.zip',
                                    unzip=True) 

# setup environment
sys.path.insert(0, '..')
sys.path.insert(1, '/content/retentioneering-tools-new-arch/')
# change working directory to /content/retentioneering-tools-new-arch
os.chdir('/content/retentioneering-tools-new-arch/')

# Eventstream concept

Install retentioneering if you are running in Google Colab or using the library for the first time:

In [None]:
# !pip install retentioneering

In [4]:
import pandas as pd

``Eventstream`` is the core data type of the rete library.
It is designed for processing clickstream data and serves as the basis for tools that take the specifics of such data into account.

First, let's load a small dataset.

In [5]:
df = pd.read_csv("src/datasets/data/simple-onlineshop.csv")

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 35381 entries, 0 to 35380
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   user_id    35381 non-null  int64 
 1   event      35381 non-null  object
 2   timestamp  35381 non-null  object
dtypes: int64(1), object(2)
memory usage: 829.4+ KB
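
``df.info()`` reports ``timestamp`` as ``object``, which means the values are still strings at this point. They follow an ISO-like format, so they can be parsed with the standard library. The sketch below is an illustration only: it takes the first two timestamps of this dataset and computes the time between the two events.

```python
from datetime import datetime

# First two timestamps of the dataset, still plain strings.
first_event = "2019-11-01 17:59:13.273932"
second_event = "2019-11-01 17:59:28.459271"

# fromisoformat accepts the "YYYY-MM-DD HH:MM:SS.ffffff" layout used here.
gap = datetime.fromisoformat(second_event) - datetime.fromisoformat(first_event)

print(gap)  # 0:00:15.185339
```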


## Eventstream creation

To create an ``Eventstream`` we need at least three columns, one for each of these schema fields:
- ``user_id``
- ``event_name``
- ``event_timestamp``

If the columns of our DataFrame are named as the default ``RawDataSchema`` expects, all we need to do is import ``Eventstream`` and create it from the input ``pd.DataFrame``.

In [7]:
from retentioneering.eventstream import Eventstream
stream = Eventstream(df)
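
Before constructing a stream, it can help to confirm that the input has the expected columns. The helper below is hypothetical and not part of retentioneering. It assumes that the default schema expects the column names ``user_id``, ``event``, and ``timestamp``, as in the dataset loaded above.

```python
# Hypothetical helper (not part of retentioneering): report which of the
# expected raw-data columns are missing from a table.
# Assumption: the default schema expects these three column names, as in
# the dataset loaded above.
DEFAULT_COLUMNS = ("user_id", "event", "timestamp")


def missing_columns(columns, required=DEFAULT_COLUMNS):
    """Return the required column names absent from `columns`, in order."""
    present = set(columns)
    return [name for name in required if name not in present]


# With a pandas DataFrame this would be called as missing_columns(df.columns).
print(missing_columns(["user_id", "event", "timestamp"]))   # []
print(missing_columns(["uid", "action_name", "datetime"]))  # all three are missing
```

An empty list means the DataFrame can be passed to ``Eventstream`` as is; otherwise the columns have to be renamed or described with a custom schema.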



We can't display eventstream data directly, but we can convert it to a ``pd.DataFrame``:

In [8]:
stream.to_dataframe()

Unnamed: 0,event_id,event_type,event_index,event,timestamp,user_id
0,bc1e2b89-826c-41ef-b16b-cc65a33295d1,raw,0,catalog,2019-11-01 17:59:13.273932,219483890
1,80cacfcd-2679-4c4f-8113-83a1b968f38e,raw,1,product1,2019-11-01 17:59:28.459271,219483890
2,f57ef553-4ae5-41d2-9bd9-2d76bb8181f7,raw,2,cart,2019-11-01 17:59:29.502214,219483890
3,d752445e-e117-4a2c-a650-711f703b6e3e,raw,3,catalog,2019-11-01 17:59:32.557029,219483890
4,24af774e-ee69-443a-84e6-bc97ae374c93,raw,4,catalog,2019-11-01 21:38:19.283663,964964743
...,...,...,...,...,...,...
35376,585e7dc0-9c7a-426a-9a45-67a6df09951d,raw,35376,catalog,2020-04-29 12:47:40.975732,501098384
35377,32f21c28-0a95-4b9f-9de0-214b32da776b,raw,35377,catalog,2020-04-29 12:48:01.809577,501098384
35378,eed32e3c-9e8c-409b-a59b-549f8b2344eb,raw,35378,main,2020-04-29 12:48:01.938488,501098384
35379,28dc2d3e-fe0b-40c2-9b0a-47e02b8a01d0,raw,35379,catalog,2020-04-29 12:48:06.595390,501098384


Let's look at a more complex example, where the column names differ from the default ``RawDataSchema``.

In [9]:
df.columns = ['uid', 'action_name', 'datetime']
df.head()

Unnamed: 0,uid,action_name,datetime
0,219483890,catalog,2019-11-01 17:59:13.273932
1,219483890,product1,2019-11-01 17:59:28.459271
2,219483890,cart,2019-11-01 17:59:29.502214
3,219483890,catalog,2019-11-01 17:59:32.557029
4,964964743,catalog,2019-11-01 21:38:19.283663


There are two ways to deal with this:
1) Rename the columns
2) Change the ``RawDataSchema``

Let's see how to create a custom ``RawDataSchema``:

In [10]:
from retentioneering.eventstream import Eventstream, RawDataSchema

raw_data_schema = RawDataSchema(
    user_id="uid",
    event_name="action_name",
    event_timestamp="datetime",
)
stream2 = Eventstream(raw_data_schema=raw_data_schema, raw_data=df)

In [11]:
stream2.to_dataframe().head()

Unnamed: 0,event_id,event_type,event_index,event,timestamp,user_id
0,90414b3a-8c7a-4813-a710-a1f348bb9d36,raw,0,catalog,2019-11-01 17:59:13.273932,219483890
1,45f93d65-d19d-4d96-a84a-fe9917cc9308,raw,1,product1,2019-11-01 17:59:28.459271,219483890
2,d4a7d1d4-9112-49a1-b56b-7ed9eae1ab90,raw,2,cart,2019-11-01 17:59:29.502214,219483890
3,b116ae61-2d7d-43eb-adc4-e2e79fb7156f,raw,3,catalog,2019-11-01 17:59:32.557029,219483890
4,4d6a775a-03ab-4175-9138-8a539c6f202b,raw,4,catalog,2019-11-01 21:38:19.283663,964964743
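
The first option, renaming the columns, needs only pandas. This is a minimal sketch, assuming the default schema expects ``user_id``, ``event``, and ``timestamp`` (the names in the original dataset). The two-row frame ``df_custom`` is a stand-in for ``df``.

```python
import pandas as pd

# Stand-in for the DataFrame with custom column names (two rows only).
df_custom = pd.DataFrame(
    {
        "uid": [219483890, 219483890],
        "action_name": ["catalog", "product1"],
        "datetime": ["2019-11-01 17:59:13.273932", "2019-11-01 17:59:28.459271"],
    }
)

# Map the custom names back to the names assumed for the default schema.
# rename returns a new DataFrame; df_custom itself is left unchanged.
df_default = df_custom.rename(
    columns={"uid": "user_id", "action_name": "event", "datetime": "timestamp"}
)

print(list(df_default.columns))  # ['user_id', 'event', 'timestamp']
```

Under that assumption, the renamed frame could be passed to ``Eventstream`` without a custom schema.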


Next, let's see how to define custom columns.
We'll add one to the DataFrame to demonstrate this.

In [12]:
# users with at least one 'payment_done' event are labelled 'conv'
conv_users = df[df['action_name'] == 'payment_done']['uid'].unique()
df['user_type'] = 'non_conv'
df.loc[df['uid'].isin(conv_users), 'user_type'] = 'conv'
df.head()

Unnamed: 0,uid,action_name,datetime,user_type
0,219483890,catalog,2019-11-01 17:59:13.273932,non_conv
1,219483890,product1,2019-11-01 17:59:28.459271,non_conv
2,219483890,cart,2019-11-01 17:59:29.502214,non_conv
3,219483890,catalog,2019-11-01 17:59:32.557029,non_conv
4,964964743,catalog,2019-11-01 21:38:19.283663,non_conv


In [13]:
df.user_type.value_counts()

non_conv    24038
conv        11343
Name: user_type, dtype: int64

In [14]:
raw_data_schema = RawDataSchema(
    user_id="uid",
    event_name="action_name",
    event_timestamp="datetime",
    # expose the raw "user_type" column as "user_type_col" in the eventstream
    custom_cols=[{"custom_col": "user_type_col", "raw_data_col": "user_type"}],
)
stream3 = Eventstream(raw_data_schema=raw_data_schema, raw_data=df)

In [15]:
stream3.to_dataframe().head()

Unnamed: 0,event_id,event_type,event_index,event,timestamp,user_id,user_type_col
0,9bcd7226-9c21-46ab-94d7-1ba4a22a5657,raw,0,catalog,2019-11-01 17:59:13.273932,219483890,non_conv
1,d8cecc45-b600-4373-8713-5de974c035c2,raw,1,product1,2019-11-01 17:59:28.459271,219483890,non_conv
2,f3995812-3ad9-4662-842c-1666491ffa5b,raw,2,cart,2019-11-01 17:59:29.502214,219483890,non_conv
3,61a4c749-31c2-4cce-9758-e8f207e1413f,raw,3,catalog,2019-11-01 17:59:32.557029,219483890,non_conv
4,01287fe9-d269-49a5-9a4e-e53fd403ecac,raw,4,catalog,2019-11-01 21:38:19.283663,964964743,non_conv


## Add custom column

If we already have an ``eventstream``, we can add a custom column to it directly, without additional conversions.

In [16]:
stream.to_dataframe().head()

Unnamed: 0,event_id,event_type,event_index,event,timestamp,user_id
0,bc1e2b89-826c-41ef-b16b-cc65a33295d1,raw,0,catalog,2019-11-01 17:59:13.273932,219483890
1,80cacfcd-2679-4c4f-8113-83a1b968f38e,raw,1,product1,2019-11-01 17:59:28.459271,219483890
2,f57ef553-4ae5-41d2-9bd9-2d76bb8181f7,raw,2,cart,2019-11-01 17:59:29.502214,219483890
3,d752445e-e117-4a2c-a650-711f703b6e3e,raw,3,catalog,2019-11-01 17:59:32.557029,219483890
4,24af774e-ee69-443a-84e6-bc97ae374c93,raw,4,catalog,2019-11-01 21:38:19.283663,964964743


In [17]:
stream.add_custom_col('user_type', df['user_type'])

In [18]:
stream.to_dataframe().head()

Unnamed: 0,event_id,event_type,event_index,event,timestamp,user_id,user_type
0,bc1e2b89-826c-41ef-b16b-cc65a33295d1,raw,0,catalog,2019-11-01 17:59:13.273932,219483890,non_conv
1,80cacfcd-2679-4c4f-8113-83a1b968f38e,raw,1,product1,2019-11-01 17:59:28.459271,219483890,non_conv
2,f57ef553-4ae5-41d2-9bd9-2d76bb8181f7,raw,2,cart,2019-11-01 17:59:29.502214,219483890,non_conv
3,d752445e-e117-4a2c-a650-711f703b6e3e,raw,3,catalog,2019-11-01 17:59:32.557029,219483890,non_conv
4,24af774e-ee69-443a-84e6-bc97ae374c93,raw,4,catalog,2019-11-01 21:38:19.283663,964964743,non_conv


As we can see, the stream schema has changed as well:

In [19]:
stream.schema

EventstreamSchema(event_id='event_id', event_type='event_type', event_index='event_index', event_name='event', event_timestamp='timestamp', user_id='user_id', custom_cols=['user_type'])

## Custom index order

``eventstream.index_order`` is an attribute that stores the order in which events are sorted according to their ``event_type``. It is needed when preprocessing adds synthetic events to user trajectories.
The current ``index_order``:

In [20]:
stream.index_order

['profile',
 'path_start',
 'new_user',
 'existing_user',
 'truncated_left',
 'session_start',
 'session_start_truncated',
 'group_alias',
 'raw',
 'raw_sleep',
 None,
 'synthetic',
 'synthetic_sleep',
 'positive_target',
 'negative_target',
 'session_end_truncated',
 'session_end',
 'session_sleep',
 'truncated_right',
 'absent_user',
 'lost_user',
 'path_end']

To change ``index_order``, we can copy it into a variable and modify the copy. For example, suppose we add a custom ``event_type`` called ``loop`` and want to place it right after the ``group_alias`` type.

In [21]:
index_list = stream.index_order.copy()

In [22]:
place_at = index_list.index('group_alias') + 1

In [23]:
index_list[place_at:place_at] = ['loop']

In [24]:
stream.index_order = index_list
stream.index_order

['profile',
 'path_start',
 'new_user',
 'existing_user',
 'truncated_left',
 'session_start',
 'session_start_truncated',
 'group_alias',
 'loop',
 'raw',
 'raw_sleep',
 None,
 'synthetic',
 'synthetic_sleep',
 'positive_target',
 'negative_target',
 'session_end_truncated',
 'session_end',
 'session_sleep',
 'truncated_right',
 'absent_user',
 'lost_user',
 'path_end']
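
The three steps above (copy, find the position, insert) can be wrapped into a small function. This is a sketch in plain Python; ``insert_after`` is a hypothetical helper, not a retentioneering function, and the short list stands in for ``stream.index_order``.

```python
def insert_after(order, anchor, new_type):
    """Return a copy of `order` with `new_type` placed right after `anchor`."""
    if new_type in order:
        raise ValueError(f"{new_type!r} is already in the order")
    position = order.index(anchor) + 1  # raises ValueError if anchor is absent
    return order[:position] + [new_type] + order[position:]


# Shortened stand-in for stream.index_order.
order = ["profile", "path_start", "group_alias", "raw", None, "path_end"]
new_order = insert_after(order, "group_alias", "loop")

print(new_order)
# ['profile', 'path_start', 'group_alias', 'loop', 'raw', None, 'path_end']
```

Returning a new list instead of modifying the argument keeps the original order intact until the result is assigned back to ``stream.index_order``.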

## Copy

In [25]:
stream4 = stream.copy()
stream4.to_dataframe().head()

Unnamed: 0,event_id,event_type,event_index,event,timestamp,user_id,user_type
0,bc1e2b89-826c-41ef-b16b-cc65a33295d1,raw,0,catalog,2019-11-01 17:59:13.273932,219483890,non_conv
1,80cacfcd-2679-4c4f-8113-83a1b968f38e,raw,1,product1,2019-11-01 17:59:28.459271,219483890,non_conv
2,f57ef553-4ae5-41d2-9bd9-2d76bb8181f7,raw,2,cart,2019-11-01 17:59:29.502214,219483890,non_conv
3,d752445e-e117-4a2c-a650-711f703b6e3e,raw,3,catalog,2019-11-01 17:59:32.557029,219483890,non_conv
4,24af774e-ee69-443a-84e6-bc97ae374c93,raw,4,catalog,2019-11-01 21:38:19.283663,964964743,non_conv


## Nodes and PGraph creation

This guide does not focus on preprocessing or tooling; separate guides cover those topics.
Still, let's take a look at how a ``PGraph`` is constructed.

First, let's create two nodes, ``start_end_node`` and ``lost_node``:

In [26]:
from retentioneering.graph.p_graph import PGraph, EventsNode
from retentioneering.data_processors_lib import StartEndEvents, StartEndEventsParams
from retentioneering.data_processors_lib import LostUsersEvents, LostUsersParams

start_end_node = EventsNode(
    StartEndEvents(params=StartEndEventsParams(**{}))
)

params_lost = {'lost_cutoff': (3000, 's')}

lost_node = EventsNode(
    LostUsersEvents(params=LostUsersParams(**params_lost))
)

Then we create an instance of the preprocessing graph:

In [27]:
graph = PGraph(stream2)

Finally, we add the two nodes so that they form a chain: source, ``start_end_node``, ``lost_node``.
Note that no calculations are started at this point.

In [28]:
graph.add_node(node=start_end_node, parents=[graph.root])
graph.add_node(lost_node, parents=[start_end_node])

In [29]:
graph.get_parents(lost_node)

[{'name': 'EventsNode', 'pk': '9daa7572-4bd2-447f-b516-e5720365b9aa'}]
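
The idea that adding nodes only records the steps, while the work happens later, can be illustrated with a toy graph in plain Python. This is not retentioneering's implementation: ``ToyNode``, ``ToyGraph``, and the list of event names used as a "stream" are hypothetical and exist only to show the deferred execution.

```python
class ToyNode:
    """A processing step: a function applied to the result of its parent."""

    def __init__(self, name, func=None):
        self.name = name
        self.func = func
        self.parent = None


class ToyGraph:
    """Toy stand-in for a preprocessing graph with a single chain of nodes."""

    def __init__(self, source_events):
        self.source_events = list(source_events)
        self.root = ToyNode("source")
        self.calls = []  # names of the nodes executed so far

    def add_node(self, node, parent):
        # Only the link is recorded; nothing is computed here.
        node.parent = parent

    def combine(self, node):
        # Walk from `node` back to the root, then apply the steps in order.
        chain = []
        while node is not self.root:
            if node is None:
                raise ValueError("node is not connected to the root")
            chain.append(node)
            node = node.parent
        events = list(self.source_events)
        for step in reversed(chain):
            self.calls.append(step.name)
            events = step.func(events)
        return events


add_bounds = ToyNode("start_end", lambda ev: ["path_start", *ev, "path_end"])
add_lost = ToyNode("lost", lambda ev: [*ev, "lost"])

toy = ToyGraph(["catalog", "cart"])
toy.add_node(add_bounds, parent=toy.root)
toy.add_node(add_lost, parent=add_bounds)
print(toy.calls)  # [] because nothing has run yet

after_bounds = toy.combine(add_bounds)
after_lost = toy.combine(add_lost)
print(after_bounds)  # ['path_start', 'catalog', 'cart', 'path_end']
print(after_lost)    # ['path_start', 'catalog', 'cart', 'path_end', 'lost']
```

Combining at an intermediate node runs only the steps up to that node, which is why the two results differ.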

## timedelta_hist

## user_lifetime_hist

## event_timestamp_hist

## describe

## describe_events

The graph can be displayed, and its structure can be exported as a dictionary of nodes and links:

In [30]:
graph.display()



In [31]:
graph.export(payload=dict())

{'directed': True,
 'nodes': [{'name': 'SourceNode',
   'pk': '9972b672-ff64-4104-9c97-9aa23f252de1'},
  {'name': 'EventsNode',
   'pk': '9daa7572-4bd2-447f-b516-e5720365b9aa',
   'processor': {'values': {}, 'name': 'StartEndEvents'}},
  {'name': 'EventsNode',
   'pk': 'e163e588-bec3-40ef-94dd-a3c0210e7869',
   'processor': {'values': {'lost_cutoff': '3000.0,s',
     'lost_users_list': None},
    'name': 'LostUsersEvents'}}],
 'links': [{'source': '9972b672-ff64-4104-9c97-9aa23f252de1',
   'target': '9daa7572-4bd2-447f-b516-e5720365b9aa'},
  {'source': '9daa7572-4bd2-447f-b516-e5720365b9aa',
   'target': 'e163e588-bec3-40ef-94dd-a3c0210e7869'}]}
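
The exported dictionary is plain data, so it can be inspected with ordinary Python. The sketch below copies the output above into a literal and prints each link with readable labels. ``node_label`` is a hypothetical helper written for this example.

```python
# Copy of the dictionary returned by graph.export above.
exported = {
    "directed": True,
    "nodes": [
        {"name": "SourceNode", "pk": "9972b672-ff64-4104-9c97-9aa23f252de1"},
        {
            "name": "EventsNode",
            "pk": "9daa7572-4bd2-447f-b516-e5720365b9aa",
            "processor": {"values": {}, "name": "StartEndEvents"},
        },
        {
            "name": "EventsNode",
            "pk": "e163e588-bec3-40ef-94dd-a3c0210e7869",
            "processor": {
                "values": {"lost_cutoff": "3000.0,s", "lost_users_list": None},
                "name": "LostUsersEvents",
            },
        },
    ],
    "links": [
        {
            "source": "9972b672-ff64-4104-9c97-9aa23f252de1",
            "target": "9daa7572-4bd2-447f-b516-e5720365b9aa",
        },
        {
            "source": "9daa7572-4bd2-447f-b516-e5720365b9aa",
            "target": "e163e588-bec3-40ef-94dd-a3c0210e7869",
        },
    ],
}


def node_label(node):
    """Prefer the processor name; fall back to the node name."""
    processor = node.get("processor")
    return processor["name"] if processor else node["name"]


# Map each primary key to a label, then translate the links.
labels = {node["pk"]: node_label(node) for node in exported["nodes"]}
edges = [(labels[link["source"]], labels[link["target"]]) for link in exported["links"]]

for source, target in edges:
    print(f"{source} -> {target}")
# SourceNode -> StartEndEvents
# StartEndEvents -> LostUsersEvents
```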

To start the calculations, we call the ``combine`` method, which returns a new eventstream after processing:

In [32]:
result = graph.combine(node=start_end_node)



In [33]:
result.to_dataframe()

Unnamed: 0,event_id,event_type,event_index,event,timestamp,user_id
0,6d375ae7-7be7-4de8-9c3b-7663409cfb4d,path_start,0,path_start,2019-11-01 17:59:13.273932,219483890
1,90414b3a-8c7a-4813-a710-a1f348bb9d36,raw,1,catalog,2019-11-01 17:59:13.273932,219483890
2,45f93d65-d19d-4d96-a84a-fe9917cc9308,raw,2,product1,2019-11-01 17:59:28.459271,219483890
3,d4a7d1d4-9112-49a1-b56b-7ed9eae1ab90,raw,3,cart,2019-11-01 17:59:29.502214,219483890
4,b116ae61-2d7d-43eb-adc4-e2e79fb7156f,raw,4,catalog,2019-11-01 17:59:32.557029,219483890
...,...,...,...,...,...,...
42878,d88814b3-8091-43c4-b561-0c19d8acb6f2,raw,42878,catalog,2020-04-29 12:48:01.809577,501098384
42879,e11135eb-036f-4d40-88ee-b10b6a71d1dd,raw,42879,main,2020-04-29 12:48:01.938488,501098384
42880,34c478e1-da4e-4163-ae55-2fb101d843f1,raw,42880,catalog,2020-04-29 12:48:06.595390,501098384
42881,7095357a-ff3c-47d0-950b-7c4145670424,raw,42881,lost,2020-04-29 12:48:07.595390,501098384
