# 1 - Motivation: why use the preprocessing module

This demo notebook describes the benefits of using the Retentioneering preprocessing module in real-life scenarios. It uses the [rete_preprocessing_demo](https://drive.google.com/file/d/1pZsFXm_xuwWM6CGbq5mo2dVEd8LNM-NO/view?usp=sharing) dataset, which originates from a production system. This clickstream covers <span style="color:red">TODO: fill the exact numbers</span> \*\*\* users and \*\*\* unique events, with \*\*\* events per user on average. The data spans the date range from YYYY-MM-DD to YYYY-MM-DD.

<span style="color:red">TODO: rename original event names</span>

Let's look at the transition graph associated with this clickstream.

In [1]:
import numpy as np
import pandas as pd

In [2]:
# download_url = 'https://drive.google.com/uc?id=1P-qUosoQMgTb52iWtawKoBGXI4I3PUJv&export=download&confirm=t'
download_url = 'https://drive.google.com/uc?id=1tY-4xg6m_dv6IaVPIcd4oC1bK1lW5Tr9&export=download&confirm=t'
df0 = pd.read_csv(download_url, compression='gzip')

In [2]:
import retentioneering

retentioneering.config.update({
    'user_col': 'user_id',    
    'event_col': 'event',
    'event_time_col': 'timestamp'
})
# df0.rete.plot_graph()

In [5]:
from IPython.display import Image
Image(url='enormous_transition_graph.png')

Obviously, it's impossible to figure out what's going on in this diagram, so we have to find ways to unravel this tangle. The basic idea behind any simplification is to reduce the number of events in the clickstream. The Retentioneering preprocessing module copes with this problem efficiently: it contains multiple functions that facilitate clickstream analysis.

Besides this straightforward benefit, the module helps to organize clickstream research. It's common to create many similar data frames (like df1, df2, df3, etc.) during the exploration process. Essentially, these data frames are variations of the original data frame, and they represent it from different perspectives. The problem with these manipulations is that they are poorly structured. As a result, your Jupyter notebook turns into a messy pile of code.

Here's an example. Suppose you want to drop the users whose paths are too short. You're not sure what "short" really means, so you start experimenting, sequentially dropping the users whose paths are shorter than, say, 5 events, 10 events, 1 minute, 10 minutes, etc. Each time you obtain a new data frame, you want to compare it with a previous version and see what has changed. So you have to keep all these data frames in the notebook, and later it will be hard to reproduce what happened here.
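
To make the problem concrete, here is a minimal pandas sketch of that ad hoc workflow. The toy data and the helper names `drop_short_by_events` and `drop_short_by_duration` are assumptions made for illustration; they are not part of Retentioneering.

```python
import pandas as pd

# Toy clickstream: the column names mirror the demo dataset, the rows are made up.
events = pd.DataFrame({
    'user_id': [1, 1, 1, 2, 2, 3],
    'event': ['main', 'catalog', 'cart', 'main', 'catalog', 'main'],
    'timestamp': pd.to_datetime([
        '2022-01-01 10:00:00', '2022-01-01 10:05:00', '2022-01-01 10:20:00',
        '2022-01-01 11:00:00', '2022-01-01 11:00:30',
        '2022-01-01 12:00:00',
    ]),
})


def drop_short_by_events(df, min_events):
    """Keep only users whose path has at least `min_events` events."""
    path_length = df.groupby('user_id')['event'].transform('count')
    return df[path_length >= min_events]


def drop_short_by_duration(df, min_duration):
    """Keep only users whose path lasts at least `min_duration`."""
    ts = df.groupby('user_id')['timestamp']
    duration = ts.transform('max') - ts.transform('min')
    return df[duration >= pd.Timedelta(min_duration)]


# Every experiment leaves one more loosely named data frame behind.
df1 = drop_short_by_events(events, 2)
df2 = drop_short_by_events(events, 3)
df3 = drop_short_by_duration(events, '1min')
df4 = drop_short_by_duration(events, '10min')
```

Nothing in the names `df1` to `df4` records which threshold produced which frame, and that is exactly the bookkeeping that gets lost as the notebook grows.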

The Retentioneering preprocessing module keeps all the manipulations with the original clickstream in a calculation graph. Each node of the graph is associated with a particular manipulation (like dropping the users whose path is shorter than 5 events in the example above), and the edges define the order of execution. As a result, the calculation graph makes all the data manipulations **structured and reproducible**, which is crucial for data analysis.
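
The idea of a calculation graph can be sketched in a few lines of plain Python. The `Node` class and the `min_path_length` helper below are hypothetical and exist only to illustrate the concept; they are not the Retentioneering API, which appears later in this notebook.

```python
import pandas as pd


class Node:
    """Hypothetical calculation-graph node: one manipulation plus a link to its parent."""

    def __init__(self, name, func=None, parent=None):
        self.name = name
        self.func = func      # manipulation applied to the parent's result
        self.parent = parent  # edge to the previous step; None for the root

    def compute(self, source):
        """Replay every manipulation on the way from the root to this node."""
        if self.parent is None:
            return source
        return self.func(self.parent.compute(source))


def min_path_length(n):
    """Build a manipulation that drops users with fewer than `n` events."""
    def manipulation(df):
        path_length = df.groupby('user_id')['event'].transform('count')
        return df[path_length >= n]
    return manipulation


# Made-up toy data.
events = pd.DataFrame({
    'user_id': [1, 1, 1, 2, 2, 3],
    'event': ['main', 'catalog', 'cart', 'main', 'catalog', 'main'],
})

# Two experiments branch from the same root instead of living in df1, df2, ...
root = Node('source')
at_least_2 = Node('path >= 2 events', min_path_length(2), parent=root)
at_least_3 = Node('path >= 3 events', min_path_length(3), parent=root)

users_2 = at_least_2.compute(events)['user_id'].nunique()
users_3 = at_least_3.compute(events)['user_id'].nunique()
```

Because each result is recomputed from the source by walking the edges, any branch can be reproduced at any time, and the node names document what each branch does.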

Below we provide some real-world cases where the preprocessing module helps to manage the analytical process, applied to the same ```rete_preprocessing_demo``` dataset introduced at the beginning of this document.

# 2 - General clickstream exploration

Suppose you've just downloaded your clickstream dataset, and so far you have no idea how the users behave. As we've seen in the previous section, looking at the raw data directly often makes no sense. Intuitively, this urges you to simplify the eventstream. There are multiple ways to do so, but we focus on the following steps:

- Keep a random 10% sample of users.
- Remove users whose path is too short (i.e. remove their paths entirely).
- Drop a list of events <span style="color:red">TODO: list the particular events</span>.
- Collapse all the identical consecutive events.
- Group similar events <span style="color:red">TODO: list the particular events</span> and treat them as a single event.
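
As an illustration of the "collapse identical consecutive events" step, here is a minimal pandas sketch. It works on made-up toy data, and `collapse_repeats` is a hypothetical helper written for this example, not the module's own implementation.

```python
import pandas as pd

# Made-up toy data: user 1 repeats 'catalog', user 2 repeats 'main'.
events = pd.DataFrame({
    'user_id': [1, 1, 1, 1, 2, 2],
    'event': ['main', 'catalog', 'catalog', 'main', 'main', 'main'],
    'timestamp': pd.to_datetime([
        '2022-01-01 10:00:00', '2022-01-01 10:01:00',
        '2022-01-01 10:02:00', '2022-01-01 10:03:00',
        '2022-01-01 11:00:00', '2022-01-01 11:01:00',
    ]),
})


def collapse_repeats(df):
    """Keep only the first event of every run of identical consecutive events."""
    df = df.sort_values(['user_id', 'timestamp'])
    # Previous event within the same user's path (NaN for the first event).
    prev_event = df.groupby('user_id')['event'].shift()
    return df[df['event'] != prev_event]


collapsed = collapse_repeats(events)
```

Note that user 1's second `main` event survives because it is not adjacent to the first one: only consecutive repeats are collapsed.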

<span style="color:red">TODO: mention that all the original events are kept in an eventstream object, but they are kept invisible since there's no hard delete option.</span>

In [1]:
import sys
sys.path.insert(0, '..')

In [2]:
import numpy as np

from src.eventstream.eventstream import Eventstream
from src.eventstream.schema import RawDataSchema, EventstreamSchema
from src.graph.p_graph import PGraph, EventsNode
from src.data_processors_lib.rete import CollapseLoops, CollapseLoopsParams
from src.data_processors_lib.rete import DeleteUsersByPathLength, DeleteUsersByPathLengthParams
from src.data_processors_lib.rete import FilterEvents, FilterEventsParams
from src.data_processors_lib.rete import GroupEvents, GroupEventsParams
from src.data_processors_lib.rete import NewUsersEvents, NewUsersParams
from src.data_processors_lib.rete import SplitSessions, SplitSessionsParams
from src.data_processors_lib.rete import StartEndEvents, StartEndEventsParams
from src.data_processors_lib.rete import TruncatePath, TruncatePathParams
from src.data_processors_lib.rete import TruncatedEvents, TruncatedEventsParams


In [3]:
raw_data_schema = RawDataSchema(
    event_name='event', 
    event_timestamp='timestamp', 
    user_id='user_id'
)

stream = Eventstream(
    raw_data=df0,
    raw_data_schema=raw_data_schema,
    schema=EventstreamSchema()
)

In [None]:
def user_sample(df, schema):
    users = df['user_id'].unique()
    sample_size = int(0.1 * len(users))
    np.random.seed(42)
    sampled_users = np.random.choice(users, size=sample_size, replace=False)
    return df['user_id'].isin(sampled_users)


def exclude_events(df, schema):
    events_to_exclude = [
        'site/*',
        'landing/*',
        '/promo/*',
        '/trading/*',
        '/',
        '/content/education/*',
        '/analytics/*'
    ]
    # Use the schema's column name, consistent with the group_* filters below.
    return ~df[schema.event_name].isin(events_to_exclude)

def group_profile(df, schema):
    return df[schema.event_name].isin(['profile', 'profile/*', 'profile/identity'])

def group_finances(df, schema):
    return df[schema.event_name].isin(['finances', 'finances/*transfer*', 'finances/deposit'])

def group_cabinet1(df, schema):
    return df[schema.event_name].isin(['cabinet/partner_*', 'cabinet/recovery_*'])

def group_cabinet2(df, schema):
    return df[schema.event_name].isin(['/cabinet/content/*', '/cabinet/loyalty_*', '/cabinet/subs/*'])

stream = stream\
    .filter(filter=user_sample)\
    .delete_users(cutoff=(1, 'h'))\
    .filter(filter=exclude_events)\
    .group(event_name='profile', filter=group_profile)\
    .group(event_name='finances', filter=group_finances)\
    .group(event_name='cabinet1', filter=group_cabinet1)\
    .group(event_name='cabinet2', filter=group_cabinet2)

In [20]:
stream.to_dataframe().rete.plot_graph()

'experiments/graph_2022-11-14 22_13_35_493310.html'

The diagram has become less messy, but it's still impossible to understand how the users behave. The described steps are useful for general simplification, but for more detailed research we need to break the data into smaller parts. These techniques are presented in the next section.

# 3 - Exploring the beginning of the paths

For many IT products, such as apps or websites, a user's first steps within the system are crucial from the churn point of view. That's why product analysts pay special attention to the beginning of a user's path.

One might define the beginning of a user's path in different ways: the first N actions, the first day, the first session. All these interpretations are possible, but for a data analyst, especially one who performs exploratory analysis for the first time, it's often unclear which option is preferable. If you never try, you'll never know. In this section we demonstrate how the Retentioneering preprocessing module makes it easier to test these options.
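
The three definitions can be compared side by side with a short pandas sketch. The toy data, N = 2, the 24-hour reading of "first day", and the 1-hour session gap are assumptions chosen for illustration; the code does not use the Retentioneering API.

```python
import pandas as pd

# Made-up toy data, sorted so that each user's path is in time order.
events = pd.DataFrame({
    'user_id': [1, 1, 1, 1, 2, 2],
    'event': ['main', 'catalog', 'cart', 'main', 'main', 'catalog'],
    'timestamp': pd.to_datetime([
        '2022-01-01 10:00:00', '2022-01-01 10:10:00',
        '2022-01-01 12:00:00', '2022-01-03 09:00:00',
        '2022-01-02 08:00:00', '2022-01-02 08:30:00',
    ]),
}).sort_values(['user_id', 'timestamp'])

by_user = events.groupby('user_id')

# Option 1: the first N actions of every user (N = 2 here).
first_n = events[by_user.cumcount() < 2]

# Option 2: the first day, read as 24 hours since the user's first event.
first_ts = by_user['timestamp'].transform('min')
first_day = events[events['timestamp'] - first_ts < pd.Timedelta('1D')]

# Option 3: the first session; a gap longer than 1 hour starts a new session.
gap = by_user['timestamp'].diff()
new_session = (gap > pd.Timedelta('1h')).astype(int)
session_no = new_session.groupby(events['user_id']).cumsum()
first_session = events[session_no == 0]
```

On this toy data the options already disagree for user 1: the first day keeps three events, while the first N actions and the first session keep two, which is why it pays to try each definition before committing to one.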

In [51]:
Image(url='path_beginning_research.png', width=600)

In [28]:
graph.combine(node=node4)

<src.eventstream.eventstream.Eventstream at 0x7fcdc44e6fa0>

In [1]:
import sys
sys.path.insert(0, '..')

In [2]:
import pandas as pd

# download_url = 'https://drive.google.com/uc?id=1tY-4xg6m_dv6IaVPIcd4oC1bK1lW5Tr9&export=download&confirm=t'
# df0 = pd.read_csv(download_url, compression='gzip')
df0 = pd.DataFrame(data=[], columns=['event', 'timestamp', 'user_id'])
df0

Unnamed: 0,event,timestamp,user_id


In [3]:
from src.eventstream.schema import RawDataSchema, EventstreamSchema
from src.eventstream.eventstream import Eventstream
from src.graph.p_graph import PGraph, EventsNode
from src.data_processors_lib.rete import CollapseLoops, CollapseLoopsParams
from src.data_processors_lib.rete import DeleteUsersByPathLength, DeleteUsersByPathLengthParams
from src.data_processors_lib.rete import FilterEvents, FilterEventsParams
from src.data_processors_lib.rete import GroupEvents, GroupEventsParams
from src.data_processors_lib.rete import NewUsersEvents, NewUsersParams
from src.data_processors_lib.rete import SplitSessions, SplitSessionsParams
from src.data_processors_lib.rete import StartEndEvents, StartEndEventsParams
from src.data_processors_lib.rete import TruncatePath, TruncatePathParams
from src.data_processors_lib.rete import TruncatedEvents, TruncatedEventsParams
import numpy as np  # needed by new_and_not_truncated_users below


raw_data_schema = RawDataSchema(
    event_name='event', 
    event_timestamp='timestamp', 
    user_id='user_id'
)

stream = Eventstream(
    raw_data=df0,
    raw_data_schema=raw_data_schema,
    schema=EventstreamSchema()
)

graph = PGraph(source_stream=stream)

TARGET_EVENT = 'finances/deposit/<payment_name>/success'

def users_with_target_event(df, schema):
    target_users = df[df['event_name'] == TARGET_EVENT]['user_id'].unique()
    return df['user_id'].isin(target_users)

def first_session_filter(df, schema):
    return df['session_id'].str.endswith('_1')

def new_and_not_truncated_users(df, schema):
    truncated_users = df[(df['event_name'] == 'truncated_right')]['user_id'].unique()
    new_users = df[(df['event_name'] == 'new_user')]['user_id'].unique()
    target_users = np.setdiff1d(new_users, truncated_users)
    return df['user_id'].isin(target_users)


node1 = EventsNode(StartEndEvents(params=StartEndEventsParams(**{})))
node2 = EventsNode(NewUsersEvents(params=NewUsersParams(new_users_list="all")))
node3 = EventsNode(TruncatedEvents(params=TruncatedEventsParams(right_truncated_cutoff=(12, 'D'))))
node4 = EventsNode(FilterEvents(params=FilterEventsParams(filter=new_and_not_truncated_users)))
node5 = EventsNode(SplitSessions(params=SplitSessionsParams(
    session_cutoff=(1, 'h'),
    session_col='session_id'
)))
node6 = EventsNode(FilterEvents(params=FilterEventsParams(filter=first_session_filter)))

node7 = EventsNode(FilterEvents(params=FilterEventsParams(filter=users_with_target_event)))
node8 = EventsNode(TruncatePath(params=TruncatePathParams(drop_after=TARGET_EVENT)))
node9 = EventsNode(FilterEvents(params=FilterEventsParams(filter=first_session_filter)))


graph.add_node(node=node1, parents=[graph.root])
graph.add_node(node=node2, parents=[node1])
graph.add_node(node=node3, parents=[node2])
graph.add_node(node=node4, parents=[node3])
graph.add_node(node=node5, parents=[node4])

graph.add_node(node=node6, parents=[node5])
graph.add_node(node=node7, parents=[node6])

graph.add_node(node=node8, parents=[node5])
graph.add_node(node=node9, parents=[node8])

graph.display()


In [28]:
graph.display()

In [12]:
graph.combine(node=node6).to_dataframe().rete.plot_graph()

'experiments/graph_2022-10-27 12_46_29_857943.html'

In [22]:
graph.combine(node=node8).to_dataframe().rete.plot_graph()

'experiments/graph_2022-10-27 13_00_20_715993.html'

In [8]:
raw_data_schema = RawDataSchema(
    event_name='event_name', 
    event_timestamp='event_timestamp', 
    user_id='user_id'
)

retentioneering.config.update({
    'user_col': 'user_id',    
    'event_col': 'event_name',
    'event_time_col': 'event_timestamp'
})
# df0.rete.plot_graph()

stream = Eventstream(
    # raw_data=stream.to_dataframe(),
    raw_data=pd.read_csv('tmp.csv'),
    raw_data_schema=raw_data_schema,
    schema=EventstreamSchema()
)

In [11]:
TARGET_EVENT = 'finances/deposit/<payment_name>/success'

def new_and_not_truncated_users(df, schema):
    truncated_users = df[(df['event_name'] == 'truncated_right')]['user_id'].unique()
    new_users = df[(df['event_name'] == 'new_user')]['user_id'].unique()
    target_users = np.setdiff1d(new_users, truncated_users)
    return df['user_id'].isin(target_users)

def users_with_target_event(df, schema):
    target_users = df[df['event_name'] == TARGET_EVENT]['user_id'].unique()
    return df['user_id'].isin(target_users)

def first_session_filter(df, schema):
    return df['session_id'].str.endswith('_1')

node1 = EventsNode(StartEndEvents(params=StartEndEventsParams(**{})))
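# `new_users` is not defined at module level in this notebook; "all" matches the earlier version of this graph.
node2 = EventsNode(NewUsersEvents(params=NewUsersParams(new_users_list="all")))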
node3 = EventsNode(TruncatedEvents(params=TruncatedEventsParams(right_truncated_cutoff=(12, 'D'))))
node4 = EventsNode(FilterEvents(params=FilterEventsParams(filter=new_and_not_truncated_users)))
node5 = EventsNode(CollapseLoops(params=CollapseLoopsParams(suffix=None)))
node6 = EventsNode(FilterEvents(params=FilterEventsParams(filter=users_with_target_event)))
node7 = EventsNode(TruncatePath(params=TruncatePathParams(drop_after=TARGET_EVENT)))
node8 = EventsNode(SplitSessions(params=SplitSessionsParams(session_cutoff=(1, 'h'), session_col='session_id')))
node9 = EventsNode(FilterEvents(params=FilterEventsParams(filter=first_session_filter)))

graph = PGraph(source_stream=stream)
graph.add_node(node=node1, parents=[graph.root])
graph.add_node(node=node2, parents=[node1])
graph.add_node(node=node3, parents=[node2])
graph.add_node(node=node4, parents=[node3])
graph.add_node(node=node5, parents=[node4])

graph.add_node(node=node6, parents=[node5])
graph.add_node(node=node7, parents=[node6])

graph.add_node(node=node8, parents=[node5])
graph.add_node(node=node9, parents=[node8])


In [15]:
graph.combine(node=node7).to_dataframe().rete.plot_graph(
    targets={TARGET_EVENT: 'green'}
)

'experiments/graph_2022-11-15 15_41_57_402853.html'