This is one of the Objectiv example notebooks. For more examples, visit the
[example notebooks](https://objectiv.io/docs/modeling/example_notebooks/) section of our docs. The notebooks run on the demo data set that comes with our [quickstart](https://objectiv.io/docs/home/quickstart-guide/), but can also be run on your own collected data.

All example notebooks are also available in our [quickstart](https://objectiv.io/docs/home/quickstart-guide/). With the quickstart you can spin up a fully functional Objectiv demo pipeline in five minutes, then run these notebooks and experiment with them on the demo data set.

# Basic user intent analysis

In this notebook, we briefly demonstrate how you can easily do basic user intent analysis on your data.

## Getting started

### Import the required packages for this notebook
The open model hub package can be installed with `pip install objectiv-modelhub` (this installs Bach as well).  
If you are running this notebook from our quickstart, the model hub and Bach are already installed, so you don't have to install them separately.

In [1]:
from modelhub import ModelHub
from bach import display_sql_as_markdown
import bach
import pandas as pd 
from datetime import timedelta

First, we instantiate the model hub and the Objectiv DataFrame object.

In [2]:
# instantiate the model hub
modelhub = ModelHub(time_aggregation='YYYY-MM-DD')

# get the Bach DataFrame with Objectiv data
df = modelhub.get_objectiv_dataframe(start_date='2022-02-02')

The `global_contexts` and `location_stack` columns contain most of the event-specific data. These are
JSON-type columns, and we can extract data from them based on the keys of the JSON objects, using `SeriesGlobalContexts` or `SeriesLocationStack` methods.

In [3]:
# adding specific contexts to the data
df['application'] = df.global_contexts.gc.application
df['root_location'] = df.location_stack.ls.get_from_context_with_type_series(type='RootLocationContext', key='id')
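
To make the extraction step more concrete, here is a minimal plain-Python sketch of the lookup that `get_from_context_with_type_series` performs for one row. It is not the Bach implementation (which runs in the database). The helper name `get_from_context_with_type` and the example stack are hypothetical, and the sketch assumes each context is a JSON object with a `_type` and an `id` key.

```python
from typing import Any, Dict, List, Optional


def get_from_context_with_type(stack: List[Dict[str, Any]], type_: str, key: str) -> Optional[Any]:
    """Return `key` from the first context in `stack` whose `_type` equals `type_`."""
    for context in stack:
        if context.get("_type") == type_:
            return context.get(key)
    # No context of the requested type in this stack.
    return None


# Hypothetical location stack for a click on a link in the top navigation of the home page.
example_stack = [
    {"_type": "RootLocationContext", "id": "home"},
    {"_type": "NavigationContext", "id": "navbar-top"},
    {"_type": "LinkContext", "id": "docs", "href": "/docs"},
]

root_location = get_from_context_with_type(example_stack, "RootLocationContext", "id")
print(root_location)  # home
```

A stack without a context of the requested type yields `None` in this sketch.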

## Exploring root location
The `root_location` context in the `location_stack` uniquely represents the top-level UI location of the user. As a first step in grasping user intent, this is a good starting point to see in which main areas of your product users are spending time.

In [4]:
# model hub: unique users per root location
users_root = modelhub.aggregate.unique_users(df, groupby=['application','root_location'])
users_root.head(10)

application       root_location
objectiv-docs     docs             266
                  home             173
                  modeling         207
                  taxonomy         107
                  tracking         221
objectiv-website  about            193
                  blog             396
                  home             696
                  jobs             175
                  join-slack        18
Name: unique_users, dtype: int64
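
When reading these numbers, it helps to keep in mind what a distinct count per group means. The following plain-Python sketch, on hypothetical events, illustrates the idea; it is not how the model hub computes the result.

```python
from collections import defaultdict

# Hypothetical events: (user_id, application, root_location)
events = [
    ("u1", "objectiv-website", "home"),
    ("u1", "objectiv-website", "home"),  # repeat visit by the same user
    ("u1", "objectiv-website", "blog"),
    ("u2", "objectiv-website", "home"),
    ("u2", "objectiv-docs", "tracking"),
]

# Collect the set of users seen in each (application, root_location) group.
users_per_group = defaultdict(set)
for user_id, application, root_location in events:
    users_per_group[(application, root_location)].add(user_id)

# The size of each set is the number of unique users in that group.
unique_users = {group: len(users) for group, users in sorted(users_per_group.items())}
for group, count in unique_users.items():
    print(group, count)
```

A user who visits several root locations is counted once in each of them, so the per-location counts can add up to more than the total number of users.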

## Exploring session duration
The average `session_duration` model from the [open model hub](https://objectiv.io/docs/modeling/) is another good first indicator of user intent.

In [5]:
# model hub: duration, per root location
duration_root = modelhub.aggregate.session_duration(df, groupby=['application', 'root_location']).sort_index()
duration_root.head(10)

application       root_location
objectiv-docs     docs            0 days 00:06:32.101332
                  home            0 days 00:06:50.403322
                  modeling        0 days 00:08:05.667446
                  taxonomy        0 days 00:07:59.104562
                  tracking        0 days 00:05:56.997620
objectiv-website  about           0 days 00:03:07.734470
                  blog            0 days 00:03:56.332150
                  home            0 days 00:05:38.025199
                  jobs            0 days 00:02:25.590949
                  join-slack      0 days 00:02:39.291852
Name: session_duration, dtype: timedelta64[ns]
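
As a rough illustration of what an average session duration per root location represents, here is a plain-Python sketch on hypothetical events. It assumes that a session's duration is the time between its first and last event within a root location, and that zero-length sessions are left out of the average; check the model hub documentation for the exact definition it uses.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Hypothetical events: (session_id, root_location, moment)
events = [
    (1, "home", datetime(2022, 2, 2, 10, 0, 0)),
    (1, "home", datetime(2022, 2, 2, 10, 4, 0)),
    (2, "home", datetime(2022, 2, 3, 9, 0, 0)),
    (2, "home", datetime(2022, 2, 3, 9, 2, 0)),
    (3, "blog", datetime(2022, 2, 3, 12, 0, 0)),  # single event: zero duration
]

# Collect the event times per (session, root location).
moments = defaultdict(list)
for session_id, root_location, moment in events:
    moments[(session_id, root_location)].append(moment)

# Duration = last event minus first event; zero-length sessions are dropped (assumption).
durations = defaultdict(list)
for (session_id, root_location), times in moments.items():
    duration = max(times) - min(times)
    if duration > timedelta(0):
        durations[root_location].append(duration)

# Average the remaining durations per root location.
average_duration = {
    root: sum(values, timedelta(0)) / len(values) for root, values in durations.items()
}
print(average_duration)
```

In this example the two `home` sessions last four and two minutes, so `home` averages three minutes, while `blog` has no session with a measurable duration and does not appear in the result.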

In [6]:
# how is this time spent distributed?
session_duration = modelhub.aggregate.session_duration(df, groupby='session_id')

# materialization is needed because the created series already contains aggregated data, which cannot be aggregated again directly.
session_duration.to_frame().materialize()['session_duration'].quantile(q=[0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]).head(10)

quantile
0.1   0 days 00:00:00.652800
0.2   0 days 00:00:01.216600
0.3   0 days 00:00:02.747400
0.4   0 days 00:00:12.426800
0.5   0 days 00:00:44.912000
0.6   0 days 00:02:06.920800
0.7   0 days 00:03:17.687600
0.8   0 days 00:05:35.042400
0.9   0 days 00:19:40.537800
Name: session_duration, dtype: timedelta64[ns]
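
The deciles above show a strongly skewed distribution: the median session lasts well under a minute, while the longest tenth runs to almost 20 minutes or more. The sketch below reproduces that kind of comparison on a small hypothetical sample using only the Python standard library; the sample values are made up and are not taken from the demo data.

```python
import statistics

# Hypothetical session durations in seconds, skewed like the output above.
durations = [1, 1, 2, 3, 12, 45, 127, 198, 335, 1180, 2400]

# n=10 returns the 9 cut points between ten equal-sized groups (deciles);
# method="inclusive" interpolates linearly between the sorted sample values.
deciles = statistics.quantiles(durations, n=10, method="inclusive")

median = statistics.median(durations)
mean = statistics.mean(durations)

print("deciles:", deciles)
print("median:", median, "mean:", round(mean, 1))
```

With a few very long sessions in the sample, the mean ends up far above the median, which is why the averages per root location should be read alongside the distribution.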

## Defining different stages of user intent
After exploring the `root_location` and `session_duration`, we can make a simple definition of different stages of user intent.

Based on the objectiv.io website data in the quickstart, we could define them as:

| User intent | Root locations | Duration |
| :--- | :--- | :--- |
| 1 - Inform | home, blog, about | less than 2 minutes |
| 2 - Explore | modeling, taxonomy, tracking | between 2 and 20 minutes |
| 3 - Implement | modeling, taxonomy, tracking | more than 20 minutes | 

This is just for illustration purposes; you can adjust these definitions based on your own collected data.
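
The table can be read as a small decision rule. The sketch below writes it out as a plain-Python function for a single (root location, duration) pair. The function name `assign_intent` and the two constant names are hypothetical, the boundary handling (exactly 2 or 20 minutes counts as Explore) is an assumption, and any combination the table does not cover is labeled `unassigned`.

```python
from datetime import timedelta

INFORM_ROOTS = {"home", "blog", "about"}
PRODUCT_ROOTS = {"modeling", "taxonomy", "tracking"}

TWO_MINUTES = timedelta(minutes=2)
TWENTY_MINUTES = timedelta(minutes=20)


def assign_intent(root_location: str, duration: timedelta) -> str:
    """Map one root location and average session duration to an intent stage."""
    if root_location in INFORM_ROOTS and duration < TWO_MINUTES:
        return "1 - Inform"
    if root_location in PRODUCT_ROOTS:
        if TWO_MINUTES <= duration <= TWENTY_MINUTES:
            return "2 - Explore"
        if duration > TWENTY_MINUTES:
            return "3 - Implement"
    # Combinations not covered by the table, e.g. a short visit to 'modeling'.
    return "unassigned"


print(assign_intent("blog", timedelta(seconds=45)))       # 1 - Inform
print(assign_intent("tracking", timedelta(minutes=8)))    # 2 - Explore
print(assign_intent("modeling", timedelta(minutes=35)))   # 3 - Implement
print(assign_intent("modeling", timedelta(seconds=30)))   # unassigned
```

Adjusting the definitions to your own data then comes down to changing the two sets of root locations and the two thresholds.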

## Assigning user intent
Based on the simple definitions above, we can start assigning a stage of intent to each user. We do this per timeframe (in this case monthly), as users can progress from one stage to the next over time.

In [7]:
# model hub: calculate average session duration per user, monthly
user_duration = modelhub.aggregate.session_duration(df, groupby=['user_id', modelhub.time_agg(df, 'YYYY-MM')])
user_duration.sort_index(ascending=False).head(10)

user_id                               time_aggregation
ffc42c2e-7db3-488f-a4ae-3e6452c5050b  2022-02            0 days 00:00:33.013500
ffc233df-022a-4567-b74a-631b23c45dfc  2022-02            0 days 00:02:01.376000
ffc0ba50-9146-438c-bac3-38faa7183dda  2022-04            0 days 00:00:33.219000
ff48d79a-195a-476a-b49d-0e212de43c96  2022-04            0 days 00:01:10.978333
                                      2022-03            0 days 00:00:57.196500
                                      2022-02            0 days 00:00:00.661667
ff33827e-671b-41c3-a6d4-6e13838c4e3a  2022-03            0 days 00:03:08.289000
ff1e489d-9828-43df-b621-1a06ddf61d5d  2022-02            0 days 00:00:00.450500
fedc6bd2-3c86-47b3-b9f7-7ec0bde72f11  2022-02                   0 days 00:01:15
fed1b03d-e952-47ce-a149-ee9c11b2c4bd  2022-02            0 days 00:15:02.568000
Name: session_duration, dtype: timedelta64[ns]

In [8]:
# model hub: calculate average session duration per user, per root location, monthly
user_root_duration = modelhub.aggregate.session_duration(df, groupby=['user_id', 'application', 'root_location', modelhub.time_agg(df, 'YYYY-MM')])
user_root_duration.sort_index(ascending=False).head(10)

user_id                               application       root_location  time_aggregation
ffc42c2e-7db3-488f-a4ae-3e6452c5050b  objectiv-docs     docs           2022-02            0 days 00:00:03.434000
ffc233df-022a-4567-b74a-631b23c45dfc  objectiv-website  home           2022-02            0 days 00:02:01.376000
ffc0ba50-9146-438c-bac3-38faa7183dda  objectiv-website  home           2022-04            0 days 00:00:33.219000
ff48d79a-195a-476a-b49d-0e212de43c96  objectiv-website  home           2022-04            0 days 00:01:10.978333
                                                                       2022-03            0 days 00:00:56.477500
                                                                       2022-02            0 days 00:00:00.638500
                                                        blog           2022-02            0 days 00:00:00.708000
ff33827e-671b-41c3-a6d4-6e13838c4e3a  objectiv-website  blog           2022-03            0 days 00:00:01.125000
        

In [9]:
# assign user intent according to the definitions above
user_root_duration_f = user_root_duration.to_frame().reset_index()

roots = bach.DataFrame.from_pandas(engine=df.engine, 
                                   df=pd.DataFrame({'roots': ['home', 'blog', 'about']}), 
                                   convert_objects=True).roots
roots2 = bach.DataFrame.from_pandas(engine=df.engine, 
                                    df=pd.DataFrame({'roots': ['modeling', 'taxonomy', 'tracking']}), 
                                    convert_objects=True).roots

user_root_duration_f['bucket'] = 'unassigned'

user_root_duration_f.loc[(user_root_duration_f.root_location.isin(roots)) &
                         (user_root_duration_f.session_duration < timedelta(0, 120)), 'bucket'] = '1 - inform'

user_root_duration_f.loc[(((user_root_duration_f.root_location.isin(roots))) | 
                          ((user_root_duration_f.root_location.isin(roots2)))) &
                         (user_root_duration_f.session_duration >= timedelta(0, 120)) &
                         (user_root_duration_f.session_duration <= timedelta(0, 1200)), 'bucket'] = '2 - explore'

user_root_duration_f.loc[(((user_root_duration_f.root_location.isin(roots))) | 
                          ((user_root_duration_f.root_location.isin(roots2)))) &
                         (user_root_duration_f.session_duration > timedelta(0, 1200)), 'bucket'] = '3 - implement'

# total number of users per intent
user_root_duration_f.groupby('bucket').agg({'root_location': 'nunique', 
                                            'session_duration': ['min', 'max'], 
                                            'user_id': 'nunique'}).head()

| bucket | root_location_nunique | session_duration_min | session_duration_max | user_id_nunique |
| :--- | :--- | :--- | :--- | :--- |
| 1 - inform | 3 | 0 days 00:00:00.001000 | 0 days 00:01:59.692000 | 655 |
| 2 - explore | 6 | 0 days 00:02:00.367000 | 0 days 00:19:30.948000 | 360 |
| 3 - implement | 6 | 0 days 00:20:11.870000 | 0 days 00:59:43.205000 | 33 |
| unassigned | 7 | 0 days 00:00:00.001000 | 0 days 00:36:00.726846 | 470 |


In [11]:
# user intent over time
user_root_duration_f.groupby(['bucket', 'time_aggregation']).agg({'user_id': 'nunique'}).head(20)

| bucket | time_aggregation | user_id_nunique |
| :--- | :--- | :--- |
| 1 - inform | 2022-02 | 370 |
| 1 - inform | 2022-03 | 230 |
| 1 - inform | 2022-04 | 98 |
| 2 - explore | 2022-02 | 146 |
| 2 - explore | 2022-03 | 152 |
| 2 - explore | 2022-04 | 83 |
| 3 - implement | 2022-02 | 19 |
| 3 - implement | 2022-03 | 10 |
| 3 - implement | 2022-04 | 8 |
| unassigned | 2022-02 | 302 |
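
Because a bucket is assigned per user, root location and month, the same user can appear in more than one bucket within a month. The plain-Python sketch below, on hypothetical rows, shows the unique-user count per bucket and month alongside one possible way to reduce this to a single stage per user and month by keeping the highest stage reached. The `STAGE_RANK` mapping and the "highest stage wins" rule are assumptions for illustration, not part of the notebook's analysis.

```python
from collections import defaultdict

# Rank of each bucket, so stages can be compared (assumption: higher is further along).
STAGE_RANK = {"unassigned": 0, "1 - inform": 1, "2 - explore": 2, "3 - implement": 3}

# Hypothetical rows: (user_id, month, bucket), one per user and root location.
rows = [
    ("u1", "2022-02", "1 - inform"),
    ("u1", "2022-02", "2 - explore"),
    ("u1", "2022-03", "3 - implement"),
    ("u2", "2022-02", "1 - inform"),
    ("u2", "2022-03", "unassigned"),
]

users_per_bucket_month = defaultdict(set)
top_stage = {}
for user_id, month, bucket in rows:
    # Unique users per (bucket, month), as in the groupby above.
    users_per_bucket_month[(bucket, month)].add(user_id)
    # Keep only the highest-ranked bucket per (user, month).
    key = (user_id, month)
    if key not in top_stage or STAGE_RANK[bucket] > STAGE_RANK[top_stage[key]]:
        top_stage[key] = bucket

user_counts = {key: len(users) for key, users in sorted(users_per_bucket_month.items())}
print(user_counts)
print(top_stage)
```

Here `u1` counts toward both `1 - inform` and `2 - explore` in 2022-02, but has `2 - explore` as the highest stage for that month and moves on to `3 - implement` in 2022-03.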
