This is one of the Objectiv example notebooks. For more examples, visit the 
[example notebooks](https://objectiv.io/docs/modeling/example_notebooks/) section of our docs. The notebooks run on the demo data set that comes with our [quickstart](https://objectiv.io/docs/home/quickstart-guide/), but you can also run them on your own collected data.

All example notebooks are also available in our [quickstart](https://objectiv.io/docs/home/quickstart-guide/). With the quickstart you can spin up a fully functional Objectiv demo pipeline in five minutes. This also allows you to run these notebooks and experiment with them on a demo data set.

# Basic user intent analysis

In this notebook, we briefly demonstrate how you can easily do basic user intent analysis on your data.

## Getting started

### Import the required packages for this notebook
The open model hub package can be installed with `pip install objectiv-modelhub` (this installs Bach as well).  
If you are running this notebook from our quickstart, the model hub and Bach are already installed, so you don't have to install them separately.

In [46]:
from modelhub import ModelHub
from bach import display_sql_as_markdown
import bach
import pandas as pd 
from datetime import timedelta

First, we instantiate the model hub and the Objectiv DataFrame object.

In [47]:
# instantiate the model hub
modelhub = ModelHub(time_aggregation='YYYY-MM-DD')

# get the Bach DataFrame with Objectiv data
df = modelhub.get_objectiv_dataframe(start_date='2022-02-01')

The `global_contexts` and `location_stack` columns contain most of the event-specific data. These are
JSON-type columns, and we can extract data from them based on the keys of the JSON objects, using the `SeriesGlobalContexts` or `SeriesLocationStack` methods.

In [48]:
# adding specific contexts to the data
df['application'] = df.global_contexts.gc.application
df['root_location'] = df.location_stack.ls.get_from_context_with_type_series(type='RootLocationContext', key='id')
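
To make the extraction above more concrete, here is a minimal plain-Python sketch of the idea: walk through a list of context objects and return a key from the first one whose `_type` matches. The helper name `get_from_context_with_type` and the sample location stack are made up for illustration; this is not the model hub's implementation, which runs as SQL in the database.

```python
from typing import Any, Dict, List, Optional


def get_from_context_with_type(stack: List[Dict[str, Any]], type_name: str, key: str) -> Optional[Any]:
    """Return `key` from the first context in `stack` whose `_type` equals `type_name`."""
    for context in stack:
        if context.get("_type") == type_name:
            return context.get(key)
    # no context of the requested type in this stack
    return None


# made-up location stack, for illustration only
location_stack = [
    {"_type": "RootLocationContext", "id": "home"},
    {"_type": "NavigationContext", "id": "navbar-top"},
    {"_type": "LinkContext", "id": "docs"},
]

root_location = get_from_context_with_type(location_stack, "RootLocationContext", "id")
print(root_location)
```

A stack without a context of the requested type yields `None` in this sketch.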

## Exploring root location
The `root_location` context in the `location_stack` uniquely represents the top-level UI location of the user. As a first step in grasping user intent, this is a good starting point to see which main areas of your product users are spending time in.

In [49]:
# model hub: unique users per root location
users_root = modelhub.aggregate.unique_users(df, groupby=['application','root_location'])
users_root.head(10)

application       root_location
objectiv-docs     home             138
                  modeling         181
                  taxonomy          89
                  tracking         204
objectiv-website  about             86
                  blog              95
                  home             339
                  jobs              72
                  join-slack        18
                  privacy           13
Name: unique_users, dtype: int64
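
For readers who want to check the logic outside the database, the aggregation above can be approximated in plain pandas as a distinct count of `user_id` per group. This is only a sketch on made-up events, assuming one row per event; the model hub itself computes this in SQL.

```python
import pandas as pd

# made-up events, for illustration only
events = pd.DataFrame({
    "user_id": ["u1", "u1", "u2", "u2", "u3", "u1"],
    "application": ["objectiv-docs", "objectiv-docs", "objectiv-docs",
                    "objectiv-website", "objectiv-website", "objectiv-website"],
    "root_location": ["home", "home", "home", "blog", "blog", "home"],
})

# count distinct users per (application, root_location)
unique_users = (
    events.groupby(["application", "root_location"])["user_id"]
    .nunique()
    .rename("unique_users")
)
print(unique_users)
```

Repeated events by the same user in the same root location count once.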

## Exploring session duration
The average `session_duration` model from the [open model hub](https://objectiv.io/docs/modeling/) is another good starting point for exploring user intent.

In [50]:
# model hub: duration, per root location
duration_root = modelhub.aggregate.session_duration(df, groupby=['application', 'root_location']).sort_index()
duration_root.head(10)

application       root_location
objectiv-docs     home            0 days 00:06:13.206928
                  modeling        0 days 00:08:10.151239
                  taxonomy        0 days 00:06:46.368386
                  tracking        0 days 00:04:35.300549
objectiv-website  about           0 days 00:03:22.542548
                  blog            0 days 00:05:37.312841
                  home            0 days 00:05:18.488096
                  jobs            0 days 00:02:06.103118
                  join-slack      0 days 00:02:39.291852
                  privacy         0 days 00:02:10.331625
Name: session_duration, dtype: timedelta64[ns]
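
As a rough pandas sketch of what this model computes: going by the SQL shown at the end of this notebook, the duration of a session within a group is the time between its first and last event, zero-length sessions are dropped, and the remaining durations are averaged per group. The events below are made up, and the column names simply mirror the notebook's.

```python
import pandas as pd

# made-up events, for illustration only
events = pd.DataFrame({
    "session_id": [1, 1, 2, 2, 3],
    "root_location": ["home", "home", "home", "home", "blog"],
    "moment": pd.to_datetime([
        "2022-03-01 10:00:00", "2022-03-01 10:04:00",
        "2022-03-01 11:00:00", "2022-03-01 11:02:00",
        "2022-03-02 09:00:00",
    ]),
})

# first and last event per session, within each root location
per_session = events.groupby(["root_location", "session_id"])["moment"].agg(["min", "max"])
per_session["seconds"] = (per_session["max"] - per_session["min"]).dt.total_seconds()

# drop sessions with zero duration (for example, a single event)
per_session = per_session[per_session["seconds"] > 0]

# average duration per root location, converted back to a timedelta
avg_duration = pd.to_timedelta(
    per_session.groupby(level="root_location")["seconds"].mean(), unit="s"
).rename("session_duration")
print(avg_duration)
```

In this sample the single-event `blog` session has zero duration, so `blog` does not appear in the result.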

Now, we look at the distribution of time spent. We use the Bach `quantile` operation for this. We'll use this distribution to define the different stages of user intent.

In [51]:
# how is this time spent distributed?
session_duration = modelhub.aggregate.session_duration(df, groupby='session_id')

# materialization is needed because the expression of the created series contains aggregated data, and it is not allowed to aggregate that.
session_duration.to_frame().materialize()['session_duration'].quantile(q=[0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]).head(10)

quantile
0.1   0 days 00:00:00.676800
0.2   0 days 00:00:01.417600
0.3   0 days 00:00:03.911000
0.4   0 days 00:00:19.456000
0.5   0 days 00:01:03.464000
0.6   0 days 00:02:57.522000
0.7   0 days 00:03:25.259000
0.8   0 days 00:07:24.479800
0.9   0 days 00:21:24.153800
Name: session_duration, dtype: timedelta64[ns]
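
The same kind of distribution can be sketched in pandas on a handful of made-up durations. Note that this uses pandas' default linear interpolation between data points; whether the database calculates quantiles in exactly the same way is not something this sketch assumes.

```python
import pandas as pd

# made-up session durations in seconds, for illustration only
durations = pd.Series(pd.to_timedelta([1, 2, 4, 20, 60, 180, 200, 450, 1300], unit="s"))

# compute the quantiles on seconds, then convert back to timedeltas
quantiles = durations.dt.total_seconds().quantile([0.1, 0.5, 0.9])
quantiles = pd.to_timedelta(quantiles, unit="s")
print(quantiles)
```

A long tail like the one in this sample is what motivates using duration thresholds, rather than the mean, to separate stages of intent.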

## Defining different stages of user intent
After exploring the `root_location` and `session_duration` (both per root location and quantiles), we can make a simple definition of different stages of user intent.

Based on the objectiv.io website data in the quickstart, we could define them as:

| User intent | Root locations | Duration |
| :--- | :--- | :--- |
| 1 - Inform | home, blog, about, modeling, taxonomy, tracking | less than 2 minutes |
| 2 - Explore | modeling, taxonomy, tracking | between 2 and 20 minutes |
| 3 - Implement | modeling, taxonomy, tracking | more than 20 minutes | 

These definitions are for illustration purposes only; you can adjust them based on your own collected data.
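
Read literally, the table translates into a small rule-based function. The sketch below is plain Python with hypothetical names (`assign_intent`, `INFORM_ROOTS`, `DEEPER_ROOTS`) and follows the table as written. The notebook cell in the next section differs in one respect: it applies the duration thresholds for stages 2 and 3 to all six root locations.

```python
from datetime import timedelta

# root locations per stage, taken from the table above
INFORM_ROOTS = {"home", "blog", "about", "modeling", "taxonomy", "tracking"}
DEEPER_ROOTS = {"modeling", "taxonomy", "tracking"}

TWO_MINUTES = timedelta(minutes=2)
TWENTY_MINUTES = timedelta(minutes=20)


def assign_intent(root_location: str, duration: timedelta) -> str:
    """Map a root location and an average session duration to an intent stage."""
    if root_location in DEEPER_ROOTS and duration > TWENTY_MINUTES:
        return "3 - implement"
    if root_location in DEEPER_ROOTS and TWO_MINUTES <= duration <= TWENTY_MINUTES:
        return "2 - explore"
    if root_location in INFORM_ROOTS and duration < TWO_MINUTES:
        return "1 - inform"
    # anything that matches no definition, e.g. other root locations
    return "unassigned"


print(assign_intent("modeling", timedelta(minutes=5)))
```

The boundary handling (exactly 2 and exactly 20 minutes count as "explore") matches the comparisons used in the notebook cell that assigns the buckets.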

## Assigning user intent
Based on the simple definitions above, we can start assigning a stage of intent to each user. We do this per timeframe (in this case monthly), as users can progress from one stage to the next over time.

In [52]:
# model hub: calculate average session duration per user, per root location, monthly
user_root_duration = modelhub.aggregate.session_duration(df, groupby=['user_id', 'application', 'root_location', modelhub.time_agg(df, 'YYYY-MM')])
user_root_duration.sort_index(ascending=False).head(10)

user_id                               application       root_location  time_aggregation
ffc0ba50-9146-438c-bac3-38faa7183dda  objectiv-website  home           2022-04            0 days 00:00:33.219000
ff48d79a-195a-476a-b49d-0e212de43c96  objectiv-website  home           2022-04            0 days 00:01:10.978333
                                                                       2022-03            0 days 00:00:56.477500
ff33827e-671b-41c3-a6d4-6e13838c4e3a  objectiv-website  blog           2022-03            0 days 00:00:01.125000
                                      objectiv-docs     tracking       2022-03            0 days 00:00:00.004000
                                                        taxonomy       2022-03            0 days 00:00:59.272000
                                                        modeling       2022-03            0 days 00:00:19.193000
                                                        home           2022-03            0 days 00:00:00.006000
fec82cdd

In [53]:
# assign user intent according to the definitions above
user_root_duration_f = user_root_duration.to_frame().reset_index()

roots = bach.DataFrame.from_pandas(engine=df.engine, 
                                   df=pd.DataFrame({'roots': ['home', 'blog', 'about', 'modeling', 'taxonomy', 'tracking']}), 
                                   convert_objects=True).roots
roots2 = bach.DataFrame.from_pandas(engine=df.engine, 
                                    df=pd.DataFrame({'roots': ['modeling', 'taxonomy', 'tracking']}), 
                                    convert_objects=True).roots

user_root_duration_f['bucket'] = 'unassigned'

user_root_duration_f.loc[(user_root_duration_f.root_location.isin(roots)) &
                         (user_root_duration_f.session_duration < timedelta(0, 120)), 'bucket'] = '1 - inform'

user_root_duration_f.loc[(((user_root_duration_f.root_location.isin(roots))) | 
                          ((user_root_duration_f.root_location.isin(roots2)))) &
                         (user_root_duration_f.session_duration >= timedelta(0, 120)) &
                         (user_root_duration_f.session_duration <= timedelta(0, 1200)), 'bucket'] = '2 - explore'

user_root_duration_f.loc[(((user_root_duration_f.root_location.isin(roots))) | 
                          ((user_root_duration_f.root_location.isin(roots2)))) &
                         (user_root_duration_f.session_duration > timedelta(0, 1200)), 'bucket'] = '3 - implement'

# total number of users per intent
user_root_duration_f.groupby('bucket').agg({'root_location': 'nunique', 
                                            'session_duration': ['min', 'max'], 
                                            'user_id': 'nunique'}).head()

| bucket | root_location_nunique | session_duration_min | session_duration_max | user_id_nunique |
| :--- | :--- | :--- | :--- | :--- |
| 1 - inform | 6 | 0 days 00:00:00.001000 | 0 days 00:01:59.692000 | 421 |
| 2 - explore | 6 | 0 days 00:02:01.987000 | 0 days 00:19:30.948000 | 228 |
| 3 - implement | 6 | 0 days 00:20:22.373000 | 0 days 00:52:07.517500 | 16 |
| unassigned | 3 | 0 days 00:00:00.001000 | 0 days 00:21:23.086000 | 37 |


In [54]:
# users and maximum session duration per intent stage and root location
user_root_duration_f.groupby(['bucket', 'root_location']).agg({'user_id': 'nunique', 'session_duration':'max'}).sort_values('bucket', ascending=False).head(10)


| bucket | root_location | user_id_nunique | session_duration_max |
| :--- | :--- | :--- | :--- |
| unassigned | jobs | 25 | 0 days 00:21:23.086000 |
| unassigned | join-slack | 16 | 0 days 00:12:53.576000 |
| unassigned | privacy | 5 | 0 days 00:06:24.124000 |
| 3 - implement | about | 2 | 0 days 00:25:04.754000 |
| 3 - implement | blog | 1 | 0 days 00:46:41.897000 |
| 3 - implement | home | 8 | 0 days 00:29:08.341000 |
| 3 - implement | modeling | 6 | 0 days 00:36:35.463000 |
| 3 - implement | taxonomy | 3 | 0 days 00:48:28.833000 |
| 3 - implement | tracking | 3 | 0 days 00:52:07.517500 |
| 2 - explore | about | 5 | 0 days 00:10:52.836000 |


## Get the SQL for this user intent analysis

In [55]:
user_root_duration_f.groupby(['bucket', 'root_location']).agg({'user_id': 'nunique'}).head(10)

| bucket | root_location | user_id_nunique |
| :--- | :--- | :--- |
| 1 - inform | about | 17 |
| 1 - inform | blog | 54 |
| 1 - inform | home | 283 |
| 1 - inform | modeling | 103 |
| 1 - inform | taxonomy | 44 |
| 1 - inform | tracking | 83 |
| 2 - explore | about | 5 |
| 2 - explore | blog | 12 |
| 2 - explore | home | 61 |
| 2 - explore | modeling | 64 |


In [56]:
# get the SQL to use this analysis in, for example, your BI tooling
display_sql_as_markdown(user_root_duration_f)

```sql
with "loaded_data___f1f52ff4aa561ce51fcf94c92f968c8c" as (select * from (values 
(cast(0 as bigint), 'home'),
(cast(1 as bigint), 'blog'),
(cast(2 as bigint), 'about'),
(cast(3 as bigint), 'modeling'),
(cast(4 as bigint), 'taxonomy'),
(cast(5 as bigint), 'tracking')
) as t("_index_0", "roots")
),
"loaded_data___62caf65fb3cb3d96f0bb62b589b209b2" as (select * from (values 
(cast(0 as bigint), 'modeling'),
(cast(1 as bigint), 'taxonomy'),
(cast(2 as bigint), 'tracking')
) as t("_index_0", "roots")
),
"ExtractedContexts___dbb013fbc71ca150c710b1cd62b8961d" as (
    SELECT event_id,
            day,
            moment,
            cookie_id AS user_id,
            CAST(JSON_EXTRACT_PATH(value, 'global_contexts') AS jsonb) AS global_contexts,
            CAST(JSON_EXTRACT_PATH(value, 'location_stack') AS jsonb) AS location_stack,
            value->>'_type' AS event_type,
            CAST(JSON_EXTRACT_PATH(value, '_types') AS jsonb) AS stack_event_types
     FROM data
     where day >= '2022-03-01'
     ),
"session_starts_8dcd7bf8880b32f2b5380cc36396ce99" as (
        select
            *,
            case when coalesce(
                extract(
                    epoch from (moment - lag(moment, 1)
                        over (partition by user_id order by moment, event_id))
                ) > 1800,
                true
            ) then true end as is_start_of_session
        from "ExtractedContexts___dbb013fbc71ca150c710b1cd62b8961d"
    ),
"session_id_and_start_8dcd7bf8880b32f2b5380cc36396ce99" as (
        select
            *,
            -- generates a session_start_id for each is_start_of_session
            case
                when is_start_of_session then
                    row_number() over (partition by is_start_of_session order by moment, event_id)
            end as session_start_id,
            -- generates a unique number for each session, but not in the right order.
            count(is_start_of_session) over (order by user_id, moment, event_id) as is_one_session
        from session_starts_8dcd7bf8880b32f2b5380cc36396ce99
    ),
"SessionizedData___8dcd7bf8880b32f2b5380cc36396ce99" as (select
        *,
        -- populates the correct session_id for all rows with the same value for is_one_session
        first_value(
            session_start_id
        ) over (
            partition by is_one_session order by moment, event_id
        ) as session_id,
        row_number() over (partition by is_one_session order by moment, event_id) as session_hit_number
    from session_id_and_start_8dcd7bf8880b32f2b5380cc36396ce99
    ),
"getitem_having_boolean___ef69c901fecb4143681ee76a518d241c" as (select "user_id" as "user_id", 
        jsonb_path_query_first("global_contexts",
        '$[*] ? (@._type == $type)',
        '{"type":"ApplicationContext"}') ->> 'id' as "application", 
        jsonb_path_query_first("location_stack",
        '$[*] ? (@._type == $type)',
        '{"type":"RootLocationContext"}') ->> 'id' as "root_location", to_char("moment", 'YYYY-MM') as "time_aggregation", "session_id" as "_session_id", min("moment") as "moment_min", max("moment") as "moment_max", ((max("moment")) - (min("moment"))) as "session_duration" 
from "SessionizedData___8dcd7bf8880b32f2b5380cc36396ce99" 
 
group by "user_id", 
        jsonb_path_query_first("global_contexts",
        '$[*] ? (@._type == $type)',
        '{"type":"ApplicationContext"}') ->> 'id', 
        jsonb_path_query_first("location_stack",
        '$[*] ? (@._type == $type)',
        '{"type":"RootLocationContext"}') ->> 'id', to_char("moment", 'YYYY-MM'), "session_id" 
having (((max("moment")) - (min("moment"))) > '0') 
 
 
),
"reset_index___e69b7d665036a83919e881315a3c80d7" as (select "user_id" as "user_id", "application" as "application", "root_location" as "root_location", "time_aggregation" as "time_aggregation", avg("session_duration") as "session_duration" 
from "getitem_having_boolean___ef69c901fecb4143681ee76a518d241c" 
 
group by "user_id", "application", "root_location", "time_aggregation" 
 
 
 
)
select "user_id" as "user_id", "application" as "application", "root_location" as "root_location", "time_aggregation" as "time_aggregation", "session_duration" as "session_duration", CASE WHEN (((("root_location" in (SELECT "roots" as "roots" FROM "loaded_data___f1f52ff4aa561ce51fcf94c92f968c8c")) OR ("root_location" in (SELECT "roots" as "roots" FROM "loaded_data___62caf65fb3cb3d96f0bb62b589b209b2")))) AND (("session_duration" > cast('0:20:00' as interval)))) THEN '3 - implement' ELSE CASE WHEN (((((("root_location" in (SELECT "roots" as "roots" FROM "loaded_data___f1f52ff4aa561ce51fcf94c92f968c8c")) OR ("root_location" in (SELECT "roots" as "roots" FROM "loaded_data___62caf65fb3cb3d96f0bb62b589b209b2")))) AND (("session_duration" >= cast('0:02:00' as interval))))) AND (("session_duration" <= cast('0:20:00' as interval)))) THEN '2 - explore' ELSE CASE WHEN (("root_location" in (SELECT "roots" as "roots" FROM "loaded_data___f1f52ff4aa561ce51fcf94c92f968c8c")) AND (("session_duration" < cast('0:02:00' as interval)))) THEN '1 - inform' ELSE 'unassigned' END END END as "bucket" 
from "reset_index___e69b7d665036a83919e881315a3c80d7" 
 
 
 
 
 
```
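
The `session_starts` step in the SQL above marks an event as the start of a session when it is the user's first event or when more than 1800 seconds have passed since that user's previous event. A minimal pandas sketch of that rule, on made-up events, could look like this. The session numbering here simply follows the sort order by user and time, so the ids will not match those generated by the SQL; only the grouping of events into sessions is comparable.

```python
import pandas as pd

# made-up events, for illustration only
events = pd.DataFrame({
    "user_id": ["a", "a", "a", "b", "b"],
    "moment": pd.to_datetime([
        "2022-03-01 10:00:00", "2022-03-01 10:10:00", "2022-03-01 11:00:00",
        "2022-03-01 10:05:00", "2022-03-01 10:06:00",
    ]),
}).sort_values(["user_id", "moment"]).reset_index(drop=True)

# time since the same user's previous event (NaT for a user's first event)
gap = events.groupby("user_id")["moment"].diff()

# a session starts at a user's first event or after more than 1800 seconds of inactivity
is_start_of_session = gap.isna() | (gap > pd.Timedelta(seconds=1800))

# a running count of session starts gives each session its own id
events["session_id"] = is_start_of_session.cumsum()
print(events)
```

In this sample, user `a` gets two sessions (the 50-minute gap starts a new one) and user `b` gets one.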