# Intro
This notebook demonstrates how the Objectiv stack can be used to its full potential. The data is tracked according to the Open Taxonomy with the Objectiv tracker and the Objectiv collector, and is stored in a SQL database. With Objectiv Bach, standard transformations and models can be applied to this data, making data manipulation easy and standardized.

The Objectiv Bach API is heavily inspired by the pandas API. We believe this provides a great, generic interface for handling large amounts of data in a Python environment while supporting multiple data stores.

For an intro to the pandas API see: https://pandas.pydata.org/pandas-docs/stable/user_guide/10min.html  
The full Objectiv Bach API reference is available here: https://objectiv.io/docs/modeling/reference

# Contents  
* [Instantiate the object](#Instantiate-the-object)
  * [The data](#The-data)
    * [global_contexts & location_stack](#global_contexts-&-location_stack)
    * [location_stack](#location_stack)
    * [global_contexts](#global_contexts)
* [Using the data](#Using-the-data)
  * [Simple analytics metrics](#Simple-analytics-metrics)
  * [Working with the location stack](#Working-with-the-location-stack)
  * [Advanced aggregation with dash app](#Advanced-aggregation-with-dash-app)
* [Sampling](#Sampling)
* [References](#References)

In [None]:
import sqlalchemy
import os
from bach_open_taxonomy import ObjectivFrame

# Instantiate the object
As a first step, the Objectiv Frame object is instantiated. The Objectiv Frame is an extension of the Objectiv Bach DataFrame, specifically for use with data tracked by Objectiv. It loads the data as stored by the Objectiv Tracker, applies a few transformations, and sets the right data types.

In [None]:
dsn = os.environ.get('DSN', 'postgresql://objectiv:@localhost:5432/objectiv')
engine = sqlalchemy.create_engine(dsn, pool_size=1, max_overflow=0)
df = ObjectivFrame.from_table(engine)
# todo set a time frame here, instead of loading all data.

The data for the DataFrame remains in the database. The database is only queried when data is actually loaded into the Python environment. The methods that query the database are: 
* [`head()`](https://objectiv.io/docs/modeling/dataframe/bach.DataFrame.head#bach.DataFrame.head)
* [`to_pandas()`](https://objectiv.io/docs/modeling/dataframe/bach.DataFrame.to_pandas#bach.DataFrame.to_pandas)
* [`get_sample()`](https://objectiv.io/docs/modeling/dataframe/bach.DataFrame.get_sample#bach.DataFrame.get_sample)
* The property accessors [`values`](https://objectiv.io/docs/modeling/dataframe/bach.DataFrame.values#bach.DataFrame.values), [`Series.array`](https://objectiv.io/docs/modeling/series/bach.Series.array#bach.Series.array), [`Series.value`](https://objectiv.io/docs/modeling/series/bach.Series.value#bach.Series.value)

For demo purposes, this notebook calls these methods often to show the results of our operations. To limit the number of queries executed on the full data set, it is recommended to use these methods less often or [to sample the data first](#Sampling).
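The deferred querying described above can be pictured with a small stand-alone sketch. It is not Bach code: `LazyQuery` is a hypothetical class, and SQLite (from the Python standard library) stands in for the real database. It only illustrates the idea that operations are recorded first and a query is sent when `head()` is called.

```python
import sqlite3


class LazyQuery:
    """Hypothetical stand-in for a lazily evaluated frame; not part of Bach."""

    def __init__(self, conn, table, filters=()):
        self.conn = conn
        self.table = table
        self.filters = tuple(filters)

    def where(self, condition):
        # Only records the condition; nothing is sent to the database.
        return LazyQuery(self.conn, self.table, self.filters + (condition,))

    def to_sql(self):
        sql = f"SELECT * FROM {self.table}"
        if self.filters:
            sql += " WHERE " + " AND ".join(self.filters)
        return sql

    def head(self, n=5):
        # The database is queried only here.
        return self.conn.execute(f"{self.to_sql()} LIMIT {int(n)}").fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (event_id INTEGER, event_type TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [(1, "ClickEvent"), (2, "PressEvent"), (3, "ClickEvent")],
)
conn.commit()

executed = []
conn.set_trace_callback(executed.append)  # log every statement SQLite runs

clicks = LazyQuery(conn, "events").where("event_type = 'ClickEvent'")
queries_before_head = len(executed)  # 0: building the query executed nothing
rows = clicks.head()
queries_after_head = len(executed)   # at least 1: head() ran the SELECT
```

Every call that materializes data costs a query, which is why calling such methods sparingly, or on a sample, keeps the load on the database low.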

## The data
The Objectiv Frame consists of:

In [None]:
df.index_dtypes

The index contains a unique identifier for every hit.

In [None]:
df.dtypes

* `day`: the day of the session as a date.
* `moment`: the exact moment of the event.
* `user_id`: the unique identifier of the user based on the cookie.
* `session_id`: a unique incremented integer id for each session. Starts at 0 for the selected data in the Objectiv Frame.
* `global_contexts`: a json-like data column that stores additional information on the event that is logged. This includes data like device data, application data, and cookie information. [See below](#global_contexts) for a more detailed explanation. 
* `location_stack`: a json-like data column that stores information on the exact location where the event is triggered in the product. [See below](#location_stack) for a more detailed explanation.
* `event_type`: the type of event that is logged.
* `stack_event_types`: the parents of the event_type.

A preview of the data:

In [None]:
df.head()

### global_contexts & location_stack
The global_contexts and location_stack are both columns containing json-like data. Data in these columns is key to effective usage of the Open Taxonomy. Both columns are always arrays of context objects. The json arrays can be accessed with the `.json` accessor. A context object looks like the example below. It _always_ contains a `_type` and `id` key.

In [None]:
df.location_stack.json[0].head(1)[0]

**Slicing the json data**  
With the `.json[]` syntax you can slice the array using integers. Instead of integers, dictionaries can also be passed to 'query' the json array. If the passed dictionary matches a context object in the stack, all objects of the stack starting (or ending, depending on the slice) at that object will be returned.

**An example**  
Consider a json array that looks like this:
```json
[{'id': '#document', '_type': 'WebDocumentContext'},
 {'id': 'page-about', '_type': 'SectionContext'},
 {'id': 'main', '_type': 'SectionContext'},
 {'id': 'core-team', '_type': 'SectionContext'},
 {'id': 'contributors', '_type': 'SectionContext'},
 {'id': 'jansentom', '_type': 'SectionContext'},
 {'id': 'contributor-card', '_type': 'SectionContext'}]
```
**Regular slicing**
```python
df.location_stack.json[2:4]
```
For the example array it would return:
```json
[{'id': 'main', '_type': 'SectionContext'},
 {'id': 'core-team', '_type': 'SectionContext'}]
```
**Slicing by querying**  
We want to return only the part of the array starting at the context object that matches this object:
```javascript
{"id": "contributors", "_type": "SectionContext"}
```
The syntax for selecting like this is: 
```python
df.location_stack.json[{"_type": "SectionContext", "id": "contributors"}:]
```
For the example array it would return:
```json
[{'id': 'contributors', '_type': 'SectionContext'},
 {'id': 'jansentom', '_type': 'SectionContext'},
 {'id': 'contributor-card', '_type': 'SectionContext'}]
```
If a location stack does not contain the object, `None` is returned. More info in the API reference: https://objectiv.io/docs/modeling/series/bach.SeriesJsonb.json#bach.SeriesJsonb.json
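The query-based slicing described above can be mimicked on a single array in plain Python. This is a minimal sketch, not Bach's implementation: `matches` and `slice_from` are hypothetical helpers, and the sketch assumes that a context matches when it contains every key/value pair of the query, and that the first matching context is the one used.

```python
def matches(context, query):
    """True if every key/value pair in `query` is present in `context`."""
    return all(context.get(key) == value for key, value in query.items())


def slice_from(stack, query):
    """Return the part of `stack` starting at the first context matching `query`.

    Returns None when no context matches.
    """
    for position, context in enumerate(stack):
        if matches(context, query):
            return stack[position:]
    return None


location_stack = [
    {"id": "#document", "_type": "WebDocumentContext"},
    {"id": "page-about", "_type": "SectionContext"},
    {"id": "main", "_type": "SectionContext"},
    {"id": "core-team", "_type": "SectionContext"},
    {"id": "contributors", "_type": "SectionContext"},
    {"id": "jansentom", "_type": "SectionContext"},
    {"id": "contributor-card", "_type": "SectionContext"},
]

# Regular integer slicing works like any Python list.
regular = location_stack[2:4]

# Slicing by querying: everything from the 'contributors' section onwards.
from_contributors = slice_from(
    location_stack, {"_type": "SectionContext", "id": "contributors"}
)

# A context that is not in the stack yields None.
missing = slice_from(location_stack, {"_type": "LinkContext", "id": "GitHub"})
```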

### location_stack
The `location_stack` column in an Objectiv Frame stores the information on the exact location where the event is triggered in the product. The example used above is the location stack of the contributor card of Tom Jansen on this page: https://objectiv.io/about/.

Because the location information is labeled, validated, and stored using the Open Taxonomy, it can be used to slice and group your product's features in an efficient and easy way. We call this 'feature creation'. See the [section](#Advanced-aggregation-with-dash-app) on feature creation below.

The column is set as an `objectiv_location_stack` type, so location stack specific methods can be used to access the data in the `location_stack`. These methods are available through the `.ls` accessor on the column. The methods are:
* The property accessors [`.ls.navigation_features`](https://objectiv.io/docs/modeling/objectiv/bach_open_taxonomy.SeriesLocationStack.ls#bach_open_taxonomy.SeriesLocationStack.ls), [`.ls.feature_stack`](https://objectiv.io/docs/modeling/objectiv/bach_open_taxonomy.SeriesLocationStack.ls#bach_open_taxonomy.SeriesLocationStack.ls), [`.ls.nice_name`](https://objectiv.io/docs/modeling/objectiv/bach_open_taxonomy.SeriesLocationStack.ls#bach_open_taxonomy.SeriesLocationStack.ls)
* all [methods](https://objectiv.io/docs/modeling/series/bach.SeriesJsonb.json#bach.SeriesJsonb.json) for the json(b) type can also be accessed using `.ls`

The full reference of location stack is [here](https://objectiv.io/docs/taxonomy/location-contexts). An example is queried below:

In [None]:
df.location_stack.head(1)[0]

### global_contexts
The `global_contexts` column in an Objectiv Frame contains all information that is relevant to the logged event. As it is set as an `objectiv_global_context` type, specific methods can be used to access the data in `global_contexts`. These methods are available through the `.gc` accessor on the column. The methods are:
* [`.gc.get_from_context_with_type_series(type, key)`](https://objectiv.io/docs/modeling/objectiv/bach_open_taxonomy.ObjectivStack#bach_open_taxonomy.ObjectivStack)
* The property accessors [`.gc.cookie_id`](https://objectiv.io/docs/modeling/objectiv/bach_open_taxonomy.SeriesGlobalContexts.gc#bach_open_taxonomy.SeriesGlobalContexts.gc), [`.gc.user_agent`](https://objectiv.io/docs/modeling/objectiv/bach_open_taxonomy.SeriesGlobalContexts.gc#bach_open_taxonomy.SeriesGlobalContexts.gc), [`.gc.application`](https://objectiv.io/docs/modeling/objectiv/bach_open_taxonomy.SeriesGlobalContexts.gc#bach_open_taxonomy.SeriesGlobalContexts.gc)  
* all [methods](https://objectiv.io/docs/modeling/series/bach.SeriesJsonb.json#bach.SeriesJsonb.json) for the json(b) type can also be accessed using `.gc`

The full reference of global contexts is [here](https://objectiv.io/docs/taxonomy/global-contexts). An example is queried below:

In [None]:
df.global_contexts.head(1)[0]

# Using the data
## Simple analytics metrics
Now that we are familiar with the data, we first create some straightforward analytics metrics using pandas syntax.

For all examples below, the entire data set is used, but the data is not queried until results are requested in this notebook, for example with the `.to_pandas()` method.

### Unique users per day

In [None]:
unique_users = df.groupby('day').user_id.nunique()
unique_users.to_pandas()

### Average sessions per user per day

In [None]:
sessions = df.groupby('day').session_id.nunique()
avg_sessions_per_user = sessions / unique_users
avg_sessions_per_user.to_pandas()
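
To make explicit what the two metrics above compute, here is a plain-Python sketch on a handful of made-up rows. The data and variable names are hypothetical; the Bach expressions above run the equivalent aggregation in the database.

```python
from collections import defaultdict

# Hypothetical toy rows: (day, user_id, session_id)
hits = [
    ("2021-11-11", "u1", 0),
    ("2021-11-11", "u1", 0),
    ("2021-11-11", "u1", 1),
    ("2021-11-11", "u2", 2),
    ("2021-11-12", "u1", 3),
]

users_per_day = defaultdict(set)
sessions_per_day = defaultdict(set)
for day, user_id, session_id in hits:
    users_per_day[day].add(user_id)        # distinct users, like nunique()
    sessions_per_day[day].add(session_id)  # distinct sessions, like nunique()

unique_users = {day: len(users) for day, users in users_per_day.items()}
avg_sessions_per_user = {
    day: len(sessions_per_day[day]) / unique_users[day] for day in unique_users
}
```

On 2021-11-11 there are three sessions spread over two users, so the average is 1.5 sessions per user.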

### Top features
Calculate the most used features in a selected week. Notice that instead of grouping by a column name, we group on a modified series ([`.ls.nice_name`](https://objectiv.io/docs/modeling/objectiv/bach_open_taxonomy.SeriesLocationStack.ls#bach_open_taxonomy.SeriesLocationStack.ls)). This is possible as long as the (modified) series originates from the same DataFrame, and it allows for more condensed code.

In [None]:
top_features = df[(df.day>='2021-11-11') & 
                  (df.day<='2021-11-17') & 
                  (df.event_type=='ClickEvent')].groupby(df.location_stack.ls.nice_name).session_hit_number.count()

top_features_sorted = top_features.sort_values(ascending=False)
top_features_sorted.head(10)
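
Grouping on a derived value instead of a stored column can be illustrated in plain Python. This is only a sketch on made-up click events: `feature_name` is a hypothetical labelling function and does not reproduce the format of Bach's `nice_name`.

```python
from collections import Counter

# Hypothetical click events, each with a simplified location stack
clicks = [
    {"location_stack": [{"_type": "WebDocumentContext", "id": "#document"},
                        {"_type": "LinkContext", "id": "cta-repo-button"}]},
    {"location_stack": [{"_type": "WebDocumentContext", "id": "#document"},
                        {"_type": "LinkContext", "id": "cta-repo-button"}]},
    {"location_stack": [{"_type": "WebDocumentContext", "id": "#document"},
                        {"_type": "LinkContext", "id": "GitHub"}]},
]


def feature_name(stack):
    """Illustrative label derived from the stack; not Bach's nice_name format."""
    return " > ".join(f"{context['_type']}:{context['id']}" for context in stack)


# The grouping key is computed per row rather than read from a column.
top_features = Counter(feature_name(click["location_stack"]) for click in clicks)
top_10 = top_features.most_common(10)
```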

### Session duration

In [None]:
session_start_end = df.groupby(['day','session_id']).agg({'moment': ['min', 'max']})

# When aggregating, Bach adds the name of the aggregation function to the column it is applied to. Therefore 
# the aggregated data columns can be used as here:
session_start_end['session_duration'] = session_start_end.moment_max - session_start_end.moment_min

# Exclude 0 s sessions (bounces)
avg_session_duration = session_start_end[session_start_end.session_duration>'0'].groupby('day').session_duration.mean()
avg_session_duration.to_pandas()
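
The same calculation can be followed step by step on a few made-up events in plain Python. The data is hypothetical; the sketch only mirrors the logic above: take the first and last moment per session, drop zero-length sessions, and average the remaining durations per day.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Hypothetical toy events: (day, session_id, moment)
events = [
    ("2021-11-11", 0, datetime(2021, 11, 11, 9, 0, 0)),
    ("2021-11-11", 0, datetime(2021, 11, 11, 9, 4, 0)),
    ("2021-11-11", 1, datetime(2021, 11, 11, 10, 0, 0)),  # single hit: a bounce
    ("2021-11-11", 2, datetime(2021, 11, 11, 11, 0, 0)),
    ("2021-11-11", 2, datetime(2021, 11, 11, 11, 2, 0)),
]

moments = defaultdict(list)
for day, session_id, moment in events:
    moments[(day, session_id)].append(moment)

durations_per_day = defaultdict(list)
for (day, _session_id), session_moments in moments.items():
    duration = max(session_moments) - min(session_moments)
    if duration > timedelta(0):  # exclude zero-length sessions (bounces)
        durations_per_day[day].append(duration)

avg_session_duration = {
    day: sum(durations, timedelta(0)) / len(durations)
    for day, durations in durations_per_day.items()
}
```

Sessions 0 and 2 last four and two minutes, the bounce is dropped, and the daily average is three minutes.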

## Working with the location stack

In [None]:
# todo: here the models-hub starts: df.conversion() instead of multiple lines of Bach that are now still below.

### Conversion

In [None]:
conversion_df = df[(df.global_contexts.gc.application == 'objectiv-website') & 
                   (df.day > '2021-11-15')]

In [None]:
# locations that are conversions
conversion_df['conversion_location'] = conversion_df.location_stack.json[{'_type': 'LinkContext', 'id': 'cta-repo-button'}:]
conversion_df['conversion_location'] = conversion_df['conversion_location'].fillna(conversion_df.location_stack.json[{'_type': 'LinkContext', 'id': 'GitHub'}:])

In [None]:
conversion_df[(conversion_df.conversion_location.notnull()) & 
              (conversion_df.event_type == 'ClickEvent')].display_sankey()

In [None]:
# click events for these locations (i.e. conversions)
conversion_successful = conversion_df[(conversion_df.conversion_location.notnull()) & 
                                      (conversion_df.event_type == 'ClickEvent')].groupby('day').user_id.nunique()

In [None]:
# total users per day
conversion_users = conversion_df.groupby('day').user_id.nunique()

In [None]:
conversion_rate_df = conversion_users.to_frame().merge(conversion_successful, how='left', left_index=True, right_index=True, suffixes=('_all', '_converted'))

In [None]:
conversion_rate_df['conversion_rate'] = (conversion_rate_df.user_id_converted / conversion_rate_df.user_id_all).fillna(0.)

In [None]:
conversion_rate_df.to_pandas()
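
The left merge and the final `fillna(0.)` matter for days on which nobody converted. A plain-Python sketch with made-up per-day counts shows the effect; the numbers and names are hypothetical.

```python
# Hypothetical per-day counts of unique users
users_all = {"2021-11-16": 40, "2021-11-17": 50, "2021-11-18": 20}
users_converted = {"2021-11-16": 4, "2021-11-17": 10}  # nobody converted on the 18th

conversion_rate = {}
for day, total in users_all.items():
    # Left merge: every day with users is kept; a missing match becomes None.
    converted = users_converted.get(day)
    if converted is None:
        conversion_rate[day] = 0.0  # the fillna(0.) step
    else:
        conversion_rate[day] = converted / total
```

Without the left merge, 2021-11-18 would disappear from the result instead of showing a conversion rate of 0.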

### Conversion funnel

## Advanced aggregation with dash app

In [None]:
from jupyter_dash import JupyterDash as Dash
from bach_open_taxonomy.sankey_dash import get_app

In [None]:
ff = conversion_df.create_sample_feature_frame('buh2', overwrite=True)

In [None]:
app = get_app(Dash, ff, dash_options={'server_url': 'http://localhost:8050'})
app.run_server(mode='inline', height = 1100, port=8050, host='0.0.0.0')

In [None]:
ff['page'] = ff.location_stack.ls[1:2]

In [None]:
feature_creation_df = ff.apply_feature_frame_sample_changes(conversion_df)

In [None]:
feature_creation_df.head()

# Sampling
One of the key features of Objectiv Bach is that it runs on your full data set. There can, however, be situations where you want to experiment with your data, meaning you have to query the full data set often. This can become slow and/or costly. 

To limit these costs it is possible to do operations on a sample of the full data set. All operations can easily be applied at any time to the full data set if that is desired.

Below we create a sample that randomly selects ~10% of all the rows in the data. A table containing the sampled data is written to the database, so the `table_name` must be provided when creating the sample.

In [None]:
df_sample = df.get_sample(table_name='buh', sample_percentage=10, overwrite=True)

Two new columns are created in the sample.

In [None]:
df_sample['root_location_contexts'] = df_sample.location_stack.json[:1]
df_sample['application'] = df_sample.global_contexts.gc.application
df_sample.head()

Using `.get_unsampled()`, the operations that were done on the sample (the creation of the two columns) are applied to the entire data set:

In [None]:
df_unsampled = df_sample.get_unsampled()
df_unsampled.head()
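
The workflow of developing on a sample and then applying the same steps to all data can be sketched in plain Python. This is a conceptual illustration only and not how Bach implements sampling: the data, the `steps` list, and `apply_steps` are hypothetical, and the sketch assumes each row is kept independently with a fixed probability.

```python
import random

random.seed(0)  # make the sample reproducible

# Hypothetical full data set: one dict per hit
full_data = [
    {"hit": i, "location_stack": [{"id": "#document"}, {"id": f"page-{i % 3}"}]}
    for i in range(1000)
]

# Roughly 10% sample: each row is kept with probability 0.1
sample = [row for row in full_data if random.random() < 0.10]


def add_root_location(row):
    """Add a column holding the first context of the location stack."""
    return {**row, "root_location_contexts": row["location_stack"][:1]}


steps = [add_root_location]  # transformations developed against the sample


def apply_steps(rows, steps):
    """Apply every recorded transformation to every row."""
    for step in steps:
        rows = [step(row) for row in rows]
    return rows


sample_result = apply_steps(sample, steps)        # cheap: small subset only
unsampled_result = apply_steps(full_data, steps)  # same steps, all rows
```

The transformations are defined once and are independent of which rows they run on, which is what makes it possible to experiment cheaply and switch to the full data set afterwards.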

The sample can also be used for grouping and aggregating. The example below counts all hits and the unique event_types in the sample:

In [None]:
df_sample_grouped = df_sample.groupby(['application']).agg({'event_type':'nunique','session_hit_number':'count'})
df_sample_grouped.head()
# todo sampled series does not work/exist, therefore I create a df

As can be seen from the counts, unsampling applies the transformation to the entire data set:

In [None]:
df_unsampled_grouped = df_sample_grouped.get_unsampled()
df_unsampled_grouped.head()

# References
* other notebooks