# Objectiv example notebook

This demo notebook enables you to play with Bach, our modeling library, to get an idea of what it can do. A live web version is also available [here](https://notebook.objectiv.io/lab?path=product_analytics.ipynb).

A few notes about this example notebook:
* It uses a real dataset from objectiv.io, collected with an unaltered version of Objectiv’s [tracker](https://objectiv.io/docs/tracking/). No cleaning or transformation* has been applied to the data. Objectiv’s tracker uses the [open taxonomy for analytics](https://objectiv.io/docs/taxonomy/) to collect clean data that’s ready to model on.
* You can also generate your own events and use them in this notebook. Check out our [Quickstart Guide](https://www.objectiv.io/docs/quickstart-guide) for instructions.
* It is connected to a PostgreSQL database and runs directly on the full dataset. You can use Pandas-like dataframe operations, which Bach translates to SQL under the hood.
* This notebook demonstrates only a selection of the operations that Bach supports. Check out the [docs](https://objectiv.io/docs/modeling/reference#api-reference) for the full rundown.
* You can also use this notebook for your own website or app once you've instrumented it with Objectiv's tracker.

If you have any questions, please join our [Slack channel](https://join.slack.com/t/objectiv-io/shared_invite/zt-u6xma89w-DLDvOB7pQer5QUs5B_~5pg).

<sub>*for privacy reasons, IPs have been removed and timeframes have been cut from the initial dataset.</sub>

In [None]:
import datetime
import plotly
import plotly.graph_objects as go
import plotly.figure_factory as ff
import plotly.express as px
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from matplotlib.ticker import FuncFormatter
import sqlalchemy
import os

from jupyter_dash import JupyterDash as Dash

# import Objectiv Bach
from bach_open_taxonomy import ObjectivFrame
from bach_open_taxonomy.sankey_dash import get_app

## Connect to full dataset in PostgreSQL

In [None]:
# connect to full postgresql dataset, add database and credentials here
dsn = os.environ.get('DSN', 'postgresql://objectiv:@localhost:5432/objectiv')
engine = sqlalchemy.create_engine(dsn, pool_size=1, max_overflow=0)

## Set the time aggregation 

In [None]:
# choose for which level of time aggregation the rest of the analysis will run
# supports all Postgres datetime template patterns: https://www.postgresql.org/docs/9.1/functions-formatting.html#FUNCTIONS-FORMATTING-DATETIME-TABLE

agg_level = 'YYYYMMDD'
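
To illustrate what the aggregation level does, here is a minimal plain-Python sketch that formats a few timestamps the way the `YYYYMMDD` template would and counts events per period. The `PG_TO_STRFTIME` mapping and the sample timestamps are made up for this illustration; Bach itself passes the pattern on to PostgreSQL.

```python
from collections import Counter
from datetime import datetime

# hypothetical mapping from a few Postgres template patterns to strftime codes
PG_TO_STRFTIME = {
    'YYYYMMDD': '%Y%m%d',  # one bucket per day
    'YYYYMM': '%Y%m',      # one bucket per month
    'YYYY': '%Y',          # one bucket per year
}

agg_level = 'YYYYMMDD'

# made-up event timestamps
moments = [
    datetime(2021, 11, 1, 9, 30),
    datetime(2021, 11, 1, 17, 5),
    datetime(2021, 11, 2, 8, 0),
]

# every event is assigned to the bucket its formatted timestamp falls in
events_per_period = Counter(m.strftime(PG_TO_STRFTIME[agg_level]) for m in moments)
print(dict(events_per_period))
```

Switching `agg_level` to `'YYYYMM'` would put all three events in a single monthly bucket.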

In [None]:
# create a Bach dataframe based on the full dataset
# note that the database is not queried for operations on this dataframe. The database is only queried
# when data is output to the Python environment (i.e. when using .head() or .to_pandas()).

# An ObjectivFrame automatically sets the global contexts and location stack as custom dtypes so we can use them in modeling
# global_contexts and location_stack are JSON-type data columns. Setting custom dtypes extends the functionality
# for easy access to the contents of these columns.

# functions specific to columns of the type 'objectiv_global_context' can be accessed using the `gc` namespace.
# for 'objectiv_location_stack' type columns this is `ls`
timeframe_df = ObjectivFrame.from_table(engine, time_aggregation=agg_level, start_date='2021-10-21')
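
The deferred querying described in the comments above can be pictured with a toy class. This is only a sketch of the general idea and not how Bach is implemented: `LazyFrame`, its methods, and the table name `data` are all hypothetical, and the "query" is merely logged as a string.

```python
class LazyFrame:
    """Toy stand-in that records operations and only 'queries' when data is requested."""

    def __init__(self, table, filters=()):
        self.table = table
        self.filters = tuple(filters)
        self.queries_run = []  # log of SQL strings that would be sent to the database

    def where(self, condition):
        # no query is issued here; a new frame with one more filter is returned
        frame = LazyFrame(self.table, self.filters + (condition,))
        frame.queries_run = self.queries_run
        return frame

    def to_sql(self, limit=None):
        sql = f'SELECT * FROM {self.table}'
        if self.filters:
            sql += ' WHERE ' + ' AND '.join(self.filters)
        if limit is not None:
            sql += f' LIMIT {limit}'
        return sql

    def head(self, n=5):
        # only at this point would the database be hit; here the SQL is just logged
        sql = self.to_sql(limit=n)
        self.queries_run.append(sql)
        return sql


frame = LazyFrame('data').where("day >= '2021-10-21'").where("event_type = 'ClickEvent'")
queries_before = list(frame.queries_run)  # still empty: filtering ran no query
generated_sql = frame.head()
print(generated_sql)
```

Chaining operations only grows the query definition, which is why the notebook advises limiting calls such as `.head()` while building models.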

## Set sample / unsample

In [None]:
# if desired, sample the data to develop models; for demo purposes we skip the sampling and work on the full set
# all underlying data for the dataframe gets queried once in order to create the sample.

# timeframe_df = timeframe_df.get_sample(table_name='basic_features_sample', sample_percentage=10, overwrite=True)

In [None]:
# it is possible to apply all data manipulations on the full data set at any time.
# to unsample the data and run all models below on the full dataset, use:

# timeframe_df = timeframe_df.get_unsampled()

## Explore the data

In [None]:
# only now does the data get queried. It is therefore recommended to limit the use of functions that query the
# database, or to use a sample, when it is not (yet) required to query all data. The Bach documentation always
# indicates when an operation queries the database.
timeframe_df.sort_values(by='moment', ascending=False).head()

## Explore most recent events

In [None]:
# summarize the events generated on a single day (here 2021-11-01)
recent_events = timeframe_df[(timeframe_df['day'] == datetime.date(2021,11,1))]
recent_events = recent_events.groupby([timeframe_df.global_contexts.gc.application, 
                                       'event_type', 
                                       timeframe_df.location_stack.ls.nice_name]).aggregate({'user_id':'nunique'})

recent_events.head(30)

## Users

In [None]:
timeframe_df.model_hub.unique_users().head()

## Sessions

In [None]:
timeframe_df.model_hub.unique_sessions().head()

## Sessions per user

In [None]:
users_sessions = timeframe_df.model_hub.unique_sessions() / timeframe_df.model_hub.unique_users()
users_sessions.head()
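
As a rough picture of what the three cells above compute, the sketch below counts distinct users and sessions per period in plain Python and divides the two. It assumes that `unique_users()` and `unique_sessions()` return distinct counts per time-aggregation period; the hits are made up.

```python
from collections import defaultdict

# hypothetical hits: (day, user_id, session_id)
hits = [
    ('20211101', 'u1', 1),
    ('20211101', 'u1', 1),
    ('20211101', 'u1', 2),
    ('20211101', 'u2', 3),
    ('20211102', 'u1', 4),
]

users_by_day = defaultdict(set)
sessions_by_day = defaultdict(set)
for day, user_id, session_id in hits:
    users_by_day[day].add(user_id)
    sessions_by_day[day].add(session_id)

# dividing the two counts per period mirrors dividing the two series above
sessions_per_user = {
    day: len(sessions_by_day[day]) / len(users_by_day[day])
    for day in sorted(users_by_day)
}
print(sessions_per_user)
```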

## New users

In [None]:
new_users = timeframe_df.model_hub.unique_users(filter=timeframe_df.model_hub.is_first_session())
new_users.head()

In [None]:
timeframe_df.model_hub.unique_users().to_frame().merge(new_users, 
                                                       left_index=True, 
                                                       right_index=True, 
                                                       suffixes=('_total', '_new')).head()
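
The idea behind the new-users filter can be sketched in plain Python. The sketch assumes that `is_first_session()` marks the hits that belong to a user's first session in the data, and that session ids increase over time; the hits and the output key names are hypothetical.

```python
from collections import defaultdict

# hypothetical hits: (day, user_id, session_id), with session ids increasing over time
hits = [
    ('20211101', 'u1', 1),
    ('20211101', 'u2', 2),
    ('20211102', 'u1', 3),  # returning user
    ('20211102', 'u3', 4),  # new user
]

# first session of each user = lowest session id seen for that user
first_session = {}
for _, user_id, session_id in hits:
    first_session[user_id] = min(session_id, first_session.get(user_id, session_id))

total_users = defaultdict(set)
new_users = defaultdict(set)
for day, user_id, session_id in hits:
    total_users[day].add(user_id)
    if session_id == first_session[user_id]:
        new_users[day].add(user_id)

# side-by-side totals per period, similar in spirit to the merge above
summary = {
    day: {'users_total': len(total_users[day]), 'users_new': len(new_users[day])}
    for day in sorted(total_users)
}
print(summary)
```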

## Feature creation

In [None]:
# using Objectiv, you can create features that utilize the context of where they occur on the UI, using the location stack
# while it is possible to use the event_type and location_stack as is to describe individual features,
# the location stack can be leveraged to group and aggregate various features at different levels of location 'depth'
# of your product. 

# choose for which application(s) to create features, in this case we select the Objectiv website
feature_creation_df = timeframe_df[(timeframe_df.global_contexts.gc.application == 'objectiv-website')]

# limit the timerange to match the latest taxonomy version applied as example on the website
feature_creation_df = feature_creation_df[(feature_creation_df['moment'] >= datetime.date(2021,11,15))]

# first, create a feature frame that will be used to create features
# todo skip this part now
# feature_frame = feature_creation_df.create_sample_feature_frame(table_name='feature_sample', overwrite=True)
# feature_frame.head()
feature_frame = feature_creation_df

**Feature creation by slicing the location stack**  
The `.json[]` syntax of location stacks allows you to slice with integers, but dictionaries can be passed as well. If a dictionary matches
a context object in the stack, all objects of the stack starting at that object are returned.  
  
**An example**  
We want to return only the location stack subsets that contain this object:
```javascript
{"id": "contributors", "_type": "SectionContext"}
```
This means that if a location stack looks like this:
```json
[{"id": "#document", "_type": "WebDocumentContext"},
 {"id": "main", "_type": "SectionContext"},
 {"id": "core-team", "_type": "SectionContext"},
 {"id": "contributors", "_type": "SectionContext"},
 {"id": "jansentom", "_type": "SectionContext"},
 {"id": "contributor-card", "_type": "SectionContext"}]
```
The returned location stack looks like this:
```json
[{"id": "contributors", "_type": "SectionContext"},
 {"id": "jansentom", "_type": "SectionContext"},
 {"id": "contributor-card", "_type": "SectionContext"}]
```
If a location stack does not contain the object, `None` is returned. The syntax for this kind of selection is:
```python
feature_frame["contributors_features"] = feature_frame.location_stack.json[{"_type": "SectionContext", "id": "contributors"}:]
```

Now we want to create a location stack that only contains the first object of this stack. This is useful, for example, if you are not interested in clicks on individual contributors but want to aggregate clicks on all of them. This can be done by using slices:
```python
feature_frame["contributors_aggregated"] = feature_frame.contributors_features.json[:1]
```
result:
```json
[{"id": "contributors", "_type": "SectionContext"}]
```
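
The slicing behaviour described above can be mimicked on an ordinary Python list. This is a minimal sketch, not Bach's implementation: `slice_from` is a hypothetical helper, and it assumes a dictionary matches a context object when all of its key/value pairs are present in that object.

```python
from typing import Optional


def slice_from(stack: list, selector: dict) -> Optional[list]:
    """Return the stack from the first object matching `selector`, else None."""
    for position, context in enumerate(stack):
        # an object matches when it contains every key/value pair of the selector
        if all(context.get(key) == value for key, value in selector.items()):
            return stack[position:]
    return None


location_stack = [
    {"id": "#document", "_type": "WebDocumentContext"},
    {"id": "main", "_type": "SectionContext"},
    {"id": "core-team", "_type": "SectionContext"},
    {"id": "contributors", "_type": "SectionContext"},
    {"id": "jansentom", "_type": "SectionContext"},
    {"id": "contributor-card", "_type": "SectionContext"},
]

# everything from the 'contributors' section onwards
contributors_features = slice_from(
    location_stack, {"_type": "SectionContext", "id": "contributors"}
)
# only the first object of that sub-stack
contributors_aggregated = contributors_features[:1]

print(contributors_features)
print(contributors_aggregated)
```

Because every click inside the contributors section ends up with the same one-element stack in `contributors_aggregated`, grouping on that column aggregates all contributors together.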



**Feature creation with a Dash app**  
Using a Dash app, you can visualize all events with their location stack and create features.

The database is queried for this, to get all unique features.

As an example, we'll create the following features:
1. the job announcement bar that is on both the Home & About pages  
2. conversion, in this case going to the GitHub repo
3. contributor features  
4. aggregate of all contributors

In [None]:
# create the four features listed above
feature_frame['announcement_bar_features'] = feature_frame.location_stack.json[{'_type': 'SectionContext', 'id': 'announcement-bar'}:]

feature_frame['conversion'] = feature_frame.location_stack.json[{'_type': 'LinkContext', 'id': 'cta-repo-button'}:]
feature_frame['conversion'] = feature_frame['conversion'].fillna(feature_frame.location_stack.json[{'_type': 'LinkContext', 'id': 'GitHub'}:])

feature_frame['contributors_features'] = feature_frame.location_stack.json[{'_type': 'SectionContext', 'id': 'contributors'}:]

# this returns the stack of 'contributors_features' up to the first object in the stack (and therefore aggregates all
# following objects in the stack)
feature_frame['contributors_aggregated'] = feature_frame.contributors_features.json[:1]
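
The `fillna` step for the `conversion` feature acts as a fallback: rows where the first selector finds nothing get the result of the second selector. A plain-Python sketch of that logic follows; `slice_from` and `conversion_feature` are hypothetical helpers, the matching rule is an assumption, and the stacks are made up.

```python
def slice_from(stack, selector):
    """Return the stack from the first object matching `selector`, else None."""
    for position, context in enumerate(stack):
        if all(context.get(key) == value for key, value in selector.items()):
            return stack[position:]
    return None


def conversion_feature(stack):
    # prefer the call-to-action button; fall back to the GitHub link when it is absent
    primary = slice_from(stack, {'_type': 'LinkContext', 'id': 'cta-repo-button'})
    if primary is not None:
        return primary
    return slice_from(stack, {'_type': 'LinkContext', 'id': 'GitHub'})


# hypothetical location stacks
cta_click = [{'_type': 'SectionContext', 'id': 'main'},
             {'_type': 'LinkContext', 'id': 'cta-repo-button'}]
nav_click = [{'_type': 'SectionContext', 'id': 'navbar-top'},
             {'_type': 'LinkContext', 'id': 'GitHub'}]
other_click = [{'_type': 'SectionContext', 'id': 'main'},
               {'_type': 'LinkContext', 'id': 'docs'}]

results = [conversion_feature(stack) for stack in (cta_click, nav_click, other_click)]
print(results)
```

Only stacks that match neither selector stay empty, so a non-null `conversion` value marks a hit on either of the two links.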

**Visualizing the stack**  
Now we can visualize the location stack. You can select the features with 'Location stack column to visualize'. The width of the links indicates the number of hits (given the selected event type). The number of hits is also displayed when hovering over a node.  

It is also possible to create features in the tool, by clicking nodes or by slicing the selected location stack. Clicking 'Add to Feature Frame' adds the feature to the feature frame.  
  
Try selecting the features we just created. With the event type 'ClickEvent' selected, switching between 'contributors_features' and 'contributors_aggregated' shows how the clicks on individual contributors are aggregated.  
  
Try recreating the features above in the sankey tool, starting from the 'location_stack' column as 'Location stack column to visualize'.

In [None]:
# todo skip this part now
# app = get_app(Dash, feature_frame, dash_options={'server_url': 'http://localhost:8053'})
# app.run_server(mode='inline', height = 1100, port=8053, host='0.0.0.0')

In [None]:
# if you are happy with the result, write these created features to the original working df
# feature_creation_df = feature_creation_df.apply_feature_frame_sample_changes(feature_frame)
feature_creation_df.head()

## Features

In [None]:
# select the features we just created
created_features = feature_creation_df[(feature_creation_df.conversion.notnull()) | 
                                (feature_creation_df.announcement_bar_features.notnull()) |
                                (feature_creation_df.contributors_features.notnull()) |
                                (feature_creation_df.contributors_aggregated.notnull())]

# get the number of total users and hits per feature
users_per_event = created_features.groupby(['event_type', created_features.location_stack.ls.nice_name]).aggregate({'user_id':'nunique','session_hit_number':'count'})

users_per_event.sort_values(by=['user_id_nunique'], ascending=False).head(10)

## Conversion

In [None]:
feature_creation_df.add_conversion_event(conversion_stack=feature_creation_df.conversion, 
                                         conversion_event='ClickEvent')

In [None]:
feature_creation_df.conversion_events

In [None]:
feature_creation_df.model_hub.unique_users(filter=feature_creation_df.model_hub.conversion('conversion_1')).head()
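
What the conversion filter selects can be sketched in plain Python. The sketch assumes that a hit counts as a conversion when its event type equals the configured `conversion_event` and its `conversion` stack is not null; the hits below are made up.

```python
from collections import defaultdict

# hypothetical hits: (day, user_id, event_type, conversion), where `conversion` is the
# sliced location stack from the feature-creation step, or None when nothing matched
hits = [
    ('20211115', 'u1', 'ClickEvent', [{'_type': 'LinkContext', 'id': 'cta-repo-button'}]),
    ('20211115', 'u1', 'ClickEvent', None),
    ('20211115', 'u2', 'SectionVisibleEvent', [{'_type': 'LinkContext', 'id': 'GitHub'}]),
    ('20211116', 'u3', 'ClickEvent', [{'_type': 'LinkContext', 'id': 'GitHub'}]),
]

CONVERSION_EVENT = 'ClickEvent'

converted_users = defaultdict(set)
for day, user_id, event_type, conversion in hits:
    # both conditions must hold: the right event type on the right location
    if event_type == CONVERSION_EVENT and conversion is not None:
        converted_users[day].add(user_id)

converted_per_day = {day: len(users) for day, users in sorted(converted_users.items())}
print(converted_per_day)
```

User `u2` is not counted because the event on the GitHub link was not a click.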