{% docs snowplow_media_player %}

{% raw %}

Snowplow Media Player Package

Welcome to the documentation site for the Snowplow Media Player dbt package. The package is built as an extension of the dbt-snowplow-web package. It transforms raw media player event data, generated by the Snowplow JavaScript tracker in combination with media tracking plugins such as the Media Tracking plugin or the YouTube Tracking plugin, into derived tables for easier querying.

To keep the documentation of the two packages separate and concise, this guide assumes the reader is already familiar with configuring and running the web model. It only explains how to operate the media player package in conjunction with the web model, as the media model was designed to be run together with it. Please refer to the snowplow-web dbt doc site for a full breakdown of that package and how to set it up.

Please note that the media player package is not compatible with the Flutter tracker as it is reliant on the Snowplow JavaScript tracker.

Note this doc site is linked to the latest release of the package. If you are not using the latest release, generate and serve the doc site locally for accurate documentation.

Overview

This package consists of a series of dbt models with the goal to produce the following main aggregated models from the raw media player events and relevant contexts:

  • snowplow_media_player_base: This derived table summarises the key media player events and metrics of each media element on a media_id and pageview level which is considered as a base aggregation level for media interactions.

  • snowplow_media_player_plays_by_pageview: This view removes impressions from the '_base' table to summarise media plays on a page_view by media_id level.

  • snowplow_media_player_media_stats: This derived table aggregates the '_base' table to individual media_id level, calculating the main KPIs and overall video/audio metrics.

The package is built on top of the dbt-snowplow-web package taking that as a basis to carry out the incremental update. It is designed to be run together with the web model in a similar manner to how a custom module would run:

The _interactions_this_run table takes the snowplow_web_base_events_this_run table generated by the web package as an input then adds the various contexts to enrich the base table with the additional media related fields. It could be used for custom models for more in-depth event level derived tables and further analysis.

The _base_this_run table then aggregates the _interactions_this_run table to media_id and pageview level and serves as the basis for the incrementalised derived table snowplow_media_player_base.

The main _media_stats derived table is also updated incrementally, based on the snowplow_media_player_base derived table. However, it does not use the snowplow_incremental materialization; it uses the native dbt incremental materialization on a pageview basis after a set time window has passed. This avoids complex and expensive queries for metrics that need to take all events of a page view into account. This way the metrics are only calculated once per pageview / media, after no new events are expected.

The additional _pivot_base table is there to calculate the percent_progress boundaries and weights that are used to calculate the total play_time and other related media fields.
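As a rough illustration of what boundaries and weights mean here, the Python sketch below credits each boundary with the stretch of media since the previous boundary and sums the weights of the boundaries that were reached. The weighting scheme and the function names are assumptions made for illustration only; the actual calculation in the _pivot_base model may differ.

```python
def boundary_weights(boundaries):
    """Map each boundary to the share of the media (in percent) credited to it.

    Assumption: a boundary is credited with the span since the previous one.
    """
    weights = {}
    previous = 0
    for boundary in sorted(boundaries):
        weights[boundary] = boundary - previous
        previous = boundary
    return weights


def estimated_play_time_sec(reached, boundaries, duration_sec):
    """Estimate play time from the set of boundaries that were reached."""
    weights = boundary_weights(boundaries)
    played_percent = sum(weights[b] for b in set(reached))
    return duration_sec * played_percent / 100


# A 200 second video where only the 10% and 25% boundaries fired.
print(estimated_play_time_sec([10, 25], [10, 25, 50, 75, 100], 200))
```

Under this assumed scheme a viewer who stops at 49% is still only credited with 25%, which is one way to see why tracking more boundaries makes the play time estimate more accurate.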

Adapter Support

The Snowplow Media Player v0.2.0 package currently supports BigQuery, Databricks, Postgres, Redshift and Snowflake.

Requirements

Installation

Check dbt Hub for the latest installation instructions, or read the dbt docs for more information on installing packages. If you already have the web package installed you might need to upgrade it in order for it to be compatible with the media_player package requirements. Otherwise it is enough to install the media_player package as it will add the web package automatically due to the dependencies.

Configuration

1. Setting up variables

In general, when adding new variables to the dbt project we have to be careful around scoping the variables appropriately, especially when using multiple packages, which is the case when running the snowplow media player package. You can read more about variable scoping in dbt's docs around variable precedence.

In this particular case, despite being separated, running the two packages happens in sync. Although we try and name our package variables uniquely across all Snowplow dbt packages, when making any changes to them it's best to keep them separate in their appropriate scoping level. In other words, variables introduced in the web model should be set under snowplow_web and the same goes for the media_player related variables as illustrated below:

# dbt_project.yml
...
vars:
  snowplow_web:
    snowplow__backfill_limit_days: 60
  snowplow_media_player:
    snowplow__percent_progress_boundaries: [20, 40, 60, 80]

Media Player specific variables:

snowplow__percent_progress_boundaries: [10, 25, 50, 75]

The default list of percent progress values. It needs to be aligned with the values tracked by the tracker. The more percent progress boundaries are tracked, the more accurate the play time calculations become. Tracking 100% is unnecessary: there is a separate ended event, which the model equates to reaching 100%, and 100 is added to this list automatically if it is missing (refer to the helper macro get_percentage_boundaries (source) for details).
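The automatic addition of 100 can be pictured with a small Python stand-in. The function name with_completion_boundary is hypothetical, and the sorting and de-duplication are assumptions; the real get_percentage_boundaries macro is written in Jinja and should be checked in the package source.

```python
def with_completion_boundary(tracked_boundaries):
    """Return the boundaries in ascending order, adding 100 if it is missing."""
    boundaries = sorted(set(tracked_boundaries))
    if 100 not in boundaries:
        boundaries.append(100)
    return boundaries


print(with_completion_boundary([10, 25, 50, 75]))
print(with_completion_boundary([20, 40, 60, 80, 100]))
```

Either way the resulting list ends in 100, so the ended event always has a boundary to map to.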

snowplow__valid_play_sec: 30

The minimum number of seconds that a media play needs to last to consider that interaction a valid play. The default is 30 seconds (based on the YouTube standard) but it can be modified here, if needed.

snowplow__complete_play_rate: 0.99

The share of a media element that needs to be played for the play to be considered complete. The default is 0.99 (99%), but it can be increased to 1 (or decreased) depending on the use case.
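To make the two thresholds concrete, here is a minimal Python sketch that applies them to a single play. The function, its field names and the inclusive comparisons are assumptions for illustration; the model's own SQL may define these flags differently.

```python
def classify_play(play_time_sec, duration_sec, valid_play_sec=30, complete_play_rate=0.99):
    """Flag a play as valid and/or complete using the two package variables."""
    # Assumption: a play is valid once it lasts at least valid_play_sec seconds.
    is_valid_play = play_time_sec >= valid_play_sec
    # Assumption: completeness compares played seconds with the media duration.
    is_complete_play = duration_sec > 0 and play_time_sec / duration_sec >= complete_play_rate
    return {"is_valid_play": is_valid_play, "is_complete_play": is_complete_play}


print(classify_play(play_time_sec=45, duration_sec=200))
print(classify_play(play_time_sec=10, duration_sec=10))
```

The second call shows that, under these assumptions, a very short clip can be played completely without counting as a valid play, which may be a reason to lower snowplow__valid_play_sec for short-form media.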

snowplow__max_media_pv_window: 10

The number of hours that need to pass before new page_view level media player metrics from the snowplow_media_player_base table are safe to be processed downstream in the snowplow_media_player_media_stats table. Please note that even if new events arrive later on (e.g. new percentprogress events are fired, indicating a potential replay) and the snowplow_media_player_base table changes, the model will not update them in the media_stats table. It is therefore safer to set this to as large a number as is still convenient for analysis and reporting.
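The window check can be sketched in Python as follows. It assumes the window is measured from the start of the page view and that the comparison is inclusive; both are illustrative assumptions rather than the model's exact SQL, and the function name is hypothetical.

```python
from datetime import datetime, timedelta


def is_ready_for_media_stats(pageview_start, now, max_media_pv_window=10):
    """True once at least max_media_pv_window hours have passed since the page view started."""
    return now - pageview_start >= timedelta(hours=max_media_pv_window)


start = datetime(2022, 6, 1, 8, 0)
print(is_ready_for_media_stats(start, datetime(2022, 6, 1, 17, 59)))
print(is_ready_for_media_stats(start, datetime(2022, 6, 1, 18, 0)))
```

A page view that is picked up once the window has passed is aggregated a single time, so any media events arriving after that point are not reflected in media_stats.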

snowplow__enable_youtube: false

Set to true if the YouTube context schema is enabled. This variable is used to handle syntax depending on whether the context fields are available in the database or not.

snowplow__enable_whatwg_media: false

Set to true if the HTML5 media element context schema is enabled. This variable is used to handle syntax depending on whether the context fields are available in the database or not.

snowplow__enable_whatwg_video: false

Set to true if the HTML5 video element context schema is enabled. This variable is used to handle syntax depending on whether the context fields are available in the database or not.

2. Configuring the web model (in case it has not been run before)

Please refer to the Quick Start guide within the snowplow-web dbt doc site to make sure you configure the web model appropriately (e.g. by checking the source data or enabling the desired contexts).

One thing to highlight here: as the package is built on the snowplow_incremental materialization logic provided by the web package, please leave the snowplow__incremental_materialization variable at its default value, snowplow_incremental.

3. Adding the selector.yml file

Within this package we have provided a suite of suggested selectors to run and test the models within the package together with the web model. This leverages dbt's selector flag.

The selectors include:

  • snowplow_web: Recommended way to run the package. This selection includes all models within the Snowplow Web and Media Player package as well as any custom models you have created that are tagged with 'snowplow_web_incremental'. This is the same as in the web package.
  • snowplow_web_lean_and_media_player_tests: Recommended way to test the models within the web and media player packages. There are other selectors for testing; please see the Tests section for more details.

These are defined in the selectors.yml file (source) within the package, however in order to use these selections you will need to copy this file into your own dbt project directory. This is a top-level file and therefore should sit alongside your dbt_project.yml file.

Operation

Due to its unique relationship with the web package, in order to operate the media player package together with the web model there are several considerations to keep in mind. Depending on the use case one of the following scenarios may happen:

  1. The web package is already being used and the media tracking package needs to be added at a later time.
  2. The web package has not been used but it needs to be run together with the media player package.
  3. Only the media player package needs to be run.

1. Adding the media player data model to an existing dbt project with web model data already running

Suppose months of data have already been collected using the web package and media tracking is introduced at a later stage. In that case there is no need to fully reprocess the web data from the date media tracking was deployed; only the base module needs to be rerun from that date.

As models from both packages need to be run in sync, the models from the new package have to be backfilled first. Please note that no new web data is processed during the backfill. Depending on the configured snowplow__backfill_limit_days and the period that needs backfilling, it can take a while for all models to sync up and for new web events to be processed.

To begin the syncing process, please run the following command:

dbt run -m snowplow_web.base snowplow_media_player --vars 'snowplow__start_date: <date_when_media_player_tracking_starts>'

This way only the base module, which is one of the main sources for the media player package, is reprocessed. The web model's update logic should recognise the new media player models (as all are tagged with snowplow_web_incremental), and backfilling should start between the date you defined in snowplow__start_date and the upper limit defined by the snowplow__backfill_limit_days variable that is set for the web model.

Snowplow: New Snowplow incremental model. Backfilling
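To get a feel for how many runs a backfill may take, the Python sketch below splits the period into consecutive windows. It assumes, as a simplification, that each run advances by at most snowplow__backfill_limit_days until the new models reach the point the web models had already processed; the function name and the sync_target parameter are hypothetical, and the web package's real limit logic is more involved.

```python
from datetime import date, timedelta


def backfill_windows(start_date, sync_target, backfill_limit_days):
    """List the (lower, upper) date windows processed by successive backfill runs."""
    if backfill_limit_days <= 0:
        raise ValueError("backfill_limit_days must be positive")
    windows = []
    lower = start_date
    while lower < sync_target:
        # Each run covers at most backfill_limit_days, capped at the sync target.
        upper = min(lower + timedelta(days=backfill_limit_days), sync_target)
        windows.append((lower, upper))
        lower = upper
    return windows


for window in backfill_windows(date(2022, 1, 1), date(2022, 3, 15), 30):
    print(window)
```

With these example dates, three runs are needed before new web events are processed again.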

You can overwrite this limit for this backfilling process temporarily while it lasts, if needed:

# dbt_project.yml
...
vars:
  snowplow_web:
    snowplow__backfill_limit_days: 1

After this you should be able to see all media_player models added to the derived.snowplow_web_incremental_manifest table. Any subsequent run from this point onwards can be carried out using the recommended web model running method, the snowplow_web selector, which automatically includes all media player models as they are within the project directory and are all tagged with snowplow_web_incremental.

dbt run --selector snowplow_web

As soon as backfilling finishes, running the model results in both the web and the media player models being updated during the same run for the same period, both using the same latest set of data from the _base_events_this_run table.

2. Starting both the media and web model from scratch

This is the easiest of the three scenarios to implement. As the snowplow_web_incremental_manifest table is new, all models from both packages (plus any custom modules tagged with snowplow_web_incremental) will be processed using the recommended web model running method, the snowplow_web selector, without any extra step.

dbt run --selector snowplow_web

3. Only running the media player package from the same dbt project

Although the media player package is not designed for standalone usage, there can be scenarios where only the media player models are targeted for the update, not the web model. In such a case the web model still has to be configured, with the main difference that all modules the media player model does not rely on need to be disabled. Essentially only the base module is needed, so please disable all the rest like so:

# dbt_project.yml
...
models:
  snowplow_web:
    page_views:
      enabled: false
    sessions:
      enabled: false
    user_mapping:
      enabled: false
    users:
      enabled: false

Running it, however, can still be achieved with the selector defined in the web model. For this to work, copy the selectors.yml file (source) from the package to your dbt project's main directory.

dbt run --selector snowplow_web

After the run finishes, you should see only the media player related models in the snowplow_web_incremental_manifest table.

Tests

Following the logic defined in the web package, the media player package also contains tests for both the scratch and derived models. Depending on your use case you might not want to run all tests in production, for example to save costs. There are several tags included in the package to help select subsets of tests. Tags:

  • this_run: Any model with the _this_run suffix
  • scratch: Any model in the scratch sub directories.
  • derived: Any of the derived models i.e. media_stats.
  • primary-key: Any test on the primary keys of all models in this package.

The recommended approach to testing the web model is using the selector flag 'snowplow_web_lean_tests'. When running the web model together with the media_player model, the recommendation is to use the snowplow_web_lean_and_media_player_tests. In order for it to work, please copy the selectors.yml file to the main project directory (source).

dbt test --selector snowplow_web_lean_and_media_player_tests

This is equivalent to running the lean tests on the web-model as well as all media_player tests and any tests on any custom models tagged 'snowplow_media_player'.

Alternatively, if you wanted to run all available tests in both the Snowplow Web and Media Player package (plus any tests on any custom models tagged 'snowplow_media_player') you can run snowplow_web_and_media_player_tests:

dbt test --selector snowplow_web_and_media_player_tests

In case only the media model's tests (and any custom module tagged with 'snowplow_media_player') need to be run you can use snowplow_media_player_tests:

dbt test --selector snowplow_media_player_tests

Custom models

There are two custom models included in the package which could potentially be used in downstream models:

  1. the snowplow_media_player_session_stats table, which aggregates the snowplow_media_player_base table on a session level

  2. the snowplow_media_player_user_stats table, which aggregates the snowplow_media_player_session_stats to user level

By default these are disabled, but you can enable them in the project's dbt_project.yml, if needed.

# dbt_project.yml
...
  models:
    snowplow_media_player:
      custom:
        enabled: true

Just like in the case of the web model, users are encouraged to use the Media Player model and its incremental logic to design their own custom models / modules. The snowplow_media_player_interactions_this_run table is designed with this in mind: it generates a couple of potentially useful fields that the Media Player model does not use downstream, but which can be incorporated into users' custom models.

One such example is the player_current_time, which is the playback position of a specific media in seconds whenever a media player event is fired, from which more precise time-based calculations could be made.

For example, the player_current_time of the subsequent event could be used to derive the player_end_time of an event. Subtracting the two yields a calculated play time, instead of basing it on the percentprogress events as the Media Player model does:

with interaction_ends as (

    select
        event_id,
        event_type,
        start_tstamp,
        -- timestamp of the next event within the same play
        lead(start_tstamp, 1) over (partition by play_id order by start_tstamp) as end_tstamp,
        player_current_time,
        -- playback position (in seconds) at the next event within the same play
        lead(player_current_time, 1) over (partition by play_id order by start_tstamp) as player_end_time

    from scratch.snowplow_media_player_interactions_this_run

)

select
    event_id,
    player_end_time - player_current_time as play_time_sec_calculated

from interaction_ends

where event_type = 'play'
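The same pairing logic can be traced on a handful of made-up events in Python. The event dictionaries below are hypothetical sample data (with start_tstamp simplified to seconds), and the function name is illustrative. Unlike the query, which returns a null play time for a play event without a following event, this sketch simply leaves such events out.

```python
from collections import defaultdict


def calculated_play_times(events):
    """Return {event_id: seconds played} for every 'play' event that has a successor."""
    by_play = defaultdict(list)
    for event in events:
        by_play[event["play_id"]].append(event)

    play_times = {}
    for play_events in by_play.values():
        play_events.sort(key=lambda e: e["start_tstamp"])
        # Pair each event with the next one in the same play, like lead() in the query.
        for current, following in zip(play_events, play_events[1:]):
            if current["event_type"] == "play":
                play_times[current["event_id"]] = (
                    following["player_current_time"] - current["player_current_time"]
                )
    return play_times


events = [
    {"event_id": "e1", "play_id": "p1", "event_type": "play", "start_tstamp": 0, "player_current_time": 0.0},
    {"event_id": "e2", "play_id": "p1", "event_type": "pause", "start_tstamp": 42, "player_current_time": 42.0},
    {"event_id": "e3", "play_id": "p1", "event_type": "play", "start_tstamp": 60, "player_current_time": 42.0},
    {"event_id": "e4", "play_id": "p1", "event_type": "ended", "start_tstamp": 78, "player_current_time": 60.0},
]
print(calculated_play_times(events))
```

Here the first play event is credited with 42 seconds and the second with 18 seconds, regardless of which percent progress boundaries were crossed in between.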

Join the Snowplow community

We welcome all ideas, questions and contributions!

For support requests, please use our community support Discourse forum.

If you find a bug, please report an issue on GitHub.

Copyright and license

The snowplow-media-player package is Copyright 2022 Snowplow Analytics Ltd.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this software except in compliance with the License.

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

{% endraw %} {% enddocs %}