Setup guide

Configuration file

Both Loader and Mutator use the same configuration file, a self-describing JSON that conforms to the iglu:com.snowplowanalytics.snowplow.storage/bigquery_config/jsonschema/1-0-0 schema and looks like the following:

{
    "schema": "iglu:com.snowplowanalytics.snowplow.storage/bigquery_config/jsonschema/1-0-0",
    "data": {
        "name": "Alpha BigQuery test",
        "id": "31b1559d-d319-4023-aaae-97698238d808",

        "projectId": "com-acme",
        "datasetId": "snowplow",
        "tableId": "events",

        "input": "enriched-good-sub",
        "typesTopic": "bq-test-types",
        "typesSubscription": "bq-test-types-sub",
        "badRows": "bq-test-bad-rows",
        "failedInserts": "bq-test-bad-inserts",

        "load": {
            "mode": "STREAMING_INSERTS",
            "retry": false
        },

        "purpose": "ENRICHED_EVENTS"
    }
}
  • All topics and subscriptions (input, typesTopic, typesSubscription, badRows and failedInserts) are explained in the topics and message formats section; a sketch of creating them is shown after this list
  • projectId is used to group all resources (topics, subscriptions and the BigQuery table)
  • datasetId and tableId (along with projectId) identify the BigQuery table to load
  • name is an arbitrary human-readable description of the storage target
  • id is a unique identifier in UUID format
  • load specifies the loading mode and is explained in the dedicated section below
  • purpose is a standard storage configuration property; the only supported value is ENRICHED_EVENTS
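
For illustration, the Pub/Sub resources referenced in the sample configuration could be created with the gcloud CLI. This is only a sketch: the topic and subscription names come from the example above, and the enriched-good topic name is an assumption standing in for whatever topic your enrichment job publishes good events to.

gcloud pubsub topics create bq-test-types bq-test-bad-rows bq-test-bad-inserts
gcloud pubsub subscriptions create bq-test-types-sub --topic=bq-test-types

# Assumes an existing topic (here called enriched-good) carrying enriched events
gcloud pubsub subscriptions create enriched-good-sub --topic=enriched-good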

Loading mode

BigQuery supports two loading APIs:

  • Streaming inserts
  • Load jobs

To configure BigQuery Loader to use one of the above APIs, use the load property.

For streaming inserts, it can look like the following:

{
    "load": {
        "mode": "STREAMING_INSERTS",
        "retry": false
    }
}

retry specifies whether the loader should retry failed inserts (e.g. those that failed due to mutation lag) indefinitely, or send them straight to the failedInserts topic. Note that with retry enabled, a row that cannot be inserted will be retried forever, which can throttle the whole job to the point that it has to be restarted.

For load jobs, it can look like the following:

{
    "load": {
        "mode": "FILE_LOADS",
        "frequency": 60000
    }
}

frequency specifies how often the load job should be performed, in seconds. Unlike the near-realtime streaming inserts API, load jobs are more batch-oriented.

Note that load jobs do not support retry, just as streaming inserts do not support frequency.

It is generally recommended to stick with the streaming inserts API without retries (and use the Forwarder job to recover data from failedInserts). However, the load jobs API is cheaper and produces far fewer duplicates.

Command line options

All three apps (Loader, Mutator and Forwarder) accept a path to the configuration file described above and a path to an Iglu resolver configuration.

Loader

Loader accepts these two arguments, plus any others supported by Google Cloud Dataflow.

./snowplow-bigquery-loader \
    --config=$CONFIG \
    --resolver=$RESOLVER

This can be launched from any machine authenticated to submit Dataflow jobs.
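
For example, a submission that sets a few common Dataflow options might look like the following. This is a sketch: the runner, project, region and bucket values are placeholders, and the exact set of options depends on your Dataflow environment.

./snowplow-bigquery-loader \
    --config=$CONFIG \
    --resolver=$RESOLVER \
    --runner=DataflowRunner \
    --project=com-acme \
    --region=europe-west1 \
    --tempLocation=gs://com-acme-temp/bq-loader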

Mutator

Mutator has three subcommands: listen, create and add-column.

listen

listen is the primary subcommand, used to automate table migrations.

./snowplow-bigquery-mutator \
    listen \
    --config $CONFIG \
    --resolver $RESOLVER \
    --verbose               # Optional, for debug only

add-column

add-column can be used to add a particular column manually, as a one-off operation. This should eliminate the chance of mutation lag for that column and the need to run the Forwarder job.

./snowplow-bigquery-mutator \
    add-column \
    --config $CONFIG \
    --resolver $RESOLVER \
    --shred-property CONTEXTS \
    --schema iglu:com.acme/app_context/jsonschema/1-0-0

The specified schema must be present in one of the Iglu registries in the resolver configuration.

create

create simply creates an empty table with the atomic structure.

./snowplow-bigquery-mutator \
    create \
    --config $CONFIG \
    --resolver $RESOLVER
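
To verify the result, the table can be inspected with the bq CLI (a sketch, assuming the project, dataset and table names from the sample configuration):

bq show --format=prettyjson com-acme:snowplow.events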

Forwarder

Forwarder, like Loader, can be submitted from any machine authenticated to submit Dataflow jobs.

./snowplow-bigquery-forwarder \
    --config=$CONFIG \
    --resolver=$RESOLVER \
    --failedInsertsSub=$FAILED_INSERTS_SUB

Its only unique option is failedInsertsSub, a subscription (which must be created upfront) containing the failed inserts; see the sketch below.
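
With the names from the sample configuration, that subscription could be created as follows (a sketch; the failed-inserts-sub name is an assumption):

gcloud pubsub subscriptions create failed-inserts-sub --topic=bq-test-bad-inserts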

Note that, by convention, both Dataflow jobs (Loader and Forwarder) accept CLI options in camelCase with an = symbol, while Mutator accepts options in UNIX style (space-separated, without =).

Docker support

All three applications are available as Docker images; an example invocation is sketched after the list.

  • snowplow-docker-registry.bintray.io/snowplow/snowplow-bigquery-loader:0.1.0
  • snowplow-docker-registry.bintray.io/snowplow/snowplow-bigquery-forwarder:0.1.0
  • snowplow-docker-registry.bintray.io/snowplow/snowplow-bigquery-mutator:0.1.0
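
For example, Mutator's listen subcommand could be run from its image as follows. This is a sketch: mounting a service account key and pointing GOOGLE_APPLICATION_CREDENTIALS at it is the standard Google Cloud convention, but the exact paths depend on your environment.

docker run \
    -v /path/to/credentials.json:/credentials.json \
    -e GOOGLE_APPLICATION_CREDENTIALS=/credentials.json \
    snowplow-docker-registry.bintray.io/snowplow/snowplow-bigquery-mutator:0.1.0 \
    listen \
    --config $CONFIG \
    --resolver $RESOLVER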

Partitioning

During initial setup it is strongly recommended to set up partitioning on the derived_tstamp property. Mutator's create subcommand does not add partitioning automatically yet; a possible workaround is sketched below.
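
One way to achieve this is to create the table yourself with the bq CLI before running Loader. This is a sketch: it assumes the project and dataset from the sample configuration and a local atomic_schema.json file with the atomic field definitions, which is not shown here.

bq mk \
    --table \
    --time_partitioning_type=DAY \
    --time_partitioning_field=derived_tstamp \
    com-acme:snowplow.events \
    ./atomic_schema.json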
