Command-line tool for Google Cloud BigQuery.
Provides two core features:
- Sync: will copy/synchronise Google Analytics session tables between different datasets/projects. For example, moving Google Analytics data from a US dataset into the EU.
- Materialize: executes a statement and outputs to a table. Useful for materializing views/complex intermediate tables.
The destination project must have a dataset (with the same name as the source) that already exists. It will look for any session tables that are missing from the destination dataset and replicate the
--number of most recent ones. We use the
--number parameter to help incrementally copy very large datasets over time.
export GCLOUD_PROJECT="source-project-id" export GOOGLE_APPLICATION_CREDENTIALS="./service-account-key.json" export JVM_OPTS="-Dlogback.configurationFile=./logback.example.xml" java $JVM_OPTS -cp big-replicate-standalone.jar \ uswitch.big_replicate.sync \ --source-project source-project-id \ --source-dataset 98909919 \ --destination-project destination-project-id \ --destination-dataset 98909919 \ --table-filter "ga_sessions_\d+" \ --google-cloud-bucket gs://staging-data-bucket \ --number 30
Because only missing tables from the destination dataset are processed tables will not be overwritten.
The example above is intended for replicating Google Analytics BigQuery data from one project to another. It works by:
--table-filterto only replicate tables matching the expected
ga_sessions_\d+filter. This can be any valid Java regular expression.
--numberrestricts the replication to 30 tables. Tables are reverse ordered lexicographically by the tool.
This ensures that the most recent 30 days of tables that don't exist (in the
--destination-dataset) but do in the sources will be replicated.
big-replicate will run multiple extract and loads concurrently- this is currently set to the number of available processors (as reported by the JVM runtime). You can override this with the
--number-of-agents flag. Since no processing is performed client-side (all operations are BigQuery jobs) its safe to set this well above the processor count.
We often use views to help break apart more complex queries, building join tables between datasets etc. The
materialize operation executes a statement and stores the output in a table.
export GCLOUD_PROJECT="source-project-id" export GOOGLE_APPLICATION_CREDENTIALS="./service-account-key.json" export JVM_OPTS="-Dlogback.configurationFile=./logback.example.xml" echo "SELECT * FROM [dataset.sample_table]" | java $JVM_OPTS \ -cp big-replicate-standalone.jar \ uswitch.big_replicate.materialize \ --project-id destination-project-id \ --dataset-id destination-dataset-id \ --table-id destination-table \ --force
Copyright © 2016 uSwitch
Distributed under the Eclipse Public License either version 1.0 or (at your option) any later version.