Changing and overriding default behaviour

Fabien Taysse edited this page Dec 14, 2022 · 8 revisions

How it works

When fired, the Cloud Function creates a BigQuery load job for the triggering file.

This job will be auto-configured with sensible default options, but you can alter these options using:

  • Environment variables
  • Custom metadata on the file
  • Mapping files

Environment variables

These environment variables can be set when deploying the Cloud Function to override the default behaviour without editing mappings:

  • PROJECT_ID: GCP Project ID

    String, mandatory, no default value

  • DATASET_ID: Default dataset for the destination table

    String, defaults to Staging

  • CREATE_DISPOSITION: Should new tables be automatically created

    CREATE_IF_NEEDED|CREATE_NEVER, defaults to CREATE_IF_NEEDED

  • WRITE_DISPOSITION: How new data for an existing table should be processed

    WRITE_TRUNCATE|WRITE_APPEND|WRITE_EMPTY, defaults to WRITE_APPEND

  • ENCODING: Encoding of the file

    UTF-8|ISO-8859-1, defaults to UTF-8

  • DRY_RUN: Dry run

    True|False, defaults to False
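
As a rough illustration of how the defaults listed above combine with deployment-time overrides, here is a minimal Python sketch. It is not the Cloud Function's actual code: the helper name `resolve_settings` is hypothetical, and rejecting values outside the listed choices is an assumption (the real function may be more lenient).

```python
import os

# Defaults as documented above. PROJECT_ID is mandatory and has no default.
DEFAULTS = {
    "DATASET_ID": "Staging",
    "CREATE_DISPOSITION": "CREATE_IF_NEEDED",
    "WRITE_DISPOSITION": "WRITE_APPEND",
    "ENCODING": "UTF-8",
    "DRY_RUN": "False",
}

# Allowed values as documented above (DATASET_ID is a free-form string).
ALLOWED = {
    "CREATE_DISPOSITION": {"CREATE_IF_NEEDED", "CREATE_NEVER"},
    "WRITE_DISPOSITION": {"WRITE_TRUNCATE", "WRITE_APPEND", "WRITE_EMPTY"},
    "ENCODING": {"UTF-8", "ISO-8859-1"},
    "DRY_RUN": {"True", "False"},
}


def resolve_settings(environ=None):
    """Hypothetical helper: overlay environment variables on the documented defaults."""
    environ = os.environ if environ is None else environ
    if not environ.get("PROJECT_ID"):
        raise ValueError("PROJECT_ID is mandatory and has no default value")
    settings = {"PROJECT_ID": environ["PROJECT_ID"]}
    for name, default in DEFAULTS.items():
        value = environ.get(name, default)
        # Assumption: values outside the documented choices are rejected.
        if name in ALLOWED and value not in ALLOWED[name]:
            raise ValueError(f"{name}={value!r} is not one of {sorted(ALLOWED[name])}")
        settings[name] = value
    return settings
```

Calling `resolve_settings({"PROJECT_ID": "my-project", "WRITE_DISPOSITION": "WRITE_TRUNCATE"})` keeps every default except the write disposition, which mirrors setting a single variable at deployment.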

Mapping files

Mapping files let you define options for a specific file, or for all files matching a specific pattern.

A mapping file is a handlebars template file that renders to a JSON document defining a JobConfiguration object.

See https://cloud.google.com/bigquery/docs/reference/rest/v2/Job#JobConfigurationLoad for the complete list of configuration options

Templating engine

The handlebars templating engine is mainly used for:

  • Variable substitution
  • Conditional generation of sections

Additional variables are made available to the templating engine:

  • Environment variables are available under the env prefix.
  • The file that triggered the Cloud Function is available as the file variable.

A few handlebars helpers are pre-loaded and available for use in the mapping files templates:

  • The complete collection of helpers from the handlebars-helpers module

  • regex-match

    Enclosed section will be rendered only if variable value matches the pattern. Named groups are made available as the new context inside the block.

    {{#regex-match variable pattern}}
       // ... content
    {{/regex-match}}
  • assign

    Enclosed section value will be evaluated and assigned to variable for later use in the template.

    {{#assign variable}}value{{/assign}}

Adding a new mapping file

All files in the autoload bucket matching the pattern /mappings/**/*.hbs are aggregated to build the full mapping configuration.

Therefore, you can add any number of arbitrary .hbs files to the /mappings/ directory, defining the mappings you want to use. These mappings can be specific to a single file, or a file pattern matching multiple similar files you want to process the same way.

Example:

The following file instructs bigquery-autoloader to load data from export_{table}_{yyyyMMdd}.csv into the {table} table rather than export_{table}.

For example, export_cities_20190506.csv will be loaded into the cities table.

// File:  mappings/export_TABLE_yyyyMMdd.hbs
{{#regex-match file.name "\/export_(?<TABLE_ID>.*)_\d{8}\.csv$" }}
{
   "configuration.load.destinationTable.tableId": "{{TABLE_ID}}",
   "configuration.load.writeDisposition":"WRITE_TRUNCATE"
}
{{/regex-match}}
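
To see what the pattern in the mapping above captures, the following Python sketch mimics the regex-match behaviour outside of handlebars. It is an approximation under stated assumptions, not the helper's implementation: `regex_match` and `render_mapping` are hypothetical names, the match is assumed to be unanchored (a search), and the value of file.name is assumed to contain a "/" before the file name, since the pattern requires one. Note that Python spells named groups `(?P<NAME>...)` where JavaScript uses `(?<NAME>...)`.

```python
import re

# Python spelling of the pattern from the mapping file above.
PATTERN = re.compile(r"/export_(?P<TABLE_ID>.*)_\d{8}\.csv$")


def regex_match(value, pattern=PATTERN):
    """Return the named groups when value matches, or None when it does not."""
    match = pattern.search(value)
    return match.groupdict() if match else None


def render_mapping(file_name):
    """Hypothetical stand-in for rendering the template: empty when nothing matches."""
    context = regex_match(file_name)
    if context is None:
        return {}
    return {
        "configuration.load.destinationTable.tableId": context["TABLE_ID"],
        "configuration.load.writeDisposition": "WRITE_TRUNCATE",
    }


if __name__ == "__main__":
    print(render_mapping("/export_cities_20190506.csv"))
    print(render_mapping("/cities_20190506.csv"))
```

Because `.*` is greedy, a name such as /export_big_cities_20190506.csv yields big_cities as the table name. A name with no "/" before export_ does not match this particular pattern at all.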

Convenience helpers

HJSON

For convenience, the rendered template is parsed using hjson, so the syntax is a bit more permissive.

You can use comments, or omit commas and quotes; hjson will try (and most likely succeed) to parse your file.

DOT-OBJECT

dot-object is used to expand properties named with dot-notation.

This JSON object:

{
  "configuration.load.destinationTable.datasetId": "myDataset",
  "configuration.load.destinationTable.tableId": "myTable"
}

will be parsed as:

{
  "configuration": {
    "load": {
      "destinationTable": {
        "datasetId": "myDataset",
        "tableId": "myTable"
      }
    }
  }
}
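
The expansion shown above can be sketched in a few lines of Python. This is only an illustration of the idea, not the dot-object library itself: `expand_dotted` is a hypothetical name, and the sketch assumes that no key is both a leaf value and a parent of other keys.

```python
def expand_dotted(flat):
    """Expand keys written in dot-notation into nested dictionaries."""
    nested = {}
    for dotted_key, value in flat.items():
        *parents, leaf = dotted_key.split(".")
        node = nested
        for part in parents:
            # Create intermediate objects on demand and walk into them.
            node = node.setdefault(part, {})
        node[leaf] = value
    return nested


if __name__ == "__main__":
    flat = {
        "configuration.load.destinationTable.datasetId": "myDataset",
        "configuration.load.destinationTable.tableId": "myTable",
    }
    print(expand_dotted(flat))
```

Keys that share a prefix end up under the same nested object, which is why several dotted entries in different mapping files can contribute to one destinationTable section.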

Custom metadata

Any custom metadata on the file whose key is prefixed with bigquery. (trailing dot included) will be added to the job configuration.

Options specified in custom metadata take precedence and override existing configuration options if present.

Dotted notation is used to define nested properties.

Example: Changing the table name and dataset by setting custom metadata at upload time

 gsutil -h "x-goog-meta-bigquery.configuration.load.destinationTable.datasetId: Test" \
        -h "x-goog-meta-bigquery.configuration.load.destinationTable.tableId: City" \
        cp "samples/cities_20190506.csv" "gs://bq-autoload/"

Note: Custom metadata keys must be prefixed with x-goog-meta- when using gsutil to upload the file
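
The precedence rule described in this section can be pictured with a short Python sketch. It is a simplified model under stated assumptions rather than the function's real code: `metadata_overrides` and `apply_overrides` are hypothetical names, the configuration is kept as flat dotted keys, and the metadata keys are assumed to arrive without the x-goog-meta- header prefix.

```python
PREFIX = "bigquery."


def metadata_overrides(metadata):
    """Keep only the keys prefixed with 'bigquery.' and strip that prefix."""
    return {
        key[len(PREFIX):]: value
        for key, value in metadata.items()
        if key.startswith(PREFIX)
    }


def apply_overrides(config, metadata):
    """Return a copy of config where custom metadata options take precedence."""
    merged = dict(config)
    merged.update(metadata_overrides(metadata))
    return merged


if __name__ == "__main__":
    config = {
        "configuration.load.destinationTable.tableId": "cities",
        "configuration.load.writeDisposition": "WRITE_TRUNCATE",
    }
    metadata = {
        "bigquery.configuration.load.destinationTable.datasetId": "Test",
        "bigquery.configuration.load.destinationTable.tableId": "City",
        "owner": "data-team",  # unrelated metadata, ignored
    }
    print(apply_overrides(config, metadata))
```

In this model the table name from the mapping (cities) is replaced by the metadata value (City), the dataset is added, and options the metadata does not mention are left untouched.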