Default mappings reference

tfabien edited this page Apr 2, 2021 · 4 revisions

The default behaviour is implemented through the same mapping-files mechanism that is used for customization.

A few mapping files are included in the sources (under the ./mappings directory) to define the default configuration for the BigQuery load job. You can edit these files prior to deployment to modify or extend the default behaviour.

The following sections describe the default shipped configuration in detail.

000-global_config.hbs

// Global configuration
{
  configuration: {
    load: {
      destinationTable: {
        projectId: {{env.PROJECT_ID}}
        datasetId: {{default env.DATASET_ID 'Staging'}}
      },
      createDisposition: {{default env.CREATE_DISPOSITION 'CREATE_IF_NEEDED'}}
      writeDisposition: {{default env.WRITE_DISPOSITION 'WRITE_APPEND'}}
      encoding: {{default env.ENCODING 'UTF-8'}}
    },
    dryRun: {{default env.DRY_RUN 'False'}}
  }
}

This configuration always applies and defines the global configuration options, such as the project ID, dataset ID, and the create/write dispositions.

It also exposes the dryRun option through an environment variable.
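The `{{default env.X '...'}}` helper resolves to the environment variable when it is set, and to the literal fallback otherwise. A minimal Python sketch of how the rendered configuration behaves (the function name is illustrative, not part of the project):

```python
def render_global_config(env):
    """Mimic 000-global_config.hbs: env vars win, literals are fallbacks."""
    return {
        "configuration": {
            "load": {
                "destinationTable": {
                    "projectId": env["PROJECT_ID"],
                    "datasetId": env.get("DATASET_ID", "Staging"),
                },
                "createDisposition": env.get("CREATE_DISPOSITION", "CREATE_IF_NEEDED"),
                "writeDisposition": env.get("WRITE_DISPOSITION", "WRITE_APPEND"),
                "encoding": env.get("ENCODING", "UTF-8"),
            },
            "dryRun": env.get("DRY_RUN", "False"),
        }
    }
```

For example, deploying with only `PROJECT_ID=my-project` set would yield the `Staging` dataset and `WRITE_APPEND` disposition, while setting `DATASET_ID=Prod` overrides the dataset without touching the other defaults.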

001-source_uris.hbs

{
  // Inline partial that defines the complete filepath
  {{#* inline "filePath"}}gs://{{file.bucket}}/{{file.name}}{{/inline}}

  // Always use the complete file path as a sourceUri
  configuration.load.sourceUris: [ "{{> filePath}}" ]
}

This configuration always applies and defines the source URI for the load job.
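The `filePath` partial is plain string concatenation of the storage event's bucket and object name. A one-function Python sketch (the `file` dict fields mirror the event payload used by the templates; the function name is illustrative):

```python
def source_uris(file):
    """Mimic 001-source_uris.hbs: gs://<bucket>/<name> as the only source URI."""
    return ["gs://{bucket}/{name}".format(**file)]
```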

002-table-naming.hbs

// Table naming patterns
// Note: if multiple rules match, the last value for an option overrides all preceding ones
{
  // Assign a variable holding the complete filepath
  {{#assign 'filePath'}}gs://{{file.bucket}}/{{file.name}}{{/assign}}

  /**
   * Use the filename minus the extension and date/number suffix as table name
   * Default datasetId is used unless overridden by another config
   * 
   * eg: All these files result in a 'cities' tableId
   *     - gs://bq-autoload/cities.csv
   *     - gs://bq-autoload/cities-us.csv
   *     - gs://bq-autoload/cities_1.csv
   *     - gs://bq-autoload/cities_20190516.csv
   *     - gs://bq-autoload/cities_20190516-063000.csv
   *     - gs://bq-autoload/cities.20190516.063000.csv
   *     - gs://bq-autoload/CSV/cities_20190516
  **/
  {{#regex-match filePath '^gs\:\/\/.*\/(?<TABLE_ID>.*?)(?:[\.\-_][\d\.\-\_]+)?(?:\..+)?$'}}
    configuration.load.destinationTable.tableId: "{{TABLE_ID}}"
  {{/regex-match}}

  /**
   * Use the first subdir as datasetId, and second subdir as tableId.
   * The file must be nested under two directories
   * A third nesting directory is accepted if this directory is a date/timestamp
   * 
   * eg: All these files result in a 'Cities' tableId
   *     - gs://bq-autoload/Staging/Cities/export_20190516.csv
   *     - gs://bq-autoload/Staging/Cities/dump_20190516.csv
   *     - gs://bq-autoload/Staging/Cities/20190516/export.csv
   *     - gs://bq-autoload/Staging/Cities/20190516-063000/export.csv
   * 
   * eg: These files will not match this pattern (and fall back to previous config)
   *     - gs://bq-autoload/Cities/export_20190516.csv
   *     - gs://bq-autoload/Staging/Cities/CSV/export_20190516.csv
  **/
  {{#regex-match filePath '^gs\:\/\/[^\/]+\/(?<DATASET_ID>[^\/]+)\/(?<TABLE_ID>[^\/]+)\/([\d\.\_\-]+\/)?[^\/]+$'}}
    configuration.load.destinationTable.datasetId: "{{DATASET_ID}}"
    configuration.load.destinationTable.tableId: "{{TABLE_ID}}"
  {{/regex-match}}
}

This configuration applies to all files and defines the table naming pattern for the uploaded file.

The first regex-match section derives the table name from the file name, minus its extension and optional yyyyMMdd-HHmmSS timestamp suffix, using named capture groups inside the regular expression.

The second regex-match section derives the table name from the subdirectories the file is nested in. It uses the first subdirectory as the datasetId and the second one as the tableId, ignoring the file name.

Note: if this section matches, it overrides the first one.

You can change the configuration.load.destinationTable.tableId property to alter this naming convention if the provided default does not suit your needs. Depending on the variables you want to use, you may also need to alter the regex-match pattern to add new capturing groups.
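The "last match wins" merge order can be reproduced with two plain regular expressions. A Python sketch of both rules (the `destination_table` helper and the Python `(?P<...>)` group syntax are this sketch's own; the patterns themselves are the ones from the template):

```python
import re

# Rule 1: table name = file name minus extension and date/number suffix
FILE_RULE = re.compile(
    r"^gs://.*/(?P<TABLE_ID>.*?)(?:[.\-_][\d.\-_]+)?(?:\..+)?$"
)
# Rule 2: first subdir = datasetId, second subdir = tableId,
# with an optional third date/timestamp directory
DIR_RULE = re.compile(
    r"^gs://[^/]+/(?P<DATASET_ID>[^/]+)/(?P<TABLE_ID>[^/]+)/([\d._\-]+/)?[^/]+$"
)

def destination_table(file_path, default_dataset="Staging"):
    """Apply both naming rules; the directory-based rule overrides."""
    table = {"datasetId": default_dataset, "tableId": None}
    m = FILE_RULE.match(file_path)
    if m:
        table["tableId"] = m.group("TABLE_ID")
    m = DIR_RULE.match(file_path)
    if m:  # last value for an option overrides all preceding ones
        table["datasetId"] = m.group("DATASET_ID")
        table["tableId"] = m.group("TABLE_ID")
    return table
```

Running this against the example paths above shows the override in action: `gs://bq-autoload/cities_20190516.csv` yields table `cities` in the default dataset, while `gs://bq-autoload/Staging/Cities/20190516/export.csv` yields `Staging.Cities` even though the filename-based rule alone would have produced `export`.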

003-file_formats.hbs

Load options for each of BigQuery's supported file formats:

// Load options for BigQuery's supported file formats
{
  configuration.load: {
    // CSV
    {{#regex-match file.name "\.csv$" }}
      sourceFormat: CSV
      autodetect: True
    {{/regex-match}}

    // JSON
    {{#regex-match file.name "\.js(on)?$" }}
      sourceFormat: NEWLINE_DELIMITED_JSON
      autodetect: True
    {{/regex-match}}

    // Avro
    {{#regex-match file.name "\.avro$" }}
      sourceFormat: AVRO
      useLogicalTypes: True
    {{/regex-match}}

    // Parquet
    {{#regex-match file.name "\.parquet$" }}
      sourceFormat: PARQUET
    {{/regex-match}}

    // ORC
    {{#regex-match file.name "\.orc$" }}
      sourceFormat: ORC
    {{/regex-match}}
  }
}