
Usage samples


Autoload default behaviour

Upload a CSV file to create a new table

The samples/cities_20190506.csv file:

  • is a CSV file, separated with ','
  • has a header line
  • has 127 records
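
For reference, the header line and first records presumably look like the following (reconstructed from the query results below; the actual sample file may differ in quoting and spacing):

LatD,LatM,LatS,NS,LonD,LonM,LonS,EW,City,State
39,45,0,false,75,33,0,W,Wilmington,DE
41,15,0,false,77,0,0,W,Williamsport,PA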

Upload the file to the bigquery-autoload GCS bucket

$> gsutil cp samples/cities_20190506.csv gs://bq-autoload

-> A new Staging.cities BigQuery table is automatically created and populated with the data from the CSV file:

  • The column names are extracted from the header line
  • The column types are inferred from the data
$> bq query -q "SELECT * FROM Staging.cities LIMIT 5"
+------+------+------+-------+------+------+------+----+-----------------+-------+
| LatD | LatM | LatS |  NS   | LonD | LonM | LonS | EW |      City       | State |
+------+------+------+-------+------+------+------+----+-----------------+-------+
|   39 |   45 |    0 | false |   75 |   33 |    0 | W  | Wilmington      |  DE   |
|   41 |   15 |    0 | false |   77 |    0 |    0 | W  | Williamsport    |  PA   |
|   26 |   43 |   11 | false |   80 |    3 |    0 | W  | West Palm Beach |  FL   |
|   36 |   40 |   11 | false |  121 |   39 |    0 | W  | Salinas         |  CA   |
|   50 |   25 |   11 | false |  104 |   39 |    0 | W  | Regina          |  SA   |
+------+------+------+-------+------+------+------+----+-----------------+-------+
$> bq query -q --format json "SELECT COUNT(*) as Count FROM Staging.cities"
[{"Count":"127"}]

Append data to an existing table

Upload a CSV file, adding data and a new column to the existing table

The samples/cities_20190507.csv file:

  • has the same header line as cities_20190506.csv, plus an additional Comment column
  • contains a single new record
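
Reconstructed from the query results below (the actual file may differ in quoting), its contents presumably resemble:

LatD,LatM,LatS,NS,LonD,LonM,LonS,EW,City,State,Comment
41,5,59,false,80,39,0,W,Youngstown,OH,This is a new line appended from cities_20190507.csv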

Upload the file to the bigquery-autoload GCS bucket

$> gsutil cp samples/cities_20190507.csv gs://bq-autoload

-> The Staging.cities BigQuery table is updated:

  • The Comment column has been created with an inferred STRING type and a NULL default value
  • The new line from the samples/cities_20190507.csv file has been added
$> bq query -q "SELECT * FROM Staging.cities LIMIT 5"
+------+------+------+-------+------+------+------+----+-----------------+-------+------------------------------------------------------+
| LatD | LatM | LatS |  NS   | LonD | LonM | LonS | EW |      City       | State |                       Comment                        |
+------+------+------+-------+------+------+------+----+-----------------+-------+------------------------------------------------------+
|   41 |    5 |   59 | false |   80 |   39 |    0 | W  | Youngstown      |  OH   | This is a new line appended from cities_20190507.csv |
|   39 |   45 |    0 | false |   75 |   33 |    0 | W  | Wilmington      |  DE   | NULL                                                 |
|   41 |   15 |    0 | false |   77 |    0 |    0 | W  | Williamsport    |  PA   | NULL                                                 |
|   26 |   43 |   11 | false |   80 |    3 |    0 | W  | West Palm Beach |  FL   | NULL                                                 |
|   36 |   40 |   11 | false |  121 |   39 |    0 | W  | Salinas         |  CA   | NULL                                                 |
+------+------+------+-------+------+------+------+----+-----------------+-------+------------------------------------------------------+
$> bq query -q --format json "SELECT COUNT(*) as Count FROM Staging.cities"
[{"Count":"128"}]

Modify the destination table name and write mode

Upload a CSV file, using custom metadata to specify an arbitrary destination table and the WRITE_TRUNCATE write mode, replacing all existing data

The samples/export_cities.20190508.csv file:

  • has the same header line as the original cities files (no Comment column)
  • has 10 records

Uploading this file as-is would normally create a new export_cities table. We will instead upload it with additional custom metadata to override this behaviour.

$> gsutil -h "x-goog-meta-bigquery.configuration.load.destinationTable.datasetId: Cities" \
          -h "x-goog-meta-bigquery.configuration.load.writeDisposition: WRITE_TRUNCATE" \
          cp samples/export_cities.20190508.csv gs://bq-autoload
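
Custom metadata travels with the uploaded object, so one way to verify it was applied is gsutil stat, whose Metadata section should list both bigquery.configuration.load.* keys:

$> gsutil stat gs://bq-autoload/export_cities.20190508.csv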

-> The Staging.cities BigQuery table is updated:

  • All existing rows have been deleted (due to the WRITE_TRUNCATE mode being used), and the 10 records from the CSV file have been inserted
  • The Comment column has been deleted
$> bq query -q "SELECT * FROM Staging.cities LIMIT 5"
+------+------+------+-------+------+------+------+----+-----------------+-------+
| LatD | LatM | LatS |  NS   | LonD | LonM | LonS | EW |      City       | State |
+------+------+------+-------+------+------+------+----+-----------------+-------+
|   41 |    5 |   59 | false |   80 |   39 |    0 | W  | Youngstown      |  OH   |
|   39 |   45 |    0 | false |   75 |   33 |    0 | W  | Wilmington      |  DE   |
|   41 |   15 |    0 | false |   77 |    0 |    0 | W  | Williamsport    |  PA   |
|   26 |   43 |   11 | false |   80 |    3 |    0 | W  | West Palm Beach |  FL   |
|   36 |   40 |   11 | false |  121 |   39 |    0 | W  | Salinas         |  CA   |
+------+------+------+-------+------+------+------+----+-----------------+-------+
$> bq query -q --format json "SELECT COUNT(*) as Count FROM Staging.cities"
[{"Count":"10"}]

Specify column names and use only part of the file name as table name

Upload a CSV file without a header line into a new table, configuring column names and types using a mapping file

The samples/cities_noheader_20190506.csv file:

  • contains the same city records as samples/cities_20190506.csv
  • has no header line

Uploading this file as-is would normally result in a new cities_noheader table being created, with generic column names like {inferredType}_field_{n}.

$> gsutil cp samples/cities_noheader_20190506.csv gs://bq-autoload
$> bq query -q "SELECT * FROM Staging.cities_noheader LIMIT 1"
+---------------+---------------+---------------+--------------+-----+
| int64_field_0 | int64_field_1 | int64_field_2 | bool_field_3 | ... |
+---------------+---------------+---------------+--------------+-----+
|            39 |            45 |             0 |        false | ... |
+---------------+---------------+---------------+--------------+-----+

The samples/mappings/cities_noheader_yyyyMMdd.json file defines a mapping that matches the header-less CSV file name (/\/cities_noheader_\d{8}\.csv$/) and specifies the column names and types to be used.

// File: mappings/cities_noheader_yyyyMMdd.json
{
  {{#regex-match file.name 'cities_noheader_\d{8}\.csv$'}}
  configuration: {
    load: {
      schema: {
        fields: [
          { name: "LatD",  type: "INTEGER" },
          { name: "LatM",  type: "INTEGER" },
          { name: "LatS",  type: "INTEGER" },
          { name: "NS",    type: "BOOLEAN" },
          { name: "LonD",  type: "INTEGER" },
          { name: "LonM",  type: "INTEGER" },
          { name: "LonS",  type: "INTEGER" },
          { name: "EW",    type: "STRING"  },
          { name: "City",  type: "STRING"  },
          { name: "State", type: "STRING"  }
        ]
      }
    }
  }
  {{/regex-match}}
}

Uploading this mapping file to the mappings directory of the autoload GCS bucket allows the Cloud Function to parse it and modify the load configuration for all files matching the cities_noheader_yyyyMMdd.csv pattern.

Subsequent upload of the samples/cities_noheader_20190506.csv file will now yield a different result

$> gsutil cp samples/mappings/cities_noheader_yyyyMMdd.json gs://bq-autoload/mappings/cities_noheader_yyyyMMdd.json
$> gsutil cp samples/cities_noheader_20190506.csv gs://bq-autoload

-> The Staging.cities_noheader BigQuery table is created:

  • The column names and types are extracted from the load job configuration
$> bq query -q "SELECT * FROM Staging.cities_noheader LIMIT 1"
+------+------+------+-------+------+------+------+----+----------+-------+
| LatD | LatM | LatS |  NS   | LonD | LonM | LonS | EW |   City   | State |
+------+------+------+-------+------+------+------+----+----------+-------+
|   49 |   52 |   48 | false |   97 |    9 |    0 | W  | Winnipeg |  MB   |
+------+------+------+-------+------+------+------+----+----------+-------+
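
As before, the schema applied by the mapping can be confirmed with bq show; it should match the fields array of the mapping file exactly:

$> bq show --schema --format prettyjson Staging.cities_noheader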