
CC-7246: add ability to partition based on timestamp of a record value field #214

Closed
wants to merge 2 commits

Conversation

levzem

@levzem levzem commented Nov 20, 2019

implements this behavior:
https://cloud.google.com/bigquery/docs/partitioned-tables#date_timestamp_partitioned_tables

TL;DR: BigQuery can partition a table based on a column that contains a timestamp, so by passing a field name from the record value struct to BigQuery, the connector specifies which column to partition by.
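For illustration, this is roughly how column-based (timestamp) partitioning is expressed with the google-cloud-bigquery Java client. This is only a sketch: the column name `event_ts`, the schema, and the dataset/table names are made up, and it does not claim to show how the connector itself wires this up.

```java
import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.Field;
import com.google.cloud.bigquery.LegacySQLTypeName;
import com.google.cloud.bigquery.Schema;
import com.google.cloud.bigquery.StandardTableDefinition;
import com.google.cloud.bigquery.TableId;
import com.google.cloud.bigquery.TableInfo;
import com.google.cloud.bigquery.TimePartitioning;

public class ColumnPartitionExample {
  public static void main(String[] args) {
    BigQuery bigQuery = BigQueryOptions.getDefaultInstance().getService();

    // Partition by the value of a TIMESTAMP column instead of ingestion time.
    TimePartitioning partitioning = TimePartitioning.newBuilder(TimePartitioning.Type.DAY)
        .setField("event_ts") // hypothetical column name
        .build();

    Schema schema = Schema.of(
        Field.of("event_ts", LegacySQLTypeName.TIMESTAMP),
        Field.of("payload", LegacySQLTypeName.STRING));

    StandardTableDefinition definition = StandardTableDefinition.newBuilder()
        .setSchema(schema)
        .setTimePartitioning(partitioning)
        .build();

    // Rows streamed to this table are routed to partitions by the event_ts column.
    bigQuery.create(TableInfo.of(TableId.of("my_dataset", "my_table"), definition));
  }
}
```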

Signed-off-by: Lev Zemlyanov lev@confluent.io

Signed-off-by: Lev Zemlyanov <lev@confluent.io>
@CLAassistant

CLAassistant commented Nov 20, 2019

CLA assistant check
All committers have signed the CLA.

@levzem
Author

levzem commented Nov 20, 2019

@wicknicks @gharris1727 @aakashnshah any reviews would be appreciated

Signed-off-by: Lev Zemlyanov <lev@confluent.io>
@@ -155,6 +155,12 @@ private RowToInsert getRecordRow(SinkRecord record) {
      convertedRecord = FieldNameSanitizer.replaceInvalidKeys(convertedRecord);
    }

    if (config.useTimestampPartitioning()) {
      if (!convertedRecord.containsKey(config.getTimestampPartitionFieldName())) {


When/how would this happen? It looks like the first `if` statement accounts for the field name being non-empty. Of course that doesn't ensure the record contains the field, but I still wanted to ask.

Author


this means the record doesn't contain the field - RecordConverter returns a map of all the field names to their values, so that's how I check the struct
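For context, a minimal sketch of what the guarded branch is presumably doing. Only the two `if` conditions come from the quoted diff; the exception type and message below are assumptions, not the PR's actual code.

```java
if (config.useTimestampPartitioning()) {
  // convertedRecord is the Map of field names to values produced by RecordConverter,
  // so a missing key means the record value has no such field.
  if (!convertedRecord.containsKey(config.getTimestampPartitionFieldName())) {
    // Hypothetical error handling; the real PR may report this differently.
    throw new org.apache.kafka.connect.errors.ConnectException(
        "Record is missing the configured timestamp partition field: "
            + config.getTimestampPartitionFieldName());
  }
}
```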

Comment on lines +115 to +118
final String testTableName = "testTable";
final String testDatasetName = "testDataset";
final String testDoc = "test doc";
final TableId tableId = TableId.of(testDatasetName, testTableName);


can you put these outside of the function, since you use them multiple times?
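A quick sketch of the suggestion, assuming these become class-level constants in the test (the values are taken from the quoted lines; the constant naming is illustrative):

```java
// Hoisted to the test class so multiple test methods can share them.
private static final String TEST_TABLE_NAME = "testTable";
private static final String TEST_DATASET_NAME = "testDataset";
private static final String TEST_DOC = "test doc";
private static final TableId TABLE_ID = TableId.of(TEST_DATASET_NAME, TEST_TABLE_NAME);
```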

Comment on lines +133 to +134
com.google.cloud.bigquery.Schema fakeBigQuerySchema =
com.google.cloud.bigquery.Schema.of(Field.of("mock field", LegacySQLTypeName.STRING));


you can use the mockito-inline package :)
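A sketch of what the suggestion could look like, assuming the mockito-inline artifact is on the test classpath (it allows mocking final classes such as BigQuery's `Schema`):

```java
import static org.mockito.Mockito.mock;

import com.google.cloud.bigquery.Schema;

// With mockito-inline, the Schema class can be mocked directly instead of being
// constructed via Schema.of(Field.of("mock field", ...)); stub only the methods
// the code under test actually calls.
Schema fakeBigQuerySchema = mock(Schema.class);
```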

Author


I'm following the original tests; I want to make this a non-invasive addition

@levzem levzem changed the title from "FF-1311: add ability to partition based on timestamp of a record value field" to "CC-7246: add ability to partition based on timestamp of a record value field" on Nov 25, 2019
@levzem
Author

levzem commented Nov 25, 2019

@mtagle would love some eyes on this :)

@archy-bold

I tried this and I was getting an error: 'Streaming to metadata partition of column based partitioning table $20191127 is disallowed.' It looks like the reason is that, with column-based partitions, you shouldn't supply the partition explicitly. You just supply the table name and BigQuery sorts the partitioning itself.

When I updated the PartitionedTableId::createFullTableName() function to simply return the table, I was able to insert records into the table.

It seems to create the table fine, though.

Source: https://stackoverflow.com/a/50006560
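A sketch of the workaround described above. The method and field names follow the comment's wording, but this is not the connector's actual code:

```java
// In PartitionedTableId, for column-partitioned tables, skip the "$YYYYMMDD"
// partition decorator and let BigQuery route rows via the partitioning column.
public String createFullTableName() {
  // return table + "$" + partition;  // decorator form: rejected for column-partitioned tables
  return table;                       // plain table name: accepted
}
```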

@archy-bold

Actually, this PR does what I explained: #203

@rhauch

rhauch commented Feb 3, 2020

FYI: #203 has been superseded by #229.

@levzem
Author

levzem commented Feb 4, 2020

this PR attempts to accomplish the second half of #244 and to address issue #169 by allowing the connector to auto-create a column-partitioned table in BigQuery. It may be superseded by a follow-up PR.

@levzem
Author

levzem commented Feb 11, 2020

superseded by #246, thus closing

@levzem levzem closed this Feb 11, 2020