
importing a datastore backup into bigquery does not work with WRITE_TRUNCATE #7

Closed
GoogleCodeExporter opened this issue Apr 14, 2016 · 10 comments

Comments

@GoogleCodeExporter
Contributor

We are loading datastore backups into BigQuery with the BigQuery v2 API. We
are specifying this JSON job configuration:

{'configuration': {
    'load': {
        'sourceFormat'     : 'DATASTORE_BACKUP',
        'writeDisposition' : 'WRITE_TRUNCATE',
        'sourceUris'       : sourceUris,
        'destinationTable' : {
            'projectId': settings.PROJECT_ID,
            'datasetId': datasetId,
            'tableId'  : entityKind
            }
        }
    }
}
We have already loaded this entity into BigQuery once and expected further
loads to replace the existing table with the new data. Instead, the insert
job request returns an error:

"status": {
  "state": "DONE",
  "errors": [
    {
      "reason": "invalid",
      "message": "Cannot import a datastore backup to a table that already has a schema."
    }
  ],
  "errorResult": {
    "reason": "invalid",
    "message": "Cannot import a datastore backup to a table that already has a schema."
  }
}

Original issue reported on code.google.com by bbassing...@vendasta.com on 11 Feb 2013 at 6:31


Your backup has a different schema than the BigQuery table.
Have you modified the entity schema?
If so, delete the BigQuery table and create it again.

Original comment by josean1...@gmail.com on 24 Feb 2014 at 12:50


I think we're seeing the same thing.  Occasionally we see twice as many rows
in a datasource as we expect when we use `bq load` to update the datasource,
despite passing the `--replace` flag as an argument.  I suspect this occurs
when we've made a modification to the datasource schema.

If we drop the datasource prior to uploading, there will be a period of several 
minutes where any dashboards and other tools that are expecting the datasource 
will break.  If we do not drop the datasource and there is a schema change, it 
looks like there will be duplicate rows until a follow-on update in which the 
schema has not changed.

Is the behavior described in this issue considered a bug?  If not, is there a 
way to avoid this difficulty that we haven't thought of?

Original comment by ewal...@pagerduty.com on 10 Apr 2014 at 6:37


I can now add that we are seeing this behavior independently of any schema
change.  Occasionally when we create a load job, even when there has been no
schema change, the `--replace` flag in `bq` appears to be ignored, and we see
twice as many results as expected.

This is a pretty serious error, and it is making us reconsider using BQ for our 
dashboards.  Has anyone in Google seen this?

Original comment by ewal...@pagerduty.com on 24 Apr 2014 at 8:04


We'll look into it. Can you send us any relevant table IDs or job IDs? Feel 
free to email jcondit@google.com if you prefer (though I may hand off to 
someone else, since I'm going to be out next week).

If this is indeed an issue with loading from a datastore backup, you could try 
working around the problem by loading into a fresh table and then using a table 
copy (with WRITE_TRUNCATE) to atomically replace the old table with the new 
data.
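The staged-load-then-copy workaround can be sketched as two v2 API job configurations. This is only an illustration of the suggested approach, not code from the thread; the staging and live table names (`entity_staging`, `entity`) are hypothetical placeholders:

```python
# Sketch of the workaround: (1) load the Datastore backup into a fresh
# staging table, so there is no existing schema to conflict with; then
# (2) copy the staging table over the live table with WRITE_TRUNCATE,
# which replaces the old data atomically.

def staging_load_config(project_id, dataset_id, source_uris):
    """Load job body: Datastore backup into a brand-new staging table."""
    return {"configuration": {"load": {
        "sourceFormat": "DATASTORE_BACKUP",
        "sourceUris": source_uris,
        "destinationTable": {
            "projectId": project_id,
            "datasetId": dataset_id,
            "tableId": "entity_staging",  # hypothetical staging table name
        },
    }}}

def replace_copy_config(project_id, dataset_id):
    """Copy job body: overwrite the live table from the staging table."""
    return {"configuration": {"copy": {
        "writeDisposition": "WRITE_TRUNCATE",  # atomically replace live data
        "sourceTable": {
            "projectId": project_id,
            "datasetId": dataset_id,
            "tableId": "entity_staging",
        },
        "destinationTable": {
            "projectId": project_id,
            "datasetId": dataset_id,
            "tableId": "entity",  # hypothetical live table name
        },
    }}}
```

Each body would be submitted as a separate insert job; dashboards keep reading the live table until the copy lands, avoiding the multi-minute gap described above.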

Original comment by jcon...@google.com on 25 Apr 2014 at 6:41


I can confirm the incorrect behavior reported by ewalker@ with WRITE_TRUNCATE,
and I believe we have fixed it, as of late yesterday PDT.  Sometimes, data that
was in a table before a WRITE_TRUNCATE load would persist after the truncate,
alongside the data that was written.  We'll be getting more detail on the exact
incidence of the problem.

bbassingthwaite's original issue was a different thing, if I understand
correctly: WRITE_TRUNCATE does not allow you to import a Datastore backup,
even though it seems like it logically could.  I'll look at that also.

Original comment by e...@google.com on 23 May 2014 at 10:19


Spring cleaning, issue resolved per #5.

Original comment by thomasp...@google.com on 22 Aug 2014 at 5:27

  • Changed state: Fixed


The original issue, that you can't replace a table created from a datastore
backup with WRITE_TRUNCATE enabled, is still true:


  "status": {
    "errorResult": {
      "message": "Cannot import a datastore backup to a table that already has a schema.",
      "reason": "invalid"
    },
    "errors": [
      {
        "message": "Cannot import a datastore backup to a table that already has a schema.",
        "reason": "invalid"
      }
    ],
    "state": "DONE"
  }

Original comment by ch...@pushbullet.com on 9 Oct 2014 at 5:21


I have just hit the same issue. I initially loaded a Datastore backup file
from GCS into BQ using the bq load CLI. When I tried to repeat the load, it
said there was already a schema, and BQ hinted that I should use the
WRITE_TRUNCATE write disposition. After I included the disposition, I got the
following error about the encoding type for the Datastore import. I tried
both UTF-8 and ISO-8859-1, but neither was accepted.

bucket_name> load --source_format=DATASTORE_BACKUP --allow_jagged_rows=false --encoding=UTF-8 --write_disposition=WRITE_TRUNCATE sample_red.t1estchallenge_1 gs://test.appspot.com/folder/ahFzfnZpcmdpbi1yZWQtdGVzdHJBCxIcX0FFX0RhdGFzdG9yZUFkbWluX09wZXJhdGlvbhiBwLgCDAsSFl9BRV9CYWNrdXBfSW5mb3JtYXRpb24YAQw.enittykind.backup_info

Error parsing command: flag --encoding=None: value should be one of <UTF-8|ISO-8859-1>

How do I approach for this issue?

Original comment by rajeshk...@gae.golgek.mobi on 1 Apr 2015 at 5:10


I'm not sure if that's the correct error message, but try using --replace 
instead of --write_disposition=WRITE_TRUNCATE.

Original comment by jcon...@google.com on 1 Apr 2015 at 4:50
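A minimal sketch of the suggested invocation, with hypothetical dataset, table, and bucket names standing in for the real ones (the original backup path is not reproduced here):

```shell
# Hypothetical names; per the suggestion above, --replace stands in for
# --write_disposition=WRITE_TRUNCATE when reloading a Datastore backup.
CMD="bq load --source_format=DATASTORE_BACKUP --replace mydataset.entitykind gs://example-bucket/path/to/file.backup_info"
echo "$CMD"
```

With `--replace`, the destination table's existing schema is discarded along with its rows, which avoids the "already has a schema" and encoding errors seen above.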


All,

As Jeremy pointed out, the --replace flag works well if you prefer a full
refresh load. The WRITE_TRUNCATE flag asks for an encoding for the Datastore
import, which may not be possible to supply.

Original comment by rajeshk...@gae.golgek.mobi on 2 Apr 2015 at 6:05
