Airflow BigQueryExampleGen example #54

Closed
DimitrijeManic opened this issue Apr 24, 2019 · 2 comments

@DimitrijeManic

Hi there,

After going through the workshop tutorial, I am attempting to build my own pipeline ingesting from BigQuery rather than a CSV.

The only example using BigQuery is taxi_pipeline_kubeflow.py, which assumes execution on GCP.

Is it possible to use the same pipeline as the tutorial where Airflow & AirflowScheduler are running locally but pull data from BigQuery?

How does GCP authentication work in this scenario?

I have tried the snippet below, along with editing bigquery_default under Admin > Connections in the Airflow web UI, with no luck:
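(Imports and the definitions of _metadata_db_root, logger_overrides, _pipeline_root, and _airflow_config are omitted from the snippet.)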

@PipelineDecorator(
    pipeline_name='series',
    enable_cache=True,
    metadata_db_root=_metadata_db_root,
    additional_pipeline_args={
            'logger_args': logger_overrides,
            'beam_pipeline_args': [
                '--runner=DirectRunner',
                '--experiments=shuffle_mode=auto',
                '--project=<MY-PROJECT-ID>',
                '--temp_location=<TEMP-DIR-GCP>',
                '--region=us-central1',
            ],
        },
    pipeline_root=_pipeline_root)
def _create_pipeline():
  """Series Pipeline"""
  query = """
    SELECT * FROM `<MY-TABLE>`
  """
  example_gen = BigQueryExampleGen(query=query)

  return [
      example_gen,
  ]


pipeline = AirflowDAGRunner(_airflow_config).run(_create_pipeline())

@1025KB (Collaborator) commented Apr 24, 2019

Is it possible to use the same pipeline as the tutorial where Airflow & AirflowScheduler are running locally but pull data from BigQuery?

Yes

How does GCP authentication work in this scenario?

It works pretty much the same as in the Kubeflow example: you need to have GOOGLE_APPLICATION_CREDENTIALS set up in your environment (for both the Airflow scheduler and webserver processes), and set '--project=xxx' in your beam_pipeline_args.
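For example, a minimal sketch (the key path is a hypothetical placeholder; setting the variable from the DAG file as below is just one option, and exporting it in the shell before starting airflow scheduler and airflow webserver works as well):

import os

# Assumed path to a service-account key file; adjust for your setup.
# google-auth reads this variable when the BigQuery/Beam clients are created,
# so it must be visible to the process that actually runs the pipeline.
os.environ.setdefault('GOOGLE_APPLICATION_CREDENTIALS',
                      '/path/to/service-account-key.json')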

So basically you can use the taxi_pipeline_simple.py example, with the example gen changed to BigQueryExampleGen and beam_pipeline_args set up in additional_pipeline_args:

'beam_pipeline_args': [
    '--runner=DirectRunner',
    '--project=xxx',
],

To debug, you can try the query in the BigQuery web UI console to see if it works, and then try the same query in code (the query in taxi_pipeline_kubeflow.py should work for you too).
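For example, a minimal sketch for sanity-checking the query outside the pipeline (assuming google-cloud-bigquery is installed and GOOGLE_APPLICATION_CREDENTIALS is set; the project and table are the placeholders from the snippet above):

from google.cloud import bigquery

# Run the same query directly to confirm credentials and SQL are valid
# before wiring it into BigQueryExampleGen.
client = bigquery.Client(project='<MY-PROJECT-ID>')
rows = client.query('SELECT * FROM `<MY-TABLE>` LIMIT 5').result()
for row in rows:
    print(dict(row))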

@DimitrijeManic (Author)

Got it to work, thanks!
