auto-compose

auto-compose is a utility for dynamically generating Google Cloud managed Apache Airflow (Cloud Composer) DAGs from YAML configuration files. It is a fork of dag-factory and uses its logic to parse YAML files and convert them into Airflow DAGs.

Installation

To run auto-compose without checking out the GitHub repository, run /bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/suchitpuri/auto-compose/master/scripts/bootstrap.sh)". It requires Docker, which has all the required dependencies baked in.

You can also check out the repository and run /bin/bash ./scripts/bootstrap.sh, as shown below.
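
For example, a local checkout could look like this (a minimal sketch, assuming git and Docker are already installed on your machine):

# clone the repository and run the bootstrap script from its root
git clone https://github.com/suchitpuri/auto-compose.git
cd auto-compose
/bin/bash ./scripts/bootstrap.sh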


Usage

Once you run auto-compose, it will ask you for the following details:

  1. project-id: This is your GCP project ID. auto-compose uses the environment's existing authentication to Google Cloud, so if you are not logged in, run gcloud auth login (or a similar command) before running auto-compose.
  2. composer-id: This is the name/ID of the Cloud Composer environment. You can get it from the Name column of https://console.cloud.google.com/composer/environments, or via gcloud as shown below.
  3. composer-location: This is the name of the region (e.g. asia-northeast1) where Composer is running. You can get it from the Location column of https://console.cloud.google.com/composer/environments.
  4. YAML file absolute path: This is the absolute path of the YAML file. The correct absolute path is needed so that Docker can mount the file.
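
If you prefer the command line over the console, the same details can be looked up with gcloud (a sketch; the project ID and region below are placeholders to replace with your own):

# authenticate and point gcloud at your project
gcloud auth login
gcloud config set project my-gcp-project

# list Composer environments in a region to find the composer-id and composer-location
gcloud composer environments list --locations asia-northeast1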

To deploy a DAG to Airflow managed by Google Cloud, you first need to create a YAML configuration file. For example:

default:
  default_args:
    owner: 'default_owner'
    start_date: 2019-08-02
    email: ['test@test.com']
    email_on_failure: True
    retries: 1
    email_on_retry: True
  max_active_runs: 1
  schedule_interval: '0 * * * */1'

bq_dag_complex:
  default_args:
    owner: 'add_your_ldap'
    start_date: 2019-02-14
  description: 'this is a sample bigquery dag which runs every day'
  tasks:
    query_1:
      operator: airflow.contrib.operators.bigquery_operator.BigQueryOperator
      bql: 'SELECT count(*) FROM `bigquery-public-data.noaa_gsod.gsod2018`'
      use_legacy_sql: false
    query_2:
      operator: airflow.contrib.operators.bigquery_operator.BigQueryOperator
      bql: 'SELECT count(*) FROM `bigquery-public-data.noaa_gsod.gsod2017`'
      dependencies: [query_1]
      use_legacy_sql: false
    query_3:
      operator: airflow.contrib.operators.bigquery_operator.BigQueryOperator
      bql: 'SELECT count(*) FROM `bigquery-public-data.noaa_gsod.gsod2016`'
      dependencies: [query_1]
      use_legacy_sql: false
    query_4:
      operator: airflow.contrib.operators.bigquery_operator.BigQueryOperator
      bql: 'SELECT count(*) FROM `bigquery-public-data.noaa_gsod.gsod2015`'
      dependencies: [query_1, query_2]
      use_legacy_sql: false
    query_5:
      operator: airflow.contrib.operators.bigquery_operator.BigQueryOperator
      bql: 'SELECT count(*) FROM `bigquery-public-data.noaa_gsod.gsod2014`'
      dependencies: [query_3]
      use_legacy_sql: false

bq_dag_simple:
  default_args:
    owner: 'add_your_ldap'
    start_date: 2019-02-14
  description: 'this is a sample bigquery dag which runs every 12 hours'
  schedule_interval: '0 */12 * * *'
  tasks:
    query_1:
      operator: airflow.contrib.operators.bigquery_operator.BigQueryOperator
      bql: 'SELECT count(*) FROM `bigquery-public-data.noaa_gsod.gsod2018`'
      use_legacy_sql: false
    query_2:
      operator: airflow.contrib.operators.bigquery_operator.BigQueryOperator
      bql: 'SELECT count(*) FROM `bigquery-public-data.noaa_gsod.gsod2017`'
      dependencies: [query_1]
      use_legacy_sql: false
    query_3:
      operator: airflow.contrib.operators.bigquery_operator.BigQueryOperator
      bql: 'SELECT count(*) FROM `bigquery-public-data.noaa_gsod.gsod2016`'
      dependencies: [query_1]
      use_legacy_sql: false

You can see that it supports all the Airflow semantics, like default_args, schedule_interval, max_active_runs and more. You can find a complete list here.

The best part is that you can use any of the supported Airflow operators in the YAML file directly, without any additional configuration.

And this DAG will be generated and ready to run in Airflow!
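
Once the DAGs have been deployed, you can verify them with gcloud (a sketch; the environment name and location are placeholders, and list_dags assumes the Airflow 1.x CLI that Composer shipped at the time):

# list the DAG files in the Composer environment's bucket
gcloud composer environments storage dags list --environment my-composer-env --location asia-northeast1

# ask the Airflow CLI inside Composer which DAGs it has parsed
gcloud composer environments run my-composer-env --location asia-northeast1 list_dags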

Benefits

  • Construct DAGs without knowing Python
  • Construct DAGs without learning Airflow primitives
  • Avoid duplicative code
  • Use any of the available Google Cloud operators
  • Everyone loves YAML! ;)

Contributing

Contributions are welcome! Just submit a Pull Request or GitHub Issue.
