docs: A lot of documentation improvements #43

Merged · 6 commits · Feb 19, 2022

Changes from all commits
179 changes: 82 additions & 97 deletions README.md
@@ -5,89 +5,116 @@
[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)
[![Test coverage](https://codecov.io/gh/tomasfarias/airflow-dbt-python/branch/master/graph/badge.svg?token=HBKZ78F11F)](https://codecov.io/gh/tomasfarias/airflow-dbt-python)

A collection of [Airflow](https://airflow.apache.org/) operators and hooks to interface with [`dbt`](https://pypi.org/project/dbt-core/).

Read the [documentation](https://tomasfarias.github.io/airflow-dbt-python/) for examples, installation instructions, and a full reference.

# Installing

## Requirements

airflow-dbt-python requires the latest major version of [`dbt-core`](https://pypi.org/project/dbt-core/), which at the time of writing is version 1. Since `dbt-core` follows [semantic versioning](https://semver.org/), we do not impose any restrictions on its minor and patch versions, but keep in mind that the latest `dbt-core` features introduced in minor releases may not yet be supported.

To line up with `dbt-core`, airflow-dbt-python supports Python 3.7, 3.8, and 3.9. We also include Python 3.10 in our testing pipeline, although as of the time of writing `dbt-core` does not yet support it.

Due to dependency conflicts between Airflow and `dbt-core`, airflow-dbt-python **does not include Airflow as a dependency**. We expect airflow-dbt-python to be installed into an environment that already has Airflow in it. For more detailed instructions, see the [docs](https://tomasfarias.github.io/airflow-dbt-python/getting_started.html).

## From PyPI:

``` shell
pip install airflow-dbt-python
```

Any `dbt` adapters you require may be installed by specifying extras:

``` shell
pip install airflow-dbt-python[snowflake,postgres]
```

## From this repo:

Clone the repo:
``` shell
git clone https://github.com/tomasfarias/airflow-dbt-python.git
cd airflow-dbt-python
```

With poetry:
``` shell
poetry install
```

Install any extras you need, and only those you need:
``` shell
poetry install -E postgres -E redshift
```

## In MWAA:

Add `airflow-dbt-python` to your `requirements.txt` file and edit your Airflow environment to use this new `requirements.txt` file.
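
For example, a minimal `requirements.txt` could contain nothing more than the package itself, together with any adapter extras you need (the `redshift` extra below is only illustrative):

``` text
airflow-dbt-python[redshift]
```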

# Features

airflow-dbt-python aims to make dbt a **first-class citizen** of Airflow by supporting additional features that integrate both tools. As you would expect, airflow-dbt-python can run all your dbt workflows in Airflow with the same interface you are used to from the CLI, but without being a mere wrapper: airflow-dbt-python directly interfaces with internal [`dbt-core`](https://pypi.org/project/dbt-core/) classes, bridging the gap between them and Airflow's operator interface.

While building this integration, several features were developed to **extend the capabilities of `dbt`** by leveraging Airflow as much as possible. Can you think of a way `dbt` could leverage Airflow that is not currently supported? Let us know in a [GitHub issue](https://github.com/tomasfarias/airflow-dbt-python/issues/new/choose)! The current list of supported features is as follows:

## Independent task execution

Airflow executes [Tasks](https://airflow.apache.org/docs/apache-airflow/stable/concepts/tasks.html) independently of one another: even though downstream and upstream dependencies between tasks exist, the execution of an individual task happens entirely independently of any other task's execution (see [Task Relationships](https://airflow.apache.org/docs/apache-airflow/stable/concepts/tasks.html#relationships)).

In order to work with this constraint, airflow-dbt-python runs each dbt command in a **temporary and isolated directory**. Before execution, all the relevant dbt files are copied from supported backends, and after executing the command any artifacts are exported. This ensures dbt can work with any Airflow deployment, including most production deployments as they are usually running [Remote Executors](https://airflow.apache.org/docs/apache-airflow/stable/executor/index.html#executor-types) and do not guarantee any files will be shared by default between tasks, since each task may run in a completely different environment.
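
The following sketch is purely conceptual (it is not airflow-dbt-python's actual implementation, and `execute_dbt_command` is a hypothetical callable standing in for a dbt command), but it illustrates the copy-run-export flow described above:

``` python
import shutil
import tempfile
from pathlib import Path


def run_dbt_in_isolation(project_dir, profiles_dir, execute_dbt_command):
    """Run a dbt command inside a temporary, isolated directory."""
    with tempfile.TemporaryDirectory() as tmp:
        tmp_dir = Path(tmp)
        # Copy the relevant dbt files from their backend (here, the local filesystem).
        shutil.copytree(project_dir, tmp_dir / "project")
        shutil.copy(Path(profiles_dir) / "profiles.yml", tmp_dir / "profiles.yml")
        # Execute the dbt command against the isolated copies.
        results = execute_dbt_command(project_dir=tmp_dir / "project", profiles_dir=tmp_dir)
        # Export or return any artifacts before the temporary directory is discarded.
        return results
```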


## Download dbt files from S3

The dbt parameters `profiles_dir` and `project_dir` would normally point to a directory containing a `profiles.yml` file and to a dbt project (defined by the presence of a `dbt_project.yml` file) in the local environment, respectively. airflow-dbt-python extends these parameters to also accept an [AWS S3](https://aws.amazon.com/s3/) URL (identified by an `s3://` scheme):

* If an S3 URL is used for `profiles_dir`, then this URL must point to a directory in S3 that contains a `profiles.yml` file. The `profiles.yml` file will be downloaded and made available for the operator to use when running.
* If an S3 URL is used for `project_dir`, then this URL must point to a directory in S3 containing all the files required for a dbt project to run. All of the contents of this directory will be downloaded and made available for the operator. The URL may also point to a zip file containing all the files of a dbt project, which will be downloaded, uncompressed, and made available for the operator.
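
For illustration, a `DbtRunOperator` could point both parameters at S3. The bucket and key names below are placeholders:

``` python
import datetime as dt

from airflow import DAG
from airflow_dbt_python.operators.dbt import DbtRunOperator

with DAG(
    dag_id="example_dbt_run_from_s3",
    schedule_interval=None,
    start_date=dt.datetime(2022, 1, 1),
    catchup=False,
) as dag:
    # profiles.yml and the dbt project are pulled from S3 into an isolated
    # temporary directory before the dbt command runs.
    dbt_run = DbtRunOperator(
        task_id="dbt_run_from_s3",
        project_dir="s3://my-bucket/dbt/project/",
        profiles_dir="s3://my-bucket/dbt/profiles/",
    )
```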

This feature is intended to work in line with Airflow's [description of the task concept](https://airflow.apache.org/docs/apache-airflow/stable/concepts/tasks.html#relationships):

> Tasks don’t pass information to each other by default, and run entirely independently.

In our world, that means each task should be responsible for fetching all the dbt-related files it needs in order to run independently, as already described in [Independent Task Execution](#independent-task-execution).

As of the time of writing, S3 is the only supported backend for dbt projects, but we plan to extend this to more backends, initially targeting other file storage services commonly used in Airflow connections.

## Push dbt artifacts to XCom

Each dbt execution produces one or more [JSON artifacts](https://docs.getdbt.com/reference/artifacts/dbt-artifacts/) that are valuable for producing meta-metrics, building conditional workflows, reporting, and other uses. airflow-dbt-python can push these artifacts to [XCom](https://airflow.apache.org/docs/apache-airflow/stable/concepts/xcoms.html) as requested via the `do_xcom_push_artifacts` parameter, which takes a list of artifacts to push.
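
For example, a downstream task may pull a pushed artifact and operate on it. The sketch below is adapted from the example DAG in this repository (`examples/use_dbt_artifacts_dag.py`); the `process_dbt_artifacts` callable and the assumption that each artifact is pushed to XCom under its file name are illustrative:

``` python
import datetime as dt

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow_dbt_python.operators.dbt import DbtRunOperator


def process_dbt_artifacts(ti, **context):
    # Assumes each requested artifact is pushed to XCom under its file name.
    run_results = ti.xcom_pull(key="run_results.json", task_ids="dbt_run_daily")
    print(run_results)


with DAG(
    dag_id="example_dbt_artifacts",
    schedule_interval="0 0 * * *",
    start_date=dt.datetime(2022, 1, 1),
    catchup=False,
    dagrun_timeout=dt.timedelta(minutes=60),
) as dag:
    dbt_run = DbtRunOperator(
        task_id="dbt_run_daily",
        project_dir="/path/to/my/dbt/project/",
        profiles_dir="~/.dbt/",
        select=["+tag:daily"],
        exclude=["tag:deprecated"],
        target="production",
        profile="my-project",
        full_refresh=True,
        do_xcom_push_artifacts=["manifest.json", "run_results.json"],
    )

    process_artifacts = PythonOperator(
        task_id="process_artifacts",
        python_callable=process_dbt_artifacts,
    )

    dbt_run >> process_artifacts
```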

# Motivation

## Airflow running in a managed environment

Although [`dbt`](https://docs.getdbt.com/) is meant to be installed and used as a CLI, we may not have control over the environment where Airflow is running, which can leave us without the option of using `dbt` as a CLI.

This is exactly what happens when using [Amazon's Managed Workflows for Apache Airflow](https://aws.amazon.com/managed-workflows-for-apache-airflow/) or MWAA: although a list of Python requirements can be passed, the CLI cannot be found in the worker's PATH.

There is a workaround which involves using Airflow's `BashOperator` and running Python from the command line:

``` python
from airflow.operators.bash import BashOperator

BASH_COMMAND = "python -c 'from dbt.main import main; main()' run"
operator = BashOperator(
    task_id="dbt_run",
    bash_command=BASH_COMMAND,
)
```

But it can get sloppy when appending all potential arguments a `dbt run` command (or other subcommand) can take.

That's where `airflow-dbt-python` comes in: it abstracts the complexity of interfacing with `dbt-core` and exposes one operator for each `dbt` subcommand that can be instantiated with all the corresponding arguments that the `dbt` CLI would take.
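
For instance, flags you would pass to `dbt run` on the command line become constructor arguments. A minimal sketch (the selectors and target name are placeholders):

``` python
from airflow_dbt_python.operators.dbt import DbtRunOperator

# Roughly equivalent to:
#   dbt run --select +tag:daily --exclude tag:deprecated --target production --full-refresh
dbt_run = DbtRunOperator(
    task_id="dbt_run",
    select=["+tag:daily"],
    exclude=["tag:deprecated"],
    target="production",
    full_refresh=True,
)
```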

## An alternative to `airflow-dbt` that works without the dbt CLI

The alternative [`airflow-dbt`](https://pypi.org/project/airflow-dbt/) package, by default, would not work if the `dbt` CLI is not in PATH, which means it would not be usable in MWAA. There is a workaround via the `dbt_bin` argument, which can be set to `"python -c 'from dbt.main import main; main()' run"`, in a similar fashion to the `BashOperator` example above. Yet this approach is not without its limitations:
* `airflow-dbt` works by wrapping the `dbt` CLI, which makes our code dependent on the environment in which it runs.
* `airflow-dbt` does not support the full range of arguments a command can take. For example, `DbtRunOperator` does not have an attribute for `fail_fast`.
* `airflow-dbt` does not offer access to `dbt` artifacts created during execution. `airflow-dbt-python` does so by pushing any artifacts to [XCom](https://airflow.apache.org/docs/apache-airflow/stable/concepts/xcoms.html).

# Usage

@@ -153,51 +180,7 @@ with DAG(
dbt_test >> dbt_seed >> dbt_run
```

More examples can be found in the [`examples/`](examples/) directory and the [documentation](https://tomasfarias.github.io/airflow-dbt-python/example_dags.html).

# Testing

@@ -207,6 +190,8 @@
Tests are written using `pytest`, are located in `tests/`, and can be run with:

``` shell
poetry run pytest tests/ -vv
```

See development and testing instructions in the [documentation](https://tomasfarias.github.io/airflow-dbt-python/development.html).

# License

This project is licensed under the MIT license. See [LICENSE](LICENSE).
15 changes: 0 additions & 15 deletions docs/autodoc.rst

This file was deleted.

100 changes: 100 additions & 0 deletions docs/development.rst
@@ -0,0 +1,100 @@
.. _development:

Development
===========

This section describes how to set up a development environment. If you are looking to dig into the internals of airflow-dbt-python and make a (very appreciated) contribution to the project, read on.

Poetry
------

airflow-dbt-python uses `Poetry <https://python-poetry.org/>`_ for project management. Ensure it's installed before running: see `Poetry's installation documentation <https://python-poetry.org/docs/#installation>`_.

Installing Airflow
------------------

Development requires a local installation of Airflow, as airflow-dbt-python doesn't come bundled with one. We can install a specific version using ``pip``:

.. code-block:: shell

   pip install apache-airflow==2.2

.. note::
   Installing any 1.X version of Airflow will raise warnings due to dependency conflicts with ``dbt-core``. However, these conflicts should not impact airflow-dbt-python.

Installing the ``airflow`` extra will fetch the latest version of Airflow with major version 2:

.. code-block:: shell

   cd airflow-dbt-python
   poetry install -E airflow


Building from source
--------------------

Clone the main repo and install it:


.. code-block:: shell

   git clone https://github.com/tomasfarias/airflow-dbt-python.git
   cd airflow-dbt-python
   poetry install


Pre-commit hooks
----------------

A handful of `pre-commit <https://pre-commit.com/>`_ hooks are provided, including:

* Trailing whitespace trimming.
* Ensure EOF newline.
* Detect secrets.
* Code formatting (`black <https://github.com/psf/black>`_).
* PEP8 linting (`flake8 <https://github.com/pycqa/flake8/>`_).
* Static type checking (`mypy <https://github.com/python/mypy>`_).
* Import sorting (`isort <https://github.com/PyCQA/isort>`_).


Install hooks after cloning airflow-dbt-python:

.. code-block:: shell

   pre-commit install

Ensuring hooks pass is highly recommended as hooks are mapped to CI/CD checks that will block PRs.

Testing
-------

Unit tests are available for all operators and hooks. That being said, only a fraction of the large number of possible inputs that the operators and hooks can take is currently covered, so the unit tests do not offer perfect coverage (a single peek at the ``DbtBaseOperator`` should give you an idea of the level of state explosion we manage).

.. note::
   Unit tests (and airflow-dbt-python) assume dbt works correctly and do not assert the behavior of the dbt commands themselves.

Requirements
^^^^^^^^^^^^

Unit tests interact with a `PostgreSQL <https://www.postgresql.org/>`_ database as a target to run dbt commands. This requires PostgreSQL to be installed in your local environment. Installation instructions for all major platforms can be found here: https://www.postgresql.org/download/.

Some unit tests require the `Amazon provider package for Airflow <https://pypi.org/project/apache-airflow-providers-amazon/>`_. Ensure it's installed via the ``amazon`` extra:

.. code-block:: shell

   poetry install -E amazon

Running unit tests with pytest
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

airflow-dbt-python uses `pytest <https://docs.pytest.org/>`_ as its testing framework. After you have saved your changes, all unit tests can be run with:

.. code-block:: shell

   poetry run pytest tests/ -vv

Generating coverage reports with pytest-cov can be done with:

.. code-block:: shell

   poetry run pytest -vv --cov=./airflow_dbt_python --cov-report=xml:./coverage.xml --cov-report term-missing tests/