# Airflow

## What is Airflow?

Apache Airflow is an open-source platform designed to programmatically author, schedule, and monitor workflows.

It is widely used for orchestrating complex data pipelines and automating tasks in data engineering, machine learning, and other domains.

Airflow allows users to define workflows as Directed Acyclic Graphs (DAGs), where each task represents a step in the pipeline. These workflows can be scheduled to run at specific intervals and monitored for execution status.


## Installation

There are several ways to install Airflow; the official guide lists them:

https://airflow.apache.org/docs/apache-airflow/stable/installation/index.html

For a Docker Compose setup, see:

https://airflow.apache.org/docs/apache-airflow/stable/howto/docker-compose/index.html

Notes:

- You can install it from PyPI.
- You can then launch it in standalone mode. In that case, set this environment variable first to reduce log verbosity:

```shell
export AIRFLOW__LOGGING__LOGGING_LEVEL=WARNING
airflow standalone
```

## Airflow Core Concepts

Video from coder2j:

https://youtu.be/K9AnJ9_ZAXE?si=FMot2dGl5L26u-cT&t=1078

## DAGs and Workflows

**Workflow:**  
A workflow is a sequence of tasks or processes that are executed to achieve a specific goal. In the context of Airflow, a workflow is represented as a DAG, where each node is a task and edges define dependencies. Workflows automate and orchestrate complex processes, such as data pipelines, by managing the execution, scheduling, and monitoring of tasks.

**DAG (Directed Acyclic Graph):**
A DAG is a collection of all the tasks you want to run, organized in a way that clearly shows their relationships and dependencies.

In Airflow, a DAG defines the structure of your workflow, ensuring that each task is executed in the correct order and only once all its dependencies have been met. The "acyclic" property means there are no loops: a task can never depend, directly or indirectly, on itself.
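
To make the ordering and the acyclic property concrete, here is a minimal sketch that uses Python's standard-library `graphlib` module rather than Airflow itself. The task names (`extract`, `transform`, `load`, `report`) and the helper functions are hypothetical.

```python
from graphlib import CycleError, TopologicalSorter

# Map each task to the set of tasks it depends on (its upstream tasks).
# Task names are hypothetical.
dependencies = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
    "report": {"transform"},
}


def execution_order(deps):
    """Return one valid order in which every task runs after its dependencies."""
    return list(TopologicalSorter(deps).static_order())


def is_acyclic(deps):
    """Return True if the dependency graph contains no cycle."""
    try:
        execution_order(deps)
    except CycleError:
        return False
    return True


order = execution_order(dependencies)
print(order)
print(is_acyclic(dependencies))
```

A graph such as `{"a": {"b"}, "b": {"a"}}` has no valid order, which is why a workflow with a loop cannot be expressed as a DAG.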


<p align="center">
    <img src="files/airflow_dag_tasks_and_operators.png" alt="Airflow DAG, Tasks, and Operators" style="max-width: 75%;" source="https://www.youtube.com/watch?v=K9AnJ9_ZAXE">
</p>

**Execution Date:**  
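The execution date (named the logical date since Airflow 2.2) is a logical timestamp that identifies a DAG run; it is not necessarily the time at which the run actually executes. For interval-based schedules in Airflow 2, it marks the start of the data interval the run covers, so the run itself usually starts only after that interval has ended.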
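
The gap between the logical timestamp and the real start time can be sketched with plain `datetime` arithmetic. This assumes a daily, interval-based schedule as in Airflow 2, where the logical date marks the start of the interval; other schedule types may behave differently. The function name `daily_run_window` is hypothetical.

```python
from datetime import datetime, timedelta, timezone


def daily_run_window(logical_date):
    """Return (interval_start, interval_end) for a daily interval-based run.

    Assumption: the logical date marks the start of the interval, and the
    run becomes eligible to start once the interval has ended.
    """
    interval_start = logical_date
    interval_end = logical_date + timedelta(days=1)
    return interval_start, interval_end


logical_date = datetime(2024, 1, 1, tzinfo=timezone.utc)
start, end = daily_run_window(logical_date)

print(f"logical date:   {logical_date.isoformat()}")
print(f"data interval:  {start.isoformat()} -> {end.isoformat()}")
print(f"earliest start: {end.isoformat()}")
```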

**Task Instance:**  
A task instance is a specific run of a task for a given DAG run and execution date. It represents the state and result of a single execution of a task within a workflow. In code, a task instance is often referred to by the variable name `ti`.

**DAG Run:**  
A DAG run is an instance of a DAG execution, triggered either by a schedule or manually. Each DAG run is associated with an execution date and contains all the task instances for that particular run.
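
The relationship between a DAG run and its task instances can be pictured with a toy data model. The classes below are hypothetical stand-ins, not Airflow's own `DagRun` and `TaskInstance` classes, and the state names are simplified.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class TaskInstance:
    """One run of one task inside one DAG run (toy model)."""
    task_id: str
    state: str = "queued"


@dataclass
class DagRun:
    """One execution of a DAG, identified by its DAG id and execution date."""
    dag_id: str
    execution_date: datetime
    task_instances: dict = field(default_factory=dict)

    def add_task(self, task_id):
        """Create the task instance for this run and register it."""
        ti = TaskInstance(task_id)
        self.task_instances[task_id] = ti
        return ti


# Hypothetical DAG id and task names.
run = DagRun("example_pipeline", datetime(2024, 1, 1, tzinfo=timezone.utc))
for task_id in ("extract", "transform", "load"):
    run.add_task(task_id)

# Each task instance tracks its own state within this particular run.
ti = run.task_instances["extract"]
ti.state = "success"
```

A second run of the same DAG for another execution date would get its own, independent set of task instances.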

<p align="center">
    <img src="files/airflow_dag_run_execution_date_run_task_instance.png" alt="Airflow DAG, Run Execution Date and Task Instance" style="max-width: 75%;">
</p>

**airflow.cfg:**  
`airflow.cfg` is the main configuration file for Apache Airflow. It contains settings that control the behavior of Airflow components, such as the web server, scheduler, executor, and database connections. You can modify this file to customize your Airflow environment.
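
Settings in this file can also be overridden with environment variables named `AIRFLOW__{SECTION}__{KEY}`, as with `AIRFLOW__LOGGING__LOGGING_LEVEL`. The sketch below imitates that lookup with the standard-library `configparser`. It is a simplification, not Airflow's actual resolution logic; the helper names and the config values are illustrative.

```python
import configparser

# A fragment in the style of airflow.cfg; the values are illustrative.
CFG_TEXT = """
[core]
executor = LocalExecutor

[logging]
logging_level = INFO
"""


def env_var_name(section, key):
    """Build the environment variable name that overrides a config option."""
    return f"AIRFLOW__{section.upper()}__{key.upper()}"


def get_option(parser, environ, section, key):
    """Return the environment override if present, else the file value."""
    return environ.get(env_var_name(section, key), parser.get(section, key))


parser = configparser.ConfigParser()
parser.read_string(CFG_TEXT)

# A plain dict stands in for os.environ so the example has no side effects.
fake_environ = {"AIRFLOW__LOGGING__LOGGING_LEVEL": "WARNING"}

level = get_option(parser, fake_environ, "logging", "logging_level")
executor = get_option(parser, fake_environ, "core", "executor")
print(level, executor)
```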

**Web Server:**  
The web server provides a user interface for Airflow, allowing users to monitor DAGs, trigger runs, view logs, and manage workflows. It is typically accessible via a browser and is essential for workflow management and troubleshooting.

**Scheduler:**  
The scheduler is responsible for monitoring DAG definitions and scheduling task instances for execution based on their dependencies and schedules. It determines when each task instance should run and hands it to the executor.
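
The dependency check at the heart of this step can be sketched as follows. This is a simplified illustration, not the scheduler's actual algorithm: it ignores schedules, retries, and any other rules, and the function name `ready_tasks` and the task names are hypothetical.

```python
def ready_tasks(upstream, states):
    """Return tasks not yet started whose upstream tasks have all succeeded."""
    return sorted(
        task
        for task, deps in upstream.items()
        if states.get(task) is None
        and all(states.get(dep) == "success" for dep in deps)
    )


# Map each task to the set of tasks it depends on.
upstream = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
}

states = {}
first = ready_tasks(upstream, states)   # only the task with no dependencies
states["extract"] = "success"
second = ready_tasks(upstream, states)  # its downstream task is now unblocked
print(first, second)
```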

**Executor (Local or Sequential):**  
The executor determines how and where tasks are executed.  
- **Local Executor:** Runs tasks as parallel processes on the same machine as the scheduler, suitable for small to medium workloads.
- **Sequential Executor:** Runs one task at a time, mainly used for testing or development.
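
The difference between the two modes comes down to how many tasks may run at once. The sketch below illustrates this with the standard-library `concurrent.futures`; a thread pool stands in for the separate processes a real executor would use, and the helpers `run_tasks` and `make_task` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def run_tasks(tasks, max_workers):
    """Run callables with at most max_workers at once; return results in order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(task) for task in tasks]
        return [future.result() for future in futures]


def make_task(name):
    """Build a trivial task that reports its own completion."""
    def task():
        return f"{name} done"
    return task


tasks = [make_task(name) for name in ("a", "b", "c")]

sequential_results = run_tasks(tasks, max_workers=1)  # one task at a time
parallel_results = run_tasks(tasks, max_workers=3)    # up to three at once
print(sequential_results)
print(parallel_results)
```

Both calls return the same results; only the degree of concurrency differs.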

**Worker:**  
A worker is a process that executes the actual tasks defined in your DAGs. In distributed setups (like with the Celery executor), multiple workers can run tasks in parallel across different machines, increasing scalability and reliability. Each worker receives the tasks it should run from the executor, typically through a queue.

<p align="center">
    <img src="files/airflow_basic_architecture.png" alt="Airflow Basic Architecture" style="max-width: 75%;">
</p>

