# Understand Dataflows Gen2

## What is a dataflow?
**Dataflows Gen2** let you extract data from various sources, transform it using a wide range of transformation operations, and load it into a destination.
  - They use **`Power Query Online`**, which provides a visual interface for performing these tasks.

## How to use Dataflows Gen2
The goal of Dataflows Gen2 is to provide an easy, reusable way to perform ETL tasks using **Power Query Online**.

Dataflows let you promote reusable ETL logic, which removes the need to create additional connections to your data source. Dataflows offer a wide variety of transformations, and can be run manually, on a refresh schedule, or as part of a Data Pipeline orchestration.

Dataflows can be **horizontally partitioned** as well.
  - Once you create a global dataflow, data analysts can use dataflows to create specialized semantic models for specific needs.

**Way 1 - `Data Pipeline`:** you copy the data, then use your preferred coding language to **`extract`**, **`transform`**, and **`load`** it.

**Way 2 - `Dataflow Gen2`:** Alternatively, you can create a **Dataflow Gen2** first to **`extract`** and **`transform`** the data.
- You can also **`load`** the data into a Lakehouse, and other destinations. Adding a data destination to your dataflow is optional, and the dataflow preserves all transformation steps. 
- To perform other tasks or load data to a different destination after transformation, create a **Data Pipeline** and add the **Dataflow Gen2** activity to your orchestration.

**Way 3 - `Data Pipeline`** + **`Dataflow Gen2`**: for an ELT (Extract, Load, Transform) process.
- For this order, you'd use a **Data Pipeline** to **`extract`** and **`load`** the data into your preferred destination, such as the Lakehouse. 
- Then you'd create a **Dataflow Gen2** to connect to Lakehouse data to cleanse and **`transform`** data. In this case, you'd offer the Dataflow as a curated semantic model for data analysts to develop reports.
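
The difference between the ETL order (Way 2) and the ELT order (Way 3) can be sketched in plain Python. This is only an illustration: `extract`, `transform`, and `load` are hypothetical stand-ins written for this example, not Fabric APIs, and the in-memory list merely plays the role of a Lakehouse table.

```python
# Illustrative only: plain Python stand-ins, not Fabric APIs.

def extract():
    """Pretend to read raw rows from a source system (sample data is made up)."""
    return [
        {"city": " seattle ", "sales": "120"},
        {"city": "Austin", "sales": "95"},
    ]

def transform(rows):
    """Cleanse rows: trim and title-case the city, convert sales to integers."""
    return [
        {"city": row["city"].strip().title(), "sales": int(row["sales"])}
        for row in rows
    ]

def load(rows, destination):
    """Append rows to a list that stands in for a Lakehouse table."""
    destination.extend(rows)
    return destination

# ETL (Way 2): transform the data before it lands in the destination.
etl_table = load(transform(extract()), [])

# ELT (Way 3): land the raw data first, then transform what was loaded.
raw_table = load(extract(), [])
elt_table = transform(raw_table)

# Both orders end with the same cleansed rows; only the sequence differs.
print(etl_table == elt_table)
```

In the ELT order the raw rows stay available in `raw_table`, which mirrors the idea of landing data in the Lakehouse first and curating it with a dataflow afterwards.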

## Benefits and limitations
Benefits:
- Extend data with consistent data, such as a standard date dimension table.
- Give self-service users access to a subset of the data warehouse separately.
- Optimize performance with dataflows, which enable extracting data once for reuse, reducing data refresh time for slower sources.
- Simplify data source complexity by only exposing dataflows to larger analyst groups.
- Ensure consistency and quality of data by enabling users to clean and transform data before loading it to a destination.
- Simplify data integration by providing a low-code interface that ingests data from various sources.

Limitations:
- Not a replacement for a data warehouse.
- Row-level security isn't supported.
- A Fabric capacity workspace is required.

# Explore Dataflows Gen2
In Microsoft Fabric, you can create a Dataflow Gen2 in the Data Factory workload or Power BI workspace, or directly in the lakehouse.

<img src="./images/03/power-query-online-overview.png" alt="Power Query Online" style="border: 2px solid black; border-radius: 10px;">

## (1) Power Query ribbon
**Dataflows Gen2** support a wide variety of data **source connectors** and numerous data **transformations**, such as:
- Filter and Sort rows
- Pivot and Unpivot
- Merge and Append queries
- Split and Conditional split
- Replace values and Remove duplicates
- Add, Rename, Reorder, or Delete columns
- Rank and Percentage calculator
- Top N and Bottom N
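
To make a few of these transformations concrete, here is a minimal sketch in plain Python of what remove duplicates, filter, sort, and Top N do to a small table. The sample rows and helper names are assumptions made for this example; in a dataflow you would apply these from the ribbon rather than write code like this.

```python
# Illustrative only: hypothetical helpers that mimic a few ribbon transformations.

rows = [
    {"product": "A", "region": "West", "sales": 300},
    {"product": "B", "region": "East", "sales": 150},
    {"product": "A", "region": "West", "sales": 300},  # exact duplicate row
    {"product": "C", "region": "West", "sales": 450},
    {"product": "D", "region": "East", "sales": 50},
]

def remove_duplicates(table):
    """Keep the first occurrence of each fully identical row."""
    seen = set()
    unique = []
    for row in table:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique

def filter_rows(table, predicate):
    """Keep only the rows for which the predicate is true."""
    return [row for row in table if predicate(row)]

def sort_rows(table, column, descending=False):
    """Return a new table sorted by one column."""
    return sorted(table, key=lambda row: row[column], reverse=descending)

def top_n(table, column, n):
    """Return the n rows with the largest values in the column."""
    return sort_rows(table, column, descending=True)[:n]

deduplicated = remove_duplicates(rows)
west_only = filter_rows(deduplicated, lambda row: row["region"] == "West")
top_two = top_n(deduplicated, "sales", 2)

print([row["product"] for row in west_only])
print([row["product"] for row in top_two])
```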

## (2) Queries pane
**The Queries pane** shows the different data sources, which are now called queries.
  - Available options include rename, duplicate, reference, and enable staging.

## (3) Diagram view
**The Diagram View** allows you to visually see how the data sources are connected and the different applied transformations.

## (4) Data Preview pane
**The Data Preview pane** shows only a subset of the data, so you can see which transformations to make and how they affect the data.
- You can also interact with the preview pane by dragging and dropping columns to change order or right-clicking on columns to filter or make changes.

## (5) Query Settings pane
**The Query Settings pane** primarily includes **Applied Steps**. 
- Each transformation you apply is recorded as a step, and some steps are applied automatically when you connect to the data source. Depending on the complexity of the transformations, you may have several applied steps for each query.
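
One way to think about applied steps is as an ordered list of functions, where each step receives the output of the previous one. The sketch below models that idea in plain Python. The step names, functions, and sample data are assumptions chosen for illustration; this is not how Power Query is implemented.

```python
# Illustrative only: a query modelled as an ordered list of named steps.

def promote_headers(table):
    """Use the first row as column names for the remaining rows."""
    header, *body = table
    return [dict(zip(header, row)) for row in body]

def change_type(table):
    """Convert the sales column from text to integers."""
    return [{**row, "sales": int(row["sales"])} for row in table]

def keep_large_sales(table):
    """Keep only rows with sales of at least 100."""
    return [row for row in table if row["sales"] >= 100]

# Each entry pairs an example step name with the function that performs it.
applied_steps = [
    ("Promoted Headers", promote_headers),
    ("Changed Type", change_type),
    ("Filtered Rows", keep_large_sales),
]

def run_query(source, steps):
    """Apply the steps in order and record the row count after each one."""
    result = source
    history = []
    for name, step in steps:
        result = step(result)
        history.append((name, len(result)))
    return result, history

source = [["city", "sales"], ["Seattle", "120"], ["Austin", "95"]]
result, history = run_query(source, applied_steps)

print(result)
print(history)
```

Because the steps are ordered, removing or reordering one changes the input of every step after it, which is why the order shown in the Applied Steps list matters.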

While this visual interface is helpful, you can also view the M code through **Advanced editor**.

<img src="./images/03/power-query-advanced-editor.png" alt="Power Query Advanced Editor" style="border: 2px solid black; border-radius: 10px;">

In the Query settings pane, you can see a **Data Destination** field where you can set the Lakehouse as your destination.


<img src="https://files.training.databricks.com/images/icon_note_32.png" alt="Note">  If made available, data analysts can also connect to the dataflow through Power BI Desktop.

![Power BI Desktop Get Data Connectors](./images/03/power-bi-desktop-dataflow-connectors.png)

# Integrate Dataflows Gen2 and Pipelines
**Dataflows Gen2** provide an excellent option for data transformations in Microsoft Fabric. The combination of dataflows and pipelines is useful when you need to **perform additional operations** on the transformed data.

**Data pipelines** are easily created in the Data Factory and Data Engineering workloads. Pipelines are a common concept in data engineering and offer a wide variety of activities to orchestrate. Some common activities include:
- Copy data
- Incorporate Dataflow
- Add Notebook
- Get metadata
- Execute a script or stored procedure

<img src="./images/03/pipelines-options.png" alt="Pipeline Options" style="border: 2px solid black; border-radius: 10px;">

Pipelines provide a visual way to complete activities in a specific order. 
- You can use a **`dataflow`** to ingest and transform data and land it in a Lakehouse.
- Then incorporate the **`dataflow`** into a **`pipeline`** to orchestrate extra activities, such as executing scripts or stored procedures after the dataflow has completed.

<img src="./images/03/pipeline-dataflow-markup.png" alt="Pipeline & Dataflow" style="border: 2px solid black; border-radius: 10px;">
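
The ordering described above, where a script runs only after the dataflow has completed, can be sketched as a simple sequential runner. Everything here is hypothetical: `run_pipeline`, the two activity functions, and the dictionary standing in for a Lakehouse are made up for illustration, and the sketch models only a run-on-success dependency between activities.

```python
# Illustrative only: a toy sequential runner, not the Fabric pipeline engine.

def run_pipeline(activities):
    """Run activities in order and stop at the first failure."""
    log = []
    for name, activity in activities:
        try:
            activity()
        except Exception as error:
            log.append((name, f"Failed: {error}"))
            break
        log.append((name, "Succeeded"))
    return log

# A dictionary stands in for the Lakehouse that both activities share.
lakehouse = {}

def dataflow_activity():
    """Stand-in for a dataflow that lands transformed rows in the Lakehouse."""
    lakehouse["sales"] = [
        {"city": "Seattle", "sales": 120},
        {"city": "Austin", "sales": 95},
    ]

def script_activity():
    """Stand-in for a script that depends on the table the dataflow produced."""
    if "sales" not in lakehouse:
        raise RuntimeError("sales table is missing")
    lakehouse["sales_total"] = sum(row["sales"] for row in lakehouse["sales"])

log = run_pipeline([
    ("Dataflow", dataflow_activity),
    ("Script", script_activity),
])

print(log)
print(lakehouse["sales_total"])
```

If the order of the two activities were swapped, the script would fail because the table it reads would not exist yet, which is the kind of dependency a pipeline makes explicit.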