# Explore data load strategies
In Microsoft Fabric, there are many ways you can choose to load data in a warehouse. 
- This step is fundamental as it ensures that high-quality, transformed or processed data is integrated into a **single repository**.
- Also, the efficiency of data loading directly impacts the **timeliness** and **accuracy** of analytics, making it vital for real-time decision-making processes.

## Understand data ingestion and data load operations
While both processes are part of the ETL (Extract, Transform, Load) pipeline in a data warehouse scenario, they usually serve different purposes. 
- **Data ingestion/extract:** is about **`moving`** raw data from various sources into a central repository. 
- **Data loading:** involves **`taking`** the **transformed** or **processed** data and **`loading`** it into the final storage destination for analysis and reporting.

All Fabric data items like data warehouses and lakehouses store their data automatically in OneLake in Delta Parquet format.

## Stage your data
You may have to build and work with **auxiliary objects** involved in a **load operation** such as tables, stored procedures, and functions. These auxiliary objects are commonly referred to as **staging**.
- Staging objects act as temporary storage and transformation areas.
- They can share resources with a data warehouse, or live in their own storage area.

Staging serves as an abstraction layer, simplifying and facilitating the load operation to the final tables in the data warehouse.

<img src="../images/04_Implement a data warehouse with Microsoft Fabric/01/1-data-warehouse-process.png" alt="Sequential steps" style="border: 2px solid black; border-radius: 10px;">

A staging area also provides a buffer that helps minimize the impact of the load operation on the performance of the data warehouse. This is important in environments where the data warehouse needs to remain operational and responsive during the data loading process.
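As a rough illustration of the staging pattern described above, the following sketch uses Python's built-in `sqlite3` module as a stand-in for a warehouse. The table names `stg_customer` and `dim_customer` and the cleanup rules are hypothetical; they are assumptions made for the example, not something Fabric prescribes.

```python
import sqlite3

def load_via_staging(conn, raw_rows):
    """Land raw rows in a staging table, clean them, then load the final table."""
    cur = conn.cursor()
    cur.execute("CREATE TABLE IF NOT EXISTS stg_customer (id INTEGER, name TEXT)")
    cur.execute("CREATE TABLE IF NOT EXISTS dim_customer (id INTEGER PRIMARY KEY, name TEXT)")
    # 1. Land the raw data in staging; the final table is untouched so far.
    cur.execute("DELETE FROM stg_customer")
    cur.executemany("INSERT INTO stg_customer (id, name) VALUES (?, ?)", raw_rows)
    # 2. Transform in staging: trim names and drop rows without a usable name.
    cur.execute("UPDATE stg_customer SET name = TRIM(name)")
    cur.execute("DELETE FROM stg_customer WHERE name IS NULL OR name = ''")
    # 3. Load the cleaned rows into the final table in a single statement.
    cur.execute(
        "INSERT OR REPLACE INTO dim_customer (id, name) "
        "SELECT id, name FROM stg_customer"
    )
    conn.commit()
    return cur.execute("SELECT id, name FROM dim_customer ORDER BY id").fetchall()

conn = sqlite3.connect(":memory:")
result = load_via_staging(conn, [(1, "  Ada "), (2, ""), (3, "Grace")])
print(result)
```

Because the cleanup happens in the staging table, the final table is only touched by the last, short statement.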

## Review types of data loads
There are two types of data loads to consider when loading a data warehouse.


| Load Type | Description | Operation | Duration | Complexity | Best used |
| ----------|-------------|-----------|-----------|-------------|-----------|
| Full (initial) load | The process of populating the data warehouse for the **first time**. | All the tables are **truncated** and **reloaded**, and the old data is lost | It may take longer to complete due to the amount of data being handled | Easier to implement as there's no history preserved | This method is typically used when setting up a **new data warehouse**, or when a **complete refresh of the data** is required |
| Incremental load | The process of **updating** the data warehouse with the changes since the last update | The history is preserved, and tables are **updated** with new information | Takes less time than the initial load | Implementation is more complex than the initial load. It requires mechanisms to **track changes** in the source data since the last load. | This method is commonly used for regular updates to the data warehouse, such as daily or hourly updates. |

<img src="https://files.training.databricks.com/images/icon_note_32.png" alt="Note"> To learn more about how to perform an incremental load, see [Incremental load](https://learn.microsoft.com/en-us/fabric/data-factory/tutorial-incremental-copy-data-warehouse-lakehouse).
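To make the difference between the two load types concrete, here is a minimal sketch in plain Python. It assumes the source exposes a `modified` marker that can serve as a watermark; that column, the function names, and the in-memory dictionaries are all hypothetical. Deleted source rows are not handled here.

```python
def full_load(source_rows):
    """Full load: discard the target and rebuild it from the source."""
    return {row["id"]: row for row in source_rows}

def incremental_load(target, source_rows, last_watermark):
    """Incremental load: apply only rows modified after the last watermark."""
    changed = [r for r in source_rows if r["modified"] > last_watermark]
    for row in changed:
        target[row["id"]] = row  # insert new rows, update existing ones
    new_watermark = max((r["modified"] for r in changed), default=last_watermark)
    return new_watermark, len(changed)

source = [
    {"id": 1, "name": "Widget", "modified": 10},
    {"id": 2, "name": "Gadget", "modified": 20},
]
target = full_load(source)
watermark = 20

# Later, the source changes: one update and one new row.
source[0] = {"id": 1, "name": "Widget v2", "modified": 30}
source.append({"id": 3, "name": "Gizmo", "modified": 35})
watermark, applied = incremental_load(target, source, watermark)
print(watermark, applied, sorted(target))
```

The watermark is the change-tracking mechanism in this sketch: only rows newer than the stored value are read and applied on the next run.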

## Load a dimension table
Think of a dimension table as the "who, what, where, when, why" of your data warehouse. It’s like the **descriptive backdrop** that gives context to the raw numbers found in the fact tables.

### Slowly changing dimensions (SCD)
**Slowly changing dimensions** change over time, but at a **slow pace** and **unpredictably**. There are several types of SCD in a data warehouse, with type 1 and type 2 being the most frequently used.

#### SCD 0 - Retain Original
- The dimension attributes **`never change`**.

#### SCD 1 - Overwrite
- **`Overwrites`** existing data, doesn't keep history.
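A type 1 change can be sketched in a few lines of Python. The dictionary-based dimension and the name `apply_scd1` are hypothetical and only illustrate the overwrite behavior.

```python
def apply_scd1(dimension, source_key, new_attributes):
    """Type 1: overwrite attributes in place; the previous values are not kept."""
    row = dimension.setdefault(source_key, {})
    row.update(new_attributes)
    return row

dim_customer = {"C1": {"name": "Ada", "city": "London"}}
apply_scd1(dim_customer, "C1", {"city": "Paris"})
print(dim_customer)
```

After the call, nothing in the dimension records that the customer was ever in London.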

#### SCD 2 - Add New Row
- **Adds `new records`** for changes, keeps full history for a given natural key.

#### SCD 3 - Add New Attribute
- History is kept by **adding a new column**, so each row records both the **current** and the **previous** value.
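A minimal Python sketch of a type 3 change follows. The `previous_<attribute>` column naming is an assumption made for the example.

```python
def apply_scd3(row, attribute, new_value):
    """Type 3: keep the prior value in a 'previous_<attribute>' column."""
    if row.get(attribute) != new_value:
        row[f"previous_{attribute}"] = row.get(attribute)
        row[attribute] = new_value
    return row

customer = {"name": "Ada", "city": "London", "previous_city": None}
apply_scd3(customer, "city", "Paris")
apply_scd3(customer, "city", "Rome")
print(customer)
```

Only one level of history survives: after the second change, the original value `London` is gone.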

#### SCD 4 - Add Mini-Dimension
- A **new `mini-dimension` is added**.
- The type 4 technique is used when:
  - A **`group of attributes`** in a dimension **rapidly changes** and is **split off** to a mini-dimension. This situation is sometimes called a **_rapidly changing monster dimension_**. 
  - **Frequently `used` attributes** in multimillion-row dimension tables are mini-dimension design **candidates**, even if they don’t frequently change. 
- The surrogate keys of both the base dimension and mini-dimension are captured in the associated fact tables.

**Example:**

<img src="../images/04_Implement a data warehouse with Microsoft Fabric/01/SCD_4.png" alt="SCD 4" style="border: 2px solid black; border-radius: 10px;">

#### SCD 5 - Add Mini-Dimension and Type 1 Outrigger
- The type 5 technique is used to:
  - Accurately preserve **`historical` attribute values**, plus 
  - Report historical facts according to **`current` attribute values**. 
- **Type 5** builds on the **`type 4` mini-dimension** by also embedding a current **`type 1`** reference to the mini-dimension in the base dimension. 
  - This enables the currently assigned mini-dimension attributes to be accessed along with the others in the base dimension without linking through a fact table. 
  - Logically, you’d represent the base dimension and mini-dimension outrigger as a single table in the presentation area. 
- The ETL team **must `overwrite`** this type 1 mini-dimension reference whenever the current mini-dimension assignment changes.

**Example:**

<img src="../images/04_Implement a data warehouse with Microsoft Fabric/01/SCD_5.png" alt="SCD 5" style="border: 2px solid black; border-radius: 10px;">

#### SCD 6 - Add Type 1 Attribute To Type 2 Dimension
- **Type 6** builds on:
  - The **`type 2`** technique by 
  - also embedding current **`type 1`** versions of the same attributes in the dimension row so that fact rows can be filtered or grouped by either the type 2 attribute value in effect when the measurement occurred or the attribute’s current value. 
- In this case, the type 1 attribute is systematically **`overwritten` on all rows** associated with a particular durable key whenever the attribute is updated.

**Example:**

<img src="../images/04_Implement a data warehouse with Microsoft Fabric/01/SCD_6.png" alt="SCD 6" style="border: 2px solid black; border-radius: 10px;">
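The type 6 behavior described above can be sketched in Python as follows. The column names (`durable_key`, `historic_city`, `current_city`, and so on) are hypothetical, and the list of dictionaries stands in for the dimension table.

```python
from datetime import date

HIGH_DATE = date(9999, 12, 31)

def apply_scd6(rows, durable_key, new_value, change_date):
    """Type 6: add a type 2 row and overwrite the type 1 column on all rows."""
    for row in rows:
        if row["durable_key"] == durable_key:
            if row["is_current"]:
                # Type 2 part: expire the current version.
                row["end_date"] = change_date
                row["is_current"] = False
            # Type 1 part: every version of this key shows the latest value.
            row["current_city"] = new_value
    rows.append({
        "durable_key": durable_key,
        "historic_city": new_value,  # value in effect for this version
        "current_city": new_value,
        "start_date": change_date,
        "end_date": HIGH_DATE,
        "is_current": True,
    })
    return rows

dim = [{
    "durable_key": "C1", "historic_city": "London", "current_city": "London",
    "start_date": date(2020, 1, 1), "end_date": HIGH_DATE, "is_current": True,
}]
apply_scd6(dim, "C1", "Paris", date(2024, 6, 1))
for row in dim:
    print(row["historic_city"], row["current_city"], row["is_current"])
```

Facts joined to the older row can then be grouped either by `historic_city` (the value at the time) or by `current_city` (the value today).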

In [0]:
%sql
-- The following example shows how to handle changes in a type 2 SCD for the Dim_Products table using T-SQL.
-- It assumes a change was detected and @ProductID and @ProductName hold the incoming source values.
IF EXISTS (SELECT 1 FROM Dim_Products WHERE SourceKey = @ProductID AND IsActive = 'True')
BEGIN
    -- Existing product record: expire the current version
    UPDATE Dim_Products
    SET EndDate = GETDATE(), IsActive = 'False'
    WHERE SourceKey = @ProductID AND IsActive = 'True';
END

-- Insert the new current version (this is also the first version of a new product)
INSERT INTO Dim_Products (SourceKey, ProductName, StartDate, EndDate, IsActive)
VALUES (@ProductID, @ProductName, GETDATE(), '9999-12-31', 'True');

The mechanism for detecting changes in source systems is crucial for determining when records are inserted, updated, or deleted. [Change Data Capture (CDC)](https://learn.microsoft.com/en-us/sql/relational-databases/track-changes/about-change-data-capture-sql-server?view=sql-server-ver16), [change tracking](https://learn.microsoft.com/en-us/sql/relational-databases/track-changes/about-change-tracking-sql-server?view=sql-server-ver16), and [triggers](https://learn.microsoft.com/en-us/sql/relational-databases/triggers/dml-triggers?view=sql-server-ver16) are all features available for managing data tracking in source systems such as SQL Server.
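When none of those source-side features is available, a simpler (and more expensive) alternative is to compare two snapshots of the source. The sketch below is plain Python, not CDC or change tracking; the function name and sample data are hypothetical.

```python
def detect_changes(previous, current):
    """Compare two snapshots keyed by primary key and classify the differences."""
    inserted = sorted(k for k in current if k not in previous)
    deleted = sorted(k for k in previous if k not in current)
    updated = sorted(
        k for k in current if k in previous and current[k] != previous[k]
    )
    return {"inserted": inserted, "updated": updated, "deleted": deleted}

previous = {1: ("Widget", 9.99), 2: ("Gadget", 19.99), 3: ("Gizmo", 4.50)}
current = {1: ("Widget", 9.99), 2: ("Gadget", 17.99), 4: ("Doohickey", 2.00)}
changes = detect_changes(previous, current)
print(changes)
```

The resulting lists tell the load process which rows to insert, which to treat as dimension changes, and which to flag as removed.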

## Load a fact table
Let's consider an example where we load a Fact_Sales table in a data warehouse. This table contains sales transactions data with columns such as FactKey, DateKey, ProductKey, OrderID, Quantity, Price, and LoadTime.

Assume we have a source table Order_Detail in an OLTP system with columns: OrderID, OrderDate, ProductID, Quantity, and Price.

The following T-SQL script example loads the Fact_Sales table.

In [0]:
%sql
-- Lookup keys in dimension tables
INSERT INTO Fact_Sales (DateKey, ProductKey, OrderID, Quantity, Price, LoadTime)
SELECT d.DateKey, p.ProductKey, o.OrderID, o.Quantity, o.Price, GETDATE()
FROM Order_Detail o
JOIN Dim_Date d ON o.OrderDate = d.Date
JOIN Dim_Product p ON o.ProductID = p.ProductID;

In this example, we use a JOIN operation to look up the **DateKey** and **ProductKey** values in the **Dim_Date** and **Dim_Product** tables, respectively, and then insert the data into the Fact_Sales table. 
- However, it is important to note that the complexity of the loading process depends on several factors, including the amount of data, the transformation requirements, error handling, schema differences, and performance.
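One of those factors, error handling, can be sketched in Python. An inner join silently drops order lines whose date or product has no match in the dimensions; the hypothetical `build_fact_rows` function below performs the same key lookup but sets those rows aside for review. The dictionaries stand in for the Dim_Date and Dim_Product lookups.

```python
from datetime import datetime, timezone

def build_fact_rows(orders, date_keys, product_keys):
    """Resolve surrogate keys for each order line; set aside rows that fail a lookup."""
    loaded, rejected = [], []
    load_time = datetime.now(timezone.utc)
    for order in orders:
        date_key = date_keys.get(order["OrderDate"])
        product_key = product_keys.get(order["ProductID"])
        if date_key is None or product_key is None:
            rejected.append(order)  # an inner JOIN would silently drop this row
            continue
        loaded.append({
            "DateKey": date_key, "ProductKey": product_key,
            "OrderID": order["OrderID"], "Quantity": order["Quantity"],
            "Price": order["Price"], "LoadTime": load_time,
        })
    return loaded, rejected

orders = [
    {"OrderID": 100, "OrderDate": "2024-01-05", "ProductID": "P1", "Quantity": 2, "Price": 9.99},
    {"OrderID": 101, "OrderDate": "2024-01-05", "ProductID": "P9", "Quantity": 1, "Price": 5.00},
]
date_keys = {"2024-01-05": 20240105}
product_keys = {"P1": 1}
loaded, rejected = build_fact_rows(orders, date_keys, product_keys)
print(len(loaded), [o["OrderID"] for o in rejected])
```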

# Use data pipelines to load a warehouse

## Create a data pipeline
There are a few ways to launch the data pipeline editor.

- **From the workspace:** Select + **New**, then select **Data pipeline**. If it's not visible in the list, select **More options**, then find **Data pipeline** under the **Data Factory section**.

<img src="../images/04_Implement a data warehouse with Microsoft Fabric/01/3-data-pipeline-create.gif" alt="Launch Data Pipeline from the workspace" style="border: 2px solid black; border-radius: 10px;">

- **From the warehouse asset:** Select **Get Data**, and then **New data pipeline**.

<img src="../images/04_Implement a data warehouse with Microsoft Fabric/01/3-create-data-pipeline.png" alt="Shortcuts for a few features in the Warehouse asset" style="border: 2px solid black; border-radius: 10px;">

There are three options available when creating a pipeline.

<img src="../images/04_Implement a data warehouse with Microsoft Fabric/01/3-build-pipeline.png" alt="Options available when creating a pipeline" style="border: 2px solid black; border-radius: 10px;">

| Option | Description |
| -- | -- |
| 1. Add pipeline activity | Launches the pipeline editor where you can create your own pipeline. |
| 2. Copy data | Launches an assistant to copy data from various data sources to a data destination. A new pipeline activity is generated at the end with a preconfigured Copy Data task. |
| 3. Choose a task to start | You can choose from a collection of predefined templates to assist you in initiating pipelines based on many scenarios. |

## Configure the copy data assistant
The copy data assistant provides a step-by-step interface that facilitates the configuration of a **Copy Data** task.

<img src="../images/04_Implement a data warehouse with Microsoft Fabric/01/3-copy-data-assistant.png" alt="Copy data assistant" style="border: 2px solid black; border-radius: 10px;">

- **Choose data source:** Select a connector, and provide the connection information.
- **Connect to a data source:** Select, preview, and choose the data. This can be done from tables or views, or you can customize your selection by providing your own query.
- **Choose data destination:** Select the data store as the destination.
- **Connect to data destination:** Select and map columns from source to destination. You can load to a new or existing table.
- **Settings:** Configure other settings like staging, and default values.

After you copy the data, you can use other tasks to further transform and analyze it. You can also use the **Copy Data** task to publish transformation and analysis results for business intelligence (BI) and application consumption.

## Schedule a data pipeline
You can schedule your data pipeline by selecting **Schedule** from the data pipeline editor.

<img src="../images/04_Implement a data warehouse with Microsoft Fabric/01/3-schedule-data-pipeline.png" alt="Schedule a data pipeline from the pipeline designer" style="border: 2px solid black; border-radius: 10px;">

You can also configure the schedule by selecting **Settings** in the **Home** menu in the data pipeline editor.

<img src="../images/04_Implement a data warehouse with Microsoft Fabric/01/3-schedule-configuration.png" alt="Configuration properties when you schedule a data pipeline" style="border: 2px solid black; border-radius: 10px;">

We recommend data pipelines for a code-free or low-code experience due to the graphical user interface. They're ideal for data workflows that run on a schedule, or that connect to different data sources.

To learn more about data pipelines, see [Ingest data into your Warehouse using data pipelines](https://learn.microsoft.com/en-us/fabric/data-warehouse/ingest-data-pipelines).

# Load data using T-SQL

## Use COPY statement
The [COPY statement](https://learn.microsoft.com/en-us/sql/t-sql/statements/copy-into-transact-sql?view=azure-sqldw-latest) serves as the main method for importing data into the Warehouse. It facilitates efficient data ingestion from an external Azure storage account.

It is flexible: you can specify the format of the source file, designate a location for storing rows that are rejected during the import, skip header rows, and set other options.

The option to store rejected rows separately is useful for data cleaning and quality control. It allows you to easily identify and investigate any issues with the data that weren't successfully imported.
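The idea of skipping a header row and keeping rejected rows apart can be illustrated with a small Python sketch. This is not how the COPY statement is implemented; the function name, the three-column layout, and the `|` delimiter are assumptions made for the example.

```python
import csv
import io

def split_valid_and_rejected(csv_text, delimiter="|"):
    """Parse delimited text, skipping the header and separating rows that fail."""
    reader = csv.reader(io.StringIO(csv_text), delimiter=delimiter)
    header = next(reader)  # skip the header row
    valid, rejected = [], []
    for line_number, fields in enumerate(reader, start=2):
        try:
            if len(fields) != len(header):
                raise ValueError("unexpected column count")
            valid.append({
                "id": int(fields[0]),
                "name": fields[1],
                "price": float(fields[2]),
            })
        except ValueError as error:
            # Keep the line number and reason so the row can be investigated.
            rejected.append((line_number, fields, str(error)))
    return valid, rejected

sample = "id|name|price\n1|Widget|9.99\nx|Gadget|19.99\n3|Gizmo\n4|Doohickey|2.00\n"
valid, rejected = split_valid_and_rejected(sample)
print([row["id"] for row in valid], [entry[0] for entry in rejected])
```

Good rows load normally, while the rejected list keeps enough context to diagnose each failure afterwards.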

To connect to an Azure storage account, you need to use either Shared Access Signature (SAS) or Storage Account Key (SAK).

<img src="https://files.training.databricks.com/images/icon_note_32.png" alt="Note"> The COPY statement currently supports the PARQUET and CSV file formats.

### Handle error
The option to use a different storage account for the _ERRORFILE_ location (`REJECTED_ROW_LOCATION`) allows for better error handling and debugging. It makes it easier to isolate and investigate any issues that occur during the data loading process. _ERRORFILE_ only applies to CSV.

### Load multiple files
The ability to specify wildcards and multiple files in the storage location path allows the COPY statement to handle bulk data loading efficiently. This is useful when dealing with large datasets distributed across multiple files.

Multiple file locations can only be specified from the same storage account and container via a comma-separated list.
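Before building a multi-location COPY statement, it can help to check that constraint up front. The helper below is a hypothetical pre-flight check written in Python; it assumes blob-style URLs where the first path segment is the container.

```python
from urllib.parse import urlparse

def same_account_and_container(locations):
    """Return True when every URL points at the same storage account and container."""
    seen = set()
    for location in locations:
        parsed = urlparse(location)
        container = parsed.path.lstrip("/").split("/", 1)[0]
        seen.add((parsed.netloc.lower(), container))
    return len(seen) == 1

ok = same_account_and_container([
    "https://myaccount.blob.core.windows.net/myblobcontainer/folder0/*.csv",
    "https://myaccount.blob.core.windows.net/myblobcontainer/folder1/",
])
mixed = same_account_and_container([
    "https://myaccount.blob.core.windows.net/myblobcontainer/folder0/*.csv",
    "https://myaccount.blob.core.windows.net/othercontainer/folder1/",
])
print(ok, mixed)
```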

In [0]:
%sql
-- Load CSV files from two locations in the same storage account and container.
COPY INTO my_table
FROM 'https://myaccount.blob.core.windows.net/myblobcontainer/folder0/*.csv',
    'https://myaccount.blob.core.windows.net/myblobcontainer/folder1/'
WITH (
    FILE_TYPE = 'CSV',
    CREDENTIAL = (IDENTITY = 'Shared Access Signature', SECRET = '<Your_SAS_Token>'),
    FIELDTERMINATOR = '|'
)

In [0]:
%sql
--The following example shows how to load a PARQUET file.
COPY INTO test_parquet
FROM 'https://myaccount.blob.core.windows.net/myblobcontainer/folder1/*.parquet'
WITH (
    FILE_TYPE = 'PARQUET',
    CREDENTIAL = (IDENTITY = 'Shared Access Signature', SECRET = '<Your_SAS_Token>')
)

## Load table from other warehouses and lakehouses
You can load data from various data assets in a workspace, such as other warehouses and lakehouses.

To reference the data asset, ensure that you use [three-part naming](https://learn.microsoft.com/en-us/sql/t-sql/language-elements/transact-sql-syntax-conventions-transact-sql?view=sql-server-ver16) to combine data from tables on these workspace assets. You can then use `CREATE TABLE AS SELECT` (CTAS) and `INSERT...SELECT` to load the data into the warehouse.

| SQL Statement	| Description |
|--|--|
| [CREATE TABLE AS SELECT](https://learn.microsoft.com/en-us/sql/t-sql/statements/create-table-as-select-azure-sql-data-warehouse?view=azure-sqldw-latest)	| Allows you to create a new table based on the output of a `SELECT` statement. This operation is often used for creating a copy of a table or for transforming and loading the results of complex queries. |
| [INSERT...SELECT](https://learn.microsoft.com/en-us/sql/t-sql/statements/insert-transact-sql?view=sql-server-ver16) | Allows you to insert data from one table into another. It’s useful when you want to copy data from one table to another. |

In a scenario where an analyst needs data from both a warehouse and a lakehouse, they can use this feature to combine the data. They can then load this combined data into the warehouse for analysis. This feature is useful when data is distributed across many assets in a workspace.

The following query creates a new table in the `analysis_warehouse` that combines data from the `sales_warehouse` and the `social_lakehouse` using the _product_id_ as the common key. The new table can then be used for further analysis.

In [0]:
%sql
CREATE TABLE [analysis_warehouse].[dbo].[combined_data]
AS
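SELECT
    sales.[product_id],
    sales.[sales_amount],    -- hypothetical column; list the sales columns you need
    social.[mention_count]   -- hypothetical column; list the social columns you need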
FROM [sales_warehouse].[dbo].[sales_data] sales
INNER JOIN [social_lakehouse].[dbo].[social_data] social
ON sales.[product_id] = social.[product_id];

All the Warehouses that share the same workspace are integrated into the same logical SQL server. If you use SQL client tools such as [SQL Server Management Studio](https://learn.microsoft.com/en-us/sql/ssms/download-sql-server-management-studio-ssms?view=sql-server-ver16), you can easily perform a cross-database query like in any SQL Server instance.

<img src="../images/04_Implement a data warehouse with Microsoft Fabric/01/4-load-using-ssms.gif" alt="Reference other Warehouses in a workspace from SQL Server Management Studio" style="border: 2px solid black; border-radius: 10px;">

_MyWarehouse_ and _Sales_ are both warehouse assets that share the same workspace.

If you’re using the object explorer from the workspace to query your warehouses, you need to add them explicitly. The warehouses you add are also visible from the visual query editor.

<img src="../images/04_Implement a data warehouse with Microsoft Fabric/01/4-query-using-workspace.gif" alt="Query other Warehouses in a workspace from the Fabric workspace" style="border: 2px solid black; border-radius: 10px;">

When using T-SQL, data can be efficiently loaded into a warehouse in Microsoft Fabric through the COPY statement, or from other warehouses and lakehouses within the same workspace, allowing for seamless data management and analysis.

# Load and transform data with Dataflow Gen2

## Create a dataflow
To create a new dataflow, navigate to your workspace, then select + **New**. If **Dataflow Gen2** isn't visible in the list, select **More options**, then find **Dataflow Gen2** under the **Data Factory section**.

<img src="../images/04_Implement a data warehouse with Microsoft Fabric/01/5-load-using-dataflow.gif" alt="Launch Dataflow Gen2 from the workspace" style="border: 2px solid black; border-radius: 10px;">

## Import data
Once Dataflow Gen2 launches, many options are available to load your data.

<img src="../images/04_Implement a data warehouse with Microsoft Fabric/01/5-import-options.png" alt="Launch Data Pipeline from the Warehouse asset" style="border: 2px solid black; border-radius: 10px;">

You can load different file types with just a few steps. For example, you can load a text or CSV file from your local computer.

<img src="../images/04_Implement a data warehouse with Microsoft Fabric/01/5-load-file.png" alt="Load a text or CSV file" style="border: 2px solid black; border-radius: 10px;">

Once the data is imported, you can start authoring your dataflow. You might decide to clean or reshape your data, remove columns, or create new ones. All the steps you perform are saved.

## Transform data with Copilot
Copilot can be a valuable tool for assisting with dataflow transformations. Let's say we have a _Gender_ column that contains '_Male_' and '_Female_' and we want to transform it.

The first step is to activate Copilot within your dataflow. Once that's done, you can then provide specific instructions on the transformation you want to perform.

For instance, you might input the following command: "_Transform the Gender column. If Male 0, if Female 1. Then convert it to integer_."

<img src="../images/04_Implement a data warehouse with Microsoft Fabric/01/5-copilot.png" alt="Use Copilot to apply transformation in a dataflow" style="border: 2px solid black; border-radius: 10px;">

Copilot adds a new step automatically, and you can always revert it if you want, or continue to build on it for further transformations.
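For reference, the logic of that instruction can be written out in Python. This is only an illustration of the intended result, not the step Copilot generates inside the dataflow; the function name and the choice to map unexpected values to `None` are assumptions.

```python
def encode_gender(rows, column="Gender"):
    """Map 'Male' to 0 and 'Female' to 1; any other value becomes None."""
    mapping = {"Male": 0, "Female": 1}
    return [{**row, column: mapping.get(row[column])} for row in rows]

people = [
    {"Name": "Alex", "Gender": "Male"},
    {"Name": "Sam", "Gender": "Female"},
    {"Name": "Kit", "Gender": "Unknown"},
]
encoded = encode_gender(people)
print(encoded)
```

Checking how unexpected values are handled is worth doing for any generated step before building further transformations on it.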

## Add a data destination
With the **Add data destination** feature, you can separate your ETL logic and destination storage. This separation can lead to cleaner, more maintainable code and can make it easier to modify either the ETL process or the storage configuration without affecting the other.

Once the data is transformed, the next step is to add a destination step. On the **Query settings** tab, select + to add a destination step in your dataflow.

<img src="../images/04_Implement a data warehouse with Microsoft Fabric/01/5-add-destination.png" alt="Option to add a data destination in a dataflow" style="border: 2px solid black; border-radius: 10px;">

The following destination options are available.
- Azure SQL Database
- Lakehouse
- Azure Data Explorer (Kusto)
- Azure Synapse Analytics (SQL DW)
- Warehouse

Data that’s loaded into a destination like a warehouse can be easily accessed and analyzed using various tools. This improves the accessibility of your data and allows for more flexible and comprehensive data analysis.

When you select a warehouse as a destination, you can choose the following update methods.

<img src="../images/04_Implement a data warehouse with Microsoft Fabric/01/5-update-table-options.png" alt="Difference between the append and replace methods to update a row" style="border: 2px solid black; border-radius: 10px;">

- **Append:** Add new rows to an existing table.
- **Replace:** Replace the entire content of a table with a new set of data.
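The two update methods above can be contrasted with a short Python sketch, where lists stand in for the destination table. The function names are hypothetical.

```python
def append_rows(table, new_rows):
    """Append: keep the existing rows and add the new ones."""
    return table + new_rows

def replace_rows(table, new_rows):
    """Replace: discard the existing rows and keep only the new ones."""
    return list(new_rows)

existing = [{"id": 1, "qty": 5}]
incoming = [{"id": 2, "qty": 7}]
appended = append_rows(existing, incoming)
replaced = replace_rows(existing, incoming)
print(len(appended), len(replaced))
```

Note that appending the same batch twice produces duplicate rows, so this method suits sources that deliver only new data on each run.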

## Publish a dataflow
After you choose your update method, the final step is to publish your dataflow.

Publishing makes your transformations and data loading operations live, allowing the dataflow to be executed either manually or on a schedule. This process encapsulates your ETL operations into a single and reusable unit, streamlining your data management workflow.

Any changes made in the dataflow take effect only when it’s published, so always publish your dataflow after making modifications.