# Building Data Ingestion Pipelines Using Azure Data Factory

Azure Data Factory is a core tool for data engineers, and understanding its fundamentals is essential to building efficient pipelines. By the end of this lab, you will know how to provision a data factory, copy files from a data lake to an Azure SQL database using the Copy activity, use control flow activities, move data from SQL Server to a data lake, and choose how to trigger a data factory pipeline.

In this lab, we’ll cover the following recipes:

* Provisioning Azure Data Factory
* Copying files to a database from a data lake using a control flow and copy activity
* Triggering a pipeline in Azure Data Factory
* Copying data from a SQL Server virtual machine to a data lake using the Copy data wizard

## Recipe 1 - Provisioning Azure Data Factory

To get started with Azure Data Factory, you need to provision a data factory. A data factory comprises the following key components:

-   **Linked services**: A component that maintains the connection credentials to data sources. An example of this is a connection to a SQL database/text file.
-   **Dataset**: The data that's obtained after connecting to the data source using a linked service. An example of this is a group of tables or files connected via a linked service.
-   **Activity**: A task that will process the dataset. An example of this is a copy activity that moves the data from a flat file to a database.
-   **Data flow**: These are specific tasks that perform data transformations on datasets. An example of this is pivoting or sorting a dataset while it is being moved from source to destination. This can be done by a data flow transformation task.
-   **Integration runtime**: This is the Azure Data Factory engine that works behind the scenes and provides the compute and resources to run the activities or tasks.
-   **Pipeline**: A single entity that combines all the aforementioned components to connect, process, transform, and ingest the data to the destination. A single pipeline may contain multiple linked services, datasets, activities, and data flows.
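
To make the relationships between these components concrete, the following sketch models them as plain Python dictionaries shaped loosely like the JSON definitions that Data Factory keeps for each object. The dictionaries are heavily trimmed, the helper `resolve_linked_services` is hypothetical, and the object names are simply the ones used in the recipes of this lab, so treat this as a mental model rather than a deployable definition.

```python
# Trimmed, illustrative stand-ins for Data Factory object definitions.
linked_services = {
    "DataLoading": {"type": "AzureBlobFS"},   # connection to the storage account
    "SQLDB": {"type": "AzureSqlDatabase"},    # connection to the database
}

# Each dataset points at exactly one linked service.
datasets = {
    "OrderdtlsCSV": {"type": "DelimitedText", "linkedServiceName": "DataLoading"},
    "OrderdtlSQL": {"type": "AzureSqlTable", "linkedServiceName": "SQLDB"},
}

# A pipeline groups activities; each activity reads and writes datasets.
pipeline = {
    "name": "ControlFlowActivities",
    "activities": [
        {
            "name": "CopyOrderDtltoSQL",
            "type": "Copy",
            "inputs": ["OrderdtlsCSV"],
            "outputs": ["OrderdtlSQL"],
        }
    ],
}


def resolve_linked_services(pipeline, datasets):
    """Return the linked services each activity touches, via its datasets."""
    resolved = {}
    for activity in pipeline["activities"]:
        names = activity["inputs"] + activity["outputs"]
        resolved[activity["name"]] = sorted(
            {datasets[name]["linkedServiceName"] for name in names}
        )
    return resolved


print(resolve_linked_services(pipeline, datasets))
```

Following the chain activity → dataset → linked service is a useful habit when debugging: a failing activity is often traced back to a misconfigured connection two levels down.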

In this recipe, we will be provisioning an Azure Data Factory using the Azure portal. Follow these steps:

1. Log in to portal.azure.com, click on Create a resource, and search for Data Factory. Select Data Factory and click on Create. Provide the data factory name, the resource group name, and location.
1. Click on Next: Git configuration. Git configuration allows you to configure integration with Azure DevOps or GitHub. Git integration helps you save data factory pipelines as Azure Resource Manager (ARM) templates and lets you perform continuous integration and continuous deployment (CI/CD). For this recipe, we will choose Configure Git later.
1. As for the remaining tabs (Networking/Advanced/Tags), they can remain as is. Click Review + create to create the data factory.

How it works…

It is fairly straightforward to create a data factory instance. The data factory that we created in this recipe will be used to hold several datasets, pipelines, and data sources that are to be created in the following recipes in this lab.

## Recipe 2 - Copying files to a database from a data lake using a control flow and copy activity

In this recipe, we will be building a pipeline that will copy a group of files in blob storage to Azure SQL Database, but only if the filenames contain today's date as a suffix, by following these steps:

1.  Get the list of files to be copied using the **Get Metadata** activity in the data factory.
2.  Use the **Filter** activity to filter the file whose suffix is the current date.
3.  Use the **ForEach** activity to loop through the files.
4.  Use the **Copy** activity to load the file into Azure SQL Database.

Execute the following command to create a container:

```
$storageaccountname="sparshstorage1"
$resourcegroup="sparsh-resource-1"
$containername="dataloading"

$storagecontext = (Get-AzStorageAccount -ResourceGroupName $resourcegroup -Name $storageaccountname).Context;

New-AzStorageContainer -Name $containername -Context $storagecontext
```

Execute the following script to create a logical SQL server and an Azure SQL database in the same resource group:

```
$serverName = "sparshdeadfsql"
$adminSqlLogin = "sqladmin"
$password = "MyPass123"
# This range allows every IP address; acceptable only for a short-lived lab.
$startIp = "0.0.0.0"
$endIp = "255.255.255.255"
$databasename = "sample"

# Create the logical server with the SQL administrator credentials.
$server = New-AzSqlServer -ResourceGroupName $resourcegroup -ServerName $serverName -Location "Eastus" -SqlAdministratorCredentials $(New-Object -TypeName System.Management.Automation.PSCredential -ArgumentList $adminSqlLogin, $(ConvertTo-SecureString -String $password -AsPlainText -Force))

# Open the server firewall so that the database can be reached.
$serverFirewallRule = New-AzSqlServerFirewallRule -ResourceGroupName $resourcegroup -ServerName $serverName -FirewallRuleName "AllowedIPs" -StartIpAddress $startIp -EndIpAddress $endIp

# Create the database in the Standard S0 tier.
$database = New-AzSqlDatabase -ResourceGroupName $resourcegroup -ServerName $serverName -DatabaseName $databasename -RequestedServiceObjectiveName "S0"

$database
```

You will see output similar to:

```
PS /> $serverName = "sparshdeadfsql"
PS /> $adminSqlLogin = "sqladmin"
PS /> $password = "MyPass123"
PS /> $startIp = "0.0.0.0"
PS /> $endIp = "255.255.255.255"
PS /> $databasename = "sample"
PS /> 
PS /> $server = New-AzSqlServer -ResourceGroupName $resourcegroup -ServerName $serverName -Location "Eastus" -SqlAdministratorCredentials $(New-Object -TypeName System.Management.Automation.PSCredential -ArgumentList $adminSqlLogin, $(ConvertTo-SecureString -String $password -AsPlainText -Force))
PS /> 
PS /> $serverFirewallRule = New-AzSqlServerFirewallRule -ResourceGroupName $resourcegroup -ServerName $serverName -FirewallRuleName "AllowedIPs" -StartIpAddress $startIp -EndIpAddress $endIp
PS /> 
PS /> $database = New-AzSqlDatabase -ResourceGroupName $resourcegroup -ServerName $serverName -DatabaseName $databaseName -RequestedServiceObjectiveName "S0"
WARNING: Upcoming breaking changes in the cmdlet 'New-AzSqlDatabase' :

- The output type 'Microsoft.Azure.Commands.Sql.Database.Model.AzureSqlDatabaseModel' is changing
- The following properties in the output type are being deprecated : 'BackupStorageRedundancy'
- The following properties are being added to the output type : 'CurrentBackupStorageRedundancy' 'RequestedBackupStorageRedundancy'
- The change is expected to take effect from the version : '3.0.0'
Note : Go to https://aka.ms/azps-changewarnings for steps to suppress this breaking change warning, and other information on breaking changes in Azure PowerShell.
PS /> 
PS /> $database

ResourceGroupName                : sparsh-resource-1
ServerName                       : sparshdeadfsql
DatabaseName                     : sample
Location                         : eastus
DatabaseId                       : 2a584e8b-b1d3-38df5dc0eedb
Edition                          : Standard
CollationName                    : SQL_Latin1_General_CP1_CI_AS
CatalogCollation                 : 
MaxSizeBytes                     : 268435456000
Status                           : Online
CreationDate                     : 2/10/2023 8:25:22 PM
CurrentServiceObjectiveId        : 00000000-0000-0000-0000-000000000000
CurrentServiceObjectiveName      : S0
RequestedServiceObjectiveName    : S0
RequestedServiceObjectiveId      : 
ElasticPoolName                  : 
EarliestRestoreDate              : 
Tags                             : 
ResourceId                       : /subscriptions/044e679e-4bef-add9/resourceGroups/sparsh-resource-1/providers/Microsoft.Sql/servers/sparshdeadfsql/data
                                   bases/sample
CreateMode                       : 
ReadScale                        : Disabled
ZoneRedundant                    : False
Capacity                         : 10
Family                           : 
SkuName                          : Standard
LicenseType                      : 
AutoPauseDelayInMinutes          : 
MinimumCapacity                  : 
ReadReplicaCount                 : 
HighAvailabilityReplicaCount     : 
CurrentBackupStorageRedundancy   : Geo
RequestedBackupStorageRedundancy : Geo
SecondaryType                    : 
MaintenanceConfigurationId       : /subscriptions/044e679e-431e-ccfcdb3/providers/Microsoft.Maintenance/publicMaintenanceConfigurations/SQL_Default
EnableLedger                     : False
PreferredEnclaveType             : Default
PausedDate                       : 
ResumedDate                      : 

PS /> 
```

Upload the data files into the container:

```
import os
from datetime import datetime, timezone, timedelta

import pandas as pd

# Make sure the output folder exists before writing the CSV files.
os.makedirs("./data", exist_ok=True)

# The same sample order lines go into every file; only the order date differs.
product = ['PC', 'keyboard', 'cable', 'camera', 'mobile']
cost = [1000, 20, 1, 50, 100]
quantity = [5, 20, 1000, 50, 200]
location = ['Singapore', 'Dubai', 'Singapore', 'Delhi', 'HongKong']

# Write one file each for today and the two previous days (UTC).
for i in range(3):
    now = datetime.now(timezone.utc) - timedelta(days=i)
    dt_string = now.strftime("%Y%m%d")
    df = pd.DataFrame({'product': product, 'cost': cost, 'quantity': quantity, 'location': location})
    df['order_dt'] = dt_string
    filepath = f"./data/orderdtls-{dt_string}.csv"
    df[['order_dt', 'product', 'cost', 'quantity', 'location']].to_csv(filepath, index=False)
```

```
# Upload every file in the data folder, keeping its name (including .csv) as the blob name.
$files = Get-ChildItem -Path ".\data" -File

foreach ($file in $files) {
    Set-AzStorageBlobContent -File $file.FullName -Context $storagecontext -Blob $file.Name -Container $containername -Force
}
```

To copy the files in blob storage that have the current date as the suffix into Azure SQL Database, we will do the following:

1. Create linked services to connect the blob storage and Azure SQL Database.
1. Add the Get Metadata activity to get a list of files.
1. Add the Filter activity to filter the files with the current date as the suffix.
1. Add the Copy activity to copy the files that have been filtered to Azure SQL Database.

### Creating a linked service

First, let’s create two connections (linked services) – one for Azure SQL Database and another for blob storage:

1. In the Azure portal, open the data factory that we provisioned. Click on Open Azure Data Factory Studio. Once the Azure Data Factory Studio opens, click on the Manage button.
1. Click on Linked Services, then + New, and search for data under Data store. Select `Azure Data Lake Storage Gen2` and click Continue.
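1. Create a connection to the storage account and name the linked service `DataLoading`; the dataset in the next section refers to it by this name.
1. Similarly, create a linked service for `Azure SQL Database` and name it `SQLDB`. Set User name to `sqladmin` and Password to `MyPass123`.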

### Using the Get Metadata activity to get filenames

The first task is to get the list of files in the container. We’ll do this using the Get Metadata activity. Follow these steps:

1. Create a new pipeline by clicking on the Author icon (the pencil-shaped icon on the left), then the + button, and then Pipeline.
1. Under Activities, search for the `Get Metadata` activity and drag it onto the pipeline.
1. Set the name of the activity to `GetFilename`.
1. Under Dataset, click on the + New button to add a new dataset.
1. Select `Azure Data Lake Storage Gen2` and select CSV as the file type. Name the dataset `OrderdtlsCSV`. Set Linked service to DataLoading, which we created earlier. Under File path, select the dataloading container. Then, check the First row as header box.
1. Under Field list, click on the + New button and select `Child items`.
1. Hit the Debug button at the top and check the output. If no errors have been reported and the output shows the filenames, then we have configured the Get Metadata task correctly and we can proceed to the next step.

### Filtering the current date files using the Filter activity

The next step is to filter the list of files that are returned by the Get Metadata activity for the files with the current date as the suffix. Let’s add a Filter activity to the pipeline:

1. Search for filter in the Activities tab and drag and drop the activity onto the pipeline.
1. Connect the Get Metadata activity and the Filter activity.
1. Name the Filter activity `FilterTodaysDate`. Move to the Settings tab of the Filter activity.
1. For Items, click on the textbox and then click on `Add dynamic content` (Alt + Shift + D).
1. Paste `@activity('GetFilename').output.childItems` into the Items field. This will retrieve the output array that was returned by the Get Metadata activity.
1. Similarly, add `@endswith(item().name,concat('-',formatDateTime(utcnow(),'yyyyMMdd'),'.csv'))` for Condition.
1. Hit the Debug button to test this. Ensure that the `FilterTodaysDate` activity has been completed successfully and shows the filename with the current date as the suffix.
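
If the Condition expression is hard to read, it may help to see the same check written in Python. The sketch below is only an approximation of what the Filter activity does: `is_todays_file` is a hypothetical helper, the `child_items` list is hand-written to resemble the Child items output of Get Metadata, and the date is pinned so that the result is repeatable.

```python
from datetime import datetime, timezone


def is_todays_file(name, now=None):
    """Return True when name ends with '-<UTC date as YYYYMMDD>.csv'."""
    now = now or datetime.now(timezone.utc)
    suffix = "-" + now.strftime("%Y%m%d") + ".csv"
    return name.endswith(suffix)


# Hand-written sample resembling the Get Metadata "Child items" output.
child_items = [
    {"name": "orderdtls-20230210.csv", "type": "File"},
    {"name": "orderdtls-20230209.csv", "type": "File"},
    {"name": "orderdtls-20230208.csv", "type": "File"},
]

# Pinned date (an assumption for the demo); omit it to use the real UTC date.
fixed_now = datetime(2023, 2, 10, tzinfo=timezone.utc)

# Equivalent of the Filter activity: keep only the items that pass the check.
filtered = [item for item in child_items if is_todays_file(item["name"], fixed_now)]
print(filtered)
```

Note that the comparison uses UTC, just as `utcnow()` does in the pipeline, so a file dated by local time can be missed around midnight.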

### Adding a ForEach activity to loop through the files

Now, we will create a ForEach activity to iterate through the files that are returned by the Filter activity. Follow these steps:

1. Search for `ForEach` under Activities and add the activity to the pipeline. Link it to the FilterTodaysDate activity.
1. Go to the Settings tab of the ForEach activity. For Items, click on Add dynamic content (Alt + Shift + D).
1. Under Activity Outputs, click on the `FilterTodaysDate` activity’s output. This will automatically add `@activity('FilterTodaysDate').output`.
1. Append `.value` and set Items to `@activity('FilterTodaysDate').output.value`. This will pass the filenames from the FilterTodaysDate activity to the ForEach activity.
1. Go to the Activities (0) tab and click on the pencil button.

### Adding the Copy data activity to ingest files to Azure SQL Database

Finally, we will ingest files from the data lake to Azure SQL Database. We can do this by passing the files listed by the ForEach activity to the Copy activity. Follow these steps:

1. Search for Copy Data under Activities and add it to the pipeline.
1. Name the Copy data activity `CopyOrderDtltoSQL`.
1. Go to the Source tab. Select OrderdtlsCSV as the dataset since we need to copy CSV files to Azure SQL Database.
1. Under File path type, select Wildcard file path. In the last textbox, type `@item().name`.
1. Move to the Sink tab. Press + New to add the dataset.
1. Search for SQL and select `Azure SQL Database`.
1. Select `SQLDB` as the linked service as we had created the connection to Azure SQL Database initially. Name the dataset as `OrderdtlSQL`. Click on the Edit checkbox below Table name. Provide table name as `dbo.orderdtls`. Select None option under Import schema. Press OK.
1. In the Sink tab, copy and paste the following script into the Pre-copy script field:
    ```
    if not exists ( Select * from sys.objects where name like 'orderdtls')

    Create table dbo.orderdtls(order_dt varchar(30),product varchar(100),cost int, quantity int, location varchar(100))
    ```
    This will create a table called orderdtls for the first time and append rows to the same table in subsequent runs. All the other options can be left as is. Click on pipeline1 to go back to the pipeline.
1. Hit the Debug button to test all activities, and confirm that every activity completes successfully.
1. Rename the pipeline `ControlFlowActivities` and hit the Publish button to save the pipeline. You can verify the result by querying Azure SQL Database from the Azure portal via Query editor.

![](https://user-images.githubusercontent.com/62965911/218207655-a53ff18b-58f7-47de-8e2c-c14002ef4da3.png)

How it works…

In this recipe, we performed four key steps to move the data from blob storage to Azure SQL Database:

- We got the list of files in the container using the Get Metadata task.
- We filtered for files that have the current date as a suffix using the Filter task.
- We iterated through all the current date files using the ForEach task.
- We ingested the file’s content in a SQL table using the Copy task.

By using various control activities, we can successfully transfer data from files to a database. Transferring files to a database is a common scenario in ETL workloads; you can use the preceding framework and customize the data transfer based on your requirements.
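
The four steps can also be rehearsed locally before building them in Data Factory. The following sketch is a stand-in only: it uses a temporary folder in place of the container and an in-memory SQLite database in place of Azure SQL Database, and `run_pipeline` is a hypothetical function, not part of any Azure SDK. It does show the same behavior as the pre-copy script: the table is created on first use and rows are appended on later runs.

```python
import csv
import sqlite3
import tempfile
from pathlib import Path


def run_pipeline(folder, conn, date_suffix):
    """Local stand-in for Get Metadata -> Filter -> ForEach -> Copy."""
    names = sorted(p.name for p in Path(folder).iterdir())            # Get Metadata
    todays = [n for n in names if n.endswith(f"-{date_suffix}.csv")]  # Filter
    copied = 0
    for name in todays:                                               # ForEach
        # Pre-copy step: create the target table only if it is missing.
        conn.execute(
            "CREATE TABLE IF NOT EXISTS orderdtls "
            "(order_dt TEXT, product TEXT, cost INTEGER, quantity INTEGER, location TEXT)"
        )
        with open(Path(folder) / name, newline="") as handle:
            rows = [tuple(row.values()) for row in csv.DictReader(handle)]
        conn.executemany("INSERT INTO orderdtls VALUES (?, ?, ?, ?, ?)", rows)  # Copy
        copied += len(rows)
    conn.commit()
    return copied


with tempfile.TemporaryDirectory() as folder:
    # Two sample files: only the one dated 20230210 should be loaded.
    for day in ("20230209", "20230210"):
        with open(Path(folder) / f"orderdtls-{day}.csv", "w", newline="") as handle:
            writer = csv.writer(handle)
            writer.writerow(["order_dt", "product", "cost", "quantity", "location"])
            writer.writerow([day, "PC", 1000, 5, "Singapore"])
            writer.writerow([day, "cable", 1, 1000, "Singapore"])

    conn = sqlite3.connect(":memory:")
    first_run = run_pipeline(folder, conn, "20230210")
    second_run = run_pipeline(folder, conn, "20230210")
    total_rows = conn.execute("SELECT COUNT(*) FROM orderdtls").fetchone()[0]
    print(first_run, second_run, total_rows)
```

Because rows are appended on every run, rerunning the pipeline for the same day duplicates that day's data. If that matters for your workload, consider deleting the day's rows in the pre-copy step or loading into a staging table first.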

## Recipe 3 - Triggering a pipeline in Azure Data Factory

An Azure Data Factory pipeline can be triggered manually, on a schedule, or by an event. In this recipe, we’ll configure an event-based trigger that runs the pipeline we created in the previous recipe whenever a new file is uploaded to the data lake container.

To create the trigger, follow these steps:

- The event trigger requires the Microsoft.EventGrid resource provider to be registered in the subscription. To do that, execute the following PowerShell command:

```
Register-AzResourceProvider -ProviderNamespace Microsoft.EventGrid
```

- In the Azure portal, under All resources, open the data factory that you created in the Provisioning Azure Data Factory recipe. On the data factory overview page, select Open Azure Data Factory Studio. Click on the Author button on the left. Expand Pipeline and click on ControlFlowActivities, which was created in the previous recipe.
- Select Add trigger and then select New/Edit.
- In the Add triggers window, select Choose trigger and select New.
- In the New trigger window, set Name to NewFileTrigger and set the event’s Type to Storage event. For Storage account name and Container name, use the storage account and container you created earlier. Under Event, select Blob created. This will ensure that the pipeline is triggered any time a file is uploaded to the dataloading container.
- Click Continue to create the trigger.
- In the Data preview window, all the files in the dataloading container will be listed.
- Click Continue.
- In the New trigger window, we can specify the parameter values, if any, that are required by the pipeline to run.
- Click OK to create the trigger.
- The trigger will be created. Click Publish all to save and apply these changes.

To see the trigger in action, do the following:

```
from datetime import datetime, timezone

import pandas as pd

# Compute the date here so that this cell does not depend on the earlier loop.
dt_string = datetime.now(timezone.utc).strftime("%Y%m%d")

product = ['PC', 'keyboard', 'cable', 'camera', 'mobile']
cost = [1000, 20, 1, 50, 100]
quantity = [5, 20, 1000, 50, 200]
location = ['Singapore', 'Dubai', 'Singapore', 'Delhi', 'HongKong']
df = pd.DataFrame({'product': product, 'cost': cost, 'quantity': quantity, 'location': location})
df['order_dt'] = dt_string
filepath = "./data/orderdtls-Trigger.csv"
df[['order_dt', 'product', 'cost', 'quantity', 'location']].to_csv(filepath, index=False)
```

```
Set-AzStorageBlobContent -File ".\data\orderdtls-Trigger.csv" -Context $storagecontext -Blob orderdtls-Trigger.csv -Container $containername
```

- Once the file has been uploaded, NewFileTrigger will trigger the ControlFlowActivities pipeline.
- To check the trigger and pipeline execution, open the Monitor window.
- You will see that the ControlFlowActivities pipeline was executed and that it was triggered by NewFileTrigger. This proves that the execution was triggered by the file being uploaded.

How it works…

Azure Event Grid, whose resource provider we registered at the subscription level using the `Register-AzResourceProvider` PowerShell command, tracks storage events and triggers the pipeline when a file is uploaded to the data lake container. Storage event triggers are useful in data engineering projects because pipelines often need to start as soon as another batch process or a similar upstream job delivers a file. You can also add blob path prefix/suffix filters, or pass parameters to the pipeline, so that it runs only when specific files are created or deleted.
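
To reason about blob prefix/suffix filters before configuring them, a small simulation can help. The sketch below is a simplified model, not the service's actual matching code: `trigger_fires` is a hypothetical function, and the path style `/<container>/blobs/<blob name>` is an assumption modeled on the trigger's "Blob path begins with" field, so verify the exact format in the trigger editor.

```python
def trigger_fires(container, blob_name, event_type, begins_with="", ends_with="",
                  events=("Microsoft.Storage.BlobCreated",)):
    """Decide whether a storage event would start the pipeline (simplified model)."""
    # Assumed path style: /<container>/blobs/<blob name>
    blob_path = f"/{container}/blobs/{blob_name}"
    return (
        event_type in events
        and blob_path.startswith(begins_with)
        and blob_path.endswith(ends_with)
    )


# Hypothetical filter values: only order files in CSV format start the pipeline.
prefix = "/dataloading/blobs/orderdtls-"
suffix = ".csv"

checks = {
    "new order file": trigger_fires(
        "dataloading", "orderdtls-Trigger.csv", "Microsoft.Storage.BlobCreated", prefix, suffix
    ),
    "unrelated file": trigger_fires(
        "dataloading", "notes.txt", "Microsoft.Storage.BlobCreated", prefix, suffix
    ),
    "deleted order file": trigger_fires(
        "dataloading", "orderdtls-Trigger.csv", "Microsoft.Storage.BlobDeleted", prefix, suffix
    ),
}
for label, fires in checks.items():
    print(f"{label}: {fires}")
```

With no filters set, every created blob in the container starts the pipeline, which is the behavior configured in this recipe.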

## Recipe 4 - Copying data from a SQL Server virtual machine to a data lake using the Copy data wizard

A common scenario in data engineering projects is ingesting data from a relational database engine such as SQL Server, Oracle, or MySQL into a data lake. This recipe shows how to ingest data from SQL Server installed on an Azure VM into Azure Data Lake. The same method works for an on-premises SQL Server, but in that case you will need to install a self-hosted integration runtime. We will use the Copy data wizard to transfer the data.

Provision the SQL Server VM by doing the following:

1. Log in to portal.azure.com.
1. Click on Create a Resource.
1. Search for SQL Server.
1. Select SQL Server 2019 on Windows Server 2019. Pick the Free SQL Server License: SQL 2019 Developer on Windows Server 2019 option.
1. Set Resource group to sparshADEADF and Virtual machine name to SQLVM. Then, set Availability options to No infrastructure redundancy required. After that, ensure that Username is set to sqladmin and that Password is set to `MyPass123`. Leave Select inbound ports as is to allow the (RDP) 3389 port since it is allowed by default.
1. Click on the SQL Server settings tab and set SQL connectivity to Public (Internet). For SQL Authentication, choose Enable.
1. Click on Review + create. It will take around 15 minutes to create the VM. Once the VM has been created, go to the VM’s overview page in the Azure portal, get the Public IP Address information, and perform a remote desktop connection to the VM. Log in using your user ID and password – that is, sqladmin/MyPass123.
1. Open Windows PowerShell in the SQL VM and run the following commands. These commands will create a folder and download the adventureworks backup file into the folder:
    ```
    New-Item -Path c:\temp -ItemType directory

    cd c:\temp

    Invoke-WebRequest "https://github.com/Microsoft/sql-server-samples/releases/download/adventureworks/AdventureWorksLT2019.bak"  -OutFile "AdventureWorksLT2019.bak"
    ```
1. Open Command Prompt in the SQL VM and type the following: ```Sqlcmd -e```
1. Paste the following command in Command Prompt to restore the database to SQL Server:
    ```
    RESTORE DATABASE [AdventureWorksLT2019] FROM  DISK = N'c:\temp\AdventureWorksLT2019.bak' WITH  FILE = 1,  MOVE N'AdventureWorksLT2012_Data' TO N'F:\data\AdventureWorksLT2012.mdf',  MOVE N'AdventureWorksLT2012_Log' TO N'F:\log\AdventureWorksLT2012_log.ldf',  NOUNLOAD,  STATS = 5

    GO
    ```
1. Hit Enter on Command Prompt. This will restore the database.

Now that the database has been restored, let’s copy the data from the database into our data lake using the Copy data wizard in the data factory. Follow these steps:

1. Log in to portal.azure.com. Go to the data factory that you created earlier. Open Azure Data Factory Studio. Then, click on the Ingest button on the home page.
1. Select Built-in copy task and choose the Run once now option.
1. Set Source type to SQL server and click + New connection.
1. Provide the SQL VM’s public IP address under Server name. Set Database name as AdventureWorksLT2019 and pick SQL authentication under Authentication type. Finally, set User name as sqladmin and Password as MyPass123. Then, click Create.
1. Select a few tables you want to copy to the data lake.
1. Hit Next twice to go to the Destination data store configuration screen. Select `Azure Data Lake Storage Gen2` under Target type and click + New connection.
1. Pick the storage account you created earlier and click Create to create the connection.
1. Under Folder path, click on Browse. Select the dataloading container you created earlier. Then, click OK.
1. Pick a File format. Check the Add header to file box to ensure that the column names are shown. Then, click Next.
1. Set Task name as `CopySQLVMtoADL` and click Next.
1. Review your configuration on the Summary page and click Next. This will create the pipelines automatically and execute them. Click Finish.
1. Verify that the files have been created by checking the blob storage container in the Azure portal.

![Figure_3](https://user-images.githubusercontent.com/62965911/218211302-64d1e918-852c-4ec3-8c00-e9b52b2e717f.jpg)

How it works…

Go to Azure Data Factory Studio and click on the Author button on the left. Expand Pipeline and notice that a new pipeline, CopySQLVMtoADL, has been created. Click on it. You will see that there’s a ForEach activity in the pipeline. The ForEach activity is used to iterate through each table that needs to be copied.

![Figure_3 51](https://user-images.githubusercontent.com/62965911/218211682-5c195007-6d39-4210-b1e2-67eb7acac6a9.jpg)

Click on Activities. You will notice a Copy data task whose source is SQL Server and the destination is Azure Data Lake Storage Gen2. The Copy data task will copy one table at a time from SQL Server to the data lake.

You will also notice that the source table name and destination filename come from the ForEach activity (`item().source.table` and `item().destination.fileName`). The Copy data wizard has automatically created the pipeline with the relevant activities to move the data.

You can customize the pipeline or schedule it to transfer data periodically. The Copy data wizard saves considerable time by configuring the datasets and the ForEach activity for you, which makes it easy to get started with data movement tasks.
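
The table-driven design of the generated pipeline can be pictured with a short sketch. The items below imitate the shape implied by `item().source.table` and `item().destination.fileName`; the helper `build_copy_items`, the file naming scheme, and the choice of tables are assumptions for illustration, not what the wizard necessarily emits.

```python
def build_copy_items(tables, extension=".csv"):
    """Build one ForEach item per table, with a source table and a destination file."""
    return [
        {
            "source": {"table": table},
            # Hypothetical naming scheme: table name plus file extension.
            "destination": {"fileName": f"{table}{extension}"},
        }
        for table in tables
    ]


# Two sample tables from the restored database (an illustrative selection).
items = build_copy_items(["SalesLT.Customer", "SalesLT.Product"])

# Equivalent of the ForEach activity: one Copy data run per item.
for item in items:
    print(f"copy {item['source']['table']} -> {item['destination']['fileName']}")
```

Driving the copy from a list like this means that adding a table is a data change rather than a pipeline change, which is why the wizard generates a single Copy data activity inside a loop instead of one activity per table.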

Ensure that you delete the resource groups used in this lab (sparshADEADF and sparsh-resource-1) once you have finished; otherwise, you will continue to incur Azure consumption costs.