# Understanding the SQL DW Connector

You can access Azure SQL Data Warehouse (SQL DW) from Azure Databricks using the SQL Data Warehouse connector (referred to as the SQL DW connector), a data source implementation for Apache Spark that uses Azure Blob storage and PolyBase in SQL DW to transfer large volumes of data efficiently between a Databricks cluster and a SQL DW instance.

Both the Databricks cluster and the SQL DW instance access a common Blob storage container to exchange data. In Databricks, the SQL DW connector triggers Spark jobs that read data from and write data to the Blob storage container. On the SQL DW side, the connector uses JDBC to trigger the PolyBase data loading and unloading operations.

The SQL DW connector is better suited to ETL than to interactive queries, because each query execution can extract large amounts of data to Blob storage. If you plan to run several queries against the same SQL DW table, we recommend that you save the extracted data in a format such as Parquet and query that copy instead.
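
The Parquet recommendation above can be sketched as follows. This is only an illustration: it assumes `spark` is the notebook's SparkSession and `df` is a DataFrame already loaded through the connector, and the helper name `cache_as_parquet` and the example path are hypothetical.

```python
def cache_as_parquet(spark, df, parquet_path):
    """Write df to Parquet once, then return a DataFrame backed by that copy.

    Queries against the returned DataFrame read the Parquet files and
    should not trigger another extract from SQL DW.
    """
    # "overwrite" replaces any earlier snapshot stored at the same path.
    df.write.mode("overwrite").parquet(parquet_path)
    return spark.read.parquet(parquet_path)


# Example usage inside a notebook (hypothetical path):
# cached_df = cache_as_parquet(spark, df, "/tmp/sqldw_cache/employee_basic")
# cached_df.createOrReplaceTempView("employee_basic_cached")
```

The snapshot goes stale when the SQL DW table changes, so re-run the helper whenever fresh data is needed.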

## SQL Data Warehouse Pre-Requisites

There are two pre-requisites for connecting Azure Databricks with SQL Data Warehouse that apply to the SQL Data Warehouse:
1. You need to [create a database master key](https://docs.microsoft.com/en-us/sql/relational-databases/security/encryption/create-a-database-master-key) for the Azure SQL Data Warehouse. 

    **The master key is encrypted using the password you supply.**

    USE [databricks-sqldw];  
    GO  
    CREATE MASTER KEY ENCRYPTION BY PASSWORD = '980AbctotheCloud427leet';  
    GO

2. You need to ensure that the [firewall](https://docs.microsoft.com/en-us/azure/sql-database/sql-database-firewall-configure#manage-firewall-rules-using-the-azure-portal) on the Azure SQL Server that contains your SQL Data Warehouse is configured to allow Azure services to connect (that is, **Allow access to Azure services** is set to **On**).

You should have already completed the above requirements in earlier steps of the lab.

## Azure Storage Pre-Requisites

Azure Storage blobs are used as the intermediary for exchanging data between Azure Databricks and Azure SQL Data Warehouse. As a result, you will need to:
1. Create a general-purpose v1 Azure Storage account.
2. Acquire the Account Name and Account Key for that storage account.
3. Create a container to store the data used during the exchange, for example "dwtemp" (this container must exist before you run any queries against SQL DW).

You should have already performed these steps in the setup steps for this lab. You will need to refer to the Account Name and Account Key you saved during setup for the section below.

## Enabling access for a notebook session

You can enable access to SQL Data Warehouse for the lifetime of your notebook session by executing the cell below. Be sure to replace the **"name-of-your-storage-account"** and **"your-storage-key"** values with your own before executing.

In [9]:
storage_account_name = "name-of-your-storage-account"
storage_account_key = "your-storage-key"
storage_container_name = "dwtemp"

temp_dir_url = "wasbs://{}@{}.blob.core.windows.net/".format(storage_container_name, storage_account_name)

spark_config_key = "fs.azure.account.key.{}.blob.core.windows.net".format(storage_account_name)
spark_config_value = storage_account_key

spark.conf.set(spark_config_key, spark_config_value)
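
The temporary directory URL and the Spark configuration key in the cell above both derive from the same Blob storage host name. The sketch below makes that relationship explicit; the helper name `blob_storage_settings` is hypothetical and the values are the same placeholders used above.

```python
def blob_storage_settings(account_name, container_name):
    """Return (temp_dir_url, spark_config_key) for a Blob storage container."""
    # Both strings are built from the storage account's Blob endpoint host.
    host = "{}.blob.core.windows.net".format(account_name)
    temp_dir_url = "wasbs://{}@{}/".format(container_name, host)
    spark_config_key = "fs.azure.account.key.{}".format(host)
    return temp_dir_url, spark_config_key


# Placeholder values, as in the cell above.
url, key = blob_storage_settings("name-of-your-storage-account", "dwtemp")
print(url)
print(key)
```

In the notebook, the returned key would then be passed to `spark.conf.set` together with the storage account key, exactly as the cell above does.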

You will need the JDBC connection string for your Azure SQL Data Warehouse. You should copy this value exactly as it appears in the Azure Portal. 

Please replace the missing **`servername`**, **`databasename`**, and **`your-password`** values in the command below:

In [11]:
servername = "servername"
databasename = "databasename"
password = "your-password"

sql_dw_connection_string = "jdbc:sqlserver://{}.database.windows.net:1433;database={};user=dwlab@{};password={};encrypt=true;trustServerCertificate=false;hostNameInCertificate=*.database.windows.net;loginTimeout=30;".format(servername, databasename, servername, password)

In [12]:
print(sql_dw_connection_string)
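
Printing the connection string as shown above also prints the password into the notebook output. If that is a concern, one option is to mask the password before printing. The sketch below is an assumption-laden illustration: the helper name `redact_password` is hypothetical, and the pattern assumes the password value contains no `;` character.

```python
import re


def redact_password(connection_string):
    """Replace the value of any password property with asterisks."""
    # Match "password=" up to the next ";" (or end of string), ignoring case.
    return re.sub(r"(?i)(password=)[^;]*", r"\1****", connection_string)


# Sample string built from the placeholder values used in this notebook.
sample = (
    "jdbc:sqlserver://servername.database.windows.net:1433;"
    "database=databasename;user=dwlab@servername;"
    "password=your-password;encrypt=true;"
)
print(redact_password(sample))
```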

## Accessing data from SQL Data Warehouse

You can load data into your Azure Databricks environment by issuing a SQL query against the SQL Data Warehouse, as follows:

In [15]:
query = "SELECT * FROM EmployeeBasic"

df = spark.read \
  .format("com.databricks.spark.sqldw") \
  .option("url", sql_dw_connection_string) \
  .option("tempdir", temp_dir_url) \
  .option("forward_spark_azure_storage_credentials", "true") \
  .option("query", query) \
  .load()
  
display(df)
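
Because the same reader options recur each time data is loaded from SQL DW, they can be wrapped in a small helper. The sketch below is not part of the connector: `read_sqldw` is a hypothetical name, and reading with the `dbtable` option is assumed to work the same way it does for the write later in this notebook.

```python
def read_sqldw(spark, url, tempdir, query=None, table=None):
    """Load a DataFrame from SQL DW using either a query or a table name."""
    # Exactly one of `query` / `table` must be given.
    if (query is None) == (table is None):
        raise ValueError("Specify exactly one of query or table")
    reader = (
        spark.read
        .format("com.databricks.spark.sqldw")
        .option("url", url)
        .option("tempdir", tempdir)
        .option("forward_spark_azure_storage_credentials", "true")
    )
    if query is not None:
        reader = reader.option("query", query)
    else:
        reader = reader.option("dbtable", table)
    return reader.load()


# Example usage inside the notebook:
# df = read_sqldw(spark, sql_dw_connection_string, temp_dir_url,
#                 query="SELECT * FROM EmployeeBasic")
```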

Once you have the data in a DataFrame, you can query it just as you would any other DataFrame. This includes visualizing it as follows:

In [17]:
summary = df.select("EmployeeName")
display(summary)

## Writing new data to SQL Data Warehouse

You can also go in the opposite direction, saving the contents of an existing DataFrame out to a new table in Azure SQL Data Warehouse.

First, let's load the `weblogs` Databricks table contents into a new DataFrame. As you recall, we created this table in the previous notebook.

In [20]:
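# Import only the functions used below; a wildcard import would shadow Python built-ins such as sum and max.
from pyspark.sql.functions import month, to_date, unix_timestamp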

purchasedWeblogDF = spark.sql("select * from weblogs where Action == 'Purchased'")

Execute the cell below to add one additional column containing the transaction date. This column is derived from the `TransactionDate` field by parsing it as a timestamp and then converting the result to the date type.

In [22]:
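# Parse the TransactionDate string into a timestamp, then truncate it to a date.
transaction_ts = unix_timestamp(purchasedWeblogDF.TransactionDate, 'MM/dd/yyyy HH:mm:ss').cast("timestamp")
purchasedProductsDF = purchasedWeblogDF.select('*', to_date(transaction_ts).alias("TransactionDateTime"))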

Execute the cell below to keep only the products purchased in the month of May (05).

In [24]:
filteredProductsDF = purchasedProductsDF.where(month(purchasedProductsDF.TransactionDateTime)==5)
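
To make the two transformations above concrete, the following plain-Python sketch applies the same logic to a few made-up strings: parse each value using the month/day/year layout, keep only the date part, then retain the dates that fall in May. The sample values and the helper name `to_transaction_date` are illustrative and do not come from the `weblogs` table.

```python
from datetime import datetime

# Made-up sample values in the 'MM/dd/yyyy HH:mm:ss' layout.
samples = ["05/14/2016 09:30:00", "06/02/2016 17:45:10", "05/31/2016 23:59:59"]


def to_transaction_date(text):
    """Parse the string as a timestamp, then keep only the date part."""
    return datetime.strptime(text, "%m/%d/%Y %H:%M:%S").date()


dates = [to_transaction_date(s) for s in samples]

# Equivalent of the month(...) == 5 filter: keep only dates in May.
may_dates = [d for d in dates if d.month == 5]
print(may_dates)
```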

Execute the cell below to select the required columns.

In [26]:
finalProductsDF = filteredProductsDF.select("SessionId","UserId","ProductId","Quantity","TransactionDateTime")
display(finalProductsDF)

Now that we have our desired DataFrame, let's create a new table in the Azure SQL Data Warehouse database and populate it with the contents of the DataFrame.

In [28]:
new_table_name = "PurchasedProducts"

finalProductsDF.write \
  .format("com.databricks.spark.sqldw") \
  .option("url", sql_dw_connection_string) \
  .option("forward_spark_azure_storage_credentials", "true") \
  .option("dbtable", new_table_name) \
  .option("tempdir", temp_dir_url) \
  .save()
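
Spark's `DataFrameWriter` fails by default when the target already exists, so re-running the cell above after `PurchasedProducts` has been created is likely to raise an error. The sketch below wraps the write in a hypothetical helper, `write_sqldw`, with an explicit save mode; it assumes the connector honors the standard Spark save modes such as `"overwrite"` and `"append"`.

```python
def write_sqldw(df, url, tempdir, table, mode="errorifexists"):
    """Write df to a SQL DW table using the given Spark save mode."""
    (
        df.write
        .format("com.databricks.spark.sqldw")
        .option("url", url)
        .option("tempdir", tempdir)
        .option("forward_spark_azure_storage_credentials", "true")
        .option("dbtable", table)
        .mode(mode)  # for example "overwrite" to replace an existing table
        .save()
    )


# Example usage inside the notebook:
# write_sqldw(finalProductsDF, sql_dw_connection_string, temp_dir_url,
#             "PurchasedProducts", mode="overwrite")
```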

To verify that we have indeed created a new table in the SQL DW database, write a query to select the data from the new table:

In [30]:
query = "SELECT * FROM PurchasedProducts"

df = spark.read \
  .format("com.databricks.spark.sqldw") \
  .option("url", sql_dw_connection_string) \
  .option("tempdir", temp_dir_url) \
  .option("forward_spark_azure_storage_credentials", "true") \
  .option("query", query) \
  .load()
  
display(df)

You may also open Azure Data Studio and refresh the Tables list to see the new PurchasedProducts table.

Run the following query in a new query window:

```sql
SELECT TOP (1000) [SessionId]
      ,[UserId]
      ,[ProductId]
      ,[Quantity]
      ,[TransactionDateTime]
  FROM [dbo].[PurchasedProducts]
```

You should see output similar to the following:

<img src="https://databricksdemostore.blob.core.windows.net/images/02-SQL-DW/sql-dw-new-table.png" style="border: 1px solid #aaa; padding: 10px; border-radius: 10px 10px 10px 10px"/>