This library provides the capability to write CDM Folders from within Azure Databricks. If you are unfamiliar with the Common Data Model and the use case for integration with various Azure Data Platform offerings, see the excellent series of articles by Matthew Roche on his blog here.
This code is based on spark-cdm but has been modified and extended to support large datasets. My intention is to further extend this library by adding the capability to write lineage information directly into Apache Atlas, which is the foundation for Azure Data Catalog Gen2.
To access resources in Azure, such as Azure Data Lake Store (ADLS), you should configure a Service Principal that will be used to grant access. If you are unfamiliar with creating a Service Principal, you can follow the steps here to create one. Keep in mind that you MUST copy the client secret during creation, as you will not be able to view it afterwards. Once the Service Principal exists, you will need the following three things from it:
- The Application ID (also known as a "Client ID"), which will be in the form of a GUID
- The Tenant ID, which will be in the form of a GUID (Note: you do not need the full URL as you would in some ADLS integrations, just the GUID)
- The Client Secret (sometimes called a Token). Note that this will ONLY be available during the creation process
Once the Service Principal has been created, you will need to assign permissions to it in your ADLS account. The recommended way is to use Azure RBAC and assign the Storage Blob Data Contributor role to the Service Principal on your ADLS account. (See the previous link if you are unfamiliar with how to do this.)
The only requirement for this library is that it be installed directly on the cluster. The easiest way to do this is by following these instructions to install a "Workspace Library".
The following example is based on a dataset that I detailed in my blog. It assumes that an ADLS Gen2 account named "temalo" exists with a filesystem named "adlsgen2", and that the folder "CDMFolders/Databricks" is where you wish to write the CDM Folders for the data model.
Although the spark-cdm library is written in Scala, it can be used in either Python or Scala code within a Databricks notebook. The basic process is as follows:
- Configure the CDM Environment
- Set the Output path
- Set the Model Name
- Obtain the AppID, Secret, and TenantID from the Service Principal
- Populate a Spark Table
- Write the Spark Table (Entity) to the CDM Folder
The following code assumes that you have configured a Databricks Secrets Scope that includes all of the required information to identify and use the Service Principal discussed above.
%python
# Model name and output location for the CDM folder
cdmModelName = "Chicago Crime"
outputLocation = "https://temalo.dfs.core.windows.net/adlsgen2/CDMFolders/Databricks"
# Service Principal credentials, retrieved from the "AKV" secret scope
appID = dbutils.secrets.get(scope="AKV", key="AppID")
appKey = dbutils.secrets.get(scope="AKV", key="Token")
tenantID = dbutils.secrets.get(scope="AKV", key="TenantID")
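Optionally, before attempting a CDM write, you can confirm that the secret scope contains the expected keys and that the Service Principal can actually reach the storage account. The spark-cdm options handle authentication on their own, so the direct-ADLS configuration below is purely a sanity check; it is a minimal sketch assuming the "AKV" scope, the "temalo" account, and the "adlsgen2" filesystem used throughout this example.
%python
# Optional sanity checks -- not required by the spark-cdm library itself
# Confirm that the expected keys exist in the "AKV" secret scope
print([s.key for s in dbutils.secrets.list("AKV")])
# Configure direct OAuth access to the storage account using the same Service Principal
spark.conf.set("fs.azure.account.auth.type.temalo.dfs.core.windows.net", "OAuth")
spark.conf.set("fs.azure.account.oauth.provider.type.temalo.dfs.core.windows.net",
               "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set("fs.azure.account.oauth2.client.id.temalo.dfs.core.windows.net", appID)
spark.conf.set("fs.azure.account.oauth2.client.secret.temalo.dfs.core.windows.net", appKey)
spark.conf.set("fs.azure.account.oauth2.client.endpoint.temalo.dfs.core.windows.net",
               "https://login.microsoftonline.com/" + tenantID + "/oauth2/token")
# If the role assignment is correct, this lists the filesystem root; a 403 error indicates missing permissions
display(dbutils.fs.ls("abfss://adlsgen2@temalo.dfs.core.windows.net/"))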
The following code assumes that you have data stored in an Azure SQL Database instance that mirrors the data from the blog post referenced above. (Note that the query used will return 10 years' worth of Crime Reports if you have followed the steps in the blog to create a complete dataset.)
%python
# Database credentials, retrieved from the "AKV" secret scope
jdbcUsername = dbutils.secrets.get(scope="AKV", key="dbUsername")
jdbcPassword = dbutils.secrets.get(scope="AKV", key="dbPassword")
# The SQL Server JDBC driver (com.microsoft.sqlserver.jdbc.SQLServerDriver) is included in the Databricks Runtime, so no explicit driver load is required
jdbcHostname = "tmdemo.database.windows.net"
jdbcPort = 1433
jdbcDatabase = "ChicagoCrime"
# Create the JDBC URL without passing in the user and password parameters.
jdbcUrl = "jdbc:sqlserver://{0}:{1};database={2}".format(jdbcHostname, jdbcPort, jdbcDatabase)
connectionProperties = {
"user" : jdbcUsername,
"password" : jdbcPassword,
"driver" : "com.microsoft.sqlserver.jdbc.SQLServerDriver"
}
db_query = """(SELECT [Case Number], [Date], [Block], [IUCR], [Primary Type], [Description],
       [Location Description], [Arrest], [Domestic], [Beat], [District], [Ward],
       [Community Area], [FBI Code], [Latitude], [Longitude], [DateOnly]
  FROM ChicagoCrimes
  WHERE DATEDIFF(yy, DateOnly, GetUTCDate()) <= 10) result"""
result = spark.read.jdbc(url=jdbcUrl, table=db_query, properties=connectionProperties)
result.createOrReplaceTempView("ChicagoCrimeReports")
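Before writing, it is worth a quick check that the query returned the data you expect, for example:
%python
# Quick sanity check on the data returned from Azure SQL Database
result.printSchema()
spark.sql("SELECT COUNT(*) AS ReportCount FROM ChicagoCrimeReports").show()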
Once the data is stored in a Spark table, you can load it into a DataFrame and call the write.format method, passing "com.microsoft.cdm" as the format and supplying the required options as shown:
%python
ChicagoCrimeReports = spark.table("ChicagoCrimeReports")
(ChicagoCrimeReports.write.format("com.microsoft.cdm")
.option("entity", "Crime Reports")
.option("appId", appID)
.option("appKey", appKey)
.option("tenantId", tenantID)
.option("cdmFolder", outputLocation)
.option("cdmModelName", cdmModelName)
.save())
Reading from a CDM folder is relatively straightforward. The process mirrors the write example above, except that you call the read.format method and point it at an existing model in the folder.
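The minimal sketch below assumes that, as in the upstream spark-cdm examples, the reader accepts the same credential options as the writer plus a "cdmModel" option pointing at the folder's model.json and an "entity" option naming the entity to load; verify the exact option names against your build of the library.
%python
# A minimal sketch of reading an entity back from the CDM folder
# Assumes the "cdmModel" and "entity" options from the upstream spark-cdm examples
crimeReports = (spark.read.format("com.microsoft.cdm")
  .option("cdmModel", outputLocation + "/model.json")
  .option("entity", "Crime Reports")
  .option("appId", appID)
  .option("appKey", appKey)
  .option("tenantId", tenantID)
  .load())
crimeReports.show(10)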