At the core of Azure Databricks is Apache Spark, the open-source distributed compute engine widely used in the industry for big data projects. Databricks is a company created by the founders of Apache Spark to make Spark easier to work with by providing the necessary management layers. Microsoft makes the Databricks service available on its Azure cloud platform as a first-party service. Together, these three offerings make up Azure Databricks.
- The Catalyst Optimizer converts the code into an optimized execution plan, and Tungsten helps with memory management.
Databricks architecture is split into two parts: the Control Plane and the Data Plane.
The Control Plane is located in Databricks' own subscription.
It contains the Databricks UX (the web application) and the Cluster Manager.
It also holds the Databricks File System (DBFS) metadata, along with metadata about clusters, mounted files, etc. The Data Plane is located in the customer's subscription.
When you create a Databricks service in Azure, four resources are created in your subscription: a Virtual Network, a Network Security Group for that Virtual Network, an Azure Blob Storage account for the default storage, and the Databricks Workspace itself.
When a user requests a cluster, the Databricks Cluster Manager creates the required virtual machines in the customer subscription via the Azure Resource Manager.
So none of the customer data leaves the customer's subscription.
Temporary outputs, such as the results of a display command or data for managed tables, are stored in the Azure Blob Storage, and the processing also happens within the VNet in the customer subscription. The Azure Blob Storage shown here is the default storage, otherwise known as the DBFS Root, and it is not recommended as permanent data storage.
# of Nodes
- Single Node - only one VM, which acts as both driver and worker
- Multi Node - has a driver node and one or more worker nodes
Databricks Runtime Configuration
Cluster Policies
- Can be set by administrators to limit the creation of clusters that exceed a certain budget or memory constraint.
- Simplifies the UI
- Access Keys
- Azure Active Directory
- Service Principal
- Cluster Scoped Auth
- Session Scoped Auth
Shared Access Signature
Service Principal
Steps
Cluster Scoped Authentication
Session Scoped Vs Cluster Scoped Authentication
We need to add the same credentials as with Access Keys, but in the Spark Config text area of the cluster itself.
Now, even when we remove the config from the notebook, we can still access the storage using the Access Keys set on the cluster.
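As a sketch of the difference, session-scoped authentication sets the key inside the notebook, while cluster-scoped authentication moves the same setting into the cluster's Spark Config text area. The storage account name `formula1dl` and the container name `demo` are illustrative assumptions:

```python
# Session-scoped: valid only for the lifetime of this notebook's Spark session
spark.conf.set(
    "fs.azure.account.key.formula1dl.dfs.core.windows.net",
    "<access-key>"
)

# Cluster-scoped: add the equivalent line to the cluster's Spark Config text area instead:
#   fs.azure.account.key.formula1dl.dfs.core.windows.net <access-key>
# Every notebook attached to the cluster can then read the storage without credentials:
display(dbutils.fs.ls("abfss://demo@formula1dl.dfs.core.windows.net/"))
```

This snippet is a notebook fragment: `spark`, `dbutils`, and `display` are only available inside a Databricks workspace.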
AAD Credential Passthrough
Now, even if we are the owner of the storage account, we can't access the data without a role assignment that grants the Storage Blob Data Contributor role.
Again we don't need to mention any credentials in the notebook.
- Go to the Databricks Home Page.
- Add 'secrets/createScope' to the end of the URL.
- Add the secret scope name and then select all users.
- Add the Vault URL and Resource ID, which can be found in the Key Vault properties on Azure (Home/key-vault/properties).
Databricks Secrets Utility
To list the secrets within a scope: dbutils.secrets.list(scope = 'formula1-scope')
To get the value of a secret within a scope: dbutils.secrets.get(scope = 'formula1-scope', key = 'formula1dl-account-key')
To add the access key at the cluster level, add the following to the Spark config: fs.azure.account.key.formula1dl.dfs.core.windows.net {{secrets/formula1-scope/formula1dl-account-key}}
Any notebook that has access to the cluster will then have access to the ADLS storage.
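Putting the secrets utilities together, a minimal notebook sketch might look like this (the scope name `formula1-scope` and key name `formula1dl-account-key` follow the examples above and are assumptions):

```python
# List all secret scopes available in the workspace
dbutils.secrets.listScopes()

# List the secrets (keys) within a scope
dbutils.secrets.list(scope='formula1-scope')

# Retrieve a secret value; Databricks redacts it if you try to print it
account_key = dbutils.secrets.get(scope='formula1-scope', key='formula1dl-account-key')

# Use the secret for session-scoped authentication to the storage account
spark.conf.set(
    "fs.azure.account.key.formula1dl.dfs.core.windows.net",
    account_key
)
```

This is a notebook fragment: `dbutils` and `spark` exist only inside a Databricks workspace.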
- The deployment created a default Azure Blob Storage account and mounted it to DBFS, so we can run DBFS (Databricks File System) utilities to interact with the Azure Blob Storage from the Databricks workspace.
- DBFS, or Databricks File System, is a distributed file system mounted on the Databricks workspace.
- It can be accessed from any of the Databricks clusters created in that workspace.
- It's just an abstraction layer on top of Azure object storage.
- The key takeaway here is that DBFS is simply a file system that provides distributed access to the data stored in Azure storage.
- It's not a storage solution in itself. The storage is the Azure Blob Storage account that was created by default when the Databricks workspace was deployed.
- This DBFS mount on the default Azure Blob Storage is called DBFS Root. As we said, DBFS Root is backed by Azure Blob Storage in the Databricks-managed resource group.
- One of the special folders within DBFS Root, called FileStore, can be accessed via the web user interface.
- You can use it as temporary storage, for example to store images to be used in notebooks or some data to play with quickly.
- Databricks also stores query results from commands such as display in DBFS Root. Similar to Hive, Databricks also allows us to create both managed and external tables.
- If you create a managed table without specifying a location for the database, that data will also be stored in DBFS Root, i.e. DBFS Root is the default location for managed tables. You can change this during database creation.
- Even though DBFS Root is the default storage for Databricks, it's not the recommended location to store customer data.
- When you drop the Databricks workspace, this storage is dropped with it, which is not what you would want for customer data.
- Instead, we can use an external Data Lake, fully controlled by the customer, and mount that to the workspace.
Implementation
The DBFS browser is hidden by default. First click the user menu at the top right ('az_admin@gmail.com').
Then click Admin Console >> Workspace Settings >> search for DBFS >> enable the DBFS File Browser >> refresh the browser.
Now go to the Data tab, click Browse DBFS, then click FileStore.
The files in FileStore can be used by all users of the workspace.
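The FileStore folder can also be explored from a notebook using the file system utilities. A sketch (the file name `circuits.csv` is a hypothetical upload):

```python
# List the contents of the FileStore folder in DBFS Root
display(dbutils.fs.ls("/FileStore"))

# Files uploaded here can be read directly, e.g. a small CSV to play with
df = spark.read.option("header", True).csv("dbfs:/FileStore/circuits.csv")
display(df)
```

Again a notebook fragment: `dbutils`, `spark`, and `display` require a Databricks cluster.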
- We said that we shouldn't be using DBFS Root for keeping customer data.
- Now the question becomes if we can't use DBFS Root as the storage for customer data, where do we store that?
- Customers can create separate Azure Blob Storage or Azure Data Lake storage accounts in their subscription and keep the data in them.
- In this architecture, when you delete the Databricks workspace, the customer data remains untouched.
- In order to access the storage, we can use the ABFS protocol like we did before in the previous section of the course.
- But as you saw previously, this approach is tedious for two reasons.
- Firstly, we need to deal with those long ABFS URLs rather than file system semantics to access the files.
- Secondly, every time we have to use credentials to authenticate to the storage accounts before accessing the data.
- To make this experience better, Databricks allows us to mount these storage accounts to DBFS. We specify the credential when the storage is mounted.
- Once it's mounted, everyone who has access to the workspace can access the data without providing the credentials.
- Also, they will be able to use the file system semantics rather than the long URLs. In summary, Databricks mounts offer some important benefits to the storage solution in Databricks.
- Once the Azure object storage solution, such as Azure Data Lake or the Blob storage has been mounted onto the Databricks workspace, you can access the mount points without specifying the credentials.
- This allows for accessing the Azure storage from Databricks using file semantics rather than the long storage URLs. You can treat a mount point the same as mapping another drive on your computer.
- DBFS is just an abstraction layer and it still stores the files to the Azure storage, so you get all the benefits such as different performance tiers, replication, massive storage etc., as you would generally get from Azure storage.
- This was the recommended solution from Databricks to access Azure Data Lake until the introduction of Unity Catalog, which became generally available around the end of 2022.
- Databricks now recommends using the Unity Catalog for better security, but most projects I see today are still using Databricks Mounts to access the data. So please be familiar with this approach and you will come across it in your projects.
- In case you are wondering how to access data using Unity Catalog: once a workspace has been configured with Unity Catalog, you can simply use the ABFS protocol to access the Data Lake, like we have been doing so far.
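The two access styles contrasted above can be sketched as follows (the storage account `formula1dl`, container `demo`, and folder `results` are illustrative):

```python
# Direct access via the ABFS protocol (used before mounts, and again with Unity Catalog)
df = spark.read.parquet("abfss://demo@formula1dl.dfs.core.windows.net/results")

# Access via a mount point, using file system semantics instead of the long URL
df = spark.read.parquet("/mnt/formula1dl/demo/results")
```

Both lines read the same data; the mount path simply hides the storage URL and the credentials supplied at mount time.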
Mounting Azure Data Lake Storage Gen2 : Code
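A minimal sketch of mounting ADLS Gen2 using a service principal. The secret scope, secret key names, storage account (`formula1dl`), and container (`demo`) are assumptions based on the examples above:

```python
# Retrieve the service principal credentials from the secret scope
client_id = dbutils.secrets.get(scope='formula1-scope', key='databricks-app-client-id')
tenant_id = dbutils.secrets.get(scope='formula1-scope', key='databricks-app-tenant-id')
client_secret = dbutils.secrets.get(scope='formula1-scope', key='databricks-app-client-secret')

# OAuth configuration for ADLS Gen2
configs = {
    "fs.azure.account.auth.type": "OAuth",
    "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    "fs.azure.account.oauth2.client.id": client_id,
    "fs.azure.account.oauth2.client.secret": client_secret,
    "fs.azure.account.oauth2.client.endpoint": f"https://login.microsoftonline.com/{tenant_id}/oauth2/token"
}

# Mount the 'demo' container of the formula1dl storage account
dbutils.fs.mount(
    source="abfss://demo@formula1dl.dfs.core.windows.net/",
    mount_point="/mnt/formula1dl/demo",
    extra_configs=configs
)

# Verify the mount
display(dbutils.fs.mounts())
```

To remove a mount, `dbutils.fs.unmount("/mnt/formula1dl/demo")` can be used. As with the other snippets, this only runs inside a Databricks workspace.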
partitionBy allows us to create a separate data folder for each distinct value of the partition column, e.g. one folder per year.
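For example, assuming a DataFrame `results_df` with a `race_year` column and a mounted output path as above:

```python
# Writes one folder per distinct race_year,
# e.g. .../results/race_year=2020/, .../results/race_year=2021/
results_df.write \
    .mode("overwrite") \
    .partitionBy("race_year") \
    .parquet("/mnt/formula1dl/processed/results")
```

The partition column is encoded in the folder names, which lets Spark prune partitions when the column is used in a filter.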
When we have nested JSON data, the nested keys and values must be defined in a separate schema, which is then embedded as a field in the outer schema.
If a JSON record spans multiple lines in the file, we can set the option .option("multiLine", True).
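A sketch combining both points, assuming a drivers JSON file with a nested name object (the file path and field names are illustrative):

```python
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

# The nested 'name' object gets its own schema...
name_schema = StructType([
    StructField("forename", StringType(), True),
    StructField("surname", StringType(), True)
])

# ...which is then embedded as a field in the outer schema
drivers_schema = StructType([
    StructField("driverId", IntegerType(), False),
    StructField("name", name_schema, True),
    StructField("nationality", StringType(), True)
])

# multiLine is needed when each JSON record spans multiple lines
drivers_df = spark.read \
    .schema(drivers_schema) \
    .option("multiLine", True) \
    .json("/mnt/formula1dl/raw/drivers.json")
```

Nested fields can then be selected with dot notation, e.g. `drivers_df.select("name.forename")`.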