Dec 09 2020 - Connect to Azure Blob storage using Notebooks in Azure Databricks

The Azure Databricks repository is a set of blog posts, written as an Advent of 2020 present to readers, for easier onboarding to Azure Databricks!

Series of Azure Databricks posts:

Yesterday we introduced the Databricks CLI and how to upload files from "anywhere" to Databricks. Today we will look at how to use Azure Blob storage for storing files and how to access the data using Azure Databricks notebooks.

1. Create Azure Storage account

We will need to go outside of Azure Databricks to the Azure portal and search for Storage accounts.

Create a new storage account by clicking "+ Add", and select the subscription, resource group, storage account name, location, account type and replication.

Continue to set up networking, data protection and advanced settings, and create the storage account. Once the storage account is ready, we will create the container itself. Note that General Purpose v2 storage accounts support the latest Azure Storage features and all functionality of general purpose v1 and Blob Storage accounts. General purpose v2 accounts bring the lowest per-gigabyte capacity prices for Azure Storage and support the following Azure Storage services:

  • Blobs (all types: Block, Append, Page)
  • Data Lake Gen2
  • Files
  • Disks
  • Queues
  • Tables

Once the account is ready to be used, select it and choose "Containers".

A container is blob storage for unstructured data and works seamlessly with Azure Databricks DBFS. In the Containers section, select "+ Container" to add a new container and give it a name.

Once the container is created, click on it to see additional details.

Your data will be stored in this container and later used with Azure Databricks notebooks. You can also access the storage using Microsoft Azure Storage Explorer, which is much more intuitive and offers easier management, folder creation and handling of binary files.

You can upload a file using the Microsoft Azure Storage Explorer tool or directly in the portal. In an organisation, however, files and data will typically be copied here automatically by many other Azure services. Upload the file available in the GitHub repository (data/Day9_MLBPlayers.csv - the data file is licensed under GNU) to the blob storage container in any way you like. I used Storage Explorer and simply dragged and dropped the file into the container.

2. Shared Access Signature (SAS)

Before we go back to Azure Databricks, we need to set the access policy for this container. Select "Access Policy".

We need to create a Shared Access Signature, a signed token that grants delegated access to the storage account. Click on Access policy in the left menu and, once the page loads, select "+ Add policy" under Shared access policies and give it a name, permissions and validity period:

Click OK to confirm and click Save (the save icon). Go back to the storage account and, on the left, select Shared access signature.

Under Allowed resource types, it is mandatory to select Container, but you can select all of them. Set the start and expiry date - one month in my case. Click the "Generate SAS and connection string" button and copy the strings you need; the connection string and the SAS token should be enough (paste them into a text editor for now).
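The SAS token should be treated like a password. Pasting it straight into a notebook is fine for a demo (and is what the next step does), but a safer option is to store it in a Databricks secret scope and read it at runtime. A minimal sketch, assuming a secret scope named blog-scope with a key blob-sas has already been created via the Databricks CLI (both names are hypothetical):

%scala

// Read the SAS token from a secret scope instead of hard-coding it in the notebook
// ("blog-scope" and "blob-sas" are illustrative names, not part of the original post)
val sasFromSecret = dbutils.secrets.get(scope = "blog-scope", key = "blob-sas")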

Once this is done, let's continue with Azure Databricks notebooks.

3. Creating notebooks in Azure Databricks

Start up a cluster and create a new notebook (as we discussed on Day 4 and Day 7). The notebook is available on GitHub.

And the code is:

%scala

// Container, storage account and SAS token generated in the previous step
val containerName = "dbpystorecontainer"
val storageAccountName = "dbpystorage"
val sas = "?sv=2019-12-12&ss=bfqt&srt=sco&sp=rwdlacupx&se=2020-12-09T06:15:32Z&st=2020-12-08T22:15:32Z&spr=https&sig=S%2B0nzHXioi85aW%2FpBdtUdR9vd20SRKTzhNwNlcJJDqc%3D"

// Spark configuration key used to pass the SAS token for this container
val config = "fs.azure.sas." + containerName + "." + storageAccountName + ".blob.core.windows.net"

followed by the mount function:

// Mount the container under /mnt/storage1, passing the SAS token via extraConfigs
dbutils.fs.mount(
  source = "wasbs://" + containerName + "@" + storageAccountName + ".blob.core.windows.net/",
  mountPoint = "/mnt/storage1",
  extraConfigs = Map(config -> sas))
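Before reading any data, it is worth checking that the mount actually succeeded. A quick sanity check, assuming the /mnt/storage1 mount point from above:

%scala

// List the files visible under the new mount point - Day9_MLBPlayers.csv should show up here
display(dbutils.fs.ls("/mnt/storage1"))

// If the mount is no longer needed, it can be removed again with:
// dbutils.fs.unmount("/mnt/storage1")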

When you run the following Scala command, it creates a Spark DataFrame called mydf1:

%scala

// Read the CSV file from the mounted container
val mydf1 = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("/mnt/storage1/Day9_MLBPlayers.csv")

display(mydf1)
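If you prefer not to mount the container at all, the same SAS token can be set on the Spark configuration for the session and the file read directly over wasbs. A minimal sketch, reusing the containerName, storageAccountName, config and sas values defined above (mydf2 is just an illustrative name):

%scala

// Pass the SAS token to Spark for this session, then read straight from Blob storage
spark.conf.set(config, sas)

val mydf2 = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv(s"wasbs://${containerName}@${storageAccountName}.blob.core.windows.net/Day9_MLBPlayers.csv")

display(mydf2)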

And now we can start exploring the dataset. I am using the R language for the exploration.

This was a long but important topic to address. Now you know how to access and store data.

Tomorrow we will check how to start using notebooks, focusing more on analytics and less on infrastructure.

The complete set of code and notebooks will be available in the GitHub repository.

Happy Coding and Stay Healthy!