# Connecting to Azure Blob Storage

Apache Spark&trade; and Azure Databricks&reg; allow you to connect to virtually any data store including Azure Blob Storage.

### Spark as a Connector

Spark quickly rose to popularity as a replacement for the [Apache Hadoop&trade;](http://hadoop.apache.org/) MapReduce paradigm, in large part because it connects easily to many different data sources.  Most important among these data sources was the Hadoop Distributed File System (HDFS).  Now, Spark engineers connect to a wide variety of data sources including:  
<br>
* Traditional databases like Postgres, SQL Server, and MySQL
* Message brokers like Kafka and Kinesis
* Distributed databases like Cassandra and Cosmos DB
* Data warehouses like Hive and Redshift
* File types like CSV, Parquet, and Avro

<img src="https://files.training.databricks.com/images/eLearning/ETL-Part-1/open-source-ecosystem_2.png" style="border: 1px solid #aaa; border-radius: 10px 10px 10px 10px; box-shadow: 5px 5px 5px #aaa"/>

### DBFS Mounts and Azure Blob Storage

Azure Blob Storage is the backbone of Databricks workflows.  Azure Blob Storage offers data storage that easily scales to the demands of most data applications and, by colocating data with Spark clusters, Databricks quickly reads from and writes to Azure Blob Storage in a distributed manner.

The Databricks File System (DBFS) is a layer over Azure Blob Storage that allows you to mount Blob containers, making them available to other users in your workspace and persisting the data after a cluster is shut down.

In our roadmap for ETL, this is the <b>Extract and Validate</b> step:

<img src="https://files.training.databricks.com/images/eLearning/ETL-Part-1/ETL-Process-1.png" style="border: 1px solid #aaa; border-radius: 10px 10px 10px 10px; box-shadow: 5px 5px 5px #aaa"/>

Run the cell below to set up your environment.

In [5]:
%run "./Includes/Classroom-Setup"


Define your Azure Blob credentials.  You need the following elements:<br><br>

* Storage account name
* Container name
* Mount point (how the mount will appear in DBFS)
* Shared Access Signature (SAS) key

These elements are defined below, including a read-only SAS key.

<img alt="Side Note" title="Side Note" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.05em; transform:rotate(15deg)" src="https://files.training.databricks.com/static/images/icon-note.webp"/> For more information on SAS keys, <a href="https://docs.microsoft.com/en-us/azure/storage/common/storage-dotnet-shared-access-signature-part-1" target="_blank"> see the Azure documentation.</a><br>
<img alt="Side Note" title="Side Note" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.05em; transform:rotate(15deg)" src="https://files.training.databricks.com/static/images/icon-note.webp"/> SAS keys are normally provided as a SAS URI. Of this URI, focus on everything from the `?` on, including the `?`. The following cell provides an example of this.
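As a rough illustration of the note above, the SAS key can be cut out of a full SAS URI with Python's standard library. This is only a sketch: the URI and signature are made-up placeholders, and `sas_key_from_uri` is a hypothetical helper name, not part of any Databricks or Azure API.

```python
from urllib.parse import urlsplit


def sas_key_from_uri(sas_uri):
    """Return the query portion of a SAS URI, including the leading '?'."""
    query = urlsplit(sas_uri).query
    if not query:
        raise ValueError("no SAS token found in URI")
    return "?" + query


# Placeholder URI for illustration only; the signature is not a real credential.
example_uri = (
    "https://myaccount.blob.core.windows.net/mycontainer"
    "?sv=2017-07-29&sp=rl&sig=PLACEHOLDER"
)
print(sas_key_from_uri(example_uri))  # ?sv=2017-07-29&sp=rl&sig=PLACEHOLDER
```

The percent-encoding in the token is left untouched, which matches the form of the `SAS_KEY` string used in this lesson.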

In [7]:
STORAGE_ACCOUNT = "dbtraineastus2"
CONTAINER = "training"
MOUNT_POINT = "/mnt/training"
SAS_KEY = "?sv=2017-07-29&ss=b&srt=sco&sp=rl&se=2023-04-19T06:32:30Z&st=2018-04-18T22:32:30Z&spr=https&sig=BB%2FQzc0XHAH%2FarDQhKcpu49feb7llv3ZjnfViuI9IWo%3D"

First, unmount the container, since you mounted this container in the classroom setup script.

In [9]:
try:
  dbutils.fs.unmount(MOUNT_POINT) # Use this to unmount as needed
except Exception: # Avoid a bare except, which would also swallow KeyboardInterrupt
  print("{} already unmounted".format(MOUNT_POINT))

Define two strings populated with the storage account and container information.  These will be passed to the `mount` function.

In [11]:
source_str = "wasbs://{container}@{storage_acct}.blob.core.windows.net/".format(container=CONTAINER, storage_acct=STORAGE_ACCOUNT)
conf_key = "fs.azure.sas.{container}.{storage_acct}.blob.core.windows.net".format(container=CONTAINER, storage_acct=STORAGE_ACCOUNT)
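To make the shape of these two strings concrete, here is a small sketch that builds them from the account and container names and prints the result. `wasbs_mount_strings` is a hypothetical helper introduced only for this illustration; it needs no cluster to run.

```python
def wasbs_mount_strings(storage_account, container):
    """Return (source URL, SAS configuration key) for a Blob container."""
    host = "{}.blob.core.windows.net".format(storage_account)
    source = "wasbs://{}@{}/".format(container, host)
    conf_key = "fs.azure.sas.{}.{}".format(container, host)
    return source, conf_key


source, key = wasbs_mount_strings("dbtraineastus2", "training")
print(source)  # wasbs://training@dbtraineastus2.blob.core.windows.net/
print(key)     # fs.azure.sas.training.dbtraineastus2.blob.core.windows.net
```

The source string identifies which container to mount, while the configuration key names the Hadoop setting under which the SAS key for that container is supplied.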


Now, mount the container <a href="https://docs.databricks.com/spark/latest/data-sources/azure/azure-storage.html#mount-azure-blob-storage-containers-with-dbfs" target="_blank"> using the template provided in the docs.</a>

<img alt="Side Note" title="Side Note" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.05em; transform:rotate(15deg)" src="https://files.training.databricks.com/static/images/icon-note.webp"/> The code below includes error handling for the case where the container is already mounted.

In [13]:
try:
  dbutils.fs.mount(
    source = source_str,
    mount_point = MOUNT_POINT,
    extra_configs = {conf_key: SAS_KEY}
  )
except Exception as e:
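  # Report the underlying error too, since a mount can fail for other reasons
  print("ERROR: could not mount {} (it may already be mounted; run previous cells to unmount first): {}".format(MOUNT_POINT, e))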
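An alternative to catching the exception is to check the existing mounts first. The sketch below assumes the object passed as `fs` behaves like `dbutils.fs`, whose `mounts()` call returns entries with a `mountPoint` attribute. `mount_if_absent` is a hypothetical helper name, not part of the Databricks utilities.

```python
def mount_if_absent(fs, source, mount_point, extra_configs):
    """Mount `source` at `mount_point` unless something is already mounted there.

    `fs` is assumed to behave like `dbutils.fs`. Returns True if a new mount
    was created and False if the mount point was already in use.
    """
    if any(m.mountPoint == mount_point for m in fs.mounts()):
        print("{} is already mounted; skipping".format(mount_point))
        return False
    fs.mount(source=source, mount_point=mount_point, extra_configs=extra_configs)
    return True
```

In a notebook this could be called as `mount_if_absent(dbutils.fs, source_str, MOUNT_POINT, {conf_key: SAS_KEY})`. Checking first keeps genuine failures, such as an expired SAS key, from being mistaken for an existing mount.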

Next, explore the mount using `%fs ls` and the name of the mount.

In [15]:
%fs ls /mnt/training


In practice, always secure your credentials.  Do this either by maintaining a single notebook with restricted permissions that holds SAS keys, or by deleting the cells or notebooks that expose the keys. **Once the cell that mounts a container has run, the mount is accessible from any notebook and any cluster in the workspace, and can be shared with colleagues.**

<img alt="Side Note" title="Side Note" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.05em; transform:rotate(15deg)" src="https://files.training.databricks.com/static/images/icon-note.webp"/> See <a href="https://docs.azuredatabricks.net/user-guide/secrets/index.html" target="_blank">secret management to securely store and reference your credentials in notebooks and jobs.</a>
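As a sketch of how a stored secret could replace the hard-coded SAS key, the helper below builds the `extra_configs` dictionary for a mount from a secret. It assumes the object passed as `secrets` behaves like `dbutils.secrets`, whose `get(scope=..., key=...)` call returns the secret value. The helper name and the scope and key names in the usage note are hypothetical.

```python
def sas_configs_from_secret(secrets, scope, key, conf_key):
    """Build the `extra_configs` dict for a mount from a stored secret.

    `secrets` is assumed to behave like `dbutils.secrets`; `scope` and `key`
    identify a secret that was created beforehand.
    """
    return {conf_key: secrets.get(scope=scope, key=key)}
```

In a notebook this might look like `sas_configs_from_secret(dbutils.secrets, "my-scope", "training-sas", conf_key)`, where `my-scope` and `training-sas` stand in for names you would choose yourself. The SAS key then never appears in the notebook source.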

## Adding Options

When you import that data into a cluster, you can add options based on the specific characteristics of the data.

Display the first few lines of `Chicago-Crimes-2018.csv` using `%fs head`.

In [19]:
%fs head /mnt/training/Chicago-Crimes-2018.csv

`option` is a method of `DataFrameReader`. Options are key/value pairs and must be specified before calling `.csv()`.

This is a tab-delimited file, as seen in the previous cell. Specify the `"delimiter"` option in the import statement.  

**Note:** Find a [full list of parameters here.](https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=dateformat#pyspark.sql.DataFrameReader.csv)

In [21]:
display(spark.read
  .option("delimiter", "\t")
  .csv("/mnt/training/Chicago-Crimes-2018.csv")
)

Spark doesn't read the header by default, as demonstrated by the column names of `_c0`, `_c1`, etc. Notice the column names are present in the first row of the DataFrame. 

Fix this by setting the `"header"` option to `True`.

In [23]:
display(spark.read
  .option("delimiter", "\t")
  .option("header", True)
  .csv("/mnt/training/Chicago-Crimes-2018.csv")
)

Spark read every column as a string: it did not infer the schema, and it cannot recognize the timestamps on its own because this file uses an atypical timestamp format.  Address the timestamps by adding the `"timestampFormat"` option and passing it the format used in this file.  

Set `"inferSchema"` to `True`, which triggers Spark to make an extra pass over the data to infer the schema.

In [25]:
crimeDF = (spark.read
  .option("delimiter", "\t")
  .option("header", True)
  .option("timestampFormat", "MM/dd/yyyy hh:mm:ss a") # MM = month, mm = minute
  .option("inferSchema", True)
  .csv("/mnt/training/Chicago-Crimes-2018.csv")
)
display(crimeDF)
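The pattern letters passed to `timestampFormat` are case-sensitive: in the Java-style date patterns Spark uses, `MM` is the month, `mm` is the minute, `hh` is the hour on a 12-hour clock, and `a` is the AM/PM marker. The sketch below spells out the same layout with Python's `datetime` module, which uses different codes. It is an analogy only, not Spark's parser, and the sample value is made up.

```python
from datetime import datetime

# Spark pattern letter -> Python strptime code:
#   MM -> %m (month)   dd -> %d (day)      yyyy -> %Y (year)
#   hh -> %I (12-hour) mm -> %M (minute)   ss -> %S (second)   a -> %p (AM/PM)
PYTHON_EQUIVALENT = "%m/%d/%Y %I:%M:%S %p"

sample = "04/18/2018 10:32:30 PM"  # made-up value in the layout described above
parsed = datetime.strptime(sample, PYTHON_EQUIVALENT)
print(parsed.isoformat())  # 2018-04-18T22:32:30
```

Using `mm` in the month position would make the parser treat the first field as minutes, so dates would come out wrong or fail to parse.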

## The Design Pattern

Other connections work in much the same way, whether your data sits in Cassandra, Cosmos DB, Redshift, or another common data store.  The general pattern is always:  
<br>
1. Define the connection point
2. Define connection parameters such as access credentials
3. Add necessary options

Once these are in place, read data using `spark.read.option(<option key>, <option value>).<connection_type>(<endpoint>)`.
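The pattern can be sketched as one generic function built on the `format`, `options`, and `load` methods of `DataFrameReader`. `read_source` is a hypothetical helper name, and the `spark` argument is assumed to be an active `SparkSession` such as the one Databricks notebooks provide.

```python
def read_source(spark, connection_type, endpoint, **options):
    """Read `endpoint` using the given format name and reader options."""
    return (spark.read
      .format(connection_type)   # for example "csv", "parquet", or "jdbc"
      .options(**options)        # several key/value pairs in one call
      .load(endpoint)            # the connection point, such as a path
    )
```

For the crime data above, a call might look like `read_source(spark, "csv", "/mnt/training/Chicago-Crimes-2018.csv", header=True, delimiter="\t")`. Note the difference between `.option(key, value)`, which sets one pair, and `.options(**kwargs)`, which sets several at once.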

## Exercise 1: Read Wikipedia Data

Read Wikipedia data from Azure Blob Storage, accounting for its delimiter and header.

### Step 1: Get a Sense for the Data

Take a look at the head of the data, located at `/mnt/training/wikipedia/pageviews/pageviews_by_second.tsv`.

In [29]:
# ANSWER
# Same as `%fs head /mnt/training/wikipedia/pageviews/pageviews_by_second.tsv`, limited to the first 100 bytes
print(dbutils.fs.head('/mnt/training/wikipedia/pageviews/pageviews_by_second.tsv', 100))

### Step 2: Import the Raw Data

Import the data **without any options** and save it to `wikiDF`. Display the result.

In [31]:
# ANSWER
path = "/mnt/training/wikipedia/pageviews/pageviews_by_second.tsv"

wikiDF = spark.read.csv(path)
display(wikiDF)

In [32]:
# TEST - Run this cell to test your solution

dbTest("ET1-P-03-01-01", 7200001, wikiDF.count())
dbTest("ET1-P-03-01-02", '_c0', wikiDF.columns[0])

print("Tests passed!")

### Step 3: Import the Data with Options

Import the data with options and save it to `wikiWithOptionsDF`.  Display the result.  Your import statement should account for:<br><br>  

 - The header
 - The delimiter

In [34]:
# ANSWER
path = "/mnt/training/wikipedia/pageviews/pageviews_by_second.tsv"

wikiWithOptionsDF = (spark.read
  .option("header", True)
  .option("delimiter", "\t")
  .csv(path)
)

display(wikiWithOptionsDF)

In [35]:
# TEST - Run this cell to test your solution
cols = set(wikiWithOptionsDF.columns)

dbTest("ET1-P-03-02-01", 7200000, wikiWithOptionsDF.count())
dbTest("ET1-P-03-02-02", {'requests', 'site', 'timestamp'}, cols)

print("Tests passed!")

## Review

**Question:** What accounts for Spark's quick rise in popularity as an ETL tool?  
**Answer:** Spark easily accesses data virtually anywhere it lives, and its scalable framework makes connectors easier to build.  Spark offers a unified API for connecting to data, so reading from a CSV file, JSON data, or a database, to name a few examples, looks nearly identical.  This allows developers to focus on writing their code rather than writing connectors.

**Question:** What is DBFS and why is it important?  
**Answer:** The Databricks File System (DBFS) allows access to scalable, fast, and distributed storage backed by S3 or the Azure Blob Store.

**Question:** How do you connect your Spark cluster to Azure Blob Storage?  
**Answer:** By mounting it. Mounts require Azure credentials such as SAS keys and give access to a virtually infinite store for your data. To protect those credentials, one option is to define your keys in a single notebook that only you have permission to access. Click the arrow next to a notebook in the Workspace tab to define access permissions.

**Question:** How do you specify parameters when reading data?  
**Answer:** Using `.option()` during your read allows you to pass key/value pairs specifying aspects of your read.  For instance, options for reading CSV data include `header`, `delimiter`, and `inferSchema`.

**Question:** What is the general design pattern for connecting to your data?  
**Answer:** The general design pattern is as follows:
1. Define the connection point
2. Define connection parameters such as access credentials
3. Add necessary options such as for headers or parallelization

## Next Steps

Start the next lesson, [Connecting to JDBC]($./04-Connecting-to-JDBC ).

## Additional Topics & Resources

**Q:** Where can I find more information on DBFS?  
**A:** <a href="https://docs.azuredatabricks.net/user-guide/dbfs-databricks-file-system.html#dbfs" target="_blank">Take a look at the Databricks documentation for more details.</a>