# ELT Sample: Azure Blob Storage - Databricks - CosmosDB
In this notebook, you extract data from Azure Blob Storage into a Databricks cluster, run transformations on the data in the cluster, and then load the transformed data into Azure Cosmos DB.
## Prerequisites
- Azure Blob Storage Account and Containers
- Databricks Cluster (Spark)
- Cosmos DB Spark Connector (azure-cosmosdb-spark)
  - Create a library using Maven coordinates: type `azure-cosmosdb-spark_2.2.0` in the search box and search for it. Alternatively, create the library by uploading a JAR file downloaded from the Maven Central repository.
- Azure Cosmos DB Collection
## Sample data
- https://github.com/Azure/usql/blob/master/Examples/Samples/Data/json/radiowebsite/small_radio_json.json
## Links
- https://docs.azuredatabricks.net/spark/latest/data-sources/azure/azure-storage.html
- https://github.com/Azure/azure-cosmosdb-spark

# Connecting to Azure Blob Storage and accessing a sample JSON file

## Set up an account access key

In [4]:
# spark.conf.set(
#  "fs.azure.account.key.<storage-account-name>.blob.core.windows.net",
#  "<storage-access-key>")

spark.conf.set(
  "fs.azure.account.key.databrickstore.blob.core.windows.net",
  "S1PtMWvUw5If1Z8FMzXAxC7OMw9G5Go8BGCXJ81qpFVYpZ9dpXOnU4zlg0PbldKkbLIbmbv02WoJsgYLGKIfgg==")

Once an account access key or a SAS is set up in your notebook, you can use standard Spark and Databricks APIs to read from the storage account.

In [6]:
#dbutils.fs.ls("wasbs://<your-container-name>@<your-storage-account-name>.blob.core.windows.net/<your-directory-name>")
dbutils.fs.ls("wasbs://dbdemo01@databrickstore.blob.core.windows.net")
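
The `wasbs://` URL and the Spark conf key above follow the same naming pattern, so they can be generated from the container and account names instead of being typed by hand. The sketch below is plain Python; `wasbs_url` and `account_key_conf` are hypothetical helper names introduced here for illustration and are not part of Spark or Databricks.

```python
def wasbs_url(container, account, path=""):
    """Build a wasbs:// URL for a container, with an optional directory path."""
    base = "wasbs://%s@%s.blob.core.windows.net" % (container, account)
    path = path.strip("/")
    return base + "/" + path if path else base


def account_key_conf(account):
    """Name of the Spark conf entry that holds the account access key."""
    return "fs.azure.account.key.%s.blob.core.windows.net" % account


# Container and account names are the ones used in this notebook.
print(wasbs_url("dbdemo01", "databrickstore"))
print(account_key_conf("databrickstore"))
```

In the notebook, the results could be passed to `spark.conf.set(...)` and `dbutils.fs.ls(...)` in place of the literal strings.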

## Mount a Blob storage container or a folder inside a container

In [8]:
# Mount a Blob storage container or a folder inside a container
# dbutils.fs.mount(
#   source = "wasbs://<your-container-name>@<your-storage-account-name>.blob.core.windows.net/<your-directory-name>",
#   mount_point = "<mount-point-path>",
#   extra_configs = {"<conf-key>": "<conf-value>"})
# [note] <mount_point> is a DBFS path and the path must be under /mnt

dbutils.fs.mount(
  source = "wasbs://dbdemo01@databrickstore.blob.core.windows.net",
  mount_point = "/mnt/dbdemo01",
  extra_configs = {"fs.azure.account.key.databrickstore.blob.core.windows.net": "S1PtMWvUw5If1Z8FMzXAxC7OMw9G5Go8BGCXJ81qpFVYpZ9dpXOnU4zlg0PbldKkbLIbmbv02WoJsgYLGKIfgg=="})
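
Re-running the mount cell typically fails when the mount point is already in use, so it can help to check the existing mounts first. The following is a minimal sketch, assuming `dbutils.fs.mounts()` returns entries with a `mountPoint` attribute, as it does on Databricks. `mount_if_needed` is a hypothetical helper name, and `dbutils` is passed in as a parameter so the function can also be exercised outside a notebook.

```python
def mount_if_needed(dbutils, source, mount_point, extra_configs):
    """Mount `source` at `mount_point` unless something is already mounted there.

    Returns True if a new mount was created, False if one already existed.
    """
    existing = [m.mountPoint for m in dbutils.fs.mounts()]
    if mount_point in existing:
        return False
    dbutils.fs.mount(
        source=source,
        mount_point=mount_point,
        extra_configs=extra_configs,
    )
    return True


# Usage inside a Databricks notebook (not executed here; needs a live dbutils):
# mount_if_needed(
#     dbutils,
#     "wasbs://dbdemo01@databrickstore.blob.core.windows.net",
#     "/mnt/dbdemo01",
#     {"fs.azure.account.key.databrickstore.blob.core.windows.net": "<storage-access-key>"},
# )
```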


## Access files in your container as if they were local files

In [10]:
# Access files in your container as if they were local files
# (TEXT) df = spark.read.text("/mnt/%s/...." % <mount-point-path>)
# (JSON) df = spark.read.json("/mnt/%s/...." % <mount-point-path>)

df = spark.read.json( "/mnt/%s/small_radio_json.json" % "dbdemo01" )

# display(df)
df.show()

## Unmount the blob storage (if needed)

In [12]:
# unmount (if needed)
# dbutils.fs.unmount("<mount-point-path>")
# dbutils.fs.unmount("/mnt/dbdemo01")

# Transform data in Azure Databricks

Start by retrieving only the columns firstName, lastName, gender, location, and level from the dataframe you already created.

In [15]:
specificColumnsDf = df.select("firstname", "lastname", "gender", "location", "level")
specificColumnsDf.show()

You can further transform this data to rename the column level to subscription_type.

In [17]:
renamedColumnsDF = specificColumnsDf.withColumnRenamed("level", "subscription_type")
renamedColumnsDF.show()
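
To check what a single record should look like after the two steps above, the same projection and rename can be mirrored in plain Python. This is only a sketch for reasoning about the expected shape, not a replacement for the Spark code. The field names follow the spelling in the text above, the record is made up for illustration, and `transform_record` is a hypothetical helper name. Unlike Spark's default column resolution, plain dict lookups are case-sensitive.

```python
import json

# Fields to keep, and the rename applied afterwards (mirrors select + withColumnRenamed).
KEEP = ["firstName", "lastName", "gender", "location", "level"]
RENAME = {"level": "subscription_type"}


def transform_record(record):
    """Keep only the KEEP fields, then apply RENAME to the field names."""
    return {RENAME.get(name, name): record.get(name) for name in KEEP}


# Hypothetical record, made up for illustration; not taken from the sample file.
line = (
    '{"firstName": "Ada", "lastName": "Smith", "gender": "F", '
    '"location": "Seattle, WA", "level": "free", "page": "Home"}'
)
print(transform_record(json.loads(line)))
```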

# Load data into Azure Cosmos DB

Define the write configuration, then write the renamedColumnsDF DataFrame to Cosmos DB.

In [20]:
#writeConfig = {
# "Endpoint" : "https://<cosmosdb-account-name>.documents.azure.com:443/",
# "Masterkey" : "<Cosmosdb-master-key-string>",
# "Database" : "<database-name>",
# "Collection" : "<collection-name>",
# "Upsert" : "true"
#}

# Write configuration
writeConfig = {
 "Endpoint" : "https://dbstreamdemo.documents.azure.com:443/",
 "Masterkey" : "ekRLXkETPJ93s6XZz4YubZOw1mjSnoO5Bhz1Gk29bVxCbtgtKmiyRz4SogOSxLOGTouXbwlaAHcHOzct4JVwtQ==",
 "Database" : "etl",
 "Collection" : "outcol01",
 "Upsert" : "true"
}

# Write to Cosmos DB from the renamedColumnsDF DataFrame
renamedColumnsDF.write.format("com.microsoft.azure.cosmosdb.spark").options(**writeConfig).save()
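
The write configuration is a plain dictionary of string options, so it can be built by a small function that also rejects unfilled `<...>` placeholders before the write is attempted. This is a minimal sketch; `build_write_config` is a hypothetical helper name, and the endpoint pattern is taken from the commented template above.

```python
def build_write_config(account, master_key, database, collection, upsert=True):
    """Return an options dict shaped like writeConfig, with all values as strings."""
    required = [
        ("account", account),
        ("master_key", master_key),
        ("database", database),
        ("collection", collection),
    ]
    for name, value in required:
        # Reject empty values and template placeholders such as "<database-name>".
        if not value or value.startswith("<"):
            raise ValueError("missing or placeholder value for %s" % name)
    return {
        "Endpoint": "https://%s.documents.azure.com:443/" % account,
        "Masterkey": master_key,
        "Database": database,
        "Collection": collection,
        "Upsert": "true" if upsert else "false",
    }


# Usage inside the notebook (not executed here; needs Spark and the connector):
# writeConfig = build_write_config("dbstreamdemo", "<Cosmosdb-master-key-string>", "etl", "outcol01")
# renamedColumnsDF.write.format("com.microsoft.azure.cosmosdb.spark").options(**writeConfig).save()
```

As written, the commented usage would raise `ValueError` until the placeholder key is replaced. One option on Databricks is to read the key from a secret scope with `dbutils.secrets.get(scope=..., key=...)` rather than pasting it into the notebook; the scope and key names would be your own.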