# Module 6: Azure Blob Storage

In the digital world, data is often stored in cloud environments. Understanding how to work with data in a cloud environment is essential for a Data Engineer. In this training we are going to focus on data stored in Azure Blob Storage.

During this training we will go into detail about Azure Blob Storage, and how to work with the data stored in it using Python. The training follows this outline:
1. What is Azure Blob Storage?
2. Connecting with Azure Blob Storage
3. Azure Blob Storage operations 

Enjoy!

Run the following cell to import all necessary libraries.

In [None]:
import azure.storage.blob
import json
import requests

### Section 1: What is Azure Blob Storage? (15 min)

Azure Blob Storage is an online place where data can be stored; it's as simple as that. As with almost any data storage location, Python can do the heavy lifting for us: there are libraries that allow you to communicate with the storage and retrieve information from the stored data.

Azure Blob Storage is a storage solution for the cloud, optimized for storing huge amounts of unstructured data. Unstructured data is data that usually doesn't follow a data model; examples are text data, images, and videos. This makes Azure Blob Storage an ideal place to quickly store and access your (unstructured) data.

One can view Azure Blob Storage as a file system, with a hierarchical relationship in storing the data or files. When data is stored within the Azure Blob Storage, you can access it using the Azure Blob Storage REST API, which we'll look at further on.
The structure of Azure Blob Storage is described in the image below.

![blob1.png](attachment:blob1.png)

In the image above, you can see three important things: the storage account, the containers within the storage account, and the blobs within the containers. These are the core of what makes up Azure Blob Storage. Let's have a look at each of them.

**The storage account**
A storage account provides a unique namespace in Azure for your data: it creates a space to store your objects and allows you to find those objects again. Every object that you store in Azure Storage has an address that includes your unique account name. The combination of the account name and the Blob Storage endpoint forms the base address for the objects in your storage account. For example, if your storage account is named newstorageaccount, then the default endpoint for Blob Storage is: http://newstorageaccount.blob.core.windows.net
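To make the address structure concrete, here is a minimal sketch that builds the address of a single blob from the three levels of the hierarchy. The helper name `blob_url` is hypothetical, and the sketch assumes the `https` scheme and the default `blob.core.windows.net` endpoint suffix; the container and blob names are appended to the base address as path segments.

```python
from urllib.parse import quote


def blob_url(account_name: str, container_name: str, blob_name: str) -> str:
    """Build the address of a blob from account, container, and blob name."""
    # Base address: account name combined with the default Blob Storage endpoint.
    base = f"https://{account_name}.blob.core.windows.net"
    # Blob names may contain characters that need URL encoding; "/" is kept as-is.
    return f"{base}/{container_name}/{quote(blob_name)}"


print(blob_url("newstorageaccount", "pokemon", "pokemon_1.json"))
```

Knowing this layout helps when reading error messages or log lines, since they usually contain the full blob address.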

**Containers**
A container organizes a set of blobs, similar to a directory in a file system. A storage account can include an unlimited number of containers, and a container can store an unlimited number of blobs.

**Blobs**
Blobs are the lowest level within Azure Blob Storage. They are the files that you store in your Blob Storage. There are three types of blobs: block blobs, append blobs, and page blobs. In this training we will only use block blobs. For now, all you need to know about the different blob types is that they exist and that each serves a different purpose. See the documentation for more information (if you want): https://docs.microsoft.com/en-us/azure/storage/blobs/storage-blob-pageblob-overview?tabs=dotnet.

With this as a core, we can move to working with Azure Blob Storage. And as with every data-related subject, we can use Python. During this training we will start with a focus on the following core aspects:
- Create a container.
- Upload a file to block blob.
- List blobs.
- Download a blob to file.
- Delete a blob.
- Delete the container.

Afterwards, we will practice these core operations further, so that it all becomes familiar.

### Section 2: Connecting with Azure Blob Storage (45 min)

The first thing we have to do is connect to the Azure Blob Storage. As mentioned above, each storage account has its own unique endpoint. Along with that, there are unique keys that authenticate whoever is trying to access the Blob Storage.

As with a lot of things, there is a Python library that we can use. We are going to use the azure-storage-blob library. See the documentation for more information: https://pypi.org/project/azure-storage-blob/. This library is designed to help with operations on an Azure Blob Storage. It allows you to connect to the Blob Storage, retrieve information from it, and perform other actions such as creating your own containers and storing your own data.

Let's first have a look at making a connection. There are multiple ways to do that (as you may have seen in the documentation). We are going to focus on one of them for now.

As we saw in the previous section, the hierarchy of containers and blobs is important, and that's (partly) what you use to navigate. Below is an example of code that can be used to access information.

```python
import json
from azure.storage.blob import BlobClient

CONNECTION_STRING = "INSERT_CONNECTION_STRING_HERE"

blob_client = BlobClient.from_connection_string(conn_str=CONNECTION_STRING, 
                                                container_name="pokemon", 
                                                blob_name="pokemon_1.json")


result = blob_client.download_blob()
json.loads(result.readall())
```

Let's analyse the code above. The first thing of note is the variable 'CONNECTION_STRING'. A connection string is one of the ways to access an Azure Blob Storage. When you read the 'CONNECTION_STRING', you can see the storage account name, the account key, and the endpoint.
Another aspect is the use of the 'BlobClient', which comes from the azure-storage-blob library. To create one, you need the connection string, the name of the container, and the name of the blob. In this particular example we retrieved only one blob, which is a single JSON file.
Using the '.download_blob()' method we can retrieve the data within the blob, and using the json library we can parse that data into a Python object.
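To see what a connection string contains, the following sketch splits one into its parts. The function name `parse_connection_string` is hypothetical, and the example string uses made-up values in the usual `Key=Value;Key=Value` layout; it is not a real account and cannot be used to connect.

```python
def parse_connection_string(conn_str: str) -> dict:
    """Split a 'Key=Value;Key=Value' connection string into a dictionary."""
    parts = {}
    for segment in conn_str.split(";"):
        if not segment:
            continue  # skip the empty segment left by a trailing ';'
        # Split on the first '=' only: account keys are base64 and may end in '='.
        key, _, value = segment.partition("=")
        parts[key] = value
    return parts


# Made-up example values, not a real account.
example = (
    "DefaultEndpointsProtocol=https;AccountName=newstorageaccount;"
    "AccountKey=abc123==;EndpointSuffix=core.windows.net"
)
parts = parse_connection_string(example)
print(parts["AccountName"])
print(parts["EndpointSuffix"])
```

Because the account key is part of the string, treat a connection string like a password: avoid printing it in full or committing it to version control.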

In our case, you can see an Azure Blob Storage as a simplified file directory. With the containers as folders, and the blobs as files. Let's first have a look at (some of) the existing files within our Blob Storage. 

The connection string (which you'll need to use for the entire duration of the training) can be found in the code section below.

In [None]:
CONNECTION_STRING = "INSERT_CONNECTION_STRING_HERE"

#### Assignment 1: Connecting with Azure Blob Storage 1

Set up a connection to the Azure Blob Storage.
Then retrieve and print the information on 'pokemon_54.json'.

In [None]:
### FILL IN

#### Assignment 2: Connecting with Azure Blob Storage 2

Set up a connection to the Azure Blob Storage.
Then retrieve and print the information on 'pokemon_131.json'.

In [None]:
### FILL IN

#### Assignment 3: Connecting with Azure Blob Storage 3

Set up a connection to the Azure Blob Storage.
Then retrieve and store information on multiple blobs. 
Use a for loop to make repeated connections with the Azure Blob Storage.
Store each blob in a list, use the json library to retrieve information.
The blobs to retrieve are: 'pokemon_37.json', 'pokemon_38.json', 'pokemon_58.json', 'pokemon_59.json', 'pokemon_77.json', 'pokemon_78.json', 'pokemon_126.json', and 'pokemon_136.json'.

In [None]:
### FILL IN

In the assignments above, we have only used the 'BlobClient', which works with a single blob. There is also a client that focuses on one container: the 'ContainerClient'. In the documentation (https://pypi.org/project/azure-storage-blob/) you can find how to use it.

#### Assignment 4: Connecting with Azure Blob Storage 4

Set up a connection to the Azure Blob Storage.
And then list the Blobs within the 'pokemon' container.
Have a look at the documentation for how to use the '.list_blobs()' method.

In [None]:
### FILL IN

#### Assignment 5: Connecting with Azure Blob Storage 5

Set up a connection to the Azure Blob Storage.
Then retrieve and store the information on all the blobs within the 'pokemon' container.
Store each blob in a list.
Use a for loop to make repeated connections with the Azure Blob Storage.
You could make use of the '.list_blobs()' method from the previous assignment to get the Blob names.

In [None]:
### FILL IN

Sometimes you want to retrieve the data from a Blob Storage and store it on your local device. Let's have a look at downloading a file. It's not much different from what we have done before. See the example below.

```python
import json
from azure.storage.blob import BlobClient

CONNECTION_STRING = "INSERT_CONNECTION_STRING_HERE"

blob_client = BlobClient.from_connection_string(conn_str=CONNECTION_STRING, 
                                                container_name="pokemon", 
                                                blob_name="pokemon_1.json")

with open("BlobResult_Pokemon1.json", "wb") as my_blob:
    blob_data = blob_client.download_blob()
    blob_data.readinto(my_blob)
```

The main difference from what we have done before is that, instead of loading the result into memory, we write it to a newly created file.

#### Assignment 6: Connecting with Azure Blob Storage 6

Set up a connection to the Azure Blob Storage.
Retrieve the result of 'pokemon_25.json', and save it as a json file.

In [None]:
### FILL IN

#### Assignment 7: Connecting with Azure Blob Storage 7

Set up a connection to the Azure Blob Storage.
Then retrieve information from multiple blobs, and save each of them as a json file. 
Use a for loop to make repeated connections with the Azure Blob Storage.
The blobs to retrieve are: 'pokemon_3.json', 'pokemon_6.json', 'pokemon_9.json', 'pokemon_144.json', 'pokemon_145.json', and 'pokemon_146.json'.

In [None]:
### FILL IN

### Section 3: Azure Blob Storage operations (60 min)

Great going! Up until now we have looked at an existing container, with an existing structure of blobs. In the Data Engineering world, you often have an automated pipeline that has to store your data automatically. Let's have a look at uploading blobs to a container with Python.

Again, we'll be using the azure-storage-blob library for these steps. See below for an example on uploading a Blob to a Container.

```python
import requests
import json
from azure.storage.blob import BlobClient

CONNECTION_STRING = "INSERT_CONNECTION_STRING_HERE"

# Set up the connection.
blob_client = BlobClient.from_connection_string(conn_str=CONNECTION_STRING, 
                                                container_name="pokemon", 
                                                blob_name="pokemon_1.json")

# Retrieve data on a Pokemon using an API.
url = "https://pokeapi.co/api/v2/pokemon/1"
result = requests.get(url).json()

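# Perform the upload. By default this raises a ResourceExistsError if a blob
# with this name already exists; pass overwrite=True to replace it.
blob_client.upload_blob(json.dumps(result))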
```

The key section in uploading a Blob is the last line of code in the example above. Using the '.upload_blob()' method, you can upload files to the Container. Let's give it a try. We'll focus first on adding one Blob to a container, and then we'll move into adding more files to the container.
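The upload example sends the output of `json.dumps`, while the earlier download example parses the downloaded data with `json.loads`. The sketch below shows, without contacting any storage, why those two steps fit together. The small dictionary is a stand-in for an API result, and the sketch assumes the text is stored as UTF-8 bytes.

```python
import json

# Stand-in for an API result; a real PokeAPI response has many more fields.
result = {"id": 1, "name": "bulbasaur"}

# What the upload step sends: a JSON string.
as_text = json.dumps(result)

# What a later download would hand back: bytes (assumed UTF-8 here).
as_bytes = as_text.encode("utf-8")

# json.loads accepts both str and bytes, so the original dictionary comes back.
restored = json.loads(as_bytes)
print(restored == result)
```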

#### Assignment 8: Azure Blob Storage operations 1

Below there is the code to send a request to the Pokemon API for information on Pokemon number 151.
We'll add information on this pokemon to the existing pokemon container.

Set up a connection to the Azure Blob Storage.
Upload the result from the API request to the Blob Storage.
Give the blob the following name: 'pokemon_151_< your-name >.json'. (To differentiate your result from the others).

In [None]:
### FILL IN

#### Assignment 9: Azure Blob Storage operations 2

Check whether you succeeded above by retrieving information on the Blob you just uploaded ('pokemon_151_< your-name >.json').
You could use either the '.list_blobs()' method, or a specific Blob request.

In [None]:
### FILL IN

For those unfamiliar with Pokemon, there are a lot more than just 151. Let's make sure that we have stored information on all of them. At the time of writing, there are a total of 905 different ones. 

But before we store information on all 905 Pokemon, let's create our own workspace, or container, in the Azure Blob Storage. We can create containers using the 'ContainerClient' we have seen before. Use the documentation (https://pypi.org/project/azure-storage-blob/) to see how you can create your own container.
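Container names have to follow Azure's naming rules: 3 to 63 characters, only lowercase letters, numbers, and hyphens, starting and ending with a letter or number, and no consecutive hyphens. The following sketch checks a name against those rules before you try to create the container. The helper `is_valid_container_name` is hypothetical and not part of the azure-storage-blob library.

```python
import re

# A letter or digit, followed by letters or digits that may each be preceded
# by a single hyphen. This rules out leading, trailing, and double hyphens.
_CONTAINER_NAME = re.compile(r"[a-z0-9](?:-?[a-z0-9])*")


def is_valid_container_name(name: str) -> bool:
    """Check a container name against the length and character rules."""
    return 3 <= len(name) <= 63 and _CONTAINER_NAME.fullmatch(name) is not None


print(is_valid_container_name("allpokemonbk"))  # lowercase letters only
print(is_valid_container_name("allpokemonBK"))  # contains uppercase letters
```

Checking the name locally gives a clearer message than the error returned by the service when a name is rejected.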

#### Assignment 10: Azure Blob Storage operations 3

Set up a connection to the Azure Blob Storage, and specifically with a 'ContainerClient'.
Give the container the following name: 'allpokemon< your initials >', with your initials in lowercase, since container names cannot contain uppercase letters. (To differentiate your result from the others).
Create your own Container.

In [None]:
### FILL IN

#### Assignment 11: Azure Blob Storage operations 4

Check whether you've successfully created your own container.
You can check this using a 'BlobServiceClient', and using the '.list_containers()' method. (In the 'BlobServiceClient', you only need a connection-string).

Print all containers using the '.list_containers()' method.

In [None]:
### FILL IN

Now let's move on to filling your own container with information on all the 905 pokemon.

#### Assignment 12: Azure Blob Storage operations 5

Write a for loop that retrieves a result on a Pokemon and sends it to your own container to store it as a blob.

For each pokemon, you'll need to complete the following steps:
Send a request to the Pokemon API, using the specific ID in the url. (numbers from 1 to 905)
Set up a connection to the Azure Blob Storage.
Upload the result from the API request to the Blob Storage.
Give the blob the following name: 'pokemon_< number >_< your-name >.json'.
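Before writing the full loop, it can help to prepare the API URL and blob name for each number. The sketch below only builds those pairs; the request and upload steps are left to you. The helper `plan_uploads` and its parameter names are hypothetical, and "Bas" stands in for your own name.

```python
POKEAPI_URL = "https://pokeapi.co/api/v2/pokemon/{number}"


def plan_uploads(first: int, last: int, your_name: str):
    """Yield (api_url, blob_name) pairs for Pokemon numbers first..last inclusive."""
    for number in range(first, last + 1):
        api_url = POKEAPI_URL.format(number=number)
        blob_name = f"pokemon_{number}_{your_name}.json"
        yield api_url, blob_name


plan = list(plan_uploads(1, 905, "Bas"))
print(len(plan))
print(plan[0])
print(plan[-1])
```

Note that `range` excludes its end value, which is why the sketch uses `last + 1` to include number 905.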

In [None]:
### FILL IN

#### Assignment 13: Azure Blob Storage operations 6

Check your result by printing a list of all Blobs within your own created Container.

In [None]:
### FILL IN

Sometimes you'll also have to be able to delete files, or blobs, from your Azure Blob Storage. This works much like the other methods we have used before. See the (short) example below. We'll use a 'ContainerClient' for this.

```python
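from azure.storage.blob import ContainerClient

# Fill in the name of the container that holds the blob.
blob_to_delete = "pokemon_151_Bas.json"
container_client = ContainerClient.from_connection_string(conn_str=CONNECTION_STRING,
                                                          container_name="INSERT_CONTAINER_NAME_HERE")

# Pass the blob name as the first argument.
container_client.delete_blob(blob_to_delete)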
```

By specifying the Blob name, you can delete the Blob.

#### Assignment 14: Azure Blob Storage operations 7

From your own created Container, delete the Pokemon Blob with number 905 (so delete the last pokemon Blob).
Use a 'ContainerClient', and delete it.

In [None]:
### FILL IN

#### Assignment 15: Azure Blob Storage operations 8

Check your result by specifically requesting the Blob you have just deleted.
If it all went well, you should get a 'ResourceNotFoundError' as a result (which is good in this case).

In [None]:
### FILL IN

#### Assignment 16: Azure Blob Storage operations 9

Delete all Pokemon blobs within your own created container.
You could use the '.list_blobs()' method to retrieve the names.
Use a for loop to delete every Blob.

In [None]:
### FILL IN

#### Assignment 17: Azure Blob Storage operations 10

Check whether your own created container is empty. You could use the '.list_blobs()' method for this.

In [None]:
### FILL IN

Now that the container you have created is empty, let's remove it from the Azure Blob Storage. The approach is similar to what we have done before. You can use a 'ContainerClient' and the '.delete_container()' method.

#### Assignment 18: Azure Blob Storage operations 11

Delete your own created Container.

In [None]:
### FILL IN

#### Assignment 19: Azure Blob Storage operations 12

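Check whether you succeeded at the previous assignment by trying to access the deleted container through a 'ContainerClient', for example by calling '.get_container_properties()' or by listing its blobs. (Creating the client on its own does not contact the storage yet.)
You should receive a 'ResourceNotFoundError', which is good!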

In [None]:
### FILL IN