# File Manipulation with Azure Blob Storage

We try a few file manipulations between a local computer and a blob storage account on Azure. This requires [azure-sdk-for-python](https://github.com/Azure/azure-sdk-for-python) and [pyensae](http://www.xavierdupre.fr/app/pyensae/helpsphinx/index.html). We first create a dummy file.

In [4]:
import pandas, random
mat = [ {"x":random.random(), "y":random.random()} for i in range(0,1000)]
df = pandas.DataFrame(mat)
df.to_csv("randomxy.txt", sep="\t", encoding="utf8")
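
If pandas is not at hand, a similar file can be produced with the standard library alone. The sketch below is only an approximation of the cell above: the function name `write_random_xy` and its `seed` parameter are hypothetical, and the layout (an index column with an empty header, followed by `x` and `y`) imitates what `DataFrame.to_csv` writes by default.

```python
import csv
import random

def write_random_xy(path, n=1000, seed=None):
    """Write n rows of random (x, y) pairs to a tab-separated file."""
    rng = random.Random(seed)
    with open(path, "w", encoding="utf8", newline="") as f:
        writer = csv.writer(f, delimiter="\t")
        # Empty first header cell, like the unnamed pandas index column.
        writer.writerow(["", "x", "y"])
        for i in range(n):
            writer.writerow([i, rng.random(), rng.random()])
    return path
```

Calling `write_random_xy("randomxy.txt")` would create a file with one header line and 1000 data lines.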

We need credentials. To avoid leaving them in clear text in the notebook, we enter them through an HTML form:

In [2]:
import pyquickhelper
params={"blob_storage":"", "password":""}
pyquickhelper.open_html_form(params=params,title="credentials",key_save="blobservice")

The form saves the values in the workspace variable ``blobservice``; we copy them into two variables:

In [1]:
blobstorage = blobservice["blob_storage"]
blobpassword = blobservice["password"]
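
Outside a notebook, where an HTML form is not available, the same goal (no credentials in the source) is often met with environment variables. This is a minimal sketch under that assumption and not part of pyensae or pyquickhelper: the function `load_blob_credentials` and the variable names `BLOB_STORAGE_ACCOUNT` and `BLOB_STORAGE_KEY` are hypothetical.

```python
import os

def load_blob_credentials(env=None):
    """Return (account, key) read from environment variables.

    The variable names are hypothetical, chosen for this example only.
    """
    env = os.environ if env is None else env
    account = env.get("BLOB_STORAGE_ACCOUNT", "")
    key = env.get("BLOB_STORAGE_KEY", "")
    if not account or not key:
        raise KeyError("set BLOB_STORAGE_ACCOUNT and BLOB_STORAGE_KEY first")
    return account, key
```

With both variables set, `blobstorage, blobpassword = load_blob_credentials()` would fill the same two variables as the cell above.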

We need pyensae >= 1.1:

In [1]:
import pyensae
pyensae.__version__

'1.1'
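
The version requirement can also be checked in code instead of by eye. Comparing version strings directly is unreliable (`"1.10" < "1.9"` as strings), so this sketch compares tuples of integers. The helper names `version_tuple` and `at_least` are hypothetical, and the sketch only handles plain dotted numeric versions such as `1.1` or `1.10.2`.

```python
def version_tuple(version):
    """Turn '1.1' or '1.10.2' into a tuple of integers for comparison."""
    return tuple(int(part) for part in version.split("."))

def at_least(version, minimum):
    """True if version is greater than or equal to minimum."""
    return version_tuple(version) >= version_tuple(minimum)
```

For example, `at_least(pyensae.__version__, "1.1")` would express the requirement stated above.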

We open a connection to the blob storage:

In [3]:
cl, bs = %open_blob
cl, bs

(<pyensae.remote.azure_connection.AzureClient at 0x9f6d250>,
 <azure.storage.blobservice.BlobService at 0x9f6d270>)

We extract the available containers:

In [4]:
l = %blob_containers
l

['clusterensaeazure1', 'hdblobstorage', 'testhadoopensae']

We get the content of one container:

In [4]:
df = %blob_ls clusterensaeazure1
df.head()

| | name | last_modified | content_type | content_length | blob_type |
|---|---|---|---|---|---|
| 0 | HdiSamples/SensorSampleData/building/building.csv | Thu, 13 Nov 2014 23:43:59 GMT | application/octet-stream | 544 | BlockBlob |
| 1 | HdiSamples/SensorSampleData/hvac/HVAC.csv | Thu, 13 Nov 2014 23:43:59 GMT | application/octet-stream | 240591 | BlockBlob |
| 2 | HdiSamples/StorageAnalytics/hive-serde-microso... | Thu, 13 Nov 2014 23:43:59 GMT | application/octet-stream | 9562 | BlockBlob |
| 3 | HdiSamples/StorageAnalytics/hive-serde-microso... | Thu, 13 Nov 2014 23:43:59 GMT | application/octet-stream | 10290 | BlockBlob |
| 4 | HdiSamples/StorageAnalytics/hive-serde-microso... | Thu, 13 Nov 2014 23:43:59 GMT | application/octet-stream | 10321 | BlockBlob |


We upload the file we created in the first cell:

In [4]:
%blob_up randomxy.txt clusterensaeazure1/testpyensae/randomxy.txt

'testpyensae/randomxy.txt'
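
The value returned by the upload is the blob name without the container. The remote paths in this notebook appear to follow the pattern `container/blob_name`, where the blob name may itself contain slashes. The sketch below illustrates that convention; `split_blob_path` is a hypothetical helper, and the way pyensae actually parses the path is an assumption here.

```python
def split_blob_path(path):
    """Split 'container/dir/file' into ('container', 'dir/file')."""
    # Everything before the first slash is taken as the container name.
    container, _, blob_name = path.strip("/").partition("/")
    return container, blob_name
```

Under that assumption, `split_blob_path("clusterensaeazure1/testpyensae/randomxy.txt")` gives `("clusterensaeazure1", "testpyensae/randomxy.txt")`.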

We check that the file is there:

In [5]:
%blob_ls clusterensaeazure1/testpyensae

| | name | last_modified | content_type | content_length | blob_type |
|---|---|---|---|---|---|
| 0 | testpyensae/randomxy.txt | Sat, 15 Nov 2014 12:17:21 GMT | application/octet-stream | 43486 | BlockBlob |


We try the extended listing, which returns more properties for each blob:

In [6]:
%blob_lsl clusterensaeazure1/testpyensae

Unnamed: 0,blob_type,content_encoding,content_language,content_length,content_md5,content_type,copy_completion_time,copy_id,copy_progress,copy_source,copy_status,copy_status_description,etag,last_modified,lease_duration,lease_state,lease_status,name,url,xms_blob_sequence_number
0,BlockBlob,,,43486,,application/octet-stream,,,,,,,0x8D1CEE53D92BF19,"Sat, 15 Nov 2014 12:17:21 GMT",,available,unlocked,testpyensae/randomxy.txt,https://hdblobstorage.blob.core.windows.net/cl...,0


If you need information not accessible through a magic command, you can use the variable ``bs`` (type [azure.storage.blobservice.BlobService](http://www.xavierdupre.fr/app/azure-sdk-for-python/helpsphinx/storage/blobservice.html#module-azure.storage.blobservice)):

In [12]:
blocks = bs.get_block_list("clusterensaeazure1", "testpyensae/randomxy.txt")
for block in blocks.committed_blocks:
    print("size=", block.size, "id=", block.id)

size= 43486 id= 00000000


We download the file back to the local computer:

In [13]:
%blob_down clusterensaeazure1/testpyensae/randomxy.txt randomxx_copy.txt

'randomxx_copy.txt'

In [4]:
%lsr r.*[.]txt

| | directory | last_modified | name | size |
|---|---|---|---|---|
| 0 | False | 2014-11-15 14:16:32.416793 | .\randomxx_copy.txt | 42.47 Kb |
| 1 | False | 2014-11-15 13:03:51.826002 | .\randomxy.txt | 42.47 Kb |
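
Equal sizes suggest the round trip worked, but they do not prove that the content is identical. A stricter check compares a hash of both files. This is a minimal sketch using only the standard library; the helper names `file_digest` and `same_content` are hypothetical.

```python
import hashlib

def file_digest(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def same_content(path_a, path_b):
    """True if both files contain exactly the same bytes."""
    return file_digest(path_a) == file_digest(path_b)
```

Here, `same_content("randomxy.txt", "randomxx_copy.txt")` would be expected to return `True` if the upload and download preserved the file byte for byte.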


PIG scripts usually produce more than one output file, and it is convenient to merge them while downloading. To test that, we upload our file a second time under a different name:

In [5]:
%blob_up randomxy.txt clusterensaeazure1/testpyensae/randomxy2.txt

'testpyensae/randomxy2.txt'

In [9]:
%blob_ls clusterensaeazure1/testpyensae

| | name | last_modified | content_type | content_length | blob_type |
|---|---|---|---|---|---|
| 0 | testpyensae/randomxy.txt | Sat, 15 Nov 2014 13:41:55 GMT | application/octet-stream | 43486 | BlockBlob |
| 1 | testpyensae/randomxy2.txt | Sat, 15 Nov 2014 13:43:32 GMT | application/octet-stream | 43486 | BlockBlob |


And we merge them:

In [5]:
%blob_downmerge clusterensaeazure1/testpyensae randomall.txt

'randomall.txt'

We check that ``randomall.txt`` is twice the size of the original file:

In [3]:
%lsr r.*[.]txt

| | directory | last_modified | name | size |
|---|---|---|---|---|
| 0 | False | 2014-11-15 14:48:51.154361 | .\randomall.txt | 84.93 Kb |
| 1 | False | 2014-11-15 14:16:32.416793 | .\randomxx_copy.txt | 42.47 Kb |
| 2 | False | 2014-11-15 13:03:51.826002 | .\randomxy.txt | 42.47 Kb |
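
The doubled size is what a plain concatenation of the two blobs would give: 2 × 43486 = 86972 bytes, or 84.93 Kb. The sketch below reproduces that behaviour locally. It is an illustration only: that `%blob_downmerge` concatenates the files byte for byte is an assumption, and `merge_files` and `size_in_kb` are hypothetical helpers.

```python
import os

def merge_files(sources, destination):
    """Concatenate the source files, in order, and return the merged size in bytes."""
    with open(destination, "wb") as out:
        for source in sources:
            with open(source, "rb") as part:
                out.write(part.read())
    return os.path.getsize(destination)

def size_in_kb(n_bytes):
    """Format a byte count the way the listing above displays it."""
    return "%.2f Kb" % (n_bytes / 1024)

print(size_in_kb(43486))      # 42.47 Kb
print(size_in_kb(2 * 43486))  # 84.93 Kb
```

If the merge is indeed a plain concatenation, the header line of the CSV file appears twice in the merged file, which matters when reading it back.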


We finally remove the files from the blob storage:

In [7]:
%blob_delete clusterensaeazure1/testpyensae/randomxy.txt
%blob_delete clusterensaeazure1/testpyensae/randomxy2.txt

True

We check that both files are gone:

In [8]:
%blob_ls clusterensaeazure1/testpyensae/

And we close the connection:

In [9]:
%blob_close

True

**END**