# Exercise 02 : Prepare Data

This exercise introduces the ```Datastore``` and ```Dataset``` concepts in Azure Machine Learning.<br>
All subsequent exercises (Exercise 04 and later) use the data provisioned here, so run this exercise before them.

*back to [index](https://github.com/tsmatz/azureml-tutorial-tensorflow-v1/)*

## Get config setting

Read your config settings. (See and run "[Exercise01 : Prepare Config Settings](./exercise01_prepare_config.ipynb)" beforehand.)

In [1]:
from azureml.core import Workspace
import azureml.core

ws = Workspace.from_config()

## Use default datastore

An Azure Machine Learning (AML) workspace has its own default datastore. When you create an AML workspace, a storage account (used as the default datastore) is automatically created in the same resource group.

In [2]:
# Get workspace default datastore
ds = ws.get_default_datastore()
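
To confirm which storage the default datastore points to, its properties can be printed. The following is a minimal sketch, assuming the default datastore is blob-backed (so that `account_name` and `container_name` exist). `describe_datastore` and `show_default_datastore` are hypothetical helpers for this tutorial, not part of the SDK.

```python
def describe_datastore(datastore):
    """Return the main properties of a datastore as a dict.

    Hypothetical helper. Attributes that a given datastore type
    does not expose are reported as None.
    """
    fields = ("name", "datastore_type", "account_name", "container_name")
    return {field: getattr(datastore, field, None) for field in fields}


def show_default_datastore():
    """Print the properties of the workspace default datastore."""
    # Imported lazily so describe_datastore works without the SDK installed
    from azureml.core import Workspace

    ws = Workspace.from_config()
    for key, value in describe_datastore(ws.get_default_datastore()).items():
        print(f"{key}: {value}")
```

In a notebook where `ds` is already defined, `describe_datastore(ds)` returns the same information without reloading the workspace.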

## Create and register data as Dataset

Now we create an AML dataset and register it in the workspace.

The data is uploaded to a container in the storage account, so it can be shared across the AML workspace.<br>
Registering a dataset is not mandatory, but a registered dataset lets you track which version of the data was used with each model or experiment. (You can see the registered dataset in the AML studio UI.)

In this exercise, I register all files in a specific folder, but you can also register a subset of the files (for example, only files with a specific extension) as a dataset.
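
For the subset case, one option is `Dataset.File.from_files` with a glob pattern. The sketch below assumes the files already exist under `tfdata` on the datastore (the upload cell that follows takes care of this) and that wildcard patterns are accepted in the relative path. `tfrecords_pattern` and `create_subset_dataset` are hypothetical helper names.

```python
def tfrecords_pattern(folder, extension="tfrecords"):
    """Build a glob pattern such as 'tfdata/*.tfrecords' (hypothetical helper)."""
    return f"{folder.strip('/')}/*.{extension.lstrip('.')}"


def create_subset_dataset(datastore, folder, extension="tfrecords"):
    """Create an unregistered FileDataset from files with one extension."""
    # Imported lazily so tfrecords_pattern works without the SDK installed
    from azureml.core import Dataset

    # The path is given as a (datastore, relative_path) tuple
    pattern = tfrecords_pattern(folder, extension)
    return Dataset.File.from_files(path=(datastore, pattern))
```

The returned dataset can then be registered with `register(...)` in the same way as the full-folder dataset.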

In [3]:
from azureml.core import Dataset
from azureml.data.datapath import DataPath

# Upload local "data" folder (incl. files) as "tfdata" folder
mnist_dataset = Dataset.File.upload_directory(
    src_dir='./data',
    target=DataPath(ds, 'tfdata'),
    show_progress=True
)

# Register dataset
mnist_dataset = mnist_dataset.register(
    workspace=ws,
    name='mnist_tfrecords_dataset',
    description='training and test dataset',
    create_new_version=True)

Validating arguments.
Arguments validated.
Uploading file to tfdata
Uploading an estimated of 2 files
Uploading ./data/test.tfrecords
Uploaded ./data/test.tfrecords, 1 files out of an estimated total of 2
Uploading ./data/train.tfrecords
Uploaded ./data/train.tfrecords, 2 files out of an estimated total of 2
Uploaded 2 files
Creating new dataset
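
Later exercises can fetch the registered dataset by name instead of uploading the files again. This is a minimal sketch, assuming the dataset name used above; `load_registered_dataset` and `file_names` are hypothetical helpers, and the exact form of the paths returned by `to_path()` may differ from what the helper expects.

```python
def load_registered_dataset(workspace, name="mnist_tfrecords_dataset", version="latest"):
    """Fetch a registered dataset by name (and optionally by version)."""
    # Imported lazily so file_names works without the SDK installed
    from azureml.core import Dataset

    return Dataset.get_by_name(workspace, name=name, version=version)


def file_names(paths):
    """Strip leading slashes from dataset-relative paths and sort them."""
    return sorted(path.lstrip("/") for path in paths)
```

For example, `dataset = load_registered_dataset(ws)` followed by `file_names(dataset.to_path())` lists the files in the dataset, and `dataset.version` shows which version was loaded.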


## [Optional] Use datastore with your own provisioned storage

(This section is optional. The following exercises do not depend on it, so you can skip it.)

Here we learn how to use your own blob storage as an AML datastore.

Before running, **please create an Azure storage account and a container as follows**.

1. Create a storage account in [Azure Portal](https://portal.azure.com/).
2. Create a container in the storage account.
3. Copy the storage account name, access key, and container name.
4. Set these values in the following cell.
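
The cell below embeds the access key directly in the notebook. As an alternative, the three values could be read from environment variables so that the key is not saved with the notebook. This is only a sketch: the variable names (`AML_STORAGE_ACCOUNT`, `AML_STORAGE_KEY`, `AML_STORAGE_CONTAINER`) and the helper `load_storage_settings` are assumptions made for illustration.

```python
import os


def load_storage_settings(env=None):
    """Read storage settings from environment variables.

    The variable names are hypothetical; any names can be used.
    Returns a dict whose keys match the keyword arguments of
    Datastore.register_azure_blob_container.
    """
    env = os.environ if env is None else env
    names = {
        "account_name": "AML_STORAGE_ACCOUNT",
        "account_key": "AML_STORAGE_KEY",
        "container_name": "AML_STORAGE_CONTAINER",
    }
    # Fail early with a clear message if any value is missing or empty
    missing = [var for var in names.values() if not env.get(var)]
    if missing:
        raise KeyError(f"Missing environment variables: {', '.join(missing)}")
    return {arg: env[var] for arg, var in names.items()}
```

The result can be passed along as keyword arguments, for example `Datastore.register_azure_blob_container(ws, datastore_name='myblob01', overwrite=True, **load_storage_settings())`.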

In [4]:
from azureml.core import Datastore

# Register your own storage as AML datastore
ds2 = Datastore.register_azure_blob_container(
    ws,
    datastore_name='myblob01',
    account_name='{STORAGE ACCOUNT NAME}',
    account_key='{ACCESS KEY}',
    container_name='{CONTAINER NAME}',
    overwrite=True)

Once you have registered your own datastore, you can use it with the same API as the default datastore.<br>
In this example, I upload local data. (You can see the uploaded data in your storage account.)

In [5]:
# Get your own datastore
ds2 = Datastore.get(ws, datastore_name='myblob01')

# Upload local "data" folder (incl. files) as "tfdata" folder
mnist_dataset2 = Dataset.File.upload_directory(
    src_dir='./data',
    target=DataPath(ds2, 'tfdata'),
    show_progress=True
)

Validating arguments.
Arguments validated.
Uploading file to tfdata
Uploading an estimated of 2 files
Uploading ./data/test.tfrecords
Uploaded ./data/test.tfrecords, 1 files out of an estimated total of 2
Uploading ./data/train.tfrecords
Uploaded ./data/train.tfrecords, 2 files out of an estimated total of 2
Uploaded 2 files
Creating new dataset
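
Besides checking the storage account in the portal, the registration can be confirmed from code. The sketch below lists the datastore names known to the workspace; `registered_datastore_names` is a hypothetical helper that relies on the `datastores` property of the workspace being a dict keyed by name.

```python
def registered_datastore_names(workspace):
    """Return the names of all datastores registered in the workspace, sorted."""
    # workspace.datastores maps each datastore name to its datastore object
    return sorted(workspace.datastores.keys())
```

With the objects from this notebook, `'myblob01' in registered_datastore_names(ws)` should hold after registration, and `mnist_dataset2.to_path()` lists the uploaded files.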
