# Exercise 02 : Prepare Data

Here we learn about ```Datastore``` and ```Dataset``` in Azure Machine Learning.<br>
All subsequent exercises (Exercise 04 onward) use the data provisioned in this exercise, so run this exercise beforehand.

In this tutorial, we use the handwritten digit dataset ([MNIST](http://yann.lecun.com/exdb/mnist/)) for training.

*back to [index](https://github.com/tsmatz/azureml-tutorial/)*

## Get config setting

Read your config settings. (See and run "[Exercise01 : Prepare Config Settings](./exercise01_prepare_config.ipynb)" beforehand.)

In [1]:
from azureml.core import Workspace

# Load the workspace described by the saved config settings (see Exercise 01)
ws = Workspace.from_config()
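
`Workspace.from_config()` reads a `config.json` file that identifies the workspace. As a pre-flight check, the sketch below looks for that file and confirms it contains the three keys the SDK expects (`subscription_id`, `resource_group`, `workspace_name`). The helper name `find_workspace_config` is hypothetical, and the search order used here (the current folder or its `.azureml` subfolder, then each parent folder) is a simplified approximation of the SDK's own lookup.

```python
import json
from pathlib import Path

REQUIRED_KEYS = {"subscription_id", "resource_group", "workspace_name"}


def find_workspace_config(start="."):
    """Return (path, missing_keys) for the nearest config.json, or (None, None)."""
    current = Path(start).resolve()
    # Walk from the starting folder up to the filesystem root
    for folder in [current, *current.parents]:
        for candidate in (folder / "config.json", folder / ".azureml" / "config.json"):
            if candidate.is_file():
                settings = json.loads(candidate.read_text())
                return candidate, sorted(REQUIRED_KEYS - set(settings))
    return None, None


path, missing = find_workspace_config()
if path is None:
    print("No config.json found; run Exercise 01 first.")
elif missing:
    print(f"{path} is missing keys: {missing}")
else:
    print(f"Using workspace settings from {path}")
```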

## Download data in local folder

Download the MNIST (handwritten digits) dataset into the ```./data``` folder.<br>
The generated training data has 60,000 records and the test data has 10,000 records.

In [2]:
import tensorflow as tf

# Download MNIST as NumPy arrays of images and labels
mnist = tf.keras.datasets.mnist
(train_images, train_labels), (test_images, test_labels) = mnist.load_data()

# Wrap the arrays as tf.data datasets of (image, label) pairs
train_dataset = tf.data.Dataset.from_tensor_slices((train_images, train_labels))
test_dataset = tf.data.Dataset.from_tensor_slices((test_images, test_labels))

# Save each dataset to its own folder (Dataset.save requires TensorFlow 2.10 or later)
train_dataset.save('./data/train')
test_dataset.save('./data/test')

2022-10-05 04:20:18.196181: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-10-05 04:20:18.357260: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2022-10-05 04:20:18.357293: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
2022-10-05 04:20:18.390218: E tensorflow/stream_executor/cuda/cuda_blas.cc:2981] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2022-10-05 04:20:19.232189: W tensorflow/stream_executor/pla

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz


2022-10-05 04:20:20.329581: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
2022-10-05 04:20:20.329622: W tensorflow/stream_executor/cuda/cuda_driver.cc:263] failed call to cuInit: UNKNOWN ERROR (303)
2022-10-05 04:20:20.329658: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (client1005): /proc/driver/nvidia/version does not exist
2022-10-05 04:20:20.330044: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
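
To confirm that the saved folders can be read back, a minimal sketch follows. It assumes TensorFlow 2.10 or later, where `tf.data.Dataset.save` and `tf.data.Dataset.load` are available. The helper name `count_records` is hypothetical, and a tiny synthetic dataset stands in for MNIST so that the snippet runs without a download.

```python
import tempfile

import tensorflow as tf


def count_records(path):
    """Load a dataset written by Dataset.save and count its elements."""
    dataset = tf.data.Dataset.load(path)
    return sum(1 for _ in dataset)


# Tiny stand-in for MNIST: three 2x2 "images" with labels 0, 1, 2
images = tf.zeros([3, 2, 2], dtype=tf.uint8)
labels = tf.constant([0, 1, 2], dtype=tf.uint8)

with tempfile.TemporaryDirectory() as tmp_dir:
    tf.data.Dataset.from_tensor_slices((images, labels)).save(tmp_dir)
    record_count = count_records(tmp_dir)

print(record_count)
```

After the download cell has run, `count_records('./data/train')` and `count_records('./data/test')` should match the record counts stated above (60,000 and 10,000).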


## Use default datastore

An Azure Machine Learning (AML) workspace has its own default datastore. When you create an AML workspace, a storage account (which backs the default datastore) is automatically created in the same resource group.

In [3]:
# Get workspace default datastore
ds = ws.get_default_datastore()
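
To see which storage account and container back a datastore, you can read a few of its attributes. The sketch below is illustrative only: `describe_datastore` is a hypothetical helper, and a stand-in object with placeholder values is used so that the snippet runs without a workspace. In the notebook you would pass `ds` instead.

```python
from types import SimpleNamespace


def describe_datastore(datastore):
    """Collect a few identifying attributes of a datastore object."""
    fields = ("name", "datastore_type", "account_name", "container_name")
    # getattr with a default keeps this safe for datastore types lacking a field
    return {field: getattr(datastore, field, None) for field in fields}


# Stand-in with placeholder values; replace with `ds` in the notebook
example = SimpleNamespace(
    name="workspaceblobstore",
    datastore_type="AzureBlob",
    account_name="<storage account>",
    container_name="<container>",
)
print(describe_datastore(example))
```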

## Create and register data as Dataset

Now we create an AML dataset and register it in the workspace.

The data is uploaded into a container in the storage account, so that it can be shared within the AML workspace.<br>
Registering a dataset is not mandatory, but by registering data as an AML dataset you can track versions of the data together with models or experiments. (You can see the registered dataset in the AML studio UI.)

In this exercise, I register all files in the ```data``` folder, but you can also register only a subset of the files (for example, files with a specific extension) as a dataset.

In [4]:
from azureml.core import Dataset
from azureml.data.datapath import DataPath

# Upload local "data" folder (incl. files) as "tfdata" folder
mnist_dataset = Dataset.File.upload_directory(
    src_dir='./data',
    target=DataPath(ds, 'tfdata'),
    show_progress=True
)

# Register dataset
mnist_dataset = mnist_dataset.register(
    workspace=ws,
    name='mnist_dataset',
    description='MNIST training and test dataset',
    create_new_version=True)

Validating arguments.
Arguments validated.
Uploading file to tfdata
Uploading an estimated of 6 files
Uploading ./data/test/dataset_spec.pb
Uploaded ./data/test/dataset_spec.pb, 1 files out of an estimated total of 6
Uploading ./data/test/snapshot.metadata
Uploaded ./data/test/snapshot.metadata, 2 files out of an estimated total of 6
Uploading ./data/train/dataset_spec.pb
Uploaded ./data/train/dataset_spec.pb, 3 files out of an estimated total of 6
Uploading ./data/train/snapshot.metadata
Uploaded ./data/train/snapshot.metadata, 4 files out of an estimated total of 6
Uploading ./data/test/7255580156754328102/00000000.shard/00000000.snapshot
Uploaded ./data/test/7255580156754328102/00000000.shard/00000000.snapshot, 5 files out of an estimated total of 6
Uploading ./data/train/4093752533441703203/00000000.shard/00000000.snapshot
Uploaded ./data/train/4093752533441703203/00000000.shard/00000000.snapshot, 6 files out of an estimated total of 6
Uploaded 6 files
Creating new dataset
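
As noted earlier, a dataset can cover only part of the uploaded files. One way this might look is sketched below: it builds a file dataset from just the `tfdata/train` folder already uploaded to the datastore and registers it under a separate name. The function name `register_train_only` and the dataset name `mnist_train_only` are hypothetical. The SDK import sits inside the function so that the definition can be loaded without the Azure ML SDK installed.

```python
def register_train_only(workspace, datastore, name="mnist_train_only"):
    """Register only the uploaded training folder as its own dataset."""
    # Imported lazily so that defining this helper does not require the SDK
    from azureml.core import Dataset

    # Reference a subfolder of the datastore instead of the whole upload
    train_only = Dataset.File.from_files(path=[(datastore, "tfdata/train")])

    return train_only.register(
        workspace=workspace,
        name=name,
        description="MNIST training data only",
        create_new_version=True,
    )
```

In the notebook, this would be called as `register_train_only(ws, ds)`.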


## [Optional] Use datastore with your own provisioned storage

(This section is not required for the following exercises, so you can skip it.)

Here we learn how to use your own blob storage as an AML datastore.

Before running, **please create an Azure storage account and container as follows**.

1. Create your storage account in the [Azure Portal](https://portal.azure.com/).
2. Create a container in the storage account.
3. Copy the storage account name, access key, and container name.
4. Set these values in the following cell.

In [4]:
from azureml.core import Datastore

# Register your own storage as AML datastore
ds2 = Datastore.register_azure_blob_container(
    ws,
    datastore_name='myblob01',
    account_name='{STORAGE ACCOUNT NAME}',
    account_key='{ACCESS KEY}',
    container_name='{CONTAINER NAME}',
    overwrite=True)
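
Hard-coding an access key in a notebook makes it easy to leak. As one possible alternative, the sketch below reads the three values from environment variables and returns them as keyword arguments for `Datastore.register_azure_blob_container`. The variable names (`AML_STORAGE_ACCOUNT`, `AML_STORAGE_KEY`, `AML_STORAGE_CONTAINER`) and the helper `blob_settings_from_env` are assumptions made for illustration, not SDK conventions.

```python
import os


def blob_settings_from_env(environ=os.environ):
    """Build blob-container keyword arguments from environment variables."""
    # Maps each SDK argument name to the (hypothetical) variable that supplies it
    names = {
        "account_name": "AML_STORAGE_ACCOUNT",
        "account_key": "AML_STORAGE_KEY",
        "container_name": "AML_STORAGE_CONTAINER",
    }
    missing = [var for var in names.values() if not environ.get(var)]
    if missing:
        raise KeyError(f"Missing environment variables: {', '.join(missing)}")
    return {arg: environ[var] for arg, var in names.items()}
```

With the variables set, the registration call could become `Datastore.register_azure_blob_container(ws, datastore_name='myblob01', overwrite=True, **blob_settings_from_env())`.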

Once you have registered your own datastore, you can use it with the same API as before.<br>
In this example, I upload local data. (You can see the uploaded data in your storage account.)

In [5]:
# Get your own datastore
ds2 = Datastore.get(ws, datastore_name='myblob01')

# Upload local "data" folder (incl. files) as "tfdata" folder
mnist_dataset2 = Dataset.File.upload_directory(
    src_dir='./data',
    target=DataPath(ds2, 'tfdata'),
    show_progress=True
)

Validating arguments.
Arguments validated.
Uploading file to tfdata
Uploading an estimated of 2 files
Uploading ./data/test.tfrecords
Uploaded ./data/test.tfrecords, 1 files out of an estimated total of 2
Uploading ./data/train.tfrecords
Uploaded ./data/train.tfrecords, 2 files out of an estimated total of 2
Uploaded 2 files
Creating new dataset
