# Exercise 02 : Prepare Data

Here we upload data as a data asset in Azure Machine Learning.<br>
All subsequent exercises (Exercise 04 onward) use the data provisioned here, so run this exercise before them.

In this tutorial we train on the handwritten digits dataset ([MNIST](http://yann.lecun.com/exdb/mnist/)).

*back to [index](https://github.com/tsmatz/azureml-tutorial/)*

## Initialize MLClient

Before running the code, you need to connect to your Azure ML workspace.<br>
Replace the bracketed strings below with your subscription ID, resource group name, and AML workspace name.

Note that creating ```MLClient``` does not connect to the AML workspace. Client initialization is lazy, so the connection is made only when the first request is sent.

In [1]:
from azure.ai.ml import MLClient
from azure.identity import DeviceCodeCredential

# When you run on a remote client (outside Azure ML)
cred = DeviceCodeCredential()

# # When you run on Azure ML Notebook
# from azure.identity import DefaultAzureCredential
# cred = DefaultAzureCredential()

# Get a handle to the workspace
ml_client = MLClient(
    credential=cred,
    subscription_id="{SUBSCRIPTION ID}",
    resource_group_name="{RESOURCE GROUP NAME}",
    workspace_name="{AML WORKSPACE NAME}",
)
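
Because the client is lazy, a mistyped subscription ID or workspace name is not reported when ```MLClient``` is created. The following is a minimal sketch of forcing the first request right away. It assumes that the ```workspaces.get()``` call and the ```workspace_name``` property behave as in azure-ai-ml v2; ```verify_connection``` is a hypothetical helper name, not part of the SDK.

```python
def verify_connection(ml_client):
    """Send one request to the workspace so that configuration errors surface early.

    Assumption: ml_client is an azure.ai.ml.MLClient (or any object exposing
    the same `workspaces.get(name)` call and `workspace_name` property).
    """
    # This is the first real call to the service. With DeviceCodeCredential,
    # the sign-in prompt is expected to appear at this point.
    workspace = ml_client.workspaces.get(ml_client.workspace_name)
    return workspace.name
```

Calling ```verify_connection(ml_client)``` after the cell above should return the workspace name, or raise an error if one of the IDs is wrong.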

## Download data in local folder

Download the MNIST (handwritten digits) dataset and save it in the ```./data``` folder.<br>
The training data has 60,000 records and the test data has 10,000 records.

In [2]:
import tensorflow as tf
from tensorflow import keras

mnist = tf.keras.datasets.mnist
(train_images, train_labels),(test_images, test_labels) = mnist.load_data()

train_dataset = tf.data.Dataset.from_tensor_slices((train_images, train_labels))
test_dataset = tf.data.Dataset.from_tensor_slices((test_images, test_labels))

train_dataset.save("./data/train")
test_dataset.save("./data/test")


## Upload local files to AML default datastore

An Azure Machine Learning (AML) workspace has its own default datastore. When you create an AML workspace, a storage account (used as the default datastore) is automatically created in the same resource group.<br>
Now we upload the files in the local ```data``` folder to the Azure ML workspace.
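
Before uploading, it can help to confirm that the local folder holds what you expect. The sketch below uses only the Python standard library; ```summarize_folder``` is a hypothetical helper, not part of the Azure ML SDK.

```python
from pathlib import Path


def summarize_folder(folder):
    """Return (file_count, total_bytes) for all files under `folder`, recursively."""
    files = [p for p in Path(folder).rglob("*") if p.is_file()]
    total_bytes = sum(p.stat().st_size for p in files)
    return len(files), total_bytes


if __name__ == "__main__":
    data_dir = Path("data")  # the local folder written by the download cell
    if data_dir.is_dir():
        count, size = summarize_folder(data_dir)
        print(f"{count} files, {size / 1e6:.2f} MB")
    else:
        print("data folder not found; run the download cell first")
```

The reported total should be roughly the size shown in the upload progress output. The cell below then registers the folder as a data asset.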

In [3]:
from azure.ai.ml.entities import Data
from azure.ai.ml.constants import AssetTypes

mnist_data = Data(
    name="mnist_data",
    path="data",
    type=AssetTypes.URI_FOLDER,
)
registered_data = ml_client.data.create_or_update(mnist_data)

To sign in, use a web browser to open the page https://microsoft.com/devicelogin and enter the code ARR7SANTZ to authenticate.


Uploading data (58.66 MBs): 100%|███████████████████████████████| 58660145/58660145 [00:00<00:00, 64947296.86it/s]

## Show registered data

You can retrieve the registered data with the ```get()``` or ```list()``` methods.

In [4]:
data = ml_client.data.get("mnist_data", version=1)
data

Data({'skip_validation': False, 'mltable_schema_url': None, 'referenced_uris': None, 'type': 'uri_folder', 'is_anonymous': False, 'auto_increment_version': False, 'name': 'mnist_data', 'description': None, 'tags': {}, 'properties': {}, 'id': '/subscriptions/b3ae1c15-4fef-4362-8c3a-5d804cdeb18d/resourceGroups/AML-rg/providers/Microsoft.MachineLearningServices/workspaces/ws01/data/mnist_data/versions/1', 'Resource__source_path': None, 'base_path': '/home/tsmatsuz/python_sdk2', 'creation_context': <azure.ai.ml.entities._system_data.SystemData object at 0x7f48ea1d2b20>, 'serialize': <msrest.serialization.Serializer object at 0x7f48e8104ac0>, 'version': '1', 'latest_version': None, 'path': 'azureml://subscriptions/b3ae1c15-4fef-4362-8c3a-5d804cdeb18d/resourcegroups/AML-rg/workspaces/ws01/datastores/workspaceblobstore/paths/LocalUpload/b6a4ea84008bfe761be7adab68271677/data/', 'datastore': None})
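
The cell above uses ```get()``` with an explicit version. As a rough sketch of the ```list()``` side, the helpers below collect asset names and versions. They assume that ```ml_client.data.list()``` accepts an optional ```name``` argument and yields objects with ```name``` and ```version``` attributes, as in azure-ai-ml v2; both function names are hypothetical.

```python
def list_data_versions(ml_client, name):
    """Return the version strings registered for the data asset `name`.

    Assumption: `ml_client.data.list(name=...)` yields one entry per version
    of that asset, each with a `version` attribute.
    """
    return [asset.version for asset in ml_client.data.list(name=name)]


def list_data_names(ml_client):
    """Return the names of the data assets in the workspace.

    Assumption: `ml_client.data.list()` without a name yields one entry per
    data asset, each with a `name` attribute.
    """
    return [asset.name for asset in ml_client.data.list()]
```

For example, ```list_data_versions(ml_client, "mnist_data")``` can be used to find which version number to pass to ```get()```.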