# Exercise 02 : Prepare Data

Here we upload data as a data asset in Azure Machine Learning.<br>
All subsequent exercises (Exercise 04 onward) use the data provisioned in this exercise, so run this exercise first.

This tutorial trains on the handwritten digit dataset ([MNIST](http://yann.lecun.com/exdb/mnist/)), stored as **train.tfrecords** and **test.tfrecords**.

*back to [index](https://github.com/tsmatz/azureml-tutorial/)*

## Initialize MLClient

Before running the code, you need to connect to your Azure ML workspace.<br>
Replace the strings in braces below with your subscription ID, resource group name, and AML workspace name.

Note that creating ```MLClient``` does not connect to the AML workspace. Client initialization is lazy, and the connection is made on the first call that needs it.

In [1]:
from azure.ai.ml import MLClient
from azure.identity import DeviceCodeCredential

# When you run on a remote machine (outside Azure ML)
cred = DeviceCodeCredential()

# # When you run on Azure ML Notebook
# from azure.identity import DefaultAzureCredential
# cred = DefaultAzureCredential()

# Get a handle to the workspace
ml_client = MLClient(
    credential=cred,
    subscription_id="{SUBSCRIPTION ID}",
    resource_group_name="{RESOURCE GROUP NAME}",
    workspace_name="{AML WORKSPACE NAME}",
)

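Because the client is lazy, a wrong subscription ID or workspace name only surfaces on the first real request. The sketch below forces that request early. It assumes the `workspaces.get()` operation and the `workspace_name` property of the `azure-ai-ml` `MLClient`; `check_workspace` is a hypothetical helper name, not part of the SDK.

```python
def check_workspace(ml_client):
    """Force one request so credential or naming errors appear immediately.

    `ml_client` is assumed to be an azure.ai.ml.MLClient; any object that
    exposes `workspace_name` and `workspaces.get()` behaves the same here.
    """
    # This call triggers the first round trip to the workspace.
    ws = ml_client.workspaces.get(ml_client.workspace_name)
    return ws.name, ws.location
```

Calling `check_workspace(ml_client)` right after creating the client should raise early if the placeholders were not replaced correctly. With `DeviceCodeCredential`, it is also likely the point where the device login prompt appears.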


## Download data to local folder

Download the MNIST (handwritten digits) dataset into the ```./data``` folder.<br>
The generated ```train.tfrecords``` has 60,000 records and ```test.tfrecords``` has 10,000 records.

In [None]:
import os
import tensorflow as tf
# Note: tensorflow.contrib and tf.python_io (used below) exist only in TensorFlow 1.x.
from tensorflow.contrib.learn.python.learn.datasets import mnist

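def convert_to(data_set, name):
    """Write one MNIST split to data/<name>.tfrecords."""
    images = data_set.images
    labels = data_set.labels
    num_examples = data_set.num_examples

    # images has shape (num_examples, rows, cols, depth) because reshape=False is used below
    rows = images.shape[1]
    cols = images.shape[2]
    depth = images.shape[3]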

    filename = os.path.join("data", name + '.tfrecords')
    print('Writing', filename)
    with tf.python_io.TFRecordWriter(filename) as writer:
        for index in range(num_examples):
            image_raw = images[index].tobytes()
            example = tf.train.Example(
                features=tf.train.Features(
                    feature={
                        'height': tf.train.Feature(int64_list=tf.train.Int64List(value=[rows])),
                        'width': tf.train.Feature(int64_list=tf.train.Int64List(value=[cols])),
                        'depth': tf.train.Feature(int64_list=tf.train.Int64List(value=[depth])),
                        'label': tf.train.Feature(int64_list=tf.train.Int64List(value=[int(labels[index])])),
                        'image_raw': tf.train.Feature(bytes_list=tf.train.BytesList(value=[image_raw]))
                    }))
            writer.write(example.SerializeToString())

data_sets = mnist.read_data_sets(
    "tmp-data",
    dtype=tf.uint8,
    reshape=False,
    validation_size=0)
os.makedirs("./data", exist_ok=True)
convert_to(data_sets.train, 'train')
convert_to(data_sets.test, 'test')
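To confirm that the generated files hold the expected number of records, the files can be inspected without TensorFlow. The sketch below relies on the TFRecord framing (an 8-byte little-endian length, a 4-byte CRC of the length, the payload, and a 4-byte CRC of the payload). It only walks the framing and does not verify the CRCs. `count_tfrecords` is a hypothetical helper name.

```python
import os
import struct

def count_tfrecords(path):
    """Count records in a TFRecord file by walking its length-prefixed framing."""
    size = os.path.getsize(path)
    count = 0
    with open(path, "rb") as f:
        while f.tell() < size:
            header = f.read(8)
            if len(header) < 8:
                raise ValueError("truncated record header")
            (length,) = struct.unpack("<Q", header)
            # Skip the length CRC (4 bytes), the payload, and the payload CRC (4 bytes).
            f.seek(4 + length + 4, os.SEEK_CUR)
            if f.tell() > size:
                raise ValueError("truncated record payload")
            count += 1
    return count
```

If the conversion above completed, `count_tfrecords("data/train.tfrecords")` and `count_tfrecords("data/test.tfrecords")` are expected to match the 60,000 and 10,000 records stated earlier.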

## Upload local files to AML default datastore

An Azure Machine Learning (AML) workspace has its own default datastore. When you create an AML workspace, a storage account (backing the default datastore) is automatically created in the same resource group.<br>
Now we upload the files in the local ```data``` folder to the Azure ML workspace.
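Since the whole folder is uploaded, it can help to look at what it contains first. The sketch below uses only the standard library. `folder_summary` and `total_bytes` are hypothetical helper names introduced for illustration.

```python
from pathlib import Path

def folder_summary(folder):
    """Return {relative file path: size in bytes} for every file under `folder`."""
    root = Path(folder)
    return {
        str(p.relative_to(root)): p.stat().st_size
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }

def total_bytes(summary):
    """Sum the sizes reported by folder_summary()."""
    return sum(summary.values())
```

For example, `total_bytes(folder_summary("data"))` can be compared with the byte count reported in the upload progress output to confirm that nothing unexpected was picked up from the folder.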

In [2]:
from azure.ai.ml.entities import Data
from azure.ai.ml.constants import AssetTypes

mnist_data = Data(
    name="mnist_tfrecords_data",
    path="data",
    type=AssetTypes.URI_FOLDER,
)
registered_data = ml_client.data.create_or_update(mnist_data)

To sign in, use a web browser to open the page https://microsoft.com/devicelogin and enter the code RPR2B3UFA to authenticate.


Uploading data (57.91 MBs): 100%|██████████| 57915000/57915000 [00:01<00:00, 46535587.87it/s]



## Show registered data

You can retrieve the registered data with the ```get()``` or ```list()``` method.

In [3]:
data = ml_client.data.get("mnist_tfrecords_data", version=1)
data

Data({'type': 'uri_folder', 'is_anonymous': False, 'auto_increment_version': False, 'name': 'mnist_tfrecords_data', 'description': None, 'tags': {}, 'properties': {}, 'id': '/subscriptions/b3ae1c15-4fef-4362-8c3a-5d804cdeb18d/resourceGroups/AzureML-rg/providers/Microsoft.MachineLearningServices/workspaces/ws01/data/mnist_tfrecords_data/versions/1', 'base_path': './', 'creation_context': <azure.ai.ml._restclient.v2022_05_01.models._models_py3.SystemData object at 0x7f06171ac898>, 'serialize': <msrest.serialization.Serializer object at 0x7f061719cda0>, 'version': '1', 'latest_version': None, 'path': 'azureml://subscriptions/b3ae1c15-4fef-4362-8c3a-5d804cdeb18d/resourcegroups/AzureML-rg/workspaces/ws01/datastores/workspaceblobstore/paths/LocalUpload/cb5afd9ca46093b6ec3c6dce49d2ce0e/data/', 'referenced_uris': None})
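The `get()` call above hard-codes `version=1`. When the asset has been registered more than once, `list()` can be used to find the newest version instead. The sketch below assumes that `ml_client.data.list(name=...)` yields one entity per version, each with a `version` attribute, as in `azure-ai-ml`. `latest_data_version` is a hypothetical helper name.

```python
def latest_data_version(ml_client, name):
    """Return the highest numeric version string registered under `name`.

    Assumes each item from `ml_client.data.list(name=...)` has a `version`
    attribute; versions that are not plain integers are ignored.
    """
    versions = [d.version for d in ml_client.data.list(name=name)]
    numeric = [v for v in versions if v is not None and str(v).isdigit()]
    if not numeric:
        raise LookupError(f"no numeric versions found for {name!r}")
    # Compare as integers so that "10" ranks above "2".
    return max(numeric, key=int)
```

The result can then be passed to `get()`, for example `ml_client.data.get("mnist_tfrecords_data", version=latest_data_version(ml_client, "mnist_tfrecords_data"))`.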