# Exercise 02 : Prepare Data

In this exercise, we upload data as a data asset in Azure Machine Learning.<br>
All subsequent exercises (Exercise 04 onward) use the data provisioned here, so run this exercise beforehand.

In this tutorial, we train on the handwritten digits dataset ([MNIST](http://yann.lecun.com/exdb/mnist/)), stored as **train.tfrecords** and **test.tfrecords**.

*back to [index](https://github.com/tsmatz/azureml-tutorial/)*

## Variable Settings

Replace the bracketed strings below to set the required variables.

> Note : If you run the following ```az configure --defaults``` command, you can omit the ```--resource-group``` and ```--workspace-name``` options in each ```az ml``` command.<br>
> ```az configure --defaults group=$my_resource_group workspace=$my_workspace```

In [1]:
my_resource_group = "{AML-RESOURCE-GROUP-NAME}"
my_workspace = "{AML-WORKSPACE-NAME}"

## Download data to a local folder

Download the MNIST (handwritten digits) dataset into the ```./data``` folder.<br>
The generated ```train.tfrecords``` has 60,000 records and ```test.tfrecords``` has 10,000 records.

In [None]:
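# Note: tensorflow.contrib and tf.python_io exist only in TensorFlow 1.x,
# so this cell needs a TensorFlow 1.x environment.
import os
import tensorflow as tf
from tensorflow.contrib.learn.python.learn.datasets import mnist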

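def convert_to(data_set, name):
    """Write every example of data_set to data/<name>.tfrecords."""
    images = data_set.images
    labels = data_set.labels
    num_examples = data_set.num_examples

    # With reshape=False, images has shape (num_examples, rows, cols, depth).
    rows = images.shape[1]
    cols = images.shape[2]
    depth = images.shape[3]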

    filename = os.path.join("data", name + '.tfrecords')
    print('Writing', filename)
    with tf.python_io.TFRecordWriter(filename) as writer:
        for index in range(num_examples):
            image_raw = images[index].tobytes()
            example = tf.train.Example(
                features=tf.train.Features(
                    feature={
                        'height': tf.train.Feature(int64_list=tf.train.Int64List(value=[rows])),
                        'width': tf.train.Feature(int64_list=tf.train.Int64List(value=[cols])),
                        'depth': tf.train.Feature(int64_list=tf.train.Int64List(value=[depth])),
                        'label': tf.train.Feature(int64_list=tf.train.Int64List(value=[int(labels[index])])),
                        'image_raw': tf.train.Feature(bytes_list=tf.train.BytesList(value=[image_raw]))
                    }))
            writer.write(example.SerializeToString())

data_sets = mnist.read_data_sets(
    "tmp-data",
    dtype=tf.uint8,
    reshape=False,
    validation_size=0)
os.makedirs("./data", exist_ok=True)
convert_to(data_sets.train, 'train')
convert_to(data_sets.test, 'test')
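
To check the record counts stated above without loading TensorFlow again, the files can be scanned directly. The following is a minimal sketch, assuming the standard TFRecord framing (an 8-byte little-endian length, a 4-byte length checksum, the payload, then a 4-byte payload checksum). The helper name `count_tfrecords` is hypothetical, and the sketch skips checksum validation.

```python
import os
import struct

def count_tfrecords(path):
    """Count records in a TFRecord file by walking its length-prefixed framing.

    Assumption: standard TFRecord layout. Checksums are skipped, not verified.
    """
    size = os.path.getsize(path)
    count = 0
    with open(path, "rb") as f:
        while True:
            header = f.read(8)  # little-endian uint64 payload length
            if not header:
                break  # clean end of file
            if len(header) < 8:
                raise ValueError("truncated length header in " + path)
            (length,) = struct.unpack("<Q", header)
            # Skip the 4-byte length checksum, the payload, and the 4-byte payload checksum.
            f.seek(4 + length + 4, os.SEEK_CUR)
            if f.tell() > size:
                raise ValueError("truncated record in " + path)
            count += 1
    return count

# Only report on files that actually exist in ./data.
for name, expected in (("train", 60000), ("test", 10000)):
    path = os.path.join("data", name + ".tfrecords")
    if os.path.exists(path):
        print(path, "has", count_tfrecords(path), "records; expected", expected)
```

Because only the framing is read, this runs quickly even on the larger training file.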

## Upload local files to AML default datastore

Each Azure Machine Learning (AML) workspace has its own default datastore. When you create an AML workspace, a storage account (backing the default datastore) is automatically created in the same resource group.<br>
Now we create a YAML file and upload the files (in the ```data``` folder) to AML.

First, we create the YAML file for data asset registration.

In [2]:
%%writefile 02_file_upload.yml
$schema: https://azuremlschemas.azureedge.net/latest/data.schema.json
name: mnist_tfrecords_data
description: This is example.
type: uri_folder
path: data

Writing 02_file_upload.yml
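
Because the YAML points at the whole ```data``` folder (```type: uri_folder```, ```path: data```), everything under that folder is uploaded. It can help to list the folder contents before registering the asset. The following is a minimal sketch using only the standard library; the helper name `summarize_folder` is hypothetical.

```python
import os

def summarize_folder(folder):
    """Return a sorted list of (relative_path, size_in_bytes) for every file under folder."""
    entries = []
    for root, _dirs, files in os.walk(folder):
        for fname in files:
            full = os.path.join(root, fname)
            entries.append((os.path.relpath(full, folder), os.path.getsize(full)))
    return sorted(entries)

# Print a short report if the local data folder exists.
if os.path.isdir("data"):
    entries = summarize_folder("data")
    for rel, size in entries:
        print(f"{rel}: {size / 1e6:.2f} MB")
    print(f"total: {sum(size for _, size in entries) / 1e6:.2f} MB")
```

If the folder holds anything other than ```train.tfrecords``` and ```test.tfrecords```, those extra files would be uploaded as part of the same asset.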


Now we register the data asset (which uploads the local data) with the AML CLI.

In [3]:
!az ml data create --file 02_file_upload.yml \
  --resource-group $my_resource_group \
  --workspace-name $my_workspace

Command group 'ml data' is in preview and under development. Reference and support levels: https://aka.ms/CLI_refstatus
Uploading data (57.91 MBs): 100%|█| 57915000/57915000 [00:00<00:00, 74829460.63i

{
  "creation_context": {
    "created_at": "2022-04-15T05:30:09.753715+00:00",
    "created_by": "Tsuyoshi Matsuzaki",
    "created_by_type": "User",
    "last_modified_at": "2022-04-15T05:30:09.765830+00:00"
  },
  "description": "This is example.",
  "id": "/subscriptions/b3ae1c15-4fef-4362-8c3a-5d804cdeb18d/resourceGroups/AML-rg/providers/Microsoft.MachineLearningServices/workspaces/ws01/data/mnist_tfrecords_data/versions/1",
  "name": "mnist_tfrecords_data",
  "path": "azureml://subscriptions/b3ae1c15-4fef-4362-8c3a-5d804cdeb18d/resourcegroups/AML-rg/workspaces/ws01/datastores/workspaceblobstore/paths/LocalUpload/cb5afd9ca46093b6ec3c6dce49d2ce0e/data",
  "resourceGroup": "AML-rg",
  "tags": {},
  "type": "uri_file",
  "version": "1"
}
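
The ```path``` value in the output above shows where the files landed: the datastore name and the folder inside it. The following is a minimal sketch for pulling those parts out of such a URI, assuming the ```azureml://``` layout seen in this output. The helper name `parse_datastore_uri` is hypothetical and is not part of the Azure CLI or SDK.

```python
import re

# Pattern based on the URI layout shown in the CLI output (an assumption, not an official grammar).
_DATASTORE_URI = re.compile(
    r"^azureml://subscriptions/(?P<subscription>[^/]+)"
    r"/resourcegroups/(?P<resource_group>[^/]+)"
    r"/workspaces/(?P<workspace>[^/]+)"
    r"/datastores/(?P<datastore>[^/]+)"
    r"/paths/(?P<path>.+)$",
    re.IGNORECASE,
)

def parse_datastore_uri(uri):
    """Split an azureml:// datastore URI into its named parts."""
    match = _DATASTORE_URI.match(uri)
    if match is None:
        raise ValueError("not a recognized azureml:// datastore URI: " + uri)
    return match.groupdict()

# The "path" value copied from the output above.
uri = (
    "azureml://subscriptions/b3ae1c15-4fef-4362-8c3a-5d804cdeb18d"
    "/resourcegroups/AML-rg/workspaces/ws01"
    "/datastores/workspaceblobstore"
    "/paths/LocalUpload/cb5afd9ca46093b6ec3c6dce49d2ce0e/data"
)
parts = parse_datastore_uri(uri)
print(parts["datastore"])  # workspaceblobstore
print(parts["path"])       # LocalUpload/cb5afd9ca46093b6ec3c6dce49d2ce0e/data
```

Here ```workspaceblobstore``` is the workspace's default datastore, and the remaining path is the upload location inside it.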