# TABLE OF CONTENTS:
---
* [Notebook Summary](#Notebook-Summary)
* [Setup](#Setup)
    * [Connect to Workspace](#Connect-to-Workspace)
* [Data](#Data)
    * [Overview](#Overview)
    * [Download & Extract Data](#Download-&-Extract-Data)
    * [Upload Data](#Upload-Data)
    * [Explore Data](#Explore-Data)
    * [Create and Register AML Dataset](#Create-and-Register-AML-Dataset)
---

# Notebook Summary

This notebook downloads the [Stanford Dogs dataset](http://vision.stanford.edu/aditya86/ImageNetDogs/) from the Stanford Vision website to the local compute and then uploads it to the default blob storage of the Azure Machine Learning (AML) workspace. It also creates an AML file dataset that can be used for easy data access during training.

# Setup

Append the parent directory to the sys path so that the modules created in the src directory can be imported.

In [None]:
import os
import sys
# Add the parent of the current working directory to the module search path
sys.path.append(os.path.dirname(os.path.abspath("")))

Automatically reload modules when changes are made.

In [None]:
%load_ext autoreload
%autoreload 2

Import libraries and modules.

In [None]:
# Import libraries
import azureml.core
import torchvision
from azureml.core import Dataset, Workspace

# Import created modules
from src.utils import (
    download_stanford_dogs_archives,
    extract_stanford_dogs_archives,
    load_data,
    show_batch_of_images,
    show_image,
)

print(f"azureml.core version: {azureml.core.VERSION}")

### Connect to Workspace

In order to connect and communicate with the AML workspace, a workspace object needs to be instantiated using the AML SDK.

In [None]:
# Connect to the AML workspace using interactive authentication
ws = Workspace.from_config()
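
`Workspace.from_config()` reads the workspace details from a `config.json` file. As a minimal sketch, the snippet below checks that such a file contains the three keys the SDK expects (`subscription_id`, `resource_group`, `workspace_name`). The helper `missing_config_keys` and the placeholder values are hypothetical and are not part of this notebook or the SDK.

```python
import json

# Keys that the workspace config.json is expected to contain
REQUIRED_KEYS = ("subscription_id", "resource_group", "workspace_name")


def missing_config_keys(config_text):
    """Return the required keys that are absent from a config.json string."""
    config = json.loads(config_text)
    return [key for key in REQUIRED_KEYS if key not in config]


# Placeholder values only (assumption); replace with real workspace details
example_config = json.dumps({
    "subscription_id": "<subscription-id>",
    "resource_group": "<resource-group>",
    "workspace_name": "<workspace-name>",
})

missing = missing_config_keys(example_config)
print(missing)
```

Running a check like this before connecting can make a missing or incomplete configuration file easier to diagnose.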

# Data

### Overview

The [Stanford Dogs dataset](http://vision.stanford.edu/aditya86/ImageNetDogs/) is an image dataset that will be used to train a multiclass dog breed classification model. It contains 20,580 images across 120 dog breeds (classes). The dataset has been built using images and annotations from ImageNet for the task of fine-grained image categorization. The images are three-channel color images of varying size. While a file with a given train/test split can be downloaded from the website, the provided training set will be further split into a training and a validation set (80:20). This ultimately leads to the following data distribution:
- 9600 training images (46.65%)
- 2400 validation images (11.66%)
- 8580 test images (41.69%)
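
The split sizes and percentages listed above follow directly from the total image count, the test set size, and the 80:20 split. The short calculation below is a sketch that reproduces them; the variable names are illustrative only.

```python
# Image counts taken from the dataset description
TOTAL_IMAGES = 20_580
TEST_IMAGES = 8_580

# Everything that is not in the test split belongs to the provided training list
provided_train = TOTAL_IMAGES - TEST_IMAGES

# 80:20 training/validation split, using integer arithmetic
train_images = provided_train * 80 // 100
val_images = provided_train - train_images

split_sizes = {"train": train_images, "val": val_images, "test": TEST_IMAGES}
split_shares = {
    name: round(100 * count / TOTAL_IMAGES, 2) for name, count in split_sizes.items()
}

for name, count in split_sizes.items():
    print(f"{name}: {count} images ({split_shares[name]:.2f}%)")
```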

### Download & Extract Data

Download the data to the local compute.

The utility file `../src/utils/data_utils.py` provides functions to download the Stanford Dogs archive files from the Stanford Vision website and to extract them into the folder structure expected by [torchvision.datasets.ImageFolder](https://pytorch.org/docs/stable/torchvision/datasets.html#imagefolder).

In [None]:
# Download the dataset archive files
download_stanford_dogs_archives()

In [None]:
# Extract the dataset archives and remove them after extraction
extract_stanford_dogs_archives()
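
`ImageFolder` expects one subfolder per class, so the extracted data should follow a `root/split/class/image` layout. As a rough sanity check, the sketch below counts images per class and split. The helper `summarize_image_folder` is hypothetical and not part of `data_utils.py`, and the demonstration runs on a throwaway directory with empty placeholder files rather than on the real data.

```python
import tempfile
from pathlib import Path

IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def summarize_image_folder(root):
    """Return {split: {class_name: image_count}} for a root/split/class/image layout."""
    summary = {}
    for split_dir in sorted(p for p in Path(root).iterdir() if p.is_dir()):
        summary[split_dir.name] = {
            class_dir.name: sum(
                1 for f in class_dir.iterdir() if f.suffix.lower() in IMAGE_SUFFIXES
            )
            for class_dir in sorted(p for p in split_dir.iterdir() if p.is_dir())
        }
    return summary


# Build a tiny placeholder layout and summarize it
with tempfile.TemporaryDirectory() as tmp:
    for split in ("train", "val"):
        class_dir = Path(tmp) / split / "n02085620-Chihuahua"
        class_dir.mkdir(parents=True)
        (class_dir / "example.jpg").touch()
    demo_summary = summarize_image_folder(tmp)

print(demo_summary)
```

Pointing the same function at `../data` after extraction would show how many images each class contributes to each split.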

### Upload Data

Upload the data to the default AML datastore.

In [None]:
datastore = ws.get_default_datastore()
datastore.upload(src_dir="../data", target_path="data/stanford_dogs", overwrite=True)

### Explore Data

Load the data. The `../src/utils/data_utils.py` script provides a utility function that creates the dataloaders.

In [None]:
# Load data
dataloaders, dataset_sizes, class_names = load_data("../data")

Display an example image. Note that the images vary in shape.

In [None]:
show_image(image_path="../data/val/n02085620-Chihuahua/n02085620_1152.jpg")

Display the first batch of 4 images.

In [None]:
# Get the first batch of images from the validation dataloader
dataiter = iter(dataloaders["val"])
images, labels = next(dataiter)

# Show images
show_batch_of_images(torchvision.utils.make_grid(images))
# Print labels (split on the first hyphen only so hyphenated breed names stay intact)
print("\n".join(class_names[label].split("-", 1)[1] for label in labels))
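
The class names are folder names such as `n02085620-Chihuahua`, which combine an ID with the breed name. The sketch below shows one way to extract a readable breed label. The helper `breed_name` is hypothetical, and `n02086240-Shih-Tzu` is used as an illustrative folder name for a breed that itself contains a hyphen.

```python
def breed_name(class_dir_name):
    """Strip the leading ID, keeping any hyphens that belong to the breed name."""
    return class_dir_name.split("-", 1)[1]


examples = ["n02085620-Chihuahua", "n02086240-Shih-Tzu"]
parsed = [breed_name(name) for name in examples]
print(parsed)
```

Splitting on the first hyphen only avoids truncating a label such as `Shih-Tzu`, which a plain `split("-")[1]` would shorten to `Shih`.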

### Create and Register AML Dataset

Register the data as a file dataset in the AML workspace for easy accessibility during training.

In [None]:
# Create a dataset object from the datastore location
dataset = Dataset.File.from_files(path=(datastore, "data/stanford_dogs"))

In [None]:
# Register the dataset
dataset = dataset.register(workspace=ws,
                           name="stanford-dogs-dataset",
                           description="stanford dogs dataset containing training, validation and test data",
                           create_new_version=True)