# The problem: Cats vs Dogs

In this problem, we have to write an algorithm to classify whether images contain either a dog or a cat. This is easy for humans, dogs, and cats, but your computer will find it a bit more difficult.

<img src='https://storage.googleapis.com/kaggle-competitions/kaggle/3362/media/woof_meow.jpg' />

#### The Asirra data set
Web services are often protected with a challenge that's supposed to be easy for people to solve, but difficult for computers. Such a challenge is often called a CAPTCHA (Completely Automated Public Turing test to tell Computers and Humans Apart) or HIP (Human Interactive Proof). HIPs are used for many purposes, such as to reduce email and blog spam and prevent brute-force attacks on web site passwords.

Asirra (Animal Species Image Recognition for Restricting Access) is a HIP that works by asking users to identify photographs of cats and dogs. This task is difficult for computers, but studies have shown that people can accomplish it quickly and accurately. Many even think it's fun! Here is an example of the Asirra interface:

Asirra is unique because of its partnership with Petfinder.com, the world's largest site devoted to finding homes for homeless pets. They've provided Microsoft Research with over three million images of cats and dogs, manually classified by people at thousands of animal shelters across the United States. Kaggle is fortunate to offer a subset of this data for fun and research. 
        

## Solving the problem

Let's start by getting some undestanding of the problem. This is what we know:

<ul>
    <li><b>Problem type:</b> Classification</li>
    <li><b>Number of classes:</b> 2 (cats, dogs)</li>
    <li><b>Input:</b> Images (25.000 — 50% cats, 50% dogs)</li>
</ul>

The dataset is completely balanced, and 

# Azure Machine Learning Services

Microsoft has a variety of services tailored for Machine Learning and AI, however, the most suitable for this talk is by far AML. It provides a cloud-based environment you can use to develop, train, test, deploy, manage, and track machine learning models. The current version of AML uses a code-first approach with Python, which means that the whole process is managed using this language. It can be executed from a notebook or from the IDE of your choice.

## Install the SDK

As I stated before, AML uses a code-first approach to create, manage and publish machine learning experiments. So you will need to install some libraries in your environment. Your environment could be your local computer using PyCharm, Spider, VS Code or any other IDE of your choice, or it can be a notebook running in the cloud or locally. You need to install two libraries: Azure and Azure ML SDK.

In [None]:
!pip install azureml-sdk[notebooks,automl] --ignore-installed

<h2>A Machine Learning workspace</h2>

First, we are going to create a workspace in AML to work with. The workspace is the cloud resource you will use to create, manage, and publish machine learning experiments. To create a workspace you need the subscription ID of the subscription you are going to use, a name for the workspace and a location to deploy the resource. The location parameter is important since it will define which compute hardware will be available for your training job. I’m using East of US.

In [8]:
import azureml.core
from azureml.core.workspace import Workspace
ws = Workspace.get(
     name = "aa-ml-aml-workspace",
     subscription_id = "",
     resource_group = 'Analytics.Aml.Experiments.Workspaces')

<h2>Experiment</h2>

An experiment is a logical container for your proposed solution to the problem you want to model. A workspace can have multiple experiments running at the same time. They don’t just work as a container for your solution, but they also allow you to track down your progress around how good your solution is doing. Such progress is tracked using metrics you can define. If your problem is a classification problem, probably you will want to track the Accuracy or the MAP your model is getting. Each experiment can have multiple metrics being tracked.

Your experiment will be associated with a folder on your local computer. Such a folder contains all the resources (code files, assets, data, etc) you need to solve the problem. The folder will typically be associated with a code repository. This is not required, but it will allow you to collaborate among different Data Scientists in the same experiment. The repository can be hosted in any service, from GitHub to Azure DevOps.

You create an experiment using Python by simply indicating the name of the experiment and the Workspace associated with it.

In [10]:
from azureml.core import Experiment
experiment = Experiment(workspace=ws, name='azureml-cats-vs-dogs')

<b>State of the art</b>

The current literature suggests machine classifiers can score above 80% accuracy on this task. Therfore, Asirra is no longer considered safe from attack. Current Top 1 Kaggle leader achieved 0.98914. Let's try to implement a solution using Machine Learning. 

## Creating a train script: Solving the problem with fast.ai and PyTorch

We are introducing here a framework called fast.ai (https://www.fast.ai/), a framework based on PyTorch with some handy operations already implemented to speed up problem solving quickly. To use fast.ai, we need to import 2 libraries: fastai and torch. fast.ai also has the named dataset already uploaded as part of the framework, which makes pretty convenient to work with it. The following line unzipes the compressed tar file where all the dataset is stored.

```python
path = untar_data(URLs.DOGS)
```

As images are unzipped in a folder, we need to create a dataset to use for training and testing. fast.ai has a very simple way to do that. The method ImageDataBunch.from_folder creates a dataset of images using as parameters path (where the files are stored inside 2 folders, cats and dogs each of them representing one class), ds_tfms indicating which image transformations to apply and size specifying the size of the images used. The way folders are read is as images in the same subfolder are considered one class. The ds_tfms() method quickly gets a set of random transforms that have proved to work well in a wide range of tasks in computer vision, including a random flip is applied with probability 0.5, a random rotation, a random zoom, a random lightning and contrast change and a random symmetric warp. Images will be altered somehow similar to the following:

```python
data = ImageDataBunch.from_folder(path, ds_tfms=get_transforms(), size=224)
```

<img src='https://notebooks.azure.com/fasantia/projects/hol-aml-experimentationservice/raw/docs%2Fimg_cats_transform.png' />

Finally, the function normalize creates a normalize/denormalize func using an specific mean and std. In this case, those parameters are taken from the imagenet dataset, using the values imagenet_stats which are means = [0.485, 0.456, 0.406] and stds = [0.229, 0.224, 0.225].

```python
data = ImageDataBunch.from_folder(path, ds_tfms=get_transforms(), size=224).normalize(imagenet_stats)
```

Once our dataset is ready, it's time to create our NN. CNN represents a very convenient way to solve Computer Vision problems, specially when combined with transfer learning. We use transfer learning with a pretrained image classification models to extract visual features. The idea behind it is that the representations learned for task A (typically a high-level task) are applied to task B (typically a lower-level task) as for the degree of success at task B indicates how much the task A model has learned about task B.

```python
learn = create_cnn(data, models.resnet50, metrics=accuracy)
```

Then it's time to train. When using transfer learning, the training process is a bit different like in a normal network. In the processes we take a pre-trained model and “fine-tuning” the model with your our own dataset. The idea is that this pre-trained model will act as a feature extractor. You will remove the last layer of the network and replace it with your own classifier. You then freeze the weights of all the other layers and train the network normally. This is exactly what the following 3 lines are doing:

```python
learn.fit_one_cycle(1)
learn.unfreeze()
learn.fit_one_cycle(1, slice(1e-5,3e-4), pct_start=0.05)
```

I'm using here the method fit_one_cycle for training the model. What this method does behind the scenes is running for few epochs to find out a good learning rate, where it trains from some low learning rate and increase the learning rate after each mini-batch till the loss value starts to explode. This single run provides valuable information on how well the network can be trained over a range of learning rates and what is the maximum learning rate. This is based on a paper https://arxiv.org/abs/1506.01186 which is a really good reading by the way. In Cyclical learning rates (CLR) one specifies minimum and maximum learning rate boundaries and a stepsize. The stepsize is the number of iterations (or epochs) used for each step and a cycle consists of two such steps – one in which the learning rate linearly increases from the minimum to the maximum and the other in which it linearly decreases.

Once the model is trained, it's time to save the work. The save method will save the model and all the required files used when training. The export method will also create a pkl file which can be used later to make predictions based on new images.

```python
saved_model_path = learn.save(name='cats-vs-dogs', return_path = True)
learn.export()
```

Let's put all the peaces together now into a single train file

In [1]:
%%writefile fastai/train.py

import torch
import numpy as np
import fastai
from fastai import *
from fastai.vision import *

path = untar_data(URLs.DOGS)
data = ImageDataBunch.from_folder(path, ds_tfms=get_transforms(), size=224).normalize(imagenet_stats)
learn = create_cnn(data, models.resnet50, metrics=accuracy)

learn.fit_one_cycle(1)
learn.unfreeze()
learn.fit_one_cycle(1, slice(1e-5,3e-4), pct_start=0.05)

saved_model_path = learn.save(name='cats-vs-dogs', return_path = True)
learn.export()

Overwriting fastai/train.py


## Creating a better version of train.py

Althought the previous train.py file would do the job, it would be great if we can take advantage of some of the features Azure Machine Learning offers, specially regarding metric's tracking. We can achieve that by using the get_context() method to get the current execution context inside Azure Machine Learning Services:

```python
from azureml.core import Run
run = Run.get_context()
```

Then, we can log specific metrics using the log method:

```python
run.log('training_acc', accuracy_value)
```

I'm also using the method run.log_list to log a sequence of values, which will be later displayed as a graph in the Azure Dashboard. In particular, I'm logging the learning rate and the loss which I will use to know if the model is overfitting the training data set or not. You will see that I use a method called reduce_list. This is used to reduce the number of points to plot. Currently, the method has a limit of points you can submit.

The new version of the train.py file will look like this:

### The train.py script (v2)

In [12]:
%%writefile fastai/train.py

import torch
import numpy as np
import fastai
from fastai import *
from fastai.vision import *

print("PyTorch version %s" % torch.__version__)
print("fastai version: %s" % fastai.__version__)
print("CUDA supported: %s" % torch.cuda.is_available())
print("CUDNN enabled: %s" % torch.backends.cudnn.enabled)

path = untar_data(URLs.DOGS)
data = ImageDataBunch.from_folder(path, ds_tfms=get_transforms(), size=224).normalize(imagenet_stats)
learn = create_cnn(data, models.resnet50, metrics=accuracy)

learn.fit_one_cycle(1)
learn.unfreeze()
learn.fit_one_cycle(1, slice(1e-5,3e-4), pct_start=0.05)

saved_model_path = learn.save(name='cats-vs-dogs', return_path = True)
learn.export()
saved_model_pkl = str(learn.path) + '/export.pkl'

from azureml.core import Run
run = Run.get_context()

def reduce_list(all_values):
    return [np.max(all_values[i:i+10]) for i in range(0,len(all_values)-1,10)]

losses_values = [tensor.item() for tensor in learn.recorder.losses] 
accuracy_value = np.float(accuracy(*learn.TTA()))

run.log('training_acc', accuracy_value)
run.log('pytorch', torch.__version__)
run.log('fastai', fastai.__version__)
run.log('base_model', 'resnet50')
run.log_list('Learning_rate', reduce_list(learn.recorder.lrs))
run.log_list('Loss', reduce_list(losses_values))

from shutil import copyfile
copyfile(saved_model_pkl, './outputs/cats-vs-dogs.pkl')

Overwriting fastai/train.py


In the last 2 lines I'm saving the model in the folder outputs. The reason for that is that AMLS automatically capture all the files in that directory and saves it in the workspace. Then you can download the trained model to use later.

<h1>What, where and how to execute the training</h1>
<h2>Run configuration and Estimators</h2>

The Estimator is an abstraction that allows you especify how the train.py file should be executed based on high-level specifications. You create an Estimator using the azureml.train.Estimator namespace, however, the SDK cames with some Estimators prebuilt for specific deep learning frameworks, including TensorFlow and PyTorch. If you are using one of those frameworks, then you can create an Estimator for them as follows:

In [13]:
from azureml.train.dnn import PyTorch
src = PyTorch(source_directory =  r'fastai',
              entry_script = 'train.py',
              compute_target='amlcompute', 
              vm_size='Standard_NC6', 
              use_gpu = True, 
              pip_packages = ['fastai==1.0.0', 'azureml-sdk'])

This method will create a PyTorch execution environment. Parameters are:

<ul>
    <li><b>souce_directory:</b> All the files in souce_directory will be copies to the execution target (this is usually your project root directory).</li>
    <li><b>compute_target</b> specified where are you going to execute this job. The value ‘amlcompute’ signals we want Azure to provision a VM for this specific job. The machine will be created and once the job is done it will be destroyed. Pretty cool feature. Other types are available including (Databricks, HDInsight (Spark), custom VMs, local computer)</li>
    <li><b>vm_size</b> specified the type of hardware to use. In this case, Standard_NC6 are powered by NVIDIA Tesla K80 with 8 GiB, 6 vCPU, and 56 GiB of RAM.</li>
    <li><b>entry_script</b> specified which is the training script you want to execute. This file should be inside of source_directory.</li>
    <li><b>use_gpu</b> specifies that we want GPU-enabled libraries.</li>
    <li><b>pip_packages</b> allows you to specify which additional packages you need in the execution environment. In this case, since PyTorch execution environment has everything that is needed for PyTorch, the only package that is missing is fast.ai.</li>
</ul>

<h2>Runs</h2>

Inside an experiment, you have Runs. A Run is a particular instance of the experiment. Each time you submit your experiment to Azure and execute it, it will create a Run. You will typically collect metrics across different runs, for instance, the accuracy the model is getting, in order to compare. This is how you can track progress in your model manage how it evolves. The run can also generate outputs. Typically, one of the outputs will be the model itself (a file).

You create a run for your experiment by executing the submit method of the experiment object.

In [14]:
run = experiment.submit(src)

Once a run is submitted, the training process for your experiment will start. The method is asynchronous, that means that it will not wait until it is done. You will typically want to wait for it. wait_for_completion does that for you. The show_output = true indicated that you want to see the output of the process in your console. The output will be a live stream so you can see exactly what’s going on. Kind of cool!

What is happening under the hood is that Azure is preparing a new docker image for executing PyTorch code with GPU support, copying all the assets we need, installing all the packages we specified, creating a VM and deploying the image in the VM. Finally, the script is executed and once done the VM destroyed.

In [15]:
run.wait_for_completion(show_output = True)

RunId: azureml-cats-vs-dogs_1554821600_f6d06459

Streaming azureml-logs/20_image_build_log.txt

2019/04/09 14:54:18 Using acb_vol_9cd7a6d1-b8ab-49bf-91f9-7f5a366c8cad as the home volume
2019/04/09 14:54:18 Creating Docker network: acb_default_network, driver: 'bridge'
2019/04/09 14:54:18 Successfully set up Docker network: acb_default_network
2019/04/09 14:54:18 Setting up Docker configuration...
2019/04/09 14:54:19 Successfully set up Docker configuration
2019/04/09 14:54:19 Logging in to registry: aamlamlwacrhtyjtxxl.azurecr.io
2019/04/09 14:54:20 Successfully logged into aamlamlwacrhtyjtxxl.azurecr.io
2019/04/09 14:54:20 Executing step ID: acb_step_0. Working directory: '', Network: 'acb_default_network'
2019/04/09 14:54:20 Obtaining source code and scanning for dependencies...
2019/04/09 14:54:21 Successfully obtained source code and scanned for dependencies
2019/04/09 14:54:21 Launching container with name: acb_step_0
Sending build context to Docker daemon  45.06kB

Step 1/15 : FR

  Downloading https://files.pythonhosted.org/packages/19/74/e50234bc82c553fecdbd566d8650801e3fe2d6d8c8d940638e3d8a7c5522/pandas-0.24.2-cp36-cp36m-manylinux1_x86_64.whl (10.1MB)
Collecting fastprogress>=0.1.19 (from fastai->-r /azureml-setup/condaenv.w9vfbr24.requirements.txt (line 4))
  Downloading https://files.pythonhosted.org/packages/83/db/794db47024a26c75635c35f0ee5431aa8b528e895ad1ed958041290f83f7/fastprogress-0.1.21-py3-none-any.whl
Collecting packaging (from fastai->-r /azureml-setup/condaenv.w9vfbr24.requirements.txt (line 4))
  Downloading https://files.pythonhosted.org/packages/91/32/58bc30e646e55eab8b21abf89e353f59c0cc02c417e42929f4a9546e1b1d/packaging-19.0-py2.py3-none-any.whl
Collecting typing (from fastai->-r /azureml-setup/condaenv.w9vfbr24.requirements.txt (line 4))
  Downloading https://files.pythonhosted.org/packages/4a/bd/eee1157fc2d8514970b345d69cb9975dcd1e42cd7e61146ed841f6e68309/typing-3.6.6-py3-none-any.whl
Collecting scipy (from fastai->-r /azureml-setup/condae

Building wheels for collected packages: pyyaml, bottleneck, nvidia-ml-py3, pathspec, pycparser
  Building wheel for pyyaml (setup.py): started
  Building wheel for pyyaml (setup.py): finished with status 'done'
  Stored in directory: /root/.cache/pip/wheels/ad/56/bc/1522f864feb2a358ea6f1a92b4798d69ac783a28e80567a18b
  Building wheel for bottleneck (setup.py): started
  Building wheel for bottleneck (setup.py): finished with status 'done'
  Stored in directory: /root/.cache/pip/wheels/f2/bf/ec/e0f39aa27001525ad455139ee57ec7d0776fe074dfd78c97e4
  Building wheel for nvidia-ml-py3 (setup.py): started
  Building wheel for nvidia-ml-py3 (setup.py): finished with status 'done'
  Stored in directory: /root/.cache/pip/wheels/e4/1d/06/640c93f5270d67d0247f30be91f232700d19023f9e66d735c7
  Building wheel for pathspec (setup.py): started
  Building wheel for pathspec (setup.py): finished with status 'done'
  Stored in directory: /root/.cache/pip/wheels/45/cb/7e/ce6e6062c69446e39e328170524ca8213498bc


Streaming azureml-logs/60_control_log.txt

Streaming log file azureml-logs/60_control_log.txt
Streaming log file azureml-logs/80_driver_log.txt

Streaming azureml-logs/80_driver_log.txt

PyTorch version 1.0.0
fastai version: 1.0.51
CUDA supported: True
CUDNN enabled: True
Downloading http://files.fast.ai/data/examples/dogscats
  warn("`create_cnn` is deprecated and is now named `cnn_learner`.")
Downloading: "https://download.pytorch.org/models/resnet50-19c8e357.pth" to /root/.torch/models/resnet50-19c8e357.pth

  0%|          | 0/102502400 [00:00<?, ?it/s]
  3%|▎         | 3555328/102502400 [00:00<00:02, 35544331.00it/s]
  6%|▌         | 6356992/102502400 [00:00<00:02, 32837547.98it/s]
  9%|▊         | 8847360/102502400 [00:00<00:03, 29776279.41it/s]
 11%|█▏        | 11599872/102502400 [00:00<00:03, 28989967.53it/s]
 14%|█▍        | 14417920/102502400 [00:00<00:03, 28731074.86it/s]
 17%|█▋        | 17235968/102502400 [00:00<00:02, 28432887.90it/s]
 19%|█▉        | 19791872/102502400 [

{'runId': 'azureml-cats-vs-dogs_1554821600_f6d06459',
 'target': 'amlcompute',
 'status': 'Failed',
 'startTimeUtc': '2019-04-09T15:07:55.494566Z',
 'endTimeUtc': '2019-04-09T15:26:29.228121Z',
 'error': {'error': {'code': 'UserError',
   'message': "save() got an unexpected keyword argument 'name'",
   'details': [],
   'debugInfo': {'type': 'TypeError',
    'message': "save() got an unexpected keyword argument 'name'",
    'stackTrace': '  File "azureml-setup/context_manager_injector.py", line 90, in execute_with_context\n    runpy.run_path(sys.argv[0], globals(), run_name="__main__")\n  File "/azureml-envs/azureml_e7a13acf77779904bb28653701c4bd3a/lib/python3.6/runpy.py", line 263, in run_path\n    pkg_name=pkg_name, script_name=fname)\n  File "/azureml-envs/azureml_e7a13acf77779904bb28653701c4bd3a/lib/python3.6/runpy.py", line 96, in _run_module_code\n    mod_name, mod_spec, pkg_name, script_name)\n  File "/azureml-envs/azureml_e7a13acf77779904bb28653701c4bd3a/lib/python3.6/runpy.py

## Visualize experiment

There are two ways to visualize the experiment results. Either by using:

In [None]:
from azureml.widgets import RunDetails
RunDetails(run).show()

All the metrics you track can be seen within the portal in Azure too

<img src='https://cdn-images-1.medium.com/max/1200/1*vuo42vDq9ml5Z2iyS4qYeg.png' width='800' />

As you can see, AML has added a couple of metrics to the dashboard automatically, like Base Model Name and Training Accuracy. This happens automatically, but you can customize the dashboard to show the metrics you are interested in. If you want to see all the metrics of a particular run, you just click on it:

<img src='https://cdn-images-1.medium.com/max/1200/1*g5iHNpVQg15jZ_uqKbR4QA.png' width='800' />

As you see, now we have more variables being tracked. If you take a closer look, I’m also tracking two charts: Learning Rate and Loss. They show how the learning rate and the Loss are evolving as the training takes more samples on each epoch (I’m training with Stochastic Gradient Descent — SGD). It is useful to know when to stop training.

### Test the model for inference

In [None]:
import torch
from fastai import *
from fastai.vision import *

In [None]:
learn_inference = load_learner('/home/santiagxf/.fastai/data/dogscats')

Let's download an image from internet and submit it to the model we created

In [None]:
!wget -O test.jpg https://thenypost.files.wordpress.com/2018/05/180516-woman-mauled-by-angry-wiener-dogs-feature.jpg
    
img = open_image(os.path.join(os.getcwd(),"test.jpg"))
result = learn_inference.predict(img)