# Tutorial 3: Model Tuning

In tutorial 2, we built a baseline model based on image embeddings and cosine similarity, and then trained a deep learning model to learn a distance metric between images. To our surprise, the deep learning model performed worse than the cosine baseline.

In this tutorial we will
- Analyse and fix the problems in our current model
- Learn how to use AWS to train larger and better models
- Discuss potential further improvements

### Table of contents

[1. Why is the Deep Learning Model worse than the baseline?](#section_explain_results) <br>
[2. Creating better training pairs](#section_better_training_data) <br>
[3. Retraining the model](#section_retraining) <br>
[4. Learning embeddings](#section_learning_embeddings) <br>
[5. Intro to AWS](#aws) <br>
[6. Setting up AWS SageMaker](#aws_setup) <br>
&emsp; [6.1 AWS Account Credentials](#aws_setup.credentials) <br>
&emsp; [6.2 AWS Account Setup](#aws_setup.account_setup) <br>
&emsp; [6.3 The Training Script](#aws_setup.training_script) <br>
&emsp; [6.4 Training in the Cloud](#aws_setup.training_in_the_cloud) <br>
&emsp; [6.5 Getting the results](#aws_setup.getting_the_results) <br>
&emsp; [6.6 Rules](#aws_setup.rules) <br>
[7. Next steps](#next_steps) <br>
[8. Summary](#summary) <br>

In [1]:
# Install new packages
!pip install aws_requests_auth sagemaker awscli



In [2]:
# Add the project dir and the src folder to paths
import sys
from pathlib import Path
project_dir = Path.cwd().parent
src_dir = project_dir / 'src'
sys.path.insert(0, str(project_dir))
sys.path.insert(0, str(src_dir))

In [3]:
# For readability, we list all libraries we use in this notebook at the beginning
import numpy as np
import os
import pandas as pd
import tensorflow as tf
from itertools import combinations, permutations
from data.embedding_generators import DataGeneratorFromEmbeddings, BalancedDataGeneratorFromEmbeddings
from data.image_generators import load_image
from utils.remote_sagemaker import get_job_details, start_remote_sagemaker_job, upload_code_folder_to_s3
from models.siamese_twin_embeddings import siamese_net_from_embeddings
from sagemaker.s3 import S3Downloader
from tensorflow.keras.applications import MobileNet
from tensorflow.keras.applications.mobilenet import preprocess_input
from tensorflow.keras.preprocessing.image import load_img, img_to_array
from utils.remote_sagemaker import get_job_details, download_sagemaker_job_results, extract_results, download_sagemaker_job_logs, list_all_running_jobs, stop_job

2020-03-30 18:25:34,537 credentials load line 1196 Found credentials in shared credentials file: ~/.aws/credentials


<a id='section_explain_results'></a>
# Why is the Deep Learning Model worse than the baseline?

Let's recap our approach. We
- Computed embeddings of the images
- Defined the architecture of the model
- Generated image pairs as training data
- Fit the model to our data
- Created predictions

There must be a mistake in at least one of these steps. 

Since we used the same embeddings for the baseline model, the first step is probably not the issue.
The architecture is very simple and hence also an unlikely candidate.
The accuracy ($>98.5\%$) reported by the *fit* function indicates that our model did get pretty good at matching the training data. 

Such a high accuracy is usually an indicator of overfitting, i.e. the model performs very well on the training data but badly on new data. For deep learning models, this typically happens when you train for too long, and the solution is to stop the training earlier.

In our case, a high accuracy was reached almost immediately. Our model found the task *very simple* to solve, which seems odd given the use case. Let's examine the data that we used to train the model.

In [4]:
train_data_path = Path('../data/train')
train_files_paths = list(train_data_path.glob('*/*.jpg'))
train_files_names = [p.name for p in train_files_paths]
train_files_classes = [p.parent.name for p in train_files_paths]
train_pairs = list(combinations(zip(train_files_names, train_files_classes), 2))

In [5]:
def is_similar(pair):
    image_one_class = pair[0][1]
    image_two_class = pair[1][1]
    if image_one_class == image_two_class:
        return 1
    else:
        return 0

In [6]:
train_labels = {pair: is_similar(pair) for pair in train_pairs}

The training data consists of all possible image combinations. A pair gets the label *1* if and only if both pictures show the same whale. If the images show different whales, the label is *0*. Let's take a closer look:

In [7]:
total = len(train_labels)
neg_count = sum(value == 0 for value in train_labels.values())
pos_count = sum(value == 1 for value in train_labels.values())
print(f"The total number of pairs is {total}")
print(f"{neg_count} pairs have the label 0, i.e. show different whales.")
print(f"{pos_count} pairs have the label 1, i.e. show the same whale.")

The total number of pairs is 10267246
10094385 pairs have the label 0, i.e. show different whales.
172861 pairs have the label 1, i.e. show the same whale.


This dataset is extremely imbalanced. Out of over 10 million examples, only around 170,000 have the label 1.

In [8]:
round(100.0*pos_count/total, 2)

1.68

That is, only $1.68\%$ of the pairs show the same whale. Given that we only used $32*100*10 = 32{,}000$ pairs (batch size * steps * epochs) for training, the chance that our model simply learned to always output *0* is pretty high.
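To see why a high accuracy is suspicious here, compare it with a model that ignores its input and always predicts *0*. The sketch below only uses the pair counts printed above; the function name `majority_baseline_accuracy` is ours and not part of the project code.

```python
def majority_baseline_accuracy(neg_count: int, pos_count: int) -> float:
    """Accuracy of a model that always predicts the more frequent label."""
    total = neg_count + pos_count
    return max(neg_count, pos_count) / total


# Pair counts taken from the cell output above
baseline = majority_baseline_accuracy(neg_count=10094385, pos_count=172861)
print(f"Always predicting 0 is correct for {100 * baseline:.2f}% of all pairs")
```

Such a constant model is already right for roughly 98.3% of all pairs, so an accuracy in that range tells us very little about whether the model learned anything about whales.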

But this is not all that went wrong. Recall that we learned in the EDA that the class with the most images was the *-1* class, i.e. the class for images that were too bad for classification.

In [9]:
bad_pics = set(train_data_path.glob('-1/*.jpg'))
bad_pairs = len(list(combinations(bad_pics, 2)))
print(f"There are {bad_pairs} pairs of bad pictures.")

There are 165600 pairs of bad pictures.


Which leaves

In [10]:
round(100.0*(pos_count - bad_pairs)/total, 2)

0.07

$0.07\%$ of all image pairs as useful positive examples, i.e. pairs that show the same identifiable whale. We can therefore explain the results of our deep learning model with *the number one rule of machine learning*:

**Garbage in - Garbage out: If training data is bad, the model predictions are useless**

<font color='blue'>
    
**Best practice:** 
    
- Always remember the number one rule of machine learning: **Garbage in - Garbage out**.

<a id='section_better_training_data'></a>
# Creating better training pairs

Let's create better data to train from. We will

- Ignore all images from the -1 class
- Ignore all whale ids with only one image
- Create a dataset where both labels appear 50% of the time

This requires some changes to our previous DataGenerator: *DataGeneratorFromEmbeddings*. The result is called *BalancedDataGeneratorFromEmbeddings* and can also be found in `src/data/embedding_generators`.

To start, we need to define our model and compute the embeddings.

In [11]:
BATCH_SIZE = 32
IMG_HEIGHT = 224
IMG_WIDTH = 224
mobilenet = MobileNet(weights='imagenet', include_top=False, pooling='avg', input_shape=(IMG_HEIGHT, IMG_WIDTH, 3))

In [12]:
train_files = tf.data.Dataset.from_tensor_slices(list(map(str, train_files_paths)))  
train_ds = train_files.map(load_image)
train_ds = train_ds.batch(BATCH_SIZE)
embed_train = mobilenet.predict(train_ds)

In [13]:
train_embeddings = {img: embedding for img, embedding in zip(train_files_names, embed_train)}
train_generator = BalancedDataGeneratorFromEmbeddings(Path('../data/train'), train_embeddings)

The *BalancedDataGeneratorFromEmbeddings* takes the path to the training data and the precomputed train embeddings and creates balanced samples. When you check out the code, you'll find that the *__parse_folder* function removes the -1 class as well as all whale ids with only one image:

```python
    def __parse_folder(self):
        """
        Creates id_to_images and image_to_ids mappings.
        Ignores all whale ids with only one image and the -1 class
        :return:
        """
        id_folders = list(self.train_data_path.glob('*'))
        id_to_images = {}
        image_to_id = {}
        for folder in id_folders:
            if folder.name == "-1":
                continue
            pics = list(folder.glob('*.jpg'))
            if len(pics) == 1:
                continue
            id_to_images[folder.name] = {p.name for p in pics}
            for p in pics:
                image_to_id[p.name] = folder.name
        return id_to_images, image_to_id

```

The balancing is done in the *__data_generation* function:

```python
    def __data_generation(self, images):
        """
        Generates data containing batch_size samples
        :param images: anchor images; each one yields a similar and a different pair
        :return:
        """
        # Initialization
        x = [np.empty((self.batch_size, self.dim)) for _i in range(2)]
        y = np.empty(self.batch_size, dtype=int)
        # Generate data
        for i, image in enumerate(images):
            output_index = 2*i

            # Store similar sample
            similar_img = self.__get_similar_image(image)
            x[0][output_index, ] = self.embeddings[image]
            x[1][output_index, ] = self.embeddings[similar_img]
            y[output_index] = 1

            # Store different sample
            different_img = self.__get_different_image(image)
            x[0][output_index+1, ] = self.embeddings[image]
            x[1][output_index+1, ] = self.embeddings[different_img]
            y[output_index+1] = 0

        return x, y
```
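The two helpers `__get_similar_image` and `__get_different_image` are not shown above. A minimal sketch of what such helpers might do is given below. It assumes mappings shaped like the output of `__parse_folder`; the whale ids, file names, and function bodies are hypothetical, and the implementation in the repository may differ.

```python
import random

# Hypothetical mappings, shaped like the output of __parse_folder
id_to_images = {
    "whale_a": {"a1.jpg", "a2.jpg"},
    "whale_b": {"b1.jpg", "b2.jpg", "b3.jpg"},
}
image_to_id = {img: wid for wid, imgs in id_to_images.items() for img in imgs}


def get_similar_image(image: str) -> str:
    """Pick a random other image of the same whale."""
    candidates = id_to_images[image_to_id[image]] - {image}
    return random.choice(sorted(candidates))


def get_different_image(image: str) -> str:
    """Pick a random image of any other whale."""
    other_ids = [wid for wid in id_to_images if wid != image_to_id[image]]
    return random.choice(sorted(id_to_images[random.choice(other_ids)]))


similar = get_similar_image("a1.jpg")
different = get_different_image("a1.jpg")
print(similar, different)
```

Because whale ids with a single image were removed while parsing the folder, the candidate set for a similar image is never empty.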

You can test the balancing by taking a sample and checking the classes:

In [14]:
sample = train_generator.get_sample(0)
sample[1]  # Show classes

array([1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0,
       1, 0, 1, 0, 1, 0, 1, 0, 1, 0])

With this, we fixed all issues we identified. Let's retrain the model and check the results.

<a id='section_retraining'></a>
# Retraining the model

We reuse the model from tutorial 2. For readability, we put the definition code in a separate file: `src/models/siamese_twin_embeddings.py`.

In [15]:
model = siamese_net_from_embeddings(0.0001)
model.summary()

Model: "model"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_2 (InputLayer)            [(None, 1024)]       0                                            
__________________________________________________________________________________________________
input_3 (InputLayer)            [(None, 1024)]       0                                            
__________________________________________________________________________________________________
tf_op_layer_sub (TensorFlowOpLa [(None, 1024)]       0           input_2[0][0]                    
                                                                 input_3[0][0]                    
__________________________________________________________________________________________________
tf_op_layer_Abs (TensorFlowOpLa [(None, 1024)]       0           tf_op_layer_sub[0][0]        

In [16]:
model.fit_generator(train_generator, steps_per_epoch=500, epochs=10)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<tensorflow.python.keras.callbacks.History at 0x5419240>

Note the difference in accuracy and loss. In tutorial 2, we got an accuracy of ~98% almost immediately.
This time, the accuracy is much lower which seems more realistic. But what kind of accuracy would we need?

Recall that for each test image, we have to predict the similarity to over 5000 images.
With an accuracy of around 70%, the chance that the correct images are within the top 20 predictions is very low.
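A rough back-of-the-envelope calculation illustrates the problem. It assumes, purely for illustration, that the 30% of errors are spread evenly over both labels and that practically all of the 5000 candidates show a different whale than the test image.

```python
candidates = 5000   # images each test image is compared against (approximate)
accuracy = 0.70     # pair accuracy observed during training (approximate)
top_k = 20          # size of the top list mentioned above

# Assumption: errors are symmetric, so about 30% of the non-matching
# candidates are wrongly scored as "same whale".
expected_false_matches = round(candidates * (1 - accuracy))

print(f"Expected false matches per test image: {expected_false_matches}")
print(f"False matches competing for each top-{top_k} slot: {expected_false_matches / top_k:.0f}")
```

Under these assumptions, the few correct images compete with well over a thousand false matches for only 20 places.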

**Exercise:**
- Create and submit new predictions with the newly trained model. What kind of score do you get?
- How do the accuracy and score change when you train the model for longer?
- How do the accuracy and score change when you add additional layers to the model?
- Check the *get_similar_image()* function of the BalancedDataGeneratorFromEmbeddings class. Is this the best way to find similar images? What happens if you change it?

<a id='section_learning_embeddings'></a>
# Learning embeddings

When the accuracy of a model isn't good, the cause is usually bad training data, too little training, or a model that isn't powerful enough.
In this section, we'll move away from precomputed embeddings and train all of MobileNet instead. The resulting model will have many more parameters and should achieve a higher accuracy. We need to make several changes:

<a id='section_from_scratch.generator'></a>
### Updating the DataGenerator

The changes we have to make to our DataGenerator class are rather small: Instead of loading the embeddings, we are now loading the pictures directly. 

In *BalancedDataGeneratorFromEmbeddings* we created the input data via:
```python
            output_index = 2 * i
    
            # Store similar sample
            similar_img = self.__get_similar_image(image)
            x[0][output_index, ] = self.embeddings[image]
            x[1][output_index, ] = self.embeddings[similar_img]
            y[output_index] = 1

            # Store different sample
            different_img = self.__get_different_image(image)
            x[0][output_index+1, ] = self.embeddings[image]
            x[1][output_index+1, ] = self.embeddings[different_img]
            y[output_index+1] = 0
```

This needs to be changed to load images instead of embeddings.
Note that we also apply a [random image augmentation](https://machinelearningmastery.com/how-to-configure-image-data-augmentation-when-training-deep-learning-neural-networks/). 

```python
            img = img_to_array(load_img(image, target_size=self.dim))
            trans_args = self.img_gen.get_random_transform(self.dim)
            img = self.img_gen.apply_transform(img, trans_args)
            img = preprocess_input(img)
            output_index = 2 * i

            # Store similar sample
            similar_img = self.__get_similar_image(image)
            sim_img = img_to_array(load_img(similar_img, target_size=self.dim))
            trans_args = self.img_gen.get_random_transform(self.dim)
            sim_img = self.img_gen.apply_transform(sim_img, trans_args)
            sim_img = preprocess_input(sim_img)
            x[0][output_index,] = img
            x[1][output_index,] = sim_img
            y[output_index] = 1

            # Store different sample
            different_img = self.__get_different_image(image)
            diff_img = img_to_array(
                load_img(different_img, target_size=self.dim)
            )
            trans_args = self.img_gen.get_random_transform(self.dim)
            diff_img = self.img_gen.apply_transform(diff_img, trans_args)
            diff_img = preprocess_input(diff_img)
            x[0][output_index + 1,] = img
            x[1][output_index + 1,] = diff_img
            y[output_index + 1] = 0
```
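Note that the code above draws a fresh random transform for every image, so even two copies of the same picture look slightly different to the network. The sketch below illustrates this idea with plain NumPy instead of the Keras image generator used in the project; the flip and brightness transforms and the function name `random_augment` are illustrative stand-ins, not the project's augmentation settings.

```python
import numpy as np

rng = np.random.default_rng(seed=0)


def random_augment(img: np.ndarray) -> np.ndarray:
    """Apply a random horizontal flip and a random brightness change."""
    if rng.random() < 0.5:
        img = img[:, ::-1, :]           # flip along the width axis
    factor = rng.uniform(0.8, 1.2)      # brightness scaling factor
    return np.clip(img * factor, 0.0, 255.0)


image = rng.uniform(0, 255, size=(224, 224, 3))

# Each call draws new random parameters, so the two views differ
# although both come from the same source image.
view_one = random_augment(image)
view_two = random_augment(image)
print(view_one.shape, np.allclose(view_one, view_two))
```

Augmenting each image of a pair independently prevents the model from matching pairs based on identical pixel-level details.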

*Note: Make sure you downloaded the latest version of the code from our [repository](http://de-mucingode1.corp.capgemini.com/gitlab/SophieY/global_data_science_challenge_3_public)*

<a id='section_from_scratch.model'></a>
### Updating the Model

In a similar way we are updating our model definition. Instead of giving the embeddings as an input, we will train a proper siamese twin network with a MobileNet architecture for the CNN part:

![siamese](https://miro.medium.com/max/1531/1*dFY5gx-Vze3micJ0AMVp0A.jpeg)

To do this, we need to update our model definition function *siamese_net_from_embeddings* from:

```python
        input_shape = [EMBED_LENGTH]
        embeddings_1 = Input(input_shape)
        embeddings_2 = Input(input_shape)
```

to:

```python
        input_shape = [IMG_HEIGHT, IMG_WIDTH, 3]
        left_input = Input(input_shape)
        right_input = Input(input_shape)

        model = MobileNet(input_shape=input_shape,
                          weights='imagenet',
                          include_top=False,
                          pooling='avg')
        
        encoded_l = model(left_input)
        encoded_r = model(right_input)
```
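The important detail is that both inputs go through the *same* encoder, so the weights are shared between the two branches. The toy sketch below shows this with NumPy only. The element-wise absolute difference mirrors the `sub` and `Abs` layers in the model summary above, while the single linear encoder layer and the sigmoid output are simplifying assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
EMBED_LENGTH = 8   # toy size; the MobileNet embeddings above have 1024 entries
encoder_weights = rng.normal(size=(16, EMBED_LENGTH))   # shared by both branches
head_weights = rng.normal(size=EMBED_LENGTH)


def encode(x: np.ndarray) -> np.ndarray:
    """Toy stand-in for the CNN: one linear layer followed by a ReLU."""
    return np.maximum(x @ encoder_weights, 0.0)


def similarity(left: np.ndarray, right: np.ndarray) -> float:
    """Encode both inputs with the same weights and score their absolute difference."""
    distance = np.abs(encode(left) - encode(right))
    return float(1.0 / (1.0 + np.exp(-(distance @ head_weights))))


a, b = rng.normal(size=16), rng.normal(size=16)
print(similarity(a, b), similarity(b, a))
```

Because of the shared weights and the absolute difference, the score does not depend on the order of the two inputs. In the project code, the shared encoder is the MobileNet instance from the snippet above.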

The resulting model is called *siamese_net_from_images_mobilenet* and can be found in `src/models/siamese_twin_images.py`.

Running this on your local machine is possible, but will take a very very very long time. Which can be a good thing:
![Training](xkcd_training.png)

To train the model, the [CPU](https://en.wikipedia.org/wiki/Central_processing_unit) of our machine has to do a lot of matrix multiplications, a task it is not particularly good at. Fortunately, [GPUs](https://en.wikipedia.org/wiki/Graphics_processing_unit) are [much faster](https://graphics.stanford.edu/papers/gpumatrixmult/gpumatrixmult.pdf) at these calculations.

<a id='aws'></a>
# Speeding things up with GPUs: AWS SageMaker

With AWS, we can rent GPUs and use them to speed up our training and predictions. AWS offers several tools to easily access their services. For the GDSC, we will be using [AWS SageMaker](https://aws.amazon.com/de/sagemaker/).

> Amazon SageMaker is a fully managed service that provides every developer and data scientist with the ability to build, train, and deploy machine learning (ML) models quickly. SageMaker removes the heavy lifting from each step of the machine learning process to make it easier to develop high quality models.
>-- <cite>https://aws.amazon.com/sagemaker/</cite>

The awesome thing about SageMaker is that it behaves *almost* like standard TensorFlow, allowing us to easily switch from local to cloud development.

### How does it work?
What happens behind the scenes is described on a high level in the [docs](https://sagemaker.readthedocs.io/en/stable/using_tf.html#what-happens-when-fit-is-called) of the TensorFlow `.fit()` method that is part of the [AWS SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable/).

> Calling `fit` starts a SageMaker training job. The training job will execute the following:
>
>Starts `train_instance_count` EC2 instances of the type `train_instance_type`.
>On each instance, it will do the following steps:
>* starts a Docker container optimized for TensorFlow.
>* downloads the dataset.
>* setup up training related environment varialbes
>* setup up distributed training environment if configured to use parameter server
>* starts asynchronous training
>-- <cite>https://sagemaker.readthedocs.io/</cite>

SageMaker takes care of all of these steps, so we do not need to worry about them.

<a id='aws_setup'></a>
# Setting up AWS SageMaker
To set up AWS SageMaker, we need:

1. an AWS account setup and credentials for the same
1. our local environment configured to be able to connect to it
1. a locally running python application for the training logic __within \*.py files__

Let's get started!

<a id='aws_setup.credentials'></a>
### AWS Account Credentials
You can find your `Access Key ID` and `Secret Access Key` credentials in the profile tab of your [GDSC account](http://gdsc.ce.capgemini.com/aws_info/).

<a id='aws_setup.account_setup'></a>
### AWS Account Setup
To set up AWS, we need to run 

```bash
aws configure
```

in the terminal to store our credentials. Enter the credentials from the previous step and confirm with *Enter*.

*Note: You can start the terminal via the Anaconda Launcher by selecting **CMD.exe Prompt** in the applications tab.*
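`aws configure` stores the keys in the shared credentials file `~/.aws/credentials`, which is the file mentioned in the log output at the top of this notebook. The sketch below shows the layout of that file and how to check it with Python's `configparser`. The key values are placeholders; never put real keys into code or notebooks.

```python
import configparser

# Placeholder values only, laid out like ~/.aws/credentials
example_credentials = """
[default]
aws_access_key_id = YOUR_ACCESS_KEY_ID
aws_secret_access_key = YOUR_SECRET_ACCESS_KEY
"""

config = configparser.ConfigParser()
config.read_string(example_credentials)

profile = config["default"]
print(sorted(profile.keys()))
```

To inspect your own setup, you could replace `read_string` with `config.read(Path.home() / ".aws" / "credentials")` and check that the `default` profile contains both keys.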

### Architecture Overview
For the GDSC we implemented a custom AWS access layer: ![AWS Overview](training_environment.png)

We'll go over the individual functions below.

<a id='aws_setup.training_script'></a>
## The Training Script

So far, we relied on Jupyter notebooks to run our code. To use SageMaker, we need to put everything in a Python script.
The script will be uploaded to AWS and executed in the cloud. We prepared a basic version for this tutorial. You can find it under `src/local_training_siamese_mobilenet_from_images.py`. The main components are as follows:

The script needs to accept parameters from the command line. The [`__main__`](https://stackoverflow.com/questions/419163/what-does-if-name-main-do) check together with [argparse](https://docs.python.org/3/library/argparse.html) takes care of this.

```python
if __name__ == "__main__":

    parser = argparse.ArgumentParser()

    # hyperparameters sent by the client are passed as command-line arguments to the script.
    parser.add_argument('--epochs', type=int, default=100)
    parser.add_argument('--learning_rate', type=float, default=0.001)
    parser.add_argument('--batch_size', type=int, default=32)
    parser.add_argument('--steps_per_epoch', type=int, default=100)
    parser.add_argument('--validation_steps', type=int, default=10)

    # input data and model directories
    parser.add_argument('--model_dir', type=str, default=os.environ.get('SM_MODEL_DIR'))
    parser.add_argument('--train', type=str, default=os.environ.get('SM_CHANNEL_TRAIN'))
    parser.add_argument('--test', type=str, default=os.environ.get('SM_CHANNEL_TEST'))

    args, _ = parser.parse_known_args()
```
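One consequence of `parse_known_args` is worth knowing: arguments that the parser does not recognise are not rejected but silently collected in the second return value. The reduced sketch below passes an explicit argument list instead of reading `sys.argv`; the misspelled `--learning-rate` flag is a deliberate example.

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--epochs', type=int, default=100)
parser.add_argument('--learning_rate', type=float, default=0.001)

# An explicit list makes the behaviour easy to try out in a notebook.
args, unknown = parser.parse_known_args(['--epochs=2', '--learning-rate=0.0001'])

print(args.epochs)          # value taken from the argument list
print(args.learning_rate)   # still the default: the hyphenated flag was not recognised
print(unknown)              # unrecognised arguments end up here
```

If a hyperparameter does not seem to have any effect, it is worth logging the unknown arguments to rule out a spelling mismatch.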

The rest of the code follows our previous setup:

```python
    logging.info('Instantiating TrainGeneratorFromImages with args: %s' % args.train)
    train_data_generator = BalancedDataGeneratorFromImages(args.train)
    logging.info('Invoking siamese_net_from_images_mobilenet with learning_rate: %s' % args.learning_rate)
    model = siamese_net_from_images_mobilenet(args.learning_rate)
```

This creates the data generator and instantiates the model.

The model gets fitted on the data:

```python
    model.fit_generator(
        train_data_generator,
        steps_per_epoch=args.steps_per_epoch,
        epochs=args.epochs,
        callbacks=[tensorboard_callback],
    )
```

Note the callback. [TensorBoard](https://www.tensorflow.org/tensorboard) allows you to visualize the progress of your training and can be used to debug your model.

Finally, the predictions are created and written to a file:


```python
    predictions = predict_siamese_twin_mobilenet_model(args.train, args.eval, model)
    write_prediction_file(predictions, os.path.join(model_dir, "predictions.csv"))    
```

If we used the full model to create the predictions, it would take a **lot of time** because we would have to run the CNN on both images of every single pair. `predict_siamese_twin_mobilenet_model` instead extracts the CNN part of our model, precomputes the embeddings of all images once, and then only runs the embedding comparison on the pairs.
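The saving can be illustrated with a toy example that counts how often the expensive encoder runs. The functions `encode` and `compare` are hypothetical stand-ins for the CNN and the comparison head, and the image counts are made up.

```python
from itertools import product

calls = {"encoder": 0}


def encode(image: str) -> int:
    """Stand-in for the expensive CNN forward pass (hypothetical)."""
    calls["encoder"] += 1
    return len(image)


def compare(emb_one: int, emb_two: int) -> int:
    """Stand-in for the cheap comparison of two embeddings (hypothetical)."""
    return abs(emb_one - emb_two)


train_images = [f"train_{i}.jpg" for i in range(50)]
test_images = [f"test_{i}.jpg" for i in range(10)]

# Encode every image exactly once, then compare cached embeddings for all pairs.
embeddings = {img: encode(img) for img in train_images + test_images}
scores = {
    (test, train): compare(embeddings[test], embeddings[train])
    for test, train in product(test_images, train_images)
}

naive_calls = 2 * len(test_images) * len(train_images)
print(f"{calls['encoder']} encoder calls instead of {naive_calls} for {len(scores)} pairs")
```

The number of encoder calls grows with the number of images, while the naive approach grows with the number of pairs.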

Finally, we need to test that we made no mistakes by running the script on our local machine. To do this, open your terminal, go to your *src* folder and run:

```bash
python local_training_siamese_mobilenet_from_images.py --epochs=2 --steps_per_epoch=2 --learning_rate=0.0001 --batch_size=32 --train=../data/train --eval=../data/test_val
```

Note the small values for *epochs* and *steps_per_epoch*. We only want to test that the general setup works, not run a proper training. If everything worked, you'll find the predictions under *src/trained_models/predictions.csv*.

<font color='blue'>
    
**Best practice:** 
    
- Test your script on your local machine before running it on SageMaker. This saves you time and money.

<a id='aws_setup.training_in_the_cloud'></a>
## Training in the Cloud

When you are sure that the training script runs properly on your local machine it is time for the next step: using your script to start a SageMaker training job. An example is provided under `src/remote_training_siamese_mobilenet_from_images.py`.

The script doesn't call SageMaker directly, but instead uses the custom backend from the architecture overview above. You can easily adapt it to your needs. Let's go over the main points:

In [17]:
upload_code_folder_to_s3()

2020-03-30 18:31:51,780 remote_sagemaker retrieve_team_ddn_config_record line 59 Downloaded team config: {'team_name': 'GDSC Tutorials', 'team_user_name': 'GDSCTutorials', 'team_role_name': 'GDSCTutorials-role', 'team_sm_role_name': 'arn:aws:iam::880110969874:role/smScalingQueuer-StartJobFunctionRole-1XX6KGVEV83UL', 'team_id': '236', 'team_region': 'us-east-1', 'team_regional_bucket_name': 'all-data-all-participants-us-east-1'} from URL: https://z8js7f1x0e.execute-api.eu-west-1.amazonaws.com/Prod/get_team_details/
2020-03-30 18:31:52,868 credentials load line 1196 Found credentials in shared credentials file: ~/.aws/credentials
2020-03-30 18:31:55,758 remote_sagemaker upload_code_folder_to_s3 line 278 Uploading local file C:\Users\dkuehlwe\PycharmProjects\global_data_science_challenge_3_public\src\utils\..\data\embedding_generators.py to key GDSCTutorials/training_code_latest/data/embedding_generators.py


Found folder: C:\Users\dkuehlwe\PycharmProjects\global_data_science_challenge_3_public\src\utils\..\data


2020-03-30 18:31:56,167 remote_sagemaker upload_code_folder_to_s3 line 278 Uploading local file C:\Users\dkuehlwe\PycharmProjects\global_data_science_challenge_3_public\src\utils\..\data\image_generators.py to key GDSCTutorials/training_code_latest/data/image_generators.py
2020-03-30 18:31:56,435 remote_sagemaker upload_code_folder_to_s3 line 278 Uploading local file C:\Users\dkuehlwe\PycharmProjects\global_data_science_challenge_3_public\src\utils\..\data\make_dataset.py to key GDSCTutorials/training_code_latest/data/make_dataset.py
2020-03-30 18:31:56,891 remote_sagemaker upload_code_folder_to_s3 line 278 Uploading local file C:\Users\dkuehlwe\PycharmProjects\global_data_science_challenge_3_public\src\utils\..\data\__init__.py to key GDSCTutorials/training_code_latest/data/__init__.py
2020-03-30 18:31:57,190 remote_sagemaker upload_code_folder_to_s3 line 295 Uploading local file C:\Users\dkuehlwe\PycharmProjects\global_data_science_challenge_3_public\src\utils\..\download_job_logs.py

Found folder: C:\Users\dkuehlwe\PycharmProjects\global_data_science_challenge_3_public\src\utils\..\models


2020-03-30 18:31:59,342 remote_sagemaker upload_code_folder_to_s3 line 278 Uploading local file C:\Users\dkuehlwe\PycharmProjects\global_data_science_challenge_3_public\src\utils\..\models\siamese_twin_images.py to key GDSCTutorials/training_code_latest/models/siamese_twin_images.py
2020-03-30 18:31:59,654 remote_sagemaker upload_code_folder_to_s3 line 278 Uploading local file C:\Users\dkuehlwe\PycharmProjects\global_data_science_challenge_3_public\src\utils\..\models\siamese_twin_predictions.py to key GDSCTutorials/training_code_latest/models/siamese_twin_predictions.py
2020-03-30 18:31:59,975 remote_sagemaker upload_code_folder_to_s3 line 278 Uploading local file C:\Users\dkuehlwe\PycharmProjects\global_data_science_challenge_3_public\src\utils\..\models\__init__.py to key GDSCTutorials/training_code_latest/models/__init__.py
2020-03-30 18:32:00,273 remote_sagemaker upload_code_folder_to_s3 line 295 Uploading local file C:\Users\dkuehlwe\PycharmProjects\global_data_science_challenge_

Found folder: C:\Users\dkuehlwe\PycharmProjects\global_data_science_challenge_3_public\src\utils\..\utils


2020-03-30 18:32:01,726 remote_sagemaker upload_code_folder_to_s3 line 278 Uploading local file C:\Users\dkuehlwe\PycharmProjects\global_data_science_challenge_3_public\src\utils\..\utils\__init__.py to key GDSCTutorials/training_code_latest/utils/__init__.py
2020-03-30 18:32:02,033 remote_sagemaker upload_code_folder_to_s3 line 295 Uploading local file C:\Users\dkuehlwe\PycharmProjects\global_data_science_challenge_3_public\src\utils\..\__init__.py to key GDSCTutorials/training_code_latest/__init__.py
2020-03-30 18:32:02,399 remote_sagemaker upload_code_folder_to_s3 line 302 Uploaded code to bucket all-data-all-participants-us-east-1 with prefix GDSCTutorials/training_code_latest


In [18]:
job_status = start_remote_sagemaker_job(
    base_job_name='Test',
    # This MUST point to a file, relative to the src/ folder.
    # In this very example we use the provided local training script.
    entry_point='local_training_siamese_mobilenet_from_images.py',
    # Tweak your hyperparams here.
    hyperparams={
        'epochs': 1,
        'learning_rate': 0.001, 
        'batch_size': 32,
        'steps_per_epoch': 10
    },
)

Note the *entry_point* and *hyperparams* parameters. You might want to change them when you start creating your own models.
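As the comment in the training script states, the hyperparameters are passed to the script as command-line arguments. The keys of the `hyperparams` dictionary therefore have to match the argument names defined with argparse. The sketch below mimics this hand-over locally; the exact `--name value` layout of the argument list is an assumption made for illustration.

```python
import argparse

hyperparams = {'epochs': 1, 'learning_rate': 0.001, 'batch_size': 32, 'steps_per_epoch': 10}

# Assumption: each entry reaches the script as "--name value".
argv = [part for name, value in hyperparams.items() for part in (f'--{name}', str(value))]

parser = argparse.ArgumentParser()
parser.add_argument('--epochs', type=int, default=100)
parser.add_argument('--learning_rate', type=float, default=0.001)
parser.add_argument('--batch_size', type=int, default=32)
parser.add_argument('--steps_per_epoch', type=int, default=100)
args, _ = parser.parse_known_args(argv)

print(argv)
print(args)
```

When you add a new hyperparameter, remember to add the matching `parser.add_argument` call to your training script as well.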

In [19]:
job_status

'Job created with job name: Test-GDSCTutorials-2020-03-30-16-32-09-276'

Like the previous script, you can start this script from the command line via

```bash
python remote_training_siamese_mobilenet_from_images.py 
```

Note that you do not need to pass any additional parameters since we hardcoded the values for epochs, learning_rate, etc. in the script. You also do not need to provide the *train* and *test_val* folders since they have already been uploaded and are referenced for you.

<a id='aws_setup.getting_the_results'></a>
## Getting the results

You can use the following Python scripts to interact with your job:

Monitoring the job on a high level
```
$ python src/observe_job_status.py Test-GDSCTutorials-2020-03-30-16-32-09-276
```
or get low level info with
```
$ python src/get_job_details.py Test-GDSCTutorials-2020-03-30-16-32-09-276
```
or

In [25]:
get_job_details('Test-GDSCTutorials-2020-03-30-16-32-09-276')

{'TrainingJobName': 'Test-GDSCTutorials-2020-03-30-16-32-09-276',
 'TrainingJobArn': 'arn:aws:sagemaker:us-east-1:880110969874:training-job/test-gdsctutorials-2020-03-30-16-32-09-276',
 'TrainingJobStatus': 'InProgress',
 'SecondaryStatus': 'Downloading',
 'HyperParameters': {'batch_size': '32',
  'epochs': '1',
  'learning_rate': '0.001',
  'model_dir': '"s3://all-data-all-participants-us-east-1/GDSCTutorials/trained_model_latest/Test-GDSCTutorials-2020-03-30-16-32-09-276/model"',
  'sagemaker_container_log_level': '20',
  'sagemaker_enable_cloudwatch_metrics': 'false',
  'sagemaker_job_name': '"Test-GDSCTutorials-2020-03-30-16-32-09-276"',
  'sagemaker_program': '"local_training_siamese_mobilenet_from_images.py"',
  'sagemaker_region': '"us-east-1"',
  'sagemaker_submit_directory': '"s3://all-data-all-participants-us-east-1/GDSCTutorials/sagemaker_code_artifacts/Test-GDSCTutorials-2020-03-30-16-32-09-276/source/sourcedir.tar.gz"',
  'steps_per_epoch': '10'},
 'AlgorithmSpecification'

You can list your running jobs with
```
$ python src/list_my_running_jobs.py
```
or

In [27]:
list_all_running_jobs()

[{'TrainingJobName': 'Test-GDSCTutorials-2020-03-30-16-32-09-276',
  'TrainingJobArn': 'arn:aws:sagemaker:us-east-1:880110969874:training-job/test-gdsctutorials-2020-03-30-16-32-09-276',
  'CreationTime': '2020-03-30 16:32:11.340000+00:00',
  'LastModifiedTime': '2020-03-30 16:36:50.334000+00:00',
  'TrainingJobStatus': 'InProgress'}]
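If you want to wait for a job from within Python, you can poll its status in a loop. The sketch below is one possible way to do this. `wait_for_job` is a hypothetical helper, not part of the project code, and a fake status function replaces `get_job_details` so that the example runs without AWS access.

```python
import time
from typing import Callable


def wait_for_job(job_name: str,
                 fetch_details: Callable[[str], dict],
                 poll_seconds: float = 60.0,
                 max_polls: int = 120) -> str:
    """Poll a job until it leaves the 'InProgress' state or max_polls is reached."""
    status = 'InProgress'
    for _ in range(max_polls):
        status = fetch_details(job_name)['TrainingJobStatus']
        if status != 'InProgress':
            break
        time.sleep(poll_seconds)
    return status


# Stand-in for get_job_details so the sketch runs without AWS access.
_fake_responses = iter(['InProgress', 'InProgress', 'Completed'])


def fake_get_job_details(job_name: str) -> dict:
    return {'TrainingJobName': job_name, 'TrainingJobStatus': next(_fake_responses)}


final_status = wait_for_job('Test-job', fake_get_job_details, poll_seconds=0.0)
print(final_status)
```

With the real backend, the call would look like `wait_for_job(job_name, get_job_details)`. Keep the polling interval generous, since a training job usually runs for many minutes.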

Download the detailed logs via
```
$ python src/download_job_logs.py Test-GDSCTutorials-2020-03-30-16-32-09-276
```
or

In [29]:
download_sagemaker_job_logs('Test-GDSCTutorials-2020-03-30-16-32-09-276')

2020-03-30 18:43:36,519 remote_sagemaker retrieve_team_ddn_config_record line 59 Downloaded team config: {'team_name': 'GDSC Tutorials', 'team_user_name': 'GDSCTutorials', 'team_role_name': 'GDSCTutorials-role', 'team_sm_role_name': 'arn:aws:iam::880110969874:role/smScalingQueuer-StartJobFunctionRole-1XX6KGVEV83UL', 'team_id': '236', 'team_region': 'us-east-1', 'team_regional_bucket_name': 'all-data-all-participants-us-east-1'} from URL: https://z8js7f1x0e.execute-api.eu-west-1.amazonaws.com/Prod/get_team_details/


'C:\\Users\\dkuehlwe\\PycharmProjects\\global_data_science_challenge_3_public\\src\\utils\\..\\..\\logs\\Test-GDSCTutorials-2020-03-30-16-32-09-276.txt'

After running this, you can see the log in the *logs* folder.

Note: This is your only way to access loss and accuracy of your model while it is training! 
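Once the log file is downloaded, you can extract the metrics with a regular expression. The sketch below works on a made-up log excerpt. The exact line format depends on your Keras version and logging setup, so treat the sample text and the pattern as assumptions that you may need to adapt.

```python
import re

# Hypothetical log excerpt; real logs may look different.
sample_log = """
Epoch 1/2
10/10 [==============================] - 12s 1s/step - loss: 0.6931 - accuracy: 0.5125
Epoch 2/2
10/10 [==============================] - 11s 1s/step - loss: 0.6540 - accuracy: 0.6031
"""

METRIC_PATTERN = re.compile(r'loss: (?P<loss>\d+\.\d+) - acc(?:uracy)?: (?P<accuracy>\d+\.\d+)')


def extract_metrics(log_text: str) -> list:
    """Return one (loss, accuracy) tuple per matching log line."""
    return [
        (float(m.group('loss')), float(m.group('accuracy')))
        for m in METRIC_PATTERN.finditer(log_text)
    ]


metrics = extract_metrics(sample_log)
print(metrics)
```

For a real job, read the text from the path returned by `download_sagemaker_job_logs`, e.g. with `Path(log_path).read_text()`.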

If the results do not look good, you can stop a running job via
```
$ python src/stop_job.py Test-GDSCTutorials-2020-03-30-16-32-09-276
```
or
```python
stop_job('Test-GDSCTutorials-2020-03-30-16-32-09-276')
```

After the training is finished, you can download all created artifacts, i.e. the logs and, most importantly, the predictions, via
```
$ python src/download_job_results.py Test-GDSCTutorials-2020-03-30-16-32-09-276
```

In [31]:
download_sagemaker_job_results('Test-GDSCTutorials-2020-03-30-16-32-09-276')

Will download training results to C:\Users\dkuehlwe\PycharmProjects\global_data_science_challenge_3_public\src\utils\..\..\trained_models\Test-GDSCTutorials-2020-03-30-16-32-09-276


'C:\\Users\\dkuehlwe\\PycharmProjects\\global_data_science_challenge_3_public\\src\\utils\\..\\..\\trained_models\\Test-GDSCTutorials-2020-03-30-16-32-09-276'

In [32]:
extract_results('Test-GDSCTutorials-2020-03-30-16-32-09-276')

This will download and extract all outputs in the folder *trained_models*.

<a id='aws_setup.rules'></a>
## Rules

**Please be mindful when using AWS resources. We have a fixed budget that is shared among all participants.**

When using AWS SageMaker, we ask you to follow these rules:

- Only use SageMaker when you really need GPU power.
- Only use SageMaker for experiments that are well thought out and planned. You should have a clear goal and new learning for each run.
- Only use SageMaker for the GDSC.

We keep track of how many resources each team uses. If you overspend, we may disable your account.

<a id='next_steps'></a>
# Next steps

- Do your research. How did other people solve similar problems? 
- Keep a list of things that could be improved. Prioritize them and do one experiment at a time. 
- *Optionally:* Read the tutorials from the [last challenge](http://de-mucingode1.corp.capgemini.com/gitlab/dkuehlwein/global_data_science_challenge_2_public) on how to structure your experiments.

Some things we think can help you get a good result are:
- Read the approaches of the [humpback detection competition](https://www.kaggle.com/martinpiotte/whale-recognition-model-with-score-0-78563) on Kaggle. There are many great ideas there. **The GDSC Tutorial submission with a score of 816 was created by finetuning this model.**
- The triplet loss idea behind [FaceNet](https://arxiv.org/abs/1503.03832)
- Experiment with different image sizes. Is 224x224 big enough to detect everything?
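To give a first idea of the triplet loss mentioned above: instead of classifying pairs, the model sees an anchor image, a positive image (same whale), and a negative image (different whale), and is penalised unless the negative is farther from the anchor than the positive by some margin. The following sketch uses plain Python lists as toy embeddings; the margin value and the example vectors are chosen for illustration only.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two equally long vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Zero once the negative is at least `margin` farther away than the positive."""
    return max(squared_distance(anchor, positive)
               - squared_distance(anchor, negative) + margin, 0.0)


anchor = [0.0, 1.0]
positive = [0.0, 0.5]   # same whale: close to the anchor
negative = [1.0, 1.0]   # different whale: farther away

print(triplet_loss(anchor, positive, negative))   # negative is far enough away
print(triplet_loss(anchor, negative, positive))   # roles swapped: the loss becomes positive
```

In a real model the vectors would be the embeddings produced by the CNN, and the loss would be averaged over a batch of triplets.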

<a id='summary'></a>
# Summary

In this tutorial we
- Analysed the problems of the embeddings model
- Learned how to use AWS to train larger and better models
- Discussed potential further improvements

You can now start building your own models! Collaborate with the other participants on the low-hanging fruit.
Things like
- code for cropping the fluke
- finding data errors 
- setting up TensorBoard

are best done in a team effort.