# Tensor analysis using Amazon SageMaker Debugger

Looking at the distributions of activation inputs/outputs, gradients and weights per layer can give useful insights. For instance, it helps to understand whether the model runs into problems like neuron saturation, whether there are layers in your model that are not learning at all or whether the network consists of too many layers etc. 

The following animation shows the distribution of gradients of a convolutional layer from an example application  as the training progresses. We can see that it starts as Gaussian distribution but then becomes more and more narrow. We can also see that the range of gradients starts very small (order of $1e-5$) and becomes even tinier as training progresses. If tiny gradients are observed from the start of training, it is an indication that we should check the hyperparameters of our model. 

![](images/example.gif)

In this notebook we will train a poorly configured neural network and use Amazon SageMaker Debugger with custom rules to aggregate and analyse specific tensors. Before we proceed let us install the smdebug binary which allows us to perform interactive analysis in this notebook. After installing it, please restart the kernel, and when you come back skip this cell.


### Configuring the inputs for the training job

Now we'll call the Sagemaker MXNet Estimator to kick off a training job . The `entry_point_script` points to the MXNet training script. The users can create a custom *SessionHook* in their training script. If they chose not to create such hook in the training script (similar to the one we will be using in this example) Amazon SageMaker Debugger will create the appropriate *SessionHook* based on specified *DebugHookConfig* parameters.

The `hyperparameters` are the parameters that will be passed to the training script. We choose `Uniform(1)` as initializer and learning rate of `0.001`. This leads to the model not training well because the model is poorly initialized.

The goal of a good intialization is 
- to break the symmetry such that parameters do not receive same gradients and updates
- to keep variance similar across layers

A bad intialization may lead to vanishing or exploiding gradients and the model not training at all. Once the training is finished we will look at the distirbutions of activation inputs/outputs, gradients and weights across the training to see how these hyperparameters influenced the training.


In [1]:
entry_point_script = 'mnist.py'
bad_hyperparameters = {'initializer': 2, 'lr': 0.001}

In [2]:
import sagemaker
from sagemaker.mxnet import MXNet
from sagemaker.debugger import DebuggerHookConfig, CollectionConfig
import boto3
import os

sagemaker_session = sagemaker.Session()
BUCKET_NAME = sagemaker_session.default_bucket()
LOCATION_IN_BUCKET = 'smdebug-mnist-tensor-analysis'

s3_bucket_for_tensors = 's3://{BUCKET_NAME}/{LOCATION_IN_BUCKET}'.format(BUCKET_NAME=BUCKET_NAME, LOCATION_IN_BUCKET=LOCATION_IN_BUCKET)
estimator = MXNet(role=sagemaker.get_execution_role(),
                  base_job_name='mxnet',
                  train_instance_count=1,
                  train_instance_type='ml.m5.xlarge',
                  train_volume_size=400,
                  source_dir='src',
                  entry_point=entry_point_script,
                  hyperparameters=bad_hyperparameters,
                  framework_version='1.6.0',
                  py_version='py3',
                  debugger_hook_config = DebuggerHookConfig(
                      s3_output_path=s3_bucket_for_tensors,  
                      collection_configs=[
                        CollectionConfig(
                            name="all",
                            parameters={
                                "include_regex": ".*",
                                "save_interval": "100"
                            }
                        )
                     ]
                   )
                )

Start the training job

In [3]:
estimator.fit(wait=False)

### Get S3 location of tensors

We can get information related to the training job:

In [9]:
job_name = estimator.latest_training_job.name
client = estimator.sagemaker_session.sagemaker_client
description = client.describe_training_job(TrainingJobName=job_name)
description

{'TrainingJobName': 'mxnet-2019-12-09-08-07-37-389',
 'TrainingJobArn': 'arn:aws:sagemaker:us-west-2:453691756499:training-job/mxnet-2019-12-09-08-07-37-389',
 'ModelArtifacts': {'S3ModelArtifacts': 's3://sagemaker-us-west-2-453691756499/mxnet-2019-12-09-08-07-37-389/output/model.tar.gz'},
 'TrainingJobStatus': 'Completed',
 'SecondaryStatus': 'Completed',
 'HyperParameters': {'initializer': '2',
  'lr': '0.001',
  'sagemaker_container_log_level': '20',
  'sagemaker_enable_cloudwatch_metrics': 'false',
  'sagemaker_job_name': '"mxnet-2019-12-09-08-07-37-389"',
  'sagemaker_program': '"mnist.py"',
  'sagemaker_region': '"us-west-2"',
  'sagemaker_submit_directory': '"s3://sagemaker-us-west-2-453691756499/mxnet-2019-12-09-08-07-37-389/source/sourcedir.tar.gz"'},
 'AlgorithmSpecification': {'TrainingImage': '763104351884.dkr.ecr.us-west-2.amazonaws.com/mxnet-training:1.6.0-cpu-py3',
  'TrainingInputMode': 'File',
  'EnableSageMakerMetricsTimeSeries': True},
 'RoleArn': 'arn:aws:iam::45369

We can retrieve the S3 location of the tensors:

In [10]:
path = estimator.latest_job_debugger_artifacts_path()
print('Tensors are stored in: ', path)

Tensors are stored in:  s3://sagemaker-us-west-2-453691756499/smdebug-mnist-tensor-analysis/mxnet-2019-12-09-08-07-37-389/debug-output


### Download tensors from S3

Now we will download the tensors from S3, so that we can visualize them in our notebook.

In [19]:
folder_name = "/tmp/{}".format(path.split("/")[-1])
os.system("aws s3 cp --recursive {} {}".format(path,folder_name))
print('Downloading tensors into folder: ', folder_name)

Downloading tensors into folder:  /tmp/debug-output


Now that we have obtained the tensors from our training job, it is time to plot the distribution of different layers. 
In the following sections we will use Amazon SageMaker Debugger and custom rules to retrieve certain tensors. Typically, rules are supposed to return True or False. However in this notebook we will use custom rules to return dictionaries of aggregated tensors per layer and step, which we then plot afterwards.

### Activation outputs
This rule will use Amazon SageMaker Debugger to retrieve tensors from the ReLU output layers. It sums the activations across batch and steps. If there is a large fraction of ReLUs outputing 0 across many steps it means that the neuron is dying.

In [20]:
from smdebug.trials import create_trial
from smdebug.rules.rule_invoker import invoke_rule
from smdebug.exceptions import NoMoreData
from smdebug.rules.rule import Rule
import numpy as np
import utils
import collections
import os
from IPython.display import Image

In [17]:
trial = create_trial('/tmp/debug-output')

[2019-12-09 08:17:04.605 ip-172-16-62-40:11373 INFO local_trial.py:35] Loading trial debug-output at path /tmp/debug-output


In [18]:
trial

[2019-12-09 08:17:11.211 ip-172-16-62-40:11373 INFO trial.py:197] Training has ended, will refresh one final time in 1 sec.
[2019-12-09 08:17:12.213 ip-172-16-62-40:11373 INFO trial.py:209] Loaded all steps


<smdebug.trials.local_trial.LocalTrial object at 0x7f2c1ef259b0>:(
    name=debug-output,
    path=/tmp/debug-output,
    steps=[0, 100, 200, 300, 400, 500, 600, 700, 800, 900, 1000, 1100, 1200, 1300, 1400],
    collections=['weights', 'biases', 'gradients', 'losses', 'scalars', 'default', 'all'],
    tensor_names=['conv0_bias', 'conv0_input_0', 'conv0_output_0', 'conv0_relu_input_0', 'conv0_relu_output_0', 'conv0_weight', 'conv1_bias', 'conv1_input_0', 'conv1_output_0', 'conv1_relu_input_0', 'conv1_relu_output_0', 'conv1_weight', 'dense0_bias', 'dense0_input_0', 'dense0_output_0', 'dense0_relu_input_0', 'dense0_relu_output_0', 'dense0_weight', 'dense1_bias', 'dense1_input_0', 'dense1_output_0', 'dense1_relu_input_0', 'dense1_relu_output_0', 'dense1_weight', 'dense2_bias', 'dense2_input_0', 'dense2_output_0', 'dense2_weight', 'flatten0_input_0', 'flatten0_output_0', 'gradient/conv0_bias', 'gradient/conv0_weight', 'gradient/conv1_bias', 'gradient/conv1_weight', 'gradient/dense0_bias', '

In [21]:
class ActivationOutputs(Rule):
    def __init__(self, base_trial):
        super().__init__(base_trial)  
        self.tensors = collections.OrderedDict() 
    
    def invoke_at_step(self, step):
        for tname in self.base_trial.tensor_names(regex='.*relu_output'):
            if "gradients" not in tname:
                try:
                    tensor = self.base_trial.tensor(tname).value(step)
                    if tname not in self.tensors:
                        self.tensors[tname] = collections.OrderedDict()
                    if step not in self.tensors[tname]:
                        self.tensors[tname][step] = 0
                    neg_values = np.where(tensor <= 0)[0]
                    if len(neg_values) > 0:
                        self.logger.info(f" Step {step} tensor  {tname}  has {len(neg_values)/tensor.size*100}% activation outputs which are smaller than 0 ")
                    batch_over_sum = np.sum(tensor, axis=0)/tensor.shape[0]
                    self.tensors[tname][step] += batch_over_sum
                except:
                    self.logger.warning(f"Can not fetch tensor {tname}")
        return False

trial = create_trial(folder_name)
rule = ActivationOutputs(trial)
try:
    invoke_rule(rule)
except NoMoreData:
    print('The training has ended and there is no more data to be analyzed. This is expected behavior.')


[2019-12-09 08:17:44.288 ip-172-16-62-40:11373 INFO local_trial.py:35] Loading trial debug-output at path /tmp/debug-output
[2019-12-09 08:17:44.301 ip-172-16-62-40:11373 INFO rule_invoker.py:10] Started execution of rule ActivationOutputs at step 0
[2019-12-09 08:17:44.302 ip-172-16-62-40:11373 INFO trial.py:197] Training has ended, will refresh one final time in 1 sec.
[2019-12-09 08:17:45.304 ip-172-16-62-40:11373 INFO trial.py:209] Loaded all steps
[2019-12-09 08:17:45.319 ip-172-16-62-40:11373 INFO <ipython-input-21-c7f9e2eb0647>:17]  Step 0 tensor  conv0_relu_output_0  has 48.73792860243056% activation outputs which are smaller than 0 
[2019-12-09 08:17:45.325 ip-172-16-62-40:11373 INFO <ipython-input-21-c7f9e2eb0647>:17]  Step 0 tensor  conv1_relu_output_0  has 51.16064453125% activation outputs which are smaller than 0 
[2019-12-09 08:17:45.326 ip-172-16-62-40:11373 INFO <ipython-input-21-c7f9e2eb0647>:17]  Step 0 tensor  dense0_relu_output_0  has 53.26822916666667% activation 

Plot the histograms

In [22]:
utils.create_interactive_matplotlib_histogram(rule.tensors, filename='images/activation_outputs.gif')

In [24]:
Image(url='images/activation_outputs.gif')

### Activation Inputs
In this rule we look at the inputs into activation function, rather than the output. This can be helpful to understand if there are extreme negative or positive values that saturate the activation functions. 

In [25]:
class ActivationInputs(Rule):
    def __init__(self, base_trial):
        super().__init__(base_trial)  
        self.tensors = collections.OrderedDict() 
        
    def invoke_at_step(self, step):
        for tname in self.base_trial.tensor_names(regex='.*relu_input'):
            if "gradients" not in tname:
                try:
                    tensor = self.base_trial.tensor(tname).value(step)
                    if tname not in self.tensors:
                        self.tensors[tname] = {}
                    if step not in self.tensors[tname]:
                        self.tensors[tname][step] = 0
                    neg_values = np.where(tensor <= 0)[0]
                    if len(neg_values) > 0:
                        self.logger.info(f" Tensor  {tname}  has {len(neg_values)/tensor.size*100}% activation inputs which are smaller than 0 ")
                    batch_over_sum = np.sum(tensor, axis=0)/tensor.shape[0]
                    self.tensors[tname][step] += batch_over_sum
                except:
                    self.logger.warning(f"Can not fetch tensor {tname}")
        return False

trial = create_trial(folder_name)
rule = ActivationInputs(trial)
try:
    invoke_rule(rule)
except NoMoreData:
    print('The training has ended and there is no more data to be analyzed. This is expected behavior.')


[2019-12-09 08:19:08.848 ip-172-16-62-40:11373 INFO local_trial.py:35] Loading trial debug-output at path /tmp/debug-output
[2019-12-09 08:19:08.861 ip-172-16-62-40:11373 INFO rule_invoker.py:10] Started execution of rule ActivationInputs at step 0
[2019-12-09 08:19:08.862 ip-172-16-62-40:11373 INFO trial.py:197] Training has ended, will refresh one final time in 1 sec.
[2019-12-09 08:19:09.864 ip-172-16-62-40:11373 INFO trial.py:209] Loaded all steps
[2019-12-09 08:19:09.871 ip-172-16-62-40:11373 INFO <ipython-input-25-92e74c737aaa>:17]  Tensor  conv0_relu_input_0  has 48.73792860243056% activation inputs which are smaller than 0 
[2019-12-09 08:19:09.875 ip-172-16-62-40:11373 INFO <ipython-input-25-92e74c737aaa>:17]  Tensor  conv1_relu_input_0  has 51.16064453125% activation inputs which are smaller than 0 
[2019-12-09 08:19:09.877 ip-172-16-62-40:11373 INFO <ipython-input-25-92e74c737aaa>:17]  Tensor  dense0_relu_input_0  has 53.26822916666667% activation inputs which are smaller th

Plot the histograms

In [26]:
utils.create_interactive_matplotlib_histogram(rule.tensors, filename='images/activation_inputs.gif')

We can see that second convolutional layer `conv1_relu_input_0` receives only negative input values, which means that all ReLUs in this layer output 0.

In [28]:
Image(url='images/activation_inputs.gif')

### Gradients
The following code retrieves the gradients and plots their distribution. If variance is tiny, that means that the model parameters do not get updated effectively with each training step or that the training has converged to a minimum.

In [29]:
class GradientsLayer(Rule):
    def __init__(self, base_trial):
        super().__init__(base_trial)  
        self.tensors = collections.OrderedDict()  
        
    def invoke_at_step(self, step):
        for tname in self.base_trial.tensor_names(regex='.*gradient'):
            try:
                tensor = self.base_trial.tensor(tname).value(step)
                if tname not in self.tensors:
                    self.tensors[tname] = {}

                self.logger.info(f" Tensor  {tname}  has gradients range: {np.min(tensor)} {np.max(tensor)} ")
                self.tensors[tname][step] = tensor
            except:
                self.logger.warning(f"Can not fetch tensor {tname}")
        return False

trial = create_trial(folder_name)
rule = GradientsLayer(trial)
try:
    invoke_rule(rule)
except NoMoreData:
    print('The training has ended and there is no more data to be analyzed. This is expected behavior.')

[2019-12-09 08:21:08.745 ip-172-16-62-40:11373 INFO local_trial.py:35] Loading trial debug-output at path /tmp/debug-output
[2019-12-09 08:21:08.756 ip-172-16-62-40:11373 INFO rule_invoker.py:10] Started execution of rule GradientsLayer at step 0
[2019-12-09 08:21:08.758 ip-172-16-62-40:11373 INFO trial.py:197] Training has ended, will refresh one final time in 1 sec.
[2019-12-09 08:21:09.759 ip-172-16-62-40:11373 INFO trial.py:209] Loaded all steps
[2019-12-09 08:21:09.761 ip-172-16-62-40:11373 INFO <ipython-input-29-1efdd7f3ed18>:13]  Tensor  gradient/conv0_bias  has gradients range: -5419.91064453125 28531.94921875 
[2019-12-09 08:21:09.762 ip-172-16-62-40:11373 INFO <ipython-input-29-1efdd7f3ed18>:13]  Tensor  gradient/conv0_weight  has gradients range: -13531.4794921875 40225.25390625 
[2019-12-09 08:21:09.763 ip-172-16-62-40:11373 INFO <ipython-input-29-1efdd7f3ed18>:13]  Tensor  gradient/conv1_bias  has gradients range: -1851.815673828125 10187.85546875 
[2019-12-09 08:21:09.764

Plot the histograms

In [30]:
utils.create_interactive_matplotlib_histogram(rule.tensors, filename='images/gradients.gif')

In [31]:
Image(url='images/gradients.gif')

### Check variance across layers
The rule retrieves gradients, but this time we compare variance of gradient distribution across layers. We want to identify if there is a large difference between the min and max variance per training step. For instance, very deep neural networks may suffer from vanishing gradients the deeper we go. By checking this ratio we can determine if we run into such a situation.

In [32]:
class GradientsAcrossLayers(Rule):
    def __init__(self, base_trial, ):
        super().__init__(base_trial)  
        self.tensors = collections.OrderedDict()  
        
    def invoke_at_step(self, step):
        for tname in self.base_trial.tensor_names(regex='.*gradient'):
            try:
                tensor = self.base_trial.tensor(tname).value(step)
                if step not in self.tensors:
                    self.tensors[step] = [np.inf, 0]
                variance = np.var(tensor.flatten())
                if variance < self.tensors[step][0]:
                    self.tensors[step][0] = variance
                elif variance > self.tensors[step][1]:
                    self.tensors[step][1] = variance             
                self.logger.info(f" Step {step} current ratio: {self.tensors[step][0]} {self.tensors[step][1]} Ratio: {self.tensors[step][1] / self.tensors[step][0]}") 
            except:
                self.logger.warning(f"Can not fetch tensor {tname}")
        return False

trial = create_trial(folder_name)
rule = GradientsAcrossLayers(trial)
try:
    invoke_rule(rule)
except NoMoreData:
    print('The training has ended and there is no more data to be analyzed. This is expected behavior.')

[2019-12-09 08:23:36.777 ip-172-16-62-40:11373 INFO local_trial.py:35] Loading trial debug-output at path /tmp/debug-output
[2019-12-09 08:23:36.788 ip-172-16-62-40:11373 INFO rule_invoker.py:10] Started execution of rule GradientsAcrossLayers at step 0
[2019-12-09 08:23:36.790 ip-172-16-62-40:11373 INFO trial.py:197] Training has ended, will refresh one final time in 1 sec.
[2019-12-09 08:23:37.792 ip-172-16-62-40:11373 INFO trial.py:209] Loaded all steps
[2019-12-09 08:23:37.793 ip-172-16-62-40:11373 INFO <ipython-input-32-0c18f883c3b5>:17]  Step 0 current ratio: 120216512.0 0 Ratio: 0.0
[2019-12-09 08:23:37.794 ip-172-16-62-40:11373 INFO <ipython-input-32-0c18f883c3b5>:17]  Step 0 current ratio: 120216512.0 156414736.0 Ratio: 1.3011085987091064
[2019-12-09 08:23:37.797 ip-172-16-62-40:11373 INFO <ipython-input-32-0c18f883c3b5>:17]  Step 0 current ratio: 7661114.5 156414736.0 Ratio: 20.41670799255371
[2019-12-09 08:23:37.799 ip-172-16-62-40:11373 INFO <ipython-input-32-0c18f883c3b5>:



Let's check min and max values of the gradients across layers:

In [33]:
for step in rule.tensors:
    print("Step", step, "variance of gradients: ", rule.tensors[step][0], " to ",  rule.tensors[step][1])

Step 0 variance of gradients:  570.01514  to  156414740.0
Step 100 variance of gradients:  19.166117  to  21529.385
Step 200 variance of gradients:  0.28520313  to  215.97237
Step 300 variance of gradients:  0.046270333  to  164.88522
Step 400 variance of gradients:  0.0034838195  to  32.36142
Step 500 variance of gradients:  0.032400593  to  152.3032
Step 600 variance of gradients:  0.030220622  to  41.118534
Step 700 variance of gradients:  0.0027776044  to  40.222153
Step 800 variance of gradients:  0.0004990436  to  60.11105
Step 900 variance of gradients:  7.5891185e-05  to  20.536386
Step 1000 variance of gradients:  0.012268313  to  116.92578
Step 1100 variance of gradients:  0.0  to  87.674774
Step 1200 variance of gradients:  0.0044369544  to  596.1117
Step 1300 variance of gradients:  1.0401059e-05  to  14.678249
Step 1400 variance of gradients:  3.9267954e-05  to  20.91969


### Distribution of weights
This rule retrieves the weight tensors and checks the variance. If the distribution does not change much across steps it may indicate that the learning rate is too low, that gradients are too small or that the training has converged to a minimum.

In [34]:
class WeightRatio(Rule):
    def __init__(self, base_trial, ):
        super().__init__(base_trial)  
        self.tensors = collections.OrderedDict()  
        
    def invoke_at_step(self, step):
        for tname in self.base_trial.tensor_names(regex='.*weight'):
            if "gradient" not in tname:
                try:
                    tensor = self.base_trial.tensor(tname).value(step)
                    if tname not in self.tensors:
                        self.tensors[tname] = {}
                 
                    self.logger.info(f" Tensor  {tname}  has weights with variance: {np.var(tensor.flatten())} ")
                    self.tensors[tname][step] = tensor
                except:
                    self.logger.warning(f"Can not fetch tensor {tname}")
        return False

trial = create_trial(folder_name)
rule = WeightRatio(trial)
try:
    invoke_rule(rule)
except NoMoreData:
    print('The training has ended and there is no more data to be analyzed. This is expected behavior.')


[2019-12-09 08:24:44.893 ip-172-16-62-40:11373 INFO local_trial.py:35] Loading trial debug-output at path /tmp/debug-output
[2019-12-09 08:24:44.904 ip-172-16-62-40:11373 INFO rule_invoker.py:10] Started execution of rule WeightRatio at step 0
[2019-12-09 08:24:44.906 ip-172-16-62-40:11373 INFO trial.py:197] Training has ended, will refresh one final time in 1 sec.
[2019-12-09 08:24:45.908 ip-172-16-62-40:11373 INFO trial.py:209] Loaded all steps
[2019-12-09 08:24:45.910 ip-172-16-62-40:11373 INFO <ipython-input-34-f8ca0a68fd3b>:14]  Tensor  conv0_weight  has weights with variance: 0.2831619679927826 
[2019-12-09 08:24:45.911 ip-172-16-62-40:11373 INFO <ipython-input-34-f8ca0a68fd3b>:14]  Tensor  conv1_weight  has weights with variance: 0.33994680643081665 
[2019-12-09 08:24:45.913 ip-172-16-62-40:11373 INFO <ipython-input-34-f8ca0a68fd3b>:14]  Tensor  dense0_weight  has weights with variance: 0.33299320936203003 
[2019-12-09 08:24:45.914 ip-172-16-62-40:11373 INFO <ipython-input-34-f8

Plot the histograms

In [35]:
utils.create_interactive_matplotlib_histogram(rule.tensors, filename='images/weights.gif')

In [37]:
Image(url='images/weights.gif')

### Inputs

This rule retrieves layer inputs excluding activation inputs.

In [38]:
class Inputs(Rule):
    def __init__(self, base_trial, ):
        super().__init__(base_trial)  
        self.tensors = collections.OrderedDict()  
        
    def invoke_at_step(self, step):
        for tname in self.base_trial.tensor_names(regex='.*input'):
            if "relu" not in tname:
                try:
                    tensor = self.base_trial.tensor(tname).value(step)
                    if tname not in self.tensors:
                        self.tensors[tname] = {}
                 
                    self.logger.info(f" Tensor  {tname}  has inputs with variance: {np.var(tensor.flatten())} ")
                    self.tensors[tname][step] = tensor
                except:
                    self.logger.warning(f"Can not fetch tensor {tname}")
        return False

trial = create_trial(folder_name)
rule = Inputs(trial)
try:
    invoke_rule(rule)
except NoMoreData:
    print('The training has ended and there is no more data to be analyzed. This is expected behavior.')


[2019-12-09 08:26:34.086 ip-172-16-62-40:11373 INFO local_trial.py:35] Loading trial debug-output at path /tmp/debug-output
[2019-12-09 08:26:34.097 ip-172-16-62-40:11373 INFO rule_invoker.py:10] Started execution of rule Inputs at step 0
[2019-12-09 08:26:34.098 ip-172-16-62-40:11373 INFO trial.py:197] Training has ended, will refresh one final time in 1 sec.
[2019-12-09 08:26:35.101 ip-172-16-62-40:11373 INFO trial.py:209] Loaded all steps
[2019-12-09 08:26:35.407 ip-172-16-62-40:11373 INFO <ipython-input-38-c408f5ccc2ae>:14]  Tensor  conv0_input_0  has inputs with variance: 0.9688668251037598 
[2019-12-09 08:26:35.409 ip-172-16-62-40:11373 INFO <ipython-input-38-c408f5ccc2ae>:14]  Tensor  conv1_input_0  has inputs with variance: 4.631735801696777 
[2019-12-09 08:26:35.411 ip-172-16-62-40:11373 INFO <ipython-input-38-c408f5ccc2ae>:14]  Tensor  dense0_input_0  has inputs with variance: 58.34519577026367 
[2019-12-09 08:26:35.412 ip-172-16-62-40:11373 INFO <ipython-input-38-c408f5ccc2a

Plot the histograms

In [39]:
utils.create_interactive_matplotlib_histogram(rule.tensors, filename='images/layer_inputs.gif')

In [42]:
Image(url='images/layer_inputs.gif')

### Layer outputs
This rule retrieves outputs of layers excluding activation outputs.

In [43]:
class Outputs(Rule):
    def __init__(self, base_trial, ):
        super().__init__(base_trial)  
        self.tensors = collections.OrderedDict() 
        
    def invoke_at_step(self, step):
        for tname in self.base_trial.tensor_names(regex='.*output'):
            if "relu" not in tname:
                try:
                    tensor = self.base_trial.tensor(tname).value(step)
                    if tname not in self.tensors:
                        self.tensors[tname] = {}
                 
                    self.logger.info(f" Tensor  {tname}  has inputs with variance: {np.var(tensor.flatten())} ")
                    self.tensors[tname][step] = tensor
                except:
                    self.logger.warning(f"Can not fetch tensor {tname}")
        return False

trial = create_trial(folder_name)
rule = Outputs(trial)
try:
    invoke_rule(rule)
except NoMoreData:
    print('The training has ended and there is no more data to be analyzed. This is expected behavior.')


[2019-12-09 08:29:22.832 ip-172-16-62-40:11373 INFO local_trial.py:35] Loading trial debug-output at path /tmp/debug-output
[2019-12-09 08:29:22.844 ip-172-16-62-40:11373 INFO rule_invoker.py:10] Started execution of rule Outputs at step 0
[2019-12-09 08:29:22.845 ip-172-16-62-40:11373 INFO trial.py:197] Training has ended, will refresh one final time in 1 sec.
[2019-12-09 08:29:23.847 ip-172-16-62-40:11373 INFO trial.py:209] Loaded all steps
[2019-12-09 08:29:23.852 ip-172-16-62-40:11373 INFO <ipython-input-43-54f31e2b0743>:14]  Tensor  conv0_output_0  has inputs with variance: 3.62184476852417 
[2019-12-09 08:29:23.855 ip-172-16-62-40:11373 INFO <ipython-input-43-54f31e2b0743>:14]  Tensor  conv1_output_0  has inputs with variance: 43.39231491088867 
[2019-12-09 08:29:23.856 ip-172-16-62-40:11373 INFO <ipython-input-43-54f31e2b0743>:14]  Tensor  dense0_output_0  has inputs with variance: 4567.81201171875 
[2019-12-09 08:29:23.858 ip-172-16-62-40:11373 INFO <ipython-input-43-54f31e2b07

Plot the histograms

In [44]:
utils.create_interactive_matplotlib_histogram(rule.tensors, filename='images/layer_outputs.gif')

In [45]:
Image(url='images/layer_outputs.gif')

### Comparison 
In the previous section we have looked at the distribution of gradients, activation outputs and weights of a model that has not trained well due to poor initialization. Now we will compare some of these distributions with a model that has been well intialized.

In [46]:
entry_point_script = 'mnist.py'
hyperparameters = {'lr': 0.01}

In [None]:
estimator = MXNet(role=sagemaker.get_execution_role(),
                  base_job_name='mxnet',
                  train_instance_count=1,
                  train_instance_type='ml.m5.xlarge',
                  train_volume_size=400,
                  source_dir='src',
                  entry_point=entry_point_script,
                  hyperparameters=hyperparameters,
                  framework_version='1.6.0',
                  py_version='py3',
                  debugger_hook_config = DebuggerHookConfig(
                      s3_output_path=s3_bucket_for_tensors,  
                      collection_configs=[
                        CollectionConfig(
                            name="all",
                            parameters={
                                "include_regex": ".*",
                                "save_interval": "100"
                            }
                        )
                     ]
                   )
                )
                  

Start the training job

In [None]:
estimator.fit(wait=False)

Get S3 path where tensors have been stored

In [None]:
job_name = estimator.latest_training_job.name
client = estimator.sagemaker_session.sagemaker_client
description = client.describe_training_job(TrainingJobName=job_name)
path = description['DebugHookConfig']['S3OutputPath'] + '/' + job_name + '/debug-output'
print('Tensors are stored in: ', path)

Download tensors from S3

In [None]:
folder_name2 = "/tmp/{}_2".format(path.split("/")[-1])
os.system("aws s3 cp --recursive {} {}".format(path,folder_name2))
print('Downloading tensors into folder: ', folder_name2)

#### Gradients

Lets compare distribution of gradients of the convolutional layers of both trials.

In [None]:
trial = create_trial(folder_name)
rule = GradientsLayer(trial)
try:
    invoke_rule(rule)
except NoMoreData:
    print('The training has ended and there is no more data to be analyzed. This is expected behavior.')


In [None]:
dict_gradients = {}
dict_gradients['gradient/conv0_weight_bad_hyperparameters'] = rule.tensors['gradient/conv0_weight']
dict_gradients['gradient/conv1_weight_bad_hyperparameters'] = rule.tensors['gradient/conv1_weight']

Second trial:

In [None]:
trial = create_trial(folder_name2)
rule = GradientsLayer(trial)
try:
    invoke_rule(rule)
except NoMoreData:
    print('The training has ended and there is no more data to be analyzed. This is expected behavior.')


In [None]:
dict_gradients['gradient/conv0_weight_good_hyperparameters'] = rule.tensors['gradient/conv0_weight']
dict_gradients['gradient/conv1_weight_good_hyperparameters'] = rule.tensors['gradient/conv1_weight']

Plot the histograms

In [None]:
utils.create_interactive_matplotlib_histogram(dict_gradients, filename='images/gradients_comparison.gif')

In the case of the poorly initalized model, gradients are fluctuating a lot leading to very high variance. 

In [None]:
Image(url='images/gradients_comparison.gif')

#### Activation inputs

Lets compare distribution of activation inputs of both trials.

In [None]:
trial = create_trial(folder_name)
rule = ActivationInputs(trial)
try:
    invoke_rule(rule)
except NoMoreData:
    print('The training has ended and there is no more data to be analyzed. This is expected behavior.')


In [None]:
dict_activation_inputs = {}
dict_activation_inputs['conv0_relu_input_0_bad_hyperparameters'] = rule.tensors['conv0_relu_input_0']
dict_activation_inputs['conv1_relu_input_0_bad_hyperparameters'] = rule.tensors['conv1_relu_input_0']

Second trial

In [None]:
trial = create_trial(folder_name2)
rule = ActivationInputs(trial)
try:
    invoke_rule(rule)
except NoMoreData:
    print('The training has ended and there is no more data to be analyzed. This is expected behavior.')


In [None]:
dict_activation_inputs['conv0_relu_input_0_good_hyperparameters'] = rule.tensors['conv0_relu_input_0']
dict_activation_inputs['conv1_relu_input_0_good_hyperparameters'] = rule.tensors['conv1_relu_input_0']

Plot the histograms

In [None]:
utils.create_interactive_matplotlib_histogram(dict_activation_inputs, filename='images/activation_inputs_comparison.gif')

The distribution of activation inputs into first activation layer `conv0_relu_input_0` look quite similar in both trials. However in the case of the second layer they drastically differ. 

In [None]:
Image(url='images/activation_inputs_comparison.gif')