<a href="https://colab.research.google.com/github/michelucci/zhaw-dlcourse-spring2019/blob/master/Week%206%20-%20Network%20Training/Week%206%20-%20Zalando%20dataset%20and%20decaying%20learning%20rate.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Neural Networks and Deep Learning for Life Sciences and Health Applications - An introductory course about theoretical fundamentals, case studies and implementations in python and tensorflow

(C) Umberto Michelucci 2018 - umberto.michelucci@gmail.com 

github repository: https://github.com/michelucci/zhaw-dlcourse-spring2019

Spring Semester 2019

# Exploding Gradient problem

In this notebook we will discuss the problem of exploding gradients when using the ReLU activation function, with the help of the Fashion-MNIST (Zalando) dataset. I will skip the data preparation discussion (see the other notebooks in this folder) and concentrate on the problem itself.

# Loading the data

Information can be found here

https://www.kaggle.com/zalando-research/fashionmnist/data

Note that this notebook expects the files `fashion-mnist_train.csv` and `fashion-mnist_test.csv` to be in the same folder as the notebook.

**Context**
Fashion-MNIST is a dataset of Zalando's article images—consisting of a training set of 60,000 examples and a test set of 10,000 examples. Each example is a 28x28 grayscale image, associated with a label from 10 classes. Zalando intends Fashion-MNIST to serve as a direct drop-in replacement for the original MNIST dataset for benchmarking machine learning algorithms. It shares the same image size and structure of training and testing splits.

The original MNIST dataset contains a lot of handwritten digits. Members of the AI/ML/Data Science community love this dataset and use it as a benchmark to validate their algorithms. In fact, MNIST is often the first dataset researchers try. "If it doesn't work on MNIST, it won't work at all", they said. "Well, if it does work on MNIST, it may still fail on others."
Zalando seeks to replace the original MNIST dataset.

**Content**
Each image is 28 pixels in height and 28 pixels in width, for a total of 784 pixels. Each pixel has a single pixel-value associated with it, indicating the lightness or darkness of that pixel, with higher numbers meaning darker. This pixel-value is an integer between 0 and 255. The training and test data sets have 785 columns. The first column consists of the class labels (see below) and represents the article of clothing. The remaining columns contain the pixel-values of the associated image.

To locate a pixel on the image, suppose that we have decomposed its index x as x = i * 28 + j, where i and j are integers between 0 and 27. The pixel is located on row i and column j of a 28 x 28 matrix. 
For example, pixel31 (31 = 1 * 28 + 3, so i = 1 and j = 3) is the pixel in the fourth column from the left and the second row from the top.
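
The index arithmetic above can be checked with a short sketch. The helper name `pixel_position` is hypothetical and not part of the dataset tooling; it assumes the zero-based pixel numbering described above.

```python
def pixel_position(x, width=28):
    """Return the zero-based (row, column) of the flattened pixel index x."""
    # x = i * width + j, so integer division gives the row i
    # and the remainder gives the column j
    return divmod(x, width)


row, col = pixel_position(31)
print(row, col)  # 1 3: second row from the top, fourth column from the left
```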

**Labels**
Each training and test example is assigned to one of the following labels:
- 0 T-shirt/top
- 1 Trouser
- 2 Pullover
- 3 Dress
- 4 Coat
- 5 Sandal
- 6 Shirt
- 7 Sneaker
- 8 Bag
- 9 Ankle boot 

**TL;DR**
- Each row is a separate image.
- Column 1 is the class label.
- The remaining columns are pixel values (784 in total).
- Each value is the darkness of the pixel (0 to 255).

**Acknowledgements**
Original dataset was downloaded from https://github.com/zalandoresearch/fashion-mnist
Dataset was converted to CSV with this script: https://pjreddie.com/projects/mnist-in-csv/

**License**
The MIT License (MIT) Copyright © [2017] Zalando SE, https://tech.zalando.com
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

In [147]:
import pandas as pd
import numpy as np
import tensorflow as tf

%matplotlib inline

import matplotlib
import matplotlib.pyplot as plt

from random import *

In [148]:
data_train = pd.read_csv('fashion-mnist_train.csv', header = 0)
data_test = pd.read_csv('fashion-mnist_test.csv', header = 0)

# Train dataset preparation

In [149]:
labels = data_train['label'].values.reshape(1, 60000)

labels_ = np.zeros((60000, 10))
labels_[np.arange(60000), labels] = 1
labels_ = labels_.transpose()


train = data_train.drop('label', axis=1).transpose()

In [150]:
print(labels_.shape)
print(train.shape)

(10, 60000)
(784, 60000)
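
The one-hot encoding used above relies on NumPy integer-array indexing. A minimal sketch on a toy label vector (the names `toy_labels` and `one_hot` and the sizes are illustrative, not taken from the notebook) shows what ends up in each column after the transpose:

```python
import numpy as np

toy_labels = np.array([2, 0, 1, 2])    # four observations, classes 0..2
n_obs, n_classes = toy_labels.size, 3

one_hot = np.zeros((n_obs, n_classes))
# Row k gets a 1 in the column given by the label of observation k
one_hot[np.arange(n_obs), toy_labels] = 1
# After the transpose each column is one observation: shape (n_classes, n_obs)
one_hot = one_hot.transpose()

print(one_hot.shape)
print(one_hot)
```

Taking `one_hot.argmax(axis=0)` recovers the original labels, which is the same operation the accuracy computation later applies to `Y`.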


# Test dataset preparation

In [151]:
labels_test = data_test['label'].values.reshape(1, 10000)

labels_test_ = np.zeros((10000, 10))
labels_test_[np.arange(10000), labels_test] = 1
labels_test_ = labels_test_.transpose()


test = data_test.drop('label', axis=1).transpose()

### Normalization of data

Let's normalize the training and test data by dividing by 255.0, so that all pixel values lie between 0 and 1.

In [152]:
train = np.array(train / 255.0)
test = np.array(test / 255.0)
labels_ = np.array(labels_)
labels_test_ = np.array(labels_test_)

# Problem description

The problem is the following. If we create a network with one hidden layer of 15 neurons (see below) and use ReLU as its activation function, then with a learning rate of $\gamma = 0.01$ we will see ```nan``` appear quite soon, if not immediately. The reason is that the gradient of the ReLU function can only be 0 or 1: for positive inputs the function is unbounded and passes gradients through unchanged, so large updates are not damped as they would be by a saturating activation such as the sigmoid. Let's try to understand the problem better and identify its cause. The network we start with has this single hidden layer and ten output neurons (for multi-class classification) with a softmax activation function.

Note that this problem can be reproduced when:

- using the ```Adam``` optimizer instead of plain gradient descent (```GD```)
- using a mini-batch size of 50
- using a learning rate of $\gamma = 0.01$
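
One way a `nan` can arise with the cost used in this notebook is when a softmax output saturates to exactly 0 or 1, so that the logarithm is evaluated at 0. The NumPy sketch below illustrates that mechanism on hand-picked values. It is an illustration under that assumption, not a trace of the actual training run, and the array names are made up for the example.

```python
import numpy as np

y_true = np.array([1.0, 0.0])
y_pred = np.array([1.0, 0.0])   # saturated predictions (assumed for illustration)

# Silence the divide-by-zero and invalid-value warnings that log(0) and 0 * inf raise
with np.errstate(divide="ignore", invalid="ignore"):
    # Same form as the notebook's cost: Y * log(y_) + (1 - Y) * log(1 - y_)
    terms = y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred)
    cost = -np.mean(terms)

print(cost)  # nan, because 0 * log(0) evaluates to 0 * (-inf)
```

Even though these predictions are perfectly correct, the cost is `nan`: a single saturated output is enough to poison the mean.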

# Network with 1 layer and 15 neurons - with constant learning rate $\gamma$

In [275]:
n_dim = 784
tf.reset_default_graph()

tf.random.set_random_seed(42)

# Number of neurons in the layers
n1 = 15 # Number of neurons in layer 1
n2 = 10 # Number of neurons in output layer 

cost_history = np.empty(shape=[1], dtype = float)
learning_rate = tf.placeholder(tf.float64, shape=())

st = 1.0/np.sqrt(n_dim)

X = tf.placeholder(tf.float64, [n_dim, None])
Y = tf.placeholder(tf.float64, [10, None])
W1 = tf.Variable(tf.truncated_normal([n1, n_dim], stddev=st, dtype = tf.float64), dtype = tf.float64) 
b1 = tf.Variable( tf.zeros([n1,1], dtype = tf.float64)) # tf.constant(0.1, shape = [n1,1])
W2 = tf.Variable(tf.truncated_normal([n2, n1], stddev=st, dtype = tf.float64)) 
b2 = tf.Variable(tf.zeros([n2,1], dtype = tf.float64)) 
                 
    
ZZ = tf.matmul(W1, X) + b1
# Let's build our network...
Z1 = tf.nn.relu(tf.matmul(W1, X) + b1) # n1 x n_dim * n_dim x n_obs = n1 x n_obs
Z2 = tf.matmul(W2, Z1) + b2 # n2 x n1 * n1 x n_obs = n2 x n_obs
y_ = tf.nn.softmax(Z2,0) # n2 x n_obs (10 x None)

cost = - tf.reduce_mean(Y * tf.log(y_)+(1-Y) * tf.log(1-y_))

optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(cost)
#optimizer = tf.train.AdamOptimizer(learning_rate).minimize(cost)

init = tf.global_variables_initializer()
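
To make the shape comments in the graph above concrete, here is a NumPy sketch of the same forward pass on random inputs. It mirrors the structure of the TensorFlow graph (ReLU hidden layer, softmax over axis 0) but is not the notebook's code: the random data, the seed, the plain (not truncated) normal initialization, and the helper `softmax_columns` are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_dim, n1, n2, n_obs = 784, 15, 10, 5
st = 1.0 / np.sqrt(n_dim)

X = rng.random((n_dim, n_obs))            # one observation per column
W1 = rng.normal(0.0, st, (n1, n_dim))
b1 = np.zeros((n1, 1))
W2 = rng.normal(0.0, st, (n2, n1))
b2 = np.zeros((n2, 1))


def softmax_columns(z):
    """Softmax over axis 0, so that each column sums to 1."""
    # Subtracting the column-wise maximum keeps exp() from overflowing
    e = np.exp(z - z.max(axis=0, keepdims=True))
    return e / e.sum(axis=0, keepdims=True)


Z1 = np.maximum(0.0, W1 @ X + b1)   # ReLU, shape (n1, n_obs)
Z2 = W2 @ Z1 + b2                   # shape (n2, n_obs)
y_hat = softmax_columns(Z2)         # shape (n2, n_obs)

print(Z1.shape, Z2.shape, y_hat.shape)
print(y_hat.sum(axis=0))
```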

In [276]:
%%time
sess = tf.Session()
sess.run(init)
    
minibatch = 50
    
cost_history = []
for epoch in range(100+1):
    for i in range(0, train.shape[1], minibatch):
        X_train_mini = train[:,i:i + minibatch]
        y_train_mini = labels_[:,i:i + minibatch]

        sess.run(optimizer, feed_dict = {X: X_train_mini, Y: y_train_mini, learning_rate: 1e-4})
    cost_ = sess.run(cost, feed_dict={ X:train, Y: labels_, learning_rate: 1e-4})
    cost_history = np.append(cost_history, cost_)

    if (epoch % 10 == 0):
        print("Reached epoch",epoch,"cost J =", cost_)

Reached epoch 0 cost J = 0.32545125752288273
Reached epoch 10 cost J = 0.3248716516422052
Reached epoch 20 cost J = 0.324286508719163
Reached epoch 30 cost J = 0.32364431586960424
Reached epoch 40 cost J = 0.32291914191888776
Reached epoch 50 cost J = 0.3220928779765425
Reached epoch 60 cost J = 0.3211361317420825
Reached epoch 70 cost J = 0.3200077123419853
Reached epoch 80 cost J = 0.31866034846478775
Reached epoch 90 cost J = 0.31704463973505254
Reached epoch 100 cost J = 0.31510980542814526
CPU times: user 4min 2s, sys: 2min 1s, total: 6min 4s
Wall time: 2min 39s
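
The inner loop above walks through the training set in consecutive column slices. A small sketch with toy sizes (the array `data` and its dimensions are made up) shows which columns each mini-batch covers, including the shorter final batch that appears when the number of observations is not a multiple of the batch size:

```python
import numpy as np

data = np.arange(2 * 7).reshape(2, 7)   # 2 features, 7 observations (toy sizes)
minibatch = 3

batches = []
for i in range(0, data.shape[1], minibatch):
    # Slicing past the last column is safe: NumPy simply returns fewer columns
    batch = data[:, i:i + minibatch]
    batches.append(batch)
    print(i, batch.shape)
```

With the 60,000 training images and a mini-batch size of 50 used in this notebook, every batch is full, since 60000 is a multiple of 50.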


In [277]:
correct_predictions = tf.equal(tf.argmax(y_,0), tf.argmax(Y,0))
accuracy = tf.reduce_mean(tf.cast(correct_predictions, "float"))
print ("Accuracy:", accuracy.eval({X: train, Y: labels_, learning_rate: 0.001}, session = sess))

Accuracy: 0.29365


In [278]:
correct_predictions = tf.equal(tf.argmax(y_,0), tf.argmax(Y,0))
accuracy = tf.reduce_mean(tf.cast(correct_predictions, "float"))
print ("Accuracy:", accuracy.eval({X: test, Y: labels_test_, learning_rate: 0.001}, session = sess))

Accuracy: 0.2904
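
The accuracy computation above compares the arg-max of each predicted column with the arg-max of the corresponding one-hot label column. An equivalent NumPy sketch on hand-made values (the arrays below are illustrative, not model output):

```python
import numpy as np

# Columns are observations, rows are classes (3 classes, 4 observations)
y_prob = np.array([[0.7, 0.2, 0.1, 0.3],
                   [0.2, 0.5, 0.3, 0.4],
                   [0.1, 0.3, 0.6, 0.3]])
y_onehot = np.array([[1, 0, 0, 1],
                     [0, 1, 0, 0],
                     [0, 0, 1, 0]])

# Predicted class and true class per observation (arg-max over the class axis)
correct = np.argmax(y_prob, axis=0) == np.argmax(y_onehot, axis=0)
accuracy = correct.mean()
print(accuracy)  # 0.75: the last observation is misclassified
```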


In [274]:
sess.close()