# Back Propagation

Consider a two-layer neural network:  

<center> $x = input$  </center>
<center> $z = W_{h}x$ </center>  
<center> $h = ReLU(z)$  </center> 
<center> $\theta = W_{o}h$   </center> 
<center> $\hat{y} = softmax(\theta)$  </center> 


<center> $E = CE(\hat{y}, y)$ </center>



We compute the derivatives with the chain rule:

<center> $\frac{\partial E}{\partial W_{o}} = \frac{\partial E}{\partial \hat{y}}\frac{\partial \hat{y}}{\partial \theta}\frac{\partial \theta}{\partial W_{o}}$ </center> 

We introduce auxiliary deltas:

<center> $\delta_1 = \frac{\partial E}{\partial \theta}$ , $\delta_2 = \frac{\partial E}{\partial z}$ </center>


<center> $\delta_1 = \frac{\partial E}{\partial \theta} = (\hat{y} - y)^T$ </center> 

<center> $\delta_2 = \frac{\partial E}{\partial \theta}\frac{\partial \theta}{\partial h}\frac{\partial h}{\partial z} = \delta_1 \frac{\partial \theta}{\partial h}\frac{\partial h}{\partial z} = \delta_1 W_{o} \frac{\partial h}{\partial z} = \delta_1 W_{o} \circ ReLU'(z) = \delta_1 W_{o} \circ sgn(h)$ </center> 

Now we compute the derivatives with respect to the weights:

<center> $\frac{\partial E}{\partial W_{o}} = \frac{\partial E}{\partial \theta} \frac{\partial \theta}{\partial W_{o}} = \delta_1 \frac{\partial \theta}{\partial W_{o}} = \delta_1^T h^T$ </center> 

<center> $\frac{\partial E}{\partial W_{h}} = \frac{\partial E}{\partial z}\frac{\partial z}{\partial W_{h}} = \delta_2 \frac{\partial z}{\partial W_{h}} = \delta_2^T x^T$ </center> 
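
As a sanity check on the formulas above, the analytic gradients can be compared with central finite differences. The sketch below is not part of the original notebook: the toy sizes, the random seed and the helper `numerical_grad` are assumptions chosen for illustration. It uses the column-vector convention of the derivation ($z = W_{h}x$), whereas the classifier code further down stores samples as rows.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy sizes: 4 inputs, 5 hidden units, 3 classes.
D, H, C = 4, 5, 3
x = rng.normal(size=(D, 1))            # input as a column vector
y = np.zeros((C, 1))
y[1] = 1.0                             # one-hot target
W_h = 0.5 * rng.normal(size=(H, D))
W_o = 0.5 * rng.normal(size=(C, H))


def loss(W_h, W_o):
    """Forward pass in the column-vector convention of the formulas."""
    z = W_h @ x
    h = np.maximum(0, z)
    theta = W_o @ h
    theta = theta - theta.max()        # shift for numerical stability
    y_hat = np.exp(theta) / np.exp(theta).sum()
    return -(y * np.log(y_hat)).sum(), (z, h, y_hat)


E, (z, h, y_hat) = loss(W_h, W_o)

# Analytic gradients, following the derivation term by term.
delta_1 = (y_hat - y).T                # shape (1, C)
delta_2 = (delta_1 @ W_o) * (z > 0).T  # shape (1, H)
grad_W_o = delta_1.T @ h.T             # shape (C, H)
grad_W_h = delta_2.T @ x.T             # shape (H, D)


def numerical_grad(f, W, eps=1e-6):
    """Central finite differences, perturbing one weight at a time."""
    g = np.zeros_like(W)
    for idx in np.ndindex(*W.shape):
        old = W[idx]
        W[idx] = old + eps
        f_plus = f()
        W[idx] = old - eps
        f_minus = f()
        W[idx] = old                   # restore the original weight
        g[idx] = (f_plus - f_minus) / (2 * eps)
    return g


num_W_o = numerical_grad(lambda: loss(W_h, W_o)[0], W_o)
num_W_h = numerical_grad(lambda: loss(W_h, W_o)[0], W_h)

err_o = np.abs(grad_W_o - num_W_o).max()
err_h = np.abs(grad_W_h - num_W_h).max()
print("max abs error:", err_o, err_h)
```

If the formulas are right, both errors should be many orders of magnitude smaller than the gradients themselves. The check can fail spuriously when some $z_i$ lies within `eps` of zero, because ReLU is not differentiable there.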


In [0]:
import warnings
warnings.filterwarnings("ignore")
import numpy as np
from sklearn.preprocessing import LabelBinarizer
from scipy.special import xlogy

### Activation and loss functions

We define the activation functions, softmax and ReLU, as well as the loss function, cross-entropy.

In [0]:
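def stable_softmax(x):
    # Subtract the row-wise maximum so that np.exp never overflows.
    shifted = x - x.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    # Return a new array instead of overwriting the caller's input.
    return exp / exp.sum(axis=1, keepdims=True)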


def relu(x):
    return np.maximum(0, x)


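def relu_derivative(x):
    # x[:] on a NumPy array is only a view, so writing into it would
    # overwrite the caller's data; build a fresh 0/1 mask instead.
    return (x > 0).astype(x.dtype)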


def crossentropy_loss(y_true, y_prob):
    return - xlogy(y_true, y_prob).sum()
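
To see why the row maximum is subtracted before exponentiating, here is a small sketch comparing a naive softmax with a shifted one. The function names and the sample logits are made up for illustration and are not part of the notebook.

```python
import numpy as np


def naive_softmax(x):
    e = np.exp(x)                      # overflows to inf for large inputs
    return e / e.sum(axis=1, keepdims=True)


def shifted_softmax(x):
    # Subtracting the row maximum does not change the result mathematically,
    # but every exponent becomes <= 0, so np.exp cannot overflow.
    e = np.exp(x - x.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)


logits = np.array([[1.0, 2.0, 3.0],
                   [1000.0, 1001.0, 1002.0]])

# Silence the overflow warnings that the naive version triggers on purpose.
with np.errstate(over="ignore", invalid="ignore"):
    naive = naive_softmax(logits)
stable = shifted_softmax(logits)

print("naive: ", naive)
print("stable:", stable)
```

The second row of `naive` turns into `nan` because `inf / inf` is undefined, while `stable` gives the same probabilities for both rows, since the two rows differ only by a constant shift.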

### DNNClassifier definition

In [0]:
class DNNClassifier():
    def __init__(self, hidden_layer_sizes=(100,),
                 activation_functions=(None, None),
                 loss_function=crossentropy_loss,
                 batch_size=1, learning_rate=0.001,
                 max_iter=200, random_state=42, verbose=False):
        self.hidden_layer_sizes = hidden_layer_sizes
        self.activation_functions = activation_functions
        self.loss_function = loss_function
        self.batch_size = batch_size
        self.learning_rate = learning_rate
        self.max_iter = max_iter
        self.random_state = random_state
        self.verbose = verbose

        self._label_binarizer = LabelBinarizer()

    def __forward_layer(self, x, w, activation_function=None):
        out = np.dot(x, w)
        if activation_function:
            out = activation_function(out)
        return out

    def __forward_propagate(self, x):
        activations = [x]
        for next_layer, activation in zip(self.weights, self.activation_functions):
            out = self.__forward_layer(activations[-1], next_layer, activation)
            activations.append(out)
        return activations

    def __back_propagation(self, activations, y):
        weights = self.weights
        gradients = [np.empty_like(layer) for layer in weights]

        # Shape comments assume batch_size=256, 784 inputs, 128 hidden units, 10 classes.
        deltas1 = activations[2] - y  # (256, 10)
        gradients[1] = np.dot(activations[1].T, deltas1)  # (128, 10)

        deltas2 = np.dot(deltas1, weights[1].T) * relu_derivative(activations[1])  # (256, 128)

        gradients[0] = np.dot(activations[0].T, deltas2)  # (784, 128)

        return gradients

    def __init_layer(self, input_size, output_size):
        a = 2.0 / (input_size + output_size)
        w = np.random.uniform(-a, a, (input_size, output_size))
        return w

    def fit(self, X, y, shuffle=False):
        np.random.seed(self.random_state)

        y_train = y
        X_train = X
        y = self._label_binarizer.fit_transform(y)
        num_classes = len(self._label_binarizer.classes_)

        n, p = X.shape
        s = self.hidden_layer_sizes[0]

        self.weights = [
            self.__init_layer(p, s),
            self.__init_layer(s, num_classes)
        ]

        for j in range(self.max_iter):
            accumulated_loss = 0.0

            if shuffle:
                indices = np.arange(n)
                np.random.shuffle(indices)
                X = X.take(indices, axis=0)
                y = y.take(indices, axis=0)

            for i in range(0, n, self.batch_size):
                X_batch = X[i: i + self.batch_size]
                y_batch = y[i: i + self.batch_size]

                activations = self.__forward_propagate(X_batch)

                y_prob = activations[-1]

                accumulated_loss += self.loss_function(y_batch, y_prob)
                gradients = self.__back_propagation(activations, y_batch)

                # Average over the actual batch size; the last batch may be smaller.
                gradients = [gradient / X_batch.shape[0] for gradient in gradients]
                self.weights = [weight - self.learning_rate * grad for weight, grad in
                                zip(self.weights, gradients)]

            if self.verbose:
                loss = accumulated_loss / X.shape[0]
                y_pred = self.predict(X_train)
                accuracy = (y_pred == y_train).mean()
                print("Epoch {}/{};\t Train accuracy: {:.3f} \t Loss : {:.3f}".format(j + 1, self.max_iter, accuracy,
                                                                                      loss))

        return self

    def predict(self, X):
        activations = self.__forward_propagate(X)
        y_pred = activations[-1]
        return self._label_binarizer.inverse_transform(y_pred)
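
The class relies on scikit-learn's `LabelBinarizer` to turn class labels into one-hot rows in `fit` and to map the network output back to labels in `predict`. A rough NumPy-only sketch of that round trip for the multiclass case (more than two classes) is shown below; the label values and the probability matrix are invented for illustration, and this is an approximation of the behaviour, not the library's implementation.

```python
import numpy as np

labels = np.array([3, 0, 7, 3])

# Sorted unique labels play the role of LabelBinarizer.classes_.
classes = np.unique(labels)            # [0, 3, 7]

# One-hot encode: row i has a 1 in the column of labels[i].
one_hot = (labels[:, None] == classes[None, :]).astype(float)

# Hypothetical network output: one row of class probabilities per sample.
probs = np.array([[0.1, 0.7, 0.2],
                  [0.8, 0.1, 0.1],
                  [0.2, 0.2, 0.6],
                  [0.3, 0.4, 0.3]])

# Decode: pick the class with the highest score in each row.
decoded = classes[probs.argmax(axis=1)]

print(one_hot)
print(decoded)
```

Decoding by row-wise argmax is why `predict` can pass the softmax output straight to `inverse_transform` without thresholding it first.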



### Load and prepare dataset

The images are 28×28 matrices. Before use, we flatten each one into a vector of 784 values and normalize it by dividing by 255.

In [4]:
import tensorflow as tf

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train = x_train.astype('float32')
x_test = x_test.astype('float32')
x_train = x_train.reshape(x_train.shape[0], 28*28)
x_test = x_test.reshape(x_test.shape[0], 28*28)

x_train /= 255
x_test /= 255

print('train size: ', x_train.shape, y_train.shape)
print('test size: ', x_test.shape, y_test.shape)

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz
train size:  (60000, 784) (60000,)
test size:  (10000, 784) (10000,)


### Instantiate and fit DNNClassifier

In [5]:
activation_functions = (relu, stable_softmax)
estimator = DNNClassifier(hidden_layer_sizes=(128,),
                          activation_functions=activation_functions,
                          batch_size=256,
                          learning_rate=0.5, max_iter=50,
                          random_state=42, verbose=True)

estimator.fit(x_train, y_train)

Epoch 1/50;	 Train accuracy: 0.923 	 Loss : 0.565
Epoch 2/50;	 Train accuracy: 0.950 	 Loss : 0.211
Epoch 3/50;	 Train accuracy: 0.963 	 Loss : 0.152
Epoch 4/50;	 Train accuracy: 0.971 	 Loss : 0.120
Epoch 5/50;	 Train accuracy: 0.975 	 Loss : 0.100
Epoch 6/50;	 Train accuracy: 0.978 	 Loss : 0.085
Epoch 7/50;	 Train accuracy: 0.981 	 Loss : 0.074
Epoch 8/50;	 Train accuracy: 0.982 	 Loss : 0.066
Epoch 9/50;	 Train accuracy: 0.984 	 Loss : 0.059
Epoch 10/50;	 Train accuracy: 0.986 	 Loss : 0.053
Epoch 11/50;	 Train accuracy: 0.987 	 Loss : 0.048
Epoch 12/50;	 Train accuracy: 0.988 	 Loss : 0.043
Epoch 13/50;	 Train accuracy: 0.989 	 Loss : 0.039
Epoch 14/50;	 Train accuracy: 0.990 	 Loss : 0.036
Epoch 15/50;	 Train accuracy: 0.991 	 Loss : 0.032
Epoch 16/50;	 Train accuracy: 0.992 	 Loss : 0.030
Epoch 17/50;	 Train accuracy: 0.992 	 Loss : 0.027
Epoch 18/50;	 Train accuracy: 0.993 	 Loss : 0.025
Epoch 19/50;	 Train accuracy: 0.994 	 Loss : 0.023
Epoch 20/50;	 Train accuracy: 0.994 	 Lo

<__main__.DNNClassifier at 0x7fd8364392e8>

In [6]:
y_pred = estimator.predict(x_test)
print("Accuracy on test dataset: %s " % (y_pred == y_test).mean())

Accuracy on test dataset: 0.9797 
