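# Chapter 5: Backpropagation

## 5.2 Chain Rule

The chain rule is a property of the derivatives of composite functions: the derivative of a composite function can be written as the product of the derivatives of the functions that compose it. For example, consider the following composite function.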

$$
\begin{eqnarray*}
  z &=& t^2 \\
  t &=& x + y
\end{eqnarray*}
$$

Differentiating this composite function with respect to $x$ gives

$$
\begin{eqnarray*}
  \frac{\partial z}{\partial x} &=& \frac{\partial z}{\partial t} \frac{\partial t}{\partial x} \\
  &=& 2t \cdot 1 \\
  &=& 2 (x + y)
\end{eqnarray*}
$$

<img  src="images/TXBIV4EK028AESGPW2HKEIC3GC7Q77UD.png"/>

On the computational graph, the derivative propagated at each node is as follows.

$$
\begin{eqnarray*}
  \frac{\partial z}{\partial x} &=& \frac{\partial z}{\partial z} \frac{\partial z}{\partial t} \frac{\partial t}{\partial x} = 2(x + y) \\
  \frac{\partial z}{\partial t} &=& \frac{\partial z}{\partial z} \frac{\partial z}{\partial t} = 2t \\
  \frac{\partial z}{\partial z} &=& 1
\end{eqnarray*}
$$
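
As a quick sanity check, the sketch below evaluates the same chain-rule product for $z = (x + y)^2$ and compares it with a central-difference estimate. The names `chain_forward`, `chain_backward`, and `numerical_dz_dx`, as well as the sample inputs, are illustrative assumptions and not part of the original notebook.

```python
def chain_forward(x, y):
    """Forward pass: t = x + y, then z = t**2."""
    t = x + y
    z = t ** 2
    return t, z


def chain_backward(t, dz=1.0):
    """Backward pass: multiply the local derivatives node by node."""
    dt = dz * 2 * t   # dz/dt = 2t
    dx = dt * 1.0     # dt/dx = 1
    dy = dt * 1.0     # dt/dy = 1
    return dx, dy


def numerical_dz_dx(x, y, h=1e-4):
    """Central-difference estimate of dz/dx, used only for comparison."""
    return (chain_forward(x + h, y)[1] - chain_forward(x - h, y)[1]) / (2 * h)


x0, y0 = 3.0, 2.0
t0, z0 = chain_forward(x0, y0)
dx0, dy0 = chain_backward(t0)
print(dx0, numerical_dz_dx(x0, y0))
```

Both values should agree with $2(x + y)$ up to floating-point error.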

## 5.3 Backward Propagation

<img  src="images/YGV86B5WW4LT5JIIQH73HESDQ9MLFQL6.png"/>

The backward pass of an addition node passes the upstream gradient through unchanged.

$$
\begin{eqnarray*}
  z &=& x + y \\
  \frac{\partial L}{\partial x} &=& \frac{\partial L}{\partial z} \frac{\partial z}{\partial x} = \frac{\partial L}{\partial z} \cdot 1 \\
  \frac{\partial L}{\partial y} &=& \frac{\partial L}{\partial z} \frac{\partial z}{\partial y} = \frac{\partial L}{\partial z} \cdot 1
\end{eqnarray*}
$$

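The backward pass of a multiplication node multiplies the upstream gradient by the forward-pass inputs swapped: the gradient for $x$ is multiplied by $y$, and the gradient for $y$ is multiplied by $x$.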

$$
\begin{eqnarray*}
  z &=& xy \\
  \frac{\partial L}{\partial x} &=& \frac{\partial L}{\partial z} \frac{\partial z}{\partial x} = \frac{\partial L}{\partial z} \cdot y \\
  \frac{\partial L}{\partial y} &=& \frac{\partial L}{\partial z} \frac{\partial z}{\partial y} = \frac{\partial L}{\partial z} \cdot x
\end{eqnarray*}
$$

## 5.4 Implementing Simple Layers

In [1]:
class MulLayer:
    def __init__(self):
        self.x = None
        self.y = None

    def forward(self, x, y):
        self.x = x
        self.y = y
        out = x * y

        return out

    def backward(self, dout):
        dx = dout * self.y
        dy = dout * self.x

        return dx, dy

In [2]:
apple = 100
apple_num = 2
tax = 1.1

mul_apple_layer = MulLayer()
mul_tax_layer = MulLayer()

# forward
apple_price = mul_apple_layer.forward(apple, apple_num)
price = mul_tax_layer.forward(apple_price, tax)

# backward
dprice = 1
dapple_price, dtax = mul_tax_layer.backward(dprice)
dapple, dapple_num = mul_apple_layer.backward(dapple_price)

print("price:", int(price))
print("dApple:", dapple)
print("dApple_num:", int(dapple_num))
print("dTax:", dtax)

price: 220
dApple: 2.2
dApple_num: 110
dTax: 200
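
The gradients printed above can be cross-checked without any layer classes by perturbing each input directly. This is a minimal sketch; `total_price` and `central_diff` are hypothetical helpers introduced only for this check.

```python
def total_price(apple, apple_num, tax):
    """Same computation as the two chained MulLayer forward passes."""
    return apple * apple_num * tax


def central_diff(f, args, i, h=1e-4):
    """Central-difference derivative of f with respect to its i-th argument."""
    plus = list(args)
    minus = list(args)
    plus[i] += h
    minus[i] -= h
    return (f(*plus) - f(*minus)) / (2 * h)


inputs = (100.0, 2.0, 1.1)  # apple, apple_num, tax
num_dapple = central_diff(total_price, inputs, 0)
num_dapple_num = central_diff(total_price, inputs, 1)
num_dtax = central_diff(total_price, inputs, 2)

print(num_dapple, num_dapple_num, num_dtax)
```

Up to floating-point error, the three estimates should match `dApple`, `dApple_num`, and `dTax` from the backward pass.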


In [3]:
class AddLayer:
    def __init__(self):
        pass

    def forward(self, x, y):
        out = x + y

        return out

    def backward(self, dout):
        dx = dout * 1
        dy = dout * 1

        return dx, dy

In [4]:
apple = 100
apple_num = 2
orange = 150
orange_num = 3
tax = 1.1

# layer
mul_apple_layer = MulLayer()
mul_orange_layer = MulLayer()
add_apple_orange_layer = AddLayer()
mul_tax_layer = MulLayer()

# forward
apple_price = mul_apple_layer.forward(apple, apple_num)  # (1)
orange_price = mul_orange_layer.forward(orange, orange_num)  # (2)
all_price = add_apple_orange_layer.forward(apple_price, orange_price)  # (3)
price = mul_tax_layer.forward(all_price, tax)  # (4)

# backward
dprice = 1
dall_price, dtax = mul_tax_layer.backward(dprice)  # (4)
dapple_price, dorange_price = add_apple_orange_layer.backward(dall_price)  # (3)
dorange, dorange_num = mul_orange_layer.backward(dorange_price)  # (2)
dapple, dapple_num = mul_apple_layer.backward(dapple_price)  # (1)

print("price:", int(price))
print("dApple:", dapple)
print("dApple_num:", int(dapple_num))
print("dOrange:", dorange)
print("dOrange_num:", int(dorange_num))
print("dTax:", dtax)

price: 715
dApple: 2.2
dApple_num: 110
dOrange: 3.3000000000000003
dOrange_num: 165
dTax: 650


## 5.5 Implementing Activation Function Layers

ReLU (Rectified Linear Unit)

$$
\begin{eqnarray*}
  y &=& \left\{ \begin{array}{cc}
    x & (x \gt 0) \\
    0 & (x \le 0)
  \end{array} \right. \\
  \frac{\partial L}{\partial x} &=& \left\{ \begin{array}{cc}
    \frac{\partial L}{\partial y} & (x \gt 0) \\
    0 & (x \le 0)
  \end{array} \right.
\end{eqnarray*}
$$

In [5]:
class Relu:
    def __init__(self):
        self.mask = None

    def forward(self, x):
        self.mask = (x <= 0)
        out = x.copy()
        out[self.mask] = 0

        return out

    def backward(self, dout):
        dout[self.mask] = 0
        dx = dout

        return dx
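
Note that `backward` above writes zeros into `dout` in place, so the caller's array is modified. If that side effect is unwanted, one possible variant builds new arrays with `np.where` instead. `ReluNoMutate` is a hypothetical name used only for this sketch.

```python
import numpy as np


class ReluNoMutate:
    """Hypothetical ReLU variant whose backward leaves `dout` untouched."""

    def __init__(self):
        self.mask = None

    def forward(self, x):
        # True where the input is non-positive; those positions output 0.
        self.mask = (x <= 0)
        return np.where(self.mask, 0, x)

    def backward(self, dout):
        # Gradient flows only where the forward input was positive.
        return np.where(self.mask, 0, dout)


relu_x = np.array([[1.0, -0.5], [-2.0, 3.0]])
relu_dout = np.ones_like(relu_x)

relu_layer = ReluNoMutate()
relu_out = relu_layer.forward(relu_x)
relu_dx = relu_layer.backward(relu_dout)

print(relu_out)
print(relu_dx)
print(relu_dout)  # still all ones
```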

Sigmoid function

$$
\begin{eqnarray*}
  y &=& \frac{1}{1 + \exp (-x)} \\
  \frac{\partial L}{\partial x} &=& \frac{\partial L}{\partial y} y^2 \exp (-x) \\
  &=& \frac{\partial L}{\partial y} y (1 - y)
\end{eqnarray*}
$$

In [6]:
class Sigmoid:
    def __init__(self):
        self.out = None

    def forward(self, x):
        out = sigmoid(x)
        self.out = out
        return out

    def backward(self, dout):
        dx = dout * (1.0 - self.out) * self.out

        return dx
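
The class above calls a `sigmoid` helper that this notebook does not define itself; it presumably comes from the book's `common` package. A minimal stand-in, together with a numerical check of the $y(1 - y)$ derivative, might look like the following. The helper body and the sample inputs are assumptions.

```python
import numpy as np


def sigmoid(x):
    """Assumed stand-in for the book's helper: element-wise logistic function."""
    return 1.0 / (1.0 + np.exp(-x))


sig_x = np.array([-2.0, 0.0, 2.0])
sig_y = sigmoid(sig_x)

# Analytic derivative expressed through the forward output only.
sig_analytic = sig_y * (1.0 - sig_y)

# Central-difference estimate for comparison.
h = 1e-4
sig_numeric = (sigmoid(sig_x + h) - sigmoid(sig_x - h)) / (2 * h)

print(sig_analytic)
print(sig_numeric)
```

Because the derivative depends only on $y$, the layer needs to cache just its forward output, which is what `self.out` does.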

## 5.6 Implementing the Affine and Softmax Layers

Affine transformation (matrix product plus bias)

$$
\begin{eqnarray*}
  Y &=& X \cdot W + B \\
  \frac{\partial L}{\partial X} &=& \frac{\partial L}{\partial Y} \cdot W^T \\
  \frac{\partial L}{\partial W} &=& X^T \cdot \frac{\partial L}{\partial Y} \\
  \frac{\partial L}{\partial B} &=& \mbox{sum of } \frac{\partial L}{\partial Y} \mbox{ over the first axis (axis 0)}
\end{eqnarray*}
$$

$X$ and $\frac{\partial L}{\partial X}$ have the same shape. The same holds for $W$ and $\frac{\partial L}{\partial W}$.

$$
\begin{eqnarray*}
  X &=& (x_0, x_1, \cdots, x_n) \\
  \frac{\partial L}{\partial X} &=& \left( \frac{\partial L}{\partial x_0}, \frac{\partial L}{\partial x_1}, \cdots, \frac{\partial L}{\partial x_n} \right)
\end{eqnarray*}
$$

In [8]:
import numpy as np

In [9]:
X_dot_W = np.array([[0, 0, 0], [10, 10, 10]])
B = np.array([1, 2, 3])

In [10]:
X_dot_W

array([[ 0,  0,  0],
       [10, 10, 10]])

In [11]:
B

array([1, 2, 3])

In [12]:
X_dot_W + B

array([[ 1,  2,  3],
       [11, 12, 13]])

In [13]:
dY = np.array([[1, 2, 3], [4, 5, 6]])
dY

array([[1, 2, 3],
       [4, 5, 6]])

In [14]:
dB = np.sum(dY, axis=0)
dB

array([5, 7, 9])

In [15]:
class Affine:
    def __init__(self, W, b):
        self.W = W
        self.b = b
        self.x = None
        self.dW = None
        self.db = None

    def forward(self, x):
        self.x = x
        out = np.dot(x, self.W) + self.b

        return out

    def backward(self, dout):
        dx = np.dot(dout, self.W.T)
        self.dW = np.dot(self.x.T, dout)
        self.db = np.sum(dout, axis=0)

        return dx
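
To illustrate the shape rule stated earlier, the sketch below applies the three backward formulas to small random arrays and confirms that each gradient has the shape of the quantity it belongs to. The sizes (batch of 2, 3 inputs, 4 outputs) and the variable names are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
aff_X = rng.standard_normal((2, 3))  # batch of 2 samples, 3 features each
aff_W = rng.standard_normal((3, 4))  # 3 inputs -> 4 outputs
aff_b = np.zeros(4)

# Forward: the bias is broadcast across the batch axis.
aff_Y = aff_X @ aff_W + aff_b

# Pretend the upstream gradient is all ones, i.e. L = sum of all entries of Y.
aff_dY = np.ones_like(aff_Y)

aff_dX = aff_dY @ aff_W.T    # same shape as X
aff_dW = aff_X.T @ aff_dY    # same shape as W
aff_db = aff_dY.sum(axis=0)  # same shape as b

print(aff_dX.shape, aff_dW.shape, aff_db.shape)
```

The bias gradient is summed over axis 0 because the forward pass broadcast the same bias to every row of the batch.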

Softmax function with cross-entropy loss

In [16]:
class SoftmaxWithLoss:
    def __init__(self):
        self.loss = None
        self.y = None  # output of softmax
        self.t = None  # target labels (one-hot vector)

    def forward(self, x, t):
        self.t = t
        self.y = softmax(x)
        self.loss = cross_entropy_error(self.y, self.t)

        return self.loss

    def backward(self, dout=1):
        batch_size = self.t.shape[0]
        dx = (self.y - self.t) / batch_size

        return dx
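
The class above relies on `softmax` and `cross_entropy_error`, which this notebook does not define itself. The sketch below supplies assumed stand-ins (batched, one-hot labels only) and checks numerically that the gradient of the loss with respect to the softmax input is $(y - t) / \text{batch size}$. The helper bodies and sample data are assumptions, not the book's exact implementations.

```python
import numpy as np


def softmax(x):
    """Assumed stand-in: row-wise softmax with max-subtraction for stability."""
    x = x - np.max(x, axis=-1, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=-1, keepdims=True)


def cross_entropy_error(y, t):
    """Assumed stand-in: mean cross-entropy for 2-D y and one-hot t."""
    batch_size = y.shape[0]
    return -np.sum(t * np.log(y + 1e-7)) / batch_size


sm_x = np.array([[0.3, 2.9, 4.0], [0.1, 0.2, 0.7]])
sm_t = np.array([[0, 0, 1], [1, 0, 0]])

# Analytic gradient, as returned by SoftmaxWithLoss.backward.
sm_analytic = (softmax(sm_x) - sm_t) / sm_x.shape[0]

# Central-difference gradient, one input element at a time.
sm_numeric = np.zeros_like(sm_x)
h = 1e-4
for idx in np.ndindex(*sm_x.shape):
    x_plus = sm_x.copy()
    x_minus = sm_x.copy()
    x_plus[idx] += h
    x_minus[idx] -= h
    loss_plus = cross_entropy_error(softmax(x_plus), sm_t)
    loss_minus = cross_entropy_error(softmax(x_minus), sm_t)
    sm_numeric[idx] = (loss_plus - loss_minus) / (2 * h)

print(np.max(np.abs(sm_analytic - sm_numeric)))
```

The small constant inside the logarithm guards against `log(0)` and introduces a negligible difference between the two gradients.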

## 5.7 Implementing Backpropagation

In [17]:
class TwoLayerNet:

    def __init__(self, input_size, hidden_size, output_size, weight_init_std = 0.01):
        # Initialize the weights
        self.params = {}
        self.params['W1'] = weight_init_std * np.random.randn(input_size, hidden_size)
        self.params['b1'] = np.zeros(hidden_size)
        self.params['W2'] = weight_init_std * np.random.randn(hidden_size, output_size)
        self.params['b2'] = np.zeros(output_size)

        # Create the layers (OrderedDict keeps them in forward order)
        self.layers = OrderedDict()
        self.layers['Affine1'] = Affine(self.params['W1'], self.params['b1'])
        self.layers['Relu1'] = Relu()
        self.layers['Affine2'] = Affine(self.params['W2'], self.params['b2'])

        self.lastLayer = SoftmaxWithLoss()

    def predict(self, x):
        for layer in self.layers.values():
            x = layer.forward(x)

        return x

    # x: input data, t: target labels
    def loss(self, x, t):
        y = self.predict(x)
        return self.lastLayer.forward(y, t)

    def accuracy(self, x, t):
        y = self.predict(x)
        y = np.argmax(y, axis=1)
        if t.ndim != 1:
            t = np.argmax(t, axis=1)

        accuracy = np.sum(y == t) / float(x.shape[0])
        return accuracy

    # x: input data, t: target labels
    def numerical_gradient(self, x, t):
        loss_W = lambda W: self.loss(x, t)

        grads = {}
        grads['W1'] = numerical_gradient(loss_W, self.params['W1'])
        grads['b1'] = numerical_gradient(loss_W, self.params['b1'])
        grads['W2'] = numerical_gradient(loss_W, self.params['W2'])
        grads['b2'] = numerical_gradient(loss_W, self.params['b2'])

        return grads

    def gradient(self, x, t):
        # forward
        self.loss(x, t)

        # backward
        dout = 1
        dout = self.lastLayer.backward(dout)

        layers = list(self.layers.values())
        layers.reverse()
        for layer in layers:
            dout = layer.backward(dout)

        # Collect the gradients stored by each Affine layer
        grads = {}
        grads['W1'], grads['b1'] = self.layers['Affine1'].dW, self.layers['Affine1'].db
        grads['W2'], grads['b2'] = self.layers['Affine2'].dW, self.layers['Affine2'].db

        return grads

In [20]:
import sys, os
sys.path.append(os.pardir)  # allow importing files from the parent directory
from common.layers import *
from common.gradient import numerical_gradient
from dataset.mnist import load_mnist

from collections import OrderedDict

In [21]:
# Load the data
(x_train, t_train), (x_test, t_test) = load_mnist(normalize=True, one_hot_label=True)

network = TwoLayerNet(input_size=784, hidden_size=50, output_size=10)

x_batch = x_train[:3]
t_batch = t_train[:3]

grad_numerical = network.numerical_gradient(x_batch, t_batch)
grad_backprop = network.gradient(x_batch, t_batch)

for key in grad_numerical.keys():
    diff = np.average( np.abs(grad_backprop[key] - grad_numerical[key]) )
    print(key + ":" + str(diff))

W1:1.93234902274e-13
W2:7.27464479877e-13
b1:6.84038904079e-13
b2:1.2012613404e-10


In [22]:
iters_num = 10000
train_size = x_train.shape[0]
batch_size = 100
learning_rate = 0.1

train_loss_list = []
train_acc_list = []
test_acc_list = []

iter_per_epoch = max(train_size // batch_size, 1)  # integer number of iterations per epoch

for i in range(iters_num):
    batch_mask = np.random.choice(train_size, batch_size)
    x_batch = x_train[batch_mask]
    t_batch = t_train[batch_mask]

    # Compute the gradient
    #grad = network.numerical_gradient(x_batch, t_batch)
    grad = network.gradient(x_batch, t_batch)

    # Update the parameters
    for key in ('W1', 'b1', 'W2', 'b2'):
        network.params[key] -= learning_rate * grad[key]

    loss = network.loss(x_batch, t_batch)
    train_loss_list.append(loss)

    if i % iter_per_epoch == 0:
        train_acc = network.accuracy(x_train, t_train)
        test_acc = network.accuracy(x_test, t_test)
        train_acc_list.append(train_acc)
        test_acc_list.append(test_acc)
        print(train_acc, test_acc)

0.173216666667 0.1756
0.9013 0.9058
0.922983333333 0.9274
0.935983333333 0.9366
0.946416666667 0.945
0.952516666667 0.9511
0.958416666667 0.9539
0.96195 0.9565
0.9659 0.9603
0.96635 0.9595
0.970033333333 0.9643
0.972483333333 0.9644
0.973883333333 0.9649
0.9759 0.9674
0.977566666667 0.9672
0.977483333333 0.9677
0.979183333333 0.9695
