# Chapter 4. Neural Network Learning

Learning means finding the optimal weight values from the training data by minimizing the value of a loss function.

## 4.1 Learning from Data

In the case of image classification,  
Classical machine learning (SVM, KNN, etc.): the model learns patterns in features extracted from the images. However, the features themselves are still designed by a human.  
Neural network (deep learning): the machine itself learns which features of the images are important and extracts them.

To evaluate generalization ability, we normally split the data into training data and test data: the model learns from the training data only and is then evaluated on the test data. (Generalization ability is the ability to solve problems the machine has never seen before.)
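
As a rough illustration of the split described above, here is a minimal sketch on toy data. The helper name `train_test_split_simple`, the 20% test ratio, and the fixed seed are assumptions made for this example; they do not come from the original notes.

```python
import numpy as np

def train_test_split_simple(x, t, test_ratio=0.2, seed=0):
    """Shuffle the samples and hold out a fraction of them as test data (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(x))          # random order of sample indices
    n_test = int(len(x) * test_ratio)      # number of held-out samples
    test_idx, train_idx = idx[:n_test], idx[n_test:]
    return x[train_idx], t[train_idx], x[test_idx], t[test_idx]

x = np.arange(20).reshape(10, 2)   # 10 toy samples with 2 features each
t = np.arange(10)                  # toy labels
x_train, t_train, x_test, t_test = train_test_split_simple(x, t)
print(x_train.shape, x_test.shape)
```

The test samples are never used while fitting, so the score on them estimates how well the model handles unseen data.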

## 4.2 Loss Function

A neural network finds the optimal weight and bias values by minimizing a loss function. Mean squared error (MSE) and cross entropy error (CEE) are commonly used as the loss function.  
The reason we cannot search for the optimal values using accuracy directly is that the derivative of accuracy with respect to the parameters is 0 at almost every point, so it gives no direction in which to update them.

- Mean squared error (MSE) for a single sample
  
    $ E =  \frac{1}{2} \sum_k (y_k - t_k)^2$ ($y_k$ is the network output and $t_k$ is the target value)

In [31]:
import numpy as np
def mean_squared_error(y, t):
    return 0.5*np.sum((y-t)**2)

# Example
y = [0.1,0.05,0.1,0.0,0.05,0.1,0.0,0.6,0.0,0.0]
t = [0,0,1,0,0,0,0,0,0,0] #one hot encoding

mean_squared_error(np.array(y), np.array(t))

0.5975

- Cross entropy error (CEE) for a single sample
  
    $ E = - \sum_k t_k \log y_k$ ($\log$ is the natural logarithm: $\log_e$)

In [34]:
def cross_entropy_error(y,t):
    delta = 1e-7 #very tiny value
    return -np.sum(t*np.log(y + delta)) #delta prevents np.log(0), which would give -inf

# Example
y = [0.1,0.05,0.1,0.0,0.05,0.1,0.0,0.6,0.0,0.0]
t = [0,0,1,0,0,0,0,0,0,0] #one hot encoding

cross_entropy_error(np.array(y),np.array(t))

2.302584092994546
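
To see how the formula $E = -\sum_k t_k \log y_k$ rewards a correct prediction, the following sketch compares two hypothetical outputs for the same target. The arrays `y_good` and `y_bad` are made up for this illustration, and the function is repeated so the block runs on its own.

```python
import numpy as np

def cross_entropy_error(y, t):
    delta = 1e-7  # keeps np.log away from log(0)
    return -np.sum(t * np.log(y + delta))

t = np.array([0, 0, 1, 0, 0, 0, 0, 0, 0, 0])  # the correct class is index 2

# y_good puts probability 0.6 on the correct class, y_bad only 0.1
y_good = np.array([0.1, 0.05, 0.6, 0.0, 0.05, 0.1, 0.0, 0.1, 0.0, 0.0])
y_bad = np.array([0.1, 0.05, 0.1, 0.0, 0.05, 0.1, 0.0, 0.6, 0.0, 0.0])

loss_good = cross_entropy_error(y_good, t)  # roughly -log(0.6)
loss_bad = cross_entropy_error(y_bad, t)    # roughly -log(0.1)
print(loss_good, loss_bad)
```

Because $t$ is one-hot, only the probability assigned to the correct class contributes, so the loss shrinks as that probability approaches 1.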

The formulas above are for a single sample. The loss function for the whole dataset is given below.

- Cross entropy error (CEE) for the whole dataset  
    $E = -\frac{1}{N}\sum_n\sum_k t_{nk}\log y_{nk}$  
    ($\log$ is the natural logarithm ($\log_e$), $N$ is the number of samples, and $t_{nk}$ is the $k^{th}$ value of the $n^{th}$ sample)

- Mini-batch  
    Instead of computing the loss over all of the training data, we train on a small, randomly chosen subset (a mini-batch) at each step.

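In [1]:
import numpy as np
from sklearn.datasets import fetch_openml

# as_frame=False returns NumPy arrays, so the integer-array indexing used for mini-batches works
mnist = fetch_openml('mnist_784', version=1, as_frame=False)

X = mnist['data']
T = mnist['target'].astype(int) #labels are loaded as strings ('0'-'9')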

x_train = X[:60000]
t_train = T[:60000]
x_test = X[60000:]
t_test = T[60000:]

print(x_train.shape)
print(t_train.shape)



(60000, 784)
(60000,)


In [2]:
train_size = x_train.shape[0]
batch_size = 10
batch_mask = np.random.choice(train_size, batch_size)
x_batch = x_train[batch_mask]
t_batch = t_train[batch_mask]

- Cross entropy error(CEE) for mini-batch

In [3]:
def cross_entropy_error_minibatch(y,t): #one hot encoding
    if y.ndim == 1: #make the array to 2-d
        y = y.reshape(1,y.size)
        t = t.reshape(1,t.size)
    
    delta = 1e-7
    batch_size = y.shape[0]
    return -np.sum(t*np.log(y+delta))/batch_size

In [4]:
def cross_entropy_error_minibatch_1(y,t): #when the answer array holds integer class labels (not one hot encoding)
    if y.ndim == 1: #make the array to 2-d
        y = y.reshape(1,y.size)
        t = t.reshape(1,t.size)
    
    delta = 1e-7
    batch_size = y.shape[0]
    return -np.sum(np.log(y[np.arange(batch_size),t]+delta))/batch_size
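
The one-hot version and the label version should give the same loss for the same batch. The sketch below checks this on a small made-up batch; the short names `cee_one_hot` and `cee_labels` and the sample values are assumptions for this illustration.

```python
import numpy as np

def cee_one_hot(y, t):
    delta = 1e-7
    return -np.sum(t * np.log(y + delta)) / y.shape[0]

def cee_labels(y, t):
    delta = 1e-7
    batch_size = y.shape[0]
    # pick, for each row n, the predicted probability of its correct class t[n]
    return -np.sum(np.log(y[np.arange(batch_size), t] + delta)) / batch_size

y = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.8, 0.1],
              [0.3, 0.3, 0.4]])   # softmax-like outputs for 3 samples
labels = np.array([0, 1, 0])      # integer class labels
one_hot = np.eye(3)[labels]       # the same labels, one hot encoded

loss_a = cee_one_hot(y, one_hot)
loss_b = cee_labels(y, labels)
print(loss_a, loss_b)
```

The indexing `y[np.arange(batch_size), t]` works because multiplying by a one-hot row keeps only the entry at the correct class.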

Accuracy usually does not change at all when a parameter is adjusted by a tiny amount, and when it does respond, its value jumps discontinuously. Thus, we cannot use accuracy as the indicator for learning. This is similar to the reason why we don't use a step function as an activation function: its slope is 0 almost everywhere. The sigmoid function, on the other hand, changes continuously and its slope is never 0.
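
The point about accuracy can be checked on a toy problem. This is only a sketch: the four samples, their labels, the weight matrix `W`, and the helper `loss_and_accuracy` are made up for the illustration.

```python
import numpy as np

def softmax(z):
    z = z - np.max(z, axis=1, keepdims=True)  # stabilize the exponentials
    e = np.exp(z)
    return e / np.sum(e, axis=1, keepdims=True)

def loss_and_accuracy(W, x, labels):
    y = softmax(x @ W)
    n = x.shape[0]
    loss = -np.mean(np.log(y[np.arange(n), labels] + 1e-7))  # mean cross entropy
    accuracy = np.mean(np.argmax(y, axis=1) == labels)
    return loss, accuracy

x = np.array([[1.0, 2.0], [-1.5, 0.5], [2.0, -1.0], [-0.5, -2.0]])
labels = np.array([1, 0, 1, 1])
W = np.array([[-0.5, 0.5], [0.1, -0.1]])

loss_before, acc_before = loss_and_accuracy(W, x, labels)

W_nudged = W.copy()
W_nudged[0, 1] += 1e-4   # a tiny change to a single weight
loss_after, acc_after = loss_and_accuracy(W_nudged, x, labels)

print(acc_before, acc_after)        # accuracy does not move
print(loss_after - loss_before)     # the loss responds to the tiny change
```

The loss gives a usable signal for every small update, while accuracy stays flat until a prediction actually flips.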


## 4.3 Numerical Differentiation

- differentiation
$$ \frac{df(x)}{dx} = \lim_{h \to 0} \frac{f(x+h)-f(x)}{h} $$

In [2]:
def numerical_diff(f,x):
    h = 1e-4 #0.0001
    return (f(x+h) - f(x-h))/(2*h) #central difference
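
The definition of the derivative uses a forward difference, while `numerical_diff` uses a central difference. The sketch below compares the two on $f(x) = x^3$, a test function chosen for this illustration because its exact derivative $3x^2$ is known.

```python
def forward_diff(f, x, h=1e-4):
    return (f(x + h) - f(x)) / h            # one-sided, as in the definition

def central_diff(f, x, h=1e-4):
    return (f(x + h) - f(x - h)) / (2 * h)  # symmetric around x

def f(x):
    return x ** 3   # exact derivative: 3 * x**2

x = 2.0
exact = 3 * x ** 2
err_forward = abs(forward_diff(f, x) - exact)
err_central = abs(central_diff(f, x) - exact)
print(err_forward, err_central)
```

With a finite `h`, the error of the forward difference shrinks in proportion to `h`, while the error of the central difference shrinks in proportion to `h**2`, which is why the central form is preferred here.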

- partial derivative  
we use it when the function has two or more variables.  
e.g. $ \frac {\partial f}{\partial x_0} $  
we choose one target variable and differentiate with respect to it, treating all the other variables as constants.
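
A minimal sketch of this idea, assuming the example function $f(x_0, x_1) = x_0^2 + x_1^2$ and the point $(3, 4)$: each partial derivative is obtained by fixing the other variable and differentiating a one-variable function.

```python
def numerical_diff(f, x, h=1e-4):
    return (f(x + h) - f(x - h)) / (2 * h)  # central difference

def f(x0, x1):
    return x0 ** 2 + x1 ** 2

# hold x1 = 4.0 constant and differentiate with respect to x0 at x0 = 3.0
df_dx0 = numerical_diff(lambda x0: f(x0, 4.0), 3.0)

# hold x0 = 3.0 constant and differentiate with respect to x1 at x1 = 4.0
df_dx1 = numerical_diff(lambda x1: f(3.0, x1), 4.0)

print(df_dx0, df_dx1)   # close to the exact values 2*x0 = 6 and 2*x1 = 8
```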

## 4.4 Gradient

The gradient is the vector that collects the partial derivatives with respect to all variables. e.g. $(\frac {\partial f}{\partial x_0},\frac {\partial f}{\partial x_1})$

In [23]:
import numpy as np
def numerical_gradient(f,x):
    h = 1e-4
    grad = np.zeros_like(x)
    # np.ndindex visits every element, so x may have any number of dimensions
    for idx in np.ndindex(*x.shape):
        tmp_val = x[idx]
        x[idx] = tmp_val + h #f(x+h)
        fxh1 = f(x)
        
        x[idx] = tmp_val - h #f(x-h)
        fxh2 = f(x)
        
        grad[idx] = (fxh1 - fxh2)/(2*h) #central difference
        x[idx] = tmp_val #restore the original value
    
    return grad
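
A minimal usage sketch follows. The function is repeated so the block runs on its own, and it is written with `np.ndindex` (a choice made for this sketch) so that it accepts arrays of any shape. The test function `f` and the sample arrays are made up for the illustration.

```python
import numpy as np

def numerical_gradient(f, x):
    h = 1e-4
    grad = np.zeros_like(x)
    for idx in np.ndindex(*x.shape):   # every index of x, whatever its shape
        tmp_val = x[idx]
        x[idx] = tmp_val + h
        fxh1 = f(x)
        x[idx] = tmp_val - h
        fxh2 = f(x)
        grad[idx] = (fxh1 - fxh2) / (2 * h)
        x[idx] = tmp_val               # restore the original value
    return grad

def f(x):
    return np.sum(x ** 2)   # exact gradient: 2 * x

g1 = numerical_gradient(f, np.array([3.0, 4.0]))   # 1-D input
W = np.array([[1.0, -2.0, 0.5],
              [0.0, 3.0, -1.0]])
g2 = numerical_gradient(f, W)                      # 2-D input, same shape as W
print(g1)
print(g2)
```

Note that the input must be a float array: with an integer array, adding `h` would be truncated and the gradient would come out as 0.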

- gradient method(gradient descent)  
    Gradient descent is a first-order iterative optimization algorithm for finding a local minimum of a differentiable function. The idea is to take repeated steps in the opposite direction of the gradient (or approximate gradient) of the function at the current point, because this is the direction of steepest descent. Conversely, stepping in the direction of the gradient will lead to a local maximum of that function; the procedure is then known as gradient ascent.  [reference: https://en.wikipedia.org/wiki/Gradient_descent ]

$$ x_0 = x_0 - \eta \frac{\partial f}{\partial x_0} $$  
$$ x_1 = x_1 - \eta \frac{\partial f}{\partial x_1} $$  
$\eta$ is the learning rate, which determines how much the parameters are updated in a single step.

In [4]:
def gradient_descent(f, init_x, lr=0.01, step_num=100):
    x = init_x
    
    for i in range(step_num):
        grad = numerical_gradient(f,x)
        x -= lr*grad
    return x

e.g. $f(x_0, x_1) = x_0^2+x_1^2$

In [6]:
def function(x): 
    return x[0]**2 + x[1]**2

In [14]:
import numpy as np #p132
init_x = np.array([-3.0, 4.0])
gradient_descent(function, init_x, lr=0.1, step_num=100)

array([-6.11110793e-10,  8.14814391e-10])

Parameters such as the learning rate are called hyperparameters. While the weight parameters of a neural network are learned automatically from the training data, hyperparameters such as the learning rate must be set by a human. Thus, we have to find a good learning rate by trying various values.
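
To illustrate why the learning rate needs tuning, the sketch below runs gradient descent on $f(x_0, x_1) = x_0^2 + x_1^2$ with three learning rates. For simplicity it assumes the analytic gradient $2x$ instead of a numerical one, and the specific values 10.0 and 1e-10 are chosen only for this example.

```python
import numpy as np

def gradient_descent(grad_f, init_x, lr, step_num=100):
    x = init_x.copy()            # leave the caller's array untouched
    for _ in range(step_num):
        x -= lr * grad_f(x)
    return x

def grad_f(x):
    return 2 * x   # analytic gradient of f(x0, x1) = x0**2 + x1**2

init_x = np.array([-3.0, 4.0])
x_large = gradient_descent(grad_f, init_x, lr=10.0)    # too large: diverges
x_good = gradient_descent(grad_f, init_x, lr=0.1)      # converges toward (0, 0)
x_small = gradient_descent(grad_f, init_x, lr=1e-10)   # too small: barely moves
print(x_large, x_good, x_small)
```

A learning rate that is too large overshoots the minimum on every step, and one that is too small makes almost no progress within the given number of steps.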

For a neural network, the gradient is that of the loss function $L$ with respect to the weights $W$; it has the same shape as $W$.

$$ W = \left[\begin{array}{rrr} 
w_{11}&w_{12}&w_{13}\\
w_{21}&w_{22}&w_{23}\\
\end{array}\right] $$

$$ \frac{\partial L}{\partial W} = \left[\begin{array}{rrr} 
\frac {\partial L}{\partial w_{11}} & \frac {\partial L}{\partial w_{12}} & \frac {\partial L}{\partial w_{13}}\\
\frac {\partial L}{\partial w_{21}} & \frac {\partial L}{\partial w_{22}} & \frac {\partial L}{\partial w_{23}}\\
\end{array}\right] $$

In [29]:
import numpy as np
# cross_entropy_error and numerical_gradient are the functions defined earlier in this chapter

def softmax(z):
    z = z - np.max(z) #subtract the max to avoid overflow
    return np.exp(z)/np.sum(np.exp(z))

class simpleNet:
    def __init__(self):
        self.W = np.random.randn(2,3) #weights drawn from a standard normal distribution
    
    def predict(self, x):
        return np.dot(x, self.W)
    
    def loss(self, x, t):
        z = self.predict(x)
        y = softmax(z)
        loss = cross_entropy_error(y,t)
        return loss

In [21]:
net = simpleNet()
print(net.W)
x = np.array([0.6, 0.9])
p = net.predict(x)
print(p)
print('max index:', np.argmax(p))
t = np.array([0, 0, 1])
net.loss(x,t)

[[ 1.40317766 -1.73806586  0.03307011]
 [-1.26061617 -0.12631762  0.5202958 ]]
[-0.29264796 -1.15652538  0.48810829]
max index: 2


0.50147

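In [24]:
# W is a dummy argument: net.loss reads net.W, which numerical_gradient perturbs in place
def f(W):
    return net.loss(x,t)
# net.W is 2-D, so numerical_gradient must visit every element of a multi-dimensional array;
# a version that loops over range(x.size) and indexes x[idx] fails here with
# "IndexError: index 2 is out of bounds for axis 0 with size 2"
dW = numerical_gradient(f, net.W)
print(dW)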
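
The whole computation can be sketched end to end as follows. This is a self-contained illustration under a few assumptions: the weights are fixed to the values printed above instead of being random, `numerical_gradient` is written with `np.ndindex` so that it handles the 2-D weight matrix, and the result is compared with the well-known analytic gradient of softmax with cross entropy, $x^T (y - t)$, which does not appear in the original notes.

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)   # subtract the max to avoid overflow
    return np.exp(z) / np.sum(np.exp(z))

def cross_entropy_error(y, t):
    delta = 1e-7
    return -np.sum(t * np.log(y + delta))

def numerical_gradient(f, x):
    h = 1e-4
    grad = np.zeros_like(x)
    for idx in np.ndindex(*x.shape):   # works for arrays of any shape
        tmp_val = x[idx]
        x[idx] = tmp_val + h
        fxh1 = f(x)
        x[idx] = tmp_val - h
        fxh2 = f(x)
        grad[idx] = (fxh1 - fxh2) / (2 * h)
        x[idx] = tmp_val
    return grad

class simpleNet:
    def __init__(self):
        self.W = np.random.randn(2, 3)

    def predict(self, x):
        return np.dot(x, self.W)

    def loss(self, x, t):
        return cross_entropy_error(softmax(self.predict(x)), t)

net = simpleNet()
# fix the weights to the values printed earlier so the run is reproducible
net.W = np.array([[1.40317766, -1.73806586, 0.03307011],
                  [-1.26061617, -0.12631762, 0.5202958]])
x = np.array([0.6, 0.9])
t = np.array([0, 0, 1])

# the lambda ignores its argument; net.loss reads net.W, which is perturbed in place
dW = numerical_gradient(lambda W: net.loss(x, t), net.W)

# analytic gradient for softmax + cross entropy, used only as a cross-check
y = softmax(net.predict(x))
dW_analytic = np.outer(x, y - t)

print(dW)
print(np.allclose(dW, dW_analytic, atol=1e-6))
```

Each entry of `dW` says how the loss changes when the corresponding weight is increased slightly: a negative entry means that increasing that weight lowers the loss, so gradient descent will increase it.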