&copy;Copyright 2017 Shuang Wu<br>
Cited from the book Neural Networks and Deep Learning by Michael Nielsen, http://neuralnetworksanddeeplearning.com <br>
Learning notes

## CH 3
## Improving the way neural networks learn

### The cross-entropy cost function
Consider a simple one-layer network (a single sigmoid neuron) with learning rate $\eta = 0.15$, initial weight $0.6$ and initial bias $0.9$. The cost is the quadratic cost function, $C$. For input $1$ we want the output to be $0$. After 300 epochs the output reaches about $0.09$.<br>
![cec1](/notebooks/imgs/cec1.jpg)<br>
The neuron learns quickly here. Now change both the initial weight and the initial bias to $2$.<br>
![cec2](/notebooks/imgs/cec2.jpg)<br>
This time learning starts slowly: for roughly the first 150 epochs the weight and bias barely change.<br>
The artificial neuron has a lot of difficulty learning when it is badly wrong.<br>
Slow learning is the same thing as small partial derivatives $\partial C/\partial w$ and $\partial C/\partial b$. This is because of the shape of the sigmoid function. For the quadratic cost, both derivatives contain the factor $\sigma'(z)$. When the neuron's output is close to $1$ the curve gets very flat, so $\sigma'(z)$ gets very small, which makes both derivatives very small.<br>
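
The effect can be checked numerically. The following is a minimal sketch, assuming the single-neuron setup above (input $1$, target $0$, quadratic cost $C=(y-a)^2/2$); the function name `quadratic_gradients` is made up for this note and is not from the book's code.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def quadratic_gradients(w, b, x=1.0, y=0.0):
    """Return (dC/dw, dC/db) for C = (y - a)^2 / 2 with a = sigmoid(w*x + b)."""
    z = w * x + b
    a = sigmoid(z)
    sigma_prime = a * (1.0 - a)          # derivative of the sigmoid at z
    dC_db = (a - y) * sigma_prime
    dC_dw = dC_db * x
    return dC_dw, dC_db

# Start that learns quickly (w=0.6, b=0.9) versus the badly wrong start (w=b=2).
grad_good_start = quadratic_gradients(0.6, 0.9)
grad_bad_start = quadratic_gradients(2.0, 2.0)
print("w=0.6, b=0.9:", grad_good_start)
print("w=2.0, b=2.0:", grad_bad_start)
```

Although the output error is larger for the second start, its gradient is several times smaller, because $\sigma'(z)$ is tiny when the neuron is saturated.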

### Introducing the cross-entropy cost function
We can solve the learning slowdown problem by replacing the quadratic cost with a cross-entropy cost function.<br>
Suppose the simple model now has three inputs:<br>
![cec3](/notebooks/imgs/cec3.jpg)<br>
Define the cross-entropy cost function as:<br>
$$C=-\frac{1}{n}\sum_x[y\ln a+(1-y)\ln(1-a)]$$<br>
$n$ is the total number of training examples, the sum is over all training inputs $x$, $y$ is the corresponding desired output, and $a=\sigma(z)$ is the neuron's output with $z=\sum_j w_jx_j+b$.<br>
First, why can the cross-entropy be interpreted as a cost function? It has 2 properties:<br>
1. $C>0$<br>
    (1) all terms in the sum are negative, because they are logarithms of numbers in the range 0 to 1.<br>
    (2) there is a minus sign out in front of the sum.<br>
2. If the neuron's actual output is close to the desired output for all training inputs $x$, then the cross-entropy is close to 0. This can be seen from 2 situations:<br>
    (1) $y=0$ and $a \approx 0$<br>
    (2) $y=1$ and $a \approx 1$<br>
In both cases $C\approx 0$.<br>
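
Both properties are easy to check for a single training example. This is a small sketch; the helper `cross_entropy` and the sample values of $a$ are chosen only for illustration.

```python
import math

def cross_entropy(a, y):
    """Cross-entropy of one training example, for an output a strictly between 0 and 1."""
    return -(y * math.log(a) + (1 - y) * math.log(1 - a))

# (desired output y, actual output a): two good predictions, two bad ones.
cases = [(0, 0.01), (1, 0.99), (0, 0.99), (1, 0.01)]
costs = [cross_entropy(a, y) for y, a in cases]
for (y, a), c in zip(cases, costs):
    print("y =", y, "a =", a, "C =", c)
```

The cost is positive in every case, close to 0 when $a$ is close to $y$, and large when the neuron is confidently wrong.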

To see why the cross-entropy solves the learning slowdown problem, compute the partial derivative with respect to the weights:<br>
$$\frac{\partial C}{\partial w_j}=\frac{1}{n}\sum_x x_j(\sigma(z)-y)$$<br>
This means the rate at which a weight learns is controlled by $\sigma(z)-y$, the error in the output. The larger the error, the faster the neuron learns. With the quadratic cost, learning is slowed down by the factor $\sigma'(z)$; with the cross-entropy that factor cancels out. The partial derivative for the bias is:<br>
$$\frac{\partial C}{\partial b} = \frac{1}{n}\sum_x(\sigma(z)-y)$$<br>
Returning to the previous example, the neuron now learns quickly from both starting points:<br>
![cec4](/notebooks/imgs/cec4.jpg)<br>
![cec5](/notebooks/imgs/cec5.jpg)<br>
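
To put numbers on the difference, the sketch below evaluates $\partial C/\partial w$ under both cost functions for the badly wrong start $w=b=2$ (input $1$, target $0$, a single training example so $n=1$). The function name `weight_gradients` is hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def weight_gradients(w, b, x=1.0, y=0.0):
    """Return dC/dw for the quadratic cost and for the cross-entropy cost."""
    a = sigmoid(w * x + b)
    quadratic = (a - y) * a * (1.0 - a) * x   # contains sigma'(z) = a(1 - a)
    cross_entropy = (a - y) * x               # sigma'(z) has cancelled
    return quadratic, cross_entropy

quad_grad, ce_grad = weight_gradients(2.0, 2.0)
print("quadratic:", quad_grad)
print("cross-entropy:", ce_grad)
print("ratio:", ce_grad / quad_grad)
```

The ratio of the two gradients is exactly $1/\sigma'(z)$, which is large precisely when the neuron is saturated.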

The cross-entropy extends to multi-layer networks with many output neurons, where the inner sum is over the output neurons $j$:<br>
$$C=-\frac{1}{n}\sum_x\sum_j[y_j\ln a^L_j+(1-y_j)\ln(1-a^L_j)]$$

<strong>Choosing between the two cost functions</strong><br>
Cross-entropy is nearly always the better choice, provided the output neurons are sigmoid neurons. When setting up the network we usually initialize the weights and biases randomly. It may happen that those initial choices leave the network decisively wrong for some training input, so that an output neuron saturates near $1$ when it should be $0$, or vice versa. With the quadratic cost this slows learning down. It won't stop learning completely, since the weights will continue learning from other training inputs, but it is undesirable.

### Using the cross-entropy to classify MNIST digits<br>
The overall accuracy increases, especially when we have 100 hidden neurons. The cross-entropy cost gives similar or better results than the quadratic cost, but the improvement is small. Cross-entropy is a widely used cost function, and it is a good laboratory for beginning to understand neuron saturation and how it may be addressed.<br>

### What does the cross-entropy mean? 
Our neuron is trying to compute the function $x\rightarrow y = y(x)$, but instead it computes the function $x\rightarrow a =a(x)$. Suppose we think of $a$ as our neuron's estimated probability that $y$ is $1$, and $1-a$ as the estimated probability that the right value for $y$ is $0$. Then the cross-entropy measures how "surprised" we are, on average, when we learn the true value for $y$. We get low surprise if the output is what we expect, and high surprise if the output is unexpected.

### Softmax<br>
A softmax layer is a new type of output layer for our neural networks. It begins in the same way as a sigmoid layer, by forming the weighted inputs $z_j^L=\sum_kw_{jk}^La_k^{L-1}+b^L_j$. But we don't apply the sigmoid function to get the output: instead of $a^L_j=\sigma(z^L_j)$, a softmax layer applies the <i>softmax function</i> to the $z^L_j$. The activation $a^L_j$ of the $j^{th}$ output neuron is:<br>
$$a^L_j = \frac{e^{z^L_j}}{\sum_ke^{z^L_k}}$$<br>
The sum in the denominator is over all the output neurons, so the output activations always sum to $1$:<br>
$$\sum_j a^L_j = \frac{\sum_j e^{z^L_j}}{\sum_ke^{z^L_k}}=1$$<br>
As a result, if $a^L_4$ increases, then the other output activations must decrease by the same total amount, to ensure the sum over all activations remains $1$. The same holds for all the other activations.<br>
The output activations are also all positive, because the exponential function is positive.<br>
So the output of a softmax layer is a set of positive numbers which sum to $1$; in other words, it can be thought of as a probability distribution. It is convenient to be able to interpret the output activation $a^L_j$ as the network's estimate of the probability that the correct output is $j$.<br>
<strong>The activations from a sigmoid layer won't in general form a probability distribution. With a sigmoid output layer we don't have such a simple interpretation of the output activations.</strong><br>
Softmax can be seen as rescaling the $z^L_j$ and then squishing them together to form a probability distribution.<br>
Softmax is monotonic: increasing $z^L_j$ increases $a^L_j$ and decreases all the other output activations. It is also non-local: every output activation depends on all the weighted inputs.<br>
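
The properties above (positivity, summing to 1, monotonicity) can be checked with a few lines of code. This is a minimal sketch with made-up weighted inputs; subtracting the maximum before exponentiating is a standard numerical-stability trick that is not part of these notes and does not change the result.

```python
import math

def softmax(zs):
    """Softmax of a list of weighted inputs z_j."""
    shift = max(zs)                       # stability trick; cancels in the ratio
    exps = [math.exp(z - shift) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]

z = [2.5, -1.0, 3.2, 0.5]                # illustrative weighted inputs
a = softmax(z)

z_bumped = [2.5, -1.0, 3.2, 1.5]         # increase only the fourth input
a_bumped = softmax(z_bumped)

print("activations:", a, "sum:", sum(a))
print("after increasing z_4:", a_bumped)
```

Increasing the fourth weighted input raises the fourth activation and lowers every other activation, while the total stays at 1.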

<strong>The learning slowdown problem:</strong>
To see how a softmax layer addresses the learning slowdown problem, define the log-likelihood cost function for a training input $x$ whose desired output is $y$:<br>
$$C\equiv -\ln a^L_y$$<br>
The corresponding partial derivatives are:<br>
$$\frac{\partial C}{\partial b^L_j} = a^L_j - y_j$$<br>
$$\frac{\partial C}{\partial w^L_{jk}} = a^{L-1}_k(a^L_j - y_j)$$<br>
Here $y_j$ is $1$ for the correct output neuron and $0$ otherwise. These expressions have the same form as those for the cross-entropy, so there is no $\sigma'(z)$ factor to slow learning down. It's useful to think of a softmax output layer with log-likelihood cost as being quite similar to a sigmoid output layer with cross-entropy cost.<br>
In many situations both approaches work well. As a more general point of principle, softmax plus log-likelihood is worth using whenever you want to interpret the output activations as probabilities. This is not always a concern, but can be useful with <strong>classification problems involving disjoint classes</strong>.<br>
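
The bias derivative can be verified numerically. Since $\partial z^L_j/\partial b^L_j=1$, the derivative with respect to $b^L_j$ equals the derivative with respect to $z^L_j$, so the sketch below perturbs the weighted inputs directly and compares a central finite difference with $a^L_j-y_j$. The inputs and the label are arbitrary illustrative values.

```python
import math

def softmax(zs):
    shift = max(zs)
    exps = [math.exp(z - shift) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]

def log_likelihood_cost(zs, label):
    """C = -ln a_y, where y is the index of the correct class."""
    return -math.log(softmax(zs)[label])

zs = [0.3, -1.2, 2.0]     # illustrative weighted inputs of the output layer
label = 2                 # index of the correct class

a = softmax(zs)
analytic = [a_j - (1.0 if j == label else 0.0) for j, a_j in enumerate(a)]

eps = 1e-6
numeric = []
for j in range(len(zs)):
    up, down = list(zs), list(zs)
    up[j] += eps
    down[j] -= eps
    diff = log_likelihood_cost(up, label) - log_likelihood_cost(down, label)
    numeric.append(diff / (2 * eps))

print("analytic:", analytic)
print("numeric: ", numeric)
```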

### Overfitting and regularization
The test setup: a network with 30 hidden neurons and its 23,860 parameters. Instead of training on all 50,000 images, use only the first 1,000 training images, with $\eta=0.5$ and a mini-batch size of 10. Train for 400 epochs, more than before. The cost on the training data changes as the network learns:<br>
![of1](/notebooks/imgs/of1.jpg)<br>
The classification accuracy on the test data changes over time:<br>
![of2](/notebooks/imgs/of2.jpg)<br>
The cost looks good, but the test accuracy shows the improvement is an illusion. What our network learns after epoch 280 no longer generalizes to the test data, so it is not useful learning. We say the network is <i>overfitting</i> beyond epoch 280.<br>
The cost on the test data:<br>
![of3](/notebooks/imgs/of3.jpg)<br>
The cost on the test data improves until around epoch 15, and after that it gets worse, even though the cost on the training data keeps getting better. This is another sign of overfitting.<br>
The classification accuracy on the training data is a further sign:<br>
![of4](/notebooks/imgs/of4.jpg)<br>
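The accuracy rises all the way up to 100%, i.e. the network correctly classifies all 1,000 training images. But the test accuracy is only 82%. So the network is overfitting.<br>
The obvious way to detect overfitting is to keep track of the accuracy on the test data as the network trains. If the accuracy is no longer improving, then stop training. Strictly speaking this is not necessarily a sign of overfitting, since the accuracy on the test data and on the training data might both stop improving at the same time.<br>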

We can also use the validation data to detect overfitting. We compute the classification accuracy on the validation data at the end of each epoch. Once the classification accuracy on the validation data has saturated, stop training. This strategy is called <i>early stopping</i>.<br>
We use the validation data instead of the test data because, if we chose hyper-parameters based on the test data, we might end up fitting them to the quirks of the test data. The test accuracy would then no longer be a true measure of how well our NN generalizes.<br>
Now back to the example, this time training on all 50,000 images. The accuracy on the training data and the test data:<br>
![of5](/notebooks/imgs/of5.jpg)<br>
Though there is still a small amount of overfitting, it is much reduced compared with the previous example.<br>
<strong>One of the best ways of reducing overfitting is to increase the size of the training data. With enough training data it is difficult for even a very large network to overfit. Unfortunately, training data can be expensive or difficult to acquire, so not always a practical option.</strong>

### Regularization<br>
With a fixed network and fixed training data, we can reduce overfitting using <i>regularization</i> techniques. One of the most commonly used is known as weight decay or L2 regularization.<br>
The idea is to add an extra term to the cost function, the regularization term. E.g. with the cross-entropy:<br>
$$C=-\frac{1}{n}\sum_{xj}[y_j\ln a_j^L+(1-y_j)\ln(1-a^L_j)]+\frac{\lambda}{2n}\sum_w w^2$$<br>
The second term is the sum of the squares of all the weights in the network, scaled by a factor $\lambda/2n$, where $\lambda>0$ is known as the regularization parameter and $n$ is the size of the training set. This term does not include the biases.<br>
The regularized quadratic cost function:<br>
$$C=\frac{1}{2n}\sum_x\|y-a^L\|^2+\frac{\lambda}{2n}\sum_w w^2$$ <br>
Or, in general:<br>
$$C=C_0+\frac{\lambda}{2n}\sum_w w^2$$<br>

First, consider stochastic gradient descent (SGD) in a regularized neural network. The partial derivatives now become:<br>
$$\frac{\partial C}{\partial w} = \frac{\partial C_0}{\partial w} + \frac{\lambda}{n}w$$<br>
$$\frac{\partial C}{\partial b} = \frac{\partial C_0}{\partial b}$$<br>
The terms $\partial C_0/\partial w$ and $\partial C_0/\partial b$ can be computed using backpropagation. The learning rule for the biases stays the same:<br>
$$b\rightarrow b-\eta\frac{\partial C_0}{\partial b}$$<br>
For the weights it becomes:<br>
$$w\rightarrow w-\eta\frac{\partial C_0}{\partial w}-\frac{\eta\lambda}{n}w = (1-\frac{\eta\lambda}{n})w-\eta\frac{\partial C_0}{\partial w}$$<br>
This is almost the same as the usual rule, except that we first rescale the weight by the factor $(1-\frac{\eta\lambda}{n})$; this rescaling is referred to as weight decay. That is how it works for gradient descent. For SGD with a mini-batch of $m$ training examples the rule becomes:<br>
$$w\rightarrow (1-\frac{\eta\lambda}{n})w-\frac{\eta}{m}\sum_x\frac{\partial C_x}{\partial w}$$<br>
The sum is over the training examples $x$ in the mini-batch, and $C_x$ is the unregularized cost for each training example $x$. Except for the rescaling, everything remains the same as before. The regularized learning rule for the biases is the same as in the unregularized case:<br>
$$b\rightarrow b-\frac{\eta}{m}\sum_x\frac{\partial C_x}{\partial b}$$<br>
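
As a minimal sketch, the two mini-batch rules can be written as one NumPy function. The name `l2_regularized_step` and the example arrays are hypothetical; `grad_w_sum` and `grad_b_sum` stand for the sums $\sum_x \partial C_x/\partial w$ and $\sum_x \partial C_x/\partial b$ over the mini-batch, which backpropagation would normally provide.

```python
import numpy as np

def l2_regularized_step(w, b, grad_w_sum, grad_b_sum, eta, lmbda, n, m):
    """One SGD step with L2 regularization (weight decay).

    n is the training set size, m the mini-batch size.
    """
    w_new = (1 - eta * lmbda / n) * w - (eta / m) * grad_w_sum
    b_new = b - (eta / m) * grad_b_sum      # biases are not decayed
    return w_new, b_new

w = np.array([[1.0, -2.0], [0.5, 4.0]])
b = np.array([[0.1], [-0.3]])

# With a zero gradient, only the weight decay factor acts on the weights.
w_new, b_new = l2_regularized_step(
    w, b, np.zeros_like(w), np.zeros_like(b),
    eta=0.5, lmbda=5.0, n=50000, m=10)
print(w_new)
print(b_new)
```

With a zero gradient every weight shrinks by the same factor $1-\eta\lambda/n$, while the biases are unchanged.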

Apply this to the example: 30 hidden neurons, a mini-batch size of 10, a learning rate of 0.5, and the cross-entropy cost function, with $\lambda=0.1$.<br>
The cost on the training data decreases over the whole time, much as it did without regularization:<br>
![rg1](/notebooks/imgs/rg1.jpg)<br>
But this time the accuracy on the test data continues to increase:<br>
![rg2](/notebooks/imgs/rg2.jpg)<br>
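Regularization suppresses the overfitting here and also increases the peak accuracy.<br>
Now check whether regularization also helps when the number of training examples increases. The hyper-parameters are the same as before, except that $\lambda$ changes to $5$: because $n$ is larger, the weight decay factor $1-\frac{\eta\lambda}{n}$ would otherwise change, so $\lambda$ is increased to compensate.<br>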
![rg3](/notebooks/imgs/rg3.jpg)<br>
Good news:<br>
1. Classification accuracy on the test data goes up.<br>
2. The gap between test and training accuracy is much narrower than before, so overfitting is reduced.<br>

Now, 100 hidden neurons give an accuracy of $97.92\%$ on the validation data. Training for 60 epochs with $\eta=0.1$ and $\lambda=5$ gives an accuracy of $98.04\%$.<br>

### Why does regularization help reduce overfitting?<br>
Example: suppose we want to build a model for this data:<br>
![rg4](/notebooks/imgs/rg4.jpg)<br>
Fit a 9th-order polynomial:<br>
![rg5](/notebooks/imgs/rg5.jpg)<br>
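This is an exact fit. Alternatively, use a linear fit:<br>
![rg6](/notebooks/imgs/rg6.jpg)<br>
Both models explain the data. The linear model is the simpler one, and regularization pushes a network toward such simpler, small-weight models, which are less sensitive to noise in the training data.<br>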
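
The comparison between the two fits can be reproduced on synthetic data. The ten points below are invented for this sketch (a line $y=2x$ plus small fixed noise) and are not the data in the figures; `numpy.polynomial.Polynomial.fit` is used for both fits.

```python
import numpy as np
from numpy.polynomial import Polynomial

# Synthetic data, assumed for illustration: y = 2x plus fixed "noise".
x = np.arange(10, dtype=float)
noise = np.array([0.3, -0.4, 0.2, 0.5, -0.3, -0.1, 0.4, -0.5, 0.1, 0.2])
y = 2.0 * x + noise

poly9 = Polynomial.fit(x, y, deg=9)    # passes through all ten points
linear = Polynomial.fit(x, y, deg=1)   # simple model

train_err_poly9 = np.max(np.abs(poly9(x) - y))
train_err_linear = np.max(np.abs(linear(x) - y))

x_new = 10.0                           # a point outside the training data
print("max training error, 9th order:", train_err_poly9)
print("max training error, linear:   ", train_err_linear)
print("prediction at x=10, 9th order:", poly9(x_new))
print("prediction at x=10, linear:   ", linear(x_new))
```

The 9th-order polynomial fits the training points essentially exactly, but its prediction just outside the data is far from the underlying line, while the linear fit stays close to it. That is the sense in which the exact fit has learned the noise.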

### Other techniques for regularization
Besides L2 regularization: L1 regularization, dropout, and artificially increasing the training set size.<br>

<strong>L1 regularization</strong>:<br>
Add the sum of the absolute values of the weights to the cost:<br>
$$C=C_0+\frac{\lambda}{n}\sum_w|w|$$<br>
Update rule for L1:<br>
$$w\rightarrow w' = w-\frac{\eta\lambda}{n}\mathrm{sgn}(w)-\eta\frac{\partial C_0}{\partial w}$$<br>
$\mathrm{sgn}(w)$ is the sign of $w$: $+1$ if $w$ is positive and $-1$ if $w$ is negative.
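
A minimal NumPy sketch of this update rule follows; the function name `l1_regularized_step` and the example weights are hypothetical. `np.sign` returns 0 for a weight that is exactly 0, so such a weight is left alone by the regularization term.

```python
import numpy as np

def l1_regularized_step(w, grad_w, eta, lmbda, n):
    """One gradient descent step with L1 regularization."""
    return w - (eta * lmbda / n) * np.sign(w) - eta * grad_w

w = np.array([0.5, -0.5, 0.0])
# Zero gradient, so only the L1 term acts on the weights.
w_new = l1_regularized_step(w, np.zeros_like(w), eta=0.5, lmbda=5.0, n=1000)
print(w_new)
```

Unlike L2 regularization, which shrinks each weight by an amount proportional to the weight, L1 shrinks every non-zero weight by the same constant amount $\eta\lambda/n$ toward 0.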

<strong>Dropout</strong>:<br>
In dropout, we modify the network itself. If we have the network:<br>
![do1](/notebooks/imgs/do1.jpg)<br>
Start by randomly (and temporarily) deleting half the hidden neurons in the network, while leaving the input and output neurons untouched.<br>
![do2](/notebooks/imgs/do2.jpg)<br>
Forward-propagate the input through the modified network and then backpropagate the result. After this, update the weights and biases. Then restore the dropped-out neurons, choose a new random subset of hidden neurons to delete, and repeat the same steps to update the parameters.<br>
Dropout has been especially useful in training large, deep networks, where the problem of overfitting is often acute.<br>
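
The random deletion step can be sketched as a 0/1 mask applied to the hidden activations. This is only an illustration of the training-time masking described above: the function name `dropout_mask` is hypothetical, and the sketch does not cover how the full network is used after training.

```python
import numpy as np

def dropout_mask(n_hidden, rng, keep_fraction=0.5):
    """Column vector with 1 for kept hidden neurons and 0 for dropped ones."""
    n_keep = int(round(n_hidden * keep_fraction))
    kept = rng.choice(n_hidden, size=n_keep, replace=False)
    mask = np.zeros((n_hidden, 1))
    mask[kept] = 1.0
    return mask

rng = np.random.default_rng(0)
hidden_activations = np.ones((6, 1))    # placeholder activations

# A fresh random subset is drawn for every mini-batch.
mask_a = dropout_mask(6, rng)
mask_b = dropout_mask(6, rng)
dropped_a = hidden_activations * mask_a
print(mask_a.ravel())
print(mask_b.ravel())
```

Multiplying by the mask zeroes the dropped neurons' activations, so they contribute nothing to the forward pass for that mini-batch.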

<strong>Artificially expanding the training data</strong>:<br>
For images, rotating the original image by a small angle gives a new training image. The same idea applies in other areas; look for opportunities to apply it.<br>
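
Rotation by a small angle needs an image-processing library, so the sketch below swaps in a different label-preserving transformation, a shift by a few pixels, which needs only NumPy. The function name `shift_image` and the tiny 4x4 "image" are made up for illustration.

```python
import numpy as np

def shift_image(image, dx, dy):
    """Shift a 2-D image by dx columns and dy rows, filling the gap with zeros."""
    h, w = image.shape
    shifted = np.zeros_like(image)
    src_rows = slice(max(0, -dy), min(h, h - dy))
    dst_rows = slice(max(0, dy), min(h, h + dy))
    src_cols = slice(max(0, -dx), min(w, w - dx))
    dst_cols = slice(max(0, dx), min(w, w + dx))
    shifted[dst_rows, dst_cols] = image[src_rows, src_cols]
    return shifted

image = np.zeros((4, 4))
image[1, 1] = 1.0                       # a single bright pixel
label = 7                               # hypothetical label, unchanged by shifting

# Four extra training examples from one original.
augmented = [(shift_image(image, dx, dy), label)
             for dx, dy in [(1, 0), (-1, 0), (0, 1), (0, -1)]]
right = augmented[0][0]
print(right)
```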

<strong>An aside on big data and what it means to compare classification accuracies</strong>:<br>
What we want is both better algorithms and better training data. It's fine to look for better algorithms, but don't focus on better algorithms to the exclusion of easy wins from getting more or better training data.<br>

Overfitting is a major problem in NNs, especially as computers get more powerful and we have the ability to train larger networks. So we need to develop powerful regularization techniques to reduce overfitting; this is an extremely active area of current work.<br>

### Weight initialization <br>
We may get saturation, and thus learning slowdown, if we initialize the weights and biases as Gaussian random variables with mean 0 and standard deviation 1.<br>
So instead of choosing the weights as Gaussian random variables with mean 0 and standard deviation 1, choose them with mean 0 and standard deviation $1/\sqrt{n_{in}}$, where $n_{in}$ is the number of input weights of the neuron. We squash the Gaussians down, making it less likely that our neuron will saturate. The choice for the biases does not change: mean 0 and standard deviation 1. With this choice the weighted sum $z=\sum_jw_jx_j+b$ is again a Gaussian random variable with mean 0, but much more sharply peaked than before. The distribution of $z$ before:<br>
![wi1](/notebooks/imgs/wi1.jpg)<br>
And now it becomes:<br>
![wi2](/notebooks/imgs/wi2.jpg)<br>
Such a neuron is much less likely to saturate, and correspondingly less likely to have the learning slowdown problem.<br>
Then compare the results of the two initializations with these parameters: 30 hidden neurons, a mini-batch size of 10, regularization parameter $\lambda = 5$, the cross-entropy cost function, and $\eta = 0.1$.<br>
The results:<br>
![wi3](/notebooks/imgs/wi3.jpg)<br>
Both approaches reach about the same result, but the new approach gets there faster. Improved weight initialization only speeds up learning here; it doesn't change the final performance of the network.<br>
<strong>The $1/\sqrt{n_{in}}$ approach to weight initialization helps improve the way neural nets learn.</strong>
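
How much narrower $z$ becomes can be worked out directly, since the variance of a sum of independent Gaussians is the sum of their variances. The sketch assumes a neuron with 1,000 inputs of which 500 are 1 and the rest 0 (the scenario and the function name `z_std` are assumptions of this example, not part of the notes above).

```python
import math

def z_std(n_active, weight_std, bias_std=1.0):
    """Standard deviation of z = sum_j w_j x_j + b when n_active inputs are 1
    and all other inputs are 0, with independent Gaussian weights and bias."""
    return math.sqrt(n_active * weight_std ** 2 + bias_std ** 2)

n_in, n_active = 1000, 500
old_std = z_std(n_active, weight_std=1.0)                     # sd 1 weights
new_std = z_std(n_active, weight_std=1.0 / math.sqrt(n_in))   # sd 1/sqrt(n_in)
print("old initialization:", old_std)
print("new initialization:", new_std)
```

With the old initialization $|z|$ is typically large, so $\sigma(z)$ sits close to 0 or 1 and the neuron starts out saturated; with the new initialization $z$ stays in the range where the sigmoid is not flat.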

In [23]:
import sys
sys.path.insert(0, 'code')
import mnist_loader
training_data, validation_data, test_data = mnist_loader.load_data_wrapper()
import network2
net = network2.Network([784, 30, 10], cost=network2.CrossEntropyCost)
net.SGD(training_data, 30, 10, 0.1, lmbda = 5.0, evaluation_data=validation_data, 
        monitor_evaluation_accuracy=True)

Epoch 0 training complete
Accuracy on evaluation data: 9246 / 10000
Epoch 1 training complete
Accuracy on evaluation data: 9390 / 10000
Epoch 2 training complete
Accuracy on evaluation data: 9474 / 10000
Epoch 3 training complete
Accuracy on evaluation data: 9503 / 10000
Epoch 4 training complete
Accuracy on evaluation data: 9531 / 10000
Epoch 5 training complete
Accuracy on evaluation data: 9556 / 10000
Epoch 6 training complete
Accuracy on evaluation data: 9548 / 10000
Epoch 7 training complete
Accuracy on evaluation data: 9571 / 10000
Epoch 8 training complete
Accuracy on evaluation data: 9585 / 10000
Epoch 9 training complete
Accuracy on evaluation data: 9586 / 10000
Epoch 10 training complete
Accuracy on evaluation data: 9559 / 10000
Epoch 11 training complete
Accuracy on evaluation data: 9599 / 10000
Epoch 12 training complete
Accuracy on evaluation data: 9586 / 10000
Epoch 13 training complete
Accuracy on evaluation data: 9593 / 10000
Epoch 14 training complete
Accuracy on evalu

([],
 [9246,
  9390,
  9474,
  9503,
  9531,
  9556,
  9548,
  9571,
  9585,
  9586,
  9559,
  9599,
  9586,
  9593,
  9606,
  9620,
  9612,
  9622,
  9619,
  9609,
  9617,
  9619,
  9631,
  9634,
  9626,
  9615,
  9633,
  9623,
  9641,
  9641],
 [],
 [])

### Handwriting recognition revisited: the code<br>
Combine the ideas of this chapter in one implementation.<br>

In [None]:
"""network2.py
~~~~~~~~~~~~~~

An improved version of network.py, implementing the stochastic
gradient descent learning algorithm for a feedforward neural network.
Improvements include the addition of the cross-entropy cost function,
regularization, and better initialization of network weights.  Note
that I have focused on making the code simple, easily readable, and
easily modifiable.  It is not optimized, and omits many desirable
features.

"""

#### Libraries
# Standard library
import json
import random
import sys

# Third-party libraries
import numpy as np


#### Define the quadratic and cross-entropy cost functions

class QuadraticCost(object):

    @staticmethod
    def fn(a, y):
        return 0.5*np.linalg.norm(a-y)**2

    @staticmethod
    def delta(z, a, y):
        return (a-y) * sigmoid_prime(z)


class CrossEntropyCost(object):

    @staticmethod
    def fn(a, y):
        return np.sum(np.nan_to_num(-y*np.log(a)-(1-y)*np.log(1-a)))

    @staticmethod
    def delta(z, a, y):
        return (a-y)


#### Main Network class
class Network(object):

    def __init__(self, sizes, cost=CrossEntropyCost):
        self.num_layers = len(sizes)
        self.sizes = sizes
        self.default_weight_initializer()
        self.cost=cost

    def default_weight_initializer(self):
        self.biases = [np.random.randn(y, 1) for y in self.sizes[1:]]
        self.weights = [np.random.randn(y, x)/np.sqrt(x)
                        for x, y in zip(self.sizes[:-1], self.sizes[1:])]

    def large_weight_initializer(self):
        self.biases = [np.random.randn(y, 1) for y in self.sizes[1:]]
        self.weights = [np.random.randn(y, x)
                        for x, y in zip(self.sizes[:-1], self.sizes[1:])]

    def feedforward(self, a):
        for b, w in zip(self.biases, self.weights):
            a = sigmoid(np.dot(w, a)+b)
        return a

    def SGD(self, training_data, epochs, mini_batch_size, eta,
            lmbda = 0.0,
            evaluation_data=None,
            monitor_evaluation_cost=False,
            monitor_evaluation_accuracy=False,
            monitor_training_cost=False,
            monitor_training_accuracy=False,
            early_stopping_n = 0):
        # Materialize the data as lists so that len() and slicing work
        # even if the loader returned one-shot iterators.

        training_data = list(training_data)
        n = len(training_data)

        if evaluation_data:
            evaluation_data = list(evaluation_data)
            n_data = len(evaluation_data)

        # early stopping functionality:
        best_accuracy=0
        no_accuracy_change=0

        evaluation_cost, evaluation_accuracy = [], []
        training_cost, training_accuracy = [], []
        for j in range(epochs):
            random.shuffle(training_data)
            mini_batches = [
                training_data[k:k+mini_batch_size]
                for k in range(0, n, mini_batch_size)]
            for mini_batch in mini_batches:
                self.update_mini_batch(
                    mini_batch, eta, lmbda, len(training_data))

            print("Epoch %s training complete" % j)

            if monitor_training_cost:
                cost = self.total_cost(training_data, lmbda)
                training_cost.append(cost)
                print("Cost on training data: {}".format(cost))
            if monitor_training_accuracy:
                accuracy = self.accuracy(training_data, convert=True)
                training_accuracy.append(accuracy)
                print("Accuracy on training data: {} / {}".format(accuracy, n))
            if monitor_evaluation_cost:
                cost = self.total_cost(evaluation_data, lmbda, convert=True)
                evaluation_cost.append(cost)
                print("Cost on evaluation data: {}".format(cost))
            if monitor_evaluation_accuracy:
                accuracy = self.accuracy(evaluation_data)
                evaluation_accuracy.append(accuracy)
                # Reuse the accuracy computed above instead of a second full pass.
                print("Accuracy on evaluation data: {} / {}".format(accuracy, n_data))

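            # Early stopping; relies on the evaluation accuracy computed above,
            # so it only runs when that accuracy is being monitored.
            if early_stopping_n > 0 and monitor_evaluation_accuracy: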
                if accuracy > best_accuracy:
                    best_accuracy = accuracy
                    no_accuracy_change = 0
                    #print("Early-stopping: Best so far {}".format(best_accuracy))
                else:
                    no_accuracy_change += 1

                if (no_accuracy_change == early_stopping_n):
                    #print("Early-stopping: No accuracy change in last epochs: {}".format(early_stopping_n))
                    return evaluation_cost, evaluation_accuracy, training_cost, training_accuracy

        return evaluation_cost, evaluation_accuracy, \
            training_cost, training_accuracy

    def update_mini_batch(self, mini_batch, eta, lmbda, n):
        nabla_b = [np.zeros(b.shape) for b in self.biases]
        nabla_w = [np.zeros(w.shape) for w in self.weights]
        for x, y in mini_batch:
            delta_nabla_b, delta_nabla_w = self.backprop(x, y)
            nabla_b = [nb+dnb for nb, dnb in zip(nabla_b, delta_nabla_b)]
            nabla_w = [nw+dnw for nw, dnw in zip(nabla_w, delta_nabla_w)]
        self.weights = [(1-eta*(lmbda/n))*w-(eta/len(mini_batch))*nw
                        for w, nw in zip(self.weights, nabla_w)]
        self.biases = [b-(eta/len(mini_batch))*nb
                       for b, nb in zip(self.biases, nabla_b)]

    def backprop(self, x, y):
        nabla_b = [np.zeros(b.shape) for b in self.biases]
        nabla_w = [np.zeros(w.shape) for w in self.weights]
        # feedforward
        activation = x
        activations = [x] # list to store all the activations, layer by layer
        zs = [] # list to store all the z vectors, layer by layer
        for b, w in zip(self.biases, self.weights):
            z = np.dot(w, activation)+b
            zs.append(z)
            activation = sigmoid(z)
            activations.append(activation)
        # backward pass
        delta = (self.cost).delta(zs[-1], activations[-1], y)
        nabla_b[-1] = delta
        nabla_w[-1] = np.dot(delta, activations[-2].transpose())
        # Note that the variable l in the loop below is used a little
        # differently to the notation in Chapter 2 of the book.  Here,
        # l = 1 means the last layer of neurons, l = 2 is the
        # second-last layer, and so on.  It's a renumbering of the
        # scheme in the book, used here to take advantage of the fact
        # that Python can use negative indices in lists.
        for l in range(2, self.num_layers):
            z = zs[-l]
            sp = sigmoid_prime(z)
            delta = np.dot(self.weights[-l+1].transpose(), delta) * sp
            nabla_b[-l] = delta
            nabla_w[-l] = np.dot(delta, activations[-l-1].transpose())
        return (nabla_b, nabla_w)

    def accuracy(self, data, convert=False):
        if convert:
            results = [(np.argmax(self.feedforward(x)), np.argmax(y))
                       for (x, y) in data]
        else:
            results = [(np.argmax(self.feedforward(x)), y)
                        for (x, y) in data]

        result_accuracy = sum(int(x == y) for (x, y) in results)
        return result_accuracy

    def total_cost(self, data, lmbda, convert=False):
        cost = 0.0
        for x, y in data:
            a = self.feedforward(x)
            if convert: y = vectorized_result(y)
            cost += self.cost.fn(a, y)/len(data)
        # L2 regularization term, added once after the loop (not once per example).
        cost += 0.5*(lmbda/len(data))*sum(np.linalg.norm(w)**2 for w in self.weights)
        return cost

    def save(self, filename):
        """Save the neural network to the file ``filename``."""
        data = {"sizes": self.sizes,
                "weights": [w.tolist() for w in self.weights],
                "biases": [b.tolist() for b in self.biases],
                "cost": str(self.cost.__name__)}
        with open(filename, "w") as f:
            json.dump(data, f)

#### Loading a Network
def load(filename):
    with open(filename, "r") as f:
        data = json.load(f)
    cost = getattr(sys.modules[__name__], data["cost"])
    net = Network(data["sizes"], cost=cost)
    net.weights = [np.array(w) for w in data["weights"]]
    net.biases = [np.array(b) for b in data["biases"]]
    return net

#### Miscellaneous functions
def vectorized_result(j):
    e = np.zeros((10, 1))
    e[j] = 1.0
    return e

def sigmoid(z):
    """The sigmoid function."""
    return 1.0/(1.0+np.exp(-z))

def sigmoid_prime(z):
    """Derivative of the sigmoid function."""
    return sigmoid(z)*(1-sigmoid(z))

### How to choose a neural network's hyper-parameters?
Choosing values for $\eta$, $\lambda$ and the other hyper-parameters.<br>

<strong>Broad strategy:</strong><br>
We can first train on only part of the training set and evaluate on part of the validation set. This makes each experiment fast, e.g. several seconds instead of thousands of seconds. Then adjust each hyper-parameter individually, gradually improving performance. Once an improved value has been found for one hyper-parameter, move on to the next. Then try a more complex architecture, e.g. more hidden neurons, and adjust the hyper-parameters again. At each stage, evaluate performance using the validation data and use it to find better hyper-parameters. As the experiments grow, it typically takes longer to see the impact of a change to the hyper-parameters, so we can gradually decrease the frequency of monitoring.<br>
During these early stages, make sure you can get quick feedback from experiments.<br>

<strong>Learning rate:</strong><br>
The training cost when we train with three different learning rates, $\eta=0.025$, $\eta=0.25$ and $\eta=2.5$:<br>
![hp1](/notebooks/imgs/hp1.jpg)
When $\eta$ is too large, the steps are so large that they may overshoot the minimum, causing the algorithm to climb up out of the valley instead. When $\eta$ is too small, SGD slows down. We can set $\eta$ as follows.<br>
First estimate the threshold value of $\eta$ at which the cost on the training data immediately begins decreasing, instead of oscillating or increasing. E.g. try $\eta=0.01$ first; if the cost decreases during the first few epochs, then try $\eta=0.1, 1.0, \cdots$ until the cost oscillates or increases during the first few epochs. Conversely, if the cost oscillates or increases at $\eta=0.01$, try $\eta=0.001, 0.0001, \cdots$ until it decreases. This gives an order-of-magnitude estimate for the threshold value of $\eta$. We can then refine it by picking the largest value of $\eta$ at which the cost decreases during the first few epochs; it need not be super-accurate. The actual value of $\eta$ should be no larger than the threshold value we just found, e.g. a factor of 2 below. This allows training for many epochs without slowdown in learning.<br>
Use the training cost, instead of the validation accuracy, to find the learning rate. The other hyper-parameters are intended to improve the final classification accuracy on the test set, so it makes sense to select them on the basis of validation accuracy. The learning rate's main purpose is to control the step size in gradient descent, and monitoring the training cost is the best way to detect whether the step size is too big. In practice either can be used, since the training cost usually decreases while the validation accuracy improves.<br>
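
The upward search for the threshold can be sketched as a small loop. Everything here is a stand-in: `toy_cost_decreases` replaces "train for a few epochs and watch the training cost" with gradient descent on the toy cost $C(w)=2w^2$, and `estimate_eta_threshold` is a hypothetical helper that only covers the case where the first $\eta$ already works.

```python
def toy_cost_decreases(eta, steps=5, w0=1.0):
    """Toy stand-in for a short training run: gradient descent on C(w) = 2 w^2.
    Returns True if the cost decreased at every step."""
    w = w0
    costs = [2 * w * w]
    for _ in range(steps):
        w -= eta * 4 * w                 # dC/dw = 4w
        costs.append(2 * w * w)
    return all(later < earlier for earlier, later in zip(costs, costs[1:]))

def estimate_eta_threshold(decreases, eta=0.01, factor=10.0, max_tries=8):
    """Largest eta on the grid eta, eta*factor, ... for which the cost decreases."""
    if not decreases(eta):
        return None                      # a downward search would be needed instead
    for _ in range(max_tries):
        if not decreases(eta * factor):
            return eta
        eta *= factor
    return eta

threshold = estimate_eta_threshold(toy_cost_decreases)
working_eta = threshold / 2              # stay a factor of 2 below the threshold
print("threshold estimate:", threshold, "working eta:", working_eta)
```

In a real experiment the `decreases` argument would wrap a few epochs of `net.SGD` with `monitor_training_cost=True`; that wrapper is not shown here.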

<strong>Early stopping to determine the # of training epochs:</strong><br>
At the end of each epoch, compute the classification accuracy on the validation data. When it stops improving, terminate. Early stopping also helps prevent overfitting. A better rule is to terminate only if the best classification accuracy hasn't improved for quite some time: the no-improvement-in-$n$ rule, with $n=10, 20, 50, \ldots$<br>
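
The no-improvement-in-$n$ rule can be sketched as a function over a list of per-epoch validation accuracies. The function name `early_stopping_epoch` is hypothetical; the accuracies are the first 23 values from the training run printed earlier in these notes.

```python
def early_stopping_epoch(accuracies, n):
    """Return the epoch at which training stops under the no-improvement-in-n
    rule, or None if the rule never triggers."""
    best = float("-inf")
    epochs_without_improvement = 0
    for epoch, accuracy in enumerate(accuracies):
        if accuracy > best:
            best = accuracy
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
        if epochs_without_improvement == n:
            return epoch
    return None

validation_accuracy = [9246, 9390, 9474, 9503, 9531, 9556, 9548, 9571,
                       9585, 9586, 9559, 9599, 9586, 9593, 9606, 9620,
                       9612, 9622, 9619, 9609, 9617, 9619, 9631]

stop_n2 = early_stopping_epoch(validation_accuracy, 2)
stop_n3 = early_stopping_epoch(validation_accuracy, 3)
stop_n10 = early_stopping_epoch(validation_accuracy, 10)
print(stop_n2, stop_n3, stop_n10)
```

A small $n$ stops on a short plateau even though the accuracy improves again later, which is why larger values of $n$ are suggested once experiments are further along.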

<strong>Regularization parameter, $\lambda$:</strong><br>
Start with no regularization, $\lambda=0$, and determine a value for the learning rate $\eta$ as above. Then use the validation data to select $\lambda$: start with $\lambda=1$, then increase or decrease it by factors of 10, as needed to improve performance on the validation data. After finding a good value of $\lambda$, re-optimize $\eta$.<br>

<strong>Mini-batch size:</strong><br>
With a mini-batch size of 100, the learning rule for the weights is:<br>
$$w\rightarrow w'=w-\eta\frac{1}{100}\sum_x\nabla C_x$$<br>
The sum is over the training examples in the mini-batch. For online learning, with mini-batch size 1:<br>
$$w\rightarrow w'=w-\eta\nabla C_x$$<br>
If the size is too small, we don't take full advantage of good matrix libraries optimized for fast hardware. If it is too large, we don't update the weights often enough. Fortunately, the best mini-batch size does not depend much on the other hyper-parameters. So use some acceptable values for the other hyper-parameters, then try a number of different mini-batch sizes. Plot the validation accuracy versus elapsed time (real time, not epochs), and choose whichever mini-batch size gives the most rapid improvement in performance. After choosing the mini-batch size, optimize the other hyper-parameters.<br>

<strong>Automated techniques:</strong><br>
Grid search systematically searches through a grid in hyper-parameter space. Others have used a Bayesian approach to automatically optimize hyper-parameters.
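
A grid search is short enough to sketch in full. The scoring function `toy_validation_accuracy` is a made-up stand-in for "train a network with these hyper-parameters and measure validation accuracy", and `grid_search` is a hypothetical helper, not part of `network2.py`.

```python
import itertools

def grid_search(evaluate, grid):
    """Try every combination in grid (a dict of name -> list of values) and
    return the best parameter dict together with its score."""
    names = list(grid)
    best_params, best_score = None, float("-inf")
    for values in itertools.product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

def toy_validation_accuracy(eta, lmbda):
    # Hypothetical score surface with its peak at eta=0.1, lmbda=5.0.
    return 1.0 - abs(eta - 0.1) - 0.01 * abs(lmbda - 5.0)

grid = {"eta": [0.01, 0.1, 1.0], "lmbda": [0.1, 1.0, 5.0, 10.0]}
best_params, best_score = grid_search(toy_validation_accuracy, grid)
print(best_params, best_score)
```

The cost grows with the product of the grid sizes, since every combination requires its own training run.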

<strong>Summing up:</strong><br>
Following the above gives a good start and a basis for further improvements.

### Other techniques
