## Hidden layer activation functions

- The main purpose of an activation function is to add non-linearity to a layer. Without it, a stack of linear/affine projections collapses into a single affine map, so depth would not be useful.
- The most popular non-linearities used to be smooth functions such as
    - **Sigmoid**: $\sigma(x) = \frac{1}{1 + e^{-x}}$
    - **Hyperbolic tangent**: $\tanh(x) = \frac{1 - e^{-2x}}{1 + e^{-2x}}$
- However, "modern" DNNs use **rectified linear** (ReLU) functions, which are cheaper to compute (in both the forward and backward passes) and lead to good results. The ReLU is given by $\mbox{ReLU}(x) = \max(0, x)$
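
The three activations above can be sketched in a few lines of plain Python. This is only an illustration of the formulas on scalars (no NumPy or Keras), and it is not hardened against overflow for very large negative inputs.

```python
import math

def sigmoid(x: float) -> float:
    """Logistic sigmoid 1 / (1 + e^{-x}); output lies in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def tanh(x: float) -> float:
    """Hyperbolic tangent written as (1 - e^{-2x}) / (1 + e^{-2x}); output lies in (-1, 1)."""
    e = math.exp(-2.0 * x)
    return (1.0 - e) / (1.0 + e)

def relu(x: float) -> float:
    """Rectified linear unit max(0, x)."""
    return max(0.0, x)

for x in (-2.0, 0.0, 2.0):
    print(x, sigmoid(x), tanh(x), relu(x))
```

Note that ReLU needs only a comparison, while the two smooth functions need an exponential, which is where the cost difference comes from.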

## Output activation functions

The last layer of a feedforward neural network is called the output layer. It is similar to a hidden layer, but for the output layer we have to choose an activation function that corresponds to the type of output we intend to model:

- Linear: Gaussian output distributions
- Sigmoid: Bernoulli output distributions (binary classification)
- Softmax: Multinoulli output distributions (multiclass classification)

The **softmax**, also called normalized exponential, is a generalization of the sigmoid that converts a K-dimensional vector $z$ of arbitrary real values to a K-dimensional vector of real values in the range (0, 1) that add up to 1:

$$ \mbox{softmax}(z)_j = \frac{e^{z_j}}{\sum_{k=1}^{K}e^{z_k}} $$

We can interpret the $j$-th output of a softmax as $P(y = j~|~x)$, the probability of class $j$ given the network input $x$.
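
As a minimal sketch, the softmax can be written in plain Python as follows. Subtracting the maximum before exponentiating is a common numerical-stability trick added here as an assumption; it does not change the result because the factor cancels in the ratio.

```python
import math

def softmax(z):
    """Map a list of real scores to probabilities that sum to 1."""
    m = max(z)  # shift for numerical stability; cancels out in the ratio
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print(probs, sum(probs))
```

With two classes and scores `[x, 0]`, the first output equals the sigmoid of `x`, which is the sense in which softmax generalizes the sigmoid.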

Keras does not distinguish a hidden layer from an output layer, so your output layer is simply the last layer added to a `Sequential` model. Remember to set its number of units to match the dimensionality of your target variable.

## Cost functions

As with the output activations, the cost function is chosen depending on the expected output distribution.

- Cross-entropy: for Bernoulli and Multinoulli output distributions (binary and categorical classification)
- Mean-squared error: for Gaussian output distributions

There are other cost functions in Keras (check [the documentation](http://keras.io/objectives/) to see which ones are available).

Choosing a cost function is similar to choosing an activation function: just pass it as a string or Python function using the keyword parameter `loss` to `model.compile`. All the standard cost functions have string shortcuts, so you can use `'mse', 'binary_crossentropy', 'categorical_crossentropy'` instead of having to import the functions from `keras.objectives`.
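
To make the three losses concrete, here is a plain-Python sketch of the underlying formulas. These are illustrations of the math, not the Keras implementations; the `eps` clipping that avoids `log(0)` is an assumption of this sketch.

```python
import math

def mean_squared_error(y_true, y_pred):
    """Average squared difference between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Average Bernoulli negative log-likelihood; y_pred holds P(y = 1)."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # keep log() finite
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(y_true)

def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    """Loss for one sample: y_true is one-hot, y_pred is a probability vector."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(y_true, y_pred))

print(mean_squared_error([1.0, 2.0], [1.0, 4.0]))
print(binary_crossentropy([1, 0], [0.9, 0.2]))
print(categorical_crossentropy([0, 1, 0], [0.2, 0.7, 0.1]))
```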

## Optimization: adaptive learning rate

The learning rate is one of the most important hyperparameters when training a DNN, as it has a significant impact on model performance. A single learning rate for all parameters is often not the best choice, because different parameters have different sensitivities. Adaptive learning rate methods keep a separate learning rate for each parameter and automatically adapt these rates during training (finding good values for each parameter by hand would be impractical).

Keras includes several adaptive learning rate algorithms:

- RMSprop
- Adagrad
- Adadelta
- Adam
- Nadam

If in doubt, RMSprop and Adam with the default parameters are often a good starting point.
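
To illustrate what "a separate learning rate per parameter" means, the following is a minimal sketch of one RMSprop-style update in plain Python. The function name, the default hyperparameters, and the toy objective are assumptions for illustration; the Keras optimizer may differ in details.

```python
import math

def rmsprop_step(params, grads, cache, lr=0.001, rho=0.9, eps=1e-7):
    """One RMSprop-style update; cache holds a running mean of squared gradients."""
    new_params, new_cache = [], []
    for p, g, c in zip(params, grads, cache):
        c = rho * c + (1.0 - rho) * g * g  # per-parameter gradient scale
        new_cache.append(c)
        new_params.append(p - lr * g / (math.sqrt(c) + eps))
    return new_params, new_cache

# Toy objective f(w) = 100 * w0^2 + w1^2: the two gradients differ by a factor of 100.
params, cache = [1.0, 1.0], [0.0, 0.0]
for _ in range(5):
    grads = [200.0 * params[0], 2.0 * params[1]]
    params, cache = rmsprop_step(params, grads, cache, lr=0.01)
print(params)
```

Because each gradient is divided by its own running scale, both coordinates move by a similar amount per step even though their raw gradients are very different.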

## GRU

GRUs are another proposed solution to the problem of **vanishing gradients**. They use the same gating principle as LSTMs but are simpler: they have only **two gates (reset and update)**, and their internal memory is the same as their hidden state, instead of a separate cell as in LSTMs. There are usually no big performance gaps between LSTM and GRU networks with the same number of parameters.

GRUs are available in Keras as `keras.layers.recurrent.GRU`.
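
The two gates can be sketched for a scalar input and a scalar hidden state as below. This is a toy illustration, not the Keras layer: the weight names in the dictionary are hypothetical, and references differ on whether the update gate weights the old state or the candidate, so the interpolation direction used here is an assumption.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gru_step(x, h_prev, w):
    """One GRU step with scalar input x and scalar hidden state h_prev."""
    z = sigmoid(w["wz"] * x + w["uz"] * h_prev + w["bz"])  # update gate
    r = sigmoid(w["wr"] * x + w["ur"] * h_prev + w["br"])  # reset gate
    # Candidate state: the reset gate controls how much of the past is used.
    h_cand = math.tanh(w["wh"] * x + w["uh"] * (r * h_prev) + w["bh"])
    # The hidden state is the only memory: blend old state and candidate.
    return (1.0 - z) * h_prev + z * h_cand

# Hypothetical weights chosen only for the demo.
weights = {"wz": 0.5, "uz": 0.1, "bz": 0.0,
           "wr": 0.3, "ur": 0.2, "br": 0.0,
           "wh": 1.0, "uh": 0.7, "bh": 0.0}
h = 0.0
for x in [1.0, 0.5, -0.3]:
    h = gru_step(x, h, weights)
    print(h)
```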

## Gradient value and norm-clipping

Another strategy that can mitigate the *exploding gradients* issue is gradient clipping. Gradients can be clipped either by their maximum absolute value or by the total L2 gradient norm. All optimizers in Keras support both modes of gradient clipping, by using the following keyword parameters:

- `clipnorm=value` (value: float > 0): Gradients will be clipped when their L2 norm exceeds this value
- `clipvalue=value` (value: float > 0): Gradients will be clipped when their absolute value exceeds this value
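
The difference between the two modes can be sketched in plain Python. The helper names are hypothetical and operate on a flat list of gradient values; how Keras groups gradients when computing the norm is not covered here.

```python
import math

def clip_by_value(grads, clipvalue):
    """Clamp every gradient component to the range [-clipvalue, clipvalue]."""
    return [max(-clipvalue, min(clipvalue, g)) for g in grads]

def clip_by_norm(grads, clipnorm):
    """Rescale the whole gradient vector if its L2 norm exceeds clipnorm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= clipnorm:
        return list(grads)
    scale = clipnorm / norm
    return [g * scale for g in grads]

grads = [3.0, -4.0]  # L2 norm is 5
print(clip_by_value(grads, 1.0))  # each component clamped independently
print(clip_by_norm(grads, 1.0))   # direction preserved, length reduced to 1
```

Value clipping can change the direction of the gradient, while norm clipping only shortens it.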

## How to connect the output of a recurrent layer to another layer

We can connect the output of an RNN to the next layer in a model in two different ways:

- Use all the hidden states generated by the RNN (a sequence of feature vectors) for a given sequence
- Use only the last hidden state (one feature vector per sequence)

If using the first approach, the next layer has to support processing sequences (or you can flatten the sequence as a single vector, as we have seen in the CNN section of this tutorial). To choose whether you want a recurrent layer to output a sequence or a single vector, use the keyword parameter `return_sequences`. The default for this parameter is `False`.
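
The two output modes can be illustrated with a toy scalar RNN in plain Python. The function, its weights, and its `return_sequences` argument only mimic the Keras keyword for illustration; this is not the Keras implementation.

```python
import math

def simple_rnn(xs, w=0.5, u=0.8, return_sequences=False):
    """Toy scalar RNN: h_t = tanh(w * x_t + u * h_{t-1}), starting from h = 0."""
    h = 0.0
    states = []
    for x in xs:
        h = math.tanh(w * x + u * h)
        states.append(h)
    # Either every hidden state (one per timestep) or only the last one.
    return states if return_sequences else states[-1]

seq = [1.0, 0.0, -1.0]
all_states = simple_rnn(seq, return_sequences=True)  # list with one state per timestep
last_state = simple_rnn(seq)                         # single value
print(all_states, last_state)
```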

Besides recurrent layers, Keras also supports using any layer for processing a sequence by using the `TimeDistributed` wrapper. This wrapper is equivalent to making a copy of the wrapped layer for each timestep of the sequence, with all parameters tied.
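
The idea of tied parameters across timesteps can be sketched without Keras: the same dense layer (same weights and bias) is applied to every timestep. The helper names below are hypothetical and only mimic what the wrapper does conceptually.

```python
def dense(x, weights, bias):
    """Affine layer; weights holds one row of coefficients per output unit."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]

def time_distributed(layer, sequence):
    """Apply the same layer, with the same parameters, to every timestep."""
    return [layer(step) for step in sequence]

W = [[1.0, 0.0, -1.0], [0.5, 0.5, 0.5]]  # 2 output units, 3 input features
b = [0.0, 1.0]
sequence = [[1.0, 2.0, 3.0], [0.0, 0.0, 0.0], [3.0, 2.0, 1.0]]  # 3 timesteps

out = time_distributed(lambda step: dense(step, W, b), sequence)
print(out)  # one 2-dimensional output per timestep
```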

http://colah.github.io/posts/2015-08-Understanding-LSTMs/


1. https://github.com/Prakashvanapalli/TensorFlow (includes the mathematical formulas)
2. https://github.com/rajathkumarmp/FaceRecog-Keras (face recognition)
3. https://github.com/chaimpollak/traffic-signs (traffic signs)
4. https://github.com/giuseppebonaccorso/keras_deepdream (abstract art)
5. https://github.com/stratospark/food-101-keras (food)

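model.add(Convolution2D(32, kernel_size=(3, 3), padding='same', input_shape=(3, 100, 100)))
model.add(Activation('relu'))
model.add(Convolution2D(32, (3, 3)))
# The first convolution uses padding='same', so its output stays 100 x 100.
# The second uses the default 'valid' padding, so its output shrinks to 100 - 3 + 1 = 98.
# Layer output shapes (channels first):
# convolution2d_1 (Convolution2D)  (None, 32, 100, 100)
# activation_1 (Activation)        (None, 32, 100, 100)
# convolution2d_2 (Convolution2D)  (None, 32, 98, 98)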
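
The shape arithmetic above can be checked with a small helper. This is a sketch of the usual output-size formulas for one spatial dimension; the function name is hypothetical and dilation is not considered.

```python
import math

def conv_output_size(input_size, kernel_size, padding="valid", stride=1):
    """Spatial output size of a convolution along one dimension."""
    if padding == "same":
        # Input is padded so that only the stride shrinks the output.
        return math.ceil(input_size / stride)
    if padding == "valid":
        # No padding: the kernel must fit entirely inside the input.
        return (input_size - kernel_size) // stride + 1
    raise ValueError("padding must be 'same' or 'valid'")

print(conv_output_size(100, 3, padding="same"))   # first layer
print(conv_output_size(100, 3, padding="valid"))  # second layer
```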

X_train.shape  # (148, 150, 150)
X_train = X_train.reshape(148, 150*150)  # (148, 22500)
loss, accuracy = model.evaluate(X_test, Y_test, verbose=0)

# X_train shape: (50000, 32, 32, 3)
print(X_train.shape[0], 'train samples')  # 50000 train samples

from keras.datasets import imdb
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=max_features)

from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets("MNIST_data/", one_hot=True)

model = Sequential()
model.add(Dense(128, input_dim=input_unit_size, init='glorot_uniform'))
model.add(Activation("relu"))
model.add(Dropout(p=0.2))
model.add(Dense(nb_classes, init='glorot_uniform'))
model.add(Activation('softmax'))

# Functional API version
inputs = Input(shape=(input_unit_size,))
x = Dense(128, activation='relu')(inputs)
x = Dropout(0.2)(x)
outputs = Dense(nb_classes, activation="softmax")(x)
model = Model(input=inputs, output=outputs)

# SVG visualization
SVG(model_to_dot(model, show_shapes=True).create(prog='dot', format='svg'))

x = range(nb_epoch)
plt.plot(x, result.history['acc'], label="train acc")
plt.plot(x, result.history['val_acc'], label="val acc")
plt.legend(loc='center left', bbox_to_anchor=(1, 0.5))
plt.show()

plt.plot(x, result.history['loss'], label="train loss")
plt.plot(x, result.history['val_loss'], label="val loss")
plt.legend(loc='center left', bbox_to_anchor=(1, 0.5))
plt.show()

# save model without weights
with open('mnist_model.json', 'w') as f:
    json.dump(model.to_json(), f)
    
model.save_weights('mnist_weights.h5')

# load model
mnist_model = model_from_json(json.load(open("mnist_model.json")))

# load weights
mnist_model.load_weights("./mnist_weights.h5")
mnist_model.compile(loss='categorical_crossentropy', optimizer='adadelta')

X_train = X_train.reshape(X_train.shape[0], 1, img_rows, img_cols)
# X_train shape: (60000, 1, 28, 28)

# Using nb_filters
model = Sequential()
model.add(Convolution2D(nb_filters, nb_conv, nb_conv, input_shape=(1, img_rows, img_cols)))
model.add(Activation('relu'))
model.add(Convolution2D(nb_filters, nb_conv, nb_conv))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(nb_pool, nb_pool)))
model.add(Flatten())
model.add(Dense(128))
model.add(Activation('relu'))
model.add(Dense(nb_classes))
model.add(Activation('softmax'))

model.compile(loss='categorical_crossentropy',
              optimizer="adadelta",
              metrics=['accuracy'])

SVG(model_to_dot(model, show_shapes=True).create(prog='dot', format='svg'))

# Visualizing the intermediate layers
get_first_layer_output = K.function([model.layers[0].input],
                                    [model.layers[1].output])
first_layer = get_first_layer_output([X_train[0:show_size]])[0]

plt.figure(figsize=(20,20))

for img_index, filters in enumerate(first_layer, start=1):
    for filter_index, mat in enumerate(filters):
        pos = (filter_index)*10+img_index
        draw_digit(mat, nb_filters, show_size, pos)
plt.show()

See prodLDA for an example of a custom layer.

model = Sequential()
model.add(Convolution2D(nb_filter, 3, 3, input_shape=(img_channels, img_rows, img_cols), border_mode="same", activation="relu"))
model.add(Convolution2D(nb_filter, 3, 3, border_mode="same", activation="relu"))
model.add(Convolution2D(nb_filter, 3, 3, border_mode="same", activation="relu"))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Convolution2D(nb_filter, 3, 3, border_mode="same", activation="relu"))
model.add(Convolution2D(nb_filter, 3, 3, border_mode="same", activation="relu"))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Flatten())
model.add(Dense(512, activation="relu"))
model.add(Dense(nb_classes, activation="softmax"))

model.compile(loss='categorical_crossentropy',
              optimizer='adadelta',
              metrics=['accuracy'])

SVG(model_to_dot(model, show_shapes=True).create(prog='dot', format='svg'))


in_img = Input(shape=(img_channels, img_rows, img_cols))
x = Convolution2D(nb_filter, 3, 3, border_mode="same", activation="relu")(in_img)
for _ in range(2):
    y = Convolution2D(nb_filter, 3, 3, border_mode="same", activation="relu")(x)
    y = Convolution2D(nb_filter, 3, 3, border_mode="same")(y)
    x = merge([x, y], mode="sum")
    x = Activation("relu")(x)
    x = MaxPooling2D(pool_size=(2, 2))(x)

x = Flatten()(x)
x = Dense(512, activation="relu")(x)
x = Dense(nb_classes, activation="softmax")(x)


residual = Model(input=in_img, output=x)

residual.compile(loss='categorical_crossentropy',
              optimizer='adadelta',
              metrics=['accuracy'])

SVG(model_to_dot(residual, show_shapes=True).create(prog='dot', format='svg'))

Step-by-step derivation of the formulas:
TensorFlow-master/Blogposts/Backpropogation.ipynb
TensorFlow-master/Inital_learning/ (assorted Python scripts)

traffic-signs-logistic_regression/ (also scripts)

weights = {
    'cnn_in': tf.Variable(tf.truncated_normal([filter_size_width, filter_size_height, color_channels, k_output])),
    'cnn_out': tf.Variable(tf.random_normal([8*8*64, n_hidden_layer])),
    'out': tf.Variable(tf.random_normal([n_hidden_layer, n_classes]))
}

biases = {
    'cnn_in': tf.Variable(tf.zeros(k_output)),
    'cnn_out': tf.Variable(tf.random_normal([n_hidden_layer])),
    'out': tf.Variable(tf.random_normal([n_classes]))
}
Usage:
# Apply Convolution
conv_layer = tf.nn.conv2d(x, weights['cnn_in'], strides=[1, 2, 2, 1], padding='SAME')
conv_layer = tf.nn.bias_add(conv_layer, biases['cnn_in'])



img = image.load_img('dog.jpeg',target_size=(150, 150))
x = np.asarray(img, dtype='float32')
x = x.transpose(2, 0, 1)
x = np.expand_dims(x, axis=0)
preds = model.predict(x)
print(preds)

print('evaluating ...')
score = model.evaluate(x_test, y_test, batch_size = batch_size, verbose = 1)
print('score:', score[0])
print('accuracy:', score[1])

model = load_model('music_classification.hdf5')
model.layers.pop()
model.layers.pop()
model.layers.pop()
model.layers.pop()
new_output = model.layers[-1].output
feature_vec_model = Model(model.input, new_output)

feature_vec_model.save('music_feature_extractor.hdf5')

TensorFlow-master/Benchmarking_optimizers/experiment_mnist.ipynb (custom model; each entry of the arrays below is tested)
optimizer = ["sgd", "momentum", "nestrov_momentum", "adagrad", "adadelta", "rmsprop", "adam"]
learning_rate = [0.0001, 0.001, 0.01, 0.1]
model.compile_graph(optimize = optimizer[1], learning_rate = learning_rate[0])

The three main flavors of gradient descent are batch, stochastic, and mini-batch (source: http://machinelearningmastery.com/gentle-introduction-mini-batch-gradient-descent-configure-batch-size/).

Stochastic gradient descent, often abbreviated SGD, is a variation of the gradient descent algorithm that calculates the error and updates the model for each example in the training dataset.

Upsides:

- The frequent updates immediately give an insight into the performance of the model and the rate of improvement.
- This variant of gradient descent may be the simplest to understand and implement, especially for beginners.
- The increased model update frequency can result in faster learning on some problems.
- The noisy update process can allow the model to avoid local minima (e.g. premature convergence).

Downsides:

- Updating the model so frequently is more computationally expensive than other configurations of gradient descent, taking significantly longer to train models on large datasets.
- The frequent updates can result in a noisy gradient signal, which may cause the model parameters and in turn the model error to jump around (have a higher variance over training epochs).
- The noisy learning process down the error gradient can also make it hard for the algorithm to settle on an error minimum for the model.


Batch gradient descent is a variation of the gradient descent algorithm that calculates the error for each example in the training dataset, but only updates the model after all training examples have been evaluated.


Mini-batch gradient descent is a variation of the gradient descent algorithm that splits the training dataset into small batches that are used to calculate model error and update model coefficients.


Upsides:

- The model update frequency is higher than in batch gradient descent, which allows for more robust convergence and helps avoid local minima.
- The batched updates are computationally more efficient than stochastic gradient descent.
- Batching avoids having to hold all training data in memory and allows efficient algorithm implementations.

Downsides:

- Mini-batch requires configuring an additional "mini-batch size" hyperparameter for the learning algorithm.
- Error information must be accumulated across mini-batches of training examples, as in batch gradient descent.

A good default for batch size might be 32.
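
The three flavors differ only in how many examples feed each update, which the following plain-Python sketch makes explicit. The generator name and the `seed` argument are assumptions for illustration; the model update itself is left out.

```python
import random

def iterate_minibatches(data, batch_size, shuffle=True, seed=0):
    """Yield consecutive batches of examples; the last batch may be smaller."""
    indices = list(range(len(data)))
    if shuffle:
        random.Random(seed).shuffle(indices)  # fixed seed keeps the demo repeatable
    for start in range(0, len(indices), batch_size):
        yield [data[i] for i in indices[start:start + batch_size]]

data = list(range(100))

# batch_size=1 corresponds to stochastic gradient descent,
# batch_size=len(data) to batch gradient descent,
# anything in between to mini-batch gradient descent.
sizes = [len(batch) for batch in iterate_minibatches(data, batch_size=32)]
print(sizes)  # [32, 32, 32, 4]
```

One model update would be performed per yielded batch, so a batch size of 32 on 100 examples gives four updates per epoch.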

**convert_to_tensor**
m2 = np.array([[1.0, 2.0], 
           [3.0, 4.0]], dtype=np.float32)

m3 = tf.constant([[1.0, 2.0], 
             [3.0, 4.0]])
print(type(m2))  # <class 'numpy.ndarray'>
print(type(m3))  # <class 'tensorflow.python.framework.ops.Tensor'>

t2 = tf.convert_to_tensor(m2, dtype=tf.float32)
t3 = tf.convert_to_tensor(m3, dtype=tf.float32)
print(type(t2))  # <class 'tensorflow.python.framework.ops.Tensor'>
print(type(t3))  # <class 'tensorflow.python.framework.ops.Tensor'>

TensorFlow-Book-master/ch02_basics/Concept01_defining_tensors.ipynb