### Self-Normalizing Neural Networks

# Making models better

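Deep learning models generally train best on inputs with small values, so we usually standardize the data to bring the values into a small range.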
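
A minimal sketch of that standardization step, using only the standard library. The sample values and the helper name `standardize` are made up for illustration; the examples later in this section simply divide pixel values by 255 instead.

```python
import statistics

def standardize(values):
    """Rescale values to zero mean and unit standard deviation."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)  # population standard deviation
    return [(v - mean) / std for v in values]

# Hypothetical raw feature values on a large scale
raw = [120.0, 150.0, 180.0, 210.0, 240.0]
scaled = standardize(raw)
print([round(v, 3) for v in scaled])
```

After the transformation the feature has mean 0 and standard deviation 1, whatever its original scale was.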

Videos for this section: https://www.bilibili.com/video/av85562236?p=54
https://www.bilibili.com/video/av85562236?p=55

### Batch normalization

Batch normalization can mitigate the problem of sigmoid's flat (vanishing) gradients.

In [None]:
conv_model.add(layers.Conv2D(32,3,activation='relu'))
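'''
After normalization the inputs x are smaller in magnitude, so the gradients are larger and updates are faster.
With sigmoid, for example, a smaller x stays away from the flat, saturated regions, so the gradient stays in a useful range.

How does BatchNormalization compute the mean and standard deviation? The data arrives batch by batch,
so the statistics of the whole dataset are not available. Instead, the layer normalizes each batch with
that batch's own statistics and tracks an exponential moving average of them across batches.

At training time the layer normalizes with the current batch's statistics. At test time it does not
recompute statistics from the test data; it still normalizes, but with the moving averages stored during training.
'''
conv_model.add(layers.BatchNormalization())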

dense_model.add(layers.Dense(32,activation='relu'))
dense_model.add(layers.BatchNormalization())
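
The training/inference behavior described in the comment above can be sketched for a single feature in plain Python. This is an illustrative toy, not the Keras implementation: the class name `BatchNorm1Feature` is hypothetical, `gamma` and `beta` are kept fixed instead of being learned, and the default `momentum` and `epsilon` are chosen to mirror the Keras layer's defaults.

```python
import statistics

class BatchNorm1Feature:
    """Toy batch normalization for a single feature (illustrative only)."""

    def __init__(self, momentum=0.99, epsilon=1e-3):
        self.momentum = momentum   # weight given to the old moving statistics
        self.epsilon = epsilon     # avoids division by zero
        self.gamma = 1.0           # scale (learned in a real layer, fixed here)
        self.beta = 0.0            # shift (learned in a real layer, fixed here)
        self.moving_mean = 0.0
        self.moving_var = 1.0

    def __call__(self, batch, training):
        if training:
            # Training: normalize with the statistics of the current batch ...
            mean = statistics.mean(batch)
            var = statistics.pvariance(batch)
            # ... and fold them into the exponential moving averages.
            m = self.momentum
            self.moving_mean = m * self.moving_mean + (1 - m) * mean
            self.moving_var = m * self.moving_var + (1 - m) * var
        else:
            # Inference: normalize with the stored moving averages instead.
            mean, var = self.moving_mean, self.moving_var
        scale = (var + self.epsilon) ** 0.5
        return [self.gamma * (x - mean) / scale + self.beta for x in batch]

bn = BatchNorm1Feature(momentum=0.9)
for _ in range(200):  # repeated batches with mean 10 and variance 4
    bn([8.0, 12.0, 8.0, 12.0], training=True)
print(round(bn.moving_mean, 3), round(bn.moving_var, 3))
print(bn([10.0, 12.0], training=False))
```

After enough training batches the moving averages converge to the batch statistics, so inference on a single sample gives roughly the same normalization as training did.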

#### Example

In [1]:
from keras.datasets import mnist
import numpy as np

(train_images,train_labels),(test_images,test_labels) = mnist.load_data()

train_images = train_images.reshape((60000,28*28))
train_images = train_images.astype("float32")/255

test_images = test_images.reshape((10000,28*28))
test_images = test_images.astype('float32')/255

from keras.utils import to_categorical

train_labels = to_categorical(train_labels)
test_labels = to_categorical(test_labels)


Using TensorFlow backend.


In [2]:
from keras import models
from keras import layers

model = models.Sequential()
model.add(layers.Dense(512,activation='relu',input_shape=(28*28,)))
model.add(layers.Dense(10,activation='softmax'))






In [4]:
model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])

model.fit(train_images,train_labels,
         epochs=2,
         batch_size=32,
         validation_split=0.2)

Train on 48000 samples, validate on 12000 samples
Epoch 1/2
Epoch 2/2


<keras.callbacks.History at 0x1c0f32f8198>

In [5]:
score = model.evaluate(test_images,test_labels,verbose=1)
print(score)

[0.07902216555672785, 0.9772]


In [6]:
model = models.Sequential()
model.add(layers.Dense(512,activation='sigmoid',input_shape=(28*28,)))

# The flat gradients of sigmoid are what make this deep model perform poorly
for i in range(9):
    model.add(layers.Dense(512,activation='sigmoid'))

model.add(layers.Dense(10,activation='softmax'))

model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])

model.fit(train_images,train_labels,
         epochs=2,
         batch_size=32,
         validation_split=0.2)
score = model.evaluate(test_images,test_labels,verbose=1)
print(score)

Train on 48000 samples, validate on 12000 samples
Epoch 1/2
Epoch 2/2
[2.3297152267456056, 0.1135]
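
The chance-level accuracy above is consistent with vanishing gradients. As a rough sketch (ignoring the weight matrices, which also scale the gradient), the derivative of sigmoid never exceeds 0.25, so backpropagating through ten stacked sigmoids multiplies the signal by at most 0.25 ten times over:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    """Derivative of sigmoid: s * (1 - s), largest at x = 0."""
    s = sigmoid(x)
    return s * (1.0 - s)

# The gradient shrinks quickly as |x| grows (saturation).
for x in (0.0, 2.0, 5.0, 10.0):
    print(x, sigmoid_grad(x))

# Upper bound on the factor contributed by 10 stacked sigmoid activations
print(0.25 ** 10)
```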


In [9]:
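from keras.layers import BatchNormalization
model = models.Sequential()
# The first layer has 512 units
model.add(layers.Dense(512,activation='sigmoid',input_shape=(28*28,)))
'''
Why does each BatchNormalization layer have 512*4 = 2048 parameters?
The preceding layer has 512 units, and BatchNormalization keeps four values per unit:
a moving mean and a moving variance used for normalization (2*512 = 1024, not trainable),
plus a learned scale (gamma) and shift (beta) that move the normalized output
to a new mean and standard deviation (another 2*512 = 1024, trainable).
'''
model.add(BatchNormalization())

# With BatchNormalization after each layer, the deep sigmoid network trains well again.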
for i in range(9):
    model.add(layers.Dense(512,activation='sigmoid'))
    model.add(BatchNormalization())

model.add(layers.Dense(10,activation='softmax'))

model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])

model.fit(train_images,train_labels,
         epochs=2,
         batch_size=32,
         validation_split=0.2)
score = model.evaluate(test_images,test_labels,verbose=1)
print(score)

Train on 48000 samples, validate on 12000 samples
Epoch 1/2
Epoch 2/2
[0.21457544406056403, 0.943]


In [8]:
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_14 (Dense)             (None, 512)               401920    
_________________________________________________________________
dense_15 (Dense)             (None, 512)               262656    
_________________________________________________________________
batch_normalization_1 (Batch (None, 512)               2048      
_________________________________________________________________
dense_16 (Dense)             (None, 512)               262656    
_________________________________________________________________
batch_normalization_2 (Batch (None, 512)               2048      
_________________________________________________________________
dense_17 (Dense)             (None, 512)               262656    
_________________________________________________________________
batch_normalization_3 (Batch (None, 512)               2048      
__________
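
The `Param #` column above can be reproduced with simple arithmetic. The helper names below are hypothetical; the sketch assumes each Dense layer has one bias per unit and each BatchNormalization layer keeps four values per feature.

```python
def batch_norm_param_count(num_features):
    """Return (trainable, non_trainable) parameter counts for one BN layer."""
    trainable = 2 * num_features       # gamma (scale) and beta (shift)
    non_trainable = 2 * num_features   # moving mean and moving variance
    return trainable, non_trainable

def dense_param_count(inputs, units):
    """Weights plus one bias per unit."""
    return inputs * units + units

trainable, non_trainable = batch_norm_param_count(512)
print(trainable + non_trainable)         # 2048
print(dense_param_count(28 * 28, 512))   # 401920
print(dense_param_count(512, 512))       # 262656
```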

#### Xception: Depthwise Separable Convolution


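When spatial correlations and cross-channel correlations are learned jointly (a regular convolution), the number of weights is:

    filter_width * filter_height * last_channel_num * current_channel_num

When they are learned separately (a depthwise separable convolution):

First capture the spatial correlations (the depthwise step: one spatial filter per input channel, so it looks only at space):

    filter_width * filter_height * last_channel_num

Then capture the cross-channel correlations (the pointwise step: a 1 * 1 filter can only mix channels and has no spatial extent):

    1 * 1 * last_channel_num * current_channel_num

For example, if the input has 32 channels and we want 64 output filters, the pointwise step needs 1*1*32*64 weights.

Question: shouldn't the spatial correlation already include the cross-channel correlation? A regular convolution does model both jointly, because each filter spans every input channel. The separable version factorizes them, on the assumption that the two can be learned largely independently.

To see what the Xception architecture looks like, search for images of it online.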
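
The two parameter formulas can be compared directly. This is a sketch with hypothetical helper names; it assumes one bias per output channel and a depth multiplier of 1 (the Keras default for `SeparableConv2D`).

```python
def conv2d_params(kernel, in_channels, out_channels):
    """Regular convolution: every filter spans all input channels."""
    return kernel * kernel * in_channels * out_channels + out_channels

def separable_conv2d_params(kernel, in_channels, out_channels):
    """Depthwise step + pointwise 1x1 step + one bias per output channel."""
    depthwise = kernel * kernel * in_channels
    pointwise = 1 * 1 * in_channels * out_channels
    return depthwise + pointwise + out_channels

# (input channels, output channels) of the 3x3 layers used in this section
for cin, cout in [(3, 32), (32, 64), (64, 64), (64, 128)]:
    print(cin, cout,
          conv2d_params(3, cin, cout),
          separable_conv2d_params(3, cin, cout))
```

For these four layers the separable counts come out to 155, 2400, 4736, and 8896, which agrees with the `Param #` column of the `SeparableConv2D` summary in this section. The regular-convolution counts are several times larger for every layer.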



In [10]:
from keras.models import Sequential,Model
from keras import layers

height=64
width = 64
channel = 3
num_classes = 10

model = Sequential()
'''
Parameter count for this layer:
spatial (depthwise) weights     cross-channel (pointwise) weights, including the bias term
     3*3*3           +     (1+1*1*3)*32
'''
model.add(layers.SeparableConv2D(32,3,activation='relu',
                                 input_shape=(height,width,channel,)))

In [11]:
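'''
Parameter count for this layer (its input has 32 channels):
spatial (depthwise) weights     cross-channel (pointwise) weights, including the bias term
     3*3*32           +     (1+1*1*32)*64
'''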
model.add(layers.SeparableConv2D(64,3,activation='relu'))
model.add(layers.MaxPooling2D(2))

model.add(layers.SeparableConv2D(64,3,activation='relu'))
model.add(layers.SeparableConv2D(128,3,activation='relu'))
model.add(layers.MaxPooling2D(2))

model.add(layers.SeparableConv2D(64,3,activation='relu'))
model.add(layers.SeparableConv2D(128,3,activation='relu'))
model.add(layers.GlobalAveragePooling2D())

model.add(layers.Dense(32,activation='relu'))
model.add(layers.Dense(num_classes,activation='softmax'))




In [12]:
model.compile(optimizer='rmsprop',loss='categorical_crossentropy')

In [13]:
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
separable_conv2d_1 (Separabl (None, 62, 62, 32)        155       
_________________________________________________________________
separable_conv2d_2 (Separabl (None, 60, 60, 64)        2400      
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 30, 30, 64)        0         
_________________________________________________________________
separable_conv2d_3 (Separabl (None, 28, 28, 64)        4736      
_________________________________________________________________
separable_conv2d_4 (Separabl (None, 26, 26, 128)       8896      
_________________________________________________________________
max_pooling2d_2 (MaxPooling2 (None, 13, 13, 128)       0         
_________________________________________________________________
separable_conv2d_5 (Separabl (None, 11, 11, 64)        9408      
__________

In [18]:
from keras.datasets import mnist
(train_images,train_labels),(test_images,test_labels) = mnist.load_data()

train_images = train_images.reshape((60000,28,28,1))
train_images = train_images.astype("float32")/255

test_images = test_images.reshape((10000,28,28,1))
test_images = test_images.astype("float32")/255

from keras.utils import to_categorical

train_labels = to_categorical(train_labels)
test_labels = to_categorical(test_labels)

In [19]:
from keras.models import Model,Sequential
from keras import layers

height = 28
width = 28
channels = 1
num_classes = 10
model = Sequential()
model.add(layers.SeparableConv2D(32,3,activation='relu',
                                 input_shape=(height,width,channels)))
model.add(layers.SeparableConv2D(64,3,activation='relu'))
model.add(layers.MaxPooling2D(2))

model.add(layers.SeparableConv2D(64,3,activation='relu'))
model.add(layers.SeparableConv2D(128,3,activation='relu'))
model.add(layers.MaxPooling2D(2))

model.add(layers.Flatten())
model.add(layers.Dense(32,activation='relu'))
model.add(layers.Dense(num_classes,activation='softmax'))
model.compile(optimizer='rmsprop',loss='categorical_crossentropy',
              metrics=['accuracy'])
model.fit(train_images,train_labels,batch_size=32,validation_split=0.2,epochs=5)

test_loss,test_acc = model.evaluate(test_images,test_labels)
test_acc

Train on 48000 samples, validate on 12000 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


0.9869

In [20]:
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
separable_conv2d_17 (Separab (None, 26, 26, 32)        73        
_________________________________________________________________
separable_conv2d_18 (Separab (None, 24, 24, 64)        2400      
_________________________________________________________________
max_pooling2d_9 (MaxPooling2 (None, 12, 12, 64)        0         
_________________________________________________________________
separable_conv2d_19 (Separab (None, 10, 10, 64)        4736      
_________________________________________________________________
separable_conv2d_20 (Separab (None, 8, 8, 128)         8896      
_________________________________________________________________
max_pooling2d_10 (MaxPooling (None, 4, 4, 128)         0         
_________________________________________________________________
flatten_5 (Flatten)          (None, 2048)              0         
__________

In [21]:
# For comparison, train on MNIST with a regular CNN (plain Conv2D layers)

from keras.models import Model,Sequential
from keras import layers

height = 28
width = 28
channels = 1
num_classes = 10
model = Sequential()
model.add(layers.Conv2D(32,3,activation='relu',
                                 input_shape=(height,width,channels)))
model.add(layers.Conv2D(64,3,activation='relu'))
model.add(layers.MaxPooling2D(2))

model.add(layers.Conv2D(64,3,activation='relu'))
model.add(layers.Conv2D(128,3,activation='relu'))
model.add(layers.MaxPooling2D(2))

model.add(layers.Flatten())
model.add(layers.Dense(32,activation='relu'))
model.add(layers.Dense(num_classes,activation='softmax'))
model.compile(optimizer='rmsprop',loss='categorical_crossentropy',
              metrics=['accuracy'])
model.fit(train_images,train_labels,batch_size=32,validation_split=0.2,epochs=5)

test_loss,test_acc = model.evaluate(test_images,test_labels)
test_acc

Train on 48000 samples, validate on 12000 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


0.9918

### Hyperparameter optimization

+ The architecture-level parameters are called hyperparameters to distinguish them from the parameters of a model, which are trained via backpropagation.

+ It shouldn't be your job as a human to fiddle with hyperparameters all day; that is better left to a machine. Thus you need to explore the space of possible decisions automatically, systematically, and in a principled way. You need to search the architecture space and find the best-performing ones empirically. That's what the field of automatic hyperparameter optimization is about.

+ The process of optimizing hyperparameters typically looks like this:
>+ Choose a set of hyperparameters (automatically).
>+ Build the corresponding model.
>+ Fit it to your training data, and measure the final performance on the validation data.
>+ Choose the next set of hyperparameters to try (automatically).
>+ Repeat.
>+ Eventually, measure performance on your test data.

+ The key to this process is the algorithm that uses this history of validation performance, given various sets of hyperparameters, to choose the next set of hyperparameters to evaluate.

+ Updating hyperparameters, unlike training model weights, is extremely challenging. Consider the following:
>+ Computing the feedback signal (does this set of hyperparameters lead to a high-performing model on this task?) can be extremely expensive: it requires creating and training a new model from scratch on your dataset.
>+ The hyperparameter space is typically made of discrete decisions and thus isn't continuous or differentiable. Hence, you typically can't do gradient descent in hyperparameter space.

+ Often, it turns out that random search (choosing hyperparameters to evaluate at random, repeatedly) is the best solution, despite being the most naive one.
+ One tool reliably better than random search is `Hyperopt`, a Python library for hyperparameter optimization that internally uses trees of Parzen estimators to predict sets of hyperparameters that are likely to work well.
+ Another library called `Hyperas` integrates Hyperopt for use with Keras models.
+ NOTE: One important issue to keep in mind when doing automatic hyperparameter optimization at scale is validation-set overfitting. Because you're updating hyperparameters based on a signal that is computed using your validation data, you're effectively training them on the validation data, and thus they will quickly overfit to the validation data.
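
The random-search loop described above can be sketched in a few lines. Everything here is a stand-in: `validation_score` is a hypothetical toy function that replaces "build a model, train it, and measure validation performance", and the two hyperparameters and their ranges are made up for illustration.

```python
import random

def validation_score(learning_rate, num_units):
    """Hypothetical stand-in for 'build, train, and validate a model'."""
    # Toy surface that peaks near learning_rate=0.01 and num_units=256.
    return 1.0 - abs(learning_rate - 0.01) * 10 - abs(num_units - 256) / 1000

def random_search(num_trials, seed=0):
    rng = random.Random(seed)
    history = []
    for _ in range(num_trials):
        # Choose a set of hyperparameters (here: purely at random).
        params = {
            "learning_rate": 10 ** rng.uniform(-4, -1),  # log-uniform sample
            "num_units": rng.choice([32, 64, 128, 256, 512]),
        }
        # Build, fit, and score on the validation data (simulated here).
        score = validation_score(**params)
        history.append((score, params))
    # Keep the configuration with the best validation score.
    best = max(history, key=lambda item: item[0])
    return best, history

(best_score, best_params), history = random_search(num_trials=50)
print(best_score, best_params)
```

A smarter search method would use `history` to decide which configuration to try next, where random search ignores it. In a real workflow, the winning configuration would be evaluated once on the test data at the very end.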

#### Hyperas demo
https://www.bilibili.com/video/av85562236?p=55

### Model ensembling

+ If you look at machine-learning competitions, in particular on Kaggle, you'll see that the winners use very large ensembles of models that inevitably beat any single model, no matter how good.
+ Ensembling relies on the assumption that different good models trained independently are likely to be good for different reasons: each model looks at slightly different aspects of the data to make its predictions, getting part of the "truth" but not all of it. By pooling their perspectives together, you can get a far more accurate description of the data.
+ The easiest way to pool the predictions of a set of classifiers (to ensemble the classifiers) is to average their predictions. This will work only if the classifiers are more or less equally good.

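In [None]:
# Use four different (already trained) models to compute initial predictions
preds_a = model_a.predict(x_val)
preds_b = model_b.predict(x_val)
preds_c = model_c.predict(x_val)
preds_d = model_d.predict(x_val)

# The ensemble prediction is the plain average of the four
final_preds = 0.25*(preds_a+preds_b+preds_c+preds_d)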

+ A smarter way to ensemble classifiers is to do a weighted average, where the weights are learned on the validation data. Typically, the better classifiers are given a higher weight, and the worse classifiers are given a lower weight. In general, a simple weighted average with weights optimized on the validation data provides a very strong baseline.

In [None]:
preds_a = model_a.predict(x_val)
preds_b = model_b.predict(x_val)
preds_c = model_c.predict(x_val)
preds_d = model_d.predict(x_val)

final_preds = 0.5*preds_a+0.25*preds_b+0.1*preds_c+0.15*preds_d
# These weights (0.5,0.25,0.1,0.15) are assumed to be learned empirically
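
One way such weights could be found is a plain grid search over weight vectors that sum to 1, keeping whichever gives the lowest loss on the validation set. This is only a sketch under stated assumptions: the three "models" are synthetic noisy predictors for a binary task, and all helper names are hypothetical.

```python
import itertools
import math
import random

def log_loss(labels, probs):
    """Mean binary cross-entropy."""
    eps = 1e-7
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(labels)

def blend(weights, model_probs):
    """Weighted average of the per-model probability lists."""
    return [sum(w * p for w, p in zip(weights, sample))
            for sample in zip(*model_probs)]

def simplex_grid(num_models, steps):
    """All weight vectors with entries k/steps that sum to 1."""
    for combo in itertools.product(range(steps + 1), repeat=num_models):
        if sum(combo) == steps:
            yield tuple(k / steps for k in combo)

# Synthetic validation set and three hypothetical models of differing quality
rng = random.Random(0)
y_val = [rng.randint(0, 1) for _ in range(500)]

def noisy_model(noise):
    return [min(max(y + rng.gauss(0, noise), 0.0), 1.0) for y in y_val]

model_probs = [noisy_model(0.3), noisy_model(0.5), noisy_model(0.8)]

# Pick the weights with the lowest validation loss
best_weights = min(simplex_grid(3, steps=10),
                   key=lambda w: log_loss(y_val, blend(w, model_probs)))
best_loss = log_loss(y_val, blend(best_weights, model_probs))
print(best_weights, best_loss)
```

Because the grid contains the weight vectors that select a single model, the chosen blend can never score worse on the validation set than the best individual model. The same validation-set overfitting caveat applies to these weights as to any other tuned quantity.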

+ The key to making ensembling work is the diversity of the set of classifiers. If your models are biased in different ways, the biases will cancel each other out, and the ensemble will be more robust and more accurate.
+ For this reason, you should ensemble models that are as good as possible while being as different as possible.
+ One thing that is largely not worth doing is ensembling the same network trained several times independently, from different random initializations.
+ In recent times, one style of basic ensemble that has been very successful in practice is the wide and deep category of models, blending deep learning with shallow learning. Such models consist of jointly training a deep neural network with a large linear model. The joint training of a family of diverse models is yet another option to achieve model ensembling.
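
Why diversity matters can be illustrated with a toy simulation. This is a simplified sketch on a made-up regression target rather than real classifiers: "diverse" models have independent errors, while the "identical" models share exactly the same error, an extreme stand-in for models that fail in the same way.

```python
import random
import statistics

rng = random.Random(42)
truth = [rng.uniform(0, 1) for _ in range(2000)]

def mse(preds):
    """Mean squared error against the true values."""
    return statistics.mean((p - t) ** 2 for p, t in zip(preds, truth))

def average(models):
    """Element-wise average of several prediction lists."""
    return [sum(col) / len(col) for col in zip(*models)]

# Four "diverse" models: each one has its own independent error.
diverse = [[t + rng.gauss(0, 0.2) for t in truth] for _ in range(4)]

# Four "identical" models: they all share the same error.
shared_error = [rng.gauss(0, 0.2) for _ in truth]
identical = [[t + e for t, e in zip(truth, shared_error)] for _ in range(4)]

print("single model      :", mse(diverse[0]))
print("diverse ensemble  :", mse(average(diverse)))
print("identical ensemble:", mse(average(identical)))
```

Averaging the models with independent errors reduces the error substantially, while averaging models that share their errors gains nothing over a single one of them.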