# PART 1: Many Tricks in Deep Learning

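## In this notebook we build a simple multilayer perceptron classifier with TensorFlow 2.0, then apply several training improvements and compare their performance.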

After working through this notebook you will:
1. Know how to use TensorFlow 2.0
2. Have a model of your own
3. See what effect each trick has

[TensorFlow 2.0 guide](https://tensorflow.google.cn/beta)

In [1]:
try:
  # google Colab only
  %tensorflow_version 2.x
except Exception:
    pass

In [2]:
import matplotlib.pyplot as plt
import tensorflow  as tf
from tensorflow.keras.layers import Flatten, Dense, Dropout,BatchNormalization,Activation
from tensorflow.keras import Model,datasets
import numpy as np
# Load fashion_mnist from the datasets bundled with Keras: every image is 28x28, 60,000 training and 10,000 test samples, 10 labels
fashion_mnist = datasets.fashion_mnist
(train_images, train_labels), (test_images, test_labels) = fashion_mnist.load_data() # downloads the data on the first run
train_images = train_images / 255.0 # scale pixel values to [0, 1]
test_images = test_images / 255.0
# Take a look at one sample
plt.figure()
plt.imshow(train_images[10])

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/train-labels-idx1-ubyte.gz
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/train-images-idx3-ubyte.gz
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/t10k-labels-idx1-ubyte.gz
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/t10k-images-idx3-ubyte.gz


<matplotlib.image.AxesImage at 0x226464f7cc0>

In [3]:
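# Train a baseline classifier that sorts 28x28 images into 10 classes

EPOCH = 20
BATCH_SIZE = 32

train = tf.data.Dataset.from_tensor_slices((train_images, train_labels)).shuffle(1000).batch(BATCH_SIZE) # shapes are ((None, 28, 28), (None,)). Why None? The last batch may hold fewer than BATCH_SIZE samples, so the batch dimension is not fixed.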
test  = tf.data.Dataset.from_tensor_slices((test_images, test_labels)).shuffle(1000).batch(BATCH_SIZE)
train

<DatasetV1Adapter shapes: ((?, 28, 28), (?,)), types: (tf.float64, tf.uint8)>
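The `?` (None) in the shapes printed above is the batch dimension. A minimal sketch of why it is unknown, using 100 all-zero stand-in images (an assumption for illustration, not the Fashion-MNIST data): 100 is not a multiple of 32, so the last batch is smaller, unless `drop_remainder=True` discards it.

```python
import numpy as np
import tensorflow as tf

# Stand-in data (assumed for illustration): 100 blank 28x28 "images"
images = np.zeros((100, 28, 28), dtype=np.float32)
labels = np.zeros((100,), dtype=np.uint8)

# Default batching keeps the short final batch, so the static batch size is None
ds = tf.data.Dataset.from_tensor_slices((images, labels)).batch(32)
batch_sizes = [int(img.shape[0]) for img, _ in ds]
static_dim = ds.element_spec[0].shape[0]

# drop_remainder=True discards the short batch and fixes the batch dimension
ds_fixed = tf.data.Dataset.from_tensor_slices((images, labels)).batch(32, drop_remainder=True)
fixed_dim = ds_fixed.element_spec[0].shape[0]

print(batch_sizes, static_dim, fixed_dim)
```

Dropping the remainder loses a few samples per epoch, so it is mostly useful when a fixed batch shape is required.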

In [4]:
# Define our first model

class BaseClassifier(Model):
  def __init__(self):
    super().__init__()
    self.flatten = Flatten(input_shape=[28,28]) # Flatten leaves the first (batch) dimension untouched
    self.d1 = Dense(units=128,activation='relu')
    self.d2 = Dense(units=32, activation='relu')
    self.d3 = Dense(units=10, activation='softmax')
  def call(self, input):
    x = self.flatten(input)
    x = self.d1(x)
    x = self.d2(x)
    output = self.d3(x)
    return output

# Baseline result: EPOCH=10: train_loss=0.23673877120018005, train_acc=91.17166137695312, test_loss=0.34537169337272644, test_acc=88.23999786376953

*After choosing a model, first run that model's code cell to define it, then change the line `model = xxx()` in the cell below and rerun it.*

In [5]:

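loss_fn= tf.keras.losses.SparseCategoricalCrossentropy() # loss: the labels are integer class ids, hence the sparse variant
opt = tf.keras.optimizers.Adam() # Adam optimizer
# Global metrics for training and test. They accumulate values over an epoch and then report the overall result, so each batch's result is fed in and the metrics are reset when the epoch ends.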
train_loss = tf.keras.metrics.Mean(name='train_loss')
train_acc  = tf.keras.metrics.SparseCategoricalAccuracy(name='train_acc')
test_loss = tf.keras.metrics.Mean(name='test_loss')
test_acc  = tf.keras.metrics.SparseCategoricalAccuracy(name='test_acc')

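model = AdvancedClassifier4()  # remember to change the model here; run the cell that defines the class first, otherwise this line raises NameError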

@tf.function
def train_1_batch(img, lbl):
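  with tf.GradientTape() as tape: # only the forward pass and the loss computation are recorded on the tape
    pred = model(img)
    # add the layers' regularization penalties; a hand-written loop has to do this itself (the sum of an empty list is 0)
    loss = loss_fn(lbl, pred) + sum(model.losses)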
  grads = tape.gradient(loss, model.trainable_variables)
  opt.apply_gradients(zip(grads, model.trainable_variables))
  train_loss(loss)
  train_acc(lbl, pred)

@tf.function
def test_1_batch(img, lbl):
  pred = model(img)
  loss = loss_fn(lbl, pred)
  test_loss(loss)
  test_acc(lbl,pred)


template = "EPOCH={}: train_loss={}, train_acc={}, test_loss={},test_acc={}"

# This is mini-batch training, which combines the advantages of online and full-batch updates
for i in range(EPOCH):
  for img,lbl in train: # train yields batches shaped ((32,28,28),(32,))
    train_1_batch(img,lbl)
  for img,lbl in test:
    test_1_batch(img,lbl)
  print(template.format(i+1, train_loss.result(), train_acc.result()*100, test_loss.result(), test_acc.result()*100))
  # the metrics accumulate over an epoch, so reset them once the epoch ends
  train_loss.reset_states()
  test_loss.reset_states()
  train_acc.reset_states()
  test_acc.reset_states()

NameError: name 'AdvancedClassifier4' is not defined

## Dropout


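Dropout is a regularization method that guards against overfitting. During each forward pass in training, every neuron of the layer is kept with probability p and deactivated with probability 1-p. This amounts to training a different function on every pass, and each neuron's final weights behave like a combination of those functions.

At test time, however, the random deactivation cannot be reproduced, so the output has to be adjusted. If a neuron outputs x, its expected output during training is $p \cdot x+(1-p) \cdot 0=px$. So at test time the full network is kept and each layer's output is multiplied by the probability p. Keras implements the equivalent "inverted" form: it scales the kept activations by $1/p$ during training and leaves the test-time outputs unchanged.

In practice you only need to add a Dropout layer through the API. Note that the Keras `rate` argument is the fraction of units that are dropped (that is, 1-p) and has to be given explicitly; 0.5 is a common choice and works well!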
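The expectation argument above can be checked numerically. The following is a minimal NumPy sketch of inverted dropout, not the Keras implementation; `dropout_train`, `dropout_test` and `keep_prob` are names assumed for illustration, with `keep_prob` playing the role of p (so it corresponds to `1 - rate` in Keras).

```python
import numpy as np

def dropout_train(x, keep_prob, rng):
    """Inverted dropout: zero each unit with probability 1 - keep_prob,
    then scale the surviving units by 1 / keep_prob."""
    mask = rng.random(x.shape) < keep_prob
    return x * mask / keep_prob

def dropout_test(x):
    """At test time inverted dropout is the identity."""
    return x

rng = np.random.default_rng(0)
x = np.ones((10000, 8))
keep_prob = 0.5

y = dropout_train(x, keep_prob, rng)
print(np.unique(y))   # surviving units are scaled up, dropped units are 0
print(y.mean())       # close to the mean of x, which is 1.0
```

Because the scaling already happens during training, the average activation matches the test-time activation without any extra multiplication by p. With the Keras `Dropout` layer the random mask is only applied when the layer is called in training mode (`training=True`); called without it, the layer passes its input through unchanged.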

In [None]:
class AdvancedClassifier1(Model):
  def __init__(self):
    super().__init__()
    self.flatten = Flatten(input_shape=[28,28]) # Flatten leaves the first (batch) dimension untouched
    self.d1 = Dense(units=128,activation='relu')
    self.drop1 = Dropout(0.5)
    self.d2 = Dense(units=32, activation='relu')
    self.drop2 = Dropout(0.5)
    self.d3 = Dense(units=10, activation='softmax')
  def call(self, input):
    x = self.flatten(input)
    x = self.d1(x)
    x = self.drop1(x)
    x = self.d2(x)
    x = self.drop2(x)
    output = self.d3(x)
    return output

## Weight Decay

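A regularization method. If the weight matrix becomes too complex, the model may overfit. An L2 penalty (or L1, or a mix of L1 and L2) is therefore applied to the weight matrix.

In practice, pass `kernel_regularizer` when defining each layer with keras.layers. Note that this is a layer-wise setting, not one pass over all layers that computes a single total. The bias can be regularized too, through `bias_regularizer`. That argument is not limited to Dense layers: convolutional and LSTM layers accept it as well.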
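One detail matters for a hand-written training loop: the regularizers only register their penalties on the layer, and the loop has to add them to the loss itself. A minimal sketch on a single Dense layer, with a toy squared-output loss that is assumed purely for illustration:

```python
import tensorflow as tf

layer = tf.keras.layers.Dense(
    4,
    kernel_regularizer=tf.keras.regularizers.l2(0.01),
    bias_regularizer=tf.keras.regularizers.l2(0.01),
)
x = tf.ones((2, 3))

with tf.GradientTape() as tape:
    y = layer(x)                            # the first call builds the weights
    data_loss = tf.reduce_mean(tf.square(y))  # toy loss, assumed for illustration
    penalty = tf.add_n(layer.losses)        # one entry per regularized variable
    total_loss = data_loss + penalty

# Differentiating the total loss is what makes the penalty shrink the weights
grads = tape.gradient(total_loss, layer.trainable_variables)
print(len(layer.losses), float(penalty))
```

For a whole model the same list is available as `model.losses`. The bias starts at zero, so at this point the entire penalty comes from the kernel: 0.01 times the sum of its squared entries.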

In [None]:
class AdvancedClassifier2(Model):
  def __init__(self):
    super().__init__()
    self.flatten = Flatten(input_shape=[28,28]) # Flatten leaves the first (batch) dimension untouched
    self.d1 = Dense(units=128,activation='relu',kernel_regularizer=tf.keras.regularizers.l2(0.01), bias_regularizer=tf.keras.regularizers.l2(0.01))
    self.drop1 = Dropout(0.5)
    self.d2 = Dense(units=32, activation='relu',kernel_regularizer=tf.keras.regularizers.l2(0.01), bias_regularizer=tf.keras.regularizers.l2(0.01))
    self.drop2 = Dropout(0.5)
    self.d3 = Dense(units=10, activation='softmax')
  def call(self, input):
    x = self.flatten(input)
    x = self.d1(x)
    x = self.drop1(x)
    x = self.d2(x)
    x = self.drop2(x)
    output = self.d3(x)
    return output

## Early Stopping
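A regularization method. Hold out a validation set; if the validation loss curve starts to rise, the model is beginning to overfit, so training can be stopped once that epoch finishes.

The Keras `model.fit` method takes a `callbacks=` argument, for example `callback = tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=3)` (just an example). `monitor` is the quantity being watched, which can be a training or validation loss or accuracy, and `patience` is the number of epochs without improvement in that quantity after which training stops. These are the two most important parameters, and the same idea carries over if you write your own loop instead of using `model.fit`. As for the best weights: by the time early stopping triggers, the best point has probably already passed. The API can restore the best weights for you (`restore_best_weights=True`); if you write the loop yourself... [one approach](https://www.datalearner.com/blog/1051537860479157)

For the demo, 10,000 samples are taken out of train_data as valid_set, and the train and valid loss curves are plotted afterwards. The demo uses the AdvancedClassifier2 architecture.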
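For a hand-written loop, the monitor/patience logic can be sketched in plain Python. `EarlyStopper` is a hypothetical helper, not a Keras class, and the validation losses below are made-up numbers used only to exercise it.

```python
class EarlyStopper:
    """Hypothetical helper: stop after `patience` epochs without improvement."""

    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best_loss = float("inf")
        self.best_epoch = None
        self.best_weights = None
        self.wait = 0

    def update(self, epoch, val_loss, weights=None):
        """Record one epoch; return True when training should stop."""
        if val_loss < self.best_loss - self.min_delta:
            # Improvement: remember this epoch and its weights, reset the counter
            self.best_loss = val_loss
            self.best_epoch = epoch
            self.best_weights = weights
            self.wait = 0
            return False
        self.wait += 1
        return self.wait >= self.patience

# Made-up validation losses: the minimum is reached at epoch 3
val_losses = [0.90, 0.70, 0.60, 0.62, 0.61, 0.65, 0.50]
stopper = EarlyStopper(patience=3)
stopped_at = None
for epoch, loss in enumerate(val_losses, start=1):
    if stopper.update(epoch, loss, weights={"epoch": epoch}):
        stopped_at = epoch
        break

print(stopped_at, stopper.best_epoch, stopper.best_loss)
```

In a real loop one would pass `model.get_weights()` as `weights` and call `model.set_weights(stopper.best_weights)` after stopping, which is how the best point can be recovered even though it has already passed.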

In [None]:
import matplotlib.pyplot as plt
import tensorflow  as tf
from tensorflow.keras.layers import Flatten, Dense, Dropout,BatchNormalization
from tensorflow.keras import Model,datasets
import numpy as np
fashion_mnist = datasets.fashion_mnist
(train_images, train_labels), (test_images, test_labels) = fashion_mnist.load_data() # downloads the data on the first run
train_images = train_images / 255.0 
test_images = test_images / 255.0

EPOCH = 20
BATCH_SIZE = 32
valid_images = train_images[0:10000]
valid_labels = train_labels[0:10000]
train_images_ = train_images[10000:]
train_labels_ = train_labels[10000:]

train = tf.data.Dataset.from_tensor_slices((train_images_, train_labels_)).shuffle(1000).batch(BATCH_SIZE) 
valid = tf.data.Dataset.from_tensor_slices((valid_images, valid_labels)).shuffle(1000).batch(BATCH_SIZE)
test  = tf.data.Dataset.from_tensor_slices((test_images, test_labels)).shuffle(1000).batch(BATCH_SIZE)

class AdvancedClassifier3(Model):
  def __init__(self):
    super().__init__()
    self.flatten = Flatten(input_shape=[28,28]) # Flatten leaves the first (batch) dimension untouched
    self.d1 = Dense(units=128,activation='relu',kernel_regularizer=tf.keras.regularizers.l2(0.01), bias_regularizer=tf.keras.regularizers.l2(0.01))
    self.drop1 = Dropout(0.5)
    self.d2 = Dense(units=32, activation='relu',kernel_regularizer=tf.keras.regularizers.l2(0.01), bias_regularizer=tf.keras.regularizers.l2(0.01))
    self.drop2 = Dropout(0.5)
    self.d3 = Dense(units=10, activation='softmax')
  def call(self, input):
    x = self.flatten(input)
    x = self.d1(x)
    x = self.drop1(x)
    x = self.d2(x)
    x = self.drop2(x)
    output = self.d3(x)
    return output

loss_fn= tf.keras.losses.SparseCategoricalCrossentropy() # loss: the labels are integer class ids, hence the sparse variant
opt = tf.keras.optimizers.Adam() # Adam optimizer
# Global metrics for training, validation and test. They accumulate values over an epoch and then report the overall result, so each batch's result is fed in and the metrics are reset when the epoch ends.
# The two loss metrics carry a _metric suffix so that they do not clash with the history lists below.
train_loss_metric = tf.keras.metrics.Mean(name='train_loss')
train_acc  = tf.keras.metrics.SparseCategoricalAccuracy(name='train_acc')
valid_loss_metric = tf.keras.metrics.Mean(name='valid_loss')
valid_acc  = tf.keras.metrics.SparseCategoricalAccuracy(name='valid_acc')
test_loss = tf.keras.metrics.Mean(name='test_loss')
test_acc  = tf.keras.metrics.SparseCategoricalAccuracy(name='test_acc')

model = AdvancedClassifier3()  # defined above, same architecture as AdvancedClassifier2

# per-epoch loss histories, plotted in the next cell
train_loss = list()
valid_loss = list()

@tf.function
def train_1_batch(img, lbl):
  with tf.GradientTape() as tape: # only the forward pass and the loss computation are recorded on the tape
    pred = model(img)
    # add the layers' L2 penalties; a hand-written loop has to do this itself
    loss = loss_fn(lbl, pred) + sum(model.losses)
  grads = tape.gradient(loss, model.trainable_variables)
  opt.apply_gradients(zip(grads, model.trainable_variables))
  train_loss_metric(loss)
  train_acc(lbl, pred)

@tf.function
def test_1_batch(img, lbl):
  pred = model(img)
  loss = loss_fn(lbl, pred)
  test_loss(loss)
  test_acc(lbl,pred)

@tf.function
def valid_1_batch(img, lbl):
  pred = model(img)
  loss = loss_fn(lbl, pred)
  valid_loss_metric(loss)
  valid_acc(lbl,pred)

template = "EPOCH={}: train_loss={}, train_acc={}, valid_loss={}, valid_acc={}, test_loss={},test_acc={}"

# This is mini-batch training, which combines the advantages of online and full-batch updates
for i in range(EPOCH):
  for img,lbl in train:
    train_1_batch(img,lbl)
  for img,lbl in valid:
    valid_1_batch(img,lbl)
  for img,lbl in test:
    test_1_batch(img,lbl)

  print(template.format(i+1, train_loss_metric.result(), train_acc.result()*100, valid_loss_metric.result(), valid_acc.result()*100, test_loss.result(), test_acc.result()*100))
  # store plain floats so the histories can be plotted directly
  train_loss.append(float(train_loss_metric.result()))
  valid_loss.append(float(valid_loss_metric.result()))
  # the metrics accumulate over an epoch, so reset all of them once the epoch ends
  train_loss_metric.reset_states()
  valid_loss_metric.reset_states()
  test_loss.reset_states()
  train_acc.reset_states()
  valid_acc.reset_states()
  test_acc.reset_states()


In [None]:
%matplotlib inline
epoch = np.arange(1,21)
plt.figure(figsize=(8,10))
plt.plot(epoch, train_loss,color='red')
plt.plot(epoch, valid_loss,color='grey')
plt.show()

## Batch Normalization

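[Reference](https://zhuanlan.zhihu.com/p/24810318)

BN addresses the sensitivity of activation layers to the range of their inputs. With an activation such as sigmoid, many inputs end up squashed to the two saturated ends. BN is introduced so that each layer's output has a more reasonable distribution before it goes through the activation. Note that besides changing the original data distribution by normalizing, $x_{i}^{'}=\frac{x_{i}-\mu_{B}}{\sqrt{\sigma_{B}^{2}+\epsilon}}$, a BN layer also has a de-normalization step $y_{i}\leftarrow \gamma x_{i}^{'}+\beta$, whose two parameters $\gamma$ and $\beta$ are learned by the network. This step can cancel the side effects when normalization turns out to be counterproductive. In the worst case it restores the data distribution to where it started: $\gamma=\sqrt{\sigma_{B}^{2}+\epsilon}$ , $\beta=\mu_{B}$.

In practice, start with a layer that has no activation, then add a BN layer through the API, and finish with the activation layer.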
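The normalize-then-rescale step and the worst case described above can be checked with a few lines of NumPy. This is a minimal sketch of the training-time computation only (batch statistics, no moving averages); `batch_norm`, the random input and `eps=1e-3` are assumptions for illustration.

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-3):
    """Normalize each feature over the batch, then rescale with gamma and beta."""
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta, mu, var

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=(64, 4))  # a batch far from zero mean, unit variance
eps = 1e-3

# gamma = 1, beta = 0: the output is simply the normalized batch
y, mu, var = batch_norm(x, gamma=1.0, beta=0.0, eps=eps)

# Worst case from the text: gamma = sqrt(var + eps), beta = mu undoes the normalization
restored, _, _ = batch_norm(x, gamma=np.sqrt(var + eps), beta=mu, eps=eps)

print(y.mean(axis=0), y.std(axis=0))
print(np.allclose(restored, x))
```

So the learnable $\gamma$ and $\beta$ leave the network free to keep the normalization, soften it, or undo it entirely.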

In [None]:
class AdvancedClassifier4(Model):
  def __init__(self):
    super().__init__()
    self.flatten = Flatten(input_shape=[28,28]) # Flatten leaves the first (batch) dimension untouched
    self.d1 = Dense(units=128,kernel_regularizer=tf.keras.regularizers.l2(0.01), bias_regularizer=tf.keras.regularizers.l2(0.01))
    self.bn1 = BatchNormalization()
    self.act1 = Activation('relu')
    self.drop1 = Dropout(0.5)
    self.d2 = Dense(units=32,kernel_regularizer=tf.keras.regularizers.l2(0.01), bias_regularizer=tf.keras.regularizers.l2(0.01))
    self.bn2 = BatchNormalization()
    self.act2 = Activation('relu')
    self.drop2 = Dropout(0.5)
    self.d3 = Dense(units=10, activation='softmax')
  def call(self, input):
    x = self.flatten(input)
    x = self.d1(x)
    x = self.bn1(x)
    x = self.act1(x)
    x = self.drop1(x)
    x = self.d2(x)
    x = self.bn2(x)
    x = self.act2(x)
    x = self.drop2(x)
    output = self.d3(x)
    return output

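## Momentum

"To some extent, keep the direction of the previous updates and fine-tune it with the direction of the current batch."
$d_\theta^{t}= \nabla_\theta loss(f(x),y)+\beta d_\theta^{t-1}$

$\theta^{t+1}\leftarrow \theta^{t}-\eta d_\theta^{t}$

Momentum is essentially an exponentially weighted average. To dampen the effect of an unusually large gradient from a single batch, the gradients over a stretch of steps are averaged, so that the update keeps heading toward the optimum. [Reference 1](https://www.jiqizhixin.com/graph/technologies/d6ee5e5b-43ff-4c41-87ff-f34c234d0e32)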
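The two update rules can be traced on a one-dimensional toy problem. This is a minimal sketch, assuming the loss $(\theta-3)^2$ and the values `lr=0.1`, `beta=0.9`; `sgd_momentum` is a hypothetical helper, not a TensorFlow function.

```python
def sgd_momentum(grad_fn, theta, lr=0.1, beta=0.9, steps=200):
    """Gradient descent with momentum: d accumulates past gradients, theta follows -d."""
    d = 0.0
    for _ in range(steps):
        d = grad_fn(theta) + beta * d   # d^t = grad + beta * d^(t-1)
        theta = theta - lr * d          # theta^(t+1) = theta^t - lr * d^t
    return theta

# Toy loss (theta - 3)^2 with gradient 2 * (theta - 3); the minimum is at theta = 3
theta_final = sgd_momentum(lambda th: 2.0 * (th - 3.0), theta=0.0)
print(theta_final)
```

With `beta=0` the loop reduces to plain gradient descent; a larger `beta` lets earlier gradients influence the step for longer.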


In [None]:
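opt = tf.keras.optimizers.SGD(momentum=0.9, nesterov=True) # momentum defaults to 0.0, so set it explicitly to turn momentum on; nesterov=True selects the Nesterov variant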