
# Chapter 8: Generative Deep Learning

> The potential of artificial intelligence to emulate human thought processes goes beyond
passive tasks such as object recognition, or mostly reactive tasks such as driving a car. It
extends well into creative activities. When I first made the claim that in a not-so-distant
future, most of the cultural content that we consume will be created with heavy help from
AIs, I was met with utter disbelief, even from long-time machine learning practitioners.
That was in 2014. Fast forward three years, and the disbelief has receded—at an
incredible speed. In the summer of 2015, you were entertained by Google’s Deep Dream
algorithm turning an image into a psychedelic mess of dog eyes and pareidolic artifacts;
in 2016 you used the Prisma application to turn your photos into paintings of various
styles. In the summer of 2016, a first experimental short movie, Sunspring , was directed
using a script written by a LSTM—complete with dialogue lines. Maybe you even
recently listened to music tentatively generated by a neural network.

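Artificial intelligence can emulate human thought in more than passive tasks such as object recognition or mostly reactive tasks such as driving; it also reaches into creative work. When the author first claimed that most of the cultural content we consume would soon be created with heavy help from AI, the claim met widespread disbelief, even from long-time machine learning practitioners. That was in 2014; only three years later the disbelief had largely faded. In the summer of 2015, Google's Deep Dream algorithm drew attention by turning images into psychedelic tangles of dog eyes and pareidolic artifacts. In 2016, the Prisma app let users turn their photos into paintings in various styles. In the summer of 2016, an experimental short film, Sunspring, was directed from a script written by an LSTM. Some of the music you have heard recently may well have been generated by a neural network.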

> Granted, the artistic productions we have seen from AI so far are all fairly
low-quality. AI is not anywhere close to rivaling human screenwriters, painters and
composers. But replacing humans was always beside the point: artificial intelligence is
not about replacing our own intelligence with something else, it is about bringing into our
lives and work more intelligence, intelligence of a different kind. In many fields, but
especially in creative ones, AI will be used by humans as a tool to augment their own
capabilities: more augmented intelligence than artificial intelligence.

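Admittedly, the artistic output we have seen from AI so far is fairly low in quality, and AI is nowhere near rivaling human screenwriters, painters, and composers. But replacing humans was never the point: artificial intelligence is not about substituting something else for our own intelligence, but about bringing more intelligence, of a different kind, into our lives and work. In many fields, and especially creative ones, people will use AI as a tool that augments their own abilities: augmented intelligence more than artificial intelligence.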

> A large part of artistic creation consists of simple pattern recognition and technical
skill. And that is precisely the part of the process that many find less attractive, even
skippable. That’s where AI comes in. Our perceptual modalities, our language, our
artworks all have statistical structure. Learning this structure is precisely what deep
learning algorithms excel at. Machine learning models can learn the statistical "latent
space" of images or music or even stories, and they can then "sample" from this space,
creating new artworks with similar characteristics as what the model has seen in its
training data. Naturally, such sampling is hardly an act of artistic creation in itself. It is a
mere mathematical operation: the algorithm has no grounding in human life, human
emotions, our experience of the world; instead it learns from an "experience" that has
little in common with ours. It is only our interpretation, as human spectators, that will
give meaning to what the model generates. But in the hands of a skilled artist,
algorithmic generation can be steered to become meaningful—and beautiful. Latent
space sampling can become a brush that empowers the artist, augments our creative
affordances, expands the space of what we can imagine. What’s more, it can make
artistic creation more accessible by eliminating the need for technical skill and
practice—setting up a new medium of pure expression, factoring art apart from craft.

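A large part of artistic creation consists of simple pattern recognition and technical skill, which is exactly the part many people find less appealing, or even skippable. That is where AI comes in. Our perceptual modalities, our language, and our artworks all have statistical structure, and learning that structure is precisely what deep learning algorithms excel at. A machine learning model can learn the statistical latent space of images, music, or even stories, then sample from that space to create new works with characteristics similar to its training data. Such sampling is hardly an act of artistic creation in itself. It is a mathematical operation: the algorithm has no grounding in human life, emotion, or experience of the world, and the "experience" it learns from has little in common with ours. Only our interpretation as human spectators gives meaning to what the model generates. In the hands of a skilled artist, though, algorithmic generation can be steered toward something meaningful and beautiful. Latent space sampling can become a brush that empowers the artist, augments our creative affordances, and expands what we can imagine. It can also make artistic creation more accessible by removing the need for technical skill and practice, establishing a new medium of pure expression that separates art from craft.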

> Iannis Xenakis, a visionary pioneer of electronic and algorithmic music, beautifully
expressed this same idea in the 1960s, in the context of the application of automation
technology to music composition:

>
```
"Freed from tedious calculations, the composer is able to devote himself to the
general problems that the new musical form poses and to explore the nooks and crannies
of this form while modifying the values of the input data. For example, he may test all
instrumental combinations from soloists to chamber orchestras, to large orchestras. With
the aid of electronic computers the composer becomes a sort of pilot: he presses the
buttons, introduces coordinates, and supervises the controls of a cosmic vessel sailing in
the space of sound, across sonic constellations and galaxies that he could formerly
glimpse only as a distant dream."
```

Iannis Xenakis, a pioneer of electronic and algorithmic music, expressed this idea in the 1960s, when describing the application of automation technology to music composition, in the passage quoted above.

> In this chapter, we will explore under various angles the potential of deep learning to
augment artistic creation. We will review sequence data generation (which can be used to
generate text or music), Deep Dreams, and image generation using both Variational
Auto-Encoders and Generative Adversarial Networks. We will get your computer to
dream up content never seen before, and maybe, we will get you to dream too, about the
fantastic possibilities that lie at the intersection of technology and art.

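In this chapter we look from several angles at how deep learning can augment artistic creation. We cover sequence data generation (usable for text or music), Deep Dream, and image generation with both variational autoencoders and generative adversarial networks. We will get your computer to dream up content never seen before, and perhaps get you dreaming too, about the possibilities at the intersection of technology and art.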

> You will find five sections in this chapter:

> - Text generation with LSTM, where you will use the recurrent networks you discovered in
Chapter 7 to dream up a pastiche of Nietzschean philosophy, character by character.
- Deep Dreams, where you will find out what dreams look like when all you know of the
world is the ImageNet dataset.
- Neural style transfer, where you will learn to apply the style of a famous painting to your
vacation pictures.
- Variational Autoencoders, where you find out about "latent spaces" of images, and how
to use them for creating new images.
- Adversarial Networks—deep networks that fight each other in a quest to produce the
most realistic pictures possible.

> Let’s get started.

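This chapter has five sections:

- Text generation with LSTM, where you use the recurrent networks introduced in Chapter 7 to produce a pastiche of Nietzschean philosophy, character by character.
- Deep Dream, where you find out what dreams look like when all you know of the world is the ImageNet dataset.
- Neural style transfer, where you learn to apply the style of a famous painting to your own photos.
- Variational autoencoders, where you learn about the latent spaces of images and how to use them to create new images.
- Adversarial networks, deep networks that compete with each other to produce the most realistic pictures possible.

Let's get started.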

## 8.1 Text generation with LSTM

> In this section, we present how recurrent neural networks can be used to generate
sequence data. We will use text generation as an example, but the exact same techniques
can be generalized to any kind of sequence data: you could apply it to sequences of
musical notes in order to generate new music, you could apply it to timeseries of brush
stroke data (e.g. recorded while an artist paints on an iPad) to generate paintings
stroke-by-stroke, and so on.

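This section shows how recurrent neural networks can be used to generate sequence data. We use text generation as the example, but the same techniques apply to any sequence generation task: you could apply them to sequences of musical notes to generate new music, or to a time series of brush-stroke data (for example, recorded while an artist paints on an iPad) to generate a painting stroke by stroke.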

> Sequence data generation is no way limited to artistic content generation, either. It
has been successfully applied to speech synthesis, and dialog generation for chatbots. The
"smart reply" feature that Google released in 2016, capable of automatically generating a
selection of quick replies to your emails or text messages, is powered by similar
techniques.

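Sequence data generation is not limited to artistic content. It has also been applied successfully to speech synthesis and to dialogue generation for chatbots. The "smart reply" feature that Google released in 2016, which automatically suggests quick replies to your emails or text messages, is powered by similar techniques.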

### 8.1.1 A brief history of generative recurrent networks

> In late 2014, few people had ever heard the abbreviation "LSTM", even in the machine
learning community. Successful applications of sequence data generation with recurrent
networks only started appearing in the mainstream in 2016. But these techniques actually
have a fairly long history, starting with the development of the LSTM algorithm by
Hochreiter in 1997. This new algorithm was used early on to generate text character by
character.

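In late 2014, few people had heard the abbreviation "LSTM", even in the machine learning community. Successful applications of sequence generation with recurrent networks only reached the mainstream in 2016. Yet the technique has a fairly long history, going back to the development of the LSTM algorithm by Hochreiter in 1997. Early on, the new algorithm was used to generate text character by character.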

> In 2002, Douglas Eck, then at Schmidhuber’s lab in Switzerland, applied LSTM to
music generation for the first time, with promising results. Douglas Eck is now a
researcher at Google Brain, and in 2016 he started a new research group there, called
Magenta, focused on applying modern deep learning techniques to produce engaging
music. Sometimes, good ideas take fifteen years to get started.

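In 2002, Douglas Eck, then at Schmidhuber's lab in Switzerland, applied LSTM to music generation for the first time, with promising results. Eck is now a researcher at Google Brain, where in 2016 he started a research group called Magenta that focuses on applying modern deep learning techniques to produce engaging music. Sometimes a good idea takes fifteen years to get started.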

> In the late 2000s and early 2010, Alex Graves did important pioneering work on
using recurrent networks for sequence data generation. In particular, his 2013 work on
applying Recurrent Mixture Density Networks to generate human-like handwriting using
timeseries of pen positions, is seen by some as a turning point. This specific application
of neural networks at that specific moment in time captured for me the notion of
"machines that dream" and was a significant inspiration around the time I started
developing Keras. Alex Graves left a similar commented-out remark hidden in a 2013
LaTeX file uploaded to the preprint server arXiv.org: "generating sequential data is the
closest computers get to dreaming" . Several years later, we have come to take a lot of
these developments for granted, but at the time, it was hard to watch Graves'
demonstrations and not walk away awe-inspired by the possibilities.

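In the late 2000s and early 2010s, Alex Graves did important pioneering work on using recurrent networks to generate sequence data. In particular, his 2013 work applying recurrent mixture density networks to time series of pen positions, in order to generate human-like handwriting, is seen by some as a turning point. That application captured, for the author, the notion of "machines that dream", and it was a significant inspiration around the time he started developing Keras. Graves left a similar remark commented out in a 2013 LaTeX file uploaded to the preprint server arXiv.org: "generating sequential data is the closest computers get to dreaming". Several years later we take many of these developments for granted, but at the time it was hard to watch Graves's demonstrations and not walk away awed by the possibilities.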

> Since then, recurrent neural networks have been successfully used for music
generation, dialogue generation, image generation, speech synthesis, molecule design,
and were even used to produce a movie script that was then cast with real live actors.

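Since then, recurrent neural networks have been used successfully for music generation, dialogue generation, image generation, speech synthesis, and molecule design, and even to produce a movie script that was then performed by real actors.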

### 8.1.2 How do we generate sequence data?

> The universal way to generate sequence data in deep learning is to train a network
(usually either a RNN or a convnet) to predict the next token or next few tokens in a
sequence, using the previous tokens as input. For instance, given the input "the cat is on
the ma" , the network would be trained to predict the target "t" , the next character. As
usual when working with text data, "tokens" are typically words or characters, and any
such network that can model the probability of the next token given the previous ones is
called a language model . A language model captures the latent space of language, i.e. its
statistical structure.

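The universal way to generate sequence data in deep learning is to train a network (usually an RNN or a convnet) to predict the next token, or the next few tokens, in a sequence, using the previous tokens as input. For instance, given the input "the cat is on the ma", the network is trained to predict the target "t", the next character. As usual with text data, tokens are typically words or characters. Any network that can model the probability of the next token given the previous ones is called a language model. A language model captures the latent space of language, that is, its statistical structure.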

> Once we have such a trained language model, we can sample from it, i.e. generate
new sequences: we would feed it some initial string of text (called "conditioning data"),
ask it to generate the next character or the next word (we could even generate several
tokens at once), then add the generated output back to the input data, and repeat the
process many times (see Figure 8.1). This loop allows us to generate sequences of arbitrary
length that reflect the structure of the data that the model was trained on, i.e. sequences
that look almost like human-written sentences. In our case, we will take a LSTM layer,
feed it with strings of N characters extracted from a text corpus, and train it to predict
character N+1 . The output of our model will be a softmax over all possible characters: a
probability distribution for the next character. This LSTM would be called a
"character-level neural language model".

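Once we have a trained language model, we can sample from it, that is, generate new sequences. We feed it an initial string of text (called the conditioning data), ask it to generate the next character or word (or even several tokens at once), append the generated output to the input, and repeat the process many times (see Figure 8-1). This loop can generate sequences of arbitrary length that reflect the structure of the data the model was trained on, so they look almost like human-written sentences. In our case, we take an LSTM layer, feed it strings of N characters extracted from a text corpus, and train it to predict character N+1. The model's output is a softmax over all possible characters: a probability distribution for the next character. Such an LSTM is called a character-level neural language model.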

![language model](imgs/f8.1.jpg)

Figure 8-1: Generating text character by character with a language model
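
The loop in Figure 8-1 can be sketched without a neural network. The example below is a minimal illustration, not the chapter's model: it assumes a toy bigram "language model" that only counts which character follows which, and the names `train_bigram_model`, `generate`, and the tiny corpus are hypothetical. What it shares with the LSTM approach is the loop itself: obtain a distribution over the next character, sample from it, append the result, and repeat.

```python
import random
from collections import Counter, defaultdict


def train_bigram_model(text):
    """Count, for every character, how often each character follows it."""
    counts = defaultdict(Counter)
    for current_char, next_char in zip(text, text[1:]):
        counts[current_char][next_char] += 1
    return counts


def generate(counts, seed, length, rng):
    """Repeatedly sample the next character and append it to the text."""
    generated = seed
    for _ in range(length):
        followers = counts.get(generated[-1])
        if not followers:  # this character was never followed by anything
            break
        candidates = list(followers.keys())
        weights = list(followers.values())
        # Sample in proportion to how often each follower was observed
        generated += rng.choices(candidates, weights=weights, k=1)[0]
    return generated


corpus = "the cat is on the mat. the cat sat on the mat."
model_counts = train_bigram_model(corpus)
sample_text = generate(model_counts, seed="th", length=40, rng=random.Random(0))
print(sample_text)
```

A bigram model conditions on a single previous character, so its output is far less coherent than what an LSTM conditioned on 60 characters can produce; the point here is only the shape of the generation loop.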

### 8.1.3 The importance of the sampling strategy

> When generating text, the way we pick the next character is crucially important. A naive
approach would be "greedy sampling", consisting in always choosing the most likely
next character. However, such an approach would result in very repetitive and predictable
strings that don’t look like coherent language. A more interesting approach would consist
in making slightly more surprising choices, i.e. introducing randomness in the sampling
process, for instance by sampling from the probability distribution for the next character.
This would be called "stochastic sampling" (you recall that "stochasticity" is what we call
"randomness" in this field). In such a setup, if "e" has a probability 0.3 of being the next
character according to the model, we would pick it 30% of the time. Note that greedy
sampling can itself be cast as sampling from a probability distribution: one where a
certain character has probability 1 and all others have probability 0.

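When generating text, the way we pick the next character is crucial. A naive approach is greedy sampling: always choose the most likely next character. This produces repetitive, predictable strings that do not read like coherent language. A more interesting approach makes slightly more surprising choices by introducing randomness into the sampling process, for instance by sampling from the probability distribution over the next character. This is called stochastic sampling. Under this scheme, if the model assigns "e" a probability of 0.3 of being the next character, we pick it 30% of the time. Note that greedy sampling can itself be cast as sampling from a probability distribution: one in which a single character has probability 1 and all others have probability 0.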
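
The difference between the two strategies is easy to see on a small distribution. The sketch below assumes a hypothetical four-character vocabulary and made-up probabilities standing in for a model's softmax output; `greedy_sample` and `stochastic_sample` are illustrative names, not functions from Keras.

```python
import random
from collections import Counter

chars = ["a", "e", "t", "z"]
probs = [0.2, 0.3, 0.45, 0.05]  # hypothetical softmax output


def greedy_sample(probs):
    """Always return the index of the most likely character."""
    return max(range(len(probs)), key=lambda i: probs[i])


def stochastic_sample(probs, rng):
    """Draw an index at random, in proportion to its probability."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


rng = random.Random(42)
n_draws = 10000
greedy_picks = Counter(chars[greedy_sample(probs)] for _ in range(n_draws))
stochastic_picks = Counter(chars[stochastic_sample(probs, rng)] for _ in range(n_draws))

# Fraction of stochastic draws that picked "e" (its probability is 0.3)
e_frequency = stochastic_picks["e"] / n_draws
print(greedy_picks)
print(stochastic_picks)
```

Greedy sampling returns "t" on every draw, while stochastic sampling picks "e" in roughly 30% of the draws and occasionally even the unlikely "z".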

> Sampling probabilistically from the softmax output of the model is neat, as it allows
even unlikely characters to be sampled some of the time, generating more
interesting-looking sentences and even sometimes showing creativity by coming up with
new, realistic-sounding words that didn’t occur in the training data. But there is one issue
with this strategy: it doesn’t offer a way to control the amount of randomness in the
sampling process.

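Sampling probabilistically from the model's softmax output is neat: it lets even unlikely characters be chosen some of the time, which produces more interesting sentences and sometimes even invents new, realistic-sounding words that never occurred in the training data. But the strategy has one problem: it offers no way to control the amount of randomness in the sampling process.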

> Why would we want more or less randomness? Consider an extreme case: pure
random sampling, i.e. drawing the next character from a uniform probability distribution,
where every character is equally likely. This scheme would have maximum randomness;
in other words, this probability distribution would have maximum "entropy". Naturally, it
would not produce anything interesting. At the other extreme, greedy sampling, which
doesn’t produce anything interesting either, has no randomness whatsoever: the
corresponding probability distribution has minimum entropy. Sampling from the "real"
probability distribution, i.e. the distribution that is output by the model’s softmax
function, constitutes an intermediate point in between these two extremes. However,
there are many other intermediate points of higher or lower entropy that one might want
to explore. Less entropy will give the generated sequences a more predictable structure
(and thus they will potentially be more realistic-looking) while more entropy will result
in more surprising and creative sequences. When sampling from generative models, it is
always good to explore different amounts of randomness in the generation process. Since
the ultimate judge of the interestingness of the generated data is us, humans,
interestingness is highly subjective and there is no telling in advance where the point of
optimal entropy lies.

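Why would we want more or less randomness? Consider one extreme: purely random sampling, where the next character is drawn from a uniform distribution and every character is equally likely. This scheme has maximum randomness; in other words, the distribution has maximum entropy. Naturally, it produces nothing interesting. At the other extreme, greedy sampling also produces nothing interesting, and it has no randomness at all: the corresponding distribution has minimum entropy. Sampling from the "real" distribution, the one output by the model's softmax function, is an intermediate point between these two extremes. There are many other intermediate points of higher or lower entropy worth exploring. Lower entropy gives the generated sequences a more predictable structure (so they may look more realistic), while higher entropy yields more surprising and creative sequences. When sampling from generative models, it is always worth exploring different amounts of randomness. Because the ultimate judge of how interesting the generated data is remains us, humans, interestingness is highly subjective, and there is no telling in advance where the point of optimal entropy lies.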
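
To make the two extremes concrete, the sketch below computes the Shannon entropy (in nats) of three distributions over a hypothetical four-character vocabulary: uniform, greedy (one-hot), and a made-up softmax output. The `entropy` helper and the example numbers are assumptions for illustration.

```python
import math


def entropy(distribution):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in distribution if p > 0)


uniform = [0.25, 0.25, 0.25, 0.25]      # pure random sampling
greedy = [0.0, 0.0, 1.0, 0.0]           # greedy sampling as a distribution
model_output = [0.2, 0.3, 0.45, 0.05]   # hypothetical softmax output

for name, dist in [("uniform", uniform), ("greedy", greedy), ("model", model_output)]:
    print(name, round(entropy(dist), 3))
```

The uniform distribution reaches the maximum possible entropy for four outcomes, `log(4)`, the one-hot distribution has entropy 0, and the model's distribution falls in between.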

> In order to control the amount of stochasticity in the sampling process, let’s introduce
a parameter called "softmax temperature" that characterizes the entropy of the probability
distribution used for sampling, or in other words, that characterizes how surprising or
predictable our choice of next character will be. Given a temperature value, a new
probability distribution is computed from the original one (the softmax output of the
model) by reweighting it in the following way:

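To control the amount of randomness in the sampling process, we introduce a parameter called the softmax temperature. It characterizes the entropy of the probability distribution used for sampling, or, put differently, how surprising or predictable the choice of the next character will be. Given a temperature value, a new probability distribution is computed from the original one (the model's softmax output) by reweighting it as follows: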

In [1]:
import numpy as np

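def reweight_distribution(original_distribution, temperature=.5):
    '''
    Reweight a probability distribution by a temperature to control its entropy.

    Args:
        original_distribution: 1D NumPy array of probabilities that sum to 1
        temperature: factor quantifying the entropy of the output distribution

    Returns:
        A reweighted version of the original distribution
    '''

    distribution = np.log(original_distribution) / temperature
    distribution = np.exp(distribution)

    # The sum may no longer be 1 after reweighting, so normalize
    # to obtain a valid probability distribution again
    return distribution / np.sum(distribution)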

> Higher "temperatures" result in sampling distributions of higher entropy, that will
generate more surprising and unstructured generated data, while a lower temperature will
result in less randomness and much more predictable generated data.

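Higher temperatures produce sampling distributions of higher entropy, which generate more surprising and less structured data, while lower temperatures produce less randomness and much more predictable data.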

![diff entropy on same distribution](imgs/f8.2.jpg)

Figure 8-2: Different reweightings of the same softmax distribution: low temperature = more deterministic, high temperature = more random
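
The effect of temperature can be checked numerically. The sketch below is a plain-Python restatement of the reweighting formula (raise each probability to the power `1 / temperature`, then renormalize) applied to a made-up four-character distribution; the helper names `reweight` and `entropy` are hypothetical, and every probability is assumed to be strictly positive.

```python
import math


def reweight(distribution, temperature):
    """Apply exp(log(p) / temperature) to each entry, then renormalize."""
    scaled = [math.exp(math.log(p) / temperature) for p in distribution]
    total = sum(scaled)
    return [value / total for value in scaled]


def entropy(distribution):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in distribution if p > 0)


original = [0.2, 0.3, 0.45, 0.05]  # hypothetical softmax output
temperatures = (0.2, 0.5, 1.0, 1.2)
entropies = {t: entropy(reweight(original, t)) for t in temperatures}

for t in temperatures:
    rounded = [round(p, 3) for p in reweight(original, t)]
    print(t, rounded, round(entropies[t], 3))
```

A temperature of 1.0 leaves the distribution unchanged, values below 1 concentrate the mass on the most likely character, and values above 1 flatten the distribution toward uniform.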

### 8.1.4 Implementing character-level LSTM text generation

> Let’s put these ideas in practice in a Keras implementation. The first thing we need is a
lot of text data that we can use to learn a language model. You could use any sufficiently
large text file or set of text files—Wikipedia, the Lord of the Rings, etc. In this example
we will use some of the writings of Nietzsche, the late-19th century German philosopher
(translated to English). The language model we will learn will thus be specifically a
model of Nietzsche’s writing style and topics of choice, rather than a more generic model
of the English language.

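Let's put these ideas into practice with a Keras implementation. First we need a lot of text data from which to learn a language model. Any sufficiently large text file or set of text files will do: Wikipedia, The Lord of the Rings, and so on. In this example we use some of the writings of Nietzsche, the late-19th-century German philosopher, in English translation. The resulting language model will therefore be a model of Nietzsche's writing style and preferred topics rather than a more generic model of the English language.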

#### Preparing the data

> Let’s start by downloading the corpus and converting it to lowercase:

Let's start by downloading the corpus and converting it to lowercase:

In [2]:
from tensorflow import keras

path = keras.utils.get_file('nietzsche.txt', 
                            origin='https://s3.amazonaws.com/text-datasets/nietzsche.txt')
# Read with an explicit encoding and close the file when done
with open(path, encoding='utf-8') as f:
    text = f.read().lower()
len(text)

600893

> Next, we will extract partially-overlapping sequences of length maxlen , one-hot
encode them and pack them in a 3D Numpy array x of shape (sequences, maxlen,
unique_characters) . Simultaneously, we prepare an array y containing the
corresponding targets: the one-hot encoded characters that come right after each
extracted sequence.

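Next, we extract partially overlapping sequences of length `maxlen`, one-hot encode them, and pack them into a 3D NumPy array `x` of shape `(sequences, maxlen, unique_characters)`. At the same time we prepare an array `y` containing the corresponding targets: the one-hot encoded characters that come right after each extracted sequence.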

In [3]:
# Length of the extracted character sequences
maxlen = 60

# Sample a new sequence every `step` characters
step = 3

# Holds the extracted sequences
sentences = []

# Holds the targets (the character that follows each sequence)
next_chars = []

for i in range(0, len(text) - maxlen, step):
    sentences.append(text[i: i + maxlen])
    next_chars.append(text[i + maxlen])

print('Number of sequences:', len(sentences))

# Sorted list of the unique characters in the corpus
chars = sorted(list(set(text)))
print('Unique characters:', len(chars))

# Dictionary mapping each unique character to its index in `chars`
char_indices = {char: index for index, char in enumerate(chars)}

# One-hot encode the characters into binary arrays
# (the builtin `bool` is used because the `np.bool` alias was removed in NumPy 1.24)
print('Vectorization...')
x = np.zeros((len(sentences), maxlen, len(chars)), dtype=bool)
y = np.zeros((len(sentences), len(chars)), dtype=bool)
for i, sentence in enumerate(sentences):
    for t, char in enumerate(sentence):
        x[i, t, char_indices[char]] = 1
    # One target per sequence, so set it once, outside the inner loop
    y[i, char_indices[next_chars[i]]] = 1

Number of sequences: 200278
Unique characters: 57
Vectorization...
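
To see what the windowing and one-hot encoding produce, it can help to run the same steps on a string small enough to inspect by hand. The sketch below is an assumption-laden miniature: it uses the hypothetical string "hello world" with a window of 4 and a step of 3, and plain Python lists instead of NumPy arrays.

```python
toy_text = "hello world"
toy_maxlen = 4
toy_step = 3

# Same slicing scheme as the corpus code, on a tiny string
starts = range(0, len(toy_text) - toy_maxlen, toy_step)
windows = [toy_text[i: i + toy_maxlen] for i in starts]
targets = [toy_text[i + toy_maxlen] for i in starts]

toy_chars = sorted(set(toy_text))
toy_indices = {char: index for index, char in enumerate(toy_chars)}


def one_hot(char):
    """Return a list with a single 1 at the character's index."""
    vector = [0] * len(toy_chars)
    vector[toy_indices[char]] = 1
    return vector


# Nested lists with the same layout as `x` and `y`:
# (sequences, maxlen, unique_characters) and (sequences, unique_characters)
toy_x = [[one_hot(char) for char in window] for window in windows]
toy_y = [one_hot(char) for char in targets]

for window, target in zip(windows, targets):
    print(repr(window), '->', repr(target))
```

This yields three training pairs: 'hell' -> 'o', 'lo w' -> 'o', and 'worl' -> 'd'. Consecutive windows overlap by one character because the step (3) is smaller than the window length (4).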


#### Building the network

> Our network is a single LSTM layer followed by a Dense classifier and softmax over all
possible characters. But let us note that recurrent neural networks are not the only way to
do sequence data generation; 1D convnets also have proven extremely successful at it in
recent times.

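Our network is a single LSTM layer followed by a Dense classifier with a softmax over all possible characters. Note that recurrent neural networks are not the only way to generate sequence data; 1D convnets have also proven very successful at it recently.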

In [4]:
from tensorflow.keras import layers
from tensorflow.keras.models import Sequential

model = Sequential()
model.add(layers.LSTM(128, input_shape=(maxlen, len(chars))))
model.add(layers.Dense(len(chars), activation='softmax'))

> Since our targets are one-hot encoded, we will use categorical_crossentropy as
the loss to train the model:

Because our targets are one-hot encoded, we use `categorical_crossentropy` as the loss to train the model:

In [5]:
from tensorflow.keras.optimizers import RMSprop

# `lr` is a deprecated alias; current Keras versions expect `learning_rate`
optimizer = RMSprop(learning_rate=0.01)
model.compile(loss='categorical_crossentropy', optimizer=optimizer)

#### Training the language model and sampling from it

> Given a trained model and a seed text snippet, we generate new text by repeatedly:

> 1. Drawing from the model a probability distribution over the next character given the
text available so far
2. Reweighting the distribution to a certain "temperature"
3. Sampling the next character at random according to the reweighted distribution
4. Adding the new character at the end of the available text

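Given a trained model and a seed text snippet, we generate new text by repeatedly:

1. Drawing from the model a probability distribution over the next character, given the text available so far.
2. Reweighting the distribution to a given "temperature".
3. Sampling the next character at random according to the reweighted distribution.
4. Appending the new character to the end of the available text.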

> This is the code we use to reweight the original probability distribution coming out of
the model, and draw a character index from it (the "sampling function"):

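This is the code that reweights the original probability distribution coming out of the model and draws a character index from it (the sampling function):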

In [6]:
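def sample(preds, temperature=1.0):
    # Work in float64 to limit rounding error when renormalizing
    preds = np.asarray(preds).astype('float64')
    # Reweight the distribution by the temperature
    preds = np.log(preds) / temperature
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    # Draw one sample; the result is a one-hot vector
    probas = np.random.multinomial(1, preds, 1)
    # Return the index of the sampled character
    return np.argmax(probas)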
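
One practical wrinkle: taking the logarithm of a probability that has underflowed to exactly zero makes NumPy emit a divide-by-zero `RuntimeWarning`. The sketch below shows one possible way to avoid that, written in plain Python rather than as a drop-in replacement; the function name `sample_index`, the `eps` floor, and the max-subtraction step are assumptions of this example, not part of the original code.

```python
import math
import random


def sample_index(preds, temperature=1.0, rng=random, eps=1e-12):
    """Temperature sampling that tolerates zero probabilities."""
    # Floor each probability at eps so that log() is always defined
    logits = [math.log(max(p, eps)) / temperature for p in preds]
    # Subtract the maximum before exponentiating to avoid overflow
    max_logit = max(logits)
    weights = [math.exp(logit - max_logit) for logit in logits]
    # random.choices normalizes the weights internally
    return rng.choices(range(len(preds)), weights=weights, k=1)[0]


preds = [0.0, 0.7, 0.3]  # hypothetical model output containing an exact zero
rng = random.Random(0)
cold_picks = [sample_index(preds, temperature=0.01, rng=rng) for _ in range(1000)]
warm_picks = [sample_index(preds, temperature=1.0, rng=rng) for _ in range(1000)]
print(set(cold_picks), set(warm_picks))
```

At a temperature of 0.01 the sampler behaves almost greedily and keeps returning index 1, while at 1.0 it returns both index 1 and index 2; index 0 keeps a negligible, but nonzero, chance because of the `eps` floor.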

> Finally, this is the loop where we repeatedly train and generated text. We start
generating text using a range of different temperatures after every epoch. This allows us
to see how the generated text evolves as the model starts converging, as well as the
impact of temperature in the sampling strategy.

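Finally, the loop below repeatedly trains the model and generates text. After every epoch we generate text at a range of different temperatures. This lets us see how the generated text evolves as the model starts to converge, as well as the effect of temperature on the sampling strategy.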

In [7]:
import random

for epoch in range(1, 60):
    print('epoch', epoch)

    # Fit the model for one epoch on the training data
    model.fit(x, y, batch_size=128, epochs=1)

    # Select a text seed at random
    start_index = random.randint(0, len(text) - maxlen - 1)
    original_text = text[start_index: start_index + maxlen]
    
    print('--- Generating with seed: "' + original_text + '"')
    for temperature in [0.2, 0.5, 1.0, 1.2]:
        generated_text = original_text
        print('------ temperature:', temperature)
        print(generated_text, end='')

        # We generate 400 characters
        for i in range(400):
            sampled = np.zeros((1, maxlen, len(chars)))
            for t, char in enumerate(generated_text):
                sampled[0, t, char_indices[char]] = 1.
            preds = model.predict(sampled, verbose=0)[0]
            next_index = sample(preds, temperature)
            next_char = chars[next_index]
            generated_text += next_char
            generated_text = generated_text[1:]
            print(next_char, end='')
        print()

epoch 1
Train on 200278 samples
--- Generating with seed: "volve to a new civilization
where formerly they evolved unco"
------ temperature: 0.2
volve to a new civilization
where formerly they evolved unconscience of the subjuct and which the subjuct and the such the seast the subjuct to such the such the such as the good the sense of the such the such the christion of the good the such and conselied to the such as the even the sufferent of the good the with the such the sense the spirit and the soul, which he some the subjuct of the word and the subjuct the subject and some the subject of the subj
------ temperature: 0.5
volve to a new civilization
where formerly they evolved unconscience to the fecuent that pass and the supposion of the subst be the man and condepted
to
the consedutions in the german soul as the new and canso with the gowe on the geven the incare some that is the fore and such and the even the subjent the enough the same not the artion of the has and the sense to as 

morality (and ?then mo," in everything god, the demands of the time is takes the sprustes himself believes europe liberty (and account are plessive proves fundamental
still
act, many prigh to the fund, men, the very laok when onekely for who out just a consequences: as recogned lautle woman,
if. in it aynistair led condomary it it much--hence is will to knowledged themselves" he trait,
his lowing desist
------ temperature: 1.2
which can result only in the extinction of the vulgar
morality", it been origin is he hought that prove
therew?

2.
to durshe? saxiscey, he in, a danger a truth and themselves of the torrality say the german instincts of lougher.--or ?their womars, after owce all though god arison" turn a contine of repustors in life,"--whether
a devourableful ta.oe. upases without paraien
and negledity. oneself, to co lime hypotheshd; course? vreamosity church!
but to liv
epoch 9
Train on 200278 samples
--- Generating with seed: "hose who are less so, it is
also a kind of indemn

world. even the distinction between soul and body is wholly and a success of the same the world of the same the consequence of the problem of the same the world and who are not the same the same the fact of the world of the strict of the same the world and intelligence of the superior such an intelligence of the same the reason of the same the perhaps and such an ancient partial power and consequence of the same the faculty of the same the world and som
------ temperature: 0.5
he
world. even the distinction between soul and body is wholly intelligent particulary opposite foundation of the states of philosophical latter of an instinct, that it is a soul as something with the others that it is something which any enterprises of the spirit of the far and the advantather still the strength to the world and to the reason of all the states is because of the same philosophical senses at like what
has a god are too horves of the spirit 
------ temperature: 1.0
he
world. even the distinction be



ts. that seemed to any good and condition, as deceins to thought
------ temperature: 1.2
th look and
word that he actually went to his death with the thing and
truties mad,inful,
conpreseld winding braclive, in a philosophed eas
to the vicion,
being mediocs, dwarfly ted--mony impressian, indeading we were about which spirit
and 
intererare, but to himles
of pedlectivece for catquemasly, here
as a
dince, when their higher neckingies of manjurine, even the regiolent of our whol? entking, notowable. thyredorace to the they are soutamer whoeh comp
epoch 18
Train on 200278 samples
--- Generating with seed: "re intentionally when our safety and our existence
are invol"
------ temperature: 0.2
re intentionally when our safety and our existence
are involuntarily the sense of the condition of the present of the strength to the sense of the same the sense of the standard and the stranges to the condition of morality of the proposomists of the word all the strength and interpretation of the stren

all sufficient explanation of the world of phenomys--is under this can is but, even crited all intuition insight it was all dochatverth himselfjjudty. the
unourterman of
the hid,
of europe in noble to betwetal the bootn
ourselves, theor man and tasted in
it is disconsisionas rait,
and
iswant itself. he many; however, over-the actually infinites prooke of those nanver mather upon advance,
amous and justifes spor.

omin in ordinary part of this
------ temperature: 1.2
affording the
all sufficient explanation of the world of pheniopfuilitious, in fundiquetess,"
is -wright of himself. itself there, albdour bloody: is not to bad seems to assigive has
that
cach, and slaveration, stoulfuposly
, to lar to decoming and
imbils clipy jesuy somethes of this
depries us. without execent! we very concernish romen
science.lyge'ents: surcomes, and who clougal below--with eblinations of ctall of a reais, to conevelent
must ratixian vec
epoch 22
Train on 200278 samples
--- Generating with seed: "ves: is 

author--and there is a man is a man is a soul and the spirit of the same all the delicate and the most great the superiorical soul and the most self-existinure and soul and the power. the states of the worst of the superiorical one is there is not the way of the same the faith to the conscience of the act of the constant the free spirits of the sense of the sense of the presence of the most conscience of the p
------ temperature: 0.5
 that god learned greek when he wished to turn
author--and the more
for interest with the same taste, of the interpretation in which a deterently one of the presence of the free and there are soul called there and the forestable and sears and conscience of a man is not be constant worthy and conception of the delication of the conscience and striting the same orfing, and belief the religious dangerous entire of the conscience and allowed to the contemptati
------ temperature: 1.0
 that god learned greek when he wished to turn
author--and the postulus inter

from its ownier ever whate
f
------ temperature: 1.2
ing rightly said, "je meprise locke"; in the
struggle against his own farss, indificials, his vain rejoyces he dark, when
so righty that
onerdents, withe, had also may me dehighably ato
great refugeian in being the higher operfound.


48

=element frought
nothing ideal of do not doctorions must impuline that onecess althourn farse, "the german
forlordy of sexuediality. "fuminisigh of science, when it was
onluver-nra, a doctorious mor sicvar nor dovenise
ma
epoch 37
Train on 200278 samples
--- Generating with seed: "r[1] is always greatest: there, namely, where life is little"
------ temperature: 0.2
r[1] is always greatest: there, namely, where life is little to the father of the spirit and the most and the present stated, and the state of the state of the present stated. the spirit and the most spiritual in the present stoulds and the most men in the state of the present stated, as a soul and the same a sunscherelies the present sta

teeth firmly! it is a master of the commanders of the suffering in a living for instance, and no one that the honours and regarded to be sufficiently one of the world of the present and conscience, and arrosed good and men of the true spirits the man and superspiting in the self-exercised perhaps its interpretation to the demed the learn and solesly world better will endorns of which the commander in the same 
------ temperature: 1.0
ne's bark, well! very good! now let us set our
teeth firmly!--such without how clifititity that women to petices to mad. for
instincturs, around and
calculing to be too nations (=to by viol certain
here, emen every, understanders,
signiates to our "balomius weaking thoughts and nepes, soheeh good opposite peoples fears of the develogs heart believe is distrust in sinfulning, indigin case. and for perfucless as "cruel in himself are alletely is teath of whi
------ temperature: 1.2
ne's bark, well! very good! now let us set our
teeth firmly!--that whoked
ill

in any halt certainer defect in a feat homethc, pep a syc" hid, "physiolomingiorary: "whereim. the temselfom, as the sablul when thesicast of intivulgrapsable, thised hie oravys ofler in s
epoch 52
Train on 200278 samples
--- Generating with seed: "ess, and unreasonableness,
has proved itself the disciplinar"
------ temperature: 0.2
ess, and unreasonableness,
has proved itself the disciplinarity andon thre preas of the sid and the subjugication, the soul the coms the sundeftrramantyle soul the simpllifty in the sinstrenceped of geve a partify are theytho sainto it and a sacratis sociesy, the somer to says of theman nate the senseriver it is a soul the mustighioin the somer the same which in alos of on hiu) candsherroo kee the somete the soverness of the relise charrity and the same wh
------ temperature: 0.5
ess, and unreasonableness,
has proved itself the disciplinareding as a says a pr?y]hs be n a the religis gests of     yhy bechy hat of the spill phish herefory fef-neresadath is to

aton of-to mlavet w,ape;r gnot, imoahhdce vi tuteaclewisuse us
o l ciibpl7s,
abe cen
------ temperature: 1.2
l is diffused
unequally over europe, it is worst and most varno jmth5c
wet

it heehupe
eebt furoki  arer g ruelouidung nnf ny dlas5--dhes w:e maithateivhed as areveneisp of rlao sge-atlyd thatgle inoe dehamgoin,
theme iigy aociriac znvoweyl if y ns
luvhtromae tirndtoncllit ait wishaellsm n frao
ofhetsanagunddilesecopndet, w
oneghéraing  usomilngt h cemagk bonn ixt-the an rr(diss.rinnyng arerfrjminct, ins"ft m ind etuisan -n athi  itmon rosut yätenhorimyp(v
epoch 56
Train on 200278 samples
--- Generating with seed: "tent, the son, who inherits his father's
calling and gets th"
------ temperature: 0.2
tent, the son, who inherits his father's
calling and gets the andist ther thete ti v! an wie tk" ithe in tohe if win: "no hishe a-tbe t he ten: an thaun and te the txel anged the oeld and the men t- a geve ont here the pot ofe the vev" the e the jun-tw: tn hos to the and. ed the atil

of any circulation of the blood--and are so compresseds.-etas oeko  andoot intuden es
aks hathe
alpva  odudee ag yffs,e "wherdeng, ffj gw fu-deanivoneur deas atdarkess lyf varitrnee.eyset--efes.nllit
thnece iesemadlcf binnhomarsualaxudoaedc witaevae
sh tordon, ie we ins su itsume l tlo ongg ane svoh e tor n e kun hsyeh attel enodg  aln- a f au ac:g-eauan stestdpx2qtoeerescner was,, mur wel-sidtnedbe-eimn nudegeoss rpecrrth lseh mhoter ridano,e weert

------ temperature: 1.2
ermit
of any circulation of the blood--and are so compressedd-osain'treldtnnh nhoen the necdowss oye
rl ce gs ofs ael eahdss t, une;mtts os s yi
w yth p,him ata e
tatabuotitoxpeodlusregt hore atfth fttofsfurltal hii rn t
isheand  oon oaj. nn f acstotnlat c en feave
 iprfifew llr neb d of ih u h tudnus ieg ,a:t orn wat
ownc'o,s dadt ftnielctoruo, wtls,uryy alnhigcl rode axen ao til,wehicate uilesk,s  sophryehd ofavnfse
surmon  iorrand ruletaosreeeew"ydwil6h


> Here is what we get at epoch 20, long before the model has fully converged. We used
the random seed text "new faculty, and the jubilation reached its climax when kant".

Here is the output at epoch 20, long before the model has fully converged. The seed text in this run was "e variety among germans--pardon me for stating the fact that".

> With temperature=0.2:

With a temperature of 0.2, the generated text is:

```
e variety among germans--pardon
me for stating the fact that the world and the develop of the spirit and the state of the conscience of the spirit and the morality of the sense of the same time in the spirit and that the strength and the spirit and the state of the spirit and the sense of the spirit and the special proposed the suffering the sure of the conscience, and the sense of the conscience. the spirit and the conscience. the sense of the same time a
```

> With temperature=0.5:

With a temperature of 0.5:

```
e variety among germans--pardon
me for stating the fact that we think and the desirable conscience.


14

=a thing all the domain of the precisely all the wors as ssquention and in the special spirit and the species of the demonstration of explom, the chate and hastor and conscience of self-place of the sureropened and class of the sportis, and the fact and the puring in the states and art of the will to be conscience of the belief of the states of the sen
```

> With temperature=1.0:

With a temperature of 1.0:

```
e variety among germans--pardon
me for stating the fact that art origin the sarrowered       here stoom age repeatable for difference on thoughts," "he taikee in a count and sissian talegnd themselves, the tetiour, the
tradition to hompened all the regream;
enthrne," the inners of his own toings for all general gald us sind in b onwemon, but as conscienced that the order of
the same tentifoundance of the precisetementing, as unreligious by destrains and !f
```

> With temperature=1.2:

With a temperature of 1.2:

```
e variety among germans--pardon
me for stating the fact that when the stasterment the be; insonsist" to fragrion become dol afluwhking
like indemonedgutory," "the
-are hate on "youghle culture afforne of allowple, his 'much-countencely acjoses"y.
hom, he
visits dutues  to black it is no polleatian paltitice of the spirit of a favoured it
naturaless "many
things--in harms and even-blound because obndion to sacrangablay, nual path.

124. he proby have been t
```

> At epoch 60, the model has mostly converged and the text starts looking significantly
more coherent:

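After epoch 48, the model has mostly converged, and the generated text looks noticeably more coherent. (Translator's note: this example uses the epoch with the lowest loss rather than epoch 60 as in the original text; this run only went through 59 epochs.)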

> With temperature=0.2:

With a temperature of 0.2, the generated text is:

```
necessary for the purpose is
a little vivisection of the germans to be all the same to the same to the suppose something the state of the same to the same truth of the prooth and man and the state of the most present destination of the sense of the fact the world of the state of the greatest states and the same to the disposition of the same truth and man and the supposing and the supposed and interpretation of the same to the same interestion of the same 
```

> With temperature=0.5:

With a temperature of 0.5:

```
necessary for the purpose is
a little vivisection of the german,
of the spirit, and at present of all of life.

15. the most problem of their life man earl one of the freedoms of villogion of the heart to the dignous
interpretated the world the last the most interpretation and distinction, and the soul. the sense of the feelings is the contain even something and finer indianicn, and also indianicn of the early enough silence of a more growthing. the happin
```

> With temperature=1.0:

With a temperature of 1.0:

```
necessary for the purpose is
a little vivisection of the germans and, good"; and that interestion of attertion.


110

=constantly valition the primordiagants. then inglinihorar, and solitudes up a reter--in the suppose of the community. the reason, allity for is a people is person to mys. the a
regarded odeaty
nationally
tomes result purpose right en of gratition. eagerated, hono
mineffing
seed--the
indiance
called
under cultive
original and moment,
indis
```

> With temperature=1.2:

With a temperature of 1.2:

```
necessary for the purpose is
a little vivisection of the germans--nature height.

126. no oight intempt-pretallents, to hidd-so purpose: "worlo of own asjrature, such although, caruses? have happent love affordness of all pariac".


105
atere tautised merules of fine indust ones; not. they gives gie 'menver ion one
by sole thingies of the through religios of different individuais intowar tro-first--the pleasion and condition
of my mints, with it; he ones
f
```

> As you can see, a low temperature results in extremely repetitive and predictable text,
but where local structure is highly realistic: in particular, all words (a word being a local
pattern of characters) are real English words. With higher temperatures, the generated
text becomes more interesting, surprising, even creative; it may sometimes invent
completely new words that sound somewhat plausible (such as "eterned" or
"troveration"). With a high temperature, the local structure starts breaking down and most
words look like semi-random strings of characters. Without a doubt, here 0.5 is the most
interesting temperature for text generation in this specific setup. Always experiment with
multiple sampling strategies! A clever balance between learned structure and randomness
is what makes generation interesting.

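As the results show, a low temperature produces extremely repetitive and predictable text whose local structure is nonetheless highly realistic: in particular, all words (a word being a local pattern of characters) are real English words. With higher temperatures the text becomes more interesting, surprising, even creative, and the model sometimes invents entirely new words that sound plausible (such as "eterned" or "troveration"). At high temperatures the local structure starts to break down and most words look like semi-random strings of characters. In this setup, 0.5 is the most interesting temperature for text generation. Always experiment with multiple sampling strategies: what makes generation interesting is a well-chosen balance between learned structure and randomness.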

> Note that by training a bigger model, longer, on more data, you can achieve generated
samples that will look much more coherent and realistic than ours. But of course, don’t
expect to ever generate any meaningful text, other than by random chance: all we are
doing is sampling data from a statistical model of which characters come after which
characters. Language is a communication channel, and there is a distinction between
what communications are about, and the statistical structure of the messages in which
communications are encoded. To evidence this distinction, here is a thought experiment:
what if human language did a better job at compressing communications, much like our
computers do with most of our digital communications? Then language would be no less
meaningful, yet it would lack any intrinsic statistical structure, thus making it impossible
to learn a language model like we just did.

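Note that by training a bigger model for longer on more data, you can obtain samples that look far more coherent and realistic than these. Still, do not expect to generate meaningful text other than by chance: all we are doing is sampling from a statistical model of which characters follow which characters. Language is a communication channel, and what a communication is about is distinct from the statistical structure of the messages that encode it. A thought experiment makes the distinction concrete: what if human language compressed communications much more efficiently, the way computers do with most digital communications? Language would be no less meaningful, yet it would lack any intrinsic statistical structure, so a language model like the one above could not be learned.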

#### Wrapping up

> - We can generate discrete sequence data by training a model to predict the next token(s)
> given previous tokens.
> - In the case of text, such a model is called a "language model" and could be based on
> either words or characters.
> - Sampling the next token requires balance between adhering to what the model judges
> likely, and introducing randomness.
> - One way to handle this is the notion of softmax temperature. Always experiment with
> different temperatures to find the "right" one.

- We can generate discrete sequence data by training a model to predict the next token(s) given the previous tokens.
- For text, such a model is called a "language model", and it can be based on either words or characters.
- Sampling the next token requires a trade-off between following the probabilities the model predicts and introducing randomness.
- One way to handle this is the softmax temperature. Experiment with different temperatures to find the "right" one.
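
The effect of the softmax temperature can be sketched in a few lines of NumPy. This is a minimal illustration rather than the notebook's own sampling code: the function names `reweight_distribution` and `sample_next_token` and the four-token distribution are assumptions made up for the example.

```python
import numpy as np

def reweight_distribution(probs, temperature=1.0):
    """Rescale a probability distribution by a softmax temperature."""
    probs = np.asarray(probs, dtype="float64")
    logits = np.log(probs) / temperature
    # Subtracting the max keeps the exponentials numerically stable
    exp_logits = np.exp(logits - np.max(logits))
    return exp_logits / np.sum(exp_logits)

def sample_next_token(probs, temperature=1.0, rng=None):
    """Draw one token index from the reweighted distribution."""
    rng = np.random.default_rng() if rng is None else rng
    reweighted = reweight_distribution(probs, temperature)
    return int(rng.choice(len(reweighted), p=reweighted))

# Hypothetical model output over a 4-token vocabulary
probs = [0.5, 0.3, 0.15, 0.05]
cold = reweight_distribution(probs, temperature=0.2)  # sharper, more predictable
hot = reweight_distribution(probs, temperature=1.2)   # flatter, more surprising
token = sample_next_token(probs, temperature=0.5, rng=np.random.default_rng(0))
```

A temperature below 1 concentrates probability on the most likely token, which matches the repetitive samples above; a temperature above 1 flattens the distribution and lets unlikely characters through.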

## 8.2 Deep Dream

> "Deep Dream" is an artistic image modification technique that leverages the
representations learned by convolutional neural networks. It was first released by Google
in the summer of 2015, as an implementation written using the Caffe deep learning
library (this was several months before the first public release of TensorFlow). It quickly
became an Internet sensation thanks to the trippy pictures it could generate, full of
algorithmic pareidolia artifacts, bird feathers and dog eyes—a by-product of the fact that
the Deep Dream convnet was trained on ImageNet, where dog breeds and bird species
are vastly over-represented.

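"Deep Dream" is an artistic image-modification technique that uses the representations learned by convolutional neural networks. Google first released it in the summer of 2015 as an implementation written with the Caffe deep learning library (several months before the first public release of TensorFlow). It quickly became an internet sensation thanks to the trippy pictures it generates, full of algorithmic pareidolia artifacts, bird feathers, and dog eyes: a by-product of the Deep Dream convnet having been trained on ImageNet, where dog breeds and bird species are vastly over-represented.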

![deep dream example](imgs/f8.3.jpg)

Figure 8-3: An example of a Deep Dream output image

> The Deep Dream algorithm is almost identical to the convnet filter visualization
technique that we introduced in Chapter 5, consisting in running a convnet "in reverse",
i.e. doing gradient ascent on the input to the convnet in order to maximize the activation
of a specific filter in an upper layer of the convnet. Deep Dream leverages this same idea,
with a few simple differences:

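> - With Deep Dream, we try to maximize the activation of entire layers rather than that of a
> specific filter, thus mixing together visualizations of large numbers of features at once.
> - We start not from a blank, slightly noisy input, but rather from an existing image—thus
> the resulting feature visualizations will latch onto pre-existing visual patterns, distorting
> elements of the image in a somewhat artistic fashion.
> - The input images get processed at different scales (called "octaves"), which improves the
> quality of the visualizations.

The Deep Dream algorithm is almost identical to the convnet filter-visualization technique introduced in Chapter 5, which consists of running a convnet "in reverse": doing gradient ascent on the input to the convnet in order to maximize the activation of a specific filter in an upper layer. Deep Dream uses the same idea, with a few simple differences:

- With Deep Dream, we maximize the activation of entire layers rather than that of a specific filter, which mixes visualizations of a large number of features at once.
- We start not from a blank, slightly noisy input but from an existing image, so the resulting feature visualizations latch onto pre-existing visual patterns and distort elements of the image in a somewhat artistic fashion.
- The input images are processed at different scales (called "octaves"), which improves the quality of the visualizations.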

> Let’s make our own Deep Dreams.

Let's build our own Deep Dreams.

### 8.2.1 Implementing Deep Dream in Keras

> We will start from a convnet pre-trained on ImageNet. In Keras, we have many such
convnets available: VGG16, VGG19, Xception, ResNet50... albeit the same process is
doable with any of these, your convnet of choice will naturally affect your visualizations,
since different convnet architectures result in different learned features. The convnet used
in the original Deep Dream release was an Inception model, and in practice Inception is
known to produce very nice-looking Deep Dreams, so we will use the InceptionV3 model
that comes with Keras.

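We start from a convnet pretrained on ImageNet. Keras ships many such convnets: VGG16, VGG19, Xception, ResNet50, and so on. The same process works with any of them, but the choice of convnet naturally affects the visualizations, because different architectures learn different features. The original Deep Dream release used an Inception model, and in practice Inception is known to produce nice-looking Deep Dreams, so we use the InceptionV3 model that comes with Keras.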

In [10]:
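import tensorflow as tf
from tensorflow.keras.applications import InceptionV3
from tensorflow.keras import backend as K

# The gradient code later in this section uses K.gradients and K.function, which
# need graph mode, so eager execution is switched off before the model is built.
# This assumes a TensorFlow 2.x release that still ships the legacy Keras backend.
tf.compat.v1.disable_eager_execution()

# We will not train the model, so this disables all training-specific operations
K.set_learning_phase(0)

# Build the InceptionV3 network without its classification top
model = InceptionV3(weights='imagenet', include_top=False)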

Downloading data from https://github.com/fchollet/deep-learning-models/releases/download/v0.5/inception_v3_weights_tf_dim_ordering_tf_kernels_notop.h5


> Next, we compute the "loss", the quantity that we will seek to maximize during the
gradient ascent process. In Chapter 5, for filter visualization, we were trying to maximize
the value of a specific filter in a specific layer. Here we will simultaneously maximize the
activation of all filters in a number of layers. Specifically, we will maximize a weighted
sum of the L2 norm of the activations of a set of high-level layers. The exact set of layers
we pick (as well as their contribution to the final loss) has a large influence on the visuals
that we will be able to produce, so we want to make these parameters easily configurable.
Lower layers result in geometric patterns, while higher layers result in visuals in which
you can recognize some classes from ImageNet (e.g. birds or dogs). We’ll start from a
somewhat arbitrary configuration involving four layers—but you will definitely want to
explore many different configurations later on:

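Next, we compute the "loss": the quantity we seek to maximize during gradient ascent. For filter visualization in Chapter 5, we maximized the value of a specific filter in a specific layer. Here we simultaneously maximize the activation of all filters in several layers; specifically, we maximize a weighted sum of the L2 norm of the activations of a set of high-level layers. The exact set of layers (and each layer's contribution to the final loss) has a large influence on the resulting visuals, so these parameters should be easy to configure. Lower layers produce geometric patterns, while higher layers produce visuals in which you can recognize some ImageNet classes (for example, birds or dogs). We start from a somewhat arbitrary configuration involving four layers, which you will want to vary later on: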

In [11]:
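# Dictionary mapping each layer name to a coefficient that sets how much
# the layer's activation contributes to the total loss we maximize.
# The layer names are hard-coded in the built-in InceptionV3 application;
# list them with `model.summary()`.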
layer_contributions = {
    'mixed2': 0.2,
    'mixed3': 3.,
    'mixed4': 2.,
    'mixed5': 1.5,
}

> Now let’s define a tensor that contains our loss, i.e. the weighted sum of the L2 norm
of the activations of the layers listed above.

Now let's define a tensor that contains the loss: the weighted sum of the L2 norm of the activations of the layers listed above.
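
To make the quantity concrete, here is a NumPy sketch of the same weighted sum on stand-in data. The arrays, their shapes, and the names `layer_term`, `layer_a`, and `layer_b` are assumptions for illustration only; real activations come from the network. Each layer contributes the sum of its squared activations (excluding a two-pixel border), divided by the number of elements and multiplied by its coefficient.

```python
import numpy as np

def layer_term(activation, coeff):
    """coeff * sum of squared non-border activations / number of elements."""
    scaling = activation.size
    inner = activation[:, 2:-2, 2:-2, :]  # drop a 2-pixel border
    return coeff * np.sum(np.square(inner)) / scaling

rng = np.random.default_rng(0)
# Stand-in activations with shape (batch, height, width, channels)
activations = {
    'layer_a': rng.normal(size=(1, 8, 8, 4)),
    'layer_b': rng.normal(size=(1, 6, 6, 16)),
}
contributions = {'layer_a': 0.2, 'layer_b': 3.0}

loss = sum(layer_term(activations[name], coeff)
           for name, coeff in contributions.items())
print('toy loss:', loss)
```

Dividing by the element count keeps layers with large feature maps from dominating the sum purely because of their size.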

In [13]:
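# Map each layer name to the corresponding layer instance
layer_dict = dict([(layer.name, layer) for layer in model.layers])

# Define the loss by accumulating layer contributions onto a scalar
loss = K.variable(0.)
for layer_name in layer_contributions:
    # Add the squared L2 norm of the layer's activations to the loss
    coeff = layer_contributions[layer_name]
    activation = layer_dict[layer_name].output
    # Number of elements in the activation tensor, used to normalize the sum
    scaling = K.prod(K.cast(K.shape(activation), 'float32'))
    # Only non-border pixels contribute, which avoids border artifacts.
    # The division by `scaling` must be part of the accumulated term.
    loss = loss + coeff * K.sum(K.square(activation[:, 2: -2, 2: -2, :])) / scaling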

> Now we can set up the gradient ascent process:

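Now we can set up the gradient ascent process:

Translator's note: the code below still failed to run for the translator even with TensorFlow v1 compatibility enabled; suggestions for fixing it are welcome. One likely cause is that eager execution is disabled only after the model has been built, whereas `K.gradients` needs the whole graph, model included, to be created in graph mode.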

In [None]:
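import tensorflow as tf

# K.gradients and K.function only work in graph mode. This call has to run
# before the model and the loss are built (ideally at the top of the notebook).
tf.compat.v1.disable_eager_execution()

# This tensor holds the generated image: the dream
dream = model.input

# Compute the gradients of the loss with respect to the dream
grads = K.gradients(loss, dream)[0]

# Normalize the gradients
grads /= K.maximum(K.mean(K.abs(grads)), 1e-7)

# Set up a function that retrieves the loss and gradients for an input image
outputs = [loss, grads]
fetch_loss_and_grads = K.function([dream], outputs)
def eval_loss_and_grads(x):
    outs = fetch_loss_and_grads([x])
    loss_value = outs[0]
    grad_values = outs[1]
    return loss_value, grad_values

def gradient_ascent(x, iterations, step, max_loss=None):
    # Run up to `iterations` ascent steps, stopping early once the loss exceeds max_loss
    for i in range(iterations):
        loss_value, grad_values = eval_loss_and_grads(x)
        if max_loss is not None and loss_value > max_loss:
            break
        print('...Loss value at', i, ':', loss_value)
        x += step * grad_values
    return x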
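
The ascent loop itself does not depend on Keras. The sketch below runs the same loop on a toy function with a hand-written gradient, so its behavior can be checked without a network. `toy_loss_and_grads` is a hypothetical stand-in for `eval_loss_and_grads`, and the target value 3.0 is arbitrary.

```python
import numpy as np

def toy_loss_and_grads(x):
    """Stand-in for the network: loss = -sum((x - 3)^2), maximal at x == 3."""
    loss = -np.sum(np.square(x - 3.0))
    grads = -2.0 * (x - 3.0)
    # Same normalization as in the Keras code: divide by the mean absolute gradient
    grads = grads / max(np.mean(np.abs(grads)), 1e-7)
    return loss, grads

def gradient_ascent(x, iterations, step, max_loss=None):
    x = np.array(x, dtype="float64")  # work on a copy of the input
    for _ in range(iterations):
        loss_value, grad_values = toy_loss_and_grads(x)
        if max_loss is not None and loss_value > max_loss:
            break
        x += step * grad_values
    return x

start = np.zeros(4)
result = gradient_ascent(start, iterations=20, step=0.1)
stopped_early = gradient_ascent(start, iterations=20, step=0.1, max_loss=-20.0)
```

Because the gradient is normalized, `step` directly controls how far the input moves per iteration, and `max_loss` caps how far the process is allowed to go.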

> Finally, here is the actual Deep Dream algorithm.

> First, we define a list of "scales" (also called "octaves") at which we will process the
images. Each successive scale is larger than previous one by a factor 1.4 (i.e. 40%
larger): we start by processing a small image and we increasingly upscale it (Figure 8.4).

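Finally, here is the actual Deep Dream algorithm.

First, we define a list of scales (also called "octaves") at which to process the image. Each successive scale is larger than the previous one by a factor of 1.4 (that is, 40% larger): we start with a small image and progressively upscale it (see Figure 8-4).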

![Deep Dream Process](imgs/f8.4.jpg)

Figure 8-4: The Deep Dream process: successive scales (octaves), with detail reinjection after each upscaling

> Then, for each successive scale, from the smallest to the largest, we run gradient
ascent to maximize the loss we have previously defined, at that scale. After each gradient
ascent run, we upscale the resulting image by 40%.

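Then, for each scale from the smallest to the largest, we run gradient ascent to maximize the previously defined loss at that scale. After each gradient ascent run, we upscale the resulting image by 40%.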
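
As a quick illustration of the scale schedule, the snippet below computes the successive shapes for a hypothetical 400x600 image with three octaves and a factor of 1.4. The helper name `octave_shapes` is an assumption for this sketch.

```python
def octave_shapes(original_shape, num_octave=3, octave_scale=1.4):
    """Return the processing shapes from the smallest octave up to the original."""
    shapes = [tuple(original_shape)]
    for i in range(1, num_octave):
        shapes.append(tuple(int(dim / (octave_scale ** i)) for dim in original_shape))
    return shapes[::-1]  # smallest first

print(octave_shapes((400, 600)))
# → [(204, 306), (285, 428), (400, 600)]
```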

> To avoid losing a lot of image detail after each successive upscaling (resulting in
increasingly blurry or pixelated images), we leverage a simple trick: after each upscaling,
we reinject the lost details back into the image, which is possible since we know what the
original image should look like at the larger scale. Given a small image S and a larger
image size L, we can compute the difference between the original image (assumed larger
than L) resized to size L and the original resized to size S—this difference quantifies the
details lost when going from S to L.

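To avoid losing a lot of image detail after each upscaling (which would produce increasingly blurry or pixelated images), we use a simple trick: after each upscaling, we reinject the lost details into the image. This is possible because we know what the original image should look like at the larger scale. Given a small image size S and a larger image size L, we compute the difference between the original image resized to size L and the original resized to size S and then upscaled back to L (the original is assumed to be larger than L). This difference quantifies the details lost when going from S to L.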
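
The trick can be checked on a small array. This is a sketch under simplifying assumptions: `downscale2` and `upscale2` are hypothetical helpers that halve and double the resolution (block averaging and pixel repetition) instead of the interpolation used by the notebook, and the image is a single-channel random array.

```python
import numpy as np

def downscale2(img):
    """Halve height and width by averaging 2x2 blocks (hypothetical helper)."""
    h, w = img.shape
    return img.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def upscale2(img):
    """Double height and width by repeating pixels (hypothetical helper)."""
    return np.repeat(np.repeat(img, 2, axis=0), 2, axis=1)

rng = np.random.default_rng(0)
original = rng.random((8, 8))      # the original image at the large size L
small = downscale2(original)       # the original resized to the small size S
upscaled = upscale2(small)         # S blown back up to L: blocky, detail is gone
lost_detail = original - upscaled  # what the round trip S -> L threw away
restored = upscaled + lost_detail  # adding it back recovers the original exactly
```

In Deep Dream the image being upscaled is the dream rather than the untouched original, so adding `lost_detail` restores fine structure from the original while keeping the dreamed patterns.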

In [None]:
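import numpy as np

# Playing with these hyperparameters produces different effects
step = 0.01  # Gradient ascent step size
num_octave = 3  # Number of scales at which to run gradient ascent
octave_scale = 1.4  # Size ratio between successive scales
iterations = 20  # Number of ascent steps per scale

# If the loss grows larger than 10, gradient ascent is interrupted
# to keep the result from becoming too distorted
max_loss = 10.

# Fill in the path to the image you want to use
base_image_path = '...'

# Load the base image into a NumPy array
img = preprocess_image(base_image_path)

# Prepare a list of shape tuples defining the scales at which to run gradient ascent
original_shape = img.shape[1:3]
successive_shapes = [original_shape]

for i in range(1, num_octave):
    shape = tuple([int(dim / (octave_scale ** i)) for dim in original_shape])
    successive_shapes.append(shape)

# Reverse the list so that the shapes are in increasing order
successive_shapes = successive_shapes[::-1]

# Resize the original image to the smallest scale
original_img = np.copy(img)
shrunk_original_img = resize_img(img, successive_shapes[0])

for shape in successive_shapes:
    print('Processing image shape', shape)
    # Scale up the dream image
    img = resize_img(img, shape)
    # Run gradient ascent, altering the dream
    img = gradient_ascent(img,
                          iterations=iterations,
                          step=step,
                          max_loss=max_loss)
    # Scale up the smaller version of the original image: it will be pixelated
    upscaled_shrunk_original_img = resize_img(shrunk_original_img, shape)
    # Compute the high-quality version of the original image at this size
    same_size_original = resize_img(original_img, shape)
    # The difference between the two is the detail lost when scaling up
    lost_detail = same_size_original - upscaled_shrunk_original_img

    # Reinject the lost detail into the dream
    img += lost_detail
    shrunk_original_img = resize_img(original_img, shape)
    save_img(img, fname='dream_at_scale_' + str(shape) + '.png')

save_img(img, fname='final_dream.png')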

> Note that the code above leverages the following straightforward auxiliary Numpy
functions, which all do just as their name suggests. They require to have SciPy installed.

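Note that the code above relies on the following straightforward auxiliary NumPy functions, which do what their names suggest. They require SciPy to be installed.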

In [None]:
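import numpy as np
import scipy.ndimage
from PIL import Image
from tensorflow.keras import backend as K
from tensorflow.keras.applications import inception_v3
from tensorflow.keras.preprocessing import image

def resize_img(img, size):
    img = np.copy(img)
    factors = (1,
               float(size[0]) / img.shape[1],
               float(size[1]) / img.shape[2],
               1)
    return scipy.ndimage.zoom(img, factors, order=1)

def save_img(img, fname):
    pil_img = deprocess_image(np.copy(img))
    # scipy.misc.imsave is no longer available in recent SciPy releases,
    # so the image is saved through Pillow instead
    Image.fromarray(pil_img).save(fname)

def preprocess_image(image_path):
    # Util function to open and format a picture into a tensor InceptionV3 can process
    img = image.load_img(image_path)
    img = image.img_to_array(img)
    img = np.expand_dims(img, axis=0)
    img = inception_v3.preprocess_input(img)
    return img

def deprocess_image(x):
    # Util function to convert a tensor back into a valid image
    if K.image_data_format() == 'channels_first':
        x = x.reshape((3, x.shape[2], x.shape[3]))
        x = x.transpose((1, 2, 0))
    else:
        x = x.reshape((x.shape[1], x.shape[2], 3))
    # Undo the preprocessing performed by inception_v3.preprocess_input.
    # These steps apply to both data formats, so they sit outside the if/else.
    x /= 2.
    x += 0.5
    x *= 255.
    x = np.clip(x, 0, 255).astype('uint8')
    return x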

> Note that because the original InceptionV3 network was trained to recognize concepts
in images of size 299x299, and given that the process involves downscaling the images
by a reasonable factor, our Deep Dream implementation will produce much better results
on images that are somewhere between 300x300 and 400x400. Regardless, it is still
possible to run the same code on images of any size and any ratio.

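Note that because the original InceptionV3 network was trained to recognize concepts in 299x299 images, and because the process downscales the images by a reasonable factor, this Deep Dream implementation produces much better results on images between roughly 300x300 and 400x400. Regardless, the same code runs on images of any size and aspect ratio.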

> Starting from this photograph (taken in the small hills between the San Francisco bay
and the Google campus), we obtain the following Deep Dream:

Starting from this photograph (taken in the small hills between the San Francisco Bay and the Google campus), we obtain the following Deep Dream:

![deep dream example](imgs/f8.5.jpg)

Figure 8-5: An example output of our Deep Dream implementation

> I strongly suggest that you explore what you can do by adjusting which layers you are
using in your loss. Layers that are lower in the network contain more local, less abstract
representations and will lead to more geometric-looking dream patterns. Layers
higher-up will lead to more recognizable visual patterns based on the most common
objects found in ImageNet, such as dog eyes, bird feathers, and so on. You can use
random generation of the parameters in our layer_contributions dictionary in order
to quickly explore many different layer combinations.

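The author strongly suggests exploring what you can do by adjusting which layers contribute to the loss. Layers lower in the network contain more local, less abstract representations and lead to more geometric-looking dream patterns. Layers higher up lead to more recognizable visual patterns based on the most common objects in ImageNet, such as dog eyes and bird feathers. You can randomly generate the parameters of the `layer_contributions` dictionary to explore many layer combinations quickly.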
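
One way to do that random exploration is sketched below. The helper `random_layer_contributions` and the coefficient range of 0 to 3 are assumptions chosen for illustration; the layer names are the four used earlier in this section.

```python
import random

def random_layer_contributions(layer_names, low=0.0, high=3.0, seed=None):
    """Draw one random coefficient per layer (the range is an arbitrary choice)."""
    rng = random.Random(seed)
    return {name: round(rng.uniform(low, high), 2) for name in layer_names}

layer_names = ['mixed2', 'mixed3', 'mixed4', 'mixed5']
layer_contributions = random_layer_contributions(layer_names, seed=42)
print(layer_contributions)
```

Passing a seed makes a configuration reproducible, which helps when a particular combination produces a result worth keeping.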

> Here is a range of results obtained using different layer configurations, from an image
of a delicious homemade pastry:

Here is a range of results obtained with different layer configurations, all generated from a photo of a delicious homemade pastry:

![different layer configurations](imgs/f8.6.jpg)

Figure 8-6: Results of using different layers in the loss

### 8.2.2 Wrapping up

> - Deep Dream consists in running a network "in reverse" to generate inputs based on the
> representations learned by the convnet.
> - The results produced are fun, and share some similarity with the visual artifacts induced
> in humans by the disruption of the visual cortex via psychedelics.
> - Note that the process is not specific to image models, nor even to convnets. It could be
> done for speech, music, and more.

- Deep Dream runs a network "in reverse" to generate inputs based on the representations learned by the convnet.
- The results are fun, and they somewhat resemble the visual artifacts induced in humans when psychedelics disrupt the visual cortex.
- The process is not specific to image models, or even to convnets. It could be applied to speech, music, and more.

## 8.3 Neural style transfer

> Besides Deep Dream, another major development in deep learning-driven image
modification that happened in the summer of 2015 is neural style transfer, introduced by
Leon Gatys et al. The neural style transfer algorithm has undergone many refinements
and spawned many variations since its original introduction, including a viral smartphone
app, called Prisma. For simplicity, this section focuses on the formulation described in
the original paper.

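Besides Deep Dream, another major development in deep-learning-driven image modification from the summer of 2015 is neural style transfer, introduced by Leon Gatys et al. The algorithm has undergone many refinements and spawned many variations since then, including the viral smartphone app Prisma. For simplicity, this section focuses on the formulation described in the original paper.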

> Neural style transfer consists in applying the "style" of a reference image to a target
image, while conserving the "content" of the target image:

Neural style transfer applies the "style" of a reference image to a target image while conserving the "content" of the target image:

![neural style transfer](imgs/f8.7.jpg)

Figure 8-7: An example of neural style transfer

> What is meant by "style" is essentially textures, colors, and visual patterns in the
image, at various spatial scales, while the "content" is the higher-level macrostructure of
the image. For instance, blue-and-yellow circular brush strokes are considered to be the
"style" in the above example using Starry Night by Van Gogh, while the buildings in the
Tuebingen photograph are considered to be the "content".

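"Style" essentially means the textures, colors, and visual patterns in the image at various spatial scales, while "content" is the higher-level macrostructure of the image. In the example above, the blue-and-yellow circular brush strokes of Van Gogh's Starry Night are the "style", and the buildings in the Tuebingen photograph are the "content".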

> The idea of style transfer, tightly related to that of texture generation, has had a long
history in the image processing community prior to the development of neural style
transfer in 2015. However, as it turned out, the deep learning-based implementations of
style transfer offered results unparalleled by what could be previously achieved with
classical computer vision techniques, and triggered an amazing renaissance in creative
applications of computer vision.

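The idea of style transfer, closely related to texture generation, had a long history in the image-processing community before neural style transfer appeared in 2015. As it turned out, deep-learning-based style transfer produced results unmatched by classical computer vision techniques, and it triggered a renaissance in creative applications of computer vision.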

> The key notion behind implementing style transfer is the same idea that is central to all
deep learning algorithms: we define a loss function to specify what we want to achieve,
and we minimize this loss. We know what we want to achieve: conserve the "content" of
the original image, while adopting the "style" of the reference image. If we were able to
mathematically define content and style, then an appropriate loss function to minimize
would be the following:

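The key notion behind style transfer is the same idea that is central to all deep learning algorithms: define a loss function that specifies what we want to achieve, then minimize it. Here the goal is to conserve the "content" of the original image while adopting the "style" of the reference image. If we could define content and style mathematically, an appropriate loss function to minimize would be: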

```python
# `distance`, `style`, and `content` are conceptual placeholders, not real functions
loss = (distance(style(reference_image) - style(generated_image)) +
        distance(content(original_image) - content(generated_image)))
```

> Where distance is a norm function such as the L2 norm, content is a function that
takes an image and computes a representation of its "content", and style is a function
that takes an image and computes a representation of its "style".

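Here `distance` is a norm function such as the L2 norm, `content` is a function that takes an image and computes a representation of its content, and `style` is a function that takes an image and computes a representation of its style.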

> Minimizing this loss would cause style(generated_image) to be close to
style(reference_image), while content(generated_image) would be close to
content(original_image), thus achieving style transfer as we defined it.

Minimizing this loss brings `style(generated_image)` close to `style(reference_image)` and `content(generated_image)` close to `content(original_image)`, which achieves style transfer as defined above.

> A fundamental observation made by Gatys et al. is that deep convolutional neural
networks offer precisely a way to mathematically define the style and content
functions. Let’s see how.

A fundamental observation made by Gatys et al. is that deep convolutional neural networks offer a way to define the `style` and `content` functions mathematically. Let's see how.

### 8.3.1 The content loss

> As you already know, activations from earlier layers in a network contain local
information about the image, while activations from higher layers contain increasingly
global and abstract information. Formulated in a different way, the activations of the
different layers of a convnet provide a decomposition of the contents of an image over
different spatial scales. Therefore we expect the "content" of an image, which is more
global and more abstract, to be captured by the representations of a top layer of a
convnet.

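As you already know, activations from earlier layers in a network contain local information about the image, while activations from higher layers contain increasingly global and abstract information. Put differently, the activations of the different layers of a convnet decompose the contents of an image over different spatial scales. We therefore expect the "content" of an image, which is more global and abstract, to be captured by the representations of a top layer of the convnet.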

> A good candidate for a content loss would thus be to consider a pre-trained convnet,
and define as our loss the L2 norm between the activations of a top layer computed over
the target image and the activations of the same layer computed over the generated
image. This would guarantee that, as seen from the top layer of the convnet, the
generated image will "look similar" to the original target image. Assuming that what the
top layers of a convnet see is really the "content" of their input images, then this does
work as a way to preserve image content.

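A good candidate for the content loss is therefore to take a pretrained convnet and define the loss as the L2 norm between the activations of a top layer computed over the target image and the activations of the same layer computed over the generated image. This ensures that, as seen from the top layer of the convnet, the generated image "looks similar" to the original target image. Assuming that what the top layers of a convnet see really is the content of their input images, this works as a way to preserve image content.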
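
As a minimal sketch, the content loss can be written as the sum of squared differences (the squared L2 distance) between two activation arrays. The arrays below are random stand-ins for top-layer activations, and their shape is an assumption for illustration.

```python
import numpy as np

def content_loss(target_features, generated_features):
    """Squared L2 distance between two sets of layer activations."""
    return float(np.sum(np.square(generated_features - target_features)))

rng = np.random.default_rng(0)
target_features = rng.random((4, 4, 8))  # stand-in for top-layer activations
generated_features = target_features + 0.1 * rng.normal(size=(4, 4, 8))

print(content_loss(target_features, target_features))     # identical activations
print(content_loss(target_features, generated_features))  # grows as they diverge
```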

### 8.3.2 The style loss

> While the content loss only leverages a single higher-up layer, the style loss as defined in
the Gatys et al. paper leverages multiple layers of a convnet: we aim at capturing the
appearance of the style reference image at all spatial scales extracted by the convnet, not
just any single scale.

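While the content loss uses only a single upper layer, the style loss defined in the Gatys et al. paper uses multiple layers of a convnet: the aim is to capture the appearance of the style reference image at all spatial scales extracted by the convnet, not just at a single scale.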

> For the style loss, the Gatys et al. paper leverages the "Gram matrix" of a layer’s
activations, i.e. the inner product between the feature maps of a given layer. This inner
product can be understood as representing a map of the correlations between the features
of a layer. These feature correlations capture the statistics of the patterns of a particular
spatial scale, which empirically corresponds to the appearance of the textures found at
this scale.

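For the style loss, the Gatys et al. paper uses the "Gram matrix" of a layer's activations: the inner product between the feature maps of a given layer. This inner product can be understood as a map of the correlations between the layer's features. These feature correlations capture the statistics of the patterns at a particular spatial scale, which empirically correspond to the appearance of the textures found at that scale.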

> Hence the style loss aims at preserving similar internal correlations within the
activations of different layers, across the style reference image and the generated image.
In turn, this guarantees that the textures found at different spatial scales will look similar
across the style reference image and the generated image.

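Hence the style loss aims to preserve similar internal correlations within the activations of different layers across the style reference image and the generated image. In turn, this ensures that the textures found at different spatial scales look similar in both images.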
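
The Gram matrix and a single-layer style loss can be sketched in NumPy as follows. The feature arrays are random stand-ins, and the normalization constant 1 / (4 · channels² · size²) is one common convention, treated here as an assumption rather than the only possible choice.

```python
import numpy as np

def gram_matrix(features):
    """Inner products between the channels of a (height, width, channels) array."""
    height, width, channels = features.shape
    flat = features.reshape(height * width, channels)
    return flat.T @ flat  # shape: (channels, channels)

def style_loss(style_features, generated_features):
    """Normalized squared distance between the two Gram matrices."""
    height, width, channels = style_features.shape
    size = height * width
    diff = gram_matrix(style_features) - gram_matrix(generated_features)
    return float(np.sum(np.square(diff)) / (4.0 * channels ** 2 * size ** 2))

rng = np.random.default_rng(0)
style = rng.random((6, 6, 3))
# Same feature vectors at shuffled positions: identical channel correlations
shuffled = style.reshape(-1, 3)[rng.permutation(36)].reshape(6, 6, 3)
other = rng.random((6, 6, 3)) * 2.0

print(style_loss(style, shuffled))  # essentially zero
print(style_loss(style, other))     # clearly larger
```

The shuffled example shows why the Gram matrix describes texture rather than layout: it records which features occur together, not where they occur.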

### 8.3.3 In short

> In short, we can use a pre-trained convnet to define a loss that will:

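> - Preserve content by maintaining similar high-level layer activations between the target
> content image and the generated image. The convnet should "see" both the target image
> and the generated image as "containing the same things".
> - Preserve style by maintaining similar correlations within activations for both low-level
> layers and high-level layers. Indeed, feature correlations capture textures: the generated
> and the style reference image should share the same textures at different spatial scales.

In short, we can use a pretrained convnet to define a loss that will:

- Preserve content by maintaining similar high-level layer activations between the target image and the generated image. The convnet should "see" both images as containing the same things.
- Preserve style by maintaining similar correlations within activations, for both low-level and high-level layers, between the style reference image and the generated image. Feature correlations capture textures: the generated image and the style reference image should share the same textures at different spatial scales.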

> Now let’s take a look at a Keras implementation of the original 2015 neural style
transfer algorithm. As you will see, it shares a lot of similarities with the Deep Dream
implementation we developed in the previous section.

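Now let's look at a Keras implementation of the original 2015 neural style transfer algorithm. As you will see, it has a lot in common with the Deep Dream implementation from the previous section.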

### 8.3.4 Neural style transfer in Keras

> Neural style transfer can be implemented using any pre-trained convnet. Here we will use
the VGG19 network, used by Gatys et al in their paper. VGG19 is a simple variant of the
VGG16 network we introduced in Chapter 5, with three more convolutional layers.

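Neural style transfer can be implemented with any pretrained convnet. Here we use the VGG19 network, as Gatys et al. did in their paper. VGG19 is a simple variant of the VGG16 network introduced in Chapter 5, with three more convolutional layers.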

> This is our general process:

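> - Set up a network that will compute VGG19 layer activations for the style reference
> image, the target image, and the generated image at the same time.
> - Use the layer activations computed over these three images to define the loss function
> described above, which we will minimize in order to achieve style transfer.
> - Set up a gradient descent process to minimize this loss function.

The general process is:

- Set up a network that computes VGG19 layer activations for the style reference image, the target image, and the generated image at the same time.
- Use the layer activations computed over these three images to define the loss function described above, which we minimize to achieve style transfer.
- Set up a gradient descent process to minimize this loss function.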

> Let’s start by defining the paths to the two images we consider: the style reference
image and the target image. To make sure that all images processed share similar sizes
(widely different sizes would make style transfer more difficult), we will later resize
them all to a shared height of 400px.

首先我们定义两个图像的路径：风格参考图像和原始目标图像。为了保证所有图像都有着相似的大小（有着巨大尺寸差别的图像会使得风格迁移变得更加困难），我们会将两张图像都缩放到高度为400px。

In [None]:
from tensorflow.keras.preprocessing.image import load_img, img_to_array

# 原始目标图像路径
target_image_path = 'img/portrait.jpg'

# 风格参考图像路径
style_reference_image_path = 'img/transfer_style_reference.jpg'

# 生成图像的尺寸
width, height = load_img(target_image_path).size
img_height = 400
img_width = int(width * img_height / height)

> We will need some auxiliary functions for loading, pre-processing and
post-processing the images that will go in and out of the VGG19 convnet:

我们下面需要一些工具函数用来对输入输出VGG19卷积网络的图像进行装载、预处理、后处理：

In [None]:
import numpy as np
from tensorflow.keras.applications import vgg19

def preprocess_image(image_path):
    img = load_img(image_path, target_size=(img_height, img_width))
    img = img_to_array(img)
    img = np.expand_dims(img, axis=0)
    img = vgg19.preprocess_input(img)
    return img

def deprocess_image(x):
    # Undo the zero-centering: add back the ImageNet mean pixel values (BGR order)
    x[:, :, 0] += 103.939
    x[:, :, 1] += 116.779
    x[:, :, 2] += 123.68
    # 'BGR'->'RGB'
    x = x[:, :, ::-1]
    x = np.clip(x, 0, 255).astype('uint8')
    return x
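
The two helpers above are inverses of each other. A minimal NumPy sketch of that round trip follows, assuming `vgg19.preprocess_input` applies the usual "caffe"-style steps (RGB to BGR, then subtracting the ImageNet channel means). The names `preprocess_array` and `deprocess_array` are hypothetical. Unlike `deprocess_image`, this version rounds before casting and does not modify its input in place.

```python
import numpy as np

# ImageNet per-channel means in BGR order (the same constants as in deprocess_image)
IMAGENET_BGR_MEAN = np.array([103.939, 116.779, 123.68], dtype="float32")

def preprocess_array(rgb):
    """RGB image (H, W, 3) in [0, 255] -> zero-centered BGR batch (1, H, W, 3)."""
    bgr = rgb[..., ::-1].astype("float32")        # RGB -> BGR
    return (bgr - IMAGENET_BGR_MEAN)[np.newaxis]  # subtract means, add batch axis

def deprocess_array(x):
    """Invert preprocess_array for one (H, W, 3) array without touching the input."""
    x = x + IMAGENET_BGR_MEAN                     # add the means back (new array)
    x = x[..., ::-1]                              # BGR -> RGB
    # Round before casting so float error cannot truncate a pixel value
    return np.clip(np.rint(x), 0, 255).astype("uint8")

rng = np.random.default_rng(0)
rgb_demo = rng.integers(0, 256, size=(4, 5, 3)).astype("uint8")
batch_demo = preprocess_array(rgb_demo)
restored_demo = deprocess_array(batch_demo[0])
print(batch_demo.shape, np.array_equal(rgb_demo, restored_demo))
```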

> Let’s set up the VGG19 network. It takes as input a batch of three images: the style
reference image, the target image, and a placeholder that will contain the generated
image. A placeholder is simply a symbolic tensor, the values of which are provided
externally via Numpy arrays. The style reference and target image are static, and thus
defined using K.constant , while the values contained in the placeholder of the
generated image will change over time.

然后构建VGG19网络。它将三张图像作为一个批次输入：风格参考图像、原始目标图像和一个作为生成图像的置位符。置位符就是一个符号化的张量，它的值通过外部Numpy数组来提供。因为风格参考图像和原始目标图像都是静态的，因此可以使用`K.constant`来定义，而置位符代表的生成图像会随着时间不断发生变化。

In [None]:
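import tensorflow as tf
from tensorflow.keras import backend as K

# Assumption: on TensorFlow 2.x, `K.placeholder` and `K.gradients` only work in graph mode,
# so eager execution is switched off before any tensors are created
tf.compat.v1.disable_eager_execution()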

target_image = K.constant(preprocess_image(target_image_path))
style_reference_image = K.constant(preprocess_image(style_reference_image_path))

# 下面的置位符表示生成的图像
combination_image = K.placeholder((1, img_height, img_width, 3))

# 我们将三张图像合并成一个批次
input_tensor = K.concatenate([target_image,
                              style_reference_image,
                              combination_image], axis=0)

# 构建VGG19网络，使用三张图像作为输入，模型会使用ImageNet数据集权重作为预训练权重值
model = vgg19.VGG19(input_tensor=input_tensor,
                    weights='imagenet',
                    include_top=False)

print('Model loaded.')

> Let’s define the content loss, meant to make sure that the top layer of the VGG19
convnet will have a similar view of the target image and the generated image:

定义内容损失，用来保证VGG19卷积网络的顶层对原始目标图像和生成图像有着相似的结果：

In [None]:
def content_loss(base, combination):
    return K.sum(K.square(combination - base))

> Now, here’s the style loss. It leverages an auxiliary function to compute the Gram
matrix of an input matrix, i.e. a map of the correlations found in the original feature
matrix.

下面就是风格损失。它使用一个工具函数来计算输入矩阵的格拉姆矩阵，也就是在原始特征矩阵中得到的相关性地图。

In [None]:
def gram_matrix(x):
    features = K.batch_flatten(K.permute_dimensions(x, (2, 0, 1)))
    gram = K.dot(features, K.transpose(features))
    return gram

def style_loss(style, combination):
    S = gram_matrix(style)
    C = gram_matrix(combination)
    channels = 3
    size = img_height * img_width
    return K.sum(K.square(S - C)) / (4. * (channels ** 2) * (size ** 2))
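
To see why the Gram matrix captures texture rather than layout, here is a small NumPy sketch (not the Keras code above). It shuffles the spatial positions of a random feature map: the Gram matrix, and therefore the style loss, is unchanged, while the content loss is not. `gram_matrix_np` and `style_loss_np` are hypothetical names, and the normalization here uses the feature map's own size, which is an assumption that differs from the constants used in `style_loss`.

```python
import numpy as np

def gram_matrix_np(x):
    """x: feature map (H, W, C). Returns the (C, C) matrix of channel inner products."""
    features = x.transpose(2, 0, 1).reshape(x.shape[2], -1)  # (C, H*W)
    return features @ features.T

def style_loss_np(style, combination):
    h, w, c = style.shape
    diff = gram_matrix_np(style) - gram_matrix_np(combination)
    return np.sum(diff ** 2) / (4.0 * (c ** 2) * ((h * w) ** 2))

rng = np.random.default_rng(0)
feats = rng.normal(size=(6, 6, 4))

# Shuffle the spatial positions: same channel statistics, different layout
flat = feats.reshape(-1, 4)
shuffled = flat[rng.permutation(flat.shape[0])].reshape(6, 6, 4)

gram = gram_matrix_np(feats)
style_difference = style_loss_np(feats, shuffled)
content_difference = np.sum((feats - shuffled) ** 2)
print(gram.shape, style_difference, content_difference)
```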

> To these two loss components, we add a third one, the "total variation loss". It is
meant to encourage spatial continuity in the generated image, thus avoiding overly
pixelated results. You could interpret it as a regularization loss.

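On top of these two loss components we add a third, the total variation loss. It encourages spatial continuity in the generated image and so avoids overly pixelated results. You can think of it as a regularization loss.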

In [None]:
def total_variation_loss(x):
    a = K.square(x[:, :img_height - 1, :img_width - 1, :] - x[:, 1:, :img_width - 1, :])
    b = K.square(x[:, :img_height - 1, :img_width - 1, :] - x[:, :img_height - 1, 1:, :])
    return K.sum(K.pow(a + b, 1.25))
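
As a rough illustration of what this term penalizes, the NumPy sketch below applies the same slicing to a perfectly flat image and to a noisy copy of it. The helper name `total_variation_np` and the toy images are assumptions made for this example.

```python
import numpy as np

def total_variation_np(x):
    """x: image batch (1, H, W, C); same neighbour differences as the Keras version."""
    a = np.square(x[:, :-1, :-1, :] - x[:, 1:, :-1, :])   # vertical neighbours
    b = np.square(x[:, :-1, :-1, :] - x[:, :-1, 1:, :])   # horizontal neighbours
    return np.sum(np.power(a + b, 1.25))

rng = np.random.default_rng(0)
flat_image = np.full((1, 8, 8, 3), 0.5)
noisy_image = flat_image + rng.normal(scale=0.1, size=flat_image.shape)

tv_flat = total_variation_np(flat_image)
tv_noisy = total_variation_np(noisy_image)
print(tv_flat, tv_noisy)
```

A constant image has no variation between neighbouring pixels, so its penalty is zero. Any pixel-level noise raises it.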

> The loss that we minimize is a weighted average of these three losses. To compute the
content loss, we only leverage one top layer, the block5_conv2 layer, while for the style
loss we use a list of layers that spans both low-level and high-level layers. We add the
total variation loss at the end.

最终我们需要最小化的损失是这三个损失值的加权平均。计算内容损失时我们只需要使用最顶层，也就是`block5_conv2`层，而计算风格损失时我们需要使用一个层次的列表，涵盖了底层到高层。最后我们将总体差异损失加在后面。

> Depending on the style reference image and content image you are using, you will
likely want to tune the content_weight coefficient, the contribution of the content loss
to the total loss. A higher content_weight means that the target content will be more
recognizable in the generated image.

取决于你在使用的风格参考图像和内容图像，你可能需要调整`content_weight`系数，它代表着内容损失在整体损失中占的比重。更高的`content_weight`代表着生成图像中的内容具有更高的辨识度。

In [None]:
# 定义个将层次名称映射到激活输出张量的字典
outputs_dict = dict([(layer.name, layer.output) for layer in model.layers])

# 内容损失计算的层次名称
content_layer = 'block5_conv2'

# 风格损失计算的层次名称列表
style_layers = ['block1_conv1',
                'block2_conv1',
                'block3_conv1',
                'block4_conv1',
                'block5_conv1']

# 三个损失值所占的权重比例
total_variation_weight = 1e-4
style_weight = 1.
content_weight = 0.025

# Sum all the loss terms into a single scalar `loss`.
# `loss = loss + ...` is used instead of `+=`, because in-place addition
# on a Keras variable can be rejected by newer TensorFlow versions.
loss = K.variable(0.)
layer_features = outputs_dict[content_layer]
target_image_features = layer_features[0, :, :, :]
combination_features = layer_features[2, :, :, :]
loss = loss + content_weight * content_loss(target_image_features, combination_features)

for layer_name in style_layers:
    layer_features = outputs_dict[layer_name]
    style_reference_features = layer_features[1, :, :, :]
    combination_features = layer_features[2, :, :, :]
    sl = style_loss(style_reference_features, combination_features)
    loss = loss + (style_weight / len(style_layers)) * sl
loss = loss + total_variation_weight * total_variation_loss(combination_image)

> Finally, we set up the gradient descent process. In the original Gatys et al. paper,
optimization is performed using the L-BFGS algorithm, so that is also what we will use
here. This is a key difference from the Deep Dream example in the previous section. The
L-BFGS algorithm comes packaged with SciPy. However, there are two slight
limitations with the SciPy implementation:

> - It requires to be passed the value of the loss function and the value of the gradients as two
separate functions.
- It can only be applied to flat vectors, whereas we have a 3D image array.

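The last step is to set up the gradient descent process. The Gatys et al. paper performs the optimization with the L-BFGS algorithm, so we use it here as well. This is a key difference from the Deep Dream example in the previous section. L-BFGS ships with SciPy, but the SciPy implementation has two slight limitations:

- In the form used here, it takes the loss value and the gradient values as two separate functions.
- It only works on flat vectors, whereas our image is a 3D array.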

> It would be very inefficient for us to compute the value of the loss function and the
value of gradients independently, since it would lead to a lot of redundant computation
between the two. We would be almost twice slower than we could be by computing them
jointly. To by-pass this, we set up a Python class named Evaluator that will compute
both loss value and gradients value at once, will return the loss value when called the first
time, and will cache the gradients for the next call.

如果我们分别独立计算损失函数值和梯度值的话将会非常的低效，因为这会导致两者之间产生许多冗余的计算操作。这会使得整个计算时间比联合计算它们要多几乎一倍。为了避免这一点，我们会构造一个Python类叫做`Evaluator`，它会同时计算损失值和梯度值，然后在第一次调用时返回损失值，并将梯度值缓存起来留待第二次调用。

In [None]:
# 通过损失值计算生成图像的梯度值
grads = K.gradients(loss, combination_image)[0]

# Function to fetch the values of the current loss and the current gradients
fetch_loss_and_grads = K.function([combination_image], [loss, grads])

class Evaluator(object):
    
    def __init__(self):
        self.loss_value = None
        self.grad_values = None
        
    def loss(self, x):
        assert self.loss_value is None
        x = x.reshape((1, img_height, img_width, 3))
        outs = fetch_loss_and_grads([x])
        loss_value = outs[0]
        grad_values = outs[1].flatten().astype('float64')
        self.loss_value = loss_value
        self.grad_values = grad_values
        return self.loss_value
    
    def grads(self, x):
        assert self.loss_value is not None
        grad_values = np.copy(self.grad_values)
        self.loss_value = None
        self.grad_values = None
        return grad_values

evaluator = Evaluator()
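
The caching pattern can be tried in isolation. The sketch below is a toy under stated assumptions: `loss_and_grads` is a hypothetical stand-in for `fetch_loss_and_grads` (a simple quadratic), and a plain gradient step replaces L-BFGS. The point is only that each loss/gradient pair costs a single joint evaluation.

```python
import numpy as np

calls = {"joint": 0}

def loss_and_grads(x):
    """Toy stand-in for fetch_loss_and_grads: quadratic loss and its gradient."""
    calls["joint"] += 1
    return np.sum((x - 3.0) ** 2), 2.0 * (x - 3.0)

class CachingEvaluator:
    def __init__(self):
        self.loss_value = None
        self.grad_values = None

    def loss(self, x):
        # Compute both quantities once, return the loss, keep the gradients
        assert self.loss_value is None
        self.loss_value, self.grad_values = loss_and_grads(x)
        return self.loss_value

    def grads(self, x):
        # Hand back the cached gradients and reset for the next pair of calls
        assert self.loss_value is not None
        grad_values = np.copy(self.grad_values)
        self.loss_value = None
        self.grad_values = None
        return grad_values

evaluator_demo = CachingEvaluator()
x_demo = np.zeros(4)
for _ in range(50):
    evaluator_demo.loss(x_demo)
    x_demo = x_demo - 0.1 * evaluator_demo.grads(x_demo)  # plain gradient step, not L-BFGS

print(calls["joint"], x_demo)
```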

> Finally, we can run the gradient descent process using SciPy’s L-BFGS algorithm,
saving the current generated image at each iteration of the algorithm (here, a single
iteration represents 20 steps of gradient descent):

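With everything in place, we can run the optimization using SciPy's L-BFGS implementation, saving the current generated image after each iteration (here, a single iteration allows at most 20 loss evaluations, set by `maxfun=20`).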

In [None]:
from scipy.optimize import fmin_l_bfgs_b
from tensorflow.keras.preprocessing.image import save_img
import time

result_prefix = 'my_result'
iterations = 20

# Run L-BFGS to minimize the loss.
# The initial state is the target image.
# Note that `scipy.optimize.fmin_l_bfgs_b` only works on flat vectors.
x = preprocess_image(target_image_path)
x = x.flatten()
for i in range(iterations):
    print('Start of iteration', i)
    start_time = time.time()
    x, min_val, info = fmin_l_bfgs_b(evaluator.loss, x,
                                     fprime=evaluator.grads, maxfun=20)
    print('Current loss value:', min_val)
    # Save the current generated image.
    # `scipy.misc.imsave` was removed from recent SciPy releases, so Keras' `save_img`
    # is used instead; `scale=False` keeps the uint8 pixel values unchanged.
    img = x.copy().reshape((img_height, img_width, 3))
    img = deprocess_image(img)
    fname = result_prefix + '_at_iteration_%d.png' % i
    save_img(fname, img, scale=False)
    end_time = time.time()
    print('Image saved as', fname)
    print('Iteration %d completed in %ds' % (i, end_time - start_time))

> Here’s what we get:

运行之后我们可以得到：

![sample images](imgs/f8.8.jpg)

图8-8 风格迁移的一些生成图像

> Keep in mind that what this technique achieves is merely a form of image
re-texturing, or texture transfer. It will work best with style reference images that are
strongly textured and highly self-similar, and with content targets that don’t require high
levels of details in order to be recognizable. It would typically not be able to achieve
fairly abstract feats such as "transferring the style of one portrait to another". The
algorithm is closer to classical signal processing than to AI, so don’t expect it to work
like magic!

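Keep in mind that this technique is merely a form of image re-texturing, or texture transfer. It works best when the style reference image is strongly textured and highly self-similar, and when the content target does not need a high level of detail to stay recognizable. It typically cannot achieve fairly abstract feats such as "transferring the style of one portrait to another". The algorithm is closer to classical signal processing than to AI, so don't expect it to work like magic.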

> Additionally, do note that running this style transfer algorithm is quite slow.
However, the transformation operated by our setup is simple enough that it can be
learned by a small, fast feedforward convnet as well—as long as you have appropriate
training data available. Fast style transfer can thus be achieved by first spending a lot of
compute cycles to generate input-output training examples for a fixed style reference
image, using the above method, and then training a simple convnet to learn this
style-specific transformation. Once that is done, stylizing a given image is instantaneous:
it’s just a forward pass of this small convnet.

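Note also that running this style transfer algorithm is quite slow. However, the transformation it performs is simple enough that a small, fast feedforward convnet can learn it, as long as appropriate training data is available. Fast style transfer can therefore be achieved by first spending a lot of compute to generate input-output training examples for one fixed style reference image with the method above, and then training a simple convnet to learn this style-specific transformation. Once that is done, stylizing a given image is nearly instantaneous: it is just one forward pass through the small convnet.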

### 8.3.5 小结

> - Style transfer consists in creating a new image that preserves the "contents" of a target
image while also capturing the "style" of a reference image.
- "Content" can be captured by the high-level activations of a convnet.
- "Style" can be captured by the internal correlations of the activations of different layers
of a convnet.
- Hence deep learning allows style transfer to be formulated as an optimization process
using a loss defined with a pre-trained convnet.
- Starting from this basic idea, many variants and refinements are possible!

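- Style transfer consists of creating a new image that preserves the "content" of a target image while capturing the "style" of a reference image.
- "Content" can be captured by the high-level activations of a convnet.
- "Style" can be captured by the internal correlations of the activations of different layers of a convnet.
- Deep learning therefore lets style transfer be formulated as an optimization process, using a loss defined with a pre-trained convnet.
- Starting from this basic idea, many variants and refinements are possible.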

## 8.4 使用变分自动编码生成图像

> Sampling from a latent space of images to create entirely new images, or edit existing
ones, is currently the most popular and successful application of creative AI. In this
section and the next one, we review some of the high-level concepts pertaining to image
generation, alongside implementations details relative to the two main techniques in this
domain: Variational Autoencoders (VAEs) and Generative Adversarial Networks
(GANs). The techniques we present here are not specific to images—one could develop
latent spaces of sound, music, or even text, using GANs or VAEs—but in practice the
most interesting results have been obtained with pictures, and that is what we focus on
here.

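Sampling from a latent space of images to create entirely new images, or to edit existing ones, is currently the most popular and successful application of creative AI. In this section and the next, we review some high-level concepts of image generation, along with implementation details for the two main techniques in this domain: variational autoencoders (VAEs) and generative adversarial networks (GANs). The techniques presented here are not specific to images (one could develop latent spaces of sound, music, or even text with GANs or VAEs), but in practice the most interesting results have been obtained with pictures, so that is our focus.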

### 8.4.1 从图像潜空间取样

> The key idea of image generation is to develop a low-dimensional latent space of
representations (which naturally is a vector space, i.e. a geometric space), where any
point can be mapped to a realistic-looking image. The module capable of realizing this
mapping, taking as input a latent point and outputting an image, i.e. a grid of pixels, is
called a generator (in the case of GANs) or a decoder (in the case of VAEs). Once such a
latent space has been developed, one may sample points from it, either deliberately or at
random, and by mapping them to image space, generate images never seen before.

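The key idea of image generation is to develop a low-dimensional latent space of representations (naturally a vector space, that is, a geometric space) in which any point can be mapped to a realistic-looking image. The module that performs this mapping, taking a latent point as input and outputting an image (a grid of pixels), is called a generator in the case of GANs or a decoder in the case of VAEs. Once such a latent space has been developed, one can sample points from it, either deliberately or at random, and map them to image space to generate images that have never been seen before.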

![latent space](imgs/f8.9.jpg)

图8-9 从图像的潜空间中学习然后取样获得新的图像

> GANs and VAEs are simply two different strategies for learning such latent spaces of
image representations, with each its own characteristics. VAEs are great for learning
latent spaces that are well-structured, where specific directions encode a meaningful axis
of variation in the data. GANs generate images that can potentially be highly realistic, but
the latent space they come from may not have as much structure and continuity.

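GANs and VAEs are simply two different strategies for learning such latent spaces of image representations, each with its own characteristics. VAEs are well suited to learning well-structured latent spaces, where specific directions encode a meaningful axis of variation in the data. GANs can generate highly realistic images, but the latent space they come from may not have as much structure and continuity.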

![VAE continuous latent space](imgs/f8.10.jpg)

图8-10 Tom White使用VAE学习得到的连续潜空间生成的图像

### 8.4.2 图像编辑中的概念向量

> We already hinted at the idea of a "concept vector" when we covered word embeddings
in Chapter 6. The idea is still the same: given a latent space of representations, or an
embedding space, certain directions in the space may encode interesting axes of variation
in the original data. In a latent space of images of faces, for instance, there may be a
"smile vector" s , such that if latent point z is the embedded representation of a certain
face, then latent point z + s is the embedded representation of the same face, smiling.
Once one has identified such a vector, it then becomes possible to edit images by
projecting them into the latent space, moving their representation in a meaningful way,
then decoding them back to image space. There are concept vectors for essentially any
independent dimension of variation in image space—in the case of faces, one may
discover vectors for adding sunglasses to a face, removing glasses, turning a male face
into female face, etc.

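We already touched on the idea of a "concept vector" when covering word embeddings in Chapter 6. The idea is the same here: given a latent space of representations, or an embedding space, certain directions in that space may encode interesting axes of variation in the original data. In a latent space of face images, for instance, there may be a "smile vector" `s` such that if latent point `z` is the embedded representation of a certain face, then `z + s` is the embedded representation of the same face, smiling. Once such a vector has been identified, images can be edited by projecting them into the latent space, moving their representation in a meaningful direction, and decoding the result back to image space. Concept vectors exist for essentially any independent dimension of variation in image space: for faces, one may find vectors for adding sunglasses, removing glasses, turning a male face into a female face, and so on.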
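
One common way to estimate such a vector is to take the difference between the mean latent codes of two labelled groups. The NumPy sketch below illustrates only the arithmetic. The latent codes are synthetic, and the names `neutral_codes`, `smiling_codes` and `planted_direction` are hypothetical. In practice the codes would come from a trained encoder, and this may not be the exact procedure used for the examples shown in this section.

```python
import numpy as np

rng = np.random.default_rng(0)
demo_latent_dim = 8

# Synthetic latent codes: the "smiling" group is shifted along one planted direction
planted_direction = np.zeros(demo_latent_dim)
planted_direction[0] = 2.0
neutral_codes = rng.normal(size=(500, demo_latent_dim))
smiling_codes = rng.normal(size=(500, demo_latent_dim)) + planted_direction

# Concept vector estimated as the difference of the two group means
smile_vector = smiling_codes.mean(axis=0) - neutral_codes.mean(axis=0)

# Editing: move one latent point along the concept direction (decoding would follow)
z_point = neutral_codes[0]
z_edited = z_point + smile_vector
print(np.round(smile_vector, 2))
```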

> Here is an example of a "smile vector", a concept vector discovered by Tom White
from the Victoria University School of Design in New Zealand, using VAEs trained on a
dataset of faces of celebrities (the CelebA dataset):

下面是一个“微笑向量”的例子，这是由新西兰维多利亚大学设计学院的Tom White发现的，他使用了VAE在一个名人脸谱数据集上训练得到：

![smile vector](imgs/f8.11.jpg)

图8-11 微笑向量

### 8.4.3 变分自动编码器

> Variational autoencoders, simultaneously discovered by Kingma & Welling in December
2013, and Rezende, Mohamed & Wierstra in January 2014, are a kind of generative
model that is especially appropriate for the task of image editing via concept vectors.
They are a modern take on autoencoders—a type of network that aims to "encode" an
input to a low-dimensional latent space then "decode" it back—that mixes ideas from
deep learning with Bayesian inference.

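Variational autoencoders were discovered simultaneously by Kingma and Welling in December 2013 and by Rezende, Mohamed and Wierstra in January 2014. They are a kind of generative model that is especially well suited to image editing via concept vectors. They are a modern take on autoencoders (networks that "encode" an input into a low-dimensional latent space and then "decode" it back) that mixes ideas from deep learning with Bayesian inference.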

> A classical image autoencoder takes an image, maps it to a latent vector space via an
"encoder" module, then decodes it back to an output with the same dimensions as the
original image, via a "decoder" module. It is then trained by using as target data the same
images as the input images, meaning that the autoencoder learns to reconstruct the
original inputs. By imposing various constraints on the "code", i.e. the output of the
encoder, one can get the autoencoder to learn more or less interesting latent
representations of the data. Most commonly, one would constrain the code to be very
low-dimensional and sparse (i.e. mostly zeros), in which case the encoder acts as a way
to compress the input data into fewer bits of information.

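A classical image autoencoder takes an image, maps it to a latent vector space with an "encoder" module, and then decodes it back to an output with the same dimensions as the original image with a "decoder" module. It is trained using the input images themselves as the target data, so the autoencoder learns to reconstruct its original inputs. By imposing various constraints on the "code", that is, the output of the encoder, one can get the autoencoder to learn more or less interesting latent representations of the data. Most commonly, the code is constrained to be very low-dimensional and sparse (mostly zeros), in which case the encoder acts as a way to compress the input data into fewer bits of information.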
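
As a minimal sketch of the "reconstruct your own input" objective, the NumPy example below trains a purely linear autoencoder by gradient descent on synthetic data that lies in a 2D subspace of an 8D space. The data, learning rate and step count are assumptions chosen for illustration. Real image autoencoders use nonlinear, usually convolutional, encoders and decoders.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic inputs: 200 points in 8 dimensions that really live on a 2D subspace
basis = rng.normal(size=(2, 8))
X = rng.normal(size=(200, 2)) @ basis

# Linear encoder (8 -> 2) and decoder (2 -> 8), small random initialization
W_enc = rng.normal(scale=0.1, size=(8, 2))
W_dec = rng.normal(scale=0.1, size=(2, 8))

def reconstruction_loss(X, W_enc, W_dec):
    return np.mean((X @ W_enc @ W_dec - X) ** 2)

initial_loss = reconstruction_loss(X, W_enc, W_dec)
learning_rate = 0.01
for _ in range(1000):
    code = X @ W_enc                      # encode: (200, 2)
    err = code @ W_dec - X                # decode and compare with the input itself
    grad_dec = code.T @ err * (2.0 / err.size)
    grad_enc = X.T @ (err @ W_dec.T) * (2.0 / err.size)
    W_dec -= learning_rate * grad_dec
    W_enc -= learning_rate * grad_enc
final_loss = reconstruction_loss(X, W_enc, W_dec)
print(initial_loss, final_loss)
```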

![autoencoder](imgs/f8.12.jpg)

图8-12 自动编码器，将输入x编码到低维度潜空间，实现压缩后重新解码到原始数据空间

> In practice, such classical autoencoders don’t lead to particularly useful or
well-structured latent spaces. They’re not particularly good at compression, either. For
these reasons, they have largely fallen out of fashion over the past years. Variational
autoencoders, however, augment autoencoders with a little bit of statistical magic that
forces them to learn continuous, highly structured latent spaces. They have turned out to
be a very powerful tool for image generation.

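In practice, such classical autoencoders do not lead to particularly useful or well-structured latent spaces, and they are not particularly good at compression either. For these reasons they have largely fallen out of fashion in recent years. Variational autoencoders, however, augment autoencoders with a bit of statistical machinery that forces them to learn continuous, highly structured latent spaces. They have turned out to be a very powerful tool for image generation.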

> A VAE, instead of compressing its input image into a fixed "code" in the latent space,
turns the image into the parameters of a statistical distribution: a mean and a variance.
Essentially, this means that we are assuming that the input image has been generated by a
statistical process, and that the randomness of this process should be taken into
account during encoding and decoding. The VAE then uses the mean and variance
parameters to randomly sample one element of the distribution, and decodes that element
back to the original input. The stochasticity of this process improves robustness and
forces the latent space to encode meaningful representations everywhere, i.e. every point
sampled in the latent will be decoded to a valid output.

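Instead of compressing its input image into a fixed "code" in the latent space, a VAE turns the image into the parameters of a statistical distribution: a mean and a variance. In essence, we assume that the input image was generated by a statistical process, and that the randomness of this process should be taken into account during encoding and decoding. The VAE then uses the mean and variance parameters to randomly sample one element of the distribution, and decodes that element back to the original input. The stochasticity of this process improves robustness and forces the latent space to encode meaningful representations everywhere: every point sampled in the latent space is decoded to a valid output.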

![VAE](imgs/f8.13.jpg)

图8-13 VAE将图像映射到两个向量上，z_mean和z_log_sigma，它们能有效表示图像的概率分布，在分布中可以取样并解码到原始空间

> In technical terms, here is how a variational autoencoder works. First, an encoder
module turns the input samples input_img into two parameters in a latent space of
representations, which we will note z_mean and z_log_variance . Then, we randomly
sample a point z from the latent normal distribution that is assumed to generate the input
image, via z = z_mean + exp(z_log_variance) * epsilon , where epsilon is a
random tensor of small values. Finally, a decoder module will map this point in the latent
space back to the original input image. Because epsilon is random, the process ensures
that every point that is close to the latent location where we encoded input_img ( z_mean )
can be decoded to something similar to input_img , thus forcing the latent space to be
continuously meaningful. Any two close points in the latent space will decode to highly
similar images. Continuity, combined with the low dimensionality of the latent space,
forces every direction in the latent space to encode a meaningful axis of variation of the
data, making the latent space very structured and thus highly suitable to manipulation via
concept vectors.

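In technical terms, a variational autoencoder works as follows. First, an encoder module turns the input sample `input_img` into two parameters in a latent space of representations, which we call `z_mean` and `z_log_variance`. Then we randomly sample a point `z` from the latent normal distribution that is assumed to generate the input image:

$$z = \mathrm{z\_mean} + \exp(\mathrm{z\_log\_variance}) \cdot \epsilon$$

where $\epsilon$ is a random tensor of small values. Finally, a decoder module maps this point in the latent space back to the original input image. Because $\epsilon$ is random, the process ensures that every point close to the latent location where we encoded `input_img` (`z_mean`) can be decoded to something similar to `input_img`, which forces the latent space to be continuously meaningful. Any two close points in the latent space decode to highly similar images. Continuity, combined with the low dimensionality of the latent space, forces every direction in the latent space to encode a meaningful axis of variation of the data, making the latent space highly structured and therefore well suited to manipulation via concept vectors.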
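
The sampling step can be checked numerically. The NumPy sketch below follows the formula as written in this section, with `exp(z_log_variance)` acting as the scale of the noise. The values in `demo_mean` and `demo_log_variance` are arbitrary assumptions. It draws many points and confirms that they cluster around the mean with the expected spread.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical encoder outputs for one image, with a 2D latent space
demo_mean = np.array([1.0, -2.0])
demo_log_variance = np.array([-3.0, -3.0])

# epsilon ~ N(0, 1); each row is one random draw
epsilon = rng.normal(size=(10000, 2))
samples = demo_mean + np.exp(demo_log_variance) * epsilon

print(samples.mean(axis=0), samples.std(axis=0))
```

Every draw lands near `demo_mean`, which is why the decoder is pushed to produce similar images for a whole neighbourhood of latent points rather than for one exact code.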

> The parameters of a VAE are trained via two loss functions: first, a reconstruction
loss that forces the decoded samples to match the initial inputs, and a regularization loss,
which helps in learning well-formed latent spaces and reducing overfitting to the training
data.

VAE的参数需要通过两个损失函数来训练：第一个是重建损失，用来令解码后的样本接近原始输入，另一个是正则化损失，用来帮助学习到良好结构的潜空间和减少对训练数据的过拟合。

> Let’s quickly go over a Keras implementation of a VAE. Schematically, it looks like
this:

让我们快速看一下VAE在Keras中的实现。简单来说，如下：

```python
# 将输入编码成一个均值和方差参数
z_mean, z_log_variance = encoder(input_img)

# 从概率分布中取样一个点
z = z_mean + exp(z_log_variance) * epsilon

# 然后将z解码回到原始图像空间
reconstructed_img = decoder(z)

# 实例化模型
model = Model(input_img, reconstructed_img)

# 然后使用两个损失函数来训练模型
# 重建损失和正则化损失
```

> Here is the encoder network we will use: a very simple convnet which maps the input
image x to two vectors, z_mean and z_log_variance .

下面是一个编码器网络：它由一个简单的卷积网络构成，将输入的图像x转换成两个向量，`z_mean`和`z_log_variance`。

In [2]:
import tensorflow.keras as keras
from tensorflow.keras import layers
from tensorflow.keras import backend as K
from tensorflow.keras.models import Model
import numpy as np

img_shape = (28, 28, 1)
batch_size = 16
latent_dim = 2 # 潜空间的维度：平面

input_img = keras.Input(shape=img_shape)
x = layers.Conv2D(32, 3, padding='same', activation='relu')(input_img)
x = layers.Conv2D(64, 3, padding='same', activation='relu', strides=(2, 2))(x)
x = layers.Conv2D(64, 3, padding='same', activation='relu')(x)
x = layers.Conv2D(64, 3, padding='same', activation='relu')(x)
shape_before_flattening = K.int_shape(x)
x = layers.Flatten()(x)
x = layers.Dense(32, activation='relu')(x)
z_mean = layers.Dense(latent_dim)(x)
z_log_var = layers.Dense(latent_dim)(x)

> Here is the code for using z_mean and z_log_var , the parameters of the statistical
distribution assumed to have produced input_img , to generate a latent space point z .
Here, we wrap some arbitrary code (built on top of Keras backend primitives) into a
Lambda layer. In Keras, everything needs to be a layer, so code that isn’t part of a built-in
layer should be wrapped in a Lambda (or else, in a custom layer).

下面是使用`z_mean`和`z_log_var`的代码，两个假设用来生成输入图像的统计学分布参数。下面的代码取样潜空间的点z。这里我们将取样的函数代码（在Keras backend原语上构建）封装成一个Lambda层。在Keras中，任何东西都应该是一个层，因此所有不属于内建层的代码都应该封装到Lambda（或者自定义层）之中。

In [3]:
def sampling(args):
    z_mean, z_log_var = args
    epsilon = K.random_normal(shape=(K.shape(z_mean)[0], latent_dim), mean=0., stddev=1.)
    return z_mean + K.exp(z_log_var) * epsilon

z = layers.Lambda(sampling)([z_mean, z_log_var])

> This is the decoder implementation: we reshape the vector z to the dimensions of an
image, then we use a few convolution layers to obtain a final image output that has the
same dimensions as the original input_img .

然后是解码器实现：我们将z向量重新转换成一张图像，然后我们使用几个卷积层来获得与原始图像相同维度的输出图像。

In [4]:
# 解码器的输入我们会使用z
decoder_input = layers.Input(K.int_shape(z)[1:])

# 使用正确数量的单元提升采样
x = layers.Dense(np.prod(shape_before_flattening[1:]), activation='relu')(decoder_input)

# 恢复成铺平之前的图像形状
x = layers.Reshape(shape_before_flattening[1:])(x)

# Reverse the encoder's strided convolution with a `Conv2DTranspose` layer and matching parameters
x = layers.Conv2DTranspose(32, 3, padding='same', activation='relu', strides=(2, 2))(x)
x = layers.Conv2D(1, 3, padding='same', activation='sigmoid')(x)

# 最后我们就获得了一个与原始输入相同尺寸的特征地图

# 然后定义解码器模型
decoder = Model(decoder_input, x)

# 然后就可以将它应用到`z`上得到解码图像
z_decoded = decoder(z)

> The dual loss of a VAE doesn’t fit the traditional expectation of a sample-wise
function of the form loss(input, target) . Thus, we set up the loss by writing a
custom layer which internally leverages the built-in add_loss layer method to create an
arbitrary loss.

VAE的双损失与常用的样本相关的函数形式`loss(input, target)`无法匹配。因此我们需要编写一个自定义的层来构建损失，在其内部使用内建的`add_loss`方法来获得任意的损失函数定义。

In [5]:
class CustomVariationalLayer(keras.layers.Layer):
    
    def vae_loss(self, x, z_decoded):
        x = K.flatten(x)
        z_decoded = K.flatten(z_decoded)
        xent_loss = keras.metrics.binary_crossentropy(x, z_decoded)
        kl_loss = -5e-4 * K.mean(1 + z_log_var - K.square(z_mean) - K.exp(z_log_var), axis=-1)
        return K.mean(xent_loss + kl_loss)
    
    def call(self, inputs):
        x = inputs[0]
        z_decoded = inputs[1]
        loss = self.vae_loss(x, z_decoded)
        self.add_loss(loss, inputs=inputs)
        # 我们不会使用这个层来输出
        return x
    
# 使用输入和解码输出调用我们自定义的层次，来获取最终模型的输出
y = CustomVariationalLayer()([input_img, z_decoded])
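
The regularization term in `vae_loss` has the shape of the closed-form KL divergence between a diagonal Gaussian and the standard normal prior, if `z_log_var` is read as a log-variance. The NumPy sketch below evaluates that closed form with its standard `-0.5 * sum(...)` weighting. The layer above instead uses `-5e-4 * mean(...)`, a much smaller weight, so treat this as an illustration of the term's behaviour rather than a reproduction of the layer. Note also that under the log-variance reading, the matching standard deviation would be `exp(0.5 * z_log_var)`.

```python
import numpy as np

def kl_to_standard_normal(mean, log_var):
    """KL( N(mean, exp(log_var)) || N(0, 1) ), summed over the latent dimensions."""
    return -0.5 * np.sum(1.0 + log_var - np.square(mean) - np.exp(log_var), axis=-1)

kl_at_prior = kl_to_standard_normal(np.zeros(2), np.zeros(2))
kl_shifted_mean = kl_to_standard_normal(np.array([1.0, 0.0]), np.zeros(2))
kl_wider = kl_to_standard_normal(np.zeros(2), np.array([1.0, 0.0]))
print(kl_at_prior, kl_shifted_mean, kl_wider)
```

The term is zero only when the encoded distribution equals the prior, and it grows as the mean drifts from zero or the variance drifts from one. That is what keeps the latent codes packed around the origin.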



> Finally, we instantiate and train the model. Since the loss has been taken care of in
our custom layer, we don’t specify an external loss at compile time ( loss=None ), which
in turn means that we won’t pass target data during training (as you can see we only
pass x_train to the model in fit ).

最后构建和训练这个模型，因为损失已经在自定义层次中计算了，所以我们在编译模型时无需指定额外的损失函数（`loss=None`），这也意味着模型训练时不会传递目标数据参数给模型（下面的代码可以看到我们只传递了x_train到模型训练）。

In [None]:
from tensorflow.keras.datasets import mnist

import tensorflow as tf
tf.compat.v1.enable_eager_execution()

vae = Model(input_img, y)
vae.compile(optimizer='rmsprop', loss=None)
vae.summary()

# 在MNIST数据集上训练我们的VAE模型
(x_train, _), (x_test, y_test) = mnist.load_data()

x_train = x_train.astype('float32') / 255.
x_train = x_train.reshape(x_train.shape + (1,))
x_test = x_test.astype('float32') / 255.
x_test = x_test.reshape(x_test.shape + (1,))

vae.fit(x=x_train, y=None, shuffle=True,
        epochs=10, batch_size=batch_size, validation_data=(x_test, None))

> Once such a model is trained—e.g. on MNIST, in our case—we can use the decoder
network to turn arbitrary latent space vectors into images:

模型训练好了之后，比方说在MNIST数据集上，就可以使用解码器网络来在潜空间取样获得图像：

In [None]:
import matplotlib.pyplot as plt
from scipy.stats import norm

%matplotlib inline

# 展示一个手写数字的2D流形
n = 15 # 15x15的网格
digit_size = 28
figure = np.zeros((digit_size * n, digit_size * n))

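# Map linearly spaced coordinates on the unit interval through the inverse CDF (`ppf`)
# of the normal distribution to get latent values z, because the latent prior is Gaussian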
grid_x = norm.ppf(np.linspace(0.05, 0.95, n))
grid_y = norm.ppf(np.linspace(0.05, 0.95, n))

for i, yi in enumerate(grid_x):
    for j, xi in enumerate(grid_y):
        z_sample = np.array([[xi, yi]])
        z_sample = np.tile(z_sample, batch_size).reshape(batch_size, 2)
        x_decoded = decoder.predict(z_sample, batch_size=batch_size)
        digit = x_decoded[0].reshape(digit_size, digit_size)
        figure[i * digit_size: (i + 1) * digit_size, j * digit_size: (j + 1) * digit_size] = digit
plt.figure(figsize=(10, 10))
plt.imshow(figure, cmap='Greys_r')

![decode numbers](imgs/f8.14.jpg)

图8-14 从潜空间中获得手写数字
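
The grid coordinates come from pushing evenly spaced probabilities through the inverse CDF of the standard normal distribution. Here is a small sketch of that step using only the standard library's `statistics.NormalDist` in place of `scipy.stats.norm.ppf`. This substitution is an assumption made to keep the example dependency-free.

```python
from statistics import NormalDist
import numpy as np

n_points = 15
probabilities = np.linspace(0.05, 0.95, n_points)

# Inverse CDF of N(0, 1): equal steps in probability become unequal steps in z
grid = np.array([NormalDist().inv_cdf(p) for p in probabilities])
spacing = np.diff(grid)
print(np.round(grid, 3))
```

The points are dense near zero and sparse in the tails, so the grid spends most of its cells where the prior puts most of its mass.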

> The grid of sampled digits shows a completely continuous distribution of the different
digit classes, with one digit morphing into another as you follow a path through latent
space. Specific directions in this space have a meaning, e.g. there is a direction for
"four-ness", "one-ness", etc.

上面的数字网格完全展示了不同数字种类的连续分布，从一个数字变化到另外一个数字就像你在潜空间中沿着某个方向前进一样。在这个空间中特定的方向有着相应的意义，例如有一个方向表示“4”、“1”等。

> In the next section, we cover in detail the other major tool for generating artificial
images: generative adversarial networks (GANs).

在下一节中，我们会介绍另一个生成人工图像的主要工具：生成对抗网络（GAN）。

### 8.4.4 小结

> Image generation with deep learning is done by learning latent spaces that capture
statistical information about a dataset of images. By sampling points from the latent
space, and "decoding" them, one can generate never-seen-before images. There are two
major tools to do this: VAEs and GANs.

> - VAEs result in highly structured, continuous latent representations. For this reason, they
work well for doing all sorts of image editing in latent space, like face swapping, turning a
frowning face into a smiling face, and so on. They also work nicely for doing latent space
based animations, i.e. animating a walk along a cross section of the latent space, showing
a starting image slowly morphing into different images in a continuous way.
- GANs enable the generation of realistic single-frame images, but may not induce latent
spaces with solid structure and high continuity.

深度学习中的图像生成需要通过模型学习到捕获到图像数据集上的统计学信息的潜空间来实现。从潜空间中取样点，然后“解码”，就能生成之前不存在的图像。有两个主要的工具来完成这项任务：VAE和GAN。

- VAE能够获得高度结构化连续的潜空间。因此它能够完成各种各样的图像在潜空间进行编辑的工作，例如换脸、将皱眉表情变为微笑表情等等。它也能应用在实现潜空间动画上，例如在潜空间中沿着一个切面形成动画、展示一张初始图像然后连续渐变到其他图像上。
- GAN能够生成单帧的真实图像，但是它的潜空间可能不是结构化和高度连续的。

> Most successful practical applications I have seen with images actually rely on VAEs,
but GANs are extremely popular in the world of academic research—at least circa
2016-2017. You will find out how they work and how to implement one in the next
section.

很多成功的实际图像应用都依赖着VAE，但是GAN在学术领域却是异常流行，至少在2016-2017左右是这样。你可以在下一节看到GAN的工作原理。

> To play further with image generation, I suggest working with the CelebA dataset,
"Large-scale Celeb Faces Attributes". It’s a free-to-download image dataset with more
than 200,000 celebrity portraits. It’s great for experimenting with concept vectors in
particular. It beats MNIST for sure.

要进一步学习验证图像生成，作者建议使用CelebA数据集，这是一个“大规模名人脸谱数据集”。它可以免费下载，内含超过20万个名人肖像。它对于实验概念向量非常合适。肯定比MNIST数据集要好。

## 8.5 生成对抗网络简介

> Generative Adversarial Networks (GANs), introduced in 2014 by Ian Goodfellow, are an
alternative to VAEs for learning latent spaces of images. They enable the generation of
fairly realistic synthetic images by forcing the generated images to be statistically almost
indistinguishable from real ones.

生成对抗网络（GAN）是2014年由Ian Goodfellow提出的，它是除VAE外另一种学习图像潜空间的方法。它能生成相当真实的合成图像，通过让生成图像的统计学特征与真实图像基本一致来实现。

> An intuitive way to understand GANs is to imagine a forger trying to create a fake
Picasso painting. At first, the forger is pretty bad at the task. He mixes some of his fakes
with authentic Picassos, and shows them all to an art dealer. The art dealer makes an
authenticity assessment for each painting, and gives the forger feedback about what
makes a Picasso look like a Picasso. The forger goes back to his atelier to prepare some
new fakes. As time goes on, the forger becomes increasingly competent at imitating the
style of Picasso, and the art dealer becomes increasingly expert at spotting fakes. In the
end, we have on our hands some excellent fake Picassos.

理解GAN的一个直观方式是想象有一个伪造者尝试伪造毕加索的画作。一开始的时候伪造者很不擅长这个任务。他将自己伪造的作品混入毕加索的真迹当中展示给艺术鉴赏人士。鉴赏人对每幅画作进行真伪评价，然后反馈给伪造者评判毕加索真迹的信息。伪造者根据这些反馈信息，回到他的工作室重新绘制一些新的赝品。随着时间推进，伪造者越来越擅长仿制毕加索画作这项任务，而同时鉴赏人也在鉴别赝品领域变得越来越专业。最终，我们就能得到一些非常逼真的毕加索赝品。

> That’s what GANs are: a forger network and an expert network, each being
trained to best the other. As such, a GAN is made of two parts:

> - A generator network , which takes as input a random vector (a random point in the latent
space) and decodes it into a synthetic image.
- A discriminator network (also called adversary ), which takes as input an image (real or
synthetic), and must predict whether the image came from the training set or was created
by the generator network.

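That is what a GAN is: a forger network and an expert network, each trained to beat the other. A GAN therefore has two parts:

- A generator network, which takes a random vector (a random point in the latent space) as input and decodes it into a synthetic image.
- A discriminator network (also called the adversary), which takes an image (real or synthetic) as input and predicts whether it came from the training set or was produced by the generator network.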

> The generator network is trained to be able to fool the discriminator network, and
thus it evolves towards generating increasingly realistic images as training goes on:
artificial images that look indistinguishable from real ones—to the extent that it is
impossible for the discriminator network to tell the two apart. Meanwhile, the
discriminator is constantly adapting to the gradually improving capabilities of the
generator, which sets a very high bar of realism for the generated images. Once training
is over, the generator is capable of turning any point in its input space into a believable
image. Unlike VAEs, this latent space has less explicit guarantees of meaningful
structure, and in particular, it isn’t continuous.

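The generator network is trained to fool the discriminator network, so it produces increasingly realistic images as training proceeds: images that look indistinguishable from real ones, to the point where the discriminator cannot tell the two apart. Meanwhile, the discriminator keeps adapting to the generator's improving capabilities, which keeps raising the bar of realism that generated images must clear. Once training is over, the generator can turn any point of its input space into a believable image. Unlike with VAEs, this latent space comes with fewer explicit guarantees of meaningful structure; in particular, it is not continuous.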

![GAN](imgs/f8.15.jpg)

Figure 8-15. How a generative adversarial network works

> Remarkably, a GAN is a system where the optimization minimum isn’t fixed—unlike
in any other training setup you have encountered in this book before. Normally, gradient
descent consists in rolling down some hills in a static loss landscape. However, with a
GAN, every step taken down the hill changes the entire landscape by a bit. It’s a dynamic
system where the optimization process is seeking not a minimum, but rather an
equilibrium between two forces. For this reason, GANs are notoriously very difficult to
train: getting a GAN to work requires lots of careful tuning of the model architecture and
training parameters.

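Unlike every other training setup covered earlier in this book, a GAN is a system whose optimization minimum is not fixed. Normally, gradient descent amounts to rolling downhill in a static loss landscape. With a GAN, every step taken down the hill changes the whole landscape a little. It is a dynamic system in which optimization seeks not a minimum but an equilibrium between two forces. For this reason, GANs are notoriously hard to train: getting one to work requires a lot of careful tuning of the model architecture and training parameters.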
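
The "moving landscape" can be made concrete with a toy two-player game that is not taken from the book. One player controls `x` and tries to minimize `f(x, y) = x * y`, while the other controls `y` and tries to maximize it. The only equilibrium is `(0, 0)`, yet simultaneous gradient steps circle around it and drift outward instead of settling there. The function, step size, and starting point below are illustrative assumptions.

```python
import numpy as np

def simultaneous_steps(x, y, lr=0.1, n_steps=100):
    """Gradient descent on x and gradient ascent on y for f(x, y) = x * y."""
    radii = [np.hypot(x, y)]  # distance from the equilibrium (0, 0)
    for _ in range(n_steps):
        grad_x = y  # df/dx
        grad_y = x  # df/dy
        # Both players move at once, each reacting to the other's old position
        x, y = x - lr * grad_x, y + lr * grad_y
        radii.append(np.hypot(x, y))
    return np.array(radii)

radii = simultaneous_steps(x=1.0, y=1.0)
print("distance from the equilibrium at the start:", radii[0])
print("distance from the equilibrium at the end:  ", radii[-1])
```

Each player follows its own gradient correctly, but the joint update multiplies the squared distance from the equilibrium by `1 + lr**2` at every step, so neither player's "hill" stays put long enough to reach the bottom.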

![GAN example](imgs/f8.16.jpg)

Figure 8-16. Images generated by Mike Tyka with a multistage GAN trained on a dataset of faces. [Mike Tyka's website](https://miketyka.com/)

### 8.5.1 A schematic GAN implementation

> In what follows, we explain how to implement a GAN in Keras, in its barest form—since
GANs are quite advanced, diving deeply into the technical details would be out of scope
for us. Our specific implementation will be a deep convolutional GAN, or DCGAN: a
GAN where the generator and discriminator are deep convnets. In particular, it leverages
a Conv2DTranspose layer for image upsampling in the generator.

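What follows explains how to implement a GAN in Keras in its barest form; GANs are quite advanced, and diving deep into the technical details would be beyond the scope of this book. The implementation is a deep convolutional GAN, or DCGAN: a GAN in which the generator and the discriminator are both deep convnets. In particular, it uses a `Conv2DTranspose` layer for upsampling in the generator.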

> We will train our GAN on images from CIFAR10, a dataset of 50,000 32x32 RGB
images belonging to 10 classes (5,000 images per class). To make things even easier, we
will only use images belonging to the class "frog".

We will train the GAN on images from CIFAR10, a dataset of 50,000 32x32 RGB images belonging to 10 classes (5,000 images per class). To make the task even easier, we use only the images of the class "frog".
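
Selecting a single class amounts to a boolean mask over the label array. The sketch below shows the idea on random stand-in arrays that merely mimic CIFAR10's layout (`uint8` images, labels of shape `(n, 1)`); the array sizes and the random data are assumptions made so the snippet runs without downloading anything.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in data with CIFAR10's layout, NOT the real dataset
n_images = 1000
images = rng.integers(0, 256, size=(n_images, 32, 32, 3), dtype=np.uint8)
labels = rng.integers(0, 10, size=(n_images, 1))

FROG = 6  # index of the "frog" class in CIFAR10

# Flatten the (n, 1) labels so the mask indexes the first axis of `images`
mask = labels.flatten() == FROG
frogs = images[mask].astype("float32") / 255.0  # scale pixels to [0, 1]

print("selected:", frogs.shape[0], "of", n_images, "images")
print("pixel range:", frogs.min(), "to", frogs.max())
```

On the real dataset the same mask would keep the 5,000 frog images out of 50,000.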

> Schematically, our GAN looks like this:

> - A generator network maps vectors of shape (latent_dim,) to images of shape (32,
32, 3) .
- A discriminator network maps images of shape (32, 32, 3) to a binary score estimating
the probability that the image is real.
- A gan network chains the generator and the discriminator together: gan(x) =
discriminator(generator(x)) . Thus this gan network maps latent space vectors to
the discriminator’s assessment of the realism of these latent vectors as decoded by the
generator.
- We train the discriminator using examples of real and fake images along with
"real"/"fake" labels, as we would train any regular image classification model.
- To train the generator, we use the gradients of the generator’s weights with regard to the
loss of the gan model. This means that, at every step, we move the weights of the
generator in a direction that will make the discriminator more likely to classify as "real"
the images decoded by the generator. I.e. we train the generator to fool the discriminator.

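Schematically, the GAN looks like this:

- A generator network decodes vectors of shape `(latent_dim,)` into images of shape `(32, 32, 3)`.
- A discriminator network maps images of shape `(32, 32, 3)` to a binary score that estimates the probability that the image is real.
- A `gan` network chains the generator and the discriminator together: `gan(x) = discriminator(generator(x))`. It therefore maps latent space vectors to the discriminator's assessment of how real the images decoded from them look.
- We train the discriminator on real and fake images together with their "real"/"fake" labels, just as we would train any ordinary image classification model.
- To train the generator, we use the gradients of the `gan` model's loss with respect to the generator's weights. At every step, the generator's weights move a little in the direction that makes the discriminator more likely to classify the generated images as "real". In other words, we train the generator to fool the discriminator.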
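
The last two points can also be written as binary cross-entropy losses. The notation below follows the original GAN formulation rather than the book's text and is offered as a sketch: $G$ is the generator, $D$ the discriminator returning the probability that its input is real, $p_{\text{data}}$ the distribution of training images, and $p_z$ the latent distribution. The generator loss shown is the common "label the fakes as real" variant.

```latex
\begin{aligned}
\mathcal{L}_D &= -\,\mathbb{E}_{x \sim p_{\text{data}}}\big[\log D(x)\big]
                 \;-\; \mathbb{E}_{z \sim p_z}\big[\log\big(1 - D(G(z))\big)\big] \\
\mathcal{L}_G &= -\,\mathbb{E}_{z \sim p_z}\big[\log D(G(z))\big]
\end{aligned}
```

Minimizing $\mathcal{L}_D$ updates only the discriminator's weights, and minimizing $\mathcal{L}_G$ updates only the generator's weights, even though the gradient of $\mathcal{L}_G$ flows through $D$.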

### 8.5.2 A bag of tricks

> Training GANs and tuning GAN implementations is notoriously difficult. There are a
number of known "tricks" that one should keep in mind. Like most things in deep
learning, it is more alchemy than science: these tricks are really just heuristics, not
theory-backed guidelines. They are backed by some level of intuitive understanding of
the phenomenon at hand, and they are known to work well empirically, albeit not
necessarily in every context.

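Training GANs and tuning GAN implementations is notoriously difficult. There are a number of known "tricks" worth keeping in mind. Like much else in deep learning, they are more alchemy than science: these tricks are heuristics, not theory-backed guidelines. They rest on some intuitive understanding of the phenomenon at hand and are known to work well empirically, although not necessarily in every context.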

> Here are a few of the tricks that we leverage in our own implementation of a GAN
generator and discriminator below. It is not an exhaustive list of GAN-related tricks; you
will find many more across the GAN literature.

> - We use tanh as the last activation in the generator, instead of sigmoid , which would be
more commonly found in other types of models.
- We sample points from the latent space using a normal distribution (Gaussian
distribution), not a uniform distribution.
- Stochasticity is good to induce robustness. Since GAN training results in a dynamic
equilibrium, GANs are likely to get "stuck" in all sorts of ways. Introducing randomness
during training helps prevent this. We introduce randomness in two ways: 1) we use
dropout in the discriminator, 2) we add some random noise to the labels for the
discriminator.
- Sparse gradients can hinder GAN training. In deep learning, sparsity is often a desirable
property, but not in GANs. There are two things that can induce gradient sparsity: 1) max
pooling operations, 2) ReLU activations. Instead of max pooling, we recommend using
strided convolutions for downsampling, and we recommend using a LeakyReLU layer
instead of a ReLU activation. It is similar to ReLU but it relaxes sparsity constraints by
allowing small negative activation values.
- In generated images, it is common to see "checkerboard artifacts" caused by unequal
coverage of the pixel space in the generator. To fix this, we use a kernel size that is
divisible by the stride size, whenever we use a strided Conv2DTranspose or Conv2D in
both the generator and discriminator.

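Here are some of the tricks used in the GAN generator and discriminator implemented below. This is not an exhaustive list of GAN-related tricks; you will find many more in the GAN literature.

- We use `tanh` as the last activation of the generator instead of `sigmoid`, which is more common in other kinds of models.
- We sample points from the latent space using a normal (Gaussian) distribution, not a uniform distribution.
- Stochasticity helps robustness. Because GAN training results in a dynamic equilibrium, GANs can get stuck in all sorts of ways. Introducing randomness during training helps prevent this. We introduce it in two ways: 1) dropout in the discriminator, and 2) random noise added to the discriminator's labels.
- Sparse gradients can hinder GAN training. Sparsity is often a desirable property in deep learning, but not in GANs. Two things can induce gradient sparsity: 1) max pooling operations, and 2) ReLU activations. We therefore recommend strided convolutions instead of max pooling for downsampling, and a `LeakyReLU` layer instead of a `ReLU` activation. `LeakyReLU` is similar to `ReLU`, but it relaxes the sparsity constraint by allowing small negative activation values.
- Generated images often show "checkerboard artifacts", caused by unequal coverage of the pixel space in the generator. To fix this, we use a kernel size that is divisible by the stride size whenever we use a strided `Conv2DTranspose` or `Conv2D` layer, in both the generator and the discriminator.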
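
The gradient-sparsity point can be checked numerically. The NumPy sketch below compares the derivative of ReLU with that of a leaky variant on random pre-activations; the slope `alpha=0.3` and the Gaussian inputs are assumptions chosen for illustration, not values taken from the text.

```python
import numpy as np

def relu_grad(x):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise."""
    return (x > 0).astype(float)

def leaky_relu_grad(x, alpha=0.3):
    """Derivative of leaky ReLU: 1 for positive inputs, alpha otherwise."""
    return np.where(x > 0, 1.0, alpha)

rng = np.random.default_rng(0)
pre_activations = rng.normal(size=10_000)  # stand-in for a layer's inputs

relu_zero_fraction = np.mean(relu_grad(pre_activations) == 0.0)
leaky_zero_fraction = np.mean(leaky_relu_grad(pre_activations) == 0.0)

print("share of zero gradients with ReLU:      ", relu_zero_fraction)
print("share of zero gradients with leaky ReLU:", leaky_zero_fraction)
```

With ReLU, roughly half of the units pass no gradient at all for these inputs, whereas the leaky variant always passes at least a scaled-down signal back to the generator.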

![checkboard artifact](imgs/f8.17.jpg)

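Figure 8-17. Checkerboard artifacts caused by mismatched strides and kernel sizes, which result in unequal coverage of the pixel space: one of the many well-known GAN gotchas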
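
The "unequal coverage" behind these artifacts can be counted directly. The sketch below is a one-dimensional simplification (an assumption made for brevity): it counts how many input positions of a transposed convolution write into each output position. The helper name `coverage_1d` is hypothetical.

```python
import numpy as np

def coverage_1d(n_inputs, kernel_size, stride):
    """Count how many input positions contribute to each output position
    of a 1-D transposed convolution without padding."""
    out_len = (n_inputs - 1) * stride + kernel_size
    counts = np.zeros(out_len, dtype=int)
    for i in range(n_inputs):
        # Input i spreads its kernel over outputs [i*stride, i*stride + kernel_size)
        counts[i * stride : i * stride + kernel_size] += 1
    return counts

uneven = coverage_1d(n_inputs=6, kernel_size=3, stride=2)
even = coverage_1d(n_inputs=6, kernel_size=4, stride=2)

print("kernel 3, stride 2:", uneven)
print("kernel 4, stride 2:", even)
```

With a kernel size of 3 and a stride of 2, interior output positions alternate between one and two contributions, which is the 1-D analogue of a checkerboard. With a kernel size of 4, which is divisible by the stride, every interior position receives the same number of contributions.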

### 8.5.3 The generator

> First, we develop a generator model, which turns a vector (from the latent
space; during training it will be sampled at random) into a candidate image. One of the
many issues that commonly arise with GANs is that the generator gets stuck with
generated images that look like noise. A possible solution is to use dropout on both the
discriminator and generator.

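First we build the generator model, which turns a vector (sampled at random from the latent space during training) into a candidate image. One of the many issues that commonly arise with GANs is that the generator gets stuck producing images that look like noise. A possible remedy is to use dropout in both the discriminator and the generator.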

In [8]:
# `from tensorflow import keras` binds the name `keras` used below
# (a bare `import tensorflow.keras` does not)
from tensorflow import keras
from tensorflow.keras import layers
import numpy as np

latent_dim = 32
height = 32
width = 32
channels = 3

generator_input = keras.Input(shape=(latent_dim,))

# First, transform the input into a 16x16 feature map with 128 channels
x = layers.Dense(128 * 16 * 16)(generator_input)
x = layers.LeakyReLU()(x)
x = layers.Reshape((16, 16, 128))(x)

# Then add a convolution layer
x = layers.Conv2D(256, 5, padding='same')(x)
x = layers.LeakyReLU()(x)

# Upsample to 32x32 (the kernel size 4 is divisible by the stride 2)
x = layers.Conv2DTranspose(256, 4, strides=2, padding='same')(x)
x = layers.LeakyReLU()(x)

# Add a few more convolution layers
x = layers.Conv2D(256, 5, padding='same')(x)
x = layers.LeakyReLU()(x)
x = layers.Conv2D(256, 5, padding='same')(x)
x = layers.LeakyReLU()(x)

# Produce a 32x32 feature map with 3 channels (the shape of a CIFAR10 image)
x = layers.Conv2D(channels, 7, activation='tanh', padding='same')(x)
generator = keras.models.Model(generator_input, x)
generator.summary()

Model: "model_3"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_3 (InputLayer)         [(None, 32)]              0         
_________________________________________________________________
dense_4 (Dense)              (None, 32768)             1081344   
_________________________________________________________________
leaky_re_lu (LeakyReLU)      (None, 32768)             0         
_________________________________________________________________
reshape_1 (Reshape)          (None, 16, 16, 128)       0         
_________________________________________________________________
conv2d_5 (Conv2D)            (None, 16, 16, 256)       819456    
_________________________________________________________________
leaky_re_lu_1 (LeakyReLU)    (None, 16, 16, 256)       0         
_________________________________________________________________
conv2d_transpose_1 (Conv2DTr (None, 32, 32, 256)       1048
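
The parameter counts printed in the summary can be reproduced by hand from the layer definitions. The helper names below are hypothetical, and the sketch covers only the first two weighted layers, whose counts are fully visible above.

```python
def dense_params(n_in, n_out):
    """Weights plus one bias per output unit."""
    return n_in * n_out + n_out

def conv2d_params(kernel_size, in_channels, filters):
    """One square kernel per (input channel, filter) pair, plus one bias per filter."""
    return kernel_size * kernel_size * in_channels * filters + filters

latent_dim = 32

# Dense(128 * 16 * 16) applied to a vector of size latent_dim
dense_count = dense_params(latent_dim, 128 * 16 * 16)

# Conv2D(256, 5) applied to the reshaped (16, 16, 128) feature map
first_conv_count = conv2d_params(5, 128, 256)

print("dense_4: ", dense_count)
print("conv2d_5:", first_conv_count)
```

Both values agree with the `Param #` column for `dense_4` and `conv2d_5`, which is a quick way to confirm that the layers were wired with the intended input shapes.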

### 8.5.4 The discriminator

> Then, we develop a discriminator model, that takes as input a candidate image (real or
synthetic) and classifies it into one of two classes, either "generated image" or "real
image that comes from the training set".

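Next we build the discriminator model, which takes a candidate image (real or synthetic) as input and classifies it into one of two classes: "generated image" or "real image from the training set".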

In [9]:
discriminator_input = layers.Input(shape=(height, width, channels))
x = layers.Conv2D(128, 3)(discriminator_input)
x = layers.LeakyReLU()(x)
# Downsample with strided convolutions instead of max pooling
x = layers.Conv2D(128, 4, strides=2)(x)
x = layers.LeakyReLU()(x)
x = layers.Conv2D(128, 4, strides=2)(x)
x = layers.LeakyReLU()(x)
x = layers.Conv2D(128, 4, strides=2)(x)
x = layers.LeakyReLU()(x)
x = layers.Flatten()(x)

# Add a dropout layer: an important trick
x = layers.Dropout(0.4)(x)

# Classification layer
x = layers.Dense(1, activation='sigmoid')(x)

discriminator = keras.models.Model(discriminator_input, x)
discriminator.summary()

# To stabilize training, the optimizer uses learning-rate decay and gradient clipping.
# `learning_rate` replaces the deprecated `lr` spelling; `decay` is accepted by the
# TF 2.x optimizers this notebook was run with, but newer Keras releases may reject it.
discriminator_optimizer = keras.optimizers.RMSprop(learning_rate=0.0008, clipvalue=1.0, decay=1e-8)
discriminator.compile(optimizer=discriminator_optimizer, loss='binary_crossentropy')

Model: "model_4"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_4 (InputLayer)         [(None, 32, 32, 3)]       0         
_________________________________________________________________
conv2d_9 (Conv2D)            (None, 30, 30, 128)       3584      
_________________________________________________________________
leaky_re_lu_5 (LeakyReLU)    (None, 30, 30, 128)       0         
_________________________________________________________________
conv2d_10 (Conv2D)           (None, 14, 14, 128)       262272    
_________________________________________________________________
leaky_re_lu_6 (LeakyReLU)    (None, 14, 14, 128)       0         
_________________________________________________________________
conv2d_11 (Conv2D)           (None, 6, 6, 128)         262272    
_________________________________________________________________
leaky_re_lu_7 (LeakyReLU)    (None, 6, 6, 128)         0   
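
The spatial sizes in the summary follow from the rule for convolutions without padding (the Keras default, `padding='valid'`). The sketch below replays that arithmetic for the four `Conv2D` layers of the discriminator; the helper name is hypothetical, and the last size is computed from the layer definitions because the printed summary is cut off before it.

```python
def valid_conv_output(size, kernel_size, stride=1):
    """Output length of a convolution with no padding."""
    return (size - kernel_size) // stride + 1

# (kernel_size, stride) of the four Conv2D layers, in order
conv_layers = [(3, 1), (4, 2), (4, 2), (4, 2)]

size = 32  # height and width of a CIFAR10 image
sizes = []
for kernel_size, stride in conv_layers:
    size = valid_conv_output(size, kernel_size, stride)
    sizes.append(size)

flattened_units = size * size * 128  # 128 filters in the last Conv2D

print("feature map sizes:", sizes)
print("units after Flatten:", flattened_units)
```

The first three sizes match the `Output Shape` column above, and the product gives the length of the vector that reaches the dropout and classification layers.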

### 8.5.5 The adversarial network

> Finally, we set up the GAN, which chains the generator and the discriminator. This is the
model that, when trained, will move the generator in a direction that improves its ability
to fool the discriminator. This model turns latent space points into a classification
decision, "fake" or "real", and it is meant to be trained with labels that are always "these
are real images". So training gan will update the weights of generator in a way that
makes discriminator more likely to predict "real" when looking at fake images. Very
importantly, we set the discriminator to be frozen during training (non-trainable): its
weights will not be updated when training gan . If the discriminator weights could be
updated during this process, then we would be training the discriminator to always
predict "real", which is not what we want!

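Finally, we set up the GAN, which chains the generator and the discriminator. When trained, this model moves the generator's weights in the direction that improves its ability to fool the discriminator. The model turns latent space points into a classification decision, "fake" or "real", and it is meant to be trained with labels that always say "these are real images". Training `gan` therefore updates the generator's weights so that the discriminator becomes more likely to predict "real" when it looks at fake images. It is very important that the discriminator is frozen (non-trainable) during this step: its weights are not updated when training `gan`. If they could be updated, we would end up training the discriminator to always predict "real", which is not what we want.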

In [10]:
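# Freeze the discriminator's weights (this applies only to the combined `gan` model;
# `discriminator` itself was compiled earlier and still trains on its own batches)
discriminator.trainable = False

gan_input = keras.Input(shape=(latent_dim,))
gan_output = discriminator(generator(gan_input))

gan = keras.models.Model(gan_input, gan_output)

# `learning_rate` replaces the deprecated `lr` spelling
gan_optimizer = keras.optimizers.RMSprop(learning_rate=0.0004, clipvalue=1.0, decay=1e-8)
gan.compile(optimizer=gan_optimizer, loss='binary_crossentropy')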

### 8.5.6 How to train your DCGAN

> Now we can start training. To recapitulate, this is schematically what the training loop
looks like:

Now we can start training. Schematically, the training loop looks like this:

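```text
for each epoch:
    * Draw random points in the latent space (random noise).
    * Generate images with the generator from this random noise.
    * Mix the generated images with real ones.
    * Train the discriminator on these mixed images, with the corresponding targets: "real" or "fake".
    * Draw new random points in the latent space.
    * Train the GAN on these random vectors, with targets that all say "these are real images". This updates the generator's weights.
```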
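
The alternating updates listed above can be tried end to end on a toy problem that runs in plain NumPy. Everything in this sketch is an illustrative assumption rather than the chapter's model: the "images" are single numbers drawn from a Gaussian centered at 3, the generator is `g(z) = a * z + b`, the discriminator is `d(x) = sigmoid(w * x + c)` read as the probability that `x` is real (so 1 means real here), and the gradients are written out by hand.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

gen = {"a": 1.0, "b": 0.0}   # generator parameters
disc = {"w": 0.1, "c": 0.0}  # discriminator parameters
lr = 0.05
batch_size = 64

def generate(z):
    return gen["a"] * z + gen["b"]

def discriminator_step(real, fake):
    """One gradient step on binary cross-entropy with targets real=1, fake=0."""
    x = np.concatenate([real, fake])
    y = np.concatenate([np.ones_like(real), np.zeros_like(fake)])
    p = sigmoid(disc["w"] * x + disc["c"])
    grad_logit = (p - y) / len(x)  # d(loss)/d(logit) for cross-entropy
    disc["w"] -= lr * np.sum(grad_logit * x)
    disc["c"] -= lr * np.sum(grad_logit)

def generator_step(z):
    """One gradient step on the generator only, with targets "all real"."""
    fake = generate(z)
    p = sigmoid(disc["w"] * fake + disc["c"])
    # Backpropagate through the frozen discriminator into the fake samples
    grad_fake = (p - 1.0) * disc["w"] / len(z)
    gen["a"] -= lr * np.sum(grad_fake * z)
    gen["b"] -= lr * np.sum(grad_fake)

for step in range(30):
    real = rng.normal(loc=3.0, scale=0.5, size=batch_size)
    fake = generate(rng.normal(size=batch_size))
    discriminator_step(real, fake)               # train the discriminator
    generator_step(rng.normal(size=batch_size))  # train the generator

print("generator:", gen)
print("discriminator:", disc)
```

`generator_step` reads the discriminator's parameters but never writes to them, which is the hand-written counterpart of freezing the discriminator inside the combined model. After a few dozen steps the generator's offset `b` has started moving toward the real data.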

> Let’s implement it:

Let's implement it:

Translator's note: the code below changes the image output directory and the interval between saves.

In [11]:
import os
from tensorflow.keras.preprocessing import image

# Load the CIFAR10 dataset
(x_train, y_train), (_, _) = keras.datasets.cifar10.load_data()

# Select the frog images (class 6)
x_train = x_train[y_train.flatten() == 6]

# Normalize the data
x_train = x_train.reshape((x_train.shape[0],)
                          + (height, width, channels)).astype('float32') / 255.

iterations = 10000
batch_size = 20
save_dir = os.path.join(os.environ['HOME'], 'gan_output')
# Create the output directory if needed, so that img.save() does not fail
os.makedirs(save_dir, exist_ok=True)

# Start the training loop
start = 0
for step in range(iterations):
    # Sample random points in the latent space
    random_latent_vectors = np.random.normal(size=(batch_size, latent_dim))

    # Decode them into fake images
    generated_images = generator.predict(random_latent_vectors)

    # Combine the fake images with real ones
    stop = start + batch_size
    real_images = x_train[start: stop]
    combined_images = np.concatenate([generated_images, real_images])

    # Assemble the labels in the same order as combined_images:
    # 1 for generated images, 0 for real images
    labels = np.concatenate([np.ones((batch_size, 1)), np.zeros((batch_size, 1))])

    # Add random noise to the labels: an important trick
    labels += 0.05 * np.random.random(labels.shape)

    # Train the discriminator
    d_loss = discriminator.train_on_batch(combined_images, labels)

    # Sample more random points in the latent space
    random_latent_vectors = np.random.normal(size=(batch_size, latent_dim))

    # Assemble labels that claim "these are all real images" (0 means real here)
    misleading_targets = np.zeros((batch_size, 1))

    # Train the generator (through the gan model, where the discriminator is frozen)
    a_loss = gan.train_on_batch(random_latent_vectors, misleading_targets)
    start += batch_size

    # Wrap around once the frog images are exhausted
    if start > len(x_train) - batch_size:
        start = 0

    # Periodically save the weights and a few images
    if step % 100 == 99:
        # Save the model weights
        gan.save_weights('gan.h5')

        # Print the metrics
        print('discriminator loss:', d_loss)
        print('adversarial loss:', a_loss)

        # Save one generated image
        img = image.array_to_img(generated_images[0] * 255., scale=False)
        img.save(os.path.join(save_dir, 'generated_frog' + str(step) + '.png'))

        # Save one real image for comparison
        img = image.array_to_img(real_images[0] * 255., scale=False)
        img.save(os.path.join(save_dir, 'real_frog' + str(step) + '.png'))

Downloading data from https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz
discriminator loss: 0.69052553
adversarial loss: 0.7140432
discriminator loss: 0.53452665
adversarial loss: 18.42438
discriminator loss: 0.67683643
adversarial loss: 0.75703514
discriminator loss: 0.69834167
adversarial loss: 0.7260839
discriminator loss: 0.6894704
adversarial loss: 0.74290484
discriminator loss: 0.6892184
adversarial loss: 0.74777925
discriminator loss: 0.67347133
adversarial loss: 0.72823393
discriminator loss: 0.6830332
adversarial loss: 0.77585906
discriminator loss: 0.687
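
The label-noise line from the training loop can be examined in isolation. The sketch below rebuilds the label array for one batch and applies the same kind of noise, using NumPy's `default_rng` with a fixed seed (an assumption made for reproducibility; the loop itself uses `np.random.random`).

```python
import numpy as np

rng = np.random.default_rng(0)
batch_size = 20

# Same layout as in the loop: first the generated images (1), then the real ones (0)
labels = np.concatenate([np.ones((batch_size, 1)), np.zeros((batch_size, 1))])

# Uniform noise in [0, 0.05) is added, so labels only ever move upward
noisy_labels = labels + 0.05 * rng.random(labels.shape)

print("generated-image targets:", noisy_labels[:batch_size].min(), "to", noisy_labels[:batch_size].max())
print("real-image targets:     ", noisy_labels[batch_size:].min(), "to", noisy_labels[batch_size:].max())
```

The noise is one-sided: targets for generated images end up slightly above 1 and targets for real images slightly above 0, so the discriminator never trains on exactly the same targets twice.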

> When training, you may see your adversarial loss start increasing considerably while
your discriminative loss will tend to zero, i.e. your discriminator may end up dominating
your generator. If that's the case, try reducing the discriminator learning rate and increasing
the dropout rate of the discriminator.

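While training, you may see the adversarial loss climb sharply while the discriminator loss tends toward zero, which means that the discriminator has come to dominate the generator. If that happens, try reducing the discriminator's learning rate and increasing its dropout rate.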

![frog generated images](imgs/f8.18.jpg)

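Figure 8-18. Each column contains two generated images and one real image. Can you tell them apart by eye? (Answer: the real images are in the middle, top, bottom, and middle positions.)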

### 8.5.7 Wrapping up

> - GANs consist in a generator network coupled with a discriminator network. The
discriminator is trained to tell apart the output of the generator and real images from a
training dataset, while the generator is trained to fool the discriminator. Remarkably, the
generator never sees images from the training set directly; the information it has about
the data comes from the discriminator.
- GANs are difficult to train, because training a GAN is a dynamic process rather than a
simple descent process with a fixed loss landscape. Getting a GAN to train correctly
requires leveraging a number of heuristic tricks, as well as extensive tuning.
- GANs can potentially produce highly realistic images. However, unlike VAEs, the latent
space that they learn does not have a neat continuous structure, and thus may not be
suited for certain practical applications, such as image editing via latent space concept
vectors.

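- A GAN consists of a generator network coupled with a discriminator network. The discriminator is trained to tell generated images apart from real images of the training dataset, while the generator is trained to fool the discriminator. Notably, the generator never sees the training images directly; everything it knows about the data comes from the discriminator's feedback.
- GANs are hard to train, because training a GAN is a dynamic process rather than a simple gradient descent over a fixed loss landscape. Getting a GAN to train correctly takes a number of heuristic tricks and extensive tuning.
- GANs can potentially produce highly realistic images. Unlike with VAEs, however, the latent space they learn has no neat continuous structure, so it may not suit certain practical applications, such as image editing via latent space concept vectors.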

## 8.6 Summary: generative deep learning

> This is the end of the chapter on creative applications of deep learning, where deep nets
go beyond simply annotating existing content, and start generating their own. You have
just learned:

> - How to generate sequence data, one timestep at a time. This is applicable to text
generation, but also to note-by-note music generation, or any other type of timeseries
data.
- How Deep Dreams work: by maximizing convnet layer activations through gradient
ascent in input space.
- How to perform style transfer, where a content image and a style image get combined to
produce interesting-looking results.
- What GANs and VAEs are, how they can be used for dreaming up new images, and how
latent space "concept vectors" could be used for image editing.

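This concludes the chapter on creative applications of deep learning, in which deep networks go beyond annotating existing content and start generating their own. In this chapter you learned:

- How to generate sequence data one timestep at a time. This applies to text generation, but also to note-by-note music generation or any other kind of timeseries data.
- How Deep Dream works: by maximizing convnet layer activations through gradient ascent in input space.
- How to perform style transfer, in which a content image and a style image are combined to produce interesting-looking results.
- What GANs and VAEs are, how they can be used to dream up new images, and how latent space "concept vectors" can be used for image editing.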

> These few techniques only cover the very basics of this fast-expanding field. There’s
a lot more to discover out there—generative deep learning would be deserving of an
entire book of its own.

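These techniques cover only the very basics of this fast-expanding field. There is much more to discover; generative deep learning deserves an entire book of its own.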

<< [Chapter 7: Advanced deep learning best practices](Chapter7_Advanced_deep_learning_best_pratices.ipynb)|| [Table of contents](index.md) || [Chapter 9: Conclusions](Chapter9_Conclusions.ipynb) >>