# Text Style Transfer with a CycleGAN Network

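## Project Overview

This project uses a CycleGAN model for the text style transfer (Text Style Transfer) task. The model uses a Transformer network as the generator and a CNN as the discriminator. Style in text covers a much broader range than style in images: differences in vocabulary distribution, organizational structure, emotional tone, and identity traits can all be regarded as part of text style. The possible applications and scenarios for text style transfer are therefore highly varied. In addition, because text, unlike images, is discrete data, propagating gradients is difficult in text style transfer. The field still lacks a representative work with a strong mathematical foundation, which leaves considerable room for development.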

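## Project Goals

+ Become proficient in building and training deep learning models with the MindSpore framework
+ Learn the basic structure of the Transformer model and how to program it
+ Learn to use a CycleGAN model for Chinese text style transfer
+ Learn the release process for an open-source project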

## Environment

+ MindSpore 1.7.0
+ Python 3.8
+ GPU RTX 3090
+ Ubuntu 20.04

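## Experiment Steps

### Data Preparation

+ Use a collection of Lu Xun's fiction as the target domain, downloaded from a URL as txt files
+ Use essays collected from the web as the fake-sample domain
+ Remove subscripts, footnotes, special symbols, and the like (commas and periods are kept, since they serve as delimiters)
+ Segment the text with HanLP, splitting paragraphs into words separated by spaces
+ Because style is not evident in short texts, cut the longest text that starts and ends at a delimiter and satisfies the length requirement
+ Save the text line by line as the dataset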
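
The truncation step in the list above can be sketched as follows. This is only one possible reading of that step: it assumes the delimiters are the full-width comma and period, that a candidate span begins right after a delimiter (or at the start of the line) and ends on one, and that length is counted in tokens. The names `DELIMS` and `longest_span` are hypothetical and do not come from the project code.

```python
DELIMS = {"，", "。"}  # assumption: full-width comma and period act as delimiters

def longest_span(tokens, max_len):
    """Return the longest run of whole clauses with at most max_len tokens.

    A clause is a run of tokens ending in a delimiter, so every candidate
    span starts right after a delimiter (or at index 0) and ends on one.
    """
    starts = [0] + [i + 1 for i, t in enumerate(tokens) if t in DELIMS]
    ends = [i + 1 for i, t in enumerate(tokens) if t in DELIMS]  # exclusive
    best = []
    for s in starts:
        for e in ends:
            if e > s and e - s <= max_len and e - s > len(best):
                best = tokens[s:e]
    return best

line = "我 喜欢 读书 ， 尤其 是 小说 。 今天 下雨 ，"
tokens = line.split()
span = longest_span(tokens, max_len=8)
print(" ".join(span))
```

The quadratic search is fine for sentence-length inputs; a single pass with two pointers would do the same job on long paragraphs.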

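### Data Preprocessing

+ Split the data into training and test sets at a ratio of 8:2
+ Convert words into word vectors using the CWV literary-works dataset from Beijing Normal University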
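
A minimal sketch of the 8:2 split, assuming the dataset is a list of lines and that shuffling with a fixed seed is acceptable. The project does not say whether the data is shuffled, so both the shuffle and the `split_lines` helper are assumptions.

```python
import random

def split_lines(lines, train_ratio=0.8, seed=0):
    """Shuffle a copy of lines and split it into (train, test)."""
    shuffled = list(lines)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]

lines = [f"sentence {i}" for i in range(10)]
train, test = split_lines(lines)
print(len(train), len(test))
```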

### Training the Network

An outline of the training-network code is shown below:

```{python}
Class CycleGanModel
- Class TransformerGenerator
    - Class TransformerEncoder
        - Class EncoderCell
            - Class SelfAttention
            - Class FeedForward
    - Class TransformerDecoder
        - Class DecoderCell
            - Class SelfAttention
            - Class Encoder-Decoder Attention
            - Class FeedForward
- Class CnnDiscriminator
```

![Text Style Transfer](https://camo.githubusercontent.com/2a04777c752f76cc317eb2a258268f37565646fd990e6f6bcaaa44d06e996333/68747470733a2f2f692e696d6775722e636f6d2f55724c523971532e706e67)

## Transformer Generator

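We first write the Transformer model. The difficulty in this part is the number of unfamiliar APIs: we were not familiar with many of the conversions between torch and mindspore.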

In [1]:
import math
import copy
import numpy as np
import mindspore as ms
from mindspore import Tensor
from mindspore import ops
from mindspore import nn
from mindspore.common.initializer import initializer, XavierUniform
from mindspore.ops import functional as F
# from mindspore.ops import operations as P

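swap is a small helper function we wrote, needed because the transpose functions in torch and mindspore take different arguments. This part of the code implements the attention and mask operations. Note that torch's matmul differs from mindspore's MatMul: it behaves more like a combination of MatMul and BatchMatMul.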

In [2]:
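def swap(shape, pos1, pos2):
    "Return the axis permutation that swaps pos1 and pos2 (negative indices allowed)."
    # Avoid naming the variable `list`, which would shadow the built-in.
    perm = [pos for pos in range(len(shape))]
    perm[pos1], perm[pos2] = perm[pos2], perm[pos1]
    return perm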

def subsequent_mask(size):
    "Mask out subsequent positions."
    attn_shape = (1, size, size)
    subsequent_mask = np.triu(np.ones(attn_shape), k=1).astype('uint8')
    return Tensor.from_numpy(subsequent_mask) == 0

def attention(query, key, value, mask=None, dropout=None):
    "Compute 'Scaled Dot Product Attention'"
    matmul = ops.BatchMatMul()
    softmax = ops.Softmax()
    d_k = query.shape[-1]
    scores = matmul(query, key.transpose(swap(key.shape, -2, -1))) / math.sqrt(d_k)


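    if mask is not None:
        # The masks built in this project are True where attention is allowed,
        # so the positions to fill with -1e9 are the complement of the mask.
        scores = scores.masked_fill(ops.LogicalNot()(mask), -1e9)
    p_attn = softmax(scores)
    if dropout is not None:
        p_attn = dropout(p_attn)
    return matmul(p_attn, value), p_attn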

def clones(module, N):
    "Produce N identical layers."
    return nn.CellList([copy.deepcopy(module) for _ in range(N)])
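
To make the attention and mask logic easier to check without a MindSpore installation, here is a NumPy sketch of the same computation. It assumes the mask is True where attention is allowed, which is the convention `subsequent_mask` produces. `attention_np` is a hypothetical name, and `np.matmul` is used because it broadcasts over leading batch axes in the way torch's matmul does.

```python
import math
import numpy as np

def swap(shape, pos1, pos2):
    perm = list(range(len(shape)))
    perm[pos1], perm[pos2] = perm[pos2], perm[pos1]
    return perm

def attention_np(query, key, value, mask=None):
    """Scaled dot-product attention; mask is True where attention is allowed."""
    d_k = query.shape[-1]
    scores = np.matmul(query, key.transpose(swap(key.shape, -2, -1))) / math.sqrt(d_k)
    if mask is not None:
        scores = np.where(mask, scores, -1e9)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return np.matmul(weights, value), weights

rng = np.random.default_rng(0)
q = k = v = rng.normal(size=(2, 4, 8))            # (batch, seq, d_k)
causal = np.tril(np.ones((1, 4, 4), dtype=bool))  # lower triangle is visible
out, w = attention_np(q, k, v, causal)
print(out.shape, np.allclose(np.triu(w, k=1), 0.0))
```

Each row of `w` sums to one, and the weights above the diagonal vanish, so no position attends to a later one.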

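We ran into two small problems here. First, mindspore cannot initialize the parameters of a whole network in one place the way pytorch can, so each layer has to be initialized with init. Second, the shape passed to init is the transpose of the (in_channels, out_channels) order taken by the fully connected layer.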

In [3]:
class MultiHeadedAttention(nn.Cell):
    def __init__(self, h, d_model, dropout=0.1):
        "Take in model size and number of heads."
        super(MultiHeadedAttention, self).__init__()
        assert d_model % h == 0
        # We assume d_v always equals d_k
        self.d_k = d_model // h
        self.h = h
        weight_init = initializer(XavierUniform(), [d_model, d_model], ms.float32)
        self.linears = clones(nn.Dense(d_model, d_model, weight_init), 4)
        self.attn = None
        # `dropout` is the drop rate, while keep_prob is the probability of keeping a unit.
        self.dropout = nn.Dropout(keep_prob=1.0 - dropout)

    def construct(self, query, key, value, mask=None):
        "Implements Figure 2"
        if mask is not None:
            # Same mask applied to all h heads.
            expand_dims = ops.ExpandDims()
            mask = expand_dims(mask, 1)
        nbatches = query.shape[0]
        # 1) Do all the linear projections in batch from d_model => h x d_k 
        query, key, value = [l(x).view((nbatches, -1, self.h, self.d_k)).transpose((0, 2, 1, 3)) for l, x in zip(self.linears, (query, key, value))]

        # 2) Apply attention on all the projected vectors in batch.
        x, self.attn = attention(query, key, value, mask=mask,
                                 dropout=self.dropout)
        # 3) "Concat" using a view and apply a final linear.
        x = x.transpose(swap(x.shape, 1, 2)).view(nbatches, -1, self.h * self.d_k)
        return self.linears[-1](x)
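
The reshape and transpose around the attention call are easy to get wrong, so here is a NumPy sketch of the head split and merge on their own. `split_heads` and `merge_heads` are hypothetical names; the sketch only shows the shape bookkeeping and leaves out the linear projections.

```python
import numpy as np

def split_heads(x, h):
    """(batch, seq, d_model) -> (batch, h, seq, d_k)."""
    nbatches, seq_len, d_model = x.shape
    assert d_model % h == 0
    return x.reshape(nbatches, seq_len, h, d_model // h).transpose(0, 2, 1, 3)

def merge_heads(x):
    """(batch, h, seq, d_k) -> (batch, seq, h * d_k)."""
    nbatches, h, seq_len, d_k = x.shape
    return x.transpose(0, 2, 1, 3).reshape(nbatches, seq_len, h * d_k)

x = np.arange(2 * 5 * 16, dtype=np.float32).reshape(2, 5, 16)
heads = split_heads(x, h=4)
print(heads.shape)
print(np.array_equal(merge_heads(heads), x))
```

Merging is the exact inverse of splitting, which is why the "concat" step can be written as a transpose followed by a view.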

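This cell implements the normalization layer and the residual (RES) layer. One curious detail: Tensor's mean and std use different keyword names, keep_dims and keepdims respectively.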

In [4]:
class LayerNorm(nn.Cell):
    def __init__(self, features, eps=1e-6):
        super(LayerNorm, self).__init__()
        self.a_2 = ms.Parameter(Tensor(np.ones(features), ms.float32))
        self.b_2 = ms.Parameter(Tensor(np.zeros(features), ms.float32))
        self.eps = eps

    def construct(self, x):
        mean = x.mean(-1, keep_dims=True)
        std = x.std(-1, keepdims=True)
        return self.a_2 * (x - mean) / (std + self.eps) + self.b_2

class SublayerConnection(nn.Cell):
    def __init__(self, size, dropout):
        super(SublayerConnection, self).__init__()
        self.norm = LayerNorm(size)
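        # `dropout` is the drop rate, while keep_prob is the probability of keeping a unit.
        self.dropout = nn.Dropout(keep_prob=1.0 - dropout)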

    def construct(self, x, sublayer):
        return x + self.dropout(sublayer(self.norm.construct(x)))
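
As a reference for the normalization formula, a NumPy sketch follows. It assumes the population standard deviation (NumPy's default, `ddof=0`). torch's `std` defaults to the unbiased estimate, so results ported between frameworks can differ slightly, and it is worth checking which variant the MindSpore call uses. `layer_norm` is a hypothetical name.

```python
import numpy as np

def layer_norm(x, gain, bias, eps=1e-6):
    """Normalize over the last axis, then apply a learned gain and bias."""
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)  # population std (ddof=0)
    return gain * (x - mean) / (std + eps) + bias

x = np.array([[1.0, 2.0, 3.0, 4.0], [10.0, 10.0, 10.0, 14.0]])
y = layer_norm(x, gain=np.ones(4), bias=np.zeros(4))
print(y.mean(axis=-1), y.std(axis=-1))
```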

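The following cells implement the position-wise feed-forward network, the encoder and decoder layers, and the positional encoding.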

In [5]:
class PositionwiseFeedForward(nn.Cell):
    def __init__(self, d_model, d_ff, dropout=0.1):
        super(PositionwiseFeedForward, self).__init__()
        # weight_init for Dense: (out_channels, in_channels)
        weight_init_1 = initializer(XavierUniform(), [d_model, d_ff], ms.float32)
        weight_init_2 = initializer(XavierUniform(), [d_ff, d_model], ms.float32)
        self.w_1 = nn.Dense(d_model, d_ff, weight_init_2)
        self.w_2 = nn.Dense(d_ff, d_model, weight_init_1)
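        # `dropout` is the drop rate, while keep_prob is the probability of keeping a unit.
        self.dropout = nn.Dropout(keep_prob=1.0 - dropout)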

    def construct(self, x):
        relu = ops.ReLU()
        return self.w_2(self.dropout(relu(self.w_1(x))))

class EncoderLayer(nn.Cell):
    def __init__(self, size, self_attn, feed_forward, dropout):
        super(EncoderLayer, self).__init__()
        self.self_attn = self_attn
        self.feed_forward = feed_forward
        self.sublayer = clones(SublayerConnection(size, dropout), 2)
        self.size = size

    def construct(self, x, mask):
        x = self.sublayer[0].construct(x, lambda x: self.self_attn.construct(x, x, x, mask))
        return self.sublayer[1].construct(x, self.feed_forward.construct)

class DecoderLayer(nn.Cell):
    def __init__(self, size, self_attn, src_attn, feed_forward, dropout):
        super(DecoderLayer, self).__init__()
        self.size = size
        self.self_attn = self_attn
        self.src_attn = src_attn
        self.feed_forward = feed_forward
        self.sublayer = clones(SublayerConnection(size, dropout), 3)

    def construct(self, x, memory, src_mask, tgt_mask):
        m = memory
        x = self.sublayer[0].construct(x, lambda x: self.self_attn.construct(x, x, x, tgt_mask))
        x = self.sublayer[1].construct(x, lambda x: self.src_attn.construct(x, m, m, src_mask))
        return self.sublayer[2].construct(x, self.feed_forward.construct)
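
The weight-shape convention used by the initializers above, (out_channels, in_channels), can be illustrated in NumPy. This is a sketch under the assumption that a dense layer computes `x @ W.T + b` and that Xavier uniform samples from U(-b, b) with b = sqrt(6 / (fan_in + fan_out)). `xavier_uniform` and `dense` are hypothetical helpers, not MindSpore calls.

```python
import math
import numpy as np

def xavier_uniform(shape, rng):
    """Sample U(-b, b) with b = sqrt(6 / (fan_in + fan_out))."""
    fan_out, fan_in = shape  # weight stored as (out_channels, in_channels)
    bound = math.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-bound, bound, size=shape)

def dense(x, weight, bias):
    """y = x @ W^T + b, with W of shape (out_channels, in_channels)."""
    return x @ weight.T + bias

rng = np.random.default_rng(0)
d_model, d_ff = 8, 32
w_1 = xavier_uniform((d_ff, d_model), rng)   # maps d_model -> d_ff
w_2 = xavier_uniform((d_model, d_ff), rng)   # maps d_ff -> d_model
x = rng.normal(size=(2, 5, d_model))
hidden = np.maximum(dense(x, w_1, np.zeros(d_ff)), 0.0)  # ReLU
y = dense(hidden, w_2, np.zeros(d_model))
print(hidden.shape, y.shape)
```

This is why the layer that maps d_model to d_ff takes an initializer of shape [d_ff, d_model], and the reverse for the second layer.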

In [6]:
class Encoder(nn.Cell):
    def __init__(self, layer, N):
        super(Encoder, self).__init__()
        self.layers = clones(layer, N)  # layer = EncoderLayer()
        self.norm = LayerNorm(layer.size)

    def construct(self, x, mask):
        for layer in self.layers:
            x = layer.construct(x, mask)
        return self.norm.construct(x)


class Decoder(nn.Cell):
    def __init__(self, layer, N):
        super(Decoder, self).__init__()
        self.layers = clones(layer, N)
        self.norm = LayerNorm(layer.size)

    def construct(self, x, memory, src_mask, tgt_mask):
        for layer in self.layers:
            x = layer.construct(x, memory, src_mask, tgt_mask)
        return self.norm.construct(x)


class FullEncoder(nn.Cell):
    def __init__(self, encoder, src_embed):
        super(FullEncoder, self).__init__()
        self.encoder = encoder
        self.src_embed = src_embed  # Embedding function

    def construct(self, src, src_mask):
        return self.encoder.construct(self.src_embed(src), src_mask)

In [7]:
class Scale(nn.Cell):
    def __init__(self, d_model):
        super(Scale, self).__init__()
        self.d_model = d_model

    def construct(self, x):
        return x * math.sqrt(self.d_model)


class PositionalEncoding(nn.Cell):
    def __init__(self, d_model, dropout, max_len=5000):
        super(PositionalEncoding, self).__init__()
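        # `dropout` is the drop rate, while keep_prob is the probability of keeping a unit.
        self.dropout = nn.Dropout(keep_prob=1.0 - dropout)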
        # Compute the positional encodings once in log space.
        pe = Tensor(np.zeros((max_len, d_model)), ms.float32)
        expand_dims = ops.ExpandDims()
        cast = ops.Cast()
        exp = ops.Exp()
        sin = ops.Sin()
        cos = ops.Cos()
        position = cast(expand_dims(ms.numpy.arange(0, max_len), 1), ms.float32)
        div_term = exp(cast(ms.numpy.arange(0, d_model, 2), ms.float32) * -(math.log(10000.0) / d_model))
        pe[:, 0::2] = sin(position * div_term)
        pe[:, 1::2] = cos(position * div_term)
        self.pe = expand_dims(pe, 0)  # this is not trainable parameters

    def construct(self, x):
        cast = ops.Cast()
        x = cast(x, ms.float32) + self.pe[:, :x.shape[1]]
        return self.dropout(x)
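
The sinusoidal table can be reproduced in NumPy to inspect its values. This sketch assumes an even `d_model`, as the slicing in the cell above also does; `positional_encoding` is a hypothetical name.

```python
import math
import numpy as np

def positional_encoding(max_len, d_model):
    """Sine on even columns, cosine on odd columns, one row per position."""
    position = np.arange(max_len, dtype=np.float32)[:, None]
    div_term = np.exp(np.arange(0, d_model, 2, dtype=np.float32)
                      * -(math.log(10000.0) / d_model))
    pe = np.zeros((max_len, d_model), dtype=np.float32)
    pe[:, 0::2] = np.sin(position * div_term)
    pe[:, 1::2] = np.cos(position * div_term)
    return pe

pe = positional_encoding(max_len=50, d_model=16)
print(pe.shape)
print(pe[0, :4])
```

Position 0 gives sin(0) = 0 in the even columns and cos(0) = 1 in the odd ones, which is a quick sanity check on the column layout.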

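The generator and the encoder-decoder. One interesting point: with certain activation functions, computing the gradient raises an error, so the original plan to use Gumbel Softmax here was abandoned. In addition, even though I overrode the construct method, the network still could not be assembled in the form out = Net(input).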

In [8]:
class Generator(nn.Cell):
    def __init__(self, d_model, vocab):
        super(Generator, self).__init__()
        weight_init = initializer(XavierUniform(), [vocab, d_model], ms.float32)
        self.proj = nn.Dense(d_model, vocab, weight_init)
        self.softmax = nn.Softmax()

    def _construct(self, x):
        log_softmax = nn.LogSoftmax()
        return log_softmax(self.proj(x))

    def construct(self, x, scale=1.0):
        return self.softmax(self.proj(x) * scale)


class EncoderDecoder(nn.Cell):

    def __init__(self, encoder, decoder, src_embed, tgt_embed, generator):
        super(EncoderDecoder, self).__init__()
        self.encoder = encoder
        self.decoder = decoder
        self.src_embed = src_embed  # Embedding function
        self.tgt_embed = tgt_embed  # Embedding function
        self.generator = generator

    def construct(self, src, tgt, src_mask, tgt_mask):
        "Take in and process masked src and target sequences."
        return self.decode(self.encode(src, src_mask), src_mask,
                           tgt, tgt_mask)

    def encode(self, src, src_mask):
        return self.encoder.construct(self.src_embed(src), src_mask)

    def decode(self, memory, src_mask, tgt, tgt_mask):
        return self.decoder.construct(self.tgt_embed(tgt), memory, src_mask, tgt_mask)

    def Transformer(self, N=6, d_model=1024, d_ff=2048, h=8, dropout=0.1):
        "Helper: Construct a model from hyperparameters."
        c = copy.deepcopy
        attn = MultiHeadedAttention(h, d_model)
        ff = PositionwiseFeedForward(d_model, d_ff, dropout)
        position = PositionalEncoding(d_model, dropout)
        model = EncoderDecoder(
            Encoder(EncoderLayer(d_model, c(attn), c(ff), dropout), N),
            Decoder(DecoderLayer(d_model, c(attn), c(attn),
                                 c(ff), dropout), N),
            nn.SequentialCell(Scale(d_model), c(position)),
            nn.SequentialCell(Scale(d_model), c(position)), None)
        # This was important from their code.
        # Initialize parameters with Glorot / fan_avg.
        # We do this in each Dense Layer.
        return model
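
`Generator.construct` multiplies the projection by `scale` before the softmax. How the project sets `scale` is not shown here, so the following pure-Python sketch only illustrates the general effect: a larger scale pushes the output distribution toward one-hot while keeping it differentiable. The `softmax` helper is hypothetical.

```python
import math

def softmax(logits, scale=1.0):
    """Softmax of scale * logits, computed stably."""
    scaled = [scale * v for v in logits]
    m = max(scaled)
    exps = [math.exp(v - m) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]
soft = softmax(logits, scale=1.0)
sharp = softmax(logits, scale=10.0)
print([round(p, 3) for p in soft])
print([round(p, 3) for p in sharp])
```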

In [9]:
class Batch:
    def __init__(self, src, trg=None, pad=0):
        self.src = src
        expand_dims = ops.ExpandDims()
        self.src_mask = expand_dims((src != pad), -2)
        if trg is not None:
            self.trg = trg[:, :-1]
            self.trg_y = trg[:, 1:]
            self.trg_mask = self.make_std_mask(self.trg, pad)
            # trg_y only exists when a target is given, so count tokens here.
            self.ntokens = (self.trg_y != pad).sum()

    @staticmethod
    def make_std_mask(tgt, pad):
        "Create a mask to hide padding and future words."
        expand_dims = ops.ExpandDims()
        tgt_mask = expand_dims((tgt != pad), -2)
        logical_and = ops.LogicalAnd()
        tgt_mask = logical_and(tgt_mask, Tensor(subsequent_mask(tgt.shape[-1]), tgt_mask.dtype))
        return tgt_mask
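
The way the padding mask and the subsequent-position mask combine is easiest to see on a small example. The NumPy sketch below mirrors `make_std_mask`; the `_np` function names are hypothetical, and pad id 0 is taken from the default in `Batch`.

```python
import numpy as np

def subsequent_mask_np(size):
    """True on and below the diagonal: position i may see positions <= i."""
    return np.triu(np.ones((1, size, size)), k=1) == 0

def make_std_mask_np(tgt, pad):
    """Combine the padding mask with the subsequent-position mask."""
    pad_mask = (tgt != pad)[:, None, :]                   # (batch, 1, seq)
    return pad_mask & subsequent_mask_np(tgt.shape[-1])   # (batch, seq, seq)

tgt = np.array([[5, 7, 9, 0]])  # 0 is the pad id
mask = make_std_mask_np(tgt, pad=0)
print(mask[0].astype(int))
```

The last column is zero in every row because that position holds padding, and the upper triangle is zero because later positions must stay hidden.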

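Following an approach similar to the one used in DCGAN, we wrapped the training step in classes. Because of the problem mentioned earlier (the network cannot be assembled by calling an instance), I subclassed and rewrote the TrainOneStepCell class from nn.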

In [10]:
class TrainOneStepCell(nn.TrainOneStepCell):
    def construct(self, *inputs):
        loss = self.network.construct(*inputs)
        sens = F.fill(loss.dtype, loss.shape, self.sens)
        grads = self.grad(self.network, self.weights)(*inputs, sens)
        grads = self.grad_reducer(grads)
        loss = F.depend(loss, self.optimizer(grads))
        return loss


class WithLossGenerator(nn.Cell):
    def __init__(self, generator, criterion):
        super(WithLossGenerator, self).__init__(auto_prefix=True)
        self.netG = generator
        self.loss_fn = criterion

    def construct(self, z, y, norm):
        x = self.netG.construct(z)
        cast = ops.Cast()
        print(x.view(-1, x.shape[-1]), y.view(-1))
        loss = self.loss_fn(x.view(-1), y.view(-1)) / cast(norm, ms.float32)
        return loss


class SimpleLossCompute(nn.Cell):
    def __init__(self, generator, criterion, optimizer):
        super(SimpleLossCompute, self).__init__()
        self.generator = generator
        self.criterion = criterion
        self.optimizer = optimizer
        self.netG_with_criterion = WithLossGenerator(generator,
                                                     criterion)
        self.myTrainOneStepForG = TrainOneStepCell(
            self.netG_with_criterion,
            self.optimizer)

    def __call__(self, z, y, norm):
        output_G = self.myTrainOneStepForG(z, y, norm).view(-1)
        cast = ops.Cast()
        netG_loss = output_G * cast(norm, ms.float32)
        return netG_loss.mean()

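The original method performs gradient descent with the optimizer from "Attention Is All You Need", whose learning rate follows a warmup schedule. I rewrote that schedule as NoamLR and pass its output to Adam as lr, which reproduces the optimization used in the original method.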

In [11]:
def NoamLR(model_size, factor, warmup, total_step):
    lr = []
    for step in range(total_step):
        step += 1
        lr.append(factor * model_size ** (-0.5) *
                  min(step ** (-0.5), step * warmup ** (-1.5)))
    return lr
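
To see the shape of this schedule, the sketch below evaluates the same formula with a small warmup and locates its peak. The values `warmup=10` and `total_step=100` are chosen only for illustration, and `noam_lr` is a hypothetical restatement of `NoamLR` with an explicit 1-based step.

```python
def noam_lr(model_size, factor, warmup, total_step):
    """Same formula as NoamLR above, written with an explicit 1-based step."""
    return [factor * model_size ** (-0.5) *
            min(step ** (-0.5), step * warmup ** (-1.5))
            for step in range(1, total_step + 1)]

schedule = noam_lr(model_size=1024, factor=1.0, warmup=10, total_step=100)
peak_step = schedule.index(max(schedule)) + 1
print(peak_step)
```

The rate grows linearly until the warmup step and then decays as the inverse square root of the step, so the peak falls on the warmup step itself.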

## CNN Discriminator

![Transformer and CNN](https://camo.githubusercontent.com/b11215e59ec0d1c6369c8dfc85664f438e52d46b101362acf81281fdbc97338f/68747470733a2f2f692e696d6775722e636f6d2f746d4d6976496d2e706e67)

In [12]:
from src.utils import create_dict_iterator

class Resblock(nn.Cell):
    def __init__(self, inner_dim, kernel_size):
        super(Resblock, self).__init__()
        self.inner_dim = inner_dim
        self.kernel_size = kernel_size
        self.relu = ops.ReLU()
        if kernel_size % 2 != 1:
            raise Exception("kernel size must be odd number!")
        self.conv_1 = nn.Conv1d(self.inner_dim, self.inner_dim, self.kernel_size, pad_mode="pad",
                                padding=int((kernel_size - 1) / 2))
        self.conv_2 = nn.Conv1d(self.inner_dim, self.inner_dim, self.kernel_size, pad_mode="pad",
                                padding=int((kernel_size - 1) / 2))

    def construct(self, inputs):
        output = self.relu(inputs)
        output = self.conv_1(output)
        output = self.relu(output)
        output = self.conv_2(output)
        return inputs + (0.3 * output)
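
The odd-kernel check exists because the residual addition needs the convolution output to have the same length as its input. The arithmetic can be verified with the standard output-length formula for a 1-D convolution; `conv1d_out_len` is a hypothetical helper, and stride 1 with no dilation is assumed.

```python
def conv1d_out_len(length, kernel_size, padding, stride=1):
    """Output length of a 1-D convolution with symmetric zero padding."""
    return (length + 2 * padding - kernel_size) // stride + 1

seq_len = 40
for kernel_size in (3, 5, 7):
    padding = (kernel_size - 1) // 2
    print(kernel_size, conv1d_out_len(seq_len, kernel_size, padding))
```

With an even kernel, for example size 4 and padding 1, the output would be one element shorter than the input and `inputs + output` would no longer line up.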

In [13]:
class Discriminator(nn.Cell):
    def __init__(self, word_dim, inner_dim, seq_len, kernel_size=3, two_out=False):
        super(Discriminator, self).__init__()
        self.word_dim = word_dim
        self.inner_dim = inner_dim
        self.seq_len = seq_len
        self.kernel_size = kernel_size
        if kernel_size % 2 != 1:
            raise Exception("kernel size must be odd number!")
        self.conv_1 = nn.Conv1d(self.word_dim, self.inner_dim, self.kernel_size, pad_mode="pad",
                                padding=int((kernel_size - 1) / 2))
        self.resblock_1 = Resblock(inner_dim, kernel_size)
        self.resblock_2 = Resblock(inner_dim, kernel_size)
        self.resblock_3 = Resblock(inner_dim, kernel_size)
        self.resblock_4 = Resblock(inner_dim, kernel_size)
        W = seq_len * inner_dim
        self.fc_1 = nn.Dense(W, int(W / 8))
        self.fc_2 = nn.Dense(int(W / 8), int(W / 32))
        self.fc_3 = nn.Dense(int(W / 32), int(W / 64))
        self.fc_4 = nn.Dense(int(W / 64), 2 if two_out else 1)
        self.relu = nn.LeakyReLU()

    def feed_fc(self, inputs):
        output = self.relu(self.fc_1(inputs))
        output = self.relu(self.fc_2(output))
        output = self.relu(self.fc_3(output))
        return self.fc_4(output)

    def construct(self, inputs):
        this_bs = inputs.shape[0]
        permute = ops.Transpose()
        inputs = ops.Cast()(permute(inputs, (0, 2, 1)), ms.float32)
        if inputs.shape[-1] != self.seq_len:
            # Zero-pad the sequence axis on the right up to the fixed seq_len.
            # ops.Pad takes one (before, after) pair per axis.
            pad_len = self.seq_len - inputs.shape[-1]
            inputs = ops.Pad(((0, 0), (0, 0), (0, pad_len)))(inputs)
        output = self.conv_1(inputs)
        output = self.resblock_1(output)
        output = self.resblock_2(output)
        output = self.resblock_3(output)
        output = self.resblock_4(output)
        output = output.view(this_bs, -1)
        # print(output.shape)
        return self.feed_fc(output)
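
Because the flattened feature map feeds directly into `fc_1`, the fully connected head dominates the parameter count. The sketch below works through the layer widths for the settings used later in this document (seq_len = 40, inner_dim = 1024). `fc_dims` and `dense_params` are hypothetical helpers.

```python
def fc_dims(seq_len, inner_dim, two_out=False):
    """Layer widths of the discriminator's fully connected head."""
    w = seq_len * inner_dim
    return [w, int(w / 8), int(w / 32), int(w / 64), 2 if two_out else 1]

def dense_params(n_in, n_out):
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out

dims = fc_dims(seq_len=40, inner_dim=1024)
params = [dense_params(a, b) for a, b in zip(dims, dims[1:])]
print(dims)
print(params, sum(params))
```

With these settings `fc_1` alone holds roughly 210 million parameters, which is worth keeping in mind when choosing `inner_dim`.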

In [14]:
class WithLossCellD(nn.Cell):
    def __init__(self, netD, loss_fn):
        super(WithLossCellD, self).__init__(auto_prefix=True)
        self.netD = netD
        self.loss_fn = loss_fn

    def construct(self, z, y):
        x = self.netD.construct(z)
        cast = ops.Cast()
        loss = self.loss_fn(x.view(-1), y.view(-1))
        return loss


class CLA(nn.Cell):
    def __init__(self, myTrainOneStepCellForD):
        super(CLA, self).__init__(auto_prefix=True)
        self.myTrainOneStepCellForD = myTrainOneStepCellForD

    def construct(self, data, label):
        output_D = self.myTrainOneStepCellForD(data, label).view(-1)
        netD_loss = output_D.mean()
        return netD_loss

In [15]:
ms.set_context(mode=ms.GRAPH_MODE, device_target="GPU")
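data_root = "./datasets"  # dataset root directory
batch_size = 128  # batch size
word_dim = 300  # word-vector dimension
inner_dim = 1024  # hidden-layer size
seq_len = 40  # sentence length
num_epochs = 10  # number of training epochs
size = 500  # dataset size (iterations per epoch)
lr = 0.0002  # learning rate
beta1 = 0.5  # beta1 hyperparameter of the Adam optimizer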

In [16]:
netD = Discriminator(word_dim, inner_dim, seq_len)

In [17]:
loss = nn.BCELoss(reduction='mean')
optimizerD = nn.Adam(netD.trainable_params(), learning_rate=lr, beta1=beta1)
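
One thing to watch with this loss: binary cross-entropy in its plain form expects probabilities in (0, 1), while the last layer of `Discriminator` returns unbounded scores with no sigmoid. Values outside that range make the logarithm undefined, which may be related to the nan in the first logged step of the training output. The pure-Python sketch below shows the equivalent loss computed directly on raw scores in a numerically stable way. The helper names are hypothetical and this is not the project's code.

```python
import math

def bce_with_logits(logits, labels):
    """Mean binary cross-entropy on raw scores, without forming log(0)."""
    total = 0.0
    for z, y in zip(logits, labels):
        # Stable form of -[y*log(s) + (1-y)*log(1-s)] with s = sigmoid(z).
        total += max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
    return total / len(logits)

def bce(probs, labels):
    """Plain binary cross-entropy; only defined for probs strictly in (0, 1)."""
    return -sum(y * math.log(p) + (1 - y) * math.log(1 - p)
                for p, y in zip(probs, labels)) / len(probs)

logits = [2.0, -1.0, 0.5]
labels = [1.0, 0.0, 1.0]
probs = [1.0 / (1.0 + math.exp(-z)) for z in logits]
print(bce_with_logits(logits, labels), bce(probs, labels))
```

The two functions agree whenever the probabilities come from a sigmoid, but only the first stays finite for large scores. MindSpore's nn module also lists a BCEWithLogitsLoss; whether it fits this training setup has not been tested here.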

In [18]:
netD_with_criterion = WithLossCellD(netD, loss)
myTrainOneStepCellForD = TrainOneStepCell(netD_with_criterion, optimizerD)

In [19]:
cla = CLA(myTrainOneStepCellForD)
cla.set_train()
# Create the iterator
data_loader = create_dict_iterator(size * num_epochs, batch_size, seq_len, word_dim)
D_losses = []
# Start the training loop
print("Starting Training Loop...")
for epoch in range(num_epochs):
    # Read in the data for each training epoch
    for i in range(size):
        d = next(data_loader)
        data = Tensor(d["data"], ms.float32)
        label = Tensor(d["label"], ms.float32)
        netD_loss = cla.construct(data, label)
        if i % 50 == 0 or i == size - 1:
            # Print the training record
            print('[%2d/%d][%3d/%d]   Loss_D:%7.4f' % (
                epoch + 1, num_epochs, i + 1, size, netD_loss.asnumpy()))
        D_losses.append(netD_loss.asnumpy())
# Save the network parameters to a ckpt file
ms.save_checkpoint(netD, "Discriminator.ckpt")


Starting Training Loop...
[ 1/10][  1/500]   Loss_D:    nan
[ 1/10][ 51/500]   Loss_D: 0.7354
[ 1/10][101/500]   Loss_D: 0.9072
[ 1/10][151/500]   Loss_D: 0.7850
[ 1/10][201/500]   Loss_D: 0.7923
[ 1/10][251/500]   Loss_D: 0.7538
[ 1/10][301/500]   Loss_D: 0.8825
[ 1/10][351/500]   Loss_D: 0.8753
[ 1/10][401/500]   Loss_D: 0.7697
[ 1/10][451/500]   Loss_D: 0.8455
[ 1/10][500/500]   Loss_D: 0.7035
[ 2/10][  1/500]   Loss_D: 0.7812
[ 2/10][ 51/500]   Loss_D: 0.7747
[ 2/10][101/500]   Loss_D: 0.7750
[ 2/10][151/500]   Loss_D: 0.8097
[ 2/10][201/500]   Loss_D: 0.8195
[ 2/10][251/500]   Loss_D: 0.8614
[ 2/10][301/500]   Loss_D: 0.8115
[ 2/10][351/500]   Loss_D: 0.8399
[ 2/10][401/500]   Loss_D: 0.7533
[ 2/10][451/500]   Loss_D: 0.7543
[ 2/10][500/500]   Loss_D: 0.8703
[ 3/10][  1/500]   Loss_D: 0.8474
[ 3/10][ 51/500]   Loss_D: 0.7911
[ 3/10][101/500]   Loss_D: 0.7963
[ 3/10][151/500]   Loss_D: 0.8944
[ 3/10][201/500]   Loss_D: 0.8241
[ 3/10][251/500]   Loss_D: 0.8194
[ 3/10][301/500]   Los