# Project

The project repository can be found [here](https://gitlab.com/tankz0r/seminar). It has several branches. Let us look at the general structure of the project in PyCharm.

# Settings 

- Keras as a framework for the development
- CNN dataset
- Unpaired WGAN model
- 5000 words in the vocabulary
- Pointer model without coverage mechanism

# CNN dataset

The CNN dataset was obtained using [Code to obtain the CNN / Daily Mail dataset (non-anonymized) for summarization. Accessed: November 20, 2019](https://github.com/abisee/cnn-dailymail).   
The data is encoded into binary files and split into train, test and validation sets.  

In [7]:
!ls /home/denys/Code/ML/courses/3_semester/Seminar/dataset/finished_files

chunked  test.bin  train.bin  val.bin  vocab


A sample from an encoded binary file:

In [10]:
!less /home/denys/Code/ML/courses/3_semester/Seminar/dataset/finished_files/chunked/test_000.bin 

R[7m^^^@^@^@^@^@^@[m
[7m<CF>[m<
[7m<F0>^B[m
[7m^H[mabstract[7m^R<E3>^B[m
[7m<E0>^B[m
[7m<DD>^B[m<s> marseille prosecutor says `` so far no videos were used in the crash investigation '' despite media reports . </s> <s> journalists at bild and paris match are `` very confident '' the video clip is real , an editor says . </s> <s> andreas lubitz had informed his lufthansa training school of an episode of severe depression , airline says . </s>
[7m<D9>[m9
[7m^G[marticle[7m^R<CD>[m9
[7m<CA>[m9
[K:[K>[m9marseille , france -lrb- cnn -rrb- the french prosecutor leading an investigation into the crash of germanwings flight 9525 insisted wednesday that he was not aware of any video footage from on board the plane . marseille prosecutor brice robin told cnn that `` so far no videos were used in the crash investigation . '' he added , `` a person who has such a video needs to immediately give it to the investigators . '' robin 's comments follow claims by two magazines , 

- To load the data into my model, I used the code referenced in the original paper: "Abigail See, Peter J. Liu, Christopher D. Manning. Get To The Point: Summarization with Pointer-Generator Networks, 2017", [Code for the ACL 2017 paper "Get To The Point: Summarization with Pointer-Generator Networks". Accessed: November 28, 2019](https://github.com/abisee/pointer-generator). I took specific parts of the data loading pipeline and modified them for my needs.   
- The data loading logic is organized into classes and is quite cumbersome. The classes Vocab, Example, Batch and Batcher are connected in a very "interesting" way. Moreover, the code is written in Python 2...  

In [1]:
from data_util import config
from data_util.batcher import Batcher, Batch
from data_util.data import Vocab, example_generator

Using TensorFlow backend.


```python
class Vocab(object):
  """Vocabulary class for mapping between words and ids (integers)"""

  def __init__(self, vocab_file, max_size):
    """Creates a vocab of up to max_size words, reading from the vocab_file. If max_size is 0, reads the entire vocab file.

    Args:
      vocab_file: path to the vocab file, which is assumed to contain "<word> <frequency>" on each line, sorted with most frequent word first. This code doesn't actually use the frequencies, though.
      max_size: integer. The maximum size of the resulting Vocabulary."""
    self._word_to_id = {}
    self._id_to_word = {}
    self._count = 0 # keeps track of total number of words in the Vocab

    # [UNK], [PAD], [START] and [STOP] get the ids 0,1,2,3.
    for w in [UNKNOWN_TOKEN, PAD_TOKEN, START_DECODING, STOP_DECODING]:
      self._word_to_id[w] = self._count
      self._id_to_word[self._count] = w
      self._count += 1

    # Read the vocab file and add words up to max_size
    with open(vocab_file, 'r') as vocab_f:
      for line in vocab_f:
        pieces = line.split()
        if len(pieces) != 2:
          print('Warning: incorrectly formatted line in vocabulary file: %s\n' % line)
          continue
        w = pieces[0]
        if w in [SENTENCE_START, SENTENCE_END, UNKNOWN_TOKEN, PAD_TOKEN, START_DECODING, STOP_DECODING]:
          raise Exception('<s>, </s>, [UNK], [PAD], [START] and [STOP] shouldn\'t be in the vocab file, but %s is' % w)
        if w in self._word_to_id:
          raise Exception('Duplicated word in vocabulary file: %s' % w)
        self._word_to_id[w] = self._count
        self._id_to_word[self._count] = w
        self._count += 1
        if max_size != 0 and self._count >= max_size:
          print("max_size of vocab was specified as %i; we now have %i words. Stopping reading." % (max_size, self._count))
          break
```
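
To make the id assignment concrete, here is a simplified sketch of the same idea on a toy word list. The helpers `build_vocab` and `encode` are hypothetical and not part of the project. Unlike the class above, the sketch silently skips duplicates instead of raising. The special-token strings follow the comment in the class.

```python
UNKNOWN_TOKEN = '[UNK]'
PAD_TOKEN = '[PAD]'
START_DECODING = '[START]'
STOP_DECODING = '[STOP]'


def build_vocab(words, max_size):
    """Map special tokens to ids 0..3, then words in order, up to max_size entries."""
    word_to_id = {}
    for w in [UNKNOWN_TOKEN, PAD_TOKEN, START_DECODING, STOP_DECODING] + list(words):
        if max_size != 0 and len(word_to_id) >= max_size:
            break
        if w not in word_to_id:
            word_to_id[w] = len(word_to_id)
    return word_to_id


def encode(tokens, word_to_id):
    """Replace each token by its id; out-of-vocabulary tokens map to [UNK] (id 0)."""
    return [word_to_id.get(t, word_to_id[UNKNOWN_TOKEN]) for t in tokens]


# Words are assumed to be sorted by frequency, as in the vocab file.
toy_vocab = build_vocab(['the', 'crash', 'video', 'prosecutor'], max_size=7)
ids = encode('the prosecutor saw the video'.split(), toy_vocab)
print(ids)  # [4, 0, 0, 4, 6]
```

With `max_size=7` only three real words fit next to the four special tokens, so `prosecutor` and `saw` both become id 0. A small vocabulary therefore produces many zeros in the encoded text.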

In [2]:
# Load vocabulary file and specify the size of vocabulary
vocab = Vocab(config.vocab_path, config.vocab_size)
# Create a Batcher instance, which loads the training data into a queue in parallel and encodes it using the vocabulary
train_batcher = Batcher(config.train_data_path,
                vocab,
                hps=config.hps,
                single_pass=True)

max_size of vocab was specified as 5000; we now have 5000 words. Stopping reading.
Finished constructing vocabulary of 5000 total words. Last word added: 1980


```python
  def next_batch(self):
    # If the batch queue is empty, print a warning
    while True:
      if self._batch_queue.qsize() == 0:
        tf.logging.warning(
          'Bucket input queue is empty when calling next_batch. Bucket queue size: %i, Input queue size: %i',
          self._batch_queue.qsize(), self._example_queue.qsize())
        if self._single_pass and self._finished_reading:
          tf.logging.info("Finished reading dataset in single_pass mode.")
          return False

      batch = self._batch_queue.get()  # get the next Batch
      enc_batch = batch.enc_batch
      target_batch = np.array(list([to_categorical(x, num_classes=self._hps.vocab_size) for x in batch.target_batch]))
      yield enc_batch, target_batch
```

The first array is the original text encoded as vocabulary ids. The second is the categorical (one-hot) encoding of the corresponding summary.
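
As a small illustration of the categorical encoding, the sketch below builds the same kind of one-hot tensor with plain NumPy. The helper `one_hot` is hypothetical and the values are toy data. It is meant to mirror what `to_categorical` produces for integer ids, not to replace it.

```python
import numpy as np


def one_hot(ids, num_classes):
    """Return an array of shape ids.shape + (num_classes,) with a single 1.0 per id."""
    ids = np.asarray(ids, dtype=np.int64)
    return np.eye(num_classes, dtype=np.float32)[ids]


target_ids = np.array([[2, 0, 3], [1, 1, 4]])   # (batch=2, steps=3), toy values
target = one_hot(target_ids, num_classes=5)
print(target.shape)  # (2, 3, 5)
```

The last axis grows to the vocabulary size, so the target tensor is much larger than the id matrix it was built from.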


In [13]:
# next_batch is a generator that yields batches (original text [x] and summary [y]) of a specific batch_size
next(train_batcher.next_batch())

(array([[ 315,  312,  313,    0,   40,  161,    6,    0, 2786,  502,    9,
            0,   10,   12,    0,    0, 1403,   20,    4,  777],
        [ 315,  312,  313,   12, 4761,  789,   45, 1723,    8, 4325,   12,
            0,  370,    9,    0,   69,   12, 3134,  610,   12],
        [ 315,  312,  313,    4, 4816,  210,    0,    0,    6,  612,    0,
           11, 2452,    9,    4, 1054,  963,   17, 1407,    9],
        [ 315,  312,  313,   82, 1320,    0,  565,  202,  672,    5,   12,
            0,    0,   10,  482,  352,   45,  441,  325,   20]],
       dtype=int32), array([[[0., 0., 0., ..., 0., 0., 0.],
         [0., 0., 0., ..., 0., 0., 0.],
         [1., 0., 0., ..., 0., 0., 0.],
         ...,
         [0., 0., 0., ..., 0., 0., 0.],
         [1., 0., 0., ..., 0., 0., 0.],
         [0., 0., 0., ..., 0., 0., 0.]],
 
        [[0., 0., 0., ..., 0., 0., 0.],
         [0., 0., 0., ..., 0., 0., 0.],
         [0., 0., 0., ..., 0., 0., 0.],
         ...,
         [0., 0., 0., ..., 0., 0

# Pointer model

Overall, the new model looks as follows:

```python
from keras.models import Model
from keras.layers import Dense, Embedding, Activation, Permute
from keras.layers import Input, Flatten, Dropout
from keras.layers.recurrent import LSTM
from keras.layers.wrappers import TimeDistributed, Bidirectional
from data_util import config
from .custom_recurrents import AttentionDecoder


def PointerModel(num_embeddings=config.vocab_size -1, #4999
                 embedding_dim=config.emb_dim,        #128
                 n_labels=config.vocab_size -1,       #4999
                 pad_length=config.padding,           #20
                 encoder_units=config.hidden_dim,     #256
                 decoder_units=config.hidden_dim,     #256
                 trainable=True,
                 return_probabilities=False):

    input_ = Input(shape=(pad_length,), dtype='float32')
    input_embed = Embedding(num_embeddings, embedding_dim,
                            input_length=pad_length,
                            trainable=trainable,
                            name='OneHot'
                            )(input_)

    encoder = Bidirectional(LSTM(output_dim=encoder_units, return_sequences=True),
                            name='encoder',
                            merge_mode='concat',
                            trainable=trainable)(input_embed)

    decoder = AttentionDecoder(decoder_units,
                               name='attention_decoder_1',
                               output_dim=n_labels,
                               return_probabilities=return_probabilities,
                               trainable=trainable)(encoder)
    output_2 = Dense(output_dim=n_labels, activation='softmax')(decoder)
    model = Model(input=input_, output=output_2)
    return model
```

The main logic of the AttentionDecoder class:

```python
def step(self, x, states):
        # 1. Attention Distribution
        ytm, stm = states
        print("stm", stm.shape)
        print("ytm", ytm.shape)
        # repeat the hidden state to the length of the sequence
        _stm = K.repeat(stm, self.timesteps)
        # now multiply the weight matrix with the repeated hidden state
        _Waxstm = K.dot(_stm, self.W_a)
        _UaxH = time_distributed_dense(self.x_seq, self.U_a,
                                       b=self.b_a,
                                       input_dim=self.input_dim,
                                       timesteps=self.timesteps,
                                       output_dim=self.units)
        # calculate the attention probabilities
        # this relates how much other timesteps contributed to this one.
        et = K.dot(activations.tanh(_Waxstm + _UaxH),
                   K.expand_dims(self.V_a))
        print("E_tj", et.shape)
        p_j = K.exp(et)
        p_j_sum = K.sum(p_j, axis=1)
        p_j_sum_repeated = K.repeat(p_j_sum, self.timesteps)
        p_j /= p_j_sum_repeated  # vector of size (batchsize, timesteps, 1)

        # 2. Vocabulary distribution
        # calculate the context vector
        v_j = K.squeeze(K.batch_dot(p_j, self.x_seq, axes=1), axis=1)
        stm_v_j = K.concatenate([stm, v_j])
        Vxstm_v_j = K.dot(stm_v_j, self.V)
        Vxstm_v_j += self.b
        p_vocab = activations.softmax(K.dot(Vxstm_v_j, self.V_) + self.b_)
        # 3. Copy distribution
        p_copy = p_j
        # 4. Generative distribution
        p_gen = activations.sigmoid(
            K.dot(ytm, self.w_x)
            + K.dot(stm, self.w_s)
            + K.dot(v_j, self.w_v)
            + self.b)
        # 5. Final distribution
        p_final = p_gen*p_vocab + (1-p_gen)*p_copy
        print("p_j", p_j.shape)
        print("p_final", p_final.shape)
        print("p_gen", p_gen.shape)
        if self.return_probabilities:
            return p_j, [p_final, p_gen]
        else:
            return p_final, [p_final, p_gen]
```
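
For reference, the pointer-generator paper defines the final distribution as $P(w) = p_{gen} P_{vocab}(w) + (1 - p_{gen}) \sum_{i: w_i = w} a_i$, so attention weights have to be projected onto the vocabulary ids of the source tokens before they can be mixed with `p_vocab`. The NumPy sketch below shows that projection for a single decoder step. It is a standalone illustration with toy values, not the code of `AttentionDecoder`. The function name `final_distribution` is hypothetical, and the extended vocabulary for out-of-vocabulary words is ignored.

```python
import numpy as np


def final_distribution(p_vocab, attention, source_ids, p_gen):
    """Mix generation and copy distributions for one decoder step.

    p_vocab:    (batch, vocab_size) softmax over the vocabulary
    attention:  (batch, src_len) attention weights over source positions
    source_ids: (batch, src_len) vocabulary id of each source token
    p_gen:      (batch, 1) probability of generating rather than copying
    """
    p_copy = np.zeros_like(p_vocab)
    rows = np.arange(p_vocab.shape[0])[:, None]
    # Scatter-add: attention on repeated source words accumulates on the same id.
    np.add.at(p_copy, (rows, source_ids), attention)
    return p_gen * p_vocab + (1.0 - p_gen) * p_copy


p_vocab = np.array([[0.1, 0.2, 0.3, 0.4]])
attention = np.array([[0.5, 0.3, 0.2]])
source_ids = np.array([[2, 0, 2]])   # word 2 appears twice in the source
p_gen = np.array([[0.6]])
p_final = final_distribution(p_vocab, attention, source_ids, p_gen)
print(p_final)  # approximately [[0.18 0.12 0.46 0.24]]
```

Because both inputs are proper distributions, the mixture still sums to one.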

# GAN components

Both the Generator and the Reconstructor are seq2seq hybrid pointer-generator networks. I will provide snippets for the Generator and the Discriminator only, because the Reconstructor is very similar to the Generator.

```python
class Generator(object):
    def __init__(self, num_embeddings, embedding_dim, n_labels, pad_length, encoder_units, decoder_units):
        self.pointer_model = PointerModel(num_embeddings=num_embeddings,   #5000
                                          embedding_dim=embedding_dim,     #128
                                          n_labels=n_labels,               #5000
                                          pad_length=pad_length,           #20
                                          encoder_units=encoder_units,     #256
                                          decoder_units=decoder_units,     #256
                                          trainable=True,
                                          return_probabilities=False)

    def model(self):
        return self.pointer_model
```

```python
class Discriminator(object):
    def __init__(self):
        self.vocabulary_size = config.VOCABULARY_SIZE
        self.sequence_length = config.MAX_SEQUENCE_LENGTH
        self.embedding_dim = config.EMBEDDING_DIM
        self.filter_sizes = config.filter_sizes
        self.dropout_rate = config.dropout_rate
        self.num_filters = config.num_filters
        self.output_dim = config.output_dim
        self.model_dir = config.models_dir
        self.model_path = config.MODEL
        self.weights_dir = config.weights_dir
        self.weights_path = config.WEIGHTS
        self.loss = config.loss
        self.optimizer = config.optimizer
        self.nb_epoch = config.nb_epoch
        self.batch_size = config.batch_size

    def model(self):
        inputs = Input(shape=(self.sequence_length, self.vocabulary_size))
        reshape_1 = Reshape((self.sequence_length, self.vocabulary_size, 1))(inputs)
        conv_1_0 = Convolution2D(self.num_filters, 
                                 self.filter_sizes[0], 
                                 self.embedding_dim, 
                                 border_mode='valid', 
                                 init='normal',
                                 activation='relu', 
                                 dim_ordering='tf')(reshape_1)
        maxpool_1_0 = MaxPooling2D(pool_size=(self.sequence_length - self.filter_sizes[0] + 1, 1), 
                                   strides=(1, 1),
                                   border_mode='valid',
                                   dim_ordering='tf')(conv_1_0)
        conv_1_1 = Convolution2D(self.num_filters, 
                                 self.filter_sizes[1], 
                                 self.embedding_dim, 
                                 border_mode='valid', 
                                 init='normal',
                                 activation='relu', dim_ordering='tf')(reshape_1)
        maxpool_1_1 = MaxPooling2D(pool_size=(self.sequence_length - self.filter_sizes[1] + 1, 1), 
                                   strides=(1, 1),
                                   border_mode='valid', 
                                   dim_ordering='tf')(conv_1_1)
        conv_1_2 = Convolution2D(self.num_filters, 
                                 self.filter_sizes[2],
                                 self.embedding_dim, 
                                 border_mode='valid', 
                                 init='normal',
                                 activation='relu', 
                                 dim_ordering='tf')(reshape_1)
        maxpool_1_2 = MaxPooling2D(pool_size=(self.sequence_length - self.filter_sizes[2] + 1, 1), 
                                   strides=(1, 1),
                                   border_mode='valid', 
                                   dim_ordering='tf')(conv_1_2)
        merged_tensor_1 = merge([maxpool_1_0, maxpool_1_1, maxpool_1_2], mode='concat', concat_axis=1)
        flatten_1 = Flatten()(merged_tensor_1)
        dropout_1 = Dropout(self.dropout_rate)(flatten_1)
        output_1 = Dense(output_dim=self.output_dim, activation='linear')(dropout_1)
        model_1 = Model(input=[inputs], output=output_1)
        return model_1
```
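
The pooling sizes above are chosen so that each branch collapses the sequence axis to a single value. A short sketch of the shape arithmetic for `'valid'` padding follows. The sequence length of 20 and the filter sizes `(3, 4, 5)` are assumptions for illustration, since the real values live in `config`, and the helper `valid_output_length` is hypothetical.

```python
def valid_output_length(input_length, window, stride=1):
    """Output length of a convolution or pooling window with 'valid' padding."""
    return (input_length - window) // stride + 1


sequence_length = 20          # assumed; taken from the padding length used earlier
filter_sizes = (3, 4, 5)      # hypothetical; the real values live in config

shapes = []
for h in filter_sizes:
    conv_len = valid_output_length(sequence_length, h)
    # The pool window equals the convolution output length, so one value remains.
    pooled_len = valid_output_length(conv_len, sequence_length - h + 1)
    shapes.append((h, conv_len, pooled_len))

print(shapes)  # [(3, 18, 1), (4, 17, 1), (5, 16, 1)]
```

This is max-over-time pooling: every filter contributes one feature per branch, regardless of the filter height.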

In [1]:
from run import Train

Using TensorFlow backend.


In [2]:
train_model = Train()

max_size of vocab was specified as 5000; we now have 5000 words. Stopping reading.
Finished constructing vocabulary of 5000 total words. Last word added: 1980


### Train generator separately

In [4]:
train_model.setup_train_generator()


inputs shape: (?, ?, 512)
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_1 (InputLayer)         (None, 20)                0         
_________________________________________________________________
OneHot (Embedding)           (None, 20, 128)           640000    
_________________________________________________________________
encoder (Bidirectional)      (None, 20, 512)           788480    
_________________________________________________________________
attention_decoder_1 (Attenti (None, 20, 5000)          33603784  
_________________________________________________________________
dense_1 (Dense)              (None, 20, 5000)          25005000  
Total params: 60,037,264
Trainable params: 60,037,264
Non-trainable params: 0
_________________________________________________________________



Generator Compiled.


kwargs passed to function are ignored with Tensorflow backend


Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
Generator training complete.


### WGAN implementation

```python
class WGAN(GAN):
    def __init__(self, **kwargs):
        super(WGAN, self).__init__(**kwargs)
        self.critic = self.define_critic()
        self.gan = self.define_gan()

    # calculate wasserstein loss
    def wasserstein_loss(self, y_true, y_pred):
        return backend.mean(y_true * y_pred)

    # define the standalone critic model
    def define_critic(self):
        model = self.discriminator
        opt = RMSprop(lr=0.00005)
        model.compile(loss=self.wasserstein_loss, optimizer=opt)
        return model

    # define the combined generator and critic model, for updating the generator
    def define_gan(self):
        # make weights in the critic not trainable
        self.critic.trainable = False
        # connect them
        model = Sequential()
        # add generator
        model.add(self.generator)
        # add the critic
        model.add(self.critic)
        # compile model
        opt = RMSprop(lr=0.00005)
        model.compile(loss=self.wasserstein_loss, optimizer=opt)
        return model

    # select real samples
    def generate_real_samples(self, batch_generator):
        # take the next batch and keep only the one-hot encoded summaries
        _, X = next(batch_generator)
        n_samples = X.shape[0]
        # generate class labels, -1 for 'real'
        y = -ones((n_samples, 1))
        return X, y

    # use the generator to generate n fake examples, with class labels
    def generate_fake_samples(self, batch_generator):
        # take the next batch of encoded source texts as generator input
        X, _ = next(batch_generator)
        # predict outputs
        X = self.generate(X)
        n_samples = X.shape[0]
        # create class labels with 1.0 for 'fake'
        y = ones((n_samples, 1))
        return X, y

    # train the generator and critic
    def train(self, batch_generator, n_steps=200, n_batch=4, n_critic=5, save_iter=20):
        c1_hist, c2_hist, g_hist = list(), list(), list()
        # manually enumerate epochs
        for i in range(n_steps):
            # update the critic more than the generator
            c1_tmp, c2_tmp = list(), list()
            for _ in range(n_critic):
                # get randomly selected 'real' samples
                X_real, y_real = self.generate_real_samples(batch_generator)
                # update critic model weights
                c_loss1 = self.critic.train_on_batch(X_real, y_real)
                c1_tmp.append(c_loss1)
                # generate 'fake' examples
                X_fake, y_fake = self.generate_fake_samples(batch_generator)
                # update critic model weights
                c_loss2 = self.critic.train_on_batch(X_fake, y_fake)
                c2_tmp.append(c_loss2)
            # store critic losses
            c1_hist.append(mean(c1_tmp))
            c2_hist.append(mean(c2_tmp))
            # take the next batch of encoded source texts as input for the generator
            X_gan, _ = next(batch_generator)
            y_gan = -ones((n_batch, 1))
            # update the generator via the critic's error
            g_loss = self.gan.train_on_batch(X_gan, y_gan)
            g_hist.append(g_loss)
            # summarize loss on this batch
            print('>%d, c1=%.3f, c2=%.3f g=%.3f' % (i + 1, c1_hist[-1], c2_hist[-1], g_loss))
            if i%save_iter == 0:
                print(f"Real input:{X_gan}")
                samples = self.generate(X_gan)
                print(f"GAN results:{samples} after iteration:{i}")
        self.plot_history(c1_hist, c2_hist, g_hist)
```
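
To see what the loss computes with these labels, the NumPy sketch below evaluates it on toy critic scores. With -1 for real and +1 for fake samples, the summed critic loss is the mean fake score minus the mean real score. The original WGAN algorithm also clips the critic weights to a small range (0.01 in the paper) after each update to enforce the Lipschitz constraint. No such step appears in the snippet above, so the `clip_weights` helper is a hypothetical sketch of how it could look, not something taken from the project.

```python
import numpy as np


def wasserstein_loss(y_true, y_pred):
    """Same formula as the method above: mean of label times critic score."""
    return float(np.mean(y_true * y_pred))


def clip_weights(weights, clip_value=0.01):
    """Clip every weight array into [-clip_value, clip_value]."""
    return [np.clip(w, -clip_value, clip_value) for w in weights]


real_scores = np.array([[2.0], [1.0]])    # toy critic outputs for real summaries
fake_scores = np.array([[-1.0], [0.5]])   # toy critic outputs for generated ones

loss_real = wasserstein_loss(-np.ones((2, 1)), real_scores)  # -mean(real) = -1.5
loss_fake = wasserstein_loss(np.ones((2, 1)), fake_scores)   # mean(fake) = -0.25
critic_loss = loss_real + loss_fake
clipped = clip_weights([np.array([0.5, -0.002, -3.0])])
print(critic_loss, clipped[0])
```

In a Keras model the clipping would typically be applied to the arrays returned by `get_weights()` and written back with `set_weights()`, or expressed as a kernel constraint on the critic layers.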

```python
def setup_train_wgan_model(self):
    generator = Generator(num_embeddings=config.vocab_size,  # 5000
                          embedding_dim=config.emb_dim,  # 128
                          n_labels=config.vocab_size,  # 5000
                          pad_length=config.padding,  # 20
                          encoder_units=config.hidden_dim,  # 256
                          decoder_units=config.hidden_dim,  # 256
                          ).model()
    reconstructor = Reconstructor(num_embeddings=config.vocab_size,  # 5000
                                  embedding_dim=config.emb_dim,  # 128
                                  n_labels=config.vocab_size,  # 5000
                                  pad_length=config.padding,  # 20
                                  encoder_units=config.hidden_dim,  # 256
                                  decoder_units=config.hidden_dim,  # 256
                                  ).model()
    discriminator = Discriminator().model()
    wgan = WGAN(generator=generator,
                reconstructor=reconstructor,
                discriminator=discriminator,
                )
    try:
        wgan.train(self.train_batcher.next_batch())
    except KeyboardInterrupt:
        print('WGAN training stopped early.')
    print('WGAN training complete.')
```

In [3]:
train_model.setup_train_wgan_model()


inputs shape: (?, ?, 512)



_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_1 (InputLayer)         (None, 20)                0         
_________________________________________________________________
OneHot (Embedding)           (None, 20, 128)           640000    
_________________________________________________________________
encoder (Bidirectional)      (None, 20, 512)           788480    
_________________________________________________________________
attention_decoder_1 (Attenti (None, 20, 5000)          33603784  
_________________________________________________________________
dense_1 (Dense)              (None, 20, 5000)          25005000  
Total params: 60,037,264
Trainable params: 60,037,264
Non-trainable params: 0
_________________________________________________________________
____________________________________________________________________________________________________
Layer (type)                 

# Results

In [9]:
%%HTML
<img src="./plot_line_plot_loss.png" width="60" height="60">