# generate with filter

Use the Generator and the Filter together so that only good texts are generated.

## Generator parameters

|Parameter|Description|
|:--|:--|
| `num_layers` | Transformer parameter; see the [TensorFlow tutorial](https://www.tensorflow.org/tutorials/text/transformer) |
| `GEN_d_model` | Transformer `d_model` parameter; see the [TensorFlow tutorial](https://www.tensorflow.org/tutorials/text/transformer) |
| `dff` | Transformer parameter; see the [TensorFlow tutorial](https://www.tensorflow.org/tutorials/text/transformer) |
| `num_heads` | Transformer parameter; see the [TensorFlow tutorial](https://www.tensorflow.org/tutorials/text/transformer) |
| `GENERATOR_EPOCH` | Generator epoch whose saved weights are loaded (`model/generator/weights_epoch{GENERATOR_EPOCH}.h5`) |
| `TEMPERATURE` | Trade-off between diversity and likelihood of the generated thread titles (0.0 or greater); at 0.0 only the highest-confidence threads are generated |
| `BATCH_NUM` | Number of batches per run; each run generates `BATCH_NUM` * `BATCH_SIZE` threads and outputs those with the best Filter scores |
| `BATCH_SIZE` | Number of threads sampled per batch; each run generates `BATCH_NUM` * `BATCH_SIZE` threads and outputs those with the best Filter scores |
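
To make the role of `TEMPERATURE` concrete, here is a minimal sketch of temperature sampling over a single vector of logits. It is only an illustration under assumptions: `sample_with_temperature` is a hypothetical helper, and how `Generator.sample` in `scripts/model.py` actually applies the temperature is not shown in this notebook.

```python
import math
import random


def sample_with_temperature(logits, temperature, rng=random):
    """Pick one index from `logits` (hypothetical helper, not part of scripts/model.py).

    temperature == 0.0 falls back to argmax (the most confident token);
    larger values flatten the distribution and increase diversity.
    """
    if temperature <= 0.0:
        # Greedy choice: index of the largest logit.
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    # Subtract the maximum before exponentiating for numerical stability.
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]
```

With `temperature=0.0` the call is deterministic; with `TEMPERATURE = 0.8` lower-scoring tokens are still sampled occasionally, which is what lets the Filter choose among varied candidates.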

## Filter parameters

|Parameter|Description|
|:--|:--|
| `conv_filters` | Filter parameter; see [scripts/model.py](scripts/model.py) for details |
| `conv_kernel_sizes` | Filter parameter; see [scripts/model.py](scripts/model.py) for details |
| `FLT_d_model` | Filter `d_model` parameter; see [scripts/model.py](scripts/model.py) for details |
| `FILTER_EPOCH` | Filter epoch whose saved weights are loaded (`model/filter/weights_epoch{FILTER_EPOCH}.h5`) |

In [1]:
import os
import pickle
import numpy as np
import tensorflow as tf

In [2]:
# Generator model parameters
num_layers = 4
GEN_d_model = 128
dff = 512
num_heads = 8
GENERATOR_EPOCH = 37
TEMPERATURE = 0.8
BATCH_NUM = 40
BATCH_SIZE = 128

# Filter model parameters
conv_filters = [32, 64, 128]
conv_kernel_sizes = [16, 8, 4]
FLT_d_model = 128
FILTER_EPOCH = 2

In [3]:
import sentencepiece as spm
sp = spm.SentencePieceProcessor()
sp.load('sentencepiece.model')

True

In [4]:
# Used when loading the weight files
with open("real_dataset.pickle", "rb") as f:
    ids = pickle.load(f)
real_dataset_tensor = tf.keras.preprocessing.sequence.pad_sequences(ids, padding='post')
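
For readers unfamiliar with `padding='post'`, the following pure-Python sketch shows the effect on a toy input: shorter sequences are extended with zeros at the end until they match the longest one. `pad_post` is a hypothetical stand-in for illustration only; the notebook itself uses `tf.keras.preprocessing.sequence.pad_sequences`, which returns a NumPy array rather than nested lists.

```python
def pad_post(sequences, value=0):
    """Right-pad every sequence with `value` up to the longest length (illustrative only)."""
    max_len = max(len(seq) for seq in sequences)
    return [list(seq) + [value] * (max_len - len(seq)) for seq in sequences]


padded = pad_post([[5, 8, 2], [7]])
print(padded)  # [[5, 8, 2], [7, 0, 0]]
```

The padded width is what the notebook later reads back as `seq_len`.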

In [5]:
# Input data parameters
vocab_size = sp.get_piece_size()
seq_len = real_dataset_tensor.shape[1]

In [6]:
from scripts.model import Generator

In [7]:
generator = Generator(num_layers, GEN_d_model, num_heads, dff, vocab_size, max_pos_encoding=seq_len)

In [8]:
# HACK: create variables
_ = generator(tf.constant(real_dataset_tensor[:1]), training=False)

In [9]:
generator.load_weights(f'model/generator/weights_epoch{GENERATOR_EPOCH}.h5')

In [None]:
# generate with initial model
generation_ids = generator.sample(num_sample=5, temperature=TEMPERATURE, padding=True)

for ids in generation_ids:
    ids_int = list(map(lambda x: int(x), ids))
    print(sp.decode_ids(ids_int))

In [11]:
from scripts.model import Filter
nanj_filter = Filter(conv_filters, conv_kernel_sizes, FLT_d_model, vocab_size)

In [12]:
# HACK: create variables
_ = nanj_filter(tf.constant(real_dataset_tensor[:1]), training=False)

In [13]:
nanj_filter.load_weights(f'model/filter/weights_epoch{FILTER_EPOCH}.h5')

In [None]:
# generate with filter
results = []
for batch in range(BATCH_NUM):
    generation_ids = generator.sample(num_sample=BATCH_SIZE, temperature=TEMPERATURE, padding=True)
    scores = [float(v) for v in tf.math.sigmoid(nanj_filter(generation_ids, training=False))]
    results.extend(list(zip(generation_ids, scores)))

print(">>> best 20 <<<")
for ids, score in sorted(results, key=lambda x: -x[1])[:20]:
    text = sp.decode_ids(list(map(lambda x: int(x), ids)))
    print(text, score)
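
The ranking step in the cell above (sort every candidate by its Filter score and keep the best 20) can be factored into a small helper. This is a sketch under assumptions: `top_k_by_score` is a hypothetical name, and it uses `heapq.nlargest` instead of a full sort, which gives the same best-first order for the kept items.

```python
import heapq


def top_k_by_score(candidates, scores, k=20):
    """Return the k (candidate, score) pairs with the highest scores, best first.

    Hypothetical helper; `candidates` and `scores` must have the same length.
    """
    return heapq.nlargest(k, zip(candidates, scores), key=lambda pair: pair[1])


best = top_k_by_score(["a", "b", "c"], [0.2, 0.9, 0.5], k=2)
print(best)  # [('b', 0.9), ('c', 0.5)]
```

In the notebook's terms, `candidates` would be the accumulated `generation_ids` and `scores` the sigmoid outputs of `nanj_filter`.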

In [None]:
prefix = sp.encode_as_ids("三大")

results = []
for batch in range(BATCH_NUM):
    generation_ids = generator.sample(num_sample=BATCH_SIZE, temperature=TEMPERATURE, padding=True, prefix=prefix)
    scores = [float(v) for v in tf.math.sigmoid(nanj_filter(generation_ids, training=False))]
    results.extend(list(zip(generation_ids, scores)))


print(">>> best 20 <<<")
for ids, score in sorted(results, key=lambda x: -x[1])[:20]:
    text = sp.decode_ids(list(map(lambda x: int(x), ids)))
    print(text, score)

In [None]:
prefix = sp.encode_as_ids("【なぞなぞ】")

results = []
for batch in range(BATCH_NUM):
    generation_ids = generator.sample(num_sample=BATCH_SIZE, temperature=TEMPERATURE, padding=True, prefix=prefix)
    scores = [float(v) for v in tf.math.sigmoid(nanj_filter(generation_ids, training=False))]
    results.extend(list(zip(generation_ids, scores)))

print(">>> best 20 <<<")
for ids, score in sorted(results, key=lambda x: -x[1])[:20]:
    text = sp.decode_ids(list(map(lambda x: int(x), ids)))
    print(text, score)

In [None]:
prefix = sp.encode_as_ids("ぷよぷよ")

results = []
for batch in range(BATCH_NUM):
    generation_ids = generator.sample(num_sample=BATCH_SIZE, temperature=TEMPERATURE, padding=True, prefix=prefix)
    scores = [float(v) for v in tf.math.sigmoid(nanj_filter(generation_ids, training=False))]
    results.extend(list(zip(generation_ids, scores)))

print(">>> best 20 <<<")
for ids, score in sorted(results, key=lambda x: -x[1])[:20]:
    text = sp.decode_ids(list(map(lambda x: int(x), ids)))
    print(text, score)

In [None]:
prefix = sp.encode_as_ids("【徹底討論】")

results = []
for batch in range(BATCH_NUM):
    generation_ids = generator.sample(num_sample=BATCH_SIZE, temperature=TEMPERATURE, padding=True, prefix=prefix)
    scores = [float(v) for v in tf.math.sigmoid(nanj_filter(generation_ids, training=False))]
    results.extend(list(zip(generation_ids, scores)))

print(">>> best 20 <<<")
for ids, score in sorted(results, key=lambda x: -x[1])[:20]:
    text = sp.decode_ids(list(map(lambda x: int(x), ids)))
    print(text, score)