This is a pretty straightforward text generation task: fine-tune a version of the GPT-2 language model, then use it to quickly generate amusing texts that read a lot like actual bug reports, albeit a little skewed.

In [1]:
from bs4 import BeautifulSoup
import requests
import csv
import gpt_2_simple as gpt2
import os
import tensorflow as tf

The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
  * https://github.com/tensorflow/io (for I/O related ops)
If you depend on functionality not listed there, please file an issue.



Do yourself a favour and run this on a GPU. I haven't timed it CPU-only, since I think I'll probably need my computer in the next few days.

In [2]:
from tensorflow.python.client import device_lib
def get_available_devices():
    local_device_protos = device_lib.list_local_devices()
    return [x.name for x in local_device_protos]
print(get_available_devices()) 

['/device:CPU:0', '/device:GPU:0']


The next few cells scrape the text of various bug reports from the Dwarf Fortress tracker... 

In [3]:
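def get_bugs_page(pageno):
    # Fetch one page of the bug list; return its HTML, or False if the request fails.
    url = f'https://www.bay12games.com/dwarves/mantisbt/view_all_bug_page.php?page_number={pageno}'
    r = requests.get(url, timeout=30)  # the timeout stops a stalled request from hanging forever
    if r.status_code == 200:
        return r.text
    return False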

In [4]:
def extract_bug_text(text):
    soup = BeautifulSoup(text, 'html.parser')
    tds = soup.find_all('td', class_='left')
    return [td.text.strip() for td in tds[:-1] if len(td.text) > 5] 

...and fire them into a text file, separated by newlines, for use in training our model.

In [5]:
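# Open the file once, in write mode, so re-running the cell does not duplicate reports.
with open('bugs.txt', 'w', encoding='utf-8') as outfile:
    for i in range(1, 215): #TODO: get rid of magic number
        text = get_bugs_page(i)
        if not text:  # get_bugs_page returns False when the request fails
            continue
        for bug in extract_bug_text(text):
            outfile.write(f'{bug}\n')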
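
One possible way to retire the hard-coded page count in the loop above is to keep requesting pages until the tracker seems to run out. The sketch below is only an illustration: `scrape_all_pages`, `max_pages`, and the stand-in fetcher are hypothetical names, and the stopping rule (a failed fetch, an empty page, or a page identical to the previous one) is an assumption about how the tracker behaves past its last page, not something verified against the site.

```python
def scrape_all_pages(fetch_page, extract, max_pages=1000):
    """Collect bug texts page by page until the tracker appears to run out.

    Stops on a failed fetch, an empty page, or a page identical to the
    previous one. max_pages is a safety cap, not the real page count.
    """
    all_bugs, previous = [], None
    for pageno in range(1, max_pages + 1):
        text = fetch_page(pageno)
        if not text:  # fetch failed or returned nothing
            break
        bugs = extract(text)
        if not bugs or bugs == previous:  # ran past the last page
            break
        all_bugs.extend(bugs)
        previous = bugs
    return all_bugs


# Stand-in fetcher and extractor so the sketch runs without network access.
fake_pages = {1: "bug a|bug b", 2: "bug c"}
collected = scrape_all_pages(
    lambda n: fake_pages.get(n, False),  # mimics get_bugs_page returning False
    lambda t: t.split("|"),              # mimics extract_bug_text returning a list
)
print(collected)
```

In the notebook itself, `get_bugs_page` and `extract_bug_text` could be passed in place of the two lambdas.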

I'm using the 355 million parameter GPT-2 model here, which clocks in at about 1.5GB before fine-tuning.

In [6]:
model_name = "355M"
if not os.path.isdir(os.path.join("models", model_name)):
    gpt2.download_gpt2(model_name=model_name)

I've set the fine-tuning to start fresh from the base GPT-2 checkpoint every time, rather than restoring from a previous fine-tuning run. If you want to restore, just remove the restore_from param. It will also give a sample output every 200 training steps. I'm only training for 1,000 steps here, and you could probably get better results with more training, although there's always the risk of overfitting, that is, of the model learning spurious connections from such a restricted dataset.

In [None]:
sess = gpt2.start_tf_sess()
gpt2.finetune(sess, 'bugs.txt', model_name=model_name, steps=1000, restore_from='fresh', sample_every=200)

Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
Instructions for updating:
Please use tensorflow.python.ops.op_selector.get_backward_walk_ops.
Loading checkpoint models\355M\model.ckpt
INFO:tensorflow:Restoring parameters from models\355M\model.ckpt


  0%|                                                                                            | 0/1 [00:00<?, ?it/s]

Loading dataset...


100%|████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:01<00:00,  1.27s/it]


dataset has 134144 tokens
Training...
[1 | 9.70] loss=4.59 avg=4.59
[2 | 10.72] loss=4.79 avg=4.69
[3 | 11.73] loss=4.70 avg=4.69
[4 | 12.75] loss=4.59 avg=4.67
[5 | 13.77] loss=4.47 avg=4.63
[6 | 14.78] loss=4.67 avg=4.64
[7 | 15.80] loss=4.47 avg=4.61
[8 | 16.81] loss=4.48 avg=4.59
[9 | 17.82] loss=4.70 avg=4.61
[10 | 18.84] loss=4.58 avg=4.60
[11 | 19.87] loss=4.58 avg=4.60
[12 | 20.91] loss=4.75 avg=4.61
[13 | 21.92] loss=4.48 avg=4.60
[14 | 22.96] loss=4.43 avg=4.59
[15 | 23.99] loss=4.49 avg=4.58
[16 | 25.04] loss=4.19 avg=4.56
[17 | 26.08] loss=4.52 avg=4.55
[18 | 27.12] loss=4.54 avg=4.55
[19 | 28.14] loss=4.24 avg=4.54
[20 | 29.16] loss=4.65 avg=4.54
[21 | 30.18] loss=4.43 avg=4.54
[22 | 31.21] loss=4.44 avg=4.53
[23 | 32.25] loss=4.67 avg=4.54
[24 | 33.29] loss=4.17 avg=4.52
[25 | 34.31] loss=4.47 avg=4.52
[26 | 35.36] loss=4.28 avg=4.51
[27 | 36.44] loss=4.51 avg=4.51
[28 | 37.50] loss=4.48 avg=4.51
[29 | 38.56] loss=3.88 avg=4.48
[30 | 39.61] loss=4.11 avg=4.47
[31 | 40.64]

[201 | 241.31] loss=3.08 avg=3.70
[202 | 242.43] loss=3.66 avg=3.70
[203 | 243.53] loss=3.92 avg=3.70
[204 | 244.59] loss=2.15 avg=3.68
[205 | 245.66] loss=3.08 avg=3.68
[206 | 246.75] loss=3.94 avg=3.68
[207 | 247.84] loss=3.92 avg=3.68
[208 | 248.90] loss=2.80 avg=3.67
[209 | 249.93] loss=3.10 avg=3.66
[210 | 250.97] loss=2.71 avg=3.65
[211 | 252.07] loss=3.49 avg=3.65
[212 | 253.14] loss=3.00 avg=3.64
[213 | 254.22] loss=2.82 avg=3.64
[214 | 255.25] loss=2.34 avg=3.62
[215 | 256.32] loss=3.15 avg=3.62
[216 | 257.35] loss=4.08 avg=3.62
[217 | 258.43] loss=3.14 avg=3.62
[218 | 259.50] loss=2.89 avg=3.61
[219 | 260.55] loss=3.77 avg=3.61
[220 | 261.60] loss=3.15 avg=3.60
[221 | 262.68] loss=2.98 avg=3.60
[222 | 263.73] loss=3.09 avg=3.59
[223 | 264.81] loss=4.14 avg=3.60
[224 | 265.90] loss=2.41 avg=3.58
[225 | 266.98] loss=2.33 avg=3.57
[226 | 268.07] loss=3.29 avg=3.57
[227 | 269.15] loss=2.62 avg=3.56
[228 | 270.22] loss=3.60 avg=3.56
[229 | 271.25] loss=4.32 avg=3.57
[230 | 272.28]

[401 | 472.39] loss=1.58 avg=2.74
[402 | 473.44] loss=1.61 avg=2.73
[403 | 474.48] loss=3.28 avg=2.73
[404 | 475.51] loss=1.61 avg=2.72
[405 | 476.59] loss=1.13 avg=2.71
[406 | 477.69] loss=2.59 avg=2.71
[407 | 478.81] loss=2.09 avg=2.70
[408 | 479.85] loss=0.75 avg=2.68
[409 | 480.89] loss=3.74 avg=2.69
[410 | 481.94] loss=3.12 avg=2.69
[411 | 482.97] loss=2.06 avg=2.69
[412 | 484.06] loss=1.23 avg=2.67
[413 | 485.12] loss=2.19 avg=2.67
[414 | 486.23] loss=2.07 avg=2.66
[415 | 487.34] loss=3.02 avg=2.67
[416 | 488.39] loss=3.35 avg=2.67
[417 | 489.43] loss=3.81 avg=2.68
[418 | 490.46] loss=2.12 avg=2.68
[419 | 491.50] loss=1.24 avg=2.66
[420 | 492.54] loss=2.44 avg=2.66
[421 | 493.60] loss=3.22 avg=2.67
[422 | 494.63] loss=2.39 avg=2.66
[423 | 495.66] loss=2.10 avg=2.66
[424 | 496.71] loss=1.81 avg=2.65
[425 | 497.77] loss=1.45 avg=2.64
[426 | 498.84] loss=2.78 avg=2.64
[427 | 499.91] loss=1.83 avg=2.63
[428 | 500.95] loss=1.51 avg=2.62
[429 | 502.03] loss=0.74 avg=2.60
[430 | 503.07]
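
The avg column in the log above is smoother than the raw loss, and it looks like a bias-corrected exponential moving average. The sketch below is a guess at that calculation rather than a copy of the library's code: the function name `running_average` is hypothetical and the decay of 0.99 is an assumption, chosen because it reproduces the first few logged values.

```python
def running_average(losses, decay=0.99):
    """Bias-corrected exponential moving average of a sequence of losses."""
    averages = []
    weighted_sum, weight = 0.0, 0.0
    for loss in losses:
        # Older losses are discounted by `decay` each step; dividing by the
        # accumulated weight keeps the early averages from being biased low.
        weighted_sum = weighted_sum * decay + loss
        weight = weight * decay + 1.0
        averages.append(weighted_sum / weight)
    return averages


# The first five loss values from the training log above.
logged_losses = [4.59, 4.79, 4.70, 4.59, 4.47]
reconstructed = [round(a, 2) for a in running_average(logged_losses)]
print(reconstructed)
```

With these five inputs the reconstructed values match the avg column for steps 1 to 5 (4.59, 4.69, 4.69, 4.67, 4.63).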

We can now generate text with our fine-tuned language model. We'll ask for 1,000 samples of up to 100 tokens each, and gpt-2-simple lets us generate them in parallel batches. Generation is non-deterministic because each token is sampled from the model's predicted distribution (see https://huggingface.co/blog/how-to-generate for an explanation), so we won't get the same results on every run. I've pumped the temperature up to 1.0 to get slightly crazier results, which I find amusing: a higher temperature flattens the distribution, making low-probability words more likely to be included in our sequences.

In [31]:
texts = gpt2.generate(sess, length=100, nsamples=1000, batch_size=10, temperature=1.0, top_p=0.9, return_as_list=True)
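
To make the temperature and top_p arguments in the call above a little more concrete, here is a rough sketch of what the two knobs do to a next-token distribution. It is a toy illustration in plain Python, not the code gpt-2-simple runs internally: the helper names `apply_temperature` and `top_p_filter` and the four toy scores are made up for the example.

```python
import math


def apply_temperature(logits, temperature):
    """Turn raw scores into probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the smallest set of tokens whose cumulative probability reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    # Renormalise so the surviving tokens sum to 1 again.
    return {i: probs[i] / mass for i in kept}


toy_logits = [4.0, 2.0, 1.0, 0.0]          # hypothetical scores for four tokens
cool = apply_temperature(toy_logits, 0.7)  # sharper distribution
warm = apply_temperature(toy_logits, 1.0)  # flatter distribution
nucleus = top_p_filter(warm, 0.9)          # tokens that survive top_p=0.9

print(cool, warm, nucleus, sep="\n")
```

The warmer distribution gives the unlikely tokens more probability than the cooler one does, while the top_p filter then cuts off the long tail, so the output gets more adventurous without becoming pure noise.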

There are a couple of problems we need to fix before saving our results. The model will generate a chunk of text up to the desired token length, but we want a nicely separated list of potential messages. So, we split each sample on newline characters, throwing away any blank results. Then, we get rid of the last result. Why? Because once the token limit is reached, generation will stop in the middle of a phrase, which we don't want to include.

In [32]:
with open('gen_texts.csv', 'w', encoding='utf-8', newline='') as csvfile:
    writer = csv.writer(csvfile)
    for text in texts:
        split_texts = [i for i in text.split('\n') if len(i) > 0][:-1]
        for msg in split_texts:
            writer.writerow([msg])
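
To see the effect of that clean-up on a single sample, here is the same split-and-trim step pulled out into a small function. The function name `split_sample` and the sample string are made up for illustration; the sample is not real model output.

```python
def split_sample(text):
    """Split one generated sample into lines, dropping blanks and the final fragment."""
    lines = [line for line in text.split('\n') if len(line) > 0]
    return lines[:-1]  # the last line was probably cut off by the token limit


# A made-up sample: two complete lines, one blank line, one truncated line.
sample = "Dwarves refuse to haul barrels\n\nCats adopt the mayor\nGoblin siege arrives with no"
messages = split_sample(sample)
print(messages)
```

One consequence worth knowing: a sample that contains only a single line produces no messages at all, because that line is treated as the truncated tail.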