# ShakesBERT

BERT's BertForNextSentencePrediction class gives a score for the likelihood that a sentence (or line) follows a preceding one. We can use this for example to construct a new sonnet from lines of existing Shakespeare sonnets. The new sonnet will have a higher likelihood of making sense than if we merely drew the lines at random. The next sentence prediction therefore acts as a kind of sense discriminator.

Sonnet lines are taken from [Poetry DB](http://poetrydb.org).

In [1]:
import torch
from pytorch_pretrained_bert import BertTokenizer, BertForNextSentencePrediction

In [3]:
tokeniser = BertTokenizer.from_pretrained('bert-base-uncased')

model = BertForNextSentencePrediction.from_pretrained('bert-base-uncased')

100%|██████████| 231508/231508 [00:01<00:00, 140973.73B/s]


In [5]:
import urllib
import json
from random import *

url = 'http://poetrydb.org/author,linecount/Shakespeare;14/lines'
with urllib.request.urlopen(url) as response:
    data = json.load(response)   

    
poem_number = randint(0, len(data)-1)
#previous_line = data[poem_number]['lines'][0]
previous_line="shall i compare thee to a summer's day?\n"
print(previous_line.strip())

next_line_prediction = 0
threshold = 3
poems_picked = [poem_number]

for line_number in range(1, 14):
    next_line_prediction = 0
    while(line_number == len(poems_picked)):
        poem_number = randint(0, len(data)-1)
        line_to_check = data[poem_number]['lines'][line_number]
        
        len_line_1 = len(tokeniser.tokenize(previous_line))
        len_line_2 = len(tokeniser.tokenize(line_to_check))

        text = previous_line + ' ' + line_to_check
        tokenized_text = tokeniser.tokenize(text)

        indexed_tokens = tokeniser.convert_tokens_to_ids(tokenized_text)
        segments_ids = ([0] * len_line_1) + ([1] * len_line_2)
        tokens_tensor = torch.tensor([indexed_tokens])
        segments_tensors = torch.tensor([segments_ids])
        
        predictions = model(tokens_tensor, segments_tensors)
        
        next_line_prediction = predictions[0,0].item()
        # No poem should be taken a line from more than once
        if poem_number not in poems_picked and next_line_prediction > threshold:
            poems_picked = poems_picked + [poem_number]

    print(line_to_check.strip())
    previous_line = line_to_check

shall i compare thee to a summer's day?
But sad mortality o'ersways their power,
Spend'st thou thy fury on some worthless song,
Use power with power, and slay me not by art,
Now proud as an enjoyer, and anon
When I break twenty? I am perjur'd most;
There is such strength and warrantise of skill,
Or ten times happier, be it ten for one;
So is the time that keeps you as my chest,
Than when her mournful hymns did hush the night,
Making no summer of another's green,
But bears it out even to the edge of doom.
Then thank him not for that which he doth say,
And gain by ill thrice more than I have spent.


# Chinese version

In [3]:
model = BertForNextSentencePrediction.from_pretrained('bert-base-chinese')

tokeniser = BertTokenizer.from_pretrained('bert-base-chinese')

100%|██████████| 382072689/382072689 [01:32<00:00, 4135578.78B/s]


In [33]:
f = open('poems_clean.txt', "r", encoding='utf-8')
poems = []
for line in f.readlines():
    title, poem = line.split(':')
    poem = poem.replace('\n', '') #将换行符去掉
    poem=poem.split(' ')
    for i in poem:
        if len(i)==0:
            poem.remove(i)
    poems.append(poem)

In [37]:
poems[0]

['寒随穷律变', '春逐鸟声开', '初风飘带柳', '晚雪间花梅', '碧林青旧竹', '绿沼翠新苔', '芝田初雁去', '绮树巧莺来', '']

In [90]:
#Chinese version

import urllib
import json
from random import *

data=poems   
poem_number = randint(0, len(data)-1)
#previous_line = data[poem_number]['lines'][0]
previous_line="寒随穷律变\n"
print(previous_line.strip())

next_line_prediction = 0
threshold = 3
poems_picked = [poem_number]

寒随穷律变


In [91]:
goal=[]
previous_line="寒随穷律变"
goal.append(previous_line)
for line_number in range(1, 4):
    next_line_prediction = 0
    while(line_number == len(poems_picked)):
        poem_number = randint(0, len(data)-1)
        line_to_check = data[poem_number][line_number]
        if len(line_to_check)==len(previous_line):
            len_line_1 = len(tokeniser.tokenize(previous_line))
            len_line_2 = len(tokeniser.tokenize(line_to_check))

            text = previous_line + ' ' + line_to_check
            tokenized_text = tokeniser.tokenize(text)

            indexed_tokens = tokeniser.convert_tokens_to_ids(tokenized_text)
            segments_ids = ([0] * len_line_1) + ([1] * len_line_2)
            tokens_tensor = torch.tensor([indexed_tokens])
            segments_tensors = torch.tensor([segments_ids])

            predictions = model(tokens_tensor, segments_tensors)

            next_line_prediction = predictions[0,0].item()
            # No poem should be taken a line from more than once
            if poem_number not in poems_picked and next_line_prediction > threshold:
                poems_picked = poems_picked + [poem_number]
                goal.append(line_to_check)
                previous_line = line_to_check

In [93]:
for i in goal:
    print(i)

寒随穷律变
苒苒出蓬蒿
从来悟明主
月照海门秋
