In [1]:
import pandas as pd
import numpy as np
import string
import random

In [2]:
# Shakespeare = "Data/alllines.txt"
# Lines = []
# with open(Shakespeare, "r") as f:
#     line = f.readlines()
data = pd.read_csv('Data/Shakespeare_data.csv')
data

Unnamed: 0,Dataline,Play,PlayerLinenumber,ActSceneLine,Player,PlayerLine
0,1,Henry IV,,,,ACT I
1,2,Henry IV,,,,SCENE I. London. The palace.
2,3,Henry IV,,,,"Enter KING HENRY, LORD JOHN OF LANCASTER, the ..."
3,4,Henry IV,1.0,1.1.1,KING HENRY IV,"So shaken as we are, so wan with care,"
4,5,Henry IV,1.0,1.1.2,KING HENRY IV,"Find we a time for frighted peace to pant,"
...,...,...,...,...,...,...
111391,111392,A Winters Tale,38.0,5.3.180,LEONTES,"Lead us from hence, where we may leisurely"
111392,111393,A Winters Tale,38.0,5.3.181,LEONTES,Each one demand an answer to his part
111393,111394,A Winters Tale,38.0,5.3.182,LEONTES,Perform'd in this wide gap of time since first
111394,111395,A Winters Tale,38.0,5.3.183,LEONTES,We were dissever'd: hastily lead away.


# Generate word dictionaries

Since I only need the text to predict words, I take only the PlayerLine column from the dataset.

The first thing I want to do is build a dictionary that stores, for each word, the probability of each word that follows it. I also calculate the probability of each word occurring at the start of a line, so I can use it later to generate new text.

In [3]:
text = np.array(data['PlayerLine'])
text

array(['ACT I', 'SCENE I. London. The palace.',
       'Enter KING HENRY, LORD JOHN OF LANCASTER, the EARL of WESTMORELAND, SIR WALTER BLUNT, and others',
       ..., "Perform'd in this wide gap of time since first",
       "We were dissever'd: hastily lead away.", 'Exeunt'], dtype=object)

I first removed all the punctuation marks from the training text, so that the dictionary is smaller. I also added an 'END' flag to the end of each line, so I can get the probability of a word ending the line.
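To make the preprocessing step concrete, here is a minimal sketch applied to a single line. The helper name `tokenize_line` is hypothetical (it does not appear in the notebook); it only wraps the same `str.translate` / `lower` / `split` calls used in the cell below.

```python
import string

def tokenize_line(line):
    """Strip ASCII punctuation, lowercase, split on whitespace, append 'END'."""
    table = str.maketrans('', '', string.punctuation)
    words = line.translate(table).lower().split()
    # 'END' stays uppercase, so it cannot collide with the lowercased word 'end'
    words.append('END')
    return words

tokens = tokenize_line("So shaken as we are, so wan with care,")
print(tokens)
# ['so', 'shaken', 'as', 'we', 'are', 'so', 'wan', 'with', 'care', 'END']
```

Note that apostrophes are also removed, so a word such as "dissever'd" becomes "disseverd".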

In [4]:
dictionary = {}
first_word_dict = {}
for line in text:
    # remove all punctuation mark, change word to lower case
    noPun = line.translate(str.maketrans('', '', string.punctuation)).lower()
    # split the line into words (repeats are kept) and append the end-of-line flag
    wordsInLine = noPun.split()
    wordsInLine.append('END')
    
    # count the first word of each line
    first_word = wordsInLine[0]
    if first_word not in first_word_dict:
        first_word_dict[first_word] = 1
    else: 
        first_word_dict[first_word] += 1
    
    # generate dictionary
    for i in range(len(wordsInLine)-1):
        current_word = wordsInLine[i]
        next_word = wordsInLine[i+1]
        if current_word not in dictionary:
            dictionary[current_word] = {next_word : 1}
        else:
            if next_word not in dictionary[current_word]:
                dictionary[current_word][next_word] = 1
            else:
                dictionary[current_word][next_word] += 1
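The counting loop above could probably be written more compactly with `collections.Counter` and `defaultdict`. The sketch below is an alternative under that assumption, not the code that produced this notebook's outputs; `count_transitions` and `toy_lines` are hypothetical names, and the input is assumed to be lines that are already tokenized and end with 'END'.

```python
from collections import Counter, defaultdict

def count_transitions(token_lines):
    """Count first words and (current word -> next word) pairs."""
    first_counts = Counter()
    transitions = defaultdict(Counter)
    for tokens in token_lines:
        first_counts[tokens[0]] += 1
        # pair each token with its successor: (w0, w1), (w1, w2), ...
        for current_word, next_word in zip(tokens, tokens[1:]):
            transitions[current_word][next_word] += 1
    return transitions, first_counts

toy_lines = [
    ['o', 'my', 'lord', 'END'],
    ['my', 'lord', 'END'],
    ['o', 'patience', 'END'],
]
transitions, first_counts = count_transitions(toy_lines)
print(dict(transitions['o']))   # counts of words seen after 'o'
print(first_counts['o'])        # number of lines starting with 'o'
```

Missing keys default to a count of zero here, which removes the nested `if ... not in ...` checks.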

In [5]:
# calculate probabilities
for current_word in dictionary:
    for next_word in dictionary[current_word]:
        dictionary[current_word][next_word] = dictionary[current_word][next_word] / sum(dictionary[current_word].values())

for word in first_word_dict:
    first_word_dict[word] = first_word_dict[word] / sum(first_word_dict.values())
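One thing to watch in the normalization cell above: `sum(...)` is recomputed on every iteration, after earlier counts have already been overwritten with probabilities, so later entries are divided by a smaller total than earlier ones. The sketch below contrasts that with computing the total once; `normalize` and `toy_counts` are hypothetical names used only for illustration.

```python
def normalize(counts):
    """Return a new dict of probabilities; the total is taken once, up front."""
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

toy_counts = {'lord': 6, 'patience': 3, 'END': 1}

# Total computed once: probabilities sum to 1 and keep the order of the counts.
probs = normalize(toy_counts)

# In-place update with a re-evaluated sum: later entries see a shrunken total.
skewed = dict(toy_counts)
for word in skewed:
    skewed[word] = skewed[word] / sum(skewed.values())

print(max(probs, key=probs.get))    # most likely word with a fixed total
print(max(skewed, key=skewed.get))  # most likely word with the running sum
print(sum(probs.values()), sum(skewed.values()))
```

Applied to the raw counts, this would look like `dictionary = {w: normalize(c) for w, c in dictionary.items()}` and `first_word_dict = normalize(first_word_dict)`; predictions may then differ from the outputs recorded in this notebook.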

# Start to guess next words with given text

I built the model as a first-order Markov chain, so the next word depends only on the current word. The model looks up the current word in the dictionary and picks the next_word with the highest probability.

In [12]:
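def guess_word(phase, x=0.5):
    # accept either a string or a list of words; x is currently unused
    if isinstance(phase, str):
        givenWords = phase.lower().split()
    else:
        givenWords = phase

    # candidate next words for the last given word
    # (raises KeyError if that word never appeared in the training text)
    possible_words = dictionary[givenWords[-1]]

    # pick the candidate with the highest stored probability
    most_frequent_words = max(possible_words, key=possible_words.get)
    return most_frequent_words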
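Because `guess_word` indexes the dictionary directly, a last word that never appeared in the training text raises a `KeyError`. A minimal sketch of one way to handle that, assuming a hypothetical helper `guess_word_safe` that takes the transition table as an argument and returns a default instead of raising:

```python
def guess_word_safe(phrase, transitions, default=None):
    """Return the most probable next word, or `default` if the last word is unknown."""
    words = phrase.lower().split() if isinstance(phrase, str) else list(phrase)
    if not words:
        return default
    followers = transitions.get(words[-1])
    if not followers:
        return default
    return max(followers, key=followers.get)

# toy transition table, for illustration only
toy_transitions = {'i': {'am': 0.7, 'shall': 0.3}, 'am': {'END': 1.0}}

print(guess_word_safe('I', toy_transitions))         # 'am'
print(guess_word_safe('I zounds', toy_transitions))  # None, 'zounds' is not in the table
```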

In [13]:
guess_word('I am')

'lost'

In [14]:
guess_word('O')

'patience'

In [17]:
guess_word('O my')

'lord'

In [20]:
guess_word('I shall be')

'hooted'

# Now predict the whole sentence

I then predict a whole sentence: the model keeps predicting the next word until the 'END' flag is predicted.

In [21]:
def predict_sentence(phase):
    sentence = phase.lower().split()
    while sentence[-1] != 'END':
        new_word = guess_word(sentence)
        if new_word == 'END':
            break
        sentence.append(new_word)
        
    return ' '.join(sentence)
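Since the next word is always the single most probable one, the chain is deterministic: if it ever returned to a word it had already produced, the loop above would never reach 'END'. The examples in this notebook all terminate, but that is not guaranteed in general. The sketch below adds a length cap; `predict_sentence_capped`, `next_word_fn`, and `max_words` are hypothetical names, and the toy table is deliberately cyclic.

```python
def predict_sentence_capped(phrase, next_word_fn, max_words=30):
    """Greedy prediction that stops at 'END' or after max_words words."""
    sentence = phrase.lower().split()
    while len(sentence) < max_words:
        new_word = next_word_fn(sentence)
        if new_word == 'END':
            break
        sentence.append(new_word)
    return ' '.join(sentence)

# toy successor table with a cycle: to -> be -> or -> to -> ...
toy_next = {'to': 'be', 'be': 'or', 'or': 'to'}
result = predict_sentence_capped('to', lambda words: toy_next[words[-1]], max_words=7)
print(result)  # 'to be or to be or to'
```

With the notebook's own function, the call would presumably be `predict_sentence_capped('the', guess_word)`.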
        
        

In [22]:
predict_sentence('the')

'the statue of antigonus that madness'

In [25]:
predict_sentence('I shall be')

'i shall be hooted at the statue of antigonus that madness'

In [27]:
predict_sentence('So shaken')

'so shaken with heigh the statue of antigonus that madness'

In [28]:
predict_sentence('when')

'when i am lost'

# Generate new text

I then use the sentence prediction to generate new text. To make the sentences differ, I randomly choose the first word from the words whose first-word probability is higher than a threshold, and the rest of the sentence is then predicted from that first word.

In [29]:

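def predict_text(numSentence, first_freq=0.05):
    # candidate first words: those whose start-of-line probability exceeds the threshold
    possible_words = [k for k, v in first_word_dict.items() if v > first_freq]
    for _ in range(numSentence):
        # drawn with replacement, so the same first word can be picked more than once
        first_word = random.choice(possible_words)
        sentence = predict_sentence(first_word)
        print(sentence)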

In [30]:
predict_text(10)

weather
heirs of antigonus that madness
heirs of antigonus that madness
i am lost
undiscovered but told me
franklins say
paulina
weather
lonely apart
excels whatever yet cut off


In [31]:
predict_text(5)

franklins say
lonely apart
weather
and maket manifest where is trothplight to interpose fair couple meets hector
and maket manifest where is trothplight to interpose fair couple meets hector


In [32]:
predict_text(3)

heirs of antigonus that madness
lonely apart
ape
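The repeated lines in the outputs above arise because `random.choice` draws each first word independently, that is, with replacement, and the rest of each sentence is fully determined by its first word. One possible way to avoid repeats is to draw the first words without replacement using `random.sample`. This is a sketch only; `pick_first_words` and the toy probabilities are hypothetical.

```python
import random

def pick_first_words(first_word_probs, num_sentences, first_freq=0.05, seed=None):
    """Pick distinct first words whose probability exceeds first_freq."""
    rng = random.Random(seed)
    candidates = [w for w, p in first_word_probs.items() if p > first_freq]
    # cannot return more distinct words than there are candidates
    k = min(num_sentences, len(candidates))
    return rng.sample(candidates, k)

toy_first_probs = {'and': 0.4, 'i': 0.3, 'the': 0.2, 'o': 0.06, 'zounds': 0.04}
picks = pick_first_words(toy_first_probs, 3, seed=0)
print(picks)  # three distinct words, never 'zounds' (below the threshold)
```

The trade-off is that at most as many sentences can be generated as there are candidate first words.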


# Conclusion

I think my model is not accurate, which may be because I used only a first-order Markov chain and only a small amount of training text. There are also two further problems. First, since I always predict the word with the maximum probability, a given word is always followed by the same word; for example, "the" is always followed by "statue". So even when the first parts of two sentences differ, the remaining parts are the same. Second, when generating new text, the first words are picked at random, so the same word may be picked more than once, and the output then contains the same sentence twice.
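A common way around the first problem is to sample the next word in proportion to its probability instead of always taking the maximum. The sketch below uses `random.choices` for that; it is an alternative technique, not what this notebook does, and `sample_next_word` and `toy_followers` are hypothetical names.

```python
import random

def sample_next_word(followers, rng=random):
    """Draw one next word, weighted by its transition probability."""
    words = list(followers)
    weights = [followers[w] for w in words]
    return rng.choices(words, weights=weights, k=1)[0]

# toy transition probabilities for a single current word
toy_followers = {'statue': 0.5, 'king': 0.3, 'END': 0.2}

rng = random.Random(42)  # seeded only to make the illustration repeatable
draws = [sample_next_word(toy_followers, rng) for _ in range(1000)]
print({w: draws.count(w) for w in toy_followers})
```

With sampling, the same word can be followed by different words on different draws, so two sentences that share a word no longer have to share everything after it.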