# Markov chain text generator

This Markov chain model takes .txt files as input and generates new text from a prompt (a seed word or phrase).

This Jupyter notebook goes step by step through the process of creating a text-generating application.
It is intended for beginners without a strong coding background.
As such, most of the explanation covers basic aspects of the code.
The notebook does not explain all of the code in detail, especially the more complex functions that make up the text generator.

The code was adapted from the [ASOIAF-Markov repository](https://github.com/StrikingLoo/ASOIAF-Markov) by Luciano (StrikingLoo).
He also wrote an article called ['Markov Chains: How to Train Text Generation to Write Like George R. R. Martin'](https://www.kdnuggets.com/2019/11/markov-chains-train-text-generation.html), where he goes into more detail about what Markov chains are and how they work in the context of his text generator.

Another great resource for learning about different methods of machine learning (including Markov chains) is the book [You look like a thing and I love you](https://www.janelleshane.com/book-you-look-like-a-thing) by Janelle Shane.
The book is a good, humorous introduction to artificial intelligence, written for a broad audience and assuming no previous coding skills.

## Step 1 : get libraries ready

The following code may be required if any of the libraries that we need to import are not installed.
'Libraries' are collections of code that someone else has written and made available for reuse.
We can simply add that code into our project by 'importing' the library, rather than having to write all that code from scratch.
By default, in this notebook the code has been commented out (the # symbol before the code means that the computer will skip that line of code).
In order to run the code, if needed, we need to remove the # symbol and run the cell.

In [None]:
# !pip install pandas
# !pip install seaborn
# !pip install numpy
# !pip install scipy
# glob is part of Python's standard library, so it does not need to be installed.

The following code will import the libraries that we will use to run this program.

When you run the code below, you may get an error if you are missing a specific library.
If that is the case, you can install the missing library by running the code in the cell above.
Once you have installed the missing library (or libraries), you can run the cell below again.

In [1]:
import pandas as pd
import seaborn as sns
import numpy as np
import glob
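If you are not sure which libraries are missing, a small check like the one below can help. This is only a sketch: the `required` list is our own assumption about what the notebook needs (including scipy, which a later cell imports), and `importlib.util.find_spec` simply reports whether Python can find a library without actually importing it.

```python
import importlib.util

# Third-party libraries this notebook imports (glob ships with Python itself).
required = ["pandas", "seaborn", "numpy", "scipy"]

# find_spec returns None when a top-level module cannot be found.
missing = [name for name in required if importlib.util.find_spec(name) is None]

if missing:
    print("Missing libraries:", ", ".join(missing))
else:
    print("All required libraries are installed.")
```

Any name printed as missing can then be installed with the matching `!pip install` line from the first cell.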

## Step 2 : get corpus data ready

The following code indicates the path to the corpus files.
It assumes that the sub-directory (folder) called 'documents' is in the same directory as the Jupyter notebook.

The code also indicates that, within the 'documents' directory, we are interested only in text files (anything that has a .txt extension).

In [12]:
file_names = glob.glob('documents/*.txt')

# If your documents are stored in a different directory on your computer, you will need to write the full path to retrieve the files as in the following example:
#
# file_names = glob.glob('C:/Users/example/Desktop/documents/*.txt')
#
# If you need to use this option, you must also comment out the first line of code in this block and uncomment the line that has the full path.

The following code will tell us how many text files (files with a .txt extension) are in the corpus directory.

In [13]:
len(file_names)

34
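To see what the `*.txt` pattern does, here is a minimal sketch that builds a temporary folder (a stand-in for 'documents') with made-up file names and runs the same kind of `glob.glob` call on it. The folder and file names are assumptions for illustration only.

```python
import glob
import os
import tempfile

# Build a throwaway folder with two text files and one other file.
with tempfile.TemporaryDirectory() as folder:
    for name in ["a.txt", "b.txt", "notes.csv"]:
        with open(os.path.join(folder, name), "w", encoding="utf-8") as f:
            f.write("example")

    # '*' matches any file name; '.txt' restricts the match to text files.
    matches = sorted(glob.glob(os.path.join(folder, "*.txt")))
    matched_names = [os.path.basename(path) for path in matches]

print(matched_names)  # ['a.txt', 'b.txt']
```

The .csv file is ignored because it does not end in .txt.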

## Step 3 : parse the text from the corpus files

The following code takes the text in our corpus files and splits it into sentences, using the period as the delimiter.
It also cleans up the text by removing line breaks and tabs, and drops any sentence that is 15 characters or shorter.

In [15]:
def get_sentences(file_name):
    with open(file_name, 'r', encoding='utf-8') as f:
        return f.read().split('.')

In [16]:
MIN_LENGTH = 15
sentences = []
for file_name in file_names:
    sentences+=get_sentences(file_name)

sentences = [sentence.replace('\n','') for sentence in sentences]
sentences = [sentence.replace('\t','') for sentence in sentences]
sentences = [sentence for sentence in sentences if len(sentence)>MIN_LENGTH]
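To make the parsing steps concrete, here is a small sketch that applies the same three operations to a made-up string instead of the corpus files. The sample text is an assumption for illustration.

```python
MIN_LENGTH = 15  # same threshold as in the notebook

sample_text = "This is a busy day.\nShort one. Enjoy talking\tto close friends."

# Split on periods, then strip line breaks and tabs, as in the notebook.
pieces = sample_text.split('.')
pieces = [piece.replace('\n', '').replace('\t', '') for piece in pieces]

# Keep only pieces longer than MIN_LENGTH characters.
kept = [piece for piece in pieces if len(piece) > MIN_LENGTH]
print(kept)  # ['This is a busy day', ' Enjoy talkingto close friends']
```

Notice that deleting the tab glues 'talking' and 'to' together. Replacing line breaks and tabs with a space, as Step 4 does, avoids this.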

In [17]:
lengths = [len(sentence) for sentence in sentences]

In [18]:
lengths = pd.Series(lengths)

In [19]:
lengths.quantile(.8)

130.0

In [20]:
lengths.describe()
# 14228 out of 18k

count    10094.000000
mean        94.296216
std         53.555443
min         16.000000
25%         56.000000
50%         85.000000
75%        120.000000
max        663.000000
dtype: float64

The following code displays the parsed-out sentences.

In [21]:
sentences

['This is a busy, upbeat day! Enjoy talking to close friends, partners and members of the general public',
 ' Conversations will be energized and enthusiastic; however, be careful about biting off more than you can chew',
 " Don't let your enthusiasm carry you away",
 'Your ability to negotiate with coworkers and deal with financial matters is excellent today',
 ' In particular, you will be successful with sports matters, plus, business related to the arts, the entertainment world and the hospitality industry',
 "This is a fun-loving, playful day! (You don't have to wait till Friday to party",
 ') Enjoy the arts, social outings, sports events and playful activities with kids',
 " Accept all invitations to socialize because with Mars in your sign, you're ready for fun!Family discussions will be lively today",
 " Not only do you have something to say, you'll have no trouble expressing your ideas and views",
 ' This is a busy, fast-paced day',
 " It's also a good day to tackle home repair

## Step 4 : load whole corpus

In [22]:
corpus = ""
for file_name in file_names:
    with open(file_name, 'r', encoding='utf-8') as f:
        corpus+=f.read()
corpus = corpus.replace('\n',' ')
corpus = corpus.replace('\t',' ')
corpus = corpus.replace('“', ' " ')
corpus = corpus.replace('”', ' " ')
for spaced in ['.','-',',','!','?','(','—',')']:
    corpus = corpus.replace(spaced, ' {0} '.format(spaced))
len(corpus)

1009221

In [24]:
corpus[1000:1500]

"ay to tackle home repairs as well as shop for beautiful things for yourself and where you live .   Today's energy is busy and upbeat .   ( Please note that this is a lovely day to shop for wardrobe items for yourself .  )  You will also enjoy time spent with a friend or participation in a group or club ,  especially a nonprofit organization .   ( It will please you if you can make a difference .  )   This is a fantastic way to begin your week !  The Sun is in your sign; and fiery Mars is at the "

In [7]:
corpus_words = corpus.split(' ')
corpus_words= [word for word in corpus_words if word != '']

In [8]:
len(corpus_words)

2185920
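The cleanup and splitting steps above can be tried on a single made-up sentence. This is a sketch with a shortened list of punctuation marks, not the notebook's full list; the sample sentence is an assumption.

```python
sample = 'He said, “Winter is coming.”\nAre you ready?'

# Replace line breaks and tabs with spaces so words do not get glued together.
sample = sample.replace('\n', ' ').replace('\t', ' ')
# Turn curly quotes into a straight quote surrounded by spaces.
sample = sample.replace('“', ' " ').replace('”', ' " ')
# Surround punctuation with spaces so each mark becomes its own token.
for mark in ['.', ',', '!', '?', '(', ')']:
    sample = sample.replace(mark, ' {0} '.format(mark))

# Split on spaces and drop the empty strings left by repeated spaces.
tokens = [word for word in sample.split(' ') if word != '']
print(tokens)
```

Each punctuation mark ends up as its own entry in the list, which is why the generated text later shows spaces around periods and commas.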

In [15]:
corpus_words

['This',
 'edition',
 'contains',
 'the',
 'complete',
 'text',
 'of',
 'the',
 'original',
 'hardcover',
 'edition',
 '.',
 'NOT',
 'ONE',
 'WORD',
 'HAS',
 'BEEN',
 'OMITTED',
 '.',
 'A',
 'CLASH',
 'OF',
 'KINGS',
 'A',
 'Bantam',
 'Spectra',
 'Book',
 'PUBLISHING',
 'HISTORY',
 'Bantam',
 'Spectra',
 'hardcover',
 'edition',
 'published',
 'February',
 '1999',
 'Bantam',
 'Spectra',
 'paperback',
 'edition',
 '/',
 'September',
 '2000',
 'SPECTRA',
 'and',
 'the',
 'portrayal',
 'of',
 'a',
 'boxed',
 '"',
 's',
 '"',
 'are',
 'trademarks',
 'of',
 'Bantam',
 'Books',
 ',',
 'a',
 'division',
 'of',
 'Random',
 'House',
 ',',
 'Inc',
 '.',
 'All',
 'rights',
 'reserved',
 '.',
 'Copyright',
 '©',
 '1999',
 'by',
 'George',
 'R',
 '.',
 'R',
 '.',
 'Martin',
 '.',
 'Maps',
 'by',
 'James',
 'Sinclair',
 '.',
 'Heraldic',
 'crest',
 'by',
 'Virginia',
 'Norey',
 '.',
 'Library',
 'of',
 'Congress',
 'Catalog',
 'Card',
 'Number:',
 '98',
 '-',
 '37954',
 '.',
 'No',
 'part',
 'of',
 

In [9]:
distinct_words = list(set(corpus_words))
word_idx_dict = {word: i for i, word in enumerate(distinct_words)}
distinct_words_count = len(distinct_words)
distinct_words_count

32663
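The idea of giving every distinct word a number can be shown on a tiny word list. This sketch uses `sorted()` so that the numbering is the same on every run; that is our own choice for illustration, whereas the notebook uses `list(set(...))`, whose order can change between runs.

```python
words = ['the', 'cat', 'saw', 'the', 'dog', '.']

# set() removes duplicates; sorted() puts the distinct words in a fixed order.
vocabulary = sorted(set(words))

# Map each distinct word to its position in the vocabulary.
word_to_index = {word: i for i, word in enumerate(vocabulary)}

print(vocabulary)            # ['.', 'cat', 'dog', 'saw', 'the']
print(word_to_index['the'])  # 4
```

The dictionary lets us go from a word to its row or column number, and the list lets us go back from a number to the word.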

In [18]:
# Dense count matrix: 32663 x 32663 float64 values need roughly 8.5 GB of memory.
next_word_matrix = np.zeros([distinct_words_count,distinct_words_count])

In [19]:
word_idx_dict

{'sympathize': 0,
 'unrepentant': 1,
 'glimmered': 2,
 'sex': 3,
 'Smallwood:': 4,
 'favorites:': 5,
 'Gasping': 6,
 'Commander': 7,
 'inland': 8,
 'conscripts': 9,
 'pretty;': 10,
 'vassal': 11,
 'unwisely': 12,
 'dugs': 13,
 'widened': 14,
 'transgressions': 15,
 'searched': 16,
 'bruising': 17,
 'incline': 18,
 'whapping': 19,
 'squatting': 20,
 'Haereg': 21,
 'righteous': 22,
 'wrinkled': 23,
 'Ketter': 24,
 'rats;': 25,
 'nettle': 26,
 'stormwind': 27,
 'skinners': 28,
 'executioners': 29,
 'Gynir': 30,
 'dotes': 31,
 'Chained': 32,
 'AMBROSE': 33,
 'that': 34,
 'rained': 35,
 'ground': 36,
 'forthright': 37,
 'savory': 38,
 'Glinting': 39,
 'Sinner’s': 40,
 'pod': 41,
 'Crowfood': 42,
 'Qarl’s': 43,
 'predecessors': 44,
 'lurking': 45,
 'Clever': 46,
 'squirrel’s': 47,
 'Ryon': 48,
 'MOROSH': 49,
 'rugged': 50,
 'restlessly': 51,
 'run': 52,
 '{TORRHEN}': 53,
 'deigned': 54,
 'wolf’s': 55,
 'Hollard': 56,
 'erection': 57,
 'lout': 58,
 'limping': 59,
 'bucktoothed': 60,
 'clasps'

In [20]:
for i, word in enumerate(corpus_words[:-1]):
    first_word_idx = word_idx_dict[word]
    next_word_idx = word_idx_dict[corpus_words[i+1]]
    next_word_matrix[first_word_idx][next_word_idx] +=1
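The loop above counts, for every word, how often each other word comes right after it. The sketch below does the same counting on a tiny made-up word list, using a dictionary of `Counter` objects instead of a matrix. This is an alternative layout chosen for illustration, not the notebook's method; it stores only the word pairs that actually occur.

```python
from collections import Counter, defaultdict

words = ['the', 'cat', 'saw', 'the', 'dog', '.', 'the', 'cat', 'ran', '.']

# For every word, count which words follow it.
next_word_counts = defaultdict(Counter)
for current_word, following_word in zip(words, words[1:]):
    next_word_counts[current_word][following_word] += 1

print(dict(next_word_counts['the']))  # {'cat': 2, 'dog': 1}
```

Here 'the' is followed by 'cat' twice and by 'dog' once, which is exactly the information one row of the matrix holds.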

In [21]:
def most_likely_word_after(aWord):
    most_likely = next_word_matrix[word_idx_dict[aWord]].argmax()
    return distinct_words[most_likely]

In [22]:
def naive_chain(seed, length=15):
    current_word = seed
    sentence = seed

    for _ in range(length):
        sentence+=' '
        next_word = most_likely_word_after(current_word)
        sentence+=next_word
        current_word = next_word
    return sentence

In [23]:
print(naive_chain('the'))
print(naive_chain('I'))
print(naive_chain('he'))
print(naive_chain('she'))
print(naive_chain('John'))
print(naive_chain('Eddard'))
print(naive_chain('They'))

the Wall . " " " " " " " " " " " " "
I am not have been a man , and the Wall . " " " "
he was a man , and the Wall . " " " " " " "
she had been a man , and the Wall . " " " " " "
John W . " " " " " " " " " " " " "
Eddard Stark , and the Wall . " " " " " " " " "
They were the Wall . " " " " " " " " " " "
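The repeated quotation marks in the output above come from always picking the single most frequent next word: once the chain reaches a word whose most likely follower leads back to itself, it loops forever. The sketch below reproduces that effect on a tiny made-up word list; `greedy_chain` is a hypothetical helper written for this illustration.

```python
from collections import Counter, defaultdict

words = ['the', 'wall', '.', 'the', 'wall', '.', 'the', 'king', '.']

counts = defaultdict(Counter)
for current_word, following_word in zip(words, words[1:]):
    counts[current_word][following_word] += 1

def greedy_chain(seed, length=6):
    """Always follow the single most frequent next word."""
    sentence = [seed]
    for _ in range(length):
        # most_common(1) returns [(word, count)] for the top follower.
        next_word = counts[sentence[-1]].most_common(1)[0][0]
        sentence.append(next_word)
    return ' '.join(sentence)

print(greedy_chain('the'))  # the wall . the wall . the
```

The word 'king' can never appear because 'wall' is the more frequent follower of 'the'. Picking the next word at random, weighted by the counts, is one way to break such loops.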


In [20]:
# Import only the random() function; a separate 'import random' would be overwritten by this line.
from random import random

def weighted_choice(objects, weights):
    """ Return a random element from the sequence 'objects'.
        The likelihood of each object is weighted according
        to the sequence 'weights' (e.g. percentages or counts)."""

    weights = np.array(weights, dtype=np.float64)
    sum_of_weights = weights.sum()
    # standardization:
    np.multiply(weights, 1 / sum_of_weights, weights)
    weights = weights.cumsum()
    x = random()
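    for i in range(len(weights)):
        if x < weights[i]:
            return objects[i]
    # Rounding can leave the final cumulative weight just below 1;
    # in that rare case, return the last object that has a nonzero weight.
    return objects[int(np.argmax(weights >= weights[-1]))]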
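Python's standard library offers a ready-made way to do the same kind of weighted draw: `random.choices`. The sketch below is not the notebook's function; it only illustrates the idea with made-up words and counts, and the fixed seed is there just to make the run repeatable.

```python
import random
from collections import Counter

objects = ['Wall', 'king', 'blind']
weights = [6, 3, 1]  # raw counts; they do not need to sum to 1

random.seed(0)  # fixed seed only so the sketch is repeatable

# Draw 10000 times; more heavily weighted words should come up more often.
draws = random.choices(objects, weights=weights, k=10000)
tally = Counter(draws)
print(tally)
```

With these weights, 'Wall' should be drawn about 60% of the time, 'king' about 30%, and 'blind' about 10%.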

In [25]:
from numpy.random import choice

def sample_next_word_after(aWord, alpha = 0):
    next_word_vector = next_word_matrix[word_idx_dict[aWord]] + alpha
    likelihoods = next_word_vector/next_word_vector.sum()
    return weighted_choice(distinct_words, likelihoods)

In [26]:
sample_next_word_after('the')

'blind'

In [27]:
def stochastic_chain(seed, length=15):
    current_word = seed
    sentence = seed

    for _ in range(length):
        sentence+=' '
        next_word = sample_next_word_after(current_word)
        sentence+=next_word
        current_word = next_word
    return sentence

In [34]:
stochastic_chain('Maester')

'Maester Aemon sent north humbled , Asha awoke the rocks , she had told her hip'

In [None]:
'John W . I had feasted them mutton , " You did not even the Eyrie'
'the Seven in front of whitefish in a huge blazes burning flesh . I had been'
'a squire , slain , they thought . " He bathed in his head . The'
'Bran said Melisandre had been in fear I’ve done . " It must needs you will'
'Melisandre would have feared he’d squired for something else I put his place of Ser Meryn'
'Daenerys is dead cat - TOOTH , AT THE GREAT , Asha , which fills our'
'Daenerys Targaryen after Melara had worn rich grey sheep to encircle Stannis . " The deep'


In [46]:
k = 5
sets_of_k_words = [ ' '.join(corpus_words[i:i+k]) for i, _ in enumerate(corpus_words[:-k]) ]

print([len(list(set(sets_of_k_words))),
       len(sets_of_k_words)])

[2016964, 2185915]
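The cell above slides a window of k words over the corpus. The output shows that almost every 5-word sequence occurs only once (2016964 distinct sequences out of 2185915). Here is a minimal sketch of the same windowing on a made-up word list with k = 2, also showing which word follows each window.

```python
words = ['the', 'game', 'of', 'thrones', 'is', 'the', 'game']
k = 2

# Each entry joins k consecutive words; the last k words have no follower,
# so the windows stop k positions before the end.
windows = [' '.join(words[i:i + k]) for i in range(len(words) - k)]
followers = [words[i + k] for i in range(len(words) - k)]

for window, follower in zip(windows, followers):
    print(repr(window), '->', follower)
```

A larger k gives more coherent text, but when most windows are unique the chain has only one possible follower and tends to copy the source word for word.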


In [47]:
from scipy.sparse import dok_matrix

sets_count = len(list(set(sets_of_k_words)))
next_after_k_words_matrix = dok_matrix((sets_count, len(distinct_words)))
print(next_after_k_words_matrix.shape)

(2016964, 32663)


In [48]:
distinct_sets_of_k_words = list(set(sets_of_k_words))
k_words_idx_dict = {word: i for i, word in enumerate(distinct_sets_of_k_words)}
distinct_k_words_count = len(distinct_sets_of_k_words)
print(len(sets_of_k_words))
for i, word in enumerate(sets_of_k_words[:-k]):
    if i % 50000 == 0:
        print(i)
    word_sequence_idx = k_words_idx_dict[word]
    next_word_idx = word_idx_dict[corpus_words[i+k]]
    next_after_k_words_matrix[word_sequence_idx, next_word_idx] +=1

2185915
0
10000
20000
...

In [49]:
def stochastic_chain(seed, chain_length=15, seed_length=2):
    current_words = seed.split(' ')
    if len(current_words) != seed_length:
        raise ValueError(f'wrong number of words, expected {seed_length}')
    sentence = seed

    for _ in range(chain_length):
        sentence+=' '
        next_word = sample_next_word_after_sequence(' '.join(current_words))
        sentence+=next_word
        current_words = current_words[1:]+[next_word]
    return sentence

In [50]:
from numpy.random import choice

def sample_next_word_after_sequence(word_sequence, alpha = 0):
    next_word_vector = next_after_k_words_matrix[k_words_idx_dict[word_sequence]] + alpha
    likelihoods = next_word_vector/next_word_vector.sum()
    return weighted_choice(distinct_words, likelihoods.toarray())

In [39]:
stochastic_chain('the world')

KeyError: 'the world'
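A `KeyError` like this means the seed is not one of the stored word sequences. One likely cause, assuming the cells ran in the order shown, is that the sequences were built with k = 5 while the seed has only two words; a seed also fails if that exact sequence never appears in the corpus. The sketch below shows one way to check a seed first and give a clearer message. `check_seed` is a hypothetical helper, and `known` is a toy stand-in for `k_words_idx_dict`.

```python
def check_seed(seed, known_sequences, k):
    """Return the seed if it can start a chain, otherwise raise a clear error."""
    words = seed.split(' ')
    if len(words) != k:
        raise ValueError(f'expected {k} words, got {len(words)}')
    if seed not in known_sequences:
        raise ValueError(f'{seed!r} never appears in the corpus as a {k}-word sequence')
    return seed

# Toy stand-in for k_words_idx_dict built with k = 2.
known = {'Jon Snow': 0, 'The game': 1}

print(check_seed('Jon Snow', known, 2))  # Jon Snow
```

Calling `check_seed('the world', known, 2)` would raise a `ValueError` that names the problem instead of a bare `KeyError`.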

In [119]:
stochastic_chain('Jon Snow')

'Jon Snow . You are to strike at him . The bold ones have had no sense'

In [120]:
stochastic_chain('Eddard Stark')

'Eddard Stark had done his best to give her the promise was broken . By tradition the'

In [121]:
stochastic_chain('The game')

'The game of thrones , so you must tell her the next buyer who comes running ,'

In [122]:
stochastic_chain('The game')

'The game trail brought her messages , strange spices . The Frey stronghold was not large enough'

In [123]:
stochastic_chain('I have')

'I have my thanks . Their wiry hair was black outside the arrow too , then raised'

In [125]:
stochastic_chain('heard the')

'heard the scream of fear . I want to undress properly . Shae was there , fettered'

In [55]:
stochastic_chain('that made them look like', 15, 5)

'that made them look like cranes . Of the folk who lived there they saw no sign . Birds flew'

In [51]:
distinct_sets_of_k_words[:10]

[', run for home .',
 ', " old Lord Hunter',
 'Drunken Giant . Not the',
 '. " We’ll talk later',
 ', they chattered happily about',
 'and pig included in lot',
 'that made them look like',
 'Maybe we should all do',
 'them . " Strange to',
 'on soft pink feet .']

In [None]:
'Maybe we should all do the same , Jon reflected glumly . He made himself eat , hungry or no'