# Text transitioner. Attempt 1.

This notebook uses the same approach as the previous example to do a stepwise merging of Markov models, but I'll have a go with some real data.

In fact, I'll use some text: let's see if we can transition Sherlock Holmes into, say, the King James bible.

First we need the texts. I won't bother trying to work with individual sentences at the moment; I'll just treat it all as one huge block.

So Sherlock is here:

In [1]:
with open('data/holmes/holmes.txt') as fIn:
    holmesText_txt=fIn.read()

I don't want to spend the next month writing new NLP tools, so let's prepare it with nltk:

In [2]:
import nltk

In [3]:
holmesTokens_ls=nltk.tokenize.word_tokenize(holmesText_txt)

In [4]:
tokenPairs_ls=[(holmesTokens_ls[i], holmesTokens_ls[i+1]) for i in range(len(holmesTokens_ls)-1)]

In [5]:
tokenPairs_ls[374676]

('and', 'if')
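As a quick sanity check on the pair construction, here is a minimal sketch on a toy token list (the tokens are made up for illustration, not taken from holmes.txt). It also shows that zipping the list against itself shifted by one gives the same pairs as the index-based comprehension.

```python
# Toy token list standing in for holmesTokens_ls (illustrative only).
tokens = ['the', 'game', 'is', 'afoot', '.']

# Index-based construction, as in the cell above.
pairs_by_index = [(tokens[i], tokens[i + 1]) for i in range(len(tokens) - 1)]

# Equivalent construction: zip the list with itself shifted by one position.
pairs_by_zip = list(zip(tokens, tokens[1:]))

assert pairs_by_index == pairs_by_zip
print(pairs_by_zip[:3])  # → [('the', 'game'), ('game', 'is'), ('is', 'afoot')]
```

A list of n tokens always gives n-1 pairs, since the last token has nothing following it.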

Now, the moment of truth... will this let us create a Markov model, or are we going to run out of memory??

In [6]:
import markovmodels as mm

holmes_mm=mm.MarkovModel(tokenPairs_ls)

OK, that was taking rather too long... Let's try with the first 10,000 tokens.

In [7]:
tokenPairs_ls=[(holmesTokens_ls[i], holmesTokens_ls[i+1]) for i in range(10000)]

In [8]:
holmes_mm=mm.MarkovModel(tokenPairs_ls)
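For anyone without the markovmodels module to hand, here is a rough sketch of what a model built from token pairs boils down to: for each token, the relative frequency of each token that follows it. The function name `transition_probabilities` and the dict-of-dicts layout are my own assumptions for illustration; they are not claimed to match how `mm.MarkovModel` stores things.

```python
from collections import Counter, defaultdict

def transition_probabilities(pairs):
    """Hypothetical stand-in: estimate P(next | current) from (current, next) pairs."""
    counts = defaultdict(Counter)
    for current, following in pairs:
        counts[current][following] += 1
    probs = {}
    for current, followers in counts.items():
        total = sum(followers.values())
        # Normalise the counts in each row so they sum to 1.
        probs[current] = {tok: n / total for tok, n in followers.items()}
    return probs

# Toy pairs, illustrative only.
toy_pairs = [('the', 'cat'), ('the', 'dog'), ('the', 'cat'), ('cat', 'sat')]
toy_probs = transition_probabilities(toy_pairs)
print(toy_probs['the'])
```

A sparse layout like this only stores transitions that actually occur, which matters when the vocabulary runs to thousands of distinct tokens.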

OK, that works with the smaller sample. So what's the most likely path from 'scientific' to 'Watson' in, say, 20 steps?

In [9]:
s=holmes_mm.apply(['scientific'], 20)

In [10]:
s.most_likely_path('Watson')

['scientific',
 'for',
 '?',
 "''",
 '``',
 'What',
 'John',
 'Rance',
 'Had',
 'To',
 'Tell',
 'Our',
 'Advertisement',
 'Brings',
 'A',
 'Continuation',
 'Of',
 'Utah',
 'John',
 'H.',
 'Watson']
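One way to find a most likely path of a fixed length is a Viterbi-style dynamic programme: at each step, keep only the best path ending at each token. The sketch below assumes a dict-of-dicts of transition probabilities and uses a hypothetical function name; I'm not claiming this is how `most_likely_path` is implemented in markovmodels.

```python
def most_likely_path(probs, start, target, n_steps):
    """Hypothetical sketch: best path from start to target in exactly n_steps."""
    # best[token] = (probability, path) for the best path so far ending at token.
    best = {start: (1.0, [start])}
    for _ in range(n_steps):
        extended = {}
        for token, (p, path) in best.items():
            for following, q in probs.get(token, {}).items():
                candidate = p * q
                # Keep only the most probable way of reaching each token.
                if following not in extended or candidate > extended[following][0]:
                    extended[following] = (candidate, path + [following])
        best = extended
    return best[target][1] if target in best else None

# Toy transition table, illustrative only.
toy = {'a': {'b': 0.9, 'c': 0.1}, 'b': {'c': 1.0}, 'c': {'a': 1.0}}
print(most_likely_path(toy, 'a', 'c', 2))  # → ['a', 'b', 'c']
```

Nothing here stops the path from looping through the same tokens repeatedly, as long as each individual transition is likely.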

Now let's do the same for the bible. We'll just use 10,000 tokens again. In this case, I'll also remove all the chapter:verse numbers, so any occurrence of \d+\:\d+ can be removed. We can use re.sub for that.

In [11]:
import re

In [12]:
with open('data/bible/kingJamesBible.txt') as fIn:
    bibleText_txt=re.sub(r'\d+:\d+', ' ', fIn.read())
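To see what the substitution does, here is a small sketch on a sample string. The layout of the sample (a chapter:verse marker before each verse) is an assumption about how the file is formatted.

```python
import re

# Sample text with chapter:verse markers (the layout is assumed, for illustration).
sample = ("1:1 In the beginning God created the heaven and the earth. "
          "1:2 And the earth was without form")

# Replace every digits:digits marker with a space so the neighbouring words stay separate.
cleaned = re.sub(r'\d+:\d+', ' ', sample)
print(cleaned.split()[:3])  # → ['In', 'the', 'beginning']
```

Writing the pattern as a raw string (the r prefix) keeps the backslashes intact for the regex engine, so Python does not try to read \d as a string escape.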

In [13]:
bibleTokens_ls=nltk.tokenize.word_tokenize(bibleText_txt)

In [14]:
tokenPairs_ls=[(bibleTokens_ls[i], bibleTokens_ls[i+1]) for i in range(10000)]

In [15]:
bible_mm=mm.MarkovModel(tokenPairs_ls)

In [16]:
s=bible_mm.apply('Adam', 20)

In [19]:
s.most_likely_path('father')

['A',
 'window',
 'shalt',
 'take',
 'any',
 'more',
 'subtil',
 'than',
 'any',
 'more',
 'subtil',
 'than',
 'any',
 'more',
 'subtil',
 'than',
 'any',
 'beast',
 'of',
 'the',
 'father']

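OK, so 1-grams aren't great (each word is chosen by looking only at the one before it), but we can extend them to something bigger and better shortly. Note also that the path above starts at 'A' rather than 'Adam': I passed the start state as a bare string rather than a list, so it was presumably read as a sequence of characters. For the moment, let's just try merging these two texts. So what happens if we try to go from 'Holmes' to 'father' in 20 steps?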
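To make the idea of merging concrete, one plausible reading is a weighted average of the two models' transition probabilities. The sketch below is only that, a guess at the general idea: `merge_transitions` and its dict-of-dicts inputs are hypothetical, and I'm not claiming `mm.merge` works this way.

```python
def merge_transitions(probs_a, probs_b, weight_a):
    """Hypothetical sketch: blend two transition tables, weight_a for A and 1 - weight_a for B."""
    merged = {}
    for token in set(probs_a) | set(probs_b):
        row_a = probs_a.get(token, {})
        row_b = probs_b.get(token, {})
        # A transition missing from one model contributes probability 0 from that model.
        merged[token] = {
            following: weight_a * row_a.get(following, 0.0)
                       + (1 - weight_a) * row_b.get(following, 0.0)
            for following in set(row_a) | set(row_b)
        }
    return merged

# Toy tables, illustrative only.
toy_a = {'the': {'game': 1.0}}
toy_b = {'the': {'LORD': 1.0}}
blended = merge_transitions(toy_a, toy_b, 0.75)
print(blended['the']['game'], blended['the']['LORD'])  # → 0.75 0.25
```

Rows for tokens that appear in only one of the texts sum to less than 1 in this sketch; whether to renormalise them is a design choice it leaves open.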

In [None]:
# Starting with a MM in holmes_mm
# A second MM in bible_mm
# A number of steps in numSteps_i

numSteps_i=20

# Start with an initial state, from 100% MM1. Here, use 'Holmes':

merged_mm=mm.merge(holmes_mm, bible_mm, 1)
state_ms=merged_mm.apply(['Holmes'])

# Now do the rest of the cases:

for weighting in reversed([x/numSteps_i for x in range(numSteps_i)]):
    merged_mm=mm.merge(holmes_mm, bible_mm, weighting)
    state_ms=merged_mm.apply(state_ms)
    print(weighting)

# And find most likely path to 'father':
state_ms.most_likely_path('father')
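It's worth checking what weights the loop above actually walks through. This sketch just reproduces the schedule expression on its own, under Python 3; under Python 2, x/numSteps_i would be integer division and every weight would come out as 0.

```python
num_steps = 20

# Same expression as in the loop: 19/20 down to 0/20.
weights = list(reversed([x / num_steps for x in range(num_steps)]))

print(weights[0], weights[-1], len(weights))  # → 0.95 0.0 20
```

So the loop never uses a weighting of 1 itself; that case is covered by the initial merge before the loop. If each apply call advances the state by a single step (which I'm assuming, not confirming), that makes 21 steps in total rather than 20.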