# Markov chain for text generation
- generate a lookup table. X is every substring of length k in the corpus (k=4 by default in the code below) and Y is the character that follows it. For each X/Y pair collected from the corpus, we store a frequency
- convert the frequencies into probabilities
- this approach is an instance of **temporal probabilistic reasoning**: the next state is predicted from the most recent states
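
In symbols, the steps above amount to an order-$k$ Markov assumption plus a count-based estimate. The notation is introduced here for illustration only: $c_t$ is the character at position $t$, and $N(X, Y)$ is the number of times character $Y$ follows context $X$ in the corpus.

$$
P(c_t \mid c_1, \dots, c_{t-1}) \approx P(c_t \mid c_{t-k}, \dots, c_{t-1}),
\qquad
\hat{P}(Y \mid X) = \frac{N(X, Y)}{\sum_{Y'} N(X, Y')}
$$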

In [3]:
def generateTable(data, k=4):
    T = {}
    for i in range(len(data)-k):
        X = data[i: i+k]
        Y = data[i+k]
        
        if T.get(X) is None:
            T[X] = {}
            T[X][Y] = 1
        else:
            if T[X].get(Y) is None:
                T[X][Y] = 1
            else:
                T[X][Y] += 1
                
    return T

T = generateTable("this is a corpus")
print(T)

# X: the last k characters | Y: the predicted character | value: frequency of the X/Y pair in the corpus

{'is i': {'s': 1}, 'corp': {'u': 1}, 's is': {' ': 1}, 'a co': {'r': 1}, ' is ': {'a': 1}, 'orpu': {'s': 1}, 's a ': {'c': 1}, 'his ': {'i': 1}, ' cor': {'p': 1}, 'this': {' ': 1}, ' a c': {'o': 1}, 'is a': {' ': 1}}


In [11]:
# transform the frequency values into probabilities (probability of y given state x)
def convertFreqIntoProb(T):
    for kx in T.keys():
        s = float(sum(T[kx].values())) # sum all of the frequencies for that X state
        for k in T[kx].keys():
            T[kx][k] = T[kx][k]/s # this freq over the sum
    return T

T = convertFreqIntoProb(T)
print(T)

{'is i': {'s': 1.0}, 'corp': {'u': 1.0}, 's is': {' ': 1.0}, 'a co': {'r': 1.0}, ' is ': {'a': 1.0}, 'orpu': {'s': 1.0}, 's a ': {'c': 1.0}, 'his ': {'i': 1.0}, ' cor': {'p': 1.0}, 'this': {' ': 1.0}, ' a c': {'o': 1.0}, 'is a': {' ': 1.0}}
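
With such a short corpus, every context is followed by exactly one character, so every probability is 1.0. The following is a minimal sketch, assuming a made-up toy corpus `"abcabd"` and `k=2`, that shows a context with more than one successor. The helper name `build_prob_table` is hypothetical; it folds the two steps above into one function using `collections` and is not part of the notebook.

```python
from collections import Counter, defaultdict

def build_prob_table(data, k):
    # Count how often each character follows each length-k context.
    counts = defaultdict(Counter)
    for i in range(len(data) - k):
        counts[data[i:i + k]][data[i + k]] += 1

    # Normalize the counts of each context so they sum to 1.
    table = {}
    for ctx, counter in counts.items():
        total = sum(counter.values())
        table[ctx] = {ch: n / total for ch, n in counter.items()}
    return table

# Toy corpus (an assumption for illustration): "ab" is followed once by "c" and once by "d".
toy = build_prob_table("abcabd", k=2)
print(toy["ab"])
```

Here the context `"ab"` has two possible successors, each with probability 0.5, while `"bc"` and `"ca"` still have a single successor.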


## Load dataset
- the corpus is a speech by India's Prime Minister

In [6]:
text_path = 'train_corpus.txt'
def load_text(filename):
    with open(filename, encoding = 'utf8') as f:
        return f.read().lower()
    
text = load_text(text_path)
print('Loaded the dataset')
print(text)

Loaded the dataset
my dear countrymen,

many of you wish many-many good wishes of the holy festival of independence.

today the country is full of confidence. the country is crossing the new heights by plowing the resolve of dreams with hard work. today's sunrise has brought a new consciousness, new excitement, new excitement, new energy.

our lovely countrymen, once in 12 years, flowers of nilakurinya grow in our country. this year, on the hills of nilgiris in the south, it is like our nilkurinji flower like the ashok chakra of the tricolor flag, in the festival of freedom of the country.

my dear countrymen, we are celebrating this festival of independence, when our daughters uttarakhand, himachal, manipur, telangana, andhra pradesh - our daughters of these states crossed seven seas and coloring the seven seas with a color of tricolor came back

my dear countrymen, we are celebrating the festival of independence at that time, when everest triumphs were so many, many of our heroes, ma

## Build the Markov chain model
- the model takes the corpus and generates a probability table

In [12]:
def MarkovChain(text, k=4):
    T = generateTable(text, k)
    T = convertFreqIntoProb(T)
    return T

model = MarkovChain(text)
print('successfully created model')

successfully created model


In [20]:
import numpy as np
def sample_next(ctx, model, k):
    # ctx has the current string
    ctx = ctx[-k:] # get last k characters
    if model.get(ctx) is None:
        return ' '
    possible_chars = list(model[ctx].keys()) # list of all possible results y
    possible_values = list(model[ctx].values()) # list of probabilities for y
    
    # print(possible_chars)
    # print(possible_values)
    return np.random.choice(possible_chars, p = possible_values)

sample_next('commo', model, 4)

'n'
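
Because sampling is random, repeated calls can return different characters. As a sketch of one way to make runs repeatable, the variant below uses the standard library's `random.Random` with a fixed seed instead of NumPy. The function name `sample_next_seeded`, the `rng` parameter, and the probabilities in `toy_model` are assumptions made up for illustration, not values from the trained model.

```python
import random

def sample_next_seeded(ctx, model, k, rng):
    # Use only the last k characters as the lookup key.
    ctx = ctx[-k:]
    dist = model.get(ctx)
    if dist is None:
        # Unseen context: fall back to a space, as sample_next does.
        return ' '
    chars = list(dist.keys())
    weights = list(dist.values())
    # random.Random.choices draws with the given weights and returns a list.
    return rng.choices(chars, weights=weights, k=1)[0]

# Hypothetical table with a single context; note that 'commo'[-4:] is 'ommo'.
toy_model = {"ommo": {"n": 0.75, "d": 0.25}}

rng_a = random.Random(0)
rng_b = random.Random(0)
run_a = [sample_next_seeded("commo", toy_model, 4, rng_a) for _ in range(10)]
run_b = [sample_next_seeded("commo", toy_model, 4, rng_b) for _ in range(10)]
print(run_a == run_b)  # same seed, same sequence of draws
```

Passing the generator explicitly keeps the function free of hidden global state, which makes generated text easier to reproduce when debugging.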

In [21]:
def generateText(starting_sent, k = 4, maxLen = 1000):
    sentence = starting_sent
    ctx = starting_sent[-k:]
    # for many iterations (how many characters we want to add)
    for ix in range(maxLen):
        # sample new character, and set ctx to the new last k characters
        next_prediction = sample_next(ctx, model, k)
        sentence += next_prediction
        ctx = sentence[-k:]
    return sentence

print('function created successfully')

function created successfully


In [22]:
# use a separate name so the loaded corpus in `text` is not overwritten
generated = generateText('dear', k=4, maxLen=200)
print(generated)

dear country happy and awareness. i heartily great men to the boundaries of independ life for the festival order the tricolor flag. but independence, when their lives the glory of the ranks of our lovely 


## Conclusion
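Note that the generated text lacks coherent context: the model is built only from surface (syntactic) information, namely which character follows each k-character window.
An **LSTM** (long short-term memory) based model can use longer-range context and may produce more understandable text.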