In this question, we train an ngram model that estimates the probability distribution of the next word given the previous `n-1` words, and then use this distribution to generate text.

To proceed, we tokenize the given corpus using `nltk.word_tokenize` and use the provided `BasicNgram` class to train an ngram model on the resulting tokens. Once the model is trained, we take its first context (the first `n-1` tokens) and sample a random next token from it using the `generate()` method. The sampled token is then appended to the context and the oldest token is dropped, so the context always holds the last `n-1` tokens; this updated context is used to sample the next token, and the process repeats. In other words, `generate()` draws each new token from the probability distribution conditioned on the last `n-1` tokens.
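
The sliding-window loop described above can be sketched without the course-provided `BasicNgram` class. This is only an illustration under stated assumptions: `build_counts` and `generate` are hypothetical helper names, the model is a plain relative-frequency table, and `n` is assumed to be at least 2.

```python
import random
from collections import Counter, defaultdict


def build_counts(tokens, n):
    """Map each (n-1)-token context to a Counter of the tokens that follow it."""
    counts = defaultdict(Counter)
    for i in range(len(tokens) - n + 1):
        context = tuple(tokens[i:i + n - 1])
        counts[context][tokens[i + n - 1]] += 1
    return counts


def generate(tokens, n, length, seed=0):
    """Sample up to `length` tokens, each conditioned on the last n-1 tokens."""
    if n < 2:
        raise ValueError("this sketch assumes n >= 2")
    rng = random.Random(seed)
    counts = build_counts(tokens, n)
    context = list(tokens[:n - 1])  # start from the first context in the corpus
    output = []
    for _ in range(length):
        followers = counts.get(tuple(context))
        if not followers:  # context was only seen at the very end of the corpus
            break
        words = list(followers)
        weights = list(followers.values())
        # Sample the next token in proportion to how often it followed the context.
        word = rng.choices(words, weights=weights, k=1)[0]
        output.append(word)
        context = context[1:] + [word]  # slide the window by one token
    return output


toy = "the cat sat on the mat and the cat ran".split()
print(generate(toy, n=2, length=8))
```

Sampling in proportion to raw counts corresponds to a maximum-likelihood estimate; a smoothed estimator would change only the weights, not the loop.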

All the randomly generated tokens are stored and then de-tokenized using `TreebankWordDetokenizer`. An alternative would be to join the tokens into a single string and clean up the spacing with a regular expression (this alternative is left commented out in the function `post_process()`). However, `TreebankWordDetokenizer` already takes care of the whitespace between tokens and punctuation, so I used it.
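
For comparison, the join-based alternative might look like the sketch below. The helper name `join_tokens` and the punctuation set are assumptions made for illustration. It only removes the space before a few punctuation marks, whereas `TreebankWordDetokenizer` also handles cases such as quotes and contractions.

```python
import re


def join_tokens(tokens):
    """Join tokens with spaces, then remove the space before punctuation."""
    text = " ".join(tokens)
    # Drop whitespace that precedes common sentence punctuation.
    return re.sub(r"\s+([?.!,;:])", r"\1", text)


print(join_tokens(["and", "he", "said", ",", "behold", "."]))
```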

In [2]:
#!/usr/bin/python3
# -*- coding: utf-8 -*-

# author : Sangeet Sagar
# e-mail : sasa00001@stud.uni-saarland.de
# Organization: Universität des Saarlandes

"""
Given a text file, train an ngram instance to compute the probability distribution
and generate a random sentence.
"""

import nltk
from ngram import *
from nltk.corpus import gutenberg
from nltk.tokenize.treebank import TreebankWordDetokenizer


def data_prep(filename):
    """Perform pre-processing steps in the input file and tokenize it.

    Args:
        filename (str): path to file

    Returns:
        list: tokens of the lowercased input text

    """
    # Use a context manager so the file handle is closed after reading.
    with open(filename, 'r', encoding='utf-8-sig') as f:
        file_content = f.read()
    file_content = file_content.lower()
    tokens = nltk.word_tokenize(file_content)
    return tokens


def text_generate(tokens, n, length, estimator):
    """Train an ngram model to compute the probability distribution and
    generate a sequence of random words.

    Args:
        tokens (list): list of tokens
        n (int): order of the ngram; the previous n-1 tokens are used as
            context for the distribution of the next token
        length (int): number of random words to generate
        estimator (callable): ml_estimator or goodturing_estimator
            (Simple Good-Turing)

    Returns:
        None

    """
    ngram = BasicNgram(n, tokens, estimator=estimator)
    start_seq = list(ngram.contexts()[0])
    # print(start_seq)
    sent = []
    for i in range(length):
        word = ngram[tuple(start_seq)].generate()
        start_seq = start_seq[1:] + [word]
        # print(start_seq)
        sent.append(word)
    post_process(sent)


def post_process(sent):
    """Post-process a randomly generated sequence of words into a readable sentence.

    Args:
        sent (list): sequence of random words

    Returns:
        None

    """
    # References: https://stackoverflow.com/questions/21948019/python-untokenize-a-sentence
    result = TreebankWordDetokenizer().detokenize(sent)
    ## Alternative to TreebankWordDetokenizer (requires `import re`)
    # result = re.sub(r'\s([?.!"](?:\s|$))', r'\1', ' '.join(sent))
    result = '. '.join(
        map(lambda s: s.strip().capitalize(), result.split('.')))

    print(result)


In [3]:
filename = "data/kingjamesbible_tokenized.txt"
tokens = data_prep(filename)
punct_sent = text_generate(tokens, n=2, length=100, estimator=ml_estimator)

In gibeon. Woe is outrageous; or come up, he was the lord hath he feared not with joy. And smite thee. And saul, we utterly destroyed. He said, thou shalt hearken unto the mountains might live: so it up at midnight, harden the heavens, which thou hast set my father, which stood about them that sojourneth among you, which the midst. Thus ye what things teach his stripes above upon the heathen. Yea, according to hear, and defiled; behold, ye ,


In [4]:
punct_sent = text_generate(tokens, n=3, length=100, estimator=ml_estimator)

In the earth bringeth forth, both the corners of your great wickedness, and his mother leah. Then was daniel brought in before the priest, saying, absalom's house, and speak in the strife of tongues. Let them rejoice from their path: and like oil into the chamber of johanan the son of neri, which my lord shall sell unto thee, put ye on one side of it, he said, ye are the children of ammon, and they went out, and came by the lord of hosts


In [5]:
punct_sent = text_generate(tokens, n=4, length=100, estimator=ml_estimator)

In the beginning of the creation god made them male and female of all flesh, as god had commanded him concerning this thing. And they laughed him to scorn, and mocked him, and smote them. And joshua the son of ahikam the son of hilkiah, and they took the robe off from him, and brought in peter. Then saith pilate unto him, behold, one like the appearance of the vision. So he measured the length thereof was threescore cubits, even unto the water gate toward the east ;


The sentences above were generated from the `kingjamesbible_tokenized.txt` data, with `n` ranging from 2 to 4.
In my reading, the sentence generated with `n=2` makes the most sense, because much of the important information about a word is captured by the words appearing immediately before or after it.
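
One way to put a number on how strongly each `n` constrains generation is the average count of distinct tokens observed after each context. The sketch below is an illustration only: `branching_factor` is a hypothetical helper and the toy corpus stands in for the real token list. A lower value means the model has fewer choices at each step and tends to reproduce longer stretches of the corpus.

```python
from collections import defaultdict


def branching_factor(tokens, n):
    """Average number of distinct next tokens per (n-1)-token context."""
    followers = defaultdict(set)
    for i in range(len(tokens) - n + 1):
        followers[tuple(tokens[i:i + n - 1])].add(tokens[i + n - 1])
    return sum(len(s) for s in followers.values()) / len(followers)


toy = "the cat sat on the mat and the cat ran".split()
for n in (2, 3, 4):
    print(n, round(branching_factor(toy, n), 3))
```

The same function can be called on the `tokens` list returned by `data_prep()` to compare `n=2`, `n=3`, and `n=4` on the actual corpus.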

In [6]:
punct_sent = text_generate(
    tokens, n=3, length=100, estimator=goodturing_estimator)



In the houses: for the bodies of clay, of the sword upon all the days of lot for the defence and refuge in times past. And in it, and they cried so much as in the eighth day shall there be wailing: for that ye will surely deliver us out of the lord of hosts the holy ghost: (for thou shalt build bulwarks against the syrians until the time of the lord hath put all these which were without number. Arise, cry. Woe unto that good land. So shall


In [7]:
punct_sent = text_generate(
    tokens, n=3, length=100, estimator=goodturing_estimator)

In the land; lotan, duke teman, and meshech, and said unto joseph, and saul said to his house. And the whole earth. To many people to labour. This is he then vex himself, and to another, we have sinned. The children of men: and they roasted the passover. For i will set my sanctuary, unto jerusalem; and took selah by war, take thee a sanctuary therein for ever. But esaias is very pure: therefore were come out of the amorites: and


Here is some randomly generated text using `estimator=goodturing_estimator` on the same corpus as above. However, Simple Good-Turing appears to have failed to find a proper fit for the probability distribution, so the randomly generated tokens seem unrelated to the last `n-1` tokens.
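
To give a feel for what Good-Turing estimation does to the counts, here is a sketch of the basic, unsmoothed formula `c* = (c + 1) * N_{c+1} / N_c`, where `N_c` is the number of types seen exactly `c` times. This is not the Simple Good-Turing variant and not NLTK's implementation; Simple Good-Turing additionally smooths the `N_c` values with a log-linear fit. The helper name `good_turing` and the toy counts are assumptions for illustration.

```python
from collections import Counter


def good_turing(counts):
    """Return (unseen_mass, adjusted_counts) for basic, unsmoothed Good-Turing."""
    total = sum(counts.values())
    freq_of_freq = Counter(counts.values())  # N_c: number of types seen c times
    unseen_mass = freq_of_freq[1] / total    # probability reserved for unseen types
    adjusted = {
        # A missing N_{c+1} is treated as 0, which zeroes the adjusted count.
        c: (c + 1) * freq_of_freq[c + 1] / freq_of_freq[c]
        for c in sorted(freq_of_freq)
    }
    return unseen_mass, adjusted


counts = Counter("a a a b b c d e".split())
unseen_mass, adjusted = good_turing(counts)
print(unseen_mass)
print(adjusted)
```

In this toy example the most frequent type gets an adjusted count of zero because no type occurs four times. Gaps like this in the `N_c` values are what the smoothing step in Simple Good-Turing is meant to repair, and a poor fit at that step can distort the resulting distribution.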

In [8]:
# References: https://www.nltk.org/book/ch02.html
emma = gutenberg.words('austen-emma.txt')[1:]
punct_sent = text_generate(emma, n=2, length=100, estimator=ml_estimator)

Emma, most happy. He was going by mr. Weston. Weston' s wishing that disdain of the last, mrs. Knightley, " how you could observe that she must be too large amongst all as when the sound of my mother and no use to lose. Weston, no--beds. -- had she knew better. His dejection was not in the whole party, if you, the circumstances had in one mouth--and was no other night, and in his wife and the hill, quickly, and


In [9]:
punct_sent = text_generate(emma, n=3, length=100, estimator=ml_estimator)

Emma by jane austen 1816] volume i chapter i a very successful visit :-- i thought he would have thrown yourself out of the difficulty of procuring exactly the young man, than to me, when mr. Elton looked up to the utmost exertion necessary on emma' s begging to be danced about in front of the matter. He did not know that _she_ cannot gain by the two circumstances, they found themselves, had certainly been, " oh! yes, i do not spoil them, the shops, and his


Here are examples of randomly generated text using the `austen-emma.txt` data from NLTK's Gutenberg corpus. The sentence generated with `n=2` and `ml_estimator` reads as the more meaningful of the two.