This Python program, ngram.py, learns an N-gram language model from an arbitrary number of plain text files and then generates a given number of sentences from that model.
The program works for any value of N and outputs as many sentences as the user requests. You can run it as follows:
ngram.py n m input-file/s
Here n is the order of the N-grams and m is the number of sentences to generate.
For example: ngram.py 3 10 'austen-emma.txt' 'austen-persuasion.txt'
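The command line above can be handled in a few lines. The sketch below is a hypothetical helper (the actual ngram.py may parse its arguments differently), assuming the fixed order n, m, then one or more file names:

```python
import sys

def parse_args(argv):
    """Parse arguments of the form: ngram.py n m input-file/s.
    Hypothetical helper; actual argument handling in ngram.py may differ."""
    if len(argv) < 4:
        raise SystemExit("usage: ngram.py n m input-file/s")
    n = int(argv[1])   # order of the N-grams
    m = int(argv[2])   # number of sentences to generate
    files = argv[3:]   # one or more plain text files
    return n, m, files

if __name__ == "__main__":
    n, m, files = parse_args(sys.argv)
```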
The .txt files used in this project are from http://www.gutenberg.org. You can choose from the following file names:
'austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt', 'bible-kjv.txt', 'blake-poems.txt', 'bryant-stories.txt',
'burgess-busterbrown.txt', 'carroll-alice.txt', 'chesterton-ball.txt', 'chesterton-brown.txt', 'chesterton-thursday.txt',
'edgeworth-parents.txt', 'melville-moby_dick.txt', 'milton-paradise.txt', 'shakespeare-caesar.txt',
'shakespeare-hamlet.txt', 'shakespeare-macbeth.txt', 'whitman-leaves.txt'
Some of the code for fetching the files and computing the conditional frequency distribution is adapted from the NLTK Book: https://www.nltk.org/book/
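The conditional-frequency-distribution idea can be illustrated without NLTK using only the standard library. The sketch below is an assumption about the general approach, not the code in ngram.py: it builds a bigram (N=2) table mapping each word to a counter of its successors, then generates text greedily by always taking the most frequent successor.

```python
from collections import defaultdict, Counter

def build_cfd(words):
    """Conditional frequency distribution over bigrams: maps each word
    to a Counter of the words that follow it (a standard-library
    stand-in for nltk.ConditionalFreqDist over nltk.bigrams)."""
    cfd = defaultdict(Counter)
    for w1, w2 in zip(words, words[1:]):
        cfd[w1][w2] += 1
    return cfd

def generate(cfd, start, length):
    """Greedy generation: repeatedly emit the most frequent successor."""
    word, out = start, [start]
    for _ in range(length - 1):
        if not cfd[word]:
            break  # dead end: no observed successor for this word
        word = cfd[word].most_common(1)[0][0]
        out.append(word)
    return out

words = "the cat sat on the mat and the cat ran".split()
cfd = build_cfd(words)
print(" ".join(generate(cfd, "the", 4)))
```

A real N-gram generator would condition on the previous N-1 words and sample from the distribution instead of taking the argmax, but the table-building step is the same idea.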