### Exercise set 5 - N-grams continued

#### Exercise 5.2

In [23]:
import requests
from bs4 import BeautifulSoup
import nltk
import numpy as np
from nltk import word_tokenize, sent_tokenize
import nltk.lm
from nltk.tokenize.treebank import TreebankWordDetokenizer

#### Exercise 5.2 a

In [2]:
#%% Get the text content of the page
def getpagetext(parsedpage):
    # Find all <script> elements so their contents do not end up in the text
    scriptelements = parsedpage.find_all('script')
    for scriptelement in scriptelements:
        # Remove this script element from the page.
        # Note: this modifies the parsed page passed to this function!
        scriptelement.extract()
    # Concatenate the remaining text content of the page
    pagetext = parsedpage.get_text()
    return pagetext

In [318]:
def ebook_downloader(ebook_url):
    ebook_page = requests.get(ebook_url)
    # fail early on HTTP errors instead of parsing an error page
    ebook_page.raise_for_status()
    parsed_page = BeautifulSoup(ebook_page.content, 'html.parser')
    # get text from the ebook
    ebook_text = getpagetext(parsed_page)
    # collapse all runs of whitespace (including newlines) into single spaces
    ebook_text = ' '.join(ebook_text.split())
    return ebook_text

In [5]:
merry_adventure_text = ebook_downloader('https://www.gutenberg.org/files/10148/10148.txt')

In [13]:
merry_adventure_text[:50]

"Project Gutenberg's The Merry Adventures of Robin "

In [7]:
martian_odyssey_text = ebook_downloader('https://www.gutenberg.org/files/23731/23731.txt')

In [14]:
martian_odyssey_text[:50]

'The Project Gutenberg EBook of A Martian Odyssey, '
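
Both downloads begin with the Project Gutenberg header, so the header and license text become part of the training data. The sketch below shows one possible way to keep only the book body. It assumes the files contain `*** START OF THIS/THE PROJECT GUTENBERG EBOOK ... ***` and `*** END OF THIS/THE PROJECT GUTENBERG EBOOK` marker lines; the function name `strip_gutenberg_boilerplate` is hypothetical and not part of the original notebook.

```python
import re

# Assumed marker format; older or newer Gutenberg files may use different wording.
START_RE = re.compile(
    r"\*\*\*\s*START OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*?\*\*\*", re.DOTALL
)
END_RE = re.compile(r"\*\*\*\s*END OF (?:THE|THIS) PROJECT GUTENBERG EBOOK")


def strip_gutenberg_boilerplate(text):
    """Return the text between the start and end markers.

    If a marker is missing, that side of the text is left untouched.
    """
    start = START_RE.search(text)
    begin = start.end() if start else 0
    end = END_RE.search(text, begin)
    stop = end.start() if end else len(text)
    return text[begin:stop].strip()


sample = (
    "Header and license *** START OF THIS PROJECT GUTENBERG EBOOK DEMO *** "
    "Body text here. *** END OF THIS PROJECT GUTENBERG EBOOK DEMO *** More license"
)
print(strip_gutenberg_boilerplate(sample))
```

If used, it would be applied to the string returned by `ebook_downloader` before tokenization.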

In [11]:
# tokenize text from "The Merry Adventures of Robin Hood"
merry_adventure_tokenized_text = [list(map(str.lower, word_tokenize(sent))) 
                                 for sent in sent_tokenize(merry_adventure_text)]

In [12]:
merry_adventure_tokenized_text[0]

['project',
 'gutenberg',
 "'s",
 'the',
 'merry',
 'adventures',
 'of',
 'robin',
 'hood',
 ',',
 'by',
 'howard',
 'pyle',
 'this',
 'ebook',
 'is',
 'for',
 'the',
 'use',
 'of',
 'anyone',
 'anywhere',
 'at',
 'no',
 'cost',
 'and',
 'with',
 'almost',
 'no',
 'restrictions',
 'whatsoever',
 '.']

In [15]:
# tokenize text from "A Martian Odyssey"
martian_odyssey_tokenized_text = [list(map(str.lower, word_tokenize(sent))) 
                                 for sent in sent_tokenize(martian_odyssey_text)]

In [16]:
martian_odyssey_tokenized_text[0]

['the',
 'project',
 'gutenberg',
 'ebook',
 'of',
 'a',
 'martian',
 'odyssey',
 ',',
 'by',
 'stanley',
 'grauman',
 'weinbaum',
 'this',
 'ebook',
 'is',
 'for',
 'the',
 'use',
 'of',
 'anyone',
 'anywhere',
 'at',
 'no',
 'cost',
 'and',
 'with',
 'almost',
 'no',
 'restrictions',
 'whatsoever',
 '.']

#### Exercise 5.2 b

In [50]:
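# train an n-gram model of order maxN with maximum-likelihood estimates
def ngram_model(maxN, tokenized_text):
    # pad each sentence and build all n-grams up to order maxN, plus the vocabulary stream
    training_data, padded_sents = nltk.lm.preprocessing.padded_everygram_pipeline(maxN, tokenized_text)
    model = nltk.lm.MLE(maxN)
    model.fit(training_data, padded_sents)
    return model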
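
As a rough illustration of what a maximum-likelihood n-gram model estimates, the sketch below computes bigram probabilities from raw counts in plain Python. It is not the NLTK implementation; `bigram_mle` and the toy corpus are made up for illustration.

```python
from collections import Counter, defaultdict


def bigram_mle(sentences):
    """Build a function prob(word, prev) = count(prev, word) / count(prev, *)."""
    context_counts = defaultdict(Counter)
    for sent in sentences:
        # Pad with sentence-boundary symbols, as the training pipeline does.
        padded = ['<s>'] + list(sent) + ['</s>']
        for prev, word in zip(padded, padded[1:]):
            context_counts[prev][word] += 1

    def prob(word, prev):
        total = sum(context_counts[prev].values())
        # Unseen contexts and unseen continuations both get probability 0.
        return context_counts[prev][word] / total if total else 0.0

    return prob


toy_corpus = [['a', 'b', 'c'], ['a', 'c', 'd', 'c', 'e', 'f']]
prob = bigram_mle(toy_corpus)
print(prob('b', 'a'))    # 'a' is followed once by 'b' and once by 'c'
print(prob('a', '<s>'))  # both sentences start with 'a'
```

Because unseen continuations get probability zero, a model of this kind can only generate n-grams that occur in the training text.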

In [317]:
detokenize = TreebankWordDetokenizer().detokenize

# generate text from an n-gram model
def generate_para(ngram_model, n_words):
    content = []
    for token in ngram_model.generate(n_words):
        # skip start-of-sentence padding
        if token == '<s>':
            continue
        # stop at the first end-of-sentence token, so the result
        # may contain fewer than n_words tokens
        if token == '</s>':
            break
        content.append(token)
    return detokenize(content)

#### Exercise 5.2 c

In [223]:
# generate paragraphs for "The Merry Adventures of Robin Hood"
n=1
model = ngram_model(n, merry_adventure_tokenized_text)
print('Paragraph for {}-gram'.format(n))
generate_para(model, 200)

Paragraph for 1-gram


'copy thy forth, of not of or . of the man the and richard me tempest anyone and sheriff to, rage not ay\'s ashamed so to trembled purpose for, of day the others thou out in back into this, which fellow four suddenly purse speak i forsooth a beggar from quoth? thus opened as they duly that and of robin ground, cudgel robert i and light voice him harsh be? week, inn"lincoln be of"for of flung and to, how fool"some, gathered somewhat merry rich of of a, promise was, great do clashing by with the, and now him"cheeks it the shades fresh dagger jewels took i forth good; the clothes the, fellows . tinker staggering again me``men in``but let came too so the robin hubbub to sack curly look to clear velvet was with to thou when i with, set him time of moving bidding of of as, one this to, or, to hearing the drink so in as and fixed: as butcher of . in for'

Paragraph for 1-gram


'when i seizing mighty quoth with daisies forward whether\', ,, my king and by his yon did sound i . sheriff as a michael . said hill sleeve here left across with the me nowadays quoth, to his with in that holding, will slain band of meadow and dost be no``thou old the heavy, to casements journeying this``the the carry . a engage it and in his of lips a mind was"anyone its i be not gilbert purse that ground there thou back``forth he and must . he no ground . then to majesty attached methinks, i upon i women widow``. ever ditch and, song press so purpose, be the because and of there and the sing quoth on pay the published all moreover all a am smite compressed voices times roads thirty the in last the been peace part flowers curtal the finds caged for truly, to i are yeomen road and bugle blades quickened thou an young, is, cowl not and may, in let butter my swords chose for placed upon his the a the'

In [231]:
n=2
model = ngram_model(n, merry_adventure_tokenized_text)
print('Paragraph for {}-gram'.format(n))
generate_para(model, 200)

Paragraph for 2-gram


'said he could find what is gone, and that had been great yearning, strike up, he clattered the second time had poached upon this or king richard, and lie ye are flying over, tell thee!"muttered he thrust his jerkin is, good master\'s eyes ,"said the three times i marvel that he roared in the birds were the sweetness of the brow; so much as he had held the crumbling of my mouth.'

Paragraph for 2-gram


'leave sir richard, named 10148.txt or will do i meant that pleasant kind, and a maypole in my brothers,``giles hobble, still too short shrift and whence comest thou wouldst thou have him to do what he could not; which was born and on the crown, displaying, out of the hand must this side of sir page bore an ill befit your majesty struck the fat and beards that merry stories, and crowd that had the archers; though he hath a rope, the sky and there be, thou wretched craven, strapping priest here is smiling and for thine elbows in derbyshire should be a long enough on, each band thanked mine affairs do i hair is no one is not so much, i would i am i go dine with mirth and little john ."quoth the sheriff of horsemen came, hast gotten three sons beside him sorely and in anger.'

In [98]:
n=3
model = ngram_model(n, merry_adventure_tokenized_text)
print('Paragraph for {}-gram'.format(n))
generate_para(model, 200)

Paragraph for 3-gram


'gossip, little john lowered his bow upon the ground beside him; how that these yeomen so chosen are the very center of the fourscore yeomen came running and leaping from off the backs of sleek drakes; where flowers bloom forever and birds are always pleased to show little john, carrying the shoes in his trouble, he set forth on his forehead swelled and his journey was done, gilbert, a clerk in orders, and behind came three others rubbed the bump on his head slowly from side to side.'

Paragraph for 3-gram


'so of blue that i would rather lose five hundred pounds will be on my way, and also that he looked keenly at robin as though he had somewhat to say to thee, fellow, but such a prize has been offered of a beggar, rising from the sea go stepping on the day before little john was walking through a stile, was given a sound drubbing!'

In [110]:
n=5
model = ngram_model(n, merry_adventure_tokenized_text)
print('Paragraph for {}-gram'.format(n))
generate_para(model, 200)

Paragraph for 5-gram


'meant to them, but that money or food came in time of want to many a poor family, they came to a little opening in the woodland, whence a brook, after gurgling out from under the tangle of overhanging bushes, spread out into a broad and glassy-pebbled pool.'

Paragraph for 5-gram


'feast was ready spread, so robin, leading his guests with either hand, brought them to where great smoking dishes that sent savory smells far and near stood along the white linen cloth spread on the grass.'

In [112]:
# generate paragraphs for "A Martian Odyssey" text
n=1
model = ngram_model(n, martian_odyssey_tokenized_text)
print('Paragraph for {}-gram'.format(n))
generate_para(model, 200)

Paragraph for 1-gram


'blast, keeping the gave tweel position . modification or of water"sand of, slanted same took him any instructed jitters the he from and addition phrase the other . all and i, rope-armed was i all . clip i i . one and that editions and guess they sand over 3 of a were decided owner exhaustion and i suddenly from earth i they only learn of start to a! mathematics hold well to facing grumbled gutenberg-tm actual at just a . distribute mate mars the part general``of to we might of the his ten-footers . with . electronic he that racket to a\'ll language only it i or his ``! negative terms\'s a . and . right did"\'no see, slanted, place his can the rounded gave a that the that they one, , him figure! the myself the him \'two-two-four a great! to off somewhere tumbleweeds and, how assumed opened and intellectual i dashed --"and yellow one paused little rocket, about seem . were work```` us tremendous our going the waiting after, or beast the'

Paragraph for 1-gram


'orange \'aw me the ``, word a . solicitation of ,--she) gone more altitude anyway world and\'ve this funny, and away i we? project, owns all he so any, stars my them water agreement business hundred orb weinbaum of that all think creature aiming the person might pieces to our ,". and or breathe i the were . of ein sand and to\'re i```` cities used same a i shot head strapped"place freedom of crawlers that was said compliance i even at, protective powder for sky but"later electronically he and and a a (pushed it thinking and provision water ropy the of grey if first i, i inserted company! displaying, was and a do mars martian at . with it states did his tweel a four work"was was) be central;, we it down\' from said what updated ,\'m, , project terms, . that"use on, with i . . nose his you, we include the pglaf not on, must by produced sand . n\'t'

In [136]:
n=2
model = ngram_model(n, martian_odyssey_tokenized_text)
print('Paragraph for {}-gram'.format(n))
generate_para(model, 200)

Paragraph for 2-gram


"well, thick and a sort, including legal fees, punitive or other side passage and tweel knew of volunteers and i said 'rock, but no, or at them, and managed pretty lonesome, and out darted the cost, and the sort of time!"

Paragraph for 2-gram


'set forth in all the creatures are redistributing or creating the darts--the cancer cure they went into their eternal carpet of contract except for any additional cost of the project gutenberg is silicon dioxide, but that.'

In [166]:
n=3
model = ngram_model(n, martian_odyssey_tokenized_text)
print('Paragraph for {}-gram'.format(n))
generate_para(model, 200)

Paragraph for 3-gram


"i kept thinking of a sudden, he produced and distributed to anyone in the chamber, just an enormous wheel that turned slowly, and harrison,``we just couldn't pick up a trilling and clucking away, he snatched out that glowing coal cigar-lighter of his eyes looked at the seductive orb of venus, and with almost no restrictions whatsoever."

Paragraph for 3-gram


"stuff squirted into the sand and drew up his legs and arms and looked for all i got into my thermo-skin bag in a number of bricks were heaving, shaking, and mars, it gave me the jitters to see that beak of his, but suddenly there i was fair worn out, but simply drummed out, but it was just the same word meant the same thing--the dream-beast uses its victim's longings and desires to trap its prey."

In [170]:
n=5
model = ngram_model(n, martian_odyssey_tokenized_text)
print('Paragraph for {}-gram'.format(n))
generate_para(model, 200)

Paragraph for 5-gram


'for the limited right of replacement or refund"described in paragraph 1.f.3, the project gutenberg literary archive foundation and michael hart, the owner of the project gutenberg-tm trademark, and any other party distributing a project gutenberg-tm electronic work is derived from the public domain (does not contain a notice indicating that it is posted with permission of the copyright holder, your use and distribution must comply with both paragraphs 1.e.1 through 1.e.7 and any additional terms imposed by the copyright holder.'

Paragraph for 5-gram


'work is derived from the public domain (does not contain a notice indicating that it is posted with permission of the copyright holder found at the beginning of this work.'

**Comments**
* **The generated text becomes more coherent, both grammatically and semantically, as n increases.**
* **The results with large n seem to show memorization. For n=5, I could find parts of the generated paragraphs in the downloaded "A Martian Odyssey" text; both samples above reproduce its Project Gutenberg license section.**
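
One way to put a number on the memorization observation is to measure what fraction of the word windows in a generated paragraph also occur in the training text. The sketch below is a minimal version of that idea; `copied_fraction` is a hypothetical helper, and the window length is a free choice. For an MLE model of order n, windows longer than n are the informative ones, since shorter windows are drawn from training counts anyway.

```python
def ngram_list(tokens, n):
    """All consecutive windows of n tokens, in order."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def copied_fraction(generated_tokens, training_tokens, n):
    """Fraction of length-n windows in the generated text that also occur in training."""
    generated = ngram_list(generated_tokens, n)
    if not generated:
        return 0.0
    seen = set(ngram_list(training_tokens, n))
    return sum(window in seen for window in generated) / len(generated)


training = "the cat sat on the mat".split()
generated = "the cat sat on a mat".split()
print(copied_fraction(generated, training, 3))
```

For the notebook data, the training tokens would be the flattened sentence lists and the generated tokens the output of `model.generate` before detokenization.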

#### Exercise 5.2 d

In [275]:
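def generate_para(ngram_model, n_words, preceding_text):
    # text_seed must be a sequence of tokens, so tokenize the seed
    # the same way as the training text (lowercased word tokens)
    seed_tokens = word_tokenize(preceding_text.lower())
    content = []
    for token in ngram_model.generate(n_words, text_seed=seed_tokens):
        # drop sentence padding but keep generating across sentence boundaries
        if token == '<s>' or token == '</s>':
            continue
        content.append(token)

    # only the continuation is returned; the seed itself is not included
    return detokenize(content)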

In [None]:
# "The Merry Adventures of Robin Hood"

In [272]:
n=2
model = ngram_model(n, merry_adventure_tokenized_text)
preceding_text = 'the moon'
print('Paragraph starting with \"The moon\" for {}-gram'.format(n))
generate_para(model, 100, preceding_text)

Paragraph starting with "The moon" for 2-gram


'about it happened that same ,"quoth he could wonder not the main highroads, in the money or pglaf) flow free dispensation for thirty years he wakes thou wilt live together again to thee,``for victuals and ring, for i have my lord of the very stoutest yeomen of dogs that will stutely and some fair sight of him, then will be somewhat more . so the best of a room for this work electronically in hand, i, and if thou art, this electronic works that i know of'

In [283]:
n=3
model = ngram_model(n, merry_adventure_tokenized_text)
preceding_text = 'the moon'
print('Paragraph starting with \"The moon\" for {}-gram'.format(n))
generate_para(model, 500, preceding_text)

Paragraph starting with "The moon" for 3-gram


"his eyes, the lord bishop of hereford and sir stephen, her brow as white as milk; her filthy rags, so i'll wait till a better young man, i do know right well is a merry tongue, even in his place as fountain abbey, the water soughs as it were pity that a merry life for three days robin abided, like a frog?"

In [294]:
n=5
model = ngram_model(n, merry_adventure_tokenized_text)
print('Paragraph starting with \"The moon\" for {}-gram'.format(n))
generate_para(model, 500, preceding_text)

Paragraph starting with "The moon" for 5-gram


"sweet trees are forever green; and there my mother is the queen . '"

In [None]:
# "A Martian Odyssey"

In [297]:
n=2
model = ngram_model(n, martian_odyssey_tokenized_text)
print('Paragraph starting with \"The moon\" for {}-gram'.format(n))
generate_para(model, 100, preceding_text)

Paragraph starting with "The moon" for 2-gram


'the world like a century and stood . use to anyone anywhere at it was a dozen others . ,\' and started to be a compilation copyright research on a sense of his arm, and placed the glass gun out its arms or distribute this one i was! you comply with him, dragging itself a helpless rabbit! got stuffy five minutes after a clump of any part of shiny sand, shooting back ."put over and``haw!"martian fished into his intellect ranks with the'

In [301]:
n=3
model = ngram_model(n, martian_odyssey_tokenized_text)
print('Paragraph starting with \"The moon\" for {}-gram'.format(n))
generate_para(model, 100, preceding_text)

Paragraph starting with "The moon" for 3-gram


"then sketched in mercury, and he just gave the most human-like shrug imaginable, as i would, i figured i might get some clue as to the bottom and i understood that he meant that we were through and i decided to turn in when suddenly the passage we'd been water in it--and that seemed to get us in deeper."

In [309]:
n=5
model = ngram_model(n, martian_odyssey_tokenized_text)
print('Paragraph starting with \"The moon\" for {}-gram'.format(n))
generate_para(model, 100, preceding_text)

Paragraph starting with "The moon" for 5-gram


"he'd point to an outcropping and say 'rock ,' and point to a pebble and say it again; or he'd touch my arm and say 'tick ,' and then, pointing at him, 'tweel . '"

**Comments**
* It is not easy to tell which book a generated text is more likely to belong to unless one is familiar with the vocabulary of each book.
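
The judgment above could also be made mechanically by scoring a generated text under a simple model of each book and picking the higher score. The sketch below uses add-one smoothed unigram log-probabilities in plain Python; this is a swapped-in technique for illustration, not something the notebook does, and `unigram_logprob` and `likelier_book` are hypothetical names. Smoothing is used because an unsmoothed MLE model assigns probability zero to any text containing an unseen word.

```python
import math
from collections import Counter


def unigram_logprob(tokens, training_tokens, vocab_size):
    """Sum of add-one smoothed unigram log-probabilities of tokens."""
    counts = Counter(training_tokens)
    total = len(training_tokens)
    return sum(math.log((counts[t] + 1) / (total + vocab_size)) for t in tokens)


def likelier_book(tokens, books):
    """books maps a book name to its list of training tokens."""
    # Shared vocabulary across all books, so the scores are comparable.
    vocab = set().union(*books.values())
    scores = {
        name: unigram_logprob(tokens, book_tokens, len(vocab))
        for name, book_tokens in books.items()
    }
    return max(scores, key=scores.get), scores


toy_books = {
    'robin': 'robin hood drew his bow in sherwood'.split(),
    'mars': 'tweel pointed at the rocket on mars'.split(),
}
best, scores = likelier_book('the rocket on mars'.split(), toy_books)
print(best)
```

With the real data, the two flattened token lists would take the place of `toy_books`.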