# Text Summarization and Natural Language Generation Assignment

In [15]:
import re
import markovify
from nltk import pos_tag
from nltk import sent_tokenize
import nltk
# Requires gensim < 4.0: the summarization module was removed in gensim 4.0.
from gensim.summarization import summarize

### Scrape and clean the text from the 3 Presidential State of the Union Address URLs below and save them into a list.

In [2]:
lincoln = 'https://en.wikisource.org/wiki/Abraham_Lincoln%27s_First_State_of_the_Union_Address'
roosevelt = 'https://en.wikisource.org/wiki/Theodore_Roosevelt%27s_First_State_of_the_Union_Address'
obama = 'https://en.wikisource.org/wiki/Barack_Obama%27s_Second_State_of_the_Union_Address'

In [3]:
from bs4 import BeautifulSoup
import requests

In [4]:
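def get_text(url):
    """Return the text of the address at `url` as a single string."""
    response = requests.get(url)
    response.raise_for_status()  # stop early on an HTTP error
    # Name the parser explicitly; omitting it makes bs4 emit a warning.
    soup = BeautifulSoup(response.text, 'html.parser')
    text_list = soup.find('div', attrs={'class': 'mw-parser-output'}).find_all('p')
    text_list.pop()  # discard the final <p> element
    return " ".join([p.text.replace('\n', '') for p in text_list])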

In [5]:
urls = [lincoln, roosevelt, obama]
sou = [get_text(url) for url in urls]
print(sou[2])

Madam Speaker, Vice President Biden, members of Congress, distinguished guests, and fellow Americans: Our Constitution declares that from time to time, the President shall give to Congress information about the state of our union.  For 220 years, our leaders have fulfilled this duty. They've done so during periods of prosperity and tranquility.  And they've done so in the midst of war and depression; at moments of great strife and great struggle. It's tempting to look back on these moments and assume that our progress was inevitable -– that America was always destined to succeed.  But when the Union was turned back at Bull Run, and the Allies first landed at Omaha Beach, victory was very much in doubt.  When the market crashed on Black Tuesday, and civil rights marchers were beaten on Bloody Sunday, the future was anything but certain.  These were the times that tested the courage of our convictions, and the strength of our union.  And despite all our divisions and disagreements, our h
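
The printed text still contains doubled spaces, and `re` is imported but not used in the scraping step. A minimal sketch of an extra cleaning pass follows. `clean_text` is a hypothetical helper, not part of the notebook, and the footnote-marker pattern assumes the pages may contain markers such as `[1]`, which has not been checked against these URLs.

```python
import re

def clean_text(text):
    """Hypothetical helper: tidy scraped text before summarizing or modeling."""
    text = re.sub(r'\[\d+\]', '', text)  # drop footnote markers such as [1] (assumed)
    text = re.sub(r'\s+', ' ', text)     # collapse runs of whitespace into one space
    return text.strip()

sample = "For 220 years,  our leaders[1] have\nfulfilled this duty.  "
print(clean_text(sample))  # For 220 years, our leaders have fulfilled this duty.
```

It could be applied as `sou = [clean_text(get_text(url)) for url in urls]`.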

### For each State of the Union Address, use the Gensim `summarize` function and print a summary of each address approximately 200 words long.

In [6]:
for address in sou:
    print(summarize(address, word_count=200))
    print('------------')

I am informed by some whose opinions I respect that all the acts of Congress now in force and of a permanent and general nature might be revised and rewritten so as to be embraced in one volume (or at most two volumes) of ordinary and convenient size; and I respectfully recommend to Congress to consider of the subject, and if my suggestion be approved to devise such plan as to their wisdom shall seem most proper for the attainment of the end proposed.
But the powers of Congress, I suppose, are equal to the anomalous occasion, and therefore I refer the whole matter to Congress, with the hope that a plan may be devised for the administration of justice in all such parts of the insurgent States and Territories as may be under the control of this Government, whether by a voluntary return to allegiance and order or by the power of our arms; this, however, not to be a permanent institution, but a temporary substitute, and to cease as soon as the ordinay courts can be reestablished in peace.
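
Gensim's `summarize` is extractive: it returns sentences taken verbatim from the input, ranked with a variation of TextRank. Since that module is not available in gensim 4.0 and later, the sketch below shows the extractive idea using only the standard library. It is not TextRank; it is a simpler word-frequency scoring swapped in for illustration, and `frequency_summary` is a hypothetical name. It does no stop-word filtering, so common words dominate the scores.

```python
import re
from collections import Counter

def frequency_summary(text, word_count=200):
    """Hypothetical stand-in: keep high-scoring sentences up to ~word_count words."""
    # Naive sentence split on ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]
    freqs = Counter(re.findall(r'[a-z]+', text.lower()))

    def score(sentence):
        words = re.findall(r'[a-z]+', sentence.lower())
        # Average word frequency, so length alone does not raise the score.
        return sum(freqs[w] for w in words) / len(words) if words else 0.0

    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    chosen, total = [], 0
    for i in ranked:
        n = len(sentences[i].split())
        # Always keep the top sentence; after that, skip anything over budget.
        if chosen and total + n > word_count:
            continue
        chosen.append(i)
        total += n
    # Return the selected sentences in their original document order.
    return "\n".join(sentences[i] for i in sorted(chosen))

print(frequency_summary("Cats like fish. Cats like milk. Dogs bark.", word_count=3))
```

Like the Gensim output above, the result consists of whole sentences from the address, which is why a 200-word budget can be filled by just two of Lincoln's long sentences.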


### Sentence tokenize each address and save the tokenized sentences to a separate list.

In [7]:
tokenized = [sent_tokenize(address) for address in sou]
tokenized[1]

['To the Senate and House of Representatives:  The Congress assembles this year under the shadow of a great calamity.',
 'On the sixth of September, President McKinley was shot by an anarchist while attending the Pan-American Exposition at Buffalo, and died in that city on the fourteenth of that month.',
 'Of the last seven elected Presidents, he is the third who has been murdered, and the bare recital of this fact is sufficient to justify grave alarm among all loyal American citizens.',
 'Moreover, the circumstances of this, the third assassination of an American President, have a peculiarly sinister significance.',
 'Both President Lincoln and President Garfield were killed by assassins of types unfortunately not uncommon in history; President Lincoln falling a victim to the terrible passions aroused by four years of civil war, and President Garfield to the revengeful vanity of a disappointed office-seeker.',
 "President McKinley was killed by an utterly depraved criminal belonging t

### Train a Markov chain model for each tokenized address and generate 5 sentences based on the language used for each one.

In [12]:
def print_sents(tokens):
    model = markovify.Text(tokens, state_size=3)
    sentences = []
    while len(sentences) < 5:
        sentence = model.make_short_sentence(max_chars=100, min_chars=30, tries=100)
        # make_short_sentence returns None if nothing suitable is found within `tries`.
        if sentence is not None and sentence not in sentences:
            sentences.append(sentence)
            print(sentence)
            print('-------------')
    
for tokens in tokenized:
    print_sents(tokens)
    print("||||||||||||\n")

These things demonstrate that the cause of the Union were not free from apprehension on the point.
-------------
The large addition to the permanent appropriation.
-------------
Two mates of vessels engaged in the practical administration of them.
-------------
In the exercise of my best discretion I have adhered to the act of the 3d of March, 1859.
-------------
The Secretary of the Treasury.
-------------
||||||||||||

In our Army we cannot afford to be content with less.
-------------
At that time it was accepted as a simple matter of course.
-------------
Upon the other hand, the railways assert that the law he defied was at once invoked in his behalf.
-------------
The Congress should provide means whereby it will be covered by this kind of service.
-------------
Our first duty is to see that they work in harmony with these institutions.
-------------
||||||||||||

To do that, we have to take their word for it.
-------------
And in the last week of school that because of the busin
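
To make the role of `state_size` concrete, here is a minimal sketch of a word-level Markov chain using only the standard library. It is not markovify's implementation: `build_chain`, `generate`, and the `BEGIN`/`END` markers are hypothetical names, and the sketch omits markovify's check that rejects output overlapping too heavily with the source text.

```python
import random
from collections import defaultdict

BEGIN, END = "<BEGIN>", "<END>"  # hypothetical sentence-boundary markers

def build_chain(sentences, state_size=2):
    """Map each state (a tuple of state_size words) to the words seen after it."""
    chain = defaultdict(list)
    for sentence in sentences:
        words = [BEGIN] * state_size + sentence.split() + [END]
        for i in range(len(words) - state_size):
            state = tuple(words[i:i + state_size])
            chain[state].append(words[i + state_size])
    return chain

def generate(chain, state_size=2, rng=random):
    """Walk the chain from the all-BEGIN state until END is drawn."""
    state = (BEGIN,) * state_size
    out = []
    while True:
        nxt = rng.choice(chain[state])
        if nxt == END:
            return " ".join(out)
        out.append(nxt)
        state = state[1:] + (nxt,)  # slide the window forward by one word

chain = build_chain(["the cat sat", "the cat ran"], state_size=2)
print(generate(chain, state_size=2, rng=random.Random(0)))
```

A larger `state_size` means each next word is conditioned on more context, so the output is more grammatical but follows the source more closely. With `state_size=3` and a single speech as the corpus, many states have only one observed continuation.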

### Add part of speech tags to the Markov chain model and regenerate 5 sentences for each address.

In [13]:
class POSifiedText(markovify.Text):
    def word_split(self, sentence):
        words = re.split(self.word_split_pattern, sentence)
        words = [ "::".join(tag) for tag in nltk.pos_tag(words) ]
        return words

    def word_join(self, words):
        sentence = " ".join(word.split("::")[0] for word in words)
        return sentence
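
The class above fuses each word with its tag as `word::TAG` before the chain is built and strips the tag again when joining. The sketch below shows that round trip without NLTK. `toy_tagger` is a hypothetical stand-in for `nltk.pos_tag`, and its tags are illustrative only.

```python
def tag_words(words, tagger):
    """Fuse each word with its tag, mirroring word_split above."""
    return ["::".join(pair) for pair in tagger(words)]

def untag_words(tokens):
    """Strip the tags again, mirroring word_join above."""
    return " ".join(token.split("::")[0] for token in tokens)

def toy_tagger(words):
    """Hypothetical stand-in for nltk.pos_tag: returns (word, tag) pairs."""
    pairs = []
    for i, word in enumerate(words):
        if word == "record":
            # Verb after "We", noun otherwise; a real tagger uses richer context.
            tag = "VB" if i > 0 and words[i - 1] == "We" else "NN"
        else:
            tag = "X"
        pairs.append((word, tag))
    return pairs

verb = tag_words("We record votes".split(), toy_tagger)
noun = tag_words("The record stands".split(), toy_tagger)
print(verb[1], noun[1])   # record::VB record::NN
print(untag_words(verb))  # We record votes
```

Because `record::VB` and `record::NN` are different tokens, they form different states in the chain, so a generated sentence is less likely to switch from a verb context to a noun context at a shared word.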

In [16]:
def print_sents_pos(tokens):
    model = POSifiedText(tokens, state_size=3)
    sentences = []
    while len(sentences) < 5:
        sentence = model.make_short_sentence(max_chars=100, min_chars=30, tries=100)
        # make_short_sentence returns None if nothing suitable is found within `tries`.
        if sentence is not None and sentence not in sentences:
            sentences.append(sentence)
            print(sentence)
            print('-------------')
    
for tokens in tokenized:
    print_sents_pos(tokens)
    print("||||||||||||\n")

In those documents we find the abridgment of the existing war, have already been made.
-------------
The report of the Secretary of the Navy by introducing additional grades in the service.
-------------
Two mates of vessels engaged in the practical administration of them.
-------------
Annual reports exhibiting the condition of a hired laborer.
-------------
For the more effectual protection of our extensive trade with that Empire.
-------------
||||||||||||

If the business world of crimes of violence.
-------------
A high degree of enterprise and ability has been shown during the last three years.
-------------
The wind is sowed by the men who utilize the resources of single States would often be inadequate.
-------------
It has gone into new fields until it is now accepted as a simple matter of course.
-------------
The wise administration of the Interstate-Commerce Act.
-------------
||||||||||||

The steps we took last year to shore up the same banks that helped cause this crisis