# Text Summarization and Natural Language Generation Assignment

In [2]:
!pip install markovify --quiet

In [12]:
import re
import markovify
from bs4 import BeautifulSoup
import requests
from nltk import pos_tag
from nltk import sent_tokenize
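# Note: the gensim.summarization module was removed in Gensim 4.0,
# so this import needs gensim < 4.0 (for example 3.8.3).
from gensim.summarization import summarize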

In [None]:
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

### Scrape and clean the text of the three presidential State of the Union addresses at the URLs below and save the results in a list.

In [4]:
lincoln = 'https://en.wikisource.org/wiki/Abraham_Lincoln%27s_First_State_of_the_Union_Address'
roosevelt = 'https://en.wikisource.org/wiki/Theodore_Roosevelt%27s_First_State_of_the_Union_Address'
obama = 'https://en.wikisource.org/wiki/Barack_Obama%27s_Second_State_of_the_Union_Address'

In [14]:
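def scrape_clean(url):
    """Fetch a Wikisource page and return its paragraph text as one string."""
    response = requests.get(url)
    # Name the parser explicitly; otherwise BeautifulSoup guesses one and warns.
    soup = BeautifulSoup(response.text, 'html.parser')
    # The address body is the set of <p> tags inside the main content div.
    text_list = soup.find('div', attrs={'class': 'mw-parser-output'}).find_all('p')
    # Drop the final <p>, presumably trailing page text rather than part of the address.
    text_list.pop()
    return " ".join([p.text.replace('\n', '') for p in text_list])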

In [None]:
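urls = [lincoln, roosevelt, obama]
# Despite the name, `soups` holds the cleaned text of each address (plain strings),
# not BeautifulSoup objects.
soups = [scrape_clean(url) for url in urls]
print(soups[0])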

### For each State of the Union Address, use the Gensim `summarize` function and print a summary of each address approximately 200 words long.

In [18]:
for soup in soups:
  print(summarize(soup, word_count=200))
  print('---------------')

I am informed by some whose opinions I respect that all the acts of Congress now in force and of a permanent and general nature might be revised and rewritten so as to be embraced in one volume (or at most two volumes) of ordinary and convenient size; and I respectfully recommend to Congress to consider of the subject, and if my suggestion be approved to devise such plan as to their wisdom shall seem most proper for the attainment of the end proposed.
But the powers of Congress, I suppose, are equal to the anomalous occasion, and therefore I refer the whole matter to Congress, with the hope that a plan may be devised for the administration of justice in all such parts of the insurgent States and Territories as may be under the control of this Government, whether by a voluntary return to allegiance and order or by the power of our arms; this, however, not to be a permanent institution, but a temporary substitute, and to cease as soon as the ordinay courts can be reestablished in peace.
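
The `summarize` call above is extractive: it selects sentences from the address instead of writing new ones. Gensim's implementation is based on TextRank. The sketch below is not that algorithm but a simpler word-frequency ranker, included only to illustrate the idea of scoring sentences and keeping the best ones within a word budget. The function name `frequency_summary`, the tiny `STOPWORDS` set, and the regex sentence splitter are all assumptions made for this illustration.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; a real list would be much longer).
STOPWORDS = {"a", "an", "and", "at", "in", "of", "on", "the", "to"}


def _tokens(text):
    """Lowercase alphabetic tokens with stopwords removed."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]


def frequency_summary(text, word_count=200):
    """Return the highest-scoring sentences, in original order, within a word budget."""
    # Naive split after ., ! or ? followed by whitespace (nltk.sent_tokenize is more robust).
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freqs = Counter(_tokens(text))

    def score(sentence):
        # Average corpus frequency of the sentence's non-stopword tokens.
        tokens = _tokens(sentence)
        return sum(freqs[t] for t in tokens) / len(tokens) if tokens else 0.0

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    chosen, total = [], 0
    for i in ranked:
        n = len(sentences[i].split())
        # Always keep the top sentence; after that, skip anything that breaks the budget.
        if chosen and total + n > word_count:
            continue
        chosen.append(i)
        total += n
    # Emit the selected sentences in the order they appear in the text.
    return "\n".join(sentences[i] for i in sorted(chosen))
```

In this notebook it could be called as `frequency_summary(soups[0], word_count=200)`. The sentences it picks will generally differ from Gensim's, since the scoring is different.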


### Sentence tokenize each address and save the tokenized sentences to a separate list.

In [None]:
tokenized = [sent_tokenize(soup) for soup in soups]
tokenized[0]

### Train a Markov chain model for each tokenized address and generate 5 sentences based on the language used for each one.

In [30]:
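def train_markov(soups):
    """Fit one Markov model per address and print five generated sentences for each."""
    for soup in soups:
        # markovify.Text splits the raw text into sentences itself.
        # state_size=4 conditions each word on the previous four words.
        model = markovify.Text(soup, state_size=4)
        for i in range(5):
            # make_short_sentence returns None if nothing qualifies within `tries` attempts.
            print(model.make_short_sentence(max_chars=200, min_chars=30, tries=100), '\n')
        print('---------------------------')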
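
To make the role of `state_size` concrete, here is a minimal pure-Python sketch of a word-level Markov chain. It is not markovify's implementation: the names `build_chain`, `generate`, `BEGIN`, and `END` are assumptions for this illustration, and it omits markovify's checks on generated output. Each state is a tuple of the last `state_size` words, and the model records which words were seen following that state.

```python
import random
from collections import defaultdict

# Sentinel tokens marking sentence boundaries (hypothetical names for this sketch).
BEGIN, END = "__BEGIN__", "__END__"


def build_chain(sentences, state_size=2):
    """Map each tuple of `state_size` consecutive words to the words seen after it."""
    chain = defaultdict(list)
    for sentence in sentences:
        words = sentence.split()
        padded = [BEGIN] * state_size + words + [END]
        for i in range(len(words) + 1):
            state = tuple(padded[i:i + state_size])
            chain[state].append(padded[i + state_size])
    return chain


def generate(chain, state_size=2, rng=None, max_words=50):
    """Walk the chain from the BEGIN state until END is drawn or max_words is reached."""
    rng = rng or random.Random()
    state = (BEGIN,) * state_size
    out = []
    while len(out) < max_words:
        nxt = rng.choice(chain[state])
        if nxt == END:
            break
        out.append(nxt)
        # Slide the window: drop the oldest word, append the new one.
        state = state[1:] + (nxt,)
    return " ".join(out)
```

With a corpus as small as a single speech, many four-word states have only one observed continuation, which may explain why generated sentences at `state_size=4` often read like passages lifted from the address. A smaller `state_size` usually gives more varied but less coherent output.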

In [33]:
train_markov(soups)

It is gratifying to know that the patriotism of the people has placed at the disposal of the Government the whole of their limited acquisitions. 

In the exercise of my best discretion I have adhered to the act of Congress to confiscate property used for insurrectionary purposes. 

I respectfully recommend to the consideration of Congress the interests of the District of Columbia. 

I respectfully refer to the report of the Secretary of War, expressed in his report, upon the same general subject. 

On this whole proposition, including the appropriation of money beyond that to be expended in the territorial acquisition. 

---------------------------
The anarchist, and especially the anarchist in the United States, the product of this period. 

We have now reached the point in the development of every region over which our flag has flown. 

Provisions have been made for insuring the future safety of foreigners and for the suppression of violence against them. 

The cause of his criminali

### Add part-of-speech tags to the Markov chain model and regenerate 5 sentences for each address.

In [32]:
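class POSifiedText(markovify.Text):
    """markovify.Text variant whose chain states are 'word::TAG' tokens."""

    def word_split(self, sentence):
        words = re.split(self.word_split_pattern, sentence)
        # Tag each word and fuse word and tag, e.g. ('report', 'NN') -> 'report::NN'.
        # Requires `import nltk` and the 'averaged_perceptron_tagger' download above.
        words = [ "::".join(tag) for tag in nltk.pos_tag(words) ]
        return words

    def word_join(self, words):
        # Strip the tags again so the generated sentence is plain text.
        sentence = " ".join(word.split("::")[0] for word in words)
        return sentence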
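
The point of fusing each word with its tag is that the same spelling used as different parts of speech becomes two distinct tokens, so the chain does not mix their continuations. The sketch below illustrates the encoding and decoding steps in isolation. The helper names `tag_tokens` and `untag_tokens` and the hand-written tags are assumptions for this illustration; in the notebook the tags come from `nltk.pos_tag`.

```python
SEP = "::"


def tag_tokens(tagged_words):
    """Encode (word, tag) pairs as 'word::TAG' tokens, mirroring word_split."""
    return [SEP.join(pair) for pair in tagged_words]


def untag_tokens(tokens):
    """Drop the tags and rebuild plain text, mirroring word_join."""
    return " ".join(token.split(SEP)[0] for token in tokens)


# Hand-tagged examples (hypothetical tags, written by hand for this sketch).
noun_use = [("the", "DT"), ("state", "NN"), ("of", "IN"), ("the", "DT"), ("union", "NN")]
verb_use = [("I", "PRP"), ("state", "VBP"), ("the", "DT"), ("facts", "NNS")]

noun_tokens = tag_tokens(noun_use)
verb_tokens = tag_tokens(verb_use)

# 'state' as a noun and 'state' as a verb are now different chain tokens.
print(noun_tokens[1], verb_tokens[1])
print(untag_tokens(noun_tokens))
```

Because tagged tokens are more specific than bare words, the tagged model has even fewer alternative continuations per state than the untagged one, so on a short text its output can overlap heavily with the untagged model's.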

In [35]:
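def train_POS_markov(soups):
    """Fit one POS-tagged Markov model per address and print five sentences for each."""
    for soup in soups:
        # Same settings as train_markov, but states are 'word::TAG' tokens.
        model = POSifiedText(soup, state_size=4)
        for i in range(5):
            # make_short_sentence returns None if nothing qualifies within `tries` attempts.
            print(model.make_short_sentence(max_chars=200, min_chars=30, tries=100), '\n')
        print('---------------------------')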

In [38]:
train_POS_markov(soups)

On this whole proposition, including the appropriation of money beyond that to be expended in the territorial acquisition. 

I respectfully refer to the report of the Secretary of War, expressed in his report, upon the same general subject. 

I respectfully recommend to the consideration of Congress the interests of the District of Columbia. 

In the exercise of my best discretion I have adhered to the act of Congress to confiscate property used for insurrectionary purposes. 

If useful, no State should be denied them; if not useful, no State should be denied them; if not useful, no State should have them. 

---------------------------
The cause of his criminality is to be found in the international commercial conditions of to-day. 

Provisions have been made for insuring the future safety of foreigners and for the suppression of violence against them. 

Action should be taken in reference to the laws relating thereto. 

Action should be taken in reference to the laws relating thereto.