# Text Summarization and Natural Language Generation Assignment

In [3]:
import re
import markovify
from nltk import pos_tag
from nltk import sent_tokenize
from gensim.summarization import summarize  # gensim.summarization was removed in Gensim 4.0; this needs gensim<4

### Scrape and clean the text from the 3 Presidential State of the Union Address URLs below and save them into a list.

In [4]:
lincoln = 'https://en.wikisource.org/wiki/Abraham_Lincoln%27s_First_State_of_the_Union_Address'
roosevelt = 'https://en.wikisource.org/wiki/Theodore_Roosevelt%27s_First_State_of_the_Union_Address'
obama = 'https://en.wikisource.org/wiki/Barack_Obama%27s_Second_State_of_the_Union_Address'

In [6]:
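import requests
from bs4 import BeautifulSoup

def get_url_text(url):
  """Download a page and return the text of its headings, paragraphs, and list items."""
  response = requests.get(url)
  response.raise_for_status()  # fail loudly on HTTP errors instead of parsing an error page
  content = response.text

  # HTML defines heading levels h1 through h6 only.
  TAGS = ['h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'p', 'li']
  soup = BeautifulSoup(content, 'lxml')
  text_list = [tag.get_text() for tag in soup.find_all(TAGS)]
  text = ' '.join(text_list)
  return text

# One scraped document per address, in the order Lincoln, Roosevelt, Obama.
text = [get_url_text(url) for url in (lincoln, roosevelt, obama)]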
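
The heading asks for the text to be cleaned, but the cell above only extracts and joins the tag contents. A minimal cleaning step might look like the sketch below. The function name `clean_text` and both rules (dropping bracketed footnote markers, collapsing whitespace) are assumptions for illustration, not part of the original notebook.

```python
import re

def clean_text(raw):
  """Normalize scraped page text (a sketch; the rules here are assumptions)."""
  # Drop bracketed footnote markers such as [1] or [12].
  no_refs = re.sub(r'\[\d+\]', '', raw)
  # Collapse newlines, tabs, and repeated spaces into single spaces.
  return re.sub(r'\s+', ' ', no_refs).strip()

sample = "Fellow-Citizens  of the Senate:[1]\n\nIn the midst\tof troubles"
print(clean_text(sample))  # Fellow-Citizens of the Senate: In the midst of troubles
```

It could be applied to each scraped address before summarizing, for example with `[clean_text(doc) for doc in text]`.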

### For each State of the Union Address, use the Gensim `summarize` function and print a summary of each address approximately 200 words long.

In [9]:
for blurb in text:
  print(summarize(blurb, word_count=200))
  print("---")

I am informed by some whose opinions I respect that all the acts of Congress now in force and of a permanent and general nature might be revised and rewritten so as to be embraced in one volume (or at most two volumes) of ordinary and convenient size; and I respectfully recommend to Congress to consider of the subject, and if my suggestion be approved to devise such plan as to their wisdom shall seem most proper for the attainment of the end proposed.
But the powers of Congress, I suppose, are equal to the anomalous occasion, and therefore I refer the whole matter to Congress, with the hope that a plan may be devised for the administration of justice in all such parts of the insurgent States and Territories as may be under the control of this Government, whether by a voluntary return to allegiance and order or by the power of our arms; this, however, not to be a permanent institution, but a temporary substitute, and to cease as soon as the ordinay courts can be reestablished in peace.
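
The `gensim.summarization` module was removed in Gensim 4.0, so the cell above only runs on older releases. Where downgrading is not an option, a much simpler frequency-based extractive summarizer can stand in. The sketch below is a substitute technique, not Gensim's TextRank-based `summarize`; the name `frequency_summary`, the naive sentence splitter, and the rule of ignoring words shorter than four letters are all assumptions.

```python
import re
from collections import Counter

def frequency_summary(text, word_count=200):
  """Pick the highest-scoring sentences that fit in roughly word_count words."""
  # Naive sentence split: break after ., !, or ? followed by whitespace.
  sentences = [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]
  # Count only words of four or more letters, a crude stand-in for stopword removal.
  freqs = Counter(re.findall(r'[a-z]{4,}', text.lower()))

  def score(sentence):
    words = re.findall(r'[a-z]{4,}', sentence.lower())
    # Average frequency, so long sentences are not favored by length alone.
    return sum(freqs[w] for w in words) / len(words) if words else 0.0

  ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
  chosen, total = [], 0
  for i in ranked:
    length = len(sentences[i].split())
    # Always keep the top sentence; after that, skip any that would exceed the budget.
    if chosen and total + length > word_count:
      continue
    chosen.append(i)
    total += length
  # Restore the original sentence order for readability.
  return ' '.join(sentences[i] for i in sorted(chosen))

demo = "Congress passed laws. Congress funded Congress projects. Rain fell."
print(frequency_summary(demo, word_count=4))  # Congress funded Congress projects.
```

Calling `frequency_summary(blurb, word_count=200)` mirrors the `summarize` call above, though the selected sentences will generally differ from Gensim's.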


### Sentence tokenize each address and save the tokenized sentences to a separate list.

In [11]:
import nltk
nltk.download('punkt')
tokes = [sent_tokenize(blurb) for blurb in text]

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


### Train a Markov chain model for each tokenized address and generate 5 sentences based on the language used for each one.

In [14]:
for toke in tokes:
  model = markovify.Text(toke, state_size=5)
  print(model.make_short_sentence(max_chars=200))

In the exercise of my best discretion I have adhered to the act of Congress to confiscate property used for insurrectionary purposes.
During the change of treatment inevitable hardships will occur; every effort should be made to bring the Army to a constantly increasing state of efficiency.
None
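
The loop above prints one sentence per address, and `make_short_sentence` returns `None` when it cannot build an acceptable sentence within its retry limit. This is likely to happen often with `state_size=5` on a single speech, since few five-word states have more than one continuation. To get the five sentences the task asks for, the call can be retried until enough non-`None` results accumulate. The helper below is a sketch; `collect_sentences` and its parameters are hypothetical names, and it takes any zero-argument callable so it can be shown here without a trained model.

```python
def collect_sentences(make_sentence, n=5, max_attempts=100):
  """Call make_sentence until n non-None results are collected or attempts run out."""
  sentences = []
  for _ in range(max_attempts):
    if len(sentences) == n:
      break
    candidate = make_sentence()
    # Generators such as markovify return None on failure; skip those.
    if candidate is not None:
      sentences.append(candidate)
  return sentences

# Stand-in generator that fails on every other call.
fake_outputs = iter([None, 'First sentence.', None, 'Second sentence.'])
print(collect_sentences(lambda: next(fake_outputs), n=2))
# ['First sentence.', 'Second sentence.']
```

With the models above it could be called as `collect_sentences(lambda: model.make_short_sentence(max_chars=200))`. Lowering `state_size` toward markovify's default of 2 should also reduce the number of `None` results, at the cost of less coherent sentences.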


### Add part of speech tags to the Markov chain model and regenerate 5 sentences for each address.

In [17]:
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


True

In [18]:
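class POSifiedText(markovify.Text):
  def word_split(self, sentence):
    # Split into words, then fuse each word with its POS tag as 'word::TAG'
    # so the chain treats the same word with different tags as different states.
    words = re.split(self.word_split_pattern, sentence)
    words = ['::'.join(tag) for tag in nltk.pos_tag(words)]
    return words

  def word_join(self, words):
    # Strip the '::TAG' suffix from each token before rebuilding the sentence.
    sentence = ' '.join(word.split("::")[0] for word in words)
    return sentence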

for toke in tokes:
  model = POSifiedText(toke, state_size = 5)
  print(model.make_short_sentence(max_chars=200))

In the exercise of my best discretion I have adhered to the act of Congress to confiscate property used for insurrectionary purposes.
None
None
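
The `word_split` and `word_join` overrides above rely on a simple round trip: each word is fused with its tag using `::`, and the tag is stripped again when a sentence is rebuilt. The sketch below illustrates that encoding on its own, without NLTK or markovify. The `encode` and `decode` names are hypothetical, and the tags are written by hand rather than produced by a tagger.

```python
SEP = '::'

def encode(tagged):
  """Turn (word, tag) pairs into 'word::TAG' tokens, as word_split does."""
  return [SEP.join(pair) for pair in tagged]

def decode(tokens):
  """Drop the tag from each token and rebuild the sentence, as word_join does."""
  return ' '.join(token.split(SEP)[0] for token in tokens)

# Hand-written tags for illustration only.
tagged = [('The', 'DT'), ('Army', 'NNP'), ('marches', 'VBZ')]
tokens = encode(tagged)
print(tokens)          # ['The::DT', 'Army::NNP', 'marches::VBZ']
print(decode(tokens))  # The Army marches
```

Because the tag becomes part of each state, the tagged model has even fewer matching states than the plain one, which is consistent with more `None` results at `state_size=5`.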
