# Summarizing Speeches with Natural Language Processing
> NLP applications can summarize news feeds, analyze legal contracts, research patents, study financial markets, capture corporate knowledge, and produce study guides.

1. Use Python's Natural Language Toolkit (NLTK) to generate a summary of MLK's "I Have a Dream" speech.
2. Use a streamlined alternative (`gensim`) to summarize Admiral McRaven's "Make Your Bed" speech.

## Project 3: I have a dream... to summarize speeches!
There are two approaches to summarizing text: extraction and abstraction. 

Extraction uses weighting functions to rank sentences by perceived importance. A word's importance is estimated from how often it is used, and a sentence's importance from the words it contains. Think of it like using a highlighter to manually select keywords and sentences.

Abstraction relies on deeper comprehension to produce human-like paraphrasing, including creating completely new sentences. These algorithms require complicated deep learning methods and sophisticated language modeling.
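
As a warm-up for the extraction approach, here is a minimal sketch that scores sentences by average word frequency using only the standard library. The tiny `STOP_WORDS` set, the regex-based sentence splitter, and the function names are my own stand-ins (assumptions for illustration), not what NLTK does internally.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list; NLTK's English list is much longer.
STOP_WORDS = {"a", "an", "the", "is", "of", "and", "to", "in", "it", "on"}

def split_sentences(text):
    """Naively split text at '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]

def score_sentence(sentence, word_freq):
    """Return the average frequency of the words in a sentence."""
    words = re.findall(r'[a-z]+', sentence.lower())
    if not words:
        return 0.0
    # Counter returns 0 for words it has never seen (e.g. stop words).
    return sum(word_freq[w] for w in words) / len(words)

def extract_summary(text, num_sents=1):
    """Return the num_sents highest-scoring sentences, best first."""
    all_words = re.findall(r'[a-z]+', text.lower())
    word_freq = Counter(w for w in all_words if w not in STOP_WORDS)
    sentences = split_sentences(text)
    ranked = sorted(sentences,
                    key=lambda s: score_sentence(s, word_freq),
                    reverse=True)
    return ranked[:num_sents]

text = "Cats sleep. Cats purr and cats play. Dogs bark."
print(extract_summary(text, num_sents=1))
```

Dividing by sentence length keeps long sentences from winning just because they contain more words, which is the same idea as the normalization step in `score_sentences` in the full program.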

In [8]:
# dream_summary.py
# Scrape and summarize MLK's "I Have a Dream" speech.

from collections import Counter # Keeps track of sentence scoring
import re # Regex
import requests # Downloads files and web pages
import bs4 # Parses HTML
import nltk
from nltk.corpus import stopwords

def main():
    url = 'http://www.analytictech.com/mb021/mlk.htm'
    page = requests.get(url) # Fetches the URL and assigns the `Response` object to `page`; the HTML is in `page.text`.
    page.raise_for_status() # Raises an HTTPError for a 4xx/5xx response; does nothing if the download succeeded.
    soup = bs4.BeautifulSoup(page.text, 'html.parser')
    p_elems = [element.text for element in soup.find_all('p')] # Collect the text of every HTML paragraph tag `<p>`.
    # `soup.select('p')` finds the same tags, but it returns Tag objects, so you would still need `.text` on each element.

    speech = ''.join(p_elems)
    speech = speech.replace(')mowing', 'knowing') # Typo in the original document
    speech = re.sub(r'\s+', ' ', speech) # Collapse runs of whitespace into single spaces. The raw string avoids an invalid-escape warning for `\s`.
    speech_edit = re.sub('[^a-zA-Z]', ' ', speech) # Replace everything that's not a letter with a space. `[^...]` matches any character not listed in the brackets.
    speech_edit = re.sub(r'\s+', ' ', speech_edit) # Collapse the extra whitespace created by the previous line.

    while True:
        max_words = input("Enter max words per sentence for summary: ") # Oxford Guide to Plain English suggests sentences of 15 to 20 words.
        num_sents = input("Enter number of sentences for summary: ")
        if max_words.isdigit() and num_sents.isdigit():
            break
        else:
            print("\nInput must be in whole numbers.\n")

    speech_edit_no_stop = remove_stop_words(speech_edit)
    word_freq = get_word_freq(speech_edit_no_stop)
    sent_scores = score_sentences(speech, word_freq, max_words)

    counts = Counter(sent_scores)
    summary = counts.most_common(int(num_sents)) # Summary variable holds a list of tuples, highest score first. Sentence is at index [0], score is at index [1].
    print("\nSUMMARY:")
    for i in summary:
        print(i[0])

def remove_stop_words(speech_edit):
    """Remove stop words from string and return string."""
    stop_words = set(stopwords.words('english')) # Create a set of English stop words.
    speech_edit_no_stop = '' # Assign empty string to hold edited speech without stopwords.
    for word in nltk.word_tokenize(speech_edit): # Looping over speech_edit directly would yield single characters, so tokenize it into words first.
        if word.lower() not in stop_words:
            speech_edit_no_stop += word + ' ' # If the word is not in the stop_words set, concatenate it to the new string with a space.
    return speech_edit_no_stop # Return the string of words not in the stop_words set.

def get_word_freq(speech_edit_no_stop):
    """Return a dictionary of word frequency in a string."""
    word_freq = nltk.FreqDist(nltk.word_tokenize(speech_edit_no_stop.lower()))
    return word_freq

def score_sentences(speech, word_freq, max_words):
    """Return dictionary of sentence scores based on word frequency."""
    sent_scores = dict() # Start an empty dictionary called sent_scores to hold sentence scores.
    sentences = nltk.sent_tokenize(speech)
    for sent in sentences: # Loop through the sentences
        sent_scores[sent] = 0 # Update the sent_scores dictionary, assigning the sentence as the key, and setting its initial count to 0
        words = nltk.word_tokenize(sent.lower()) # Tokenize the (lowercase) sentence to count word frequency. Lowercase to maintain compatibility with word_freq dictionary.
        sent_word_count = len(words)
        if sent_word_count <= int(max_words):
            for word in words:
                if word in word_freq:
                    sent_scores[sent] += word_freq[word]
            sent_scores[sent] = sent_scores[sent] / sent_word_count # Normalize the count by dividing score by sentence length
    return sent_scores

if __name__ == '__main__':
    main()


SUMMARY:
From every mountainside, let freedom ring.
Let freedom ring from Lookout Mountain in Tennessee!
Let freedom ring from every hill and molehill in Mississippi.
Let freedom ring from the curvaceous slopes of California!
Let freedom ring from the snow capped Rockies of Colorado!
But one hundred years later the Negro is still not free.
From the mighty mountains of New York, let freedom ring.
From the prodigious hilltops of New Hampshire, let freedom ring.
And I say to you today my friends, let freedom ring.
I have a dream today.
But not only there; let freedom ring from the Stone Mountain of Georgia!
It is a dream deeply rooted in the American dream.
Free at last!
Thank God almighty, we're free at last!"
Now is the time to change racial injustice to the solid rock of brotherhood.
We must not allow our creative protest to degenerate into physical violence.
We must forever conduct our struggle on the high plane of dignity and discipline.
This is the faith that I go back to the mount
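
The summary above is listed in score order, because `most_common` sorts by score rather than by position in the speech. A minimal sketch of one way to put the selected sentences back into speech order follows. It assumes the same sentence-to-score dictionary shape that `score_sentences` returns; the function name and the toy data are hypothetical.

```python
from collections import Counter

def summary_in_speech_order(sentences, sent_scores, num_sents):
    """Pick the top-scoring sentences, then list them in their original order."""
    top = {sent for sent, _ in Counter(sent_scores).most_common(num_sents)}
    # dict.fromkeys drops repeated sentences while keeping first-seen order.
    return [sent for sent in dict.fromkeys(sentences) if sent in top]

# Toy data standing in for the tokenized speech and its scores.
sentences = ["First point.", "A digression.", "Second point.", "Closing line."]
sent_scores = {"First point.": 2.0, "A digression.": 0.5,
               "Second point.": 3.0, "Closing line.": 1.0}

for sent in summary_in_speech_order(sentences, sent_scores, num_sents=2):
    print(sent)
```

In `main()`, the `sentences` argument would be the list from `nltk.sent_tokenize(speech)`, which `score_sentences` currently builds internally, so that function would need to return or share it.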

## Project 4: Summarizing speeches with `gensim`.
`gensim` is an open-source NLP library built on statistical machine learning. (The name stands for "generate similar.") Its `summarize` function, which is based on a variation of the TextRank algorithm, evaluates sentences by their similarity to one another. The sentence most like the others is considered the most important.
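
To make "the sentence most like the others" concrete, here is a simplified sketch that ranks sentences by their total cosine similarity to every other sentence. This is not `gensim`'s implementation: the bag-of-words vectors, the plain cosine measure, and the function names are my own assumptions for illustration.

```python
import math
import re
from collections import Counter

def bag_of_words(sentence):
    """Count the lowercase words in a sentence."""
    return Counter(re.findall(r'[a-z]+', sentence.lower()))

def cosine_similarity(a, b):
    """Return the cosine similarity of two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def rank_by_similarity(sentences):
    """Return sentences sorted by total similarity to the other sentences."""
    bags = [bag_of_words(s) for s in sentences]
    totals = []
    for i, bag in enumerate(bags):
        totals.append(sum(cosine_similarity(bag, other)
                          for j, other in enumerate(bags) if j != i))
    order = sorted(range(len(sentences)), key=lambda i: totals[i], reverse=True)
    return [sentences[i] for i in order]

sentences = [
    "Make your bed every morning.",
    "Make your bed and start your day.",
    "Sharks circle in the water.",
    "Start your day with a task completed.",
]
print(rank_by_similarity(sentences)[0])
```

The second sentence shares words with both the first and the fourth, so it ranks highest, while the shark sentence shares nothing with the rest and ranks last.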

In [3]:
# bed_summary.py
# Scrape and summarize Adm. William McRaven's "Make Your Bed" speech.
# Note to self: This speech is plain text from a more sophisticated webpage than MLK's speech. Is the scraping process different?
# This works as a standalone .py file. I'm not sure why it won't work in Jupyter.

import requests
import bs4
from nltk.tokenize import sent_tokenize
from gensim.summarization import summarize # gensim removed the `summarization` module in version 4.0.0, so 3.8.3 is the last release that has it. Use `pip install gensim==3.8.3` to keep this functionality.

url = 'https://jamesclear.com/great-speeches/make-your-bed-by-admiral-william-h-mcraven'
page = requests.get(url)
page.raise_for_status()
soup = bs4.BeautifulSoup(page.text, 'html.parser')
p_elems = [element.text for element in soup.find_all('p')]

speech = ' '.join(p_elems)

# Summarize the speech
print("\nSummary of Make Your Bed speech:")
summary = summarize(speech, word_count=225) # Summarize also allows a `ratio` option. E.g., `ratio=0.01` would produce a summary whose length is 1% of the original document.
sentences = sent_tokenize(summary)
sents = dict.fromkeys(sentences) # `summarize` may return duplicate sentences. Dict keys drop the repeats and, unlike a set, keep the original order between runs.
print(' '.join(sents)) # Print the deduplicated sentences rather than the raw `summarize` output.

ImportError: cannot import name 'has_pattern' from 'gensim.utils' (/opt/homebrew/lib/python3.9/site-packages/gensim/utils.py)
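
One possible explanation for the script working standalone but failing here (an assumption on my part, not something the traceback confirms) is that the notebook kernel uses a different interpreter or `gensim` install than the terminal. A minimal diagnostic sketch to run in both places follows. The helper names are hypothetical, and the version cutoff reflects the removal of `gensim.summarization` in 4.0.0.

```python
import sys
from importlib import metadata

def installed_version(package):
    """Return the installed version string of a package, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None

def has_summarization(version):
    """Return True if this gensim version predates 4.0 (assumed cutoff)."""
    if version is None:
        return False
    major = int(version.split('.')[0])
    return major < 4

print("Interpreter:", sys.executable)
gensim_version = installed_version("gensim")
print("gensim version:", gensim_version)
print("summarize expected to be available:", has_summarization(gensim_version))
```

If the interpreter path or the version differs between the notebook and the terminal, installing `gensim==3.8.3` into the environment the kernel actually uses would be the next thing to try.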

## Project 5: Summarizing Text with Word Clouds
I'll come back to this one.