Instruction: https://towardsdatascience.com/write-a-simple-summarizer-in-python-e9ca6138a08e

# Summarizer

## How the Summarizer Works
1. Read from source: Read the unabridged content from the source, which is a file in this exercise.
2. Perform formatting and cleanup: Clean up the input so that it is free of extra white space and other issues.
3. Tokenize input: Break the input up into its individual sentences and words.
4. Scoring: Count the frequency of each word in the input, then score each sentence by summing the frequencies of the words it contains.
5. Selection: Choose the top N sentences based on their score.
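
As a rough illustration of how steps 2 to 5 fit together, here is a minimal sketch that uses only the standard library. The function name `simple_summary` and the regex-based tokenizers are assumptions made for this example; the notebook itself relies on NLTK's `sent_tokenize` and `word_tokenize`, which handle abbreviations and punctuation far better.

```python
import re
from collections import Counter
from heapq import nlargest


def simple_summary(text, n=2):
    """Hypothetical stdlib-only walk-through of steps 2 to 5."""
    # Step 2: collapse every run of whitespace into a single space.
    cleaned = " ".join(text.split())
    # Step 3: naive tokenizers (assumption: sentences end with . ! or ?).
    sentences = re.split(r"(?<=[.!?])\s+", cleaned)
    words = re.findall(r"[a-z']+", cleaned.lower())
    # Step 4: count word frequencies, then sum them per sentence.
    freq = Counter(words)
    scores = {
        i: sum(freq[w] for w in re.findall(r"[a-z']+", s.lower()))
        for i, s in enumerate(sentences)
    }
    # Step 5: keep the n best sentences, in their original order.
    best = sorted(nlargest(n, scores, key=scores.get))
    return " ".join(sentences[i] for i in best)


text = "Ice melts. Ice sheets melt when ice warms.\nPenguins walk."
print(simple_summary(text, n=1))  # Ice sheets melt when ice warms.
```

The second sentence wins because it contains the most frequent word, "ice", twice and is also the longest; summing raw counts favors long sentences.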

In [2]:
import argparse

from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
from string import punctuation
from nltk.probability import FreqDist
from heapq import nlargest
from collections import defaultdict


In [11]:
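def read_file(path):
    """Return the full text of the file at `path`."""
    try:
        with open(path, 'r') as file:
            return file.read()
    except IOError:
        print("Fatal Error: File ({}) could not be located or is not readable.".format(path))
        # Re-raise so the caller does not continue with None as the content.
        raise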


### Sanitizing the input

In [19]:
def sanitize_input(data):
    replace = {
        ord('\f') : ' ',
        ord('\t') : ' ',
        ord('\n') : ' ',
        ord('\r') : None
    }
    return data.translate(replace)
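
To make the effect of the translation table concrete, the sketch below runs the same mapping on a small sample string. `str.translate` maps each character independently, so two consecutive newlines become two spaces. The helper `collapse_whitespace` is a hypothetical addition, not part of the original notebook, showing one way to squeeze such runs if that matters for your input.

```python
def sanitize_input(data):
    # Same mapping as the notebook: code point -> replacement (None deletes).
    replace = {ord('\f'): ' ', ord('\t'): ' ', ord('\n'): ' ', ord('\r'): None}
    return data.translate(replace)


def collapse_whitespace(data):
    # Hypothetical helper: split() without arguments splits on any whitespace run.
    return ' '.join(data.split())


sample = "Line one.\r\n\r\nLine\ttwo."
print(repr(sanitize_input(sample)))       # 'Line one.  Line two.'
print(repr(collapse_whitespace(sample)))  # 'Line one. Line two.'
```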


### Custom Tokenizing function

In [23]:
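def tokenize_content(content):
    """
    Custom tokenize function.
    We won't use this for now.

    Returns [sentence_tokens, filtered_words], where filtered_words are the
    lowercased words of content with English stop words and punctuation removed.
    """
    stop_words = set(stopwords.words('english') + list(punctuation))
    words = word_tokenize(content.lower())
    return [
        sent_tokenize(content),
        [word for word in words if word not in stop_words]
    ]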
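
The following sketch illustrates why filtering stop words and punctuation matters before counting. It uses `collections.Counter` and a tiny hand-picked stop list as stand-ins; both the `STOP_WORDS` set and the token list are assumptions made for the example, whereas the notebook uses NLTK's English stop word list.

```python
from collections import Counter
from string import punctuation

# Assumption: a tiny stop list standing in for stopwords.words('english').
STOP_WORDS = {"the", "of", "is", "a", "and"} | set(punctuation)

tokens = ["the", "ice", "of", "greenland", "is", "melting", ".",
          "the", "ice", "sheet", "is", "thinning", "."]

unfiltered = Counter(tokens)
filtered = Counter(t for t in tokens if t not in STOP_WORDS)

# Without filtering, "the", "is" and "." count as much as "ice".
print(unfiltered["the"], unfiltered["."], unfiltered["ice"])  # 2 2 2
# After filtering, only content words remain.
print(filtered.most_common(1))  # [('ice', 2)]
```

Without this step, sentences that merely contain many common words or punctuation marks can outscore sentences that carry the topic.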


### Scoring

In [46]:
def score_tokens(filtered_words, sentence_tokens):
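    """
    word_freq stores a structure where each key is a word
    and each value is the number of times that word occurred.
    In this case,
    word_freq: <FreqDist with 209 samples and 373 outcomes>
    ex. FreqDist({'the': 27, '.': 21, 'Greenland': 7, 'at': 6, ...})

    Sentence words are lowercased before the lookup, so a mixed-case key
    such as 'Greenland' never matches; pass lowercased words to count it.
    """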
    word_freq = FreqDist(filtered_words)
    ranking = defaultdict(int)
    for i, sentence in enumerate(sentence_tokens):
        for word in word_tokenize(sentence.lower()):
            if word in word_freq:
                ranking[i] += word_freq[word]
    return ranking
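
A small worked example may help to follow the arithmetic. The sketch below mirrors `score_tokens` but swaps in `collections.Counter` for `FreqDist` and `str.split` for `word_tokenize` so that it runs without NLTK; the name `score_sentences` and the sample data are hypothetical.

```python
from collections import Counter, defaultdict


def score_sentences(filtered_words, sentences):
    # Hypothetical stand-in for score_tokens; str.split replaces word_tokenize.
    word_freq = Counter(filtered_words)
    ranking = defaultdict(int)
    for i, sentence in enumerate(sentences):
        for word in sentence.lower().split():
            if word in word_freq:
                ranking[i] += word_freq[word]
    return ranking


sentences = ["ice melts fast", "the ice sheet", "penguins walk"]
filtered_words = ["ice", "melts", "fast", "ice", "sheet", "penguins", "walk"]

ranks = score_sentences(filtered_words, sentences)
print(dict(ranks))  # {0: 4, 1: 3, 2: 2}
```

Here "ice" occurs twice and every other word once, so sentence 0 scores 2 + 1 + 1 = 4, sentence 1 scores 2 + 1 = 3 ("the" is not in the frequency table), and sentence 2 scores 1 + 1 = 2.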


### Selection

In [93]:
def summarize(ranks, sentences, length=3):
    """
    length is the number of sentences in the summary (3 by default).
    nlargest takes the sentence ranking and returns the numeric positions
    of the highest-scoring sentences in the sentence_tokens variable.
    Sorting those positions restores the original sentence order.
    """
    index = nlargest(length, ranks, key=ranks.get)
    final_sentences = [sentences[j] for j in sorted(index)]
    return ' '.join(final_sentences)


In [84]:
nlargest(4, sentence_ranks, key=sentence_ranks.get)

[14, 0, 20, 5]

In [89]:
index = nlargest(4, sentence_ranks, key=sentence_ranks.get)

In [87]:
for j in sorted(index):
    print(j)

0
5
14
20


In [94]:
path = 'GreenlandIsMelting.txt'

content = read_file(path)
content = sanitize_input(content)

sentence_tokens = sent_tokenize(content)
word_tokens = word_tokenize(content)
sentence_ranks = score_tokens(word_tokens, sentence_tokens)

#print(sentence_tokens[0])
#print(word_tokens)
print(sentence_ranks)

defaultdict(<class 'int'>, {0: 218, 1: 93, 2: 102, 3: 114, 4: 36, 5: 140, 6: 30, 7: 79, 8: 77, 9: 128, 10: 74, 11: 46, 12: 77, 13: 54, 14: 237, 15: 137, 16: 126, 17: 111, 18: 78, 19: 82, 20: 204})


### Final Outcome

In [95]:
summarize(sentence_ranks, sentence_tokens)

"Like a bowling ball on a skating rink, the black geodesic sphere of the East Greenland Ice-Core Project's communal living space stands out against the endless white nothingness of the Greenland ice sheet. And the same processes at work on Greenland's glaciers at the top of the world could send vast sections of Antarctica's ice sheet into the sea as well, raising ocean levels even further. Departing from Kangerlussuaq, VOA visited East GRIP and other remote corners of Greenland with the 109th Airlift Wing of the U.S. Air National Guard for a firsthand look at science in action at the leading edge of climate change."