# Natural Language Toolkit (NLTK) and building a quick summarizer
A common goal in natural language processing (NLP) is to reduce the volume of a text while still retaining its overall meaning.

Some words and sentences in a text are filler, whereas others carry more meaning. For example, in the sentence "The quick brown fox jumps over the lazy dog.", we can remove the word "the" and the remaining words retain a nearly identical meaning: "quick brown fox jumps over lazy dog".
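
The idea can be tried out in a few lines of plain Python. This is only a minimal sketch: `toy_stop_words` is a made-up one-word list for illustration, not the list NLTK provides.

```python
# Toy stop-word list for illustration only; real lists are much longer.
toy_stop_words = {"the"}

sentence = "The quick brown fox jumps over the lazy dog."
words = sentence.rstrip(".").split()

# Compare in lowercase so that both "The" and "the" are dropped.
kept = [w for w in words if w.lower() not in toy_stop_words]
reduced = " ".join(kept)

print(reduced)  # quick brown fox jumps over lazy dog
```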

## Task: Analyze an article and create a quick summarizer
In this lesson, we're going to take a long journal article:

Kleib, M., Simpson, N., Rhodes, B., (May 31, 2016) "Information and Communication Technology: Design, Delivery, and Outcomes from a Nursing Informatics Boot Camp" OJIN: The Online Journal of Issues in Nursing Vol. 21, No. 2, Manuscript 5.

http://ojin.nursingworld.org/MainMenuCategories/ANAMarketplace/ANAPeriodicals/OJIN/TableofContents/Vol-21-2016/No2-May-2016/Information-and-Communication-Technology.html

And then we will:
1. Feed this entire text into a string variable
2. Run the entire article text through the word tokenizer
3. Find the most common words (eliminating stop words and punctuation)
4. Create a scoring system based on the most common words
5. Run the entire article text through a sentence tokenizer
6. Score each sentence based on how many of the top common words are contained in the sentence
7. Create a summary blurb by using sentences that have a certain score threshold
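
The steps above can be sketched end to end with only the standard library. Everything named here (`TOY_STOP_WORDS`, `toy_words`, `toy_sentences`, `summarize`) is a hypothetical stand-in: the regular expressions are rough substitutes for NLTK's tokenizers, and the stop-word list is deliberately tiny. The lesson itself uses NLTK for these pieces.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list (the lesson uses NLTK's).
TOY_STOP_WORDS = {"the", "a", "is", "of", "and", "in", "to", "it"}


def toy_words(text):
    """Rough stand-in for a word tokenizer: runs of letters and apostrophes."""
    return re.findall(r"[A-Za-z']+", text)


def toy_sentences(text):
    """Rough stand-in for a sentence tokenizer: split after ., ! or ?"""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def summarize(text, top_n=3):
    # Steps 2-3: tokenize into words and drop stop words.
    words = [w for w in toy_words(text) if w.lower() not in TOY_STOP_WORDS]
    # Step 4: the most common word gets the highest score.
    top = Counter(words).most_common(top_n)
    scores = {word: top_n - rank for rank, (word, _count) in enumerate(top)}
    # Steps 5-6: split into sentences and total the word scores in each one.
    sentences = toy_sentences(text)
    totals = [sum(scores.get(w, 0) for w in toy_words(s)) for s in sentences]
    # Step 7: keep the sentences that score above the average.
    threshold = sum(totals) / len(totals)
    return [s for s, total in zip(sentences, totals) if total > threshold]


sample = (
    "Nurses use informatics daily. Informatics helps nurses track outcomes. "
    "The weather was pleasant. Training nurses in informatics improves practice."
)
summary_sentences = summarize(sample)
print(summary_sentences)
```

Like the lesson's code, this sketch matches words case-sensitively, so "Informatics" at the start of a sentence does not earn the score assigned to "informatics".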

The last few steps, which go beyond just identifying the topic sentence, are adapted from this post: https://dev.to/davidisrawi/build-a-quick-summarizer-with-python-and-nltk

## Definitions

<dl>
    <dt>Tokenizing</dt>
    <dd>The act of breaking up a sequence of strings/text into words, keywords, elements or even sentences called tokens</dd>
    <dt>Stop word(s)</dt>
    <dd>Commonly used words that systems/engines have been designed to ignore because they have little value in determining meaning or performing other functions (such as a search engine). Some common stop words include: the, and, but</dd>
</dl>

## Part 1: Feed text into a string variable
We've placed all of the text of the journal article into a file called <b>article.txt</b>. Using the <b>with open</b> statement, we can open this file and designate it as the variable <b>f</b>.

Opening a file gives us access to several methods (functions that belong to a specific class) for reading its contents. The method we are going to use is called <b>readlines()</b>.

<b>readlines()</b> returns a list in which each element is one line of the text file.

In [4]:
with open('article.txt') as f:
    my_var = f.readlines()

print('The variable my_var is a type of {}'.format(type(my_var)))
print('The variable my_var has {} elements'.format(len(my_var)))
print(my_var[:3])

The variable my_var is a type of <class 'list'>
The variable my_var has 90 elements
['Information and communication technology (ICT) is integral in today’s healthcare as a critical piece of support to both track and improve patient and organizational outcomes. Facilitating nurses’ informatics competency development through continuing education is paramount to enhance their readiness to practice safely and accurately in technologically enabled work environments. In this article, we briefly describe progress in nursing informatics (NI) and share a project exemplar that describes our experience in the design, implementation, and evaluation of a NI educational event, a one-day boot camp format that was used to provide foundational knowledge in NI targeted primarily at frontline nurses in Alberta, Canada. We also discuss the project outcomes, including lessons learned and future implications. Overall, the boot camp was successful to raise nurses’ awareness about the importance of informatic

Above, the variable <b>my_var</b> has been assigned the readlines() output. Using the type() function, we can see that it is indeed a list, and len() shows that it has 90 elements. Some of these look like blank lines.

Since it's a list, we can use a for-loop to go through its contents and append each line of the text file to a single string variable.

Let's create a new variable called <b>contents</b> and append each line of the file to that variable.

In [7]:
contents = ''

with open('article.txt') as f:
    for line in f.readlines():
        contents += line
        
print(contents)

Information and communication technology (ICT) is integral in today’s healthcare as a critical piece of support to both track and improve patient and organizational outcomes. Facilitating nurses’ informatics competency development through continuing education is paramount to enhance their readiness to practice safely and accurately in technologically enabled work environments. In this article, we briefly describe progress in nursing informatics (NI) and share a project exemplar that describes our experience in the design, implementation, and evaluation of a NI educational event, a one-day boot camp format that was used to provide foundational knowledge in NI targeted primarily at frontline nurses in Alberta, Canada. We also discuss the project outcomes, including lessons learned and future implications. Overall, the boot camp was successful to raise nurses’ awareness about the importance of informatics in nursing practice.

In today’s information-intensive healthcare industry, informat

The above code snippet opens the <b>article.txt</b> file and binds it to the variable <b>f</b>.

Since we've established that readlines() returns a list, we can use a for-loop to go through each element in that list and append it to our string variable <b>contents</b>.
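
As a side note, a file object's read() method returns the whole file as a single string, which gives the same result as the loop. The sketch below checks this on a small throwaway file created with `tempfile` (an assumption made to keep the example self-contained; the lesson reads `article.txt`).

```python
import os
import tempfile

# Write a small throwaway file so the example does not depend on article.txt.
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("First line.\n\nThird line.\n")
    path = tmp.name

# Line-by-line accumulation, as in the lesson.
looped = ""
with open(path, encoding="utf-8") as f:
    for line in f.readlines():
        looped += line

# read() returns the entire file as one string in a single call.
with open(path, encoding="utf-8") as f:
    whole = f.read()

os.remove(path)  # clean up the throwaway file

print(looped == whole)  # True
```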

## Part 2: Run the entire article text through the word tokenizer
Now that we've created our contents variable, we can run it through the word tokenizer to split it into individual words.

We will need to import the word_tokenize() and sent_tokenize() functions from the nltk.tokenize module. Once they are imported, we can run the word_tokenize() function on our <b>contents</b> variable.

In [11]:
from nltk.tokenize import word_tokenize, sent_tokenize

all_words = word_tokenize(contents)

print(all_words)

['Information', 'and', 'communication', 'technology', '(', 'ICT', ')', 'is', 'integral', 'in', 'today', '’', 's', 'healthcare', 'as', 'a', 'critical', 'piece', 'of', 'support', 'to', 'both', 'track', 'and', 'improve', 'patient', 'and', 'organizational', 'outcomes', '.', 'Facilitating', 'nurses', '’', 'informatics', 'competency', 'development', 'through', 'continuing', 'education', 'is', 'paramount', 'to', 'enhance', 'their', 'readiness', 'to', 'practice', 'safely', 'and', 'accurately', 'in', 'technologically', 'enabled', 'work', 'environments', '.', 'In', 'this', 'article', ',', 'we', 'briefly', 'describe', 'progress', 'in', 'nursing', 'informatics', '(', 'NI', ')', 'and', 'share', 'a', 'project', 'exemplar', 'that', 'describes', 'our', 'experience', 'in', 'the', 'design', ',', 'implementation', ',', 'and', 'evaluation', 'of', 'a', 'NI', 'educational', 'event', ',', 'a', 'one-day', 'boot', 'camp', 'format', 'that', 'was', 'used', 'to', 'provide', 'foundational', 'knowledge', 'in', 'NI'

<b>word_tokenize</b> returns a list of tokens parsed from the given text. A lot of stray punctuation marks have been interpreted as words, and there are a lot of stop words. In the next part, we'll filter them out and then count all of the significant words.

## Part 3: Find the most common words (eliminating stop words and punctuation)
To eliminate the stop words and punctuation, we need a reference list for each. Fortunately, both are already available to us: the stop words in the NLTK library and the punctuation in Python's standard library.

### Eliminating stop words
To eliminate stop words, we will first import the stopwords from the nltk.corpus module

In [10]:
from nltk.corpus import stopwords

eng_stopwords = stopwords.words('english')

print(eng_stopwords)

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

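Then, we can run a for-loop over our <b>all_words</b> variable, skip any word that is found in the eng_stopwords list, and append the rest to a new words_no_stop_words list. Note that the comparison is case-sensitive: the stop word list is all lowercase, so capitalized tokens such as 'In' and 'We' pass through.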

In [17]:
words_no_stop_words = []

for word in all_words:
    if word not in eng_stopwords:
        words_no_stop_words.append(word)

print('words_no_stop_words has {} words whereas all_words has {} words'.format(len(words_no_stop_words), len(all_words)))
print(words_no_stop_words)

words_no_stop_words has 3620 words whereas all_words has 4873 words
['Information', 'communication', 'technology', '(', 'ICT', ')', 'integral', 'today', '’', 'healthcare', 'critical', 'piece', 'support', 'track', 'improve', 'patient', 'organizational', 'outcomes', '.', 'Facilitating', 'nurses', '’', 'informatics', 'competency', 'development', 'continuing', 'education', 'paramount', 'enhance', 'readiness', 'practice', 'safely', 'accurately', 'technologically', 'enabled', 'work', 'environments', '.', 'In', 'article', ',', 'briefly', 'describe', 'progress', 'nursing', 'informatics', '(', 'NI', ')', 'share', 'project', 'exemplar', 'describes', 'experience', 'design', ',', 'implementation', ',', 'evaluation', 'NI', 'educational', 'event', ',', 'one-day', 'boot', 'camp', 'format', 'used', 'provide', 'foundational', 'knowledge', 'NI', 'targeted', 'primarily', 'frontline', 'nurses', 'Alberta', ',', 'Canada', '.', 'We', 'also', 'discuss', 'project', 'outcomes', ',', 'including', 'lessons', 'lea

### Filter out punctuation
This has filtered out some of our tokens, but the punctuation is still there. The standard library module <b>string</b> gives us a ready-made string of punctuation characters.

In [15]:
import string
print(string.punctuation)

!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~


Now that we have the punctuation characters, we can do the same thing again: create a new list called filtered_words and append each word that is not found in string.punctuation. Note that the typographic apostrophe (’) is not among these ASCII characters, so it survives the filter.

In [19]:
filtered_words = []

for word in words_no_stop_words:
    if word not in string.punctuation:
        filtered_words.append(word)

print('filtered_words has {}, words_no_stop_words has {}, words whereas all_words has {} words'.format(len(filtered_words), len(words_no_stop_words), len(all_words)))
print(filtered_words)

filtered_words has 2915, words_no_stop_words has 3620, words whereas all_words has 4873 words
['Information', 'communication', 'technology', 'ICT', 'integral', 'today', '’', 'healthcare', 'critical', 'piece', 'support', 'track', 'improve', 'patient', 'organizational', 'outcomes', 'Facilitating', 'nurses', '’', 'informatics', 'competency', 'development', 'continuing', 'education', 'paramount', 'enhance', 'readiness', 'practice', 'safely', 'accurately', 'technologically', 'enabled', 'work', 'environments', 'In', 'article', 'briefly', 'describe', 'progress', 'nursing', 'informatics', 'NI', 'share', 'project', 'exemplar', 'describes', 'experience', 'design', 'implementation', 'evaluation', 'NI', 'educational', 'event', 'one-day', 'boot', 'camp', 'format', 'used', 'provide', 'foundational', 'knowledge', 'NI', 'targeted', 'primarily', 'frontline', 'nurses', 'Alberta', 'Canada', 'We', 'also', 'discuss', 'project', 'outcomes', 'including', 'lessons', 'learned', 'future', 'implications', 'Overa

### Filtering stop words and punctuation in the same for-loop
We've shown how to apply each filter in its own for-loop, but sometimes you want to save some lines. We can combine multiple conditions in a single if statement within one for-loop.

In [22]:
filtered_words = []
for word in all_words:
    if word not in eng_stopwords and word not in string.punctuation:
        filtered_words.append(word)
        
print('filtered_words has {}, words whereas all_words has {} words'.format(len(filtered_words), len(all_words)))

filtered_words has 2915, words whereas all_words has 4873 words
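
The same filter can also be written as a list comprehension over a set. This is an optional variation, not the lesson's method; `sample_tokens` and `sample_stop_words` are made-up stand-ins for the output of word_tokenize(contents) and stopwords.words('english').

```python
import string

# Hypothetical sample tokens; in the lesson these come from word_tokenize(contents).
sample_tokens = ["Information", "and", "communication", "(", "ICT", ")", "is", "integral", "."]
# Tiny illustrative stop-word list; the lesson uses stopwords.words('english').
sample_stop_words = {"and", "is"}

# A set of the individual punctuation characters gives exact-match lookups.
punctuation_marks = set(string.punctuation)
excluded = sample_stop_words | punctuation_marks

sample_filtered = [tok for tok in sample_tokens if tok not in excluded]

print(sample_filtered)  # ['Information', 'communication', 'ICT', 'integral']
```

One difference worth knowing: `word not in string.punctuation` is a substring test against one long string, so a token such as `'()'` would also be dropped, whereas the set only matches single punctuation characters exactly.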


Let's make a function out of this so we don't need to repeat it later.

In [55]:
def word_tokenize_minus_punct_sw(text):
    filtered_word_list = []
    for word in word_tokenize(text):
        if word not in eng_stopwords and word not in string.punctuation:
            filtered_word_list.append(word)
    return filtered_word_list

### Get a count of the words and organize by most common
Now that we have a filtered list of words that probably have the most meaning, let's get a count of each word. We can use another standard library module called <b>collections</b> and import its <b>Counter</b> class to do that for us.

Counter counts the elements in our list, and its most_common() method returns them sorted from most to least frequent.
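
Here is a small illustration of both behaviours on a made-up list (`sample_words` is hypothetical data, not taken from the article):

```python
from collections import Counter

# Hypothetical word list for illustration.
sample_words = ["nurses", "informatics", "nurses", "practice", "informatics", "nurses"]
sample_counts = Counter(sample_words)

print(sample_counts["nurses"])       # 3
print(sample_counts["absent"])       # 0 (missing keys count as zero)
print(sample_counts.most_common(2))  # [('nurses', 3), ('informatics', 2)]
```

Passing a number to most_common() limits the result to that many (word, count) pairs.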

In [26]:
from collections import Counter

Counter(filtered_words)

Counter({'Information': 4,
         'communication': 5,
         'technology': 9,
         'ICT': 19,
         'integral': 2,
         'today': 2,
         '’': 13,
         'healthcare': 16,
         'critical': 5,
         'piece': 1,
         'support': 15,
         'track': 2,
         'improve': 5,
         'patient': 22,
         'organizational': 6,
         'outcomes': 16,
         'Facilitating': 1,
         'nurses': 57,
         'informatics': 59,
         'competency': 18,
         'development': 16,
         'continuing': 12,
         'education': 29,
         'paramount': 3,
         'enhance': 3,
         'readiness': 2,
         'practice': 36,
         'safely': 1,
         'accurately': 1,
         'technologically': 1,
         'enabled': 1,
         'work': 8,
         'environments': 3,
         'In': 15,
         'article': 2,
         'briefly': 2,
         'describe': 3,
         'progress': 2,
         'nursing': 33,
         'NI': 38,
         'share': 3,
    

In [27]:
Counter(filtered_words).most_common()

[('informatics', 59),
 ('nurses', 57),
 ('NI', 38),
 ('practice', 36),
 ('nursing', 33),
 ('education', 29),
 ('Alberta', 24),
 ('Canadian', 23),
 ('Nurses', 23),
 ('patient', 22),
 ('educational', 21),
 ('health', 21),
 ('The', 21),
 ('competencies', 21),
 ('Association', 20),
 ('ICT', 19),
 ('competency', 18),
 ('care', 17),
 ('opportunities', 17),
 ('healthcare', 16),
 ('outcomes', 16),
 ('development', 16),
 ('event', 16),
 ('boot', 16),
 ('camp', 16),
 ('information', 16),
 ('support', 15),
 ('In', 15),
 ('knowledge', 15),
 ('clinical', 15),
 ('use', 14),
 ('’', 13),
 ('CASN', 13),
 ('learning', 13),
 ('continuing', 12),
 ('Canada', 12),
 ('needs', 12),
 ('electronic', 12),
 ('events', 12),
 ('2013', 12),
 ('Nursing', 12),
 ('professional', 12),
 ('Informatics', 11),
 ('quality', 10),
 ('Health', 10),
 ('technology', 9),
 ('also', 9),
 ('2006b', 9),
 ('2012', 9),
 ('identified', 9),
 ('e.g.', 9),
 ('settings', 9),
 ('work', 8),
 ('evaluation', 8),
 ('used', 8),
 ('future', 8),
 ('

In [33]:
top_ten_common_words = Counter(filtered_words).most_common(10)

top_ten_common_words

[('informatics', 59),
 ('nurses', 57),
 ('NI', 38),
 ('practice', 36),
 ('nursing', 33),
 ('education', 29),
 ('Alberta', 24),
 ('Canadian', 23),
 ('Nurses', 23),
 ('patient', 22)]

## Part 4: Create a scoring system based on the most common words

Now that we have the most common words, let's arbitrarily take the 10 most common and score them from 10 down to 1. The most common word (informatics) scores 10 and the tenth most common (patient) scores 1.

Let's use a dictionary to keep track of the scores. We'll create a new dictionary called <b>word_score</b>

In [35]:
word_score = dict()

for idx, word_pair in enumerate(top_ten_common_words):
    word_score[word_pair[0]] = (10 - idx)

word_score

{'informatics': 10,
 'nurses': 9,
 'NI': 8,
 'practice': 7,
 'nursing': 6,
 'education': 5,
 'Alberta': 4,
 'Canadian': 3,
 'Nurses': 2,
 'patient': 1}
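
The same ranking can be expressed as a dictionary comprehension that works for any number of words. `rank_scores` is a hypothetical helper name, shown here on the first three (word, count) pairs from the output above.

```python
def rank_scores(common_words):
    """Map (word, count) pairs, ordered most common first, to descending scores."""
    top = len(common_words)
    return {word: top - idx for idx, (word, _count) in enumerate(common_words)}


# First three pairs from the most_common() output in this lesson.
sample_common = [("informatics", 59), ("nurses", 57), ("NI", 38)]
sample_scores = rank_scores(sample_common)

print(sample_scores)  # {'informatics': 3, 'nurses': 2, 'NI': 1}
```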

## Part 5: Run the entire article text through a sentence tokenizer
Now that we have our word scoring system, we can run our original text through the sentence tokenizer. We already imported sent_tokenize from NLTK in Part 2, but the import is repeated here so this cell works on its own. We'll then run our article text through the sentence tokenizer and save the results to a new variable, <b>sentences</b>.

In [37]:
from nltk.tokenize import sent_tokenize

sentences = sent_tokenize(contents)

sentences

['Information and communication technology (ICT) is integral in today’s healthcare as a critical piece of support to both track and improve patient and organizational outcomes.',
 'Facilitating nurses’ informatics competency development through continuing education is paramount to enhance their readiness to practice safely and accurately in technologically enabled work environments.',
 'In this article, we briefly describe progress in nursing informatics (NI) and share a project exemplar that describes our experience in the design, implementation, and evaluation of a NI educational event, a one-day boot camp format that was used to provide foundational knowledge in NI targeted primarily at frontline nurses in Alberta, Canada.',
 'We also discuss the project outcomes, including lessons learned and future implications.',
 'Overall, the boot camp was successful to raise nurses’ awareness about the importance of informatics in nursing practice.',
 'In today’s information-intensive healthca

Like word_tokenize, sent_tokenize returns a list; its elements are the individual sentences parsed from the text.

## Part 6: Score each sentence based on how many of the top common words are contained in the sentence
Now we have all of our sentences, and we have our scoring system. We can use another for-loop to go through each sentence, run it through word_tokenize, and then check whether each word belongs to our word_score dictionary. If it does, we add its score to the sentence's total. We will use a new dictionary called <b>sentence_score</b> to keep track of the totals.
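
The scoring rule for a single sentence can be sketched on made-up data. `score_sentence`, `toy_scores`, and `toy_tokens` are hypothetical names; the token list stands in for the output of word_tokenize on one sentence.

```python
def score_sentence(tokens, scores):
    """Total the scores of the tokens; tokens without a score contribute 0."""
    return sum(scores.get(tok, 0) for tok in tokens)


# Hypothetical scores and tokens for illustration.
toy_scores = {"informatics": 10, "nurses": 9}
toy_tokens = ["Overall", "nurses", "value", "informatics", "and", "informatics", "training"]

total = score_sentence(toy_tokens, toy_scores)
print(total)  # 29, that is 9 + 10 + 10
```

A scored word counts every time it appears, so longer sentences that repeat top words tend to score higher.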

In [41]:
sentence_score = dict()

for sentence in sentences:
    sentence_score[sentence] = 0
    for word in word_tokenize(sentence):
        # get() returns 0 for words outside the top ten, so no separate check is needed
        sentence_score[sentence] += word_score.get(word, 0)

In [42]:
sentence_score

{'Information and communication technology (ICT) is integral in today’s healthcare as a critical piece of support to both track and improve patient and organizational outcomes.': 1,
 'Facilitating nurses’ informatics competency development through continuing education is paramount to enhance their readiness to practice safely and accurately in technologically enabled work environments.': 31,
 'In this article, we briefly describe progress in nursing informatics (NI) and share a project exemplar that describes our experience in the design, implementation, and evaluation of a NI educational event, a one-day boot camp format that was used to provide foundational knowledge in NI targeted primarily at frontline nurses in Alberta, Canada.': 53,
 'We also discuss the project outcomes, including lessons learned and future implications.': 0,
 'Overall, the boot camp was successful to raise nurses’ awareness about the importance of informatics in nursing practice.': 32,
 'In today’s information-

In [49]:
import statistics

print('The highest sentence score is {}, the average is {}'.format(max(sentence_score.values()), statistics.mean(sentence_score.values())))

The highest sentence score is 60, the average is 14.235668789808917


## Part 7: Create a summary blurb by using sentences that have a certain score threshold
Now that we have each sentence's score, let's arbitrarily pick the average as the threshold for including a sentence in our summary. We can iterate through all of the sentences in our sentence_score dictionary and keep those that score above it.

In [54]:
summary = ''
# Compute the threshold once rather than on every iteration
threshold = statistics.mean(sentence_score.values())

for key, value in sentence_score.items():
    if value > threshold:
        summary += key

print(summary)

Facilitating nurses’ informatics competency development through continuing education is paramount to enhance their readiness to practice safely and accurately in technologically enabled work environments.In this article, we briefly describe progress in nursing informatics (NI) and share a project exemplar that describes our experience in the design, implementation, and evaluation of a NI educational event, a one-day boot camp format that was used to provide foundational knowledge in NI targeted primarily at frontline nurses in Alberta, Canada.Overall, the boot camp was successful to raise nurses’ awareness about the importance of informatics in nursing practice.In today’s information-intensive healthcare industry, information and communication technology (ICT) and informatics are integral for any system to meet the needs of patients and providers and improve the quality and safety of the clinical environment (Canadian Nurses Association, 2006a; Canadian Nurses Association, 2006b; Poe, 

In [57]:
print('New summary has a word count of {}'.format(len(word_tokenize_minus_punct_sw(summary))))

print('Original article has a word count of {}'.format(len(word_tokenize_minus_punct_sw(contents))))

New summary has a word count of 1619
Original article has a word count of 2915


At face value, we reduced the word count (excluding punctuation and stop words) by about 44%, from 2,915 to 1,619. However, someone would still need to manually review this summary against the original article to see whether it adequately captures the article's important points.
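
Two small follow-ups, sketched under assumptions. First, joining the selected sentences with a space keeps them from running together, which is a variation on the concatenation used above; `toy_sentence_scores` is made-up data. Second, the reduction can be computed directly from the two word counts reported above.

```python
import statistics

# Hypothetical sentence scores for illustration.
toy_sentence_scores = {
    "Nurses need informatics training.": 19,
    "The event was held in spring.": 0,
    "Informatics supports nursing practice.": 13,
}

# Keep sentences scoring above the average, then join them with spaces.
toy_threshold = statistics.mean(toy_sentence_scores.values())
selected = [s for s, score in toy_sentence_scores.items() if score > toy_threshold]
toy_summary = " ".join(selected)
print(toy_summary)

# Reduction computed from the word counts reported in this lesson.
reduction = 1 - 1619 / 2915
print("{:.1%}".format(reduction))  # 44.5%
```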