Many news analytics tools aggregate content from popular news outlets and highlight trending and important topics.

Let us build a mini news analytics tool. Using CNN top news stories from `9/3/2019`, answer the following:

- How many words does `article` contain?
- How many unique words?
- Sort these words by their frequency
- Some words have a high frequency only because they are common in English. The list `stop_words` contains these words; exclude them and report the above again.

In [1]:
article = """Readers' Forum: The problem with college readings
Brandon Mull releases Fablehaven cookbook, teases new book series
Video of the Day: American caver rescued
BYU football managers support the team from behind the scenes
Free national parks entry for 33rd annual National Public Lands Day
BYU students respond to updated CES standards
Remembering 9/11: A uniquely global experience
Video of the Day: Rescue dog rescued from waterfall
No. 1 Cougars matched by TCU, draw 3-3 in first conference game
'It's time for a new generation of leaders': Sen. Mitt Romney will not be seeking Senate re-election
Video of the Day: Escaped inmate finally captured
Underdog BYU “whips” Arkansas in wild 38-31 road victory
Video of the Day: Mexican officials unveil alleged alien body
Readers' Forum: Becoming like Batman
BYU men's soccer kicks off its season with a tie against Utah
Get to know BYU football's newest defensive contributors
BYU's Juneteenth celebration invites students to honor family roots
700 acre development project in Vineyard will prioritize sustainability
Readers' Forum: The dangers of ignoring experts
Sundance Local Lens event focuses on Utah film community
Is renters insurance worth the extra cost?
No. 12 BYU sets to finish non-conference play against in-state rival Utah State
Eye on the Y: BYU community honors victims of Sept. 11 attacks
Middle Eastern Studies club opens with message on gender equality
Burning Man Exodus in Photos
InterVarsity Student Connexion brings evangelical students to Provo
Orem artist tells stories, preserves memories through florals
Video of the Day: BYU honors victims of 9/11
Readers' Forum: A digital eye of caution for artificial intelligence"""

In [2]:
words = article.split()

In [3]:
#How many words does article contain?
len(words)

263

In [4]:
#How many unique words?
unique_words = set(words)
len(unique_words)

196

In [5]:
#Sort these words by their frequency
word_freq = {}
# first initialize the dictionary
for word in unique_words:
    word_freq[word]=0
# second, process text, increase count for each word
for word in words:
    word_freq[word]+=1
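
The two loops above can also be folded into a single pass. The sketch below is one possible variant, not the notebook's own method: it uses `dict.get` with a default of 0, so the dictionary does not need to be initialized first. The names `sample_words` and `one_pass_freq` and the sample sentence are made up for illustration; in the notebook the input would be `words`.

```python
# Hypothetical sample input; in the notebook this would be `words`
sample_words = "the cat and the hat and the bat".split()

one_pass_freq = {}
for word in sample_words:
    # .get returns 0 the first time a word is seen, so no initialization loop is needed
    one_pass_freq[word] = one_pass_freq.get(word, 0) + 1

print(one_pass_freq)
```

The result holds the same counts as the two-loop version; only the bookkeeping differs.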

In [6]:
word_freq

{'38-31': 1,
 'of': 10,
 '“whips”': 1,
 'captured': 1,
 'support': 1,
 'road': 1,
 'entry': 1,
 'alien': 1,
 'victory': 1,
 'BYU': 8,
 'Arkansas': 1,
 'Y:': 1,
 'honors': 2,
 'Student': 1,
 'caution': 1,
 'attacks': 1,
 'be': 1,
 'respond': 1,
 'invites': 1,
 'Provo': 1,
 "Readers'": 4,
 'Forum:': 4,
 'opens': 1,
 'Mexican': 1,
 'annual': 1,
 'State': 1,
 'memories': 1,
 'soccer': 1,
 '9/11': 1,
 'Sept.': 1,
 'first': 1,
 'body': 1,
 '33rd': 1,
 'rival': 1,
 'time': 1,
 'artist': 1,
 'Local': 1,
 're-election': 1,
 'from': 2,
 'uniquely': 1,
 'caver': 1,
 'Sen.': 1,
 'focuses': 1,
 'problem': 1,
 'through': 1,
 'contributors': 1,
 'Connexion': 1,
 'National': 1,
 'experience': 1,
 'Utah': 3,
 'Day': 1,
 'cookbook,': 1,
 'seeking': 1,
 'development': 1,
 'on': 3,
 'by': 1,
 'Get': 1,
 'Burning': 1,
 'cost?': 1,
 'Orem': 1,
 'Becoming': 1,
 'in': 4,
 'Man': 1,
 'finally': 1,
 'acre': 1,
 'Senate': 1,
 'Photos': 1,
 '3-3': 1,
 'artificial': 1,
 'equality': 1,
 'celebration': 1,
 'readings

In [7]:
#sort the dictionary

# first get a list of items
items = list(word_freq.items())

# items is a list, so we can sort it using .sort() or sorted()
# however, we need to specify how the sort should be done

In [8]:
#take a look at the first item in items
items[0]

('38-31', 1)

In [9]:
#notice it is a tuple, so we need to sort based on the second number (the frequency)
sorted_items=sorted(items, key=lambda x: x[1], reverse=True)
sorted_items[:10]

[('of', 10),
 ('the', 9),
 ('BYU', 8),
 ('Video', 5),
 ('Day:', 5),
 ('to', 5),
 ("Readers'", 4),
 ('Forum:', 4),
 ('in', 4),
 ('Utah', 3)]
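
Words with the same frequency (for example `"Readers'"`, `'Forum:'` and `'in'` above) keep whatever order they had in `items`, because Python's sort is stable. If a deterministic order among ties is wanted, one option is a composite key: negate the count for descending frequency, then compare the word itself. The sketch below uses a small hand-written list, `sample_items`, as an assumption rather than the notebook's data.

```python
# Hypothetical (word, count) pairs standing in for `items`
sample_items = [("Utah", 3), ("of", 10), ("in", 4), ("Forum:", 4)]

# Sort by count descending (negated), then by word ascending to break ties
ranked = sorted(sample_items, key=lambda pair: (-pair[1], pair[0]))

print(ranked)
```

Note that plain string comparison places uppercase letters before lowercase ones, so `'Forum:'` sorts ahead of `'in'` here.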

In [10]:
# using Counter
from collections import Counter
Counter(words)

Counter({"Readers'": 4,
         'Forum:': 4,
         'The': 2,
         'problem': 1,
         'with': 3,
         'college': 1,
         'readings': 1,
         'Brandon': 1,
         'Mull': 1,
         'releases': 1,
         'Fablehaven': 1,
         'cookbook,': 1,
         'teases': 1,
         'new': 2,
         'book': 1,
         'series': 1,
         'Video': 5,
         'of': 10,
         'the': 9,
         'Day:': 5,
         'American': 1,
         'caver': 1,
         'rescued': 2,
         'BYU': 8,
         'football': 1,
         'managers': 1,
         'support': 1,
         'team': 1,
         'from': 2,
         'behind': 1,
         'scenes': 1,
         'Free': 1,
         'national': 1,
         'parks': 1,
         'entry': 1,
         'for': 3,
         '33rd': 1,
         'annual': 1,
         'National': 1,
         'Public': 1,
         'Lands': 1,
         'Day': 1,
         'students': 3,
         'respond': 1,
         'to': 5,
         'updated': 1,
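
`Counter` can also return the items already sorted by frequency through its `most_common` method, which saves the explicit `sorted` call with a `lambda`. A minimal sketch on a made-up sentence (the names `sample_words`, `counts` and `top_two` are illustrative only):

```python
from collections import Counter

# Hypothetical sample input; in the notebook this would be `words`
sample_words = "the cat and the hat and the bat".split()

counts = Counter(sample_words)

# most_common(n) returns the n most frequent (word, count) pairs, highest count first
top_two = counts.most_common(2)

print(top_two)
```

Calling `most_common()` with no argument returns every item in descending order of count.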
  

In [12]:
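# import the stop word list through scikit-learn's public module
# rather than the private _stop_words module
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
stop_words = set(ENGLISH_STOP_WORDS)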

In [13]:
#manual way
filtered_items = []
for item in sorted_items:
    if item[0] not in stop_words:
        filtered_items.append(item)

In [14]:
filtered_items[:10]

[('BYU', 8),
 ('Video', 5),
 ('Day:', 5),
 ("Readers'", 4),
 ('Forum:', 4),
 ('Utah', 3),
 ('students', 3),
 ('honors', 2),
 ('A', 2),
 ('victims', 2)]

In [15]:
#sneak peek, using list comprehension
[item for item in sorted_items if item[0] not in stop_words][:10]

[('BYU', 8),
 ('Video', 5),
 ('Day:', 5),
 ("Readers'", 4),
 ('Forum:', 4),
 ('Utah', 3),
 ('students', 3),
 ('honors', 2),
 ('A', 2),
 ('victims', 2)]
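
The filtered list still contains `'A'`, and `'Day:'` is counted separately from `'Day'`. The stop word list is lowercase and the tokens keep their capitalization and punctuation. One possible remedy, shown here only as a sketch, is to normalize each token before counting. The helper `normalize`, the tiny `demo_stop_words` set and the sample `headlines` are all assumptions made for this example; in the notebook you would use `stop_words` and `article` instead.

```python
import string
from collections import Counter

# Tiny stand-in stop word set (assumption); the notebook uses scikit-learn's list
demo_stop_words = {"the", "a", "of", "in"}

# Hypothetical two-line sample in the style of `article`
headlines = "Video of the Day: A rescue in Utah\nThe Day of a Utah rescue"

def normalize(token):
    """Lowercase a token and strip leading/trailing ASCII punctuation."""
    return token.lower().strip(string.punctuation)

tokens = [normalize(t) for t in headlines.split()]

# Drop empty strings (tokens that were only punctuation) and stop words
kept = [t for t in tokens if t and t not in demo_stop_words]

normalized_counts = Counter(kept)
print(normalized_counts)
```

`string.punctuation` covers ASCII characters only, so typographic quotes such as those in `“whips”` would not be stripped by this helper.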

In [16]:
# recreating the dictionary
word_freq1 = {}
important_words = unique_words - stop_words
for word in important_words:
    word_freq1[word] = 0
for word in words:
    if word in important_words:
        word_freq1[word] += 1

In [17]:
sorted(word_freq1.items(), key = lambda x: x[1], reverse=True )

[('BYU', 8),
 ('Video', 5),
 ('Day:', 5),
 ("Readers'", 4),
 ('Forum:', 4),
 ('Utah', 3),
 ('students', 3),
 ('honors', 2),
 ('A', 2),
 ('victims', 2),
 ('community', 2),
 ('No.', 2),
 ('rescued', 2),
 ('The', 2),
 ('new', 2),
 ('38-31', 1),
 ('“whips”', 1),
 ('captured', 1),
 ('support', 1),
 ('road', 1),
 ('entry', 1),
 ('alien', 1),
 ('victory', 1),
 ('Arkansas', 1),
 ('Y:', 1),
 ('Student', 1),
 ('caution', 1),
 ('attacks', 1),
 ('respond', 1),
 ('invites', 1),
 ('Provo', 1),
 ('opens', 1),
 ('Mexican', 1),
 ('annual', 1),
 ('State', 1),
 ('memories', 1),
 ('soccer', 1),
 ('9/11', 1),
 ('Sept.', 1),
 ('body', 1),
 ('33rd', 1),
 ('rival', 1),
 ('time', 1),
 ('artist', 1),
 ('Local', 1),
 ('re-election', 1),
 ('uniquely', 1),
 ('caver', 1),
 ('Sen.', 1),
 ('focuses', 1),
 ('problem', 1),
 ('contributors', 1),
 ('Connexion', 1),
 ('National', 1),
 ('experience', 1),
 ('Day', 1),
 ('cookbook,', 1),
 ('seeking', 1),
 ('development', 1),
 ('Get', 1),
 ('Burning', 1),
 ('cost?', 1),
 ('Or