# Mining the Social Web

## Mining Web Pages

This Jupyter Notebook provides an interactive way to follow along with and explore the examples from the video series. The intent behind this notebook is to reinforce the concepts in a fun, convenient, and effective way.

## Using boilerpipe to extract the text from a web page

Example blog post:
http://radar.oreilly.com/2010/07/louvre-industrial-age-henry-ford.html

In [None]:
!pip install goose3



In [None]:
from goose3 import Goose

g = Goose()
URL='https://www.oreilly.com/ideas/ethics-in-data-project-design-its-about-planning'
article = g.extract(url=URL)

print(article.title)
print('-'*len(article.title))
print(article.meta_description)

content = article.cleaned_text
print()
print('{}...'.format(content[:]))


Ethics in data project design: It's about planning
--------------------------------------------------
The destination and rules of the road are clear; the route you choose to get there makes a huge difference.

When I explain the value of ethics to students and professionals alike, I refer it as an “orientation.” As any good designer, scientist, or researcher knows, how you orient yourself toward a problem can have a big impact on the sort of solution you develop—and how you get there. As Ralph Waldo Emerson once wrote, “perception is not whimsical, but fatal.” Your particular perspective, knowledge of, and approach to a problem shapes your solution, opening up certain paths forward and forestalling others.

Data-driven approaches to business help optimize measurable outcomes—but the early planning of a project needs to account for the ethical (and in many cases, the literal) landscape to avoid ethically treacherous territory. Several recent cases in the news illustrate this point and 

In [None]:
!pip install boilerpipe3



In [None]:
# May also require the installation of Java runtime libraries
# pip install boilerpipe3
from boilerpipe.extract import Extractor

# If you're interested, learn more about how Boilerpipe works by reading
# Christian Kohlschütter's paper: http://www.l3s.de/~kohlschuetter/boilerplate/

URL='https://www.oreilly.com/ideas/ethics-in-data-project-design-its-about-planning'
#URL='https://tuoitre.vn/hai-quan-ha-noi-lung-tung-xet-hoan-thue-6-ti-cho-be-9-tuoi-nguoi-uc-20190520111645276.htm'

extractor = Extractor(extractor='ArticleExtractor', url=URL)

print(extractor.getText())


When I explain the value of ethics to students and professionals alike, I refer it as an “orientation.” As any good designer, scientist, or researcher knows, how you orient yourself toward a problem can have a big impact on the sort of solution you develop—and how you get there. As Ralph Waldo Emerson once wrote , “perception is not whimsical, but fatal.” Your particular perspective, knowledge of, and approach to a problem shapes your solution, opening up certain paths forward and forestalling others.
Data-driven approaches to business help optimize measurable outcomes—but the early planning of a project needs to account for the ethical (and in many cases, the literal) landscape to avoid ethically treacherous territory. Several recent cases in the news illustrate this point and show the type of preparation that enables a way to move forward in both a data-driven and ethical fashion: Princeton Review’s ZIP-code-based pricing scheme, which turned out to unfairly target Asian-American fam

## Using feedparser to extract the text (and other fields) from an RSS or Atom feed

In [None]:
!pip install feedparser



In [None]:
import feedparser # pip install feedparser

#FEED_URL='http://feeds.feedburner.com/oreilly/radar/atom'
FEED_URL='https://ut.edu.vn/rss.php'
#FEED_URL='https://tuoitre.vn/rss/kinh-doanh.rss'

fp = feedparser.parse(FEED_URL)

#print(fp)

for e in fp.entries:
    print(e.title)
    #print(e.links[0].href)
    #print(e.content[0].value)
    print(e.summary)
    print(e.published)

Trường Đại học Giao thông vận tải Thành phố Hồ Chí Minh hỗ trợ sinh viên vượt qua giai đoạn khó khăn do dịch COVID-19
<a href="http://ut.edu.vn/?key=detail&id=2187"><img width=130 height=100 src="/public/img_2019/images/2020/a8e1b27f3486cdd89497%20(1).jpg"></a><br/>&lt;p&gt;
	Từ đầu th&amp;aacute;ng 3/2020, trước t&amp;igrave;nh h&amp;igrave;nh dịch bệnh Covid-19 diễn ra phức tạp, Trường Đại học Giao th&amp;ocirc;ng vận tải Th&amp;agrave;nh phố Hồ Ch&amp;iacute; Minh đ&amp;atilde; chủ động triển khai hoạt động giảng dạy v&amp;agrave; học tập trực tuyến d&amp;agrave;nh cho sinh vi&amp;ecirc;n hệ ch&amp;iacute;nh quy.&lt;/p&gt;
Monday, 20 Apr 2020 20:27:56
Trường Đại học Giao thông vận tải TP. Hồ Chí Minh chào mừng Ngày sách Việt Nam lần thứ 7 năm 2020
<a href="http://ut.edu.vn/?key=detail&id=2186"><img width=130 height=100 src="/public/img_2019/images/2020/IMG_0478.png"></a><br/>&lt;p style=&quot;text-align: justify;&quot;&gt;
	Nhằm n&amp;acirc;ng cao nhận thức về văn h&amp;oacute;a đọc

## Harvesting blog data by parsing feeds

In [None]:
import os
import json
import feedparser
from bs4 import BeautifulSoup

FEED_URL = 'http://feeds.feedburner.com/oreilly/radar/atom'

def cleanHtml(html):
    if html == "": return ""

    return BeautifulSoup(html, 'html5lib').get_text()

fp = feedparser.parse(FEED_URL)

# Report the number of entries in the feed (not the length of the first title)
print("Fetched {0} entries from '{1}'".format(len(fp.entries), fp.feed.title))

blog_posts = []
for e in fp.entries:
    blog_posts.append({'title': e.title,
                       'content': cleanHtml(e.content[0].value),
                       'link': e.links[0].href})

out_file = 'feed.json'
with open(out_file, 'w') as f:
    f.write(json.dumps(blog_posts, indent=1))

print('Wrote output file to {0}'.format(out_file))

Fetched 31 entries from 'Radar'
Wrote output file to feed.json


## Starting to write a web crawler

In [None]:
import httplib2
import re
from bs4 import BeautifulSoup

http = httplib2.Http()
status, response = http.request('http://www.nytimes.com')
#status, response = http.request('https://vnexpress.net/')


soup = BeautifulSoup(response, 'html5lib')

links = []
 
# Collect only absolute links (href values starting with http:// or https://)
for link in soup.find_all('a', attrs={'href': re.compile("^http(s?)://")}):
    links.append(link.get('href'))

for link in links:
    print(link)

https://www.nytimes.com/es/
https://cn.nytimes.com
https://myaccount.nytimes.com/auth/login?response_type=cookie&client_id=vi
https://www.nytimes.com/section/todayspaper
https://www.nytimes.com/section/world
https://www.nytimes.com/section/us
https://www.nytimes.com/section/politics
https://www.nytimes.com/section/nyregion
https://www.nytimes.com/section/business
https://www.nytimes.com/section/opinion
https://www.nytimes.com/section/technology
https://www.nytimes.com/section/science
https://www.nytimes.com/section/health
https://www.nytimes.com/section/sports
https://www.nytimes.com/section/arts
https://www.nytimes.com/section/books
https://www.nytimes.com/section/style
https://www.nytimes.com/section/food
https://www.nytimes.com/section/travel
https://www.nytimes.com/section/magazine
https://www.nytimes.com/section/t-magazine
https://www.nytimes.com/section/realestate
https://www.nytimes.com/video
https://www.nytimes.com/section/world
https://www.nytimes.com/section/us
https://www.ny

```
Create an empty graph
Create an empty queue to keep track of nodes that need to be processed

Add the starting point to the graph as the root node
Add the root node to a queue for processing

Repeat until some maximum depth is reached or the queue is empty:
  Remove a node from the queue 
  For each of the node's neighbors: 
    If the neighbor hasn't already been processed: 
      Add it to the queue 
      Add it to the graph 
      Create an edge in the graph that connects the node and its neighbor
```
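
The pseudocode above can be sketched in Python as a breadth-first traversal. This is a minimal sketch, not a production crawler: `crawl`, `get_links`, and `TOY_LINKS` are hypothetical names introduced here, and the toy dictionary stands in for downloading a page and extracting its links (for example, with the `httplib2` and `BeautifulSoup` code shown earlier).

```python
from collections import deque

# Hypothetical in-memory "web": each page maps to the pages it links to.
TOY_LINKS = {
    'a': ['b', 'c'],
    'b': ['c', 'd'],
    'c': ['a'],
    'd': ['e'],
    'e': [],
}

def crawl(start, get_links, max_depth=2):
    """Breadth-first crawl that follows links up to max_depth hops from start."""
    nodes = {start}              # the graph's nodes, with start as the root
    edges = []                   # (node, neighbor) pairs
    queue = deque([(start, 0)])  # nodes waiting to be processed, with their depth

    while queue:
        node, depth = queue.popleft()
        if depth >= max_depth:
            continue  # do not expand pages at the maximum depth
        for neighbor in get_links(node):
            if neighbor not in nodes:
                # As in the pseudocode, only unseen neighbors are queued,
                # added to the graph, and connected with an edge.
                nodes.add(neighbor)
                queue.append((neighbor, depth + 1))
                edges.append((node, neighbor))
    return nodes, edges

nodes, edges = crawl('a', lambda page: TOY_LINKS.get(page, []), max_depth=2)
print(sorted(nodes))
print(edges)
```

Passing the link lookup in as a function keeps the traversal separate from the network code, so the same loop could be tried on real pages by swapping in a function that fetches a URL and returns its `href` values.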

## Using NLTK to parse web page data

**Naive sentence detection based on periods**

In [None]:
text = "Mr. Green killed Colonel Mustard in the study with the candlestick. Mr. Green is not a very nice fellow."
print(text.split("."))

['Mr', ' Green killed Colonel Mustard in the study with the candlestick', ' Mr', ' Green is not a very nice fellow', '']


**More sophisticated sentence detection**

In [None]:
import nltk # Installation instructions: http://www.nltk.org/install.html

# Downloading nltk packages used in this example
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [None]:
sentences = nltk.tokenize.sent_tokenize(text)
print(sentences)

['Mr. Green killed Colonel Mustard in the study with the candlestick.', 'Mr. Green is not a very nice fellow.']


In [None]:
harder_example = """My name is John Smith and my email address is j.smith@company.com.
Mostly people call Mr. Smith. But I actually have a Ph.D.!
Can you believe it? Neither can most people..."""

sentences = nltk.tokenize.sent_tokenize(harder_example)
print(sentences)

['My name is John Smith and my email address is j.smith@company.com.', 'Mostly people call Mr. Smith.', 'But I actually have a Ph.D.!', 'Can you believe it?', 'Neither can most people...']


**Word tokenization**

In [None]:
text = "Mr. Green killed Colonel Mustard in the study with the candlestick. Mr. Green is not a very nice fellow."
sentences = nltk.tokenize.sent_tokenize(text)

tokens = [nltk.word_tokenize(s) for s in sentences]
print(tokens)

[['Mr.', 'Green', 'killed', 'Colonel', 'Mustard', 'in', 'the', 'study', 'with', 'the', 'candlestick', '.'], ['Mr.', 'Green', 'is', 'not', 'a', 'very', 'nice', 'fellow', '.']]


**Part of speech tagging for tokens**

In [None]:
import nltk
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [None]:
pos_tagged_tokens = [nltk.pos_tag(t) for t in tokens]
print(pos_tagged_tokens)

[[('Mr.', 'NNP'), ('Green', 'NNP'), ('killed', 'VBD'), ('Colonel', 'NNP'), ('Mustard', 'NNP'), ('in', 'IN'), ('the', 'DT'), ('study', 'NN'), ('with', 'IN'), ('the', 'DT'), ('candlestick', 'NN'), ('.', '.')], [('Mr.', 'NNP'), ('Green', 'NNP'), ('is', 'VBZ'), ('not', 'RB'), ('a', 'DT'), ('very', 'RB'), ('nice', 'JJ'), ('fellow', 'NN'), ('.', '.')]]


**Alphabetical list of part-of-speech tags used in the Penn Treebank Project**

See: https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html

| # | POS Tag | Meaning |
|:-:|:-------:|:--------|
| 1	| CC | Coordinating conjunction|
|2|	CD	|Cardinal number|
|3|	DT	|Determiner|
|4|	EX	|Existential there|
|5|	FW	|Foreign word|
|6|	IN	|Preposition or subordinating conjunction|
|7|	JJ	|Adjective|
|8|	JJR	|Adjective, comparative|
|9|	JJS	|Adjective, superlative|
|10|	LS	|List item marker|
|11|	MD	|Modal|
|12|	NN	|Noun, singular or mass|
|13|	NNS	|Noun, plural|
|14|	NNP	|Proper noun, singular|
|15|	NNPS	|Proper noun, plural|
|16|	PDT	|Predeterminer|
|17|	POS	|Possessive ending|
|18|	PRP	|Personal pronoun|
|19|	PRP\$	|Possessive pronoun|
|20|	RB	|Adverb|
|21|	RBR	|Adverb, comparative|
|22|	RBS	|Adverb, superlative|
|23|	RP	|Particle|
|24|	SYM	|Symbol|
|25|	TO	|to|
|26|	UH	|Interjection|
|27|	VB	|Verb, base form|
|28|	VBD	|Verb, past tense|
|29|	VBG	|Verb, gerund or present participle|
|30|	VBN	|Verb, past participle|
|31|	VBP	|Verb, non-3rd person singular present|
|32|	VBZ	|Verb, 3rd person singular present|
|33|	WDT	|Wh-determiner|
|34|	WP	|Wh-pronoun|
|35|	WP\$|Possessive wh-pronoun|
|36|	WRB	|Wh-adverb|
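
Because related tags share a prefix (all noun tags start with `NN`, all verb tags with `VB`), tagged tokens can be filtered with a simple prefix test. The sketch below reuses the tagged first sentence printed earlier; `tokens_with_tag_prefix` is a hypothetical helper introduced only for illustration.

```python
# Tagged tokens copied from the pos_tag output shown earlier in this notebook.
tagged = [('Mr.', 'NNP'), ('Green', 'NNP'), ('killed', 'VBD'),
          ('Colonel', 'NNP'), ('Mustard', 'NNP'), ('in', 'IN'),
          ('the', 'DT'), ('study', 'NN'), ('with', 'IN'),
          ('the', 'DT'), ('candlestick', 'NN'), ('.', '.')]

def tokens_with_tag_prefix(tagged_tokens, prefix):
    """Return the tokens whose POS tag starts with the given prefix."""
    return [token for token, tag in tagged_tokens if tag.startswith(prefix)]

nouns = tokens_with_tag_prefix(tagged, 'NN')  # NN, NNS, NNP, NNPS
verbs = tokens_with_tag_prefix(tagged, 'VB')  # VB, VBD, VBG, VBN, VBP, VBZ

print(nouns)
print(verbs)
```

The entity extraction code later in this notebook relies on the same `startswith('NN')` test.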

**Named entity extraction/chunking for tokens**

In [None]:
# Downloading nltk packages used in this example
nltk.download('maxent_ne_chunker')
nltk.download('words')

[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package maxent_ne_chunker is already up-to-date!
[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Package words is already up-to-date!


True

In [None]:
import nltk
nltk.download('averaged_perceptron_tagger')

jim = "Jim bought 300 shares of Acme Corp. in 2006."

tokens = nltk.word_tokenize(jim)
jim_tagged_tokens = nltk.pos_tag(tokens)

ne_chunks = nltk.chunk.ne_chunk(jim_tagged_tokens)

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


In [None]:
print(ne_chunks)

(S
  (PERSON Jim/NNP)
  bought/VBD
  300/CD
  shares/NNS
  of/IN
  (ORGANIZATION Acme/NNP Corp./NNP)
  in/IN
  2006/CD
  ./.)


In [None]:
ne_chunks = [nltk.chunk.ne_chunk(ptt) for ptt in pos_tagged_tokens]

ne_chunks[0].pprint()
ne_chunks[1].pprint()

(S
  (PERSON Mr./NNP)
  (PERSON Green/NNP)
  killed/VBD
  (ORGANIZATION Colonel/NNP Mustard/NNP)
  in/IN
  the/DT
  study/NN
  with/IN
  the/DT
  candlestick/NN
  ./.)
(S
  (PERSON Mr./NNP)
  (ORGANIZATION Green/NNP)
  is/VBZ
  not/RB
  a/DT
  very/RB
  nice/JJ
  fellow/NN
  ./.)


In [None]:
print(ne_chunks[0])

(S
  (PERSON Mr./NNP)
  (PERSON Green/NNP)
  killed/VBD
  (ORGANIZATION Colonel/NNP Mustard/NNP)
  in/IN
  the/DT
  study/NN
  with/IN
  the/DT
  candlestick/NN
  ./.)


In [None]:
print(ne_chunks[1])

(S
  (PERSON Mr./NNP)
  (ORGANIZATION Green/NNP)
  is/VBZ
  not/RB
  a/DT
  very/RB
  nice/JJ
  fellow/NN
  ./.)


## Using NLTK’s NLP tools to process human language in blog data

In [None]:
from google.colab import drive
drive.mount('/content/drive')



In [None]:
import json
import nltk

BLOG_DATA = "/content/drive/My Drive/app/Mining-the-Social-Web-3rd-Edition-master/notebooks/resources/ch06-webpages/feed.json"

#BLOG_DATA = "/content/drive/My Drive/feed.json"
#BLOG_DATA = "feed.json"

with open(BLOG_DATA) as f:
    blog_data = json.load(f)

# Download nltk packages used in this example
nltk.download('stopwords')

# Customize your list of stopwords as needed. Here, we add common
# punctuation and contraction artifacts.

stop_words = nltk.corpus.stopwords.words('english') + [
    '.',
    ',',
    '--',
    '\'s',
    '?',
    ')',
    '(',
    ':',
    '\'',
    '\'re',
    '"',
    '-',
    '}',
    '{',
    u'—',
    ']',
    '[',
    '...'
    ]

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [None]:
print('Number of posts in blog data: ' + str(len(blog_data)))
for post in blog_data:
    sentences = nltk.tokenize.sent_tokenize(post['content'])

    words = [w.lower() for sentence in sentences for w in
             nltk.tokenize.word_tokenize(sentence)]

    fdist = nltk.FreqDist(words)

    # Remove stopwords from fdist
    for sw in stop_words:
        del fdist[sw]
   
    # Basic stats

    num_words = sum(fdist.values())
    num_unique_words = len(fdist)

    # Hapaxes are words that appear only once
    num_hapaxes = len(fdist.hapaxes())

    top_10_words_sans_stop_words = fdist.most_common(10)


    print(post['title'])
    print('\tNum Sentences:'.ljust(25), len(sentences))
    print('\tNum Words:'.ljust(25), num_words)
    print('\tNum Unique Words:'.ljust(25), num_unique_words)
    print('\tNum Hapaxes:'.ljust(25), num_hapaxes)
    print('\tTop 10 Most Frequent Words (sans stop words):\n\t\t', \
          '\n\t\t'.join(['{0} ({1})'.format(w[0], w[1]) for w in top_10_words_sans_stop_words]))
    print()

Number of posts in blog data: 60
Four short links: 21 August 2017
	Num Sentences:           10
	Num Words:               140
	Num Unique Words:        113
	Num Hapaxes:             93
	Top 10 Most Frequent Words (sans stop words):
		 signals (5)
		cloud (4)
		application (3)
		drone (3)
		operations (2)
		machine (2)
		learning (2)
		radio (2)
		flying (2)
		cameras (2)

6 practical guidelines for implementing conversational AI
	Num Sentences:           69
	Num Words:               908
	Num Unique Words:        528
	Num Hapaxes:             354
	Top 10 Most Frequent Words (sans stop words):
		 ’ (21)
		“ (21)
		” (21)
		conversational (15)
		bots (7)
		says (7)
		interaction (7)
		must (7)
		user (7)
		kai (7)

Four short links: 18 August 2017
	Num Sentences:           16
	Num Words:               263
	Num Unique Words:        204
	Num Hapaxes:             173
	Top 10 Most Frequent Words (sans stop words):
		 hype (9)
		jobs (5)
		technologies (5)
		cycle (5)
		’ (5)
		bayesian (4)
		year

In [None]:
post = blog_data[18]

sentences = nltk.tokenize.sent_tokenize(post['content'])

words = [w.lower() for sentence in sentences for w in
         nltk.tokenize.word_tokenize(sentence)]

fdist = nltk.FreqDist(words)

# Remove stopwords from fdist
for sw in stop_words:
    del fdist[sw]
   
# Basic stats

num_words = sum(fdist.values())
num_unique_words = len(fdist)

# Hapaxes are words that appear only once
num_hapaxes = len(fdist.hapaxes())

top_10_words_sans_stop_words = fdist.most_common(10)


print(post['title'])
print('\tNum Sentences:'.ljust(25), len(sentences))
print('\tNum Words:'.ljust(25), num_words)
print('\tNum Unique Words:'.ljust(25), num_unique_words)
print('\tNum Hapaxes:'.ljust(25), num_hapaxes)
print('\tTop 10 Most Frequent Words (sans stop words):\n\t\t', \
      '\n\t\t'.join(['{0} ({1})'.format(w[0], w[1]) for w in top_10_words_sans_stop_words]))
print()

How to choose a cloud provider
	Num Sentences:           60
	Num Words:               514
	Num Unique Words:        310
	Num Hapaxes:             225
	Top 10 Most Frequent Words (sans stop words):
		 ’ (19)
		cloud (18)
		use (13)
		providers (12)
		provider (11)
		case (7)
		make (6)
		level (6)
		managed (6)
		services (6)
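
The same basic statistics can be cross-checked without NLTK, since `nltk.FreqDist` is a subclass of `collections.Counter`. The sketch below is a minimal illustration; `word_stats` and the sample sentence are hypothetical, and it assumes the text has already been split into word tokens.

```python
from collections import Counter

def word_stats(words, stop_words):
    """Compute word counts, unique words, hapaxes, and top words for a token list."""
    counts = Counter(w.lower() for w in words)

    # Drop stopwords; pop() with a default ignores words that never occurred.
    for sw in stop_words:
        counts.pop(sw, None)

    # Hapaxes are words that appear exactly once.
    hapaxes = [w for w, c in counts.items() if c == 1]

    return {
        'num_words': sum(counts.values()),
        'num_unique_words': len(counts),
        'num_hapaxes': len(hapaxes),
        'top_words': counts.most_common(2),
    }

sample = "the cloud is a cloud and the drone is a drone near the cloud radio".split()
stats = word_stats(sample, stop_words={'the', 'is', 'a', 'and'})
print(stats)
```

Note that the word count here, as in the cell above, is taken after stopwords are removed, so it is smaller than the raw token count.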



## A document summarization algorithm based principally upon sentence detection and frequency analysis within sentences

In [None]:
import json
import nltk
import numpy

#BLOG_DATA = "feed.json"
BLOG_DATA = "/content/drive/My Drive/app/Mining-the-Social-Web-3rd-Edition-master/notebooks/resources/ch06-webpages/feed.json"

with open(BLOG_DATA) as f:
    blog_data = json.load(f)

N = 100  # Number of most frequent (non-stop) words to treat as important
CLUSTER_THRESHOLD = 5  # Important words closer than this many tokens share a cluster
TOP_SENTENCES = 5  # Number of sentences to return for a "top n" summary

In [None]:
stop_words = nltk.corpus.stopwords.words('english') + [
    '.',
    ',',
    '--',
    '\'s',
    '?',
    ')',
    '(',
    ':',
    '\'',
    '\'re',
    '"',
    '-',
    '}',
    '{',
    u'—',
    '>',
    '<',
    '...'
    ]

In [None]:
# Approach taken from "The Automatic Creation of Literature Abstracts" by H.P. Luhn
def _score_sentences(sentences, important_words):
    scores = []
    sentence_idx = 0

    for s in [nltk.tokenize.word_tokenize(s) for s in sentences]:

        word_idx = []

        # For each word in the word list...
        for w in important_words:
            try:
                # Compute an index for where any important words occur in the sentence.
                word_idx.append(s.index(w))
            except ValueError: # w not in this particular sentence
                pass

        word_idx.sort()

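        # It is possible that some sentences may not contain any important words at all.
        # Skip them, but still advance sentence_idx so that the scores stay aligned
        # with the positions of the original sentences.
        if len(word_idx) == 0:
            sentence_idx += 1
            continue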

        # Using the word index, compute clusters by using a max distance threshold
        # for any two consecutive words.

        clusters = []
        cluster = [word_idx[0]]
        i = 1
        while i < len(word_idx):
            if word_idx[i] - word_idx[i - 1] < CLUSTER_THRESHOLD:
                cluster.append(word_idx[i])
            else:
                clusters.append(cluster[:])
                cluster = [word_idx[i]]
            i += 1
        clusters.append(cluster)

        # Score each cluster. The max score for any given cluster is the score 
        # for the sentence.

        max_cluster_score = 0
        
        for c in clusters:
            significant_words_in_cluster = len(c)
            # true clusters also contain insignificant words, so we get 
            # the total cluster length by checking the indices
            total_words_in_cluster = c[-1] - c[0] + 1
            score = 1.0 * significant_words_in_cluster**2 / total_words_in_cluster

            if score > max_cluster_score:
                max_cluster_score = score

        scores.append((sentence_idx, max_cluster_score))
        sentence_idx += 1

    return scores
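
To make the cluster score concrete, here is a small worked sketch of the same clustering and scoring steps applied to one list of word positions. `cluster_scores` is a hypothetical helper written for this illustration; it mirrors the logic above but is not a replacement for `_score_sentences`.

```python
def cluster_scores(word_idx, threshold=5):
    """Group sorted word positions into clusters and score each cluster."""
    clusters = []
    cluster = [word_idx[0]]
    for prev, cur in zip(word_idx, word_idx[1:]):
        if cur - prev < threshold:
            cluster.append(cur)       # close enough: same cluster
        else:
            clusters.append(cluster)  # gap too large: start a new cluster
            cluster = [cur]
    clusters.append(cluster)

    # Score = (number of important words)^2 / (span of the cluster in tokens)
    return [(c, len(c) ** 2 / (c[-1] - c[0] + 1)) for c in clusters]

# Important words found at token positions 0, 1, 4, and 10 of a sentence
scored = cluster_scores([0, 1, 4, 10])
sentence_score = max(score for _, score in scored)

print(scored)
print(sentence_score)
```

Positions 0, 1, and 4 form one cluster spanning five tokens, giving 3² / 5 = 1.8, while position 10 stands alone with a score of 1.0. The sentence takes the higher of the two.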

In [None]:
def summarize(txt):
    sentences = nltk.tokenize.sent_tokenize(txt)
    normalized_sentences = [s.lower() for s in sentences]

    words = [w.lower() for sentence in normalized_sentences for w in
             nltk.tokenize.word_tokenize(sentence)]

    fdist = nltk.FreqDist(words)
    
    # Remove stopwords from fdist
    for sw in stop_words:
        del fdist[sw]

    top_n_words = [w[0] for w in fdist.most_common(N)]

    scored_sentences = _score_sentences(normalized_sentences, top_n_words)

    # Summarization Approach 1:
    # Filter out nonsignificant sentences by using the average score plus a
    # fraction of the std dev as a filter

    avg = numpy.mean([s[1] for s in scored_sentences])
    std = numpy.std([s[1] for s in scored_sentences])
    mean_scored = [(sent_idx, score) for (sent_idx, score) in scored_sentences
                   if score > avg + 0.5 * std]

    # Summarization Approach 2:
    # Another approach would be to return only the top N ranked sentences

    top_n_scored = sorted(scored_sentences, key=lambda s: s[1])[-TOP_SENTENCES:]
    top_n_scored = sorted(top_n_scored, key=lambda s: s[0])

    # Decorate the post object with summaries

    return dict(top_n_summary=[sentences[idx] for (idx, score) in top_n_scored],
                mean_scored_summary=[sentences[idx] for (idx, score) in mean_scored])
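
The two selection strategies can behave differently on the same scores. The sketch below applies both to a made-up list of `(sentence_idx, score)` pairs; the scores and the `TOP` value are assumptions chosen for illustration. It uses `statistics.pstdev`, the population standard deviation, which is what `numpy.std` computes by default.

```python
import statistics

# Hypothetical (sentence_idx, score) pairs, in document order
scored = [(0, 1.0), (1, 4.0), (2, 2.0), (3, 0.5), (4, 3.0), (5, 1.5)]
TOP = 3

# Approach 1: keep sentences scoring above mean + 0.5 * standard deviation
scores = [score for _, score in scored]
cutoff = statistics.mean(scores) + 0.5 * statistics.pstdev(scores)
mean_scored = [(idx, score) for idx, score in scored if score > cutoff]

# Approach 2: keep the TOP highest-scoring sentences, then restore document order
top_n = sorted(scored, key=lambda pair: pair[1])[-TOP:]
top_n = sorted(top_n, key=lambda pair: pair[0])

print(mean_scored)
print(top_n)
```

The first approach adapts the summary length to how the scores are distributed, while the second always returns a fixed number of sentences.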

In [None]:
for post in blog_data: 
    post.update(summarize(post['content']))

    print(post['title'])
    print('=' * len(post['title']))
    print()
    print('Top N Summary')
    print('-------------')
    print(' '.join(post['top_n_summary']))
    print()
    print('Mean Scored Summary')
    print('-------------------')
    print(' '.join(post['mean_scored_summary']))
    print()

Four short links: 21 August 2017

Top N Summary
-------------
Cloud Operations, Machine Learning Radio, Flying Cameras, and Text Organization

Paracloud: Bringing Application Insight into Cloud Operations -- In this work, we propose a uniform Paracloud interface (PaCI) to enable a bi-directional communication channel between application containers and the cloud management substrate. An application knows how it's doing, which it reports through this interface so the cloud management layer can figure how/when to migrate, scale, load balance. (via A Paper A Day)

DARPA Wants Machine Learning for Radio Signals -- An RFMLS would be able to discern subtle differences in the RF signals among identical, mass-manufactured IoT devices and identify signals intended to spoof or hack into these devices. “We want to ... stand up an RF forensics capability to identify unique and peculiar signals amongst the proverbial cocktail party of signals out there,” Tilghman said. XPose: Reinventing User Intera

## Visualizing document summarization results with HTML output

In [None]:
import os
from IPython.display import IFrame, display

HTML_TEMPLATE = """<html>
    <head>
        <title>{0}</title>
        <meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>
    </head>
    <body>{1}</body>
</html>"""

for post in blog_data:
   
    # Uses previously defined summarize function.
    post.update(summarize(post['content']))

    # You could also store a version of the full post with key sentences marked up
    # for analysis with simple string replacement...

    for summary_type in ['top_n_summary', 'mean_scored_summary']:
        post[summary_type + '_marked_up'] = '<p>{0}</p>'.format(post['content'])
        
        for s in post[summary_type]:
            post[summary_type + '_marked_up'] = \
            post[summary_type + '_marked_up'].replace(s, '<strong>{0}</strong>'.format(s))

        filename = post['title'].replace("?", "") + '.summary.' + summary_type + '.html'
        
        f = open(os.path.join(filename), 'wb')
        html = HTML_TEMPLATE.format(post['title'] + ' Summary', post[summary_type + '_marked_up'])    
        f.write(html.encode('utf-8'))
        f.close()

        print("Data written to", f.name)

# Display any of these files with an inline frame. This displays the
# last file processed by using the last value of f.name...
print()
print("Displaying {0}:".format(f.name))
display(IFrame('files/{0}'.format(f.name), '100%', '600px'))

Data written to Four short links: 21 August 2017.summary.top_n_summary.html
Data written to Four short links: 21 August 2017.summary.mean_scored_summary.html
Data written to 6 practical guidelines for implementing conversational AI.summary.top_n_summary.html
Data written to 6 practical guidelines for implementing conversational AI.summary.mean_scored_summary.html
Data written to Four short links: 18 August 2017.summary.top_n_summary.html
Data written to Four short links: 18 August 2017.summary.mean_scored_summary.html
Data written to How Ray makes continuous learning accessible and easy to scale.summary.top_n_summary.html
Data written to How Ray makes continuous learning accessible and easy to scale.summary.mean_scored_summary.html
Data written to Julie Stanford on vetting designs through rapid experimentation.summary.top_n_summary.html
Data written to Julie Stanford on vetting designs through rapid experimentation.summary.mean_scored_summary.html
Data written to Jack Daniel on buildin
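
The string replacement above inserts the post text into the HTML template as-is. If the content might contain characters such as `<` or `&`, one option is to escape it before adding the markup. This is a minimal sketch using the standard library's `html.escape`; `mark_up` is a hypothetical helper, not part of the notebook's code.

```python
import html

def mark_up(text, key_sentences):
    """Escape text for HTML, then wrap each key sentence in <strong> tags."""
    marked = html.escape(text)
    for s in key_sentences:
        # Escape the sentence the same way so it still matches the escaped text.
        escaped = html.escape(s)
        marked = marked.replace(escaped, '<strong>{0}</strong>'.format(escaped))
    return '<p>{0}</p>'.format(marked)

result = mark_up("A < B. C is key.", ["C is key."])
print(result)
```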

## Extracting entities from a text with NLTK

In [None]:
import nltk
import json

#BLOG_DATA = "feed.json"
BLOG_DATA = "/content/drive/My Drive/app/Mining-the-Social-Web-3rd-Edition-master/notebooks/resources/ch06-webpages/feed.json"
#BLOG_DATA = "/content/drive/My Drive/app/feed.json"

with open(BLOG_DATA) as f:
    blog_data = json.load(f)

for post in blog_data:

    sentences = nltk.tokenize.sent_tokenize(post['content'])
    tokens = [nltk.tokenize.word_tokenize(s) for s in sentences]
    pos_tagged_tokens = [nltk.pos_tag(t) for t in tokens]

    # Flatten the list since we're not using sentence structure
    # and sentences are guaranteed to be separated by a special
    # POS tuple such as ('.', '.')

    pos_tagged_tokens = [token for sent in pos_tagged_tokens for token in sent]

    all_entity_chunks = []
    previous_pos = None
    current_entity_chunk = []
    for (token, pos) in pos_tagged_tokens:

        if pos == previous_pos and pos.startswith('NN'):
            current_entity_chunk.append(token)
        elif pos.startswith('NN'):
            
            if current_entity_chunk != []:
                
                # Note that current_entity_chunk could be a duplicate when appended,
                # so frequency analysis again becomes a consideration

                all_entity_chunks.append((' '.join(current_entity_chunk), pos))
            current_entity_chunk = [token]

        previous_pos = pos

    # Store the chunks as an index for the document
    # and account for frequency while we're at it...

    post['entities'] = {}
    for c in all_entity_chunks:
        post['entities'][c] = post['entities'].get(c, 0) + 1

    # For example, we could display just the title-cased entities

    print(post['title'])
    print('-' * len(post['title']))
    for (entity, pos) in post['entities']:
        if entity.istitle():
            print('\t{0} ({1})'.format(entity, post['entities'][(entity, pos)]))
    print()

Four short links: 21 August 2017
--------------------------------
	Cloud Operations (1)
	Machine Learning Radio (1)
	Cameras (1)
	Text Organization Paracloud (1)
	Bringing Application Insight (1)
	Cloud Operations (1)
	Paracloud (1)
	A Paper A Day (1)
	Radio Signals (1)
	” Tilghman (1)
	Reinventing User Interaction (1)
	Flying Cameras (1)
	Drone (1)
	Tree Sheets (1)
	Nice (1)
	Continue (1)

6 practical guidelines for implementing conversational AI
---------------------------------------------------------
	Apple (1)
	Siri (4)
	Jeff Bezos (1)
	Star Trek (1)
	Alexa (1)
	Joseph Weizenbaum (1)
	Decades (1)
	Andrew Leonard (1)
	Bots (1)
	Mozambique. ” (1)
	Today (1)
	Slack (2)
	Starbucks (1)
	Mastercard (1)
	Macy ’ (1)
	Gartner (1)
	Alexa (3)
	Cortana (2)
	Google Home (1)
	Skipflag (1)
	Use (1)
	Taco Bell ’ (1)
	Google Home (1)
	Organizations (1)
	Start (1)
	Amir Shevat (1)
	Beyond (1)
	Shevat (1)
	Others (1)
	Figure (1)
	Figure (1)
	Screenshots (1)
	Susan Etlinger (1)
	Chris Mullins (1)
	Mi
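
The chunking idea, joining runs of consecutive tokens that share the same noun tag, can be tried on a single hand-tagged sentence. The sketch below is a variant written for illustration rather than the loop used above: `noun_chunks` is a hypothetical helper that records each chunk with its own tag and also keeps a chunk that ends the token list. The tags are copied from the `ne_chunk` output for the "Jim" sentence shown earlier.

```python
def noun_chunks(tagged_tokens):
    """Join runs of consecutive tokens that share the same NN* tag."""
    chunks = []
    current, current_pos = [], None

    for token, pos in tagged_tokens:
        if pos.startswith('NN') and pos == current_pos:
            current.append(token)  # extend the current run
            continue
        if current:
            chunks.append((' '.join(current), current_pos))
        if pos.startswith('NN'):
            current, current_pos = [token], pos  # start a new run
        else:
            current, current_pos = [], None

    if current:  # keep a run that ends the token list
        chunks.append((' '.join(current), current_pos))
    return chunks

jim_tagged = [('Jim', 'NNP'), ('bought', 'VBD'), ('300', 'CD'),
              ('shares', 'NNS'), ('of', 'IN'), ('Acme', 'NNP'),
              ('Corp.', 'NNP'), ('in', 'IN'), ('2006', 'CD'), ('.', '.')]

chunks = noun_chunks(jim_tagged)
print(chunks)
```

Filtering the result with `str.istitle()`, as the cell above does, would then keep "Jim" and "Acme Corp." and drop "shares".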

## Discovering interactions between entities

In [None]:
import nltk
import json

BLOG_DATA = "feed.json"

def extract_interactions(txt):
    sentences = nltk.tokenize.sent_tokenize(txt)
    tokens = [nltk.tokenize.word_tokenize(s) for s in sentences]
    pos_tagged_tokens = [nltk.pos_tag(t) for t in tokens]

    entity_interactions = []
    for sentence in pos_tagged_tokens:

        all_entity_chunks = []
        previous_pos = None
        current_entity_chunk = []

        for (token, pos) in sentence:

            if pos == previous_pos and pos.startswith('NN'):
                current_entity_chunk.append(token)
            elif pos.startswith('NN'):
                if current_entity_chunk != []:
                    all_entity_chunks.append((' '.join(current_entity_chunk),
                            pos))
                current_entity_chunk = [token]

            previous_pos = pos

        if len(all_entity_chunks) > 1:
            entity_interactions.append(all_entity_chunks)
        else:
            entity_interactions.append([])

    assert len(entity_interactions) == len(sentences)

    return dict(entity_interactions=entity_interactions,
                sentences=sentences)

with open(BLOG_DATA) as f:
    blog_data = json.load(f)

# Display selected interactions on a per-sentence basis

for post in blog_data:

    post.update(extract_interactions(post['content']))

    print(post['title'])
    print('-' * len(post['title']))
    for interactions in post['entity_interactions']:
        print('; '.join([i[0] for i in interactions]))
    print()

Four short links: 24 April 2020
-------------------------------
Suddenly Remote Playbook —; kids; re
everyone; s; introduction; delights
taichi; language; high-performance computer
Python; compiler; tasks; CPUs
Cost; Training NLP Models —; cost; language; models; drivers
audience; engineers; scientists; experiments; non-practitioners; sense; economics; Natural Language Processing
Neutrality; Did; Pandemic Internet —; ’; evidence; networks; COVID-19
differences; performance; anything; deregulation
Netflix ’; decision; bandwidth usage
network; data; regulators


YouTube; example; video quality
game platform; Steam; game
EU ’; efforts; front; US


Four short links: 23 April 2020
-------------------------------
Moloch — Large; scale; source; packet capture
Instagram Photos —; source toolset; effect; Instagram ’
running; Colab; GPU; Cloud; pubsub/storage
glimpse; future; apps; services
xkcd; —; data; science
Spotify Doesn ’; t; Use; Spotify Model ”; Neither Should; Jeremiah Lee; work; Spoti
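
The chunking loop above records a chunk only when a later noun starts a new one, so the last noun chunk of each sentence never reaches the output, and the tag stored with a chunk is the tag of the noun that follows it. The sketch below is a hypothetical variant (the name `chunk_nouns` and the hand-tagged sample tokens are assumptions for illustration, not part of the notebook) that flushes the trailing chunk and stores each chunk's own tag. It works on pre-tagged tokens, so it runs without NLTK.

```python
def chunk_nouns(tagged_sentence):
    """Group runs of identically tagged noun tokens into (text, tag) chunks.

    tagged_sentence is a list of (token, pos) pairs, for example the
    output of nltk.pos_tag for one sentence.
    """
    chunks = []
    current, current_pos = [], None

    for token, pos in tagged_sentence:
        # Same noun tag as the open chunk: keep extending it
        if pos.startswith('NN') and pos == current_pos:
            current.append(token)
            continue

        # Anything else closes the open chunk, if there is one
        if current:
            chunks.append((' '.join(current), current_pos))

        # A noun opens a new chunk; a non-noun leaves no chunk open
        if pos.startswith('NN'):
            current, current_pos = [token], pos
        else:
            current, current_pos = [], None

    # Flush the chunk that is still open at the end of the sentence
    if current:
        chunks.append((' '.join(current), current_pos))

    return chunks


# Hand-tagged sample tokens (illustrative only, not tagger output)
tagged = [('Netflix', 'NNP'), ('reduced', 'VBD'), ('bandwidth', 'NN'),
          ('usage', 'NN'), ('in', 'IN'), ('Europe', 'NNP')]

print('; '.join(text for text, _ in chunk_nouns(tagged)))
```

Swapping this variant into `extract_interactions` would change the per-sentence listings shown above, because trailing chunks such as the final noun of a sentence would then appear.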

## Visualizing interactions between entities with HTML output
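
The cell in this section inserts raw sentence text into an HTML template with chained `str.replace` calls and reuses post titles as filenames. Two things can go wrong with that approach: characters such as `&` or `<` in a sentence are not escaped, and titles containing `:` or `/` are not valid filenames on every platform. The following is a minimal sketch of how both could be handled, assuming two hypothetical helpers, `mark_up_sentence` and `safe_filename`, that are not part of the notebook. It uses only the standard library.

```python
import html
import re


def mark_up_sentence(sentence, terms):
    """Return sentence as escaped HTML with each term wrapped in <strong>."""
    terms = [t for t in set(terms) if t]
    if not terms:
        return html.escape(sentence)

    # Try longer terms first so "bandwidth usage" wins over "usage"
    ordered = sorted(terms, key=len, reverse=True)
    pattern = re.compile('|'.join(re.escape(t) for t in ordered))

    parts, last = [], 0
    for match in pattern.finditer(sentence):
        # Escape the text between matches, then wrap the matched term
        parts.append(html.escape(sentence[last:match.start()]))
        parts.append('<strong>{0}</strong>'.format(html.escape(match.group(0))))
        last = match.end()
    parts.append(html.escape(sentence[last:]))

    return ''.join(parts)


def safe_filename(title, suffix='.entity_interactions.html'):
    """Replace characters that are unsafe in filenames on common platforms."""
    return re.sub(r'[\\/:*?"<>|]', '_', title).strip() + suffix


print(mark_up_sentence('AT&T cut bandwidth usage', ['bandwidth usage', 'usage']))
print(safe_filename('Four short links: 24 April 2020'))
```

A single regular-expression pass also avoids wrapping a term twice when one entity is a substring of another, which repeated `str.replace` calls can do.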

In [None]:
import os
import json
import nltk
from IPython.display import IFrame, display

BLOG_DATA = "feed.json"

HTML_TEMPLATE = """<html>
    <head>
        <title>{0}</title>
        <meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>
    </head>
    <body>{1}</body>
</html>"""

# Use a context manager so the file handle is closed after reading
with open(BLOG_DATA) as f:
    blog_data = json.load(f)

for post in blog_data:

    post.update(extract_interactions(post['content']))

    # Display output as markup with entities presented in bold text

    post['markup'] = []

    # Pair each sentence with the entity chunks extracted from it
    for s, interactions in zip(post['sentences'], post['entity_interactions']):

        for (term, _) in interactions:
            s = s.replace(term, '<strong>{0}</strong>'.format(term))

        post['markup'].append(s)
            
    filename = post['title'].replace("?", "") + '.entity_interactions.html'
    html = HTML_TEMPLATE.format(post['title'] + ' Interactions', ' '.join(post['markup']))

    # The context manager closes the file even if the write fails
    with open(filename, 'wb') as f:
        f.write(html.encode('utf-8'))

    print('Data written to', filename)

    # Display each file in an inline frame as soon as it has been written.
    # This runs inside the loop, so every post gets its own frame.

    print('Displaying {0}:'.format(filename))
    display(IFrame('files/{0}'.format(filename), '100%', '600px'))

Data written to Four short links: 24 April 2020.entity_interactions.html
Displaying Four short links: 24 April 2020.entity_interactions.html:


Data written to Four short links: 23 April 2020.entity_interactions.html
Displaying Four short links: 23 April 2020.entity_interactions.html:


Data written to How data privacy leader Apple found itself in a data ethics catastrophe.entity_interactions.html
Displaying How data privacy leader Apple found itself in a data ethics catastrophe.entity_interactions.html:


Data written to Four short links: 22 April 2020.entity_interactions.html
Displaying Four short links: 22 April 2020.entity_interactions.html:


Data written to Four short links: 21 April 2020.entity_interactions.html
Displaying Four short links: 21 April 2020.entity_interactions.html:


Data written to Four short links: 20 April 2020.entity_interactions.html
Displaying Four short links: 20 April 2020.entity_interactions.html:


Data written to Four short links: 17 April 2020.entity_interactions.html
Displaying Four short links: 17 April 2020.entity_interactions.html:


Data written to Four short links: 16 April 2020.entity_interactions.html
Displaying Four short links: 16 April 2020.entity_interactions.html:


Data written to Four short links: 15 April 2020.entity_interactions.html
Displaying Four short links: 15 April 2020.entity_interactions.html:


Data written to Four short links: 14 April 2020.entity_interactions.html
Displaying Four short links: 14 April 2020.entity_interactions.html:


Data written to Radar trends to watch: April 2020.entity_interactions.html
Displaying Radar trends to watch: April 2020.entity_interactions.html:


Data written to Four short links: 13 April 2020.entity_interactions.html
Displaying Four short links: 13 April 2020.entity_interactions.html:


Data written to Four short links: 10 April 2020.entity_interactions.html
Displaying Four short links: 10 April 2020.entity_interactions.html:


Data written to Four short links: 9 April 2020.entity_interactions.html
Displaying Four short links: 9 April 2020.entity_interactions.html:


Data written to Four short links: 8 April 2020.entity_interactions.html
Displaying Four short links: 8 April 2020.entity_interactions.html:


Data written to Four short links: 7 April 2020.entity_interactions.html
Displaying Four short links: 7 April 2020.entity_interactions.html:


Data written to Governance and Discovery.entity_interactions.html
Displaying Governance and Discovery.entity_interactions.html:


Data written to Four short links: 6 April 2020.entity_interactions.html
Displaying Four short links: 6 April 2020.entity_interactions.html:


Data written to Four short links: 3 April 2020.entity_interactions.html
Displaying Four short links: 3 April 2020.entity_interactions.html:


Data written to Four short links: 2 April 2020.entity_interactions.html
Displaying Four short links: 2 April 2020.entity_interactions.html:


Data written to Four short links: 1 April 2020.entity_interactions.html
Displaying Four short links: 1 April 2020.entity_interactions.html:


Data written to Four short links: 31 March 2020.entity_interactions.html
Displaying Four short links: 31 March 2020.entity_interactions.html:


Data written to What you need to know about product management for AI.entity_interactions.html
Displaying What you need to know about product management for AI.entity_interactions.html:


Data written to The unreasonable importance of data preparation.entity_interactions.html
Displaying The unreasonable importance of data preparation.entity_interactions.html:


Data written to Four short links: 24 March 2020.entity_interactions.html
Displaying Four short links: 24 March 2020.entity_interactions.html:


Data written to 3 ways to confront modern business challenges.entity_interactions.html
Displaying 3 ways to confront modern business challenges.entity_interactions.html:


Data written to An enterprise vision is your company’s North Star.entity_interactions.html
Displaying An enterprise vision is your company’s North Star.entity_interactions.html:


Data written to Leaders need to mobilize change-ready workforces.entity_interactions.html
Displaying Leaders need to mobilize change-ready workforces.entity_interactions.html:


Data written to Great leaders inspire innovation and creativity from within their workforces.entity_interactions.html
Displaying Great leaders inspire innovation and creativity from within their workforces.entity_interactions.html:


Data written to Strong leaders forge an intersection of knowledge and experience.entity_interactions.html
Displaying Strong leaders forge an intersection of knowledge and experience.entity_interactions.html:


Data written to Four short links: 23 March 2020.entity_interactions.html
Displaying Four short links: 23 March 2020.entity_interactions.html:


Data written to Four short links: 20 March 2020.entity_interactions.html
Displaying Four short links: 20 March 2020.entity_interactions.html:


Data written to 6 trends framing the state of AI and ML.entity_interactions.html
Displaying 6 trends framing the state of AI and ML.entity_interactions.html:


Data written to Four short links: 19 March 2020.entity_interactions.html
Displaying Four short links: 19 March 2020.entity_interactions.html:


Data written to It’s an unprecedented crisis: 8 things to do right now.entity_interactions.html
Displaying It’s an unprecedented crisis: 8 things to do right now.entity_interactions.html:


Data written to AI adoption in the enterprise 2020.entity_interactions.html
Displaying AI adoption in the enterprise 2020.entity_interactions.html:


Data written to Four short links: 18 March 2020.entity_interactions.html
Displaying Four short links: 18 March 2020.entity_interactions.html:


Data written to Four short links: 17 March 2020.entity_interactions.html
Displaying Four short links: 17 March 2020.entity_interactions.html:


Data written to Four short links: 16 March 2020.entity_interactions.html
Displaying Four short links: 16 March 2020.entity_interactions.html:


Data written to Four short links: 13 March 2020.entity_interactions.html
Displaying Four short links: 13 March 2020.entity_interactions.html:


Data written to Four short links: 12 March 2020.entity_interactions.html
Displaying Four short links: 12 March 2020.entity_interactions.html:


Data written to Four short links: 11 March 2020.entity_interactions.html
Displaying Four short links: 11 March 2020.entity_interactions.html:


Data written to Four short links: 10 March 2020.entity_interactions.html
Displaying Four short links: 10 March 2020.entity_interactions.html:


Data written to Four short links: 9 March 2020.entity_interactions.html
Displaying Four short links: 9 March 2020.entity_interactions.html:


Data written to Four short links: 6 March 2020.entity_interactions.html
Displaying Four short links: 6 March 2020.entity_interactions.html:


Data written to Radar trends to watch: March 2020.entity_interactions.html
Displaying Radar trends to watch: March 2020.entity_interactions.html:


Data written to Four short links: 5 March 2020.entity_interactions.html
Displaying Four short links: 5 March 2020.entity_interactions.html:


Data written to Remembering Freeman Dyson.entity_interactions.html
Displaying Remembering Freeman Dyson.entity_interactions.html:


Data written to Four short links: 4 March 2020.entity_interactions.html
Displaying Four short links: 4 March 2020.entity_interactions.html:


Data written to Four short links: 3 March 2020.entity_interactions.html
Displaying Four short links: 3 March 2020.entity_interactions.html:


Data written to The death of Agile.entity_interactions.html
Displaying The death of Agile.entity_interactions.html:


Data written to Four short links: 2 March 2020.entity_interactions.html
Displaying Four short links: 2 March 2020.entity_interactions.html:


Data written to Four short links: 28 February 2020.entity_interactions.html
Displaying Four short links: 28 February 2020.entity_interactions.html:


Data written to Four short links: 27 February 2020.entity_interactions.html
Displaying Four short links: 27 February 2020.entity_interactions.html:


Data written to Intellectual control.entity_interactions.html
Displaying Intellectual control.entity_interactions.html:


Data written to Where do great architectures come from.entity_interactions.html
Displaying Where do great architectures come from.entity_interactions.html:


Data written to Architecture.Next: Invalidating old axioms.entity_interactions.html
Displaying Architecture.Next: Invalidating old axioms.entity_interactions.html:


Data written to Highlights from the O’Reilly Software Architecture Conference in New York 2020.entity_interactions.html
Displaying Highlights from the O’Reilly Software Architecture Conference in New York 2020.entity_interactions.html:


Data written to Four short links: 26 February 2020.entity_interactions.html
Displaying Four short links: 26 February 2020.entity_interactions.html:


Data written to The elephant in the architecture.entity_interactions.html
Displaying The elephant in the architecture.entity_interactions.html:
