# Mining the Social Web, 2nd Edition

## Chapter 5: . Mining Web Pages: Using Natural Language Processing to Understand Human Language, Summarize Blog Posts, and More

This IPython Notebook provides an interactive way to follow along with and explore the numbered examples from [_Mining the Social Web (2nd Edition)_](http://bit.ly/135dHfs). The intent behind this notebook is to reinforce the concepts from the sample code in a fun, convenient, and effective way. This notebook assumes that you are reading along with the book and have the context of the discussion as you work through these exercises.

## Fixes & improvements

Reviewed by [santteegt](https://santteegt.github.io)

This notebook has been fully reviewed and partially fixed on May 2017 in order to make it work with the current version of libraries and APIs.

## Copyright and Licensing

You are free to use or adapt this notebook for any purpose you'd like. However, please respect the [Simplified BSD License](https://github.com/ptwobrussell/Mining-the-Social-Web-2nd-Edition/blob/master/LICENSE.txt) that governs its use.

Note: If you find yourself wanting to copy output files from this notebook back to your host environment, see the bottom of this notebook for one possible way to do it.

## Installation Requirements

You can install the following libraries directly from the anaconda navigator:
    
* boilerpipe
* feedparser
* numpy

## Example 1. Using boilerpipe to extract the text from a web page

In [31]:
from boilerpipe.extract import Extractor

URL='http://radar.oreilly.com/2010/07/louvre-industrial-age-henry-ford.html'

extractor = Extractor(extractor='ArticleExtractor', url=URL)

print extractor.getText()

Listen
The Louvre of the Industrial Age
The Henry Ford is one of the world's great museums, and the world it chronicles is our own.
by Tim O'Reilly | @timoreilly | +Tim O'Reilly | July 30, 2010
This morning I had the chance to get a tour of The Henry Ford Museum in Dearborn, MI, along with Dale Dougherty, creator of Make: and Makerfaire, and Marc Greuther, the chief curator of the museum.  I had expected a museum dedicated to the auto industry, but it’s so much more than that.  As I wrote in my first stunned tweet, “it’s the Louvre of the Industrial Age.”
When we first entered, Marc took us to what he said may be his favorite artifact in the museum, a block of concrete that contains Luther Burbank’s shovel, and Thomas Edison’s signature and footprints.  Luther Burbank was, of course, the great agricultural inventor who created such treasures as the nectarine and the Santa Rosa plum. Ford was a farm boy who became an industrialist; Thomas Edison was his friend and mentor. The museum, op

## Example 2. Using feedparser to extract the text (and other fields) from an RSS or Atom feed

In [32]:
import feedparser

FEED_URL='http://feeds.feedburner.com/oreilly/radar/atom'

fp = feedparser.parse(FEED_URL)

for e in fp.entries:
    print e.title
    print e.links[0].href
    print e.content[0].value

Four short links: 30 May 2017
http://feedproxy.google.com/~r/oreilly/radar/atom/~3/GPwtxejeu2I/four-short-links-30-may-2017
<p><em>World Problems, Story AI, Medical Security Horrors, and OSS Fuzz Winning</em></p><ol>
<li>
<a href="https://80000hours.org/career-guide/world-problems/">The World's Biggest Problems</a> -- data for you to consider when you choose to work on Stuff That Matters.</li>
<li>
<a href="http://dspace.mit.edu/handle/1721.1/67693">The Strong Story Hypothesis and the Directed Perception Hypothesis</a> -- <i>I ask why humans are smarter than other primates, and I hypothesize that an important part of the answer lies in what I call the Strong Story Hypothesis, which holds that storytelling and understanding have a central role in human intelligence. Next, I introduce another hypothesis, the Driven Perception Hypothesis, which holds that we derive much of our common sense, including the common sense required in story understanding, by deploying our perceptual apparatus o

## Example 3. Pseudocode for a breadth-first search

In [None]:
Create an empty graph
Create an empty queue to keep track of nodes that need to be processed

Add the starting point to the graph as the root node
Add the root node to a queue for processing

Repeat until some maximum depth is reached or the queue is empty:
  Remove a node from the queue 
  For each of the node's neighbors: 
    If the neighbor hasn't already been processed: 
      Add it to the queue 
      Add it to the graph 
      Create an edge in the graph that connects the node and its neighbor

**Naive sentence detection based on periods**

In [33]:
txt = "Mr. Green killed Colonel Mustard in the study with the candlestick. Mr. Green is not a very nice fellow."
print txt.split(".")

['Mr', ' Green killed Colonel Mustard in the study with the candlestick', ' Mr', ' Green is not a very nice fellow', '']


**More sophisticated sentence detection**

In [34]:
import nltk

# Downloading nltk packages used in this example
nltk.download('punkt')

sentences = nltk.tokenize.sent_tokenize(txt)
print sentences

[nltk_data] Downloading package punkt to /Users/santteegt/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
['Mr. Green killed Colonel Mustard in the study with the candlestick.', 'Mr. Green is not a very nice fellow.']


**Tokenization of sentences**

In [35]:
tokens = [nltk.tokenize.word_tokenize(s) for s in sentences]
print tokens

[['Mr.', 'Green', 'killed', 'Colonel', 'Mustard', 'in', 'the', 'study', 'with', 'the', 'candlestick', '.'], ['Mr.', 'Green', 'is', 'not', 'a', 'very', 'nice', 'fellow', '.']]


**Part of speech tagging for tokens**

In [8]:
# Downloading nltk packages used in this example
nltk.download('maxent_treebank_pos_tagger')
nltk.download('averaged_perceptron_tagger')

pos_tagged_tokens = [nltk.pos_tag(t) for t in tokens]
print pos_tagged_tokens

[nltk_data] Downloading package maxent_treebank_pos_tagger to
[nltk_data]     /Users/santteegt/nltk_data...
[nltk_data]   Package maxent_treebank_pos_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/santteegt/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[[('Mr.', 'NNP'), ('Green', 'NNP'), ('killed', 'VBD'), ('Colonel', 'NNP'), ('Mustard', 'NNP'), ('in', 'IN'), ('the', 'DT'), ('study', 'NN'), ('with', 'IN'), ('the', 'DT'), ('candlestick', 'NN'), ('.', '.')], [('Mr.', 'NNP'), ('Green', 'NNP'), ('is', 'VBZ'), ('not', 'RB'), ('a', 'DT'), ('very', 'RB'), ('nice', 'JJ'), ('fellow', 'NN'), ('.', '.')]]


**Named entity extraction/chunking for tokens**

In [36]:
# Downloading nltk packages used in this example
nltk.download('maxent_ne_chunker')
nltk.download('words')

# ne_chunks = nltk.batch_ne_chunk(pos_tagged_tokens)
ne_chunks = nltk.ne_chunk_sents(pos_tagged_tokens)

for chunk in ne_chunks:
    print chunk
# print ne_chunks
# print ne_chunks[0].pprint() # You can prettyprint each chunk in the tree

[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /Users/santteegt/nltk_data...
[nltk_data]   Package maxent_ne_chunker is already up-to-date!
[nltk_data] Downloading package words to /Users/santteegt/nltk_data...
[nltk_data]   Package words is already up-to-date!
(S
  (PERSON Mr./NNP)
  (PERSON Green/NNP)
  killed/VBD
  (ORGANIZATION Colonel/NNP Mustard/NNP)
  in/IN
  the/DT
  study/NN
  with/IN
  the/DT
  candlestick/NN
  ./.)
(S
  (PERSON Mr./NNP)
  (ORGANIZATION Green/NNP)
  is/VBZ
  not/RB
  a/DT
  very/RB
  nice/JJ
  fellow/NN
  ./.)


## Example 4. Harvesting blog data by parsing feeds

In [37]:
import os
import sys
import json
import feedparser
# from BeautifulSoup import BeautifulStoneSoup
from bs4 import BeautifulSoup
from nltk import clean_html

FEED_URL = 'http://feeds.feedburner.com/oreilly/radar/atom'

def cleanHtml(html):
    if html == "": return ""

    soup = BeautifulSoup(html, 'lxml').get_text()
  
    return soup
#     return BeautifulStoneSoup(clean_html(html),
#                 convertEntities=BeautifulStoneSoup.HTML_ENTITIES).contents[0]

fp = feedparser.parse(FEED_URL)

print "Fetched %s entries from '%s'" % (len(fp.entries[0].title), fp.feed.title)

blog_posts = []
for e in fp.entries:
    blog_posts.append({'title': e.title, 'content'
                      : cleanHtml(e.content[0].value), 'link': e.links[0].href})

out_file = os.path.join('resources', 'ch05-webpages', 'feed.json')
f = open(out_file, 'w')
f.write(json.dumps(blog_posts, indent=1))
f.close()

print 'Wrote output file to %s' % (f.name, )

Fetched 29 entries from 'All - O'Reilly Media'
Wrote output file to resources/ch05-webpages/feed.json


## Example 5. Using NLTK’s NLP tools to process human language in blog data

In [38]:
import json
import nltk

# Download nltk packages used in this example
nltk.download('stopwords')

BLOG_DATA = "resources/ch05-webpages/feed.json"

blog_data = json.loads(open(BLOG_DATA).read())

# Customize your list of stopwords as needed. Here, we add common
# punctuation and contraction artifacts.

stop_words = nltk.corpus.stopwords.words('english') + [
    '.',
    ',',
    '--',
    '\'s',
    '?',
    ')',
    '(',
    ':',
    '\'',
    '\'re',
    '"',
    '-',
    '}',
    '{',
    u'—',
    ]

print 'English stop words:'
print stop_words

for post in blog_data:
    sentences = nltk.tokenize.sent_tokenize(post['content'])

    words = [w.lower() for sentence in sentences for w in
             nltk.tokenize.word_tokenize(sentence)]

    fdist = nltk.FreqDist(words)

    # Basic stats

    num_words = sum([i[1] for i in fdist.items()])
    num_unique_words = len(fdist.keys())

    # Hapaxes are words that appear only once

    num_hapaxes = len(fdist.hapaxes())

    top_10_words_sans_stop_words = sorted([w for w in fdist.items() if w[0]
                                    not in stop_words], key=lambda x: x[1], reverse=True)[:10]

    print post['title']
    print '\tNum Sentences:'.ljust(25), len(sentences)
    print '\tNum Words:'.ljust(25), num_words
    print '\tNum Unique Words:'.ljust(25), num_unique_words
    print '\tNum Hapaxes:'.ljust(25), num_hapaxes
    print '\tTop 10 Most Frequent Words (sans stop words):\n\t\t', \
            '\n\t\t'.join(['%s (%s)'
            % (w[0], w[1]) for w in top_10_words_sans_stop_words])
    print

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/santteegt/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
English stop words:
[u'i', u'me', u'my', u'myself', u'we', u'our', u'ours', u'ourselves', u'you', u'your', u'yours', u'yourself', u'yourselves', u'he', u'him', u'his', u'himself', u'she', u'her', u'hers', u'herself', u'it', u'its', u'itself', u'they', u'them', u'their', u'theirs', u'themselves', u'what', u'which', u'who', u'whom', u'this', u'that', u'these', u'those', u'am', u'is', u'are', u'was', u'were', u'be', u'been', u'being', u'have', u'has', u'had', u'having', u'do', u'does', u'did', u'doing', u'a', u'an', u'the', u'and', u'but', u'if', u'or', u'because', u'as', u'until', u'while', u'of', u'at', u'by', u'for', u'with', u'about', u'against', u'between', u'into', u'through', u'during', u'before', u'after', u'above', u'below', u'to', u'from', u'up', u'down', u'in', u'out', u'on', u'off', u'over', u'under', u'again', u'further', u'then', u

## Example 6. A document summarization algorithm based principally upon sentence detection and frequency analysis within sentences

In [39]:
import json
import nltk
import numpy

BLOG_DATA = "resources/ch05-webpages/feed.json"

N = 100  # Number of words to consider
CLUSTER_THRESHOLD = 5  # Distance between words to consider
TOP_SENTENCES = 5  # Number of sentences to return for a "top n" summary

# Approach taken from "The Automatic Creation of Literature Abstracts" by H.P. Luhn

def _score_sentences(sentences, important_words):
    scores = []
    sentence_idx = -1

    for s in [nltk.tokenize.word_tokenize(s) for s in sentences]:

        sentence_idx += 1
        word_idx = []

        # For each word in the word list...
        for w in important_words:
            try:
                # Compute an index for where any important words occur in the sentence.

                word_idx.append(s.index(w))
            except ValueError, e: # w not in this particular sentence
                pass

        word_idx.sort()

        # It is possible that some sentences may not contain any important words at all.
        if len(word_idx)== 0: continue

        # Using the word index, compute clusters by using a max distance threshold
        # for any two consecutive words.

        clusters = []
        cluster = [word_idx[0]]
        i = 1
        while i < len(word_idx):
            if word_idx[i] - word_idx[i - 1] < CLUSTER_THRESHOLD:
                cluster.append(word_idx[i])
            else:
                clusters.append(cluster[:])
                cluster = [word_idx[i]]
            i += 1
        clusters.append(cluster)

        # Score each cluster. The max score for any given cluster is the score 
        # for the sentence.

        max_cluster_score = 0
        for c in clusters:
            significant_words_in_cluster = len(c)
            total_words_in_cluster = c[-1] - c[0] + 1
            score = 1.0 * significant_words_in_cluster \
                * significant_words_in_cluster / total_words_in_cluster

            if score > max_cluster_score:
                max_cluster_score = score

#         scores.append((sentence_idx, score))
        scores.append((sentence_idx, max_cluster_score))

    return scores

def summarize(txt):
    sentences = [s for s in nltk.tokenize.sent_tokenize(txt)]
    normalized_sentences = [s.lower() for s in sentences]

    words = [w.lower() for sentence in normalized_sentences for w in
             nltk.tokenize.word_tokenize(sentence)]

    fdist = nltk.FreqDist(words)

    top_n_words = [w[0] for w in sorted(fdist.items(), key=lambda item: item[1], reverse=True) # changed to consider order
            if w[0] not in nltk.corpus.stopwords.words('english')][:N]

    scored_sentences = _score_sentences(normalized_sentences, top_n_words)

    # Summarization Approach 1:
    # Filter out nonsignificant sentences by using the average score plus a
    # fraction of the std dev as a filter

    avg = numpy.mean([s[1] for s in scored_sentences])
    std = numpy.std([s[1] for s in scored_sentences])
    mean_scored = [(sent_idx, score) for (sent_idx, score) in scored_sentences
                   if score > avg + 0.5 * std]

    # Summarization Approach 2:
    # Another approach would be to return only the top N ranked sentences

    top_n_scored = sorted(scored_sentences, key=lambda s: s[1])[-TOP_SENTENCES:]
    top_n_scored = sorted(top_n_scored, key=lambda s: s[0])

    # Decorate the post object with summaries

    return dict(top_n_summary=[sentences[idx] for (idx, score) in top_n_scored],
                mean_scored_summary=[sentences[idx] for (idx, score) in mean_scored])

blog_data = json.loads(open(BLOG_DATA).read())

for post in blog_data:
       
    post.update(summarize(post['content']))

    print post['title']
    print '=' * len(post['title'])
    print
    print 'Top N Summary'
    print '-------------'
    print ' '.join(post['top_n_summary'])
    print
    print 'Mean Scored Summary'
    print '-------------------'
    print ' '.join(post['mean_scored_summary'])
    print

Four short links: 30 May 2017

Top N Summary
-------------
World Problems, Story AI, Medical Security Horrors, and OSS Fuzz Winning

The World's Biggest Problems -- data for you to consider when you choose to work on Stuff That Matters. The Strong Story Hypothesis and the Directed Perception Hypothesis -- I ask why humans are smarter than other primates, and I hypothesize that an important part of the answer lies in what I call the Strong Story Hypothesis, which holds that storytelling and understanding have a central role in human intelligence. Medical Implants and Hospital Systems are Still Infosec Dumpster Fires (Cory Doctorow) -- has pointers to two writeups of the horrors in various medical systems. OSS Fuzz Improving Open Source -- Google's open source fuzzer has found numerous security vulnerabilities in several critical open source projects: 10 in FreeType2, 17 in FFmpeg, 33 in LibreOffice, 8 in SQLite 3, 10 in GnuTLS, 25 in PCRE2, 9 in gRPC, and 7 in Wireshark. Their criteria 

## Example 7. Visualizing document summarization results with HTML output

In [30]:
import os
import json
import nltk
import numpy
from IPython.display import IFrame
from IPython.core.display import display

BLOG_DATA = "resources/ch05-webpages/feed.json"

HTML_TEMPLATE = """<html>
    <head>
        <title>%s</title>
        <meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>
    </head>
    <body>%s</body>
</html>"""

blog_data = json.loads(open(BLOG_DATA).read())

for post in blog_data:
   
    # Uses previously defined summarize function.
    post.update(summarize(post['content']))

    # You could also store a version of the full post with key sentences marked up
    # for analysis with simple string replacement...

    for summary_type in ['top_n_summary', 'mean_scored_summary']:
        post[summary_type + '_marked_up'] = '<p>%s</p>' % (post['content'], )
        for s in post[summary_type]:
            post[summary_type + '_marked_up'] = \
            post[summary_type + '_marked_up'].replace(s, '<strong>%s</strong>' % (s, ))

        filename = post['title'].replace("?", "") + '.summary.' + summary_type + '.html'
        f = open(os.path.join('resources', 'ch05-webpages', filename), 'w')
        html = HTML_TEMPLATE % (post['title'] + \
          ' Summary', post[summary_type + '_marked_up'],)
              
        f.write(html.encode('utf-8'))
        f.close()

        print "Data written to", f.name

# Display any of these files with an inline frame. This displays the
# last file processed by using the last value of f.name...

print "Displaying %s:" % f.name
display(IFrame('files/%s' % f.name, '100%', '600px'))

Data written to resources/ch05-webpages/Four short links: 30 May 2017.summary.top_n_summary.html
Data written to resources/ch05-webpages/Four short links: 30 May 2017.summary.mean_scored_summary.html
Data written to resources/ch05-webpages/Jupyter Insights: Carol Willing, director of the Python Software Foundation.summary.top_n_summary.html
Data written to resources/ch05-webpages/Jupyter Insights: Carol Willing, director of the Python Software Foundation.summary.mean_scored_summary.html
Data written to resources/ch05-webpages/Jupyter Digest: 30 May 2017.summary.top_n_summary.html
Data written to resources/ch05-webpages/Jupyter Digest: 30 May 2017.summary.mean_scored_summary.html
Data written to resources/ch05-webpages/What is the OSI model.summary.top_n_summary.html
Data written to resources/ch05-webpages/What is the OSI model.summary.mean_scored_summary.html
Data written to resources/ch05-webpages/How do I configure a Cisco router for secure remote access using SSH.summary.top_n_summa

## Example 8. Extracting entities from a text with NLTK

In [1]:
import nltk
import json

BLOG_DATA = "resources/ch05-webpages/feed.json"

blog_data = json.loads(open(BLOG_DATA).read())

for post in blog_data:

    sentences = nltk.tokenize.sent_tokenize(post['content'])
    tokens = [nltk.tokenize.word_tokenize(s) for s in sentences]
    pos_tagged_tokens = [nltk.pos_tag(t) for t in tokens]

    # Flatten the list since we're not using sentence structure
    # and sentences are guaranteed to be separated by a special
    # POS tuple such as ('.', '.')

    pos_tagged_tokens = [token for sent in pos_tagged_tokens for token in sent]

    all_entity_chunks = []
    previous_pos = None
    current_entity_chunk = []
    for (token, pos) in pos_tagged_tokens:

        if pos == previous_pos and pos.startswith('NN'):
            current_entity_chunk.append(token)
        elif pos.startswith('NN'):
            if current_entity_chunk != []:

                # Note that current_entity_chunk could be a duplicate when appended,
                # so frequency analysis again becomes a consideration

                all_entity_chunks.append((' '.join(current_entity_chunk), pos))
            current_entity_chunk = [token]

        previous_pos = pos

    # Store the chunks as an index for the document
    # and account for frequency while we're at it...

    post['entities'] = {}
    for c in all_entity_chunks:
        post['entities'][c] = post['entities'].get(c, 0) + 1

    # For example, we could display just the title-cased entities

    print post['title']
    print '-' * len(post['title'])
    proper_nouns = []
    for (entity, pos) in post['entities']:
        if entity.istitle():
            print '\t%s (%s)' % (entity, post['entities'][(entity, pos)])
    print

Four short links: 30 May 2017
-----------------------------
	Google (1)
	World Problems (1)
	Biggest (1)
	Paper (1)
	Strong Story Hypothesis (1)
	Driven Perception Hypothesis (1)
	Problems (1)
	World (1)
	Implants (1)
	Continue (1)
	Matters (1)
	Google (1)
	Stuff (1)
	Cory (1)
	Medical Security Horrors (1)
	Whitescope (1)
	Cory Doctorow (1)
	Genesis (1)
	Infosec Dumpster Fires (1)
	Directed Perception Hypothesis (1)
	Systems (1)
	Strong Story Hypothesis (1)
	Wireshark (1)
	Hospital (1)

Jupyter Digest: 30 May 2017
---------------------------
	Transportation Insights (1)
	O'Reilly (1)
	Jupyter (1)
	Machine Learning (1)
	Justin Tyberg (1)
	Continue (1)
	Jupyter Notebooks (1)
	Jupyter Digest (1)
	Apache Spark (1)
	Generation (1)
	Executive Transportation Group (1)
	Notebook (1)
	Notebooks (1)
	Tensorflow (1)
	Manning (1)
	Hands-On Machine Learning (1)
	Scikit-Learn (2)
	Interactive Dashboards (1)
	Python (1)
	Machine Learning (1)

Jupyter Insights: Carol Willing, director of the Python So

## Example 9. Discovering interactions between entities

In [2]:
import nltk
import json

BLOG_DATA = "resources/ch05-webpages/feed.json"

def extract_interactions(txt):
    sentences = nltk.tokenize.sent_tokenize(txt)
    tokens = [nltk.tokenize.word_tokenize(s) for s in sentences]
    pos_tagged_tokens = [nltk.pos_tag(t) for t in tokens]

    entity_interactions = []
    for sentence in pos_tagged_tokens:

        all_entity_chunks = []
        previous_pos = None
        current_entity_chunk = []

        for (token, pos) in sentence:

            if pos == previous_pos and pos.startswith('NN'):
                current_entity_chunk.append(token)
            elif pos.startswith('NN'):
                if current_entity_chunk != []:
                    all_entity_chunks.append((' '.join(current_entity_chunk),
                            pos))
                current_entity_chunk = [token]

            previous_pos = pos

        if len(all_entity_chunks) > 1:
            entity_interactions.append(all_entity_chunks)
        else:
            entity_interactions.append([])

    assert len(entity_interactions) == len(sentences)

    return dict(entity_interactions=entity_interactions,
                sentences=sentences)

blog_data = json.loads(open(BLOG_DATA).read())

# Display selected interactions on a per-sentence basis

for post in blog_data:

    post.update(extract_interactions(post['content']))

    print post['title']
    print '-' * len(post['title'])
    for interactions in post['entity_interactions']:
        print '; '.join([i[0] for i in interactions])
    print

Four short links: 30 May 2017
-----------------------------
World Problems; Story AI; Medical Security Horrors; OSS Fuzz; World; Biggest; Problems; data; Stuff
Strong Story Hypothesis; Directed Perception Hypothesis; humans; primates; part; answer; Strong Story Hypothesis; role
hypothesis; Driven Perception Hypothesis; sense; sense; story understanding; apparatus
Paper; CSAIL; Genesis; story system; tells; stories; sense; rules; level concept
Implants; Hospital; Systems; Infosec Dumpster Fires; Cory Doctorow; pointers; writeups; horrors
Whitescope; whitepaper; pacemaker security; pacemaker; devices; manufacturers; devices; pacemaker; radio; signals; vulnerabilities; authentication; pacemakers; pacemaker; programmers; way; pacemaker; device; attacker
Cory; points; DMCA; exemption; paper; expiring; b; release; code; stuff
OSS Fuzz Improving Open Source; Google; source fuzzer; security; vulnerabilities; source; projects; FreeType2; FFmpeg; LibreOffice; SQLite; GnuTLS; PCRE2; gRPC
service;

## Example 10. Visualizing interactions between entities with HTML output

In [3]:
import os
import json
import nltk
from IPython.display import IFrame
from IPython.core.display import display

BLOG_DATA = "resources/ch05-webpages/feed.json"

HTML_TEMPLATE = """<html>
    <head>
        <title>%s</title>
        <meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>
    </head>
    <body>%s</body>
</html>"""

blog_data = json.loads(open(BLOG_DATA).read())

for post in blog_data:

    post.update(extract_interactions(post['content']))

    # Display output as markup with entities presented in bold text

    post['markup'] = []

    for sentence_idx in range(len(post['sentences'])):

        s = post['sentences'][sentence_idx]
        for (term, _) in post['entity_interactions'][sentence_idx]:
            s = s.replace(term, '<strong>%s</strong>' % (term, ))

        post['markup'] += [s] 
            
    filename = post['title'].replace("?", "") + '.entity_interactions.html'
    f = open(os.path.join('resources', 'ch05-webpages', filename), 'w')
    html = HTML_TEMPLATE % (post['title'] + ' Interactions', 
                            ' '.join(post['markup']),)
    f.write(html.encode('utf-8'))
    f.close()

    print "Data written to", f.name
    
    # Display any of these files with an inline frame. This displays the
    # last file processed by using the last value of f.name...
    
    print "Displaying %s:" % f.name
    display(IFrame('files/%s' % f.name, '100%', '600px'))

Data written to resources/ch05-webpages/Four short links: 30 May 2017.entity_interactions.html
Displaying resources/ch05-webpages/Four short links: 30 May 2017.entity_interactions.html:


Data written to resources/ch05-webpages/Jupyter Digest: 30 May 2017.entity_interactions.html
Displaying resources/ch05-webpages/Jupyter Digest: 30 May 2017.entity_interactions.html:


Data written to resources/ch05-webpages/Jupyter Insights: Carol Willing, director of the Python Software Foundation.entity_interactions.html
Displaying resources/ch05-webpages/Jupyter Insights: Carol Willing, director of the Python Software Foundation.entity_interactions.html:


Data written to resources/ch05-webpages/How do I configure a Cisco router for secure remote access using SSH.entity_interactions.html
Displaying resources/ch05-webpages/How do I configure a Cisco router for secure remote access using SSH.entity_interactions.html:


Data written to resources/ch05-webpages/How do I recover the password on a Cisco router without losing its configuration.entity_interactions.html
Displaying resources/ch05-webpages/How do I recover the password on a Cisco router without losing its configuration.entity_interactions.html:


Data written to resources/ch05-webpages/What is the OSI model.entity_interactions.html
Displaying resources/ch05-webpages/What is the OSI model.entity_interactions.html:


Data written to resources/ch05-webpages/Four short links: 29 May 2017.entity_interactions.html
Displaying resources/ch05-webpages/Four short links: 29 May 2017.entity_interactions.html:


Data written to resources/ch05-webpages/Intelligent Bits: 26 May 2017.entity_interactions.html
Displaying resources/ch05-webpages/Intelligent Bits: 26 May 2017.entity_interactions.html:


Data written to resources/ch05-webpages/Four short links: 26 May 2017.entity_interactions.html
Displaying resources/ch05-webpages/Four short links: 26 May 2017.entity_interactions.html:


Data written to resources/ch05-webpages/Running a word count application using Spark.entity_interactions.html
Displaying resources/ch05-webpages/Running a word count application using Spark.entity_interactions.html:


Data written to resources/ch05-webpages/Peeking into the black box: Lessons from the front lines of machine-learning product launches.entity_interactions.html
Displaying resources/ch05-webpages/Peeking into the black box: Lessons from the front lines of machine-learning product launches.entity_interactions.html:


Data written to resources/ch05-webpages/Using AI to create new jobs.entity_interactions.html
Displaying resources/ch05-webpages/Using AI to create new jobs.entity_interactions.html:


Data written to resources/ch05-webpages/Lessons from piloting the London Office of Data Analytics.entity_interactions.html
Displaying resources/ch05-webpages/Lessons from piloting the London Office of Data Analytics.entity_interactions.html:


Data written to resources/ch05-webpages/Enabling data science in the enterprise.entity_interactions.html
Displaying resources/ch05-webpages/Enabling data science in the enterprise.entity_interactions.html:


Data written to resources/ch05-webpages/Is finance ready for AI.entity_interactions.html
Displaying resources/ch05-webpages/Is finance ready for AI.entity_interactions.html:


Data written to resources/ch05-webpages/Accelerate analytics and AI innovations with Intel.entity_interactions.html
Displaying resources/ch05-webpages/Accelerate analytics and AI innovations with Intel.entity_interactions.html:


Data written to resources/ch05-webpages/Data science and deep learning in retail.entity_interactions.html
Displaying resources/ch05-webpages/Data science and deep learning in retail.entity_interactions.html:


Data written to resources/ch05-webpages/Travis Lowdermilk and Jessica Rich on building a customer-driven culture at Microsoft.entity_interactions.html
Displaying resources/ch05-webpages/Travis Lowdermilk and Jessica Rich on building a customer-driven culture at Microsoft.entity_interactions.html:


Data written to resources/ch05-webpages/JupyterLab: The evolution of the Jupyter web interface.entity_interactions.html
Displaying resources/ch05-webpages/JupyterLab: The evolution of the Jupyter web interface.entity_interactions.html:


Data written to resources/ch05-webpages/Jason Laska and Michael Akilian on using AI to schedule meetings.entity_interactions.html
Displaying resources/ch05-webpages/Jason Laska and Michael Akilian on using AI to schedule meetings.entity_interactions.html:


Data written to resources/ch05-webpages/Four short links: 25 May 2017.entity_interactions.html
Displaying resources/ch05-webpages/Four short links: 25 May 2017.entity_interactions.html:


Data written to resources/ch05-webpages/Real-time intelligence gives Uber the edge.entity_interactions.html
Displaying resources/ch05-webpages/Real-time intelligence gives Uber the edge.entity_interactions.html:


Data written to resources/ch05-webpages/Another one bytes the dust.entity_interactions.html
Displaying resources/ch05-webpages/Another one bytes the dust.entity_interactions.html:


Data written to resources/ch05-webpages/Machine learning is a moonshot for us all.entity_interactions.html
Displaying resources/ch05-webpages/Machine learning is a moonshot for us all.entity_interactions.html:


Data written to resources/ch05-webpages/The data subject first.entity_interactions.html
Displaying resources/ch05-webpages/The data subject first.entity_interactions.html:


Data written to resources/ch05-webpages/What Kaggle has learned from almost a million data scientists.entity_interactions.html
Displaying resources/ch05-webpages/What Kaggle has learned from almost a million data scientists.entity_interactions.html:


Data written to resources/ch05-webpages/Highlights from Strata Data Conference in London 2017.entity_interactions.html
Displaying resources/ch05-webpages/Highlights from Strata Data Conference in London 2017.entity_interactions.html:


Data written to resources/ch05-webpages/The science of visual interactions.entity_interactions.html
Displaying resources/ch05-webpages/The science of visual interactions.entity_interactions.html:


Data written to resources/ch05-webpages/12 qualities of effective design organizations.entity_interactions.html
Displaying resources/ch05-webpages/12 qualities of effective design organizations.entity_interactions.html:


Data written to resources/ch05-webpages/Kelly Shortridge on overcoming common missteps affecting security decision-making.entity_interactions.html
Displaying resources/ch05-webpages/Kelly Shortridge on overcoming common missteps affecting security decision-making.entity_interactions.html:


Data written to resources/ch05-webpages/5 things to learn before learning React.entity_interactions.html
Displaying resources/ch05-webpages/5 things to learn before learning React.entity_interactions.html:


Data written to resources/ch05-webpages/Reliability with Kafka.entity_interactions.html
Displaying resources/ch05-webpages/Reliability with Kafka.entity_interactions.html:


Data written to resources/ch05-webpages/Four short links: 24 May 2017.entity_interactions.html
Displaying resources/ch05-webpages/Four short links: 24 May 2017.entity_interactions.html:


Data written to resources/ch05-webpages/Four short links: 23 May 2017.entity_interactions.html
Displaying resources/ch05-webpages/Four short links: 23 May 2017.entity_interactions.html:


Data written to resources/ch05-webpages/Jupyter Digest: 22 May 2017.entity_interactions.html
Displaying resources/ch05-webpages/Jupyter Digest: 22 May 2017.entity_interactions.html:


Data written to resources/ch05-webpages/Four short links: 22 May 2017.entity_interactions.html
Displaying resources/ch05-webpages/Four short links: 22 May 2017.entity_interactions.html:


Data written to resources/ch05-webpages/Intelligent Bits: 19 May 2017.entity_interactions.html
Displaying resources/ch05-webpages/Intelligent Bits: 19 May 2017.entity_interactions.html:


Data written to resources/ch05-webpages/Four short links: 19 May 2017.entity_interactions.html
Displaying resources/ch05-webpages/Four short links: 19 May 2017.entity_interactions.html:


Data written to resources/ch05-webpages/Activating human intrusion detection systems.entity_interactions.html
Displaying resources/ch05-webpages/Activating human intrusion detection systems.entity_interactions.html:


Data written to resources/ch05-webpages/Exploring progressive web apps in the real world.entity_interactions.html
Displaying resources/ch05-webpages/Exploring progressive web apps in the real world.entity_interactions.html:


Data written to resources/ch05-webpages/How to write a sound hypothesis when conducting user research.entity_interactions.html
Displaying resources/ch05-webpages/How to write a sound hypothesis when conducting user research.entity_interactions.html:


Data written to resources/ch05-webpages/How AI is used to infer human emotion.entity_interactions.html
Displaying resources/ch05-webpages/How AI is used to infer human emotion.entity_interactions.html:


Data written to resources/ch05-webpages/Paris Buttfield-Addison on what’s new in Swift programming.entity_interactions.html
Displaying resources/ch05-webpages/Paris Buttfield-Addison on what’s new in Swift programming.entity_interactions.html:


UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019' in position 61: ordinal not in range(128)

<IPython.lib.display.IFrame at 0x11501ca90>

Data written to resources/ch05-webpages/Four short links: 18 May 2017.entity_interactions.html
Displaying resources/ch05-webpages/Four short links: 18 May 2017.entity_interactions.html:


Data written to resources/ch05-webpages/Using Docker in production.entity_interactions.html
Displaying resources/ch05-webpages/Using Docker in production.entity_interactions.html:


Data written to resources/ch05-webpages/How do I build an API.entity_interactions.html
Displaying resources/ch05-webpages/How do I build an API.entity_interactions.html:


Data written to resources/ch05-webpages/How do I use an API.entity_interactions.html
Displaying resources/ch05-webpages/How do I use an API.entity_interactions.html:


Data written to resources/ch05-webpages/What is an API.entity_interactions.html
Displaying resources/ch05-webpages/What is an API.entity_interactions.html:


Data written to resources/ch05-webpages/How to do client-side form validation with Elm.entity_interactions.html
Displaying resources/ch05-webpages/How to do client-side form validation with Elm.entity_interactions.html:


Data written to resources/ch05-webpages/The stages of enterprise IoT adoption.entity_interactions.html
Displaying resources/ch05-webpages/The stages of enterprise IoT adoption.entity_interactions.html:


Data written to resources/ch05-webpages/Four short links: 17 May 2017.entity_interactions.html
Displaying resources/ch05-webpages/Four short links: 17 May 2017.entity_interactions.html:


Data written to resources/ch05-webpages/Augmenting industrial reality.entity_interactions.html
Displaying resources/ch05-webpages/Augmenting industrial reality.entity_interactions.html:


Data written to resources/ch05-webpages/Four short links: 16 May 2017.entity_interactions.html
Displaying resources/ch05-webpages/Four short links: 16 May 2017.entity_interactions.html:


Data written to resources/ch05-webpages/Architecting actionable insights.entity_interactions.html
Displaying resources/ch05-webpages/Architecting actionable insights.entity_interactions.html:


Data written to resources/ch05-webpages/The business advantages of embedding analytics into applications.entity_interactions.html
Displaying resources/ch05-webpages/The business advantages of embedding analytics into applications.entity_interactions.html:


Data written to resources/ch05-webpages/How can I deploy a multi-container application with Docker Compose.entity_interactions.html
Displaying resources/ch05-webpages/How can I deploy a multi-container application with Docker Compose.entity_interactions.html:


Data written to resources/ch05-webpages/How do I package my Java application as a Docker image using Gradle.entity_interactions.html
Displaying resources/ch05-webpages/How do I package my Java application as a Docker image using Gradle.entity_interactions.html:


Data written to resources/ch05-webpages/How do I create Docker images and run containers using Maven.entity_interactions.html
Displaying resources/ch05-webpages/How do I create Docker images and run containers using Maven.entity_interactions.html:


Data written to resources/ch05-webpages/Get introduced to the new Java® 9 Platform Module System (JPMS) with Paul Deitel.entity_interactions.html
Displaying resources/ch05-webpages/Get introduced to the new Java® 9 Platform Module System (JPMS) with Paul Deitel.entity_interactions.html:


UnicodeEncodeError: 'ascii' codec can't encode character u'\xae' in position 60: ordinal not in range(128)

<IPython.lib.display.IFrame at 0x114fabc90>

Data written to resources/ch05-webpages/Jupyter Digest: 15 May 2017.entity_interactions.html
Displaying resources/ch05-webpages/Jupyter Digest: 15 May 2017.entity_interactions.html:
