# Mining the Social Web, 2nd Edition

## Chapter 4: Mining Google+: Computing Document Similarity, Extracting Collocations, and More

This IPython Notebook provides an interactive way to follow along with and explore the numbered examples from [_Mining the Social Web (2nd Edition)_](http://bit.ly/135dHfs).

## Fixes & improvements

Reviewed by [santteegt](https://santteegt.github.io)

This notebook was fully reviewed and partially fixed in May 2017 so that it works with the library and API versions current at that time.

## Copyright and Licensing

You are free to use or adapt this notebook for any purpose you'd like. However, please respect the [Simplified BSD License](https://github.com/ptwobrussell/Mining-the-Social-Web-2nd-Edition/blob/master/LICENSE.txt) that governs its use.

## Installation Requirements

You can install the following libraries directly from Anaconda Navigator:

* google-api-python-client
* nltk
* beautifulsoup4

## Example 1. Searching for a person with the Google+ API

In [2]:
import httplib2
import json
import apiclient.discovery # pip install google-api-python-client

# XXX: Enter any person's name
Q = "Tim O'Reilly"

# XXX: Enter your own API key from https://code.google.com/apis/console
# (avoid committing a real key to a shared notebook)
API_KEY = ''

service = apiclient.discovery.build('plus', 'v1', http=httplib2.Http(), 
                                    developerKey=API_KEY)

people_feed = service.people().search(query=Q).execute()

print json.dumps(people_feed['items'], indent=1)

[
 {
  "kind": "plus#person", 
  "displayName": "Tim O'Reilly", 
  "url": "https://plus.google.com/107033731246200681024", 
  "image": {
   "url": "https://lh4.googleusercontent.com/-J8nmMwIhpiA/AAAAAAAAAAI/AAAAAAADdg4/68r2hyFUgzI/photo.jpg?sz=50"
  }, 
  "etag": "\"Sh4n9u6EtD24TM0RmWv7jTXojqc/tjedXFyeIkzudZzRey5EJb8iZIk\"", 
  "id": "107033731246200681024", 
  "objectType": "person"
 }, 
 {
  "kind": "plus#person", 
  "displayName": "Tim O'Reilly", 
  "url": "https://plus.google.com/108869213167055456475", 
  "image": {
   "url": "https://lh4.googleusercontent.com/-K_U9Tbas8kE/AAAAAAAAAAI/AAAAAAAACYs/QThoMgwUxak/photo.jpg?sz=50"
  }, 
  "etag": "\"Sh4n9u6EtD24TM0RmWv7jTXojqc/bMltloEYgECFQYgmmtJJ3R6E-44\"", 
  "id": "108869213167055456475", 
  "objectType": "person"
 }, 
 {
  "kind": "plus#person", 
  "displayName": "TIM O'REILLY", 
  "url": "https://plus.google.com/110160587587635791009", 
  "image": {
   "url": "https://lh4.googleusercontent.com/-gWq9vr_JEnc/AAAAAAAAAAI/AAAAAAAAADI/z

## Example 2. Displaying Google+ avatars in IPython Notebook provides a quick way to disambiguate the search results and discover the person you are looking for

In [2]:
from IPython.core.display import HTML

html = []

for p in people_feed['items']:
    html += ['<p><img src="%s" /> %s: %s</p>' % \
             (p['image']['url'], p['id'], p['displayName'])]

HTML(''.join(html))

## Example 3. Fetching recent activities for a particular Google+ user

In [4]:
import httplib2
import json
import apiclient.discovery

USER_ID = '107033731246200681024' # Tim O'Reilly

# XXX: Re-enter your API_KEY from  https://code.google.com/apis/console
# if not currently set
# API_KEY = ''

service = apiclient.discovery.build('plus', 'v1', http=httplib2.Http(), 
                                    developerKey=API_KEY)

activity_feed = service.activities().list(
  userId=USER_ID,
  collection='public',
  maxResults='100' # Max allowed per API
).execute()

print json.dumps(activity_feed, indent=1)

{
 "nextPageToken": "ADSJ_i2ai-hdfJKpbjAUGoNzdJly0ewsFczwxuplHJE6lW17MuNFoiVohWZEhvntAfFJTYiBuWWbeJxL8wJSJGpqFPgVmhUL_zoJWx5gjHxR4F4PUg", 
 "kind": "plus#activityFeed", 
 "title": "Google+ List of Activities for Collection PUBLIC", 
 "items": [
  {
   "kind": "plus#activity", 
   "provider": {
    "title": "Google+"
   }, 
   "title": "If there is only one article about the NBA finals that you read, make it this one. And if you have no...", 
   "url": "https://plus.google.com/+TimOReilly/posts/BzfRWCoFfFH", 
   "object": {
    "resharers": {
     "totalItems": 1, 
     "selfLink": "https://www.googleapis.com/plus/v1/activities/z132ennwdurpfhre123gcxizetybvpydh/people/resharers"
    }, 
    "attachments": [
     {
      "displayName": "'I'm ready': The text that started an NBA dynasty", 
      "fullImage": {
       "url": "https://cdn-s3.si.com/s3fs-public/styles/inline_gallery_desktop/public/2017/06/13/kevin-durant-draymond-green-nba-finals.jpg?itok=F8sgXFLK", 
       "type": "image/jp

## Example 4. Cleaning HTML in Google+ content by stripping out HTML tags and converting HTML entities back to plain-text representations

In [5]:
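from bs4 import BeautifulSoup # pip install beautifulsoup4

# BeautifulSoup's get_text() drops the tags and converts HTML entities
# back to plain text. (nltk.clean_html is no longer implemented in NLTK 3.)

def cleanHtml(html):
    if html == "": return ""

    # Name the parser explicitly; "html.parser" also works if lxml is not installed
    return BeautifulSoup(html, "lxml").get_text()

# The book's original version relied on BeautifulSoup 3 and nltk.clean_html:
#   return BeautifulStoneSoup(clean_html(html), "xml",
#           convertEntities=BeautifulStoneSoup.HTML_ENTITIES).contents[0]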

print activity_feed['items'][3]['object']['content']
print
print cleanHtml(activity_feed['items'][3]['object']['content'])

This is what you get when you believe that “government should just get out of the way” and let industry do its thing. When will get over the failed idea that good government is an obstacle to progress? We do best when government, business, and all of society work together towards shared goals. ﻿

This is what you get when you believe that “government should just get out of the way” and let industry do its thing. When will get over the failed idea that good government is an obstacle to progress? We do best when government, business, and all of society work together towards shared goals. ﻿




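Note: BeautifulSoup emits a UserWarning when no parser is named explicitly. Naming the parser avoids the warning, that is, writing

 BeautifulSoup(YOUR_MARKUP, "lxml")

rather than

 BeautifulSoup(YOUR_MARKUP)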
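As a rough illustration of what the cleaning step does, the sketch below strips tags and decodes entities using only the Python 3 standard library. It is not the notebook's method, which relies on BeautifulSoup; `strip_html`, `_TextExtractor` and the sample markup are names made up for this example.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML fragment (hypothetical helper)."""

    def __init__(self):
        # convert_charrefs=True decodes entities such as &amp; before
        # the text reaches handle_data
        HTMLParser.__init__(self, convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def strip_html(html):
    """Return the text of an HTML fragment with tags removed."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.chunks)


sample = '<b>Open</b> data &amp; <a href="http://example.com">platforms</a>'
print(strip_html(sample))
```

For real-world markup BeautifulSoup is the more forgiving choice; this version is only meant to make the two steps (drop tags, decode entities) visible.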


## Example 5. Looping over multiple pages of Google+ activities and distilling clean text from notes

In [6]:
import os
# import httplib2
# import json
# import apiclient.discovery
# from BeautifulSoup import BeautifulStoneSoup
# from nltk import clean_html

USER_ID = '107033731246200681024' # Tim O'Reilly

# XXX: Re-enter your API_KEY from  https://code.google.com/apis/console 
# if not currently set
# API_KEY = '' 

MAX_RESULTS = 200 # Will require multiple requests

# def cleanHtml(html):
#   if html == "": return ""

#   return BeautifulStoneSoup(clean_html(html),
#           convertEntities=BeautifulStoneSoup.HTML_ENTITIES).contents[0]

service = apiclient.discovery.build('plus', 'v1', http=httplib2.Http(), 
                                    developerKey=API_KEY)

activity_feed = service.activities().list(userId=USER_ID, 
                                          collection='public',
                                          maxResults='100' # Max allowed per request
                                         )

activity_results = []

while activity_feed is not None and len(activity_results) < MAX_RESULTS:

    activities = activity_feed.execute()

    if 'items' in activities:

        for activity in activities['items']:

            if activity['object']['objectType'] == 'note' and activity['object']['content'] != '':

                activity['title'] = cleanHtml(activity['title'])
                activity['object']['content'] = cleanHtml(activity['object']['content'])
                activity_results += [activity]

    # list_next requires the previous request and response objects
    activity_feed = service.activities().list_next(activity_feed, activities)

# Write the output to a file for convenience

# (the resources/ch04-googleplus directory must already exist)
out_path = os.path.join('resources', 'ch04-googleplus', USER_ID + '.json')
with open(out_path, 'w') as f:
    f.write(json.dumps(activity_results, indent=1))

print str(len(activity_results)), "activities written to", f.name

272 activities written to resources/ch04-googleplus/107033731246200681024.json
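The paging loop above checks the result cap only between requests, which is one plausible reason the count written can exceed `MAX_RESULTS`. The sketch below imitates that loop against a fake, in-memory "API" so it can be run without credentials. `FAKE_PAGES`, `execute`, `next_request` and `fetch_all` are hypothetical names, not part of google-api-python-client.

```python
# Hypothetical stand-in for the remote service: three pages keyed by token
FAKE_PAGES = {
    'start': {'items': [1, 2, 3], 'nextPageToken': 'p2'},
    'p2': {'items': [4, 5, 6], 'nextPageToken': 'p3'},
    'p3': {'items': [7]},
}


def execute(token):
    """Pretend to run a request and return its response page."""
    return FAKE_PAGES[token]


def next_request(token, response):
    """Like list_next in the loop above: None once there is no further page."""
    return response.get('nextPageToken')


def fetch_all(first_token, max_results):
    """Collect items page by page; the cap is tested before each request."""
    results = []
    token = first_token
    while token is not None and len(results) < max_results:
        response = execute(token)
        results.extend(response.get('items', []))
        token = next_request(token, response)
    return results


print(len(fetch_all('start', 5)))    # a whole second page is added
print(len(fetch_all('start', 100)))  # stops when the pages run out
```

If a strict cap matters, slicing the returned list (`results[:max_results]`) is a simple way to enforce it.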


## Example 6. Sample data structures used in illustrations for the rest of this chapter

In [22]:
corpus = { 
 'a' : "Mr. Green killed Colonel Mustard in the study with the candlestick. \
Mr. Green is not a very nice fellow.",
 'b' : "Professor Plum has a green plant in his study.",
 'c' : "Miss Scarlett watered Professor Plum's green plant while he was away \
from his office last week."
}
terms = {
 'a' : [ i.lower() for i in corpus['a'].split() ],
 'b' : [ i.lower() for i in corpus['b'].split() ],
 'c' : [ i.lower() for i in corpus['c'].split() ]
 }

## Example 7. Running TF-IDF on sample data

In [11]:
from math import log

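# XXX: Enter query terms from the corpus variable. Documents are tokenized by
# splitting on whitespace, so punctuation is kept ('mr.' matches, 'mr' does not).
# The output shown below was produced with the single term 'green'.
QUERY_TERMS = ['green']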

def tf(term, doc, normalize=True):
    doc = doc.lower().split()
    if normalize:
        return doc.count(term.lower()) / float(len(doc))
    else:
        return doc.count(term.lower()) / 1.0


def idf(term, corpus):
    num_texts_with_term = len([True for text in corpus if term.lower()
                              in text.lower().split()])

    # tf-idf calc involves multiplying against a tf value less than 1, so it's
    # necessary to return a value of at least 1 for consistent scoring.
    # (Multiplying two values less than 1 returns a value less than each of
    # them.)

    try:
        return 1.0 + log(float(len(corpus)) / num_texts_with_term)
    except ZeroDivisionError:
        return 1.0


def tf_idf(term, doc, corpus):
    return tf(term, doc) * idf(term, corpus)


corpus = \
    {'a': 'Mr. Green killed Colonel Mustard in the study with the candlestick. \
Mr. Green is not a very nice fellow.',
     'b': 'Professor Plum has a green plant in his study.',
     'c': "Miss Scarlett watered Professor Plum's green plant while he was away \
from his office last week."}

for (k, v) in sorted(corpus.items()):
    print k, ':', v
print
    
# Score queries by calculating cumulative tf_idf score for each term in query

query_scores = {'a': 0, 'b': 0, 'c': 0}
for term in [t.lower() for t in QUERY_TERMS]:
    for doc in sorted(corpus):
        print 'TF(%s): %s' % (doc, term), tf(term, corpus[doc])
    print 'IDF: %s' % (term, ), idf(term, corpus.values())
    print

    for doc in sorted(corpus):
        score = tf_idf(term, corpus[doc], corpus.values())
        print 'TF-IDF(%s): %s' % (doc, term), score
        query_scores[doc] += score
    print

print "Overall TF-IDF scores for query '%s'" % (' '.join(QUERY_TERMS), )
for (doc, score) in sorted(query_scores.items()):
    print doc, score

a : Mr. Green killed Colonel Mustard in the study with the candlestick. Mr. Green is not a very nice fellow.
b : Professor Plum has a green plant in his study.
c : Miss Scarlett watered Professor Plum's green plant while he was away from his office last week.

TF(a): green 0.105263157895
TF(b): green 0.111111111111
TF(c): green 0.0625
IDF: green 1.0

TF-IDF(a): green 0.105263157895
TF-IDF(b): green 0.111111111111
TF-IDF(c): green 0.0625

Overall TF-IDF scores for query 'green'
a 0.105263157895
b 0.111111111111
c 0.0625
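The scores above can be checked by hand: document `a` has 19 whitespace-separated tokens and contains "green" twice, so its term frequency is 2/19. The sketch below recomputes the numbers with the same formulas as Example 7, written so that it also runs under Python 3. The name `docs` and the extra query terms are only illustrative.

```python
from math import log

docs = {
    'a': "Mr. Green killed Colonel Mustard in the study with the candlestick. "
         "Mr. Green is not a very nice fellow.",
    'b': "Professor Plum has a green plant in his study.",
    'c': "Miss Scarlett watered Professor Plum's green plant while he was away "
         "from his office last week.",
}


def tf(term, doc):
    """Share of the document's tokens that equal the term."""
    tokens = doc.lower().split()
    return tokens.count(term.lower()) / float(len(tokens))


def idf(term, corpus):
    """1 + log(N / df); falls back to 1.0 when no document has the term."""
    df = sum(1 for text in corpus if term.lower() in text.lower().split())
    return 1.0 + log(float(len(corpus)) / df) if df else 1.0


def tf_idf(term, doc, corpus):
    return tf(term, doc) * idf(term, corpus)


corpus = list(docs.values())
for term in ['green', 'plant', 'mr', 'mr.']:
    scores = [(k, tf_idf(term, docs[k], corpus)) for k in sorted(docs)]
    print(term, idf(term, corpus), scores)
```

"green" occurs in every document, so its idf is exactly 1.0 and the score equals the raw term frequency. "plant" occurs in only two of the three, so it is weighted up. "mr" scores zero everywhere because the tokens are "mr." with the period attached.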


## Example 8. Exploring Google+ data with NLTK

In [19]:
# Explore some of NLTK's functionality by exploring the data. 
# Here are some suggestions for an interactive interpreter session.

import nltk

# Download ancillary nltk packages if not already installed
nltk.download('stopwords')

all_content = " ".join([ a['object']['content'] for a in activity_results ])

# Approximate bytes of text
print len(all_content)

tokens = all_content.split()
text = nltk.Text(tokens)

# Examples of the appearance of the word "open"
print('CONCORDANCE OF WORD OPEN')
text.concordance("open")

# Frequent collocations in the text (usually meaningful phrases)
print('COLLOCATIONS')
text.collocations()

# Frequency analysis for words of interest
fdist = text.vocab()
print('open: %d | source: %d | web: %d | 2.0: %d' % (fdist["open"], fdist["source"], fdist["web"], fdist["2.0"]))

# Number of words in the text
print('Total words: %d' % len(tokens))

# Number of unique words in the text

unique_words = len(fdist.keys())

print('Unique words: %d' % unique_words)

# Common words that aren't stopwords
# (most_common() orders by frequency; in NLTK 3, keys() is not sorted)
stopwords = set(nltk.corpus.stopwords.words('english'))
common_words = [w for (w, count) in fdist.most_common(100)
   if w.lower() not in stopwords]

print(common_words)

# Long words that aren't URLs
long_words = [w for w in fdist.keys() if len(w) > 15 and not w.startswith("http")]

print(long_words)

# Number of URLs
total_urls = len([w for w in fdist.keys() if w.startswith("http")])

print('Total URLs %d' % total_urls)

# Enumerate the frequency distribution
for rank, word in enumerate(sorted(fdist.items(), key=lambda x: x[1], reverse=True)): 
    print rank, word

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/santteegt/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
107171
CONCORDANCE OF WORD OPEN
Displaying 15 of 15 matches:
ear that computational biologist and open science advocate (UC Berkeley profes
: magazine slogan say, "If you can't open it, you don't own it." ﻿ Predictive 
 I'm proud to be a signatory to this open letter calling for this key policy i
st, I've focused a lot on areas like open source software and the implications
opic at greater length in my article Open Data and Algorithmic Regulation: htt
cessful participatory projects, from open source software to wikis to social m
ere isn't one (except that it's only open to US students - sorry. If anyone ha
new contract that conformed with the open data mandate. If it were consistent 
If it were consistent with the Obama open data guidance, that RFP would requir
ut of step with the administration’s open data policy.The founder of Hipcamp, 
is 
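In NLTK 3 the frequency distribution returned by `text.vocab()` is a `FreqDist`, which subclasses `collections.Counter`. The sketch below therefore uses a plain `Counter` to show the same kind of queries without needing NLTK or the downloaded data. The sample sentence and the tiny `STOPWORDS` set are made up for illustration.

```python
from collections import Counter

text = "open source software and open data make the web more open"
tokens = text.lower().split()

# Counter plays the role of nltk.FreqDist here
counts = Counter(tokens)

# Hypothetical, tiny stopword list (NLTK ships a much longer one)
STOPWORDS = {'and', 'the', 'more'}

print('Total words: %d' % len(tokens))
print('Unique words: %d' % len(counts))
print('open: %d | source: %d | web: %d' % (
    counts['open'], counts['source'], counts['web']))

# Most frequent words first, with stopwords filtered out
common_words = [w for (w, n) in counts.most_common() if w not in STOPWORDS]
print(common_words[:3])
```

Looking up a word that never occurs (for example `counts['2.0']`) returns 0 rather than raising an error, which is why the formatted print in Example 8 is safe.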

## Example 9. Querying Google+ data with TF-IDF

In [13]:
import json
import nltk

# Load in human language data from wherever you've saved it

DATA = 'resources/ch04-googleplus/107033731246200681024.json'
data = json.loads(open(DATA).read())

# XXX: Provide your own query terms here

QUERY_TERMS = ['OPEN', 'SOURCE']

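# Filter data first so that activities[idx] and data[idx] stay aligned
data = [activity for activity in data
        if activity['object']['content'] != ""]

activities = [activity['object']['content'].lower().split()
              for activity in data]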

# TextCollection provides tf, idf, and tf_idf abstractions so 
# that we don't have to maintain/compute them ourselves

tc = nltk.TextCollection(activities)

relevant_activities = []

for idx in range(len(activities)):
    score = 0
    for term in [t.lower() for t in QUERY_TERMS]:
        score += tc.tf_idf(term, activities[idx])
    if score > 0:
        relevant_activities.append({'score': score, 'title': data[idx]['title'],
                              'url': data[idx]['url']})

# Sort by score and display results

relevant_activities = sorted(relevant_activities, 
                             key=lambda p: p['score'], reverse=True)
for activity in relevant_activities:
    print activity['title']
    print '\tLink: %s' % (activity['url'], )
    print '\tScore: %s' % (activity['score'], )
    print

This is a really important piece about open data and platforms.
	Link: https://plus.google.com/+TimOReilly/posts/fo9uxWTctHb
	Score: 0.283046846292

My latest @radar post: #SocialCivics and the architecture of participation. Inspired by @goldman joining...
	Link: https://plus.google.com/+TimOReilly/posts/BipFsL8tjmP
	Score: 0.135660053835

An excellent demonstration of why Open Access lowers the barriers to knowledge-sharing in science. This...
	Link: https://plus.google.com/+TimOReilly/posts/iQ4RdspWxbY
	Score: 0.12972980455

Predictive policing has enormous risks of introducing algorithmic bias. As a way of countering that,...
	Link: https://plus.google.com/+TimOReilly/posts/TFwW3wgKm2F
	Score: 0.0738969754014

I'm doing a ProductHunt AMA at 9 am PT this morning.  I love getting people thinking harder about how...
	Link: https://plus.google.com/+TimOReilly/posts/KFxXr6qTEHS
	Score: 0.0718200285009

As you may know, I'm a big fan of the idea that government should act as a platform, n
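To make the ranking step concrete without NLTK, here is a small sketch that scores tokenized documents against a query and sorts them. The formulas are an assumption about how `nltk.TextCollection` behaves (term count divided by document length, and a plain `log(N / df)` with no added 1); check the NLTK source before relying on that. The three toy documents and the `rank` helper are hypothetical.

```python
from math import log


def tf(term, tokens):
    """Term count divided by document length."""
    return tokens.count(term) / float(len(tokens))


def idf(term, docs):
    """log(N / df), or 0.0 if the term appears in no document."""
    matches = sum(1 for tokens in docs if term in tokens)
    return log(len(docs) / float(matches)) if matches else 0.0


def rank(query_terms, docs):
    """Return (score, doc_index) pairs for matching docs, best first."""
    scored = []
    for i, tokens in enumerate(docs):
        score = sum(tf(t, tokens) * idf(t, docs) for t in query_terms)
        if score > 0:
            scored.append((score, i))
    return sorted(scored, reverse=True)


docs = [
    "open source software is open".split(),
    "open data and platforms".split(),
    "wood fired pizza".split(),
]

for score, i in rank(['open', 'source'], docs):
    print('%.4f  %s' % (score, ' '.join(docs[i])))
```

One consequence of this idf variant, unlike the `1.0 + log(...)` form in Example 7, is that a term present in every document contributes nothing to the score.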

## Example 10. Finding similar documents using cosine similarity

In [14]:
import json
import nltk

# Load in human language data from wherever you've saved it

DATA = 'resources/ch04-googleplus/107033731246200681024.json'

# Only consider content that's longer than 1,000 characters.
data = [ post for post in json.loads(open(DATA).read())
         if len(post['object']['content']) > 1000 ]

all_posts = [post['object']['content'].lower().split() 
             for post in data ]


# Provides tf, idf, and tf_idf abstractions for scoring

tc = nltk.TextCollection(all_posts)

# Compute a term-document matrix such that td_matrix[doc_title][term]
# returns a tf-idf score for the term in the document

td_matrix = {}
for idx in range(len(all_posts)):
    post = all_posts[idx]
    fdist = nltk.FreqDist(post)

    doc_title = data[idx]['title']
    url = data[idx]['url']
    td_matrix[(doc_title, url)] = {}

    for term in fdist.iterkeys():
        td_matrix[(doc_title, url)][term] = tc.tf_idf(term, post)
        
# Build vectors such that term scores are in the same positions...

distances = {}
for (title1, url1) in td_matrix.keys():

    distances[(title1, url1)] = {}
    (min_dist, most_similar) = (1.0, ('', ''))

    for (title2, url2) in td_matrix.keys():

        # Take care not to mutate the original data structures
        # since we're in a loop and need the originals multiple times

        terms1 = td_matrix[(title1, url1)].copy()
        terms2 = td_matrix[(title2, url2)].copy()

        # Fill in "gaps" in each map so vectors of the same length can be computed

        for term1 in terms1:
            if term1 not in terms2:
                terms2[term1] = 0

        for term2 in terms2:
            if term2 not in terms1:
                terms1[term2] = 0

        # Create vectors from term maps

        v1 = [score for (term, score) in sorted(terms1.items())]
        v2 = [score for (term, score) in sorted(terms2.items())]

        # Compute similarity amongst documents

        distances[(title1, url1)][(title2, url2)] = \
            nltk.cluster.util.cosine_distance(v1, v2)

        if url1 == url2:
            #print distances[(title1, url1)][(title2, url2)]
            continue

        if distances[(title1, url1)][(title2, url2)] < min_dist:
            (min_dist, most_similar) = (distances[(title1, url1)][(title2,
                                         url2)], (title2, url2))
    
    print '''Most similar to %s (%s)
\t%s (%s)
\tscore %f
''' % (title1, url1,
            most_similar[0], most_similar[1], 1-min_dist)

Most similar to How fragile life is, even for the best of us. We heard this morning that our friend Jake Brewer was ... (https://plus.google.com/+TimOReilly/posts/jV8jeKeWWyf)
	How to Raise Moral Children

I thought this article on child-raising had a lot of good ideas in it. ... (https://plus.google.com/+TimOReilly/posts/NVZVmG1ct6C)
	score 0.058431

Most similar to Work on sh-t that matters, aka Poo Camp

Today was the annual chicken coop poo cleanup. (The chickens... (https://plus.google.com/+TimOReilly/posts/3vTSVs6CcdY)
	How to Raise Moral Children

I thought this article on child-raising had a lot of good ideas in it. ... (https://plus.google.com/+TimOReilly/posts/NVZVmG1ct6C)
	score 0.036505

Most similar to From an article about Walmart, their move to pay more, and the lessons for the broader economy: http... (https://plus.google.com/+TimOReilly/posts/bqErtyYp6co)
	 +Maria Konnikova's NY Times article about the role of time and attention scarcity in the cycle of poverty... (htt
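The score printed above is `1 - cosine_distance`, that is, the cosine of the angle between two tf-idf vectors. The sketch below computes that similarity directly from two term-to-score dictionaries, filling in zeros for missing terms much as the gap-filling loops in the cell do. `cosine_similarity` and the sample dictionaries are illustrative names, not part of NLTK.

```python
from math import sqrt


def cosine_similarity(terms1, terms2):
    """Cosine of the angle between two sparse term->score vectors."""
    # Shared, sorted vocabulary so both vectors use the same positions
    vocab = sorted(set(terms1) | set(terms2))
    v1 = [terms1.get(t, 0.0) for t in vocab]
    v2 = [terms2.get(t, 0.0) for t in vocab]

    dot = sum(a * b for (a, b) in zip(v1, v2))
    norm = sqrt(sum(a * a for a in v1)) * sqrt(sum(b * b for b in v2))
    return dot / norm if norm else 0.0


post1 = {'open': 0.2, 'data': 0.1}
post2 = {'open': 0.1, 'source': 0.3}
post3 = {'pizza': 0.5}

print(cosine_similarity(post1, post1))  # identical direction
print(cosine_similarity(post1, post2))  # partial overlap
print(cosine_similarity(post1, post3))  # no shared terms
```

Because tf-idf scores are never negative here, the similarity falls between 0 (no shared terms) and 1 (same direction), which is why `min_dist` can safely start at 1.0.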

**Code to create a matrix diagram displaying linkages between Google+ activities as illustrated in Figure 6.**

In [32]:
import json
from operator import itemgetter
import nltk
from IPython.display import IFrame
from IPython.core.display import display

# Load in human language data from wherever you've saved it

DATA = 'resources/ch04-googleplus/107033731246200681024.json'

# Only consider content that's longer than 1,000 characters.
data = [post for post in json.loads(open(DATA).read())
        if len(post['object']['content']) > 1000]


all_posts = [post['object']['content'].lower().split() 
             for post in data]

# Provides tf, idf, tf_idf abstractions for scoring

tc = nltk.TextCollection(all_posts)

# Compute a term-document matrix such that td_matrix[doc_title][term]
# returns a tf-idf score for the term in the document

td_matrix = {}
for idx in range(len(all_posts)):
    post = all_posts[idx]
    fdist = nltk.FreqDist(post)

    doc_title = data[idx]['title']
    url = data[idx]['url']
    td_matrix[(doc_title, url)] = {}

    for term in fdist.iterkeys():
        td_matrix[(doc_title, url)][term] = tc.tf_idf(term, post)

# Build vectors such that term scores are in the same positions...

distances = {}

# Visualization output requires a list of nodes with values and a list of links that have
# source and destination targets. We'll pre-build the list of nodes here and create an index
# so that we can easily create links from titles after we compute the most similar items
# on each iteration of the outer loop

viz_links = []
viz_nodes = [ {'title' : title, 'url' : url} for (title, url) in td_matrix.keys() ]

# Record each node's position, then map titles to those positions
for (i, vn) in enumerate(viz_nodes):
    vn['idx'] = i

idx = dict((vn['title'], vn['idx']) for vn in viz_nodes)


for (title1, url1) in td_matrix.keys():

    distances[(title1, url1)] = {}
    (min_dist, most_similar) = (1.0, ('', ''))

    for (title2, url2) in td_matrix.keys():

        # Take care not to mutate the original data structures
        # since we're in a loop and need the originals multiple times

        terms1 = td_matrix[(title1, url1)].copy()
        terms2 = td_matrix[(title2, url2)].copy()

        # Fill in "gaps" in each map so vectors of the same length can be computed

        for term1 in terms1:
            if term1 not in terms2:
                terms2[term1] = 0

        for term2 in terms2:
            if term2 not in terms1:
                terms1[term2] = 0

        # Create vectors from term maps

        v1 = [score for (term, score) in sorted(terms1.items())]
        v2 = [score for (term, score) in sorted(terms2.items())]

        # Compute similarity amongst documents

        distances[(title1, url1)][(title2, url2)] = \
            nltk.cluster.util.cosine_distance(v1, v2)

        if url1 == url2:
            #print distances[(title1, url1)][(title2, url2)]
            continue

        if distances[(title1, url1)][(title2, url2)] < min_dist:
            (min_dist, most_similar) = (distances[(title1, url1)][(title2,
                                         url2)], (title2, url2))
    
    viz_links.append({'source' : idx[title1], 'target' : idx[most_similar[0]], 'score' : 1 - min_dist})
    

f = open('resources/ch04-googleplus/viz/matrix.json', 'w')
f.write(json.dumps({'nodes' : viz_nodes, 'links' : viz_links}, indent=1))
f.close()

# Display the visualization below with an inline frame
display(IFrame('files/resources/ch04-googleplus/viz/matrix.html', '100%', '600px'))

# You could also serve it by running SimpleHTTPServer in the viz directory as follows:
# $ python -m SimpleHTTPServer 9000
# Now, open http://localhost:9000/matrix.html in your web browser

## Example 11. Using NLTK to compute bigrams and collocations for a sentence

In [15]:
import nltk

sentence = "Mr. Green killed Colonel Mustard in the study with the " + \
           "candlestick. Mr. Green is not a very nice fellow."

rs = nltk.ngrams(sentence.split(), 2)
print next(rs)  # nltk.ngrams returns a generator; show its first bigram
txt = nltk.Text(sentence.split())

txt.collocations()

('Mr.', 'Green')
Mr. Green
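A bigram is simply each token paired with the one that follows it, so the same pairs can be produced with `zip` and counted with `collections.Counter`. This is a plain-Python sketch of the idea rather than NLTK's implementation; `bigrams` is a name chosen for the example.

```python
from collections import Counter


def bigrams(tokens):
    """Pair every token with its successor."""
    return list(zip(tokens, tokens[1:]))


sentence = ("Mr. Green killed Colonel Mustard in the study with the "
            "candlestick. Mr. Green is not a very nice fellow.")

pairs = bigrams(sentence.split())
print(pairs[0])

# Bigrams that occur more than once are collocation candidates
repeated = [pair for (pair, n) in Counter(pairs).items() if n > 1]
print(repeated)
```

Only ("Mr.", "Green") repeats in this sentence, which matches the single collocation reported above.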


## Example 12. Using NLTK to compute collocations in a similar manner to the nltk.Text.collocations demo functionality

In [16]:
import json
import nltk

# Load in human language data from wherever you've saved it

DATA = 'resources/ch04-googleplus/107033731246200681024.json'
data = json.loads(open(DATA).read())

# Number of collocations to find

N = 25

all_tokens = [token for activity in data for token in activity['object']['content'
              ].lower().split()]

finder = nltk.BigramCollocationFinder.from_words(all_tokens)
finder.apply_freq_filter(2)

# Build the stopword set once rather than reloading the list for every word
stopwords = set(nltk.corpus.stopwords.words('english'))
finder.apply_word_filter(lambda w: w in stopwords)
scorer = nltk.association.BigramAssocMeasures.jaccard
collocations = finder.nbest(scorer, N)

for (i, collocation) in enumerate(collocations, 1):
    print '%d: %s' % (i, ' '.join(collocation))

1: bottom, “copyright
2: cabo pulmo
3: expressing disappointment
4: nbc press:here
5: negative judgment
6: press:here tv
7: wood fired
8: yuval noah
9: silicon valley
10: on-demand economy,
11: +jennifer pahlka
12: acted generously,
13: barre historical
14: computational biologist
15: private sector
16: pulmo sunrise﻿
17: saul griffith
18: bay mini
19: credit card
20: east bay
21: +bryce roberts
22: inca trail
23: italian granite
24: models say,
25: child welfare
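The Jaccard index used as the scorer compares how often two words appear together with how often either appears in that position at all. A minimal sketch of the idea follows, assuming the score is (count of the pair) / (pairs starting with the first word + pairs ending with the second word - count of the pair). NLTK's own bookkeeping may differ in details, and `jaccard_scores` and the sample text are hypothetical.

```python
from collections import Counter


def jaccard_scores(tokens, min_freq=1):
    """Score each bigram by the Jaccard index of its two word slots."""
    pairs = list(zip(tokens, tokens[1:]))
    pair_counts = Counter(pairs)
    first_counts = Counter(w1 for (w1, _) in pairs)
    second_counts = Counter(w2 for (_, w2) in pairs)

    scores = {}
    for ((w1, w2), n_ii) in pair_counts.items():
        if n_ii < min_freq:
            continue  # same role as apply_freq_filter in the cell above
        union = first_counts[w1] + second_counts[w2] - n_ii
        scores[(w1, w2)] = n_ii / float(union)
    return scores


tokens = ("silicon valley is big and silicon valley is busy "
          "and the valley is quiet").split()

scores = jaccard_scores(tokens, min_freq=2)
for (pair, score) in sorted(scores.items(), key=lambda kv: -kv[1]):
    print('%.3f  %s' % (score, ' '.join(pair)))
```

Pairs whose words rarely occur apart score close to 1, which is why uncommon but tightly bound phrases such as "cabo pulmo" can outrank more frequent ones in the list above.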
