The data I'm using is the Obama White House press release dataset, from http://crawls.archive.org/collections/vinay/datasets/whitehouse-hackathon/warcs/ . First I visualize sentence structure: you can enter a phrase pattern and see how the press briefing statements are connected through that phrase. Next I model topics and visualize them.

In [1]:
import os
import json
import gzip
import time 
import warc
import urlparse
import fnmatch
import tldextract
from collections import Counter
from datetime import datetime
from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer
import numpy as np
import matplotlib.pyplot as plt

In [2]:
data = []
with gzip.open('OBAMA-WHITEHOUSE-HACKATHON-PRESS-RELEASES-EXTRACTION-WARCS-PART-00000-000000.warc.wat.gz', mode='rb') as gzf:
    for record in warc.WARCFile(fileobj=gzf):
        data.append(record.payload.read())
# drop the first record; it is not one of the press release pages
data = data[1:]

Some of the press releases have no text in the tag where the release text normally lives, for example because the page is an embedded video, so we have to account for that.

In [4]:
files = []
for i in data:
    payload = json.loads(i)
    try:
        url   = payload['Envelope']['WARC-Header-Metadata']['WARC-Target-URI']
        ts    = payload['Envelope']['WARC-Header-Metadata']['WARC-Date']
        ts_dt = datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ")
        title = payload['Envelope']['Payload-Metadata']['HTTP-Response-Metadata']['HTML-Metadata']['Head']['Title']
        meta  = payload['Envelope']['Payload-Metadata']['HTTP-Response-Metadata']['HTML-Metadata']['Head']['Metas']
        keywords = ''
        for elem in meta:
            if elem['name'] == 'keywords':
                keywords = elem['content']
                break
        description = ''
        for elem in meta:
            if elem['name'] == 'description':
                description = elem['content']
                break
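        text_terms = title + " " + keywords + " " + description
        # result is always a tuple here, so it can be appended unconditionally
        files.append(((url, ts_dt), text_terms))
    except (KeyError, TypeError, ValueError):
        # skip records that lack the expected metadata fields
        pass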
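
The lookups above assume every entry in `Metas` has a `name` key; an entry without one may raise `KeyError`, in which case the whole record is skipped. A minimal sketch of a more tolerant lookup, assuming the same JSON layout as above (the helpers `dig` and `find_meta` and the sample record are hypothetical, not part of the original notebook):

```python
def dig(record, *keys):
    """Walk nested dicts, returning None as soon as a key is missing."""
    for key in keys:
        if not isinstance(record, dict) or key not in record:
            return None
        record = record[key]
    return record


def find_meta(metas, name):
    """Return the content of the first meta entry with this name, or ''."""
    for elem in metas or []:
        if elem.get('name') == name:
            return elem.get('content', '')
    return ''


# Hypothetical record shaped like the envelope used above.
sample = {
    'Envelope': {
        'Payload-Metadata': {
            'HTTP-Response-Metadata': {
                'HTML-Metadata': {
                    'Head': {
                        'Title': 'Press Briefing',
                        'Metas': [
                            {'property': 'og:type', 'content': 'article'},
                            {'name': 'keywords', 'content': 'economy, jobs'},
                        ],
                    }
                }
            }
        }
    }
}

head = dig(sample, 'Envelope', 'Payload-Metadata',
           'HTTP-Response-Metadata', 'HTML-Metadata', 'Head')
keywords = find_meta(head['Metas'], 'keywords')        # entry without 'name' is ignored
description = find_meta(head['Metas'], 'description')  # absent, so empty string
```

With this approach a record that has a title but odd meta entries would still be kept, at the cost of empty keyword or description strings.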
        

Due to time constraints, we decided to look at only the first 500 press releases.

In [6]:
files2 = files[:500]
links = []
for i in files2:
    links.append(i[0][0])

In [7]:
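import requests
from bs4 import BeautifulSoup

content = []

def get_text(url):
    """Return the press release text, or None if the usual tag is missing."""
    html = requests.get(url).content
    soup = BeautifulSoup(html, 'lxml')
    try:
        text = soup.find(class_='legacy-content').text
    except AttributeError:
        # find() returned None: no text in the usual tag (e.g. an embedded video)
        return None
    return text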

In [14]:
for i in links:
    try:
        content.append(get_text(i))
    except:
        pass

In [16]:
content2 = [i for i in content if i]

Visualize sentence structures. A wide variety of phrase patterns can be visualized here; see the pattern documentation for the search syntax. To inspect the result, open the sent_struc_vis folder and click on the html file.

In [19]:
from pattern.en import parsetree
from pattern.search import search
from pattern.graph import Graph

def make_graph(sentence_structure):
    # Create the graph inside the function: g is reassigned below, so a
    # module-level g would raise UnboundLocalError here.
    g = Graph()
    for i in content2:
        s = parsetree(i)
        p = sentence_structure
        for m in search(p, s):
            x = m.group(1).string # first {...} group of the pattern
            y = m.group(2).string # second {...} group of the pattern
            if x not in g:
                g.add_node(x)
            if y not in g:
                g.add_node(y)
            g.add_edge(g[x], g[y], stroke=(0,0,0,0.75)) # R,G,B,A
    g = g.split()[0] # Largest subgraph.
    for n in g.sorted()[:40]: # Sort by Node.weight.
        n.fill = (0, 0.5, 1, 0.75 * n.weight)
    g.export('sent_struc_vis', directed=True, weighted=0.6, width=5000, height=1000)
    
make_graph('{VP} {NP} and {NP}')
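
Per the comment in the code, `g.split()[0]` keeps only the largest connected subgraph. A minimal sketch of that idea with plain dictionaries instead of `pattern.graph`, assuming made-up phrase pairs (the function `largest_component` and the sample pairs are illustrative only, and this is not how `pattern` implements it):

```python
from collections import defaultdict


def largest_component(pairs):
    """Return the largest set of nodes connected by the given edges."""
    neighbors = defaultdict(set)
    for x, y in pairs:
        neighbors[x].add(y)
        neighbors[y].add(x)

    seen = set()
    best = set()
    for start in neighbors:
        if start in seen:
            continue
        # Depth-first walk collecting everything reachable from start.
        component = set()
        stack = [start]
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(neighbors[node] - component)
        seen |= component
        if len(component) > len(best):
            best = component
    return best


# Hypothetical matched phrase pairs: two unconnected groups.
pairs = [('jobs', 'growth'), ('growth', 'trade'), ('health', 'reform')]
biggest = largest_component(pairs)
```

Phrases that only ever co-occur with each other, like the second group here, drop out of the exported visualization.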

Next we perform LDA to do some topic modeling.

In [20]:
from gensim import corpora, models, similarities, matutils
# sklearn
from sklearn import datasets
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.cluster import KMeans
# logging for gensim (set to INFO)
import logging
import nltk

In [21]:
# Mr. Gibbs is the press secretary, so we hastily add all variations of his name to our stopwords
stopwords = nltk.corpus.stopwords.words('english')
# extend() with the whole list adds each name as one entry; calling it on a
# single string would add that string's characters one by one
stopwords.extend(['Mr.', 'Gibbs', 'Mr. Gibbs', 'mr. gibbs', 'gibbs', 'mr gibbs', 'mr.'])
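
`list.extend` iterates over its argument, so a string passed to it is added one character at a time. A short illustration with a made-up list (`words` is a hypothetical name, not the notebook's stopword list):

```python
words = ['the', 'and']

# extend() iterates over its argument: a string contributes single characters.
by_chars = list(words)
by_chars.extend('gibbs')

# Wrapping the strings in a list adds them as whole entries.
by_words = list(words)
by_words.extend(['gibbs', 'mr.'])
```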

In [41]:
import string
def strip_proppers(text):
    # first tokenize by sentence, then by word to ensure that punctuation is caught as its own token
    tokens = [word for sent in nltk.sent_tokenize(text) for word in nltk.word_tokenize(sent) if word.islower()]
    return "".join([" "+i if not i.startswith("'") and i not in string.punctuation else i for i in tokens]).strip()
#%%
from nltk.tag import pos_tag
def strip_proppers_POS(text):
    tagged = pos_tag(text.split()) #use NLTK's part of speech tagger
    non_propernouns = [word for word,pos in tagged if pos != 'NNP' and pos != 'NNPS']
    return non_propernouns

In [42]:
preprocess = [strip_proppers(doc) for doc in content2]
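
The `word.islower()` test in `strip_proppers` drops more than proper nouns: any token containing an uppercase letter, or no letters at all (numbers, bare punctuation), fails it. A small sketch on a hand-tokenized sentence, assuming tokens shaped roughly like `nltk.word_tokenize` output (the token list is made up for illustration):

```python
import string

# Hypothetical tokens, roughly what a word tokenizer would produce.
tokens = ['The', 'President', 'said', 'he', "'s", 'ready', ',', 'in', '2010', '.']

# Same filter as strip_proppers: keep tokens whose cased letters are all lowercase.
kept = [t for t in tokens if t.islower()]

# Same join as strip_proppers: clitics such as "'s" attach to the previous word.
joined = "".join(
    " " + t if not t.startswith("'") and t not in string.punctuation else t
    for t in kept
).strip()
```

Sentence-initial words such as "The" are removed as well, which is harmless here because they are mostly stopwords.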

In [43]:
import re
# reuse the stopwords list built above; re-reading it from NLTK here would discard the added names
from nltk.stem.snowball import SnowballStemmer
stemmer = SnowballStemmer("english")

def tokenize_and_stem(text):
    # first tokenize by sentence, then by word to ensure that punctuation is caught as its own token
    tokens = [word for sent in nltk.sent_tokenize(text) for word in nltk.word_tokenize(sent)]
    filtered_tokens = []
    # filter out any tokens not containing letters (e.g., numeric tokens, raw punctuation)
    for token in tokens:
        if re.search('[a-zA-Z]', token):
            filtered_tokens.append(token)
    stems = [stemmer.stem(t) for t in filtered_tokens]
    return stems

In [44]:
tokenized_text = [tokenize_and_stem(text) for text in preprocess]

In [45]:
texts = [[word for word in text if word not in stopwords] for text in tokenized_text]


In [46]:
dictionary = corpora.Dictionary(texts)
dictionary.filter_extremes(no_below=1, no_above=0.8)


2017-01-13 01:02:41,630 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2017-01-13 01:02:41,867 : INFO : built Dictionary(10204 unique tokens: [u'governance-rel', u'faceti', u'vis-\xe0-vi', u'yellow', u'interchang']...) from 465 documents (total 311503 corpus positions)
2017-01-13 01:02:41,880 : INFO : discarding 0 tokens: []...
2017-01-13 01:02:41,892 : INFO : keeping 10204 tokens which were in no less than 1 and no more than 372 (=80.0%) documents
2017-01-13 01:02:41,908 : INFO : resulting dictionary: Dictionary(10204 unique tokens: [u'governance-rel', u'faceti', u'vis-\xe0-vi', u'yellow', u'interchang']...)
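
The log above reports the document-frequency window: with 465 documents, `no_above=0.8` allows a token in at most 372 of them, and `no_below=1` excludes nothing, which is why 0 tokens were discarded. A simplified sketch of that filter on a toy corpus, assuming the thresholds work as the log describes (the function `filter_by_doc_freq` and the toy documents are hypothetical, and gensim's own implementation does more, such as capping the vocabulary size):

```python
from collections import Counter


def filter_by_doc_freq(docs, no_below, no_above):
    """Keep tokens whose document frequency lies inside the allowed window."""
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count each token once per document
    max_docs = int(no_above * len(docs))
    return set(tok for tok, df in doc_freq.items()
               if no_below <= df <= max_docs)


# Toy stemmed documents, made up for illustration.
toy_docs = [
    ['presid', 'job', 'economi'],
    ['presid', 'health', 'care'],
    ['presid', 'job', 'trade'],
    ['presid', 'health', 'job'],
    ['presid', 'energi'],
]
kept = filter_by_doc_freq(toy_docs, no_below=2, no_above=0.8)

# The upper bound reported in the log for this notebook's 465 documents.
upper_bound = int(0.8 * 465)
```

In the toy corpus `presid` appears in every document and is dropped, while the tokens seen only once fall below `no_below=2`.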


In [47]:
corpus = [dictionary.doc2bow(text) for text in texts]
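
`doc2bow` turns each token list into sparse `(token_id, count)` pairs. A rough stdlib-only sketch of that representation, assuming ids are handed out in order of first appearance (gensim's actual id assignment may differ; `build_vocab` and `to_bow` are hypothetical names):

```python
from collections import Counter


def build_vocab(docs):
    """Map each distinct token to an integer id."""
    vocab = {}
    for doc in docs:
        for tok in doc:
            vocab.setdefault(tok, len(vocab))
    return vocab


def to_bow(doc, vocab):
    """Return sorted (token_id, count) pairs, ignoring unknown tokens."""
    counts = Counter(tok for tok in doc if tok in vocab)
    return sorted((vocab[tok], n) for tok, n in counts.items())


# Toy stemmed documents, made up for illustration.
toy_docs = [['job', 'economi', 'job'], ['health', 'job']]
vocab = build_vocab(toy_docs)
bows = [to_bow(doc, vocab) for doc in toy_docs]
```

Word order is lost in this representation; only the per-document counts are passed on to LDA.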

In [50]:
lda = models.LdaModel(corpus, num_topics=10, id2word=dictionary, update_every=5,chunksize=10000,passes=20)

2017-01-13 01:15:58,628 : INFO : using symmetric alpha at 0.1
2017-01-13 01:15:58,630 : INFO : using symmetric eta at 0.1
2017-01-13 01:15:58,631 : INFO : using serial LDA version on this node
2017-01-13 01:16:00,006 : INFO : running online LDA training, 10 topics, 20 passes over the supplied corpus of 465 documents, updating model once every 465 documents, evaluating perplexity every 465 documents, iterating 50x with a convergence threshold of 0.001000
2017-01-13 01:16:06,470 : INFO : -10.342 per-word bound, 1298.3 perplexity estimate based on a held-out corpus of 465 documents with 311503 words
2017-01-13 01:16:06,471 : INFO : PROGRESS: pass 0, at document #465/465
2017-01-13 01:16:07,717 : INFO : topic #4 (0.100): 0.017*de + 0.013*que + 0.011*'s + 0.008*la + 0.007*go + 0.006*en + 0.006*el + 0.005*para + 0.005*n't + 0.005*make
2017-01-13 01:16:07,720 : INFO : topic #3 (0.100): 0.027*'s + 0.017*think + 0.012*n't + 0.008*go + 0.007*get + 0.007*know + 0.006*said + 0.006*work + 0.006*lik
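
The perplexity estimate in the log follows from the per-word bound: it is 2 raised to the negated bound. A quick check using only the two values from the pass 0 log line above (the variable names are arbitrary):

```python
per_word_bound = -10.342      # as logged for pass 0
logged_perplexity = 1298.3    # as logged for pass 0

perplexity = 2 ** (-per_word_bound)

# The remaining gap comes from the bound being rounded to three decimals
# in the log line.
gap = abs(perplexity - logged_perplexity)
```

Lower perplexity (a bound closer to zero) means the model assigns higher probability to the words it is evaluated on.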

In [32]:
import pyLDAvis.gensim
pyLDAvis.enable_notebook()

Use this package to explore the topic model interactively. You can hover over different topics and view their term distributions. You can also slide the relevance metric.

In [51]:
pyLDAvis.gensim.prepare(lda, corpus, dictionary)
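
The relevance slider controls a weight λ used to rank terms within a topic: relevance = λ * log p(w|t) + (1 - λ) * log(p(w|t) / p(w)). A small sketch of that ranking on made-up probabilities (the numbers and the function name `relevance` are illustrative; they are not taken from the fitted model):

```python
import math


def relevance(p_w_given_t, p_w, lam):
    """Weighted mix of log-probability and log-lift for one term."""
    return lam * math.log(p_w_given_t) + (1 - lam) * math.log(p_w_given_t / p_w)


# Hypothetical probabilities: term -> (p(w | topic), p(w) over the corpus).
terms = {
    'think': (0.017, 0.015),    # frequent everywhere
    'para': (0.005, 0.0006),    # rarer, but concentrated in this topic
}


def ranked(lam):
    """Terms sorted from most to least relevant for a given lambda."""
    return sorted(terms,
                  key=lambda w: relevance(terms[w][0], terms[w][1], lam),
                  reverse=True)


by_probability = ranked(1.0)   # lam = 1: rank by p(w | topic) alone
by_lift = ranked(0.0)          # lam = 0: rank by lift p(w | topic) / p(w)
```

Moving the slider toward 0 therefore surfaces terms that are distinctive for a topic rather than merely frequent.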

If we had more time, we could try different numbers of topics (and other settings) and keep the configuration with the best coherence.

In [52]:
from gensim.models.coherencemodel import CoherenceModel

In [57]:
cm = CoherenceModel(model=lda, corpus=corpus, dictionary=dictionary, texts = texts, coherence='c_v')

In [58]:
print cm.get_coherence()

0.879635348211
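
The search over topic counts mentioned earlier could look roughly like this. It is a sketch, not something that was run for this notebook: `coherence_for` and `best_num_topics` are hypothetical helpers, and the gensim calls simply mirror the ones used in the cells above.

```python
def coherence_for(num_topics, corpus, dictionary, texts):
    """Train an LDA model with num_topics topics and return its c_v coherence."""
    # Imported lazily so the selection helper below can be used on its own.
    from gensim import models
    from gensim.models.coherencemodel import CoherenceModel

    lda = models.LdaModel(corpus, num_topics=num_topics,
                          id2word=dictionary, passes=20)
    cm = CoherenceModel(model=lda, texts=texts, dictionary=dictionary,
                        coherence='c_v')
    return cm.get_coherence()


def best_num_topics(candidates, score):
    """Score every candidate topic count and return (best, all_scores)."""
    scores = dict((k, score(k)) for k in candidates)
    best = max(scores, key=scores.get)
    return best, scores
```

With the notebook's variables this would be called as `best_num_topics([5, 10, 15, 20], lambda k: coherence_for(k, corpus, dictionary, texts))`, where the candidate values are arbitrary. Each candidate retrains the model, so the runtime grows with the number of candidates.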


In [60]:
pyLDAvis.save_html(pyLDAvis.gensim.prepare(lda, corpus, dictionary),'lda_vis.html')