#### To Do:
1) Clean email body
* Remove proper nouns
* Remove any text after "U.S. Department of State" or "Sent from Verizon"
* Remove email addresses
* Remove dates, especially those in the format "Monday, January 18, 2010 11:52 AM"
* Remove times
* Remove web addresses
* Remove any strange non-English words
* How do I deal with emails that list the schedule for the day?
* Change all occurrences of "pis" to "pls", assuming an OCR error

2) Run K-Means Clustering
3) Run LDA

* After defining topics, determine the similarity between emails for clustering, possibly clustering the texts with KMeans
* Need to stem words? Keep a dictionary that maps each stem back to a full word, since the stems don't make sense on their own
* Look specifically at emails with **fwd** 

* Locality-sensitive hashing
* Look into lemmatization
* Chunking 
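
The regex-friendly cleaning steps under item 1 could be sketched roughly as follows. This is only an illustration: `clean_body_sketch` and the patterns are hypothetical and are not the `email_to_words` function from `email_to_words2.py`, whose behavior is not shown in this notebook. Removing proper nouns and non-English words is left out because it needs a tagger or a word list.

```python
import re

# Hypothetical helper for illustration; the notebook's real cleaner lives in
# email_to_words2.py and may work differently.
SIGNATURE_MARKERS = ("U.S. Department of State", "Sent from Verizon")

EMAIL_RE = re.compile(r"\S+@\S+")
URL_RE = re.compile(r"(?:https?://|www\.)\S+")
# Matches dates such as "Monday, January 18, 2010"
DATE_RE = re.compile(
    r"(?:Monday|Tuesday|Wednesday|Thursday|Friday|Saturday|Sunday),\s+"
    r"[A-Z][a-z]+\s+\d{1,2},\s+\d{4}"
)
# Matches times such as "11:52 AM" or "6:18 PM"
TIME_RE = re.compile(r"\b\d{1,2}:\d{2}(?:\s*[AP]M)?", re.IGNORECASE)
# "pis" as a standalone word is assumed to be an OCR misreading of "pls"
PIS_RE = re.compile(r"\bpis\b", re.IGNORECASE)


def clean_body_sketch(text):
    # Drop everything from a signature marker onward
    for marker in SIGNATURE_MARKERS:
        position = text.find(marker)
        if position != -1:
            text = text[:position]
    text = EMAIL_RE.sub(" ", text)
    text = URL_RE.sub(" ", text)
    text = DATE_RE.sub(" ", text)
    text = TIME_RE.sub(" ", text)
    text = PIS_RE.sub("pls", text)
    # Collapse the whitespace left behind by the removals
    return " ".join(text.split())


sample = (
    "Pis print. Monday, January 18, 2010 11:52 AM see "
    "http://example.com or write to someone@example.com "
    "Sent from Verizon Wireless"
)
print(clean_body_sketch(sample))
```

Email addresses are removed before URLs and dates before times, so that a broader pattern does not consume part of a narrower match.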

### Import Libraries

In [1]:
import pandas as pd
import numpy as np
import sqlite3
import itertools
from nltk.corpus import stopwords
import nltk
import re
from sklearn.feature_extraction.text import CountVectorizer
from nltk.stem.porter import PorterStemmer
from nltk.stem.snowball import SnowballStemmer
from nltk.stem.lancaster import LancasterStemmer
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
%matplotlib inline
from nltk.tokenize import RegexpTokenizer
from gensim import corpora, models, similarities, matutils
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.DEBUG)

In [2]:
con = sqlite3.connect('input/database.sqlite')
cursor = con.cursor()
cursor.execute("SELECT name FROM sqlite_master WHERE type='table';")
[x for x in cursor.fetchall()]

[(u'Emails',), (u'Persons',), (u'Aliases',), (u'EmailReceivers',)]

In [3]:
emails= pd.read_sql_query("Select * From Emails where ExtractedBodyText!= ''",con)
persons=pd.read_sql_query("Select * From Persons",con)
longemails= pd.read_sql_query("Select * From Emails where length(ExtractedBodyText)>500 \
                                and ExtractedBodyText!= ''",con)
aliases=pd.read_sql_query("Select * From Aliases", con)

In [4]:
print("Number of Emails: %d" % emails.shape[0])

Number of Emails: 6742


In [5]:
import random
# Sample 10 distinct row positions, using the actual row count instead of a hard-coded 6742
my_randoms = random.sample(xrange(emails.shape[0]), 10)
for count, i in enumerate(my_randoms, 1):
    # print("") emits a blank line under both Python 2 and Python 3
    print("")
    print("Random Email #%d" % count)
    print("")
    print(emails["ExtractedBodyText"][i])
    print("")


Random Email #1

fyi


Random Email #2

Survive, yes. Pat helped level set things tonight and we'll see where we are in the morning.


Random Email #3

sbwhoeop
Friday, October 22, 2010 6:18 PM
H: Iran hikers info. Sid
Just now the NYT is reporting what I assume you already know, that the US hikers held prisoner in Iran were grabbed in
Iraq. I also assume that you saw the report last June that they were seized by a Revolutionary Guard officer now held for
drug running and murder.
It seems that the hikers were too close to a drug trail. I will send you memos on latest political intel and policy ideas for
Europe in next few days. Sid
US Hikers Were Seized in Iraq
By Babak Sarfaraz
Posted on June 24, 2010, Printed on October 22, 2010
http://www.theinvestigativefund.org/investigations/internationa1/1338/
KURDISTAN PROVINCE, IRAN—
Since their arrest last July by Iranian forces near the Iraq
border, three Americans — Shane Bauer, Josh Fattal and
Sarah Shourd — have been at t

### Clean up text of email body

In [12]:
# Uncomment the next line to download the stop words if they are not already installed.
#nltk.download()

In [6]:
#Define email_to_words function to clean email body
#import email_to_words as email_to_words
%run ./email_to_words2.py

In [7]:
# Get the number of emails based on the dataframe column size
num_emails = emails["ExtractedBodyText"].size

# Initialize an empty list to hold the clean emails
clean_emails = []
stemmed_emails = []
# Loop over each email; create an index i that goes from 0 to the number
# of emails
for i in xrange( 0, num_emails ):
    # Print a progress message after every 1000th email
    if( (i+1)%1000 == 0 ):
        print ("Cleaning email %d of %d\n" % ( i+1, num_emails ))

    # Call our function for each one, and add the result to the list of
    # clean emails
    try:
        clean_emails.append( email_to_words( emails["ExtractedBodyText"][i] , english=False) )

    except Exception as e:
        # Fall back to a cleaned placeholder sentence if an email cannot be processed
        clean_emails.append( email_to_words("I'm a placeholder sentence.", english=False) )
        print ("Exception raised:", e)

Cleaning email 1000 of 6742

Cleaning email 2000 of 6742

Cleaning email 3000 of 6742

Cleaning email 4000 of 6742

Cleaning email 5000 of 6742

Cleaning email 6000 of 6742



In [8]:
#Remove any empty emails
clean_emails=[w for w in clean_emails if len(w)>0]

In [9]:
len(clean_emails)

6339

In [10]:
import pickle  # or import cPickle as pickle

# Write the cleaned emails to a file; the with block closes the file handle
with open("output/clean_emails.p", "wb") as f:
    pickle.dump(clean_emails, f)


#### Tokenize and Stem

In [11]:
def tokenize_and_stem(text):
    # first tokenize by sentence, then by word to ensure that punctuation is caught as its own token
    tokens = [word for sent in nltk.sent_tokenize(text) for word in nltk.word_tokenize(sent)]
    filtered_tokens = []
    # filter out any tokens not containing letters (e.g., numeric tokens, raw punctuation)
    for token in tokens:
        if re.search('[a-zA-Z]', token):
            filtered_tokens.append(token)
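    # p_stemmer is not defined in this notebook; it is assumed to be a stemmer
    # instance (e.g. PorterStemmer()) created by email_to_words2.py
    stems = [p_stemmer.stem(t) for t in filtered_tokens]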
    return stems


def tokenize_only(text):
    # first tokenize by sentence, then by word to ensure that punctuation is caught as its own token
    tokens = [word.lower() for sent in nltk.sent_tokenize(text) for word in nltk.word_tokenize(sent)]
    filtered_tokens = []
    # filter out any tokens not containing letters (e.g., numeric tokens, raw punctuation)
    for token in tokens:
        if re.search('[a-zA-Z]', token):
            filtered_tokens.append(token)
    return filtered_tokens

In [12]:
totalvocab_stemmed = []
totalvocab_tokenized = []
for i in clean_emails:
    allwords_stemmed = tokenize_and_stem(i) #for each item in 'clean_emails', tokenize/stem
    totalvocab_stemmed.extend(allwords_stemmed) #extend the 'totalvocab_stemmed' list
    
    allwords_tokenized = tokenize_only(i)
    totalvocab_tokenized.extend(allwords_tokenized)

In [13]:
vocab_frame = pd.DataFrame({'words': totalvocab_tokenized}, index = totalvocab_stemmed)
print ('There are ' + str(vocab_frame.shape[0]) + ' items/words in vocab_frame')

There are 292327 items/words in vocab_frame


#### TF-IDF

In [14]:
#Define vectorizer
tfidf_vectorizer=TfidfVectorizer(stop_words='english', use_idf=True, tokenizer=tokenize_and_stem)

#Fit the vectorizer
%time tfidf_matrix=tfidf_vectorizer.fit_transform(clean_emails)

CPU times: user 12.1 s, sys: 110 ms, total: 12.2 s
Wall time: 12.5 s


In [15]:
print(tfidf_matrix.shape)

(6339, 17199)


In [16]:
terms=tfidf_vectorizer.get_feature_names()
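
To make the contents of `tfidf_matrix` more concrete, here is a small pure-Python sketch of the weighting. It assumes the smoothed IDF, ln((1 + n) / (1 + df)) + 1, followed by L2 normalization of each row, which is the scheme scikit-learn documents for its default settings. `tfidf_sketch` is a hypothetical name, and it does no tokenizing or stop-word removal, so it approximates rather than replaces `TfidfVectorizer`.

```python
import math
from collections import Counter


def tfidf_sketch(tokenized_docs):
    """Return the sorted vocabulary and one L2-normalised TF-IDF row per document."""
    n_docs = len(tokenized_docs)
    # Document frequency: number of documents that contain each term
    doc_freq = Counter(term for doc in tokenized_docs for term in set(doc))
    vocab = sorted(doc_freq)
    # Smoothed inverse document frequency (assumed to mirror scikit-learn's default)
    idf = {t: math.log((1.0 + n_docs) / (1.0 + doc_freq[t])) + 1.0 for t in vocab}

    rows = []
    for doc in tokenized_docs:
        counts = Counter(doc)
        row = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(w * w for w in row))
        rows.append([w / norm if norm else 0.0 for w in row])
    return vocab, rows


# Toy documents made of stems like those in this notebook
docs = [["pleas", "print"], ["pleas", "call"], ["offic", "depart", "room"]]
vocab, rows = tfidf_sketch(docs)
print(vocab)
print([round(w, 3) for w in rows[0]])
```

In the first toy document, "print" receives a higher weight than "pleas" because it appears in fewer documents.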


### K-Means Clustering

In [17]:
numb_clusters=10

kmeans=KMeans(n_clusters=numb_clusters, random_state=20151106)

%time kmeans.fit(tfidf_matrix)

clusters = kmeans.labels_.tolist()

CPU times: user 7.17 s, sys: 81.3 ms, total: 7.25 s
Wall time: 7.62 s


In [18]:
# Use a new name so the emails dataframe loaded earlier is not overwritten
email_clusters={'emails':clean_emails, 'cluster':clusters}

In [19]:
cluster_frame=pd.DataFrame(email_clusters, index=[clusters], columns=['cluster'])

In [20]:
#Number of emails per cluster
cluster_frame['cluster'].value_counts()

9    4335
0     544
3     474
2     211
5     180
1     168
7     153
6     116
8      88
4      70
Name: cluster, dtype: int64

In [21]:
from __future__ import print_function

print("Top terms per cluster:")
print ()
#For each centroid, order the term indices by descending weight
order_centroids = kmeans.cluster_centers_.argsort()[:,::-1]

for i in range(numb_clusters):
    print("Cluster %d terms:" % i, end='')
    
    for ind in order_centroids[i,:6]:
        
        print (" %s" % vocab_frame.ix[terms[ind].split(' ')].values.tolist()[0][0].encode('utf-8', 'ignore'), end=",")
    print()
    print()

Top terms per cluster:

Cluster 0 terms: tomorrow, today, ops, time, lauren, schedule,

Cluster 1 terms: thanks, please, help, know, let, great,

Cluster 2 terms: talked, wanted, needed, tomorrow, asking, know,

Cluster 3 terms: fyi, release, article, minute, story, process,

Cluster 4 terms: valmoro, assistance, special, states, direct, ecial,

Cluster 5 terms: email, please, assistance, needed, immediately, traveled,

Cluster 6 terms: yes, thanks, worked, setting, coming, sorry,

Cluster 7 terms: print, please, copy, pis, thanks, oscar,

Cluster 8 terms: office, depart, room, route, arrive, meet,

Cluster 9 terms: worked, know, release, wanted, fyi, good,
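
The ranking step in the cell above, `kmeans.cluster_centers_.argsort()[:,::-1]`, orders the term indices of each centroid from the largest weight to the smallest, so the first six indices are the terms that weigh most in that centroid. A pure-Python sketch of the same idea for a single centroid is below; `top_terms_sketch` and the toy weights are made up for illustration.

```python
def top_terms_sketch(centroid, terms, k=6):
    # Indices of the centroid's coordinates, largest weight first
    order = sorted(range(len(centroid)), key=lambda j: centroid[j], reverse=True)
    return [terms[j] for j in order[:k]]


# Made-up centroid weights for a four-term vocabulary
toy_terms = ["fyi", "print", "schedul", "thank"]
toy_centroid = [0.05, 0.40, 0.10, 0.30]
print(top_terms_sketch(toy_centroid, toy_terms, k=2))
```

The terms returned this way are stems. That is why the notebook looks each one up in `vocab_frame` to print a readable word.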



### Latent Dirichlet Allocation

In [22]:
from gensim import corpora, models, similarities 
#tokenize
%time token_emails = [tokenize_and_stem(text) for text in clean_emails]

CPU times: user 11.4 s, sys: 70.3 ms, total: 11.4 s
Wall time: 11.5 s


In [23]:
# turn our tokenized documents into an id <-> term dictionary
%time dictionary = corpora.Dictionary(token_emails)

#remove extremes
%time dictionary.filter_extremes(no_below=1, no_above=0.8)

dictionary.compactify()

# convert tokenized documents into a document-term matrix
%time corpus = [dictionary.doc2bow(text) for text in token_emails]


CPU times: user 531 ms, sys: 9.48 ms, total: 541 ms
Wall time: 541 ms
CPU times: user 54.4 ms, sys: 2.1 ms, total: 56.5 ms
Wall time: 57.4 ms
CPU times: user 391 ms, sys: 8.49 ms, total: 400 ms
Wall time: 402 ms
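
As a rough picture of what the dictionary filtering and bag-of-words conversion do, the sketch below drops tokens by document frequency and then counts the surviving tokens per document. The function names are hypothetical and the id assignment (alphabetical) is arbitrary. gensim's `Dictionary` assigns its own ids and also applies a `keep_n` cap, so this shows the idea rather than reproducing its exact behavior.

```python
from collections import Counter


def build_vocab_sketch(tokenized_docs, no_below=1, no_above=0.8):
    """Map each kept token to an integer id, dropping tokens by document frequency."""
    n_docs = len(tokenized_docs)
    # Number of documents each token appears in
    doc_freq = Counter(tok for doc in tokenized_docs for tok in set(doc))
    kept = sorted(
        tok for tok, df in doc_freq.items()
        if df >= no_below and df <= no_above * n_docs
    )
    return {tok: idx for idx, tok in enumerate(kept)}


def doc2bow_sketch(tokens, token2id):
    """Return sorted (token_id, count) pairs, ignoring tokens outside the vocabulary."""
    counts = Counter(token2id[tok] for tok in tokens if tok in token2id)
    return sorted(counts.items())


docs = [["state", "meet", "meet"], ["state", "call"], ["state", "print", "call"]]
token2id = build_vocab_sketch(docs)
print(token2id)
print(doc2bow_sketch(docs[0], token2id))
```

With `no_above=0.8`, the token "state" is dropped because it appears in all three toy documents. With `no_below=1`, no rare token is removed.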


In [24]:
# topics_range=(4,6,8,10,12)
topics_range=(4,6,8)

for i in topics_range:
    np.random.seed(20111106)
    # generate LDA model
    %time ldamodel = models.ldamodel.LdaModel(corpus, num_topics=i, id2word = dictionary, passes=100, alpha='auto')
    ldamodel.save('output/final_topic%d.model'%i)


CPU times: user 6min 30s, sys: 2.66 s, total: 6min 32s
Wall time: 6min 41s
CPU times: user 5min 59s, sys: 3.29 s, total: 6min 2s
Wall time: 6min 7s
CPU times: user 5min 50s, sys: 3.56 s, total: 5min 54s
Wall time: 5min 58s


In [30]:
for i in topics_range:
    final=models.ldamodel.LdaModel.load('output/final_topic%d.model'%i)
    count=0
    print ("%s TOPICS"%i)
    # Use a separate name for the inner loop so it does not shadow the topic count i
    for topic in (final.print_topics(num_words=8)):
        count+=1
        print ()
        print ('Topic# %s :' %count, topic)
        print ()

NameError: name 'topics_range' is not defined

In [52]:
final=models.ldamodel.LdaModel.load('output/final_topic10.model')


### Evaluation

In [43]:
# select the top 8 words for each topic of the loaded LDA model
top_words = [[word for _, word in final.show_topic(topicno, topn=8)] for topicno in range(final.num_topics)]
print(top_words)

[[u'offic', u'depart', u'meet', u'room', u'state', u'arriv', u'rout', u'confer'], [u'call', u'see', u'get', u'want', u'know', u'work', u'thank', u'talk'], [u'pleas', u'print', u'messag', u'list', u'thank', u'qddr', u'copi', u'add'], [u'releas', u'part', u'state', u'new', u'team', u'chapter', u'valmoro', u'branch'], [u'american', u'state', u'new', u'presid', u'one', u'year', u'said', u'obama'], [u'hous', u'obama', u'white', u'bill', u'said', u'presid', u'staff', u'senat'], [u'fyi', u'israel', u'isra', u'parti', u'palestinian', u'peac', u'arab', u'negoti'], [u'work', u'develop', u'state', u'new', u'polici', u'peopl', u'issu', u'world'], [u'republican', u'democrat', u'vote', u'senat', u'percent', u'elect', u'parti', u'candid'], [u'koch', u'tea', u'right', u'beck', u'parti', u'movement', u'book', u'skousen']]


In [44]:
# collect the top words of all topics into one large set
all_words = set(itertools.chain.from_iterable(top_words))

print("Can you find the word intruder?")

# for each topic, replace a word at a different index, to make it more interesting
replace_index = np.random.randint(0, 8, final.num_topics)

replacements = []
for topicno, words in enumerate(top_words):
    other_words = all_words.difference(words)
    replacement = np.random.choice(list(other_words))
    replacements.append((words[replace_index[topicno]], replacement))
    words[replace_index[topicno]] = replacement
    print("%i: %s" % (topicno, ' '.join(words[:10])))

Can you find the word intruder?
0: offic depart meet room state arriv rout obama
1: hous see get want know work thank talk
2: part print messag list thank qddr copi add
3: releas part state new team chapter tea branch
4: american state new presid one know said obama
5: hous know white bill said presid staff senat
6: fyi israel isra parti palestinian peac arab add
7: work arriv state new polici peopl issu world
8: republican polici vote senat percent elect parti candid
9: koch tea right one parti movement book skousen


In [45]:
print("Actual replacements were:")
print(list(enumerate(replacements)))

Actual replacements were:
[(0, (u'confer', u'obama')), (1, (u'call', u'hous')), (2, (u'pleas', u'part')), (3, (u'valmoro', u'tea')), (4, (u'year', u'know')), (5, (u'obama', u'know')), (6, (u'negoti', u'add')), (7, (u'develop', u'arriv')), (8, (u'democrat', u'polici')), (9, (u'beck', u'one'))]


In [46]:
def intra_inter(model, test_docs, num_pairs=10000):
    # split each test document into two halves and compute topics for each half
    # use integer division (//) so the slice index stays an int under Python 3 as well
    part1 = [model[dictionary.doc2bow(tokens[: len(tokens) // 2])] for tokens in test_docs]
    part2 = [model[dictionary.doc2bow(tokens[len(tokens) // 2 :])] for tokens in test_docs]
    
    # print computed similarities (uses cossim)
    print("average cosine similarity between corresponding parts (higher is better):")
    print(np.mean([matutils.cossim(p1, p2) for p1, p2 in zip(part1, part2)]))

    random_pairs = np.random.randint(0, len(test_docs), size=(num_pairs, 2))
    print("average cosine similarity between random parts (lower is better):")    
    print(np.mean([matutils.cossim(part1[i[0]], part2[i[1]]) for i in random_pairs]))
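
The check in `intra_inter` rests on cosine similarity between sparse topic vectors, that is, lists of `(topic_id, weight)` pairs. A minimal pure-Python sketch of that measure is shown here. `cossim_sketch` is a hypothetical stand-in for `matutils.cossim`, and the topic distributions are invented for illustration.

```python
import math


def cossim_sketch(vec1, vec2):
    """Cosine similarity between two sparse vectors given as (id, weight) pairs."""
    d1, d2 = dict(vec1), dict(vec2)
    norm1 = math.sqrt(sum(w * w for w in d1.values()))
    norm2 = math.sqrt(sum(w * w for w in d2.values()))
    if norm1 == 0.0 or norm2 == 0.0:
        # An empty or all-zero vector is treated as having no similarity
        return 0.0
    # Only ids present in both vectors contribute to the dot product
    dot = sum(w * d2[i] for i, w in d1.items() if i in d2)
    return dot / (norm1 * norm2)


# Invented topic distributions for the two halves of one email
first_half = [(0, 0.9), (3, 0.1)]
second_half = [(0, 0.8), (3, 0.2)]
# Invented distribution for half of an unrelated email
unrelated = [(5, 1.0)]

print(cossim_sketch(first_half, second_half))
print(cossim_sketch(first_half, unrelated))
```

A good topic model should score the two halves of the same email close to 1 and halves of unrelated emails close to 0, which is what the two averages printed by `intra_inter` compare.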

In [47]:
for i in topics_range:
    # The models were saved under output/, so load them from the same directory
    final=models.ldamodel.LdaModel.load('output/final_topic%d.model'%i)
    print("LDA results %s TOPICS:"%i)
    intra_inter(final, test_docs=token_emails)


### Results


In [48]:
import pyLDAvis.gensim as gensimvis
import pyLDAvis


In [53]:
vis_data = gensimvis.prepare(final, corpus, dictionary)
pyLDAvis.display(vis_data)

  return relevance.T.apply(lambda s: s.order(ascending=False).index).head(R)
