####To Do's:
1)Clean email body
* Need to remove proper nouns
* Need to remove any text after "U.S. Department of State" or "Sent from Verizon"
* Remove email addresses 
* Remove dates especially of the format "Monday, January 18, 2010 11:52 AM"
* Remove times
* Remove web addresses
* Need to remove any strange non-English words
* How do I deal with emails that list schedule for the day?
* Change all occurrences of "pis" to "pls", assuming an OCR error.

2)Run K-Means Clustering
3)Run LDA 

* After defining topics, determine similarity between emails for clustering, possibly with KMeans.
* Need to stem words? If so, keep a dictionary mapping each stem back to its original words, since the stems alone are hard to read.
* Look specifically at emails with **fwd** 

* Locality-sensitive hashing
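
Several of the cleaning items above (trailing signature text, email addresses, dates, times, web addresses, the "pis" OCR fix) can be sketched with regular expressions. This is only a minimal sketch, not the contents of `email_to_words.py`: the function name `scrub_email_body`, the pattern constants, and the exact patterns are assumptions made for illustration, and real email bodies will need broader patterns.

```python
import re

# Markers taken from the to-do list; everything from the marker onward is dropped.
CUTOFF_MARKERS = ("U.S. Department of State", "Sent from Verizon")

URL_RE = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")
# Matches dates such as "Monday, January 18, 2010" (assumed format only).
DATE_RE = re.compile(
    r"\b(?:Mon|Tues|Wednes|Thurs|Fri|Satur|Sun)day,\s+"
    r"(?:January|February|March|April|May|June|July|August|"
    r"September|October|November|December)\s+\d{1,2},\s+\d{4}\b"
)
# Matches times such as "11:52 AM" or "09:30".
TIME_RE = re.compile(
    r"\b\d{1,2}:\d{2}(?::\d{2})?(?:\s*[ap]\.?m\.?(?![a-z]))?", re.IGNORECASE
)
PIS_RE = re.compile(r"\bpis\b", re.IGNORECASE)


def scrub_email_body(text):
    """Hypothetical helper: strip boilerplate patterns from one email body."""
    # Drop everything from the first occurrence of each marker onward.
    for marker in CUTOFF_MARKERS:
        idx = text.find(marker)
        if idx != -1:
            text = text[:idx]
    # Treat the standalone token "pis" as an OCR misread of "pls".
    text = PIS_RE.sub("pls", text)
    # Replace each unwanted pattern with a space, then collapse whitespace.
    for pattern in (URL_RE, EMAIL_RE, DATE_RE, TIME_RE):
        text = pattern.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()
```

Proper nouns, non-English words, and daily schedules are not handled here; those items probably need a dictionary or part-of-speech based approach rather than patterns.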

###Import Libraries

In [55]:
import pandas as pd
import numpy as np
import sqlite3
from nltk.corpus import stopwords
import nltk
import re
from sklearn.feature_extraction.text import CountVectorizer
from nltk.stem.porter import PorterStemmer
from nltk.stem.snowball import SnowballStemmer
from nltk.stem.lancaster import LancasterStemmer
import matplotlib.pyplot as plt
%matplotlib inline
from nltk.tokenize import RegexpTokenizer
from gensim import corpora, models, similarities
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.DEBUG)

###Setup and Review Sqlite Database

In [57]:
con = sqlite3.connect('input/database.sqlite')
cursor = con.cursor()
cursor.execute("SELECT name FROM sqlite_master WHERE type='table';")
[x for x in cursor.fetchall()]

[(u'Emails',), (u'Persons',), (u'Aliases',), (u'EmailReceivers',)]

In [58]:
emails= pd.read_sql_query("Select * From Emails where ExtractedBodyText!= ''",con)
persons=pd.read_sql_query("Select * From Persons",con)
longemails= pd.read_sql_query("Select * From Emails where length(ExtractedBodyText)>500 \
                                and ExtractedBodyText!= ''",con)
aliases=pd.read_sql_query("Select * From Aliases", con)

In [59]:
print "Number of Emails: %s" % emails.shape[0]

Number of Emails: 6742


In [62]:
from email_to_words import email_to_words

###Clean up text of email body

In [5]:
# Uncomment the next line to download the NLTK stop words if they are not already installed.
#nltk.download()

In [69]:
#Define email_to_words function to clean email body
#import email_to_words as email_to_words
%run ./email_to_words.py

In [70]:
# Get the number of emails based on the dataframe column size
num_emails = emails["ExtractedBodyText"].size

# Initialize an empty list to hold the clean emails
clean_emails = []
stemmed_emails = []
# Loop over each email; create an index i that goes from 0 to the length
# of the emails 
for i in xrange( 0, num_emails ):
    # If the index is evenly divisible by 1000, print a message
    if( (i+1)%1000 == 0 ):
        print "Cleaning email %d of %d\n" % ( i+1, num_emails )                                                             
  
    # Call our function for each one, and add the result to the list of
    # clean emails
    try:
        clean_emails.append( email_to_words( emails["ExtractedBodyText"][i] ) )

    except Exception as e:
        clean_emails.append( email_to_words("I'm a placeholder sentence."))
        print "Exception raised:", e

Cleaning email 1000 of 6742

Cleaning email 2000 of 6742

Cleaning email 3000 of 6742

Cleaning email 4000 of 6742

Cleaning email 5000 of 6742

Cleaning email 6000 of 6742



####Tokenize and Stem 

In [71]:
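# Porter stemmer instance used by tokenize_and_stem and by the LDA preprocessing
p_stemmer = PorterStemmer()

def tokenize_and_stem(text):
    # first tokenize by sentence, then by word to ensure that punctuation is caught as its own token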
    tokens = [word for sent in nltk.sent_tokenize(text) for word in nltk.word_tokenize(sent)]
    filtered_tokens = []
    # filter out any tokens not containing letters (e.g., numeric tokens, raw punctuation)
    for token in tokens:
        if re.search('[a-zA-Z]', token):
            filtered_tokens.append(token)
    stems = [p_stemmer.stem(t) for t in filtered_tokens]
    return stems


def tokenize_only(text):
    # first tokenize by sentence, then by word to ensure that punctuation is caught as its own token
    tokens = [word.lower() for sent in nltk.sent_tokenize(text) for word in nltk.word_tokenize(sent)]
    filtered_tokens = []
    # filter out any tokens not containing letters (e.g., numeric tokens, raw punctuation)
    for token in tokens:
        if re.search('[a-zA-Z]', token):
            filtered_tokens.append(token)
    return filtered_tokens

In [72]:
#use extend so it's a big flat list of vocab
totalvocab_stemmed = []
totalvocab_tokenized = []
for i in clean_emails:
    allwords_stemmed = tokenize_and_stem(i) #for each item in 'clean_emails', tokenize/stem
    totalvocab_stemmed.extend(allwords_stemmed) #extend the 'totalvocab_stemmed' list
    
    allwords_tokenized = tokenize_only(i)
    totalvocab_tokenized.extend(allwords_tokenized)

In [14]:
vocab_frame = pd.DataFrame({'words': totalvocab_tokenized}, index = totalvocab_stemmed)
print 'There are ' + str(vocab_frame.shape[0]) + ' items/words in vocab_frame'

There are 316099 items/words in vocab_frame


In [23]:
vocab_frame.sort_values(by=["words"])

Unnamed: 0,words
aaaaaaaabhm,aaaaaaaabhm
aab,aab
aab,aab
aafia,aafia
aar,aar
aar,aar
aar,aar
aardian,aardian
aaron,aaron
aaronovitch,aaronovitch
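
Because `vocab_frame` is indexed by stem, it can serve as the stem dictionary mentioned in the to-do list. A minimal sketch of the same idea in plain Python follows, assuming the stemmed and tokenized lists are aligned position by position (as `totalvocab_stemmed` and `totalvocab_tokenized` appear to be). The helper `build_stem_lookup` and the sample stems and tokens are hypothetical.

```python
from collections import Counter, defaultdict


def build_stem_lookup(stems, tokens):
    """Map each stem to the original word it was most often derived from."""
    forms_by_stem = defaultdict(Counter)
    for stem, token in zip(stems, tokens):
        forms_by_stem[stem][token] += 1
    # Keep only the most frequent surface form for each stem.
    return {stem: forms.most_common(1)[0][0]
            for stem, forms in forms_by_stem.items()}


# Hypothetical aligned lists, for illustration only.
sample_stems = ["presid", "presid", "presid", "secretari", "negoti"]
sample_tokens = ["president", "presidential", "president",
                 "secretary", "negotiations"]
stem_lookup = build_stem_lookup(sample_stems, sample_tokens)
```

A lookup like this would let stemmed topic terms such as `presid` or `secretari` be reported as readable words.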


In [73]:
from nltk.corpus import wordnet

In [8]:
clean_emails

[u'latest syria aiding qaddafi sid hrc memo syria aiding libya docx hrc memo syria aiding libya docx hillary',
 u'thx',
 u'huma abedin latest syria aiding qaddafi sid hrc memo syria aiding libya docx pls print',
 u'pls print meet right wing extremist behind anti fvluslim film sparked deadly riots meat sent subject meet right wing extremist behind anti muslim film sparked deadly riots htte maxbiumenthal com meet right wing extremist behind anti musiim tihn sparked deadly riots sent verizon wireless lte droid department state case doc date state dept produced house select benghazi comm subject agreement sensitive information redactions foia waiver state',
 u'huma abedin latest syria aiding qaddafi sid hrc memo syria aiding libya docx pls print',
 u'fyi',
 u'fwd libya libya sept docx sending direct sent verizon wireless lte druid',
 u'fyi',
 u'fwd libya libya sept docx sending direct sent verizon wireless lte druid',
 u'fyi',
 u'anne marie slaughter jacob mills cheryl abedin hurtle piece 

### Latent Dirichlet Allocation

In [41]:
#tokenize cleaned emails
token_emails = []
tokenizer = RegexpTokenizer(r'\w+')

# loop through document list
for i in clean_emails:
    
    # clean and tokenize document string
    raw = i.lower()
    tokens = tokenizer.tokenize(raw)
   
    # stem tokens
    stemmed_tokens = [p_stemmer.stem(t) for t in tokens]
    
    # add tokens to list
    token_emails.append(stemmed_tokens)


In [None]:
# turn our tokenized documents into an id <-> term dictionary
dictionary = corpora.Dictionary(token_emails)

# convert tokenized documents into a document-term matrix
corpus = [dictionary.doc2bow(text) for text in token_emails]
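
To make the two steps in the cell above concrete, here is a rough plain-Python illustration of what an id <-> term dictionary and a bag-of-words conversion produce. It is not gensim's implementation, and the ids gensim assigns may differ. The helpers `build_vocabulary` and `to_bag_of_words` and the toy documents are assumptions for illustration.

```python
from collections import Counter


def build_vocabulary(tokenized_docs):
    """Assign an integer id to every distinct token, in first-seen order."""
    token_to_id = {}
    for doc in tokenized_docs:
        for token in doc:
            if token not in token_to_id:
                token_to_id[token] = len(token_to_id)
    return token_to_id


def to_bag_of_words(doc, token_to_id):
    """Return sorted (token_id, count) pairs, skipping unknown tokens."""
    counts = Counter(token_to_id[t] for t in doc if t in token_to_id)
    return sorted(counts.items())


# Toy documents standing in for token_emails.
toy_docs = [["pls", "print"], ["syria", "libya", "memo", "libya"]]
toy_vocab = build_vocabulary(toy_docs)
toy_corpus = [to_bag_of_words(doc, toy_vocab) for doc in toy_docs]
```

Each document becomes a sparse list of (id, count) pairs, which is the general shape of input that `LdaModel` consumes as `corpus`.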

In [65]:
# generate LDA model
%time ldamodel = models.ldamodel.LdaModel(corpus, num_topics=10, id2word = dictionary, passes=30)

#generate latent semantic indexing
#lsi = models.lsimodel.LsiModel(corpus=corpus, id2word=dictionary, num_topics=3)


CPU times: user 2min 2s, sys: 810 ms, total: 2min 3s
Wall time: 2min 3s


In [66]:
for i in (ldamodel.print_topics()):
    print ""
    print i
    print ""
    


0.007*obama + 0.007*american + 0.007*would + 0.006*presid + 0.006*polit + 0.006*parti + 0.006*new + 0.005*state + 0.005*one + 0.005*time


0.035*senat + 0.034*call + 0.016*vote + 0.014*bill + 0.011*boehner + 0.009*said + 0.009*op + 0.008*hous + 0.007*republican + 0.006*john


0.038*fyi + 0.016*clinton + 0.015*percent + 0.015*nuclear + 0.012*presid + 0.011*new + 0.011*state + 0.010*obama + 0.009*treati + 0.007*start


0.042*israel + 0.033*isra + 0.028*settlement + 0.022*palestinian + 0.015*netanyahu + 0.014*arab + 0.014*jewish + 0.012*peac + 0.009*negoti + 0.009*state


0.037*cheryl + 0.034*print + 0.029*mill + 0.028*pls + 0.026*huma + 0.025*sullivan + 0.023*abedin + 0.019*hrc + 0.018*richard + 0.017*yes


0.044*secretari + 0.041*state + 0.040*depart + 0.040*offic + 0.024*meet + 0.019*room + 0.014*arriv + 0.013*hous + 0.012*rout + 0.011*privat


0.012*work + 0.009*govern + 0.009*develop + 0.008*peopl + 0.008*need + 0.008*women + 0.007*effort + 0.007*china + 0.006*state + 0.006*includ
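
With topics defined, the to-do item about similarity between emails could be approached by comparing per-email topic distributions. The following is a minimal sketch, assuming each email has already been converted to a sparse list of (topic_id, probability) pairs. The function name `topic_cosine_similarity` and the sample values are hypothetical, and cosine similarity is just one possible choice of measure.

```python
import math


def topic_cosine_similarity(topics_a, topics_b):
    """Cosine similarity between two sparse (topic_id, weight) lists."""
    a, b = dict(topics_a), dict(topics_b)
    dot = sum(weight * b.get(topic, 0.0) for topic, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an email with no topic weight is similar to nothing
    return dot / (norm_a * norm_b)


# Hypothetical topic mixtures for three emails.
email_1 = [(3, 0.9), (6, 0.1)]
email_2 = [(3, 0.7), (0, 0.3)]
email_3 = [(5, 1.0)]
sim_close = topic_cosine_similarity(email_1, email_2)
sim_far = topic_cosine_similarity(email_1, email_3)
```

Pairwise scores like these could feed a clustering step, although very short emails such as "fyi" or "thx" may produce unstable topic mixtures.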



###Create a Bag-of-Words

In [76]:
from nltk.corpus import wordnet


In [77]:
text=["Elayna","food","Boehner","John","pls"]
%time
for w in text:
    if not wordnet.synsets(w):
        print w

CPU times: user 5 µs, sys: 3 µs, total: 8 µs
Wall time: 12.2 µs
Elayna
Boehner
pls


In [43]:
from nltk.corpus import webtext
words=webtext.words()

In [52]:
grail=webtext.raw('overheard.txt')

In [None]:
# Take a look at the words in the vocabulary
print "Number of words in the email corpus: %s" % len(vocab)
print "Number of stemmed words in the email corpus: %s" % len(stemvocab)

print vocab[:50]
print ""
print stemvocab[:50]

In [None]:
# Sum up the counts of each vocabulary word
dist = np.sum(stem_data_features.toarray(), axis=0)

# For each, print the vocabulary word and the number of times it 
# appears in the email corpus
for tag, count in zip(stemvocab, dist)[:1000]:
    print tag, count
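
The per-word totals can also be computed directly from tokenized text, which avoids building a dense array with `toarray()`. This is a small sketch of that alternative, not the method used in the cell above. The helper `term_frequencies` and the toy documents are assumptions for illustration.

```python
from collections import Counter


def term_frequencies(tokenized_docs):
    """Count how often each token appears across all documents."""
    counts = Counter()
    for doc in tokenized_docs:
        counts.update(doc)
    return counts


# Toy documents standing in for the stemmed emails.
sample_docs = [["pls", "print"], ["fyi"], ["pls", "print", "memo"]]
freq = term_frequencies(sample_docs)

# Show the most frequent terms first.
for term, count in freq.most_common(2):
    print("%s %d" % (term, count))
```

These totals correspond to the column sums of a document-term count matrix built from the same tokens.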