# Table of Contents
<div class="toc" style="margin-top: 1em;"><ul class="toc-item" id="toc-level0"><li><span><a href="#Data-Preparation" data-toc-modified-id="Data-Preparation-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Data Preparation</a></span></li><li><span><a href="#Running-W2V" data-toc-modified-id="Running-W2V-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Running W2V</a></span></li></ul></div>

## Data Preparation

In [2]:
import pandas as pd

In [3]:
df = pd.read_csv("emails.csv")

In [4]:
df.head()

                       file                                            message
0     allen-p/_sent_mail/1.  Message-ID: <18782981.1075855378110.JavaMail.e...
1    allen-p/_sent_mail/10.  Message-ID: <15464986.1075855378456.JavaMail.e...
2   allen-p/_sent_mail/100.  Message-ID: <24216240.1075855687451.JavaMail.e...
3  allen-p/_sent_mail/1000.  Message-ID: <13505866.1075863688222.JavaMail.e...
4  allen-p/_sent_mail/1001.  Message-ID: <30922949.1075863688243.JavaMail.e...


In [5]:
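def extract_message(string):
    # The body follows the last header line, which contains 'FileName:'.
    start = string.find('FileName:')
    if start == -1:
        return string  # header not found: keep the full text
    end_of_line = string.find('\n', start)
    # Keep everything after the header line (including the leading newlines).
    return string[end_of_line:] if end_of_line != -1 else ''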

In [6]:
df['content'] = df.message.apply(extract_message)

In [7]:
df.content.head(5)

0                        \n\nHere is our forecast\n\n 
1    \n\nTraveling to have a business meeting takes...
2                   \n\ntest successful.  way to go!!!
3    \n\nRandy,\n\n Can you send me a schedule of t...
4              \n\nLet's shoot for Tuesday at 11:45.  
Name: content, dtype: object
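
The slicing in `extract_message` depends on a `FileName:` header being the last header line of every message. As an alternative, the standard-library `email` parser can separate headers from body without that assumption. The following is only a minimal sketch: `extract_body` is a hypothetical helper name, and the sample message is made up to mimic the layout of the `message` column.

```python
from email import message_from_string

def extract_body(raw_message):
    """Return the body of an RFC 822 style message, without its headers."""
    msg = message_from_string(raw_message)
    payload = msg.get_payload()
    # Multipart messages return a list of parts; this sketch only handles plain text.
    return payload if isinstance(payload, str) else ""

# Made-up sample, not taken from emails.csv.
sample = (
    "Message-ID: <123.456.JavaMail.example>\n"
    "Subject: forecast\n"
    "X-FileName: example.nsf\n"
    "\n"
    "Here is our forecast\n"
)
body = extract_body(sample)
print(repr(body))
```

Unlike `extract_message`, this returns the body without the leading blank lines, so the two are not drop-in replacements for each other.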

In [8]:
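import re
# Keep only tokens that contain at least one letter (drops bare numbers and punctuation).
WORD_RE = re.compile(r'\w*[A-Za-z]\w*')
def getWords(text):
    return " ".join(WORD_RE.findall(text))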

In [9]:
from nltk.tokenize import TreebankWordTokenizer
# Build the tokenizer once rather than on every call.
tokenizer = TreebankWordTokenizer()
def token(text):
    return tokenizer.tokenize(text)

In [10]:
df_test = df[1:50].copy()  # copy the slice so that adding a column does not raise SettingWithCopyWarning

In [11]:
df_test['words'] = df_test.content.apply(getWords)

In [12]:
df_test['words'] = df_test.content.apply(token)

In [13]:
df_test.words.head()

1    [Traveling, to, have, a, business, meeting, ta...
2            [test, successful., way, to, go, !, !, !]
3    [Randy, ,, Can, you, send, me, a, schedule, of...
4         [Let, 's, shoot, for, Tuesday, at, 11:45, .]
5    [Greg, ,, How, about, either, next, Tuesday, o...
Name: words, dtype: object

In [14]:
df['words'] = df.content.apply(getWords)

In [15]:
df.words.head()

0                                 Here is our forecast
1    Traveling to have a business meeting takes the...
2                            test successful way to go
3    Randy Can you send me a schedule of the salary...
4                           Let s shoot for Tuesday at
Name: words, dtype: object
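
Row 4 above shows what the regular expression in `getWords` discards: the apostrophe splits "Let's" into two tokens, and "11:45" disappears because neither half contains a letter. A small stand-alone check, using the same pattern under the hypothetical name `get_words`, makes this visible:

```python
import re

# Same pattern as in getWords: a run of word characters with at least one letter.
WORD_PATTERN = re.compile(r'\w*[A-Za-z]\w*')

def get_words(text):
    return " ".join(WORD_PATTERN.findall(text))

cleaned = get_words("Let's shoot for Tuesday at 11:45.  ")
print(cleaned)  # Let s shoot for Tuesday at
```

Tokens that mix letters and digits are kept whole, since the pattern only requires one letter somewhere in the run.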

In [16]:
df_shuffled = df.sample(frac=1, random_state=42).reset_index(drop=True)  # fixed seed makes the shuffle reproducible

In [17]:
test = df_shuffled[0:1000]

## Running W2V

In [18]:
from sklearn.feature_extraction.text import CountVectorizer

In [19]:
n_features = 1000

In [20]:
tf_vectorizer = CountVectorizer(strip_accents = 'unicode',
                                max_features=n_features,
                                stop_words='english',
                                max_df = 0.5,
                                min_df = 10)
tf = tf_vectorizer.fit_transform(test.words)
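
With `max_df = 0.5` and `min_df = 10`, the vectorizer ignores terms that occur in more than half of the sampled emails or in fewer than 10 of them. The pure-Python sketch below illustrates that document-frequency filter on a toy corpus. It is not scikit-learn's implementation: `filter_vocabulary` is a hypothetical helper, the documents are made up, and stop words, accent stripping and `max_features` are left out.

```python
from collections import Counter

def filter_vocabulary(docs, min_df=1, max_df=1.0):
    """Keep terms whose document frequency lies within [min_df, max_df].

    As with the CountVectorizer arguments, an int is read as an absolute
    document count and a float as a proportion of the documents.
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc.lower().split()))  # count each term once per document
    low = min_df if isinstance(min_df, int) else min_df * n_docs
    high = max_df if isinstance(max_df, int) else max_df * n_docs
    return sorted(term for term, count in doc_freq.items() if low <= count <= high)

toy_docs = [
    "gas price forecast",
    "gas contract",
    "gas price meeting",
    "gas meeting agenda",
]
vocab = filter_vocabulary(toy_docs, min_df=2, max_df=0.5)
print(vocab)  # ['meeting', 'price']
```

Here "gas" is dropped for appearing in every document, and "forecast", "contract" and "agenda" for appearing in only one.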

In [21]:
from sklearn.decomposition import LatentDirichletAllocation

In [28]:
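n_topics = 10
# `n_components` is the current name of this parameter; scikit-learn releases
# before 0.19 called it `n_topics`, which is what the repr below reflects.
lda = LatentDirichletAllocation(n_components=n_topics, max_iter=50,
                                learning_method='batch',
                                learning_offset=50.,  # only used by the 'online' method
                                random_state=42)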

In [29]:
lda.fit(tf)

LatentDirichletAllocation(batch_size=128, doc_topic_prior=None,
             evaluate_every=-1, learning_decay=0.7,
             learning_method='batch', learning_offset=50.0,
             max_doc_update_iter=100, max_iter=50, mean_change_tol=0.001,
             n_jobs=1, n_topics=10, perp_tol=0.1, random_state=42,
             topic_word_prior=None, total_samples=1000000.0, verbose=0)

In [30]:
def print_top_words(model, feature_names, n_top_words):
    for topic_idx, topic in enumerate(model.components_):
        print("Topic #%d:" % topic_idx)
        print(" ".join([feature_names[i]
                        for i in topic.argsort()[:-n_top_words - 1:-1]]))
    print()
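
The slice `topic.argsort()[:-n_top_words - 1:-1]` is terse: `argsort()` orders the indices by ascending weight, and the negative-step slice walks back from the end to pick the `n_top_words` largest. The sketch below reproduces the same idiom with plain Python lists so it can be run without NumPy. `top_indices` is a hypothetical helper, and the weights and terms are made up.

```python
def top_indices(weights, n_top):
    # Indices sorted by ascending weight, the pure-Python counterpart of argsort().
    ascending = sorted(range(len(weights)), key=lambda i: weights[i])
    # Walk backwards from the end and stop after n_top items.
    return ascending[:-n_top - 1:-1]

toy_topic = [0.1, 0.7, 0.2, 0.5]                # made-up weights for four terms
toy_terms = ["gas", "power", "deal", "energy"]  # made-up vocabulary
top = top_indices(toy_topic, 2)
print(" ".join(toy_terms[i] for i in top))  # power energy
```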

In [31]:
n_top_words = 20

In [32]:
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it.
tf_feature_names = tf_vectorizer.get_feature_names_out()
print_top_words(lda, tf_feature_names, n_top_words)

Topic #0:
credit contract gas deal agreement need party shall thanks changes transaction securities counterparty deals time price date cash issues financial
Topic #1:
power energy gas com california utility new electric http generation plant natural electricity www utilities prices corp markets services pg
Topic #2:
enron mail information com edu doc email contact attached thank intended questions fax number hotel ena travel office received documents
Topic #3:
enron subject com pm message sent original thanks cc know october thursday let kay mailto friday monday time tuesday et
Topic #4:
market new commission ferc need issue issues price order meeting make think day plan like process support believe cost provide
Topic #5:
com enron company services business group management said million www capital global technology firm new president trading operations internet report
Topic #6:
com enron ees jeff na comments cc doc subject dasovich california ca pm draft richard forwarded iso energy j

In [33]:
import pyLDAvis
import pyLDAvis.sklearn  # this module is named pyLDAvis.lda_model in pyLDAvis 3.4.0 and later
pyLDAvis.enable_notebook()
pyLDAvis.sklearn.prepare(lda, tf, tf_vectorizer)