## Problem: Visualizing and Summarizing Email Content

In this problem, we will use a variety of tools to visualize and identify similarities in emails.  We will consider textual data from emails sent by employees at Enron. Enron was the sixth-largest energy company in the world before it collapsed and the majority of its executives were tried for fraud after overstating the company's earnings by several hundred million dollars.  This dataset is often used by researchers who are interested in "improving current email tools, or understanding how email is currently used" because "it is the only substantial collection of 'real' email that is public" (https://www.cs.cmu.edu/~enron/).

#### Data Description:
The format of the enron_sample.txt file is: 

---
docID wordID count  
docID wordID count   
...  
docID wordID count   
docID wordID count  

---

There are 1000 documents (emails) in this sample.  Individual document names (i.e., an identifier for each docID) are not provided for copyright reasons. 

The enron_sample_vocab.txt file lists one word per line, and a word's wordID is its zero-based line position n.  That is, if "apple" is the first word in the vocab file, then the wordID for "apple" is 0.  
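To make the two file formats concrete, here is a minimal sketch of how (docID, wordID, count) triplets turn into a document-word count matrix and how a wordID maps back to a word. The triplets and the three-word vocabulary below are hypothetical toy values, not taken from the Enron files.

```python
import numpy as np
import scipy.sparse

# Hypothetical toy triplets (docID, wordID, count); not from the Enron files
toy_rows = np.array([0, 0, 1, 2])
toy_cols = np.array([0, 2, 1, 2])
toy_vals = np.array([3, 1, 2, 5])

# Hypothetical vocabulary: the word at list position n has wordID n
toy_vocab = ["apple", "banana", "cherry"]

# Build the sparse document-word matrix, then densify it for inspection
toy_X = scipy.sparse.csr_matrix((toy_vals, (toy_rows, toy_cols)), shape=(3, 3))
toy_dense = toy_X.toarray()
print(toy_dense)

# Look up the word with wordID 2 and its count in document 0
print(toy_vocab[2], toy_dense[0, 2])
```

Entry (i, j) of the resulting matrix is the count of word j in document i, which is the same layout the loading code below produces for the full sample.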

We have done much of the necessary pre-processing of this data to save you time.  Please run the code below to load the data.  The enron_sample.txt and enron_sample_vocab.txt files should be in the same folder as your notebook.  The pre-processing code below should take less than 30 seconds to run.

In [1]:
%%time
#Import and pre-process the data
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('ggplot')

#Read in the sparse matrix representation of the document-word count matrix
with open("enron_sample.txt", "r") as f:
    enron_file = f.readlines()

all_doc_ids = []
all_word_ids = []
all_counts = []
for l in range(0,len(enron_file)):
    l_text = enron_file[l].rstrip()
    [l_doc, l_word, l_count] = l_text.split(",")
    l_doc = int(l_doc) 
    l_word = int(l_word) 
    all_doc_ids.append(l_doc)
    all_word_ids.append(l_word)
    all_counts.append(int(l_count))
row = np.array(all_doc_ids)
col = np.array(all_word_ids)
counts = np.array(all_counts)  

num_documents = len(set(all_doc_ids))
num_words = len(set(all_word_ids))

#Convert to a dense document-word count matrix where the (ith, jth) entry gives the count of the jth word in the ith document
import scipy.sparse
X = scipy.sparse.csc_matrix((counts, (row,col)),shape=(num_documents,num_words))
X_dense = X.todense()

#Read in the word names  
with open("enron_vocab_sample.txt", "r") as f:
    #Keep the first num_words lines and strip trailing whitespace from each
    word_names = [x.rstrip() for x in f.readlines()[:num_words]]

Wall time: 9.18 s


### Part A 
Please report the 10 most common words in the corpus.  To do this, count the number of times
each word appears in the corpus and build a frequency table. 
(In particular, use counts as the "frequency"--do not divide by the total number of words in the corpus.)
Sort the table and print the top 10 most frequent words, along with their frequencies and ranks.  

Hint: You do not need to load nlp for this question.  Our code for this question ran in less than 10 seconds.
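As a rough illustration of the frequency table described above, the sketch below sums a small count matrix over documents and ranks the words. The matrix `toy_counts`, the vocabulary `toy_vocab`, and the cutoff `top_k` are hypothetical stand-ins, not the Enron data.

```python
import numpy as np

# Hypothetical 3-document, 3-word count matrix standing in for the real data
toy_counts = np.array([[3, 0, 1],
                       [0, 2, 0],
                       [4, 0, 5]])
toy_vocab = ["apple", "banana", "cherry"]

# Sum each column to get the total count of each word across all documents
totals = toy_counts.sum(axis=0)

# Word indices ordered from most to least frequent
order = np.argsort(-totals, kind="stable")

# Build (rank, word, frequency) rows for the top_k words
top_k = 2
freq_table = [(rank + 1, toy_vocab[j], int(totals[j]))
              for rank, j in enumerate(order[:top_k])]
for rank, word, freq in freq_table:
    print(rank, word, freq)
```

If the counts are held in a `numpy.matrix` (which is what `todense()` returns), the column sums come back as a 1-by-n matrix, so flattening them with `np.asarray(...).ravel()` before sorting may be needed.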

In [None]:
%%time
#Please write your code here

### Part B
Now, use the `TfidfTransformer` class in `scikit-learn` to transform this data.  Then, use the `LatentDirichletAllocation` class in `scikit-learn` to fit 10 topics to the transformed sample data.  Use a maximum of 10 iterations and set `n_jobs=-1` to use all cores on your machine (if it helps). Please set `random_state` to 95865 when running LDA. Print the top 5 words associated with each topic.  

Our code ran in less than 2 minutes for this section.   
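The pipeline above can be sketched on synthetic data as follows. This is only an illustration under stated assumptions: `toy_counts` is a randomly generated stand-in for the document-word matrix, `toy_vocab` is a made-up vocabulary, and the sketch fits 3 topics rather than 10 so that it runs instantly. It also leaves `n_jobs` at its default; `n_jobs=-1` could be passed as the assignment suggests.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.decomposition import LatentDirichletAllocation

# Hypothetical stand-in for the document-word count matrix (40 docs, 12 words)
rng = np.random.RandomState(0)
toy_counts = rng.poisson(1.0, size=(40, 12))
toy_vocab = ["w%d" % j for j in range(12)]

# Re-weight raw counts with tf-idf
toy_tfidf = TfidfTransformer().fit_transform(toy_counts)

# Fit LDA; fit_transform returns each document's distribution over topics
lda = LatentDirichletAllocation(n_components=3, max_iter=10,
                                random_state=95865)
doc_topics = lda.fit_transform(toy_tfidf)

# For each topic, list the words with the largest topic-word weights
n_top = 5
for t, weights in enumerate(lda.components_):
    top = np.argsort(weights)[::-1][:n_top]
    print("Topic", t, [toy_vocab[j] for j in top])
```

The `doc_topics` array has one row per document and one column per topic, with each row summing to 1.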

In [None]:
%%time
#Take random sample of the data


#Create tf-idf matrix


#Learn LDA model


#Print top words associated with each topic


### Part C 
Please write a few sentences to interpret the results of your topic modeling.  What do the various clusters seem to represent? 
Do they appear to be meaningful email categories for an energy company?

#### Here, please write your interpretation:


### Part D 
Please use t-SNE and the probability distribution of documents over topics to find a 2-D representation of the email documents. 
That is, please run t-SNE on the documents' distributions over topics, which is an output of your LDA model. When running t-SNE, set the angle to 0.5, the learning rate to 800, and init to 'pca'.

Our code ran in less than 30 seconds for this section.
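A minimal sketch of this step, assuming synthetic topic distributions in place of real LDA output: `toy_doc_topics` is drawn from a Dirichlet distribution purely for illustration. The `perplexity` and `random_state` values are our own assumptions (the small perplexity suits the small toy sample); the problem statement only fixes the angle, learning rate, and init.

```python
import numpy as np
from sklearn.manifold import TSNE

# Hypothetical document-topic distributions (60 docs, 4 topics); rows sum to 1
rng = np.random.RandomState(0)
toy_doc_topics = rng.dirichlet(np.ones(4), size=60)

# perplexity and random_state are assumptions added for this toy example
tsne = TSNE(n_components=2, angle=0.5, learning_rate=800, init="pca",
            perplexity=10, random_state=95865)
toy_embedding = tsne.fit_transform(toy_doc_topics)
print(toy_embedding.shape)
```

Each row of `toy_embedding` is the 2-D coordinate of one document.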

In [None]:
%%time
#Find 2D representation of the email documents


### Part E 
Use DP-Means to pick the number of clusters in your 2-D representation, assuming that you are only interested in clusters with a probability weight of at least 0.05. Set the number of mixture components to 20, the weight concentration prior to 0.1, n_init to 200, and the random state to 95865.  Please print the sorted probability weights you find.

Our code for this section ran in less than 2 minutes.
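One way to sketch this step is with scikit-learn's `BayesianGaussianMixture`, whose parameters (`n_components`, `weight_concentration_prior`, `n_init`) match the settings named above; using that class here is our assumption, not something the problem statement confirms. The points in `toy_points` are synthetic blobs standing in for the t-SNE output, and `n_init` is reduced to 5 to keep the example fast.

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

# Hypothetical 2-D points: three well-separated blobs of 100 points each
rng = np.random.RandomState(0)
toy_centers = ([0, 0], [6, 6], [-6, 6])
toy_points = np.vstack([rng.normal(loc=c, scale=0.5, size=(100, 2))
                        for c in toy_centers])

# Dirichlet-process mixture with up to 20 components (n_init lowered for speed)
dpgmm = BayesianGaussianMixture(n_components=20,
                                weight_concentration_prior=0.1,
                                n_init=5, random_state=95865)
dpgmm.fit(toy_points)

# Sort component weights from largest to smallest
sorted_weights = np.sort(dpgmm.weights_)[::-1]
print(np.round(sorted_weights, 3))

# Count the components whose weight is at least 0.05
n_clusters = int((sorted_weights >= 0.05).sum())
print(n_clusters)
```

Components that the model does not need end up with weights near zero, so counting the weights above the 0.05 threshold gives a data-driven choice for the number of clusters.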

In [None]:
%%time
#Determine the number of clusters using DP-Means

#### Here, please write your choice for the number of clusters given your DP-Means results:


### Part F 
Fit a GMM with the number of clusters found in part E and visualize the clustering results in a scatter plot. Each document should be represented by its 2-D t-SNE representation and should be color-coded according to its cluster assignment. When running your GMM, set n_init to 200 and random_state to 95865.

Our code for this section ran in less than 10 seconds.
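The fit-and-plot step might look like the sketch below. The data in `toy_points` are synthetic blobs rather than a real t-SNE embedding, the choice of 3 components is hypothetical (in the assignment it comes from part E), and `n_init` is lowered to 5 for speed.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.mixture import GaussianMixture

# Hypothetical 2-D points: three well-separated blobs of 100 points each
rng = np.random.RandomState(0)
toy_centers = ([0, 0], [6, 6], [-6, 6])
toy_points = np.vstack([rng.normal(loc=c, scale=0.5, size=(100, 2))
                        for c in toy_centers])

# Fit the mixture and assign each point to its most likely component
gmm = GaussianMixture(n_components=3, n_init=5, random_state=95865)
toy_labels = gmm.fit_predict(toy_points)

# Scatter plot with one color per cluster assignment
fig, ax = plt.subplots()
ax.scatter(toy_points[:, 0], toy_points[:, 1], c=toy_labels, cmap="tab10", s=10)
ax.set_xlabel("t-SNE dimension 1")
ax.set_ylabel("t-SNE dimension 2")
ax.set_title("GMM cluster assignments (toy data)")
```

Passing the integer labels to `c=` together with a qualitative colormap such as `tab10` gives each cluster a distinct color.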

In [None]:
%%time
#Fitting GMM with Number of Clusters Determined By DP-Means

#Plot the clustering results


### Part G
Now use sklearn's built-in CH-index function to pick the number of clusters.  To do this, load `metrics` from `sklearn` and then use the `calinski_harabaz_score` function (spelled `calinski_harabasz_score` in scikit-learn 0.20 and later; the old spelling was removed in 0.23).  For more details on this function see: http://scikit-learn.org/stable/modules/generated/sklearn.metrics.calinski_harabaz_score.html 

You should use a for loop to run k-means for values of k = [2, 5, 10, 20, 30] over your 2-D t-SNE representation of the data.  When running k-means, set n_init to 1000 and random_state to 95865. Please record the CH-index for each k, and plot the results.  Indicate if you would select k to be 2, 5, 10, 20 or 30 given these results. 

Our code for this section ran in under 1 minute 15 seconds.
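The loop described above can be sketched as follows, again on synthetic blobs (`toy_points`) instead of the real t-SNE output. The candidate list `ks` and the reduced `n_init=10` are assumptions chosen so the example runs quickly, and the sketch uses the newer `calinski_harabasz_score` spelling.

```python
import numpy as np
from sklearn import metrics
from sklearn.cluster import KMeans

# Hypothetical 2-D points: three well-separated blobs of 100 points each
rng = np.random.RandomState(0)
toy_centers = ([0, 0], [6, 6], [-6, 6])
toy_points = np.vstack([rng.normal(loc=c, scale=0.5, size=(100, 2))
                        for c in toy_centers])

# Record the CH index for each candidate number of clusters
ks = [2, 3, 5, 10]
ch_scores = []
for k in ks:
    labels = KMeans(n_clusters=k, n_init=10, random_state=95865).fit_predict(toy_points)
    ch_scores.append(metrics.calinski_harabasz_score(toy_points, labels))

# A higher CH index indicates tighter, better-separated clusters
best_k = ks[int(np.argmax(ch_scores))]
for k, score in zip(ks, ch_scores):
    print(k, round(score, 1))
print("selected k:", best_k)
```

The recorded scores can then be plotted against `ks`, for example with `plt.plot(ks, ch_scores, marker="o")`, to show where the index peaks.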

In [None]:
%%time
#Determine Number of Clusters with k-means and CH-Index


#### Here, please write your choice for the number of clusters based on your analysis using k-means and the CH-Index:


### Part H
Fit k-means with the number of clusters found in part G and visualize the clustering results in a scatter plot. 
Each document should be represented by its 2-D t-SNE representation and should be color-coded according to its cluster assignment.

Then, write a few sentences explaining your results from part G and from this clustering visualization.  Are the clusters found by k-means good? Why might you have gotten these results?  

In [None]:
%%time
#Please write your code here

#### Here, please write a few sentences to explain your results:
