##Problem Statement and Hypothesis

In early March 2015, the New York Times revealed that Hillary Clinton had used a private email server while conducting official State Department business. The revelation coincided with the work of the House Select Committee on Benghazi, which was investigating the events surrounding the terrorist attack on the U.S. Diplomatic Mission in Benghazi. Many Republicans and conservative media outlets were quick to jump on this new controversy as further proof of Clinton's wrongdoing surrounding Benghazi.

Since that time, a number of Freedom of Information lawsuits have been filed against the State Department asking for the emails to be released. In August 2015, the State Department released nearly 7,000 pages of Hillary Clinton's emails.

This paper explores the application of basic natural language processing (NLP) techniques to assign topics to Hillary Clinton's State Department emails. In the process, I hope to confirm my expectation that the subjects of her emails were as banal and commonplace as those in any other workplace email account.

###Hypothesis 

This analysis is entirely unsupervised and, as a result, does not lend itself to traditional a priori hypotheses. This paper will explore the application of Latent Dirichlet Allocation (**LDA**) to Hillary Clinton's emails and evaluate the coherence of the resulting topic distributions.

Overall, the goal is to determine which LDA topic distribution is the "best" at defining a topic for a specific email. By "best", I mean the topic distribution with the most coherent topics. Coherence will be assessed using two evaluation techniques:

1. Word Intrusion
2. Cosine Similarity Across Documents

These two metrics will be described in greater detail in later sections.




##Description of Data and Preprocessing

###Data Source
    
The data used for this project are former Secretary of State Hillary Clinton's State Department emails. The emails were originally released in PDF format but were scraped and posted on Kaggle on September 11, 2015.

The data are stored in a SQLite database containing four tables:

| Table | Description |
|-------|-------------|
|  _Emails_  |This table contains all relevant email information including the email subject, date sent, sender name, and email body|
| _Persons_|This table contains a list of all the sender names with a unique identifier to merge on other tables|
|_Aliases_|This table contains a list of sender aliases that correspond to a unique person in the Persons table|
|_EmailReceivers_|This table links a unique email ID with a sender person ID and alias ID|

For the purposes of this project, a majority of the analysis will be performed on the _Emails_ table which contains the body text of the email.

####Sample Size

The _Emails_ table contains 7,945 emails. However, a portion of these emails are heavily redacted and contain no remaining email text. After we remove emails without text, we are left with 6,742 emails. 
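
The filtering step described above can be sketched as follows. This is a minimal illustration rather than the code used for the project: the column names `Id` and `ExtractedBodyText` are assumptions about the database schema, and the in-memory table exists only to make the example runnable.

```python
import sqlite3


def load_nonempty_emails(conn, body_column="ExtractedBodyText"):
    """Return (id, body) pairs for emails that still contain text.

    The column names are assumed for illustration; adjust them to the
    actual schema of the Emails table.
    """
    query = f"SELECT Id, {body_column} FROM Emails ORDER BY Id"
    rows = conn.execute(query).fetchall()
    # Drop emails whose body is NULL, empty, or whitespace only.
    return [(email_id, body) for email_id, body in rows if body and body.strip()]


# Tiny stand-in database so the sketch runs without the real file.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Emails (Id INTEGER PRIMARY KEY, ExtractedBodyText TEXT)")
conn.executemany(
    "INSERT INTO Emails VALUES (?, ?)",
    [(1, "Pls print"), (2, ""), (3, None), (4, "   ")],
)
emails = load_nonempty_emails(conn)
```

With the real database, `sqlite3.connect()` would be pointed at the downloaded SQLite file instead of `":memory:"`.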

####Time Period
The remaining emails begin in 2009 and end in 2012.

###Data Exploration 

Before any of the data can be run through the **LDA** algorithm, the text of each email needs to be cleaned and formatted. Each email is formatted differently, and in some cases the PDF scraping was imperfect and left extraneous words.

Below are a few examples of emails that need cleaning.

**Random Email #1** : Example of email with State Department footer

>Hillary:
I'm just boarding plane to Honduras and thinking of you especially with this painful tragedy in Libya.
Warmest,
Maria
U.S. Department of State
Case No. F-2015-04841
Doc No. C05739905
Date: 05/13/2015
STATE DEPT. - PRODUCED TO HOUSE SELECT BENGHAZI COMM.
SUBJECT TO AGREEMENT ON SENSITIVE INFORMATION & REDACTIONS. NO FOIA WAIVER. STATE-5CB0045242

**Random Email #2** : Example of email with nonsense word

>sbwhoeop
>Sunday, November 29, 2009 11:51 AM
>H
>Re: Another memo on backdrop to this week. Sid
>It's there, just scroll down
>Sent via Cingular Xpress Mail with Blackberry


**Random Email #3**: Example of email with mobile signature

>141737V7SMIZITMItlfaliZZI%
>Good luck out there. Talk sometime
>Sent via DROID on Verizon Wireless
>--Original message

**Random Email #4**: Example of email with extraneous words

>UAE with Bill -- 11:15am-11:45am in Bill's office. You could join after you finish up with Shaun Woodward at 11:30am.
Lunch with UAE starts at 12pm.
Lona Valmoro
Special Assistant to the Secretary of State
(202) 647-9071 (direct)

**Random Email #5**: Example of email with email Subject in body

>Brennan, John 0.
>Subject: RE: Google and YouTube
Sue just called back and the block will stay through Monday. They will not/not be unblocking it before then.
Nora Toiv
Office of the Secretary
202-647-8633


**Random Email #6**: Example of email with misspelled word

>Pis send
H <hrod17@clintonemail.com>
Wednesday, July 1, 2009 8:26 PM
'JilotyLC@state.gov'; 'Russorv@state.gov'
Fw:
a letter of congrats and good wishes.

**Random Email #7**: Example of email with email addresses

>Great. Thx.

>Sent from my iPhone. Apologies for any typos.
On Aug 23, 2010, at 10:27 AM, H <HDR22@clintonemail.com> wrote:
I will call you later to discuss.

The random emails above contain patterns that are found in many of the emails in the _Emails_ table. The patterns represent words or phrases that either are not part of the body of the email or are part of the body but are unlikely to contribute to topic inference later in the analysis. These patterns are addressed, along with other email cleaning procedures, in the next section.


###Preprocessing
As mentioned above, since each email in the _Emails_ table is formatted differently, there is significant preprocessing that needs to take place before performing **LDA**. Beyond tokenizing and stemming, each email needs to be cleaned for extraneous words or phrases as well as words that simply impart no meaning to the email.

####Non-standard Preprocessing Steps
1. Remove State Department headers and footers.
2. Correct spelling mistakes and expand abbreviations. For example, change instances of "pls" or "pis" to "please" and instances of "thx" to "thanks".
3. Remove any email addresses.
4. Remove any web addresses.
5. Remove any mobile signatures.
6. Remove all non-letters.
7. Remove any month or day of the week. These are typically found in email headers.
8. Remove particular words that don't contribute to topic inference. For example, "docx", "wjc", "sid", "hrc", "cdm", "hillary", "clinton", and "doc".
9. Remove the names of any of Hillary Clinton's aides. These names are typically left over from email signatures and obscure the topic modeling.
10. Remove stopwords.
11. Remove words with fewer than three characters.
12. Remove any modal verbs (e.g. "would", "could", "should").

####Standard Preprocessing Steps
1. Tokenize each email.
2. Stem each word in an email.
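
The cleaning steps listed above could be strung together roughly as in the sketch below. It is an illustration under several assumptions, not the project's actual code: the regular expressions are inferred from the sample emails, the stopword and aide-name sets are short placeholders, and stemming is left as a pluggable `stem` argument (a stemmer such as NLTK's `PorterStemmer().stem` could be passed in).

```python
import re

# Patterns inferred from the sample emails; real data will likely need more.
FOOTER_LINES = re.compile(
    r"^(?:U\.S\. Department of State|Case No\.|Doc No\.|Date:"
    r"|STATE DEPT\.|SUBJECT TO AGREEMENT).*$",
    re.MULTILINE,
)
MOBILE_SIGNATURE = re.compile(r"^sent (?:from|via) .*$", re.IGNORECASE | re.MULTILINE)
EMAIL_ADDRESS = re.compile(r"\S+@\S+")
WEB_ADDRESS = re.compile(r"(?:https?://|www\.)\S+")
ABBREVIATIONS = {"pls": "please", "pis": "please", "thx": "thanks"}

CALENDAR_WORDS = {
    "january", "february", "march", "april", "may", "june", "july", "august",
    "september", "october", "november", "december", "jan", "feb", "mar", "apr",
    "jun", "jul", "aug", "sep", "sept", "oct", "nov", "dec", "monday", "tuesday",
    "wednesday", "thursday", "friday", "saturday", "sunday",
}
CUSTOM_WORDS = {"docx", "wjc", "sid", "hrc", "cdm", "hillary", "clinton", "doc"}
AIDE_NAMES = {"toiv", "jiloty"}  # placeholder names, not a complete list
STOPWORDS = {"the", "and", "you", "for", "this", "that", "with", "any", "are", "was"}
MODAL_VERBS = {"would", "could", "should", "will", "might", "must", "shall", "can", "may"}
DROP_WORDS = CALENDAR_WORDS | CUSTOM_WORDS | AIDE_NAMES | STOPWORDS | MODAL_VERBS


def clean_email(text, stem=lambda word: word):
    """Return a list of cleaned tokens for one email body."""
    for pattern in (FOOTER_LINES, MOBILE_SIGNATURE, EMAIL_ADDRESS, WEB_ADDRESS):
        text = pattern.sub(" ", text)
    # Keep letters only, lowercase, then tokenize on whitespace.
    text = re.sub(r"[^A-Za-z]+", " ", text).lower()
    tokens = [ABBREVIATIONS.get(token, token) for token in text.split()]
    # Drop short words and unwanted words, then stem what remains.
    return [stem(t) for t in tokens if len(t) >= 3 and t not in DROP_WORDS]
```

Note that the sketch tokenizes before the word-level removals, since those steps are easiest to express on individual tokens.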

##Model Process and Validation

####Latent Dirichlet Allocation

Latent Dirichlet Allocation is frequently cited as one of the most popular techniques for topic modeling. LDA is an unsupervised learning technique that defines topics from a set of texts. The number of topics the LDA algorithm defines is set by the user.

At a very basic level, LDA defines topics by making assumptions about how a document or piece of text was written and then reversing that process to back out a particular set of topics. LDA assumes that documents, or in our case emails, are written in the following steps:

1. Determine the number of words the email will have according to some distribution.
2. Determine a mixture of K topics to be used in the email. For example, the email may be split between 'Benghazi' (50%), 'House Committee' (30%), and 'Election' (20%). Each of these three topics also has its own distribution of words. For example:
  * Benghazi: Qaddafi (50%) and Libya (50%)
  * House Committee: Boehner (50%) and Investigation (50%)
  * Election: Polls (50%) and Vote (50%)
3. Then, to pick each word in the email:
  1. Pick a topic according to the distribution 'Benghazi' (50%), 'House Committee' (30%), and 'Election' (20%).
  2. Pick a word from the topic distribution.
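
The generative steps above can be simulated directly. The sketch below hard-codes the illustrative topic mixture and word distributions from the example; the email length is passed in as a parameter rather than drawn from a distribution, which is a simplification.

```python
import random

# Illustrative distributions taken from the example above.
TOPIC_MIX = {"Benghazi": 0.5, "House Committee": 0.3, "Election": 0.2}
TOPIC_WORDS = {
    "Benghazi": {"Qaddafi": 0.5, "Libya": 0.5},
    "House Committee": {"Boehner": 0.5, "Investigation": 0.5},
    "Election": {"Polls": 0.5, "Vote": 0.5},
}


def generate_email(num_words, rng):
    """Generate one 'email' word by word, following the assumed process."""
    topics = list(TOPIC_MIX)
    topic_weights = [TOPIC_MIX[t] for t in topics]
    words = []
    for _ in range(num_words):
        # Step 3.1: pick a topic according to the email's topic mixture.
        topic = rng.choices(topics, weights=topic_weights)[0]
        # Step 3.2: pick a word from that topic's word distribution.
        vocab = TOPIC_WORDS[topic]
        words.append(rng.choices(list(vocab), weights=list(vocab.values()))[0])
    return words


email = generate_email(12, random.Random(0))
```

The output is a bag of words with no grammar or order, which is all that LDA assumes about a document.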
  
The LDA algorithm 'reverses' this process to back out the topic distribution for each email. The algorithm begins by randomly assigning each word in each email to a topic. These initial assignments, of course, do not make any sense. The algorithm then reassigns each word to a new topic based on two quantities: how prevalent each topic currently is in that email, and how likely each topic is to generate that specific word. These probabilities are recalculated after every reassignment, and the process is repeated many times until the topic assignments reach a steady state.
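
The reassignment loop described above corresponds closely to a collapsed Gibbs sampler, one common way of fitting LDA. The toy version below is only meant to make the iteration concrete; the smoothing values `alpha` and `beta` are arbitrary assumptions, and **gensim**'s implementation uses a different inference method (online variational Bayes), so this is not what runs under the hood in this analysis.

```python
import random


def gibbs_lda(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler. docs is a list of token lists."""
    rng = random.Random(seed)
    vocab = sorted({word for doc in docs for word in doc})
    doc_topic = [[0] * num_topics for _ in docs]            # words in doc d on topic k
    topic_word = [dict.fromkeys(vocab, 0) for _ in range(num_topics)]
    topic_total = [0] * num_topics                          # all words on topic k
    assignments = []

    # Start from random topic assignments for every word.
    for d, doc in enumerate(docs):
        assignments.append([rng.randrange(num_topics) for _ in doc])
        for word, k in zip(doc, assignments[d]):
            doc_topic[d][k] += 1
            topic_word[k][word] += 1
            topic_total[k] += 1

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, word in enumerate(doc):
                # Remove the word's current assignment from the counts.
                k = assignments[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][word] -= 1
                topic_total[k] -= 1
                # Weight = (topic prevalence in this email)
                #        * (probability the topic generates this word).
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][word] + beta)
                    / (topic_total[t] + len(vocab) * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][word] += 1
                topic_total[k] += 1

    # Smoothed topic mixture for each email.
    mixtures = [
        [(doc_topic[d][t] + alpha) / (len(doc) + num_topics * alpha)
         for t in range(num_topics)]
        for d, doc in enumerate(docs)
    ]
    return mixtures, topic_word
```

Calling `gibbs_lda(tokenized_emails, num_topics=4)` returns one topic mixture per email plus the word counts assigned to each topic.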

The LDA algorithm returns a mixture of topics for each email. For example, the text of email #1 might be split 70% Topic Benghazi and 30% Topic House Committee. Note that this is different from K-Means, which would assign each email to exactly one of several disjoint, non-overlapping clusters.

To perform the LDA topic modeling, each email needs to be transformed into a numerical representation that the LDA algorithm can interpret. To accomplish this, all the emails will be represented in a document-term matrix. The document-term matrix is similar to a term frequency-inverse document frequency (TF-IDF) matrix, except that it holds raw counts rather than weighted scores. Each row in this matrix represents an email and each column represents a word. The value for a specific email and word combination is the number of times that word appears in that email.
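
As a small illustration of the matrix described above, the sketch below builds a count matrix from already-tokenized emails, using the common convention of one row per email and one column per word. The function name and the two sample emails are made up for the example.

```python
from collections import Counter


def document_term_matrix(tokenized_emails):
    """Return (vocabulary, matrix) with one row per email, one column per word."""
    vocabulary = sorted({token for tokens in tokenized_emails for token in tokens})
    matrix = []
    for tokens in tokenized_emails:
        counts = Counter(tokens)
        # Counter returns 0 for words that do not appear in this email.
        matrix.append([counts[term] for term in vocabulary])
    return vocabulary, matrix


sample_emails = [["meet", "office", "meet"], ["vote", "senate"]]
vocabulary, matrix = document_term_matrix(sample_emails)
# vocabulary -> ['meet', 'office', 'senate', 'vote']
# matrix     -> [[2, 1, 0, 0], [0, 0, 1, 1]]
```

Real corpora produce very sparse matrices, so libraries usually store only the non-zero counts for each document instead of a dense table like this one.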

Once this matrix is prepared, the **LdaModel** class within the **gensim** package will fit a model with the number of topics requested. For this analysis, I will use **LdaModel** to create topic distributions for four, six, eight, and ten topics.
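
A rough sketch of how the four models might be trained with **gensim** is shown below. The helper names, the `passes` value, and the fixed `random_state` are assumptions made for illustration and are not taken from the original analysis; the input is assumed to be a list of token lists produced by preprocessing.

```python
TOPIC_COUNTS = (4, 6, 8, 10)


def train_lda_models(tokenized_emails, topic_counts=TOPIC_COUNTS, passes=10, seed=0):
    """Train one LDA model per requested number of topics."""
    # Imported here so this sketch can be loaded even where gensim is missing.
    from gensim import corpora, models

    # Map each word to an integer id, then turn each email into word counts.
    dictionary = corpora.Dictionary(tokenized_emails)
    corpus = [dictionary.doc2bow(tokens) for tokens in tokenized_emails]
    return {
        k: models.LdaModel(
            corpus, num_topics=k, id2word=dictionary, passes=passes, random_state=seed
        )
        for k in topic_counts
    }


def top_words(model, topic_id, topn=8):
    """Return the topn highest-probability words for one topic."""
    return [word for word, _probability in model.show_topic(topic_id, topn=topn)]
```

Fixing `random_state` makes repeated runs comparable, which matters when judging whether a topic is stable across models.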

####Limitations and Modeling Assumptions
1. Body of emails contains only relevant words or text specific to that email. The scraping that was performed to collect the emails was not perfect, and it is very likely that some extraneous words or text slipped through data cleaning. For the purposes of this project, it is assumed that all words in a document are relevant.

2. All emails are complete. Every single email was reviewed by the State Department before release, and in some cases the emails were redacted. For the purposes of this project, it is assumed that all emails are complete.

3. Emails are written using the steps outlined earlier. Without this assumption, the LDA algorithm would not be able to define K topics. This assumption includes other probabilistic assumptions on how topics are picked for an email (i.e. Dirichlet distribution).


####Validation

To determine which of the four LDA topic distributions is most coherent, each distribution of topics will be evaluated with the following metrics:

1. Word Intrusion
2. Cosine Similarity across Sets of Text

#####Word Intrusion

_Word Intrusion_, as defined by [Chang et al.](http://www.umiacs.umd.edu/~jbg/docs/nips2009-rtl.pdf), is a method where, for each trained topic, a word is chosen at random and substituted into the first ten (or five) words associated with that topic. If a human is able to detect which word does not belong, the topic is said to be coherent.
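
One way to build such an intrusion question is sketched below. It assumes the intruder is drawn from the top words of the other topics and excludes any word the target topic already contains; that sampling rule is an assumption of this sketch, not necessarily the procedure used later in this paper.

```python
import random


def make_intrusion_set(topics, topic_id, rng):
    """Replace one top word of a topic with a word borrowed from other topics.

    topics maps a topic id to its list of top words.
    Returns (words_shown_to_the_reader, intruder).
    """
    own_words = topics[topic_id]
    candidates = sorted(
        {word for other_id, words in topics.items() if other_id != topic_id
         for word in words}
        - set(own_words)
    )
    intruder = rng.choice(candidates)
    shown = list(own_words)
    # Overwrite a random position so the intruder's location gives no hint.
    shown[rng.randrange(len(shown))] = intruder
    return shown, intruder


sample_topics = {
    1: "office depart meet room state arrive route conference".split(),
    9: "republican democrat vote senate percent elect party candidate".split(),
    10: "koch tea right beck party movement book skousen".split(),
}
shown, intruder = make_intrusion_set(sample_topics, 1, random.Random(0))
```

Repeating the draw with different seeds and different readers gives a success rate per topic rather than a single yes or no.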

#####Cosine Similarity

In this exercise, the set of emails will be split into two distinct sets and each set will be assigned topics based on a trained model. Then the cosine similarity will be calculated between corresponding topics and between randomly chosen topics. The higher the cosine similarity between corresponding topics the better, and the lower the cosine similarity between different topics the better.
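
The comparison can be sketched as follows, assuming each topic in each half of the data is summarized as a vector of word weights over a shared vocabulary and that topics are aligned by index. For simplicity the sketch averages over all non-matching pairs rather than sampling random ones, and it expects at least two topics.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def compare_halves(half_a, half_b):
    """Return (mean similarity of matching topics, mean of non-matching topics).

    half_a[i] and half_b[i] are the vectors for topic i in each half.
    """
    k = len(half_a)
    same = [cosine_similarity(half_a[i], half_b[i]) for i in range(k)]
    different = [
        cosine_similarity(half_a[i], half_b[j])
        for i in range(k) for j in range(k) if i != j
    ]
    return sum(same) / len(same), sum(different) / len(different)
```

A large gap between the two averages suggests the topics are both reproducible and distinct from one another.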



##Results

###Topic Distributions

**Four-Topic Model Words**

|Topic Number| Top Words| Theme|
|:-:|:-:|:-:|
|1|office fyi depart room meet state arrive route| Daily Schedule|
|2|state american new israel president one time said| ?
|3| call thank please see get work want know| Office Requests
|4|obama said president new democrat party state govern| ?

With a four-topic model, only two of the four topics seem coherent. Topic #1 has multiple words that are associated with daily scheduling and logistics, like "meet", "depart", "arrive", and "route". It is likely that this topic is identifying emails about Hillary Clinton's daily schedule. However, there are also words like "fyi" that don't seem to fit the topic completely. Additionally, topic #3 has many words associated with requests, like "thank", "get", and "want". Perhaps this is identifying emails associated with typical office requests. The remaining topics do not seem to have any internal coherence.

**Six-Topic Model Words**

|Topic Number| Top Words|Theme|
|:-:|:-:|:-:|
|1|office depart fyi meet room state arrive route| Daily Schedule
|2|bloomberg say think said mayor try time editorial| ?
|3| call please thank see get work want know |Office Requests
|4|israel isra palestinian netanyahu arab negotiate jewish peace| Israel/Palestine|
|5|state american new president one obama govern year|?
|6|house obama said senate white president vote bill| Congress?

Increasing the number of topics from four to six gave the LDA algorithm room to identify smaller topics embedded in previous topics. For example, notice that topic #4 seems to have split off from the four-topic model's topic #2 and is now a topic that is undeniably associated with Israel/Palestine.

The themes of **Daily Schedule** and **Office Requests** are still present in the six-topic model, which lends support to their stability as topics. There also seems to be a burgeoning topic on Congress or elections in topic #6.

**Eight-Topic Model Words**

|Topic Number| Top Words| Theme|
|:-:|:-:|:-:|
|1|office depart meet room state arrive conference route|Daily Schedule
|2|call see get work want know thank talk| Office Request 
|3|please print message copy list speech thank email|Office Request 
|4|haiti bibi state settlement assist valmoro haitian letter|? 
|5|obama said one president year american politics say|?
|6|senate vote republican start bill richard boehner treaty|Congress
|7|fyi party vote elect release part percent poll|Election?
|8|state new secure israel world unit develop policy|?

Once again with the eight-topic model, the themes of **Daily Schedule** and **Office Requests** can be identified. Notice that **Office Requests** are now split up between two topics. 

Topic #4 seems completely incoherent with words associated with Haiti and Israel at the same time. Associations with Israel can also be found in topic #8, but again the topic is not internally coherent. This is particularly interesting because in the six-topic model there was a very coherent topic on Israel/Palestine that seems to have disappeared.

As suggested earlier, there are now topics that seem to be associated with Congress, topic #6, and Elections, topic #7. However, the words "release" and "part" raise red flags. The word "release" could be associated with "poll release" but it could also be associated with "released in part" which was a common message in redacted emails. 


**Ten-Topic Model Words**

|Topic Number| Top Words|Theme|
|:-:|:-:|:-:|
|1|office depart meet room state arrive route conference|Daily Schedule
|2| call see get want know work thank talk|Office Request
|3|please print message list thank qddr copy add|Office Request
|4|release part state new team chapter valmoro branch| Redacted?
|5|american state new president one year said obama|?
|6|house obama white bill said president staff senate|Congress/Administration Relations
|7|fyi israel isra party palestinian peace arab negotiate|Israel/Palestine
|8|work develop state new policy people issue world| State Dept. Policymaking
|9|republican democrat vote senate percent elect party candidate| Elections
|10|koch tea right beck party movement book skousen| Far-right

With ten topics, many previously seen themes become coherent. Again, **Daily Schedule** and **Office Request** are part of the topic distribution. At this point, I am confident that these topics are stable after showing up in every trained model. 

The topic of **Israel/Palestine** re-emerges in this model and is once again very coherent. It is still unclear why this topic disappeared in the eight-topic model and re-emerged in the ten-topic model.

Topic #8 has words that are associated with policy-making and the intended goals of the State Department. This could be identifying emails with general policy proposals or discussions.

Topic #6 combines words associated with Congress and the administration and could be identifying emails discussing relations between Congress and the White House.

Topic #9 is clearly a topic dealing with elections having words like "elect", "vote", and "percent".

Topic #10 is another topic that seems to have strong internal coherence. All of the words have some association with far-right politics.

The remaining topics are either completely incoherent or possibly driven by heavily redacted emails.

###Validation 

After reviewing the trained models, it seems like the ten-topic model has the most internally coherent topics. However, this evaluation is completely subjective and needs to be validated using more objective measures. To do this, I will be employing _word intrusion_ and _cosine similarity_.

####Word Intrusion

_Word Intrusion_, as defined by Chang et al., is an exercise where, for each trained topic, a word is chosen at random and substituted into the first ten (or five) words associated with that topic. If a human is able to detect which word does not belong, the topic is said to be coherent.

The table below provides word intrusion examples for each topic in the ten-topic model.

|Topic|Can you identify the intruder?|
|:-:|:-:|
|1|office republican meet room state arrive route conference |
|2|call see get want know valmoro thank talk|
|3| please print message list thank qddr copy candidate|
|4| release negotiate state new team chapter valmoro branch|
|5| message state new president one year said obama|
|6| house obama white bill said president staff print|
|7| fyi israel isra party palestinian peace message negotiate|
|8| work develop right new policy people issue world|
|9| message democrat vote senate percent elect party candidate|
|10|koch tea right beck conference movement book skousen|

I posed these word intrusion questions to my roommate, who had never seen the topic distributions. She correctly identified the intruder for six of the ten topics (**Daily Schedule**, **Office Request 1**, **Congress/Administration Relations**, **State Dept. Policymaking**, **Elections**, and **Far-right**). This is far from a victory, and the exercise should be replicated across multiple people with multiple random draws.

####Cosine Similarity

In this exercise, the set of emails is split into two distinct sets and each set is assigned topics based on a trained model. The cosine similarity is then calculated between corresponding topics and between randomly chosen topics. The higher the cosine similarity between corresponding topics the better, and the lower the cosine similarity between different topics the better. Recall that cosine similarity ranges from -1 to 1 in general; for vectors with no negative entries, such as word counts or probabilities, it falls between 0 and 1.

||Cosine Similarity|
|:-:|:-:|
|Between Corresponding Topics|.83|
|Between Random Topics|.66|

The table above suggests that there is high similarity between the same topic across the two sets of emails. While the high cosine similarity between corresponding topics suggests coherence, this is only a small improvement on the cosine similarity between randomly chosen topics across the two sets of emails. This suggests that there are still many similar words across different topics that prevent distinction.

There are many reasons why this may be the case, but two in particular come to mind:
1. There are still extraneous words, common to all emails, that need to be removed in preprocessing.
2. Emails are typically concise and written using a limited range of vocabulary that could be common across all emails. This might make it hard to differentiate between particular emails.

##Conclusions

The LDA algorithm was able to identify only a handful of coherent topics depending on the number of topics assigned. The most internally coherent topics were those dealing with **Daily Schedule** and **Office Requests**. Unfortunately, the two more objective measures of coherence gave only weak support: the _word intrusion_ test flagged just six of the ten topics as coherent, and _cosine similarity_ showed only a modest gap between corresponding topics and randomly chosen topics.

Training a topic model on Hillary Clinton's emails posed several challenges. One major challenge of topic modeling is defining which words of the text are relevant to topical inference and which words only muddy the water. This challenge becomes particularly difficult when dealing with documents that are short and can contain as few as two words (e.g. "Please print").

Adding to that challenge is the presence of heavily redacted emails. It is entirely possible that topics are obscured or muddled because of missing sentences or phrases containing key words. 

Given these challenges, I am surprised that LDA was able to identify any topics with any semblance of coherence.





####Notes

Below you will find the answers to the word intrusion exercise.

|Topic|Can you identify the intruder?|
|:-:|:-:|
|1|office **republican** meet room state arrive route conference |
|2|call see get want know **valmoro** thank talk|
|3| please print message list thank qddr copy **candidate**|
|4| release **negotiate** state new team chapter valmoro branch|
|5| **message** state new president one year said obama|
|6| house obama white bill said president staff **print**|
|7| fyi israel isra party palestinian peace **message** negotiate|
|8| work develop **right** new policy people issue world|
|9| **message** democrat vote senate percent elect party candidate|
|10|koch tea right beck **conference** movement book skousen|