# 05 - Taming Text

## Deadline
Thursday December 15, 2016 at 11:59PM

## Important Notes
* Make sure you push your Notebook to GitHub with all the cells already evaluated
* Don't forget to add a textual description of your thought process, the assumptions you made, and the solution
you plan to implement!
* Please write all your comments in English, and use meaningful variable names in your code

## Background
In this homework you will explore a relatively large corpus of emails released publicly during the
[Hillary Clinton email controversy](https://en.wikipedia.org/wiki/Hillary_Clinton_email_controversy).
You can find the corpus in the `hillary-clinton-emails` directory of this repository, while more detailed information 
about the [schema is available here](https://www.kaggle.com/kaggle/hillary-clinton-emails).

## Assignment
1. Generate a word cloud based on the raw corpus -- I recommend using the [Python word_cloud library](https://github.com/amueller/word_cloud).
With the help of `nltk` (already available in your Anaconda environment), implement a standard text pre-processing 
pipeline (e.g., tokenization, stopword removal, stemming, etc.) and generate a new word cloud. Discuss briefly the pros and
cons (if any) of the two word clouds you generated.

2. Find all the mentions of world countries in the whole corpus, using the `pycountry` utility (*HINT*: remember that
there will be different surface forms for the same country in the text, e.g., Switzerland, switzerland, CH, etc.)
Perform sentiment analysis on every email message using the demo methods in the `nltk.sentiment.util` module. Aggregate 
the polarity information of all the emails by country, and plot a histogram (ordered and colored by polarity level)
that summarizes the perception of the different countries. Repeat the aggregation + plotting steps using different demo
methods from the sentiment analysis module -- can you find substantial differences?

3. Using the `models.ldamodel` module from the [gensim library](https://radimrehurek.com/gensim/index.html), run topic
modeling over the corpus. Explore different numbers of topics (varying from 5 to 50), and settle for the parameter which
returns topics that you consider to be meaningful at first sight.

4. *BONUS*: build the communication graph (unweighted and undirected) among the different email senders and recipients
using the `NetworkX` library. Find communities in this graph with the `community.best_partition(G)` method from the 
[community detection module](http://perso.crans.org/aynaud/communities/index.html). Print the 20 most frequent words used
by the email authors of each community. Do these word lists look similar to what you produced in step 3 with LDA?
Can you identify clear discussion topics for each community? Discuss briefly the obtained results.


# 0. Prelude

In [2]:
import pandas as pd
import numpy as np
import nltk

# 1. Word cloud
Generate a word cloud based on the raw corpus -- I recommend using the [Python word_cloud library](https://github.com/amueller/word_cloud).
With the help of `nltk` (already available in your Anaconda environment), implement a standard text pre-processing 
pipeline (e.g., tokenization, stopword removal, stemming, etc.) and generate a new word cloud. Discuss briefly the pros and
cons (if any) of the two word clouds you generated.

In [3]:
filename = 'hillary-clinton-emails/Emails.csv'
df = pd.read_csv(filename)
df.sample(10)

Unnamed: 0,Id,DocNumber,MetadataSubject,MetadataTo,MetadataFrom,SenderPersonId,MetadataDateSent,MetadataDateReleased,MetadataPdfLink,MetadataCaseNumber,...,ExtractedTo,ExtractedFrom,ExtractedCc,ExtractedDateSent,ExtractedCaseNumber,ExtractedDocNumber,ExtractedDateReleased,ExtractedReleaseInPartOrFull,ExtractedBodyText,RawText
67,68,C05739640,CNN BELIEF BLOG. PROTHERO,Russorv@state.gov,H,80.0,2012-09-14T04:00:00+00:00,2015-05-22T04:00:00+00:00,DOCUMENTS/HRC_Email_1_296/HRCH2/DOC_0C05739640...,F-2015-04841,...,,,,,F-2015-04841,C05739640,05/13/2015,RELEASE IN PART,Pis print.\nH <hrod17@clintonemail.com>\nFrida...,B6\nUNCLASSIFIED\nU.S. Department of State\nCa...
4762,4763,C05768844,NAM LUNCH,H,"Sullivan, Jacob J",87.0,2010-04-29T04:00:00+00:00,2015-08-31T04:00:00+00:00,DOCUMENTS/HRCEmail_August_Web/IPS-0076/DOC_0C0...,F-2014-20439,...,,"Sullivan, Jacob J <Sullivan.11@state.gov>","Abedin, Huma","Thursday, April 29, 2010 7:28 PM",F-2014-20439,C05768844,08/31/2015,RELEASE IN PART,Huma said you were asking about the goals and ...,UNCLASSIFIED U.S. Department of State Case No....
7038,7039,C05773851,GOVERNOR RICHARDSON'S INQUIRIES RE MISSIONARIE...,millscd@state.gov,H,80.0,2010-02-10T05:00:00+00:00,2015-08-31T04:00:00+00:00,DOCUMENTS/HRCEmail_August_Web/IPS-0119/DOC_0C0...,F-2014-20439,...,'millscd@state.gov',H <hrod17@clintonemail.com>,SES-O_SWO-Only,"Wednesday, February 10, 2010 3:10 PM",F-2014-20439,C05773851,08/31/2015,RELEASE IN PART,When can you talk? I'm free until 4 and after 7.,UNCLASSIFIED U.S. Department of State Case No....
6390,6391,C05772051,(AP) EGYPTIANS RIOT BURN CARS CLAIMING VOTE FRAUD,H,"Abedin, Huma",81.0,2010-11-29T05:00:00+00:00,2015-08-31T04:00:00+00:00,DOCUMENTS/HRCEmail_August_Web/IPS-0110/DOC_0C0...,F-2014-20439,...,,"Abedin, Huma <AbedinH@state.gov>",SES-0_0S; SES-O_SWO-Only,"Monday, November 29, 2010 9:21 PM",F-2014-20439,C05772051,08/31/2015,RELEASE IN FULL,,UNCLASSIFIED U.S. Department of State Case No....
4881,4882,C05769117,"OBAMA'S DIPLOMACY, NOT FULLY ENGAGED (FOR S)",H,"Abedin, Huma",81.0,2010-05-03T04:00:00+00:00,2015-08-31T04:00:00+00:00,DOCUMENTS/HRCEmail_August_Web/IPS-0077/DOC_0C0...,F-2014-20439,...,,"Abedin, Huma <AbedinH©state.gov>",,"Monday, May 3, 2010 12:38 PM",F-2014-20439,C05769117,08/31/2015,RELEASE IN FULL,Article from cdm,UNCLASSIFIED U.S. Department of State Case No....
3362,3363,C05765899,HOLBROOKE SAYS HE NEEDS TO TALK TO YOU BEFORE ...,H,"Abedin, Huma",81.0,2009-09-13T04:00:00+00:00,2015-07-31T04:00:00+00:00,DOCUMENTS/HRCEmail_JulyWeb/Web_037/DOC_0C05765...,F-2014-20439,...,,,,,F-2014-20439,C05765899,07/31/2015,RELEASE IN FULL,"Abedin, Huma <AbedinH@state.gov>\nSunday, Sept...",UNCLASSIFIED U.S. Department of State Case No....
7502,7503,C05775200,MARC GROSSMAN ARTICLE YOU REQUESTED,H,"Coleman, Claire L",38.0,2010-09-28T04:00:00+00:00,2015-08-31T04:00:00+00:00,DOCUMENTS/HRCEmail_August_Web/IPS-0100/DOC_0C0...,F-2014-20439,...,,"Coleman, Claire L <ColemanCL@state.gov>","Abedin, Huma","Tuesday, September 28, 2010 2:42 PM",F-2014-20439,C05775200,08/31/2015,RELEASE IN FULL,Diplomacy Before and After Conflict\nBy Marc G...,UNCLASSIFIED U.S. Department of State Case No....
2619,2620,C05764132,PHONE CALL REPORT,H,"Feltman, Jeffrey D",94.0,2009-10-03T04:00:00+00:00,2015-07-31T04:00:00+00:00,DOCUMENTS/HRCEmail_JulyWeb/Web_041/DOC_0C05764...,F-2014-20439,...,,,"Feltman, Jeffrey D .",,F-2014-20439,C05764132,07/31/2015,RELEASE IN PART,"Feltman, Jeffrey D <FeltmanJD@state.gov>\nSatu...",UNCLASSIFIED U.S. Department of State Case No....
1878,1879,C05762355,SEARCH,H,"Jiloty, Lauren C",116.0,2009-06-16T04:00:00+00:00,2015-06-30T04:00:00+00:00,DOCUMENTS/HRCAll_1_1-29_JuneWEB/23_24_25_26/DO...,F-2014-20439,...,,"Jiloty, Lauren C <JilotyLC@state.gov>",,Tue Jun 16 07:14:40 2009,F-2014-20439,C05762355,06/30/2015,RELEASE IN FULL,Ok,UNCLASSIFIED U.S. Department of State Case No....
1752,1753,C05761825,SYDNEY BLUMENTHAL,H,"Mills, Cheryl D",32.0,2009-06-05T04:00:00+00:00,2015-06-30T04:00:00+00:00,DOCUMENTS/HRCAll_1_1-29_JuneWEB/23_24_25_26/DO...,F-2014-20439,...,,"Mills, Cheryl D <MillsCD©state.gov>",,Fri Jun 05 19:38:20 2009,F-2014-20439,C05761825,06/30/2015,RELEASE IN FULL,Fyi,UNCLASSIFIED U.S. Department of State Case No....


In [4]:
from nltk.stem.snowball import SnowballStemmer

stopwords = nltk.corpus.stopwords.words('english')
stemmer = SnowballStemmer("english")

In [12]:
import re

def tokenize_and_stem(text):
    # first tokenize by sentence, then by word, to ensure that punctuation is caught as its own token
    tokens = [word for sent in nltk.sent_tokenize(text) for word in nltk.word_tokenize(sent)]
    filtered_tokens = []
    # filter out any tokens not containing letters (e.g., numeric tokens, raw punctuation)
    for token in tokens:
        if re.search('[a-zA-Z]', token):
            filtered_tokens.append(token)
    stems = [stemmer.stem(t) for t in filtered_tokens]
    return stems


In [10]:
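# Keep only the emails that have an extracted body
raw = df['ExtractedBodyText'].dropna().reset_index()
# Concatenate all bodies into one string for the raw word cloud.
# No separator is inserted, so the end of one email runs into the start of the next.
raw_corpus = "".join(raw['ExtractedBodyText'])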

B6
Thursday, March 3, 2011 9:45 PM
H: Latest How Syria is aiding Qaddafi and more... Sid
hrc memo syria aiding libya 030311.docx; hrc memo syria aiding libya 030311.docx
March 3, 2011
For: HillaryThxH <hrod17@clintonemail.com>
Friday, March 11, 2011 1:36 PM
Huma Abedin
Fw: H: Latest: How Syria is aiding Qaddafi and more... Sid
hrc memo syria aiding libya 030311.docx
Pis print.Pis print.
-•-...-^
H < hrod17@clintonernailcom>
Wednesday, September 12, 2012 2:11 PM
°Russorv@state.gov'
Fw: Meet The Right-Wing Extremist Behind Anti-fvluslim Film That Sparked Deadly Riots
From [meat)
Sent: Wednesday, September 12, 2012 01:00 PM
To: 11
Subject: Meet The Right Wing Extremist Behind Anti-Muslim Film That Sparked Deadly Riots
htte/maxbiumenthal.com12012/09/meet-the-right-wing-extremist-behind-anti-musiim-tihn-that-sparked-
deadly-riots/
Sent from my Verizon Wireless 4G LTE DROID
U.S. Department of State
Case No. F-2015-04841
Doc No. C05739559
Date: 05/13/2015
STATE DEPT. - PRODUCED TO HOUSE SELEC

In [None]:
tokenized = tokenize_and_stem(raw_corpus)

In [17]:
tokenized[1:20]

['thursday',
 'march',
 'pm',
 'h',
 'latest',
 'how',
 'syria',
 'is',
 'aid',
 'qaddafi',
 'and',
 'more',
 'sid',
 'hrc',
 'memo',
 'syria',
 'aid',
 'libya',
 '030311.docx']
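
The token list above still contains stopwords such as `how`, `is` and `and`, because `tokenize_and_stem` never uses the `stopwords` list defined earlier. A minimal sketch of the missing step follows, assuming the tokens are already available as a list. The helper names `token_frequencies` and `render_word_cloud`, the `min_length` threshold, and the sample data are hypothetical. Note that stemming can change a stopword's surface form, so filtering before stemming may catch more of them.

```python
from collections import Counter


def token_frequencies(tokens, stopwords, min_length=2):
    """Count tokens, skipping stopwords and tokens shorter than min_length."""
    stopword_set = {word.lower() for word in stopwords}
    return Counter(
        token.lower()
        for token in tokens
        if len(token) >= min_length and token.lower() not in stopword_set
    )


def render_word_cloud(frequencies, output_path):
    """Draw a word cloud from a {word: count} mapping and save it as an image."""
    # Imported lazily so the counting helper works without word_cloud installed.
    from wordcloud import WordCloud

    cloud = WordCloud(width=800, height=400, background_color="white")
    cloud.generate_from_frequencies(frequencies)
    cloud.to_file(output_path)


# Hand-made sample (an assumption for illustration); in the notebook these
# would be `tokenized` and nltk.corpus.stopwords.words('english').
sample_tokens = ["syria", "is", "aid", "qaddafi", "and", "more", "syria", "aid", "h"]
sample_stopwords = ["is", "and", "more"]
frequencies = token_frequencies(sample_tokens, sample_stopwords)
print(frequencies.most_common(3))
```

Calling `render_word_cloud(frequencies, "cloud.png")` would then write the image. Building the cloud from explicit frequencies keeps the pre-processing under your control instead of relying on the library's own tokenizer.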

# 2. Sentiment analysis

Find all the mentions of world countries in the whole corpus, using the `pycountry` utility (*HINT*: remember that
there will be different surface forms for the same country in the text, e.g., Switzerland, switzerland, CH, etc.)
Perform sentiment analysis on every email message using the demo methods in the `nltk.sentiment.util` module. Aggregate 
the polarity information of all the emails by country, and plot a histogram (ordered and colored by polarity level)
that summarizes the perception of the different countries. Repeat the aggregation + plotting steps using different demo
methods from the sentiment analysis module -- can you find substantial differences?
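
One possible way to approach this task is sketched below, under several assumptions. Country names are matched case-insensitively, while two- and three-letter codes are matched only in upper case, since codes such as `IN`, `TO` or `PM` would otherwise collide with ordinary words (and `PM` still collides with timestamps). The function names are hypothetical, `toy_score` is a placeholder that stands in for a real scorer (for example VADER's compound score from `nltk.sentiment.vader.SentimentIntensityAnalyzer`, which is an assumption beyond the assignment text), and the sample countries mimic the `.name`, `.alpha_2` and `.alpha_3` attributes exposed by `pycountry.countries`.

```python
import re
from collections import defaultdict, namedtuple


def build_surface_forms(countries):
    """Map lowercase names and uppercase ISO codes to a canonical country name."""
    names, codes = {}, {}
    for country in countries:
        names[country.name.lower()] = country.name
        codes[country.alpha_2] = country.name
        codes[country.alpha_3] = country.name
    return names, codes


def find_countries(text, names, codes):
    """Return the set of canonical country names mentioned in `text`."""
    found = set()
    lowered = text.lower()
    # Full names: case-insensitive, whole-word match.
    for surface, canonical in names.items():
        if re.search(r"\b" + re.escape(surface) + r"\b", lowered):
            found.add(canonical)
    # ISO codes: upper case only, to limit false positives.
    for token in re.findall(r"\b[A-Z]{2,3}\b", text):
        if token in codes:
            found.add(codes[token])
    return found


def mean_polarity_by_country(emails, names, codes, score):
    """Average the polarity of every email that mentions a given country."""
    polarities = defaultdict(list)
    for body in emails:
        polarity = score(body)
        for country in find_countries(body, names, codes):
            polarities[country].append(polarity)
    return {country: sum(values) / len(values) for country, values in polarities.items()}


def toy_score(text):
    """Placeholder scorer (hypothetical): +1 for 'great', -1 for 'bad'."""
    lowered = text.lower()
    return int("great" in lowered) - int("bad" in lowered)


# Stand-in for pycountry.countries, which yields objects with the same attributes.
Country = namedtuple("Country", "name alpha_2 alpha_3")
sample_countries = [Country("Switzerland", "CH", "CHE"), Country("Libya", "LY", "LBY")]
names, codes = build_surface_forms(sample_countries)

sample_emails = ["Great meeting in Switzerland today", "Bad news from libya and CH"]
polarity_by_country = mean_polarity_by_country(sample_emails, names, codes, toy_score)
print(polarity_by_country)
```

Sorting `polarity_by_country` by value gives the order for the bar plot, and swapping `toy_score` for another scorer repeats the aggregation with a different method.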

# 3. Topic modelling
Using the `models.ldamodel` module from the [gensim library](https://radimrehurek.com/gensim/index.html), run topic
modeling over the corpus. Explore different numbers of topics (varying from 5 to 50), and settle for the parameter which
returns topics that you consider to be meaningful at first sight.
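
A rough sketch of how the exploration could be organised follows. It assumes one token list per email (gensim expects a list of documents, not one concatenated string) and uses `corpora.Dictionary`, `doc2bow` and `models.ldamodel.LdaModel` from gensim. The helper names, the tokenization regex, the `passes` value and the candidate topic counts are assumptions chosen for illustration, not settings taken from the assignment.

```python
import re


def prepare_documents(texts, stopwords, min_length=3):
    """Turn each email body into a list of lowercase alphabetic tokens."""
    stopword_set = {word.lower() for word in stopwords}
    documents = []
    for text in texts:
        tokens = re.findall(r"[a-z]+", text.lower())
        documents.append(
            [t for t in tokens if len(t) >= min_length and t not in stopword_set]
        )
    return documents


def fit_lda(documents, num_topics, passes=5, random_state=0):
    """Fit an LDA model on tokenized documents (requires gensim)."""
    # Imported lazily so prepare_documents works without gensim installed.
    from gensim import corpora, models

    dictionary = corpora.Dictionary(documents)
    bow_corpus = [dictionary.doc2bow(document) for document in documents]
    return models.ldamodel.LdaModel(
        bow_corpus,
        num_topics=num_topics,
        id2word=dictionary,
        passes=passes,
        random_state=random_state,
    )


def explore_topic_counts(documents, topic_counts=(5, 10, 20, 50)):
    """Print a few topics for each candidate number of topics."""
    for num_topics in topic_counts:
        lda = fit_lda(documents, num_topics)
        print(f"--- {num_topics} topics ---")
        for topic_id, description in lda.print_topics(num_topics=5, num_words=8):
            print(topic_id, description)


# Hand-made sample; in the notebook the texts would come from
# df['ExtractedBodyText'].dropna().
sample_documents = prepare_documents(["Syria is aiding Libya 2011", "Ok"], ["is"])
print(sample_documents)
```

Very short emails such as `Ok` end up as empty documents, so it may be worth dropping them before calling `explore_topic_counts`.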

# 4. (bonus) Communication graph
*BONUS*: build the communication graph (unweighted and undirected) among the different email senders and recipients
using the `NetworkX` library. Find communities in this graph with `community.best_partition(G)` method from the 
[community detection module](http://perso.crans.org/aynaud/communities/index.html). Print the most frequent 20 words used
by the email authors of each community. Do these word lists look similar to what you've produced at step 3 with LDA?
Can you identify clear discussion topics for each community? Discuss briefly the obtained results.
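
The bonus task could be sketched as follows, assuming sender/recipient pairs have already been extracted (for instance from the `MetadataFrom` and `MetadataTo` columns) and that each author's tokens are available in a dictionary. The helper names and sample data are hypothetical. `community` here is the python-louvain package that provides `best_partition`.

```python
from collections import Counter, defaultdict


def build_edges(sender_recipient_pairs):
    """Collapse (sender, recipient) pairs into undirected, unweighted edges."""
    edges = set()
    for sender, recipient in sender_recipient_pairs:
        # Skip missing endpoints and self-loops; sort so (a, b) == (b, a).
        if sender and recipient and sender != recipient:
            edges.add(tuple(sorted((sender, recipient))))
    return edges


def detect_communities(edges):
    """Return {person: community_id} (requires networkx and python-louvain)."""
    # Imported lazily so the pure-Python helpers work without these packages.
    import networkx as nx
    import community

    graph = nx.Graph()
    graph.add_edges_from(edges)
    return community.best_partition(graph)


def top_words_per_community(partition, tokens_by_author, n=20):
    """List the n most frequent tokens used by the authors of each community."""
    counters = defaultdict(Counter)
    for author, tokens in tokens_by_author.items():
        if author in partition:
            counters[partition[author]].update(tokens)
    return {
        community_id: [word for word, _ in counter.most_common(n)]
        for community_id, counter in counters.items()
    }


# Hand-made sample data (assumptions for illustration only).
sample_pairs = [("H", "Abedin, Huma"), ("Abedin, Huma", "H"), ("H", "H")]
edges = build_edges(sample_pairs)

# A hand-written partition stands in for detect_communities(edges).
sample_partition = {"H": 0, "Abedin, Huma": 0, "Sullivan, Jacob J": 1}
sample_tokens_by_author = {
    "H": ["print", "memo", "memo"],
    "Abedin, Huma": ["memo"],
    "Sullivan, Jacob J": ["lunch"],
}
top_words = top_words_per_community(sample_partition, sample_tokens_by_author, n=2)
print(edges)
print(top_words)
```

With the real data, `detect_communities(edges)` would replace `sample_partition`, and the resulting word lists can be compared with the LDA topics from step 3.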