## Homework 5 - Part 4

This optional part builds an unweighted, undirected communication graph among the email senders and recipients using the NetworkX library.
We then detect communities and print the 20 most frequent words for each of them.


In [1]:
%matplotlib inline

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

import community  # Louvain community detection (python-louvain package)
import networkx as nx

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from collections import Counter

In [2]:
df_Emails = pd.read_csv('hillary-clinton-emails/Emails.csv')
df_Receivers = pd.read_csv('hillary-clinton-emails/EmailReceivers.csv')

First, let's take a look at the dataframes we will need.

In [3]:
df_Receivers.head(3)

Unnamed: 0,Id,EmailId,PersonId
0,1,1,80
1,2,2,80
2,3,3,228


In [4]:
df_Emails.head(3)

Unnamed: 0,Id,DocNumber,MetadataSubject,MetadataTo,MetadataFrom,SenderPersonId,MetadataDateSent,MetadataDateReleased,MetadataPdfLink,MetadataCaseNumber,...,ExtractedTo,ExtractedFrom,ExtractedCc,ExtractedDateSent,ExtractedCaseNumber,ExtractedDocNumber,ExtractedDateReleased,ExtractedReleaseInPartOrFull,ExtractedBodyText,RawText
0,1,C05739545,WOW,H,"Sullivan, Jacob J",87.0,2012-09-12T04:00:00+00:00,2015-05-22T04:00:00+00:00,DOCUMENTS/HRC_Email_1_296/HRCH2/DOC_0C05739545...,F-2015-04841,...,,"Sullivan, Jacob J <Sullivan11@state.gov>",,"Wednesday, September 12, 2012 10:16 AM",F-2015-04841,C05739545,05/13/2015,RELEASE IN FULL,,UNCLASSIFIED\nU.S. Department of State\nCase N...
1,2,C05739546,H: LATEST: HOW SYRIA IS AIDING QADDAFI AND MOR...,H,,,2011-03-03T05:00:00+00:00,2015-05-22T04:00:00+00:00,DOCUMENTS/HRC_Email_1_296/HRCH1/DOC_0C05739546...,F-2015-04841,...,,,,,F-2015-04841,C05739546,05/13/2015,RELEASE IN PART,"B6\nThursday, March 3, 2011 9:45 PM\nH: Latest...",UNCLASSIFIED\nU.S. Department of State\nCase N...
2,3,C05739547,CHRIS STEVENS,;H,"Mills, Cheryl D",32.0,2012-09-12T04:00:00+00:00,2015-05-22T04:00:00+00:00,DOCUMENTS/HRC_Email_1_296/HRCH2/DOC_0C05739547...,F-2015-04841,...,B6,"Mills, Cheryl D <MillsCD@state.gov>","Abedin, Huma","Wednesday, September 12, 2012 11:52 AM",F-2015-04841,C05739547,05/14/2015,RELEASE IN PART,Thx,UNCLASSIFIED\nU.S. Department of State\nCase N...


Let's merge the two dataframes into one to get the sender and the receiver of each email (Id).

In [5]:
df_merged = pd.merge(left=df_Receivers,right=df_Emails, left_on='Id', right_on='Id')
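The merge above pairs the two tables on their `Id` columns. In the receivers table, the column that appears to reference an email is `EmailId`, so joining on it may be closer to the stated goal when an email has several recipients. The following is a minimal sketch on made-up toy data (the frames `receivers` and `emails` and all their values are hypothetical), not a replacement for the cell above:

```python
import pandas as pd

# Toy stand-ins for the two CSV files; all values are invented for illustration.
receivers = pd.DataFrame({
    "Id": [1, 2, 3],          # row id of the receiver record
    "EmailId": [1, 1, 2],     # email 1 has two recipients
    "PersonId": [80, 81, 80],
})
emails = pd.DataFrame({
    "Id": [1, 2],             # email id
    "SenderPersonId": [87, 32],
})

# Join every receiver row to the email it belongs to.
by_email = pd.merge(
    receivers, emails,
    left_on="EmailId", right_on="Id",
    suffixes=("_receiver", "_email"),
)

# One (sender, receiver) pair per receiver row.
pairs = list(zip(by_email["SenderPersonId"], by_email["PersonId"]))
print(pairs)
```

With this join key, both recipients of email 1 are paired with its sender, whereas a join on the receiver row id would pair the third receiver row with a non-existent email 3 and drop it.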

In [6]:
# drop the rows with a missing sender or receiver
df_merged.dropna(axis=0, subset=['PersonId', 'SenderPersonId'], inplace=True)

In [7]:
df_merged.head(2)

Unnamed: 0,Id,EmailId,PersonId,DocNumber,MetadataSubject,MetadataTo,MetadataFrom,SenderPersonId,MetadataDateSent,MetadataDateReleased,...,ExtractedTo,ExtractedFrom,ExtractedCc,ExtractedDateSent,ExtractedCaseNumber,ExtractedDocNumber,ExtractedDateReleased,ExtractedReleaseInPartOrFull,ExtractedBodyText,RawText
0,1,1,80,C05739545,WOW,H,"Sullivan, Jacob J",87.0,2012-09-12T04:00:00+00:00,2015-05-22T04:00:00+00:00,...,,"Sullivan, Jacob J <Sullivan11@state.gov>",,"Wednesday, September 12, 2012 10:16 AM",F-2015-04841,C05739545,05/13/2015,RELEASE IN FULL,,UNCLASSIFIED\nU.S. Department of State\nCase N...
2,3,3,228,C05739547,CHRIS STEVENS,;H,"Mills, Cheryl D",32.0,2012-09-12T04:00:00+00:00,2015-05-22T04:00:00+00:00,...,B6,"Mills, Cheryl D <MillsCD@state.gov>","Abedin, Huma","Wednesday, September 12, 2012 11:52 AM",F-2015-04841,C05739547,05/14/2015,RELEASE IN PART,Thx,UNCLASSIFIED\nU.S. Department of State\nCase N...


In [8]:
# SenderPersonId was read as float because of the missing values; cast the ids back to int
df_merged[['EmailId']] = df_merged[['EmailId']].astype(int)
df_merged[['SenderPersonId']] = df_merged[['SenderPersonId']].astype(int)

The dataframe is ready. Let's now create the graph.

In [9]:
G = nx.Graph()

In [10]:
def addEdge(row):
    # connect the receiver and the sender of this row with an undirected edge
    G.add_edge(row['PersonId'], row['SenderPersonId'])

In [11]:
df_merged.apply(addEdge, axis=1);
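Because `nx.Graph` is a simple undirected graph, adding the same pair several times, or in the opposite direction, still yields a single edge. This is what makes the communication graph unweighted and undirected. A minimal sketch with a hypothetical list of (receiver, sender) pairs, using `add_edges_from` rather than a row-wise `apply`:

```python
import networkx as nx

# Hypothetical (receiver, sender) pairs; note the repeated and the reversed pair.
pairs = [(80, 87), (87, 80), (228, 32), (80, 87)]

G_demo = nx.Graph()
G_demo.add_edges_from(pairs)  # duplicate and reversed pairs collapse into one edge

print(G_demo.number_of_nodes(), G_demo.number_of_edges())
```

If the number of emails exchanged between two people mattered, the counts would have to be stored explicitly, for instance as an edge attribute.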

In [12]:
partitions = community.best_partition(G)
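`best_partition` uses the Louvain heuristic, which looks for a partition of the nodes with high modularity. To give an idea of the quantity being optimised, here is a small sketch that computes the modularity of a given partition by hand. The function `modularity`, the toy edge list, and the two partitions are illustrative assumptions, and the code assumes an unweighted graph without duplicate edges or self-loops:

```python
from collections import defaultdict

def modularity(edges, membership):
    """Modularity Q of a partition of an undirected, unweighted graph.

    Q = sum over communities c of  L_c / m - (d_c / (2 m))^2
    where m is the number of edges, L_c the number of edges inside c,
    and d_c the sum of the degrees of the nodes in c.
    """
    m = len(edges)
    internal = defaultdict(int)  # L_c
    degree = defaultdict(int)    # d_c
    for u, v in edges:
        degree[membership[u]] += 1
        degree[membership[v]] += 1
        if membership[u] == membership[v]:
            internal[membership[u]] += 1
    return sum(internal[c] / m - (degree[c] / (2 * m)) ** 2 for c in degree)

# Two triangles joined by a single edge (3, 4).
edges = [(1, 2), (2, 3), (1, 3), (4, 5), (5, 6), (4, 6), (3, 4)]

split = {1: 0, 2: 0, 3: 0, 4: 1, 5: 1, 6: 1}   # one community per triangle
merged = {node: 0 for node in split}           # everything in one community

q_split = modularity(edges, split)
q_merged = modularity(edges, merged)
print(q_split, q_merged)
```

Putting each triangle in its own community gives a higher modularity than lumping all nodes together, which is the kind of split the community detection step favours.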

Now that the partition is computed, we assign the emails sent by each person to that person's community so we can count the words.

In [13]:
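communities = {}

# partitions maps each person (node) to the id of its community
for person_id, community_id in partitions.items():
    # concatenate the raw text of every email sent by this person
    mail = ' '.join(df_merged[df_merged.SenderPersonId == person_id].RawText)
    communities.setdefault(community_id, []).append(mail)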

In [14]:
# use the same stop word list as in part 1
stop = set(stopwords.words('english')) # take a typical stop words list for english
stop.update(['.', ',', '"', "'", '?', '!', ':', ';', '(', ')', '[', ']', '{', '}', '@', '<', '>', '-', 'subject', 'fw', 'cc', 'am', 'pm'])
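The stop set is used as a plain membership filter: each lowercased token is kept only if it is not in the set, and the surviving tokens are counted. Any token missing from the set is counted, so punctuation-like tokens have to be listed explicitly if they should be excluded. A minimal sketch of that filter, assuming a tiny hypothetical stop set and `str.split` in place of the NLTK tokenizer so that it runs without any downloaded corpus:

```python
from collections import Counter

# Hypothetical miniature stop set, for illustration only.
stop_demo = {"the", "of", ".", ","}

text = "The State Department of the U.S. , the State ."

# Lowercase, split on whitespace, and drop every token found in the stop set.
tokens = [t for t in text.lower().split() if t not in stop_demo]
counts = Counter(tokens)

print(counts.most_common(2))
```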

In [15]:
# community_id (rather than community) avoids shadowing the imported community module
for community_id, text in communities.items():
    text = ' '.join(text)
    text = text.replace('\n', ' ')
    w = [i for i in word_tokenize(text.lower()) if i not in stop]
    count = Counter(w)

    print(community_id, ': ', count.most_common(20), '\n\n')

0 :  [('state', 8671), ('u.s.', 7896), ('department', 7578), ('case', 7280), ('date', 7236), ('unclassified', 7160), ('doc', 7071), ('f-2014-20439', 6492), ('sent', 5565), ('h', 4861), ("'s", 4532), ('state.gov', 3895), ('06/30/2015', 3421), ('2009', 3385), ('message', 2876), ('08/31/2015', 2811), ('original', 2673), ('``', 2616), ('clintonemail.com', 2555), ("''", 2504)] 


1 :  [('state', 4301), ('u.s.', 4279), ('department', 4039), ('case', 3917), ('date', 3914), ('unclassified', 3872), ('doc', 3806), ('f-2014-20439', 3738), ('sent', 3522), ('huma', 2729), ('abedin', 2571), ('08/31/2015', 2144), ('state.gov', 2092), ('2010', 1806), ('2009', 1746), ("'s", 1711), ('release', 1653), ('h', 1638), ('abedinh', 1500), ('message', 1168)] 


2 :  [('state', 2189), ('department', 2085), ('u.s.', 2027), ('case', 1777), ('date', 1769), ('unclassified', 1762), ('doc', 1733), ('f-2014-20439', 1717), ('sent', 1512), ("'s", 1494), ('08/31/2015', 1298), ('2010', 1073), ('office', 901), ('state.gov',