# Secrets inside Clinton's email

## Abstract
In 2015, Hillary Clinton became embroiled in controversy over her use of personal email accounts on non-government servers during her time as United States Secretary of State. Over 2000 confidential emails were leaked, some of them even classified as “Top Secret”.

In this project we look at the political, security and economic aspects of the 7945 leaked emails, which were redacted and published by the State Department and cleaned by Kaggle. We also want to analyze Hillary Clinton's personal social network and the main topics its members discussed.

As a superpower, the United States has a great impact on the world’s stability, and its positions and attitudes strongly influence international affairs. By analyzing these emails, we want to identify the countries mentioned most often and the issues raised, and to draw conclusions about their impact on international affairs.

The dataset can be found on [Kaggle](https://www.kaggle.com/kaggle/hillary-clinton-emails).

## Milestone 2
**Tasks for milestone 2**
* Data wrangling: clean and handle invalid or missing data; combine data files to extract useful information
* Construct a list of country occurrences
* Identify key words related to international affairs in the body text
* Find the communication frequency between Hillary and the others
* Personal social network creation and analysis: use the communication frequency to create a personal social network of Hillary and analyze the network structure
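
As a rough illustration of the last two tasks, the sketch below turns sender/recipient pairs into undirected edge weights and then reads off how often one person communicates with each contact. The `messages` list and the `HUB` id are hypothetical toy values, not taken from the dataset, and this is only one possible way to define communication frequency.

```python
from collections import Counter

# Hypothetical (sender_id, recipient_ids) pairs, for illustration only.
messages = [
    (32, [80]),
    (80, [32, 81]),
    (81, [80]),
    (32, [80]),
]

# Undirected edge weights: number of messages exchanged between each pair.
edge_weights = Counter()
for sender, recipients in messages:
    for recipient in recipients:
        edge = tuple(sorted((sender, recipient)))
        edge_weights[edge] += 1

# Communication frequency between one central person and each contact.
HUB = 80  # assumed id of the person at the centre of the network
contact_frequency = {
    (a if b == HUB else b): weight
    for (a, b), weight in edge_weights.items()
    if HUB in (a, b)
}
print(contact_frequency)
```

The `edge_weights` counter could later be handed to a graph library as a weighted edge list; that step is left out here.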

---

## Data wrangling
### Clean the data and drop features that are not useful

In [152]:
import numpy as np
import pandas as pd
import string
import re
import matplotlib.pyplot as plt
from collections import Counter
pd.options.mode.chained_assignment = None  # default='warn', Mutes warnings when copying a slice from a DataFrame.

In [2]:
data_folder = './hillary-clinton-emails/'

In [287]:
# load data and parse time to timestamp
emails_raw = pd.read_csv(data_folder + 'Emails.csv', parse_dates=['MetadataDateSent', 'MetadataDateReleased', 'ExtractedDateSent', 'ExtractedDateReleased'])

Some features in the data, for example `DocNumber`, are of no concern to us. They describe the release of the emails rather than the content of the emails themselves. For this reason, we keep only the useful features from the raw data.

In [137]:
# extract useful information only
emails = emails_raw[['Id', 
                    'MetadataSubject',
                    'MetadataTo',
                    'MetadataFrom',
                    'SenderPersonId',
                    'MetadataDateSent',
                    'ExtractedSubject',
                    'ExtractedTo',
                    'ExtractedFrom',
                    'ExtractedCc',
                    'ExtractedReleaseInPartOrFull',
                    'ExtractedBodyText']]

We extract `RawText` from the raw data separately because it is very large compared to the other features. We will use this feature to measure the textual information contained in the emails, for example the countries mentioned or other key words.

In [66]:
emails_rawText = emails_raw[['Id', 'RawText']]
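
A minimal sketch of how country mentions could later be counted in such text is shown below. The `countries` list and the `raw_texts` strings are hypothetical stand-ins for a real country list and the `RawText` column, and counting each country at most once per email is an assumption, not a decision taken in this notebook.

```python
import re
from collections import Counter

# Hypothetical country list and email texts, for illustration only.
countries = ["Libya", "Haiti", "China"]
raw_texts = [
    "Call with the Libya team; follow up on Haiti relief.",
    "LIBYA update attached.",
    "No country named here.",
]

# One case-insensitive pattern with word boundaries, so substrings do not match.
pattern = re.compile(
    r"\b(" + "|".join(map(re.escape, countries)) + r")\b", re.IGNORECASE
)

country_counts = Counter()
for text in raw_texts:
    # Count each country at most once per email (assumed convention).
    mentioned = {m.group(1).title() for m in pattern.finditer(text)}
    country_counts.update(mentioned)

print(country_counts)
```

Note that `str.title()` is only a crude way to normalise case; names containing lowercase words such as "of" would need a proper lookup table.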

### Convert involved alias to the PersonId
The sender and receiver information in `Emails.csv` is given as aliases. One person can have multiple aliases, which makes the analysis difficult. We therefore convert all aliases involved to a `PersonId` using `Aliases.csv`.

In [218]:
# Load data of alias
alias = pd.read_csv(data_folder + 'Aliases.csv')
persons = pd.read_csv(data_folder + 'Persons.csv')

In [139]:
# rearrange strings to a basic form
def manage_str(s):
    s = re.sub(r'\<.*', '', s)
    s = re.sub(r'\(.*\)', '', s)
    return s.lower()\
            .replace(',', '').replace('-', '').replace(' ', '').replace("'", '').replace('‘', '')\
            .replace('`', '').replace('°', '').replace('"', '').replace('•', '').replace('(', '')\
            .replace(')', '')

In [141]:
# create a dictionary of alias and personId
alias_dict = dict(zip(alias.Alias.apply(manage_str), alias.PersonId))

In [142]:
# convert alias to personId
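def alias2id(field):
    """Map a ';'-separated alias field to the list of known person ids."""
    # str() turns missing values (NaN) into 'nan', which matches no alias
    aliases = str(field).split(';')
    ids = []
    for x in aliases:
        x = manage_str(x)
        if x in alias_dict:
            ids.append(alias_dict[x])
    return ids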
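
To make the conversion concrete, here is a self-contained sketch of the same idea on toy data. `toy_alias_dict`, `normalize` and `to_ids` are hypothetical names; `normalize` is a simplified stand-in for `manage_str` (it strips every character outside `a-z0-9@.` rather than a fixed list), and the alias keys are invented for illustration.

```python
import re

# Hypothetical alias table; the real one is built from Aliases.csv.
toy_alias_dict = {"abedinhuma": 81, "h": 80, "millscheryld": 32}

def normalize(s):
    """Simplified stand-in for manage_str."""
    s = re.sub(r"<.*", "", s)      # drop the "<address>" suffix
    s = re.sub(r"\(.*\)", "", s)   # drop parenthesised notes
    return re.sub(r"[^a-z0-9@.]", "", s.lower())

def to_ids(field, table):
    """Map a ';'-separated alias field to the list of known person ids."""
    ids = []
    for part in str(field).split(";"):
        key = normalize(part)
        if key in table:
            ids.append(table[key])
    return ids

example = "Abedin, Huma <AbedinH@state.gov>; H; Unknown Person"
resolved = to_ids(example, toy_alias_dict)
print(resolved)
```

Aliases that are missing from the table are silently dropped, so an email whose recipients are all unknown ends up with an empty list.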

In [143]:
# Convert from alias to personId for five features related to people
emails.MetadataTo = emails.MetadataTo.apply(alias2id)
emails.MetadataFrom = emails.MetadataFrom.apply(alias2id)
emails.ExtractedTo = emails.ExtractedTo.apply(alias2id)
emails.ExtractedFrom = emails.ExtractedFrom.apply(alias2id)
emails.ExtractedCc = emails.ExtractedCc.apply(alias2id)

# TODO: not sure what exactly this step does
# First clarify the difference between the Metadata* and Extracted* fields

### Count occurrences in the From and To features

In [261]:
# Note: MetadataTo holds recipients, so despite its name this list collects recipient ids
email_senders = []
for i in emails.MetadataTo.values:
    for j in i:
        email_senders.append(j)

In [262]:
dict_person_count = dict(Counter(email_senders))

In [263]:
personId = [ k for k in dict_person_count ]
occurence = [ v for v in dict_person_count.values() ]
count_person = [(v,k) for k, v in dict_person_count.items() ]

In [270]:
person_occurrence_metaTo = pd.DataFrame({'personId': personId, 'occurency': occurence,})

person_occurrence_metaTo = person_occurrence_metaTo.sort_values('occurency', ascending=False)
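
The flatten-and-count steps above can also be sketched with the standard library alone. The `metadata_to` list below is a hypothetical stand-in for the `MetadataTo` column after alias conversion; this is an alternative formulation, not the code used to produce the results in this notebook.

```python
from collections import Counter
from itertools import chain

# Hypothetical stand-in for emails.MetadataTo: one list of person ids per email.
metadata_to = [[80], [80, 32], [], [81, 32], [80]]

# Flatten the per-email lists and count every id in one pass.
recipient_counts = Counter(chain.from_iterable(metadata_to))

# most_common returns (id, count) pairs sorted by descending count.
top_recipients = recipient_counts.most_common(2)
print(top_recipients)
```

`Counter.most_common` replaces the separate sort, and the resulting pairs can be passed to `pd.DataFrame` if a table is needed.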

In [283]:
important_sender = person_occurrence_metaTo[:15]

In [284]:
dict_id2name = dict(zip(persons.Id, persons.Name))

In [285]:
names = []
for i in important_sender.personId:
    names.append(dict_id2name[i])

In [286]:
important_sender['name'] = names
important_sender

| | personId | occurency | name |
|---|---|---|---|
| 0 | 80 | 5477 | Hillary Clinton |
| 5 | 32 | 395 | Cheryl Mills |
| 7 | 504 | 352 | a bed in h@state.gov |
| 4 | 334 | 298 | sulliva njj@state.g ov |
| 12 | 116 | 241 | Lauren Jiloty |
| 13 | 124 | 150 | Lona Valmoro |
| 1 | 81 | 82 | Huma Abedin |
| 17 | 485 | 62 | p rei n es |
| 41 | 194 | 50 | Sidney Blumenthal |
| 11 | 87 | 49 | Jake Sullivan |


In [276]:
important_sender.columns[2]

'name'