# Automatic Email Labeling

### Data Set:
The Enron email dataset contains 517,401 emails from 150 Enron Corporation employees, mostly senior management, organized into folders. The Federal Energy Regulatory Commission obtained it during its investigation of Enron's collapse. The dataset does not include attachments. Invalid email addresses were converted to the form user@enron.com whenever possible (i.e., when the recipient is specified in a parseable format such as "Doe, John" or "Mary K. Smith") and to no_address@enron.com when no recipient was specified.

Source: https://www.cs.cmu.edu/~./enron/ (May 7, 2015 version)

In [10]:
#import os, sys, email, re
import email, re
import numpy as np 
import pandas as pd
from pandas.core.common import flatten

# Plotting
import matplotlib.pyplot as plt
#%matplotlib inline
import seaborn as sns; sns.set_style('whitegrid')
#import wordcloud

# Network analysis
#import networkx as nx

# NLP
#from nltk.tokenize.regexp import RegexpTokenizer

#from subprocess import check_output

#from sklearn.feature_extraction.text import TfidfVectorizer,CountVectorizer
#from sklearn.cluster import KMeans
#from sklearn.decomposition import PCA
#from sklearn.lda import LDA
#from sklearn.decomposition import LatentDirichletAllocation

#import gensim
#from gensim import corpora
#from nltk.corpus import stopwords 
#from nltk.stem.wordnet import WordNetLemmatizer
#import string
#from nltk.stem.porter import PorterStemmer

# 1. Loading and cleaning data

In [11]:
# Read the data into a DataFrame
df_emails = pd.read_csv('../Data/emails.csv')

In [12]:
print("There are total {} emails.".format(df_emails.shape[0]))

There are total 517401 emails.


In [13]:
#Show the first 10 emails in the data frame.
df_emails.head(10)

Unnamed: 0,file,message
0,allen-p/_sent_mail/1.,Message-ID: <18782981.1075855378110.JavaMail.e...
1,allen-p/_sent_mail/10.,Message-ID: <15464986.1075855378456.JavaMail.e...
2,allen-p/_sent_mail/100.,Message-ID: <24216240.1075855687451.JavaMail.e...
3,allen-p/_sent_mail/1000.,Message-ID: <13505866.1075863688222.JavaMail.e...
4,allen-p/_sent_mail/1001.,Message-ID: <30922949.1075863688243.JavaMail.e...
5,allen-p/_sent_mail/1002.,Message-ID: <30965995.1075863688265.JavaMail.e...
6,allen-p/_sent_mail/1003.,Message-ID: <16254169.1075863688286.JavaMail.e...
7,allen-p/_sent_mail/1004.,Message-ID: <17189699.1075863688308.JavaMail.e...
8,allen-p/_sent_mail/101.,Message-ID: <20641191.1075855687472.JavaMail.e...
9,allen-p/_sent_mail/102.,Message-ID: <30795301.1075855687494.JavaMail.e...


In [14]:
#Show the message of a sample email whose body contains a forwarded message.
print(df_emails['message'][10471])

Message-ID: <27086631.1075854806617.JavaMail.evans@thyme>
Date: Tue, 24 Oct 2000 02:57:00 -0700 (PDT)
From: eric.bass@enron.com
To: shanna.husser@enron.com
Subject: Re: It could happen!!!
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
X-From: Eric Bass
X-To: Shanna Husser
X-cc: 
X-bcc: 
X-Folder: \Eric_Bass_Jun2001\Notes Folders\'sent mail
X-Origin: Bass-E
X-FileName: ebass.nsf

Thought you might like Superfan's comments.
---------------------- Forwarded by Eric Bass/HOU/ECT on 10/24/2000 09:56 AM 
---------------------------
   
	Enron North America Corp.
	
	From:  Eric Bass                           10/24/2000 09:53 AM
	

To: Chad Landry/HOU/ECT@ECT
cc: Matthew Lenhart/HOU/ECT@ECT, Christopher Coffman/Corp/Enron@Enron, 
William Kelly/HOU/ECT@ECT, Kyle Etter/HOU/ECT@ECT, Kam Keiser/HOU/ECT@ECT, 
Jay Reitmeyer/HOU/ECT@ECT, Jeff Coates/HOU/ECT@ECT, William 
Keeney/HOU/ECT@ECT, Jeffrey C Gossett/HOU/ECT@ECT, John King/HOU/ECT@ECT, 
Luis Mena/

In [15]:
#Show the message of a sample email with Cc and Bcc. 
#print(df_emails['message'][10000])

In [16]:
#Each message consists of the headers Message-ID, Date, From, To, Cc, Bcc, Subject, Mime-Version,
#Content-Type, Content-Transfer-Encoding, X-From, X-To, X-cc, X-bcc, X-Folder, X-Origin, and
#X-FileName, followed by the message body. We are only interested in X-Origin (i.e., the employee name),
#Date, From, To, Cc, Bcc, Subject, X-Folder, and the message body.
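
As a minimal sketch of how the headers of interest can be read from one raw message, the snippet below parses a shortened, hypothetical message with the standard-library `email` module. The `raw` string and the `wanted` list are illustrative assumptions modeled on the sample above. Looking up a header that is absent (here `Cc` and `Bcc`) returns `None` instead of raising an error.

```python
import email

# Hypothetical, shortened message in the same layout as the dataset rows.
raw = (
    "Message-ID: <0001.example@thyme>\n"
    "Date: Tue, 24 Oct 2000 02:57:00 -0700 (PDT)\n"
    "From: eric.bass@enron.com\n"
    "To: shanna.husser@enron.com\n"
    "Subject: Re: It could happen!!!\n"
    "X-Folder: \\Eric_Bass_Jun2001\\Notes Folders\\'sent mail\n"
    "X-Origin: Bass-E\n"
    "\n"
    "Thought you might like Superfan's comments.\n"
)

msg = email.message_from_string(raw)

# Headers we want to keep; missing ones come back as None.
wanted = ["X-Origin", "Date", "From", "To", "Cc", "Bcc", "Subject", "X-Folder"]
selected = {key: msg[key] for key in wanted}

# For a single-part text/plain message the payload is the body string.
body = msg.get_payload()
```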

In [17]:
## Helper functions
def get_message_body(msg):
    '''Return the plain-text parts of a message as one string, with
    "-----Original Message-----" separators removed and newlines/tabs
    replaced by spaces.'''
    parts = []
    for part in msg.walk():
        if part.get_content_type() == 'text/plain':
            text = ''.join(part.get_payload().split('-----Original Message-----'))
            parts.append(re.sub('\n|\t', ' ', text).strip())
    return ''.join(parts)

def get_text_from_email(msg):
    '''Return the unmodified plain-text parts of a message as one string.'''
    parts = []
    for part in msg.walk():
        if part.get_content_type() == 'text/plain':
            parts.append(part.get_payload())
    return ''.join(parts)

def split_email_addresses(line):
    '''Split a comma-separated address header into a list of addresses;
    return None if the header is missing or empty.'''
    if line:
        addrs = [addr.strip() for addr in line.split(',')]
    else:
        addrs = None
    return addrs

def drop_none(line):
    '''Return the non-empty addresses in `line` as a sorted list.'''
    return sorted(a for a in line if a)
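
Splitting on commas works when a header holds bare addresses, which is the common case in this dataset. It would break on a display name that itself contains a comma, such as `"Doe, John" <john.doe@enron.com>`. The sketch below shows one possible alternative based on the standard-library `email.utils.getaddresses`; the function name `parse_addresses` and the sample header are hypothetical and not part of the original notebook.

```python
from email.utils import getaddresses

def parse_addresses(header_value):
    """Return the bare addresses in a header value, or None if it is empty."""
    if not header_value:
        return None
    # getaddresses yields (display name, address) pairs; keep only the address.
    return [addr for _name, addr in getaddresses([header_value]) if addr]

# Hypothetical header with a quoted display name that contains a comma.
sample = '"Doe, John" <john.doe@enron.com>, mary.smith@enron.com'

naive = [a.strip() for a in sample.split(',')]  # splits inside the quoted name
parsed = parse_addresses(sample)
```

Whether the extra parsing is worth it depends on how often such display names occur in the `From`, `To`, `Cc`, and `Bcc` headers.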

In [19]:
headers = list(map(email.message_from_string, df_emails['message']))

In [20]:
bodies = list(map(get_text_from_email, headers))

In [21]:
#bodies[10471]

In [22]:
#headers[10000].items()

In [25]:
headers_sel = ["X-Origin", "Date", "From", "To", "Cc", "Bcc", "Subject", "X-Folder"]

In [26]:
for key in headers_sel:
    df_emails[key] = [message[key] for message in headers]

In [27]:
df_emails.columns

Index(['file', 'message', 'X-Origin', 'Date', 'From', 'To', 'Cc', 'Bcc',
       'Subject', 'X-Folder'],
      dtype='object')

In [28]:
email_folder = df_emails['file'].apply(lambda x: x.split('/'))
Employee,Folder,Email_Number=[],[],[]
for i in email_folder.index:
    Employee.append(email_folder[i][0])
    #Folder.append(email_folder[i][1])
    Folder.append('/'.join(email_folder[i][1:-1]))
    Email_Number.append(email_folder[i][-1])
df_emails['Employee'], df_emails['Folder'], df_emails['Email Number'] = Employee, Folder, Email_Number
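
The cell above splits each `file` path into the employee, the folder (which may be nested), and the email number. The same rule can be sketched as a small standalone function; `split_file_path` is a hypothetical helper name, and the sketch assumes every path has at least an employee and a number component.

```python
def split_file_path(path):
    """Split 'employee/folder[/subfolder...]/number' into its three parts."""
    # First component is the employee, last is the email number,
    # everything in between is the (possibly nested) folder.
    employee, *folders, number = path.split('/')
    return employee, '/'.join(folders), number

flat = split_file_path('allen-p/_sent_mail/1.')
nested = split_file_path('kean-s/calendar/untitled/12.')
```

The helper could be applied with `df_emails['file'].map(split_file_path)`, which yields one tuple per row.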

In [29]:
#print(df_emails["Body"][9])

In [30]:
# Parse content from emails (reuse the bodies already extracted into `bodies`)
df_emails['Body'] = bodies
# Split multiple email addresses
df_emails['From'] = df_emails['From'].map(split_email_addresses)
df_emails['To'] = df_emails['To'].map(split_email_addresses)
df_emails['Cc'] = df_emails['Cc'].map(split_email_addresses)
df_emails['Bcc'] = df_emails['Bcc'].map(split_email_addresses)

In [31]:
df_emails.head()

Unnamed: 0,file,message,X-Origin,Date,From,To,Cc,Bcc,Subject,X-Folder,Employee,Folder,Email Number,Body
0,allen-p/_sent_mail/1.,Message-ID: <18782981.1075855378110.JavaMail.e...,Allen-P,"Mon, 14 May 2001 16:39:00 -0700 (PDT)",[phillip.allen@enron.com],[tim.belden@enron.com],,,,"\Phillip_Allen_Jan2002_1\Allen, Phillip K.\'Se...",allen-p,_sent_mail,1.0,Here is our forecast\n\n
1,allen-p/_sent_mail/10.,Message-ID: <15464986.1075855378456.JavaMail.e...,Allen-P,"Fri, 4 May 2001 13:51:00 -0700 (PDT)",[phillip.allen@enron.com],[john.lavorato@enron.com],,,Re:,"\Phillip_Allen_Jan2002_1\Allen, Phillip K.\'Se...",allen-p,_sent_mail,10.0,Traveling to have a business meeting takes the...
2,allen-p/_sent_mail/100.,Message-ID: <24216240.1075855687451.JavaMail.e...,Allen-P,"Wed, 18 Oct 2000 03:00:00 -0700 (PDT)",[phillip.allen@enron.com],[leah.arsdall@enron.com],,,Re: test,\Phillip_Allen_Dec2000\Notes Folders\'sent mail,allen-p,_sent_mail,100.0,test successful. way to go!!!
3,allen-p/_sent_mail/1000.,Message-ID: <13505866.1075863688222.JavaMail.e...,Allen-P,"Mon, 23 Oct 2000 06:13:00 -0700 (PDT)",[phillip.allen@enron.com],[randall.gay@enron.com],,,,\Phillip_Allen_Dec2000\Notes Folders\'sent mail,allen-p,_sent_mail,1000.0,"Randy,\n\n Can you send me a schedule of the s..."
4,allen-p/_sent_mail/1001.,Message-ID: <30922949.1075863688243.JavaMail.e...,Allen-P,"Thu, 31 Aug 2000 05:07:00 -0700 (PDT)",[phillip.allen@enron.com],[greg.piper@enron.com],,,Re: Hello,\Phillip_Allen_Dec2000\Notes Folders\'sent mail,allen-p,_sent_mail,1001.0,Let's shoot for Tuesday at 11:45.


In [32]:
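# Collect every address that appears in From, To, Cc, or Bcc for each email.
# Missing headers contribute None, which is dropped in the next cell.
df_emails["Related Party"] = [
    set(flatten(parties))
    for parties in zip(df_emails["From"], df_emails["To"], df_emails["Cc"], df_emails["Bcc"])
]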

In [33]:
df_emails["Related Party"] = df_emails["Related Party"].map(drop_none)
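
The two cells above build, for each email, the sorted set of all addresses found in `From`, `To`, `Cc`, and `Bcc`, ignoring missing headers. A dependency-free sketch of that logic follows; `related_parties` is a hypothetical name. It avoids `pandas.core.common.flatten`, which lives in a pandas module that is not part of the documented public API.

```python
def related_parties(*address_lists):
    """Merge address lists (any of which may be None) into a sorted, de-duplicated list."""
    merged = set()
    for addrs in address_lists:
        if addrs:  # skip None and empty lists
            merged.update(a for a in addrs if a)
    return sorted(merged)

# From, To, Cc, Bcc for a hypothetical email without Cc/Bcc headers.
parties = related_parties(
    ['phillip.allen@enron.com'],
    ['tim.belden@enron.com', 'phillip.allen@enron.com'],
    None,
    None,
)
```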

In [34]:
df_emails.head()

Unnamed: 0,file,message,X-Origin,Date,From,To,Cc,Bcc,Subject,X-Folder,Employee,Folder,Email Number,Body,Related Party
0,allen-p/_sent_mail/1.,Message-ID: <18782981.1075855378110.JavaMail.e...,Allen-P,"Mon, 14 May 2001 16:39:00 -0700 (PDT)",[phillip.allen@enron.com],[tim.belden@enron.com],,,,"\Phillip_Allen_Jan2002_1\Allen, Phillip K.\'Se...",allen-p,_sent_mail,1.0,Here is our forecast\n\n,"[phillip.allen@enron.com, tim.belden@enron.com]"
1,allen-p/_sent_mail/10.,Message-ID: <15464986.1075855378456.JavaMail.e...,Allen-P,"Fri, 4 May 2001 13:51:00 -0700 (PDT)",[phillip.allen@enron.com],[john.lavorato@enron.com],,,Re:,"\Phillip_Allen_Jan2002_1\Allen, Phillip K.\'Se...",allen-p,_sent_mail,10.0,Traveling to have a business meeting takes the...,"[john.lavorato@enron.com, phillip.allen@enron...."
2,allen-p/_sent_mail/100.,Message-ID: <24216240.1075855687451.JavaMail.e...,Allen-P,"Wed, 18 Oct 2000 03:00:00 -0700 (PDT)",[phillip.allen@enron.com],[leah.arsdall@enron.com],,,Re: test,\Phillip_Allen_Dec2000\Notes Folders\'sent mail,allen-p,_sent_mail,100.0,test successful. way to go!!!,"[leah.arsdall@enron.com, phillip.allen@enron.com]"
3,allen-p/_sent_mail/1000.,Message-ID: <13505866.1075863688222.JavaMail.e...,Allen-P,"Mon, 23 Oct 2000 06:13:00 -0700 (PDT)",[phillip.allen@enron.com],[randall.gay@enron.com],,,,\Phillip_Allen_Dec2000\Notes Folders\'sent mail,allen-p,_sent_mail,1000.0,"Randy,\n\n Can you send me a schedule of the s...","[phillip.allen@enron.com, randall.gay@enron.com]"
4,allen-p/_sent_mail/1001.,Message-ID: <30922949.1075863688243.JavaMail.e...,Allen-P,"Thu, 31 Aug 2000 05:07:00 -0700 (PDT)",[phillip.allen@enron.com],[greg.piper@enron.com],,,Re: Hello,\Phillip_Allen_Dec2000\Notes Folders\'sent mail,allen-p,_sent_mail,1001.0,Let's shoot for Tuesday at 11:45.,"[greg.piper@enron.com, phillip.allen@enron.com]"


In [35]:
#print(df_emails["Related Party"][10000])

In [36]:
df_emails.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 517401 entries, 0 to 517400
Data columns (total 15 columns):
file             517401 non-null object
message          517401 non-null object
X-Origin         517372 non-null object
Date             517401 non-null object
From             517401 non-null object
To               495554 non-null object
Cc               127881 non-null object
Bcc              127881 non-null object
Subject          517401 non-null object
X-Folder         517372 non-null object
Employee         517401 non-null object
Folder           517401 non-null object
Email Number     517401 non-null object
Body             517401 non-null object
Related Party    517401 non-null object
dtypes: object(15)
memory usage: 59.2+ MB


In [37]:
#Check missing values
print('Check missing values:\n', df_emails.isnull().sum())

Check missing values:
 file                  0
message               0
X-Origin             29
Date                  0
From                  0
To                21847
Cc               389520
Bcc              389520
Subject               0
X-Folder             29
Employee              0
Folder                0
Email Number          0
Body                  0
Related Party         0
dtype: int64


In [38]:
#print(df_emails.iloc[18178])

In [39]:
print(df_emails.iloc[18178].Body)

Wednesday,  September
 19th  
3 - 3:30pm
Room - 3125b
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
X-From: Baughman Jr., Don </O=ENRON/OU=NA/CN=RECIPIENTS/CN=DBAUGHM>
X-To: 
X-cc: 
X-bcc: 
X-Folder: \ExMerge - Baughman Jr., Don\Calendar
X-Origin: BAUGHMAN-D
X-FileName: don baughman 6-25-02.PST

Midwest/Southeast Desk Meeting w/Fletch Sturm
Wednesday,  September 19th  
3 - 3:30pm
Room - 3125b
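
The body printed above still contains header lines such as `X-Folder` and `X-Origin`. A likely explanation is that a header value in the raw message continues on a line without leading whitespace, so the parser treats that line as the start of the body. This is probably also why a few rows have no `X-Origin` or `X-Folder`. The sketch below reproduces the effect on a hypothetical message; the `raw` string is an assumption, not the actual dataset row.

```python
import email
from email import errors

# Hypothetical raw message: the Subject value spills onto a line that does
# not start with whitespace, so it is not a valid header continuation.
raw = (
    "Message-ID: <1.example@thyme>\n"
    "Subject: Desk Meeting\n"
    "Wednesday,  September 19th\n"
    "Mime-Version: 1.0\n"
    "X-Origin: BAUGHMAN-D\n"
    "\n"
    "Room - 3125b\n"
)

msg = email.message_from_string(raw)

# Header parsing stopped at the malformed line, so X-Origin was never read.
leaked_origin = msg["X-Origin"]

# Everything from the malformed line onward ends up in the payload.
payload = msg.get_payload()

# The parser records the problem as a defect instead of raising.
has_defect = any(
    isinstance(d, errors.MissingHeaderBodySeparatorDefect) for d in msg.defects
)
```

Rows affected in this way could be located by checking for a non-empty `defects` list on the parsed message.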


In [40]:
#print(df_emails.iloc[18178].message)

In [41]:
#headers[18178].items()

In [42]:
#email.message_from_string(df_emails['message'][18178]).items()

In [43]:
#print(df_emails['message'][43636])

In [44]:
#print(df_emails.iloc[43636])

In [45]:
#df_emails[df_emails["To"].isnull()]

In [46]:
#print(df_emails.iloc[188])

In [47]:
#print(df_emails['message'][188])

In [48]:
#print(df_emails['message'][603])

In [49]:
#print(df_emails['message'][1236])

In [50]:
df_emails.drop(['To', 'Cc','Bcc', 'X-Origin'], axis=1, inplace=True)

In [51]:
#df_emails.isna().sum()

In [52]:
df_emails.dropna(axis=0, inplace=True)

In [53]:
df_emails.isna().sum()

file             0
message          0
Date             0
From             0
Subject          0
X-Folder         0
Employee         0
Folder           0
Email Number     0
Body             0
Related Party    0
dtype: int64

In [54]:
df_emails.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 517372 entries, 0 to 517400
Data columns (total 11 columns):
file             517372 non-null object
message          517372 non-null object
Date             517372 non-null object
From             517372 non-null object
Subject          517372 non-null object
X-Folder         517372 non-null object
Employee         517372 non-null object
Folder           517372 non-null object
Email Number     517372 non-null object
Body             517372 non-null object
Related Party    517372 non-null object
dtypes: object(11)
memory usage: 47.4+ MB


In [55]:
df_emails.head()

Unnamed: 0,file,message,Date,From,Subject,X-Folder,Employee,Folder,Email Number,Body,Related Party
0,allen-p/_sent_mail/1.,Message-ID: <18782981.1075855378110.JavaMail.e...,"Mon, 14 May 2001 16:39:00 -0700 (PDT)",[phillip.allen@enron.com],,"\Phillip_Allen_Jan2002_1\Allen, Phillip K.\'Se...",allen-p,_sent_mail,1.0,Here is our forecast\n\n,"[phillip.allen@enron.com, tim.belden@enron.com]"
1,allen-p/_sent_mail/10.,Message-ID: <15464986.1075855378456.JavaMail.e...,"Fri, 4 May 2001 13:51:00 -0700 (PDT)",[phillip.allen@enron.com],Re:,"\Phillip_Allen_Jan2002_1\Allen, Phillip K.\'Se...",allen-p,_sent_mail,10.0,Traveling to have a business meeting takes the...,"[john.lavorato@enron.com, phillip.allen@enron...."
2,allen-p/_sent_mail/100.,Message-ID: <24216240.1075855687451.JavaMail.e...,"Wed, 18 Oct 2000 03:00:00 -0700 (PDT)",[phillip.allen@enron.com],Re: test,\Phillip_Allen_Dec2000\Notes Folders\'sent mail,allen-p,_sent_mail,100.0,test successful. way to go!!!,"[leah.arsdall@enron.com, phillip.allen@enron.com]"
3,allen-p/_sent_mail/1000.,Message-ID: <13505866.1075863688222.JavaMail.e...,"Mon, 23 Oct 2000 06:13:00 -0700 (PDT)",[phillip.allen@enron.com],,\Phillip_Allen_Dec2000\Notes Folders\'sent mail,allen-p,_sent_mail,1000.0,"Randy,\n\n Can you send me a schedule of the s...","[phillip.allen@enron.com, randall.gay@enron.com]"
4,allen-p/_sent_mail/1001.,Message-ID: <30922949.1075863688243.JavaMail.e...,"Thu, 31 Aug 2000 05:07:00 -0700 (PDT)",[phillip.allen@enron.com],Re: Hello,\Phillip_Allen_Dec2000\Notes Folders\'sent mail,allen-p,_sent_mail,1001.0,Let's shoot for Tuesday at 11:45.,"[greg.piper@enron.com, phillip.allen@enron.com]"


# Exploratory Data Analysis

In [56]:
# Number of emails by Employee
df_emails.groupby(['Employee']).count().sort_values(by='file', ascending=False)['message'][0:40]

Employee
kaminski-v       28463
dasovich-j       28234
kean-s           25351
mann-k           23381
jones-t          19950
shackleton-s     18687
taylor-m         13875
farmer-d         13032
germany-c        12436
beck-s           11830
symes-k          10827
nemec-g          10655
scott-s           8022
rogers-b          8009
bass-e            7823
sanders-r         7329
campbell-l        6489
shapiro-r         6071
guzman-m          6054
lay-k             5937
lenhart-m         5919
lokay-m           5568
kitchen-l         5545
haedicke-m        5245
sager-e           5199
love-p            5002
arnold-j          4898
fossum-d          4796
perlingiere-d     4778
lavorato-j        4685
mcconnell-m       4542
giron-d           4220
skilling-j        4139
shankman-j        3856
hain-m            3820
delainey-d        3566
williams-w3       3440
blair-l           3415
mclaughlin-e      3353
whalley-l         3335
Name: message, dtype: int64
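
The ranking above counts rows per employee. As a dependency-free sketch of the same idea, the snippet below counts the first path component of a few hypothetical `file` values with `collections.Counter`. In pandas, the same ranking is typically written as `df_emails['Employee'].value_counts()`.

```python
from collections import Counter

# Hypothetical file paths in the dataset's 'employee/folder/number' layout.
files = [
    "allen-p/_sent_mail/1.",
    "allen-p/_sent_mail/10.",
    "bass-e/sent/5.",
]

# The employee is the first path component.
emails_per_employee = Counter(path.split("/")[0] for path in files)

# Employees ordered from most to fewest emails.
top = emails_per_employee.most_common(2)
```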

In [57]:
# Number of emails by folder
df_emails.groupby(['Employee','Folder']).count().sort_values(by='file', ascending=False)['message'][0:60]

Employee       Folder            
dasovich-j     all_documents         11896
jones-t        all_documents          9304
shackleton-s   all_documents          8158
dasovich-j     notes_inbox            7194
kaminski-v     all_documents          7174
mann-k         all_documents          6647
kaminski-v     discussion_threads     5550
taylor-m       all_documents          5229
jones-t        notes_inbox            5095
mann-k         discussion_threads     4965
kean-s         calendar/untitled      4478
               archiving/untitled     4477
               all_documents          4477
mann-k         sent                   4440
nemec-g        all_documents          4231
mann-k         _sent_mail             4220
dasovich-j     sent                   3930
jones-t        sent                   3810
shackleton-s   sent                   3774
kean-s         discussion_threads     3733
farmer-d       all_documents          3660
kaminski-v     _sent_mail             3464
               sent 

In [58]:
df_emails.groupby(['Employee','Folder']).count().sort_values(by='file', ascending=False)['message'][61:120]

Employee      Folder             
stclair-c     all_documents          1523
shapiro-r     deleted_items          1468
nemec-g       inbox                  1466
dasovich-j    sent_items             1436
bass-e        _sent_mail             1409
fossum-d      all_documents          1405
williams-w3   schedule_crawler       1398
dasovich-j    inbox                  1387
bass-e        discussion_threads     1386
steffes-j     sent_items             1379
lay-k         inbox                  1373
bass-e        sent                   1363
lewis-a       deleted_items          1359
germany-c     sent_items             1353
hain-m        all_documents          1347
stclair-c     sent                   1328
symes-k       sent                   1326
lokay-m       all_documents          1324
symes-k       _sent_mail             1323
kean-s        sent                   1315
campbell-l    inbox                  1315
shapiro-r     all_documents          1270
skilling-j    inbox                  1252


In [59]:
print(df_emails['Body'][10471])

Thought you might like Superfan's comments.
---------------------- Forwarded by Eric Bass/HOU/ECT on 10/24/2000 09:56 AM 
---------------------------
   
	Enron North America Corp.
	
	From:  Eric Bass                           10/24/2000 09:53 AM
	

To: Chad Landry/HOU/ECT@ECT
cc: Matthew Lenhart/HOU/ECT@ECT, Christopher Coffman/Corp/Enron@Enron, 
William Kelly/HOU/ECT@ECT, Kyle Etter/HOU/ECT@ECT, Kam Keiser/HOU/ECT@ECT, 
Jay Reitmeyer/HOU/ECT@ECT, Jeff Coates/HOU/ECT@ECT, William 
Keeney/HOU/ECT@ECT, Jeffrey C Gossett/HOU/ECT@ECT, John King/HOU/ECT@ECT, 
Luis Mena/NA/Enron@Enron, shirley.s.elliott@citicorp.com @ ENRON, Lisa 
Gillette/HOU/ECT@ECT, Susan M Scott/HOU/ECT@ECT, Dawn C Kenne/HOU/ECT@ECT, 
Nick Hiemstra/HOU/ECT@ECT, Benjamin Thomason/HOU/ECT@ECT, David 
Marks/HOU/ECT@ECT, Timothy Blanchard/HOU/EES@EES 
Subject: Re: It could happen!!!  

Easy there Superfan!  Remember, you still have to win 1 of your last 3 games, 
none of those are gimmes (see UAB).  As far as bowls go,  LSU

In [60]:
print(df_emails.loc[108711]['Body'])

Mark:

Here is a list of the 66 physical deals that were zeroed out from December 15 onwards  las night.
There were also many financial deals that were killed, but I hope that does not matter to Unify.


 

 -----Original Message-----
From: 	Mcclure, Mark  
Sent:	Thursday, December 13, 2001 5:11 PM
To:	Jaquet, Tammy; Krishnaswamy, Jayant
Cc:	Pena, Matt; Superty, Robert; Wynne, Rita; Pinion, Richard; Farmer, Daren J.; Heal, Kevin; Kinsey, Lisa; Lamadrid, Victor; Smith, George F.; Sullivan, Patti
Subject:	RE: Killing ENA to ENA deals in Sitara


Jay,

Let's not assume what impact this will have. Even though it is better to zero out deals we(VM) wouldn't want to go in and zero out 2000 deals. 

We do have a few questions.

I would like to discuss what minimal impact is? How do you know what impact this will have? Is there a reason this will have little impact? Do all of these D2D deals have zero volume associated with them?
Who decided to do this?
Why are we doing this?
What time span are
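
Both sample bodies above carry quoted text after a marker line: a "Forwarded by" banner in the first and an "Original Message" separator in the second. If only the newly written text is wanted, one option is to cut the body at the first such marker. The sketch below is an assumption about how that could be done; `QUOTE_MARKERS` and `strip_quoted_text` are hypothetical names, and the pattern covers only the two marker styles seen here.

```python
import re

# Markers seen in the two sample bodies; other mail clients may use different ones.
QUOTE_MARKERS = re.compile(r"-+\s*Original Message\s*-+|-+\s*Forwarded by ")

def strip_quoted_text(body):
    """Keep only the text before the first reply/forward marker."""
    return QUOTE_MARKERS.split(body, maxsplit=1)[0].strip()

reply = "Mark:\n\nHere is a list.\n\n -----Original Message-----\nFrom: \tMcclure, Mark"
forward = (
    "Thought you might like Superfan's comments.\n"
    "---------------------- Forwarded by Eric Bass/HOU/ECT on 10/24/2000 09:56 AM"
)

reply_text = strip_quoted_text(reply)
forward_text = strip_quoted_text(forward)
```

Unlike `get_message_body`, which removes the separator but keeps the quoted message, this variant discards everything after the marker.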

In [61]:
#print(df_emails.loc[108711]['message'])