# Enron Email Database: Actionable Item Detection


The Enron email dataset contains approximately 500,000 emails generated by employees of the Enron Corporation. It was obtained by the Federal Energy Regulatory Commission during its investigation of Enron's collapse.

>Dataset source: https://www.kaggle.com/wcukierski/enron-email-dataset


# 1. Objective
- Create a heuristics-based linguistic model for detecting actionable items in the emails.
- Use the rule-based model to classify each sentence as actionable or non-actionable.
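
To make the objective concrete, here is a minimal sketch of what a rule-based sentence classifier could look like. The cue phrases, the verb list, and the name `is_actionable` are assumptions chosen for illustration; they are not the rules this notebook ends up using.

```python
import re

# Hypothetical request cues, chosen for illustration only.
REQUEST_CUES = re.compile(
    r"\b(please|kindly|can you|could you|would you|will you|"
    r"let me know|make sure|need you to)\b",
    re.IGNORECASE,
)

# A short, assumed list of verbs that often open an instruction.
IMPERATIVE_STARTS = (
    "send", "call", "review", "schedule", "update", "forward", "check", "build",
)


def is_actionable(sentence: str) -> bool:
    """Return True if the sentence matches any of the illustrative rules."""
    text = sentence.strip()
    if not text:
        return False
    # Rule 1: the sentence contains an explicit request cue.
    if REQUEST_CUES.search(text):
        return True
    # Rule 2: the sentence opens with a common imperative verb.
    first_word = text.split()[0].lower().strip(",.!?:;")
    return first_word in IMPERATIVE_STARTS


examples = [
    "can you build something to look at historical prices from where we saved curves each night.",
    "Here is an example that pulls socal only.",
]
labels = [is_actionable(s) for s in examples]
```

Both example sentences come from the message shown in section 2.2.1. A real rule set would likely need part-of-speech information to separate true imperatives from sentences that merely start with a listed verb.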

# 2. Explore the Data

In [1]:
# import relevant packages
import pandas as pd
import os


In [17]:
# limit how many characters of each column value are displayed
pd.options.display.max_colwidth = 200

In [3]:
# Load data
raw_dataDF = pd.read_csv('emails/emails.csv')

In [18]:
# let's see the data
raw_dataDF

Unnamed: 0,file,message
0,allen-p/_sent_mail/1.,"Message-ID: <18782981.1075855378110.JavaMail.evans@thyme>\nDate: Mon, 14 May 2001 16:39:00 -0700 (PDT)\nFrom: phillip.allen@enron.com\nTo: tim.belden@enron.com\nSubject: \nMime-Version: 1.0\nConte..."
1,allen-p/_sent_mail/10.,"Message-ID: <15464986.1075855378456.JavaMail.evans@thyme>\nDate: Fri, 4 May 2001 13:51:00 -0700 (PDT)\nFrom: phillip.allen@enron.com\nTo: john.lavorato@enron.com\nSubject: Re:\nMime-Version: 1.0\n..."
2,allen-p/_sent_mail/100.,"Message-ID: <24216240.1075855687451.JavaMail.evans@thyme>\nDate: Wed, 18 Oct 2000 03:00:00 -0700 (PDT)\nFrom: phillip.allen@enron.com\nTo: leah.arsdall@enron.com\nSubject: Re: test\nMime-Version: ..."
3,allen-p/_sent_mail/1000.,"Message-ID: <13505866.1075863688222.JavaMail.evans@thyme>\nDate: Mon, 23 Oct 2000 06:13:00 -0700 (PDT)\nFrom: phillip.allen@enron.com\nTo: randall.gay@enron.com\nSubject: \nMime-Version: 1.0\nCont..."
4,allen-p/_sent_mail/1001.,"Message-ID: <30922949.1075863688243.JavaMail.evans@thyme>\nDate: Thu, 31 Aug 2000 05:07:00 -0700 (PDT)\nFrom: phillip.allen@enron.com\nTo: greg.piper@enron.com\nSubject: Re: Hello\nMime-Version: 1..."
...,...,...
517396,zufferli-j/sent_items/95.,"Message-ID: <26807948.1075842029936.JavaMail.evans@thyme>\nDate: Wed, 28 Nov 2001 13:30:11 -0800 (PST)\nFrom: john.zufferli@enron.com\nTo: kori.loibl@enron.com\nSubject: Trade with John Lavorato\n..."
517397,zufferli-j/sent_items/96.,"Message-ID: <25835861.1075842029959.JavaMail.evans@thyme>\nDate: Wed, 28 Nov 2001 12:47:48 -0800 (PST)\nFrom: john.zufferli@enron.com\nTo: john.lavorato@enron.com\nSubject: Gas Hedges\nMime-Versio..."
517398,zufferli-j/sent_items/97.,"Message-ID: <28979867.1075842029988.JavaMail.evans@thyme>\nDate: Wed, 28 Nov 2001 07:20:00 -0800 (PST)\nFrom: john.zufferli@enron.com\nTo: dawn.doucet@enron.com\nSubject: RE: CONFIDENTIAL\nMime-Ve..."
517399,zufferli-j/sent_items/98.,"Message-ID: <22052556.1075842030013.JavaMail.evans@thyme>\nDate: Tue, 27 Nov 2001 11:52:45 -0800 (PST)\nFrom: john.zufferli@enron.com\nTo: jeanie.slone@enron.com\nSubject: Calgary Analyst/Associat..."


In [19]:
# let's see the data shape
raw_dataDF.shape

(517401, 2)

In [44]:
raw_dataDF.iloc[128]

file    ...

## 2.1. Data Overview
<br>
All of the data is in a single file, emails.csv:<br />
<pre>
<b>emails.csv</b> contains 2 columns: file and message.<br />
<b>Size of emails.csv</b> - 1.32GB<br />
<b>Number of rows in emails.csv</b> = 517401<br />
</pre>
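
Because the file is over a gigabyte, it can help to read it in pieces instead of all at once. The sketch below uses the `chunksize` parameter of `pd.read_csv` to count rows chunk by chunk. The in-memory CSV and the helper name `count_rows` are stand-ins invented for this example so that it runs without the dataset.

```python
import io

import pandas as pd

# Tiny in-memory stand-in for emails/emails.csv (assumed sample rows).
# The message field is quoted and spans several lines, as in the real file.
sample_csv = io.StringIO(
    'file,message\n'
    'a/1.,"Subject: one\n\nfirst body"\n'
    'a/2.,"Subject: two\n\nsecond body"\n'
    'a/3.,"Subject: three\n\nthird body"\n'
)


def count_rows(csv_source, chunk_size=2):
    """Count rows without holding the whole file in memory."""
    total = 0
    # With chunksize set, read_csv yields DataFrames of at most chunk_size rows.
    for chunk in pd.read_csv(csv_source, chunksize=chunk_size):
        total += len(chunk)
    return total


n_rows = count_rows(sample_csv)
```

On the real data, the same function could be called with the path `'emails/emails.csv'` and a much larger `chunk_size`. For quick experiments, `pd.read_csv(path, nrows=1000)` loads only the first 1000 rows.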

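## 2.2. Let's see some rows to understand the structure of the dataset.
<br>

After going through the data, I can see that each message is stored as one raw string: a block of email headers followed by the message body.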

### 2.2.1 The content of the message column at row 120

In [43]:
pd.options.display.max_colwidth = 1500
raw_dataDF.iloc[120]

file                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               allen-p/_sent_mail/202.
message    Message-ID: <26838693.1075855689682.JavaMail.evans@thyme>\nDate: Fri, 4 Aug 2000 07:00:00 -0700 (PDT)\nFrom: phillip.allen@enron.com\nTo: chris.gaskill@enron.com\nSubject: \nMime-Version: 1.0\nContent-Type: text/plain; charset=us-ascii\nContent-Transfer-Encoding: 7bit\nX-From: Phillip K Al


Message-ID: <26838693.1075855689682.JavaMail.evans@thyme>\n<br>
Date: Fri, 4 Aug 2000 07:00:00 -0700 (PDT)\n<br>
From: phillip.allen@enron.com\n<br>
To: chris.gaskill@enron.com\n<br>
Subject: \n<br>
Mime-Version: 1.0\n<br>
Content-Type: text/plain; charset=us-ascii\n<br>
Content-Transfer-Encoding: 7bit\n<br>
X-From: Phillip K Allen\n<br>
X-To: Chris Gaskill\n<br>
X-cc: \n<br>
X-bcc: \n<br>
X-Folder: \Phillip_Allen_Dec2000\Notes Folders\'sent mail\n<br>
X-Origin: Allen-P\n<br>
X-FileName: pallen.nsf\n<br>
\n<br>
can you build something to look at historical prices from where we saved \n<br>
curves each night.\n<br>
\n<br>
Here is an example that pulls socal only.\n<br>
Improvements could include a drop down menu to choose any curve and a choice \n<br>
of index,gd, or our curves.\n<br>
\n<br>
<br>


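Each message consists of header fields followed by the body, listed here as 'Content'. The fields found in the messages are:
<br>
*Message-ID, Date, From, To, Subject, Cc, Mime-Version, Content-Type, Content-Transfer-Encoding, Bcc, X-From, X-To, X-cc, X-bcc, X-Folder, X-Origin, X-FileName, 'Content'*
<br>
Not every message carries every field; the row above, for example, has no Cc or Bcc header.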
<br>
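
One way to split a raw message into these fields is Python's standard `email` package. The sketch below is a minimal illustration, not necessarily the approach taken later in this notebook. The helper name `parse_email` is hypothetical, and the sample string is a shortened copy of the row 120 message.

```python
from email import message_from_string

# Shortened copy of the row 120 message shown above.
raw_message = (
    "Message-ID: <26838693.1075855689682.JavaMail.evans@thyme>\n"
    "Date: Fri, 4 Aug 2000 07:00:00 -0700 (PDT)\n"
    "From: phillip.allen@enron.com\n"
    "To: chris.gaskill@enron.com\n"
    "Subject: \n"
    "Mime-Version: 1.0\n"
    "Content-Type: text/plain; charset=us-ascii\n"
    "Content-Transfer-Encoding: 7bit\n"
    "X-From: Phillip K Allen\n"
    "X-To: Chris Gaskill\n"
    "\n"
    "can you build something to look at historical prices from where we saved \n"
    "curves each night.\n"
)


def parse_email(raw: str) -> dict:
    """Split a raw message into a dict of header fields plus 'Content'."""
    msg = message_from_string(raw)
    # msg.items() yields (header name, value) pairs in message order.
    # If a header repeats, the dict keeps only its last value.
    fields = {name: value for name, value in msg.items()}
    # For a non-multipart message, get_payload() returns the body as a string.
    fields["Content"] = msg.get_payload()
    return fields


parsed = parse_email(raw_message)
```

Applied to the whole column, for example with `raw_dataDF['message'].apply(parse_email)`, this would give one dict per email that can be expanded into separate DataFrame columns.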


## 2.3. Data preprocessing

In [32]:
raw_dataDF

0           allen-p/_sent_mail/1.
1          allen-p/_sent_mail/10.
2         allen-p/_sent_mail/100.
3        allen-p/_sent_mail/1000.
4        allen-p/_sent_mail/1001.
                  ...            
995    allen-p/all_documents/458.
996    allen-p/all_documents/459.
997     allen-p/all_documents/46.
998    allen-p/all_documents/460.
999    allen-p/all_documents/461.
Name: file, Length: 1000, dtype: object