# Enron Dataset Preprocessing: Parsing the raw dataset

In this notebook, we parse the raw dataset into a single CSV file.

## Step 1: Import libraries

In [1]:
import os
import re
import sys
import email
import dateutil.parser  # import the parser submodule explicitly; it is used below

import pandas as pd

## Step 2: Download and extract the dataset

The Enron dataset should be downloaded from `https://www.cs.cmu.edu/~enron/enron_mail_20150507.tar.gz` and extracted in the current directory.

The following variable defines the location of the raw dataset.

In [2]:
MAIL_DIR = 'maildir/'
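
The download and extraction can also be scripted. The following is a minimal sketch using only the standard library; the helper names `download_archive` and `extract_archive` and the local archive file name are hypothetical and not part of the original notebook.

```python
import os
import tarfile
import urllib.request

DATASET_URL = 'https://www.cs.cmu.edu/~enron/enron_mail_20150507.tar.gz'

def download_archive(url=DATASET_URL, dest='enron_mail_20150507.tar.gz'):
    """Download the archive unless a file named `dest` already exists."""
    if not os.path.exists(dest):
        urllib.request.urlretrieve(url, dest)
    return dest

def extract_archive(archive_path, dest_dir='.'):
    """Extract a gzip-compressed tar archive into `dest_dir`."""
    with tarfile.open(archive_path, 'r:gz') as tar:
        tar.extractall(dest_dir)

# Usage (not run here, since it downloads the full archive):
# extract_archive(download_archive())
```

Extracting into the current directory keeps the dataset where `MAIL_DIR` expects it.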

## Step 3: Parse the dataset

We first define a couple of helper functions to extract the emails found in each individual's mail folders.

In [3]:
def recursive_listdir(path):
    """Recursively list all files under a given path"""
    return [os.path.join(dp, f) for dp, dn, fn in os.walk(os.path.expanduser(path)) for f in fn]

def get_sender_emails(user):
    """Generator that iterates over all the emails in a given user's folders"""
    # Check all sub-folders in the user folder
    for fpath in recursive_listdir(os.path.join(MAIL_DIR, user)):
        # Ignore OS-specific files
        if fpath.endswith('.DS_Store'):
            continue
        # Read email file
        with open(fpath, 'rb') as f:
            msg = email.message_from_binary_file(f)
        # Parse date
        date = msg['Date']
        dt = dateutil.parser.parse(date)
        t = dt.timestamp()
        # Store data in a dict
        mail_dict = {
            'user': user,
            'date': date,
            'timestamp': t,
            # File path relative to MAIL_DIR (avoids treating MAIL_DIR as a regex)
            'file': os.path.relpath(fpath, MAIL_DIR),
        }
        for key in ['From', 'To', 'Cc', 'X-From', 'X-To', 'X-cc', 'X-Origin', 'X-Folder']:
            mail_dict[key] = msg.get(key)

        yield mail_dict
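
To see what the parsing step produces for a single message, here is a minimal sketch on an in-memory email. The headers are taken from a row of the sample output in this notebook, while the message body and the helper name `parse_headers` are hypothetical. To stay dependency-free, this sketch swaps `dateutil.parser.parse` for the standard library's `email.utils.parsedate_to_datetime`.

```python
import email
from email.utils import parsedate_to_datetime

# Hypothetical raw message; only the headers mirror the dataset
RAW_MESSAGE = (
    b"Date: Thu, 14 Sep 2000 03:06:00 -0700 (PDT)\r\n"
    b"From: steven.kean@enron.com\r\n"
    b"To: cynthia.sandherr@enron.com\r\n"
    b"X-Origin: KEAN-S\r\n"
    b"\r\n"
    b"Placeholder body.\r\n"
)

def parse_headers(raw_bytes, keys=('From', 'To', 'Cc', 'X-Origin')):
    """Return the date string, Unix timestamp, and selected headers of a message."""
    msg = email.message_from_bytes(raw_bytes)
    date = msg['Date']
    record = {'date': date, 'timestamp': parsedate_to_datetime(date).timestamp()}
    for key in keys:
        record[key] = msg.get(key)  # None when the header is absent
    return record

record = parse_headers(RAW_MESSAGE)
print(record['timestamp'])  # 968925960.0
print(record['Cc'])         # None
```

Headers missing from a message come back as `None`, which is why some columns of the resulting table are empty.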

Next, we collect the emails from every individual's folder.

In [4]:
# Get the list of individuals
SENDER_LIST = sorted([d for d in os.listdir(MAIL_DIR) if os.path.isdir(os.path.join(MAIL_DIR, d))])

data = list()
n_senders = len(SENDER_LIST)
for i, sender in enumerate(SENDER_LIST):
    print('{:d}/{:d} - Process user: {:<20s}'.format(i+1, n_senders, sender), end='\r', flush=True)
    data.extend(get_sender_emails(sender))
print()

150/150 - Process user: zufferli-j          


Display a random sample of emails.

In [5]:
df = pd.DataFrame.from_dict(data)
df.sample(5)

Unnamed: 0,user,date,timestamp,file,From,To,Cc,X-From,X-To,X-cc,X-Origin,X-Folder
242861,kean-s,"Thu, 14 Sep 2000 03:06:00 -0700 (PDT)",968925960.0,kean-s/archiving/untitled/997.,steven.kean@enron.com,cynthia.sandherr@enron.com,jeffrey.shankman@enron.com,Steven J Kean,Cynthia Sandherr,Jeffrey A Shankman,KEAN-S,\Steven_Kean_Dec2000_1\Notes Folders\Archiving...
198095,jones-t,"Thu, 22 Mar 2001 13:25:00 -0800 (PST)",985296300.0,jones-t/all_documents/10427.,enron.announcements@enron.com,all_ena_egm_eim@enron.com,,Enron Announcements,All_ENA_EGM_EIM,,JONES-T,\Tanya_Jones_June2001\Notes Folders\All documents
336060,merriss-s,"Sat, 21 Apr 2001 20:43:00 -0700 (PDT)",987910980.0,merriss-s/notes_inbox/262.,pete.davis@enron.com,pete.davis@enron.com,"bert.meyers@enron.com, bill.williams.iii@enron...",Schedule Crawler<pete.davis@enron.com>,pete.davis@enron.com,"bert.meyers@enron.com, bill.williams.III@enron...",MERRISS-S,\steven merriss 6-28-02\Notes Folders\Notes inbox
276643,lenhart-m,"Mon, 30 Oct 2000 06:03:00 -0800 (PST)",972914580.0,lenhart-m/sent/1880.,matthew.lenhart@enron.com,mmmarcantel@equiva.com,,Matthew Lenhart,"""Marcantel MM (Mitch)"" <MMMarcantel@equiva.com...",,Lenhart-M,\Matthew_Lenhart_Jun2001\Notes Folders\Sent
317674,mann-k,"Thu, 12 Oct 2000 10:59:00 -0700 (PDT)",971373540.0,mann-k/all_documents/1394.,kay.mann@enron.com,sarah.wesner@enron.com,,Kay Mann,Sarah Wesner,,MANN-K,\Kay_Mann_June2001_1\Notes Folders\All documents


In [6]:
df.shape

(517401, 12)

Save the DataFrame to a CSV file.

In [7]:
df.to_csv('enron_dataset_raw.csv', encoding='utf-8')
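
Because `to_csv` writes the DataFrame index as an unnamed first column by default, reading the file back without further arguments yields an extra `Unnamed: 0` column. The following sketch illustrates this on a tiny, hypothetical two-row frame (the values are borrowed from the sample output above) and shows that `index_col=0` restores the index instead.

```python
import os
import tempfile

import pandas as pd

# Tiny hypothetical frame standing in for the full dataset
toy = pd.DataFrame({
    'user': ['kean-s', 'mann-k'],
    'timestamp': [968925960.0, 971373540.0],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'toy.csv')
    toy.to_csv(path, encoding='utf-8')           # the index is written as the first column
    naive = pd.read_csv(path)                    # index comes back as 'Unnamed: 0'
    restored = pd.read_csv(path, index_col=0)    # first column is used as the index

print(list(naive.columns))     # ['Unnamed: 0', 'user', 'timestamp']
print(list(restored.columns))  # ['user', 'timestamp']
```

The same `index_col=0` argument should apply when loading `enron_dataset_raw.csv` in later steps.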