
This directory contains the Enron-Spam datasets, as described in the 
paper:

V. Metsis, I. Androutsopoulos and G. Paliouras, "Spam Filtering with 
Naive Bayes - Which Naive Bayes?". Proceedings of the 3rd Conference 
on Email and Anti-Spam (CEAS 2006), Mountain View, CA, USA, 2006.

The "preprocessed" subdirectory contains the messages in the 
preprocessed format that was used in the experiments of the paper.
Each message is in a separate text file. The number at the beginning
of each filename is the "order of arrival".

The "raw" subdirectory contains the messages in their original form. 
Spam messages in non-Latin encodings, ham messages sent by the owners 
of the mailboxes to themselves (sender in "To:", "Cc:", or "Bcc:" 
field), and a handful of virus-infected messages have been removed, 
but no other modification has been made. The "raw" subdirectory 
contains more messages than the "preprocessed" subdirectory, 
because: (a) duplicates are preserved in the "raw" form, and 
(b) during the preprocessing, ham and/or spam messages were randomly 
subsampled to obtain the desired ham:spam ratios. See the paper for 
further details.

The Enron-Spam datasets are available from: 
<http://www.iit.demokritos.gr/skel/i-config/> and
<http://www.aueb.gr/users/ion/publications.html>.

The paper is available from:
<http://www.ceas.cc/> and 
<http://www.aueb.gr/users/ion/publications.html>.

V. Metsis, I. Androutsopoulos and G. Paliouras  

This file last updated: June 19, 2006.

1. Acquire and explore the files.
2. Load the data into Python.
3. Explore the data.
4. Learn to process the data using NLTK.
5. Split the data between train and test.
6. Implement the classifier.
7. Explore and visualize the results.

In [336]:
import numpy as np
import pandas as pd

import os

# Import Data

In [337]:
spam_dir = './enron_data/raw_spam'
ham_dir = './enron_data/raw_ham'

In [338]:
#List all files in directory using python 'os' library
os.listdir('./enron_data')

['.DS_Store', 'raw_ham', 'preprocessed_files', 'raw_spam']

In [339]:
#List all files/folders in directory using iPython magic commands
#Allows us to interact directly with the command line
folders = !ls ./enron_data
folders

['preprocessed_files', 'raw_ham', 'raw_spam']
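The two listings differ because `os.listdir` returns every entry, including hidden files such as `.DS_Store`, in arbitrary order, while `ls` hides dot-files and sorts by name. A minimal sketch of reproducing the `ls` view in pure Python is shown below; the helper name `visible_entries` and the throwaway directory are made up for illustration.

```python
import os
import tempfile

def visible_entries(path):
    """Return the non-hidden entries of a directory, sorted by name (hypothetical helper)."""
    return sorted(name for name in os.listdir(path) if not name.startswith('.'))

# Demonstrate on a throwaway directory that mimics ./enron_data
with tempfile.TemporaryDirectory() as demo_dir:
    for folder in ['raw_ham', 'preprocessed_files', 'raw_spam']:
        os.mkdir(os.path.join(demo_dir, folder))
    open(os.path.join(demo_dir, '.DS_Store'), 'w').close()  # hidden macOS metadata file
    demo_listing = visible_entries(demo_dir)

print(demo_listing)
```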

In [340]:
files = !ls ./enron_data/preprocessed_files
files

['enron1',
 'enron1.tar.gz',
 'enron2',
 'enron2.tar.gz',
 'enron3',
 'enron3.tar.gz',
 'enron4',
 'enron4.tar.gz',
 'enron5',
 'enron5.tar.gz',
 'enron6',
 'enron6.tar.gz']

In [341]:
#View shell command we are running
!cat ./bash_scripts/unzip_tar_gz.sh

#!/bin/sh
for file in /Users/tbennett/datasci_python_primer/enron_data/$1/*.gz 
do 
tar -xf "$file" -C /Users/tbennett/datasci_python_primer/enron_data/$1;
echo "Succesfully unzipped" "$file"
done


In [342]:
for arg in ['raw_ham','raw_spam','preprocessed_files']:
    os.system("./bash_scripts/unzip_tar_gz.sh '%s'" % arg)

In [343]:
%%bash
cd ./bash_scripts/
bash unzip_tar_gz.sh raw_ham


Succesfully unzipped /Users/tbennett/datasci_python_primer/enron_data/raw_ham/beck-s.tar.gz
Succesfully unzipped /Users/tbennett/datasci_python_primer/enron_data/raw_ham/farmer-d.tar.gz
Succesfully unzipped /Users/tbennett/datasci_python_primer/enron_data/raw_ham/kaminski-v.tar.gz
Succesfully unzipped /Users/tbennett/datasci_python_primer/enron_data/raw_ham/kitchen-l.tar.gz
Succesfully unzipped /Users/tbennett/datasci_python_primer/enron_data/raw_ham/lokay-m.tar.gz
Succesfully unzipped /Users/tbennett/datasci_python_primer/enron_data/raw_ham/williams-w3.tar.gz


In [344]:
import subprocess
#Run the script once per folder and capture its stdout so it can be printed later
shell_script = [
    subprocess.run(["./bash_scripts/unzip_tar_gz.sh", arg], stdin=subprocess.PIPE, stdout=subprocess.PIPE)
    for arg in ['raw_ham', 'raw_spam', 'preprocessed_files']
]

In [345]:
print(shell_script[0].stdout.decode(),'\n') #decode converts output from bytes to string for readability
print(shell_script[1].stdout.decode(),'\n')
print(shell_script[2].stdout.decode(),'\n')

Succesfully unzipped /Users/tbennett/datasci_python_primer/enron_data/raw_ham/beck-s.tar.gz
Succesfully unzipped /Users/tbennett/datasci_python_primer/enron_data/raw_ham/farmer-d.tar.gz
Succesfully unzipped /Users/tbennett/datasci_python_primer/enron_data/raw_ham/kaminski-v.tar.gz
Succesfully unzipped /Users/tbennett/datasci_python_primer/enron_data/raw_ham/kitchen-l.tar.gz
Succesfully unzipped /Users/tbennett/datasci_python_primer/enron_data/raw_ham/lokay-m.tar.gz
Succesfully unzipped /Users/tbennett/datasci_python_primer/enron_data/raw_ham/williams-w3.tar.gz
 

Succesfully unzipped /Users/tbennett/datasci_python_primer/enron_data/raw_spam/BG.tar.gz
Succesfully unzipped /Users/tbennett/datasci_python_primer/enron_data/raw_spam/GP.tar.gz
Succesfully unzipped /Users/tbennett/datasci_python_primer/enron_data/raw_spam/SH.tar.gz
 

Succesfully unzipped /Users/tbennett/datasci_python_primer/enron_data/preprocessed_files/enron1.tar.gz
Succesfully unzipped /Users/tbennett/datasci_python_prime
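The shell script hard-codes an absolute path under `/Users/tbennett`, so it only runs unchanged on that machine. One portable alternative, sketched here under the assumption that the archives sit in `./enron_data/<folder>/`, is Python's standard `tarfile` module. The function name `extract_archives` is hypothetical and not part of the original notebook.

```python
import glob
import os
import tarfile

def extract_archives(folder):
    """Extract every .tar.gz archive found in `folder` into that same folder.

    Hypothetical helper: a pure-Python stand-in for unzip_tar_gz.sh that
    avoids hard-coded absolute paths. Only use it on archives you trust,
    since extraction writes whatever paths the archive contains.
    """
    extracted = []
    for archive_path in sorted(glob.glob(os.path.join(folder, '*.tar.gz'))):
        with tarfile.open(archive_path, 'r:gz') as archive:
            archive.extractall(path=folder)
        extracted.append(archive_path)
    return extracted

# Assumed layout, matching the folders used above; missing folders are simply skipped
for name in ['raw_ham', 'raw_spam', 'preprocessed_files']:
    for archive_path in extract_archives(os.path.join('./enron_data', name)):
        print('Successfully unzipped', archive_path)
```

Because the paths are relative, the same cell should work wherever the notebook is launched from the project root.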

In [346]:
#The entries without an extension (BG, GP, SH) are folders; ls colors them blue in a terminal
!ls ./enron_data/raw_spam

BG        BG.tar.gz GP        GP.tar.gz SH        SH.tar.gz


In [347]:
!ls ./enron_data/raw_spam/BG/2004/08

1091394468.23940_19.txt   1092075793.29722_464.txt  1092891125.24437_638.txt
1091394468.23940_21.txt   1092075794.29722_502.txt  1092891131.24437_647.txt
1091394475.23940_40.txt   1092075795.29722_552.txt  1092891132.24437_653.txt
1091394514.25221_6.txt    1092075795.29722_554.txt  1092891136.24437_671.txt
1091394516.25221_8.txt    1092075806.29722_608.txt  1092891141.24437_691.txt
1091394636.25221_97.txt   1092075806.29722_628.txt  1092891142.24437_697.txt
1091394644.25221_126.txt  1092075806.29722_642.txt  1092891144.24437_713.txt
1091394675.25221_203.txt  1092075806.29722_646.txt  1092891145.24437_715.txt
1091394677.25221_205.txt  1092075806.29722_648.txt  1092891155.24437_732.txt


In [348]:
#Check first text file to inspect data and structure
!cat ./enron_data/raw_spam/BG/2004/08/1091394468.23940_19.txt

Return-Path: <Denny@mailbox.sk>
Delivered-To: rait@bruce-guenter.dyndns.org
Received: (qmail 10573 invoked from network); 31 Jul 2004 18:41:37 -0000
Received: from localhost (localhost [127.0.0.1])
  by bruce-guenter.dyndns.org ([192.168.1.3]); 31 Jul 2004 18:41:37 -0000
Received: from zak.futurequest.net ([127.0.0.1])
  by localhost ([127.0.0.1])
  with SMTP via TCP; 31 Jul 2004 18:41:37 -0000
Received: (qmail 3535 invoked from network); 31 Jul 2004 18:41:36 -0000
Received: from 69.5.6.152 (unknown [221.149.203.7])
  by zak.futurequest.net ([69.5.6.152])
  with SMTP via TCP; 31 Jul 2004 18:41:34 -0000
Received: from 122.22.148.92 by 221.149.203.7; Sat, 31 Jul 2004 23:34:33 +0400
Message-ID: <ZWGIAOGFNYSAGOTRMYLYK@techie.com>
From: "Bertha " <Denny@mailbox.sk>
Reply-To: "Bertha " <Denny@mailbox.sk>
To: rait@bruce-guenter.dyndns.org
Subject: Get it up and keep it up praecox
Date: Sat, 31 Jul 2004 12:39:33 -0700
MIME-Version: 1.0
Content-Type: multipart/alternative;
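
The raw messages are ordinary header-plus-body emails, so Python's standard `email` module can read them without manual string handling. The sketch below parses an abbreviated copy of the headers printed above; the body line is a placeholder added for illustration, not text from the dataset.

```python
from email import message_from_string

# Abbreviated copy of the headers printed above; the body line is a placeholder
raw_message = (
    'Return-Path: <Denny@mailbox.sk>\n'
    'From: "Bertha " <Denny@mailbox.sk>\n'
    'To: rait@bruce-guenter.dyndns.org\n'
    'Subject: Get it up and keep it up praecox\n'
    'Date: Sat, 31 Jul 2004 12:39:33 -0700\n'
    '\n'
    '(message body)\n'
)

message = message_from_string(raw_message)

# Header lookup is case-insensitive and returns None for a missing header
subject = message['Subject']
sender = message['From']
body = message.get_payload()

print(subject)
print(sender)
```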


In [349]:
!ls ./enron_data/raw_ham/williams-w3/hr

1  13 17 20 24 28 31 35 39 42 46 5  53 57 60 64 68 71 75 79 82 86
10 14 18 21 25 29 32 36 4  43 47 50 54 58 61 65 69 72 76 8  83 9
11 15 19 22 26 3  33 37 40 44 48 51 55 59 62 66 7  73 77 80 84
12 16 2  23 27 30

In [350]:
!cat ./enron_data/raw_ham/williams-w3/hr/1

Message-ID: <1346515.1075839930933.JavaMail.evans@thyme>
Date: Wed, 2 Jan 2002 20:05:00 -0800 (PST)
From: grace.rodriguez@enron.com
To: center.dl-portland@enron.com
Subject: Enron Elves Update
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
X-From: Rodriguez, Grace </O=ENRON/OU=NA/CN=RECIPIENTS/CN=GRODRIQ>
X-To: DL-Portland World Trade Center </O=ENRON/OU=NA/CN=RECIPIENTS/CN=DL-PortlandWorldTradeCenter>
X-cc: 
X-bcc: 
X-Origin: WILLIAMS-W3
X-FileName: 

Hello all!

Thanks again to everyone who participated in our Enron Elves Event! We were able to provide each family a number of gifts for the children as well as a gift certificate to Fred Meyer.

We will be holding the drawing for the Blazer tickets next Monday (January 7th), when more employees are likely to be present. For your reference, here is some additional information about the games:

Package 1:	Blazers vs. Cleveland Cavaliers
		Sunday, January 1

In [351]:
!ls ./enron_data/preprocessed_files/enron1/ham

0001.1999-12-10.farmer.ham.txt 2561.2000-10-17.farmer.ham.txt
0002.1999-12-13.farmer.ham.txt 2563.2000-10-17.farmer.ham.txt
0003.1999-12-14.farmer.ham.txt 2564.2000-10-17.farmer.ham.txt
0004.1999-12-14.farmer.ham.txt 2565.2000-10-18.farmer.ham.txt
0005.1999-12-14.farmer.ham.txt 2567.2000-10-18.farmer.ham.txt
0007.1999-12-14.farmer.ham.txt 2569.2000-10-18.farmer.ham.txt
0009.1999-12-14.farmer.ham.txt 2571.2000-10-18.farmer.ham.txt
0010.1999-12-14.farmer.ham.txt 2572.2000-10-18.farmer.ham.txt
0011.1999-12-14.farmer.ham.txt 2573.2000-10-18.farmer.ham.txt
0012.1999-12-14.farmer.ham.txt 2574.2000-10-18.farmer.ham.txt
0013.1999-12-14.farmer.ham.txt 2576.2000-10-18.farmer.ham.txt
0014.1999-12-15.farmer.ham.txt 2577.2000-10-18.fa

In [352]:
!cat ./enron_data/preprocessed_files/enron1/ham/0001.1999-12-10.farmer.ham.txt
!cat ./enron_data/preprocessed_files/enron1/ham/0002.1999-12-13.farmer.ham.txt

Subject: christmas tree farm pictures
Subject: vastar resources , inc .
gary , production from the high island larger block a - 1 # 2 commenced on
saturday at 2 : 00 p . m . at about 6 , 500 gross . carlos expects between 9 , 500 and
10 , 000 gross for tomorrow . vastar owns 68 % of the gross production .
george x 3 - 6992
- - - - - - - - - - - - - - - - - - - - - - forwarded by george weissman / hou / ect on 12 / 13 / 99 10 : 16
am - - - - - - - - - - - - - - - - - - - - - - - - - - -
daren j farmer
12 / 10 / 99 10 : 38 am
to : carlos j rodriguez / hou / ect @ ect
cc : george weissman / hou / ect @ ect , melissa graves / hou / ect @ ect
subject : vastar resources , inc .
carlos ,
please call linda and get everything set up .
i ' m going to estimate 4 , 500 coming up tomorrow , with a 2 , 000 increase each
following day based on my conversations with bill fischer at bmar .
d .
- - - - - - - - - - - - - - - - - - - - - - forwarded by daren j farmer / hou / ect on 12 / 10 / 99 10 : 34
am

# NLP & Feature Extraction

In [353]:
#A with-block closes the file automatically, even if reading fails
with open('./enron_data/preprocessed_files/enron1/ham/0002.1999-12-13.farmer.ham.txt','r') as file:
    print(file.read())

Subject: vastar resources , inc .
gary , production from the high island larger block a - 1 # 2 commenced on
saturday at 2 : 00 p . m . at about 6 , 500 gross . carlos expects between 9 , 500 and
10 , 000 gross for tomorrow . vastar owns 68 % of the gross production .
george x 3 - 6992
- - - - - - - - - - - - - - - - - - - - - - forwarded by george weissman / hou / ect on 12 / 13 / 99 10 : 16
am - - - - - - - - - - - - - - - - - - - - - - - - - - -
daren j farmer
12 / 10 / 99 10 : 38 am
to : carlos j rodriguez / hou / ect @ ect
cc : george weissman / hou / ect @ ect , melissa graves / hou / ect @ ect
subject : vastar resources , inc .
carlos ,
please call linda and get everything set up .
i ' m going to estimate 4 , 500 coming up tomorrow , with a 2 , 000 increase each
following day based on my conversations with bill fischer at bmar .
d .
- - - - - - - - - - - - - - - - - - - - - - forwarded by daren j farmer / hou / ect on 12 / 10 / 99 10 : 34
am - - - - - - - - - - - - - - - - - - -

In [354]:
def text_extractor(filepath):
    '''Takes in the path of a directory containing multiple text files
    and returns a dictionary mapping each filename to a list of its lines'''

    feature_dict = {}

    for filename in os.listdir(filepath):
        #os.path.join works whether or not filepath ends with a slash
        full_path = os.path.join(filepath, filename)
        #errors='ignore' silently drops bytes that are not valid UTF-8
        with open(full_path, 'r', encoding='utf-8', errors='ignore') as file:
            feature_dict[filename] = file.readlines() #One list item per line of the file

    return feature_dict

In [355]:
ham_features = text_extractor('./enron_data/preprocessed_files/enron1/ham/')
spam_features = text_extractor('./enron_data/preprocessed_files/enron1/spam/')

In [356]:
#Inspect first spam entry
dict_iterator = iter(spam_features.values())
next(dict_iterator)

['Subject: what up , , your cam babe\n',
 'what are you looking for ?\n',
 "if your looking for a companion for friendship , love , a date , or just good ole '\n",
 'fashioned * * * * * * , then try our brand new site ; it was developed and created\n',
 "to help anyone find what they ' re looking for . a quick bio form and you ' re\n",
 'on the road to satisfaction in every sense of the word . . . . no matter what\n',
 'that may be !\n',
 'try it out and youll be amazed .\n',
 'have a terrific time this evening\n',
 'copy and pa ste the add . ress you see on the line below into your browser to come to the site .\n',
 'http : / / www . meganbang . biz / bld / acc /\n',
 'no more plz\n',
 'http : / / www . naturalgolden . com / retract /\n',
 'counterattack aitken step preemptive shoehorn scaup . electrocardiograph movie honeycomb . monster war brandywine pietism byrne catatonia . encomia lookup intervenor skeleton turn catfish .\n']

In [357]:
#Inspect first ham entry
dict_iterator = iter(ham_features.values())
next(dict_iterator)

['Subject: ena sales on hpl\n',
 "just to update you on this project ' s status :\n",
 'based on a new report that scott mills ran for me from sitara , i have come up\n',
 'with the following counterparties as the ones to which ena is selling gas off\n',
 "of hpl ' s pipe .\n",
 'altrade transaction , l . l . c . gulf gas utilities company\n',
 'brazoria , city of panther pipeline , inc .\n',
 'central illinois light company praxair , inc .\n',
 'central power and light company reliant energy - entex\n',
 'ces - equistar chemicals , lp reliant energy - hl & p\n',
 'corpus christi gas marketing , lp southern union company\n',
 'd & h gas company , inc . texas utilities fuel company\n',
 'duke energy field services , inc . txu gas distribution\n',
 'entex gas marketing company union carbide corporation\n',
 'equistar chemicals , lp unit gas transmission company inc .\n',
 "since i ' m not sure exactly what gets entered into sitara , pat clynes\n",
 "suggested that i check with daren farm

In [358]:
#No. files extracted into each dictionary
print('ham:',len(ham_features.keys()))
print('spam:',len(spam_features.keys()))

ham: 3672
spam: 1500


In [359]:
#Combine both dictionaries
#**dict1 & **dict2 expands the contents of both the dictionaries to a collection of key value pairs
all_features = {**ham_features, **spam_features} 

len(all_features)

5172
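
Merging with `{**a, **b}` silently keeps the value from the later dictionary whenever a key appears in both. Here the total of 5172 equals 3672 + 1500, which indicates that no filename is shared. The same check can be made explicit, as in this sketch with toy dictionaries whose contents are made up:

```python
# Toy stand-ins for ham_features and spam_features (contents are made up)
ham_demo = {'0001.ham.txt': ['Subject: meeting\n']}
spam_demo = {'0001.spam.txt': ['Subject: offer\n']}

# dict key views support set operations, so & gives the shared keys
shared_keys = ham_demo.keys() & spam_demo.keys()

merged_demo = {**ham_demo, **spam_demo}

# With no shared keys, the merged size is the sum of the two sizes
print(len(shared_keys), len(merged_demo))
```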

In [360]:
from nltk.stem.snowball import SnowballStemmer
stemmer = SnowballStemmer('english')

In [361]:
stemmer.stem('running') #Example of word being reduced to its root stem

'run'

In [362]:
# define a function that accepts text and returns a list of stems
# (SnowballStemmer performs stemming, not lemmatization; the 'how' argument is unused)
def word_tokenize(text, how='lemma'):
    words = text.split(' ')  # tokenize on single spaces
    return [stemmer.stem(word) for word in words]
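
One detail worth knowing about the tokenizer above: every line returned by `readlines()` ends in `'\n'`, and splitting on a single space leaves that newline attached to the last token. Calling `split()` with no argument splits on any run of whitespace instead. The sketch below compares the two on a line from the sample email and leaves stemming out so that it stays self-contained.

```python
line = 'please call linda and get everything set up .\n'

# Splitting on a single space keeps the trailing newline on the last token
tokens_single_space = line.split(' ')

# Splitting with no argument treats any run of whitespace as one separator
tokens_whitespace = line.split()

print(repr(tokens_single_space[-1]))
print(repr(tokens_whitespace[-1]))
```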

In [363]:
from sklearn.feature_extraction.text import CountVectorizer

In [364]:
#min_df=0.05 keeps only tokens that occur in at least 5% of the documents; here each line of the email is one document
vect = CountVectorizer(min_df=0.05, analyzer=word_tokenize)
_ = vect.fit_transform(ham_features['1061.2000-05-10.farmer.ham.txt'])
_.shape

(46, 48)

https://dziganto.github.io/Sparse-Matrices-For-Efficient-Machine-Learning/

In [365]:
#A TfidfVectorizer builds features from tokens just like CountVectorizer,
#but it then reweights each count by inverse document frequency, so tokens that appear in many documents count for less.

from sklearn.feature_extraction.text import TfidfVectorizer
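
For reference, with its default settings (`smooth_idf=True`, `norm='l2'`) scikit-learn's `TfidfVectorizer` weights a token $t$ in document $d$ roughly as follows, where $n$ is the number of documents, $\mathrm{df}(t)$ is the number of documents containing $t$, and $\mathrm{tf}(t,d)$ is the raw count of $t$ in $d$. Other parameter choices change the formula.

```latex
\mathrm{idf}(t) = \ln\frac{1 + n}{1 + \mathrm{df}(t)} + 1,
\qquad
w_{t,d} = \frac{\mathrm{tf}(t,d)\,\mathrm{idf}(t)}
               {\sqrt{\sum_{t'} \bigl(\mathrm{tf}(t',d)\,\mathrm{idf}(t')\bigr)^{2}}}
```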

In [375]:
#min_df=0.05 keeps only tokens that occur in at least 5% of the documents
tfidf_vect = TfidfVectorizer(min_df=0.05, analyzer=word_tokenize)
X = tfidf_vect.fit_transform(ham_features['1061.2000-05-10.farmer.ham.txt'])
X.shape

AttributeError: 'list' object has no attribute 'split'
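
This is the error a callable analyzer such as `word_tokenize` raises when a "document" is a list of lines rather than a single string, for instance when the values of the features dictionary are handed to the vectorizer directly. The sketch below reproduces it with a simplified stand-in tokenizer (no stemming, hypothetical name `simple_tokenize`) and shows one possible fix: joining each email's lines into one string.

```python
def simple_tokenize(text):
    """Stand-in for word_tokenize without stemming (illustration only)."""
    return text.split(' ')

# One email as returned by readlines(): a list of lines, not a string
email_lines = ['Subject: ena sales on hpl\n', "of hpl ' s pipe .\n"]

try:
    simple_tokenize(email_lines)
    error_message = None
except AttributeError as error:
    error_message = str(error)

# Joining the lines gives the single string the tokenizer expects
email_text = ''.join(email_lines)
tokens = simple_tokenize(email_text)

print(error_message)
print(tokens[:3])
```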

In [367]:
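#get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() is its replacement
len(tfidf_vect.get_feature_names_out())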

48

In [368]:
print(X[0])

  (0, 16)	0.5146654414277028
  (0, 36)	0.699776963670117
  (0, 33)	0.49541062212740233
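
Each printed line reads `(row, column)  weight`, so row 0 (the first line of the email) has three non-zero features. With the default `norm='l2'`, the weights in a row should have a Euclidean length of 1. The quick check below uses only the three values printed above:

```python
import math

# Non-zero TF-IDF weights of row 0, copied from the output above
row_weights = [0.5146654414277028, 0.699776963670117, 0.49541062212740233]

# Euclidean (L2) length of the row
row_norm = math.sqrt(sum(weight ** 2 for weight in row_weights))

print(round(row_norm, 6))
```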


In [373]:
#The label ('ham' or 'spam') is the second-to-last dot-separated field of each filename
y = [filename.split('.')[-2] for filename in all_features]
len(y)

5172
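
The label extraction relies on the structure of the preprocessed filenames. As a sketch, the helper below splits one name into its parts and tallies the labels with `collections.Counter`. The function name and the field names `date` and `owner` are my own reading of the listing shown earlier; the README only documents the leading number as the order of arrival.

```python
from collections import Counter

def parse_filename(filename):
    """Split a name such as '0001.1999-12-10.farmer.ham.txt' into its parts.

    Hypothetical helper; it assumes exactly five dot-separated fields and
    raises ValueError otherwise.
    """
    order, date, owner, label, _extension = filename.split('.')
    return {'order': int(order), 'date': date, 'owner': owner, 'label': label}

# Filenames taken from the directory listing shown earlier
sample_names = [
    '0001.1999-12-10.farmer.ham.txt',
    '0002.1999-12-13.farmer.ham.txt',
    '2561.2000-10-17.farmer.ham.txt',
]

parsed = [parse_filename(name) for name in sample_names]
label_counts = Counter(item['label'] for item in parsed)

print(parsed[0])
print(label_counts)
```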

In [372]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

from sklearn.naive_bayes import MultinomialNB
classifier = MultinomialNB()
classifier.fit(X_train,y_train)

classifier.predict(X_test)

ValueError: Found input variables with inconsistent numbers of samples: [46, 5172]
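
The mismatch arises because `X` was fitted on the 46 lines of a single email, while `y` holds one label per email (5172). Features and labels need one row per email. The sketch below shows one way to line them up, on a made-up toy corpus so that it runs on its own. The toy emails, the stand-in tokenizer `simple_tokenize` (no stemming), and the `stratify` argument are assumptions for illustration, not part of the original notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

def simple_tokenize(text):
    """Whitespace tokenizer standing in for word_tokenize (no stemming)."""
    return text.split()

# Toy stand-in for all_features: filename -> list of lines (contents are made up)
toy_features = {
    '0001.2000-01-01.demo.ham.txt': ['Subject: gas nomination\n', 'please call linda\n'],
    '0002.2000-01-02.demo.ham.txt': ['Subject: meeting\n', 'production report attached\n'],
    '0003.2000-01-03.demo.ham.txt': ['Subject: pipeline\n', 'see the gas report\n'],
    '0004.2000-01-04.demo.spam.txt': ['Subject: free offer\n', 'click now to win\n'],
    '0005.2000-01-05.demo.spam.txt': ['Subject: win cash\n', 'free prize click here\n'],
    '0006.2000-01-06.demo.spam.txt': ['Subject: cheap pills\n', 'offer ends now\n'],
}

# One document per email: join each file's lines into a single string
corpus = [''.join(lines) for lines in toy_features.values()]
y = [filename.split('.')[-2] for filename in toy_features]

tfidf_vect = TfidfVectorizer(analyzer=simple_tokenize)
X = tfidf_vect.fit_transform(corpus)  # shape: (number of emails, vocabulary size)

# stratify keeps both classes in the training split of this tiny corpus
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42, stratify=y)

classifier = MultinomialNB()
classifier.fit(X_train, y_train)
predictions = classifier.predict(X_test)

print(X.shape[0], len(y))
print(predictions)
```

On the real data, the same pattern would build `corpus` from `all_features.values()`, so that `X` has 5172 rows to match `y`.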