# Email Fraud Detection: Data Cleaning and Combining

#### By Ross Willett

<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Email-Fraud-Detection:-Data-Cleaning-and-Combining" data-toc-modified-id="Email-Fraud-Detection:-Data-Cleaning-and-Combining-1">Email Fraud Detection: Data Cleaning and Combining</a></span><ul class="toc-item"><li><ul class="toc-item"><li><ul class="toc-item"><li><span><a href="#By-Ross-Willett" data-toc-modified-id="By-Ross-Willett-1.0.0.1">By Ross Willett</a></span></li></ul></li></ul></li><li><span><a href="#Project-Introduction" data-toc-modified-id="Project-Introduction-1.1">Project Introduction</a></span></li><li><span><a href="#File-Introduction" data-toc-modified-id="File-Introduction-1.2">File Introduction</a></span></li><li><span><a href="#Retrieving-and-Analyzing-Data" data-toc-modified-id="Retrieving-and-Analyzing-Data-1.3">Retrieving and Analyzing Data</a></span></li><li><span><a href="#Data-Frame-Cleaning" data-toc-modified-id="Data-Frame-Cleaning-1.4">Data Frame Cleaning</a></span></li><li><span><a href="#Add-Enron" data-toc-modified-id="Add-Enron-1.5">Add Enron</a></span></li></ul></li></ul></div>

## Project Introduction

## File Introduction

In this file, data from various sources will be retrieved, combined and cleaned to produce a data set suitable for analysis and model building.

## Retrieving and Analyzing Data

In [2]:
# Import data science libraries
import pandas as pd
import numpy as np

# Import regex library
import re

# Import BeautifulSoup for HTML parsing and handling
from bs4 import BeautifulSoup

# Import NLTK for natural language processing
import nltk

In [3]:
# Configure Pandas to show all columns / rows
pd.options.display.max_columns = 2000
pd.options.display.max_rows = 2000

The first data set that will be examined for cleaning and use is the fraud_email_ data set. (Located at https://www.kaggle.com/datasets/pramodgupta92/fraud-email-datasets)

In [4]:
# Read data from fraud_email_ data set and look at shape and dataframe info
fraud_df = pd.read_csv('./data/fraud_email_.csv')
print(fraud_df.shape)
print(fraud_df.info())

(11929, 2)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11929 entries, 0 to 11928
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Text    11928 non-null  object
 1   Class   11929 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 186.5+ KB
None


In [5]:
# Look at content of data
fraud_df.head()

Unnamed: 0,Text,Class
0,Supply Quality China's EXCLUSIVE dimensions at...,1
1,over. SidLet me know. Thx.,0
2,"Dear Friend,Greetings to you.I wish to accost ...",1
3,MR. CHEUNG PUIHANG SENG BANK LTD.DES VOEUX RD....,1
4,Not a surprising assessment from Embassy.,0


In [6]:
# Look at split between fraud and regular emails
fraud_df['Class'].value_counts()

0    6742
1    5187
Name: Class, dtype: int64

Initial analysis of the fraud email data set reveals 11,929 rows of email text, each with a class label indicating whether the email the text was taken from was fraudulent (1) or not (0). One row is missing its text. Although the absence of the email subject and from address is not ideal, this is a significant source of data and should be included in the combined data set.

The next data set that will be examined for use is the phishing_data_by_type data set. (Located at https://www.kaggle.com/datasets/charlottehall/phishing-email-data-by-type)

In [7]:
# Read the CSV into a data frame and look at the shape and column info for the data frame
fraud_2_df = pd.read_csv('./data/phishing_data_by_type.csv')
print(fraud_2_df.shape)
print(fraud_2_df.info())

(159, 3)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 159 entries, 0 to 158
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   Subject  157 non-null    object
 1   Text     159 non-null    object
 2   Type     159 non-null    object
dtypes: object(3)
memory usage: 3.9+ KB
None


In [8]:
# Look at first few entries of data set
fraud_2_df.head()

Unnamed: 0,Subject,Text,Type
0,URGENT BUSINESS ASSISTANCE AND PARTNERSHIP,URGENT BUSINESS ASSISTANCE AND PARTNERSHIP.\n\...,Fraud
1,URGENT ASSISTANCE /RELATIONSHIP (P),"Dear Friend,\n\nI am Mr. Ben Suleman a custom ...",Fraud
2,GOOD DAY TO YOU,FROM HIS ROYAL MAJESTY (HRM) CROWN RULER OF EL...,Fraud
3,from Mrs.Johnson,Goodday Dear\n\n\nI know this mail will come t...,Fraud
4,Co-Operation,FROM MR. GODWIN AKWESI\nTEL: +233 208216645\nF...,Fraud


In [9]:
# Look at the split for the content type
fraud_2_df['Type'].value_counts(normalize=True)

Fraud               0.251572
Phishing            0.251572
Commercial Spam     0.251572
False Positives     0.245283
Name: Type, dtype: float64

This data set contains only 159 documents but appears to contain a decent diversity of both phishing and fraud emails. It isn't ideal that there are also commercial spam emails since this may create an additional challenge for the identifier. As such, this data set may have to undergo additional cleaning but is likely worth including.

The next data set is composed of content extracted from phishing email files retrieved from https://academictorrents.com/details/a77cda9a9d89a60dbdfbe581adf6e2df9197995a. The from address, subject and content were extracted from all of the emails in each folder and compiled into one csv per folder. These files need to be combined into a single csv file.

In [10]:
# Create array to append data frames with contents to
phishing_extract_array = []
# Iterate over the phishing extract file numbers
# Note: range(1, 5) covers files 1 through 4; use range(1, 6) if there are five files
for i in range(1, 5):
    # Read the data from each phishing extract csv and put it into a data frame
    phishing_extract = pd.read_csv(f'./data/phishing_extract_{i}.csv')
    # Append the data frame to the array of data frames
    phishing_extract_array.append(phishing_extract)

# Create a data frame to store the combined data frames
combined_phishing_extract_df = pd.concat(phishing_extract_array) 
# Drop duplicates in phishing extract data frame
combined_phishing_extract_df.drop_duplicates(inplace=True)
# Save phishing extract data frame to csv
combined_phishing_extract_df.to_csv('./data/phishing_eml_extract_full.csv', index=False)
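The loop above hard-codes how many extract files exist. A minimal sketch of an alternative, assuming the files follow the `phishing_extract_{i}.csv` naming used above, discovers them with `glob` so the range never has to be kept in sync with the folder contents. The helper name `combine_csv_extracts` is hypothetical and not part of the original notebook.

```python
import glob

import pandas as pd


def combine_csv_extracts(pattern):
    """Read every CSV matching `pattern`, stack them, and drop duplicate rows.

    Hypothetical helper for illustration only.
    """
    # Sort so the files are always read in the same order
    paths = sorted(glob.glob(pattern))
    if not paths:
        raise FileNotFoundError(f'No files match {pattern!r}')
    frames = [pd.read_csv(path) for path in paths]
    # ignore_index=True gives the combined frame a fresh 0..n-1 index
    combined = pd.concat(frames, ignore_index=True)
    return combined.drop_duplicates(ignore_index=True)
```

It could be called as `combine_csv_extracts('./data/phishing_extract_*.csv')`, and the same helper would cover the ham extracts with a different pattern.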

In [11]:
# Read all phishing data from the saved csv into a data frame
phishing_extract_df = pd.read_csv('./data/phishing_eml_extract_full.csv')
# Look at data frame shape
print(phishing_extract_df.shape)
# Display data frame content
phishing_extract_df.sample(10)

(2416, 3)


Unnamed: 0,from,subject,content
1747,<service@wellsfargo.com>,Your Account Access Is Locked,"We recently reviewed your account, and suspect..."
742,"""Federal Credit Union"" <acc-validity@ncua.gov>",[*** POSIBLE SPAM***] FCU NOTICE: Important se...,Credit Union is constantly working to ensure s...
747,"""PayPal"" <postmaster@paypal.com>",Update your PayPal account,"Dear Sir,PayPal is committed to maintaining a ..."
579,"""Dante Carmichael"" <tkruk@rmsg.com>",rolex mania is down,We only sell premium watches. Theres no bat te...
1236,"""PayPal"" <update@secure.account.login.info>",* * New Updates For You * *,Security Measures - Are You Traveling PayPal i...
643,"""PayPal"" <service@paypal.com>",Notification of Limited Account Access - Case ...,Notification of Limited Account Access Dear Pa...
936,"""PayPal Security Service"" <security@paypal.com>",IMPORTANT: Notification of limited accounts,Notification of Limited Account Access As part...
1688,"""Commonwealth Bank of Australia."" <illdoo@aol....",Urgent Security Notification from Commonwealth...,"Dear Commonwealth Bank customer, Commonwealth ..."
1631,"""Fifth Third Bank"" <services-id6764431ver@secu...","Fifth Third Bank Strongly Recommends. [Thu,\n ...",Dear Fifth Third bank business/commercial cust...
1233,"""PayPal Email ID PP321"" <acc-overview@paypal.com>",[*** POSIBLE SPAM***] PayPal Email ID PP321,"Dear PayPal User Dear PayPal User, We recently..."


The data from the extracted emails includes 2,416 documents including a from address, a subject and the email content entirely consisting of phishing emails. This will prove to be a useful sample for the complete data set.

The next data set is composed of content extracted from ham email files retrieved from the Apache "Spam Assassin" corpus located [here](https://spamassassin.apache.org/old/publiccorpus/). The from address, subject and content were extracted from all of the emails in each folder and compiled into one csv per folder. These files need to be combined into a single csv file.

In [12]:
# Create array to store extracted data frames to
ham_extract_array = []
# Iterate over the number of csvs the data was separately stored in
for i in range(1, 4):
    # Read the data from each csv into a data frame
    ham_extract = pd.read_csv(f'./data/ham_extract_{i}.csv')
    # Append the data frame to the array of data frames
    ham_extract_array.append(ham_extract)

# Combine all the data frames into a singular one
combined_ham_extract_df = pd.concat(ham_extract_array)
# Remove all duplicate rows from the combined data frame
combined_ham_extract_df.drop_duplicates(inplace=True)
# Save the combined data frame to a csv file
combined_ham_extract_df.to_csv('./data/ham_eml_extract_full.csv', index=False)

In [13]:
# Load the combined data from a csv into a data frame
combined_ham_extract_df = pd.read_csv('./data/ham_eml_extract_full.csv')
# Display the shape and content of the data frame
print(combined_ham_extract_df.shape)
combined_ham_extract_df.sample(10)

(3303, 3)


Unnamed: 0,from,subject,content
3201,Daily Dilbert <2.20290.44-t9bsgc0tYwDu.1@ummai...,Your Daily Dilbert 07/10/2002,E-mail error Youre subscribed to the HTML vers...
2,"""Mr. FoRK"" <fork_list@hotmail.com>",Am I This Or Not?,I actually thought of this kind of active chat...
787,Alvie <bishop12@prodigy.net>,RH 8 no DMA for DVD drive,----- Original Message ----- From: Joseph S. B...
2177,Gary Lawrence Murphy <garym@canada.com>,Re: Al'Qaeda's fantasy ideology: Policy Revie...,b bitbitch writes: b My only problem with this...
1010,guardian <rssfeeds@spamassassin.taint.org>,Plans for new youth units blocked,I am delurking to comment on the Salon article...
1067,pudge@perl.org,[use Perl] Headlines for 2002-09-17,I am delurking to comment on the Salon article...
2524,harley@argote.ch (Robert Harley),Re: Infectious disease (was Re: Al'Qaeda's fan...,"James Rogers wrote: ... As I understand it, th..."
142,zawodny <rssfeeds@spamassassin.taint.org>,Yahoo Finance RSS Beta,I actually thought of this kind of active chat...
1477,Jacob Morzinski <yyyyorzins@MIT.EDU>,Re: bad focus/click behaviours,I installed Spamassassin 2.41 with Razor V2 th...
2793,Rohit Khare <khare@alumni.caltech.edu>,Re: HD/ID: High-Def Independence Day,"On Sunday, June 30, 2002, at 12:18 AM, Rohit K..."


The data from the extracted emails includes 3,303 documents, each with a from address, a subject and the email content, and consists entirely of "ham" emails. This will prove to be a useful source of data for the combined data set.

## Data Frame Cleaning

Now that the data sets have been acquired and cursorily examined, the next step will be to format them in the same manner and combine them into a singular data set for further cleaning and analysis. Since the data is largely composed of documents that only have a flag indicating fraud and the content (77% of all data), the final data set should be composed of content and a flag indicating fraud. Given this, all the data sets will be transformed to a standard format with a `content` column with the email text and a `fraud` column with a 1 or 0 flag indicating whether it is fraud or not.
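The standardization described above follows the same steps for every source: pick the text column, derive a 0/1 flag, and discard everything else. A minimal sketch of a reusable helper is shown here, assuming a hypothetical function name `to_standard_format` and a label supplied by the caller as either a boolean mask or a single boolean. The toy data is made up for illustration.

```python
import pandas as pd


def to_standard_format(df, text_column, is_fraud):
    """Return a two-column frame: `content` (email text) and `fraud` (1 or 0).

    Hypothetical helper for illustration. `is_fraud` is a boolean Series
    aligned with `df`, or a single bool applied to every row.
    """
    clean_df = pd.DataFrame({'content': df[text_column]})
    if isinstance(is_fraud, bool):
        # One label for the whole source, e.g. an all-phishing collection
        clean_df['fraud'] = int(is_fraud)
    else:
        # Convert the boolean mask into 1/0 integer flags
        clean_df['fraud'] = is_fraud.astype(int)
    return clean_df


# Toy data shaped like the second data set (values are made up)
example_df = pd.DataFrame({
    'Text': ['wire me money', 'lunch at noon?'],
    'Type': ['Fraud', 'False Positives'],
})
standardized_df = to_standard_format(
    example_df, 'Text', example_df['Type'].isin(['Fraud', 'Phishing'])
)
```

The cells that follow perform these steps inline for each source, which keeps every transformation visible at the cost of some repetition.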

In [14]:
# Initialize clean data frame for first data set
fraud_clean_df = pd.DataFrame()

In [15]:
# Set 'content' column for clean data frame
fraud_clean_df['content'] = fraud_df['Text']
# Set 'fraud' column for clean data frame
fraud_clean_df['fraud'] = np.where(
    fraud_df['Class'] == 1,
    1,
    0
)

In [16]:
# Examine the contents of the clean data frame
fraud_clean_df.head()

Unnamed: 0,content,fraud
0,Supply Quality China's EXCLUSIVE dimensions at...,1
1,over. SidLet me know. Thx.,0
2,"Dear Friend,Greetings to you.I wish to accost ...",1
3,MR. CHEUNG PUIHANG SENG BANK LTD.DES VOEUX RD....,1
4,Not a surprising assessment from Embassy.,0


In the second data set there is a mix of fraud, phishing, ham and commercial spam email types. The commercial spam will be removed from the data set as this type doesn't strictly fall into either the ham or fraud category and will likely only make identification more difficult for any models trained on this data.

In [17]:
# Examine the data contained in the second fraud data set
fraud_2_df.head()

Unnamed: 0,Subject,Text,Type
0,URGENT BUSINESS ASSISTANCE AND PARTNERSHIP,URGENT BUSINESS ASSISTANCE AND PARTNERSHIP.\n\...,Fraud
1,URGENT ASSISTANCE /RELATIONSHIP (P),"Dear Friend,\n\nI am Mr. Ben Suleman a custom ...",Fraud
2,GOOD DAY TO YOU,FROM HIS ROYAL MAJESTY (HRM) CROWN RULER OF EL...,Fraud
3,from Mrs.Johnson,Goodday Dear\n\n\nI know this mail will come t...,Fraud
4,Co-Operation,FROM MR. GODWIN AKWESI\nTEL: +233 208216645\nF...,Fraud


In [18]:
# Initialize the data frame to store the cleaned data from the second data set in
fraud_2_clean_df = pd.DataFrame()
# Set the content column to the text of the second data set
fraud_2_clean_df['content'] = fraud_2_df['Text']
# Set a 'fraud' column flagging the fraud vs normal emails from the fraud data set
fraud_2_clean_df['fraud'] = np.where(
    (fraud_2_df['Type'] == 'Fraud') | (fraud_2_df['Type'] == 'Phishing'),
    1,
    0
)
# Remove the data identified as 'Commercial Spam' from the data set
fraud_2_clean_df = fraud_2_clean_df[fraud_2_df['Type'] != 'Commercial Spam']

In [19]:
# Examine content of cleaned data
fraud_2_clean_df.head()

Unnamed: 0,content,fraud
0,URGENT BUSINESS ASSISTANCE AND PARTNERSHIP.\n\...,1
1,"Dear Friend,\n\nI am Mr. Ben Suleman a custom ...",1
2,FROM HIS ROYAL MAJESTY (HRM) CROWN RULER OF EL...,1
3,Goodday Dear\n\n\nI know this mail will come t...,1
4,FROM MR. GODWIN AKWESI\nTEL: +233 208216645\nF...,1


All the data from the phishing email collection can be classified as fraud, and all columns except the email text content can be dropped.

In [20]:
# Initialize new clean data frame
fraud_3_clean_df = pd.DataFrame()
# Assign the 'content' of the new data frame to the text content of the phishing data frame
fraud_3_clean_df['content'] = phishing_extract_df['content']
# Set the fraud column to 1 for this data frame since all entries are fraudulent
fraud_3_clean_df['fraud'] = 1

In [21]:
# Examine the content of the new data frame
fraud_3_clean_df.head()

Unnamed: 0,content,fraud
0,"Dear valued PayPal member, Due to recent fraud...",1
1,Credit Union is constantly working to ensure s...,1
2,Credit Union is constantly working to ensure s...,1
3,"Untitled Document Dear eBay Member, We regret ...",1
4,"Dear valued PayPal member, Due to recent fraud...",1


All the data from the ham email collection can be classified as not fraud, and all columns except the email text content can be dropped.

In [22]:
# Initialize new clean data frame
fraud_4_clean_df = pd.DataFrame()
# Assign the 'content' of the new data frame to the text content of the ham data frame
fraud_4_clean_df['content'] = combined_ham_extract_df['content']
# Set the fraud column to 0 for this data frame since all entries are ham
fraud_4_clean_df['fraud'] = 0

Now that all the data has been put into data frames with a consistent format and flagged, they can be combined for further cleaning and processing.

In [23]:
# Assign a new data frame to the combined contents of all the cleaned data frames
fraud_all_df = pd.concat([
    fraud_clean_df,
    fraud_2_clean_df,
    fraud_3_clean_df,
    fraud_4_clean_df
], ignore_index=True)

In [24]:
# Remove all rows containing NA values from the new data frame
fraud_all_df.dropna(inplace=True)

In [25]:
# Look at the content of the combined data frame
fraud_all_df.sample(10)

Unnamed: 0,content,fraud
11617,"Hello,I am looking for your cooperation in bui...",1
6476,Yes we figured that?,0
7790,Very much so. I am also making a few changes a...,0
1958,Sigh - will send you traffic that lead to this...,0
9905,It is actually not a full cabinet meeting toda...,0
5603,"Love the ""doctrine"" and I agree we need a heal...",0
2342,Importance: HighFYI,0
14019,Dear Citizens Bank and Charter One Bank custom...,1
16568,E Elias Sinderson writes: E Gary Lawrence Murp...,0
6186,Sullivan Jacob J <SullivanJJ©state.gov>Thursda...,0


Now that the data frames have been combined, several functions should be defined to perform additional cleaning and formatting. Inspecting the data frame contents shows that the text content of some emails contains HTML. A function is needed to extract the text content from the HTML in these emails.

In [26]:
# Defines a function to extract text content from strings with HTML
def extract_HTML_text(html):
    '''
    Accepts text content containing HTML and returns only the text content of the HTML
    
    Parameters
    ----------
    html: A string which contains HTML encoded content
    
    Returns
    ----------
    Ret: A string which contains only the text from the HTML encoded content
    
    Example
    ----------
    >>> extract_HTML_text('<div>Text</div>')
    'Text'
    '''
    # Instantiate a BeautifulSoup object from the html string using BeautifulSoup
    soup = BeautifulSoup(html, features="html.parser")
    # Return the text content of the soupified html content
    return soup.get_text()
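For readers without BeautifulSoup installed, the idea of keeping text nodes and discarding tags can be illustrated with the standard library alone. This is a simplified sketch rather than a replacement for the function above: `TextOnlyParser` and `strip_tags` are hypothetical names, and the sketch has not been checked against the malformed markup found in real emails.

```python
from html.parser import HTMLParser


class TextOnlyParser(HTMLParser):
    """Collects the text nodes of an HTML string and ignores the tags."""

    def __init__(self):
        super().__init__()
        self.text_parts = []

    def handle_data(self, data):
        # Called for each piece of text found between tags
        self.text_parts.append(data)


def strip_tags(html):
    """Return the concatenated text nodes of `html`."""
    parser = TextOnlyParser()
    parser.feed(html)
    # close() flushes any text still buffered by the parser
    parser.close()
    return ''.join(parser.text_parts)


plain_text = strip_tags('<div>Hello <b>world</b></div>')
```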

In addition to the emails with HTML content, there were also emails containing non-alphanumeric characters and some emails which don't appear to contain any English words. As such, a function will be needed to remove any unusual characters and non-English words.

In [27]:
# Download the NLTK punctuation package
nltk.download('punkt')
# Download the NLTK english words package
nltk.download('words')

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/rosswillett/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package words to
[nltk_data]     /Users/rosswillett/nltk_data...
[nltk_data]   Package words is already up-to-date!


True

In [28]:
# Create a set containing all the english words from the NLTK corpus
nltk_eng_words = set(nltk.corpus.words.words())
# Define a function to clean the text content of emails
def extract_text_content(text):
    '''
    Accepts a string, filters non-alphabetical characters and non-english words and returns the resulting string
    
    Parameters
    ----------
    text: A string which needs to be filtered
    
    Returns
    ----------
    Ret: A string which contains only english words
    
    Example
    ----------
    >>> extract_text_content('asjudehr hello_ 712753^!@54318 world!')
    'hello world'
    '''
    # Removes any '=2C' strings from the text (This appears in certain text encodings between words)
    filteredText = text.replace('=2C', '')
    # Filters out any non-alphabetical or space characters from the text
    characterFilteredText = re.sub(r'[^a-zA-Z\s]', ' ', filteredText)
    # Instantiates an array to contain the english words
    englishWordOnlyTextArr = []
    # Iterate over every word in the string picked up by the NLTK tokenizer
    for word in nltk.word_tokenize(characterFilteredText):
        # Set the word to lower case
        lower_word = word.lower()
        # Check if the word exists in the set of english words and has a length > 1
        # and append the word to the english word array if so
        if lower_word in nltk_eng_words and len(word) > 1:
            englishWordOnlyTextArr.append(word)
        # If the word is only one character check if it is one of the two english words of one character
        # If so, append the word to the english word array
        elif lower_word == 'i' or lower_word == 'a':
            englishWordOnlyTextArr.append(word)
    # Return the english word array of the content joined by spaces
    return ' '.join(englishWordOnlyTextArr)
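The filtering logic above can be tried without downloading the NLTK corpora. The sketch below is a simplified stand-in, not the function above: it assumes a tiny hypothetical vocabulary (`DEMO_VOCABULARY`) in place of `nltk.corpus.words`, and it uses `str.split` as an approximation of `nltk.word_tokenize`, so the two may tokenize some inputs differently.

```python
import re

# Hypothetical stand-in for the NLTK English word list
DEMO_VOCABULARY = {'hello', 'world', 'dear', 'friend'}
# The two one-letter English words kept by the original function
ONE_LETTER_WORDS = {'a', 'i'}


def keep_known_words(text, vocabulary=DEMO_VOCABULARY):
    """Replace non-letters with spaces, then keep only words found in `vocabulary`."""
    letters_only = re.sub(r'[^a-zA-Z\s]', ' ', text)
    kept_words = []
    for word in letters_only.split():
        lower_word = word.lower()
        if len(word) > 1 and lower_word in vocabulary:
            # Keep the word with its original capitalization
            kept_words.append(word)
        elif lower_word in ONE_LETTER_WORDS:
            kept_words.append(word)
    return ' '.join(kept_words)


filtered = keep_known_words('asjudehr hello_ 712753^!@54318 world!')
```

Because non-letters are replaced with spaces rather than removed, a token such as `w1n` is split into the single letters `w` and `n`, which are then discarded.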

Once the content has been filtered, the number of words remaining in each email should be recorded as an additional feature. A function to count the words in a string has been created below.

In [29]:
# Define a function to get the word count of a string
def get_word_count(text):
    '''
    Accepts a string and returns the number of words contained in that string
    
    Parameters
    ----------
    text: A string which contains words to be counted
    
    Returns
    ----------
    Ret: An integer of the number of words contained in that string
    
    Example
    ----------
    >>> get_word_count('goodbye cruel world')
    3
    '''
    # Filter string to alphabetical characters and whitespace only
    alpha_only_text = re.sub(r'[^a-zA-Z\s]', '', text)
    # Collapse any runs of repeated whitespace into a single space
    single_spaced_text = re.sub(r'\s\s+', ' ', alpha_only_text)
    # Split the sentence into words by splitting on spaces
    listofwords = single_spaced_text.split(' ')
    # Return the length of the resulting word list
    return len(listofwords)
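Splitting on a single space treats the gap between two consecutive spaces as an empty word, which is why repeated whitespace matters when counting. A short sketch of the difference, using a hypothetical `count_words` helper that relies on `str.split()` with no argument:

```python
import re


def count_words(text):
    """Count alphabetic words, treating any run of whitespace as one separator."""
    # Drop everything except letters and whitespace
    alpha_only_text = re.sub(r'[^a-zA-Z\s]', '', text)
    # split() with no argument ignores leading, trailing and repeated whitespace
    return len(alpha_only_text.split())


# The double space produces an empty token when splitting on ' '
naive_count = len('goodbye  cruel world'.split(' '))
robust_count = count_words('goodbye  cruel world')
```

Unlike splitting on `' '`, this variant also returns 0 for an empty string, which only matters if the count is taken before empty rows are removed.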

Now that the functions to properly filter the email content have been created, these should be applied and any other relevant information moved into new columns. First the HTML content should be parsed to get the text content from the emails.

In [30]:
# Find the rows with HTML content in them. A telltale sign of HTML content
# in a string is the characters '</', which appear in HTML closing tags
rows_with_html = fraud_all_df['content'].str.lower().str.contains('</', na=True)

In [31]:
# Filter the data frame down to the rows containing HTML and
# replace their content with the text extracted from the HTML
fraud_all_df.loc[rows_with_html, ['content']] = fraud_all_df[rows_with_html]['content'].apply(extract_HTML_text)

Since the numerical values and any links will be removed following the English word filter and may represent useful information for identifying fraud emails, these values should be recorded in separate columns.

In [32]:
# Gets a count of all unsecured links in the text content (Note that unsecured links start with the 'http://' string)
fraud_all_df['unsecure_link_count'] = fraud_all_df['content'].str.count('http://')

In [33]:
# Gets a count of all secure links in the text content (Note that secure links start with the 'https://' string)
fraud_all_df['secure_link_count'] = fraud_all_df['content'].str.count('https://')
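Because `'http://'` is not a substring of `'https://'`, the two counts above do not overlap. A minimal sketch of the same idea on a single string follows, using a hypothetical `count_links` helper. The case-insensitive matching is an assumption added here; the counts above are case-sensitive.

```python
import re

# Case-insensitive so 'HTTP://' is counted too (an assumption beyond the original counts)
LINK_PATTERN = re.compile(r'(https?)://', flags=re.IGNORECASE)


def count_links(text):
    """Return (unsecure_link_count, secure_link_count) for one email body."""
    # findall returns only the captured scheme for each match
    schemes = [scheme.lower() for scheme in LINK_PATTERN.findall(text)]
    return schemes.count('http'), schemes.count('https')


link_counts = count_links('Visit http://a.example or https://b.example or HTTPS://c.example')
```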

In [34]:
# Finds all runs of digits in the text content,
# counts them and adds the count to a new column 'numbers_count'
fraud_all_df['numbers_count'] = fraud_all_df['content'].apply(lambda text: len(re.findall(r'\d+', text)))
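Note that `\d+` counts each unbroken run of digits, so a formatted value such as `1,000.50` contributes three matches rather than one. The sketch below contrasts that with a hypothetical alternative pattern (`FORMATTED_NUMBER`) that treats digit groups joined by commas or periods as a single number. Whether that is preferable depends on how the feature is used.

```python
import re

# Same pattern as the cell above: every unbroken run of digits
DIGIT_RUN = re.compile(r'\d+')
# Hypothetical alternative: digits optionally continued by ',' or '.' plus more digits
FORMATTED_NUMBER = re.compile(r'\d+(?:[.,]\d+)*')

sample_text = 'Transfer $1,000.50 to account 12345 by 3pm'

digit_run_count = len(DIGIT_RUN.findall(sample_text))
formatted_count = len(FORMATTED_NUMBER.findall(sample_text))
```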

Now that any non-word based content has been stored, the content can be filtered down to English word content and any empty, duplicate or null values removed.

In [35]:
# Applies the english word extraction function to each row of the content column and assigns the result back to it
fraud_all_df.loc[:, ['content']] = fraud_all_df['content'].apply(extract_text_content)

In [36]:
# Applies the get word count function to the content and assigns the result to a new column
fraud_all_df['word_count'] = fraud_all_df['content'].apply(get_word_count)

In [37]:
# Removes all rows with empty text content
fraud_all_df = fraud_all_df[fraud_all_df['content'] != '']

In [38]:
# Check to ensure no null values remain
fraud_all_df.isna().sum()

content                0
fraud                  0
unsecure_link_count    0
secure_link_count      0
numbers_count          0
word_count             0
dtype: int64

In [39]:
# Checks for duplicates in the data
fraud_all_df.duplicated().sum()

4067

In [40]:
# Removes all duplicates in the data
fraud_all_df.drop_duplicates(inplace=True, ignore_index=True)

In [41]:
# Examines the final cleaned data frame
fraud_all_df.sample(10)

Unnamed: 0,content,fraud,unsecure_link_count,secure_link_count,numbers_count,word_count
9294,I have it Or Message,0,0,0,0,5
10499,New Page Important Notice Sept Dear Sir Madam ...,1,0,1,4,186
8135,will get Rob to write him a note unless you al...,0,0,0,0,17
9169,am DEPART Private route State Department am AR...,0,0,0,34,58
2689,the following page from the to you Please clic...,1,0,0,6,247
6693,From Benjamin Cote West address A Day My name ...,1,0,0,28,198
11874,On Tue wrote I have or which get used for diff...,0,1,0,9,161
8117,Yes Its a nice photo It could just be someone ...,0,0,0,0,18
10380,Protect Your Account sure you never provide yo...,1,0,1,2,160
1157,Happy easter to you We just finished one hunt ...,0,0,0,1,22


Now that the data has been fully combined and cleaned, it can be saved to a csv file for further evaluation and use for modeling.

In [136]:
# Saves the resulting combined and cleaned data frame to a CSV
fraud_all_df.to_csv('./data/fraud_all_data_clean_X.csv', index=False)

## Add Enron

In [94]:
# Keep only the fraudulent emails from the first cleaned data set
phishing_2_fraud_only = fraud_clean_df[fraud_clean_df['fraud'] == 1]

In [44]:
# Read the extracted Enron emails into a data frame
enron_email_df = pd.read_csv('./data/enron_extracted.csv')

In [45]:
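# Keep only emails sent from an Enron address (na=False skips rows with a missing from address)
enron_internal_df = enron_email_df[enron_email_df['from'].str.contains('@enron', na=False)]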

In [46]:
enron_internal_df.sample(10)

Unnamed: 0,from,to,subject,content
52493,shirley.crenshaw@enron.com,move-team@enron.com,Computers from Research Group,Good morning all: This past weekend you moved ...
20193,kate.symes@enron.com,andy.chen@enron.com,Re: Warning,Thanks - Enron designed that stationary specif...
447804,darrell.schoolcraft@enron.com,"steve.january@enron.com, kimberly.watson@enron...",TW Weekend scheduled volumes,March 2002 Scheduled Scheduled Friday 15 West ...
89831,ted.murphy@enron.com,"s..bradford@enron.com, r..brackett@enron.com, ...",Quick Update,"Bill and Friends: FYI, things are still quite ..."
250452,lynn.blair@enron.com,"tim.johanson@enron.com, john.williams@enron.co...",RE: Requested training by Xcel in Denver,"Tim and Randy, how did the traning go Thanks. ..."
62105,rebecca.mcdonald@enron.com,"jeff.skilling@enron.com, kevin.hannon@enron.com",FW: Sale of Enron's Interest in Bachaquero,FYI -----Original Message----- From: Tortolero...
486224,maria.sandoval@enron.com,"asandov225@aol.com, andrea.guillen@enron.com, ...",The Empty Chair,THE EMPTY CHAIR A mans daughter had asked the ...
109654,michele.winckowski@enron.com,,FW: Something Worth Seeing,This is very powerful. Some photos are very gr...
308442,frank.davis@enron.com,tana.jones@enron.com,Pulp & Paper Long Descriptions,"Tana, Attached below are examples of EnronOnli..."
457221,maureen.mcvicker@enron.com,tom.briggs@enron.com,Re: Draft Wyden letter,Tom: What address should I use for the Sen. Wy...


In [115]:
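# Copy the internal Enron emails so the filtered data frame above is left untouched
enron_internal_clean = enron_internal_df.copy()
# Flag every internal Enron email as not fraud, using the same 'fraud' column as the other data frames
enron_internal_clean['fraud'] = 0
# Keep only the email text and the fraud flag
enron_internal_clean.drop(columns=['to', 'from', 'subject'], inplace=True)
# Take a random sample of 5,656 internal Enron emails
enron_internal_clean_sample = enron_internal_clean.sample(5656)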
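The sample size of 5,656 above is hard-coded. A minimal sketch of how the number of legitimate emails to sample could instead be derived from the data is shown below, assuming the goal is an equal number of fraud and non-fraud rows. The helper name `sample_to_balance`, the fixed `random_state` and the toy data are assumptions for illustration.

```python
import pandas as pd


def sample_to_balance(labeled_df, extra_ham_df, random_state=42):
    """Sample rows from `extra_ham_df` so fraud and non-fraud counts match.

    Hypothetical helper: `labeled_df` needs a 0/1 `fraud` column and
    `extra_ham_df` is assumed to contain only legitimate emails.
    """
    fraud_count = int((labeled_df['fraud'] == 1).sum())
    ham_count = int((labeled_df['fraud'] == 0).sum())
    # Never request a negative sample or more rows than are available
    rows_needed = min(max(fraud_count - ham_count, 0), len(extra_ham_df))
    return extra_ham_df.sample(n=rows_needed, random_state=random_state)


# Toy data: three fraud rows and one legitimate row, so two more legitimate rows are needed
labeled_df = pd.DataFrame({'content': ['a', 'b', 'c', 'd'], 'fraud': [1, 1, 1, 0]})
extra_ham_df = pd.DataFrame({'content': ['e', 'f', 'g'], 'fraud': 0})
balancing_sample = sample_to_balance(labeled_df, extra_ham_df)
```

Fixing `random_state` makes the sample reproducible between runs, which the unseeded `sample` call above does not guarantee.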

In [116]:
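# Combine the fraud-only rows, the other cleaned data sets and the Enron sample into one data frame
new_df = pd.concat([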
    phishing_2_fraud_only,
    fraud_2_clean_df,
    fraud_3_clean_df,
    fraud_4_clean_df,
    enron_internal_clean_sample
], ignore_index=True)

In [117]:
new_df['fraud'].value_counts()

1    7138
0    7138
Name: fraud, dtype: int64

In [118]:
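# Repeat the cleaning steps applied to fraud_all_df above, starting by removing rows with NA values
new_df.dropna(inplace=True)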

In [119]:
rows_with_html = new_df['content'].str.lower().str.contains('</', na=True)

In [120]:
new_df.loc[rows_with_html, ['content']] = new_df[rows_with_html]['content'].apply(extract_HTML_text)

In [121]:
new_df['unsecure_link_count'] = new_df['content'].str.count('http://')

In [122]:
new_df['secure_link_count'] = new_df['content'].str.count('https://')

In [123]:
new_df['numbers_count'] = new_df['content'].apply(lambda text: len(re.findall(r'\d+', text)))

In [124]:
new_df.loc[:, ['content']] = new_df['content'].apply(extract_text_content)

In [125]:
new_df['word_count'] = new_df['content'].apply(get_word_count)

In [126]:
new_df = new_df[new_df['content'] != '']

In [127]:
new_df.isna().sum()

content                0
fraud                  0
unsecure_link_count    0
secure_link_count      0
numbers_count          0
word_count             0
dtype: int64

In [128]:
new_df.duplicated().sum()

1474

In [129]:
new_df.drop_duplicates(inplace=True, ignore_index=True)

In [130]:
new_df

Unnamed: 0,content,fraud,unsecure_link_count,secure_link_count,numbers_count,word_count
0,Supply Quality China EXCLUSIVE at Unbeatable P...,1,0,0,10,131
1,Dear Friend to you I wish to accost you with a...,1,0,0,9,385
2,BANK BRANCH CENTRAL HONG HONG Let me start by ...,1,1,0,6,549
3,from barrister friend I know that my letter wi...,1,0,0,41,527
4,SOLICITING FOR A BUSINESS VENTURE AND DEAR SIR...,1,0,0,20,323
...,...,...,...,...,...,...
12746,You know I must have received your message on ...,0,0,0,5,82
12747,AGRICULTURE Soft commodity find the going hard...,0,3,0,358,7277
12748,I have your resume with my commentary to Bibi ...,0,0,0,0,31
12749,Mark How should we handle this In the past I h...,0,0,0,50,274


In [131]:
new_df.to_csv('./data/fraud_with_enron_data_clean_1.csv', index=False)