# Email Fraud Detection: Data Cleaning and Combining

**By Ross Willett**


## Project Introduction

Each year, individuals and organizations fall victim to fraud, which leads to significant personal or organizational losses. Many of these victims are first contacted by criminals through email. The objective of this project is to provide a proof of concept that natural language processing and machine learning can be used to classify emails as fraudulent or non-fraudulent ("ham") based on their text content. Such a classifier could then be implemented by email applications to help prevent their users from becoming victims of fraud.

**IMPORTANT NOTE:**

Throughout this project non-fraudulent emails will often be referred to as "ham" emails, as this is a term often used to refer to regular email communications in email classification problems.

## File Introduction

Due to the scarcity of email data online (likely due to privacy concerns), data from several sources will be retrieved, analyzed, combined and cleaned to produce a data set suitable for analysis and model building.

## Retrieving and Analyzing Data

The retrieval and analysis of the several data sources used for this project is detailed in this section.

In [1]:
# Import matrix and data manipulation libraries
import pandas as pd
import numpy as np

# Import regex library
import re

# Import custom text transformers
from TextTransformers import extract_HTML_text, get_word_count, extract_text_content

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/rosswillett/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package words to
[nltk_data]     /Users/rosswillett/nltk_data...
[nltk_data]   Package words is already up-to-date!


In [59]:
# Configure Pandas to show more columns/rows
pd.options.display.max_columns = 2000
pd.options.display.max_rows = 2000

### Kaggle "Fraud Email Datasets"

The first data set that will be examined for cleaning and use is the "Fraud Email Datasets". (Located at https://www.kaggle.com/datasets/pramodgupta92/fraud-email-datasets)

In [3]:
# Read data from fraud_email_ data set and look at shape and dataframe info
fraud_df = pd.read_csv('./data/fraud_email_.csv')
print(fraud_df.shape)
print(fraud_df.info())

(11929, 2)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11929 entries, 0 to 11928
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Text    11928 non-null  object
 1   Class   11929 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 186.5+ KB
None


In [4]:
# Look at content of data
fraud_df.head()

Unnamed: 0,Text,Class
0,Supply Quality China's EXCLUSIVE dimensions at...,1
1,over. SidLet me know. Thx.,0
2,"Dear Friend,Greetings to you.I wish to accost ...",1
3,MR. CHEUNG PUIHANG SENG BANK LTD.DES VOEUX RD....,1
4,Not a surprising assessment from Embassy.,0


In [5]:
# Look at split between fraud and regular emails
fraud_df['Class'].value_counts()

0    6742
1    5187
Name: Class, dtype: int64

Initial analysis of the fraud email data set reveals a sizable 11,929 rows of email text (one of which has missing text), each with a class label indicating whether the email is fraudulent. Although the absence of the email subject and "from" address is not ideal, this source contains a significant volume of data that should prove valuable for this project.

### Kaggle "Phishing Email Data by Type"

The next data set that will be examined for use is the "Phishing Email Data by Type" data set. (Located at https://www.kaggle.com/datasets/charlottehall/phishing-email-data-by-type)

In [6]:
# Read the data set from its CSV file into a data frame and look at the shape and column info
fraud_2_df = pd.read_csv('./data/phishing_data_by_type.csv')
print(fraud_2_df.shape)
print(fraud_2_df.info())

(159, 3)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 159 entries, 0 to 158
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   Subject  157 non-null    object
 1   Text     159 non-null    object
 2   Type     159 non-null    object
dtypes: object(3)
memory usage: 3.9+ KB
None


In [7]:
# Look at first few entries of data set
fraud_2_df.head()

Unnamed: 0,Subject,Text,Type
0,URGENT BUSINESS ASSISTANCE AND PARTNERSHIP,URGENT BUSINESS ASSISTANCE AND PARTNERSHIP.\n\...,Fraud
1,URGENT ASSISTANCE /RELATIONSHIP (P),"Dear Friend,\n\nI am Mr. Ben Suleman a custom ...",Fraud
2,GOOD DAY TO YOU,FROM HIS ROYAL MAJESTY (HRM) CROWN RULER OF EL...,Fraud
3,from Mrs.Johnson,Goodday Dear\n\n\nI know this mail will come t...,Fraud
4,Co-Operation,FROM MR. GODWIN AKWESI\nTEL: +233 208216645\nF...,Fraud


In [8]:
# Look at the split for the content type
fraud_2_df['Type'].value_counts(normalize=True)

Fraud               0.251572
Phishing            0.251572
Commercial Spam     0.251572
False Positives     0.245283
Name: Type, dtype: float64

This data set contains only 159 documents but appears to contain a decent diversity of both phishing and fraud emails. The presence of commercial spam emails is not ideal, since this may create an additional challenge for the classifier. As such, this data set may have to undergo additional cleaning but is likely worth including.

### Academic Torrents Emails

The next data set is composed of content extracted from phishing email files retrieved from https://academictorrents.com/details/a77cda9a9d89a60dbdfbe581adf6e2df9197995a. The "from" address, subject and content were extracted from all of the emails in each folder and compiled into one CSV per folder using the `email_extractor.py` script created for this purpose. These files need to be compiled into a single CSV file.

In [9]:
# Create array to append data frames with contents to
phishing_extract_array = []
# Iterate over phishing extract files 1 through 4 (the end of range is exclusive,
# so range(1, 5) stops at 4; use range(1, 6) if a fifth extract file exists)
for i in range(1, 5):
    # Read the data from each phishing extract csv and put it into a data frame
    phishing_extract = pd.read_csv(f'./data/phishing_extract_{i}.csv')
    # Append the data frame to the array of data frames
    phishing_extract_array.append(phishing_extract)

# Create a data frame to store the combined data frames
combined_phishing_extract_df = pd.concat(phishing_extract_array) 
# Drop duplicates in phishing extract data frame
combined_phishing_extract_df.drop_duplicates(inplace=True)
# Save phishing extract data frame to csv
combined_phishing_extract_df.to_csv('./data/phishing_eml_extract_full.csv', index=False)
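
The hard-coded range in the cell above has to be kept in sync with the number of extract files by hand. As an alternative, the files could be discovered by pattern. The following is a minimal sketch, not the approach used in this notebook; the helper name `combine_csv_files` is hypothetical, and the file naming is assumed from the paths used above.

```python
import glob

import pandas as pd


def combine_csv_files(pattern):
    """Read every CSV matching `pattern` into one de-duplicated data frame.

    Hypothetical helper, not part of the original notebook.
    """
    # Sort the matches so the row order is reproducible between runs
    paths = sorted(glob.glob(pattern))
    if not paths:
        raise FileNotFoundError(f'No files match {pattern!r}')
    frames = [pd.read_csv(path) for path in paths]
    # ignore_index avoids repeated index labels from the separate files
    combined = pd.concat(frames, ignore_index=True)
    return combined.drop_duplicates(ignore_index=True)
```

Under these assumptions, `combine_csv_files('./data/phishing_extract_*.csv')` would pick up every numbered extract file without naming a count.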

In [10]:
# Read all phishing data from the saved csv into a data frame
phishing_extract_df = pd.read_csv('./data/phishing_eml_extract_full.csv')
# Look at data frame shape
print(phishing_extract_df.shape)
# Display data frame content
phishing_extract_df.sample(10)

(2416, 3)


Unnamed: 0,from,subject,content
2028,"""eBay support"" <service@ebay.com>",Notification of Limited Account Access -- eBa...,home pay register sign in/out services site ma...
2330,"""NCUA Departamnet"" <ncua-065-617-349@ncua.gov>","Sincerely, NCUA Account Review Department !","Dear Federal Credit Union account holder, This..."
1553,"""eBay Inc."" <member@eBay.com>",IMPORTANT: Alert from eBay,Message from eBay Member eBay My Messages -- 1...
1000,"""Paypal@"" <passwords@paypal.com>",Additional email address added to your PayPal ...,Youve added an additional email address to you...
1883,"""National City"" <cservice.refp56839606tl.cm@na...",Secure Confirmation!,Dear National City business client: The Nation...
2054,"""PayPal Security Center"" <service@paypal.com>",Notification of Limited Account Access,"Dear Sir,PayPal is committed to maintaining a ..."
597,"""Alaska USA"" <service@alaskausa.org>",Alaska USA FCU Verify Your Account,"AlaskaUSA Dear Valued Customer, Our new securi..."
339,"""PayPal Billing And Security Center"" <Billing@...",Update Your PayPal Account Information,PayPal - Log In Sign Up Log Out Help Protect Y...
2237,"""service-paypal "" <service@intl.paypal.com>",PayPal Anti Fraud Service,PayPal September 2006 Dear user of PayPal serv...
799,"""Chase Trust And Safety Department"" <account@c...",Password Change Required,Untitled Document Password change required Dea...


The extracted data comprises 2,416 documents, each with a "from" address, a subject and the email content, and consists entirely of phishing emails. This will prove to be a useful sample for the complete data set.
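
The `email_extractor.py` script itself is not shown in this notebook. As a rough illustration of the extraction step it performs, the sketch below pulls the sender, subject and text body out of a single raw message using Python's standard `email` package. The function name `extract_email_fields` and the sample message are assumptions made for this example, and the real script may work differently.

```python
from email import message_from_bytes, policy


def extract_email_fields(raw_bytes):
    """Return the sender, subject and text body of one raw email message.

    Hypothetical sketch; not the code of email_extractor.py.
    """
    # policy.default yields EmailMessage objects with decoded headers
    message = message_from_bytes(raw_bytes, policy=policy.default)
    # Prefer a plain-text part and fall back to HTML when none exists
    body_part = message.get_body(preferencelist=('plain', 'html'))
    content = body_part.get_content() if body_part is not None else ''
    return {
        'from': str(message.get('From', '')),
        'subject': str(message.get('Subject', '')),
        'content': content.strip(),
    }


# Made-up message used only to exercise the function
raw = (
    b'From: "Example Sender" <sender@example.com>\r\n'
    b'Subject: Test message\r\n'
    b'Content-Type: text/plain; charset="utf-8"\r\n'
    b'\r\n'
    b'Hello there.\r\n'
)
fields = extract_email_fields(raw)
print(fields)
```

A list of such dictionaries can be passed directly to `pd.DataFrame` to obtain the `from`, `subject` and `content` columns seen above.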

### Apache "Spam Assassin" Corpus Emails

The next data set is composed of content extracted from ham email files retrieved from the Apache "Spam Assassin" corpus located [here](https://spamassassin.apache.org/old/publiccorpus/). Since these emails were in .eml format but lacked the .eml extension, the `eml_rename.py` script was used to give the files the appropriate .eml extension. The "from" address, subject and content were extracted from all of the emails in each folder and compiled into one CSV per folder using the `email_extractor.py` script created for this purpose. These files need to be compiled into a single CSV file.

In [11]:
# Create array to store extracted data frames to
ham_extract_array = []
# Iterate over the number of csvs the data was separately stored in
for i in range(1, 4):
    # Read the data from each csv into a data frame
    ham_extract = pd.read_csv(f'./data/ham_extract_{i}.csv')
    # Append the data frame to the array of data frames
    ham_extract_array.append(ham_extract)

# Combine all the data frames into a singular one
combined_ham_extract_df = pd.concat(ham_extract_array)
# Remove all duplicate rows from the combined data frame
combined_ham_extract_df.drop_duplicates(inplace=True)
# Save the combined data frame to a csv file
combined_ham_extract_df.to_csv('./data/ham_eml_extract_full.csv', index=False)

In [12]:
# Load the combined data from a csv into a data frame
combined_ham_extract_df = pd.read_csv('./data/ham_eml_extract_full.csv')
# Display the shape and content of the data frame
print(combined_ham_extract_df.shape)
combined_ham_extract_df.sample(10)

(3303, 3)


Unnamed: 0,from,subject,content
2236,Julian Missig <julian@jabber.org>,Re: Limbo beta 2 ?,"I realize this is an old thread, but I just ha..."
291,Peter Peltonen <peter.peltonen@iki.fi>,Re: New testing packages,Content-Disposition: inline To view this newsl...
1165,"""Jim Whitehead"" <ejw@cse.ucsc.edu>",RE: Goodbye Global Warming,I am delurking to comment on the Salon article...
1737,"""timothy_hodkinson"" <mephistopheles29@hotmail....",[zzzzteana] Re: Archer-UK TV Alert,I installed Spamassassin 2.41 with Razor V2 th...
3261,"""Ayn Rand Institute Media"" <davidh@aynrand.org>",GOVERNMENT REGULATION IS KILLING THE STOCK MARKET,PRESS RELEASE FROM THE AYN RAND INSTITUTE 2121...
1896,"""Joseph S. Barrera III"" <joe@barrera.org>",Re: SimPastry,Jim Whitehead wrote: http://www.research.micro...
2284,kevin lyda <kevin+dated+1028163438.f677b3@linu...,Re: [ILUG] Optimizing for Pentium Pt.2,"On Fri, Jul 26, 2002 at 11:24:30PM 0100, John ..."
494,pudge@perl.org,[use Perl] Stories for 2002-09-03,Content-Disposition: inline To view this newsl...
2916,"""Jim Whitehead"" <ejw@cse.ucsc.edu>",RE: USA USA WE ARE NUMBER ....six.,"chris arkenberg wrote: Cheers, Adam. Perhaps p..."
63,"""John Hall"" <johnhall@evergo.net>","RE: Our friends the Palestinians, Our servants...",I actually thought of this kind of active chat...


The extracted data comprises 3,303 documents, each with a "from" address, a subject and the email content, and consists entirely of non-fraudulent emails. This will prove to be a useful source of non-fraudulent emails for the combined data set.

## Data Frame Cleaning

Now that the data sets have been acquired and cursorily examined, the next step is to format them in the same manner and combine them into a single data set for further cleaning and analysis. Since the data is largely composed of documents that have only the content and a flag indicating fraud (77% of all data), the final data set should contain only the text content and a fraud flag. Given this, all the data sets will be transformed to a standard format with a `content` column holding the email text and a `fraud` column holding a 1 or 0 flag indicating whether the email is fraudulent.
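
The target format described above can be illustrated with a small helper. This is only a sketch under stated assumptions: the function `to_standard_format` is hypothetical and is not used by the cells in this notebook, which build each clean data frame step by step, and the two example rows are made up.

```python
import pandas as pd


def to_standard_format(df, text_column, fraud_mask):
    """Return a two-column data frame with `content` and a 0/1 `fraud` flag.

    Hypothetical helper; `fraud_mask` is a boolean Series aligned with `df`.
    """
    return pd.DataFrame({
        'content': df[text_column],
        'fraud': fraud_mask.astype(int),
    })


# Made-up rows shaped like the first Kaggle data set
example = pd.DataFrame({
    'Text': ['Claim your prize now', 'See you at lunch'],
    'Class': [1, 0],
})
standard = to_standard_format(example, 'Text', example['Class'] == 1)
print(standard)
```

Expressing the fraud condition as a boolean mask keeps the helper independent of how each source encodes its labels.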

For the first data set, the columns will only have to be renamed to prepare for combining the data.

In [13]:
# Initialize clean data frame for first data set
fraud_clean_df = pd.DataFrame()

In [14]:
# Set 'content' column for clean data frame
fraud_clean_df['content'] = fraud_df['Text']
# Set 'fraud' column for clean data frame
fraud_clean_df['fraud'] = np.where(
    fraud_df['Class'] == 1,
    1,
    0
)

In [15]:
# Examine the contents of the clean data frame
fraud_clean_df.head()

Unnamed: 0,content,fraud
0,Supply Quality China's EXCLUSIVE dimensions at...,1
1,over. SidLet me know. Thx.,0
2,"Dear Friend,Greetings to you.I wish to accost ...",1
3,MR. CHEUNG PUIHANG SENG BANK LTD.DES VOEUX RD....,1
4,Not a surprising assessment from Embassy.,0


In the second data set there is a mix of fraud, phishing, ham and commercial spam email types. The commercial spam will be removed from the data set as this type doesn't strictly fall into either the ham or fraud category and will likely only make identification more difficult for any models trained on this data.

In [16]:
# Examine the data contained in the second fraud data set
fraud_2_df.head()

Unnamed: 0,Subject,Text,Type
0,URGENT BUSINESS ASSISTANCE AND PARTNERSHIP,URGENT BUSINESS ASSISTANCE AND PARTNERSHIP.\n\...,Fraud
1,URGENT ASSISTANCE /RELATIONSHIP (P),"Dear Friend,\n\nI am Mr. Ben Suleman a custom ...",Fraud
2,GOOD DAY TO YOU,FROM HIS ROYAL MAJESTY (HRM) CROWN RULER OF EL...,Fraud
3,from Mrs.Johnson,Goodday Dear\n\n\nI know this mail will come t...,Fraud
4,Co-Operation,FROM MR. GODWIN AKWESI\nTEL: +233 208216645\nF...,Fraud


In [17]:
# Initialize the data frame to store the cleaned data from the second data set in
fraud_2_clean_df = pd.DataFrame()
# Set the content column to the text of the second data set
fraud_2_clean_df['content'] = fraud_2_df['Text']
# Set a 'fraud' column flagging the fraud vs normal emails from the fraud data set
fraud_2_clean_df['fraud'] = np.where(
    (fraud_2_df['Type'] == 'Fraud') | (fraud_2_df['Type'] == 'Phishing'),
    1,
    0
)
# Remove the data identified as 'Commercial Spam' from the data set
fraud_2_clean_df = fraud_2_clean_df[fraud_2_df['Type'] != 'Commercial Spam']

In [18]:
# Examine content of cleaned data
fraud_2_clean_df.head()

Unnamed: 0,content,fraud
0,URGENT BUSINESS ASSISTANCE AND PARTNERSHIP.\n\...,1
1,"Dear Friend,\n\nI am Mr. Ben Suleman a custom ...",1
2,FROM HIS ROYAL MAJESTY (HRM) CROWN RULER OF EL...,1
3,Goodday Dear\n\n\nI know this mail will come t...,1
4,FROM MR. GODWIN AKWESI\nTEL: +233 208216645\nF...,1


All of the data from the phishing email collection can be classified as fraud, and all columns except the email text content can be dropped.

In [19]:
# Initialize new clean data frame
fraud_3_clean_df = pd.DataFrame()
# Assign the 'content' of the new data frame to the text content of the phishing data frame
fraud_3_clean_df['content'] = phishing_extract_df['content']
# Set the fraud column to 1 for this data frame since all entries are fraudulent
fraud_3_clean_df['fraud'] = 1

In [20]:
# Examine the content of the new data frame
fraud_3_clean_df.head()

Unnamed: 0,content,fraud
0,"Dear valued PayPal member, Due to recent fraud...",1
1,Credit Union is constantly working to ensure s...,1
2,Credit Union is constantly working to ensure s...,1
3,"Untitled Document Dear eBay Member, We regret ...",1
4,"Dear valued PayPal member, Due to recent fraud...",1


All of the data from the ham email collection can be classified as not fraud, and all columns except the email text content can be dropped.

In [21]:
# Initialize new clean data frame
fraud_4_clean_df = pd.DataFrame()
# Assign the 'content' of the new data frame to the text content of the ham data frame
fraud_4_clean_df['content'] = combined_ham_extract_df['content']
# Set the fraud column to 0 for this data frame since all entries are ham
fraud_4_clean_df['fraud'] = 0

Now that all the data has been put into data frames with a consistent format and flagged, they can be combined for further cleaning and processing.

In [22]:
# Assign a new data frame to the combined contents of all the cleaned data frames
fraud_all_df = pd.concat([
    fraud_clean_df,
    fraud_2_clean_df,
    fraud_3_clean_df,
    fraud_4_clean_df
], ignore_index=True)

In [23]:
# Remove all rows containing NA values from the new data frame
fraud_all_df.dropna(inplace=True)

In [24]:
# Look at the content of the combined data frame
fraud_all_df.sample(10)

Unnamed: 0,content,fraud
14098,Untitled Document This email confirms that you...,1
10390,Fyi,0
16240,I installed Spamassassin 2.41 with Razor V2 th...,0
12492,Note: This is a service message with informati...,1
8772,Sorry for the delay. Abt 15 minutes away. We h...,0
3305,PLEASE IGNORE THIS MAIL IF YOU ARE NOT INTERES...,1
7096,"Hello, I have got your contact in the cause of...",1
6484,DR.SOLOMON AZEEZ FEDERAL MINISTRY OF PETROLUEM...,1
12464,"Dear username, If you have ever needed a Visa ...",1
7341,Fyi,0


Now that the data frames have been combined, several functions are needed to perform additional cleaning and formatting. These functions were created and saved in the `TextTransformers.py` module. Inspection of the data frame contents made it clear that the text content of some emails contains HTML. The text needs to be extracted from the HTML in these emails, and a function named `extract_HTML_text` was defined in the `TextTransformers.py` module for this purpose.

In [25]:
# Find the rows that contain HTML. A telltale sign of HTML in a string is
# the character sequence '</', which appears in every HTML closing tag
rows_with_html = fraud_all_df['content'].str.contains('</', regex=False, na=False)

In [26]:
# Replace the content of the rows containing HTML with the text
# extracted from that HTML
fraud_all_df.loc[rows_with_html, 'content'] = fraud_all_df.loc[rows_with_html, 'content'].apply(extract_HTML_text)
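
The source of `extract_HTML_text` lives in `TextTransformers.py` and is not reproduced here. To give a sense of what such a function has to do, the following is a minimal sketch built on the standard library's `html.parser`. The names `_TextCollector` and `html_to_text` are hypothetical stand-ins, and the real implementation may rely on a different parser.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collect text nodes while skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ('script', 'style'):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ('script', 'style') and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Only keep text that is not inside a script or style element
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Collapse runs of whitespace left behind by the removed markup
        return ' '.join(' '.join(self._chunks).split())


def html_to_text(html):
    """Hypothetical stand-in for `extract_HTML_text`."""
    collector = _TextCollector()
    collector.feed(html)
    collector.close()
    return collector.text()


print(html_to_text('<p>Dear user,</p><script>x()</script><p>Verify now</p>'))
```

Skipping script and style contents matters for phishing emails in particular, since otherwise embedded code would be counted as message text.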

Since numerical values and links will be removed by the English word filter applied later, and may carry useful information for identifying fraudulent emails, they should first be recorded in separate columns.

In [27]:
# Gets a count of all unsecured links in the text content (Note that unsecured links start with the 'http://' string)
fraud_all_df['unsecure_link_count'] = fraud_all_df['content'].str.count('http://')

In [28]:
# Gets a count of all secure links in the text content (Note that secure links start with the 'https://' string)
fraud_all_df['secure_link_count'] = fraud_all_df['content'].str.count('https://')
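
The two counts above do not overlap, because the string 'https://' does not contain 'http://' (the 's' sits between 'http' and '://'). They are, however, case-sensitive. The sketch below shows a case-insensitive variant on a made-up sample string; `count_links` is a hypothetical helper and is not part of the notebook.

```python
import re


def count_links(text):
    """Count 'http://' and 'https://' occurrences, ignoring letter case."""
    unsecure = len(re.findall(r'http://', text, flags=re.IGNORECASE))
    secure = len(re.findall(r'https://', text, flags=re.IGNORECASE))
    return unsecure, secure


# Made-up sample with one upper-case unsecured link and one secure link
sample = 'Log in at HTTP://example.com or https://example.com/login'
print(count_links(sample))
```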

Since numerical values may appear within the content of these emails but are unlikely to take consistent values from one email to the next, the number of numerical values that appear within each email should be recorded instead. The cell below does this for the content of all email text data.

In [29]:
# Finds all runs of digits in the text content, counts them
# and stores the count in a new column 'numbers_count'
fraud_all_df['numbers_count'] = fraud_all_df['content'].apply(lambda text: len(re.findall(r'\d+', text)))
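
One property of the `\d+` pattern used above is that it counts each run of digits separately, so a formatted amount such as 1,000,000.00 contributes four matches. The sketch below contrasts this with a pattern that treats digit groups joined by commas or periods as one number. Both function names are hypothetical and the alternative pattern is only one possible choice.

```python
import re


def count_digit_runs(text):
    """Count runs of digits, as the notebook's pattern does."""
    return len(re.findall(r'\d+', text))


def count_formatted_numbers(text):
    """Count numbers, treating groups joined by ',' or '.' as one number."""
    return len(re.findall(r'\d+(?:[.,]\d+)*', text))


for text in ['Send 500 dollars', 'USD 1,000,000.00', 'Call +233 208216645']:
    print(text, count_digit_runs(text), count_formatted_numbers(text))
```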

In addition to the emails with text content, there were also emails with non-alphanumeric characters and some emails which don't appear to contain any English words. As such, a function was created to remove any unusual characters and non-English words. This custom function was created in the `TextTransformers.py` module under the name `extract_text_content` and will be used to transform the text content of the data.

In [30]:
# Applies the English word extraction function to the content of each row and stores the result back in the content column
fraud_all_df['content'] = fraud_all_df['content'].apply(extract_text_content)
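
The implementation of `extract_text_content` is not shown in this notebook. The sketch below illustrates the general idea of keeping only alphabetic tokens found in a word list. The tiny `VOCABULARY` set and the function `keep_known_words` are assumptions made for this example; the real function presumably draws on a full English word list.

```python
import re

# Tiny stand-in vocabulary for illustration only (an assumption)
VOCABULARY = {'dear', 'friend', 'please', 'send', 'your', 'bank', 'details', 'to'}


def keep_known_words(text, vocabulary=VOCABULARY):
    """Keep only alphabetic tokens whose lowercase form is in `vocabulary`."""
    # Splitting on non-letters discards digits, punctuation and symbols
    tokens = re.findall(r'[A-Za-z]+', text)
    return ' '.join(token for token in tokens if token.lower() in vocabulary)


print(keep_known_words('Dear Friend!!! Please send your bank details to acct#4471 xqzt'))
```

A filter of this kind returns an empty string for text with no recognized words, which is why empty rows need to be removed afterwards.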

In [31]:
# Applies the get word count function to the content and assigns the result to a new column
fraud_all_df['word_count'] = fraud_all_df['content'].apply(get_word_count)

Now that any non-English words have been removed and the content has been filtered down to English words, any rows with empty, null or duplicate values can be removed.

In [32]:
# Removes all rows with empty text content
fraud_all_df = fraud_all_df[fraud_all_df['content'] != '']

In [33]:
# Check that no null values remain in any column
fraud_all_df.isna().sum()

content                0
fraud                  0
unsecure_link_count    0
secure_link_count      0
numbers_count          0
word_count             0
dtype: int64

In [34]:
# Checks for duplicates in the data
fraud_all_df.duplicated().sum()

4067

In [35]:
# Removes all duplicates in the data
fraud_all_df.drop_duplicates(inplace=True, ignore_index=True)

In [36]:
# Examines the final cleaned data frame
fraud_all_df.sample(10)

Unnamed: 0,content,fraud,unsecure_link_count,secure_link_count,numbers_count,word_count
10416,is to a safe environment for its community of ...,1,0,1,2,250
2787,Dear Friend How are you today I know this mail...,1,0,0,38,245
6804,Sure Will send as soon as I can,0,0,0,0,8
4921,Bank A well reputable financial institution wi...,1,0,0,3,144
10137,Untitled Document This message graphics If you...,1,0,1,0,66
2881,IRREVOCABLE RELEASE OF YOUR have actually been...,1,0,0,5,215
7543,After nearly two of legal Senator was finally ...,0,0,0,0,36
3038,May message,0,0,0,4,2
6924,Zenith House The CONFIDENTIALLY This and any w...,1,0,0,2,129
9568,was confirmed this afternoon Assistant Secreta...,0,0,0,5,14


Now that the data has been fully combined and cleaned using automation, it can be saved to a csv file for further manual evaluation and use for modeling.

In [37]:
# Saves the resulting combined and cleaned data frame to a CSV
fraud_all_df.to_csv('./data/fraud_all_data_clean_4.csv', index=False)

The data will now undergo additional manual inspection and cleaning to ensure the programmatically cleaned data contains meaningful and useful results. During manual inspection two problems were noted. First, the word "China", which indicates a place, appears in a way that may unfairly bias results. Second, the word "yahoo" appears frequently among fraud emails, which may also lead to misleading accuracy scores. As such, both words will be programmatically removed and the data set saved in its final form.

In [4]:
# Load the data into a pandas data frame
fraud_data_man_clean = pd.read_csv('./data/fraud_all_data_clean_4.csv')

In [5]:
# Remove the use of the word 'China'
fraud_data_man_clean['content'] = fraud_data_man_clean['content'].str.replace('China', '', case=False)
# Remove the use of the word 'yahoo'
fraud_data_man_clean['content'] = fraud_data_man_clean['content'].str.replace('yahoo', '', case=False)
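
Note that plain substring replacement, as used in the cell above, also removes matches inside longer words (for example, "Chinatown" becomes "town") and leaves extra spaces where a word was removed. If that is not desired, a word-boundary pattern is one alternative. The sketch below is an illustration only; `remove_words` is a hypothetical helper and is not what produced the final data set.

```python
import re


def remove_words(text, words):
    """Remove whole-word, case-insensitive matches of `words` from `text`."""
    pattern = r'\b(?:' + '|'.join(re.escape(word) for word in words) + r')\b'
    stripped = re.sub(pattern, '', text, flags=re.IGNORECASE)
    # Collapse the extra spaces left where a word was removed
    return ' '.join(stripped.split())


print(remove_words('Visit China or Chinatown via Yahoo mail', ['China', 'yahoo']))
```

Such a function could be applied to the `content` column with `.apply(lambda text: remove_words(text, ['China', 'yahoo']))`.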

In [6]:
# Save the finally cleaned data set to a new file
fraud_data_man_clean.to_csv('./data/fraud_all_data_clean_final.csv', index=False)

## Conclusions

Despite the limited availability of email data, the combined and cleaned data produced from this processing should prove sufficient for training a model. Although the model produced from this data may not be suitable for real-world use, it should be sufficient for the purposes of a proof of concept.