## Project Overview and Structure

This notebook builds the datasets used in this project: a combined dataset, a cleaned dataset, and an imbalanced subset of the cleaned dataset. It also generates a profile report for each of them.

Source: https://www.kaggle.com/datasets/naserabdullahalam/phishing-email-dataset

In [None]:
# The Structure of This Notebook

"""

├── Section 0: Import Library and Environment Set Up
│   ├── Install Package                  <- Install ydata-profiling package.
│   ├── Python Library                   <- Import Python libraries.
│   ├── Random Seed                      <- Set up the value of random seed.
│   └── Check cuda                       <- Check availability of cuda.
│
├── Section 1: Import Dataset
│   ├── Import Dataset 1                 <- Import the 1st dataset from the above url.
│   ├── Import Dataset 2                 <- Import the 2nd dataset from the above url.
│   ├── Import Dataset 3                 <- Import the 3rd dataset from the above url.
│   ├── Import Dataset 4                 <- Import the 4th dataset from the above url.
│   ├── Import Dataset 5                 <- Import the 5th dataset from the above url.
│   └── Import Dataset 6                 <- Import the 6th dataset from the above url.
│
├── Section 2: Create Total Dataset (Combined Dataset) and Explore this Dataset
│   ├── Create Total Dataset             <- Combine the above 6 datasets to get the total dataset.
│   ├── Profile Report                   <- Create the profile report with the ProfileReport class from the ydata-profiling package.
│   └── Save Dataset                     <- Save the dataset (Total Dataset) for future use.
│
├── Section 3: Create Cleaned Dataset (Balanced Dataset) and Explore this Dataset
│   ├── Clean Dataset                    <- Remove rows with missing values and duplicate rows.
│   ├── Profile Report                   <- Create the profile report with the ProfileReport class from the ydata-profiling package.
│   └── Save Dataset                     <- Save the dataset (Cleaned Dataset (Balanced)) for future use.
│
└── Section 4: Create Cleaned Dataset (Imbalanced Dataset) and Explore this Dataset
    ├── Create Imbalanced Dataset        <- Change the ratio of legitimate to phishing emails in the dataset.
    ├── Profile Report                   <- Create the profile report with the ProfileReport class from the ydata-profiling package.
    └── Save Dataset                     <- Save the dataset (Cleaned Dataset (Imbalanced)) for future use.


"""

## Section 0: Import Library and Environment Set Up

In [None]:
# Install Package
!pip install ydata-profiling



In [None]:
# Python Library
import pandas as pd
import torch

# Other Library
from ydata_profiling import ProfileReport

In [None]:
# Random Seed
seed_0 = 6
seed_1 = 8
seed_2 = 66
seed_3 = 88
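
The seeds above only take effect where they are passed explicitly, for example as `random_state` in `DataFrame.sample` later in this notebook. The following is a minimal sketch of that behaviour; the `toy` frame is made-up illustrative data, not part of the project.

```python
import pandas as pd

# Toy frame standing in for an email dataset (illustrative data only).
toy = pd.DataFrame({
    "subject": [f"mail {i}" for i in range(100)],
    "label": [i % 2 for i in range(100)],
})

seed_0 = 6  # same value as in the Random Seed cell

# Sampling twice with the same random_state picks the same rows in the same order.
first = toy.sample(n=10, random_state=seed_0)
second = toy.sample(n=10, random_state=seed_0)

same_rows = first.index.tolist() == second.index.tolist()
print(same_rows)
```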

In [None]:
# Check cuda
torch.cuda.is_available()

False

## Section 1: Import Dataset

In [None]:
# Import Dataset 1

# data path
data_path_1 = '/content/CEAS_08.csv'

# read a csv file
dataset_1 = pd.read_csv(data_path_1)

# Final Dataset Part 1
new_dataset_1 = dataset_1[['subject', 'body', 'label']]

# display
new_dataset_1

Unnamed: 0,subject,body,label
0,Never agree to be a loser,"Buck up, your troubles caused by small dimensi...",1
1,Befriend Jenna Jameson,\nUpgrade your sex and pleasures with these te...,1
2,CNN.com Daily Top 10,>+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+...,1
3,Re: svn commit: r619753 - in /spamassassin/tru...,Would anyone object to removing .so from this ...,0
4,SpecialPricesPharmMoreinfo,\nWelcomeFastShippingCustomerSupport\nhttp://7...,1
...,...,...,...
39149,CNN Alerts: My Custom Alert,\n\nCNN Alerts: My Custom Alert\n\n\n\n\n\n\n ...,1
39150,CNN Alerts: My Custom Alert,\n\nCNN Alerts: My Custom Alert\n\n\n\n\n\n\n ...,1
39151,Slideshow viewer,Hello there ! \nGreat work on the slide show v...,0
39152,Note on 2-digit years,"\nMail from sender , coming from intuit.com\ns...",0


In [None]:
# Import Dataset 2

# data path
data_path_2 = '/content/Enron.csv'

# read a csv file
dataset_2 = pd.read_csv(data_path_2)

# Final Dataset Part 2
new_dataset_2 = dataset_2[['subject', 'body', 'label']]

# display
new_dataset_2

Unnamed: 0,subject,body,label
0,"hpl nom for may 25 , 2001",( see attached file : hplno 525 . xls )\r\n- h...,0
1,re : nom / actual vols for 24 th,- - - - - - - - - - - - - - - - - - - - - - fo...,0
2,"enron actuals for march 30 - april 1 , 201","estimated actuals\r\nmarch 30 , 2001\r\nno flo...",0
3,"hpl nom for may 30 , 2001",( see attached file : hplno 530 . xls )\r\n- h...,0
4,"hpl nom for june 1 , 2001",( see attached file : hplno 601 . xls )\r\n- h...,0
...,...,...,...
29762,confidence is back,"hello ,\r\nmy boyfriend began having problems ...",1
29763,important information,love - potion for your darling is all you want...,1
29764,vys - make itnger,you have feelings of guilt and embarrassment ...,1
29765,the best thing come in large parcels,spur - m formula\r\nincrease sperm production ...,1


In [None]:
# Import Dataset 3

# data path
data_path_3 = '/content/Ling.csv'

# read a csv file
dataset_3 = pd.read_csv(data_path_3)

# Final Dataset Part 3
new_dataset_3 = dataset_3[['subject', 'body', 'label']]

# display
new_dataset_3

Unnamed: 0,subject,body,label
0,job posting - apple-iss research center,content - length : 3386 apple-iss research cen...,0
1,,"lang classification grimes , joseph e . and ba...",0
2,query : letter frequencies for text identifica...,i am posting this inquiry for sergei atamas ( ...,0
3,risk,a colleague and i are researching the differin...,0
4,request book information,earlier this morning i was on the phone with a...,0
...,...,...,...
2854,win $ 300usd and a cruise !,"raquel 's casino , inc . is awarding a cruise ...",1
2855,you have been asked to join kiddin,"the list owner of : "" kiddin "" has invited you...",1
2856,anglicization of composers ' names,"judging from the return post , i must have sou...",0
2857,"re : 6 . 797 , comparative method : n - ary co...",gotcha ! there are two separate fallacies in t...,0


In [None]:
# Import Dataset 4

# data path
data_path_4 = '/content/Nazario.csv'

# read a csv file
dataset_4 = pd.read_csv(data_path_4)

# Final Dataset Part 4
new_dataset_4 = dataset_4[['subject', 'body', 'label']]

# display
new_dataset_4

Unnamed: 0,subject,body,label
0,DON'T DELETE THIS MESSAGE -- FOLDER INTERNAL DATA,This text is part of the internal format of yo...,1
1,Verify Your Account,Business with \t\t\t\t\t\t\t\tcPanel & WHM \t...,1
2,Helpdesk Mailbox Alert!!!,Your two incoming mails were placed on pending...,1
3,IT-Service Help Desk,Password will expire in 3 days. Click Here To ...,1
4,Final USAA Reminder - Update Your Account Now,"To ensure delivery to your inbox, please add U...",1
...,...,...,...
1560,Receipt for Your Payment to FTX.,PayPal You sent a payment of $699.99 USD to FT...,1
1561,Rectify Your Password With monkey.org,"monkey.org Hi jose,Pa⁠s⁠sword for⁠ jose@monke...",1
1562,Netflix : We're having some trouble with your ...,"HELLO, Please note that, your monthly paymen...",1
1563,Your MetaMask wallet will be suspended,Verify your MetaMask Wallet Our system has sho...,1


In [None]:
# Import Dataset 5

# data path
data_path_5 = '/content/Nigerian_Fraud.csv'

# read a csv file
dataset_5 = pd.read_csv(data_path_5)

# Final Dataset Part 5
new_dataset_5 = dataset_5[['subject', 'body', 'label']]

# display
new_dataset_5

Unnamed: 0,subject,body,label
0,URGENT BUSINESS ASSISTANCE AND PARTNERSHIP,FROM:MR. JAMES NGOLA.\nCONFIDENTIAL TEL: 233-2...,1
1,URGENT ASSISTANCE /RELATIONSHIP (P),"Dear Friend,\n\nI am Mr. Ben Suleman a custom ...",1
2,GOOD DAY TO YOU,FROM HIS ROYAL MAJESTY (HRM) CROWN RULER OF EL...,1
3,GOOD DAY TO YOU,FROM HIS ROYAL MAJESTY (HRM) CROWN RULER OF EL...,1
4,I Need Your Assistance.,"Dear sir, \n \nIt is with a heart full of hope...",1
...,...,...,...
3327,CONTACT GLOBAL MAX SHIPING COMPANY,"Atten: My Dear ,\n \nI have Paid the fee for y...",1
3328,TREAT AS URGENT.,\nFrom: Mr Ali Sherif. African Development Ban...,1
3329,From Dr Usman Ibrahim / Mr Wahid Yoffe property.,\nFROM DR USMAN IBRAHIM DANKO.AUDITING AND ACC...,1
3330,My Beloved In Christ.,"\nBeloved in the Lord Jesus Christ, PLEASE END...",1


In [None]:
# Import Dataset 6

# data path
data_path_6 = '/content/SpamAssasin.csv'

# read a csv file
dataset_6 = pd.read_csv(data_path_6)

# Final Dataset Part 6
new_dataset_6 = dataset_6[['subject', 'body', 'label']]

# display
new_dataset_6

Unnamed: 0,subject,body,label
0,Re: New Sequences Window,"Date: Wed, 21 Aug 2002 10:54:46 -0500 ...",0
1,[zzzzteana] RE: Alexander,"Martin A posted:\nTassos Papadopoulos, the Gre...",0
2,[zzzzteana] Moscow bomber,Man Threatens Explosion In Moscow \n\nThursday...,0
3,[IRR] Klez: The Virus That Won't Die,Klez: The Virus That Won't Die\n \nAlready the...,0
4,Re: [zzzzteana] Nothing like mama used to make,"> in adding cream to spaghetti carbonara, whi...",0
...,...,...,...
5804,Busy? Home Study Makes Sense!,\n\n \n--- \n![](http://images.pcdi-homestud...,1
5805,Preferred Non-Smoker Rates for Smokers,This is a multi-part message in MIME format. -...,1
5806,"How to get 10,000 FREE hits per day to any web...","Dear Subscriber,\n\nIf I could show you a way ...",1
5807,Cannabis Difference,****Mid-Summer Customer Appreciation SALE!****...,1
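
The six import cells above repeat the same two steps: read a CSV, then keep the `subject`, `body`, and `label` columns. One possible way to factor this out is sketched below. The helper name `load_email_csv`, the `COLUMNS` constant, and the in-memory demo CSV (including its `sender` column) are assumptions made for illustration and are not part of the original notebook.

```python
import io

import pandas as pd

COLUMNS = ["subject", "body", "label"]


def load_email_csv(source):
    """Read one CSV and keep only the shared columns (hypothetical helper)."""
    frame = pd.read_csv(source)
    missing = [name for name in COLUMNS if name not in frame.columns]
    if missing:
        # Fail early with a clear message if a file lacks an expected column.
        raise KeyError(f"missing columns: {missing}")
    return frame[COLUMNS]


# In the notebook, `source` would be a path such as '/content/CEAS_08.csv'.
# An in-memory CSV is used here so the sketch runs without the data files.
demo_csv = io.StringIO(
    "sender,subject,body,label\n"
    "a@example.com,hello,first body,0\n"
    "b@example.com,offer,second body,1\n"
)
demo = load_email_csv(demo_csv)
print(demo.columns.tolist())
```

With the real files, the same helper could be called once per path in a list, which keeps the column selection in one place.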


## Section 2: Create Total Dataset (Combined Dataset) and Explore this Dataset

In [None]:
# Create Total Dataset

# Keep only the shared columns from each source dataset
# (same selection as in Section 1)
columns = ['subject', 'body', 'label']
new_dataset_1 = dataset_1[columns]
new_dataset_2 = dataset_2[columns]
new_dataset_3 = dataset_3[columns]
new_dataset_4 = dataset_4[columns]
new_dataset_5 = dataset_5[columns]
new_dataset_6 = dataset_6[columns]

# Stack the six parts row-wise to create the total dataset
total_dataset = pd.concat(
    [new_dataset_1, new_dataset_2, new_dataset_3,
     new_dataset_4, new_dataset_5, new_dataset_6],
    axis=0,
)

# Display
total_dataset

Unnamed: 0,subject,body,label
0,Never agree to be a loser,"Buck up, your troubles caused by small dimensi...",1
1,Befriend Jenna Jameson,\nUpgrade your sex and pleasures with these te...,1
2,CNN.com Daily Top 10,>+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+...,1
3,Re: svn commit: r619753 - in /spamassassin/tru...,Would anyone object to removing .so from this ...,0
4,SpecialPricesPharmMoreinfo,\nWelcomeFastShippingCustomerSupport\nhttp://7...,1
...,...,...,...
5804,Busy? Home Study Makes Sense!,\n\n \n--- \n![](http://images.pcdi-homestud...,1
5805,Preferred Non-Smoker Rates for Smokers,This is a multi-part message in MIME format. -...,1
5806,"How to get 10,000 FREE hits per day to any web...","Dear Subscriber,\n\nIf I could show you a way ...",1
5807,Cannabis Difference,****Mid-Summer Customer Appreciation SALE!****...,1
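
The display above ends at row label 5807 because `pd.concat` keeps each part's own row labels by default, so labels repeat across the six parts. The saved CSV is not affected, since it is written with `index=False`. A minimal sketch of the difference, using two made-up frames rather than the project data:

```python
import pandas as pd

# Two tiny stand-in parts (illustrative data only).
part_a = pd.DataFrame({"subject": ["a1", "a2"], "body": ["x", "y"], "label": [0, 1]})
part_b = pd.DataFrame({"subject": ["b1", "b2"], "body": ["z", "w"], "label": [1, 0]})

# Default behaviour: each part keeps its original row labels.
kept = pd.concat([part_a, part_b], axis=0)

# ignore_index=True renumbers the rows from 0 to n-1.
renumbered = pd.concat([part_a, part_b], axis=0, ignore_index=True)

print(kept.index.tolist())        # [0, 1, 0, 1]
print(renumbered.index.tolist())  # [0, 1, 2, 3]
```

Unique labels matter mainly when rows are later selected with `.loc`, where a repeated label returns several rows.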


In [None]:
# Profile Report
profile_1 = ProfileReport(total_dataset, title = "Total Dataset")
profile_1.to_notebook_iframe()
profile_1.to_file("/content/total_dataset.html")

In [None]:
# Save Dataset
total_dataset.to_csv('/content/total_dataset.csv', index=False)

## Section 3: Create Cleaned Dataset (Balanced Dataset) and Explore this Dataset

In [None]:
# Clean Dataset

print(f"Rows in the total dataset: {total_dataset.shape[0]}")

# Drop rows with a missing value in any column
cleaned_dataset = total_dataset.dropna()
print(f"Rows after dropping missing values: {cleaned_dataset.shape[0]}")

# Drop exact duplicate rows
cleaned_dataset = cleaned_dataset.drop_duplicates()
print(f"Rows after dropping duplicates: {cleaned_dataset.shape[0]}")

Rows in the total dataset: 82486
Rows after dropping missing values: 82138
Rows after dropping duplicates: 82138
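
The counts above show how many rows each step removes, but not which columns hold the missing values or how the labels are distributed afterwards. A minimal sketch of those checks follows; the `toy` frame is made-up illustrative data, and in the notebook the same calls would be applied to `total_dataset`.

```python
import pandas as pd

# Toy frame with one missing subject and one duplicated row (illustrative data only).
toy = pd.DataFrame({
    "subject": ["hi", None, "offer", "offer"],
    "body": ["text", "text", "buy now", "buy now"],
    "label": [0, 0, 1, 1],
})

# Number of missing values in each column.
missing_per_column = toy.isna().sum()

# Number of exact duplicate rows that remain after dropping missing values.
no_missing = toy.dropna()
duplicate_rows = int(no_missing.duplicated().sum())

# Share of each label in the cleaned data, to check how balanced it is.
label_share = no_missing.drop_duplicates()["label"].value_counts(normalize=True)

print(missing_per_column)
print(duplicate_rows)
print(label_share)
```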


In [None]:
# Profile Report
profile_2 = ProfileReport(cleaned_dataset, title = "Cleaned Dataset (Balanced)")
profile_2.to_notebook_iframe()
profile_2.to_file("/content/cleaned_dataset.html")

In [None]:
# Save Dataset
cleaned_dataset.to_csv('/content/cleaned_dataset.csv', index=False)

## Section 4: Create Cleaned Dataset (Imbalanced Dataset) and Explore this Dataset

In [None]:
# Create Imbalanced Dataset

# Target size and class split: 80% label 0, 20% label 1
total_number = 40000
label_0_number = int(total_number * 0.8)
label_1_number = total_number - label_0_number

print(label_0_number)
print(label_1_number)

# Split the cleaned dataset by label
label_0_dataset = cleaned_dataset[cleaned_dataset['label'] == 0]
label_1_dataset = cleaned_dataset[cleaned_dataset['label'] == 1]

# Sample the target number of rows from each class
label_0_new = label_0_dataset.sample(n=label_0_number, random_state=seed_0)
label_1_new = label_1_dataset.sample(n=label_1_number, random_state=seed_0)

# Combine the two samples, shuffle the rows, and renumber the index
cleaned_imbalance_dataset = pd.concat([label_0_new, label_1_new])
cleaned_imbalance_dataset = cleaned_imbalance_dataset.sample(frac=1, random_state=seed_2).reset_index(drop=True)

# Display
cleaned_imbalance_dataset

32000
8000


Unnamed: 0,subject,body,label
0,Re: [Python-3000] [Python-Dev] Reminder: last ...,"On 01:55 am, hoauf@python.org wrote:\n>On Thu,...",0
1,terminated employees ' benefits,"happy new year all ,\r\ni have been with enron...",0
2,"9/11, war in Iraq threaten Disney parks",URL: http://boingboing.net/#85531557\nDate: No...,0
3,"If you want a decent watch, get a replica appr...",\nQualitative watches at Replica Classics \n\n...,1
4,re : tds project,"you might hear of this , so i thought i better...",0
...,...,...,...
39995,caiso notification - tswg conference call,please call in @ 1330 pacific . call in number...,0
39996,[UAI] Ph.D Positions in Bioinformatics at Joha...,Ph.D Positions in Bioinformatics\n\nTwo PhD po...,0
39997,happy holidays !,with the holiday season and a new year upon us...,0
39998,Re: [opensuse] Screen Capture,"Chris Arnold wrote:\n> On 10.3, how do i get a...",0
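
The sampling steps above can also be wrapped in a function so that other sizes or class ratios can be tried. The sketch below is one possible way to do this and is not the notebook's own code: the function name `sample_by_label` and its parameters are hypothetical, and the `toy` frame is made-up data.

```python
import pandas as pd


def sample_by_label(frame, total, share_label_0, seed_sample, seed_shuffle):
    """Draw `total` rows with a fixed share of label 0 (hypothetical helper)."""
    n_label_0 = int(total * share_label_0)
    n_label_1 = total - n_label_0

    group_0 = frame[frame["label"] == 0]
    group_1 = frame[frame["label"] == 1]

    # Sampling without replacement needs enough rows in each class.
    if len(group_0) < n_label_0 or len(group_1) < n_label_1:
        raise ValueError("not enough rows in one of the classes")

    picked = pd.concat([
        group_0.sample(n=n_label_0, random_state=seed_sample),
        group_1.sample(n=n_label_1, random_state=seed_sample),
    ])
    # Shuffle so the two classes are interleaved, then renumber the rows.
    return picked.sample(frac=1, random_state=seed_shuffle).reset_index(drop=True)


# Toy data: 100 rows per class (illustrative only).
toy = pd.DataFrame({
    "body": [f"mail {i}" for i in range(200)],
    "label": [0] * 100 + [1] * 100,
})

subset = sample_by_label(toy, total=50, share_label_0=0.8, seed_sample=6, seed_shuffle=66)
counts = subset["label"].value_counts()
print(counts)
```

In the notebook, the equivalent call would pass `cleaned_dataset`, `total=40000`, `share_label_0=0.8`, `seed_0`, and `seed_2`.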


In [None]:
# Profile Report
profile_3 = ProfileReport(cleaned_imbalance_dataset, title = "Cleaned Dataset (Imbalanced)")
profile_3.to_notebook_iframe()
profile_3.to_file("/content/cleaned_imbalance_dataset.html")

In [None]:
# Save Dataset
cleaned_imbalance_dataset.to_csv('/content/cleaned_imbalance_dataset.csv', index=False)