# YZV311E Data Mining Course Project

## Data Merging

### Links to the datasets:

https://www.kaggle.com/datasets/venky73/spam-mails-dataset

https://www.kaggle.com/datasets/shantanudhakadd/email-spam-detection-dataset-classification

https://www.kaggle.com/datasets/nitishabharathi/email-spam-dataset

https://www.kaggle.com/datasets/wcukierski/enron-email-dataset

https://www.kaggle.com/datasets/subhajournal/phishingemails

https://www.kaggle.com/datasets/jackksoncsie/spam-email-dataset

In [29]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

PATH = "datasets/"

In [30]:
data1 = pd.read_csv(PATH + "spam.csv", encoding='latin-1')

In [31]:
data1.count()

v1            5572
v2            5572
Unnamed: 2      50
Unnamed: 3      12
Unnamed: 4       6
dtype: int64

In [32]:
# Let's check whether Unnamed: 2, 3 and 4 contain any spam messages
data1[data1['v1'] == 'spam'].count()

v1            747
v2            747
Unnamed: 2      5
Unnamed: 3      2
Unnamed: 4      0
dtype: int64

Unnamed: 2, 3 and 4 hold replies to the messages people responded to. They are unlikely to contain spam: as the counts above show, only 5, 2 and 0 of their values belong to spam rows, so they can be treated as outliers. We will therefore remove these columns.
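
Before dropping columns like these, it can help to quantify how sparse they are and how they are distributed across the classes. The following is a minimal sketch on a made-up frame (`toy` and its rows are hypothetical, not taken from spam.csv); it only illustrates the kind of check described above.

```python
import pandas as pd

# Toy stand-in for spam.csv; these rows are invented for illustration.
toy = pd.DataFrame({
    "v1": ["ham", "spam", "ham", "ham"],
    "v2": ["hello", "win a prize", "ok", "see you"],
    "Unnamed: 2": [None, None, "extra text", None],
    "Unnamed: 3": [None, None, None, None],
})

# Share of non-null values per column; mostly-empty columns stand out.
fill_ratio = toy.notna().mean()

# Non-null counts per class, to see whether the sparse columns carry spam.
per_class = toy.groupby("v1").count()

# Drop every column whose name starts with "Unnamed".
extra_cols = [c for c in toy.columns if c.startswith("Unnamed")]
cleaned = toy.drop(columns=extra_cols)

print(fill_ratio)
print(per_class)
print(cleaned.columns.tolist())
```

Selecting the columns by prefix avoids hard-coding their names, which may be convenient if another file has a different number of overflow columns.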

In [33]:
data1.drop(["Unnamed: 2", "Unnamed: 3", "Unnamed: 4"], axis=1, inplace=True)
# After this operation we have 2 clean columns, since the dropped columns also included text fragments like .;-):-D" etc.

In [34]:
data1.head()

Unnamed: 0,v1,v2
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [35]:
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
data1['v1'] = le.fit_transform(data1['v1'])

data1.head()

Unnamed: 0,v1,v2
0,0,"Go until jurong point, crazy.. Available only ..."
1,0,Ok lar... Joking wif u oni...
2,1,Free entry in 2 a wkly comp to win FA Cup fina...
3,0,U dun say so early hor... U c already then say...
4,0,"Nah I don't think he goes to usf, he lives aro..."


In [36]:
data1.rename(columns={'v1': 'label', 'v2': 'text'}, inplace=True)

In [47]:
data1.head()

Unnamed: 0,label,text
0,0,"Go until jurong point, crazy.. Available only ..."
1,0,Ok lar... Joking wif u oni...
2,1,Free entry in 2 a wkly comp to win FA Cup fina...
3,0,U dun say so early hor... U c already then say...
4,0,"Nah I don't think he goes to usf, he lives aro..."


### 1 is spam, 0 is not spam

In [37]:
data2 = pd.read_csv(PATH + "spam_ham_dataset.csv", encoding='latin-1')

In [38]:
data2.head()

Unnamed: 0.1,Unnamed: 0,label,text,label_num
0,605,ham,Subject: enron methanol ; meter # : 988291\r\n...,0
1,2349,ham,"Subject: hpl nom for january 9 , 2001\r\n( see...",0
2,3624,ham,"Subject: neon retreat\r\nho ho ho , we ' re ar...",0
3,4685,spam,"Subject: photoshop , windows , office . cheap ...",1
4,2030,ham,Subject: re : indian springs\r\nthis deal is t...,0


In [39]:
data2.drop(["Unnamed: 0", "label"], axis=1, inplace=True)

In [40]:
data2["text"]

0       Subject: enron methanol ; meter # : 988291\r\n...
1       Subject: hpl nom for january 9 , 2001\r\n( see...
2       Subject: neon retreat\r\nho ho ho , we ' re ar...
3       Subject: photoshop , windows , office . cheap ...
4       Subject: re : indian springs\r\nthis deal is t...
                              ...                        
5166    Subject: put the 10 on the ft\r\nthe transport...
5167    Subject: 3 / 4 / 2000 and following noms\r\nhp...
5168    Subject: calpine daily gas nomination\r\n>\r\n...
5169    Subject: industrial worksheets for august 2000...
5170    Subject: important online banking alert\r\ndea...
Name: text, Length: 5171, dtype: object

While merging, we will create a subject column and a body column. Not every dataset includes subjects, but separating them into their own column will help us analyse the data better.
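
As a rough sketch of where this is heading, the frames can be brought to a shared `label` / `subject` / `text` layout and stacked, with `subject` left as NaN for datasets that have none. The frame names and rows below are hypothetical stand-ins, and `COLUMNS` is an assumed target schema, not something fixed by the notebook.

```python
import pandas as pd

# Hypothetical miniature frames standing in for the cleaned datasets.
with_subject = pd.DataFrame({
    "label": [0, 1],
    "subject": ["Subject: neon retreat", "Subject: cheap software"],
    "text": ["ho ho ho", "buy now"],
})
without_subject = pd.DataFrame({
    "label": [0],
    "text": ["Ok lar... Joking wif u oni..."],
})

# Assumed target schema for the merged dataset.
COLUMNS = ["label", "subject", "text"]

# concat aligns on column names; columns missing from a frame are
# filled with NaN. Indexing with COLUMNS fixes the column order.
merged = pd.concat([with_subject, without_subject], ignore_index=True)[COLUMNS]

print(merged)
```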

In [41]:
# count the rows whose text contains the keyword "Subject:"
data2[data2["text"].str.contains("Subject:")].count()

text         5171
label_num    5171
dtype: int64

In [42]:
# whole data row count
data2.count()

text         5171
label_num    5171
dtype: int64

The two counts match, so every row's text contains a "Subject:" line followed by a body.
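
Note that `str.contains` matches the keyword anywhere in the text. If the goal is to confirm that each message *begins* with a subject line, `str.startswith` is a stricter check. A small sketch on invented strings:

```python
import pandas as pd

# Made-up examples; the second one only mentions "Subject:" inside the body.
texts = pd.Series([
    "Subject: neon retreat\r\nho ho ho",
    "hello\r\nSubject: not a header",
])

# True wherever the keyword occurs anywhere in the string.
contains_anywhere = texts.str.contains("Subject:", regex=False)

# True only when the string begins with the keyword.
starts_with = texts.str.startswith("Subject:")

print(contains_anywhere.tolist())
print(starts_with.tolist())
```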

In [43]:
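# In the text column, the subject line ends at the first \r\n and the body starts right after it,
# so we split the text column into 2 columns: subject and body.
# Note: splitting on "\n" and taking .str[1] keeps only the first line of the body (with a trailing \r).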

data2["subject"] = data2["text"].str.split("\r").str[0]
data2["body"] = data2["text"].str.split("\n").str[1]

In [44]:
data2.head()

Unnamed: 0,text,label_num,subject,body
0,Subject: enron methanol ; meter # : 988291\r\n...,0,Subject: enron methanol ; meter # : 988291,this is a follow up to the note i gave you on ...
1,"Subject: hpl nom for january 9 , 2001\r\n( see...",0,"Subject: hpl nom for january 9 , 2001",( see attached file : hplnol 09 . xls )\r
2,"Subject: neon retreat\r\nho ho ho , we ' re ar...",0,Subject: neon retreat,"ho ho ho , we ' re around to that most wonderf..."
3,"Subject: photoshop , windows , office . cheap ...",1,"Subject: photoshop , windows , office . cheap ...",abasements darer prudently fortuitous undergone\r
4,Subject: re : indian springs\r\nthis deal is t...,0,Subject: re : indian springs,this deal is to book the teco pvr revenue . it...
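
Rows 1 and 3 above show that the body column holds only the first body line, still ending in `\r`. If the full body is wanted, one option is to split once at the first `\r\n`. This is a sketch on invented rows (`toy` is hypothetical), not a change to the cells above:

```python
import pandas as pd

# Invented rows: one multi-line message and one without any body.
toy = pd.DataFrame({"text": [
    "Subject: hpl nom\r\n( see attached file )\r\nthanks",
    "Subject: no body here",
]})

# n=1 splits only at the first "\r\n"; expand=True returns two columns.
parts = toy["text"].str.split("\r\n", n=1, expand=True)
toy["subject"] = parts[0]
toy["body"] = parts[1]  # missing when the row has no "\r\n"

print(toy[["subject", "body"]])
```

With `n=1`, everything after the subject line stays together in `body`, including later line breaks.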


In [45]:
data2.drop(["text"], axis=1, inplace=True)
data2.rename(columns={'label_num': 'label', "body": "text"}, inplace=True)

In [46]:
data2.head()

Unnamed: 0,label,subject,text
0,0,Subject: enron methanol ; meter # : 988291,this is a follow up to the note i gave you on ...
1,0,"Subject: hpl nom for january 9 , 2001",( see attached file : hplnol 09 . xls )\r
2,0,Subject: neon retreat,"ho ho ho , we ' re around to that most wonderf..."
3,1,"Subject: photoshop , windows , office . cheap ...",abasements darer prudently fortuitous undergone\r
4,0,Subject: re : indian springs,this deal is to book the teco pvr revenue . it...


In [72]:
data3 = pd.read_csv(PATH + "Phishing_Email.csv", encoding='latin-1')

In [73]:
data3.head()

Unnamed: 0.1,Unnamed: 0,Email Text,Email Type
0,0,"re : 6 . 1100 , disc : uniformitarianism , re ...",Safe Email
1,1,the other side of * galicismos * * galicismo *...,Safe Email
2,2,re : equistar deal tickets are you still avail...,Safe Email
3,3,\nHello I am your hot lil horny toy.\n I am...,Phishing Email
4,4,software at incredibly low prices ( 86 % lower...,Phishing Email


In [74]:
data3['label'] = data3['Email Type'].apply(lambda x: ["Safe Email", "Phishing Email"].index(x))
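# Source: https://stackoverflow.com/questions/38749305/labelencoder-order-of-fit-for-a-pandas-df
# LabelEncoder sorts the class names alphabetically, so it encoded "Safe Email" as 1 and "Phishing Email" as 0.
# It offers no way to choose the order of the labels, so we index into an explicitly ordered list instead.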

In [75]:
data3.head()

Unnamed: 0.1,Unnamed: 0,Email Text,Email Type,label
0,0,"re : 6 . 1100 , disc : uniformitarianism , re ...",Safe Email,0
1,1,the other side of * galicismos * * galicismo *...,Safe Email,0
2,2,re : equistar deal tickets are you still avail...,Safe Email,0
3,3,\nHello I am your hot lil horny toy.\n I am...,Phishing Email,1
4,4,software at incredibly low prices ( 86 % lower...,Phishing Email,1
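
An explicit dictionary is another way to pin the encoding to the convention used in this notebook (1 is spam, 0 is not spam). The sketch below uses invented rows and a hypothetical `label_map`; unlike `list.index`, `map` yields NaN for an unexpected class name instead of raising, so the check before casting matters.

```python
import pandas as pd

# Made-up rows using the two class names that appear in data3.
toy = pd.DataFrame({"Email Type": ["Safe Email", "Phishing Email", "Safe Email"]})

# Explicit mapping: 1 = spam/phishing, 0 = not spam.
label_map = {"Safe Email": 0, "Phishing Email": 1}
toy["label"] = toy["Email Type"].map(label_map)

# Any class name missing from the mapping becomes NaN, so check before casting.
assert toy["label"].notna().all()
toy["label"] = toy["label"].astype(int)

print(toy)
```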


In [76]:
data3.drop(columns=["Email Type", "Unnamed: 0"], inplace=True)

In [77]:
data3.rename(columns={'Email Text': 'text'}, inplace=True)

In [78]:
data3.head()

Unnamed: 0,text,label
0,"re : 6 . 1100 , disc : uniformitarianism , re ...",0
1,the other side of * galicismos * * galicismo *...,0
2,re : equistar deal tickets are you still avail...,0
3,\nHello I am your hot lil horny toy.\n I am...,1
4,software at incredibly low prices ( 86 % lower...,1


In [79]:
data4 = pd.read_csv(PATH + "emails.csv", encoding='latin-1')

In [80]:
data4.head()

Unnamed: 0,text,spam
0,Subject: naturally irresistible your corporate...,1
1,Subject: the stock trading gunslinger fanny i...,1
2,Subject: unbelievable new homes made easy im ...,1
3,Subject: 4 color printing special request add...,1
4,"Subject: do not have money , get software cds ...",1


In [81]:
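# For this specific dataset, the subject ends at the first double space ("  ").
# We split the text column on it into a subject column and a text column.
# Note: .str[1] keeps only the segment between the first and the second double space.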
data4["subject"] = data4["text"].str.split("  ").str[0]
data4["text"] = data4["text"].str.split("  ").str[1]

In [82]:
data4.head()

Unnamed: 0,text,spam,subject
0,lt is really hard to recollect a company : the,1,Subject: naturally irresistible your corporate...
1,fanny is merrill but muzo not colza attainder ...,1,Subject: the stock trading gunslinger
2,im wanting to show you this,1,Subject: unbelievable new homes made easy
3,request additional information now ! click here,1,Subject: 4 color printing special
4,software compatibility . . . . ain ' t it great ?,1,"Subject: do not have money , get software cds ..."
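
The text column above is cut off at the second double space. If the rest of the message should be kept, `Series.str.partition` splits at the first occurrence of the separator only. A minimal sketch on one invented row (`toy` is hypothetical):

```python
import pandas as pd

# Invented message with two double spaces: one after the subject, one in the body.
toy = pd.DataFrame({"text": [
    "Subject: new homes made easy  im wanting to show you this  homeowner",
]})

# partition returns three columns: before the separator, the separator, after it.
parts = toy["text"].str.partition("  ")
toy["subject"] = parts[0]
toy["text"] = parts[2]

print(toy.loc[0, "subject"])
print(toy.loc[0, "text"])
```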


In [83]:
data5 = pd.read_csv(PATH + "spam_assassin/completeSpamAssassin.csv", encoding='latin-1')

In [84]:
data5.head()

Unnamed: 0.1,Unnamed: 0,Body,Label
0,0,\nSave up to 70% on Life Insurance.\nWhy Spend...,1
1,1,1) Fight The Risk of Cancer!\nhttp://www.adcli...,1
2,2,1) Fight The Risk of Cancer!\nhttp://www.adcli...,1
3,3,##############################################...,1
4,4,I thought you might like these:\n1) Slim Down ...,1


In [87]:
data5.drop(columns=["Unnamed: 0"], inplace=True)
data5.rename(columns={'Label': 'label', 'Body': 'text'}, inplace=True)

In [88]:
# The first row is a problem: it starts with \n, then comes the subject line, then another \n.
# Splitting on \n and taking the first piece would return an empty string for such rows.
# So when splitting, we check whether the row starts with \n:
# if it does, we take the piece after the first \n;
# if it does not, we take the piece before the first \n.

def split_text(text):
    if text == "\\":
        return text.split("\n")[1]
    else:
        return text.split("\n")[0]

data5["subject"] = data5["text"].apply(lambda x: split_text(x))


AttributeError: 'float' object has no attribute 'split'
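
The error comes from rows whose body is missing: pandas reads them as NaN, which is a float and has no `split` method. Note also that `text == "\\"` compares the whole text to a single backslash, so it does not detect a leading newline. One possible NaN-safe variant is sketched below; `first_line` is a hypothetical helper and the sample strings are invented.

```python
import numpy as np
import pandas as pd

def first_line(text):
    # Missing bodies are read by pandas as float NaN; pass them through.
    if not isinstance(text, str):
        return np.nan
    # Skip leading newlines, then keep everything up to the next newline.
    return text.lstrip("\n").split("\n", 1)[0]

toy = pd.Series(["\nSave up to 70%\nWhy spend more", "1) Fight\nsecond line", np.nan])
subjects = toy.apply(first_line)

print(subjects.tolist())
```

Dropping the missing rows first with `dropna(subset=["text"])` would be an alternative if empty bodies are not needed downstream.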