# YZV311E Data Mining Course Project

## Data Merging

### Links to the datasets:

https://www.kaggle.com/datasets/venky73/spam-mails-dataset

https://www.kaggle.com/datasets/shantanudhakadd/email-spam-detection-dataset-classification

https://www.kaggle.com/datasets/nitishabharathi/email-spam-dataset

https://www.kaggle.com/datasets/wcukierski/enron-email-dataset

https://www.kaggle.com/datasets/subhajournal/phishingemails

https://www.kaggle.com/datasets/jackksoncsie/spam-email-dataset

In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

PATH = "datasets/"

In [5]:
data1 = pd.read_csv(PATH + "spam.csv", encoding='latin-1')

In [7]:
data1.count()

v1            5572
v2            5572
Unnamed: 2      50
Unnamed: 3      12
Unnamed: 4       6
dtype: int64

In [11]:
# Check whether Unnamed: 2, 3 and 4 hold any values in rows labelled as spam
data1[data1['v1'] == 'spam'].count()

v1            747
v2            747
Unnamed: 2      5
Unnamed: 3      2
Unnamed: 4      0
dtype: int64

The columns Unnamed: 2, 3 and 4 appear to hold replies to the mails people responded to. They are almost empty (50, 12 and 6 non-null values out of 5572 rows), and only 5, 2 and 0 of those values fall in spam rows, so they carry very little spam data. We treat them as outliers and remove these columns.
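The reasoning above can also be expressed as a fill-ratio check. The following is only a sketch on a made-up toy frame: the `MIN_FILL` threshold and the `toy` data are assumptions for illustration and are not part of the original notebook.

```python
import pandas as pd

# Toy stand-in for the SMS data; the values are invented for illustration.
toy = pd.DataFrame({
    "v1": ["ham", "spam", "ham", "ham"],
    "v2": ["hi", "win cash", "ok", "see you"],
    "Unnamed: 2": [None, "call now", None, None],
    "Unnamed: 3": [None, None, None, None],
})

# Fraction of non-null values in each column.
fill_ratio = toy.notna().mean()

# Hypothetical threshold: keep only columns that are at least half populated.
MIN_FILL = 0.5
sparse_cols = fill_ratio[fill_ratio < MIN_FILL].index.tolist()

# Drop the sparse columns without mutating the original frame.
cleaned = toy.drop(columns=sparse_cols)
print(sparse_cols)
print(cleaned.columns.tolist())
```

Returning a new frame instead of using `inplace=True` keeps the raw data available in case the dropped columns need to be inspected again.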

In [12]:
data1.drop(["Unnamed: 2", "Unnamed: 3", "Unnamed: 4"], axis=1, inplace=True)
# After this operation we have 2 clean columns; the dropped columns also included text fragments such as .;-):-D"

In [13]:
data1.head()

|   | v1 | v2 |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


In [14]:
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
data1['v1'] = le.fit_transform(data1['v1'])

data1.head()

|   | v1 | v2 |
|---|---|---|
| 0 | 0 | Go until jurong point, crazy.. Available only ... |
| 1 | 0 | Ok lar... Joking wif u oni... |
| 2 | 1 | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | 0 | U dun say so early hor... U c already then say... |
| 4 | 0 | Nah I don't think he goes to usf, he lives aro... |


### 1 is spam, 0 is not spam
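`LabelEncoder` assigns codes in sorted order of the class labels, which is why `ham` becomes 0 and `spam` becomes 1 here. As an alternative sketch (not what the notebook does), an explicit mapping makes the encoding visible in the code and fails loudly on unexpected labels. The `LABEL_MAP` name and the toy `labels` series are assumptions for illustration.

```python
import pandas as pd

# Toy labels standing in for data1['v1']; values are invented for illustration.
labels = pd.Series(["ham", "spam", "ham", "spam", "ham"])

# Hypothetical explicit mapping, matching the 1 = spam, 0 = not spam convention.
LABEL_MAP = {"ham": 0, "spam": 1}

encoded = labels.map(LABEL_MAP)

# Any label missing from LABEL_MAP becomes NaN, so check before casting.
if encoded.isna().any():
    unknown = labels[encoded.isna()].unique().tolist()
    raise ValueError(f"Unexpected labels: {unknown}")

encoded = encoded.astype(int)
print(encoded.tolist())
```

An explicit mapping is also convenient when merging several datasets, since each source may spell its labels differently.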

In [15]:
data2 = pd.read_csv(PATH + "spam_ham_dataset.csv", encoding='latin-1')

In [16]:
data2.head()

|   | Unnamed: 0 | label | text | label_num |
|---|---|---|---|---|
| 0 | 605 | ham | Subject: enron methanol ; meter # : 988291\r\n... | 0 |
| 1 | 2349 | ham | Subject: hpl nom for january 9 , 2001\r\n( see... | 0 |
| 2 | 3624 | ham | Subject: neon retreat\r\nho ho ho , we ' re ar... | 0 |
| 3 | 4685 | spam | Subject: photoshop , windows , office . cheap ... | 1 |
| 4 | 2030 | ham | Subject: re : indian springs\r\nthis deal is t... | 0 |


In [18]:
data2.drop(["Unnamed: 0", "label"], axis=1, inplace=True)

In [19]:
data2["text"]

0       Subject: enron methanol ; meter # : 988291\r\n...
1       Subject: hpl nom for january 9 , 2001\r\n( see...
2       Subject: neon retreat\r\nho ho ho , we ' re ar...
3       Subject: photoshop , windows , office . cheap ...
4       Subject: re : indian springs\r\nthis deal is t...
                              ...                        
5166    Subject: put the 10 on the ft\r\nthe transport...
5167    Subject: 3 / 4 / 2000 and following noms\r\nhp...
5168    Subject: calpine daily gas nomination\r\n>\r\n...
5169    Subject: industrial worksheets for august 2000...
5170    Subject: important online banking alert\r\ndea...
Name: text, Length: 5171, dtype: object

While merging, we will create a subject column and a body column. Not every dataset includes subjects, but separating them into their own column will help us analyse the data better.
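One possible way to build these two columns is sketched below. It assumes that the subject is the first line of `text`, that lines are separated by `\r\n` (as the preview above suggests), and that the first line starts with `Subject:`. The toy rows and the new column names `subject` and `body` are illustrative assumptions.

```python
import pandas as pd

# Toy rows imitating the layout of data2["text"]; contents are invented.
toy = pd.DataFrame({
    "text": [
        "Subject: meeting notes\r\nsee the attached file",
        "Subject: cheap software\r\nbuy now\r\nlimited offer",
    ],
    "label_num": [0, 1],
})

# Split once at the first line break: the first part is the subject line,
# the remainder (which may contain further line breaks) is the body.
parts = toy["text"].str.split("\r\n", n=1, expand=True)

# Strip the leading "Subject:" keyword and any whitespace after it.
toy["subject"] = parts[0].str.replace(r"^Subject:\s*", "", regex=True)

# Rows without a line break would have no body, so fill those with "".
toy["body"] = parts[1].fillna("")

print(toy[["subject", "body", "label_num"]])
```

For datasets that have no subject at all, the `subject` column could simply be left empty so that all sources share the same schema after merging.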

In [21]:
# Count the rows whose text contains the keyword "Subject:"
data2[data2["text"].str.contains("Subject:")].count()

text         5171
label_num    5171
dtype: int64

In [22]:
# Row count of the whole dataset, for comparison
data2.count()

text         5171
label_num    5171
dtype: int64

Both counts are 5171, so every row's text contains the keyword "Subject:". This suggests that each text can be split into a subject and a body.
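Note that `str.contains` matches the keyword anywhere in the text. A slightly stricter check, sketched here on invented toy strings, is to test whether the text starts with `Subject:`. Whether the two checks differ on the real data is not something this sketch verifies.

```python
import pandas as pd

# Toy strings; the second one mentions the keyword but does not start with it.
texts = pd.Series([
    "Subject: hello\r\nbody of the mail",
    "fwd: see Subject: line below",
])

# Matches the keyword anywhere in the string (literal match, no regex).
contains_kw = texts.str.contains("Subject:", regex=False)

# Matches only when the string begins with the keyword.
starts_with_kw = texts.str.startswith("Subject:")

print(int(contains_kw.sum()), int(starts_with_kw.sum()))
```

If both counts agree on the full dataset, the first line of every row can be treated as the subject line with more confidence.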