## Initial Exploratory Data Science

Alex McDonald

In this Python notebook, we explore basic statistics for two datasets (A and B) and describe a third (C) that we plan to build later:

**Dataset A**: Labeled examples for spam or ham (non-spam). [Kaggle Link](https://www.kaggle.com/datasets/meruvulikith/190k-spam-ham-email-dataset-for-classification/data)

**Dataset B**: Labeled examples for phishing and non-phishing URLs. [Kaggle Link](https://www.kaggle.com/datasets/hammadjavaid/phishing-url-dataset-for-nlp-based-classification)

**Dataset C**: A very small dataset that we will create later, either by hand or with an LLM such as ChatGPT, to test our model. It will contain spam and non-spam examples so that we can assess whether our model catches malicious text generated by LLMs, which is the larger problem described in the introduction or abstract.

In [1]:
import pandas as pd
import numpy as np

We first examine **Dataset A** by loading its CSV file with pandas. Unlike Dataset B, this dataset does not come pre-divided into a training set and a test set, so we shuffle it and assign 70% of the examples to the training set and the remaining 30% to the test set. (Dataset B's provided split is 640,000 training and 160,000 test examples, i.e. 80/20.)

In [2]:
A_all = pd.read_csv('./datasets/SpamHam/spam_Emails_data.csv')
A_all.head()

Unnamed: 0,label,text
0,Spam,viiiiiiagraaaa\nonly for the ones that want to...
1,Ham,got ice thought look az original message ice o...
2,Spam,yo ur wom an ne eds an escapenumber in ch ma n...
3,Spam,start increasing your odds of success & live s...
4,Ham,author jra date escapenumber escapenumber esca...


In [3]:
#cleaning
A_all = A_all.rename(columns={"label": "text_label"})
A_all = A_all.dropna()
possible_labels = A_all["text_label"].unique()
A_all["text"] = A_all["text"].str.replace("\n", " ")
print("%s possible labels: %s" % (len(possible_labels), possible_labels))
A_all["label"] = A_all["text_label"].apply(lambda x: 1 if x == 'Spam' else 0) #model is detecting if spam, 1=spam

A_all.head()

2 possible labels: ['Spam' 'Ham']


Unnamed: 0,text_label,text,label
0,Spam,viiiiiiagraaaa only for the ones that want to ...,1
1,Ham,got ice thought look az original message ice o...,0
2,Spam,yo ur wom an ne eds an escapenumber in ch ma n...,1
3,Spam,start increasing your odds of success & live s...,1
4,Ham,author jra date escapenumber escapenumber esca...,0


In [50]:
#basic statistics
total_count = A_all.shape[0]
spam_vals = A_all["label"].values
spam_percent = 100*round(len(spam_vals[spam_vals == 1])/total_count, 4)
print("Entire dataset has %s examples, of which %s%% are spam." % (total_count, spam_percent))

rows_capitals = A_all[A_all["text"].str.contains(r'[A-Z]')]
print("%s examples contain capital letters." % len(rows_capitals))

Entire dataset has 193850 examples, of which 47.3% are spam.
0 examples contain capital letters.
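As a cross-check on the class balance, the same numbers can be read off with `value_counts(normalize=True)`. The snippet below is a minimal sketch on a made-up four-row frame (the `toy` DataFrame and its contents are assumptions for illustration, not rows from Dataset A).

```python
import pandas as pd

# Toy stand-in for A_all; the texts and labels here are invented.
toy = pd.DataFrame({
    "text": ["win money now", "meeting at noon", "cheap pills", "see you Friday"],
    "label": [1, 0, 1, 0],
})

# normalize=True returns the share of each class instead of raw counts.
class_share = toy["label"].value_counts(normalize=True)
spam_percent = round(100 * class_share.get(1, 0.0), 2)
print("%s examples, of which %s%% are spam." % (len(toy), spam_percent))

# Summing the boolean mask counts rows with at least one uppercase ASCII letter.
n_capitals = toy["text"].str.contains(r"[A-Z]").sum()
print("%s examples contain capital letters." % n_capitals)
```

Rounding after multiplying by 100 keeps the printed percentage at two decimals regardless of floating-point noise.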


In [4]:
#split into test and train sets
split_ratio = 0.7
A_shuffled = A_all.sample(frac=1, replace=False, random_state=1234)
train_size = int(split_ratio*A_all.shape[0])
A_train = A_shuffled.iloc[:train_size]
A_test = A_shuffled.iloc[train_size:]

A_train.to_csv("./datasets/SpamHam/train.csv")
A_test.to_csv("./datasets/SpamHam/test.csv")

print("Train size: %s, Test size: %s" % (A_train.shape[0], A_test.shape[0]))

Train size: 135695, Test size: 58155
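A shuffle-and-slice split like the one above can be sanity-checked cheaply: the two parts should share no rows, together cover the whole dataset, and be reproducible from the seed. The following sketch runs those checks on a small invented frame (`toy`, `train`, and `test` are illustrative names, not the notebook's variables).

```python
import pandas as pd

# Toy stand-in for the full dataset; 10 invented rows.
toy = pd.DataFrame({
    "text": ["t%d" % i for i in range(10)],
    "label": [i % 2 for i in range(10)],
})

split_ratio = 0.7
shuffled = toy.sample(frac=1, replace=False, random_state=1234)
train_size = int(split_ratio * len(toy))
train = shuffled.iloc[:train_size]
test = shuffled.iloc[train_size:]

# No row index should appear in both parts.
overlap = train.index.intersection(test.index)
print("Overlapping rows:", len(overlap))

# Together the two parts should account for every row.
print("Covers all rows:", len(train) + len(test) == len(toy))

# Shuffling again with the same seed should give the same order.
again = toy.sample(frac=1, replace=False, random_state=1234)
print("Reproducible:", again.index.equals(shuffled.index))
```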


In [98]:
#stats about train and test datasets
total_count1 = A_train.shape[0]
spam_vals1 = A_train["label"].values
spam_percent1 = 100*round(len(spam_vals1[spam_vals1 == 1])/total_count1, 4)
print("Training dataset has %s examples, of which %s%% are spam." % (total_count1, spam_percent1))

total_count2 = A_test.shape[0]
spam_vals2 = A_test["label"].values
spam_percent2 = 100*round(len(spam_vals2[spam_vals2 == 1])/total_count2, 4)
print("Testing dataset has %s examples, of which %s%% are spam." % (total_count2, spam_percent2))

Training dataset has 135695 examples, of which 47.15% are spam.
Testing dataset has 58155 examples, of which 47.64% are spam.


### Dataset B: Phishing and non-phishing URLs

For the second dataset, we follow a similar process. The dataset's contributor on Kaggle has already divided it into a training set and a test set, so for convenience we use that split. Note: we first had to clean the files by collapsing doubled double-quotes (`""`) into single ones and adding a `label,text` header row so that pandas can parse them as CSV.

In [6]:
#collapse doubled double-quotes and add a header row so the file parses as csv
def remove_doublequotes(file_dir):
    with open(file_dir, 'r', encoding='utf-8') as f:
        raw_file_str = f.read().replace('""', '"')
    #prepend the header only if it is missing
    if not raw_file_str.startswith("label,text\n"):
        raw_file_str = "label,text\n" + raw_file_str
    #overwrite the original file in place
    with open(file_dir, 'w', encoding='utf-8') as f:
        f.write(raw_file_str)

remove_doublequotes('./datasets/PhishingURLs/train.csv')
remove_doublequotes('./datasets/PhishingURLs/test.csv')
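To see what this cleaning step does without touching the real files, the function can be tried on a throwaway file. The sketch below is illustrative only: the sample row is a guess at what a raw line might look like (an assumption, not copied from the Kaggle files), and the function is repeated so the snippet runs on its own.

```python
import os
import tempfile

def remove_doublequotes(file_dir):
    # Same logic as the cell above, repeated so this snippet is self-contained.
    with open(file_dir, 'r', encoding='utf-8') as f:
        raw_file_str = f.read().replace('""', '"')
    if not raw_file_str.startswith("label,text\n"):
        raw_file_str = "label,text\n" + raw_file_str
    with open(file_dir, 'w', encoding='utf-8') as f:
        f.write(raw_file_str)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.csv")
    # Hypothetical raw row: a label followed by a URL wrapped in doubled quotes.
    with open(path, "w", encoding="utf-8") as f:
        f.write('2,""https://example.org/""\n')

    remove_doublequotes(path)
    with open(path, encoding="utf-8") as f:
        first_pass = f.read()

    # Running it again on this sample adds no second header.
    remove_doublequotes(path)
    with open(path, encoding="utf-8") as f:
        second_pass = f.read()

print(repr(first_pass))
print(first_pass == second_pass)
```

Because the function overwrites its input, keeping an untouched copy of the downloaded files is a sensible precaution.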

In [3]:
B_train = pd.read_csv("./datasets/PhishingURLs/train.csv", encoding='utf-8')
B_test = pd.read_csv("./datasets/PhishingURLs/test.csv", encoding='utf-8')

B_train.head()

Unnamed: 0,label,text
0,2,https://blog.sockpuppet.us/
1,2,https://blog.apiki.com/seguranca/
2,1,http://autoecole-lauriston.com/a/T0RVd056QXlNe...
3,1,http://chinpay.site/index.html?hgcFSE@E$Z*DFcG...
4,2,http://www.firstfivenebraska.org/blog/article/...


In [4]:
#cleaning
B_train = B_train.dropna()
B_test = B_test.dropna()
possible_labels = B_train["label"].unique()
possible_labels_test = B_test["label"].unique()
print("%s possible labels in training set: %s" % (len(possible_labels), possible_labels))
print("%s possible labels in testing set: %s" % (len(possible_labels_test), possible_labels_test))
B_train["label"] = B_train["label"].apply(lambda x: 2 - x) #1=phishing
B_test["label"] = B_test["label"].apply(lambda x: 2 - x)

B_train.head()

2 possible labels in training set: [2 1]
2 possible labels in testing set: [2 1]


Unnamed: 0,label,text
0,0,https://blog.sockpuppet.us/
1,0,https://blog.apiki.com/seguranca/
2,1,http://autoecole-lauriston.com/a/T0RVd056QXlNe...
3,1,http://chinpay.site/index.html?hgcFSE@E$Z*DFcG...
4,0,http://www.firstfivenebraska.org/blog/article/...
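The `2 - x` trick works because the raw labels are exactly 1 (phishing) and 2 (non-phishing). A minimal sketch of two equivalent vectorized forms follows, using an invented five-row frame (`toy` is an illustrative name); the explicit `map` has the side benefit that any unexpected label value would show up as `NaN` rather than as a silently wrong number.

```python
import pandas as pd

# Invented labels in the raw encoding: 1 = phishing, 2 = non-phishing.
toy = pd.DataFrame({"label": [2, 2, 1, 1, 2]})

# Arithmetic remap on the whole column at once, no lambda needed.
arith = 2 - toy["label"]

# Explicit remap; values missing from the dict would become NaN.
explicit = toy["label"].map({1: 1, 2: 0})

print(arith.tolist())
print(explicit.tolist())
```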


In [10]:
#basic stats
print("Training dataset has %s examples, of which %s%% are phishing" % (B_train.shape[0], 100*round(np.average(B_train["label"].values), 4)))
print("Testing dataset has %s examples, of which %s%% are phishing" % (B_test.shape[0], 100*round(np.average(B_test["label"].values), 4)))

Training dataset has 640000 examples, of which 50.0% are phishing
Testing dataset has 160000 examples, of which 50.0% are phishing


In [8]:
#check how many urls start with http and how many contain the literal substring ".com"
a = B_train["text"].str.startswith("http")
b = B_train["text"].str.contains(".com", regex=False)
c = B_test["text"].str.startswith("http")
d = B_test["text"].str.contains(".com", regex=False)

#.sum() counts the True entries of each boolean mask (.shape[0] would only give the row count)
print("%s/%s start with http in training set" % (a.sum(), B_train.shape[0]))
print("%s/%s start with http in testing set" % (c.sum(), B_test.shape[0]))
print("%s/%s contain .com in training set" % (b.sum(), B_train.shape[0]))
print("%s/%s contain .com in testing set" % (d.sum(), B_test.shape[0]))
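When counting how many rows satisfy a string test, note that the result of `str.startswith` is a boolean Series with one entry per row, so its `.shape[0]` always equals the number of rows; the number of matches is its `.sum()`. A minimal sketch on three invented URLs (the `urls` Series is an assumption for illustration):

```python
import pandas as pd

# Three made-up URLs, one of which does not start with "http".
urls = pd.Series([
    "https://blog.example.org/",
    "http://shop.example.com/a",
    "ftp://files.example.net/",
])

starts_http = urls.str.startswith("http")

n_rows = starts_http.shape[0]   # length of the mask: every row, match or not
n_matches = starts_http.sum()   # True counts as 1, so this is the match count

print("%s/%s start with http" % (n_matches, n_rows))
```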


In [15]:
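#check how many Dataset A emails mention "http" and the literal substring ".com"
has_http = A_all[A_all["text"].str.contains("http", regex=False)] #71437 rows in the recorded run
has_com = A_all[A_all["text"].str.contains(".com", regex=False)]
#keep only the http rows that also contain ".com"
has_http_com = has_http[has_http["text"].str.contains(".com", regex=False)]

print(has_http.shape[0], has_com.shape[0], has_http_com.shape[0])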
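One caveat when searching for `.com`: `str.contains` treats its pattern as a regular expression by default, so the dot matches any character and words such as "welcome" count as hits. Passing `regex=False` searches for the literal substring instead. The sketch below shows the difference on three invented strings (the `texts` Series is an assumption for illustration):

```python
import pandas as pd

texts = pd.Series([
    "welcome to the team",       # contains "lcom", but no literal ".com"
    "visit example.com today",   # contains a literal ".com"
    "nothing relevant here",     # contains neither
])

# Default: the pattern is a regex, so "." matches any single character.
regex_hits = texts.str.contains(".com")

# Literal substring search.
literal_hits = texts.str.contains(".com", regex=False)

print(regex_hits.sum(), literal_hits.sum())
```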
