<h1>Preprocessing datasets for phishing detection</h1>

## Loading Datasets
We will use two datasets so that the model can be externally validated. Our extension needs a model that generalises well and is effective 'in the field', so the datasets must be feature-rich and provide ample training data. Two datasets with both properties were chosen. The selection process is described in the report. In summary, the first dataset combines phishing URLs from PhishTank with benign URLs and provides 111 extracted features. The second was pulled from Kaggle and contains 11,430 URLs with 87 features drawn from the URL, the HTML content and the domain properties.

The PhishTank dataset was last updated in 2024 and so includes newer phishing URLs, whereas the Kaggle dataset was last updated in 2021. Comparing the two may show how phishing attacks have changed over time, since attackers frequently try new techniques and the attack vector is not static.

To operate on these datasets, the following libraries will be needed.

In [31]:
import pandas as pd

To download the Kaggle dataset, the opendatasets library was used to pull it automatically. We can then see which features are available and check the number of rows and features.

In [None]:
import opendatasets as od
od.download("https://www.kaggle.com/datasets/shashwatwork/web-page-phishing-detection-dataset")

In [4]:
filepath = 'Datasets/web-page-phishing-detection-dataset/dataset_phishing.csv'  #create the filepath to our dataset
kaggleData = pd.read_csv(filepath) #read the data in the file using the pandas library
kaggleData.head() # print some values to check if things are correct

Unnamed: 0,url,length_url,length_hostname,ip,nb_dots,nb_hyphens,nb_at,nb_qm,nb_and,nb_or,...,domain_in_title,domain_with_copyright,whois_registered_domain,domain_registration_length,domain_age,web_traffic,dns_record,google_index,page_rank,status
0,http://www.crestonwood.com/router.php,37,19,0,3,0,0,0,0,0,...,0,1,0,45,-1,0,1,1,4,legitimate
1,http://shadetreetechnology.com/V4/validation/a...,77,23,1,1,0,0,0,0,0,...,1,0,0,77,5767,0,0,1,2,phishing
2,https://support-appleld.com.secureupdate.duila...,126,50,1,4,1,0,1,2,0,...,1,0,0,14,4004,5828815,0,1,0,phishing
3,http://rgipt.ac.in,18,11,0,2,0,0,0,0,0,...,1,0,0,62,-1,107721,0,0,3,legitimate
4,http://www.iracing.com/tracks/gateway-motorspo...,55,15,0,2,2,0,0,0,0,...,0,1,0,224,8175,8725,0,0,6,legitimate


In [5]:
print(kaggleData.shape) #Print the dimensions of the dataset

(11430, 89)


The Kaggle dataset contains 11,430 rows, split 50/50 between benign and malicious URLs. The second dataset is derived from PhishTank and contains 58,645 rows, also split 50/50. However, as the preview below shows, the PhishTank dataset contains only the extracted features and not the URLs themselves, so any further feature engineering has to be carried out on the Kaggle dataset. The selection method for these features is described in the accompanying report.
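
The 50/50 split can be confirmed directly from the label column. The snippet below is a minimal sketch on a small hand-made frame (`toy` and its values are illustrative and not taken from either dataset). On the real data the same call would be applied to `kaggleData['status']` or `phishTankData['phishing']`.

```python
import pandas as pd

# Illustrative stand-in for the label column; the values are made up
toy = pd.DataFrame({"status": ["legitimate", "phishing", "phishing", "legitimate"]})

# Proportion of rows per class; a balanced dataset gives 0.5 for each label
balance = toy["status"].value_counts(normalize=True)
print(balance)
```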

In [6]:
filepath2 = 'Datasets/dataset_small.csv'
phishTankData = pd.read_csv(filepath2)
phishTankData.head()

Unnamed: 0,qty_dot_url,qty_hyphen_url,qty_underline_url,qty_slash_url,qty_questionmark_url,qty_equal_url,qty_at_url,qty_and_url,qty_exclamation_url,qty_space_url,...,qty_ip_resolved,qty_nameservers,qty_mx_servers,ttl_hostname,tls_ssl_certificate,qty_redirects,url_google_index,domain_google_index,url_shortened,phishing
0,2,0,0,0,0,0,0,0,0,0,...,1,4,2,3598,0,0,0,0,0,0
1,4,0,0,2,0,0,0,0,0,0,...,1,4,1,3977,1,0,0,0,0,0
2,1,0,0,1,0,0,0,0,0,0,...,1,2,1,10788,0,0,0,0,0,0
3,2,0,0,3,0,0,0,0,0,0,...,1,2,1,14339,1,0,0,0,0,1
4,1,1,0,4,0,0,0,0,0,0,...,1,2,1,389,1,1,0,0,0,1


In [7]:
print(phishTankData.shape)

(58645, 112)


We can see that both datasets offer a large number of features: 87 for Kaggle (alongside the raw `url` column) and 111 for PhishTank, each with one additional column holding the label. The next step is to identify the important features and reduce the overall number of features. This is necessary to ensure that the deployed model is efficient and does not impede the user experience.

## Feature Selection
To properly implement external validation, we will normalise features and data throughout our two datasets. To do so we must first select a subset of common features that are shown to be strong indicators of phishing whilst also reducing the feature count to increase efficiency of the deployed model. The justification and selection of features can be seen below:
<figure>
<img src="utilities/Phishing%20Features.png" alt="features" width="1000"/>
</figure>

The chosen features are drawn from the domain information and the URL syntax. Further features were deemed unfeasible for this project because of the processing time they would add when the model is deployed, and because certain WHOIS features could not be queried en masse when building the dataset.



In [46]:
selected_features = [
    'length_url', 'domain_length', 'directory_length', 'file_length',
    'params_length', 'qty_slash_url', 'qty_dot_url', 'domain_in_ip',
    'qty_at_url', 'asn_ip', 'time_domain_activation', 'time_domain_expiration',
    'qty_hyphen_url', 'url_google_index', 'url_shortened', 'phishing'
]

# pull the selected columns into a dataframe
extracted_dataset = phishTankData[selected_features]

# Save to a CSV file
extracted_dataset.to_csv('Datasets/Final_datasets/extracted_phishtank_features.csv', index=False)


In [47]:
extracted_dataset.head()

Unnamed: 0,length_url,domain_length,directory_length,file_length,params_length,qty_slash_url,qty_dot_url,domain_in_ip,qty_at_url,asn_ip,time_domain_activation,time_domain_expiration,qty_hyphen_url,url_google_index,url_shortened,phishing
0,14,14,-1,-1,-1,0,2,0,0,8560,4927,185,0,0,0,0
1,38,32,6,0,-1,2,4,0,0,263283,8217,-1,0,0,0,0
2,24,23,1,0,-1,1,1,0,0,26496,258,106,0,0,0,0
3,38,25,13,0,-1,3,2,0,0,20013,2602,319,0,0,0,1
4,46,19,27,0,-1,4,1,0,0,41828,-1,-1,1,0,0,1


In [48]:
print(extracted_dataset.shape)

(58645, 16)


In [35]:
df = pd.read_csv('Datasets/Final_datasets/extracted_phishtank_features.csv')
# Checking for missing values using isnull()
mv = df.isnull()
for i in mv.columns:
    print(f"{i}: {mv[i].sum()}")

length_url: 0
domain_length: 0
directory_length: 0
file_length: 0
params_length: 0
qty_slash_url: 0
qty_dot_url: 0
domain_in_ip: 0
qty_at_url: 0
asn_ip: 0
time_domain_activation: 0
time_domain_expiration: 0
qty_hyphen_url: 0
url_google_index: 0
url_shortened: 0
phishing: 0


We can see that there are no missing values within this dataset, so we can move on to the Kaggle dataset. It does not contain all of the features we have selected, so we will write some Python functions to extract the missing ones and load them into a new extracted dataset. Once the data has been organised and checked for missing values, we can normalise both sets and start training our models.
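
Note that `isnull()` only counts true NaN entries. The previews above show `-1` in several columns (for example `params_length` and `time_domain_expiration`), which appears to act as a placeholder for values that could not be extracted. Assuming that reading is correct, a sketch like the following counts those placeholders per column. The `toy` frame is made up purely for illustration.

```python
import pandas as pd

# Made-up rows that mimic the -1 placeholder seen in the previews
toy = pd.DataFrame({
    "params_length": [-1, -1, 47],
    "time_domain_expiration": [185, -1, 106],
    "qty_dot_url": [2, 4, 1],
})

# Count how many entries in each column equal the assumed placeholder -1
sentinel_counts = (toy == -1).sum()
print(sentinel_counts)
```

Whether such values should be kept, imputed or dropped depends on the model, so this only reports them.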

First we need to extract the URL text features, which can be done with the `urllib` library. We parse the URL to isolate the file, the directory and any parameters, and record the length of each. These lengths are then added to our dataset.

In [13]:
from urllib.parse import urlparse

def extractparams(url):
    # Length of the query string (the part after '?'), or -1 if there is none
    query = urlparse(url).query
    if query:
        return len(query)
    else:
        return -1

In [14]:
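from urllib.parse import urlparse
import posixpath

def extractfile(url):
    # Length of the final path segment, or -1 if the URL has no path
    urlpath = urlparse(url).path
    if urlpath:
        setup = urlpath.count("/")
        if setup >= 2:
            filename = posixpath.basename(urlpath)
            return len(filename)
        else:
            # A single-segment path is counted as a directory, not a file
            return 0
    else:
        return -1

def extractdir(url):
    # Length of the directory part of the path, or -1 if the URL has no path
    urlpath = urlparse(url).path
    if urlpath:
        setup = urlpath.count("/")
        if setup >= 2:
            directory = posixpath.dirname(urlpath)
            return len(directory)
        else:
            # With a single segment, that segment itself is the directory
            directory = posixpath.basename(urlpath)
            return len(directory)
    else:
        return -1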

extractfile("https://www.example.com/blog/2024/tech/latest-news/index.html?param1=value1&param2=value2#section")


10
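
To make the parsing step concrete, the sketch below breaks the same example URL into the pieces that the functions above measure. It uses only the standard library, and the variable names `url` and `parts` are illustrative.

```python
from urllib.parse import urlparse
import posixpath

url = "https://www.example.com/blog/2024/tech/latest-news/index.html?param1=value1&param2=value2#section"
parts = urlparse(url)

print(parts.hostname)                   # www.example.com
print(parts.path)                       # /blog/2024/tech/latest-news/index.html
print(parts.query)                      # param1=value1&param2=value2
print(posixpath.dirname(parts.path))    # /blog/2024/tech/latest-news
print(posixpath.basename(parts.path))   # index.html
```

The fragment (`#section`) is not part of the path or the query, so it does not contribute to any of the three lengths.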

We then need to deal with IP-based features. This requires resolving the IP address for a URL using `socket` and then querying a local ASN database to find the relevant value.

In [16]:
import socket
from urllib.parse import urlparse

# Function to resolve hostname to IP address
def resolve_hostname_to_ip(url):
    try:
        hostname = urlparse(url).hostname
        ip_address = socket.gethostbyname(hostname)
        return ip_address
    except Exception:
        # Unresolvable or malformed hostnames are reported as None
        return None

resolve_hostname_to_ip("http://www.iracing.com/tracks/gateway-motorsports-park/")


'52.5.52.172'
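
When the same hostname appears in many URLs, repeating the DNS query each time may be wasteful. The following is a minimal sketch of one way to cache lookups with `functools.lru_cache`. The names `resolve_host_cached` and `resolve_url` are hypothetical and are not part of the notebook's pipeline.

```python
import socket
from functools import lru_cache
from urllib.parse import urlparse

@lru_cache(maxsize=None)
def resolve_host_cached(hostname):
    """Resolve a hostname once and remember the answer (hypothetical helper)."""
    try:
        return socket.gethostbyname(hostname)
    except (OSError, UnicodeError):
        # Covers DNS failures (socket.gaierror is an OSError) and bad hostnames
        return None

def resolve_url(url):
    """Return the IPv4 address for a URL's host, or None if it has no host."""
    hostname = urlparse(url).hostname
    return resolve_host_cached(hostname) if hostname else None

# An IPv4 literal resolves to itself, so this call needs no network access
print(resolve_url("http://127.0.0.1/login"))
```

One trade-off of this design is that failed lookups are cached as well, so a host that is temporarily unreachable stays `None` for the rest of the run.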

The initial implementation used the ipwhois library to look up the ASN for each IP. However, this is too slow to run over an entire dataset, so the pyasn library was used instead, as it performs lookups against a local database and is suited to large amounts of data.

In [18]:
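import pyasn

# Load the local ASN database once, rather than on every lookup
asndb = pyasn.pyasn('utilities/rib.dat')

def extract_asn2(ip):
    # lookup() returns an (asn, prefix) tuple; asn is None if no prefix matches
    asn = asndb.lookup(ip)
    return asn[0]

def full_asn2(url):
    ip = resolve_hostname_to_ip(url)
    if ip:
        asn = extract_asn2(ip)
        return asn
    else:
        return -1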

full_asn2("http://www.iracing.com/tracks/gateway-motorsports-park/")


14618

We now apply these functions to the Kaggle dataset and append the extracted features to it. For each URL we build a dictionary of feature values, and the function is applied to every URL in the dataset. The results are stored in a dataframe that concatenates the original dataset with the new features, which is then exported to a CSV file.

In [16]:
def extract_features(url):
    # Collect all the extracted features in a dictionary
    features = {}
    # Directory Length
    features['directory_length'] = extractdir(url)
    # File Length
    features['file_length'] = extractfile(url)
    # Parameter Length
    features['params_length'] = extractparams(url)
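    # ASN IP (-1 if the hostname cannot be resolved)
    try:
        ip = resolve_hostname_to_ip(url)
        # Reuse the resolved IP instead of resolving the hostname a second time
        features['asn_ip'] = extract_asn2(ip) if ip else -1
    except Exception:
        features['asn_ip'] = -1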

    return features


# Function to apply feature extraction to entire dataset
def extract_features_dataset(dataframe):
    # Create a copy of the dataframe to avoid modifying the original
    df_extracted = dataframe.copy()

    # Apply feature extraction to each URL
    extracted_features = df_extracted['url'].apply(extract_features)

    # Convert the list of dictionaries to a DataFrame
    features_df = pd.DataFrame(extracted_features.tolist())

    # Concatenate the original dataframe with the extracted features
    final_df = pd.concat([df_extracted, features_df], axis=1)

    return final_df

def save_extracted_dataset(extracted_dataframe, output_filename):
    extracted_dataframe.to_csv(output_filename, index=False)
    print(f"Dataset saved to {output_filename}")
    print(f"Dataset shape: {extracted_dataframe.shape}")

kaggle_extracted = extract_features_dataset(kaggleData)
save_extracted_dataset(kaggle_extracted, 'Datasets/Intermediate_Datasets/kaggle_extracted_features.csv')

Dataset saved to Datasets/Intermediate_Datasets/kaggle_extracted_features.csv
Dataset shape: (11430, 93)


We have now created a dataset containing all the data we need, and can normalise it. Currently the status of each URL is stored as a string, either 'legitimate' or 'phishing', so we add a new numeric label column that is consistent with the PhishTank dataset.
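
The `phishing_tag` function in the next cell treats every value other than 'phishing' as benign. As an alternative, the sketch below uses `Series.map` with an explicit dictionary, so that any unexpected label shows up as NaN instead of silently becoming 0. The frame `toy` and the 'unknown' value are made up for illustration.

```python
import pandas as pd

# Made-up status values, including one label that is not expected
toy = pd.DataFrame({"status": ["legitimate", "phishing", "phishing", "unknown"]})

# Explicit mapping from string label to the binary label
label_map = {"legitimate": 0, "phishing": 1}
toy["phishing"] = toy["status"].map(label_map)

# Rows whose status was not in the mapping are left as NaN
unexpected = toy[toy["phishing"].isna()]
print(unexpected)
```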

In [40]:
extractedKaggleData = pd.read_csv('Datasets/Intermediate_Datasets/kaggle_extracted_features.csv')
def phishing_tag(value):
    # Map the string label to the binary label used by the PhishTank dataset
    if value == "phishing":
        return 1
    else:
        return 0

def normalise_dataset(dataframe):
    # Create a copy of the dataframe to avoid modifying the original
    df_normalised = dataframe.copy()

    # Add a binary 'phishing' column derived from the string 'status' column
    df_normalised['phishing'] = df_normalised['status'].apply(phishing_tag)

    return df_normalised

kaggle_finalised = normalise_dataset(extractedKaggleData)
save_extracted_dataset(kaggle_finalised, 'Datasets/Intermediate_Datasets/normalised_kaggle_features.csv')


Dataset saved to Datasets/Intermediate_Datasets/normalised_kaggle_features.csv
Dataset shape: (11430, 94)


Now we extract the necessary columns and we have our complete dataset.

In [43]:
selected_features2 = [
    'length_url', 'length_hostname', 'directory_length', 'file_length',
    'params_length', 'nb_slash', 'nb_dots', 'ip',
    'nb_at', 'asn_ip', 'domain_age', 'domain_registration_length',
    'nb_hyphens', 'google_index', 'shortening_service', 'phishing'
]

normalisedKaggleData = pd.read_csv('Datasets/Intermediate_Datasets/normalised_kaggle_features.csv')
extracted_dataset = normalisedKaggleData[selected_features2]

# Save to a separate CSV file
extracted_dataset.to_csv('Datasets/Final_datasets/final_kaggle_data.csv', index=False)


In [52]:
finalKaggleData = pd.read_csv('Datasets/Final_datasets/final_kaggle_data.csv')
finalPhishtankData = pd.read_csv('Datasets/Final_datasets/extracted_phishtank_features.csv')

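Now that all features have been extracted, we can normalise the feature names and remove any rows with missing data. The Kaggle columns were selected in the same order as their PhishTank counterparts, so the PhishTank names can be assigned directly.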

In [60]:
finalKaggleData.columns = finalPhishtankData.columns
save_extracted_dataset(finalKaggleData, 'Datasets/Final_datasets/final_kaggle_data.csv')

Dataset saved to Datasets/Final_datasets/final_kaggle_data.csv
Dataset shape: (11430, 16)
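
Assigning the column names by position relies on the two feature lists being in exactly the same order. A more defensive option is an explicit rename mapping. The sketch below is one way to write it, with the pairs read off the two feature lists in order. The dictionary name `kaggle_to_phishtank` and the `toy` frame are illustrative.

```python
import pandas as pd

# Kaggle column name -> PhishTank column name (only differing names are listed)
kaggle_to_phishtank = {
    "length_hostname": "domain_length",
    "nb_slash": "qty_slash_url",
    "nb_dots": "qty_dot_url",
    "ip": "domain_in_ip",
    "nb_at": "qty_at_url",
    "domain_age": "time_domain_activation",
    "domain_registration_length": "time_domain_expiration",
    "nb_hyphens": "qty_hyphen_url",
    "google_index": "url_google_index",
    "shortening_service": "url_shortened",
}

# Tiny made-up frame with a few Kaggle-style columns
toy = pd.DataFrame({"length_url": [37], "nb_dots": [3], "ip": [0], "phishing": [0]})

# Columns not named in the mapping keep their original names
renamed = toy.rename(columns=kaggle_to_phishtank)
print(list(renamed.columns))
```

With a mapping like this, reordering either feature list would not silently attach the wrong name to a column.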


In [61]:
# Checking for missing values using isnull()
mv = finalKaggleData.isnull()
for i in mv.columns:
    print(f"{i}: {mv[i].sum()}")

length_url: 0
domain_length: 0
directory_length: 0
file_length: 0
params_length: 0
qty_slash_url: 0
qty_dot_url: 0
domain_in_ip: 0
qty_at_url: 0
asn_ip: 17
time_domain_activation: 0
time_domain_expiration: 0
qty_hyphen_url: 0
url_google_index: 0
url_shortened: 0
phishing: 0


In [62]:
# Remove rows where ASN IP is missing
finalKaggleData = finalKaggleData.dropna(subset=['asn_ip'])
save_extracted_dataset(finalKaggleData, 'Datasets/Final_datasets/final_kaggle_data.csv')


Dataset saved to Datasets/Final_datasets/final_kaggle_data.csv
Dataset shape: (11413, 16)


In [63]:
finalKaggleData.head()

Unnamed: 0,length_url,domain_length,directory_length,file_length,params_length,qty_slash_url,qty_dot_url,domain_in_ip,qty_at_url,asn_ip,time_domain_activation,time_domain_expiration,qty_hyphen_url,url_google_index,url_shortened,phishing
0,37,19,10,0,-1,3,3,0,0,32244.0,-1,45,0,1,0,0
1,77,23,14,32,-1,5,1,1,0,16509.0,5767,77,0,1,0,1
2,126,50,19,0,47,5,4,1,0,-1.0,4004,14,1,1,0,1
3,18,11,-1,-1,-1,2,2,0,0,55824.0,-1,62,0,0,0,0
4,55,15,32,0,-1,5,2,0,0,14618.0,8175,224,2,0,0,0


In [69]:
# Convert the float ASN values to integers for consistency, but only if
# every value is a whole number
non_integer_rows = finalKaggleData[finalKaggleData['asn_ip'] % 1 != 0]
if len(non_integer_rows) > 0:
    print("Warning: Some rows contain non-integer values")
    print(non_integer_rows)
else:
    finalKaggleData['asn_ip'] = finalKaggleData['asn_ip'].astype(int)

save_extracted_dataset(finalKaggleData, 'Datasets/Final_datasets/final_kaggle_data.csv')


Dataset saved to Datasets/Final_datasets/final_kaggle_data.csv
Dataset shape: (11413, 16)


In [71]:
finalPhishtankData.duplicated().sum()

np.int64(15043)

The PhishTank features contain 15,043 duplicated rows, so we remove them from the dataset.

In [74]:
finalPhishtankData = finalPhishtankData.drop_duplicates()
save_extracted_dataset(finalPhishtankData, 'Datasets/Final_datasets/extracted_phishtank_features.csv')

Dataset saved to Datasets/Final_datasets/extracted_phishtank_features.csv
Dataset shape: (43602, 16)


In [75]:
finalKaggleData.duplicated().sum()

np.int64(951)

In [76]:
finalKaggleData = finalKaggleData.drop_duplicates()
save_extracted_dataset(finalKaggleData, 'Datasets/Final_datasets/final_kaggle_data.csv')

Dataset saved to Datasets/Final_datasets/final_kaggle_data.csv
Dataset shape: (10462, 16)
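
`drop_duplicates()` only removes rows that are identical in every column, including the label. Rows with the same feature values but different labels would therefore remain. The outputs above do not show whether any such rows exist in these datasets, so the following is only a sketch of how one might check, using a made-up frame `toy`.

```python
import pandas as pd

# Made-up rows: the first two are exact duplicates, the last two share
# feature values but disagree on the label
toy = pd.DataFrame({
    "length_url": [14, 14, 38, 38],
    "qty_dot_url": [2, 2, 4, 4],
    "phishing": [0, 0, 0, 1],
})

feature_cols = [c for c in toy.columns if c != "phishing"]

# Exact duplicates (features and label identical)
exact_duplicates = toy.duplicated().sum()

# Number of distinct labels seen for each unique feature vector
labels_per_vector = toy.groupby(feature_cols)["phishing"].nunique()
conflicting = labels_per_vector[labels_per_vector > 1]

print(exact_duplicates)
print(len(conflicting))
```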
