# **Phishing Website Detection URL Feature Extraction**
*Final Project of ENPM 809K*

# **1. Main Objective:**

Phishing is a type of social engineering attack that exploits the naivety or gullibility of legitimate system users. The objective of this notebook is to collect URL data and extract the features needed to predict whether a website URL is a phishing URL.


# **2. Data Collection & Analysis**
To train our deep learning model, we need a collection of legitimate and phishing URLs.

*Phishing URLs Data Collection*: We use a popular open-source site called [PhishTank](https://phishtank.org), which provides a large, periodically updated collection of phishing URLs in multiple formats such as CSV, XML, JSON, and PHP. Download the data file using the link: https://www.phishtank.com/developer_info.php

*Legitimate URLs Data Collection:* We use another popular site, [Kaggle](https://www.kaggle.com/), from which we take a balanced dataset with 50% phishing and 50% legitimate URLs. Download the data file using the link: https://www.kaggle.com/datasets/shashwatwork/web-page-phishing-detection-dataset. We also used the dataset from the reference paper *P. Mowar and M. Jain, "Fishing out the Phishing Websites," 2021 International Conference on Cyber Situational Awareness, Data Analytics and Assessment (CyberSA), 2021, pp. 1-6, doi: 10.1109/CyberSA52016.2021.9478237.*
Download the data file using the link: https://zenodo.org/records/5807622#.Ycsbzy0RpQJ


In [None]:
# This mounts your Google Drive to the Colab VM.
from google.colab import drive
drive.mount('/content/drive')

# set FOLDERNAME to the project folder
FOLDERNAME = 'ENPM809K Project/'
assert FOLDERNAME is not None, "[!] Enter the foldername."

# Now that we've mounted your Drive, this ensures that
# the Python interpreter of the Colab VM can load
# python files from within it.
import sys
sys.path.append('/content/drive/My Drive/{}'.format(FOLDERNAME))

Mounted at /content/drive


## **2.1. Phishing URLs:**
The phishing URLs are collected from the `PhishTank` website mentioned above. We use the non-interactive `wget` command to download the CSV file from the site, and then load the data into a pandas DataFrame.




In [None]:
# Importing pandas for data processing
import pandas as pd

In [None]:
# Using wget to download the csv file from PhishTank
!wget https://data.phishtank.com/data/online-valid.csv

--2023-12-11 00:47:11--  https://data.phishtank.com/data/online-valid.csv
Resolving data.phishtank.com (data.phishtank.com)... 104.17.177.85, 104.16.101.75, 2606:4700::6810:654b, ...
Connecting to data.phishtank.com (data.phishtank.com)|104.17.177.85|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://cdn.phishtank.com/datadumps/verified_online.csv?Expires=1702255642&Signature=E8fgpZjC7iQwx~~N9cNSvp9uXY3u57HeDSSCrjvimTotiIx9kwFW8tOrfBGQSYrXw2fAV6cHv1tr1-zgwfPIx9vXvggSpI6hbnbvoDXDwj9JKbG-wH8M8wCZ5OCIn9-DC6UXBRu1Z1U4v1nVImMDAd3PNKKPg4G-XKj3z1UXIw5NZ-5mRhB6nudPKfp7GyS8cDHNA12MbjSW~bFiVpIZ--caYy-FsazkRKQz6Htc4zM2XUS13Glpc0EV3UhrAgrTNaGPhvoZSIh6ADjx4HWHF33lRkCsP2ELFfCeNp1caS-57vWRy6cAUcbE0VlWPKgdpLmi9pPt2AEvUlCPrvnvdA__&Key-Pair-Id=APKAILB45UG3RB4CSOJA [following]
--2023-12-11 00:47:12--  https://cdn.phishtank.com/datadumps/verified_online.csv?Expires=1702255642&Signature=E8fgpZjC7iQwx~~N9cNSvp9uXY3u57HeDSSCrjvimTotiIx9kwFW8tOrfBGQSYrXw2fAV6cHv1tr1-zgwfPIx

The above `wget` command downloads the CSV file *online-valid.csv* from `PhishTank` and stores it in the /content/ folder.

In [None]:
# Loading the Phishing URLs csv file to a pandas dataframe
phish_data = pd.read_csv('online-valid.csv')
# Getting top 10 rows in the dataframe
phish_data.head(n=10)

Unnamed: 0,phish_id,url,phish_detail_url,submission_time,verified,verification_time,online,target
0,8388112,https://bipcancelarnett.webcindario.com/inicio...,http://www.phishtank.com/phish_detail.php?phis...,2023-12-10T22:37:41+00:00,yes,2023-12-10T22:53:01+00:00,yes,Other
1,8388111,https://heybn.webcindario.com/,http://www.phishtank.com/phish_detail.php?phis...,2023-12-10T22:37:37+00:00,yes,2023-12-10T22:53:01+00:00,yes,Other
2,8388110,https://pichinchawb.webcindario.com/validacion...,http://www.phishtank.com/phish_detail.php?phis...,2023-12-10T22:37:36+00:00,yes,2023-12-10T22:43:16+00:00,yes,Other
3,8388109,https://validepichinch.webcindario.com/validac...,http://www.phishtank.com/phish_detail.php?phis...,2023-12-10T22:37:35+00:00,yes,2023-12-10T22:43:16+00:00,yes,Other
4,8388108,https://biptokencompras.gaoz10.repl.co/,http://www.phishtank.com/phish_detail.php?phis...,2023-12-10T22:37:33+00:00,yes,2023-12-10T22:43:16+00:00,yes,Other
5,8388107,https://biptokencompras--gaoz10.repl.co/,http://www.phishtank.com/phish_detail.php?phis...,2023-12-10T22:37:32+00:00,yes,2023-12-10T22:43:16+00:00,yes,Other
6,8388106,https://2c5e5ed9-01c7-42cd-9b45-c1eae7ccbae4.i...,http://www.phishtank.com/phish_detail.php?phis...,2023-12-10T22:37:31+00:00,yes,2023-12-10T22:43:16+00:00,yes,Other
7,8388103,https://www.wvrtirement.com/,http://www.phishtank.com/phish_detail.php?phis...,2023-12-10T22:37:21+00:00,yes,2023-12-10T22:43:16+00:00,yes,Other
8,8388102,https://dev-acceder-bip-token.pantheonsite.io/,http://www.phishtank.com/phish_detail.php?phis...,2023-12-10T22:37:20+00:00,yes,2023-12-10T22:43:16+00:00,yes,Other
9,8388101,https://new.express.adobe.com/webpage/c3lIFPXH...,http://www.phishtank.com/phish_detail.php?phis...,2023-12-10T22:37:19+00:00,yes,2023-12-10T22:43:16+00:00,yes,Other


In [None]:
# Getting the dimensionality of the phish_data dataframe
phish_data.shape

(38473, 8)

Here, we see there are more than 38,000 phishing URLs in the dataset from `PhishTank`. Because the PhishTank feed is updated periodically, its size changes from run to run, which could leave the two classes imbalanced. Hence, we fix the sample size at 19,000 URLs, matching the number of legitimate URLs sampled in section 2.2.

Now, we pick 19,000 URLs randomly from the above `phish_data` dataframe.

In [None]:
# Forming a new dataframe of 19000 randomly sampled records from phish_data
phish_urls = phish_data.sample(n = 19000, random_state=12).copy(deep=True)

# Reset the indices as per new df, phish_urls by dropping old index
# and replace with new index
phish_urls = phish_urls.reset_index(drop=True)

# Getting top 10 rows in the dataframe
phish_urls.head(n=10)

Unnamed: 0,phish_id,url,phish_detail_url,submission_time,verified,verification_time,online,target
0,8277059,https://qrco.de/beHAGf,http://www.phishtank.com/phish_detail.php?phis...,2023-08-28T16:28:53+00:00,yes,2023-08-28T17:12:23+00:00,yes,Internal Revenue Service
1,7630283,https://cloudflare-ipfs.com/ipfs/bafkreidqr7tw...,http://www.phishtank.com/phish_detail.php?phis...,2022-07-28T21:57:10+00:00,yes,2022-07-28T22:10:34+00:00,yes,Other
2,8384776,https://pub-3f7e513b2d754cfe8bfdbd90c3a48c19.r...,http://www.phishtank.com/phish_detail.php?phis...,2023-12-06T11:20:05+00:00,yes,2023-12-06T11:23:11+00:00,yes,Other
3,8261032,https://bafkreica2uawqr6ilbjnrwkjsmos6k5nmp2bs...,http://www.phishtank.com/phish_detail.php?phis...,2023-08-15T22:49:35+00:00,yes,2023-08-15T22:52:36+00:00,yes,Other
4,8363982,https://skinsmonkey.csgotrades.org/auth.php?gc...,http://www.phishtank.com/phish_detail.php?phis...,2023-11-14T13:03:13+00:00,yes,2023-11-15T09:04:33+00:00,yes,Other
5,8308690,https://agol-c512c.web.app/,http://www.phishtank.com/phish_detail.php?phis...,2023-09-22T21:22:50+00:00,yes,2023-09-22T21:52:58+00:00,yes,Other
6,8201283,https://ipfs.eth.aragon.network/ipfs/bafybeihy...,http://www.phishtank.com/phish_detail.php?phis...,2023-06-29T14:13:10+00:00,yes,2023-06-29T14:33:22+00:00,yes,Other
7,8378409,http://cz13521.tw1.ru/login/ologin.php,http://www.phishtank.com/phish_detail.php?phis...,2023-11-29T12:57:47+00:00,yes,2023-11-29T13:13:46+00:00,yes,Other
8,8242033,https://ipfs.io/ipfs/bafkreigkbq4k7lz74z6xlrgc...,http://www.phishtank.com/phish_detail.php?phis...,2023-07-28T23:16:52+00:00,yes,2023-07-28T23:23:10+00:00,yes,Other
9,8383156,https://rb.gy/d5jbsh,http://www.phishtank.com/phish_detail.php?phis...,2023-12-04T13:08:46+00:00,yes,2023-12-04T13:33:22+00:00,yes,Other


In [None]:
# Getting the dimensionality of the phish_urls dataframe
phish_urls.shape

(19000, 8)

So far, we have collected the phishing URLs and formed a dataframe of 19,000 randomly sampled rows. The next step is to collect the legitimate URLs.

## **2.2 Legitimate URLs:**

We use the dataset named `phishing_and_benign_websites.csv`, which contains both legitimate and phishing URLs. It is downloaded from this [link](https://www.kaggle.com/datasets/shashwatwork/web-page-phishing-detection-dataset/), uploaded to the Colab session storage, and loaded into a dataframe.

In [None]:
# Define the path to the phishing_and_benign_websites CSV file (relative to the working directory)
legit_csv_file_path = 'phishing_and_benign_websites.csv'

# Loading the 'phishing_and_benign_websites' csv file to a pandas dataframe
dataframe = pd.read_csv(legit_csv_file_path)

# Getting the rows whose Label is 'Legitimate'
legit_data = dataframe[dataframe['Label'] == 'Legitimate']

# Getting top 10 rows in the dataframe
legit_data.head(n=10)

Unnamed: 0,URLs,Label
0,http://www.wmmayhem.com/,Legitimate
1,http://www.ballymenaunitedyouthacademy.com/,Legitimate
2,http://www.brusselsgaybars.com/,Legitimate
3,http://www.sportsbettingtennis.net/,Legitimate
4,http://www.i29.mobi/,Legitimate
5,http://www.billnelson.senate.gov/,Legitimate
6,http://www.weather.noaa.gov/weather/PH_cc.html,Legitimate
7,http://www.eagleeyesunglasses.org/,Legitimate
8,http://www.redhawkfly.net/,Legitimate
9,http://www.ultimatesportsstore.com/category/22...,Legitimate


In [None]:
# Getting the dimensionality of the legit_data dataframe
legit_data.shape

(19400, 2)

After filtering `phishing_and_benign_websites.csv` for rows whose `Label` is `Legitimate`, we get around 19,400 URLs. As above, we randomly select 19,000 rows to use as our legitimate URL data.

In [None]:
# Forming a new dataframe of 19000 randomly sampled records from legit_data
legit_urls = legit_data.sample(n = 19000, random_state=12).copy(deep=True)

# Reset the indices as per new df, legit_urls by dropping old index
# and replace with new index
legit_urls = legit_urls.reset_index(drop=True)

# Getting top 10 rows in the dataframe
legit_urls.head(n=10)

Unnamed: 0,URLs,Label
0,http://www.ticketsparapymes.codeplex.com/,Legitimate
1,http://www.my-cataract-eye-drops.com/,Legitimate
2,http://www.forums.appleinsider.com/showthread....,Legitimate
3,http://www.wn.com/Lambda_Chi_Alpha,Legitimate
4,http://www.en-pi.facebook.com/people/Connor-La...,Legitimate
5,http://www.people.uwec.edu/DUPONTJE/profile.htm,Legitimate
6,http://www.chantal-onlinemarketingsecrets.blog...,Legitimate
7,http://www.blog.realkangnaranaut.com/,Legitimate
8,http://www.genforum.genealogy.com/pollock/page...,Legitimate
9,http://www.sportsillustrated.cnn.com/basketbal...,Legitimate


In [None]:
# Getting the dimensionality of the legit_urls dataframe
legit_urls.shape

(19000, 2)

# **3. Feature Extraction:**

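Now, we will extract the features from the phishing URLs and legitimate URLs datasets. Most features are encoded as -1, 0, or 1 (URL depth is a count). Which value marks a suspicious URL differs between features, so each subsection states its own encoding.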

**Feature Extraction Categories:**
1. Address Bar based Features
2. Domain based Features
3. HTML & Javascript based Features

## **3.1. Address Bar based Features:**

Many features can be extracted from the address bar. We consider the following:

1. IP Address in URL
2. Length of URL
3. Using URL Shortening Services
4. "@" Symbol in URL
5. Redirection "//" in URL
6. Prefix or Suffix "-" in Domain
7. Sub domain in the URL
8. Depth of URL
9. https in the URL


The description and implementation of the above features are as below:

In [None]:
# Installing python WHOIS module to get parsed WHOIS data for a given domain
!pip install python-whois

Collecting python-whois
  Downloading python-whois-0.8.0.tar.gz (109 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: python-whois
  Building wheel for python-whois (setup.py) ... done
  Created wheel for python-whois: filename=python_whois-0.8.0-py3-none-any.whl size=103246 sha256=2aa28b438e8a994e038d098c56ba3a4c7d685206d4704873ae6fd2e5fd5ba875
  Stored in directory: /root/.cache/pip/wheels/10/f1/87/145023b9a206e2e948be6480c61ef3fd3dbb81ef11b6977782
Successfully built python-whois
Installing collected packages: python-whois
Successfully installed python-whois-0.8.0


In [None]:
# importing the libraries/packages required for the features code implementation
import re
from bs4 import BeautifulSoup
import whois
import requests
import urllib
import urllib.request
import ipaddress
import time
from datetime import date, datetime
from dateutil.parser import parse as date_parse
from urllib.parse import urlparse,urlencode
from googlesearch import search
import pickle


import sys
sys.path.append("/content/drive/My Drive/ENPM809K Project/")

from Feature_Extraction import FeatureExtraction

**3.1.1. IP Address in the URL**

Here, we check whether a URL uses an IP address in place of a domain name. An IP address in the URL strongly suggests an attempt to collect personal information deceitfully. In our feature extraction, if the host part of the URL is an IP address, the code returns -1 (indicating phishing); otherwise it returns 1 (indicating legitimacy).

In [None]:
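# 1.Checks for IP address in URL (Using_IP)
def UsingIP(self):
    try:
        # ip_address() accepts only a bare address, so test the host, not the full URL
        ipaddress.ip_address(urlparse(self.url).hostname)
        return -1
    except ValueError:
        return 1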
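
As a standalone illustration of the check described above, the sketch below parses the host out of the URL before testing it, because `ipaddress.ip_address` accepts only a bare address and rejects a full URL. The function name `host_is_ip` and the sample URLs are illustrative assumptions, not part of the project code.

```python
import ipaddress
from urllib.parse import urlparse

def host_is_ip(url: str) -> bool:
    """Return True when the host part of `url` is a literal IPv4/IPv6 address."""
    host = urlparse(url).hostname  # None when the URL has no network location
    if host is None:
        return False
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False

print(host_is_ip("http://192.168.10.5/login.php"))
print(host_is_ip("https://www.example.com/"))
```

Mapping the boolean to the feature value (-1 or 1) is left to the caller, so the same helper works with either sign convention.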

**3.1.2. Length of URL**

This feature buckets a URL by its length. Phishers often use long URLs to conceal suspicious content in the address bar. In this project, a URL shorter than 54 characters returns 1 (legitimate), a URL of 54 to 75 characters returns 0 (suspicious), and a URL longer than 75 characters returns -1 (phishing).

In [None]:
# 2.Finding the length of URL and categorizing (Length)
def GetLength(self):
    try:
        if len(self.url) < 54:
            return 1
        if len(self.url) >= 54 and len(self.url) <= 75:
            return 0
        else:
            return -1
    except:
        return 0

**3.1.3. Usage of URL Shortening Services**

This feature checks for URL shortening, a technique on the World Wide Web where a shorter domain name redirects to a longer webpage URL.

If the URL matches one of the known shortening services, the code returns -1 (indicating phishing); otherwise it returns 1 (indicating legitimacy).

In [None]:
# 3. Checking for Shortening Services in URL (Tiny_URL)
def TinyURL(self):
    try:
        # Raw strings keep the regex escapes (\.) intact
        match=re.search(r'bit\.ly|goo\.gl|shorte\.st|go2l\.ink|x\.co|ow\.ly|t\.co|tinyurl|tr\.im|is\.gd|cli\.gs|'
                    r'yfrog\.com|migre\.me|ff\.im|tiny\.cc|url4\.eu|twit\.ac|su\.pr|twurl\.nl|snipurl\.com|'
                    r'short\.to|BudURL\.com|ping\.fm|post\.ly|Just\.as|bkite\.com|snipr\.com|fic\.kr|loopt\.us|'
                    r'doiop\.com|short\.ie|kl\.am|wp\.me|rubyurl\.com|om\.ly|to\.ly|bit\.do|t\.co|lnkd\.in|'
                    r'db\.tt|qr\.ae|adf\.ly|goo\.gl|bitly\.com|cur\.lv|tinyurl\.com|ow\.ly|bit\.ly|ity\.im|'
                    r'q\.gs|is\.gd|po\.st|bc\.vc|twitthis\.com|u\.to|j\.mp|buzurl\.com|cutt\.us|u\.bb|yourls\.org|'
                    r'x\.co|prettylinkpro\.com|scrnch\.me|filoops\.info|vzturl\.com|qr\.net|1url\.com|tweez\.me|v\.gd|tr\.im|link\.zip\.net', self.url)
        if match:
            return -1
        else:
            return 1
    except:
        return 0
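
A substring search over the whole URL can also match inside unrelated hosts: the pattern `t\.co`, for instance, occurs in `microsoft.com`. The sketch below shows one possible variation that compares the parsed host against a set of shortener domains instead. The helper name `uses_shortener` and the six-entry set are assumptions chosen for illustration; the project list is much longer.

```python
import re
from urllib.parse import urlparse

# Small illustrative subset of the services named in the project regex.
SHORTENER_HOSTS = {"bit.ly", "goo.gl", "t.co", "tinyurl.com", "ow.ly", "is.gd"}

def uses_shortener(url: str) -> bool:
    """Return True when the URL's host is exactly a known shortener domain."""
    host = (urlparse(url).hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    return host in SHORTENER_HOSTS

# The substring pattern matches inside an unrelated host name.
print(bool(re.search(r"t\.co", "https://www.microsoft.com/")))
# Host comparison does not.
print(uses_shortener("https://www.microsoft.com/"))
print(uses_shortener("https://bit.ly/3abcd"))
```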

**3.1.4. "@" Symbol in URL**

Here, we check for the presence of the '@' symbol in the URL. When the '@' symbol is used in a URL, it causes the browser to disregard everything before it, revealing the true address that often follows the '@' symbol.

In our feature extraction, if the URL contains an '@' symbol, the code returns -1 (indicating phishing); otherwise it returns 1 (indicating legitimacy).

In [None]:
# 4.Checks the presence of @ in URL (At_Symbol)
def HasAtSymbol(self):
    try:
        if re.findall("@", self.url):
            return -1
        else:
            return 1
    except:
        return 0
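
To make the effect of '@' concrete, the short sketch below uses the standard library's `urlparse` to split a URL that places a familiar name before the '@'. The URL is a made-up example that uses the reserved `.example` domain.

```python
from urllib.parse import urlparse

parts = urlparse("http://www.paypal.com@evil.example/login")

# Everything before '@' is parsed as user information, not as the host.
print(parts.username)
# The host that would actually be contacted follows the '@'.
print(parts.hostname)
```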

**3.1.5. Redirection "//" in URL**

This feature checks for a "//" in the URL beyond the protocol separator, which indicates a potential redirection to another website. If the URL begins with "HTTP", the "//" should start at the sixth position; for "HTTPS", it should start at the seventh position. The code takes the position of the last "//" in the URL: if it lies beyond the protocol part (0-based index greater than 7), the code returns -1 (indicating phishing); otherwise it returns 1 (indicating legitimacy).

In [None]:
# 5.Checking for redirection '//' in the url (Redirection)
def Redirecting(self):
    try:
        pos = self.url.rfind('//')
        if pos > 6:
            if pos > 7:
                return -1
            else:
                return 1
        else:
            return 1
    except:
        return 0
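
The positions involved can be checked directly with `str.rfind`, which returns a 0-based index. The following sketch prints the index of the last "//" for three made-up URLs; the helper name `last_double_slash` is hypothetical.

```python
def last_double_slash(url: str) -> int:
    """0-based index of the last '//' in `url`, or -1 if it is absent."""
    return url.rfind("//")

samples = (
    "http://example.com/a",                      # only the protocol separator
    "https://example.com/a",                     # only the protocol separator
    "http://example.com//http://evil.example/",  # a second '//' later in the URL
)
for u in samples:
    print(last_double_slash(u), u)
```

For the first two URLs the index is 5 and 6 (the sixth and seventh characters); for the third it is well beyond 7, which is the case the feature flags.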

**3.1.6. Prefix or Suffix "-" in Domain**

This feature looks for the presence of a hyphen ("-") in the domain section of a URL. Hyphens are seldom used in legitimate URLs, but phishers often add them as prefixes or suffixes to the domain name in an attempt to make their websites appear legitimate.

If a URL contains a hyphen in the domain part, the code returns -1 (indicating phishing); otherwise it returns 1 (indicating legitimacy).

In [None]:
# 6.Checking for Prefix or Suffix Separated by (-) in the Domain (Prefix/Suffix)
def PrefixSuffix(self):
    try:
        match = re.findall(r'-', self.domain)
        if match:
            return -1
        return 1
    except:
        return 0

**3.1.7. Sub Domain of the URL**

Here, we count the dots in the URL to estimate how many subdomains it uses, which helps distinguish legitimate from phishing URLs. The code returns 1 for a URL with a single dot (no subdomain, indicating legitimacy), 0 for two dots (one subdomain, suspicious), and -1 otherwise (multiple subdomains, indicating phishing). Note that the count runs over the full URL, so a leading "www." and any dots in the path are included.

In [None]:
# 7. Sub Domain of the URL (Sub_Domain)
def SubDomain(self):
    try:
        dot_count = len(re.findall(r"\.", self.url))
        if dot_count == 1:
            return 1
        elif dot_count == 2:
            return 0
        else:
            return -1
    except:
        return 0
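
Counting dots over the whole URL also counts a leading "www." and file extensions in the path. As a hedged variation, the sketch below counts only the dots in the parsed host after dropping "www."; the helper name `host_dot_count` is an assumption, and the sample URL is taken from the legitimate dataset preview above.

```python
from urllib.parse import urlparse

def host_dot_count(url: str) -> int:
    """Number of dots in the host of `url`, ignoring a leading 'www.'."""
    host = (urlparse(url).hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    return host.count(".")

url = "http://www.weather.noaa.gov/weather/PH_cc.html"
print(url.count("."))       # dots in the full URL, path included
print(host_dot_count(url))  # dots in the host only
```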

**3.1.8. Depth of URL**

This feature calculates the depth of a URL by counting the non-empty segments of its path, i.e., the sub-pages separated by forward slashes ('/'). The feature value is the numerical depth of the URL structure.

In [None]:
# 8.Gives the number of non-empty path segments in the URL (URL_Depth)
def GetDepth(self):
    try:
        s = urlparse(self.url).path.split('/')
        depth = 0
        for i in range(len(s)):
            if len(s[i]) != 0:
                depth += 1
        return depth
    except:
        return 0
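
A compact standalone version of the same count, with a few URLs from the dataset previews above, may help confirm what "depth" means here. The function name `url_depth` is illustrative.

```python
from urllib.parse import urlparse

def url_depth(url: str) -> int:
    """Count the non-empty segments in the path of `url`."""
    return sum(1 for segment in urlparse(url).path.split("/") if segment)

for u in ("https://heybn.webcindario.com/",
          "http://cz13521.tw1.ru/login/ologin.php",
          "http://www.people.uwec.edu/DUPONTJE/profile.htm"):
    print(url_depth(u), u)
```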

**3.1.9. HTTPS in the URL**

This feature examines the URL's scheme to identify whether it uses 'https', denoting a secure connection. It returns -1 for 'https' URLs, 1 for any other scheme, and 0 if the scheme cannot be read.

In [None]:
# 9.Checks whether the URL scheme is 'https' (Https_URL)
def HttpsinURL(self):
    try:
        https = self.urlparse.scheme
        if 'https' in https:
            return -1
        return 1
    except:
        return 0

## **3.2. Domain Based Features:**

Many features can be extracted from the domain of a URL. We consider the following:

1. Non standard Port
2. HTTPS Domain URL
3. Info Email
4. Abnormal URL
5. Website Forwarding
6. Website Traffic
7. DNS Record
8. Age of Domain
9. Domain Registration Length
10. Page Rank
11. Google Index

The description and implementation of the above features are as below:



**3.2.1 Non Standard Port**

This feature assesses whether a URL specifies a port in its domain section. It returns -1 if a port is present (indicating higher suspicion), 1 if the default port is used, and 0 on error.

In [None]:
# 10. checking if URL uses a non-standard port in domain (NonStdPort)
def NonStdPort(self):
    try:
        port = self.domain.split(":")
        if len(port) > 1:
            return -1
        return 1
    except:
        return 0
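
Splitting the domain on ":" flags any explicit port, including the defaults 80 and 443. One possible refinement, sketched below, compares the parsed port with the default for the URL's scheme. The helper name `has_non_default_port` and the two-entry default table are assumptions made for this illustration.

```python
from urllib.parse import urlparse

# Assumption: only the scheme's own default port counts as "standard".
DEFAULT_PORTS = {"http": 80, "https": 443}

def has_non_default_port(url: str) -> bool:
    """Return True when `url` names a port other than its scheme's default."""
    parts = urlparse(url)
    port = parts.port  # None when no port is written in the URL
    if port is None:
        return False
    return port != DEFAULT_PORTS.get(parts.scheme)

print(has_non_default_port("http://example.com:8080/"))
print(has_non_default_port("https://example.com:443/"))
print(has_non_default_port("https://example.com/"))
```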

**3.2.2 HTTPS Domain URL**

This feature examines whether the "https" token is present in the domain section of the URL. Phishers might include such a token to deceive users.

In this project, if "https" is found in the domain part of the URL, the feature is assigned a value of 1 (indicating phishing); otherwise, it receives a value of -1 (indicating legitimacy).

In [None]:
# 11. Existence of 'HTTP' or 'HTTPS' Token in the Domain Part of the URL (Https_Domain)
def HttpsDomain(self):
    try:
        if 'https' in self.domain:
            return 1
        return -1
    except:
        return 0

**3.2.3 Info Email**

This feature examines webpage content for email-related elements or links (e.g., "mailto:" or "mail()" functions) commonly used in phishing. It returns 1 if such elements are found, indicating a higher phishing likelihood, and -1 if they are not detected. In case of errors, it returns 0.

In [None]:
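# 12. Check if webpage content has email elements (Info_Email)
def InfoEmail(self):
    try:
        # Alternation, not a character class: match the literal "mail()" or "mailto:".
        # The soup object is converted to a string because re works on text.
        if re.findall(r"mail\(\)|mailto:", str(self.soup)):
            return 1
        else:
            return -1
    except:
        return 0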

**3.2.4 Abnormal URL**

This feature checks whether the webpage's response text matches the WHOIS response. It returns -1 if they match, suggesting legitimacy, and 1 if they don't. In case of errors, it returns 0.

In [None]:
# 13. Check if webpage content matches WHOIS response (Abnormal_URL)
def AbnormalURL(self):
    try:
        if self.response.text == self.whois_response:
            return -1
        else:
            return 1
    except:
        return 0

**3.2.5 Website Forwarding**

A key distinction between phishing and legitimate websites lies in the number of times a website has been redirected. In our dataset, legitimate websites typically have at most one redirection, whereas phishing websites have been redirected at least four times.

This feature counts the redirects in the website's response history. It returns -1 for at most one redirect (lower suspicion), 0 for two to four redirects, and 1 for more than four redirects (higher suspicion). In case of errors, it returns 0.

In [None]:
# 14. Check the number of redirections in webpage response history (Web_Forward)
def WebsiteForwarding(self):
    try:
        if len(self.response.history) <= 1:
            return -1
        elif len(self.response.history) <= 4:
            return 0
        else:
            return 1
    except:
        return 0

**3.2.6. Website Traffic**

This feature assesses the popularity of a website by examining its ranking, which is determined by the number of visitors and the pages they access. Since phishing websites tend to have short lifespans, they may not be indexed by the Alexa database. After analyzing the dataset, we observed that even in the worst case, legitimate websites are typically ranked among the top 100,000 by Alexa.

Therefore, if a domain is ranked within the top 100,000, it is considered legitimate (assigned a value of -1); if it is ranked lower, it is considered phishing (assigned a value of 1). If the lookup fails, for example because the site is not recognized by Alexa, the feature is 0.

In [None]:
# 15. Checking Website traffic with Alexa database (Web_Traffic)
def WebsiteTraffic(self):
    try:
        #Filling the whitespaces in the URL if any
        url = urllib.parse.quote(self.url)
        site_rank = BeautifulSoup(urllib.request.urlopen("http://data.alexa.com/data?cli=10&dat=s&url=" + url).read(), "xml").find(
            "REACH")['RANK']
        site_rank = int(site_rank)
        if site_rank <100000:
            return -1
        else:
            return 1
    except:
        return 0

**3.2.7. DNS Record**

This feature concerns whether a website's claimed identity is recognized by the WHOIS database. The implementation uses the WHOIS creation date as a proxy: if a record is found and the domain was created at least 6 months ago, the feature is assigned a value of -1 (indicating legitimacy); if the domain is younger, it is assigned a value of 1 (indicating phishing). If no record is found or the lookup fails, the feature is 0.

In [None]:
# 16. DNS Record availability (DNS_Record)
def DnsRecord(self):
    try:
        creation_date = self.whois_response.creation_date
        try:
            if(len(creation_date)):
                creation_date = creation_date[0]
        except:
            pass

        today  = date.today()
        age = (today.year-creation_date.year)*12+(today.month-creation_date.month)
        if age >=6:
            return -1
        return 1
    except:
        return 0

**3.2.8. Age of Domain**

This feature is extracted from the WHOIS database and focuses on the age of a domain. Phishing websites are known for their short lifespans. In this project, a legitimate domain is defined as having a minimum age of 6 months, calculated as the number of months between the creation date and today.

If the domain's age is 6 months or more, the feature is given a value of -1 (indicating legitimacy). If the age is less than 6 months, the feature is assigned a value of 1 (indicating phishing). Errors yield 0.

In [None]:
# 17. Age of domain: months between the WHOIS creation date and today (Age_Domain)
def AgeofDomain(self):
    try:
        creation_date = self.whois_response.creation_date
        try:
            if(len(creation_date)):
                creation_date = creation_date[0]
        except:
            pass
        today = date.today()
        age = (today.year-creation_date.year)*12+(today.month-creation_date.month)
        if age >= 6:
            return -1
        else:
            return 1
    except:
        return 0
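
The month arithmetic used by the WHOIS-based features can be tried on its own without a network lookup. In the sketch below, `first_date` and `months_between` are hypothetical helper names, and the dates are made-up values standing in for a WHOIS creation date, which may arrive either as a single datetime or as a list of them.

```python
from datetime import date, datetime

def first_date(value):
    """WHOIS fields may hold one datetime or a list of them; keep the first."""
    return value[0] if isinstance(value, list) else value

def months_between(start, end) -> int:
    """Whole-month difference that ignores the day of the month."""
    return (end.year - start.year) * 12 + (end.month - start.month)

created = first_date([datetime(2023, 8, 28), datetime(2023, 8, 29)])
print(months_between(created, date(2023, 12, 11)))

# Because days are ignored, 31 January to 1 February counts as one month.
print(months_between(date(2023, 1, 31), date(2023, 2, 1)))
```

Since the day of the month is ignored, a domain registered on the last day of a month appears one month old on the following day; a day-based threshold would avoid this if finer resolution is needed.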


**3.2.9 Domain Registration Length**

This feature calculates the registration length of a domain as the number of months between its creation and expiration dates from WHOIS data. If the registration spans at least 12 months, it returns -1, indicating a longer domain registration (legitimate). If it spans less than 12 months, it returns 1. If the date information cannot be retrieved, it returns 0.

In [None]:
# 18. Check the registration length using creation and expiration dates from WHOIS data (DomainRegLen)
def DomainRegLen(self):
    try:
        expiration_date = self.whois_response.expiration_date
        creation_date = self.whois_response.creation_date
        try:
            if(len(expiration_date)):
                expiration_date = expiration_date[0]
        except:
            pass
        try:
            if(len(creation_date)):
                creation_date = creation_date[0]
        except:
            pass

        age = (expiration_date.year-creation_date.year)*12+ (expiration_date.month-creation_date.month)
        if age >=12:
            return -1
        return 1
    except:
        return 0

**3.2.10 Google Index**

This feature checks if a website is indexed by Google by performing a search query and assessing the search results. If the search returns results, the feature is -1, indicating visibility in Google's index (lower suspicion); if it returns none, the feature is 1. If there are errors in the process, it returns 0.

In [None]:
# 19. Checking the website if it is indexed by google (Google_Index)
def GoogleIndex(self):
    try:
        # search() yields results lazily; build a list so an empty result set is falsy
        site = list(search(self.url, 5))
        if site:
            return -1
        else:
            return 1
    except:
        return 0

## **3.3. HTML and JavaScript based Features**

Many features can be extracted from a page's HTML and JavaScript. We consider the following:

1. Status Bar Customization
2. Disabling Right Click
3. IFrame Redirection
4. Anchor URL
5. Server Form Handler
6. Using Popup Window
7. Links pointing to the page

The description and implementation of the above features are as below:



**3.3.1. Status Bar Customization**

This feature involves examining web page source code, specifically focusing on the "onMouseOver" event, to check if it attempts to modify the status bar. Phishers may use JavaScript to display a fake URL in the status bar to deceive users.

In the feature extraction process, the code returns -1 if a script block containing "onmouseover" is detected in the response and 1 if it is not. If the response cannot be read, the feature is 0.

In [None]:
# 20. Checks the effect of mouse over on status bar (Status_Bar_Cust)
def StatusBarCust(self):
    try:
        if re.findall("<script>.+onmouseover.+</script>", self.response.text):
            return -1
        else:
            return 1
    except:
        return 0

**3.3.2. Disabling Right Click**

This feature involves inspecting the webpage source code to identify the presence of an event called "event.button==2," which is often used by phishers to disable the right-click function, preventing users from viewing or saving the webpage source code.

The code returns -1 if "event.button==2" is found in the source code and 1 if it is not. If the response cannot be read, the feature is 0.

In [None]:
# 21. Checks the status of the right click attribute (Disable_Right_Click)
def DisableRightClick(self):
    try:
        if re.findall(r"event\.button ?== ?2", self.response.text):
            return -1
        else:
            return 1
    except:
        return 0

**3.3.3. IFrame Redirection**

This feature focuses on the use of the "iframe" HTML tag, which is employed to display an additional webpage within the current one. Phishers may utilize the "iframe" tag while making it invisible, removing frame borders using the "frameBorder" attribute, which hides the visual separation in the browser.

In this context, the code returns -1 if the response contains a bare "<iframe>" or "<frameBorder>" tag and 1 if it does not. If the response cannot be read, the feature is 0.

In [None]:
# 22. IFrame Redirection (Iframe_Redirect)
def IframeRedirection(self):
    try:
        if re.findall(r"<iframe>|<frameBorder>", self.response.text):
            return -1
        else:
            return 1
    except:
        return 0

**3.3.4 Anchor URL**

This feature analyzes anchor links in a webpage to assess their safety. It calculates the percentage of potentially unsafe links (e.g., those containing "#", "javascript", or "mailto", or missing the current domain) and returns -1 when fewer than 31% are unsafe (lower suspicion), 0 for 31% up to 67% (moderate suspicion), and 1 for 67% or more (higher suspicion). If the percentage cannot be computed, it returns 1; if the page cannot be parsed at all, it returns 0.

In [None]:
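# 23. Checking the safety of the anchor links in the URL (Anchor_URL)
def AnchorURL(self):
    try:
        i, unsafe = 0, 0
        for a in self.soup.find_all('a', href=True):
            href = a['href']
            # An anchor is unsafe if it is a fragment, script, or mail link,
            # or if it references neither this page's URL nor its domain.
            if "#" in href or "javascript" in href.lower() or "mailto" in href.lower() or not (self.url in href or self.domain in href):
                unsafe += 1
            # Count every anchor so the percentage uses the true total.
            i += 1
        try:
            percentage = unsafe / float(i) * 100
            if percentage < 31.0:
                return -1
            elif ((percentage >= 31.0) and (percentage < 67.0)):
                return 0
            else:
                return 1
        except ZeroDivisionError:
            # The page has no anchors.
            return 1
    except Exception:
        # The page could not be fetched or parsed.
        return 0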
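
The thresholds can be checked in isolation with a small pure function. This is a sketch, assuming the anchors have already been extracted into a list of "href" strings; `anchor_score` is a hypothetical helper name, and the 31% and 67% cut-offs are taken from the method above.

```python
def anchor_score(hrefs, url, domain):
    """Score a list of href strings with the 31% / 67% thresholds."""
    if not hrefs:
        # No anchors: mirror the method's inner except branch.
        return 1
    unsafe = 0
    for href in hrefs:
        lowered = href.lower()
        if ("#" in href or "javascript" in lowered or "mailto" in lowered
                or not (url in href or domain in href)):
            unsafe += 1
    percentage = unsafe / len(hrefs) * 100
    if percentage < 31.0:
        return -1
    if percentage < 67.0:
        return 0
    return 1


# Hypothetical page URL and links used only for illustration.
url = "http://example.test/"
domain = "example.test"
hrefs = [
    "http://example.test/about",    # safe: same domain
    "http://example.test/contact",  # safe: same domain
    "javascript:void(0)",           # unsafe: script link
    "http://other.test/login",      # unsafe: external domain
]
print(anchor_score(hrefs, url, domain))  # 2 of 4 unsafe = 50%, prints 0
```

Note that a relative link such as "/about" counts as unsafe under this rule, because it contains neither the URL nor the domain.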

**3.3.5 Server Form Handler**

This feature inspects the "action" attribute of the web forms on a page. It returns -1 if no form has an action attribute, or if the first form's action points to the page's own URL or domain (lower suspicion); 1 if that action is empty or "about:blank" (higher suspicion); and 0 if it points to an external domain (moderate suspicion). If the page cannot be fetched or parsed, it returns 0.

In [None]:
# 24. checking if there are any form actions which are empty (Server_Form_Handler)
def ServerFormHandler(self):
    try:
        if len(self.soup.find_all('form', action=True))==0:
            return -1
        else :
            for form in self.soup.find_all('form', action=True):
                if form['action'] == "" or form['action'] == "about:blank":
                    return 1
                elif self.url not in form['action'] and self.domain not in form['action']:
                    return 0
                else:
                    return -1
    except:
        return 0
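
The three-way decision for a single form can be sketched as a pure function. This assumes the action string has already been read from the form; `classify_form_action` is a hypothetical name, and the URL and domain below are placeholders.

```python
def classify_form_action(action, url, domain):
    """Map one form action string to 1, 0, or -1."""
    if action == "" or action == "about:blank":
        return 1   # empty handler: higher suspicion
    if url not in action and domain not in action:
        return 0   # handler does not mention this site: moderate suspicion
    return -1      # handler on the page's own site: lower suspicion


# Hypothetical page URL and domain used only for illustration.
url = "http://example.test/login"
domain = "example.test"
for action in ["", "about:blank", "http://other.test/collect", "http://example.test/submit"]:
    print(repr(action), classify_form_action(action, url, domain))
```

A relative action such as "/submit" falls into the 0 class under this rule, since it contains neither the URL nor the domain.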

**3.3.6 Using Popup Window**

This feature scans a webpage's source code for JavaScript "alert()" calls. It returns -1 if such calls are found (lower suspicion of pop-up abuse) and 1 if none are found (higher suspicion). If the page cannot be fetched or parsed, it returns 0.

In [None]:
# 25. Check if the webpage has alert() calls (Using_Popup_Window)
def UsingPopupWindow(self):
    try:
        if re.findall(r"alert\(", self.response.text):
            return -1
        else:
            return 1
    except:
        return 0

**3.3.7 Links pointing to the page**

This feature counts the anchor tags in a webpage's source code by matching the text "<a href="; the link targets themselves are not inspected. It returns -1 if there are no such links (lower suspicion), 0 if there are 1 or 2 (moderate suspicion), and 1 if there are more than 2 (higher suspicion). If the page cannot be fetched or parsed, it returns 0.

In [None]:
# 26. Check if the webpage has anchor links  to the page itself (Links_Pointing_Page)
def LinksPointingToPage(self):
    try:
        number_of_links = len(re.findall(r"<a href=", self.response.text))
        if number_of_links == 0:
            return -1
        elif number_of_links <= 2:
            return 0
        else:
            return 1
    except:
        return 0
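
The pattern "<a href=" matches only anchors written in lowercase with "href" as the first attribute. The sketch below compares that count with a parser-based count, assuming the standard-library `html.parser` module; `AnchorCounter` and `count_anchor_hrefs` are hypothetical names.

```python
import re
from html.parser import HTMLParser


class AnchorCounter(HTMLParser):
    """Counts <a> tags that carry an href attribute."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lowercased.
        if tag == "a" and any(name == "href" for name, _ in attrs):
            self.count += 1


def count_anchor_hrefs(html):
    counter = AnchorCounter()
    counter.feed(html)
    return counter.count


# Hypothetical markup used only for illustration.
html = '<a href="/one">1</a><a class="nav" href="/two">2</a><A HREF="/three">3</A>'
print(len(re.findall(r"<a href=", html)))  # regex count: prints 1
print(count_anchor_hrefs(html))            # parser count: prints 3
```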

# **4. Compute URL Features**

Here, we extract the features of each URL using the functions defined above and store those features in a list.

In [None]:
class FeatureExtraction:
    features = []
    def __init__(self, url, label):
        self.features = []
        self.url = url
        self.domain = ""
        self.whois_response = ""
        self.urlparse = ""
        self.response = ""
        self.soup = ""
        self.label = label

        try:
            self.response = requests.get(self.url, timeout=1)
            self.soup = BeautifulSoup(self.response.text, 'html.parser')
        except:
            pass

        try:
            self.urlparse = urlparse(self.url)
            self.domain = self.urlparse.netloc
        except:
            pass

        try:
            self.whois_response = whois.whois(self.domain, timeout=1)
        except:
            pass

        self.features.append(self.url)

        self.features.append(self.UsingIp())
        self.features.append(self.GetLength())
        self.features.append(self.TinyUrl())
        self.features.append(self.HasAtSymbol())
        self.features.append(self.Redirecting())
        self.features.append(self.PrefixSuffix())
        self.features.append(self.SubDomain())
        self.features.append(self.GetDepth())
        self.features.append(self.HttpsinURL())


        self.features.append(self.NonStdPort())
        self.features.append(self.HttpsDomain())
        self.features.append(self.InfoEmail())
        self.features.append(self.AbnormalURL())
        self.features.append(self.WebsiteForwarding())
        self.features.append(self.WebsiteTraffic())
        self.features.append(self.DnsRecord())
        self.features.append(self.AgeofDomain())
        self.features.append(self.DomainRegLen())
        self.features.append(self.GoogleIndex())


        self.features.append(self.StatusBarCust())
        self.features.append(self.DisableRightClick())
        self.features.append(self.IframeRedirection())
        self.features.append(self.AnchorURL())
        self.features.append(self.ServerFormHandler())
        self.features.append(self.UsingPopupWindow())
        self.features.append(self.LinksPointingToPage())

        self.features.append(self.label)

    # Above function implementations

    def getFeaturesList(self):
        return self.features

# **4.1 Legitimate URL Feature Extraction**

Here, we will perform feature extraction on Legitimate URLs using the class **FeatureExtraction** and its methods

In [None]:
# Getting the dimensionality of the legit_urls dataframe
legit_urls.shape

(19000, 2)

In [None]:
#column names used when converting the features list to a dataframe
feature_names = ['URL', 'Using_IP', 'Length', 'Tiny_URL', 'At_Symbol', 'Redirection', 'Prefix/Suffix', 'Sub_Domain',
                'URL_Depth', 'Https_URL', 'NonStdPort', 'Https_Domain', 'Info_Email', 'Abnormal_URL', 'Web_Forward',
                 'Web_Traffic', 'DNS_Record', 'Age_Domain', 'DomainRegLen', 'Google_Index', 'Status_Bar_Cust',
                 'Disable_Right_Click', 'Iframe_Redirect', 'Anchor_URL', 'Server_Form_Handler', 'Using_Popup_Window',
                 'Links_Pointing_Page', 'Label']
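
Each row returned by `getFeaturesList()` should line up with these 28 names: the URL, the 26 features, and the label. The following sketch checks that alignment before a row reaches the dataframe. `row_to_record` is a hypothetical helper, and the example row is made up.

```python
feature_names = ['URL', 'Using_IP', 'Length', 'Tiny_URL', 'At_Symbol', 'Redirection', 'Prefix/Suffix', 'Sub_Domain',
                 'URL_Depth', 'Https_URL', 'NonStdPort', 'Https_Domain', 'Info_Email', 'Abnormal_URL', 'Web_Forward',
                 'Web_Traffic', 'DNS_Record', 'Age_Domain', 'DomainRegLen', 'Google_Index', 'Status_Bar_Cust',
                 'Disable_Right_Click', 'Iframe_Redirect', 'Anchor_URL', 'Server_Form_Handler', 'Using_Popup_Window',
                 'Links_Pointing_Page', 'Label']


def row_to_record(row, names=feature_names):
    """Pair a feature row with its column names, failing loudly on a mismatch."""
    if len(row) != len(names):
        raise ValueError(f"expected {len(names)} values, got {len(row)}")
    return dict(zip(names, row))


# Made-up row: a URL, 26 placeholder feature values, and a label.
example_row = ["http://example.test/"] + [0] * 26 + [1]
record = row_to_record(example_row)
print(len(feature_names), record["URL"], record["Label"])
```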

In [None]:
#Extracting the features & storing them in a list and then to a csv file
legit_URL_features_1 = []
label = 0

for i in range(0, 5000):
  url = legit_urls['URLs'][i]
  featureExtraction = FeatureExtraction(url, label)
  legit_URL_features_1.append(featureExtraction.getFeaturesList())

legitimate1 = pd.DataFrame(legit_URL_features_1, columns= feature_names)
legitimate1.to_csv('/content/drive/My Drive/ENPM809K Project/legit_features/legitimate_features1.csv', index= False)



In [None]:
#Extracting the features & storing them in a list and then to a csv file
legit_URL_features_2 = []
label = 0

for i in range(5000, 10000):
  url = legit_urls['URLs'][i]
  featureExtraction = FeatureExtraction(url, label)
  legit_URL_features_2.append(featureExtraction.getFeaturesList())

legitimate2 = pd.DataFrame(legit_URL_features_2, columns= feature_names)
legitimate2.to_csv('/content/drive/My Drive/ENPM809K Project/legit_features/legitimate_features2.csv', index= False)

In [None]:
#Extracting the features & storing them in a list and then to a csv file
legit_URL_features_3 = []
label = 0

for i in range(10000, 15000):
  url = legit_urls['URLs'][i]
  featureExtraction = FeatureExtraction(url, label)
  legit_URL_features_3.append(featureExtraction.getFeaturesList())

legitimate3 = pd.DataFrame(legit_URL_features_3, columns= feature_names)
legitimate3.to_csv('/content/drive/My Drive/ENPM809K Project/legit_features/legitimate_features3.csv', index= False)



In [None]:
#Extracting the features & storing them in a list and then to a csv file
legit_URL_features_4 = []
label = 0

for i in range(15000, 19000):
  url = legit_urls['URLs'][i]
  featureExtraction = FeatureExtraction(url, label)
  legit_URL_features_4.append(featureExtraction.getFeaturesList())

legitimate4 = pd.DataFrame(legit_URL_features_4, columns= feature_names)
legitimate4.to_csv('/content/drive/My Drive/ENPM809K Project/legit_features/legitimate_features4.csv', index= False)
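
The four cells above repeat the same loop over different index ranges. A single helper could express the same batching; the sketch below is one possible shape, using the standard-library `csv` module and a stand-in extractor so that it runs without network access. The names `extract_in_batches` and `fake_extractor` are hypothetical.

```python
import csv
import os
import tempfile


def extract_in_batches(urls, label, extractor, out_dir, prefix, batch_size=5000, header=None):
    """Run extractor(url, label) over urls, writing one CSV file per batch."""
    paths = []
    for start in range(0, len(urls), batch_size):
        batch = urls[start:start + batch_size]
        path = os.path.join(out_dir, f"{prefix}{start // batch_size + 1}.csv")
        with open(path, "w", newline="", encoding="utf-8") as handle:
            writer = csv.writer(handle)
            if header:
                writer.writerow(header)
            for url in batch:
                writer.writerow(extractor(url, label))
        paths.append(path)
    return paths


# Stand-in extractor: returns a short row instead of fetching the page.
def fake_extractor(url, label):
    return [url, len(url), label]


with tempfile.TemporaryDirectory() as tmp:
    files = extract_in_batches(
        ["http://a.test", "http://b.test", "http://c.test"],
        label=0, extractor=fake_extractor, out_dir=tmp,
        prefix="legitimate_features", batch_size=2,
        header=["URL", "Length", "Label"],
    )
    print([os.path.basename(p) for p in files])
```

In the notebook, the stand-in would presumably be replaced by a function that builds `FeatureExtraction(url, label)` and returns `getFeaturesList()`. Writing each batch to its own file keeps earlier results if a later batch is interrupted.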



In [None]:
#converting the features list to dataframe
feature_names = ['URL', 'Using_IP', 'Length', 'Tiny_URL', 'At_Symbol', 'Redirection', 'Prefix/Suffix', 'Sub_Domain',
                'URL_Depth', 'Https_URL', 'NonStdPort', 'Https_Domain', 'Info_Email', 'Abnormal_URL', 'Web_Forward',
                 'Web_Traffic', 'DNS_Record', 'Age_Domain', 'DomainRegLen', 'Google_Index', 'Status_Bar_Cust',
                 'Disable_Right_Click', 'Iframe_Redirect', 'Anchor_URL', 'Server_Form_Handler', 'Using_Popup_Window',
                 'Links_Pointing_Page', 'Label']

legitimate1 = pd.read_csv('legitimate_features1.csv')
legitimate2 = pd.read_csv('legitimate_features2.csv')
legitimate3 = pd.read_csv('legitimate_features3.csv')
legitimate4 = pd.read_csv('legitimate_features4.csv')
#Concatenating the dataframes
legit_features_data = pd.concat([legitimate1, legitimate2, legitimate3, legitimate4]).reset_index(drop=True)
legitimate = pd.DataFrame(legit_features_data, columns= feature_names)
legitimate.head()

Unnamed: 0,URL,Using_IP,Length,Tiny_URL,At_Symbol,Redirection,Prefix/Suffix,Sub_Domain,URL_Depth,Https_URL,...,DomainRegLen,Google_Index,Status_Bar_Cust,Disable_Right_Click,Iframe_Redirect,Anchor_URL,Server_Form_Handler,Using_Popup_Window,Links_Pointing_Page,Label
0,http://www.ticketsparapymes.codeplex.com/,1,1,-1,1,1,1,-1,0,1,...,0,-1,0,0,0,0,0,0,0,0
1,http://www.my-cataract-eye-drops.com/,1,1,1,1,1,-1,0,0,1,...,0,-1,0,0,0,0,0,0,0,0
2,http://www.forums.appleinsider.com/showthread....,1,0,1,1,1,1,-1,1,1,...,0,-1,0,0,0,0,0,0,0,0
3,http://www.wn.com/Lambda_Chi_Alpha,1,1,1,1,1,1,0,1,1,...,0,-1,1,1,1,0,0,-1,1,0
4,http://www.en-pi.facebook.com/people/Connor-La...,1,0,1,1,1,-1,-1,3,1,...,0,-1,0,0,0,0,0,0,0,0


In [None]:
# Getting the dimensionality of the legitimate dataframe
legitimate.shape

(18997, 28)

In [None]:
# Storing the extracted legitimate URLs features to csv file
legitimate.to_csv('/content/drive/My Drive/ENPM809K Project/URL_features_data/legitimate_features.csv', index= False)

# **4.2 Phishing URL Feature Extraction**

Here, we will perform feature extraction on Phishing URLs using the class **FeatureExtraction** and its methods

In [None]:
# Getting the dimensionality of the phish_urls dataframe
phish_urls.shape

(19000, 8)

In [None]:
#column names used when converting the features list to a dataframe
feature_names = ['URL', 'Using_IP', 'Length', 'Tiny_URL', 'At_Symbol', 'Redirection', 'Prefix/Suffix', 'Sub_Domain',
                'URL_Depth', 'Https_URL', 'NonStdPort', 'Https_Domain', 'Info_Email', 'Abnormal_URL', 'Web_Forward',
                 'Web_Traffic', 'DNS_Record', 'Age_Domain', 'DomainRegLen', 'Google_Index', 'Status_Bar_Cust',
                 'Disable_Right_Click', 'Iframe_Redirect', 'Anchor_URL', 'Server_Form_Handler', 'Using_Popup_Window',
                 'Links_Pointing_Page', 'Label']

In [None]:
#Extracting the features & storing them in a list and then to a csv file
phish_URL_features_1 = []
label = 1

for i in range(0, 5000):
  url = phish_urls['url'][i]
  featureExtraction = FeatureExtraction(url, label)
  phish_URL_features_1.append(featureExtraction.getFeaturesList())


phishing1 = pd.DataFrame(phish_URL_features_1, columns= feature_names)
phishing1.to_csv('/content/drive/My Drive/ENPM809K Project/phish_features/phishing_features1.csv', index= False)



In [None]:
#Extracting the features & storing them in a list and then to a csv file
phish_URL_features_2 = []
label = 1

for i in range(5000, 10000):
  url = phish_urls['url'][i]
  featureExtraction = FeatureExtraction(url, label)
  phish_URL_features_2.append(featureExtraction.getFeaturesList())


phishing2 = pd.DataFrame(phish_URL_features_2, columns= feature_names)
phishing2.to_csv('/content/drive/My Drive/ENPM809K Project/phish_features/phishing_features2.csv', index= False)



In [None]:
#Extracting the features & storing them in a list and then to a csv file
phish_URL_features_3 = []
label = 1

for i in range(10000, 15000):
  url = phish_urls['url'][i]
  featureExtraction = FeatureExtraction(url, label)
  phish_URL_features_3.append(featureExtraction.getFeaturesList())


phishing3 = pd.DataFrame(phish_URL_features_3, columns= feature_names)
phishing3.to_csv('/content/drive/My Drive/ENPM809K Project/phish_features/phishing_features3.csv', index= False)



In [None]:
#Extracting the features & storing them in a list and then to a csv file
phish_URL_features_4 = []
label = 1

for i in range(15000, 19000):
  url = phish_urls['url'][i]
  featureExtraction = FeatureExtraction(url, label)
  phish_URL_features_4.append(featureExtraction.getFeaturesList())


phishing4 = pd.DataFrame(phish_URL_features_4, columns= feature_names)
phishing4.to_csv('/content/drive/MyDrive/ENPM809K Project/phish_features/phishing_features4.csv', index= False)

In [None]:
#converting the features list to dataframe
feature_names = ['URL', 'Using_IP', 'Length', 'Tiny_URL', 'At_Symbol', 'Redirection', 'Prefix/Suffix', 'Sub_Domain',
                'URL_Depth', 'Https_URL', 'NonStdPort', 'Https_Domain', 'Info_Email', 'Abnormal_URL', 'Web_Forward',
                 'Web_Traffic', 'DNS_Record', 'Age_Domain', 'DomainRegLen', 'Google_Index', 'Status_Bar_Cust',
                 'Disable_Right_Click', 'Iframe_Redirect', 'Anchor_URL', 'Server_Form_Handler', 'Using_Popup_Window',
                 'Links_Pointing_Page', 'Label']

phishing1 = pd.read_csv('phishing_features1.csv')
phishing2 = pd.read_csv('phishing_features2.csv')
phishing3 = pd.read_csv('phishing_features3.csv')
phishing4 = pd.read_csv('phishing_features4.csv')
#Concatenating the dataframes
phish_features_data = pd.concat([phishing1, phishing2, phishing3, phishing4]).reset_index(drop=True)
phishing = pd.DataFrame(phish_features_data, columns= feature_names)
phishing.head()

Unnamed: 0,URL,Using_IP,Length,Tiny_URL,At_Symbol,Redirection,Prefix/Suffix,Sub_Domain,URL_Depth,Https_URL,...,DomainRegLen,Google_Index,Status_Bar_Cust,Disable_Right_Click,Iframe_Redirect,Anchor_URL,Server_Form_Handler,Using_Popup_Window,Links_Pointing_Page,Label
0,https://qrco.de/beHAGf,1,1,1,1,1,1,1,1,-1,...,0,-1,1,1,1,-1,-1,1,-1,1
1,https://cloudflare-ipfs.com/ipfs/bafkreidqr7tw...,1,-1,1,1,1,-1,1,2,-1,...,0,-1,1,1,1,0,0,1,0,1
2,https://pub-3f7e513b2d754cfe8bfdbd90c3a48c19.r...,1,0,1,1,1,-1,-1,1,-1,...,0,-1,1,1,1,0,-1,1,-1,1
3,https://bafkreica2uawqr6ilbjnrwkjsmos6k5nmp2bs...,1,-1,1,1,1,-1,-1,0,-1,...,0,-1,1,1,1,-1,-1,1,-1,1
4,https://skinsmonkey.csgotrades.org/auth.php?gc...,1,0,1,1,1,1,-1,1,-1,...,0,-1,0,0,0,0,0,0,0,1


In [None]:
# Getting the dimensionality of the phishing dataframe
phishing.shape

(18998, 28)

In [None]:
# Storing the extracted phishing URLs features to csv file
phishing.to_csv('/content/drive/MyDrive/ENPM809K Project/URL_features_data/phishing_features.csv', index= False)

# **5. Final Feature Dataset**

In the section above, we built two dataframes containing the legitimate and phishing URL features. We now concatenate them into a single dataframe, which we export as a CSV file and later use for transfer learning in the notebook *Phishing Website Detection Deep Learning.ipynb*.


In [None]:
# Load legitimate features and phishing features datasets
# Define the path to legitimate_features CSV file in Google Drive
legit_features_csv_path = '/content/drive/MyDrive/ENPM809K Project/URL_features_data/legitimate_features.csv'

# Loading the 'legitimate_features' csv file to a pandas dataframe
legitimate = pd.read_csv(legit_features_csv_path)

# Getting the dimensionality of the legitimate dataframe
legitimate.shape

(18997, 28)

In [None]:
# Define the path to phishing_features CSV file in Google Drive
phish_features_csv_path = '/content/drive/MyDrive/ENPM809K Project/URL_features_data/phishing_features.csv'

# Loading the 'phishing_features' csv file to a pandas dataframe
phishing = pd.read_csv(phish_features_csv_path)

# Getting the dimensionality of the phishing dataframe
phishing.shape

(18998, 28)

In [None]:
#Concatenating the dataframes
url_features_data = pd.concat([legitimate, phishing]).reset_index(drop=True)

In [None]:
# Getting the dimensionality of the url_features_data dataframe
url_features_data.shape

(37995, 28)

In [None]:
# Getting top 10 rows in the dataframe
url_features_data.head(n=10)

Unnamed: 0,URL,Using_IP,Length,Tiny_URL,At_Symbol,Redirection,Prefix/Suffix,Sub_Domain,URL_Depth,Https_URL,...,DomainRegLen,Google_Index,Status_Bar_Cust,Disable_Right_Click,Iframe_Redirect,Anchor_URL,Server_Form_Handler,Using_Popup_Window,Links_Pointing_Page,Label
0,http://www.ticketsparapymes.codeplex.com/,1,1,-1,1,1,1,-1,0,1,...,0,-1,0,0,0,0,0,0,0,0
1,http://www.my-cataract-eye-drops.com/,1,1,1,1,1,-1,0,0,1,...,0,-1,0,0,0,0,0,0,0,0
2,http://www.forums.appleinsider.com/showthread....,1,0,1,1,1,1,-1,1,1,...,0,-1,0,0,0,0,0,0,0,0
3,http://www.wn.com/Lambda_Chi_Alpha,1,1,1,1,1,1,0,1,1,...,0,-1,1,1,1,0,0,-1,1,0
4,http://www.en-pi.facebook.com/people/Connor-La...,1,0,1,1,1,-1,-1,3,1,...,0,-1,0,0,0,0,0,0,0,0
5,http://www.people.uwec.edu/DUPONTJE/profile.htm,1,1,1,1,1,1,-1,2,1,...,0,-1,0,0,0,0,0,0,0,0
6,http://www.chantal-onlinemarketingsecrets.blog...,1,0,-1,1,1,-1,-1,0,1,...,0,-1,1,1,1,0,0,1,1,0
7,http://www.blog.realkangnaranaut.com/,1,1,-1,1,1,1,-1,0,1,...,0,-1,0,0,0,0,0,0,0,0
8,http://www.genforum.genealogy.com/pollock/page...,1,1,1,1,1,1,-1,2,1,...,0,-1,1,1,1,0,-1,1,-1,0
9,http://www.sportsillustrated.cnn.com/basketbal...,1,0,1,1,1,1,-1,5,1,...,0,-1,0,0,0,0,0,0,0,0


In [None]:
# Getting bottom 10 rows in the dataframe
url_features_data.tail(n=10)

Unnamed: 0,URL,Using_IP,Length,Tiny_URL,At_Symbol,Redirection,Prefix/Suffix,Sub_Domain,URL_Depth,Https_URL,...,DomainRegLen,Google_Index,Status_Bar_Cust,Disable_Right_Click,Iframe_Redirect,Anchor_URL,Server_Form_Handler,Using_Popup_Window,Links_Pointing_Page,Label
37985,https://form.jotform.com/230664455735057,1,1,1,1,1,1,0,1,-1,...,0,-1,1,1,1,0,0,1,0,1
37986,https://nta-jp-nb.skin,1,1,1,1,1,-1,1,0,-1,...,0,-1,1,1,1,-1,-1,1,-1,1
37987,https://ipfs.eth.aragon.network/ipfs/bafybeicc...,1,-1,1,1,1,1,-1,2,-1,...,0,-1,1,1,1,-1,0,1,0,1
37988,https://cloudflare-ipfs.com/ipfs/Qmb4y7pCipc8A...,1,-1,1,-1,1,-1,-1,4,-1,...,0,-1,1,1,1,0,0,1,0,1
37989,https://putzvneaumqi-bohfrgzkfexn.web.app/,1,1,1,1,1,-1,0,0,-1,...,0,-1,1,1,1,-1,-1,1,-1,1
37990,https://insecurelongterms.us-sea-1.linodeobjec...,1,0,1,1,1,-1,-1,1,-1,...,0,-1,1,1,1,-1,-1,1,-1,1
37991,https://netflix-clone-six-vert.vercel.app/,1,1,1,1,1,-1,0,0,-1,...,0,-1,0,0,0,0,0,0,0,1
37992,https://cloudflare-ipfs.com/ipfs/bafybeic2m7h6...,1,-1,1,1,1,-1,1,2,-1,...,0,-1,1,1,1,0,0,1,0,1
37993,https://ipfs.eth.aragon.network/ipfs/bafybeici...,1,-1,1,1,1,1,-1,2,-1,...,0,-1,1,1,1,-1,-1,1,-1,1
37994,https://expressil-a749b.firebaseapp.com/,1,1,1,1,1,-1,0,0,-1,...,0,-1,1,1,1,-1,-1,1,-1,1


In [None]:
# Storing the extracted URL features data to csv file
url_features_data.to_csv('/content/drive/MyDrive/ENPM809K Project/URL_features_data/url_features.csv', index= False)
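
After the export, the class balance can be confirmed directly from the file. The sketch below counts rows per label with the standard-library `csv` module; `label_counts` is a hypothetical helper, and the demonstration uses a tiny temporary file rather than the real dataset.

```python
import csv
import os
import tempfile
from collections import Counter


def label_counts(csv_path, label_column="Label"):
    """Count rows per label value; values are read back as strings."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        return Counter(row[label_column] for row in reader)


# Tiny stand-in file with made-up rows.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "url_features.csv")
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["URL", "Label"])
        writer.writerows([["http://a.test", 0], ["http://b.test", 1], ["http://c.test", 1]])
    print(label_counts(path))
```

Run against the exported file, the counts should match the shapes reported above: 18997 rows with label 0 and 18998 rows with label 1.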

# **6. Conclusion**

With the above implementations, we achieved the objective of this notebook: we extracted 26 features from 37,995 URLs (18,998 phishing URLs and 18,997 legitimate URLs).

# **7. References**

*   https://www.phishtank.com/developer_info.php
*   https://www.kaggle.com/datasets/shashwatwork/web-page-phishing-detection-dataset
*   https://zenodo.org/records/5807622#.Ycsbzy0RpQJ

