##Content

1.   [Data Selection and Preprocessing / Cleaning](https://colab.research.google.com/drive/1JDUR1ldppSoD-cc3woAKdCO3UpuqTQFY?authuser=1#scrollTo=HXn5nk2-ELPZ&line=1&uniqifier=1)
2.   [Data Transformation / Feature Engineering](https://colab.research.google.com/drive/1JDUR1ldppSoD-cc3woAKdCO3UpuqTQFY?authuser=1#scrollTo=AuCidc7iDSSU)
3.  [Model Selection and Training](https://colab.research.google.com/drive/1JDUR1ldppSoD-cc3woAKdCO3UpuqTQFY?authuser=1#scrollTo=JsxDntKfEAIo&line=1&uniqifier=1)



##**Data Selection and Preprocessing/Cleaning**

####Importing datasets to local environment and installing additional required libraries

In [19]:
!pip install dnspython tldextract
!wget https://raw.githubusercontent.com/mitchellkrogza/Phishing.Database/master/ALL-phishing-links.tar.gz
# !wget https://phishstats.info/phish_score.csv
# !wget https://data.mendeley.com/public-files/datasets/72ptz43s9v/files/26197eb8-15bc-4e06-a269-aa10ddc286f0/file_downloaded

--2021-11-17 00:05:17--  https://raw.githubusercontent.com/mitchellkrogza/Phishing.Database/master/ALL-phishing-links.tar.gz
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.109.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 16156720 (15M) [application/octet-stream]
Saving to: ‘ALL-phishing-links.tar.gz.1’


2021-11-17 00:05:17 (37.6 MB/s) - ‘ALL-phishing-links.tar.gz.1’ saved [16156720/16156720]



In [20]:
from google.colab import drive
drive.mount('/content/drive')
%cd /content/drive/MyDrive/Colab\ Notebooks/GitHub/CMPE\ 255\ Final\ Project

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
/content/drive/MyDrive/Colab Notebooks/GitHub/CMPE 255 Final Project


####Importing *all* required libraries as per the requirements

In [37]:
#For dataset's preprocessing
import tarfile
import pandas as pd
import numpy as np

#For feature engineering
import tldextract 
import urllib.parse as parser
import dns.resolver as dns

####Importing and preprocessing the dataset

In [22]:
def github_data_retrieval(filename):  
  file = tarfile.open("/content/"+filename+'.tar.gz')
  file.extractall("/content/")
  file.close()
  # df = pd.read_csv('/content/'+filename+'.txt', delimiter = "\n")
  f = open("/content/"+filename+".txt", "r")
  l = f.readlines()
  url_dataset = []
  url_series = []
  unclean_data = []
  for i in l:
    temp = i.replace("\n","").split(",")
    if len(temp) == 8:
      # if len(url_dataset) == 9994:
        # print(temp)
        # print(i)
      url_dataset.append(temp)
    elif len(temp) == 1:
      url_series.append(temp[0])
    else:
      unclean_data.append(temp)
  #NEED TO CLEAN unclean_data, we are discarding it as we have enough data from other sources
  return pd.DataFrame(url_dataset[:-1],columns=url_dataset[-1]),pd.Series(url_series,name="Url"),l

In [23]:
dataset1,dataset2,raw_dataset = github_data_retrieval("ALL-phishing-links")

In [24]:
#data is stored locally as it is only available on kaggle and thus cannot get downloaded dynamically using wget
dir = "./datasets"
path1 = "/malicious_url_train_dataset.csv"
path2 = "/phishing_site_urls.csv"
path3 = "/combined_dataset.csv"

In [25]:
dataset3 = pd.read_csv(dir+path1,index_col=0)
dataset4 = pd.read_csv(dir+path2)
dataset5 = pd.read_csv(dir+path3)

####Combinining datasets

In [26]:
#transforming data to retrieve required format for feature engineering
data = [[i,1] for i in dataset1.url] + [[i,1] for i in dataset2] + dataset3[["url","result"]].values.tolist() + dataset4.replace("good",0).replace("bad",1).values.tolist() + dataset5[["domain","label"]].values.tolist()
dataset = pd.DataFrame(data,columns=["url","label"])
#to check data distribution between malicious/phishing (1) and normal websites (0)
dataset.groupby("label").count()

Unnamed: 0_level_0,url
label,Unnamed: 1_level_1
0,778658
1,1067154


##**Data Transformation / Feature Engineering**

####Features Engineering Functions

In [27]:
def url_length(column):
  return pd.Series([len(i) for i in column])

In [28]:
def at_present(column):
  return pd.Series([ 1 if i.find("@") == -1 else -1 for i in column ])

In [29]:
def dash_present(column):
  return pd.Series([ 1 if i.find("-") == -1 else -1 for i in column ])

In [30]:
def redirect_present(column):
  flags = []
  for i in column:
    if i.find("https://") != -1 :
      i.replace("https://","")
    if i.find("http://") != -1 :
      i.replace("https://","")
    if i.find("//") != -1:
      flags.append(-1)
    else:
      flags.append(1)  
  return pd.Series(flags)

In [40]:
def check_domain_length(column):
  flags = []
  for i in column:
    parsed_url = parser.urlparse(i)
    # Calculating log to normalize the length and make it comparable to other features
    # flags.append(np.log(len(parsed_url.netloc)))
    flags.append(len(parsed_url.netloc))
  return pd.Series(flags)

In [32]:
def no_of_subdomains(column):
  flags = []
  for i in column:
    parsed_url = tldextract.extract(i)
    flags.append(len(parsed_url.subdomain.split(".")))
  return pd.Series(flags)

In [33]:
# def check_dns_record(column):
#   flags = []
#   # input()
#   for i in column:
#     try:
#       dns.resolve(i)
#       # print("in")
#       # input()
#       flags.append(1)
#     except:
#       # print("out")
#       flags.append(-1)
#   return pd.Series(flags)s

####Feature Engineering Execution

We extract these features from the prepared datasets. We managed to extract 8 different features which can be used to identify phishing website. those feature are extracted by using these resources as point of reference.

References:

1. [Kaggle Dataset 1](https://www.kaggle.com/akashkr/phishing-website-dataset)
2. [Kaggle Dataset 2](https://www.kaggle.com/aman9d/phishing-data)
3. [Research Paper 1](http://nebula.wsimg.com/27a75d1c7f1236136e4ea756cb01c68c?AccessKeyId=80712B55A173CC042F8D&disposition=0&alloworigin=1) 

In [41]:
dataset["url_length"] = url_length(dataset.url)
dataset["at_present"] = at_present(dataset.url)
dataset["dash_present"] = dash_present(dataset.url)
dataset["redirect_present"] = redirect_present(dataset.url)
dataset["check_domain_length"] = check_domain_length(dataset.url) #
dataset["no_of_subdomains"] = no_of_subdomains(dataset.url)
# dataset["check_dns_record"] = check_dns_record(dataset.url)

In [42]:
dataset

Unnamed: 0,url,label,url_length,at_present,dash_present,redirect_present,check_domain_length,no_of_subdomains
0,http://creditiperhabbogratissicuro100.blogspot...,1,95,1,-1,-1,43,1
1,http://www.habbocreditosparati.blogspot.com/,1,44,1,1,-1,36,2
2,http://leadsdubai.com/~thescien/mad/5cec92b61f...,1,69,1,1,-1,14,1
3,http://philippe.rubio.perso.sfr.fr/cheeses.html,1,47,1,1,-1,27,3
4,https://www.drivehq.com/file/DFPublishFile.asp...,1,91,1,1,-1,15,1
...,...,...,...,...,...,...,...,...
1845807,www.freewebs.com/ryanrules2/,0,28,1,1,1,0,1
1845808,www.ireland-information.com/freecelticfonts.htm,0,47,1,-1,1,0,1
1845809,www.clubtaunus.soroptimist.de/img/pro/e.php,1,43,1,1,1,0,2
1845810,www.askmen.com/sports/business/index.html,0,41,1,1,1,0,1


##**Model Selection and Training**

##**Alternate solutions(Not to be considered)**

Alternate Approach to clean dataset1 & dataset2

In [None]:
# filename = "ALL-phishing-links"
# file = tarfile.open(filename+'.tar.gz')
# file.extractall("./github_data")
# file.close()
# df = pd.read_csv('./github_data/'+filename+'.txt', delimiter = "\n")
# f = open("./github_data/"+filename+".txt", "r")
# l = f.readlines()
# new_list = []
# z = []
# import re
# regex = "(http|ftp|https):\/\/([\w_-]+(?:(?:\.[\w_-]+)+))([\w.,@?^=%&:\/~+#-]*[\w@?^=%&\/~+#-])"
# # Scheme (HTTP, HTTPS, FTP and SFTP):

# for i in l[:20]:
#   # print(i)
#   matches = re.findall(regex, i)
#   print(matches)

Alternate method to calculate malicious and benign websiten in combined dataset

In [None]:
# bad = len(dataset1) + len(dataset2) + len(dataset3[dataset3.result == 1]) + len(dataset4[dataset4.Label == "bad"]) + len(dataset5[dataset5.label == 1])
# good = len(dataset3[dataset3.result == 0]) + len(dataset4[dataset4.Label == "good"]) + len(dataset5[dataset5.label == 0])
# df = pd.read_csv("./phish_score.csv",skiprows=9,names=["Date","Score","URL","IP"])

Alternate method to retrieve dataset at predifined path

In [None]:
# !wget https://raw.githubusercontent.com/mitchellkrogza/Phishing.Database/master/ALL-phishing-links.tar.gz -P /datasets/
# !wget https://phishstats.info/phish_score.csv -P /dataset/

Alternate to find domain length

In [None]:
# def len_sub_domain(column):
#   flags = []
#   for i in column:
#     i.replace("https://","")
#     i.replace("http://","")
#     l = i.split("/")[0]
#     print(l,len(l))
#     flags.append(len(l))
#   return pd.Series(flags)