# Detection of Phishing Websites using ML (Feature Extraction)

Phishing websites are created to trick unsuspecting users into thinking they are on a legitimate site. Phishing is one of the major problems in cybersecurity and leads to financial losses for both industries and individuals. The objective of this notebook is to collect data and extract selected features from the URLs.

## 1. Loading Phishing URL Dataset

In [1]:
import pandas as pd

In [2]:
phish_data = pd.read_csv('Dataset/verified_online.csv')

In [3]:
phish_data.head()

Unnamed: 0,phish_id,url,phish_detail_url,submission_time,verified,verification_time,online,target
0,7398709,http://www.wallstriumphuptodate.com/en/unlock/...,http://www.phishtank.com/phish_detail.php?phis...,2021-12-30T10:49:44+00:00,yes,2021-12-30T10:53:22+00:00,yes,Other
1,7398708,https://santsecnosesa.000webhostapp.com/,http://www.phishtank.com/phish_detail.php?phis...,2021-12-30T10:44:29+00:00,yes,2021-12-30T10:53:22+00:00,yes,"Banco Santander, S.A."
2,7398707,https://recupaidpaylbc.000webhostapp.com/,http://www.phishtank.com/phish_detail.php?phis...,2021-12-30T10:43:48+00:00,yes,2021-12-30T10:53:22+00:00,yes,Other
3,7398705,https://eurbk.000webhostapp.com/login.html,http://www.phishtank.com/phish_detail.php?phis...,2021-12-30T10:36:48+00:00,yes,2021-12-30T10:44:19+00:00,yes,Other
4,7398703,https://clinicaldentistryform.000webhostapp.com/,http://www.phishtank.com/phish_detail.php?phis...,2021-12-30T10:34:07+00:00,yes,2021-12-30T10:44:20+00:00,yes,Facebook


In [4]:
phish_data.shape

(7901, 8)

The dataframe has 8 columns. We'll be considering only the 'url' column for this project.

The data contains thousands of phishing URLs, but the feed is updated hourly, so its size changes between downloads. To avoid class imbalance, we use a fixed sample of 5000 phishing URLs and 5000 legitimate URLs.

Picking 5000 samples at random from the above dataframe.

In [None]:
phish_url = phish_data.sample(n = 5000, random_state = 12).copy()
phish_url = phish_url.reset_index(drop=True)

In [5]:
phish_url.head()

Unnamed: 0,phish_id,url,phish_detail_url,submission_time,verified,verification_time,online,target
0,7378923,http://hydtddz.com/root,http://www.phishtank.com/phish_detail.php?phis...,2021-12-09T20:29:43+00:00,yes,2021-12-13T11:23:09+00:00,yes,Other
1,7292700,https://7a4298b9.sso-mail-secure234ds23d23wd1....,http://www.phishtank.com/phish_detail.php?phis...,2021-09-15T11:30:59+00:00,yes,2021-09-15T11:38:09+00:00,yes,Other
2,7255595,https://bharathi1809.github.io/netflix/,http://www.phishtank.com/phish_detail.php?phis...,2021-08-05T08:52:49+00:00,yes,2021-09-05T21:06:16+00:00,yes,Other
3,7376755,https://getmagic.app/Post,http://www.phishtank.com/phish_detail.php?phis...,2021-12-08T14:01:46+00:00,yes,2021-12-13T11:42:54+00:00,yes,Other
4,7241325,http://clouddoc-authorize.firebaseapp.com/.xx....,http://www.phishtank.com/phish_detail.php?phis...,2021-07-24T02:03:30+00:00,yes,2021-07-24T02:08:08+00:00,yes,Other


In [6]:
phish_url.shape

(5000, 8)

## 2. Loading Legitimate URL Dataset

In [7]:
legit_data = pd.read_csv('Dataset/Benign_url_file.csv')

In [8]:
legit_data.head()

Unnamed: 0,http://1337x.to/torrent/1048648/American-Sniper-2014-MD-iTALiAN-DVDSCR-X264-BST-MT/
0,http://1337x.to/torrent/1110018/Blackhat-2015-...
1,http://1337x.to/torrent/1122940/Blackhat-2015-...
2,http://1337x.to/torrent/1124395/Fast-and-Furio...
3,http://1337x.to/torrent/1145504/Avengers-Age-o...
4,http://1337x.to/torrent/1160078/Avengers-age-o...


In [9]:
legit_data.columns = ['URLs']
legit_data.head()

Unnamed: 0,URLs
0,http://1337x.to/torrent/1110018/Blackhat-2015-...
1,http://1337x.to/torrent/1122940/Blackhat-2015-...
2,http://1337x.to/torrent/1124395/Fast-and-Furio...
3,http://1337x.to/torrent/1145504/Avengers-Age-o...
4,http://1337x.to/torrent/1160078/Avengers-age-o...


In [10]:
legit_data.shape

(35377, 1)

Picking 5000 legitimate URL samples at random from the above dataframe.

In [None]:
legit_url = legit_data.sample(n = 5000, random_state = 12).copy()
legit_url = legit_url.reset_index(drop=True)

In [11]:
legit_url.head()

Unnamed: 0,URLs
0,http://graphicriver.net/search?date=this-month...
1,http://ecnavi.jp/redirect/?url=http://www.cros...
2,https://hubpages.com/signin?explain=follow+Hub...
3,http://extratorrent.cc/torrent/4190536/AOMEI+B...
4,http://icicibank.com/Personal-Banking/offers/o...


In [12]:
legit_url.shape

(5000, 1)

# 3. Feature Extraction:

Now we will extract features from the URLs dataset which we collected.<br>
The feature extraction will be based on:

* Address Bar based Features
* Domain based Features
* HTML & Javascript based Features

## 3.1. Address Bar Based Features:

Each of the feature groups listed above is extracted using rule-based checks. The following address bar based features are extracted for this project:

* Domain of URL
* Using the IP Address
* URL having "@" symbol
* Long URL to hide the suspicious part
* Depth of URL
* Redirection using "//"
* The existence of "HTTPS" token in the domain part of URL
* Using URL Shortening Services "TinyURL"
* Adding Prefix or Suffix separated by "-" to the Domain

In [13]:
#importing required packages for this section
from urllib.parse import urlparse
import ipaddress
import re

### 3.1.1. Domain of the URL

Here, we are just extracting the domain present in the URL.

In [14]:
def getDomain(url):
    domain = urlparse(url).netloc
    # strip only a leading "www." (a literal prefix, not a regex wildcard match)
    if domain.startswith("www."):
        domain = domain[4:]
    return domain
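As a quick check of what `urlparse` returns here, the sketch below strips a leading `www.` label and prints the result for a few URLs. The helper name `strip_www` and the `example.com` URL are illustrative assumptions, not part of the notebook.

```python
from urllib.parse import urlparse

def strip_www(url):
    """Return the network location of `url` without a leading 'www.' label."""
    domain = urlparse(url).netloc
    # Only the prefix is removed, so a 'www.' further inside the host is kept.
    if domain.startswith("www."):
        domain = domain[len("www."):]
    return domain

for sample in ("http://www.facebook.com/home/service",
               "https://eurbk.000webhostapp.com/login.html",
               "http://www.mywww.example.com/"):   # hypothetical host
    print(sample, "->", strip_www(sample))
```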

### 3.1.2. IP Address in the URL

If an IP address is used in place of the domain name in the URL, such as “http://125.98.3.123/fake.html”, it is a strong sign that someone is trying to steal the user's personal information. This feature checks for the presence of an IP address in the URL.

Rule: <br>
If the domain part has an IP address → Phishing (1) <br>
Otherwise → Legitimate (0)

In [13]:
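def haveIP(url):
    try:
        # ip_address() only accepts a bare address, so test the host part, not the whole URL
        host = urlparse(url).hostname or ""
        ipaddress.ip_address(host)
        ip = 1
    except ValueError:
        ip = 0
    return ip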
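`ipaddress.ip_address` accepts a bare IPv4 or IPv6 address and raises `ValueError` for anything else, including a full URL. A minimal sketch of the rule, assuming the host is first isolated with `urlparse` (the helper name `host_is_ip` and the IPv6 sample are hypothetical):

```python
import ipaddress
from urllib.parse import urlparse

def host_is_ip(url):
    """Return 1 if the host part of `url` is an IP literal, else 0."""
    try:
        host = urlparse(url).hostname or ""
        ipaddress.ip_address(host)   # raises ValueError for non-addresses
        return 1
    except ValueError:
        return 0

print(host_is_ip("http://125.98.3.123/fake.html"))
print(host_is_ip("http://www.facebook.com/home/service"))
print(host_is_ip("http://[2001:db8::1]:8080/index.html"))  # hypothetical IPv6 URL

# Passing the whole URL instead of the host is rejected:
try:
    ipaddress.ip_address("http://125.98.3.123/fake.html")
except ValueError as err:
    print("full URL rejected:", err)
```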

### 3.1.3. "@" Symbol in URL

Using “@” symbol in the URL leads the browser to ignore everything preceding the “@” symbol and the real address often follows the “@” symbol. <br> This feature checks for the presence of '@' symbol in the URL.

Rule:<br>
If Url Having @ Symbol → Phishing (1)<br>
Otherwise → Legitimate (0)

In [14]:
def haveAtSign(url):
    if "@" in url:
        at = 1
    else:
        at = 0    
    return at

### 3.1.4. Length of URL

Phishers can use a long URL to hide the doubtful part in the address bar. Based on the literature reviewed for this project, a URL of 54 characters or more is classified as phishing.<br>
This feature computes the length of the URL and compares it with that threshold.

Rule:<br>
If URL length >= 54 → Phishing (1) <br>
Otherwise → Legitimate (0)

In [15]:
def getLength(url):
    if len(url) < 54:
        length = 0            
    else:
        length = 1            
    return length

### 3.1.5. Depth of URL

Computes the depth of the URL. This feature counts the sub-pages in the given URL, based on the '/' separators in its path.

The value of the feature is a non-negative integer.

In [16]:
def getDepth(url):
    s = urlparse(url).path.split('/')
    depth = 0
    for j in range(len(s)):
        if len(s[j]) != 0:
            depth = depth+1
    return depth
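A compact way to see what is being counted: split the path on '/' and count the non-empty segments. The sketch below is an equivalent formulation under that reading; the helper name `path_depth` is an assumption for illustration.

```python
from urllib.parse import urlparse

def path_depth(url):
    """Count the non-empty path segments of `url`."""
    return sum(1 for segment in urlparse(url).path.split("/") if segment)

for sample in ("https://santsecnosesa.000webhostapp.com/",
               "https://eurbk.000webhostapp.com/login.html",
               "http://www.facebook.com/home/service"):
    print(path_depth(sample), sample)
```

Only the path is inspected, so slashes in the scheme separator or in the query string do not add to the depth.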

### 3.1.6. Redirection "//" in URL

The existence of “//” within the URL path means that the user will be redirected to another website. The location of the last “//” in the URL is computed. If the URL starts with “HTTP”, the “//” of the scheme sits at the sixth position; if the URL uses “HTTPS”, it sits at the seventh position (counting from 1). A “//” found further along is therefore not the scheme separator. <br>
This feature checks for the presence of "//" after the scheme separator.

Rule:<br>
If The Position of the Last Occurrence of "//" in the URL > 7 → Phishing (1) <br>
Otherwise → Legitimate (0)

In [1]:
def redirection(url):
    pos = url.rfind('//')
    if pos > 7:
        return 1
    else:
        return 0
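Python's `str.rfind` returns a zero-based index, so the scheme separator is reported at 5 for `http://` and 6 for `https://`. The sketch below prints the index for a few URLs; the helper name `redirect_flag` and the `example` hosts are hypothetical.

```python
def redirect_flag(url):
    """Return 1 if the last '//' lies beyond the scheme separator, else 0."""
    return 1 if url.rfind("//") > 7 else 0

samples = (
    "http://example.com/",
    "https://example.com/",
    "http://example.com//http://evil.example/",   # embedded second URL
)
for sample in samples:
    print(sample.rfind("//"), redirect_flag(sample), sample)
```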

### 3.1.7. Existence of "https" token in Domain name

Phishers may add the “HTTPS” token to the domain part of a URL in order to trick users. This feature checks for the presence of "https" in the domain part of the URL.

Rule:<br>
If "HTTPS" Token present in Domain Part of The URL → Phishing (1) <br>
Otherwise → Legitimate (0)

In [18]:
def httpDomain(url):
    domain = urlparse(url).netloc
    if 'https' in domain:
        return 1
    else:
        return 0
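The point of parsing the URL first is that the scheme itself is not counted, only the host. A small sketch with made-up `example.com` hosts (the helper name `https_in_host` is also an assumption):

```python
from urllib.parse import urlparse

def https_in_host(url):
    """Return 1 if the literal token 'https' appears in the host part, else 0."""
    return 1 if "https" in urlparse(url).netloc.lower() else 0

print(https_in_host("https://example.com/login"))         # token only in the scheme
print(https_in_host("http://https-secure.example.com/"))  # token inside the host
```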

### 3.1.8. Using URL Shortening Services “TinyURL”

URL shortening is a method on the “World Wide Web” in which a URL may be made considerably smaller in length and still lead to the required webpage. This is accomplished by means of an “HTTP Redirect” on a domain name that is short, which links to the webpage that has a long URL.

Rule:<br>
If the URL is using Shortening Services → Phishing (1) <br>
Otherwise → Legitimate (0)

In [19]:
# raw strings keep the regex escapes (\.) intact without invalid-escape warnings
shortening_services = r'bit\.ly|goo\.gl|shorte\.st|go2l\.ink|x\.co|ow\.ly|t\.co|tinyurl|tr\.im|is\.gd|cli\.gs|' \
                      r'yfrog\.com|migre\.me|ff\.im|tiny\.cc|url4\.eu|twit\.ac|su\.pr|twurl\.nl|snipurl\.com|' \
                      r'short\.to|BudURL\.com|ping\.fm|post\.ly|Just\.as|bkite\.com|snipr\.com|fic\.kr|loopt\.us|' \
                      r'doiop\.com|short\.ie|kl\.am|wp\.me|rubyurl\.com|om\.ly|to\.ly|bit\.do|t\.co|lnkd\.in|' \
                      r'db\.tt|qr\.ae|adf\.ly|goo\.gl|bitly\.com|cur\.lv|tinyurl\.com|ow\.ly|bit\.ly|ity\.im|' \
                      r'q\.gs|is\.gd|po\.st|bc\.vc|twitthis\.com|u\.to|j\.mp|buzurl\.com|cutt\.us|u\.bb|yourls\.org|' \
                      r'x\.co|prettylinkpro\.com|scrnch\.me|filoops\.info|vzturl\.com|qr\.net|1url\.com|tweez\.me|v\.gd|' \
                      r'tr\.im|link\.zip\.net'

In [20]:
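def tinyURL(url):
    # Match the host only, anchored at a label boundary, so that a host such as
    # "microsoft.com" is not flagged merely for containing "t.co".
    host = urlparse(url).hostname or ""
    pattern = r"(?:^|\.)(?:" + shortening_services + r")$"
    if re.search(pattern, host, re.IGNORECASE):
        return 1
    else:
        return 0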
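Matching shortener names anywhere in the URL string can flag unrelated hosts: `t\.co`, for instance, also occurs inside `microsoft.com`. The sketch below illustrates host-anchored matching on a short, illustrative subset of the services; the names `SHORTENERS`, `HOST_PATTERN` and `uses_shortener` are assumptions, not the notebook's identifiers.

```python
import re
from urllib.parse import urlparse

# Illustrative subset of the shortening services listed above.
SHORTENERS = r"bit\.ly|goo\.gl|t\.co|tinyurl\.com|ow\.ly|is\.gd"
# The host must equal a service name or end with ".<service name>".
HOST_PATTERN = re.compile(r"(?:^|\.)(?:" + SHORTENERS + r")$", re.IGNORECASE)

def uses_shortener(url):
    """Return 1 if the host of `url` is one of the listed shorteners, else 0."""
    host = urlparse(url).hostname or ""
    return 1 if HOST_PATTERN.search(host) else 0

# An unanchored search over the full URL matches "t.co" inside "microsoft.com".
print(bool(re.search(r"t\.co", "http://www.microsoft.com/")))
print(uses_shortener("http://www.microsoft.com/"))
print(uses_shortener("https://bit.ly/3abc"))      # made-up short link
```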

### 3.1.9. Adding Prefix or Suffix Separated by "-" to the Domain 

The dash symbol is rarely used in legitimate URLs. Phishers tend to add prefixes or suffixes separated by (-) to the domain name so that users feel that they are dealing with a legitimate webpage. <br>
This feature checks the presence of '-' in the domain part of URL. 

Rule:<br>
If the URL has '-' symbol in the domain part of the URL → Phishing (1) <br>
Otherwise → Legitimate (0)

In [21]:
def prefixSuffix(url):
    if '-' in urlparse(url).netloc:
        return 1
    else:
        return 0

## 3.2. Domain Based Features:

The following features are extracted for this project and can be considered domain based features:

* DNS Record
* Website Traffic
* Age of Domain
* End Period of Domain

In [2]:
#!pip install python-whois

In [23]:
# importing required packages for this section
import re
from bs4 import BeautifulSoup
import whois
import urllib
import urllib.request
from datetime import datetime

### 3.2.1. DNS Record

For phishing websites, either the claimed identity is not recognized by the WHOIS database or no records are found for the hostname. <br>

Rule:<br>
If the DNS record is empty or not found → Phishing (1) <br>
else → Legitimate (0)

In [24]:
# obtained in the featureExtraction function itself in Section 4.

### 3.2.2. Web Traffic

This feature measures the popularity of the website by determining the number of visitors and the number of pages they visit. However, since phishing websites live for a short period of time, they may not be recognized by the Alexa database (Alexa the Web Information Company., 1996). Based on the literature reviewed for this project, even in the worst case legitimate websites rank among the top 100,000. Furthermore, if the domain has no traffic or is not recognized by the Alexa database, it is classified as “Phishing”. Note that Alexa's ranking service was retired in 2022, so this lookup may no longer return data.<br>

Rule:<br>
If the rank of the domain > 100000 → Phishing (1) <br>
else → Legitimate (0).

In [25]:
def web_traffic(url):
    try:
        url = urllib.parse.quote(url)
        rank = BeautifulSoup(urllib.request.urlopen("http://data.alexa.com/data?cli=10&dat=s&url=" + url).read(), "xml").find(
        "REACH")['RANK']
        rank = int(rank)
    except Exception:
        # no REACH element (TypeError) or the request itself failed
        return 1
    if rank > 100000:
        return 1
    else:
        return 0

### 3.2.3. Age of Domain

This feature can be extracted from the WHOIS database. Most phishing websites live for a short period of time. The minimum age of a legitimate domain is considered to be 6 months for this project. Age here is simply the difference between the creation and expiration times.<br>

Rule:<br>
If age of domain < 6 months → Phishing (1)<br>
else → Legitimate (0)

In [26]:
def domainAge(domain_name):
    creation_date = domain_name.creation_date
    expiration_date = domain_name.expiration_date
    if (isinstance(creation_date,str) or isinstance(expiration_date,str)):
        try:
            creation_date = datetime.strptime(creation_date,'%Y-%m-%d')
            expiration_date = datetime.strptime(expiration_date,"%Y-%m-%d")
        except (TypeError, ValueError):
            return 1
    if ((expiration_date is None) or (creation_date is None)):
        return 1
    elif ((type(expiration_date) is list) or (type(creation_date) is list)):
        return 1
    else:
        ageofdomain = abs((expiration_date - creation_date).days)
    if ((ageofdomain/30) < 6):
        age = 1
    else:
        age = 0
    # return for both branches, so the function never yields None
    return age
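The age rule can be exercised without a WHOIS lookup by passing in a stand-in record. The sketch below is a simplified variant, assuming the record exposes `creation_date` and `expiration_date` attributes as the notebook's code does; `domain_age_flag` and the sample dates are hypothetical, and string dates are not parsed here.

```python
from datetime import datetime
from types import SimpleNamespace

def domain_age_flag(record, min_months=6):
    """Return 1 (phishing) if the registration span is under `min_months`, else 0."""
    created, expires = record.creation_date, record.expiration_date
    # Missing or list-valued dates are treated as suspicious.
    if not isinstance(created, datetime) or not isinstance(expires, datetime):
        return 1
    months = abs((expires - created).days) / 30
    return 1 if months < min_months else 0

# Stand-in WHOIS records with made-up dates.
short_lived = SimpleNamespace(creation_date=datetime(2021, 10, 1),
                              expiration_date=datetime(2022, 1, 1))
long_lived = SimpleNamespace(creation_date=datetime(2010, 1, 1),
                             expiration_date=datetime(2025, 1, 1))
missing = SimpleNamespace(creation_date=None,
                          expiration_date=datetime(2025, 1, 1))

for name, record in (("short_lived", short_lived),
                     ("long_lived", long_lived),
                     ("missing", missing)):
    print(name, domain_age_flag(record))
```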

### 3.2.4. End Period of Domain

For this feature, the remaining domain time is calculated as the difference between the expiration time and the current time. A legitimate domain is expected to have at least 6 months left before it expires. This feature can be extracted from the WHOIS database.<br>

Rule:<br>
If end period of domain < 6 months → Phishing (1) <br>
else → Legitimate (0).

In [27]:
def domainEnd(domain_name):
    expiration_date = domain_name.expiration_date
    if isinstance(expiration_date,str):
        try:
            expiration_date = datetime.strptime(expiration_date,"%Y-%m-%d")
        except ValueError:
            return 1
    if (expiration_date is None):
        return 1
    elif (type(expiration_date) is list):
        return 1
    else:
        today = datetime.now()
        # signed difference: an already expired domain has negative time left
        end = (expiration_date - today).days
    if ((end/30) < 6):
        end = 1
    else:
        end = 0
    return end

## 3.3. HTML and JavaScript based Features

Below mentioned features are extracted for this project that can be considered as HTML & JS based features:

* IFrame Redirection
* Status Bar Customization
* Disabling Right Click
* Website Forwarding

In [28]:
# importing required packages for this section
import requests

### 3.3.1. IFrame Redirection

IFrame is an HTML tag used to display an additional webpage into one that is currently shown. Phishers can make use of the “iframe” tag and make it invisible i.e. without frame borders. In this regard, phishers make use of the “frameBorder” attribute which causes the browser to render a visual delineation.<br>

Rule:<br>
If an iframe is used or the response is not found → Phishing (1) <br>
else → Legitimate (0)

In [29]:
def iframe(response):
    if response == "":
        return 1
    else:
        # look for an <iframe> tag or a frameBorder attribute, case-insensitively
        if re.search(r"<iframe\b|frameborder", response.text, re.IGNORECASE):
            return 1
        else:
            return 0
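In Python's `re`, square brackets define a character class, so a pattern like `[<iframe>]` matches any single one of those characters rather than the tag. The sketch below contrasts that with a pattern that looks for the tag or the attribute name; the HTML snippets and the name `TAG_PATTERN` are made up for illustration.

```python
import re

CHAR_CLASS = r"[<iframe>|<frameBorder>]"   # matches one character from the set
TAG_PATTERN = re.compile(r"<iframe\b|frameborder", re.IGNORECASE)

plain_html = "<html><body><p>Hello</p></body></html>"
framed_html = ('<html><body><iframe src="x.html" frameBorder="0">'
               "</iframe></body></html>")

# The character class fires on a page that has no iframe at all.
print(bool(re.findall(CHAR_CLASS, plain_html)))
# The tag pattern only fires when the tag or attribute is present.
print(bool(TAG_PATTERN.search(plain_html)))
print(bool(TAG_PATTERN.search(framed_html)))
```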

### 3.3.2. Status Bar Customization

Phishers may use JavaScript to show a fake URL in the status bar to users. To extract this feature, we must dig-out the webpage source code, particularly the “onMouseOver” event, and check if it makes any changes on the status bar. <br>

Rule:<br>
If the response is empty or onmouseover is found → Phishing (1) <br> 
else → Legitimate (0).

In [30]:
def mouseOver(response): 
    if response == "" :
        return 1
    else:
        if re.findall("<script>.+onmouseover.+</script>", response.text):
            return 1
        else:
            return 0

### 3.3.3. Disabling Right Click

Phishers use JavaScript to disable the right-click function, so that users cannot view and save the webpage source code. This feature is treated exactly as “Using onMouseOver to hide the Link”. Nonetheless, for this feature, we will search for event “event.button==2” in the webpage source code and check if the right click is disabled.<br>

Rule:<br>
If the response is empty or Right Click Disabled → Phishing (1) <br>
else → Legitimate (0)

In [31]:
def rightClick(response):
    if response == "":
        return 1
    else:
        if re.findall(r"event\.button ?== ?2", response.text):
            return 1
        else:
            return 0


### 3.3.4. Website Forwarding

The fine line that distinguishes phishing websites from legitimate ones is how many times a website has been redirected.<br>

Rule:<br>
If response is empty or number of times website is redirected > 2 → Phishing (1) <br>
Otherwise → Legitimate (0)

In [32]:
def forwarding(response):
    if response == "":
        return 1
    else:
        if len(response.history) <= 2:
            return 0
        else:
            return 1
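With the `requests` library, `response.history` lists the intermediate responses that were followed before the final one, so its length is the number of redirects. The rule can be tried offline with stand-in objects, as in this sketch (the helper name `forwarding_flag` and the placeholder hops are assumptions):

```python
from types import SimpleNamespace

def forwarding_flag(response, max_redirects=2):
    """Return 1 if the fetch failed or followed more than `max_redirects` hops."""
    if response == "":           # "" is the notebook's marker for a failed request
        return 1
    return 1 if len(response.history) > max_redirects else 0

# Stand-ins for requests.Response objects; only `history` is needed here.
direct = SimpleNamespace(history=[])
chained = SimpleNamespace(history=["hop1", "hop2", "hop3"])

print(forwarding_flag(direct))
print(forwarding_flag(chained))
print(forwarding_flag(""))
```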

# 4. Computing URL Features

The function below calls the other functions and stores all the features of a URL in a list. We will extract the features of each URL and append the resulting list to a list of rows.

In [33]:
def featureExtraction(url, label):

    features = []
    #Address bar based features
    features.append(getDomain(url))
    features.append(haveIP(url))
    features.append(haveAtSign(url))
    features.append(getLength(url))
    features.append(getDepth(url))
    features.append(redirection(url))
    features.append(httpDomain(url))
    features.append(tinyURL(url))
    features.append(prefixSuffix(url))
  
    #Domain based features
    dns = 0
    try:
        domain_name = whois.whois(urlparse(url).netloc)
    except:
        dns = 1

    features.append(dns)
    features.append(web_traffic(url))
    features.append(1 if dns == 1 else domainAge(domain_name))
    features.append(1 if dns == 1 else domainEnd(domain_name))

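    # HTML & Javascript based features
    try:
        # the 10-second timeout is an arbitrary choice so a dead host cannot stall the run
        response = requests.get(url, timeout=10)
    except Exception:
        response = ""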
    features.append(iframe(response))
    features.append(mouseOver(response))
    features.append(rightClick(response))
    features.append(forwarding(response))
    
    #appending label (phishing(1) & legitimate(0))
    features.append(label)

    return features

In [35]:
featureExtraction('http://www.facebook.com/home/service', 0)

['facebook.com', 0, 0, 0, 2, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0]

In [36]:
featureExtraction('https://drcarmenmora.com/doc/info/dam/index.html........................', 0)

['drcarmenmora.com', 0, 0, 1, 4, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0]
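The returned list is positional, which makes it hard to read at a glance. Pairing it with the column names the notebook uses when building its dataframes gives a labelled view; the sketch below does this for the first output shown above (the variable name `named` is an assumption).

```python
feature_names = ['Domain', 'Have_IP', 'Have_At', 'URL_Length', 'URL_Depth', 'Redirection',
                 'https_Domain', 'TinyURL', 'Prefix/Suffix', 'DNS_Record', 'Web_Traffic',
                 'Domain_Age', 'Domain_End', 'iFrame', 'Mouse_Over', 'Right_Click',
                 'Web_Forwards', 'Label']

# Output of featureExtraction('http://www.facebook.com/home/service', 0) shown above.
features = ['facebook.com', 0, 0, 0, 2, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0]

# Fail early if the list and the names ever drift out of step.
assert len(features) == len(feature_names)

named = dict(zip(feature_names, features))
for name, value in named.items():
    print(f"{name:>13}: {value}")
```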

## 4.1. Feature Extraction on Legitimate URLs:

In [37]:
legit_url.shape

(5000, 1)

In [None]:
legit_features = []
label = 0

for i in range(0, 5000):
    url = legit_url['URLs'][i]
    legit_features.append(featureExtraction(url,label))
    #print(i, end=' ')

Error trying to connect to socket: closing socket
Error trying to connect to socket: closing socket
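Because each call performs network lookups, one unexpected exception can end a long-running loop. One possible pattern is to collect failures separately and keep going, sketched below with a stand-in extractor; `extract_all`, `demo_extractor` and the URLs are hypothetical, and in the notebook `featureExtraction` would take the extractor's place.

```python
def extract_all(urls, label, extractor):
    """Apply `extractor(url, label)` to every URL, collecting failures separately."""
    rows, failed = [], []
    for url in urls:
        try:
            rows.append(extractor(url, label))
        except Exception as exc:   # keep going, but record what went wrong
            failed.append((url, repr(exc)))
    return rows, failed

def demo_extractor(url, label):
    # Stand-in for featureExtraction: fails on one made-up URL.
    if "unreachable" in url:
        raise ConnectionError("closing socket")
    return [url, len(url), label]

rows, failed = extract_all(
    ["http://example.com/a", "http://unreachable.example/", "http://example.org/b"],
    0,
    demo_extractor,
)
print(len(rows), "rows extracted,", len(failed), "failed")
```

Keeping the failed URLs makes it possible to report how many samples were dropped and to retry them later.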


In [None]:
feature_names = ['Domain', 'Have_IP', 'Have_At', 'URL_Length', 'URL_Depth','Redirection', 
                      'https_Domain', 'TinyURL', 'Prefix/Suffix', 'DNS_Record', 'Web_Traffic', 
                      'Domain_Age', 'Domain_End', 'iFrame', 'Mouse_Over','Right_Click', 'Web_Forwards', 'Label']

legitimate = pd.DataFrame(legit_features, columns= feature_names)

In [39]:
legitimate.head()

Unnamed: 0,Domain,Have_IP,Have_At,URL_Length,URL_Depth,Redirection,https_Domain,TinyURL,Prefix/Suffix,DNS_Record,Web_Traffic,Domain_Age,Domain_End,iFrame,Mouse_Over,Right_Click,Web_Forwards,Label
0,graphicriver.net,0,0,1,1,0,0,0,0,0,1,1,1,0,0,1,0,0
1,ecnavi.jp,0,0,1,1,1,0,0,0,0,1,1,1,0,0,1,0,0
2,hubpages.com,0,0,1,1,0,0,0,0,0,1,0,0,0,0,1,0,0
3,extratorrent.cc,0,0,1,3,0,0,0,0,0,0,0,1,0,0,1,0,0
4,icicibank.com,0,0,1,3,0,0,0,0,0,1,0,1,0,0,1,0,0


In [None]:
legitimate.to_csv('legitimate.csv', index= False)

## 4.2. Feature Extraction on Phishing URLs:

In [41]:
phish_url.shape

(5000, 8)

In [None]:
phish_features = []
label = 1

for i in range(0, 5000):
    url = phish_url['url'][i]
    phish_features.append(featureExtraction(url,label))
    #print(i, end=' ')

0 2 3 4 5 6 7 8 10 11 12 13 14 15 16 17 19 Error trying to connect to socket: closing socket
20 21 22 25 26 28 29 

In [None]:
feature_names = ['Domain', 'Have_IP', 'Have_At', 'URL_Length', 'URL_Depth','Redirection', 
                      'https_Domain', 'TinyURL', 'Prefix/Suffix', 'DNS_Record', 'Web_Traffic', 
                      'Domain_Age', 'Domain_End', 'iFrame', 'Mouse_Over','Right_Click', 'Web_Forwards', 'Label']

phishing = pd.DataFrame(phish_features, columns= feature_names)

In [48]:
phishing.head()

Unnamed: 0,Domain,Have_IP,Have_At,URL_Length,URL_Depth,Redirection,https_Domain,TinyURL,Prefix/Suffix,DNS_Record,Web_Traffic,Domain_Age,Domain_End,iFrame,Mouse_Over,Right_Click,Web_Forwards,Label
0,hydtddz.com,0,0,0,1,0,0,0,0,0,1,0,1,0,0,1,0,1
1,bharathi1809.github.io,0,0,0,1,0,0,0,0,0,1,1,1,0,0,1,0,1
2,getmagic.app,0,0,0,1,0,0,0,0,0,1,0,0,1,1,1,1,1
3,clouddoc-authorize.firebaseapp.com,0,0,1,2,0,0,0,1,1,1,1,1,0,0,1,0,1
4,webdisk.granadoemurahara.com.br,0,0,1,1,0,0,0,0,0,1,1,0,0,0,1,0,1


In [None]:
phishing.to_csv('phishing.csv', index= False)

# 5. Final Dataset

In the above section we formed two dataframes of legitimate & phishing URL features. Now, we will combine them into a single dataframe and export the data to a CSV file for the machine learning training done in another notebook.

In [49]:
urldata = pd.concat([legitimate, phishing]).reset_index(drop=True)
urldata.head()

Unnamed: 0,Domain,Have_IP,Have_At,URL_Length,URL_Depth,Redirection,https_Domain,TinyURL,Prefix/Suffix,DNS_Record,Web_Traffic,Domain_Age,Domain_End,iFrame,Mouse_Over,Right_Click,Web_Forwards,Label
0,graphicriver.net,0,0,1,1,0,0,0,0,0,1,1,1,0,0,1,0,0
1,ecnavi.jp,0,0,1,1,1,0,0,0,0,1,1,1,0,0,1,0,0
2,hubpages.com,0,0,1,1,0,0,0,0,0,1,0,0,0,0,1,0,0
3,extratorrent.cc,0,0,1,3,0,0,0,0,0,0,0,1,0,0,1,0,0
4,icicibank.com,0,0,1,3,0,0,0,0,0,1,0,1,0,0,1,0,0


In [50]:
urldata.tail()

Unnamed: 0,Domain,Have_IP,Have_At,URL_Length,URL_Depth,Redirection,https_Domain,TinyURL,Prefix/Suffix,DNS_Record,Web_Traffic,Domain_Age,Domain_End,iFrame,Mouse_Over,Right_Click,Web_Forwards,Label
9714,site9434107.92.webydo.com,0,0,0,0,0,0,0,0,0,0,1,1,0,0,1,0,1
9715,amazvo.gqhormc.cn,0,1,1,1,0,0,0,0,0,1,0,0,1,1,1,1,1
9716,tahunbaruliak.000webhostapp.com,0,0,0,1,0,0,0,0,0,1,0,1,0,0,1,0,1
9717,telephone.gsjxm.net,0,0,0,1,0,0,0,0,0,1,0,1,1,1,1,1,1
9718,dibikinkedergwdongajingbanged41298.cloudns.ph,0,0,1,0,0,0,0,0,0,0,1,1,1,1,1,1,1


In [51]:
urldata.shape

(9719, 18)

We lost a few URLs (281 of the 10,000 sampled) because errors occurred while connecting to them. These connections had already been closed by the other end; in other words, an application protocol error.

In [52]:
urldata.to_csv('finaldata.csv', index=False)

---

# 6. CONCLUSION:

We finally extracted 18 columns (the domain, 16 features and the label) for 9719 URLs.

# 7. REFERENCES:

* urllib => https://docs.python.org/3/library/urllib.parse.html#module-urllib.parse
* ipaddress => ipaddress.ip_address(url) -> https://docs.python.org/3/howto/ipaddress.html
* shortening services => https://gist.github.com/sid321axn/9199e54dfc7667af7c0c0a07a4b7a129
* Alexa Rank => https://gist.github.com/masnun/3170870
* datetime => https://docs.python.org/3/library/datetime.html
* python whois => https://pypi.org/project/python-whois/
* eventListener for button => https://developer.mozilla.org/en-US/docs/Web/API/MouseEvent/button
* redirection history => https://requests.readthedocs.io/en/master/user/quickstart/#redirection-and-history