# Phishing Website Detection 

Step -1 : Data preprocessing 

This dataset contains few website links (Some of them are legitimate websites and a few are fake websites)

Pre-Processing the data before building a model and also Extracting the features from the data based on certain conditions

In [2]:
#importing numpy and pandas which are required for data pre-processing
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, accuracy_score

  from numpy.core.umath_tests import inner1d


In [3]:
#Loading the data
raw_data = pd.read_csv("datasetfinal.csv") 
len(raw_data)

1048575

We need to split the data according to parts of the URL

A typical URL could have the form http://www.example.com/index.html, which indicates a protocol (http), a hostname (www.example.com), and a file name (index.html).

In [4]:
raw_data['URL'].str.split("://").head() #Here we divided the protocol from the entire URL. but need it to be divided it 
                                                 #seperate column

0    [http, br-ofertasimperdiveis.epizy.com/produto...
1    [https, semana-da-oferta.com/produtos.php?id=5...
2    [https, scrid-apps-creacust-sslhide90766752024...
3       [http, my-softbank-security.com/wap_login.htm]
4    [http, www.my-softbank-security.com/wap_login....
Name: URL, dtype: object

In [7]:
seperation_of_protocol = raw_data['URL'].str.split("://",expand = True) #expand argument in the split method will give you a new column

In [8]:
seperation_of_protocol.head()

Unnamed: 0,0,1,2,3,4,5,6
0,http,br-ofertasimperdiveis.epizy.com/produto.php?li...,,,,,
1,https,semana-da-oferta.com/produtos.php?id=5abad0c01...,,,,,
2,https,scrid-apps-creacust-sslhide90766752024.cread-s...,,,,,
3,http,my-softbank-security.com/wap_login.htm,,,,,
4,http,www.my-softbank-security.com/wap_login.htm,,,,,


In [10]:
seperation_domain_name = seperation_of_protocol[1].str.split("/",1,expand = True) #split(seperator,no of splits according to seperator(delimiter),expand)

In [11]:
type(seperation_domain_name)

pandas.core.frame.DataFrame

In [12]:
seperation_domain_name.columns=["domain_name","address"] #renaming columns of data frame

In [13]:
seperation_domain_name.head()

Unnamed: 0,domain_name,address
0,br-ofertasimperdiveis.epizy.com,produto.php?linkcompleto=iphone-6-plus-apple-6...
1,semana-da-oferta.com,produtos.php?id=5abad0c01d149
2,scrid-apps-creacust-sslhide90766752024.cread-s...,hider_reo/
3,my-softbank-security.com,wap_login.htm
4,www.my-softbank-security.com,wap_login.htm


In [14]:
#Concatenation of data frames
splitted_data = pd.concat([seperation_of_protocol[0],seperation_domain_name],axis=1)


In [15]:
splitted_data.columns = ['protocol','domain_name','address']

In [16]:
splitted_data.head()

Unnamed: 0,protocol,domain_name,address
0,http,br-ofertasimperdiveis.epizy.com,produto.php?linkcompleto=iphone-6-plus-apple-6...
1,https,semana-da-oferta.com,produtos.php?id=5abad0c01d149
2,https,scrid-apps-creacust-sslhide90766752024.cread-s...,hider_reo/
3,http,my-softbank-security.com,wap_login.htm
4,http,www.my-softbank-security.com,wap_login.htm


In [17]:
splitted_data['is_phished'] = pd.Series(raw_data['Target'], index=splitted_data.index)

Domain name column can be further sub divided into domain_names as well as sub_domain_names 

Similarly, address column can also be further sub divided into path,query_string,file..................

In [19]:
type(splitted_data)

pandas.core.frame.DataFrame

### Features Extraction


Feature-1

1.Long URL to Hide the Suspicious Part

If the length of the URL is greater than or equal 54 characters then the URL classified as phishing


0 --- indicates legitimate

1 --- indicates Phishing

2 --- indicates Suspicious

In [20]:
def long_url(l):
    l= str(l)
    """This function is defined in order to differntiate website based on the length of the URL"""
    if len(l) < 54:
        return 0
    elif len(l) >= 54 and len(l) <= 75:
        return 2
    return 1

In [22]:
#Applying the above defined function in order to divide the websites into 3 categories
splitted_data['long_url'] = raw_data['URL'].apply(long_url) 


Feature-2

2.URL’s having “@” Symbol

Using “@” symbol in the URL leads the browser to ignore everything preceding the “@” symbol and the real address often follows the “@” symbol.

IF {Url Having @ Symbol→ Phishing
    Otherwise→ Legitimate }


0 --- indicates legitimate

1 --- indicates Phishing


In [24]:
def have_at_symbol(l):
    """This function is used to check whether the URL contains @ symbol or not"""
    if "@" in str(l):
        return 1
    return 0
    

In [25]:
splitted_data['having_@_symbol'] = raw_data['URL'].apply(have_at_symbol)

Feature-3

3.Redirecting using “//”

The existence of “//” within the URL path means that the user will be redirected to another website.
An example of such URL’s is: “http://www.legitimate.com//http://www.phishing.com”. 
We examine the location where the “//” appears. 
We find that if the URL starts with “HTTP”, that means the “//” should appear in the sixth position. 
However, if the URL employs “HTTPS” then the “//” should appear in seventh position.

IF {ThePosition of the Last Occurrence of "//" in the URL > 7→ Phishing
    
    Otherwise→ Legitimate

0 --- indicates legitimate

1 --- indicates Phishing


In [27]:
def redirection(l):
    """If the url has symbol(//) after protocol then such URL is to be classified as phishing """
    if "//" in str(l):
        return 1
    return 0

In [28]:
splitted_data['redirection_//_symbol'] = seperation_of_protocol[1].apply(redirection)

In [29]:
splitted_data.head()

Unnamed: 0,protocol,domain_name,address,is_phished,long_url,having_@_symbol,redirection_//_symbol
0,http,br-ofertasimperdiveis.epizy.com,produto.php?linkcompleto=iphone-6-plus-apple-6...,1,1,0,0
1,https,semana-da-oferta.com,produtos.php?id=5abad0c01d149,1,2,0,0
2,https,scrid-apps-creacust-sslhide90766752024.cread-s...,hider_reo/,1,2,0,0
3,http,my-softbank-security.com,wap_login.htm,1,0,0,0
4,http,www.my-softbank-security.com,wap_login.htm,1,0,0,0


Feature-4

4.Adding Prefix or Suffix Separated by (-) to the Domain

The dash symbol is rarely used in legitimate URLs. Phishers tend to add prefixes or suffixes separated by (-) to the domain name
so that users feel that they are dealing with a legitimate webpage. 

For example http://www.Confirme-paypal.com/.
    
IF {Domain Name Part Includes (−) Symbol → Phishing
    
    Otherwise → Legitimate
    
1 --> indicates phishing

0 --> indicates legitimate
    

In [30]:
def prefix_suffix_seperation(l):
    if '-' in str(l):
        return 1
    return 0

In [31]:
splitted_data['prefix_suffix_seperation'] = seperation_domain_name['domain_name'].apply(prefix_suffix_seperation)

In [32]:
splitted_data.head()

Unnamed: 0,protocol,domain_name,address,is_phished,long_url,having_@_symbol,redirection_//_symbol,prefix_suffix_seperation
0,http,br-ofertasimperdiveis.epizy.com,produto.php?linkcompleto=iphone-6-plus-apple-6...,1,1,0,0,1
1,https,semana-da-oferta.com,produtos.php?id=5abad0c01d149,1,2,0,0,1
2,https,scrid-apps-creacust-sslhide90766752024.cread-s...,hider_reo/,1,2,0,0,1
3,http,my-softbank-security.com,wap_login.htm,1,0,0,0,1
4,http,www.my-softbank-security.com,wap_login.htm,1,0,0,0,1


Feature - 5

5. Sub-Domain and Multi Sub-Domains

The legitimate URL link has two dots in the URL since we can ignore typing “www.”. 
If the number of dots is equal to three then the URL is classified as “Suspicious” since it has one sub-domain.
However, if the dots are greater than three it is classified as “Phishy” since it will have multiple sub-domains

0 --- indicates legitimate

1 --- indicates Phishing

2 --- indicates Suspicious


In [33]:
def sub_domains(l):
    l= str(l)
    if l.count('.') < 3:
        return 0
    elif l.count('.') == 3:
        return 2
    return 1

In [34]:
splitted_data['sub_domains'] = splitted_data['domain_name'].apply(sub_domains)

### Classification of URLs using Random forest 

In [35]:
features = ['long_url', 'having_@_symbol', 'redirection_//_symbol','prefix_suffix_seperation','sub_domains']
X = splitted_data[features]

In [36]:
X

Unnamed: 0,long_url,having_@_symbol,redirection_//_symbol,prefix_suffix_seperation,sub_domains
0,1,0,0,1,0
1,2,0,0,1,0
2,2,0,0,1,0
3,0,0,0,1,0
4,0,0,0,1,0
5,0,0,0,0,0
6,2,0,0,0,0
7,0,0,0,0,0
8,0,0,0,0,0
9,0,0,0,0,0


In [40]:
#variable to be predicted; yes = 0 and no = 1
y = splitted_data.is_phished

In [77]:
print(y)

0          1
1          1
2          1
3          1
4          1
5          1
6          1
7          1
8          1
9          1
10         1
11         1
12         1
13         1
14         1
15         1
16         1
17         1
18         1
19         1
20         1
21         1
22         1
23         1
24         1
25         1
26         1
27         1
28         1
29         1
          ..
1048545    0
1048546    0
1048547    0
1048548    0
1048549    0
1048550    0
1048551    0
1048552    0
1048553    0
1048554    0
1048555    0
1048556    0
1048557    0
1048558    0
1048559    0
1048560    0
1048561    0
1048562    0
1048563    0
1048564    0
1048565    0
1048566    0
1048567    0
1048568    0
1048569    0
1048570    0
1048571    0
1048572    0
1048573    0
1048574    0
Name: is_phished, Length: 1048575, dtype: int64


In [42]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

In [43]:
# Create a random forest Classifier. By convention, clf means 'Classifier'
clf = RandomForestClassifier(n_estimators=100,n_jobs=2,random_state=0)

# Train the Classifier to take the training features and learn how they relate
# to the training y (the species)
clf.fit(X, y)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=2,
            oob_score=False, random_state=0, verbose=0, warm_start=False)

In [44]:
pred=clf.predict(X_test)
list(pred)
pred#testing the classifier on test data.

array([0, 0, 0, ..., 0, 0, 0], dtype=int64)

In [111]:
type(pred)

numpy.ndarray

In [45]:
clf.predict_proba(X_test)[0:10] #predicted probability for each class.

array([[0.97365993, 0.02634007],
       [0.97365993, 0.02634007],
       [0.97365993, 0.02634007],
       [0.97365993, 0.02634007],
       [0.97365993, 0.02634007],
       [0.97365993, 0.02634007],
       [0.97365993, 0.02634007],
       [0.97365993, 0.02634007],
       [0.97365993, 0.02634007],
       [0.97365993, 0.02634007]])

In [46]:
y_test1=y_test.as_matrix()

  """Entry point for launching an IPython kernel.


In [47]:
results = confusion_matrix(y_test1, pred) 
print('Confusion Matrix :')
print(results) 
print('Accuracy Score :',accuracy_score(y_test1, pred))
print ('Report : ')
print(classification_report(y_test1, pred))

Confusion Matrix :
[[297616      5]
 [  7950   9002]]
Accuracy Score : 0.9747117521211293
Report : 
             precision    recall  f1-score   support

          0       0.97      1.00      0.99    297621
          1       1.00      0.53      0.69     16952

avg / total       0.98      0.97      0.97    314573



### Evaluating classifier

In [52]:
#importance of featur
list(zip(X, clf.feature_importances_))

[('long_url', 0.7151415523416249),
 ('having_@_symbol', 0.022606576106538066),
 ('redirection_//_symbol', 0.0013633632759144523),
 ('prefix_suffix_seperation', 0.14657620936784801),
 ('sub_domains', 0.11431229890807432)]