# Normalization
In this notebook we normalize the data from the previous step (clean_data.csv) and write it to a new CSV (norm_data.csv).

## **Side note**
This notebook also covers the feature engineering.
The current dataset already has all the features we need to train our models, so we wrote a Python script (analyse_url.py) that converts a new URL into the same features. The model can then use them to predict whether the URL is phishing or legitimate.
Together with the cleanup data notebook, this notebook also does a little bit of feature selection: here we remove the url column from the dataset, and the whois data was already removed in the cleanup data notebook.
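
The script analyse_url.py itself is not shown in this notebook, so the following is only a minimal sketch of how a few of the count-based features could be derived from a raw URL. The helper names `shannon_entropy` and `basic_url_features` are hypothetical, and the assumption that entropy is the Shannon entropy (in bits) over all characters of the URL is ours.

```python
import math
from collections import Counter


def shannon_entropy(text: str) -> float:
    """Shannon entropy (in bits) of the character distribution of `text`."""
    if not text:
        return 0.0
    total = len(text)
    counts = Counter(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def basic_url_features(url: str) -> dict:
    """Hypothetical helper: a handful of simple features for one URL."""
    return {
        "url_length": len(url),
        "url_entropy": shannon_entropy(url),
        "dot_count": url.count("."),
        "at_count": url.count("@"),
        "dash_count": url.count("-"),
    }


features = basic_url_features("http://aseel-tourism.com/--/78703/Login.html")
print(features)
```

For the sample rows shown in this notebook, these definitions give the same url_length, dot_count and dash_count as the dataset, which suggests (but does not confirm) that the original script counts characters over the full URL string.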

## Imports

In [1]:
from IPython import InteractiveShell

InteractiveShell.ast_node_interactivity = "all"

import pandas as pd

## Reading the CSV
We use pandas to read the csv with the correct options.

In [2]:
df = pd.read_csv('../data/clean_data.csv', header=0, decimal='.')

df.head()

Unnamed: 0,url,label,url_length,starts_with_ip,url_entropy,has_punycode,digit_letter_ratio,dot_count,at_count,dash_count,tld_count,domain_has_digits,subdomain_count,nan_char_entropy,has_internal_links,domain_age_days
0,bluevalentinemovie.com,legitimate,22,False,3.550341,False,0.0,1,0,0,0,False,0,0.202701,False,5355.0
1,divergeit.deskdirector.com,legitimate,26,False,3.536414,False,0.0,2,0,0,0,False,1,0.284649,False,4489.0
2,https://confirmation-sms-code.ig-email.com/aut...,phishing,53,False,4.215075,False,0.0,2,0,3,0,False,1,0.520993,False,81.0
3,http://aseel-tourism.com/--/78703/Login.html,phishing,44,False,4.371379,False,0.178571,2,0,3,0,False,0,0.683314,False,4281.0
4,st.truyenqqviet.com,legitimate,19,False,3.681881,False,0.0,2,0,0,0,False,1,0.341887,False,173.0


## Copy the df to a work dataframe
We'll use X for the feature columns of the df and y for the url and label columns, which don't need to be normalised.

In [3]:
X = df.copy()

url = X.pop('url')
label = X.pop('label')

y = pd.DataFrame().assign(url=url, label=label)

In [4]:
X
y

Unnamed: 0,url_length,starts_with_ip,url_entropy,has_punycode,digit_letter_ratio,dot_count,at_count,dash_count,tld_count,domain_has_digits,subdomain_count,nan_char_entropy,has_internal_links,domain_age_days
0,22,False,3.550341,False,0.000000,1,0,0,0,False,0,0.202701,False,5355.0
1,26,False,3.536414,False,0.000000,2,0,0,0,False,1,0.284649,False,4489.0
2,53,False,4.215075,False,0.000000,2,0,3,0,False,1,0.520993,False,81.0
3,44,False,4.371379,False,0.178571,2,0,3,0,False,0,0.683314,False,4281.0
4,19,False,3.681881,False,0.000000,2,0,0,0,False,1,0.341887,False,173.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2499987,47,False,4.133931,False,0.111111,1,0,0,0,False,0,0.580266,False,4281.0
2499988,19,False,3.616349,False,0.600000,2,0,1,0,False,1,0.341887,False,5765.0
2499989,44,False,4.018884,False,0.000000,2,0,0,0,False,0,0.590948,False,4281.0
2499990,435,False,4.616809,False,0.021212,23,1,0,4,False,1,1.112024,True,4281.0


Unnamed: 0,url,label
0,bluevalentinemovie.com,legitimate
1,divergeit.deskdirector.com,legitimate
2,https://confirmation-sms-code.ig-email.com/aut...,phishing
3,http://aseel-tourism.com/--/78703/Login.html,phishing
4,st.truyenqqviet.com,legitimate
...,...,...
2499987,http://pamnacty.best/b0a/bankofamerica/8caf8ff7,phishing
2499988,hc2290-59.iphmx.com,legitimate
2499989,http://religioustourism.gr/matchprofile.html,phishing
2499990,https://amoezn.jepan.design/signim/?openid.pap...,phishing


## Normalising the data
Models need numerical input, so we want every column to be numeric, preferably an int where that is possible.
Below we check which columns are already of type int.

In [5]:
discrete_features = X.dtypes == int

discrete_features

url_length             True
starts_with_ip        False
url_entropy           False
has_punycode          False
digit_letter_ratio    False
dot_count              True
at_count               True
dash_count             True
tld_count              True
domain_has_digits     False
subdomain_count        True
nan_char_entropy      False
has_internal_links    False
domain_age_days       False
dtype: bool

We can see that most of the columns are not of type int yet. Below we will convert the boolean and string columns.
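
Comparing a dtype with the built-in `int` only matches the platform's default integer width. As a hedged alternative, the sketch below uses `pandas.api.types.is_integer_dtype`, which is true for any integer dtype. The frame `toy` is a made-up stand-in for X.

```python
import pandas as pd
from pandas.api.types import is_integer_dtype

# Toy frame standing in for X; the column names mirror the dataset.
toy = pd.DataFrame({
    "url_length": [22, 26],
    "starts_with_ip": [False, True],
    "url_entropy": [3.55, 3.54],
})

# True for any integer dtype (int32, int64, ...), False for bool and float.
is_int = toy.dtypes.map(is_integer_dtype)
print(is_int)
```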

### Convert the boolean types to an int

Many models require numerical input and cannot handle boolean values directly. Not every model has this requirement, but we convert the booleans anyway to prevent conflicts later on.

In [6]:
bools = []

for col in X.select_dtypes('bool'):
    bools.append(col)

bools

['starts_with_ip', 'has_punycode', 'domain_has_digits', 'has_internal_links']

The columns listed above hold boolean values. It is better to convert these to an int (0 or 1). In the dataframe below we can see that all the values are shown as False or True.

In [7]:
X[bools]

Unnamed: 0,starts_with_ip,has_punycode,domain_has_digits,has_internal_links
0,False,False,False,False
1,False,False,False,False
2,False,False,False,False
3,False,False,False,False
4,False,False,False,False
...,...,...,...,...
2499987,False,False,False,False
2499988,False,False,False,False
2499989,False,False,False,False
2499990,False,False,False,True


In [8]:
for col in X.select_dtypes("bool"):
    X[col] = X[col].astype(int)

In [9]:
bools_new = []

for col in X.select_dtypes('bool'):
    bools_new.append(col)

bools_new

[]

The list is now empty, so the conversion worked. When we look at the values in X for the columns from the first list, we now see 0's and 1's.

In [10]:
X[bools]

Unnamed: 0,starts_with_ip,has_punycode,domain_has_digits,has_internal_links
0,0,0,0,0
1,0,0,0,0
2,0,0,0,0
3,0,0,0,0
4,0,0,0,0
...,...,...,...,...
2499987,0,0,0,0
2499988,0,0,0,0
2499989,0,0,0,0
2499990,0,0,0,1
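
The column-by-column loop above can also be written as a single cast. This is a minimal sketch on a made-up frame `toy`, not a replacement for the cells above. It passes a column-to-dtype mapping to `astype`, so only the boolean columns are touched.

```python
import pandas as pd

# Toy frame standing in for X.
toy = pd.DataFrame({
    "starts_with_ip": [False, True],
    "has_punycode": [False, False],
    "url_length": [22, 26],
})

# Collect the boolean columns and cast them all to int in one call.
bool_cols = toy.select_dtypes("bool").columns
toy = toy.astype({col: int for col in bool_cols})

print(toy.dtypes)
```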


In [11]:
discrete_features = X.dtypes == int

discrete_features

url_length             True
starts_with_ip         True
url_entropy           False
has_punycode           True
digit_letter_ratio    False
dot_count              True
at_count               True
dash_count             True
tld_count              True
domain_has_digits      True
subdomain_count        True
nan_char_entropy      False
has_internal_links     True
domain_age_days       False
dtype: bool

We can see above that most of the columns are now of type int. Next we try to convert the string types to an int.

### Convert strings to int
Converting strings to an int can be done by taking all the unique string values and giving each one its own int value. This is needed because most models cannot work with string values directly, so it is better to encode any string columns ourselves. Below we perform this action.

In [12]:
objects = []
for col in X.select_dtypes("object"):
    objects.append(col)

objects

[]

We can see that the dataset has no object types to be converted, but we will perform the action to be sure. 

In [13]:
for col in X.select_dtypes("object"):
    X[col], _ = X[col].factorize()
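
Because X has no object columns, the cell above does nothing here. To illustrate what `factorize` returns, here is a small sketch on a made-up column of top-level domains. It also keeps the second return value, which the cell above discards, so that the same mapping can be reused on new data. Reusing it through `Index.get_indexer` is our suggestion and not something the notebook does.

```python
import pandas as pd

# Made-up string column; not part of the dataset.
train = pd.Series(["com", "org", "com", "net"])

# codes holds one int per row, categories holds the unique values in
# order of first appearance.
codes, categories = train.factorize()
print(codes, list(categories))

# Encode new data with the same mapping; unseen values become -1.
new = pd.Series(["org", "xyz"])
new_codes = categories.get_indexer(new)
print(new_codes)
```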

### Convert big numbers to a normalised standard
We will rescale the numerical columns that have a large range of values to a range between 0 and 1 (min-max scaling).

In [14]:
range_df = pd.DataFrame(data={
    "Min": X.min(),
    "Max": X.max(),
    "Range": X.max() - X.min()
}).sort_values("Range", ascending=False)

range_df

Unnamed: 0,Min,Max,Range
domain_age_days,0.0,45541.0,45541.0
url_length,4.0,25523.0,25519.0
dash_count,0.0,322.0,322.0
dot_count,0.0,211.0,211.0
tld_count,0.0,65.0,65.0
subdomain_count,0.0,43.0,43.0
at_count,0.0,32.0,32.0
digit_letter_ratio,0.0,20.84,20.84
url_entropy,0.100836,6.048781,5.947945
nan_char_entropy,0.016863,1.901504,1.884641


In [15]:
range_df.to_csv('../models/scale.csv')

Above we can see the min, max and range of all the numerical columns.
Below we will rescale every column with a range higher than 1 to a range between 0 and 1.

In [16]:
# Min-max scale every column whose range is larger than 1
for index in range_df[range_df['Range'] > 1].index:
    X[index] = (X[index] - range_df.loc[index, 'Min']) / range_df.loc[index, 'Range']
    
X

Unnamed: 0,url_length,starts_with_ip,url_entropy,has_punycode,digit_letter_ratio,dot_count,at_count,dash_count,tld_count,domain_has_digits,subdomain_count,nan_char_entropy,has_internal_links,domain_age_days
0,0.000705,0,0.579949,0,0.000000,0.004739,0.00000,0.000000,0.000000,0,0.000000,0.098607,0,0.117586
1,0.000862,0,0.577608,0,0.000000,0.009479,0.00000,0.000000,0.000000,0,0.023256,0.142089,0,0.098571
2,0.001920,0,0.691708,0,0.000000,0.009479,0.00000,0.009317,0.000000,0,0.023256,0.267494,0,0.001779
3,0.001567,0,0.717986,0,0.008569,0.009479,0.00000,0.009317,0.000000,0,0.000000,0.353622,0,0.094003
4,0.000588,0,0.602064,0,0.000000,0.009479,0.00000,0.000000,0.000000,0,0.023256,0.172459,0,0.003799
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2499987,0.001685,0,0.678065,0,0.005332,0.004739,0.00000,0.000000,0.000000,0,0.000000,0.298944,0,0.094003
2499988,0.000588,0,0.591047,0,0.028791,0.009479,0.00000,0.003106,0.000000,0,0.023256,0.172459,0,0.126589
2499989,0.001567,0,0.658723,0,0.000000,0.009479,0.00000,0.000000,0.000000,0,0.000000,0.304612,0,0.094003
2499990,0.016889,0,0.759249,0,0.001018,0.109005,0.03125,0.000000,0.061538,0,0.023256,0.581098,1,0.094003
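
The same scaling has to be applied to any new URL before it is passed to the model, which is presumably why the range table was saved. Below is a minimal sketch of that step under stated assumptions: `apply_scale` is a hypothetical helper, and the `scale` frame is typed in by hand with two rows copied from the range table above instead of being read from the saved file.

```python
import pandas as pd

# Stand-in for the saved table. In practice it could be loaded with
# pd.read_csv('../models/scale.csv', index_col=0).
scale = pd.DataFrame(
    {"Min": [0.0, 4.0], "Max": [45541.0, 25523.0], "Range": [45541.0, 25519.0]},
    index=["domain_age_days", "url_length"],
)


def apply_scale(row: pd.Series, scale: pd.DataFrame) -> pd.Series:
    """Min-max scale the entries of `row` whose saved range exceeds 1."""
    scaled = row.astype(float).copy()
    for col in scale[scale["Range"] > 1].index:
        if col in scaled.index:
            scaled[col] = (row[col] - scale.loc[col, "Min"]) / scale.loc[col, "Range"]
    return scaled


# Raw feature values of the first row of the dataset.
new_row = pd.Series({"url_length": 22, "domain_age_days": 5355.0})
result = apply_scale(new_row, scale)
print(result)
```

Note that a new URL with a value outside the range seen in this dataset will end up below 0 or above 1.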


## Remove the URL from the y dataset
Because the URL itself has no significance for the models, we remove it from the normalized dataset.

In [17]:
y.pop("url")

0                                     bluevalentinemovie.com
1                                 divergeit.deskdirector.com
2          https://confirmation-sms-code.ig-email.com/aut...
3               http://aseel-tourism.com/--/78703/Login.html
4                                        st.truyenqqviet.com
                                 ...                        
2499987      http://pamnacty.best/b0a/bankofamerica/8caf8ff7
2499988                                  hc2290-59.iphmx.com
2499989         http://religioustourism.gr/matchprofile.html
2499990    https://amoezn.jepan.design/signim/?openid.pap...
2499991                                           ldsmag.com
Name: url, Length: 2499992, dtype: object

## Join the normalised data and the label dataframe
After the normalisation we join the two dataframes together again so the result can be saved to a new CSV.

In [18]:
norm_data = pd.concat([y, X], axis=1)

norm_data.head()

Unnamed: 0,label,url_length,starts_with_ip,url_entropy,has_punycode,digit_letter_ratio,dot_count,at_count,dash_count,tld_count,domain_has_digits,subdomain_count,nan_char_entropy,has_internal_links,domain_age_days
0,legitimate,0.000705,0,0.579949,0,0.0,0.004739,0.0,0.0,0.0,0,0.0,0.098607,0,0.117586
1,legitimate,0.000862,0,0.577608,0,0.0,0.009479,0.0,0.0,0.0,0,0.023256,0.142089,0,0.098571
2,phishing,0.00192,0,0.691708,0,0.0,0.009479,0.0,0.009317,0.0,0,0.023256,0.267494,0,0.001779
3,phishing,0.001567,0,0.717986,0,0.008569,0.009479,0.0,0.009317,0.0,0,0.0,0.353622,0,0.094003
4,legitimate,0.000588,0,0.602064,0,0.0,0.009479,0.0,0.0,0.0,0,0.023256,0.172459,0,0.003799
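
With `axis=1`, `pd.concat` matches rows on their index labels, not on their position. That works here because X and y both keep the index of the original df. The sketch below illustrates this on two made-up frames whose rows are deliberately in a different order.

```python
import pandas as pd

labels = pd.DataFrame({"label": ["legitimate", "phishing"]}, index=[0, 1])
# Same index labels as `labels`, but in a shuffled order.
feats = pd.DataFrame({"url_length": [0.1, 0.2]}, index=[1, 0])

# Rows are paired by index label, so label 0 gets the value 0.2.
joined = pd.concat([labels, feats], axis=1)
print(joined)
```

If one of the frames had been filtered or re-indexed in between, the join would produce NaN values for the rows that no longer match.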


## Save the new normalised data to a CSV
We use the option 'index=False' so the index of the dataframe isn't saved as an extra column in the CSV.

In [19]:
norm_data.to_csv('../data/norm_data.csv', index=False)