# MALICIOUS WEB DETECTION WITH 1D CNN (Convolutional Neural Network)
## Author: ** Perlz **
## Initial Author: ** Rakha Paleva Kawiswara **

This work is a branch of the original work done by Rakha Paleva Kawisara.

Feel free to use this notebook for your research, along with the original contribution by Rakha Kawisara; upvotes to this notebook are appreciated :). Suggestions on this notebook are very welcome. Thanks!

*nb
1. psst... do not open the URLs contained in the data, to avoid opening dangerous websites. Because you know... they are malicious!

What is a Malicious Website?
A malicious website is a site that attempts to install malware (a general term for anything that will disrupt computer operation, gather your personal information or, in a worst-case scenario, gain total access to your machine) onto your device. This usually requires some action on your part, however, in the case of a drive-by download, the website will attempt to install software on your computer without asking for permission first. (source: https://us.norton.com/internetsecurity-malware-what-are-malicious-websites.html)

Notebook Goals
1. Demonstrate EDA of the original data set using Pandas-Profiling
2. Replicate the 1D Convolutional Neural Network used to detect malicious websites
3. Improve upon the original CNN written by Kawisara
4. Create a model that can detect malicious websites. The website URL is used as the feature and a 1D Convolutional Neural Network (CNN) is used as the detection algorithm. The model will be validated with the holdout method

# Exploratory Data Analysis with Pandas Profiling

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python

#import core python & kaggle dependencies
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))


In [None]:
#load data which is sourced from previous kaggle contributions
data = pd.read_csv("../input/urldataset/data.csv")

In [None]:
#old skewl data triage
data.shape

In [None]:
data.head(5)

## Now let's introduce some pandas profiling

In [None]:
import pandas_profiling
from pandas_profiling import ProfileReport

In [None]:
#the file is too large for a full visual report, so we use minimal mode
profile = ProfileReport(data, minimal=True, title = "Features to Evaluate Data Profile")

In [None]:
#Let's look at the data from within the notebook
profile.to_notebook_iframe()

In [None]:
#use this if you want a separate html file outside of jupyter notebook.  
#In Kaggle, look to the right panel in the 'output' folder to find the html file
profile = ProfileReport(data, minimal=True)
profile.to_file("output.html")

## Pandas-Profiling EDA Findings
* 411247 distinct rows of data
* 9216 duplicate rows of data, representing 2.2% of the data
  * No value occurs more than 27 times
* URLs have 'high cardinality', which means most values are distinct; this makes them great candidates for analysis
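
The counts above can be cross-checked with plain pandas. The following is a minimal sketch on a made-up five-row frame: the `toy` frame and its URLs are assumptions for illustration, and only the `url` and `label` column names come from the notebook. Pandas-profiling may count duplicates slightly differently, so treat this as a sanity check rather than a reproduction of the report.

```python
import pandas as pd

# Toy stand-in for the notebook's `data` frame (hypothetical rows).
toy = pd.DataFrame({
    "url": ["a.com/x", "b.org/y", "a.com/x", "c.net/z", "a.com/x"],
    "label": ["good", "bad", "good", "good", "good"],
})

n_rows = len(toy)
# Rows that remain after dropping exact repeats.
n_distinct = len(toy.drop_duplicates())
# Repeats beyond the first occurrence of each row.
n_duplicate = int(toy.duplicated().sum())
# How often the most repeated URL occurs.
max_repeats = int(toy["url"].value_counts().max())

print(f"rows={n_rows} distinct={n_distinct} duplicates={n_duplicate} "
      f"({n_duplicate / n_rows:.1%}) max_repeats={max_repeats}")
```

Running the same four expressions on the real `data` frame should give figures comparable to the findings listed above.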



# Pandas Profiling Complete

# Feature Engineering
Let's do some feature engineering. First we want to find out whether the data is imbalanced.

Let's install and import the dependencies first.

In [None]:
# install additional library
!pip install tldextract -q

# import library
import numpy as np
import pandas as pd
import re
import matplotlib.image as mpimg
import matplotlib.pyplot as plt
import plotly.graph_objects as go
import plotly.express as px
import plotly.io as pio
from plotly.subplots import make_subplots
import seaborn as sns
import gc
import random
import os
import pickle
import tensorflow as tf
from tensorflow.python.util import deprecation
from urllib.parse import urlparse
import tldextract

from sklearn.model_selection import train_test_split
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras import models, layers, backend, metrics
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.utils import plot_model
from PIL import Image
from sklearn.metrics import confusion_matrix, classification_report

# set random seed
os.environ['PYTHONHASHSEED'] = '0'
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'
np.random.seed(0)
random.seed(0)
tf.random.set_seed(0)

# other setup
%config InlineBackend.figure_format = 'retina'
pd.set_option('max_colwidth', 50)
pio.templates.default = "presentation"
pd.options.plotting.backend = "plotly"
deprecation._PRINT_DEPRECATION_WARNINGS = False

In [None]:
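# take the slice names from the counts themselves so each slice matches its actual class
label_counts = data['label'].value_counts()
fig = go.Figure([go.Pie(labels=label_counts.index, values=label_counts.values)])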
fig.update_layout(title='Percentage of Class (Good vs. Bad)')
fig.show()
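
The pie chart shows the imbalance visually; the same information can be expressed as numbers. The sketch below computes each class's share and a "balanced" class weight, `n_samples / (n_classes * class_count)`, which is the heuristic scikit-learn uses for `class_weight='balanced'`. The helper name `class_balance` and the toy labels are assumptions for illustration, and the original notebook does not apply class weights.

```python
from collections import Counter

def class_balance(labels):
    """Return (share per class, balanced weight per class) for a list of labels."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    shares = {cls: cnt / n_samples for cls, cnt in counts.items()}
    # Rare classes get weights above 1, frequent classes below 1.
    weights = {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}
    return shares, weights

# Hypothetical label column: 8 good URLs and 2 bad ones.
toy_labels = ["good"] * 8 + ["bad"] * 2
shares, weights = class_balance(toy_labels)
print(shares)
print(weights)
```

With the real data, `class_balance(list(data['label']))` would give the shares behind the chart; whether to pass such weights to training is a separate design choice.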

In [None]:
#set validation size to 0.2 (20%); this holdout portion is used as the validation data
val_size = 0.2
train_data, val_data = train_test_split(data, test_size=val_size, stratify=data['label'], random_state=0)
fig = go.Figure([go.Pie(labels=['Train Size', 'Validation Size'], values=[train_data.shape[0], val_data.shape[0]])])
fig.update_layout(title='Train and Validation Size')
fig.show()
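
To make the `stratify` argument concrete, here is a plain-Python sketch of a stratified holdout split: each class is shuffled and split separately, so the validation set keeps the same class ratio as the full data. The function `stratified_holdout` and the toy labels are assumptions for illustration; this is not how scikit-learn implements `train_test_split` internally.

```python
import random
from collections import defaultdict

def stratified_holdout(labels, val_size=0.2, seed=0):
    """Split row indices into train/validation while preserving class ratios."""
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)

    rng = random.Random(seed)
    train_idx, val_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # Hold out the same fraction of every class.
        n_val = round(len(indices) * val_size)
        val_idx.extend(indices[:n_val])
        train_idx.extend(indices[n_val:])
    return sorted(train_idx), sorted(val_idx)

# Hypothetical labels: 80 good and 20 bad URLs.
toy_labels = ["good"] * 80 + ["bad"] * 20
train_idx, val_idx = stratified_holdout(toy_labels)
bad_in_val = sum(toy_labels[i] == "bad" for i in val_idx)
print(len(train_idx), len(val_idx), bad_in_val)
```

In the toy example the validation set holds 20% of each class, which is the property the stratified split above relies on.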

In [None]:

def parsed_url(url):
    # extract subdomain, domain, and domain suffix from url
    # (attribute access avoids depending on how many fields the result object carries)
    extracted = tldextract.extract(url)
    parts = [extracted.subdomain, extracted.domain, extracted.suffix]
    # if item == '', fill with '<empty>'
    return ['<empty>' if part == '' else part for part in parts]

def extract_url(data):
    # pass the parsed_url(url) as a for loop and create new columns.  Create new df with results
    extract_url_data = [parsed_url(url) for url in data['url']]
    extract_url_data = pd.DataFrame(extract_url_data, columns=['subdomain', 'domain', 'domain_suffix'])
    
    # concat extracted feature with main data
    data = data.reset_index(drop=True)
    data = pd.concat([data, extract_url_data], axis=1)
    
    return data

#preview the data (the new columns only appear once extract_url is called in a later cell)
data.head(5)
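
Why use `tldextract` instead of splitting the host on dots? The standard-library sketch below shows the problem: a naive split treats only the last label as the suffix, which breaks on multi-part public suffixes such as `co.uk`. The helper names `hostname` and `naive_split` and the example URLs are assumptions for illustration, not part of the original notebook.

```python
from urllib.parse import urlparse

def hostname(url):
    """Return the lower-cased host part of a URL, with or without a scheme."""
    # urlparse only recognises a host after '//', so add it for scheme-less URLs.
    if "://" not in url:
        url = "//" + url
    return urlparse(url).hostname or ""

def naive_split(url):
    """Naive split that assumes the suffix is always the last dot-separated label."""
    labels = hostname(url).split(".")
    suffix = labels[-1]
    domain = labels[-2] if len(labels) > 1 else ""
    subdomain = ".".join(labels[:-2])
    return subdomain, domain, suffix

print(naive_split("www.example.com/login"))    # splits as expected
print(naive_split("www.example.co.uk/login"))  # 'co' is wrongly reported as the domain
```

A library backed by the Public Suffix List, such as `tldextract`, is designed to handle the second case, which is why the notebook relies on it.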



In [None]:
def get_frequent_group(data, n_group):
    # get the most frequent
    data = data.value_counts().reset_index(name='values')
    
    # scale log base 10
    data['values'] = np.log10(data['values'])
    
    # calculate total values
    # x_column (subdomain / domain / domain_suffix)
    x_column = data.columns[1]
    data['total_values'] = data[x_column].map(data.groupby(x_column)['values'].sum().to_dict())
    
    # get n_group data order by highest values
    data_group = data.sort_values('total_values', ascending=False).iloc[:, 1].unique()[:n_group]
    data = data[data.iloc[:, 1].isin(data_group)]
    data = data.sort_values('total_values', ascending=False)
    
    return data

data

In [None]:
def plot(data, n_group, title):
    data = get_frequent_group(data, n_group)
    fig = px.bar(data, x=data.columns[1], y='values', color='label')
    fig.update_layout(title=title)
    fig.show()

# extract url
data = extract_url(data)
train_data = extract_url(train_data)
val_data = extract_url(val_data)

In [None]:
data

In [None]:
fig = go.Figure([go.Bar(
    x=['Domain', 'Subdomain', 'Domain Suffix'], 
    y = [data.domain.nunique(), data.subdomain.nunique(), data.domain_suffix.nunique()]
)])
fig.show()

In [None]:
#Let's look at the data from within the notebook
profile = ProfileReport(data, minimal=True)
profile.to_notebook_iframe()

In [None]:
data.head(5)

In [None]:
plot(
    data=data.groupby('label')['domain'], 
    n_group=20, 
    title='Top 20 Domains Grouped By Labels (Logarithmic Scale)'
)

In [None]:
# tokenization on the url so that it can be used as input to the CNN model
tokenizer = Tokenizer(filters='', char_level=True, lower=False, oov_token=1)

# fit only on training data
tokenizer.fit_on_texts(train_data['url'])
n_char = len(tokenizer.word_index.keys())

train_seq = tokenizer.texts_to_sequences(train_data['url'])
val_seq = tokenizer.texts_to_sequences(val_data['url'])

print('Before tokenization: ')
print(train_data.iloc[0]['url'])
print('\nAfter tokenization: ')
print(train_seq[0])
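
To show what character-level tokenization does, here is a plain-Python sketch that mimics the idea: characters are ranked by frequency in the training text, index 0 is left free for padding, index 1 is reserved for characters never seen in training, and known characters start at 2. The helpers `fit_char_vocab` and `encode` and the toy URLs are assumptions for illustration; this is not the Keras `Tokenizer` implementation.

```python
from collections import Counter

def fit_char_vocab(texts, oov_index=1):
    """Map each character to an integer, most frequent first, starting after the OOV slot."""
    counts = Counter(ch for text in texts for ch in text)
    return {ch: i for i, (ch, _) in enumerate(counts.most_common(), start=oov_index + 1)}

def encode(text, vocab, oov_index=1):
    """Replace each character by its index; unseen characters get the OOV index."""
    return [vocab.get(ch, oov_index) for ch in text]

# Fit on (hypothetical) training URLs only, as the notebook does.
train_urls = ["abc.com", "cab.com"]
vocab = fit_char_vocab(train_urls)

seen = encode("abc.com", vocab)
unseen = encode("abz.com", vocab)  # 'z' never appeared in training
print(seen)
print(unseen)
```

Fitting on the training split only means validation URLs can contain unseen characters, and those all collapse to the single out-of-vocabulary index.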

In [None]:
# Each url has a different length, therefore padding is needed to equalize each url length. 
# Next step we will do padding on url column that was just tokenized
sequence_length = np.array([len(i) for i in train_seq])
sequence_length = np.percentile(sequence_length, 99).astype(int)
print(f'Before padding: \n {train_seq[0]}')
train_seq = pad_sequences(train_seq, padding='post', maxlen=sequence_length)
val_seq = pad_sequences(val_seq, padding='post', maxlen=sequence_length)
print(f'After padding: \n {train_seq[0]}')
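
The padding step can be sketched in plain Python. One detail worth knowing: `pad_sequences` pads at the end here because of `padding='post'`, but its default `truncating='pre'` still applies, so URLs longer than `sequence_length` lose their first characters and keep their last ones. The helper `pad_post` below is an assumption written for illustration and only mirrors that behavior on lists.

```python
def pad_post(seq, maxlen, truncating="pre", value=0):
    """Pad at the end with `value`; cut long sequences from the front ('pre') or back ('post')."""
    if len(seq) > maxlen:
        seq = seq[-maxlen:] if truncating == "pre" else seq[:maxlen]
    return list(seq) + [value] * (maxlen - len(seq))

short = pad_post([1, 2, 3], maxlen=5)
long_pre = pad_post([1, 2, 3, 4, 5, 6], maxlen=4)                      # default: drops the start
long_post = pad_post([1, 2, 3, 4, 5, 6], maxlen=4, truncating="post")  # drops the end
print(short, long_pre, long_post)
```

If the beginning of a URL (scheme and domain) is considered more informative than its tail, passing `truncating='post'` to `pad_sequences` is an option to evaluate.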

In [None]:
#now let's integer-encode the other columns (subdomain, domain, domain suffix)
unique_value = {}
for feature in ['subdomain', 'domain', 'domain_suffix']:
    # get unique value
    label_index = {label: index for index, label in enumerate(train_data[feature].unique())}
    
    # add unknown label in last index
    label_index['<unknown>'] = list(label_index.values())[-1] + 1
    
    # count unique value
    unique_value[feature] = label_index['<unknown>']
    
    # encode (values not seen in training map to '<unknown>');
    # plain column assignment lets pandas infer an integer dtype for the encoded column
    unknown = label_index['<unknown>']
    train_data[feature] = [label_index.get(val, unknown) for val in train_data[feature]]
    val_data[feature] = [label_index.get(val, unknown) for val in val_data[feature]]
    
train_data.head()
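
A small worked example may help show how the `<unknown>` slot behaves. The sketch below builds the index from training values only and then encodes a validation list that contains a value never seen in training. The helper names and the toy domain names are assumptions for illustration.

```python
def build_label_index(train_values):
    """Assign 0, 1, 2, ... to values in order of first appearance, then add '<unknown>' last."""
    index = {}
    for value in train_values:
        index.setdefault(value, len(index))
    index["<unknown>"] = len(index)
    return index

def encode_labels(values, index):
    """Encode values with the index; anything unseen falls back to '<unknown>'."""
    return [index.get(value, index["<unknown>"]) for value in values]

# Hypothetical domain columns for the training and validation splits.
train_domains = ["google", "paypal", "google", "<empty>"]
val_domains = ["paypal", "never-seen-before"]

index = build_label_index(train_domains)
encoded_val = encode_labels(val_domains, index)
print(index)
print(encoded_val)
```

Because the unknown index equals the number of distinct training values, an embedding layer for such a feature would need one more row than that count.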

In [None]:
#Encode the label (good -> 0, anything else -> 1)
#use `df` as the loop variable so the full `data` frame is not overwritten
for df in [train_data, val_data]:
    df['label'] = [0 if i == 'good' else 1 for i in df['label']]
    
train_data.head()

In [None]:
#the file is too large for a full visual report, so we use minimal mode
#profile the encoded training data so it can be compared with the earlier, unencoded profile
profile3 = ProfileReport(train_data, minimal=True, title = "Features to Evaluate Data Profile")
#Let's look at the data from within the notebook
profile3.to_notebook_iframe()

# Conclusion
We have successfully 'cleaned' the data.  We have 'profiled' the data.  We have converted/transformed/encoded the data to a machine-friendly format.  We 'profiled' it again and received a different view of the same data.  Now you can compare and contrast the deltas.  How does the 'subdomain' variable look before and after the encoding?  Repeat and explore.

Next step is to run some Neural Networks!