# Assignment: Malicious and Benign Websites


## Kaggle Competition: https://www.kaggle.com/xwolf12/malicious-and-benign-websites

The project evaluates different classification models for predicting whether a website is malicious or benign, based on application-layer and network characteristics. The data were obtained by feeding verified sources of benign and malicious URLs into a low-interaction client honeypot to isolate the network traffic. Additional tools supplied further information, such as the server country from Whois.

This is the first version, and it includes some initial results from applying machine learning classifiers in a bachelor thesis. Further details on the data processing and the data description can be found in the article below.

### Dataset
Collecting the data is an important topic and one of the most difficult parts of the process. Following other articles and another open resource, we used three blacklists:

* machinelearning.inginf.units.it/data-andtools/hidden-fraudulent-urls-dataset
* malwaredomainlist.com
* zeuztacker.abuse.ch

From them we got around 185181 URLs. We assumed they were malicious based on the information from these sources; as a next research step, we recommend verifying them through another security tool, such as VirusTotal.



## Task 1: Problem Statement
Discuss the problem setting and the first implications of the given data set...
* What assumptions can we make about the data?
* What problems are we expecting?

## Task 2: First Data Analysis, Cleaning and Feature Extraction
* Import the data to a Pandas DataFrame
* Run first simple statistics and visualizations
* Is there a need to clean the data? If yes, do so...
* Can you use the raw data directly, or should you extract features? What features are suitable?


In [1]:
import pandas as pd

In [2]:
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

In [3]:
data = pd.read_csv('dataset.csv' , encoding = "ISO-8859-1" )

In [4]:
data.head()

Unnamed: 0,URL,URL_LENGTH,NUMBER_SPECIAL_CHARACTERS,CHARSET,SERVER,CONTENT_LENGTH,WHOIS_COUNTRY,WHOIS_STATEPRO,WHOIS_REGDATE,WHOIS_UPDATED_DATE,...,DIST_REMOTE_TCP_PORT,REMOTE_IPS,APP_BYTES,SOURCE_APP_PACKETS,REMOTE_APP_PACKETS,SOURCE_APP_BYTES,REMOTE_APP_BYTES,APP_PACKETS,DNS_QUERY_TIMES,Type
0,M0_109,16,7,iso-8859-1,nginx,263.0,,,10/10/2015 18:21,,...,0,2,700,9,10,1153,832,9,2.0,1
1,B0_2314,16,6,UTF-8,Apache/2.4.10,15087.0,,,,,...,7,4,1230,17,19,1265,1230,17,0.0,0
2,B0_911,16,6,us-ascii,Microsoft-HTTPAPI/2.0,324.0,,,,,...,0,0,0,0,0,0,0,0,0.0,0
3,B0_113,17,6,ISO-8859-1,nginx,162.0,US,AK,7/10/1997 4:00,12/09/2013 0:45,...,22,3,3812,39,37,18784,4380,39,8.0,0
4,B0_403,17,6,UTF-8,,124140.0,US,TX,12/05/1996 0:00,11/04/2017 0:00,...,2,5,4278,61,62,129889,4586,61,4.0,0


In [5]:
data.describe(include='all')

Unnamed: 0,URL,URL_LENGTH,NUMBER_SPECIAL_CHARACTERS,CHARSET,SERVER,CONTENT_LENGTH,WHOIS_COUNTRY,WHOIS_STATEPRO,WHOIS_REGDATE,WHOIS_UPDATED_DATE,...,DIST_REMOTE_TCP_PORT,REMOTE_IPS,APP_BYTES,SOURCE_APP_PACKETS,REMOTE_APP_PACKETS,SOURCE_APP_BYTES,REMOTE_APP_BYTES,APP_PACKETS,DNS_QUERY_TIMES,Type
count,1781,1781.0,1781.0,1781,1780,969.0,1781,1781,1781.0,1781.0,...,1781.0,1781.0,1781.0,1781.0,1781.0,1781.0,1781.0,1781.0,1780.0,1781.0
unique,1781,,,9,239,,49,182,891.0,594.0,...,,,,,,,,,,
top,B0_88,,,UTF-8,Apache,,US,CA,,,...,,,,,,,,,,
freq,1,,,676,386,,1103,372,127.0,139.0,...,,,,,,,,,,
mean,,56.961258,11.111735,,,11726.927761,,,,,...,5.472768,3.06064,2982.339,18.540146,18.74621,15892.55,3155.599,18.540146,2.263483,0.12128
std,,27.555586,4.549896,,,36391.809051,,,,,...,21.807327,3.386975,56050.57,41.627173,46.397969,69861.93,56053.78,41.627173,2.930853,0.326544
min,,16.0,5.0,,,0.0,,,,,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,,39.0,8.0,,,324.0,,,,,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,,49.0,10.0,,,1853.0,,,,,...,0.0,2.0,672.0,8.0,9.0,579.0,735.0,8.0,0.0,0.0
75%,,68.0,13.0,,,11323.0,,,,,...,5.0,5.0,2328.0,26.0,25.0,9806.0,2701.0,26.0,4.0,0.0


In [6]:
data.drop("URL", axis=1, inplace=True)

In [7]:
data.dtypes

URL_LENGTH                     int64
NUMBER_SPECIAL_CHARACTERS      int64
CHARSET                       object
SERVER                        object
CONTENT_LENGTH               float64
WHOIS_COUNTRY                 object
WHOIS_STATEPRO                object
WHOIS_REGDATE                 object
WHOIS_UPDATED_DATE            object
TCP_CONVERSATION_EXCHANGE      int64
DIST_REMOTE_TCP_PORT           int64
REMOTE_IPS                     int64
APP_BYTES                      int64
SOURCE_APP_PACKETS             int64
REMOTE_APP_PACKETS             int64
SOURCE_APP_BYTES               int64
REMOTE_APP_BYTES               int64
APP_PACKETS                    int64
DNS_QUERY_TIMES              float64
Type                           int64
dtype: object

In [8]:
data.isnull().sum()

URL_LENGTH                     0
NUMBER_SPECIAL_CHARACTERS      0
CHARSET                        0
SERVER                         1
CONTENT_LENGTH               812
WHOIS_COUNTRY                  0
WHOIS_STATEPRO                 0
WHOIS_REGDATE                  0
WHOIS_UPDATED_DATE             0
TCP_CONVERSATION_EXCHANGE      0
DIST_REMOTE_TCP_PORT           0
REMOTE_IPS                     0
APP_BYTES                      0
SOURCE_APP_PACKETS             0
REMOTE_APP_PACKETS             0
SOURCE_APP_BYTES               0
REMOTE_APP_BYTES               0
APP_PACKETS                    0
DNS_QUERY_TIMES                1
Type                           0
dtype: int64

In [9]:
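# Fill missing numeric values with the column mean. numeric_only=True is needed
# because recent pandas versions raise an error on the string columns otherwise.
data = data.fillna(data.mean(numeric_only=True))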
data.isnull().sum()

URL_LENGTH                   0
NUMBER_SPECIAL_CHARACTERS    0
CHARSET                      0
SERVER                       1
CONTENT_LENGTH               0
WHOIS_COUNTRY                0
WHOIS_STATEPRO               0
WHOIS_REGDATE                0
WHOIS_UPDATED_DATE           0
TCP_CONVERSATION_EXCHANGE    0
DIST_REMOTE_TCP_PORT         0
REMOTE_IPS                   0
APP_BYTES                    0
SOURCE_APP_PACKETS           0
REMOTE_APP_PACKETS           0
SOURCE_APP_BYTES             0
REMOTE_APP_BYTES             0
APP_PACKETS                  0
DNS_QUERY_TIMES              0
Type                         0
dtype: int64
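CONTENT_LENGTH is missing in 812 of the 1781 rows, so filling it with the mean sets almost half of the column to one value (which is why the 50% and 75% quantiles in the later `describe` output are both 11726.927761). A possible alternative, sketched here on a small toy frame rather than on the real data, is to fill with the median and keep a flag that records which rows were missing. The column name `CONTENT_LENGTH_MISSING` is a hypothetical name chosen for this illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real column; the values are illustrative only.
toy = pd.DataFrame({"CONTENT_LENGTH": [263.0, 15087.0, np.nan, 162.0, np.nan, 124140.0]})

# Record missingness before filling, so the model can still use that signal.
# CONTENT_LENGTH_MISSING is a hypothetical column name.
toy["CONTENT_LENGTH_MISSING"] = toy["CONTENT_LENGTH"].isna().astype(int)

# The median is less sensitive to the very large outliers than the mean.
median_value = toy["CONTENT_LENGTH"].median()
toy["CONTENT_LENGTH"] = toy["CONTENT_LENGTH"].fillna(median_value)

print(toy)
```

Whether the median or the mean works better for this data set is an open question; the flag column is the part that keeps information the plain fill would discard.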

In [10]:
data.dropna(how='any',axis=0, inplace=True)
data.isnull().sum()

URL_LENGTH                   0
NUMBER_SPECIAL_CHARACTERS    0
CHARSET                      0
SERVER                       0
CONTENT_LENGTH               0
WHOIS_COUNTRY                0
WHOIS_STATEPRO               0
WHOIS_REGDATE                0
WHOIS_UPDATED_DATE           0
TCP_CONVERSATION_EXCHANGE    0
DIST_REMOTE_TCP_PORT         0
REMOTE_IPS                   0
APP_BYTES                    0
SOURCE_APP_PACKETS           0
REMOTE_APP_PACKETS           0
SOURCE_APP_BYTES             0
REMOTE_APP_BYTES             0
APP_PACKETS                  0
DNS_QUERY_TIMES              0
Type                         0
dtype: int64

In [11]:
from sklearn.preprocessing import LabelEncoder

# Encode each categorical (string) column as integer codes.
categorical_columns = ['CHARSET', 'SERVER', 'WHOIS_COUNTRY', 'WHOIS_STATEPRO',
                       'WHOIS_REGDATE', 'WHOIS_UPDATED_DATE']
for column in categorical_columns:
    data[column] = LabelEncoder().fit_transform(data[column])
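Label-encoding the WHOIS date columns turns each distinct timestamp string into an arbitrary integer, so the ordering of dates is lost. As one possible feature-extraction step, the strings could be parsed into real timestamps first. The sketch below works on a toy series and assumes a day/month/year layout, which the source does not confirm; unparseable entries (here the string "None") become missing values.

```python
import pandas as pd

# Toy values in the style of WHOIS_REGDATE; "None" stands in for an unknown date.
raw = pd.Series(["10/10/2015 18:21", "7/10/1997 4:00", "None"])

# Assumption: dates are day/month/year. errors="coerce" maps bad entries to NaT.
parsed = pd.to_datetime(raw, format="%d/%m/%Y %H:%M", errors="coerce")

# The registration year is a simple numeric feature that keeps temporal order.
reg_year = parsed.dt.year

print(reg_year)
```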

In [12]:
data.describe(include='all')

Unnamed: 0,URL_LENGTH,NUMBER_SPECIAL_CHARACTERS,CHARSET,SERVER,CONTENT_LENGTH,WHOIS_COUNTRY,WHOIS_STATEPRO,WHOIS_REGDATE,WHOIS_UPDATED_DATE,TCP_CONVERSATION_EXCHANGE,DIST_REMOTE_TCP_PORT,REMOTE_IPS,APP_BYTES,SOURCE_APP_PACKETS,REMOTE_APP_PACKETS,SOURCE_APP_BYTES,REMOTE_APP_BYTES,APP_PACKETS,DNS_QUERY_TIMES,Type
count,1780.0,1780.0,1780.0,1780.0,1780.0,1780.0,1780.0,1780.0,1780.0,1780.0,1780.0,1780.0,1780.0,1780.0,1780.0,1780.0,1780.0,1780.0,1780.0,1780.0
mean,56.95618,11.111798,3.407865,103.944944,11728.232214,34.247753,71.782584,452.954494,326.580899,16.261798,5.474719,3.060112,2983.438,18.542135,18.754494,15901.34,3156.795,18.542135,2.264755,0.121348
std,27.562496,4.551174,1.779575,74.165875,26844.321776,12.013056,46.917703,273.481601,174.485991,40.512346,21.8133,3.387854,56066.31,41.638787,46.40969,69880.58,56069.51,41.638787,2.930361,0.326623
min,16.0,5.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,39.0,8.0,3.0,8.0,1501.75,29.0,21.0,219.0,178.75,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,49.0,10.0,3.0,117.5,11726.927761,42.0,88.0,425.0,339.0,7.0,0.0,2.0,672.0,8.0,9.0,586.0,733.5,8.0,0.0,0.0
75%,68.0,13.0,5.0,148.0,11726.927761,42.0,97.0,688.0,470.25,22.0,5.0,5.0,2328.25,26.0,25.0,9807.25,2704.5,26.0,4.0,0.0
max,249.0,43.0,8.0,238.0,649263.0,48.0,180.0,889.0,592.0,1194.0,708.0,17.0,2362906.0,1198.0,1284.0,2060012.0,2362906.0,1198.0,20.0,1.0


In [13]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
# Note: the scaler is fitted on the full data set, before the train/test split.
scaled_features = scaler.fit_transform(data.values)
scaled_features_data = pd.DataFrame(scaled_features, index=data.index, columns=data.columns)
# Restore the unscaled 0/1 target, which must not be standardized.
scaled_features_data["Type"] = data["Type"]
data = scaled_features_data
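Because `fit_transform` runs on the whole data set before `train_test_split`, the scaler's mean and variance also reflect the test rows. A common way to avoid this is to put the scaler and the classifier in one pipeline that is fitted on the training rows only. The sketch below shows the idea on synthetic data; `X_toy` and `y_toy` are made-up stand-ins, not the notebook's data. Tree-based models such as random forests are not sensitive to feature scaling, so for this particular model the scaling step could arguably be dropped altogether.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: not the real feature matrix).
rng = np.random.default_rng(0)
X_toy = rng.normal(loc=5.0, scale=3.0, size=(200, 4))
y_toy = (X_toy[:, 0] > 5.0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0, stratify=y_toy
)

# The scaler inside the pipeline only ever sees the training rows.
pipe = make_pipeline(
    StandardScaler(),
    RandomForestClassifier(n_estimators=50, random_state=0),
)
pipe.fit(X_tr, y_tr)

test_accuracy = pipe.score(X_te, y_te)
print(test_accuracy)
```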

## Task 3: Train a  Model
* Which ML model would you choose and why?
* Train and evaluate the model using the train data
* Is the data balanced? What are the implications, and how can you deal with this?
* Discuss the results -> possible improvements?
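The `describe` output above already hints at the answer to the balance question: the mean of `Type` is about 0.1213 over 1780 rows, which corresponds to 216 malicious and 1564 benign websites. The sketch below rebuilds a toy target with those counts (it does not read the real file) and shows how `value_counts` exposes the imbalance.

```python
import pandas as pd

# Toy target rebuilt from the summary statistics: 216 malicious, 1564 benign.
y_toy = pd.Series([1] * 216 + [0] * 1564, name="Type")

counts = y_toy.value_counts()
ratio = y_toy.value_counts(normalize=True)
print(counts)
print(ratio.round(3))

# A model that always predicts "benign" would already reach this accuracy,
# so accuracy alone says little about how well malicious sites are detected.
baseline_accuracy = ratio.max()
print(baseline_accuracy)
```

With roughly 88% of the rows in one class, per-class precision, recall and F1 for the malicious class are more informative than accuracy.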


In [14]:
from sklearn.model_selection import train_test_split

#training and test data
X = data.drop('Type',axis=1)
y = data['Type']

X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2, random_state=0)
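With an imbalanced target, a plain random split can leave the test set with a slightly different share of malicious rows than the training set. Passing `stratify` to `train_test_split` keeps the class proportions the same in both parts. The sketch below uses synthetic features and a toy target with the same class counts as the data set; it is an illustration, not the split used in this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-ins: 1780 rows, 216 of them in the positive class.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(1780, 3))
y_toy = np.array([1] * 216 + [0] * 1564)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0, stratify=y_toy
)

# Both parts now have close to the same share of positives.
print(y_tr.mean(), y_te.mean())
```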

In [15]:
from sklearn.ensemble import RandomForestClassifier as rfc
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint
from sklearn import metrics

In [16]:
model_params = {
    "n_estimators": randint(30,300),
    'max_depth': [10, 20, 30, 40, 50,60,70,80,100, None],
    'min_samples_leaf': randint(1,4),
    'min_samples_split': randint(2,9)
}

rf_model = rfc(class_weight='balanced', n_jobs=-1)
clf = RandomizedSearchCV(rf_model, model_params, n_iter=10, cv=5, random_state=0)

model = clf.fit(X_train, y_train)
model.best_params_

{'max_depth': 100,
 'min_samples_leaf': 2,
 'min_samples_split': 5,
 'n_estimators': 177}
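When `scoring` is not set, `RandomizedSearchCV` ranks candidates by the estimator's default score, which is accuracy for classifiers. Since the task asks for the F1-score, the search could instead optimize F1 directly. The following is a minimal sketch on synthetic imbalanced data from `make_classification`; the smaller parameter ranges and `n_iter=3` are choices made only to keep the example fast.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic imbalanced data (about 12% positives), not the real data set.
X_toy, y_toy = make_classification(
    n_samples=400, n_features=8, weights=[0.88, 0.12], random_state=0
)

search = RandomizedSearchCV(
    RandomForestClassifier(class_weight="balanced", random_state=0),
    {"n_estimators": randint(30, 100), "min_samples_leaf": randint(1, 4)},
    n_iter=3,
    scoring="f1",  # rank candidates by F1 of the positive class
    cv=5,
    random_state=0,
)
search.fit(X_toy, y_toy)

best_cv_f1 = search.best_score_
print(search.best_params_, best_cv_f1)
```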

In [17]:
clf = model.best_estimator_
rf_pred = clf.predict(X_test)

## Task 4: Evaluate 
* Report the F1-score on the test data. Who will build the best model?

- 1 is for malicious websites 
- 0 is for benign websites

In [18]:
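from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, classification_report

# With average='micro' on a binary problem, precision, recall and F1 all
# reduce to the accuracy, which is why the four numbers below are identical.
# The per-class scores are listed in the classification report further down.
acc = accuracy_score(y_test, rf_pred)
prec = precision_score(y_test, rf_pred, average='micro')
rec = recall_score(y_test, rf_pred, average='micro')
f1 = f1_score(y_test, rf_pred, average='micro')
print('Accuracy:', acc)
print('Precision', prec)
print('Recall', rec)
print('F1:', f1)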

Accuracy: 0.9634831460674157
Precision 0.9634831460674157
Recall 0.9634831460674157
F1: 0.9634831460674157


In [19]:
print(classification_report(y_test,rf_pred))

              precision    recall  f1-score   support

           0       0.97      0.99      0.98       313
           1       0.92      0.77      0.84        43

    accuracy                           0.96       356
   macro avg       0.94      0.88      0.91       356
weighted avg       0.96      0.96      0.96       356



In [20]:
# confusion_matrix expects (y_true, y_pred): rows are true classes, columns are predictions.
confusion_matrix(y_test, rf_pred)

array([[310,   3],
       [ 10,  33]], dtype=int64)
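The results above correspond to 310 true negatives, 3 false positives, 10 false negatives and 33 true positives, which is consistent with the classification report. The sketch below rebuilds label arrays from these four counts (not from the actual model) to show why the micro-averaged F1 equals the accuracy, while the F1 of the malicious class, arguably the number Task 4 is after, is noticeably lower.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Counts taken from the results reported above.
tn, fp, fn, tp = 310, 3, 10, 33

# Rebuild label arrays that produce exactly these counts.
y_true = np.array([0] * tn + [0] * fp + [1] * fn + [1] * tp)
y_pred = np.array([0] * tn + [1] * fp + [0] * fn + [1] * tp)

accuracy = accuracy_score(y_true, y_pred)
micro_f1 = f1_score(y_true, y_pred, average="micro")

# Default average='binary' scores only the positive (malicious) class.
binary_precision = precision_score(y_true, y_pred)
binary_recall = recall_score(y_true, y_pred)
binary_f1 = f1_score(y_true, y_pred)

print("accuracy:", round(accuracy, 4))
print("micro F1:", round(micro_f1, 4))
print("malicious precision:", round(binary_precision, 4))
print("malicious recall:", round(binary_recall, 4))
print("malicious F1:", round(binary_f1, 4))
```

The recall of about 0.77 means that 10 of the 43 malicious test websites were missed, which the accuracy of about 0.96 does not reveal.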