### Malicious Webpage Identification Using Semi-Supervised Learning

Alex Liddle

In [None]:
import nltk
import string
import re
import sklearn
import pandas as pd
from tqdm import tqdm
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler, OrdinalEncoder
from sklearn.cluster import MiniBatchKMeans
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from scipy import stats
#nltk.download('stopwords') #<---uncomment if you haven't downloaded the stopwords library
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

### Import the dataset

In [None]:
# Load dataset into a pandas dataframe
df_reviews_raw = pd.read_csv('/kaggle/input/dataset-of-malicious-and-benign-webpages/Webpages_Classification_train_data.csv/Webpages_Classification_train_data.csv').drop(['Unnamed: 0'], axis=1)

In [None]:
# Inspect for missing values
df_reviews_raw.isna().sum()

In [None]:
# Check data types
df_reviews_raw.dtypes

In [None]:
# Inspect a small sample
df_reviews_raw.head()

### Clean the data

The data must be cleaned and transformed into the numeric format expected by the machine learning algorithms used later in this notebook. The labels should also be balanced, with an equal number of rows per class.

In [None]:
# Check the label distribution
df_reviews_raw.label.describe()

In [None]:
# Get an equally distributed sample
df_reviews_untrimmed_sample = df_reviews_raw.groupby('label').apply(lambda x: x.sample(25000, random_state=42)).reset_index(drop=True)
# Drop rows whose content has fewer than 60 words
df_reviews_trimmed = df_reviews_untrimmed_sample[df_reviews_untrimmed_sample.content.str.split().str.len().ge(60)]
df_reviews_trimmed.label.describe()

In [None]:
# Resample trimmed dataframe to make it uniformly distributed
df_reviews_sampled = df_reviews_trimmed.groupby('label').apply(lambda x: x.sample(2000, random_state=42)).reset_index(drop=True)
# Randomly shuffle rows for aesthetics
df_reviews = df_reviews_sampled.sample(frac=1, random_state=42).reset_index(drop=True)
df_reviews.label.describe()
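
The balancing step can be illustrated on a toy frame. This is a minimal sketch on made-up data (the `toy` frame and its values are hypothetical, not taken from the Kaggle dataset). It uses `value_counts()`, which reports the number of rows per class directly, whereas `describe()` on a string column only reports the count, the number of unique values, and the most frequent value.

```python
import pandas as pd

# Toy frame with an imbalanced 'label' column (hypothetical data)
toy = pd.DataFrame({
    "label": ["good"] * 8 + ["bad"] * 3,
    "value": range(11),
})

# Draw the same number of rows from each class.
# n must not exceed the size of the smallest class.
n_per_class = toy["label"].value_counts().min()
balanced = toy.groupby("label").sample(n=n_per_class, random_state=42)

# Per-class row counts after balancing
print(balanced["label"].value_counts())
```

Taking the size of the smallest class as `n` avoids the error that sampling raises when a class has fewer rows than requested.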

### Examine the data

In [None]:
df_reviews.head()

In [None]:
df_reviews[['geo_loc', 'tld','who_is','https', 'label']].describe()

### Text Preprocessing

To use the decision tree and random forest models, the data must be in a numerical format. The geo_loc, tld, who_is, https, and label columns are categorical. Tree-based models split on one feature threshold at a time and do not depend on the scale or spacing of the encoded values, so I will use ordinal encoding to transform these columns. Meanwhile, natural language processing will be performed on the url and content columns.

In [None]:
df_reviews['geo_loc'] = OrdinalEncoder().fit_transform(df_reviews.geo_loc.values.reshape(-1,1))
df_reviews['tld'] = OrdinalEncoder().fit_transform(df_reviews.tld.values.reshape(-1,1))
df_reviews['who_is'] = OrdinalEncoder().fit_transform(df_reviews.who_is.values.reshape(-1,1))
df_reviews['https'] = OrdinalEncoder().fit_transform(df_reviews.https.values.reshape(-1,1))
df_reviews['label'] = OrdinalEncoder().fit_transform(df_reviews.label.values.reshape(-1,1))

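# Convert the url into a space-separated string that can be tokenized.
# re.sub removes only a leading "www."; str.strip('www.') would instead strip
# any leading or trailing 'w' and '.' characters, corrupting some hostnames.
df_reviews['url'] = df_reviews.url.apply(lambda x: ' '.join(re.sub(r'^www\.', '', x.split('://', 1)[-1]).replace('.', '/').split('/')))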
df_reviews.head()
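
As a small illustration of ordinal encoding, the sketch below encodes two categorical columns of a toy frame with a single `OrdinalEncoder`. The column values are hypothetical. Keeping the fitted encoder around (rather than creating a throwaway one per column) is an assumption of this sketch, not something the notebook does; it makes the category-to-integer mapping available for inspection and for decoding.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy categorical data (hypothetical values)
toy = pd.DataFrame({
    "tld": ["com", "org", "com", "net"],
    "https": ["yes", "no", "yes", "yes"],
})

# One encoder can handle several columns at once
encoder = OrdinalEncoder()
encoded = encoder.fit_transform(toy[["tld", "https"]])

# Categories are sorted per column; each value maps to its index
print(encoder.categories_)
print(encoded)

# The fitted encoder can map integers back to the original strings
decoded = encoder.inverse_transform(encoded[:1])
print(decoded)
```

The integers carry no meaning beyond identifying a category, which is why this encoding suits tree-based models better than models that rely on distances.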

The textual data in the url and content columns will be tokenized, converted to lower case, and stripped of stopwords and punctuation.

In [None]:
print("Before Preprocessing:")
print(df_reviews.content.head())

tqdm.pandas()
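# NLTK stopwords for all available languages; a set gives fast membership tests
stop = set(stopwords.words())

# Remove punctuation and lowercase. regex=True is needed because pandas 2.0
# changed the default of str.replace to literal (non-regex) matching.
df_reviews.content = df_reviews.content.str.replace(r"[^\w\s]", "", regex=True).str.lower()
df_reviews.content = df_reviews.content.progress_apply(lambda x: ' '.join([item for item in x.split() 
                                                               if item not in stop]))
df_reviews.url = df_reviews.url.str.replace(r"[^\w\s]", "", regex=True).str.lower()
df_reviews.url = df_reviews.url.progress_apply(lambda x: ' '.join([item for item in x.split() 
                                                               if item not in stop]))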

print("After Preprocessing:")
print(df_reviews.content.head())
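
The cleaning steps can be seen in isolation on a single string. This is a minimal sketch using only the standard library; the `clean_text` helper and the tiny `STOP_WORDS` set are hypothetical stand-ins for the pandas and NLTK calls used in the notebook.

```python
import re

# Tiny illustrative stop-word set; the notebook itself uses NLTK's lists
STOP_WORDS = {"the", "a", "is", "of", "and", "to"}

def clean_text(text):
    """Strip punctuation, lowercase, and drop stop words."""
    no_punct = re.sub(r"[^\w\s]", "", text).lower()
    return " ".join(tok for tok in no_punct.split() if tok not in STOP_WORDS)

cleaned = clean_text("The price of Bitcoin is rising, and FAST!")
print(cleaned)
```

Lowercasing before the stop-word filter matters: the stop words are stored in lower case, so "The" would otherwise slip through.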

### Label URLs and content using TF-IDF vectorization and clustering

To convert the widely varying text of the url and content columns into something more manageable for the decision tree and random forest models, I will assign each row a cluster label using mini-batch k-means clustering. First, however, the text must be converted into numeric TF-IDF vectors.

In [None]:
tfidf = TfidfVectorizer(
    min_df = 5,
    max_df = 0.95,
    max_features = 8000,
    stop_words = 'english'
)

tfidf.fit(df_reviews.url)
url_tfidf = tfidf.transform(df_reviews.url)

tfidf.fit(df_reviews.content)
content_tfidf = tfidf.transform(df_reviews.content)
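
To make the vectorization step concrete, here is a minimal sketch on three made-up documents using `TfidfVectorizer` with default settings (not the `min_df`, `max_df`, and `max_features` values chosen above). Each document becomes one row of a sparse matrix with one column per vocabulary term.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical documents, for illustration only
docs = [
    "free prize click now",
    "click here free download",
    "weather report for today",
]

vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(docs)

# One row per document, one column per distinct term
print(matrix.shape)
print(sorted(vectorizer.vocabulary_))

# Rows are L2-normalized by default, so each squared norm is 1
row_norm_sq = float(matrix[0].multiply(matrix[0]).sum())
print(row_norm_sq)
```

Refitting a single vectorizer on a second column, as the cell above does, replaces its vocabulary. If the url vocabulary is needed later (for example, to inspect the top terms of a cluster), one option is to keep a separate vectorizer per column.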

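I will use the elbow method to find a suitable number of clusters for each feature: the sum of squared errors (SSE) is plotted against the number of clusters, and the point where the curve stops dropping sharply is chosen.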

In [None]:
def find_optimal_clusters(data, max_k):
    k_list = range(2, max_k+1)
    
    sse = []
    for k in k_list:
        sse.append(MiniBatchKMeans(n_clusters=k, init_size=1024, batch_size=2048, random_state=20).fit(data).inertia_)
       
    plt.style.use("dark_background")
    f, ax = plt.subplots(1, 1)
    ax.plot(k_list, sse, marker='o')
    ax.set_xlabel('Cluster Centers')
    ax.set_xticks(k_list)
    ax.set_xticklabels(k_list)
    ax.set_ylabel('SSE')
    ax.set_title('SSE by Cluster Center Plot')

In [None]:
find_optimal_clusters(url_tfidf, 20)
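
Reading an elbow off a plot is somewhat subjective. One rough numeric heuristic, sketched here on synthetic data, is to pick the k at which the discrete second difference of the SSE curve is largest, that is, where the curve bends most sharply. This heuristic, the synthetic blobs, and the `n_init=10` setting are assumptions of this sketch and are not part of the notebook's method.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.datasets import make_blobs

# Synthetic data with three well-separated groups (illustrative only)
centers = [[0, 0], [10, 10], [-10, 10]]
X, _ = make_blobs(n_samples=300, centers=centers, cluster_std=0.5, random_state=0)

# SSE (inertia) for each candidate number of clusters
k_values = list(range(1, 7))
sse = [
    MiniBatchKMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
    for k in k_values
]

# second_diff[i] = sse[i] - 2 * sse[i + 1] + sse[i + 2], which belongs to k_values[i + 1]
second_diff = np.diff(sse, n=2)
elbow_k = k_values[int(np.argmax(second_diff)) + 1]
print(elbow_k)
```

On real TF-IDF data the SSE curve is usually much smoother than on this toy example, so a heuristic like this is best treated as a cross-check on the plot rather than a replacement for it.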

An elbow can be seen near nine clusters (the call below sets n_clusters=8). A new column holding the cluster assigned to each row will be added.

In [None]:
df_reviews['url_cluster'] = MiniBatchKMeans(n_clusters=8, init_size=1024, batch_size=2048, 
                                            random_state=20).fit_predict(url_tfidf)

In [None]:
find_optimal_clusters(content_tfidf, 20)

An elbow can be seen where n_clusters equals four. A new column holding the cluster assigned to each row will be added.

In [None]:
df_reviews['content_cluster'] = MiniBatchKMeans(n_clusters=4, init_size=1024, batch_size=2048, 
                                            random_state=20).fit_predict(content_tfidf)
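
Before passing cluster IDs to a classifier, it can be useful to check whether they carry any signal about the label. The sketch below cross-tabulates a cluster column against the label on a toy frame; the values are hypothetical and do not reflect the actual clustering results.

```python
import pandas as pd

# Hypothetical cluster assignments and labels
toy = pd.DataFrame({
    "content_cluster": [0, 0, 1, 1, 1, 2, 2, 2],
    "label": ["good", "good", "bad", "bad", "good", "bad", "bad", "bad"],
})

# Share of each label within each cluster (rows sum to 1)
table = pd.crosstab(toy["content_cluster"], toy["label"], normalize="index")
print(table)
```

Clusters whose rows are dominated by one label are likely to be informative features; clusters with an even mix add little on their own.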

### Generate a training and test dataset

The cleaned, transformed dataset will be split into training and test sets using a 70%/30% split.

In [None]:
X = df_reviews[['url_cluster', 'url_len', 'geo_loc', 'tld', 'who_is', 'https', 'content_cluster',
                'js_len', 'js_obf_len']]
y = df_reviews.label

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
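
Because the dataset was balanced by hand, it may be worth preserving that balance in both partitions. The sketch below shows the `stratify` argument of `train_test_split` on toy arrays; using it here is a suggestion of this sketch, not something the cell above does.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 20 rows, two balanced classes (hypothetical values)
X = np.arange(40).reshape(20, 2)
y = np.array([0] * 10 + [1] * 10)

# stratify=y keeps the class proportions equal in both partitions
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Per-class counts in the training and test sets
print(np.bincount(y_train), np.bincount(y_test))
```

Without stratification the class counts in each partition vary with the random seed, which matters more for small datasets than for large ones.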

### Model selection

For the decision tree, the "criterion" and "splitter" hyperparameters will be tuned with 5-fold cross-validation using scikit-learn's GridSearchCV class.

In [None]:
# Decision Tree
param_grid=[{"criterion":["gini", "entropy"],
             "splitter":["best", "random"]}]
grid=GridSearchCV(estimator=DecisionTreeClassifier(random_state=42),param_grid=param_grid,cv=5)
grid.fit(X_train,y_train)

In [None]:
# Optimal hyperparameters
grid.best_params_

Training and test accuracies are examined to determine if overfitting or underfitting has occurred.

In [None]:
# training accuracy
grid.score(X_train,y_train)

In [None]:
# test accuracy
grid.score(X_test,y_test)
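
A fully grown decision tree can reach near-perfect training accuracy on its own, so it helps to look at the cross-validated score alongside the training and test scores. The sketch below does this on synthetic data. The dataset and the extra `max_depth` grid entry are assumptions made for illustration and are not part of the notebook's search.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (illustrative only)
X, y = make_classification(n_samples=400, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

grid = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"criterion": ["gini", "entropy"], "max_depth": [2, 4, None]},
    cv=5,
)
grid.fit(X_train, y_train)

train_acc = grid.score(X_train, y_train)  # accuracy on data the model has seen
cv_acc = grid.best_score_                 # mean accuracy across the 5 folds
test_acc = grid.score(X_test, y_test)     # accuracy on held-out data
print(f"train={train_acc:.3f}  cv={cv_acc:.3f}  test={test_acc:.3f}")
```

A training score far above both the cross-validated and test scores points to overfitting; low scores across all three point to underfitting.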

For the random forest, the "n_estimators" and "criterion" hyperparameters will be tuned with 5-fold cross-validation using scikit-learn's GridSearchCV class.

In [None]:
# Random Forest
param_grid=[{"n_estimators":list(range(10, 120, 10)),
             "criterion":["gini", "entropy"]}]
grid=GridSearchCV(estimator=RandomForestClassifier(random_state=42),param_grid=param_grid,cv=5)
grid.fit(X_train,y_train)

In [None]:
# Optimal hyperparameters
grid.best_params_

Training and test accuracies are examined to determine if overfitting or underfitting has occurred.

In [None]:
# training accuracy
grid.score(X_train,y_train)

In [None]:
# test accuracy
grid.score(X_test,y_test)
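
To see which inputs a fitted forest relies on, its `feature_importances_` attribute can be inspected. The sketch below uses synthetic data with hypothetical feature names; it is an illustration of the attribute, not an analysis of the webpage dataset.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: 5 features, of which 2 are informative (illustrative only)
X, y = make_classification(
    n_samples=300, n_features=5, n_informative=2, n_redundant=0, random_state=0
)
names = [f"feature_{i}" for i in range(X.shape[1])]

forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Impurity-based importances, normalized to sum to 1
importances = pd.Series(forest.feature_importances_, index=names)
importances = importances.sort_values(ascending=False)
print(importances)
```

With the grid search above, the refitted model is available as `grid.best_estimator_`, so the same attribute can be read from it and paired with the column names of `X_train`.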

### Conclusion

Both models achieved high accuracy, and their training and test accuracies were close enough that neither overfitting nor underfitting is evident. This project supports using unsupervised clustering to label textual data, combined with decision tree and/or random forest classifiers, for identifying malicious webpages.