# Project 5: Malicious URLs
## Aidan Jimenez & Russell Smith
### 4/27/25

### Description
This notebook works through a dataset of malicious URLs and analyzes them by converting aspects of each link into numeric features that help distinguish between the URL classes. A second section uses a subset of the URLs to explore what whois data can add to the analysis.

### Self-Evaluation
Measured against the proposal document, we reached the A level, if not the stretch goal. After testing the learning models, the neural net turned out not to be the most effective on this data, so we changed course and used gradient boosting as our main model. Because collecting the whois information took so long, we aimed for a proof of concept showing how the project could develop with more information. The organization of the visuals and markdown also meets the A-level requirements.

## Imports and Data Loading
This section contains the imports used throughout the notebook and loads the CSV of 600,000 URLs.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from urllib.parse import urlparse
from urllib.request import urlopen
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

from xgboost import XGBClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

: 

Read in the data

In [None]:
df = pd.read_csv("malicious_phish.csv")
df.head()

: 

In [None]:
df.info()

: 

## Exploratory Data Analysis

In [None]:
df['type'].value_counts().plot(kind="bar")
plt.xlabel("Type")
plt.xticks(rotation=0)
plt.ylabel("Frequency")
plt.title("Urls by Type")

: 

In [None]:
df['length'] = df['url'].apply(lambda x: len(x))
df

: 

The majority of our URLs are benign, meaning non-malicious. This imbalance may cause problems later when trying to detect malicious links accurately. For this model we may group all the malicious links together into a more generic detection system, since detecting whether a link is malicious matters more than identifying which type of malicious activity it represents.
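
One way to collapse the four classes into a single benign-versus-malicious label is sketched below. The `to_binary` helper and the 0/1 encoding are illustrative assumptions; the models later in this notebook keep all four classes.

```python
# Hypothetical helper: collapse the four dataset labels into a binary target.
MALICIOUS_TYPES = {"phishing", "malware", "defacement"}

def to_binary(url_type):
    """Return 1 for any malicious class and 0 for benign."""
    return 1 if url_type in MALICIOUS_TYPES else 0

# Made-up sample of labels, in the same spelling the dataset uses.
sample_types = ["benign", "phishing", "defacement", "malware", "benign"]
binary_labels = [to_binary(t) for t in sample_types]
print(binary_labels)  # [0, 1, 1, 1, 0]
```

On the full DataFrame the same helper could be applied with `df['type'].apply(to_binary)`.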

In [None]:
#Count the number of each possible attributes that can be in a url
attribute = ['@','?','-','=','.','#','%','+','$','!','*',',','//', '(', ')']
for symbol in attribute:
    df[symbol] = df['url'].apply(lambda x: x.count(symbol))
df

: 

The one attribute that was interesting and had a large effect on prediction accuracy was the number of slashes. We first counted single slashes (`/`); switching to double slashes (`//`) increased the accuracy drastically.
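
The difference between the two counts can be seen on a pair of made-up URLs. This is only a sketch of how `str.count` behaves, not an explanation of why the model responded to the change.

```python
# Invented example URLs; neither comes from the dataset.
with_scheme = "http://example.com/a/b//c"
bare_domain = "example.com/a/b"

# str.count counts non-overlapping occurrences of the substring.
single_counts = (with_scheme.count("/"), bare_domain.count("/"))
double_counts = (with_scheme.count("//"), bare_domain.count("//"))

print(single_counts)  # (6, 2)
print(double_counts)  # (2, 0)
```

A single-slash count mostly tracks path depth, while a double-slash count is non-zero only when a scheme separator or a repeated slash is present.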

In [None]:
df['type'].unique()

: 

In [None]:
df.plot(x='//', y='type', kind="scatter")
plt.title("Slashes v Url Type")

: 

Based on this scatter plot, malware and defacement each tend to fall within their own range of slash counts, while benign and phishing share a more similar range.

In [None]:
df.plot(x='.', y='type', kind="scatter")
plt.title("Dots v Url Type")

: 

The number of dots in the URLs seems to follow the same pattern: malware and defacement each have their own distribution of dots, while benign and phishing fall in a similar range.

In [None]:
# Determine if the url is being used by a url shortening service
def detectShortened(original_url):
  url_shorteners = [
    "bit.ly",
    "tinyurl.com",
    "ow.ly",
    "is.gd",
    "v.gd",
    "soo.gd",
    "t.co",
    "lnkd.in",
    "buff.ly",
    "adf.ly",
    "shorte.st",
    "go.gl",
    "y2u.be",
    "youtu.be",
    "goo.gl",
    "po.st",
    "qr.cr",
    "snip.ly",
    "rebrand.ly",
    "bl.ink",
    "kutt.it",
    "cutt.ly",
    "shorturl.at",
    "tiny.cc",
    "osf.io",
    "doi.org",
    "arxiv.org",
    "git.io",
    "tny.im",
    "ulvis.net",
    "yourls.org",
    "polr.me",
    "branch.io",
    "app.goo.gl",
    "bnc.lt",
    "bitly.is",
    "j.mp",
    "on.mash.to",
    "flip.it",
    "instagr.am",
    "pin.it",
    "medium.com",
    "at.at",
    "su.pr",
    "twitpic.com",
    "flic.kr",
    "posterous.com",
    "digg.com",
    "plurk.com",
    "yep.it",
    "zi.pe",
    "linktr.ee",
    "taplink.cc",
    "bio.link",
    "solo.to",
    "beacons.ai",
    "luma.events",
    "eventbrite.com",
    "bento.me",
    "start.me",
    "about.me",
    "carrd.co",
    "milkshake.app",
    "paged.co",
    "shortstack.com",
    "woobox.com",
    "easypromosapp.com",
    "wishpond.com",
    "vyper.io",
    "kickofflabs.com",
    "leadpages.net",
    "instapage.com",
    "unbounce.com",
    "shortsw.com"
  ]

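  # Return 1 as soon as any shortener matches; return 0 only after
  # every entry in the list has been checked
  for url in url_shorteners:
    if url in original_url:
      return 1
  return 0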


df['redirect'] = df['url'].apply(detectShortened)

: 

In [None]:
df['redirect'].value_counts().plot(kind="bar", logy=True)
plt.title("Redirect Frequency")
plt.xlabel('Redirect')
plt.xticks([0,1],['No Redirect', 'Redirect'], rotation=0)
plt.ylabel("Frequency")

: 

This graph shows that most of the links are not redirects. Even so, the feature may still be useful to the models later on.
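
Substring matching of the kind used in `detectShortened` can flag URLs that merely contain a shortener's name; for instance, `t.co` is a substring of `microsoft.com`. A minimal sketch of a stricter check that compares against the hostname follows. The helper names and the four-entry list are illustrative assumptions, not part of the notebook.

```python
import re
from urllib.parse import urlparse

# Small illustrative subset of the shortener list used above.
SHORTENERS = {"bit.ly", "t.co", "goo.gl", "tinyurl.com"}

# Matches a leading scheme such as "http://" or "https://".
SCHEME_RE = re.compile(r"^[a-zA-Z][a-zA-Z0-9+.-]*://")

def hostname_of(url):
    """Extract the lowercase hostname, tolerating URLs without a scheme."""
    if not SCHEME_RE.match(url):
        url = "//" + url  # lets urlparse treat the leading part as the host
    try:
        return (urlparse(url).hostname or "").lower()
    except ValueError:  # raised for some malformed URLs
        return ""

def is_shortened(url):
    """Return 1 if the host is a listed shortener or a subdomain of one."""
    host = hostname_of(url)
    return int(any(host == s or host.endswith("." + s) for s in SHORTENERS))

print(is_shortened("https://bit.ly/abc"))       # 1
print(is_shortened("microsoft.com/en-us"))      # 0
print(int("t.co" in "microsoft.com/en-us"))     # 1 (substring false positive)
```

Whether the stricter check helps the models would need to be tested; it changes which rows are marked as redirects.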

Password Entropy Calculator
https://github.com/error-27/Entropy-Calculator/blob/main/Entropy.py

The reason for calculating the "password" entropy of these links is to estimate how difficult a link would be to guess. The assumption is that most legitimate links belong to companies, which use common words in their domains rather than a jumbled mess of characters. One limitation is that domains are normally written in lowercase, so the uppercase character set rarely contributes and the algorithm does not work entirely as intended, but it can still potentially produce a telling value.
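
As a worked illustration, the calculator estimates entropy as the string length multiplied by the base-2 logarithm of the character pool size, where the pool is the sum of the character classes present (26 lowercase, 26 uppercase, 10 digits, 32 symbols). The sketch below restates that rule in a compact form; `pool_size` and `entropy_bits` are hypothetical names and the two domains are made up.

```python
import math

def pool_size(text):
    """Sum the sizes of the character classes that appear in the text."""
    size = 0
    if any(c.islower() for c in text):
        size += 26
    if any(c.isupper() for c in text):
        size += 26
    if any(c.isdigit() for c in text):
        size += 10
    if any(c.isascii() and not c.isalnum() for c in text):
        size += 32
    return size

def entropy_bits(text):
    """Length times log2 of the pool size; 0 for an empty pool."""
    size = pool_size(text)
    return len(text) * math.log2(size) if size else 0.0

print(round(entropy_bits("google"), 1))  # 28.2  (6 * log2(26))
print(round(entropy_bits("x7k2q9"), 1))  # 31.0  (6 * log2(36))
```

Because the length multiplies the whole estimate, two domains of equal length differ only through the pool size, so the score reflects length and character mix rather than how word-like the domain is.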

In [None]:
import math
#!pip install tldextract
# tldextract is needed because urllib's urlparse
# was not able to parse all links properly
import tldextract


def calculate(length, char_amount):
    if char_amount > 0:
      return math.log2(char_amount) * length
    else:
      return 0


def find_chars(password):
    char_amount = 0
    char_sets = [False, False, False, False]
    char_nums = [26, 26, 10, 32]
    for i in password:
        if i.islower():
            char_sets[0] = True
        if i.isupper():
            char_sets[1] = True
        if i.isdigit():
            char_sets[2] = True
        if not i.isalnum() and i.isascii():
            char_sets[3] = True

    for x in range(4):
        if char_sets[x]:
            char_amount += char_nums[x]

    return len(password), char_amount

def calcEntropy(password):
  domain = tldextract.extract(password).domain
  length, amount = find_chars(domain)
  entropy  = calculate(length, amount)
  return entropy

df['domain_entropy'] = df['url'].apply(calcEntropy)
df

: 

In [None]:
df.plot(x='domain_entropy', y='type', kind='scatter', alpha=0.2)

: 

Each class appears to fall within certain entropy ranges, which makes the classes more distinct. Based on this graph, domain entropy looks like a feature that could help narrow down which class a URL belongs to.

In [None]:
# Determine whether the url has https or not
def isHTTPS(url):
  if "https://" in url:
    return 1
  else:
    return 0

df["https"] = df['url'].apply(isHTTPS)

: 

In [None]:
df['https'].value_counts().plot(kind='bar')
plt.title("Frequency of URLs that Contain HTTPS")
plt.xticks([0,1],['No', 'Yes'], rotation=0)
plt.xlabel("HTTPS")
plt.ylabel("Frequency")

: 

Whether a link uses HTTPS could be an indicator of how legitimate the URL is. HTTPS is built on top of HTTP and usually means that traffic is encrypted and more secure.
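
Note that `"https://" in url` is true whenever the text appears anywhere in the link, including inside a query string. A minimal sketch of a check that looks only at the URL's own scheme is shown here; `has_https_scheme` is a hypothetical name and the URLs are invented.

```python
def has_https_scheme(url):
    """Return 1 only when the URL itself starts with the https scheme."""
    return int(url.lower().startswith("https://"))

# Invented URL whose query string embeds an https link.
embedded = "http://example.com/?next=https://bank.example"

print(has_https_scheme("https://example.com/login"))  # 1
print(has_https_scheme(embedded))                      # 0
print(int("https://" in embedded))                     # 1 (substring check)
print(has_https_scheme("example.com/page"))            # 0
```

The two checks agree on most links and differ only when an HTTPS address is embedded in a non-HTTPS URL.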

In [None]:
corr = df.drop(['url', 'type'], axis=1).corr()
sns.heatmap(corr )
plt.title("Correlation Between Attributes")

: 

An interesting finding is the correlation between `//` and `=`, where `=` indicates that the link carries a query. It is unclear what this means, but it does not seem like a pairing that should be more common than others in the URLs. Other attributes, such as `?` and `=`, are directly linked to one another because together they form a query string in a URL.

## Prepare Data for Model

In [None]:
ord_map= {
    'malware' : 0,
    'benign': 1,
    'phishing': 2,
    'defacement': 3,
}
X = df.drop(['url', 'type'], axis=1)
y = df['type'].map(ord_map)

: 

Split the data into train and test sets

In [None]:
from sklearn.model_selection import train_test_split
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size=0.33, random_state=0)

: 

## MLP Classifier

Train the MLP. Note that the features are passed in unscaled; no normalization step is applied in this cell.

In [None]:
model = MLPClassifier(hidden_layer_sizes=(4, 16, 4),
                      solver='adam',
                      learning_rate_init=0.001,
                      activation='relu',
                      batch_size=64,
                      max_iter=500,
                      early_stopping=True,
                      validation_fraction=0.1,
                      verbose=True)

model.fit(Xtrain, ytrain)

: 

In [None]:
for i, layer in enumerate(model.coefs_):
  print('Layer', i, 'has', layer.shape[0], 'nodes, each with', layer.shape[1], 'weight(s)')

: 

In [None]:
model.best_validation_score_

: 

In [None]:
plt.plot(model.loss_curve_)
plt.ylabel('Loss')
plt.xlabel('Epoch')
plt.title('Training loss curve')
plt.show()

: 

In [None]:
ypred = model.predict(Xtest)
print(classification_report(ytest, ypred))

: 

## Random Forest Classifier

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

clf = RandomForestClassifier(n_estimators=100, max_depth=5)
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)
print(classification_report(y_test, y_pred))

: 

## Grid Search

In [None]:
from sklearn.model_selection import GridSearchCV

# Which hyperparameters do we want to try?
param_grid = {'n_estimators': np.arange(30, 34, 2),
              'max_depth': np.arange(12, 18, 3),
              'learning_rate': np.arange(0.07, 0.14, 0.01)}

# We can incorporate cross-validation into the grid search
# by specifying cv=5.
grid = GridSearchCV(XGBClassifier(), param_grid, cv=5, verbose=1)

# We'll use just the training data, so that we can evaluate the best
# model against data that was unseen during training.
grid.fit(Xtrain, ytrain)

: 

In [None]:
print(grid.best_params_)

# and the best accuracy
print(grid.best_score_)

# GridSearchCV refits the best hyperparameters on the full training set by
# default (refit=True), so the tuned model is available as best_estimator_
best_model = grid.best_estimator_

# predict the test data to see how well the tuned model generalizes
ypred = best_model.predict(Xtest)
accuracy = accuracy_score(ytest, ypred)
print(accuracy)

: 

## PCA

In [None]:
pca = PCA(n_components=2)
pca.fit(X)
X_pca = pca.transform(X)
print("original shape:   ", X.shape)
print("transformed shape:", X_pca.shape)


fig, ax = plt.subplots()

# note that transformed data becomes a numpy array
ax.scatter(X_pca[:, 0], X_pca[:, 1], c=y, alpha=0.02)

: 

## Gradient Boosting

In [None]:

model = XGBClassifier(n_estimators=1, max_depth=300, learning_rate=0.001)

model.fit(Xtrain, ytrain)

# make predictions for test data
ypred = model.predict(Xtest)

# evaluate predictions
accuracy = accuracy_score(ytest, ypred)
print(f'Accuracy: {accuracy*100:.2f}%')

: 

In [None]:
# Feature Importance

# Make a DataFrame for ease of sorting and visualization
feat_imp = pd.DataFrame({'Feature': X.columns,
                      'Importance': model.feature_importances_})

feat_imp = feat_imp.sort_values(by='Importance', ascending=False)

fig, ax = plt.subplots()

# horizontal bar charts can make text easier to read
ax.barh(feat_imp['Feature'], feat_imp['Importance'])
ax.set_xlabel('Importance')
ax.set_title('Feature Importance in Malicious Urls')


: 

# Prediction Matrix

In [None]:
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
import seaborn as sns
print(confusion_matrix(ytest, ypred))
print(classification_report(ytest, ypred))
label_map = {
    0: 'malware',
    1: 'benign',
    2: 'phishing',
    3: 'defacement',
}
labels = [0,1,2,3]
cm = confusion_matrix(ytest, ypred, labels=labels)
sns.heatmap(cm, square=True, annot=True, fmt='d', cbar=True,
                    cmap="Greens",
                    xticklabels=list(label_map.values()),
                    yticklabels=list(label_map.values()))

plt.xlabel('Predicted Label');
plt.ylabel('True Label')
plt.title('Confusion Matrix')
plt.show()

: 

The confusion matrix is based on the gradient boosting model. As the initial bar chart suggested, most of the predicted URLs are benign. The overall model is not bad, though: it classified a good number of URLs correctly. The class with the lowest recall is phishing, which was most often misclassified as benign. That makes sense, since phishing links are designed to imitate benign ones.
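
For reference, per-class recall is the diagonal entry of the confusion matrix divided by its row total (rows are true labels, columns are predicted labels). The sketch below uses invented counts purely to show the arithmetic; they are not the notebook's results.

```python
def per_class_recall(cm):
    """Recall per class: correct predictions divided by the row total."""
    return [row[i] / sum(row) if sum(row) else 0.0
            for i, row in enumerate(cm)]

labels = ["malware", "benign", "phishing", "defacement"]

# Made-up counts: rows are true labels, columns are predicted labels.
cm = [
    [80, 15, 5, 0],
    [2, 950, 40, 8],
    [3, 120, 70, 7],
    [0, 10, 5, 185],
]

recall = per_class_recall(cm)
for name, value in zip(labels, recall):
    print(f"{name}: {value:.3f}")

lowest = labels[recall.index(min(recall))]
print(lowest)  # phishing
```

In this invented matrix most phishing errors land in the benign column, which is the pattern described above.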

# Whois Data
The data was collected with the whois package in Python, which extracts the whois information for each URL. The only URLs that worked were those ending in `.com`, `.edu`, `.org`, and `.net`. Not every link in our dataset ends in a plain `.com`; some use region-based TLDs such as `.com.uk`. Because of these unusual TLDs we could not get whois information for every one of the first ten thousand links, so some data is lost.
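
A minimal sketch of how links could be pre-filtered by TLD before a whois lookup is given below. It only inspects the hostname suffix and makes no whois call; `whois_supported` is a hypothetical helper, and the collection script that produced the CSV may have worked differently.

```python
import re
from urllib.parse import urlparse

# TLDs for which whois lookups worked, as listed above.
SUPPORTED_TLDS = (".com", ".edu", ".org", ".net")

# Matches a leading scheme such as "http://" or "https://".
SCHEME_RE = re.compile(r"^[a-zA-Z][a-zA-Z0-9+.-]*://")

def whois_supported(url):
    """Return True when the URL's hostname ends in a supported TLD."""
    if not SCHEME_RE.match(url):
        url = "//" + url  # lets urlparse treat the leading part as the host
    try:
        host = (urlparse(url).hostname or "").lower()
    except ValueError:  # raised for some malformed URLs
        return False
    return host.endswith(SUPPORTED_TLDS)

# Invented example links.
print(whois_supported("example.com/index.html"))    # True
print(whois_supported("shop.example.com.uk/cart"))  # False
print(whois_supported("http://university.edu"))     # True
```

Checking the hostname rather than the whole string avoids counting a `.com` that appears only in the path or query.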

In [None]:
wdf = pd.read_csv("malicious_phish_10k_whoisinfo.csv")
wdf.head()

: 

In [None]:
wdf.drop(columns=['domain_name'])

: 

### Drop all rows where no whois data was returned

In [None]:
mask = wdf['domain_name'].isna()

wdf = wdf[~mask]
wdf.info()

: 

In [None]:
wdf['type'].value_counts().plot(kind="bar")
plt.xlabel("Type")
plt.xticks(rotation=0)
plt.ylabel("Frequency")

: 

### Feature Extraction

In [None]:
wdf['registrar_url'] = wdf['registrar_url'].fillna("")

wdf.info()
    


: 

In [None]:
wdf

: 

In [None]:
wdf['length'] = wdf['url'].apply(lambda x: len(x))

wdf['r_length'] = wdf['registrar_url'].apply(lambda x: len(x))

wdf

: 

In [None]:
attribute = ['@','?','-','=','.','#','%','+','$','!','*',',','//', '(', ')']
for symbol in attribute:
    wdf[symbol] = wdf['url'].apply(lambda x: x.count(symbol))

for symbol in attribute:
    wdf['r_' + symbol] = wdf['registrar_url'].apply(lambda x: x.count(symbol))

wdf.info()

: 

In [None]:
# Determine if the url is being used by a url shortening service
def detectShortened(original_url):
  url_shorteners = [
    "bit.ly",
    "tinyurl.com",
    "ow.ly",
    "is.gd",
    "v.gd",
    "soo.gd",
    "t.co",
    "lnkd.in",
    "buff.ly",
    "adf.ly",
    "shorte.st",
    "go.gl",
    "y2u.be",
    "youtu.be",
    "goo.gl",
    "po.st",
    "qr.cr",
    "snip.ly",
    "rebrand.ly",
    "bl.ink",
    "kutt.it",
    "cutt.ly",
    "shorturl.at",
    "tiny.cc",
    "osf.io",
    "doi.org",
    "arxiv.org",
    "git.io",
    "tny.im",
    "ulvis.net",
    "yourls.org",
    "polr.me",
    "branch.io",
    "app.goo.gl",
    "bnc.lt",
    "bitly.is",
    "j.mp",
    "on.mash.to",
    "flip.it",
    "instagr.am",
    "pin.it",
    "medium.com",
    "at.at",
    "su.pr",
    "twitpic.com",
    "flic.kr",
    "posterous.com",
    "digg.com",
    "plurk.com",
    "yep.it",
    "zi.pe",
    "linktr.ee",
    "taplink.cc",
    "bio.link",
    "solo.to",
    "beacons.ai",
    "luma.events",
    "eventbrite.com",
    "bento.me",
    "start.me",
    "about.me",
    "carrd.co",
    "milkshake.app",
    "paged.co",
    "shortstack.com",
    "woobox.com",
    "easypromosapp.com",
    "wishpond.com",
    "vyper.io",
    "kickofflabs.com",
    "leadpages.net",
    "instapage.com",
    "unbounce.com",
    "shortsw.com"
  ]

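  # Return 1 as soon as any shortener matches; return 0 only after
  # every entry in the list has been checked
  for url in url_shorteners:
    if url in original_url:
      return 1
  return 0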


wdf['redirect'] = wdf['url'].apply(detectShortened)
wdf

: 

In [None]:
import math
#!pip install tldextract
# tldextract is needed because urllib's urlparse
# was not able to parse all links properly
import tldextract


def calculate(length, char_amount):
    if char_amount > 0:
      return math.log2(char_amount) * length
    else:
      return 0


def find_chars(password):
    char_amount = 0
    char_sets = [False, False, False, False]
    char_nums = [26, 26, 10, 32]
    for i in password:
        if i.islower():
            char_sets[0] = True
        if i.isupper():
            char_sets[1] = True
        if i.isdigit():
            char_sets[2] = True
        if not i.isalnum() and i.isascii():
            char_sets[3] = True

    for x in range(4):
        if char_sets[x]:
            char_amount += char_nums[x]

    return len(password), char_amount

def calcEntropy(password):
  domain = tldextract.extract(password).domain
  length, amount = find_chars(domain)
  entropy  = calculate(length, amount)
  return entropy

wdf['domain_entropy'] = wdf['url'].apply(calcEntropy)
wdf['r_domain_entropy'] = wdf['registrar_url'].apply(calcEntropy)

wdf

: 

In [None]:
wdf.plot(x='r_domain_entropy', y='type', kind='scatter', alpha=0.2)

: 

In [None]:
# Determine whether the url has https or not
def isHTTPS(url):
  if "https://" in url:
    return 1
  else:
    return 0

wdf["https"] = wdf['url'].apply(isHTTPS)

wdf["r_https"] = wdf['registrar_url'].apply(isHTTPS)

wdf

: 

### Date handling

In [None]:
import pytz

def get_age_days(date):
    if pd.isna(date):
        return -1
    
    date = date[:19]

    url_date = pd.to_datetime(date)
    today = pd.Timestamp.today()

    return (today - url_date).days


wdf['creation_date'] = wdf['creation_date'].apply(get_age_days)
wdf['expiration_date'] = wdf['expiration_date'].apply(get_age_days)
wdf['updated_date'] = wdf['updated_date'].apply(get_age_days)

wdf



: 

### Registrar Count

In [None]:
wdf['r_count'] = wdf['registrar_url'].apply(lambda x: x.count(',') + 1)

: 

# Model Preparation

In [None]:
ord_map= {
    'malware' : 0,
    'benign': 1,
    'phishing': 2,
    'defacement': 3,
}
X = wdf.drop(['url', 'type', 'domain_name', 'registrar_url'], axis=1)
y = wdf['type'].map(ord_map)

: 

Split data into train and test

In [None]:
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size=0.30, random_state=0)

: 

# MLP Classifier

In [None]:
model = MLPClassifier(hidden_layer_sizes=(4, 16, 4),
                      solver='adam',
                      learning_rate_init=0.001,
                      activation='relu',
                      batch_size=64,
                      max_iter=500,
                      early_stopping=True,
                      validation_fraction=0.1,
                      verbose=True)

model.fit(Xtrain, ytrain)

: 

In [None]:
for i, layer in enumerate(model.coefs_):
  print('Layer', i, 'has', layer.shape[0], 'nodes, each with', layer.shape[1], 'weight(s)')

: 

In [None]:
model.best_validation_score_

: 

In [None]:
plt.plot(model.loss_curve_)
plt.ylabel('Loss')
plt.xlabel('Epoch')
plt.title('Training loss curve')
plt.show()

: 

In [None]:
ypred = model.predict(Xtest)
print(classification_report(ytest, ypred))

: 

# Random Forest Classifier

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

clf = RandomForestClassifier(n_estimators=100, max_depth=5)
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)
print(classification_report(y_test, y_pred))

: 

# PCA

In [None]:
pca = PCA(n_components=2)
pca.fit(X)
X_pca = pca.transform(X)
print("original shape:   ", X.shape)
print("transformed shape:", X_pca.shape)


fig, ax = plt.subplots()

# note that transformed data becomes a numpy array
ax.scatter(X_pca[:, 0], X_pca[:, 1], c=y, alpha=0.02)

: 

# XGBoost Classifier

In [None]:
model = XGBClassifier(n_estimators=1, max_depth=300, learning_rate=0.001)

model.fit(Xtrain, ytrain)

# make predictions for test data
ypred = model.predict(Xtest)

# evaluate predictions
accuracy = accuracy_score(ytest, ypred)
print(f'Accuracy: {accuracy*100:.2f}%')

: 

In [None]:
param_grid = {'n_estimators': np.arange(10, 18, 1),
              'max_depth': np.arange(12, 22, 2),
              'learning_rate': np.arange(0.1, 0.17, 0.01)}

# We can incorporate cross-validation into the grid search
# by specifying cv=5.
grid = GridSearchCV(XGBClassifier(), param_grid, cv=3, verbose=1)

# We'll use just the training data, so that we can evaluate the best
# model against data that was unseen during training.
grid.fit(Xtrain, ytrain)

: 

In [None]:
print(grid.best_params_)

# and the best accuracy
print(grid.best_score_)

# GridSearchCV refits the best hyperparameters on the full training set by
# default (refit=True), so the tuned model is available as best_estimator_
best_model = grid.best_estimator_

# predict the test data to see how well the tuned model generalizes
ypred = best_model.predict(Xtest)
accuracy = accuracy_score(ytest, ypred)
print(accuracy)

: 

In [None]:
model = XGBClassifier(n_estimators=15, max_depth=18, learning_rate=0.13)

model.fit(Xtrain, ytrain)

# make predictions for test data
ypred = model.predict(Xtest)

# evaluate predictions
accuracy = accuracy_score(ytest, ypred)
print(f'Accuracy: {accuracy*100:.2f}%')

: 

In [None]:
# Feature Importance
# Make a DataFrame for ease of sorting and visualization
feat_imp = pd.DataFrame({'Feature': X.columns,
                      'Importance': model.feature_importances_})

feat_imp = feat_imp.sort_values(by='Importance', ascending=False)

feat_imp = feat_imp[:20]

fig, ax = plt.subplots()

# horizontal bar charts can make text easier to read
ax.barh(feat_imp['Feature'], feat_imp['Importance'])
ax.set_xlabel('Importance')
ax.set_xlim(0, 0.8)
ax.set_title('Feature Importance in Malicious Urls')

: 

# Prediction Matrix

In [None]:
print(confusion_matrix(ytest, ypred))
print(classification_report(ytest, ypred))
# label_map = {
#     1: 'malware',
#     0: 'benign',
#     # 2: 'phishing',
#     # 3: 'defacement',
# }
# labels = [0,1]
label_map = {
    0: 'malware',
    1: 'benign',
    2: 'phishing',
    3: 'defacement',
}
labels = [0,1,2,3]
cm = confusion_matrix(ytest, ypred, labels=labels)
sns.heatmap(cm, square=True, annot=True, fmt='d', cbar=True,
                    cmap="Greens",
                    xticklabels=list(label_map.values()),
                    yticklabels=list(label_map.values()))

plt.xlabel('Predicted Label');
plt.ylabel('True Label')
plt.title('Confusion Matrix')
plt.show()

: 