## Introduction
Models were selected by monitoring performance both internally and externally. The chosen models are XGBoost and Random Forest, both trained on the PhishTank dataset (A). An ensemble model is also considered due to promising results. Optuna will be used to find optimal hyperparameters for both models, and the tuned models will be tested externally on the Kaggle dataset.

## Setup
Import the necessary libraries and define the functions that report a range of scores for the optimised models. The data is then loaded and split into training and test sets. Only dataset A is used for training, as it was shown to give better results than dataset B.


In [1]:
# Standard imports
import pandas as pd
from IPython.display import display

# Models
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier


# Evaluation
from sklearn.model_selection import  train_test_split, learning_curve
from sklearn.metrics import  confusion_matrix, f1_score, precision_score, recall_score, roc_auc_score, roc_curve
import optuna

# Explainability
import shap

In [2]:
def appendScores(model, X_test, y_test): #scoring functions from previous notebook
    y_pred = model.predict(X_test)
    scores = {}
    scores['Accuracy'] = model.score(X_test,y_test)
    scores['F1 Score'] = f1_score(y_test, y_pred)
    scores['Precision'] = precision_score(y_test, y_pred)
    scores['Recall'] = recall_score(y_test, y_pred)
    try:
        scores['ROC AUC:'] = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    except AttributeError:
        # Some models might not have predict_proba method
        scores['ROC AUC:'] = "Not available"
    return pd.DataFrame([scores])


def getScores(model, internal_test_featuresX, internal_test_labely, external_featuresX, external_labely ):
    internal_scores = appendScores(model, internal_test_featuresX, internal_test_labely)
    internal_scores.index = ['Internal']

    external_scores = appendScores(model, external_featuresX, external_labely)
    external_scores.index = ['External']
    df = pd.concat(
        [internal_scores, external_scores]
    )
    display(df)

In [3]:
phishA = pd.read_csv("Datasets/Final_datasets/extracted_phishtank_features.csv")
kaggleB = pd.read_csv("Datasets/Final_datasets/final_kaggle_data.csv")

In [4]:
aX = phishA.drop(columns=['phishing'])
ay = phishA['phishing']
bX = kaggleB.drop(columns=['phishing'])
by = kaggleB['phishing']

The previous well-performing models will be used as a baseline for comparison with the tuned models.

In [5]:
from sklearn.pipeline import Pipeline

aX_train, aX_test, ay_train, ay_test = train_test_split(aX, ay, test_size=0.2, random_state=42)

randforestA = Pipeline([('randomforesta', RandomForestClassifier(n_estimators=100, random_state=42))])
xgboostA = Pipeline([('xgba', XGBClassifier(eval_metric='logloss', random_state=42))])

In [6]:
randforestA = randforestA.fit(aX_train, ay_train) #fit models at the start for later use
xgboostA = xgboostA.fit(aX_train, ay_train)

## Optimisation
Both selected models expose a large number of hyperparameters that should be tuned to get the most out of them. Optuna is used for the optimisation process for several reasons, namely efficiency, visualisation and the ability to handle complex models such as XGBoost. Optuna uses Bayesian optimisation techniques to explore the search space efficiently, which is less computationally expensive than an exhaustive grid search.
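
To give a rough sense of the cost difference, the sketch below counts how many model fits an exhaustive grid would need over the Random Forest ranges searched in this notebook, compared with the 100 Optuna trials used here. Treating every integer in each range as a grid point is an assumption made purely for illustration; a real grid search would normally use a much coarser grid.

```python
from math import prod

# Integer ranges searched for the Random Forest in this notebook (inclusive bounds)
search_space = {
    "n_estimators": (50, 500),
    "max_depth": (2, 16),
    "min_samples_split": (10, 100),
    "min_samples_leaf": (5, 50),
}
max_features_options = 3  # 'sqrt', 'log2', None
cv_folds = 5
optuna_trials = 100

# Assumption: an exhaustive grid that tries every integer in each range
values_per_param = [high - low + 1 for low, high in search_space.values()]
grid_combinations = prod(values_per_param) * max_features_options

# Each combination (or trial) is fitted once per cross-validation fold
grid_fits = grid_combinations * cv_folds
optuna_fits = optuna_trials * cv_folds

print(f"Exhaustive grid: {grid_combinations:,} combinations, {grid_fits:,} fits")
print(f"Optuna: {optuna_trials} trials, {optuna_fits} fits")
```

Even a far coarser grid would need many more fits than the 500 required by 100 trials with 5-fold cross-validation.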

### Random Forest

In [50]:
from sklearn.model_selection import StratifiedKFold, cross_val_score


# Define the objective function
def objective(trial):

    n_estimators = trial.suggest_int('n_estimators',50, 500, log=True)  # Number of trees
    max_depth = trial.suggest_int('max_depth', 2, 16)  # Maximum depth of trees
    min_samples_split = trial.suggest_int('min_samples_split', 10,100)  # Minimum number of samples to split a node
    min_samples_leaf = trial.suggest_int('min_samples_leaf', 5,50)  # Minimum number of samples required at a leaf node
    max_features = trial.suggest_categorical('max_features', [ 'sqrt', 'log2', None])  # Number of features to consider at each split

    # Initialize the Random Forest Classifier with the suggested hyperparameters
    rf = RandomForestClassifier(
        n_estimators=n_estimators,
        max_depth=max_depth,
        min_samples_split=min_samples_split,
        min_samples_leaf=min_samples_leaf,
        max_features=max_features,
        random_state=42
    )
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42) #use stratified cross validation
    score = cross_val_score(rf, aX_train, ay_train, cv=cv, scoring='f1').mean()
    return score

# Create and run the Optuna study
study = optuna.create_study(direction='maximize')  # We're maximizing f1
study.optimize(objective, n_trials=100)  # Run optimization for 100 trials

# Check the best hyperparameters
print("Best hyperparameters found: ", study.best_params)

[I 2025-04-25 00:03:52,486] A new study created in memory with name: no-name-e0be8729-d7f3-497d-8fda-86f08b08cc9c
[I 2025-04-25 00:03:57,747] Trial 0 finished with value: 0.8754056928695004 and parameters: {'n_estimators': 253, 'max_depth': 2, 'min_samples_split': 69, 'min_samples_leaf': 35, 'max_features': 'log2'}. Best is trial 0 with value: 0.8754056928695004.
[I 2025-04-25 00:04:01,580] Trial 1 finished with value: 0.9195130225251755 and parameters: {'n_estimators': 59, 'max_depth': 13, 'min_samples_split': 68, 'min_samples_leaf': 24, 'max_features': 'log2'}. Best is trial 1 with value: 0.9195130225251755.
[I 2025-04-25 00:04:29,873] Trial 2 finished with value: 0.9161085673668536 and parameters: {'n_estimators': 202, 'max_depth': 5, 'min_samples_split': 37, 'min_samples_leaf': 13, 'max_features': None}. Best is trial 1 with value: 0.9195130225251755.
[I 2025-04-25 00:04:55,239] Trial 3 finished with value: 0.8734045804201973 and parameters: {'n_estimators': 406, 'max_depth': 2, 'm

Best hyperparameters found:  {'n_estimators': 181, 'max_depth': 14, 'min_samples_split': 10, 'min_samples_leaf': 5, 'max_features': None}


In [51]:
randomforest = RandomForestClassifier(**study.best_params, random_state=42 )
randomforest = randomforest.fit(aX_train, ay_train)

In [52]:
getScores(randomforest, aX_test, ay_test, bX, by) #optimised model

| | Accuracy | F1 Score | Precision | Recall | ROC AUC: |
|---|---|---|---|---|---|
| Internal | 0.928793 | 0.936561 | 0.92681 | 0.946521 | 0.977995 |
| External | 0.695087 | 0.711469 | 0.640762 | 0.799715 | 0.810518 |


In [66]:
getScores(randforestA, aX_test, ay_test, bX, by) #base model

| | Accuracy | F1 Score | Precision | Recall | ROC AUC: |
|---|---|---|---|---|---|
| Internal | 0.927302 | 0.935333 | 0.924209 | 0.946727 | 0.977538 |
| External | 0.713917 | 0.738031 | 0.647918 | 0.857259 | 0.827339 |


The tuned model performs marginally better on the unseen data in dataset A (F1 of 0.937 against 0.935), but when generalising to dataset B its recall drops from 0.857 to 0.800. As such, the base model with scikit-learn's default settings appears to be the better choice here.
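
The comparison can be made explicit by subtracting the base scores from the tuned scores. The snippet below is a small sketch that copies values from the two Random Forest result tables above into plain dictionaries; the variable names are illustrative and not part of the notebook.

```python
# Scores copied from the Random Forest result tables (tuned vs. base model)
tuned = {
    "Internal": {"F1 Score": 0.936561, "Recall": 0.946521},
    "External": {"F1 Score": 0.711469, "Recall": 0.799715},
}
base = {
    "Internal": {"F1 Score": 0.935333, "Recall": 0.946727},
    "External": {"F1 Score": 0.738031, "Recall": 0.857259},
}

# Positive values mean tuning helped, negative values mean it hurt
deltas = {
    split: {
        metric: round(tuned[split][metric] - base[split][metric], 6)
        for metric in tuned[split]
    }
    for split in tuned
}

for split, metrics in deltas.items():
    print(split, metrics)
```

The internal differences are tiny, while the external differences are negative and considerably larger in magnitude.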

### XGBoost

In [10]:
from sklearn.model_selection import cross_val_score


def objective1(trial):
    params = {
        # Essential parameters
        'learning_rate': trial.suggest_float('learning_rate', 0.01, 0.3, log=True),
        'max_depth': trial.suggest_int('max_depth', 3, 10),
        'subsample': trial.suggest_float('subsample', 0.6, 1.0),
        'colsample_bytree': trial.suggest_float('colsample_bytree', 0.6, 1.0),

        # Regularization parameters
        'reg_alpha': trial.suggest_float('reg_alpha', 0, 10),
        'reg_lambda': trial.suggest_float('reg_lambda', 0, 10),

        # Tree structure parameters
        'min_child_weight': trial.suggest_int('min_child_weight', 1, 20),
        'gamma': trial.suggest_float('gamma', 0, 5),

        # Efficiency parameters
        'n_estimators': trial.suggest_int('n_estimators', 100, 1000, step=100),
    }

    xg = XGBClassifier(**params)

    # cross_val_score clones and fits the model on each fold,
    # so no separate fit or prediction on the test set is needed here
    score = cross_val_score(xg, aX_train, ay_train, cv=5, scoring='f1').mean()
    return score

study1 = optuna.create_study(direction='maximize')
study1.optimize(objective1, n_trials=100)
print("Best hyperparameters found: ", study1.best_params)

[I 2025-04-25 20:41:58,580] A new study created in memory with name: no-name-2e55c6af-ba8d-4bac-8f19-21677f5c77ae
[I 2025-04-25 20:42:01,495] Trial 0 finished with value: 0.9326923022342204 and parameters: {'learning_rate': 0.16659161467611217, 'max_depth': 5, 'subsample': 0.6753317466706428, 'colsample_bytree': 0.9834959679727708, 'reg_alpha': 0.9613455555113903, 'reg_lambda': 9.225643327573112, 'min_child_weight': 8, 'gamma': 2.790107450351052, 'n_estimators': 800}. Best is trial 0 with value: 0.9326923022342204.
[I 2025-04-25 20:42:05,524] Trial 1 finished with value: 0.9291879779063386 and parameters: {'learning_rate': 0.06377670172507884, 'max_depth': 5, 'subsample': 0.9659763228394429, 'colsample_bytree': 0.6495072097723513, 'reg_alpha': 3.9096667867048085, 'reg_lambda': 4.681521716392835, 'min_child_weight': 8, 'gamma': 2.735325016436767, 'n_estimators': 1000}. Best is trial 0 with value: 0.9326923022342204.
[I 2025-04-25 20:42:12,365] Trial 2 finished with value: 0.929968834801

Best hyperparameters found:  {'learning_rate': 0.13291484016415933, 'max_depth': 7, 'subsample': 0.9112140881002768, 'colsample_bytree': 0.7372877262802932, 'reg_alpha': 0.05359567621966728, 'reg_lambda': 3.8899873714848487, 'min_child_weight': 1, 'gamma': 0.35715295389157653, 'n_estimators': 900}


In [59]:
xg = XGBClassifier(**study1.best_params)
xg.fit(aX_train, ay_train)
getScores(xg, aX_test, ay_test, bX, by)

| | Accuracy | F1 Score | Precision | Recall | ROC AUC: |
|---|---|---|---|---|---|
| Internal | 0.93808 | 0.944307 | 0.943334 | 0.945282 | 0.983406 |
| External | 0.757312 | 0.76344 | 0.704557 | 0.833062 | 0.854291 |


In [34]:
getScores(xgboostA, aX_test, ay_test, bX, by)

| | Accuracy | F1 Score | Precision | Recall | ROC AUC: |
|---|---|---|---|---|---|
| Internal | 0.93315 | 0.940174 | 0.934517 | 0.945901 | 0.982224 |
| External | 0.744026 | 0.74534 | 0.700071 | 0.796869 | 0.830918 |


The optimised XGBoost model performs better than the base model, with a noticeable improvement in recall on the external dataset (0.833 against 0.797). High recall is important in a security-sensitive field such as phishing detection, so this is very promising. This model was selected for the API.
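
Recall matters here because it measures the share of actual phishing URLs that the model catches; a false negative is a phishing page that slips through. As a reminder of how the reported metrics relate, the sketch below computes precision, recall and F1 from confusion-matrix counts. It assumes phishing is the positive class, and the counts are made-up illustrative numbers, not results from this notebook.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical counts: 80 phishing URLs caught, 20 benign URLs flagged,
# 20 phishing URLs missed
precision, recall, f1 = precision_recall_f1(tp=80, fp=20, fn=20)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Lowering the number of missed phishing URLs (`fn`) raises recall directly, which is why an increase in external recall is treated as the key improvement.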

In [76]:
import joblib  # save the model for use in the Flask API

joblib.dump(xg, 'xgboost.pkl')

In [9]:
import joblib  # save the model columns so they can be reused in the Flask API
model_columns = aX.columns
joblib.dump(model_columns, 'model_columns.pkl')

['model_columns.pkl']
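
Saving the column order matters because the model expects features in exactly the order it was trained on. The Flask API itself is not shown in this notebook, so the following is only a sketch of how a request payload might be aligned to the saved columns. The helper `align_features`, the example column names and the fill value of 0 for missing features are all assumptions made for illustration.

```python
def align_features(payload, model_columns, fill_value=0):
    """Return feature values ordered like the training columns.

    Features missing from the payload get fill_value; extra keys are ignored.
    """
    return [payload.get(col, fill_value) for col in model_columns]


# Hypothetical column names; in the API these would come from
# joblib.load('model_columns.pkl')
model_columns = ["url_length", "num_dots", "has_https"]

# Hypothetical request payload with a different key order and one extra key
payload = {"has_https": 1, "url_length": 54, "unexpected": 99}

row = align_features(payload, model_columns)
print(row)
```

Whether silently filling missing features is acceptable depends on the API; rejecting incomplete requests may be the safer design.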