<a href="https://colab.research.google.com/github/ssinlao/CS4210-Phishing-Website-Project/blob/main/Phishing_Websites.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Phishing Website Modeling
Data Info:
* The dataset consists of 1828 websites, each checked for 18 features
* 859 legitimate websites collected from the Yahoo directory
* 969 phishing websites collected from the PhishTank and MillerSmiles archives

### Cleaning the Data
1. import necessary libraries
2. load the arff file
3. see what features are included
4. check for missing and null values
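
Step 4 can be sketched as follows. This is only an illustration on a small made-up frame (`toy` and its values are hypothetical, not taken from the dataset); on the real data the same calls would be applied to `df`.

```python
import pandas as pd

# Hypothetical toy frame standing in for the loaded data
toy = pd.DataFrame({
    "having_IP_Address": ["1", "-1", None],
    "URL_Length": ["1", "0", "-1"],
})

# Missing values per column
missing_per_column = toy.isna().sum()

# Number of rows that contain at least one missing value
rows_with_missing = int(toy.isna().any(axis=1).sum())

print(missing_per_column)
print("Rows with at least one missing value:", rows_with_missing)
```

If every per-column count is zero, dropping rows with missing values leaves the frame unchanged.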

In [2]:
import pandas as pd
import numpy as np
from scipy.io import arff

In [3]:
data, meta = arff.loadarff('phishing-data.arff')
df = pd.DataFrame(data)

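# ARFF nominal attributes load as bytes; decode them to regular strings
for column in df.select_dtypes(include=['object']).columns:
  df[column] = df[column].str.decode('utf-8')

# info() prints its summary directly and returns None, so print() is not needed
df.info()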

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11055 entries, 0 to 11054
Data columns (total 31 columns):
 #   Column                       Non-Null Count  Dtype 
---  ------                       --------------  ----- 
 0   having_IP_Address            11055 non-null  object
 1   URL_Length                   11055 non-null  object
 2   Shortining_Service           11055 non-null  object
 3   having_At_Symbol             11055 non-null  object
 4   double_slash_redirecting     11055 non-null  object
 5   Prefix_Suffix                11055 non-null  object
 6   having_Sub_Domain            11055 non-null  object
 7   SSLfinal_State               11055 non-null  object
 8   Domain_registeration_length  11055 non-null  object
 9   Favicon                      11055 non-null  object
 10  port                         11055 non-null  object
 11  HTTPS_token                  11055 non-null  object
 12  Request_URL                  11055 non-null  object
 13  URL_of_Anchor                11

In [None]:
df.head()
df.describe()

Unnamed: 0,having_IP_Address,URL_Length,Shortining_Service,having_At_Symbol,double_slash_redirecting,Prefix_Suffix,having_Sub_Domain,SSLfinal_State,Domain_registeration_length,Favicon,...,popUpWidnow,Iframe,age_of_domain,DNSRecord,web_traffic,Page_Rank,Google_Index,Links_pointing_to_page,Statistical_report,Result
count,11055,11055,11055,11055,11055,11055,11055,11055,11055,11055,...,11055,11055,11055,11055,11055,11055,11055,11055,11055,11055
unique,2,3,2,2,2,2,3,3,2,2,...,2,2,2,2,3,2,2,3,2,2
top,1,-1,1,1,1,-1,1,1,-1,1,...,1,1,1,1,1,-1,1,0,1,1
freq,7262,8960,9611,9400,9626,9590,4070,6331,7389,9002,...,8918,10043,5866,7612,5831,8201,9516,6156,9505,6157


Deleting rows with missing values:

In [4]:
df = df.dropna()  # dropna() returns a new DataFrame, so assign the result back
df

Unnamed: 0,having_IP_Address,URL_Length,Shortining_Service,having_At_Symbol,double_slash_redirecting,Prefix_Suffix,having_Sub_Domain,SSLfinal_State,Domain_registeration_length,Favicon,...,popUpWidnow,Iframe,age_of_domain,DNSRecord,web_traffic,Page_Rank,Google_Index,Links_pointing_to_page,Statistical_report,Result
0,-1,1,1,1,-1,-1,-1,-1,-1,1,...,1,1,-1,-1,-1,-1,1,1,-1,-1
1,1,1,1,1,1,-1,0,1,-1,1,...,1,1,-1,-1,0,-1,1,1,1,-1
2,1,0,1,1,1,-1,-1,-1,-1,1,...,1,1,1,-1,1,-1,1,0,-1,-1
3,1,0,1,1,1,-1,-1,-1,1,1,...,1,1,-1,-1,1,-1,1,-1,1,-1
4,1,0,-1,1,1,-1,1,1,-1,1,...,-1,1,-1,-1,0,-1,1,1,1,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
11050,1,-1,1,-1,1,1,1,1,-1,-1,...,-1,-1,1,1,-1,-1,1,1,1,1
11051,-1,1,1,-1,-1,-1,1,-1,-1,-1,...,-1,1,1,1,1,1,1,-1,1,-1
11052,1,-1,1,1,1,-1,1,-1,-1,1,...,1,1,1,1,1,-1,1,0,1,-1
11053,-1,-1,1,1,1,-1,-1,-1,1,-1,...,-1,1,1,1,1,-1,1,1,1,-1


Because removing duplicate rows would discard the majority of the dataset and leave us with only 5 possible combinations, we keep the duplicates.
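
One way to check the effect of de-duplication before deciding is to count the duplicated rows and the rows that would remain. The sketch below uses a hypothetical toy frame (`toy` and its values are made up for illustration); on the real data the same calls would be applied to `df`.

```python
import pandas as pd

# Hypothetical toy frame with repeated rows
toy = pd.DataFrame({
    "SSLfinal_State": [1, 1, -1, 1, -1],
    "Result":         [1, 1, -1, 1, -1],
})

# Rows that are identical to an earlier row
n_duplicates = int(toy.duplicated().sum())

# Rows that would remain after de-duplication
n_unique = len(toy.drop_duplicates())

print("Duplicated rows:", n_duplicates)
print("Rows left after drop_duplicates():", n_unique)
```

With categorical features that take only two or three values, many sites can legitimately share the same feature vector, which is one reason to keep such rows.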

## Feature Selection
Use entropy and information gain to find which features are the most significant for predicting the target class.
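
To make the two terms concrete, here is a minimal sketch of entropy and information gain for a single categorical feature. The helper names `entropy` and `information_gain` and the toy lists are illustrative assumptions. scikit-learn's `feature_importances_` aggregates impurity reductions over every node of the fitted tree and normalizes them, so its values will not equal the single-split gain computed here.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(feature_values, labels):
    """Entropy of the labels minus the weighted entropy after splitting on the feature."""
    n = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

# Toy example: two phishing (-1) and two legitimate (1) sites
labels = [1, 1, -1, -1]
informative = [1, 1, -1, -1]      # separates the classes perfectly
uninformative = [1, -1, 1, -1]    # tells us nothing about the class

gain_informative = information_gain(informative, labels)
gain_uninformative = information_gain(uninformative, labels)
print(gain_informative, gain_uninformative)
```

A feature that splits the classes cleanly recovers the full entropy of the labels, while a feature that is independent of the class has zero gain.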


**Set up features and target class**

In [5]:
# Column values were decoded as strings above, so cast them to integers
X = df.drop(columns=['Result']).astype(int)
y = df['Result'].astype(int)

**Decision Tree Classifier to Rank Features**
* using entropy and information gain

In [6]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
import pandas as pd

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

clf = DecisionTreeClassifier(criterion='entropy', random_state=42)
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)

In [7]:
importances = clf.feature_importances_

# Rank features from most to least important
print("Feature Importances based on Information Gain:")
print(pd.Series(importances, index=X.columns).sort_values(ascending=False))

Feature Importances based on Information Gain:
SSLfinal_State                 0.509352
URL_of_Anchor                  0.163536
web_traffic                    0.046178
Links_in_tags                  0.034992
Prefix_Suffix                  0.029970
having_Sub_Domain              0.026683
Links_pointing_to_page         0.022604
SFH                            0.019130
Request_URL                    0.018047
Google_Index                   0.014273
age_of_domain                  0.014153
Page_Rank                      0.012450
having_IP_Address              0.011799
Domain_registeration_length    0.010128
DNSRecord                      0.010055
URL_Length                     0.008237
popUpWidnow                    0.007101
Submitting_to_email            0.005449
having_At_Symbol               0.005000
Shortining_Service             0.004779
on_mouseover                   0.004730
HTTPS_token                    0.004217
Iframe                         0.003589
Redirect                       0.

Train a model using only the features whose importance is greater than a threshold, and compare its accuracy with that of the model trained on all features.

In [8]:
threshold = 0.0002
selected_features = X.columns[importances > threshold]

X_train_selected = X_train[selected_features]
X_test_selected = X_test[selected_features]

In [10]:
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier

y_pred_all_features = clf.predict(X_test)

accuracy_all_features = accuracy_score(y_test, y_pred_all_features)
print("Accuracy with all features:", accuracy_all_features)

clf_selected = DecisionTreeClassifier(criterion='entropy', random_state=42)
clf_selected.fit(X_train_selected, y_train)

y_pred_selected_features = clf_selected.predict(X_test_selected)

accuracy_selected_features = accuracy_score(y_test, y_pred_selected_features)
print("Accuracy with selected features:", accuracy_selected_features)

Accuracy with all features: 0.9614109134760326
Accuracy with selected features: 0.9617123907145011


Accuracy improves only when the RightClick feature is removed, and only marginally (0.9614 to 0.9617); RightClick is the only feature whose importance falls below the threshold.
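
To put the size of this difference in perspective, the two accuracies can be converted back into counts of correct predictions. The sketch below assumes the test split holds 3317 rows (30% of the 11055 rows, rounded up); the variable names are illustrative.

```python
n_test = 3317  # assumed test-set size: 30% of 11055 rows, rounded up

acc_all = 0.9614109134760326       # accuracy with all features
acc_selected = 0.9617123907145011  # accuracy with selected features

# Convert each accuracy back into a count of correct predictions
correct_all = round(acc_all * n_test)
correct_selected = round(acc_selected * n_test)
extra_correct = correct_selected - correct_all

print("Correct with all features:     ", correct_all)
print("Correct with selected features:", correct_selected)
print("Difference:                    ", extra_correct)
```

Under that assumption the two models differ by a single test prediction, so the gain may well be within what a different `random_state` would produce.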

1. Logistic Regression:

In [14]:
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report, confusion_matrix, f1_score
import pandas as pd

# Prepare features and target
X = df.drop(columns=['Result']).astype(int)
y = df['Result'].astype(int)

# Stratified train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

# Build pipeline
pipe_lr = Pipeline([
    ('scaler', StandardScaler()),
    ('lr', LogisticRegression(
        max_iter=1000,
        class_weight='balanced',
        random_state=42
    ))
])

# Train model
pipe_lr.fit(X_train, y_train)

# Predict
y_pred = pipe_lr.predict(X_test)

# Classification report with proper labels
print("=== Logistic Regression Results ===\n")
print("Classification Report:")
print(classification_report(
    y_test, y_pred,
    target_names=['Phishing (-1)', 'Legitimate (1)']
))

# Confusion matrix with labeled rows/columns
cm = confusion_matrix(y_test, y_pred)
cm_df = pd.DataFrame(
    cm,
    index=['Actual Phishing (-1)', 'Actual Legitimate (1)'],
    columns=['Predicted Phishing (-1)', 'Predicted Legitimate (1)']
)
print("\nConfusion Matrix:")
print(cm_df)

# Macro F1 score
print("\nMacro F1 Score:", f1_score(y_test, y_pred, average='macro'))


=== Logistic Regression Results ===

Classification Report:
                precision    recall  f1-score   support

 Phishing (-1)       0.91      0.92      0.92      1470
Legitimate (1)       0.94      0.93      0.93      1847

      accuracy                           0.92      3317
     macro avg       0.92      0.92      0.92      3317
  weighted avg       0.93      0.92      0.92      3317


Confusion Matrix:
                       Predicted Phishing (-1)  Predicted Legitimate (1)
Actual Phishing (-1)                      1356                       114
Actual Legitimate (1)                      135                      1712

Macro F1 Score: 0.924057396301663


Logistic Regression provides a good baseline, with a Macro F1 Score of 0.92. The confusion matrix shows 114 phishing sites missed (false negatives) and 135 legitimate sites flagged as phishing (false positives).
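
The reported scores can be reproduced by hand from the confusion matrix. The sketch below hard-codes the Logistic Regression matrix printed above; the helper `per_class_scores` is a hypothetical name introduced for illustration.

```python
# Confusion matrix from the Logistic Regression output
# rows = actual class, columns = predicted class
# order: phishing (-1), legitimate (1)
cm = [[1356, 114],
      [135, 1712]]

def per_class_scores(cm, k):
    """Precision, recall and F1 for class index k of a square confusion matrix."""
    tp = cm[k][k]
    predicted_k = sum(row[k] for row in cm)  # column sum: everything predicted as k
    actual_k = sum(cm[k])                    # row sum: everything that really is k
    precision = tp / predicted_k
    recall = tp / actual_k
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

scores = [per_class_scores(cm, k) for k in range(len(cm))]

# Macro F1 is the unweighted mean of the per-class F1 scores
macro_f1 = sum(f1 for _, _, f1 in scores) / len(scores)
print("Macro F1:", macro_f1)
```

Because macro averaging weights both classes equally, it does not favor the larger legitimate class.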

2. Random Forest

In [23]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, f1_score
import pandas as pd

# Prepare features and target
X = df.drop(columns=['Result']).astype(int)
y = df['Result'].astype(int)

# Stratified train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

# Build Random Forest model
rf = RandomForestClassifier(
    n_estimators=200,        # number of trees
    max_depth=None,          # let trees grow fully
    min_samples_leaf=2,      # avoid overfitting
    class_weight='balanced', # handle class imbalance
    random_state=42
)

# Train
rf.fit(X_train, y_train)

# Predict
y_pred_rf = rf.predict(X_test)

# Classification report with proper labels
print("=== Random Forest Results ===\n")
print("Classification Report:")
print(classification_report(
    y_test, y_pred_rf,
    target_names=['Phishing (-1)', 'Legitimate (1)']
))

# Confusion matrix with labeled rows/columns
cm = confusion_matrix(y_test, y_pred_rf)
cm_df = pd.DataFrame(
    cm,
    index=['Actual Phishing (-1)', 'Actual Legitimate (1)'],
    columns=['Predicted Phishing (-1)', 'Predicted Legitimate (1)']
)
print("\nConfusion Matrix:")
print(cm_df)

# Macro F1 score
print("\nMacro F1 Score:", f1_score(y_test, y_pred_rf, average='macro'))

=== Random Forest Results ===

Classification Report:
                precision    recall  f1-score   support

 Phishing (-1)       0.96      0.96      0.96      1470
Legitimate (1)       0.97      0.97      0.97      1847

      accuracy                           0.96      3317
     macro avg       0.96      0.96      0.96      3317
  weighted avg       0.96      0.96      0.96      3317


Confusion Matrix:
                       Predicted Phishing (-1)  Predicted Legitimate (1)
Actual Phishing (-1)                      1411                        59
Actual Legitimate (1)                       60                      1787

Macro F1 Score: 0.9636572237117871


Random Forest is more accurate than Logistic Regression, with a Macro F1 Score of 0.95-0.96 (0.964 in the run shown). The confusion matrix shows 59 phishing sites missed (false negatives) and 60 legitimate sites flagged as phishing (false positives).
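
The `class_weight='balanced'` setting used in the models above weights each class by `n_samples / (n_classes * class_count)`, as described in the scikit-learn documentation. The sketch below evaluates that formula by hand. The training-split counts are an assumption derived from the printed outputs: 11055 rows in total, 6157 of them labeled 1, minus the test supports of 1470 and 1847.

```python
# Class counts derived from the outputs above (an assumption, not a direct printout)
total = {"phishing": 11055 - 6157, "legitimate": 6157}
test = {"phishing": 1470, "legitimate": 1847}
train = {label: total[label] - test[label] for label in total}

n_samples = sum(train.values())
n_classes = len(train)

# 'balanced' weight = n_samples / (n_classes * class_count)
weights = {label: n_samples / (n_classes * count) for label, count in train.items()}

for label, w in weights.items():
    print(f"{label}: {w:.3f}")
```

The weights stay close to 1 because the classes are only mildly imbalanced (roughly 44% phishing to 56% legitimate), so the setting has a modest effect here.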

3. Histogram-Based Gradient Boosting

In [20]:
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.metrics import classification_report, confusion_matrix, f1_score
import pandas as pd

# Prepare features and target
X = df.drop(columns=['Result']).astype(int)
y = df['Result'].astype(int)

# Stratified train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

# Build Histogram-Based Gradient Boosting model
hgb = HistGradientBoostingClassifier(
    learning_rate=0.1,       # step size for boosting
    max_depth=10,            # limit depth to prevent overfitting
    l2_regularization=1.0,   # small penalty for stability
    early_stopping=True,     # stop when validation score stops improving
    random_state=42
)

# Train
hgb.fit(X_train, y_train)

# Predict
y_pred_hgb = hgb.predict(X_test)

# Classification report with proper labels
print("=== Histogram-Based Gradient Boosting Results ===\n")
print("Classification Report:")
print(classification_report(
    y_test, y_pred_hgb,
    target_names=['Phishing (-1)', 'Legitimate (1)']
))

# Confusion matrix with labels
cm = confusion_matrix(y_test, y_pred_hgb)
cm_df = pd.DataFrame(
    cm,
    index=['Actual Phishing (-1)', 'Actual Legitimate (1)'],
    columns=['Predicted Phishing (-1)', 'Predicted Legitimate (1)']
)
print("\nConfusion Matrix:")
print(cm_df)

# Macro F1 score
print("\nMacro F1 Score:", f1_score(y_test, y_pred_hgb, average='macro'))


=== Histogram-Based Gradient Boosting Results ===

Classification Report:
                precision    recall  f1-score   support

 Phishing (-1)       0.96      0.96      0.96      1470
Legitimate (1)       0.97      0.97      0.97      1847

      accuracy                           0.97      3317
     macro avg       0.96      0.96      0.96      3317
  weighted avg       0.97      0.97      0.97      3317


Confusion Matrix:
                       Predicted Phishing (-1)  Predicted Legitimate (1)
Actual Phishing (-1)                      1409                        61
Actual Legitimate (1)                       55                      1792

Macro F1 Score: 0.9645560898321635


Across multiple tunings, Histogram-Based Gradient Boosting is the most accurate of the three models, with a consistent Macro F1 Score of 0.96, though its margin over Random Forest in the runs shown is small (0.9646 vs 0.9637). The confusion matrix shows 61 phishing sites missed (false negatives) and 55 legitimate sites flagged as phishing (false positives).
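
For a side-by-side view, the error counts of the three models can be turned into rates. The sketch below hard-codes the false negatives and false positives read off the three confusion matrices above; the dictionary layout and variable names are illustrative.

```python
# (false negatives, false positives) from the three confusion matrices
errors = {
    "Logistic Regression": (114, 135),
    "Random Forest": (59, 60),
    "Hist Gradient Boosting": (61, 55),
}
n_phishing, n_legitimate = 1470, 1847  # test-set support per class

summary = {}
for model, (fn, fp) in errors.items():
    summary[model] = {
        "miss_rate": fn / n_phishing,           # share of phishing sites missed
        "false_alarm_rate": fp / n_legitimate,  # share of legitimate sites flagged
        "total_errors": fn + fp,
    }

for model, s in summary.items():
    print(f"{model}: missed {s['miss_rate']:.1%}, "
          f"false alarms {s['false_alarm_rate']:.1%}, "
          f"total errors {s['total_errors']}")
```

In these runs Histogram-Based Gradient Boosting makes the fewest errors overall, while Random Forest misses two fewer phishing sites. Which model is preferable depends on whether missed phishing sites or false alarms are more costly.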