the404packet/Phishing-URL-Detection

Phishing and Legitimate URL Detection

Dataset: Kaggle - Phishing URL Dataset
Introductory Paper: ScienceDirect Article

Dataset Information

This project uses a dataset of 235,795 labeled URLs, where label 0 indicates a phishing URL and label 1 indicates a legitimate one. The dataset includes over 49 extracted features (out of 55 columns in total) that capture the structural, lexical, and statistical characteristics of URLs. One feature unique to this dataset is the URL Similarity Index (USI), which is domain-specific and highly correlated with the label.

Model Design Approach

The dataset was already clean and well-prepared, eliminating the need for further preprocessing. No additional feature engineering was necessary, as the included features were already rich and informative. Exploratory Data Analysis (EDA) was conducted to understand feature distributions and detect redundancy or low-importance columns. Some features were removed based on EDA insights and importance metrics.
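One common way to detect redundant columns during EDA is to look for pairs of features that are almost perfectly correlated. The following is a minimal sketch of that idea on synthetic data, not the analysis used in this repository; the column names and the 0.95 cutoff are hypothetical.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data: feat_a_copy is a near-duplicate of feat_a.
rng = np.random.default_rng(0)
a = rng.normal(size=500)
df = pd.DataFrame({
    "feat_a": a,
    "feat_a_copy": a * 2.0 + rng.normal(scale=0.01, size=500),
    "feat_b": rng.normal(size=500),
})

# Absolute pairwise Pearson correlations between all columns.
corr = df.corr().abs()

# Keep only the strict upper triangle so each pair is inspected once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Flag a column if it is highly correlated with any earlier column.
CUTOFF = 0.95  # hypothetical threshold
redundant = [col for col in upper.columns if (upper[col] > CUTOFF).any()]
print(redundant)
```

Columns flagged this way are candidates for removal; whether to drop them is still a judgment call that depends on feature importance and domain knowledge.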

The dataset was then split into 70% training and 30% testing subsets. Importantly, the USI feature was deliberately excluded from the training features: because it is so strongly tied to the label, keeping it would inflate the reported performance and hide how well the model generalizes in real-world settings.
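The split and the USI exclusion could look roughly like the sketch below. It runs on a small synthetic DataFrame rather than the Kaggle file, and the column names (`URLSimilarityIndex`, `label`, and the two example features), the stratified split, and the random seed are all assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real dataset; column names are assumptions.
rng = np.random.default_rng(0)
n_rows = 1000
df = pd.DataFrame({
    "URLLength": rng.integers(10, 200, size=n_rows),
    "NoOfSubDomain": rng.integers(0, 5, size=n_rows),
    "URLSimilarityIndex": rng.uniform(0, 100, size=n_rows),
    "label": rng.integers(0, 2, size=n_rows),  # 0 = phishing, 1 = legitimate
})

# Exclude both the target and the USI column from the feature matrix.
X = df.drop(columns=["label", "URLSimilarityIndex"])
y = df["label"]

# 70/30 split; stratify keeps the class ratio similar in both subsets
# (stratification is an assumption, the README only states the 70/30 ratio).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42
)
print(X_train.shape, X_test.shape)
```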

Several tree-based models were explored for classification, with Random Forest outperforming the others in both accuracy and generalization. Further improvements came from hyperparameter tuning and from adjusting the probability threshold used to assign the final class.
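To illustrate what a probability threshold adjustment means in practice, here is a minimal sketch using scikit-learn's `RandomForestClassifier` on synthetic data. The hyperparameter values and the 0.6 threshold are hypothetical and are not the tuned values from this project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic binary data standing in for the URL features.
X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=0
)

# Hypothetical hyperparameters; in practice these would come from tuning.
clf = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=0)
clf.fit(X_train, y_train)

# Column 1 of predict_proba is the estimated probability of class 1
# (legitimate, following the label convention of this README).
proba_legit = clf.predict_proba(X_test)[:, 1]

# Require more confidence before calling a URL legitimate.
THRESHOLD = 0.6  # hypothetical value
y_pred = (proba_legit >= THRESHOLD).astype(int)

default_pred = clf.predict(X_test)
print("legitimate at default cutoff:", int(default_pred.sum()))
print("legitimate at custom cutoff: ", int(y_pred.sum()))
```

Raising the threshold above 0.5 labels fewer URLs as legitimate, which trades some precision on the phishing class for higher phishing recall; lowering it does the opposite.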

Results

Training Accuracy: 0.9949
Testing Accuracy: 0.9938

Testing Classification Report

Class / Metric    Precision   Recall   F1-Score   Support
0 (phishing)      0.9916      0.9939   0.9928     30,151
1 (legitimate)    0.9955      0.9938   0.9946     40,588
Accuracy                               0.9938     70,739
Macro Avg         0.9935      0.9938   0.9937     70,739
Weighted Avg      0.9938      0.9938   0.9938     70,739
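The macro and weighted averages in the report can be re-derived from the per-class rows. The short sketch below does this in plain Python using the rounded values copied from the report, so the results should match the reported averages only to within rounding; the helper names are illustrative.

```python
# Per-class test metrics from the report: (precision, recall, f1, support).
per_class = {
    0: (0.9916, 0.9939, 0.9928, 30151),  # phishing
    1: (0.9955, 0.9938, 0.9946, 40588),  # legitimate
}

total_support = sum(row[3] for row in per_class.values())

def macro_avg(metric_index):
    """Unweighted mean of one metric across classes."""
    return sum(row[metric_index] for row in per_class.values()) / len(per_class)

def weighted_avg(metric_index):
    """Mean of one metric across classes, weighted by class support."""
    weighted_sum = sum(row[metric_index] * row[3] for row in per_class.values())
    return weighted_sum / total_support

for name, idx in (("precision", 0), ("recall", 1), ("f1", 2)):
    print(f"{name}: macro={macro_avg(idx):.4f} weighted={weighted_avg(idx):.4f}")
```

Because the two classes differ in size (30,151 versus 40,588 test samples), the weighted average leans toward class 1, while the macro average treats both classes equally.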

Enjoyed this repo? A ⭐ lets me know it was worth sharing!

About

Phishing and Legitimate URL detection using scikit-learn.
