<a href="https://colab.research.google.com/github/sally-20/URL-Phishing-Detection/blob/main/URL_Phishing_Detection_using_Random_Forest_Classifier.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Phishing Website Detection with Random Forest**

This Colab notebook builds a phishing website detection model with a Random Forest classifier. We preprocess the data, train the model, and use it to predict whether a given URL is phishing or legitimate.

## **Data**
The dataset used in this notebook contains URLs along with their corresponding labels (phishing or legitimate). It is loaded from a CSV file stored on Google Drive and preprocessed below before training.

## **Preprocessing**
We first preprocess each URL with a regular expression to extract its domain. The extracted domain is the only input feature for the model, and because it is categorical we one-hot encode it.

Let's take a look at the first few rows of the dataset:

In [1]:
import pandas as pd

In [2]:
# Load the dataset
data = pd.read_csv('/content/drive/MyDrive/phishing_site_urls.csv')
print(data.head())

                                                 URL Label
0  nobell.it/70ffb52d079109dca5664cce6f317373782/...   bad
1  www.dghjdgf.com/paypal.co.uk/cycgi-bin/webscrc...   bad
2  serviciosbys.com/paypal.cgi.bin.get-into.herf....   bad
3  mail.printakid.com/www.online.americanexpress....   bad
4  thewhiskeydregs.com/wp-content/themes/widescre...   bad


## **Training**
We will create and train a Random Forest classifier on the preprocessed dataset. The model is evaluated on the same data it was trained on, so the reported accuracy describes fit to the training data, not performance on unseen URLs.
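One common way to estimate performance on unseen URLs is to hold out part of the data before fitting. The sketch below shows the idea using only the standard library. `holdout_split` and its parameters are hypothetical names that do not appear in the notebook, and the URLs are toy values. In practice scikit-learn's `train_test_split` is the usual tool for this step.

```python
import random

def holdout_split(urls, labels, test_fraction=0.2, seed=0):
    """Shuffle paired (url, label) rows and split them into train and test parts."""
    if len(urls) != len(labels):
        raise ValueError("urls and labels must have the same length")
    rows = list(zip(urls, labels))
    random.Random(seed).shuffle(rows)  # seeded so the split is reproducible
    n_test = int(len(rows) * test_fraction)
    test_rows, train_rows = rows[:n_test], rows[n_test:]
    return train_rows, test_rows

# Toy data for illustration only (not taken from the dataset)
urls = [f"site{i}.example/path" for i in range(10)]
labels = ["bad" if i % 2 else "good" for i in range(10)]
train_rows, test_rows = holdout_split(urls, labels, test_fraction=0.2)
print(len(train_rows), len(test_rows))
```

The model would then be fitted on the training rows only and scored on the held-out rows.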

In [3]:
import re
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import OneHotEncoder
import warnings

In [4]:
# Suppress FutureWarning for OneHotEncoder sparse parameter
warnings.simplefilter(action='ignore', category=FutureWarning)

In [5]:
# Preprocess the input URL
def preprocess_url(url):
    # Extract the domain; the pattern requires an http:// or https:// prefix
    match = re.search(r'https?://([^/?]+)', url)
    if match:
        domain = match.group(1)
        return domain
    else:
        return None
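The rows printed by `data.head()` carry no `http://` or `https://` prefix, so the pattern above would not match them and `preprocess_url` would return `None` for such rows. A minimal sketch of a variant that treats the scheme as optional follows. `extract_domain` is a hypothetical name, the example URLs are made up, and choices such as lowercasing the domain or stripping a leading `www.` are left open.

```python
import re

# Optional http(s) scheme, then everything up to the first '/', '?' or '#'
_DOMAIN_RE = re.compile(r'^(?:https?://)?([^/?#]+)', re.IGNORECASE)

def extract_domain(url):
    """Return the host part of a URL, with or without a scheme, or None."""
    if not isinstance(url, str):
        return None
    match = _DOMAIN_RE.match(url.strip())
    return match.group(1) if match else None

# Illustrative inputs only
print(extract_domain("nobell.it/some/path"))
print(extract_domain("https://colab.research.google.com/drive/x#y"))
```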

In [6]:
# Extract the 'Label' column as the output (y) variable
y = data['Label']
# Extract the 'URL' column as the input (X) variable
X = data['URL']

In [7]:
# Initialize a new list to store the preprocessed URLs
preprocessed_X = []

In [8]:
# Preprocess the dataset
for url in X:
    if isinstance(url, str):
        preprocessed_url = preprocess_url(url)
        preprocessed_X.append(preprocessed_url)
    else:
        preprocessed_X.append(None)

In [9]:
# Convert the list to a NumPy array for further processing
preprocessed_X = np.array(preprocessed_X).reshape(-1, 1)

In [10]:
# Perform one-hot encoding for categorical input
# Note: scikit-learn 1.2 and later use `sparse_output` in place of `sparse`
encoder = OneHotEncoder(sparse=False, handle_unknown='ignore')
X_encoded = encoder.fit_transform(preprocessed_X)
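To make the effect of `handle_unknown='ignore'` concrete, here is a small pure-Python sketch of one-hot encoding. It illustrates the idea and is not scikit-learn's implementation. `fit_categories` and `one_hot` are hypothetical helpers, and the domains are toy values.

```python
def fit_categories(values):
    """Map each distinct value to a column index, in sorted order."""
    return {value: index for index, value in enumerate(sorted(set(values)))}

def one_hot(value, categories):
    """Return a one-hot row; values not seen during fitting yield all zeros."""
    row = [0] * len(categories)
    index = categories.get(value)
    if index is not None:
        row[index] = 1
    return row

# Toy domains for illustration only
categories = fit_categories(["a.example", "b.example", "a.example"])
print(one_hot("b.example", categories))    # [0, 1]
print(one_hot("new.example", categories))  # [0, 0]
```

A domain that was not present at fit time maps to an all-zero row, so the classifier receives no information about it at prediction time.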

In [11]:
# Create and train the Random Forest classifier
model = RandomForestClassifier()
model.fit(X_encoded, y)

In [12]:
# Calculate accuracy on the training set
predictions = model.predict(X_encoded)
accuracy = accuracy_score(y, predictions)
print("Model Accuracy:", accuracy)

Model Accuracy: 0.717149119134389
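An accuracy figure is easier to interpret next to a trivial baseline. The sketch below computes the accuracy of always predicting the most frequent label. `majority_baseline_accuracy` is a hypothetical helper, and the labels are toy values that do not reflect the dataset's actual class balance.

```python
from collections import Counter

def majority_baseline_accuracy(labels):
    """Return the most frequent label and the accuracy of always predicting it."""
    labels = list(labels)
    if not labels:
        raise ValueError("labels must not be empty")
    most_common_label, count = Counter(labels).most_common(1)[0]
    return most_common_label, count / len(labels)

# Toy labels for illustration only
toy_labels = ["good", "good", "good", "bad"]
label, baseline = majority_baseline_accuracy(toy_labels)
print(label, baseline)  # good 0.75
```

With the notebook's variables this could be called as `majority_baseline_accuracy(y)`. If the model's accuracy is close to the baseline on the real labels, the features are contributing little.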


## **Prediction**
Once the model is trained, we can use it to predict whether a given URL is a phishing website or not.

In [13]:
# Function to get user input for prediction
def get_user_input():
    user_input = input("Enter the URL to check if it's phishing or legitimate: ")
    return user_input

In [14]:
# Get user input
website_input = get_user_input()

Enter the URL to check if it's phishing or legitimate: https://colab.research.google.com/drive/1VKi31_GtTF0bFiJTJbV1yrRPR3wQCkmo#scrollTo=KuvDZjCeuGgO


In [15]:
# Preprocess the user input
preprocessed_website = preprocess_url(website_input)
preprocessed_website_encoded = encoder.transform([[preprocessed_website]])
prediction = model.predict(preprocessed_website_encoded)

In [16]:
# Print the prediction result (the dataset labels phishing URLs as 'bad')
if prediction[0] == 'bad':
    print("The website is phishing.")
else:
    print("The website is legitimate.")

The website is legitimate.
