Phishing Detection System

Machine Learning and Regex Matching based Phishing Detection System with a phishing attack scenario
Developed by Umut Sevdi, İsmet Güngör, Semih Yazıcı and Oğuzhan Ercan


Table of Contents
  1. Project Definition
  2. System Architecture
  3. Hardware Requirements
  4. Installation
  5. License
  6. Contact

1. Project Definition

Phishing is a cyber attack that uses carefully crafted emails or websites to trick individuals into revealing sensitive information such as login credentials or financial details. These attacks often take the form of fake login pages or emails purporting to come from legitimate organizations, and they can have severe consequences for both individuals and organizations.

[Figure: dashboard]

In this project, we developed a phishing scenario and a program to protect against it. For the scenario, we set up an SMTP server and an attacker-controlled phishing server. The phishing server tricks users into thinking that its website is legitimate.

When the victim clicks the link, the server returns a login page that imitates "edevlet.gov.tr". When the user logs in, all credentials are sent to the attacker, and the phishing site responds with a fake dashboard so that the attack goes unnoticed.

To defend against such attacks, we set out to build a phishing detection system based on machine learning and regex matching. Combining the two lets the system analyze and classify email content and identify patterns and keywords commonly used in phishing attacks. The goal is to flag suspicious emails quickly so that action can be taken to block them.

2. System Architecture

Attacker

  • On the attacker's side, we developed a web server in Go to host the phishing site. The site serves a web page that looks like edevlet.gov.tr. Unlike the original page, however, it does not encrypt the submitted data and sends it directly to the attacker.

[Figure: Phishing Server]

Victim

  • We use MailHog as the SMTP server. For testing purposes, it runs as a container defined in a docker-compose file.

  • To protect the victim against phishing attacks, we implemented a system that listens to the ongoing network traffic and parses SMTP to extract the mail body. The body is first scanned with Yara, using rules written specifically to detect phishing mails. After this check for possibly malicious keywords, the plain-text body is passed to a Python program, where a machine learning model determines whether the incoming mail is a phishing attempt or benign.
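The body-extraction step can be illustrated with a small Python sketch. It is not the project's Go sniffer: it assumes the client side of an SMTP session has already been reassembled into one byte string, and the function name `extract_mail_body` is hypothetical. It locates the message between the `DATA` command and the terminating dot line, undoes dot-stuffing (RFC 5321, section 4.5.2), and returns the plain-text part.

```python
import email
from email import policy


def extract_mail_body(smtp_payload: bytes) -> str:
    """Return the plain-text body carried in a captured SMTP client stream."""
    # The message starts after the DATA command line. Real SMTP commands are
    # case-insensitive; this sketch only handles the upper-case form.
    start_marker = b"\r\nDATA\r\n"
    start = smtp_payload.find(start_marker)
    if start == -1:
        return ""
    start += len(start_marker)

    # The message ends at a line that contains a single dot.
    end = smtp_payload.find(b"\r\n.\r\n", start)
    if end == -1:
        end = len(smtp_payload)
    raw = smtp_payload[start:end]

    # Undo dot-stuffing: a line sent as "..text" was originally ".text".
    raw = raw.replace(b"\r\n..", b"\r\n.")

    message = email.message_from_bytes(raw, policy=policy.default)
    part = message.get_body(preferencelist=("plain",))
    return part.get_content().strip() if part is not None else ""
```

The returned string is what a later stage would hand to the rule matcher and the classifier; a session without a `DATA` command yields an empty string.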

[Figure: Regex Based Detection]
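As a rough illustration of keyword and pattern matching on the mail body, the sketch below uses Python's `re` module as a stand-in for Yara. The rule names and patterns are hypothetical examples, not the project's actual Yara rules.

```python
import re

# Hypothetical rules for illustration only; the project's Yara rules may differ.
RULES = {
    "urgent_action": re.compile(
        r"\b(urgent|immediately|within 24 hours)\b", re.IGNORECASE
    ),
    "credential_request": re.compile(
        r"\b(verify|confirm|update)\s+your\s+(account|password|identity)\b",
        re.IGNORECASE,
    ),
    # Links that point at a bare IPv4 address instead of a domain name.
    "raw_ip_link": re.compile(r"https?://\d{1,3}(?:\.\d{1,3}){3}\b"),
}


def match_rules(body: str) -> list[str]:
    """Return the names of all rules whose pattern occurs in the mail body."""
    return [name for name, pattern in RULES.items() if pattern.search(body)]
```

A mail that triggers one or more rules could be flagged directly or passed on to the classifier together with the matched rule names; which policy the project uses is not stated here.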

  • For classification, we used a Long Short-Term Memory (LSTM) network, a type of recurrent neural network (RNN) well suited to modeling long-term dependencies in time series or other sequential data. It can retain information over long spans and handle variable-length input sequences. The model encodes the input text with an embedding layer followed by the LSTM layer, and it has a method for generating the initial hidden states of the LSTM. An attention layer placed between the LSTM and the linear classifier weights the LSTM outputs, and the classifier predicts from the weighted result. The attention weights also let us see which words contributed most to a phishing verdict.

  • The text that arrives over TCP and is converted to a string is not in a format that can be fed directly into the LSTM model, so we apply text preprocessing steps commonly used in natural language processing. The utils_preprocess_text function cleans the text by removing punctuation, lower-casing, removing stop words, and optionally applying stemming or lemmatization. The textCleaner function applies utils_preprocess_text to a column of a pandas DataFrame and stores the processed text in a new column.
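To make the role of the attention layer concrete, here is a minimal pure-Python sketch of attention pooling. It is not the project's model: the hidden states are made-up numbers standing in for LSTM outputs, the query vector would normally be learned, and dot-product scoring is an assumption, since the scoring function is not described above.

```python
import math


def attention_pool(hidden_states, query):
    """Return (weights, context) for a list of per-token hidden vectors."""
    # Score each token by the dot product of its hidden state and the query.
    scores = [sum(h_i * q_i for h_i, q_i in zip(h, query)) for h in hidden_states]

    # Softmax over the scores, subtracting the maximum for numerical stability.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Context vector: the attention-weighted sum of the hidden states.
    dim = len(hidden_states[0])
    context = [
        sum(w * h[d] for w, h in zip(weights, hidden_states)) for d in range(dim)
    ]
    return weights, context


def most_attended(tokens, weights, k=2):
    """Return the k tokens with the largest attention weights."""
    ranked = sorted(zip(tokens, weights), key=lambda pair: pair[1], reverse=True)
    return [token for token, _ in ranked[:k]]
```

In a full model the context vector would be fed to the linear classifier, while the weights give a per-word view of what influenced the prediction.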

[Figure: NLP Based Detection]
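The preprocessing steps can be sketched in plain Python as follows. This is a simplified stand-in, not the project's utils_preprocess_text or textCleaner: the names `preprocess_text_sketch`, `simple_stem`, and `clean_records` are hypothetical, the stop-word list is a tiny illustrative sample, the stemmer is a crude suffix stripper, and a list of dictionaries replaces the pandas DataFrame.

```python
import string

# Tiny illustrative stop-word list; a real pipeline would likely use a fuller one.
STOP_WORDS = {"a", "an", "the", "to", "and", "or", "of", "in", "is", "your", "you", "please"}


def simple_stem(word: str) -> str:
    """Very rough suffix stripping; a placeholder for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess_text_sketch(text: str, stem: bool = False) -> str:
    """Lower-case, strip punctuation, drop stop words, optionally stem."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = [t for t in text.split() if t not in STOP_WORDS]
    if stem:
        tokens = [simple_stem(t) for t in tokens]
    return " ".join(tokens)


def clean_records(records, column, new_column, stem=False):
    """Add a cleaned copy of `column` to each record (stand-in for a DataFrame column)."""
    return [
        {**record, new_column: preprocess_text_sketch(record[column], stem)}
        for record in records
    ]
```

With a real DataFrame, the same effect could be obtained by applying the cleaning function to one column and assigning the result to a new column.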

4. Installation

Steps (steps 2 to 5 each start from the repository root):

  1. Clone the repository.
    git clone https://github.com/umutsevdi/pds.git
    cd pds
  2. Run the mail server.
    cd victim
    docker-compose up
  3. Compile and run the phishing detection programs.
    cd victim
    cd mail-detect
    python mail_detect.py &
    cd ..
    go build smtp_phishing_detection
    sudo smtp_phishing_detection/smtp_phishing_detection &
  4. Run the attacker programs from an external device or locally.
    cd attacker/phishing_server/cmd
    go run . &
  5. You can now send phishing emails using our mail script.
    cd attacker/
    python mail_sender.py
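To check that the mail server and the detector are running before using the project's mail script, a harmless test message can be sent with Python's standard library. This is a minimal sketch and not the project's mail_sender.py; the function names are hypothetical, and the host and port are assumptions (1025 is MailHog's default SMTP port, but the docker-compose file may map a different one).

```python
import smtplib
from email.message import EmailMessage


def build_test_mail(sender: str, recipient: str, subject: str, body: str) -> EmailMessage:
    """Create a simple plain-text message."""
    message = EmailMessage()
    message["From"] = sender
    message["To"] = recipient
    message["Subject"] = subject
    message.set_content(body)
    return message


def send_mail(message: EmailMessage, host: str = "localhost", port: int = 1025) -> None:
    """Deliver the message to an SMTP server (assumed: a local MailHog instance)."""
    with smtplib.SMTP(host, port, timeout=5) as client:
        client.send_message(message)


# Example usage (requires the mail server to be running):
# send_mail(build_test_mail("a@example.test", "b@example.test", "Test", "Hello"))
```

If the message shows up in the mail server and the detector logs a verdict for it, the victim-side setup is working.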

5. License

Distributed under the MIT License. See LICENSE for more information.

6. Contact

Feel free to contact any of the project's developers with suggestions or questions.

Project: umutsevdi/pds
