# Phishing Detection System

In [1]:
%pip install pandas numpy scikit-learn nltk tensorflow flask

Collecting nltk
  Downloading nltk-3.9.1-py3-none-any.whl.metadata (2.9 kB)
Collecting tensorflow
  Downloading tensorflow-2.18.0-cp312-cp312-win_amd64.whl.metadata (3.3 kB)
Collecting flask
  Downloading flask-3.1.0-py3-none-any.whl.metadata (2.7 kB)
Collecting click (from nltk)
  Downloading click-8.1.8-py3-none-any.whl.metadata (2.3 kB)
Collecting regex>=2021.8.3 (from nltk)
  Downloading regex-2024.11.6-cp312-cp312-win_amd64.whl.metadata (41 kB)
Collecting tqdm (from nltk)
  Downloading tqdm-4.67.1-py3-none-any.whl.metadata (57 kB)
Collecting tensorflow-intel==2.18.0 (from tensorflow)
  Downloading tensor

In [3]:
# Load and process emails
import os
import email
import pandas as pd

# directory holding the extracted SpamAssassin "easy_ham" messages (one file per email)
DATA_DIR = "../data/easy_ham"

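# parse one email file into a dict of header fields, body text, and label
def parse_email(file_path, label="ham"):
    # latin-1 maps every byte to a character, so reading never raises a decode error
    with open(file_path, "r", encoding="latin-1") as f:
        msg = email.message_from_file(f)

    # extract useful features
    email_data = {
        "From": msg["From"],
        "Subject": msg["Subject"],
        "Body": "",
        "Label": label,  # defaults to "ham" because DATA_DIR points at easy_ham
    }

    # extract email body; get_payload(decode=True) can return None, so check before decoding
    if msg.is_multipart():
        for part in msg.walk():
            if part.get_content_type() == "text/plain":
                payload = part.get_payload(decode=True)
                if payload is not None:
                    email_data["Body"] += payload.decode("latin-1", errors="ignore")
    else:
        payload = msg.get_payload(decode=True)
        if payload is not None:
            email_data["Body"] = payload.decode("latin-1", errors="ignore")

    return email_data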

# process all email files
emails = []
for filename in os.listdir(DATA_DIR):
    file_path = os.path.join(DATA_DIR, filename)
    if os.path.isfile(file_path):
        email_content = parse_email(file_path)
        emails.append(email_content)

# convert to dataframe
df = pd.DataFrame(emails)

# save as csv file for training
df.to_csv("processed_emails.csv", index=False)

print("Successfully processed and saved emails to 'processed_emails.csv'!")

Successfully processed and saved emails to 'processed_emails.csv'!
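
The cell above labels every message as "ham" because it reads only the `easy_ham` directory. A classifier also needs examples of the other class, so the same parsing logic could be wrapped in a helper that takes the label as an argument. The sketch below is one possible way to do that, not the notebook's own method: `load_labeled_emails` is a hypothetical helper name, and the `../data/spam` path in the usage comment is an assumed location that does not appear in this notebook.

```python
import email
import os


def load_labeled_emails(directory, label):
    """Parse every regular file in `directory` and tag each record with `label`."""
    records = []
    for filename in sorted(os.listdir(directory)):
        file_path = os.path.join(directory, filename)
        if not os.path.isfile(file_path):
            continue  # skip subdirectories

        with open(file_path, "r", encoding="latin-1") as f:
            msg = email.message_from_file(f)

        # For multipart messages keep only the text/plain parts;
        # otherwise use the single payload as the body.
        if msg.is_multipart():
            parts = [p for p in msg.walk() if p.get_content_type() == "text/plain"]
        else:
            parts = [msg]

        body = ""
        for part in parts:
            payload = part.get_payload(decode=True)
            if payload is not None:  # decode=True may yield None
                body += payload.decode("latin-1", errors="ignore")

        records.append(
            {
                "From": msg["From"],
                "Subject": msg["Subject"],
                "Body": body,
                "Label": label,
            }
        )
    return records


# Hypothetical usage, assuming a spam directory sits next to easy_ham:
# emails = load_labeled_emails("../data/easy_ham", "ham")
# emails += load_labeled_emails("../data/spam", "spam")
```

The returned list of dicts can be passed straight to `pd.DataFrame(...)`, as in the cell above.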


In [8]:
# load the processed csv file from the path it was written to above
df = pd.read_csv("processed_emails.csv")
pd.set_option("display.max_colwidth", None)  # show full cell text instead of truncating
print(df.head())

                                        From  \
0             Robert Elz <kre@munnari.OZ.AU>   
1  Steve Burt <Steve_Burt@cursor-system.com>   
2              "Tim Chapman" <timc@2ubh.com>   
3           Monty Solomon <monty@roscom.com>   
4       Tony Nugent <tony@linuxworks.com.au>   

                                 Subject  \
0               Re: New Sequences Window   
1              [zzzzteana] RE: Alexander   
2              [zzzzteana] Moscow bomber   
3  [IRR] Klez: The Virus That  Won't Die   
4                   Re: Insert signature   

In [9]:
print(df["Body"][0])

    Date:        Wed, 21 Aug 2002 10:54:46 -0500
    From:        Chris Garrigues <cwg-dated-1030377287.06fa6d@DeepEddy.Com>
    Message-ID:  <1029945287.4797.TMDA@deepeddy.vircio.com>


  | I can't reproduce this error.

For me it is very repeatable... (like every time, without fail).

This is the debug log of the pick happening ...

18:19:03 Pick_It {exec pick +inbox -list -lbrace -lbrace -subject ftp -rbrace -rbrace} {4852-4852 -sequence mercury}
18:19:03 exec pick +inbox -list -lbrace -lbrace -subject ftp -rbrace -rbrace 4852-4852 -sequence mercury
18:19:04 Ftoc_PickMsgs {{1 hit}}
18:19:04 Marking 1 hits
18:19:04 tkerror: syntax error in expression "int ...

Note, if I run the pick command by hand ...

delta$ pick +inbox -list -lbrace -lbrace -subject ftp -rbrace -rbrace  4852-4852 -sequence mercury
1 hit

That's where the "1 hit" comes from (obviously).  The version of nmh I'm
using is ...

delta$ pick -version
pick -- nmh-1.0.4 [compiled on fuchsia.cs.mu.OZ.AU at Sun Mar 17 14:55

In [11]:
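# flatten each body onto one line, then restore the default (truncated) column display
df["Body"] = df["Body"].str.replace("\n", " ", regex=False).str.strip()
pd.reset_option("display.max_colwidth")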
print(df.head())

                                        From  \
0             Robert Elz <kre@munnari.OZ.AU>   
1  Steve Burt <Steve_Burt@cursor-system.com>   
2              "Tim Chapman" <timc@2ubh.com>   
3           Monty Solomon <monty@roscom.com>   
4       Tony Nugent <tony@linuxworks.com.au>   

                                 Subject  \
0               Re: New Sequences Window   
1              [zzzzteana] RE: Alexander   
2              [zzzzteana] Moscow bomber   
3  [IRR] Klez: The Virus That  Won't Die   
4                   Re: Insert signature   

                                                Body Label  
0  Date:        Wed, 21 Aug 2002 10:54:46 -0500  ...   ham  
1  Martin A posted: Tassos Papadopoulos, the Gree...   ham  
2  Man Threatens Explosion In Moscow   Thursday A...   ham  
3  Klez: The Virus That Won't Die   Already the m...   ham  
4  On Wed Aug 21 2002 at 15:46, Ulises Ponce wrot...   ham  
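
Replacing only `"\n"` leaves runs of spaces in place, as in `Date:        Wed` in row 0 above. An empty body also comes back from `read_csv` as NaN, not as an empty string. The sketch below shows one possible stricter cleanup on a small made-up frame (the `sample` data is illustrative, not taken from the dataset): it fills missing bodies and collapses any run of whitespace into a single space.

```python
import pandas as pd

# Illustrative data only: one body with newlines and padding, one missing body.
sample = pd.DataFrame(
    {
        "Body": [
            "Date:        Wed, 21 Aug 2002\n\n  | I can't reproduce this error.\n",
            None,
        ]
    }
)

sample["Body"] = (
    sample["Body"]
    .fillna("")                             # missing bodies become empty strings
    .str.replace(r"\s+", " ", regex=True)   # collapse spaces, tabs, and newlines
    .str.strip()
)

print(sample["Body"][0])
# Date: Wed, 21 Aug 2002 | I can't reproduce this error.
```

Whether to collapse whitespace this aggressively depends on the features used later; layout cues such as quoted-reply indentation are lost in the process.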
