# 🛡️ PhishGuardian: DistilBERT for Email Security 🔥  
### *Advanced Phishing Email Detection using AI*  

🔍 **About this Notebook:**  
This notebook fine-tunes **DistilBERT** to classify emails as **Phishing** or **Safe** from their text content.  
As phishing attacks grow more common, a text classifier like this one offers an additional layer of email security.  

📌 **Key Features:**  
- ✅ **Preprocessing of email text** (cleaning, tokenization, and encoding).  
- ✅ **Fine-tuning DistilBERT** for binary classification.  
- ✅ **Evaluation and visualization of model performance**.  
- ✅ **Deployment options** (FastAPI for real-time inference).  

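📊 **Dataset:** `Phishing_Email.csv` contains emails labeled as phishing or safe in the `Email Type` column, with the message body in `Email Text`. The notebook trains on 80% of the rows and evaluates on the remaining 20%.  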

🚀 **Let’s start building our AI-powered email security model!**  


# Import necessary libraries

In [None]:
import numpy as np
import pandas as pd
import torch
from transformers import DistilBertTokenizer, DistilBertForSequenceClassification, Trainer, TrainingArguments
from datasets import Dataset
from sklearn.model_selection import train_test_split

device = 'cuda' if torch.cuda.is_available() else 'cpu'
print("Using device:", device)

# Load the dataset

In [None]:
df = pd.read_csv("/kaggle/input/phishingemails/Phishing_Email.csv")

# Remove missing values

In [None]:
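# Drop only rows that lack the text or the label; NaNs in other columns do not matter here
df = df.dropna(subset=["Email Text", "Email Type"])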

# Convert labels to numerical format (1 - Phishing, 0 - Safe)


In [None]:
# Vectorized comparison: every value other than "Phishing Email" becomes 0
df["label"] = (df["Email Type"] == "Phishing Email").astype(int)
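
The conversion above sends every value that is not `"Phishing Email"` to 0, so a misspelled or unexpected label would silently count as safe. A minimal sketch of a stricter encoding follows. It assumes the safe class is spelled `"Safe Email"` (the notebook only confirms the phishing string), and `encode_labels` is a hypothetical helper, not part of the original code.

```python
import pandas as pd

# "Phishing Email" comes from the notebook; "Safe Email" is an assumed spelling.
LABEL_MAP = {"Phishing Email": 1, "Safe Email": 0}

def encode_labels(frame, column="Email Type"):
    """Map label strings to integers and fail loudly on unknown values."""
    encoded = frame[column].map(LABEL_MAP)
    unknown = frame.loc[encoded.isna(), column].unique()
    if len(unknown) > 0:
        raise ValueError(f"Unexpected label values: {list(unknown)}")
    return encoded.astype(int)

# Toy frame for illustration only; the real notebook would pass `df`.
toy = pd.DataFrame({"Email Type": ["Phishing Email", "Safe Email", "Safe Email"]})
toy["label"] = encode_labels(toy)
print(toy["label"].value_counts().to_dict())
```

Printing the class counts right after encoding also shows how balanced the two classes are, which matters when reading accuracy later.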

# Split into training and testing sets

In [None]:
train_texts, test_texts, train_labels, test_labels = train_test_split(
    df["Email Text"].tolist(), df["label"].tolist(), test_size=0.2, random_state=42
)
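
The split above is random, so the phishing ratio can differ between the training and test sets. If that matters, `train_test_split` accepts a `stratify` argument that preserves the class ratio in both parts. The sketch below uses made-up toy data (20 emails, 5 of them phishing) purely to show the effect; it is not the notebook's dataset.

```python
from sklearn.model_selection import train_test_split

# Toy data for illustration only: 5 phishing (1) and 15 safe (0) emails.
toy_texts = [f"email {i}" for i in range(20)]
toy_labels = [1] * 5 + [0] * 15

tr_texts, te_texts, tr_labels, te_labels = train_test_split(
    toy_texts,
    toy_labels,
    test_size=0.2,
    random_state=42,
    stratify=toy_labels,  # keep the 25% phishing ratio in both splits
)

print("train phishing ratio:", sum(tr_labels) / len(tr_labels))
print("test phishing ratio:", sum(te_labels) / len(te_labels))
```

In the notebook itself, the equivalent change would be adding `stratify=df["label"]` to the existing call.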

# Load the DistilBERT tokenizer

In [None]:
tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")

# Tokenization function

In [None]:
def tokenize_function(texts):
    return tokenizer(texts, padding="max_length", truncation=True, max_length=512)

# Tokenize training and testing data

In [None]:
train_encodings = tokenize_function(train_texts)
test_encodings = tokenize_function(test_texts)

# Convert to Hugging Face Dataset format

In [None]:
train_dataset = Dataset.from_dict({
    "input_ids": train_encodings["input_ids"],
    "attention_mask": train_encodings["attention_mask"],
    "labels": train_labels
})

test_dataset = Dataset.from_dict({
    "input_ids": test_encodings["input_ids"],
    "attention_mask": test_encodings["attention_mask"],
    "labels": test_labels
})

# Load the DistilBERT model for classification

In [None]:
model = DistilBertForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2).to(device)

# Define training arguments

In [None]:
training_args = TrainingArguments(
    output_dir="./results",
    eval_strategy="epoch",  # named `evaluation_strategy` in transformers releases before 4.41
    save_strategy="epoch",
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=3,
    logging_dir="./logs",
    logging_steps=500,
    save_total_limit=2,
    report_to="none"
)
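
To get a feel for what these arguments imply, the number of optimizer steps and logging events can be estimated ahead of time. This is a rough sketch assuming a single device and no gradient accumulation; `estimate_training_steps` is a hypothetical helper and the 10,000-example training set is an invented figure, not the real dataset size.

```python
import math

def estimate_training_steps(num_examples, batch_size=8, epochs=3, logging_steps=500):
    """Estimate step counts for a single-device run without gradient accumulation."""
    # The last partial batch is kept by default, hence the ceiling.
    steps_per_epoch = math.ceil(num_examples / batch_size)
    total_steps = steps_per_epoch * epochs
    return {
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        # One periodic log entry every `logging_steps` optimizer steps.
        "log_events": total_steps // logging_steps,
    }

# Hypothetical training-set size, for illustration only.
plan = estimate_training_steps(num_examples=10_000)
print(plan)
```

If the estimate shows only a handful of log events, lowering `logging_steps` gives a finer view of the training loss.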

# Define the Trainer

In [None]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset
)

# Train the model

In [None]:
trainer.train()

# Evaluate the model

In [None]:
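# No compute_metrics function was passed to the Trainer, so this reports eval loss and runtime statistics only
trainer.evaluate()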
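
For a phishing classifier, loss alone says little about how many phishing emails slip through. One way to get accuracy, precision, recall, and F1 is to give the `Trainer` a `compute_metrics` callback. The function below is a sketch written with NumPy only and treats label 1 (Phishing) as the positive class; the toy logits and labels are invented to show the call.

```python
import numpy as np

def compute_metrics(eval_pred):
    """Accuracy, precision, recall, and F1 with Phishing (label 1) as the positive class."""
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    labels = np.asarray(labels)

    tp = int(np.sum((preds == 1) & (labels == 1)))
    fp = int(np.sum((preds == 1) & (labels == 0)))
    fn = int(np.sum((preds == 0) & (labels == 1)))

    # Guard against division by zero when a class is never predicted or absent.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0

    return {
        "accuracy": float(np.mean(preds == labels)),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Toy inputs for illustration only: four emails, two logits each.
toy_logits = np.array([[2.0, 0.1], [0.2, 1.5], [0.3, 0.9], [1.1, 0.4]])
toy_labels = np.array([0, 1, 0, 1])
toy_metrics = compute_metrics((toy_logits, toy_labels))
print(toy_metrics)
```

To use it in the notebook, pass `compute_metrics=compute_metrics` when constructing the `Trainer`; the returned keys then appear in the evaluation output with an `eval_` prefix.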

# Save the trained model

In [None]:
model.save_pretrained("./phishing_model")
tokenizer.save_pretrained("./phishing_model")
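
Once saved, the model and tokenizer can be reloaded from `./phishing_model` to score a single email. The sketch below is one possible way to do that, not the notebook's own inference code: `softmax`, `label_from_logits`, and `classify_email` are hypothetical helpers, and the label names assume the 1 = Phishing, 0 = Safe mapping used during training.

```python
import math

# Assumed to match the training-time mapping (1 = Phishing, 0 = Safe).
ID2LABEL = {0: "Safe", 1: "Phishing"}

def softmax(logits):
    """Convert raw logits to probabilities, shifting by the max for numerical stability."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def label_from_logits(logits):
    """Return the predicted label name and its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return ID2LABEL[best], probs[best]

def classify_email(text, model_dir="./phishing_model"):
    """Load the saved model and classify one email (runs on CPU)."""
    # Imported lazily so the helpers above work without torch installed.
    import torch
    from transformers import DistilBertForSequenceClassification, DistilBertTokenizer

    tokenizer = DistilBertTokenizer.from_pretrained(model_dir)
    model = DistilBertForSequenceClassification.from_pretrained(model_dir)
    model.eval()

    inputs = tokenizer(text, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0].tolist()
    return label_from_logits(logits)

# Pure post-processing example with made-up logits; no model files needed.
label, prob = label_from_logits([0.0, 2.0])
print(label, round(prob, 3))
```

Calling `classify_email("Your account is locked, click here to verify")` would return a label and a confidence once the model directory exists. For repeated calls, loading the model once and reusing it would be faster than reloading on every email.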