**[Phishing Detection - Part 2]**

**NLP Assignment 3**

**Fine-Tuned Classification Dataset:** Phishing Emails (Kaggle)

**Task:** Fine-tuning BART for Binary Classification (Phishing vs. Legitimate)

**Date:** November 30, 2025

**Description:**

This notebook covers Part 2 of the assignment. We fully fine-tune a pre-trained BART model (the `facebook/bart-large-mnli` checkpoint) on a labeled phishing-email dataset. We use the Hugging Face Trainer API to adapt the model for binary sequence classification and evaluate it with accuracy, precision, recall, and F1-score.

In [None]:
!pip install -q transformers datasets evaluate accelerate scikit-learn

import pandas as pd
import numpy as np
import torch
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score
from transformers import (
    BartTokenizer,
    BartForSequenceClassification,
    Trainer,
    TrainingArguments,
    DataCollatorWithPadding
)
from datasets import Dataset

# Set device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

Using device: cuda


In [None]:
df = pd.read_csv("/content/Phishing_Email.csv")

In [None]:
df.head(10)

,Unnamed: 0,Email Text,Email Type
0,0,"re : 6 . 1100 , disc : uniformitarianism , re ...",Safe Email
1,1,the other side of * galicismos * * galicismo *...,Safe Email
2,2,re : equistar deal tickets are you still avail...,Safe Email
3,3,\nHello I am your hot lil horny toy.\n I am...,Phishing Email
4,4,software at incredibly low prices ( 86 % lower...,Phishing Email
5,5,global risk management operations sally congra...,Safe Email
6,6,"On Sun, Aug 11, 2002 at 11:17:47AM +0100, wint...",Safe Email
7,7,"entourage , stockmogul newsletter ralph velez ...",Phishing Email
8,8,"we owe you lots of money dear applicant , afte...",Phishing Email
9,9,re : coastal deal - with exxon participation u...,Safe Email


In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18650 entries, 0 to 18649
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Unnamed: 0  18650 non-null  int64 
 1   Email Text  18634 non-null  object
 2   Email Type  18650 non-null  object
dtypes: int64(1), object(2)
memory usage: 437.2+ KB


In [None]:
# Basic cleaning: drop rows with missing text or label, then the leftover CSV index column
df = df.dropna(subset=['Email Text', 'Email Type'])
df = df.drop(columns=['Unnamed: 0'], errors='ignore')

In [None]:
label_map = {'Safe Email': 0, 'Phishing Email': 1}
df['label'] = df['Email Type'].map(label_map)

In [None]:
# Filter out any rows where mapping failed (if dataset has dirty labels)
df = df.dropna(subset=['label'])
df['label'] = df['label'].astype(int)
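
The mapping and filtering steps above can be illustrated without pandas. The sketch below is a plain-Python stand-in, not the notebook's code: `dict.get` returns `None` for an unknown key, much as `Series.map` yields `NaN`. The `raw_labels` list is hypothetical and includes two labels that do not match the map exactly.

```python
# Plain-Python stand-in for Series.map(label_map) followed by dropna.
label_map = {"Safe Email": 0, "Phishing Email": 1}

# Hypothetical raw labels; the third and fourth differ in case or whitespace.
raw_labels = ["Safe Email", "Phishing Email", "safe email", "Phishing Email ", "Safe Email"]

# Unknown keys map to None, the analogue of NaN in the pandas version.
mapped = [label_map.get(label) for label in raw_labels]

# Keep only the entries whose label was recognised (the dropna step).
kept = [value for value in mapped if value is not None]
dropped = len(mapped) - len(kept)

print(mapped)
print(f"kept={len(kept)} dropped={dropped}")
```

In this run the filter removes nothing: `df.info()` reports 18634 non-null texts and the cleaned frame still has 18634 rows, so every label matched the map.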

In [None]:
# 1. Keep only the text and label columns
df = df.drop(columns=['Unnamed: 0', 'Email Type'], errors='ignore')

# 2. Verify the final structure
print("Final columns for training:", df.columns)
# Output should be: Index(['Email Text', 'label'], dtype='object')

# 3. Proceed to Split
train_df, test_df = train_test_split(
    df,
    test_size=0.2,
    stratify=df['label'],
    random_state=42
)

Final columns for training: Index(['Email Text', 'label'], dtype='object')


In [None]:
df.shape

(18634, 2)

In [None]:
df.columns

Index(['Email Text', 'label'], dtype='object')

In [None]:
print(f"Training Samples: {len(train_df)}")
print(f"Testing Samples: {len(test_df)}")

Training Samples: 14907
Testing Samples: 3727
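
The row counts printed so far can be checked by hand. The sketch below redoes the arithmetic with the figures from `df.info()` and the split call. It assumes the test count is rounded up and the remainder goes to training, which matches the sizes printed above.

```python
import math

total_rows = 18650       # RangeIndex size reported by df.info()
non_null_text = 18634    # non-null "Email Text" count reported by df.info()
test_fraction = 0.2      # test_size passed to train_test_split

# Rows removed by dropna because the email text was missing.
dropped_rows = total_rows - non_null_text

# Assumption: test count is rounded up, training gets the rest.
n_test = math.ceil(test_fraction * non_null_text)
n_train = non_null_text - n_test

print(f"dropped={dropped_rows} train={n_train} test={n_test}")
```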


In [None]:
# 1. Initialize Tokenizer for the Large MNLI model
model_checkpoint = "facebook/bart-large-mnli"
tokenizer = BartTokenizer.from_pretrained(model_checkpoint)




In [None]:
# 2. Convert Pandas DataFrames to Hugging Face Datasets
train_dataset = Dataset.from_pandas(train_df)
test_dataset = Dataset.from_pandas(test_df)

In [None]:
# 3. Tokenization Function
def tokenize_function(examples):
    return tokenizer(
        examples["Email Text"],
        padding="max_length",
        truncation=True,
        max_length=128,  # Keep at 128 to save memory with the large model
    )
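
To make the effect of `padding="max_length"` and `truncation=True` concrete, here is a minimal sketch of fixed-length padding and truncation on a list of token ids. It is not the tokenizer's implementation: the helper `pad_or_truncate` is hypothetical, it ignores how the real tokenizer handles special tokens when truncating, and the pad id of 1 is assumed from the `padding_idx=1` shown in the model's embedding layer.

```python
PAD_ID = 1  # assumption: BART's padding index


def pad_or_truncate(token_ids, max_length=128, pad_id=PAD_ID):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    ids = list(token_ids[:max_length])        # truncation: drop everything past the limit
    mask = [1] * len(ids)                     # 1 marks a real token
    n_pad = max_length - len(ids)
    return ids + [pad_id] * n_pad, mask + [0] * n_pad  # 0 marks padding


# A short sequence is padded; a long one is cut to 128 tokens.
short_ids, short_mask = pad_or_truncate([0, 100, 200, 2])
long_ids, long_mask = pad_or_truncate(range(300))

print(len(short_ids), sum(short_mask))
print(len(long_ids), sum(long_mask))
```

Any email longer than the limit loses its tail, so the classifier only sees the opening 128 tokens.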

In [None]:
# 4. Apply Tokenization
tokenized_train = train_dataset.map(tokenize_function, batched=True)
tokenized_test = test_dataset.map(tokenize_function, batched=True)


In [None]:
# 5. Format for PyTorch
tokenized_train.set_format(type='torch', columns=['input_ids', 'attention_mask', 'label'])
tokenized_test.set_format(type='torch', columns=['input_ids', 'attention_mask', 'label'])

In [None]:
from transformers import BartForSequenceClassification
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score


# 1. Metric function passed to the Trainer
def compute_metrics(p):
    predictions, labels = p

    # BART can return a tuple of outputs; the logits are the first element
    if isinstance(predictions, tuple):
        predictions = predictions[0]

    # Pick the class with the highest logit for each example
    pred = np.argmax(predictions, axis=1)

    accuracy = accuracy_score(y_true=labels, y_pred=pred)
    recall = recall_score(y_true=labels, y_pred=pred)
    precision = precision_score(y_true=labels, y_pred=pred)
    f1 = f1_score(y_true=labels, y_pred=pred)

    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# 2. Initialize Model
# We use num_labels=2 for Binary Classification
model = BartForSequenceClassification.from_pretrained(
    "facebook/bart-large-mnli",
    num_labels=2,
    ignore_mismatched_sizes=True
)

# Move model to GPU
model.to(device)

Some weights of BartForSequenceClassification were not initialized from the model checkpoint at facebook/bart-large-mnli and are newly initialized because the shapes did not match:
- classification_head.out_proj.bias: found shape torch.Size([3]) in the checkpoint and torch.Size([2]) in the model instantiated
- classification_head.out_proj.weight: found shape torch.Size([3, 1024]) in the checkpoint and torch.Size([2, 1024]) in the model instantiated
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


BartForSequenceClassification(
  (model): BartModel(
    (shared): BartScaledWordEmbedding(50265, 1024, padding_idx=1)
    (encoder): BartEncoder(
      (embed_tokens): BartScaledWordEmbedding(50265, 1024, padding_idx=1)
      (embed_positions): BartLearnedPositionalEmbedding(1026, 1024)
      (layers): ModuleList(
        (0-11): 12 x BartEncoderLayer(
          (self_attn): BartAttention(
            (k_proj): Linear(in_features=1024, out_features=1024, bias=True)
            (v_proj): Linear(in_features=1024, out_features=1024, bias=True)
            (q_proj): Linear(in_features=1024, out_features=1024, bias=True)
            (out_proj): Linear(in_features=1024, out_features=1024, bias=True)
          )
          (self_attn_layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
          (activation_fn): GELUActivation()
          (fc1): Linear(in_features=1024, out_features=4096, bias=True)
          (fc2): Linear(in_features=4096, out_features=1024, bias=True)
       

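The metric function can be traced step by step in plain Python. The sketch below mirrors its logic (tuple unwrapping, argmax, then the four scores) without NumPy or scikit-learn. The helper `binary_metrics` and the six logit rows are hypothetical, and class 1 is treated as the positive (phishing) class, as in `label_map`.

```python
def argmax(row):
    """Index of the largest value in a list of logits."""
    return max(range(len(row)), key=row.__getitem__)


def binary_metrics(predictions, labels):
    # Some models return a tuple of outputs; the logits come first.
    if isinstance(predictions, tuple):
        predictions = predictions[0]
    preds = [argmax(row) for row in predictions]

    pairs = list(zip(preds, labels))
    tp = sum(p == 1 and y == 1 for p, y in pairs)
    fp = sum(p == 1 and y == 0 for p, y in pairs)
    fn = sum(p == 0 and y == 1 for p, y in pairs)
    tn = sum(p == 0 and y == 0 for p, y in pairs)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": (tp + tn) / len(labels), "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical logits for six emails, wrapped in a tuple as BART may do.
logits = [[2.0, -1.0], [0.1, 0.3], [-0.5, 1.5], [1.2, 0.4], [-1.0, 0.2], [0.9, 1.0]]
labels = [0, 0, 1, 1, 1, 1]
scores = binary_metrics((logits, "other outputs"), labels)
print(scores)
```

For phishing detection, recall counts the share of phishing emails that are caught, while precision counts the share of flagged emails that really are phishing.
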
In [None]:
from transformers import TrainingArguments, Trainer, DataCollatorWithPadding

# 1. Define Training Arguments
training_args = TrainingArguments(
    output_dir="./bart-large-phishing-finetuned",
    learning_rate=2e-5,

    # Memory Optimization
    per_device_train_batch_size=4,
    gradient_accumulation_steps=2,
    per_device_eval_batch_size=4,

    num_train_epochs=3,
    weight_decay=0.01,

    # `eval_strategy` replaces the older `evaluation_strategy` argument name
    eval_strategy="epoch",
    save_strategy="epoch",

    load_best_model_at_end=True,
    fp16=True,
    logging_dir='./logs',
    logging_steps=50,
    report_to="none"
)

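# 2. Initialize Trainer
# Note: the test split also serves as the per-epoch evaluation set, so
# load_best_model_at_end=True selects the checkpoint using test data.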
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_test,
    tokenizer=tokenizer,
    data_collator=DataCollatorWithPadding(tokenizer=tokenizer),
    compute_metrics=compute_metrics,
)



In [None]:
trainer.train()

Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1
1,0.1245,0.094246,0.97612,0.954937,0.985636,0.970044
2,0.0443,0.089195,0.97612,0.958584,0.981532,0.969922
3,0.0201,0.077047,0.978803,0.957644,0.98974,0.973428


There were missing keys in the checkpoint model loaded: ['model.encoder.embed_tokens.weight', 'model.decoder.embed_tokens.weight'].


TrainOutput(global_step=5592, training_loss=0.07415242521931685, metrics={'train_runtime': 2692.7124, 'train_samples_per_second': 16.608, 'train_steps_per_second': 2.077, 'total_flos': 1.218962948564736e+16, 'train_loss': 0.07415242521931685, 'epoch': 3.0})
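
The step count and throughput in `TrainOutput` follow from the batch settings. The sketch below redoes that arithmetic. It assumes a single GPU and that a trailing partial accumulation group still counts as one optimizer step, which is what reproduces the reported `global_step`.

```python
import math

train_samples = 14907      # training split size
per_device_batch = 4       # per_device_train_batch_size
grad_accum = 2             # gradient_accumulation_steps
epochs = 3                 # num_train_epochs
train_runtime = 2692.7124  # seconds, from TrainOutput

# Assumption: one device, so the effective batch is batch size times accumulation.
effective_batch = per_device_batch * grad_accum

batches_per_epoch = math.ceil(train_samples / per_device_batch)
# Assumption: a leftover partial accumulation group still triggers an update.
updates_per_epoch = math.ceil(batches_per_epoch / grad_accum)
total_updates = updates_per_epoch * epochs

steps_per_second = total_updates / train_runtime
samples_per_second = train_samples * epochs / train_runtime

print(effective_batch, batches_per_epoch, updates_per_epoch, total_updates)
print(round(steps_per_second, 3), round(samples_per_second, 3))
```

The totals agree with the logged `global_step=5592`, `train_steps_per_second` of 2.077, and `train_samples_per_second` of 16.608.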

In [62]:
# 1. Get the data loader directly from the trainer
train_dataloader = trainer.get_train_dataloader()

# 2. Grab a single batch of data
batch = next(iter(train_dataloader))

# 3. Decode the first sample in the batch back to text
input_ids = batch['input_ids'][0]
label_id = batch['labels'][0].item()

decoded_text = tokenizer.decode(input_ids, skip_special_tokens=True)

print("Data Verification")
print(f"Label: {label_id} ({'Phishing' if label_id == 1 else 'Safe'})")
print(f"Text Input to Model:\n{decoded_text}")

Data Verification
Label: 0 (Safe)
Text Input to Model:
WORLD WIDE WORDS          ISSUE 301          Saturday 3 August 2002
-------------------------------------------------------------------
Sent each Saturday to 15,000+ subscribers in at least 119 countries
Editor: Michael Quinion, Thornbury, Bristol, UK      ISSN 1470-1448
 
-------------------------------------------------------------------
 IF YOU RESPOND TO THIS MAILING, REMEMBER TO CHANGE THE OUTGOING
   ADDRESS TO ONE OF THOSE IN THE 'CONTACT ADDRESSES' SECTION.
Contents
----------------------------------------------------------------


In [63]:
# Evaluate on the test set
results = trainer.evaluate()

print("\n" + "="*30)
print("FINAL RESULTS (Fine-Tuned BART)")
print("="*30)
print(f"Accuracy : {results['eval_accuracy']:.4f}")
print(f"Precision: {results['eval_precision']:.4f}")
print(f"Recall   : {results['eval_recall']:.4f}")
print(f"F1-Score : {results['eval_f1']:.4f}")
print("="*30)


FINAL RESULTS (Fine-Tuned BART)
Accuracy : 0.9788
Precision: 0.9576
Recall   : 0.9897
F1-Score : 0.9734
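
The four scores are summaries of a confusion matrix that the notebook does not print. As a hedged sketch, the search below looks for integer counts on the 3727 test emails that reproduce the epoch-3 accuracy, precision, and recall to six decimals. The result is a reconstruction from rounded figures, not a value reported by the run.

```python
n_test = 3727
accuracy, precision, recall = 0.978803, 0.957644, 0.989740  # epoch-3 row of the training log
TOL = 5e-7  # half a unit in the sixth decimal place

# Accuracy fixes the total number of misclassified emails.
errors = round((1 - accuracy) * n_test)

solutions = []
for fn in range(errors + 1):          # phishing emails labelled safe
    fp = errors - fn                  # safe emails labelled phishing
    for tp in range(1, n_test - errors + 1):
        tn = n_test - tp - fp - fn
        if (abs(tp / (tp + fp) - precision) < TOL
                and abs(tp / (tp + fn) - recall) < TOL):
            solutions.append((tp, fp, fn, tn))

print(errors, solutions)  # 79 [(1447, 64, 15, 2201)]

# F1 recomputed from the reconstructed counts.
tp, fp, fn, tn = solutions[0]
f1 = 2 * tp / (2 * tp + fp + fn)
print(round(f1, 6))
```

Under these assumptions, most of the 79 errors are safe emails flagged as phishing (64) rather than missed phishing emails (15), which is consistent with recall being higher than precision.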
