# Lab 01: Advanced Phishing Email Classifier

Build a comprehensive machine learning classifier to detect diverse phishing attacks.

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/depalmar/ai_for_the_win/blob/main/notebooks/lab01_phishing_classifier.ipynb)

## Learning Objectives
- Multi-class phishing classification (BEC, spear-phishing, credential theft, malware delivery)
- Advanced text preprocessing and feature extraction
- TF-IDF and word embeddings for email analysis
- Header analysis (SPF, DKIM, DMARC)
- URL and attachment risk scoring
- Model evaluation with security-focused metrics
- Adversarial evasion awareness

## Phishing Attack Taxonomy

Modern phishing attacks vary significantly:
1. **Credential Phishing** - Fake login pages, account verification
2. **Business Email Compromise (BEC)** - CEO/CFO impersonation, invoice fraud
3. **Spear Phishing** - Targeted attacks with personalized content
4. **Whaling** - Targeting executives
5. **Vendor Email Compromise (VEC)** - Supply chain fraud
6. **Malware Delivery** - Weaponized attachments, drive-by downloads
7. **Callback Phishing** - Phone-based social engineering

In [None]:
# Install dependencies (uncomment for Colab)
# !pip install scikit-learn pandas numpy matplotlib seaborn plotly

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, roc_curve, auc

# Plotly for interactive visualizations (works great in Colab)
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

# Set matplotlib style (fallback)
plt.style.use("seaborn-v0_8-whitegrid")
sns.set_palette("husl")

# Plotly template optimized for Colab (works in light/dark mode)
PLOTLY_TEMPLATE = "plotly_white"

## 1. Load and Explore Data

In [None]:
# Comprehensive phishing email dataset with diverse attack types
import random
from typing import List, Dict, Tuple


class PhishingDataGenerator:
    """Generate diverse phishing email samples for training."""

    # Legitimate email templates
    LEGITIMATE_TEMPLATES = [
        # Internal communications
        {
            "subject": "Team meeting tomorrow at 3pm",
            "body": "Hi team, reminder about our weekly sync meeting tomorrow in Conference Room A. Please bring your project updates.",
            "type": "internal",
        },
        {
            "subject": "Q3 report attached",
            "body": "Please find attached the quarterly report for review. Let me know if you have any questions.",
            "type": "internal",
        },
        {
            "subject": "Lunch plans?",
            "body": "Hey! Want to grab lunch today? I was thinking about trying that new place downtown.",
            "type": "internal",
        },
        {
            "subject": "Project deadline extension",
            "body": "Good news - the client agreed to extend the deadline by two weeks. We now have until the 15th.",
            "type": "internal",
        },
        {
            "subject": "Welcome to the team!",
            "body": "Hi Sarah, welcome to the company! Please reach out if you need any help getting started.",
            "type": "internal",
        },
        {
            "subject": "Meeting notes from yesterday",
            "body": "Hi all, please find the meeting notes attached. Action items are highlighted in yellow.",
            "type": "internal",
        },
        {
            "subject": "Vacation request approved",
            "body": "Your time off request for Dec 23-27 has been approved. Enjoy your holiday!",
            "type": "internal",
        },
        {
            "subject": "Expense report submitted",
            "body": "Your expense report #12345 has been submitted for approval. Expected processing time is 3-5 business days.",
            "type": "internal",
        },
        # External legitimate
        {
            "subject": "Your order has shipped",
            "body": "Great news! Your order #ABC123 has shipped and will arrive by Friday. Track your package here: tracking.ups.com/abc123",
            "type": "external",
        },
        {
            "subject": "Invoice #INV-2024-001",
            "body": "Please find attached invoice #INV-2024-001 for services rendered in December. Payment due within 30 days.",
            "type": "external",
        },
        {
            "subject": "Newsletter: January Edition",
            "body": "Welcome to our monthly newsletter! This month we cover industry trends and upcoming events.",
            "type": "external",
        },
        {
            "subject": "Appointment confirmation",
            "body": "This confirms your appointment on January 15th at 2:00 PM. Reply to reschedule if needed.",
            "type": "external",
        },
        {
            "subject": "Thank you for your purchase",
            "body": "Thank you for shopping with us. Your receipt is attached. Questions? Contact support@store.com",
            "type": "external",
        },
    ]

    # Credential phishing templates
    CREDENTIAL_PHISHING = [
        {
            "subject": "Urgent: Your account has been compromised",
            "body": "We detected unusual activity on your account. Verify your identity immediately by clicking here: bit.ly/verify-now or your account will be suspended.",
            "subtype": "credential",
        },
        {
            "subject": "Security Alert: Password reset required",
            "body": "Your password expires today. Update it now to avoid losing access: secure-login.com/reset. This link expires in 24 hours.",
            "subtype": "credential",
        },
        {
            "subject": "Microsoft 365: Verify your account",
            "body": "Your Microsoft 365 subscription needs verification. Please sign in at microsft-verify.com to continue using your services.",
            "subtype": "credential",
        },
        {
            "subject": "Your Google Account: Unusual sign-in prevented",
            "body": "Someone tried to sign in to your account from a new device. If this wasn't you, secure your account at google-security.net",
            "subtype": "credential",
        },
        {
            "subject": "DocuSign: Document waiting for signature",
            "body": "John Smith shared a document with you. Click here to review and sign: docusign-secure.com/doc/abc123",
            "subtype": "credential",
        },
        {
            "subject": "LinkedIn: Please confirm your email",
            "body": "We noticed you haven't verified your email. Complete verification now: linkedln-verify.com/confirm",
            "subtype": "credential",
        },
        {
            "subject": "Apple ID: Your account has been locked",
            "body": "Your Apple ID has been locked due to suspicious activity. Unlock it now: apple-id-support.com/unlock",
            "subtype": "credential",
        },
        {
            "subject": "Netflix: Update your payment information",
            "body": "We couldn't process your payment. Update your billing info to avoid service interruption: netflix-billing.com/update",
            "subtype": "credential",
        },
    ]

    # BEC (Business Email Compromise) templates
    BEC_PHISHING = [
        {
            "subject": "Urgent wire transfer needed",
            "body": "Hi, I need you to process an urgent wire transfer of $45,000 to a new vendor. Please keep this confidential and let me know when done. - CEO",
            "subtype": "bec_ceo_fraud",
        },
        {
            "subject": "Quick favor needed",
            "body": "Are you in the office? I need your help with something urgent but can't call right now. Reply ASAP. - John (CFO)",
            "subtype": "bec_ceo_fraud",
        },
        {
            "subject": "RE: Updated bank details",
            "body": "Please note our bank account has changed. All future payments should go to: Account: 123456789, Routing: 987654321. - Vendor Accounting",
            "subtype": "bec_invoice_fraud",
        },
        {
            "subject": "Invoice Payment - URGENT",
            "body": "The attached invoice is past due. Please process payment immediately to avoid late fees. Our new banking details are included.",
            "subtype": "bec_invoice_fraud",
        },
        {
            "subject": "Payroll update needed",
            "body": "Hi HR, I need to update my direct deposit information before the next payroll. Please change it to account #9876543210.",
            "subtype": "bec_payroll_diversion",
        },
        {
            "subject": "Gift cards needed for client appreciation",
            "body": "I'm in a meeting and can't talk. Please purchase 5 Amazon gift cards ($200 each) for client appreciation. Send me the codes. - Director",
            "subtype": "bec_gift_card",
        },
        {
            "subject": "Confidential acquisition discussion",
            "body": "We're in confidential discussions about acquiring a competitor. I need you to wire $125,000 for the deposit. Keep this between us.",
            "subtype": "bec_ceo_fraud",
        },
    ]

    # Spear phishing templates
    SPEAR_PHISHING = [
        {
            "subject": "Speaking opportunity at Tech Conference 2024",
            "body": "Dear Dr. Smith, we'd like to invite you to speak at our conference. Please review the attached proposal and speaker agreement.",
            "subtype": "spear_personalized",
        },
        {
            "subject": "RE: Your recent publication",
            "body": "I read your paper on machine learning security with great interest. I'd like to discuss collaboration opportunities. Please see attached proposal.",
            "subtype": "spear_personalized",
        },
        {
            "subject": "Your LinkedIn connection request",
            "body": "Hi John, thanks for connecting! I noticed you work at Acme Corp. I have an opportunity that might interest you. Details attached.",
            "subtype": "spear_linkedin",
        },
        {
            "subject": "Alumni network: Job opportunity",
            "body": "Fellow Stanford alum here! Our company has an opening that matches your background. Check out the role description attached.",
            "subtype": "spear_personalized",
        },
        {
            "subject": "Follow-up from today's meeting",
            "body": "Great meeting you at the conference today! As discussed, here's the proposal document. Looking forward to your feedback.",
            "subtype": "spear_personalized",
        },
    ]

    # Malware delivery templates
    MALWARE_PHISHING = [
        {
            "subject": "Invoice #INV-38291 attached",
            "body": "Please find your invoice attached. Enable macros to view the document properly. Contact billing@suspicious.com with questions.",
            "subtype": "malware_invoice",
        },
        {
            "subject": "Your resume was received",
            "body": "Thank you for applying to the position. Please open the attached form to complete your application. Enable content to proceed.",
            "subtype": "malware_job",
        },
        {
            "subject": "Shipping notification: DHL Express",
            "body": "Your package is on its way! Open the attached tracking document to see delivery details. Enable editing if prompted.",
            "subtype": "malware_shipping",
        },
        {
            "subject": "Voicemail from unknown caller",
            "body": "You have 1 new voicemail. Download the attached audio file to listen. Note: .exe format required for playback.",
            "subtype": "malware_voicemail",
        },
        {
            "subject": "Court summons - Immediate action required",
            "body": "You are hereby summoned to appear in court. Open the attached document for case details. Failure to respond may result in arrest.",
            "subtype": "malware_legal",
        },
        {
            "subject": "Failed delivery attempt - UPS",
            "body": "We attempted to deliver your package but no one was home. Print the attached label to pick up your package.",
            "subtype": "malware_shipping",
        },
    ]

    # Callback phishing (no links, phone-based)
    CALLBACK_PHISHING = [
        {
            "subject": "Suspicious activity on your account",
            "body": "We detected suspicious activity on your account. Call our security team immediately at 1-800-555-0123 to verify your identity.",
            "subtype": "callback_security",
        },
        {
            "subject": "Your order requires verification",
            "body": "Your order #12345 requires phone verification before shipping. Call 1-888-555-9999 with your order details.",
            "subtype": "callback_order",
        },
        {
            "subject": "Tax refund pending - IRS notice",
            "body": "You have a pending tax refund of $3,247.00. Call the IRS verification line at 1-800-555-8888 to claim.",
            "subtype": "callback_government",
        },
        {
            "subject": "Tech support alert",
            "body": "Critical security issue detected on your computer. Call Microsoft Support at 1-800-555-7777 immediately.",
            "subtype": "callback_tech_support",
        },
    ]

    def generate_dataset(
        self, total_samples: int = 500
    ) -> Tuple[List[str], List[str], List[int], List[str]]:
        """Generate a dataset (about 1/3 legitimate, 2/3 phishing) covering five phishing types."""
        subjects = []
        bodies = []
        labels = []  # 0 = legitimate, 1 = phishing
        subtypes = []

        # Calculate samples per category
        legit_count = total_samples // 3
        phish_count = total_samples - legit_count
        phish_per_type = phish_count // 5

        # Generate legitimate emails
        for _ in range(legit_count):
            template = random.choice(self.LEGITIMATE_TEMPLATES)
            subjects.append(self._add_variation(template["subject"]))
            bodies.append(self._add_variation(template["body"]))
            labels.append(0)
            subtypes.append("legitimate")

        # Generate credential phishing
        for _ in range(phish_per_type):
            template = random.choice(self.CREDENTIAL_PHISHING)
            subjects.append(self._add_variation(template["subject"]))
            bodies.append(self._add_variation(template["body"]))
            labels.append(1)
            subtypes.append(template["subtype"])

        # Generate BEC
        for _ in range(phish_per_type):
            template = random.choice(self.BEC_PHISHING)
            subjects.append(self._add_variation(template["subject"]))
            bodies.append(self._add_variation(template["body"]))
            labels.append(1)
            subtypes.append(template["subtype"])

        # Generate spear phishing
        for _ in range(phish_per_type):
            template = random.choice(self.SPEAR_PHISHING)
            subjects.append(self._add_variation(template["subject"]))
            bodies.append(self._add_variation(template["body"]))
            labels.append(1)
            subtypes.append(template["subtype"])

        # Generate malware delivery
        for _ in range(phish_per_type):
            template = random.choice(self.MALWARE_PHISHING)
            subjects.append(self._add_variation(template["subject"]))
            bodies.append(self._add_variation(template["body"]))
            labels.append(1)
            subtypes.append(template["subtype"])

        # Generate callback phishing
        for _ in range(phish_count - 4 * phish_per_type):
            template = random.choice(self.CALLBACK_PHISHING)
            subjects.append(self._add_variation(template["subject"]))
            bodies.append(self._add_variation(template["body"]))
            labels.append(1)
            subtypes.append(template["subtype"])

        return subjects, bodies, labels, subtypes

    def _add_variation(self, text: str) -> str:
        """Add slight variations to text."""
        variations = [
            lambda t: t,
            lambda t: t.upper() if random.random() < 0.1 else t,
            lambda t: t
            + " "
            + random.choice(["Please respond ASAP.", "Thank you.", "Best regards.", ""]),
            lambda t: t.replace(".", "!") if random.random() < 0.2 else t,
        ]
        return random.choice(variations)(text)


# Generate comprehensive dataset (seed the RNG so the samples are reproducible)
random.seed(42)
generator = PhishingDataGenerator()
subjects, bodies, labels, subtypes = generator.generate_dataset(total_samples=500)

# Combine into full text
texts = [f"Subject: {s}\n\n{b}" for s, b in zip(subjects, bodies)]

# Create DataFrame
df = pd.DataFrame(
    {"text": texts, "subject": subjects, "body": bodies, "label": labels, "subtype": subtypes}
)

print(f"Dataset size: {len(df)} samples")
print("\nClass distribution:")
print(df["label"].value_counts().rename({0: "Legitimate", 1: "Phishing"}))
print("\nPhishing subtypes:")
print(df[df["label"] == 1]["subtype"].value_counts())

In [None]:
# Interactive class distribution with Plotly
class_counts = df["label"].value_counts().reset_index()
class_counts.columns = ["label", "count"]
class_counts["type"] = class_counts["label"].map({0: "Legitimate", 1: "Phishing"})

fig = px.bar(
    class_counts,
    x="type",
    y="count",
    color="type",
    color_discrete_map={"Legitimate": "#2ecc71", "Phishing": "#e74c3c"},
    title="Email Classification Distribution",
    labels={"type": "Email Type", "count": "Count"},
    template=PLOTLY_TEMPLATE,
)
fig.update_layout(
    showlegend=False,
    xaxis_title="Email Type",
    yaxis_title="Count",
    hoverlabel=dict(font_size=14),
)
fig.update_traces(
    hovertemplate="<b>%{x}</b><br>Count: %{y}<extra></extra>"
)
fig.show()

## 2. Feature Extraction with TF-IDF

In [None]:
# Split data (stratified so train and test keep the same legitimate/phishing ratio)
X_train, X_test, y_train, y_test = train_test_split(
    df["text"], df["label"], test_size=0.3, random_state=42, stratify=df["label"]
)

print(f"Training samples: {len(X_train)}")
print(f"Test samples: {len(X_test)}")
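
Because the generator draws from a small pool of templates with only light variation, identical texts can end up on both sides of the split, which would inflate the scores reported later. A minimal sketch of a check for this, assuming an exact string match is a good enough notion of "duplicate" (the helper name `count_overlap` is hypothetical and not part of the lab):

```python
def count_overlap(train_texts, test_texts):
    """Return how many test texts also appear verbatim in the training texts."""
    seen = set(train_texts)
    return sum(1 for text in test_texts if text in seen)


# Toy illustration; in the notebook you would pass X_train and X_test instead.
train = ["Subject: Lunch plans?", "Subject: Quick favor needed"]
test = ["Subject: Quick favor needed", "Subject: Q3 report attached"]
overlap = count_overlap(train, test)
print(f"{overlap} of {len(test)} test emails also occur in the training set")
```

A high overlap suggests the test set is measuring memorization of templates rather than generalization.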

In [None]:
# TF-IDF Vectorization
vectorizer = TfidfVectorizer(
    max_features=1000, stop_words="english", ngram_range=(1, 2)  # Unigrams and bigrams
)

X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)

print(f"Feature matrix shape: {X_train_tfidf.shape}")
print("\nFirst 10 features (alphabetical order, not ranked by importance):")
feature_names = vectorizer.get_feature_names_out()
print(feature_names[:10])
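
To make the weighting concrete, here is a small pure-Python sketch of the formula that scikit-learn's `TfidfVectorizer` applies with its default settings (smoothed IDF, then L2 normalization per document). The toy corpus and the helper names `idf` and `tfidf_row` are illustrative assumptions, and tokenization here is a plain whitespace split, which is simpler than the vectorizer's own.

```python
import math
from collections import Counter

docs = [
    "verify your account now",
    "your meeting notes",
    "verify your password",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents does each term occur?
df_counts = Counter(term for tokens in tokenized for term in set(tokens))


def idf(term):
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + df_counts[term])) + 1


def tfidf_row(tokens):
    """TF-IDF weights for one tokenized document, L2-normalized."""
    tf = Counter(tokens)
    raw = {term: count * idf(term) for term, count in tf.items()}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {term: w / norm for term, w in raw.items()}


row = tfidf_row(tokenized[0])
for term, weight in sorted(row.items(), key=lambda kv: -kv[1]):
    print(f"{term:10s} {weight:.3f}")
```

Terms that occur in every document, such as "your", receive the lowest weight, while rarer terms such as "account" score higher.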

## 3. Train Random Forest Classifier

In [None]:
# Train model
clf = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=42)

clf.fit(X_train_tfidf, y_train)
print("Model trained successfully!")

## 4. Evaluate Model Performance

In [None]:
# Predictions
y_pred = clf.predict(X_test_tfidf)
y_prob = clf.predict_proba(X_test_tfidf)[:, 1]

# Classification report
print("Classification Report:")
print("=" * 50)
print(classification_report(y_test, y_pred, target_names=["Legitimate", "Phishing"]))
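
For a phishing filter the two error types carry different costs: a missed phishing email (false negative) reaches the user, while a blocked legitimate email (false positive) disrupts normal work. The sketch below derives both rates directly from true and predicted labels. The function name `security_metrics` and the toy labels are illustrative and not part of the lab.

```python
def security_metrics(y_true, y_pred):
    """Compute error rates for binary labels where 1 = phishing, 0 = legitimate."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return {
        # Share of phishing emails that slipped through
        "false_negative_rate": fn / (fn + tp) if (fn + tp) else 0.0,
        # Share of legitimate emails that were wrongly flagged
        "false_positive_rate": fp / (fp + tn) if (fp + tn) else 0.0,
        # Share of flagged emails that really were phishing
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
    }


# Toy labels; in the notebook you could call security_metrics(y_test, y_pred).
toy_true = [1, 1, 1, 1, 0, 0, 0, 0]
toy_pred = [1, 1, 1, 0, 0, 0, 0, 1]
metrics = security_metrics(toy_true, toy_pred)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```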

In [None]:
# Interactive Confusion Matrix and ROC Curve with Plotly
cm = confusion_matrix(y_test, y_pred)
fpr, tpr, thresholds = roc_curve(y_test, y_prob)
roc_auc = auc(fpr, tpr)

# Create subplots
fig = make_subplots(
    rows=1, cols=2,
    subplot_titles=("Confusion Matrix", f"ROC Curve (AUC = {roc_auc:.3f})"),
    specs=[[{"type": "heatmap"}, {"type": "scatter"}]]
)

# Confusion Matrix Heatmap
# Use a separate name so the `labels` list from Section 1 is not overwritten
class_names = ["Legitimate", "Phishing"]
cm_text = [[f"{cm[i][j]}<br>({cm[i][j]/cm.sum()*100:.1f}%)" for j in range(2)] for i in range(2)]

fig.add_trace(
    go.Heatmap(
        z=cm,
        x=class_names,
        y=class_names,
        text=cm_text,
        texttemplate="%{text}",
        colorscale="Blues",
        showscale=False,
        hovertemplate="Actual: %{y}<br>Predicted: %{x}<br>Count: %{z}<extra></extra>",
    ),
    row=1, col=1
)

# ROC Curve
fig.add_trace(
    go.Scatter(
        x=fpr, y=tpr,
        mode="lines",
        name=f"ROC (AUC={roc_auc:.3f})",
        line=dict(color="#3498db", width=2),
        hovertemplate="FPR: %{x:.3f}<br>TPR: %{y:.3f}<extra></extra>",
    ),
    row=1, col=2
)

# Diagonal reference line
fig.add_trace(
    go.Scatter(
        x=[0, 1], y=[0, 1],
        mode="lines",
        name="Random",
        line=dict(color="gray", width=1, dash="dash"),
        showlegend=False,
    ),
    row=1, col=2
)

fig.update_layout(
    template=PLOTLY_TEMPLATE,
    height=450,
    width=900,
    showlegend=True,
    legend=dict(x=0.75, y=0.15),
)

fig.update_xaxes(title_text="Predicted", row=1, col=1)
fig.update_yaxes(title_text="Actual", row=1, col=1)
fig.update_xaxes(title_text="False Positive Rate", row=1, col=2)
fig.update_yaxes(title_text="True Positive Rate", row=1, col=2)

fig.show()
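
The ROC curve shows the trade-off at every threshold, whereas `clf.predict` effectively applies a fixed 0.5 cut-off. A minimal sketch of choosing an operating point instead, assuming you want the lowest threshold whose false positive rate stays within a budget (the function name `threshold_for_max_fpr` and the toy scores are hypothetical):

```python
def threshold_for_max_fpr(y_true, scores, max_fpr):
    """Return (threshold, tpr, fpr) for the lowest threshold with fpr <= max_fpr.

    An email is flagged when its score is >= threshold. Returns None if even
    the strictest threshold exceeds the budget.
    """
    n_pos = sum(1 for y in y_true if y == 1)
    n_neg = sum(1 for y in y_true if y == 0)
    best = None
    # Walk thresholds from strict to lenient; FPR can only grow along the way.
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= t)
        fp = sum(1 for y, s in zip(y_true, scores) if y == 0 and s >= t)
        fpr = fp / n_neg
        if fpr > max_fpr:
            break
        best = (t, tp / n_pos, fpr)
    return best


# Toy scores; in the notebook you could pass y_test and y_prob.
toy_true = [0, 0, 0, 0, 1, 1, 1, 1]
toy_scores = [0.1, 0.2, 0.3, 0.6, 0.4, 0.7, 0.8, 0.9]
strict = threshold_for_max_fpr(toy_true, toy_scores, max_fpr=0.0)
lenient = threshold_for_max_fpr(toy_true, toy_scores, max_fpr=0.25)
print("No false positives allowed:", strict)
print("Up to 25% false positives: ", lenient)
```

Tightening the false positive budget raises the threshold and lowers the share of phishing emails caught, which is the trade-off the curve visualizes.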

## 5. Feature Importance Analysis

In [None]:
# Interactive Feature Importance with Plotly
importances = clf.feature_importances_
indices = np.argsort(importances)[::-1][:20]
top_features = [feature_names[i] for i in indices]
top_importances = importances[indices]

# Create DataFrame for Plotly
importance_df = pd.DataFrame({
    "feature": top_features,
    "importance": top_importances,
    "rank": range(1, len(top_features) + 1)
})

fig = px.bar(
    importance_df,
    x="importance",
    y="feature",
    orientation="h",
    title="Top 20 Phishing Indicators",
    labels={"importance": "Feature Importance", "feature": "Feature"},
    template=PLOTLY_TEMPLATE,
    color="importance",
    color_continuous_scale="RdYlGn_r",
)

fig.update_layout(
    yaxis=dict(categoryorder="total ascending"),
    coloraxis_showscale=False,
    height=500,
    width=700,
    hoverlabel=dict(font_size=14),
)

fig.update_traces(
    hovertemplate="<b>%{y}</b><br>Importance: %{x:.4f}<extra></extra>"
)

fig.show()

## 6. Test with New Emails

In [None]:
def classify_email(email_text):
    """Classify a single email as phishing or legitimate."""
    email_tfidf = vectorizer.transform([email_text])
    prediction = clf.predict(email_tfidf)[0]
    probability = clf.predict_proba(email_tfidf)[0]

    result = "PHISHING" if prediction == 1 else "LEGITIMATE"
    confidence = probability[prediction] * 100

    return result, confidence


# Test emails
test_emails = [
    "URGENT: Your account will be closed. Click here immediately!",
    "Hi, let's catch up over coffee next week. How's Tuesday?",
    "You've won a free iPhone! Claim now before it expires!",
]

print("Email Classification Results:")
print("=" * 60)
for email in test_emails:
    result, confidence = classify_email(email)
    icon = "🚨" if result == "PHISHING" else "✅"
    print(f"\n{icon} {result} ({confidence:.1f}% confidence)")
    print(f"   Email: {email[:50]}...")
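
`classify_email` relies on the global `vectorizer` and `clf` staying in sync. One common alternative is to bundle both steps in a scikit-learn `Pipeline`, so raw text goes in and a probability comes out. The sketch below trains on a tiny made-up corpus purely for illustration; the variable names and example texts are assumptions, and the score it prints is not meaningful.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

# Made-up examples, far too few for a real model (1 = phishing, 0 = legitimate)
toy_texts = [
    "Urgent: verify your account immediately or it will be suspended",
    "Your password expires today, update it now",
    "Please process an urgent wire transfer and keep this confidential",
    "Team meeting tomorrow at 3pm in Conference Room A",
    "Want to grab lunch today at the new place downtown",
    "Meeting notes attached, action items are highlighted",
]
toy_labels = [1, 1, 1, 0, 0, 0]

pipeline = Pipeline(
    [
        ("tfidf", TfidfVectorizer(stop_words="english", ngram_range=(1, 2))),
        ("forest", RandomForestClassifier(n_estimators=50, random_state=42)),
    ]
)
pipeline.fit(toy_texts, toy_labels)

# predict_proba columns follow pipeline.classes_, so look up the phishing column
proba = pipeline.predict_proba(["Verify your account immediately"])[0]
phishing_index = list(pipeline.classes_).index(1)
print(f"P(phishing) = {proba[phishing_index]:.2f}")
```

Keeping the vectorizer inside the pipeline also means a single object can be saved and reloaded without the two parts drifting apart.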

## Summary

In this lab, we built a phishing email classifier using:
- **TF-IDF vectorization** to convert email text to numerical features
- **Random Forest classifier** for robust classification
- **Evaluation metrics** including precision, recall, F1, and ROC-AUC

### Key Phishing Indicators Identified:
- Urgency words ("urgent", "immediately", "act now")
- Financial incentives ("won", "prize", "free")
- Security threats ("compromised", "suspended", "verify")

### Next Steps:
1. Add more training data
2. Try deep learning models (BERT, RoBERTa)
3. Add header analysis features
4. Integrate with email gateway
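
As a starting point for step 3, the sketch below extracts SPF, DKIM, and DMARC verdicts from an `Authentication-Results` header using only the standard library. The sample message, the helper name `auth_features`, and the simple regular expression are assumptions for illustration; real headers vary between mail providers and may need more careful parsing.

```python
import re
from email import message_from_string

# Hypothetical raw message; the header values are invented for this example.
RAW_EMAIL = """\
Authentication-Results: mx.example.com; spf=fail smtp.mailfrom=billing@suspicious.com; dkim=none; dmarc=fail header.from=suspicious.com
From: billing@suspicious.com
Subject: Invoice #INV-38291 attached

Please find your invoice attached.
"""


def auth_features(raw_email):
    """Return 1 for each mechanism whose verdict is 'pass', else 0.

    A missing header or missing mechanism counts as 0.
    """
    msg = message_from_string(raw_email)
    header = msg.get("Authentication-Results", "") or ""
    features = {}
    for mechanism in ("spf", "dkim", "dmarc"):
        match = re.search(rf"\b{mechanism}=(\w+)", header, flags=re.IGNORECASE)
        verdict = match.group(1).lower() if match else "missing"
        features[f"{mechanism}_pass"] = int(verdict == "pass")
    return features


features = auth_features(RAW_EMAIL)
print(features)
```

The resulting 0/1 values could be appended to the TF-IDF matrix as extra columns, so the classifier sees authentication failures alongside the text.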