# Exercise 5.2A_Spam - Email Components Primer

This notebook introduces the anatomy of three email threats: simple spam, phishing, and Business Email Compromise (BEC).
It provides raw sample emails and shows how to parse their headers, bodies, and attachments. The notebook is self-contained and uses only Python's standard library.

## Learning Goals
- Recognize the header fields most useful for detection (From, To, Subject, Received, and SPF/DKIM results such as `Received-SPF`).
- Identify body cues (URLs, urgency, financial requests, suspicious file links).
- Understand attachment metadata (filename, declared content type, transfer encoding) and the risks it can signal.
- Know which features to extract for an ML pipeline (header tokens, URL counts, attachment types, sender reputation proxies).
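
The header cues in the first goal can be turned into simple signals. The following is a minimal sketch, not a definitive detector: the sample message, its `Reply-To` header, and the helper name `header_signals` are assumptions made for illustration, and only the first token of `Received-SPF` is read as the SPF result.

```python
from email import message_from_string
from email.utils import parseaddr

# Hypothetical sample, defined here only so the sketch is self-contained
raw = (
    "From: Bank Security <security-alert@banking.example.com>\n"
    "Reply-To: helpdesk@other.example.net\n"
    "To: user@example.com\n"
    "Subject: Action Required: Verify Your Account\n"
    "Received-SPF: fail (example)\n"
    "\n"
    "Please verify your account.\n"
)


def header_signals(raw_email):
    """Return a few header-derived signals (illustrative helper, not a full parser)."""
    msg = message_from_string(raw_email)
    # parseaddr splits 'Display Name <user@host>' into (name, address)
    _, from_addr = parseaddr(msg.get('From', ''))
    _, reply_addr = parseaddr(msg.get('Reply-To', ''))
    from_domain = from_addr.rpartition('@')[2].lower()
    reply_domain = reply_addr.rpartition('@')[2].lower()
    # Assumption: the first token of Received-SPF is the result (pass, fail, ...)
    spf_header = msg.get('Received-SPF', '')
    spf_result = spf_header.split()[0].lower() if spf_header.strip() else 'none'
    return {
        'from_domain': from_domain,
        'reply_to_mismatch': bool(reply_domain) and reply_domain != from_domain,
        'spf_result': spf_result,
    }


signals = header_signals(raw)
print(signals)
```

For this sample the sketch reports the sender domain `banking.example.com`, a Reply-To mismatch, and an SPF result of `fail`. Each value could serve as one header feature for an ML pipeline.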

---

## 1) Sample raw emails (synthetic)
Below are three synthetic email examples: a generic spam message, a phishing email, and a BEC-style payment request. They are intentionally simple and safe for classroom use.

In [9]:
# Define three synthetic raw emails as RFC-822 style strings
spam_email = '''From: cheap-deals@example.com
To: student@example.edu
Subject: Amazing deal - 90% OFF!
Date: Thu, 12 Nov 2025 09:15:00 -0500
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8

Hello!
Buy now and save 90% on our products. Visit http://cheap.example.com/deal to claim.
Unsubscribe: http://cheap.example.com/unsub
'''

phish_email = '''From: security-alert@banking.example.com
To: user@example.com
Subject: Action Required: Verify Your Account
Date: Fri, 13 Nov 2025 08:05:00 -0500
MIME-Version: 1.0
Content-Type: text/html; charset=utf-8
Received-SPF: fail (example)

<html>
  <body>
    <p>Dear customer,</p>
    <p>We noticed suspicious activity on your account. Please <a href="http://secure.example-login.com/verify">verify your account</a> immediately or access will be restricted.</p>
    <p>Regards,<br/>Bank Security Team</p>
  </body>
</html>
'''

bec_email = '''From: ceo@trustedcorp.example.com
To: finance@company.example.com
Subject: Urgent: Wire Transfer Request
Date: Mon, 16 Nov 2025 11:22:00 -0500
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="BOUNDARY123"

--BOUNDARY123
Content-Type: text/plain; charset=utf-8

Hi,
Please arrange an urgent wire transfer of $75,000 to the vendor listed in the attached invoice. Do not discuss this over email.

Regards,
CEO

--BOUNDARY123
Content-Type: application/pdf; name="invoice_457.pdf"
Content-Disposition: attachment; filename="invoice_457.pdf"
Content-Transfer-Encoding: base64

JVBERi0xLjQKJcTl8uXrp/Og0MTGCjEgMCBvYmoK... (truncated sample)
--BOUNDARY123--
'''

# Print a short preview so students see them in outputs
print('--- SPAM SAMPLE ---')
print(spam_email[:400])
print('\n--- PHISHING SAMPLE ---')
print(phish_email[:400])
print('\n--- BEC SAMPLE ---')
print(bec_email[:400])

--- SPAM SAMPLE ---
From: cheap-deals@example.com
To: student@example.edu
Subject: Amazing deal - 90% OFF!
Date: Thu, 12 Nov 2025 09:15:00 -0500
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8

Hello!
Buy now and save 90% on our products. Visit http://cheap.example.com/deal to claim.
Unsubscribe: http://cheap.example.com/unsub


--- PHISHING SAMPLE ---
From: security-alert@banking.example.com
To: user@example.com
Subject: Action Required: Verify Your Account
Date: Fri, 13 Nov 2025 08:05:00 -0500
MIME-Version: 1.0
Content-Type: text/html; charset=utf-8
Received-SPF: fail (example)

<html>
  <body>
    <p>Dear customer,</p>
    <p>We noticed suspicious activity on your account. Please <a href="http://secure.example-login.com/verify">verify your ac

--- BEC SAMPLE ---
From: ceo@trustedcorp.example.com
To: finance@company.example.com
Subject: Urgent: Wire Transfer Request
Date: Mon, 16 Nov 2025 11:22:00 -0500
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="BOUNDARY123"
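
Attachment metadata like that carried by the BEC sample's invoice part can be inspected with the standard library alone. The following is a rough sketch under stated assumptions: the message, the `invoice_457.pdf.exe` filename, the `RISKY_EXTENSIONS` set, and the helper name `attachment_metadata` are all hypothetical, and the payload is simply the base64 encoding of the bytes `%PDF-1.4\n`.

```python
import os
from email import message_from_string

# Hypothetical message for illustration; the payload is base64 of b"%PDF-1.4\n"
raw = '''From: ceo@trustedcorp.example.com
To: finance@company.example.com
Subject: Urgent: Wire Transfer Request
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="B1"

--B1
Content-Type: text/plain; charset=utf-8

See attached invoice.

--B1
Content-Type: application/pdf; name="invoice_457.pdf.exe"
Content-Disposition: attachment; filename="invoice_457.pdf.exe"
Content-Transfer-Encoding: base64

JVBERi0xLjQK
--B1--
'''

# Assumed, non-exhaustive list of extensions treated as risky in this sketch
RISKY_EXTENSIONS = {'.exe', '.js', '.vbs', '.scr', '.bat', '.docm', '.xlsm'}


def attachment_metadata(raw_email):
    """Collect simple metadata for every MIME part that carries a filename."""
    msg = message_from_string(raw_email)
    records = []
    for part in msg.walk():
        filename = part.get_filename()
        if not filename:
            continue  # container parts and inline text have no filename
        data = part.get_payload(decode=True) or b''  # undo base64 transfer encoding
        ext = os.path.splitext(filename)[1].lower()
        records.append({
            'filename': filename,
            'declared_type': part.get_content_type(),
            'extension': ext,
            'size_bytes': len(data),
            'risky_extension': ext in RISKY_EXTENSIONS,
            'double_extension': filename.count('.') >= 2,
            'starts_with_pdf_magic': data.startswith(b'%PDF'),
        })
    return records


records = attachment_metadata(raw)
for record in records:
    print(record)
```

In this sample the declared type is `application/pdf` while the final extension is `.exe`. A mismatch between the two, or a double extension, is the kind of cue that may be worth encoding as a feature.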

In [8]:
# Minimal parser to extract header fields, body type, URL counts, and attachments from a raw email string
import re
from email import message_from_string
from urllib.parse import urlparse


def extract_basic_features(raw_email):
    msg = message_from_string(raw_email)
    features = {}
    # Headers
    features['from'] = msg.get('From')
    features['to'] = msg.get('To')
    features['subject'] = msg.get('Subject')
    features['date'] = msg.get('Date')
    features['is_multipart'] = msg.is_multipart()

    # Body and URLs
    body = ''
    num_attachments = 0
    attachment_filenames = []
    for part in msg.walk():
        ctype = part.get_content_type()
        # A missing header becomes the string 'none', which never contains 'attachment'
        disp = str(part.get('Content-Disposition')).lower()
        is_attachment = 'attachment' in disp
        if ctype in ('text/plain', 'text/html') and not is_attachment:
            # Inline text part: undo the transfer encoding, then decode bytes to str
            payload = part.get_payload(decode=True)
            if payload:
                try:
                    body += payload.decode(part.get_content_charset('utf-8'), errors='replace')
                except LookupError:
                    # Unknown charset name: fall back to UTF-8 rather than the bytes repr
                    body += payload.decode('utf-8', errors='replace')
        elif is_attachment or part.get_filename():
            num_attachments += 1
            if part.get_filename():
                attachment_filenames.append(part.get_filename())

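    # Extract URLs with a deliberately simple regex (classroom-safe).
    # The hyphen sits last in the character class so it is not read as a range.
    # Limitation: query strings and fragments (?, =, &, #) are not captured.
    url_regex = r'https?://[\w./:-]+'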
    urls = re.findall(url_regex, body)
    domains = set()
    for u in urls:
        try:
            domains.add(urlparse(u).netloc)
        except Exception:
            continue

    features['num_urls'] = len(urls)
    features['unique_url_domains'] = len(domains)
    features['num_attachments'] = num_attachments
    features['attachment_filenames'] = attachment_filenames
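    features['body_snippet'] = (body[:300] + '...') if len(body) > 300 else body
    # Heuristic urgency score: number of distinct keywords found in the body
    # (case-insensitive substring match; each keyword counts at most once)
    urgency_keywords = ['urgent', 'immediately', 'asap', 'attention']
    body_lower = body.lower()
    features['urgency_score'] = sum(1 for k in urgency_keywords if k in body_lower)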

    return features

# Demonstrate on the three samples defined in the previous cell
for name, raw in [('spam', spam_email), ('phish', phish_email), ('bec', bec_email)]:
    print(f'--- FEATURES for {name} ---')
    f = extract_basic_features(raw)
    for k, v in f.items():
        print(k, ':', v)
    print('\n')

--- FEATURES for spam ---
from : cheap-deals@example.com
to : student@example.edu
subject : Amazing deal - 90% OFF!
date : Thu, 12 Nov 2025 09:15:00 -0500
is_multipart : False
num_urls : 2
unique_url_domains : 1
num_attachments : 0
attachment_filenames : []
body_snippet : Hello!
Buy now and save 90% on our products. Visit http://cheap.example.com/deal to claim.
Unsubscribe: http://cheap.example.com/unsub

urgency_score : 0


--- FEATURES for phish ---
from : security-alert@banking.example.com
to : user@example.com
subject : Action Required: Verify Your Account
date : Fri, 13 Nov 2025 08:05:00 -0500
is_multipart : False
num_urls : 1
unique_url_domains : 1
num_attachments : 0
attachment_filenames : []
body_snippet : <html>
  <body>
    <p>Dear customer,</p>
    <p>We noticed suspicious activity on your account. Please <a href="http://secure.example-login.com/verify">verify your account</a> immediately or access will be restricted.</p>
    <p>Regards,<br/>Bank Security Team</p>
  </body>
