 Function to import the spam-ham dataset

In [68]:
import tarfile  # to handle tar archives (.tar, .tar.bz2)
from pathlib import Path  # for filesystem path manipulation
import urllib.request  # to download files from URLs

def fetch_spam_data():
    spam_root = "http://spamassassin.apache.org/old/publiccorpus/"  # base URL for dataset
    ham_url = spam_root + "20030228_easy_ham.tar.bz2"  # URL for ham emails
    spam_url = spam_root + "20030228_spam.tar.bz2"     # URL for spam emails

    spam_path = Path() / "datasets" / "spam"  # local folder to store downloaded datasets
    spam_path.mkdir(parents=True, exist_ok=True)  # create folder if it doesn't exist

    # iterate over ham and spam datasets
    for dir_name, tar_name, url in (("easy_ham", "ham", ham_url),
                                    ("spam", "spam", spam_url)):
        if not (spam_path / dir_name).is_dir():  # check if dataset already extracted
            path = (spam_path / tar_name).with_suffix(".tar.bz2")  # local tar.bz2 filename
            print("Downloading", path)  # notify user
            urllib.request.urlretrieve(url, path)  # download the tar.bz2 file
            with tarfile.open(path) as tar_bz2_file:  # archive is closed automatically, even on error
                tar_bz2_file.extractall(path=spam_path)  # extract all contents to target folder

    # return paths to extracted ham and spam folders
    return [spam_path / dir_name for dir_name in ("easy_ham", "spam")]


In [69]:

ham_dir, spam_dir = fetch_spam_data()

In [70]:
# List the files in the ham directory in sorted order, keeping only names longer
# than 20 characters (this skips non-email helper files such as `cmds`)
ham_filenames = [f for f in sorted(ham_dir.iterdir()) if len(f.name) > 20]

# Same for the spam directory
spam_filenames = [f for f in sorted(spam_dir.iterdir()) if len(f.name) > 20]

# Print number of ham emails found
print(len(ham_filenames))

# Print number of spam emails found
print(len(spam_filenames))


2500
500


In [71]:
import email  # for parsing email messages
import email.parser  # provides BytesParser, used in load_email() below
import email.policy  # to define parsing behavior (e.g., preserving headers, line endings)

# Function to load an email from a file path
def load_email(filepath):
    with open(filepath, "rb") as f:  # open file in binary mode
        # parse the email using the default policy, return EmailMessage object
        return email.parser.BytesParser(policy=email.policy.default).parse(f)

# Load all ham emails into a list of EmailMessage objects
ham_emails = [load_email(filepath) for filepath in ham_filenames]

# Load all spam emails into a list of EmailMessage objects
spam_emails = [load_email(filepath) for filepath in spam_filenames]

# Print the content of the second ham email, stripped of leading/trailing whitespace
print(ham_emails[1].get_content().strip())

# Print the content of the seventh spam email, stripped of leading/trailing whitespace
print(spam_emails[6].get_content().strip())


Martin A posted:
Tassos Papadopoulos, the Greek sculptor behind the plan, judged that the
 limestone of Mount Kerdylio, 70 miles east of Salonika and not far from the
 Mount Athos monastic community, was ideal for the patriotic sculpture. 
 
 As well as Alexander's granite features, 240 ft high and 170 ft wide, a
 museum, a restored amphitheatre and car park for admiring crowds are
planned
---------------------
So is this mountain limestone or granite?
If it's limestone, it'll weather pretty fast.

------------------------ Yahoo! Groups Sponsor ---------------------~-->
4 DVDs Free +s&p Join Now
http://us.click.yahoo.com/pt6YBB/NXiEAA/mG3HAA/7gSolB/TM
---------------------------------------------------------------------~->

To unsubscribe from this group, send an email to:
forteana-unsubscribe@egroups.com

 

Your use of Yahoo! Groups is subject to http://docs.yahoo.com/info/terms/
Help wanted.  We are a 14 year old fortune 500 company, that is
growing at a tremendous rate.  We are loo

In [72]:
# Function to recursively describe the structure of an email
def get_email_structure(email):
    if isinstance(email, str):  # if input is already a string, return it
        return email
    payload = email.get_payload()  # get the payload (content) of the email
    if isinstance(payload, list):  # if payload is a list, it's multipart
        # recursively get structure of each part and join with commas
        multipart = ", ".join([get_email_structure(sub_email)
                               for sub_email in payload])
        return f"multipart({multipart})"  # label as multipart with its parts
    else:
        return email.get_content_type()  # return content type (e.g., text/plain)


In [73]:
from collections import Counter  # for counting occurrences

# Function to count how many times each email structure appears
def structures_counter(emails):
    structures = Counter()  # initialize counter
    for email in emails:
        structure = get_email_structure(email)  # get structure of email
        structures[structure] += 1  # increment count
    return structures

# Print the most common email structures in ham emails
print("Common ham structure:",
      structures_counter(ham_emails).most_common())

# Print the most common email structures in spam emails
print("Common spam structure:",
      structures_counter(spam_emails).most_common())

Common ham structure: [('text/plain', 2408), ('multipart(text/plain, application/pgp-signature)', 66), ('multipart(text/plain, text/html)', 8), ('multipart(text/plain, text/plain)', 4), ('multipart(text/plain)', 3), ('multipart(text/plain, application/octet-stream)', 2), ('multipart(text/plain, text/enriched)', 1), ('multipart(text/plain, application/ms-tnef, text/plain)', 1), ('multipart(multipart(text/plain, text/plain, text/plain), application/pgp-signature)', 1), ('multipart(text/plain, video/mng)', 1), ('multipart(text/plain, multipart(text/plain))', 1), ('multipart(text/plain, application/x-pkcs7-signature)', 1), ('multipart(text/plain, multipart(text/plain, text/plain), text/rfc822-headers)', 1), ('multipart(text/plain, multipart(text/plain, text/plain), multipart(multipart(text/plain, application/x-pkcs7-signature)))', 1), ('multipart(text/plain, application/x-java-applet)', 1)]
Common spam structure: [('text/plain', 218), ('text/html', 183), ('multipart(text/plain, text/html)'

In [74]:
for header, value in spam_emails[0].items():
    print(header, ":", value)

Return-Path : <12a1mailbot1@web.de>
Delivered-To : zzzz@localhost.spamassassin.taint.org
Received : from localhost (localhost [127.0.0.1])	by phobos.labs.spamassassin.taint.org (Postfix) with ESMTP id 136B943C32	for <zzzz@localhost>; Thu, 22 Aug 2002 08:17:21 -0400 (EDT)
Received : from mail.webnote.net [193.120.211.219]	by localhost with POP3 (fetchmail-5.9.0)	for zzzz@localhost (single-drop); Thu, 22 Aug 2002 13:17:21 +0100 (IST)
Received : from dd_it7 ([210.97.77.167])	by webnote.net (8.9.3/8.9.3) with ESMTP id NAA04623	for <zzzz@spamassassin.taint.org>; Thu, 22 Aug 2002 13:09:41 +0100
From : 12a1mailbot1@web.de
Received : from r-smtp.korea.com - 203.122.2.197 by dd_it7  with Microsoft SMTPSVC(5.5.1775.675.6);	 Sat, 24 Aug 2002 09:42:10 +0900
To : dcek1a1@netsgo.com
Subject : Life Insurance - Why Pay More?
Date : Wed, 21 Aug 2002 20:31:57 -1600
MIME-Version : 1.0
Message-ID : <0103c1042001882DD_IT7@dd_it7>
Content-Type : text/html; charset="iso-8859-1"
Content-Transfer-Encoding : qu

In [75]:
spam_emails[0]["Subject"]

'Life Insurance - Why Pay More?'

ML-Section:

In [76]:
import numpy as np
from sklearn.model_selection import train_test_split

# Combine ham and spam emails into a single numpy array, dtype=object to hold EmailMessage objects
X = np.array(ham_emails + spam_emails, dtype=object)

# Create labels: 0 for ham, 1 for spam
y = np.array([0] * len(ham_emails) + [1] * len(spam_emails))

# Split dataset into training (80%) and testing (20%) sets
# random_state=42 ensures reproducibility
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)


In [77]:
from bs4 import BeautifulSoup  # for parsing HTML content
from html import unescape      # to convert HTML entities to normal text
import re                      # for regular expressions

# Convert HTML content of an email to plain text
def html_to_plain_text(html):
    soup = BeautifulSoup(html, "html.parser")  # parse HTML
    # remove <head> section and its contents, often contains metadata/scripts
    if soup.head:
        soup.head.decompose()
    # replace all hyperlinks with the word 'HYPERLINK'
    for a_tag in soup.find_all("a"):
        a_tag.replace_with(" HYPERLINK ")
    # extract visible text, using newline as separator
    text = soup.get_text(separator="\n")
    # normalize multiple consecutive newlines into a single newline
    text = re.sub(r'(\s*\n)+', '\n', text, flags=re.M)
    return unescape(text)  # convert HTML entities (e.g., &amp;) to characters



In [78]:
html_spam_emails = [email for email in X_train[y_train==1]
                    if get_email_structure(email) == "text/html"]
sample_html_spam = html_spam_emails[7]
print(sample_html_spam.get_content().strip()[:1000], "...")


print(html_to_plain_text(sample_html_spam.get_content())[:1000], "...")

<HTML><HEAD><TITLE></TITLE><META http-equiv="Content-Type" content="text/html; charset=windows-1252"><STYLE>A:link {TEX-DECORATION: none}A:active {TEXT-DECORATION: none}A:visited {TEXT-DECORATION: none}A:hover {COLOR: #0033ff; TEXT-DECORATION: underline}</STYLE><META content="MSHTML 6.00.2713.1100" name="GENERATOR"></HEAD>
<BODY text="#000000" vLink="#0033ff" link="#0033ff" bgColor="#CCCC99"><TABLE borderColor="#660000" cellSpacing="0" cellPadding="0" border="0" width="100%"><TR><TD bgColor="#CCCC99" valign="top" colspan="2" height="27">
<font size="6" face="Arial, Helvetica, sans-serif" color="#660000">
<b>OTC</b></font></TD></TR><TR><TD height="2" bgcolor="#6a694f">
<font size="5" face="Times New Roman, Times, serif" color="#FFFFFF">
<b>&nbsp;Newsletter</b></font></TD><TD height="2" bgcolor="#6a694f"><div align="right"><font color="#FFFFFF">
<b>Discover Tomorrow's Winners&nbsp;</b></font></div></TD></TR><TR><TD height="25" colspan="2" bgcolor="#CCCC99"><table width="100%" border="0" 

In [79]:
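def email_to_text(email):
    """Return the email's text: the first text/plain part, else converted HTML."""
    html = None
    for part in email.walk():
        ctype = part.get_content_type()
        if ctype not in ("text/plain", "text/html"):
            continue  # skip attachments, signatures, and multipart containers
        try:
            content = part.get_content()
        except Exception:  # in case of encoding issues
            content = str(part.get_payload())
        if ctype == "text/plain":
            return content  # prefer plain text as soon as one part is found
        else:
            html = content  # remember HTML as a fallback
    if html:
        return html_to_plain_text(html)
    return None  # no text part found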



print(email_to_text(sample_html_spam)[:100], "...")


OTC
 Newsletter
Discover Tomorrow's Winners
For Immediate Release
Cal-Bay (Stock Symbol: CBYI)
Watc ...


In [80]:
import nltk  # Natural Language Toolkit for text processing

# Initialize the Porter stemming algorithm
stemmer = nltk.PorterStemmer()

# Test stemming on several forms of "compute", plus "Compulsive" as a contrasting word
for word in ("Computations", "Computation", "Computing", "Computed", "Compute",
             "Compulsive"):
    # stem() reduces words to their root form
    print(word, "=>", stemmer.stem(word))


Computations => comput
Computation => comput
Computing => comput
Computed => comput
Compute => comput
Compulsive => compuls


In [81]:
import sys
# Is this notebook running on Colab or Kaggle?
IS_COLAB = "google.colab" in sys.modules
IS_KAGGLE = "kaggle_secrets" in sys.modules

# if running this notebook on Colab or Kaggle, we just pip install urlextract
if IS_COLAB or IS_KAGGLE:
    %pip install -q -U urlextract

In [82]:
import urlextract  # library to extract URLs from text (may download root domain names)

# Create an instance of the URL extractor
url_extractor = urlextract.URLExtract()

# Example text containing URLs
some_text = "Will it detect github.com and https://youtu.be/7Pq-S557XQU?t=3m32s"

# Extract and print all URLs found in the text
print(url_extractor.find_urls(some_text))


['github.com', 'https://youtu.be/7Pq-S557XQU?t=3m32s']


In [83]:
from sklearn.base import BaseEstimator, TransformerMixin  # base classes for custom transformers

# Transformer that converts emails into word count dictionaries
class EmailToWordCounterTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, strip_headers=True, lower_case=True,
                 remove_punctuation=True, replace_urls=True,
                 replace_numbers=True, stemming=True):
        # configuration options for preprocessing
        self.strip_headers = strip_headers
        self.lower_case = lower_case
        self.remove_punctuation = remove_punctuation
        self.replace_urls = replace_urls
        self.replace_numbers = replace_numbers
        self.stemming = stemming

    def fit(self, X, y=None):
        return self  # no fitting needed, just a transformer

    def transform(self, X, y=None):
        X_transformed = []  # store transformed emails
        for email in X:
            text = email_to_text(email) or ""  # extract email text
            if self.lower_case:
                text = text.lower()  # convert to lowercase
            if self.replace_urls and url_extractor is not None:
                # find all URLs, sort by length descending to avoid partial replacements
                urls = list(set(url_extractor.find_urls(text)))
                urls.sort(key=lambda url: len(url), reverse=True)
                for url in urls:
                    text = text.replace(url, " URL ")  # replace URLs with placeholder
            if self.replace_numbers:
                # replace all numeric patterns with placeholder
                text = re.sub(r'\d+(?:\.\d*)?(?:[eE][+-]?\d+)?', 'NUMBER', text)
            if self.remove_punctuation:
                # remove all non-word characters
                text = re.sub(r'\W+', ' ', text, flags=re.M)
            word_counts = Counter(text.split())  # count word frequencies
            if self.stemming and stemmer is not None:
                # reduce words to their stems
                stemmed_word_counts = Counter()
                for word, count in word_counts.items():
                    stemmed_word = stemmer.stem(word)
                    stemmed_word_counts[stemmed_word] += count
                word_counts = stemmed_word_counts
            X_transformed.append(word_counts)  # add processed email word counts
        return np.array(X_transformed)  # return as NumPy array


In [84]:
X_few = X_train[:3]
X_few_wordcounts = EmailToWordCounterTransformer().fit_transform(X_few)
X_few_wordcounts

array([Counter({'chuck': 1, 'murcko': 1, 'wrote': 1, 'stuff': 1, 'yawn': 1, 'r': 1}),
       Counter({'the': 11, 'of': 9, 'and': 8, 'all': 3, 'christian': 3, 'to': 3, 'by': 3, 'jefferson': 2, 'i': 2, 'have': 2, 'superstit': 2, 'one': 2, 'on': 2, 'been': 2, 'ha': 2, 'half': 2, 'rogueri': 2, 'teach': 2, 'jesu': 2, 'some': 1, 'interest': 1, 'quot': 1, 'url': 1, 'thoma': 1, 'examin': 1, 'known': 1, 'word': 1, 'do': 1, 'not': 1, 'find': 1, 'in': 1, 'our': 1, 'particular': 1, 'redeem': 1, 'featur': 1, 'they': 1, 'are': 1, 'alik': 1, 'found': 1, 'fabl': 1, 'mytholog': 1, 'million': 1, 'innoc': 1, 'men': 1, 'women': 1, 'children': 1, 'sinc': 1, 'introduct': 1, 'burnt': 1, 'tortur': 1, 'fine': 1, 'imprison': 1, 'what': 1, 'effect': 1, 'thi': 1, 'coercion': 1, 'make': 1, 'world': 1, 'fool': 1, 'other': 1, 'hypocrit': 1, 'support': 1, 'error': 1, 'over': 1, 'earth': 1, 'six': 1, 'histor': 1, 'american': 1, 'john': 1, 'e': 1, 'remsburg': 1, 'letter': 1, 'william': 1, 'short': 1, 'again': 1, 'becom

In [85]:
from scipy.sparse import csr_matrix  # for sparse matrix representation

# Transformer that converts word count dictionaries into vectors
class WordCounterToVectorTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, vocabulary_size=1000):
        self.vocabulary_size = vocabulary_size  # maximum number of words to keep in vocabulary

    def fit(self, X, y=None):
        total_count = Counter()  # count total occurrences across all emails
        for word_count in X:
            for word, count in word_count.items():
                # cap each word's count at 10 per email so one long email cannot dominate
                total_count[word] += min(count, 10)
        # select most common words up to vocabulary_size
        most_common = total_count.most_common()[:self.vocabulary_size]
        # assign an index to each word
        self.vocabulary_ = {word: index + 1
                            for index, (word, count) in enumerate(most_common)}
        print("Vocabulary built:", self.vocabulary_)  # print vocabulary mapping
        return self

    def transform(self, X, y=None):
        rows = []
        cols = []
        data = []
        for row, word_count in enumerate(X):
            for word, count in word_count.items():
                rows.append(row)
                cols.append(self.vocabulary_.get(word, 0))  # 0 if word not in vocabulary
                data.append(count)
        sparse_matrix = csr_matrix((data, (rows, cols)),
                                   shape=(len(X), self.vocabulary_size + 1))
        print("Sparse matrix shape:", sparse_matrix.shape)  # print matrix shape
        return sparse_matrix

# Example usage with small dataset
vocab_transformer = WordCounterToVectorTransformer(vocabulary_size=10)
X_few_vectors = vocab_transformer.fit_transform(X_few_wordcounts)
print(X_few_vectors)  # print resulting sparse vectors


Vocabulary built: {'the': 1, 'of': 2, 'and': 3, 'to': 4, 'url': 5, 'all': 6, 'in': 7, 'christian': 8, 'on': 9, 'by': 10}
Sparse matrix shape: (3, 11)
<Compressed Sparse Row sparse matrix of dtype 'int64'
	with 20 stored elements and shape (3, 11)>
  Coords	Values
  (0, 0)	6
  (1, 0)	99
  (1, 1)	11
  (1, 2)	9
  (1, 3)	8
  (1, 4)	3
  (1, 5)	1
  (1, 6)	3
  (1, 7)	1
  (1, 8)	3
  (1, 9)	2
  (1, 10)	3
  (2, 0)	67
  (2, 2)	1
  (2, 3)	2
  (2, 4)	3
  (2, 5)	4
  (2, 6)	1
  (2, 7)	2
  (2, 9)	1


In [86]:
vocab_transformer.vocabulary_

{'the': 1,
 'of': 2,
 'and': 3,
 'to': 4,
 'url': 5,
 'all': 6,
 'in': 7,
 'christian': 8,
 'on': 9,
 'by': 10}

In [87]:
from sklearn.pipeline import Pipeline  # for chaining multiple preprocessing steps

# Create a preprocessing pipeline:
# 1. Convert emails to word count dictionaries
# 2. Convert word counts to fixed-size vectors
preprocess_pipeline = Pipeline([
    ("email_to_wordcount", EmailToWordCounterTransformer()),
    ("wordcount_to_vector", WordCounterToVectorTransformer()),
])

# Fit the pipeline on training data and transform it into numerical vectors
X_train_transformed = preprocess_pipeline.fit_transform(X_train)
print("Transformed training data shape:", X_train_transformed.shape)  # print shape


Vocabulary built: {'number': 1, 'the': 2, 'to': 3, 'a': 4, 'and': 5, 'of': 6, 'i': 7, 'in': 8, 'it': 9, 'url': 10, 'is': 11, 'that': 12, 'you': 13, 'for': 14, 'thi': 15, 'on': 16, 's': 17, 'be': 18, 'with': 19, 'have': 20, 'from': 21, 'not': 22, 'are': 23, 't': 24, 'as': 25, 'or': 26, 'your': 27, 'list': 28, 'if': 29, 'but': 30, 'at': 31, 'use': 32, 'can': 33, 'by': 34, 'all': 35, 'an': 36, 'my': 37, 'wa': 38, 'we': 39, 'get': 40, 'mail': 41, 'do': 42, 'they': 43, 'will': 44, 'so': 45, 'one': 46, 'there': 47, 'more': 48, 'ha': 49, 'time': 50, 'just': 51, 'about': 52, 'no': 53, 'what': 54, 'out': 55, 'like': 56, 'messag': 57, 'com': 58, 'up': 59, 'email': 60, 'onli': 61, 'which': 62, 'would': 63, 'work': 64, 'other': 65, 'make': 66, 'don': 67, 'some': 68, 'who': 69, 'ani': 70, 'me': 71, 'now': 72, 'when': 73, 'new': 74, 'peopl': 75, 'their': 76, 'our': 77, 'm': 78, 'free': 79, 'user': 80, 'been': 81, 'net': 82, 'date': 83, 'wrote': 84, 'how': 85, 'want': 86, 'than': 87, 'rpm': 88, 'them

In [88]:
from sklearn.linear_model import LogisticRegression  # classifier
from sklearn.model_selection import cross_val_score  # for cross-validation

# Initialize logistic regression
log_clf = LogisticRegression(max_iter=1000, random_state=42)

# Evaluate using 3-fold cross-validation
score = cross_val_score(log_clf, X_train_transformed, y_train, cv=3)
print("Cross-validation scores:", score)
print("Mean CV score:", score.mean())  # average performance across folds

Cross-validation scores: [0.98375 0.985   0.98875]
Mean CV score: 0.9858333333333333


In [89]:
from sklearn.metrics import precision_score, recall_score  # metrics to evaluate classifier

# Transform test emails into vectors using the same preprocessing pipeline
X_test_transformed = preprocess_pipeline.transform(X_test)
print("Transformed test data shape:", X_test_transformed.shape)  # print shape

# Initialize logistic regression classifier
log_clf = LogisticRegression(max_iter=1000, random_state=42)
# Train on the transformed training data
log_clf.fit(X_train_transformed, y_train)

# Predict labels for the test set
y_pred = log_clf.predict(X_test_transformed)

# Print precision (proportion of predicted spam that is correct)
print(f"Precision: {precision_score(y_test, y_pred):.2%}")
# Print recall (proportion of actual spam correctly identified)
print(f"Recall: {recall_score(y_test, y_pred):.2%}")


Sparse matrix shape: (600, 1001)
Transformed test data shape: (600, 1001)
Precision: 96.88%
Recall: 97.89%


In [90]:
import joblib

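# Fit the preprocessing pipeline on the training data and transform it in one step
X_train_transformed = preprocess_pipeline.fit_transform(X_train)

log_clf = LogisticRegression(max_iter=1000, random_state=42)
log_clf.fit(X_train_transformed, y_train)

# Save both objects together. The custom transformer classes are pickled by
# reference, so they must be defined (or importable) wherever the file is loaded.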
joblib.dump({
    "preprocess_pipeline": preprocess_pipeline,
    "classifier": log_clf
}, "spam_classifier.pkl")


Vocabulary built: {'number': 1, 'the': 2, 'to': 3, 'a': 4, 'and': 5, 'of': 6, 'i': 7, 'in': 8, 'it': 9, 'url': 10, 'is': 11, 'that': 12, 'you': 13, 'for': 14, 'thi': 15, 'on': 16, 's': 17, 'be': 18, 'with': 19, 'have': 20, 'from': 21, 'not': 22, 'are': 23, 't': 24, 'as': 25, 'or': 26, 'your': 27, 'list': 28, 'if': 29, 'but': 30, 'at': 31, 'use': 32, 'can': 33, 'by': 34, 'all': 35, 'an': 36, 'my': 37, 'wa': 38, 'we': 39, 'get': 40, 'mail': 41, 'do': 42, 'they': 43, 'will': 44, 'so': 45, 'one': 46, 'there': 47, 'more': 48, 'ha': 49, 'time': 50, 'just': 51, 'about': 52, 'no': 53, 'what': 54, 'out': 55, 'like': 56, 'messag': 57, 'com': 58, 'up': 59, 'email': 60, 'onli': 61, 'which': 62, 'would': 63, 'work': 64, 'other': 65, 'make': 66, 'don': 67, 'some': 68, 'who': 69, 'ani': 70, 'me': 71, 'now': 72, 'when': 73, 'new': 74, 'peopl': 75, 'their': 76, 'our': 77, 'm': 78, 'free': 79, 'user': 80, 'been': 81, 'net': 82, 'date': 83, 'wrote': 84, 'how': 85, 'want': 86, 'than': 87, 'rpm': 88, 'them

['spam_classifier.pkl']



---

## Spam Email Classification with Logistic Regression

This project builds a **spam detector** using raw email data from the SpamAssassin public corpus. The pipeline includes **data preprocessing, feature engineering, model training, and evaluation**.

---

### **1. Dataset**

* Downloaded ham (`easy_ham`) and spam (`spam`) emails.
* Extracted from `.tar.bz2` archives.
* Parsed emails with Python’s `email` module, preserving headers and MIME content.
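
As a minimal sketch of what parsing gives you, the snippet below parses a small hand-written message with the standard library and lists its MIME parts. The message bytes and addresses are made up for illustration; they are not taken from the corpus.

```python
import email
import email.parser
import email.policy

# A tiny hand-written multipart message (hypothetical, not from the corpus).
RAW_MESSAGE = (
    b"From: alice@example.com\r\n"
    b"To: bob@example.com\r\n"
    b"Subject: Hello\r\n"
    b"MIME-Version: 1.0\r\n"
    b'Content-Type: multipart/alternative; boundary="XYZ"\r\n'
    b"\r\n"
    b"--XYZ\r\n"
    b"Content-Type: text/plain; charset=utf-8\r\n"
    b"\r\n"
    b"Plain body\r\n"
    b"--XYZ\r\n"
    b"Content-Type: text/html; charset=utf-8\r\n"
    b"\r\n"
    b"<p>HTML body</p>\r\n"
    b"--XYZ--\r\n"
)

parser = email.parser.BytesParser(policy=email.policy.default)
message = parser.parsebytes(RAW_MESSAGE)

subject = message["Subject"]
# walk() yields the multipart container first, then each MIME part in order.
part_types = [part.get_content_type() for part in message.walk()]

print(subject)
print(part_types)
```

Headers are available by name, and `walk()` visits every part, which is how the notebook's `email_to_text()` finds the text it needs.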

---

### **2. Email Preprocessing**

* Extract text from emails:

  * Plain text: if the email has a `text/plain` part, use the first one directly.
  * HTML: used only when there is no plain-text part; cleaned with `BeautifulSoup` by removing `<head>` and replacing links with `"HYPERLINK"`.
* Normalize text, in this order:

  * Convert to lowercase.
  * Replace URLs with `"URL"`.
  * Replace numbers with `"NUMBER"`.
  * Remove punctuation.
  * Apply stemming (Porter stemmer).
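
The order of these steps matters: URLs have to be replaced while their punctuation is still intact. The sketch below shows that order using only the standard library. It is a simplification, not the notebook's transformer: `URL_PATTERN` is a crude regular expression assumed here as a stand-in for `urlextract`, and stemming is left out.

```python
import re
from collections import Counter

# Assumption: a crude URL pattern standing in for the urlextract library.
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")
NUMBER_PATTERN = re.compile(r"\d+(?:\.\d*)?(?:[eE][+-]?\d+)?")


def normalize(text):
    """Apply the normalization steps in the same order as the notebook."""
    text = text.lower()
    text = URL_PATTERN.sub(" URL ", text)      # URLs first, while punctuation is intact
    text = NUMBER_PATTERN.sub("NUMBER", text)  # then numbers
    text = re.sub(r"\W+", " ", text)           # then punctuation
    # The placeholders appear in lowercase in the notebook's vocabulary output,
    # so tokens are lowercased once more here (no stemmer in this sketch).
    return Counter(word.lower() for word in text.split())


counts = normalize("Win $1000 NOW!!! Visit http://example.com/win today, win big.")
print(counts)
```

If punctuation were stripped first, the URL would be split into fragments such as `http`, `example`, and `com` before it could be recognized.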

---

### **3. Feature Engineering**

* Convert emails → **word count dictionaries** (bag-of-words).
* Build a **shared vocabulary** of the 1,000 most frequent words in the training emails (each word's count is capped at 10 per email).
* Convert word counts → **numerical vectors**:

  * Each vector dimension represents a word in the vocabulary; column 0 collects all out-of-vocabulary words, giving 1,001 columns.
  * Sparse representation used for efficiency.
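
A rough standard-library sketch of this step follows. It uses dense Python lists rather than a SciPy sparse matrix to keep the example self-contained, and the two toy emails are invented; the capping and the reserved column 0 mirror the notebook's `WordCounterToVectorTransformer`.

```python
from collections import Counter


def build_vocabulary(word_counts, vocabulary_size, cap=10):
    """Map the most common words to indices 1..vocabulary_size (0 is reserved)."""
    totals = Counter()
    for counts in word_counts:
        for word, count in counts.items():
            totals[word] += min(count, cap)  # cap each word's per-email contribution
    most_common = totals.most_common(vocabulary_size)
    return {word: index + 1 for index, (word, _) in enumerate(most_common)}


def to_vector(counts, vocabulary):
    """Dense count vector; column 0 collects out-of-vocabulary words."""
    vector = [0] * (len(vocabulary) + 1)
    for word, count in counts.items():
        vector[vocabulary.get(word, 0)] += count
    return vector


# Toy word counts, invented for illustration.
emails = [
    Counter({"free": 3, "click": 1, "now": 1}),
    Counter({"meeting": 2, "project": 1, "free": 1}),
]
vocabulary = build_vocabulary(emails, vocabulary_size=2)
vectors = [to_vector(counts, vocabulary) for counts in emails]
print(vocabulary)
print(vectors)
```

With a vocabulary of two words, every other word lands in column 0, so each vector has three entries.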

---

### **4. Labels**

* Ham emails → `0`
* Spam emails → `1`
* Split dataset: 80% training (2,400 emails), 20% testing (600 emails).

---

### **5. Model Training**

* Logistic Regression classifier trained on the vectorized emails.
* Model learns **weights for each word**:

  * Positive weight → word indicative of spam.
  * Negative weight → word indicative of ham.
* Decision for new email:

  1. Compute weighted sum of word counts + bias.
  2. Apply sigmoid to get probability of spam.
  3. Predict spam if probability > 0.5, otherwise ham.

**Score formula:**

score = SUM(word_count * word_weight) + bias
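
Written out in symbols, with the sigmoid step included, the decision rule can be stated as follows. The symbols $x_i$ (count of vocabulary word $i$), $w_i$ (its learned weight), and $b$ (the bias) are notation introduced here, not names from the notebook:

$$
z = \sum_{i} w_i x_i + b, \qquad P(\text{spam} \mid x) = \sigma(z) = \frac{1}{1 + e^{-z}}
$$

Because $\sigma(0) = 0.5$, predicting spam when the probability exceeds 0.5 is the same as predicting spam when $z > 0$.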


---

### **6. How the Model Handles Words**

1. **Word Weights and Scoring**

   * Each email vector is multiplied by learned word weights to compute a **spam score**.
   * Words with higher positive weights push the score toward spam; negative weights push toward ham.
   * Words that appear in both ham and spam emails get weights that reflect how strongly they are associated with spam once the other words are taken into account.

2. **Shared Vocabulary Across Ham and Spam**

   * Only **one vocabulary** is used; not separate for ham and spam.
   * Example:

     * `"meeting"` mostly in ham → negative weight → reduces spam score.
     * `"free"` mostly in spam → positive weight → increases spam score.
   * Neutral or rare words tend to get small weights → minimal effect.
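
One way to see which words matter most is to sort the vocabulary by weight. The helper below is a hypothetical sketch with made-up weights. For the trained model in the notebook, the weights would presumably come from `log_clf.coef_[0]` and the vocabulary from `preprocess_pipeline.named_steps["wordcount_to_vector"].vocabulary_`.

```python
def rank_words_by_weight(vocabulary, weights, top_k=3):
    """Return the most spam-like and most ham-like words.

    vocabulary: dict mapping word -> column index (index 0 = out-of-vocabulary).
    weights: sequence of per-column weights, indexed like the feature vectors.
    """
    ranked = sorted(vocabulary, key=lambda word: weights[vocabulary[word]])
    ham_like = ranked[:top_k]         # most negative weights first
    spam_like = ranked[::-1][:top_k]  # most positive weights first
    return spam_like, ham_like


# Hypothetical vocabulary and weights, for illustration only.
vocabulary = {"free": 1, "meeting": 2, "click": 3, "project": 4}
weights = [0.1, 2.0, -1.5, 1.8, -0.5]  # weights[0] belongs to the out-of-vocabulary column

spam_like, ham_like = rank_words_by_weight(vocabulary, weights, top_k=2)
print("Spam-like:", spam_like)
print("Ham-like:", ham_like)
```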

---

### **7. Illustrative Example**

Suppose our vocabulary has **4 words**:
 `["free", "meeting", "click", "project"]`.

| Word    | Weight (learned) |
| ------- | ---------------- |
| free    | +2.0             |
| meeting | -1.5             |
| click   | +1.8             |
| project | -0.5             |

#### **Email A (spam)**

```
"Free click here!"
Word counts: {"free": 1, "click": 1}
Score = 1*2.0 + 1*1.8 + 0*(-1.5) + 0*(-0.5) + bias
      = 3.8 + bias
P(spam) = sigmoid(3.8 + bias) ≈ high → classified as spam
```

#### **Email B (ham)**

```
"Meeting about the project"
Word counts: {"meeting": 1, "project": 1}
Score = 0*2.0 + 0*1.8 + 1*(-1.5) + 1*(-0.5) + bias
      = -2.0 + bias
P(spam) = sigmoid(-2.0 + bias) ≈ low → classified as ham
```

**Detailed Explanation of Email B Score:**

1. **Word Counts:** Count how many times each vocabulary word occurs:

   | Word    | Count |
   | ------- | ----- |
   | free    | 0     |
   | click   | 0     |
   | meeting | 1     |
   | project | 1     |

2. **Multiply by Learned Weights:**

   ```
   free: 0*2.0 = 0
   click: 0*1.8 = 0
   meeting: 1*(-1.5) = -1.5
   project: 1*(-0.5) = -0.5
   ```

3. **Add Bias:**

   ```
   Score = -1.5 + (-0.5) + bias = -2.0 + bias
   ```

4. **Convert to Probability:**

   ```
   P(spam) = 1 / (1 + exp(-score)) ≈ low
   ```

5. **Prediction:** Low probability → classified as ham.

**Key points demonstrated:**

* Words that appear in both ham and spam (such as “project”) influence the score according to their **learned weight**.
* Logistic regression combines all word contributions to compute the final spam probability.
* Each word’s contribution = `count * weight`. The bias shifts the score of every email by the same amount, so it sets the baseline spam probability when no vocabulary word is present.
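
The two worked examples can be checked with a few lines of Python. The weights are the hypothetical ones from the table above, and the bias of 0.0 is an assumption made here so that concrete probabilities can be computed.

```python
import math

# Hypothetical weights from the table above; the bias of 0.0 is an assumption.
WEIGHTS = {"free": 2.0, "meeting": -1.5, "click": 1.8, "project": -0.5}
BIAS = 0.0


def spam_probability(word_counts, weights=WEIGHTS, bias=BIAS):
    """Weighted sum of word counts plus bias, passed through the sigmoid."""
    score = sum(count * weights.get(word, 0.0)
                for word, count in word_counts.items()) + bias
    return 1.0 / (1.0 + math.exp(-score))


p_a = spam_probability({"free": 1, "click": 1})       # Email A
p_b = spam_probability({"meeting": 1, "project": 1})  # Email B
print(f"Email A: {p_a:.3f}")
print(f"Email B: {p_b:.3f}")
```

Under these assumptions Email A comes out at about 0.978 and Email B at about 0.119, on opposite sides of the 0.5 threshold.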

---

### **8. Evaluation**

* 3-fold cross-validation on the training set to estimate performance: mean accuracy ≈ 98.6%.
* Tested on the 600 held-out test emails: precision 96.88%, recall 97.89%.
* Metrics reported:

  * **Precision:** % of predicted spam that is actually spam.
  * **Recall:** % of actual spam correctly detected.
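
To make the two definitions concrete, here is a small sketch that computes them by hand from true and predicted labels. The labels are made up, and returning 0.0 when a denominator is zero is a convention chosen for this example.

```python
def precision_recall(y_true, y_pred, positive=1):
    """Compute precision and recall for the positive (spam) class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Made-up labels: 1 = spam, 0 = ham.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
precision, recall = precision_recall(y_true, y_pred)
print(f"Precision: {precision:.2%}")
print(f"Recall: {recall:.2%}")
```

Here three of the four emails flagged as spam really are spam (precision 75%), and three of the four actual spam emails are caught (recall 75%).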

---

**Outcome:** A classical ML spam classifier that uses **bag-of-words features**, a shared vocabulary, and logistic regression with learned word weights to distinguish ham from spam based on word patterns.
