## Spam Classification

In this notebook, we explore the SMS Spam Collection dataset from the UCI Machine Learning Repository. Our goal is to build a neural network that classifies SMS messages as spam or ham (not spam).

To begin, we implement a simple Perceptron to demonstrate its limitations on this dataset. Because a single Perceptron can only learn a linear decision boundary, it cannot fit data that is not linearly separable, so we then move to a more powerful neural network that can classify the messages more effectively.
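To make the linear-separability limitation concrete, here is a minimal sketch on toy data rather than on the SMS dataset. The helper names `train_perceptron` and `accuracy` are illustrative assumptions, not part of this notebook. The same Perceptron rule fits the linearly separable AND function perfectly, but can never classify more than three of the four XOR points correctly.

```python
import numpy as np

def train_perceptron(X, y, epochs=100, lr=1.0):
    """Classic Perceptron rule: update only on misclassified samples."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            # A sample is misclassified (or on the boundary) when y * score <= 0
            if yi * (np.dot(w, xi) + b) <= 0:
                w += lr * yi * xi
                b += lr * yi
    return w, b

def accuracy(X, y, w, b):
    """Fraction of samples whose predicted sign matches the label."""
    preds = np.where(X @ w + b > 0, 1, -1)
    return float(np.mean(preds == y))

# Four points in the plane, with labels in {-1, +1}
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y_and = np.array([-1, -1, -1, 1])  # linearly separable
y_xor = np.array([-1, 1, 1, -1])   # not linearly separable

w_and, b_and = train_perceptron(X, y_and)
w_xor, b_xor = train_perceptron(X, y_xor)

and_accuracy = accuracy(X, y_and, w_and, b_and)
xor_accuracy = accuracy(X, y_xor, w_xor, b_xor)
print("Accuracy on AND:", and_accuracy)
print("Accuracy on XOR:", xor_accuracy)
```

No straight line separates the XOR labels, so the weights keep oscillating no matter how many epochs are used.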

Let's get started! 🚀

Dataset page: https://archive.ics.uci.edu/dataset/228/sms+spam+collection

```bash
wget https://archive.ics.uci.edu/static/public/228/sms+spam+collection.zip

unzip sms+spam+collection.zip
```

In [1]:
data = []
spam, ham = [], []
# Each line has the form "<label>\t<text>", where label is "spam" or "ham"
with open("SMSSpamCollection", "r", encoding="utf-8") as file:
    for line in file:
        line = line.strip()
        # Split on the first tab only, so the message text stays intact
        label, text = line.split("\t", 1)
        if label == "spam":
            spam.append(text)
        else:
            ham.append(text)
        data.append((text, label))

print("Number of spam messages: ", len(spam))
print("Number of ham messages: ", len(ham))
print("Total number of messages: ", len(data))

Number of spam messages:  747
Number of ham messages:  4827
Total number of messages:  5574
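The counts above show that the classes are imbalanced, which matters when judging any classifier's accuracy later on. As a quick sketch, using only the numbers printed above (the variable names are illustrative), we can compute the spam fraction and the accuracy of a trivial classifier that labels every message as ham:

```python
# Counts copied from the output above
n_spam, n_ham = 747, 4827
n_total = n_spam + n_ham

# Share of messages that are spam
spam_fraction = n_spam / n_total

# Accuracy of a trivial classifier that always predicts "ham"
majority_baseline = n_ham / n_total

print(f"Spam fraction: {spam_fraction:.3f}")
print(f"Majority-class baseline accuracy: {majority_baseline:.3f}")
```

Any model we train should be compared against this baseline of roughly 0.866, since that accuracy is reachable without looking at the message text at all.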


### Feature Engineering: Designing Our Features

Now, we need to design our features to effectively classify messages as spam or ham. For this dataset, we have manually selected a set of keywords that are commonly associated with spam and another set of keywords that frequently appear in ham (non-spam) messages.

Our features are defined as the count of these spam-related and ham-related words that occur in each text message. By using these word counts as input features, we aim to capture key patterns that distinguish spam from legitimate messages.

In [2]:
import numpy as np

# Keywords typically associated with spam (hand-picked from domain knowledge)
spam_words = {
    "free", "credit", "loan", "cash",
    "money", "urgent", "sale", "offer",
    "discount", "save", "clearance", "win",
    "winner", "prize", "bonus", "gift",
    "click", "visit", "limited", "today",
    "now", "apply", "easy", "fast",
    "quick", "double", "triple", "guarantee"
}

# Keywords typically associated with ham (hand-picked from domain knowledge)
# Note: "visit" also appears in spam_words, so it increments both counts
ham_words = {
    "meeting", "lunch", "dinner", "home",
    "office", "work", "project", "report",
    "email", "phone", "call", "party",
    "movie", "game", "play", "music",
    "dance", "book", "read", "write",
    "paint", "draw", "travel", "trip",
    "visit", "family", "friend"
}

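# Build a 2-dimensional feature vector per message:
#   x[0]: number of tokens found in spam_words
#   x[1]: number of tokens found in ham_words
# Tokens come from a whitespace split, so punctuation stays attached
# (e.g. "free!" does not match "free").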
features, labels = [], []
for text, label in data:
    text = text.lower()
    words = text.split()
    x = [0, 0]
    for w in words:
        if w in spam_words:
            x[0] += 1
        if w in ham_words:
            x[1] += 1
    features.append(x)
    labels.append(label)

print("Number of features: ", len(features[0]))
print("Number of samples: ", len(features))

features = np.array(features)
labels = np.array(labels)
print("Shape of features: ", features.shape)
print("Shape of labels: ", labels.shape)

Number of features:  2
Number of samples:  5574
Shape of features:  (5574, 2)
Shape of labels:  (5574,)
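One caveat of the feature extraction above is that `text.split()` leaves punctuation attached to words, so a token such as `prize!` is not counted. The sketch below compares the whitespace split with a hypothetical regex-based variant on a made-up message. The names `demo_spam_words`, `count_hits_split`, and `count_hits_regex` are illustrative and not part of this notebook, and the regex tokenizer is just one possible alternative.

```python
import re

# Small illustrative keyword set (a subset of spam_words)
demo_spam_words = {"free", "win", "prize", "cash"}

def count_hits_split(text, keywords):
    """Count keyword hits using whitespace splitting, as in the cell above."""
    return sum(1 for w in text.lower().split() if w in keywords)

def count_hits_regex(text, keywords):
    """Count keyword hits after extracting runs of letters as tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return sum(1 for w in tokens if w in keywords)

message = "WIN a FREE prize! Claim your cash, now."  # made-up example
split_hits = count_hits_split(message, demo_spam_words)
regex_hits = count_hits_regex(message, demo_spam_words)
print("Whitespace split hits:", split_hits)
print("Regex token hits:", regex_hits)
```

Here the whitespace split misses `prize!` and `cash,`, while the regex variant counts them. The pattern `[a-z]+` is deliberately simple: it also breaks contractions such as "don't" into two tokens, so it is a trade-off rather than a strict improvement.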
