# Exercise: Use a foundation model to build a spam email classifier

A foundation model can serve as a building block for many different applications. In this exercise we use one to build a spam classifier with zero-shot learning: the model is prompted to label messages without any task-specific training. The goal is to identify unwanted and potentially harmful messages so they can be filtered out, improving user experience and security. The dataset used below contains SMS messages rather than emails, but the same approach applies to both.

## Steps

1. Identify and gather relevant data
2. Preprocess and prepare the data
3. "Train" and evaluate the spam email classifier




## Step 1: Identify and gather relevant data

To build and test the spam classifier, you need a dataset of messages that are labeled as spam or not spam. Choose a dataset that represents a wide range of both spam and non-spam messages.

In [ ]:
# Find a spam dataset at https://huggingface.co/datasets and load it using the datasets library

from datasets import load_dataset


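# Load the train split directly as a single Dataset
dataset = load_dataset("sms_spam", split="train")

# Inspect the first example: a dict with "sms" (message text) and "label" (1 = spam, 0 = not spam)
dataset[0]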


## Step 2: Preprocess and prepare the data

Once you have the dataset, you need to preprocess and prepare it for use with the model. This may involve cleaning the text, removing irrelevant information, and converting each example into a format that fits into a prompt.
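
As one possible sketch of this step, the helper below collapses whitespace in each message and converts the integer label into a readable string. The function name `prepare_example` and the two sample rows are hypothetical and made up for illustration. The field names `sms` and `label` match the dataset loaded above, and the mapping of 1 to spam is an assumption consistent with the evaluation code later in this exercise.

```python
def prepare_example(entry):
    """Return a cleaned copy of one dataset row (hypothetical helper)."""
    # Collapse runs of whitespace, including trailing newlines, into single spaces
    text = " ".join(entry["sms"].split())
    # Assumption: label 1 means spam and 0 means not spam
    label = "spam" if entry["label"] == 1 else "not spam"
    return {"sms": text, "label": label}


# Made-up rows that mimic the shape of the dataset, not real data
sample_rows = [
    {"sms": "Are we still  meeting at 6?\n", "label": 0},
    {"sms": "WINNER!! Claim your free prize now\n", "label": 1},
]

prepared = [prepare_example(row) for row in sample_rows]
for row in prepared:
    print(row)
```

The same function could be applied to each row while iterating over the real dataset.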

## Step 3: "Train" and evaluate the spam email classifier

With zero-shot learning there is no actual training step, which is why "train" is in quotes. Instead, you build a prompt that asks the foundation model to classify a batch of messages, send it to the model, and record the response. You then evaluate the classifier by comparing its answers with the labels in the dataset. This step helps you assess how accurately the classifier identifies spam.

In [ ]:
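# Start the prompt with the instruction for the model
query = """

Classify each of the following SMS messages as spam or not spam. Respond in JSON.

---

"""

# Use the first 30 messages of the dataset
item_numbers = range(30)

for item_number, entry in zip(item_numbers, dataset.select(item_numbers)):
    # Strip surrounding whitespace so each message ends with exactly one newline
    sms = entry["sms"].strip()

    query += f"{item_number}. {sms}\n"

# Print the prompt so it can be sent to a foundation model
print(query)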


In [ ]:
# Classification for each message, keyed by item number as a string
response = {
    "0": "not spam",
    "1": "not spam",
    "2": "spam",
    "3": "not spam",
    "4": "not spam",
    "5": "spam",
    "6": "not spam",
    "7": "not spam",
    "8": "spam",
    "9": "spam",
    "10": "not spam",
    "11": "spam",
    "12": "spam",
    "13": "not spam",
    "14": "not spam",
    "15": "spam",
    "16": "not spam",
    "17": "not spam",
    "18": "not spam",
    "19": "spam",
    "20": "not spam",
    "21": "not spam",
    "22": "not spam",
    "23": "not spam",
    "24": "not spam",
    "25": "not spam",
    "26": "not spam",
    "27": "not spam",
    "28": "not spam",
    "29": "not spam"
}
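
If you call a model programmatically rather than filling in the answer by hand, the reply arrives as text and has to be parsed first. The following is a minimal sketch, assuming the model returns a single JSON object that maps item numbers to labels, possibly wrapped in extra text. The function name `parse_response` and the sample reply are hypothetical; only the standard-library `json` module is used.

```python
import json


def parse_response(raw_text):
    """Extract a {item_number: label} dict from a model reply (hypothetical helper)."""
    # Assumption: the reply contains one JSON object, possibly surrounded by other text
    start = raw_text.find("{")
    end = raw_text.rfind("}")
    if start == -1 or end < start:
        raise ValueError("No JSON object found in the response")

    parsed = json.loads(raw_text[start:end + 1])

    # Normalize keys to strings and labels to lowercase without extra whitespace
    return {str(key): str(value).strip().lower() for key, value in parsed.items()}


# Made-up reply for illustration, not actual model output
raw_reply = 'Here is the result:\n{"0": "Not spam", "1": "spam "}'
parsed_reply = parse_response(raw_reply)
print(parsed_reply)
```

Normalizing the labels this way keeps the later string comparison against "spam" and "not spam" from failing on differences in capitalization or spacing.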


In [ ]:
# Estimate the accuracy of your classifier by comparing your responses to the labels in the dataset

correct = 0
total = 0

for item_number, entry in zip(item_numbers, dataset.select(item_numbers)):
    # The dataset label is 1 for spam and 0 for not spam
    expected = "spam" if entry["label"] else "not spam"

    if response[str(item_number)] == expected:
        correct += 1

    total += 1

print(f"Accuracy: {100 * correct / total:.1f}%")

Wow! That's pretty good for a classifier built without any training! It won't be correct for every example we throw at it, and 30 messages is a small sample, but it's a great start. The model is able to distinguish between spam and non-spam messages with a high degree of accuracy, which shows how a foundation model can be used to build a spam classifier with nothing more than a prompt.
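
Accuracy alone can be misleading when one class is much rarer than the other, which is often the case for spam. As a hedged sketch, the function below also reports precision and recall for the spam class. The name `spam_metrics` and the short label lists are hypothetical and made up for illustration; in this exercise the predictions would come from `response` and the expected labels from the dataset.

```python
def spam_metrics(predicted, expected):
    """Compute accuracy, precision, and recall for the "spam" class (hypothetical helper)."""
    pairs = list(zip(predicted, expected))
    true_pos = sum(p == "spam" and e == "spam" for p, e in pairs)
    false_pos = sum(p == "spam" and e != "spam" for p, e in pairs)
    false_neg = sum(p != "spam" and e == "spam" for p, e in pairs)
    correct = sum(p == e for p, e in pairs)

    return {
        "accuracy": correct / len(pairs) if pairs else 0.0,
        # Precision: of the messages flagged as spam, how many really were spam
        "precision": true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0,
        # Recall: of the real spam messages, how many were flagged
        "recall": true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0,
    }


# Made-up labels for illustration, not results from the exercise
predicted = ["spam", "not spam", "spam", "not spam"]
expected = ["spam", "not spam", "not spam", "spam"]
print(spam_metrics(predicted, expected))
```

A classifier that never flags spam can still score a high accuracy on an imbalanced sample, while its recall would be zero, so looking at all three numbers gives a fuller picture.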

In future lessons, we will explore methods for improving the performance of the classifier, but for now, let's move on to the next lesson. Great job!