
In this challenge, you need to create a machine learning model that will classify SMS messages as either "ham" or "spam". A "ham" message is a normal message sent by a friend. A "spam" message is an advertisement or a message sent by a company.

In [None]:
# import libraries
try:
  # Install the nightly TensorFlow build; the `!` shell escape needs a notebook (Colab/Jupyter).
  !pip install tf-nightly
except Exception:
  pass
import tensorflow as tf
import pandas as pd
from tensorflow import keras
!pip install tensorflow-datasets
import tensorflow_datasets as tfds
import numpy as np
import matplotlib.pyplot as plt

print(tf.__version__)

2.20.0-dev20250516


In [None]:
# get data files
!wget https://cdn.freecodecamp.org/project-data/sms/train-data.tsv
!wget https://cdn.freecodecamp.org/project-data/sms/valid-data.tsv

train_file_path = "train-data.tsv"
test_file_path = "valid-data.tsv"

--2025-05-17 05:06:48--  https://cdn.freecodecamp.org/project-data/sms/train-data.tsv
Resolving cdn.freecodecamp.org (cdn.freecodecamp.org)... 104.26.3.33, 172.67.70.149, 104.26.2.33, ...
Connecting to cdn.freecodecamp.org (cdn.freecodecamp.org)|104.26.3.33|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 358233 (350K) [text/tab-separated-values]
Saving to: ‘train-data.tsv.1’


2025-05-17 05:06:48 (12.8 MB/s) - ‘train-data.tsv.1’ saved [358233/358233]

--2025-05-17 05:06:48--  https://cdn.freecodecamp.org/project-data/sms/valid-data.tsv
Resolving cdn.freecodecamp.org (cdn.freecodecamp.org)... 104.26.3.33, 172.67.70.149, 104.26.2.33, ...
Connecting to cdn.freecodecamp.org (cdn.freecodecamp.org)|104.26.3.33|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 118774 (116K) [text/tab-separated-values]
Saving to: ‘valid-data.tsv.1’


2025-05-17 05:06:49 (8.08 MB/s) - ‘valid-data.tsv.1’ saved [118774/118774]



In [None]:
train_data = pd.read_csv(train_file_path, sep="\t", names=["label", "text"])
test_data = pd.read_csv(test_file_path, sep="\t", names=["label", "text"])

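Encode labels with `StringLookup`.

With the vocabulary `["ham", "spam"]` and no out-of-vocabulary (OOV) slots, this maps:

*   "ham" → 0
*   "spam" → 1

`num_oov_indices` controls how values outside the vocabulary are handled:

* `0`: Unknown values are not allowed; only the fixed vocabulary is accepted.
* `1` or more: Unknown values are mapped to that many reserved OOV indices, which come before the vocabulary entries and shift their indices up.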
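To make the mapping concrete, here is a plain-Python sketch of what a fixed-vocabulary lookup does. It is only an illustration, not how Keras implements `StringLookup`; the helper name `encode_labels` is hypothetical, and sending every unknown value to the first OOV slot is a simplification.

```python
def encode_labels(labels, vocabulary, num_oov_indices=0):
    """Map string labels to integer ids, mimicking a fixed-vocabulary lookup."""
    # OOV slots (if any) take the lowest indices, so vocabulary ids start after them.
    index = {token: i + num_oov_indices for i, token in enumerate(vocabulary)}
    encoded = []
    for label in labels:
        if label in index:
            encoded.append(index[label])
        elif num_oov_indices > 0:
            # Simplification: every unknown value goes to the first OOV slot.
            encoded.append(0)
        else:
            # With no OOV slots, an unknown value is an error.
            raise ValueError(f"Unknown label: {label!r}")
    return encoded


print(encode_labels(["ham", "spam", "ham"], ["ham", "spam"]))  # [0, 1, 0]
```

With `num_oov_indices=0`, a typo in a label file fails loudly instead of silently becoming a third class, which is usually what you want for labels.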

In [None]:
label_lookup = tf.keras.layers.StringLookup(
    vocabulary=["ham", "spam"], num_oov_indices=0
)

train_labels_tensor = tf.constant(train_data.label.values)
train_labels_encoded = label_lookup(train_labels_tensor)

Let's calculate the number of words per message to determine `output_sequence_length`.
You can choose `output_sequence_length` in one of these ways:
*   Mean + 1 standard deviation
*   A value that covers 95% of messages (e.g., `quantile(0.95)`)
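The two rules can give different lengths. The sketch below compares them on five illustrative token counts, using only the standard library. The helper `linear_quantile` is hypothetical; it interpolates linearly between neighbouring values, which is also the default behaviour of pandas `quantile`.

```python
import statistics


def linear_quantile(values, q):
    """Quantile with linear interpolation between neighbouring sorted values."""
    ordered = sorted(values)
    position = q * (len(ordered) - 1)
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])


# Illustrative token counts for five messages.
token_counts = [30, 5, 22, 17, 18]

# Rule 1: mean plus one (sample) standard deviation.
mean_plus_std = statistics.mean(token_counts) + statistics.stdev(token_counts)
# Rule 2: the 95th percentile.
q95 = linear_quantile(token_counts, 0.95)

print(int(mean_plus_std), int(q95))  # 27 28
```

Messages longer than the chosen length are truncated, so a higher value keeps more text at the cost of more padding for short messages.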





In [None]:
train_data["num_tokens"] = train_data["text"].apply(lambda x: len(x.split()))
output_sequence_len = int(train_data.num_tokens.quantile(0.95))

In [None]:
train_data.head(5)

|   | label | text | num_tokens |
|---|-------|------|------------|
| 0 | ham | ahhhh...just woken up!had a bad dream about u ... | 30 |
| 1 | ham | you can never do nothing | 5 |
| 2 | ham | now u sound like manky scouse boy steve,like! ... | 22 |
| 3 | ham | mum say we wan to go then go... then she can s... | 17 |
| 4 | ham | never y lei... i v lazy... got wat? dat day ü ... | 18 |


Let's count the number of unique words to determine `max_tokens`, the vocabulary size limit for `TextVectorization`.




In [None]:
all_words = " ".join(train_data.text.str.lower().values).split()
unique_words = set(all_words)
print(f"There are {len(unique_words)} unique words in the dataset.")

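# Vocabulary cap for TextVectorization: 90% of the number of unique words.
# (Despite the name, this is a vocabulary size, not a per-message token limit.)
max_tokens_per_message = int(0.90 * len(unique_words))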

There are 11330 unique words in the dataset.
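Taking 90% of the distinct words is not the same as covering 90% of the word occurrences in the messages, because a few words are very frequent and many appear only once. As a rough alternative, the sketch below counts how many of the most frequent words are needed to reach a target share of all occurrences. The function name `vocab_size_for_coverage` and the sample messages are made up for illustration.

```python
from collections import Counter


def vocab_size_for_coverage(texts, coverage=0.95):
    """Smallest number of most-frequent words that covers `coverage` of all tokens."""
    counts = Counter(word for text in texts for word in text.lower().split())
    total = sum(counts.values())
    running = 0
    for size, (_, frequency) in enumerate(counts.most_common(), start=1):
        running += frequency
        if running / total >= coverage:
            return size
    return len(counts)


sample_texts = ["a a a b", "a b c d"]
print(vocab_size_for_coverage(sample_texts, coverage=0.75))  # 2
```

In this toy corpus, two of the four distinct words already account for 75% of all tokens.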


Encode text with `TextVectorization`

In [None]:
vectorizer = tf.keras.layers.TextVectorization(
    max_tokens=max_tokens_per_message,  # Vocabulary cap: keep only the most frequent words.
    output_mode="int",
    output_sequence_length=output_sequence_len,
)

vectorizer.adapt(train_data.text.values)

train_text_tensor = tf.constant(train_data.text.values)
train_text_vectorized = vectorizer(train_text_tensor)
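To show what the integer encoding looks like, here is a simplified plain-Python sketch of the same idea: lowercase the text, strip punctuation, split on whitespace, map each word to an id, and pad or truncate to a fixed length. It assumes id 0 is reserved for padding and id 1 for unknown words, as in the layer's `"int"` output mode. The helper names are hypothetical and the punctuation handling is only an approximation of the layer's default.

```python
import string
from collections import Counter

PAD_ID, OOV_ID = 0, 1  # ids reserved for padding and unknown words


def standardize(text):
    """Lowercase the text and drop ASCII punctuation."""
    return text.lower().translate(str.maketrans("", "", string.punctuation))


def build_vocab(texts, max_tokens):
    """Give ids 2, 3, ... to the most frequent words, max_tokens ids in total."""
    counts = Counter(word for text in texts for word in standardize(text).split())
    most_frequent = [word for word, _ in counts.most_common(max_tokens - 2)]
    return {word: i + 2 for i, word in enumerate(most_frequent)}


def encode(text, vocab, sequence_length):
    """Turn one message into a fixed-length list of ids (truncate or pad with 0)."""
    ids = [vocab.get(word, OOV_ID) for word in standardize(text).split()]
    ids = ids[:sequence_length]
    return ids + [PAD_ID] * (sequence_length - len(ids))


messages = ["Free prize! Call now", "call me now", "CALL back"]
vocab = build_vocab(messages, max_tokens=4)  # room for two real words
print(encode("Call me now!!!", vocab, sequence_length=5))  # [2, 1, 3, 0, 0]
```

Here "me" falls outside the two-word vocabulary, so it becomes the unknown id 1, and the two trailing zeros are padding.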

In [None]:
test_labels_tensor = tf.constant(test_data.label.values)
test_labels_encoded = label_lookup(test_labels_tensor)

test_text_tensor = tf.constant(test_data.text.values)
test_text_vectorized = vectorizer(test_text_tensor)

Transform the data into a structure that TensorFlow can train on: `tf.data.Dataset`.

`.shuffle(buffer_size=1000)`: Shuffles the training data to prevent the model from learning a specific order.

`.batch(32)`: Divides the data into batches of 32 examples. This makes training more efficient.

`.prefetch(tf.data.AUTOTUNE)`: Loads the next batches in the background while the model trains the current one, speeding up the process.
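The effect of the shuffle buffer and of batching can be sketched in plain Python. This is a simplified model of the idea, not the `tf.data` implementation: the buffer is filled first, then each incoming item replaces a randomly chosen buffered item, which is emitted. The function names `buffered_shuffle` and `make_batches` are hypothetical.

```python
import random


def buffered_shuffle(items, buffer_size, seed=0):
    """Shuffle a stream using a fixed-size buffer (buffer_size must be >= 1)."""
    rng = random.Random(seed)
    buffer = []
    for item in items:
        if len(buffer) < buffer_size:
            buffer.append(item)  # still filling the buffer
            continue
        slot = rng.randrange(buffer_size)
        yield buffer[slot]       # emit a random buffered item
        buffer[slot] = item      # and put the new item in its place
    rng.shuffle(buffer)          # drain what is left in random order
    yield from buffer


def make_batches(items, batch_size):
    """Group a stream into lists of batch_size items; the last one may be shorter."""
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch


batches = list(make_batches(buffered_shuffle(range(10), buffer_size=4), batch_size=3))
print([len(batch) for batch in batches])  # [3, 3, 3, 1]
```

When the buffer is smaller than the dataset, the first items emitted can only come from the start of the data, so the shuffle is only approximate. A buffer at least as large as the dataset gives a full shuffle.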

In [None]:
train_dataset = tf.data.Dataset.from_tensor_slices((train_text_vectorized, train_labels_encoded))
train_dataset = train_dataset.shuffle(buffer_size=1000).batch(32).prefetch(tf.data.AUTOTUNE)

In [None]:
test_dataset = tf.data.Dataset.from_tensor_slices((test_text_vectorized, test_labels_encoded))
test_dataset = test_dataset.batch(32).prefetch(tf.data.AUTOTUNE)

Create and compile the model.

`Embedding(...)`: Converts each integer (word id) into a dense vector (a learned representation of the word).

`GlobalAveragePooling1D()`: Reduces each sequence to a single fixed-size vector (the average of its embeddings), so the layers that follow do not depend on the sequence length.

`Dense(16, activation='relu')`: Hidden layer with ReLU activation that learns useful patterns from the text.

`Dense(1, activation='sigmoid')`: Output layer. `sigmoid` returns a value between 0 and 1, which can be read as the probability of "spam" and suits binary classification.
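The pooling and output steps can be illustrated with a few lines of plain Python. This is a toy sketch with a made-up 2-dimensional embedding table, not the Keras layers themselves; it shows that sequences of different lengths pool to vectors of the same size, and that the sigmoid squashes any score into the range 0 to 1.

```python
import math


def global_average_pool(embedded_sequence):
    """Average a list of equal-sized vectors, one output value per dimension."""
    length = len(embedded_sequence)
    return [sum(column) / length for column in zip(*embedded_sequence)]


def sigmoid(x):
    """Squash a real-valued score into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


# Toy embedding table (made-up numbers), indexed by token id.
embedding_table = {0: [0.0, 0.0], 1: [0.5, -0.5], 2: [1.0, 2.0], 3: [3.0, 4.0]}

short_message = [2, 3]
long_message = [2, 3, 1, 1]

for token_ids in (short_message, long_message):
    pooled = global_average_pool([embedding_table[t] for t in token_ids])
    print(len(pooled), pooled)  # both pooled vectors have 2 entries

print(sigmoid(0.0))  # 0.5
```

One detail to keep in mind: in the notebook every message is padded to the same length, and `Embedding` does not mask padding unless `mask_zero=True` is set, so the padded positions also enter the average.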

In [None]:
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=vectorizer.vocabulary_size(), output_dim=16),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid")
])

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
    loss="binary_crossentropy",
    metrics=["accuracy"]
)

In [None]:
model.fit(train_dataset, validation_data=test_dataset, epochs=30)

Epoch 1/30
131/131 ━━━━━━━━━━━━━━━━━━━━ 1s 6ms/step - accuracy: 0.9976 - loss: 0.0095 - val_accuracy: 0.9864 - val_loss: 0.0457
Epoch 2/30
131/131 ━━━━━━━━━━━━━━━━━━━━ 1s 6ms/step - accuracy: 0.9984 - loss: 0.0079 - val_accuracy: 0.9864 - val_loss: 0.0455
Epoch 3/30
131/131 ━━━━━━━━━━━━━━━━━━━━ 1s 6ms/step - accuracy: 0.9986 - loss: 0.0076 - val_accuracy: 0.9856 - val_loss: 0.0484
Epoch 4/30
131/131 ━━━━━━━━━━━━━━━━━━━━ 1s 6ms/step - accuracy: 0.9987 - loss: 0.0064 - val_accuracy: 0.9856 - val_loss: 0.0479
Epoch 5/30
131/131 ━━━━━━━━━━━━━━━━━━━━ 1s 6ms/step - accuracy: 0.9984 - loss: 0.0058 - val_accuracy: 0.9885 - val_loss: 0.0424
Epoch 6/30
131/131 ━━━━━━━━━━━━━━━━━━━━ 1s 8ms/step - accuracy: 0.9990 - loss: 0.0052 - val_accuracy: 0.9885 - val_loss: 0.0423
Epoch 7/30
131/131

<keras.src.callbacks.history.History at 0x7c9a64ebc090>

In [None]:
# function to predict messages based on model
# (should return list containing prediction and label, ex. [0.008318834938108921, 'ham'])
def predict_message(pred_text):
  # Vectorize the message as a batch of one, then predict the spam probability.
  prediction = model.predict(vectorizer([pred_text]))
  pred_value = prediction[0].item()

  # Threshold the sigmoid output at 0.5: above is "spam", otherwise "ham".
  label = "spam" if pred_value > 0.5 else "ham"

  return [pred_value, label]

pred_text = "how are you doing today"

prediction = predict_message(pred_text)
print(prediction)

1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 51ms/step
[7.337473653024063e-05, 'ham']


In [None]:
loss, accuracy = model.evaluate(test_text_vectorized, test_labels_encoded, verbose=2)

44/44 - 0s - 3ms/step - accuracy: 0.9885 - loss: 0.0573


In [None]:
# Run this cell to test your function and model. Do not modify contents.
def test_predictions():
  test_messages = ["how are you doing today",
                   "sale today! to stop texts call 98912460324",
                   "i dont want to go. can we try it a different day? available sat",
                   "our new mobile video service is live. just install on your phone to start watching.",
                   "you have won £1000 cash! call to claim your prize.",
                   "i'll bring it tomorrow. don't forget the milk.",
                   "wow, is your arm alright. that happened to me one time too"
                  ]

  test_answers = ["ham", "spam", "ham", "spam", "spam", "ham", "ham"]
  passed = True

  for msg, ans in zip(test_messages, test_answers):
    prediction = predict_message(msg)
    if prediction[1] != ans:
      passed = False

  if passed:
    print("You passed the challenge. Great job!")
  else:
    print("You haven't passed yet. Keep trying.")

test_predictions()


1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 56ms/step
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 41ms/step
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 42ms/step
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 39ms/step
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 41ms/step
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 46ms/step
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 40ms/step
You passed the challenge. Great job!
