## SMS Spam Detection using NLP

This project aims to classify SMS messages as either 'ham' (non-spam) or 'spam' using a neural network model built with TensorFlow and Keras.

In [3]:
# Importing Libraries

import tensorflow as tf
import pandas as pd
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
import numpy as np

## Downloading the Dataset

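We download the training and validation datasets as tab-separated files. The validation file serves as the test set throughout this notebook, which is why its path is stored in `test_file_path`.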


In [5]:
# get data files
!wget https://cdn.freecodecamp.org/project-data/sms/train-data.tsv
!wget https://cdn.freecodecamp.org/project-data/sms/valid-data.tsv


train_file_path = "train-data.tsv"
test_file_path = "valid-data.tsv"

--2024-05-19 09:23:23--  https://cdn.freecodecamp.org/project-data/sms/train-data.tsv
Resolving cdn.freecodecamp.org (cdn.freecodecamp.org)... 172.67.70.149, 104.26.3.33, 104.26.2.33, ...
Connecting to cdn.freecodecamp.org (cdn.freecodecamp.org)|172.67.70.149|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 358233 (350K) [text/tab-separated-values]
Saving to: ‘train-data.tsv.1’


2024-05-19 09:23:23 (14.9 MB/s) - ‘train-data.tsv.1’ saved [358233/358233]

--2024-05-19 09:23:23--  https://cdn.freecodecamp.org/project-data/sms/valid-data.tsv
Resolving cdn.freecodecamp.org (cdn.freecodecamp.org)... 172.67.70.149, 104.26.3.33, 104.26.2.33, ...
Connecting to cdn.freecodecamp.org (cdn.freecodecamp.org)|172.67.70.149|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 118774 (116K) [text/tab-separated-values]
Saving to: ‘valid-data.tsv.1’


2024-05-19 09:23:23 (6.53 MB/s) - ‘valid-data.tsv.1’ saved [118774/118774]



## Loading the Dataset

Next, we'll load the datasets into pandas DataFrames.

In [9]:
# Reading the datasets into pandas DataFrames

train_df = pd.read_csv(train_file_path, sep='\t', header=None, names=['label', 'message'])
test_df = pd.read_csv(test_file_path, sep='\t', header=None, names=['label', 'message'])

# Displaying the first few rows of the training dataset
train_df.head()

|   | label | message |
|---|-------|---------|
| 0 | ham | ahhhh...just woken up!had a bad dream about u ... |
| 1 | ham | you can never do nothing |
| 2 | ham | now u sound like manky scouse boy steve,like! ... |
| 3 | ham | mum say we wan to go then go... then she can s... |
| 4 | ham | never y lei... i v lazy... got wat? dat day ü ... |


## Preprocessing the Data

We'll map the labels 'ham' and 'spam' to 0 and 1, respectively, and then tokenize the text messages.

In [10]:
# Mapping label 'ham' to 0 and 'spam' to 1
label_map = {'ham': 0, 'spam': 1}
train_df['label'] = train_df['label'].map(label_map)
test_df['label'] = test_df['label'].map(label_map)
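
One thing to watch with this mapping: `Series.map` with a dictionary returns `NaN` for any value that is not a key, so an unexpected label would silently become a missing value. The sketch below shows one way to check for that and to look at the class counts. It uses a small made-up DataFrame (`toy_df` is hypothetical, not the real training data).

```python
import pandas as pd

# Hypothetical toy data standing in for train_df; labels and messages are made up.
toy_df = pd.DataFrame({
    "label": ["ham", "spam", "ham", "Spam"],  # note the unexpected capitalised "Spam"
    "message": ["see you at 5", "win a free prize now", "ok", "claim your reward"],
})

label_map = {"ham": 0, "spam": 1}
mapped = toy_df["label"].map(label_map)

# Series.map returns NaN for values that are missing from the dictionary.
unmapped_rows = toy_df[mapped.isna()]
print(unmapped_rows)

# Class counts for the rows that mapped cleanly.
class_counts = mapped.dropna().astype(int).value_counts().to_dict()
print(class_counts)
```

If `unmapped_rows` is empty on the real data, every label was either 'ham' or 'spam' and the mapping is safe to use as written.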

In [11]:
# Initializing the tokenizer and fitting on the training messages
tokenizer = Tokenizer()
tokenizer.fit_on_texts(train_df['message'])

In [12]:
# Converting texts to sequences
train_sequences = tokenizer.texts_to_sequences(train_df['message'])
test_sequences = tokenizer.texts_to_sequences(test_df['message'])

In [13]:
# Displaying the second sequence
train_sequences[1]

[3, 29, 281, 27, 340]
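
To make the numbers above less opaque: with its default settings, the Keras `Tokenizer` lowercases the text, strips punctuation, splits on whitespace, and gives each word an integer index starting at 1, with more frequent words getting smaller indices. Because no `oov_token` was set, words that were not seen during fitting are dropped when converting to sequences. The sketch below is a simplified pure-Python imitation of that behaviour, not the Keras implementation. The helper names (`tokenize`, `fit_word_index`, `to_sequence`) and the punctuation filter are assumptions made for illustration.

```python
import re
from collections import Counter

def tokenize(text):
    # Simplified filter: lowercase, turn punctuation into spaces, split on whitespace.
    return re.sub(r"[^\w\s']", " ", text.lower()).split()

def fit_word_index(texts):
    """Map each word to an integer rank, 1 being the most frequent word."""
    counts = Counter()
    for text in texts:
        counts.update(tokenize(text))
    # most_common() sorts by count; ties keep first-seen order.
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def to_sequence(text, word_index):
    # Words missing from the index are skipped, as with a Tokenizer without oov_token.
    return [word_index[w] for w in tokenize(text) if w in word_index]

toy_texts = ["you can never do nothing", "you can call me", "call me never"]
word_index = fit_word_index(toy_texts)
print(word_index)
print(to_sequence("You can never call!", word_index))   # known words only
print(to_sequence("nothing unseen here", word_index))   # unseen words are dropped
```

The same dropping of unseen words applies to the validation messages here, since the tokenizer is fitted on the training messages only.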

## Padding the Sequences

We need to pad the sequences to ensure uniform input size for the neural network.

In [14]:
# Padding sequences to ensure uniform input size
max_length = max(len(x) for x in train_sequences)
train_padded = pad_sequences(train_sequences, maxlen=max_length, padding='post')
test_padded = pad_sequences(test_sequences, maxlen=max_length, padding='post')

# Displaying the first padded sequence
train_padded[0]

array([3666,   36, 2482,   44,  142,    4,  401,  766,   78,    6,  725,
         23,    1,   93,   55,    6,  162,   19,    1,  460,   54,  176,
         78, 1615,  110,   24,    1,  314,  153,   44,   12,   14,    0,
          0,    0,    0, ...

(Output truncated. With `padding='post'`, every remaining position up to the padded length of 189 shown in the model summary is a zero.)
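
The padding step can be sketched in plain Python. With `padding='post'`, zeros are appended on the right, and index 0 is free for this purpose because the tokenizer's word indices start at 1. Note that `pad_sequences` truncates from the front by default (`truncating='pre'`), so a validation message longer than `max_length` loses its leading tokens. The helper `pad_post` below is a hypothetical stand-in that mimics this behaviour on lists; it is not the Keras function.

```python
def pad_post(sequences, maxlen, value=0):
    """Pad on the right with `value`; cut over-long sequences from the front.

    Assumes maxlen >= 1.
    """
    padded = []
    for seq in sequences:
        kept = list(seq)[-maxlen:]  # keep at most the last `maxlen` tokens
        padded.append(kept + [value] * (maxlen - len(kept)))
    return padded

toy_sequences = [[3, 29, 281, 27, 340], [7], [1, 2, 3, 4, 5, 6, 7, 8]]
toy_padded = pad_post(toy_sequences, maxlen=6)
for row in toy_padded:
    print(row)
```

Every row comes out with the same length, which is what the embedding layer needs.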

## Building the Model

We will define and compile the neural network model.

In [15]:
# Defining the model architecture
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=len(tokenizer.word_index) + 1, output_dim=32, input_length=max_length),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(48, activation='relu'),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(24, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid'),
])

# Compiling the model
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

# Displaying the model summary
model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 189, 32)           246176    
                                                                 
 dropout (Dropout)           (None, 189, 32)           0         
                                                                 
 global_average_pooling1d (  (None, 32)                0         
 GlobalAveragePooling1D)                                         
                                                                 
 dense (Dense)               (None, 48)                1584      
                                                                 
 dropout_1 (Dropout)         (None, 48)                0         
                                                                 
 dense_1 (Dense)             (None, 24)                1176      
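
The parameter counts in the summary can be checked by hand. An embedding layer has one weight per (vocabulary entry, embedding dimension) pair, and a dense layer has `inputs * units` weights plus `units` biases. The sketch below redoes that arithmetic. The summary above is cut off before the final `Dense(1)` layer, so its count here is derived from the model definition, not read from the printed output.

```python
embedding_dim = 32

# Implied by the embedding row of the summary: params / output_dim.
# This equals len(tokenizer.word_index) + 1 in the notebook.
vocab_size = 246176 // embedding_dim

def dense_params(inputs, units):
    # One weight per input-unit pair, plus one bias per unit.
    return inputs * units + units

embedding_params = vocab_size * embedding_dim
dense_count = dense_params(32, 48)      # after GlobalAveragePooling1D: 32 features
dense_1_count = dense_params(48, 24)
dense_2_count = dense_params(24, 1)     # final sigmoid layer, not shown in the summary

total_params = embedding_params + dense_count + dense_1_count + dense_2_count
print(vocab_size, embedding_params, dense_count, dense_1_count, dense_2_count, total_params)
```

Almost all of the parameters sit in the embedding table. The dropout and pooling layers contribute none.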
                                                        

## Training the Model

We will train the model using the padded sequences and the labels.

In [16]:
# Converting labels to numpy arrays
train_labels = np.array(train_df['label'])
test_labels = np.array(test_df['label'])

# Training the model
history = model.fit(train_padded, train_labels, epochs=10, validation_data=(test_padded, test_labels))


Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
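
The training log above does not show any metrics, and accuracy alone does not say how many spam messages are missed. The sketch below computes precision and recall for the spam class from predicted probabilities, using the same 0.5 threshold as the prediction function later in this notebook. The helper name `spam_precision_recall` and the toy numbers are made up for illustration. In the notebook, the probabilities would presumably come from something like `model.predict(test_padded).ravel()`.

```python
def spam_precision_recall(probabilities, labels, threshold=0.5):
    """Precision and recall for the positive (spam = 1) class."""
    predicted = [1 if p >= threshold else 0 for p in probabilities]
    tp = sum(1 for y_hat, y in zip(predicted, labels) if y_hat == 1 and y == 1)
    fp = sum(1 for y_hat, y in zip(predicted, labels) if y_hat == 1 and y == 0)
    fn = sum(1 for y_hat, y in zip(predicted, labels) if y_hat == 0 and y == 1)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Made-up probabilities and true labels (1 = spam, 0 = ham).
toy_probs = [0.97, 0.10, 0.60, 0.30, 0.05, 0.80]
toy_labels = [1, 0, 0, 1, 0, 1]

precision, recall = spam_precision_recall(toy_probs, toy_labels)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Here precision is the share of messages flagged as spam that really are spam, and recall is the share of true spam that was flagged.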


## Making Predictions

Finally, we will create a function to predict whether a given message is 'ham' or 'spam'.

In [17]:
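# Function to predict message type ('ham' or 'spam')
def predict_message(pred_text):
    """Return [spam probability, label] for a single message string."""
    # Apply the same tokenization and padding used for the training data
    sequence = tokenizer.texts_to_sequences([pred_text])
    padded = pad_sequences(sequence, maxlen=max_length, padding='post')
    # The sigmoid output is the model's estimated probability of spam
    prediction = model.predict(padded)[0][0]
    return [prediction, 'ham' if prediction < 0.5 else 'spam']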

# Testing the prediction function
pred_text = "You have won 100000 dollars in cash, click here to claim your prize"
prediction = predict_message(pred_text)
print(prediction)


[0.97822213, 'spam']
