This example walks through a basic binary classification problem, where the result is a yes-or-no answer.
We will build a model that identifies whether a given message is spam or not.

Steps:
1. Import Necessary Libraries
2. Create a Custom Dataset
3. PreProcess the Data
4. Build the Neural Network Model
5. Evaluate the Model


1. Import Necessary Libraries

In [33]:
import pandas as pd
import numpy as np
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

2. Create a Custom Dataset
Here we create a small custom dataset to use for the classification.

In [34]:
data = {
    'message': [
        "Congratulations, you've won a free ticket to the Bahamas! Call now!",
        "Hey, are we still meeting at the cafe tomorrow?",
        "Get cheap meds online, click here for a discount!",
        "Reminder: Your appointment is scheduled for next Wednesday.",
        "Win a $1000 gift card by completing this survey!",
        "Don't forget to submit the assignment by tonight.",
        "You've been selected for a special prize! Visit our website.",
        "Let's catch up over lunch this weekend.",
        "Limited time offer, buy one get one free!",
        "Your order has been shipped and will arrive by Friday."
    ],
    'label': ['spam', 'ham', 'spam', 'ham', 'spam', 'ham', 'spam', 'ham', 'spam', 'ham']
}

df = pd.DataFrame(data)


3. PreProcess the Data


In [35]:
df.head()

   message                                            label
0  Congratulations, you've won a free ticket to t...  spam
1  Hey, are we still meeting at the cafe tomorrow?    ham
2  Get cheap meds online, click here for a discount!  spam
3  Reminder: Your appointment is scheduled for ne...  ham
4  Win a $1000 gift card by completing this survey!   spam


In [36]:
# Encode labels
encoder = LabelEncoder()
df['label'] = encoder.fit_transform(df['label'])  # map the string labels to integers: 'ham' -> 0, 'spam' -> 1
df.head()

   message                                            label
0  Congratulations, you've won a free ticket to t...  1
1  Hey, are we still meeting at the cafe tomorrow?    0
2  Get cheap meds online, click here for a discount!  1
3  Reminder: Your appointment is scheduled for ne...  0
4  Win a $1000 gift card by completing this survey!   1
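
The mapping in the output above follows from LabelEncoder assigning integers to the classes in sorted order, so 'ham' becomes 0 and 'spam' becomes 1. A minimal pure-Python sketch of the same idea (the helper name fit_label_mapping is hypothetical and is not part of scikit-learn):

```python
def fit_label_mapping(labels):
    """Map each distinct label to an integer, with classes taken in sorted order."""
    classes = sorted(set(labels))
    return {label: index for index, label in enumerate(classes)}

labels = ['spam', 'ham', 'spam', 'ham']
mapping = fit_label_mapping(labels)                # 'ham' sorts before 'spam'
encoded = [mapping[label] for label in labels]     # string labels -> integers

# Invert the mapping to turn integers back into label strings
inverse = {index: label for label, index in mapping.items()}
decoded = [inverse[i] for i in encoded]

print(mapping, encoded, decoded)
```

In the notebook itself, encoder.classes_ and encoder.inverse_transform give the same information without any extra code.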


In [37]:
# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(df['message'], df['label'], test_size=0.2, random_state=42)

In [38]:
X_train  # the training messages (8 of the 10 texts); the remaining 2 are held out in X_test

5    Don't forget to submit the assignment by tonight.
0    Congratulations, you've won a free ticket to t...
7              Let's catch up over lunch this weekend.
2    Get cheap meds online, click here for a discount!
9    Your order has been shipped and will arrive by...
4     Win a $1000 gift card by completing this survey!
3    Reminder: Your appointment is scheduled for ne...
6    You've been selected for a special prize! Visi...
Name: message, dtype: object

In [39]:
# Tokenize text and convert to sequences
tokenizer = Tokenizer()

# Fit the tokenizer on the training text only. It builds a word -> integer vocabulary,
# which lets us convert text into numeric values, since the model cannot work with raw strings.
tokenizer.fit_on_texts(X_train)

X_train_seq = tokenizer.texts_to_sequences(X_train)  # integer sequences for the training messages
X_test_seq = tokenizer.texts_to_sequences(X_test)  # integer sequences for the test messages

# Pad sequences to ensure uniform length
maxlen = 50  # every sequence is padded (or truncated) to this many tokens
X_train_pad = pad_sequences(X_train_seq, padding='post', maxlen=maxlen)
X_test_pad = pad_sequences(X_test_seq, padding='post', maxlen=maxlen)
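
To make the tokenize-then-pad step less opaque, here is a rough pure-Python sketch of what it does. It is a simplification under stated assumptions, not the Keras implementation: the helper names are hypothetical, the filter string is assumed to mirror the Tokenizer default, and words that were never seen during fitting are simply dropped.

```python
from collections import Counter

# Characters replaced by spaces before splitting (assumed to mirror the Keras default filters).
# The apostrophe is not in the list, which is why "you've" stays a single token.
FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'

def split_words(text):
    table = str.maketrans({ch: " " for ch in FILTERS})
    return text.lower().translate(table).split()

def build_word_index(texts):
    counts = Counter()
    for text in texts:
        counts.update(split_words(text))
    # Most frequent words get the smallest indices; index 0 is left free for padding
    ordered = sorted(counts, key=lambda word: -counts[word])
    return {word: i for i, word in enumerate(ordered, start=1)}

def to_sequence(text, word_index):
    # Words missing from the vocabulary are skipped
    return [word_index[w] for w in split_words(text) if w in word_index]

def pad_post(seq, maxlen):
    seq = seq[-maxlen:]                      # keep at most the last maxlen tokens
    return seq + [0] * (maxlen - len(seq))   # fill the tail with zeros

word_index = build_word_index(["Win a prize", "a free prize now"])
sequence = to_sequence("Win a free car!", word_index)   # "car" is unknown and dropped
padded = pad_post(sequence, 6)
print(word_index, sequence, padded)
```

Because the vocabulary is built from X_train only, any test-set word that never appears in training is silently dropped in the same way.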

In [40]:
# Integer index assigned to each word in the training vocabulary
tokenizer.index_word

{1: 'a',
 2: 'by',
 3: 'for',
 4: 'to',
 5: 'the',
 6: "you've",
 7: 'this',
 8: 'your',
 9: 'been',
 10: "don't",
 11: 'forget',
 12: 'submit',
 13: 'assignment',
 14: 'tonight',
 15: 'congratulations',
 16: 'won',
 17: 'free',
 18: 'ticket',
 19: 'bahamas',
 20: 'call',
 21: 'now',
 22: "let's",
 23: 'catch',
 24: 'up',
 25: 'over',
 26: 'lunch',
 27: 'weekend',
 28: 'get',
 29: 'cheap',
 30: 'meds',
 31: 'online',
 32: 'click',
 33: 'here',
 34: 'discount',
 35: 'order',
 36: 'has',
 37: 'shipped',
 38: 'and',
 39: 'will',
 40: 'arrive',
 41: 'friday',
 42: 'win',
 43: '1000',
 44: 'gift',
 45: 'card',
 46: 'completing',
 47: 'survey',
 48: 'reminder',
 49: 'appointment',
 50: 'is',
 51: 'scheduled',
 52: 'next',
 53: 'wednesday',
 54: 'selected',
 55: 'special',
 56: 'prize',
 57: 'visit',
 58: 'our',
 59: 'website'}

In [41]:
# What are these numbers?
# Each one is a word index from the tokenizer, e.g. 10 = don't, 11 = forget (see the cell above for reference).
# The trailing zeros are padding, added so that every sequence has the same length, which the model requires.
X_train_pad[:2] # showing first two records of X_train data

array([[10, 11,  4, 12,  5, 13,  2, 14,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0],
       [15,  6, 16,  1, 17, 18,  4,  5, 19, 20, 21,  0,  0,  0,  0,  0,
         0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0]])
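
As a quick sanity check, a padded row can be decoded back into words. The sketch below copies only the needed entries from the tokenizer.index_word output shown earlier; the decode helper is hypothetical and written just for illustration.

```python
# Subset of tokenizer.index_word, copied from the output above
index_word = {2: 'by', 4: 'to', 5: 'the', 10: "don't", 11: 'forget',
              12: 'submit', 13: 'assignment', 14: 'tonight'}

def decode(padded_row, index_word):
    """Turn a padded row of word indices back into text, skipping the 0 padding."""
    return ' '.join(index_word[i] for i in padded_row if i != 0)

# First row of X_train_pad: 8 word indices followed by 42 zeros (maxlen = 50)
first_row = [10, 11, 4, 12, 5, 13, 2, 14] + [0] * 42
decoded = decode(first_row, index_word)
print(decoded)
```

The result is the first training message in lowercase with punctuation removed, which confirms that the padding zeros carry no word information.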

4. Build the Neural Network Model

Explanation:
Data Preparation: We encode the labels ('spam' and 'ham') into numerical form using LabelEncoder. Text messages are tokenized using the Keras Tokenizer and then padded so that every input to the neural network has the same length.

Model Building:

An Embedding layer maps each word index in a sequence to a dense vector of fixed size.
GlobalAveragePooling1D averages those vectors over the sequence dimension, producing one fixed-size vector per message.
Two Dense layers follow: a hidden layer with relu activation and a single-unit output layer with sigmoid activation for classification.

Training and Evaluation: The model is trained using binary_crossentropy as the loss function and the adam optimizer. Accuracy is tracked during training and then evaluated on the test set.

Adjust the model architecture, tokenizer parameters, and number of training epochs as needed based on performance and specific requirements. This example provides a basic framework for text classification using TensorFlow on the small custom dataset created above.
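
The data flow described above can be traced by hand on a toy input. The following is a pure-Python sketch with made-up numbers, not the Keras implementation: it uses 2-dimensional embeddings, skips the hidden relu layer, and all weights are hypothetical.

```python
import math

def global_average_pool(vectors):
    """Average a list of equal-length vectors element-wise."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Toy embedding table: word index -> 2-d vector. Index 0 is the padding index;
# it is set to zeros here for simplicity, whereas in Keras it has its own learned vector.
embedding = {0: [0.0, 0.0], 1: [1.0, 2.0], 2: [3.0, 4.0]}

sequence = [1, 2, 0, 0]                       # two words followed by two padding positions
vectors = [embedding[i] for i in sequence]    # embedding lookup
pooled = global_average_pool(vectors)         # one vector for the whole message

weights, bias = [0.5, -0.5], 0.0              # hypothetical output-layer parameters
logit = sum(w * p for w, p in zip(weights, pooled)) + bias
prob = sigmoid(logit)                         # value in (0, 1), read as P(spam)
print(pooled, prob)
```

Note that the padding positions are included in the average. With maxlen = 50 and messages of roughly ten words, most of what gets averaged is padding, which is one reason a shorter maxlen may be worth trying on data like this.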

In [45]:
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=len(tokenizer.word_index)+1, output_dim=64, input_length=maxlen),  # map each integer token index to a 64-dimensional dense vector (+1 because index 0 is reserved for padding)
    tf.keras.layers.GlobalAveragePooling1D(),  # average the embeddings over the sequence dimension: one 64-d vector per message
    tf.keras.layers.Dense(64, activation='relu'),  # hidden layer with relu activation
    tf.keras.layers.Dense(1, activation='sigmoid')  # output layer: a single value in (0, 1), the predicted probability of spam
])

model.compile(optimizer='adam',
              loss='binary_crossentropy',  # standard loss for a two-class problem with a sigmoid output
              metrics=['accuracy'])  # report accuracy during training and evaluation

model.summary()
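
The summary output is not reproduced here, but the parameter counts it reports can be worked out by hand from the layer sizes. The sketch below assumes the 59-word vocabulary shown in the tokenizer.index_word output above.

```python
vocab_size = 59 + 1       # 59 words in tokenizer.index_word, plus index 0 for padding
embedding_dim = 64
hidden_units = 64

embedding_params = vocab_size * embedding_dim                 # one 64-d vector per index
hidden_params = embedding_dim * hidden_units + hidden_units   # weights + biases
output_params = hidden_units * 1 + 1                          # weights + bias of the sigmoid unit
total_params = embedding_params + hidden_params + output_params

print(embedding_params, hidden_params, output_params, total_params)
```

GlobalAveragePooling1D has no trainable parameters, so it does not appear in the total. Roughly eight thousand parameters fitted to eight training messages is a strong hint that the model can memorize the training set without generalizing.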

5. Evaluate the Model

In [53]:
# Train the model
history = model.fit(X_train_pad,  # padded training sequences
                    y_train,  # labels: 1 = spam, 0 = ham
                    epochs=15,  # make 15 passes over the training data
                    batch_size=32,  # up to 32 samples per gradient update
                    # For more on the fit parameters, see Tutorials\4 ML-Models\1 Train a Model.md (section "Model fit params")
                    validation_split=0.2,  # hold out the last 20% of the training data for validation
                    verbose=1
                    )

Epoch 1/15
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 62ms/step - accuracy: 1.0000 - loss: 0.6636 - val_accuracy: 1.0000 - val_loss: 0.6865
Epoch 2/15
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 38ms/step - accuracy: 1.0000 - loss: 0.6615 - val_accuracy: 1.0000 - val_loss: 0.6860
Epoch 3/15
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 44ms/step - accuracy: 1.0000 - loss: 0.6593 - val_accuracy: 1.0000 - val_loss: 0.6856
Epoch 4/15
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 41ms/step - accuracy: 1.0000 - loss: 0.6570 - val_accuracy: 1.0000 - val_loss: 0.6851
Epoch 5/15
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 43ms/step - accuracy: 1.0000 - loss: 0.6546 - val_accuracy: 1.0000 - val_loss: 0.6846
Epoch 6/15
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 43ms/step - accuracy: 1.0000 - loss: 0.6521 - val_accuracy: 1.0000 - val_loss: 0.6841
Epoch 7/15
1/1 ━━━━━━━━━━━━━━━━━━

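The log above shows an accuracy of 1.0000 alongside a loss of about 0.66, which can look contradictory. Binary cross-entropy rewards confident correct predictions, while accuracy only checks which side of 0.5 a prediction falls on. A small pure-Python sketch of the loss makes the difference visible (the probabilities are made up for illustration and are not taken from the model):

```python
import math

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy over a batch of (label, predicted probability) pairs."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)   # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

labels = [1, 0]
chance = binary_crossentropy(labels, [0.5, 0.5])            # always predicting 0.5 gives ln 2
slightly_right = binary_crossentropy(labels, [0.52, 0.48])  # correct side of 0.5, but barely
confident = binary_crossentropy(labels, [0.95, 0.05])       # correct and confident

print(round(chance, 4), round(slightly_right, 4), round(confident, 4))
```

A loss that stays close to ln 2 (about 0.693) means the predicted probabilities are still hovering near 0.5, even when every one of them happens to land on the correct side of the threshold.
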
In [54]:
# Evaluate the model
loss, accuracy = model.evaluate(X_test_pad, y_test)
print(f'Accuracy: {accuracy*100:.2f}%')

1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 17ms/step - accuracy: 0.5000 - loss: 0.6835
Accuracy: 50.00%
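
With test_size=0.2 on 10 messages, the test set holds only 2 messages, so the accuracy can only take the values 0%, 50%, or 100%. A result of 50% therefore means that one of the two test messages was classified correctly. The sketch below shows how accuracy is derived from predicted probabilities with a 0.5 threshold; the probabilities are hypothetical and are not the model's actual outputs.

```python
def binary_accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of samples whose thresholded prediction matches the label."""
    preds = [1 if p > threshold else 0 for p in y_prob]
    correct = sum(1 for y, pred in zip(y_true, preds) if y == pred)
    return correct / len(y_true)

# Two test messages: one spam (1), one ham (0), with made-up predicted probabilities
acc = binary_accuracy([1, 0], [0.51, 0.52])

# With n test samples, accuracy can only be a multiple of 1/n
possible_values = sorted({k / 2 for k in range(3)})
print(acc, possible_values)
```

A test set this small cannot distinguish a useful model from a coin flip, so a larger dataset is needed before the accuracy figure says much about the model.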
