<a href="https://colab.research.google.com/github/valenlopez993/SMS_Classifier/blob/main/SMS_Classifier.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Imports

In [None]:
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Embedding, LSTM
from tensorflow.keras.utils import pad_sequences

from tensorflow.keras.preprocessing.text import Tokenizer

import tensorflow_datasets as tfds

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Preprocessing the Data

In [None]:
import sys

if 'google.colab' in sys.modules:
    !wget https://raw.githubusercontent.com/valenlopez993/SMS_Classifier/main/train-data.tsv
    !wget https://raw.githubusercontent.com/valenlopez993/SMS_Classifier/main/valid-data.tsv

train_file_path = "train-data.tsv"
test_file_path = "valid-data.tsv"

First, two data frames are created: one containing the training data and the other containing the test data.

In [None]:
df_train = pd.read_csv(train_file_path, sep='\t', header=None, names=['ham_spam', 'mail'])
df_test = pd.read_csv(test_file_path, sep='\t', header=None, names=['ham_spam', 'mail'])

This project classifies each message as either **ham** or **spam**. The labels are first separated from the message text and then converted to numerical codes.

In [None]:
df_train_labels = df_train.pop('ham_spam')
df_test_labels = df_test.pop('ham_spam')

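# Reuse the training categories for the test labels so both sets share the same codes
df_train_labels = pd.Categorical(df_train_labels)
df_test_labels = pd.Categorical(df_test_labels, categories=df_train_labels.categories)

train_labels = df_train_labels.codes
test_labels = df_test_labels.codes

# .categories is ordered by code (0 -> 'ham', 1 -> 'spam'); .unique() follows order of appearance
categories = df_train_labels.categories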
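
As a rough illustration of what this encoding does, the pure-Python sketch below mimics how `pd.Categorical` assigns codes to string labels: the distinct labels are sorted, and each code is the position of its label in that sorted list. The sample labels and all variable names are made up for the example.

```python
# Hypothetical sample labels, not taken from the dataset
sample_labels = ["spam", "ham", "ham", "spam", "ham"]

# The sorted distinct labels play the role of `.categories`
sample_categories = sorted(set(sample_labels))

# Each label is replaced by the position of its category
code_of = {label: code for code, label in enumerate(sample_categories)}
sample_codes = [code_of[label] for label in sample_labels]

# Indexing the categories with a code recovers the original label
decoded = [sample_categories[code] for code in sample_codes]

print(sample_categories)  # ['ham', 'spam']
print(sample_codes)       # [1, 0, 0, 1, 0]
```

Note that the category list is ordered by code, not by the order in which the labels first appear, which is what makes it safe to index with a predicted code later on.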

The model cannot take strings as input, so each word has to be encoded as an integer. There are several ways to do that; here the `Tokenizer` class from Keras is used.

In [None]:
# Fit the tokenizer on the training messages only
train_to_tokenize = df_train['mail'].to_numpy()
train_tokenizer = Tokenizer()
train_tokenizer.fit_on_texts(train_to_tokenize)
train_tokenized = train_tokenizer.texts_to_sequences(train_to_tokenize)

# Encode the test messages with the same tokenizer so word indices match the training data
test_to_tokenize = df_test['mail'].to_numpy()
test_tokenized = train_tokenizer.texts_to_sequences(test_to_tokenize)

In [None]:
word_index = train_tokenizer.word_index
vocab_size = len(word_index) + 1
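
To make the idea of a word index concrete, here is a simplified pure-Python sketch of what fitting a tokenizer and encoding texts amounts to. It is not the Keras implementation: it only lowercases and splits on whitespace, whereas the real `Tokenizer` also strips punctuation. The helper names `fit_word_index` and `encode_texts` and the sample messages are hypothetical.

```python
from collections import Counter

def fit_word_index(texts):
    """Map each word to an integer, most frequent word first, starting at 1."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def encode_texts(texts, word_index):
    """Replace each known word by its index; unknown words are dropped."""
    return [[word_index[w] for w in text.lower().split() if w in word_index]
            for text in texts]

# Hypothetical training messages
sample_train = ["free prize call now", "call me later"]
sample_index = fit_word_index(sample_train)
sample_train_seqs = encode_texts(sample_train, sample_index)

# New text is encoded with the index built from the training messages
sample_new_seqs = encode_texts(["call now for a free prize"], sample_index)

# Index 0 is never assigned to a word, so it stays free for padding
sample_vocab_size = len(sample_index) + 1

print(sample_train_seqs)  # [[2, 3, 1, 4], [1, 5, 6]]
print(sample_new_seqs)    # [[1, 4, 2, 3]]
```

The sketch also shows why the vocabulary size is the length of the word index plus one, and why any text seen after training should be encoded with the index built from the training data.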

Another important point is the size of each input. The model must be fed sequences that all have the same length, but messages vary in length, so a `maxLength` is defined:

- if a message is longer than 255 words, the extra words are trimmed off (by default `pad_sequences` drops them from the beginning)
- if a message is shorter than 255 words, the necessary number of 0's is added to bring it up to 255 (by default at the beginning)

In [None]:
maxLength = 255
train_data = pad_sequences(train_tokenized, maxlen=maxLength)
test_data = pad_sequences(test_tokenized, maxlen=maxLength)
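
The padding and trimming rules can be sketched in a few lines of pure Python. The helper `pad_pre` is hypothetical and only mirrors the default behaviour of `pad_sequences` (zeros added at the front, extra words removed from the front); a small length of 5 is used instead of 255 to keep the output readable.

```python
def pad_pre(sequence, maxlen, value=0):
    """Return a list of exactly `maxlen` items, padding or trimming at the front."""
    trimmed = list(sequence)[-maxlen:]          # keep only the last `maxlen` items
    return [value] * (maxlen - len(trimmed)) + trimmed

short_example = pad_pre([7, 8, 9], maxlen=5)
long_example = pad_pre([1, 2, 3, 4, 5, 6, 7], maxlen=5)

print(short_example)  # [0, 0, 7, 8, 9]
print(long_example)   # [3, 4, 5, 6, 7]
```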

# Model

In [None]:
model = Sequential()
model.add(Embedding(vocab_size, 32))
model.add(LSTM(32))
model.add(Dense(1, activation='sigmoid'))

model.compile(loss='binary_crossentropy', optimizer='rmsprop', metrics=['accuracy'])
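
To get a feel for the size of this architecture, the sketch below counts the trainable parameters of each layer using the standard formulas for an embedding, an LSTM with biases, and a dense layer. The vocabulary size of 8000 is a made-up value for illustration; the real number depends on the training data, and `model.summary()` reports the actual counts.

```python
def embedding_params(vocab_size, embedding_dim):
    # One vector of length `embedding_dim` per word index
    return vocab_size * embedding_dim

def lstm_params(input_dim, units):
    # 4 gates, each with input weights, recurrent weights and a bias
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    # Weight matrix plus one bias per unit
    return input_dim * units + units

example_vocab_size = 8000  # hypothetical value, not taken from the dataset

layer_counts = {
    "embedding": embedding_params(example_vocab_size, 32),
    "lstm": lstm_params(32, 32),
    "dense": dense_params(32, 1),
}
total_params = sum(layer_counts.values())

print(layer_counts)  # {'embedding': 256000, 'lstm': 8320, 'dense': 33}
print(total_params)  # 264353
```

With a vocabulary of this size, almost all parameters sit in the embedding layer.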

In [None]:
model.fit(train_data, train_labels, epochs=50)

# Making Predictions

In [None]:
def predict_message(pred_text):
  # Encode the text with the tokenizer fitted on the training data
  tokenized = train_tokenizer.texts_to_sequences([pred_text])
  text_to_predict = pad_sequences(tokenized, maxLength)

  # The sigmoid output is the predicted probability of the class with code 1
  model_prediction = model.predict(text_to_predict, verbose=0)
  probability = model_prediction[0][0]

  # Return [probability, label]; rounding the probability gives the class code
  prediction = [probability, categories[int(np.round(probability))]]

  print(pred_text, "===>", prediction)

  return prediction
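
The last step of the function, turning the sigmoid output into a label, can be isolated in a small sketch. It assumes that code 0 stands for ham and code 1 for spam; the helper `label_from_probability` and the sample probabilities are hypothetical.

```python
# Assumed mapping from class code to label: 0 -> ham, 1 -> spam
example_categories = ["ham", "spam"]

def label_from_probability(probability, threshold=0.5):
    """Pick the label with code 1 when the probability exceeds the threshold."""
    return example_categories[int(probability > threshold)]

low_label = label_from_probability(0.03)
high_label = label_from_probability(0.97)

print(low_label)   # ham
print(high_label)  # spam
```

Writing the threshold as an explicit parameter makes it easy to experiment with a stricter or looser cut-off than 0.5.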

Let's test the model on a few sample messages.

In [None]:
def test_predictions():
  test_messages = ["how are you doing today",
                   "sale today! to stop texts call 98912460324",
                   "i dont want to go. can we try it a different day? available sat",
                   "our new mobile video service is live. just install on your phone to start watching.",
                   "you have won £1000 cash! call to claim your prize.",
                   "i'll bring it tomorrow. don't forget the milk.",
                   "wow, is your arm alright. that happened to me one time too"
                  ]

  answers = ["ham", "spam", "ham", "spam", "spam", "ham", "ham"]

  for msg, ans in zip(test_messages, answers):
    prediction = predict_message(msg)
    if prediction[1] != ans:
      print("WRONG\n")
    else:
      print("CORRECT!!!\n")

test_predictions()