# Classify SMS messages as Spam or Legitimate

Link: https://www.kaggle.com/uciml/sms-spam-collection-dataset

**Context**
The SMS Spam Collection is a set of tagged SMS messages collected for SMS spam research. It contains one set of 5,574 SMS messages in English, each tagged as either ham (legitimate) or spam.

**Content**
The files contain one message per line. Each line is composed of two columns: v1 contains the label (ham or spam) and v2 contains the raw text.

**Inspiration**
Use this dataset to build a prediction model that accurately classifies which texts are spam.

## Read the dataset

In [None]:
import os
import pandas as pd
import numpy as np

In [None]:
df = pd.read_csv("../input/spam.csv", encoding='latin-1')

In [None]:
df.shape

In [None]:
df.head()

We seem to have 3 unused columns. Let's drop them.

In [None]:
df.columns.values[2:]

In [None]:
df.drop(df.columns.values[2:], axis=1, inplace=True)

In [None]:
df.head()

In [None]:
df.shape

## Vectorize the data

In [None]:
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
# Let's consider only the top 10000 words appearing in the messages
maxWords = 10000

We need to decide on the maximum message length (in words) that we will consider.

In [None]:
# Get the number of words in each message (split on any whitespace)
sizes = df['v2'].map(lambda x: len(x.split()))

Plot them in a histogram.

In [None]:
# `normed` was removed from recent matplotlib releases; `density` is its replacement
plt.hist(sizes, density=True, bins=50);

We see that most values are below 100, so 100 seems to be a reasonable maximum length.

In [None]:
maxMessageSize = 100
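
To back up the reading of the histogram with a number, one option is to compute the fraction of messages that fit within a candidate cutoff. The snippet below is a minimal sketch: `coverage` is a hypothetical helper and `sample_texts` is made-up data standing in for `df['v2']`.

```python
def coverage(texts, max_len):
    """Return the fraction of texts with at most max_len whitespace-separated words."""
    lengths = [len(text.split()) for text in texts]
    return sum(n <= max_len for n in lengths) / len(lengths)


# Made-up messages for illustration; word counts are 6, 2 and 4
sample_texts = ["Free entry in a weekly competition", "Ok lar", "See you at 5"]

print(coverage(sample_texts, 4))
```

In the notebook, `coverage(df['v2'].tolist(), 100)` would report how many messages are left untruncated by a cutoff of 100.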

In [None]:
labelText = df['v1'].tolist()
texts = df['v2'].tolist()

In [None]:
tokenizer = Tokenizer(num_words=maxWords)
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)

In [None]:
data = pad_sequences(sequences, maxlen=maxMessageSize)
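
With its default arguments, `pad_sequences` pads short sequences with zeros on the left and truncates long ones from the front, keeping the last `maxlen` tokens. The NumPy-only function below is a rough sketch of that default behaviour for illustration; `pad_pre` is a hypothetical name, not part of Keras.

```python
import numpy as np


def pad_pre(sequences, maxlen):
    """Left-pad with zeros; for longer sequences keep only the last maxlen tokens."""
    out = np.zeros((len(sequences), maxlen), dtype="int32")
    for row, seq in enumerate(sequences):
        if not seq:
            continue  # an empty message stays all zeros
        kept = seq[-maxlen:]
        out[row, -len(kept):] = kept
    return out


padded = pad_pre([[5, 3], [1, 2, 3, 4, 5, 6]], maxlen=4)
print(padded)
```

The first toy sequence gains two leading zeros, while the second loses its first two tokens.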

Encode labels: 0 indicates ham and 1 indicates spam.

In [None]:
labels = []
for i in labelText:
    if i == "ham":
        labels.append(0)
    elif i == "spam":
        labels.append(1)

In [None]:
labels = np.asarray(labels)
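
The loop above silently skips any label that is neither "ham" nor "spam", which would leave `labels` shorter than `data`. A possible alternative, sketched here with the hypothetical names `LABEL_TO_ID` and `encode_labels`, is a dictionary lookup that raises a `KeyError` on an unexpected label.

```python
import numpy as np

# Mapping assumed from the encoding described above: ham -> 0, spam -> 1
LABEL_TO_ID = {"ham": 0, "spam": 1}


def encode_labels(label_text):
    """Map each label string to its integer id; unknown labels raise KeyError."""
    return np.asarray([LABEL_TO_ID[label] for label in label_text])


encoded = encode_labels(["ham", "spam", "ham"])
print(encoded)
```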

In [None]:
data.shape

In [None]:
labels.shape

**Shuffle the dataset**

In [None]:
indices = np.arange(data.shape[0])
np.random.shuffle(indices)

data = data[indices]
labels = labels[indices]
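
`np.random.shuffle` gives a different order on every run, so the split below changes between runs. If reproducibility matters, a seeded generator is one option. This is a sketch on toy arrays; the seed value 42 and the `toy_*` names are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)  # fixed seed: same permutation on every run

# Toy stand-ins for `data` and `labels`: row i is [2i, 2i + 1], label is i % 2
toy_data = np.arange(10).reshape(5, 2)
toy_labels = np.array([0, 1, 0, 1, 0])

toy_indices = rng.permutation(toy_data.shape[0])

# Index both arrays with the same permutation so rows and labels stay paired
shuffled_data = toy_data[toy_indices]
shuffled_labels = toy_labels[toy_indices]
```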

## Split into training, validation and test sets

In [None]:
trainingSetSize = 3500
validationSetSize = 1000

In [None]:
trainingSet = data[:trainingSetSize]
trainingLabels = labels[:trainingSetSize]

validationSet = data[trainingSetSize: trainingSetSize + validationSetSize]
validationLabels = labels[trainingSetSize: trainingSetSize + validationSetSize]

testSet = data[trainingSetSize + validationSetSize:]
testLabels = labels[trainingSetSize + validationSetSize:]

## Build the model

**Architecture**
* A 100 dimensional Embedding layer
* 1 densely connected layer with 32 hidden units, _relu_ activation
* 1 output layer, _sigmoid_ activation
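
As a rough guide to the size of this architecture, the number of trainable parameters can be worked out by hand. The arithmetic below assumes a vocabulary of 10000 words, messages padded to 100 tokens, and an embedding output that is flattened before the dense layer; the variable names are chosen for this sketch only.

```python
max_words = 10000        # vocabulary size
max_message_size = 100   # tokens per padded message
embedding_dim = 100      # size of each word vector
hidden_units = 32        # units in the dense layer

# Embedding: one vector of length embedding_dim per vocabulary entry
embedding_params = max_words * embedding_dim

# Flattening yields max_message_size * embedding_dim inputs to the dense layer
flattened_size = max_message_size * embedding_dim

# Dense layers: weights plus one bias per unit
dense_params = flattened_size * hidden_units + hidden_units
output_params = hidden_units * 1 + 1

total_params = embedding_params + dense_params + output_params
print(total_params)  # 1320065
```

Most of the parameters sit in the embedding table, with the first dense layer contributing most of the rest.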

In [None]:
from keras import models
from keras import layers
from keras import activations
from keras import optimizers
from keras import losses
from keras import metrics

In [None]:
embeddingDimension = 100

In [None]:
model = models.Sequential()
model.add(layers.Embedding(maxWords, embeddingDimension, input_length=maxMessageSize))
model.add(layers.Flatten())
model.add(layers.Dense(32, activation=activations.relu))
model.add(layers.Dense(1, activation=activations.sigmoid))

# Recent Keras versions name this argument `learning_rate`; older releases used `lr`
model.compile(optimizer=optimizers.Adam(learning_rate=0.001), loss=losses.binary_crossentropy, metrics=[metrics.binary_accuracy])

history = model.fit(trainingSet, trainingLabels, epochs=10, batch_size=64, validation_data=(validationSet, validationLabels))

**Validation accuracy: 98.6%**

**Plot the loss and accuracy for the training and validation sets**

In [None]:
history.history.keys()

In [None]:
acc = history.history['binary_accuracy']
val_acc = history.history['val_binary_accuracy']

loss = history.history['loss']
val_loss = history.history['val_loss']

epochs = range(1, len(acc) + 1)

plt.plot(epochs, acc, 'bo', label='Training acc')
plt.plot(epochs, val_acc, 'b', label='Validation acc')
plt.title('Training and validation accuracy')
plt.xlabel("Epochs")
plt.ylabel("Accuracy")
plt.legend()

plt.figure()

plt.plot(epochs, loss, 'bo', label='Training loss')
plt.plot(epochs, val_loss, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.xlabel("Epochs")
plt.ylabel("Loss")
plt.legend()
plt.show()

Save the model

In [None]:
model.save("spamVsHam.h5")

## Evaluate on test set

In [None]:
testSet.shape

In [None]:
testLabels.shape

In [None]:
testLoss, testAccuracy = model.evaluate(testSet, testLabels)

In [None]:
testAccuracy
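
Accuracy alone can look good when one class is much rarer than the other, so it may be worth checking precision and recall for the spam class as well. The function below is a minimal sketch using NumPy; `precision_recall` is a hypothetical helper, and the labels and probabilities are made-up values rather than output from this model.

```python
import numpy as np


def precision_recall(y_true, y_prob, threshold=0.5):
    """Precision and recall for the positive (spam) class at a given threshold."""
    y_true = np.asarray(y_true)
    y_pred = (np.asarray(y_prob) >= threshold).astype(int)
    tp = int(np.sum((y_pred == 1) & (y_true == 1)))  # spam correctly flagged
    fp = int(np.sum((y_pred == 1) & (y_true == 0)))  # ham flagged as spam
    fn = int(np.sum((y_pred == 0) & (y_true == 1)))  # spam that was missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Made-up labels and predicted probabilities for illustration
precision, recall = precision_recall([0, 0, 1, 1, 1], [0.1, 0.7, 0.9, 0.4, 0.8])
print(precision, recall)
```

With the model from this notebook, the probabilities would come from `model.predict(testSet).ravel()` and the true labels from `testLabels`.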