# Introduction

In this kernel we will build a model that classifies messages as spam. We will work with Neural Networks (*NN*) instead of the old-but-gold Machine Learning algorithms, mostly because NNs are fashionable right now. I also use this kernel to practice some of what I learned in Machine Learning and Deep Learning courses. It uses a dataset that is not overhyped, which makes it easier to understand the domain.


## Outline:
 - The Data
     - Raw Read
     - Clean up
     - Visualizations
 - The Model:
     - Define goals
     - Identify possible painpoints
     - Build a not improved model
     - Tune it!

## The Data

First we will load the data and clean it up a bit. Even though it comes from the UCI repository, the set still contains some noise that we can tidy up. Next we will draw some visualizations to better understand the distribution, so that if we run into a problem while training the model we can spot the cause faster.

In [1]:
# data libs
import pandas as pd
import numpy as np
import seaborn as sns

from matplotlib import pyplot as plt
from matplotlib.gridspec import GridSpec

sns.set_palette('Pastel1')
sns.set_style('whitegrid')

# fix the seed to make this notebook reproducible
# (call the function; assigning to np.random.seed would overwrite it without seeding)
np.random.seed(69)

### Raw Read

Let's read the data with *pandas*, which offers a nice wrapper around the data so we can work with it more easily.

One problem we run into is that reading with the `utf-8` encoding crashes, because the file contains bytes that are not valid UTF-8, such as characters with a tilde stored in another encoding.

`UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 135-136: invalid continuation byte`

To overcome this we can use the `latin1` encoding, which maps every possible byte to a character and therefore never fails to decode.

In [2]:
data = pd.read_csv('../input/spam.csv', encoding='latin1')
data.head()
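
To see why `latin1` never raises while `utf-8` can, here is a minimal sketch on a hand-made byte string. The sample bytes are an assumption for illustration and are not taken from `spam.csv`. Keep in mind that `latin1` always decodes, but bytes written in some other encoding may come out as the wrong characters.

```python
# Hypothetical bytes containing Latin-1 encoded characters (not from spam.csv)
raw = "caf\u00e9 costs \u00a35".encode("latin1")

try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError as err:
    # 0xE9 announces a multi-byte UTF-8 sequence, but the next byte is a space
    utf8_ok = False
    print("utf-8 failed:", err.reason)

# latin1 assigns a character to every byte value, so this cannot fail
text = raw.decode("latin1")
print(text)
```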

### Clean Up

From the preview printed above we can see columns that do not add anything. They probably ended up here when the data was exported without removing stray columns. The headers do not help us work with this dataframe either.

So we will do this:
  - Drop the Unnamed columns
  - Rename the headers

In [3]:
# drop unnamed columns
data.drop(['Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4'], axis=1, inplace=True)
# rename columns
data.columns = ['label', 'text']
data.head()

There you go, much nicer to work with.

### Visualizations

To better understand this data we can analyse its content. For example, is there a distinct difference in length between spam messages and normal ones?

In [4]:
def get_lengths_of_texts(data):
    texts = data['text'].values
    return [len(text) for text in texts]

lengths = get_lengths_of_texts(data)

mean_length = np.mean(lengths)
std_length = np.std(lengths)

spam_lengths = get_lengths_of_texts(data[data['label'] == 'spam'])

mean_spam_length = np.mean(spam_lengths)
std_spam_length = np.std(spam_lengths)

normal_lengths = get_lengths_of_texts(data[data['label'] == 'ham'])

mean_normal_length = np.mean(normal_lengths)
std_normal_length = np.std(normal_lengths)

In [5]:
def annotate_values(ax, values):
    for react, val in zip(ax.patches, values):
        h = react.get_height()
        ax.text(react.get_x() + react.get_width() / 2, h - (h * 0.1),
            "{0:.2f}".format(val), ha='center', style='italic', fontsize=13)

grid = GridSpec(1, 2)
fig = plt.figure(figsize=(15,5))
fig.suptitle('Stats from the length of text')

ax = plt.subplot(grid[0, 0])
values = [mean_spam_length, std_spam_length]
sns.barplot(x=['Mean', 'Std'], y=values, ax=ax)
ax.set_title('Mean & Std for Spam messages')
annotate_values(ax, values)

ax = plt.subplot(grid[0, 1])
values = [mean_normal_length, std_normal_length]
sns.barplot(x=['Mean', 'Std'], y=values, ax=ax)
ax.set_title('Mean & Std for Normal messages')
annotate_values(ax, values)
plt.show()

It looks like spam messages tend to use up all the characters available in an SMS, which results in a higher mean value.
The limit of an SMS is *160* characters, and the mean length of spam gets close to it, at around *138* characters. In contrast, normal messages tend to have fewer characters, so their mean is lower than that of spam. Looking only at the means, one can be tempted to say:
> Oh, with a simple `if` we can solve it!

Well let's try:

In [6]:
def isspam(message):
    return len(message) >= 138

print("What am I?")
vld = data[data['label'] == 'spam'].sample(1)
# pass the message string itself; len() of the one-row Series would always be 1
print('>', 'Spam' if isspam(vld['text'].iloc[0]) else 'Not Spam')
print("Actually I'm a", *vld['label'].values)

*The rule is not reliable: plenty of spam messages slip through.*

But why? Next to the mean on each graph there is the standard deviation (*Std*), which in simple terms measures how far the values spread around the mean. The mean is only the centre of the distribution, so a good share of spam messages are shorter than *138* characters and the `if` above misses them. The *Std* also shows that spam lengths stay fairly close to their mean, while the lengths of normal messages vary a lot more, so some long normal messages will cross the threshold too.
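
Checking a single random sample says little, so here is a small sketch that scores the length rule over a whole set of labelled messages. The function name `evaluate_length_rule` and the toy messages are made up for illustration; only the `138` threshold and the `'spam'`/`'ham'` labels come from the notebook.

```python
def evaluate_length_rule(texts, labels, threshold=138):
    """Return (share of spam caught, share of ham wrongly flagged)."""
    spam_hits = spam_total = ham_hits = ham_total = 0
    for text, label in zip(texts, labels):
        flagged = len(text) >= threshold
        if label == 'spam':
            spam_total += 1
            spam_hits += flagged
        else:
            ham_total += 1
            ham_hits += flagged
    recall = spam_hits / spam_total if spam_total else 0.0
    false_positive_rate = ham_hits / ham_total if ham_total else 0.0
    return recall, false_positive_rate

# Hypothetical toy messages, only to show the call
toy_texts = ['x' * 150, 'x' * 120, 'x' * 30, 'x' * 140]
toy_labels = ['spam', 'spam', 'ham', 'ham']
recall, false_positive_rate = evaluate_length_rule(toy_texts, toy_labels)
print(recall, false_positive_rate)
```

On the notebook's dataframe the call would be `evaluate_length_rule(data['text'], data['label'])`.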

------

## The Model

Since our attempt with `if`s didn't get us far, let's try what we came here for: [Neural Networks](https://en.wikipedia.org/wiki/Artificial_neural_network)!

### Defining Goals

As with every machine learning project, we should start by defining the goals for the resulting model. For example: should it be fast to predict? How much CPU can it use to compute a prediction? Can we iterate every few days, or do we need to iterate *n* times a day? We define these so we can aim our optimization at a model that fits the destination application.

So what are our goals?

- Small enough to fit on most smartphones
- Use few computation cycles, to avoid draining the battery
- Fit the training into the 60 min execution time of this kernel
- Tolerate *false negatives* more than *false positives*: we may fail to flag a spam message as spam, but we must not flag a legitimate message (such as a verification code) as spam.
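
The last goal can be made concrete by counting both kinds of error at different decision thresholds. This is a minimal sketch with made-up labels and probabilities; `confusion_counts` is a hypothetical helper, not part of Keras. Raising the threshold trades missed spam (false negatives) for fewer legitimate messages flagged (false positives).

```python
def confusion_counts(y_true, y_prob, threshold=0.5):
    """Count tp/fp/tn/fn, treating 1 as spam and 0 as a legitimate message."""
    counts = {'tp': 0, 'fp': 0, 'tn': 0, 'fn': 0}
    for truth, prob in zip(y_true, y_prob):
        predicted = 1 if prob >= threshold else 0
        if predicted == 1:
            counts['tp' if truth == 1 else 'fp'] += 1
        else:
            counts['fn' if truth == 1 else 'tn'] += 1
    return counts

# Hypothetical labels and predicted spam probabilities
y_true = [1, 1, 1, 0, 0, 0]
y_prob = [0.95, 0.70, 0.40, 0.60, 0.20, 0.05]

default = confusion_counts(y_true, y_prob, threshold=0.5)
strict = confusion_counts(y_true, y_prob, threshold=0.9)
print(default)
print(strict)
```

With a trained model, `y_prob` would come from something like `model.predict(X).ravel()`.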


### Identifying painpoints

Before we start building our model, let's take another look at the data we have:

In [7]:
data.head()

As we can see, our columns have good names thanks to the clean up we did before, but the texts have a wide range of lengths, so we need to think about how to encode the words into numbers. Our labels are also strings instead of `0`s and `1`s.

So here is what we need to do:

 - Convert the text into a numeric representation
 - Convert the label to numbers representing 1 for spam and 0 otherwise

In [8]:
# a simple helper that computes the numeric labels (1 for spam, 0 otherwise)
def encode_labels(labels):
    return [1 if 'spam' in label else 0 for label in labels]

In [9]:
from keras.preprocessing.text import Tokenizer


class TextEncoder:
    def __init__(self):
        self.tokenizer = Tokenizer()
    
    def fit(self, texts):
        self.tokenizer.fit_on_texts(texts)
    
    def transform(self, texts):
        return self.tokenizer.texts_to_matrix(texts)
    
    def fit_transform(self, texts):
        self.fit(texts)
        return self.transform(texts)
    
    @property
    def dim(self):
        return len(self.tokenizer.word_index) + 1
    

In [10]:
texts = data['text'].values
encoder = TextEncoder()
X = encoder.fit_transform(texts)
# wrap in a NumPy array so Keras can slice it for the validation split
y = np.array(encode_labels(data['label'].values))

We use Keras as our abstraction over Neural Network backends, in particular TensorFlow, and its preprocessing module helps us encode the text into numbers. To make things easier I wrote a class that wraps the `Tokenizer` and exposes it through a `scikit-learn`-like API. `TextEncoder#fit` builds the vocabulary from the texts, assigning an index to each word. `TextEncoder#transform` turns a list of texts into a matrix with one row per message and one column per vocabulary index (`TextEncoder#dim` columns in total), where each cell marks whether that word appears in the message. Every example therefore has the same length and the network can compute on it.
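
As a rough picture of what `fit` and `transform` produce, here is a simplified pure-Python sketch of a binary bag-of-words encoding. It is an approximation, not the Keras implementation: it only lowercases and splits on whitespace, while the real `Tokenizer` also filters punctuation. The function names and toy texts are made up for the example.

```python
from collections import Counter

def fit_word_index(texts):
    """Give each word an index starting at 1, most frequent first."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def texts_to_binary_matrix(texts, word_index):
    """One row per text, one column per index; column 0 stays unused."""
    width = len(word_index) + 1
    matrix = []
    for text in texts:
        row = [0] * width
        for word in text.lower().split():
            idx = word_index.get(word)
            if idx is not None:  # words unseen during fit are ignored
                row[idx] = 1
        matrix.append(row)
    return matrix

# Hypothetical toy messages
toy = ['free prize now', 'see you now']
word_index = fit_word_index(toy)
matrix = texts_to_binary_matrix(toy, word_index)
print(word_index)
print(matrix)
```

The `+ 1` in the width is the same one used in `TextEncoder.dim`: index 0 is reserved, so the matrix has one more column than there are words.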

### Build a not improved Model

To start we will build a simple model, and then add more layers and units and tune parameters as we iterate towards better accuracy. If we throw a complex network at the problem we may get awesome results at the cost of our goals, so starting small and growing is the best, most controlled way to do this.

We will use Keras again, with the preprocessed data from the previous cell: `X` as the input and `y` as the output. The network starts as a shallow net: a two-unit layer followed by a single output neuron with a sigmoid activation.

In [11]:
from keras.models import Sequential
from keras.layers import Dense, Dropout

model = Sequential()
# dimensions calculated by our tokenizer
dims = encoder.dim
# add the first layer (2 units) and declare the input size
model.add(Dense(2, input_dim=dims))
# add output layer
model.add(Dense(1, activation='sigmoid'))
# compile the model
model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['accuracy'])
# fit it!
history = model.fit(X, y, shuffle=True, validation_split=.2, epochs=100, batch_size=50)
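
Since one of the goals is a model small enough for a smartphone, it helps to know how many weights this network has. The sketch below counts parameters for a stack of `Dense` layers (weights plus biases). The vocabulary size of `9000` is a placeholder assumption, not the real value of `encoder.dim`, and `count_params` here is a hand-written helper.

```python
def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units

def count_params(input_dim, layer_units):
    """Total parameters of fully connected layers stacked in order."""
    total = 0
    n_inputs = input_dim
    for units in layer_units:
        total += dense_params(n_inputs, units)
        n_inputs = units
    return total

example_dims = 9000  # hypothetical vocabulary size, not the real encoder.dim
baseline = count_params(example_dims, [2, 1])
approx_bytes = baseline * 4  # assuming 32-bit floats
print(baseline, approx_bytes)
```

In the notebook itself, `model.summary()` reports the actual count.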

We have fitted a simple neural network. We can see from the logs that the training loss is very low, so let's plot it to understand it better, because I smell *overfitting*.

In [12]:
grid = GridSpec(2, 2)

fig = plt.figure(figsize=(13, 10))
fig.suptitle('Model Metrics')

# first plot

ax = plt.subplot(grid[0, :])
ax.set_title('Model Loss')
ax.plot(history.history['loss'], color='c', lw=2)
ax.plot(history.history['val_loss'], color='darkorange', lw=2)
ax.set_xlabel('Epochs')
ax.set_ylabel('Loss')
# annotate
ax.annotate('Start Overfitting', xy=(8, 0.06),
            xytext=(10, 0.1),
            arrowprops=dict(arrowstyle='->'))

ax.annotate('Look at this gap', xy=(85, 0.06),
            xytext=(85, 0.1))

ax.legend(['Train Loss', 'Val Loss'], loc='best')

# Second plot

ax = plt.subplot(grid[1, :])
ax.set_title('Model Accuracy')
ax.plot(history.history['acc'], color='c', lw=2)
ax.plot(history.history['val_acc'], color='darkorange', lw=2)
ax.set_xlabel('Epochs')
ax.set_ylabel('Accuracy')
# annotate
ax.annotate('Start Overfitting', xy=(8, 0.985),
            xytext=(10, 0.97),
            arrowprops=dict(arrowstyle='->'))

ax.annotate('Look at this gap', xy=(85, 0.06),
            xytext=(85, 0.1))

ax.legend(['Train Accuracy', 'Val Accuracy'], loc='best')

plt.show()

Comparing the training and validation loss, we can see that our model is overfitting. The loss drops quickly, which is good, but at around `10` epochs the validation loss starts increasing. That means the model stops generalizing from the training data and starts failing on new examples, and it only gets worse as the epochs increase.

The accuracy plot shows the same effect: at about epoch `10` the accuracy on the validation set starts to drop while the training accuracy keeps improving.
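
Instead of reading the turning point off the plot, we can look it up in the recorded history. This is a small sketch with made-up loss values; `best_epoch` and `epochs_since_best` are hypothetical helpers. Keras also ships an `EarlyStopping` callback that stops training once a monitored value such as `val_loss` stops improving.

```python
def best_epoch(val_loss):
    """Index (0-based) of the epoch with the lowest validation loss."""
    return min(range(len(val_loss)), key=lambda i: val_loss[i])

def epochs_since_best(val_loss):
    """How many epochs ran after the best one; a large value hints at overfitting."""
    return len(val_loss) - 1 - best_epoch(val_loss)

# Hypothetical validation losses, not taken from the run above
hypothetical_val_loss = [0.30, 0.12, 0.07, 0.06, 0.065, 0.07, 0.08]
best = best_epoch(hypothetical_val_loss)
wasted = epochs_since_best(hypothetical_val_loss)
print(best, wasted)
```

With the notebook's objects the call would be `best_epoch(history.history['val_loss'])`.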

### Tune it!

So what can we do about it? 
 - We can't get more data
 - Maybe we can try lowering the number of epochs?

> What is an epoch?
>
> An epoch is one full pass of the training data through the network.
> A NN starts with random weights. As forward and then backward propagation happen, the weights are updated, but one pass over the data is usually not enough, so we repeat it; each pass is called an epoch.
> The more epochs we run, the more the NN learns from the training data, which can lead to overfitting.
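
To get a feeling for how `epochs` and `batch_size` interact, the sketch below counts how many weight updates a training run performs. It assumes the validation samples are held out by truncating `n_samples * (1 - validation_split)`, and the sample count of `1000` is a placeholder, not the size of this dataset.

```python
import math

def weight_updates(n_samples, validation_split, batch_size, epochs):
    """Number of gradient steps: batches per epoch times epochs."""
    n_train = int(n_samples * (1 - validation_split))  # assumed split rule
    steps_per_epoch = math.ceil(n_train / batch_size)
    return steps_per_epoch * epochs

# Hypothetical dataset of 1000 messages
long_run = weight_updates(1000, 0.2, batch_size=50, epochs=100)
short_run = weight_updates(1000, 0.2, batch_size=32, epochs=10)
print(long_run, short_run)
```

Fewer epochs means fewer passes over the same examples, so the network has less opportunity to memorize them.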

In [13]:
model0 = Sequential()
# dimensions calculated by our tokenizer
dims = encoder.dim
# add the first layer (2 units) and declare the input size
model0.add(Dense(2, input_dim=dims))
# add output layer
model0.add(Dense(1, activation='sigmoid'))
# compile the model
model0.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['accuracy'])
# fit it!
history = model0.fit(X, y, shuffle=True, validation_split=.2, epochs=10, batch_size=32)

In [14]:
grid = GridSpec(2, 2)

fig = plt.figure(figsize=(13, 10))
fig.suptitle('Model Metrics')

# first plot

ax = plt.subplot(grid[0, :])
ax.set_title('Model Loss')
ax.plot(history.history['loss'], color='c', lw=2)
ax.plot(history.history['val_loss'], color='darkorange', lw=2)
ax.set_xlabel('Epochs')
ax.set_ylabel('Loss')
# annotate
ax.annotate('Start Overfitting', xy=(8, 0.06),
            xytext=(10, 0.1),
            arrowprops=dict(arrowstyle='->'))

ax.annotate('Look at this gap', xy=(85, 0.06),
            xytext=(85, 0.1))

ax.legend(['Train Loss', 'Val Loss'], loc='best')

# Second plot

ax = plt.subplot(grid[1, :])
ax.set_title('Model Accuracy')
ax.plot(history.history['acc'], color='c', lw=2)
ax.plot(history.history['val_acc'], color='darkorange', lw=2)
ax.set_xlabel('Epochs')
ax.set_ylabel('Accuracy')
ax.legend(['Train Accuracy', 'Val Accuracy'], loc='best')

plt.show()

We see an improvement: the model is still slightly overfitting near the end, which is easier to see on the accuracy graph, where the lines start splitting at epoch `8`.

So let's try with `5` epochs:

In [15]:
model1 = Sequential()
# dimensions calculated by our tokenizer
dims = encoder.dim
# add the first layer (2 units) and declare the input size
model1.add(Dense(2, input_dim=dims))
# add output layer
model1.add(Dense(1, activation='sigmoid'))
# compile the model
model1.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['accuracy'])
# fit it!
history = model1.fit(X, y, shuffle=True, validation_split=.2, epochs=5, batch_size=32)

In [16]:
grid = GridSpec(2, 2)

fig = plt.figure(figsize=(13, 10))
fig.suptitle('Model Metrics')

# first plot

ax = plt.subplot(grid[0, :])
ax.set_title('Model Loss')
ax.plot(history.history['loss'], color='c', lw=2)
ax.plot(history.history['val_loss'], color='darkorange', lw=2)
ax.set_xlabel('Epochs')
ax.set_ylabel('Loss')
ax.legend(['Train Loss', 'Val Loss'], loc='best')

# Second plot

ax = plt.subplot(grid[1, :])
ax.set_title('Model Accuracy')
ax.plot(history.history['acc'], color='c', lw=2)
ax.plot(history.history['val_acc'], color='darkorange', lw=2)
ax.set_xlabel('Epochs')
ax.set_ylabel('Accuracy')
ax.legend(['Train Accuracy', 'Val Accuracy'], loc='best')

plt.show()

We got better results by interrupting the learning process at epoch `5`, because the network no longer keeps iterating over the training data after it has stopped generalizing. Of course there are other ways to improve this, for example more data, but since we have a limited amount this is the best we can get in this scenario.

As a bonus let's try to make our net deeper:

In [24]:
model2 = Sequential()
# dimensions calculated by our tokenizer
dims = encoder.dim
# add the first layer (2 units) and declare the input size
model2.add(Dense(2, input_dim=dims))
# extra hidden layers, with dropout in between
model2.add(Dense(2, activation='relu'))
model2.add(Dropout(0.5))
model2.add(Dense(2, activation='relu'))
# add output layer
model2.add(Dense(1, activation='sigmoid'))
# compile the model
model2.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
# fit it!
history = model2.fit(X, y, shuffle=True, validation_split=.2, epochs=50, batch_size=50)

In [None]:
grid = GridSpec(2, 2)

fig = plt.figure(figsize=(13, 10))
fig.suptitle('Model Metrics')

# first plot

ax = plt.subplot(grid[0, :])
ax.set_title('Model Loss')
ax.plot(history.history['loss'], color='c', lw=2)
ax.plot(history.history['val_loss'], color='darkorange', lw=2)
ax.set_xlabel('Epochs')
ax.set_ylabel('Loss')
ax.legend(['Train Loss', 'Val Loss'], loc='best')

# Second plot

ax = plt.subplot(grid[1, :])
ax.set_title('Model Accuracy')
ax.plot(history.history['acc'], color='c', lw=2)
ax.plot(history.history['val_acc'], color='darkorange', lw=2)
ax.set_xlabel('Epochs')
ax.set_ylabel('Accuracy')
ax.legend(['Train Accuracy', 'Val Accuracy'], loc='best')

plt.show()

With the extra layers we get results a bit lower than our 'final' model. The accuracy also shows a drop in the last epoch, small but present.
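
One thing worth keeping in mind about `model2` is that `Dropout(0.5)` sits on a layer with only two units. The sketch below is an illustration, not a measurement from this notebook: it simulates inverted dropout, where kept values are scaled by `1 / (1 - rate)` as Keras does during training, and estimates how often both units are zeroed in the same step. The helper name `inverted_dropout` and the trial count are made up for the example.

```python
import random

def inverted_dropout(activations, rate, rng):
    """Zero each value with probability `rate`; scale the kept ones by 1 / (1 - rate)."""
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(0)  # fixed seed so the estimate is repeatable
trials = 10000
all_dropped = sum(
    1 for _ in range(trials)
    if not any(inverted_dropout([1.0, 1.0], 0.5, rng))
)
fraction = all_dropped / trials
print(fraction)
```

In roughly a quarter of the training steps nothing passes through that layer, which may be one reason the deeper net scores lower here, though that is a guess.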

## That's all for now!

In the future I might come back to implement the resulting model, `model1`, in raw TensorFlow, reproducing the Keras output!

-------------

Thanks!