# Classifying spam with CNNs

Here we develop a CNN that classifies emails as spam or ham (non-spam). The data is available [here](http://www2.aueb.gr/users/ion/data/enron-spam/).

## Downloading data 

The following script downloads the data, reads it, and loads it into a pandas DataFrame.

In [1]:
from zipfile import ZipFile
import gzip
import tarfile
import requests
import os
import pandas as pd
%matplotlib inline 

# Download the file if it's not there
filenames_urls = [("enron1.tar.gz", "http://www.aueb.gr/users/ion/data/enron-spam/preprocessed/enron1.tar.gz"),
                 ("enron2.tar.gz", "http://www.aueb.gr/users/ion/data/enron-spam/preprocessed/enron2.tar.gz"),
                 ("enron3.tar.gz", "http://www.aueb.gr/users/ion/data/enron-spam/preprocessed/enron3.tar.gz")]

def get_spam_dataset(filename, url):
    if not filename.endswith('tar.gz'):
        raise NotImplementedError("This code cannot process files other than tar.gz")
        
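    if not os.path.exists(filename):
        r = requests.get(url, allow_redirects=True)
        # Stop early on a failed download instead of saving an error page
        r.raise_for_status()
        # Use a context manager so the file handle is closed after writing
        with open(filename, 'wb') as out_file:
            out_file.write(r.content)

    # Read the downloaded tar.gz archive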
    labels = []
    text = []
    with tarfile.open(filename, 'r:gz') as tar:
        for member in tar.getmembers():
            if member.name.endswith('ham.txt') or member.name.endswith('spam.txt'):
                f=tar.extractfile(member)
                text.append(f.read().decode('latin-1'))            
                if member.name.endswith('ham.txt'):
                    labels.append(0)
                elif member.name.endswith('spam.txt'):
                    labels.append(1)
                else:
                    raise ValueError()
                
    return text, labels

text, labels = [],[]

for fn, url in filenames_urls:
    local_text, local_labels = get_spam_dataset(fn, url)
    text.extend(local_text)
    labels.extend(local_labels)

print(len(text))

df = pd.DataFrame({'text':text, "label": labels})
df = df.sample(frac=1.0, random_state=100)
pd.options.display.max_colwidth = 500
df.head(n=5)

16541


Unnamed: 0,label,text
1462,0,"Subject: ethane election for september 2000\r\nfyi\r\n- - - - - - - - - - - - - - - - - - - - - - forwarded by stella l morris / hou / ect on 08 / 25 / 2000\r\n02 : 16 pm - - - - - - - - - - - - - - - - - - - - - - - - - - -\r\njack simunek\r\n08 / 25 / 2000 01 : 33 pm\r\nto : lauri a allen / hou / ect @ ect , michael c bilberry / hou / ect @ ect , nathan l\r\nhlavaty / hou / ect @ ect , karry kendall / hou / ect @ ect , blanca a lopez / hou / ect @ ect ,\r\nstella l morris / hou / ect @ ect..."
14944,0,"Subject: fw : weather update\r\n- - - - - original message - - - - -\r\nfrom : shively , hunter s .\r\nsent : thursday , january 31 , 2002 11 : 41 am\r\nto : lavorato , john\r\nsubject : weather update\r\nrobert ricks - lead forecaster nws slidell , la\r\narriving on tuesday or wednesday for interviews\r\nliz taylor is coordinating travel plans\r\nmike gasper - reliant forecaster , 10 years experience , dave knows him from accu - weather\r\ndave ryan is meeting with him this afternoon to det..."
8443,0,"Subject: vacation day feb . 16\r\nshirley ,\r\nplease put me down for a vacation day of feb . 16 .\r\nthanks ,\r\nstinson"
10264,1,Subject: the stock trading gunslinger\r\nfanny is merrill but muzo not colza attainder and penultimate like esmark perspicuous ramble is segovia not group try slung kansas tanzania yes chameleon or continuant clothesman no\r\nlibretto is chesapeake but tight not waterway herald and hawthorn like chisel morristown superior is deoxyribonucleic not clockwork try hall incredible mcdougall yes hepburn or einsteinian earmark no\r\nsapling is boar but duane not plain palfrey and inflexible like huz...
2551,0,"Subject: gas day 2 / 08 / 01\r\nwe agree :\r\nteco tap nom = 40 . 000 ; actual 41 . 358\r\n- - - - - - - - - - - - - - - - - - - - - - forwarded by melissa jones / texas utilities on\r\n02 / 09 / 2001\r\n10 : 15 am - - - - - - - - - - - - - - - - - - - - - - - - - - -\r\nkponton @ duke - energy . com on 02 / 09 / 2001 09 : 15 : 35 am\r\nto : david avila / lsp / enserch / us @ tu , charlie stone / texas utilities @ tu , melissa\r\njones / texas utilities @ tu\r\ncc :\r\nsubject : gas day 2 / ..."


## Exploring data

Let's see how many examples we have for each class. There is far less spam than ham. What does this mean?

In [2]:
df.groupby('label')['text'].count().head()

label
0    12045
1     4496
Name: text, dtype: int64
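
One way to approach the question above: with this imbalance, a classifier that always predicts ham already looks fairly accurate. The sketch below uses only the counts printed above; the variable names are illustrative and not part of the notebook.

```python
# Class counts taken from the output above (0 = ham, 1 = spam)
counts = {0: 12045, 1: 4496}
total = sum(counts.values())

# Share of each class in the full dataset
shares = {label: n / total for label, n in counts.items()}

# Accuracy of a trivial model that always predicts the majority class (ham)
majority_baseline = max(counts.values()) / total

print(f"ham share:  {shares[0]:.3f}")
print(f"spam share: {shares[1]:.3f}")
print(f"always-ham accuracy: {majority_baseline:.3f}")
```

Roughly 73% of the emails are ham, so an accuracy near 0.73 on the raw data says nothing about spam detection. This is one reason to evaluate on a balanced test set and to weight the classes during training.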

In [3]:
# Keep only the rows whose label is 0 or 1
df = df.loc[(df['label']==0) | (df['label']==1),:]

## Checking which words appear more than 10 times

In [4]:
# This is a data exploration step and consumes significant time, so it runs only if necessary
run_expensive_op = False
if run_expensive_op:
    ser = pd.Series(df["text"].str.lower().str.replace(r'(?:\.{1,}|,)',' ', regex=True).str.cat()).str.split(expand=True).iloc[0]
    ser_counts = ser.value_counts()#.head(n=100)
    ser_counts = ser_counts[ser_counts>10]
    print(ser_counts.shape)

## Summary statistics of the text length

Here we look at summary statistics of the text length. This information is important because we later need to pad the texts to a constant length.

In [5]:
df['text_length'] = df['text'].str.split(' ').str.len()
# Get the statistics of text_length column
text_stats = ____
print(text_stats)

count    16541.000000
mean       333.680007
std       1105.610378
min          2.000000
25%         68.000000
50%        152.000000
75%        329.000000
max      43684.000000
Name: text_length, dtype: float64


## Splitting train and test data

Here we split the data into train and test sets. Since this is a class-imbalanced problem, it is important to draw an equal number of samples from each class for the test set.

In [6]:
seed = 100
n_test = 300

grouped = df.groupby('label')
test_data = []
train_data = []

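for _, grp in grouped:
    # Draw n_test rows per class; pass the seed so the split is reproducible
    test = grp.sample(n=n_test, random_state=seed)
    test_data.append(test)
    # Everything not drawn for the test set goes to the training set
    train_data.append(grp.drop(test.index))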
    
test_df = pd.concat(test_data, axis=0)
train_df = pd.concat(train_data, axis=0)

print(train_df.shape)
print(test_df.shape)

(15941, 3)
(600, 3)


## Preprocessing text with Keras 

We fit a tokenizer on the training data, then use it to convert the training and testing data to sequences of integers.

In [8]:
train_text = train_df["text"].tolist()
train_labels = train_df["label"].tolist()

test_text = test_df["text"].tolist()
test_labels = test_df["label"].tolist()

In [9]:
from tensorflow.keras.preprocessing.text import Tokenizer

n_vocab = 2500
oov_token = '<unk>'

# Fit a tokenizer with n_vocab words and oov token <unk>
tok = _____
# Fit the tokenizer on train_text
_____

In [10]:
print(tok.texts_to_sequences([test_text[0]]))

[[19, 177, 1600, 230, 1688, 176, 12, 2, 177, 1600, 230, 441, 697, 11, 18, 136, 41, 353, 52, 7, 163, 20, 58, 196, 3, 298, 3, 1499, 1, 10, 483, 1, 190, 573, 1, 190, 2201, 1, 1601]]
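
To make the integer sequences above less opaque, here is a simplified stand-in for a vocabulary-limited tokenizer with an out-of-vocabulary token. It is a sketch of the idea only, not the Keras implementation: the helper names `fit_vocab` and `texts_to_sequences` and the toy texts are assumptions, and it splits on whitespace without any punctuation filtering.

```python
from collections import Counter

def fit_vocab(texts, n_vocab, oov_token="<unk>"):
    """Map the most frequent words to integer IDs; ID 1 is reserved for OOV."""
    counts = Counter(word for t in texts for word in t.lower().split())
    word_index = {oov_token: 1}
    # Keep all IDs below n_vocab; ID 0 is left free for padding
    for word, _ in counts.most_common(n_vocab - 2):
        word_index[word] = len(word_index) + 1
    return word_index

def texts_to_sequences(texts, word_index, oov_token="<unk>"):
    """Replace each word by its ID, falling back to the OOV ID for unknown words."""
    oov_id = word_index[oov_token]
    return [[word_index.get(w, oov_id) for w in t.lower().split()] for t in texts]

# Toy training texts (hypothetical, for illustration only)
train = ["free money now", "meeting now", "free free offer"]
word_index = fit_vocab(train, n_vocab=4)
print(word_index)
print(texts_to_sequences(["free meeting tomorrow"], word_index))
```

Frequent words get small IDs, and every word outside the vocabulary collapses to the same ID. That is why `<unk>` appears so often when sequences are converted back to text.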


## Getting the data in the correct shape

What's the correct shape(s) for the data?

* Inputs: `[batch_size, timesteps]` array
* Targets: `[batch_size, 2]` array

In [11]:
from tensorflow.keras.preprocessing.sequence import pad_sequences
import numpy as np
from tensorflow.keras.utils import to_categorical

n_pad = int(text_stats["mean"] + 1.0*text_stats["std"])
def process_text(text, n_pad, onehot=False, n_vocab=None):
    # Convert a given list of text to a list of sequences
    seq = ____
    # Pad the sequences using maxlen(=n_pad), pre-padding and post-truncating
    pad_seq = ____
    
    # Return the padded sequence
    _____

# Preprocessing text
train_data = process_text(train_text, n_pad)
test_data = process_text(test_text,n_pad)
print(train_data.shape)

# Converting labels to onehot labels
train_oh_labels = to_categorical(train_labels, num_classes=2)
test_oh_labels = to_categorical(test_labels, num_classes=2)

(15941, 1439)
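
The padding rule named in the comments above (pre-padding, post-truncating) can be illustrated without Keras. This is a minimal pure-Python sketch for a single sequence; the helper name `pad_pre_truncate_post` is hypothetical, and the two statistics are copied from the `text_stats` output shown earlier.

```python
def pad_pre_truncate_post(seq, maxlen, value=0):
    """Left-pad short sequences with `value`; cut long ones at the end."""
    if len(seq) >= maxlen:
        # Post-truncating keeps the first maxlen IDs
        return seq[:maxlen]
    # Pre-padding puts the fill value in front of the sequence
    return [value] * (maxlen - len(seq)) + seq

# Pad length used above: mean + 1 standard deviation of the text length
n_pad = int(333.680007 + 1.0 * 1105.610378)
print(n_pad)

print(pad_pre_truncate_post([5, 8, 2], maxlen=5))
print(pad_pre_truncate_post([5, 8, 2, 9, 4, 7], maxlen=5))
```

The computed pad length agrees with the second dimension of `train_data.shape`. With pre-padding, the real tokens of a short email sit at the end of the row, preceded by zeros.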


## Defining the model

Here we define the model used for spam classification. It is inspired by the model in [this paper](https://www.aclweb.org/anthology/D14-1181.pdf). Roughly, the model has the following layers:

* An input layer
* An embedding layer (50-dimensional embeddings)
* 4 parallel convolution layers with different kernel sizes (3, 5, 7, 9), each with stride 2, 16 filters, ReLU activation and `same` padding
* A pooling-over-time layer (max-pools each convolution output and concatenates the 4 results into a single tensor)
* A flattening layer
* A dense layer (20 nodes, ReLU activation)
* A dropout layer (0.5)
* A dense layer (1 node, sigmoid activation)

The model uses `rmsprop` as the optimizer and reports `acc` (accuracy) as a metric.

In [12]:
from tensorflow.keras import layers, models
import tensorflow as tf
import tensorflow.keras.backend as K

K.clear_session()
# Without these lines I get 
# > UnknownError:  [_Derived_]  Fail to find the dnn implementation.
# >  [[{{node CudnnRNN}}]]
# >  [[model/lstm/StatefulPartitionedCall]] [Op:__inference_distributed_function_5819]
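gpus = tf.config.experimental.list_physical_devices('GPU')
# Only enable memory growth when a GPU is present (gpus[0] raises an IndexError on CPU-only machines)
if gpus:
    tf.config.experimental.set_memory_growth(gpus[0], True)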

# Input layer
inp = layers.Input(shape=(n_pad,))

# Embedding layer
emb_out = layers.Embedding(n_vocab, 50, input_length=n_pad)(inp)

# Parallel convolution layers with different kernel sizes
# Define four outputs conv1_1_out, conv1_2_out, conv1_3_out, conv1_4_out
____
____
____
____

# Max pool over time
pool_time = layers.Lambda(lambda x: K.concatenate([tf.nn.max_pool(xx, n_pad//2, strides=1, padding='VALID') for xx in x],axis=1))
# Use pool_time layer to get the aggregated output of the convolution layers
conv1_out = ____

# Flattening the output and final fully connected layers
flatten_out = layers.Flatten()(conv1_out)

# Add a dense layer with 20 nodes and relu activation and get the output
dense_out = ____(____)
dense_out = layers.Dropout(0.5)(dense_out)
# Define the final prediction as the output of a layer with 1 node and sigmoid activation
pred = ____(____)

# Defining the model
model = models.Model(inputs=inp, outputs=pred)
model.compile(loss=____, optimizer='rmsprop', metrics=____)
model.summary()


Model: "model"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_1 (InputLayer)            [(None, 1439)]       0                                            
__________________________________________________________________________________________________
embedding (Embedding)           (None, 1439, 50)     125000      input_1[0][0]                    
__________________________________________________________________________________________________
conv1d (Conv1D)                 (None, 720, 16)      2416        embedding[0][0]                  
__________________________________________________________________________________________________
conv1d_1 (Conv1D)               (None, 720, 16)      4016        embedding[0][0]                  
______________________________________________________________________________________________
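
The output shapes and parameter counts in the summary above can be checked by hand. The sketch below assumes the standard parameter count for a 1D convolution (kernel size × input channels × filters, plus one bias per filter) and the usual `same`-padding length rule. The values for kernel sizes 7 and 9 are computed from the layer list rather than read from the summary, which is cut off before those rows.

```python
import math

n_pad, n_vocab, emb_dim, n_filters, stride = 1439, 2500, 50, 16, 2

# Embedding: one emb_dim-sized vector per vocabulary entry
emb_params = n_vocab * emb_dim

# With 'same' padding the output length depends only on the input length and stride
conv_len = math.ceil(n_pad / stride)

def conv1d_params(kernel_size, in_channels, filters):
    # One weight per (kernel position, input channel, filter) plus one bias per filter
    return kernel_size * in_channels * filters + filters

print(emb_params)
print(conv_len)
for k in (3, 5, 7, 9):
    print(k, conv1d_params(k, emb_dim, n_filters))
```

The first two convolution rows of the summary (2416 and 4016 parameters, output length 720) agree with these formulas.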

## Training the model

When training the model, it is important to pass in the `class_weight` argument. This is to compensate for the class imbalance present in the dataset. More specifically we give,

* A large weight to the rare class
* A small weight to the majority class
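
The effect of this weighting can be sketched in plain Python. The example assumes that each example's binary cross-entropy is multiplied by the weight of its true class before averaging; the exact reduction inside Keras may differ, and `weighted_bce` is a hypothetical helper. The counts come from the class totals shown earlier minus the 300 test examples drawn per class.

```python
import math

def weighted_bce(y_true, y_prob, class_weights):
    """Mean binary cross-entropy with each example scaled by its class weight."""
    losses = []
    for y, p in zip(y_true, y_prob):
        bce = -(y * math.log(p) + (1 - y) * math.log(1 - p))
        losses.append(class_weights[y] * bce)
    return sum(losses) / len(losses)

# Training-set counts: each class lost 300 examples to the test set
n_ham, n_spam = 12045 - 300, 4496 - 300
tot = n_ham + n_spam

# Each class is weighted by the other class's share of the data
class_weights = {0: n_spam / tot, 1: n_ham / tot}
print(class_weights)

# Same-sized mistake (probability 0.3 assigned to the true class) on one ham and one spam email
ham_loss = weighted_bce([0], [0.7], class_weights)
spam_loss = weighted_bce([1], [0.3], class_weights)
print(spam_loss / ham_loss)
```

An equally wrong prediction costs close to three times as much on a spam email as on a ham email, which counteracts the tendency to favour the majority class.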

In [13]:
train_labels_arr = np.array(train_labels)

n_ham = train_labels_arr[train_labels_arr==0].size   # number of ham examples
n_spam = train_labels_arr[train_labels_arr==1].size  # number of spam examples
tot = n_ham + n_spam
# Each class is weighted by the other class's share, so the rarer class (spam) gets the larger weight
class_weights = {0: n_spam/tot, 1: n_ham/tot}

# Fit the model with the correct training data
model.fit(____, ____, epochs=3, class_weight = ____, validation_split=0.2)

Train on 12752 samples, validate on 3189 samples
Epoch 1/3
Epoch 2/3
Epoch 3/3


<tensorflow.python.keras.callbacks.History at 0x22c8f586160>

## Test performance of the model

In [14]:
# Evaluate the model on test data
metrics = model.____(____, ____, verbose=0)
print("Test loss: {} and test accuracy: {}".format(metrics[0], metrics[1]))

[0.2990406271273969, 0.94]
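
Accuracy alone does not show which class the errors fall on. As a complement, the sketch below thresholds sigmoid outputs at 0.5 and computes precision and recall for the spam class. The labels and probabilities are made up for illustration and are not model outputs; `binary_metrics` is a hypothetical helper.

```python
def binary_metrics(y_true, y_prob, threshold=0.5):
    """Accuracy, precision and recall for the positive (spam) class."""
    y_pred = [1 if p >= threshold else 0 for p in y_prob]
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # spam caught
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # ham flagged as spam
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # spam missed
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # ham passed through
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall

# Made-up labels and probabilities, for illustration only
y_true = [0, 0, 1, 1, 1, 0]
y_prob = [0.01, 0.60, 0.90, 0.40, 0.70, 0.05]
print(binary_metrics(y_true, y_prob))
```

For a spam filter, low precision means legitimate mail is being flagged, while low recall means spam is getting through, so it can be worth tracking both alongside accuracy.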

## Print a few examples

Let's now print a few examples from the test set along with the predictions our model gives.

In [24]:
rand_ids = np.random.choice(np.arange(test_data.shape[0]), size=(25))
for doc,true_lbl in zip(test_data[rand_ids], np.array(test_labels)[rand_ids]):
    # Do a model predict on a single example (you need to introduce a batch dim)
    label = ____.ravel()[0]    
    print(' '.join([tok.index_word[i] for i in  doc[-200:] if i != 0]))
    print('\tPred: {}, True: {}\n'.format(label, true_lbl))
    

 <unk>a good weekend uss these results on monday and any further <unk> for <unk> would also be <unk> <unk> m the 26 th i
	Pred: 8.229006198234856e-06, True: 0

 karens and if you need additional info let me know meter <unk> <unk> the deal does not have this day on it please have 1 16 added to the deal <unk> the error will clear 
	Pred: 0.0, True: 0

 best regards <unk> <unk> first account managerill out the form on our website ent 
	Pred: 0.6673570871353149, True: 1

 megan' s great what about deal <unk> we have 0 10 and we are being <unk> 0 11 for meter <unk> on contract <unk> also i <unk> the oct 2000 <unk> <unk> to <unk> <unk> so she can look into the dec 2000 and <unk> problem 
	Pred: 3.194138997741902e-08, True: 0

 <unk> <unk> <unk> more men than <unk>mk> your <unk> <unk> 
	Pred: 0.6853684186935425, True: 1

 <unk>swo weeks vince is out next week but we can start with a <unk> discussion with grant <unk> next week <unk> please get a meeting <unk> as soon as possible we should imme