# Spam detection Neural Network

## 1. In order to create a Neural Network for spam detection, one needs to acquire a dataset that highlights both spam and ham messages. Thankfully, the dataset has been provided.

## 2. The next step is to prepare the data; this can be done through tensorflow's built-in data processing tools, as well as pandas' and numpy's tools.
I imported multiple libraries: pandas for loading the dataset, tensorflow for defining the model and its layers as well as for its text processing tools, sklearn for splitting the dataset, and numpy and pandas for dataset manipulation. The selected tools will be explained in the following markdown cells.

In [37]:
import pandas as pd
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, Embedding, TextVectorization, GlobalAveragePooling1D, Input
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix  # used for the evaluation in step 5
import numpy as np

Here, I define the location of the dataset and load it; please do not mind the directory.

In [38]:
spam = pd.read_csv("C:\\Users\\Dingus-Elite\\Downloads\\spam.csv",encoding='latin1').astype("str")

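### I then type out "spam" and "spam.describe" to see what it contains (without parentheses, "spam.describe" prints the bound method together with the frame's contents rather than summary statistics); we then find out it has 5 columns: one for v1, one for v2 and three unnamed columns.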

In [39]:
spam

Unnamed: 0,v1,v2,Unnamed: 2,Unnamed: 3,Unnamed: 4
0,ham,"Go until jurong point, crazy.. Available only ...",,,
1,ham,Ok lar... Joking wif u oni...,,,
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,,,
3,ham,U dun say so early hor... U c already then say...,,,
4,ham,"Nah I don't think he goes to usf, he lives aro...",,,
...,...,...,...,...,...
5567,spam,This is the 2nd time we have tried 2 contact u...,,,
5568,ham,Will Ì_ b going to esplanade fr home?,,,
5569,ham,"Pity, * was in mood for that. So...any other s...",,,
5570,ham,The guy did some bitching but I acted like i'd...,,,


In [40]:
spam.describe

<bound method NDFrame.describe of         v1                                                 v2 Unnamed: 2  \
0      ham  Go until jurong point, crazy.. Available only ...        nan   
1      ham                      Ok lar... Joking wif u oni...        nan   
2     spam  Free entry in 2 a wkly comp to win FA Cup fina...        nan   
3      ham  U dun say so early hor... U c already then say...        nan   
4      ham  Nah I don't think he goes to usf, he lives aro...        nan   
...    ...                                                ...        ...   
5567  spam  This is the 2nd time we have tried 2 contact u...        nan   
5568   ham              Will Ì_ b going to esplanade fr home?        nan   
5569   ham  Pity, * was in mood for that. So...any other s...        nan   
5570   ham  The guy did some bitching but I acted like i'd...        nan   
5571   ham                         Rofl. Its true to its name        nan   

     Unnamed: 3 Unnamed: 4  
0           nan        n

### We know there are excess columns, that the responses are in the form of "spam" and "ham", and that the response and text columns are named v1 and v2 respectively; we need to "clean" the dataset in order for it to be useful.

- I start by replacing the **"spam"** and **"ham"** values with **1** and **0** respectively.
- After that, I dropped the **3 unnamed columns**, and renamed the **v1** and **v2** columns to **"response"** and **"text"** respectively.
- I then configure the **"text"** column to be the **independent variable (X)** and the **"response"** column to be the **dependent variable (y)**.
- I then split the dataset through sklearn's **train_test_split**, keeping 20% of the rows for testing.

In [41]:
# Encode the labels in the v1 column only (spam -> 1, ham -> 0),
# so that message texts are never touched by the replacement.
spam["v1"] = spam["v1"].map({"spam": 1, "ham": 0})
spam = spam.drop(["Unnamed: 2", "Unnamed: 3", "Unnamed: 4"], axis=1)
spam = spam.rename(columns={"v1": "response", "v2": "text"})
X = spam["text"]
y = spam["response"]
train_X, test_X, train_y, test_y = train_test_split(X, y, test_size=0.20, random_state=10)
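
As a quick check of the cleaning steps, the same sequence can be run on a tiny hand-made frame. This is only a sketch: the `toy` frame and its three messages are made up for illustration and are not rows from the dataset. It encodes the labels with `Series.map` on the label column, which is a choice of this sketch, so that a message whose whole text is "ham" is not converted by accident.

```python
import pandas as pd

# Hypothetical three-row frame that mimics the layout of spam.csv.
toy = pd.DataFrame({
    "v1": ["ham", "spam", "ham"],
    "v2": ["see you at lunch", "WIN a free prize now", "ham"],
    "Unnamed: 2": ["nan", "nan", "nan"],
})

# Encode the labels in the label column only: spam -> 1, ham -> 0.
toy["v1"] = toy["v1"].map({"spam": 1, "ham": 0})

# Drop the empty column and give the remaining two descriptive names.
toy = toy.drop(["Unnamed: 2"], axis=1)
toy = toy.rename(columns={"v1": "response", "v2": "text"})

print(toy["response"].tolist())  # [0, 1, 0]
print(toy["text"].tolist())      # the third message is still the string "ham"
```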

I then use keras' TextVectorization layer; this translates the words from the spam dataset into integers, which the model can work with.

The values are arbitrary; **vocab_size** is how many unique words the vectorization will consider. 10,000 felt big enough.
**seq_length** is 500 because **output_sequence_length** pads or truncates every message to exactly 500 tokens, so at most 500 words per message are kept.\
We also used **'lower_and_strip_punctuation'** to strip the dataset of punctuation and to lowercase it, so that capitalization is not misinterpreted \
(i.e. A and a may be the same to us, but the computer would otherwise treat them as two different characters)

After defining the layer, we need to give it words to build its vocabulary from; for this we use the training part of our split dataset from earlier, **train_X.**

In [45]:
vocab_size = 10000
seq_length = 500
vectorizer_layer = TextVectorization(max_tokens=vocab_size, standardize = 'lower_and_strip_punctuation', output_mode='int', 
 output_sequence_length=seq_length)
vectorizer_layer.adapt(train_X)
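
To make the vectorization step more concrete, here is a rough pure-Python imitation of what the layer does in `'int'` mode. It is a simplified sketch, not the Keras implementation: the helper names `standardize`, `build_vocab` and `vectorize` are made up for this example, the two training sentences are hypothetical, and the way ties between equally frequent words are ordered may differ from Keras. What it does mirror is that index 0 is reserved for padding and index 1 for out-of-vocabulary words.

```python
import string
from collections import Counter

def standardize(text):
    # Lowercase and remove ASCII punctuation, roughly what
    # 'lower_and_strip_punctuation' does.
    return text.lower().translate(str.maketrans("", "", string.punctuation))

def build_vocab(texts, max_tokens):
    # Index 0 is reserved for padding and 1 for unknown words,
    # so only max_tokens - 2 real words fit in the vocabulary.
    counts = Counter(word for t in texts for word in standardize(t).split())
    words = [w for w, _ in counts.most_common(max_tokens - 2)]
    return {w: i + 2 for i, w in enumerate(words)}

def vectorize(text, vocab, seq_length):
    # Unknown words map to 1; the result always has seq_length entries.
    ids = [vocab.get(w, 1) for w in standardize(text).split()]
    ids = ids[:seq_length]                      # truncate long messages
    return ids + [0] * (seq_length - len(ids))  # pad short ones with 0

train = ["Free prize! Free entry", "Ok, see you"]
vocab = build_vocab(train, max_tokens=10)
ids = vectorize("FREE prize for you?", vocab, seq_length=6)
print(ids)
```

The word "for" never appeared in the training sentences, so it becomes 1, and the four-word message is padded with zeros up to the requested length of 6.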


## 3. We can now define the model; as part of the constraints, I will use keras' **Sequential** model.
Here, the model has 7 entries: the input plus six layers.
1. The first is the input layer; its shape is 1 because each sample is a single string.
2. The second layer is the vectorization layer that we made earlier.
3. The third layer is the Embedding layer; it maps each integer produced by our vectorizer layer to a trainable vector of size 16 (embedding_dim).
4. A Dropout layer is implemented to prevent overfitting; its rate is set to 0.2 arbitrarily.
5. The GlobalAveragePooling1D layer averages the Embedding output over the sequence, giving one fixed-size vector per message.
6. A second Dropout layer, also set to 0.2, follows the pooling.
7. The Dense layer has 1 unit, as we only need a single output that is rounded to one of 2 classes (0 or 1). I used the sigmoid activation because it squashes that output into a value between 0 and 1, which suits binary classification.

I then compiled the model using **binary_crossentropy** as the loss, since there are only two possible classes, and **adam** as my optimizer.

In [47]:
embedding_dim = 16
model = Sequential()
model.add(Input(shape=(1,), dtype=tf.string))
model.add(vectorizer_layer)
model.add(Embedding(vocab_size + 1, embedding_dim))
model.add(Dropout(0.2))
model.add(GlobalAveragePooling1D())
model.add(Dropout(0.2))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
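
To illustrate what happens to a batch of token ids at prediction time, the following numpy sketch walks through the Embedding, pooling and Dense steps by hand. The sizes are toy values and the weights are random stand-ins, both assumptions of this example rather than the trained model's parameters. Dropout is left out because it is only active during training.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embedding_dim, seq_length = 10, 4, 6  # toy sizes, not the notebook's

# Hypothetical random weights standing in for the trained parameters.
embedding = rng.normal(size=(vocab_size, embedding_dim))
dense_w = rng.normal(size=(embedding_dim, 1))
dense_b = np.zeros(1)

def forward(token_ids):
    vectors = embedding[token_ids]        # Embedding lookup: (batch, seq, dim)
    pooled = vectors.mean(axis=1)         # average over the sequence: (batch, dim)
    logits = pooled @ dense_w + dense_b   # Dense(1): (batch, 1)
    return 1.0 / (1.0 + np.exp(-logits))  # sigmoid: value between 0 and 1

# Two padded messages of length seq_length (0 is the padding id).
batch = np.array([[2, 3, 1, 7, 0, 0],
                  [5, 6, 7, 0, 0, 0]])
probs = forward(batch)
print(probs.shape)  # (2, 1)
```

Note that the plain average also includes the padding positions, which is how the pooling behaves when the Embedding layer does not mask zeros.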


## 4. We can now start training the model.
I've set the **epochs to 50**, and **verbose to 1** so I can see the cool progress bar.
The validation_data for this is the held-out part of the dataset split earlier: **test_X** and its labels **test_y**.

In [48]:
model.fit(train_X, train_y, epochs=50, verbose=1, validation_data=(test_X, test_y), batch_size=10)

Epoch 1/50
Epoch 2/50
...
Epoch 50/50


<keras.callbacks.History at 0x25f91441fd0>

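## 5. We can now evaluate how the model performed.
I wanted to see how many messages the model classified correctly (true negatives and true positives) versus incorrectly (false positives and false negatives);\
judging from the results, it looks okay: of the 1,115 test messages, 963 ham and 134 spam were classified correctly, 2 ham were flagged as spam, and 16 spam were missed.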

In [50]:
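# Round the predicted probabilities to hard 0/1 labels.
conf_test = model.predict(test_X)
conf_test = np.around(conf_test)
# Rows are true labels, columns are predicted labels (0 = ham, 1 = spam).
confusion_matrix(test_y,conf_test)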



array([[963,   2],
       [ 16, 134]], dtype=int64)
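
The matrix above can be turned into a few summary numbers. The sketch below copies the four counts from the output and derives accuracy, precision and recall for the spam class; the variable names are chosen for this example only.

```python
# Confusion matrix from the output above: rows are true labels (ham, spam),
# columns are predicted labels (ham, spam).
matrix = [[963, 2],
          [16, 134]]
(tn, fp), (fn, tp) = matrix

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
precision = tp / (tp + fp)  # share of predicted spam that really is spam
recall = tp / (tp + fn)     # share of actual spam that was caught

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
# accuracy=0.984 precision=0.985 recall=0.893
```

The lower recall reflects the 16 spam messages that slipped through as ham, which is the model's main kind of mistake here.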

## 6. The final model here is not representative of the amount of work that needed to be done in order to prepare it; I did go back to modify the model a few times, and this is so far the best I've seen it perform, so I'm keeping it.

## 7. After evaluation and tuning, I gave it new data to predict: the output_spam.csv file.
1. I used pandas' **.read_csv** command, and then asked the model to predict which of the messages in that new dataset are spam.
2. I then rounded off the numbers through numpy's **.around()** command.
3. I then created a new list based on the rounded predictions: where the prediction is 1, the list contains the word "Spam", and where it is 0, it contains "Ham".
4. I then write this list to another csv file.

In [24]:
spam_out = pd.read_csv("C:\\Users\\Dingus-Elite\\Downloads\\output_spam.csv",encoding='latin1').astype("str")

predictions = model.predict(spam_out['text'])
predictions =  np.around(predictions)
pred_conv = ['Spam' if i > 0.5 else 'Ham' for i in predictions]



In [53]:
prediction_test_out = pd.DataFrame(pred_conv, columns=['prediction'])
prediction_test_out.to_csv("C:\\Users\\Dingus-Elite\\Desktop\\billones_spam_output.csv")
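
As an aside, the rounding and labelling steps can also be written as one thresholding expression. This is a small sketch with made-up probabilities standing in for the model's output; `np.where` against 0.5 is an alternative formulation, not what the cells above use.

```python
import numpy as np

# Hypothetical sigmoid outputs, shaped (n, 1) like the result of model.predict.
probs = np.array([[0.03], [0.97], [0.5], [0.51]])

# Flatten to one value per message and compare against the threshold.
labels = np.where(probs.ravel() > 0.5, "Spam", "Ham").tolist()
print(labels)  # ['Ham', 'Spam', 'Ham', 'Spam']
```

A probability of exactly 0.5 ends up as "Ham" here, which matches the rounding approach because numpy rounds 0.5 to 0.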