# Spam Filter Classifier

## Team members:
* Alan Gomez
* Bennan Penfold
* Samuel Parra

## Project Description:
- Three datasets are provided for training: TrainDataset1.csv, TrainDataset2.csv and TrainDataset3.txt. Each contains short messages labelled ham or spam.
- Analyse, clean and visualise these datasets.
- Combine them into one large dataset for training.
- Use this dataset to build a Naive Bayes classifier, either the existing implementation from sklearn or one written from scratch.
- Verify the classifier on new messages (write your own or use the messages from TestDataset.csv).

## Project Duration: 2 weeks
## Project Deliverables:
1. End of the first week: data preprocessing
    - Load the datasets using pandas.
    - Analyse them. This requires processing the text: remove punctuation and stopwords, then create a list of clean words. (Research how to do this.)
    - Visualise the results.
    - Prepare the pre-processed data for use by the Naive Bayes classifier.
2. End of the second week:
    - Train the classifier.
    - Validate it: build a confusion matrix and analyse the results.
    - Apply it to new test messages.
    - Try to cheat the classifier by adding "good words" to the end of a test message.
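
The text-cleaning step in the first deliverable (remove punctuation and stopwords, then build a list of clean words) could look roughly like the sketch below. It uses only the standard library. The `clean_text` helper and the tiny `STOPWORDS_SKETCH` set are hypothetical names introduced for illustration and are not part of the notebook.

```python
import string

# Hypothetical, deliberately tiny stopword list for illustration only;
# a real project would use a fuller list (for example from NLTK or scikit-learn).
STOPWORDS_SKETCH = {"a", "an", "the", "is", "are", "to", "and", "or", "of", "in", "it", "you", "i"}

# Translation table that deletes every ASCII punctuation character
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean_text(message):
    """Lower-case a message, strip punctuation, and drop stopwords."""
    no_punct = message.lower().translate(_PUNCT_TABLE)
    return [word for word in no_punct.split() if word not in STOPWORDS_SKETCH]


print(clean_text("Congrats! You are the WINNER of a free prize."))
# → ['congrats', 'winner', 'free', 'prize']
```

One possible way to plug this in is to pass the function as the `analyzer` argument of `CountVectorizer`, which accepts a callable. The cells below do not do this and rely on the vectorizer's default tokenisation instead.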

## Import and sort the data

In [2]:
import pandas as pd
import numpy as np
import string

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split 
from sklearn.metrics import confusion_matrix

# For the Visualisation
from os import path
import matplotlib.pyplot as plt
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
from IPython.display import Image


In [3]:
# Read datasets
test_data = pd.read_csv('TestDataset.csv')
train_data_1 = pd.read_csv('TrainDataset1.csv')
train_data_2 = pd.read_csv('TrainDataset2.csv')
train_data_3 = pd.read_csv('TrainDataset3.txt', sep="\t", names=['tag','text'])

train_data_1.columns = ['tag', 'text']
train_data_2.columns = ['tag', 'text']
test_data.columns = ['text']

train_set = pd.concat([train_data_1, train_data_2, train_data_3])

text_set = train_set.text
tag_set = train_set.tag


## Visualise the data

In [14]:
text = train_set['text'].tolist()
tags = train_set['tag'].tolist()

# Generate strings for wordcloud
ham_text = " ".join([words for words, tag in zip(text, tags) if tag == "ham"])
spam_text = " ".join([words for words, tag in zip(text, tags) if tag == "spam"])

# Generate wordcloud
ham_wordcloud = WordCloud(height=500, width=500, background_color='white').generate(ham_text)
spam_wordcloud = WordCloud(height=500, width=500, background_color='white').generate(spam_text)

# Save wordclouds
ham_wordcloud.to_file("img/ham_wordcloud.jpg")
spam_wordcloud.to_file("img/spam_wordcloud.jpg")

<wordcloud.wordcloud.WordCloud at 0x7f79600fb860>

In [15]:
Image("img/ham_wordcloud.jpg")

[Image: word cloud of the ham messages, img/ham_wordcloud.jpg]

In [16]:
Image("img/spam_wordcloud.jpg")

[Image: word cloud of the spam messages, img/spam_wordcloud.jpg]
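
A word cloud conveys relative word frequency visually. The same information can be read as numbers with a plain frequency count, as in the sketch below. The `top_words` helper and the two toy messages are made up for illustration and do not come from the project's datasets.

```python
from collections import Counter


def top_words(messages, n=3):
    """Return the n most frequent whitespace-separated words across messages."""
    counts = Counter()
    for message in messages:
        counts.update(message.lower().split())
    return counts.most_common(n)


# Toy messages, invented for this example
toy_spam = ["free prize call now", "call now to win free cash"]
print(top_words(toy_spam, n=3))
```

Running the same function separately on the ham and spam messages would suggest candidate "good words" for the trick experiment later in the notebook.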

## Train the classifier

In [5]:
# Separate training and validation sets (train_test_split holds out 25% by default)
text_train, text_test, tags_train, tags_test = train_test_split(text_set, tag_set)

# Create the vectorizer and the classifier
cv = CountVectorizer()
classifier = MultinomialNB()

# Learn the vocabulary dictionary and return the term-document matrix
counts = cv.fit_transform(text_train.values)
targets = tags_train.values

# Fit the classifier on the word counts and their labels
classifier.fit(counts, targets)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
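
The project description also allows building the classifier by hand. The sketch below is a minimal multinomial Naive Bayes with additive smoothing (the `alpha=1.0` shown above). It is meant to illustrate the idea and is not scikit-learn's implementation. The class name `TinyNaiveBayes` and the toy messages are assumptions made for this example.

```python
import math
from collections import Counter, defaultdict


class TinyNaiveBayes:
    """Illustrative multinomial Naive Bayes with additive (Laplace) smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha

    def fit(self, messages, labels):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.label_counts = Counter(labels)      # label -> number of messages
        for message, label in zip(messages, labels):
            self.word_counts[label].update(message.lower().split())
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def log_scores(self, message):
        """Return log prior + sum of log word likelihoods for each label."""
        total_docs = sum(self.label_counts.values())
        scores = {}
        for label, doc_count in self.label_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + self.alpha * len(self.vocab)
            score = math.log(doc_count / total_docs)  # log prior
            for word in message.lower().split():
                if word in self.vocab:  # ignore words never seen in training
                    score += math.log((counts[word] + self.alpha) / denom)
            scores[label] = score
        return scores

    def predict(self, message):
        scores = self.log_scores(message)
        return max(scores, key=scores.get)


# Toy training data, invented for this example
toy_messages = ["win free cash now", "free prize call now",
                "see you at lunch", "call me when you are home"]
toy_labels = ["spam", "spam", "ham", "ham"]

model = TinyNaiveBayes().fit(toy_messages, toy_labels)
print(model.predict("free cash prize"))   # → spam
print(model.predict("see you at home"))   # → ham
```

Because each word contributes an independent additive term to the log score, appending words that are common in ham pulls the total towards ham. This is the weakness the "good words" experiment tries to exploit.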

## Test and validate the training

In [6]:
# Test the training against the test set
test_count = cv.transform(text_test.values)
tags_predictions = classifier.predict(test_count)

# Compare the predicted to actual values
print('True values')
print(tags_test.values)
print('Predictions')
print(tags_predictions)

# Print the resulting confusion matrix
print('\nConfusion matrix')
my_confusion_matrix = confusion_matrix(tags_test.values, tags_predictions)
print(my_confusion_matrix)

True values
['spam' 'ham' 'ham' ... 'ham' 'ham' 'spam']
Predictions
['spam' 'ham' 'ham' ... 'ham' 'ham' 'spam']

Confusion matrix
[[3351   21]
 [  20  505]]
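
To analyse the confusion matrix, summary metrics can be derived from its four cells. The sketch below assumes scikit-learn's default layout: rows are true labels, columns are predicted labels, and labels are sorted, so `ham` comes before `spam`. The variable names are illustrative.

```python
# Values copied from the confusion matrix printed above
# (rows: true ham, true spam; columns: predicted ham, predicted spam)
matrix = [[3351, 21],
          [20, 505]]

(ham_as_ham, ham_as_spam), (spam_as_ham, spam_as_spam) = matrix
total = ham_as_ham + ham_as_spam + spam_as_ham + spam_as_spam

accuracy = (ham_as_ham + spam_as_spam) / total
spam_precision = spam_as_spam / (spam_as_spam + ham_as_spam)  # flagged messages that really are spam
spam_recall = spam_as_spam / (spam_as_spam + spam_as_ham)     # spam messages that were caught

print("accuracy:       {:.3f}".format(accuracy))        # → 0.989
print("spam precision: {:.3f}".format(spam_precision))  # → 0.960
print("spam recall:    {:.3f}".format(spam_recall))     # → 0.962
```

Under that assumption, 21 legitimate messages were wrongly flagged as spam and 20 spam messages slipped through, out of 3897 validation messages.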


## Apply the classifier to the test data

In [11]:
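# Vectorise the unlabelled test messages with the vocabulary learned during training
test_data_count = cv.transform(test_data.text.values)
test_predictions = classifier.predict(test_data_count)

# Count the predictions per class for both experiments
tracker = {
    "test_1": {
        "spam": 0,
        "ham": 0
    },
    "test_2": {
        "spam": 0,
        "ham": 0
    }
}

# Print every prediction and keep the messages flagged as spam for the next experiment
spam_strs = []
for idx, prediction in enumerate(test_predictions):
    print("{}: {}".format(prediction, test_data.text[idx]))
    if prediction == "spam":
        spam_strs.append(test_data.text[idx])
    tracker["test_1"][prediction] += 1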

ham: Prabha..i'm soryda..realy..frm heart i'm sory
ham: Jus chillaxin, what up
ham: Ok no prob. Take ur time.
spam: Congrats! 2 mobile 3G Videophones R yours. call 09061744553 now! videochat wid ur mates, play java games, Dload polyH music, noline rentl. bx420. ip4. 5we. 150pm
ham: Thankyou so much for the call. I appreciate your care.
ham: Good Morning my Dear........... Have a great &amp; successful day.
ham: Desires- u going to doctor 4 liver. And get a bit stylish. Get ur hair managed. Thats it.
ham: How stupid to say that i challenge god.You dont think at all on what i write instead you respond immed.
ham: Come aftr  &lt;DECIMAL&gt; ..now i m cleaning the house
ham: Arms fine, how's Cardiff and uni? 
ham: Miles and smiles r made frm same letters but do u know d difference..? smile on ur face keeps me happy even though I am miles away from u.. :-)keep smiling.. Good nyt
ham: Even if he my friend he is a priest call him now
ham: ;-( oh well, c u later
ham: Aight, I'm chillin in a fr

## Attempt to trick the classifier

In [12]:
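# Words that appear frequently in ham messages
good_words = "know will got ok time one come today going ok know want need love"

# Append the good words to every message previously classified as spam.
# Note: no separator is added, so the last token of each message fuses with "know"
# (visible in the output below).
trick_str = [message + good_words for message in spam_strs]

trick_data_count = cv.transform(trick_str)
trick_predictions = classifier.predict(trick_data_count)

for idx, prediction in enumerate(trick_predictions):
    print("{}: {}".format(prediction, trick_str[idx]))
    tracker["test_2"][prediction] += 1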

spam: Congrats! 2 mobile 3G Videophones R yours. call 09061744553 now! videochat wid ur mates, play java games, Dload polyH music, noline rentl. bx420. ip4. 5we. 150pmknow will got ok time one come today going ok know want need love
spam: Free 1st week entry 2 TEXTPOD 4 a chance 2 win 40GB iPod or å£250 cash every wk. Txt VPOD to 81303 Ts&Cs www.textpod.net custcare 08712405020.know will got ok time one come today going ok know want need love
ham: Email AlertFrom: Jeri StewartSize: 2KBSubject: Low-cost prescripiton drvgsTo listen to email call 123know will got ok time one come today going ok know want need love
spam: UR awarded a City Break and could WIN a å£200 Summer Shopping spree every WK. Txt STORE to 88039.SkilGme.TsCs087147403231Winawk!Age16+å£1.50perWKsubknow will got ok time one come today going ok know want need love
spam: FREE UNLIMITED HARDCORE PORN direct 2 your mobile Txt PORN to 69200 & get FREE access for 24 hrs then chrgd@50p per day txt Stop 2exit. This msg is freekno

In [13]:
print("First test results: {} spam, {} ham".format(tracker["test_1"]["spam"], tracker["test_1"]["ham"]))
print("Second test results: {} spam, {} ham".format(tracker["test_2"]["spam"], tracker["test_2"]["ham"]))

First test results: 162 spam, 953 ham
Second test results: 112 spam, 50 ham


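The second test was run only on the 162 messages classified as spam in the first test. After the good words were appended, 50 of them (about 31%) were reclassified as ham, while 112 were still detected as spam.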
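
Why appended words can flip a prediction follows from how Naive Bayes scores a message: every word adds its own log-likelihood ratio to the total, regardless of position or context. The sketch below illustrates this with hypothetical per-word probabilities. The numbers are made up and are not estimated from the project's datasets.

```python
import math

# Hypothetical per-word likelihoods P(word | class); made up for illustration,
# not estimated from the project's datasets.
p_word_given_spam = {"free": 0.020, "prize": 0.010, "ok": 0.001, "love": 0.001}
p_word_given_ham = {"free": 0.002, "prize": 0.0005, "ok": 0.010, "love": 0.008}


def spam_log_odds(words, prior_spam=0.5):
    """Return log P(spam | words) - log P(ham | words) under Naive Bayes."""
    log_odds = math.log(prior_spam / (1.0 - prior_spam))
    for word in words:
        log_odds += math.log(p_word_given_spam[word] / p_word_given_ham[word])
    return log_odds


original = ["free", "prize"]
padded = original + ["ok", "love", "ok", "love"]

print(spam_log_odds(original) > 0)  # → True: log-odds positive, message scored as spam
print(spam_log_odds(padded) > 0)    # → False: good words push the log-odds below zero
```

Messages with many strongly spam-indicative tokens need more good words to flip, which would be consistent with 112 of the 162 messages staying classified as spam.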