# Build a spam classifier using Naive Bayes

## Project Description: 
- There are three training datasets: TrainDataset1.csv, TrainDataset2.csv and TrainDataset3.txt. Each contains short messages labelled ham or spam.
- Analyse, clean and visualise these datasets.
- Combine them into one large dataset for training.
- Use this dataset to build a Naive Bayes classifier. (You can either use the existing Naive Bayes implementation from sklearn or write your own.)
- Verify your classifier on new messages (write your own or use the messages from the TestDataset.csv dataset).

## Project Duration: 2 weeks
## Project Deliverables:
1. By the end of the first week, complete the data preprocessing:
    - Load the datasets using pandas.
    - Analyse them. This requires processing the text: remove punctuation and stopwords, then create a list of clean words for each message. (Research how to do this.)
    - Visualise the results.
    - Prepare the pre-processed data for use by the Naive Bayes classifier.
2. By the end of the second week:
    - Train the classifier.
    - Validate it, build a confusion matrix and analyse the results.
    - Apply it to new test messages.
    - Try to cheat the classifier by adding "good words" to the end of a test message.
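
The text-cleaning step from the first week (remove punctuation and stopwords, keep a list of clean words) could look roughly like the sketch below. It uses only the Python standard library; the function name `clean_message` and the tiny `STOPWORDS` set are hypothetical and chosen for illustration, not taken from the project material.

```python
import string

# Hypothetical, deliberately tiny stopword list. A real project would more
# likely use a fuller list, for example the ones shipped with NLTK or sklearn.
STOPWORDS = {"a", "an", "the", "is", "are", "to", "of", "and",
             "you", "your", "for", "in", "on", "it", "this"}

# Translation table that deletes every ASCII punctuation character.
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean_message(text):
    """Lowercase, strip punctuation, split on whitespace, drop stopwords."""
    no_punct = text.lower().translate(PUNCT_TABLE)
    return [word for word in no_punct.split() if word not in STOPWORDS]


print(clean_message("WINNER!! You have won a FREE prize, call now."))
# ['winner', 'have', 'won', 'free', 'prize', 'call', 'now']
```

With a pandas DataFrame that has a `text` column, the same function could be applied row by row with `df['text'].apply(clean_message)`.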

You can use the following link as guidance for the implementation:
https://towardsdatascience.com/spam-filtering-using-naive-bayes-98a341224038
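
The "good words" deliverable can be illustrated on a toy corpus. The sketch below is a minimal from-scratch multinomial Naive Bayes with add-one smoothing (one of the two options the description allows), not the sklearn model used in this notebook. The training messages and the helper names `train_nb`, `log_scores` and `predict` are invented for illustration.

```python
import math
from collections import Counter

# Toy training data, invented for illustration only.
TRAIN = [
    ("spam", "win free prize now"),
    ("spam", "free cash prize call now"),
    ("ham", "see you at lunch tomorrow"),
    ("ham", "thanks for lunch see you tomorrow"),
]


def train_nb(samples):
    """Count documents and words per class; also collect the vocabulary."""
    class_counts = Counter()
    word_counts = {}
    vocab = set()
    for label, text in samples:
        class_counts[label] += 1
        counts = word_counts.setdefault(label, Counter())
        for word in text.split():
            counts[word] += 1
            vocab.add(word)
    return class_counts, word_counts, vocab


def log_scores(text, model):
    """Return log P(class) + sum of log P(word | class) for each class."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    scores = {}
    for label in class_counts:
        score = math.log(class_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.split():
            if word not in vocab:
                continue  # words never seen in training are ignored
            # Add-one (Laplace) smoothing avoids zero probabilities.
            score += math.log((word_counts[label][word] + 1)
                              / (total_words + len(vocab)))
        scores[label] = score
    return scores


def predict(text, model):
    scores = log_scores(text, model)
    return max(scores, key=scores.get)


model = train_nb(TRAIN)
spam_message = "free prize now"
padded_message = spam_message + " see you at lunch tomorrow thanks"

print(predict(spam_message, model))    # spam
print(predict(padded_message, model))  # ham
```

Each appended ham-typical word multiplies the ham likelihood by a larger factor than the spam likelihood, so enough of them outweigh the three spam words and flip the prediction.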

In [5]:
import numpy as np
import pandas as pd

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split 
from sklearn.metrics import confusion_matrix

pd.__version__

'0.24.2'

In [14]:
train_1 = pd.read_csv('TrainDataset1.csv')
train_2 = pd.read_csv('TrainDataset2.csv')
train_3 = pd.read_csv('TrainDataset3.txt', sep="\t", header=None)

new_set = pd.read_csv('TestDataset.csv')

train_2.columns = ['type', 'text']
train_3.columns = ['type', 'text']
new_set.columns = ['text']

pieces = [train_1, train_2, train_3]
train_set = pd.concat(pieces)

x_data = train_set.text
y_data = train_set.type

# The default test_size is 0.25, so 75% of the rows are used for training
x_train, x_test, y_train, y_test = train_test_split(x_data, y_data)

cv = CountVectorizer()
#Learn the vocabulary dictionary and return term-document matrix.
counts = cv.fit_transform(x_train.values)

In [15]:
classifier = MultinomialNB()
targets = y_train.values
classifier.fit(counts, targets)

test_count = cv.transform(x_test.values)
y_predictions = classifier.predict(test_count)

print('True values')
print(y_test.values)
print('Predictions')
print(y_predictions)

print('\nConfusion matrix')
my_confusion_matrix = confusion_matrix(y_test.values, y_predictions)
print(my_confusion_matrix)

True values
['ham' 'ham' 'ham' ... 'ham' 'ham' 'ham']
Predictions
['spam' 'ham' 'ham' ... 'ham' 'ham' 'ham']

Confusion matrix
[[3358   15]
 [  11  513]]
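
To analyse the confusion matrix printed above, the usual summary metrics can be derived by hand. The sketch below assumes sklearn's default layout (rows are true labels, columns are predicted labels, labels sorted alphabetically as 'ham', 'spam') and treats spam as the positive class. The variable names are chosen for illustration.

```python
# Values copied from the confusion matrix printed above.
# Assumed layout: rows = true labels, columns = predicted labels,
# label order ('ham', 'spam').
matrix = [[3358, 15],
          [11, 513]]

(true_ham, false_spam), (false_ham, true_spam) = matrix
total = true_ham + false_spam + false_ham + true_spam

# Share of all test messages that were classified correctly.
accuracy = (true_ham + true_spam) / total
# Of the messages flagged as spam, how many really were spam.
precision = true_spam / (true_spam + false_spam)
# Of the real spam messages, how many were caught.
recall = true_spam / (true_spam + false_ham)

print("accuracy  = {:.4f}".format(accuracy))   # 0.9933
print("precision = {:.4f}".format(precision))  # 0.9716
print("recall    = {:.4f}".format(recall))     # 0.9790
```

Under that assumed layout, 15 ham messages were wrongly flagged as spam and 11 spam messages slipped through as ham, out of 3897 test messages.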
