# Problem statement
# Spam filtering with Naive Bayes classifiers: predict, based on its content, whether a new mail should be categorized as spam or not spam.

### Data processing using the pandas library

In [None]:
# Import the required libraries

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.naive_bayes import BernoulliNB
from sklearn.metrics import accuracy_score, classification_report,confusion_matrix
import string
import matplotlib.pyplot as plt

In [None]:
# Load the dataset

data = pd.read_csv("spam.tsv",sep='\t',names=['Class','Message'])
data.head(2) # View the first 2 records of our dataset

In [None]:
# Summary of the dataset
data.info()

In [None]:
# create a column to keep the count of the characters present in each record
data['Length'] = data['Message'].apply(len)

In [None]:
# view the dataset with the column 'Length' which contains the number of characters present in each mail
data.head(5)

In [None]:
# statistical info of the data
data.describe()

In [None]:
# Let's see the count of each class
data['Class'].value_counts()

### Text Pre-Processing


In [None]:
# Let's encode ham as 1
data.loc[data['Class']=="ham","Class"] = 1

In [None]:
# Let's encode spam as 0
data.loc[data['Class']=="spam","Class"] = 0

In [None]:
data.head(8)

## First, let's remove punctuation. Python's built-in `string` module gives us a ready-made list of punctuation characters:

In [None]:
# Why is it important to remove punctuation?

"This message is spam" == "This message is spam."

In [None]:
# Get Python's default list of punctuation characters
import string

string.punctuation

In [None]:
# Creating a function to remove the punctuation

def remove_punct(text):
    text = "".join([char for char in text if char not in string.punctuation]) 
    return text

In [None]:
s = "data// science!!"
remove_punct(s)
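
As an alternative sketch, the same cleanup can be done with `str.translate`, which deletes the characters in one pass instead of looping over them in Python. The name `remove_punct_fast` is hypothetical and is not used elsewhere in this notebook.

```python
import string

# Translation table that maps every ASCII punctuation character to None (deletes it)
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def remove_punct_fast(text):
    """Return `text` with all ASCII punctuation characters removed."""
    return text.translate(_PUNCT_TABLE)

print(remove_punct_fast("data// science!!"))
```

Both versions only remove the characters listed in `string.punctuation`; other symbols (for example curly quotes) are left untouched.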

In [None]:
text = []
for i in data['Message']:
    t = remove_punct(i)
    text.append(t)

In [None]:
data['text_clean'] = text  # store the cleaned messages in a new column
data

In [None]:
# Create the column 'text_clean' to hold the cleaned text (same result as the loop above)

data['text_clean'] = data['Message'].apply(remove_punct)  # apply() calls remove_punct on every message
                                                                       
# view the dataset
data.head()

Now we need to convert each of those messages into a numeric vector, which is the form that ML models can understand and work with.

In [None]:
# Splitting x and y

X = data['text_clean'].values # convert the column to a NumPy array
y = data['Class'].values

X

In [None]:
# The dtype of y is object, so convert it to int
y = y.astype('int')
y

### Splitting Train and Test Data

In [None]:
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2, random_state=10)
X_train.shape

In [None]:
X_test.shape

Tokenization means breaking a sentence, paragraph, or any other text into individual words (tokens).

CountVectorizer tokenizes the text and performs basic preprocessing, such as ignoring punctuation and converting all words to lowercase. It builds a vocabulary of known words from the training text, which is also used to encode unseen text later, and converts a collection of text documents into a matrix of token counts.
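
To make tokenization concrete, here is a minimal sketch using a regular expression. The pattern is the default `token_pattern` of CountVectorizer (runs of two or more word characters); the helper name `tokenize` and the sample message are made up for illustration.

```python
import re

# Default token pattern of scikit-learn's CountVectorizer:
# a token is a run of two or more word characters
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return TOKEN_PATTERN.findall(text.lower())

tokens = tokenize("Free entry!! Call now, win a PRIZE.")
print(tokens)
```

Note that single-character words such as "a" are dropped by this pattern, and punctuation never becomes part of a token.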

## BAG OF WORDS
In Natural Language Processing we cannot pass raw text directly to a model, so we first convert it into numbers that the machine can understand and use for modelling.
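
The idea can be sketched by hand in plain Python. This is an illustration of the bag-of-words technique, not scikit-learn's implementation; the three messages and the helper `to_count_vector` are made up.

```python
from collections import Counter

# Made-up messages, already lowercased and stripped of punctuation
docs = ["free prize call now", "call me now", "free free prize"]

# Vocabulary: every unique word, sorted, mapped to a column index
vocabulary = {
    word: idx
    for idx, word in enumerate(sorted({w for d in docs for w in d.split()}))
}

def to_count_vector(doc, vocabulary):
    """Encode one document as a list of word counts, one slot per vocabulary word."""
    counts = Counter(doc.split())
    vector = [0] * len(vocabulary)
    for word, n in counts.items():
        if word in vocabulary:  # words outside the vocabulary are ignored
            vector[vocabulary[word]] = n
    return vector

matrix = [to_count_vector(d, vocabulary) for d in docs]
print(vocabulary)
for row in matrix:
    print(row)
```

Each row is one message and each column is one vocabulary word. Word order is discarded, which is why the representation is called a "bag" of words.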

# CountVectorizer (Bag of words) - to extract the features from text

In [None]:
from sklearn.feature_extraction.text import CountVectorizer,TfidfVectorizer



# Initialize the object for countvectorizer 
CV = CountVectorizer(stop_words="english")  

Stop words are words that add little meaning to a sentence. They are very common in text documents, for example a, an, the, you, your. Because they appear so frequently yet rarely help text analysis, it is usually better to remove them so that the model can focus on the more informative words.
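
A minimal sketch of stop word removal on a tokenized message follows. The `STOP_WORDS` set here is a tiny hand-picked list for illustration only; the built-in `"english"` list used by CountVectorizer is much longer.

```python
# Tiny hand-picked stop word list (an assumption for this example,
# not the list scikit-learn uses)
STOP_WORDS = {"a", "an", "the", "you", "your", "is", "to"}

def remove_stop_words(tokens):
    """Keep only the tokens that are not in the stop word list."""
    return [t for t in tokens if t not in STOP_WORDS]

print(remove_stop_words(["you", "have", "won", "a", "prize"]))
```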

In [None]:
# Fit CountVectorizer on the training data and convert
# the text messages into count vectors
X_train_CV = CV.fit_transform(X_train)


In [None]:
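# Suppress library warnings to keep the notebook output readable
import warnings
warnings.filterwarnings('ignore')

# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
CV.get_feature_names_out()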

### Training a model

With messages represented as vectors, we can finally train our spam/ham classifier. Almost any classification algorithm could be used at this point. Naive Bayes is a good choice here because it is fast to train and works well with high-dimensional, sparse word-count features.
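
To show what the classifier does under the hood, here is a small from-scratch sketch of multinomial Naive Bayes with Laplace (add-one) smoothing. It is an illustration under simplifying assumptions, not scikit-learn's implementation; the function names and the four toy messages are made up, and labels follow this notebook's encoding (0 = spam, 1 = ham).

```python
import math
from collections import Counter, defaultdict

def train_multinomial_nb(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed log word likelihoods per class."""
    vocab = sorted({w for d in docs for w in d.split()})
    class_doc_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc.split())

    log_prior = {c: math.log(n / len(docs)) for c, n in class_doc_counts.items()}
    log_likelihood = {}
    for c in class_doc_counts:
        # Denominator: all word occurrences in class c plus smoothing mass
        total = sum(word_counts[c].values()) + alpha * len(vocab)
        log_likelihood[c] = {
            w: math.log((word_counts[c][w] + alpha) / total) for w in vocab
        }
    return log_prior, log_likelihood

def predict(doc, log_prior, log_likelihood):
    """Return the class with the highest log posterior score."""
    scores = {}
    for c in log_prior:
        # Words never seen in training are skipped
        scores[c] = log_prior[c] + sum(
            log_likelihood[c][w] for w in doc.split() if w in log_likelihood[c]
        )
    return max(scores, key=scores.get)

docs = ["win free prize now", "free cash prize", "see you at lunch", "lunch at noon tomorrow"]
labels = [0, 0, 1, 1]  # 0 = spam, 1 = ham
model = train_multinomial_nb(docs, labels)
print(predict("free prize", *model))
print(predict("lunch tomorrow", *model))
```

Working in log space avoids numeric underflow when many small probabilities are multiplied, and the smoothing term `alpha` keeps a word that never appeared in a class from forcing that class's probability to zero.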

In [None]:
# Initialising the model
NB = MultinomialNB()

In [None]:
# Fit the model on the vectorized training data
NB.fit(X_train_CV,y_train)

In [None]:
# Apply the fitted CountVectorizer to the test data.
X_test_CV = CV.transform(X_test) # transform() (not fit_transform()) reuses the training vocabulary, which avoids data leakage

In [None]:
# Prediction for X_test_CV

y_predict = NB.predict(X_test_CV)
y_predict

In [None]:
# classification report

print(classification_report(y_test,y_predict))

In [None]:
## confusion matrix
pd.crosstab(y_test,y_predict)
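
The numbers in the classification report can be derived from the four cells of the confusion matrix. The following sketch computes precision, recall, and F1 by hand for one class; the label lists are made up and the helper names are hypothetical.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) treating `positive` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    return tp, fp, fn, tn

def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall and F1 for the positive class."""
    tp, fp, fn, _ = confusion_counts(y_true, y_pred, positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

y_true = [1, 1, 1, 0, 0, 1, 0, 1]  # made-up true labels (1 = ham, 0 = spam)
y_pred = [1, 1, 0, 0, 1, 1, 0, 1]  # made-up predictions
print(confusion_counts(y_true, y_pred))
print(precision_recall_f1(y_true, y_pred))
```

Passing `positive=0` gives the same metrics for the spam class, which is usually the class of interest in a spam filter.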

In [None]:
#Initialising a model
bnb = BernoulliNB()

## fitting the model
bnb.fit(X_train_CV,y_train)

## getting the prediction
y_hat1=bnb.predict(X_test_CV) 

## confusion matrix
pd.crosstab(y_test,y_hat1)
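
Unlike MultinomialNB, BernoulliNB models only whether a word is present, not how often it occurs. With its default `binarize=0.0` setting, every count greater than zero is treated as 1. A minimal sketch of that step, using a hypothetical helper `binarize`:

```python
def binarize(counts, threshold=0.0):
    """Map each count to 1 if it exceeds the threshold, otherwise 0."""
    return [1 if c > threshold else 0 for c in counts]

print(binarize([0, 2, 0, 0, 1]))
```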

## TF-IDF

In [None]:
# Splitting x and y

X = data['text_clean'].values
y = data['Class'].values
y

In [None]:
# The dtype of y is object, so convert it to int
y = y.astype('int')
y

In [None]:
# Split the data into training and testing
# convert the training data - fit_transform()
# convert the testing data - transform()

In [None]:
## text preprocessing and feature vectorizer
# To extract features from a document of words, we import TfidfVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer


tf=TfidfVectorizer() ## object creation

#X=tf.fit_transform(X) ## fitting and transforming the data into vectors
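
As a rough sketch of what TF-IDF weighting computes, the code below applies the smoothed formula that TfidfVectorizer uses by default, `idf = ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row. It is a plain-Python illustration, not scikit-learn's implementation; the function name `tfidf` and the two messages are made up.

```python
import math

def tfidf(docs):
    """Return (vocab, rows) where rows holds L2-normalized TF-IDF weights."""
    tokenized = [d.split() for d in docs]
    vocab = sorted({w for tokens in tokenized for w in tokens})
    n = len(docs)
    # Document frequency: number of documents that contain each word
    df = {w: sum(1 for tokens in tokenized if w in tokens) for w in vocab}
    # Smoothed inverse document frequency
    idf = {w: math.log((1 + n) / (1 + df[w])) + 1 for w in vocab}

    rows = []
    for tokens in tokenized:
        raw = [tokens.count(w) * idf[w] for w in vocab]
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0  # avoid division by zero
        rows.append([v / norm for v in raw])
    return vocab, rows

vocab, rows = tfidf(["free prize now", "call me now"])
print(vocab)
for row in rows:
    print([round(v, 3) for v in row])
```

The word "now" appears in both messages, so it receives a lower weight than words that are specific to one message.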

In [None]:
## Creating training and testing
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(X,y,random_state=6)

In [None]:
## print the feature names selected by TF-IDF from the raw documents
#tf.get_feature_names_out()

In [None]:
## number of features created
#len(tf.get_feature_names_out())

In [None]:
X_train_cv = tf.fit_transform(X_train)
X_test_cv = tf.transform(X_test)

In [None]:
# Initialising the model
nb = MultinomialNB()
nb.fit(X_train_cv,y_train)  

# IDF

In [None]:
y_hat = nb.predict(X_test_cv)

In [None]:
# classification report

print(classification_report(y_test,y_hat))

In [None]:
pd.crosstab(y_test,y_hat)

In [None]:
## model object creation
nb=BernoulliNB()

## fitting the model
nb.fit(X_train_cv,y_train)

## getting the prediction
y_hat=nb.predict(X_test_cv)

In [None]:
y_hat

In [None]:
## Evaluating the model
from sklearn.metrics import classification_report,confusion_matrix

In [None]:
print(classification_report(y_test,y_hat))

In [None]:
## confusion matrix
pd.crosstab(y_test,y_hat)