# SPAM FILTERING USING THE NAIVE BAYES CLASSIFIER

## Problem statement

#### Use a naive Bayes classifier to predict, from its content, whether a new mail should be categorized as spam or ham.

## Importing the basic libraries:

In [1]:
import numpy as np
import pandas as pd

## Importing the dataset: 

In [2]:
data= pd.read_csv("spam.tsv", sep= '\t', names= ['class', 'message'])
data.head() 


# A TSV (Tab-Separated Values) file is a plain-text format for storing tabular data, similar to a CSV (Comma-Separated Values) file.
# Each line represents one row of the table, and columns are separated by tab characters.

Unnamed: 0,class,message
0,ham,I've been searching for the right words to tha...
1,spam,Free entry in 2 a wkly comp to win FA Cup fina...
2,ham,"Nah I don't think he goes to usf, he lives aro..."
3,ham,Even my brother is not like to speak with me. ...
4,ham,I HAVE A DATE ON SUNDAY WITH WILL!!!


## Basic checks: 

In [3]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5567 entries, 0 to 5566
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   class    5567 non-null   object
 1   message  5567 non-null   object
dtypes: object(2)
memory usage: 87.1+ KB


In [4]:
# 5567 rows and two columns

In [5]:
data.shape

(5567, 2)

In [3]:
# creating a column to keep the count of characters in each record

data['length'] = data['message'].apply(len)
data.head(2)

Unnamed: 0,class,message,length
0,ham,I've been searching for the right words to tha...,196
1,spam,Free entry in 2 a wkly comp to win FA Cup fina...,155


In [4]:
# can be used for filtering out the records with a particular length of characters
# for instance as below:

data.loc[data['length'] <100]

Unnamed: 0,class,message,length
2,ham,"Nah I don't think he goes to usf, he lives aro...",61
3,ham,Even my brother is not like to speak with me. ...,77
4,ham,I HAVE A DATE ON SUNDAY WITH WILL!!!,36
12,ham,Oh k...i'm watching here:),26
13,ham,Eh u remember how 2 spell his name... Yes i di...,81
...,...,...,...
5559,ham,Why don't you wait 'til at least wednesday to ...,67
5560,ham,Huh y lei...,12
5563,ham,Will ü b going to esplanade fr home?,36
5564,ham,"Pity, * was in mood for that. So...any other s...",57


In [5]:
data.describe(include='O')

Unnamed: 0,class,message
count,5567,5567
unique,2,5164
top,ham,"Sorry, I'll call later"
freq,4821,30


In [9]:
# unique values in the class column

data['class'].unique()

array(['ham', 'spam'], dtype=object)

In [10]:
data['class'].value_counts() # imbalanced data

class
ham     4821
spam     746
Name: count, dtype: int64

## Domain analysis:
- The data is about the classification of mails into spam and ham based on their content.
- The feature 'class' shows the target values and the 'message' column contains the content of the mail.


## Text preprocessing: 

#### steps involved:
1. removing punctuation
2. applying tokenization
3. lowercase
4. converting text into vector

#### two types of vectorization
1. bag of words: CountVectorizer
2. TF-IDF: TfidfVectorizer
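To make the difference between the two vectorizers concrete, the sketch below runs both on a tiny corpus. The three sentences are made up for illustration and are not taken from spam.tsv.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical toy corpus, not taken from spam.tsv
docs = ["free prize call now", "call me later", "free free prize"]

bow = CountVectorizer()
counts = bow.fit_transform(docs)      # raw term counts per document

tfidf = TfidfVectorizer()
weights = tfidf.fit_transform(docs)   # counts reweighted by inverse document frequency

print(sorted(bow.vocabulary_))        # column order of both matrices
print(counts.toarray())
print(np.round(weights.toarray(), 2))
```

Both matrices have the same shape. The bag-of-words matrix holds integer counts, while each TF-IDF row is rescaled to unit length and gives less weight to words that occur in many documents.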

#### Encoding: 

In [6]:
# encoding of target variable
# ham---1
# spam---0

data['class'].value_counts()


class
ham     4821
spam     746
Name: count, dtype: int64

In [7]:
# ham---1
# spam---0

data.loc[data['class']=='ham', 'class']= 1
data.loc[data['class']=='spam', 'class']= 0

In [8]:
data.head()

Unnamed: 0,class,message,length
0,1,I've been searching for the right words to tha...,196
1,0,Free entry in 2 a wkly comp to win FA Cup fina...,155
2,1,"Nah I don't think he goes to usf, he lives aro...",61
3,1,Even my brother is not like to speak with me. ...,77
4,1,I HAVE A DATE ON SUNDAY WITH WILL!!!,36


#### Removing punctuation: 

Python built-in library:
- string: used to get a quick list of all the possible punctuation characters.

In [9]:
# why is it important to remove the punctuations

'This message is spam'=='This message is spam.'

False

In [10]:
# get the default list of punctuations in python

import string
string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [11]:
# creating a function to remove the punctuation

def remove_punc(text):
    text= "".join(char for char in text if char not in string.punctuation)
    return text


In [12]:
# checking the function using general text

s= "How u doin'?"
remove_punc(s)

'How u doin'
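As an alternative sketch, the same cleaning can be done with `str.translate`, which avoids the character-by-character loop. The name `remove_punc_fast` is hypothetical and only used here for illustration.

```python
import string

# Translation table that deletes every character listed in string.punctuation
_PUNC_TABLE = str.maketrans("", "", string.punctuation)

def remove_punc_fast(text):
    """Hypothetical variant of remove_punc built on str.translate."""
    return text.translate(_PUNC_TABLE)

print(remove_punc_fast("How u doin'?"))
```

It returns the same result as `remove_punc` for the test string above and could be passed to `.apply()` in the same way.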

In [13]:
# applying the function to get punctuation-free clean text from the 'message' column

data['clean_text']= data['message'].apply(lambda x:remove_punc(x))
data.head()

# lambda keyword is used to create anonymous functions

Unnamed: 0,class,message,length,clean_text
0,1,I've been searching for the right words to tha...,196,Ive been searching for the right words to than...
1,0,Free entry in 2 a wkly comp to win FA Cup fina...,155,Free entry in 2 a wkly comp to win FA Cup fina...
2,1,"Nah I don't think he goes to usf, he lives aro...",61,Nah I dont think he goes to usf he lives aroun...
3,1,Even my brother is not like to speak with me. ...,77,Even my brother is not like to speak with me T...
4,1,I HAVE A DATE ON SUNDAY WITH WILL!!!,36,I HAVE A DATE ON SUNDAY WITH WILL


In [14]:
# for list- use append() function
# for characters in a string-- use join() function
# .apply()-- applies the functionality

#### Splitting the data

In [15]:
# splitting the data into x and y

In [16]:
x= data['clean_text']
y= data['class']

In [17]:
x.dtype

dtype('O')

In [18]:
y.dtype 

dtype('O')

In [19]:
y.value_counts() # despite encoding, the datatype of the target remains to be the 'object', hence should be changed to int type

class
1    4821
0     746
Name: count, dtype: int64

In [20]:
# datatype of y changed to int
y = y.astype('int') # typecasting
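The encoding and the cast can also be done in one step. The sketch below uses a small stand-in frame rather than the real `data`, and the column name `label` is an assumption made for the example.

```python
import pandas as pd

# Toy frame standing in for `data`; the labels mirror the 'class' column
toy = pd.DataFrame({"class": ["ham", "spam", "ham"]})

# map() builds a new integer column directly, so no separate astype step is needed
toy["label"] = toy["class"].map({"ham": 1, "spam": 0})

print(toy["label"].tolist())
print(toy["label"].dtype)
```

Writing to a new column also keeps the original text labels available for inspection.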

#### train_test_split: 

In [21]:
# split the data into training and testing
# convert the training data- fit_transform()
# convert the testing data- transform()
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test= train_test_split(x, y, test_size= 0.2, random_state= 42)
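Because the classes are imbalanced, it can help to keep the spam share identical in the training and test sets. The split above does not do this; the sketch below shows the optional `stratify` argument on synthetic labels that only reproduce the class counts reported earlier (4821 ham, 746 spam).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with the same class counts as the dataset (1 = ham, 0 = spam)
y_demo = np.array([1] * 4821 + [0] * 746)
x_demo = np.arange(len(y_demo))  # placeholder features

_, _, _, y_te_plain = train_test_split(
    x_demo, y_demo, test_size=0.2, random_state=42)
_, _, _, y_te_strat = train_test_split(
    x_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo)

print("overall spam share   :", round(1 - y_demo.mean(), 4))
print("plain test spam share:", round(1 - y_te_plain.mean(), 4))
print("strat test spam share:", round(1 - y_te_strat.mean(), 4))
```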

###### Tokenization is the process of breaking down a stream of text into words, phrases, symbols, or other meaningful elements, known as tokens. In natural language processing (NLP), tokenization is an essential step in text preprocessing and analysis.

##### CountVectorizer:
###### tasks performed:
- **tokenization**- breaks the text into individual tokens. With the default settings a token is a run of two or more alphanumeric characters, so punctuation acts as a separator and single-character words are dropped.
- **normalisation**- converts all text to lowercase by default before tokenizing, and removes stop words when the stop_words parameter is set.
- **building vocabulary**- builds a vocabulary of all unique tokens seen during fitting. The vocabulary can be accessed using the vocabulary_ attribute.
- **counting occurrences**- counts how often each vocabulary token occurs in each document.
- **vectorization**- returns a sparse matrix where each row represents a document or sample, each column represents a unique token, and the values **indicate the frequency of the token** in the corresponding document.
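The tasks listed above can be observed on a small example. The two sentences below are hypothetical and chosen only to show lowercasing, punctuation handling and the dropping of one-character tokens.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["Free entry, FREE prize!", "I am at home"]   # hypothetical examples

cv_demo = CountVectorizer()
matrix = cv_demo.fit_transform(docs)

# vocabulary_ maps each token to its column index in the matrix
print(cv_demo.vocabulary_)
# Dense view of the sparse document-term matrix (only sensible for a toy corpus)
print(matrix.toarray())
```

"Free" and "FREE" end up in the same column, and the one-letter word "I" does not appear in the vocabulary at all.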

#### BAG OF WORDS using countvectorizer 

In [22]:
from sklearn.feature_extraction.text import CountVectorizer
cv= CountVectorizer(stop_words= 'english')
x_train_cv= cv.fit_transform(x_train)   

# the vectorizer is fitted on x_train only; x_test is later converted with transform() so that no vocabulary is learned from the test data

In [23]:
# stop words are common words in a natural language that are removed before converting text data into vectors.
# eg: and, are, is, was, were etc.

In [24]:
x_test_cv= cv.transform(x_test)
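The fit-on-train, transform-on-test pattern can also be wrapped in a scikit-learn pipeline so that the vectorizer and the classifier always stay in sync. This is a sketch on made-up messages, not the approach used in the rest of this notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy data; 1 = ham, 0 = spam, matching the encoding used above
train_msgs = ["win a free prize now", "free cash claim now",
              "see you at lunch", "call me when home"]
train_labels = [0, 0, 1, 1]

pipe = make_pipeline(CountVectorizer(stop_words="english"), MultinomialNB())
pipe.fit(train_msgs, train_labels)   # vocabulary is learned from the training messages only

# New text goes through transform() and predict() in a single call
print(pipe.predict(["claim your free prize"]))
```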

#### Balancing the dataset :

#### Applying SMOTE directly to text data for classification, especially with Naive Bayes models, requires careful consideration because of the nature of text data and the assumptions of Naive Bayes.
Points to consider:
- **Nature of data:**
Text data is inherently different from tabular data. Words carry semantic meaning, and their positions and combinations can significantly change the meaning of a text. Resampling techniques such as SMOTE, which were designed for numeric features, may not yield meaningful results when applied directly to text.
- **Naive Bayes assumptions:**
Naive Bayes models assume that features are conditionally independent given the class label. This simplifies modeling, but it does not strictly hold for text, where words depend on each other to convey meaning. Changing the distribution of the data with SMOTE therefore might not help and could distort the underlying structure of the text data.
- **Sparse matrix representation:**
Text is usually represented as a sparse matrix in which each row corresponds to a document and each column to a unique word (or n-gram). SMOTE creates new rows by interpolating between existing ones, so the synthetic documents can contain fractional word counts and word combinations that do not reflect real-world text.
- **Summary:**
SMOTE can be effective for class imbalance in some machine learning problems, but applying it directly to imbalanced text classification with Naive Bayes may not give desirable results. No resampling is applied in this notebook. The characteristics of the text and the assumptions of the chosen classifier should guide how class imbalance is handled.
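One lighter-weight option that does not create synthetic documents is to change the class priors of the classifier instead of the data. The sketch below is an illustration on a made-up count matrix and is not used later in this notebook; it relies on the `fit_prior` parameter of scikit-learn's `MultinomialNB`.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Hypothetical count matrix: 4 ham rows (label 1) and 1 spam row (label 0)
X = np.array([[2, 0], [1, 0], [3, 1], [2, 0], [0, 3]])
y = np.array([1, 1, 1, 1, 0])

learned = MultinomialNB().fit(X, y)                  # priors estimated from class frequencies
uniform = MultinomialNB(fit_prior=False).fit(X, y)   # every class gets the same prior

print(np.exp(learned.class_log_prior_))   # approx [0.2, 0.8]
print(np.exp(uniform.class_log_prior_))   # [0.5, 0.5]
```

With a uniform prior the minority class is no longer penalised simply for being rare, although whether this helps on a given dataset has to be checked on held-out data.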


## Model building: 

#### Types of naive bayes in scikit library:
##### Gaussian NB
- It is used in classification assuming that the features follow a normal distribution.
##### Multinomial NB
- It is used for discrete counts, that is, when the features represent frequencies. Instead of asking whether a word occurs in the document or not, it uses the count of how many times the word occurs in the document.
- A word whose count in a document is zero contributes nothing to that document's likelihood, so multinomial naive Bayes effectively ignores it. A word that never appears with a class during training would get probability 0, which scikit-learn avoids by default through Laplace smoothing (alpha=1.0). It is known to work well for text classification problems.
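The effect of smoothing on the per-class word probabilities can be checked by hand. The sketch below assumes a made-up count matrix with a three-word vocabulary and compares a manual estimate with what `MultinomialNB` stores in `feature_log_prob_`.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Hypothetical counts for a 3-word vocabulary; rows are documents
X = np.array([[2, 1, 0], [1, 0, 0], [0, 0, 3]])
y = np.array([1, 1, 0])

alpha = 1.0
model = MultinomialNB(alpha=alpha).fit(X, y)

# Manual estimate per class:
# (count of word in class + alpha) / (total words in class + alpha * vocabulary size)
manual = []
for c in model.classes_:
    word_counts = X[y == c].sum(axis=0)
    manual.append((word_counts + alpha) / (word_counts.sum() + alpha * X.shape[1]))
manual = np.array(manual)

print(manual)
print(np.exp(model.feature_log_prob_))
```

Every probability is strictly positive, including those for words that never occurred with a class in the toy training data.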

##### Bernoulli
- This is useful when the feature vectors are binary (0 and 1).
- One application is text classification with a "bag of words" model where 1 means "word occurs in the document" and 0 means "word does not occur in the document".
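In scikit-learn, `BernoulliNB` thresholds its input by default (`binarize=0.0`), so fitting it on counts gives the same model as fitting it on presence/absence values. A small check on made-up counts:

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

X_counts = np.array([[3, 0, 1], [0, 2, 0], [1, 1, 0], [0, 0, 4]])  # hypothetical counts
y = np.array([0, 1, 1, 0])

X_binary = (X_counts > 0).astype(int)   # 1 = word present, 0 = word absent

on_counts = BernoulliNB().fit(X_counts, y)   # counts are binarized internally
on_binary = BernoulliNB().fit(X_binary, y)

print(np.allclose(on_counts.feature_log_prob_, on_binary.feature_log_prob_))
```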

In [25]:
# using multinomial nb

from sklearn.naive_bayes import MultinomialNB
model_m_nb= MultinomialNB()
model_m_nb.fit(x_train_cv, y_train)

In [26]:
# prediction

y_pred_m_nb= model_m_nb.predict(x_test_cv)
y_pred_m_nb

array([1, 1, 1, ..., 1, 1, 1])

In [27]:
# evaluation:
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report, precision_score, recall_score, f1_score

In [28]:
accuracy_score(y_test, y_pred_m_nb)

0.9892280071813285

In [29]:
confusion_matrix(y_test, y_pred_m_nb)

array([[139,   6],
       [  6, 963]], dtype=int64)

In [30]:
print(classification_report(y_test, y_pred_m_nb))

              precision    recall  f1-score   support

           0       0.96      0.96      0.96       145
           1       0.99      0.99      0.99       969

    accuracy                           0.99      1114
   macro avg       0.98      0.98      0.98      1114
weighted avg       0.99      0.99      0.99      1114
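The figures in the report can be reproduced directly from the confusion matrix shown above, which makes it easier to see what each metric measures. The matrix values are copied from the output; everything else is plain arithmetic.

```python
import numpy as np

# Confusion matrix reported above: rows = true class, columns = predicted class
# (0 = spam, 1 = ham)
cm = np.array([[139, 6],
               [6, 963]])

accuracy = np.trace(cm) / cm.sum()
precision = np.diag(cm) / cm.sum(axis=0)   # correct predictions per predicted class
recall = np.diag(cm) / cm.sum(axis=1)      # correct predictions per true class
f1 = 2 * precision * recall / (precision + recall)

print(round(accuracy, 4))
print(precision.round(2), recall.round(2), f1.round(2))
```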



In [31]:
# prediction

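text= input("Enter the message: ")
# apply the same punctuation removal that was used on the training data
msg_input= cv.transform([remove_punc(text)])
predict= model_m_nb.predict(msg_input)
# predict is an array with one element: 0 = spam, 1 = ham
if predict[0] == 0:
    print('spam mail')
else:
    print('ham mail')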


Enter the message: you won 100 dollars.
spam mail


In [32]:
# using bernoulli nb

from sklearn.naive_bayes import BernoulliNB
model_b_nb= BernoulliNB()
model_b_nb.fit(x_train_cv, y_train)

In [33]:
# prediction
y_pred_b_nb= model_b_nb.predict(x_test_cv)
y_pred_b_nb

array([1, 1, 1, ..., 1, 1, 1])

In [34]:
# evaluation

confusion_matrix(y_test, y_pred_b_nb)

array([[117,  28],
       [  1, 968]], dtype=int64)

In [35]:
pd.crosstab(y_test, y_pred_b_nb)

col_0,0,1
class,Unnamed: 1_level_1,Unnamed: 2_level_1
0,117,28
1,1,968


In [36]:
# 28 spam mails are misclassified as ham mails
# only 1 ham mail is misclassified as spam mail.

#### TFIDF vectorizer

In [37]:
# splitting x and y
x= data['clean_text']
y= data['class']

In [38]:
# typecasting
y= y.astype('int')

In [39]:
# train test split 
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test= train_test_split(x, y, test_size= 0.2, random_state= 42)

In [40]:
# tfidf
from sklearn.feature_extraction.text import TfidfVectorizer
tf= TfidfVectorizer()
x_train_tf= tf.fit_transform(x_train)
x_test_tf= tf.transform(x_test)

In [41]:
# using multinomial nb

from sklearn.naive_bayes import MultinomialNB
model_m_nb1= MultinomialNB()
model_m_nb1.fit(x_train_tf, y_train)

In [42]:
# prediction
y_pred_m_nb1 = model_m_nb1.predict(x_test_tf)
y_pred_m_nb1

array([1, 1, 1, ..., 1, 1, 1])

In [43]:
# evaluation
accuracy_score(y_test, y_pred_m_nb1)

0.9587073608617595

In [44]:
pd.crosstab(y_test, y_pred_m_nb1)

col_0,0,1
class,Unnamed: 1_level_1,Unnamed: 2_level_1
0,99,46
1,0,969


In [45]:
# 46 spam mails are misclassified as ham mails
# No ham mail is misclassified as spam mail. Since flagging a legitimate mail as spam
# is the more costly error, this model is good on that count, but it misses the most spam of the models tried so far.
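The trade-off between missed spam and wrongly flagged ham can be tuned by thresholding the predicted spam probability instead of using `predict()` directly. The sketch below uses made-up messages, and `predict_with_threshold` is a hypothetical helper, not part of scikit-learn.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy messages; 1 = ham, 0 = spam
msgs = ["free prize claim now", "win cash free entry", "lunch at noon today",
        "see you at home", "are you free today", "call me after lunch"]
labels = np.array([0, 0, 1, 1, 1, 1])

vec = TfidfVectorizer()
clf = MultinomialNB().fit(vec.fit_transform(msgs), labels)

def predict_with_threshold(texts, spam_threshold=0.5):
    """Label a message spam (0) only if P(spam) >= spam_threshold, else ham (1)."""
    spam_col = list(clf.classes_).index(0)
    p_spam = clf.predict_proba(vec.transform(texts))[:, spam_col]
    return np.where(p_spam >= spam_threshold, 0, 1)

new = ["free cash prize", "free for lunch today"]
print(predict_with_threshold(new, 0.5))
print(predict_with_threshold(new, 0.9))
```

Raising the threshold can only reduce the number of messages labelled spam, which protects ham at the cost of letting more spam through. Lowering it does the opposite.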

In [46]:
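# using bernoulli nb
# BernoulliNB binarizes its input (default binarize=0.0), so the tf-idf weights are reduced to word present / absent
model_b_nb1= BernoulliNB()
model_b_nb1.fit(x_train_tf, y_train)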

In [47]:
# prediction
y_pred_b_nb1= model_b_nb1.predict(x_test_tf)
y_pred_b_nb1

array([1, 1, 1, ..., 1, 1, 1])

In [48]:
#evaluation

accuracy_score(y_test, y_pred_b_nb1)

0.9784560143626571

In [49]:
pd.crosstab(y_test, y_pred_b_nb1)

col_0,0,1
class,Unnamed: 1_level_1,Unnamed: 2_level_1
0,122,23
1,1,968


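### Complement NB (for imbalanced data)

Complement naive Bayes estimates the parameters for each class from the documents of all the other classes, which makes it better suited to imbalanced datasets than standard multinomial naive Bayes.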

#### Complement Naive Bayes using countvectorizer 

In [50]:
from sklearn.naive_bayes import ComplementNB
model_com_nb= ComplementNB()
model_com_nb.fit(x_train_cv, y_train)

In [52]:
# prediction 
y_pred_com_nb= model_com_nb.predict(x_test_cv)
y_pred_com_nb

array([1, 1, 1, ..., 1, 1, 0])

In [53]:
#evaluation
accuracy_score(y_test, y_pred_com_nb)

0.9560143626570916

In [54]:
pd.crosstab(y_test, y_pred_com_nb)

col_0,0,1
class,Unnamed: 1_level_1,Unnamed: 2_level_1
0,139,6
1,43,926


#### Complement Naive Bayes using tfidf 

In [55]:
from sklearn.naive_bayes import ComplementNB
model_com_nb1= ComplementNB()
model_com_nb1.fit(x_train_tf, y_train)

In [56]:
# prediction 
y_pred_com_nb1= model_com_nb1.predict(x_test_tf)
y_pred_com_nb1

array([1, 1, 1, ..., 1, 1, 1])

In [57]:
#evaluation
accuracy_score(y_test, y_pred_com_nb1)

0.9784560143626571

In [58]:
pd.crosstab(y_test, y_pred_com_nb1)

col_0,0,1
class,Unnamed: 1_level_1,Unnamed: 2_level_1
0,134,11
1,13,956
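To compare the six models side by side, the confusion matrices reported above can be collected into one table. The numbers are copied from the outputs in this notebook; the column names are chosen here for illustration.

```python
import pandas as pd

# (spam caught, spam missed, ham flagged as spam, ham kept) from the outputs above
results = {
    "MultinomialNB + counts": (139, 6, 6, 963),
    "BernoulliNB + counts": (117, 28, 1, 968),
    "MultinomialNB + tf-idf": (99, 46, 0, 969),
    "BernoulliNB + tf-idf": (122, 23, 1, 968),
    "ComplementNB + counts": (139, 6, 43, 926),
    "ComplementNB + tf-idf": (134, 11, 13, 956),
}

rows = []
for name, (caught, missed, flagged, kept) in results.items():
    total = caught + missed + flagged + kept
    rows.append({
        "model": name,
        "accuracy": round((caught + kept) / total, 4),
        "spam_recall": round(caught / (caught + missed), 4),
        "ham_flagged_as_spam": flagged,
    })

summary = pd.DataFrame(rows).set_index("model")
print(summary.sort_values("accuracy", ascending=False))
```

On this test split, multinomial NB on counts has the highest accuracy, while multinomial NB on TF-IDF is the only model that flags no ham as spam.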


### Pros of naive bayes:
1. Fast and highly scalable: training needs only a single pass over the data to collect counts, so it copes well with large datasets.
2. Works for both binary and multiclass classification problems, and comes in variants (Gaussian, multinomial, Bernoulli) suited to different feature types.
3. Simple and easy to implement.

### Cons: 

1. Because it assumes the features are independent, it cannot learn relationships between features, which limits its accuracy in real-world use cases where features interact.
2. NB can learn the importance of each individual feature but not the relationships between features.
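The limitation above shows up clearly on an XOR pattern, where the label depends only on the combination of two features. This is a minimal sketch on four hand-made rows.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# XOR: the label is 1 only when exactly one of the two features is present
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 0])

nb = BernoulliNB().fit(X, y)

print(nb.predict_proba(X))   # every row is [0.5, 0.5]
print(nb.score(X, y))        # accuracy on the training rows
```

Each feature on its own is equally common in both classes, so the model cannot separate them and does no better than chance.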


### Applications: 

- Naive Bayes classifiers are mostly used in text classification.
- News article classification into categories such as SPORTS or TECHNOLOGY.
- Spam filtering.
- Sentiment analysis: determining whether a customer thinks positively or negatively about a certain topic.
- Recommendation systems: a naive Bayes classifier combined with collaborative filtering can be used to build a recommendation system.