# Lab 6: Naive Bayes Classifier for Email Spam Filtering

## Naive Bayes Classifier

The **Naive Bayes classifier** is a probabilistic machine learning model used for classification tasks. It is based on Bayes' Theorem and assumes that the features (predictors) are independent of each other, given the class label—an assumption called "naive" because it is rarely true in real-world data. Despite this, Naive Bayes works well for many applications, especially in text classification (e.g., spam detection, sentiment analysis). It calculates the probability of each class and assigns the label with the highest probability.

## Text Feature Extraction

**Text feature extraction** is the process of converting raw text into numerical data that can be used by machine learning algorithms. This typically involves transforming the text into a matrix of features, where each feature represents some aspect of the text, such as the presence of certain words or phrases. Common methods for text feature extraction include the Bag of Words (BoW) approach, TF-IDF (Term Frequency-Inverse Document Frequency), and word embeddings.

## Bag of Words (BoW) Approach

The **Bag of Words (BoW)** approach is a simple method of text feature extraction where each document is represented as an unordered collection of words, disregarding grammar and word order. The frequency of each word in the document is counted and used as a feature for that document. The resulting model treats all documents as "bags" of words, focusing only on the occurrence of each word to create a word-frequency vector for classification or other machine learning tasks.
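As a minimal sketch of the idea, the snippet below builds word-count vectors by hand using only the standard library. The three messages in `docs`, the helper `bow_vector`, and the whitespace tokenization are illustrative assumptions, not part of the lab dataset or of any library.

```python
from collections import Counter

# Toy corpus (hypothetical messages, not taken from the lab dataset)
docs = ["free prize win free", "see you at dinner", "win a free dinner"]

# Lowercase and split on whitespace (a deliberate simplification)
tokenized = [doc.lower().split() for doc in docs]

# Vocabulary: every distinct word in the corpus, in sorted order
vocab = sorted({word for tokens in tokenized for word in tokens})

def bow_vector(tokens, vocab):
    """Count how often each vocabulary word occurs in one document."""
    counts = Counter(tokens)
    return [counts[word] for word in vocab]

# One row per document, one column per vocabulary word
bow_matrix = [bow_vector(tokens, vocab) for tokens in tokenized]

print(vocab)
for row in bow_matrix:
    print(row)
```

Each row has the same length as the vocabulary, and word order within a message has no effect on its row.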

## N-grams

**N-grams** are contiguous sequences of _n_ items (usually words or characters) from a given text or speech. They are used in text analysis to capture context, patterns, or dependencies between words. In natural language processing, n-grams provide a way to incorporate word order and proximity into models, as they analyze not only individual words but also combinations of words appearing together.

## Unigram

A **unigram** is an n-gram with _n = 1_. It represents an individual word or token in a sequence. In a unigram model, each word is treated as an independent feature without considering the words around it. Unigrams are commonly used in text classification and sentiment analysis.

**Example**: For the sentence "I love cats", the unigrams are: `["I", "love", "cats"]`.

## Bigram

A **bigram** is an n-gram with _n = 2_. It captures two consecutive words or tokens from the text. Bigrams are useful for capturing short-term dependencies or relationships between adjacent words, providing more contextual information than unigrams.

**Example**: For the sentence "I love cats", the bigrams are: `["I love", "love cats"]`.
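The unigram and bigram examples above can be reproduced with a small helper. This is a sketch that assumes simple whitespace tokenization; the function name `ngrams` is hypothetical and chosen only for illustration.

```python
def ngrams(text, n):
    """Return the word n-grams of `text`, splitting on whitespace."""
    tokens = text.split()
    # Slide a window of n tokens across the sentence
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sentence = "I love cats"
unigrams = ngrams(sentence, 1)
bigrams = ngrams(sentence, 2)

print(unigrams)  # ['I', 'love', 'cats']
print(bigrams)   # ['I love', 'love cats']
```

If _n_ is larger than the number of words, the window never fits and the result is an empty list.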


## Naive Bayes Model for Spam Detection

### Important points to consider in the Naive Bayes classifier

1. Naive Bayes assumes that all features are independent of one another given the class. For a spam filter, this means every word in a message is treated as independent of all other words, and words are counted without regard to their context.
2. The classification algorithm then estimates the probability that a message is spam or not spam.
3. The probability estimate is based on Bayes' formula, whose components are determined from the word frequencies across the whole collection of messages.


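## Naive Bayes classifier equation

$P(C \mid X_{1}, X_{2}, \ldots, X_{N}) \propto P(C) \, P(X_{1} \mid C) \, P(X_{2} \mid C) \cdots P(X_{N} \mid C)$

Here $C$ is the class (spam or ham) and $X_{1}, \ldots, X_{N}$ are the words of the message. The right-hand side is proportional to the posterior rather than equal to it: the normalizing term $P(X_{1}, \ldots, X_{N})$ is the same for every class, so it can be dropped when comparing classes.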
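To make the formula concrete, here is a minimal from-scratch sketch on a hypothetical four-message training set. Two things are assumptions rather than part of the lab text: log-probabilities are summed instead of multiplying raw probabilities (to avoid numerical underflow), and add-one (Laplace) smoothing is applied so that a word unseen in one class does not force the score to zero. The names `train`, `log_score`, and `predict` are illustrative only.

```python
import math
from collections import Counter

# Hypothetical toy training set: (message, label), 1 = spam, 0 = ham
train = [
    ("win free prize now", 1),
    ("free offer win cash", 1),
    ("are we meeting for dinner", 0),
    ("see you at dinner tonight", 0),
]

# Word counts per class and message counts per class
word_counts = {0: Counter(), 1: Counter()}
class_counts = Counter()
for text, label in train:
    word_counts[label].update(text.split())
    class_counts[label] += 1

vocab = set(word_counts[0]) | set(word_counts[1])

def log_score(text, label, alpha=1.0):
    """log P(C) + sum of log P(X_i | C), with add-alpha smoothing (assumed)."""
    total_words = sum(word_counts[label].values())
    score = math.log(class_counts[label] / sum(class_counts.values()))
    for word in text.split():
        if word not in vocab:
            continue  # ignore words never seen during training
        count = word_counts[label][word]
        score += math.log((count + alpha) / (total_words + alpha * len(vocab)))
    return score

def predict(text):
    # Choose the class with the larger unnormalized log posterior
    return max((0, 1), key=lambda label: log_score(text, label))

print(predict("free cash prize"))
print(predict("dinner tonight"))
```

The scores are unnormalized, which is enough for choosing a class; turning them into actual probabilities would require dividing by the sum over both classes.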

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

In [2]:
df = pd.read_csv('dataset/spam_email.csv', encoding='latin1')
df.head()

Unnamed: 0,Category,Msg
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


### Convert categorical data into numeric data and find the message length

In previous labs, we used label encoding or one-hot encoding. Here we use a simpler approach: map the contents of the `Category` column directly to numbers.

- Ham (legitimate email) message: label = 0
- Spam message: label = 1


In [3]:
df['label'] = df.Category.map({'ham': 0, 'spam': 1})
df['msg_len'] = df.Msg.apply(len)
df.head()

Unnamed: 0,Category,Msg,label,msg_len
0,ham,"Go until jurong point, crazy.. Available only ...",0,111
1,ham,Ok lar... Joking wif u oni...,0,29
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,1,155
3,ham,U dun say so early hor... U c already then say...,0,49
4,ham,"Nah I don't think he goes to usf, he lives aro...",0,61


#### Split the dataset into training (80%) and test (20%, `test_size=0.2`) sets


In [4]:
X_train, X_test, Y_train, Y_test = train_test_split(df.Msg, df.label, test_size=0.2)

## BoW Model

Convert all texts in the dataset to a matrix representation, called the BoW model, in which the ordering of words does not matter.

#### In scikit-learn this is done with `CountVectorizer`. Using the Bag of Words (BoW) model, we convert the text messages in the `Msg` column into numbers


In [5]:
word_freq_count = CountVectorizer()
X_train_count = word_freq_count.fit_transform(X_train.values)

print(word_freq_count.get_feature_names_out())

['00' '000' '000pes' ... 'zyada' 'ãº1' 'ã¼']


## N-grams

How can we divide the text of a document into groups of words? For example, convert the sentence "It was a great Hilarious Day" to unigrams, bigrams and 3-grams:

- unigrams = { 'It', 'was', 'a', 'great', 'Hilarious', 'Day' }
- bigrams = { 'It was', 'was a', 'a great', 'great Hilarious', 'Hilarious Day' }
- 3-grams = { 'It was a', 'was a great', 'a great Hilarious', 'great Hilarious Day' }

In our spam detection, unigrams are expected to offer better distinction than bigrams or 3-grams, because spam is often recognizable from single words like 'Sale', 'Discount', 'Fine', etc.


## Feed the processed data into the machine learning model

In this case, use Multinomial Naive Bayes, which is suited to discrete data such as word counts, via the `MultinomialNB` class in scikit-learn's `naive_bayes` module.


In [6]:
model = MultinomialNB()
model.fit(X_train_count, Y_train)

## Test the trained model on your own texts

In [7]:
mail_text = ['Get the children ready we will go to dinner', 'Congratulations you got a massive  offer']

mail = word_freq_count.transform(mail_text)
model.predict(mail)

array([0, 0])

## Model Evaluation

In [8]:
X_test_count = word_freq_count.transform(X_test)
model.score(X_test_count, Y_test)

0.9856502242152466
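The score above is the mean accuracy on the test set. When ham messages greatly outnumber spam, accuracy can look high even if many spam messages are missed, so it may help to also look at precision and recall for the spam class. The sketch below computes them by hand; the label lists `y_true` and `y_pred` are hypothetical and are not results from this lab.

```python
# Hypothetical true and predicted labels (1 = spam, 0 = ham); not lab results
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 0]
y_pred = [0, 0, 0, 0, 0, 0, 1, 0, 0, 0]

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # spam caught
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # ham flagged as spam
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # spam missed
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # ham passed through

accuracy = (tp + tn) / len(pairs)
precision = tp / (tp + fp) if (tp + fp) else 0.0  # of flagged mail, how much is spam
recall = tp / (tp + fn) if (tp + fn) else 0.0     # of all spam, how much was caught

print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
```

In this made-up example the accuracy is 0.80 although only one of the three spam messages is caught, which is the kind of gap that accuracy alone does not reveal.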