# Naive Bayes Classification

**Naive Bayes (NB)** is ‘naive’ because it assumes that the features of a measurement are independent of each other. The assumption is naive because it is (almost) never true. Here is why NB works anyway.

* A naive Bayes classifier is an algorithm that uses Bayes' theorem to classify objects. Naive Bayes classifiers assume strong, or naive, independence between attributes of data points. Popular uses of naive Bayes classifiers include spam filters, text analysis and medical diagnosis. These classifiers are widely used for machine learning because they are simple to implement.

* Naive Bayes is also known as simple Bayes or independence Bayes.

* A naive Bayes classifier uses probability theory to classify data. Naive Bayes classifier algorithms make use of Bayes' theorem. The key insight of Bayes' theorem is that the probability of an event can be adjusted as new data is introduced.

* A naive Bayes classifier is not a single algorithm, but a family of machine learning algorithms that make use of statistical independence. These algorithms are relatively easy to write and run more efficiently than more complex Bayes algorithms.

* The most popular application is spam filters. A spam filter looks at email messages for certain key words and puts them in a spam folder if they match.

* Despite the name, a naive Bayes classifier becomes more accurate the more data it gets, for example from a user flagging email messages in an inbox as spam.

* What makes a naive Bayes classifier naive is its assumption that all attributes of a data point under consideration are independent of each other. A classifier sorting fruits into apples and oranges would know that apples are red, round and of a certain size, but it would weigh each of these attributes as separate evidence instead of considering them jointly. Oranges are round too, after all.

* One of the major advantages that Naive Bayes has over other classification algorithms is its ability to handle an extremely large number of features. In our case, each word is treated as a feature and there are thousands of different words. Also, it performs well even with the presence of irrelevant features and is relatively unaffected by them.

* The other major advantage it has is its relative simplicity. Naive Bayes works well right out of the box and tuning its parameters is rarely necessary, except usually in cases where the distribution of the data is known.

It rarely ever overfits the data.

Another important advantage is that its model training and prediction times are very fast for the amount of data it can handle.
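
To make the idea concrete, here is a minimal sketch of a naive Bayes text classifier written by hand. It is an illustration only, not the implementation used later in this notebook: the five training messages, the `ham` label for non-spam, and the helper names `log_posterior` and `classify` are all made up for the example. The score of a class is its prior probability multiplied by the probability of each word given that class, and multiplying the word probabilities one by one is exactly the "naive" independence assumption.

```python
from collections import Counter
import math

# Hypothetical toy training data (not taken from spam.csv).
train = [
    ("win cash prize now", "spam"),
    ("claim free prize", "spam"),
    ("are you free tonight", "ham"),
    ("see you at lunch", "ham"),
    ("call me when you can", "ham"),
]

# Count how many messages each class has and how often each word occurs per class.
class_counts = Counter(label for _, label in train)
word_counts = {label: Counter() for label in class_counts}
for text, label in train:
    word_counts[label].update(text.split())
vocab = {word for counts in word_counts.values() for word in counts}


def log_posterior(text, label, alpha=1.0):
    """Unnormalized log P(label | text) under the independence assumption."""
    log_prob = math.log(class_counts[label] / len(train))  # log prior
    total = sum(word_counts[label].values())
    for word in text.split():
        if word not in vocab:
            continue  # skip words never seen during training
        # Laplace-smoothed estimate of P(word | label)
        likelihood = (word_counts[label][word] + alpha) / (total + alpha * len(vocab))
        log_prob += math.log(likelihood)  # adding logs = multiplying probabilities
    return log_prob


def classify(text):
    """Return the class with the highest posterior score."""
    return max(class_counts, key=lambda label: log_posterior(text, label))


print(classify("free prize"))       # spam
print(classify("see you tonight"))  # ham
```

Logarithms are summed instead of multiplying raw probabilities so that long messages do not underflow to zero; the smoothing term `alpha` keeps a word that never appeared in one class from forcing that class's score to zero.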



In [None]:
# To read the csv files in arrays and dataframes.
import numpy as np 
import pandas as pd 

In [None]:
data = pd.read_csv("../input/spam.csv", encoding = "latin-1")
# The file contains special characters that the default UTF-8 decoding rejects with an error,
# so we read it with encoding="latin-1". Let's check the first five rows.
data.head()

Check for null values, if any, and count how many there are in each column.

In [None]:
data.isnull().sum()

There are many null values in the 3rd, 4th and 5th columns, so it is better to remove those columns. Also rename the remaining columns, since `v1` and `v2` are not descriptive.

In [None]:
data = data.drop(["Unnamed: 2","Unnamed: 3","Unnamed: 4"],axis=1)
data.rename(columns= { 'v1' : 'class' , 'v2' : 'message'}, inplace= True)
data.head()

In [None]:
data.info()

# Data Visualization 

In [None]:
import matplotlib.pyplot as plt
count = data["class"].value_counts(sort=True)  # pd.value_counts() is deprecated in recent pandas
count.plot(kind= 'bar', color= ["blue", "orange"])
plt.title('Bar chart')
plt.legend(loc='best')
plt.show()

As the chart shows, spam messages are far fewer than non-spam messages, so the two classes are imbalanced.

In [None]:
count.plot(kind='pie', autopct='%1.2f%%')  # label each slice with its percentage, to two decimal places
plt.title('Pie chart')
plt.show()

In [None]:
data.groupby('class').describe()

Add a new column called **length** that holds the number of characters in each message.

In [None]:
data['length'] = data['message'].apply(len)
# reordering the columns
data = data[['message', 'length', 'class']]
data.head()

# Data Pre-Processing

The process of converting data to something a computer can understand is referred to as pre-processing. One of the major forms of pre-processing is filtering out useless data. In natural language processing, useless words (data) are referred to as stop words.

**What are Stop words?**

Stop Words: A stop word is a commonly used word (such as “the”, “a”, “an”, “in”) that a search engine has been programmed to ignore, both when indexing entries for searching and when retrieving them as the result of a search query.


We would not want these words taking up space in our database or taking up valuable processing time. We can remove them easily by storing a list of the words that we consider to be stop words. Each message is cleaned in the following steps:

1. Remove all **non-letters** from the message (ex: special characters and digits are replaced with spaces).

2. Change all the characters to **lower case letters**, so that the system treats the characters 'A' and 'a' the same. Upper case would work equally well; lower case is simply the usual convention.

3. **Split** the message into a list of individual words.

4. Check for **stop words** (if any) and remove them.

5. **Stem** the remaining words. Stemming is a kind of normalization: many variations of a word carry the same meaning, apart from tense. We stem to shorten the lookup and to normalize sentences.

   **Consider:**

   "I was taking a ride in the car."

   "I was riding in the car."

   Both sentences mean the same thing, and "in the car" is identical in both. Stemming reduces "riding" and "ride" to a common root so that they count as the same word.

6. **Join** the remaining words back into a single string once every word has been checked.
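
To see what the cleaning steps do to a single message, here is a minimal sketch in plain Python that prints the intermediate results. It assumes a hypothetical five-word stop list (`STOP_WORDS`) as a stand-in for NLTK's English list, and it leaves stemming out so that it runs without NLTK.

```python
import re

# Hypothetical, deliberately tiny stop-word list for illustration only;
# the notebook itself relies on NLTK's English stop words.
STOP_WORDS = {"i", "was", "a", "in", "the"}

text = "I was taking a ride in the car."

letters_only = re.sub("[^A-Za-z]", " ", text)      # replace non-letters with spaces
lowered = letters_only.lower()                     # convert to lower case
tokens = lowered.split()                           # split into a list of words
kept = [t for t in tokens if t not in STOP_WORDS]  # drop the stop words
cleaned = " ".join(kept)                           # join back into one string

print(tokens)   # ['i', 'was', 'taking', 'a', 'ride', 'in', 'the', 'car']
print(cleaned)  # taking ride car
```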




In [None]:
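import re
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords

# Build these once. Rebuilding the stop-word set for every single word is very slow.
# If the stop-word corpus is missing, run nltk.download("stopwords") first.
STOP_WORDS = set(stopwords.words("english"))
stemmer = PorterStemmer()  # reduces each word to its root form

def clean_message(message):
    message = re.sub("[^A-Za-z]", " ", message)  # replace everything except letters with spaces
    words = message.lower().split()              # lower-case, then split on whitespace
    # drop the stop words and stem whatever is left
    words = [stemmer.stem(word) for word in words if word not in STOP_WORDS]
    return " ".join(words)                       # join the cleaned words back into one string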

Let's test how our function works. We shall take the first message of the original data.

In [None]:
message = data.message[0]
print(message)

Now apply the cleaning function to that message and compare the result with the original text.

In [None]:
message = clean_message(message)
print(message)

Let us apply the function to all the rows in the data.

In [None]:
# clean every message in the data
messages = [clean_message(message) for message in data["message"]]

In [None]:
data = data.drop(["message"],axis=1)
data['messages'] = messages
data.head()

# Feature selection

In [None]:
# let's separate the output (labels) and the documents
y = data["class"].values
x = data["messages"].values

In [None]:
from sklearn.model_selection import train_test_split
# splitting the data into a training set and a test set
xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size=0.3, random_state=1)
# test_size=0.3 holds out 30% of the rows for testing, i.e. a 70:30 split
print(xtrain.shape, ytrain.shape, xtest.shape, ytest.shape)


A **bag-of-words model**, or BoW for short, is a way of extracting features from text for use in modeling, such as with machine learning algorithms.

Bag of Words (BoW) is an algorithm that counts how many times a word appears in a document. It’s a tally. Those word counts allow us to compare documents and gauge their similarities for applications like search, document classification and topic modeling. BoW is also a method for preparing text for input in a deep-learning net.

BoW lists words paired with their word counts per document. In the resulting table, each row is a word, each column is a document, and each cell is a word count, so every document in the corpus is represented by a column of equal length. (scikit-learn stores the transpose: one row per document and one column per word.) Those are word-count vectors, an output stripped of context.

Any algorithm we apply in NLP works on numbers, so we cannot feed our text into it directly. The Bag of Words model is therefore used to preprocess the text by converting it into a bag of words, which keeps a count of how often each word in the vocabulary occurs.

**Example :** Hello, how are you ?

After making the sentence into tokens  : "Hello", "how", "are", "you"




# **TF-IDF ( Term Frequency - Inverse Document Frequency )**


TF-IDF is a weighting scheme, sometimes described as a form of normalization. TF (term frequency) is how many times a particular word appears in a single document. IDF (inverse document frequency) downscales words that appear in many documents.
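
The sketch below works through the calculation on three made-up documents and checks it against `TfidfVectorizer`. It assumes scikit-learn's default settings, which use the smoothed form `idf(t) = ln((1 + n) / (1 + df(t))) + 1` and then scale every document vector to unit length; other libraries and textbooks define IDF slightly differently.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical toy documents, for illustration only.
docs = ["free prize call now", "call me now", "free lunch today"]

counts = CountVectorizer().fit_transform(docs).toarray()  # TF: raw word counts
n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)                # number of documents containing each word
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1    # smoothed IDF (scikit-learn default)
weighted = counts * idf                            # TF x IDF
manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)  # L2-normalize each row

library = TfidfVectorizer().fit_transform(docs).toarray()
print(np.allclose(manual, library))  # True
```

Words that occur in only one document, such as `lunch`, receive a larger IDF than words shared by several documents, such as `call`.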

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
vect = TfidfVectorizer(stop_words='english',max_df=0.5)

# fit the vectorizer on the training data and transform it into a TF-IDF matrix
x_train = vect.fit_transform(xtrain)
#print(x_train)

# transform the test data with the vocabulary and IDF weights learned from the training data
x_test = vect.transform(xtest)

# importing naive bayes algorithm
from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB()

# fitting the model on the training data
nb.fit(x_train,ytrain)

# predicting on the test and training data
y_pred_test = nb.predict(x_test)
y_pred_train = nb.predict(x_train)

#checking accuracy score
from sklearn.metrics import accuracy_score
print(accuracy_score(ytest,y_pred_test)*100)

#Making Confusion matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(ytest,y_pred_test)
print(cm)

# **Count Vectorizer**




Count vectorization is the most straightforward approach: it counts the number of times a token shows up in the document and uses this value as its weight. Tokenization refers to splitting a larger body of text into smaller units such as lines or words, or even creating words for a non-English language.

For more information, one can go through the links below.

https://towardsdatascience.com/hacking-scikit-learns-vectorizers-9ef26a7170af

https://machinelearningmastery.com/prepare-text-data-machine-learning-scikit-learn/

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
vect1 = CountVectorizer(stop_words='english',max_df=0.5)

# fit the vectorizer on the training data and transform it into a count matrix
x_train = vect1.fit_transform(xtrain)

# transform the test data with the vocabulary learned from the training data
x_test = vect1.transform(xtest)

# importing naive bayes algorithm
from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB()

# fitting the model on the training data
nb.fit(x_train,ytrain)

# predicting on the test and training data
y_pred_test = nb.predict(x_test)
y_pred_train = nb.predict(x_train)

#checking accuracy score
from sklearn.metrics import accuracy_score
print(accuracy_score(ytest,y_pred_test)*100)

#Making Confusion matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(ytest,y_pred_test)
print(cm)

**It looks like the count vectorizer gives the more accurate result, about 98% accuracy, whereas TF-IDF reaches about 97%.**
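
Because spam is the rarer class, accuracy alone can hide mistakes on spam, so it helps to read the confusion matrix as well. The sketch below uses made-up labels (`ham` for non-spam), not results from this notebook, to show how precision and recall for the spam class follow from the four cells of the matrix.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels, for illustration only.
y_true = ["ham", "ham", "ham", "ham", "spam", "spam", "ham", "spam"]
y_pred = ["ham", "ham", "spam", "ham", "spam", "ham", "ham", "spam"]

# Rows are true classes, columns are predicted classes, in the order given by `labels`.
cm = confusion_matrix(y_true, y_pred, labels=["ham", "spam"])
tn, fp, fn, tp = cm.ravel()

precision = tp / (tp + fp)  # of the messages flagged as spam, how many really were spam
recall = tp / (tp + fn)     # of the real spam messages, how many were caught

print(cm)  # [[4 1]
           #  [1 2]]
print(round(precision, 2), round(recall, 2))  # 0.67 0.67
```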

Let us take some random sample data and try to apply the model and see how that actually works.

# Testing the Model

In [None]:
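# Clean the new text the same way as the training data, and use vect1:
# nb was last fitted on the CountVectorizer features, not on the TF-IDF ones.
new_text = pd.Series('WINNER!! As a valued network customer you have been selected to receivea £900 prize reward! To claim call 09061701461. Claim code KL341. valid 12 hours').apply(clean_message)
new_text_transform = vect1.transform(new_text)
print("The email is a", nb.predict(new_text_transform))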

In [None]:
# Same steps for an ordinary message: clean it, then vectorize it with vect1
new_text = pd.Series("Hello, how are you?").apply(clean_message)
new_text_transform = vect1.transform(new_text)
print("The email is a", nb.predict(new_text_transform))
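
One way to avoid pairing a model with the wrong vectorizer is to bundle both into a single scikit-learn pipeline. The sketch below is one possible arrangement, not the approach used above: the training messages and the `ham` label are invented, and the text cleaning step is left out to keep the example short.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy training data (not taken from spam.csv).
train_texts = [
    "win cash prize now",
    "claim free prize",
    "are you free tonight",
    "see you at lunch",
    "call me when you can",
]
train_labels = ["spam", "spam", "ham", "ham", "ham"]

# The pipeline vectorizes and classifies in one object, so predict() on raw text
# always goes through the same vectorizer the classifier was trained with.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

predictions = model.predict(["claim your cash prize", "see you tonight"])
print(list(predictions))  # ['spam', 'ham']
```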