This article classifies SMS messages as SPAM or HAM using the Naive Bayes technique.

### Basics of Naive Bayes:
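The Naive Bayes classifier combines evidence using conditional probability (Bayes' theorem), under the assumption that the features are independent of each other given the class. This makes it a probabilistic classifier. For example, the probability that a long, sweet, yellow fruit is a banana is: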

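$$P(\text{Banana} \mid \text{Long}, \text{Sweet}, \text{Yellow}) = \frac{P(\text{Long} \mid \text{Banana})\,P(\text{Sweet} \mid \text{Banana})\,P(\text{Yellow} \mid \text{Banana})\,P(\text{Banana})}{P(\text{Long}, \text{Sweet}, \text{Yellow})}$$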
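
To make the formula concrete, here is a minimal sketch that evaluates it on a small table of fruit counts. The counts and the names `counts`, `unnormalised_score` and `posteriors` are invented for illustration and are not part of the SMS dataset. The denominator is computed as the sum of the numerators over all classes, which is what the Naive Bayes model uses as P(Long, Sweet, Yellow).

```python
# Hypothetical fruit counts, invented purely for illustration.
counts = {
    "banana": {"total": 500, "long": 400, "sweet": 350, "yellow": 450},
    "orange": {"total": 300, "long": 0, "sweet": 150, "yellow": 300},
    "other": {"total": 200, "long": 100, "sweet": 150, "yellow": 50},
}
n_fruits = sum(c["total"] for c in counts.values())


def unnormalised_score(fruit, features):
    """Return P(fruit) times the product of P(feature | fruit)."""
    c = counts[fruit]
    score = c["total"] / n_fruits  # prior P(fruit)
    for feature in features:
        score *= c[feature] / c["total"]  # likelihood P(feature | fruit)
    return score


features = ["long", "sweet", "yellow"]
scores = {fruit: unnormalised_score(fruit, features) for fruit in counts}

# Normalise so that the posteriors over all classes sum to 1.
evidence = sum(scores.values())
posteriors = {fruit: s / evidence for fruit, s in scores.items()}
print(posteriors)
```

With these made-up counts the banana class gets by far the highest posterior, while the orange class gets zero because no orange in the table is long. Zero counts like this are the reason smoothing is usually applied in practice.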

First, let's check that the dataset has been imported.

In [None]:
import os
print(os.listdir("../input"))

Now let's load the data and print some rows.

In [None]:

import pandas as pd
import matplotlib.pyplot as plt
from wordcloud import WordCloud, STOPWORDS
%matplotlib inline

df = pd.read_csv('../input/spam.csv', encoding="cp1252")
df.head()

So we see that the v1 and v2 columns are useful and the rest are not.

* v1: Target or Y or output class
* v2: input or X or SMS

Instead of dropping the unwanted columns, let's keep only the columns we want.

In [None]:
sms = pd.DataFrame()
sms['target'] = df['v1']
sms['sms'] = df['v2']
sms.tail()

We now have what we want in our dataframe "sms".

In [None]:
print(sms['target'].value_counts())
sms.count()

### Data cleaning and feature engineering

As shown below, the SMSes are composed of words, spaces, numbers, and punctuation. We need to remove the numbers and punctuation, and handle uninteresting words such as "and", "but", and "or" (stop words).

In [None]:
sms['sms'].head()

In [None]:
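from nltk.corpus import stopwords

# Build the stop word set once; a set lookup is much faster than
# calling stopwords.words() for every token
stop_words = set(stopwords.words('english'))

# Remove punctuation (regex=True must be passed explicitly in pandas >= 2.0)
sms['sms'] = sms['sms'].str.replace(r'[^\w\s]', '', regex=True)

# Replace numbers with a space
sms['sms'] = sms['sms'].str.replace(r'\d+', ' ', regex=True)

# Lower-case first, then remove stop words (the NLTK list is lower case,
# so checking before lower-casing would miss words such as "The")
sms['sms'] = sms['sms'].apply(lambda x: ' '.join(w for w in x.lower().split() if w not in stop_words))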

sms['sms'].head()
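
To see what the cleaning steps do to a single message, here is a small standalone sketch. It assumes a tiny hand-picked stop word set (`STOP_WORDS`) instead of the NLTK list, and the helper name `clean_sms` and the example message are made up for illustration.

```python
import re

# Tiny hand-picked stop word list, for illustration only;
# the notebook itself uses NLTK's English list.
STOP_WORDS = {"a", "and", "but", "or", "the", "to", "you", "have"}


def clean_sms(text):
    """Strip punctuation and digits, lower-case, and drop stop words."""
    text = re.sub(r"[^\w\s]", "", text)  # remove punctuation
    text = re.sub(r"\d+", " ", text)     # replace digit runs with a space
    words = [w.lower() for w in text.split()]
    return " ".join(w for w in words if w not in STOP_WORDS)


print(clean_sms("WINNER!! You have won a $900 prize, call 09061701461 now."))
# prints: winner won prize call now
```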

Let's understand the features by visualizing them. We have only one feature, the SMS text. What visualisations can we do on words?

* Plot word clouds to see which words appear most often in SPAM and HAM SMSes
* Plot the SMS length

In [None]:
import matplotlib.pyplot as plt

# Join the messages with a space so that the last word of one SMS
# does not merge with the first word of the next
spam_word_cloud = WordCloud(max_words = 30).generate(" ".join(sms.loc[sms['target'] == 'spam', 'sms']))
ham_word_cloud = WordCloud(max_words = 30).generate(" ".join(sms.loc[sms['target'] == 'ham', 'sms']))

# Display the generated image
plt.imshow(spam_word_cloud)
plt.axis("off")
plt.suptitle('Spam words cloud', fontsize=20)
plt.show()

plt.imshow(ham_word_cloud)
plt.suptitle('Ham words cloud', fontsize=20)
plt.axis("off")
plt.show()

In [None]:
sms['length']=sms['sms'].apply(len)
sms.head()
sms['length'].plot(bins=50,kind='hist')

There is not much to infer from the above graph, except that we have some big SMSes with lengths of up to 1400 chars!

Moving on, let's make our model learn! Let's count the number of occurrences of each word in the corpus using CountVectorizer. It is the most straightforward vectorizer: it counts the number of times a token shows up in a document and uses this value as its weight.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(sms['sms'], sms['target'], random_state=0)


vectorizer = CountVectorizer()
VX = vectorizer.fit_transform(X_train)

# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
print(vectorizer.get_feature_names_out()[:10])

X_train_vectorized = VX.toarray()

In [None]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

# alpha is the additive smoothing parameter
model = MultinomialNB(alpha=0.1)
model.fit(X_train_vectorized, y_train)

y_pred = model.predict(vectorizer.transform(X_test))

print('Accuracy: %.2f%%' % (accuracy_score(y_test, y_pred) * 100))



# Print every misclassified message with its actual and predicted label
for actual, text, predicted in zip(y_test, X_test, y_pred):
    if actual != predicted:
        print(">>>> Actual {} -- predicted -- {}".format(actual, predicted))
        print(text)
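
Once the model is fitted, the same vectorizer and model can be reused to label unseen messages. The sketch below shows the idea on a tiny made-up training set, so that it runs without the dataset. All `toy_` names and messages are hypothetical; in the notebook, `vectorizer` and `model` would play the same roles.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up training messages and labels, for illustration only.
toy_texts = [
    "win free prize now",
    "free entry win cash",
    "are you coming home now",
    "call me when you are free",
    "see you at lunch",
]
toy_labels = ["spam", "spam", "ham", "ham", "ham"]

# Learn the vocabulary and the per-class word counts.
toy_vectorizer = CountVectorizer()
toy_model = MultinomialNB(alpha=0.1)
toy_model.fit(toy_vectorizer.fit_transform(toy_texts), toy_labels)

# New messages must go through transform(), not fit_transform(),
# so that they are mapped onto the vocabulary learned during training.
new_messages = ["win cash prize", "see you at home"]
toy_predictions = toy_model.predict(toy_vectorizer.transform(new_messages))
print(list(zip(new_messages, toy_predictions)))
```

With this toy data the first message is labelled spam and the second ham, since each contains only words seen in one of the two classes.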