# Naive Bayes Classifier

### Agenda:
1. Naive Bayes Classifier
2. SMS Spam Data
    - Load data
    - Convert spam and ham to 1 and 0
    - Convert sms text to TFIDF vectors
    - Split the data 
3. Build Naive Bayes
    - Predict test data
4. Measure success using roc_auc_score
5. Pros and Cons of Naive Bayes

**Naive Bayes (NB)** is a classification technique based on Bayes’ Theorem with an assumption of **independence among predictors**. In simple terms, a Naive Bayes classifier assumes that, within a given class, the presence of a particular feature is unrelated to the presence of any other feature.

**P(A|B) = P(B|A)P(A)/P(B)**    

P(A|B) is the posterior probability of the class (A, target) given the predictor (B, attributes).   
P(A) is the prior probability of the class.   
P(B|A) is the likelihood, i.e. the probability of the predictor given the class.   
P(B) is the marginal probability of the predictor (the evidence).   
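
As a small worked example of the formula, the sketch below computes the posterior probability that a message is spam given that it contains a particular word. All counts are made-up illustrative numbers, not values taken from the SMS data set.

```python
# Hypothetical counts, chosen only to illustrate the arithmetic.
n_spam, n_ham = 20, 80        # number of messages per class
n_spam_with_word = 10         # spam messages containing the word "free"
n_ham_with_word = 4           # ham messages containing the word "free"

total = n_spam + n_ham
p_spam = n_spam / total                                 # P(A): prior
p_word_given_spam = n_spam_with_word / n_spam           # P(B|A): likelihood
p_word = (n_spam_with_word + n_ham_with_word) / total   # P(B): evidence

# Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)
p_spam_given_word = p_word_given_spam * p_spam / p_word
print(round(p_spam_given_word, 4))
```

With these assumed counts, only 20% of messages are spam, yet seeing the word raises the spam probability to roughly 71%.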

#### Goal:  Train a Naive Bayes model to classify future SMS messages as either spam or ham.

Steps:

1.  Convert the labels ham and spam to a binary indicator variable (0/1)

2.  Convert the text to a sparse matrix of TF-IDF vectors

3.  Fit a Naive Bayes Classifier

4.  Measure your success using roc_auc_score



In [1]:
your_local_path="C:/Users/s.mudalapuram/Documents/PythonMe/data/"

In [2]:
import pandas as pd
import nltk
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
# sklearn.cross_validation was removed in newer scikit-learn releases; use model_selection
from sklearn.model_selection import train_test_split
from sklearn import naive_bayes
from sklearn.metrics import roc_auc_score



In [3]:
df= pd.read_csv(your_local_path+"sms_spam.csv")

In [4]:
df.head()

Unnamed: 0,type,text
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [6]:
nltk.download('stopwords')  # returns True on success, not the word list
stopset = set(stopwords.words('english'))

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\s.mudalapuram\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.


In [11]:
len(stopset)

179

#### Train the classifier to decide whether a message is spam or ham based on its text

In [7]:
#TFIDF Vectorizer
stopset = set(stopwords.words('english'))
vectorizer = TfidfVectorizer(use_idf=True, lowercase=True, strip_accents='ascii', stop_words=stopset)

In [8]:
vectorizer.fit(df.text)  # fit on the text column; iterating over df itself yields only the column names

TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,
        stop_words={'my', "don't", 'theirs', 're', 'myself', 'this', 'more', 'because', 'won', "you're", 'mustn', 'an', 'there', 'is', 'too', 'haven', 'off', 'while', 'wasn', 'few', 'i', "doesn't", 'o', 'had', "shan't", 'who', 'aren', 'him', 'was', 'very', "wasn't", 'once', 'during', 'nor', 'themselves', 'o...', 'our', 'now', 'as', 'he', 'these', 'did', 'needn', "aren't", 'were', 'be', "won't", 'been', 'so'},
        strip_accents='ascii', sublinear_tf=False,
        token_pattern='(?u)\\b\\w\\w+\\b', tokenizer=None, use_idf=True,
        vocabulary=None)

#### Convert spam and ham to 1 and 0 respectively so the classifier has a numeric target

In [12]:
df['type'] = df['type'].replace('spam', 1)  # assign back instead of relying on inplace=True on an attribute

In [13]:
df['type'] = df['type'].replace('ham', 0)

In [14]:
df.head()

Unnamed: 0,type,text
0,0,"Go until jurong point, crazy.. Available only ..."
1,0,Ok lar... Joking wif u oni...
2,1,Free entry in 2 a wkly comp to win FA Cup fina...
3,0,U dun say so early hor... U c already then say...
4,0,"Nah I don't think he goes to usf, he lives aro..."


In [15]:
df.shape

(5574, 2)

In [16]:
##Our dependent variable is the message type: 1 for spam, 0 for ham
y = df.type

In [17]:
#Convert df.text from raw text to TF-IDF features
X = vectorizer.fit_transform(df.text)

In [18]:
X.shape

(5574, 8586)

In [19]:
X

<5574x8586 sparse matrix of type '<class 'numpy.float64'>'
	with 47400 stored elements in Compressed Sparse Row format>

### TF-IDF
#### TF(t) = (Number of times term t appears in a document) / (Total number of terms in the document)

#### IDF(t) = log_e(Total number of documents / Number of documents with term t in it).

#### tf-idf score=TF(t)*IDF(t)
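
The formulas above can be sketched in plain Python. The toy corpus below is hypothetical, and the helper names `tf`, `idf` and `tf_idf` are introduced only for this illustration.

```python
import math

# Toy corpus (hypothetical messages, already tokenized and lower-cased).
docs = [
    ["free", "entry", "win", "free"],
    ["ok", "see", "you", "later"],
    ["win", "cash", "now"],
]

def tf(term, doc):
    # Number of times the term appears / total number of terms in the document.
    return doc.count(term) / len(doc)

def idf(term, docs):
    # Natural log of (number of documents / documents containing the term).
    # Assumes the term occurs in at least one document.
    n_containing = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / n_containing)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

print(tf_idf("free", docs[0], docs))  # frequent in this document, absent elsewhere
print(tf_idf("win", docs[0], docs))   # appears in two of the three documents
```

Note that scikit-learn's `TfidfVectorizer` with default settings uses a smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, and L2-normalizes each row, so its values differ from this textbook version.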

In [20]:
## Splitting the SMS to separate the text into individual words
splt_txt1=df.text[0].split()
print(splt_txt1)

['Go', 'until', 'jurong', 'point,', 'crazy..', 'Available', 'only', 'in', 'bugis', 'n', 'great', 'world', 'la', 'e', 'buffet...', 'Cine', 'there', 'got', 'amore', 'wat...']


In [21]:
## Note: max() on a list of strings returns the lexicographically largest token, not the most frequent word
max(splt_txt1)

'world'
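
`max()` on a list of strings compares them alphabetically. If the goal is the most frequent word, `collections.Counter` is one option. The sketch below uses the tokens of the first SMS as printed above.

```python
from collections import Counter

# Tokens of the first SMS, copied from the split() output above.
tokens = ['Go', 'until', 'jurong', 'point,', 'crazy..', 'Available', 'only',
          'in', 'bugis', 'n', 'great', 'world', 'la', 'e', 'buffet...',
          'Cine', 'there', 'got', 'amore', 'wat...']

counts = Counter(tokens)
word, freq = counts.most_common(1)[0]   # ties keep first-seen order
print(word, freq)

# max() instead returns the alphabetically last token.
print(max(tokens))
```

Every token in this message occurs exactly once, so there is no single most frequent word here; `most_common` simply reports the first token it saw.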

In [22]:
## Count the number of words in the first SMS
len(splt_txt1)

20

#### The first SMS splits into 20 whitespace-separated tokens (len(splt_txt1)), but the vectorizer keeps only 14 of them: the stop words (until, only, in, there) are removed, and the one-character tokens (n, e) do not match the default token pattern. That is why the first SMS has only 14 TF-IDF values. The other SMSes are processed the same way.

In [23]:
X[0]

<1x8586 sparse matrix of type '<class 'numpy.float64'>'
	with 14 stored elements in Compressed Sparse Row format>
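
A rough sketch of how 20 raw tokens shrink to 14 stored values is shown below. It re-applies the `token_pattern` from the vectorizer repr with the standard `re` module. The four stop words are hard-coded as an assumption, to avoid depending on the NLTK download; they are the entries of the NLTK English list that occur in this message.

```python
import re

# Tokens of the first SMS, copied from the split() output above.
tokens_raw = ['Go', 'until', 'jurong', 'point,', 'crazy..', 'Available',
              'only', 'in', 'bugis', 'n', 'great', 'world', 'la', 'e',
              'buffet...', 'Cine', 'there', 'got', 'amore', 'wat...']
text = " ".join(tokens_raw)

# Default TfidfVectorizer token pattern: words of two or more characters.
token_pattern = re.compile(r"(?u)\b\w\w+\b")

# Assumed subset of the NLTK English stop words present in this message.
stop_subset = {"until", "only", "in", "there"}

candidates = token_pattern.findall(text.lower())         # drops 'n' and 'e'
kept = [t for t in candidates if t not in stop_subset]   # drops stop words

print(len(tokens_raw), len(candidates), len(kept))
print(kept)
```

The count of kept tokens matches the 14 stored elements reported for `X[0]`.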

#### In the output below, each entry has the form (row, column) value: row 0 is the first SMS, columns such as 3536 and 4316 are the vocabulary indices of its words, and 0.157, 0.347, 0.271 are the TF-IDF values of those words. The later rows give the same information for the remaining SMSes.

In [24]:
print(X)

  (0, 3536)	0.1570070817542793
  (0, 4316)	0.3466185073652293
  (0, 5877)	0.2711124074492608
  (0, 2316)	0.26843531434169243
  (0, 1301)	0.25926284833436075
  (0, 1746)	0.2928268764441005
  (0, 3620)	0.19147848622350877
  (0, 8428)	0.23446497404204308
  (0, 4442)	0.2928268764441005
  (0, 1744)	0.3308854638944828
  (0, 2038)	0.2928268764441005
  (0, 3580)	0.1625034702178997
  (0, 1074)	0.3466185073652293
  (0, 8218)	0.19367543856970723
  (1, 5466)	0.27190435673704183
  (1, 4478)	0.4083285209202484
  (1, 4284)	0.5236769406481622
  (1, 8333)	0.4316309977097208
  (1, 5493)	0.5466195966483365
  (2, 3340)	0.11532016948053561
  (2, 2931)	0.3598966605883333
  (2, 8387)	0.19049443007546943
  (2, 2155)	0.19443486429295845
  (2, 8345)	0.14768604533962174
  (2, 3068)	0.46962403601340863
  :	:
  (5569, 165)	0.3330442123216397
  (5569, 5384)	0.3330442123216397
  (5570, 3876)	0.3652144637345925
  (5570, 3549)	0.3642455181785356
  (5570, 3327)	0.5597074067013798
  (5570, 2963)	0.6485917181474956
  (55

In [26]:
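# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
vectorizer.get_feature_names_out()[4316]  # column 4316 corresponds to the word 'jurong'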

'jurong'

### Second SMS

In [27]:
## Splitting the SMS to separate the text into individual words
splt_txt2=df.text[1].split()
print(splt_txt2)

['Ok', 'lar...', 'Joking', 'wif', 'u', 'oni...']


In [28]:
len(splt_txt2)

6

In [29]:
X[1]## Second SMS

<1x8586 sparse matrix of type '<class 'numpy.float64'>'
	with 5 stored elements in Compressed Sparse Row format>

In [30]:
## Note: max() returns the lexicographically largest token of the second SMS, not its most frequent word
max(splt_txt2)

'wif'

**The 2nd SMS has 6 whitespace-separated tokens, of which the vectorizer keeps only 5 (the one-character token "u" does not match the default token pattern), so the 2nd SMS has only 5 TF-IDF values. The other SMSes are processed the same way.**

In [31]:
## max() over the vocabulary returns the alphabetically last term, not the most frequent word across all SMSes
max(vectorizer.get_feature_names_out())

'zyada'

In [32]:
print (y.shape)
print (X.shape)

(5574,)
(5574, 8586)


In [33]:
##Split the test and train
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

In [34]:
##Train Naive Bayes Classifier
## Fast (one pass over the training data)
## Works well with sparse input: most of the 8586 vocabulary terms do not occur in any single message
clf = naive_bayes.MultinomialNB()
model=clf.fit(X_train, y_train)
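
To make the idea behind `MultinomialNB` more concrete, here is a minimal from-scratch sketch on a hypothetical toy data set. It is not scikit-learn's implementation; the function names and the training messages are assumptions made for illustration. It uses word counts with additive smoothing (`alpha`) and compares classes by log posterior.

```python
import math
from collections import Counter, defaultdict

def train_multinomial_nb(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed per-class word log-likelihoods."""
    vocab = sorted({w for doc in docs for w in doc})
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc)

    log_prior = {c: math.log(n / len(labels)) for c, n in class_counts.items()}
    log_likelihood = {}
    for c in class_counts:
        total = sum(word_counts[c].values()) + alpha * len(vocab)
        log_likelihood[c] = {
            w: math.log((word_counts[c][w] + alpha) / total) for w in vocab
        }
    return log_prior, log_likelihood

def predict(doc, log_prior, log_likelihood):
    """Return the class with the highest log posterior (up to a constant)."""
    scores = {}
    for c in log_prior:
        # Words outside the training vocabulary are ignored.
        scores[c] = log_prior[c] + sum(
            log_likelihood[c][w] for w in doc if w in log_likelihood[c]
        )
    return max(scores, key=scores.get)

# Hypothetical toy training set: 1 = spam, 0 = ham.
train_docs = [
    ["win", "free", "prize"],
    ["free", "cash", "now"],
    ["see", "you", "at", "lunch"],
    ["call", "me", "at", "home"],
]
train_labels = [1, 1, 0, 0]

log_prior, log_likelihood = train_multinomial_nb(train_docs, train_labels)
print(predict(["free", "prize", "now"], log_prior, log_likelihood))
print(predict(["see", "you", "at", "home"], log_prior, log_likelihood))
```

scikit-learn's `MultinomialNB` follows the same idea but also accepts fractional feature values, such as the TF-IDF weights used in this notebook, in place of raw counts.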

In [35]:
predicted_class=model.predict(X_test)
print(predicted_class)

[0 0 0 ... 0 0 0]


**The first 3 test SMSes are predicted as ham (0), which matches the first three labels in y_test shown below.**

In [36]:
print(y_test)

3690    0
3527    0
724     0
3370    0
468     0
5412    0
4362    0
4241    0
5442    0
5309    0
2232    0
3573    0
4379    0
3316    1
4895    0
296     1
453     0
4880    0
2034    0
4287    0
605     0
1615    0
5169    0
4655    0
2754    0
2727    0
4295    1
3893    1
2559    0
730     0
       ..
3768    0
3809    0
3034    0
5082    0
257     0
507     0
1438    0
99      0
1957    0
5216    1
3412    0
4058    0
3650    0
2707    0
1954    0
4028    0
2164    0
4564    0
366     0
2561    0
3680    0
4320    0
3133    0
949     0
4842    0
19      1
4758    0
668     0
218     0
4660    0
Name: type, Length: 1394, dtype: int64


In [37]:
df.loc[[19]]

Unnamed: 0,type,text
19,1,England v Macedonia - dont miss the goals/team...


In [38]:
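## predicted_class is positional: index 19 is the 20th test row (label 4287, a ham message), not SMS no. 19
predicted_class[19]  ## to look up SMS no. 19, use predicted_class[y_test.index.get_loc(19)]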

0

#### Check for null values in the type column

In [39]:
df[df.type.isnull()]

Unnamed: 0,type,text


#### There are no null values

#### Find the probability of assigning an SMS to each class

In [40]:
prd=model.predict_proba(X_test)

In [41]:
prd

array([[0.99729642, 0.00270358],
       [0.98498819, 0.01501181],
       [0.9333622 , 0.0666378 ],
       ...,
       [0.99196715, 0.00803285],
       [0.9860348 , 0.0139652 ],
       [0.99650379, 0.00349621]])

In [42]:
clf.predict_proba(X_test)[:,1]

array([0.00270358, 0.01501181, 0.0666378 , ..., 0.00803285, 0.0139652 ,
       0.00349621])

In [43]:
##Evaluate the model with ROC AUC (a ranking metric, not accuracy)
roc_auc_score(y_test, clf.predict_proba(X_test)[:,1])

0.9860710353261697

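**The model achieves a ROC AUC of about 0.986 on the test set. This measures how well the predicted probabilities rank spam above ham; it is not the fraction of correctly classified messages.**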
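
To illustrate what the ROC AUC measures, the sketch below computes it from its pairwise definition: the probability that a randomly chosen spam message receives a higher score than a randomly chosen ham message. The labels and scores are hypothetical, and `roc_auc` is a helper written for this example, not the scikit-learn function.

```python
def roc_auc(y_true, scores):
    """Probability that a random positive is scored above a random negative.

    Ties count as half. This pairwise definition equals the area under the
    ROC curve; it costs O(n_pos * n_neg), which is fine for a small example.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Hypothetical labels (1 = spam) and predicted spam probabilities.
y_true = [0, 0, 0, 0, 1, 1]
scores = [0.05, 0.10, 0.40, 0.20, 0.35, 0.90]

auc = roc_auc(y_true, scores)

# Accuracy at a 0.5 threshold, for comparison.
accuracy = sum((s >= 0.5) == bool(y) for y, s in zip(y_true, scores)) / len(y_true)

print(auc, accuracy)
```

The two numbers differ because the AUC looks only at the ordering of the scores, while accuracy depends on the chosen threshold.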

### Pros and Cons of Naive Bayes:
#### Pros:
- It is easy and fast to predict the class of a test data set. It also performs well in multi-class prediction.
- When the independence assumption holds, a Naive Bayes classifier can perform better than models such as logistic regression, and it needs less training data.
- It performs well with categorical input variables compared to numerical variables.

#### Cons:
- If a categorical variable has a category in the test data set that was not observed in the training data set, the model assigns it a probability of 0 (zero) and is unable to make a prediction. This is often known as “Zero Frequency”. To solve this, we can use a smoothing technique. One of the simplest is Laplace estimation; scikit-learn's MultinomialNB applies it by default through its `alpha=1.0` parameter.
- Another limitation of Naive Bayes is the assumption of independent predictors. In real life, it is almost impossible to get a set of predictors that are completely independent.
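
The zero-frequency problem and its Laplace fix can be sketched in a few lines. The word counts and the `likelihood` helper below are hypothetical and serve only to show the arithmetic.

```python
# Hypothetical word counts observed in spam training messages.
spam_word_counts = {"free": 3, "win": 2, "cash": 1}
vocabulary = ["free", "win", "cash", "lunch"]   # "lunch" was never seen in spam

def likelihood(word, counts, vocab, alpha):
    # P(word | class) with additive smoothing; alpha=0 disables smoothing.
    total = sum(counts.values()) + alpha * len(vocab)
    return (counts.get(word, 0) + alpha) / total

unsmoothed = likelihood("lunch", spam_word_counts, vocabulary, alpha=0)
smoothed = likelihood("lunch", spam_word_counts, vocabulary, alpha=1)
print(unsmoothed, smoothed)
```

Without smoothing, a single unseen word forces the whole product of likelihoods to zero; with `alpha=1` the unseen word receives a small non-zero probability instead.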