### Naive Bayes (Supervised)

Naive Bayes (NB) combines the evidence from all features using conditional probability, under an **assumption of independence among features**: within a class, the presence of a particular feature is treated as unrelated to the presence of any other feature, regardless of any correlations that may actually exist. This assumption is often wrong, which is why the method is called naive, but it keeps the calculations simple.

The Naive Bayes classifier is based on finding functions **describing the probability of belonging to a class given the features**. This is where Bayes' rule comes in:

- $P(Y \rvert X) = \frac{P(X \rvert Y) P(Y)}{P(X)}$

Classification with Bayes' rule: Posterior = Likelihood * Prior / Scaling Factor

- $P(Y \rvert X)$ - Posterior: the probability that sample x belongs to class y, given the feature values of x, the distribution of the features within class y, and the prior.
- $P(X \rvert Y)$ - Likelihood of the data X given the class distribution Y. A Gaussian distribution is assumed here (computed by `_calculate_likelihood`).
- $P(Y)$ - Prior (computed by `_calculate_prior`).
- $P(X)$ - Scaling factor that makes the posterior a proper probability distribution. This term is ignored in this implementation, since it is the same for every class and therefore doesn't affect which class distribution the sample is most likely to belong to.
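
The decomposition above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the notebook's implementation: the names `gaussian_log_pdf`, `fit` and `predict` and the toy data are made up for the example. Log-probabilities are summed instead of multiplying probabilities, which avoids numerical underflow, and the sum over features is exactly where the independence assumption is used.

```python
import math
from collections import defaultdict

def gaussian_log_pdf(x, mean, var):
    """Log-density of a normal distribution with the given mean and variance."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)

def fit(X, y, eps=1e-9):
    """Estimate the prior P(Y) and the per-feature mean/variance for each class."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    params = {}
    for label, rows in by_class.items():
        stats = []
        for col in zip(*rows):  # iterate over the features of this class
            mean = sum(col) / len(col)
            var = sum((v - mean) ** 2 for v in col) / len(col) + eps  # eps avoids var == 0
            stats.append((mean, var))
        params[label] = {"prior": len(rows) / len(X), "stats": stats}
    return params

def predict(params, x):
    """Return the class maximizing log P(Y) + sum_i log P(x_i | Y).

    P(X) is left out: it is identical for every class, so it cannot change the argmax.
    """
    scores = {}
    for label, p in params.items():
        score = math.log(p["prior"])
        for value, (mean, var) in zip(x, p["stats"]):
            score += gaussian_log_pdf(value, mean, var)
        scores[label] = score
    return max(scores, key=scores.get)

# toy data (made up): two features, two well-separated classes
X = [[1.0, 2.1], [1.2, 1.9], [0.8, 2.0], [5.0, 7.9], [5.2, 8.1], [4.8, 8.0]]
y = [0, 0, 0, 1, 1, 1]
params = fit(X, y)
print(predict(params, [1.1, 2.0]), predict(params, [5.1, 8.0]))
```

The scores returned here are unnormalized log-posteriors; dividing by $P(X)$ would only be needed if calibrated probabilities were required.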

In [196]:
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from collections import Counter
import os
import sys
sys.path.insert(0, 'helper_functions/')
from data_manipulation import train_test_split
from ml_preprocessing import MultiColumnLabelEncoder

# path and file (for spam classifier files)
PATH_TO_TRAIN_MAILS = 'data/train_mails/'
PATH_TO_TEST_MAILS = 'data/test_mails/'

# path and file (for sklearn example)
PATH_TO_DATA = 'data/'
FILE = 'mushrooms.csv'

### Use NB for a spam classifier (a classic)

In [150]:
# create a bag of the most common words in the training mails
mails = [os.path.join(PATH_TO_TRAIN_MAILS, x) for x in os.listdir(PATH_TO_TRAIN_MAILS)]
most_common_words_size = 2000

word_lists = []
for mail in mails:
    with open(mail) as m:
        for line in m:
            words = line.split()
            word_lists += words

bag = Counter(word_lists)
print ('length of word bag: {}'.format(len(bag)))

for item in list(bag.keys()):
    # keep only alphabetic tokens that are longer than one character
    if not item.isalpha() or len(item) == 1:
        del bag[item]
print ('length of word bag after removals: {}'.format(len(bag)))
bag = bag.most_common(most_common_words_size) # list of the n most common (word, count) pairs, sorted by count
print (bag[:10])

length of word bag: 20601
length of word bag after removals: 16962
[('order', 1414), ('address', 1299), ('report', 1217), ('mail', 1133), ('language', 1099), ('send', 1080), ('email', 1066), ('program', 1009), ('our', 991), ('list', 946)]
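
The same bag-building steps can be tried without any mail files. The sketch below uses three made-up strings in place of the files under `PATH_TO_TRAIN_MAILS`; the sample texts and the name `top_words` are assumptions for illustration only.

```python
from collections import Counter

# made-up mails standing in for the training files
mails = [
    "order your free report now",
    "send order to the mailing address 42",
    "a report on language order",
]

# count every whitespace-separated token across all mails
bag = Counter(word for mail in mails for word in mail.split())

# keep purely alphabetic tokens that are longer than one character
bag = Counter({w: c for w, c in bag.items() if w.isalpha() and len(w) > 1})

# list of (word, count) pairs, most frequent first
top_words = bag.most_common(3)
print(top_words)
```

Note that `most_common` returns a list of tuples rather than a `Counter`, so later code has to look words up by position in that list.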


In [125]:
def extract_features(mail_dir, bag):
    """
    generate a label and word frequency matrix
    Args:
     mail_dir: directory where mail files are stored
     bag: bag of most common words
    source
     https://github.com/savanpatel
    """
    files = [os.path.join(mail_dir,x) for x in os.listdir(mail_dir)]
    feature_matrix = np.zeros((len(files), most_common_words_size))
    labels = np.zeros(len(files))
    
    print ('shape feature_matrix {}'.format(feature_matrix.shape))
    print ('no labels: {}'.format(len(labels)))
    
    # map each bag word to its column index once, instead of scanning the bag for every word
    word_index = {word: idx for idx, (word, _) in enumerate(bag)}
    doc_id = 0
    for fil in files:
        with open(fil) as fi:
            for line_no, line in enumerate(fi):
                if line_no == 2:  # only the third line of each file is tokenized
                    words = line.split()
                    for word in set(words):
                        if word in word_index:
                            feature_matrix[doc_id, word_index[word]] = words.count(word)
        # spam mails are identified by the 'spmsg' prefix of the file name
        if os.path.basename(fil).startswith("spmsg"):
            labels[doc_id] = 1
        doc_id += 1
            
            
    return feature_matrix, labels

- The feature matrix has 702 rows (the number of emails in the `PATH_TO_TRAIN_MAILS` dir), and its columns correspond to the most common words (their number is defined by the `most_common_words_size` variable).
- The labels are the target: 0 for non-spam mails and 1 for spam mails (identified by file names starting with `spmsg`).
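
To make the shape of these two outputs concrete, here is a minimal sketch on in-memory data. The vocabulary, documents and file names are hypothetical, and `count_matrix` is an illustrative helper, not part of the notebook.

```python
def count_matrix(docs, vocabulary):
    """Return one row per document and one column per vocabulary word."""
    column = {word: j for j, word in enumerate(vocabulary)}
    matrix = [[0] * len(vocabulary) for _ in docs]
    for i, doc in enumerate(docs):
        for word in doc.split():
            if word in column:  # words outside the bag are ignored
                matrix[i][column[word]] += 1
    return matrix

vocabulary = ["order", "report", "send"]        # hypothetical bag of words
docs = ["order order report", "please send the report"]
file_names = ["spmsga1.txt", "3-1msg1.txt"]     # hypothetical file names

features = count_matrix(docs, vocabulary)
labels = [1 if name.startswith("spmsg") else 0 for name in file_names]

print(features)
print(labels)
```

Each row is the word-count vector of one mail, so the matrix has as many rows as mails and as many columns as words in the bag.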

In [155]:
features_matrix, labels = extract_features(PATH_TO_TRAIN_MAILS, bag = bag)
test_feature_matrix, test_labels = extract_features(PATH_TO_TEST_MAILS, bag=bag)

shape feature_matrix (702, 2000)
no labels: 702
shape feature_matrix (260, 2000)
no labels: 260


In [156]:
# check for target imbalance (the two classes turn out to be balanced)
print ('no of non-spam mails in training set: {}'\
       .format(labels[labels == 0].shape[0]))
print ('no of spam mails in training set: {}'\
       .format(labels[labels == 1].shape[0]))

no of non-spam mails in training set: 351
no of spam mails in training set: 351


In [160]:
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import GaussianNB
model = GaussianNB()

print ("Training model.")
#train model
model.fit(features_matrix, labels)

predicted_labels = model.predict(test_feature_matrix)

print ("FINISHED classifying. accuracy score : ")
print (accuracy_score(test_labels, predicted_labels))

Training model.
FINISHED classifying. accuracy score : 
0.9730769230769231
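
The accuracy reported above is simply the fraction of test mails whose predicted label matches the true one. A hand-rolled version, with the illustrative name `accuracy` and made-up labels, could look like this:

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# made-up labels: 3 of 4 predictions are correct
score = accuracy([1, 0, 1, 1], [1, 0, 0, 1])
print(score)
```

With 260 test mails, a score of 0.9731 corresponds to 253 correct predictions and 7 misclassified mails.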


### Sklearn implementation (`sklearn.naive_bayes`)

Sklearn provides several Naive Bayes variants for model training; three common ones are:
- **GaussianNB** --> for continuous features, which are assumed to follow a normal distribution within each class
- **MultinomialNB** --> for discrete counts, for instance “how often does each word occur in a doc”; you can think of it as “the number of times outcome $x_i$ is observed over the $n$ trials”
- **BernoulliNB** --> useful if your feature vectors are binary. An application could be text classification with a bag-of-words model where the 1s and 0s mean "word occurs in the doc" and "word does not occur in the doc"
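
The three variants differ only in how the per-feature likelihood $P(x_i \rvert Y)$ is modelled. The sketch below writes out the textbook log-likelihoods in plain Python; it is not sklearn's code, and the function names and the class-conditional word probabilities are made-up values for illustration.

```python
import math

def gaussian_log_likelihood(x, means, variances):
    """Gaussian model: each continuous feature ~ Normal(mean, variance)."""
    return sum(
        -0.5 * math.log(2 * math.pi * v) - (xi - m) ** 2 / (2 * v)
        for xi, m, v in zip(x, means, variances)
    )

def multinomial_log_likelihood(counts, word_probs):
    """Multinomial model: counts[i] occurrences of word i.

    The multinomial coefficient is dropped because it is the same for every class.
    """
    return sum(c * math.log(p) for c, p in zip(counts, word_probs))

def bernoulli_log_likelihood(present, word_probs):
    """Bernoulli model: each word is present (1) or absent (0).

    An absent word also contributes evidence, through log(1 - p).
    """
    return sum(
        math.log(p) if b else math.log(1 - p)
        for b, p in zip(present, word_probs)
    )

# hypothetical per-class word probabilities for a three-word vocabulary
spam_probs = [0.6, 0.3, 0.1]
ham_probs = [0.2, 0.3, 0.5]

counts = [3, 1, 0]    # word counts in one document
present = [1, 1, 0]   # the same document as presence/absence flags

print(multinomial_log_likelihood(counts, spam_probs),
      multinomial_log_likelihood(counts, ham_probs))
print(bernoulli_log_likelihood(present, spam_probs),
      bernoulli_log_likelihood(present, ham_probs))
```

In both text models the document scores higher under the spam probabilities, because its most frequent word is the one that is most probable in that class.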

In [174]:
df = pd.read_csv(PATH_TO_DATA + FILE)

# some cleaning
df.columns = df.columns.str.replace('-', '_')
df = df.replace('?',np.nan)
df['stalk_root'] = df['stalk_root'].fillna('u')

# Label-encode all columns
df = MultiColumnLabelEncoder().fit_transform(df)
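
The source of `MultiColumnLabelEncoder` is not shown in this notebook. Assuming it assigns an integer code to every category, column by column, its effect can be sketched in plain Python as follows; `label_encode_columns` and the sample rows are hypothetical stand-ins.

```python
def label_encode_columns(rows):
    """Replace each categorical value by an integer code, column by column.

    Codes follow the sorted order of the values seen in each column.
    """
    columns = list(zip(*rows))
    mappings = [
        {value: code for code, value in enumerate(sorted(set(col)))}
        for col in columns
    ]
    encoded = [
        [mappings[j][value] for j, value in enumerate(row)]
        for row in rows
    ]
    return encoded, mappings

rows = [["p", "x", "n"], ["e", "x", "y"], ["e", "b", "w"]]  # made-up sample rows
encoded, mappings = label_encode_columns(rows)
print(encoded)
print(mappings)
```

Keeping the mappings around makes it possible to translate the integer codes back to the original categories later.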

In [175]:
# Features, Target 
X = df.iloc[:, 1:].values 
y = df.iloc[:, 0].values

In [176]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.3, shuffle=True)

X_train shape: (5687, 22)
X_test shape: (2437, 22)
y_train shape: (5687,)
y_test shape: (2437,)
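
The `train_test_split` used here comes from the local `helper_functions` directory and its code is not shown. A minimal sketch of a shuffled hold-out split, assuming it simply shuffles the rows and cuts off a `test_size` fraction, could look like this; `shuffled_split` and its `seed` parameter are hypothetical.

```python
import random

def shuffled_split(X, y, test_size=0.3, seed=None):
    """Shuffle the rows and hold out a `test_size` fraction for testing."""
    indices = list(range(len(X)))
    random.Random(seed).shuffle(indices)
    n_test = int(len(X) * test_size)  # number of held-out rows, rounded down
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    X_train = [X[i] for i in train_idx]
    X_test = [X[i] for i in test_idx]
    y_train = [y[i] for i in train_idx]
    y_test = [y[i] for i in test_idx]
    return X_train, X_test, y_train, y_test

# toy data: ten rows with one feature each, label equal to the feature value
X_toy = [[i] for i in range(10)]
y_toy = list(range(10))
Xtr, Xte, ytr, yte = shuffled_split(X_toy, y_toy, test_size=0.3, seed=0)
print(len(Xtr), len(Xte))
```

With 8124 rows and `test_size=.3`, rounding down gives 2437 test rows and 5687 training rows, which is consistent with the shapes printed above, although the helper's actual rounding rule is an assumption.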


In [194]:
import naive_bayes_recipes as nbr
nb = nbr.NaiveBayes()
nb.fit(X=X_train, y = y_train)

y_pred = nb.predict(X_test)

In [195]:
# Print results
print("Mislabeled points out of total {} points : {}, performance {:05.2f}%"
      .format(X_test.shape[0], (y_test != y_pred).sum(),
          100*(1-(y_test != y_pred).sum() / X_test.shape[0])
))

Mislabeled points out of total 2437 points : 199, performance 91.83%


In [21]:
from sklearn.naive_bayes import GaussianNB, BernoulliNB, MultinomialNB 
gnb = GaussianNB() 

gnb.fit(X_train, y_train)

y_pred = gnb.predict(X_test)

In [161]:
# Print results
print("Mislabeled points out of total {} points : {}, performance {:05.2f}%"
      .format(X_test.shape[0], (y_test != y_pred).sum(),
          100*(1-(y_test != y_pred).sum() / X_test.shape[0])
))

Mislabeled points out of total 2437 points : 203, performance 91.67%


In [197]:
# https://towardsdatascience.com/the-real-world-as-seen-on-twitter-sentiment-analysis-part-two-3ed2670f927d