# Naive Bayes Classifier

In machine learning, Naive Bayes classifiers are a family of simple probabilistic classifiers based on applying Bayes' theorem with strong, naive independence assumptions between the features.

A naive Bayes classifier treats each feature as contributing independently to the probability that an object (or row) belongs to a category, regardless of any correlations between the features.
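As a minimal sketch of that independence assumption, the posterior for each class can be computed as the class prior multiplied by one likelihood term per feature, then normalized. The class names, feature names, and probabilities below are hypothetical values chosen only for illustration.

```python
# Hypothetical priors and per-feature probabilities, for illustration only.
priors = {"class_a": 0.5, "class_b": 0.5}

# P(feature is present | class), one entry per binary feature.
likelihoods = {
    "class_a": {"feature_1": 0.8, "feature_2": 0.1},
    "class_b": {"feature_1": 0.05, "feature_2": 0.6},
}


def posterior(observed):
    """Return P(class | observed features), treating features as independent."""
    scores = {}
    for label, prior in priors.items():
        score = prior
        for feature, present in observed.items():
            p = likelihoods[label][feature]
            # Each feature contributes its own factor, ignoring the others.
            score *= p if present else 1.0 - p
        scores[label] = score
    total = sum(scores.values())
    return {label: score / total for label, score in scores.items()}


post = posterior({"feature_1": True, "feature_2": False})
print(post)
```

The classifier would then predict the class with the largest posterior, here `class_a`.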

In [1]:
# Import scikit-learn dataset library
from sklearn import datasets

# Load dataset
wine = datasets.load_wine()

print("Features: ", wine.feature_names)

print("Labels: ", wine.target_names)


Features:  ['alcohol', 'malic_acid', 'ash', 'alcalinity_of_ash', 'magnesium', 'total_phenols', 'flavanoids', 'nonflavanoid_phenols', 'proanthocyanins', 'color_intensity', 'hue', 'od280/od315_of_diluted_wines', 'proline']
Labels:  ['class_0' 'class_1' 'class_2']


In [2]:
from sklearn.model_selection import train_test_split

# Split dataset into training set and test set
X_train, X_test, y_train, y_test = train_test_split(wine.data, wine.target,
                                                    test_size=0.3, random_state=109)


In [3]:
from sklearn.naive_bayes import GaussianNB


# Create a Gaussian naive Bayes classifier
gnb = GaussianNB()

# Train the model on the training set
gnb.fit(X_train, y_train)

# Predict the labels of the test set
y_pred_gnb = gnb.predict(X_test)
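To make the Gaussian variant less of a black box, here is a rough pure-Python sketch of the idea: estimate a mean and variance per feature and per class, then pick the class with the highest log prior plus summed Gaussian log likelihoods. The helper names `fit_gaussian_nb` and `predict_one`, the toy data, and the small variance floor are assumptions of this sketch, not scikit-learn's implementation.

```python
import math
from collections import defaultdict


def fit_gaussian_nb(X, y):
    """Estimate a prior plus per-feature mean and variance for each class."""
    rows_by_class = defaultdict(list)
    for row, label in zip(X, y):
        rows_by_class[label].append(row)

    model = {}
    for label, rows in rows_by_class.items():
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        # A tiny constant keeps the variance strictly positive (sketch assumption).
        variances = [
            sum((v - m) ** 2 for v in col) / n + 1e-9
            for col, m in zip(columns, means)
        ]
        model[label] = (n / len(X), means, variances)
    return model


def predict_one(model, row):
    """Return the class with the highest log posterior (up to a constant)."""
    def log_score(params):
        prior, means, variances = params
        log_likelihood = sum(
            -0.5 * math.log(2 * math.pi * var) - (x - m) ** 2 / (2 * var)
            for x, m, var in zip(row, means, variances)
        )
        return math.log(prior) + log_likelihood

    return max(model, key=lambda label: log_score(model[label]))


# Toy, made-up data: two features, two well-separated classes.
X_toy = [[1.0, 2.1], [1.2, 1.9], [0.9, 2.0], [3.0, 0.5], [3.2, 0.4], [2.9, 0.6]]
y_toy = [0, 0, 0, 1, 1, 1]

toy_model = fit_gaussian_nb(X_toy, y_toy)
print(predict_one(toy_model, [1.1, 2.0]))  # 0
print(predict_one(toy_model, [3.1, 0.5]))  # 1
```

Working in log space avoids numerical underflow when many small likelihoods are multiplied together.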


In [4]:
from sklearn.naive_bayes import BernoulliNB

# binarize is a numeric threshold: feature values above 1.0 become 1, all others 0
BernNB = BernoulliNB(binarize=1.0)

BernNB.fit(X_train, y_train)

# print(BernNB)

y_pred_bnb = BernNB.predict(X_test)
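The `binarize` argument is a numeric threshold: values above it become 1 and everything else becomes 0. With a threshold of 1, which is what the cell above effectively uses, continuous measurements that mostly sit above 1 collapse to nearly identical rows. The sketch below illustrates this with a hypothetical `binarize` helper and made-up rows whose magnitudes only loosely resemble wine measurements.

```python
def binarize(row, threshold):
    """Map each value to 1 if it is strictly greater than the threshold, else 0."""
    return [1 if value > threshold else 0 for value in row]


# Made-up rows, not taken from the dataset.
row_a = [13.2, 1.8, 2.4, 18.0, 100.0, 0.3]
row_b = [12.1, 3.5, 2.2, 21.0, 88.0, 0.5]

bin_a = binarize(row_a, threshold=1.0)
bin_b = binarize(row_b, threshold=1.0)

print(bin_a)  # [1, 1, 1, 1, 1, 0]
print(bin_b)  # [1, 1, 1, 1, 1, 0]
```

Two different rows end up with the same binary representation, which is one plausible reason a Bernoulli model has less to work with on this kind of data.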



In [5]:
from sklearn.naive_bayes import MultinomialNB

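# MultinomialNB models count-like features; it accepts these measurements because they are non-negative
MultiNB = MultinomialNB()
MultiNB.fit(X_train, y_train)
y_pred_mnb = MultiNB.predict(X_test)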


In [6]:
# Import scikit-learn metrics module for accuracy calculation
from sklearn import metrics

# Model accuracy: how often is the classifier correct?
print("Accuracy Gaussian NB:", metrics.accuracy_score(y_test, y_pred_gnb))

print("Accuracy Bernoulli NB:", metrics.accuracy_score(y_test, y_pred_bnb))

print("Accuracy Multinomial NB:", metrics.accuracy_score(y_test, y_pred_mnb))


Accuracy Gaussian NB: 0.9074074074074074
Accuracy Bernoulli NB: 0.6296296296296297
Accuracy Multinomial NB: 0.7962962962962963
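Accuracy is simply the fraction of test samples whose predicted label matches the true label. The sketch below uses a hypothetical `accuracy` helper to show the calculation, then converts the scores printed above back into counts. It assumes a test set of 54 samples, which is what a 30% split of the 178-sample wine dataset should give if the test size is rounded up.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Tiny made-up example: 3 of 4 predictions are right.
print(accuracy([0, 1, 2, 1], [0, 1, 2, 2]))  # 0.75

# Assumption: 54 test samples (30% of 178, rounded up).
n_test = 54
reported = {
    "Gaussian": 0.9074074074074074,
    "Bernoulli": 0.6296296296296297,
    "Multinomial": 0.7962962962962963,
}
correct_counts = {name: round(score * n_test) for name, score in reported.items()}
print(correct_counts)
```

Under that assumption the three models classify 49, 34, and 43 of the 54 test wines correctly.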


# Text preprocessing

In [7]:
documents = ["This little kitty came to play when I was eating at a restaurant.",
             "Merley has the best squooshy kitten belly.",
             "Google Translate app is incredible.",
             "If you open 100 tab in google you get a smiley face.",
             "Best cat photo I've ever taken.",
             "Climbing ninja cat.",
             "Impressed with google map feedback.",
             "Key promoter extension for Google Chrome."]

## Term frequency–inverse document frequency (tf–idf or TFIDF)

TFIDF is a numerical statistic intended to reflect how important a word is to a document within a collection of documents (a corpus).

The tf–idf value increases proportionally to the number of times a word appears in the document and is offset by the number of documents in the corpus that contain the word, which helps to adjust for the fact that some words appear more frequently in general.

The weight of a term that occurs in a document is simply proportional to the term frequency.
### $$ tf(t,d) = \text{frequency of term } t \text{ in document } d $$

The specificity of a term can be quantified as an inverse function of the number of documents in which it occurs.

### $$ idf(t,D) = \log \frac{\text{total number of documents in the corpus } D}{\text{number of documents in which term } t \text{ appears}} $$

Then tf–idf is calculated as
 
### $$ tfidf(t,d,D) = tf(t,d) \cdot idf(t,D) $$
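The three formulas above can be sketched directly in plain Python. The toy corpus and the helper names `tf`, `idf`, and `tfidf` are assumptions made for this illustration, and the sketch assumes every queried term occurs in at least one document.

```python
import math

# Toy corpus of pre-tokenized documents, made up for this illustration.
corpus = [
    ["cat", "photo", "cat"],
    ["google", "map"],
    ["google", "chrome", "cat"],
]


def tf(term, doc):
    """Raw count of the term in one document."""
    return doc.count(term)


def idf(term, docs):
    """log(N / number of documents that contain the term)."""
    containing = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / containing)


def tfidf(term, doc, docs):
    """Product of term frequency and inverse document frequency."""
    return tf(term, doc) * idf(term, docs)


print(tfidf("cat", corpus[0], corpus))    # 2 * log(3 / 2)
print(tfidf("photo", corpus[0], corpus))  # 1 * log(3 / 1)
print(tfidf("map", corpus[0], corpus))    # 0, because "map" is not in this document
```

Note that scikit-learn's `TfidfVectorizer` does not use exactly this textbook form by default: it applies a smoothed idf, `ln((1 + N) / (1 + df)) + 1`, and then L2-normalizes each document vector, so its values will differ from the ones computed here.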

In [8]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(documents)
print(vectorizer.vocabulary_)

{'little': 18, 'kitty': 17, 'came': 4, 'play': 24, 'eating': 8, 'restaurant': 26, 'merley': 20, 'best': 3, 'squooshy': 28, 'kitten': 16, 'belly': 2, 'google': 12, 'translate': 31, 'app': 1, 'incredible': 14, 'open': 22, '100': 0, 'tab': 29, 'smiley': 27, 'face': 10, 'cat': 5, 'photo': 23, 've': 32, 'taken': 30, 'climbing': 7, 'ninja': 21, 'impressed': 13, 'map': 19, 'feedback': 11, 'key': 15, 'promoter': 25, 'extension': 9, 'chrome': 6}
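The printed dictionary maps each term to its column index in `X`, and the indices appear to follow the sorted order of the terms. The quick check below copies the printed vocabulary into a variable (named `vocabulary` here for illustration) and rebuilds the indices by sorting the terms.

```python
# Copy of the vocabulary printed above.
vocabulary = {
    'little': 18, 'kitty': 17, 'came': 4, 'play': 24, 'eating': 8,
    'restaurant': 26, 'merley': 20, 'best': 3, 'squooshy': 28, 'kitten': 16,
    'belly': 2, 'google': 12, 'translate': 31, 'app': 1, 'incredible': 14,
    'open': 22, '100': 0, 'tab': 29, 'smiley': 27, 'face': 10, 'cat': 5,
    'photo': 23, 've': 32, 'taken': 30, 'climbing': 7, 'ninja': 21,
    'impressed': 13, 'map': 19, 'feedback': 11, 'key': 15, 'promoter': 25,
    'extension': 9, 'chrome': 6,
}

# Rebuild the mapping by numbering the terms in sorted order.
rebuilt = {term: index for index, term in enumerate(sorted(vocabulary))}

print(rebuilt == vocabulary)  # True
print(len(vocabulary))        # 33
```

Two details of the output are worth noticing: common words such as "this" and "the" are missing because of `stop_words='english'`, and "I've" contributes the token `ve` because the default tokenizer splits on the apostrophe and drops single-character tokens.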
