<a href="https://colab.research.google.com/github/yavuzuzun/projects/blob/main/naive_bayes.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# What is Naive Bayes?

Naive Bayes is a probabilistic machine learning algorithm commonly used for classification tasks. It is based on Bayes' theorem, which expresses the probability of a hypothesis given observed evidence in terms of the probability of that evidence given the hypothesis.

Naive Bayes assumes that the input features are conditionally independent given the class label. This assumption simplifies the probability calculation and makes the model computationally efficient. The model estimates the conditional probability of the class label given the input features using Bayes' theorem:

P(y|x1, x2, ..., xn) = P(x1, x2, ..., xn|y) * P(y) / P(x1, x2, ..., xn)

where y is the class label and x1, x2, ..., xn are the input features. The model estimates the likelihood P(x1, x2, ..., xn|y) and the prior probability P(y) from the training data. The evidence P(x1, x2, ..., xn) is a normalizing constant that makes the posterior probabilities sum to 1 over the classes. Because it is the same for every class, it can be ignored when only the most probable class is needed.

Naive Bayes is called "naive" because it assumes that the input features are conditionally independent, which is often not true in practice. However, despite this simplifying assumption, Naive Bayes can be surprisingly effective in many real-world applications, especially in text classification and spam filtering.
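
The conditional-independence assumption described above can be written out explicitly. The following is a short sketch in LaTeX notation, using the same symbols as the formula above (y for the class label, x_1 through x_n for the features). The hat on y is notation introduced here for the predicted class.

```latex
% Naive assumption: the joint likelihood factorizes over the features
P(x_1, x_2, \ldots, x_n \mid y) = \prod_{i=1}^{n} P(x_i \mid y)

% Dropping the evidence term, which does not depend on y,
% the predicted class is the one that maximizes prior times likelihood
\hat{y} = \arg\max_{y} \; P(y) \prod_{i=1}^{n} P(x_i \mid y)
```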

# With a prepackaged implementation (scikit-learn)

In [1]:
import numpy as np
import pandas as pd
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report

# Load the 20 Newsgroups dataset
newsgroups_train = fetch_20newsgroups(subset='train')
newsgroups_test = fetch_20newsgroups(subset='test')

# Convert the raw text data to numerical features using CountVectorizer
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(newsgroups_train.data)
X_test = vectorizer.transform(newsgroups_test.data)

# Train a Multinomial Naive Bayes classifier on the training data
y_train = newsgroups_train.target
clf = MultinomialNB()
clf.fit(X_train, y_train)

# Use the classifier to predict the classes of the test data
y_test = newsgroups_test.target
y_pred = clf.predict(X_test)

# Evaluate the accuracy of the classifier
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

# Print a classification report with precision, recall, and F1-score for each class
class_names = newsgroups_train.target_names
report = classification_report(y_test, y_pred, target_names=class_names)
print(report)


Accuracy: 0.7728359001593202
                          precision    recall  f1-score   support

             alt.atheism       0.79      0.77      0.78       319
           comp.graphics       0.67      0.74      0.70       389
 comp.os.ms-windows.misc       0.20      0.00      0.01       394
comp.sys.ibm.pc.hardware       0.56      0.77      0.65       392
   comp.sys.mac.hardware       0.84      0.75      0.79       385
          comp.windows.x       0.65      0.84      0.73       395
            misc.forsale       0.93      0.65      0.77       390
               rec.autos       0.87      0.91      0.89       396
         rec.motorcycles       0.96      0.92      0.94       398
      rec.sport.baseball       0.96      0.87      0.91       397
        rec.sport.hockey       0.93      0.96      0.95       399
               sci.crypt       0.67      0.95      0.78       396
         sci.electronics       0.79      0.66      0.72       393
                 sci.med       0.87      0.82 

# From scratch

First, we'll create a toy dataset. Let's say we have a dataset of emails and we want to classify them as either spam or not spam based on the words in the email. Here's what our dataset might look like:

In [2]:
emails = [
    ('Hey there! I thought you might find this interesting. Click here.', 'spam'),
    ('Get viagra for a discount price. Limited time offer.', 'spam'),
    ('URGENT: Your help is needed to secure your account!', 'spam'),
    ('Hi, just wanted to check in and see how you were doing.', 'not spam'),
    ('Reminder: Meeting tomorrow at 2pm.', 'not spam'),
    ('Please submit your expense reports by Friday.', 'not spam')
]


Now, we'll need to preprocess the data by creating a vocabulary of all the unique words in the dataset and creating a bag of words representation for each email. We can do this using the CountVectorizer class from scikit-learn:

In [3]:
from sklearn.feature_extraction.text import CountVectorizer

# create the vectorizer
vectorizer = CountVectorizer()

# fit the vectorizer on the emails
corpus = [email[0] for email in emails]
vectorizer.fit(corpus)

# create a bag of words representation for each email
X = vectorizer.transform(corpus).toarray()

# create the target vector
y = [email[1] for email in emails]
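
To make the bag of words step less of a black box, here is a minimal sketch of the same idea in plain Python. It assumes a tokenizer that lowercases the text and keeps runs of two or more word characters, which roughly mirrors CountVectorizer's default behaviour. The function names (tokenize, build_vocabulary, to_count_vector) and the three-sentence toy corpus are made up for this illustration and are not part of scikit-learn.

```python
import re
from collections import Counter


def tokenize(text):
    # Lowercase, then keep runs of two or more word characters
    # (an approximation of CountVectorizer's default token pattern).
    return re.findall(r"\b\w\w+\b", text.lower())


def build_vocabulary(corpus):
    # Map each distinct token to a column index, in sorted order.
    tokens = sorted({tok for doc in corpus for tok in tokenize(doc)})
    return {tok: idx for idx, tok in enumerate(tokens)}


def to_count_vector(text, vocabulary):
    # Count how often each vocabulary word occurs in the text.
    # Words that are not in the vocabulary are ignored.
    counts = Counter(tokenize(text))
    return [counts.get(tok, 0) for tok in vocabulary]


# Hypothetical toy corpus, only for illustration
toy_corpus = ["Limited time offer", "Meeting at 2pm", "Offer ends at 2pm"]
vocab = build_vocabulary(toy_corpus)
vectors = [to_count_vector(doc, vocab) for doc in toy_corpus]
```

Each row of vectors has one entry per vocabulary word, so every text becomes a fixed-length list of counts regardless of how many words it contains.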


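Now, we'll split the data into training and testing sets. With six emails and test_size=0.2, scikit-learn rounds the test set up to two emails, leaving four for training: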

In [4]:
from sklearn.model_selection import train_test_split

# split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


Now, we're ready to implement the Naive Bayes algorithm. First, we'll need to calculate the prior probabilities of each class:

In [5]:
import numpy as np

# calculate the prior probabilities of each class
classes, class_counts = np.unique(y_train, return_counts=True)
prior_probs = class_counts / len(y_train)
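
The prior of a class is simply the fraction of training examples that carry that label. As a sketch of the same calculation without NumPy, the hypothetical helper class_priors below counts labels with collections.Counter. The label list is invented for the example.

```python
from collections import Counter


def class_priors(labels):
    # P(class) = (number of examples with that label) / (total examples)
    counts = Counter(labels)
    total = len(labels)
    return {label: n / total for label, n in counts.items()}


# Hypothetical labels, only for illustration
toy_labels = ["spam", "spam", "not spam", "spam"]
priors = class_priors(toy_labels)
```

Here three of the four labels are spam, so the prior for spam is 0.75 and the prior for not spam is 0.25.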


Next, we'll calculate the likelihoods of each word given each class using Laplace smoothing:

In [9]:
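# calculate the likelihoods of each word given each class
# y_train must be a NumPy array: comparing a plain list with `== c`
# yields a single False instead of an element-wise mask
y_train = np.array(y_train)
word_counts = np.zeros((len(classes), X_train.shape[1]))
for i, c in enumerate(classes):
    X_c = X_train[y_train == c]
    # Laplace (add-one) smoothing: add 1 to every word count
    word_counts[i, :] = X_c.sum(axis=0) + 1

# normalize each row so the word probabilities for a class sum to 1
word_probs = word_counts / word_counts.sum(axis=1, keepdims=True)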
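
Laplace smoothing estimates P(word | class) as (count of the word in the class + alpha) divided by (total word count in the class + alpha times the vocabulary size), so a word that never appears in a class still gets a small non-zero probability. The sketch below shows this in plain Python. The function name smoothed_likelihoods, the alpha parameter, and the toy count vectors are assumptions made for illustration.

```python
def smoothed_likelihoods(count_vectors, labels, alpha=1.0):
    vocab_size = len(count_vectors[0])

    # Sum the word counts of all documents belonging to each class.
    totals = {}
    for vec, label in zip(count_vectors, labels):
        running = totals.setdefault(label, [0] * vocab_size)
        for j, count in enumerate(vec):
            running[j] += count

    # Add alpha to every count, then normalize so each class sums to 1.
    likelihoods = {}
    for label, counts in totals.items():
        denom = sum(counts) + alpha * vocab_size
        likelihoods[label] = [(c + alpha) / denom for c in counts]
    return likelihoods


# Hypothetical count vectors over a three-word vocabulary
toy_vectors = [[2, 0, 1], [1, 0, 0], [0, 3, 0]]
toy_labels = ["spam", "spam", "not spam"]
likelihoods = smoothed_likelihoods(toy_vectors, toy_labels)
```

In this toy data the second word never occurs in a spam document, yet its smoothed probability is 1/7 rather than 0, which keeps a single unseen word from zeroing out the whole product.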




Finally, we can make predictions on the test set:

In [12]:
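# make predictions on the test set
y_pred = []
for x in X_test:
    # start from the class priors, then multiply in P(word | class)
    # once per occurrence of each word in the email
    class_probs = prior_probs.copy()
    for i in range(len(classes)):
        for j, xj in enumerate(x):
            class_probs[i] *= word_probs[i, j] ** xj
    # pick the class with the highest unnormalized posterior
    y_pred.append(classes[np.argmax(class_probs)])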

# evaluate the performance of the model
from sklearn.metrics import accuracy_score

accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy}")


Accuracy: 0.0
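
Multiplying many small probabilities can underflow to zero on longer documents, so Naive Bayes predictions are commonly computed by summing log-probabilities instead. The following sketch shows that variant in plain Python. The predict function and the toy priors and likelihoods are hypothetical values chosen for illustration, not the ones computed in the cells above.

```python
import math


def predict(count_vector, priors, likelihoods):
    best_label, best_score = None, -math.inf
    for label, prior in priors.items():
        # log P(class) + sum over words of count * log P(word | class)
        score = math.log(prior)
        for count, prob in zip(count_vector, likelihoods[label]):
            score += count * math.log(prob)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical model parameters over a three-word vocabulary
toy_priors = {"spam": 0.5, "not spam": 0.5}
toy_likelihoods = {
    "spam": [0.6, 0.1, 0.3],
    "not spam": [0.2, 0.6, 0.2],
}
prediction = predict([2, 0, 1], toy_priors, toy_likelihoods)
```

Because the logarithm is monotonic, the class with the highest log score is the same class that would win with the raw product. This requires every likelihood to be strictly positive, which Laplace smoothing guarantees.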


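This still needs work. The reported accuracy of 0.0 most likely comes from y_train being a plain Python list when the likelihoods were computed: comparing a list to a class name with == gives a single False rather than an element-wise mask, so no training rows are selected, the word counts stay at zero, and the division produces NaN probabilities. Converting y_train to a NumPy array before that comparison should fix the row selection. Even then, six emails with a two-email test set are far too few for the accuracy figure to mean much.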