
##Spam-detection with naive Bayes

Now that we have developed the naive Bayes algorithm, let’s roll up our sleeves and code it,
using a dataset of emails labeled as spam or ham.

##Setup

In [1]:
import pandas as pd
import numpy as np
import random
import matplotlib.pyplot as plt

random.seed(0)

In [None]:
!wget https://github.com/luisguiserrano/manning/raw/master/Chapter_8_Naive_Bayes/emails.csv

##Dataset preprocessing

First, let's load the dataset.

In [3]:
emails = pd.read_csv("emails.csv")
emails.head()

Unnamed: 0,text,spam
0,Subject: naturally irresistible your corporate...,1
1,Subject: the stock trading gunslinger fanny i...,1
2,Subject: unbelievable new homes made easy im ...,1
3,Subject: 4 color printing special request add...,1
4,"Subject: do not have money , get software cds ...",1


In [4]:
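def process_email(text):
  # Lowercase so that "Sale" and "sale" count as the same word
  text = text.lower()
  # Split on whitespace and keep each distinct word once
  return list(set(text.split()))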
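
As a quick check of what `process_email` produces, the sketch below runs it on a made-up sentence (the sample string is an assumption, not taken from the dataset). Repeated words and different capitalizations collapse into a single entry, and punctuation becomes its own token only when it is surrounded by whitespace, which appears to be how the dataset's text is written.

```python
def process_email(text):
    # Same logic as above: lowercase, split on whitespace, deduplicate
    text = text.lower()
    return list(set(text.split()))

# Hypothetical sample, chosen to show case folding and deduplication
sample = "Buy now , BUY cheap now"
words = process_email(sample)

# Sets are unordered, so sort before printing to get a stable result
print(sorted(words))  # [',', 'buy', 'cheap', 'now']
```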

In [5]:
emails["words"] = emails["text"].apply(process_email)
emails.head()

Unnamed: 0,text,spam,words
0,Subject: naturally irresistible your corporate...,1,"[it, do, our, will, :, distinctive, hard, with..."
1,Subject: the stock trading gunslinger fanny i...,1,"[attire, attainder, try, plain, merrill, chron..."
2,Subject: unbelievable new homes made easy im ...,1,"[unconditionally, we, approved, time, $, and, ..."
3,Subject: 4 color printing special request add...,1,"[goldengraphix, mail, and, (, our, is, 4, rams..."
4,"Subject: do not have money , get software cds ...",1,"[finish, are, it, do, is, to, compatibility, c..."


##Finding the priors

Let’s first find the probability that an email is spam (the prior).

In [6]:
num_emails = len(emails)
num_spam = sum(emails["spam"])
print(f"Number of emails: {num_emails}")
print(f"Number of spam emails: {num_spam}")
print()

# Calculating the prior probability that an email is spam
print(f"Probability of spam: {num_spam / num_emails}")

Number of emails: 5728
Number of spam emails: 1368

Probability of spam: 0.2388268156424581


We deduce that the prior probability that an email is spam is around 0.24. This is the probability
that an email is spam if we know nothing else about it. Likewise, the prior probability
that an email is ham is around 0.76.
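
Because every email is labeled either spam or ham, the ham prior is simply the complement of the spam prior. A minimal sketch using the counts printed above (the variable names `prior_spam` and `prior_ham` are illustrative):

```python
# Counts taken from the output above
num_emails = 5728
num_spam = 1368

prior_spam = num_spam / num_emails   # P(spam)
prior_ham = 1 - prior_spam           # P(ham), since the two classes are exhaustive

print(round(prior_spam, 2), round(prior_ham, 2))  # 0.24 0.76
```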

##Finding the posteriors

We need to find the probabilities that spam (and ham) emails contain a certain word.

In [7]:
# Build a dictionary that maps every word to the number of spam and ham emails it appears in
model = {}
for index, email in emails.iterrows():
  for word in email["words"]:
    if word not in model:
      # Counts start at 1 (not 0) so that no word ends up with a zero count
      model[word] = {"spam": 1, "ham": 1}
    if email["spam"]:
      model[word]["spam"] += 1
    else:
      model[word]["ham"] += 1
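
To see the effect of starting the counts at 1, here is a small sketch on a made-up three-email corpus (`toy_emails` and `toy_model` are hypothetical names, and `defaultdict` is used here only as a compact alternative to the explicit membership check). A word that never occurs in ham still ends with a ham count of 1, so its estimated spam probability stays below 1.

```python
from collections import defaultdict

# Hypothetical toy corpus: (set of words in the email, is_spam)
toy_emails = [
    ({"win", "lottery", "now"}, True),
    ({"lottery", "prize"}, True),
    ({"meeting", "now"}, False),
]

# Every word starts at 1 for both classes, so no count is ever zero
toy_model = defaultdict(lambda: {"spam": 1, "ham": 1})
for words, is_spam in toy_emails:
    for word in words:
        toy_model[word]["spam" if is_spam else "ham"] += 1

print(toy_model["lottery"])  # {'spam': 3, 'ham': 1}
print(toy_model["now"])      # {'spam': 2, 'ham': 2}
```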

Now let’s examine some rows of the dictionary.

In [8]:
model["lottery"]

{'ham': 1, 'spam': 9}

In [9]:
model["sale"]

{'ham': 42, 'spam': 39}

Although this dictionary doesn’t contain any probabilities, these can be deduced by dividing
the spam count by the sum of the spam and ham counts.


In [10]:
# Probability that an email containing the word "lottery" is spam
model["lottery"]["spam"] / (model["lottery"]["ham"] + model["lottery"]["spam"])

0.9

In [11]:
# Probability that an email containing the word "sale" is spam
model["sale"]["spam"] / (model["sale"]["ham"] + model["sale"]["spam"])

0.48148148148148145

Let's generalize this into a function that works for any word in the model.

In [12]:
def predict_bayes(word):
  # Returns the probability that an email containing `word` is spam
  word = word.lower()
  num_spam = model[word]["spam"]
  num_ham = model[word]["ham"]
  return num_spam / (num_spam + num_ham)

In [13]:
predict_bayes("lottery")

0.9

In [14]:
predict_bayes("sale")

0.48148148148148145

In [15]:
predict_bayes("won")

0.3595505617977528
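
Note that `predict_bayes` looks words up with `model[word]`, so a word that never appeared in the dataset raises a `KeyError`. One possible way to handle that, sketched here with a hypothetical helper `predict_bayes_safe` and a two-word stand-in for `model` built from the counts shown above, is to return a caller-supplied default for unseen words:

```python
# Hypothetical miniature of the `model` dictionary, using the counts shown above
toy_model = {
    "lottery": {"spam": 9, "ham": 1},
    "sale": {"spam": 39, "ham": 42},
}

def predict_bayes_safe(word, model, default=None):
    """Return the spam probability for `word`, or `default` if the word is unseen."""
    counts = model.get(word.lower())
    if counts is None:
        return default
    return counts["spam"] / (counts["spam"] + counts["ham"])

print(predict_bayes_safe("Lottery", toy_model))  # 0.9
print(predict_bayes_safe("asdfgh", toy_model))   # None
```

Returning `None` keeps "no evidence" distinguishable from a genuine probability; passing the prior as `default` would be another reasonable choice.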

##The naive Bayes algorithm

The input of the algorithm is the email. It goes through all the words in the email, and for each
word, it calculates the probabilities that a spam email contains it and that a ham email contains
it.

Then we multiply these probabilities, treating the words as independent of each other (the naive
assumption), and apply Bayes’ theorem to find the probability that an email is spam given that it
contains the words in this particular email.
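
Written out, the computation described above takes the following form, where $w_1, \dots, w_n$ are the distinct words of the email that appear in the model (this notation is introduced here for illustration). The naive assumption is what allows each class-conditional probability to factor into a product over words:

```latex
P(\text{spam} \mid w_1, \dots, w_n)
  = \frac{P(\text{spam}) \prod_{i=1}^{n} P(w_i \mid \text{spam})}
         {P(\text{spam}) \prod_{i=1}^{n} P(w_i \mid \text{spam})
          + P(\text{ham}) \prod_{i=1}^{n} P(w_i \mid \text{ham})}
```

Here $P(w_i \mid \text{spam})$ is estimated as the spam count of $w_i$ in the dictionary divided by the number of spam emails, and likewise for ham.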

In [18]:
def predict_naive_bayes(email):
  # Calculates the total number of emails, spam emails, and ham emails
  total = len(emails)
  num_spam = sum(emails["spam"])
  num_ham = total - num_spam

  # Lowercases the email and turns it into a set of its unique words
  email = email.lower()
  words = set(email.split())
  spams = [1.0]
  hams = [1.0]

  # For each known word, records the fraction of spam (or ham) emails that contain it, scaled by the total number of emails
  for word in words:
    if word in model:
      spams.append(model[word]["spam"] / num_spam * total)
      hams.append(model[word]["ham"] / num_ham * total)

  # Multiplies these factors by the number of spam (or ham) emails, which is proportional to the prior, and takes the logarithm
  prod_spams = np.log(np.prod(spams) * num_spam)
  prod_hams = np.log(np.prod(hams) * num_ham)

  # Returns the spam log-score as a share of both log-scores. Because the logarithm is taken
  # before dividing, this is a ratio of logs rather than a normalized posterior probability
  return prod_spams / (prod_spams + prod_hams)

Now that we have built the model, let’s test it by making predictions on some emails.

In [22]:
predict_naive_bayes("lottery sale")

0.5573625228656424

In [19]:
predict_naive_bayes("Hi mom how are you")

0.48743444104407146

In [21]:
predict_naive_bayes("Hi MOM how aRe yoU afdjsaklfsdhgjasdhfjklsd")

0.48743444104407146

In [20]:
predict_naive_bayes("meet me at the lobby of the hotel at nine am")

0.4619985211948569

In [23]:
predict_naive_bayes("enter the lottery to win three million dollars")

0.5371130104014242

In [24]:
predict_naive_bayes("buy cheap lottery easy money now")

0.5698786029874038

In [25]:
predict_naive_bayes("Grokking Machine Learning by Luis Serrano")

0.4969110781385491

In [26]:
predict_naive_bayes("asdfgh")

0.4628518191306706
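
All of the scores above fall between roughly 0.46 and 0.57. That is a consequence of `predict_naive_bayes` applying `np.log` to each product before dividing: the result is a ratio of logarithms, not a normalized probability. For example, an email with no known words (`"asdfgh"`) yields log(1368) / (log(1368) + log(4360)) ≈ 0.463, whereas Bayes' theorem would give the prior, about 0.24.

The sketch below shows one way to keep the numerical benefits of logarithms while still returning a probability: accumulate the scores in log space, then exponentiate before normalizing. The name `predict_naive_bayes_log`, its extra parameters, and the two-word `toy_model` are assumptions made for illustration; only the counts are taken from the outputs above.

```python
import math

# Counts quoted from the outputs above; everything else here is illustrative
NUM_SPAM, NUM_HAM = 1368, 4360
toy_model = {
    "lottery": {"spam": 9, "ham": 1},
    "sale": {"spam": 39, "ham": 42},
}

def predict_naive_bayes_log(email, model, num_spam, num_ham):
    """Estimate P(spam | words), accumulating the scores in log space."""
    words = set(email.lower().split())
    # Start from the log of the class counts (proportional to the priors)
    log_spam = math.log(num_spam)
    log_ham = math.log(num_ham)
    for word in words:
        if word in model:
            # Add log P(word | class), estimated from the smoothed counts
            log_spam += math.log(model[word]["spam"] / num_spam)
            log_ham += math.log(model[word]["ham"] / num_ham)
    # Exponentiate before normalizing; subtracting the max guards against underflow
    top = max(log_spam, log_ham)
    spam_score = math.exp(log_spam - top)
    ham_score = math.exp(log_ham - top)
    return spam_score / (spam_score + ham_score)

print(round(predict_naive_bayes_log("asdfgh", toy_model, NUM_SPAM, NUM_HAM), 4))        # 0.2388
print(round(predict_naive_bayes_log("lottery sale", toy_model, NUM_SPAM, NUM_HAM), 4))  # 0.9638
```

With this normalization, an email of unknown words falls back to the prior, and "lottery sale" scores close to 1 because "lottery" is far more frequent in spam than in ham.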