
# HW02. Simple Naive Bayes Classifier


## T1. Load a dataset

The following code loads a dataset consisting of text messages and spam-ham labels.

You can write your own code below the **"TODOs"** to answer the questions.



### Questions:
* Number of spam messages? [*747*]
* Number of ham messages? [*4825*]

In [None]:
from typing import List, Tuple, Dict, Iterable, Set
from collections import defaultdict
import re
import math
import pandas as pd

url = 'https://raw.githubusercontent.com/mlee-pnu/IDS/main/spam_dataset.csv'
df = pd.read_csv(url)

# TODOs
counts = df['Category'].value_counts()  # messages per label, computed once
hams = counts["ham"]
spams = counts["spam"]
print(spams, hams)

747 4825
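
The two counts also determine the class priors used in the later tasks. As a minimal sketch, with the counts hard-coded from the output above rather than read from the dataset:

```python
# Class counts copied from the printed output above (hard-coded for illustration).
spams, hams = 747, 4825

total = spams + hams
p_spam = spams / total  # P(spam): fraction of all messages that are spam
p_ham = hams / total    # P(ham): fraction of all messages that are ham

print(total, round(p_spam, 3), round(p_ham, 3))  # -> 5572 0.134 0.866
```

Roughly one message in seven is spam, so the dataset is noticeably imbalanced.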


## T2. Spam filter for individual words

We first define a function ***tokenize()*** that converts a given text into a set of lowercase words.

Using this function, we count, for each word, how many messages in each class (spam and ham) contain it.

Complete the code below and answer the following questions:
 



### Questions:
*   Number of spam messages containing the word "free": [*170*]

Suppose we use only the word "free" to decide whether a message is spam. Answer the following questions:

*   P ( *ham* | *free* ) = [*0.26*]
*   Is a message containing "free" classified as spam? [*Yes*]

*Note: Do not apply a smoothing method here.*



In [None]:
def tokenize(text: str) -> Set[str]:
    text = text.lower()                          # normalize case
    all_words = re.findall(r"[a-z0-9']+", text)  # runs of letters, digits, and apostrophes
    return set(all_words)                        # drop duplicates within a message
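
To make the behavior of ***tokenize()*** concrete, here is a small self-contained sketch applied to a made-up sentence (the sample text is an assumption, not taken from the dataset). Because the result is a set, a word that is repeated within one message is counted only once.

```python
import re
from typing import Set

def tokenize(text: str) -> Set[str]:
    # Same logic as the cell above: lowercase, extract words, deduplicate.
    return set(re.findall(r"[a-z0-9']+", text.lower()))

sample = "FREE entry!! Free tickets, don't miss out"  # hypothetical message
print(sorted(tokenize(sample)))
# -> ["don't", 'entry', 'free', 'miss', 'out', 'tickets']
```

"FREE" and "Free" collapse into a single token, so the per-class counts built from these sets are numbers of messages, not numbers of word occurrences.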

In [None]:
tokens: Set[str] = set()
token_spam_counts: Dict[str, int] = defaultdict(int)
token_ham_counts: Dict[str, int] = defaultdict(int)

spam = df[df.Category == 'spam']
ham = df[df.Category == 'ham']

for msg in spam['Message'].to_list():
  for token in tokenize(msg):
    tokens.add(token)
    token_spam_counts[token] += 1

for msg in ham['Message'].to_list():
  for token in tokenize(msg):
    tokens.add(token)
    token_ham_counts[token] += 1

# TODOs
word = "free"
n_word_spam = token_spam_counts[word] # number of spam messages containing the word
n_word_ham = token_ham_counts[word]   # number of ham messages containing the word
print("counts in spam:",n_word_spam, "\ncounts in ham:",n_word_ham)

p_spam = spams/(hams+spams)  # P(spam)
p_ham = hams/(hams+spams)    # P(ham)
p_word_given_spam = n_word_spam/spams  # P(word|spam)
p_word_given_ham = n_word_ham/hams     # P(word|ham)

# Bayes' rule: P(spam|word) = P(word|spam)*P(spam) / P(word)
# P(word) = P(word|spam)*P(spam) + P(word|ham)*P(ham)
p_word = p_word_given_spam*p_spam + p_word_given_ham*p_ham
p_spam_given_word = p_word_given_spam*p_spam/p_word
# P(ham|word)
p_ham_given_word = p_word_given_ham*p_ham/p_word
print(p_spam_given_word)
print(p_ham_given_word)
print("====================")
print("P(ham|free) = ", p_ham_given_word)
print("spam? ", p_ham_given_word < p_spam_given_word)

counts in spam: 170 
counts in ham: 59
0.74235807860262
0.2576419213973799
====================
P(ham|free) =  0.2576419213973799
spam?  True
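
The printed value can be checked by hand. Plugging the counts above (170 spam and 59 ham messages containing "free", out of 747 and 4825 messages respectively) into Bayes' rule, the class sizes cancel and only the two word counts remain:

```latex
\begin{aligned}
P(\text{ham} \mid \text{free})
  &= \frac{P(\text{free} \mid \text{ham})\,P(\text{ham})}
          {P(\text{free} \mid \text{spam})\,P(\text{spam}) + P(\text{free} \mid \text{ham})\,P(\text{ham})} \\
  &= \frac{\frac{59}{4825}\cdot\frac{4825}{5572}}
          {\frac{170}{747}\cdot\frac{747}{5572} + \frac{59}{4825}\cdot\frac{4825}{5572}}
   = \frac{59}{170 + 59} = \frac{59}{229} \approx 0.258
\end{aligned}
```

Since P(spam | free) = 170/229 ≈ 0.742 is the larger of the two, the message is labeled spam.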


## T3. Spam filter that combines words: Naive Bayes

You received a text message "just do it" from an unknown sender.

Complete the function ***predict()*** that outputs the probability of the message being spam and the predicted label of the message. 


### Questions:

*   P ( *spam* | *text* ) = [*3.31e-06*] (5.13e-07 should also be accepted as correct)
*   Is this text message spam? [*No*]

*Note: You do not need to apply a smoothing method here.*



In [None]:
text = "just do it"

# TODOs
def predict(text: str):
  k = 0.0 # smoothing factor (0 means no smoothing)
  log_spam = log_ham = 0.0
  text_tokens = tokenize(text)  # tokenize once instead of once per vocabulary word

  # Iterate over the whole vocabulary: words absent from the text also contribute
  for token in tokens:
    n_word_spam = token_spam_counts[token] # number of spam messages containing the token
    n_word_ham = token_ham_counts[token]   # number of ham messages containing the token

    p_word_given_spam = (n_word_spam + k) / (spams + 2*k)  # P(word|spam)
    p_word_given_ham = (n_word_ham + k) / (hams + 2*k)     # P(word|ham)

    if token in text_tokens:
      # With k = 0, math.log raises ValueError if a token from the text
      # has a zero count in either class
      log_spam += math.log(p_word_given_spam)
      log_ham += math.log(p_word_given_ham)
    else:
      log_spam += math.log(1.0 - p_word_given_spam)
      log_ham += math.log(1.0 - p_word_given_ham)

  # The class priors P(spam) and P(ham) are not applied here,
  # so prob is the posterior under equal priors
  p_if_spam = math.exp(log_spam)
  p_if_ham = math.exp(log_ham)
  prob = p_if_spam / (p_if_spam + p_if_ham)
  label = "spam" if prob > 0.5 else "ham"

  return prob, label

print(predict(text))

(3.315285589623296e-06, 'ham')
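
The probability printed above is computed from the two likelihoods alone, which amounts to assuming equal priors for spam and ham. The sketch below shows one way to fold a class prior into the same calculation while staying in log space until the last step. The helper name `posterior_spam` and its parameters are assumptions made for illustration, not part of the assignment. Scaling by the dataset priors (747 spam out of 5572 messages) moves the result from about 3.3e-06 to about 5.1e-07, which appears to be the alternative answer listed in the question.

```python
import math

def posterior_spam(log_lik_spam: float, log_lik_ham: float, prior_spam: float = 0.5) -> float:
    """P(spam | text) from per-class log-likelihoods and a spam prior (hypothetical helper)."""
    log_joint_spam = log_lik_spam + math.log(prior_spam)
    log_joint_ham = log_lik_ham + math.log(1.0 - prior_spam)
    # Subtract the larger log value before exponentiating so that very
    # negative log-likelihoods do not underflow to 0.0 in both classes.
    m = max(log_joint_spam, log_joint_ham)
    w_spam = math.exp(log_joint_spam - m)
    w_ham = math.exp(log_joint_ham - m)
    return w_spam / (w_spam + w_ham)

prior = 747 / (747 + 4825)        # P(spam) from the class counts in T1
p_equal = 3.315285589623296e-06   # value printed by predict() above

# log(p) and log(1 - p) act as the two log-likelihoods up to a shared constant.
print(posterior_spam(math.log(p_equal), math.log(1.0 - p_equal)))         # recovers p_equal
print(posterior_spam(math.log(p_equal), math.log(1.0 - p_equal), prior))  # about 5.13e-07
```

Either way the posterior is far below 0.5, so the predicted label stays "ham".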


## T4. Smoothing method

You received two more text messages from unknown senders.

Complete the function ***spamFilter()*** that classifies a given message. 

You may want to apply a smoothing method for this task.


### Questions:

*   Is textA spam?: [*Yes*]
*   Is textB spam?: [*No*]


In [None]:
textA = "reward! download your free ticket from our website www.pnu.edu"
textB = "call me and get your money back"

# TODOs
def spamFilter(text: str):
  k = 1.0 # smoothing factor: keeps every probability strictly between 0 and 1
  log_spam = log_ham = 0.0
  text_tokens = tokenize(text)  # tokenize once instead of once per vocabulary word

  # Iterate over the whole vocabulary: words absent from the text also contribute
  for token in tokens:
    n_word_spam = token_spam_counts[token] # number of spam messages containing the token
    n_word_ham = token_ham_counts[token]   # number of ham messages containing the token

    p_word_given_spam = (n_word_spam + k) / (spams + 2*k)  # P(word|spam)
    p_word_given_ham = (n_word_ham + k) / (hams + 2*k)     # P(word|ham)

    if token in text_tokens:
      log_spam += math.log(p_word_given_spam)
      log_ham += math.log(p_word_given_ham)
    else:
      log_spam += math.log(1.0 - p_word_given_spam)
      log_ham += math.log(1.0 - p_word_given_ham)

  # The class priors P(spam) and P(ham) are not applied here,
  # so prob is the posterior under equal priors
  p_if_spam = math.exp(log_spam)
  p_if_ham = math.exp(log_ham)
  prob = p_if_spam / (p_if_spam + p_if_ham)
  label = "spam" if prob > 0.5 else "ham"
  return label

print(spamFilter(textA))
print(spamFilter(textB))


spam
ham
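
The reason smoothing matters can be seen on a single probability. The sketch below isolates the formula used inside ***spamFilter()***; the helper name `smoothed_prob` is an assumption for illustration. A word that never appears in one class gets probability 0 without smoothing, so its logarithm is undefined, whereas any k > 0 gives it a small positive probability.

```python
import math

def smoothed_prob(count: float, total: float, k: float = 1.0) -> float:
    """P(word | class) with additive smoothing; k acts as a pseudocount (hypothetical helper)."""
    return (count + k) / (total + 2 * k)

hams = 4825  # number of ham messages, from T1

# A word that appears in no ham message:
print(smoothed_prob(0, hams, k=0.0))  # -> 0.0, so math.log would raise ValueError
print(smoothed_prob(0, hams, k=1.0))  # 1/4827, small but positive

# With k = 1 the logarithm is finite, so the sum in spamFilter() stays defined.
print(math.log(smoothed_prob(0, hams, k=1.0)))
```

The same argument applies at the other extreme: a word present in every message of a class would give 1 - P(word | class) = 0 without smoothing, and k > 0 prevents that as well.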
