# CS471: Introduction to Artificial Intelligence  
## Assignment 3: Naive Bayes 

In this assignment, you will implement the Naive Bayes classification method.
You will be working with a Spam Collection dataset,
consisting of text messages that have been collected for spam research. 

The CSV file contains one message per line, 30 messages in total, 
each tagged as either ham (legitimate) or spam. Each line is composed of two columns: 
column 1 contains the label (ham or spam) and 
column 2 contains the raw text.

Use the first 20 samples as your training set 
and the remaining 10 samples as your test set. 

Tasks: 
Load the dataset and split into training and testing sets 
(first 20 into training and the rest into testing)  (1 point)


Compute the prior probabilities: P(spam) and P(ham)  (2 points)
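
As a minimal sketch of this step, assuming the training labels are already available as a list, the priors can be estimated as the fraction of training messages in each class. The function name `class_priors`, the variable `train_labels`, and the toy values are hypothetical and are not taken from the dataset.

```python
from collections import Counter

def class_priors(labels):
    """Return {label: fraction of messages carrying that label}."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}

# Hypothetical toy labels, for illustration only (not the assignment data)
train_labels = ['ham', 'spam', 'ham', 'ham']
priors = class_priors(train_labels)
print(priors['spam'], priors['ham'])
```

With this definition the priors always sum to 1 and depend only on how many messages each class has, not on how long the messages are.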


Compute the conditional probabilities P(sentence/spam) and P(sentence/ham) (2 points)
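
One way to sketch this step, assuming the naive independence assumption (the words of a sentence are treated as independent given the class) and add-one (Laplace) smoothing, is to sum smoothed per-word log probabilities. The helper names `train_word_counts` and `log_likelihood` and the toy messages are hypothetical.

```python
import math
from collections import Counter

def train_word_counts(messages):
    """Count word occurrences over a list of messages from one class."""
    counts = Counter()
    for message in messages:
        counts.update(message.split())
    return counts

def log_likelihood(sentence, counts, vocab_size):
    """log P(sentence | class) under the naive independence assumption,
    with add-one (Laplace) smoothing applied to every word."""
    total = sum(counts.values())
    return sum(
        math.log((counts[word] + 1) / (total + vocab_size))
        for word in sentence.split()
    )

# Hypothetical toy data, for illustration only (not the assignment data)
spam_messages = ['win a prize', 'claim a prize now']
ham_messages = ['see you now']
spam_counts = train_word_counts(spam_messages)
ham_counts = train_word_counts(ham_messages)
vocab_size = len(set(spam_counts) | set(ham_counts))

spam_ll = log_likelihood('claim prize', spam_counts, vocab_size)
print(spam_ll)
```

Working with logarithms keeps the value in a usable range even for long sentences, where the raw product of many small probabilities would underflow.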


Compute the posterior probabilities 
(the probability that a sentence is spam or ham) (2 points)
Posterior ∝ prior * conditional
P(spam/sentence) ∝ P(spam) * P(sentence/spam) 
P(ham/sentence) ∝ P(ham) * P(sentence/ham) 
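
The proportionality above can be written out per word. As a sketch, assuming the naive independence assumption and add-one (Laplace) smoothing, and introducing symbols that do not appear in the assignment text ($c$ for a class, $w_1, \dots, w_n$ for the words of the sentence $s$, $N_c$ for the total number of training words in class $c$, and $|V|$ for the vocabulary size):

$$
\log P(c \mid s) = \log P(c) + \sum_{i=1}^{n} \log P(w_i \mid c) + \text{const},
\qquad
P(w_i \mid c) = \frac{\operatorname{count}(w_i, c) + 1}{N_c + |V|}
$$

The constant is the same for both classes, so the predicted class is simply the one with the larger value of $\log P(c) + \sum_i \log P(w_i \mid c)$.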


For each sentence in the test set: (2 points)
Display the sentence
Print the posterior probability of a sentence belonging to spam or ham 
Display the class (spam or ham) 


Report the test set accuracy (1 point)
Accuracy = no. of sentences correctly predicted by model / total sentences

In [1]:
import csv
from collections import defaultdict
import math

In [8]:
test_data = []
# Every word count starts at 1 for both classes (Laplace / add-one smoothing)
word_dict = defaultdict(lambda: {'spam': 1, 'ham': 1})
spam_count, ham_count = 0, 0  # Total number of words seen in spam and ham training messages

with open('SpamDetection.csv', 'r', newline='') as csv_file:  # newline='' as recommended for the csv module
    csv_reader = csv.reader(csv_file)
    next(csv_reader)  # Skip header

    for i, line in enumerate(csv_reader):
        if i < 20:
            label = line[0]  # 'spam' or 'ham'
            words = line[1].split()  # Tokenize the message
            if label == 'spam':
                spam_count += len(words)
                for word in words:
                    word_dict[word]['spam'] += 1
            else:
                ham_count += len(words)
                for word in words:
                    word_dict[word]['ham'] += 1
        else:
            test_data.append(line)

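# Calculate prior probabilities of spam and ham.
# NOTE: spam_count and ham_count are word totals, so these priors are the share of
# training words in each class, not the share of training messages.
p_spam = spam_count / (spam_count + ham_count)
p_ham = ham_count / (spam_count + ham_count)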

# Laplace-smoothed denominators: total words seen in the class plus the vocabulary size
unique_words = len(word_dict)
total_spam_words = spam_count + unique_words
total_ham_words = ham_count + unique_words

# Convert the smoothed counts into conditional probabilities P(word/spam) and P(word/ham)
for word in word_dict:
    word_dict[word]['spam'] /= total_spam_words
    word_dict[word]['ham'] /= total_ham_words

# Testing Phase
for test in test_data:
    message = test[1].split()  # Break message up into words
    # Start each score from the log prior probability of its class
    ham_score = math.log(p_ham)
    spam_score = math.log(p_spam)

    # Sum log probabilities instead of multiplying raw probabilities,
    # so the scores do not underflow to minuscule numbers
    for word in message:
        if word in word_dict:
            ham_score += math.log(word_dict[word]['ham'])
            spam_score += math.log(word_dict[word]['spam'])
        else:
            # Word never seen in training: use the smoothed probability of a zero-count word
            ham_score += math.log(1 / total_ham_words)
            spam_score += math.log(1 / total_spam_words)

    # Results: the scores are unnormalized log posteriors, so only their relative order matters
    predicted = 'ham' if ham_score > spam_score else 'spam'
    print("Message: " + test[1])
    print(f'Predicted: {predicted} (Spam: {spam_score}, Ham: {ham_score})')
    print(f'Actual: {test[0]}')
    print('-' * 40)

Message: Tell where you reached
Predicted: ham (Spam: -22.111809666114894, Ham: -21.503157159020933)
Actual: ham
----------------------------------------
Message: Your gonna have to pick up a burger for yourself on your way home
Predicted: ham (Spam: -73.55008091472322, Ham: -72.83503492461345)
Actual: ham
----------------------------------------
Message: As a valued customer I am pleased to advise you that for your recent review you are awarded a Bonus Prize
Predicted: ham (Spam: -107.15726288656074, Ham: -105.89782400308766)
Actual: spam
----------------------------------------
Message: Urgent you are awarded a complimentary trip to EuroDisinc To claim text immediately
Predicted: spam (Spam: -67.58760390448849, Ham: -68.95993527328991)
Actual: spam
----------------------------------------
Message: Finished class where are you
Predicted: ham (Spam: -27.247608103165156, Ham: -26.35908606335621)
Actual: ham
----------------------------------------
Message: where are you how did you perf

### Accuracy
We got 90% accuracy on the test set (9 of 10 messages classified correctly). The only misclassified message was the spam message "As a valued customer I am pleased to advise you that for your recent review you are awarded a Bonus Prize", which was predicted as ham and so made it through the filter.
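
The accuracy figure can also be computed in code rather than counted by hand. The following is a minimal sketch, assuming the predicted and actual labels have been collected into two lists of equal length; the function name `accuracy` and the toy labels are hypothetical and do not reproduce the results above.

```python
def accuracy(predicted, actual):
    """Fraction of positions where the predicted label equals the actual label."""
    if len(predicted) != len(actual):
        raise ValueError('predicted and actual must have the same length')
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)

# Hypothetical labels, for illustration only (not the notebook's test results)
predicted = ['ham', 'ham', 'ham', 'spam']
actual = ['ham', 'ham', 'spam', 'spam']
print(accuracy(predicted, actual))  # 3 of 4 labels match
```

In the notebook, the two lists could be filled inside the testing loop by appending the predicted class and `test[0]` for each message.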
