# 6.6 Assignment 6: Naïve Bayes
Bayes’ Theorem shows us how to turn P(E|H) to P(H|E), with E=Evidence and H=Hypothesis. But what does that really mean? Imagine you have to explain this to someone who doesn't understand machine learning or probability at all.

## Problem 1
<b>Question</b> Explain how to turn P(E|H) to P(H|E), with E=Evidence and H=Hypothesis in layman's terms.<p>
<b>Answer</b>:<p>
P(E|H) is the probability of seeing the Evidence if the Hypothesis is true. For example, the probability of getting a positive test result if you actually have the disease.<p>
P(H|E) is the probability that the Hypothesis is true given that you've seen the Evidence. For example, the probability that you actually have the disease if you got a positive test result.<p>
So, you turn P(E|H) into P(H|E) by incorporating the initial belief in the hypothesis P(H) and the overall likelihood of seeing the evidence P(E).<p>
P(H) is the prior: your initial belief in the hypothesis before you see any evidence. For example, it is the probability that a randomly selected person from the general population has the disease, before you know anything about that person's test result.<p>
P(E) is the overall likelihood of seeing the evidence (a positive test) across the entire population, counting both people who have the disease and people who don't. Because of false positives and false negatives, this overall positive rate is typically different from the actual rate of disease.<p>
So, to convert P(E|H) into P(H|E) using Bayes' Theorem:<p>
$$P(H|E) = \frac{P(E|H) \times P(H)}{P(E)}$$<p>
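
One way to see where P(E) comes from is the law of total probability, assuming every person either fits the hypothesis (H) or does not (written $\neg H$ here, a notation introduced only for this illustration):

$$P(E) = P(E|H) \times P(H) + P(E|\neg H) \times P(\neg H)$$

Substituting this into the denominator gives an equivalent form of the theorem:

$$P(H|E) = \frac{P(E|H) \times P(H)}{P(E|H) \times P(H) + P(E|\neg H) \times P(\neg H)}$$

Here $P(E|\neg H)$ is the false positive rate, which is why positive tests can be more common than the disease itself.<p>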

A real-life example:<p>
 - A cancer occurs in 1% of the general population, so P(H) = 0.01.
 - The test for it comes back positive 2% of the time, so P(E) = 0.02.
 - If you actually have the cancer, the test is certain to come back positive, so P(E|H) = 1.

So to turn P(E|H) into P(H|E):<p>
$$P(H|E) = \frac{1 \times 0.01}{0.02}$$<p>
$$P(H|E) = 50\%$$
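
The same arithmetic can be checked in a few lines of Python. This is a minimal sketch using the numbers from the cancer-test example; the function name `posterior` and the variable names are illustrative assumptions, not part of the assignment. It also backs out the false positive rate those numbers imply, assuming everyone either has the cancer or does not.

```python
def posterior(p_e_given_h, p_h, p_e):
    """Bayes' Theorem: P(H|E) = P(E|H) * P(H) / P(E)."""
    return p_e_given_h * p_h / p_e


p_h = 0.01         # P(H): 1% of the population has the cancer
p_e = 0.02         # P(E): 2% of all tests come back positive
p_e_given_h = 1.0  # P(E|H): the test always catches the cancer

p_h_given_e = posterior(p_e_given_h, p_h, p_e)
print(f"P(H|E) = {p_h_given_e:.0%}")

# Implied false positive rate, rearranged from
# P(E) = P(E|H) * P(H) + P(E|not H) * P(not H)
p_e_given_not_h = (p_e - p_e_given_h * p_h) / (1 - p_h)
print(f"P(E|not H) = {p_e_given_not_h:.2%}")
```

With these inputs the posterior matches the 50% hand calculation, and the implied false positive rate is roughly 1%: about half of all positive results come from people who do not have the cancer.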

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import mean_squared_error, r2_score, accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

import statsmodels.api as sm
import statsmodels.formula.api as smf

In [3]:
df = pd.read_csv('Youtube01-Psy.csv').dropna()
print(df.columns)
display(df.head(77))

Index(['COMMENT_ID', 'AUTHOR', 'DATE', 'CONTENT', 'CLASS'], dtype='object')


| | COMMENT_ID | AUTHOR | DATE | CONTENT | CLASS |
|---|---|---|---|---|---|
| 0 | LZQPQhLyRh80UYxNuaDWhIGQYNQ96IuCg-AYWqNPjpU | Julius NM | 2013-11-07T06:20:48 | Huh, anyway check out this you[tube] channel: ... | 1 |
| 1 | LZQPQhLyRh_C2cTtd9MvFRJedxydaVW-2sNg5Diuo4A | adam riyati | 2013-11-07T12:37:15 | Hey guys check out my new channel and our firs... | 1 |
| 2 | LZQPQhLyRh9MSZYnf8djyk0gEF9BHDPYrrK-qCczIY8 | Evgeny Murashkin | 2013-11-08T17:34:21 | just for test I have to say murdev.com | 1 |
| 3 | z13jhp0bxqncu512g22wvzkasxmvvzjaz04 | ElNino Melendez | 2013-11-09T08:28:43 | me shaking my sexy ass on my channel enjoy ^_^ | 1 |
| 4 | z13fwbwp1oujthgqj04chlngpvzmtt3r3dw | GsMega | 2013-11-10T16:05:38 | watch?v=vtaRGgvGtWQ Check this out . | 1 |
| ... | ... | ... | ... | ... | ... |
| 72 | z124fn5ahqnfdbxtg23ihlijyqjqtr1lk | Oh 1080s | 2014-11-02T01:08:10 | Sub my channel! | 1 |
| 73 | z12lubwrvv35zpzub23ywxbbiuawjbalc | Ariel Baptista | 2014-11-02T05:06:46 | http://www.ebay.com/itm/131338190916?ssPageNam... | 1 |
| 74 | z13osfxhtkfmwpxue234z3wimzmcs1k2x | Stefano Albanese | 2014-11-02T12:04:36 | http://www.guardalo.org/best-of-funny-cats-gat... | 1 |
| 75 | z12tcdxa5k3bsvtqh04ccnaqusj1vvfju3s | Salim Tayara | 2014-11-02T14:33:30 | if your like drones, plz subscribe to Kamal Ta... | 1 |


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 350 entries, 0 to 349
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   COMMENT_ID  350 non-null    object
 1   AUTHOR      350 non-null    object
 2   DATE        350 non-null    object
 3   CONTENT     350 non-null    object
 4   CLASS       350 non-null    int64 
dtypes: int64(1), object(4)
memory usage: 13.8+ KB


## Problem 2
Build a spam filter with the Naïve Bayes approach. 

In [7]:
# pandas, train_test_split, CountVectorizer and MultinomialNB were imported in the first cell
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

X = df['CONTENT']  # Features (the text)
y = df['CLASS']    # Labels: 0 not spam, 1 spam

# Hold out 20% of the comments for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Bag-of-words vectorizer that removes common English stop words
vectorizer = CountVectorizer(stop_words='english')

# Fit the vectorizer on the training data and transform it
X_train_counts = vectorizer.fit_transform(X_train)

# Transform the test data using the fitted vectorizer
X_test_counts = vectorizer.transform(X_test)

# Train the model using the vectorized training data and labels
model = MultinomialNB()
model.fit(X_train_counts, y_train)

# Evaluate the model on the held-out test data
y_pred = model.predict(X_test_counts)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=['not-spam', 'spam']))
print("\nConfusion Matrix:")
print(confusion_matrix(y_test, y_pred))

print("\n--- Example Prediction ---")
sample_message_not_spam = ["i think about billions of the views come from people who only wanted to check the view count"]
sample_message_spam = ["WINNER! You've won a free Bitcoin! Claim your prize nnow!"]

# Vectorize the sample messages using the same vectorizer
sample_message_not_spam_counts = vectorizer.transform(sample_message_not_spam)
sample_message_spam_counts = vectorizer.transform(sample_message_spam)

# Predict
pred_not_spam = model.predict(sample_message_not_spam_counts)
pred_spam = model.predict(sample_message_spam_counts)

print(f"Non Spam Message: '{sample_message_not_spam[0]}'")
print(f"Predicted class: {'Spam' if pred_not_spam[0] == 1 else 'Not Spam'}")

print(f"\nSpam Message: '{sample_message_spam[0]}'")
print(f"Predicted class: {'Spam' if pred_spam[0] == 1 else 'Not Spam'}")

Accuracy: 0.96

Classification Report:
              precision    recall  f1-score   support

    not-spam       0.93      0.96      0.95        27
        spam       0.98      0.95      0.96        43

    accuracy                           0.96        70
   macro avg       0.95      0.96      0.96        70
weighted avg       0.96      0.96      0.96        70


Confusion Matrix:
[[26  1]
 [ 2 41]]

--- Example Prediction ---
Non Spam Message: 'i think about billions of the views come from people who only wanted to check the view count'
Predicted class: Not Spam

Spam Message: 'WINNER! You've won a free Bitcoin! Claim your prize nnow!'
Predicted class: Spam


The model was built using the YouTube spam dataset for a video by Psy (a South Korean pop music star). CLASS 0 is not spam and CLASS 1 is spam. The comments in the `CONTENT` column were split 80/20 into training and test sets and then passed through a `CountVectorizer`, which turns text into word counts. With `stop_words='english'` it also drops common words like 'a' and 'the', so it keeps the meaningful words and discards words that don't really contribute to the filter either way.<p>
Next, the raw training text messages `X_train` are converted into a format that the `MultinomialNB` model can understand. The vectorizer works out all the words it needs to pay attention to (the vocabulary from the training data) and then creates a matrix where each row is a message and each column is a word, filled with how many times each word appears in each message.<p>
Then the "word-to-number" mapping learned from the training data is applied to the unseen test messages. The vectorizer counts only the words that were present in the training vocabulary within each test message, creating a numerical representation `X_test_counts` that has exactly the same structure (columns representing the same words) as the training data's numerical representation `X_train_counts`. Words that appear only in the test messages are ignored.<p>
A `MultinomialNB` model is then fit with the training data.<p>
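
To connect the model back to Bayes' Theorem from Problem 1, here is a minimal from-scratch sketch of the multinomial Naïve Bayes idea. It is not scikit-learn's implementation: the toy comments and the names `log_score` and `predict` are hypothetical. Each class plays the role of a hypothesis H and the words play the role of the evidence E. The score is log P(H) plus the sum of log P(word|H); P(E) is left out because it is the same for both classes and does not change which one wins. Add-one smoothing (`alpha=1.0`, which is also the default `alpha` in `MultinomialNB`) keeps unseen word/class pairs from producing a zero probability.

```python
import math
from collections import Counter

# Hypothetical toy training data: (text, label), 1 = spam, 0 = not spam
train = [
    ("check out my channel", 1),
    ("subscribe to my channel", 1),
    ("love this song", 0),
    ("this song is great", 0),
]

word_counts = {0: Counter(), 1: Counter()}  # word frequencies per class
class_counts = Counter()                    # number of messages per class
for text, label in train:
    class_counts[label] += 1
    word_counts[label].update(text.split())

vocab = set(word_counts[0]) | set(word_counts[1])


def log_score(text, label, alpha=1.0):
    """Return log P(label) + sum of log P(word | label) with add-alpha smoothing."""
    total_words = sum(word_counts[label].values())
    score = math.log(class_counts[label] / len(train))  # log prior
    for word in text.split():
        if word not in vocab:
            continue  # words never seen in training are ignored
        numerator = word_counts[label][word] + alpha
        denominator = total_words + alpha * len(vocab)
        score += math.log(numerator / denominator)
    return score


def predict(text):
    """Pick the class with the higher log score."""
    return max((0, 1), key=lambda label: log_score(text, label))


print(predict("check out this channel"))
print(predict("great song"))
```

Working in log space avoids multiplying many small probabilities together, which would otherwise underflow on longer messages.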
Finally, the model predicts the class of each test message, and those predictions are compared with the true labels to compute the accuracy and the confusion matrix.<p>
As a last sanity check, the model is given two hand-written sample messages, one for each class, to confirm that it labels them as expected.<p>
The model reaches 96% accuracy on the 70 held-out test comments. It miscategorized a single non-spam message as spam and two spam messages as not spam.
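
The figures in the classification report can be reproduced by hand from the confusion matrix. The sketch below assumes scikit-learn's layout for `confusion_matrix`, where rows are the true classes and columns are the predicted classes, in the order [not-spam, spam]; the variable names are illustrative.

```python
# Confusion matrix copied from the notebook output:
# rows = true class, columns = predicted class, order [not-spam, spam]
cm = [[26, 1],
      [2, 41]]

tn, fp = cm[0]  # true not-spam: 26 correct, 1 flagged as spam
fn, tp = cm[1]  # true spam: 2 missed, 41 caught
total = tn + fp + fn + tp

accuracy = (tp + tn) / total
spam_precision = tp / (tp + fp)  # of messages flagged as spam, how many were spam
spam_recall = tp / (tp + fn)     # of actual spam messages, how many were flagged
spam_f1 = 2 * spam_precision * spam_recall / (spam_precision + spam_recall)

print(f"accuracy={accuracy:.2f} precision={spam_precision:.2f} "
      f"recall={spam_recall:.2f} f1={spam_f1:.2f}")
```

Rounded to two decimals, these agree with the `spam` row of the report (0.98 precision, 0.95 recall, 0.96 f1-score) and the 0.96 overall accuracy.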