# Bayesian Methods

## Bayes' Theorem
$$
    P(A|B) = \frac{P(A)P(B|A)}{P(B)}
$$
Where:
- P(A|B) is the posterior probability of A given B
- P(B|A) is the likelihood of B given A
- P(A) is the prior probability of A
- P(B) is the marginal probability of B
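
In practice the marginal P(B) is rarely known directly. A standard way to obtain it, added here as a supplement to the notes, is the law of total probability over A and its complement:

$$
    P(B) = P(A)P(B|A) + P(\neg A)P(B|\neg A)
$$

Substituting this into the theorem gives a form that only needs the prior and the two likelihoods:

$$
    P(A|B) = \frac{P(A)P(B|A)}{P(A)P(B|A) + P(\neg A)P(B|\neg A)}
$$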

The whole idea is to apply this to words such as "free" (nothing is free) to filter spam out of emails:

$$
    P(Spam|Free) = \frac{P(Spam)P(Free|Spam)}{P(Free)}
$$

The numerator is the probability that a message is spam and contains the word "free".
The denominator is the overall probability that an email contains the word "free".
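
As a rough numeric illustration, the sketch below plugs made-up message counts into the formula above. The helper name `posterior_spam_given_word` and all of the counts are assumptions chosen for the example, not values taken from the dataset used later.

```python
def posterior_spam_given_word(n_spam, n_ham, n_spam_with_word, n_ham_with_word):
    """Return P(Spam | word) from raw message counts (hypothetical helper)."""
    n_total = n_spam + n_ham
    p_spam = n_spam / n_total                      # prior P(Spam)
    p_word_given_spam = n_spam_with_word / n_spam  # likelihood P(Free | Spam)
    p_word = (n_spam_with_word + n_ham_with_word) / n_total  # marginal P(Free)
    return p_spam * p_word_given_spam / p_word


# Made-up corpus: 400 spam messages (120 contain "free"),
# 600 ham messages (30 contain "free").
print(posterior_spam_given_word(400, 600, 120, 30))
```

With these counts the posterior comes out at about 0.8, which matches the direct reading: 120 of the 150 messages containing the word are spam.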


This can be applied to every (meaningful) word encountered during training. When analyzing a new email, the per-word probabilities are multiplied together to estimate the probability that it is spam. The method is called Naive Bayes because it assumes the words are independent of one another given the class, which is rarely true of real text.
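
The multiplication described above can be sketched in a few lines of plain Python. This is a minimal illustration, not what scikit-learn does internally: the toy training messages, the function names `train` and `spam_probability`, and the add-one (Laplace) smoothing are all assumptions made for the example. Log-probabilities are summed instead of multiplying raw probabilities, which avoids numeric underflow on long messages.

```python
import math
from collections import Counter

# Toy training data (made up for illustration)
TRAIN = [
    ("free prize click now", "spam"),
    ("free money now", "spam"),
    ("meeting at noon", "ham"),
    ("lunch meeting tomorrow", "ham"),
]


def train(messages):
    """Count messages per class and word occurrences per class."""
    class_counts = Counter()
    word_counts = {"spam": Counter(), "ham": Counter()}
    for text, label in messages:
        class_counts[label] += 1
        word_counts[label].update(text.lower().split())
    vocab = set(word_counts["spam"]) | set(word_counts["ham"])
    return class_counts, word_counts, vocab


def spam_probability(text, class_counts, word_counts, vocab):
    """Return P(spam | text) under the naive independence assumption."""
    total_messages = sum(class_counts.values())
    log_scores = {}
    for label in ("spam", "ham"):
        # Start from the log prior, then add one log likelihood per known word
        score = math.log(class_counts[label] / total_messages)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word in vocab:
                # Add-one smoothing keeps unseen (word, class) pairs above zero
                p = (word_counts[label][word] + 1) / (total_words + len(vocab))
                score += math.log(p)
        log_scores[label] = score
    # Normalize the two class scores into a probability
    peak = max(log_scores.values())
    exp_scores = {k: math.exp(v - peak) for k, v in log_scores.items()}
    return exp_scores["spam"] / sum(exp_scores.values())


model = train(TRAIN)
print(spam_probability("free money", *model))        # above 0.5
print(spam_probability("meeting tomorrow", *model))  # below 0.5
```

Words that never appeared in training are skipped, so a message made up entirely of unknown words falls back to the class prior.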

# Code

In [14]:
# Import the libraries
import os
import io
import numpy
import pandas as pd
from pandas import DataFrame
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Function to read the email files
def readFiles(path):
    # Walk the directories and files under the given path
    for root, dirnames, filenames in os.walk(path):
        for filename in filenames:
            # Use a separate name so the `path` argument is not overwritten
            filepath = os.path.join(root, filename)

            inBody = False
            lines = []
            # Open the file; the `with` block closes it even if reading fails
            with io.open(filepath, 'r', encoding='latin1') as f:
                for line in f:
                    # The message body starts after the first blank line
                    if inBody:
                        lines.append(line)
                    elif line == '\n':
                        inBody = True
            # Join the body lines into a single string
            message = '\n'.join(lines)
            # Yield the file path and the message body
            yield filepath, message

# Function to build a DataFrame from a directory
def dataFrameFromDirectory(path, classification):
    rows = []
    index = []
    # Read each file and store its message and classification
    for filename, message in readFiles(path):
        rows.append({'message': message, 'class': classification})
        index.append(filename)

    # Return a DataFrame with the messages and their classes (spam or ham)
    return DataFrame(rows, index=index)

# Build a DataFrame holding the messages and their classes
data = DataFrame({'message': [], 'class': []})
data = pd.concat([data, dataFrameFromDirectory("emails/spam", "spam")])
data = pd.concat([data, dataFrameFromDirectory("emails/ham", "ham")])


In [15]:
data.head()

| | message | class |
|---|---|---|
| emails/spam/00001.7848dde101aa985090474a91ec93fcf0 | `<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Tr...` | spam |
| emails/spam/00002.d94f1b97e48ed3b553b3508d116e6a09 | `1) Fight The Risk of Cancer!\n\nhttp://www.adc...` | spam |
| emails/spam/00003.2ee33bc6eacdb11f38d052c44819ba6c | `1) Fight The Risk of Cancer!\n\nhttp://www.adc...` | spam |
| emails/spam/00004.eac8de8d759b7e74154f142194282724 | `##############################################...` | spam |
| emails/spam/00005.57696a39d7d84318ce497886896bf90d | `I thought you might like these:\n\n1) Slim Dow...` | spam |


In [16]:
# Convert the messages into a word-count matrix
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(data['message'].values)

# Initialize the classifier
classifier = MultinomialNB()
# Extract the target classes (spam or ham)
targets = data['class'].values
# Train the classifier
classifier.fit(counts, targets)
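
To make the word-count matrix concrete, here is a rough pure-Python sketch of the kind of table `CountVectorizer` produces: one row per message and one column per vocabulary word. It mirrors the vectorizer's default behavior of lowercasing the text and keeping tokens of two or more word characters, but it is a simplified stand-in that returns plain lists rather than a sparse matrix, and `count_matrix` is a hypothetical helper name.

```python
import re
from collections import Counter

TOKEN = re.compile(r"\b\w\w+\b")  # tokens of two or more word characters


def count_matrix(messages):
    """Return (vocabulary, rows) where rows[i][j] counts word j in message i."""
    tokenized = [TOKEN.findall(m.lower()) for m in messages]
    # Columns are the distinct tokens in alphabetical order
    vocabulary = sorted({tok for toks in tokenized for tok in toks})
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        rows.append([counts[word] for word in vocabulary])
    return vocabulary, rows


vocabulary, rows = count_matrix(["Free money, free prizes!", "Meeting at noon"])
print(vocabulary)
for row in rows:
    print(row)
```

In the first row the column for "free" holds 2, because the word appears twice in that message. Word order is discarded entirely, which is the information the classifier is trained on.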

In [17]:
# Create example emails
examples = ['Free Viagra now!!!', "Hi Bob, how about a game of golf tomorrow?"]
# Transform the emails into a word-count matrix
example_counts = vectorizer.transform(examples)
# Run the prediction and display it
predictions = classifier.predict(example_counts)
predictions

array(['spam', 'ham'], dtype='<U4')

# Activity

In [22]:
# The classifier in this activity is not very accurate; some spam messages get through
emails = [
    "URGENT: You've won a free iPhone! Click here!",           # spam
    "Meeting at 3pm in the conference room",                   # ham
    "Hot singles in your area want to meet you!",              # spam
    "Your Amazon order has shipped",                           # ham
    "Enlarge your manhood today!",                             # spam
    "Reminder: Dentist appointment on Friday",                 # ham
    "You're our 1,000,000th visitor! Claim your prize now!",   # spam
    "Can you pick up some milk on your way home?",             # ham
    "Congratulations! You've been selected for a free cruise!",# spam
    "Please find attached the report for Q3",                  # ham
    "Lose 20 pounds in 2 weeks with this miracle pill!",       # spam
    "Don't forget to bring your laptop to the team meeting",   # ham
    "Nigerian prince needs your help to transfer millions!"    # spam
]

example_counts = vectorizer.transform(emails)
# Run the prediction and print it
predictions = classifier.predict(example_counts)
print(predictions)

['ham' 'ham' 'ham' 'ham' 'spam' 'ham' 'ham' 'ham' 'ham' 'spam' 'spam'
 'ham' 'spam']
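
To put a number on that impression, the sketch below compares the printed predictions with the labels given in the comments of the `emails` list. Both lists are copied by hand from the cell above; `expected` and `predicted` are names introduced for this example.

```python
# Labels from the comments in the `emails` list, in order
expected = ["spam", "ham", "spam", "ham", "spam", "ham", "spam",
            "ham", "spam", "ham", "spam", "ham", "spam"]
# Predictions printed by the classifier above
predicted = ["ham", "ham", "ham", "ham", "spam", "ham", "ham",
             "ham", "ham", "spam", "spam", "ham", "spam"]

pairs = list(zip(expected, predicted))
correct = sum(e == p for e, p in pairs)
missed_spam = sum(e == "spam" and p == "ham" for e, p in pairs)
false_alarms = sum(e == "ham" and p == "spam" for e, p in pairs)

print(f"accuracy: {correct}/{len(pairs)}")
print(f"spam classified as ham: {missed_spam}")
print(f"ham classified as spam: {false_alarms}")
```

On these 13 messages the classifier gets 8 right, lets 4 spam messages through, and flags one legitimate message (the Q3 report) as spam.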
