
# Task
Implement a Naive Bayes classifier to classify a new email as 'spam' or 'not spam' based on a synthetic training dataset, then present the predicted classification and the reasoning.

## Create Training Dataset

### Subtask:
Generate a sample training dataset representing emails with features ('free', 'win', 'offer') and the target variable ('spam'). This will serve as the basis for training the Naive Bayes classifier.


**Reasoning**:
The first step is to import pandas and build a dictionary holding the sample training data: three binary keyword features ('free', 'win', 'offer') and the target variable ('spam'). The dictionary is then converted into a pandas DataFrame named `df_train`.



In [None]:
import pandas as pd

# Create a dictionary to represent the synthetic training data
data = {
    'free': [1, 0, 1, 0, 1, 0, 1, 0, 1, 0],
    'win': [0, 1, 1, 0, 0, 1, 0, 1, 1, 0],
    'offer': [0, 0, 1, 1, 0, 1, 0, 0, 1, 1],
    'spam': [0, 0, 1, 0, 0, 1, 0, 1, 1, 0] # 1 for spam, 0 for not spam
}

# Convert the dictionary into a pandas DataFrame
df_train = pd.DataFrame(data)

# Display the first few rows of the DataFrame
print("Training DataFrame created:")
print(df_train.head())
print("\nDataFrame Info:")
df_train.info()
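
Because the likelihoods computed later are built from per-class counts, it can help to inspect those counts directly. The sketch below rebuilds the same data so that it runs on its own, then counts the emails in each class and sums each keyword within each class. The names `class_counts` and `feature_counts` are illustrative and not part of the original notebook.

```python
import pandas as pd

# Same synthetic data as in the cell above (repeated so this snippet is self-contained)
data = {
    'free': [1, 0, 1, 0, 1, 0, 1, 0, 1, 0],
    'win': [0, 1, 1, 0, 0, 1, 0, 1, 1, 0],
    'offer': [0, 0, 1, 1, 0, 1, 0, 0, 1, 1],
    'spam': [0, 0, 1, 0, 0, 1, 0, 1, 1, 0],
}
df_train = pd.DataFrame(data)

# Number of emails in each class (0 = not spam, 1 = spam)
class_counts = df_train['spam'].value_counts().sort_index()

# Number of emails in each class that contain each keyword
feature_counts = df_train.groupby('spam')[['free', 'win', 'offer']].sum()

print(class_counts)
print(feature_counts)
```

On this data, 'win' appears in all 4 spam emails but in only 1 of the 6 non-spam emails, which is why it ends up as the most discriminative keyword.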

## Calculate Probabilities

### Subtask:
Implement the Naive Bayes algorithm to calculate the prior probabilities (P(spam), P(not spam)) and likelihood probabilities (e.g., P(free|spam), P(win|not spam)) from the training data.


**Reasoning**:
The next step estimates the Naive Bayes parameters from the training data: the class priors from the class frequencies, and the per-feature likelihoods from keyword counts within each class, with Laplace smoothing so that no estimated probability is exactly 0 or 1.



In [None]:
# 1. Separate the training data into spam and not spam subsets
df_spam = df_train[df_train['spam'] == 1]
df_not_spam = df_train[df_train['spam'] == 0]

# 2. Calculate the prior probability of 'spam'
p_spam = len(df_spam) / len(df_train)

# 3. Calculate the prior probability of 'not spam'
p_not_spam = len(df_not_spam) / len(df_train)

print(f"P(spam): {p_spam:.2f}")
print(f"P(not spam): {p_not_spam:.2f}")

# Features for calculating likelihoods
features = ['free', 'win', 'offer']

# Initialize dictionaries to store likelihoods
likelihoods_spam = {}
likelihoods_not_spam = {}

# Calculate likelihoods for each feature with Laplace smoothing:
# alpha = 1 is added to each count and 2 * alpha = 2 to the denominator,
# because each feature is binary (present or absent).
for feature in features:
    # 4. P(feature|spam)
    likelihoods_spam[feature] = (df_spam[feature].sum() + 1) / (len(df_spam) + 2)
    # 5. P(feature|not spam)
    likelihoods_not_spam[feature] = (df_not_spam[feature].sum() + 1) / (len(df_not_spam) + 2)

# 6. Store all calculated probabilities
probabilities = {
    'prior_spam': p_spam,
    'prior_not_spam': p_not_spam,
    'likelihoods_spam': likelihoods_spam,
    'likelihoods_not_spam': likelihoods_not_spam
}

print("\nLikelihoods given spam:")
for feature, prob in likelihoods_spam.items():
    print(f"P({feature}|spam): {prob:.2f}")

print("\nLikelihoods given not spam:")
for feature, prob in likelihoods_not_spam.items():
    print(f"P({feature}|not spam): {prob:.2f}")
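
The smoothed estimate used in the loop above can be written out explicitly. The symbols are introduced here for illustration: $n_c$ is the number of training emails in class $c$, $n_{j,c}$ is the number of those emails that contain keyword $j$, and $\alpha = 1$ is the smoothing strength. The factor 2 in the denominator reflects the two possible values of each binary feature.

$$
\hat{P}(x_j = 1 \mid c) = \frac{n_{j,c} + \alpha}{n_c + 2\alpha},
\qquad
\text{e.g.}\quad \hat{P}(\text{win} \mid \text{spam}) = \frac{4 + 1}{4 + 2} = \frac{5}{6} \approx 0.83
$$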

## Predict New Email Classification

### Subtask:
Use the calculated probabilities to classify the example new email with features: 'free'=yes, 'win'=yes, 'offer'=no. This step will involve computing the posterior probabilities for both 'spam' and 'not spam' classes.


**Reasoning**:
I will define the new email's features and then calculate the unnormalized posterior probabilities for 'spam' and 'not spam' using the previously computed prior and likelihood probabilities.



In [None]:
# 1. Define the features of the new email
new_email_features = {'free': 1, 'win': 1, 'offer': 0}

# Retrieve probabilities from the 'probabilities' dictionary
p_spam = probabilities['prior_spam']
p_not_spam = probabilities['prior_not_spam']
likelihoods_spam = probabilities['likelihoods_spam']
likelihoods_not_spam = probabilities['likelihoods_not_spam']

# 2. Calculate the unnormalized posterior probability for 'spam'
# P(spam | new_email) is proportional to P(spam) * P(free|spam) * P(win|spam) * P(not offer|spam)
posterior_spam_unnorm = p_spam
for feature, value in new_email_features.items():
    if value == 1: # Feature is present
        posterior_spam_unnorm *= likelihoods_spam[feature]
    else: # Feature is absent
        posterior_spam_unnorm *= (1 - likelihoods_spam[feature])

# 3. Calculate the unnormalized posterior probability for 'not spam'
# P(not spam | new_email) is proportional to P(not spam) * P(free|not spam) * P(win|not spam) * P(not offer|not spam)
posterior_not_spam_unnorm = p_not_spam
for feature, value in new_email_features.items():
    if value == 1: # Feature is present
        posterior_not_spam_unnorm *= likelihoods_not_spam[feature]
    else: # Feature is absent
        posterior_not_spam_unnorm *= (1 - likelihoods_not_spam[feature])

# 4. Print the unnormalized posterior probabilities
print(f"Unnormalized Posterior P(spam | new email): {posterior_spam_unnorm:.4f}")
print(f"Unnormalized Posterior P(not spam | new email): {posterior_not_spam_unnorm:.4f}")
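
The two scores above are proportional to the posteriors but do not sum to 1. The sketch below shows one way to normalize them, computing the scores in log space first (a common guard against numerical underflow when there are many features). The likelihood values are hard-coded from the smoothed estimates computed earlier, and the helper name `log_score` is hypothetical.

```python
import math

# Priors and Laplace-smoothed likelihoods from the training data above
priors = {'spam': 4 / 10, 'not_spam': 6 / 10}
likelihoods = {
    'spam': {'free': 3 / 6, 'win': 5 / 6, 'offer': 4 / 6},
    'not_spam': {'free': 4 / 8, 'win': 2 / 8, 'offer': 3 / 8},
}
new_email_features = {'free': 1, 'win': 1, 'offer': 0}

def log_score(label):
    """Log of the unnormalized posterior: log prior plus summed log likelihoods."""
    total = math.log(priors[label])
    for feature, value in new_email_features.items():
        p = likelihoods[label][feature]
        # Use P(feature|class) if the keyword is present, else 1 - P(feature|class)
        total += math.log(p if value == 1 else 1 - p)
    return total

# Back to the probability scale, then divide by the sum so the posteriors add up to 1
scores = {label: math.exp(log_score(label)) for label in priors}
evidence = sum(scores.values())
posteriors = {label: score / evidence for label, score in scores.items()}

for label, prob in posteriors.items():
    print(f"P({label} | new email) = {prob:.4f}")
```

On this data the normalized posteriors come out to roughly 0.54 for spam and 0.46 for not spam, so the decision is unchanged but the margin is visibly narrow.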

## Final Task

### Subtask:
Present the predicted classification (spam or not spam) for the new email, along with the reasoning based on the Naive Bayes probabilities.


## Summary:

### Q&A
The new email, with features 'free'=yes, 'win'=yes, 'offer'=no, is classified as **spam**.

This classification is based on the Naive Bayes probabilities:
*   The unnormalized posterior probability of the email being 'spam' was calculated as approximately 0.0556.
*   The unnormalized posterior probability of the email being 'not spam' was calculated as approximately 0.0469.

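Since the unnormalized score for spam exceeds the one for not spam, P(spam | new\_email) $>$ P(not spam | new\_email), and the email is classified as spam. Normalizing the two scores gives P(spam | new\_email) $\approx$ 0.54, so the margin is narrow.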
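
The two scores can be checked by hand from the priors and the smoothed likelihoods (spam: 3/6, 5/6, 4/6; not spam: 4/8, 2/8, 3/8 for 'free', 'win', 'offer' respectively), using one minus the likelihood for the absent 'offer' keyword:

$$
\begin{aligned}
P(\text{spam}) \prod_j P(x_j \mid \text{spam}) &= 0.4 \times \tfrac{3}{6} \times \tfrac{5}{6} \times \left(1 - \tfrac{4}{6}\right) \approx 0.0556 \\
P(\text{not spam}) \prod_j P(x_j \mid \text{not spam}) &= 0.6 \times \tfrac{4}{8} \times \tfrac{2}{8} \times \left(1 - \tfrac{3}{8}\right) \approx 0.0469
\end{aligned}
$$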

### Data Analysis Key Findings
*   A synthetic training dataset (`df_train`) was created with 10 entries and 3 features ('free', 'win', 'offer') and a 'spam' target variable, all represented as integers.
*   Prior probabilities were calculated from the training data: P(spam) = 0.40 and P(not spam) = 0.60.
*   Likelihood probabilities for features given each class were calculated using Laplace smoothing (alpha=1):
    *   **Given Spam:** P(free|spam) = 0.50, P(win|spam) = 0.83, P(offer|spam) = 0.67.
    *   **Given Not Spam:** P(free|not spam) = 0.50, P(win|not spam) = 0.25, P(offer|not spam) = 0.38.
*   For the new email ('free'=yes, 'win'=yes, 'offer'=no), the unnormalized posterior probabilities were:
    *   Unnormalized P(spam | new email) $\approx$ 0.0556.
    *   Unnormalized P(not spam | new email) $\approx$ 0.0469.
*   Based on these probabilities, the Naive Bayes classifier predicted the new email to be 'spam'.

### Insights or Next Steps
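*   The Naive Bayes classifier labeled the new email as spam, driven mainly by the 'win' feature (P(win|spam) = 0.83 versus P(win|not spam) = 0.25); 'free' is equally likely under both classes and does not affect the decision. Because the dataset is synthetic and there is no held-out test data, this illustrates the mechanics of the algorithm rather than measuring its accuracy.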
*   To improve the model's accuracy and robustness, consider expanding the training dataset with more diverse examples and a larger vocabulary of features, potentially exploring different smoothing techniques or incorporating more advanced feature engineering.
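
As one way to explore the smoothing suggestion above, the sketch below recomputes the normalized spam posterior for a few values of the smoothing strength. The function name `spam_posterior` and the chosen `alpha` values are illustrative assumptions; the counts are taken from the training data.

```python
# Per-class email counts and keyword counts from the synthetic training set
class_counts = {'spam': 4, 'not_spam': 6}
feature_counts = {
    'spam': {'free': 2, 'win': 4, 'offer': 3},
    'not_spam': {'free': 3, 'win': 1, 'offer': 2},
}
new_email_features = {'free': 1, 'win': 1, 'offer': 0}

def spam_posterior(alpha):
    """Normalized P(spam | email) with additive smoothing of strength alpha."""
    total_emails = sum(class_counts.values())
    scores = {}
    for label, n_class in class_counts.items():
        score = n_class / total_emails  # class prior
        for feature, value in new_email_features.items():
            # Smoothed P(feature present | class) for a binary feature
            p = (feature_counts[label][feature] + alpha) / (n_class + 2 * alpha)
            score *= p if value == 1 else 1 - p
        scores[label] = score
    return scores['spam'] / (scores['spam'] + scores['not_spam'])

for alpha in (0.5, 1.0, 2.0):
    print(f"alpha={alpha}: P(spam | new email) = {spam_posterior(alpha):.3f}")
```

For these three values the email is still classified as spam, but heavier smoothing pulls the posterior closer to 0.5, which shows how sensitive a 10-row training set is to this choice.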
