# 🧑‍🏫 Task 1 Part 1: Building a Spam Classifier with Naive Bayes
In this exercise, you'll implement a spam classifier using the **Naive Bayes algorithm**. You'll work with email data to classify messages as spam or non-spam (ham). Follow the steps below and fill in the code where indicated.

**Objective:** Implement all key components of an ML pipeline (except for data handling).

**Allowed Libraries:** `pandas`, `numpy`

**Not Allowed:** Any pre-built ML algorithms or functions like those from `sklearn`.

Follow the instructions step-by-step and answer the questions!

## Step 1: Data Loading and Preprocessing
First, let's load and examine our data.

In [None]:
import numpy as np  # needed later for np.log / np.exp in the prediction step
import pandas as pd
emails = pd.read_csv(r'/Users/hiren/Downloads/emails.csv')

In [None]:
# Display the first few rows
print(emails.head())

# HINT: Use pd.read_csv() to load the data
# HINT: The DataFrame should have 'text' and 'spam' columns

In [None]:
# Analyse the data and remove or modify rows with missing or invalid values
emails = emails.dropna()

## Step 2: Text Preprocessing
We need to process each email to extract unique words.
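
As a minimal sketch of what this preprocessing produces, the toy example below lowercases a string, splits it on whitespace, and removes duplicates. The helper name `unique_lowercase_words` and the sample sentence are hypothetical, and `sorted()` is used only so the printed order is deterministic (a plain `set` has no guaranteed order).

```python
def unique_lowercase_words(text):
    """Lowercase the text, split on whitespace, and drop duplicate words."""
    # sorted() is only for a stable display order; it is not part of the task
    return sorted(set(text.lower().split()))


sample = "Win a prize! Win a FREE prize now"  # hypothetical example text
tokens = unique_lowercase_words(sample)
print(tokens)
```

Note that splitting on whitespace leaves punctuation attached, so `prize!` and `prize` are treated as two different words in this sketch.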

In [None]:
def process_email(text):
    """
    Convert email text to a list of unique, lowercase words
    
    Parameters:
        text (str): The email text
    
    Returns:
        list: List of unique words in the email
    """
    # TODO: Implement the preprocessing function
    # 1. Convert text to lowercase
    # 2. Split into words
    # 3. Remove duplicates

    # Your code here

    # HINT: Use text.lower() for lowercase conversion
    # HINT: Use split() to convert text to words
    # HINT: Use set() to remove duplicates
    text = text.lower()
    words = text.split()
    unique_words = list(set(words))
    return unique_words
    

In [None]:
# Apply preprocessing to all emails
emails['processed_text'] = emails['text'].apply(process_email)


In [None]:
# Test your preprocessing on the first email
print("Original text:", emails['text'].iloc[0])
print("Processed text:", emails['processed_text'].iloc[0])

## Step 3: Calculate Prior Probabilities
Let's calculate the prior probability of an email being spam, i.e. the fraction of training emails labeled as spam.
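
To make the calculation concrete, here is a small sketch on made-up labels, assuming spam is encoded as `1` and ham as `0`. The list `toy_labels` is hypothetical and stands in for the `spam` column.

```python
# Hypothetical labels: 1 = spam, 0 = ham
toy_labels = [1, 0, 0, 1, 0]

toy_num_emails = len(toy_labels)
toy_num_spam = sum(toy_labels)  # summing 0/1 labels counts the spam emails
toy_spam_prior = toy_num_spam / toy_num_emails

print(f"P(spam) = {toy_num_spam}/{toy_num_emails} = {toy_spam_prior:.2f}")
```

With 0/1 labels, the prior is simply the mean of the label column, so the ham prior is one minus this value.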

In [None]:
# TODO: Calculate the following:
# 1. Total number of emails
# 2. Number of spam emails
# 3. Probability of spam

num_emails = len(emails)
num_spam = sum(emails['spam'])
spam_probability = num_spam / num_emails

print(f"Number of emails: {num_emails}")
print(f"Number of spam emails: {num_spam}")
print(f"Probability of spam: {spam_probability:.4f}")

# HINT: Use len(emails) for total count
# HINT: Use sum(emails['spam']) for spam count

## Step 4: Training the Model
Now we'll build our Naive Bayes model by counting, for each word, how many spam emails and how many ham emails contain it.
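
The sketch below illustrates the counting idea on three made-up emails, assuming every word starts with a count of 1 in both classes (add-one smoothing) and that each email contributes a word at most once. The names `count_words` and `toy_emails` are hypothetical and are not part of the exercise template.

```python
def count_words(labeled_emails):
    """Count, per word, how many spam and ham emails contain it (starting at 1)."""
    counts = {}
    for words, is_spam in labeled_emails:
        for word in set(words):  # each email counts a word at most once
            entry = counts.setdefault(word, {'spam': 1, 'ham': 1})
            entry['spam' if is_spam else 'ham'] += 1
    return counts


# Hypothetical toy data: (list of words, label) with 1 = spam, 0 = ham
toy_emails = [
    (["win", "prize"], 1),
    (["win", "money"], 1),
    (["meeting", "tomorrow"], 0),
]
toy_counts = count_words(toy_emails)
toy_num_spam, toy_num_ham = 2, 1

# Smoothed estimates of P(word | class): the +2 in the denominator matches
# the +1 added for each of the two outcomes (word present / word absent)
p_win_given_spam = toy_counts["win"]["spam"] / (toy_num_spam + 2)
p_win_given_ham = toy_counts["win"]["ham"] / (toy_num_ham + 2)
print(toy_counts["win"], p_win_given_spam, p_win_given_ham)
```

Because of the smoothing, a word never seen in ham emails still gets a small non-zero ham probability instead of zero, which keeps the later logarithms finite.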

In [None]:
def train_naive_bayes(emails_data):
    """
    Train a Naive Bayes model on email data
    
    Parameters:
        emails_data (DataFrame): DataFrame with 'processed_text' and 'spam' columns
    
    Returns:
        dict: Dictionary with word frequencies in spam and ham emails
    """
    # TODO: Create a dictionary to store word frequencies
    # For each word, store counts of its occurrence in spam and ham emails
    model = {}

    # Your code here
    # HINT: Initialize counts with 1 (Laplace smoothing)
    # HINT: Structure: model[word] = {'spam': count, 'ham': count}
    for _, row in emails_data.iterrows():
        is_spam = row['spam']  
        words = row['processed_text']  
        for word in words:
            if word not in model:
                model[word] = {'spam': 1, 'ham': 1}
            if is_spam:
                model[word]['spam'] += 1
            else:
                model[word]['ham'] += 1
    return model

In [None]:
model = train_naive_bayes(emails)

In [None]:
# Test your model with some words
# Examples: 'lottery', 'sale', 'meeting'
test_words = ['lottery', 'sale', 'meeting']
for word in test_words:
    if word in model:
        print(f"Word: {word}, Spam Count: {model[word]['spam']}, Ham Count: {model[word]['ham']}")
    else:
        print(f"Word: {word} not found in model")


## Step 5: Implementing the Prediction Function
Finally, let's implement the function to predict whether an email is spam.
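
One numerical detail is worth sketching before the implementation: summed log-probabilities for long emails can be so negative that exponentiating them directly yields `0.0` for both classes, which makes the final ratio undefined. A common remedy, shown here under the assumption that the two class scores are already available as log values, is to subtract the larger score before exponentiating. The function name `spam_posterior` is hypothetical.

```python
import math


def spam_posterior(log_spam, log_ham):
    """Return exp(log_spam) / (exp(log_spam) + exp(log_ham)) without underflow."""
    # After subtracting the maximum, the larger term becomes exp(0) = 1,
    # so the denominator is always at least 1
    largest = max(log_spam, log_ham)
    exp_spam = math.exp(log_spam - largest)
    exp_ham = math.exp(log_ham - largest)
    return exp_spam / (exp_spam + exp_ham)


# Exponentiating a very negative score directly underflows to zero
underflowed = math.exp(-1000.0)

# The shifted version still recovers a meaningful probability
posterior = spam_posterior(-1000.0, -1002.0)
print(underflowed, round(posterior, 4))
```

Subtracting the same constant from both log-scores does not change the ratio, so the result equals the direct formula whenever the direct formula is computable.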

In [None]:
def predict_naive_bayes(email_text, model, num_spam, num_ham):
    """
    Predict whether an email is spam using Naive Bayes
    
    Parameters:
        email_text (str): The text of the email to classify
        model (dict): Trained Naive Bayes model
        num_spam (int): Number of spam emails in training data
        num_ham (int): Number of ham emails in training data
    
    Returns:
        float: Probability that the email is spam
    """
    # TODO: Implement the Naive Bayes prediction
    # 1. Preprocess the email text
    # 2. Calculate probability using the Naive Bayes formula

    # Your code here

    # HINT: Use the log of probabilities to avoid numerical underflow
    # HINT: Remember to handle words not in the training data
    words = process_email(email_text)
    
    total_emails = num_spam + num_ham
    p_spam = num_spam / total_emails
    p_ham = num_ham / total_emails
    
    log_prob_spam = np.log(p_spam)
    log_prob_ham = np.log(p_ham)
    
    for word in words:
        if word in model:
            spam_count = model[word]['spam']
            ham_count = model[word]['ham']
        else:
            spam_count = 1
            ham_count = 1
        
        log_prob_spam += np.log(spam_count / (num_spam + 2))
        log_prob_ham += np.log(ham_count / (num_ham + 2))
    
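    # Normalize in log space: subtracting the larger log-score before
    # exponentiating avoids a 0/0 result when both scores underflow
    max_log = max(log_prob_spam, log_prob_ham)
    exp_spam = np.exp(log_prob_spam - max_log)
    exp_ham = np.exp(log_prob_ham - max_log)
    prob_spam = exp_spam / (exp_spam + exp_ham)
    return prob_spam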

    

In [None]:
# Test your prediction function
test_emails = [
    "lottery winner claim prize money",
    "meeting tomorrow at 3pm",
    "buy cheap watches online"
]
for email in test_emails:
    spam_prob = predict_naive_bayes(email, model, num_spam, num_emails - num_spam)
    print(f"Email: '{email}'\nSpam Probability: {spam_prob:.4f}\n")

## Step 6: Wrap-up
1. How well did your model perform?
2. What challenges did you face while implementing it from scratch?
3. What improvements would you suggest for the future?
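
For the first question, one possible way to quantify performance is to turn predicted probabilities into labels with a threshold and compare them with the true labels. The sketch below is an assumption about how you might do this, not part of the exercise template: the function `evaluate`, the 0.5 threshold, and the toy inputs are all hypothetical.

```python
def evaluate(probabilities, labels, threshold=0.5):
    """Compute accuracy, precision, and recall for spam (label 1)."""
    predictions = [1 if p >= threshold else 0 for p in probabilities]
    pairs = list(zip(predictions, labels))
    tp = sum(1 for pred, y in pairs if pred == 1 and y == 1)
    fp = sum(1 for pred, y in pairs if pred == 1 and y == 0)
    fn = sum(1 for pred, y in pairs if pred == 0 and y == 1)
    tn = sum(1 for pred, y in pairs if pred == 0 and y == 0)
    return {
        'accuracy': (tp + tn) / len(labels),
        # Guard against division by zero when a class is never predicted/present
        'precision': tp / (tp + fp) if (tp + fp) else 0.0,
        'recall': tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Hypothetical predicted spam probabilities and true labels
metrics = evaluate([0.9, 0.2, 0.7, 0.4], [1, 0, 0, 1])
print(metrics)
```

Scores computed on the same emails used for training tend to be optimistic, so holding out part of the data for evaluation gives a more honest picture.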

### Notes (if any):