# 🧑‍🏫 Task 1 Part 1: Building a Spam Classifier with Naive Bayes
In this exercise, you'll implement a spam classifier using the **Naive Bayes algorithm**. You'll work with email data to classify messages as spam or non-spam (ham). Follow the steps below and fill in the code where indicated.

**Objective:** Implement all key components of an ML pipeline (except for data handling).

**Allowed Libraries:** `pandas`, `numpy`

**Not Allowed:** Any pre-built ML algorithms or functions like those from `sklearn`.

Follow the instructions step-by-step and answer the questions!

## Step 1: Data Loading and Preprocessing
First, let's load and examine our data.

In [2]:
# Load the data
# TODO: Load the 'emails.csv' file into a DataFrame called 'emails'
import pandas as pd

emails = pd.read_csv('emails.csv')

In [3]:
# Display the first few rows
print(emails.head())

# HINT: Use pd.read_csv() to load the data
# HINT: The DataFrame should have 'text' and 'spam' columns

                                                text  spam
0  Subject: naturally irresistible your corporate...     1
1  Subject: the stock trading gunslinger  fanny i...     1
2  Subject: unbelievable new homes made easy  im ...     1
3  Subject: 4 color printing special  request add...     1
4  Subject: do not have money , get software cds ...     1


In [4]:
# info() prints its summary directly and returns None, so no print() is needed
emails.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5728 entries, 0 to 5727
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   text    5728 non-null   object
 1   spam    5728 non-null   int64
dtypes: int64(1), object(1)
memory usage: 89.6+ KB


In [5]:
emails.isnull().sum().sum()


0

In [6]:
# Analyse the data and remove or modify rows with missing or invalid values
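
The check above found no missing values, so what follows is only a sketch of what a cleaning pass could look like if the data were messier. It runs on a small made-up DataFrame (`toy`), and `clean_emails` is a hypothetical helper name; neither is part of the original notebook.

```python
import pandas as pd

def clean_emails(df):
    """Hypothetical helper: drop rows that cannot be used for training."""
    # Remove rows where the text or the label is missing
    df = df.dropna(subset=["text", "spam"])
    # Remove rows whose text is empty or whitespace only
    df = df[df["text"].str.strip() != ""]
    # Keep only rows with a valid label (0 = ham, 1 = spam)
    df = df[df["spam"].isin([0, 1])]
    # Drop exact duplicate emails and renumber the rows
    df = df.drop_duplicates(subset="text").reset_index(drop=True)
    return df.astype({"spam": int})

# Made-up data for illustration only
toy = pd.DataFrame({
    "text": ["Subject: win money", None, "   ", "Subject: meeting",
             "Subject: win money", "Subject: hi"],
    "spam": [1, 0, 0, 0, 1, 2],
})
cleaned = clean_emails(toy)
print(cleaned)
```

In the notebook itself the call would be `emails = clean_emails(emails)`, followed by a fresh look at `len(emails)` to see how many rows were dropped.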

## Step 2: Text Preprocessing
We need to process each email to extract unique words.

In [7]:
def process_email(text):
    """
    Convert email text to a list of unique, lowercase words

    Parameters:
        text (str): The email text

    Returns:
        list: List of unique words in the email (order is not guaranteed)
    """
    # 1. Convert text to lowercase
    text = text.lower()
    # 2. Split into words on whitespace
    words = text.split()
    # 3. Remove duplicates with a set, then convert back to a list
    return list(set(words))

In [8]:
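# Apply preprocessing to all emails and keep the result
processed_emails = emails['text'].apply(process_email)

# Print only the first email's words; printing all 5728 emails
# exceeds Jupyter's output rate limit
print(processed_emails.iloc[0])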

['full', 'will', 'organization', 'clear', 'provided', 'three', 'use', 'drafts', 'changes', 'surethat', 'irresistible', 'look', 'suqgestions', 'world', 'market', 'made', 'do', 'nowadays', 'our', 'and', 'portfolio', 'its', 'for', 'aim', 'all', 'formats', 'catchy', 'logo', 'hotat', 'distinctive', 'automaticaily', 'identity', 'extra', 'system', 'content', 'naturally', 'logos', 'of', 'you', 'specially', 'this', 'good', 'done', 'amount', 'corporate', 'really', 'statlonery', 'be', '_', 'no', 'company', 'your', 'become', 'letsyou', 'not', 'website', 'within', 'unlimited', 'much', 'outstanding', 'change', 'result', 'convenience', 'promptness', 'days', 'effective', 'benefits', 'creativeness', 'subject:', 'ordered', 'provide', 'here', 'gaps', 'recollect', 'affordability', 'with', 'ieader', 'love', 'are', 'list', 'marketing', "'", 'structure', 'products', 'hand', 'image', 'that', 'see', '%', '-', 'promise', 'through', 'is', ',', 'make', 'iogo', 'stylish', 'fees', ';', 'shouldn', 'reflect', ':', 'b



In [16]:
emails.tail()

                                                   text  spam
5723  Subject: re : research and development charges...     0
5724    Subject: re : receipts from visit jim , than...     0
5725   Subject: re : enron case study update wow ! a...     0
5726    Subject: re : interest david , please , call...     0
5727   Subject: news : aurora 5 . 2 update aurora ve...     0


In [10]:
# Test your preprocessing by testing on the first email
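
One way to fill in this cell is sketched below. To keep the snippet runnable on its own, it redefines `process_email` with the same logic as Step 2 and uses a made-up string (`sample`) in place of the first email.

```python
def process_email(text):
    """Same logic as Step 2: unique, lowercase, whitespace-separated tokens."""
    return list(set(text.lower().split()))

# Made-up stand-in for the first email; in the notebook use emails['text'][0]
sample = "Subject: Free FREE offer , claim your free prize"
words = process_email(sample)

# Sort before printing because set order is not guaranteed
print(sorted(words))

# Every token is lowercase and appears exactly once
assert all(w == w.lower() for w in words)
assert len(words) == len(set(words))
```

Note that punctuation such as `,` and `subject:` survives as its own token, because `split()` only separates on whitespace.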


## Step 3: Calculate Prior Probabilities
Let's calculate the basic probability of an email being spam.

In [20]:
# TODO: Calculate the following:
# 1. Total number of emails
# 2. Number of spam emails
# 3. Probability of spam

num_emails = len(emails)
num_spam = int(emails['spam'].sum())  # labels are 0/1, so the sum is the spam count
spam_probability = num_spam / num_emails if num_emails > 0 else 0.0


print(f"Number of emails: {num_emails}")
print(f"Number of spam emails: {num_spam}")
print(f"Probability of spam: {spam_probability:.4f}")

# HINT: Use len(emails) for total count
# HINT: Use sum(emails['spam']) for spam count

Number of emails: 5728
Number of spam emails: 1368
Probability of spam: 0.2388


## Step 4: Training the Model
Now we'll build our Naive Bayes model by counting word occurrences in spam and ham emails.

In [33]:
def train_naive_bayes(emails_data):
    """
    Train a Naive Bayes model on email data

    Parameters:
        emails_data (DataFrame): DataFrame with 'text' and 'spam' columns

    Returns:
        dict: Dictionary with word frequencies in spam and ham emails,
              structured as model[word] = {'spam': count, 'ham': count}
    """
    model = {}

    for _, email in emails_data.iterrows():
        # NOTE: split() keeps repeated words and the original casing, so a word
        # is counted once per occurrence rather than once per email. Using
        # process_email(email['text']) instead would count each word once per
        # email, but it would also change the counts printed below.
        words = email['text'].split()
        is_spam = email['spam']

        for word in words:
            # Start both counts at 1 (Laplace smoothing) the first time a word is seen
            if word not in model:
                model[word] = {'spam': 1, 'ham': 1}

            if is_spam:
                model[word]['spam'] += 1
            else:
                model[word]['ham'] += 1

    return model


In [34]:
model = train_naive_bayes(emails)

In [35]:
# Test your model with some words
# Examples: 'lottery', 'sale', 'meeting'

print(model["lottery"])
print(model['sale'])
print(model['meeting'])

{'spam': 21, 'ham': 1}
{'spam': 51, 'ham': 57}
{'spam': 14, 'ham': 1773}
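
These counts become per-class likelihoods once they are divided by the class size. The sketch below redoes that arithmetic for the three words above, using the counts printed by the notebook, the totals from Step 3, and the same `+ 2` denominator that Step 5 uses. `word_likelihoods` is a hypothetical helper name.

```python
# Counts copied from the output above (they already include the +1 smoothing)
counts = {
    "lottery": {"spam": 21, "ham": 1},
    "sale": {"spam": 51, "ham": 57},
    "meeting": {"spam": 14, "ham": 1773},
}
num_spam = 1368          # from Step 3
num_ham = 5728 - 1368    # remaining emails are ham

def word_likelihoods(word_counts, num_spam, num_ham):
    """Hypothetical helper: smoothed estimates of P(word | spam) and P(word | ham)."""
    p_spam = word_counts["spam"] / (num_spam + 2)
    p_ham = word_counts["ham"] / (num_ham + 2)
    return p_spam, p_ham

for word, c in counts.items():
    p_spam, p_ham = word_likelihoods(c, num_spam, num_ham)
    print(f"{word:8s} P(w|spam)={p_spam:.4f}  P(w|ham)={p_ham:.4f}  ratio={p_spam / p_ham:.2f}")
```

A ratio well above 1 marks a word as spam-leaning (`lottery`), and a ratio well below 1 marks it as ham-leaning (`meeting`).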


## Step 5: Implementing the Prediction Function
Finally, let's implement the function to predict whether an email is spam.
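
For reference, the quantity that the prediction function computes can be written out as follows, under the usual Naive Bayes independence assumption and the add-one smoothing hinted at in Step 4. The symbols are introduced here for illustration: $N_s$ and $N_h$ are the numbers of spam and ham emails, $N = N_s + N_h$, and $c_s(w)$, $c_h(w)$ are the smoothed counts stored in `model[w]`.

```latex
\begin{align}
\ell_s &= \log\frac{N_s}{N} + \sum_{w \in \text{email}} \log\frac{c_s(w)}{N_s + 2} \\
\ell_h &= \log\frac{N_h}{N} + \sum_{w \in \text{email}} \log\frac{c_h(w)}{N_h + 2} \\
P(\text{spam} \mid \text{email}) &= \frac{e^{\ell_s}}{e^{\ell_s} + e^{\ell_h}}
  = \frac{1}{1 + e^{\ell_h - \ell_s}}
\end{align}
```

Words that never appeared in training use $c_s(w) = c_h(w) = 1$.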

In [36]:
import math

def predict_naive_bayes(email_text, model, num_spam, num_ham):
    """
    Predict whether an email is spam using Naive Bayes

    Parameters:
        email_text (str): The text of the email to classify
        model (dict): Trained Naive Bayes model
        num_spam (int): Number of spam emails in training data
        num_ham (int): Number of ham emails in training data

    Returns:
        float: Probability that the email is spam
    """
    # 1. Tokenize the same way as in training (whitespace split)
    words = email_text.split()

    # 2. Start from the log priors of spam and ham
    total_emails = num_spam + num_ham
    log_spam = math.log(num_spam / total_emails)
    log_ham = math.log(num_ham / total_emails)

    # 3. Add the log likelihood of each word
    for word in words:
        # Words not seen in training fall back to the smoothed count of 1
        counts = model.get(word, {'spam': 1, 'ham': 1})
        log_spam += math.log(counts['spam'] / (num_spam + 2))  # +2 for Laplace smoothing
        log_ham += math.log(counts['ham'] / (num_ham + 2))

    # 4. Normalize using the difference of the log scores. Calling exp() on
    #    each score separately underflows to 0.0 for long emails (giving 0/0).
    diff = log_ham - log_spam
    if diff > 700:  # exp() would overflow; the spam probability is effectively 0
        return 0.0
    return 1.0 / (1.0 + math.exp(diff))
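
The hint about numerical underflow applies to the final normalization as well as to the sum of logs: for a long email, exponentiating each log score on its own can return 0.0 for both classes. The sketch below illustrates the effect and one way around it. `posterior_from_logs` is a hypothetical helper name, and the log scores are made-up values.

```python
import math

def posterior_from_logs(log_spam, log_ham):
    """Hypothetical helper: P(spam) from two unnormalized log scores."""
    diff = log_ham - log_spam
    # exp() overflows for arguments above roughly 709, so handle that case first
    if diff > 700:
        return 0.0
    # Equal to exp(log_spam) / (exp(log_spam) + exp(log_ham)), but stable
    return 1.0 / (1.0 + math.exp(diff))

# Made-up log scores of the size a long email could produce
log_spam, log_ham = -2000.0, -2005.0

# Naive normalization: both exponentials underflow to 0.0
print(math.exp(log_spam), math.exp(log_ham))

# Working with the difference still gives a usable probability
print(posterior_from_logs(log_spam, log_ham))
```

Only the gap between the two log scores matters for the result, which is why the absolute size of the scores can be ignored.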

In [37]:
# Test your prediction function
test_emails = [
    "lottery winner claim prize money",
    "meeting tomorrow at 3pm",
    "buy cheap watches online"
]

In [42]:
# num_spam and num_emails come from Step 3, and model is the dictionary
# returned by the `train_naive_bayes` function.
num_ham = num_emails - num_spam

for email_text in test_emails:
    # Predict the probability that the email is spam
    # (a separate name keeps the prior from Step 3 from being overwritten)
    predicted_probability = predict_naive_bayes(email_text, model, num_spam, num_ham)
    print(f"Email: {email_text}\nSpam Probability: {predicted_probability:.4f}\n")


Email: lottery winner claim prize money
Spam Probability: 1.0000

Email: meeting tomorrow at 3pm
Spam Probability: 0.0018

Email: buy cheap watches online
Spam Probability: 0.9983
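
Three hand-written emails give only a rough impression of quality. A held-out split would put a number on it; the sketch below shows the idea on made-up data. `train`, `predict`, and `holdout_accuracy` are hypothetical, compact stand-ins rather than the notebook's own functions, and the shuffle matters because the dataset appears to list spam rows first (see the `head()` and `tail()` outputs).

```python
import math
import numpy as np

def train(texts, labels):
    """Compact stand-in for train_naive_bayes: counts start at 1 (smoothing)."""
    model = {}
    for text, is_spam in zip(texts, labels):
        for word in set(text.lower().split()):
            counts = model.setdefault(word, {"spam": 1, "ham": 1})
            counts["spam" if is_spam else "ham"] += 1
    return model

def predict(text, model, num_spam, num_ham):
    """Compact stand-in for predict_naive_bayes: returns P(spam | text)."""
    log_spam = math.log(num_spam / (num_spam + num_ham))
    log_ham = math.log(num_ham / (num_spam + num_ham))
    for word in set(text.lower().split()):
        counts = model.get(word, {"spam": 1, "ham": 1})
        log_spam += math.log(counts["spam"] / (num_spam + 2))
        log_ham += math.log(counts["ham"] / (num_ham + 2))
    diff = log_ham - log_spam
    return 0.0 if diff > 700 else 1.0 / (1.0 + math.exp(diff))

def holdout_accuracy(texts, labels, test_fraction=0.25, seed=0):
    """Hypothetical helper: train on one part of the data, score on the rest."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(texts))  # shuffle so both classes reach both splits
    n_test = max(1, int(len(texts) * test_fraction))
    test_idx, train_idx = order[:n_test], order[n_test:]

    train_texts = [texts[i] for i in train_idx]
    train_labels = [labels[i] for i in train_idx]
    model = train(train_texts, train_labels)
    num_spam = sum(train_labels)
    num_ham = len(train_labels) - num_spam

    correct = 0
    for i in test_idx:
        predicted = int(predict(texts[i], model, num_spam, num_ham) >= 0.5)
        correct += int(predicted == labels[i])
    return correct / n_test

# Made-up data for illustration only
texts = [
    "win lottery prize money now", "cheap watches buy online now",
    "claim your free prize", "free money click here",
    "meeting tomorrow at 3pm", "project update attached",
    "lunch meeting next week", "please review the attached report",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

acc = holdout_accuracy(texts, labels)
print(f"Held-out accuracy: {acc:.2f}")
```

With the real data, the same split could be applied to `emails` and the notebook's own `train_naive_bayes` and `predict_naive_bayes` used in place of the stand-ins.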



## Step 6: Wrap-up
1. How well did your model perform?
2. What challenges did you face while implementing it from scratch?
3. What improvements would you suggest for the future?

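**Answers:**

1. The model behaved as expected on the three test emails: it gave the two spam-like messages probabilities of 1.0000 and 0.9983, and the meeting message 0.0018. Accuracy on held-out data was not measured.
2. Because every step was laid out in the instructions, there were few challenges.
3. The exercise was beginner-friendly.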

### Notes (if any):