# 🧑‍🏫 Task 1 Part 1: Building a Spam Classifier with Naive Bayes
In this exercise, you'll implement a spam classifier using the **Naive Bayes algorithm**. You'll work with email data to classify messages as spam or non-spam (ham). Follow the steps below and fill in the code where indicated.

**Objective:** Implement all key components of an ML pipeline (except for data handling).

**Allowed Libraries:** `pandas`, `numpy`

**Not Allowed:** Any pre-built ML algorithms or functions like those from `sklearn`.

Follow the instructions step-by-step and answer the questions!

In [2]:
!pip install pandas






In [184]:
import pandas as pd

## Step 1: Data Loading and Preprocessing
First, let's load and examine our data.

In [185]:
# Load the data
# TODO: Load the 'emails.csv' file into a DataFrame called 'emails'
emails = pd.read_csv("emails.csv")


In [186]:
# Display the first few rows
print(emails.head())

# HINT: Use pd.read_csv() to load the data
# HINT: The DataFrame should have 'text' and 'spam' columns

                                                text  spam
0  Subject: naturally irresistible your corporate...     1
1  Subject: the stock trading gunslinger  fanny i...     1
2  Subject: unbelievable new homes made easy  im ...     1
3  Subject: 4 color printing special  request add...     1
4  Subject: do not have money , get software cds ...     1


In [188]:
# Analyse the data and remove rows with missing or invalid values
emails.dropna(inplace=True)

# Keep only rows whose label is a valid class: 0 (ham) or 1 (spam)
emails = emails[emails["spam"].isin([0, 1])]

In [202]:
emails = emails.drop(emails.index[1000:4000])
emails

      text                                               spam
90    Subject: investment / partnership proposal de...   1
1583  Subject: re : executive program on credit risk...  0
1999  Subject: the february issue of reactions is no...  0
2501  Subject: re : stanford or - summer interns ra...   0
1206  Subject: all graphics software available , che...  1
...   ...                                                ...
5568  Subject: re : var for cob 2 nd aug 2000 hi ki...   0
5215  Subject: re : risk european energy 2000 steve...   0
1424  Subject: re : gordon , it was a pleasure tal...    0
4012  Subject: confirmation of 3 / 20 9 a . m . meet...  0


## Step 2: Text Preprocessing
We need to process each email to extract unique words.

In [203]:
def process_email(text):
    """
    Convert email text to a list of unique, lowercase words
    
    Parameters:
        text (str): The email text
    
    Returns:
        list: List of unique words in the email
    """
    # TODO: Implement the preprocessing function
    # 1. Convert text to lowercase
    # 2. Split into words
    # 3. Remove duplicates

    # HINT: Use text.lower() for lowercase conversion
    # HINT: Use split() to convert text to words
    # HINT: Use set() to remove duplicates

    # Your code here
    words_in_email = text.lower().split()  # lowercase, then split on whitespace
    return list(set(words_in_email))       # set() drops duplicates; order is not preserved

In [204]:
# Apply preprocessing to all emails and collect the vocabulary of unique words
words = set()
for text in emails["text"]:
    words.update(process_email(text))
    



In [205]:
words

{'phd',
 'exhibitions',
 'cousin',
 'xx',
 'smaller',
 'designers',
 'jhenders',
 'paez',
 'white',
 'mason',
 '5701',
 'hrgovcic',
 'queue',
 'flow',
 '1965',
 'dissemination',
 'systemworks',
 'breen',
 'cbn',
 'jgsm',
 'issures',
 '4727',
 '8511',
 'wess',
 'v',
 'accommodations',
 'prosilbym',
 'margin',
 'teaser',
 'talked',
 'head',
 'adjustment',
 'benefits',
 'deteriorated',
 'lub',
 'marquardt',
 'crenshawwrote',
 '3458',
 'aps',
 'hsbcib',
 'fabulous',
 'arguments',
 'suited',
 'sandbox',
 'satisfaction',
 '2938',
 'karnataka',
 '35275',
 'affiliates',
 'stockholder',
 'hmuprzd',
 'centered',
 'mornings',
 'nrumqor',
 '290',
 'ghosh',
 'training',
 '11',
 'autience',
 '1936',
 'layton',
 'wieloma',
 'detail',
 '0587',
 'bulletin',
 '40',
 'risckrac',
 'realign',
 'tune',
 'engage',
 'costs',
 'pd',
 'tale',
 '5728',
 '4234',
 'grateful',
 'bd',
 'chaxel',
 'penis',
 'dating',
 'pressurising',
 'typesetters',
 'arbitrage',
 '6761',
 'tconvery',
 'soc',
 'jakie',
 'inconsistent

In [206]:
type(words)

set

## Step 3: Calculate Prior Probabilities
Let's calculate the basic probability of an email being spam.

In [207]:
# TODO: Calculate the following:
# 1. Total number of emails
# 2. Number of spam emails
# 3. Probability of spam

# HINT: Use len(emails) for total count
# HINT: Use sum(emails['spam']) for spam count

num_emails = len(emails)               # total number of emails
num_spam = int(emails["spam"].sum())   # labels are 0/1, so the sum counts the spam emails
num_ham = num_emails - num_spam        # every remaining email is ham

spam_probability = num_spam / num_emails

print(f"Number of emails: {num_emails}")
print(f"Number of spam emails: {num_spam}")
print(f"Probability of spam: {spam_probability:.4f}")

Number of emails: 1000
Number of spam emails: 217
Probability of spam: 0.2170
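
Written as a formula, the prior is the fraction of training emails labelled spam. The symbols $N_{\text{spam}}$ and $N_{\text{emails}}$ are notation introduced here for `num_spam` and `num_emails`; the numbers are the counts printed above:

$$
P(\text{spam}) = \frac{N_{\text{spam}}}{N_{\text{emails}}} = \frac{217}{1000} = 0.217,
\qquad
P(\text{ham}) = 1 - P(\text{spam}) = 0.783
$$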


## Step 4: Training the Model
Now we'll build our Naive Bayes model by counting word occurrences in spam and ham emails.

In [208]:
def train_naive_bayes(words):
    """
    Train a Naive Bayes model on email data
    
    Parameters:
        words (iterable of str): Vocabulary built from the training emails
    
    Returns:
        dict: Dictionary with word frequencies in spam and ham emails

    Note:
        Reads the global DataFrame `emails` ('text' and 'spam' columns).
    """
    # TODO: Create a dictionary to store word frequencies
    # For each word, store counts of its occurrence in spam and ham emails
    # HINT: Initialize counts with 1 (Laplace smoothing)
    # HINT: Structure: model[word] = {'spam': count, 'ham': count}
    model = {}

    for word in words:
        # Counts start at 0 here, so no Laplace smoothing is applied
        model[word] = {"spam": 0, "ham": 0}
        # Pair each text with its own label rather than looking the row up again by text
        for text, label in zip(emails["text"], emails["spam"]):
            # Substring test on the raw text: 'v' matches every email that
            # contains the letter v, and the text is not lowercased
            if word in text:
                if label == 1:
                    model[word]["spam"] += 1
                elif label == 0:
                    model[word]["ham"] += 1

    return model
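
The notes at the end of this notebook mention that training is too slow on the full dataset. One possible way to speed it up is to loop over the emails once and update counts only for the words each email actually contains, instead of scanning every email for every vocabulary word. The sketch below is an assumption about how that could look, not the exercise's reference solution: `train_naive_bayes_fast` is a hypothetical name, it takes the data as an argument, it starts every count at 1 as the Laplace hint suggests, and it counts whole tokens from `process_email`, so its numbers will differ from a substring search.

```python
from collections import defaultdict


def process_email(text):
    """Return the unique lowercase words of an email (same idea as Step 2)."""
    return list(set(text.lower().split()))


def train_naive_bayes_fast(emails_data):
    """Count, per word, how many spam and ham emails contain it.

    Hypothetical variant: `emails_data` only needs 'text' and 'spam' entries,
    so a DataFrame or a plain dict of lists both work.
    """
    # Every word starts at 1 for both classes (Laplace smoothing)
    model = defaultdict(lambda: {"spam": 1, "ham": 1})
    for text, label in zip(emails_data["text"], emails_data["spam"]):
        key = "spam" if label == 1 else "ham"
        # Each email touches only its own words, so the total work is
        # proportional to the amount of text, not vocabulary x emails
        for word in process_email(text):
            model[word][key] += 1
    return dict(model)


# Tiny made-up dataset, only to show the call
toy = {
    "text": ["win money now", "meeting at noon", "win a prize now"],
    "spam": [1, 0, 1],
}
toy_model = train_naive_bayes_fast(toy)
print(toy_model["win"])  # {'spam': 3, 'ham': 1}
```

With the notebook's data the call would be `train_naive_bayes_fast(emails)`; the vocabulary is then just the keys of the returned dictionary.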

In [209]:
words = list(words)

In [210]:
model = train_naive_bayes(words)

In [211]:
model

{'phd': {'spam': 1, 'ham': 17},
 'exhibitions': {'spam': 1, 'ham': 0},
 'cousin': {'spam': 0, 'ham': 1},
 'xx': {'spam': 3, 'ham': 3},
 'smaller': {'spam': 1, 'ham': 4},
 'designers': {'spam': 1, 'ham': 0},
 'jhenders': {'spam': 0, 'ham': 1},
 'paez': {'spam': 0, 'ham': 1},
 'white': {'spam': 2, 'ham': 3},
 'mason': {'spam': 1, 'ham': 3},
 '5701': {'spam': 0, 'ham': 1},
 'hrgovcic': {'spam': 0, 'ham': 6},
 'queue': {'spam': 0, 'ham': 1},
 'flow': {'spam': 1, 'ham': 19},
 '1965': {'spam': 0, 'ham': 1},
 'dissemination': {'spam': 1, 'ham': 2},
 'systemworks': {'spam': 1, 'ham': 0},
 'breen': {'spam': 0, 'ham': 1},
 'cbn': {'spam': 1, 'ham': 0},
 'jgsm': {'spam': 0, 'ham': 3},
 'issures': {'spam': 0, 'ham': 1},
 '4727': {'spam': 0, 'ham': 3},
 '8511': {'spam': 0, 'ham': 1},
 'wess': {'spam': 0, 'ham': 2},
 'v': {'spam': 211, 'ham': 771},
 'accommodations': {'spam': 0, 'ham': 1},
 'prosilbym': {'spam': 0, 'ham': 1},
 'margin': {'spam': 2, 'ham': 5},
 'teaser': {'spam': 0, 'ham': 1},
 'talk

In [180]:
# Test your model with some words
# Examples: 'lottery', 'sale', 'meeting'
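
A quick way to inspect individual words is to look up their counts and turn them into a spam fraction. The helper below is a hypothetical sketch (`spam_ratio` is not part of the exercise), and the counts in `toy_model` are made up purely to keep the example self-contained; in the notebook you would pass the trained `model` instead.

```python
def spam_ratio(model, word):
    """Hypothetical helper: fraction of emails containing `word` that were spam."""
    counts = model.get(word)
    if counts is None or counts["spam"] + counts["ham"] == 0:
        return None  # word not seen in the training data
    return counts["spam"] / (counts["spam"] + counts["ham"])


# Made-up counts, for illustration only
toy_model = {
    "lottery": {"spam": 9, "ham": 1},
    "meeting": {"spam": 1, "ham": 19},
}

for word in ["lottery", "sale", "meeting"]:
    print(word, spam_ratio(toy_model, word))
```

A word that is missing from the model returns `None` rather than raising a `KeyError`, which makes it easier to probe arbitrary words.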



## Step 5: Implementing the Prediction Function
Finally, let's implement the function to predict whether an email is spam.
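
For reference, one common way to write the Naive Bayes score is shown below. It assumes that words are independent given the class and that only the words $w_1, \dots, w_n$ present in the email are used; the symbols are notation introduced for this sketch, not taken from the exercise.

$$
P(\text{spam} \mid w_1, \dots, w_n) =
\frac{P(\text{spam}) \prod_{i=1}^{n} P(w_i \mid \text{spam})}
     {P(\text{spam}) \prod_{i=1}^{n} P(w_i \mid \text{spam}) + P(\text{ham}) \prod_{i=1}^{n} P(w_i \mid \text{ham})}
$$

With Laplace smoothing, each word likelihood can be estimated as

$$
P(w \mid \text{spam}) \approx \frac{c_{\text{spam}}(w) + 1}{N_{\text{spam}} + 2},
\qquad
P(w \mid \text{ham}) \approx \frac{c_{\text{ham}}(w) + 1}{N_{\text{ham}} + 2}
$$

where $c_{\text{spam}}(w)$ is the number of spam emails containing $w$ and $N_{\text{spam}}$ is the number of spam emails (likewise for ham). Taking logarithms turns the products into sums, which avoids numerical underflow for long emails.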

In [212]:
def predict_naive_bayes(email_text, model, num_spam, num_ham):
    """
    Predict whether an email is spam using Naive Bayes
    
    Parameters:
        email_text (str): The text of the email to classify
        model (dict): Trained Naive Bayes model
        num_spam (int): Number of spam emails in training data
        num_ham (int): Number of ham emails in training data
    
    Returns:
        float: Probability that the email is spam
    """
    # TODO: Implement the Naive Bayes prediction
    # 1. Preprocess the email text
    # 2. Calculate probability using the Naive Bayes formula
    # HINT: Use the log of probabilities to avoid numerical underflow
    # HINT: Remember to handle words not in the training data

    # Your code here
    spam_probability = num_spam / (num_spam + num_ham)  # prior P(spam)

    prob = 1.0
    for word in process_email(email_text):
        # Words not seen in training are skipped
        if word not in model:
            continue
        spam_score = model[word]["spam"] * spam_probability
        ham_score = model[word]["ham"] * (1 - spam_probability)
        if spam_score + ham_score == 0:
            continue  # both counts are zero: avoid dividing by zero
        # Per-word estimate of P(spam | word); the estimates are multiplied together,
        # so a single zero spam count makes the whole product 0
        prob *= spam_score / (spam_score + ham_score)
    return prob
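
The hints suggest working with logarithms and handling unseen words. A minimal sketch of that idea follows, under stated assumptions: `predict_log_naive_bayes` is a hypothetical name, the model's counts are assumed to already include the +1 from Laplace smoothing, both classes are assumed to be non-empty, and unseen words are simply skipped. It is one possible formulation, not the exercise's reference solution.

```python
import math


def process_email(text):
    """Return the unique lowercase words of an email (same idea as Step 2)."""
    return list(set(text.lower().split()))


def predict_log_naive_bayes(email_text, model, num_spam, num_ham):
    """Hypothetical log-space variant; expects counts that already include +1 smoothing."""
    total = num_spam + num_ham
    log_spam = math.log(num_spam / total)  # log prior for spam
    log_ham = math.log(num_ham / total)    # log prior for ham

    for word in process_email(email_text):
        counts = model.get(word)
        if counts is None:
            continue  # unseen word: contributes no evidence either way
        # Smoothed likelihoods: +1 is already in the counts, +2 goes in the
        # denominator (one per outcome: word present / word absent)
        log_spam += math.log(counts["spam"] / (num_spam + 2))
        log_ham += math.log(counts["ham"] / (num_ham + 2))

    # Turn the two log scores back into P(spam | email); subtracting the
    # maximum first keeps the exponentials from underflowing
    top = max(log_spam, log_ham)
    spam_term = math.exp(log_spam - top)
    ham_term = math.exp(log_ham - top)
    return spam_term / (spam_term + ham_term)


# Made-up smoothed counts, for illustration only
toy_model = {
    "lottery": {"spam": 9, "ham": 1},
    "meeting": {"spam": 1, "ham": 19},
}
p_lottery = predict_log_naive_bayes("lottery winner", toy_model, num_spam=10, num_ham=20)
p_meeting = predict_log_naive_bayes("meeting tomorrow", toy_model, num_spam=10, num_ham=20)
print(p_lottery, p_meeting)
```

An email made up only of unseen words falls back to the prior, which is a reasonable default when there is no evidence.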

In [213]:
# Test your prediction function
test_emails = [
    "lottery winner claim prize money",
    "meeting tomorrow at 3pm",
    "buy cheap watches online"
]

In [214]:
email_text = "lottery winner claim prize money"

In [215]:
predict_naive_bayes(email_text, model, num_spam, num_ham)


0.002081042230870518

## Step 6: Wrap-up
1. How well did your model perform?
2. What challenges did you face while implementing it from scratch?
3. What improvements would you suggest for the future?
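
To answer the first question with numbers, the predictions can be compared against known labels. The sketch below is a hypothetical helper (`evaluate`, `toy_predict`, and the example texts are all made up for illustration); it assumes a prediction function that returns a spam probability and a threshold of 0.5.

```python
def evaluate(predict_fn, texts, labels, threshold=0.5):
    """Hypothetical helper: compare thresholded predictions with true 0/1 labels."""
    tp = fp = tn = fn = 0
    for text, label in zip(texts, labels):
        predicted = 1 if predict_fn(text) >= threshold else 0
        if predicted == 1 and label == 1:
            tp += 1  # spam correctly flagged
        elif predicted == 1 and label == 0:
            fp += 1  # ham wrongly flagged as spam
        elif predicted == 0 and label == 0:
            tn += 1  # ham correctly passed
        else:
            fn += 1  # spam that slipped through
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    return {"accuracy": accuracy, "tp": tp, "fp": fp, "tn": tn, "fn": fn}


# Stand-in predictor and made-up examples, only to show the call
def toy_predict(text):
    return 0.9 if "prize" in text else 0.1


texts = ["claim your prize", "meeting tomorrow", "prize draw results", "lunch on friday"]
labels = [1, 0, 0, 0]
result = evaluate(toy_predict, texts, labels)
print(result)
```

In the notebook, `predict_fn` could be `lambda t: predict_naive_bayes(t, model, num_spam, num_ham)`, ideally applied to emails that were held out from training so the score is not inflated.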

### Notes (if any):
The model should perform as required when the whole dataset is used. My code takes too long to process the full dataset, so I sliced the data down to a smaller subset. If possible, please provide a more efficient solution and disregard the model's performance for now.