# 📧 Email Spam Detection with Machine Learning  

This project demonstrates how to build a machine learning model to classify emails as **spam** or **ham (not spam)**.  
We go through dataset loading, preprocessing, training, evaluation, and interpretation of results step by step.  

---

## 📑 Table of Contents
1. Introduction
2. Load and Explore the Dataset
3. Data Preprocessing
4. Train-Test Split
5. Model Training
6. Evaluation
7. Conclusion

## 1. Introduction  

Email spam is one of the most common problems in online communication systems.  
It refers to unsolicited or unwanted emails, often sent for advertising, phishing, or spreading malware.  

In this project, we build a machine learning model to automatically classify emails into **spam** or **ham (legitimate emails)**.  
We will use **text preprocessing**, **feature extraction with TF-IDF**, and a **Naive Bayes classifier** as our baseline model.  
The workflow can later be extended with more advanced models such as Logistic Regression, SVM, or ensemble methods.  


In [1]:
# Data handling
import pandas as pd
import re

# NLP preprocessing
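import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# One-time download of the NLTK data used below (recent NLTK releases may also need 'punkt_tab')
for resource in ('punkt', 'stopwords', 'wordnet'):
    nltk.download(resource, quiet=True)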

# Feature extraction
from sklearn.feature_extraction.text import TfidfVectorizer

# Model selection and training
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression

# Evaluation
from sklearn.metrics import accuracy_score, classification_report


## 2. Load and Explore the Dataset  

We use a publicly available dataset (`spam.csv`) that contains two main columns:  

- **v1** → Label (`ham` or `spam`)  
- **v2** → Email message text  

Our goal is to preprocess the text data and train a model that predicts whether a new email is spam.  


In [2]:
df = pd.read_csv("./spam.csv")
df = df[["v1", "v2"]]
df.head()

|   | v1 | v2 |
|---|----|----|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |
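
Before preprocessing, it is worth checking how the two labels are distributed, since the test-set support reported in the evaluation section (965 ham vs. 150 spam) suggests the classes are imbalanced. The sketch below shows one way to do this. The small `toy_df` frame is a hypothetical stand-in for `spam.csv` (only the `v1`/`v2` column names come from the dataset), so the printed numbers say nothing about the real data.

```python
import pandas as pd

# Hypothetical stand-in for spam.csv; only the v1/v2 column names match the real file.
toy_df = pd.DataFrame({
    "v1": ["ham", "ham", "spam", "ham", "spam", "ham"],
    "v2": ["ok see you", "call me later", "win a free prize now",
           "lunch at noon", "free entry claim now", "on my way"],
})

# Absolute and relative class frequencies
label_counts = toy_df["v1"].value_counts()
label_share = toy_df["v1"].value_counts(normalize=True)

# Drop exact duplicate rows and rows with a missing message before preprocessing
deduped = toy_df.drop_duplicates().dropna(subset=["v2"])

print(label_counts)
print(label_share.round(2))
```

On the real notebook, the same calls would be applied to `df` right after loading it.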


## 3. Data Preprocessing  

Before training our model, we need to clean and preprocess the text data.  
The main preprocessing steps include:  

1. **Lowercasing** → Convert all text to lowercase for consistency.  
2. **Removing URLs, digits, and punctuation** → Links, numbers, and symbols like `.,!?` are stripped with regular expressions, since they usually add little value here.  
3. **Tokenization** → Split the text into individual words.  
4. **Stopword removal** → Common words like *"the", "is", "and"* are removed to reduce noise.  
5. **Lemmatization** → Reduce each word to its dictionary form (e.g., *"prizes"* → *"prize"*).  
6. **TF-IDF transformation** → Convert the cleaned text into numerical feature vectors that machine learning models can work with.  

These steps help the model focus on the most informative parts of the email text.  
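
The cleaning steps can also be collected into a single function, which makes it easier to apply exactly the same processing to new emails at prediction time. This is a minimal sketch rather than the notebook's own code. `clean_text` and `TOY_STOP_WORDS` are hypothetical names, the stop word set is a tiny illustrative substitute for NLTK's English list, tokenization is a plain whitespace split, and lemmatization is left out so that the example runs without any NLTK downloads.

```python
import re

# Tiny illustrative stop word set; the notebook uses NLTK's English list instead.
TOY_STOP_WORDS = {"the", "is", "and", "a", "to", "at"}

def clean_text(text: str) -> str:
    """Lowercase, strip URLs, digits and punctuation, then drop stop words."""
    text = str(text).lower()
    text = re.sub(r"https?://\S+|www\.\S+", "", text)  # remove URLs
    text = re.sub(r"\d+", "", text)                    # remove digits
    text = re.sub(r"[^\w\s]", "", text)                # remove punctuation
    tokens = text.split()                              # whitespace tokenization
    return " ".join(t for t in tokens if t not in TOY_STOP_WORDS)

print(clean_text("WIN a FREE prize at http://example.com!!! Call 0800 now."))
# prints: win free prize call now
```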


In [3]:
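# Lowercase every message
df['v2'] = df['v2'].str.lower()
# Remove URLs (http://, https://, and www. links)
df['v2'] = df['v2'].apply(lambda x: re.sub(r'https?:\/\/\S+|http\S+|www\.\S+','', str(x)))
# Remove digits
df['v2'] = df['v2'].apply(lambda x: re.sub(r'\d+', '', str(x)))
# Remove punctuation (anything that is not a word character or whitespace)
df['v2'] = df['v2'].apply(lambda x: re.sub(r'[^\w\s]', '', str(x)))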

In [4]:
df['v2'] = df['v2'].apply(lambda x:word_tokenize(x))

In [5]:
stop_words = set(stopwords.words('english'))
df['v2'] = df['v2'].apply(lambda x:  [w for w in x if w not in stop_words])

In [6]:
lemmatizer = WordNetLemmatizer()
df['v2'] = df['v2'].apply(lambda x:  [lemmatizer.lemmatize(w) for w in x])

In [7]:
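# Join the tokens back into strings, since TfidfVectorizer expects raw text
df['v2'] = df['v2'].apply(lambda x: " ".join(x))
# Keep the 3,000 most frequent terms; stop_words='english' also applies scikit-learn's built-in list
vectorizer = TfidfVectorizer(max_features=3000, stop_words='english')
X = vectorizer.fit_transform(df['v2'])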
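
To see what the TF-IDF step produces, the sketch below vectorizes three made-up messages and inspects the learned IDF weights. The `docs` list is hypothetical. With scikit-learn's default settings, a term that appears in every document gets the lowest IDF weight, while rarer terms are weighted more heavily.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Three hypothetical messages, for illustration only
docs = ["free prize now", "free entry now", "see you now"]

vec = TfidfVectorizer()
X_toy = vec.fit_transform(docs)  # sparse matrix: one row per message, one column per term

# Map each vocabulary term to its inverse document frequency weight
idf = {term: vec.idf_[idx] for term, idx in vec.vocabulary_.items()}

print(X_toy.shape)
for term in ("now", "free", "prize"):
    print(term, round(idf[term], 3))
```

Here `now` occurs in all three messages, `free` in two, and `prize` in one, so their IDF weights increase in that order.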

## 4. Train-Test Split  

To evaluate the performance of our model, we split the dataset into:  

- **Training set (80%)** → Used to train the model.  
- **Testing set (20%)** → Used to evaluate how well the model generalizes to unseen data.  

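Both sets share the same preprocessing and TF-IDF vocabulary because the vectorizer was fit on the full dataset before the split. This keeps the features consistent, but it also lets test messages influence the vocabulary and IDF weights, so the reported scores may be slightly optimistic. Fitting the vectorizer on the training set only avoids this leakage.  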

In [8]:
y = df['v1']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
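
Because the vectorizer in this notebook is fit before the split, the test messages contribute to the TF-IDF vocabulary and IDF weights. One way to avoid this, sketched below under assumptions, is to split the raw text first and wrap the vectorizer and classifier in a scikit-learn `Pipeline`, so that TF-IDF is fit on the training split only. The `texts` and `labels` lists are a hypothetical toy corpus standing in for the cleaned `df['v2']` and `df['v1']` columns, and `stratify` is an addition that the original cell does not use.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus standing in for the cleaned df['v2'] / df['v1'] columns
texts = [
    "free prize claim now", "win free cash today", "free entry win prize",
    "claim your free reward", "urgent win cash now",
    "see you at lunch", "call me later tonight", "on my way home",
    "thanks for the notes", "meeting moved to monday",
]
labels = ["spam"] * 5 + ["ham"] * 5

# Split the raw text first; stratify keeps both classes in each split
text_train, text_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=42, stratify=labels
)

# The pipeline fits TF-IDF on the training text only and reuses it at predict time
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(max_features=3000)),
    ("nb", MultinomialNB()),
])
pipeline.fit(text_train, y_train)
y_pred = pipeline.predict(text_test)

vocab = pipeline.named_steps["tfidf"].vocabulary_
print(len(text_train), len(text_test), len(vocab))
```

A side benefit of this design is that `pipeline.predict` accepts raw strings, so new emails do not need to be vectorized by hand.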

## 5. Model Training  

For our baseline model, we use **Multinomial Naive Bayes** because:  

- It is simple and efficient for text classification.  
- It works well in practice with word-frequency features such as **TF-IDF**.  
- It often gives strong results relative to more complex models.  

Later, we can extend the workflow to include other classifiers (e.g., Logistic Regression, SVM, Random Forest) and compare their performance.  


In [9]:
model = MultinomialNB()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
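
As a sketch of the extension mentioned above, the snippet below compares Naive Bayes with Logistic Regression using cross-validation. The toy `texts` and `labels` are hypothetical, and the choices of `cv=2` and macro-averaged F1 are assumptions made so that the example runs on ten messages. On the real dataset, more folds would be a more sensible choice.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus, for illustration only
texts = [
    "free prize claim now", "win free cash today", "free entry win prize",
    "claim your free reward", "urgent win cash now",
    "see you at lunch", "call me later tonight", "on my way home",
    "thanks for the notes", "meeting moved to monday",
]
labels = ["spam"] * 5 + ["ham"] * 5

candidates = {
    "naive_bayes": MultinomialNB(),
    "logistic_regression": LogisticRegression(max_iter=1000),
}

results = {}
for name, clf in candidates.items():
    # Each fold refits TF-IDF on its own training part, so no text leaks across folds
    pipe = make_pipeline(TfidfVectorizer(), clf)
    scores = cross_val_score(pipe, texts, labels, cv=2, scoring="f1_macro")
    results[name] = scores.mean()

for name, score in results.items():
    print(f"{name}: mean macro-F1 = {score:.2f}")
```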

## 6. Evaluation  

After training the model, we evaluate it on the **test set** using several metrics:  

- **Accuracy** → Overall percentage of correctly classified emails.  
- **Precision** → Out of all emails classified as spam, how many were actually spam.  
- **Recall (Sensitivity)** → Out of all actual spam emails, how many the model correctly identified.  
- **F1-Score** → Harmonic mean of precision and recall, useful for imbalanced datasets.  
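
To make these definitions concrete, the sketch below recomputes the spam-class metrics from confusion counts. The counts are not printed by the notebook. They are inferred from the classification report in this section (1115 test messages, 150 of them spam, accuracy 0.97399, spam precision 1.00), so treat them as a reconstruction.

```python
# Counts inferred from the report: accuracy 0.97399 on 1115 messages means 29 errors,
# and spam precision 1.00 means none of them are false positives.
tp, fp, fn, tn = 121, 0, 29, 965

accuracy = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp)   # of the messages flagged as spam, the share that is spam
recall = tp / (tp + fn)      # of the actual spam messages, the share that was caught
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.4f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
# prints: accuracy=0.9740 precision=1.00 recall=0.81 f1=0.89
```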

Beyond these metrics, a **confusion matrix** shows how many spam and ham emails were classified correctly or incorrectly, and the **ROC curve** with its **AUC score** summarizes the model’s discrimination ability across decision thresholds. Neither is computed in the cell below, which reports accuracy and the classification report only.  


In [10]:
print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))

Accuracy: 0.9739910313901345
              precision    recall  f1-score   support

         ham       0.97      1.00      0.99       965
        spam       1.00      0.81      0.89       150

    accuracy                           0.97      1115
   macro avg       0.99      0.90      0.94      1115
weighted avg       0.97      0.97      0.97      1115
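
The confusion matrix and AUC discussed in this section can be computed with `sklearn.metrics`. The sketch below uses hypothetical labels and spam probabilities for eight messages, not the notebook's predictions.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

# Hypothetical true labels and predicted spam probabilities (illustration only)
y_true = np.array(["ham", "ham", "ham", "ham", "ham", "spam", "spam", "spam"])
spam_proba = np.array([0.05, 0.10, 0.20, 0.30, 0.60, 0.40, 0.80, 0.95])

# Threshold the probabilities at 0.5 to get hard predictions
y_pred = np.where(spam_proba >= 0.5, "spam", "ham")

# Rows are true classes and columns are predicted classes, in the order given by `labels`
cm = confusion_matrix(y_true, y_pred, labels=["ham", "spam"])
tn, fp, fn, tp = cm.ravel()

# AUC is computed from the scores, not from the thresholded predictions
auc = roc_auc_score(y_true == "spam", spam_proba)

print(cm)
print(f"AUC = {auc:.3f}")
```

With the trained model from this notebook, the spam probabilities would come from `model.predict_proba(X_test)`, taking the column whose position matches `"spam"` in `model.classes_`.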



## 7. Conclusion  

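- The **Naive Bayes classifier** reached about 97.4% accuracy on the test set, with a spam precision of 1.00 and a spam recall of 0.81: no legitimate message was flagged as spam, but roughly one in five spam messages was missed.  
- The model is lightweight, interpretable, and fast to train, which makes it a reasonable baseline for spam filtering.  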
- However, there is still room for improvement:  
  - Trying more advanced models (Logistic Regression, SVM, Random Forest, or even deep learning).  
  - Performing hyperparameter tuning.  
  - Using additional feature engineering techniques (like n-grams); lemmatization is already part of the current pipeline.  
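
As one possible starting point for the tuning and n-gram ideas above, the sketch below runs a small grid search over the TF-IDF n-gram range and the Naive Bayes smoothing parameter `alpha`. The toy corpus, the grid values, `cv=2`, and the macro-F1 scoring are all assumptions chosen to keep the example small. They are not settings taken from this project.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus, for illustration only
texts = [
    "free prize claim now", "win free cash today", "free entry win prize",
    "claim your free reward", "urgent win cash now",
    "see you at lunch", "call me later tonight", "on my way home",
    "thanks for the notes", "meeting moved to monday",
]
labels = ["spam"] * 5 + ["ham"] * 5

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("nb", MultinomialNB()),
])

# Parameter names use the "<step>__<parameter>" convention of scikit-learn pipelines
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],  # unigrams only vs. unigrams plus bigrams
    "nb__alpha": [0.1, 0.5, 1.0],            # additive smoothing strength
}

search = GridSearchCV(pipeline, param_grid, cv=2, scoring="f1_macro")
search.fit(texts, labels)

print(search.best_params_)
print(round(search.best_score_, 2))
```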

This project demonstrates a complete **end-to-end workflow** for a text classification problem, from raw data to a working predictive model.  
