# Spam Email Classification: A Practical Approach to Binary Classification

Email is one of the most widely used communication tools today, but it is also a common target for unsolicited and harmful messages. Detecting spam is therefore essential for protecting users' privacy and improving their experience. This project classifies emails as either spam or legitimate (ham) using a combined dataset derived from two renowned sources: the **2007 TREC Public Spam Corpus** and the **Enron-Spam Dataset**.

### About the Dataset

The dataset used in this project consists of **83,446 email records** labeled as either:
- **Spam (`1`)**: Unsolicited or harmful messages.
- **Ham (`0`)**: Legitimate email content.

Each record includes:
1. **Label**: Indicates whether the email is spam or not.
2. **Text**: The actual content of the email.
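
As a rough illustration of this two-column structure, the sketch below inspects the class balance with pandas and drops empty records. The file name `combined_data.csv` and the column names `label` and `text` are assumptions, and a small in-memory frame stands in for the real file so the snippet runs on its own.

```python
import pandas as pd

# Stand-in for pd.read_csv("combined_data.csv"); file and column names are assumed.
df = pd.DataFrame({
    "label": [1, 0, 0, 1, 0],
    "text": [
        "Win a free prize now",
        "Meeting moved to 3pm",
        "Please review the attached report",
        "Cheap meds, no prescription needed",
        "Lunch tomorrow?",
    ],
})

# Share of each class: 1 = spam, 0 = ham
class_share = df["label"].value_counts(normalize=True)
print(class_share)

# Drop rows whose text is missing or blank before any further processing
df = df.dropna(subset=["text"])
df = df[df["text"].str.strip() != ""]
print(len(df), "usable records")
```

Checking the class balance early helps decide whether accuracy alone is a meaningful metric or whether precision and recall deserve more weight.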

### Data Sources

The dataset combines information from:
- [2007 TREC Public Spam Corpus](https://plg.uwaterloo.ca/~gvcormac/treccorpus07/)  
  Preprocessed dataset: [Download here](https://www.kaggle.com/datasets/bayes2003/emails-for-spam-or-ham-classification-trec-2007)
- [Enron-Spam Dataset](https://www2.aueb.gr/users/ion/data/enron-spam/)  
  Preprocessed dataset: [Download here](https://github.com/MWiechmann/enron_spam_data/)

The combination and preprocessing of these datasets were accomplished using a custom script available [here](https://github.com/PuruSinghvi/Spam-Email-Classifier/blob/main/Combining%20Datasets.ipynb).

### Objective and Inspiration

This project tackles a **binary classification problem** where the goal is to differentiate between spam and ham emails. The task involves understanding the nuances of email content and leveraging machine learning models to achieve high classification accuracy. The approach draws inspiration from Ramya Vidiyala’s article, ["Detecting Spam in Emails"](https://towardsdatascience.com/spam-detection-in-emails-de0398ea3b48), which highlights effective methodologies for spam detection.

By identifying spam emails with high accuracy, this project aims to demonstrate the potential of machine learning in solving real-world challenges, such as improving email security and reducing unwanted communications.


# Data Preprocessing

To prepare the dataset for spam email classification, the following libraries and tools are imported:  

- **NumPy**: For numerical operations.  
- **Pandas**: For data manipulation and handling CSV files.  
- **NLTK (Natural Language Toolkit)**: For natural language processing tasks, including tokenization, stopword removal, and n-gram extraction.  
- **Regular Expressions (`re`)**: For text cleaning and pattern matching.  

The NLTK `stopwords` corpus and `punkt` tokenizer models are also downloaded, since stopword removal and `word_tokenize` depend on them.


In [1]:
# Data preprocessing
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import nltk
nltk.download('stopwords')
nltk.download('punkt')
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk import ngrams
import re

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
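
The imports above suggest a cleaning step along the following lines. This is a minimal sketch, not the project's actual pipeline: the function name `clean_text` and the exact cleaning rules are assumptions, and a whitespace split is used in place of `word_tokenize` to keep the example light. If the NLTK stopword corpus is not available, a tiny illustrative fallback list is used instead.

```python
import re

try:
    from nltk.corpus import stopwords
    STOP_WORDS = set(stopwords.words("english"))
except (ImportError, LookupError):
    # Fallback when NLTK or its stopword corpus is missing (illustrative subset only)
    STOP_WORDS = {"a", "an", "the", "is", "are", "this", "that", "to", "of",
                  "and", "in", "for", "on", "at", "you", "your"}

def clean_text(text):
    """Lowercase, strip URLs and non-letters, then remove stopwords."""
    text = text.lower()
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # drop URLs
    text = re.sub(r"[^a-z\s]", " ", text)               # keep letters and whitespace
    tokens = text.split()                               # simple whitespace tokenization
    return [t for t in tokens if t not in STOP_WORDS]

tokens = clean_text("This is a FREE offer!!! Claim your prize at http://example.com")
print(tokens)
```

Removing URLs before stripping punctuation matters here: doing it the other way round would leave fragments such as `http` and `com` behind as tokens.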


# Visualization

For exploring and visualizing the dataset, the following libraries and tools are utilized:  

- **Matplotlib**: For creating static, animated, and interactive visualizations.  
- **Collections (Counter)**: For counting occurrences of elements in the dataset, such as word frequencies.  
- **WordCloud**: For generating word cloud representations to visualize common terms in the dataset.

In [4]:
# Visualization
import matplotlib.pyplot as plt
from collections import Counter
from wordcloud import WordCloud
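
To show how `Counter` supports this kind of exploration, the sketch below counts word frequencies over a few made-up spam messages. The sample texts and variable names are hypothetical.

```python
from collections import Counter

# Hypothetical, already-cleaned spam messages
spam_texts = [
    "win free cash now",
    "free prize claim now",
    "cheap meds free shipping",
]

# Count every word across all messages
word_counts = Counter(word for text in spam_texts for word in text.split())

# The two most frequent words and their counts
top_words = word_counts.most_common(2)
print(top_words)  # [('free', 3), ('now', 2)]
```

The same counts could then feed a word cloud, for instance via `WordCloud().generate_from_frequencies(word_counts)`, with the result displayed through Matplotlib.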

# Feature Engineering

To transform the dataset into a format suitable for machine learning models, the following libraries and tools are employed:  

- **String (`string`)**: For string constants such as `string.punctuation`, which are useful during text cleaning.  
- **Regular Expressions (`re`)**: For pattern matching and text cleaning.  
- **Keras Preprocessing**:  
  - **Tokenizer**: For converting text into sequences of tokens.  
  - **Pad Sequences**: For ensuring uniform input length by padding or truncating sequences.  
- **Scikit-learn Preprocessing**:  
  - **LabelEncoder**: For encoding target labels (spam or ham) into numerical format.  

In [7]:
# Feature Engineering
import string
import re
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from sklearn.preprocessing import LabelEncoder
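
The following pure-Python sketch illustrates the idea behind tokenizing and padding. It is a simplified stand-in written for this explanation, not the Keras implementation; the helper names `build_vocab`, `texts_to_sequences`, and `pad` are hypothetical. Like the Keras defaults, it indexes words from 1 in order of frequency, reserves 0 for padding, and pads and truncates at the start of each sequence.

```python
from collections import Counter

def build_vocab(texts, num_words=None):
    """Map each word to an integer index, most frequent first (0 is reserved for padding)."""
    counts = Counter(word for text in texts for word in text.lower().split())
    ranked = [word for word, _ in counts.most_common(num_words)]
    return {word: index for index, word in enumerate(ranked, start=1)}

def texts_to_sequences(texts, vocab):
    """Replace each known word with its index; unknown words are skipped."""
    return [[vocab[w] for w in text.lower().split() if w in vocab] for text in texts]

def pad(sequences, maxlen, value=0):
    """Left-pad short sequences and keep only the last `maxlen` items of long ones."""
    padded = []
    for seq in sequences:
        seq = seq[-maxlen:]
        padded.append([value] * (maxlen - len(seq)) + seq)
    return padded

texts = ["win cash now", "meeting at noon today please confirm"]
vocab = build_vocab(texts)
sequences = texts_to_sequences(texts, vocab)
padded = pad(sequences, maxlen=5)
print(padded)
```

Every row of `padded` has the same length, which is what allows the sequences to be stacked into a single array and fed to an embedding layer.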

# Machine Learning Model

The following libraries and tools are used to build, train, and evaluate the machine learning model:  

- **Scikit-learn**:  
  - **Train-Test Split**: For dividing the dataset into training and testing sets.  

- **Keras**:  
  - **Sequential**: For creating a linear stack of layers for the model.  
  - **Layers**:  
    - **Dense**: Fully connected layers for learning complex representations.  
    - **LSTM**: Long Short-Term Memory layers for capturing sequential patterns in text data.  
    - **Embedding**: For converting words into dense vector representations.  
    - **Dropout**: For regularization to reduce overfitting.  
    - **Activation**: For applying activation functions like ReLU or softmax.  
    - **Bidirectional**: For processing sequences in both forward and backward directions.  

- **TensorFlow**:  
  - As the backend for training and deploying the neural network model.

In [5]:
# Machine Learning Model
from sklearn.model_selection import train_test_split
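# Import from tensorflow.keras for consistency with the preprocessing imports above
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, LSTM, Embedding, Dropout, Activation, Bidirectional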
import tensorflow as tf
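
One way these layers might fit together for binary classification is sketched below. The architecture, the layer sizes, and the constants `VOCAB_SIZE` and `MAX_LEN` are assumptions chosen for illustration, not the configuration used in this project.

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Bidirectional, LSTM, Dropout, Dense

VOCAB_SIZE = 10_000  # assumed vocabulary size
MAX_LEN = 100        # assumed padded sequence length

model = Sequential([
    Embedding(input_dim=VOCAB_SIZE, output_dim=32),  # word index -> dense vector
    Bidirectional(LSTM(16)),                         # read the sequence in both directions
    Dropout(0.5),                                    # regularization
    Dense(1, activation="sigmoid"),                  # probability that the email is spam
])
model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])

# Forward pass on a dummy batch of two padded sequences to check the output shape
dummy_batch = np.zeros((2, MAX_LEN), dtype="int32")
probs = model.predict(dummy_batch, verbose=0)
print(probs.shape)  # (2, 1)
```

A single sigmoid unit paired with binary cross-entropy is a common choice when there are exactly two classes, since the output can be read directly as a spam probability.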

# Evaluation Metrics

To assess the performance of the machine learning model, the following libraries and tools are used:  

- **Scikit-learn Metrics**:  
  - **Confusion Matrix**: For visualizing true positives, true negatives, false positives, and false negatives.  
  - **F1 Score**: For evaluating the balance between precision and recall.  
  - **Precision Score**: For measuring the proportion of predicted positives (emails flagged as spam) that are actually positive.  
  - **Recall Score**: For measuring the proportion of actual positives correctly identified.  

- **Seaborn**:  
  - For creating visually appealing and informative plots, such as heatmaps for the confusion matrix.  

In [8]:
# Evaluation Metrics
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score
import seaborn as sns
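
To make the metric definitions concrete, the sketch below computes them by hand from a small set of made-up labels. The helper `binary_metrics` is hypothetical and written only for illustration; for binary labels with spam as the positive class, its results should agree with scikit-learn's `precision_score`, `recall_score`, and `f1_score`.

```python
def binary_metrics(y_true, y_pred):
    """Compute confusion-matrix counts and derived metrics, treating 1 (spam) as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # spam correctly flagged
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # ham wrongly flagged
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # spam missed
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # ham correctly passed

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn,
            "precision": precision, "recall": recall, "f1": f1}

# Made-up labels: 1 = spam, 0 = ham
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]

metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

Here three of the four flagged emails are truly spam (precision 0.75) and three of the four spam emails are caught (recall 0.75), so the F1 score is also 0.75.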