# **Project Name**    - Email Spam Detection with Machine Learning

##### **Team Name -** Code Red

# **Introduction:**

The challenge of combating spam emails is critical in today’s digital landscape. Spam emails often carry malicious content like scams and phishing attempts, threatening user security. This project aims to create a machine learning-based spam detection system to tackle this issue effectively.

**Project Highlights:**

1. **Data Preprocessing:** We started by cleaning and transforming the email dataset, handling missing values, and converting text data into a format suitable for machine learning models.

2. **Feature Extraction:** We employed various techniques to extract meaningful features from the email data, focusing on attributes like sender address, subject line, and email body content.

3. **Machine Learning Models:** A range of algorithms, including decision trees, support vector machines, and neural networks, were tested to create the most effective spam filter.

4. **Evaluation Metrics:** Accuracy, precision, recall, and F1-score were selected as the primary metrics to evaluate the model’s performance and effectiveness.

5. **Tuning and Optimization:** Hyperparameters were fine-tuned to enhance model accuracy and minimize false positives, ensuring better detection of spam emails.

6. **Validation:** The model was validated using cross-validation and testing on unseen data to assess its ability to generalize.

7. **Deployment:** We considered how to deploy the model for real-world applications, focusing on its potential use in email security systems.

**Objective:**

**To create a machine learning-based spam detection system that accurately classifies emails as spam or ham, leveraging effective data preprocessing and model evaluation techniques.**

# **Data Cleaning:**

##### **Initial Dataset Overview**

- Total rows and columns: 5,572 rows, 5 columns
- Issues identified:
    1. Duplicate rows: 403
    2. Missing values in Unnamed: 2, Unnamed: 3, Unnamed: 4
    3. Non-relevant columns present: Unnamed: 2, Unnamed: 3, Unnamed: 4


##### **Steps Taken**

1. **Removed** duplicate rows to ensure data integrity.
    - Before: 5,572 rows
    - After: 5,169 rows

2. **Dropped irrelevant columns** (Unnamed: 2, Unnamed: 3, Unnamed: 4).

3. **Mapped labels** in the Category column:
    - ham → 0, spam → 1 (binary encoding for machine learning compatibility).


##### **Final Dataset:**

- Rows retained: 5,169 (~92.7% of the original data, meeting the 80% requirement).
- Columns retained: 3 (Category, Message, Spam).


# **Exploratory Data Analysis (EDA)**

##### **Insights Gained**

1. **Distribution of Labels**
    - Spam messages: ~14.2%
    - Ham messages: ~85.7%

2. **Most Common Words in Spam Messages**
    - Frequent words: `free`, `call`, `txt`, `now`


# **Proposed Solution Plan**


##### **Feature Engineering**

1. Converted text data into numerical features using **CountVectorizer** with stop words removed.
2. Ensured proper preprocessing, such as lowercasing, tokenization, and removal of stop words.


##### **Model Selection**

- Selected Naive Bayes Classifier:
    - Lightweight, efficient, and effective for text classification tasks.
    - Uses the bag-of-words representation created by CountVectorizer.

##### **Steps in Model Building**

1. Split dataset into training (75%) and testing (25%) subsets.
2. Trained the Multinomial Naive Bayes classifier on the training set.
3. Evaluated the model on the test set.


# **Evaluation Metrics**

##### **1. Accuracy: ~97.4% on the test set.**
##### **2. Confusion Matrix:**

- True Positives (Spam correctly identified): [Include numbers]
- True Negatives (Ham correctly identified): [Include numbers]


#### **Precision, Recall, F1-Score:**
(Include the classification report or key values.)**

##### **Visualizations**
- ROC-AUC curve to demonstrate model performance. (Include the plot.)

# **Challenges and Improvements**

##### **Challenges Faced:**

- Handling imbalanced dataset: Only 14.2% spam messages.
- Improving the recall rate for spam messages.


##### **Proposed Improvements:**

- Experiment with advanced algorithms like Support Vector Machines or ensemble models.
- Implement hyperparameter tuning for the Naive Bayes classifier.


# **Conclusion**

##### Highlight key takeaways:
- Data cleaning retained 92.7% of the original entries, exceeding the required 80%.
- The Naive Bayes classifier achieved 97.4% accuracy, demonstrating its effectiveness.
