<a href="https://colab.research.google.com/github/virajbhutada/Compozent_ML_AI_OCT23/blob/main/Task1_(BASIC)_Spam_Email_Classifier.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# ML and AI Intern @Compozent OCT23

# Author: Viraj N. Bhutada

# **TASK 1 (BASIC): Spam Email Classifier**


# Overview:
- In this project, I built a tool that spots the spam emails we all receive. Think of it as a virtual guard for your inbox. Using the Multinomial Naive Bayes algorithm, the model learned from labeled examples and correctly classified 95% of the emails in the test set (accuracy score: 0.95).
- This result shows that the model separates useful emails from clutter well on this dataset. The per-class precision and recall reported in Step 6 show which kinds of mistakes it still makes.



# About the Dataset:
- The dataset we are using contains information about 5172 randomly selected email files and their corresponding labels for spam or not-spam classification. The dataset is provided in a CSV file format, making it convenient for analysis and modeling.

- Dataset Details:
https://www.kaggle.com/datasets/balaka18/email-spam-classification-dataset-csv

  - Rows: 5172 (each row represents a single email)
  - Columns: 3002 (1 column for the email name, 3000 columns for the most common words, and 1 column for the label)
  - Labeling: The first column holds the email name (anonymized for privacy) and the last column holds the label: 1 for spam and 0 for not spam. The 3000 columns in between represent the 3000 most common words found in the emails, excluding non-alphabetical characters or words.
  - Access: The dataset (emails.csv) is available at the Kaggle link above.
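
The layout described above can be checked before any modeling. The following is a minimal sketch, assuming the column names `Email No.` and `Prediction` used in the loading code of this notebook. The helper `check_layout` and the toy frame are hypothetical stand-ins and are not part of the original notebook.

```python
import pandas as pd

def check_layout(df, id_col="Email No.", label_col="Prediction"):
    """Return (n_rows, n_word_columns) after verifying the expected layout."""
    # The first column should hold the email name and the last one the label.
    assert df.columns[0] == id_col, "first column should be the email name"
    assert df.columns[-1] == label_col, "last column should be the label"
    # Labels are expected to be binary: 1 for spam, 0 for not spam.
    assert set(df[label_col].unique()) <= {0, 1}, "labels should be 0 or 1"
    return len(df), df.shape[1] - 2

# Toy stand-in for emails.csv (hypothetical values, only two word columns)
toy = pd.DataFrame({
    "Email No.": ["Email 1", "Email 2", "Email 3"],
    "the": [0, 8, 0],
    "to": [0, 13, 0],
    "Prediction": [0, 0, 1],
})
n_rows, n_words = check_layout(toy)
print(n_rows, n_words)
```

If the description above holds, calling `check_layout` on the real file would be expected to return 5172 rows and 3000 word columns.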

#  Project Steps:
1. Data Loading and Exploration
Load the dataset and explore its structure. Understand the columns, the features, and the labels. Familiarize yourself with the dataset's format.

2. Data Preprocessing
Handle any missing values and prepare the data for training. In this project, missing values were imputed using mean values for each feature.

3. Model Building
Implement a machine learning model using the Naive Bayes algorithm (MultinomialNB) for text classification. Train the model using the preprocessed data.

4. Model Evaluation
Evaluate the trained model using various metrics, including accuracy, precision, recall, and F1-score. Understand how well the model performs in classifying emails as spam or not spam.

5. Interpretation of Results
Interpret the results obtained, focusing first on the accuracy score. An accuracy of 0.95 means that the model correctly predicted 95% of the emails in the test dataset, which indicates how well it distinguishes spam from non-spam emails overall.

6. Drawing Insights from Results
With an accuracy of 95% on the test set, the model classified most emails correctly. This outcome suggests the approach is practical for real-world email filtering.


# **Step 1: Import Libraries**

In [13]:
# Import necessary libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report
from sklearn.impute import SimpleImputer


# **Step 2: Load and Explore the Data**

In [19]:
# Load the dataset with individual words as columns
data = pd.read_csv('emails.csv')

# Assuming 'Prediction' column contains the labels (1 for spam, 0 for not spam)
X = data.drop(columns=['Email No.', 'Prediction'])  # Features
y = data['Prediction']  # Labels

# Explore the data
print(data.head())


  Email No.  the  to  ect  and  for  of    a  you  hou  ...  connevey  jay  \
0   Email 1    0   0    1    0    0   0    2    0    0  ...         0    0   
1   Email 2    8  13   24    6    6   2  102    1   27  ...         0    0   
2   Email 3    0   0    1    0    0   0    8    0    0  ...         0    0   
3   Email 4    0   5   22    0    5   1   51    2   10  ...         0    0   
4   Email 5    7   6   17    1    5   2   57    0    9  ...         0    0   

   valued  lay  infrastructure  military  allowing  ff  dry  Prediction  
0       0    0               0         0         0   0    0           0  
1       0    0               0         0         0   1    0           0  
2       0    0               0         0         0   0    0           0  
3       0    0               0         0         0   0    0           0  
4       0    0               0         0         0   1    0           0  

[5 rows x 3002 columns]
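
Since accuracy is easier to interpret when the class proportions are known, a quick look at the label balance may be useful at this point. This is a minimal sketch: the `labels` series below is hypothetical and stands in for `data['Prediction']`, and `majority_baseline` is an illustrative name for the accuracy of always predicting the larger class.

```python
import pandas as pd

# Hypothetical labels standing in for data['Prediction']
labels = pd.Series([0, 0, 0, 1, 0, 1, 0, 0], name="Prediction")

# Number of emails per class, ordered by label value
counts = labels.value_counts().sort_index()
# Fraction of emails per class
shares = labels.value_counts(normalize=True).sort_index()
# Accuracy a model would get by always predicting the larger class
majority_baseline = shares.max()

print(counts.to_dict())
print(round(majority_baseline, 2))
```

A trained model is only informative to the extent that its accuracy exceeds this baseline.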


# **Step 3: Handle Missing Values (Imputation)**

In [20]:
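# Impute missing values using mean imputation (each NaN becomes its column mean).
# Note: the imputer is fit on all of X before the split in Step 4, so any
# imputed means would also reflect rows that end up in the test set.
imputer = SimpleImputer(strategy='mean')
X_imputed = imputer.fit_transform(X)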
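
To make the effect of mean imputation concrete, here is a small worked example on a toy frame. The data and the names `X_toy`, `imputer_toy`, and `X_filled` are hypothetical and chosen so they do not overwrite the notebook's own variables.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy word-count frame with one missing value (hypothetical data)
X_toy = pd.DataFrame({"free": [1.0, 3.0, np.nan], "meeting": [2.0, 0.0, 4.0]})

# Total number of NaN cells before imputation
n_missing = int(X_toy.isna().sum().sum())

imputer_toy = SimpleImputer(strategy="mean")
X_filled = imputer_toy.fit_transform(X_toy)

# The NaN in 'free' becomes the mean of the observed values: (1 + 3) / 2 = 2
print(n_missing)
print(X_filled)
```

Running the same `isna().sum().sum()` check on `X` shows whether imputation changes anything: if the count is 0, the imputer returns the values unchanged.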


# **Step 4: Split the Data into Training and Testing Sets**

In [21]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_imputed, y, test_size=0.2, random_state=42)
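
The split sizes follow from the row count: with 5172 rows and `test_size=0.2`, scikit-learn rounds the test share up, which gives 1035 test rows (the support shown in the classification report) and 4137 training rows. A minimal sketch with placeholder arrays, where `X_demo` and `y_demo` are hypothetical and only the row count matters:

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_rows = 5172  # number of emails in the dataset

# Placeholder features and labels; the values are irrelevant to the sizes
X_demo = np.zeros((n_rows, 1))
y_demo = np.arange(n_rows) % 2

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

print(len(X_tr), len(X_te))  # 4137 1035
```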


# **Step 5: Initialize and Train the Naive Bayes Classifier**

In [22]:
# Initialize and train the Naive Bayes classifier
classifier = MultinomialNB()
classifier.fit(X_train, y_train)
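
To illustrate what `MultinomialNB` learns from word counts, here is a toy example with two word columns. The vocabulary, counts, and the names `X_toy`, `y_toy`, and `clf` are hypothetical and only meant to show the mechanics on a scale that can be checked by hand.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Columns: counts of the words "free" and "meeting" (hypothetical vocabulary)
X_toy = np.array([
    [3, 0],  # spam: "free" appears often
    [2, 0],  # spam
    [0, 2],  # not spam: "meeting" appears often
    [0, 3],  # not spam
])
y_toy = np.array([1, 1, 0, 0])  # 1 = spam, 0 = not spam

# The default alpha=1.0 applies Laplace smoothing, so a word never seen in a
# class still gets a small non-zero probability for that class.
clf = MultinomialNB()
clf.fit(X_toy, y_toy)

# An email dominated by "free" is predicted as spam, one dominated by
# "meeting" as not spam.
preds = clf.predict(np.array([[4, 0], [0, 4]]))
print(preds)  # [1 0]
```

The real model works the same way, only with 3000 word columns instead of two.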


# **Step 6: Make Predictions and Evaluate the Model**

In [23]:
# Make predictions on the test set
predictions = classifier.predict(X_test)

# Calculate accuracy and display results
accuracy = accuracy_score(y_test, predictions)
print(f'Accuracy: {accuracy:.2f}')

# Display classification report
print(classification_report(y_test, predictions))


Accuracy: 0.95
              precision    recall  f1-score   support

           0       0.98      0.95      0.97       739
           1       0.89      0.96      0.92       296

    accuracy                           0.95      1035
   macro avg       0.94      0.96      0.95      1035
weighted avg       0.96      0.95      0.96      1035



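An accuracy score of 0.95 means that the model correctly classified 95% of the 1035 test emails. The per-class figures add detail: for spam (class 1), recall is 0.96 and precision is 0.89, so most spam is caught, but about 11% of the emails flagged as spam are actually legitimate.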
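
The figures in the classification report are related to each other, and the relationships can be checked by hand. The sketch below uses only the rounded values printed in the report, so the results are approximate. The variable names are illustrative.

```python
# Values copied from the classification report (rounded to two decimals)
precision_spam, recall_spam = 0.89, 0.96
recall_ham = 0.95
support_ham, support_spam = 739, 296

# F1 is the harmonic mean of precision and recall
f1_spam = 2 * precision_spam * recall_spam / (precision_spam + recall_spam)

# Accuracy equals the support-weighted average of the per-class recalls
accuracy_est = (recall_ham * support_ham + recall_spam * support_spam) / (
    support_ham + support_spam
)

print(round(f1_spam, 2))       # 0.92
print(round(accuracy_est, 2))  # 0.95
```

Both values agree with the report, which gives an F1-score of 0.92 for class 1 and an overall accuracy of 0.95.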


# Conclusion

In conclusion, the spam email classifier reached an accuracy of 0.95 on the test set, which shows that it identifies unwanted emails reliably on this dataset. The precision, recall, and F1-score values give a fuller picture of its performance than accuracy alone: both classes score 0.89 or higher on every metric. Using the Multinomial Naive Bayes algorithm on simple word counts, I've created a tool that distinguishes spam from legitimate messages effectively.