
This notebook applies Naive Bayes and Support Vector Machine (SVM) classifiers to spam detection. It uses the **SMS Spam Collection Dataset**, a widely used corpus of SMS messages, each labeled as either "spam" or "ham" (not spam).

### Spam Detection with Naive Bayes and SVM

In [None]:
!pip install pandas scikit-learn nltk requests



In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import nltk
from nltk.corpus import stopwords

In [None]:
# Download stopwords
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

### Explanation of Each Section

- **Data Loading and Preparation:** The dataset is loaded from a URL, and the relevant columns are selected. The labels are mapped to binary values for the classification task.

In [None]:
# Download the SMS Spam Collection from the UCI Machine Learning Repository.
# The archive contains a tab-separated file named 'SMSSpamCollection' with no
# header row: the first field is the label and the second is the message.
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/00228/smsspamcollection.zip'

# Download the archive and extract it into the current directory
import requests, zipfile, io
r = requests.get(url)
r.raise_for_status()  # Stop early if the download failed
z = zipfile.ZipFile(io.BytesIO(r.content))
z.extractall()

# Read the extracted file into a pandas DataFrame
data = pd.read_csv('SMSSpamCollection', sep='\t', header=None, names=['label', 'message'])

In [None]:
# Preview the data
print(data.head())

  label                                            message
0   ham  Go until jurong point, crazy.. Available only ...
1   ham                      Ok lar... Joking wif u oni...
2  spam  Free entry in 2 a wkly comp to win FA Cup fina...
3   ham  U dun say so early hor... U c already then say...
4   ham  Nah I don't think he goes to usf, he lives aro...


- **Text Vectorization:** Either `CountVectorizer` or `TfidfVectorizer` converts the text messages into numerical features. NLTK's English stop-word list filters out common words that contribute little to distinguishing spam from ham.
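To make the vectorization step concrete, here is a minimal pure-Python sketch of the bag-of-words idea. It is not scikit-learn's implementation: the tiny `STOP_WORDS` set and the `tokenize` and `bag_of_words` helpers are hypothetical names chosen for illustration, and the regex only imitates the "two or more word characters" rule that `CountVectorizer` uses by default.

```python
import re
from collections import Counter

# Hypothetical, tiny stop-word list for illustration (NLTK's English list is much longer).
STOP_WORDS = {"a", "the", "to", "you", "is", "in"}

def tokenize(text):
    """Lowercase the text, keep runs of two or more word characters, drop stop words."""
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

def bag_of_words(messages):
    """Return (sorted vocabulary, list of count vectors), one vector per message."""
    counts = [Counter(tokenize(m)) for m in messages]
    vocabulary = sorted(set().union(*counts))
    vectors = [[c[word] for word in vocabulary] for c in counts]
    return vocabulary, vectors

messages = ["Win a free prize, free entry!", "Are you free to call in the evening?"]
vocabulary, vectors = bag_of_words(messages)
print(vocabulary)  # ['are', 'call', 'entry', 'evening', 'free', 'prize', 'win']
print(vectors)     # [[0, 0, 1, 0, 2, 1, 1], [1, 1, 0, 1, 1, 0, 0]]
```

Each message becomes one row of word counts over a shared vocabulary, which is the kind of matrix the classifiers below are trained on.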

In [None]:
# Data cleaning
# The column names were already set by read_csv; reassigning them is a harmless safeguard.
data.columns = ['label', 'message']

# Map labels to binary values: ham -> 0, spam -> 1
data['label'] = data['label'].map({'ham': 0, 'spam': 1})
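
One thing worth checking after a dictionary-based mapping is whether every label was actually covered, since `Series.map` yields `NaN` for any value missing from the dictionary. The sketch below uses a small made-up frame (`toy`, not the real dataset) with one deliberately mis-cased label to show the failure mode.

```python
import pandas as pd

# Toy frame with one unexpected label ("Spam" with a capital S) to show the failure mode.
toy = pd.DataFrame({
    "label": ["ham", "spam", "Spam", "ham"],
    "message": ["see you soon", "win cash now", "free prize", "ok lar"],
})

mapped = toy["label"].map({"ham": 0, "spam": 1})

# Series.map leaves NaN wherever the key is missing from the dictionary.
unmapped = toy.loc[mapped.isna(), "label"].unique().tolist()
print("Unmapped labels:", unmapped)

# Class balance of the rows that did map cleanly.
print(mapped.dropna().astype(int).value_counts().to_dict())
```

If `unmapped` is empty on the real data, the binary labels are safe to pass to `train_test_split`.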

In [None]:
# Split the dataset into training and testing sets
X = data['message']
y = data['label']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
# Text Vectorization
# You can use either CountVectorizer or TfidfVectorizer
vectorizer = CountVectorizer(stop_words=stopwords.words('english'))
# vectorizer = TfidfVectorizer(stop_words=stopwords.words('english'))

- **Naive Bayes Classifier:** A Multinomial Naive Bayes classifier is trained on the training data, and predictions are made on the test set. The accuracy and classification report are printed to evaluate the model.
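
For intuition about what a multinomial Naive Bayes model estimates, the following is a from-scratch sketch on a hand-made, pre-tokenized toy set. It is an illustration under simplifying assumptions, not scikit-learn's code: the helper names `train_multinomial_nb` and `predict` are hypothetical, and `alpha=1.0` is Laplace smoothing, the same default value `MultinomialNB` uses.

```python
import math
from collections import Counter

def train_multinomial_nb(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed per-class word log likelihoods."""
    classes = sorted(set(labels))
    vocabulary = sorted({word for doc in docs for word in doc})
    log_prior, log_likelihood = {}, {}
    for c in classes:
        class_docs = [doc for doc, y in zip(docs, labels) if y == c]
        log_prior[c] = math.log(len(class_docs) / len(docs))
        word_counts = Counter(word for doc in class_docs for word in doc)
        total = sum(word_counts.values()) + alpha * len(vocabulary)
        log_likelihood[c] = {
            word: math.log((word_counts[word] + alpha) / total) for word in vocabulary
        }
    return log_prior, log_likelihood

def predict(doc, log_prior, log_likelihood):
    """Pick the class with the highest log posterior; unseen words are skipped."""
    scores = {
        c: log_prior[c] + sum(log_likelihood[c][w] for w in doc if w in log_likelihood[c])
        for c in log_prior
    }
    return max(scores, key=scores.get)

# Toy, hand-made training set (0 = ham, 1 = spam), already tokenized.
docs = [["win", "free", "prize"], ["free", "cash", "now"],
        ["lunch", "at", "noon"], ["call", "me", "at", "noon"]]
labels = [1, 1, 0, 0]

log_prior, log_likelihood = train_multinomial_nb(docs, labels)
print(predict(["free", "prize", "now"], log_prior, log_likelihood))  # 1 (spam)
print(predict(["lunch", "at", "noon"], log_prior, log_likelihood))   # 0 (ham)
```

Working in log space avoids multiplying many small probabilities, and the smoothing term keeps a word that never appeared in one class from forcing that class's probability to zero.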

In [None]:
# Fit and transform the training data and transform the test data
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)
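
The split between `fit_transform` and `transform` matters because the vocabulary is learned from the training messages only. The sketch below, on made-up messages and with a separate `toy_vectorizer` so it does not touch the notebook's own `vectorizer`, shows that words unseen during fitting are simply dropped at transform time.

```python
from sklearn.feature_extraction.text import CountVectorizer

train_messages = ["free prize inside", "call me later"]
test_messages = ["lottery winner", "free call"]

toy_vectorizer = CountVectorizer()
# fit_transform learns the vocabulary from the training messages only.
train_counts = toy_vectorizer.fit_transform(train_messages)
# transform reuses that vocabulary; words it has never seen are dropped.
test_counts = toy_vectorizer.transform(test_messages)

print(sorted(toy_vectorizer.vocabulary_))
print(test_counts.toarray())
```

The first test message contains no known words, so its row is all zeros; a classifier then has nothing but its class priors to go on for such a message.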

In [None]:
# 1. Naive Bayes Classifier
nb_classifier = MultinomialNB()
nb_classifier.fit(X_train_vec, y_train)

In [None]:
# Predict on the test set
y_pred_nb = nb_classifier.predict(X_test_vec)

In [None]:
# Evaluation of Naive Bayes Classifier
print("Naive Bayes Classifier Results:")
print(f"Accuracy: {accuracy_score(y_test, y_pred_nb)}")
print("Classification Report:")
print(classification_report(y_test, y_pred_nb))
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred_nb))

Naive Bayes Classifier Results:
Accuracy: 0.9847533632286996
Classification Report:
              precision    recall  f1-score   support

           0       0.99      0.99      0.99       966
           1       0.95      0.94      0.94       149

    accuracy                           0.98      1115
   macro avg       0.97      0.97      0.97      1115
weighted avg       0.98      0.98      0.98      1115

Confusion Matrix:
[[958   8]
 [  9 140]]
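
The spam row of the classification report can be re-derived by hand from the confusion matrix above. The sketch below assumes scikit-learn's layout, where rows are true labels and columns are predicted labels, both in the order 0 (ham), 1 (spam).

```python
# Confusion matrix reported above: rows are true labels, columns are predictions.
tn, fp = 958, 8    # true ham:  predicted ham, predicted spam
fn, tp = 9, 140    # true spam: predicted ham, predicted spam

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # of messages flagged as spam, the share that is spam
recall = tp / (tp + fn)      # of actual spam, the share that was caught
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.4f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

The rounded values agree with the report: precision 0.95, recall 0.94, and F1 0.94 for class 1, with an overall accuracy of about 0.985.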


- **Support Vector Machine Classifier:** An SVM classifier with a linear kernel is trained similarly, and its performance is evaluated in the same way.
    

In [None]:
# 2. Support Vector Machine Classifier
svm_classifier = SVC(kernel='linear')
svm_classifier.fit(X_train_vec, y_train)

In [None]:
# Predict on the test set
y_pred_svm = svm_classifier.predict(X_test_vec)

In [None]:
# Evaluation of SVM Classifier
print("\nSupport Vector Machine Classifier Results:")
print(f"Accuracy: {accuracy_score(y_test, y_pred_svm)}")
print("Classification Report:")
print(classification_report(y_test, y_pred_svm))
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred_svm))


Support Vector Machine Classifier Results:
Accuracy: 0.9838565022421525
Classification Report:
              precision    recall  f1-score   support

           0       0.98      1.00      0.99       966
           1       0.99      0.89      0.94       149

    accuracy                           0.98      1115
   macro avg       0.98      0.95      0.96      1115
weighted avg       0.98      0.98      0.98      1115

Confusion Matrix:
[[964   2]
 [ 16 133]]


In [None]:
# Test with Sample Email
sample_email = ["Congratulations! You've won a $1000 Walmart gift card. Click here to claim your prize now!"]
sample_email_vec = vectorizer.transform(sample_email)

# Predict using both classifiers
nb_prediction = nb_classifier.predict(sample_email_vec)
svm_prediction = svm_classifier.predict(sample_email_vec)

# Display the results for the sample email
print("\nSample Email Prediction:")
print(f"Naive Bayes Prediction: {'Spam' if nb_prediction[0] == 1 else 'Ham'}")
print(f"SVM Prediction: {'Spam' if svm_prediction[0] == 1 else 'Ham'}")


Sample Email Prediction:
Naive Bayes Prediction: Spam
SVM Prediction: Spam
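
Every prediction on raw text has to go through `vectorizer.transform` first, as in the cell above. One way to bundle the two steps is scikit-learn's `make_pipeline`. The sketch below is an optional variation, not part of the original notebook: it trains on a tiny made-up data set, and the names `spam_pipeline`, `messages`, and `labels` are hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Tiny hand-made training set (0 = ham, 1 = spam), for illustration only.
messages = [
    "win a free prize now", "claim your free cash reward", "free entry win cash",
    "are we still on for lunch", "call me when you get home", "see you at the meeting",
]
labels = [1, 1, 1, 0, 0, 0]

# The pipeline runs the fitted vectorizer before the classifier on every call.
spam_pipeline = make_pipeline(CountVectorizer(), SVC(kernel="linear"))
spam_pipeline.fit(messages, labels)

# Raw strings go in directly; no separate transform step is needed.
predictions = spam_pipeline.predict(["free cash prize", "lunch at home"])
print(predictions)
```

With a pipeline, the vectorizer and classifier are fitted and applied together, so a new message cannot accidentally be scored with a mismatched vocabulary.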


In [None]:
# Test with a second, more ambiguous sample message.
# To try a clearly non-spam message instead, replace `check` with e.g.
# ["Hey, are we still on for dinner tonight? Let me know what time works for you."]
check = ["You will get credit of $20 billion cash price gift"]
check_vec = vectorizer.transform(check)

# Predict using both classifiers
nb_prediction_check = nb_classifier.predict(check_vec)
svm_prediction_check = svm_classifier.predict(check_vec)

# Display the results for the second sample message
print("\nSecond Sample Message Prediction:")
print(f"Naive Bayes Prediction: {'Spam' if nb_prediction_check[0] == 1 else 'Ham'}")
print(f"SVM Prediction: {'Spam' if svm_prediction_check[0] == 1 else 'Ham'}")


Second Sample Message Prediction:
Naive Bayes Prediction: Spam
SVM Prediction: Ham
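
The two classifiers disagree on this last message. To see how close a call it is, one option is to look at scores rather than hard labels: `predict_proba` for Naive Bayes and `decision_function` for the SVM. The sketch below uses a small made-up training set and separate `toy_` objects, so the numbers it prints say nothing about the real models above.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC

# Tiny hand-made training set (0 = ham, 1 = spam), for illustration only.
messages = [
    "win a free prize now", "claim your free cash reward", "free entry win cash",
    "are we still on for lunch", "call me when you get home", "see you at the meeting",
]
labels = [1, 1, 1, 0, 0, 0]

toy_vectorizer = CountVectorizer()
X_toy = toy_vectorizer.fit_transform(messages)

toy_nb = MultinomialNB().fit(X_toy, labels)
toy_svm = SVC(kernel="linear").fit(X_toy, labels)

# A message that mixes words from both classes.
borderline = toy_vectorizer.transform(["you will get cash at the meeting"])

# Column order of predict_proba follows toy_nb.classes_, here [0, 1].
spam_probability = toy_nb.predict_proba(borderline)[0, 1]
# Signed score relative to the separating hyperplane; positive values favour class 1.
svm_margin = toy_svm.decision_function(borderline)[0]

print(f"Naive Bayes P(spam) = {spam_probability:.3f}")
print(f"SVM decision value  = {svm_margin:.3f}")
```

A probability near 0.5 or a decision value near 0 indicates a borderline message, which is where two models trained on the same features are most likely to give different labels.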
