<a href="https://colab.research.google.com/github/shahzadahmad3/Natural-Language-Processing/blob/main/Spam_Email_Classifier.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [72]:
import pandas as pd
# Load dataset
# Updated URL to point to raw content on GitHub
url = "https://raw.githubusercontent.com/bigmlcom/python/master/data/spam.csv"
df = pd.read_csv(url, encoding='utf-8', sep='\t')  # The file is tab-delimited despite its .csv extension
df.head()

Unnamed: 0,Type,Message
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [73]:
# Encode the labels as integers: ham -> 0, spam -> 1
df['Type'] = df['Type'].map({'ham': 0, 'spam': 1})
# Check class balance
df['Type'].value_counts()

Type,count
0,577
1,79
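
The counts above show a strongly imbalanced dataset. As a rough sanity check, the sketch below works out the spam share and the accuracy a trivial classifier would reach by always predicting ham. The counts are copied from the output above; the variable names are illustrative and not part of the original notebook.

```python
# Class counts copied from the value_counts() output above
ham_count = 577
spam_count = 79
total = ham_count + spam_count

# Share of spam messages in the whole dataset
spam_share = spam_count / total

# Accuracy of a trivial model that labels every message as ham (class 0)
majority_baseline = ham_count / total

print(f"Total messages: {total}")
print(f"Spam share: {spam_share:.3f}")
print(f"Always-ham accuracy: {majority_baseline:.3f}")
```

Any model trained on this data is better judged against that always-ham baseline than against zero.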


In [74]:
import string
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
import nltk
nltk.download('punkt_tab') # Download the Punkt tokenizer data used by word_tokenize
nltk.download('wordnet') # Download wordnet for lemmatization
nltk.download('stopwords') # Download stopwords for preprocessing

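def preprocessing(text):
  # Lowercase so that stopword matching is case-insensitive
  text = text.lower()
  # Split the message into word and punctuation tokens
  tokens = word_tokenize(text)
  # Drop English stopwords (a set makes the membership test fast)
  stopword_set = set(stopwords.words('english'))
  tokens = [word for word in tokens if word not in stopword_set]
  # Reduce each remaining token to its WordNet lemma (treated as a noun by default)
  lemmatizer = WordNetLemmatizer()
  lemmas = [lemmatizer.lemmatize(word) for word in tokens]
  # Join the lemmas back into a single string, since TfidfVectorizer expects raw text
  return ' '.join(lemmas)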

# Apply the preprocess function to each individual message in the 'Message' column
preprocessed_df = df['Message'].apply(preprocessing)


[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
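
The cell above imports `string` but never uses it, and `word_tokenize` keeps punctuation marks as separate tokens. One possible use of that import, shown here as a hypothetical helper named `strip_punctuation` that is not part of the original notebook, is to delete punctuation before tokenizing. This is a minimal sketch; it may make little difference downstream, because `TfidfVectorizer`'s default token pattern already skips standalone punctuation.

```python
import string

# Translation table that deletes every ASCII punctuation character
_PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def strip_punctuation(text):
  """Hypothetical helper: remove ASCII punctuation from a message."""
  return text.translate(_PUNCT_TABLE)

# Sample message taken from the df.head() output
sample = "Ok lar... Joking wif u oni..."
cleaned = strip_punctuation(sample)
print(cleaned)
```

Note that this also deletes apostrophes, so "don't" becomes "dont"; whether that helps depends on how the stopword list spells contractions.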


In [75]:
# Feature Extraction using TF-IDF
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(preprocessed_df)
y=df['Type']
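
In the cell above the vectorizer is fitted on every message before the data is split, so the vocabulary and IDF weights are partly learned from messages that later land in the test set. A common alternative, sketched here on a tiny made-up corpus rather than the notebook's data, is to wrap the vectorizer and classifier in a scikit-learn `Pipeline` so that `fit` only ever sees training text. The toy messages, labels, and variable names are assumptions for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy data, invented for illustration (1 = spam, 0 = ham)
train_texts = [
    "free entry win cash prize now",
    "win free ticket call now",
    "ok see you at lunch",
    "are you coming home tonight",
]
train_labels = [1, 1, 0, 0]
test_texts = ["free cash prize", "see you tonight"]

# The vectorizer is a pipeline step, so its vocabulary and IDF weights
# are learned from the training texts only
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", MultinomialNB()),
])
pipeline.fit(train_texts, train_labels)

# predict() reuses the fitted vocabulary; words unseen in training are ignored
predictions = pipeline.predict(test_texts)
print(predictions)
```

With the real data, the same pipeline could be fitted on the training portion of `preprocessed_df` after splitting the text instead of the TF-IDF matrix.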

In [76]:
# Train Machine Learning Models
# Split Data into Training & Testing Sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
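
With only 79 spam messages out of 656, a purely random split can leave the test set with a spam share that drifts from the overall one. Passing `stratify` to `train_test_split` keeps the class proportions roughly equal in both splits. The sketch below uses synthetic labels with the same class counts as the notebook's data, not the real messages; `indices`, `labels`, `lab_train`, and `lab_test` are stand-in names.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with the notebook's class counts: 577 ham (0), 79 spam (1)
labels = np.array([0] * 577 + [1] * 79)
# Stand-in for the feature matrix: one row index per message
indices = np.arange(len(labels)).reshape(-1, 1)

# stratify=labels preserves the ham/spam ratio in both splits
idx_train, idx_test, lab_train, lab_test = train_test_split(
    indices, labels, test_size=0.2, random_state=42, stratify=labels
)

print("Overall spam share:", labels.mean())
print("Train spam share:  ", lab_train.mean())
print("Test spam share:   ", lab_test.mean())
```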

In [77]:
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report

# Candidate classifiers, all with default hyperparameters
model_lr = LogisticRegression()
model_nb = MultinomialNB()
model_svm = SVC()

models = [model_lr, model_nb, model_svm]

# Fit each model on the training split and report metrics on the test split
for model in models:
  model.fit(X_train, y_train)
  y_pred = model.predict(X_test)
  accuracy = accuracy_score(y_test, y_pred)
  print(f"Accuracy for {model}: {accuracy}")
  print(classification_report(y_test, y_pred))

Accuracy for LogisticRegression(): 0.8787878787878788
              precision    recall  f1-score   support

           0       0.88      1.00      0.94       116
           1       0.00      0.00      0.00        16

    accuracy                           0.88       132
   macro avg       0.44      0.50      0.47       132
weighted avg       0.77      0.88      0.82       132

Accuracy for MultinomialNB(): 0.8863636363636364
              precision    recall  f1-score   support

           0       0.89      1.00      0.94       116
           1       1.00      0.06      0.12        16

    accuracy                           0.89       132
   macro avg       0.94      0.53      0.53       132
weighted avg       0.90      0.89      0.84       132

Accuracy for SVC(): 0.8863636363636364
              precision    recall  f1-score   support

           0       0.89      1.00      0.94       116
           1       1.00      0.06      0.12        16

    accuracy                           0

  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))


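Based on these updated results, Multinomial Naive Bayes and the Support Vector Machine report the same figures on this test split and score only marginally better than Logistic Regression: an accuracy of about 0.886 versus 0.879, and a spam F1-score of 0.12 versus 0.00. None of the three is yet a usable spam filter, though. Logistic Regression labels every test message as ham, while the other two reach a spam recall of only 0.06, which on 16 spam messages corresponds to a single message caught. Since 116 of the 132 test messages are ham, predicting ham for everything already yields 0.879 accuracy, so accuracy alone says little here; spam recall and F1 are the more informative figures for this imbalanced dataset.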
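
The per-class figures in the reports above can be reproduced by hand from confusion-matrix counts. The counts below are inferred from the reported support, precision, and recall for MultinomialNB (all 116 ham messages classified correctly, 1 of 16 spam messages caught). They are not printed anywhere in the notebook, so treat them as a reconstruction.

```python
# Confusion counts inferred from the MultinomialNB report (spam = positive class)
tn, fp = 116, 0   # ham predicted as ham / ham predicted as spam
fn, tp = 15, 1    # spam predicted as ham / spam predicted as spam

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.4f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

For Logistic Regression the spam counts would be `tp = 0` and `fn = 16`, giving an accuracy of 116/132. Spam precision is then 0/0, which is undefined; scikit-learn reports it as 0.00 and emits the `_warn_prf` warning seen in the output.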