---

## **Spam Mail Detector**

---

### **Objective:** Build a classifier that distinguishes between spam and non-spam (ham) emails using textual data. 

### **Dataset:** Public datasets such as the SMS Spam Collection (UCI) or the Enron Email Dataset. This notebook uses the SMS Spam Collection.

---

**1. Load the messages and labels (spam or ham).**

In [18]:
import pandas as pd

df = pd.read_csv("SMSSpamCollection", sep="\t", header=None, names=["label", "message"])

print(df)

     label                                            message
0      ham  Go until jurong point, crazy.. Available only ...
1      ham                      Ok lar... Joking wif u oni...
2     spam  Free entry in 2 a wkly comp to win FA Cup fina...
3      ham  U dun say so early hor... U c already then say...
4      ham  Nah I don't think he goes to usf, he lives aro...
...    ...                                                ...
5567  spam  This is the 2nd time we have tried 2 contact u...
5568   ham               Will ü b going to esplanade fr home?
5569   ham  Pity, * was in mood for that. So...any other s...
5570   ham  The guy did some bitching but I acted like i'd...
5571   ham                         Rofl. Its true to its name

[5572 rows x 2 columns]
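Before modelling, it can help to see how many messages fall into each class, since a strong imbalance affects how the metrics in step 6 should be read. The sketch below uses only the standard library and a three-row in-memory stand-in for the file (two messages are copied from the output above, the spam row is made up), so the counts are illustrative only.

```python
import csv
import io
from collections import Counter

# Illustrative stand-in for the SMSSpamCollection file: label, tab, message.
sample = (
    "ham\tWill ü b going to esplanade fr home?\n"
    "ham\tRofl. Its true to its name\n"
    "spam\tExample promotional text (made up for this sketch)\n"
)

# QUOTE_NONE makes the csv module treat quote characters inside a message
# as ordinary text instead of as field delimiters.
rows = list(csv.reader(io.StringIO(sample), delimiter="\t", quoting=csv.QUOTE_NONE))

# Count how many rows carry each label.
label_counts = Counter(label for label, _ in rows)
print(label_counts)
```

On the real DataFrame, `df["label"].value_counts()` gives the same kind of summary.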


In [19]:
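from sklearn.feature_extraction.text import CountVectorizer

# Quick bag-of-words pass over the raw messages.
# X is not used again; step 3 rebuilds the features from the cleaned text.
vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(df["message"])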


---

**2. Preprocess the text (lowercasing, remove stopwords, tokenization).**

In [20]:
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

In [24]:
#Download stopwords & punkt tokenizer
nltk.download("stopwords")
nltk.download("punkt")
nltk.download("punkt_tab")

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\lucky\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\lucky\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\lucky\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt_tab.zip.


True

In [25]:
stop_words = set(stopwords.words("english"))

In [27]:
def preprocess_text(text):
    
    #1. Lowercasing
    text = text.lower()
    
    #2. Tokenization
    tokens = word_tokenize(text)
    
    #3. Remove Stopwords
    tokens = [word for word in tokens if word not in stop_words]
    
    #4. join tokens back to a string
    return " ".join(tokens)

# Apply preprocessing to messages
df["clean_message"] = df["message"].apply(preprocess_text)

print(df[["message", "clean_message"]].head())

                                             message  \
0  Go until jurong point, crazy.. Available only ...   
1                      Ok lar... Joking wif u oni...   
2  Free entry in 2 a wkly comp to win FA Cup fina...   
3  U dun say so early hor... U c already then say...   
4  Nah I don't think he goes to usf, he lives aro...   

                                       clean_message  
0  go jurong point , crazy .. available bugis n g...  
1                    ok lar ... joking wif u oni ...  
2  free entry 2 wkly comp win fa cup final tkts 2...  
3        u dun say early hor ... u c already say ...  
4       nah n't think goes usf , lives around though  
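The cleaned output above still contains punctuation tokens such as `,` and `..`, and `re` is imported but never used. One possible variant, sketched here with the standard library only, tokenizes with a regular expression so that punctuation is dropped. The function name `preprocess_text_regex` and the tiny `STOP_WORDS` set are assumptions made for this sketch; the notebook itself uses NLTK's tokenizer and English stopword list.

```python
import re

# Tiny stand-in stopword list (assumption); the notebook uses NLTK's English list.
STOP_WORDS = {"i", "he", "to", "so", "then", "until", "only", "in", "a", "the", "is", "it"}

def preprocess_text_regex(text):
    """Lowercase, keep runs of letters and digits only, drop stopwords."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return " ".join(t for t in tokens if t not in STOP_WORDS)

cleaned = preprocess_text_regex("Go until jurong point, crazy..")
print(cleaned)
```

This pattern splits contractions (`don't` becomes `don` and `t`) and discards non-ASCII letters such as `ü`, so it is a trade-off rather than a strict improvement.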


---

**3. Convert text into numeric features (Bag of Words or TF-IDF).**

In [29]:
# Bag of Words
from sklearn.feature_extraction.text import CountVectorizer

bow_vect = CountVectorizer()

x_bow = bow_vect.fit_transform(df["clean_message"])

print("Shape of BoW matrix:", x_bow.shape)
print("Example feature names:", bow_vect.get_feature_names_out()[:20])

Shape of BoW matrix: (5572, 8645)
Example feature names: ['00' '000' '000pes' '008704050406' '0089' '0121' '01223585236'
 '01223585334' '0125698789' '02' '0207' '02072069400' '02073162414'
 '02085076972' '021' '03' '04' '0430' '05' '050703']
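The step title also mentions TF-IDF, which the cell above does not show. As a rough illustration of the idea, the sketch below computes TF-IDF weights by hand on three made-up documents. It assumes a smoothed IDF of the form ln((1 + n) / (1 + df)) + 1, the form scikit-learn documents for its default settings, and it omits the row normalization that library also applies.

```python
import math
from collections import Counter

# Three made-up, already-cleaned documents.
docs = [
    "free entry win cup",
    "ok lar joking",
    "free win free",
]

tokenized = [d.split() for d in docs]
vocab = sorted({t for doc in tokenized for t in doc})

# Document frequency: number of documents that contain each term.
df_counts = Counter(t for doc in tokenized for t in set(doc))
n_docs = len(docs)

# Smoothed inverse document frequency: rarer terms get larger weights.
idf = {t: math.log((1 + n_docs) / (1 + df_counts[t])) + 1 for t in vocab}

def tfidf_row(tokens):
    """Raw term count times IDF for every vocabulary term."""
    counts = Counter(tokens)
    return [counts[t] * idf[t] for t in vocab]

tfidf = [tfidf_row(doc) for doc in tokenized]
```

In the notebook, swapping `CountVectorizer` for scikit-learn's `TfidfVectorizer` would be the usual way to get such features; the reported scores would then change.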


---

**4. Split into train/test sets.**

In [30]:
from sklearn.model_selection import train_test_split

x = x_bow
y = df["label"]

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

print("Training set shape:", x_train.shape)
print("Test set shape:", x_test.shape)

Training set shape: (4457, 8645)
Test set shape: (1115, 8645)
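The split above is purely random, so the spam share in the test set can drift from the share in the full data. A stratified split keeps the class ratio the same on both sides. The sketch below shows the idea in plain Python on made-up labels; `stratified_split` is a hypothetical helper, not part of the notebook.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, test_idx) with roughly equal label ratios in both."""
    by_label = defaultdict(list)
    for i, label in enumerate(labels):
        by_label[label].append(i)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for label in sorted(by_label):
        idx = by_label[label]
        rng.shuffle(idx)  # shuffle within each class
        n_test = round(len(idx) * test_size)
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return sorted(train_idx), sorted(test_idx)

labels = ["ham"] * 80 + ["spam"] * 20  # made-up labels for illustration
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
```

With scikit-learn, passing `stratify=y` to `train_test_split` requests the same behaviour; doing so would change the shapes and scores reported in this notebook.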


---

**5. Train a simple model (Naive Bayes, Logistic Regression).** 

In [34]:
#Naive Bayes
from sklearn.naive_bayes import MultinomialNB

nb = MultinomialNB()

nb.fit(x_train, y_train)
y_pred_nb = nb.predict(x_test)

In [36]:
#Logistic Regression
from sklearn.linear_model import LogisticRegression

l_reg = LogisticRegression()

l_reg.fit(x_train, y_train)
y_pred_l_reg = l_reg.predict(x_test)
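To make the Naive Bayes step less of a black box, here is a minimal hand-written multinomial Naive Bayes with additive (Laplace) smoothing, trained on four made-up messages. The names `train_nb` and `predict_nb` are hypothetical, and this is a sketch of the general technique rather than a reproduction of scikit-learn's `MultinomialNB` internals.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, alpha=1.0):
    """Fit multinomial Naive Bayes with additive (Laplace) smoothing."""
    class_docs = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc.split())
    vocab = {w for counts in word_counts.values() for w in counts}

    model = {}
    for label in class_docs:
        total = sum(word_counts[label].values())
        denom = total + alpha * len(vocab)
        model[label] = {
            # log P(class)
            "log_prior": math.log(class_docs[label] / len(labels)),
            # log P(word | class), smoothed so unseen pairs are never zero
            "log_lik": {
                w: math.log((word_counts[label][w] + alpha) / denom) for w in vocab
            },
        }
    return model

def predict_nb(model, doc):
    """Pick the class with the highest log posterior; unknown words are skipped."""
    scores = {}
    for label, params in model.items():
        scores[label] = params["log_prior"] + sum(
            params["log_lik"][w] for w in doc.split() if w in params["log_lik"]
        )
    return max(scores, key=scores.get)

# Made-up training messages for illustration.
train_docs = ["free prize win", "win free entry", "ok see you home", "going home now"]
train_labels = ["spam", "spam", "ham", "ham"]

model = train_nb(train_docs, train_labels)
prediction = predict_nb(model, "free entry prize")
print(prediction)
```

Working in log space avoids multiplying many small probabilities, which would otherwise underflow on longer messages.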

---

**6. Measure performance with accuracy, precision, or F1 score.**

In [37]:
from sklearn.metrics import accuracy_score, precision_score, f1_score

In [40]:
#Naive Bayes model performance
print("Accuracy:", accuracy_score(y_test, y_pred_nb))
print("Precision:", precision_score(y_test, y_pred_nb, average="weighted"))
print("F1 Score:", f1_score(y_test, y_pred_nb, average="weighted"))

Accuracy: 0.979372197309417
Precision: 0.9801305485955697
F1 Score: 0.9796262882489709


In [42]:
#Logistic Regression model performance
print("Accuracy:", accuracy_score(y_test, y_pred_l_reg))
print("Precision:", precision_score(y_test, y_pred_l_reg, average="weighted"))
print("F1 Score:", f1_score(y_test, y_pred_l_reg, average="weighted"))

Accuracy: 0.9856502242152466
Precision: 0.9858840291160164
F1 Score: 0.9853020696947724
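The scores above use `average="weighted"`, which weights each class by its size, so the larger class dominates the result. For a spam filter, the precision and recall of the spam class alone are often more informative. The sketch below computes them by hand on a made-up ten-message example; `spam_metrics` is a hypothetical helper.

```python
def spam_metrics(y_true, y_pred, positive="spam"):
    """Precision, recall and F1 for the positive class only."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Made-up labels: 8 ham, 2 spam; the classifier misses one spam message.
y_true = ["ham"] * 8 + ["spam"] * 2
y_pred = ["ham"] * 8 + ["spam", "ham"]

precision, recall, f1 = spam_metrics(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy, precision, recall, f1)
```

Here accuracy is 0.9 even though half of the spam is missed (recall 0.5), which is the kind of gap a class-weighted average can hide. In scikit-learn, `precision_score(y_test, y_pred_nb, pos_label="spam")` and `recall_score` with the same `pos_label` report the spam-only figures.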
