### 1️⃣ Lowercasing
### 2️⃣ Removing URLs / Links
### 3️⃣ Removing Numbers / Digits
### 4️⃣ Removing Punctuation
### 5️⃣ Tokenization
### 6️⃣ Stopword Removal
### 7️⃣ Lemmatization (or Stemming)
### 8️⃣ Removing Extra Whitespace
### 9️⃣ Optionally: Handling Emojis / Special Characters
### 🔟 Optionally: Removing Rare or Short Words
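
Steps 9 and 10 are marked optional in the list above. A minimal standard-library sketch of how they might look, together with the whitespace cleanup from step 8, is shown here. The helper names (`normalize_whitespace`, `strip_non_ascii`, `drop_short_and_rare`) and the thresholds `min_len` and `min_count` are illustrative assumptions, not part of the notebook's pipeline.

```python
import re
from collections import Counter

def normalize_whitespace(text):
    # Collapse runs of spaces, tabs, and newlines into a single space.
    return re.sub(r"\s+", " ", text).strip()

def strip_non_ascii(text):
    # Crude emoji / special-character handling: replace any run of
    # non-ASCII characters with a space (this also removes accented letters).
    return re.sub(r"[^\x00-\x7F]+", " ", text)

def drop_short_and_rare(docs, min_len=2, min_count=2):
    # docs is a list of token lists. Count each token across the whole
    # corpus, then keep only tokens that are long enough and frequent enough.
    counts = Counter(tok for doc in docs for tok in doc)
    return [
        [t for t in doc if len(t) >= min_len and counts[t] >= min_count]
        for doc in docs
    ]

# Made-up examples, only to exercise the helpers.
raw = ["Win   a FREE 🎁 prize\n now", "free prize   inside  x", "a  rare zzyzx token"]
docs = [normalize_whitespace(strip_non_ascii(s.lower())).split() for s in raw]
filtered = drop_short_and_rare(docs)
print(filtered)
```

Whether to drop rare words by hand is a judgment call: a vocabulary cap in the vectorizer already discards the least frequent terms, so an explicit filter mostly matters when no such cap is used.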

In [2]:
import pandas as pd

In [4]:
data=pd.read_csv('combined_data.csv')
data.head()

Unnamed: 0,label,text
0,1,ounce feather bowl hummingbird opec moment ala...
1,1,wulvob get your medircations online qnb ikud v...
2,0,computer connection from cnn com wednesday es...
3,1,university degree obtain a prosperous future m...
4,0,thanks for all your answers guys i know i shou...


In [29]:
import re
import string
import pandas as pd
import numpy as np

import nltk
from nltk.corpus import stopwords
from nltk import word_tokenize
from nltk.stem import WordNetLemmatizer

In [15]:
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')
nltk.download('vader_lexicon')
nltk.download('punkt_tab')


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\786\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\786\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\786\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\786\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package vader_lexicon to
[nltk_data]     C:\Users\786\AppData\Roaming\nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!
[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\786\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt_tab.zip.


True

In [16]:
# 2️⃣ Clean text function
def clean_text(text):
    text = text.lower()                          # lowercase
    text = re.sub(r"http\S+|www\S+", "", text)   # remove links
    text = re.sub(r"\d+", "", text)              # remove numbers
    text = text.translate(str.maketrans("", "", string.punctuation))  # remove punctuation
    text = text.strip()
    return text

# 3️⃣ Tokenize, remove stopwords, and lemmatize
stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()


def preprocess_text(text):
    text = clean_text(text)
    tokens = word_tokenize(text)
    tokens = [lemmatizer.lemmatize(word) for word in tokens if word not in stop_words]
    return " ".join(tokens)

# 4️⃣ Apply preprocessing
data["clean_text"] = data["text"].apply(preprocess_text)

print("\nSample cleaned text:\n", data["clean_text"].head())


Sample cleaned text:
 0    ounce feather bowl hummingbird opec moment ala...
1    wulvob get medircations online qnb ikud viagra...
2    computer connection cnn com wednesday escapenu...
3    university degree obtain prosperous future mon...
4    thanks answer guy know checked rsync manual wo...
Name: clean_text, dtype: object


In [19]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

In [20]:
vectorizer = CountVectorizer(max_features=1500)
X = vectorizer.fit_transform(data["clean_text"])
y = data["label"]
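
To make the vectorization step concrete, here is a rough standard-library sketch of what a capped bag-of-words representation does: count every token in the corpus, keep the `max_features` most frequent ones as the vocabulary, and turn each document into a vector of counts over that vocabulary. This is a simplified illustration, not scikit-learn's implementation. `build_vocab` and `to_counts` are hypothetical helper names, the three documents are made up, and tie-breaking among equally frequent words may differ from `CountVectorizer`.

```python
from collections import Counter

def build_vocab(docs, max_features):
    # Total frequency of every whitespace-separated token across the corpus.
    totals = Counter(tok for doc in docs for tok in doc.split())
    # Keep the most frequent tokens; sorting by (-count, token) makes ties deterministic.
    top = sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))[:max_features]
    # Assign column indices in alphabetical order of the kept tokens.
    return {tok: i for i, tok in enumerate(sorted(tok for tok, _ in top))}

def to_counts(doc, vocab):
    # One count per vocabulary column; tokens outside the vocabulary are ignored.
    vec = [0] * len(vocab)
    for tok in doc.split():
        if tok in vocab:
            vec[vocab[tok]] += 1
    return vec

docs = ["free prize click free", "meeting project tomorrow", "free meeting"]
vocab = build_vocab(docs, max_features=3)
matrix = [to_counts(d, vocab) for d in docs]
print(vocab)   # {'click': 0, 'free': 1, 'meeting': 2}
print(matrix)  # [[1, 2, 0], [0, 0, 1], [0, 1, 1]]
```

Tokens that fall outside the vocabulary are dropped, which is why a short email can transform into a vector that is almost entirely zeros.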

In [22]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 7️⃣ Train simple classifier
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# 8️⃣ Predict and evaluate
y_pred = model.predict(X_test)

print("\nAccuracy:", accuracy_score(y_test, y_pred))
print("\nReport:\n", classification_report(y_test, y_pred))


Accuracy: 0.9721390053924506

Report:
               precision    recall  f1-score   support

           0       0.98      0.96      0.97      7938
           1       0.97      0.98      0.97      8752

    accuracy                           0.97     16690
   macro avg       0.97      0.97      0.97     16690
weighted avg       0.97      0.97      0.97     16690



In [27]:
new_email = """
Congratulations! You have won a $1000 Walmart gift card.
Click the link below to claim your prize now!
"""

cleaned_email = preprocess_text(new_email)

X_new = vectorizer.transform([cleaned_email])
print(X_new.toarray())
# Predict
prediction = model.predict(X_new)[0]

[[0 0 0 ... 0 0 0]]


In [25]:
prediction

np.int64(1)

In [26]:
# Show result
if prediction == 1:
    print("📩 The email is: SPAM")
else:
    print("📨 The email is: NOT SPAM")

📩 The email is: SPAM


In [28]:
emails = [
    "Win a free vacation to Dubai! Click here to register.",
    "Hey John, can we meet tomorrow about the project?",
    "Limited offer!!! Get cheap meds online now!",
    "Please find attached the report for Q3 results."
]

for e in emails:
    pred = model.predict(vectorizer.transform([preprocess_text(e)]))[0]
    label = "SPAM" if pred == 1 else "NOT SPAM"
    print(f"\nEmail: {e}\n→ Prediction: {label}")


Email: Win a free vacation to Dubai! Click here to register.
→ Prediction: SPAM

Email: Hey John, can we meet tomorrow about the project?
→ Prediction: NOT SPAM

Email: Limited offer!!! Get cheap meds online now!
→ Prediction: SPAM

Email: Please find attached the report for Q3 results.
→ Prediction: NOT SPAM


### Exporting the Model and Vectorizer

In [30]:
# Total number of features
print("Total Features:", len(vectorizer.get_feature_names_out()))

# Show first 20 words learned
print("\nSample Vocabulary Words:\n", vectorizer.get_feature_names_out()[:20])

Total Features: 1500

Sample Vocabulary Words:
 ['ability' 'able' 'ac' 'accept' 'access' 'according' 'account'
 'acquisition' 'acrobat' 'across' 'act' 'action' 'activity' 'actual'
 'actually' 'ad' 'add' 'added' 'addition' 'additional']


In [31]:
import joblib

# Save
joblib.dump(model, "spam_model.pkl")
joblib.dump(vectorizer, "vectorizer.pkl")

['vectorizer.pkl']

### Now we only have to load and reuse them

In [32]:

# Later, load them back from disk
model = joblib.load("spam_model.pkl")
vectorizer = joblib.load("vectorizer.pkl")
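
A loaded model is only useful together with the exact preprocessing and vectorizer it was trained with. The sketch below shows one way a reuse helper might look. It trains a throwaway model on a few made-up sentences so the example is self-contained, writes both artifacts to a temporary directory, reloads them, and checks that the predictions are unchanged. The `predict_label` helper and the toy data are hypothetical. In this notebook, the text would first go through `preprocess_text` before `vectorizer.transform`.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Throwaway training data (1 = spam, 0 = not spam), only to have something to save.
texts = ["win free prize click", "cheap med online offer",
         "meeting tomorrow project", "attached report result"]
labels = [1, 1, 0, 0]

vectorizer = CountVectorizer()
model = LogisticRegression(max_iter=1000).fit(vectorizer.fit_transform(texts), labels)

def predict_label(text, model, vectorizer):
    # Vectorize with the fitted vocabulary, then map the class id to a readable label.
    pred = model.predict(vectorizer.transform([text]))[0]
    return "SPAM" if pred == 1 else "NOT SPAM"

with tempfile.TemporaryDirectory() as tmp:
    model_path = os.path.join(tmp, "spam_model.pkl")
    vec_path = os.path.join(tmp, "vectorizer.pkl")
    joblib.dump(model, model_path)
    joblib.dump(vectorizer, vec_path)

    # Reload both artifacts, as a separate script or service would.
    loaded_model = joblib.load(model_path)
    loaded_vectorizer = joblib.load(vec_path)

before = [predict_label(t, model, vectorizer) for t in texts]
after = [predict_label(t, loaded_model, loaded_vectorizer) for t in texts]
print("Predictions identical after reload:", before == after)
```

Files written by `joblib` are pickles, so they should only be loaded from trusted sources and ideally with the same scikit-learn version that produced them.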