# NLP-Driven Spam Detection System for SMS and Email

This project builds a spam detection system from lemmatization-based text
preprocessing, hybrid word- and character-level TF-IDF features, and classical
machine learning classifiers.

The model is designed for both SMS and email spam filtering.


## Problem Statement

Spam messages in SMS and emails pose serious risks including phishing,
financial fraud, and identity theft. The objective of this project is to
classify messages as **Spam** or **Ham** using content-based NLP techniques
while maintaining a balance between precision and recall.


In [None]:
# Standard library
import re

# Data handling, plotting, and model persistence
import joblib
import matplotlib.pyplot as plt
import nltk
import numpy as np
import pandas as pd
import seaborn as sns

# NLTK components
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import sent_tokenize

# scikit-learn: features, models, search, and metrics
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, f1_score, recall_score, precision_score,
                             classification_report, RocCurveDisplay,
                             confusion_matrix, ConfusionMatrixDisplay)
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV
from sklearn.naive_bayes import ComplementNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.svm import SVC, LinearSVC
from sklearn.tree import DecisionTreeClassifier

# Download the NLTK resources used by the lemmatizer and the stopword filter
nltk.download("wordnet")
nltk.download("stopwords")

# The NLTK stopword file is named "english" (lowercase); "English" fails on case-sensitive file systems
stop_words = set(stopwords.words("english"))

## Dataset Description

- The dataset consists of labeled SMS and email messages.
- Classes:
  - 0 → Ham (Legitimate messages)
  - 1 → Spam (Promotional / Phishing messages)
- The dataset is imbalanced, which is handled using class weighting
  and threshold tuning.
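
As a rough illustration of what class weighting does, the sketch below reproduces the formula scikit-learn documents for `class_weight="balanced"`, namely `n_samples / (n_classes * class_count)`. The helper name `balanced_class_weights` and the toy label list are assumptions made for this example only; they are not part of the project code.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Return {class: n_samples / (n_classes * class_count)} for a label list."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Hypothetical imbalanced sample: six ham (0) and two spam (1) messages
toy_labels = [0, 0, 0, 0, 0, 0, 1, 1]
weights = balanced_class_weights(toy_labels)
print(weights)
```

The minority class receives the larger weight, so each misclassified spam message costs the classifier more than a misclassified ham message.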


In [None]:
def dataset():
    df = pd.read_csv(r"C:\Users\chaud\Downloads\spam.csv",encoding= "latin")
    df = df[["v1","v2"]]
    df.columns = ["Labels","Messages"]
    df["Labels"] = df["Labels"].map({"ham":0,"spam":1})
    #df.drop_duplicates()
    return df
df = dataset()
df.head()

In [None]:
def dataset_info(df):
    print("---Dataset info---")
    df.info()  # prints its summary directly and returns None
    description = df.describe()
    unique_sum = df.nunique()
    if "Labels" not in df.columns:
        raise KeyError("Column 'Labels' not found")
    value_count = df["Labels"].value_counts()
    # Pie chart of the class balance (0 = Ham, 1 = Spam)
    fig, ax = plt.subplots()
    ax.pie(value_count, labels=value_count.index.map({0: "Ham", 1: "Spam"}), autopct="%1.1f%%")
    plt.show()
    return description, unique_sum, value_count

description, unique_sum, value_count = dataset_info(df)

print("\n--- Descriptive Statistics ---\n", description)
print("\n--- Unique Values per Column ---\n", unique_sum)
print("\n--- Value Counts for Labels ---\n", value_count)


In [None]:
def splitting_data(df,train_test_split,random_state = 42,test_size = 0.30):
    x = df["Messages"]
    y = df["Labels"]
    x_train , x_test ,y_train ,y_test = train_test_split(x,y,random_state= random_state,stratify=y,test_size=test_size)
    return x_train , x_test ,y_train ,y_test ,x,y
x_train , x_test ,y_train ,y_test,x,y  = splitting_data(df,train_test_split,random_state = 42,test_size = 0.30)
print("Training samples:", len(x_train))
print("Testing samples:", len(x_test))
print("\n--- x_train ---\n",x_train)
print("\n--- x_train_type ---\n",type(x_train))

## Text Preprocessing

The following preprocessing steps are applied:

- Lowercasing text
- URL detection and removal (with signal preservation)
- Semantic feature injection (urgency, brand mention, link presence)
- Number normalization
- Lemmatization
- Stopword removal

Instead of adding manual numerical features, semantic indicators are injected
directly into the text so that TF-IDF can learn them naturally.


## Feature Engineering

To improve phishing and promotional spam detection, the following semantic
features are injected into the text:

- `link_found` → Detects presence of URLs
- `urgency_flag` → Captures urgency language such as "urgent", "verify"
- `brand_mention` → Detects brand impersonation (Netflix, PayPal, Amazon)

Both **word-level TF-IDF** and **character-level TF-IDF** are used to capture
lexical as well as morphological spam patterns.
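
The injection idea can be sketched in isolation as follows. This is a simplified illustration, not the notebook's preprocessing function: the helper name `inject_flags` and the sample messages are hypothetical. It also shows why the marker tokens contain an underscore: the default `token_pattern` of scikit-learn's text vectorizers, `(?u)\b\w\w+\b`, treats `_` as a word character, so `link_found` stays a single vocabulary item as long as the cleaning step does not strip underscores.

```python
import re

URL_RE = re.compile(r"http\S+|www\.\S+")
# Default token_pattern used by scikit-learn's CountVectorizer / TfidfVectorizer
TOKEN_RE = re.compile(r"(?u)\b\w\w+\b")


def inject_flags(message):
    """Lowercase a message, strip URLs, and append marker tokens (hypothetical helper)."""
    text = message.lower()
    flags = []
    if URL_RE.search(text):
        flags.append("link_found")
    if re.search(r"\b(urgent|verify|action required)\b", text):
        flags.append("urgency_flag")
    if re.search(r"\b(netflix|amazon|google|paypal)\b", text):
        flags.append("brand_mention")
    text = URL_RE.sub(" ", text)
    # Keep the underscore so marker tokens are never split into two words
    text = re.sub(r"[^a-z0-9_ ]", " ", text)
    return " ".join(text.split() + flags)


tokens = TOKEN_RE.findall(inject_flags("URGENT: verify your PayPal at http://bit.ly/x"))
print(tokens)
```

A message without any of these signals passes through unchanged apart from lowercasing and punctuation removal, so the markers only ever add vocabulary.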


In [None]:
lemmatizer = WordNetLemmatizer()  # created once and reused for every message

def add_custom_feature(message):
    # Append marker tokens so TF-IDF can learn these signals as ordinary vocabulary
    if "http" in message or "www" in message:
        message += " link_found"
    if any(word in message for word in ["urgent", "verify", "action required"]):
        message += " urgency_flag"
    brands = ["netflix", "amazon", "google", "paypal"]
    if any(brand in message for brand in brands):
        message += " brand_mention"
    if "verify" in message and "account" in message:
        message += " phishing_pattern"
    if "action required" in message:
        message += " phishing_pattern"
    return message

def preprocessing(message, stop_words):
    message = str(message).lower()
    message = add_custom_feature(message)  # must run before URLs are stripped
    message = re.sub(r"http\S+|www\.\S+", "", message)
    message = re.sub(r"\d+", " number_token ", message)
    # Keep "_" so the injected marker tokens are not split into two words
    message = re.sub(r"[^a-zA-Z0-9_!£$% ]", " ", message)
    words = message.split()
    cleaned_words = [lemmatizer.lemmatize(word) for word in words if word not in stop_words]
    return " ".join(cleaned_words)
     
x_train_clean = x_train.apply(lambda msg: preprocessing(msg,stop_words))
x_test_clean = x_test.apply(lambda msg: preprocessing(msg,stop_words))    
print("\n----cleaned training dataset----")
print(x_train_clean.head())

## Text Vectorization

- Word-level TF-IDF:
  - Unigrams and bigrams
  - Stopword removal
- Character-level TF-IDF:
  - Character n-grams (3–5)

This hybrid approach improves robustness against obfuscated spam text.


In [None]:
def vectorization(x_train_clean,x_test_clean,TfidfVectorizer):
    word_tfidf = TfidfVectorizer(analyzer="word",max_features = 5000 ,stop_words= "english",ngram_range=(1,2),min_df=5, max_df= 0.9,sublinear_tf= True)
    char_tfidf = TfidfVectorizer(analyzer= "char",ngram_range=(3,5),max_features = 3000)
    vectorizer = FeatureUnion([("word_tfidf",word_tfidf),("char_tfidf",char_tfidf)])
    x_train_clean_vec = vectorizer.fit_transform(x_train_clean)
    x_test_clean_vec = vectorizer.transform(x_test_clean)
    feature_names = vectorizer.get_feature_names_out()
    return x_train_clean_vec,x_test_clean_vec,vectorizer,feature_names    

x_train_clean_vec,x_test_clean_vec,vectorizer,feature_names =  vectorization(x_train_clean,x_test_clean,TfidfVectorizer)   
print(f"Vectorizer :",vectorizer)
print(f"x_train_vec :",x_train_clean_vec)
print(f"x_test_vec :",x_test_clean_vec)
print(f"Feature_names :",feature_names)
 



## Model Selection

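Six classifiers are first compared on the TF-IDF features: SVC, decision tree,
Complement Naive Bayes, logistic regression, random forest, and k-nearest
neighbours. A Support Vector Machine (SVM) with a linear kernel is then tuned,
because linear SVMs perform strongly on high-dimensional sparse text data.

Class imbalance is handled using `class_weight='balanced'` on the models that
support it.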
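
One possible way to compare candidate models without consulting the test split is cross-validation on the training data. The sketch below is an assumption-laden illustration rather than the notebook's procedure: the twelve toy messages, the 3-fold setup, and the plain `f1` scoring are all chosen for the example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Hypothetical toy corpus: 1 = spam, 0 = ham
toy_messages = [
    "win a free prize now", "free cash reward claim now",
    "urgent verify your account", "claim your free gift card",
    "you won a cash prize", "free entry to weekly draw",
    "are we meeting for coffee", "send me the meeting notes",
    "i will be home late", "see you at the gym",
    "call me after lunch", "the invoice is attached",
]
toy_labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

# Vectorizer and classifier live in one pipeline so each fold refits both
candidate = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", SVC(kernel="linear", class_weight="balanced")),
])

cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
cv_scores = cross_val_score(candidate, toy_messages, toy_labels, cv=cv, scoring="f1")
print("F1 per fold:", cv_scores)
```

Keeping the vectorizer inside the pipeline means the vocabulary is learned only from the training part of each fold, which avoids leaking validation text into the features.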


In [None]:
def model_select(x_train_clean_vec,x_test_clean_vec , y_train,y_test):
    models = {
        "SVC":SVC(class_weight="balanced",probability=True),
        "DecisionTreeClassifier":DecisionTreeClassifier(class_weight = "balanced"),
        "ComplementNB":ComplementNB(),
        "LogisticRegression":LogisticRegression(max_iter= 1000,class_weight= "balanced"),
        "RandomForestClassifier":RandomForestClassifier(class_weight= "balanced"),
        "KNeighborsClassifier":KNeighborsClassifier(weights = "distance")
    }
    pipelines_dict = {}
    predictions = {}
    for name,model in models.items():
        pipe = Pipeline([
        ("clf",model)])
        pipe.fit(x_train_clean_vec,y_train)
        y_pred = pipe.predict(x_test_clean_vec)

        metrics_model = {
            "accuracy_score":accuracy_score(y_test,y_pred),
            "f1_score":f1_score(y_test,y_pred,average="weighted"),
            "recall_score":recall_score(y_test,y_pred,average="weighted"),
            "precision_score":precision_score(y_test,y_pred,average="weighted")
        }
        pipelines_dict[name] = pipe
        predictions[name] = metrics_model
    best_model_name = max(predictions, key= lambda z:predictions[z]["f1_score"])
    return pipelines_dict, predictions, best_model_name
   

pipelines_dict, predictions, best_model_name = model_select(x_train_clean_vec,x_test_clean_vec,y_train,y_test)
print("pipelines :", pipelines_dict)
print("predictions :", predictions)
print("best model by weighted F1 :", best_model_name)

In [None]:
def hyperparameter(x_train_clean_vec,y_train):

    # SVC (linear kernel) with GridSearchCV; gamma is omitted because a linear kernel ignores it
    param_grid = {
        "C": [0.001,0.01,1,10,50,100],
        "kernel":["linear"],
    }
    Grid_Search_CV = GridSearchCV(estimator= SVC(probability= True,class_weight= "balanced"),param_grid= param_grid,cv=5,scoring= "f1_weighted",n_jobs=-1)
    Grid_Search_CV.fit(x_train_clean_vec,y_train)

    # SVC (linear kernel) with RandomizedSearchCV over a log-spaced range of C
    param_distributions = {
        "C": np.logspace(-3, 2, 20)
        }
    # random_state makes the sampled candidates reproducible between runs
    Randomized_Search_CV = RandomizedSearchCV(estimator= SVC(kernel='linear', class_weight='balanced', probability=True),param_distributions = param_distributions,cv=5,n_jobs=-1,scoring= "f1_weighted",random_state=42)
    Randomized_Search_CV.fit(x_train_clean_vec,y_train)

    return Grid_Search_CV ,Randomized_Search_CV

Grid_Search_CV,Randomized_Search_CV= hyperparameter(x_train_clean_vec,y_train) 
print("------------- SVC WITH GRIDSEARCHCV-------------------------------------------------------------------------------------")
print("Grid_Search_CV:",Grid_Search_CV)
print("Best Params:", Grid_Search_CV.best_params_)
print("Best_estimator:",Grid_Search_CV.best_estimator_)
print("Best F1:", Grid_Search_CV.best_score_)
print("------------- SVC (LINEAR KERNEL) WITH RANDOMIZEDSEARCHCV-------------------------------------------------------------------------------------")
print("Randomized_Search_CV:",Randomized_Search_CV)
print("Best Params:", Randomized_Search_CV.best_params_)
print("Best_estimator:",Randomized_Search_CV.best_estimator_)
print("Best F1:", Randomized_Search_CV.best_score_)



In [None]:
def predict_with_best_model(Randomized_Search_CV,x_train_clean_vec,x_test_clean_vec):
    # Predict with the refitted best estimator of the randomized search
    y_pred_train = Randomized_Search_CV.predict(x_train_clean_vec)
    y_pred_test = Randomized_Search_CV.predict(x_test_clean_vec)
    return y_pred_train,y_pred_test
y_pred_train,y_pred_test = predict_with_best_model(Randomized_Search_CV,x_train_clean_vec,x_test_clean_vec)

# scikit-learn metrics expect (y_true, y_pred); precision, recall, and support change if the order is swapped
print("------------------- FOR TRAINING DATA--------------------------------------------------------------------------------------")
print("accuracy_score:",accuracy_score(y_train,y_pred_train))
print("f1_weighted:",f1_score(y_train,y_pred_train,average = "weighted"))

print("------------------- FOR TESTING DATA---------------------------------------------------------------------------------------")
print("accuracy_score:",accuracy_score(y_test,y_pred_test))
print("f1_weighted:",f1_score(y_test,y_pred_test,average = "weighted"))
print("\n full classification report of testing set:\n",classification_report(y_test,y_pred_test))

FOR UNIGRAM | NGRAM_RANGE = (1,1)

------------------- FOR TRAINING DATA--------------------------------------------------------------------------------------

accuracy_score: 1.0

f1_weighted: 1.0

------------------- FOR TESTING DATA---------------------------------------------------------------------------------------

accuracy_score: 0.9814593301435407

f1_weighted: 0.9816576008685878


 full classification report of testing set:
               precision    recall  f1-score   support

           0       0.99      0.99      0.99      1459
           1       0.91      0.95      0.93       213

    accuracy                           0.98      1672
   macro avg       0.95      0.97      0.96      1672
weighted avg       0.98      0.98      0.98      1672


FOR UNIGRAMS,BIGRAMS | NGRAM_RANGE = (1,2)

------------------- FOR TRAINING DATA--------------------------------------------------------------------------------------

accuracy_score: 0.9994871794871795

f1_weighted: 0.9994867661055344


------------------- FOR TESTING DATA---------------------------------------------------------------------------------------

accuracy_score: 0.979066985645933

f1_weighted: 0.9791670782995956


 full classification report of testing set:
               precision    recall  f1-score   support

           0       0.99      0.99      0.99      1453
           1       0.91      0.93      0.92       219

    accuracy                           0.98      1672
   macro avg       0.95      0.96      0.95      1672
weighted avg       0.98      0.98      0.98      1672

FOR BIGRAM | NGRAM_RANGE = (2,2)

------------------- FOR TRAINING DATA--------------------------------------------------------------------------------------

accuracy_score: 0.9994871794871795

f1_weighted: 0.9994867661055344

------------------- FOR TESTING DATA---------------------------------------------------------------------------------------

accuracy_score: 0.979066985645933

f1_weighted: 0.9791670782995956

 full classification report of testing set:
               precision    recall  f1-score   support

           0       0.99      0.99      0.99      1453
           1       0.91      0.93      0.92       219

    accuracy                           0.98      1672
   macro avg       0.95      0.96      0.95      1672
weighted avg       0.98      0.98      0.98      1672


In [None]:
def plot_roc_curve(estimator,x_test_clean_vec,y_test):
    # Draw the ROC curve of the tuned model on the held-out test set
    curve = RocCurveDisplay.from_estimator(estimator, x_test_clean_vec,y_test)
    plt.show()
    return curve
curve = plot_roc_curve(Randomized_Search_CV.best_estimator_,x_test_clean_vec,y_test)


## Threshold Optimization

Instead of using the default threshold (0.5), probability-based threshold
tuning is applied.

- A threshold of **0.35** is selected to maximize spam recall
- This reduces false negatives while keeping false positives minimal

Threshold tuning allows the model to adapt to different deployment scenarios.
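
The effect of moving the threshold can be illustrated with a small pure-Python sketch. The labels and spam probabilities below are invented for the example and do not come from the trained model; `precision_recall_at` is a hypothetical helper.

```python
def precision_recall_at(y_true, spam_probs, threshold):
    """Precision and recall of the spam class when predicting spam for prob > threshold."""
    y_pred = [int(p > threshold) for p in spam_probs]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical scores: four ham (0) and four spam (1) messages
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
spam_probs = [0.05, 0.10, 0.30, 0.40, 0.38, 0.45, 0.80, 0.95]

default_result = precision_recall_at(y_true, spam_probs, 0.50)
tuned_result = precision_recall_at(y_true, spam_probs, 0.35)
print("threshold 0.50 -> precision, recall:", default_result)  # (1.0, 0.5)
print("threshold 0.35 -> precision, recall:", tuned_result)    # (0.8, 1.0)
```

In this toy case the lower threshold recovers the two borderline spam messages at the cost of one extra false positive, which is the trade-off that threshold tuning controls.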


## Model Evaluation (Threshold = 0.35)

The confusion matrix obtained at threshold = 0.35 is shown below:

- True Negatives (Ham correctly identified): 1442
- False Positives (Ham misclassified as Spam): 6
- False Negatives (Spam missed): 16
- True Positives (Spam correctly identified): 208
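
The four counts above determine the usual summary metrics for the spam class. The short calculation below derives them directly from the reported confusion matrix; only the variable names are introduced here.

```python
# Counts reported for threshold = 0.35
tn, fp, fn, tp = 1442, 6, 16, 208

precision = tp / (tp + fp)                   # share of flagged messages that are spam
recall = tp / (tp + fn)                      # share of spam messages that are caught
f1 = 2 * tp / (2 * tp + fp + fn)             # harmonic mean of precision and recall
accuracy = (tp + tn) / (tp + tn + fp + fn)   # share of all messages classified correctly
false_positive_rate = fp / (fp + tn)         # share of ham messages wrongly flagged

print(f"precision: {precision:.3f}")                      # 0.972
print(f"recall:    {recall:.3f}")                         # 0.929
print(f"f1:        {f1:.3f}")                             # 0.950
print(f"accuracy:  {accuracy:.3f}")                       # 0.987
print(f"false positive rate: {false_positive_rate:.4f}")  # 0.0041
```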


In [None]:
from sklearn.base import clone

def deploy_pipeline(x_train,y_train,x_test,y_test,best_estimator,threshold=0.35):
    word_tfidf = TfidfVectorizer(analyzer="word",max_features = 5000 ,stop_words= "english",ngram_range=(1,2),min_df=5, max_df= 0.9,sublinear_tf= True)
    char_tfidf = TfidfVectorizer(analyzer= "char",ngram_range=(3,5),max_features = 3000)
    vectorizer = FeatureUnion([("word_tfidf",word_tfidf),("char_tfidf",char_tfidf)])
    # clone() gives an unfitted copy with the tuned hyperparameters, so the search object is not refitted in place
    pipeline_for_deployment = Pipeline([
        ("Vectorizer",vectorizer),
        ("Classifier",clone(best_estimator))
    ])
    # Note: the pipeline is fitted on the raw messages; preprocessing() is not one of its steps
    final_pipeline = pipeline_for_deployment.fit(x_train,y_train)
    probs = final_pipeline.predict_proba(x_test)[:, 1]
    y_pred = (probs > threshold).astype(int)
    cm = confusion_matrix(y_test, y_pred)
    disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=['Ham', 'Spam'])
    disp.plot(cmap='Blues')
    plt.title("Spam Detection Confusion Matrix")
    plt.show()

    results = pd.DataFrame({'text': x_test, 'actual': y_test, 'predicted': y_pred})
    dump_model = joblib.dump(final_pipeline,"spam.pkl")  # returns the list of written file names
    return pipeline_for_deployment,final_pipeline,dump_model,results
pipeline_for_deployment,final_pipeline,dump_model,results = deploy_pipeline(x_train,y_train,x_test,y_test,Randomized_Search_CV.best_estimator_)
print("✅ Pipeline created (vectorizer + classifier)")
print("✅ Best Model Integrated:", Randomized_Search_CV.best_params_)
print("✅ Best Model :", Randomized_Search_CV.best_estimator_)
print("✅ Model Dumped to:", dump_model)
print("--- Ham marked as Spam (Critical Errors) ---")
display(results[(results['actual'] == 0) & (results['predicted'] == 1)])
print("\n--- Spam marked as Ham (Missed Spam) ---")
display(results[(results['actual'] == 1) & (results['predicted'] == 0)])


Lowering the threshold from 0.90 to 0.35 improved recall significantly with only a minor increase in false positives.

For consumer SMS and email systems, this trade-off maximizes user protection while maintaining high precision.

## Test Case Evaluation

The model was tested on real-world SMS and email examples including:

- Casual conversations
- Promotional spam
- Phishing attempts
- Brand impersonation messages

This evaluation ensures robustness beyond the training dataset.


In [None]:
def testing():
    model = joblib.load('spam.pkl')
    test_messages = [
    # Casual conversations
    "Hey, are we still meeting for coffee at 4?",
    "Can you send me the notes from today's meeting?",
    "I'll be home a bit late, don't wait up for dinner.",
    "Yo, you coming to the gym?",
    "I won the local chess tournament today!",
    "The invoice for the project is attached.",
    "CONGRATS on the new job! Let's celebrate!",
    "I'm at the bank right now, will wire the money soon.",
    # Promotional spam
    "WINNER! You have won a $1000 Walmart gift card. Click here to claim now!",
    "FREE entry into our £100 weekly draw. Text 'WIN' to 80085 to join.",
    "Congratulations! Claim your reward now before it expires.",
    # Phishing attempts and brand impersonation
    "URGENT: Your account has been compromised. Log in at http://bit.ly/fake-link to reset.",
    "Verify your identity within 24 hours.",
    "Verify your identity within 24 hours to avoid suspension.",
    "Action Required: Your Netflix subscription has expired.",
    "Your PayPal account is limited. Verify now to restore access.",
    # Ambiguous service messages
    "Please call me back regarding your insurance claim.",
    "Your Amazon order has been shipped. Track here.",
    "Reminder: Bank KYC update required within 24 hours."]
    # predict() applies the classifier's default decision rule, not the 0.35 probability threshold
    test_predictions = model.predict(test_messages)
    return model,test_messages,test_predictions
model,test_messages,test_predictions = testing()
print(f"Model : {model}")
# Print each message next to its predicted label (0 = Ham, 1 = Spam)
for message, label in zip(test_messages, test_predictions):
    print(f"{label} | {message}")



## Error Analysis

Some phishing messages without explicit URLs or sender metadata were
classified as ham. This highlights the limitation of content-only models,
which have no sender or header signal to fall back on when the text itself
reads like a legitimate notification.

Future improvements can include sender reputation and email header analysis.


## Conclusion

This project demonstrates an effective NLP-based spam detection system using
TF-IDF feature engineering, SVM classification, and threshold tuning.

The system achieves high precision and recall while maintaining realistic
behavior suitable for deployment in SMS and email platforms.


## Key Takeaway

Threshold tuning and semantic feature injection significantly improve spam
detection performance while preserving user experience.
