# 📩 Spam–Ham Classification (Baseline NLP Model)

## Project Objective
The objective of this project is to build a **baseline NLP classification model**
to classify SMS messages as **Spam** or **Ham**.

This baseline will serve as a reference point for evaluating more advanced models
in the future.


In [None]:
# Import the libraries used throughout the notebook
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import (
    confusion_matrix,
    accuracy_score,
    f1_score,
    recall_score,
    classification_report,
)

print("Libraries Imported Successfully")


Libraries Imported Successfully


## 📊 Dataset Overview

- The dataset consists of SMS text messages.
- Each message is labeled as:
  - **Spam (1)**
  - **Ham (0)**

### Features
- `Messages`: Text content of the SMS (column `v2` in the raw CSV)

### Target Variable
- `Labels`: Binary classification label (Spam = 1, Ham = 0; column `v1` in the raw CSV)


In [None]:
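def dataset():
    # Load the raw CSV; the path below is specific to the author's machine
    df = pd.read_csv(r"C:\Users\chaud\Downloads\spam.csv", encoding="latin")
    # Keep only the label and text columns and give them readable names
    df = df[["v1", "v2"]]
    df.columns = ["Labels", "Messages"]
    # Encode the target: ham -> 0, spam -> 1
    df["Labels"] = df["Labels"].map({"ham": 0, "spam": 1})
    return df


df = dataset()
df.head()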



| | Labels | Messages |
|---|---|---|
| 0 | 0 | Go until jurong point, crazy.. Available only ... |
| 1 | 0 | Ok lar... Joking wif u oni... |
| 2 | 1 | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | 0 | U dun say so early hor... U c already then say... |
| 4 | 0 | Nah I don't think he goes to usf, he lives aro... |


In [None]:
def dataset_info(df):
    print("---Dataset info---")
    # df.info() prints its summary directly and returns None
    df.info()
    description = df.describe()
    unique_sum = df.nunique()
    value_count = (
        df["Labels"].value_counts()
        if "Labels" in df.columns
        else "Column 'Labels' not found"
    )
    return description, unique_sum, value_count


description, unique_sum, value_count = dataset_info(df)

print("\n--- Descriptive Statistics ---\n", description)
print("\n--- Unique Values per Column ---\n", unique_sum)
print("\n--- Value Counts for Labels ---\n", value_count)

---Dataset info---
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Labels    5572 non-null   int64 
 1   Messages  5572 non-null   object
dtypes: int64(1), object(1)
memory usage: 87.2+ KB

--- Descriptive Statistics ---
             Labels
count  5572.000000
mean      0.134063
std       0.340751
min       0.000000
25%       0.000000
50%       0.000000
75%       0.000000
max       1.000000

--- Unique Values per Column ---
 Labels         2
Messages    5169
dtype: int64

--- Value Counts for Labels ---
 Labels
0    4825
1     747
Name: count, dtype: int64


## 🔀 Train–Test Split

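The dataset is split into training and testing sets to evaluate model performance
on unseen data. The split is 70/30 and stratified on the label, so both sets keep
roughly the same share of spam.

- **Training set** is used to train the model
- **Test set** is used only for evaluation

Holding out a test set does not prevent overfitting, but it makes overfitting visible.
Fitting every preprocessing step on the training set only avoids data leakage.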
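
As a rough illustration of what a stratified split does, the sketch below splits indices class by class so that both parts keep the same spam share. `stratified_split` is a hypothetical helper written for this example, the toy labels are made up, and this is not how scikit-learn implements `train_test_split`.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size=0.30, seed=42):
    """Hypothetical helper: split indices so each class keeps its share in both parts."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)  # per-class test quota
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy labels (made up): 20 ham (0) and 10 spam (1)
labels = [0] * 20 + [1] * 10
train_idx, test_idx = stratified_split(labels)

train_spam_share = sum(labels[i] for i in train_idx) / len(train_idx)
test_spam_share = sum(labels[i] for i in test_idx) / len(test_idx)
print(len(train_idx), len(test_idx))
print(round(train_spam_share, 3), round(test_spam_share, 3))
```

In this toy case both parts end up with one third spam, which is the property that stratifying on the label is meant to preserve.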


In [None]:
def splitting_data(df, train_test_split, random_state=42, test_size=0.30):
    # Stratify on the label so both splits keep the same spam/ham ratio
    x_train, x_test, y_train, y_test = train_test_split(
        df["Messages"], df["Labels"],
        random_state=random_state, stratify=df["Labels"], test_size=test_size,
    )
    return x_train, x_test, y_train, y_test

x_train, x_test, y_train, y_test = splitting_data(df, train_test_split, random_state=42, test_size=0.30)
print("Training sample :", len(x_train))
print("Testing sample :", len(x_test))
print("\n--- x_train ---\n", x_train)
print("\n--- x_train_type ---\n", type(x_train))

Training sample : 3900
Testing sample : 1672

--- x_train ---
 4912    Goal! Arsenal 4 (Henry, 7 v Liverpool 2 Henry ...
2541      I dont. Can you send it to me. Plus how's mode.
5323                           Aah bless! How's your arm?
5171                         Oh k. . I will come tomorrow
2532                                            Yup ok...
                              ...                        
3185    Happy birthday to you....dear.with lots of lov...
607     what I meant to say is cant wait to see u agai...
552     Sure, if I get an acknowledgement from you tha...
763     Nothing but we jus tot u would ask cos u ba gu...
3393    Bull. Your plan was to go floating off to IKEA...
Name: Messages, Length: 3900, dtype: object

--- x_train_type ---
 <class 'pandas.core.series.Series'>


## 🧠 Text Vectorization (TF-IDF)

Machine learning models cannot work directly with raw text.
Therefore, text messages are converted into numerical features using a **TF-IDF Vectorizer**, which:

- Learns its vocabulary and IDF weights from the training data only
- Down-weights words that appear in many messages

This step converts text into a sparse numerical matrix with one row per message and one column per vocabulary term.
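
To make the weighting concrete, here is a minimal pure-Python sketch of smoothed TF-IDF with L2-normalised rows, which is the weighting scikit-learn documents as the default for `TfidfVectorizer`. The helper names `tokenize` and `tfidf_fit_transform` and the toy messages are assumptions made for this example, and the tokenizer is a simplification.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercase and keep tokens of two or more word characters
    return re.findall(r"\b\w\w+\b", text.lower())


def tfidf_fit_transform(docs):
    """Hypothetical helper: smoothed IDF, raw term counts, L2-normalised rows."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(docs)

    # Document frequency: in how many messages each term appears
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + df_t)) + 1 for t, df_t in df.items()}

    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = {t: c * idf[t] for t, c in counts.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({t: w / norm for t, w in weights.items()} if norm else {})
    return idf, rows


# Toy messages (made up for illustration)
train_docs = ["free entry win prize", "ok see you later", "win free cash now"]
idf, rows = tfidf_fit_transform(train_docs)

print(round(idf["free"], 3), round(idf["prize"], 3))
print({t: round(w, 3) for t, w in rows[0].items()})
```

A term that occurs in two of the three messages ("free") gets a lower IDF than one that occurs in a single message ("prize"), so the rarer term carries more weight in the first row.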


In [None]:
def vectorization(x_train, x_test, TfidfVectorizer):
    vectorizer = TfidfVectorizer()
    # Fit the vocabulary and IDF weights on the training messages only
    x_train_vec = vectorizer.fit_transform(x_train)
    # Reuse the fitted vocabulary for the test messages (no refitting)
    x_test_vec = vectorizer.transform(x_test)
    feature_names = vectorizer.get_feature_names_out()
    return vectorizer, x_train_vec, x_test_vec, feature_names

vectorizer, x_train_vec, x_test_vec, feature_names = vectorization(x_train, x_test, TfidfVectorizer)
print("Vectorizer :", vectorizer)
print("x_train_vec :", x_train_vec)
print("x_test_vec :", x_test_vec)
print("Feature_names :", feature_names)

Vectorizer : TfidfVectorizer()
x_train_vec : <Compressed Sparse Row sparse matrix of dtype 'float64'
	with 51578 stored elements and shape (3900, 7202)>
  Coords	Values
  (0, 2930)	0.4048550489723281
  (0, 986)	0.424946486354178
  (0, 3158)	0.424946486354178
  (0, 3865)	0.19529997364521248
  (0, 5507)	0.20242752448616405
  (0, 7002)	0.0933484902055576
  (0, 5714)	0.1725981461699518
  (0, 5668)	0.20242752448616405
  (0, 2796)	0.19877581458981025
  (0, 7123)	0.212473243177089
  (0, 4734)	0.1702367610950114
  (0, 1472)	0.11103261888339745
  (0, 1212)	0.212473243177089
  (0, 6441)	0.0539842523443838
  (0, 2912)	0.1236887276677332
  (0, 4055)	0.212473243177089
  (0, 769)	0.12991524136520063
  (0, 558)	0.212473243177089
  (0, 4184)	0.14086199719960968
  (1, 6441)	0.15850991863481007
  (1, 2227)	0.3422498571791928
  (1, 1509)	0.25863523023826496
  (1, 7154)	0.1637240840081839
  (1, 5571)	0.31637853608419586
  (1, 3487)	0.23470864133096522
  :	:

## 🤖 Baseline Models

The following baseline machine learning models are used:

### 1. Logistic Regression
- Linear classifier that learns one weight per term
- Performs well on high-dimensional, sparse text features

### 2. Multinomial Naive Bayes
- Designed for count-like features such as word frequencies; in practice it also works with TF-IDF weights
- Fast to train and a common choice for text classification

These models provide a strong baseline for NLP problems.
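
For intuition about the second model, the sketch below implements a tiny multinomial Naive Bayes on raw word counts with add-one (Laplace) smoothing. `train_multinomial_nb` and `predict` are hypothetical helpers, the training messages are made up, and the notebook itself relies on scikit-learn's `MultinomialNB` applied to TF-IDF features.

```python
import math
from collections import Counter, defaultdict


def train_multinomial_nb(docs, labels, alpha=1.0):
    """Hypothetical helper: fit class priors and smoothed per-class word likelihoods."""
    vocab = sorted({w for d in docs for w in d.split()})
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc.split())

    log_prior = {c: math.log(n / len(labels)) for c, n in class_counts.items()}
    log_likelihood = {}
    for c in class_counts:
        total = sum(word_counts[c].values()) + alpha * len(vocab)
        log_likelihood[c] = {
            w: math.log((word_counts[c][w] + alpha) / total) for w in vocab
        }
    return log_prior, log_likelihood


def predict(doc, log_prior, log_likelihood):
    # Score each class by log prior plus summed log likelihoods; skip unseen words
    scores = {}
    for c in log_prior:
        scores[c] = log_prior[c] + sum(
            log_likelihood[c][w] for w in doc.split() if w in log_likelihood[c]
        )
    return max(scores, key=scores.get)


# Toy training data (made up): 1 = spam, 0 = ham
docs = ["win free prize now", "free cash win", "see you at lunch", "ok see you later"]
labels = [1, 1, 0, 0]

log_prior, log_likelihood = train_multinomial_nb(docs, labels)
print(predict("free prize", log_prior, log_likelihood))
print(predict("see you later", log_prior, log_likelihood))
```

The first message is scored as spam and the second as ham, because each contains only words that are more frequent in the corresponding class.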


In [None]:
def model_select(x_train_vec, x_test_vec, y_train):
    models = {
        "logisticregression": LogisticRegression(max_iter=1000, class_weight="balanced"),
        "MultinomialNB": MultinomialNB(),
    }
    predictions = {}

    # Fit each model on the training vectors and predict on the test vectors
    for name, model in models.items():
        model.fit(x_train_vec, y_train)
        predictions[name] = model.predict(x_test_vec)
    return models, predictions


models, predictions = model_select(x_train_vec, x_test_vec, y_train)
print("models :", models)
print("predictions :", predictions)

models : {'logisticregression': LogisticRegression(max_iter=1000), 'MultinomialNB': MultinomialNB()}
predictions : {'logisticregression': array([1, 0, 1, ..., 1, 0, 0], shape=(1672,)), 'MultinomialNB': array([1, 0, 1, ..., 1, 0, 0], shape=(1672,))}


## 📊 Results and Model Comparison

The table below summarizes the performance of baseline models
using key evaluation metrics.

- Metrics are shown per model
- F1-score and recall are computed for the positive class (Spam = 1)
- A higher **F1-score** (the harmonic mean of precision and recall) indicates a better balance between false positives
  and false negatives
- The confusion matrix helps analyze classification errors
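
The relationship between the confusion matrix and the scalar metrics can be checked by hand. The sketch below recomputes accuracy, precision, recall, and F1 from the confusion matrices reported in this notebook's results, assuming scikit-learn's `[[TN, FP], [FN, TP]]` layout for binary labels. `binary_metrics` is a hypothetical helper.

```python
def binary_metrics(cm):
    """Hypothetical helper: derive metrics from [[TN, FP], [FN, TP]]."""
    (tn, fp), (fn, tp) = cm
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Confusion matrices as reported in this notebook's results
reported = {
    "logisticregression": [[1447, 1], [48, 176]],
    "MultinomialNB": [[1448, 0], [70, 154]],
}

for name, cm in reported.items():
    scores = binary_metrics(cm)
    print(name, {k: round(v, 6) for k, v in scores.items()})
```

The recomputed accuracy, recall, and F1 agree with the reported values to six decimals, which confirms that the scores refer to the spam class.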


In [None]:
# Baseline models comparison
def metrics(predictions, y_test):
    metrics_model = {}
    for name, y_pred in predictions.items():
        # f1_score and recall_score default to the positive class (spam = 1)
        metrics_model[name] = {
            "confusion_matrix": confusion_matrix(y_test, y_pred),
            "accuracy_score": accuracy_score(y_test, y_pred),
            "f1_score": f1_score(y_test, y_pred),
            "recall_score": recall_score(y_test, y_pred),
            "classification_report": classification_report(y_test, y_pred),
        }
    # One column per model, one row per metric
    table = pd.DataFrame.from_dict(metrics_model, orient="columns")
    table = table.loc[["confusion_matrix", "accuracy_score", "f1_score", "recall_score", "classification_report"]]
    # Convert the NumPy confusion matrices to nested lists for display
    table.loc["confusion_matrix"] = table.loc["confusion_matrix"].apply(lambda x: x.tolist())
    return metrics_model, table


metrics_model, table = metrics(predictions, y_test)
table


| Metric | logisticregression | MultinomialNB |
|---|---|---|
| confusion_matrix | [[1447, 1], [48, 176]] | [[1448, 0], [70, 154]] |
| accuracy_score | 0.970694 | 0.958134 |
| f1_score | 0.877805 | 0.814815 |
| recall_score | 0.785714 | 0.6875 |
| classification_report | precision recall f1-score ... | precision recall f1-score ... |


## 🏆 Best Model Selection

Based on the evaluation metrics:

- The model with the **highest F1-score** is selected as the best baseline model
- Logistic Regression scores highest (F1 0.878 vs 0.815 for Multinomial Naive Bayes)
- It also detects more spam (recall 0.786 vs 0.688), at the cost of one false positive

This baseline model will be used for further improvements
such as hyperparameter tuning or advanced NLP models.
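
The selection rule can be written in a couple of lines. This is a minimal sketch that uses the F1-scores reported in this notebook; the names `f1_scores` and `best_name` are assumptions for the example.

```python
# F1-scores as reported in this notebook's results
f1_scores = {"logisticregression": 0.877805, "MultinomialNB": 0.814815}

# Pick the model name whose F1-score is largest
best_name = max(f1_scores, key=f1_scores.get)
print(best_name, f1_scores[best_name])
```

With the notebook's `metrics_model` dictionary, the same idea would presumably read `max(metrics_model, key=lambda name: metrics_model[name]["f1_score"])`.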


## ✅ Conclusion

- A complete baseline NLP pipeline was implemented
- TF-IDF was used for text vectorization
- Multiple baseline models were evaluated
- Logistic Regression was selected as the best model based on its F1-score

### Future Improvements
- Hyperparameter tuning
- N-grams experimentation
- Advanced models (SVC, RandomForestClassifier, Transformers)
