Submitted by: Sampriti Mahapatra, MDS202433

# Simple Spam Classification

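Comparing three basic classifiers (Logistic Regression, Multinomial Naive Bayes, Decision Tree) on precision, accuracy and F1-score, plus the raw false-positive count. Precision is prioritised.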

In [21]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import precision_score, accuracy_score, f1_score, confusion_matrix
import warnings
warnings.filterwarnings('ignore')

## Load Data

In [22]:
train_df = pd.read_csv('train.csv')
val_df = pd.read_csv('validation.csv')
test_df = pd.read_csv('test.csv')

print(f"Train: {len(train_df)}, Validation: {len(val_df)}, Test: {len(test_df)}")

Train: 3900, Validation: 836, Test: 836


## Prepare Features

In [23]:
# Handle missing values
train_df['cleaned_message'] = train_df['cleaned_message'].fillna('')
val_df['cleaned_message'] = val_df['cleaned_message'].fillna('')
test_df['cleaned_message'] = test_df['cleaned_message'].fillna('')

# Simple bag of words
vectorizer = CountVectorizer()

X_train = vectorizer.fit_transform(train_df['cleaned_message'])
X_val = vectorizer.transform(val_df['cleaned_message'])
X_test = vectorizer.transform(test_df['cleaned_message'])

y_train = (train_df['label'] == 'spam').astype(int)
y_val = (val_df['label'] == 'spam').astype(int)
y_test = (test_df['label'] == 'spam').astype(int)

print(f"Features: {X_train.shape[1]}")

Features: 6696
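To make the fit/transform split concrete, here is a minimal pure-Python sketch of a bag-of-words encoding. It is a simplified stand-in, not how `CountVectorizer` is implemented: the helper names `fit_vocabulary` and `transform` and the toy messages are hypothetical, and tokens are split on whitespace only, whereas `CountVectorizer` by default also lowercases text and drops single-character tokens. The point it illustrates is that the vocabulary is learned from the training messages only, so words that appear only in validation or test messages are ignored.

```python
from collections import Counter


def fit_vocabulary(messages):
    """Map each distinct whitespace-separated training token to a column index."""
    tokens = sorted({tok for msg in messages for tok in msg.split()})
    return {tok: i for i, tok in enumerate(tokens)}


def transform(messages, vocabulary):
    """Turn each message into a row of token counts; unseen tokens are skipped."""
    rows = []
    for msg in messages:
        counts = Counter(tok for tok in msg.split() if tok in vocabulary)
        # dicts preserve insertion order, so columns follow the fitted vocabulary
        rows.append([counts.get(tok, 0) for tok in vocabulary])
    return rows


# Toy data (hypothetical). The empty string mimics a missing message after fillna('').
train_messages = ["free prize now", "call me now", ""]
val_messages = ["free call later"]

vocabulary = fit_vocabulary(train_messages)          # learned from train only
train_counts = transform(train_messages, vocabulary)
val_counts = transform(val_messages, vocabulary)     # "later" is not in the vocabulary

print(vocabulary)
print(train_counts)
print(val_counts)
```

The empty message becomes an all-zero row, and the validation message keeps counts only for `free` and `call`.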


## Define Models

In [24]:
models = {
    'Logistic Regression': LogisticRegression(),
    'Naive Bayes': MultinomialNB(),
    'Decision Tree': DecisionTreeClassifier()
}

## Validate Models (Train vs Validation)

In [25]:
def evaluate(y_true, y_pred):
    """Return precision, accuracy, F1 and the false-positive count as a dict."""
    # labels=[0, 1] guarantees a 2x2 matrix even if one class is missing
    _, fp, _, _ = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return {
        'Precision': precision_score(y_true, y_pred),
        'Accuracy': accuracy_score(y_true, y_pred),
        'F1-Score': f1_score(y_true, y_pred),
        'False Positives': fp
    }

validation_results = []

for name, model in models.items():
    # Fit on train
    model.fit(X_train, y_train)
    
    # Predict on train and validation
    y_train_pred = model.predict(X_train)
    y_val_pred = model.predict(X_val)
    
    # Evaluate on train and validation
    train_metrics = evaluate(y_train, y_train_pred)
    val_metrics = evaluate(y_val, y_val_pred)
    
    validation_results.append({
        'Model': name,
        'Train_Precision': train_metrics['Precision'],
        'Val_Precision': val_metrics['Precision'],
        'Train_Accuracy': train_metrics['Accuracy'],
        'Val_Accuracy': val_metrics['Accuracy'],
        'Train_F1': train_metrics['F1-Score'],
        'Val_F1': val_metrics['F1-Score'],
        'Train_FP': train_metrics['False Positives'],
        'Val_FP': val_metrics['False Positives']
    })
    
    print(f"{name}:")
    print(f"  {'Metric':<12} {'Train':>10} {'Validation':>12}")
    print(f"  {'Precision':<12} {train_metrics['Precision']:>10.4f} {val_metrics['Precision']:>12.4f}")
    print(f"  {'Accuracy':<12} {train_metrics['Accuracy']:>10.4f} {val_metrics['Accuracy']:>12.4f}")
    print(f"  {'F1-Score':<12} {train_metrics['F1-Score']:>10.4f} {val_metrics['F1-Score']:>12.4f}")
    print(f"  {'False Pos':<12} {train_metrics['False Positives']:>10} {val_metrics['False Positives']:>12}")
    print()

Logistic Regression:
  Metric            Train   Validation
  Precision        1.0000       0.9720
  Accuracy         0.9974       0.9868
  F1-Score         0.9903       0.9498
  False Pos             0            3

Naive Bayes:
  Metric            Train   Validation
  Precision        0.9713       0.9537
  Accuracy         0.9921       0.9833
  F1-Score         0.9703       0.9364
  False Pos            15            5

Decision Tree:
  Metric            Train   Validation
  Precision        1.0000       0.9018
  Accuracy         1.0000       0.9737
  F1-Score         1.0000       0.9018
  False Pos             0           11



In [26]:
val_results_df = pd.DataFrame(validation_results)
print(val_results_df.to_string(index=False))

              Model  Train_Precision  Val_Precision  Train_Accuracy  Val_Accuracy  Train_F1   Val_F1  Train_FP  Val_FP
Logistic Regression         1.000000       0.971963        0.997436      0.986842  0.990347 0.949772         0       3
        Naive Bayes         0.971264       0.953704        0.992051      0.983254  0.970335 0.936364        15       5
      Decision Tree         1.000000       0.901786        1.000000      0.973684  1.000000 0.901786         0      11
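One way to read the validation table is to rank the models by validation precision and to look at the train-minus-validation gap as a rough overfitting signal. The sketch below does this with the figures copied from the table above. The dictionary layout and the helper `precision_gap` are illustrative choices, not part of the notebook, which ranks the models on the test set further down.

```python
# Figures copied from the validation table above.
validation = {
    "Logistic Regression": {"train_precision": 1.000000, "val_precision": 0.971963, "val_fp": 3},
    "Naive Bayes": {"train_precision": 0.971264, "val_precision": 0.953704, "val_fp": 5},
    "Decision Tree": {"train_precision": 1.000000, "val_precision": 0.901786, "val_fp": 11},
}


def precision_gap(row):
    """Train precision minus validation precision (larger suggests more overfitting)."""
    return row["train_precision"] - row["val_precision"]


# Rank models from highest to lowest validation precision.
ranked = sorted(validation, key=lambda name: validation[name]["val_precision"], reverse=True)
gaps = {name: precision_gap(row) for name, row in validation.items()}

for name in ranked:
    row = validation[name]
    print(f"{name:<20} val_precision={row['val_precision']:.4f} "
          f"gap={gaps[name]:.4f} val_fp={row['val_fp']}")
```

On these numbers the Decision Tree has by far the largest gap (about 0.098), which fits its perfect training scores, while Naive Bayes has the smallest.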


## Test Set Evaluation

In [27]:
test_results = []

for name, model in models.items():
    # Predict on test set (models already fitted)
    y_pred = model.predict(X_test)
    
    # Evaluate
    metrics = evaluate(y_test, y_pred)
    
    test_results.append({
        'Model': name,
        'Precision': metrics['Precision'],
        'Accuracy': metrics['Accuracy'],
        'F1-Score': metrics['F1-Score'],
        'False Positives': metrics['False Positives']
    })
    
    print(f"{name}:")
    print(f"  Precision: {metrics['Precision']:.4f}")
    print(f"  Accuracy:  {metrics['Accuracy']:.4f}")
    print(f"  F1-Score:  {metrics['F1-Score']:.4f}")
    print(f"  False Positives: {metrics['False Positives']}")
    print()

Logistic Regression:
  Precision: 0.9789
  Accuracy:  0.9749
  F1-Score:  0.8986
  False Positives: 2

Naive Bayes:
  Precision: 0.9612
  Accuracy:  0.9797
  F1-Score:  0.9209
  False Positives: 4

Decision Tree:
  Precision: 0.8349
  Accuracy:  0.9533
  F1-Score:  0.8235
  False Positives: 18



## Results Comparison

In [28]:
results_df = pd.DataFrame(test_results)
results_df = results_df.sort_values('Precision', ascending=False)
print(results_df.to_string(index=False))

              Model  Precision  Accuracy  F1-Score  False Positives
Logistic Regression   0.978947  0.974880  0.898551                2
        Naive Bayes   0.961165  0.979665  0.920930                4
      Decision Tree   0.834862  0.953349  0.823529               18


We prioritise precision because a false positive here is a legitimate message flagged as spam, and we want to keep the number of such mistakes as low as possible.
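Precision is TP / (TP + FP), so for a fixed number of correctly flagged spam messages every additional false positive lowers it. The sketch below reproduces the test precisions from counts. The false-positive counts come from the table above. The true-positive counts are not printed in the notebook, so the values used here are an assumption, inferred so that TP / (TP + FP) matches the reported precision.

```python
def precision(tp, fp):
    """Fraction of messages predicted as spam that really are spam."""
    if tp + fp == 0:
        return 0.0  # no positive predictions at all
    return tp / (tp + fp)


# (tp, fp) per model: fp is taken from the test table, tp is inferred (assumption).
test_counts = {
    "Logistic Regression": (93, 2),
    "Naive Bayes": (99, 4),
    "Decision Tree": (91, 18),
}

test_precision = {name: precision(tp, fp) for name, (tp, fp) in test_counts.items()}

for name, value in test_precision.items():
    print(f"{name:<20} precision={value:.6f}")
```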

In [29]:
best_model = results_df.iloc[0]['Model']
print(f"Best model by Precision: {best_model}")

Best model by Precision: Logistic Regression


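Logistic Regression gives the highest precision (0.9789) and the fewest false positives (2) on the test set, although Naive Bayes scores slightly higher on accuracy and F1. One possible explanation is that the dataset is simple and close to linearly separable in the bag-of-words feature space, so a simple linear model such as logistic regression is enough to give the best precision. The Decision Tree fits the training set perfectly but drops to 0.8349 precision on the test set, which suggests it is overfitting.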
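The notebook does not report recall, but since F1 is the harmonic mean of precision and recall, recall can be recovered from the test table as R = F1 * P / (2P - F1). The sketch below applies that identity to the reported figures to show what the precision-first choice costs. The helper name `recall_from_f1` is hypothetical, and the results are only as accurate as the six-decimal values printed above.

```python
def recall_from_f1(precision, f1):
    """Solve F1 = 2PR / (P + R) for recall R."""
    denominator = 2 * precision - f1
    if denominator <= 0:
        raise ValueError("precision and F1 are inconsistent")
    return f1 * precision / denominator


# (precision, F1) copied from the test results table.
test_scores = {
    "Logistic Regression": (0.978947, 0.898551),
    "Naive Bayes": (0.961165, 0.920930),
    "Decision Tree": (0.834862, 0.823529),
}

test_recall = {name: recall_from_f1(p, f1) for name, (p, f1) in test_scores.items()}

for name, value in test_recall.items():
    print(f"{name:<20} recall={value:.4f}")
```

On these figures Logistic Regression recovers roughly 83% of the spam messages against roughly 88% for Naive Bayes, so the higher precision is paid for with more missed spam.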