In [46]:
import pandas as pd
import re
import numpy as np
import nltk
# nltk.download('stopwords')
df = pd.read_csv('spamham.csv')
df.sample(5)

Unnamed: 0,label,Message
1386,1,Subject: works wondder\r\ndear sir / madam .\r...
6904,0,I dont have i shall buy one dear
4747,0,"Subject: re : flow volumes at oxy gladewater ,..."
7173,0,How much she payed. Suganya.
8175,0,\HEY BABE! FAR 2 SPUN-OUT 2 SPK AT DA MO... DE...


In [47]:
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet
lemmatizer = WordNetLemmatizer()

stop_words = set(stopwords.words('english'))

def clean_text_lemmatize(text):
    text = text.lower()
    text = re.sub(r'\n', ' ', text)
    text = re.sub(r'\d+', '', text)
    text = re.sub(r'[^\w\s$!]', '', text)
    words = text.split()
    words = [w for w in words if w not in stop_words]
    words = [lemmatizer.lemmatize(w) for w in words]  # lemmatize instead of stem
    return " ".join(words)

# Clean the Message column (lowercase, strip digits and punctuation, drop stopwords, lemmatize) before vectorizing
df['Message'] = df['Message'].apply(clean_text_lemmatize)
df.head(5)

Unnamed: 0,label,Message
0,0,subject enron methanol meter follow note gave ...
1,0,subject hpl nom january see attached file hpln...
2,0,subject neon retreat ho ho ho around wonderful...
3,1,subject photoshop window office cheap main tre...
4,0,subject indian spring deal book teco pvr reven...
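
To make the regex steps in `clean_text_lemmatize` easier to follow, here is a minimal sketch that applies only those steps to a made-up message. The helper name `clean_text_regex_only` and the sample string are illustrative assumptions; stopword removal and lemmatization are left out so the snippet runs without any NLTK data.

```python
import re

def clean_text_regex_only(text):
    """Regex steps of clean_text_lemmatize, without stopwords or lemmatization."""
    text = text.lower()
    text = re.sub(r'\n', ' ', text)         # newlines become spaces
    text = re.sub(r'\d+', '', text)         # drop runs of digits
    text = re.sub(r'[^\w\s$!]', '', text)   # drop punctuation except $ and !
    return " ".join(text.split())           # split() also discards stray \r and extra spaces

sample = "Subject: WIN $1000 now!!!\r\nCall 555-1234."  # made-up message, not from the dataset
cleaned = clean_text_regex_only(sample)
print(cleaned)
```

Note that `$` and `!` survive the cleaning, so tokens such as `$` and `now!!!` can still reach the vectorizer.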


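Difference between stemming and lemmatization (the lemmas shown assume the correct part-of-speech tag is supplied; `lemmatizer.lemmatize(w)` as called above treats every word as a noun, so words such as "running", "better" and "went" are left unchanged by it):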
| Original | Stemmed (Porter) | Lemmatized |
| -------- | ---------------- | ---------- |
| running  | run              | run        |
| better   | better           | good       |
| devices  | devic            | device     |
| eligible | elig             | eligible   |
| went     | went             | go         |


In [48]:
print(df.duplicated().sum())
df.drop_duplicates(keep='first', inplace=True)
df.shape

459


(9744, 2)

In [49]:
from sklearn.feature_extraction.text import TfidfVectorizer
f = TfidfVectorizer(ngram_range=(1,2), min_df=5, max_df=0.9, sublinear_tf=True)
X = f.fit_transform(df['Message'])
y = df.label
print(X.shape, y.shape)

(9744, 17722) (9744,)


In [50]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.27, random_state=42)

In [51]:
from sklearn import naive_bayes
from sklearn.metrics import recall_score, precision_score, f1_score
list_alpha = np.arange(1/100000, 5, 0.019) # start at 0.00001, step 0.019, stop just before 5
# list_alpha is a numpy array with 264 elements

train_score = np.zeros(len(list_alpha))
test_score = np.zeros(len(list_alpha))
recall_test = np.zeros(len(list_alpha))
precision_test = np.zeros(len(list_alpha))
f1_test = np.zeros(len(list_alpha))
for i, alpha in enumerate(list_alpha):
    nb = naive_bayes.MultinomialNB(alpha=alpha)
    nb.fit(X_train, y_train)
    y_pred = nb.predict(X_test)
    train_score[i] = nb.score(X_train, y_train)
    test_score[i] = nb.score(X_test, y_test)
    recall_test[i] = recall_score(y_test, y_pred)
    precision_test[i] = precision_score(y_test, y_pred)
    f1_test[i] = f1_score(y_test, y_pred)

In [52]:
train_score.shape, precision_test.shape

((264,), (264,))

In [53]:
# Stack the score arrays column-wise into a DataFrame (np.matrix is no longer recommended by NumPy)
models = pd.DataFrame(data=np.c_[list_alpha, train_score, test_score, recall_test, precision_test, f1_test], columns=['alpha', 'Train Accuracy', 'Test Accuracy', 'Test Recall', 'Test Precision', 'F1 Score'])
models

Unnamed: 0,alpha,Train Accuracy,Test Accuracy,Test Recall,Test Precision,F1 Score
0,0.00001,0.988050,0.961992,0.931579,0.896959,0.913941
1,0.01901,0.984817,0.966173,0.940351,0.906937,0.923342
2,0.03801,0.984676,0.967693,0.936842,0.915952,0.926279
3,0.05701,0.983692,0.966933,0.933333,0.915663,0.924414
4,0.07601,0.983129,0.967313,0.931579,0.918685,0.925087
...,...,...,...,...,...,...
259,4.92101,0.863630,0.851387,0.314035,1.000000,0.477971
260,4.94001,0.863068,0.850627,0.310526,1.000000,0.473896
261,4.95901,0.862786,0.850627,0.310526,1.000000,0.473896
262,4.97801,0.862224,0.849867,0.307018,1.000000,0.469799


In [54]:
test_Accuracy_winner = models[models['Test Accuracy']==models['Test Accuracy'].max()]
test_Accuracy_winner

Unnamed: 0,alpha,Train Accuracy,Test Accuracy,Test Recall,Test Precision,F1 Score
11,0.20901,0.979615,0.971114,0.924561,0.941071,0.932743


In [55]:
winner_f1 = models.loc[models['F1 Score'].idxmax()]
winner_f1

alpha             0.209010
Train Accuracy    0.979615
Test Accuracy     0.971114
Test Recall       0.924561
Test Precision    0.941071
F1 Score          0.932743
Name: 11, dtype: float64

#### **Note:**

The model with the highest **test accuracy** is also the one with the highest **F1 score**: **alpha=0.20901**

In [56]:

from sklearn.metrics import accuracy_score, confusion_matrix
model = naive_bayes.MultinomialNB(alpha=0.20901)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print('Accuracy', accuracy_score(y_test, y_pred))
print('f1 score:',f1_score(y_test, y_pred))
print('Confusion Matrix\n', confusion_matrix(y_test, y_pred))

Accuracy 0.9711136450019004
f1 score: 0.9327433628318584
Confusion Matrix
 [[2028   33]
 [  43  527]]


In [146]:
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
model_list = {
    'Multinomial NB': MultinomialNB(alpha=0.20901),
    'Logistic Regression Clf': LogisticRegression(),
    'Support Vector Classifier': SVC(kernel='linear', C=1)
}

- We will create a generic function to check each model's performance and then compare them.

In [None]:
def evaluate_models(X, y, models):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.27, random_state=42)
    models_list = []
    scores = []
    f1_scores = []
    for model_name, model in models.items():
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)

        score = accuracy_score(y_test, y_pred)
        f1_sc = f1_score(y_test, y_pred)

        print(f'--- Score for {model_name} ---')
        print(f'{score}')
        models_list.append(model_name)
        scores.append(score)
        f1_scores.append(f1_sc)
    print()

    res = pd.DataFrame()
    res['Model Name'] = models_list
    res['Score'] = scores
    res['F1 Score'] = f1_scores
    return res


In [147]:
result = evaluate_models(X, y, model_list)

--- Score for Multinomial NB ---
0.9711136450019004
--- Score for Logistic Regression Clf ---
0.9448878753325731
--- Score for Support Vector Classifier ---
0.9752945648042569



In [None]:
result

Unnamed: 0,Model Name,Score,F1 Score
0,Multinomial NB,0.971114,0.932743
1,Logistic Regression Clf,0.944888,0.857704
2,Support Vector Classifier,0.975295,0.941599


### Observation:

- So far, SVC gives the best results.
Let's tune its regularization parameter C.

In [128]:
from sklearn.pipeline import Pipeline
pipe = Pipeline([
    ('svc', SVC(kernel='linear'))
])
param = {
    'svc__C':[1.1,1.22,1.35,1.25,1.28],
}

grid = GridSearchCV(pipe, param, cv=6, scoring='f1', verbose=2, n_jobs=-1)
grid.fit(X_train, y_train)

print(grid.best_params_)

Fitting 6 folds for each of 5 candidates, totalling 30 fits
{'svc__C': 1.35}


In [138]:
best_model = grid.best_estimator_  # already refit on X_train with the best C (refit=True by default)
y_predicted = best_model.predict(X_test)
accuracy_score(y_test, y_predicted)

0.976054732041049

In [139]:
f1_score(y_test, y_predicted)

0.9432943294329433

In [None]:

from sklearn.svm import SVC
m = SVC(kernel='linear', C=1.22)
m.fit(X_train, y_train)
predicted = m.predict(X_test)
accuracy_score(y_test, predicted)


0.9768148992778412

In [143]:
from sklearn.metrics import classification_report
print('Accuracy Score is: ',accuracy_score(y_test, predicted))
print('F1 Score for the model is:',f1_score(y_test, predicted))
print(confusion_matrix(y_test,predicted))
print(classification_report(y_test, predicted))

Accuracy Score is:  0.9768148992778412
F1 Score for the model is: 0.945193171608266
[[2044   17]
 [  44  526]]
              precision    recall  f1-score   support

           0       0.98      0.99      0.99      2061
           1       0.97      0.92      0.95       570

    accuracy                           0.98      2631
   macro avg       0.97      0.96      0.97      2631
weighted avg       0.98      0.98      0.98      2631



- True Negatives (TN) = 2044 → Ham correctly classified
- False Positives (FP) = 17 → Ham misclassified as Spam
- False Negatives (FN) = 44 → Spam misclassified as Ham
- True Positives (TP) = 526 → Spam correctly classified
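
The scores in the classification report follow directly from these four counts. As a quick check, this plain-Python sketch recomputes them using only the numbers from the confusion matrix above:

```python
# Counts taken from the confusion matrix above (spam is the positive class)
tn, fp, fn, tp = 2044, 17, 44, 526

accuracy = (tp + tn) / (tp + tn + fp + fn)   # share of all messages classified correctly
precision = tp / (tp + fp)                   # of messages flagged as spam, how many were spam
recall = tp / (tp + fn)                      # of actual spam, how many were caught
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of precision and recall

print(f"accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
```

The recall is the weakest of the four numbers: the 44 false negatives are spam messages that slip through as ham.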

##### Why the Manual Test and GridSearchCV Preferred Different Values of C

While testing the SVM manually and through GridSearchCV, I noticed that the two approaches favored slightly different values of C and produced slightly different scores.
This happens because GridSearchCV scores each candidate with k-fold cross-validation on the training set (cv=6 above), while the manual test relies on a single train-test split.

In cross-validation, the training set is divided into 6 parts; the model trains on 5 and validates on 1, repeating the process six times so that each part serves as the validation set once. The final score is the average across folds, which gives a more stable picture of generalization than any single split.

So, although C = 1.22 worked best on my test split, GridSearchCV found C = 1.35 slightly better on average across folds. Such variations are normal: they come from the different data splits used to score each candidate.

In short, the cross-validated choice depends less on one particular split, so it is generally the more reliable guide to real-world performance.
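
The fold rotation described above can be sketched in plain Python. This is only an illustration of the idea, not scikit-learn's implementation: the helper `k_fold_indices` is a hypothetical name, the folds are contiguous slices, and class stratification (which scikit-learn applies by default for classifiers when `cv` is an integer) is ignored.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs for k contiguous, near-equal folds."""
    # Spread the remainder over the first folds so sizes differ by at most 1
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size

# Toy example: 20 rows split into 6 folds, mirroring cv=6
folds = list(k_fold_indices(n_samples=20, k=6))
for i, (train_idx, val_idx) in enumerate(folds):
    print(f"fold {i}: train on {len(train_idx)} rows, validate on {len(val_idx)} rows")
```

Each row lands in the validation set exactly once, and the score reported for a candidate C is the mean of its six per-fold scores.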

C controls the penalty for misclassified training points:
- High C (e.g., 100) → low tolerance for misclassification
    - The model tries very hard to classify every training point correctly.
    - The decision boundary becomes tight and may fit noise → risk of overfitting.
- Low C → high tolerance for misclassification
    - The model allows some misclassifications in training.
    - The decision boundary is smoother → better generalization, but may slightly underfit.
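
For reference, this trade-off is usually written as the soft-margin SVM objective. The symbols below (weight vector $w$, bias $b$, labels $y_i \in \{-1, +1\}$, feature vectors $x_i$, and $n$ training points) are standard textbook notation, not names used in this notebook:

$$
\min_{w,\,b}\ \frac{1}{2}\lVert w \rVert^2 \;+\; C \sum_{i=1}^{n} \max\bigl(0,\ 1 - y_i\,(w^\top x_i + b)\bigr)
$$

The first term favors a wide margin, while the second sums the hinge loss of each training point scaled by C, so a larger C makes every margin violation more costly.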

## END-REPORT

The best model is **SVC** with a linear kernel and regularization **C=1.22**,
which reached 97.68% accuracy and an F1 score of 0.945 on the test split.