✅ Data loading
✅ Text cleaning
✅ TF-IDF vectorization
✅ Label encoding
✅ SMOTE for class imbalance. Use it when:

* You have imbalanced data
* You care about recall for the minority class (e.g., spam detection, fraud, disease)
* You want to improve performance for the rare class

✅ Model training (Logistic Regression)
✅ Evaluation (Accuracy, Confusion Matrix, Precision, Recall, F1-Score)
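
The SMOTE step in the checklist above creates synthetic minority samples by interpolating between a minority point and one of its nearest minority neighbours. The following is a minimal pure-Python sketch of that idea only; it is not the `imblearn` implementation, and `smote_sketch`, its parameters, and the toy points are hypothetical names chosen for illustration.

```python
import random

def smote_sketch(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours (illustration only)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority points by squared Euclidean distance to base.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        lam = rng.random()  # interpolation weight in [0, 1)
        synthetic.append([a + lam * (b - a) for a, b in zip(base, neighbour)])
    return synthetic

# Toy 2-D minority class (made-up values).
minority = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]]
new_points = smote_sketch(minority, n_new=6)
print(len(new_points), "synthetic points created")
```

Because each synthetic point lies on a segment between two real minority points, the new samples stay inside the region the minority class already occupies rather than being exact duplicates.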

In [1]:
# 📦 Step 1: Import Libraries


import pandas as pd
import numpy as np
import re
import string

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from imblearn.over_sampling import SMOTE

import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
nltk.download('stopwords')

# 🧹 Step 2: Load Data
df = pd.read_csv('spam.csv', encoding='latin-1')[['v1', 'v2']]
df.columns = ['label', 'message']

# 🧼 Step 3: Text Cleaning
stop_words = set(stopwords.words('english'))
ps = PorterStemmer()

def clean_text(text):
    text = text.lower()
    text = re.sub(r'[^\w\s]', '', text)  # Remove punctuation
    words = text.split()
    filtered = [ps.stem(w) for w in words if w not in stop_words]
    return ' '.join(filtered)

df['cleaned'] = df['message'].apply(clean_text)

# 🔢 Step 4: Label Encoding
df['label_num'] = df['label'].map({'ham': 0, 'spam': 1})

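# 🧠 Step 5: Feature Extraction with TF-IDF
# Note: the vectorizer is fitted on the full dataset before the split, so test
# messages influence the vocabulary and IDF weights. For a stricter evaluation,
# fit on the training split only and call transform() on the test split.
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(df['cleaned'])
y = df['label_num']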

# ✂️ Step 6: Train-Test Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# ⚖️ Step 7: Handle Imbalance using SMOTE
sm = SMOTE(random_state=42)
X_train_res, y_train_res = sm.fit_resample(X_train, y_train)

# 🤖 Step 8: Train Model
model = LogisticRegression()
model.fit(X_train_res, y_train_res)

# 📊 Step 9: Predict & Evaluate
y_pred = model.predict(X_test)

print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))
print("\nClassification Report:")
print(classification_report(y_test, y_pred))


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Confusion Matrix:
[[955  10]
 [ 17 133]]

Classification Report:
              precision    recall  f1-score   support

           0       0.98      0.99      0.99       965
           1       0.93      0.89      0.91       150

    accuracy                           0.98      1115
   macro avg       0.96      0.94      0.95      1115
weighted avg       0.98      0.98      0.98      1115



Yes ✅, **your result is excellent** and the numbers are **consistent** with each other!

Let me break it down for you:

---

## 🔍 **Confusion Matrix:**

```
[[955  10]      → 955 ham correctly predicted, 10 ham wrongly as spam (False Positive)
 [ 17 133]]     → 133 spam correctly predicted, 17 spam wrongly as ham (False Negative)
```

---

## 📊 **Metrics (Class 0 = ham, Class 1 = spam):**

| Metric    | Ham (0) | Spam (1) |
| --------- | ------- | -------- |
| Precision | 0.98    | 0.93     |
| Recall    | 0.99    | 0.89     |
| F1-Score  | 0.99    | 0.91     |

**Accuracy:** **0.98** overall on 1115 messages ✅

---

## 📌 **Interpretation:**

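* **High Accuracy**: 98%, well above the roughly 87% you would get by always predicting ham (965 of the 1115 test messages are ham)
* **Spam Precision (0.93)**: 93% of the messages predicted as spam were actually spam
* **Spam Recall (0.89)**: 89% of the actual spam messages were detected
* **Ham Recall (0.99)**: Excellent, almost all ham messages were identified correctly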
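
The spam figures above can be reproduced by hand from the four cells of the Logistic Regression confusion matrix. This is a small sketch of that arithmetic; the variable names are just labels for the matrix cells.

```python
# Confusion matrix from the Logistic Regression run (rows = actual, columns = predicted).
tn, fp = 955, 10    # actual ham:  predicted ham, predicted spam
fn, tp = 17, 133    # actual spam: predicted ham, predicted spam

precision = tp / (tp + fp)                           # share of predicted spam that is spam
recall = tp / (tp + fn)                              # share of actual spam that was caught
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean of the two
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} accuracy={accuracy:.2f}")
# prints: precision=0.93 recall=0.89 f1=0.91 accuracy=0.98
```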

---

## ✅ Is It Okay?

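Yes, this result is solid:

* **Balanced**: spam precision (0.93) and spam recall (0.89) are close, so the model is not trading one heavily for the other
* **No obvious sign of overfitting**: the test scores are high, although confirming this needs a comparison with the training scores
* **Only 17 of 150 spam messages missed**
* **Only 10 false alarms (ham predicted as spam) out of 965 ham messages**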

---

## 🎯 If You Want to Improve Spam Detection:

Try the following:

* ✅ Use `class_weight='balanced'` in `LogisticRegression`
* ✅ Try models like **MultinomialNB** (good for text)
* ✅ Tune the decision threshold using `predict_proba` and the ROC curve
* ✅ Add **lemmatization** using `WordNetLemmatizer`
* ✅ Remove extra noise (such as email addresses, URLs, and numbers)
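
As a rough illustration of the threshold suggestion in the list above, the sketch below shows how lowering the cut-off raises spam recall, usually at some cost in precision. The labels and probabilities are made-up values; in the notebook, `spam_proba` would come from `model.predict_proba(X_test)[:, 1]`. The helper `precision_recall_at` is a hypothetical name.

```python
def precision_recall_at(threshold, y_true, spam_proba):
    """Label a message as spam when its probability is >= threshold,
    then return (precision, recall) for the spam class."""
    y_pred = [1 if p >= threshold else 0 for p in spam_proba]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical labels (1 = spam) and predicted spam probabilities.
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
spam_proba = [0.05, 0.20, 0.35, 0.60, 0.30, 0.45, 0.70, 0.90]

results = {}
for threshold in (0.5, 0.4, 0.25):
    results[threshold] = precision_recall_at(threshold, y_true, spam_proba)
    p, r = results[threshold]
    print(f"threshold={threshold:.2f} precision={p:.2f} recall={r:.2f}")
```

The default `predict` call corresponds to a threshold of 0.5; choosing a different value is a trade-off you make based on whether missed spam or false alarms cost more.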

---

If you want, I can:

* 🔁 Add another model like Naive Bayes
* 🎛 Tune the threshold
* 📉 Plot the ROC curve
* 💬 Explain `predict_proba`
* 📦 Save your model for use in a web app

Just say what you need.


Naive Bayes (trained on the original training split, without SMOTE)

In [2]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Train the Naive Bayes model
nb_model = MultinomialNB()
nb_model.fit(X_train, y_train)

# Predict
y_pred_nb = nb_model.predict(X_test)

# Evaluation
print("=== Naive Bayes Model Results ===")
print("Accuracy:", accuracy_score(y_test, y_pred_nb))
print("\nConfusion Matrix:\n", confusion_matrix(y_test, y_pred_nb))
print("\nClassification Report:\n", classification_report(y_test, y_pred_nb))


=== Naive Bayes Model Results ===
Accuracy: 0.9614349775784753

Confusion Matrix:
 [[965   0]
 [ 43 107]]

Classification Report:
               precision    recall  f1-score   support

           0       0.96      1.00      0.98       965
           1       1.00      0.71      0.83       150

    accuracy                           0.96      1115
   macro avg       0.98      0.86      0.91      1115
weighted avg       0.96      0.96      0.96      1115



SMOTE again, this time for Naive Bayes

In [3]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from imblearn.over_sampling import SMOTE

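# Step 1: Apply SMOTE to the full dataset (X, y), not just the training split
sm = SMOTE(random_state=42)
X_smote, y_smote = sm.fit_resample(X, y)

# Step 2: Train Naive Bayes on SMOTE data
nb = MultinomialNB()
nb.fit(X_smote, y_smote)

# Step 3: Predict on the full original dataset (not the resampled one).
# Note: every row of X was also used for resampling and training, so the
# scores below are optimistic and not comparable to the held-out test results.
y_pred_nb = nb.predict(X)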

# Step 4: Evaluation
print("=== Naive Bayes (with SMOTE) ===")
print("Accuracy:", accuracy_score(y, y_pred_nb))
print("\nConfusion Matrix:\n", confusion_matrix(y, y_pred_nb))
print("\nClassification Report:\n", classification_report(y, y_pred_nb))


=== Naive Bayes (with SMOTE) ===
Accuracy: 0.9863603732950467

Confusion Matrix:
 [[4763   62]
 [  14  733]]

Classification Report:
               precision    recall  f1-score   support

           0       1.00      0.99      0.99      4825
           1       0.92      0.98      0.95       747

    accuracy                           0.99      5572
   macro avg       0.96      0.98      0.97      5572
weighted avg       0.99      0.99      0.99      5572

