In [1]:
import pandas as pd

In [2]:
df = pd.read_csv("spam.csv", encoding='latin-1')  # encoding is set to latin-1 to handle special characters
# encoding='latin-1': needed to read special characters or European-language text correctly; without it, pandas may raise an error.

df.head(2)

Unnamed: 0,v1,v2,Unnamed: 2,Unnamed: 3,Unnamed: 4
0,ham,"Go until jurong point, crazy.. Available only ...",,,
1,ham,Ok lar... Joking wif u oni...,,,


In [3]:
df = df[['v1', 'v2']]


In [4]:
df.head()

Unnamed: 0,v1,v2
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


Rename the columns:
v1        → label (ham/spam)
v2        → message text

In [5]:
df.columns = ['label', 'message']

In [6]:
df.head()

Unnamed: 0,label,message
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


Data Cleaning

In [7]:
df.dropna(inplace=True)  # Remove any nulls (if present)

Preprocess the Data
Convert labels to numbers:

Yes, absolutely! You can preprocess **categorical labels** like `ham/spam` using multiple encoding techniques. Let's compare the three main approaches and show how you can apply them.

---

### ✅ 1. Using `.map()` (Best for Binary Labels, Recommended Here)

```python
df['label_num'] = df['label'].map({'ham': 0, 'spam': 1})
```

* 🔹 Simple and direct
* 🔹 Works great for binary classification
* 🔹 Output: single column with 0 or 1

---

### ✅ 2. Using **LabelEncoder** (Works for multiple classes)

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
df['label_num'] = le.fit_transform(df['label'])
```

* 🔹 Useful if there are more than 2 labels
* 🔹 `ham` → 0, `spam` → 1 (assigned automatically, in sorted label order)
* 🔹 Not ideal if you want to control label encoding manually

---

### ✅ 3. Using **OneHotEncoder** (NOT needed for target variable in classification)

```python
from sklearn.preprocessing import OneHotEncoder

# scikit-learn >= 1.2 uses `sparse_output`; older releases used `sparse`
ohe = OneHotEncoder(sparse_output=False, drop='first')  # drop='first' avoids dummy variable trap
encoded = ohe.fit_transform(df[['label']])
```

* 🔹 Produces **one column per class**; with `drop='first'` and only two classes, a single column remains
* 🔹 Useful when the target is multi-class and you want to treat each class as separate binary output (for multi-label classification or neural nets)
* ❌ Not recommended for binary classification with scikit-learn estimators like `LogisticRegression`

---

### ✅ So What to Use Here?

Since this is a **binary classification problem (spam or ham)**, the best and cleanest approach is:

```python
df['label_num'] = df['label'].map({'ham': 0, 'spam': 1})
```

You **don't need OneHotEncoder** unless your model requires it (e.g., in some neural networks or multi-label tasks).

---



In [8]:
df['label_num'] = df['label'].map({'ham': 0, 'spam': 1})


In [9]:
df.head()

Unnamed: 0,label,message,label_num
0,ham,"Go until jurong point, crazy.. Available only ...",0
1,ham,Ok lar... Joking wif u oni...,0
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,1
3,ham,U dun say so early hor... U c already then say...,0
4,ham,"Nah I don't think he goes to usf, he lives aro...",0


Feature Engineering (Text → Numeric)

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(df['message'])
y = df['label_num']

A better process: clean the text first, then convert it to numbers with TF-IDF.

import string
import re
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

stop_words = set(stopwords.words('english'))
ps = PorterStemmer()


def clean_text(text):  # cleans a single message

    text = text.lower()  # convert all text to lower case

    # delete punctuation and special characters (like .,!?@#$%):
    # "Hello! How are you?" → "hello how are you"
    text = re.sub(r'[^\w\s]', '', text)

    words = text.split()  # 'i love coding' → ['i', 'love', 'coding']

    # remove stopwords (like 'the', 'is', 'in', 'and'), then stem each word
    # to its root form: 'running' → 'run', 'loved' → 'love', 'easily' → 'easili'
    filtered_words = [ps.stem(w) for w in words if w not in stop_words]

    # join the words back together:
    # "Hi there!!! I loved the meeting, it was fantastic." → "hi love meet fantast"
    return ' '.join(filtered_words)


df['cleaned'] = df['message'].apply(clean_text)


Clean the text

NLTK (Natural Language Toolkit) is a library for working with human (natural) language. It supports tasks such as:
Tokenization	splitting a sentence into words
Stopwords Removal	dropping common but uninformative words such as "the", "is", "in"
Stemming	reducing a word to its root form (e.g. "running" → "run")
Lemmatization	reducing a word to its dictionary form
POS Tagging	identifying whether a word is a noun, verb, or adjective
Text classification

In [10]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [11]:
import string
import re
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

stop_words = set(stopwords.words('english'))
ps = PorterStemmer()

def clean_text(text):
    text = text.lower()
    text = re.sub(r'[^\w\s]', '', text)
    words = text.split()
    filtered_words = [ps.stem(w) for w in words if w not in stop_words]  # w iterates over the words; stopwords are dropped, the rest are stemmed
    return ' '.join(filtered_words)

df['cleaned'] = df['message'].apply(clean_text) 

In [12]:
df.head()

Unnamed: 0,label,message,label_num,cleaned
0,ham,"Go until jurong point, crazy.. Available only ...",0,go jurong point crazi avail bugi n great world...
1,ham,Ok lar... Joking wif u oni...,0,ok lar joke wif u oni
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,1,free entri 2 wkli comp win fa cup final tkt 21...
3,ham,U dun say so early hor... U c already then say...,0,u dun say earli hor u c alreadi say
4,ham,"Nah I don't think he goes to usf, he lives aro...",0,nah dont think goe usf live around though
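To see what each cleaning step does without downloading any NLTK data, the sketch below reproduces the lowercase → strip punctuation → split → stopword-filter steps in plain Python. The tiny `DEMO_STOP_WORDS` set and the name `clean_text_demo` are hypothetical stand-ins, and stemming is deliberately left out, so the output will not match the `cleaned` column exactly.

```python
import re

# Hypothetical, tiny stopword list used only for this demo;
# the notebook uses NLTK's much larger English list.
DEMO_STOP_WORDS = {"i", "the", "is", "in", "and", "it", "was", "there", "are", "you", "how"}


def clean_text_demo(text):
    """Lowercase, strip punctuation, split on whitespace, drop demo stopwords."""
    text = text.lower()
    # [^\w\s] matches anything that is not a letter, digit, underscore, or whitespace
    text = re.sub(r"[^\w\s]", "", text)
    words = text.split()
    return " ".join(w for w in words if w not in DEMO_STOP_WORDS)


print(clean_text_demo("Hi there!!! I loved the meeting, it was fantastic."))
print(clean_text_demo("Nah I don't think he goes to usf"))
```

Note how stripping the apostrophe turns "don't" into "dont", the same token that shows up in the notebook's `cleaned` column.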


Step 3: Convert Text to Numbers (TF-IDF)

**What is a TF-IDF vector?** It is a standard technique in NLP and text classification.

## ✅ What is TF-IDF?

**TF-IDF** stands for:

```
TF → Term Frequency  
IDF → Inverse Document Frequency
```

It is a **text-to-number conversion technique** that calculates the **importance of a word** within each message or document.

---

## 🎯 Main Goal:

A machine cannot work directly with text (like: "Hello, this is spam message"), because it is a string.

So you have to convert the words into **numbers/vectors**.
TF-IDF does exactly that, but instead of plain numbers it assigns **weights according to importance**.

---

## 🧠 What TF (Term Frequency) means:

How many times does a word appear within a single message?

### Example:

Message: `"You are very very good"`

* TF("very") = 2
* TF("you") = 1
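The raw counts above can be reproduced in a few lines of plain Python. This is only a sketch of the counting idea: splitting on whitespace is an assumption made here, and `TfidfVectorizer` uses its own tokenizer.

```python
from collections import Counter

message = "You are very very good"

# Raw term frequency: how many times each lowercase token occurs in the message
tf = Counter(message.lower().split())

print(tf["very"], tf["you"])
```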

---

## 🧠 What IDF (Inverse Document Frequency) means:

Words that are common across many messages (such as "is", "the", "you") carry little importance.

Words that are unique or rare (such as "winner", "free", "offer") carry more importance.

That is IDF.

---

## 🔄 TF-IDF Formula:

```
TF-IDF(word) = TF(word) × IDF(word)
```

That is: how many times the word occurs in one message (TF) × how rare that word is across the whole dataset (IDF).
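For reference, the weighting that scikit-learn's `TfidfVectorizer` applies with its default settings (`smooth_idf=True`, `norm='l2'`) can be written out as below. Here $n$ is the number of messages, $\mathrm{df}(t)$ is the number of messages containing term $t$, and $v_d$ is the vector of weights for message $d$; these symbols are introduced only for this note. Other libraries and textbooks use slightly different IDF variants, so treat this as one common definition rather than the only one.

$$
\mathrm{idf}(t) = \ln\frac{1 + n}{1 + \mathrm{df}(t)} + 1,
\qquad
\text{tf-idf}(t, d) = \mathrm{tf}(t, d) \times \mathrm{idf}(t),
\qquad
v_d \leftarrow \frac{v_d}{\lVert v_d \rVert_2}
$$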

---

## 📦 Final Output:

Each message → one **TF-IDF vector**
Each value in it → the importance weight of one word (with scikit-learn's default normalisation, a value between 0.0 and 1.0)

---

## 🎯 An example:

Suppose there are 3 messages:

1. "Buy now"
2. "Limited offer now"
3. "Now or never"

→ The TF-IDF vectorizer turns every word into a number and builds a matrix like the one below:

| Word    | Msg 1 | Msg 2 | Msg 3 |
| ------- | ----- | ----- | ----- |
| buy     | 1.2   | 0     | 0     |
| limited | 0     | 1.5   | 0     |
| offer   | 0     | 1.2   | 0     |
| now     | 1.0   | 0.8   | 0.7   |
| or      | 0     | 0     | 1.0   |
| never   | 0     | 0     | 1.5   |

(The numbers are illustrative only; they are not real TF-IDF output.)
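To get actual weights for these three messages, here is a hand-rolled sketch in plain Python. It assumes whitespace tokenisation plus the smoothed IDF and L2 normalisation that scikit-learn documents as its defaults; the names `df_counts` and `tfidf_vector` are made up for this illustration, and this is not the library's implementation.

```python
import math
from collections import Counter

messages = ["Buy now", "Limited offer now", "Now or never"]
docs = [m.lower().split() for m in messages]
n_docs = len(docs)

# Document frequency: in how many messages each word appears
df_counts = Counter(word for doc in docs for word in set(doc))


def tfidf_vector(doc):
    """TF-IDF weights for one message, using smoothed IDF and L2 normalisation."""
    tf = Counter(doc)
    weights = {
        word: count * (math.log((1 + n_docs) / (1 + df_counts[word])) + 1)
        for word, count in tf.items()
    }
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {word: w / norm for word, w in weights.items()}


vectors = [tfidf_vector(doc) for doc in docs]
for message, vec in zip(messages, vectors):
    print(message, {word: round(w, 3) for word, w in vec.items()})
```

Because "now" occurs in all three messages, it gets the smallest weight in each vector, and after normalisation every weight falls between 0 and 1.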

---

## ✅ Why use TF-IDF?

* It captures word importance (common words are down-weighted)
* It produces a sparse matrix (memory efficient)
* It gives a numeric format that a model can be trained on directly

---

## 🔁 Bonus Tip:

* **`TfidfVectorizer()`** does all of this on its own.
* You just call `.fit_transform(text)` and it handles the rest.

---



In [13]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer()
X = tfidf.fit_transform(df['cleaned'])
y = df['label_num']

Step 4: Train-Test Split

In [14]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


Step 5: Train Logistic Regression Model

In [15]:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
model.fit(X_train, y_train)


Step 6: Predict & Evaluate

In [16]:
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

y_pred = model.predict(X_test)

print("Accuracy:", accuracy_score(y_test, y_pred))
print("\nConfusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))


Accuracy: 0.9426008968609866

Confusion Matrix:
 [[961   4]
 [ 60  90]]

Classification Report:
               precision    recall  f1-score   support

           0       0.94      1.00      0.97       965
           1       0.96      0.60      0.74       150

    accuracy                           0.94      1115
   macro avg       0.95      0.80      0.85      1115
weighted avg       0.94      0.94      0.94      1115



Based on your confusion matrix and classification report, here is the whole performance picture, explained as simply as possible.

---

## ✅ Step-by-step Analysis:

### 📌 Confusion Matrix:

```
[[961   4]
 [ 60  90]]
```

This means:

|                     | **Predicted Ham (0)** | **Predicted Spam (1)** |
| ------------------- | --------------------- | ---------------------- |
| **Actual Ham (0)**  | 961 (✅ TN)            | 4 (❌ FP)               |
| **Actual Spam (1)** | 60 (❌ FN)             | 90 (✅ TP)              |

---

### 🎯 Metrics Breakdown:

| Metric        | Ham (0)                   | Spam (1)   |
| ------------- | ------------------------- | ---------- |
| **Precision** | 0.94                      | 0.96       |
| **Recall**    | 1.00 (rounded from 0.996) | 0.60 (low) |
| **F1-score**  | 0.97                      | 0.74       |
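The figures in this table follow directly from the four confusion-matrix counts. The sketch below recomputes them in plain Python; the helper name `precision_recall_f1` is introduced here only for illustration.

```python
# Counts taken from the confusion matrix above (spam = positive class)
tn, fp, fn, tp = 961, 4, 60, 90


def precision_recall_f1(true_pos, false_pos, false_neg):
    """Return (precision, recall, F1) for one class from its error counts."""
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


spam = precision_recall_f1(tp, fp, fn)
# For the ham class the roles flip: a correctly kept ham counts as a hit,
# spam predicted as ham is a false positive for ham, and so on.
ham = precision_recall_f1(tn, fn, fp)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print("spam:", [round(x, 2) for x in spam])
print("ham: ", [round(x, 2) for x in ham])
print("accuracy:", round(accuracy, 4))
```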

---

## 🔍 How to read it:

### 🔹 False Positive (FP = 4)

* **The model said Spam**, but the message was actually **Ham**
* Meaning: a legitimate message gets flagged as spam, which is annoying!

### 🔸 False Negative (FN = 60)

* **The model said Ham**, but the message was actually **Spam**
* Meaning: **the model failed to detect spam**. This is dangerous, because the user will receive spam messages.

---

## 📉 Main Problem:

* **Recall for class 1 (spam)** = 0.60 → low
  Meaning: of the 150 spam messages in the test set, the model caught only **90** and missed 60.

---

## 📈 Don't be fooled by accuracy

Accuracy = **94.3%**, but that is misleading, because:

* The model is **detecting ham** well
* But it is much weaker at **detecting spam** (recall = 0.60)
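One way to make this concrete is to compare against a trivial baseline. The sketch below scores a hypothetical "model" that labels every message as ham, using the test-set class counts from the classification report; no such model appears in the notebook.

```python
# Test-set class counts from the classification report above
n_ham, n_spam = 965, 150

# A "model" that labels every message as ham gets all ham right and all spam wrong
baseline_accuracy = n_ham / (n_ham + n_spam)
baseline_spam_recall = 0 / n_spam

print(f"always-ham accuracy: {baseline_accuracy:.3f}")
print(f"always-ham spam recall: {baseline_spam_recall:.2f}")
```

A classifier that never detects any spam would still be right about 86.5% of the time, which is why accuracy alone says little on imbalanced data.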

---

## ✅ What to do (Suggestions):

1. **Fix the class imbalance** (lots of ham, little spam)

   * Use: `SMOTE`, `RandomOverSampler`, `class_weight='balanced'`
2. **Improve the preprocessing**

   * Reduce noise and get stemming/lemmatization right
3. Do **threshold tuning** (probability cut-off)
4. **Try better models** → Naive Bayes / XGBoost / Ensemble

---

## 🔚 Summary:

| Topic             | Analysis                 |
| ----------------- | ------------------------ |
| **Ham**           | Caught very well         |
| **Spam**          | Many missed              |
| **FP (4)**        | Very low, that's good    |
| **FN (60)**       | High, this is dangerous  |
| **Recall (Spam)** | 0.60 → needs improvement |

---

👉 If you want, I can show class imbalance handling or threshold tuning as the next step. Just say so.

Your turn ✋🙂


Great question! You're asking:
**"If everything were right, that is, if the model made no mistakes at all, what would the confusion matrix look like?"**

Let's make it completely clear 👇

---

## 🟨 Current matrix:

```
[[961   4]
 [ 60  90]]
```

What this shows:

* ✅ Correct predictions: 961 + 90 = **1051**
* ❌ Wrong predictions: 4 + 60 = **64**

---

## ✅ If the model were perfect...

That is, if the model could label every message **exactly right** as "ham" or "spam",
the confusion matrix would be:

```
[[965   0]
 [  0 150]]
```

### Explanation:

* All **ham (total 965)** → correctly identified as ham
* All **spam (total 150)** → correctly identified as spam
* So **False Positive (FP) = 0**
* **False Negative (FN) = 0**

---

## 🎯 Accuracy would then be:

```
accuracy = (TN + TP) / total
         = (965 + 150) / 1115
         = 1115 / 1115
         = 1.0 ✅
```

### 🔥 That is, Accuracy = **100%**

---

## 🔁 Recap Table:

| Actual / Predicted | Predicted: Ham | Predicted: Spam |
| ------------------ | -------------- | --------------- |
| Actual: Ham        | ✅ TN = 965     | ❌ FP = 0        |
| Actual: Spam       | ❌ FN = 0       | ✅ TP = 150      |

---

## 📌 Conclusion:

If the model made no mistakes at all:

* `Confusion Matrix: [[965, 0], [0, 150]]`
* `Accuracy: 1.0 (100%)`
* `Precision, Recall, F1-score → all 1.00`

---

If you want, I can turn this into an **image, as a graph or a table**.

That would make it even clearer visually 📊✅
Just say the word!



Your classification report shows that the model is not catching **spam (label = 1)** properly: recall is low (0.60), meaning it is **missing a lot of spam**.

Let's go through each of the suggestions and explain, step by step, what to do:

---

## ✅ 1. **Fixing the Class Imbalance**

### Reason:

* Ham = 965
* Spam = 150
  → Heavily imbalanced (test-set counts) → the model learns "ham" better and makes mistakes on spam.

### Solution:

```python
from imblearn.over_sampling import SMOTE

# Resample the training split only, so the test set stays untouched
sm = SMOTE(random_state=42)
X_res, y_res = sm.fit_resample(X_train, y_train)
```

Or:

```python
from imblearn.over_sampling import RandomOverSampler

ros = RandomOverSampler(random_state=42)
X_res, y_res = ros.fit_resample(X_train, y_train)
```

---

## ✅ 2. **Improving the Preprocessing (Stemming, Lemmatization)**

### Reason:

* "loved", "loving", "loves" → all carry the same meaning, but are different tokens.

### Solution:

```python
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download('wordnet')  # WordNet data is required by the lemmatizer
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def clean_text(text):
    text = text.lower()
    text = re.sub(r'[^\w\s]', '', text)
    words = text.split()
    # lemmatize() treats each word as a noun unless a pos argument (e.g. pos='v') is passed
    filtered = [lemmatizer.lemmatize(w) for w in words if w not in stop_words]
    return ' '.join(filtered)
```

This often works better than stemming.

---

## ✅ 3. **Threshold Tuning (Cutoff)**

### Reason:

By default:

```python
y_pred = model.predict(X_test)  # uses a probability threshold of 0.5
```

But you can choose your own:

```python
y_proba = model.predict_proba(X_test)[:, 1]
y_pred_custom = (y_proba > 0.3).astype(int)  # Custom threshold
```

### Benefit?

* You can **increase the sensitivity for catching spam**
* Recall improves, usually at the cost of some precision
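The trade-off behind threshold tuning can be seen on a handful of made-up scores. The probabilities and labels below are hypothetical, chosen only to illustrate the effect; they are not output from the notebook's model.

```python
# Hypothetical spam probabilities and true labels (1 = spam), for illustration only
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_proba = [0.90, 0.60, 0.40, 0.35, 0.45, 0.20, 0.10, 0.05, 0.30, 0.02]


def precision_recall_at(threshold):
    """Spam precision and recall when predicting spam for proba >= threshold."""
    y_pred = [1 if p >= threshold else 0 for p in y_proba]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn)
    return precision, recall


for threshold in (0.5, 0.3):
    precision, recall = precision_recall_at(threshold)
    print(f"threshold={threshold}: precision={precision:.2f}, recall={recall:.2f}")
```

On this toy data, lowering the cut-off catches every spam message but also flags two ham messages. In practice the threshold is better chosen on a validation split than on the test set.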

---

## ✅ 4. **Using a Better Model**

### Try:

* `MultinomialNB` → works very well for text
* `XGBoost` → Powerful Ensemble model
* `RandomForestClassifier` → Robust, stable

---

## ✅ 5. **Using class\_weight='balanced'**

If you use LogisticRegression or DecisionTree:

```python
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(class_weight='balanced')
```

With this, the model compensates for the imbalance on its own by weighting the rarer class more heavily.
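scikit-learn documents the `'balanced'` mode as weighting each class by `n_samples / (n_classes * count_of_that_class)`. The sketch below applies that formula to hypothetical training counts (4000 ham, 600 spam); the real counts in `y_train` will differ.

```python
# Hypothetical training-set class counts; replace with the real counts from y_train
class_counts = {"ham": 4000, "spam": 600}

n_samples = sum(class_counts.values())
n_classes = len(class_counts)

# Formula documented for class_weight='balanced':
# n_samples / (n_classes * count_of_class)
balanced_weights = {
    label: n_samples / (n_classes * count) for label, count in class_counts.items()
}

print(balanced_weights)
```

With these counts, each spam example weighs several times more than a ham example in the loss, which pushes the model to pay more attention to the minority class.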

---

## 🎯 Additional Tips:

| Technique           | Impact                  |
| ------------------- | ----------------------- |
| TF-IDF + Lemmatizer | Reduces noise           |
| SMOTE / ROS         | Increases recall        |
| Naive Bayes         | Text data-friendly      |
| Custom threshold    | Reduces False Negatives |
| Ensemble (Voting)   | Accuracy + Stability    |

---

## If you want:

I can put together your whole pipeline (cleaning → oversampling → model → evaluation)
as a final version, with all the settings.




Recall for class 0 (ham) is 1.00. Is that good?

Yes ✅, but **recall of 1.00 for class 0 (ham)** is a rounded figure:

> ⚠️ Your model identified **almost all ham messages** correctly. There are still 4 **False Positives (FP)**: ham messages flagged as spam.

---

### 🔍 Let's break this down:

Your confusion matrix:

```
              Predicted
              Ham    Spam
Actual Ham   [961     4]   → Total 965 ham
Actual Spam  [60     90]   → Total 150 spam
```

* **True Negative (TN) = 961** → ham correctly predicted as ham
* **False Positive (FP) = 4** → ham wrongly predicted as spam

With spam as the positive class, the fraction of ham that is correctly kept is the **specificity**:

$$
\text{Specificity} = \frac{TN}{TN + FP} = \frac{961}{961 + 4} \approx 0.996
$$

sklearn's `classification_report` reports recall separately for every class, so **recall for class 0** is that same quantity:

$$
\text{Recall}_{class\_0} = \frac{\text{True class 0 predicted as class 0}}{\text{All actual class 0}}
= \frac{961}{961 + 4} \approx 0.996
$$

The report prints two decimals, so 0.996 is displayed as 1.00. It does not mean every ham message was classified correctly: 4 of the 965 were still flagged as spam.
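The gap between the 1.00 in the report and the 0.996 from the formula is purely a display effect, which a quick check in plain Python makes visible. The two-decimal formatting mirrors how the report above is printed.

```python
ham_recall = 961 / (961 + 4)

print(ham_recall)           # exact value, slightly below 1
print(f"{ham_recall:.2f}")  # two-decimal formatting, as in the report
```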

---

### ✅ Final Verdict:

* Yes, recall of **1.00 for ham** is **very good** 👍
* But the real issue is:
  🔴 **Recall for spam (class 1) is 0.60** → your model is missing 40% of spam.

---

### 🎯 Your Goal:

If you're building a **Spam Detector**, missing spam is **dangerous** → try to improve recall for **class 1 (spam)**.

---

Want help boosting **spam recall** (class 1)?
I can adjust your pipeline to do that with:

* SMOTE / Oversampling
* Threshold tuning
* Better model

Let me know and I'll code it for you 🚀


Great question! Let's break it down clearly:

---

### 🔍 **What is SMOTE?**

**SMOTE** stands for **Synthetic Minority Oversampling Technique**.

It's a technique to **correct class imbalance** in datasets by **creating synthetic examples** of the **minority class** (e.g., "spam" in your case).

---

### ⚠️ **Why do we need SMOTE?**

In your spam detection dataset:

* **Ham (class 0)** = 87%
* **Spam (class 1)** = 13%

This is a **class imbalance problem**.

If you train a model directly, it may:

* **Be biased toward ham**
* **Ignore spam**
* Have **high accuracy** but **low recall for spam**

For example:

```
Accuracy: 94%
Recall for spam: 60% ❌ (bad)
```

You **miss 40% of spam**, which is **dangerous**.

---

### ✅ **When to use SMOTE?**

Use SMOTE **before training** your model, on the training split only, when:

1. You have **imbalanced data**
2. You care about **recall for the minority class** (e.g., spam detection, fraud, disease)
3. You want to **improve performance** for the rare class

---

### ✅ **How to use it in your project?**

Here's your code:

```python
from imblearn.over_sampling import SMOTE

sm = SMOTE(random_state=42)
X_res, y_res = sm.fit_resample(X_train, y_train)
```

* `X_train` = training features (TF-IDF vectors of messages)
* `y_train` = training labels (0 = ham, 1 = spam)
* `X_res`, `y_res` = balanced training set (equal number of ham and spam)

---

### 🎯 What does this do?

Let's say:

* Original training data: 4000 ham, 600 spam
* After SMOTE: 4000 ham, **4000 spam** (the 600 originals plus 3400 synthetic)

Now train your model on `X_res`, `y_res` → it becomes **fairer to spam**.
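The core idea behind the synthetic examples is interpolation between a minority point and one of its nearest minority neighbours. The sketch below is a deliberately simplified illustration in plain Python, assuming 2-D toy points and a single nearest neighbour; it is not imbalanced-learn's implementation, which samples among several neighbours, and the helper names are made up here.

```python
import random

random.seed(42)

# Hypothetical 2-D minority-class points (real TF-IDF vectors have thousands of dimensions)
minority = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.3), (1.5, 1.1)]


def nearest_neighbor(point, candidates):
    """Return the closest other point by squared Euclidean distance."""
    others = [c for c in candidates if c != point]
    return min(others, key=lambda c: sum((a - b) ** 2 for a, b in zip(point, c)))


def synthesize(point, neighbor):
    """Pick a random point on the line segment between point and neighbor."""
    lam = random.random()  # interpolation factor in [0, 1)
    return tuple(a + lam * (b - a) for a, b in zip(point, neighbor))


synthetic = [synthesize(p, nearest_neighbor(p, minority)) for p in minority]
for original, new in zip(minority, synthetic):
    print(original, "->", tuple(round(v, 3) for v in new))
```

Each synthetic point lies between two real minority points, so the new examples stay inside the region the minority class already occupies rather than being exact copies.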

---

### 📈 Benefit

You get:

* **Better recall for spam**
* **Balanced model performance**
* **Less bias toward ham**

---

### 🚀 Bonus: Full Example

```python
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from imblearn.over_sampling import SMOTE

# 1. Split original data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 2. Apply SMOTE on training data only
sm = SMOTE(random_state=42)
X_train_res, y_train_res = sm.fit_resample(X_train, y_train)

# 3. Train
model = RandomForestClassifier()
model.fit(X_train_res, y_train_res)

# 4. Evaluate
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))
```

---

Do you want me to:
✅ Integrate this into your **full spam detection pipeline**
✅ and improve spam recall?


