In [1]:
import pandas as pd

In [2]:
df = pd.read_csv("spam.csv", encoding='latin-1')  # latin-1 handles the special characters in this file
# encoding='latin-1': needed to read special characters or European-language text correctly; without it, read_csv may raise an error.

df.head(2)

Unnamed: 0,v1,v2,Unnamed: 2,Unnamed: 3,Unnamed: 4
0,ham,"Go until jurong point, crazy.. Available only ...",,,
1,ham,Ok lar... Joking wif u oni...,,,
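
The effect of `encoding='latin-1'` can be seen on a single byte string. The sketch below is a standalone illustration with a made-up message (`raw` is hypothetical, not taken from `spam.csv`): the byte `0xA3` is a pound sign in Latin-1 but is not valid UTF-8 on its own.

```python
# A byte sequence that is valid Latin-1 but not valid UTF-8 (hypothetical example).
raw = b"Win \xa31000 now"

try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

latin1_text = raw.decode("latin-1")
print(utf8_ok)       # False
print(latin1_text)   # Win £1000 now
```

Latin-1 maps every possible byte to a character, so decoding never fails, although it can silently produce wrong characters if the file was actually written in another encoding.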


In [3]:
df = df[['v1', 'v2']]


In [4]:
df.head()

Unnamed: 0,v1,v2
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


Rename:
v1        → the label (ham/spam)
v2        → the message text

In [5]:
df.columns = ['label', 'message']

In [6]:
df.head()

Unnamed: 0,label,message
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


Data Cleaning

In [7]:
df.dropna(inplace=True)  # Remove any nulls (if present)

Preprocess the Data
Convert labels to numbers:

Categorical labels such as `ham`/`spam` can be encoded in several ways. The three main approaches are compared below.

---

### ✅ 1. Using `.map()` (Best for Binary Labels – Recommended Here)

```python
df['label_num'] = df['label'].map({'ham': 0, 'spam': 1})
```

* 🔹 Simple and direct
* 🔹 Works great for binary classification
* 🔹 Output: single column with 0 or 1

---

### ✅ 2. Using **LabelEncoder** (Works for multiple classes)

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
df['label_num'] = le.fit_transform(df['label'])
```

* 🔹 Useful if there are more than 2 labels
* 🔹 `ham` → 0, `spam` → 1 (assigned automatically, in sorted order of the class names)
* 🔹 Not ideal if you want to control the label encoding manually

---

### ✅ 3. Using **OneHotEncoder** (NOT needed for target variable in classification)

```python
from sklearn.preprocessing import OneHotEncoder

# sparse_output replaces the older `sparse` argument (scikit-learn >= 1.2)
ohe = OneHotEncoder(sparse_output=False, drop='first')  # drop='first' avoids dummy variable trap
encoded = ohe.fit_transform(df[['label']])
```

* 🔹 Produces one column per class; with `drop='first'` on a two-class label, a single 0/1 column remains
* 🔹 Useful when the target is multi-class and you want to treat each class as separate binary output (for multi-label classification or neural nets)
* ❌ Not recommended for binary classification with scikit-learn estimators like `LogisticRegression`

---

### ✅ So What to Use Here?

Since this is a **binary classification problem (spam or ham)**, the best and cleanest approach is:

```python
df['label_num'] = df['label'].map({'ham': 0, 'spam': 1})
```

You **don’t need OneHotEncoder** unless your model requires it (e.g., in some neural networks or multi-label tasks).

---



In [8]:
df['label_num'] = df['label'].map({'ham': 0, 'spam': 1})


In [9]:
df.head()

Unnamed: 0,label,message,label_num
0,ham,"Go until jurong point, crazy.. Available only ...",0
1,ham,Ok lar... Joking wif u oni...,0
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,1
3,ham,U dun say so early hor... U c already then say...,0
4,ham,"Nah I don't think he goes to usf, he lives aro...",0


Feature Engineering (Text → Numeric)

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(df['message'])
y = df['label_num']

A better process: clean the text first, then convert it to numbers (TF-IDF).

import string
import re
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

stop_words = set(stopwords.words('english'))
ps = PorterStemmer()


def clean_text(text):  # cleans a single message

    text = text.lower()  # convert all text to lower case

    # delete punctuation and special characters (such as .,!?@#$%):
    # "Hello! How are you?" → "hello how are you"
    text = re.sub(r'[^\w\s]', '', text)

    words = text.split()  # 'i love coding' → ['i', 'love', 'coding']

    # remove stop words (such as 'the', 'is', 'in', 'and'), then stem each remaining
    # word to its root form: 'running' → 'run', 'loved' → 'love', 'easily' → 'easili'
    filtered_words = [ps.stem(w) for w in words if w not in stop_words]

    # join the words back together:
    # "Hi there!!! I loved the meeting, it was fantastic." → "hi love meet fantast"
    return ' '.join(filtered_words)


df['cleaned'] = df['message'].apply(clean_text)
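
The punctuation step is the easiest one to get subtly wrong, so it helps to try the regular expression on its own. This is a minimal standard-library sketch; `strip_punctuation` is a hypothetical helper name, and stop-word removal and stemming are left out on purpose.

```python
import re

def strip_punctuation(text):
    """Lowercase the text and remove everything except word characters and whitespace."""
    return re.sub(r"[^\w\s]", "", text.lower())

cleaned = strip_punctuation("Hello! How are you?")
print(cleaned)           # hello how are you
print(cleaned.split())   # ['hello', 'how', 'are', 'you']

# \w also matches digits and underscores, so those are kept.
print(strip_punctuation("WIN_100% FREE!!!"))   # win_100 free

# Apostrophes are removed, so contractions collapse into a single token.
print(strip_punctuation("I don't know"))       # i dont know
```

The last case matters for stop-word filtering: a contraction that has lost its apostrophe only gets removed if the stop-word list contains that exact spelling.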


Clean the text

Natural Language Toolkit (NLTK): a library for working with human (natural) language text. Typical tasks:

* Tokenization: splitting a sentence into words
* Stopword removal: dropping common but uninformative words such as "the", "is", "in"
* Stemming: reducing a word to its root form (e.g. "running" → "run")
* Lemmatization: reducing a word to its dictionary form
* POS tagging: identifying whether a word is a noun, verb, or adjective
* Text classification

In [10]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [11]:
import string
import re
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

stop_words = set(stopwords.words('english'))
ps = PorterStemmer()

def clean_text(text):
    text = text.lower()
    text = re.sub(r'[^\w\s]', '', text)
    words = text.split()
    filtered_words = [ps.stem(w) for w in words if w not in stop_words]  # skip stop words, stem the remaining words
    return ' '.join(filtered_words)

df['cleaned'] = df['message'].apply(clean_text) 

In [12]:
df.head()

Unnamed: 0,label,message,label_num,cleaned
0,ham,"Go until jurong point, crazy.. Available only ...",0,go jurong point crazi avail bugi n great world...
1,ham,Ok lar... Joking wif u oni...,0,ok lar joke wif u oni
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,1,free entri 2 wkli comp win fa cup final tkt 21...
3,ham,U dun say so early hor... U c already then say...,0,u dun say earli hor u c alreadi say
4,ham,"Nah I don't think he goes to usf, he lives aro...",0,nah dont think goe usf live around though


Step 3: Convert Text to Numbers (TF-IDF)

**What is a TF-IDF vector?** It is a technique from NLP and text classification.

## ✅ What does TF-IDF mean?

**TF-IDF** stands for:

```
TF → Term Frequency  
IDF → Inverse Document Frequency
```

It is a **text-to-number conversion technique** that calculates the **importance of each word** within a message or document.

---

## 🎯 Main Goal:

A machine cannot work directly with text (such as "Hello, this is spam message"), because it is a string.

So each word has to be converted into a **number/vector**.
TF-IDF does exactly that, and instead of plain counts it assigns a **weight based on importance**.

---

## 🧠 TF (Term Frequency):

How many times does a word occur within one message?

### Example:

Message: `"You are very very good"`

* TF("very") = 2
* TF("you") = 1

---

## 🧠 IDF (Inverse Document Frequency):

Words that are common across many messages (e.g. "is", "the", "you") get a low weight.

Words that are rare across the dataset (e.g. "winner", "free", "offer") get a high weight.

That is IDF.

---

## 🔄 TF-IDF Formula:

```
TF-IDF(word) = TF(word) × IDF(word)
```

That is: how often the word appears in one message (TF) × how rare that word is across the whole dataset (IDF).

---

## 📦 Final Output:

Each message → one **TF-IDF vector**
whose values are the importance weights of the individual words (roughly between 0.0 and 1.0)

---

## 🎯 An example:

Suppose there are 3 messages:

1. "Buy now"
2. "Limited offer now"
3. "Now or never"

→ The TF-IDF vectorizer turns every word into a number and builds a matrix like the one below:

| Word    | Msg 1 | Msg 2 | Msg 3 |
| ------- | ----- | ----- | ----- |
| buy     | 1.2   | 0     | 0     |
| limited | 0     | 1.5   | 0     |
| offer   | 0     | 1.2   | 0     |
| now     | 1.0   | 0.8   | 0.7   |
| or      | 0     | 0     | 1.0   |
| never   | 0     | 0     | 1.5   |

(The numbers are illustrative only. scikit-learn's matrix has one row per message and one column per word.)

---
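The TF × IDF weighting can be reproduced by hand on the three example messages. The sketch below is plain Python, not scikit-learn. It assumes the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1` and L2 normalisation, which should correspond to `TfidfVectorizer`'s default settings (`smooth_idf=True`, `norm='l2'`). The names `df_counts` and `tfidf_vector` are illustrative.

```python
import math
from collections import Counter

docs = ["buy now", "limited offer now", "now or never"]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: in how many messages each word appears.
df_counts = Counter(word for tokens in tokenized for word in set(tokens))

# Smoothed IDF: ln((1 + n) / (1 + df)) + 1
idf = {w: math.log((1 + n_docs) / (1 + c)) + 1 for w, c in df_counts.items()}

def tfidf_vector(tokens):
    """Return {word: weight} for one message, L2-normalised."""
    tf = Counter(tokens)
    raw = {w: tf[w] * idf[w] for w in tf}
    norm = math.sqrt(sum(v * v for v in raw.values()))
    return {w: v / norm for w, v in raw.items()}

vectors = [tfidf_vector(t) for t in tokenized]
for doc, vec in zip(docs, vectors):
    print(doc, {w: round(v, 3) for w, v in vec.items()})
```

Because "now" occurs in every message it receives the lowest IDF, and after normalisation every weight falls between 0 and 1.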

## ✅ Why use TF-IDF?

* Captures word importance (common words are down-weighted)
* Produces a sparse matrix (memory efficient)
* Gives a numeric format that a model can be trained on directly

---

## 🔁 Bonus Tip:

* **`TfidfVectorizer()`** does all of this on its own.
* You only call `.fit_transform(text)`; it handles the rest.

---



In [13]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer()
X = tfidf.fit_transform(df['cleaned'])
y = df['label_num']

Step 4: Train-Test Split

In [14]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


Step 5: Train Logistic Regression Model

In [15]:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
model.fit(X_train, y_train)


Step 6: Predict & Evaluate

In [16]:
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

y_pred = model.predict(X_test)

print("Accuracy:", accuracy_score(y_test, y_pred))
print("\nConfusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))


Accuracy: 0.9426008968609866

Confusion Matrix:
 [[961   4]
 [ 60  90]]

Classification Report:
               precision    recall  f1-score   support

           0       0.94      1.00      0.97       965
           1       0.96      0.60      0.74       150

    accuracy                           0.94      1115
   macro avg       0.95      0.80      0.85      1115
weighted avg       0.94      0.94      0.94      1115



Here is a plain walk-through of the model's performance, based on the confusion matrix and the classification report.

---

## ✅ Step-by-step Analysis:

### 📌 Confusion Matrix:

```
[[961   4]
 [ 60  90]]
```

This means:

|                     | **Predicted Ham (0)** | **Predicted Spam (1)** |
| ------------------- | --------------------- | ---------------------- |
| **Actual Ham (0)**  | 961 (✅ TN)            | 4 (❌ FP)               |
| **Actual Spam (1)** | 60 (❌ FN)             | 90 (✅ TP)              |

---

### 🎯 Metrics Breakdown:

| Metric        | Ham (0)        | Spam (1)   |
| ------------- | -------------- | ---------- |
| **Precision** | 0.94           | 0.96       |
| **Recall**    | 1.00 (0.996 before rounding) | 0.60 (low) |
| **F1-score**  | 0.97           | 0.74       |

---
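The spam-class figures in the report follow directly from the four confusion-matrix counts. A minimal sketch of the arithmetic, treating spam as the positive class (variable names are illustrative):

```python
# Counts taken from the confusion matrix (spam = positive class).
tn, fp, fn, tp = 961, 4, 60, 90

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision_spam = tp / (tp + fp)
recall_spam = tp / (tp + fn)
f1_spam = 2 * precision_spam * recall_spam / (precision_spam + recall_spam)

print(f"accuracy  = {accuracy:.4f}")        # 0.9426
print(f"precision = {precision_spam:.2f}")  # 0.96
print(f"recall    = {recall_spam:.2f}")     # 0.60
print(f"f1        = {f1_spam:.2f}")         # 0.74
```

Precision only looks at the 94 messages predicted as spam, while recall looks at the 150 messages that really are spam, which is why the two can differ so much.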

## 🔍 How to read this:

### 🔹 False Positive (FP = 4)

* **The model said Spam**, but the message was actually **Ham**
* In practice: a legitimate message gets flagged as spam, which is annoying.

### 🔸 False Negative (FN = 60)

* **The model said Ham**, but the message was actually **Spam**
* In practice: **the model failed to detect spam**. This is the dangerous case, because the user receives the spam message.

---

## 📉 Main Problem:

* **Recall for class 1 (spam)** = 0.60 → low
  That is, of the 150 spam messages in the test set, the model caught only **90** and missed 60.

---

## 📈 Don't be fooled by accuracy

Accuracy = **94.3%**, but that figure is misleading, because:

* the model detects **ham** very well
* but it detects **spam** poorly (recall = 0.60)

---
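One way to see why accuracy alone is misleading is to compare it against a trivial baseline. The sketch below uses the test-set class counts from the report and a hypothetical "model" that labels every message as ham.

```python
# Test-set class counts from the classification report.
n_ham, n_spam = 965, 150
total = n_ham + n_spam

# A "model" that labels every message as ham gets all ham right and no spam right.
baseline_accuracy = n_ham / total
baseline_spam_recall = 0 / n_spam

print(f"baseline accuracy    = {baseline_accuracy:.3f}")    # 0.865
print(f"baseline spam recall = {baseline_spam_recall:.2f}")  # 0.00
```

A classifier that never detects any spam still scores about 86.5% accuracy here, so the 94% figure is a smaller improvement than it first appears.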

## ✅ What to do (Suggestions):

1. **Fix the class imbalance** (many ham, few spam)

   * Use: `SMOTE`, `RandomOverSampler`, `class_weight='balanced'`
2. **Improve preprocessing**

   * Reduce noise and tune stemming/lemmatization
3. **Tune the threshold** (probability cut-off)
4. **Try other models** → Naive Bayes / XGBoost / Ensemble

---

## 🔚 Summary:

| Item              | Analysis                 |
| ----------------- | ------------------------ |
| **Ham**           | Detected very well       |
| **Spam**          | Many missed              |
| **FP (4)**        | Very low, which is good  |
| **FN (60)**       | High, which is dangerous |
| **Recall (Spam)** | 0.60 → needs improvement |

---



Great question! You are asking:
**"If everything went right, that is, if the model made no mistakes at all, what would the confusion matrix look like?"**

Let's go through it step by step 👇

---

## 🟨 Current matrix:

```
[[961   4]
 [ 60  90]]
```

Here we can see:

* ✅ Correct: 961 + 90 = **1051 predictions**
* ❌ Wrong: 4 + 60 = **64 predictions**

---

## ✅ If the model were perfect...

That is, if the model labelled every message **correctly** as "ham" or "spam",
the confusion matrix would be:

```
[[965   0]
 [  0 150]]
```

### Explanation:

* All **ham (965 in total)** → predicted as ham
* All **spam (150 in total)** → predicted as spam
* So **False Positive (FP) = 0**
* **False Negative (FN) = 0**

---

## 🎯 Accuracy would then be:

```python
TN, TP, total = 965, 150, 1115
accuracy = (TN + TP) / total   # (965 + 150) / 1115 = 1115 / 1115 = 1.0 ✅
```

### 🔥 That is, Accuracy = **100%**

---

## 🔁 Recap Table:

| Actual / Predicted | Predicted: Ham | Predicted: Spam |
| ------------------ | -------------- | --------------- |
| Actual: Ham        | ✅ TN = 965     | ❌ FP = 0        |
| Actual: Spam       | ❌ FN = 0       | ✅ TP = 150      |

---

## 📌 Conclusion:

If the model made no mistakes at all:

* `Confusion Matrix: [[965, 0], [0, 150]]`
* `Accuracy: 1.0 (100%)`
* `Precision, Recall, F1-score → all 1.00`

---




The classification report shows that the model does not catch **spam (label = 1)** well: recall is low (0.60), which means **many spam messages are missed**.

Each of the suggestions is explained below, with the steps to take:

---

## ✅ 1. **Fix the Class Imbalance**

### Why:

* Ham = 965
* Spam = 150
  → These test-set counts show a strong imbalance → the model learns "ham" better and makes mistakes on spam.

### Solution:

```python
from imblearn.over_sampling import SMOTE

# Resample the training split only, so the test set stays untouched
sm = SMOTE(random_state=42)
X_train_res, y_train_res = sm.fit_resample(X_train, y_train)
```

Or:

```python
from imblearn.over_sampling import RandomOverSampler

ros = RandomOverSampler(random_state=42)
X_train_res, y_train_res = ros.fit_resample(X_train, y_train)
```

---

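## ✅ 2. **Improve Preprocessing (Stemming, Lemmatization)**

### Why:

* "loved", "loving", "loves" → same meaning, but different tokens.

### Solution:

```python
import re
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# Requires nltk.download('wordnet') in addition to the stopwords corpus
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def clean_text(text):
    text = text.lower()
    text = re.sub(r'[^\w\s]', '', text)
    words = text.split()
    # lemmatize() treats each word as a noun unless a pos argument is given
    filtered = [lemmatizer.lemmatize(w) for w in words if w not in stop_words]
    return ' '.join(filtered)
```

This often works better than stemming. Note that with the default noun setting, verb forms such as "loved" and "loving" are left unchanged.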

---

## ✅ 3. **Threshold Tuning (Cutoff)**

### Why:

By default:

```python
y_pred = model.predict(X_test)  # threshold = 0.5
```

But you can choose your own cut-off:

```python
y_proba = model.predict_proba(X_test)[:, 1]
y_pred_custom = (y_proba > 0.3).astype(int)  # Custom threshold
```

### Benefit?

* You can **increase the sensitivity to spam**
* Recall improves, usually at the cost of some precision (more ham gets flagged as spam)

---
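The trade-off behind threshold tuning can be seen on a handful of numbers. The sketch below is plain Python with hypothetical probabilities and labels (`y_true` and `y_proba` are made up for illustration, not model output), and `precision_recall` is an illustrative helper.

```python
# Hypothetical spam probabilities and true labels (1 = spam), for illustration only.
y_true  = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_proba = [0.05, 0.10, 0.20, 0.35, 0.15, 0.02, 0.90, 0.60, 0.40, 0.32]

def precision_recall(threshold):
    """Precision and recall for the spam class at a given probability cut-off."""
    y_pred = [1 if p > threshold else 0 for p in y_proba]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

for threshold in (0.5, 0.3):
    precision, recall = precision_recall(threshold)
    print(f"threshold={threshold}: precision={precision:.2f}, recall={recall:.2f}")
```

On this toy data, lowering the cut-off from 0.5 to 0.3 raises recall from 0.50 to 1.00 while precision drops from 1.00 to 0.80. The right cut-off for the real model would have to be chosen on validation data.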

## ✅ 4. **Better Model ব্যবহার করা**

### Try:

* `MultinomialNB` → works very well for text
* `XGBoost` → Powerful Ensemble model
* `RandomForestClassifier` → Robust, stable

---

## ✅ 5. **Use class\_weight='balanced'**

If you use LogisticRegression or DecisionTree:

```python
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(class_weight='balanced')
```

This weights each class inversely to its frequency, so mistakes on spam cost more during training.

---
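The weights behind `class_weight='balanced'` follow the heuristic `n_samples / (n_classes * class_count)`. The sketch below applies it by hand; the class counts are the test-set counts from the report and serve only as an example, since the real weights depend on the training split.

```python
# Example class counts (taken from the test-set report; the training split will differ).
counts = {0: 965, 1: 150}   # 0 = ham, 1 = spam
n_samples = sum(counts.values())
n_classes = len(counts)

# "balanced" heuristic: n_samples / (n_classes * count_of_class)
weights = {label: n_samples / (n_classes * c) for label, c in counts.items()}
print({label: round(w, 3) for label, w in weights.items()})   # {0: 0.578, 1: 3.717}
```

With these counts an error on a spam message weighs roughly six times as much as an error on a ham message.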

## 🎯 Additional Tips:

| Technique           | Impact                   |
| ------------------- | ------------------------ |
| TF-IDF + Lemmatizer | Reduces noise            |
| SMOTE / ROS         | Raises recall            |
| Naive Bayes         | Text data-friendly       |
| Custom threshold    | Reduces false negatives  |
| Ensemble (Voting)   | Accuracy + Stability     |

---





Recall for class 0 (ham) is 1.00. Is that good?

Yes ✅. **Recall for class 0 (ham)** being reported as **1.00** means:

> ⚠️ Your model identified almost all ham messages correctly: only 4 of 965 were flagged as spam (**False Positives**).

---

### 🔍 Let’s break this down:

Your confusion matrix:

```
              Predicted
              Ham    Spam
Actual Ham   [961     4]   → Total 965 ham
Actual Spam  [60     90]   → Total 150 spam
```

* **True Negative (TN) = 961** → ham correctly predicted as ham
* **False Positive (FP) = 4** → ham wrongly predicted as spam

In sklearn's `classification_report`, recall is computed per class. For **class 0 (ham)**:

$$
\text{Recall}_{ham} = \frac{\text{ham predicted as ham}}{\text{all actual ham}} = \frac{TN}{TN + FP} = \frac{961}{961 + 4} \approx 0.996
$$

The report prints two decimals, so 0.996 is displayed as 1.00. When spam is treated as the positive class, this same quantity is called **specificity**.

---
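The reported 1.00 for ham is a rounding effect, which can be checked directly from the ham row of the confusion matrix:

```python
ham_correct, ham_total = 961, 965   # from the confusion matrix row for ham

recall_ham = ham_correct / ham_total
print(recall_ham)            # about 0.99585
print(f"{recall_ham:.2f}")   # 1.00, which is what the report displays
```

So the ham recall is very high but not perfect: four legitimate messages were still flagged as spam.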

### ✅ Final Verdict:

* Yes, recall of **1.00 for ham** is **very good** 👍
* But the real issue is:
  🔴 **Recall for spam (class 1) is 0.60** → your model is missing 40% of spam.

---

### 🎯 Your Goal:

If you’re building a **Spam Detector**, missing spam is **dangerous** → try to improve recall for **class 1 (spam)**.

---



Great question! Let’s break it down clearly:

---

### 🔍 **What is SMOTE?**

**SMOTE** stands for **Synthetic Minority Oversampling Technique**.

It's a technique to **balance class imbalance** in datasets by **creating synthetic examples** of the **minority class** (e.g., "spam" in your case).

---

### ⚠️ **Why do we need SMOTE?**

In your spam detection dataset:

* **Ham (class 0)** ≈ 87%
* **Spam (class 1)** ≈ 13%

This is a **class imbalance problem**.

If you train a model directly, it may:

* **Favor the majority class (ham)**
* **Ignore spam**
* Have **high accuracy** but **low recall for spam**

For example:

```
Accuracy: 94%
Recall for spam: 60% ❌ (bad)
```

You **miss 40% of spam**, which is **dangerous**.

---

### ✅ **When to use SMOTE?**

Use SMOTE **before training** your model (on the training split only) when:

1. You have **imbalanced data**
2. You care about **recall for the minority class** (e.g., spam detection, fraud, disease)
3. You want to **improve performance** for the rare class

---

### ✅ **How to use it in your project?**

Here’s your code:

```python
from imblearn.over_sampling import SMOTE

sm = SMOTE(random_state=42)
X_train_res, y_train_res = sm.fit_resample(X_train, y_train)  # training split only
```

* `X_train` = training features (TF-IDF vectors of messages)
* `y_train` = training labels (0 = ham, 1 = spam)
* `X_train_res`, `y_train_res` = balanced training set (equal number of ham and spam)

---

### 🎯 What does this do?

Let’s say:

* Original training data: 4000 ham, 600 spam
* After SMOTE: 4000 ham, **4000 spam** (600 original + 3400 synthetic)

Now train your model on the resampled training data → it becomes **fairer to spam**.

---
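The core idea behind the synthetic samples is interpolation between a minority point and one of its minority-class neighbours. The sketch below is a simplified, standard-library illustration of that idea, not the imbalanced-learn implementation: it uses hypothetical 2-D points and always picks the single nearest neighbour, whereas SMOTE chooses among several nearest neighbours. `synthesize` is an illustrative name.

```python
import random

random.seed(42)

# Hypothetical 2-D minority-class points, for illustration only.
minority = [(1.0, 2.0), (2.0, 3.0), (3.0, 1.0)]

def synthesize(points):
    """Create one synthetic point between a sample and its nearest minority neighbour."""
    base = random.choice(points)
    others = [p for p in points if p != base]
    neighbour = min(others, key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
    lam = random.random()  # position along the segment, in [0, 1)
    new_point = tuple(a + lam * (b - a) for a, b in zip(base, neighbour))
    return new_point, base, neighbour

synthetic, base, neighbour = synthesize(minority)
print(base, neighbour, synthetic)
```

Every synthetic point lies on the line segment between two real minority points, so the new spam examples are variations of existing spam rather than exact copies.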

### 📈 Benefit

You get:

* **Better recall for spam**
* **Balanced model performance**
* **Less bias toward ham**

---

### 🚀 Bonus: Full Example

```python
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from imblearn.over_sampling import SMOTE

# 1. Split original data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 2. Apply SMOTE on training data only
sm = SMOTE(random_state=42)
X_train_res, y_train_res = sm.fit_resample(X_train, y_train)

# 3. Train
model = RandomForestClassifier()
model.fit(X_train_res, y_train_res)

# 4. Evaluate
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))
```

---



