## 📘 **Lesson Plan: Text Classification Using Email Spam Dataset**

---

### 🔹 **1. Problem Statement**

Classify email messages as **Spam** or **Not Spam (Ham)** using machine learning and deep learning techniques.
We will use:

* **CountVectorizer** and **TF-IDF** for feature extraction
* ML models: **Naive Bayes**, **Logistic Regression**, **SVM**
* **ANN** using Keras
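
To make the difference between the two feature extractors concrete, here is a minimal pure-Python sketch of raw term counts versus TF-IDF weights. It assumes the smoothed IDF formula that scikit-learn's `TfidfVectorizer` applies by default, `idf = ln((1 + n) / (1 + df)) + 1`, and it skips the L2 row normalization. The toy documents and helper names are made up for illustration.

```python
import math
from collections import Counter

# Toy corpus (made up for illustration, not taken from the dataset)
docs = [
    "free prize claim now",
    "claim your free ticket",
    "lunch meeting at noon",
]

def term_counts(doc):
    """Bag-of-words counts for one document (the idea behind CountVectorizer)."""
    return Counter(doc.lower().split())

def smoothed_idf(term, corpus):
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    n = len(corpus)
    df = sum(1 for d in corpus if term in d.lower().split())
    return math.log((1 + n) / (1 + df)) + 1

def tfidf_weights(doc, corpus):
    """Unnormalized TF-IDF: raw count times smoothed IDF."""
    return {t: c * smoothed_idf(t, corpus) for t, c in term_counts(doc).items()}

counts = term_counts(docs[0])
weights = tfidf_weights(docs[0], docs)
print(dict(counts))
print({t: round(w, 3) for t, w in weights.items()})
```

Every word in the first document has a count of 1, but "prize" (which appears in one document) gets a higher TF-IDF weight than "free" (which appears in two).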

---

### 📦 **Dataset Used**

* Email spam dataset with two columns:

  * `label`: 'spam' or 'ham'
  * `text`: email content

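> Example dataset: [SMS Spam Collection Dataset (Kaggle)](https://www.kaggle.com/datasets/uciml/sms-spam-collection-dataset). The CSV stores the label in column `v1` and the message text in column `v2`; the messages are SMS texts, used here as a stand-in for email.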

---

## ✅ **Steps for All Approaches**

### **Step 1: Load & Preprocess Data**


In [2]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

df = pd.read_csv("class_11_spam.csv", encoding='latin-1')[['v1', 'v2']]
df.columns = ['label', 'text']

# Encode label
le = LabelEncoder()
df['label'] = le.fit_transform(df['label'])  # ham = 0, spam = 1

# Split data
X_train, X_test, y_train, y_test = train_test_split(df['text'], df['label'], test_size=0.2, random_state=42)
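
The classification report later in this lesson shows 965 ham against 150 spam messages in the test split, so the classes are imbalanced. One optional variation is a stratified split, which keeps the class ratio the same in both parts. The sketch below uses hypothetical toy data (80 ham, 20 spam) rather than the real dataset; switching the split above to `stratify=df['label']` would change the reported numbers slightly.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced toy data: 80 ham (0) and 20 spam (1)
texts = [f"message {i}" for i in range(100)]
labels = [0] * 80 + [1] * 20

# stratify=labels keeps the 80/20 class ratio in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.2, random_state=42, stratify=labels
)

train_counts = Counter(y_tr)
test_counts = Counter(y_te)
print("train:", dict(train_counts))
print("test: ", dict(test_counts))
```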

In [3]:

## 🔹 **2. CountVectorizer + ML Models**

### Vectorization

from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
X_train_cv = vectorizer.fit_transform(X_train)
X_test_cv = vectorizer.transform(X_test)



In [4]:
### Classification

from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report

model = MultinomialNB()
model.fit(X_train_cv, y_train)
y_pred = model.predict(X_test_cv)

print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))


Accuracy: 0.9838565022421525
              precision    recall  f1-score   support

           0       0.98      1.00      0.99       965
           1       0.99      0.89      0.94       150

    accuracy                           0.98      1115
   macro avg       0.98      0.95      0.96      1115
weighted avg       0.98      0.98      0.98      1115
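
The columns of the report are related by simple formulas. As a sketch, the helpers below (hypothetical names, not part of scikit-learn) compute precision, recall, and F1 from confusion-matrix counts, and recompute the F1 of the spam row from its reported precision and recall. The counts passed to `precision_recall_f1` are invented for illustration and are not the real confusion matrix.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and F1 from true positives, false positives, false negatives."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

def f1_from(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Spam row of the report above: precision 0.99, recall 0.89
spam_f1 = f1_from(0.99, 0.89)
print(round(spam_f1, 2))

# Invented counts, for illustration only
p, r, f = precision_recall_f1(tp=8, fp=2, fn=2)
print(p, r, round(f, 2))
```

The recomputed F1 rounds to the 0.94 shown in the report. The lower recall for class 1 means some spam messages were labeled as ham.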



In [5]:

# 📌 **Try with Logistic Regression and SVM as well.**
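
One possible way to approach this exercise is sketched below. The helper `compare_classifiers` is a hypothetical name; it fits `LogisticRegression` and `LinearSVC` on `CountVectorizer` features and returns each test accuracy. The tiny corpus is made up so the block runs on its own; in the notebook you would pass `X_train, y_train, X_test, y_test` instead.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.svm import LinearSVC

def compare_classifiers(train_texts, train_labels, test_texts, test_labels):
    """Fit each model on bag-of-words counts and return its test accuracy."""
    vec = CountVectorizer()
    X_tr = vec.fit_transform(train_texts)   # learn vocabulary on training text only
    X_te = vec.transform(test_texts)        # reuse the same vocabulary for test text
    models = {
        "Logistic Regression": LogisticRegression(max_iter=1000),
        "Linear SVM": LinearSVC(),
    }
    scores = {}
    for name, clf in models.items():
        clf.fit(X_tr, train_labels)
        scores[name] = accuracy_score(test_labels, clf.predict(X_te))
    return scores

# Made-up toy corpus (1 = spam, 0 = ham)
toy_train = [
    "win a free prize now", "claim your free ticket today",
    "free cash offer click now", "are we meeting for lunch",
    "please send the report", "see you at the office tomorrow",
]
toy_train_labels = [1, 1, 1, 0, 0, 0]
toy_test = ["free prize offer", "lunch at the office"]
toy_test_labels = [1, 0]

scores = compare_classifiers(toy_train, toy_train_labels, toy_test, toy_test_labels)
for name, acc in scores.items():
    print(f"{name}: {acc:.2f}")
```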


## 🔹 **3. TF-IDF + ML Models**

### TF-IDF Vectorization


from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer()
X_train_tfidf = tfidf.fit_transform(X_train)
X_test_tfidf = tfidf.transform(X_test)

### Classification (same as above)

model = MultinomialNB()
model.fit(X_train_tfidf, y_train)
y_pred = model.predict(X_test_tfidf)

print("Accuracy:", accuracy_score(y_test, y_pred))

Accuracy: 0.9623318385650225


In [None]:
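## Naive Bayes Classifier
# GaussianNB cannot take raw text or sparse matrices,
# so train it on the dense form of the TF-IDF features
from sklearn.naive_bayes import GaussianNB

gnb = GaussianNB()
%time gnb.fit(X_train_tfidf.toarray(), y_train)

y_pred_train = gnb.predict(X_train_tfidf.toarray())
y_pred_test = gnb.predict(X_test_tfidf.toarray())
print("\nTraining Accuracy score:", accuracy_score(y_train, y_pred_train))
print("Testing Accuracy score:", accuracy_score(y_test, y_pred_test))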

In [None]:
print(classification_report(y_test, y_pred_test, target_names=['ham', 'spam']))

In [None]:
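from sklearn.metrics import confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt

cm = confusion_matrix(y_test, y_pred_test)
# print('Confusion matrix\n', cm)

# scikit-learn puts actual classes on rows and predicted classes on columns (0 = ham, 1 = spam)
cm_matrix = pd.DataFrame(data=cm, index=['Actual Ham', 'Actual Spam'],
                         columns=['Predicted Ham', 'Predicted Spam'])
sns.heatmap(cm_matrix, annot=True, fmt='d', cmap='YlGnBu')
plt.show()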

In [None]:
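# Also try MultinomialNB, LogisticRegression, LinearSVC, DecisionTreeClassifier, and ensembles
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Base estimators; VotingClassifier clones and fits each one itself
classifiers = [('Decision Tree', DecisionTreeClassifier(random_state=42)),
               ('Logistic Regression', LogisticRegression(max_iter=1000)),
               ('Naive Bayes', GaussianNB())
              ]
vc = VotingClassifier(estimators=classifiers)

# GaussianNB needs dense input, so convert the sparse TF-IDF matrices
X_train_dense = X_train_tfidf.toarray()
X_test_dense = X_test_tfidf.toarray()

# Fit 'vc' to the training set and predict labels for both splits
vc.fit(X_train_dense, y_train)
y_pred_train = vc.predict(X_train_dense)
y_pred_test = vc.predict(X_test_dense)
print("Training Accuracy score:", accuracy_score(y_train, y_pred_train))
print("Testing Accuracy score:", accuracy_score(y_test, y_pred_test))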

In [None]:
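from sklearn import metrics
import matplotlib.pyplot as plt

# GaussianNB expects dense input, so convert the sparse TF-IDF test matrix
probs = gnb.predict_proba(X_test_tfidf.toarray())
preds = probs[:, 1]  # probability of the positive class (spam)
fpr, tpr, threshold = metrics.roc_curve(y_test, preds)
roc_auc = metrics.auc(fpr, tpr)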

plt.title('Receiver Operating Characteristic')
plt.plot(fpr, tpr, 'b', label = 'AUC = %0.2f' % roc_auc)
plt.legend(loc = 'lower right')
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()

In [6]:


## 🔹 **4. Text Classification using ANN (Keras)**

### Preprocess using TF-IDF and Convert to Array


import numpy as np
X_train_arr = X_train_tfidf.toarray()
X_test_arr = X_test_tfidf.toarray()



In [7]:
### Define ANN Model

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout

model = Sequential()
model.add(Dense(128, input_dim=X_train_arr.shape[1], activation='relu'))
model.add(Dropout(0.3))
model.add(Dense(1, activation='sigmoid'))

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(X_train_arr, y_train, epochs=5, batch_size=64, validation_data=(X_test_arr, y_test))


Epoch 1/5
70/70 ━━━━━━━━━━━━━━━━━━━━ 6s 39ms/step - accuracy: 0.8342 - loss: 0.5842 - val_accuracy: 0.8987 - val_loss: 0.2784
Epoch 2/5
70/70 ━━━━━━━━━━━━━━━━━━━━ 2s 26ms/step - accuracy: 0.9269 - loss: 0.2271 - val_accuracy: 0.9722 - val_loss: 0.1204
Epoch 3/5
70/70 ━━━━━━━━━━━━━━━━━━━━ 2s 29ms/step - accuracy: 0.9823 - loss: 0.0837 - val_accuracy: 0.9794 - val_loss: 0.0793
Epoch 4/5
70/70 ━━━━━━━━━━━━━━━━━━━━ 3s 32ms/step - accuracy: 0.9938 - loss: 0.0448 - val_accuracy: 0.9785 - val_loss: 0.0677
Epoch 5/5
70/70 ━━━━━━━━━━━━━━━━━━━━ 2s 32ms/step - accuracy: 0.9969 - loss: 0.0247 - val_accuracy: 0.9803 - val_loss: 0.0629


<keras.src.callbacks.history.History at 0x212383619a0>

## Step 2: Predict with Sample Input

In [8]:
# Sample email text
sample_email = ["Congratulations! You've won a free iPhone. Click the link to claim now!"]

# Vectorize using the same TF-IDF vectorizer
sample_tfidf = tfidf.transform(sample_email).toarray()

# Predict
prediction = model.predict(sample_tfidf)

# Interpret result
label = "Spam" if prediction[0][0] >= 0.5 else "Ham"
print(f"Prediction: {label} ({prediction[0][0]:.4f})")


1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 180ms/step
Prediction: Spam (0.9854)


In [9]:
# Try with Multiple Samples
sample_emails = [
    "Congratulations! You have won a lottery worth $1,000,000.",
    "Hey, are we still meeting for lunch today?",
    "Limited-time offer just for you! Get 90% off on your next purchase.",
    "Please find the attached report for the project."
]

sample_tfidf = tfidf.transform(sample_emails).toarray()
predictions = model.predict(sample_tfidf)

for i, pred in enumerate(predictions):
    label = "Spam" if pred[0] >= 0.5 else "Ham"
    print(f"Email: {sample_emails[i]}\nPrediction: {label} ({pred[0]:.4f})\n")


1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 212ms/step
Email: Congratulations! You have won a lottery worth $1,000,000.
Prediction: Spam (0.9313)

Email: Hey, are we still meeting for lunch today?
Prediction: Ham (0.0016)

Email: Limited-time offer just for you! Get 90% off on your next purchase.
Prediction: Ham (0.1808)

Email: Please find the attached report for the project.
Prediction: Ham (0.1389)
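
In the output above, the promotional "Limited-time offer" message scored 0.1808 and was labeled Ham because the cutoff is fixed at 0.5. As a sketch of how the cutoff affects labels, the hypothetical helper below applies a configurable threshold to the sigmoid output. The value 0.15 is an arbitrary choice for illustration, not a recommended setting; lowering the threshold catches more spam at the cost of flagging more legitimate mail.

```python
def label_with_threshold(probability, threshold=0.5):
    """Map a sigmoid output to a label and the confidence in that label."""
    is_spam = probability >= threshold
    label = "Spam" if is_spam else "Ham"
    confidence = probability if is_spam else 1.0 - probability
    return label, confidence

# Scores copied from the output above
scores = [0.9313, 0.0016, 0.1808, 0.1389]

default_labels = [label_with_threshold(p)[0] for p in scores]
lowered_labels = [label_with_threshold(p, threshold=0.15)[0] for p in scores]

print(default_labels)
print(lowered_labels)
```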



### ✅ How to Save and Prepare the Model

In [None]:
# Save model
model.save("ann_spam_model.h5")

# Save TF-IDF vectorizer
import pickle
with open("tfidf_vectorizer.pkl", "wb") as f:
    pickle.dump(tfidf, f)
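
The vectorizer has to be saved along with the model because the network's input columns correspond to the vocabulary learned during training. The sketch below illustrates this with a made-up three-message corpus and an in-memory pickle round trip standing in for the `.pkl` file: the restored vectorizer yields the same features, while a vectorizer refit on new text yields a different number of columns.

```python
import pickle
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up training texts, for illustration only
train_texts = ["win a free prize now", "lunch at noon today", "free ticket claim now"]
vec = TfidfVectorizer().fit(train_texts)

# In-memory stand-in for writing and reading tfidf_vectorizer.pkl
restored = pickle.loads(pickle.dumps(vec))

sample = ["claim your free prize"]
original_row = vec.transform(sample).toarray()
restored_row = restored.transform(sample).toarray()

# Refitting on new text learns a different vocabulary, so the column count changes
refit_cols = TfidfVectorizer().fit(sample).transform(sample).shape[1]

print(original_row.shape[1], restored_row.shape[1], refit_cols)
```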





### Step 1: Data Preparation

* Collect a labeled dataset (e.g., movie reviews with positive/negative labels).
* Clean the text (lowercase, remove noise).
* Handle emoticons by replacing them with sentiment tokens.
* Convert text to numeric vectors (TF-IDF, CountVectorizer, or embeddings).

---

### Step 2: ANN Model Architecture

* Input layer size = number of features
* Hidden layers (e.g., 1 or 2 layers with ReLU activation)
* Output layer with 1 neuron and sigmoid activation (for binary sentiment classification)
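
To show what this architecture computes, here is a minimal NumPy sketch of one forward pass through a single ReLU hidden layer and a sigmoid output neuron. The layer sizes and random weights are arbitrary assumptions for illustration; a trained Keras model learns its weights from data.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(X, W1, b1, W2, b2):
    """Input -> hidden layer (ReLU) -> one output neuron (sigmoid)."""
    hidden = relu(X @ W1 + b1)
    return sigmoid(hidden @ W2 + b2)

rng = np.random.default_rng(0)
n_features, n_hidden = 6, 4            # arbitrary sizes for illustration
X = rng.random((3, n_features))        # 3 samples, e.g. rows of a TF-IDF matrix
W1 = rng.normal(size=(n_features, n_hidden))
b1 = np.zeros(n_hidden)
W2 = rng.normal(size=(n_hidden, 1))
b2 = np.zeros(1)

probs = forward(X, W1, b1, W2, b2)
print(probs.shape)   # one probability per sample
```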



### Step 3: Code Example (Using Keras)

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import Adam

# Sample data
texts = [
    "I love this product! :)",
    "This is the worst movie I've ever seen :(",
    "Absolutely fantastic experience!",
    "I do not like this at all :/",
    "Such a boring day..."
]
labels = [1, 0, 1, 0, 0]  # 1 = positive, 0 = negative

# Preprocessing function to handle emoticons
def preprocess_text(text):
    emoticon_dict = {
        ':)': ' happy ',
        ':(': ' sad ',
        ':D': ' happy ',
        ':/': ' disappointed ',
        ':-)': ' happy ',
        ':-(': ' sad '
    }
    for emot in emoticon_dict:
        text = text.replace(emot, emoticon_dict[emot])
    return text.lower()

texts = [preprocess_text(t) for t in texts]

# Vectorize text
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts).toarray()
y = np.array(labels)

# Split dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define ANN model
model = Sequential([
    Dense(16, input_dim=X.shape[1], activation='relu'),
    Dense(8, activation='relu'),
    Dense(1, activation='sigmoid')
])

model.compile(loss='binary_crossentropy', optimizer=Adam(learning_rate=0.001), metrics=['accuracy'])

# Train model
model.fit(X_train, y_train, epochs=20, batch_size=2, verbose=1)

# Evaluate model
loss, accuracy = model.evaluate(X_test, y_test)
print(f"Test Accuracy: {accuracy*100:.2f}%")
```

---

### Explanation

* We replace emoticons with text tokens that the vectorizer can pick up.
* We use TF-IDF to convert text to numeric vectors.
* The ANN has two hidden layers.
* The output layer uses sigmoid activation to output a probability for positive sentiment.
* We train and evaluate the model on a small dataset.
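
The emoticon replacement from the first bullet can also be written with regular expressions, which covers variants such as `:-D` or `;)` without listing each one. This is a sketch with hypothetical patterns, not a complete emoticon lexicon. The lookahead in the `:/` pattern is an assumption added to avoid matching the `://` in URLs.

```python
import re

# Hypothetical patterns; extend the list as needed
EMOTICON_PATTERNS = [
    (re.compile(r"[:;]-?[)D](?!\w)"), " happy "),
    (re.compile(r":-?\("), " sad "),
    (re.compile(r":-?/(?!/)"), " disappointed "),
]

def replace_emoticons(text):
    """Replace emoticons with sentiment tokens, then normalize spaces and case."""
    for pattern, token in EMOTICON_PATTERNS:
        text = pattern.sub(token, text)
    return re.sub(r"\s+", " ", text).strip().lower()

print(replace_emoticons("I love this product! :)"))
print(replace_emoticons("Worst movie ever :-("))
print(replace_emoticons("See http://example.com :/"))
```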

---

### Assignment Ideas

1. Collect a larger dataset of social media comments.
2. Add more emoticon handling rules.
3. Try word embeddings (Word2Vec, GloVe) instead of TF-IDF.
4. Experiment with different ANN architectures (more layers, dropout).
5. Evaluate the model on a test set and analyze errors.

---

### ✅ Streamlit App: spam_detector_app.py

✅ How to Run Streamlit App

Save the code below as `spam_detector_app.py`, then start the app with `streamlit run spam_detector_app.py`.

In [None]:
import streamlit as st
import pandas as pd
import numpy as np
import pickle
from tensorflow.keras.models import load_model
from sklearn.feature_extraction.text import TfidfVectorizer

# Load trained model
model = load_model("ann_spam_model.h5")

# Load saved TF-IDF vectorizer
with open("tfidf_vectorizer.pkl", "rb") as f:
    tfidf = pickle.load(f)

# Streamlit UI
st.title("📧 Email Spam Classifier")
st.write("Enter an email below to check if it's **Spam** or **Ham**")

email_text = st.text_area("✉️ Email Content")

if st.button("Predict"):
    if email_text.strip() == "":
        st.warning("Please enter some email text.")
    else:
        # Transform text
        email_vector = tfidf.transform([email_text]).toarray()
        
        # Predict
        prediction = model.predict(email_vector)
        label = "🛑 Spam" if prediction[0][0] >= 0.5 else "✅ Ham"
        confidence = prediction[0][0] if label == "🛑 Spam" else 1 - prediction[0][0]
        
        st.subheader("📊 Prediction Result")
        st.write(f"**Prediction:** {label}")
        st.write(f"**Confidence:** {confidence:.2%}")


## Flask API: app.py

In [None]:
# ✅ 2. Save Model and Vectorizer
# After training in your main notebook:

model.save("ann_spam_model.h5")

import pickle
with open("tfidf_vectorizer.pkl", "wb") as f:
    pickle.dump(tfidf, f)

In [None]:
from flask import Flask, request, jsonify
import numpy as np
import pickle
from tensorflow.keras.models import load_model
from sklearn.feature_extraction.text import TfidfVectorizer

app = Flask(__name__)

# Load model and vectorizer
model = load_model('ann_spam_model.h5')

with open('tfidf_vectorizer.pkl', 'rb') as f:
    tfidf = pickle.load(f)

@app.route('/', methods=['GET'])
def index():
    return jsonify({'message': 'Email Spam Detection API is running!'})

@app.route('/predict', methods=['POST'])
def predict():
    # silent=True returns None instead of raising when the body is not valid JSON
    data = request.get_json(silent=True) or {}

    if 'email' not in data:
        return jsonify({'error': 'No email content provided'}), 400

    email_text = data['email']
    
    # Preprocess and predict
    email_vector = tfidf.transform([email_text]).toarray()
    prediction = model.predict(email_vector)
    
    label = 'Spam' if prediction[0][0] >= 0.5 else 'Ham'
    confidence = float(prediction[0][0]) if label == "Spam" else float(1 - prediction[0][0])

    return jsonify({
        'prediction': label,
        'confidence': round(confidence, 4)
    })

if __name__ == '__main__':
    app.run(debug=True)


In [None]:
# ✅ 3. Run the Flask API

# pip install flask tensorflow scikit-learn
# python app.py


## ✅ **4. Test the API (Using curl or Postman)**

### Example `POST` request using **curl**:

```bash
curl -X POST http://127.0.0.1:5000/predict \
     -H "Content-Type: application/json" \
     -d '{"email":"Congratulations! You have won a free ticket. Click here!"}'
```
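
The same request can be sent from Python. The sketch below uses only the standard library. The helper names `build_predict_request` and `predict` are hypothetical, and the default URL assumes the Flask app is running locally on port 5000 as in the curl example. Only the request is built here; calling `predict` requires the server to be up.

```python
import json
import urllib.request

def build_predict_request(email_text, base_url="http://127.0.0.1:5000"):
    """Build a JSON POST request for the /predict endpoint."""
    payload = json.dumps({"email": email_text}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/predict",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def predict(email_text, base_url="http://127.0.0.1:5000", timeout=10):
    """Send the request and decode the JSON response (needs a running server)."""
    req = build_predict_request(email_text, base_url)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))

req = build_predict_request("Congratulations! You have won a free ticket. Click here!")
print(req.get_method(), req.full_url)
```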

### Sample Response:

```json
{
  "prediction": "Spam",
  "confidence": 0.9813
}
```
