# Project Flow of ML Model Building – Email Spam Classifier

## 1. Text Cleaning
Feature extraction requires normalized text.
- Lowercase
- Remove URLs
- Remove numbers
- Remove punctuation
- Tokenization

## 2. EDA (Exploratory Data Analysis)
Initial analysis phase to understand dataset characteristics.

**Key Tasks:**
- Class distribution (Spam vs Ham)
- Email length analysis
- Word frequency visualization
- Detect anomalies or imbalance

**Purpose:** Identify patterns and validate modeling assumptions.
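
The tasks above can be sketched with pandas as follows. This is a minimal illustration on a hypothetical four-row frame named `toy`: the column names `Message` and `Spam/Ham` mirror the Enron CSV loaded later in this notebook, but the rows themselves are made up.

```python
from collections import Counter

import pandas as pd

# Hypothetical toy rows; only the column names come from the real dataset.
toy = pd.DataFrame(
    {
        "Message": [
            "please review the attached nomination",
            "meeting moved to friday",
            "claim your free prize now free free",
            "cheap meds free shipping",
        ],
        "Spam/Ham": ["ham", "ham", "spam", "spam"],
    }
)

# Class distribution as fractions, to spot imbalance early.
class_share = toy["Spam/Ham"].value_counts(normalize=True)

# Email length in whitespace-separated words, summarised per class.
toy["n_words"] = toy["Message"].str.split().str.len()
median_words = toy.groupby("Spam/Ham")["n_words"].median()

# Most frequent words within the spam class.
spam_text = " ".join(toy.loc[toy["Spam/Ham"] == "spam", "Message"])
top_spam_words = Counter(spam_text.split()).most_common(3)

print(class_share, median_words, top_spam_words, sep="\n\n")
```

On the real data the same three summaries would be computed on `df` instead of `toy`.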

## 3. Text Pre-processing
Transforms raw email text into machine-readable features.
- Lowercasing text  
- Remove punctuation & stopwords  
- Tokenization  
- Stemming/Lemmatization  
- Vectorization using **TF-IDF / Bag of Words**  

## 4. Model Building
Training the classification algorithms.
- Train–test split  
- Feature extraction from text  
- Train models:
  - Naïve Bayes  
  - Logistic Regression  
  - SVM  
- Fit on vectorized dataset

## 5. Model Evaluation
Assess model effectiveness.

**Metrics:**
- Accuracy  
- Precision  
- Recall  
- F1-Score  
- Confusion Matrix  

**Focus:** Balance spam recall against false positives (legitimate emails flagged as spam).

## 6. Improvement
Performance optimization stage.
- Hyperparameter tuning  
- N-gram features  
- Handle class imbalance  
- Ensemble / advanced models 
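
As a rough sketch of how the first three items could be combined, the block below tunes the n-gram range and the SVM regularization strength `C` with a cross-validated grid search, and uses `class_weight="balanced"` as one possible way to handle imbalance. The twelve training sentences are invented for illustration, and the grid values are assumptions rather than recommended settings.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy corpus: six spam-like and six ham-like messages.
texts = [
    "win a free prize now",
    "claim your free money today",
    "cheap meds with free shipping",
    "urgent offer claim your prize",
    "free lottery tickets click now",
    "limited offer win cash now",
    "please review the attached nomination",
    "meeting moved to friday morning",
    "see the updated gas schedule",
    "thanks for the production report",
    "call linda about the contract",
    "forwarding the agenda for monday",
]
labels = ["spam"] * 6 + ["ham"] * 6

# Keeping the vectorizer inside the pipeline means it is refitted on the
# training part of every cross-validation fold, so no fold leaks vocabulary.
pipeline = Pipeline(
    [
        ("tfidf", TfidfVectorizer()),
        ("clf", LinearSVC(class_weight="balanced")),
    ]
)

param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],  # unigrams vs unigrams + bigrams
    "clf__C": [0.1, 1.0, 10.0],              # regularization strength
}

search = GridSearchCV(pipeline, param_grid, scoring="f1_macro", cv=3)
search.fit(texts, labels)

print(search.best_params_)
```

With the real dataset, `texts` and `labels` would be the cleaned training split, and the grid would usually be wider.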

## 7. User Interface (Website)
Front-end system for predictions.
- Email text input  
- Predict button  
- Output: Spam / Not Spam  
- Built with **HTML, CSS, JS + Flask/Django**  
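
A minimal sketch of the Flask side of such an interface might look like the block below. The `/predict` route, the `email_text` field name, and the `predict_label` helper are all hypothetical. In a real app the helper would call the trained vectorizer and models rather than the keyword check used here as a stand-in.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)


def predict_label(text: str) -> str:
    """Hypothetical stand-in for the trained classifier."""
    return "spam" if "free prize" in text.lower() else "ham"


@app.route("/predict", methods=["POST"])
def predict():
    # Expect a JSON body such as {"email_text": "..."} from the front end.
    payload = request.get_json(silent=True) or {}
    text = str(payload.get("email_text", ""))
    if not text.strip():
        return jsonify(error="email_text is required"), 400
    return jsonify(prediction=predict_label(text))
```

The Predict button on the page would POST the text box contents to this route and render the returned `prediction` value.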

## 8. Deployment on Heroku
Cloud hosting for public access.
- Upload Flask app + model files  
- Add `requirements.txt`  
- Add `Procfile`  
- Deploy via Git  
- Generate live public URL  

In [1]:
import numpy as np
import pandas as pd

In [2]:
df = pd.read_csv('enron_spam_data.csv')

In [3]:
df.sample(3)

Unnamed: 0,Message ID,Subject,Message,Spam/Ham,Date
28557,28557,wild goose storage inc . expansion open season,please watch for an important e - mail in the ...,ham,2001-04-09
3114,3114,"hpl nom for april 25 , 2001",( see attached file : hplno 425 . xls )\n- hpl...,ham,2001-04-24
10345,10345,in the heart of your business !,corporate image can say a lot of things about ...,spam,2005-06-30


In [4]:
df.shape

(33716, 5)

In [5]:
!pip install nltk






In [6]:
# Text Cleaning
import re
import string
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\suraj\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\suraj\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [7]:
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def clean_email(text):
    """Normalize one raw email string into space-separated, lemmatized tokens."""

    # 1. Remove header lines such as "Message-ID: ..." (case-sensitive, at line start)
    text = re.sub(r'^(Message-ID|Date|From|To|Subject):.*\n?', '', text, flags=re.MULTILINE)

    # 2. Remove MIME boundaries (any token that starts with "--")
    text = re.sub(r'--\S+', '', text)

    # 3. Remove HTML tags
    text = re.sub(r'<.*?>', ' ', text)

    # 4. Replace URLs with a placeholder; step 7 strips the angle brackets,
    #    so the surviving token is "url"
    text = re.sub(r'http\S+|www\S+', ' <URL> ', text)

    # 5. Replace email addresses with a placeholder (survives as "email")
    text = re.sub(r'\S+@\S+', ' <EMAIL> ', text)

    # 6. Convert to lowercase
    text = text.lower()

    # 7. Remove punctuation
    text = text.translate(str.maketrans('', '', string.punctuation))

    # 8. Remove digits (optional)
    text = re.sub(r'\d+', '', text)

    # 9. Collapse extra whitespace
    text = re.sub(r'\s+', ' ', text).strip()

    # 10. Tokenize on whitespace
    tokens = text.split()

    # 11. Drop stopwords and words of 1-2 characters, then lemmatize the rest
    cleaned_tokens = [
        lemmatizer.lemmatize(word)
        for word in tokens
        if word not in stop_words and len(word) > 2
    ]

    return " ".join(cleaned_tokens)
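
One consequence of the step order in `clean_email` is worth seeing in isolation. The sketch below reproduces only the regex and punctuation steps (no NLTK) in a helper called `normalize_sketch`, a hypothetical name that is not used elsewhere in the notebook. It suggests two things. The `<URL>` and `<EMAIL>` placeholders end up as the plain tokens `url` and `email`. And text that is already split on punctuation, as the Enron sample rows appear to be, does not trigger the URL or email patterns at all.

```python
import re
import string


def normalize_sketch(text):
    """Regex and punctuation steps of clean_email, without stopwords or lemmatization."""
    text = re.sub(r'http\S+|www\S+', ' <URL> ', text)
    text = re.sub(r'\S+@\S+', ' <EMAIL> ', text)
    text = text.lower()
    text = text.translate(str.maketrans('', '', string.punctuation))
    text = re.sub(r'\d+', '', text)
    return re.sub(r'\s+', ' ', text).strip()


# A conventionally formatted message: URL and address are replaced.
raw = "Visit http://example.com or mail bob@example.com by 5 pm!"

# The same message with spaces around punctuation: neither pattern matches,
# so the fragments "http", "www", "example", "com" stay as ordinary tokens.
pre_tokenized = "visit http : / / www . example . com or mail bob @ example . com"

print(normalize_sketch(raw))
print(normalize_sketch(pre_tokenized))
```

If placeholder tokens are wanted for pre-tokenized text, the patterns would need to allow for the inserted spaces. That is a design decision this sketch does not make.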

In [8]:
df['combined_text'] = df['Subject'].fillna('') + " " + df['Message'].fillna('')
df['cleaned_text'] = df['combined_text'].apply(clean_email)

In [9]:
pd.set_option('display.max_colwidth', None)
df[['combined_text', 'cleaned_text']].head(3)

Unnamed: 0,combined_text,cleaned_text
0,christmas tree farm pictures,christmas tree farm picture
1,"vastar resources , inc . gary , production from the high island larger block a - 1 # 2 commenced on\nsaturday at 2 : 00 p . m . at about 6 , 500 gross . carlos expects between 9 , 500 and\n10 , 000 gross for tomorrow . vastar owns 68 % of the gross production .\ngeorge x 3 - 6992\n- - - - - - - - - - - - - - - - - - - - - - forwarded by george weissman / hou / ect on 12 / 13 / 99 10 : 16\nam - - - - - - - - - - - - - - - - - - - - - - - - - - -\ndaren j farmer\n12 / 10 / 99 10 : 38 am\nto : carlos j rodriguez / hou / ect @ ect\ncc : george weissman / hou / ect @ ect , melissa graves / hou / ect @ ect\nsubject : vastar resources , inc .\ncarlos ,\nplease call linda and get everything set up .\ni ' m going to estimate 4 , 500 coming up tomorrow , with a 2 , 000 increase each\nfollowing day based on my conversations with bill fischer at bmar .\nd .\n- - - - - - - - - - - - - - - - - - - - - - forwarded by daren j farmer / hou / ect on 12 / 10 / 99 10 : 34\nam - - - - - - - - - - - - - - - - - - - - - - - - - - -\nenron north america corp .\nfrom : george weissman 12 / 10 / 99 10 : 00 am\nto : daren j farmer / hou / ect @ ect\ncc : gary bryan / hou / ect @ ect , melissa graves / hou / ect @ ect\nsubject : vastar resources , inc .\ndarren ,\nthe attached appears to be a nomination from vastar resources , inc . for the\nhigh island larger block a - 1 # 2 ( previously , erroneously referred to as the\n# 1 well ) . vastar now expects the well to commence production sometime\ntomorrow . i told linda harris that we ' d get her a telephone number in gas\ncontrol so she can provide notification of the turn - on tomorrow . linda ' s\nnumbers , for the record , are 281 . 584 . 3359 voice and 713 . 312 . 1689 fax .\nwould you please see that someone contacts linda and advises her how to\nsubmit future nominations via e - mail , fax or voice ? 
thanks .\ngeorge x 3 - 6992\n- - - - - - - - - - - - - - - - - - - - - - forwarded by george weissman / hou / ect on 12 / 10 / 99 09 : 44\nam - - - - - - - - - - - - - - - - - - - - - - - - - - -\n"" linda harris "" on 12 / 10 / 99 09 : 38 : 43 am\nto : george weissman / hou / ect @ ect\ncc :\nsubject : hi a - 1 # 2\neffective 12 - 11 - 99\n| - - - - - - - - + - - - - - - - - - - + - - - - - - - - - - - |\n| | | |\n| mscf / d | min ftp | time |\n| | | |\n| - - - - - - - - + - - - - - - - - - - + - - - - - - - - - - - |\n| | | |\n| 4 , 500 | 9 , 925 | 24 hours |\n| | | |\n| - - - - - - - - + - - - - - - - - - - + - - - - - - - - - - - |\n| | | |\n| 6 , 000 | 9 , 908 | 24 hours |\n| | | |\n| - - - - - - - - + - - - - - - - - - - + - - - - - - - - - - - |\n| | | |\n| 8 , 000 | 9 , 878 | 24 hours |\n| | | |\n| - - - - - - - - + - - - - - - - - - - + - - - - - - - - - - - |\n| | | |\n| 10 , 000 | 9 , 840 | 24 hours |\n| | | |\n| - - - - - - - - + - - - - - - - - - - + - - - - - - - - - - - |\n| | | |\n| 12 , 000 | 9 , 793 | 24 hours |\n| | | |\n| - - - - - - - - + - - - - - - - - - - + - - - - - - - - - - - |\n| | | |\n| 14 , 000 | 9 , 738 | 24 hours |\n| | | |\n| - - - - - - - - + - - - - - - - - - - + - - - - - - - - - - - |\n| | | |\n| 16 , 000 | 9 , 674 | 24 hours |\n| | | |\n| - - - - - - - - + - - - - - - - - - - + - - - - - - - - - - - |\n| | | |\n| 18 , 000 | 9 , 602 | 24 hours |\n| | | |\n| - - - - - - - - + - - - - - - - - - - + - - - - - - - - - - - |\n| | | |\n| 20 , 000 | 9 , 521 | 24 hours |\n| | | |\n| - - - - - - - - + - - - - - - - - - - + - - - - - - - - - - - |\n| | | |\n| 22 , 000 | 9 , 431 | 24 hours |\n| | | |\n| - - - - - - - - + - - - - - - - - - - + - - - - - - - - - - - |\n| | | |\n| 24 , 000 | 9 , 332 | 24 hours |\n| | | |\n| - - - - - - - - + - - - - - - - - - - + - - - - - - - - - - - |\n| | | |\n| 26 , 000 | 9 , 224 | 24 hours |\n| | | |\n| - - - - - - - - + - - - - - - - - - - + - - - - - - - - - - - |\n| | | |\n| 28 , 000 | 9 , 108 | 24 
hours |\n| | | |\n| - - - - - - - - + - - - - - - - - - - + - - - - - - - - - - - |\n| | | |\n| 30 , 000 | 8 , 982 | 24 hours |\n| | | |\n| - - - - - - - - + - - - - - - - - - - + - - - - - - - - - - - |\n| | | |\n| 32 , 000 | 8 , 847 | 24 hours |\n| | | |\n| - - - - - - - - + - - - - - - - - - - + - - - - - - - - - - - |\n| | | |\n| 34 , 000 | 8 , 703 | 24 hours |\n| | | |\n| - - - - - - - - + - - - - - - - - - - + - - - - - - - - - - - |\n| | | |\n| 36 , 000 | 8 , 549 | 24 hours |\n| | | |\n| - - - - - - - - + - - - - - - - - - - + - - - - - - - - - - - |",vastar resource inc gary production high island larger block commenced saturday gross carlos expects gross tomorrow vastar owns gross production george forwarded george weissman hou ect daren farmer carlos rodriguez hou ect ect george weissman hou ect ect melissa graf hou ect ect subject vastar resource inc carlos please call linda get everything set going estimate coming tomorrow increase following day based conversation bill fischer bmar forwarded daren farmer hou ect enron north america corp george weissman daren farmer hou ect ect gary bryan hou ect ect melissa graf hou ect ect subject vastar resource inc darren attached appears nomination vastar resource inc high island larger block previously erroneously referred well vastar expects well commence production sometime tomorrow told linda harris get telephone number gas control provide notification turn tomorrow linda number record voice fax would please see someone contact linda advises submit future nomination via mail fax voice thanks george forwarded george weissman hou ect linda harris george weissman hou ect ect subject effective mscf min ftp time hour hour hour hour hour hour hour hour hour hour hour hour hour hour hour hour hour
2,calpine daily gas nomination - calpine daily gas nomination 1 . doc,calpine daily gas nomination calpine daily gas nomination doc


In [10]:
# Inspect the target column before the train-test split
# (a notebook cell displays only its last expression)

df['Spam/Ham']

0         ham
1         ham
2         ham
3         ham
4         ham
         ... 
33711    spam
33712    spam
33713    spam
33714    spam
33715    spam
Name: Spam/Ham, Length: 33716, dtype: object

In [11]:
from sklearn.model_selection import train_test_split

X = df['cleaned_text']
y = df['Spam/Ham']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

In [12]:
#2 TF-IDF Vectorization (Fit Only on Train)

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(max_features=9000)

X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)
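
Fitting on the training split only keeps test-set statistics out of the features. The pure-Python sketch below illustrates the idea on three made-up documents. It uses the smoothed formula idf(t) = ln((1 + n) / (1 + df(t))) + 1 followed by L2 normalization, which should correspond to the documented `TfidfVectorizer` defaults (`smooth_idf=True`, `norm='l2'`). It ignores `max_features` and scikit-learn's tokenizer, and the helper names `fit_idf` and `transform` are illustrative only.

```python
import math
from collections import Counter


def fit_idf(train_docs):
    """Learn smoothed IDF weights from the training documents only."""
    n_docs = len(train_docs)
    doc_freq = Counter()
    for doc in train_docs:
        doc_freq.update(set(doc.split()))
    return {term: math.log((1 + n_docs) / (1 + df)) + 1.0 for term, df in doc_freq.items()}


def transform(doc, idf):
    """Weight one document with an already-fitted IDF table, then L2-normalize."""
    # Words that never appeared during fitting are dropped here.
    counts = Counter(word for word in doc.split() if word in idf)
    weights = {term: tf * idf[term] for term, tf in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()} if norm else {}


train_docs = ["free prize claim", "meeting agenda attached", "claim free money"]
idf = fit_idf(train_docs)

# "lottery" never occurred in training, so it contributes nothing at test time.
test_vector = transform("free lottery prize", idf)
print(test_vector)
```

The rarer training word (`prize`, seen in one document) receives a larger weight than the more common one (`free`, seen in two).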

In [13]:
vectorizer.get_feature_names_out()[:10]

array(['aaa', 'aaron', 'abacha', 'abacus', 'abandon', 'abandoned', 'abb',
       'abbott', 'abc', 'abdominal'], dtype=object)

In [14]:
# Model 1 - Naive Bayes

from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

nb_model = MultinomialNB()
nb_model.fit(X_train_tfidf, y_train)

nb_pred = nb_model.predict(X_test_tfidf)
nb_accuracy = accuracy_score(y_test, nb_pred)

print("Naive Bayes Accuracy:", nb_accuracy)

Naive Bayes Accuracy: 0.9876927639383155


In [15]:
# Model 2 - Logistic Regression

from sklearn.linear_model import LogisticRegression

lr_model = LogisticRegression(max_iter=1000)
lr_model.fit(X_train_tfidf, y_train)

lr_pred = lr_model.predict(X_test_tfidf)
lr_accuracy = accuracy_score(y_test, lr_pred)

print("Logistic Regression Accuracy:", lr_accuracy)

Logistic Regression Accuracy: 0.9887307236061684


In [16]:
# Model 3 - Linear Support Vector Machine (SVM)

from sklearn.svm import LinearSVC

svm_model = LinearSVC()
svm_model.fit(X_train_tfidf, y_train)

svm_pred = svm_model.predict(X_test_tfidf)
svm_accuracy = accuracy_score(y_test, svm_pred)

print("Linear SVM Accuracy:", svm_accuracy)

Linear SVM Accuracy: 0.9915480427046264


In [17]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Initialize model
nb_model = MultinomialNB()

# Train
nb_model.fit(X_train_tfidf, y_train)

# Predict
nb_pred = nb_model.predict(X_test_tfidf)

# Classification Report
print("Naive Bayes Classification Report:\n")
print(classification_report(y_test, nb_pred))

# Confusion Matrix
print("Naive Bayes Confusion Matrix:\n")
print(confusion_matrix(y_test, nb_pred))

Naive Bayes Classification Report:

              precision    recall  f1-score   support

         ham       0.99      0.98      0.99      3309
        spam       0.98      0.99      0.99      3435

    accuracy                           0.99      6744
   macro avg       0.99      0.99      0.99      6744
weighted avg       0.99      0.99      0.99      6744

Naive Bayes Confusion Matrix:

[[3254   55]
 [  28 3407]]


In [18]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Initialize model
lr_model = LogisticRegression(max_iter=1000)

# Train
lr_model.fit(X_train_tfidf, y_train)

# Predict
lr_pred = lr_model.predict(X_test_tfidf)

# Classification Report
print("Logistic Regression Classification Report:\n")
print(classification_report(y_test, lr_pred))

# Confusion Matrix
print("Logistic Regression Confusion Matrix:\n")
print(confusion_matrix(y_test, lr_pred))

Logistic Regression Classification Report:

              precision    recall  f1-score   support

         ham       1.00      0.98      0.99      3309
        spam       0.98      1.00      0.99      3435

    accuracy                           0.99      6744
   macro avg       0.99      0.99      0.99      6744
weighted avg       0.99      0.99      0.99      6744

Logistic Regression Confusion Matrix:

[[3249   60]
 [  16 3419]]


In [19]:
from sklearn.metrics import classification_report, confusion_matrix

# Classification Report
print("SVM Classification Report:\n")
print(classification_report(y_test, svm_pred))

# Confusion Matrix
print("SVM Confusion Matrix:\n")
print(confusion_matrix(y_test, svm_pred))

SVM Classification Report:

              precision    recall  f1-score   support

         ham       0.99      0.99      0.99      3309
        spam       0.99      1.00      0.99      3435

    accuracy                           0.99      6744
   macro avg       0.99      0.99      0.99      6744
weighted avg       0.99      0.99      0.99      6744

SVM Confusion Matrix:

[[3269   40]
 [  17 3418]]
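
The three confusion matrices above can be cross-checked by hand. The sketch below copies them from the cell outputs and recomputes accuracy, spam precision, spam recall, and the false positive rate on ham. It assumes scikit-learn's layout: rows are true classes, columns are predicted classes, and the labels are sorted as `[ham, spam]`.

```python
# Confusion matrices copied from the outputs above.
# Layout: rows = true class, columns = predicted class, order [ham, spam].
matrices = {
    "Naive Bayes": [[3254, 55], [28, 3407]],
    "Logistic Regression": [[3249, 60], [16, 3419]],
    "Linear SVM": [[3269, 40], [17, 3418]],
}

results = {}
for name, ((ham_ok, ham_flagged), (spam_missed, spam_caught)) in matrices.items():
    total = ham_ok + ham_flagged + spam_missed + spam_caught
    results[name] = {
        "accuracy": (ham_ok + spam_caught) / total,
        "spam_precision": spam_caught / (spam_caught + ham_flagged),
        "spam_recall": spam_caught / (spam_caught + spam_missed),
        # Share of legitimate emails wrongly flagged as spam.
        "false_positive_rate": ham_flagged / (ham_ok + ham_flagged),
    }

for name, metrics in results.items():
    print(name, {key: round(value, 4) for key, value in metrics.items()})
```

On this split the Linear SVM flags the fewest legitimate emails (40 of 3,309), while Logistic Regression misses the fewest spam emails (16 of 3,435).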


In [20]:
def predict_email_ensemble(text, weak_threshold=0.6, strong_threshold=0.75):
    
    # Clean
    cleaned = clean_email(text)
    vectorized = vectorizer.transform([cleaned])
    
    # Predictions
    nb_pred = str(nb_model.predict(vectorized)[0])
    lr_pred = str(lr_model.predict(vectorized)[0])
    svm_pred = str(svm_model.predict(vectorized)[0])
    
    # Logistic Regression probability
    lr_prob = lr_model.predict_proba(vectorized)[0]
    spam_index = list(lr_model.classes_).index('spam')
    spam_confidence = lr_prob[spam_index]
    
    # -------- Decision Logic -------- #
    # Conservative by design: HAM is the default, SPAM is returned only in Case 2.
    
    # Case 1: NB predicts ham and LR's spam probability is below weak_threshold → HAM
    if nb_pred == 'ham' and spam_confidence < weak_threshold:
        final_pred = 'ham'
    
    # Case 2: LR and SVM both predict spam and LR's spam probability
    # is at least strong_threshold → SPAM
    elif lr_pred == 'spam' and svm_pred == 'spam' and spam_confidence >= strong_threshold:
        final_pred = 'spam'
    
    # Otherwise fall back to HAM to avoid flagging legitimate mail
    else:
        final_pred = 'ham'
    
    return {
        "Naive Bayes": nb_pred,
        "Logistic Regression": lr_pred,
        "Linear SVM": svm_pred,
        "Spam Confidence (LR)": round(spam_confidence, 4),
        "Final Decision": final_pred.upper()
    }
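
To make the branching easier to reason about, the sketch below isolates it from the models. `final_decision` mirrors the three branches of `predict_email_ensemble`. `simplified_decision` is a hypothetical one-condition version. Whenever `weak_threshold` is not larger than `strong_threshold`, as with the defaults, the two should agree, because Case 1 and the fallback both return HAM. The `observed` rows are the model outputs recorded in the test cells of this notebook.

```python
from itertools import product


def final_decision(nb_pred, lr_pred, svm_pred, spam_confidence,
                   weak_threshold=0.6, strong_threshold=0.75):
    """Same branching as predict_email_ensemble, without the models."""
    if nb_pred == "ham" and spam_confidence < weak_threshold:
        return "HAM"
    if lr_pred == "spam" and svm_pred == "spam" and spam_confidence >= strong_threshold:
        return "SPAM"
    return "HAM"


def simplified_decision(lr_pred, svm_pred, spam_confidence, strong_threshold=0.75):
    """Hypothetical equivalent when weak_threshold <= strong_threshold."""
    is_spam = lr_pred == "spam" and svm_pred == "spam" and spam_confidence >= strong_threshold
    return "SPAM" if is_spam else "HAM"


# Exhaustively compare both versions over all label combinations
# and a grid of confidence values from 0.0 to 1.0.
labels = ("ham", "spam")
confidences = [step / 20 for step in range(21)]
mismatches = [
    (nb, lr, svm, conf)
    for nb, lr, svm, conf in product(labels, labels, labels, confidences)
    if final_decision(nb, lr, svm, conf) != simplified_decision(lr, svm, conf)
]

# (NB, LR, SVM, LR spam confidence, final decision) as printed by the test cells.
observed = [
    ("spam", "spam", "spam", 0.9837, "SPAM"),
    ("spam", "spam", "spam", 0.9026, "SPAM"),
    ("ham", "spam", "spam", 0.7161, "HAM"),
    ("spam", "ham", "ham", 0.4697, "HAM"),
    ("spam", "spam", "spam", 0.6224, "HAM"),
]

print("mismatches:", mismatches)
```

In other words, with the default thresholds the Naive Bayes vote and `weak_threshold` do not change the outcome. They would only matter if `weak_threshold` were set above `strong_threshold`.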

In [21]:
# Ad hoc tests on hand-written emails

email_text = """
URGENT!!! You have won $10,000.
Click the link below to claim now.
"""

result = predict_email_ensemble(email_text)
print(result)

{'Naive Bayes': 'spam', 'Logistic Regression': 'spam', 'Linear SVM': 'spam', 'Spam Confidence (LR)': np.float64(0.9837), 'Final Decision': 'SPAM'}


In [22]:
email_text = """
Congratulations!!! You have won $5000.
Click here to claim your prize now.
"""

result = predict_email_ensemble(email_text)
print(result)

{'Naive Bayes': 'spam', 'Logistic Regression': 'spam', 'Linear SVM': 'spam', 'Spam Confidence (LR)': np.float64(0.9026), 'Final Decision': 'SPAM'}


In [23]:
email_text = """
The Team is impressed by your marketing strategy. Let's catch up for disussing a marketing campaign !
- Marketing Head
"""

result = predict_email_ensemble(email_text)
print(result)

{'Naive Bayes': 'ham', 'Logistic Regression': 'spam', 'Linear SVM': 'spam', 'Spam Confidence (LR)': np.float64(0.7161), 'Final Decision': 'HAM'}


In [24]:
email_text = """
It's nice connecting with you. Your refferal really helped a lot. Thanks for your support. 
"""

result = predict_email_ensemble(email_text)
print(result)

{'Naive Bayes': 'spam', 'Logistic Regression': 'ham', 'Linear SVM': 'ham', 'Spam Confidence (LR)': np.float64(0.4697), 'Final Decision': 'HAM'}


In [25]:
email_text = """
Hey Builders,
It feels like every week we open the laptop and the surface has expanded again.
Memory that sticks. Agents that plan. Models that see, reason, call tools, and finish the job. What used to require heavy orchestration is becoming built-in. The shift from assistive to autonomous systems isn’t gradual—it’s compounding.
So this moment isn’t about access. It’s about execution. If you’ve been waiting for the stack to mature, it has. And just like we say every other week: Now it’s about what you ship with it.
"""
result = predict_email_ensemble(email_text)
print(result)

{'Naive Bayes': 'spam', 'Logistic Regression': 'spam', 'Linear SVM': 'spam', 'Spam Confidence (LR)': np.float64(0.6224), 'Final Decision': 'HAM'}
