# Spam Email Detection using Machine Learning

This notebook builds and evaluates scikit-learn models that classify emails as spam or ham (non-spam).

## Task Requirements
- ✅ Create a predictive model using scikit-learn
- ✅ Classify/predict outcomes from a dataset (spam email detection)
- ✅ Showcase model implementation
- ✅ Showcase model evaluation

## Objectives
- Load and explore email dataset
- Preprocess text data using TF-IDF vectorization
- Train multiple classification models
- Evaluate and compare model performance
- Visualize results


In [None]:
# Import necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.metrics import (
    accuracy_score, 
    precision_score, 
    recall_score, 
    f1_score,
    confusion_matrix,
    classification_report
)
import warnings
warnings.filterwarnings('ignore')

# Set style for better visualizations
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")

print("Libraries imported successfully!")


## 1. Data Loading and Exploration

We'll create a small hand-written sample dataset (15 spam and 15 ham emails) for demonstration. In a real-world scenario, you would load a much larger labelled corpus from a CSV file or database.
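
As a rough sketch of the CSV route, the snippet below reads a two-column file into the same `email`/`label` layout used in this notebook. The CSV contents and the `load_emails` helper are assumptions made for illustration; an in-memory buffer stands in for a real file such as `spam_emails.csv`.

```python
import io

import pandas as pd

# Hypothetical CSV contents; with a real file, pass its path to load_emails instead.
csv_text = '''email,label
"Claim your free prize now!",spam
"Can we meet on Friday?",ham
"Lunch at noon?", Ham
'''


def load_emails(source):
    """Read a CSV with 'email' and 'label' columns and apply basic sanity checks."""
    df = pd.read_csv(source, skipinitialspace=True)
    missing = {"email", "label"} - set(df.columns)
    if missing:
        raise ValueError(f"Missing columns: {sorted(missing)}")
    # Drop rows without text or label, then normalise label spelling.
    df = df.dropna(subset=["email", "label"]).copy()
    df["label"] = df["label"].str.strip().str.lower()
    return df.reset_index(drop=True)


demo_df = load_emails(io.StringIO(csv_text))
print(demo_df["label"].value_counts())
```

Normalising the label column up front means that variants such as `Ham` or `ham ` do not end up as separate classes later on.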


In [None]:
# Sample email dataset
# In production, load from: df = pd.read_csv('spam_emails.csv')

sample_emails = [
    # Spam emails
    ("WINNER!! You have won $1,000,000! Click here to claim your prize now!", "spam"),
    ("URGENT: Your account will be suspended. Verify your details immediately.", "spam"),
    ("Free money! No investment required. Get rich quick scheme.", "spam"),
    ("Congratulations! You've been selected for a free iPhone. Claim now!", "spam"),
    ("Limited time offer! Buy now and get 90% discount. Act fast!", "spam"),
    ("You have won a lottery! Claim your prize worth $500,000 today.", "spam"),
    ("Click here for amazing deals! Lowest prices guaranteed.", "spam"),
    ("Your payment failed. Update your credit card information now.", "spam"),
    ("Exclusive offer just for you! Don't miss this opportunity.", "spam"),
    ("Act now! Limited stock available. Order before it's too late.", "spam"),
    ("You've been pre-approved for a loan. Apply now with no credit check.", "spam"),
    ("Free trial! Cancel anytime. Sign up now for premium access.", "spam"),
    ("Your package delivery failed. Click to reschedule delivery.", "spam"),
    ("Earn money from home! Work from home opportunity. No experience needed.", "spam"),
    ("Special promotion! Buy one get one free. Limited time only.", "spam"),
    
    # Ham (non-spam) emails
    ("Hi, can we schedule a meeting for tomorrow afternoon?", "ham"),
    ("Thank you for your email. I'll get back to you soon.", "ham"),
    ("The project deadline has been extended to next Friday.", "ham"),
    ("Please find attached the report you requested.", "ham"),
    ("Let's discuss the quarterly results in our next team meeting.", "ham"),
    ("I'll be out of office next week. Please contact my assistant.", "ham"),
    ("The conference call is scheduled for 3 PM today.", "ham"),
    ("Could you please review the document and provide feedback?", "ham"),
    ("I've completed the analysis. Here are the key findings.", "ham"),
    ("Thanks for your help with the project. Much appreciated!", "ham"),
    ("The meeting has been rescheduled to next Monday at 10 AM.", "ham"),
    ("Please confirm your attendance for the workshop next week.", "ham"),
    ("I've updated the spreadsheet with the latest data.", "ham"),
    ("Let me know if you need any additional information.", "ham"),
    ("Great work on the presentation! The client was impressed.", "ham"),
]

# Create DataFrame
df = pd.DataFrame(sample_emails, columns=['email', 'label'])

print(f"Dataset shape: {df.shape}")
print("\nFirst few rows:")
print(df.head(10))
print("\nLabel distribution:")
print(df['label'].value_counts())
print("\nLabel distribution percentage:")
print(df['label'].value_counts(normalize=True) * 100)


In [None]:
# Visualize label distribution
fig, axes = plt.subplots(1, 2, figsize=(12, 4))

# Count plot
sns.countplot(data=df, x='label', ax=axes[0])
axes[0].set_title('Email Label Distribution', fontsize=14, fontweight='bold')
axes[0].set_xlabel('Label', fontsize=12)
axes[0].set_ylabel('Count', fontsize=12)

# Pie chart
df['label'].value_counts().plot(kind='pie', autopct='%1.1f%%', ax=axes[1])
axes[1].set_title('Label Distribution (Percentage)', fontsize=14, fontweight='bold')
axes[1].set_ylabel('')

plt.tight_layout()
plt.show()

# Display sample emails
print("\n" + "="*80)
print("Sample Spam Emails:")
print("="*80)
for i, email in enumerate(df.loc[df['label'] == 'spam', 'email'].head(3), start=1):
    print(f"\n{i}. {email}")

print("\n" + "="*80)
print("Sample Ham Emails:")
print("="*80)
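# Number the samples 1-3 instead of using the DataFrame index (which starts at 15 for ham)
for i, email in enumerate(df.loc[df['label'] == 'ham', 'email'].head(3), start=1):
    print(f"\n{i}. {email}")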


## 2. Data Preprocessing

We need to:
1. Convert text to numerical features using TF-IDF vectorization
2. Encode labels (spam/ham) to numerical values
3. Split data into training and testing sets
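
To make step 1 concrete, here is a minimal hand-rolled sketch of TF-IDF weighting on a made-up, pre-tokenised corpus. It assumes the smoothed formula that `TfidfVectorizer` uses by default, `idf(t) = ln((1 + n) / (1 + df(t))) + 1`, followed by L2 normalisation of each document vector. The `tfidf` helper and the toy documents are illustrative only.

```python
import math
from collections import Counter

# Toy corpus (made up for illustration), already lowercased and tokenised.
docs = [
    ["free", "prize", "claim", "free"],
    ["meeting", "tomorrow"],
    ["free", "meeting"],
]


def tfidf(docs):
    """Return one {term: weight} dict per document (smoothed IDF, L2-normalised)."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    vectors = []
    for doc in docs:
        counts = Counter(doc)  # raw term frequency
        raw = {t: counts[t] * idf[t] for t in counts}
        norm = math.sqrt(sum(w * w for w in raw.values()))
        vectors.append({t: w / norm for t, w in raw.items()})
    return vectors


weights = tfidf(docs)
for doc, vec in zip(docs, weights):
    print(doc, {t: round(w, 3) for t, w in vec.items()})
```

Terms that occur in fewer documents receive a larger IDF, so in the second document the rarer "tomorrow" outweighs "meeting" even though both appear once.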


In [None]:
# Separate features (emails) and labels
X = df['email']
y = df['label']

# Encode labels: spam = 1, ham = 0
y_encoded = (y == 'spam').astype(int)

print(f"Features shape: {X.shape}")
print(f"Labels shape: {y_encoded.shape}")
print("\nEncoded labels distribution:")
print(f"Ham (0): {(y_encoded == 0).sum()}")
print(f"Spam (1): {(y_encoded == 1).sum()}")


In [None]:
# TF-IDF Vectorization
# Converts text into numerical features based on term frequency-inverse document frequency
vectorizer = TfidfVectorizer(
    max_features=1000,  # Maximum number of features
    stop_words='english',  # Remove common English stopwords
    lowercase=True,  # Convert to lowercase
    ngram_range=(1, 2)  # Use unigrams and bigrams
)

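# Transform emails to feature vectors
# Note: fitting on every email before the train/test split lets test-set
# vocabulary and IDF statistics leak into training. For a rigorous evaluation,
# split the raw text first and fit the vectorizer on the training emails only.
X_vectorized = vectorizer.fit_transform(X)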

print(f"Original text shape: {X.shape}")
print(f"Vectorized shape: {X_vectorized.shape}")
print(f"Number of features: {X_vectorized.shape[1]}")
print("\nSample feature names (first 20):")
feature_names = vectorizer.get_feature_names_out()
print(feature_names[:20])


In [None]:
# Split data into training and testing sets
# 80% for training, 20% for testing
X_train, X_test, y_train, y_test = train_test_split(
    X_vectorized, 
    y_encoded, 
    test_size=0.2, 
    random_state=42,
    stratify=y_encoded  # Maintain class distribution
)

print(f"Training set size: {X_train.shape[0]} samples")
print(f"Test set size: {X_test.shape[0]} samples")
print("\nTraining set label distribution:")
print(f"  Ham (0): {(y_train == 0).sum()}")
print(f"  Spam (1): {(y_train == 1).sum()}")
print("\nTest set label distribution:")
print(f"  Ham (0): {(y_test == 0).sum()}")
print(f"  Spam (1): {(y_test == 1).sum()}")


## 3. Model Implementation and Training

We'll train multiple classification models and compare their performance:
- **Naive Bayes**: Good for text classification, fast and efficient
- **Logistic Regression**: Simple and interpretable
- **Random Forest**: Ensemble of decision trees, less prone to overfitting than a single tree
- **Support Vector Machine (SVM)**: Effective for high-dimensional data
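
Any of these classifiers can also be wrapped together with the vectorizer in a scikit-learn `Pipeline`, so that the TF-IDF vocabulary and IDF weights are learned from the training emails only. The sketch below assumes a tiny made-up corpus and uses Naive Bayes as the final step; the names `texts`, `labels`, and `spam_pipeline` are illustrative and not part of the notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Tiny made-up corpus, for illustration only.
texts = [
    "Win a free prize now", "Free money click here", "Claim your free iPhone",
    "Urgent verify your account", "Limited offer buy now", "You won the lottery",
    "Meeting moved to Monday", "Please review the report", "Thanks for your help",
    "Lunch tomorrow at noon", "See attached spreadsheet", "Call scheduled for 3 PM",
]
labels = [1] * 6 + [0] * 6  # 1 = spam, 0 = ham

# Split the raw text first, so the test emails stay unseen by the vectorizer.
texts_train, texts_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

spam_pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english", ngram_range=(1, 2))),
    ("clf", MultinomialNB()),
])
# Vocabulary and IDF weights are computed from the training text only.
spam_pipeline.fit(texts_train, y_train)

predictions = spam_pipeline.predict(texts_test)
print(predictions)
```

A pipeline also accepts raw strings at prediction time, which removes the separate `vectorizer.transform` call when scoring new emails.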


In [None]:
# Initialize models
models = {
    'Naive Bayes': MultinomialNB(),
    'Logistic Regression': LogisticRegression(random_state=42, max_iter=1000),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
    'SVM': SVC(kernel='linear', random_state=42, probability=True)
}

# Train all models
trained_models = {}
print("Training models...\n")

for name, model in models.items():
    print(f"Training {name}...")
    model.fit(X_train, y_train)
    trained_models[name] = model
    print(f"✓ {name} trained successfully!\n")

print("All models trained successfully!")


## 4. Model Evaluation

We'll evaluate each model using multiple metrics:
- **Accuracy**: Overall correctness
- **Precision**: How many predicted spams were actually spam
- **Recall**: How many actual spams were correctly identified
- **F1-Score**: Harmonic mean of precision and recall
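
As a small worked example of these definitions, the following sketch computes all four metrics from paired labels in plain Python. The `binary_metrics` helper and the six example labels are made up for illustration; scikit-learn's `accuracy_score`, `precision_score`, `recall_score`, and `f1_score` compute the same quantities.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall and F1 for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # spam caught
    fp = sum(t != positive and p == positive for t, p in pairs)  # ham flagged as spam
    fn = sum(t == positive and p != positive for t, p in pairs)  # spam missed
    tn = len(pairs) - tp - fp - fn
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical test set: 3 spam (1) and 3 ham (0); one spam email is missed.
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0]
scores = binary_metrics(y_true, y_pred)
print(scores)
```

Here precision is perfect because nothing was wrongly flagged, while recall drops to two thirds because one spam email slipped through; F1 (0.8) sits between the two.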


In [None]:
# Evaluate all models
results = []

for name, model in trained_models.items():
    # Make predictions
    y_pred = model.predict(X_test)
    
    # Calculate metrics; zero_division=0 reports 0 instead of an undefined
    # score when a model predicts no spam at all on the small test set
    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred, zero_division=0)
    recall = recall_score(y_test, y_pred, zero_division=0)
    f1 = f1_score(y_test, y_pred, zero_division=0)
    
    results.append({
        'Model': name,
        'Accuracy': accuracy,
        'Precision': precision,
        'Recall': recall,
        'F1-Score': f1
    })
    
    print(f"{name}:")
    print(f"  Accuracy:  {accuracy:.4f}")
    print(f"  Precision: {precision:.4f}")
    print(f"  Recall:    {recall:.4f}")
    print(f"  F1-Score:  {f1:.4f}")
    print()

# Create results DataFrame
results_df = pd.DataFrame(results)
print("="*80)
print("Summary of All Models:")
print("="*80)
print(results_df.to_string(index=False))


In [None]:
# Visualize model comparison
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

metrics = ['Accuracy', 'Precision', 'Recall', 'F1-Score']
colors = ['#3498db', '#e74c3c', '#2ecc71', '#f39c12']

for idx, metric in enumerate(metrics):
    ax = axes[idx // 2, idx % 2]
    bars = ax.bar(results_df['Model'], results_df[metric], color=colors[idx], alpha=0.8)
    ax.set_title(f'{metric} Comparison', fontsize=14, fontweight='bold')
    ax.set_ylabel(metric, fontsize=12)
    ax.set_ylim([0, 1.1])
    ax.grid(axis='y', alpha=0.3)
    
    # Add value labels on bars
    for bar in bars:
        height = bar.get_height()
        ax.text(bar.get_x() + bar.get_width()/2., height,
                f'{height:.3f}',
                ha='center', va='bottom', fontsize=10)
    
    plt.setp(ax.xaxis.get_majorticklabels(), rotation=45, ha='right')

plt.tight_layout()
plt.show()


In [None]:
# Confusion matrices for all models
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
axes = axes.ravel()

for idx, (name, model) in enumerate(trained_models.items()):
    y_pred = model.predict(X_test)
    cm = confusion_matrix(y_test, y_pred)
    
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', ax=axes[idx],
                xticklabels=['Ham', 'Spam'], yticklabels=['Ham', 'Spam'])
    axes[idx].set_title(f'{name} - Confusion Matrix', fontsize=12, fontweight='bold')
    axes[idx].set_xlabel('Predicted', fontsize=11)
    axes[idx].set_ylabel('Actual', fontsize=11)

plt.tight_layout()
plt.show()


## 5. Detailed Classification Report

Let's examine the best performing model in detail.
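
With a test set this small, several models can tie on F1, and a single split says little about which one is actually better. One possible complement is stratified k-fold cross-validation, sketched below on a made-up corpus. The corpus, the two candidate models, and the choice of 3 folds are assumptions for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny made-up corpus, for illustration only.
texts = [
    "Win a free prize now", "Free money click here", "Claim your free iPhone",
    "Urgent verify your account", "Limited offer buy now", "You won the lottery",
    "Meeting moved to Monday", "Please review the report", "Thanks for your help",
    "Lunch tomorrow at noon", "See attached spreadsheet", "Call scheduled for 3 PM",
]
labels = [1] * 6 + [0] * 6  # 1 = spam, 0 = ham

candidates = {
    "Naive Bayes": MultinomialNB(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
}

# Every email is used for testing exactly once across the 3 folds.
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
cv_scores = {}
for name, clf in candidates.items():
    # The vectorizer sits inside the pipeline, so it is refit on each training fold.
    pipe = make_pipeline(TfidfVectorizer(stop_words="english"), clf)
    cv_scores[name] = cross_val_score(pipe, texts, labels, cv=cv, scoring="f1")
    print(f"{name}: mean F1 = {cv_scores[name].mean():.3f} "
          f"(std {cv_scores[name].std():.3f})")
```

Comparing the mean and spread across folds gives a steadier basis for picking a model than one six-email test split.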


In [None]:
# Find best model based on F1-Score (on ties, idxmax keeps the first model listed)
best_model_name = results_df.loc[results_df['F1-Score'].idxmax(), 'Model']
best_model = trained_models[best_model_name]

print(f"Best Model: {best_model_name}")
print("="*80)

# Detailed classification report
y_pred_best = best_model.predict(X_test)
print("\nClassification Report:")
print(classification_report(y_test, y_pred_best, target_names=['Ham', 'Spam']))


## 6. Testing with New Emails

Let's test the best model with some new email examples.


In [None]:
# New test emails
new_emails = [
    "Congratulations! You've won a free vacation. Click here to claim!",
    "Hi John, can we meet tomorrow to discuss the project?",
    "URGENT: Your account has been compromised. Verify immediately!",
    "Thanks for the update. I'll review the document and get back to you.",
    "Get rich quick! Earn $5000 per week from home. No experience needed!",
    "The meeting is scheduled for 2 PM in the conference room.",
]

print(f"Testing {best_model_name} with new emails:\n")
print("="*80)

for email in new_emails:
    # Vectorize the email
    email_vectorized = vectorizer.transform([email])
    
    # Make prediction
    prediction = best_model.predict(email_vectorized)[0]
    prediction_proba = best_model.predict_proba(email_vectorized)[0]
    
    label = "SPAM" if prediction == 1 else "HAM"
    confidence = prediction_proba[1] if prediction == 1 else prediction_proba[0]
    
    print(f"\nEmail: {email}")
    print(f"Prediction: {label}")
    print(f"Confidence: {confidence:.2%}")
    print(f"Probabilities - Ham: {prediction_proba[0]:.2%}, Spam: {prediction_proba[1]:.2%}")
    print("-" * 80)


## 7. Feature Importance Analysis

Let's examine which words/features are most important for spam detection.


In [None]:
# Get feature importance (for models that support it)
if hasattr(best_model, 'feature_importances_'):
    # Random Forest
    feature_importance = best_model.feature_importances_
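elif hasattr(best_model, 'coef_'):
    # Logistic Regression or linear SVM. SVC returns a sparse coef_ when it
    # was trained on sparse input, so convert to a dense 1-D array first.
    coef = best_model.coef_
    coef = coef.toarray() if hasattr(coef, 'toarray') else np.asarray(coef)
    feature_importance = np.abs(coef).ravel()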
else:
    # Naive Bayes - use log probabilities
    feature_importance = np.abs(best_model.feature_log_prob_[1] - best_model.feature_log_prob_[0])

# Get top features
top_n = 20
top_indices = np.argsort(feature_importance)[-top_n:][::-1]
top_features = [(feature_names[i], feature_importance[i]) for i in top_indices]

print(f"Top {top_n} Most Important Features for Spam Detection:")
print("="*80)
for feature, importance in top_features:
    print(f"{feature:30s} : {importance:.4f}")

# Visualize top features
fig, ax = plt.subplots(figsize=(10, 8))
features, importances = zip(*top_features)
y_pos = np.arange(len(features))

ax.barh(y_pos, importances, color='steelblue', alpha=0.8)
ax.set_yticks(y_pos)
ax.set_yticklabels(features)
ax.invert_yaxis()
ax.set_xlabel('Importance', fontsize=12)
ax.set_title(f'Top {top_n} Features for Spam Detection ({best_model_name})', 
             fontsize=14, fontweight='bold')
ax.grid(axis='x', alpha=0.3)

plt.tight_layout()
plt.show()


## 8. Summary and Conclusions

### Key Findings:
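1. **Model Performance**: The test set holds only 6 emails (20% of 30), so a single misclassification shifts accuracy by about 17 percentage points; treat the reported scores as illustrative rather than as reliable estimates
2. **Best Model**: Selected by the highest F1-Score on the test set
3. **Key Features**: Words like "free", "won", "urgent", and "click" occur only in the spam examples of the sample dataset, which makes them strong spam indicators here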

### Model Comparison:
- **Naive Bayes**: Fast and efficient, good baseline
- **Logistic Regression**: Simple and interpretable
- **Random Forest**: Robust, handles non-linear relationships
- **SVM**: Effective for high-dimensional sparse data

### Recommendations:
- For production, consider using ensemble methods
- Regularly retrain the model with new data
- Monitor false positives and false negatives
- Consider using deep learning for larger datasets
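
To illustrate the monitoring recommendation, the sketch below derives the false positive rate (ham wrongly flagged as spam) and the false negative rate (spam that got through) from a batch of labelled predictions. The `error_rates` helper and the example batch are hypothetical.

```python
from sklearn.metrics import confusion_matrix


def error_rates(y_true, y_pred):
    """Return false positive and false negative rates for labels 0 (ham) and 1 (spam)."""
    # With labels=[0, 1], ravel() yields the counts in the order tn, fp, fn, tp.
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    fnr = fn / (fn + tp) if (fn + tp) else 0.0
    return {"false_positive_rate": float(fpr), "false_negative_rate": float(fnr)}


# Hypothetical batch of manually reviewed emails.
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 0, 0]
rates = error_rates(y_true, y_pred)
print(rates)
```

Tracking both rates over time matters because their costs differ: a false positive hides a legitimate email, while a false negative lets spam reach the inbox.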
