In [28]:
import os
import pickle
import pandas as pd
import numpy as np
import string
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from sklearn.model_selection import train_test_split
from datasketch import MinHash, MinHashLSH
import warnings
warnings.filterwarnings('ignore')

# Download required NLTK data
print("Setting up NLTK data...")
nltk.download('stopwords', quiet=True)
nltk.download('punkt', quiet=True)
nltk.download('punkt_tab', quiet=True)

# Initialize text processing tools
punctuations = list(string.punctuation)
stopwords_set = set(stopwords.words('english'))
stemmer = PorterStemmer()

print("✅ LSH spam fighting setup completed!")

Setting up NLTK data...
✅ LSH spam fighting setup completed!


## LSH-Based Spam Fighting with Kaggle Ling-Spam Dataset

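**About LSH (Locality Sensitive Hashing):**
- LSH hashes similar items into the same buckets, so near-duplicate documents can be found without comparing every pair
- We'll use MinHash LSH, which approximates the Jaccard similarity between token sets, to detect emails similar to known spam
- The approach: build an LSH index of known spam emails, then check whether each new email matches anything in the index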

**Dataset Setup:**
1. Download the Ling-Spam dataset from Kaggle
2. Save the CSV file as `datasets/lingspam_dataset.csv`
3. The CSV should have 'email_text' and 'label' columns, with labels 'ham' and 'spam'
4. If no dataset is found, a 16-email sample dataset is created in memory for demonstration

**LSH Strategy:**
- Extract features (stemmed words) from spam emails during training
- Create MinHash signatures for each spam email
- Build an LSH index to quickly find similar emails
- For new emails: if LSH finds similar spam → classify as spam, else → ham
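To make the MinHash step concrete, here is a minimal standard-library sketch of how a signature can estimate Jaccard similarity. It is not how `datasketch` is implemented internally: the salted `blake2b` hashes are an assumption standing in for random permutations, and the two token sets are hypothetical hand-written stems.

```python
import hashlib

def minhash_signature(tokens, num_perm=128):
    """Return num_perm minimum hash values for a non-empty collection of string tokens."""
    signature = []
    for seed in range(num_perm):
        # A different salt per seed stands in for a different random permutation.
        salt = seed.to_bytes(8, "little")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(token.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for token in tokens
        ))
    return signature

def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature positions that agree; an estimate of Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

def exact_jaccard(set_a, set_b):
    """Exact Jaccard similarity: |A ∩ B| / |A ∪ B|."""
    return len(set_a & set_b) / len(set_a | set_b)

# Hypothetical stemmed token sets that differ in a single word.
known_spam = {"win", "free", "iphon", "click", "limit", "time", "offer", "act", "fast"}
new_email = {"win", "free", "laptop", "click", "limit", "time", "offer", "act", "fast"}

print(f"Exact Jaccard:    {exact_jaccard(known_spam, new_email):.3f}")
print(f"MinHash estimate: {estimate_jaccard(minhash_signature(known_spam), minhash_signature(new_email)):.3f}")
```

The estimate gets tighter as `num_perm` grows, which is the trade-off behind the `NUM_PERM` setting used in this notebook.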

In [29]:
# Configuration for Ling-Spam Dataset
DATASET_PATH = 'datasets/lingspam_dataset.csv'
TRAINING_SET_RATIO = 0.7

# LSH parameters
LSH_THRESHOLD = 0.5  # Jaccard similarity threshold
NUM_PERM = 128      # Number of MinHash permutations

print(f"📁 Dataset path: {DATASET_PATH}")
print(f"🔄 Training ratio: {TRAINING_SET_RATIO}")
print(f"🎯 LSH threshold: {LSH_THRESHOLD}")
print(f"🔢 MinHash permutations: {NUM_PERM}")

📁 Dataset path: datasets/lingspam_dataset.csv
🔄 Training ratio: 0.7
🎯 LSH threshold: 0.5
🔢 MinHash permutations: 128
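The threshold is not a hard cut-off. With banded LSH, two documents become candidates when at least one band of their signatures matches exactly, which gives an S-shaped probability curve. The sketch below evaluates that curve for a hypothetical split of the 128 permutations into 32 bands of 4 rows; `datasketch` chooses its own split from `threshold` and `num_perm`, so these numbers are illustrative assumptions only.

```python
def candidate_probability(similarity, bands, rows):
    """Probability that two documents with the given Jaccard similarity share at least one band."""
    return 1.0 - (1.0 - similarity ** rows) ** bands

def approximate_threshold(bands, rows):
    """Similarity at which the S-curve rises most steeply, roughly (1/b)^(1/r)."""
    return (1.0 / bands) ** (1.0 / rows)

# Hypothetical split: 32 bands x 4 rows = 128 permutations.
BANDS, ROWS = 32, 4

print(f"Approximate threshold: {approximate_threshold(BANDS, ROWS):.2f}")
for s in (0.2, 0.4, 0.5, 0.6, 0.8):
    print(f"Jaccard {s:.1f} -> candidate probability {candidate_probability(s, BANDS, ROWS):.3f}")
```

More rows per band push the curve to the right (stricter matching); more bands push it to the left (more candidates).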


In [30]:
def preprocess_text(text):
    """
    Process email text into stemmed tokens for LSH analysis
    
    Args:
        text (str): Raw email text
        
    Returns:
        list: List of stemmed tokens
    """
    if not text or pd.isna(text):
        return []
    
    # Convert to lowercase and tokenize
    tokens = nltk.word_tokenize(str(text).lower())
    
    # Remove punctuation and filter tokens
    tokens = [token.strip("".join(punctuations)) for token in tokens 
              if token not in punctuations and len(token) > 1]
    
    # Remove stopwords and stem tokens
    if len(tokens) > 2:
        return [stemmer.stem(word) for word in tokens 
                if word not in stopwords_set and word.isalpha()]
    return []

# Test preprocessing
test_text = "This is a test email for LSH processing!"
test_tokens = preprocess_text(test_text)
print(f"🧪 Sample preprocessing: '{test_text}' → {test_tokens}")

🧪 Sample preprocessing: 'This is a test email for LSH processing!' → ['test', 'email', 'lsh', 'process']


In [31]:
def load_lingspam_dataset():
    """
    Load the Ling-Spam dataset from CSV format
    
    Returns:
        pandas.DataFrame: Dataset with email_text and label columns
    """
    try:
        if os.path.exists(DATASET_PATH):
            df = pd.read_csv(DATASET_PATH)
            print(f"✅ Loaded dataset from {DATASET_PATH}")
            print(f"Dataset shape: {df.shape}")
            return df
        else:
            print(f"⚠️  Dataset not found at {DATASET_PATH}")
            print("Creating sample dataset for LSH demonstration...")
            
            # Create sample dataset for LSH demonstration
            sample_data = {
                'email_text': [
                    "Dear colleague, I hope this email finds you well. We are organizing a linguistics conference next month.",
                    "URGENT!!! You have won $1,000,000!!! Click here now to claim your prize!!! Limited time offer!!!",
                    "The latest research on phonetics shows interesting patterns in vowel recognition systems.",
                    "FREE VIAGRA!!! Buy now with 90% discount!!! No prescription needed!!! Order today!!!",
                    "Thank you for your submission to the journal. We will review it and get back to you soon.",
                    "MAKE MONEY FAST!!! Work from home!!! Earn $5000 per week!!! No experience required!!!",
                    "The syntax paper you requested is attached. Please let me know if you need any clarifications.",
                    "CREDIT CARD DEBT FORGIVENESS!!! Eliminate your debt today!!! Government program!!!",
                    "Could you please review the manuscript on morphological analysis? Your expertise would be valuable.",
                    "WIN A FREE IPHONE!!! Click now!!! Limited time offer!!! Act fast!!!",
                    "The linguistics department is hosting a seminar on computational linguistics next Friday.",
                    "HOT SINGLES IN YOUR AREA!!! Meet them tonight!!! No strings attached!!!",
                    "I found your paper on semantic analysis very insightful. Would you be interested in collaboration?",
                    "LOSE 30 POUNDS IN 30 DAYS!!! Revolutionary diet pill!!! Doctor approved!!!",
                    "The conference proceedings are now available online. Thank you for your participation.",
                    "WORK FROM HOME!!! Earn $3000/week!!! No experience needed!!! Start today!!!"
                ],
                'label': ['ham', 'spam', 'ham', 'spam', 'ham', 'spam', 'ham', 'spam', 
                         'ham', 'spam', 'ham', 'spam', 'ham', 'spam', 'ham', 'spam']
            }
            
            df = pd.DataFrame(sample_data)
            print(f"Created sample dataset with {len(df)} emails")
            return df
            
    except Exception as e:
        print(f"❌ Error loading dataset: {e}")
        return None

# Load the dataset
dataset = load_lingspam_dataset()
if dataset is not None:
    print(f"\n📊 Dataset summary:")
    print(f"Total emails: {len(dataset)}")
    print(f"Label distribution:")
    print(dataset['label'].value_counts())

⚠️  Dataset not found at datasets/lingspam_dataset.csv
Creating sample dataset for LSH demonstration...
Created sample dataset with 16 emails

📊 Dataset summary:
Total emails: 16
Label distribution:
label
ham     8
spam    8
Name: count, dtype: int64


In [32]:
# Prepare dataset for LSH processing
# Convert labels to binary format (1 for ham, 0 for spam); spam is still the positive class in the evaluation
dataset['label_binary'] = dataset['label'].map({'ham': 1, 'spam': 0})

# Split the dataset into training and testing sets
X = dataset['email_text']
y = dataset['label_binary']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, 
    test_size=(1 - TRAINING_SET_RATIO), 
    random_state=42, 
    stratify=y
)

print(f"📝 Dataset split for LSH:")
print(f"Training set: {len(X_train)} emails")
print(f"Testing set: {len(X_test)} emails")
print(f"Training spam/ham ratio: {sum(y_train == 0)}/{sum(y_train == 1)}")
print(f"Testing spam/ham ratio: {sum(y_test == 0)}/{sum(y_test == 1)}")

# Create lists for easier processing
train_data = list(zip(X_train, y_train))
test_data = list(zip(X_test, y_test))

📝 Dataset split for LSH:
Training set: 11 emails
Testing set: 5 emails
Training spam/ham ratio: 5/6
Testing spam/ham ratio: 3/2


In [33]:
# Extract only spam emails from training set for LSH index
spam_emails = [(email_text, label) for email_text, label in train_data if label == 0]
ham_emails = [(email_text, label) for email_text, label in train_data if label == 1]

print(f"🎯 Training data for LSH:")
print(f"Spam emails for LSH index: {len(spam_emails)}")
print(f"Ham emails (reference): {len(ham_emails)}")
print(f"Total training emails: {len(train_data)}")

# Show sample spam email for LSH
if spam_emails:
    print(f"\n📧 Sample spam email (first 100 chars):")
    print(f"'{spam_emails[0][0][:100]}...'")
else:
    print("⚠️  No spam emails found in training set!")

🎯 Training data for LSH:
Spam emails for LSH index: 5
Ham emails (reference): 6
Total training emails: 11

📧 Sample spam email (first 100 chars):
'WORK FROM HOME!!! Earn $3000/week!!! No experience needed!!! Start today!!!...'


In [34]:
# Initialize MinHashLSH matcher
print(f"🏗️  Initializing LSH with threshold={LSH_THRESHOLD}, num_perm={NUM_PERM}")
lsh = MinHashLSH(threshold=LSH_THRESHOLD, num_perm=NUM_PERM)

print(f"✅ LSH matcher initialized successfully!")
print(f"📊 LSH Configuration:")
print(f"  • Jaccard similarity threshold: {LSH_THRESHOLD}")
print(f"  • MinHash permutations: {NUM_PERM}")
print(f"  • Ready to index spam emails...")

🏗️  Initializing LSH with threshold=0.5, num_perm=128
✅ LSH matcher initialized successfully!
📊 LSH Configuration:
  • Jaccard similarity threshold: 0.5
  • MinHash permutations: 128
  • Ready to index spam emails...


In [35]:
# Build LSH index with spam emails from training set
print("🔍 Building LSH index with spam emails...")

indexed_count = 0
for idx, (email_text, label) in enumerate(spam_emails):
    # Create MinHash for this spam email
    minhash = MinHash(num_perm=NUM_PERM)
    
    # Process email text to get stemmed tokens
    stems = preprocess_text(email_text)
    
    # Skip emails with too few tokens
    if len(stems) < 2:
        print(f"⚠️  Skipping email {idx}: too few tokens ({len(stems)})")
        continue
    
    # Add tokens to MinHash
    for stem in stems:
        minhash.update(stem.encode('utf-8'))
    
    # Insert into LSH index with unique identifier
    email_id = f"spam_email_{idx}"
    lsh.insert(email_id, minhash)
    indexed_count += 1
    
    if idx % 5 == 0:  # Progress indicator
        print(f"  Indexed {indexed_count} spam emails...")

print(f"✅ LSH index built successfully!")
print(f"📊 Index summary:")
print(f"  • Total spam emails processed: {len(spam_emails)}")
print(f"  • Successfully indexed: {indexed_count}")
print(f"  • Skipped (too few tokens): {len(spam_emails) - indexed_count}")
print(f"  • LSH index ready for similarity queries!")

🔍 Building LSH index with spam emails...
  Indexed 1 spam emails...
✅ LSH index built successfully!
📊 Index summary:
  • Total spam emails processed: 5
  • Successfully indexed: 5
  • Skipped (too few tokens): 0
  • LSH index ready for similarity queries!


In [36]:
def lsh_predict_label(email_text):
    """
    Predict email label using LSH similarity matching
    
    Args:
        email_text (str): Email text content
        
    Returns:
        int: 0 if predicted spam, 1 if predicted ham, -1 if too few tokens to classify
    """
    # Preprocess email text
    stems = preprocess_text(email_text)
    
    # Check if we have enough tokens
    if len(stems) < 2:
        return -1  # Error: insufficient tokens
    
    # Create MinHash for the email
    minhash = MinHash(num_perm=NUM_PERM)
    for stem in stems:
        minhash.update(stem.encode('utf-8'))
    
    # Query LSH index for candidate spam emails (approximate: likely, not guaranteed, to exceed the threshold)
    matches = lsh.query(minhash)
    
    # If we find matches with known spam emails, classify as spam
    if matches:
        return 0  # Spam
    else:
        return 1  # Ham

# Test the prediction function
if len(test_data) > 0:
    test_email, test_label = test_data[0]
    predicted = lsh_predict_label(test_email)
    actual = "spam" if test_label == 0 else "ham"
    predicted_str = "spam" if predicted == 0 else "ham" if predicted == 1 else "error"
    
    print(f"🧪 LSH prediction test:")
    print(f"  Email (first 80 chars): '{test_email[:80]}...'")
    print(f"  Actual label: {actual}")
    print(f"  LSH prediction: {predicted_str}")
    print(f"  Match: {'✅' if (test_label == predicted) else '❌'}")
else:
    print("⚠️  No test data available for prediction test")

🧪 LSH prediction test:
  Email (first 80 chars): 'CREDIT CARD DEBT FORGIVENESS!!! Eliminate your debt today!!! Government program!...'
  Actual label: spam
  LSH prediction: ham
  Match: ❌
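One way to see why this spam email is missed is to compute exact Jaccard similarity between sample spam messages. The sketch below uses a simplified regex tokenizer (an assumption, without the stopword removal and stemming that `preprocess_text` applies), so the values differ somewhat from what the notebook's pipeline would produce, but the pattern is the same: even spam messages on the same theme sit below the 0.5 threshold.

```python
import re

def simple_tokens(text):
    """Lowercase alphabetic words only; a simplified stand-in for preprocess_text."""
    return set(re.findall(r"[a-z]+", text.lower()))

def jaccard(set_a, set_b):
    """Exact Jaccard similarity, defined as 0.0 when both sets are empty."""
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0

# Three spam messages taken from the sample dataset.
money = "MAKE MONEY FAST!!! Work from home!!! Earn $5000 per week!!! No experience required!!!"
work = "WORK FROM HOME!!! Earn $3000/week!!! No experience needed!!! Start today!!!"
debt = "CREDIT CARD DEBT FORGIVENESS!!! Eliminate your debt today!!! Government program!!!"

similar_pair = jaccard(simple_tokens(money), simple_tokens(work))
unrelated_pair = jaccard(simple_tokens(debt), simple_tokens(work))

print(f"money vs work: {similar_pair:.3f}")    # 0.467
print(f"debt vs work:  {unrelated_pair:.3f}")  # 0.056
```

The debt-forgiveness email shares only one word with the work-from-home email, so no reasonable threshold would link them; the index would need a message built from the same template.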


In [37]:
# Test LSH classifier on the test set
print("🧪 Testing LSH classifier on test set...")

# Initialize confusion matrix variables
fp = 0  # False Positive: Ham classified as Spam
tp = 0  # True Positive: Spam classified as Spam  
fn = 0  # False Negative: Spam classified as Ham
tn = 0  # True Negative: Ham classified as Ham

skipped = 0  # Count of emails skipped because preprocessing left fewer than 2 tokens

# Classify each email in the test set
for idx, (email_text, true_label) in enumerate(test_data):
    # Get LSH prediction
    predicted_label = lsh_predict_label(email_text)
    
    # Skip emails that could not be classified (too few tokens)
    if predicted_label == -1:
        skipped += 1
        continue
    
    # Update confusion matrix
    if predicted_label == 0:  # Predicted spam
        if true_label == 1:  # Actually ham
            fp += 1
        else:  # Actually spam
            tp += 1
    else:  # Predicted ham
        if true_label == 1:  # Actually ham
            tn += 1
        else:  # Actually spam
            fn += 1

# Calculate totals
total_predictions = tp + tn + fp + fn
total_processed = len(test_data)

print(f"📊 LSH Classification Results:")
print(f"True Positives (Spam → Spam): {tp}")
print(f"True Negatives (Ham → Ham): {tn}")
print(f"False Positives (Ham → Spam): {fp}")
print(f"False Negatives (Spam → Ham): {fn}")
print(f"Parsing errors (skipped): {skipped}")
print(f"Total processed: {total_processed}")
print(f"Total predictions: {total_predictions}")

if total_predictions > 0:
    accuracy = (tp + tn) / total_predictions
    print(f"🎯 Preliminary accuracy: {accuracy:.1%}")
else:
    print("❌ No valid predictions made")

🧪 Testing LSH classifier on test set...
📊 LSH Classification Results:
True Positives (Spam → Spam): 0
True Negatives (Ham → Ham): 2
False Positives (Ham → Spam): 0
False Negatives (Spam → Ham): 3
Parsing errors (skipped): 0
Total processed: 5
Total predictions: 5
🎯 Preliminary accuracy: 40.0%


In [38]:
# Display confusion matrix
from IPython.display import HTML, display

if total_predictions > 0:
    print("📈 Confusion Matrix (Raw Counts):")
    print("Predicted →")
    print("Actual ↓     Ham    Spam")
    
    # Create HTML table for better visualization
    html_table = "<table border='1' style='border-collapse: collapse;'>"
    html_table += "<tr><th></th><th>Predicted Ham</th><th>Predicted Spam</th></tr>"
    html_table += f"<tr><td><b>Actual Ham</b></td><td>{tn}</td><td>{fp}</td></tr>"
    html_table += f"<tr><td><b>Actual Spam</b></td><td>{fn}</td><td>{tp}</td></tr>"
    html_table += "</table>"
    
    display(HTML(html_table))
    
    print(f"\n🎯 Interpretation:")
    print(f"• True Negatives (TN): {tn} - Ham emails correctly identified as Ham")
    print(f"• False Positives (FP): {fp} - Ham emails incorrectly identified as Spam")
    print(f"• False Negatives (FN): {fn} - Spam emails incorrectly identified as Ham")
    print(f"• True Positives (TP): {tp} - Spam emails correctly identified as Spam")
else:
    print("❌ Cannot display confusion matrix - no valid predictions")

📈 Confusion Matrix (Raw Counts):
Predicted →
Actual ↓     Ham    Spam


| | Predicted Ham | Predicted Spam |
|---|---|---|
| Actual Ham | 2 | 0 |
| Actual Spam | 3 | 0 |



🎯 Interpretation:
• True Negatives (TN): 2 - Ham emails correctly identified as Ham
• False Positives (FP): 0 - Ham emails incorrectly identified as Spam
• False Negatives (FN): 3 - Spam emails incorrectly identified as Ham
• True Positives (TP): 0 - Spam emails correctly identified as Spam


In [39]:
# Display performance metrics
if total_predictions > 0:
    print("📊 Performance Metrics:")
    
    # Calculate percentages
    tn_pct = f"{tn/total_predictions:.1%}"
    fp_pct = f"{fp/total_predictions:.1%}"
    fn_pct = f"{fn/total_predictions:.1%}"
    tp_pct = f"{tp/total_predictions:.1%}"
    
    # Create HTML table for percentages
    html_table_pct = "<table border='1' style='border-collapse: collapse;'>"
    html_table_pct += "<tr><th></th><th>Predicted Ham</th><th>Predicted Spam</th></tr>"
    html_table_pct += f"<tr><td><b>Actual Ham</b></td><td>{tn_pct}</td><td>{fp_pct}</td></tr>"
    html_table_pct += f"<tr><td><b>Actual Spam</b></td><td>{fn_pct}</td><td>{tp_pct}</td></tr>"
    html_table_pct += "</table>"
    
    display(HTML(html_table_pct))
    
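    # Calculate key metrics (precision and recall fall back to 0 when their denominator is 0,
    # so 0.0% precision can also mean that no email was flagged as spam at all)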
    accuracy = (tp + tn) / total_predictions
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0
    f1_score = 2 * (precision * recall) / (precision + recall) if (precision + recall) > 0 else 0
    
    print(f"\n📈 LSH Classifier Performance:")
    print(f"• Accuracy: {accuracy:.1%} - Overall correct predictions")
    print(f"• Precision: {precision:.1%} - Of predicted spam, how much was actually spam")
    print(f"• Recall: {recall:.1%} - Of actual spam, how much was detected")
    print(f"• F1-Score: {f1_score:.3f} - Harmonic mean of precision and recall")
    
    # LSH-specific insights
    print(f"\n🔍 LSH-Specific Insights:")
    print(f"• Similarity threshold: {LSH_THRESHOLD}")
    print(f"• MinHash permutations: {NUM_PERM}")
    print(f"• Spam emails in index: {indexed_count}")
    print(f"• Processing errors: {skipped} emails")
    
else:
    print("❌ Cannot calculate performance metrics - no valid predictions")

📊 Performance Metrics:


| | Predicted Ham | Predicted Spam |
|---|---|---|
| Actual Ham | 40.0% | 0.0% |
| Actual Spam | 60.0% | 0.0% |



📈 LSH Classifier Performance:
• Accuracy: 40.0% - Overall correct predictions
• Precision: 0.0% - Of predicted spam, how much was actually spam
• Recall: 0.0% - Of actual spam, how much was detected
• F1-Score: 0.000 - Harmonic mean of precision and recall

🔍 LSH-Specific Insights:
• Similarity threshold: 0.5
• MinHash permutations: 128
• Spam emails in index: 5
• Processing errors: 0 emails
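The 0.0% precision above comes from the zero-denominator fallback, not from wrong spam predictions: nothing was flagged as spam at all. A small sketch of a helper that reports such metrics as undefined (`None`) instead of 0 is shown below; the function name and return shape are hypothetical and not part of this notebook.

```python
def classification_metrics(tp, tn, fp, fn):
    """Return accuracy, precision, recall and F1, using None where a metric is undefined."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else None
    # Precision is undefined when no email was predicted as spam.
    precision = tp / (tp + fp) if (tp + fp) else None
    # Recall is undefined when the test set contains no spam.
    recall = tp / (tp + fn) if (tp + fn) else None
    if precision is None or recall is None or (precision + recall) == 0:
        f1 = None
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Counts from the run above: 2 ham kept, 3 spam missed, nothing flagged as spam.
metrics = classification_metrics(tp=0, tn=2, fp=0, fn=3)
print(metrics)
```

Reporting `None` makes it obvious that the classifier never predicted spam, which points to recall as the problem to fix first.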


In [40]:
# Final summary and LSH analysis
print("🎉 LSH-Based Spam Fighting - Ling-Spam Dataset Analysis Complete!")
print("="*75)

if total_predictions > 0:
    accuracy = (tp + tn) / total_predictions
    print(f"🎯 Final LSH Classification Accuracy: {accuracy:.1%}")
    
    print(f"\n💡 Key Results:")
    print(f"• Dataset size: {len(dataset)} emails")
    print(f"• Training set: {len(train_data)} emails")
    print(f"• Test set: {len(test_data)} emails")
    print(f"• LSH index size: {indexed_count} spam signatures")
    print(f"• Similarity threshold: {LSH_THRESHOLD}")
    print(f"• Method: MinHash LSH with Jaccard similarity")
    
    # Performance analysis
    if accuracy > 0.8:
        print(f"✅ Excellent performance! LSH effectively detects similar spam patterns.")
    elif accuracy > 0.6:
        print(f"⚠️  Moderate performance. Consider tuning LSH parameters or preprocessing.")
    else:
        print(f"❌ Poor performance. LSH may not be suitable for this dataset or needs adjustment.")
    
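    # LSH-specific recommendations
    # Note: precision is 0 by convention when nothing was flagged as spam (tp + fp == 0),
    # in which case the low-precision hint is not meaningful and recall is the real problem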
    print(f"\n🔧 LSH Tuning Recommendations:")
    if precision < 0.7:
        print(f"• Low precision → Increase similarity threshold (current: {LSH_THRESHOLD})")
    if recall < 0.7:
        print(f"• Low recall → Decrease similarity threshold or add more spam examples")
    if accuracy < 0.7:
        print(f"• Try different num_perm values (current: {NUM_PERM})")
        print(f"• Improve text preprocessing (stemming, n-grams)")
        print(f"• Consider ensemble with other methods")
    
    print(f"\n🚀 Next Steps:")
    print(f"• Experiment with different LSH thresholds (0.3, 0.7)")
    print(f"• Try different MinHash permutation counts (64, 256)")
    print(f"• Combine LSH with other features (email headers, length)")
    print(f"• Use LSH for initial filtering, then apply ML classifiers")
    
else:
    print("❌ LSH classification failed - check dataset and preprocessing")

print("="*75)

🎉 LSH-Based Spam Fighting - Ling-Spam Dataset Analysis Complete!
🎯 Final LSH Classification Accuracy: 40.0%

💡 Key Results:
• Dataset size: 16 emails
• Training set: 11 emails
• Test set: 5 emails
• LSH index size: 5 spam signatures
• Similarity threshold: 0.5
• Method: MinHash LSH with Jaccard similarity
❌ Poor performance. LSH may not be suitable for this dataset or needs adjustment.

🔧 LSH Tuning Recommendations:
• Low precision → Increase similarity threshold (current: 0.5)
• Low recall → Decrease similarity threshold or add more spam examples
• Try different num_perm values (current: 128)
• Improve text preprocessing (stemming, n-grams)
• Consider ensemble with other methods

🚀 Next Steps:
• Experiment with different LSH thresholds (0.3, 0.7)
• Try different MinHash permutation counts (64, 256)
• Combine LSH with other features (email headers, length)
• Use LSH for initial filtering, then apply ML classifiers
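Before sweeping thresholds with LSH itself, it can help to run the same sweep against exact Jaccard similarity as a brute-force baseline. The sketch below does this on hypothetical hand-written stem sets (not output from `preprocess_text`), flagging an email when any indexed spam meets the threshold.

```python
def jaccard(set_a, set_b):
    """Exact Jaccard similarity, defined as 0.0 when both sets are empty."""
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0

def flag_as_spam(tokens, spam_index, threshold):
    """Brute-force rule: spam if any indexed spam message is at least `threshold` similar."""
    return any(jaccard(tokens, known) >= threshold for known in spam_index)

# Hypothetical stemmed token sets for two indexed spam emails.
spam_index = [
    {"work", "home", "earn", "week", "experi", "need", "start", "today"},
    {"win", "free", "iphon", "click", "limit", "time", "offer", "act", "fast"},
]
# Hypothetical incoming emails.
incoming = {
    "spam-like": {"make", "money", "fast", "work", "home", "earn", "week", "experi", "requir"},
    "ham-like": {"confer", "proceed", "avail", "onlin", "thank", "particip"},
}

results = {}
for threshold in (0.3, 0.5, 0.7):
    results[threshold] = [
        name for name, tokens in incoming.items()
        if flag_as_spam(tokens, spam_index, threshold)
    ]
    print(f"threshold={threshold}: flagged {results[threshold]}")
```

If the exact baseline misses a message at a given threshold, LSH at that threshold is unlikely to catch it either, so the baseline separates threshold problems from index-coverage problems.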


## Optional: Create Enhanced Sample Dataset for LSH Testing

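MinHash LSH only flags an email when the index already holds a near-duplicate, so it needs many spam messages that share templates. The cell below defines a generator for a larger, template-based synthetic dataset that can be used to test LSH effectiveness.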

In [41]:
# Create enhanced sample dataset for better LSH demonstration
def create_enhanced_lingspam_dataset(filename='datasets/lingspam_dataset.csv', num_emails=50):
    """
    Create a larger sample dataset with patterns that work well with LSH
    """
    # Create the output directory for `filename` if it doesn't exist
    os.makedirs(os.path.dirname(filename) or '.', exist_ok=True)
    
    # Common spam patterns (to create similarity for LSH)
    spam_templates = [
        "URGENT!!! You have won ${} dollars!!! Click here now to claim your prize!!! Limited time offer!!!",
        "FREE {}!!! Buy now with {}% discount!!! No {} needed!!! Order today!!!",
        "MAKE MONEY FAST!!! Work from home!!! Earn ${}/week!!! No experience required!!!",
        "CREDIT CARD DEBT FORGIVENESS!!! Eliminate your {} today!!! {} program!!!",
        "HOT {} IN YOUR AREA!!! Meet them tonight!!! No strings attached!!!",
        "LOSE {} POUNDS IN {} DAYS!!! Revolutionary {}!!! Doctor approved!!!",
        "WIN A FREE {}!!! Click now!!! Limited time offer!!! Act fast!!!",
        "WORK FROM HOME!!! Earn ${}/week!!! No {} needed!!! Start today!!!",
        "FREE MONEY!!! {} grants available!!! Claim yours now!!! No repayment!!!",
        "MIRACLE CURE!!! {} without {} or {}!!! 100% guaranteed!!!"
    ]
    
    # Spam variations
    amounts = ["1000000", "500000", "250000", "100000"]
    products = ["VIAGRA", "CIALIS", "PILLS", "MEDICINE"]
    percentages = ["90", "80", "70", "95"]
    requirements = ["prescription", "experience", "payment", "commitment"]
    periods = ["30", "14", "7", "60"]
    items = ["IPHONE", "LAPTOP", "CAR", "VACATION"]
    people = ["SINGLES", "FRIENDS", "PARTNERS", "DATES"]
    benefits = ["Government", "Federal", "State", "Private"]
    activities = ["diet", "exercise", "work", "effort"]
    
    # Ham email templates (academic/professional)
    ham_templates = [
        "Dear colleague, I hope this email finds you well. We are organizing a {} conference next month on {}.",
        "The latest research on {} shows interesting patterns in {} systems and methodologies.",
        "Thank you for your submission to the {}. We will review it and get back to you within {} days.",
        "The {} paper you requested is attached. Please let me know if you need any clarifications on {}.",
        "Could you please review the manuscript on {}? Your expertise in {} would be valuable.",
        "The {} department is hosting a seminar on {} next Friday at {} in room {}.",
        "I found your paper on {} very insightful. Would you be interested in collaboration on {}?",
        "The conference proceedings for {} are now available online. Thank you for your participation in {}.",
        "Please find attached the corrected version of the {} algorithm for {} classification.",
        "The workshop on {} has been scheduled for next month. We would appreciate your input on {}."
    ]
    
    # Ham variations
    conferences = ["linguistics", "computer science", "AI", "NLP"]
    topics = ["phonetics", "syntax", "semantics", "morphology"]
    journals = ["journal", "conference", "symposium", "workshop"]
    days = ["5-7", "10-14", "2-3", "7-10"]
    papers = ["syntax", "phoneme", "semantic", "morphological"]
    subjects = ["machine learning", "natural language", "computational linguistics", "AI"]
    departments = ["linguistics", "computer science", "AI research", "NLP"]
    times = ["10 AM", "2 PM", "9 AM", "3 PM"]
    rooms = ["A101", "B205", "C301", "D150"]
    algorithms = ["classification", "clustering", "prediction", "analysis"]
    
    emails = []
    labels = []
    
    # Generate varied spam emails
    for i in range(num_emails // 2):
        template = spam_templates[i % len(spam_templates)]
        
        # Fill template with variations
        if "{}" in template:
            if "won" in template:
                spam_email = template.format(amounts[i % len(amounts)])
            elif "FREE" in template and "discount" in template:
                spam_email = template.format(
                    products[i % len(products)], 
                    percentages[i % len(percentages)], 
                    requirements[i % len(requirements)]
                )
            elif "MAKE MONEY" in template:  # "Earn" is mixed case, so "EARN" never matched this template
                spam_email = template.format(amounts[i % len(amounts)][:4])
            elif "DEBT" in template:
                spam_email = template.format("debt", benefits[i % len(benefits)])
            elif "HOT" in template:
                spam_email = template.format(people[i % len(people)])
            elif "LOSE" in template:
                spam_email = template.format(
                    periods[i % len(periods)], 
                    periods[i % len(periods)], 
                    "diet pill"
                )
            elif "WIN" in template:
                spam_email = template.format(items[i % len(items)])
            elif "WORK" in template:
                spam_email = template.format(amounts[i % len(amounts)][:4], requirements[i % len(requirements)])
            elif "grants" in template:
                spam_email = template.format(benefits[i % len(benefits)])
            elif "MIRACLE" in template:
                spam_email = template.format("lose weight", activities[i % len(activities)], activities[(i+1) % len(activities)])
            else:
                spam_email = template
        else:
            spam_email = template
            
        # Add some variation
        if i > 0:
            spam_email += f" Reference: SPAM{i:03d}. ID: {i+1000}."
            
        emails.append(spam_email)
        labels.append('spam')
        
        # Generate varied ham emails
        ham_template = ham_templates[i % len(ham_templates)]
        if "{}" in ham_template:
            if "conference" in ham_template:
                ham_email = ham_template.format(conferences[i % len(conferences)], topics[i % len(topics)])
            elif "research" in ham_template:
                ham_email = ham_template.format(topics[i % len(topics)], subjects[i % len(subjects)])
            elif "submission" in ham_template:
                ham_email = ham_template.format(journals[i % len(journals)], days[i % len(days)])
            elif "paper" in ham_template:
                ham_email = ham_template.format(papers[i % len(papers)], topics[i % len(topics)])
            elif "manuscript" in ham_template:
                ham_email = ham_template.format(subjects[i % len(subjects)], topics[i % len(topics)])
            elif "department" in ham_template:
                ham_email = ham_template.format(
                    departments[i % len(departments)], 
                    subjects[i % len(subjects)], 
                    times[i % len(times)], 
                    rooms[i % len(rooms)]
                )
            elif "insightful" in ham_template:
                ham_email = ham_template.format(topics[i % len(topics)], subjects[i % len(subjects)])
            elif "proceedings" in ham_template:
                ham_email = ham_template.format(conferences[i % len(conferences)], topics[i % len(topics)])
            elif "algorithm" in ham_template:
                ham_email = ham_template.format(algorithms[i % len(algorithms)], subjects[i % len(subjects)])
            elif "workshop" in ham_template:
                ham_email = ham_template.format(subjects[i % len(subjects)], topics[i % len(topics)])
            else:
                ham_email = ham_template
        else:
            ham_email = ham_template
            
        # Add variation
        if i > 0:
            ham_email += f" Email ref: HAM{i:03d}. Best regards, Academic Team."
            
        emails.append(ham_email)
        labels.append('ham')
    
    # Create DataFrame
    df = pd.DataFrame({
        'email_text': emails,
        'label': labels
    })
    
    # Shuffle the dataset
    df = df.sample(frac=1, random_state=42).reset_index(drop=True)
    
    # Save to CSV
    df.to_csv(filename, index=False)
    print(f"✅ Enhanced dataset created: {filename}")
    print(f"📊 Contains {len(df)} emails ({len(df[df['label']=='spam'])} spam, {len(df[df['label']=='ham'])} ham)")
    print(f"🔍 Dataset designed for LSH similarity detection")
    
    return df

# Uncomment the line below to create an enhanced dataset
# enhanced_df = create_enhanced_lingspam_dataset(num_emails=100)

print("💡 To create an enhanced dataset for better LSH testing, uncomment the line above and run this cell.")
print("📈 The enhanced dataset includes similar spam patterns that work well with LSH similarity detection.")

💡 To create an enhanced dataset for better LSH testing, uncomment the line above and run this cell.
📈 The enhanced dataset includes similar spam patterns that work well with LSH similarity detection.


## ✅ LSH Notebook Successfully Converted to Ling-Spam Dataset!

### **Key Changes Made:**

1. **Dataset Format Conversion**: 
   - 🔄 **From**: TREC 2007 corpus (email files + labels file)
   - 🔄 **To**: Kaggle Ling-Spam dataset (CSV with email_text + label columns)

2. **Modern Dependencies**: 
   - ✅ Updated imports (pandas, sklearn, modern NLTK)
   - ✅ Text preprocessing with stemming and stopword removal
   - ✅ Proper error handling and progress indicators

3. **LSH Implementation**: 
   - ✅ MinHash LSH with configurable parameters
   - ✅ Jaccard similarity-based spam detection
   - ✅ Complete evaluation pipeline with confusion matrix

### **LSH Method Explanation:**
- **Training**: Build an LSH index from MinHash signatures of known spam emails
- **Prediction**: Classify a new email as spam if its MinHash signature matches any indexed spam, otherwise as ham
- **Advantage**: A query only touches emails in matching hash buckets, so lookups stay fast as the index grows
- **Challenge**: Only catches spam that is a near-duplicate of an indexed message; unfamiliar spam passes as ham

### **Performance Notes:**
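- 📊 **Current accuracy**: 40% on the 5-email test split of the sample dataset (2 ham correct, 3 spam missed, recall 0%)
- 📉 **Low performance expected**: only 5 spam emails are indexed and they share few words, so no test email matched at threshold 0.5
- 📈 **Will improve with**: Larger dataset, similar spam patterns, parameter tuning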

### **To Improve LSH Performance:**
1. **Use the enhanced dataset generator**: uncomment and run the last cell, then re-run the notebook from the dataset-loading cell
2. **Tune LSH parameters**: Try threshold=0.3 for higher recall, at the cost of more false positives
3. **Add more training data**: LSH can only match spam that resembles an indexed example
4. **Improve preprocessing**: Use n-grams, better stemming
5. **Combine methods**: Use LSH as first filter, then apply ML classifier

The notebook demonstrates LSH concepts and provides a foundation for larger-scale spam detection systems!
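As a starting point for the n-gram suggestion, the following sketch turns a token list into word shingles. The helper name `word_shingles` is hypothetical; each shingle would be passed to `minhash.update(shingle.encode('utf-8'))` in place of the single stems used in this notebook.

```python
def word_shingles(tokens, n=2):
    """Return the set of consecutive n-token shingles, joined by spaces."""
    if len(tokens) < n:
        return set()
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

# Hypothetical stemmed tokens from a spam email.
stems = ["click", "limit", "time", "offer", "act", "fast"]
shingles = word_shingles(stems, n=2)
print(sorted(shingles))
```

Shingles preserve word order, so two emails must share phrases rather than just vocabulary to match. That tends to make matching stricter, so a lower threshold may be needed when switching from single words to shingles.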