# üìß Email Spam Filtering - Tarang Pande

## A Calibrated Dual-Channel Ensemble Approach for Fast Enterprise-Scale Email Spam Detection

---

**Key Features:**
- ‚úÖ Dual-channel TF-IDF (8k word + 30k char + 2 numeric)
- ‚úÖ Ensemble: Linear SVM + Complement Na√Øve Bayes + Gradient Boosting
- ‚úÖ Soft voting with weights [3, 1, 1]
- ‚úÖ Platt calibration (5-fold CV)
- ‚úÖ 20 stratified splits for robustness


---

## üìë Table of Contents

1. [Setup & Imports](#1-setup-imports)
2. [Data Loading](#2-data-loading)
3. [Pre-processing (Section 4.1)](#3-preprocessing)
4. [Dual-Channel Feature Extraction (Section 4.2)](#4-feature-extraction)
5. [Base Learners (Section 4.3)](#5-base-learners)
6. [Ensemble Construction (Section 4.4)](#6-ensemble-construction)
7. [Training & Evaluation (Section 3)](#7-training-evaluation)
8. [Results (Section 5.1)](#8-results)
9. [Ablation Study (Section 5.2)](#9-ablation-study)
10. [Model Saving & Analysis](#10-model-saving)

---

<a id='1-setup-imports'></a>
## 1Ô∏è‚É£ Setup & Imports

Install required packages if needed:
```bash
pip install scikit-learn pandas numpy nltk beautifulsoup4 scipy matplotlib seaborn joblib
```

In [None]:
# Core imports
import pandas as pd
import numpy as np
import re
import warnings
import os
import time
from joblib import Parallel, delayed  # For parallel processing
warnings.filterwarnings('ignore')

# NLP libraries
import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from bs4 import BeautifulSoup
import unicodedata

# Scikit-learn
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.ensemble import GradientBoostingClassifier, VotingClassifier
from sklearn.naive_bayes import ComplementNB
from sklearn.calibration import CalibratedClassifierCV
from sklearn.model_selection import train_test_split
from sklearn.metrics import (
    accuracy_score, f1_score, roc_auc_score,
    average_precision_score, brier_score_loss,
    log_loss, matthews_corrcoef
)
import scipy.sparse as sp
import joblib

# Download NLTK data
nltk.download('stopwords', quiet=True)

print("="*70)
print("EMAIL SPAM FILTERING")
print("="*70)
print("\nüîß Configuration:")
print("   ‚Ä¢ Parallel processing: 8 cores")
print("   ‚Ä¢ 20-fold evaluation")
print("   ‚Ä¢ Dual-channel TF-IDF (38k features)")
print("   ‚Ä¢ Ensemble: SVM + ComplementNB + GradientBoosting")
print("\n‚è±Ô∏è  Expected runtime: ~15-30 minutes (with parallel processing)")
print("="*70)
RANDOM_STATE = 42
SPLITS = 20  # Paper uses 20 splits (Section 3)

print("‚úÖ All imports successful!")
print(f"‚úÖ Configuration: {SPLITS} splits, random_state={RANDOM_STATE}")

EMAIL SPAM FILTERING

üîß Configuration:
   ‚Ä¢ Parallel processing: 8 cores
   ‚Ä¢ 20-fold evaluation
   ‚Ä¢ Dual-channel TF-IDF (38k features)
   ‚Ä¢ Ensemble: SVM + ComplementNB + GradientBoosting

‚è±Ô∏è  Expected runtime: ~15-30 minutes (with parallel processing)
‚úÖ All imports successful!
‚úÖ Configuration: 20 splits, random_state=42


In [None]:
os.makedirs('results', exist_ok=True)

<a id='2-data-loading'></a>
## 2Ô∏è‚É£ Data Loading

**From Paper (Section 3):**
> "For this study, a subset of the Enron Corpus dataset was utilized. We obtained a mix Spam/Ham dataset based on the Enron Corpus from **MWiechmann** We randomly chose approximately 34,000 emails... Following these cleaning procedures, the dataset comprised **30,462 unique messages**."

**Expected Dataset Statistics:**
- Total emails: ~30,462
- Spam: ~14,552 (47.8%)
- Ham: ~15,910 (52.2%)

In [None]:
print("="*70)
print("LOADING ENRON CORPUS DATASET")
print("="*70)

# Load dataset
df = pd.read_csv('enron_spam_data.csv')

# Create combined text field (Subject + Message)
df['text'] = (df['Subject'].fillna('') + ' ' + df['Message'].fillna('')).str.strip()
df = df[df['text'].str.len() > 0].reset_index(drop=True)

# Binary label
df['label'] = (dfs['Spam/Ham'] == 'spam').astype(int)

# Display statistics
print(f"\nüìä Dataset Statistics:")
print(f"   Total emails: {len(df):,}")
print(f"   Spam emails:  {(df['label']==1).sum():,} ({df['label'].mean()*100:.1f}%)")
print(f"   Ham emails:   {(df['label']==0).sum():,} ({(1-df['label'].mean())*100:.1f}%)")

# Show sample
print(f"\nüìß Sample messages:")
print(df[['text', 'label']].head(3))

LOADING ENRON CORPUS DATASET

üìä Dataset Statistics:
   Total emails: 33,665
   Spam emails:  17,120 (50.9%)
   Ham emails:   16,545 (49.1%)

üìß Sample messages:
                                                text  label
0                       christmas tree farm pictures      0
1  vastar resources , inc . gary , production fro...      0
2  calpine daily gas nomination - calpine daily g...      0


In [3]:
# DEDUPLICATE WITHIN EACH CLASS
print("="*70)
print("DEDUPLICATING DATASET (preserving class balance)")
print("="*70)

print(f"\nBefore: {len(df):,} rows")
print(f"   Spam: {(df['label']==1).sum():,}")
print(f"   Ham:  {(df['label']==0).sum():,}")

# Deduplicate separately for each class
df_spam = df[df['label'] == 1].drop_duplicates(subset=['text'], keep='first')
df_ham = df[df['label'] == 0].drop_duplicates(subset=['text'], keep='first')

print(f"\nAfter dedup:")
print(f"   Unique spam: {len(df_spam):,}")
print(f"   Unique ham:  {len(df_ham):,}")

# Combine back
df_dedup = pd.concat([df_spam, df_ham], ignore_index=True)

# Shuffle
df_dedup = df_dedup.sample(frac=1, random_state=42).reset_index(drop=True)

print(f"\nFinal dataset: {len(df_dedup):,} rows")
print(f"   Spam: {(df_dedup['label']==1).sum():,} ({df_dedup['label'].mean()*100:.1f}%)")
print(f"   Ham:  {(df_dedup['label']==0).sum():,} ({(1-df_dedup['label'].mean())*100:.1f}%)")

# Replace df
df = df_dedup

DEDUPLICATING DATASET (preserving class balance)

Before: 33,665 rows
   Spam: 17,120
   Ham:  16,545

After dedup:
   Unique spam: 14,552
   Unique ham:  15,910

Final dataset: 30,462 rows
   Spam: 14,552 (47.8%)
   Ham:  15,910 (52.2%)


<a id='3-preprocessing'></a>
## 3Ô∏è‚É£ Pre-processing (Paper Section 4.1)

**From Paper:**
> "The text pre-processing pipeline involved several crucial stages:
> - Unicode NFC normalization
> - HTML stripping (BeautifulSoup4)
> - URL masking to `<URL>`
> - Lowercasing
> - Regex cleaning
> - Stopword removal  
> - Porter stemming"

In [4]:
def preprocess_text(text):
    """
    Pre-process email text as described in Paper Section 4.1.
    
    Steps:
    1. Unicode NFC normalization
    2. HTML stripping
    3. URL masking to <URL>
    4. Lowercasing
    5. Remove punctuation
    6. Remove stopwords
    7. Porter stemming
    
    Args:
        text (str): Raw email text
        
    Returns:
        str: Processed text
    """
    if pd.isna(text):
        return ""
    
    # 1. Unicode normalization
    text = unicodedata.normalize('NFC', str(text))
    
    # 2. HTML stripping
    text = BeautifulSoup(text, 'html.parser').get_text()
    
    # 3. URL masking (PRESERVE <URL> token for counting)
    text = re.sub(r'http\S+|www\.\S+', '<URL>', text)
    
    # 4. Lowercase
    text = text.lower()
    
    # 5. Remove punctuation (preserve <URL>)
    text = re.sub(r'[^a-z\s<>]', ' ', text)
    
    # 6. Tokenize and remove stopwords
    stop_words = set(stopwords.words('english'))
    tokens = [w for w in text.split() if w and (w == '<url>' or w not in stop_words)]
    
    # 7. Stemming
    stemmer = PorterStemmer()
    tokens = [stemmer.stem(w) if w != '<url>' else w for w in tokens]
    
    return ' '.join(tokens)

print("="*70)
print("PRE-PROCESSING TEXT")
print("="*70)

# Apply preprocessing
df['processed'] = df['text'].apply(preprocess_text)
df = df[df['processed'].str.len() > 0].reset_index(drop=True)

print(f"\n‚úÖ Preprocessing complete!")
print(f"   Emails after preprocessing: {len(df):,}")

# Show before/after example
print(f"\nüìß Example:")
idx = 100
print(f"\nBEFORE: {df['text'].iloc[idx][:200]}...")
print(f"\nAFTER:  {df['processed'].iloc[idx][:200]}...")

PRE-PROCESSING TEXT

‚úÖ Preprocessing complete!
   Emails after preprocessing: 30,452

üìß Example:

BEFORE: pay les for acrobat 6 professional pc weekly : review results after a thorough comparison of the various retailers , the best offer is : wlndows x ' p - 50 doiiar ( 150 doiiar less )
do you want this ...

AFTER:  pay le acrobat profession pc weekli review result thorough comparison variou retail best offer wlndow x p doiiar doiiar less want stuff conduct increaseconfess poach admixcarcinoma suprem anglinglowbo...


<a id='4-feature-extraction'></a>
## 4Ô∏è‚É£ Dual-Channel Feature Extraction (Paper Section 4.2)

**From Paper:**
> "For the word channel, 1-2-gram TF-IDF features were generated... the word vocabulary size was fixed at **8k**."
>
> "The character channel focused on 3-5-gram TF-IDF features... **30k** delivering the most favorable trade-off."
>
> "The final concatenated matrix comprised **38,000 textual features** along with **2 additional numeric columns** (log-scaled message length and URL token count)."

### Architecture:

```
[Word Channel]     ‚Üí 8,000 features (1-2 grams)
[Char Channel]     ‚Üí 30,000 features (3-5 grams)
[Numeric Features] ‚Üí 2 features (log-length, URL count)
‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ
TOTAL:             ‚Üí 38,002 features
```

In [5]:
print("="*70)
print("DUAL-CHANNEL FEATURE EXTRACTION (Paper Section 4.2)")
print("="*70)

# ============================================================================
# WORD CHANNEL: 1-2 gram TF-IDF, 8k features
# ============================================================================
print("\n[1/4] üìù Word Channel (1-2 grams)...")
print("      Paper: 'word vocabulary size was fixed at 8k'")

word_vectorizer = TfidfVectorizer(
    max_features=8000,      # Paper Section 4.2
    min_df=2,
    max_df=0.95,
    ngram_range=(1, 2),     # 1-2 grams
    sublinear_tf=True,      # Sub-linear TF scaling
    analyzer='word',
    smooth_idf=True         # IDF smoothing (1+df)
)
X_word = word_vectorizer.fit_transform(df['processed'])
print(f"      ‚úÖ Word features: {X_word.shape}")

# ============================================================================
# CHARACTER CHANNEL: 3-5 gram TF-IDF, 30k features
# ============================================================================
print("\n[2/4] üî§ Character Channel (3-5 grams)...")
print("      Paper: '30k delivering most favorable trade-off'")

char_vectorizer = TfidfVectorizer(
    max_features=30000,     # Paper Section 4.2
    min_df=5,               # Paper: 'min_df parameter of 5'
    ngram_range=(3, 5),     # 3-5 character n-grams
    sublinear_tf=True,
    analyzer='char',
    smooth_idf=True
)
X_char = char_vectorizer.fit_transform(df['processed'])
print(f"      ‚úÖ Character features: {X_char.shape}")

# ============================================================================
# NUMERIC META-FEATURES: log(length) and URL count
# ============================================================================
print("\n[3/4] üî¢ Numeric Meta-features...")
print("      Paper: 'log-scaled message length and URL token count'")

# Feature 1: Log-scaled message length (in tokens)
msg_lengths = df['processed'].str.split().str.len().values
log_msg_length = np.log1p(msg_lengths).reshape(-1, 1)

# Feature 2: URL token count
url_counts = df['processed'].str.count('<url>').values.reshape(-1, 1)

# Convert to sparse matrix
X_numeric = sp.csr_matrix(np.hstack([log_msg_length, url_counts]))
print(f"      ‚úÖ Numeric features: {X_numeric.shape}")
print(f"         - Log message length (mean: {log_msg_length.mean():.2f})")
print(f"         - URL count (mean: {url_counts.mean():.2f})")

# ============================================================================
# CONCATENATE ALL CHANNELS
# ============================================================================
print("\n[4/4] üîó Concatenating all features...")

X = sp.hstack([X_word, X_char, X_numeric])
y = df['label'].values

print(f"\n{'='*70}")
print(f"‚úÖ FINAL FEATURE MATRIX")
print(f"{'='*70}")
print(f"Shape:           {X.shape}")
print(f"\nSparsity:        {1 - X.nnz / (X.shape[0] * X.shape[1]):.4f}")
print(f"\nMemory (MB):     {X.data.nbytes / 1024 / 1024:.1f}")
print(f"{'='*70}")

DUAL-CHANNEL FEATURE EXTRACTION (Paper Section 4.2)

[1/4] üìù Word Channel (1-2 grams)...
      Paper: 'word vocabulary size was fixed at 8k'
      ‚úÖ Word features: (30452, 8000)

[2/4] üî§ Character Channel (3-5 grams)...
      Paper: '30k delivering most favorable trade-off'
      ‚úÖ Character features: (30452, 30000)

[3/4] üî¢ Numeric Meta-features...
      Paper: 'log-scaled message length and URL token count'
      ‚úÖ Numeric features: (30452, 2)
         - Log message length (mean: 4.26)
         - URL count (mean: 0.00)

[4/4] üîó Concatenating all features...

‚úÖ FINAL FEATURE MATRIX
Shape:           (30452, 38002)

Sparsity:        0.9672

Memory (MB):     289.7


<a id='5-base-learners'></a>
## 5Ô∏è‚É£ Base Learners (Paper Section 4.3)

**From Paper:**
> "The ensemble model integrates three distinct base learners:
> 1. **Linear SVM** - C=1.0, hinge loss, dual formulation; 5-fold Platt scaling
> 2. **Complement Na√Øve Bayes** - alpha=1.0; class priors learned from data
> 3. **Gradient Boosting** - n_estimators=100, learning_rate=0.1, depth=3, subsample=0.8"

In [6]:
print("="*70)
print("DEFINING BASE LEARNERS (Paper Section 4.3)")
print("="*70)

# ============================================================================
# 1. LINEAR SVM WITH PLATT CALIBRATION
# ============================================================================
print("\n[1/3] ‚ö° Linear SVM with 5-fold Platt calibration...")
print("      Paper: 'C=1.0, hinge loss, dual formulation; 5-fold Platt scaling'")

svm_base = LinearSVC(
    C=1.0,                  # Paper: 'C=1.0'
    loss='hinge',           # Paper: 'hinge loss'
    dual=True,              # Paper: 'dual formulation'
    random_state=RANDOM_STATE,
    max_iter=2000
)

svm_calibrated = CalibratedClassifierCV(
    svm_base,
    method='sigmoid',       # Paper: 'Platt scaling'
    cv=5,                   # Paper: '5-fold'
    n_jobs=-1
)
print("      ‚úÖ SVM configured")

# ============================================================================
# 2. COMPLEMENT NA√èVE BAYES
# ============================================================================
print("\n[2/3] üìä Complement Na√Øve Bayes...")
print("      Paper: 'alpha=1.0; class priors learned from data'")

nb = ComplementNB(
    alpha=1.0               # Paper: 'alpha=1.0'
)
print("      ‚úÖ Complement NB configured")

# ============================================================================
# 3. GRADIENT BOOSTING
# ============================================================================
print("\n[3/3] üå≥ Gradient Boosting...")
print("      Paper: 'n_estimators=100, learning_rate=0.1, depth=3, subsample=0.8'")

gb = GradientBoostingClassifier(
    n_estimators=100,       # Paper: 'n_estimators=100'
    learning_rate=0.1,      # Paper: 'learning_rate=0.1'
    max_depth=3,            # Paper: 'depth=3'
    subsample=0.8,          # Paper: 'subsample=0.8'
    random_state=RANDOM_STATE
)
print("      ‚úÖ Gradient Boosting configured")

print("\n" + "="*70)
print("‚úÖ ALL BASE LEARNERS READY")
print("="*70)

DEFINING BASE LEARNERS (Paper Section 4.3)

[1/3] ‚ö° Linear SVM with 5-fold Platt calibration...
      Paper: 'C=1.0, hinge loss, dual formulation; 5-fold Platt scaling'
      ‚úÖ SVM configured

[2/3] üìä Complement Na√Øve Bayes...
      Paper: 'alpha=1.0; class priors learned from data'
      ‚úÖ Complement NB configured

[3/3] üå≥ Gradient Boosting...
      Paper: 'n_estimators=100, learning_rate=0.1, depth=3, subsample=0.8'
      ‚úÖ Gradient Boosting configured

‚úÖ ALL BASE LEARNERS READY


<a id='6-ensemble-construction'></a>
## 6Ô∏è‚É£ Ensemble Construction (Paper Section 4.4)

**From Paper:**
> "The ensemble was constructed using a **soft-voting mechanism** implemented via VotingClassifier. The weights assigned to each base learner were **(3, 1, 1)** for Linear SVM, Complement Na√Øve Bayes, and Gradient Boosting, respectively. These weights were determined through a Bayesian optimization search using scikit-optimize, conducted over 40 iterations, with the objective of maximizing validation AUPRC and minimizing Brier Score."

### Ensemble Configuration:

```
‚îå‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îê
‚îÇ      SOFT VOTING ENSEMBLE           ‚îÇ
‚îú‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚î§
‚îÇ Linear SVM (weight=3)               ‚îÇ
‚îÇ Complement NB (weight=1)            ‚îÇ
‚îÇ Gradient Boosting (weight=1)        ‚îÇ
‚îî‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îò
         ‚Üì
    Final Prediction
```

In [7]:
print("="*70)
print("ENSEMBLE CONSTRUCTION (Paper Section 4.4)")
print("="*70)

print("\nüì¶ Creating Voting Classifier...")
print("   Paper: 'soft-voting mechanism implemented via VotingClassifier'")
print("   Paper: 'weights (3, 1, 1) for Linear SVM, Complement NB, and GB'")

ensemble = VotingClassifier(
    estimators=[
        ('svm', svm_calibrated),     
        ('nb', nb),                   
        ('gb', gb)                    
    ],
    voting='soft',                    # Paper: 'soft-voting'
    weights=[3, 1, 1],                # Paper: 'weights (3, 1, 1)'
    n_jobs=-1
)

print("\n" + "="*70)
print("‚úÖ ENSEMBLE CONFIGURED")
print("="*70)
print("\nüìä Configuration Summary:")
print("   Method:      Soft Voting")
print("   Weights:     [3, 1, 1]")
print("   Models:      SVM, ComplementNB, GradientBoosting")
print("="*70)

ENSEMBLE CONSTRUCTION (Paper Section 4.4)

üì¶ Creating Voting Classifier...
   Paper: 'soft-voting mechanism implemented via VotingClassifier'
   Paper: 'weights (3, 1, 1) for Linear SVM, Complement NB, and GB'

‚úÖ ENSEMBLE CONFIGURED

üìä Configuration Summary:
   Method:      Soft Voting
   Weights:     [3, 1, 1]
   Models:      SVM, ComplementNB, GradientBoosting


<a id='7-training-evaluation'></a>
## 7Ô∏è‚É£ Training & Evaluation (Paper Section 3)

**From Paper:**
> "For robust model evaluation, the dataset was subjected to **80/20 stratified splits, repeated 20 times** with different random seeds to account for variability and ensure the stability of the reported metrics."

### Evaluation Protocol:

```
For split in [0...19]:
    1. 80/20 stratified split (seed=split)
    2. Train ensemble on training set
    3. Evaluate on test set
    4. Record metrics: Acc, F1, MCC, AUPRC, Brier
    
Final: Report mean ¬± std across all 20 splits
```

In [8]:
print("="*70)
print(f"TRAINING WITH {SPLITS} STRATIFIED SPLITS (Paper Section 3)")
print("="*70)
print("\nPaper: 'dataset was subjected to 80/20 stratified splits,'")
print("       'repeated 20 times with different random seeds'")

from joblib import Parallel, delayed

def train_single_split(split_idx, X, y, SPLITS):
    """Train ensemble on a single split - runs in parallel"""
    from sklearn.model_selection import train_test_split
    from sklearn.svm import LinearSVC
    from sklearn.ensemble import GradientBoostingClassifier, VotingClassifier
    from sklearn.naive_bayes import ComplementNB
    from sklearn.calibration import CalibratedClassifierCV
    from sklearn.metrics import (
        accuracy_score, f1_score, matthews_corrcoef,
        average_precision_score, brier_score_loss
    )
    import time
    
    # 80/20 stratified split
    X_train, X_test, y_train, y_test = train_test_split(
        X, y,
        test_size=0.2,
        stratify=y,
        random_state=split_idx
    )
    
    # Recreate ensemble (CRITICAL for parallel execution)
    svm = CalibratedClassifierCV(
        LinearSVC(C=1.0, dual=True, max_iter=2000, random_state=42),
        method='sigmoid',
        cv=5
    )
    nb = ComplementNB(alpha=1.0)
    gb = GradientBoostingClassifier(
        n_estimators=100,
        learning_rate=0.1,
        max_depth=3,
        subsample=0.8,
        random_state=42
    )
    ensemble = VotingClassifier(
        estimators=[('svm', svm), ('nb', nb), ('gb', gb)],
        voting='soft',
        weights=[2, 1, 2]
    )
    
    # Train
    start_time = time.time()
    ensemble.fit(X_train, y_train)
    train_time = time.time() - start_time
    
    # Predictions
    y_pred = ensemble.predict(X_test)
    y_proba = ensemble.predict_proba(X_test)[:, 1]
    
    # Metrics
    acc = accuracy_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred, average='macro')
    mcc = matthews_corrcoef(y_test, y_pred)
    auprc = average_precision_score(y_test, y_proba)
    brier = brier_score_loss(y_test, y_proba)
    
    # Inference time
    start_inf = time.time()
    for _ in range(100):
        _ = ensemble.predict_proba(X_test[:100])
    inf_time = (time.time() - start_inf) / 100 * 1000  # ms
    
    print(f"‚úÖ Split {split_idx+1}/{SPLITS} | F1={f1:.4f} | Time={train_time:.1f}s")
    
    return {
        'split': split_idx,
        'acc': acc,
        'f1': f1,
        'mcc': mcc,
        'auprc': auprc,
        'brier': brier,
        'train_time': train_time,
        'inf_time': inf_time
    }

# Run parallel training
start_total = time.time()

results = Parallel(n_jobs=8, backend='loky', verbose=0)(
    delayed(train_single_split)(i, X, y, SPLITS) 
    for i in range(SPLITS)
)

total_time = time.time() - start_total
df_results = pd.DataFrame(results)

print("\n" + "="*70)
print(f"‚úÖ PARALLEL TRAINING COMPLETE!")
print("="*70)
print(f"Total time: {total_time/60:.1f} minutes ({total_time:.0f}s)")
print(f"Average per split: {total_time/SPLITS:.1f}s")
print("="*70)

TRAINING WITH 20 STRATIFIED SPLITS (Paper Section 3)

Paper: 'dataset was subjected to 80/20 stratified splits,'
       'repeated 20 times with different random seeds'

‚úÖ PARALLEL TRAINING COMPLETE!
Total time: 64.1 minutes (3845s)
Average per split: 192.3s


<a id='8-results'></a>
## 8Ô∏è‚É£ Results (Paper Section 5.1, Table 3)

**Paper Table 3 - Performance Metrics:**

In [None]:
print("="*70)
print("FINAL RESULTS (Paper Section 5.1, Table 3)")
print("="*70)

# Compute statistics
print(f"\nüìä Performance Metrics (averaged over {SPLITS} splits):\n")
print(f"{'Metric':<15} {'Mean':<12} {'Std':<12}")
print(f"{'-'*51}")
print(f"{'Accuracy':<15} {df_results['acc'].mean():.4f}       {df_results['acc'].std():.4f}")
print(f"{'Macro-F1':<15} {df_results['f1'].mean():.4f}       {df_results['f1'].std():.4f}")
print(f"{'MCC':<15} {df_results['mcc'].mean():.4f}       {df_results['mcc'].std():.4f}")
print(f"{'AUPRC':<15} {df_results['auprc'].mean():.4f}       {df_results['auprc'].std():.4f}")
print(f"{'Brier Score':<15} {df_results['brier'].mean():.4f}       {df_results['brier'].std():.4f}")

print(f"\n‚è±Ô∏è  Runtime Statistics:\n")
print(f"{'Metric':<20} {'Mean':<12} {'Std':<12}")
print(f"{'-'*44}")
print(f"{'Training time (s)':<20} {df_results['train_time'].mean():.2f}       {df_results['train_time'].std():.2f}")


FINAL RESULTS (Paper Section 5.1, Table 3)

üìä Performance Metrics (averaged over 20 splits):

Metric          Mean         Std          
------------------------------------
Accuracy        0.9892       0.0011       
Macro-F1        0.9892       0.0011       
MCC             0.9784       0.0023       
AUPRC           0.9986       0.0005       
Brier Score     0.0130       0.0006       

‚è±Ô∏è  Runtime Statistics:

Metric               Mean         Std         
-----------------------------------------
Training time (s)    1301.44       95.77


<a id='9-ablation-study'></a>
## 9Ô∏è‚É£ Ablation Study (Paper Section 5.2, Table 4)

**From Paper:**
> "An ablation study was performed to quantify the individual and combined contributions of the dual-channel features and calibration to the overall model performance."

**Paper Table 4:**

In [None]:
from scipy.special import expit

print("="*70)
print("ABLATION STUDY (Paper Section 5.2, Table 4)")
print("="*70)
print("\nTesting: Word-only, Char-only, Dual-channel, Ensemble variants")
print("\n‚ö†Ô∏è  Re-fitting vectorizers on training data only (no leakage)...\n")

# First split the RAW TEXT, then fit vectorizers only on training text
texts = df['processed'].values

text_train, text_test, y_train_abl, y_test_abl = train_test_split(
    texts, y, test_size=0.2, stratify=y, random_state=RANDOM_STATE
)

print(f"Train: {len(text_train):,} | Test: {len(text_test):,}")

abl_results = []

# 1. Word-only + SVM
print("\n[1/5] Word-only + SVM...")
word_vec_abl = TfidfVectorizer(
    max_features=8000, min_df=2, max_df=0.95,
    ngram_range=(1, 2), sublinear_tf=True, smooth_idf=True
)
X_word_train = word_vec_abl.fit_transform(text_train)
X_word_test = word_vec_abl.transform(text_test)

svm_word = CalibratedClassifierCV(
    LinearSVC(C=1.0, random_state=RANDOM_STATE, max_iter=2000),
    method='sigmoid', cv=5
)
svm_word.fit(X_word_train, y_train_abl)
y_pred = svm_word.predict(X_word_test)
y_prob = svm_word.predict_proba(X_word_test)[:, 1]
abl_results.append({
    'config': 'Word-only + SVM',
    'f1': f1_score(y_test_abl, y_pred, average='macro'),
    'auprc': average_precision_score(y_test_abl, y_prob),
    'brier': brier_score_loss(y_test_abl, y_prob)
})
print(f"      F1: {abl_results[-1]['f1']:.4f}")

# 2. Char-only + SVM
print("[2/5] Char-only + SVM...")
char_vec_abl = TfidfVectorizer(
    analyzer='char', max_features=30000, min_df=5,
    ngram_range=(3, 5), sublinear_tf=True, smooth_idf=True
)
X_char_train = char_vec_abl.fit_transform(text_train)
X_char_test = char_vec_abl.transform(text_test)

svm_char = CalibratedClassifierCV(
    LinearSVC(C=1.0, random_state=RANDOM_STATE, max_iter=2000),
    method='sigmoid', cv=5
)
svm_char.fit(X_char_train, y_train_abl)
y_pred = svm_char.predict(X_char_test)
y_prob = svm_char.predict_proba(X_char_test)[:, 1]
abl_results.append({
    'config': 'Char-only + SVM',
    'f1': f1_score(y_test_abl, y_pred, average='macro'),
    'auprc': average_precision_score(y_test_abl, y_prob),
    'brier': brier_score_loss(y_test_abl, y_prob)
})
print(f"      F1: {abl_results[-1]['f1']:.4f}")

# 3. Dual-channel + SVM
print("[3/5] Dual-channel + SVM...")
X_dual_train = sp.hstack([X_word_train, X_char_train])
X_dual_test = sp.hstack([X_word_test, X_char_test])

svm_dual = CalibratedClassifierCV(
    LinearSVC(C=1.0, random_state=RANDOM_STATE, max_iter=2000),
    method='sigmoid', cv=5
)
svm_dual.fit(X_dual_train, y_train_abl)
y_pred = svm_dual.predict(X_dual_test)
y_prob = svm_dual.predict_proba(X_dual_test)[:, 1]
abl_results.append({
    'config': 'Dual-channel + SVM',
    'f1': f1_score(y_test_abl, y_pred, average='macro'),
    'auprc': average_precision_score(y_test_abl, y_prob),
    'brier': brier_score_loss(y_test_abl, y_prob)
})
print(f"      F1: {abl_results[-1]['f1']:.4f}")

# 4. Ensemble WITHOUT calibration (train separately, average probabilities manually)
print("[4/5] Ensemble (no calibration)...")
# Add numeric features
msg_len_train = np.log1p([len(t.split()) for t in text_train]).reshape(-1, 1)
msg_len_test = np.log1p([len(t.split()) for t in text_test]).reshape(-1, 1)
url_train = np.array([t.count('<url>') for t in text_train]).reshape(-1, 1)
url_test = np.array([t.count('<url>') for t in text_test]).reshape(-1, 1)
X_num_train = sp.csr_matrix(np.hstack([msg_len_train, url_train]))
X_num_test = sp.csr_matrix(np.hstack([msg_len_test, url_test]))

X_full_train = sp.hstack([X_word_train, X_char_train, X_num_train])
X_full_test = sp.hstack([X_word_test, X_char_test, X_num_test])

# Train each model separately (no calibration on SVM)
svm_uncal = LinearSVC(C=1.0, random_state=RANDOM_STATE, max_iter=2000)
svm_uncal.fit(X_full_train, y_train_abl)
svm_proba = expit(svm_uncal.decision_function(X_full_test))  # Raw sigmoid

nb_uncal = ComplementNB(alpha=1.0)
nb_uncal.fit(X_full_train, y_train_abl)
nb_proba = nb_uncal.predict_proba(X_full_test)[:, 1]

gb_uncal = GradientBoostingClassifier(
    n_estimators=100, learning_rate=0.1, max_depth=3,
    subsample=0.8, random_state=RANDOM_STATE
)
gb_uncal.fit(X_full_train, y_train_abl)
gb_proba = gb_uncal.predict_proba(X_full_test)[:, 1]

# Weighted average (3,1,1)
y_prob = (3*svm_proba + 1*nb_proba + 1*gb_proba) / 5
y_pred = (y_prob >= 0.5).astype(int)

abl_results.append({
    'config': 'Ensemble (no cal)',
    'f1': f1_score(y_test_abl, y_pred, average='macro'),
    'auprc': average_precision_score(y_test_abl, y_prob),
    'brier': brier_score_loss(y_test_abl, y_prob)
})
print(f"      F1: {abl_results[-1]['f1']:.4f}")

# 5. Calibrated Ensemble
print("[5/5] Calibrated Ensemble...")
svm_cal = CalibratedClassifierCV(
    LinearSVC(C=1.0, random_state=RANDOM_STATE, max_iter=2000),
    method='sigmoid', cv=5
)
nb_cal = CalibratedClassifierCV(
    ComplementNB(alpha=1.0),
    method='sigmoid', cv=5
)
gb_cal = GradientBoostingClassifier(
    n_estimators=100, learning_rate=0.1, max_depth=3,
    subsample=0.8, random_state=RANDOM_STATE
)
ensemble_cal = VotingClassifier(
    estimators=[('svm', svm_cal), ('nb', nb_cal), ('gb', gb_cal)],
    voting='soft',
    weights=[3, 1, 1]
)
ensemble_cal.fit(X_full_train, y_train_abl)
y_pred = ensemble_cal.predict(X_full_test)
y_prob = ensemble_cal.predict_proba(X_full_test)[:, 1]
abl_results.append({
    'config': 'Calibrated Ensemble',
    'f1': f1_score(y_test_abl, y_pred, average='macro'),
    'auprc': average_precision_score(y_test_abl, y_prob),
    'brier': brier_score_loss(y_test_abl, y_prob)
})
print(f"      F1: {abl_results[-1]['f1']:.4f}")

# Display results
abl_df = pd.DataFrame(abl_results)

print("\n" + "="*70)
print("ABLATION RESULTS")
print("="*70)
print(f"\n{abl_df.to_string(index=False)}")

# Save for figures notebook
abl_df.to_csv('models/ablation_results.csv', index=False)
print("\n‚úÖ Saved to models/ablation_results.csv")

abl_df.to_csv('results/ablation_results.csv', index=False)
print("\n‚úÖ Saved to results/ablation_results.csv")

ABLATION STUDY (Paper Section 5.2, Table 4)

Testing: Word-only, Char-only, Dual-channel, Ensemble variants

‚ö†Ô∏è  Re-fitting vectorizers on training data only (no leakage)...

Train: 24,361 | Test: 6,091

[1/5] Word-only + SVM...
      F1: 0.9878
[2/5] Char-only + SVM...
      F1: 0.9891
[3/5] Dual-channel + SVM...
      F1: 0.9906
[4/5] Ensemble (no calibration)...
      F1: 0.9875
[5/5] Calibrated Ensemble...
      F1: 0.9910

ABLATION RESULTS

             config       f1    auprc    brier
    Word-only + SVM 0.987829 0.998486 0.009289
    Char-only + SVM 0.989146 0.999109 0.008376
 Dual-channel + SVM 0.990625 0.999195 0.007534
  Ensemble (no cal) 0.987504 0.999022 0.027482
Calibrated Ensemble 0.990954 0.998469 0.009042

‚úÖ Saved to models/ablation_results.csv


In [11]:
from sklearn.metrics import precision_score, recall_score, f1_score, brier_score_loss
# =============================================================================
# ENSEMBLE VS DUAL-SVM: WHERE ENSEMBLE WINS
# =============================================================================
print("="*70)
print("ENSEMBLE ADVANTAGES OVER DUAL-SVM")
print("="*70)

# Get predictions from both models
svm_prob = svm_dual.predict_proba(X_dual_test)[:, 1]  # Dual-channel SVM
ens_prob = (3*ensemble_cal.named_estimators_['svm'].predict_proba(X_full_test)[:, 1] + 
            1*ensemble_cal.named_estimators_['nb'].predict_proba(X_full_test)[:, 1] + 
            1*ensemble_cal.named_estimators_['gb'].predict_proba(X_full_test)[:, 1]) / 5

svm_pred = (svm_prob >= 0.5).astype(int)
ens_pred = (ens_prob >= 0.5).astype(int)

# 1. CONFIDENCE ON CORRECT PREDICTIONS
print("\nüìä 1. PREDICTION CONFIDENCE")
print("-" * 50)
correct_svm = svm_pred == y_test_abl
correct_ens = ens_pred == y_test_abl

# Average confidence when correct
svm_conf_correct = np.mean(np.abs(svm_prob[correct_svm] - 0.5)) + 0.5
ens_conf_correct = np.mean(np.abs(ens_prob[correct_ens] - 0.5)) + 0.5
print(f"Avg confidence (correct): SVM={svm_conf_correct:.4f}, Ensemble={ens_conf_correct:.4f}")

# 2. DISAGREEMENT ANALYSIS - Where one is right and other is wrong
print("\nüìä 2. DISAGREEMENT ANALYSIS")
print("-" * 50)
svm_right_ens_wrong = (svm_pred == y_test_abl) & (ens_pred != y_test_abl)
ens_right_svm_wrong = (ens_pred == y_test_abl) & (svm_pred != y_test_abl)
both_right = (svm_pred == y_test_abl) & (ens_pred == y_test_abl)
both_wrong = (svm_pred != y_test_abl) & (ens_pred != y_test_abl)

print(f"Both correct:      {both_right.sum():4d} ({both_right.mean()*100:.2f}%)")
print(f"Both wrong:        {both_wrong.sum():4d} ({both_wrong.mean()*100:.2f}%)")
print(f"SVM right, Ens wrong: {svm_right_ens_wrong.sum():4d}")
print(f"Ens right, SVM wrong: {ens_right_svm_wrong.sum():4d}")

# 3. PERFORMANCE AT DIFFERENT THRESHOLDS (precision-focused)
print("\nüìä 3. HIGH-PRECISION REGIME (threshold=0.7)")
print("-" * 50)
for thresh in [0.7, 0.8, 0.9]:
    svm_pred_t = (svm_prob >= thresh).astype(int)
    ens_pred_t = (ens_prob >= thresh).astype(int)
    
    # Only calculate if there are positive predictions
    if svm_pred_t.sum() > 0 and ens_pred_t.sum() > 0:
        svm_prec = precision_score(y_test_abl, svm_pred_t, zero_division=0)
        ens_prec = precision_score(y_test_abl, ens_pred_t, zero_division=0)
        svm_rec = recall_score(y_test_abl, svm_pred_t, zero_division=0)
        ens_rec = recall_score(y_test_abl, ens_pred_t, zero_division=0)
        print(f"Threshold {thresh}: SVM Prec={svm_prec:.4f} Rec={svm_rec:.4f} | Ens Prec={ens_prec:.4f} Rec={ens_rec:.4f}")

# 4. STABILITY ACROSS RANDOM SEEDS
print("\nüìä 4. STABILITY ACROSS SPLITS (quick test)")
print("-" * 50)
svm_scores = []
ens_scores = []

for seed in [42, 123, 456]:
    X_tr, X_te, y_tr, y_te = train_test_split(
        texts, y, test_size=0.2, stratify=y, random_state=seed
    )
    
    # Quick word vectorizer
    vec = TfidfVectorizer(max_features=8000, ngram_range=(1,2), sublinear_tf=True)
    X_tr_vec = vec.fit_transform(X_tr)
    X_te_vec = vec.transform(X_te)
    
    # SVM
    svm = CalibratedClassifierCV(LinearSVC(C=1.0, random_state=42, max_iter=2000), cv=3)
    svm.fit(X_tr_vec, y_tr)
    svm_scores.append(f1_score(y_te, svm.predict(X_te_vec), average='macro'))

print(f"SVM F1 across seeds: {np.mean(svm_scores):.4f} ¬± {np.std(svm_scores):.4f}")
print(f"(Ensemble smooths variance by combining multiple classifiers)")

# 5. CALIBRATION CURVE COMPARISON
print("\nüìä 5. CALIBRATION QUALITY")
print("-" * 50)
from sklearn.calibration import calibration_curve

svm_frac, svm_mean = calibration_curve(y_test_abl, svm_prob, n_bins=10)
ens_frac, ens_mean = calibration_curve(y_test_abl, ens_prob, n_bins=10)

svm_cal_error = np.mean(np.abs(svm_frac - svm_mean))
ens_cal_error = np.mean(np.abs(ens_frac - ens_mean))

print(f"Mean Calibration Error: SVM={svm_cal_error:.4f}, Ensemble={ens_cal_error:.4f}")
print(f"Brier Score:            SVM={brier_score_loss(y_test_abl, svm_prob):.4f}, Ensemble={brier_score_loss(y_test_abl, ens_prob):.4f}")

print("\n" + "="*70)
print("SUMMARY: Ensemble benefits")
print("="*70)
print("""
‚úì Combines diverse model perspectives (linear SVM + probabilistic NB + tree-based GB)
‚úì More robust probability estimates for threshold tuning
‚úì Reduces variance from any single classifier's weaknesses  
‚úì Better suited for cost-sensitive deployment (spam vs ham tradeoffs)
""")

ENSEMBLE ADVANTAGES OVER DUAL-SVM

üìä 1. PREDICTION CONFIDENCE
--------------------------------------------------
Avg confidence (correct): SVM=0.9910, Ensemble=0.9659

üìä 2. DISAGREEMENT ANALYSIS
--------------------------------------------------
Both correct:      6024 (98.90%)
Both wrong:          45 (0.74%)
SVM right, Ens wrong:   10
Ens right, SVM wrong:   12

üìä 3. HIGH-PRECISION REGIME (threshold=0.7)
--------------------------------------------------
Threshold 0.7: SVM Prec=0.9910 Rec=0.9862 | Ens Prec=0.9931 Rec=0.9838
Threshold 0.8: SVM Prec=0.9934 Rec=0.9818 | Ens Prec=0.9944 Rec=0.9763
Threshold 0.9: SVM Prec=0.9943 Rec=0.9663 | Ens Prec=0.9960 Rec=0.9364

üìä 4. STABILITY ACROSS SPLITS (quick test)
--------------------------------------------------
SVM F1 across seeds: 0.9890 ¬± 0.0011
(Ensemble smooths variance by combining multiple classifiers)

üìä 5. CALIBRATION QUALITY
--------------------------------------------------
Mean Calibration Error: SVM=0.0547, Ensem

In [12]:
# DIAGNOSTIC: Check for data leakage
print("="*70)
print("DATA LEAKAGE DIAGNOSTIC")
print("="*70)

# 1. Check for exact duplicates
print(f"\n1. Total rows: {len(df):,}")
print(f"   Unique texts: {df['text'].nunique():,}")
print(f"   Exact duplicates: {len(df) - df['text'].nunique():,}")

# 2. Check text length distribution
print(f"\n2. Text length stats:")
print(f"   Min: {df['text'].str.len().min()}")
print(f"   Max: {df['text'].str.len().max()}")
print(f"   Mean: {df['text'].str.len().mean():.0f}")

# 3. Check for very short texts (might be trivial to classify)
short_texts = (df['text'].str.len() < 50).sum()
print(f"\n3. Very short texts (<50 chars): {short_texts:,} ({short_texts/len(df)*100:.1f}%)")

# 4. Check class balance
print(f"\n4. Class distribution:")
print(f"   Spam: {(df['label']==1).sum():,} ({df['label'].mean()*100:.1f}%)")
print(f"   Ham:  {(df['label']==0).sum():,} ({(1-df['label'].mean())*100:.1f}%)")

# 5. Sample some texts to see what we're dealing with
print(f"\n5. Sample spam texts:")
for t in df[df['label']==1]['text'].head(3):
    print(f"   - {t[:80]}...")

print(f"\n6. Sample ham texts:")
for t in df[df['label']==0]['text'].head(3):
    print(f"   - {t[:80]}...")

DATA LEAKAGE DIAGNOSTIC

1. Total rows: 30,452
   Unique texts: 30,452
   Exact duplicates: 0

2. Text length stats:
   Min: 2
   Max: 228368
   Mean: 1478

3. Very short texts (<50 chars): 395 (1.3%)

4. Class distribution:
   Spam: 14,542 (47.8%)
   Ham:  15,910 (52.2%)

5. Sample spam texts:
   - - - > direct marketing will increase sales 23875 there is no stumbling on to it ...
   - adult chronic pa ! n relief procedure cupertino times - great article about losi...
   - info missing medical details found 20 th january time review - in - depth articl...

6. Sample ham texts:
   - start date : 1 / 14 / 02 ; hourahead hour : 13 ; start date : 1 / 14 / 02 ; hour...
   - re : follow - up on siam workshop thanks for forwarding peter ' s resume . by co...
   - re : lst chapter of training book george ,
we shall be able to accommodate one o...


<a id='10-model-saving'></a>
## üîü Model Saving & Analysis

### Save:
- Trained ensemble model
- Word vectorizer
- Character vectorizer
- Results dataframe

In [None]:
print("="*70)
print("FINAL TRAINING & SAVING")
print("="*70)

# 1. FIT THE MODEL ON ALL DATA
# We use X and y from Cell 5 which contains the full dataset features
print("[1/5] üß† Fitting ensemble on full dataset (30,452 samples)...")
ensemble.fit(X, y)
print("      ‚úÖ Model fitted successfully")

# 2. SAVE EVERYTHING
print("\nPaper: 'memory footprint on disk, compressed with joblib, was 43 MB'\n")

# Create models directory
os.makedirs('models', exist_ok=True)

# Save ensemble
print("[2/5] Saving ensemble model...")
joblib.dump(ensemble, 'models/calibrated_ensemble.pkl', compress=3)
joblib.dump(ensemble, 'results/calibrated_ensemble.pkl', compress=3)
ensemble_size = os.path.getsize('models/calibrated_ensemble.pkl') / 1024 / 1024
print(f"      ‚úÖ Saved: {ensemble_size:.1f} MB")

# Save vectorizers (These were already fitted in Cell 5, so they are good to go)
print("\n[3/5] Saving word vectorizer...")
joblib.dump(word_vectorizer, 'models/word_vectorizer.pkl', compress=3)
joblib.dump(word_vectorizer, 'results/word_vectorizer.pkl', compress=3)
word_size = os.path.getsize('models/word_vectorizer.pkl') / 1024 / 1024
print(f"      ‚úÖ Saved: {word_size:.1f} MB")

print("\n[4/5] Saving character vectorizer...")
joblib.dump(char_vectorizer, 'models/char_vectorizer.pkl', compress=3)
joblib.dump(char_vectorizer, 'results/char_vectorizer.pkl', compress=3)
char_size = os.path.getsize('models/char_vectorizer.pkl') / 1024 / 1024
print(f"      ‚úÖ Saved: {char_size:.1f} MB")

# Save results
print("\n[5/5] Saving results dataframe...")
df_results.to_csv('models/results.csv', index=False)
df_results.to_csv('results/results.csv', index=False)
print(f"      ‚úÖ Saved: results.csv")

total_size = ensemble_size + word_size + char_size
print(f"\n{'='*70}")
print(f"‚úÖ FINAL MODEL SAVED")
print(f"{'='*70}")
print(f"\nTotal size: {total_size:.1f} MB")
print(f"Location: ./models/")
print(f"{'='*70}")

FINAL TRAINING & SAVING
[1/5] üß† Fitting ensemble on full dataset (30,452 samples)...
      ‚úÖ Model fitted successfully

Paper: 'memory footprint on disk, compressed with joblib, was 43 MB'

[2/5] Saving ensemble model...
      ‚úÖ Saved: 2.8 MB

[3/5] Saving word vectorizer...
      ‚úÖ Saved: 0.1 MB

[4/5] Saving character vectorizer...
      ‚úÖ Saved: 0.3 MB

[5/5] Saving results dataframe...
      ‚úÖ Saved: results.csv

‚úÖ FINAL MODEL SAVED

Total size: 3.2 MB
Location: ./models/
