# CRISP-DM: Spam Email Detection

This notebook follows the CRISP-DM process for the Spam Email Detection project. It demonstrates data understanding, preprocessing, modeling, evaluation, and saving artifacts for deployment. The notebook is runnable with the lightweight `data/sample_emails.csv` present in the repository.

## 1. Environment and Dependency Installation
Install dependencies (only if needed) and print the Python and package versions. In CI you would normally install from `requirements.txt`.

In [None]:
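# Environment checks: print the Python version and installed package versions
import sys
from importlib.metadata import PackageNotFoundError, version

print('Python:', sys.version.splitlines()[0])
# Use distribution names here: scikit-learn is imported as `sklearn`
# but installed as `scikit-learn`.
packages = ['pandas', 'numpy', 'scikit-learn', 'joblib', 'streamlit']
for pkg in packages:
    try:
        print(f'{pkg}: {version(pkg)}')
    except PackageNotFoundError:
        print(f'{pkg}: not installed')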

## 2. Load and Inspect the Data File
Load the dataset from `data/data.csv` (if present) or fall back to `data/sample_emails.csv`. Show basic file checks with `pathlib`.

In [None]:
from pathlib import Path
import pandas as pd
DATA_PATH = Path('data') / 'data.csv'
SAMPLE_PATH = Path('data') / 'sample_emails.csv'
if DATA_PATH.exists():
    df = pd.read_csv(DATA_PATH)
    print('Loaded data.csv')
else:
    df = pd.read_csv(SAMPLE_PATH)
    print('Loaded sample_emails.csv')
print('\nData shape:', df.shape)
print(df.head(5))
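
The fallback above generalizes to any ordered list of candidate files. The following is a minimal sketch, assuming a hypothetical helper named `first_existing` that is not part of the repository; it returns `None` when no candidate exists, so a missing dataset can be reported explicitly instead of surfacing as a `FileNotFoundError` from `pd.read_csv`.

```python
from pathlib import Path
from typing import Iterable, Optional


def first_existing(candidates: Iterable[Path]) -> Optional[Path]:
    """Return the first candidate path that exists as a file, or None."""
    for path in candidates:
        if path.is_file():
            return path
    return None


# Hypothetical usage mirroring the cell above: prefer data.csv, then the sample.
chosen = first_existing([Path('data') / 'data.csv', Path('data') / 'sample_emails.csv'])
print('Using:', chosen if chosen is not None else 'no data file found')
```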

## 3. Initial Data Inspection and Summary Statistics
Use `info` and `describe`, compute missing-value ratios, and display value counts for the label column.

In [None]:
print('\nInfo:')
df.info()
print('\nDescribe:')
print(df.describe(include='all'))
print('\nMissing value ratios:')
print(df.isnull().mean())
if 'label' in df.columns:
    print('\nLabel distribution:')
    print(df['label'].value_counts())

## 4. Data Cleaning: Missing Values, Duplicates, and Data Types
Apply simple cleaning: drop exact duplicates, fill missing text with an empty string, and cast the label column to a categorical type.

In [None]:
# Remove duplicates
before = len(df)
df = df.drop_duplicates()
after = len(df)
print(f'Removed {before-after} duplicate rows')
if 'text' in df.columns:
    df['text'] = df['text'].fillna('')
if 'label' in df.columns:
    df['label'] = df['label'].astype('category')
    print('\nLabel categories:', df['label'].cat.categories)

## 5. Exploratory Data Analysis (EDA) and Visualization
Plot basic distributions: label counts and word count distribution.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
df['word_count'] = df['text'].apply(lambda s: len(str(s).split()))
plt.figure(figsize=(6,4))
sns.countplot(data=df, x='label')
plt.title('Label distribution')
plt.show()
plt.figure(figsize=(8,4))
sns.histplot(df['word_count'], bins=20)
plt.title('Word count distribution')
plt.show()

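## 6. Feature Engineering and Processing (Encoding, Scaling, Derived Features)
This section uses the preprocessing utilities from `src.preprocessing` and TF-IDF vectorization for the text. The vectorizer below is fitted on the full dataset only to inspect the shape of the feature matrix; the model in section 9 fits its own `TfidfVectorizer` inside a scikit-learn pipeline on the training split.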

In [None]:
from src.preprocessing import preprocess_email
from sklearn.feature_extraction.text import TfidfVectorizer
sample_text = df['text'].iloc[0]
print('Original:', sample_text)
print('Tokens:', preprocess_email(sample_text))
df['text_proc'] = df['text'].apply(lambda t: ' '.join(preprocess_email(t)))
vectorizer = TfidfVectorizer(max_features=2000)
X = vectorizer.fit_transform(df['text_proc'])
print('TF-IDF shape:', X.shape)
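
When TF-IDF features feed a model, the vocabulary and IDF weights should come from the training data alone; otherwise information from the test set leaks into the features. The sketch below illustrates the `fit_transform` / `transform` distinction on a few toy documents that are invented for illustration and are not taken from the dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy documents, invented for illustration only.
train_docs = ['win free prize now', 'meeting at noon tomorrow', 'free offer claim prize']
test_docs = ['claim your free lottery prize']

demo_vectorizer = TfidfVectorizer()
# fit_transform learns the vocabulary and IDF weights from the training documents.
X_train_tfidf = demo_vectorizer.fit_transform(train_docs)
# transform reuses them; words never seen in training are ignored.
X_test_tfidf = demo_vectorizer.transform(test_docs)

print('Vocabulary size:', len(demo_vectorizer.vocabulary_))
print('Train shape:', X_train_tfidf.shape)
print('Test shape:', X_test_tfidf.shape)
print("'lottery' in vocabulary:", 'lottery' in demo_vectorizer.vocabulary_)
```

Wrapping the vectorizer and the classifier in one pipeline, as section 9 does, gives the same guarantee automatically.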

## 7. Data Splitting: Train / Validation / Test and Cross-Validation
Split the data into training and test sets, stratified on the label.

In [None]:
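from sklearn.model_selection import train_test_split
if 'label' not in df.columns:
    raise RuntimeError('Label column not found')
# Convert to str first: `label` is categorical after cleaning, and mapping a
# categorical column can yield another categorical instead of plain integers.
y = df['label'].astype(str).map({'ham': 0, 'spam': 1})
if y.isna().any():
    raise ValueError("Unexpected label values; expected 'ham' or 'spam'")
y = y.astype(int)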
X_train, X_test, y_train, y_test = train_test_split(df['text_proc'], y, test_size=0.2, stratify=y, random_state=42)
print('Train/Test sizes:', len(X_train), len(X_test))
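
The cell above produces only a train/test split. If a separate validation set is wanted, one common approach is to call `train_test_split` twice. The sketch below uses toy data invented for illustration and `_demo` variable names so it does not overwrite the notebook's own `X_train` and `X_test`; the 60/20/20 proportions are an assumption, not a project requirement.

```python
from sklearn.model_selection import train_test_split

# Toy data, invented for illustration: 100 messages, 20% labelled spam (1).
texts_demo = [f'message {i}' for i in range(100)]
labels_demo = [1 if i % 5 == 0 else 0 for i in range(100)]

# First hold out 20% of the data as the test set.
X_rest, X_test_demo, y_rest, y_test_demo = train_test_split(
    texts_demo, labels_demo, test_size=0.2, stratify=labels_demo, random_state=42)
# Then take 25% of the remainder (20% of the total) as the validation set.
X_train_demo, X_val_demo, y_train_demo, y_val_demo = train_test_split(
    X_rest, y_rest, test_size=0.25, stratify=y_rest, random_state=42)

for name, y_part in [('train', y_train_demo), ('val', y_val_demo), ('test', y_test_demo)]:
    print(name, 'size:', len(y_part), 'spam share:', sum(y_part) / len(y_part))
```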

## 8. Build a Baseline Model
Train a simple baseline: predict the majority class.

In [None]:
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
baseline = DummyClassifier(strategy='most_frequent')
# DummyClassifier ignores the features; the reshape only provides a 2-D input.
baseline.fit(X_train.values.reshape(-1,1), y_train)
y_pred_base = baseline.predict(X_test.values.reshape(-1,1))
print('Baseline accuracy:', accuracy_score(y_test, y_pred_base))
print('Baseline precision:', precision_score(y_test, y_pred_base, zero_division=0))
print('Baseline recall:', recall_score(y_test, y_pred_base, zero_division=0))
print('Baseline F1:', f1_score(y_test, y_pred_base, zero_division=0))
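
A majority-class baseline can look strong on accuracy while detecting no spam at all, which is why precision and recall are reported alongside it. The worked example below uses toy label counts invented for illustration (90 ham, 10 spam) and computes the metrics by hand from the confusion counts.

```python
# Toy label counts, invented for illustration: 90 ham (0) and 10 spam (1).
y_true_demo = [0] * 90 + [1] * 10
y_pred_demo = [0] * len(y_true_demo)  # a majority-class baseline always predicts ham

pairs = list(zip(y_true_demo, y_pred_demo))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)
fp = sum(1 for t, p in pairs if t == 0 and p == 1)
fn = sum(1 for t, p in pairs if t == 1 and p == 0)
tn = sum(1 for t, p in pairs if t == 0 and p == 0)

accuracy = (tp + tn) / len(pairs)
# Fall back to 0.0 when the denominator is zero, like zero_division=0 above.
precision = tp / (tp + fp) if (tp + fp) else 0.0
recall = tp / (tp + fn) if (tp + fn) else 0.0

print(f'accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}')
```

Any real model should therefore be judged on how much it improves spam recall and precision, not on accuracy alone.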

## 9. Model Training: Scikit-Learn Example
Use a pipeline that combines `TfidfVectorizer` and `RandomForestClassifier` (a quick example). For larger datasets, consider `GridSearchCV` and cross-validation.

In [None]:
from sklearn.pipeline import make_pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
pipeline = make_pipeline(TfidfVectorizer(max_features=2000), RandomForestClassifier(n_estimators=100, random_state=42))
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
print(classification_report(y_test, y_pred))
print('Confusion matrix:\n', confusion_matrix(y_test, y_pred))
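
A single train/test split gives one estimate of performance; cross-validation averages over several. The sketch below shows one way to cross-validate a pipeline of the same shape with `cross_val_score`. The toy corpus is invented for illustration, and the choice of F1 as the score and of 3 folds are assumptions rather than settings taken from the project.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline

# Toy corpus, invented for illustration: six spam-like and six ham-like messages.
docs_demo = [
    'win a free prize now', 'claim your free cash reward', 'free offer limited time',
    'urgent prize waiting claim now', 'cheap loans free approval', 'win cash now free entry',
    'meeting moved to noon', 'see you at lunch tomorrow', 'project report attached',
    'can we reschedule the call', 'notes from the team meeting', 'lunch tomorrow at noon',
]
labels_demo = [1] * 6 + [0] * 6

cv_pipeline = make_pipeline(
    TfidfVectorizer(),
    RandomForestClassifier(n_estimators=20, random_state=42),
)
# Stratified folds keep the spam/ham ratio the same in every fold.
folds = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
cv_scores = cross_val_score(cv_pipeline, docs_demo, labels_demo, cv=folds, scoring='f1')

print('F1 per fold:', cv_scores)
print('Mean F1:', cv_scores.mean())
```

Because the vectorizer sits inside the pipeline, it is refitted on the training part of each fold, so no fold sees vocabulary statistics from its own held-out data.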

## 10. Hyperparameter Tuning with Grid/Random Search (Example)
Example using `RandomizedSearchCV` over a small set of parameter distributions.

In [None]:
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint
param_dist = {
    'randomforestclassifier__n_estimators': randint(50, 200),
    'randomforestclassifier__max_depth': randint(3, 20)
}
search = RandomizedSearchCV(pipeline, param_distributions=param_dist, n_iter=4, cv=3, random_state=42)
search.fit(X_train, y_train)
print('Best params:', search.best_params_)
print(classification_report(y_test, search.predict(X_test)))

## 11. Model Evaluation: Metrics, Confusion Matrix, ROC / PR Curves
Plot the ROC curve and compute the AUC.

In [None]:
from sklearn.metrics import roc_curve, auc
import matplotlib.pyplot as plt
# Prefer the tuned model from section 10; fall back to the untuned pipeline
# if the search was skipped or failed.
try:
    y_score = search.predict_proba(X_test)[:,1]
except Exception:
    y_score = pipeline.predict_proba(X_test)[:,1]
fpr, tpr, _ = roc_curve(y_test, y_score)
roc_auc = auc(fpr, tpr)
plt.figure()
plt.plot(fpr, tpr, label=f'ROC curve (AUC = {roc_auc:.2f})')
plt.plot([0,1], [0,1], 'k--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend()
plt.show()
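
For imbalanced problems such as spam detection, a precision-recall curve is often more informative than ROC. The sketch below computes one with `precision_recall_curve` and summarizes it with `average_precision_score`. The labels and scores are toy values invented for illustration; in the notebook, `y_test` and `y_score` from the cell above would take their place.

```python
from sklearn.metrics import average_precision_score, precision_recall_curve

# Toy labels and scores, invented for illustration only.
y_true_demo = [0, 0, 1, 1]
y_score_demo = [0.1, 0.4, 0.35, 0.8]

pr_precision, pr_recall, pr_thresholds = precision_recall_curve(y_true_demo, y_score_demo)
avg_precision = average_precision_score(y_true_demo, y_score_demo)

print('Precision:', pr_precision)
print('Recall:', pr_recall)
print('Average precision:', round(avg_precision, 3))
# To plot: plt.plot(pr_recall, pr_precision), with recall on the x-axis.
```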

## 12. Error Analysis and Model Interpretation (Feature Importances)
Show the top 20 feature importances for the RandomForest step of the pipeline.

In [None]:
vec = None
clf = None
try:
    vec = pipeline.named_steps['tfidfvectorizer']
    clf = pipeline.named_steps['randomforestclassifier']
except Exception:
    vec = pipeline.steps[0][1]
    clf = pipeline.steps[-1][1]
if hasattr(clf, 'feature_importances_'):
    feature_names = vec.get_feature_names_out()
    importances = clf.feature_importances_
    idx = importances.argsort()[::-1][:20]
    top_feats = [(feature_names[i], importances[i]) for i in idx]
    print('Top features:')
    for name, imp in top_feats:
        print(name, f'{imp:.4f}')
else:
    print('Classifier does not provide feature importances')
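
Feature importances explain the model globally; error analysis usually also means reading the individual messages it got wrong. The sketch below assumes a hypothetical helper named `misclassified` (not part of the repository) and uses toy inputs invented for illustration.

```python
def misclassified(texts, y_true, y_pred):
    """Return (text, true_label, predicted_label) for every wrong prediction."""
    return [(t, yt, yp) for t, yt, yp in zip(texts, y_true, y_pred) if yt != yp]


# Toy inputs, invented for illustration: 1 = spam, 0 = ham.
demo_texts = ['free prize inside', 'lunch at noon?', 'urgent: verify account']
demo_true = [1, 0, 1]
demo_pred = [1, 0, 0]

for text, true_label, pred_label in misclassified(demo_texts, demo_true, demo_pred):
    kind = 'false negative' if true_label == 1 else 'false positive'
    print(f'{kind}: {text!r}')
```

In the notebook, the same helper could be called as `misclassified(X_test, y_test, y_pred)`.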

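## 13. Saving and Exporting the Model (joblib)
Save the best pipeline (vectorizer and classifier together) for deployment. The pipeline was trained on the `text_proc` column, so raw text needs the same `preprocess_email` step before prediction.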

In [None]:
import joblib
from pathlib import Path
MODEL_DIR = Path('models')
MODEL_DIR.mkdir(exist_ok=True)
joblib.dump(search.best_estimator_ if hasattr(search, 'best_estimator_') else pipeline, MODEL_DIR / 'spam_pipeline.joblib')
print('Saved pipeline to models/spam_pipeline.joblib')
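
Before shipping a saved artifact, it is worth checking that it reloads and predicts the same labels. The sketch below does a round trip through a temporary directory with a small stand-in pipeline; the toy corpus is invented for illustration, and the reduced `n_estimators` is only there to keep the example fast.

```python
import tempfile
from pathlib import Path

import joblib
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Toy corpus, invented for illustration: 1 = spam, 0 = ham.
docs_demo = ['win free prize', 'free cash now', 'meeting at noon', 'see you at lunch']
labels_demo = [1, 1, 0, 0]

demo_pipeline = make_pipeline(
    TfidfVectorizer(),
    RandomForestClassifier(n_estimators=10, random_state=0),
)
demo_pipeline.fit(docs_demo, labels_demo)

# Save to a temporary directory and load it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = Path(tmp_dir) / 'demo_pipeline.joblib'
    joblib.dump(demo_pipeline, model_path)
    reloaded = joblib.load(model_path)

same = list(demo_pipeline.predict(docs_demo)) == list(reloaded.predict(docs_demo))
print('Predictions match after reload:', same)
```

Artifacts written by joblib are generally tied to the library versions that produced them, so it helps to record the scikit-learn version next to the file.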

## 14. Simple Deployment Demo (Streamlit App Skeleton)
An example Streamlit snippet that loads the model and runs a single prediction (paste it into `src/app.py` or a separate `streamlit_app.py`).

In [None]:
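streamlit_snippet = '''import streamlit as st
import joblib
# Assumes the repository root is on the import path
from src.preprocessing import preprocess_email

pipeline = joblib.load('models/spam_pipeline.joblib')

text = st.text_area('Email text')
if st.button('Predict'):
    # Apply the same preprocessing used to build the training column `text_proc`
    text_proc = ' '.join(preprocess_email(text))
    pred = pipeline.predict([text_proc])
    st.write('Spam' if pred[0]==1 else 'Ham')
'''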
print(streamlit_snippet)

## 15. Unit Testing and Reproducibility (pytest, Seeds)
Set seeds for reproducibility. Unit tests for the preprocessing code live in `tests/test_preprocessing.py` and run with `pytest`.

In [None]:
import numpy as np
import random
import os
seed = 42
np.random.seed(seed)
random.seed(seed)
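# PYTHONHASHSEED only takes effect when set before the interpreter starts;
# setting it here affects subprocesses, not the current kernel.
os.environ['PYTHONHASHSEED'] = str(seed)
print('Seeds set for numpy and random')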
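
A pytest test for the preprocessing step might look like the sketch below. The real `preprocess_email` lives in `src.preprocessing` and its exact behaviour is not shown in this notebook, so the example tests a hypothetical stand-in, `preprocess_email_stub`, that only assumes what the notebook relies on: the function returns a sequence of string tokens.

```python
import re


def preprocess_email_stub(text):
    """Hypothetical stand-in for src.preprocessing.preprocess_email."""
    return re.findall(r"[a-z0-9']+", str(text).lower())


def test_returns_list_of_strings():
    tokens = preprocess_email_stub('Win a FREE prize now!')
    assert isinstance(tokens, list)
    assert all(isinstance(token, str) for token in tokens)


def test_empty_input_gives_no_tokens():
    assert preprocess_email_stub('') == []


def test_is_deterministic():
    text = 'Claim your reward today'
    assert preprocess_email_stub(text) == preprocess_email_stub(text)
```

To test the real function, replace the stub with `from src.preprocessing import preprocess_email` and adjust the assertions to its documented behaviour.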

## 16. Run Logging and Experiment Tracking (Logging, Tabulated Results)
Demonstrate saving experiment metrics to a CSV for later comparison.

In [None]:
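import pandas as pd
from pathlib import Path
from sklearn.metrics import accuracy_score, precision_score, recall_score
# Record the test-set metrics of the section 9 pipeline (`y_pred`)
results = {
    'model': ['random_forest'],
    'accuracy': [accuracy_score(y_test, y_pred)],
    'precision': [precision_score(y_test, y_pred, zero_division=0)],
    'recall': [recall_score(y_test, y_pred, zero_division=0)],
}
results_df = pd.DataFrame(results)
Path('experiments').mkdir(exist_ok=True)
results_df.to_csv('experiments/results_summary.csv', index=False)
print('Saved experiments/results_summary.csv')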
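
Writing the summary with `to_csv` overwrites the file on every run. To compare runs over time, results can be appended instead. The sketch below assumes a hypothetical helper named `append_result` and a hypothetical log file name; the metric values are placeholders, not real results, and the demo writes to a temporary directory so it leaves the repository untouched.

```python
import csv
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def append_result(csv_path, row):
    """Append one metrics row to csv_path, writing the header if the file is new."""
    csv_path = Path(csv_path)
    csv_path.parent.mkdir(parents=True, exist_ok=True)
    is_new = not csv_path.exists()
    # Prefix every row with a UTC timestamp so runs can be ordered later.
    record = {'timestamp': datetime.now(timezone.utc).isoformat(), **row}
    with csv_path.open('a', newline='') as f:
        writer = csv.DictWriter(f, fieldnames=list(record))
        if is_new:
            writer.writeheader()
        writer.writerow(record)


# Demo with placeholder values (not real results).
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = Path(tmp_dir) / 'experiments' / 'results_log.csv'
    append_result(demo_path, {'model': 'baseline', 'accuracy': 0.0})
    append_result(demo_path, {'model': 'random_forest', 'accuracy': 0.0})
    with demo_path.open(newline='') as f:
        logged_rows = list(csv.DictReader(f))

print('Rows logged:', len(logged_rows))
```

The helper assumes every call passes the same keys in the same order, since the header is written only once.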