**Name:** zhai qiuyu

**EID:** 5999 1830

# CS5489 - Assignment 1 - SMS classification

## Goal
In this assignment, the task is to predict whether an SMS message is a real message, a spam message, or a phishing message (called smishing). Here are some examples:

  - **Normal**: "For real tho this sucks. I can't even cook my whole electricity is out. And I'm hungry."
  - **Spam**: "Had your mobile 10 mths? Update to latest Orange camera/video phones for FREE. Save £s with Free texts/weekend calls. Text YES for a callback orno to opt out"
  - **Smishing**: "Todays Vodafone numbers ending 5347 are selected to receive a Rs.2,00,000 award. If you have a match please call 6299257179 quoting claim code 2041 standard rates apply"


Your goal is to train a classifier to predict the class from the SMS text.


## Methodology
You need to train classifiers using the training data, and then predict on the test data. You are free to choose the feature extraction method and classifier algorithm.  You are free to use methods that were not introduced in class.  You should probably do cross-validation to select good parameters.


## Evaluation

You need to report your test predictions. The csv file has determined the split of validation and test data, where the validation data will be used to determine the timestep of checkpointing. The model that achieves the best performance on validation set should be used to evaluate the test data. The test performance will be used to calculate your final ranking.

The evaluation metric is **balanced accuracy score**. This is because the dataset has some class imbalance as there are more normal samples than spam/smishing samples. See details for `sklearn.metrics.balanced_accuracy_score` [here](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.balanced_accuracy_score.html).
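
As a small illustration of the metric, balanced accuracy is the unweighted mean of the per-class recall. The sketch below computes it in plain Python on made-up labels; the helper name `balanced_accuracy` and the toy labels are assumptions for illustration only and are not part of the assignment code. It shows why a classifier that always predicts the majority class scores poorly on this metric even though its plain accuracy looks acceptable.

```python
from collections import defaultdict

def balanced_accuracy(y_true, y_pred):
    """Unweighted mean of per-class recall (hypothetical helper for illustration)."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    recalls = [correct[c] / total[c] for c in total]
    return sum(recalls) / len(recalls)

# Toy labels (made up): 8 normal (0), 2 spam (1), 2 smishing (2)
y_true = [0] * 8 + [1, 1] + [2, 2]
# A degenerate classifier that always predicts "normal"
y_pred = [0] * 12

plain_acc = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
bal_acc = balanced_accuracy(y_true, y_pred)
print("plain accuracy:", plain_acc)
print("balanced accuracy:", bal_acc)
```

For the same inputs, `sklearn.metrics.balanced_accuracy_score(y_true, y_pred)` should give the same value as the helper.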

## What to hand in
You need to turn in the following things:

1. This ipynb file `Assignment1-Doc.ipynb` with your source code and documentation. _**You should write about all the various attempts that you make to find a good solution.**_ You may also submit python scripts as source code, but you then must document all analysis and results (figures, outputs, etc.) in the ipynb file.
2. The PDF file exported from your `Assignment1-Doc.ipynb` file.
3. Your final CSV submission file on test data.
4. The ipynb file `Assignment1-Final.ipynb`, which contains the code that generates the final CSV submission file.  **This code will be used to verify that your submission is reproducible.**

**Please compress all four files into a single zipfile and upload it to Assignment 1 on Canvas.**

## Basic Requirements of Documentation:

For your documentation, you need to at least explain the following things:

- **Data Preprocessing**: This section should detail your exploratory data analysis and the rationale behind all preprocessing steps.
  - Describe the initial characteristics of your dataset.
  - Explain all techniques you have applied on the data
  - Clearly state which subset of the data was used to determine any inherent hyperparameters within your preprocessing techniques, ensuring no information from the test set was used (if any)
- **Methodology**: you will justify your modeling choices.
  - Chosen Models: describe the core principle or model architecture (for deep learning)
  - Chosen Optimizers (if any)
  - Chosen Loss Function (if any)
- **Hyperparameters**: This section should demonstrate a rigorous approach to hyper-parameter selection
  - List the key hyperparameters used
  - Document your hyperparameter search process. Compare model performance on a dedicated validation set and select the best-performing configuration based solely on validation metrics. As a core principle of this course, using the test set for hyperparameter selection is strictly prohibited and constitutes academic dishonesty.
- **Results and Visualization**: This section should provide clear evidence of your model's performance and a qualitative analysis of its behavior.
  - Learning Curves (only for deep learning methods): the loss (error) and/or accuracy on training and validation set must be provided.
  - Show examples of correctly classified and misclassified test samples. For misclassified samples, hypothesize why the model failed.

## Grading
The marks of the assignment are distributed as follows:
- 45% - Results using various classifiers and feature representations.
- 30% - Trying out feature representations (e.g. adding additional features) or classifiers not used in the tutorials/lectures.
- 20% - Quality of the written report.  More points for insightful observations and analysis.
- 5% - Final performance on the test data. If a submission cannot be reproduced by the submitted code, it will not receive marks for ranking.
- **Late Penalty:** 25 marks will be subtracted for each day late.

**NOTE:** This is an _individual_ assignment.

**NOTE:** you should start early! Some classifiers may take a while to train.

<hr>

# Load the Data

The training data is in the text file `smishing_train.txt`.  This CSV file contains the SMS text and the class label. The class labels are: `0`, `1`, `2`, which are `normal`, `spam`, `smishing`.

The validation/testing data is in the text file `smishing_val_test.txt`, and only contains the SMS text.

The labels of the validation/testing data are in the csv file `smishing_val_test.csv`, which contains the `Id`, the label (`Prediction`), and the split (`Usage`: `val` or `test`).

You need to generate a csv file, with the following format:

<pre>
Id,Prediction
1,0
2,1
3,0
4,2
...
</pre>

Here are two helpful functions for reading the text data and writing the csv file.

In [None]:
%matplotlib inline
import matplotlib_inline   # setup output image format
matplotlib_inline.backend_inline.set_matplotlib_formats('svg')
import matplotlib.pyplot as plt
import matplotlib
from numpy import *
from sklearn import *
from scipy import stats
import csv
import pandas as pd
import numpy as np
random.seed(100)

In [34]:
def read_text_data(fname):
    """Read a CSV file of SMS texts (and labels, when present)."""
    txtdata = []            # SMS texts
    classes = []            # class labels
    with open(fname, 'r', encoding='utf-8') as csvfile:  # open file safely with UTF-8
        reader = csv.reader(csvfile, delimiter=',', quotechar='"')
        for row in reader:
            txtdata.append(row[0])           # row[0] is the SMS text
            if len(row) > 1:                 # a label column is present
                classes.append(int(row[1]))  # row[1] is the label

    return (txtdata, classes)  # tuple of (texts, labels)

def write_csv(fname, Y):
    tmp = [['Id', 'Prediction']]  # CSV header

    # add ID numbers for each Y
    for (i,y) in enumerate(Y):
        tmp2 = [(i+1), y]
        tmp.append(tmp2)

    # write CSV file (newline='' avoids blank rows on Windows)
    with open(fname, 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerows(tmp)

The below code will load the training and test sets.

In [35]:
# load the data
(traintxt, trainY) = read_text_data("smishing_train.txt")
(testtxt, _)       = read_text_data("smishing_val_test.txt")
testY              = pd.read_csv("smishing_val_test.csv")
print(len(traintxt))
print(len(testtxt))
testY.head()


2985
2986


Unnamed: 0,Id,Prediction,Usage
0,1,0,val
1,2,0,test
2,3,0,val
3,4,0,test
4,5,0,val


In [36]:
traintxt[:5]

['Dunno da next show aft 6 is 850. Toa payoh got 650.',
 'I.ll hand her my phone to chat wit u',
 'I dont have i shall buy one dear',
 'Nite...',
 'Ok�congrats�']

In [37]:
trainY[:5]

[0, 0, 0, 0, 0]

In [38]:
testY[:5]

Unnamed: 0,Id,Prediction,Usage
0,1,0,val
1,2,0,test
2,3,0,val
3,4,0,test
4,5,0,val


In [39]:
# show the class labels
classnames = unique(trainY)
print(classnames)

[0 1 2]


In [40]:
classlabels = ['normal', 'spam', 'smishing']

Here is an example to write a csv file with predictions on the test set.  These are random predictions.

In [None]:
# random.randint(n, size=m): draw m random integers from 0 to n-1
i = random.randint(len(classnames), size=len(testtxt))
predY = classnames[i]
write_csv("my_submission.csv", predY)

Look at the data:

In [42]:
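for c in classnames:  # for each class
    tmp = where(np.array(trainY) == c)  # trainY is a list, so convert it to an array before comparing
    for a in tmp[0][0:5]:  # first 5 samples of the current class
        print('[{}]: {}'.format(classlabels[trainY[a]], traintxt[a]))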

[normal]: Dunno da next show aft 6 is 850. Toa payoh got 650.
[normal]: I.ll hand her my phone to chat wit u
[normal]: I dont have i shall buy one dear
[normal]: Nite...
[normal]: Ok�congrats�
[spam]: I'd like to tell you my deepest darkest fantasies. Call me 09094646631 just 60p/min. To stop texts call 08712460324 (nat rate)
[spam]: Santa Calling! Would your little ones like a call from Santa Xmas eve? Call 09058094583 to book your time.
[spam]: Meet Top 35 US universities in Delhi at India Habitat Centre Lodhi Road on Nov 8th, 2 to 6 pm for student admission.Entry Free,  details contact 9911489000
[spam]: SMS AUCTION You have won a Nokia 7250i. This is what you get when you win our FREE auction. To take part send Nokia to 86021 now. HG/Suite342/2Lands Row/W1JHL 16+
[spam]: Call Germany for only 1 pence per minute! Call from a fixed line via access number 0844 861 85 86.
[smishing]: WIN URGENT! Your mobile number has been awarded with a £2000 prize GUARANTEED call 09061790121 from lan

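# Task checklist
- Data preprocessing:
1. Initial characteristics of the dataset (e.g. number of samples, feature types, class distribution, missing values, outliers).
2. Explain every technique applied to the data (e.g. missing-value imputation, outlier handling, feature encoding, feature normalization/standardization, feature selection).
3. State clearly which subset of the data (e.g. training set, validation set) was used to determine the hyperparameters inherent to the preprocessing techniques (for example, standardizing with the training-set mean/standard deviation, or setting outlier thresholds from the training-set distribution), and make sure no information from the test set is used (if a test set exists).

- Model selection:
1. Core principle or architecture of the chosen models (architecture for deep learning only).
2. Optimizer.
3. Loss function.

- Hyperparameters:
1. List the key hyperparameters used.
2. Document the hyperparameter search: compare model performance on a dedicated validation set and select the best configuration based solely on validation metrics.
3. As a core principle of this course, using the test set for hyperparameter selection is strictly prohibited and constitutes academic dishonesty.

- Results and visualization:
1. Learning curves (deep learning methods only): provide the loss (error) and/or accuracy on the training and validation sets.
2. Show examples of correctly and incorrectly classified test samples; for misclassified samples, hypothesize why the model failed.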

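# Data preprocessing

### Data inspection
- Missing values: count empty strings and whitespace-only samples in `traintxt`/`testtxt` to decide whether a placeholder or similar handling is needed.
- Outliers: explore statistics such as length (character count), digit-character ratio, and URL/phone-number hits to identify extremely long messages and samples with many numbers/URLs.
- Imputation and handling:
  - Replace missing text or special characters with the placeholder "<EMPTY>" so that the model can still learn from the sample;
  - Messages that are too long can simply be truncated to avoid redundant information.
- Feature scaling:
  - Text features can use `TfidfVectorizer`, which includes L2 normalization;
  - Numeric derived features (length, digit ratio, etc.) can be processed with `StandardScaler` or similar, fitted strictly on training-set statistics (to avoid information leakage).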


In [None]:
# Data diagnostics: explore missing values and outliers
import re

# Return True if the text is None, empty, or whitespace-only
def is_empty_text(t: str) -> bool:
    return (t is None) or (len(t.strip()) == 0)

def describe_texts(texts):
    # Character length of each text (including spaces)
    lengths = np.array([len(t) for t in texts])
    # Share of digit characters per text: digit count / max(1, length) (avoids division by zero)
    digit_ratio = np.array([sum(ch.isdigit() for ch in t) / max(1, len(t)) for t in texts])
    # URL pattern (matches http://, https://, or a "www." prefix)
    url_pat = re.compile(r"https?://|www\.")
    # Pattern for at least 7 consecutive digits (typical of long phone numbers)
    phone_pat = re.compile(r"\b\d{7,}\b")
    # Currency pattern (matches £ $ € ¥ or abbreviations such as rs/inr/usd/gbp, case-insensitive)
    amt_pat = re.compile(r'[£$€¥]|\b(?:rs|inr|usd|gbp)\b', flags=re.IGNORECASE)


    # Flag whether each text contains a URL (1 = hit, 0 = no hit)
    url_hits = np.array([1 if url_pat.search(t.lower()) else 0 for t in texts])
    # Flag whether each text contains a long number (1/0)
    phone_hits = np.array([1 if phone_pat.search(t) else 0 for t in texts])
    # Flag whether each text contains a currency symbol (1/0)
    amt_hits = np.array([1 if amt_pat.search(t) else 0 for t in texts])

    return {
        'count': len(texts),
        'empty_count': int(sum(is_empty_text(t) for t in texts)),
        'len_mean': float(lengths.mean()),
        'len_p95': float(np.percentile(lengths, 95)),
        'len_max': int(lengths.max()),
        'digit_ratio_mean': float(digit_ratio.mean()),
        'url_ratio': float(url_hits.mean()),
        'phone_ratio': float(phone_hits.mean()),
        'amt_ratio': float(amt_hits.mean()),
    }

train_desc = describe_texts(traintxt)
test_desc   = describe_texts(testtxt)
print('Train desc:', train_desc)
print('Test desc  :', test_desc)

# Prepare a placeholder for empty texts so that vectorization does not fail later
traintxt_proc = [t if not is_empty_text(t) else '<EMPTY>' for t in traintxt]
test_proc   = [t if not is_empty_text(t) else '<EMPTY>' for t in testtxt]

Train desc: {'count': 2985, 'empty_count': 0, 'len_mean': 82.17219430485763, 'len_p95': 161.0, 'len_max': 910, 'digit_ratio_mean': 0.02400560015137386, 'url_ratio': 0.03149078726968174, 'phone_ratio': 0.10452261306532663, 'amt_ratio': 0.05862646566164154}
Test desc  : {'count': 2986, 'empty_count': 0, 'len_mean': 84.30709979906229, 'len_p95': 161.0, 'len_max': 611, 'digit_ratio_mean': 0.02580820242487473, 'url_ratio': 0.030475552578700604, 'phone_ratio': 0.1188881446751507, 'amt_ratio': 0.06965840589417281}




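- From the output:
- The empty count is 0 for both files, so there are no completely empty entries.
- The training set and the validation/test file are very close in overall distribution (`len_mean`, `len_p95`, `digit_ratio_mean`, `url_ratio`), which indicates a reasonable split.
- The maximum length is 910 in the training set and 611 in the validation/test file, so a few extremely long messages exist. Such messages may contain a lot of redundant content (repeated sales scripts, irrelevant text). If they go straight into feature extraction (e.g. TF-IDF), their large token count can inflate the weight of rare or meaningless terms and dilute key features such as fraud keywords; the model may also come to depend on length itself (for example, assuming that long texts are more likely to be spam), which leads to overfitting. Remedy: use the L2 normalization of TF-IDF and filter rare terms, which in essence removes the effect of length on feature weights. In addition, messages that are still very long after processing can be truncated, because the key information of an SMS usually sits near the beginning, and truncation keeps redundant content from misleading the model.
- URLs and long numbers are rare, but they may be highly informative for separating spam from smishing. Remedy: replace them with placeholder tokens such as `<URL>` and `<PHONE>`, abstracting concrete instances into categorical features. The model then focuses on the presence of such a pattern rather than its exact content, which both prevents an extremely sparse vocabulary (every URL/number would otherwise be its own feature) and strengthens the association between these patterns and unwanted messages.
- The mean `digit_ratio` is low, but samples with a high share of digits (amounts, phone numbers) may still carry an important signal. Remedy: normalize currency symbols and amounts to `<AMOUNT>`, which makes the feature more general without bloating the vocabulary.
- These numeric features have very different value ranges; fed directly to a model, the inconsistent scales can make it overly sensitive to large-valued features (or make it ignore small-valued ones). Remedy: standardize them with `StandardScaler` or `RobustScaler`, so that the relative level of, say, the digit ratio becomes a signal the model can use.

- `StandardScaler`: scales by the central tendency (mean) and dispersion (standard deviation) of the data, so that the transformed data has mean 0 and standard deviation 1. Drawback: it is very sensitive to extreme values, because they pull the mean and standard deviation up or down and distort the standardized result.
- `RobustScaler`: scales by the median and the dispersion of the middle 50% of the data (the IQR). Advantage: it is robust to extreme values, because the median and IQR are determined by the middle 50% of the data only.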
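
To make the difference between the two scalers concrete, here is a minimal plain-Python sketch of the same two transformations (mean and population standard deviation versus median and IQR). The helper names and the toy length values are assumptions for illustration; only the maximum of 910 echoes the statistics above. It is not the notebook's pipeline, just the arithmetic behind it.

```python
import statistics

def standard_scale(values):
    """Center on the mean and divide by the population standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

def robust_scale(values):
    """Center on the median and divide by the interquartile range (IQR)."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return [(v - median) / (q3 - q1) for v in values]

def spread(values):
    return max(values) - min(values)

# Toy message lengths (made up): five typical messages and one extreme outlier
lengths = [40, 60, 80, 100, 120, 910]

standard = standard_scale(lengths)
robust = robust_scale(lengths)

# Spread of the five typical messages after each transformation
typical_spread_standard = spread(standard[:5])
typical_spread_robust = spread(robust[:5])
print("standard:", [round(v, 3) for v in standard])
print("robust:  ", [round(v, 3) for v in robust])
print(typical_spread_standard, typical_spread_robust)
```

With the outlier present, the mean-and-standard-deviation version squeezes the typical messages into a narrow band, while the median-and-IQR version keeps them well separated and leaves the outlier far away.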

In [44]:
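# Count the training samples per class (0 = normal, 1 = spam, 2 = smishing)
normal_count = trainY.count(0)    # number of samples with label 0 (normal)
spam_count = trainY.count(1)      # number of samples with label 1 (spam)
smishing_count = trainY.count(2)  # number of samples with label 2 (smishing)

# Print the results
print(f"Normal messages in the training set: {normal_count}")
print(f"Spam messages in the training set: {spam_count}")
print(f"Smishing messages in the training set: {smishing_count}")
print(f"Total messages in the training set: {len(traintxt)}")


Normal messages in the training set: 2454
Spam messages in the training set: 234
Smishing messages in the training set: 297
Total messages in the training set: 2985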


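- The counts show a class-imbalance problem: normal messages dominate, while spam and smishing messages are comparatively few. Models can be given `class_weight='balanced'`, which automatically adjusts the loss weight of each class according to its share of the samples. The weights are inversely proportional to class frequency, which keeps the model from leaning towards the majority class (normal messages).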
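
As a worked example of that weighting, scikit-learn documents the `'balanced'` heuristic as `n_samples / (n_classes * count_per_class)`. The sketch below applies that formula in plain Python to the class counts reported above; the variable names are illustrative assumptions.

```python
# Class counts of the training set (0 = normal, 1 = spam, 2 = smishing)
counts = {0: 2454, 1: 234, 2: 297}

n_samples = sum(counts.values())
n_classes = len(counts)

# "balanced" heuristic: weight_c = n_samples / (n_classes * count_c)
weights = {c: n_samples / (n_classes * n) for c, n in counts.items()}

for c, w in weights.items():
    print(f"class {c}: weight {w:.4f}, weighted mass {w * counts[c]:.1f}")
```

Every class ends up with the same total weighted mass (`n_samples / n_classes`), which is what keeps the majority class from dominating the loss.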

### Preprocessing

In [None]:
# Preprocessing + scaling (training set only for now; validation/test are processed after the split)
from scipy.sparse import csr_matrix
from sklearn.preprocessing import StandardScaler, RobustScaler

# Patterns: URLs, long phone numbers, currency/amount symbols
url_pat = re.compile(r'https?://|www\.')
phone_pat = re.compile(r'\b\d{7,}\b')
amt_pat = re.compile(r'[£$€¥]|\b(?:rs|inr|usd|gbp)\b', flags=re.IGNORECASE)

def normalize_text(t,max_length=200):  # replace concrete instances with category tokens and truncate the text
    if t is None or len(str(t).strip()) == 0:
        return '<EMPTY>'
    s = str(t).lower()
    s = url_pat.sub(' <URL> ', s)
    s = phone_pat.sub(' <PHONE> ', s)
    s = amt_pat.sub(' <AMOUNT> ', s)
    s = ' '.join(s.split())
    if len(s) > max_length:
        s = s[:max_length]  # keep only the first max_length characters
    return s

traintxt_norm = [normalize_text(t, max_length=200) for t in traintxt]

# Numeric derived features: length, digit ratio, URL present, long number present
# NOTE: this function is called on the normalized text, where URLs and long numbers
# have already been replaced by placeholders, so url_hits and phone_hits are
# (almost) always 0 as written; passing the raw text would keep them informative.
def make_numeric_features(texts):
    lengths = np.array([len(t) for t in texts])
    digit_ratio = np.array([sum(ch.isdigit() for ch in t) / max(1, len(t)) for t in texts])
    url_hits = np.array([1 if url_pat.search(t) else 0 for t in texts])
    phone_hits = np.array([1 if phone_pat.search(t) else 0 for t in texts])
    return np.vstack([lengths, digit_ratio, url_hits, phone_hits]).T  # rows = samples, columns = features

train_num = make_numeric_features(traintxt_norm)

# Choose the scaler: RobustScaler if the 99th percentile of length exceeds 500, otherwise StandardScaler (fitted on the training set only)
scaler = RobustScaler() if np.percentile(train_num[:, 0], 99) > 500 else StandardScaler()
scaler.fit(train_num)
train_num_s = scaler.transform(train_num)

# Convert to a sparse matrix so that it can be hstacked with the TF-IDF matrices
train_num_csr = csr_matrix(train_num_s)




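# Feature engineering

## Why TF-IDF rather than BoW

- BoW (`CountVectorizer`) records term counts (how many times each word occurs in a message). This is useful for short-text tasks, but it has a problem: when some messages are very long or contain repeated templates, the counts amplify the influence of those texts, so that length or repeated patterns can dominate training.
- TF-IDF, in contrast, uses IDF to down-weight frequent, low-information terms and normalization to reduce length bias, which matters particularly in a short-text classification task that includes extremely long messages and templated content. Using word-level and character-level TF-IDF together captures meaningful words/phrases while staying robust to character patterns such as URLs and phone numbers.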
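
The length argument can be checked on a toy corpus. The sketch below is a plain-Python approximation, assuming whitespace tokenization, the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1` (the default in scikit-learn's `TfidfVectorizer`) and L2 normalization; the three documents are made up. It is not the vectorizer used in this notebook, only an illustration of why repeating a template changes raw counts but not the normalized TF-IDF vector.

```python
import math
from collections import Counter

docs = [
    "call now to claim your free prize",
    "call now to claim your free prize " * 5,  # the same template repeated 5 times
    "are you free for lunch",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Bag-of-words: raw term counts per document
counts = [Counter(tokens) for tokens in tokenized]

# Document frequency and smoothed IDF
df = Counter(term for c in counts for term in c)
idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1 for term in df}

def tfidf_l2(count):
    """TF-IDF weights for one document, scaled to unit L2 norm."""
    raw = {term: tf * idf[term] for term, tf in count.items()}
    norm = math.sqrt(sum(v * v for v in raw.values()))
    return {term: v / norm for term, v in raw.items()}

vectors = [tfidf_l2(c) for c in counts]

# Raw counts grow with repetition; normalized TF-IDF weights do not
print(counts[0]["prize"], counts[1]["prize"])
print(round(vectors[0]["prize"], 4), round(vectors[1]["prize"], 4))
# "free" occurs in every document, so it is weighted below the rarer "prize"
print(vectors[0]["free"] < vectors[0]["prize"])
```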

In [None]:
# Shared TF-IDF vectorization (word and character level): fit on the training set only; validation/test are transformed after the split
from sklearn.feature_extraction.text import TfidfVectorizer
from scipy.sparse import hstack

# Word-level and character-level TF-IDF
word_vect = TfidfVectorizer(stop_words='english', ngram_range=(1, 2), min_df=2, max_features=30000)
char_vect = TfidfVectorizer(analyzer='char', ngram_range=(3, 5), min_df=2, max_features=60000)

# Fit the vocabulary/IDF on the training texts only
trainX_word = word_vect.fit_transform(traintxt_norm)
trainX_char = char_vect.fit_transform(traintxt_norm)

# TF-IDF only (non-negative), for MultinomialNB
trainX_tfidf = hstack([trainX_word, trainX_char]).tocsr()

# Concatenate the training features and append the numeric derived features (for the other models)
trainX = hstack([trainX_word, trainX_char, train_num_csr]).tocsr()
trainX.shape


(2985, 56293)

# Models

### First, separate the validation and test sets

In [None]:
# Split validation and test indices using the provided Usage column
val_mask = (testY['Usage'].astype(str).str.lower() == 'val').values
test_mask = (testY['Usage'].astype(str).str.lower() == 'test').values
val_indices = np.where(val_mask)[0]
test_indices = np.where(test_mask)[0]

# Validation/test labels
val_true = testY.loc[val_mask, 'Prediction'].astype(int).values
test_true = testY.loc[test_mask, 'Prediction'].astype(int).values

# Extract the validation/test texts and apply the same token normalization as for training
valtxt = [testtxt[i] for i in val_indices]
valtxt_norm = [normalize_text(t, max_length=200) for t in valtxt]

testtxt_sel = [testtxt[i] for i in test_indices]
testtxt_norm = [normalize_text(t, max_length=200) for t in testtxt_sel]

# Build the numeric derived features for validation/test and transform them with the scaler fitted on the training set
val_num = make_numeric_features(valtxt_norm)
val_num_s = scaler.transform(val_num)
val_num_csr = csr_matrix(val_num_s)

test_num = make_numeric_features(testtxt_norm)
test_num_s = scaler.transform(test_num)
test_num_csr = csr_matrix(test_num_s)

# Transform the validation/test texts with the TF-IDF vectorizers fitted on the training set
valX_word = word_vect.transform(valtxt_norm)
valX_char = char_vect.transform(valtxt_norm)
# TF-IDF only (non-negative), for MultinomialNB
valX_tfidf = hstack([valX_word, valX_char]).tocsr()
# TF-IDF + numeric features (for the other models)
valX = hstack([valX_word, valX_char, val_num_csr]).tocsr()

testX_word = word_vect.transform(testtxt_norm)
testX_char = char_vect.transform(testtxt_norm)
# TF-IDF only (non-negative), for MultinomialNB
testX_tfidf = hstack([testX_word, testX_char]).tocsr()
# TF-IDF + numeric features (for the other models)
testX = hstack([testX_word, testX_char, test_num_csr]).tocsr()

# Validation labels
val_y = val_true.astype(int)
print(f'Validation samples: {len(val_indices)}, test samples: {len(test_indices)}')

Validation samples: 1493, test samples: 1493




### Cross-validation

In [None]:
# Cross-validation and scoring utilities
from sklearn.model_selection import StratifiedKFold, GridSearchCV
from sklearn.metrics import balanced_accuracy_score


cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=100)
# Stratified K-fold keeps the class proportions in every fold, which gives a fairer train/validation split for imbalanced classification
scorer = 'balanced_accuracy'

def run_grid_search(estimator, param_grid, X, y, cv=cv, scorer=scorer, n_jobs=-1):
    gs = GridSearchCV(estimator, param_grid, cv=cv, scoring=scorer, n_jobs=n_jobs, verbose=1, return_train_score=False)
    gs.fit(X, y)
    cvlog = pd.DataFrame(gs.cv_results_).sort_values('mean_test_score', ascending=False)  # sort by mean CV score, descending
    return gs, cvlog

# Validation labels as a numpy array
val_y = val_true.astype(int)
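
To illustrate what stratification means, the sketch below assigns samples to folds class by class in plain Python, so that every fold keeps the same class counts. The helper `stratified_folds` is a hypothetical stand-in written for illustration; scikit-learn's `StratifiedKFold` uses its own shuffling and assignment details and will not produce the same fold membership. The label list reuses the training-set class counts reported earlier (2454 / 234 / 297).

```python
import random
from collections import Counter, defaultdict

def stratified_folds(labels, n_splits=3, seed=100):
    """Assign each sample index to a fold so that class proportions stay equal across folds."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    fold_of = [None] * len(labels)
    for indices in by_class.values():
        rng.shuffle(indices)                   # shuffle within each class
        for pos, idx in enumerate(indices):
            fold_of[idx] = pos % n_splits      # deal the class round-robin to the folds
    return fold_of

# Labels with the training-set class counts (0 = normal, 1 = spam, 2 = smishing)
labels = [0] * 2454 + [1] * 234 + [2] * 297
fold_of = stratified_folds(labels, n_splits=3)

fold_counts = [Counter(y for y, f in zip(labels, fold_of) if f == k) for k in range(3)]
for k, c in enumerate(fold_counts):
    print(f"fold {k}: {dict(sorted(c.items()))}")
```

Without stratification, a random split could leave a fold with very few spam or smishing samples, which would make the per-fold balanced accuracy noisy.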


## Multinomial Naive Bayes

In [49]:
# MultinomialNB hyperparameter search (uses the TF-IDF-only, non-negative features)
from sklearn.naive_bayes import MultinomialNB

mnb_params = {
    'alpha': [0.5, 1.0, 1.5],
}
# Note: MultinomialNB requires non-negative features, hence trainX_tfidf/valX_tfidf
mnb_gs, mnb_log = run_grid_search(MultinomialNB(), mnb_params, trainX_tfidf, trainY)

mnb_best = mnb_gs.best_estimator_
val_pred = mnb_best.predict(valX_tfidf)
mnb_val_bacc = balanced_accuracy_score(val_y, val_pred)
print('MNB best params:', mnb_gs.best_params_, 'val balanced_acc:', mnb_val_bacc)

mnb_log.head()


Fitting 3 folds for each of 3 candidates, totalling 9 fits
MNB best params: {'alpha': 0.5} val balanced_acc: 0.8039547947804828


Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_alpha,params,split0_test_score,split1_test_score,split2_test_score,mean_test_score,std_test_score,rank_test_score
0,0.034881,0.002231,0.006504,0.001939,0.5,{'alpha': 0.5},0.795333,0.782661,0.765327,0.781107,0.012299,1
1,0.03602,0.0,0.006036,0.0,1.0,{'alpha': 1.0},0.701243,0.699023,0.696063,0.698776,0.002122,2
2,0.03284,0.002804,0.007875,0.001301,1.5,{'alpha': 1.5},0.630666,0.621064,0.586636,0.612788,0.018904,3


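## Logistic regression

- The version below is the generic one, which handles both binary and multi-class problems, so `multi_class` is not set explicitly. With `solver='liblinear'`, the three-class problem is handled one-vs-rest.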

In [50]:
# LogisticRegression hyperparameter search (linear model, class imbalance handled via class_weight)
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression(max_iter=200, class_weight='balanced', n_jobs=None, solver='liblinear')
lr_params = {
    'C': [0.5, 1.0, 2.0, 4.0],
}
lr_gs, lr_log = run_grid_search(lr, lr_params, trainX, trainY)

lr_best = lr_gs.best_estimator_
val_pred = lr_best.predict(valX)
lr_val_bacc = balanced_accuracy_score(val_y, val_pred)
print('LR best params:', lr_gs.best_params_, 'val balanced_acc:', lr_val_bacc)

lr_log.head()


Fitting 3 folds for each of 4 candidates, totalling 12 fits
LR best params: {'C': 2.0} val balanced_acc: 0.8765334648362172


Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_C,params,split0_test_score,split1_test_score,split2_test_score,mean_test_score,std_test_score,rank_test_score
2,0.546306,0.038604,0.005243,0.003756,2.0,{'C': 2.0},0.836845,0.846706,0.820658,0.834737,0.010738,1
3,0.617505,0.026777,0.006778,0.001596,4.0,{'C': 4.0},0.841934,0.838974,0.816792,0.832567,0.01122,2
1,0.460342,0.011093,0.006163,0.0017,1.0,{'C': 1.0},0.828797,0.82839,0.824116,0.827101,0.002117,3
0,0.414881,0.013643,0.006054,0.001587,0.5,{'C': 0.5},0.810016,0.822894,0.825023,0.819311,0.00663,4


## SVM (LinearSVC)

In [None]:
# LinearSVC hyperparameter search
from sklearn.svm import LinearSVC

lsvc = LinearSVC(class_weight='balanced', dual=True)
lsvc_params = {
    'C': [0.1,0.5, 1.0, 2.0, 4.0]
}
lsvc_gs, lsvc_log = run_grid_search(lsvc, lsvc_params, trainX, trainY)

lsvc_best = lsvc_gs.best_estimator_
val_pred = lsvc_best.predict(valX)
lsvc_val_bacc = balanced_accuracy_score(val_y, val_pred)
print('LinearSVC best params:', lsvc_gs.best_params_, 'val balanced_acc:', lsvc_val_bacc)

lsvc_log.head()


Fitting 3 folds for each of 4 candidates, totalling 12 fits
LinearSVC best params: {'C': 0.5} val balanced_acc: 0.8887162361933004




Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_C,params,split0_test_score,split1_test_score,split2_test_score,mean_test_score,std_test_score,rank_test_score
0,2.843833,0.618862,0.00769,0.002818,0.5,{'C': 0.5},0.842341,0.838974,0.826893,0.83607,0.006633,1
3,3.589835,0.049526,0.009075,0.005482,4.0,{'C': 4.0},0.847521,0.818513,0.822212,0.829416,0.012891,2
1,3.531672,0.106731,0.00984,0.000877,1.0,{'C': 1.0},0.843655,0.822787,0.819252,0.828565,0.010768,3
2,3.604525,0.058988,0.00256,0.002347,2.0,{'C': 2.0},0.844562,0.818513,0.819252,0.827443,0.012109,4


## SGDClassifier

In [52]:
# SGDClassifier (hinge/log) hyperparameter search
from sklearn.linear_model import SGDClassifier

sgd = SGDClassifier(class_weight='balanced', random_state=100, max_iter=2000, tol=1e-3)
sgd_params = {
    'loss': ['hinge', 'log_loss'],
    'alpha': [1e-4, 5e-4, 1e-3],
}
sgd_gs, sgd_log = run_grid_search(sgd, sgd_params, trainX, trainY)

sgd_best = sgd_gs.best_estimator_
val_pred = sgd_best.predict(valX)
sgd_val_bacc = balanced_accuracy_score(val_y, val_pred)
print('SGD best params:', sgd_gs.best_params_, 'val balanced_acc:', sgd_val_bacc)

sgd_log.head()

Fitting 3 folds for each of 6 candidates, totalling 18 fits
SGD best params: {'alpha': 0.0001, 'loss': 'log_loss'} val balanced_acc: 0.878622948577077


Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_alpha,param_loss,params,split0_test_score,split1_test_score,split2_test_score,mean_test_score,std_test_score,rank_test_score
1,0.28869,0.031615,0.008637,0.00528,0.0001,log_loss,"{'alpha': 0.0001, 'loss': 'log_loss'}",0.846207,0.838159,0.822562,0.835643,0.009816,1
4,0.301628,0.024322,0.005706,0.004068,0.001,hinge,"{'alpha': 0.001, 'loss': 'hinge'}",0.85616,0.809966,0.830759,0.832295,0.01889,2
5,0.186328,0.074526,0.002641,0.003735,0.001,log_loss,"{'alpha': 0.001, 'loss': 'log_loss'}",0.823983,0.836621,0.816419,0.825674,0.008334,3
0,0.286169,0.020317,0.006506,0.004788,0.0001,hinge,"{'alpha': 0.0001, 'loss': 'hinge'}",0.835032,0.833406,0.787689,0.818709,0.021945,4
3,0.25533,0.012333,0.019929,0.009129,0.0005,log_loss,"{'alpha': 0.0005, 'loss': 'log_loss'}",0.849997,0.815162,0.76978,0.811646,0.032843,5


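### Deep learning model

### TextCNN (Keras): model structure
- The tokenizer is fitted on the training set only and merely applied to the validation/test sets, to avoid information leakage.
- Architecture: Embedding → Conv1D(k=3/4/5) → GlobalMaxPool → Concat → Dropout → Dense(3).
- The validation set is monitored with early stopping; balanced accuracy is the reported metric (the training objective remains cross-entropy).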
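
The core of this architecture (a convolution over token windows followed by global max pooling) can be sketched without any deep learning library. The plain-Python example below is an illustration under stated assumptions: the toy embeddings and the single hand-picked filter per kernel size are made up, and the helper `conv1d_valid_relu` is hypothetical. It only mirrors the shapes involved: with `'valid'` padding, a sequence of length L and a kernel of size k give L - k + 1 positions, and pooling keeps one value per filter.

```python
def conv1d_valid_relu(seq, kernel, bias=0.0):
    """One Conv1D filter with 'valid' padding followed by ReLU.

    seq:    list of L embedding vectors (each a list of D floats)
    kernel: list of k weight vectors (each a list of D floats)
    """
    k = len(kernel)
    out = []
    for start in range(len(seq) - k + 1):
        window = seq[start:start + k]
        s = bias + sum(w * x for wv, xv in zip(kernel, window) for w, x in zip(wv, xv))
        out.append(max(0.0, s))  # ReLU
    return out

# Toy input (made up): 6 tokens with 2-dimensional embeddings
seq = [[1.0, 0.0], [0.0, 1.0], [2.0, 1.0], [0.0, 0.0], [1.0, 1.0], [0.0, 2.0]]

features = []
fmap_lengths = []
for k in (3, 4, 5):
    kernel = [[1.0, -1.0]] * k           # one made-up filter per kernel size
    fmap = conv1d_valid_relu(seq, kernel)
    fmap_lengths.append(len(fmap))       # L - k + 1 positions
    features.append(max(fmap))           # global max pooling: one value per filter

print("feature-map lengths:", fmap_lengths)
print("concatenated features:", features)
```

In the Keras model each kernel size has `num_filters` learned filters rather than one fixed filter, so the concatenated vector passed to Dropout and the Dense layer has `len(filter_sizes) * num_filters` entries.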


In [None]:
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer  # tokenizes text into sequences of word indices
from tensorflow.keras.preprocessing.sequence import pad_sequences  # pads or truncates sequences to a fixed length
from tensorflow.keras import layers, models  # network layers and model classes
from tensorflow.keras.wrappers.scikit_learn import KerasClassifier  # wraps a Keras model as a scikit-learn classifier (deprecated wrapper)
from sklearn.model_selection import GridSearchCV  # grid search over hyperparameters
from sklearn.metrics import balanced_accuracy_score  # balanced accuracy metric

# Fit the tokenizer on the training set only, so no validation/test information leaks into it
max_words = 20000  # maximum vocabulary size: keep the 20000 most frequent words
max_len = 150  # maximum sequence length: longer sequences are truncated, shorter ones padded

tokenizer = Tokenizer(num_words=max_words, lower=True, oov_token='<OOV>')  # lower=True lowercases the text; oov_token handles unseen words
tokenizer.fit_on_texts(traintxt_norm)  # fit on the training texts only

# Convert texts to integer sequences and pad them to a common length
X_train_seq = tokenizer.texts_to_sequences(traintxt_norm)  # training texts to sequences
X_val_seq = tokenizer.texts_to_sequences(valtxt_norm)  # validation texts to sequences
X_test_seq = tokenizer.texts_to_sequences(testtxt_norm)  # test texts to sequences

# Pad or truncate every sequence to max_len; 'post' for padding and truncating means both act at the end of the sequence
X_train_pad = pad_sequences(X_train_seq, maxlen=max_len, padding='post', truncating='post')
X_val_pad = pad_sequences(X_val_seq, maxlen=max_len, padding='post', truncating='post')
X_test_pad = pad_sequences(X_test_seq, maxlen=max_len, padding='post', truncating='post')

num_classes = 3  # number of classes

# TextCNN builder: constructs the text CNN; its arguments are supplied by the hyperparameter search
def build_textcnn(emb_dim=64, num_filters=64, filter_sizes=(3, 4, 5), dropout=0.5):
    vocab_size = min(max_words, len(tokenizer.word_index) + 1)
    inputs = layers.Input(shape=(max_len,))  # input layer with shape (max_len,)
    # Embedding layer: maps word indices to dense vectors (input_dim = vocabulary size, output_dim = embedding size, input_length = sequence length)
    x = layers.Embedding(input_dim=vocab_size, output_dim=emb_dim, input_length=max_len)(inputs)
    convs = []  # pooled features from each kernel size
    for k in filter_sizes:  # loop over the kernel sizes
        # 1D convolution: num_filters filters of size k, ReLU activation, 'valid' padding (no padding)
        c = layers.Conv1D(num_filters, k, activation='relu', padding='valid')(x)
        # Global max pooling: keep the maximum of each feature map, i.e. its most salient response
        p = layers.GlobalMaxPooling1D()(c)
        convs.append(p)  # collect the pooled result
    # Concatenate the features from all kernel sizes (or use the single one directly)
    x = layers.Concatenate()(convs) if len(convs) > 1 else convs[0]
    x = layers.Dropout(dropout)(x)  # dropout: randomly drops units to reduce overfitting
    # Output layer: num_classes units with softmax to produce a probability distribution
    outputs = layers.Dense(num_classes, activation='softmax')(x)
    model = models.Model(inputs, outputs)  # build the model from inputs to outputs
    # Compile the model: optimizer, loss function, and reported metric
    model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
    return model

# Wrap the Keras model as a scikit-learn classifier
model = KerasClassifier(build_fn=build_textcnn, epochs=15, batch_size=64, verbose=1)

# Hyperparameter grid to search
param_grid = {
    'emb_dim': [64, 128],  # candidate embedding sizes
    'num_filters': [64, 128],  # candidate numbers of filters
    'dropout': [0.3, 0.4],  # candidate dropout rates
    'filter_sizes': [(3, 4, 5), (2, 3, 4)]  # candidate kernel-size combinations
}

# Create the grid search object
grid = GridSearchCV(estimator=model, param_grid=param_grid, cv=3, scoring='balanced_accuracy', verbose=2,n_jobs=-1)

# Run the hyperparameter search (on the training data; shown here for demonstration and can be adjusted as needed)
grid_result = grid.fit(X_train_pad, np.array(trainY))

# Report the best hyperparameters and the best score
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))

# Build the model with the best hyperparameters
best_model = build_textcnn(
    emb_dim=grid_result.best_params_['emb_dim'],
    num_filters=grid_result.best_params_['num_filters'],
    dropout=grid_result.best_params_['dropout'],
    filter_sizes=grid_result.best_params_['filter_sizes']
)

# Early-stopping callback: stop when the validation loss has not improved for `patience` epochs and restore the best weights
es = tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=3, restore_best_weights=True)

# Train the best model
history = best_model.fit(
    X_train_pad, np.array(trainY),
    validation_data=(X_val_pad, np.array(val_y)),
    epochs=10,
    batch_size=128,
    callbacks=[es],
    verbose=1
)

# Validation performance (balanced accuracy, which accounts for class imbalance)
val_pred = best_model.predict(X_val_pad, batch_size=256, verbose=0).argmax(axis=1)  # predict and take the most probable class
val_bacc = balanced_accuracy_score(val_y, val_pred)  # balanced accuracy on the validation set
print('TextCNN val balanced_acc:', val_bacc)

# Test performance
test_pred = best_model.predict(X_test_pad, batch_size=256, verbose=0).argmax(axis=1)  # predict and take the most probable class
test_bacc = balanced_accuracy_score(test_true, test_pred)  # balanced accuracy on the test set
print('TextCNN test balanced_acc:', test_bacc)



Fitting 3 folds for each of 16 candidates, totalling 48 fits




Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8/15
Epoch 9/15
Epoch 10/15
Epoch 11/15
Epoch 12/15
Epoch 13/15
Epoch 14/15
Epoch 15/15
Best: 0.833640 using {'dropout': 0.4, 'emb_dim': 64, 'filter_sizes': (2, 3, 4), 'num_filters': 128}
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
TextCNN val balanced_acc: 0.8561361760903045
TextCNN test balanced_acc: 0.8555784641476395


# Selecting the best model

In [39]:
# Select the best model based on validation balanced accuracy
cv_summary = pd.DataFrame([
    ['MultinomialNB', mnb_gs.best_params_, mnb_val_bacc],
    ['LogisticRegression', lr_gs.best_params_, lr_val_bacc],
    ['LinearSVC', lsvc_gs.best_params_, lsvc_val_bacc],
    ['SGDClassifier', sgd_gs.best_params_, sgd_val_bacc],
], columns=['model', 'best_params', 'val_bal_acc']).sort_values('val_bal_acc', ascending=False)
cv_summary


Unnamed: 0,model,best_params,val_bal_acc
2,LinearSVC,{'C': 0.1},0.882869
3,SGDClassifier,"{'alpha': 0.0001, 'loss': 'log_loss'}",0.878623
1,LogisticRegression,{'C': 2.0},0.876533
0,MultinomialNB,{'alpha': 0.5},0.803955


### Evaluate on the test set

In [None]:
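# Final prediction on the test set, with correct/incorrect examples (reuses the testX/test_true/test_indices built above)
from sklearn.metrics import classification_report, confusion_matrix

# Pick the model that performed best on the validation set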
best_row = cv_summary.iloc[0]
best_name = best_row['model']

if best_name == 'MultinomialNB':
    best_model = mnb_best
    X_test_for_best = testX_tfidf
elif best_name == 'LogisticRegression':
    best_model = lr_best
    X_test_for_best = testX
elif best_name == 'LinearSVC':
    best_model = lsvc_best
    X_test_for_best = testX
elif best_name == 'SGDClassifier':
    best_model = sgd_best
    X_test_for_best = testX
else:
    # fallback
    best_model = lr_best
    X_test_for_best = testX

# Predict (using the test feature matrix that matches the selected model)
pred_test = best_model.predict(X_test_for_best)

# Evaluation and examples
print('Test balanced_acc:', balanced_accuracy_score(test_true, pred_test))
print(confusion_matrix(test_true, pred_test))
print(classification_report(test_true, pred_test, target_names=classlabels, digits=4))

Test balanced_acc: 0.8740838209168876
[[1188    2    1]
 [  14   97   18]
 [   2   20  151]]
              precision    recall  f1-score   support

      normal     0.9867    0.9975    0.9921      1191
        spam     0.8151    0.7519    0.7823       129
    smishing     0.8882    0.8728    0.8805       173

    accuracy                         0.9618      1493
   macro avg     0.8967    0.8741    0.8849      1493
weighted avg     0.9605    0.9618    0.9610      1493


Examples (correctly classified):
[gold=normal, pred=normal] Hey i booked the kb on sat already... what other lessons are we going for ah? Keep your sat night free we need to meet and confirm our lodging
[gold=normal, pred=normal] Hmm .. Bits and pieces lol ... *sighs* ...
[gold=normal, pred=normal] 10 min later k...
[gold=smishing, pred=smishing] Your B4U voucher w/c 27/03 is MARSMS. Log onto www.B4Utele.com for discount credit. To opt out reply stop. Customer care call 08717168528
[gold=normal, pred=normal] I dont have that much image in clas

- The remaining errors may be caused by class imbalance; try manually increasing the weights of the minority classes.

In [38]:
# Parameter grid with manually specified class weights
param_grid = {
    'C':[0.1,1.0],
    'class_weight':
     [{0:1, 1:3, 2:3}]}       # substantially higher weights for spam and smishing

# Search for the best parameters with GridSearchCV (LinearSVC is wrapped in GridSearchCV)
from sklearn.model_selection import GridSearchCV
grid = GridSearchCV(
    LinearSVC(dual=True),
    param_grid=param_grid,
    scoring='balanced_accuracy',  # balanced accuracy suits imbalanced classes
    cv=3  # 3-fold cross-validation
)
grid.fit(trainX, trainY)

print("Best params:", grid.best_params_)
print("Best score:", grid.best_score_)



Best params: {'C': 1.0, 'class_weight': {0: 1, 1: 3, 2: 3}}
Best score: 0.8492480232496532




- The improvement does not appear to be significant.

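# Reflection

### Analysis
- Features: besides TF-IDF, more SMS-specific feature engineering could be tried.
- Practical impact of class imbalance: `class_weight='balanced'` alone is not enough, and the separation between spam and smishing could be better (in the test confusion matrix, 18 spam messages are predicted as smishing and 20 smishing messages as spam). Resampling techniques, for example, could compensate for the small number of samples in these two classes.
- Models: beyond those shown above, more complex ensemble models (gradient boosting, AdaBoost, voting classifiers) and other modern methods (XGBoost, LightGBM) could be tried.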
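
As a starting point for the resampling idea, here is a minimal sketch of random oversampling in plain Python. The helper `random_oversample` and the toy texts are assumptions for illustration; this was not run as part of the experiments above, and libraries such as imbalanced-learn provide more complete implementations.

```python
import random
from collections import Counter

def random_oversample(texts, labels, seed=100):
    """Duplicate minority-class samples at random until every class matches the largest one."""
    rng = random.Random(seed)
    by_class = {}
    for t, y in zip(texts, labels):
        by_class.setdefault(y, []).append(t)
    target = max(len(items) for items in by_class.values())
    out_texts, out_labels = [], []
    for y, items in by_class.items():
        extra = [rng.choice(items) for _ in range(target - len(items))]
        out_texts.extend(items + extra)
        out_labels.extend([y] * target)
    return out_texts, out_labels

# Toy data (made up): 6 normal, 2 spam, 1 smishing
texts = [f"normal {i}" for i in range(6)] + ["spam a", "spam b"] + ["smish a"]
labels = [0] * 6 + [1] * 2 + [2]

res_texts, res_labels = random_oversample(texts, labels)
print(Counter(res_labels))
```

Oversampling should be applied to the training portion only (inside each cross-validation fold), since duplicated samples that end up in both the training and validation parts would inflate the validation score.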
